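1. Cluster monitoring: metrics
### --- Monitoring metrics
~~~ Kafka uses Yammer Metrics to report metrics in the server and in the Scala clients.
~~~ The Java clients use Kafka Metrics, a built-in metrics registry
~~~ that minimizes the transitive dependencies pulled into client applications.
~~~ Both expose metrics via JMX and can be configured to report statistics through pluggable reporters
~~~ that hook into your monitoring system. See the official documentation for the full list of metrics.
### --- JMX: enable the JMX port on Kafka
~~~ # Enable JMX_PORT on every node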
[root@hadoop01 ~]# vim /opt/yanqi/servers/kafka_ms/bin/kafka-server-start.sh
~~~ Add the following line at the top of the script (after the #!/bin/bash line, if present)
export JMX_PORT=9581
### --- Start the Kafka cluster
~~~ Set JMX_PORT on every Kafka machine, then restart Kafka
[root@hadoop01 ~]# kafka-server-start.sh -daemon /opt/yanqi/servers/kafka_ms/config/server.properties
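### --- Verify that JMX is enabled
~~~ First print the process that is listening on port 9581, then check that its PID matches the PID of the Kafka process.
~~~ # Check hadoop01
~~~ The Kafka PID here is 4127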
[root@hadoop01 ~]# ss -nelp | grep 9581
tcp LISTEN 0 50 :::9581 :::* users:(("java",pid=4127,fd=78)) ino:39145 sk:ffff986bf87f2200 v6only:0 <->
~~~ # Port 9581 is up
~~~ The Kafka PID matches the PID listening on port 9581, so the JMX port was opened by the broker
[root@hadoop01 ~]# jps
4127 Kafka
~~~ # Check hadoop02
[root@hadoop02 ~]# ss -nelp | grep 9581
tcp LISTEN 0 50 :::9581 :::* users:(("java",pid=8927,fd=78)) ino:55599 sk:ffff9fe139100840 v6only:0 <->
[root@hadoop02 ~]# jps
8927 Kafka
~~~ # Check hadoop03
~~~ Alternatively, check the Kafka startup log and confirm that the startup parameter -Dcom.sun.management.jmxremote.port=9581 is present
[root@hadoop03 ~]# ss -nelp | grep 9581
tcp LISTEN 0 50 :::9581 :::* users:(("java",pid=9250,fd=78)) ino:55666 sk:ffff8cd9790da100 v6only:0 <->
[root@hadoop03 ~]# jps
9250 Kafka
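~~~ The same check can also be made from a JMX client rather than with ss and jps. The following is a minimal Java sketch, not a definitive tool. It assumes the broker exposes unauthenticated, non-SSL JMX over the standard RMI connector on hadoop01:9581; the class name JmxPortCheck and the helper serviceUrl are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

class JmxPortCheck {
    // Builds the standard RMI-based JMX service URL for host:port.
    static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    public static void main(String[] args) throws Exception {
        // Assumed defaults: the host and port used in this walkthrough.
        String host = args.length > 0 ? args[0] : "hadoop01";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9581;
        JMXServiceURL url = new JMXServiceURL(serviceUrl(host, port));
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Count the MBeans registered in the kafka.server domain.
            int n = conn.queryNames(new ObjectName("kafka.server:*"), null).size();
            System.out.println(host + ":" + port + " exposes " + n + " kafka.server MBeans");
        } catch (IOException e) {
            // Port closed, host unreachable, or JMX not enabled.
            System.out.println(host + ":" + port + " is not reachable over JMX: " + e.getMessage());
        }
    }
}
```

~~~ A non-zero MBean count suggests that the process behind the port really is a Kafka broker, which is what comparing the PIDs establishes above.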
2. Connecting to the JMX port with JConsole
### --- Prepare topics to monitor
[root@hadoop01 ~]# kafka-topics.sh --zookeeper localhost:2181/myKafka \
--create --topic topic_x --partitions 3 --replication-factor 2
[root@hadoop01 ~]# kafka-topics.sh --zookeeper localhost:2181/myKafka \
--create --topic topic_y --partitions 3 --replication-factor 2
[root@hadoop01 ~]# kafka-topics.sh --zookeeper localhost:2181/myKafka \
--create --topic topic_z --partitions 3 --replication-factor 2
[root@hadoop01 ~]# kafka-topics.sh --zookeeper localhost:2181/myKafka --list
topic_x
topic_y
topic_z
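### --- On Windows/macOS, locate and open the jconsole tool
~~~ It is in ${JAVA_HOME}/bin/; on a Mac you can simply type jconsole on the command line
~~~ -->Win+R-->cmd-->jconsole
~~~ -->When prompted "Secure connection failed. Retry insecurely?", choose the insecure connection
### --- The JMX view agrees with the describe output: broker 0 holds 2 partitions of topic_x, partition 0 and partition 1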
[root@hadoop01 ~]# kafka-topics.sh --zookeeper localhost:2181/myKafka \
--describe --topic topic_x
Topic:topic_x PartitionCount:3 ReplicationFactor:2 Configs:
Topic: topic_x Partition: 0 Leader: 0 Replicas: 0,2 Isr: 0,2
Topic: topic_x Partition: 1 Leader: 1 Replicas: 1,0 Isr: 1,0
Topic: topic_x Partition: 2 Leader: 2 Replicas: 2,1 Isr: 2,1
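~~~ The partition-to-broker layout can be derived mechanically from the describe output above. The sketch below does this in Java, assuming the line format shown here (newer Kafka versions may print it differently); the class ReplicaLayout and its method are hypothetical names, not part of Kafka.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplicaLayout {
    // Matches the per-partition lines; the summary line (PartitionCount:...) does not match.
    private static final Pattern LINE =
        Pattern.compile("Partition:\\s*(\\d+)\\s+Leader:\\s*(\\d+)\\s+Replicas:\\s*([\\d,]+)");

    // Maps broker id -> partitions that have a replica on that broker.
    static Map<Integer, List<Integer>> partitionsPerBroker(String describeOutput) {
        Map<Integer, List<Integer>> layout = new TreeMap<>();
        for (String line : describeOutput.split("\n")) {
            Matcher m = LINE.matcher(line);
            if (!m.find()) {
                continue;
            }
            int partition = Integer.parseInt(m.group(1));
            for (String broker : m.group(3).split(",")) {
                layout.computeIfAbsent(Integer.parseInt(broker), k -> new ArrayList<>())
                      .add(partition);
            }
        }
        return layout;
    }

    public static void main(String[] args) {
        String out = String.join("\n",
            "Topic:topic_x PartitionCount:3 ReplicationFactor:2 Configs:",
            "Topic: topic_x Partition: 0 Leader: 0 Replicas: 0,2 Isr: 0,2",
            "Topic: topic_x Partition: 1 Leader: 1 Replicas: 1,0 Isr: 1,0",
            "Topic: topic_x Partition: 2 Leader: 2 Replicas: 2,1 Isr: 2,1");
        System.out.println(partitionsPerBroker(out)); // {0=[0, 1], 1=[1, 2], 2=[0, 2]}
    }
}
```

~~~ Each broker holds two of the three partitions, and broker 0 is the one holding partitions 0 and 1.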
3. Detailed metrics
### --- Detailed metrics:
~~~ See the official documentation: http://kafka.apache.org/10/documentation.html#monitoring
~~~ The commonly used ones are listed here, starting with OS-level metrics
objectName | Attribute | Description |
java.lang:type=OperatingSystem | FreePhysicalMemorySize | Free physical memory |
java.lang:type=OperatingSystem | SystemCpuLoad | System CPU load |
java.lang:type=OperatingSystem | ProcessCpuLoad | CPU load of the JVM process |
java.lang:type=GarbageCollector,name=G1 Young Generation | CollectionCount | Number of GC runs |
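~~~ These java.lang MBeans exist in every JVM, so they can be tried out locally before pointing a client at a broker. The sketch below reads them from the local platform MBean server. It assumes a HotSpot/OpenJDK runtime, since FreePhysicalMemorySize, SystemCpuLoad and ProcessCpuLoad are vendor extensions; the class name OsMetrics and its helpers are hypothetical.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

class OsMetrics {
    // Reads one attribute of one MBean; works for local and remote connections alike.
    static Object read(MBeanServerConnection conn, String objectName, String attribute)
            throws Exception {
        return conn.getAttribute(new ObjectName(objectName), attribute);
    }

    // Sums CollectionCount over all registered collectors, whatever GC is in use.
    static long totalGcCount(MBeanServerConnection conn) throws Exception {
        long total = 0;
        ObjectName pattern = new ObjectName("java.lang:type=GarbageCollector,*");
        for (ObjectName gc : conn.queryNames(pattern, null)) {
            total += ((Number) conn.getAttribute(gc, "CollectionCount")).longValue();
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        // The local JVM stands in for a broker here; a remote
        // MBeanServerConnection can be passed to the helpers instead.
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        String os = "java.lang:type=OperatingSystem";
        for (String attr : new String[] {"FreePhysicalMemorySize", "SystemCpuLoad", "ProcessCpuLoad"}) {
            System.out.println(attr + " = " + read(conn, os, attr));
        }
        System.out.println("GC collections = " + totalGcCount(conn));
    }
}
```

~~~ Querying with the pattern java.lang:type=GarbageCollector,* avoids hard-coding a collector name such as G1 Young Generation, which differs between GC configurations.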
4. Broker metrics
objectName | Attribute | Description |
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec | Count | Inbound traffic per second |
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec | Count | Outbound traffic per second |
kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec | Count | Rejected (dropped) traffic per second |
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec | Count | Messages written per second |
kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec | Count | Failed fetch requests per second on this broker |
kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec | Count | Failed produce requests per second on this broker |
kafka.server:type=ReplicaManager,name=PartitionCount | Value | Number of partitions on this broker |
kafka.server:type=ReplicaManager,name=LeaderCount | Value | Number of leader replicas on this broker |
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer | Count | Total time spent on a FetchConsumer request |
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower | Count | Total time spent on a FetchFollower request |
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce | Count | Total time spent on a Produce request |
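~~~ One caveat when reading the *PerSec MBeans: the Count attribute of a Yammer meter is a running total, not a rate (the same MBean also exposes ready-made rates such as OneMinuteRate). A per-second figure can be derived from two Count samples, as in the sketch below. The class MeterRate and the sample values are hypothetical and only illustrate the arithmetic.

```java
class MeterRate {
    // Average per-second rate between two samples of a cumulative counter.
    static double ratePerSecond(long earlierCount, long laterCount, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        if (laterCount < earlierCount) {
            // The counter went backwards, e.g. the broker restarted between samples.
            return Double.NaN;
        }
        return (laterCount - earlierCount) * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // Hypothetical BytesInPerSec Count samples taken 10 seconds apart.
        System.out.println(ratePerSecond(1_000_000L, 1_500_000L, 10_000L)); // 50000.0
    }
}
```

~~~ Returning NaN on a counter reset keeps a restart from showing up as a large negative rate on a dashboard.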
5. Producer and topic metrics
objectName | Attribute | Official description | Notes |
kafka.producer:type=producer-metrics,client-id=console-producer (client-id varies) | incoming-byte-rate | The average number of incoming bytes received per second from all servers. | Average inbound bytes per second received by the producer |
kafka.producer:type=producer-metrics,client-id=console-producer (client-id varies) | outgoing-byte-rate | The average number of outgoing bytes sent per second to all servers. | Average outbound bytes per second sent by the producer |
kafka.producer:type=producer-metrics,client-id=console-producer (client-id varies) | request-rate | The average number of requests sent per second to the broker. | Average number of requests the producer sends to the broker per second |
kafka.producer:type=producer-metrics,client-id=console-producer (client-id varies) | response-rate | The average number of responses received per second from the broker. | Average number of responses the producer receives from the broker per second |
kafka.producer:type=producer-metrics,client-id=console-producer (client-id varies) | request-latency-avg | The average request latency in ms. | Average latency of a request, in milliseconds |
kafka.producer:type=producer-topic-metrics,client-id=console-producer,topic=testjmx (client-id and topic name vary) | record-send-rate | The average number of records sent per second for a topic. | Average number of records sent to the topic per second |
kafka.producer:type=producer-topic-metrics,client-id=console-producer,topic=testjmx (client-id and topic name vary) | record-retry-total | The total number of retried record sends | Total number of records whose send was retried |
kafka.producer:type=producer-topic-metrics,client-id=console-producer,topic=testjmx (client-id and topic name vary) | record-error-total | The total number of record sends that resulted in errors | Total number of records whose send failed |
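~~~ Because the client-id (and the topic) differ from one client to the next, a monitoring client would typically query with an ObjectName pattern instead of a fixed name. The sketch below uses only javax.management and assumes the hyphenated names used by the Java client (producer-metrics, client-id); console-producer is just an example id, and the class ClientIdPattern is hypothetical.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

class ClientIdPattern {
    // Pattern that matches the producer-metrics MBean of any client-id.
    static ObjectName anyProducer() throws MalformedObjectNameException {
        return new ObjectName("kafka.producer:type=producer-metrics,client-id=*");
    }

    public static void main(String[] args) throws Exception {
        ObjectName pattern = anyProducer();
        ObjectName concrete =
            new ObjectName("kafka.producer:type=producer-metrics,client-id=console-producer");
        System.out.println(pattern.isPattern());                  // true
        System.out.println(pattern.apply(concrete));              // true
        System.out.println(concrete.getKeyProperty("client-id")); // console-producer
    }
}
```

~~~ Passing such a pattern to MBeanServerConnection.queryNames returns the concrete names that are currently registered, and getKeyProperty then recovers each client-id. The same approach should work for the per-topic MBeans by adding topic=* to the pattern.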
6. Consumer metrics
objectName | Attribute | Official description | Notes |
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=consumer-1 (client-id varies) | records-lag-max | Number of messages the consumer lags behind the producer by. Published by the consumer, not broker. | Consumption lag in messages, as reported by the consumer |
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=consumer-1 (client-id varies) | records-consumed-rate | The average number of records consumed per second | Average number of messages consumed per second |