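Monitoring ZooKeeper Service Health with Zabbix
1. Use Case
Our current business makes little use of ZooKeeper as a coordination service. However, we are going to use Codis as our Redis clustering solution, and Codis relies on ZooKeeper to store its configuration. Monitoring ZooKeeper properly therefore matters.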
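2. Key ZooKeeper Monitoring Points
System monitoring
Memory usage: ZooKeeper should run entirely in memory and must never touch swap. The Java heap size must not exceed the available memory.
Swap usage: swapping degrades ZooKeeper performance; set vm.swappiness = 0.
Network bandwidth usage: if ZooKeeper performance drops, check bandwidth usage and packet loss. A typical ZooKeeper workload is about 20% writes and 80% reads.
Disk usage: keep an eye on the usage of the ZooKeeper data directory.
Disk I/O: ZooKeeper writes to disk asynchronously, so it does not generate heavy I/O requests. If ZooKeeper shares disks with other I/O-intensive services, watch the disk I/O.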
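ZooKeeper monitoring
zk_avg/min/max_latency: time taken to respond to a client request. Alerting when this exceeds 10 ticks is recommended.
zk_outstanding_requests: number of queued requests. This value grows when ZooKeeper is pushed beyond its processing capacity; a suggested alert threshold is 10.
zk_packets_received: number of packets received from clients.
zk_packets_sent: number of packets sent to clients, mainly responses and notifications.
zk_max_file_descriptor_count: maximum number of open file descriptors allowed, controlled by ulimit.
zk_open_file_descriptor_count: number of open file descriptors. Alert when it exceeds 85% of the allowed maximum.
Mode: the role the server runs in. It is standalone if the server has not joined an ensemble, otherwise follower or leader.
zk_followers: reported only by the leader; the number of followers in the ensemble. The normal value is the number of ensemble members minus 1.
zk_pending_syncs: reported only by the leader; the number of pending syncs.
zk_znode_count: number of znodes.
zk_watch_count: number of watches.
Java Heap Size: heap size of the ZooKeeper Java process.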
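The alert suggestions in the list above can be collected into one small check. The following is only a sketch in Python 3, not part of the original script. The function name check_thresholds and the parameters tick_ms and ensemble_size are hypothetical, and it assumes that latency is reported in milliseconds and that tick_ms matches the server's tickTime setting. The sample values are taken from the leader output shown in section 3.

```python
def check_thresholds(stats, tick_ms=2000, ensemble_size=3):
    """Return a list of alert messages for one server's stats dictionary.

    tick_ms and ensemble_size are assumptions; set them to match your
    own tickTime and the number of servers in your ensemble.
    """
    alerts = []

    # Latency above 10 ticks (assumes latency is reported in milliseconds).
    if stats.get("zk_avg_latency", 0) > 10 * tick_ms:
        alerts.append("average latency above 10 ticks")

    # More than 10 queued requests.
    if stats.get("zk_outstanding_requests", 0) > 10:
        alerts.append("more than 10 outstanding requests")

    # Open file descriptors above 85% of the allowed maximum.
    max_fd = stats.get("zk_max_file_descriptor_count", 0)
    open_fd = stats.get("zk_open_file_descriptor_count", 0)
    if max_fd and open_fd > 0.85 * max_fd:
        alerts.append("open file descriptors above 85% of the limit")

    # Only the leader reports zk_followers; expect ensemble size minus 1.
    if stats.get("zk_server_state") == "leader":
        if stats.get("zk_followers") != ensemble_size - 1:
            alerts.append("leader does not see all followers")

    return alerts


sample = {
    "zk_avg_latency": 0,
    "zk_outstanding_requests": 0,
    "zk_open_file_descriptor_count": 30,
    "zk_max_file_descriptor_count": 102400,
    "zk_server_state": "leader",
    "zk_followers": 2,
}
print(check_thresholds(sample))  # prints []
```

In practice these thresholds would normally live in Zabbix triggers; the function only makes the rules explicit.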
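3. Writing the Script and Configuration File for Monitoring ZooKeeper with Zabbix
There are two ways to get this data into Zabbix. One is to fetch each item individually through the zabbix agent; both active and passive checks work. The other is to send all of the data to Zabbix in one go with zabbix_sender. We choose the second approach here. A script that sends everything at once with zabbix_sender cannot be written like an agent script that fetches items one by one.
First, collect the monitored items into a dictionary. Then iterate over the dictionary and send each key:value pair with zabbix_sender, using its -k and -o options.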
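As a rough illustration of that loop, the sketch below (Python 3) turns a dictionary into one zabbix_sender argument list per metric. The helper name sender_commands is hypothetical. The zookeeper.status[...] key prefix and the two paths are the ones used by the detailed script later in this section. The commands are only printed here, not executed.

```python
def sender_commands(stats,
                    sender="/opt/app/zabbix/sbin/zabbix_sender",
                    conf="/opt/app/zabbix/conf/zabbix_agentd.conf"):
    """Build one zabbix_sender argument list per metric (hypothetical helper)."""
    commands = []
    for name in sorted(stats):
        key = "zookeeper.status[%s]" % name
        # -c: agent config file, -k: item key, -o: value to send
        commands.append([sender, "-c", conf, "-k", key, "-o", str(stats[name])])
    return commands


stats = {"zk_avg_latency": 0, "zk_server_state": "follower"}
for argv in sender_commands(stats):
    print(" ".join(argv))
```

Each list can be handed to subprocess.call as it stands, which avoids shell quoting problems for values that contain spaces, such as zk_version.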
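The raw data comes from the mntr four-letter command:
    echo mntr|nc 127.0.0.1 2181
This command can be run from Python with the subprocess module, or you can use the socket module to connect to port 2181 and send the command yourself. Either way, the output of mntr then has to be converted into a dictionary.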
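A minimal Python 3 sketch of the socket approach follows. It is not the author's script (that one appears below and targets Python 2). The names read_all, four_letter and parse_mntr are hypothetical. It assumes the server closes the connection after answering, so the reply is read until end of stream instead of relying on a single recv call.

```python
import socket


def read_all(sock, bufsize=4096):
    """Read from the socket until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", "replace")


def four_letter(cmd, host="127.0.0.1", port=2181, timeout=1.0):
    """Send a four-letter command such as 'mntr' and return the reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        return read_all(sock)


def parse_mntr(text):
    """Turn tab-separated 'key<TAB>value' lines into a dictionary."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        key, value = key.strip(), value.strip()
        if not sep or not key:
            continue  # skip malformed lines
        try:
            stats[key] = int(value)
        except ValueError:
            stats[key] = value  # keep non-numeric values such as zk_version
    return stats


# Against a live server one would call: parse_mntr(four_letter("mntr"))
sample = ("zk_version\t3.4.6-1569965, built on 02/20/2014 09:09 GMT\n"
          "zk_avg_latency\t0\n"
          "zk_server_state\tfollower\n")
print(parse_mntr(sample))
```

Reading until the connection closes avoids truncating the reply if it ever grows beyond one buffer.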
In other words, mntr output in this form:
    zk_version 3.4.6-1569965, built on 02/20/2014 09:09 GMT
    zk_avg_latency 0
    zk_max_latency 0
    zk_min_latency 0
    zk_packets_received 91
    zk_packets_sent 90
    zk_num_alive_connections 1
    zk_outstanding_requests 0
    zk_server_state follower
    zk_znode_count 17159
    zk_watch_count 0
    zk_ephemerals_count 1
    zk_approximate_data_size 6666471
    zk_open_file_descriptor_count 27
    zk_max_file_descriptor_count 102400
needs to be converted into a dictionary like this (this sample was taken on the leader and also carries the reply to the ruok command):
    {'zk_followers': 2, 'zk_outstanding_requests': 0, 'zk_approximate_data_size': 6666471,
     'zk_packets_sent': 2089, 'zk_pending_syncs': 0, 'zk_avg_latency': 0,
     'zk_version': '3.4.6-1569965, built on 02/20/2014 09:09 GMT', 'zk_watch_count': 2,
     'zk_packets_received': 2090, 'zk_open_file_descriptor_count': 30, 'zk_server_ruok': 'imok',
     'zk_server_state': 'leader', 'zk_synced_followers': 2, 'zk_max_latency': 28,
     'zk_num_alive_connections': 2, 'zk_min_latency': 0, 'zk_ephemerals_count': 1,
     'zk_znode_count': 17159, 'zk_max_file_descriptor_count': 102400}
In the end, the data sent with zabbix_sender has the following format:
Here zookeeper.status[zk_version] is the name of the key, followed by its value:
    zookeeper.status[zk_outstanding_requests]:0
    zookeeper.status[zk_approximate_data_size]:6666471
    zookeeper.status[zk_packets_sent]:48
    zookeeper.status[zk_avg_latency]:0
    zookeeper.status[zk_version]:3.4.6-1569965, built on 02/20/2014 09:09 GMT
    zookeeper.status[zk_watch_count]:0
    zookeeper.status[zk_packets_received]:49
    zookeeper.status[zk_open_file_descriptor_count]:27
    zookeeper.status[zk_server_ruok]:imok
    zookeeper.status[zk_server_state]:follower
    zookeeper.status[zk_max_latency]:0
    zookeeper.status[zk_num_alive_connections]:1
    zookeeper.status[zk_min_latency]:0
    zookeeper.status[zk_ephemerals_count]:1
    zookeeper.status[zk_znode_count]:17159
    zookeeper.status[zk_max_file_descriptor_count]:102400
The condensed code is as follows:
    #!/usr/bin/python
    # Python 2 script: read the mntr output and build both dictionaries
    import socket
    #from StringIO import StringIO
    from cStringIO import StringIO

    # send the mntr four-letter command to the local ZooKeeper server
    s = socket.socket()
    s.connect(('localhost', 2181))
    s.send('mntr')
    data_mntr = s.recv(2048)
    s.close()
    #print data_mntr

    h = StringIO(data_mntr)
    result = {}    # metric name -> value
    zresult = {}   # Zabbix item key -> value
    for line in h.readlines():
        # each line is "key<TAB>value"
        key, value = map(str.strip, line.split('\t'))
        zkey = 'zookeeper.status' + '[' + key + ']'
        zvalue = value
        result[key] = value
        zresult[zkey] = zvalue

    print result
    print '\n\n'
    print zresult

Running it prints the two dictionaries:
    # python test.py
    {'zk_outstanding_requests': '0', 'zk_approximate_data_size': '6666471', 'zk_max_latency': '0', 'zk_avg_latency': '0', 'zk_version': '3.4.6-1569965, built on 02/20/2014 09:09 GMT', 'zk_watch_count': '0', 'zk_num_alive_connections': '1', 'zk_open_file_descriptor_count': '27', 'zk_server_state': 'follower', 'zk_packets_sent': '542', 'zk_packets_received': '543', 'zk_min_latency': '0', 'zk_ephemerals_count': '1', 'zk_znode_count': '17159', 'zk_max_file_descriptor_count': '102400'}

    {'zookeeper.status[zk_watch_count]': '0', 'zookeeper.status[zk_avg_latency]': '0', 'zookeeper.status[zk_max_latency]': '0', 'zookeeper.status[zk_approximate_data_size]': '6666471', 'zookeeper.status[zk_server_state]': 'follower', 'zookeeper.status[zk_num_alive_connections]': '1', 'zookeeper.status[zk_min_latency]': '0', 'zookeeper.status[zk_outstanding_requests]': '0', 'zookeeper.status[zk_packets_received]': '543', 'zookeeper.status[zk_ephemerals_count]': '1', 'zookeeper.status[zk_znode_count]': '17159', 'zookeeper.status[zk_packets_sent]': '542', 'zookeeper.status[zk_open_file_descriptor_count]': '27', 'zookeeper.status[zk_max_file_descriptor_count]': '102400', 'zookeeper.status[zk_version]': '3.4.6-1569965, built on 02/20/2014 09:09 GMT'}
The detailed code is as follows:
    #!/usr/bin/python
    """
    Check Zookeeper Cluster

    zookeeper version should be newer than 3.4.x

    # echo mntr|nc 127.0.0.1 2181
    zk_version 3.4.6-1569965, built on 02/20/2014 09:09 GMT
    zk_avg_latency 0
    zk_max_latency 4
    zk_min_latency 0
    zk_packets_received 84467
    zk_packets_sent 84466
    zk_num_alive_connections 3
    zk_outstanding_requests 0
    zk_server_state follower
    zk_znode_count 17159
    zk_watch_count 2
    zk_ephemerals_count 1
    zk_approximate_data_size 6666471
    zk_open_file_descriptor_count 29
    zk_max_file_descriptor_count 102400

    # echo ruok|nc 127.0.0.1 2181
    imok
    """

    import sys
    import socket
    import subprocess
    from StringIO import StringIO
    import os

    zabbix_sender = '/opt/app/zabbix/sbin/zabbix_sender'
    zabbix_conf = '/opt/app/zabbix/conf/zabbix_agentd.conf'
    send_to_zabbix = 1   # set to 0 to only print the commands


    ############# get zookeeper server status
    class ZooKeeperServer(object):

        def __init__(self, host='localhost', port='2181', timeout=1):
            self._address = (host, int(port))
            self._timeout = timeout
            self._result = {}

        def _create_socket(self):
            return socket.socket()

        def _send_cmd(self, cmd):
            """ Send a 4letter word command to the server """
            s = self._create_socket()
            s.settimeout(self._timeout)
            s.connect(self._address)
            s.send(cmd)
            data = s.recv(2048)
            s.close()
            return data

        def get_stats(self):
            """ Get ZooKeeper server stats as a map """
            data_mntr = self._send_cmd('mntr')
            data_ruok = self._send_cmd('ruok')
            # default to empty dicts so an empty reply cannot raise NameError
            result_mntr = self._parse(data_mntr) if data_mntr else {}
            result_ruok = self._parse_ruok(data_ruok) if data_ruok else {}

            self._result = dict(result_mntr.items() + result_ruok.items())

            if not self._result.has_key('zk_followers') and not self._result.has_key('zk_synced_followers') and not self._result.has_key('zk_pending_syncs'):
                # these three metrics are only exposed by the leader;
                # on followers we simply set them to 0
                leader_only = {'zk_followers': 0, 'zk_synced_followers': 0, 'zk_pending_syncs': 0}
                self._result = dict(result_mntr.items() + result_ruok.items() + leader_only.items())

            return self._result

        def _parse(self, data):
            """ Parse the output from the 'mntr' 4letter word command """
            h = StringIO(data)
            result = {}
            for line in h.readlines():
                try:
                    key, value = self._parse_line(line)
                    result[key] = value
                except ValueError:
                    pass  # ignore broken lines
            return result

        def _parse_ruok(self, data):
            """ Parse the output from the 'ruok' 4letter word command """
            h = StringIO(data)
            result = {}
            ruok = h.readline()
            if ruok:
                result['zk_server_ruok'] = ruok
            return result

        def _parse_line(self, line):
            try:
                key, value = map(str.strip, line.split('\t'))
            except ValueError:
                raise ValueError('Found invalid line: %s' % line)

            if not key:
                raise ValueError('The key is mandatory and should not be empty')

            # keep numeric values as integers, everything else as strings
            try:
                value = int(value)
            except (TypeError, ValueError):
                pass

            return key, value

        def get_pid(self):
            # ps -ef|grep java|grep zookeeper|awk '{print $2}'
            pidarg = '''ps -ef|grep java|grep zookeeper|grep -v grep|awk '{print $2}' '''
            pidout = subprocess.Popen(pidarg, shell=True, stdout=subprocess.PIPE)
            pid = pidout.stdout.readline().strip('\n')
            return pid

        def send_to_zabbix(self, metric):
            key = "zookeeper.status[" + metric + "]"
            # inside the method, send_to_zabbix refers to the module-level flag
            if send_to_zabbix > 0:
                #print key + ":" + str(self._result[metric])
                try:
                    subprocess.call([zabbix_sender, "-c", zabbix_conf, "-k", key, "-o", str(self._result[metric])], stdout=FNULL, stderr=FNULL, shell=False)
                except OSError as detail:
                    print "Something went wrong while executing zabbix_sender : ", detail
            else:
                print "Simulation: the following command would be executed :\n", zabbix_sender, "-c", zabbix_conf, "-k", key, "-o", self._result[metric], "\n"


    def usage():
        """Display program usage"""
        print "\nUsage : ", sys.argv[0], " alive|all"
        print "Modes : \n\talive : Return pid of running zookeeper\n\tall : Send zookeeper stats as well"
        sys.exit(1)


    accepted_modes = ['alive', 'all']

    if len(sys.argv) == 2 and sys.argv[1] in accepted_modes:
        mode = sys.argv[1]
    else:
        usage()

    zk = ZooKeeperServer()
    # print zk.get_stats()
    pid = zk.get_pid()

    if pid != "" and mode == 'all':
        zk.get_stats()
        # print zk._result
        FNULL = open(os.devnull, 'w')
        for key in zk._result:
            zk.send_to_zabbix(key)
        FNULL.close()
        print pid
    elif pid != "" and mode == "alive":
        print pid
    else:
        print 0
The Zabbix configuration file check_zookeeper.conf:
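    UserParameter=zookeeper.status[*],/usr/bin/python /opt/app/zabbix/sbin/check_zookeeper.py $1
Here $1 is the mode passed in the item key, so the agent items are zookeeper.status[alive] and zookeeper.status[all]. The individual metrics pushed by zabbix_sender are received on the server as Zabbix trapper items.
Restart the zabbix agent service.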
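The detailed script starts one zabbix_sender process per metric. If a single invocation is preferred, zabbix_sender can also load values from an input file through its -i option, where each line holds a host name, an item key and a value separated by whitespace. The Python 3 sketch below only builds such lines. The helper names quote and sender_input_lines are hypothetical. Using "-" as the host name (so that the host name comes from the agent configuration file) and double-quoting values follow my reading of the zabbix_sender manual, so check both against your Zabbix version.

```python
def quote(value):
    """Double-quote a value, escaping backslashes and double quotes."""
    text = str(value)
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def sender_input_lines(stats, host="-"):
    """Build '<hostname> <key> <value>' lines for zabbix_sender -i."""
    return ["%s zookeeper.status[%s] %s" % (host, name, quote(stats[name]))
            for name in sorted(stats)]


stats = {
    "zk_avg_latency": 0,
    "zk_version": "3.4.6-1569965, built on 02/20/2014 09:09 GMT",
}
for line in sender_input_lines(stats):
    print(line)

# The lines could then be written to a file and sent in one call, e.g.:
#   zabbix_sender -c /opt/app/zabbix/conf/zabbix_agentd.conf -i <file>
```

Quoting matters here because values such as zk_version contain spaces, which would otherwise be read as extra fields.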
4. Creating the Zabbix Template for ZooKeeper and Setting Alert Thresholds
See the attachment for the template.
References:
https://blog.serverdensity.com/how-to-monitor-zookeeper/
https://github.com/apache/zookeeper/tree/trunk/src/contrib/monitoring
http://john88wang.blog.51cto.com/2165294/1708302
Reposted from: https://blog.51cto.com/john88wang/1745339