ceph状态监控环境:ubuntu18.04 ceph-n版 使用:脚本名称已固定不能更改,将 dingding.sh中钉钉机器人信息填写正确 将两脚本存放于 /opt/dingding/目录 添加定时任务即可。
一. 钉钉告警脚本(dingding.sh) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 webhook='https://oapi.dingtalk.com/robot/send?access_token=87939c93a91b1' cluster='消息' atMobiles='"15******"' msg=$1 msg2=$2 function_SendMsgToDingding () { curl $webhook -H 'Content-Type: application/json' -d " { 'msgtype': 'text', 'text': { 'content': '$cluster \n $msg \n $msg2 ' }, 'at': { 'isAtAll': true } }" } main () { function_SendMsgToDingding } main
二. ceph监控脚本 (check_ceph.sh) 脚本监控
1.ceph容量** 2.osd 状态 **3.ceph集群状态
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 #!/bin/bash set -etime=`date +%F%H%M` mkdir -p /opt/dingding/ ||true check_log="/opt/dingding/check_ceph.log" dingding="/opt/dingding/dingding.sh" echo "${0} 脚本执行时间:${time} " >> ${check_log} check_osd () { osd_sum=`ceph osd stat |awk '{print $1}' ` osd_up=`ceph osd stat |awk '{print $3}' ` if [ "${osd_sum} " = "${osd_up} " ];then osd_status="osd总数为:${osd_sum} osd up数量为: ${osd_up} osd运行状态正常" echo "${osd_status} " sh ${dingding} "${osd_status} " else osd_status="osd总数为:${osd_sum} osd up数量为: ${osd_up} osd存在down状态" echo "${osd_status} " sh ${dingding} "${osd_status} " fi } check_ceph_use () { contrast="5" ceph_sum=`ceph df |grep TOTAL |awk '{print $2$3}' ` ceph_use=`ceph df |grep TOTAL |awk '{print $NF}' |awk -F'[ .]' '{print $1}' ` ceph_use_sum=`ceph df |grep TOTAL |awk '{print $8$9}' ` if [ "${ceph_use} " -gt "${contrast} " ];then ceph_use_alert="ceph集群总容量为${ceph_sum} 当前已使用${ceph_use_sum} ceph集群利用率为:%${ceph_use} > %${contrast} 请尽快处理" echo "${ceph_use_alert} " sh ${dingding} "${ceph_use_alert} " else ceph_use_alert="ceph集群总容量为:${ceph_sum} 当前已使用:${ceph_use_sum} ceph集群利用率为:%${ceph_use} 集群正常" sh ${dingding} "${ceph_use_alert} " fi } check_ceph_colony_status () { ceph_colony_status=`ceph health` ceph_detail=`ceph health detail` if [ "${ceph_colony_status} " = "HEALTH_OK" ];then helth_message="ceph集群健康状态为: ${ceph_colony_status} " echo "${helth_message} " else helth_message="ceph集群不健康状态为: \n ${ceph_colony_status} " echo "${helth_message} " sh ${dingding} "ceph集群不健康请登陆使用ceph health detail查看" sh ${dingding} "${helth_message} " sh ${dingding} "${ceph_detail} " fi } main () { check_osd check_ceph_use check_ceph_colony_status } main
脚本优化使用 tr命令替换特殊字符使告警详情顺利发送 ( 可能会出问题不建议使用 ) ** **https://www.runoob.com/linux/linux-comm-tr.html
参数 :
1 2 3 4 5 6 7 8 echo " HEALTH_WARN application not enabled on 1 pool(s) POOL_APP_NOT_ENABLED application not enabled on 1 pool(s) application not enabled on pool 'kubernetes' use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications." |tr "',<>()." " " ceph_detail=`ceph health detail |tr "[:punct:]" " " `
三. 扩展 某些告警需要抑制使其告警不发送时 可以利用 if判断与的-a (并且) -0(或者)进行抑制 如演示
四. 设置定时任务 1 2 */1 * * * * sh /opt/dingding/check_ceph.sh
测试定时任务成功
报警效果