Contents
Linux Docker monitoring
Linux process monitoring
Linux OS monitoring
Windows OS monitoring
Configuration files & alert rules
Prometheus configuration file
node_alert.rules
docker_container.rules
mysql_alert.rules
vmware.rules
Alertmanager configuration
Consul service registration
Dashboard JSON files
Linux Docker monitoring
cAdvisor collects the same statistics as the `docker stats` command and presents them through a web UI (and a /metrics endpoint for Prometheus).
cadvisor.tar
Upload the cadvisor.tar package, load the image, re-tag it, then run the container:
docker load -i cadvisor.tar
docker tag gcr.io/cadvisor/cadvisor:latest google/cadvisor:latest
docker run -d --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --publish=8080:8080 --name=cadvisor google/cadvisor:latest
Once the container is running, open the cAdvisor UI at http://ip:8080
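The same container can also be declared as a docker-compose service. This is a sketch equivalent to the `docker run` command above; the `restart` policy is an addition, not part of the original command:

```yaml
# docker-compose.yml equivalent of the docker run above
services:
  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    restart: unless-stopped   # assumption: auto-restart is usually wanted for a monitor
```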
Linux process monitoring
process-exporter reports the state of selected processes, matched by regex, absolute path, or name.
process-exporter-0.7.5.linux-amd64.tar.gz
For setup details, see my other article:
Prometheus监控主机进程-CSDN博客
Default port: 9256
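process-exporter needs a config file (passed with `-config.path`) naming the processes to track. A minimal sketch; the file path is hypothetical and the patterns are chosen to match the process groups alerted on later in node_alert.rules:

```yaml
# Hypothetical /opt/monitoring/process-exporter.yml
process_names:
  - name: "{{.Comm}}"   # group metrics by executable name
    cmdline:
    - 'vsftpd'
    - 'goproxy'
```

Each matched group appears in metrics such as `namedprocess_namegroup_num_procs{groupname="..."}`, which the alert rules below query.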
Linux OS monitoring
node_exporter exposes the host's OS resources (CPU, memory, disk, etc.).
After copying node_exporter to the target path, create a systemd unit:
cat /etc/systemd/system/node-exporter.service
[Unit]
Description=Prometheus Node exporter
After=network.target
[Service]
ExecStart=/opt/monitoring/node_exporter
[Install]
WantedBy=multi-user.target
Then reload systemd and start the service:
systemctl daemon-reload
systemctl enable --now node-exporter
Default port: 9100
Windows OS monitoring
windows_exporter exposes the host's OS resources (CPU, memory, disk, etc.).
windows_exporter-0.26.0-amd64.msi
1. Disable the firewall.
2. Run the MSI as administrator.
3. In services.msc, verify that the windows_exporter service is set to start automatically.
Default port: 9182
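All of these exporters serve metrics in the Prometheus text exposition format on /metrics. The toy parser below shows what a scrape returns and how a sample line splits into name and value; the metric names are real node_exporter metrics but the values are invented, and a real consumer should use an official client library rather than this sketch:

```python
# Sample of the Prometheus text exposition format (values are made up).
sample = """\
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.249966592e+09
node_load1 0.42
"""

def parse_metrics(text):
    """Tiny parser: metric name -> float value; skips comments, ignores labels."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # HELP/TYPE lines and blanks
        name, value = line.rsplit(None, 1)
        metrics[name] = float(value)
    return metrics

m = parse_metrics(sample)
print(m["node_load1"])
```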
Configuration files & alert rules
All files live under /opt/monitor/prometheus.
Prometheus configuration file
cat /opt/monitor/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 10s     # scrape targets every 10 seconds by default
  scrape_timeout: 5s
  evaluation_interval: 10s # evaluate rules every 10 seconds
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'zqa_monitor'

# Load and evaluate rules in these files every 'evaluation_interval'.
rule_files:
  - 'node_alert.rules'
  - 'mysql_alert.rules'
  - 'docker_container.rules'
  # - "first.rules"
  # - "second.rules"

# Alertmanager configuration
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - "alertmanager:9093"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  # - job_name: 'cadvisor'
  #   scrape_interval: 5s
  #   dns_sd_configs:
  #     - names:
  #         - 'tasks.cadvisor'
  #       type: 'A'
  #       port: 8080
  #   static_configs:
  #     - targets: ['10.33.70.218:8080']

  - job_name: 'node-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['10.100.10.100:9182']
    consul_sd_configs:
      - server: '10.33.70.203:8500'
        services: ['node-exporter-dev']

  - job_name: 'mysql-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['10.33.70.218:9104', '10.33.70.166:9104', '10.33.70.224:9104']

  - job_name: 'postgres-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['123.57.190.129:9187']

  - job_name: 'vsphere-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['10.33.70.22:9272']

  - job_name: 'es-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['123.57.216.51:9114']

  - job_name: 'pushgateway'
    scrape_interval: 30s
    honor_labels: true   # keep labels pushed to the gateway instead of overwriting them
    static_configs:
      - targets: ['39.104.94.83:19091']
        labels:
          instance: pushgateway

  - job_name: "cadvisor"
    scrape_interval: 10s
    metrics_path: '/metrics'
    static_configs:
      - targets: ["47.93.21.11:8080"]

  # - job_name: 'kafka-exporter'
  #   scrape_interval: 5s
  #   static_configs:
  #     - targets: ['10.100.7.1:9308']

  # - job_name: 'pushgateway'
  #   scrape_interval: 10s
  #   dns_sd_configs:
  #     - names:
  #         - 'tasks.pushgateway'
  #       type: 'A'
  #       port: 9091
  #   static_configs:
  #     - targets: ['node-exporter:9100']
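Hand-maintained `static_configs` can get unwieldy. Besides the Consul discovery used above, Prometheus also supports file-based service discovery: a `file_sd_configs` entry pointing at a JSON file of target groups, which Prometheus re-reads on change. A sketch generating such a file; the filename and label values are illustrative, the target IPs are taken from the mysql-exporter job above:

```python
import json

# Target groups Prometheus would pick up via
# file_sd_configs: [{files: ['mysql_targets.json']}]  (filename is an assumption)
targets = [
    {
        "targets": ["10.33.70.218:9104", "10.33.70.166:9104"],
        "labels": {"env": "dev"},   # illustrative extra label
    }
]

with open("mysql_targets.json", "w") as f:
    json.dump(targets, f, indent=2)
```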
node_alert.rules
groups:
- name: zqaalert
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."
  - alert: HighLoad
    expr: node_load1 > 8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} under high load"
      description: "{{ $labels.instance }} of job {{ $labels.job }} is under high load."
  - alert: MemoryAvailableBelow5Percent
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "Node memory alert (< 5% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HighDiskUsage
    expr: 100 - (node_filesystem_avail_bytes{device!~'rootfs'} * 100) / node_filesystem_size_bytes{device!~'rootfs'} > 90
    for: 5m
    labels:
      severity: high
    annotations:
      summary: "{{ $labels.instance }}: High disk usage detected"
      description: "{{ $labels.instance }}: disk usage is above 90% (current value: {{ $value }})"
  - alert: HighCpuUsage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 95
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: High CPU usage detected"
      description: "{{ $labels.instance }}: CPU usage is above 95% (current value: {{ $value }})"
  # - alert: ProcessRestarted
  #   expr: ceil(time() - max by(instance, groupname) (namedprocess_namegroup_oldest_start_time_seconds)) < 60
  #   for: 0s
  #   labels:
  #     severity: warning
  #   annotations:
  #     summary: "Process restarted"
  #     description: "Process {{ $labels.groupname }} restarted {{ $value }} seconds ago"
  - alert: ProcessExited
    # expr: max by(instance, groupname) (rate(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor.*|^lizhu_agent.*|^lizhurunner.*"}[5m])) < 0
    expr: namedprocess_namegroup_num_procs{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor.*|^lizhu_agent.*|^lizhurunner.*"} == 0
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Process exited"
      description: "Process {{ $labels.groupname }} has exited"
  # - alert: ProcessExited
  #   expr: max_over_time(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor.*|^lizhu_agent.*|^lizhurunner.*"}[1d]) < (time() - 10*60)
  #   for: 1s
  #   labels:
  #     severity: warning
  #   annotations:
  #     description: A process in group {{ $labels.groupname }} exited within the last 10 minutes
  #     summary: Process exited
  # - alert: HostUnusualDiskReadRate
  #   expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 200
  #   for: 5m
  #   labels:
  #     severity: warning
  #   annotations:
  #     summary: Host unusual disk read rate (instance {{ $labels.instance }})
  #     description: "Disk is probably reading too much data (> 200 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  # - alert: HostUnusualDiskWriteRate
  #   expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 120
  #   for: 2m
  #   labels:
  #     severity: warning
  #   annotations:
  #     summary: Host unusual disk write rate (instance {{ $labels.instance }})
  #     description: "Disk is probably writing too much data (> 120 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostOomKillDetected
    expr: increase(node_vmstat_oom_kill[1m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host OOM kill detected (instance {{ $labels.instance }})
      description: "OOM kill detected\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: EsxiHostDisconnected
    expr: vmware_host_power_state != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ESXi host {{ $labels.host_name }} lost connection"
      description: "VMware host {{ $labels.host_name }} is not connected to the virtualization platform."
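One pitfall in the process matchers above: in a regex, `lizhu_monitor*` means `lizhu_monito` followed by zero or more `r`s, not "anything starting with lizhu_monitor"; the `.*` suffix is what matches an arbitrary tail. Prometheus fully anchors `=~` matchers (the leading `^` is redundant), which Python's `re.fullmatch` mirrors:

```python
import re

# Prometheus anchors =~ matchers, so re.fullmatch mirrors its behavior.
# 'lizhu_monitor*' = 'lizhu_monito' plus zero or more trailing 'r's:
assert re.fullmatch("lizhu_monitor*", "lizhu_monitorrr") is not None
assert re.fullmatch("lizhu_monitor*", "lizhu_monitor_v2") is None   # misses this group!
# With '.*' any suffix matches, which is what the alert intends:
assert re.fullmatch("lizhu_monitor.*", "lizhu_monitor_v2") is not None
```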
docker_container.rules
groups:
- name: zqaalert
  rules:
  - alert: ContainerAbsent
    expr: absent(container_last_seen)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container absent: {{ $labels.instance }}"
      description: "No container seen for 5 minutes, current value: {{ $value }}"
  - alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) > 300
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage alert, container: {{ $labels.instance }}"
      description: "Container CPU usage above 300%, current value: {{ $value }}"
  - alert: ContainerMemoryUsage
    expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container memory usage alert, container: {{ $labels.instance }}"
      description: "Container memory usage above 80%, current value: {{ $value }}"
  - alert: ContainerVolumeIOUsage
    expr: (sum(container_fs_io_current{name!=""}) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container storage IO usage alert, container: {{ $labels.instance }}"
      description: "Container storage IO usage above 80%, current value: {{ $value }}"
  - alert: ContainerHighThrottleRate
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container throttling alert, container: {{ $labels.instance }}"
      description: "Container is being throttled, current value: {{ $value }}"
mysql_alert.rules
groups:
- name: zqaalert
  rules:
  - alert: MysqlDown
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: MySQL down (instance {{ $labels.instance }})
      description: "MySQL instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: MysqlTooManyConnections
    expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL too many connections (> 80%) (instance {{ $labels.instance }})
      description: "More than 80% of MySQL connections are in use on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: MysqlHighThreadsRunning
    expr: max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 60
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL high threads running (instance {{ $labels.instance }})
      description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: MysqlSlowQueries
    expr: increase(mysql_global_status_slow_queries[1m]) > 0
    for: 60m
    labels:
      severity: warning
    annotations:
      summary: MySQL slow queries (instance {{ $labels.instance }})
      description: "MySQL server has new slow queries.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
vmware.rules
groups:
- name: VMware Host Connection State
  rules:
  - alert: HostDisconnected
    expr: vmware_host_power_state != 1
    for: 5m  # the host must stay disconnected for 5 minutes before the alert fires
    labels:
      severity: warning
    annotations:
      summary: "VMware host {{ $labels.instance }} disconnected"
      description: "VMware host {{ $labels.instance }} is not connected to the virtualization platform."
Alertmanager configuration
Machines are grouped for alerting via the route/receiver configuration below.
cat /opt/monitor/alertmanager/config.yml
global:
  resolve_timeout: 5m
  smtp_from: 'ops@xxx.com'
  smtp_smarthost: 'smtp.feishu.cn:465'
  smtp_auth_username: 'ops@xxx.com'
  smtp_auth_password: 'ydWhsFDk3pF50TZg'
  smtp_require_tls: false
  smtp_hello: 'ZQA监控告警'
route:
  group_by: ['zqaalert']
  group_wait: 60s       # after the first alert fires, how long to wait for other alerts in the same group
  group_interval: 10m   # how often to check whether new alerts need to be sent for an existing group
  repeat_interval: 60m  # how long to wait before re-sending a notification for a still-firing group
  receiver: 'web.hook'
receivers:
#- name: 'web.hook.prometheusalert'
- name: 'web.hook'
  webhook_configs:
  - url: 'http://10.33.70.22:9094/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/7fe7f42d-242b-42eb-837c-028cfc84adb8'
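For reference when writing a webhook receiver like the one above: Alertmanager POSTs a JSON body in its webhook payload format (version "4"). The sketch below shows the shape of that payload and extracts the essentials; the label values are invented for illustration:

```python
# Shape of the JSON Alertmanager POSTs to a webhook receiver
# (payload version "4"); label values here are invented.
payload = {
    "version": "4",
    "status": "firing",
    "groupLabels": {"alertname": "InstanceDown"},
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "InstanceDown", "severity": "critical",
                       "instance": "10.33.70.218:9100"},
            "annotations": {"summary": "Instance 10.33.70.218:9100 down"},
        }
    ],
}

def summarize(p):
    """Return (alertname, status) for each alert in a webhook payload."""
    return [(a["labels"]["alertname"], a["status"]) for a in p["alerts"]]

print(summarize(payload))
```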
Consul service registration
Each host registers itself with Consul via a cron entry:
* */1 * * * ip addr | awk '/^[0-9]+: / {}; /inet.*global/ {print gensub(/(.*)\/(.*)/, "\\1", "g", $2)}' |grep "10.33"|head -1|xargs -i curl -X PUT -d '{"id": "node-exporter-{}","name": "node-exporter-dev","address": "{}","port": 9100,"tags": ["env-dev"],"checks": [{"http": "http://{}:9100/metrics", "interval": "5s"}]}' http://consul.intra.xxx.net/v1/agent/service/register
A prebuilt Consul container image is available; just run it.
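The JSON body that the curl one-liner PUTs to Consul's `/v1/agent/service/register` endpoint can be sketched in Python. The IP below is illustrative; actually sending it (e.g. with `requests.put(url, json=body)`) is omitted:

```python
import json

def registration(ip, port=9100):
    """Build the same body the cron's curl sends, with the host IP filled in."""
    return {
        "id": f"node-exporter-{ip}",
        "name": "node-exporter-dev",
        "address": ip,
        "port": port,
        "tags": ["env-dev"],
        "checks": [{"http": f"http://{ip}:{port}/metrics", "interval": "5s"}],
    }

# Illustrative IP; the cron derives the real one from `ip addr`.
print(json.dumps(registration("10.33.1.2"), indent=2))
```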
Dashboard JSON files
Below are the Grafana dashboards I find most useful:
Grafana dashboards | Grafana Labs