Contents
Linux Docker monitoring
Linux process monitoring
Linux OS monitoring
Windows OS monitoring
Configuration files & alert rules
Prometheus configuration file
node_alert.rules
docker_container.rules
mysql_alert.rules
vmware.rules
Alertmanager configuration
Consul service registration
Dashboard JSON files
Linux Docker monitoring
cAdvisor collects the same statistics as the `docker stats` command and presents them through a web UI (and a /metrics endpoint for Prometheus).
cadvisor.tar
Upload the cadvisor.tar package, load the image, re-tag it, then run the container:
docker load -i cadvisor.tar
docker tag gcr.io/cadvisor/cadvisor:latest google/cadvisor:latest
docker run -d --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --publish=8080:8080 --name=cadvisor google/cadvisor:latest
Once the container is running, open the cAdvisor UI at http://ip:8080
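The same container can also be declared as a docker-compose service. This is a sketch equivalent to the `docker run` command above; the `restart` policy is an addition, not part of the original command:

```yaml
# docker-compose.yml equivalent of the docker run above
services:
  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    restart: unless-stopped   # assumption: auto-restart is usually wanted for a monitor
```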
Linux process monitoring
process-exporter reports the state of selected processes, matched by regex, absolute path, or name.
process-exporter-0.7.5.linux-amd64.tar.gz
For setup details, see my other article:
Prometheus监控主机进程-CSDN博客
Default port: 9256
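process-exporter needs a config file (passed with `-config.path`) naming the processes to track. A minimal sketch; the file path is hypothetical and the patterns are chosen to match the process groups alerted on later in node_alert.rules:

```yaml
# Hypothetical /opt/monitoring/process-exporter.yml
process_names:
  - name: "{{.Comm}}"   # group metrics by executable name
    cmdline:
    - 'vsftpd'
    - 'goproxy'
```

Each matched group appears in metrics such as `namedprocess_namegroup_num_procs{groupname="..."}`, which the alert rules below query.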
Linux OS monitoring
node_exporter exposes the host's OS resources (CPU, memory, disk, etc.).
After copying node_exporter to the target path, create a systemd unit:
cat /etc/systemd/system/node-exporter.service
[Unit]
Description=Prometheus Node exporter
After=network.target
[Service]
ExecStart=/opt/monitoring/node_exporter
[Install]
WantedBy=multi-user.target
Then reload systemd and start the service:
systemctl daemon-reload
systemctl enable --now node-exporter
Default port: 9100
Windows OS monitoring
windows_exporter exposes the host's OS resources (CPU, memory, disk, etc.).
windows_exporter-0.26.0-amd64.msi
1. Disable the firewall.
2. Run the MSI as administrator.
3. In services.msc, verify that the windows_exporter service is set to start automatically.
Default port: 9182
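All of these exporters serve metrics in the Prometheus text exposition format on /metrics. The toy parser below shows what a scrape returns and how a sample line splits into name and value; the metric names are real node_exporter metrics but the values are invented, and a real consumer should use an official client library rather than this sketch:

```python
# Sample of the Prometheus text exposition format (values are made up).
sample = """\
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.249966592e+09
node_load1 0.42
"""

def parse_metrics(text):
    """Tiny parser: metric name -> float value; skips comments, ignores labels."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # HELP/TYPE lines and blanks
        name, value = line.rsplit(None, 1)
        metrics[name] = float(value)
    return metrics

m = parse_metrics(sample)
print(m["node_load1"])
```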
Configuration files & alert rules
All files live under /opt/monitor/prometheus.
Prometheus configuration file
cat /opt/monitor/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 10s     # scrape targets every 10 seconds by default
  scrape_timeout: 5s
  evaluation_interval: 10s # evaluate rules every 10 seconds
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'zqa_monitor'

# Load and evaluate rules in these files every 'evaluation_interval'.
rule_files:
  - 'node_alert.rules'
  - 'mysql_alert.rules'
  - 'docker_container.rules'
  # - "first.rules"
  # - "second.rules"

# Alertmanager configuration
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - "alertmanager:9093"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  # - job_name: 'cadvisor'
  #   scrape_interval: 5s
  #   dns_sd_configs:
  #     - names:
  #         - 'tasks.cadvisor'
  #       type: 'A'
  #       port: 8080
  #   static_configs:
  #     - targets: ['10.33.70.218:8080']

  - job_name: 'node-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['10.100.10.100:9182']
    consul_sd_configs:
      - server: '10.33.70.203:8500'
        services: ['node-exporter-dev']

  - job_name: 'mysql-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['10.33.70.218:9104', '10.33.70.166:9104', '10.33.70.224:9104']

  - job_name: 'postgres-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['123.57.190.129:9187']

  - job_name: 'vsphere-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['10.33.70.22:9272']

  - job_name: 'es-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['123.57.216.51:9114']

  - job_name: 'pushgateway'
    scrape_interval: 30s
    honor_labels: true   # keep labels pushed to the gateway instead of overwriting them
    static_configs:
      - targets: ['39.104.94.83:19091']
        labels:
          instance: pushgateway

  - job_name: "cadvisor"
    scrape_interval: 10s
    metrics_path: '/metrics'
    static_configs:
      - targets: ["47.93.21.11:8080"]

  # - job_name: 'kafka-exporter'
  #   scrape_interval: 5s
  #   static_configs:
  #     - targets: ['10.100.7.1:9308']

  # - job_name: 'pushgateway'
  #   scrape_interval: 10s
  #   dns_sd_configs:
  #     - names:
  #         - 'tasks.pushgateway'
  #       type: 'A'
  #       port: 9091
  #   static_configs:
  #     - targets: ['node-exporter:9100']
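Hand-maintained `static_configs` can get unwieldy. Besides the Consul discovery used above, Prometheus also supports file-based service discovery: a `file_sd_configs` entry pointing at a JSON file of target groups, which Prometheus re-reads on change. A sketch generating such a file; the filename and label values are illustrative, the target IPs are taken from the mysql-exporter job above:

```python
import json

# Target groups Prometheus would pick up via
# file_sd_configs: [{files: ['mysql_targets.json']}]  (filename is an assumption)
targets = [
    {
        "targets": ["10.33.70.218:9104", "10.33.70.166:9104"],
        "labels": {"env": "dev"},   # illustrative extra label
    }
]

with open("mysql_targets.json", "w") as f:
    json.dump(targets, f, indent=2)
```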
node_alert.rules
groups:
- name: zqaalert
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."
  - alert: HighLoad
    expr: node_load1 > 8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} under high load"
      description: "{{ $labels.instance }} of job {{ $labels.job }} is under high load."
  - alert: MemoryAvailableBelow5Percent
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "Node memory alert (< 5% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HighDiskUsage
    expr: 100 - (node_filesystem_avail_bytes{device!~'rootfs'} * 100) / node_filesystem_size_bytes{device!~'rootfs'} > 90
    for: 5m
    labels:
      severity: high
    annotations:
      summary: "{{ $labels.instance }}: High disk usage detected"
      description: "{{ $labels.instance }}: disk usage is above 90% (current value: {{ $value }})"
  - alert: HighCpuUsage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 95
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: High CPU usage detected"
      description: "{{ $labels.instance }}: CPU usage is above 95% (current value: {{ $value }})"
  # - alert: ProcessRestarted
  #   expr: ceil(time() - max by(instance, groupname) (namedprocess_namegroup_oldest_start_time_seconds)) < 60
  #   for: 0s
  #   labels:
  #     severity: warning
  #   annotations:
  #     summary: "Process restarted"
  #     description: "Process {{ $labels.groupname }} restarted {{ $value }} seconds ago"
  - alert: ProcessExited
    # expr: max by(instance, groupname) (rate(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor.*|^lizhu_agent.*|^lizhurunner.*"}[5m])) < 0
    expr: namedprocess_namegroup_num_procs{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor.*|^lizhu_agent.*|^lizhurunner.*"} == 0
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Process exited"
      description: "Process {{ $labels.groupname }} has exited"
  # - alert: ProcessExited
  #   expr: max_over_time(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor.*|^lizhu_agent.*|^lizhurunner.*"}[1d]) < (time() - 10*60)
  #   for: 1s
  #   labels:
  #     severity: warning
  #   annotations:
  #     description: A process in group {{ $labels.groupname }} exited within the last 10 minutes
  #     summary: Process exited
  # - alert: HostUnusualDiskReadRate
  #   expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 200
  #   for: 5m
  #   labels:
  #     severity: warning
  #   annotations:
  #     summary: Host unusual disk read rate (instance {{ $labels.instance }})
  #     description: "Disk is probably reading too much data (> 200 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  # - alert: HostUnusualDiskWriteRate
  #   expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 120
  #   for: 2m
  #   labels:
  #     severity: warning
  #   annotations:
  #     summary: Host unusual disk write rate (instance {{ $labels.instance }})
  #     description: "Disk is probably writing too much data (> 120 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: HostOomKillDetected
    expr: increase(node_vmstat_oom_kill[1m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host OOM kill detected (instance {{ $labels.instance }})
      description: "OOM kill detected\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: EsxiHostDisconnected
    expr: vmware_host_power_state != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ESXi host {{ $labels.host_name }} lost connection"
      description: "VMware host {{ $labels.host_name }} is not connected to the virtualization platform."
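One pitfall in the process matchers above: in a regex, `lizhu_monitor*` means `lizhu_monito` followed by zero or more `r`s, not "anything starting with lizhu_monitor"; the `.*` suffix is what matches an arbitrary tail. Prometheus fully anchors `=~` matchers (the leading `^` is redundant), which Python's `re.fullmatch` mirrors:

```python
import re

# Prometheus anchors =~ matchers, so re.fullmatch mirrors its behavior.
# 'lizhu_monitor*' = 'lizhu_monito' plus zero or more trailing 'r's:
assert re.fullmatch("lizhu_monitor*", "lizhu_monitorrr") is not None
assert re.fullmatch("lizhu_monitor*", "lizhu_monitor_v2") is None   # misses this group!
# With '.*' any suffix matches, which is what the alert intends:
assert re.fullmatch("lizhu_monitor.*", "lizhu_monitor_v2") is not None
```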
docker_container.rules
groups:
- name: zqaalert
  rules:
  - alert: ContainerAbsent
    expr: absent(container_last_seen)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container absent: {{ $labels.instance }}"
      description: "No container seen for 5 minutes, current value: {{ $value }}"
  - alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) > 300
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage alert, container: {{ $labels.instance }}"
      description: "Container CPU usage above 300%, current value: {{ $value }}"
  - alert: ContainerMemoryUsage
    expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container memory usage alert, container: {{ $labels.instance }}"
      description: "Container memory usage above 80%, current value: {{ $value }}"
  - alert: ContainerVolumeIOUsage
    expr: (sum(container_fs_io_current{name!=""}) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container storage IO usage alert, container: {{ $labels.instance }}"
      description: "Container storage IO usage above 80%, current value: {{ $value }}"
  - alert: ContainerHighThrottleRate
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container throttling alert, container: {{ $labels.instance }}"
      description: "Container is being throttled, current value: {{ $value }}"
mysql_alert.rules
groups:
- name: zqaalert
  rules:
  - alert: MysqlDown
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: MySQL down (instance {{ $labels.instance }})
      description: "MySQL instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: MysqlTooManyConnections
    expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL too many connections (> 80%) (instance {{ $labels.instance }})
      description: "More than 80% of MySQL connections are in use on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: MysqlHighThreadsRunning
    expr: max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 60
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL high threads running (instance {{ $labels.instance }})
      description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
  - alert: MysqlSlowQueries
    expr: increase(mysql_global_status_slow_queries[1m]) > 0
    for: 60m
    labels:
      severity: warning
    annotations:
      summary: MySQL slow queries (instance {{ $labels.instance }})
      description: "MySQL server has new slow queries.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
vmware.rules
groups:
- name: VMware Host Connection State
  rules:
  - alert: HostDisconnected
    expr: vmware_host_power_state != 1
    for: 5m  # the host must stay disconnected for 5 minutes before the alert fires
    labels:
      severity: warning
    annotations:
      summary: "VMware host {{ $labels.instance }} disconnected"
      description: "VMware host {{ $labels.instance }} is not connected to the virtualization platform."
Alertmanager configuration
Machines are grouped for alerting via the route/receiver configuration below.
cat /opt/monitor/alertmanager/config.yml
global:
  resolve_timeout: 5m
  smtp_from: 'ops@xxx.com'
  smtp_smarthost: 'smtp.feishu.cn:465'
  smtp_auth_username: 'ops@xxx.com'
  smtp_auth_password: 'ydWhsFDk3pF50TZg'
  smtp_require_tls: false
  smtp_hello: 'ZQA监控告警'
route:
  group_by: ['zqaalert']
  group_wait: 60s       # after the first alert fires, how long to wait for other alerts in the same group
  group_interval: 10m   # how often to check whether new alerts need to be sent for an existing group
  repeat_interval: 60m  # how long to wait before re-sending a notification for a still-firing group
  receiver: 'web.hook'
receivers:
#- name: 'web.hook.prometheusalert'
- name: 'web.hook'
  webhook_configs:
  - url: 'http://10.33.70.22:9094/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/7fe7f42d-242b-42eb-837c-028cfc84adb8'
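For reference when writing a webhook receiver like the one above: Alertmanager POSTs a JSON body in its webhook payload format (version "4"). The sketch below shows the shape of that payload and extracts the essentials; the label values are invented for illustration:

```python
# Shape of the JSON Alertmanager POSTs to a webhook receiver
# (payload version "4"); label values here are invented.
payload = {
    "version": "4",
    "status": "firing",
    "groupLabels": {"alertname": "InstanceDown"},
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "InstanceDown", "severity": "critical",
                       "instance": "10.33.70.218:9100"},
            "annotations": {"summary": "Instance 10.33.70.218:9100 down"},
        }
    ],
}

def summarize(p):
    """Return (alertname, status) for each alert in a webhook payload."""
    return [(a["labels"]["alertname"], a["status"]) for a in p["alerts"]]

print(summarize(payload))
```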
Consul service registration
Each host registers itself with Consul via a cron entry:
* */1 * * * ip addr | awk '/^[0-9]+: / {}; /inet.*global/ {print gensub(/(.*)\/(.*)/, "\\1", "g", $2)}' |grep "10.33"|head -1|xargs -i curl -X PUT -d '{"id": "node-exporter-{}","name": "node-exporter-dev","address": "{}","port": 9100,"tags": ["env-dev"],"checks": [{"http": "http://{}:9100/metrics", "interval": "5s"}]}' http://consul.intra.xxx.net/v1/agent/service/register
A prebuilt Consul container image is available; just run it.
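The JSON body that the curl one-liner PUTs to Consul's `/v1/agent/service/register` endpoint can be sketched in Python. The IP below is illustrative; actually sending it (e.g. with `requests.put(url, json=body)`) is omitted:

```python
import json

def registration(ip, port=9100):
    """Build the same body the cron's curl sends, with the host IP filled in."""
    return {
        "id": f"node-exporter-{ip}",
        "name": "node-exporter-dev",
        "address": ip,
        "port": port,
        "tags": ["env-dev"],
        "checks": [{"http": f"http://{ip}:{port}/metrics", "interval": "5s"}],
    }

# Illustrative IP; the cron derives the real one from `ip addr`.
print(json.dumps(registration("10.33.1.2"), indent=2))
```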
Dashboard JSON files
Below are the Grafana dashboards I find most useful:
Grafana dashboards | Grafana Labs