Prometheus各类监控及监控指标和告警规则

news2024/9/20 18:44:44

目录

linux  docker监控

linux  系统进程监控

linux  系统os监控

windows  系统os监控

配置文件&告警规则

Prometheus配置文件

 node_alert.rules

docker_container.rules

mysql_alert.rules

vmware.rules

Alertmanager告警规则

consoul注册服务

Dashboard JSON文件



linux  docker监控

获取的是docker stats命令的统计结果,可以页面方式展示出来。

cadvisor.tar

上传cadvisor.tar包,导入后修改tag,运行容器

docker load -i cadvisor.tar

docker tag gcr.io/cadvisor/cadvisor:latest google/cadvisor:latest

docker run -d --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --publish=8080:8080 --name=cadvisor google/cadvisor:latest

容器运行后如下:

访问cadvisor   http://ip:8080

linux  系统进程监控

通过正则、绝对路径、名字等获取指定进程的运行状况

process-exporter-0.7.5.linux-amd64.tar.gz

参考我的另一篇文章

Prometheus监控主机进程-CSDN博客

默认端口 9256

linux  系统os监控

通过exporter获取当前系统的Cpu、内存、硬盘等OS资源

node_exporter放到指定路径后

cat /etc/systemd/system/node-exporter.service

[Unit]
Description=Prometheus Node exporter
After=network.target

[Service]
ExecStart=/opt/monitoring/node_exporter

[Install]
WantedBy=multi-user.target

默认端口:9100

windows  系统os监控

通过exporter获取当前系统的Cpu、内存、硬盘等OS资源

windows_exporter-0.26.0-amd64.msi

1.关闭防火墙

2.管理员模式双击执行

3.services.msc服务管理检查windows-exporter服务自动启动即可

默认端口:9182

配置文件&告警规则

/opt/monitor/prometheus目录下

Prometheus配置文件
cat /opt/monitor/prometheus/prometheus.yml 
# my global config
global:
  scrape_interval:     10s # By default, scrape targets every 15 seconds.
  scrape_timeout: 5s
  evaluation_interval: 10s # By default, scrape targets every 15 seconds.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'zqa_monitor'

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - 'node_alert.rules'
  - 'mysql_alert.rules'
  - 'docker_container.rules'
  # - "first.rules"
  # - "second.rules"

# alert
alerting:
  alertmanagers:
  - scheme: http
    static_configs:
    - targets:
      - "alertmanager:9093"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.

  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
         - targets: ['localhost:9090']


  #- job_name: 'cadvisor'

    # Override the global default and scrape targets from this job every 5 seconds.
   # scrape_interval: 5s

    #dns_sd_configs:
    #- names:
    #  - 'tasks.cadvisor'
    #  type: 'A'
    #  port: 8080

    #static_configs:
    #     - targets: ['10.33.70.218:8080']

  - job_name: 'node-exporter'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
         - targets: ['10.100.10.100:9182']
    consul_sd_configs:
    - server: '10.33.70.203:8500'
      services: ['node-exporter-dev']

  - job_name: 'mysql-exporter'
    scrape_interval: 5s
    static_configs:
         - targets: ['10.33.70.218:9104', '10.33.70.166:9104', '10.33.70.224:9104']
  - job_name: 'postgres-exporter'
    scrape_interval: 5s
    static_configs:
         - targets: ['123.57.190.129:9187']
  - job_name: 'vsphere-exporter'
    scrape_interval: 5s
    static_configs:
         - targets: ['10.33.70.22:9272']
  - job_name: 'es-exporter'
    scrape_interval: 5s
    static_configs:
         - targets: ['123.57.216.51:9114']
  - job_name: 'pushgateway'
    scrape_interval: 30s
    static_configs:
      - targets: ['39.104.94.83:19091']
        labels:
          instance: pushgateway
    honor_labels: true

  - job_name: "cadvisor"
    scrape_interval: 10s
    metrics_path: '/metrics'
    static_configs:
      - targets: ["47.93.21.11:8080]

  #- job_name: 'kafka-exporter'
  #  scrape_interval: 5s
  #  static_configs:
  #       - targets: [ '10.100.7.1:9308']

#  - job_name: 'pushgateway'
#    scrape_interval: 10s
#    dns_sd_configs:
#    - names:
#      - 'tasks.pushgateway'
#      type: 'A'
#      port: 9091

#     static_configs:
#          - targets: ['node-exporter:9100']

 node_alert.rules
groups:
- name: zqaalert
  rules:

  - alert:  机器宕机
    expr: up == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."

  - alert: 负载率
    expr: node_load1 > 8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} under high load"
      description: "{{ $labels.instance }} of job {{ $labels.job }} is under high load."

  - alert: 可用内存小于5%
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "节点内存告警 (< 5% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert:  磁盘使用率
    expr: (100 - ((node_filesystem_avail_bytes{device!~'rootfs'} * 100) / node_filesystem_size_bytes{device!~'rootfs'}) > 90)
    for: 5m
    labels:
      severity: High
    annotations:
      summary: "{{$labels.instance}}: High Disk usage detected"
      description: "{{$labels.instance}}: 硬盘使用率大于 90% (当前值:{{ $value }})"


  - alert: Cpu使用率
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 95
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: High Cpu usage detected"
      description: "{{$labels.instance}}: CPU 使用率大于 95% (current value is:{{ $value }})"

 # - alert: 进程恢复
 #   expr: ceil(time() - max by(instance, groupname) (namedprocess_namegroup_oldest_start_time_seconds)) < 60
 #   for: 0s
 #   labels:
 #     severity: warning
 #   annotations:
 #     summary: "进程重启"
 #     description: "进程{{ $labels.groupname }}在{{ $value }}秒前重启过"


  - alert: 进程退出告警
   # expr: max by(instance, groupname) (rate(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor*|^lizhu_agent*|^lizhurunner*"}[5m])) < 0
    expr: namedprocess_namegroup_num_procs{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor*|^lizhu_agent*|^lizhurunner*"} == 0
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "进程退出"
      description: "进程{{ $labels.groupname }}退出了"  

#  - alert: 进程退出告警
#    expr: max_over_time(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor.*|^lizhu_agent.*|^lizhurunner.*"}[1d]) < (time() - 10*60)
#    for: 1s
#    labels:
#      severity: warning
#    annotations:
#      description: 进程组 {{ $labels.groupname }} 中的进程在最近10分钟内退出了
#      summary: 进程退出

  #- alert: 机器硬盘读取速率
  #  expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 200
  #  for: 5m
  #  labels:
  #    severity: warning
  #  annotations:
  #    summary: Host unusual disk read rate (instance {{ $labels.instance }})
  #    description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  #- alert: 机器硬盘写入速率
  #  expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 120
  #  for: 2m
  #  labels:
  #    severity: warning
  #  annotations:
  #    summary: Host unusual disk write rate (instance {{ $labels.instance }})
  #    description: "Disk is probably writing too much data VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostOomKillDetected
    expr: increase(node_vmstat_oom_kill[1m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host OOM kill detected (instance {{ $labels.instance }})
      description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: Esxi主机连接丢失
    expr: vmware_host_power_state != 1
    for: 1m 
    labels:
      severity: critical
    annotations:
      summary: "Esxi 物理机IP: {{ $labels.host_name }} 丢失连接"
      description: "VMware host {{ $labels.host_name }} is not connected to the virtualization platform."

      

docker_container.rules
groups:
- name: zqaalert
  rules:
    - alert: ContainerAbsent
      expr: absent(container_last_seen)
      for: 5m
      labels:
        severity: warning
      annotations:
          summary: "无容器 容器:{{$labels.instance }}"
          description: "5分钟检查容器不存在,当前值为:{{ $value }}"
    - alert: ContainerCpuUsage
      expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY(instance, name)*100 ) > 300
      for: 2m
      labels:
        severity: warning
      annotations:
          summary: "容器cpu使用率告警,容器:{{$labels.instance }}"
          description: "容器cpu使用率超过300%,当前值为:{{ $value }}"
    - alert: ContainerMemoryUsage
      expr: (sum(container_memory_working_set_bytes{name!=""})BY (instance, name) /sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100 ) > 80
      for: 2m
      labels:
        severity: warning
      annotations:
          summary: "容器内存使用率告警,容器:{{$labels.instance }}"
          description: "容器内存使用率超过80%,当前值为:{{ $value }}"
    - alert: ContainerVolumeIOUsage
      expr: (sum(container_fs_io_current{name!=""}) BY (instance, name) * 100) >80 
      for: 2m
      labels:
        severity: warning
      annotations:
          summary: "容器存储IO使用率告警,容器:{{$labels.instance }}"
          description: "容器存储IO使用率超过80%,当前值为:{{ $value }}"
    - alert: ContainerHighThrottleRate
      expr: rate(container_cpus_cfs_throttled_seconds_total[3m]) > 1 
      for: 2m
      labels:
        severity: warning
      annotations:
          summary: "容器限制告警,容器:{{$labels.instance }}"
          description: "容器被限制,当前值为:{{ $value }}"
mysql_alert.rules
groups:
- name: zqaalert
  rules:

  - alert:  Mysql 宕机
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: MySQL down (instance {{ $labels.instance }})
      description: "MySQL instance is down on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: MysqlTooManyConnections(>80%)
    expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL too many connections (> 80%) (instance {{ $labels.instance }})
      description: "More than 80% of MySQL connections are in use on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: MysqlHighThreadsRunning
    expr: max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 60
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL high threads running (instance {{ $labels.instance }})
      description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: Mysql慢查询
    expr: increase(mysql_global_status_slow_queries[1m]) > 0
    for: 60m
    labels:
      severity: warning
    annotations:
      summary: MySQL slow queries (instance {{ $labels.instance }})
      description: "MySQL server mysql has some new slow query.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
vmware.rules
- name: VMware Host Connection State
  rules:
  - alert: HostDisconnected
    expr: vmware_host_power_state == "connected"
    for: 5m # 规定主机连接状态必须持续5分钟才会触发警报
    labels:
      severity: warning
    annotations:
      summary: "VMware host {{ $labels.instance }} disconnected"
      description: "VMware host {{ $labels.instance }} is not connected to the virtualization platform."

Alertmanager告警规则

通过定义组来监控组内机器

cat vim /opt/monitor/alertmanager/config.yml

global:
  resolve_timeout: 5m
  smtp_from: 'ops@xxx.com'
  smtp_smarthost: 'smtp.feishu.cn:465'
  smtp_auth_username: 'ops@xxx.com'
  smtp_auth_password: 'ydWhsFDk3pF50TZg'
  smtp_require_tls: false
  smtp_hello: 'ZQA监控告警'

route:
  group_by: ['zqaalert']
  group_wait: 60s # 在触发第一个警报后,等待相同分组内的所有警报的最长时间
  group_interval: 10m   # 系统每隔10分钟检查一次是否有新的警报需要处理
  repeat_interval: 60m  # 在发送警报通知后,在重复发送通知之间等待的时间。设置为1小时意味着如果同一组内的警报在 1小时再次触发
  receiver: 'web.hook'
receivers:
#- name: 'web.hook.prometheusalert'
- name: 'web.hook'
  webhook_configs:
  - url: 'http://10.33.70.22:9094/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/7fe7f42d-242b-42eb-837c-028cfc84adb8'

consoul注册服务

* */1 * * * ip addr | awk '/^[0-9]+: / {}; /inet.*global/ {print gensub(/(.*)\/(.*)/, "\\1", "g", $2)}' |grep "10.33"|head -1|xargs -i curl -X PUT -d  '{"id": "node-exporter-{}","name": "node-exporter-dev","address": "{}","port": 9100,"tags": ["env-dev"],"checks": [{"http": "http://{}:9100/metrics", "interval": "5s"}]}'  http://consul.intra.xxx.net/v1/agent/service/register

有现成的consoul容器,运行即可

Dashboard JSON文件

以下是我认为比较好用的  grafana 的 dashboards文件

Grafana dashboards | Grafana Labs

    

   

    

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1952051.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

并发编程--volatile

1.什么是volatile volatile是 轻 量 级 的 synchronized&#xff0c;它在多 处 理器开 发 中保 证 了共享 变 量的 “ 可 见 性 ” 。可 见 性的意思是当一个 线 程 修改一个共享变 量 时 &#xff0c;另外一个 线 程能 读 到 这 个修改的 值 。如果 volatile 变 量修 饰 符使用…

车载录像机:移动安全领域的科技新星

随着科技的飞速发展&#xff0c;人类社会的各个领域都在不断经历技术革新。其中&#xff0c;车载录像机作为安防行业与汽车技术结合的产物&#xff0c;日益受到人们的关注。它不仅体现了人类科技发展的成果&#xff0c;更在安防领域发挥了重要作用。本文将详细介绍车载录像机的…

Spring Boot集成canal快速入门demo

1.什么是canal&#xff1f; canal 是阿里开源的一款 MySQL 数据库增量日志解析工具&#xff0c;提供增量数据订阅和消费。 工作原理 MySQL主备复制原理 MySQL master 将数据变更写入二进制日志&#xff08;binary log&#xff09;, 日志中的记录叫做二进制日志事件&#xff…

【QT】UDP

目录 核心API 示例&#xff1a;回显服务器 服务器端编写&#xff1a; 第一步&#xff1a;创建出socket对象 第二步&#xff1a; 连接信号槽 第三步&#xff1a;绑定端口号 第四步&#xff1a;编写信号槽所绑定方法 第五步&#xff1a;编写第四步中处理请求的方法 客户端…

Simulink代码生成: 基本模块的使用

文章目录 1 引言2 模块使用实例2.1 In/Out模块2.2 Constant模块2.3 Scope/Display模块2.4 Ground/Terminator模块 3 总结 1 引言 本文中博主介绍Simulink中最简单最基础的模块&#xff0c;包括In/Out模块&#xff08;输入输出&#xff09;&#xff0c;Constant模块&#xff08…

Postman测试工具详细解读

目录 一、Postman的基本概念二、Postman的主要功能1. 请求构建2. 响应查看3. 断言与自动化测试4. 环境与变量5. 集合与文档化6. 与团队实时协作 三、Postman在API测试中的重要性1. 提高测试效率2. 保障API的稳定性3. 促进团队协作4. 生成文档与交流工具 四、Postman的使用技巧1…

CAS算法

CAS算法 1. CAS简介 CAS叫做CompareAndSwap&#xff0c;比较并交换&#xff0c;主要是通过处理器的指令来保证操作的原子性。 CAS基本概念 内存位置 (V)&#xff1a;需要进行CAS操作的内存地址。预期原值 (A)&#xff1a;期望该内存位置上的旧值。新值 (B)&#xff1a;如果旧…

VSCode python autopep8 格式化 长度设置

ctrl, 打开设置 > 搜索autopep8 > 找到Autopep8:Args > 添加项--max-line-length150

Java泛型的介绍和基本使用

什么是泛型 ​ 泛型就是将类型参数化&#xff0c;比如定义了一个栈&#xff0c;你必须在定义之前声明这个栈中存放的数据的类型&#xff0c;是int也好是double或者其他的引用数据类型也好&#xff0c;定义好了之后这个栈就无法用来存放其他类型的数据。如果这时候我们想要使用这…

谷粒商城实战笔记-71-商品服务-API-属性分组-前端组件抽取父子组件交互

文章目录 一&#xff0c;一次性创建所有的菜单二&#xff0c;开发属性分组界面1&#xff0c;左侧三级分类树形组件2&#xff0c;右侧分组列表3&#xff0c;左右两部分通信3.1 子组件发送数据3.2&#xff0c;父组件接收数据 Vue的父子组件通信父组件向子组件传递数据子组件向父组…

SpringBoot添加密码安全配置以及Jwt配置

Maven仓库&#xff08;依赖查找&#xff09; 1、SpringBoot安全访问配置 首先添加依赖 spring-boot-starter-security 然后之后每次启动项目之后&#xff0c;访问任何的请求都会要求输入密码才能请求。&#xff08;如下&#xff09; 在没有配置的情况下&#xff0c;默认用户…

LLM agentic模式之工具使用: Gorilla

Gorilla Gorilla出自2023年5月的论文《Gorilla: Large Language Model Connected with Massive APIs》&#xff0c;针对LLM无法准确地生成API调用时的参数&#xff0c;构建API使用数据集后基于Llama微调了一个模型。 数据集构建 API数据集APIBench的构建过程如下&#xff1…

《Programming from the Ground Up》阅读笔记:p75-p87

《Programming from the Ground Up》学习第4天&#xff0c;p75-p87总结&#xff0c;总计13页。 一、技术总结 1.persistent data p75, Data which is stored in files is called persistent data, because it persists in files that remain on disk even when the program …

C语言程序设计15

程序设计15 问题15_1代码15_1结果15_1 问题15_2代码15_2结果15_2 问题15_3代码15_3结果15_3 问题15_1 在 m a i n main main 函数中将多次调用 f u n fun fun 函数&#xff0c;每调用一次&#xff0c;输出链表尾部结点中的数据&#xff0c;并释放该结点&#xff0c;使链表缩短…

【SQL 新手教程 3/20】关系模型 -- 外键

&#x1f497; 关系数据库建立在关系模型上⭐ 关系模型本质上就是若干个存储数据的二维表 记录 (Record)&#xff1a; 表的每一行称为记录&#xff08;Record&#xff09;&#xff0c;记录是一个逻辑意义上的数据 字段 (Column)&#xff1a;表的每一列称为字段&#xff08;Colu…

Buildroot 构建 Linux 系统

Buildroot 是一个工具&#xff0c;以简化和自动化为嵌入式系统构建完整 Linux 系统的过程。使用交叉编译技术&#xff0c;Buildroot 能够生成交叉编译工具链、根文件系统、Linux 内核映像和针对目标设备的引导加载程序。可以独立地使用这些选项的任何组合&#xff0c;例如&…

Vitis AI 使用 VAI_Q_PYTORCH 工具

目录 1. 简介 2. 资料汇总 3. 示例解释 3.1 快速上手示例 4. 总结 1. 简介 vai_q_pytorch 是 Vitis AI Quantizer for Pytorch 的缩写&#xff0c;主要作用是优化神经网络模型。它是 Vitis AI 平台的一部分&#xff0c;专注于神经网络的深度压缩。 vai_q_pytorch 的作用…

大数据管理中心设计规划方案(可编辑的43页PPT)

引言&#xff1a;随着企业业务的快速发展&#xff0c;数据量急剧增长&#xff0c;传统数据管理方式已无法满足高效处理和分析大数据的需求。建立一个集数据存储、处理、分析、可视化于一体的大数据管理中心&#xff0c;提升数据处理能力&#xff0c;加速业务决策过程&#xff0…

Spring Boot:图书管理系统(一)

1.编写用户登录接口 代码&#xff1a; package com.example.demo;import jakarta.servlet.http.HttpSession; import org.springframework.util.StringUtils; import org.springframework.web.bind.annotation.RequestMapping; import org.springframework.web.bind.annotatio…

HarmonyOS和OpenHarmony区别联系

前言 相信我们在刚开始接触鸿蒙开发的时候经常看到HarmonyOS和OpenHarmony频繁的出现在文章和文档之中&#xff0c;那么这两个名词分别是什么意思&#xff0c;他们之间又有什么联系呢&#xff1f;本文将通过现有的文章和网站内容并与Google的AOSP和Android做对比&#xff0c;带…