小阿轩yx - Prometheus Monitoring System Deployment

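Preface

Prometheus is written in Go. It collects monitoring data with a pull model and provides a multidimensional data model and a flexible query interface.
Besides configuring monitoring targets through static files, Prometheus supports automatic discovery and can obtain targets dynamically through Kubernetes, Consul, DNS, and other mechanisms.
For data collection, Go's concurrency lets a single Prometheus server scrape monitoring data from hundreds of nodes.
For data storage, the local time series database has been optimized steadily, and a single Prometheus server is said to ingest up to ten million samples per second; when large volumes of historical data must be kept, remote storage is also supported.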

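Prometheus architecture overview

Origins

Prometheus originated at SoundCloud. As microservices spread rapidly, the number of instances grew geometrically, and the team needed a monitoring system with the following capabilities:

A multidimensional data model, so that data can be sliced and diced at will along dimensions such as instance, service, endpoint, and method.
Operational simplicity, so that a monitoring server can be started anywhere and at any time, even on a local workstation, without setting up a distributed storage backend or reconfiguring the environment.
Scalable data collection and a decentralized architecture, so that many instances of a service can be monitored reliably and independent teams can run independent monitoring servers.
A powerful query language that uses the data model for meaningful alerting and graphing.

At the time, however, these capabilities were scattered across separate systems.

It was not until 2012, when an engineer at SoundCloud started an incubation project, that they were brought together in one system, and Prometheus was born.

Prometheus is written in Go and has been open source from the start.
In 2016 Prometheus joined the CNCF (Cloud Native Computing Foundation) as its second hosted project, after Kubernetes.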

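What is Prometheus

Prometheus is general-purpose enough

to monitor instances at every level:

Your own applications
Third-party services
Hosts or network devices

Prometheus is also particularly well suited to dynamic cloud environments and Kubernetes-based cloud-native environments.

Note, however, that Prometheus is not a cure-all. It currently does not address the following:

Logs and traces (Prometheus handles only metrics, that is, time series)
Anomaly detection based on machine learning or AI
Horizontally scalable, clustered storage

How Prometheus works

Prometheus periodically scrapes the state of monitored components over HTTP.
The HTTP endpoint that exposes a monitored component's metrics is called an exporter.
Ready-made exporters exist for most common components,

for example HAProxy, Nginx, MySQL, and Linux system information (disk, memory, CPU, network, and so on).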
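
To make the scrape model concrete, the sketch below shows what an exporter's text output looks like and how a single value can be read from it. The sample lines only imitate node_exporter output and the values are made up; `sample_metrics` and `metric_value` are illustrative helper names, not part of Prometheus.

```shell
#!/usr/bin/env bash
# Sample of the text format an exporter returns from /metrics.
# The metric names mimic node_exporter output; the values are made up.
sample_metrics() {
  cat <<'EOF'
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 4.0e+09
EOF
}

# metric_value NAME: print the value of an unlabeled metric read from stdin.
# Lines starting with '#' are HELP/TYPE comments and are skipped.
metric_value() {
  awk -v name="$1" '$0 !~ /^#/ && $1 == name { print $2 }'
}

sample_metrics | metric_value node_load1   # prints 0.42
```

Against a real exporter the same helper could be fed from curl, for example `curl -s http://ip:9100/metrics | metric_value node_load1`, assuming curl is installed and the exporter is reachable.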

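Prometheus components

The Prometheus ecosystem consists of several components (architecture diagram).

Prometheus server

The core component. It collects data with a pull model over HTTP and stores it as time series.

Exporters/Jobs

Collect performance data from targets that do not support instrumentation natively and expose it over an HTTP endpoint for the Prometheus server to scrape.

Node-Exporter

Collects host-level metrics from each node, such as load average, CPU, memory, disk, and network usage. It must be deployed on every compute node.

Kube-State-Metrics

An exporter that provides Kubernetes resource data to Prometheus. It watches the API server and collects state metrics for resource objects in the cluster.
It also exposes metrics about itself, mainly the number of resources collected and the number of errors encountered during collection.
Note that kube-state-metrics only exposes metrics and does not store them, so Prometheus is used to scrape and store the data. Its focus is business-related metadata,

such as the state of Deployments, Pods, and replicas.

cadvisor

Monitors resource usage inside containers, such as CPU, memory, network I/O, and disk I/O.

blackbox-exporter

Monitors the liveness of service containers.

Service Discovery

Prometheus supports several service discovery mechanisms:

File
DNS
Consul
Kubernetes
OpenStack
EC2

Service discovery itself is not complicated: Prometheus queries an interface provided by a third party for the list of targets to monitor, then polls those targets for monitoring data.

Alertmanager

A standalone alerting component. After it receives alerts from the Prometheus server, it deduplicates and groups them, routes them to the appropriate receivers, and sends notifications.

Common notification channels include:

Email
WeChat
DingTalk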
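
The deduplicate-then-group idea can be illustrated with a few lines of shell. This is only a conceptual sketch and not how Alertmanager is implemented: the input format ("alertname instance" per line), the sample alerts, and the function names are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical incoming alerts, one per line: "<alertname> <instance>".
incoming_alerts() {
  cat <<'EOF'
InstanceDown 192.168.10.101:9100
InstanceDown 192.168.10.101:9100
InstanceDown 192.168.10.102:9100
HostOutOfMemory 192.168.10.101:9100
EOF
}

# group_alerts: drop exact duplicates, then emit one line per alertname
# listing the affected instances (a stand-in for one notification per group).
group_alerts() {
  sort -u | awk '
    {
      if ($1 in groups) groups[$1] = groups[$1] "," $2
      else groups[$1] = $2
    }
    END { for (name in groups) print name ": " groups[name] }
  ' | sort
}

incoming_alerts | group_alerts
```

Four incoming alerts collapse into two notifications here, one per alert name, which is the effect grouping by alertname is meant to achieve.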

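Pushgateway

A relay. The Prometheus server only pulls data, but some nodes can only push it for one reason or another; the Pushgateway accepts pushed data and exposes it for the Prometheus server to scrape.
In other words, hosts can report data from short-lived jobs to the Pushgateway, and the Prometheus server then pulls everything from the Pushgateway.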
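
A push from a short-lived job might look like the sketch below. It assumes curl is installed and that a Pushgateway is running; the deployment in this article does not install one, so the address (port 9091 is the Pushgateway default) and the job and metric names are placeholders.

```shell
#!/usr/bin/env bash
# Assumption: a Pushgateway listens at this address; this article's
# deployment does not include one, so host and port are placeholders.
PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-http://192.168.10.108:9091}"

# push_metric JOB NAME VALUE: send one sample for a short-lived job.
# The Pushgateway accepts the text exposition format on /metrics/job/<job>.
push_metric() {
  local job="$1" name="$2" value="$3"
  printf '%s %s\n' "$name" "$value" |
    curl -s --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/${job}"
}

# Example call (commented out because it needs a reachable Pushgateway):
# push_metric nightly_backup backup_duration_seconds 42
```

Prometheus would then scrape the Pushgateway like any other target, so the pushed sample outlives the job that produced it.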

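Grafana

A cross-platform, open-source metrics analytics and visualization tool. It displays collected data visually and can notify alert receivers promptly.
The official library offers a rich set of dashboards and plugins.

Prometheus features

A multidimensional data model and a flexible query language

Each metric can carry multiple labels (tags), so monitoring data can be combined along any dimension;
an HTTP query interface is provided;
data can easily be displayed with components such as Grafana.

Local storage on the server node

With the time series database built into Prometheus, write rates in the tens of millions of samples per second are claimed.
In addition, when large volumes of historical data must be retained, Prometheus can connect to third-party time series databases,

such as OpenTSDB.

An open metrics data standard

Time series data is collected with an HTTP-based pull model; only data exposed in the Prometheus metrics format can be scraped.
Pushing time series data to an intermediate gateway is also supported, which gives more flexibility across monitoring scenarios.

Target discovery through static file configuration and dynamic discovery

Data collection then happens automatically.
Prometheus already supports Kubernetes, Consul, and other service discovery mechanisms, which reduces manual configuration work for operators.

A variety of charts and dashboards

for example through Grafana.

Prometheus workflow

Configure monitoring targets: define the targets and their metrics in the Prometheus configuration file.
Pull metric data: Prometheus periodically pulls metric data from the targets and writes it to local storage.
Store metric data: Prometheus stores the data in its own time series database (TSDB) for analysis and querying.
Analyze metric data: Prometheus provides an expression language for processing and analyzing time series, for example computing normalized metrics, calculating quantiles, and handling outliers.
Query metric data: Prometheus provides an HTTP query API for running queries and retrieving results, which can then be shown in charts and dashboards.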
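
The query step can be tried from the command line. The sketch below wraps the instant-query endpoint of the HTTP API (`/api/v1/query`); it assumes curl is installed and uses the server address from the deployment later in this article. `prom_query` is an illustrative name.

```shell
#!/usr/bin/env bash
# Assumption: the Prometheus server from this article's deployment.
PROM_URL="${PROM_URL:-http://192.168.10.108:9090}"

# prom_query EXPR: run an instant query through the HTTP API.
# curl -G sends the url-encoded data as a query string on a GET request.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Example calls (commented out because they need a running server):
# prom_query 'up'
# prom_query 'node_load1{instance="192.168.10.101:9100"}'
```

The response is JSON; Grafana issues the same kind of request when it renders a panel from a Prometheus data source.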

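Introduction to Grafana

Grafana is an open-source data visualization tool written in Go.
It can be used for data monitoring and data statistics, and it includes alerting.

Features

Visualization

Fast, flexible client-side graphs with many options; panel plugins offer many different ways to visualize metrics and logs.

Alerting

Define alert rules visually for your most important metrics. Grafana evaluates them continuously and sends notifications.

Notifications

When an alert changes state, Grafana sends a notification, for example by email.

Dynamic dashboards

Create dynamic, reusable dashboards with template variables, which appear as drop-down menus at the top of the dashboard.

Mixed data sources

Mix different data sources in the same graph. The data source can be specified per query, and this works even for custom data sources.

Annotations

Annotate graphs with events from different data sources. Hovering over an event shows its full metadata and tags.

Filters

Filters let you create new key/value filters on the fly, which are applied automatically to all queries that use that data source.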

Installing Prometheus

Resource layout

IP	Node	Components
192.168.10.108	Prometheus server	Prometheus, node_exporter
192.168.10.107	Grafana server	Grafana
192.168.10.101	Agent01 server (Linux host)	node_exporter
192.168.10.102	Agent02 server (Linux host)	node_exporter
192.168.10.103	Agent03 server (Windows host)	windows_exporter

Deploying Prometheus

The Prometheus release packages can be downloaded from the official site:

https://prometheus.io/download/

Upload the Prometheus release tarball to host 108

Set the hostname

[root@localhost ~]# hostnamectl set-hostname prometheus
[root@localhost ~]# bash

Disable the firewall and SELinux

[root@prometheus ~]# systemctl stop firewalld
[root@prometheus ~]# systemctl disable firewalld
[root@prometheus ~]# setenforce 0
[root@prometheus ~]# vim /etc/sysconfig/selinux
## change to
SELINUX=disabled

Extract the archive

[root@prometheus ~]# tar zxvf prometheus-2.48.0.linux-amd64.tar.gz

Move it to the target directory

[root@prometheus ~]# mv prometheus-2.48.0.linux-amd64 /usr/local/prometheus

Create symlinks in /usr/local/bin

[root@prometheus ~]# ln -s /usr/local/prometheus/prometheus /usr/local/bin
[root@prometheus ~]# ln -s /usr/local/prometheus/promtool /usr/local/bin

Check the version

[root@prometheus ~]# prometheus --version
prometheus, version 2.48.0 (branch: HEAD, revision: 6d80b30990bc297d95b5c844e118c4011fad8054)
build user:    root@26117804242c
build date:    20231116-04:35:21
go version:    go1.21.4
platform:      linux/amd64
tags:          netgo,builtinassets,stringlabels

Register Prometheus as a systemd service

[root@prometheus ~]# vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml --web.listen-address=:9090

[Install]
WantedBy=multi-user.target

Start the service

[root@prometheus ~]# systemctl daemon-reload
[root@prometheus ~]# systemctl enable --now prometheus

Access test

Open the Prometheus home page

http://192.168.10.108:9090/

Check the status of the monitored targets

Click Status > Targets

View the detailed metrics

http://192.168.10.108:9090/metrics
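
The same check can be scripted instead of done in a browser. A minimal sketch, assuming curl is installed and the server address above: `/-/healthy` and `/-/ready` are Prometheus management endpoints, while the function names are illustrative.

```shell
#!/usr/bin/env bash
# Assumption: the Prometheus server address used in this article.
PROM_URL="${PROM_URL:-http://192.168.10.108:9090}"

# prom_status PATH: print the HTTP status code returned for PATH.
prom_status() {
  curl -s -o /dev/null -w '%{http_code}' "${PROM_URL}$1"
}

# check_prometheus: succeed only if both management endpoints answer 200.
check_prometheus() {
  [ "$(prom_status /-/healthy)" = "200" ] &&
    [ "$(prom_status /-/ready)" = "200" ]
}

# Example (commented out because it needs a running server):
# check_prometheus && echo "Prometheus is up"
```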

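Deploying node_exporter

An exporter is the metrics collection component in the Prometheus ecosystem.
It gathers data from the target jobs and converts it into the time series format that Prometheus understands.

Unlike traditional collection agents,

it only collects: it does not send data to the server but waits for the Prometheus server to scrape it.

Default scrape URL for node_exporter

http://ip:9100/metrics

Install node_exporter on the agent servers (the monitored hosts)

Upload the node_exporter release tarball to hosts 101 and 102

Set the hostnames

Host 1

[root@localhost ~]# hostnamectl set-hostname agent01
[root@localhost ~]# bash

Host 2

[root@localhost ~]# hostnamectl set-hostname agent02
[root@localhost ~]# bash

Enable synchronized input in the terminal sessions for 101 and 102, then disable the firewall and SELinux on both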

[root@agent01 ~]# systemctl stop firewalld
[root@agent01 ~]# systemctl disable firewalld
[root@agent01 ~]# setenforce 0
[root@agent01 ~]# vim /etc/sysconfig/selinux
## change to
SELINUX=disabled

Extract the archive

[root@agent01 ~]# tar zxvf node_exporter-1.7.0.linux-amd64.tar.gz

Move the files to the target directory

[root@agent01 ~]# mv node_exporter-1.7.0.linux-amd64 /usr/local/node_exporter

Register it as a systemd service

[root@agent01 ~]# vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network.target
[Service]
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target

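Start the service

[root@agent01 ~]# systemctl daemon-reload
[root@agent01 ~]# systemctl start node_exporter
[root@agent01 ~]# systemctl enable node_exporter

Check the port

[root@agent01 ~]# netstat -anpt | grep 9100
tcp6    0    0 :::9100    :::*    LISTEN 6352/node_exporter

On Windows hosts the exporter listens on port 9182

Disable synchronized input

Installing the exporter on Windows

Open a Windows system (Windows 10 is used here)

Copy or upload the required installer to the Windows 10 system

Double-click it to run; nothing else is needed

Finally, open the firewall settings in Control Panel and turn the firewall off

Adding the targets to the Prometheus server configuration

Edit the Prometheus configuration file and add the targets (host 108)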

[root@prometheus ~]# vim /usr/local/prometheus/prometheus.yml
# Edit the main configuration file in the Prometheus installation directory
scrape_configs:
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.10.108:9090"]

  - job_name: 'agent'
    static_configs:
      - targets: ['192.168.10.101:9100','192.168.10.102:9100','192.168.10.103:9182']

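On Linux hosts the exporter listens on TCP port 9100.
On Windows hosts the exporter listens on TCP port 9182.

Restart the Prometheus service

[root@prometheus ~]# systemctl restart prometheus.service

After the restart, refresh 192.168.10.108:9090 in the browser and open Status > Targets in the Prometheus web UI to check whether the targets were added successfully.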
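
The target list is also available from the HTTP API at `/api/v1/targets`. The sketch below counts targets by health state with plain grep, so it works without jq. The sample response is trimmed and made up, and the helper assumes the compact JSON form `"health":"up"`; `count_health` is an illustrative name.

```shell
#!/usr/bin/env bash
# count_health STATE: count targets whose "health" field equals STATE in
# the JSON read from stdin.
count_health() {
  grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}

# Trimmed, made-up response in the shape returned by /api/v1/targets.
sample_targets='{"status":"success","data":{"activeTargets":[
{"labels":{"instance":"192.168.10.101:9100","job":"agent"},"health":"up"},
{"labels":{"instance":"192.168.10.102:9100","job":"agent"},"health":"up"},
{"labels":{"instance":"192.168.10.103:9182","job":"agent"},"health":"down"}]}}'

printf '%s\n' "$sample_targets" | count_health up     # prints 2
printf '%s\n' "$sample_targets" | count_health down   # prints 1
```

With a running server, the input could come from `curl -s http://192.168.10.108:9090/api/v1/targets` instead of the sample string.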

Deploying and using Grafana

Deploying Grafana

(Use a new host, 107)

Upload the Grafana RPM to the host

Set the hostname

[root@localhost ~]# hostnamectl set-hostname grafana
[root@localhost ~]# bash

Install Grafana

[root@grafana ~]# yum -y install grafana-enterprise-10.2.2-1.x86_64.rpm

Start the service

[root@grafana ~]# systemctl start grafana-server
[root@grafana ~]# systemctl enable grafana-server

Check the status

[root@grafana ~]# systemctl status grafana-server

Disable the firewall and SELinux

[root@grafana ~]# systemctl stop firewalld
[root@grafana ~]# systemctl disable firewalld
[root@grafana ~]# setenforce 0
[root@grafana ~]# vim /etc/sysconfig/selinux
## change to
SELINUX=disabled

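Open Grafana in a browser

http://192.168.10.107:3000/login
The default port is 3000

The initial username and password are both admin

Configure the data source

On first login Grafana asks you to set a new password. Once inside, click DATA SOURCES to add a data source.

Select Prometheus as the data source

Fill in the connection details

Click the Save & test button at the bottom of the page

View the added data source

Click "Connections" --> "Data sources" to see the data sources that have been added

Adding a Grafana dashboard by importing a template

On the Home page, click to add a dashboard

Choose the import option

Specify the template ID

Template IDs are available from the Grafana website, which offers a large collection of templates that reduce the management effort for users.

This example uses template ID 12633. Enter the ID and click the "Load" button

https://grafana.com/grafana/dashboards/

12633: a template for monitoring Linux nodes

14694: a template for monitoring Windows nodes

You can also create dashboards from other suitable templates

Select the data source

Choose the matching data source from the drop-down menu, then click "Import".

View the dashboard after the import

Building a custom dashboard

On the Home page, click to add a dashboard

Choose "Add visualization" to add a dashboard

With this method you lay out the dashboard panels yourself and set the monitoring parameters yourself.

Select the Prometheus data source

Set the query

Metric: the metric to monitor
Label filters: the filter rules, which determine which host is monitored
Instance: select by the IP address and port of the monitored host
Job: select by the job name defined in prometheus.yml (for example agent)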

Prometheus alerting

Installing the Alertmanager component

Upload the alertmanager-0.26.0.linux release tarball to host 108

Extract the archive

[root@prometheus ~]# tar zxvf alertmanager-0.26.0.linux-amd64.tar.gz

Move the files to the target directory

[root@prometheus ~]# mv alertmanager-0.26.0.linux-amd64 /usr/local/alertmanager

Add the alertmanager service

[root@prometheus ~]# vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager project
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/alertmanager/alertmanager \
  --config.file=/usr/local/alertmanager/alertmanager.yml \
  --storage.path=/usr/local/alertmanager --web.listen-address=0.0.0.0:9093
Restart=on-failure

[Install]
WantedBy=multi-user.target

Start the service

[root@prometheus ~]# systemctl daemon-reload
[root@prometheus ~]# systemctl start alertmanager
[root@prometheus ~]# systemctl enable alertmanager

Option note: web.listen-address is the address and port on which Alertmanager serves its web UI and API; Prometheus sends alerts to it.

Accessing the Alertmanager web page

The Alertmanager web UI uses port 9093

Login URL

http://192.168.10.108:9093

Adding Alertmanager to Prometheus

[root@prometheus ~]# vim /usr/local/prometheus/prometheus.yml
# Set targets to the Alertmanager address (here Alertmanager runs on the Prometheus host)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 192.168.10.108:9093

Check the syntax of the Prometheus configuration file

[root@prometheus ~]# promtool check config /usr/local/prometheus/prometheus.yml
Checking /usr/local/prometheus/prometheus.yml
  SUCCESS: 1 rule files found
 SUCCESS: /usr/local/prometheus/prometheus.yml is valid prometheus config file syntax
Checking /usr/local/prometheus/rules/hoststats-alert.rules
  SUCCESS: 3 rules found

Restart Prometheus

[root@prometheus ~]# systemctl restart prometheus

Check the port

[root@prometheus ~]# netstat -anpt | grep 9090
tcp6    0    0 :::9090    :::*        LISTEN        13275/prometheus
tcp6    0    0 ::1:43604  ::1:9090    TIME_WAIT     -

Add the email notification channel

[root@prometheus ~]# vim /usr/local/alertmanager/alertmanager.yml
global:
  # How long to wait without updates before an alert that was "firing" is declared "resolved"
  resolve_timeout: 5m
  # QQ Mail SMTP server and port
  smtp_smarthost: 'smtp.qq.com:25'
  # Sender address
  smtp_from: '406720950@qq.com'
  # SMTP account (mailbox address)
  smtp_auth_username: '406720950@qq.com'
  # Mailbox authorization code
  smtp_auth_password: 'pcmibkzesjqqcaha'
  smtp_hello: 'qq.com'
  # Do not require TLS
  smtp_require_tls: false

# Routing
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10s
  receiver: 'email'

# Recipients
receivers:
- name: 'email'
  email_configs:
  - to: '406720950@qq.com'
    send_resolved: true

# Inhibition rules; remove this block if you do not need it
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

In this file, adapt at least smtp_smarthost, smtp_from, smtp_auth_username, smtp_auth_password, and the receiver's to address to your own mail account.

Restart the service

[root@prometheus ~]# systemctl restart alertmanager

Check the status

[root@prometheus ~]# netstat -tunlp | grep alert
tcp6    0    0 :::9093    :::*    LISTEN    13222/alertmanager
tcp6    0    0 :::9094    :::*    LISTEN    13222/alertmanager
udp6    0    0 :::9094    :::*              13222/alertmanager

The alertmanager process serves its web UI and API on TCP port 9093; port 9094 (TCP and UDP) is used for cluster communication.
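
Before any Prometheus rule exists, the email route can be exercised by posting a hand-made alert to Alertmanager. A minimal sketch, assuming curl is installed and Alertmanager runs at the address above; `/api/v2/alerts` is the Alertmanager v2 API endpoint, while the alert name, the summary text, and the function names are made up for the test.

```shell
#!/usr/bin/env bash
# Assumption: the Alertmanager address used in this article.
ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://192.168.10.108:9093}"

# test_alert_payload NAME SEVERITY: build the JSON body for one alert.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"manual test alert"}}]' "$1" "$2"
}

# send_test_alert NAME SEVERITY: post the alert to the v2 API.
send_test_alert() {
  test_alert_payload "$1" "$2" |
    curl -s -X POST -H 'Content-Type: application/json' \
      --data-binary @- "${ALERTMANAGER_URL}/api/v2/alerts"
}

# Show the request body without sending anything.
test_alert_payload ManualTest warning; echo

# Example (commented out because it needs a running Alertmanager):
# send_test_alert ManualTest warning
```

If the mail settings are correct, the test alert should appear in the web UI and arrive by email after the configured group_wait.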

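Configure Prometheus and add alert rules

Each alert rule consists of five parts:

Name (alert)
Condition (expr): a PromQL expression, for example CPU usage above 58%. Until the condition is met, the alert is in the inactive state
Duration (for): for example CPU usage staying above 58% for 30 seconds. Within those 30 seconds the alert is pending; once the 30 seconds have passed it becomes firing
Labels (labels): attached to the alert so that specific alerts can be identified and routed by label
Annotations (annotations): descriptive text that explains the circumstances at the time of the alert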
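
The three alert states can be pictured as a small decision function. This is a simplified sketch of the behaviour described above, not Prometheus code; in practice the state is re-evaluated at each rule evaluation interval, and `alert_state` with its arguments is an illustrative construction.

```shell
#!/usr/bin/env bash
# alert_state EXPR_TRUE SECONDS_TRUE FOR_SECONDS
#   EXPR_TRUE    1 if the PromQL condition currently holds, 0 otherwise
#   SECONDS_TRUE how long it has held without interruption
#   FOR_SECONDS  the rule's "for" duration
alert_state() {
  if [ "$1" -ne 1 ]; then
    echo inactive
  elif [ "$2" -lt "$3" ]; then
    echo pending
  else
    echo firing
  fi
}

alert_state 0 0 30    # condition not met: inactive
alert_state 1 10 30   # met for 10s of the required 30s: pending
alert_state 1 45 30   # met for longer than 30s: firing
```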

Add a directory for alert rules to Prometheus

[root@prometheus ~]# mkdir /usr/local/prometheus/rules

Edit the configuration file

[root@prometheus ~]# vim /usr/local/prometheus/prometheus.yml
rule_files:
  - /usr/local/prometheus/rules/*.rules

Add the alert rules

[root@prometheus ~]# vim /usr/local/prometheus/rules/hoststats-alert.rules
groups:
- name: node1_alerts
  rules:
  - alert: HighNodeCpu
    expr: instance:node_cpu:avg_rate1m > 10
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: High Node CPU for 1 minute
      console: This is a Test

  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: InstanceDown
    expr: up == 0
    for: 10s
    labels:
      severity: critical
    annotations:
      summary: Host {{ $labels.instance }} of {{ $labels.job }} is Down!

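The rule file adds three rules:

The first checks CPU load (its expression uses a recording rule named instance:node_cpu:avg_rate1m, which is not defined in this article and must exist for the alert to fire)
The second checks memory usage
The third detects hosts that are down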
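
The arithmetic behind the HostOutOfMemory expression can be checked by hand. The sketch below repeats the same calculation (available / total * 100, compared with 10) on made-up numbers; the function names and the sample host are assumptions for illustration.

```shell
#!/usr/bin/env bash
# available_percent AVAILABLE_BYTES TOTAL_BYTES: the same arithmetic as
# the HostOutOfMemory expression (available / total * 100).
available_percent() {
  awk -v a="$1" -v t="$2" 'BEGIN { printf "%.1f\n", a / t * 100 }'
}

# memory_alert_condition AVAILABLE TOTAL: succeed when under 10% is left.
memory_alert_condition() {
  awk -v a="$1" -v t="$2" 'BEGIN { exit !(a / t * 100 < 10) }'
}

# Made-up host with 4 GiB of RAM and 300 MiB available.
total=$((4 * 1024 * 1024 * 1024))
available=$((300 * 1024 * 1024))

available_percent "$available" "$total"   # prints 7.3
if memory_alert_condition "$available" "$total"; then
  echo "condition true: the alert would become pending"
fi
```

With 7.3% available the condition holds, so the alert would stay pending for the 2 minutes set by for and then fire.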

Restart Prometheus

[root@prometheus rules]# systemctl restart prometheus

View the Prometheus alerts page

Shut down one of the agent hosts and watch the alert appear in the web UI

Check the mailbox for the alert email

Start the host again
