Contents
- 1. Install Alertmanager from the binary release
- 2. Configure Alertmanager
- 3. Connect Alertmanager to Prometheus
- 4. Create the alert rule file (on the Prometheus server)
- 5. Test alerting
1. Install Alertmanager from the binary release
- Download the release tarball
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
- Extract the tarball into the /data directory
[root@alert ~]# mkdir /data
[root@alert ~]# tar xf alertmanager-0.25.0.linux-amd64.tar.gz -C /data/
[root@alert ~]# cd /data/
[root@alert data]# ln -s alertmanager-0.25.0.linux-amd64/ alertmanager
- Register Alertmanager as a systemd service
vi /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alertmanager Service daemon
After=network.target
[Service]
Type=simple
User=root
Group=root
ExecStart=/data/alertmanager/alertmanager \
  --config.file=/data/alertmanager/alertmanager.yml \
  --storage.path=/data/alertmanager/data/ \
  --data.retention=120h \
  --web.external-url=http://192.168.10.5:9093 \
  --web.listen-address=:9093
Restart=on-failure
[Install]
WantedBy=multi-user.target
[root@alert system]# systemctl daemon-reload
[root@alert system]# systemctl start alertmanager
[root@alert system]# systemctl enable alertmanager.service
- The status page can be opened in a browser:
http://IP:9093/#/status
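The same check can be scripted from the command line. The sketch below assumes Alertmanager is reachable on localhost:9093 and queries its /-/healthy and /-/ready endpoints; the AM_URL variable and the check_alertmanager function name are illustrative and not part of the original setup.

```shell
# Assumption: Alertmanager listens on localhost:9093; override AM_URL otherwise.
AM_URL="${AM_URL:-http://localhost:9093}"

# Returns 0 only if both the health and readiness endpoints answer successfully.
# -f: fail on HTTP errors, -sS: quiet but still print errors, --max-time: do not hang.
check_alertmanager() {
  curl -fsS --max-time 5 "$1/-/healthy" &&
    curl -fsS --max-time 5 "$1/-/ready"
}

check_alertmanager "$AM_URL" || echo "Alertmanager is not reachable at $AM_URL" >&2
```

If the check fails, `systemctl status alertmanager` and `journalctl -u alertmanager` are the usual places to look for the reason.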
2. Configure Alertmanager
- Edit the Alertmanager configuration file
vim /data/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: 'your-mailbox@163.com'
  smtp_auth_username: 'your-mailbox@163.com'
  # replace with your own SMTP password or authorization code
  smtp_auth_password: 'PLAPPSJXJCQABYAF'
  smtp_require_tls: false
templates:
  - 'template/*.tmpl'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 20m
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'recipient@qq.com'
        html: '{{ template "test.html" . }}'
        send_resolved: true
Note:
The name test.html used above must match the name in the define statement of the template below.
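- Create the alert template (it must live in the template directory so that the 'template/*.tmpl' pattern in the configuration finds it)
[root@alert ~]# cd /data/alertmanager
[root@alert alertmanager]# mkdir template
[root@alert alertmanager]# vim template/test.tmpl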
{{ define "test.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
========= ERROR ==========<br>
Alert name: {{ .Labels.alertname }}<br>
Severity: {{ .Labels.severity }}<br>
Instance: {{ .Labels.instance }} {{ .Labels.device }}<br>
Details: {{ .Annotations.summary }}<br>
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
========= END ==========<br>
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
========= INFO ==========<br>
Alert name: {{ .Labels.alertname }}<br>
Severity: {{ .Labels.severity }}<br>
Instance: {{ .Labels.instance }}<br>
Details: {{ .Annotations.summary }}<br>
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
========= END ==========<br>
{{- end }}
{{- end }}
{{- end }}
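Before restarting, the configuration and the template it references can be validated with amtool, which ships in the same release tarball as Alertmanager. This is a minimal sketch assuming the paths used above; the AMTOOL and AM_CONFIG variable names are illustrative.

```shell
# Assumption: amtool sits next to the alertmanager binary under /data/alertmanager.
AMTOOL="${AMTOOL:-/data/alertmanager/amtool}"
AM_CONFIG="${AM_CONFIG:-/data/alertmanager/alertmanager.yml}"

if [ -x "$AMTOOL" ]; then
  # check-config parses the YAML and reports the receivers and templates it found
  "$AMTOOL" check-config "$AM_CONFIG"
else
  echo "amtool not found at $AMTOOL" >&2
fi
```

A reported template count of zero would suggest that the 'template/*.tmpl' pattern does not match the location of test.tmpl.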
[root@alert ~]# systemctl restart alertmanager
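The constant 28800e9 in the template is a duration in nanoseconds. It shifts the timestamps, which are typically reported in UTC, forward by eight hours (UTC+8). The arithmetic is sketched below so the value can be adapted to another offset; the variable names are illustrative.

```shell
# Hours to add to UTC; 8 corresponds to the UTC+8 offset used in the template.
offset_hours=8

# hours -> seconds -> nanoseconds
offset_ns=$(( offset_hours * 3600 * 1000000000 ))

echo "$offset_ns"   # prints 28800000000000, i.e. 28800e9
```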
3. Connect Alertmanager to Prometheus
- Edit the Prometheus configuration file prometheus.yml
Add the following entries at the appropriate places in the file
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.10.5:9093
rule_files:
  - "/data/prometheus/rules/*_rules.yml"
scrape_configs:
  # append this job to the existing scrape_configs list
  - job_name: "alertmanager"
    static_configs:
      - targets: ["192.168.10.5:9093"]
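After editing, the file can be syntax-checked with promtool, which ships with Prometheus. This sketch assumes Prometheus is installed under /data/prometheus; the original only shows the rules directory at that location, so both paths may need adjusting. The variable names are illustrative.

```shell
# Assumption: promtool and prometheus.yml both live under /data/prometheus.
PROMTOOL="${PROMTOOL:-/data/prometheus/promtool}"
PROM_CONFIG="${PROM_CONFIG:-/data/prometheus/prometheus.yml}"

if [ -x "$PROMTOOL" ]; then
  # Validates the configuration file and any rule files it references
  "$PROMTOOL" check config "$PROM_CONFIG"
else
  echo "promtool not found at $PROMTOOL" >&2
fi
```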
4. Create the alert rule file (on the Prometheus server)
- Create the rules directory
mkdir -p /data/prometheus/rules
- Create an alert rule, using node as the example
vim /data/prometheus/rules/node_rules.yml
groups:
  - name: node_status
    rules:
      - alert: status
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Node has been down for more than 1 minute"
          summary: "Node Down"
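The expression up == 0 matches every scrape target whose most recent scrape failed, and for: 1m requires that state to persist for one minute before the alert fires. The sketch below first validates the rule file with promtool and then previews which targets currently match the expression through the Prometheus HTTP API. It assumes promtool lives under /data/prometheus and that Prometheus listens on localhost:9090; neither is stated in the original, and the variable and function names are illustrative.

```shell
# Assumptions: promtool location and Prometheus address; adjust as needed.
PROMTOOL="${PROMTOOL:-/data/prometheus/promtool}"
RULE_FILE="${RULE_FILE:-/data/prometheus/rules/node_rules.yml}"
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Validate the rule file syntax and expressions without touching the server.
if [ -x "$PROMTOOL" ]; then
  "$PROMTOOL" check rules "$RULE_FILE"
else
  echo "promtool not found at $PROMTOOL" >&2
fi

# Run the rule expression as an instant query; an empty result means no target is down.
preview_rule_expr() {
  curl -fsS --max-time 5 -G --data-urlencode 'query=up == 0' "$1/api/v1/query"
}

preview_rule_expr "$PROM_URL" || echo "Prometheus is not reachable at $PROM_URL" >&2
```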
5. Test alerting
- Restart the Prometheus service, then shut down one node to trigger the alert
[root@localhost rules]# systemctl restart prometheus.service
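- After the rule has been firing for one minute, an email reporting the node as down should arrive
- Bring the node back up; a recovery email should follow
- This completes a simple email alerting setup based on Alertmanager.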
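The email path can also be exercised without shutting down a node by posting a synthetic alert directly to the Alertmanager v2 API. This is a sketch assuming the Alertmanager address used above; the alert name ManualTest, the label values, and the build_test_alert function are made up for illustration.

```shell
# Assumption: Alertmanager address from the configuration above.
AM_URL="${AM_URL:-http://192.168.10.5:9093}"

# Emits a one-element JSON array in the shape expected by POST /api/v2/alerts.
# $1 = alert name, $2 = summary; the values must not contain double quotes.
build_test_alert() {
  cat <<EOF
[{"labels":{"alertname":"$1","severity":"critical","instance":"test-instance"},
  "annotations":{"summary":"$2"}}]
EOF
}

build_test_alert "ManualTest" "Manual test alert" |
  curl -fsS --max-time 5 -X POST -H 'Content-Type: application/json' \
    --data @- "$AM_URL/api/v2/alerts" ||
  echo "could not post the test alert to $AM_URL" >&2
```

With the routing shown earlier, the notification should go out after the 30s group_wait. Because the alert is not re-sent, it should be treated as resolved once resolve_timeout has elapsed, which in turn should produce the recovery email.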