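The server at 10.20.13.8 monitors the servers at 10.20.13.110 and 10.20.13.111.
The monitored servers (110 and 111) only need node-exporter and nvidia-container-toolkit installed.
Pull the images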
docker pull prom/node-exporter
docker pull prom/prometheus
docker pull grafana/grafana
Create the Prometheus config directory
mkdir /opt/prometheus
cd /opt/prometheus/
vim prometheus.yml
global:
  scrape_interval: 60s
  evaluation_interval: 60s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: prometheus
  - job_name: linux
    static_configs:
      - targets: ['10.20.13.8:9100']
        labels:
          instance: master
  - job_name: node
    static_configs:
      - targets: ['10.20.13.111:9100','10.20.13.110:9100']
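Before starting Prometheus it can help to confirm that every target in the file is reachable. The following is a minimal sketch, not part of the original notes: list_targets is a hypothetical helper that pulls host:port pairs out of the config with plain text matching (it is not a YAML parser), and the loop in the comments assumes curl is installed on 10.20.13.8.

```shell
#!/bin/sh
# list_targets FILE: print every host:port target found in a Prometheus
# config, one per line. Hypothetical helper; matches IPv4:port and
# localhost:port by pattern only.
list_targets() {
    grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+|localhost:[0-9]+' "$1"
}

# Usage (on 10.20.13.8, once the exporters are running):
#   for t in $(list_targets /opt/prometheus/prometheus.yml); do
#       curl -fsS -o /dev/null "http://$t/metrics" && echo "ok   $t" || echo "FAIL $t"
#   done
```

A target that prints FAIL here will also show as down on the Prometheus targets page, so this is only a quicker way to spot a wrong IP or a closed port.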
Start Prometheus
docker run -d \
-p 9090:9090 \
-v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
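Start node-exporter (on every server scraped on port 9100: 10.20.13.8, 10.20.13.110 and 10.20.13.111)
# The --path.* flags point node-exporter at the host mounts; without them it reads the container's own /proc and /sys.
docker run -d -p 9100:9100 \
-v "/proc:/host/proc:ro" \
-v "/sys:/host/sys:ro" \
-v "/:/rootfs:ro" \
prom/node-exporter \
--path.procfs=/host/proc \
--path.sysfs=/host/sys \
--path.rootfs=/rootfs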
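To check that an exporter is actually serving data, one option is to fetch its /metrics page and read a single value. The sketch below is an illustration only: metric_value is a hypothetical helper, and it assumes the samples carry no trailing timestamp (node-exporter does not emit one), so the last field on each line is the value.

```shell
#!/bin/sh
# metric_value NAME: read Prometheus text-format metrics on stdin and print
# the value of every sample whose metric name is exactly NAME.
# Hypothetical helper; assumes the value is the last field on the line.
metric_value() {
    awk -v name="$1" '
        /^#/ { next }                       # skip HELP/TYPE comment lines
        {
            metric = $1
            i = index(metric, "{")          # drop the {label="..."} part, if any
            if (i > 0) metric = substr(metric, 1, i - 1)
            if (metric == name) print $NF
        }'
}

# Usage:
#   curl -s http://10.20.13.110:9100/metrics | metric_value node_load1
```

An empty result means either the exporter is not reachable or the metric name does not exist on that host.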
Create the Grafana data directory
mkdir /opt/grafana-storage
chmod 777 -R /opt/grafana-storage
Start Grafana
docker run -d \
-p 3000:3000 \
--name=grafana \
-v /opt/grafana-storage:/var/lib/grafana \
grafana/grafana
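Open the Grafana URL
10.20.13.8:3000
You are first redirected to the login page; the default username and password are both admin.
When adding the data source, enter the host's IP address rather than localhost, because inside the Grafana container localhost refers to the container itself: http://ip:9090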
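As an alternative to adding the data source by hand, Grafana can load it from a provisioning file at startup. This is a rough sketch under assumptions that are not in the original notes: write_datasource is a hypothetical helper, /opt/grafana-provisioning/datasources is a made-up host path, and the Grafana container would need one extra -v mount (shown in the comments) for the file to be picked up.

```shell
#!/bin/sh
# write_datasource DIR URL: write a Grafana data source provisioning file
# into DIR that points at the Prometheus server at URL.
# Hypothetical helper; the data source name "Prometheus" is an arbitrary choice.
write_datasource() {
    mkdir -p "$1" || return 1
    cat > "$1/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $2
    isDefault: true
EOF
}

# Usage (host path is an assumption):
#   write_datasource /opt/grafana-provisioning/datasources http://10.20.13.8:9090
# then add this mount to the Grafana docker run command:
#   -v /opt/grafana-provisioning/datasources:/etc/grafana/provisioning/datasources
```

The URL follows the same rule as the manual step: it must be the host's IP, not localhost.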
Install GPU monitoring (on the monitored servers 110 and 111)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt update
apt upgrade
apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
Run the dcgm-exporter container
docker run -d --restart always --gpus all -p 9400:9400 --name gpu-exporter nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04
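Add the exporter port to the config file (on 10.20.13.8)
vim /opt/prometheus/prometheus.yml
Add this section under scrape_configs, then restart the Prometheus container so the change is loaded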
  - job_name: gpu_metrics
    static_configs:
      - targets: ['10.20.13.111:9400','10.20.13.110:9400']
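The same edit can be scripted so that it is safe to run more than once. The sketch below is only an illustration: add_gpu_job is a hypothetical helper, and it assumes scrape_configs is the last top-level key in the file, that the existing jobs are indented two spaces under it, and that the file ends with a newline.

```shell
#!/bin/sh
# add_gpu_job FILE TARGET...: append a gpu_metrics scrape job to FILE unless
# a job with that name is already present.
# Hypothetical helper; relies on scrape_configs being the last section.
add_gpu_job() {
    file=$1
    shift
    if grep -q 'job_name: gpu_metrics' "$file"; then
        return 0                            # already configured, nothing to do
    fi
    targets=""
    for t in "$@"; do
        targets="${targets:+$targets,}'$t'" # build 'a:9400','b:9400'
    done
    cat >> "$file" <<EOF
  - job_name: gpu_metrics
    static_configs:
      - targets: [$targets]
EOF
}

# Usage:
#   add_gpu_job /opt/prometheus/prometheus.yml 10.20.13.111:9400 10.20.13.110:9400
```

Prometheus still has to be restarted afterwards, since it reads the file at startup.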
In Grafana, import the GPU monitoring dashboard template, ID 12239