7.1 Monitoring System Deployment and Management
7.2 Kubernetes Cluster-Level Monitoring
Preparation: deploy a Kubernetes cluster
master:192.168.192.128
node01:192.168.192.129
node02:192.168.192.130
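1. Prometheus architecture:
How Prometheus works:
1. Data collection (Exporters): Prometheus periodically pulls data from its targets over HTTP. A target can be an application, a system, a service, or any other resource.
2. Data storage (Storage): Prometheus stores the collected data in its local storage engine. The engine stores data as time series, and each time series is identified by a metric name and a set of key-value pairs (labels).
3. Data aggregation (PromQL): Prometheus aggregates data through query expressions. PromQL is the Prometheus query language; it lets users retrieve specific metric information from the storage engine.
4. Alert handling (Alertmanager): Prometheus evaluates user-defined rules against the data. When a metric crosses a given threshold, Prometheus sends an alert to Alertmanager. Alertmanager groups, silences, and routes alerts, and delivers them to the appropriate receivers, such as email, WeCom, or DingTalk.
5. Dashboards (Grafana): Grafana presents Prometheus data visually, through dashboards, charts, logs, and alerts.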
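To make the storage model in step 2 concrete: in the Prometheus text exposition format, one sample is a single line holding the metric name, the label set, and the value. The sketch below splits such a line with plain shell parameter expansion. The sample line and its label values are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample line in the Prometheus text exposition format.
sample='node_filesystem_avail_bytes{instance="node01",mountpoint="/"} 15000000000'

# Metric name: everything before the opening brace.
metric_name="${sample%%\{*}"
# Label set: the text between the braces.
labels="${sample#*\{}"
labels="${labels%%\}*}"
# Value: the last whitespace-separated field.
value="${sample##* }"

echo "name=$metric_name"
echo "labels=$labels"
echo "value=$value"
```

The metric name together with the label set is what identifies one time series; the value is one data point in it.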
Prometheus deployment:
1. Create the namespace
kubectl create namespace monitor
2. Create the RBAC rules
The RBAC rules consist of three kinds of YAML resources: ServiceAccount, ClusterRole, and ClusterRoleBinding.
vim prometheus_rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources: ["nodes","nodes/proxy","services","endpoints","pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  # Note: this binds the built-in cluster-admin role, so the prometheus ClusterRole above is not what grants access.
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitor
Run: kubectl apply -f prometheus_rbac.yaml
Verify:
3. Create the ConfigMap that holds the Prometheus configuration file
vim prometheus_cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitor
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: "kubernetes"
    ############ scrape jobs ###################
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets: ['127.0.0.1:9090']
        labels:
          instance: prometheus
    ############ location of the alerting rule files ###################
    rule_files:
    - /etc/prometheus/rules/*.rules
kubectl apply -f prometheus_cm.yaml
Verify:
4. Create the ConfigMap that holds the Prometheus rules files
It contains general.rules and node.rules.
vim prometheus_rules_cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitor
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: |
          up{job=~"k8s-nodes|prometheus"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} (hostname: {{ $labels.hostname }}) has been down for more than 1 minute."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: |
          100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: usage of {{ $labels.mountpoint }} is too high"
          description: "{{ $labels.instance }} (hostname: {{ $labels.hostname }}): usage of {{ $labels.mountpoint }} is above 85% (current value: {{ $value }})"
kubectl apply -f prometheus_rules_cm.yaml
Verify:
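The NodeFilesystemUsage expression is plain arithmetic on two node_exporter gauges. As a rough sketch, the same formula can be evaluated with awk; the byte counts below are made up (a 100 GiB filesystem with 12 GiB available) and only illustrate how the percentage relates to the 85% threshold.

```shell
#!/usr/bin/env bash
# Hypothetical values for node_filesystem_size_bytes and node_filesystem_avail_bytes.
size_bytes=$((100 * 1024 * 1024 * 1024))
avail_bytes=$((12 * 1024 * 1024 * 1024))

# Same formula as the rule: 100 - (avail / size) * 100
usage=$(awk -v a="$avail_bytes" -v s="$size_bytes" 'BEGIN { printf "%.0f", 100 - (a / s) * 100 }')

if [ "$usage" -gt 85 ]; then
  state="alert condition met"
else
  state="ok"
fi
echo "usage=${usage}% -> $state"
```

In Prometheus the condition additionally has to hold for the whole "for: 1m" window before the alert fires.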
5. Create the Prometheus Service to expose Prometheus:
vim prometheus_svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor
  labels:
    k8s-app: prometheus
spec:
  type: ClusterIP
  ports:
  - name: http
    port: 9090
    targetPort: 9090
  selector:
    k8s-app: prometheus
Run: kubectl apply -f prometheus_svc.yaml
6. Create the Prometheus Deployment
Prometheus needs to persist its data so that history can be recovered after a restart. Before creating the Deployment, therefore, first deploy an NFS service to provide persistent storage.
Deploy the NFS server (IP: 192.168.192.131):
1. Install nfs-utils and rpcbind
yum install -y nfs-utils
yum install -y rpcbind
2. Start the services and enable them at boot (start rpcbind first, then nfs-server):
systemctl start rpcbind
systemctl enable rpcbind
systemctl start nfs-server
systemctl enable nfs-server
3. Configure the shared directory:
Create the shared directory first, then add the export to /etc/exports.
mkdir -p /data/nfs01
vim /etc/exports
/data/nfs01 192.168.192.0/24(rw,no_root_squash,no_all_squash,sync)
Reload the NFS service so that the configuration takes effect:
systemctl reload nfs-server
NFS client configuration (on all three Kubernetes nodes):
1. Install the nfs-utils client
yum install -y nfs-utils
2. Use showmount to list the exports of the NFS server
showmount -e 192.168.192.131
The command failed. The cause was that the firewall on the NFS server had not been turned off; stopping the firewall fixes it:
systemctl stop firewalld
systemctl disable firewalld
Run showmount again:
Check whether the storage class exists: kubectl get storageclass
If no storage class named "nfs-storage" exists, create one. Because the PVC uses NFS storage, create an NFS-type storage class.
To use an NFS StorageClass, an NFS provisioner must also be installed; the provisioner defines the NFS details (server IP, shared directory, and so on).
GitHub: https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner
Install git: yum install git -y
Download the source: git clone https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner
cd nfs-subdir-external-provisioner/deploy
sed -i 's/namespace: default/namespace: monitor/' rbac.yaml ## change the namespace to monitor
kubectl apply -f rbac.yaml ## create the RBAC authorization
vim prometheus_sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-storage
provisioner: nfs-provisioner
parameters:
  archiveOnDelete: "false"
Run: kubectl apply -f prometheus_sc.yaml
Verify:
Create an NFS-type PV that points to the shared directory on the NFS server
vim prometheus_pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-prometheus-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  nfs:
    # the export and server configured in the NFS steps above
    path: /data/nfs01
    server: 192.168.192.131
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs-storage
kubectl apply -f prometheus_pv.yaml
Verify:
Create the PVC. Its storageClassName is nfs-storage, which matches the PV defined above:
vim prometheus_pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data-pvc
  namespace: monitor
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: "nfs-storage"
  resources:
    requests:
      storage: 10Gi
Run: kubectl apply -f prometheus_pvc.yaml
Verify:
Create the Prometheus Deployment manifest:
vim prometheus_controller.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitor
  labels:
    k8s-app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:v2.36.0
        imagePullPolicy: IfNotPresent
        ports:
        - name: http
          containerPort: 9090
        securityContext:
          runAsUser: 65534
          privileged: true
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--web.enable-lifecycle"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention.time=10d"
        - "--web.console.libraries=/etc/prometheus/console_libraries"
        - "--web.console.templates=/etc/prometheus/consoles"
        resources:
          limits:
            cpu: 2000m
            memory: 2048Mi
          requests:
            cpu: 1000m
            memory: 512Mi
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 5
          timeoutSeconds: 10
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 30
        volumeMounts:
        - name: data
          mountPath: /prometheus
          subPath: prometheus
        - name: config
          mountPath: /etc/prometheus
        - name: prometheus-rules
          mountPath: /etc/prometheus/rules
      - name: configmap-reload
        image: jimmidyson/configmap-reload:v0.5.0
        imagePullPolicy: IfNotPresent
        args:
        - "--volume-dir=/etc/config"
        - "--webhook-url=http://localhost:9090/-/reload"
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 10m
            memory: 10Mi
        volumeMounts:
        - name: config
          mountPath: /etc/config
          readOnly: true
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: prometheus-data-pvc
      - name: prometheus-rules
        configMap:
          name: prometheus-rules
      - name: config
        configMap:
          name: prometheus-config
Run: kubectl apply -f prometheus_controller.yaml
Verify:
Problem: the pod is in the CrashLoopBackOff state
kubectl describe pod:
Check the pod logs:
The check showed that the cause was missing write permission on the NFS directory
Log in to the NFS server:
Change to the /data/nfs01 directory:
After the directory permissions were changed, the pod and the Deployment returned to a normal state
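The exact permission change was not recorded in these notes. One possible fix, sketched here purely as an assumption, is to make the exported directory writable for the non-root UID the container runs as (runAsUser: 65534 in the Deployment). The helper name is hypothetical, and the demo runs against a temporary directory; on the NFS server the argument would be /data/nfs01.

```shell
#!/usr/bin/env bash
# Hypothetical helper: open up a directory so that a non-root UID
# (such as 65534 from the Deployment) can write to it.
make_share_writable() {
  local dir="$1"
  chmod 777 "$dir"
}

demo_dir=$(mktemp -d)        # stand-in for /data/nfs01
make_share_writable "$demo_dir"
mode=$(stat -c '%a' "$demo_dir")
echo "mode of $demo_dir is now $mode"
```

A tighter alternative, if root access on the server is available, would be chown -R 65534:65534 /data/nfs01 while leaving the mode unchanged.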
Create an Ingress so that Prometheus can be reached from outside through a domain name
vim prometheus_ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: monitor
  name: prometheus-ingress
spec:
  ingressClassName: nginx
  rules:
  - host: prometheus.kubernets.cn
    http:
      paths:
      - pathType: Prefix
        backend:
          service:
            name: prometheus
            port:
              number: 9090
        path: /
Because no real domain name is available, the host name was mapped in the hosts file
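The hosts entry itself is not shown in these notes. A minimal sketch of adding it idempotently follows. The IP 192.168.192.128 (the master node) is an assumption; in practice it should be whichever node the ingress controller listens on. The helper name is hypothetical, and the demo writes to a temporary file instead of /etc/hosts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append "IP hostname" to a hosts file unless the
# hostname is already present.
add_hosts_entry() {
  local file="$1" ip="$2" host="$3"
  grep -qwF -- "$host" "$file" || echo "$ip $host" >> "$file"
}

hosts_file=$(mktemp)          # stand-in for /etc/hosts
add_hosts_entry "$hosts_file" 192.168.192.128 prometheus.kubernets.cn
add_hosts_entry "$hosts_file" 192.168.192.128 prometheus.kubernets.cn   # second call is a no-op
entries=$(grep -c 'prometheus.kubernets.cn' "$hosts_file")
echo "entries: $entries"
```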
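Verify:
To access Prometheus from outside the virtual machines, deploy a NodePort Service
vim prometheus_nodeport_svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitor
spec:
  type: NodePort
  selector:
    k8s-app: prometheus
  ports:
  - port: 9090
    targetPort: 9090
    nodePort: 30090
kubectl apply -f prometheus_nodeport_svc.yaml
Open in a browser: http://192.168.192.128:30090/
Prometheus web UI:
Graph: draws charts. You can choose the time range, metrics, and labels, and add several panels for comparison.
Alert: shows the configured alerting rules and their current state. The rules themselves are defined in the rule files, and notifications are sent when a metric reaches the configured threshold.
Explore: queries and browses metric data, either with query expressions or with label filters.
Status: shows Prometheus status information, including the current targets, rules, and alerts.
Config: shows the Prometheus configuration currently loaded. Configuration items are added, changed, and removed by editing the ConfigMap, not in the UI.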
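Prometheus-Based Full-Stack Monitoring Platform: Kubernetes Cluster-Level Monitoring
I. Introduction to kube-state-metrics
kube-state-metrics is a Kubernetes component that queries the Kubernetes API server, collects state information about the resources in the cluster (nodes, pods, services, and so on), and converts it into metrics that Prometheus can consume.
Main information provided by kube-state-metrics:
Node state, such as node CPU and memory capacity, node conditions, and node labels.
Pod state, such as pod phase, container state, container image information, and pod labels and annotations.
State of controllers such as Deployment, DaemonSet, StatefulSet, and ReplicaSet: replica counts, replica status, creation time, and so on.
Service state, such as service type, service IP, and ports.
Storage volume state, such as volume type and volume capacity.
Kubernetes API server state, such as request counts and response times, is exposed by the API server's own /metrics endpoint rather than by kube-state-metrics; it is collected by the kube-apiserver job described later.
With kube-state-metrics it is easy to monitor a Kubernetes cluster, spot problems, and raise early warnings.
II. The kube-state-metrics deployment consists of six kinds of YAML resources: ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment, ConfigMap, and Service: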
<<kube-state-metrics.yaml>>
kubectl apply -f kube-state-metrics.yaml
Verify:
kubectl get all -n monitor | grep kube-state-metrics
curl -kL $(kubectl get service -n monitor | grep kube-state-metrics | awk '{ print $3 }'):8080/metrics
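The curl command above relies on the third column of the kubectl get service output being the ClusterIP. The sketch below runs the same grep/awk filter against a made-up sample of that output, trimmed to its first three columns (the IP addresses are hypothetical), to show what ends up in the URL.

```shell
#!/usr/bin/env bash
# Hypothetical output of: kubectl get service -n monitor
# (only the first three columns are reproduced here)
sample_output='NAME                 TYPE        CLUSTER-IP
kube-state-metrics   ClusterIP   10.96.12.34
prometheus           ClusterIP   10.96.56.78'

# Same filter as in the verification command: keep the kube-state-metrics
# row and print its third column (CLUSTER-IP).
cluster_ip=$(printf '%s\n' "$sample_output" | grep kube-state-metrics | awk '{ print $3 }')
url="${cluster_ip}:8080/metrics"
echo "$url"
```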
The pod description reports that the image cannot be pulled:
Pull the images manually:
Find an image mirror: https://docker.aityp.com/i/search?search=nfs-subdir-external-provisioner%3Av4.0.2@
Pull the images on every node:
ctr -n k8s.io images pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/mirrorgooglecontainers/addon-resizer:1.8.6
ctr -n k8s.io images tag swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/mirrorgooglecontainers/addon-resizer:1.8.6 docker.io/mirrorgooglecontainers/addon-resizer:1.8.6
ctr -n k8s.io images pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.4.2
ctr -n k8s.io images tag swr.cn-north-4.myhuaweicloud.com/ddn-k8s/k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.4.2 k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.4.2
crictl images # list the images
After the images have been pulled, the pod status changes to Running
Add monitoring of the Kubernetes cluster components
1. kube-apiserver
Scraping over HTTPS requires TLS settings: either specify the CA certificate path or set insecure_skip_verify: true to skip certificate verification.
In addition, bearer_token_file must be specified; otherwise the scrape fails with "server returned HTTP status 400 Bad Request".
The job is a fragment of the Prometheus configuration, not a standalone Kubernetes resource: add it under scrape_configs in the Prometheus configuration ConfigMap and apply that ConfigMap again with kubectl apply -f.
- job_name: kube-apiserver
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
    action: keep
    regex: default;kubernetes
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: replace
    target_label: endpoint
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: service
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
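The first relabel rule of the kube-apiserver job joins the values of its source labels with ";" and keeps a target only if the whole joined string matches the regex. A small shell sketch of that logic follows, using made-up namespace/service pairs; grep -x stands in for the fully anchored match, and the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Emulates a relabel rule with action: keep.
# Prometheus joins the source label values with ";" and anchors the regex
# at both ends, which grep -x reproduces here.
keep_target() {
  local namespace="$1" service="$2" regex="$3"
  printf '%s;%s\n' "$namespace" "$service" | grep -qxE -- "$regex"
}

for pair in "default kubernetes" "kube-system kube-dns" "monitor prometheus"; do
  set -- $pair
  if keep_target "$1" "$2" 'default;kubernetes'; then
    echo "keep  $1/$2"
  else
    echo "drop  $1/$2"
  fi
done
```

Only the kubernetes Service in the default namespace survives, which is the API server endpoint.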
2. controller-manager
View the controller-manager pod: kubectl get pod -n kube-system
View the pod details:
kubectl describe pod -n kube-system kube-controller-manager-master1
Name:         kube-controller-manager-k8s-master
Namespace:    kube-system
……
Labels:       component=kube-controller-manager
              tier=control-plane
……
Containers:
  kube-controller-manager:
    ……
    Command:
      kube-controller-manager
      --allocate-node-cidrs=true
      --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
      --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
      --bind-address=127.0.0.1
      ……
As shown above, it is enough to match pods whose label is component=kube-controller-manager. Note, however, that by default controller-manager only accepts connections on 127.0.0.1, so its configuration must be changed first.
Edit /etc/kubernetes/manifests/kube-controller-manager.yaml (on all three nodes)
cat /etc/kubernetes/manifests/kube-controller-manager.yaml
  - command:
    - --bind-address=0.0.0.0 # change the bind address to 0.0.0.0
    #- --port=0 # comment out the port=0 flag
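The two manifest changes can also be scripted. The sketch below applies them with sed to a temporary file that contains only the relevant lines; the file content is a trimmed stand-in, not the full manifest, and on a real control-plane node the target would be /etc/kubernetes/manifests/kube-controller-manager.yaml.

```shell
#!/usr/bin/env bash
manifest=$(mktemp)            # stand-in for the real static pod manifest
cat > "$manifest" <<'EOF'
  - command:
    - kube-controller-manager
    - --bind-address=127.0.0.1
    - --port=0
EOF

# 1) listen on all interfaces, 2) comment out the --port=0 flag
sed -i \
  -e 's/--bind-address=127\.0\.0\.1/--bind-address=0.0.0.0/' \
  -e 's/^\( *\)- --port=0$/\1#- --port=0/' \
  "$manifest"

cat "$manifest"
```

The kubelet watches the manifests directory, so saving the real file is what restarts the static pod with the new flags.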
Edit the Prometheus configuration. Service discovery matches port 80 by default, so the port must be set to 10252 explicitly:
vim prometheus-config.yaml
- job_name: kube-controller-manager
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_component]
    regex: kube-controller-manager
    action: keep
  - source_labels: [__meta_kubernetes_pod_ip]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:10252
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: replace
    target_label: endpoint
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: service
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
curl -XPOST http://prometheus.kubernets.cn/-/reload
# run this command to hot-reload Prometheus manually
Verify:
3. scheduler
View the kube-scheduler pod details:
[root@tiaoban prometheus]# kubectl describe pod -n kube-system kube-scheduler-master1
Name:         kube-scheduler-k8s-master
Namespace:    kube-system
……
Labels:       component=kube-scheduler
              tier=control-plane
……
As shown above, it is enough to match pods whose label is component=kube-scheduler. Like controller-manager, the scheduler is started with --port=0 by default, and that flag must be commented out.
Edit /etc/kubernetes/manifests/kube-scheduler.yaml (on all three nodes)
cat /etc/kubernetes/manifests/kube-scheduler.yaml
……
  - command:
    - --bind-address=0.0.0.0 # change the bind address to 0.0.0.0
    #- --port=0 # comment out the port=0 flag
……
Edit the Prometheus configuration. Service discovery matches port 80 by default, so the port must be set to 10251 explicitly. A token must also be specified; otherwise the scrape fails with "server returned HTTP status 400 Bad Request":
- job_name: kube-scheduler
  kubernetes_sd_configs:
  - role: pod
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_component]
    regex: kube-scheduler
    action: keep
  - source_labels: [__meta_kubernetes_pod_ip]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:10251
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: replace
    target_label: endpoint
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: service
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
Apply the change and hot-reload Prometheus manually
Verify:
4. kube-state-metrics
The port must be set to 8080 explicitly
- job_name: kube-state-metrics
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_name]
    regex: kube-state-metrics
    action: keep
  - source_labels: [__meta_kubernetes_pod_ip]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:8080
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: replace
    target_label: endpoint
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: service
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
Apply the YAML file
Verify:
5. coredns
Edit the configuration. Service discovery matches port 53 by default, so the port must be set to 9153 explicitly
- job_name: coredns
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels:
    - __meta_kubernetes_service_label_k8s_app
    regex: kube-dns
    action: keep
  - source_labels: [__meta_kubernetes_pod_ip]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:9153
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: replace
    target_label: endpoint
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: service
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
Apply the YAML file
Verify:
6. etcd
View the pod details: kubectl describe pod -n kube-system etcd-master1
Name:                 etcd-master1
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 master1/192.10.192.158
Start Time:           Mon, 30 Jan 2023 15:06:35 +0800
Labels:               component=etcd
                      tier=control-plane
···
    Command:
      etcd
      --advertise-client-urls=https://192.10.192.158:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --initial-advertise-peer-urls=https://192.10.192.158:2380
      --initial-cluster=master1=https://192.10.192.158:2380
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379,https://192.10.192.158:2379
      --listen-metrics-urls=http://127.0.0.1:2381
      --listen-peer-urls=https://192.10.192.158:2380
      --name=master1
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
···
As shown above, the startup arguments include --listen-metrics-urls=http://127.0.0.1:2381. This flag makes the metrics endpoint listen on port 2381 over plain HTTP, so no certificate configuration is needed. That is much simpler than in earlier versions, where metrics had to be fetched over HTTPS and the matching certificates had to be configured. The manifest still has to be edited, however, so that the listen address becomes 0.0.0.0.
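The manifest change for etcd is not shown in these notes. A minimal sketch of it with sed follows, run against a temporary file holding only two of the flags from the output above. The real file is assumed to be /etc/kubernetes/manifests/etcd.yaml, the usual kubeadm location.

```shell
#!/usr/bin/env bash
manifest=$(mktemp)            # stand-in for /etc/kubernetes/manifests/etcd.yaml
cat > "$manifest" <<'EOF'
    - --listen-client-urls=https://127.0.0.1:2379,https://192.10.192.158:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
EOF

# Only the metrics listener is changed; the client URLs stay untouched.
sed -i 's#--listen-metrics-urls=http://127\.0\.0\.1:2381#--listen-metrics-urls=http://0.0.0.0:2381#' "$manifest"

metrics_flag=$(grep -- '--listen-metrics-urls' "$manifest" | sed 's/^ *- //')
echo "$metrics_flag"
```

Leaving the client URLs alone keeps the TLS-protected client port unchanged while exposing only the plain-HTTP metrics port to the pod network.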
Write the Prometheus configuration. Note that service discovery matches port 2379 by default, so the port must be set to 2381 explicitly
- job_name: etcd
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels:
    - __meta_kubernetes_pod_label_component
    regex: etcd
    action: keep
  - source_labels: [__meta_kubernetes_pod_ip]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:2381
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: replace
    target_label: endpoint
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
A brief description of some of the parameters used above:
kubernetes_sd_configs: enables Kubernetes dynamic service discovery.
kubernetes_sd_configs.role: selects the Kubernetes discovery mode. With the endpoints role, Prometheus queries the kube-apiserver for Endpoints objects and uses them as scrape targets.
kubernetes_sd_configs.namespaces: restricts discovery to the listed namespaces, for example only the namespace in which kube-state-metrics runs. The jobs above do not set it, so all namespaces are searched.
relabel_configs: rewrites the labels of the discovered targets.
Hot-reload Prometheus so that the ConfigMap change takes effect (you can also wait for the automatic reload):
curl -XPOST http://prometheus.kubernets.cn/-/reload
Verify:
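What cAdvisor does: it monitors the resource usage and performance of containers. It runs as a daemon that collects, aggregates, processes, and exports information about running containers.
cAdvisor supports Docker containers natively and also tries to support other container types as far as possible, aiming to be compatible with all of them.
Kubernetes already integrates it into the kubelet by default, so there is no need to deploy cAdvisor separately to expose information about the containers running on a node.
Add the cAdvisor configuration to Prometheus
Because cAdvisor is built into the kubelet, nothing needs to be deployed. Note that its metrics path is /metrics/cadvisor and that it must be scraped over HTTPS; insecure_skip_verify: true can be set to skip certificate verification:
- job_name: kubelet
  metrics_path: /metrics/cadvisor
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: replace
    target_label: endpoint
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
Hot-reload Prometheus so that the configuration takes effect: curl -XPOST http://prometheus.kubernets.cn/-/reload
Verify: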
Node Exporter is the node resource collector provided by the Prometheus project. It collects data about server nodes, such as CPU frequency information, disk I/O statistics, and available memory. Because it has to run on every Kubernetes node, it is deployed here as a DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:latest
        ports:
        - containerPort: 9100
        resources:
          requests:
            cpu: 0.15
        securityContext:
          privileged: true
        args:
        - --path.procfs
        - /host/proc
        - --path.sysfs
        - /host/sys
        - --collector.filesystem.ignored-mount-points
        # a single level of quoting, so the quote characters do not become part of the regex
        - '^/(sys|proc|dev|host|etc)($|/)'
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /rootfs
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: dev
        hostPath:
          path: /dev
      - name: sys
        hostPath:
          path: /sys
      - name: rootfs
        hostPath:
          path: /
Notes on node_exporter.yaml:
hostPID: whether the Node Exporter pod shares the host's PID namespace. When true, it can see the host's process information.
hostIPC: whether the Node Exporter pod shares the host's IPC namespace. When true, it can see the host's IPC information.
hostNetwork: whether the Node Exporter pod shares the host's network namespace. When true, it can see the host's network information.
Verify: curl localhost:9100/metrics | grep cpu
Add k8s-node monitoring
Add a new scrape job, k8s-nodes, to prometheus-config.yaml. node_exporter also runs on every node, so the node role can be used. The discovered address uses port 10250 by default, which only needs to be replaced with 9100.
- job_name: k8s-nodes
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - source_labels: [__address__]
    regex: '(.*):10250'
    replacement: '${1}:9100'
    target_label: __address__
    action: replace
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_endpoints_name]
    action: replace
    target_label: endpoint
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: namespace
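The first relabel rule of the k8s-nodes job is a regex substitution on the discovered address. The sketch below reproduces it with sed for the three node IPs of this cluster, to show the address Prometheus ends up scraping; the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Emulates the first relabel rule of the k8s-nodes job:
#   regex '(.*):10250'  ->  replacement '${1}:9100'
rewrite_address() {
  printf '%s\n' "$1" | sed -E 's/^(.*):10250$/\1:9100/'
}

for addr in 192.168.192.128:10250 192.168.192.129:10250 192.168.192.130:10250; do
  echo "$addr -> $(rewrite_address "$addr")"
done
```

An address that does not end in :10250 passes through unchanged, which matches how a replace action behaves when its regex does not match.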
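Hot-reload Prometheus so that the configuration takes effect: curl -XPOST http://prometheus.kubernets.cn/-/reload
Verify:
kube-state-metrics: converts the state of the objects in the Kubernetes API into metrics that Prometheus can consume.
cAdvisor: a tool for monitoring container resource usage and performance. It collects metrics on CPU, memory, disk, network, and file systems.
node-exporter: a collector for host-level metrics, such as CPU load, memory usage, disk space, and network traffic.
These three tools work together to give users a complete Kubernetes monitoring solution and a better view of how the cluster and its containerized applications are running.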