K8s集群版本是二进制部署的1.20.4,kube-prometheus对应选择的版本是kube-prometheus-0.8.0
Coredns是在安装集群的时候部署的,采用的也是该版本的官方文档,kube-prometheus中也有coredns的监控配置信息,但是在prometheus的监控页面并没有发现coredns的servicemonitor.。所以我们需要一步步的去排查该问题。
先看下coredns的servicemonitor
vim kubernetes-serviceMonitorCoreDNS.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
labels:
app.kubernetes.io/name: coredns
name: coredns
namespace: monitoring
spec:
endpoints:
- bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
interval: 15s
port: metrics
jobLabel: app.kubernetes.io/name
namespaceSelector:
matchNames:
- kube-system
selector:
matchLabels:
app.kubernetes.io/name: kube-dns
再来看下coredns的service配置
---
apiVersion: v1
kind: Service
metadata:
name: kube-dns
namespace: kube-system
annotations:
prometheus.io/port: "9153"
prometheus.io/scrape: "true"
labels:
k8s-app: kube-dns
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/name: "CoreDNS"
spec:
selector:
k8s-app: kube-dns
clusterIP: 10.0.0.2
ports:
- name: dns
port: 53
protocol: UDP
- name: dns-tcp
port: 53
protocol: TCP
- name: metrics
port: 9153
protocol: TCP
从上面两段可以看到,servicemonitor去匹配的service是
labels:
app.kubernetes.io/name: coredns
而我们创建的coredns的service的labels
labels:
k8s-app: kube-dns
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/name: "CoreDNS"
两边没有对应上,所以该servicemonitor无法匹配到对应的service,所以监控不到我们的coredns.
因coredns对服务的影响比较大,我们选择去修改servicemonitor
修改labels后重新apply
Kubectl apply -f kubernetes-serviceMonitorCoreDNS.yaml
coredns就加载出来了
配置coredns的监控信息
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
app.kubernetes.io/name: kube-prometheus
app.kubernetes.io/part-of: kube-prometheus
prometheus: k8s
role: alert-rules
name: kubernetes-monitoring-coredns-rules
namespace: monitoring
spec:
groups:
- name: coredns
rules:
- alert: CoreDNSDown
annotations:
message: CoreDNS has disappeared from Prometheus target discovery.
runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsdown
expr: |
absent(up{job="kube-dns"} == 1)
for: 15m
labels:
severity: critical
- alert: CoreDNS的dns请求持续时间延迟高
annotations:
message: CoreDNS has 99th percentile latency of {{ $value }} seconds for server
{{ $labels.server }} zone {{ $labels.zone }} .
runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednslatencyhigh
expr: |
histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket{job="kube-dns"}[5m])) by(server, zone, le)) > 4
for: 10m
labels:
severity: critical
- alert: CoreDNS响应错误高
annotations:
message: CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests.
runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednserorshigh
expr: |
sum(rate(coredns_dns_responses_total{job="kube-dns",rcode="SERVFAIL"}[5m]))
/
sum(rate(coredns_dns_responses_total{job="kube-dns"}[5m])) > 0.03
for: 10m
labels:
severity: critical
- alert: CoreDNS响应错误高
annotations:
message: CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests.
runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednserorshigh
expr: |
sum(rate(coredns_dns_responses_total{job="kube-dns",rcode="SERVFAIL"}[5m]))
/
sum(rate(coredns_dns_responses_total{job="kube-dns"}[5m])) > 0.01
for: 10m
labels:
severity: warning
- alert: CoreDNS转发请求持续时间延迟高
annotations:
message: CoreDNS has 99th percentile latency of {{ $value }} seconds forwarding requests to {{ $labels.to }}.
runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwardlatencyhigh
expr: |
histogram_quantile(0.99, sum(rate(coredns_forward_request_duration_seconds_bucket{job="kube-dns"}[5m])) by(to, le)) > 4
for: 10m
labels:
severity: critical
- alert: CoreDNSForwardErrorsHigh
annotations:
message: CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of forward requests to {{ $labels.to }}.
runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwarderrorshigh
expr: |
sum(rate(coredns_forward_responses_total{job="kube-dns",rcode="SERVFAIL"}[5m]))
/
sum(rate(coredns_forward_responses_total{job="kube-dns"}[5m])) > 0.03
for: 10m
labels:
severity: critical
- alert: CoreDNSForwardErrorsHigh
annotations:
message: CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of forward requests to {{ $labels.to }}.
runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwarderrorshigh
expr: |
sum(rate(coredns_forward_responses_total{job="kube-dns",rcode="SERVFAIL"}[5m]))
/
sum(rate(coredns_forward_responses_total{job="kube-dns"}[5m])) > 0.01
for: 10m
labels:
severity: warning
- alert: CoreDNSForwardHealthcheckFailureCount
annotations:
message: CoreDNS health checks have failed to upstream server {{ $labels.to }}.
runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwardhealthcheckfailurecount
expr: |
sum(rate(coredns_forward_healthcheck_failures_total{job="kube-dns"}[5m])) by (to) > 0
for: 10m
labels:
severity: warning
- alert: CoreDNSForwardHealthcheckBrokenCount
annotations:
message: CoreDNS health checks have failed for all upstream servers.
runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwardhealthcheckbrokencount
expr: |
sum(rate(coredns_forward_healthcheck_broken_total{job="kube-dns"}[5m])) > 0
for: 10m
labels:
severity: warning
- alert: CorednsPanicCount
expr: increase(coredns_panics_total[1m]) > 0
for: 0m
labels:
severity: critical
annotations:
summary: CoreDNS Panic Count (instance {{ $labels.instance }})
description: "Number of CoreDNS panics encountered\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"