监控k8s controller和scheduler,创建serviceMonitor以及Rules

news2024/12/25 12:56:08

目录

一、修改kube-controller和kube-schduler的yaml文件

二、创建service、endpoint、serviceMonitor

三、Prometheus验证

四、创建PrometheusRule资源

五、Prometheus验证


直接上干货

一、修改kube-controller和kube-schduler的yaml文件

注意:修改时要一个节点一个节点的修改,等上一个修改的节点服务正常启动后再修改下个节点

kube-controller文件路径:/etc/kubernetes/manifests/kube-controller-manager.yaml
kube-scheduler文件路径:/etc/kubernetes/manifests/kube-scheduler.yaml

vim /etc/kubernetes/manifests/kube-controller-manager.yaml
vim /etc/kubernetes/manifests/kube-scheduler.yaml

二、创建service、endpoint、serviceMonitor

kube-controller-monitor.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manage-monitor
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
    targetPort: 10257
  sessionAffinity: None
  type: ClusterIP
--- 
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manage-monitor
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.50.238.191
  - ip: 10.50.107.48
  - ip: 10.50.140.151
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-controller-manager

kube-scheduler-monitor.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler-monitor
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
    targetPort: 10259
  sessionAffinity: None
  type: ClusterIP
--- 
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler-monitor
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.50.238.191
  - ip: 10.50.107.48
  - ip: 10.50.140.151
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-scheduler

root@10-50-238-191:/home/sunwenbo/prometheus-serviceMonitor/serviceMonitor/kubernetes-cluster# kubectl  apply -f ./
service/kube-controller-manage-monitor created
endpoints/kube-controller-manage-monitor created
servicemonitor.monitoring.coreos.com/kube-controller-manager created
service/kube-scheduler-monitor created
endpoints/kube-scheduler-monitor created
servicemonitor.monitoring.coreos.com/kube-scheduler created

三、Prometheus验证

四、创建PrometheusRule资源

kube-controller-rules.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-namespace: cattle-monitoring-system
    prometheus-operator-validated: "true"
  generation: 3
  labels:
    app: rancher-monitoring
    app.kubernetes.io/instance: rancher-monitoring
    app.kubernetes.io/part-of: rancher-monitoring
  name: kube-controller-manager
  namespace: cattle-monitoring-system
spec:
  groups:
  - name: kube-controller-manager.rule
    rules:
    - alert: K8SControllerManagerDown
      expr: absent(up{job="kube-controller-manager"} == 1)
      for: 1m
      labels:
        severity: critical
        cluster: manage-prod
      annotations:
        description: There is no running K8S controller manager. Deployments and replication controllers are not making progress.
        summary: No kubernetes controller manager are reachable
    - alert: K8SControllerManagerDown
      expr: up{job="kube-controller-manager"} == 0
      for: 1m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        description: kubernetes controller manager {{ $labels.instance }} is down. {{ $labels.instance }} isn't reachable
        summary: kubernetes controller manager is down
    - alert: K8SControllerManagerUserCPU
      expr: sum(rate(container_cpu_user_seconds_total{pod=~"kube-controller-manager.*",container_name!="POD"}[5m]))by(pod) > 5
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        description: kubernetes controller manager {{ $labels.instance }} is user cpu time > 5s. {{ $labels.instance }} isn't reachable
        summary: kubernetes controller 负载较高超过5s
   
    - alert: K8SControllerManagerUseMemory
      expr: sum(rate(container_memory_usage_bytes{pod=~"kube-controller-manager.*",container_name!="POD"}[5m])/1024/1024)by(pod) > 20
      for: 5m
      labels:
        severity: info
        cluster: manage-prod
      annotations:
        description: kubernetes controller manager {{ $labels.instance }} is use memory More than 20MB
        summary: kubernetes controller 使用内存超过20MB
   
    - alert: K8SControllerManagerQueueTimedelay
      expr: histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket{job="kubernetes-controller-manager"}[5m])) by(le)) > 10
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        description: kubernetes controller manager {{ $labels.instance }} is QueueTimedelay More than 10s
        summary: kubernetes controller 队列停留时间超过10秒,请检查ControllerManager

kube-scheduler-rules.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-namespace: cattle-monitoring-system
    prometheus-operator-validated: "true"
  generation: 3
  labels:
    app: rancher-monitoring
    app.kubernetes.io/instance: rancher-monitoring
    app.kubernetes.io/part-of: rancher-monitoring
  name: kube-scheduler
  namespace: cattle-monitoring-system
spec:
  groups:
  - name: kube-scheduler.rule
    rules:
    - alert: K8SSchedulerDown
      expr: absent(up{job="kube-scheduler"} == 1)
      for: 1m
      labels:
        severity: critical
        cluster: manage-prod
      annotations:
        description: "There is no running K8S scheduler. New pods are not being assigned to nodes."
        summary: "all k8s scheduler is down"
    - alert: K8SSchedulerDown
      expr: up{job="kube-scheduler"} == 0
      for: 1m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        description: "K8S scheduler {{ $labels.instance }} is no running. New pods are not being assigned to nodes."
        summary: "k8s scheduler {{ $labels.instance }} is down"
    - alert: K8SSchedulerUserCPU
      expr: sum(rate(container_cpu_user_seconds_total{pod=~"kube-scheduler.*",container_name!="POD"}[5m]))by(pod) > 1
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: "kubernetes scheduler {{ $labels.instance }} is user cpu time > 1s. {{ $labels.instance }} isn't reachable"
        summary: "kubernetes scheduler 负载较高超过1s,当前值为{{$value}}"
   
    - alert: K8SSchedulerUseMemory
      expr: sum(rate(container_memory_usage_bytes{pod=~"kube-scheduler.*",container_name!="POD"}[5m])/1024/1024)by(pod) > 20
      for: 5m
      labels:
        severity: info
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: "kubernetess scheduler {{ $labels.instance }} is use memory More than 20MB"
        summary: "kubernetes scheduler 使用内存超过20MB,当前值为{{$value}}MB"
   
    - alert: K8SSchedulerPodPending
      expr: sum(scheduler_pending_pods{job="kubernetes-scheduler"})by(queue) > 5
      for: 5m
      labels:
        severity: info
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: "kubernetess scheduler {{ $labels.instance }} is Pending pod More than 5"
        summary: "kubernetes scheduler pod无法调度 > 5,当前值为{{$value}}"
   
    - alert: K8SSchedulerPodPending
      expr: sum(scheduler_pending_pods{job="kubernetes-scheduler"})by(queue) > 10
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: kubernetess scheduler {{ $labels.instance }} is Pending pod More than 10
        summary: "kubernetes scheduler pod无法调度 > 10,当前值为{{$value}}"
   
    - alert: K8SSchedulerPodPending
      expr: sum(rate(scheduler_binding_duration_seconds_count{job="kubernetes-scheduler"}[5m])) > 1
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: kubernetess scheduler {{ $labels.instance }}
        summary: "kubernetes scheduler pod 无法绑定调度有问题,当前值为{{$value}}"
   
    - alert: K8SSchedulerVolumeSpeed
      expr: sum(rate(scheduler_volume_scheduling_duration_seconds_count{job="kubernetes-scheduler"}[5m])) > 1
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: kubernetess scheduler {{ $labels.instance }}
        summary: "kubernetes scheduler pod Volume 速度延迟,当前值为{{$value}}"
   
    - alert: K8SSchedulerClientRequestSlow
      expr: histogram_quantile(0.99, sum(rate(rest_client_request_duration_seconds_bucket{job="kubernetes-scheduler"}[5m])) by (verb, url, le)) > 1
      for: 5m
      labels:
        severity: warning
        cluster: manage-prod
      annotations:
        current_value: '{{$value}}'
        description: kubernetess scheduler {{ $labels.instance }}
        summary: "kubernetes scheduler 客户端请求速度延迟,当前值为{{$value}}"
root@10-50-238-191:/home/sunwenbo/prometheus-serviceMonitor/rules# kubectl  apply -f kube-controller-rules.yaml 
prometheusrule.monitoring.coreos.com/kube-apiserver-rules configured
root@10-50-238-191:/home/sunwenbo/prometheus-serviceMonitor/rules# kubectl  apply -f kube-scheduler-rules.yaml 
prometheusrule.monitoring.coreos.com/kube-apiserver-rules configured
root@10-50-238-191:/home/sunwenbo/prometheus-serviceMonitor/rules# 

五、Prometheus验证

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1322483.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

neo4j安装报错:neo4j.bat : 无法将“neo4j.bat”项识别为 cmdlet、函数、脚本文件或可运行程序的名称。

neo4j安装报错: neo4j.bat : 无法将“neo4j.bat”项识别为 cmdlet、函数、脚本文件或可运行程序的名称。请检查名称的拼写,如果包括路径,请确 保路径正确,然后再试一次。 解决办法: 在环境变量中的,用户…

手把手教你搭建谷歌Gemini

前言 谷歌上周推出了一款名为 Gemini 的多模态大模型,并且现在发布了免费开放的 Gemini API 供开发者使用。根据谷歌提供的定价信息,Gemini 有两种收费方式。免费版本每分钟可以进行 60 次请求,足够满足个人用户的需求。收费版本目前暂不可用…

你了解Redis中的跳跃表吗?

跳跃表的基本内容: 对于一个有序序列,链表相对于数组来说,删除和插入的效率要快很多,只需要改变指针的指向,但是在查找的时候,数组就要更占优势一些,可以随机访问,然而链表需要从头…

连接SSH报错 / 连接容器SSH

连接SSH报错 / 连接容器SSH 前言被控端主控端连接失败 前言 本文介绍如何通过SSH方式远程连接Linux被控端,并介绍如何解决连接失败问题。 此方法同样适用于SSH连接Docker容器。 被控端 被控端一般为Linux,默认已安装ssh,但需要手动安装ope…

C++ OJ题测试—排序算法效率

目录 OJ链接 一、直接插入排序 二、希尔排序 三、直接选择排序 常规: 第二种: 四、 堆排序 五、冒泡排序 六、快速排序 常规: 三路划分优化效率 七、归并排序 八、计数排序 OJ链接 ​ 一、直接插入排序 class Solution { pub…

Gateway No servers available for service

springCloud集成网关测试报错找不到服务,如下 造成这种错误可能是下面两种原因 1、nacos注册的服务不在一个命名空间内,导致找不到服务503 spring cloud:nacos:discovery:server-addr: 127.0.0.1:8848config:server-addr: 127.0.0.1:8848file-extensio…

探索Qt 6.3:了解基本知识点和新特性

学习目标: 理解Qt6.3的基本概念和框架:解释Qt是什么,它的核心思想和设计原则。学会安装和配置Qt6.3开发环境:提供详细的步骤,让读者能够顺利安装和配置Qt6.3的开发环境。掌握Qt6.3的基本编程技巧:介绍Qt6.…

欧洲版OpenAI疑似将在24年发布并开源GPT-4级别模型!

大家好,我是二狗。 今天在推特上看到一条振奋人心的消息: “ 欧洲版OpenAI、法国初创公司 Mistral 首席执行官 Arthur Mensch 在法国国家广播电台宣布,Mistral 将在 2024 年发布开源 GPT-4 级别模型。” 这位老哥接着表示甚至可能是免费的&a…

C++ STL——栈和队列(stack queue)

本节目标 1.stack的介绍和使用及其模拟实现 2.queue的介绍和使用及其模拟实现 3.priority_queue的介绍和使用及其模拟实现 4.容器适配器 1.stack的介绍和使用及其模拟实现 1.1 stack的介绍 stack的文档介绍 根据stack的文档介绍可以知道,stack是一种容器适配器…

原生微信小程序-使用 阿里字体图标 详解

步骤一 1、打开阿里巴巴矢量图标库 网址:iconfont-阿里巴巴矢量图标库 2、搜索字体图标,鼠标悬浮点击添加入库 3、按如下步骤添加到自己的项目 步骤二 进入微信开发者工具 1、创建 fonts文件夹 > iconfont.wxss 文件,将刚才的代码复制…

结果实例: 一个cpu的parsec结果

简介 限于篇幅限制,很多教程和论文只展示部分结果。我们这里展示非常细节的结果,包括输出的许多命令行结果。 运行命令行 的shell窗口 ./build/X86/gem5.opt -d m5out/onlyoneCPUkvmCheckPointDifferRCS20231218restore \configs/deprecated/example/…

杰发科技AC7840——SPM电源管理之低功耗模式

0、SPM简介 很早以前就听过低功耗模式,一直没有怎么深入了解,最近遇到几个项目都是跟低功耗有关。正好AutoChips的芯片都有电源管理的功能,在此借用AC7840的SPM对低功耗进行测试。 1、AC7840的5种功耗模式 2、AC7840的模式转换 3、唤醒 在…

Kubernetes 的用法和解析 -- 5

一.企业级镜像仓库Harbo 准备:另起一台新服务器,并配置docker yum源,安装docker 和 docker-compose 1.1 上传harbor安装包并安装 [rootharbor ~]# tar xf harbor-offline-installer-v2.5.3.tgz [rootharbor ~]# cp harbor.yml.tmpl harbor…

Home Assistant 如何开启SSH服务

环境: Home Assistant 11.2 SSH & Web Terminal 17.0 问题描述: Home Assistant 如何开启SSH服务 解决方案: 通过添加一个名为Terminal & SSH的插件来在 Home Assistant 中启用 SSH 服务 下面是启用 SSH 服务的大致步骤&#x…

C语言数据结构-排序

文章目录 1 排序的概念及运用1.1 排序的概念1.2 排序的应用 2 插入排序2.1 直接插入排序2.2 希尔排序2.3 直接排序和希尔排序对比 3 选择排序3.1 堆排序3.2 直接选择排序 4 交换排序4.1 冒泡排序4.2 快速排序4.2.1 挖坑法14.2.2 挖坑法24.2.3 挖坑法3 5 并归排序6 十万级别数据…

Ubuntu中基础命令使用

前言 以下指令测试来自于Ubuntu18.04 如果有说的不对的,欢迎指正与补充 以下指令为我学习嵌入式开发中使用过最多的指令 目录 前言 1 ls 首先我们进入到Linux操作系统中 2 touch创建一个文件 3 pwd查看当前路径 4 创建目录 5 删除文件 6 cd 目录跳转 0…

LVS负载均衡集群之HA高可用模式

Keepalived工具介绍 专为LVS和HA设计的一款健康检查工具 一个合格的集群应该具备的特性: 1.负载均衡 LVS Nginx HAProxy F5 2.健康检查(探针) for调度器/节点服务器 Keeplived Hearbeat 3.故障转移 通过VIP飘逸实现主备切换 健康检查&am…

HarmonyOS 中DatePicker先择时间 路由跳转并传值到其它页

效果 代码 代码里有TextTimerController 这一种例用方法较怪,Text ,Button Datepicker 的使用。 import router from ohos.router’则是引入路由模块。 import router from ohos.router Entry Component struct TextnewClock {textTimerController: TextTimerContr…

【马来西亚会议】第四届计算机技术与全媒介融合设计国际学术会议(CTMCD 2024)

第四届计算机技术与全媒介融合设计国际学术会议(CTMCD 2024) 2023 4th International Conference on Computer Technology and Media Convergence Design 第四届计算机技术与全媒介融合设计国际学术会议(CTMCD 2024)将于 2024年2月23日-25日…

计算机组成原理(存储器与CPU的连接)

题目: 设 CPU 共有 16 根地址线。8 根数据线,并用 作访存控制信号,R/作读/写命令信号。现有这些存储芯片:ROM (2K*8 位、4K*4 位、8K*8 位),RAM(1K*4 位、2K*8 位、4K*8 位)及 74138 译码器和其他门电路(门电路自定)。试从上述规…