HAMi + prometheus-k8s + grafana: monitoring vGPU virtualization

2025/1/14 3:55:58

I spent more than half a month in Changsha recently, going over project metrics with the client, so I haven't posted in a while.

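Now that I'm back, I've gone on working out how to monitor HAMi vGPU virtualization in grafana. After all, the contract says we have to demonstrate GPU resource limits and compute sharing, as well as monitoring of shared accelerator-card resources.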

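First, why HAMi? One important reason is that someone at my company introduced me to the tool's author, so I can put many of my questions to the author directly.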

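HAMi is a Chinese open-source project for virtualizing GPUs and domestic accelerator cards (for the supported GPU and accelerator models and their specific features, see the project site: https://github.com/Project-HAMi/HAMi/). It virtualizes GPUs or accelerator cards for container workloads running on kubernetes. HAMi was formerly named "k8s-vGPU-scheduler"; it was originally open-sourced by our company and is becoming increasingly popular both in China and internationally.

HAMi is middleware for managing heterogeneous devices in Kubernetes. It can manage different types of heterogeneous devices (GPU, NPU, and so on), share those devices among Pods, and make better scheduling decisions based on device topology and scheduling policy. To keep things short, this post gives just one workable approach: prometheus scrapes the monitoring metrics and serves as the data source, and grafana displays them.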

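This post assumes the Kubernetes cluster and HAMi are already deployed. All components mentioned below are installed inside the kubernetes cluster. The component and software versions are as follows: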

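Component or software | Version | Notes
kubernetes cluster | v1.23.1 | AMD64 servers
HAMi | According to the author, HAMi's release process is not yet mature; for now the value of the scheduler.kubeScheduler.imageTag install parameter is treated as its version, and this value must match the kubernetes version | Project: https://github.com/Project-HAMi/HAMi/
kube-prometheus stack | prom/prometheus:v2.27.1 | For installing the monitoring stack, see the CSDN blog post on deploying prometheus + grafana monitoring
dcgm-exporter | nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04 |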

HAMi is installed with helm by default. Add the Helm repository:

helm repo add hami-charts https://project-hami.github.io/HAMi/


Check the Kubernetes version and install HAMi (the server version here is 1.23.1):

helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system

Verify that HAMi installed successfully:

kubectl get pods -n kube-system


If hami-device-plugin and hami-scheduler are both in the Running state, the installation succeeded.
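The check above can be narrowed to just the HAMi pods. This is a minimal sketch, assuming the default `kubectl get pods` column order (NAME, READY, STATUS, ...); the helper name `hami_not_running` is made up for illustration.

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints every
# hami-* pod whose STATUS column (third field) is not "Running".
# hami_not_running is a hypothetical helper name, not part of HAMi.
hami_not_running() {
    awk '$1 ~ /^hami-/ && $3 != "Running" { print $1, $3 }'
}

# On a live cluster (not executed here):
#   kubectl get pods -n kube-system --no-headers | hami_not_running
```

Empty output means every hami-* pod reports Running.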

Render the helm install into hami-install.yaml:


helm template hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system > hami-install.yaml

The deployment then uses the rendered manifest, which looks like this:

---
# Source: hami/templates/device-plugin/monitorserviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-device-plugin
  namespace: "kube-system"
  labels:
    app.kubernetes.io/component: "hami-device-plugin"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/scheduler/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-scheduler
  namespace: "kube-system"
  labels:
    app.kubernetes.io/component: "hami-scheduler"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/device-plugin/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.json: |
    {
        "nodeconfig": [
            {
                "name": "m5-cloudinfra-online02",
                "devicememoryscaling": 1.8,
                "devicesplitcount": 10,
                "migstrategy":"none",
                "filterdevices": {
                  "uuid": [],
                  "index": []
                }
            }
        ]
    }
---
# Source: hami/templates/scheduler/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.json: |
    {
        "kind": "Policy",
        "apiVersion": "v1",
        "extenders": [
            {
                "urlPrefix": "https://127.0.0.1:443",
                "filterVerb": "filter",
                "bindVerb": "bind",
                "enableHttps": true,
                "weight": 1,
                "nodeCacheCapable": true,
                "httpTimeout": 30000000000,
                "tlsConfig": {
                    "insecure": true
                },
                "managedResources": [
                    {
                        "name": "nvidia.com/gpu",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "nvidia.com/gpumem",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "nvidia.com/gpucores",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "nvidia.com/gpumem-percentage",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "nvidia.com/priority",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "cambricon.com/vmlu",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "hygon.com/dcunum",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "hygon.com/dcumem",
                        "ignoredByScheduler": true 
                    },
                    {
                        "name": "hygon.com/dcucores",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "iluvatar.ai/vgpu",
                        "ignoredByScheduler": true
                    }
                ],
                "ignoreable": false
            }
        ]
    }
---
# Source: hami/templates/scheduler/configmapnew.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-newversion
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
    - schedulerName: hami-scheduler
    extenders:
    - urlPrefix: "https://127.0.0.1:443"
      filterVerb: filter
      bindVerb: bind
      nodeCacheCapable: true
      weight: 1
      httpTimeout: 30s
      enableHTTPS: true
      tlsConfig:
        insecure: true
      managedResources:
      - name: nvidia.com/gpu
        ignoredByScheduler: true
      - name: nvidia.com/gpumem
        ignoredByScheduler: true
      - name: nvidia.com/gpucores
        ignoredByScheduler: true
      - name: nvidia.com/gpumem-percentage
        ignoredByScheduler: true
      - name: nvidia.com/priority
        ignoredByScheduler: true
      - name: cambricon.com/vmlu
        ignoredByScheduler: true
      - name: hygon.com/dcunum
        ignoredByScheduler: true
      - name: hygon.com/dcumem
        ignoredByScheduler: true
      - name: hygon.com/dcucores
        ignoredByScheduler: true
      - name: iluvatar.ai/vgpu
        ignoredByScheduler: true
---
# Source: hami/templates/scheduler/device-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-device
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  device-config.yaml: |-
    nvidia:
      resourceCountName: nvidia.com/gpu
      resourceMemoryName: nvidia.com/gpumem
      resourceMemoryPercentageName: nvidia.com/gpumem-percentage
      resourceCoreName: nvidia.com/gpucores
      resourcePriorityName: nvidia.com/priority
      overwriteEnv: false
      defaultMemory: 0
      defaultCores: 0
      defaultGPUNum: 1
      deviceSplitCount: 10
      deviceMemoryScaling: 1
      deviceCoreScaling: 1
    cambricon:
      resourceCountName: cambricon.com/vmlu
      resourceMemoryName: cambricon.com/mlu.smlu.vmemory
      resourceCoreName: cambricon.com/mlu.smlu.vcore
    hygon:
      resourceCountName: hygon.com/dcunum
      resourceMemoryName: hygon.com/dcumem
      resourceCoreName: hygon.com/dcucores
    metax:
      resourceCountName: "metax-tech.com/gpu"
    mthreads:
      resourceCountName: "mthreads.com/vgpu"
      resourceMemoryName: "mthreads.com/sgpu-memory"
      resourceCoreName: "mthreads.com/sgpu-core"
    iluvatar: 
      resourceCountName: iluvatar.ai/vgpu
      resourceMemoryName: iluvatar.ai/vcuda-memory
      resourceCoreName: iluvatar.ai/vcuda-core
    vnpus:
    - chipName: 910B
      commonWord: Ascend910A
      resourceName: huawei.com/Ascend910A
      resourceMemoryName: huawei.com/Ascend910A-memory
      memoryAllocatable: 32768
      memoryCapacity: 32768
      aiCore: 30
      templates:
        - name: vir02
          memory: 2184
          aiCore: 2
        - name: vir04
          memory: 4369
          aiCore: 4
        - name: vir08
          memory: 8738
          aiCore: 8
        - name: vir16
          memory: 17476
          aiCore: 16
    - chipName: 910B3
      commonWord: Ascend910B
      resourceName: huawei.com/Ascend910B
      resourceMemoryName: huawei.com/Ascend910B-memory
      memoryAllocatable: 65536
      memoryCapacity: 65536
      aiCore: 20
      aiCPU: 7
      templates:
        - name: vir05_1c_16g
          memory: 16384
          aiCore: 5
          aiCPU: 1
        - name: vir10_3c_32g
          memory: 32768
          aiCore: 10
          aiCPU: 3
    - chipName: 310P3
      commonWord: Ascend310P
      resourceName: huawei.com/Ascend310P
      resourceMemoryName: huawei.com/Ascend310P-memory
      memoryAllocatable: 21527
      memoryCapacity: 24576
      aiCore: 8
      aiCPU: 7
      templates:
        - name: vir01
          memory: 3072
          aiCore: 1
          aiCPU: 1
        - name: vir02
          memory: 6144
          aiCore: 2
          aiCPU: 2
        - name: vir04
          memory: 12288
          aiCore: 4
          aiCPU: 4
---
# Source: hami/templates/device-plugin/monitorrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name:  hami-device-plugin-monitor
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - create
      - watch
      - list
      - update
      - patch
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - update
      - list
      - patch
---
# Source: hami/templates/device-plugin/monitorrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: "hami-device-plugin"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  #name: cluster-admin
  name: hami-device-plugin-monitor
subjects:
  - kind: ServiceAccount
    name: hami-device-plugin
    namespace: "kube-system"
---
# Source: hami/templates/scheduler/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: "hami-scheduler"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: hami-scheduler
    namespace: "kube-system"
---
# Source: hami/templates/device-plugin/monitorservice.yaml
apiVersion: v1
kind: Service
metadata:
  name: hami-device-plugin-monitor
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  externalTrafficPolicy: Local
  selector:
    app.kubernetes.io/component: hami-device-plugin
  type: NodePort
  ports:
    - name: monitorport
      port: 31992
      targetPort: 9394
      nodePort: 31992
---
# Source: hami/templates/scheduler/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  type: NodePort
  ports:
    - name: http
      port: 443
      targetPort: 443
      nodePort: 31998
      protocol: TCP
    - name: monitor
      port: 31993
      targetPort: 9395
      nodePort: 31993
      protocol: TCP
  selector:
    app.kubernetes.io/component: hami-scheduler
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
---
# Source: hami/templates/device-plugin/daemonsetnvidia.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
      app.kubernetes.io/name: hami
      app.kubernetes.io/instance: hami
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-device-plugin
        hami.io/webhook: ignore
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
    spec:
      imagePullSecrets: 
        []
      serviceAccountName: hami-device-plugin
      priorityClassName: system-node-critical
      hostPID: true
      hostNetwork: true
      containers:
        - name: device-plugin
          image: projecthami/hami:latest
          imagePullPolicy: "IfNotPresent"
          lifecycle:
            postStart:
              exec:
                command: ["/bin/sh","-c", "cp -f /k8s-vgpu/lib/nvidia/* /usr/local/vgpu/"]
          command:
            - nvidia-device-plugin
            - --config-file=/device-config.yaml
            - --mig-strategy=none
            - --disable-core-limit=false
            - -v=false
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
            - name: HOOK_PATH
              value: /usr/local
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: lib
              mountPath: /usr/local/vgpu
            - name: usrbin
              mountPath: /usrbin
            - name: deviceconfig
              mountPath: /config
            - name: hosttmp
              mountPath: /tmp
            - name: device-config
              mountPath: /device-config.yaml
              subPath: device-config.yaml
        - name: vgpu-monitor
          image: projecthami/hami:latest
          imagePullPolicy: "IfNotPresent"
          command: ["vGPUmonitor"]
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
              add: ["SYS_ADMIN"]
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
            - name: HOOK_PATH
              value: /usr/local/vgpu              
          volumeMounts:
            - name: ctrs
              mountPath: /usr/local/vgpu/containers
            - name: dockers
              mountPath: /run/docker
            - name: containerds
              mountPath: /run/containerd
            - name: sysinfo
              mountPath: /sysinfo
            - name: hostvar
              mountPath: /hostvar
      volumes:
        - name: ctrs
          hostPath:
            path: /usr/local/vgpu/containers
        - name: hosttmp
          hostPath:
            path: /tmp
        - name: dockers
          hostPath:
            path: /run/docker
        - name: containerds
          hostPath:
            path: /run/containerd
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: lib
          hostPath:
            path: /usr/local/vgpu
        - name: usrbin
          hostPath:
            path: /usr/bin
        - name: sysinfo
          hostPath:
            path: /sys
        - name: hostvar
          hostPath:
            path: /var
        - name: deviceconfig
          configMap:
            name: hami-device-plugin
        - name: device-config
          configMap:
            name: hami-scheduler-device
      nodeSelector: 
        gpu: "on"
---
# Source: hami/templates/scheduler/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
      app.kubernetes.io/name: hami
      app.kubernetes.io/instance: hami
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-scheduler
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        hami.io/webhook: ignore
    spec:
      imagePullSecrets: 
        []
      serviceAccountName: hami-scheduler
      priorityClassName: system-node-critical
      containers:
        - name: kube-scheduler
          image: registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.31.0
          imagePullPolicy: "IfNotPresent"
          command:
            - kube-scheduler
            - --config=/config/config.yaml
            - -v=4
            - --leader-elect=true
            - --leader-elect-resource-name=hami-scheduler
            - --leader-elect-resource-namespace=kube-system
          volumeMounts:
            - name: scheduler-config
              mountPath: /config
        - name: vgpu-scheduler-extender
          image: projecthami/hami:latest
          imagePullPolicy: "IfNotPresent"
          env:
          command:
            - scheduler
            - --http_bind=0.0.0.0:443
            - --cert_file=/tls/tls.crt
            - --key_file=/tls/tls.key
            - --scheduler-name=hami-scheduler
            - --metrics-bind-address=:9395
            - --node-scheduler-policy=binpack
            - --gpu-scheduler-policy=spread
            - --device-config-file=/device-config.yaml
            - --debug
            - -v=4
          ports:
            - name: http
              containerPort: 443
              protocol: TCP
          volumeMounts:
            - name: tls-config
              mountPath: /tls
            - name: device-config
              mountPath: /device-config.yaml
              subPath: device-config.yaml
      volumes:
        - name: tls-config
          secret:
            secretName: hami-scheduler-tls
        - name: scheduler-config
          configMap:
            name: hami-scheduler-newversion
        - name: device-config
          configMap:
            name: hami-scheduler-device
---
# Source: hami/templates/scheduler/webhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: hami-webhook
webhooks:
  - admissionReviewVersions:
    - v1beta1
    clientConfig:
      service:
        name: hami-scheduler
        namespace: kube-system
        path: /webhook
        port: 443
    failurePolicy: Ignore
    matchPolicy: Equivalent
    name: vgpu.hami.io
    namespaceSelector:
      matchExpressions:
      - key: hami.io/webhook
        operator: NotIn
        values:
        - ignore
    objectSelector:
      matchExpressions:
      - key: hami.io/webhook
        operator: NotIn
        values:
        - ignore
    reinvocationPolicy: Never
    rules:
      - apiGroups:
          - ""
        apiVersions:
          - v1
        operations:
          - CREATE
        resources:
          - pods
        scope: '*'
    sideEffects: None
    timeoutSeconds: 10
---
# Source: hami/templates/scheduler/job-patch/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
---
# Source: hami/templates/scheduler/job-patch/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
rules:
  - apiGroups:
      - admissionregistration.k8s.io
    resources:
      #- validatingwebhookconfigurations
      - mutatingwebhookconfigurations
    verbs:
      - get
      - update
---
# Source: hami/templates/scheduler/job-patch/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name:  hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: hami-admission
subjects:
  - kind: ServiceAccount
    name: hami-admission
    namespace: "kube-system"
---
# Source: hami/templates/scheduler/job-patch/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name:  hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
rules:
  - apiGroups:
      - ""
    resources:
      - secrets
    verbs:
      - get
      - create
---
# Source: hami/templates/scheduler/job-patch/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: hami-admission
subjects:
  - kind: ServiceAccount
    name: hami-admission
    namespace: "kube-system"
---
# Source: hami/templates/scheduler/job-patch/job-createSecret.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hami-admission-create
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
spec:
  template:
    metadata:
      name: hami-admission-create
      labels:
        helm.sh/chart: hami-2.4.0
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        app.kubernetes.io/version: "2.4.0"
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/component: admission-webhook
        hami.io/webhook: ignore
    spec:
      imagePullSecrets: 
        []
      containers:
        - name: create
          image: liangjw/kube-webhook-certgen:v1.1.1
          imagePullPolicy: IfNotPresent
          args:
            - create
            - --cert-name=tls.crt
            - --key-name=tls.key
            - --host=hami-scheduler.kube-system.svc,127.0.0.1
            - --namespace=kube-system
            - --secret-name=hami-scheduler-tls
      restartPolicy: OnFailure
      serviceAccountName: hami-admission
      securityContext:
        runAsNonRoot: true
        runAsUser: 2000
---
# Source: hami/templates/scheduler/job-patch/job-patchWebhook.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hami-admission-patch
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
spec:
  template:
    metadata:
      name: hami-admission-patch
      labels:
        helm.sh/chart: hami-2.4.0
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        app.kubernetes.io/version: "2.4.0"
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/component: admission-webhook
        hami.io/webhook: ignore
    spec:
      imagePullSecrets: 
        []
      containers:
        - name: patch
          image: liangjw/kube-webhook-certgen:v1.1.1
          imagePullPolicy: IfNotPresent
          args:
            - patch
            - --webhook-name=hami-webhook
            - --namespace=kube-system
            - --patch-validating=false
            - --secret-name=hami-scheduler-tls
      restartPolicy: OnFailure
      serviceAccountName: hami-admission
      securityContext:
        runAsNonRoot: true
        runAsUser: 2000
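A rendered manifest this long is easier to review with a quick inventory. The sketch below counts the top-level objects per kind; `count_kinds` is a made-up helper name. One thing to keep in mind: the hami-admission Jobs carry helm hook annotations, and plain kubectl does not act on those, so the ordering helm would enforce is not applied when the file is deployed directly.

```shell
# Prints how many objects of each kind a rendered manifest contains.
# Only unindented "kind:" lines are counted, so nested fields such as
# roleRef.kind or ConfigMap payloads are skipped.
# count_kinds is a hypothetical helper name.
count_kinds() {
    grep '^kind:' "$1" | awk '{ n[$2]++ } END { for (k in n) print k, n[k] }' | sort
}

# Usage (not executed here):
#   count_kinds hami-install.yaml
```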

Deploy dcgm-exporter:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "3.6.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "3.6.1"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"

---

kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
  ports:
  - name: "metrics"
    port: 9400

Once its pods are running, dcgm-exporter is installed.

Download the panel JSON file from the hami-vgpu dashboard page: hami-vgpu-dashboard | Grafana Labs.

After the import, grafana creates a dashboard named "hami-vgpu-dashboard", but some panels on that page, such as vGPUCorePercentage, have no data yet.

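ServiceMonitor is a custom resource of the Prometheus Operator, used mainly to monitor services in Kubernetes. It serves two purposes:

1. Automatic discovery

A ServiceMonitor lets Prometheus discover and monitor services in Kubernetes automatically. By defining a ServiceMonitor, you tell Prometheus which service endpoints to scrape.

2. Scrape configuration

A ServiceMonitor also carries the scrape parameters, for example:

  • Scrape interval: how often Prometheus scrapes the data (for example every 30 seconds).
  • Timeout: the timeout for each scrape request.
  • Label selector: the labels of the services to monitor, so that Prometheus scrapes only the relevant services.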
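The three parameters above map onto ServiceMonitor fields roughly as follows. This is an illustrative sketch only: the name example-svc-monitor, the label app: example, and the port name metrics are placeholders, and `write_example_monitor` is a made-up helper that just writes the manifest to a file.

```shell
# Writes an illustrative ServiceMonitor manifest to the path given as $1.
# All names and labels in the manifest are hypothetical placeholders.
write_example_monitor() {
    cat > "$1" <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example          # label selector: which Services to scrape
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - port: metrics         # name of the Service port to scrape
      path: /metrics
      interval: 30s         # scrape interval
      scrapeTimeout: 10s    # per-scrape timeout, must not exceed interval
EOF
}

# Usage (not executed here):
#   write_example_monitor example-svc-monitor.yaml
#   kubectl apply -f example-svc-monitor.yaml
```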

Two ServiceMonitors need to be configured here, one for hami-device-plugin and one for hami-scheduler:

hami-device-plugin-svc-monitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - path: /metrics
      port: monitorport
      interval: "15s"
      honorLabels: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_endpoints_name]
          regex: hami-.*
          replacement: $1
          action: keep
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          regex: (.*)
          targetLabel: node_name
          replacement: ${1}
          action: replace
        - sourceLabels: [__meta_kubernetes_pod_host_ip]
          regex: (.*)
          targetLabel: ip
          replacement: $1
          action: replace

hami-scheduler-svc-monitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-scheduler-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - path: /metrics
      port: monitor
      interval: "15s"
      honorLabels: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_endpoints_name]
          regex: hami-.*
          replacement: $1
          action: keep
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          regex: (.*)
          targetLabel: node_name
          replacement: ${1}
          action: replace
        - sourceLabels: [__meta_kubernetes_pod_host_ip]
          regex: (.*)
          targetLabel: ip
          replacement: $1
          action: replace

Confirm that the ServiceMonitors were created.
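One way to do that from the command line is sketched below, assuming the two names used in the manifests above. `missing_monitors` is a made-up helper; it expects the `-o name` output format, where each line ends in `/<name>`.

```shell
# Reads `kubectl get servicemonitor -o name` output on stdin and prints
# each expected ServiceMonitor that is NOT present.
# missing_monitors is a hypothetical helper name.
missing_monitors() {
    awk -F/ '
        { seen[$NF] = 1 }
        END {
            n = split("hami-device-plugin-svc-monitor hami-scheduler-svc-monitor", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen)) print want[i]
        }'
}

# On a live cluster (not executed here):
#   kubectl get servicemonitor -n kube-system -o name | missing_monitors
```

Empty output means both ServiceMonitors exist.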

Start a GPU pod to test:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-1
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10

If the pod stays in the Pending state, check the node. If the node reports 0 GPUs, the following is needed:

docker:
    1. Download and install the NVIDIA-DOCKER2 package
    2. Add the following to /etc/docker/daemon.json:
        {
            "default-runtime": "nvidia",
            "runtimes": {
                "nvidia": {
                    "path": "/usr/bin/nvidia-container-runtime",
                    "runtimeArgs": []
                }
            }
        }
k8s:
    1. Pull the k8s-device-plugin image
    2. Write nvidia-device-plugin.yml to create the driver pod

Create it with this yml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.11
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

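Once the GPU pod is running, exec into it and check the GPU. If the GPU memory it sees equals the configured limit, the limit has taken effect.

Visit {scheduler node ip}:31993/metrics.

The last two lines of the output are: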

vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-1",podnamespace="default",zone="vGPU"} 1.048576e+10
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-2",podnamespace="default",zone="vGPU"} 1.048576e+10

Both lines carry the same deviceuuid, which shows that a single GPU is shared by different pods.

Exec into a hami-device-plugin DaemonSet pod and run nvidia-smi -L to list all GPUs on the machine:
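
The same check can be done programmatically. The following Python sketch parses the two metric lines quoted above and groups pods by deviceuuid. The regular expressions and the function name are my own, and the conversion to MiB assumes the sample value is reported in bytes.

```python
import re
from collections import defaultdict

SAMPLE = '''\
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-1",podnamespace="default",zone="vGPU"} 1.048576e+10
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-2",podnamespace="default",zone="vGPU"} 1.048576e+10
'''

LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')   # name{labels} value
LABEL = re.compile(r'(\w+)="([^"]*)"')          # key="value" pairs

def pods_per_device(text, metric="vGPUPodsDeviceAllocated"):
    """Map each deviceuuid to a list of (podname, allocated MiB, core share)."""
    shared = defaultdict(list)
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m or m.group(1) != metric:
            continue
        labels = dict(LABEL.findall(m.group(2)))
        mib = float(m.group(3)) / 2**20  # assumption: the value is in bytes
        shared[labels["deviceuuid"]].append(
            (labels["podname"], mib, int(labels["deviceusedcore"]))
        )
    return dict(shared)

result = pods_per_device(SAMPLE)
for uuid, pods in result.items():
    print(uuid, pods)
```

A device whose list holds more than one pod is being shared.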

root@node126:/# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-7666e9de-679b-a768-51c6-260b81cd00ec)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-9f32af29-1a72-6e47-af2c-72b1130a176b)
root@node126:/# 
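
To find out which physical card the shared deviceuuid from the scheduler metrics belongs to, the `nvidia-smi -L` output can be turned into a lookup table. This is a minimal sketch: the regular expression and the function name are assumptions of mine, based only on the two output lines shown here.

```python
import re

SMI_OUTPUT = """\
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-7666e9de-679b-a768-51c6-260b81cd00ec)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-9f32af29-1a72-6e47-af2c-72b1130a176b)
"""

GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (GPU-[0-9a-f-]+)\)$")

def gpus_by_uuid(text):
    """Map each GPU UUID to its (index, model name) from `nvidia-smi -L` output."""
    gpus = {}
    for line in text.splitlines():
        m = GPU_LINE.match(line.strip())
        if m:
            gpus[m.group(3)] = (int(m.group(1)), m.group(2))
    return gpus

gpus = gpus_by_uuid(SMI_OUTPUT)
# deviceuuid taken from the vGPUPodsDeviceAllocated lines above
shared_uuid = "GPU-7666e9de-679b-a768-51c6-260b81cd00ec"
print(gpus[shared_uuid])
```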

The two ServiceMonitors created earlier scrape the /metrics endpoints of the services labeled app.kubernetes.io/component: hami-scheduler and app.kubernetes.io/component: hami-device-plugin.

Once the GPU pod is running, open the hami-vgpu-metrics-dashboard in Grafana.
