gpu-manager安装及测试

news2025/1/23 17:30:31

提示:GPU-manager安装为主部分内容做了升级开箱即用,有用请点收藏❤抱拳

文章目录

  • 前言
  • 一、约束条件
  • 二、使用步骤
    • 1.下载镜像
        • 1.1 查看当前虚拟机的驱动类型:
    • 2.部署gpu-manager
    • 3.部署gpu-admission
    • 4.修改kube-scheduler.yaml![在这里插入图片描述](https://img-blog.csdnimg.cn/f76aab5b5bfc4f9c80eb4cd4721efeee.png)
      • 4.1 新建/etc/kubernetes/scheduler-policy-config.json
      • 4.2 新建/etc/kubernetes/scheduler-extender.yaml
      • 4.3 修改/etc/kubernetes/manifests/kube-scheduler.yaml
    • 4.1 结果查看
    • 测试
  • 总结


前言

本文只做开箱即用部分,想了解GPUManager虚拟化方案技术层面请直接点击:GPUmanager虚拟化方案


一、约束条件

1、虚拟机需要完成直通式绑定,也就是物理GPU与虚拟机绑定,我做的是hyper-v的虚拟机绑定参照上一篇文章
2、对于k8s要求1.10版本以上
3、GPU-Manager 要求集群内包含 GPU 机型节点
4、每张 GPU 卡一共有100个单位的资源,仅支持0 - 1的小数卡,以及1的倍数的整数卡设置。显存资源是以256MiB为最小的一个单位的分配显存
我的版本:k8s-1.20

二、使用步骤

1.下载镜像

镜像地址:https://hub.docker.com/r/tkestack/gpu-manager/tags
manager:docker pull tkestack/gpu-manager:v1.1.5
https://hub.docker.com/r/tkestack/gpu-quota-admission/tags
admission:docker pull tkestack/gpu-quota-admission:v1.0.0

1.1 查看当前虚拟机的驱动类型:

docker info 

在这里插入图片描述

2.部署gpu-manager

拥有GPU节点打标签:

kubectl label node XX nvidia-device-enable=enable

如果docker驱动是systemd 需要在yaml指定,因为GPUmanager默认cgroupfs
在这里插入图片描述
创建yaml内容如下:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-manager
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager-role
subjects:
  - kind: ServiceAccount
    name: gpu-manager
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: tencent.com/vcuda-core
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # only run node has gpu device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
        - image: tkestack/gpu-manager:v1.1.5
          imagePullPolicy: IfNotPresent
          name: gpu-manager
          securityContext:
            privileged: true
          ports:
            - containerPort: 5678
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vdriver
              mountPath: /etc/gpu-manager/vdriver
            - name: vmdata
              mountPath: /etc/gpu-manager/vm
            - name: log
              mountPath: /var/log/gpu-manager
            - name: checkpoint
              mountPath: /etc/gpu-manager/checkpoint
            - name: run-dir
              mountPath: /var/run
            - name: cgroup
              mountPath: /sys/fs/cgroup
              readOnly: true
            - name: usr-directory
              mountPath: /usr/local/host
              readOnly: true
            - name: kube-root
              mountPath: /root/.kube
              readOnly: true
          env:
            - name: LOG_LEVEL
              value: "4"
            - name: EXTRA_FLAGS
              value: "--cgroup-driver=systemd"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            type: Directory
            path: /var/lib/kubelet/device-plugins
        - name: vmdata
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vm
        - name: vdriver
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vdriver
        - name: log
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/log
        - name: checkpoint
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/checkpoint
        # We have to mount the whole /var/run directory into container, because of bind mount docker.sock
        # inode change after host docker is restarted
        - name: run-dir
          hostPath:
            type: Directory
            path: /var/run
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        # We have to mount /usr directory instead of specified library path, because of non-existing
        # problem for different distro
        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr
        - name: kube-root
          hostPath:
            type: Directory
            path: /root/.kube

执行yaml文件:

kubectl apply -f gpu-manager.yaml
kubectl get pod -A|grep  gpu 查询结果

3.部署gpu-admission

创建yaml内容如下:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-admission
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-admission-as-kube-scheduler
subjects:
  - kind: ServiceAccount
    name: gpu-admission
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:kube-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-admission-as-volume-scheduler
subjects:
  - kind: ServiceAccount
    name: gpu-admission
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:volume-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-admission-as-daemon-set-controller
subjects:
  - kind: ServiceAccount
    name: gpu-admission
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:controller:daemon-set-controller
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
    app: gpu-admission
  name: gpu-admission
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      serviceAccountName: gpu-admission
      containers:
        - image: thomassong/gpu-admission:47d56ae9
          name: gpu-admission
          env:
            - name: LOG_LEVEL
              value: "4"
          ports:
            - containerPort: 3456
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      priority: 2000000000
      priorityClassName: system-cluster-critical
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-admission
  namespace: kube-system
spec:
  ports:
    - port: 3456
      protocol: TCP
      targetPort: 3456
  selector:
    app: gpu-admission
  type: ClusterIP

执行yaml文件:

kubectl create -f gpu-admission.yaml
kubectl get pod -A|grep  gpu 查询结果

4.修改kube-scheduler.yaml在这里插入图片描述

4.1 新建/etc/kubernetes/scheduler-policy-config.json

创建内容:

vim /etc/kubernetes/scheduler-policy-config.json
复制如下内容:
{
    "kind": "Policy",
    "apiVersion": "v1",
    "predicates": [
        {
            "name": "PodFitsHostPorts"
        },
        {
            "name": "PodFitsResources"
        },
        {
            "name": "NoDiskConflict"
        },
        {
            "name": "MatchNodeSelector"
        },
        {
            "name": "HostName"
        }
    ],
    "priorities": [
        {
            "name": "BalancedResourceAllocation",
            "weight": 1
        },
        {
            "name": "ServiceSpreadingPriority",
            "weight": 1
        }
    ],
    "extenders": [
        {
            "urlPrefix": "http://gpu-admission.kube-system:3456/scheduler",
            "apiVersion": "v1beta1",
            "filterVerb": "predicates",
            "enableHttps": false,
            "nodeCacheCapable": false
        }
    ],
    "hardPodAffinitySymmetricWeight": 10,
    "alwaysCheckAllPredicates": false
}

4.2 新建/etc/kubernetes/scheduler-extender.yaml

创建内容:

vim /etc/kubernetes/scheduler-extender.yaml
复制如下内容:
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: "/etc/kubernetes/scheduler.conf"
algorithmSource:
  policy:
    file:
      path: "/etc/kubernetes/scheduler-policy-config.json"


4.3 修改/etc/kubernetes/manifests/kube-scheduler.yaml

修改内容:

vim /etc/kubernetes/manifests/kube-scheduler.yaml
复制如下内容:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
    - command:
        - kube-scheduler
        - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
        - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
        - --bind-address=0.0.0.0
        - --feature-gates=TTLAfterFinished=true,ExpandCSIVolumes=true,CSIStorageCapacity=true,RotateKubeletServerCertificate=true
        - --kubeconfig=/etc/kubernetes/scheduler.conf
        - --leader-elect=true
        - --port=0
        - --config=/etc/kubernetes/scheduler-extender.yaml
      image: registry.cn-beijing.aliyuncs.com/kubesphereio/kube-scheduler:v1.22.10
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 8
        httpGet:
          path: /healthz
          port: 10259
          scheme: HTTPS
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 15
      name: kube-scheduler
      resources:
        requests:
          cpu: 100m
      startupProbe:
        failureThreshold: 24
        httpGet:
          path: /healthz
          port: 10259
          scheme: HTTPS
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 15
      volumeMounts:
        - mountPath: /etc/kubernetes/scheduler.conf
          name: kubeconfig
          readOnly: true
        - mountPath: /etc/localtime
          name: localtime
          readOnly: true
        - mountPath: /etc/kubernetes/scheduler-extender.yaml
          name: extender
          readOnly: true
        - mountPath: /etc/kubernetes/scheduler-policy-config.json
          name: extender-policy
          readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
    - hostPath:
        path: /etc/kubernetes/scheduler.conf
        type: FileOrCreate
      name: kubeconfig
    - hostPath:
        path: /etc/localtime
        type: File
      name: localtime
    - hostPath:
        path: /etc/kubernetes/scheduler-extender.yaml
        type: FileOrCreate
      name: extender
    - hostPath:
        path: /etc/kubernetes/scheduler-policy-config.json
        type: FileOrCreate
      name: extender-policy
status: {}

修改内容入下:
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
修改完成k8s自动重启,如果没有重启执行 kubectl delete pod -n [podname]

4.1 结果查看

执行命令:

 kubectl describe node master[节点名称]

在这里插入图片描述

测试

镜像下载:docker pull gaozhenhai/tensorflow-gputest:0.2
创建yaml内容: vim vcuda-test.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: vcuda-test
    qcloud-app: vcuda-test
  name: vcuda-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: vcuda-test
  template:
    metadata:
      labels:
        k8s-app: vcuda-test
        qcloud-app: vcuda-test
    spec:
      containers:
        - command:
            - sleep
            - 360000s
          env:
            - name: PATH
              value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
          image: gaozhenhai/tensorflow-gputest:0.2
          imagePullPolicy: IfNotPresent
          name: tensorflow-test
          resources:
            limits:
              cpu: "4"
              memory: 8Gi
              tencent.com/vcuda-core: "50"
              tencent.com/vcuda-memory: "32"
            requests:
              cpu: "4"
              memory: 8Gi
              tencent.com/vcuda-core: "50"
              tencent.com/vcuda-memory: "32"

启动yaml:kubectl apply -f vcuda-test.yaml
进入容器:

kubectl exec -it `kubectl get pods -o name | cut -d '/' -f2` -- bash

执行测试命令:

cd /data/tensorflow/cifar10 && time python cifar10_train.py

查看结果:

执行命令:nvidia-smi pmon -s u -d 1、命令查看GPU资源使用情况

总结

到此vgpu容器层虚拟化全部完成

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/860077.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

Web Server市场占有率调查

目录 一、理论 1.Web Server市场占有率调查 一、理论 1.Web Server市场占有率调查 (1)netcraft ①查询 每月netcraft公司都会出一次调查报告 netcraft官方:Netcraft | Leader in Phishing Detection, Cybercrime Disruption and Websi…

Samba(二)

问题 Rocky Linux使用smbclient访问win11的共享文件时提示 Error NT_STATUS_IO_TIMEOUT 分析 通过测试,发现关闭windows公用网络防火墙时,可正常显示服务器端所分享出来的所有资源;进一步发现单独放行防火墙进站规则中的文件和打印机共享&a…

B树的插入与删除过程

B树的插入 原树: 插入key后,若导致原节点关键字数超过上限,则从中间位置( ⌈ m 2 ⌉ \lceil\frac{m}{2}\rceil ⌈2m​⌉)将关键字分成两部分,左部分包含的关键字放在原节点中,右部分包含的关键…

面试题:说说vue2的生命周期函数?说说vue3的生命周期函数?说说vue2和vue3的生命周期函数对比?

说说vue2的生命周期函数?说说vue3的生命周期函数?说说vue2和vue3的生命周期函数对比? 一、说说vue2的生命周期函数1.1 vue生命周期分为四个阶段、8个钩子1.1.1 beforeCreate 和 created 初始化阶段1.1.2 beforeMount 和 mounted 挂载阶段1.1.…

利用XLL文件投递Qbot银行木马的钓鱼活动分析

1概述 近期,安天CERT发现了一起利用恶意Microsoft Excel加载项(XLL)文件投递Qbot银行木马的恶意活动。攻击者通过发送垃圾邮件来诱导用户打开附件中的XLL文件,一旦用户安装并激活Microsoft Excel加载项,恶意代码将被执…

合并区间——力扣56

文章目录 题目描述解法一 排序题目描述 解法一 排序 vector<vector<int>> merge(vector<vector<int>>& intervals

SpringCloud实用篇5——elasticsearch基础

目录 1.初识elasticsearch1.1 了解ES1.1.1 elasticsearch的作用1.1.2 ELK技术栈1.1.3 elasticsearch和lucene1.1.4 总结 1.2.倒排索引1.2.1.正向索引1.2.2.倒排索引1.2.3.正向和倒排 1.3 es的一些概念1.3.1 文档和字段1.3.2 索引和映射1.3.3 mysql与elasticsearch 1.4 部署单点…

Linux MQTT智能家居项目(智能家居界面布局)

文章目录 前言一、创建工程项目二、界面布局准备工作三、正式界面布局总结 前言 一、创建工程项目 1.选择工程名称和项目保存路径 2.选择QWidget 3.添加保存图片的资源文件&#xff1a; 在工程目录下添加Icon文件夹保存图片&#xff1a; 将文件放入目录中&#xff1a; …

https的原理和方案

文章目录 https原理为什么要加密常见的加密方式对称加密非对称加密数据摘要&&数据指纹数据签名 https的几种工作方案方案一&#xff1a;只使用对称加密方案二&#xff1a;只使用非对称加密方案三&#xff1a;两端都使用非对称加密方案四&#xff1a;非对称加密 对称加…

2023年十款开源测试开发工具推荐

1、AutoMeter-API 自动化测试平台 AutoMeter 是一款针对分布式服务&#xff0c;微服务 API 做功能和性能一体化的自动化测试平台&#xff0c;一站式提供发布单元&#xff0c;API&#xff0c;环境&#xff0c;用例&#xff0c;前置条件&#xff0c;场景&#xff0c;计划&#x…

计算机工作原理:进程调度

在计算机中&#xff0c;什么是进程&#xff1f;一个跑起来的程序就是一个进程&#xff0c;没跑起来就只能算一个程序。 在windows的任务管理器中&#xff0c;可以很清楚的看到有哪一些进程。 进程&#xff08;progress&#xff09;也叫任务&#xff08;task&#xff09;。 每…

C语言进阶第二课-----------指针的进阶----------升级版

作者前言 &#x1f382; ✨✨✨✨✨✨&#x1f367;&#x1f367;&#x1f367;&#x1f367;&#x1f367;&#x1f367;&#x1f367;&#x1f382; ​&#x1f382; 作者介绍&#xff1a; &#x1f382;&#x1f382; &#x1f382; &#x1f389;&#x1f389;&#x1f389…

腾讯云服务器配置选择方法

腾讯云服务器配置如何选择&#xff1f;CPU内存、带宽和系统盘怎么选择合适&#xff1f;个人用户可以选择轻量应用服务器&#xff0c;企业用户可以选择云服务器CVM&#xff0c;2核2G3M带宽轻量服务器95元一年、2核4G5M服务器168元一年&#xff0c;企业用户可以选择标准型S5云服务…

Gitlab CI/CD笔记-第二天-使用maven打包并且使用主机套接字进行构建并push镜像。

一、安装gitlab-runner 1.可以是linux也可以是docker的 2.本文说的是docker安装部署的。 二、直接上.gitlab-ci.yml stages: # List of stages for jobs, and their order of execution - build-image build-image-job: stage: build-image image: harbor.com:543/docke…

UEFI: 模块和包概述

UEFI规范中有两个最重要的概念&#xff1a;模块&#xff08;Module)和包&#xff08;Package&#xff09;。 模块 Module UEFI上最小的可单独编译的代码单元&#xff0c;或者是预编译的二进制文件比如efi执行文件。 包 Package 由模块、平台描述文件&#xff08;DSC)和包声明…

Leecode长度最小的子数组_209

209.长度最小的子数组 题目建议&#xff1a; 本题关键在于理解滑动窗口&#xff0c;这个滑动窗口看文字讲解 还挺难理解的&#xff0c;建议大家先看视频讲解。 拓展题目可以先不做。 题目链接&#xff1a;力扣&#xff08;LeetCode&#xff09;官网 - 全球极客挚爱的技术成…

exchange partition global index

EXCHANGE PARTITION with a Table having a UNIQUE INDEX and PK Constraint. (Doc ID 1620636.1)​编辑To Bottom In this Document Symptoms Changes Cause Solution References APPLIES TO: Oracle Database - Enterprise Edition - Version 11.2.0.3 and later Oracle Da…

day16:static、final、常量、抽象类

一、static 静态变量能否继承&#xff1f; 静态变量属于类&#xff0c;是共享的资源&#xff0c;不认为是被继承的 静态变量不可以定义在静态方法中&#xff0c; 静态方法中先于对象存在&#xff0c; 不能使用 this super 静态方法中可以直接调用静态方法。 静态方法不能直…

MySQL中基础查询语句

用户表user数据如下&#xff1a; iddevice_idgenderageuniversityprovince12138male21北京大学Beijing23214male复旦大学Shanghai36543famale20北京大学Deijing42315female 23 浙江大学ZheJiang55432male25山东大学Shandong 1&#xff0c;写出ddl语句创建如上表&#xff0c;…

idea更改背景-给idea设置个性化背景

一&#xff0c;具体操作 按两次键盘Shift,打开快速查找/搜索功能 输入setb 选择Set Backgrounf Image 选择本地图片 二&#xff0c;推荐图片网站 Awesome Wallpapers - wallhaven.cc 该网站拥有大量免费高清图片可以白嫖