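Configuring the GPU driver on GPU worker nodes in a k8s cluster
The operating system is CentOS 7.
1. Download and install the GPU driver
Check the GPU model: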
[root@VM-3-9-centos user]# lspci | grep -i nvidia
00:08.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
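On a node with many PCI devices it can help to count only the NVIDIA display devices rather than read the listing by eye. The following is a minimal sketch; count_nvidia_gpus is a hypothetical helper name, and the filter on "3D controller" and "VGA compatible controller" is an assumption about how lspci classifies the cards.

```shell
#!/usr/bin/env bash
# count_nvidia_gpus (hypothetical helper): read lspci-style text on stdin
# and print how many NVIDIA 3D/VGA controllers it lists.
count_nvidia_gpus() {
    grep -i 'nvidia' | grep -Eci '3D controller|VGA compatible controller' || true
}

# Typical use on a node (lspci comes from the pciutils package):
#   lspci | count_nvidia_gpus
```

A result of 0 suggests the card is not visible on the PCI bus, in which case installing the driver will not help.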
1.1 Download the driver from the official site
https://www.nvidia.cn/Download/index.aspx
Note: preferably pick a driver that supports CUDA 12.
1.2 Install the GPU driver
bash NVIDIA-Linux-x86_64-525.105.17.run
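The .run installer builds a kernel module, so it usually needs a compiler and kernel headers matching the running kernel, and it tends to refuse to run while the open-source nouveau driver is loaded. The sketch below covers those two checks; both function names are hypothetical, and the package list is an assumption for a stock CentOS 7 host.

```shell
#!/usr/bin/env bash
# has_nouveau (hypothetical helper): read lsmod-style text on stdin and
# succeed if the nouveau module itself is listed as loaded.
has_nouveau() {
    grep -q '^nouveau[[:space:]]'
}

# install_build_prereqs (hypothetical helper): install the packages the
# installer typically needs to compile its kernel module.
# Assumption: the matching kernel-devel package is available in the yum repos.
install_build_prereqs() {
    yum install -y gcc make "kernel-devel-$(uname -r)" "kernel-headers-$(uname -r)"
}

# Typical use on the node, before running the .run installer:
#   install_build_prereqs
#   lsmod | has_nouveau && echo "nouveau is loaded; blacklist it and reboot first"
```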
Check that the installation succeeded:
[root@VM-3-9-centos user]# nvidia-smi
Wed May 17 13:04:48 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:08.0 Off |                    0 |
| N/A   45C    P0    26W /  70W |   3414MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     19692      C   python3                          3410MiB |
+-----------------------------------------------------------------------------+
[root@VM-3-9-centos user]#
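For scripted checks, the driver version can be read in machine-friendly form and compared against a minimum. This sketch assumes GNU sort (for -V) and takes 525.60.13 as the minimum Linux driver for CUDA 12.0, a figure recalled from NVIDIA's CUDA release notes that should be verified there. version_ge is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# version_ge A B (hypothetical helper): succeed when dotted version A >= B.
# Relies on GNU sort -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Typical use on the GPU node:
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
#   version_ge "$drv" 525.60.13 && echo "driver is new enough for CUDA 12.0"
```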
Uninstalling the GPU driver
A server reboot is required.
/usr/bin/nvidia-uninstall
1.3 Install nvidia-docker2
yum install -y nvidia-docker2
yum install -y nvidia-container-runtime
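The two yum commands above only work if a repository providing these packages is configured. A sketch of how that repository was commonly added while nvidia-docker2 was current is shown below; the URL pattern is an assumption based on the nvidia-docker project's published instructions and may since have been superseded by the NVIDIA Container Toolkit repositories. nvidia_docker_repo_url is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# nvidia_docker_repo_url ID VERSION_ID (hypothetical helper): print the
# .repo file URL for a distribution, e.g. "centos" and "7".
# Assumption: the nvidia.github.io/nvidia-docker layout is still served.
nvidia_docker_repo_url() {
    printf 'https://nvidia.github.io/nvidia-docker/%s%s/nvidia-docker.repo' "$1" "$2"
}

# Typical use on the node:
#   . /etc/os-release
#   curl -s -L "$(nvidia_docker_repo_url "$ID" "$VERSION_ID")" \
#        -o /etc/yum.repos.d/nvidia-docker.repo
#   yum install -y nvidia-docker2
```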
2. Configure the environment to support the GPU
2.1 Edit daemon.json
{
  "registry-mirrors": [
    "https://tf72mndn.mirror.aliyuncs.com"
  ],
  "exec-opts": ["native.cgroupdriver=systemd"],
  "storage-driver": "overlay2",
  "log-opts": {
    "max-file": "3",
    "max-size": "500m"
  },
  "storage-opts": ["overlay2.override_kernel_check=true"],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
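The new settings only take effect after the Docker daemon restarts. The sketch below checks the file before restarting; it assumes the file lives at /etc/docker/daemon.json (Docker's default location) and that Docker is managed by systemd. runtime_is_nvidia is a hypothetical helper name, and the grep is a rough text match rather than a real JSON parse.

```shell
#!/usr/bin/env bash
# runtime_is_nvidia FILE (hypothetical helper): succeed when FILE sets
# "default-runtime" to "nvidia". Plain text match, not a JSON validator.
runtime_is_nvidia() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Typical use after editing the file:
#   runtime_is_nvidia /etc/docker/daemon.json && systemctl restart docker
#   docker info | grep -i 'default runtime'
```

If the restart worked, docker info should report nvidia as the default runtime.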
2.2 Deploy the NVIDIA device plugin for k8s
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
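Note: change the deployment type if needed. If several machines have GPUs, you can choose to deploy the plugin only to the servers that have GPUs.
2.3 Check the GPU from inside the K8S cluster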
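One way to restrict the plugin to GPU servers is a nodeSelector on the DaemonSet. The sketch below builds the patch for that. It assumes the DaemonSet is named nvidia-device-plugin-daemonset in kube-system (consistent with the pod name in the section 2.3 output) and that every GPU node carries the label nvidia.com/gpu.present=true; if not, the label has to be added first. gpu_selector_patch is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# gpu_selector_patch KEY VALUE (hypothetical helper): print a strategic-merge
# patch that adds nodeSelector KEY=VALUE to a DaemonSet's pod template.
gpu_selector_patch() {
    printf '{"spec":{"template":{"spec":{"nodeSelector":{"%s":"%s"}}}}}' "$1" "$2"
}

# Typical use from a machine with kubectl access:
#   kubectl label node vm-3-9-centos nvidia.com/gpu.present=true --overwrite
#   kubectl -n kube-system patch daemonset nvidia-device-plugin-daemonset \
#       -p "$(gpu_selector_patch nvidia.com/gpu.present true)"
```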
[root@VM-2-8-centos user]# kubectl describe node vm-3-9-centos |grep nv
nvidia.com/gpu.present=true
nvidia.com/gpu: 1
nvidia.com/gpu: 1
kube-system nvidia-device-plugin-daemonset-4p97n 0 (0%) 0 (0%) 0 (0%) 0 (0%) 85m
nvidia.com/gpu 1 1
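The same check can be scripted. This sketch pulls the first nvidia.com/gpu count (the Capacity entry) out of the describe output; gpu_count_from_describe is a hypothetical helper name, and the jsonpath alternative in the comments assumes kubectl's backslash escaping for dots in key names.

```shell
#!/usr/bin/env bash
# gpu_count_from_describe (hypothetical helper): read `kubectl describe node`
# text on stdin and print the first "nvidia.com/gpu:" value it finds.
gpu_count_from_describe() {
    awk '$1 == "nvidia.com/gpu:" { print $2; exit }'
}

# Typical use:
#   kubectl describe node vm-3-9-centos | gpu_count_from_describe
# Alternative without text parsing:
#   kubectl get node vm-3-9-centos \
#       -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```

An empty result suggests the device plugin has not registered the GPU on that node.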
2.4 Set the number of GPUs a container uses through Rancher
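Whatever the UI, the GPU count presumably ends up as an nvidia.com/gpu resource limit on the container, which is how the device plugin allocates GPUs. The sketch below generates an equivalent plain Pod manifest for a quick smoke test without Rancher. The pod name gpu-smoke-test and the image tag nvidia/cuda:12.0.0-base-ubuntu22.04 are assumptions for illustration; any CUDA image compatible with the driver should do. gpu_test_pod_manifest is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# gpu_test_pod_manifest COUNT (hypothetical helper): print a Pod manifest
# that requests COUNT GPUs (default 1) and runs nvidia-smi once.
gpu_test_pod_manifest() {
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.0.0-base-ubuntu22.04  # assumed tag, adjust as needed
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: ${1:-1}
EOF
}

# Typical use:
#   gpu_test_pod_manifest 1 | kubectl apply -f -
#   kubectl logs gpu-smoke-test
```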