First, find out which GPU model the server has.
[kfk@bigdata-pro01 ~]$ lshw -C display
WARNING: you should run this program as super-user.
*-display
description: VGA compatible controller
product: SVGA II Adapter
vendor: VMware
physical id: f
bus info: pci@0000:00:0f.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: vga_controller bus_master cap_list rom
configuration: driver=vmwgfx latency=64
resources: irq:16 ioport:1070(size=16) memory:e8000000-efffffff memory:fe000000-fe7fffff memory:c0400000-c0407fff
WARNING: output may be incomplete or inaccurate, you should run this program as super-user.
[kfk@bigdata-pro01 ~]$
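The lshw output above shows a VMware virtual adapter. On a machine with a physical NVIDIA card, lspci is another common way to find the model. The following is only a minimal sketch, not part of the original procedure; the helper name filter_nvidia is made up for illustration, and the script assumes lspci (from pciutils) is installed.

```shell
# Keep only the lines of an lspci-style listing that mention NVIDIA.
# Reads the listing on stdin; prints nothing if there is no match.
filter_nvidia() {
    grep -i 'nvidia' || true
}

# Query the real hardware only when lspci is available on this machine.
if command -v lspci >/dev/null 2>&1; then
    lspci | filter_nvidia
fi
```

If the filter prints nothing, the machine (or virtual machine) has no NVIDIA device visible on the PCI bus, and installing the driver will not help.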
Then download the driver that matches that GPU model (and your operating system) from the official NVIDIA website.
Disable nouveau
Edit dist-blacklist.conf:
vim /usr/lib/modprobe.d/dist-blacklist.conf
Append the following at the end of the file:
blacklist nouveau
options nouveau modeset=0
After the edit, the file looks like this (the first comment lines are cut off in this listing):
# mode tools can also control driver binding.
#Syntax: see modprobe.conf(5).
#watchdog drivers
blacklist i8xx_tco
#framebuffer drivers
blacklist aty128fb
blacklist atyfb
blacklist radeonfb
blacklist i810fb
blacklist cirrusfb
blacklist intelfb
blacklist kyrofb
blacklist i2c-matroxfb
blacklist hgafb
#blacklist nvidiafb
blacklist rivafb
blacklist savagefb
blacklist sstfb
blacklist neofb
blacklist tridentfb
blacklist tdfxfb
blacklist virgefb
blacklist vga16fb
blacklist viafb
#ISDN - see bugs 154799, 159068
blacklist hisax
blacklist hisax_fcpcipnp
#sound drivers
blacklist snd-pcsp
#I/O dynamic configuration support for s390x (bz #563228)
blacklist chsc_sch
#crypto algorithms
blacklist sha1-mb
#see bz #1562114
blacklist sha256-mb
blacklist sha512-mb
blacklist nouveau
options nouveau modeset=0
Comment out the blacklist nvidiafb line:
#blacklist nvidiafb
Edit blacklist.conf
Add blacklist nouveau to it. If /etc/modprobe.d does not exist yet, create it first:
mkdir -p /etc/modprobe.d
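Editing these files by hand in vim works, but it is easy to add the same line twice when the steps are repeated. As a rough sketch (the helper ensure_line is hypothetical, not something the original steps use), the lines can be appended only when they are missing:

```shell
# Append a line to a file only if that exact line is not already there.
# $1 = file path, $2 = line to add
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Usage, as root (paths taken from the steps above):
#   ensure_line /usr/lib/modprobe.d/dist-blacklist.conf "blacklist nouveau"
#   ensure_line /usr/lib/modprobe.d/dist-blacklist.conf "options nouveau modeset=0"
```

grep -x matches whole lines only and -F treats the text literally, so a commented-out "#blacklist nouveau" does not count as already present.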
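Rebuild initramfs-3.10.0-957.el7.x86_64.img
Here 3.10.0-957.el7 is the kernel version, so the file name differs slightly from one kernel to another. Back up the current image first: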
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
dracut /boot/initramfs-$(uname -r).img $(uname -r)
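If these two commands are run a second time, the mv would replace the original backup with an image that already has nouveau disabled. A small sketch that guards against that is shown below; the function names are invented for illustration, and nothing runs until rebuild_initramfs is called as root.

```shell
# Build the backup file name used above for a given kernel release.
initramfs_backup_path() {
    printf '/boot/initramfs-%s-nouveau.img\n' "$1"
}

# Back up the current image once, then regenerate it with dracut.
rebuild_initramfs() {
    kver=$(uname -r)
    img="/boot/initramfs-${kver}.img"
    bak=$(initramfs_backup_path "$kver")
    # Keep the first backup; never overwrite it on a re-run.
    if [ ! -e "$bak" ]; then
        mv "$img" "$bak" || return 1
    fi
    # --force lets dracut overwrite an image that is still in place.
    dracut --force "$img" "$kver"
}
```

Keeping the untouched backup means the old image can be moved back into place if the new one fails to boot.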
Install kernel-devel
This step is also critical. If kernel-devel is not installed yet, install it now; otherwise the NVIDIA driver installer fails with error code 256.
yum install kernel-devel kernel-headers -y
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
Package kernel-headers-3.10.0-957.el7.x86_64 already installed and latest version
Resolving Dependencies
--> Running transaction check
---> Package kernel-devel.x86_64 0:3.10.0-957.el7 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
=========================================================================================================================================================================================================================================
Package Arch Version Repository Size
Installing:
kernel-devel x86_64 3.10.0-957.el7 base 17 M
Transaction Summary
Install 1 Package
Total download size: 17 M
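The kernel-devel version has to match the running kernel, as it does in the output above (3.10.0-957.el7). A quick check, sketched here under the assumption that kernel-devel puts its files in /usr/src/kernels/<release> as it does on CentOS 7 (the function names are made up for this example):

```shell
# Directory where kernel-devel is expected to install files for a release.
kernel_src_dir() {
    printf '/usr/src/kernels/%s\n' "$1"
}

# Succeeds only if that directory exists for the running kernel.
has_matching_kernel_devel() {
    [ -d "$(kernel_src_dir "$(uname -r)")" ]
}

if has_matching_kernel_devel; then
    echo "kernel-devel found for $(uname -r)"
else
    echo "kernel-devel for $(uname -r) not found"
fi
```

If the check fails even though kernel-devel is installed, the package was probably built for a newer kernel than the one currently running.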
reboot
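The nouveau settings only take effect after a reboot. Once the machine is back up, run lsmod|grep nouveau and make sure it prints nothing, which confirms that the nouveau driver is disabled.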
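The same check can be scripted so that it gives an explicit yes/no answer instead of relying on empty output. This is just a sketch; nouveau_listed is a name chosen here, and it matches the module name exactly in the first column of lsmod-style text.

```shell
# Reads lsmod-style text on stdin.
# Exit status 0 if a module named exactly "nouveau" is listed, 1 otherwise.
nouveau_listed() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Check the live system only when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | nouveau_listed; then
        echo "nouveau is still loaded"
    else
        echo "nouveau is not loaded"
    fi
fi
```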
Install the driver
Run init 3 to switch to the text console.
Then run the CUDA run file:
chmod +x cuda_10.2.89_440.33.01_linux.run
./cuda_10.2.89_440.33.01_linux.run
Driver: Installed
Toolkit: Installed in /usr/local/cuda-10.2/
Samples: Installed in /root/, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda-10.2/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-10.2/lib64, or, add /usr/local/cuda-10.2/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.2/doc/pdf for detailed information on setting up CUDA.
Logfile is /var/log/cuda-installer.log
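The installer asks for PATH and LD_LIBRARY_PATH to include the CUDA directories. One way to do that for the current shell is sketched below; the variable name CUDA_HOME is only a convenience chosen for this example, and the lines would normally go into a profile file such as ~/.bashrc to make them permanent.

```shell
# Install location reported by the installer output above.
CUDA_HOME=/usr/local/cuda-10.2

# Prepend the CUDA directories; the ${VAR:+...} form avoids a stray ':'
# when the variable was empty or unset.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```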
Run nvidia-smi to confirm that the driver was installed correctly:
[root@ASR1 asr]# nvidia-smi
Mon Dec 5 22:48:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:31:00.0 Off |                    0 |
| N/A   65C    P0    32W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   63C    P0    24W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process
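For scripts, the table is awkward to parse. nvidia-smi can also print selected fields as CSV, which is easier to check automatically. The sketch below counts the GPUs it reports; count_gpus is a helper invented for this example, and it is only exercised against real hardware when nvidia-smi is on the PATH.

```shell
# Count the non-empty lines on stdin (one CSV line per GPU).
count_gpus() {
    grep -c . || true
}

# Query name and driver version per GPU, without the CSV header row.
if command -v nvidia-smi >/dev/null 2>&1; then
    n=$(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader | count_gpus)
    echo "GPUs reported by the driver: $n"
fi
```

On the server shown above, this would be expected to report two GPUs, one line per Tesla T4.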
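Summary
Identify the GPU model and the operating system, then download the matching driver from the NVIDIA website. Next, blacklist nouveau and install kernel-devel. Once that is done, use init 3 to switch to the text console and run the .run installer; when prompted, type accept and then choose Install. Finally, use nvidia-smi, which ships with the driver, to verify that the installation succeeded.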