Passing a Tesla P40 GPU Through on an R730

source link: https://shidawuhen.github.io/2023/06/24/R730%E7%9B%B4%E9%80%9ATesla-P40%E6%98%BE%E5%8D%A1/

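This post covers how to pass a Tesla P40 through to CentOS 7.7 and Windows Server 2016 guests on an R730 running ESXi. In passthrough mode the ordinary driver is all you need; the vGPU driver is not required.

The plan was to install the OS and set up RAID myself next, but I need the GPU now, so the GPU goes in first.

In early 2023 a used P40 cost around 800 yuan; by May the second-hand dealers were all asking 1,200. I found a private seller and got one for 950. You also need to buy a power cable, because the bundled one isn't long enough. They cost 25 each, and I bought two.

Why the P40? Mainly cost. I'd love to buy something better, but better is expensive. By today's standards its performance is fairly mediocre, but for something that costs about 1,000 yuan, what more could you ask for? And when you remember that it originally sold for over 30,000, it starts to look like a pretty good deal, doesn't it?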

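2. Version notes

Installing a GPU is full of pitfalls, most of them about version matching and compatibility. You can follow the materials and the procedure I give here.

First determine the CUDA version from the TensorFlow version, then determine the CentOS version from the CUDA version. Know exactly what you are going to install!

Different CUDA versions require different kernel and gcc versions, so be sure to check. Here is a brief summary: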

CUDA version   Linux kernel   GCC     CentOS version
11.2           3.10           4.8.5   CentOS 7.y (y <= 9)
11.4           3.10.0-1136    6
11.6           3.10.0-1160    6.x
12.1           3.10.0-1160    6.x
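
The selection order described above (TensorFlow, then CUDA, then kernel and gcc) can be sketched as a small lookup. This is only an illustration: the TensorFlow rows follow TensorFlow's published tested build configurations, the CUDA rows simply copy the table above, and the names TF_TO_CUDA, CUDA_REQUIREMENTS and plan_stack are made up for this sketch.

```python
from typing import Dict, Tuple

# TensorFlow GPU release -> (CUDA, cuDNN), per TensorFlow's tested build configurations.
TF_TO_CUDA: Dict[str, Tuple[str, str]] = {
    "2.5.0": ("11.2", "8.1"),
    "2.6.0": ("11.2", "8.1"),
}

# CUDA -> (Linux kernel, GCC), copied from the summary table in this post.
CUDA_REQUIREMENTS: Dict[str, Tuple[str, str]] = {
    "11.2": ("3.10", "4.8.5"),
    "11.4": ("3.10.0-1136", "6"),
    "11.6": ("3.10.0-1160", "6.x"),
    "12.1": ("3.10.0-1160", "6.x"),
}


def plan_stack(tf_version: str) -> Dict[str, str]:
    """Resolve the versions to install, starting from the TensorFlow version."""
    cuda, cudnn = TF_TO_CUDA[tf_version]      # KeyError for releases not listed here
    kernel, gcc = CUDA_REQUIREMENTS[cuda]
    return {
        "tensorflow": tf_version,
        "cuda": cuda,
        "cudnn": cudnn,
        "kernel": kernel,
        "gcc": gcc,
    }
```

Calling plan_stack("2.5.0") resolves to the CUDA 11.2 / cuDNN 8.1 combination used in the rest of this post.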

Kernel version for each CentOS release

CentOS version   Release   Kernel version
7.4              1708      3.10.0-693.el7.x86_64
7.6              1810
7.7              1908      3.10.0-1062.18.1.el7.x86_64
7.8              2003
7.9              2009      3.10.0-1160

ESXi: 6.7

CentOS 7.7: http://linuxsoft.cern.ch/centos-vault/7.7.1908/isos/x86_64/CentOS-7-x86_64-Minimal-1908.iso

GPU: Tesla P40

NVIDIA driver: https://us.download.nvidia.cn/tesla/460.106.00/NVIDIA-Linux-x86_64-460.106.00.run

CUDA: https://developer.download.nvidia.com/compute/cuda/11.2.0/local_installers/cuda_11.2.0_460.27.04_linux.run

cuDNN: cudnn-11.2-linux-x64-v8.1.0.77.tgz

tensorflow-gpu: 2.5.0

2.1 CentOS

http://linuxsoft.cern.ch/centos-vault/

2.2 CentOS kernel

https://buildlogs.centos.org/

2.3 ESXi and GPU passthrough compatibility

Check whether ESXi supports passthrough for the GPU:

https://www.vmware.com/resources/compatibility/search.php?deviceCategory=vsga

2.4 NVIDIA driver

2.5 CUDA

2.6 cuDNN

https://developer.nvidia.com/rdp/cudnn-archive

2.7 TensorFlow

https://www.tensorflow.org/install/source#common_installation_problems

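3. Installing the GPU

Now install the card in the machine. Do this with the power disconnected.

Take the power cable that came with the card: plug its single-connector end into the card (A), join its dual-connector end to the extension cable I bought (B), and plug the single-connector end of the extension cable into the power socket on the riser (C).

If the server won't boot after the installation, possible causes are:

  • Check that the cables are connected correctly: mine wouldn't boot at first because I had plugged a cable in wrong. Luckily no hardware was fried.
  • Check that the power supply is large enough: mine is 750 W, which is only just enough.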

4. Passthrough to CentOS 7.7

4.1 Installing CentOS

First download the ISO (http://linuxsoft.cern.ch/centos-vault/7.7.1908/isos/x86_64/CentOS-7-x86_64-Minimal-1908.iso) and upload it to the ESXi datastore.

Then create the virtual machine.

Be sure to set the VM's firmware to EFI here; otherwise, even with the NVIDIA driver installed, nvidia-smi reports "No devices were found".

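4.2 Configuring CentOS

Edit the network configuration: vi /etc/sysconfig/network-scripts/ifcfg-ens192

systemctl restart network

Run ping www.baidu.com to check that the network works.

ip address shows the current IP address.

Once you know the IP, you can manage CentOS with ssh root@IP, which is more efficient.

Turn off the firewall:

systemctl stop firewalld

systemctl disable firewalld

The versions shown by uname -r and ll /usr/src/kernels/ must match exactly. If ll /usr/src/kernels/ shows nothing, kernel-devel and kernel-headers need to be installed.

uname -r

ll /usr/src/kernels/

yum list installed | grep kernel

yum list installed | grep gcc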
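
The "must match exactly" rule above can be checked mechanically. The following is a minimal sketch, assuming the kernel sources live under /usr/src/kernels as on CentOS 7; the function names are hypothetical and not part of any tool.

```python
import os
import platform
from typing import List


def missing_kernel_sources(running: str, installed: List[str]) -> bool:
    """True when no /usr/src/kernels entry matches the running kernel exactly."""
    return running not in installed


def check_host(src_root: str = "/usr/src/kernels") -> bool:
    """Compare the running kernel (same string as `uname -r`) with the source dirs."""
    running = platform.release()
    installed = sorted(os.listdir(src_root)) if os.path.isdir(src_root) else []
    return missing_kernel_sources(running, installed)
```

If check_host() returns True, kernel-devel for the running kernel still has to be installed before the NVIDIA driver can build its module.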

Install the kernel packages

There are two ways:

  1. If yum can find kernel-devel and kernel-headers, this is the most convenient way:

yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)

  2. If yum cannot find the matching kernel packages, get the matching version from https://buildlogs.centos.org/ :

yum install wget vim -y

wget https://buildlogs.centos.org/c7.1908.00.x86_64/kernel/20190808101829/3.10.0-1062.el7.x86_64/kernel-devel-3.10.0-1062.el7.x86_64.rpm --no-check-certificate

wget https://buildlogs.centos.org/c7.1908.00.x86_64/kernel/20190808101829/3.10.0-1062.el7.x86_64/kernel-headers-3.10.0-1062.el7.x86_64.rpm --no-check-certificate

rpm -ivh --force kernel-devel-3.10.0-1062.el7.x86_64.rpm

rpm -ivh --force kernel-headers-3.10.0-1062.el7.x86_64.rpm

yum install gcc

Check the configuration again:

yum list installed | grep kernel

yum list installed | grep gcc

uname -r

ll /usr/src/kernels/

Reboot and confirm that the system starts normally.

4.3 Configuring the GPU

Set the GPU to passthrough.

All memory must be reserved here. I gave the VM 32 GB; after clicking "Reserve all memory", the "Reserve all guest memory" option in the memory settings becomes checked.

Be sure to add the following three parameters:

pciPassthru.use64bitMMIO=TRUE

pciPassthru.64bitMMIOSizeGB=64

hypervisor.cpuid.v0=FALSE
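
To double-check that all three parameters actually landed in the VM's configuration, one option is to scan the .vmx file for them. This is a rough sketch, assuming the parameters are stored as key = "value" lines; REQUIRED, parse_vmx and missing_settings are illustrative names, not part of any VMware tooling.

```python
import re
from typing import Dict, List

# The three parameters from this section and the values used in this post.
REQUIRED: Dict[str, str] = {
    "pciPassthru.use64bitMMIO": "TRUE",
    "pciPassthru.64bitMMIOSizeGB": "64",
    "hypervisor.cpuid.v0": "FALSE",
}

# Matches `key = "value"` as well as the unquoted `key=value` form.
_LINE = re.compile(r'^\s*([^\s=#]+)\s*=\s*"?([^"]*)"?\s*$')


def parse_vmx(text: str) -> Dict[str, str]:
    """Collect key/value pairs from .vmx-style text, ignoring unparseable lines."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:
            settings[match.group(1)] = match.group(2).strip()
    return settings


def missing_settings(text: str) -> List[str]:
    """Return the required keys that are absent or set to a different value."""
    found = parse_vmx(text)
    return [key for key, value in REQUIRED.items()
            if found.get(key, "").upper() != value]
```

An empty list from missing_settings means the configuration text contains all three settings with the values shown above.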

4.4 Installing the NVIDIA driver

# Check whether the GPU is visible
yum install pciutils

lspci | grep -i nvidia

Pick a suitable version at https://www.nvidia.com/Download/index.aspx

wget https://us.download.nvidia.cn/tesla/460.106.00/NVIDIA-Linux-x86_64-460.106.00.run

chmod +x NVIDIA-Linux-x86_64-460.106.00.run

./NVIDIA-Linux-x86_64-460.106.00.run --kernel-source-path=/usr/src/kernels/3.10.0-1062.el7.x86_64/ -k $(uname -r)

After a successful installation, check the result with nvidia-smi.
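
For a scriptable version of the nvidia-smi check, the tool's CSV query mode can be parsed. The sketch below assumes nvidia-smi is on the PATH; parse_gpu_csv and list_gpus are names invented for this example.

```python
import subprocess
from typing import Dict, List

# Ask nvidia-smi for one CSV line per GPU, without a header row.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_csv(output: str) -> List[Dict[str, str]]:
    """Turn 'name, driver, memory' lines into dicts; other lines are skipped."""
    gpus: List[Dict[str, str]] = []
    for line in output.strip().splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            continue  # e.g. an error message instead of a GPU row
        name, driver, memory = parts
        gpus.append({"name": name, "driver": driver, "memory": memory})
    return gpus


def list_gpus() -> List[Dict[str, str]]:
    """Run nvidia-smi; an empty list means it is missing, failed, or saw no GPU."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(result.stdout)
```

An empty result from list_gpus() inside the guest is the scripted equivalent of the "No devices were found" symptom mentioned earlier.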

A few notes:

  1. Without -k $(uname -r), the installer fails with an error about nvidia.ko.
  2. If the driver version is wrong, the installer reports "Unable to load the 'nvidia-drm' kernel module"; in that case try a different driver version.
  3. If the installer complains about nouveau, let the NVIDIA installer handle it, then reboot and run the driver installer again.
  4. If a warning is shown, you can continue; it does not affect the installation.
  5. Choose Yes here.
  6. Use dmesg | grep NVRM to find out why the driver failed to start.

All I can say is that installing the driver was really painful.
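
For the nouveau note above, it can help to confirm after the reboot that the nouveau module is really gone before re-running the installer. A minimal sketch, assuming the usual /proc/modules layout where the module name is the first field of each line; the helper names are hypothetical.

```python
from typing import List


def loaded_modules(proc_modules_text: str) -> List[str]:
    """Module names are the first whitespace-separated field of each line."""
    return [line.split()[0]
            for line in proc_modules_text.splitlines() if line.strip()]


def nouveau_in_the_way(proc_modules_text: str) -> bool:
    """True while the open-source nouveau driver is still loaded."""
    return "nouveau" in loaded_modules(proc_modules_text)


def read_proc_modules(path: str = "/proc/modules") -> str:
    """Read the kernel's module list; returns '' where the file is unavailable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""
```

nouveau_in_the_way(read_proc_modules()) should be False before the NVIDIA installer is started again.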

4.5 Installing CUDA

https://developer.nvidia.com/cuda-11.2.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=CentOS&target_version=7&target_type=runfilelocal

wget https://developer.download.nvidia.com/compute/cuda/11.2.0/local_installers/cuda_11.2.0_460.27.04_linux.run
# Pass --kernel-source-path
sudo sh cuda_11.2.0_460.27.04_linux.run --kernel-source-path=/usr/src/kernels/3.10.0-1062.el7.x86_64/

Then set up the environment:

vim ~/.bashrc
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2/lib64
export PATH=$PATH:/usr/local/cuda-11.2/bin
# CUDA_HOME is a single directory, not a colon-separated list
export CUDA_HOME=/usr/local/cuda-11.2
# Run in the terminal:
source ~/.bashrc

# Verify that CUDA installed correctly
nvcc --version

# To uninstall
cd /usr/local/cuda-11.2/bin
./cuda-uninstaller
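
The nvcc --version check above can also be scripted, for example to confirm that the toolkit on the PATH is the 11.2 release and not some other installation. This is a sketch that relies on nvcc printing a "release X.Y" fragment in its version banner; cuda_release and installed_cuda_release are illustrative names.

```python
import re
import subprocess
from typing import Optional


def cuda_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'X.Y' after the word 'release' in nvcc's version banner."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_cuda_release() -> Optional[str]:
    """Run `nvcc --version`; None means nvcc is not on PATH or failed."""
    try:
        result = subprocess.run(["nvcc", "--version"],
                                capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return cuda_release(result.stdout)
```

A None result usually points back at the PATH line in ~/.bashrc or at a shell that has not sourced it yet.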

4.6 Installing cuDNN

You first need to register an account, then download the required version from https://developer.nvidia.com/rdp/cudnn-archive

tar -zxvf cudnn-11.2-linux-x64-v8.1.0.77.tgz

# cuDNN 8 splits its API across several cudnn*.h headers, so copy all of them
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
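
To confirm which cuDNN version was unpacked, the version macros can be read from the header. The sketch below assumes a cuDNN 8 archive, where CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL are defined in cudnn_version.h; the default path points at the extracted cuda/include directory, and the function names are made up for illustration.

```python
import re
from typing import Optional, Tuple


def cudnn_version(header_text: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch) from the #define lines, or None if any is missing."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#define\s+" + key + r"\s+(\d+)",
                          header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def read_cudnn_version(path: str = "cuda/include/cudnn_version.h") -> Optional[Tuple[int, ...]]:
    """Read the header from disk; None when the file is absent or unreadable."""
    try:
        with open(path) as handle:
            return cudnn_version(handle.read())
    except OSError:
        return None
```

For the archive used in this post the expected major and minor versions are 8 and 1, matching the cuDNN 8.1 requirement of tensorflow-gpu 2.5.0.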

4.7 Installing tensorflow-gpu

Install Python 3.8

yum install -y zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel
yum install libffi-devel -y

# Download Python 3.8
wget https://www.python.org/ftp/python/3.8.12/Python-3.8.12.tgz

# Extract
tar -zxvf Python-3.8.12.tgz

# Build; this takes a while
cd Python-3.8.12/
./configure
make && make install


# Make python3 the default python
# Show the path of python
which python
# Show the path of python3
which python3
# Back up the old python so it can be restored if something goes wrong
mv /usr/bin/python /usr/bin/python.bak
# Point python at python3
ln -s /usr/local/bin/python3 /usr/bin/python


# Show the path of pip3
which pip3
# Show the path of pip; it should report that pip does not exist
which pip
# Make pip available as a command too
ln -s /usr/local/bin/pip3 /usr/bin/pip
# Show the pip version; if it prints, the setup worked
pip -V


# Run yum; it now fails
yum

# To fix this, change python to python2 in the first line of each of these files; yum then works again
vi /usr/libexec/urlgrabber-ext-down
vi /usr/bin/yum
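
The first-line edit described above can be expressed as a small text transformation, which makes it easy to see exactly what changes. This is a sketch only: pin_python2 is a hypothetical helper, it assumes the first line is a plain /usr/bin/python shebang (with or without a space after "#!"), and it works on a string rather than touching the files.

```python
import re

# Matches '#!/usr/bin/python' and '#! /usr/bin/python', but not '...python2'.
_SHEBANG = re.compile(r"^#!\s*/usr/bin/python[ \t]*$")


def pin_python2(script_text: str) -> str:
    """Rewrite a bare /usr/bin/python shebang on line 1 to /usr/bin/python2."""
    lines = script_text.split("\n")
    if lines and _SHEBANG.match(lines[0]):
        lines[0] = "#!/usr/bin/python2"
    return "\n".join(lines)
```

Scripts whose first line already names python2 pass through unchanged, so applying the function twice is harmless.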

Install TensorFlow

pip install tensorflow-gpu==2.5.0

4.8 Testing

import tensorflow as tf
# Use TF1-style graph mode so that a Session can be created
tf.compat.v1.disable_eager_execution()
hello = tf.constant("Hello,Tensorflow!")
a = tf.constant(1)
b = tf.constant(2)
sess = tf.compat.v1.Session()
# Run the graph: print the string constant, then the sum a + b
print(sess.run(hello))
print(sess.run(a + b))
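
The session test above shows that TensorFlow runs, but not necessarily that it runs on the GPU. A short complementary check, sketched here with TensorFlow 2's tf.config.list_physical_devices, lists the GPUs TensorFlow can see; visible_gpu_names is a name chosen for this example, and returning an empty list when TensorFlow is not importable is a simplification of this sketch.

```python
from typing import List


def visible_gpu_names() -> List[str]:
    """Names of the GPUs TensorFlow can see; [] if none (or TensorFlow is absent)."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]
```

If the list is empty inside the guest even though nvidia-smi shows the P40, the CUDA and cuDNN library paths from the earlier steps are the first thing to recheck.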

5. Passthrough to Windows Server 2016

Installing the driver on Windows Server is the easiest case.
