0. 前言
需要使用ddqn完成某项任务,为了快速训练,使用带有GPU的服务器进行训练。记录下整个过程,以及遇到的坑。
1. 选择模板代码
参考代码来源
GitHub
该代码最后一次更新是Mar 24, 2020。
环境配置:
python3.8
运行安装脚本:
apt-get update
apt-get install xvfb
apt-get install python-opengl
apt-get install python3-pip
python -m pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple/
安装python requirements
所需requirements文件
tensorflow
tensorlayer
opencv-python-headless
matplotlib
pyglet==1.5.27
gym==0.20.0
python -m pip install -r requirements -i https://pypi.tuna.tsinghua.edu.cn/simple/
运行模板代码
xvfb-run -s "-screen 0 1400x900x24" python double_DQN\ \&\ dueling_DQN.py
2. ubuntu 环境准备
本部分为踩坑记录,不需要跟着做
服务器之前有其他人用过,也可能是系统自带python,因此会有python环境,首先查看python版本
python --version
显示python命令对应的版本是2.7.17,之后查找该命令对应的符号链接文件位置。
which python
会显示python命令使用的符号链接文件
/usr/bin/python
查看该路径下还有没有其它python版本
ls -al | grep python
输出如下
lrwxrwxrwx 1 root root 9 Apr 16 2018 python -> python2.7
lrwxrwxrwx 1 root root 9 Apr 16 2018 python2 -> python2.7
-rwxr-xr-x 1 root root 3628904 Nov 29 02:51 python2.7
lrwxrwxrwx 1 root root 9 Jun 22 2018 python3 -> python3.6
-rwxr-xr-x 2 root root 4526456 Nov 25 22:10 python3.6
-rwxr-xr-x 2 root root 4526456 Nov 25 22:10 python3.6m
-rwxr-xr-x 1 root root 1018 Oct 29 2017 python3-jsondiff
-rwxr-xr-x 1 root root 3661 Oct 29 2017 python3-jsonpatch
-rwxr-xr-x 1 root root 1342 May 2 2016 python3-jsonpointer
-rwxr-xr-x 1 root root 398 Nov 16 2017 python3-jsonschema
lrwxrwxrwx 1 root root 10 Jun 22 2018 python3m -> python3.6m
发现python命令使用的是2.7,但python3可以使用3.6。因为目前有的tensorflow版本不支持2.7了,先使用3.6.
接着查看tensorflow gpu各个版本的要求:官网。如下图所示,我选择了2.0.0,发布于2019年9月30日,和代码更新时间比较近,并且支持python 3.6。
准备使用pip安装tensorflow,但pip并没有安装,使用一下命令安装pip。
apt install python3-pip
之后执行安装命令
python -m pip install tensorflow-gpu==2.0.0
比较难受的是pip源中并没有2.0.0,换了清华源也没有,输出如下
Could not find a version that satisfies the requirement tensorflow-gpu==2.0.0 (from versions: 1.13.1, 1.13.2, 1.14.0, 2.12.0)
No matching distribution found for tensorflow-gpu==2.0.0
可以看到最新的版本只有2.12.0,那只能安装最新的版本。
还需要选择python版本,至少需要python3.7。我为了之后使用方便直接把python命令的软连接接入到新安装的python3.7上。
apt install python3.7
rm -f /usr/bin/python
ln -s /usr/bin/python3.7 /usr/bin/python
python --version
最后显示版本为3.7.5,替换成功,
这时候需要更新一下pip(后面装tensorflow的时候需要安装很多相关包,如果不升级pip的话会有很多包装不上,其中一个报错是 Failed building wheel for grpcio)。
python -m pip install --upgrade pip
继续安装tensorflow-gpu。
python -m pip install tensorflow-gpu
输出如下
The "tensorflow-gpu" package has been removed!
Please install "tensorflow" instead.
Other than the name, the two packages have been identical
since TensorFlow 2.1, or roughly since Sep 2019. For more
information, see: pypi.org/project/tensorflow-gpu
意思是tensorflow2.1之后gpu包没得了,直接pip install tensorflow就可以。(安装速度感人,切换清华源)
python -m pip install tensorflow -i https://pypi.tuna.tsinghua.edu.cn/simple/
安装成功!!
3. 模板代码运行
本部分为踩坑记录,不需要跟着做
将代码下载到相应文件夹之后,可以使用如下语句运行ddqn模板代码。
python double_DQN\ \&\ dueling_DQN.py
当然会报很多no module的错误,使用pip依次安装,requirements总结如下
tensorlayer
opencv-python
opencv-python-headless
matplotlib
pyglet
将上述内容写文件,之后一键安装
python -m pip install -r requirments -i https://pypi.tuna.tsinghua.edu.cn/simple/
还需要安装gym,它是一个经常用于测试强化学习的示例,目前新的版本中获取新的状态时参数数量增加了,即以下语句会报错,step函数不仅输出变多了,而且s_的输出也不太正常。因此更换为早一点的版本。
s_,r,done,_ = self.env.step(a)
我根据模板代码的时间查看了gym的tag,发现时间上和模板代码相似,再打开gym的core文件查看step函数,果然从输出数量上合适的。进行安装:
python -m pip install gym==0.20.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/
报错如下:
Collecting gym==0.20.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/f1/16/a421155206e7dc41b3a79d4e9311287b88c20140d567182839775088e9ad/gym-0.20.0.tar.gz (1.6 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [1 lines of output]
error in gym setup command: 'extras_require' must be a dictionary whose values are strings or lists of strings containing valid project/version requirement specifiers.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
原因找了好久,从【参考】中找到了解决办法,更新为指定版本的setuptools:
python -m pip install --upgrade pip setuptools==57.5.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/
运行代码
python double_DQN\ \&\ dueling_DQN.py
发现pyglet最低要求python3.8。。。重新安装python3.8,之后直接使用requirements文件一键安装到新环境。
运行过程中报错
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 27, in <module>
from pyglet.gl import *
File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/__init__.py", line 47, in <module>
from pyglet.gl.gl import *
File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/gl.py", line 7, in <module>
from pyglet.gl.lib import link_GL as _link_function
File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/lib.py", line 98, in <module>
from pyglet.gl.lib_glx import link_GL, link_GLX
File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/lib_glx.py", line 11, in <module>
gl_lib = pyglet.lib.load_library('GL')
File "/usr/local/lib/python3.8/dist-packages/pyglet/lib.py", line 134, in load_library
raise ImportError(f'Library "{names[0]}" not found.')
ImportError: Library "GL" not found.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "double_DQN & dueling_DQN.py", line 195, in <module>
ddqn.train(200)
File "double_DQN & dueling_DQN.py", line 161, in train
if self.is_rend:self.env.render()
File "/usr/local/lib/python3.8/dist-packages/gym/core.py", line 254, in render
return self.env.render(mode, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/cartpole.py", line 179, in render
from gym.envs.classic_control import rendering
File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 29, in <module>
raise ImportError(
ImportError:
Error occurred while running `from pyglet.gl import *`
HINT: make sure you have OpenGL installed. On Ubuntu, you can run 'apt-get install python-opengl'.
If you're running on a server, you may need a virtual frame buffer; something like this should work:
'xvfb-run -s "-screen 0 1400x900x24" python <your_script.py>'
最后说明了原因,缺少OpenGL 。并且在服务器上运行显示有点问题,就按照他给的解决方案处理。
apt-get install python-opengl
xvfb-run -s "-screen 0 1400x900x24" python double_DQN\ \&\ dueling_DQN.py
处理之后,再次报错
Traceback (most recent call last):
File "double_DQN & dueling_DQN.py", line 195, in <module>
ddqn.train(200)
File "double_DQN & dueling_DQN.py", line 161, in train
if self.is_rend:self.env.render()
File "/usr/local/lib/python3.8/dist-packages/gym/core.py", line 254, in render
return self.env.render(mode, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/cartpole.py", line 229, in render
return self.viewer.render(return_rgb_array=mode == "rgb_array")
File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 126, in render
self.transform.enable()
File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 232, in enable
glPushMatrix()
NameError: name 'glPushMatrix' is not defined
原因是pyglet版本太高,降为1.5.27即可。
4. 安装GPU支持
根据tensorflow版本选择cuda和cudnn。
4.1 安装cuda 11.2
wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
chmod +x cuda_11.2.2_460.32.03_linux.run
sudo ./cuda_11.2.2_460.32.03_linux.run
安装完成后,需要将CUDA的路径添加到环境变量中。打开~/.bashrc文件,在文件末尾添加以下两行代码:
export PATH=/usr/local/cuda-11.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
然后运行以下命令使环境变量生效:
source ~/.bashrc
验证CUDA的安装是否成功。运行以下命令:
nvcc -V
4.2 安装cudnn 8.1
在NVIDIA官网下载。
之后上传到服务器,
tar -xzvf cudnn-11.2-linux-x64-v8.1.1.33.tgz
cp -P cuda/include/cudnn*.h /usr/local/cuda-11.2/include
cp -P cuda/lib64/libcudnn* /usr/local/cuda-11.2/lib64/
chmod a+r /usr/local/cuda-11.2/include/cudnn*.h /usr/local/cuda-11.2/lib64/libcudnn*
使用如下代码测试gpu是否正常使用
import tensorflow as tf
# 显示当前GPU设备信息
print(tf.config.list_physical_devices('GPU'))
# 创建一个TensorFlow的Session并在其中进行一个简单的运算
with tf.compat.v1.Session() as sess:
# 创建一个TensorFlow的常量张量
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
# 创建一个TensorFlow的变量张量
b = tf.Variable(tf.random.normal([3, 2], stddev=0.1), name='b')
# 进行矩阵乘法运算
c = tf.matmul(a, b, name='c')
# 初始化所有变量
sess.run(tf.compat.v1.global_variables_initializer())
# 运行TensorFlow图
print(sess.run(c))
输出如下,可以正常使用
2023-03-05 14:43:55.699137: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-05 14:43:55.861866: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-03-05 14:43:56.628130: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-03-05 14:43:56.628241: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-03-05 14:43:56.628256: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-03-05 14:43:58.016301: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.031069: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.032359: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-03-05 14:43:58.035332: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-05 14:43:58.036833: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.038127: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.039367: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.106971: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.108369: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.109663: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.110894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30969 MB memory: -> device: 0, name: Tesla V100S-PCIE-32GB, pci bus id: 0000:00:06.0, compute capability: 7.0
2023-03-05 14:43:59.127126: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:357] MLIR V1 optimization pass is not enabled
[[0.50951785 0.10452858]
[1.070737 0.27480656]]
模板代码使用GPU加速感觉速度没快多少,可能是神经网络层数比较少的原因。