0. Preface

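I needed to use DDQN for a task and, to speed up training, ran it on a server with a GPU. This post records the whole process and the pitfalls I ran into.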

1. Choosing the template code

Source of the reference code:
GitHub
The code was last updated on Mar 24, 2020.

Environment setup:
python3.8
Run the installation script:

apt-get update
apt-get install xvfb
apt-get install python-opengl
apt-get install python3-pip
python -m pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple/

Install the Python requirements
The requirements file should contain the following:

tensorflow
tensorlayer
opencv-python-headless
matplotlib
pyglet==1.5.27
gym==0.20.0

python -m pip install -r requirements -i https://pypi.tuna.tsinghua.edu.cn/simple/
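
The requirements file above pins only two packages. As a small sketch for double-checking which entries are pinned before installing, the following parses that text; the helper name `parse_requirements` is made up for this example and it only understands plain `name` and `name==version` lines.

```python
def parse_requirements(text):
    """Return a list of (package, pinned_version_or_None) pairs."""
    pairs = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, sep, version = line.partition("==")
        pairs.append((name.strip(), version.strip() if sep else None))
    return pairs


# Contents of the requirements file listed in this post.
REQUIREMENTS = """\
tensorflow
tensorlayer
opencv-python-headless
matplotlib
pyglet==1.5.27
gym==0.20.0
"""

for package, pin in parse_requirements(REQUIREMENTS):
    print(f"{package:<24} {pin or '(unpinned)'}")
```

The two pins (pyglet and gym) are the ones that the pitfalls in section 3 turn out to depend on.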

Run the template code

xvfb-run -s "-screen 0 1400x900x24" python double_DQN\ \&\ dueling_DQN.py

2. Preparing the Ubuntu environment

This section is a record of the pitfalls I hit; you do not need to follow along.

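The server had been used by other people before, and the system may also ship its own Python, so a Python environment already exists. First, check the Python version.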

python --version

This shows that the python command corresponds to version 2.7.17. Next, find where the symlink behind that command lives.

which python

This prints the symlink that the python command uses

/usr/bin/python
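
The same lookup can be done from Python, which also follows the symlink to the real binary. This is only a sketch; `resolve_command` is a helper name invented here, built on the standard `shutil.which` and `os.path.realpath`.

```python
import os
import shutil


def resolve_command(name):
    """Return (path_found_on_PATH, real_file_after_symlinks), or None."""
    path = shutil.which(name)
    if path is None:
        return None
    return path, os.path.realpath(path)


for cmd in ("python", "python3"):
    print(cmd, "->", resolve_command(cmd))
```

On the server described here, one would expect `python` to resolve to `/usr/bin/python2.7`, matching the listing shown in this section.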

Check whether that directory (/usr/bin) contains any other Python versions

ls -al /usr/bin | grep python

The output is as follows

lrwxrwxrwx  1 root   root           9 Apr 16  2018 python -> python2.7
lrwxrwxrwx  1 root   root           9 Apr 16  2018 python2 -> python2.7
-rwxr-xr-x  1 root   root     3628904 Nov 29 02:51 python2.7
lrwxrwxrwx  1 root   root           9 Jun 22  2018 python3 -> python3.6
-rwxr-xr-x  2 root   root     4526456 Nov 25 22:10 python3.6
-rwxr-xr-x  2 root   root     4526456 Nov 25 22:10 python3.6m
-rwxr-xr-x  1 root   root        1018 Oct 29  2017 python3-jsondiff
-rwxr-xr-x  1 root   root        3661 Oct 29  2017 python3-jsonpatch
-rwxr-xr-x  1 root   root        1342 May  2  2016 python3-jsonpointer
-rwxr-xr-x  1 root   root         398 Nov 16  2017 python3-jsonschema
lrwxrwxrwx  1 root   root          10 Jun 22  2018 python3m -> python3.6m

So the python command points to 2.7, while python3 points to 3.6. Since current TensorFlow versions no longer support 2.7, I started with 3.6.

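Next, check the requirements of each tensorflow-gpu version on the official website. I chose 2.0.0, released on September 30, 2019, which is close to the date of the code's last update and supports Python 3.6.
I planned to install tensorflow with pip, but pip was not installed, so I installed it with the following command.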

apt install python3-pip

Then run the install command

python -m pip install tensorflow-gpu==2.0.0

Annoyingly, the pip index did not offer 2.0.0, and switching to the Tsinghua mirror made no difference. The output was as follows

Could not find a version that satisfies the requirement tensorflow-gpu==2.0.0 (from versions: 1.13.1, 1.13.2, 1.14.0, 2.12.0)
No matching distribution found for tensorflow-gpu==2.0.0
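
When pip refuses a pin like this, the "from versions" list in the message is the useful part. The sketch below pulls that list out of the error text so it can be compared against the version you wanted; `available_versions` is a hypothetical helper, and the regular expression assumes the message format shown above.

```python
import re


def available_versions(pip_error):
    """Extract the versions pip says it can see from its error message."""
    match = re.search(r"\(from versions: ([^)]*)\)", pip_error)
    if not match:
        return []
    items = (v.strip() for v in match.group(1).split(","))
    return [v for v in items if v and v != "none"]


# The error message quoted in this post.
ERROR = (
    "Could not find a version that satisfies the requirement "
    "tensorflow-gpu==2.0.0 (from versions: 1.13.1, 1.13.2, 1.14.0, 2.12.0)"
)

versions = available_versions(ERROR)
print(versions)
print("2.0.0 offered:", "2.0.0" in versions)
```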

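As the output shows, the only recent version offered is 2.12.0, so the only option was to install the latest version. (A possible contributing factor, which I did not verify: pip only lists releases that have files compatible with the interpreter and pip version in use, and at this point python still pointed to 2.7 with an old system pip.)
A Python version also had to be chosen; at least Python 3.7 is required. For convenience later on, I pointed the python command's symlink directly at the newly installed python3.7.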

apt install python3.7
rm -f /usr/bin/python
ln -s /usr/bin/python3.7  /usr/bin/python
python --version

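The version now shows as 3.7.5, so the switch succeeded.
Next, pip needs to be upgraded. Installing tensorflow pulls in many dependencies, and without a newer pip many of them fail to install; one such error is Failed building wheel for grpcio.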

python -m pip install --upgrade pip
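
Because the python symlink was just repointed, it can be worth confirming from inside Python which interpreter is actually running before installing anything large. A minimal sketch follows; `meets_minimum` is a name made up for this example, and the default of (3, 7) simply mirrors the minimum mentioned in this section.

```python
import sys


def meets_minimum(version_info, minimum=(3, 7)):
    """True if the (major, minor) part of version_info is at least minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


print("interpreter:", sys.executable)
print("version:", sys.version.split()[0])
print("new enough:", meets_minimum(sys.version_info))
```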

Continue installing tensorflow-gpu.

python -m pip install tensorflow-gpu

The output is as follows

The "tensorflow-gpu" package has been removed!
    
Please install "tensorflow" instead.

Other than the name, the two packages have been identical
since TensorFlow 2.1, or roughly since Sep 2019. For more
information, see: pypi.org/project/tensorflow-gpu

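In other words, the separate GPU package has gone away since TensorFlow 2.1, and pip install tensorflow is all that is needed. The download was painfully slow, so I switched to the Tsinghua mirror.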

python -m pip install tensorflow -i https://pypi.tuna.tsinghua.edu.cn/simple/

Installation succeeded!

3. Running the template code

This section is a record of the pitfalls I hit; you do not need to follow along.

After downloading the code into a folder, the DDQN template code can be run with the following command.

python double_DQN\ \&\ dueling_DQN.py

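Naturally this reports many "No module named ..." errors. I installed the missing packages with pip one at a time; the resulting requirements are summarized below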

tensorlayer
opencv-python
opencv-python-headless
matplotlib
pyglet

Write the list above to a file named requirements, then install everything in one go

python -m pip install -r requirements -i https://pypi.tuna.tsinghua.edu.cn/simple/
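
Instead of discovering missing modules one crash at a time, they can be listed up front. The sketch below uses the standard `importlib.util.find_spec`; the module list is my reading of what the template code imports, so treat it as an assumption. Note that import names can differ from pip package names (opencv-python is imported as cv2).

```python
import importlib.util

# Import names (not pip package names) assumed to be needed by the template.
MODULES = ["tensorflow", "tensorlayer", "cv2", "matplotlib", "pyglet", "gym"]


def missing_modules(names):
    """Return the top-level modules that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing:", missing_modules(MODULES))
```

An empty list only means the modules can be located; version problems like the ones described next still show up at runtime.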

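gym still needs to be installed. It is a toolkit of environments commonly used to test reinforcement learning algorithms. In recent versions, step returns more values than before, so the statement below raises an error; beyond the extra return values, the value of s_ also looked wrong. I therefore switched to an earlier version.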

s_,r,done,_ = self.env.step(a)
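
This post works around the API change by pinning an older gym. An alternative, shown here only as a sketch and not what the template code does, is a small adapter that accepts both the 4-value step result of older gym releases and the 5-value result (observation, reward, terminated, truncated, info) of newer ones. The function `step_compat` and the two stand-in environment classes are invented for illustration.

```python
def step_compat(env, action):
    """Call env.step and always return (obs, reward, done, info)."""
    result = env.step(action)
    if len(result) == 5:
        # Newer API: episode end is split into terminated and truncated.
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    obs, reward, done, info = result
    return obs, reward, done, info


class OldStyleEnv:
    """Stand-in for an environment with the 4-value step API."""

    def step(self, action):
        return [0.0], 1.0, False, {}


class NewStyleEnv:
    """Stand-in for an environment with the 5-value step API."""

    def step(self, action):
        return [0.0], 1.0, False, True, {}


print(step_compat(OldStyleEnv(), 0))
print(step_compat(NewStyleEnv(), 0))
```

In the newer API, reset also returns an (observation, info) pair rather than the observation alone, so an adapter like this would not be enough on its own.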

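I looked through gym's tags for one dated close to the template code, then opened gym's core file to check the step function; the number of return values matched what the template expects. Install it: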

python -m pip install gym==0.20.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/

This fails with the following error:

Collecting gym==0.20.0
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/f1/16/a421155206e7dc41b3a79d4e9311287b88c20140d567182839775088e9ad/gym-0.20.0.tar.gz (1.6 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [1 lines of output]
      error in gym setup command: 'extras_require' must be a dictionary whose values are strings or lists of strings containing valid project/version requirement specifiers.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

It took a long time to track down the cause. The fix, found in [Reference], is to install a specific version of setuptools:

python -m pip install --upgrade pip setuptools==57.5.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/
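
To confirm that the pin actually took effect before retrying the gym install, the installed versions can be read from Python. This sketch assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library; `installed_version` is a helper name chosen for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("pip", "setuptools", "gym"):
    print(f"{name}: {installed_version(name)}")
```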

Run the code

python double_DQN\ \&\ dueling_DQN.py

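It turned out that pyglet requires at least Python 3.8... So I installed Python 3.8 and then used the requirements file to install everything into the new environment in one go.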

An error occurred at runtime

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 27, in <module>
    from pyglet.gl import *
  File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/__init__.py", line 47, in <module>
    from pyglet.gl.gl import *
  File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/gl.py", line 7, in <module>
    from pyglet.gl.lib import link_GL as _link_function
  File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/lib.py", line 98, in <module>
    from pyglet.gl.lib_glx import link_GL, link_GLX
  File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/lib_glx.py", line 11, in <module>
    gl_lib = pyglet.lib.load_library('GL')
  File "/usr/local/lib/python3.8/dist-packages/pyglet/lib.py", line 134, in load_library
    raise ImportError(f'Library "{names[0]}" not found.')
ImportError: Library "GL" not found.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "double_DQN & dueling_DQN.py", line 195, in <module>
    ddqn.train(200)
  File "double_DQN & dueling_DQN.py", line 161, in train
    if self.is_rend:self.env.render()
  File "/usr/local/lib/python3.8/dist-packages/gym/core.py", line 254, in render
    return self.env.render(mode, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/cartpole.py", line 179, in render
    from gym.envs.classic_control import rendering
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 29, in <module>
    raise ImportError(
ImportError: 
    Error occurred while running `from pyglet.gl import *`
    HINT: make sure you have OpenGL installed. On Ubuntu, you can run 'apt-get install python-opengl'.
    If you're running on a server, you may need a virtual frame buffer; something like this should work:
    'xvfb-run -s "-screen 0 1400x900x24" python <your_script.py>'

The end of the message explains the cause: OpenGL is missing. Rendering on a headless server is also a problem, so I followed the suggested fix.

apt-get install python-opengl
xvfb-run -s "-screen 0 1400x900x24" python double_DQN\ \&\ dueling_DQN.py
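
Both conditions named in the error hint can be checked from Python before starting a long run: whether a GL library can be located, and whether a display is set (xvfb-run provides one for the command it wraps). This is a rough sketch; `rendering_prerequisites` is a made-up helper and a non-empty result is not proof that rendering will work.

```python
import ctypes.util
import os


def rendering_prerequisites():
    """Report the DISPLAY variable and the GL library name, if any."""
    return {
        "display": os.environ.get("DISPLAY"),
        "libGL": ctypes.util.find_library("GL"),
    }


status = rendering_prerequisites()
for key, value in status.items():
    print(f"{key}: {value if value else 'not found'}")
```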

After that, another error appeared

Traceback (most recent call last):
  File "double_DQN & dueling_DQN.py", line 195, in <module>
    ddqn.train(200)
  File "double_DQN & dueling_DQN.py", line 161, in train
    if self.is_rend:self.env.render()
  File "/usr/local/lib/python3.8/dist-packages/gym/core.py", line 254, in render
    return self.env.render(mode, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/cartpole.py", line 229, in render
    return self.viewer.render(return_rgb_array=mode == "rgb_array")
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 126, in render
    self.transform.enable()
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 232, in enable
    glPushMatrix()
NameError: name 'glPushMatrix' is not defined

The cause is that the pyglet version is too new; downgrading to 1.5.27 fixes it.
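
My understanding, which this post does not verify, is that pyglet 2.x targets modern OpenGL and no longer exposes legacy calls such as glPushMatrix through `from pyglet.gl import *`, while gym 0.20.0's rendering code relies on them. Under that assumption, a guard on the major version is enough to catch the problem early; `pyglet_supports_legacy_gl` is a hypothetical helper.

```python
def pyglet_supports_legacy_gl(version):
    """Assumption: only pyglet 1.x exposes the legacy GL calls gym 0.20.0 uses."""
    major = int(version.split(".")[0])
    return major < 2


for candidate in ("1.5.27", "2.0.5"):
    print(candidate, "->", pyglet_supports_legacy_gl(candidate))
```

In practice the version string would come from `pyglet.version` once the package is installed.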

4. Installing GPU support

Choose the CUDA and cuDNN versions according to the tensorflow version.
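
As an illustration of that lookup, the sketch below hard-codes a few rows as I recall them from TensorFlow's tested build configurations for Linux GPU builds. The table is deliberately incomplete and the helper name `cuda_requirements` is invented here, so check the official table for the version you actually installed.

```python
# (CUDA, cuDNN) per TensorFlow minor release; a partial, hand-copied table.
TESTED_BUILDS = {
    "2.4": ("11.0", "8.0"),
    "2.10": ("11.2", "8.1"),
    "2.11": ("11.2", "8.1"),
    "2.12": ("11.8", "8.6"),
}


def cuda_requirements(tf_version):
    """Look up (cuda, cudnn) for a TensorFlow version string, or None."""
    key = ".".join(tf_version.split(".")[:2])
    return TESTED_BUILDS.get(key)


print(cuda_requirements("2.11.0"))
```

Recent TensorFlow 2.x releases also expose `tf.sysconfig.get_build_info()`, which reports the CUDA and cuDNN versions the installed wheel was built against.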

4.1 Installing CUDA 11.2

wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
chmod +x cuda_11.2.2_460.32.03_linux.run
sudo ./cuda_11.2.2_460.32.03_linux.run

After installation, add the CUDA paths to the environment variables. Open ~/.bashrc and append the following two lines to the end of the file:

export PATH=/usr/local/cuda-11.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Then run the following command to apply the environment variables:

source ~/.bashrc
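
In the export lines, `${PATH:+:${PATH}}` expands to a colon plus the old value only when PATH is already set and non-empty, which avoids leaving a stray colon behind. To confirm that the new entries are visible to programs started from the shell, a check along these lines may help; `path_list_contains` is a helper written for this example and the directory names are the ones used in this post.

```python
import os


def path_list_contains(value, directory):
    """True if a PATH-style string contains the given directory."""
    wanted = os.path.normpath(directory)
    entries = (p for p in value.split(os.pathsep) if p)
    return any(os.path.normpath(p) == wanted for p in entries)


CUDA_HOME = "/usr/local/cuda-11.2"

print("PATH ok:",
      path_list_contains(os.environ.get("PATH", ""), CUDA_HOME + "/bin"))
print("LD_LIBRARY_PATH ok:",
      path_list_contains(os.environ.get("LD_LIBRARY_PATH", ""),
                         CUDA_HOME + "/lib64"))
```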

Verify that CUDA was installed successfully by running:

nvcc -V
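
If the check needs to be scripted, the release number can be parsed out of the nvcc output. The sketch below assumes the usual "release X.Y" wording in that output; the sample line is illustrative rather than copied from this server, and both function names are invented for the example.

```python
import re
import subprocess


def parse_nvcc_release(output):
    """Return (major, minor) from nvcc -V output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def installed_nvcc_release():
    """Run nvcc -V and parse its release; None if nvcc cannot be run."""
    try:
        result = subprocess.run(["nvcc", "-V"], capture_output=True, text=True)
    except OSError:
        return None
    return parse_nvcc_release(result.stdout)


SAMPLE = "Cuda compilation tools, release 11.2, V11.2.152"
print(parse_nvcc_release(SAMPLE))
print(installed_nvcc_release())
```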

4.2 Installing cuDNN 8.1

Download it from the NVIDIA website.
Then upload it to the server and run:

tar -xzvf cudnn-11.2-linux-x64-v8.1.1.33.tgz
cp -P cuda/include/cudnn*.h /usr/local/cuda-11.2/include
cp -P cuda/lib64/libcudnn* /usr/local/cuda-11.2/lib64/
chmod a+r /usr/local/cuda-11.2/include/cudnn*.h /usr/local/cuda-11.2/lib64/libcudnn*
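
One way to confirm that the headers landed in the right place is to read the version macros from cudnn_version.h, which is where cuDNN 8 defines them. The following is a sketch under that assumption; the function names are made up, the sample header text is illustrative, and the path matches the copy destination used above.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cudnn_version.h text, or None."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"#define\s+{key}\s+(\d+)", header_text)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def installed_cudnn_version(path="/usr/local/cuda-11.2/include/cudnn_version.h"):
    """Parse the installed header; None if it cannot be read."""
    try:
        with open(path, encoding="utf-8", errors="replace") as handle:
            return parse_cudnn_version(handle.read())
    except OSError:
        return None


SAMPLE_HEADER = """\
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 1
"""

print(parse_cudnn_version(SAMPLE_HEADER))
print(installed_cudnn_version())
```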

Use the following code to test whether the GPU works

import tensorflow as tf

# Show the GPU devices visible to TensorFlow
print(tf.config.list_physical_devices('GPU'))

# Create a TensorFlow Session and run a simple computation in it
with tf.compat.v1.Session() as sess:
    # Create a constant tensor
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    # Create a variable
    b = tf.Variable(tf.random.normal([3, 2], stddev=0.1), name='b')
    # Matrix multiplication
    c = tf.matmul(a, b, name='c')

    # Initialize all variables
    sess.run(tf.compat.v1.global_variables_initializer())

    # Run the TensorFlow graph
    print(sess.run(c))

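The output is as follows and shows that the GPU works. The warnings about libnvinfer refer to TensorRT libraries, which are only needed if you want to use TensorRT.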

2023-03-05 14:43:55.699137: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-05 14:43:55.861866: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-03-05 14:43:56.628130: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-03-05 14:43:56.628241: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-03-05 14:43:56.628256: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-03-05 14:43:58.016301: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.031069: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.032359: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-03-05 14:43:58.035332: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-05 14:43:58.036833: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.038127: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.039367: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.106971: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.108369: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.109663: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.110894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30969 MB memory:  -> device: 0, name: Tesla V100S-PCIE-32GB, pci bus id: 0000:00:06.0, compute capability: 7.0
2023-03-05 14:43:59.127126: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:357] MLIR V1 optimization pass is not enabled
[[0.50951785 0.10452858]
 [1.070737   0.27480656]]
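
The line that matters in all of this output is the "Created device" entry, which names the GPU TensorFlow actually picked up. As a sketch for pulling it out of a noisy log, the following parses that line; the regular expression is fitted to the log line quoted above and `parse_created_device` is a name chosen for this example.

```python
import re

DEVICE_RE = re.compile(
    r"Created device (?P<device>\S+) with (?P<memory_mb>\d+) MB memory:\s+-> "
    r"device: \d+, name: (?P<name>[^,]+), .*"
    r"compute capability: (?P<cc>[\d.]+)"
)


def parse_created_device(line):
    """Extract device path, memory, GPU name and compute capability."""
    match = DEVICE_RE.search(line)
    if match is None:
        return None
    return {
        "device": match.group("device"),
        "memory_mb": int(match.group("memory_mb")),
        "name": match.group("name").strip(),
        "compute_capability": match.group("cc"),
    }


# The relevant line from the log in this post.
LOG_LINE = (
    "2023-03-05 14:43:59.110894: I tensorflow/core/common_runtime/gpu/"
    "gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/"
    "device:GPU:0 with 30969 MB memory:  -> device: 0, "
    "name: Tesla V100S-PCIE-32GB, pci bus id: 0000:00:06.0, "
    "compute capability: 7.0"
)

print(parse_created_device(LOG_LINE))
```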