Configuring a dual-GPU TensorFlow deep-learning environment on CentOS 7 (1708)

At work I needed to set up a GPU environment for TensorFlow: a single CentOS 7 machine with two graphics cards. The first step is installing CentOS 7 itself, followed by the graphics drivers; with multiple cards, the driver installation differs somewhat from the single-card case. The whole process is described in detail below, so this document can serve as a reference for similar deployments.

1. Installing CentOS 7 (1708)

Write the system image to a USB drive with an image-writing tool, then install from it. Boot from the USB drive; when offered UEFI and non-UEFI modes, choose the non-UEFI USB entry to reach the installation menu.

1.1 Changing the image mount location

After choosing Install CentOS 7 the installation begins; if everything goes smoothly you can skip the rest of this section. If the error "failed to map image memory..." appears, the image mount location has to be changed. After the error, wait 1-2 minutes and a command line appears. List the devices with ls /dev; in my case the USB drive was mounted at /dev/sdb4. Note this path down, it will be needed shortly.

Reboot the machine, boot from the USB drive again, highlight Install CentOS 7, and press Tab to edit the boot command line. Change:

linuxefi /images/pxeboot/vmlinuz inst.stage2=hd:LABEL=CentOS\x207\x20x86_64 quiet

to:

linuxefi /images/pxeboot/vmlinuz inst.stage2=hd:/dev/sdb4 quiet

then press Enter to start the installation and enter the graphical installer. If the graphical installer still fails to start, see the next section.

1.2 Fixing graphical-installer failures

If the graphical installer fails with "X startup failed, falling back to text mode", the basic graphical environment usually failed to start. To work around it, reboot, boot from the USB drive, and instead of Install CentOS 7 choose Troubleshooting -->, then select the first entry and change the image location there (press Tab or e to edit; the change is the same as in section 1.1).

1.3 Configuring the SSH service

After the system is installed, configure the SSH service. In this setup the SSH port is changed to 22222, so port 22222 must be opened in the firewall. The detailed SSH configuration is not covered here.
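As a sketch, changing the port to 22222 might look like the following (assuming the stock /etc/ssh/sshd_config and firewalld; run as root):

sed -i 's/^#\?Port .*/Port 22222/' /etc/ssh/sshd_config  # set the sshd listening port
firewall-cmd --zone=public --add-port=22222/tcp --permanent  # open the port
firewall-cmd --reload
systemctl restart sshd

On an SELinux-enforcing host you may also need semanage port -a -t ssh_port_t -p tcp 22222 before restarting sshd.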

2. Inspecting the graphics cards from Linux

The lspci command lists PCI devices. Use lspci | grep -i vga to list all graphics cards in the machine along with their models, or lspci -vnn | grep VGA -A 12 for more detailed information about each card.

3. Installing the NVIDIA graphics driver

rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm  # add the ELRepo repository
yum install nvidia-detect  # install the nvidia-detect tool
nvidia-detect -v  # detect the graphics card model
yum update  # update packages
yum update kernel kernel-devel  # update the kernel and kernel headers
lsmod | grep nouveau  # check whether the nouveau driver is loaded

Edit /lib/modprobe.d/dist-blacklist.conf and add the following two lines to blacklist the open-source nouveau driver: blacklist nouveau and options nouveau modeset=0.
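Equivalently, the two lines can be appended non-interactively (run as root):

cat >> /lib/modprobe.d/dist-blacklist.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF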

3.1 Updating and regenerating the GRUB2 configuration

Open /etc/default/grub and append rdblacklist=nouveau after quiet in the GRUB_CMDLINE_LINUX line, so it reads GRUB_CMDLINE_LINUX="rd.lvm.lv=vg_centos/lv_root rd.lvm.lv=vg_centos/lv_swap rhgb quiet rdblacklist=nouveau", then save the file.

grub2-mkconfig -o /boot/grub2/grub.cfg

Next, rebuild the initramfs. First move the existing one elsewhere as a backup; open a terminal and run:

sudo mv /boot/initramfs-$(uname -r).img /path/of/your/choice  # back up the current initramfs
sudo dracut /boot/initramfs-$(uname -r).img $(uname -r)       # rebuild it

3.2 Installing the NVIDIA driver

Download the driver from the NVIDIA website; it comes as a .run file. Run it with sh *.run and simply answer yes to the prompts. The installed driver provides a self-update command, nvidia-installer --update. After installation, check /etc/X11/xorg.conf: the Driver entry in the Device section will have been set to the NVIDIA driver.

Running lspci | grep -i vga again shows that the Intel integrated graphics has disappeared: it was disabled earlier, so only the discrete cards are now visible.

3.3 Installing Bumblebee

yum -y install bumblebee

The official Bumblebee wiki explains its relationship to CUDA: if you only need CUDA, Bumblebee is not required, so you can skip installing it and skip configuring GPU switching.

4. Installing the CUDA Toolkit

Choose a suitable version from the download page; here the latest CUDA 9.1 release is used. Put the downloaded file in a directory and run sh ./cuda_9.1.85_387.26_linux.run to install. Pay close attention to the following prompts during installation:

Do you accept the previously read EULA?
accept/decline/quit: accept
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 387.26?
(y)es/(n)o/(q)uit: n
Install the CUDA 9.1 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
[ default is /usr/local/cuda-9.1 ]:

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 9.1 Samples?
(y)es/(n)o/(q)uit: y

Enter CUDA Samples Location
[ default is /root ]: /usr/local/cuda-9.1/examples

If some libraries are missing, install the corresponding dependencies. The typical messages look like this:

Installing the CUDA Toolkit in /usr/local/cuda-9.1 ...
Missing recommended library: libGLU.so
Missing recommended library: libX11.so
Missing recommended library: libXi.so
Missing recommended library: libXmu.so

Install the dependencies with yum install mesa-libGLU-devel libX11-devel libXi-devel libXmu-devel, then rerun the CUDA Toolkit installer and repeat the steps above. When it finishes, the installer prints:

Please make sure that
- PATH includes /usr/local/cuda-9.1/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-9.1/lib64, or, add /usr/local/cuda-9.1/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.1/bin

Now configure CUDA by adding cuda/bin to PATH and cuda/lib64 to LD_LIBRARY_PATH. For example, append the following to /etc/profile and then run source /etc/profile:

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-9.1/lib64:/usr/local/cuda/lib64
export PATH=$PATH:/usr/local/cuda-9.1/bin:/usr/local/cuda/bin
export CUDA_HOME=/usr/local/cuda-9.1

To test CUDA, go into the examples directory, build the programs under NVIDIA_CUDA-9.1_Samples, and run one of them, for example:
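For reference, building the samples and running deviceQuery might look like this (the sample path assumes the location chosen during the installation above):

cd /usr/local/cuda-9.1/examples/NVIDIA_CUDA-9.1_Samples  # samples location chosen earlier
make -j"$(nproc)"                                        # build all samples
./bin/x86_64/linux/release/deviceQuery                   # run the device-query smoke test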

[root@localhost deviceQuery]# ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce GTX 1060 6GB"
CUDA Driver Version / Runtime Version 9.1 / 9.1
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 6078 MBytes (6373572608 bytes)
(10) Multiprocessors, (128) CUDA Cores/MP: 1280 CUDA Cores
GPU Max Clock rate: 1785 MHz (1.78 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX 1050 Ti"
CUDA Driver Version / Runtime Version 9.1 / 9.1
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 4040 MBytes (4235919360 bytes)
( 6) Multiprocessors, (128) CUDA Cores/MP: 768 CUDA Cores
GPU Max Clock rate: 1392 MHz (1.39 GHz)
Memory Clock rate: 3504 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 1048576 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce GTX 1060 6GB (GPU0) -> GeForce GTX 1050 Ti (GPU1) : No
> Peer access from GeForce GTX 1050 Ti (GPU1) -> GeForce GTX 1060 6GB (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 2
Result = PASS

Result = PASS above means the check succeeded.

[root@localhost bandwidthTest]# ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: GeForce GTX 1060 6GB
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6374.1

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6448.6

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 144068.5

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Result = PASS here means host-device communication works.

5. Installing cuDNN

First download the cuDNN version matching your CUDA version from the download page, then upload it to a directory on the server. The download is an archive; the main steps are:

# change to the directory containing the archive
cp cudnn-9.1-linux-x64-v7.1.solitairetheme8 cudnn-9.1-linux-x64-v7.1.tgz
tar -xvf cudnn-9.1-linux-x64-v7.1.tgz

cp ./lib64/* /usr/local/cuda-9.1/lib64/
cp ./include/* /usr/local/cuda-9.1/include/
chmod a+r /usr/local/cuda-9.1/include/cudnn.h /usr/local/cuda-9.1/lib64/libcudnn*

After the steps above, cuDNN is installed and configured.
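To double-check which cuDNN version ended up in place, the version defines in cudnn.h can be parsed; a small sketch (the header path is the one used above):

```shell
# cudnn_version HEADER -> prints "major.minor.patch" parsed from cudnn.h
cudnn_version() {
  awk '/#define CUDNN_MAJOR/      {M=$3}
       /#define CUDNN_MINOR/      {m=$3}
       /#define CUDNN_PATCHLEVEL/ {p=$3}
       END {printf "%s.%s.%s\n", M, m, p}' "$1"
}

# e.g. cudnn_version /usr/local/cuda-9.1/include/cudnn.h
```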

6. Installing tensorflow-gpu

6.1 Installing Anaconda

Download the latest Anaconda release from the Tsinghua mirror (https://mirrors.tuna.tsinghua.edu.cn/); here version 5.1.0 with Python 3.6.4 is used. Then install it:

sh Anaconda3-5.1.0-Linux-x86_64.sh
# at the prefix prompt, enter /opt/anaconda3 to install Anaconda under /opt/anaconda3
# then add the Python/Anaconda bin directory to .bashrc

Anaconda bundles pip, numpy, scipy, matplotlib, pandas, jupyter-notebook, and more. On top of that, install a small tool for checking GPU status:

pip install gpustat

6.2 Installing the GPU version of TensorFlow

To install TensorFlow, run conda install tensorflow-gpu. This installs the GPU build of TensorFlow together with the CUDA shared libraries it needs; the CUDA-related libraries end up under ${CONDA_HOME}/anaconda3/lib.

Because conda only offers tensorflow-gpu builds it has matched itself, installing through conda pulls in cudnn, the CUDA runtime, and mkl together, with automatically compatible versions. This route is therefore simpler and less error-prone.

You can also install the latest TensorFlow with pip: pip install tensorflow-gpu, and pip will pick an appropriate release. If the GPU turns out to be unavailable afterwards, the cause is almost always a version mismatch in the installed TensorFlow.

Test after the installation completes:

>>> import tensorflow as tf
>>> hello = tf.constant('Hello, Tensorflow')
>>> sess = tf.Session()
2018-04-16 02:26:55.745130: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-16 02:26:56.924933: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-16 02:26:56.925576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7845
pciBusID: 0000:02:00.0
totalMemory: 5.94GiB freeMemory: 5.86GiB
2018-04-16 02:26:56.992033: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-16 02:26:56.992277: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 1 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.392
pciBusID: 0000:01:00.0
totalMemory: 3.94GiB freeMemory: 3.89GiB
2018-04-16 02:26:56.992325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1408] Ignoring visible gpu device (device: 1, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1) with Cuda multiprocessor count: 6. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
2018-04-16 02:26:56.992342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-16 02:26:57.159155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-16 02:26:57.159213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0 1
2018-04-16 02:26:57.159237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N N
2018-04-16 02:26:57.159245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 1: N N
2018-04-16 02:26:57.159377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5649 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:02:00.0, compute capability: 6.1)
>>> print(sess.run(hello))
b'Hello, Tensorflow'

6.3 Installing Keras

Installing Keras is straightforward: pip install keras installs it into site-packages. Then edit /root/.keras/keras.json and set the backend to tensorflow:

{
"floatx": "float32",
"epsilon": 1e-07,
"backend": "tensorflow",
"image_data_format": "channels_last"
}
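The same edit can be scripted with the standard json module; a sketch (the function name is mine, and the config path is the one from this guide):

```python
import json
import os

def set_keras_backend(config_path, backend="tensorflow"):
    """Set the "backend" key in a keras.json file, creating the file if absent."""
    if os.path.exists(config_path):
        with open(config_path) as f:
            cfg = json.load(f)
    else:
        # defaults matching the keras.json shown above
        cfg = {"floatx": "float32", "epsilon": 1e-07,
               "image_data_format": "channels_last"}
    cfg["backend"] = backend
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=4)
    return cfg

# e.g. set_keras_backend(os.path.expanduser("~/.keras/keras.json"))
```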

7. Configuring jupyter-notebook or jupyter-lab

Once Jupyter is configured, you can connect remotely to the jupyter server, which makes it convenient for several people to share the Anaconda environment. First, generate a Jupyter config file:

jupyter notebook --generate-config
# Writing default config to: /root/.jupyter/jupyter_notebook_config.py

Then set the server login password:

In [1]: from notebook.auth import passwd

In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:1c5cf71ad9a3:dad7ae0af719426816841d239f5e8247176dd0adsf'
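For the curious, the sha1:salt:hash string that passwd() returns can be approximated with the standard library. This is a simplified sketch of the scheme, not the library's exact code:

```python
import hashlib
import random

def make_passwd_hash(passphrase, algorithm="sha1"):
    """Approximate notebook.auth.passwd(): salted hash as "algo:salt:digest"."""
    salt = "%012x" % random.getrandbits(48)  # 12 hex characters of salt
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return "%s:%s:%s" % (algorithm, salt, h.hexdigest())

def verify_passwd(passphrase, stored):
    """Check a passphrase against a stored "algo:salt:digest" string."""
    algorithm, salt, digest = stored.split(":")
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return h.hexdigest() == digest
```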

Copy the generated sha1:... string into the config file (vim ~/.jupyter/jupyter_notebook_config.py). The main settings are:

c.NotebookApp.password =
c.NotebookApp.port = 8888
c.NotebookApp.allow_root = True
c.NotebookApp.open_browser = False
c.NotebookApp.notebook_dir = '/opt/workspace'

Open the corresponding firewall ports:

[root@localhost cuda]# firewall-cmd --zone=public --add-port=8888/tcp --permanent
[root@localhost cuda]# firewall-cmd --zone=public --add-port=8889/tcp --permanent
[root@localhost cuda]# firewall-cmd --reload

8. Possible errors and fixes

8.1 Error: 'GLIBC_2.23' not found

ImportError: /lib64/libm.so.6: version `GLIBC_2.23' not found (required by /opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)

mkdir ~/glibc
cd ~/glibc

wget http://ftp.gnu.org/gnu/glibc/glibc-2.23.tar.gz
tar zxvf glibc-2.23.tar.gz
cd glibc-2.23
mkdir build
cd build

../configure --prefix=/opt/glibc-2.23
make -j4
sudo make install

export LD_LIBRARY_PATH=/opt/glibc-2.23/lib

After this fix, Python itself may stop working: Python was originally built against glibc-2.17, so pointing it at the new glibc can cause problems. Still, this approach does resolve the error above, and it has worked in other scenarios as well.

8.2 Error: InvalidArgumentError

The error below appears when only the CPU is detected and no GPU. It almost always means the TensorFlow version does not match the installed CUDA and cuDNN versions. If you can install through conda, do so; it matches the versions automatically.

(see above for traceback): Cannot assign a device for operation 'matmul/bias': Operation was explicitly assigned to /device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device

A related problem: only one GPU is detected, and the second card does not appear in the GPU list. As the log in section 6.2 shows, TensorFlow by default ignores GPUs with too few multiprocessors, so the environment variable TF_MIN_GPU_MULTIPROCESSOR_COUNT has to be lowered before the second card becomes usable. System-wide you can run export TF_MIN_GPU_MULTIPROCESSOR_COUNT=2; inside a project, set it in code before importing TensorFlow:

import os
os.environ['TF_MIN_GPU_MULTIPROCESSOR_COUNT'] = '2'
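CUDA_VISIBLE_DEVICES works the same way for choosing which physical cards TensorFlow sees. Both variables must be set before tensorflow is imported; a sketch (the device indices typically follow the CUDA enumeration order reported by deviceQuery):

```python
import os

# Set BEFORE "import tensorflow": TF reads these when it initializes the GPUs.
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"] = "2"  # accept GPUs with few SMs (e.g. the GTX 1050 Ti)
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"           # expose both cards; "0" would expose only the first

# import tensorflow as tf  # import only after the variables above are set
```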

8.3 FutureWarning

The warning "FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type." is fixed by upgrading h5py to 2.8.0rc1: pip install h5py==2.8.0rc1.

9. References

1. Example of installing a Linux driver

2. Installing the NVIDIA driver on a CentOS 7 (1708) Intel+NVIDIA dual-GPU laptop and controlling the discrete card with Bumblebee

3. Bumblebee wiki

4. CUDA installation and configuration

5. Official cuDNN configuration guide

6. Setting up a TensorFlow deep-learning environment on CentOS 7