
Deep Learning Server Environment Setup: Ubuntu 17.04 + Nvidia GTX 1080 + CUDA 9.0 + cuDNN 7.0 + TensorFlow 1.3


Original source: http://www.52nlp.cn/tag/cuda-9-0

A year ago I put together a "deep learning server" and wrote two posts about setting up its environment: "Deep Learning Machine Environment Setup: Ubuntu 16.04 + Nvidia GTX 1080 + CUDA 8.0" and "Deep Learning Machine Environment Setup: Ubuntu 16.04 + GeForce GTX 1080 + TensorFlow", both of which attracted a fair amount of attention and citation. Over the past year the deep learning wave has kept rolling; in particular, Andrew Ng recently launched his deep learning course series on Coursera, a beginner-oriented series that lowers the barrier to entry even further.

Recently this machine ran into some trouble, and in the spirit of "better to tinker than not", I reinstalled the system from scratch, choosing the latest versions: Ubuntu 17.04, CUDA 9.0, cuDNN 7.0 and TensorFlow 1.3, and then hit a pile of new pitfalls. On top of that, almost everything I could find on Google, at home or abroad, still covers CUDA 8.0 and cuDNN 6.0/5.0, so I am writing up this round of deep learning machine setup here.

1. Preparation

After installing Ubuntu 17.04, there are two preparatory steps. The first is to update the apt-get sources; this time I used the NetEase (163) mirror:

deb http://mirrors.163.com/ubuntu/ zesty main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ zesty-security main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ zesty-updates main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ zesty-proposed main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ zesty-backports main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ zesty main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ zesty-security main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ zesty-updates main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ zesty-proposed main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ zesty-backports main restricted universe multiverse

The second is to point pip at the Tsinghua University mirror (https://mirrors.tuna.tsinghua.edu.cn/help/pypi/). Concretely, create a ~/.config/pip/pip.conf file containing:

[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple

Both changes dramatically speed up downloading and installing the packages used below, so they are well worth the minute they take.
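
For reference, a minimal sketch of applying both mirror changes; it assumes the default /etc/apt/sources.list location and overwrites ~/.config/pip/pip.conf, so back up anything you care about first:

# back up the original apt sources, then paste in the 163 mirror entries above
sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak
sudo nano /etc/apt/sources.list
sudo apt-get update

# point pip at the Tsinghua mirror
mkdir -p ~/.config/pip
cat > ~/.config/pip/pip.conf <<'EOF'
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
EOF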

Next, install the driver for the GTX 1080. I followed "How to install Nvidia Drivers on Ubuntu 17.04 & below, Linux Mint" and picked the latest driver it pointed to, 381.09:


sudo apt-get purge nvidia*
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update && sudo apt-get install nvidia-381 nvidia-settings

After installation, reboot the machine and run nvidia-smi to check that the driver is working. That said, the CUDA 9 installation later pulled in the 384.69 driver anyway, so I am not sure this step is strictly necessary.
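
A quick sanity check after the reboot; both commands are standard, and the exact version strings will depend on which driver actually ended up installed:

nvidia-smi                        # should list the GeForce GTX 1080 and the driver version
cat /proc/driver/nvidia/version   # the loaded kernel module should report the same version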

2. Install the CUDA Toolkit

Head to NVIDIA's official CUDA page as usual; after logging in you can download CUDA 9.0 from the CUDA Toolkit 9.0 Release Candidate Downloads page. This time I chose the deb package for Ubuntu 17.04:

[screenshot: selecting the Ubuntu 17.04 deb package on the CUDA 9.0 download page]

After downloading the deb file, install CUDA 9 the way the official instructions describe:

sudo dpkg -i cuda-repo-ubuntu1704-9-0-local-rc_9.0.103-1_amd64.deb
sudo apt-key add /var/cuda-repo-9-0-local-rc/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda

During installation it looked like the graphics driver was installed yet again, this time version 384.69. Afterwards, running "nvidia-smi" produced the error "Failed to initialize NVML: Driver/library version mismatch". At this point you need to reboot so the new driver takes effect; then run "nvidia-smi" again:

[screenshot: nvidia-smi output after the reboot]

Next you can try out the CUDA samples. I copied the samples shipped with cuda-9.0 to a temporary directory and compiled them:


cp -r /usr/local/cuda-9.0/samples/ .
cd samples/
make

Then run a couple of them:

textminer@textminer:~/cuda_sample/samples/1_Utilities/bandwidthTest$ ./bandwidthTest

[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: GeForce GTX 1080
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 11258.6

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12875.1

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 231174.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

textminer@textminer:~/cuda_sample/samples/6_Advanced/c++11_cuda$ ./c++11_cuda

GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

Read 3223503 byte corpus from ./warandpeace.txt
counted 107310 instances of 'x', 'y', 'z', or 'w' in "./warandpeace.txt"
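
Besides bandwidthTest and the c++11_cuda sample shown above, deviceQuery is another common sanity check; the path below assumes the same copied sample tree and the default in-place build done by make:

cd ~/cuda_sample/samples/1_Utilities/deviceQuery
./deviceQuery    # should report compute capability 6.1 for the GTX 1080 and end with "Result = PASS"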

Finally, set the CUDA environment variables in ~/.bashrc:

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda

Then run source ~/.bashrc to make them take effect.
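
A quick way to confirm the variables took hold; nvcc ships with the toolkit, so this doubles as a toolkit check:

source ~/.bashrc
nvcc --version           # should report the CUDA 9.0 release
echo $LD_LIBRARY_PATH    # should include /usr/local/cuda/lib64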

3. Install cuDNN

Installing cuDNN is simple, but you again need to go to the NVIDIA site: https://developer.nvidia.com/cudnn. This time we pick cuDNN 7, which NVIDIA's page describes as follows:

What’s New in cuDNN 7?
Deep learning frameworks using cuDNN 7 can leverage new features and performance of the Volta architecture to deliver up to 3x faster training performance compared to Pascal GPUs. cuDNN 7 is now available as a free download to the members of the NVIDIA Developer Program. Highlights include:

Up to 2.5x faster training of ResNet50 and 3x faster training of NMT language translation LSTM RNNs on Tesla V100 vs. Tesla P100
Accelerated convolutions using mixed-precision Tensor Cores operations on Volta GPUs
Grouped Convolutions for models such as ResNeXt and Xception and CTC (Connectionist Temporal Classification) loss layer for temporal classification

I chose this version: cuDNN v7.0 (August 3, 2017), for CUDA 9.0 RC --- cuDNN v7.0 Library for Linux
[screenshot: cuDNN v7.0 Library for Linux download selection]

After downloading, extract the archive and copy the files into the CUDA installation directory:

tar -zxvf cudnn-9.0-linux-x64-v7.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp -d cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
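
To confirm the copy worked, you can grep the version macros out of the installed header; this is just a quick check, not a required step:

grep CUDNN_MAJOR -A 2 /usr/local/cuda/include/cudnn.h
# expect something along the lines of:
# #define CUDNN_MAJOR 7
# #define CUDNN_MINOR 0
# #define CUDNN_PATCHLEVEL ...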

4. Install TensorFlow 1.3

Before installing TensorFlow itself, the official TensorFlow installation guide says to install the libcupti-dev library first:

The libcupti-dev library, which is the NVIDIA CUDA Profile Tools Interface. This library provides advanced profiling support. To install this library, issue the following command:

$ sudo apt-get install libcupti-dev

Then install the TensorFlow 1.3 GPU version inside a virtualenv. Note that I am using Python 2.7:

sudo apt-get install python-pip python-dev python-virtualenv
virtualenv --system-site-packages tensorflow1.3
source tensorflow1.3/bin/activate
(tensorflow1.3) textminer@textminer:~/tensorflow/tensorflow1.3$ pip install --upgrade tensorflow-gpu
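
As an aside, the virtualenv can be left and re-entered at any time; the directory name tensorflow1.3 is simply the one created above:

deactivate                          # leave the virtualenv
source tensorflow1.3/bin/activate   # re-enter it later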

Going through the Tsinghua pip mirror, installing the tensorflow-gpu package this way is very fast:

Collecting tensorflow-gpu
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ca/c4/e39443dcdb80631a86c265fb07317e2c7ea5defe73cb531b7cd94692f8f5/tensorflow_gpu-1.3.0-cp27-cp27mu-manylinux1_x86_64.whl (158.8MB)
21% |███████ | 34.7MB 958kB/s eta 0:02:10

Successfully built markdown html5lib
Installing collected packages: backports.weakref, protobuf, funcsigs, pbr, mock, numpy, markdown, html5lib, bleach, werkzeug, tensorflow-tensorboard, tensorflow-gpu
Successfully installed backports.weakref-1.0rc1 bleach-1.5.0 funcsigs-1.0.2 html5lib-0.9999999 markdown-2.6.9 mock-2.0.0 numpy-1.13.1 pbr-3.1.1 protobuf-3.4.0 tensorflow-gpu-1.3.0 tensorflow-tensorboard-0.1.5 werkzeug-0.12.2

Installing TensorFlow this way is convenient, and switching between TensorFlow versions is easy too; if not for the pitfall below, it would be my first choice. I then tried to run TensorFlow, fully expecting a clean import followed by the GPU details:

(tensorflow1.3) textminer@textminer:~/tensorflow/tensorflow1.3$ python
Python 2.7.13 (default, Jan 19 2017, 14:48:08)
[GCC 6.3.0 20170118] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf

Instead it threw the following error:

File "/home/textminer/tensorflow/tensorflow1.3/local/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcusolver.so.8.0: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/install_sources#common_installation_problems

I checked /usr/local/cuda/lib64/ and libcusolver.so.9.0 is indeed there. After some googling I was fairly confident about the cause: the official TensorFlow wheels do not yet support CUDA 9, only CUDA 8, so the pip package looks for the CUDA 8.0 suffix, libcusolver.so.8.0.
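
You can see the mismatch directly by listing the solver libraries the toolkit actually installed; this is purely diagnostic, and the real fix is the source build below rather than any symlink hack:

ls -l /usr/local/cuda/lib64/libcusolver*
# only libcusolver.so.9.0 (and its symlinks) are present,
# while the pip wheel was linked against libcusolver.so.8.0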

Fortunately all was not lost. Material on this is scarce, but Google turned up two recent TensorFlow issues on GitHub: "Upgrade to CuDNN 7 and CUDA 9" and "CUDA 9RC + cuDNN7". The first asks for the official TensorFlow builds to support CUDA 9 and cuDNN 7: "Please upgrade TensorFlow to support CUDA 9 and CuDNN 7. Nvidia claims this will provide a 2x performance boost on Pascal GPUs." The second describes an unofficial source-build route for CUDA 9 and cuDNN 7: "This is an unofficial and very not supported patch to make it possible to compile TensorFlow with CUDA9RC and cuDNN 7 or CUDA8 + cuDNN 7."

So it was back to building TensorFlow from source, an approach I do not recommend. I still remember how painful the source build was last summer, especially with the network restrictions inside China, which makes me even less willing to recommend it; but there was no way around it, so I gave it a try. To be clear: if the official TensorFlow release supports CUDA 9 and cuDNN 7 by the time you read this, just install it via pip as above and ignore everything that follows.

5. Installing TensorFlow from source

To be fair, strictly following the steps from that ten-day-old GitHub issue basically works:

git clone https://github.com/tensorflow/tensorflow.git
wget https://storage.googleapis.com/tf-performance/public/cuda9rc_patch/0001-CUDA-9.0-and-cuDNN-7.0-support.patch
wget https://storage.googleapis.com/tf-performance/public/cuda9rc_patch/eigen.f3a22f35b044.cuda9.diff
cd tensorflow/
git status
git checkout db596594b5653b43fcb558a4753b39904bb62cbd~
git apply ../0001-CUDA-9.0-and-cuDNN-7.0-support.patch
./configure
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

I did, however, hit one problem: after configure, building TensorFlow with bazel failed with the following error:

ERROR: Skipping '//tensorflow/tools/pip_package:build_pip_package': error loading package 'tensorflow/tools/pip_package': Encountered error while reading extension file 'cuda/build_defs.bzl': no such package '@local_config_cuda//cuda

Some googling showed that I was on the newest bazel, 0.5.4, and that rolling the version back is a known workaround; after downgrading to bazel 0.5.2 the problem went away. For reference, my choices during the configure step are recorded below, after the downgrade sketch:
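
One way to do the downgrade, assuming bazel's standard release installer; the file name follows bazel's usual naming scheme, so double-check it against the bazelbuild releases page:

# remove the apt-installed bazel if there is one, then install 0.5.2 via the release installer
sudo apt-get remove bazel
wget https://github.com/bazelbuild/bazel/releases/download/0.5.2/bazel-0.5.2-installer-linux-x86_64.sh
chmod +x bazel-0.5.2-installer-linux-x86_64.sh
./bazel-0.5.2-installer-linux-x86_64.sh --user
bazel version    # should now report 0.5.2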

Please specify the location of python. [Default is /usr/bin/python]:
Found possible Python library paths:
/usr/local/lib/python2.7/dist-packages
/usr/lib/python2.7/dist-packages
Please input the desired Python library path to use. Default is /usr/local/lib/python2.7/dist-packages
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: Y
jemalloc as malloc support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Google Cloud Platform support? [y/N]: N
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [y/N]: N
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]:
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]:
No VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL support? [y/N]:
No OpenCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 8.0]: 9.0
Please specify the location where CUDA 9.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 6.0]: 7
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1]
Do you want to use clang as CUDA compiler? [y/N]: N
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
Do you wish to build TensorFlow with MPI support? [y/N]:
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Add "--config=mkl" to your bazel command to build with MKL support.
Please note that MKL on MacOS or windows is still not supported.
If you would like to use a local MKL instead of downloading, please set the environment variable "TF_MKL_ROOT" every time before build.
Configuration finished

Even with the right bazel version and a clean configure, the first bazel build of TensorFlow will still fail:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

But this is exactly the case the issue covers, and it provides an Eigen patch as the fix:

Attempt to build TensorFlow, so that Eigen is downloaded. This build will fail if building for CUDA9RC but will succeed for CUDA8
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

Apply the Eigen patch:

    cd -P bazel-out/../../../external/eigen_archive
    patch -p1 < ~/Downloads/eigen.f3a22f35b044.cuda9.diff

Build TensorFlow successfully
    cd -
    bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

Building TensorFlow again then succeeds; finally, build the pip package:

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
ls /tmp/tensorflow_pkg/
tensorflow-1.3.0rc1-cp27-cp27mu-linux_x86_64.whl
sudo pip install /tmp/tensorflow_pkg/tensorflow-1.3.0rc1-cp27-cp27mu-linux_x86_64.whl

Let's try the freshly installed TensorFlow in ipython:

Python 2.7.13 (default, Jan 19 2017, 14:48:08) 
Type "copyright", "credits" or "license" for more information.
 
IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: import tensorflow as tf
 
In [2]: hello = tf.constant('Hello, Tensorflow')
 
In [3]: sess = tf.Session()
2017-09-01 13:32:08.828776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.835
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.62GiB
2017-09-01 13:32:08.828808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-09-01 13:32:08.828813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-09-01 13:32:08.828823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
 
In [4]: print(sess.run(hello))
Hello, Tensorflow

The GPU information finally shows up. From here on, enjoy the speedup the GPU build of TensorFlow brings.
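
If you want to double-check that ops really land on the GPU, here is a minimal sketch using the TensorFlow 1.x log_device_placement option:

import tensorflow as tf

# log which device each op is assigned to; the GTX 1080 should show up as /gpu:0
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0, 3.0], name='a')
    b = tf.constant([4.0, 5.0, 6.0], name='b')
    print(sess.run(a + b))  # placement lines are printed as the graph runs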
