
Deep Learning Server Environment Setup: Ubuntu 17.04 + Nvidia GTX 1080 + CUDA 9.0 + cuDNN 7.0 + TensorFlow 1.3


Original source: http://www.52nlp.cn/tag/cuda-9-0

A year ago I put together a "deep learning server" and wrote two posts about setting up its environment: "Deep Learning Machine Environment Setup: Ubuntu 16.04 + Nvidia GTX 1080 + CUDA 8.0" and "Deep Learning Machine Environment Setup: Ubuntu 16.04 + GeForce GTX 1080 + TensorFlow", both of which attracted a fair amount of attention and citation. Over the past year the deep learning wave has kept rolling; in particular, Andrew Ng recently launched his deep learning course series on Coursera, a beginner-oriented series that lowers the barrier to entry even further.

Recently this machine ran into some trouble, and in the spirit of "better to tinker than not", I reinstalled the system from scratch, choosing the latest versions: Ubuntu 17.04, CUDA 9.0, cuDNN 7.0 and TensorFlow 1.3, and then hit a pile of new pitfalls. On top of that, almost everything I could find on Google, at home or abroad, still covers CUDA 8.0 and cuDNN 6.0/5.0, so I am writing up this round of deep learning machine setup here.

1. Preparation

After installing Ubuntu 17.04, there are two preparatory steps. The first is to update the apt-get sources; this time I used the NetEase (163) mirror:

deb http://mirrors.163.com/ubuntu/ zesty main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ zesty-security main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ zesty-updates main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ zesty-proposed main restricted universe multiverse
deb http://mirrors.163.com/ubuntu/ zesty-backports main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ zesty main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ zesty-security main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ zesty-updates main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ zesty-proposed main restricted universe multiverse
deb-src http://mirrors.163.com/ubuntu/ zesty-backports main restricted universe multiverse

The second is to point pip at the Tsinghua University mirror (https://mirrors.tuna.tsinghua.edu.cn/help/pypi/). Concretely, create a ~/.config/pip/pip.conf file containing:

[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple

Both changes dramatically speed up downloading and installing the packages used below, so they are well worth the minute they take.
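
For reference, a minimal sketch of applying both mirror changes; it assumes the default /etc/apt/sources.list location and overwrites ~/.config/pip/pip.conf, so back up anything you care about first:

# back up the original apt sources, then paste in the 163 mirror entries above
sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak
sudo nano /etc/apt/sources.list
sudo apt-get update

# point pip at the Tsinghua mirror
mkdir -p ~/.config/pip
cat > ~/.config/pip/pip.conf <<'EOF'
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
EOF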

Next, install the driver for the GTX 1080. I followed "How to install Nvidia Drivers on Ubuntu 17.04 & below, Linux Mint" and picked the latest driver it pointed to, 381.09:


sudo apt-get purge nvidia*
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update && sudo apt-get install nvidia-381 nvidia-settings

After installation, reboot the machine and run nvidia-smi to check that the driver is working. That said, the CUDA 9 installation later pulled in the 384.69 driver anyway, so I am not sure this step is strictly necessary.
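
A quick sanity check after the reboot; both commands are standard, and the exact version strings will depend on which driver actually ended up installed:

nvidia-smi                        # should list the GeForce GTX 1080 and the driver version
cat /proc/driver/nvidia/version   # the loaded kernel module should report the same version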

2. Install the CUDA Toolkit

Head to NVIDIA's official CUDA page as usual; after logging in you can download CUDA 9.0 from the CUDA Toolkit 9.0 Release Candidate Downloads page. This time I chose the deb package for Ubuntu 17.04:

[screenshot: selecting the Ubuntu 17.04 deb package on the CUDA 9.0 download page]

After downloading the deb file, install CUDA 9 the way the official instructions describe:

sudo dpkg -i cuda-repo-ubuntu1704-9-0-local-rc_9.0.103-1_amd64.deb
sudo apt-key add /var/cuda-repo-9-0-local-rc/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda

During installation it looked like the graphics driver was installed yet again, this time version 384.69. Afterwards, running "nvidia-smi" produced the error "Failed to initialize NVML: Driver/library version mismatch". At this point you need to reboot so the new driver takes effect; then run "nvidia-smi" again:

[screenshot: nvidia-smi output after the reboot]

Next you can try out the CUDA samples. I copied the samples shipped with cuda-9.0 to a temporary directory and compiled them:


cp -r /usr/local/cuda-9.0/samples/ .
cd samples/
make

Then run a couple of them:

textminer@textminer:~/cuda_sample/samples/1_Utilities/bandwidthTest$ ./bandwidthTest

[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: GeForce GTX 1080
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 11258.6

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12875.1

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 231174.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

textminer@textminer:~/cuda_sample/samples/6_Advanced/c++11_cuda$ ./c++11_cuda

GPU Device 0: "GeForce GTX 1080" with compute capability 6.1

Read 3223503 byte corpus from ./warandpeace.txt
counted 107310 instances of 'x', 'y', 'z', or 'w' in "./warandpeace.txt"
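
Besides bandwidthTest and the c++11_cuda sample shown above, deviceQuery is another common sanity check; the path below assumes the same copied sample tree and the default in-place build done by make:

cd ~/cuda_sample/samples/1_Utilities/deviceQuery
./deviceQuery    # should report compute capability 6.1 for the GTX 1080 and end with "Result = PASS"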

Finally, set the CUDA environment variables in ~/.bashrc:

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda

Then run source ~/.bashrc to make them take effect.
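
A quick way to confirm the variables took hold; nvcc ships with the toolkit, so this doubles as a toolkit check:

source ~/.bashrc
nvcc --version           # should report the CUDA 9.0 release
echo $LD_LIBRARY_PATH    # should include /usr/local/cuda/lib64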

3. Install cuDNN

Installing cuDNN is simple, but you again need to go to the NVIDIA site: https://developer.nvidia.com/cudnn. This time we pick cuDNN 7, which NVIDIA's page describes as follows:

What’s New in cuDNN 7?
Deep learning frameworks using cuDNN 7 can leverage new features and performance of the Volta architecture to deliver up to 3x faster training performance compared to Pascal GPUs. cuDNN 7 is now available as a free download to the members of the NVIDIA Developer Program. Highlights include:

Up to 2.5x faster training of ResNet50 and 3x faster training of NMT language translation LSTM RNNs on Tesla V100 vs. Tesla P100
Accelerated convolutions using mixed-precision Tensor Cores operations on Volta GPUs
Grouped Convolutions for models such as ResNeXt and Xception and CTC (Connectionist Temporal Classification) loss layer for temporal classification

I chose this version: cuDNN v7.0 (August 3, 2017), for CUDA 9.0 RC --- cuDNN v7.0 Library for Linux
[screenshot: cuDNN v7.0 Library for Linux download selection]

After downloading, extract the archive and copy the files into the CUDA installation directory:

tar -zxvf cudnn-9.0-linux-x64-v7.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp -d cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
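
To confirm the copy worked, you can grep the version macros out of the installed header; this is just a quick check, not a required step:

grep CUDNN_MAJOR -A 2 /usr/local/cuda/include/cudnn.h
# expect something along the lines of:
# #define CUDNN_MAJOR 7
# #define CUDNN_MINOR 0
# #define CUDNN_PATCHLEVEL ...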

4. Install TensorFlow 1.3

Before installing TensorFlow itself, the official TensorFlow installation guide says to install the libcupti-dev library first:

The libcupti-dev library, which is the NVIDIA CUDA Profile Tools Interface. This library provides advanced profiling support. To install this library, issue the following command:

$ sudo apt-get install libcupti-dev

Then install the TensorFlow 1.3 GPU version inside a virtualenv. Note that I am using Python 2.7:

sudo apt-get install python-pip python-dev python-virtualenv
virtualenv --system-site-packages tensorflow1.3
source tensorflow1.3/bin/activate
(tensorflow1.3) textminer@textminer:~/tensorflow/tensorflow1.3$ pip install --upgrade tensorflow-gpu
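
As an aside, the virtualenv can be left and re-entered at any time; the directory name tensorflow1.3 is simply the one created above:

deactivate                          # leave the virtualenv
source tensorflow1.3/bin/activate   # re-enter it later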

Going through the Tsinghua pip mirror, installing the tensorflow-gpu package this way is very fast:

Collecting tensorflow-gpu
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ca/c4/e39443dcdb80631a86c265fb07317e2c7ea5defe73cb531b7cd94692f8f5/tensorflow_gpu-1.3.0-cp27-cp27mu-manylinux1_x86_64.whl (158.8MB)
21% |███████ | 34.7MB 958kB/s eta 0:02:10

Successfully built markdown html5lib
Installing collected packages: backports.weakref, protobuf, funcsigs, pbr, mock, numpy, markdown, html5lib, bleach, werkzeug, tensorflow-tensorboard, tensorflow-gpu
Successfully installed backports.weakref-1.0rc1 bleach-1.5.0 funcsigs-1.0.2 html5lib-0.9999999 markdown-2.6.9 mock-2.0.0 numpy-1.13.1 pbr-3.1.1 protobuf-3.4.0 tensorflow-gpu-1.3.0 tensorflow-tensorboard-0.1.5 werkzeug-0.12.2

Installing TensorFlow this way is convenient, and switching between TensorFlow versions is easy too; if not for the pitfall below, it would be my first choice. I then tried to run TensorFlow, fully expecting a clean import followed by the GPU details:

(tensorflow1.3) textminer@textminer:~/tensorflow/tensorflow1.3$ python
Python 2.7.13 (default, Jan 19 2017, 14:48:08)
[GCC 6.3.0 20170118] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf

Instead it threw the following error:

File "/home/textminer/tensorflow/tensorflow1.3/local/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcusolver.so.8.0: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/install_sources#common_installation_problems

I checked /usr/local/cuda/lib64/ and libcusolver.so.9.0 is indeed there. After some googling I was fairly confident about the cause: the official TensorFlow wheels do not yet support CUDA 9, only CUDA 8, so the pip package looks for the CUDA 8.0 suffix, libcusolver.so.8.0.
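
You can see the mismatch directly by listing the solver libraries the toolkit actually installed; this is purely diagnostic, and the real fix is the source build below rather than any symlink hack:

ls -l /usr/local/cuda/lib64/libcusolver*
# only libcusolver.so.9.0 (and its symlinks) are present,
# while the pip wheel was linked against libcusolver.so.8.0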

Fortunately all was not lost. Material on this is scarce, but Google turned up two recent TensorFlow issues on GitHub: "Upgrade to CuDNN 7 and CUDA 9" and "CUDA 9RC + cuDNN7". The first asks for the official TensorFlow builds to support CUDA 9 and cuDNN 7: "Please upgrade TensorFlow to support CUDA 9 and CuDNN 7. Nvidia claims this will provide a 2x performance boost on Pascal GPUs." The second describes an unofficial source-build route for CUDA 9 and cuDNN 7: "This is an unofficial and very not supported patch to make it possible to compile TensorFlow with CUDA9RC and cuDNN 7 or CUDA8 + cuDNN 7."

So it was back to building TensorFlow from source, an approach I do not recommend. I still remember how painful the source build was last summer, especially with the network restrictions inside China, which makes me even less willing to recommend it; but there was no way around it, so I gave it a try. To be clear: if the official TensorFlow release supports CUDA 9 and cuDNN 7 by the time you read this, just install it via pip as above and ignore everything that follows.

5. Installing TensorFlow from source

To be fair, strictly following the steps from that ten-day-old GitHub issue basically works:

git clone https://github.com/tensorflow/tensorflow.git
wget https://storage.googleapis.com/tf-performance/public/cuda9rc_patch/0001-CUDA-9.0-and-cuDNN-7.0-support.patch
wget https://storage.googleapis.com/tf-performance/public/cuda9rc_patch/eigen.f3a22f35b044.cuda9.diff
cd tensorflow/
git status
git checkout db596594b5653b43fcb558a4753b39904bb62cbd~
git apply ../0001-CUDA-9.0-and-cuDNN-7.0-support.patch
./configure
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

I did, however, hit one problem: after configure, building TensorFlow with bazel failed with the following error:

ERROR: Skipping '//tensorflow/tools/pip_package:build_pip_package': error loading package 'tensorflow/tools/pip_package': Encountered error while reading extension file 'cuda/build_defs.bzl': no such package '@local_config_cuda//cuda

Some googling showed that I was on the newest bazel, 0.5.4, and that rolling the version back is a known workaround; after downgrading to bazel 0.5.2 the problem went away. For reference, my choices during the configure step are recorded below, after the downgrade sketch:
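
One way to do the downgrade, assuming bazel's standard release installer; the file name follows bazel's usual naming scheme, so double-check it against the bazelbuild releases page:

# remove the apt-installed bazel if there is one, then install 0.5.2 via the release installer
sudo apt-get remove bazel
wget https://github.com/bazelbuild/bazel/releases/download/0.5.2/bazel-0.5.2-installer-linux-x86_64.sh
chmod +x bazel-0.5.2-installer-linux-x86_64.sh
./bazel-0.5.2-installer-linux-x86_64.sh --user
bazel version    # should now report 0.5.2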

Please specify the location of python. [Default is /usr/bin/python]:
Found possible Python library paths:
/usr/local/lib/python2.7/dist-packages
/usr/lib/python2.7/dist-packages
Please input the desired Python library path to use. Default is /usr/local/lib/python2.7/dist-packages
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: Y
jemalloc as malloc support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Google Cloud Platform support? [y/N]: N
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [y/N]: N
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]:
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]:
No VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL support? [y/N]:
No OpenCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 8.0]: 9.0
Please specify the location where CUDA 9.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 6.0]: 7
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1]
Do you want to use clang as CUDA compiler? [y/N]: N
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
Do you wish to build TensorFlow with MPI support? [y/N]:
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Add "--config=mkl" to your bazel command to build with MKL support.
Please note that MKL on MacOS or windows is still not supported.
If you would like to use a local MKL instead of downloading, please set the environment variable "TF_MKL_ROOT" every time before build.
Configuration finished

Even with the right bazel version and a clean configure, the first bazel build of TensorFlow will still fail:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

But this is exactly the case the issue covers, and it provides an Eigen patch as the fix:

Attempt to build TensorFlow, so that Eigen is downloaded. This build will fail if building for CUDA9RC but will succeed for CUDA8
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

Apply the Eigen patch:

    cd -P bazel-out/../../../external/eigen_archive
    patch -p1 < ~/Downloads/eigen.f3a22f35b044.cuda9.diff

Build TensorFlow successfully
    cd -
    bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

Building TensorFlow again then succeeds; finally, build the pip package:

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
ls /tmp/tensorflow_pkg/
tensorflow-1.3.0rc1-cp27-cp27mu-linux_x86_64.whl
sudo pip install /tmp/tensorflow_pkg/tensorflow-1.3.0rc1-cp27-cp27mu-linux_x86_64.whl

Let's try the freshly installed TensorFlow in ipython:

Python 2.7.13 (default, Jan 19 2017, 14:48:08) 
Type "copyright", "credits" or "license" for more information.
 
IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: import tensorflow as tf
 
In [2]: hello = tf.constant('Hello, Tensorflow')
 
In [3]: sess = tf.Session()
2017-09-01 13:32:08.828776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.835
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.62GiB
2017-09-01 13:32:08.828808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-09-01 13:32:08.828813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-09-01 13:32:08.828823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
 
In [4]: print(sess.run(hello))
Hello, Tensorflow

The GPU information finally shows up. From here on, enjoy the speedup the GPU build of TensorFlow brings.
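
If you want to double-check that ops really land on the GPU, here is a minimal sketch using the TensorFlow 1.x log_device_placement option:

import tensorflow as tf

# log which device each op is assigned to; the GTX 1080 should show up as /gpu:0
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0, 3.0], name='a')
    b = tf.constant([4.0, 5.0, 6.0], name='b')
    print(sess.run(a + b))  # placement lines are printed as the graph runs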
