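My AI Journey (23): Building Bazel on Windows and Using Bazel to Build TensorFlow
Google has abandoned CMake and is pushing its own Bazel hard, and that looks like the way things are heading, so I spent a few days working out how to build Bazel and how to build TensorFlow with it. Most people don't have several machines at work, and the Windows machine has to stay on for everyday tasks; paying for extra hardware out of your own pocket is another story. Running Linux in a VM performs poorly unless the hardware is strong, so I decided to get this working on Windows. Builds that go smoothly on Linux are often poorly supported on Windows, so I was mentally prepared for trouble before I started.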
pacman -Syu zip unzip
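This makes sure zip and unzip are installed.
Then go to https://github.com/bazelbuild/bazel/releases, download the Bazel source package, and extract it.
Start an MSYS2 MinGW 64-bit terminal and set the environment variables: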
export BAZEL_VS="D:/Program Files (x86)/Microsoft Visual Studio 14.0"
export BAZEL_SH="$(cygpath -m $(realpath $(which bash)))"
export PATH="d:/anaconda3:$PATH"
export JAVA_HOME="D:/Java/jdk1.8.0_181"
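Before starting a long build, it can be worth confirming that these variables point at directories that actually exist. The following is only a minimal sketch and not part of Bazel; the helper name check_dir_var is made up for illustration.

```shell
# Hypothetical helper (not part of Bazel): report whether a variable
# names an existing directory. Usage: check_dir_var NAME VALUE
check_dir_var() (
    name=$1
    value=$2
    if [ -z "$value" ]; then
        echo "$name is not set"
        exit 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name points to a missing directory: $value"
        exit 1
    fi
    echo "$name ok: $value"
)
```

It could be called as check_dir_var BAZEL_VS "$BAZEL_VS" and check_dir_var JAVA_HOME "$JAVA_HOME". BAZEL_SH names a file rather than a directory, so it would need a -f test instead.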
Then change into the directory of the extracted source package and run
./compile.sh
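and that is all. When the build finishes, bazel.exe is generated in the output directory. Once you have one successful build, updating to a newer version of the Bazel source later only requires running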
bazel build //src:bazel
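There is no need to rebuild everything from scratch.
Once bazel.exe is built, add its directory to PATH.
Building TensorFlow needs roughly the same environment as building Bazel: install VS/VC, Java 8, MSYS2, and Python or Anaconda first, then add the directory containing bazel.exe to PATH (mine is in D:\AI). You also need to install git and add its path to PATH.
Go to https://github.com/tensorflow/tensorflow, download the zip of the latest TensorFlow source, and extract it (I extracted mine to D:\AI\tensorflow-master). Two script files then need to be modified: one is the environment setup file, the other is the main script that gets executed: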
D:\AI\tensorflow-master\tensorflow\tools\ci_build\windows\bazel\common_env.sh:
# Use a temporary directory with a short name.
export TMPDIR=${TMPDIR:-"C:/tmp"}
export TMPDIR=$(cygpath -m "$TMPDIR")
mkdir -p "$TMPDIR"
export workspace="D:/AI/tensorflow-master"
# Set bash path
export BAZEL_SH=${BAZEL_SH:-"C:/msys64/usr/bin/bash"}
export PYTHON_BASE_PATH="${PYTHON_DIRECTORY:-"D:/Anaconda3"}"
export BAZEL_VS="D:/Program Files (x86)/Microsoft Visual Studio 14.0"
# Set the path to find bazel.
export PATH="/d/AI/:$PATH"
# Set Python path for ./configure
export PYTHON_BIN_PATH="${PYTHON_BASE_PATH}/python.exe"
export PYTHON_LIB_PATH="${PYTHON_BASE_PATH}/lib/site-packages"
# Add python into PATH, it's needed because gen_git_source.py uses
# '/usr/bin/env python' as a shebang
export PATH="${PYTHON_BASE_PATH}:$PATH"
# Add git into PATH needed for gen_git_source.py
export PATH="/d/Program Files/Git/cmd:$PATH"
# Make sure we have pip in PATH
export PATH="${PYTHON_BASE_PATH}/Scripts:$PATH"
# Setting default values to CUDA related environment variables
export TF_CUDA_VERSION=${TF_CUDA_VERSION:-9.0}
export TF_CUDNN_VERSION=${TF_CUDNN_VERSION:-7}
export TF_CUDA_COMPUTE_CAPABILITIES=${TF_CUDA_COMPUTE_CAPABILITIES:-5.0}
export CUDA_TOOLKIT_PATH=${CUDA_TOOLKIT_PATH:-"C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${TF_CUDA_VERSION}"}
export CUDNN_INSTALL_PATH=${CUDNN_INSTALL_PATH:-"C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${TF_CUDA_VERSION}"}
# Add Cuda and Cudnn dll directories into PATH
export PATH="$(cygpath -u "${CUDA_TOOLKIT_PATH}")/bin:$PATH"
export PATH="$(cygpath -u "${CUDA_TOOLKIT_PATH}")/extras/CUPTI/libx64:$PATH"
# export PATH="$(cygpath -u "${CUDNN_INSTALL_PATH}")/bin:$PATH"
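Above, I copied all three cuDNN files into the corresponding CUDA directories, so CUDNN_INSTALL_PATH and CUDA_TOOLKIT_PATH are set to the same path.
The main script has one error that needs fixing; at least under my MSYS2 it does not produce the correct value: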
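Since the build takes hours, a quick check that the cuDNN files really landed under the CUDA directory may be worthwhile. The sketch below assumes the usual cuDNN 7 Windows archive layout (bin/cudnn64_7.dll, include/cudnn.h, lib/x64/cudnn.lib); those file names are an assumption on my part, and missing_cudnn_files is a made-up helper name.

```shell
# Sketch: list which of the three cuDNN 7 files are missing under a CUDA root.
# The relative paths follow the cuDNN 7 Windows archive layout, which is an
# assumption here; adjust them if your cuDNN version differs.
missing_cudnn_files() (
    root=$1
    status=0
    for rel in bin/cudnn64_7.dll include/cudnn.h lib/x64/cudnn.lib; do
        if [ ! -f "$root/$rel" ]; then
            echo "missing: $rel"
            status=1
        fi
    done
    exit "$status"
)
```

From an MSYS2 terminal it could be run as missing_cudnn_files "$(cygpath -u "$CUDA_TOOLKIT_PATH")"; no output and a zero exit status mean all three files were found.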
D:\AI\tensorflow-master\tensorflow\tools\ci_build\windows\gpu\pip\build_tf_windows.sh:
#script_dir=$(dirname $0)
#cd ${script_dir%%tensorflow/tools/ci_build/windows/gpu/pip}.
# Hard-code the repository root instead of deriving it from $0.
script_dir="D:/AI/tensorflow-master"
cd "$script_dir"
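If you would rather not hard-code the path, one alternative is to resolve the script's directory to an absolute path before stripping the known suffix. This is only a sketch: repo_root_from is a hypothetical helper that does not exist in the TensorFlow scripts, and I have not confirmed that it cures whatever went wrong under my MSYS2.

```shell
# Sketch: derive the repository root from the path of build_tf_windows.sh.
# repo_root_from is a hypothetical helper, not part of the TensorFlow scripts.
repo_root_from() (
    script_path=$1
    # Resolve the script's directory to an absolute path first, so the
    # suffix match below does not depend on how the script was invoked.
    script_dir=$(cd "$(dirname "$script_path")" && pwd) || exit 1
    suffix=/tensorflow/tools/ci_build/windows/gpu/pip
    case $script_dir in
        *"$suffix") echo "${script_dir%"$suffix"}" ;;
        *) echo "unexpected script location: $script_dir" >&2; exit 1 ;;
    esac
)
```

Inside the script it could be used as script_dir=$(repo_root_from "$0") && cd "$script_dir". The explicit case check makes it fail loudly, rather than cd to a wrong directory, when the script is not where it is expected to be.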
Then, for convenience, create a run.bat file under D:\AI\tensorflow-master with the following content:
bash D:/AI/tensorflow-master/tensorflow/tools/ci_build/windows/gpu/pip/build_tf_windows.sh %*
Then use pip to install keras-preprocessing and keras-applications into your Anaconda environment:
pip install keras-preprocessing --no-deps
pip install keras-applications --no-deps
pip install h5py
Otherwise the TensorFlow build will later fail and exit with:
ModuleNotFoundError: No module named 'keras-preprocessing'
ModuleNotFoundError: No module named 'keras_applications'
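Because these errors only appear well into the build, it may help to check up front that the modules import. The sketch below takes the interpreter as its first argument; missing_modules is a made-up helper name. Note that the import names use underscores (keras_preprocessing, keras_applications) even though the pip package names use hyphens.

```shell
# Sketch: print every module from the argument list that the given Python
# interpreter cannot import. Usage: missing_modules PYTHON MODULE...
missing_modules() (
    py=$1
    shift
    for mod in "$@"; do
        # A failed import makes the interpreter exit non-zero.
        "$py" -c "import $mod" >/dev/null 2>&1 || echo "$mod"
    done
)
```

For example, missing_modules python keras_preprocessing keras_applications h5py prints nothing when all three are installed.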
Then, in an MSYS2 terminal, run
cd /d/AI/tensorflow-master
./run.bat
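This starts a build that takes several hours. If no error occurs along the way, the resulting file, named something like tensorflow_gpu-1.11.0rc1-cp36-cp36m-win_amd64.whl, is placed under D:\AI\tensorflow-master\py_test_dir\.
Errors that may occur during the build, and how to fix them: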
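The exact wheel file name changes with the TensorFlow and Python versions, so a small helper to pick up whatever was just built can be handy. This is a minimal sketch; newest_wheel is a made-up name.

```shell
# Sketch: print the most recently modified .whl file in a directory,
# or return non-zero if the directory contains no wheel.
newest_wheel() (
    dir=$1
    newest=
    for f in "$dir"/*.whl; do
        [ -f "$f" ] || continue
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest=$f
        fi
    done
    [ -n "$newest" ] || exit 1
    echo "$newest"
)
```

The result could then be installed with pip install "$(newest_wheel /d/AI/tensorflow-master/py_test_dir)".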
(1)
Unzipping simple_console_for_windows.zip to create runfiles tree...
[./bazel-bin/tensorflow/tools/pip_package/simple_console_for_windows.zip]
End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of ./bazel-bin/tensorflow/tools/pip_package/simple_console_for_windows.zip or ./bazel-bin/tensorflow/tools/pip_package/simple_console_for_windows.zip.zip, and cannot find ./bazel-bin/tensorflow/tools/pip_package/simple_console_for_windows.zip.ZIP, period.
This one is the nastiest: the error shows up every time just as the build is about to finish. At first I thought my Bazel version was wrong, but several versions all gave the same error, and reading the Bazel source turned up no clues. I eventually found it in the TensorFlow issues: it happens when the TensorFlow source you are using is from about half a month ago. It was a bug in TensorFlow's own code that has recently been fixed; see https://github.com/tensorflow/tensorflow/issues/22382 for details.
(2)
c:\tmp\6vmqdjhl\execroot\org_tensorflow\bazel-out\x64_windows-opt\genfiles\external\local_config_cuda\cuda\cuda\include\crt/host_config.h(133): fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions 2012, 2013, 2015 and 2017 are supported!
The cause is that CUDA 9.0 only supports up to Visual Studio 2017 version 15.3, so a newer VS is certain to fail (I had initially installed the latest Visual Studio 2017 version 15.8). I tried relaxing the following check in CUDA's host_config.h:
#if _MSC_VER < 1600 || _MSC_VER > 1911
#error -- unsupported Microsoft Visual Studio version! Only the versions 2012, 2013, 2015 and 2017 are supported!
Even with this check changed so that it passes, a pile of errors still appears later in the build. The mapping between _MSC_VER and VS versions is as follows:
MSVC++ 4.0 _MSC_VER == 1000 (Developer Studio 4.0)
MSVC++ 4.2 _MSC_VER == 1020 (Developer Studio 4.2)
MSVC++ 5.0 _MSC_VER == 1100 (Visual Studio 97 version 5.0)
MSVC++ 6.0 _MSC_VER == 1200 (Visual Studio 6.0 version 6.0)
MSVC++ 7.0 _MSC_VER == 1300 (Visual Studio .NET 2002 version 7.0)
MSVC++ 7.1 _MSC_VER == 1310 (Visual Studio .NET 2003 version 7.1)
MSVC++ 8.0 _MSC_VER == 1400 (Visual Studio 2005 version 8.0)
MSVC++ 9.0 _MSC_VER == 1500 (Visual Studio 2008 version 9.0)
MSVC++ 10.0 _MSC_VER == 1600 (Visual Studio 2010 version 10.0)
MSVC++ 11.0 _MSC_VER == 1700 (Visual Studio 2012 version 11.0)
MSVC++ 12.0 _MSC_VER == 1800 (Visual Studio 2013 version 12.0)
MSVC++ 14.0 _MSC_VER == 1900 (Visual Studio 2015 version 14.0)
MSVC++ 14.1 _MSC_VER == 1910 (Visual Studio 2017 version 15.0)
MSVC++ 14.11 _MSC_VER == 1911 (Visual Studio 2017 version 15.3)
MSVC++ 14.12 _MSC_VER == 1912 (Visual Studio 2017 version 15.5)
MSVC++ 14.13 _MSC_VER == 1913 (Visual Studio 2017 version 15.6)
MSVC++ 14.14 _MSC_VER == 1914 (Visual Studio 2017 version 15.7)
MSVC++ 14.15 _MSC_VER == 1915 (Visual Studio 2017 version 15.8)
Installing a VS/VC version that CUDA 9.0 supports cost me a great deal of time. I could not find Visual Studio 2017 version 15.3 (the one matching 1911) on Microsoft's site, but I did find Visual Studio 2017 version 15.0, which matches 1910. With that installed, the build still failed with a pile of errors on reaching the CUDA code, so I had no choice but to install Visual Studio 2015 Update 3 and change the value of BAZEL_VS, after which these errors went away. Every VS install and uninstall takes a long time and needs a good network connection, which is really tiresome. The funniest part is that the TensorFlow developers say Visual Studio 2017 works without problems; I believed them, installed 2017 first, and got burned. From my experiments, CUDA 9.0 really only works with VS 2015. Although CUDA 10 is already out, the latest TensorFlow still uses CUDA 9.0, so building with VS 2015 should not need to change in the short term.
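The version gate in host_config.h quoted above can be mirrored in a one-line shell function, which makes it easy to see which rows of the table the header accepts. This is only a sketch of that single preprocessor check, and cuda90_accepts_msc_ver is a made-up name. As my experience with 1910 shows, passing the gate does not guarantee that the rest of the build succeeds.

```shell
# Sketch of the version gate in CUDA 9.0's host_config.h as quoted above:
# the header rejects _MSC_VER values below 1600 or above 1911.
# Returns 0 (success) when the given _MSC_VER passes the gate.
cuda90_accepts_msc_ver() {
    [ "$1" -ge 1600 ] && [ "$1" -le 1911 ]
}
```

For example, cuda90_accepts_msc_ver 1900 succeeds (Visual Studio 2015), while cuda90_accepts_msc_ver 1915 fails (Visual Studio 2017 version 15.8).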
(3)
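ERROR: D:/ai/tensorflow-master/tensorflow/contrib/lite/toco/python/BUILD:23:1: no such package '@com_google_absl//absl/strings': java.io.IOException: thread interrupted and referenced by '//tensorflow/contrib/lite/toco/python:tensorflow_wrap_toco_py_wrap'
ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted: no such package '@com_google_absl//absl/strings': java.io.IOException: thread interrupted
This happens because the build has to download some packages into the temporary directory. The *.tar.gz package downloaded to C:\tmp\6vmqdjhl\external\com_google_absl\ was incomplete and could not be extracted. Delete the com_google_absl directory and restart the build; once the download succeeds, the package is extracted automatically and the build can continue. Many other packages under C:\tmp\6vmqdjhl\external\ are also downloaded during the build. If a similar error appears for one of them, delete the corresponding directory and restart the build; it will continue once the download succeeds.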
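The delete-and-retry step can be wrapped in a small function so that only the one broken package is removed. This is a minimal sketch: purge_external is a made-up helper, and the output base (C:\tmp\6vmqdjhl on my machine) has to be passed in by the caller.

```shell
# Sketch: delete one partially downloaded external repository so that the
# next build fetches it again. Usage: purge_external OUTPUT_BASE REPO_NAME
purge_external() (
    output_base=$1
    repo=$2
    # Refuse names that could make rm -rf wander outside external/.
    case $repo in
        ''|*/*|.|..) echo "refusing suspicious repository name: '$repo'" >&2; exit 1 ;;
    esac
    target=$output_base/external/$repo
    if [ -d "$target" ]; then
        rm -rf "$target"
        echo "removed $target"
    else
        echo "nothing to remove at $target"
    fi
)
```

In my case the call would be purge_external /c/tmp/6vmqdjhl com_google_absl, followed by ./run.bat again.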