
SLI Causes Both GPUs to Be Occupied by TensorFlow at Once (on Windows)

I have recently been learning TensorFlow and was driven half-crazy by a few problems that are not really bugs, so I am writing up the fixes here. I use TensorFlow on Win10, so Ubuntu users can move along; these issues do not occur there. (Parts of this article have been reorganized and reposted on the 相約機器人 WeChat public account; see the link there.)

As everyone knows, when TensorFlow runs it grabs the memory of every GPU it detects. Whether that design is good or bad is debatable. So how do you restrict it to specific GPUs? The only real way is to use CUDA itself to hide certain cards (the alternative being to physically pull out the extra cards, which nobody would be silly enough to do). The following approaches, given in some textbooks and online tutorials, treat the symptoms but not the cause:

(1) Using a with ... device statement

For example:

with tf.device("/gpu:1"):

This only specifies which GPU the code inside the block runs on; the program itself still reserves memory on every GPU (believe it or not).
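A minimal sketch of this (assuming TensorFlow 1.x and two visible GPUs): the ops are pinned to /gpu:1, yet while the session is open, nvidia-smi in another terminal will show memory reserved on GPU 0 as well:

import tensorflow as tf  # TensorFlow 1.x API

# Pin the ops to the second GPU.
with tf.device("/gpu:1"):
    a = tf.placeholder(tf.int16)
    b = tf.placeholder(tf.int16)
    prod = tf.multiply(a, b)

# log_device_placement prints where each op actually runs.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(prod, feed_dict={a: 3, b: 4}))
    # While the session is open, check nvidia-smi in another terminal:
    # memory is reserved on GPU 0 too, even though only GPU 1 runs the ops.
    input("Press Enter to close the session...")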

(2) Using allow_growth=True or per_process_gpu_memory_fraction

For example:

import tensorflow as tf

g = tf.placeholder(tf.int16)
h = tf.placeholder(tf.int16)
mul = tf.multiply(g,h)

# allow_growth lets the allocation grow as needed instead of grabbing everything at start-up
gpu_options = tf.GPUOptions(allow_growth = True)
#gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction = 0.7)
config = tf.ConfigProto(log_device_placement = True,allow_soft_placement = True,gpu_options = gpu_options)
with tf.Session(config=config) as sess:
    print("相乘:%d" % sess.run(mul, feed_dict = {g:3,h:4}))

The former lets the program's GPU memory usage grow gradually as it needs more, but it still occupies every GPU, as shown below:

The upper screenshot was taken before running the program and the lower one after; you can see that after the run both GPUs are occupied, even though only GPU 0 actually executed the program:

C:\Users\B622>python
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>>
>>> g = tf.placeholder(tf.int16)
>>> h = tf.placeholder(tf.int16)
>>> mul = tf.multiply(g,h)
>>>
>>> gpu_options = tf.GPUOptions(allow_growth = True)
>>> #gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction = 0.7)
... config = tf.ConfigProto(log_device_placement = True,allow_soft_placement = True,gpu_options = gpu_options)
>>> with tf.Session(config=config) as sess:
...     print("相乘:%d" % sess.run(mul, feed_dict = {g:3,h:4}))
...
2018-08-21 07:00:01.651592: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-08-21 07:00:01.927932: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:17:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2018-08-21 07:00:02.025456: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:65:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2018-08-21 07:00:02.030441: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1435] Adding visible gpu devices: 0, 1
2018-08-21 07:00:03.036953: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-21 07:00:03.040347: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:929]      0 1
2018-08-21 07:00:03.042564: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 0:   N N
2018-08-21 07:00:03.044994: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 1:   N N
2018-08-21 07:00:03.047419: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8806 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
2018-08-21 07:00:03.054450: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8806 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1
2018-08-21 07:00:03.064623: I T:\src\github\tensorflow\tensorflow\core\common_runtime\direct_session.cc:284] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1

Mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
2018-08-21 07:00:03.074668: I T:\src\github\tensorflow\tensorflow\core\common_runtime\placer.cc:886] Mul: (Mul)/job:localhost/replica:0/task:0/device:GPU:0
Placeholder_1: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
2018-08-21 07:00:03.078028: I T:\src\github\tensorflow\tensorflow\core\common_runtime\placer.cc:886] Placeholder_1: (Placeholder)/job:localhost/replica:0/task:0/device:GPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
2018-08-21 07:00:03.081462: I T:\src\github\tensorflow\tensorflow\core\common_runtime\placer.cc:886] Placeholder: (Placeholder)/job:localhost/replica:0/task:0/device:GPU:0
相乘:12

The latter option, per_process_gpu_memory_fraction, reserves a fixed share of memory, but it simply grabs that share on every GPU, so all GPUs are still occupied, as shown below:
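For reference, this is the same snippet as above with the commented-out option switched in; a sketch assuming two GPUs, in which TensorFlow pre-allocates roughly 70% of the memory on each card when the session is created:

import tensorflow as tf

g = tf.placeholder(tf.int16)
h = tf.placeholder(tf.int16)
mul = tf.multiply(g, h)

# Reserve a fixed fraction (~70%) of GPU memory, on every visible GPU.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True,
                        gpu_options=gpu_options)
with tf.Session(config=config) as sess:
    print(sess.run(mul, feed_dict={g: 3, h: 4}))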

The correct approach is to use CUDA to hide certain GPUs. There are two ways to do it:

(1) Set it directly in the Python code

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # number GPUs by PCI bus ID, the same order nvidia-smi uses
os.environ["CUDA_VISIBLE_DEVICES"] = "1"         # expose only physical GPU 1 to CUDA (and thus to TensorFlow)

(2) Set it in the terminal

On Windows (replace test.py with your own .py file):

set CUDA_VISIBLE_DEVICES=1
python test.py

On Linux:

CUDA_VISIBLE_DEVICES=1 python test.py

However, if the code also contains statements such as with tf.device():, a careless index can cause an error. Why? First recall how CUDA_VISIBLE_DEVICES behaves:

CUDA_VISIBLE_DEVICES=1           Only device 1 will be seen
CUDA_VISIBLE_DEVICES=0,1         Devices 0 and 1 will be visible
CUDA_VISIBLE_DEVICES="0,1"       Same as above, quotation marks are optional
CUDA_VISIBLE_DEVICES=0,2,3       Devices 0, 2, 3 will be visible; device 1 is masked

CUDA_VISIBLE_DEVICES=""          No GPU will be visible

For example, running the following code produces an error:

import tensorflow as tf
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

with tf.device("/gpu:1"):   # wrong index: with CUDA_VISIBLE_DEVICES="1", the only visible card is /gpu:0
    g = tf.placeholder(tf.int16)
    h = tf.placeholder(tf.int16)
    mul = tf.multiply(g,h)
gpu_options = tf.GPUOptions(allow_growth = True)
config = tf.ConfigProto(log_device_placement = True,gpu_options = gpu_options)
#config = tf.ConfigProto(log_device_placement = True,allow_soft_placement = True)
with tf.Session(config=config) as sess:
    print("相乘:%d" % sess.run(mul, feed_dict = {g:3,h:4}))

This is because once you set os.environ["CUDA_VISIBLE_DEVICES"] = "1" and then also use with tf.device("/gpu:1"): (note: with tf.device("/gpu:0"): would be correct), the program complains that GPU 1 is not available and that only CPU 0 and GPU 0 exist, as shown below. The reason is that after CUDA_VISIBLE_DEVICES is set, CUDA renumbers the visible GPUs from 0 in the order you specified. Here only one GPU was made visible, so only index 0 can be used, and anything beyond that index raises an error. Physically the card in use is still GPU 1 on the PCI bus, but as far as the program is concerned its index is 0, not 1:

InvalidArgumentError: Cannot assign a device for operation 'Mul': Operation was explicitly assigned to /device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0 ]. Make sure the device specification refers to a valid device.
	 [[Node: Mul = Mul[T=DT_INT16, _device="/device:GPU:1"](Placeholder, Placeholder_1)]]

Caused by op 'Mul', defined at:
  File "E:\Anaconda3\lib\site-packages\spyder\utils\ipython\start_kernel.py", line 269, in <module>
    main()
  File "E:\Anaconda3\lib\site-packages\spyder\utils\ipython\start_kernel.py", line 265, in main
    kernel.start()
  File "E:\Anaconda3\lib\site-packages\ipykernel\kernelapp.py", line 486, in start
    self.io_loop.start()
  File "E:\Anaconda3\lib\site-packages\tornado\platform\asyncio.py", line 127, in start
    self.asyncio_loop.run_forever()
  File "E:\Anaconda3\lib\asyncio\base_events.py", line 422, in run_forever
    self._run_once()
  File "E:\Anaconda3\lib\asyncio\base_events.py", line 1432, in _run_once
    handle._run()
  File "E:\Anaconda3\lib\asyncio\events.py", line 145, in _run
    self._callback(*self._args)
  File "E:\Anaconda3\lib\site-packages\tornado\platform\asyncio.py", line 117, in _handle_events
    handler_func(fileobj, events)
  File "E:\Anaconda3\lib\site-packages\tornado\stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "E:\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "E:\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "E:\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "E:\Anaconda3\lib\site-packages\tornado\stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "E:\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "E:\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "E:\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "E:\Anaconda3\lib\site-packages\ipykernel\ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "E:\Anaconda3\lib\site-packages\ipykernel\zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "E:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2662, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "E:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2785, in _run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "E:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2909, in run_ast_nodes
    if self.run_code(code, result):
  File "E:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-1-f92d6fb2b710>", line 1, in <module>
    runfile('C:/Users/B622/.spyder-py3/temp.py', wdir='C:/Users/B622/.spyder-py3')
  File "E:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "E:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/B622/.spyder-py3/temp.py", line 22, in <module>
    add2 = tf.multiply(g,h)
  File "E:\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 337, in multiply
    return gen_math_ops.mul(x, y, name)
  File "E:\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 5066, in mul
    "Mul", x=x, y=y, name=name)
  File "E:\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "E:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op
    op_def=op_def)
  File "E:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'Mul': Operation was explicitly assigned to /device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0 ]. Make sure the device specification refers to a valid device.
	 [[Node: Mul = Mul[T=DT_INT16, _device="/device:GPU:1"](Placeholder, Placeholder_1)]]

Likewise, if you set os.environ["CUDA_VISIBLE_DEVICES"] = "3,0,1" (assuming you have four GPUs), then physical GPU 3 appears to the program as GPU 0, physical GPU 0 as GPU 1, physical GPU 1 as GPU 2, and physical GPU 2 is not visible at all (it has been hidden).
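A quick way to see the remapping (a sketch, assuming a machine with four GPUs) is to list the devices TensorFlow can see after setting the variable:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "3,0,1"   # physical GPU 2 is hidden

from tensorflow.python.client import device_lib

# Besides the CPU entry, this prints /device:GPU:0, /device:GPU:1 and /device:GPU:2;
# their pci bus ids in physical_device_desc correspond to physical GPUs 3, 0 and 1, in that order.
for d in device_lib.list_local_devices():
    print(d.name, d.physical_device_desc)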

Of course, to guard against a careless index you can set allow_soft_placement = True in tf.ConfigProto (meaning that if the specified device does not exist, TensorFlow is allowed to assign one automatically). But that works against the whole point of pinning certain code to a particular GPU, so when writing tf.device think carefully about what the current GPU indices are.
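Put differently, the earlier snippet works once the device string matches the renumbered index; a corrected sketch:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"      # physical GPU 1 is now /gpu:0

import tensorflow as tf

with tf.device("/gpu:0"):                     # index 0 after the remapping
    g = tf.placeholder(tf.int16)
    h = tf.placeholder(tf.int16)
    mul = tf.multiply(g, h)

config = tf.ConfigProto(log_device_placement=True,
                        gpu_options=tf.GPUOptions(allow_growth=True))
with tf.Session(config=config) as sess:
    print(sess.run(mul, feed_dict={g: 3, h: 4}))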

On top of all that, there is one more nasty pitfall on Windows: if your two GPUs are linked with an SLI bridge, then no matter how you set os.environ["CUDA_VISIBLE_DEVICES"] = "1" or issue the corresponding command in the terminal, TensorFlow still occupies both GPUs, even though only the specified GPU is actually visible to it.

The hidden GPU cannot be used, yet its memory is still taken; you lose on both ends. Absurd as that sounds, this problem cost me a whole night plus a morning of troubleshooting, and no amount of searching on Baidu turned up an answer. I tried removing the SLI bridge (pictured below):

But after removing it, Windows could no longer detect either graphics card, as shown below (both cards show an exclamation mark in Device Manager, and running nvidia-smi in the terminal reports an error saying no GPU exists):

After putting the bridge back, everything showed up normally again, which was thoroughly baffling. I struggled with this for a very long time without any luck; at first I assumed the driver was broken and reinstalled it countless times, but the exclamation marks stayed, and I nearly cried. (Note: this problem does not occur under Ubuntu.)

In the end, disabling SLI solved it. Just disable it directly in the NVIDIA settings (NVIDIA Control Panel), as shown below:

 

When disabling it, Windows will tell you that some programs need to be closed; just end them in Task Manager.

Note: when you end the first process in the screenshot above (WindowsInternal...), it restarts itself within a second or two, so you have to be quick; it may take a few tries.

With SLI disabled, TensorFlow no longer occupies both GPUs at once, and it truly occupies only the GPU you specify.