
Some tips about using Google's TPU (Cont.)

Sometimes I get this error from TPUEstimator:

...
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DeadlineExceededError: Deadline Exceeded

After stopping and restarting the TPU in the GCP console, the error disappeared. A TPU can't be used directly the way a GPU can: there is no device node visible in the VM like '/dev/tpu' or anything similar. Google provides the TPU as an RPC service, so you can only run DNN training through that service. I suspect this RPC service is not stable enough, so sometimes it fails and leads to the 'Deadline Exceeded' error.
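For illustration, the wiring looks roughly like this (a minimal sketch with the TF 1.x contrib API seen in the traceback above; the TPU name, GCS bucket, and toy model_fn are placeholders, not my actual training code):

import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Trivial model, only here to keep the sketch self-contained.
    logits = tf.layers.dense(features, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.contrib.tpu.CrossShardOptimizer(
        tf.train.GradientDescentOptimizer(0.01))
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

# The TPU is resolved by name to a gRPC endpoint, not opened as a local device.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='my-tpu')
print(resolver.get_master())   # e.g. grpc://10.240.1.2:8470 -- the RPC service

run_config = tf.contrib.tpu.RunConfig(
    cluster=resolver,                  # every train() call talks to this endpoint
    model_dir='gs://my-bucket/model',  # checkpoints must live on GCS
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100),
)

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=True,
    train_batch_size=1024,
)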

When I get this type of error from the TPU:

2018-09-29 01:57:12.779430: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.

The only solution is to create a new TPU instance in the GCP console and delete the old one. It seems Google needs to improve the robustness of its TPU RPC service.
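One way to confirm whether the endpoint is really gone (just a sketch; 'my-tpu' is a placeholder name) is to list the remote devices over the same gRPC session the warning comes from:

import tensorflow as tf

resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='my-tpu')
with tf.Session(resolver.get_master()) as sess:
    # If this hangs or returns no /device:TPU:* entries, the TPU service
    # behind the gRPC endpoint is no longer usable.
    for dev in sess.list_devices():
        print(dev.name)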

Running 10000 steps per turn and getting the 'loss' for every turn:

INFO:tensorflow:Loss for final step: 3.2015076.
INFO:tensorflow:Loss for final step: 2.5733204.
INFO:tensorflow:Loss for final step: 1.8888541.
INFO:tensorflow:Loss for final step: 2.3713436.
INFO:tensorflow:Loss for final step: 2.9957836.
INFO:tensorflow:Loss for final step: 1.3974692.
INFO:tensorflow:Loss for final step: 1.3933656.
INFO:tensorflow:Loss for final step: 2.3544135.
INFO:tensorflow:Loss for final step: 1.9383199.
INFO:tensorflow:Loss for final step: 2.0213509.
INFO:tensorflow:Loss for final step: 1.8641331.
INFO:tensorflow:Loss for final step: 1.6767861.
INFO:tensorflow:Loss for final step: 2.63849.
INFO:tensorflow:Loss for final step: 2.19468.
INFO:tensorflow:Loss for final step: 1.9854712.
INFO:tensorflow:Loss for final step: 1.9380764.
INFO:tensorflow:Loss for final step: 0.97299415.
INFO:tensorflow:Loss for final step: 2.089243.
INFO:tensorflow:Loss for final step: 2.1150723.
INFO:tensorflow:Loss for final step: 1.8242038.
INFO:tensorflow:Loss for final step: 2.8426473.

It's quite strange that the loss never goes low enough. I still need to do more experiments.
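The loop that produces a log like this is roughly the following (a sketch; 'estimator' is the TPUEstimator wired up as in the earlier sketch, and the input function is omitted):

for turn in range(21):
    # Each call runs 10000 steps on the TPU; TPUEstimator logs
    # "Loss for final step: ..." when the call returns.
    estimator.train(input_fn=train_input_fn, steps=10000)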

Previously, I ran MobileNet_v2 on a machine with a GeForce GTX 960, which could process 100 samples per second. Using eight TPUv2 cores (a TPUv2-8), it can process about 500 samples per second. At first I was disappointed by the performance boost of TPUv2, since that works out to only about 1.4 TFLOPS per core. But then I realized the bottleneck may not be the TPU itself, since I/O is usually the limit on training speed. Besides, my model is MobileNet_v2, which is so small and light that it can't exploit the full capability of the TPU.
Therefore I set 'depth_multiplier=4' for MobileNet_v2. With this heavier model, the GTX 960 could process 21 samples per second, while the TPUv2-8 could process 275 samples per second. This time, I estimate each TPUv2 core delivers about 4 TFLOPS. I know this is far below Google's official 45 TFLOPS, but considering the possible bottlenecks of storage I/O and network bandwidth, it becomes understandable. There is also another possibility: Google's 45 TFLOPS refers to half-precision performance.
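For reference, the back-of-envelope arithmetic behind these estimates looks like this (a sketch; the GTX 960's roughly 2.3 TFLOPS FP32 peak is my assumption, used only to scale the measured throughput, and is not part of the measurements above):

# Assumed peak FP32 throughput of the GTX 960, in TFLOPS.
gtx960_tflops = 2.3

# Plain MobileNet_v2: 100 samples/s on the GTX 960 vs 500 samples/s on TPUv2-8.
per_core = 500.0 / 8                       # ~62.5 samples/s per TPU core
print(per_core / 100.0 * gtx960_tflops)    # ~1.4 TFLOPS per core

# depth_multiplier=4: 21 samples/s on the GTX 960 vs 275 samples/s on TPUv2-8.
per_core = 275.0 / 8                       # ~34.4 samples/s per TPU core
print(per_core / 21.0 * gtx960_tflops)     # ~3.8, i.e. about 4 TFLOPS per core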