TensorFlow Source Code Analysis: The Relationship Between Sessions and Thread Pools

阿新 • Published: 2019-01-15

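1. TensorFlow's SessionFactory

To create a new session, TensorFlow uses a multi-factory pattern: different factories serve different scenarios, and the SessionOptions passed in determines which factory is used.

1.1 Registering a factory

TensorFlow lets multiple session factories be registered, so that different modules can register their own: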

void SessionFactory::Register(const string& runtime_type,
                              SessionFactory* factory) {
  mutex_lock l(*get_session_factory_lock());
  if (!session_factories()->insert({runtime_type, factory}).second) {
    // Note the trailing space after "under" so the name is not glued to it.
    LOG(ERROR) << "Two session factories are being registered "
               << "under " << runtime_type;
  }
}

By default TensorFlow provides two factories: DirectSession for a single machine and GrpcSession for a cluster.

Which factory is used is determined by the target field of the SessionOptions passed in.
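
How the target field picks a factory can be sketched in a few lines. The following is a minimal standalone illustration, not TensorFlow's code: SketchOptions, SketchFactory, LocalFactory, GrpcFactory and Registry are hypothetical names, and the selection rule (an empty target goes to the local factory, a target starting with "grpc://" goes to the gRPC factory) is an assumption about how the two default factories divide the work.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for SessionOptions: only the target field matters here.
struct SketchOptions {
  std::string target;
};

// Hypothetical stand-in for SessionFactory.
class SketchFactory {
 public:
  virtual ~SketchFactory() = default;
  // Returns true if this factory can build a session for the given options.
  virtual bool AcceptsOptions(const SketchOptions& options) const = 0;
  virtual std::string Name() const = 0;
};

// Assumed rule: an empty target means "run in this process".
class LocalFactory : public SketchFactory {
 public:
  bool AcceptsOptions(const SketchOptions& options) const override {
    return options.target.empty();
  }
  std::string Name() const override { return "local"; }
};

// Assumed rule: a target that starts with "grpc://" means "run on a cluster".
class GrpcFactory : public SketchFactory {
 public:
  bool AcceptsOptions(const SketchOptions& options) const override {
    return options.target.rfind("grpc://", 0) == 0;
  }
  std::string Name() const override { return "grpc"; }
};

class Registry {
 public:
  // Returns false if a factory is already registered under runtime_type.
  bool Register(const std::string& runtime_type, SketchFactory* factory) {
    assert(factory != nullptr);
    std::lock_guard<std::mutex> lock(mu_);
    return factories_.insert({runtime_type, factory}).second;
  }

  // Returns the first registered factory that accepts the options,
  // or nullptr if none does.
  SketchFactory* Find(const SketchOptions& options) {
    std::lock_guard<std::mutex> lock(mu_);
    for (const auto& entry : factories_) {
      if (entry.second->AcceptsOptions(options)) return entry.second;
    }
    return nullptr;
  }

 private:
  std::mutex mu_;
  std::map<std::string, SketchFactory*> factories_;
};
```

The registry does not own the factories; in this sketch the caller keeps them alive for as long as the registry is used.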

2. TensorFlow's Session

2.1 Initializing a Session

In session.cc, a session is initialized through NewSession:

Session* NewSession(const SessionOptions& options) {
  SessionFactory* factory;
  const Status s = SessionFactory::GetFactory(options, &factory);
  if (!s.ok()) {
    LOG(ERROR) << s;
    return nullptr;
  }
  return factory->NewSession(options);
}

As the code shows, the session is created through the factory's NewSession; on a single machine this is the DirectSession factory mentioned earlier:

  Session* NewSession(const SessionOptions& options) override {
    // Must do this before the CPU allocator is created.
    if (options.config.graph_options().build_cost_model() > 0) {
      EnableCPUAllocatorFullStats(true);
    }
    std::vector<Device*> devices;
    const Status s = DeviceFactory::AddDevices(
        options, "/job:localhost/replica:0/task:0", &devices);
    if (!s.ok()) {
      LOG(ERROR) << s;
      return nullptr;
    }

    DirectSession* session =
        new DirectSession(options, new DeviceMgr(devices), this);
    {
      mutex_lock l(sessions_lock_);
      sessions_.push_back(session);
    }
    return session;
  }

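2.2 Parallel computation

Every op in TensorFlow has to be computed, and within one session the ops are run in parallel for speed. On a cluster this means distributed computation; on a single machine it means multithreading, that is, the familiar thread pool.

The rest of this post covers parallel execution on a single machine, that is, the thread pools in DirectSession.

TensorFlow has three kinds of relationship between sessions and thread pools: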

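One session with multiple thread pools. When the session is initialized, it reads the settings for several thread pools from the SessionOptions config and builds a vector of thread pools. If thread_pool_options.global_name is empty, the pool is owned by the session, which must shut it down itself.
One session with a single thread pool. When the session is initialized and use_per_session_threads is set in the SessionOptions config, it reads the single-pool settings and creates a thread pool dedicated to that session, which the session must shut down itself.
Multiple sessions sharing one thread pool. When the session is initialized, a thread pool shared by all sessions is created; it is global to the process and cannot be shut down.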
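
The three cases above amount to a short decision chain. The sketch below models it with hypothetical types (PoolOption, SketchConfig, PoolPlan and PlanThreadPools are illustrative names, not TensorFlow's). It only records which mode is chosen and whether each pool is owned by the session, assuming the configured pool list takes priority over use_per_session_threads.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical mirror of ThreadPoolOptionProto.
struct PoolOption {
  int num_threads;
  std::string global_name;
};

// Hypothetical mirror of the two ConfigProto fields that matter here.
struct SketchConfig {
  std::vector<PoolOption> session_inter_op_thread_pool;
  bool use_per_session_threads = false;
};

enum class PoolKind { kPerSessionList, kPerSessionSingle, kProcessGlobal };

struct PoolPlan {
  PoolKind kind = PoolKind::kProcessGlobal;
  // One entry per pool; true means the session must shut the pool down.
  std::vector<bool> owned;
};

PoolPlan PlanThreadPools(const SketchConfig& config) {
  PoolPlan plan;
  if (!config.session_inter_op_thread_pool.empty()) {
    // Case 1: one session, several pools selected per Run call.
    plan.kind = PoolKind::kPerSessionList;
    for (const PoolOption& option : config.session_inter_op_thread_pool) {
      assert(option.num_threads >= 0);
      // A pool without a global name belongs to this session only.
      plan.owned.push_back(option.global_name.empty());
    }
  } else if (config.use_per_session_threads) {
    // Case 2: one session, one private pool.
    plan.kind = PoolKind::kPerSessionSingle;
    plan.owned.push_back(true);
  } else {
    // Case 3: the process-wide pool shared by all sessions, never shut down.
    plan.kind = PoolKind::kProcessGlobal;
    plan.owned.push_back(false);
  }
  return plan;
}
```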

The config.proto protocol buffer file defines the format of these settings in ConfigProto and ThreadPoolOptionProto:

message ThreadPoolOptionProto {
  // The number of threads in the pool.
  //
  // 0 means the system picks a value based on where this option proto is used
  // (see the declaration of the specific field for more info).
  int32 num_threads = 1;

  // The global name of the threadpool.
  //
  // If empty, then the threadpool is made and used according to the scope it's
  // in - e.g., for a session threadpool, it is used by that session only.
  //
  // If non-empty, then:
  // - a global threadpool associated with this name is looked
  //   up or created. This allows, for example, sharing one threadpool across
  //   many sessions (e.g., like the default behavior, if
  //   inter_op_parallelism_threads is not configured), but still partitioning
  //   into a large and small pool.
  // - if the threadpool for this global_name already exists, then it is an
  //   error if the existing pool was created using a different num_threads
  //   value as is specified on this call.
  // - threadpools created this way are never garbage collected.
  string global_name = 2;
};

message ConfigProto {
  // Map from device type name (e.g., "CPU" or "GPU" ) to maximum
  // number of devices of that type to use.  If a particular device
  // type is not found in the map, the system picks an appropriate
  // number.
  map<string, int32> device_count = 1;

  // The execution of an individual op (for some op types) can be
  // parallelized on a pool of intra_op_parallelism_threads.
  // 0 means the system picks an appropriate number.
  int32 intra_op_parallelism_threads = 2;

  // Nodes that perform blocking operations are enqueued on a pool of
  // inter_op_parallelism_threads available in each process.
  // 0 means the system picks an appropriate number.
  // Note that the first Session created in the process sets the
  // number of threads for all future sessions unless use_per_session_threads is
  // true or session_inter_op_thread_pool is configured.
  int32 inter_op_parallelism_threads = 5;

  // If true, use a new set of threads for this session rather than the global
  // pool of threads. Only supported by direct sessions.
  // If false, use the global threads created by the first session, or the
  // per-session thread pools configured by session_inter_op_thread_pool.
  // This option is deprecated. The same effect can be achieved by setting
  // session_inter_op_thread_pool to have one element, whose num_threads equals
  // inter_op_parallelism_threads.
  bool use_per_session_threads = 9;

  // This option is experimental - it may be replaced with a different mechanism
  // in the future.
  // Configures session thread pools. If this is configured, then RunOptions for
  // a Run call can select the thread pool to use.
  // The intended use is for when some session invocations need to run in a
  // background pool limited to a small number of threads:
  // - For example, a session may be configured to have one large pool (for
  // regular compute) and one small pool (for periodic, low priority work);
  // using the small pool is currently the mechanism for limiting the inter-op
  // parallelism of the low priority work.  Note that it does not limit the
  // parallelism of work spawned by a single op kernel implementation.
  // - Using this setting is normally not needed in training, but may help some
  // serving use cases.
  // - It is also generally recommended to set the global_name field of this
  // proto, to avoid creating multiple large pools. It is typically better to
  // run the non-low-priority work, even across sessions, in a single large
  // pool.
  repeated ThreadPoolOptionProto session_inter_op_thread_pool = 12;

  // Assignment of Nodes to Devices is recomputed every placement_period
  // steps until the system warms up (at which point the recomputation
  // typically slows down automatically).
  int32 placement_period = 3;

  // When any filters are present sessions will ignore all devices which do not
  // match the filters. Each filter can be partially specified, e.g. "/job:ps"
  // "/job:worker/replica:3", etc.
  repeated string device_filters = 4;

  // Options that apply to all GPUs.
  GPUOptions gpu_options = 6;

  // Whether soft placement is allowed. If allow_soft_placement is true,
  // an op will be placed on CPU if
  // 1. there's no GPU implementation for the OP
  // or
  // 2. no GPU devices are known or registered
  // or
  // 3. need to co-locate with reftype input(s) which are from CPU.
  bool allow_soft_placement = 7;

  // Whether device placements should be logged.
  bool log_device_placement = 8;

  // Options that apply to all graphs.
  GraphOptions graph_options = 10;

  // Global timeout for all blocking operations in this session.  If non-zero,
  // and not overridden on a per-operation basis, this value will be used as the
  // deadline for all blocking operations.
  int64 operation_timeout_in_ms = 11;

  // Options that apply when this session uses the distributed runtime.
  RPCOptions rpc_options = 13;

  // Optional list of all workers to use in this session.
  ClusterDef cluster_def = 14;

  // If true, any resources such as Variables used in the session will not be
  // shared with other sessions.
  bool isolate_session_state = 15;

  // Next: 16
};

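Creating multiple thread pools for a single session is mainly useful because a run can then actively choose among them. Remember that a RunOptions can be passed when calling session.run? Again, let's look directly at the protocol definition: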

message RunOptions {
  // TODO(pbar) Turn this into a TraceOptions proto which allows
  // tracing to be controlled in a more orthogonal manner?
  enum TraceLevel {
    NO_TRACE = 0;
    SOFTWARE_TRACE = 1;
    HARDWARE_TRACE = 2;
    FULL_TRACE = 3;
  }
  TraceLevel trace_level = 1;

  // Time to wait for operation to complete in milliseconds.
  int64 timeout_in_ms = 2;

  // The thread pool to use, if session_inter_op_thread_pool is configured.
  int32 inter_op_thread_pool = 3;

  // Whether the partition graph(s) executed by the executor(s) should be
  // outputted via RunMetadata.
  bool output_partition_graphs = 5;

  // EXPERIMENTAL.  Options used to initialize DebuggerState, if enabled.
  DebugOptions debug_options = 6;

  // When enabled, causes tensor alllocation information to be included in
  // the error message when the Run() call fails because the allocator ran
  // out of memory (OOM).
  //
  // Enabling this option can slow down the Run() call.
  bool report_tensor_allocations_upon_oom = 7;

  reserved 4;
}

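The relevant parameter is inter_op_thread_pool. In TensorFlow, both the communication protocol and the configuration are based on Google's protocol buffers, so the accessor functions and code for these objects are generated when the protocol files are compiled. For example: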

  thread::ThreadPool* pool =
      thread_pools_[run_options.inter_op_thread_pool()].first;

The inter_op_thread_pool() function called here cannot be found in the source tree. During the build, TensorFlow automatically generates the C++ code from config.proto, in genfiles/tensorflow/core/protobuf/config.pb.h and config.pb.cc.
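
Because inter_op_thread_pool is a plain index into the session's list of pools, the lookup is easy to model. This is a minimal sketch, assuming an out-of-range index is rejected rather than clamped. SketchPool, PoolEntry and SelectPool are hypothetical names, and the pair layout (pool pointer, owned flag) is inferred from the .first access in the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for thread::ThreadPool.
struct SketchPool {
  std::string name;
  int num_threads;
};

// first = the pool, second = whether the session owns it.
using PoolEntry = std::pair<SketchPool*, bool>;

// Returns the pool selected by a RunOptions-style index,
// or nullptr when the index does not name a configured pool.
SketchPool* SelectPool(const std::vector<PoolEntry>& thread_pools,
                       int inter_op_thread_pool) {
  if (inter_op_thread_pool < 0 ||
      static_cast<std::size_t>(inter_op_thread_pool) >= thread_pools.size()) {
    return nullptr;
  }
  SketchPool* pool =
      thread_pools[static_cast<std::size_t>(inter_op_thread_pool)].first;
  assert(pool != nullptr);  // configured entries are expected to hold a pool
  return pool;
}
```

Since an unset int32 field in proto3 reads as 0, a Run call that does not set inter_op_thread_pool selects the first configured pool.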

2.2.1 Number of threads in the pool

int32 NumInterOpThreadsFromSessionOptions(const SessionOptions& options) {
  const int32 t = options.config.inter_op_parallelism_threads();
  if (t != 0) return t;
  // Default to using the number of cores available in the process.
  return port::NumSchedulableCPUs();
}

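The count comes from inter_op_parallelism_threads in the config; when multiple thread pools are configured, each pool's own num_threads is read instead. If nothing is configured, the default is the number of schedulable CPUs on the system.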
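
The fallback rule can be reproduced outside TensorFlow. This sketch substitutes std::thread::hardware_concurrency() for port::NumSchedulableCPUs(), which is an assumption made for portability; the two may report different values. NumInterOpThreads is a hypothetical name.

```cpp
#include <cassert>
#include <thread>

// Returns the configured thread count, or a detected CPU count when the
// configured value is 0 ("let the system pick").
int NumInterOpThreads(int configured_threads) {
  if (configured_threads != 0) return configured_threads;
  // hardware_concurrency() may return 0 when the count cannot be detected,
  // so fall back to a single thread in that case.
  const unsigned detected = std::thread::hardware_concurrency();
  const int result = detected == 0 ? 1 : static_cast<int>(detected);
  assert(result >= 1);
  return result;
}
```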

2.2.2 Thread pool implementation

TensorFlow's thread pool is implemented on top of Eigen's thread pool:

struct ThreadPool::Impl : Eigen::ThreadPoolTempl<EigenEnvironment> {
  Impl(Env* env, const ThreadOptions& thread_options, const string& name,
       int num_threads, bool low_latency_hint)
      : Eigen::ThreadPoolTempl<EigenEnvironment>(
            num_threads, low_latency_hint,
            EigenEnvironment(env, thread_options, name)) {}

  void ParallelFor(int64 total, int64 cost_per_unit,
                   std::function<void(int64, int64)> fn) {
    CHECK_GE(total, 0);
    CHECK_EQ(total, (int64)(Eigen::Index)total);
    Eigen::ThreadPoolDevice device(this, this->NumThreads());
    device.parallelFor(
        total, Eigen::TensorOpCost(0, 0, cost_per_unit),
        [&fn](Eigen::Index first, Eigen::Index last) { fn(first, last); });
  }
};
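
The contract of ParallelFor is that fn receives half-open ranges [first, last) that together cover [0, total). A minimal sketch of that contract follows, using plain std::thread instead of Eigen's pool and an even split instead of Eigen's cost-based block sizing. SketchParallelFor is a hypothetical name, and spawning one thread per shard is a simplification that a real pool avoids by reusing workers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Splits [0, total) into at most num_threads contiguous shards and runs
// fn(first, last) for each shard on its own thread, then waits for all.
void SketchParallelFor(
    std::int64_t total, int num_threads,
    const std::function<void(std::int64_t, std::int64_t)>& fn) {
  assert(total >= 0);
  assert(num_threads > 0);
  if (total == 0) return;

  const std::int64_t shards =
      std::min<std::int64_t>(static_cast<std::int64_t>(num_threads), total);
  // Ceiling division so the shards cover every index.
  const std::int64_t block = (total + shards - 1) / shards;

  std::vector<std::thread> workers;
  for (std::int64_t first = 0; first < total; first += block) {
    const std::int64_t last = std::min(first + block, total);
    workers.emplace_back([&fn, first, last] { fn(first, last); });
  }
  for (std::thread& worker : workers) worker.join();
}
```

Because the shards are disjoint, fn can write to per-index output slots without locking; any state shared across shards still needs synchronization.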
