Understanding the Caffe Source, Part 1: Blob Storage Layout and Design

阿新 • Published: 2018-11-23


Blog: blog.shinelee.me | Cnblogs | CSDN

What a Blob is for

According to the official Caffe documentation:

A Blob is a wrapper over the actual data being processed and passed along by Caffe, and also under the hood provides synchronization capability between the CPU and the GPU. Mathematically, a blob is an N-dimensional array stored in a C-contiguous fashion.

Caffe stores and communicates data using blobs. Blobs provide a unified memory interface holding data; e.g., batches of images, model parameters, and derivatives for optimization.

Blobs conceal the computational and mental overhead of mixed CPU/GPU operation by synchronizing from the CPU host to the GPU device as needed. Memory on the host and device is allocated on demand (lazily) for efficient memory usage.

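Blob is the basic data structure in Caffe. It serves two main purposes:

It stores and transfers data and exposes a unified memory interface. In Caffe, input images, each layer's weights, the gradients computed during backpropagation, and each layer's inputs and outputs are all managed as Blobs.
It hides the details of data synchronization between the CPU and the GPU (implemented by SyncedMemory), so users do not have to manage that synchronization themselves.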

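Logically, a Blob is an \(N_d\)-dimensional tensor. When \(N_d=4\), the Blob's shape is \(N \times C \times H \times W\), i.e. \(Num \times Channel \times Height \times Width\), which can represent a batch of input images, the kernel parameters of a convolution layer, or the input and output feature maps of a convolution layer. When \(N_d=2\), it can represent the weights of a fully connected layer, \(N_{out} \times N_{in}\). When \(N_d=1\), it can represent the bias parameters of a convolution layer or a fully connected layer.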

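Specifically:

\(N_d=4\), Blob holding input images: \(N\) is the number of images in the current batch (the mini-batch size), \(C\) is the number of image channels (\(C=3\) for an RGB image), and \(H\) and \(W\) are the image height and width.
\(N_d=4\), Blob holding the input or output of a convolution layer: \(N\) is the number of samples in the batch (\(N=1\) when a single image is fed through the network), \(C\) is the number of feature maps, and \(H\) and \(W\) are the height and width of each feature map.
\(N_d=4\), Blob holding the kernel parameters of a convolution layer: \(N\) is the number of output feature maps of the layer, which equals the number of kernels; \(C\) is the number of input feature maps of the layer, which equals the depth of one kernel; \(H\) and \(W\) are the kernel height and width. Each kernel is three-dimensional, \(C \times H \times W\).
\(N_d=2\), Blob holding the weights of a fully connected layer: the shape is a two-dimensional matrix \(N_{out} \times N_{in}\), where \(N_{out}\) is the number of outputs and \(N_{in}\) is the number of inputs.
\(N_d=1\), Blob is a vector of length \(N\): for the bias of a convolution layer, \(N\) is the number of kernels (equal to the number of output feature maps); for the bias of a fully connected layer, \(N\) is the number of outputs (the same as \(N_{out}\) above).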
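The shape conventions above can be made concrete with a small sketch. The helper names below (NumElements, ImageBatchShape, ConvWeightShape, FcWeightShape, BiasShape) are hypothetical and are not part of Caffe; they only restate the list as code, under the assumption that a shape is a plain vector of ints.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helpers for illustration only; none of these exist in Caffe.

// Number of elements described by a shape (product of all dimensions).
inline int NumElements(const std::vector<int>& shape) {
  int count = 1;
  for (int dim : shape) {
    assert(dim >= 0);
    count *= dim;
  }
  return count;
}

// N_d = 4: a batch of images, N x C x H x W.
inline std::vector<int> ImageBatchShape(int batch, int channels,
                                        int height, int width) {
  return {batch, channels, height, width};
}

// N_d = 4: convolution kernels, N_out x C_in x K_h x K_w.
inline std::vector<int> ConvWeightShape(int num_output, int num_input,
                                        int kernel_h, int kernel_w) {
  return {num_output, num_input, kernel_h, kernel_w};
}

// N_d = 2: fully connected weights, N_out x N_in.
inline std::vector<int> FcWeightShape(int num_output, int num_input) {
  return {num_output, num_input};
}

// N_d = 1: one bias value per output (per kernel or per output unit).
inline std::vector<int> BiasShape(int num_output) {
  return {num_output};
}
```

For example, a layer with 64 kernels of size 5 x 5 over 3 input channels would hold 64 * 3 * 5 * 5 weights and 64 biases under these conventions.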

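Main member variables

shared_ptr<SyncedMemory> data_; // the data: images, parameters, layer inputs and outputs, etc.
shared_ptr<SyncedMemory> diff_; // gradients from backpropagation; the parameter update applied during training
shared_ptr<SyncedMemory> shape_data_; // shape kept in SyncedMemory for GPU use; same values as shape_ below
vector<int> shape_; // shape, shared by data and diff
int count_; // number of elements in the tensor, e.g. N*C*H*W
int capacity_; // capacity: size of the currently allocated memory; a reshape may need to grow it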

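Blob storage layout

The data regions behind a Blob's data_ and diff_ are both stored in row-major order (C style). The figure below, taken from Wikipedia, illustrates row-major and column-major storage: the same nine numbers are stored contiguously and represent the same matrix, but in a different order.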

[Figure: row-major order vs. column-major order]
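As a minimal sketch of the two layouts, the functions below compute the linear index of element (row, col) in a rows x cols matrix. The function names are made up for this illustration.

```cpp
#include <cassert>

// Row-major (C style): elements of one row are adjacent in memory.
inline int RowMajorIndex(int row, int col, int rows, int cols) {
  assert(row >= 0 && row < rows && col >= 0 && col < cols);
  return row * cols + col;
}

// Column-major (Fortran style): elements of one column are adjacent.
inline int ColMajorIndex(int row, int col, int rows, int cols) {
  assert(row >= 0 && row < rows && col >= 0 && col < cols);
  return col * rows + row;
}
```

In a 3 x 3 matrix, element (1, 2) lands at index 5 in row-major order and at index 7 in column-major order, which is the difference the figure depicts.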

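When the input is a single RGB image, the shape is \(1 \times 3 \times 4 \times 5\), and the storage order is shown in the figure below (the image material comes from an external source). Along the channel axis, 0 is R, 1 is G, and 2 is B: the R plane is stored first in row-major order, then the G plane, and finally the B plane. This is only an illustration; in Caffe the actual channel order is BGR.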

[Figure: storage order of a \(1 \times 3 \times 4 \times 5\) Blob]

When \(N_d=4\), with shape \(Num \times Channel \times Height \times Width\), the Blob is stored contiguously along the \(Width\) axis, as shown below:

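[Figure: memory layout of a 4-D Blob, contiguous along Width]

Once this figure is understood, operations such as concatenating and cropping multi-dimensional Blobs become easy to follow.

Blob's offset member function returns the offset of the element at \((n, c, h, w)\). The computation is consistent with row-major storage: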

  inline int offset(const int n, const int c = 0, const int h = 0,
      const int w = 0) const {
    CHECK_GE(n, 0);
    CHECK_LE(n, num());
    CHECK_GE(channels(), 0);
    CHECK_LE(c, channels());
    CHECK_GE(height(), 0);
    CHECK_LE(h, height());
    CHECK_GE(width(), 0);
    CHECK_LE(w, width());
    return ((n * channels() + c) * height() + h) * width() + w;
  }
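To see the formula at work outside Caffe, here is a standalone sketch of the same offset arithmetic. Shape4 and Offset are hypothetical names for this example, and the bounds checks here are strict (index < size), a choice made for this sketch that is tighter than the CHECK_LE calls in the listing above.

```cpp
#include <cassert>

// Hypothetical 4-D shape holder for this sketch (not a Caffe type).
struct Shape4 {
  int num;
  int channels;
  int height;
  int width;
};

// Row-major offset of element (n, c, h, w); w varies fastest.
inline int Offset(const Shape4& s, int n, int c = 0, int h = 0, int w = 0) {
  assert(n >= 0 && n < s.num);
  assert(c >= 0 && c < s.channels);
  assert(h >= 0 && h < s.height);
  assert(w >= 0 && w < s.width);
  return ((n * s.channels + c) * s.height + h) * s.width + w;
}
```

With the \(1 \times 3 \times 4 \times 5\) shape used earlier, the first element of channel 1 sits at offset 20 (one full 4 x 5 plane after the start), moving down one row adds 5, and moving right one column adds 1.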

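Data transfer between CPU and GPU

const Dtype* cpu_data() const; // read-only access, return (const Dtype*)data_->cpu_data();
const Dtype* gpu_data() const; // return (const Dtype*)data_->gpu_data();
Dtype* mutable_cpu_data(); // writable access, return static_cast<Dtype*>(data_->mutable_cpu_data());
Dtype* mutable_gpu_data(); // static_cast<Dtype*>(data_->mutable_gpu_data());

Caffe uses the functions above to obtain pointers to the data region on the CPU and on the GPU. On each call, SyncedMemory decides by itself whether data needs to be synchronized (how it decides will be covered in detail in the SyncedMemory article). When CPU (GPU) data is accessed and the GPU (CPU) copy may have been updated, the data is first synchronized to the CPU (GPU). The sample code below, taken from the official Caffe site, shows when synchronization happens.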

// Assuming that data are on the CPU initially, and we have a blob.
const Dtype* foo;
Dtype* bar;
foo = blob.gpu_data(); // data copied cpu->gpu.
foo = blob.cpu_data(); // no data copied since both have up-to-date contents.
bar = blob.mutable_gpu_data(); // no data copied.
// ... some operations ...
bar = blob.mutable_gpu_data(); // no data copied when we are still on GPU.
foo = blob.cpu_data(); // data copied gpu->cpu, since the gpu side has modified the data
foo = blob.gpu_data(); // no data copied since both have up-to-date contents
bar = blob.mutable_cpu_data(); // still no data copied.
bar = blob.mutable_gpu_data(); // data copied cpu->gpu.
bar = blob.mutable_cpu_data(); // data copied gpu->cpu.

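Once a mutable function has been called, the next access from the other side triggers a copy, even if the data was never actually modified. So when you know you will not modify the data, prefer the const functions, and call the mutable functions only when you really need to write.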
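The copy behaviour in the official example can be reproduced with a toy state machine. This is only a sketch of the bookkeeping, assuming a "head" flag that records which side holds the latest data; ToySyncedMemory and ReplayOfficialExample are hypothetical names, no real memory is moved, and the actual SyncedMemory class does more than this.

```cpp
#include <cassert>

// Toy model: tracks where the latest data lives and counts copies.
class ToySyncedMemory {
 public:
  enum Head { UNINITIALIZED, HEAD_AT_CPU, HEAD_AT_GPU, SYNCED };

  explicit ToySyncedMemory(Head initial) : head_(initial) {}

  void cpu_data() { ToCpu(); }
  void gpu_data() { ToGpu(); }
  // A mutable access marks that side as the only up-to-date copy.
  void mutable_cpu_data() { ToCpu(); head_ = HEAD_AT_CPU; }
  void mutable_gpu_data() { ToGpu(); head_ = HEAD_AT_GPU; }

  Head head() const { return head_; }
  int cpu_to_gpu_copies() const { return cpu_to_gpu_; }
  int gpu_to_cpu_copies() const { return gpu_to_cpu_; }

 private:
  void ToCpu() {
    if (head_ == HEAD_AT_GPU) {
      ++gpu_to_cpu_;  // stale CPU copy: copy gpu -> cpu
      head_ = SYNCED;
    } else if (head_ == UNINITIALIZED) {
      head_ = HEAD_AT_CPU;
    }
  }
  void ToGpu() {
    if (head_ == HEAD_AT_CPU) {
      ++cpu_to_gpu_;  // stale GPU copy: copy cpu -> gpu
      head_ = SYNCED;
    } else if (head_ == UNINITIALIZED) {
      head_ = HEAD_AT_GPU;
    }
  }

  Head head_;
  int cpu_to_gpu_ = 0;
  int gpu_to_cpu_ = 0;
};

// Replays the nine accesses of the official example, starting on the CPU.
inline ToySyncedMemory ReplayOfficialExample() {
  ToySyncedMemory mem(ToySyncedMemory::HEAD_AT_CPU);
  mem.gpu_data();          // copy cpu -> gpu
  mem.cpu_data();          // no copy
  mem.mutable_gpu_data();  // no copy
  mem.mutable_gpu_data();  // no copy
  mem.cpu_data();          // copy gpu -> cpu
  mem.gpu_data();          // no copy
  mem.mutable_cpu_data();  // no copy
  mem.mutable_gpu_data();  // copy cpu -> gpu
  mem.mutable_cpu_data();  // copy gpu -> cpu
  assert(mem.head() == ToySyncedMemory::HEAD_AT_CPU);
  return mem;
}
```

In this model the last two calls each cost a copy although nothing was written in between, which is why const accessors are preferable whenever the data is only read.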

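Main member functions

Blob's main member functions are:

Basic functions, including constructors, setters and getters, and predicates
Reshape, which sets the Blob's shape and allocates memory
Update, which updates parameters during network training, \(data = data - diff\)
Blob arithmetic functions, for slice element counts, the sum of absolute values (L1 norm), the sum of squares (squared L2 norm), scaling, and so on
Helper functions, such as import from and export to proto

The most important of these are described below.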

template <typename Dtype>
class Blob {
 public:
  Blob()
       : data_(), diff_(), count_(0), capacity_(0) {}

  /// @brief Deprecated; use <code>Blob(const vector<int>& shape)</code>.
  explicit Blob(const int num, const int channels, const int height,
      const int width);
  explicit Blob(const vector<int>& shape);
// ......
};

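Blob's constructors call Reshape, which sets the shape members and allocates the initial storage. Layer::Reshape and Layer::Forward also call Reshape to set the dimensions of the output Blobs. If the input Blob of the whole network is reshaped, Net::Forward or Net::Reshape must be called to recompute the shapes of the Blobs in every layer (derived layer by layer, from bottom to top). When a Blob's size changes, memory is reallocated only if the current capacity is insufficient, as the code below shows: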

template <typename Dtype>
bool Blob<Dtype>::Reshape(const vector<int>& shape) {
  CHECK_LE(shape.size(), kMaxBlobAxes);
  count_ = 1;
  shape_.resize(shape.size());
  if (!shape_data_ || shape_data_->size() < shape.size() * sizeof(int)) {
    shape_data_.reset(new SyncedMemory(shape.size() * sizeof(int)));
  }
  int* shape_data = static_cast<int*>(shape_data_->mutable_cpu_data());
  for (int i = 0; i < shape.size(); ++i) {
    CHECK_GE(shape[i], 0);
    if (count_ != 0) {
      CHECK_LE(shape[i], INT_MAX / count_) << "blob size exceeds INT_MAX";
    }
    count_ *= shape[i];
    shape_[i] = shape[i];
    shape_data[i] = shape[i];
  }
  // Allocate new memory only when capacity is insufficient; the old memory is released by shared_ptr
  if (count_ > capacity_) {
    capacity_ = count_;
    data_.reset(new SyncedMemory(capacity_ * sizeof(Dtype)));
    diff_.reset(new SyncedMemory(capacity_ * sizeof(Dtype)));
    return true;
  }
  else {
    return false;
  }
}
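The grow-only policy in Reshape can be sketched without SyncedMemory. ToyBlob below is a hypothetical class for illustration: it assumes float data held in a std::vector behind a shared_ptr, and like the listing above it returns true only when new storage had to be allocated.

```cpp
#include <cassert>
#include <climits>
#include <memory>
#include <vector>

// Toy stand-in for Blob's shape and capacity bookkeeping (not Caffe code).
class ToyBlob {
 public:
  // Sets the shape; allocates only when the element count exceeds capacity.
  bool Reshape(const std::vector<int>& shape) {
    int count = 1;
    for (int dim : shape) {
      assert(dim >= 0);
      if (count != 0) {
        assert(dim <= INT_MAX / count);  // guard against int overflow
      }
      count *= dim;
    }
    shape_ = shape;
    count_ = count;
    if (count_ > capacity_) {
      capacity_ = count_;
      // Resetting the shared_ptr releases the previous buffer.
      data_ = std::make_shared<std::vector<float>>(capacity_);
      return true;
    }
    return false;  // existing buffer is large enough and is reused
  }

  int count() const { return count_; }
  int capacity() const { return capacity_; }

 private:
  std::vector<int> shape_;
  int count_ = 0;
  int capacity_ = 0;
  std::shared_ptr<std::vector<float>> data_;
};
```

Shrinking a blob therefore keeps the larger buffer around: count drops, capacity does not, and a later reshape back to the larger size costs no allocation.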

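During training, the loss function and backpropagation yield the gradients, which give the update diff_ for each layer's parameters. Update is then called to apply it: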

template <typename Dtype>
void Blob<Dtype>::Update() {
  // We will perform update based on where the data is located.
  switch (data_->head()) {
  case SyncedMemory::HEAD_AT_CPU:
    // perform computation on CPU
    // data = data - diff, axpy: y = ax + y
    caffe_axpy<Dtype>(count_, Dtype(-1),
        static_cast<const Dtype*>(diff_->cpu_data()),
        static_cast<Dtype*>(data_->mutable_cpu_data()));
    break;
  case SyncedMemory::HEAD_AT_GPU:
  case SyncedMemory::SYNCED:
#ifndef CPU_ONLY
    // perform computation on GPU
    caffe_gpu_axpy<Dtype>(count_, Dtype(-1),
        static_cast<const Dtype*>(diff_->gpu_data()),
        static_cast<Dtype*>(data_->mutable_gpu_data()));
#else
    NO_GPU;
#endif
    break;
  default:
    LOG(FATAL) << "Syncedmem not initialized.";
  }
}
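The CPU branch boils down to an axpy call with a = -1. The sketch below is a plain-loop stand-in under that reading; Axpy and UpdateInPlace are hypothetical names for this example, not the caffe_axpy routine itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for an axpy routine: y = alpha * x + y, element by element.
template <typename Dtype>
void Axpy(Dtype alpha, const std::vector<Dtype>& x, std::vector<Dtype>* y) {
  assert(y != nullptr && x.size() == y->size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    (*y)[i] += alpha * x[i];
  }
}

// data = data - diff, expressed as axpy with alpha = -1.
template <typename Dtype>
void UpdateInPlace(const std::vector<Dtype>& diff, std::vector<Dtype>* data) {
  Axpy(Dtype(-1), diff, data);
}
```

Note that this step applies no learning rate of its own; it subtracts whatever is stored in diff, so any scaling has to be applied to diff beforehand.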

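It is worth noting that Blob axis indices may be negative: -1 refers to the last axis, as in Python. The implementation is shown below. Whenever an axis needs to be accessed, first call CanonicalAxisIndex to obtain the actual axis index, for example CanonicalAxisIndex(-1).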

// axis_index the axis index.
// If 0 <= index < num_axes(), return index.
// If -num_axes <= index <= -1, return (num_axes() - (-index))
inline int CanonicalAxisIndex(int axis_index) const {
  CHECK_GE(axis_index, -num_axes())
      << "axis " << axis_index << " out of range for " << num_axes()
      << "-D Blob with shape " << shape_string();
  CHECK_LT(axis_index, num_axes())
      << "axis " << axis_index << " out of range for " << num_axes()
      << "-D Blob with shape " << shape_string();
  if (axis_index < 0) {
    return axis_index + num_axes();
  }
  return axis_index;
}
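The same mapping can be tried in isolation. CanonicalAxis below is a hypothetical free function that takes the number of axes as a parameter instead of reading it from a Blob, and uses assert in place of the CHECK macros.

```cpp
#include <cassert>

// Maps an axis index in [-num_axes, num_axes) to [0, num_axes).
inline int CanonicalAxis(int axis_index, int num_axes) {
  assert(axis_index >= -num_axes);
  assert(axis_index < num_axes);
  return axis_index < 0 ? axis_index + num_axes : axis_index;
}
```

For a 4-D Blob, -1 maps to axis 3 (width) and -4 maps to axis 0 (num), while non-negative indices pass through unchanged.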

Other functions; only representative ones are listed.

// setters and getters
// The basic setters and getters, such as the const and mutable functions above, are omitted
// Returns the value at (n, c, h, w): return cpu_data()[offset(n, c, h, w)]
inline Dtype data_at(const int n, const int c, const int h, const int w) const;
inline Dtype diff_at(const int n, const int c, const int h, const int w) const;
void ShareData(const Blob& other); // share data with another Blob, like a shallow copy
void ShareDiff(const Blob& other); // share diff with another Blob
// Copy from another Blob, like a deep copy
void CopyFrom(const Blob<Dtype>& source, bool copy_diff = false, bool reshape = false);

// Number of elements in a slice of axes, count *= shape(i)
inline int count(int start_axis, int end_axis) const;

// proto serialization and deserialization
void FromProto(const BlobProto& proto, bool reshape = true); // import from proto
void ToProto(BlobProto* proto, bool write_diff = false) const; // export to proto

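// arithmetic
Dtype asum_data() const; // data L1 norm (sum of absolute values)
Dtype asum_diff() const; // diff L1 norm (sum of absolute values)
Dtype sumsq_data() const; // data sum of squares (squared L2 norm)
Dtype sumsq_diff() const; // diff sum of squares (squared L2 norm)
void scale_data(Dtype scale_factor); // scale data by a constant, in place
void scale_diff(Dtype scale_factor); // scale diff by a constant, in place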

// predicates
bool ShapeEquals(const BlobProto& other); // whether the shapes are equal
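A few of these can be sketched as free functions to make their meaning concrete. The names CountSlice, SumAbs and SumSquares are hypothetical; the sketch assumes the slice covers axes from start_axis up to, but not including, end_axis, and works on plain vectors rather than on a Blob.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Product of shape[i] for i in [start_axis, end_axis).
inline int CountSlice(const std::vector<int>& shape, int start_axis,
                      int end_axis) {
  assert(start_axis >= 0 && start_axis <= end_axis);
  assert(end_axis <= static_cast<int>(shape.size()));
  int count = 1;
  for (int i = start_axis; i < end_axis; ++i) {
    count *= shape[i];
  }
  return count;
}

// Sum of absolute values (L1 norm), the quantity asum_* refers to.
inline double SumAbs(const std::vector<double>& values) {
  double sum = 0.0;
  for (double v : values) sum += std::fabs(v);
  return sum;
}

// Sum of squares (squared L2 norm), the quantity sumsq_* refers to.
inline double SumSquares(const std::vector<double>& values) {
  double sum = 0.0;
  for (double v : values) sum += v * v;
  return sum;
}
```

For a shape of 2 x 3 x 4 x 5, the slice over axes 1 to 4 has 60 elements, which is the size of one sample, a quantity that is often needed when walking a batch item by item.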

That's all.

References

Blobs, Layers, and Nets: anatomy of a Caffe model
Row- and column-major order
Caffe: a fast open framework for deep learning
