Understanding the Caffe Source Code 1: Blob Storage Structure and Design

Caffe · Published 2018-11-23 18:39:00

Blog: blog.shinelee.me | cnblogs | CSDN

What a Blob does

According to the official Caffe description:

A Blob is a wrapper over the actual data being processed and passed along by Caffe, and also under the hood provides synchronization capability between the CPU and the GPU. Mathematically, a blob is an N-dimensional array stored in a C-contiguous fashion.

Caffe stores and communicates data using blobs. Blobs provide a unified memory interface holding data; e.g., batches of images, model parameters, and derivatives for optimization.

Blobs conceal the computational and mental overhead of mixed CPU/GPU operation by synchronizing from the CPU host to the GPU device as needed. Memory on the host and device is allocated on demand (lazily) for efficient memory usage.

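Blob is the basic data structure in Caffe. Its main roles are:

Storing and passing data behind a unified memory interface. In Caffe, input images, each layer's weights and the gradients computed during backpropagation, and each layer's inputs and outputs are all managed as Blob objects.
Hiding the details of data synchronization between CPU and GPU (implemented by SyncedMemory), so users do not have to manage CPU/GPU synchronization themselves.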

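Logically, a Blob is an \(N_d\)-dimensional tensor. When \(N_d=4\), the shape of a Blob is defined as \(N \times C \times H \times W\), i.e. \(Num \times Channel \times Height \times Width\), and it can represent a batch of input images, the kernel parameters of a convolutional layer, the input and output feature maps of a convolutional layer, and so on. When \(N_d=2\), it can represent the weights of a fully connected layer, \(N_{out} \times N_{in}\). When \(N_d=1\), it can represent the bias parameters of a convolutional or fully connected layer.

Specifically:

\(N_d=4\), Blob holds input images: \(N\) is the number of images in the current batch (the mini-batch size), \(C\) is the number of image channels (\(C=3\) for RGB), and \(H\) and \(W\) are the image height and width.
\(N_d=4\), Blob holds the input or output of a convolutional layer: \(N\) is the batch size (\(N=1\) when a single image is processed), \(C\) is the number of feature maps, and \(H\) and \(W\) are the feature map height and width.
\(N_d=4\), Blob holds the kernel parameters of a convolutional layer: \(N\) is the number of output feature maps of the layer, which equals the number of kernels; \(C\) is the number of input feature maps, which equals the depth of one kernel; \(H\) and \(W\) are the kernel height and width. Each kernel is three-dimensional, \(C \times H \times W\).
\(N_d=2\), Blob holds the weights of a fully connected layer: the shape is an \(N_{out} \times N_{in}\) matrix, where \(N_{out}\) is the number of outputs and \(N_{in}\) is the number of inputs.
\(N_d=1\), Blob is a vector of length \(N\): for the bias of a convolutional layer, \(N\) is the number of kernels (the same as the number of output feature maps); for the bias of a fully connected layer, \(N\) is the number of outputs (the same as \(N_{out}\) above).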
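
The shapes listed above can be turned into element counts with a few lines of C++. This is a minimal sketch outside Caffe: `ElementCount`, `ConvWeightShape` and `BiasShape` are hypothetical helper names, and the layer sizes in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Number of elements in a tensor of the given shape (product of all axes).
// An empty shape yields 1, the empty-product convention.
inline long long ElementCount(const std::vector<int>& shape) {
  long long count = 1;
  for (int dim : shape) {
    assert(dim >= 0);
    count *= dim;
  }
  return count;
}

// Hypothetical helper: N_d = 4 shape of convolution kernel parameters,
// ordered as (output maps, input maps, kernel height, kernel width).
inline std::vector<int> ConvWeightShape(int num_output, int num_input,
                                        int kernel_h, int kernel_w) {
  return {num_output, num_input, kernel_h, kernel_w};
}

// Hypothetical helper: N_d = 1 shape of a bias vector, one value per output.
inline std::vector<int> BiasShape(int num_output) { return {num_output}; }
```

For instance, a layer with 8 kernels of size \(3 \times 3\) over 3 input channels would have `ElementCount(ConvWeightShape(8, 3, 3, 3))` weights, i.e. 216, and 8 bias values.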

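Main member variables

shared_ptr<SyncedMemory> data_; // data: images, parameters, layer inputs and outputs, etc.
shared_ptr<SyncedMemory> diff_; // gradients from backpropagation; the parameter update applied by Update() during training
shared_ptr<SyncedMemory> shape_data_; // shape kept in SyncedMemory so it can be read on the GPU; same values as shape_ below
vector<int> shape_; // shape, shared by data_ and diff_
int count_; // number of elements in the tensor, e.g. N*C*H*W
int capacity_; // capacity: number of elements currently allocated; a reshape may need to grow it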

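Blob storage layout

The data regions behind a Blob's data_ and diff_ are both stored in row-major order (C style). The original figure (from Wikipedia) contrasts row-major and column-major storage: the same nine numbers are stored contiguously and represent the same matrix, but in a different order. For a \(3 \times 3\) matrix with entries \(a_{ij}\), row-major order stores \(a_{11}, a_{12}, a_{13}, a_{21}, \dots\), while column-major order stores \(a_{11}, a_{21}, a_{31}, a_{12}, \dots\).

When the input is a single RGB image with shape \(1 \times 3 \times 4 \times 5\), the storage order is as shown in the original figure (image material from the linked source). Along the channel axis, 0 is R, 1 is G and 2 is B: the R plane is stored first in row-major order, then the G plane, then the B plane. This is only an illustration; in Caffe the actual channel order is BGR.

When \(N_d=4\), with shape \(Num \times Channel \times Height \times Width\), a Blob is stored contiguously along the \(Width\) axis: \(w\) varies fastest, then \(h\), then \(c\), and \(n\) varies slowest (illustrated by a figure in the original post).

Once this layout is clear, operations such as concatenating and cropping multi-dimensional Blobs are easy to understand.

A Blob's offset member function returns the offset of element \((n, c, h, w)\). The computation matches row-major storage; the code is: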

inline int offset(const int n, const int c = 0, const int h = 0,
    const int w = 0) const {
  CHECK_GE(n, 0);
  CHECK_LE(n, num());
  CHECK_GE(channels(), 0);
  CHECK_LE(c, channels());
  CHECK_GE(height(), 0);
  CHECK_LE(h, height());
  CHECK_GE(width(), 0);
  CHECK_LE(w, width());
  return ((n * channels() + c) * height() + h) * width() + w;
}
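
The same arithmetic can be tried outside Caffe. The sketch below assumes a 4-D shape; `Shape4` and `RowMajorOffset` are hypothetical names, not Caffe APIs, and unlike the `CHECK_LE` calls above it rejects an index equal to the axis size.

```cpp
#include <cassert>

// Hypothetical stand-in for a 4-D Blob shape (not a Caffe type).
struct Shape4 {
  int num, channels, height, width;
};

// Row-major offset of element (n, c, h, w); same formula as Blob::offset.
inline int RowMajorOffset(const Shape4& s, int n, int c, int h, int w) {
  assert(n >= 0 && n < s.num);
  assert(c >= 0 && c < s.channels);
  assert(h >= 0 && h < s.height);
  assert(w >= 0 && w < s.width);
  return ((n * s.channels + c) * s.height + h) * s.width + w;
}
```

For the \(1 \times 3 \times 4 \times 5\) image discussed above, moving one step along \(w\) advances the offset by 1, one step along \(h\) by 5, and one step along \(c\) by 20, so element \((0, 1, 2, 3)\) sits at offset \(20 + 10 + 3 = 33\).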

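Data transfer between CPU and GPU

const Dtype* cpu_data() const; // read-only access; return (const Dtype*)data_->cpu_data();
const Dtype* gpu_data() const; // return (const Dtype*)data_->gpu_data();
Dtype* mutable_cpu_data(); // writable access; return static_cast<Dtype*>(data_->mutable_cpu_data());
Dtype* mutable_gpu_data(); // return static_cast<Dtype*>(data_->mutable_gpu_data());

Caffe uses the accessors above to obtain pointers to the data on the CPU and on the GPU. When one of them is called, SyncedMemory decides on its own whether the data needs to be synchronized (how it decides is left to the detailed discussion of SyncedMemory). When the CPU (GPU) side is accessed and the GPU (CPU) side may have been updated, the data is first synchronized to the CPU (GPU). The sample code below, taken from the Caffe website, shows when synchronization happens.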

// Assuming that data are on the CPU initially, and we have a blob.
const Dtype* foo;
Dtype* bar;
foo = blob.gpu_data(); // data copied cpu->gpu.
foo = blob.cpu_data(); // no data copied since both have up-to-date contents.
bar = blob.mutable_gpu_data(); // no data copied.
// ... some operations ...
bar = blob.mutable_gpu_data(); // no data copied when we are still on GPU.
foo = blob.cpu_data(); // data copied gpu->cpu, since the gpu side has modified the data
foo = blob.gpu_data(); // no data copied since both have up-to-date contents
bar = blob.mutable_cpu_data(); // still no data copied.
bar = blob.mutable_gpu_data(); // data copied cpu->gpu.
bar = blob.mutable_cpu_data(); // data copied gpu->cpu.

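Once a mutable function has been called, a later call to an accessor for the other side (mutable or const) triggers a synchronization, even if the data was never actually modified. So when you know you will not modify the data, prefer the const functions, and call the mutable functions only when you actually write to the data.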
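
The copy pattern in the sample can be reproduced with a toy state machine. This is a simplified model, not Caffe's SyncedMemory: `ToySyncedHead` is a hypothetical class, its three states borrow the names `HEAD_AT_CPU`, `HEAD_AT_GPU` and `SYNCED` that appear in Blob::Update later in this article, and it only counts simulated copies without allocating or moving any memory.

```cpp
#include <cassert>

// Simplified model of the bookkeeping only; no real memory is involved.
class ToySyncedHead {
 public:
  enum Head { HEAD_AT_CPU, HEAD_AT_GPU, SYNCED };

  void cpu_data() { ToCpu(); }
  void gpu_data() { ToGpu(); }
  // A mutable accessor marks its own side as the only up-to-date one,
  // whether or not the caller actually writes anything.
  void mutable_cpu_data() { ToCpu(); head_ = HEAD_AT_CPU; }
  void mutable_gpu_data() { ToGpu(); head_ = HEAD_AT_GPU; }

  Head head() const { return head_; }
  int copies() const { return copies_; }

 private:
  void ToCpu() {
    if (head_ == HEAD_AT_GPU) { ++copies_; head_ = SYNCED; }
  }
  void ToGpu() {
    if (head_ == HEAD_AT_CPU) { ++copies_; head_ = SYNCED; }
  }
  Head head_ = HEAD_AT_CPU;  // assumption: data starts on the CPU
  int copies_ = 0;
};

// Replays the accessor sequence from the sample above and returns the
// number of simulated copies.
inline int CopiesInExampleSequence() {
  ToySyncedHead m;
  m.gpu_data();          // copy 1: cpu -> gpu
  assert(m.head() == ToySyncedHead::SYNCED);
  m.cpu_data();          // no copy, both sides are current
  m.mutable_gpu_data();  // no copy
  m.mutable_gpu_data();  // no copy
  m.cpu_data();          // copy 2: gpu -> cpu
  m.gpu_data();          // no copy
  m.mutable_cpu_data();  // no copy
  m.mutable_gpu_data();  // copy 3: cpu -> gpu
  m.mutable_cpu_data();  // copy 4: gpu -> cpu
  return m.copies();
}
```

In this model the nine calls cause four copies, matching the four "data copied" comments in the sample. The last two copies happen only because a mutable accessor was called on the other side just before.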

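Main member functions

The main member functions of Blob are:

Basic functions, including constructors, setters and getters, and logical checks
The Reshape function, which sets the shape of a Blob and allocates memory
The Update function, used to update parameters during network training, \(data = data - diff\)
Blob arithmetic functions, for counting the elements in a slice, computing the L1 norm and the sum of squares (squared L2 norm), scaling by a scalar, and so on
Helper functions, such as proto import and export

The most important of these are described below.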

template <typename Dtype>
class Blob {
 public:
  Blob()
      : data_(), diff_(), count_(0), capacity_(0) {}

  /// @brief Deprecated; use <code>Blob(const vector<int>& shape)</code>.
  explicit Blob(const int num, const int channels, const int height,
      const int width);
  explicit Blob(const vector<int>& shape);
  // ......
};

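The Blob constructors call Reshape, which assigns the shape member variable and sets up the initial memory. Reshape is also called from Layer::Reshape or Layer::Forward to set the dimensions of the output Blob. If the input Blob of the whole network is reshaped, Net::Forward or Net::Reshape must be called to recompute the shapes of the Blobs of every layer (derived layer by layer, from bottom to top). When the size of a Blob changes, memory is reallocated only if the existing allocation is too small, as the code shows: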

template <typename Dtype>
bool Blob<Dtype>::Reshape(const vector<int>& shape) {
  CHECK_LE(shape.size(), kMaxBlobAxes);
  count_ = 1;
  shape_.resize(shape.size());
  if (!shape_data_ || shape_data_->size() < shape.size() * sizeof(int)) {
    shape_data_.reset(new SyncedMemory(shape.size() * sizeof(int)));
  }
  int* shape_data = static_cast<int*>(shape_data_->mutable_cpu_data());
  for (int i = 0; i < shape.size(); ++i) {
    CHECK_GE(shape[i], 0);
    if (count_ != 0) {
      CHECK_LE(shape[i], INT_MAX / count_) << "blob size exceeds INT_MAX";
    }
    count_ *= shape[i];
    shape_[i] = shape[i];
    shape_data[i] = shape[i];
  }
  // Allocate only when the capacity is insufficient; the old memory is released (shared_ptr)
  if (count_ > capacity_) {
    capacity_ = count_;
    data_.reset(new SyncedMemory(capacity_ * sizeof(Dtype)));
    diff_.reset(new SyncedMemory(capacity_ * sizeof(Dtype)));
    return true;
  }
  else {
    return false;
  }
}
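
The grow-only behaviour of capacity_ can be seen in a stripped-down model. `ToyBlob` below is a hypothetical class that keeps only the count/capacity bookkeeping from the listing above and uses a plain `std::vector<float>` in place of SyncedMemory; it is a sketch, not Caffe code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical miniature of the count_/capacity_ bookkeeping.
class ToyBlob {
 public:
  // Returns true when the buffer had to be reallocated.
  bool Reshape(const std::vector<int>& shape) {
    int count = 1;
    for (int dim : shape) {
      assert(dim >= 0);
      count *= dim;
    }
    shape_ = shape;
    count_ = count;
    if (count_ > capacity_) {
      capacity_ = count_;
      data_.assign(capacity_, 0.0f);  // fresh buffer; old contents are dropped
      return true;
    }
    return false;  // the existing buffer is large enough and is reused
  }

  int count() const { return count_; }
  int capacity() const { return capacity_; }

 private:
  std::vector<int> shape_;
  std::vector<float> data_;
  int count_ = 0;
  int capacity_ = 0;
};
```

Reshaping this toy object from \(2 \times 3 \times 4 \times 5\) down to \(1 \times 3 \times 4 \times 5\) halves count() but leaves capacity() at 120; only a reshape to something larger than 120 elements allocates again.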

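During training, the loss function and backpropagation yield the gradients, from which the update diff_ for each layer's parameters is obtained. Update is then called to apply it to the parameters: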

template <typename Dtype>
void Blob<Dtype>::Update() {
  // We will perform update based on where the data is located.
  switch (data_->head()) {
  case SyncedMemory::HEAD_AT_CPU:
    // perform computation on CPU
    // data = data - diff, axpy: y = ax + y
    caffe_axpy<Dtype>(count_, Dtype(-1),
        static_cast<const Dtype*>(diff_->cpu_data()),
        static_cast<Dtype*>(data_->mutable_cpu_data()));
    break;
  case SyncedMemory::HEAD_AT_GPU:
  case SyncedMemory::SYNCED:
#ifndef CPU_ONLY
    // perform computation on GPU
    caffe_gpu_axpy<Dtype>(count_, Dtype(-1),
        static_cast<const Dtype*>(diff_->gpu_data()),
        static_cast<Dtype*>(data_->mutable_gpu_data()));
#else
    NO_GPU;
#endif
    break;
  default:
    LOG(FATAL) << "Syncedmem not initialized.";
  }
}
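
The arithmetic performed by Update is just an axpy with \(a = -1\). A minimal CPU-only sketch on `std::vector<float>` follows; `Axpy` and `ToyUpdate` are hypothetical names standing in for the Caffe routines, which operate on raw pointers instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = a * x + y, element by element.
inline void Axpy(float a, const std::vector<float>& x, std::vector<float>* y) {
  assert(x.size() == y->size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    (*y)[i] += a * x[i];
  }
}

// data = data - diff, i.e. axpy with a = -1.
inline void ToyUpdate(const std::vector<float>& diff,
                      std::vector<float>* data) {
  Axpy(-1.0f, diff, data);
}
```

With data = {1.0, 2.0, 3.0} and diff = {0.5, 0.5, 1.0}, calling `ToyUpdate(diff, &data)` leaves data = {0.5, 1.5, 2.0}. Any learning rate is assumed to have been folded into diff already, since the update itself only subtracts.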

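It is worth noting that Blob axis indices may be negative: -1 refers to the last axis, as in Python. The implementation is shown below. Before accessing an axis, call CanonicalAxisIndex to get the actual axis index, for example CanonicalAxisIndex(-1).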

// axis_index the axis index.
// If 0 <= index < num_axes(), return index.
// If -num_axes <= index <= -1, return (num_axes() - (-index))
inline int CanonicalAxisIndex(int axis_index) const {
  CHECK_GE(axis_index, -num_axes())
      << "axis " << axis_index << " out of range for " << num_axes()
      << "-D Blob with shape " << shape_string();
  CHECK_LT(axis_index, num_axes())
      << "axis " << axis_index << " out of range for " << num_axes()
      << "-D Blob with shape " << shape_string();
  if (axis_index < 0) {
    return axis_index + num_axes();
  }
  return axis_index;
}
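
The index mapping is easy to check in isolation. The free function below is a hypothetical standalone version that takes the number of axes as a parameter and uses `assert` in place of the `CHECK_*` macros.

```cpp
#include <cassert>

// Maps a possibly negative axis index into the range [0, num_axes).
inline int CanonicalAxis(int axis_index, int num_axes) {
  assert(axis_index >= -num_axes);
  assert(axis_index < num_axes);
  return axis_index < 0 ? axis_index + num_axes : axis_index;
}
```

For a 4-D Blob, `CanonicalAxis(-1, 4)` gives 3 (the \(W\) axis) and `CanonicalAxis(-4, 4)` gives 0 (the \(N\) axis), while non-negative indices pass through unchanged.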

Other functions; only representative ones are listed.

// set / get
// Basic setters and getters, such as the const and mutable accessors above, are omitted
// Returns the value at (n, c, h, w): return cpu_data()[offset(n, c, h, w)]
inline Dtype data_at(const int n, const int c, const int h, const int w) const;
inline Dtype diff_at(const int n, const int c, const int h, const int w) const;
void ShareData(const Blob& other); // share data with another Blob, similar to a shallow copy
void ShareDiff(const Blob& other); // share diff with another Blob
// Copy from another Blob, similar to a deep copy
void CopyFrom(const Blob& source, bool copy_diff, bool reshape);

// Number of elements in a slice of axes: count *= shape(i)
inline int count(int start_axis, int end_axis) const;

// proto serialization and deserialization
void FromProto(const BlobProto& proto, bool reshape = true); // import from proto
void ToProto(BlobProto* proto, bool write_diff = false) const; // export to proto

// arithmetic
Dtype asum_data() const; // L1 norm of data (sum of absolute values)
Dtype asum_diff() const; // L1 norm of diff
Dtype sumsq_data() const; // sum of squares of data (squared L2 norm)
Dtype sumsq_diff() const; // sum of squares of diff (squared L2 norm)
void scale_data(Dtype scale_factor); // scale data by a scalar, in place
void scale_diff(Dtype scale_factor); // scale diff by a scalar, in place

// logical checks
bool ShapeEquals(const BlobProto& other); // whether the shapes are equal
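
A few of these operations are simple enough to sketch on plain vectors. The helpers below are hypothetical stand-ins, not the Caffe implementations; they assume that asum means the sum of absolute values, that sumsq means the sum of squares as the name suggests, and that count(start_axis, end_axis) multiplies the axes in the half-open range [start_axis, end_axis).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sum of absolute values (the L1 norm).
inline float AbsSum(const std::vector<float>& v) {
  float sum = 0.0f;
  for (float x : v) sum += std::fabs(x);
  return sum;
}

// Sum of squares (the squared L2 norm; no square root is taken).
inline float SumSquares(const std::vector<float>& v) {
  float sum = 0.0f;
  for (float x : v) sum += x * x;
  return sum;
}

// Product of shape[i] for i in [start_axis, end_axis).
inline int CountRange(const std::vector<int>& shape, int start_axis,
                      int end_axis) {
  assert(0 <= start_axis && start_axis <= end_axis);
  assert(end_axis <= static_cast<int>(shape.size()));
  int count = 1;
  for (int i = start_axis; i < end_axis; ++i) count *= shape[i];
  return count;
}
```

For the vector {3, -4}, `AbsSum` returns 7 and `SumSquares` returns 25, whose square root 5 is the L2 norm. For the shape \(2 \times 3 \times 4 \times 5\), `CountRange(shape, 1, 4)` returns 60, the number of elements in one image of the batch.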

That's all.