MACE Source Code Walkthrough [ARM Convolution, Part 2]: 1*1 Convolution Implementation

阿新 • • Published: 2019-01-18

Preface

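This article walks through the ARM implementation of 1*1 convolution in MACE. 1*1 convolution is a somewhat special operation in a CNN because it no longer works on a spatial neighborhood. It typically shows up in the following situations (which are not mutually exclusive):
1. Purely to add non-linear mapping, without relying on the CNN's neighborhood-based feature extraction
2. To change the number of feature maps in a bottleneck structure
3. As the pointwise component of a depthwise separable convolution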

If you know of other cases beyond these three, feel free to add them.

Source files covered in this article:

mace/mace/kernels/arm/conv_2d_neon_1x1.cc
mace/mace/kernels/gemm.cc

From convolution to matrix multiplication

// mace/mace/kernels/arm/conv_2d_neon_1x1.cc 


#include "mace/kernels/arm/conv_2d_neon.h"
#include "mace/kernels/gemm.h"

namespace mace {
namespace kernels {

void Conv2dNeonK1x1S1(const float *input,
                      const float *filter,
                      const index_t batch,
                      const index_t height,
                      const index_t width,
                      const index_t in_channels,
                      const index_t out_channels,
                      float *output) {
  for (index_t b = 0; b < batch; ++b) {
    Gemm(filter, input + b * in_channels * height * width, 1, out_channels,
         in_channels, height * width,
         output + b * out_channels * height * width);
  }
}

}  // namespace kernels 

}  // namespace mace

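The MACE code for 1*1 convolution is shown above. It simply calls the gemm matrix multiplication once per batch. This section briefly explains how the convolution turns into a matrix multiplication. Let the number of input channels be C1 and the number of output channels be C2. A general convolution kernel then has C1xC2xkhxkw parameters, so when the kernel size is 1*1 the four-dimensional kernel collapses into a two-dimensional matrix K of size C1*C2. For a single batch, suppose the input data has size C1*H*W and reshape it into a C1*(H*W) matrix F. The per-channel convolutions followed by a sum over channels can then be written as the product of these two matrices: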

Z = K^{t} * F

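This yields a matrix Z of size C2*(H*W). In effect, each single-channel convolution degenerates into multiplying a matrix (the feature map) by a scalar (the single kernel weight). Judging from the call above, Gemm receives filter as its left operand with out_channels rows and in_channels columns, so the filter appears to be stored already in the K^{t} layout and no explicit transpose is needed. The figure shows an example with C1=2, C2=3, and input and output feature maps of size 2*3 (1*6 or 3*2 would work the same way).
[Figure: a 1*1 convolution written as a matrix product, with C1=2 and C2=3]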

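Once the matrix multiplication is done, the 1*1 convolution for a single batch is complete. I0 and I1 denote the two channels of the input data; here the w*h values of each channel are flattened into one row. Why is there no reshape function in the source? Because the memory layout does not change, so no extra operation is needed.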
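To make the equivalence concrete, here is a minimal sketch (not MACE code) that computes a 1*1 convolution directly with nested loops and then as a plain matrix product over the same buffers. Conv1x1Direct and NaiveGemm are hypothetical helpers introduced only for illustration, and the filter is assumed to be a row-major out_channels x in_channels matrix, matching the argument order of the Gemm call above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct 1x1 convolution for one batch in NCHW layout (illustrative helper).
// input:  in_channels  x height x width
// filter: out_channels x in_channels (the 1x1 spatial dims are dropped)
// output: out_channels x height x width
std::vector<float> Conv1x1Direct(const std::vector<float> &input,
                                 const std::vector<float> &filter,
                                 std::size_t height, std::size_t width,
                                 std::size_t in_channels,
                                 std::size_t out_channels) {
  assert(input.size() == in_channels * height * width);
  assert(filter.size() == out_channels * in_channels);
  std::vector<float> output(out_channels * height * width, 0.0f);
  for (std::size_t oc = 0; oc < out_channels; ++oc)
    for (std::size_t y = 0; y < height; ++y)
      for (std::size_t x = 0; x < width; ++x)
        for (std::size_t ic = 0; ic < in_channels; ++ic)
          output[(oc * height + y) * width + x] +=
              filter[oc * in_channels + ic] *
              input[(ic * height + y) * width + x];
  return output;
}

// Plain row-major matrix product C[m x n] = A[m x k] * B[k x n]
// (illustrative helper, no tiling or SIMD).
std::vector<float> NaiveGemm(const std::vector<float> &a,
                             const std::vector<float> &b,
                             std::size_t m, std::size_t k, std::size_t n) {
  assert(a.size() == m * k && b.size() == k * n);
  std::vector<float> c(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t p = 0; p < k; ++p)
      for (std::size_t j = 0; j < n; ++j)
        c[i * n + j] += a[i * k + p] * b[p * n + j];
  return c;
}
```

Calling NaiveGemm(filter, input, out_channels, in_channels, height * width) on the same vectors should give the same values as Conv1x1Direct, without reshaping or copying the input.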

The gemm implementation

Now let's turn to the gemm implementation.

/**
 * Gemm does fast matrix multiplications with batch.
 * It is optimized for arm64-v8 and armeabi-v7a using neon.
 *
 * We adopt two-level tiling to make better use of l1 cache and register.
 * For register tiling, function like GemmXYZ computes gemm for
 * matrix[X, Y] * matrix[Y, Z] with all data being able to fit in register.
 * For cache tiling, we try to compute one block of multiplication with
 * two input matrices and one output matrix fit in l1 cache.
 */

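So says the comment at the top of the source file. For better performance MACE uses blocked matrix multiplication, so before reading this part of the code it is worth pausing to review the formula for block matrix multiplication.
MACE splits a large matrix operation into two levels of blocked multiplication. Functions at the first level are all named GemmXYZ, meaning a product of matrices of size [X,Y] and [Y,Z], and most of the NEON optimization lives in these functions. The matrices at this level are small, Gemm688 at the largest, so in most cases the variables can stay in registers, which avoids the cost of spilling register variables to the stack. This level of blocked multiplication is called register tiling.
The second level groups several register tiles into a block and tries to keep the memory footprint of one block (two input matrices plus one output matrix) within the L1 cache to raise the cache hit rate. This is called cache tiling. Both levels exist to reduce the cost of moving data through the memory hierarchy.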

register tiling

#define MACE_GEMM_PART_CAL_8(RC, RA, RAN)                      \
  c##RC = vmlaq_lane_f32(c##RC, b0, vget_low_f32(a##RA), 0);   \
  c##RC = vmlaq_lane_f32(c##RC, b1, vget_low_f32(a##RA), 1);   \
  c##RC = vmlaq_lane_f32(c##RC, b2, vget_high_f32(a##RA), 0);  \
  c##RC = vmlaq_lane_f32(c##RC, b3, vget_high_f32(a##RA), 1);  \
  c##RC = vmlaq_lane_f32(c##RC, b4, vget_low_f32(a##RAN), 0);  \
  c##RC = vmlaq_lane_f32(c##RC, b5, vget_low_f32(a##RAN), 1);  \
  c##RC = vmlaq_lane_f32(c##RC, b6, vget_high_f32(a##RAN), 0); \
  c##RC = vmlaq_lane_f32(c##RC, b7, vget_high_f32(a##RAN), 1);

#define MACE_GEMM_PART_CAL_4(RC)                              \
  c##RC = vmlaq_lane_f32(c##RC, b0, vget_low_f32(a##RC), 0);  \
  c##RC = vmlaq_lane_f32(c##RC, b1, vget_low_f32(a##RC), 1);  \
  c##RC = vmlaq_lane_f32(c##RC, b2, vget_high_f32(a##RC), 0); \
  c##RC = vmlaq_lane_f32(c##RC, b3, vget_high_f32(a##RC), 1);

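These two macros are the heart of the sub-matrix computation. Each accumulates the products of 8 (or 4) float vectors with 8 (or 4) scalars, which is the basic operation of our matrix multiplication. One invocation of MACE_GEMM_PART_CAL_4(RC) multiplies a 1*4 matrix (A) by a 4*4 matrix (B) and adds the result to the accumulator c##RC.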
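For readers without a NEON toolchain at hand, the following portable sketch mimics what MACE_GEMM_PART_CAL_4 computes. It is only an illustration: Vec4, MulAddScalar and PartCal4 are hypothetical names, and the element-wise loop stands in for vmlaq_lane_f32, which adds a vector scaled by one selected lane to an accumulator.

```cpp
#include <array>

using Vec4 = std::array<float, 4>;

// Scalar stand-in for vmlaq_lane_f32: returns acc + vec * scalar,
// computed element by element.
Vec4 MulAddScalar(Vec4 acc, const Vec4 &vec, float scalar) {
  for (int i = 0; i < 4; ++i) acc[i] += vec[i] * scalar;
  return acc;
}

// Scalar equivalent of MACE_GEMM_PART_CAL_4: c += a (1x4) * B (4x4),
// where b[r] holds row r of B and a[r] is the scalar that scales it.
Vec4 PartCal4(Vec4 c, const Vec4 &a, const std::array<Vec4, 4> &b) {
  for (int r = 0; r < 4; ++r) c = MulAddScalar(c, b[r], a[r]);
  return c;
}
```

Written this way, the row vector of C is a weighted sum of the rows of B, with the weights taken from one row of A. That is why the macro only needs whole-vector multiply-accumulate operations and no horizontal additions.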

inline void Gemm144(const float *a_ptr,
                    const float *b_ptr,
                    const index_t stride_a,
                    const index_t stride_b,
                    const index_t stride_c,
                    float *c_ptr) {
#if defined(MACE_ENABLE_NEON)
  MACE_UNUSED(stride_a);
  MACE_UNUSED(stride_c);
  float32x4_t a0;
  float32x4_t b0, b1, b2, b3;
  float32x4_t c0;

  a0 = vld1q_f32(a_ptr);

  b0 = vld1q_f32(b_ptr);
  b1 = vld1q_f32(b_ptr + 1 * stride_b);
  b2 = vld1q_f32(b_ptr + 2 * stride_b);
  b3 = vld1q_f32(b_ptr + 3 * stride_b);

  c0 = vld1q_f32(c_ptr);

  MACE_GEMM_PART_CAL_4(0);

  vst1q_f32(c_ptr, c0);
#else
  GemmBlock(a_ptr, b_ptr, 1, 4, 4, stride_a, stride_b, stride_c, c_ptr);
#endif
}

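Take Gemm144 as an example. The input matrices A and B are loaded into one and four 1*4 float vectors respectively, and the multiply-accumulate instructions then store the result in a 1*4 result vector. A function like Gemm884 differs only in that each row of A takes one more vector.
So MACE_GEMM_PART_CAL_8 needs two more arguments, which together make up one row of A. The calling code looks like this: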

  MACE_GEMM_PART_CAL_8(0, 0, 1);
  MACE_GEMM_PART_CAL_8(1, 2, 3);
  MACE_GEMM_PART_CAL_8(2, 4, 5);
  MACE_GEMM_PART_CAL_8(3, 6, 7);
  MACE_GEMM_PART_CAL_8(4, 8, 9);
  MACE_GEMM_PART_CAL_8(5, 10, 11);
  MACE_GEMM_PART_CAL_8(6, 12, 13);
  MACE_GEMM_PART_CAL_8(7, 14, 15);

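The first level of matrix multiplication is made up of this family of GemmXYZ functions, and the calls to them make up the second level. Let's keep reading.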

cache tiling

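The bulk of this part lives in two functions, GemmTile and Gemm. This is production code, so it has to handle boundaries and optimize for different compilers and devices, which makes it look rather bulky. To make the logic clear I removed the code controlled by the aarch64 and clang macros and, for now, the boundary handling as well. The code then looks like this: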

inline void GemmTile(const float *A,
                     const float *B,
                     const index_t height,
                     const index_t K,
                     const index_t width,
                     const index_t stride_a,
                     const index_t stride_b,
                     const index_t stride_c,
                     float *C) {
  index_t h = 0;
  index_t w = 0;
  index_t k = 0;
  int reg_height_tile = 8;
  int reg_K_tile = 8;

  for (h = 0; h < height - reg_height_tile + 1; h += reg_height_tile) {
    for (k = 0; k < K - reg_K_tile + 1; k += reg_K_tile) {
      const float *a_ptr = A + (h * stride_a + k);
      for (w = 0; w + 3 < width; w += 4) {
        const float *b_ptr = B + (k * stride_b + w);
        float *c_ptr = C + (h * stride_c + w);
        Gemm884(a_ptr, b_ptr, stride_a, stride_b, stride_c, c_ptr);
      }
    }
  }
}

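The first-level matrix computation sits inside Gemm884, which can be treated here as a single element. Seen that way, the three nested loops are the same as in an ordinary matrix multiplication (recall the block matrix multiplication formula; it is essentially a recursive process).
Now let's add the boundary handling back: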

inline void GemmTile(const float *A,
                     const float *B,
                     const index_t height,
                     const index_t K,
                     const index_t width,
                     const index_t stride_a,
                     const index_t stride_b,
                     const index_t stride_c,
                     float *C) {
  index_t h = 0;
  index_t w = 0;
  index_t k = 0;
  int reg_height_tile = 6;
  int reg_K_tile = 4;

  for (h = 0; h < height - reg_height_tile + 1; h += reg_height_tile) {
    for (k = 0; k < K - reg_K_tile + 1; k += reg_K_tile) {
      const float *a_ptr = A + (h * stride_a + k);
      for (w = 0; w + 3 < width; w += 4) {
        const float *b_ptr = B + (k * stride_b + w);
        float *c_ptr = C + (h * stride_c + w);
        Gemm884(a_ptr, b_ptr, stride_a, stride_b, stride_c, c_ptr);
      }
      if (w < width) {
          const float *b_ptr = B + (k * stride_b + w);
          float *c_ptr = C + (h * stride_c + w);
          GemmBlock(a_ptr, b_ptr, reg_height_tile, reg_K_tile, width - w,
              stride_a, stride_b, stride_c, c_ptr);
      }
    }
    if (k < K) {
        const float *a_ptr = A + (h * stride_a + k);
        const float *b_ptr = B + k * stride_b;
        float *c_ptr = C + h * stride_c;
        GemmBlock(a_ptr, b_ptr, reg_height_tile, K - k, width, stride_a, stride_b,
            stride_c, c_ptr);
    }
  }
  if (h < height) {
      index_t remain_h = height - h;
      for (k = 0; k < K - reg_K_tile; k += reg_K_tile) {
          const float *a_ptr = A + (h * stride_a + k);
          index_t w;
          for (w = 0; w + 3 < width; w += 4) {
              const float *b_ptr = B + (k * stride_b + w);
              float *c_ptr = C + (h * stride_c + w);
              GemmX44(a_ptr, b_ptr, stride_a, stride_b, stride_c, c_ptr, remain_h);
          }
          if (w < width) {
              const float *b_ptr = B + (k * stride_b + w);
              float *c_ptr = C + (h * stride_c + w);
              GemmBlock(a_ptr, b_ptr, remain_h, reg_K_tile, width - w, stride_a,
                  stride_b, stride_c, c_ptr);
          }
      }
      if (k < K) {
          const float *a_ptr = A + (h * stride_a + k);
          const float *b_ptr = B + k * stride_b;
          float *c_ptr = C + h * stride_c;
          GemmBlock(a_ptr, b_ptr, remain_h, K - k, width, stride_a, stride_b,
              stride_c, c_ptr);
      }
  }
}

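Comparing the two, you can see that within a block the leftover part of each of the three dimensions, where fewer elements remain than the step size, is computed by GemmBlock. Note that the two listings use different tile sizes (8 and 8 versus 6 and 4), presumably taken from different preprocessor branches. Within any one branch the register kernel has to match the tile shape, so the Gemm884 call in the second listing is best read as standing for that branch's kernel. The code inside the aarch64 and clang macros embeds NEON assembly, which allows better control over instruction scheduling and register use. It can be read along the same lines as the GemmXYZ walkthrough, so I won't go into it here.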
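The control flow of the full listing can be reproduced in a small portable sketch. This is not MACE code: the tile shape (2, 2, 4), FixedKernel, BlockMulAdd and TiledMulAdd are all hypothetical, and the fixed-shape kernel is just a loop rather than hand-written NEON. It only shows how full tiles go to the fixed-shape kernel while the leftovers in each of the three dimensions go to a generic fallback.

```cpp
#include <cassert>
#include <cstddef>

// Generic fallback: C[h x w] += A[h x k] * B[k x w], row-major with strides.
// It plays the role that GemmBlock plays in the listing above.
void BlockMulAdd(const float *a, const float *b, std::size_t h, std::size_t k,
                 std::size_t w, std::size_t stride_a, std::size_t stride_b,
                 std::size_t stride_c, float *c) {
  for (std::size_t i = 0; i < h; ++i)
    for (std::size_t p = 0; p < k; ++p)
      for (std::size_t j = 0; j < w; ++j)
        c[i * stride_c + j] += a[i * stride_a + p] * b[p * stride_b + j];
}

// Hypothetical tile shape chosen for illustration only.
constexpr std::size_t kTileH = 2, kTileK = 2, kTileW = 4;

// Stand-in for a fixed-shape GemmXYZ kernel; a real one would use SIMD.
void FixedKernel(const float *a, const float *b, std::size_t stride_a,
                 std::size_t stride_b, std::size_t stride_c, float *c) {
  BlockMulAdd(a, b, kTileH, kTileK, kTileW, stride_a, stride_b, stride_c, c);
}

// C[height x width] += A[height x K] * B[K x width], all densely packed.
void TiledMulAdd(const float *a, const float *b, std::size_t height,
                 std::size_t K, std::size_t width, float *c) {
  assert(a != nullptr && b != nullptr && c != nullptr);
  const std::size_t sa = K, sb = width, sc = width;
  std::size_t h = 0;
  for (; h + kTileH <= height; h += kTileH) {
    std::size_t k = 0;
    for (; k + kTileK <= K; k += kTileK) {
      std::size_t w = 0;
      for (; w + kTileW <= width; w += kTileW)  // full tiles
        FixedKernel(a + h * sa + k, b + k * sb + w, sa, sb, sc,
                    c + h * sc + w);
      if (w < width)  // leftover columns
        BlockMulAdd(a + h * sa + k, b + k * sb + w, kTileH, kTileK,
                    width - w, sa, sb, sc, c + h * sc + w);
    }
    if (k < K)  // leftover part of the shared dimension
      BlockMulAdd(a + h * sa + k, b + k * sb, kTileH, K - k, width, sa, sb,
                  sc, c + h * sc);
  }
  if (h < height)  // leftover rows
    BlockMulAdd(a + h * sa, b, height - h, K, width, sa, sb, sc, c + h * sc);
}
```

The loop conditions are written as h + kTileH <= height rather than h < height - kTileH + 1 so that unsigned index types cannot underflow when a dimension is smaller than the tile.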

Gemm

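Working from the bottom up, we finally reach the top-level interface function for matrix multiplication. As with GemmTile, let's first strip away the details: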

// A: height x K, B: K x width, C: height x width
void Gemm(const float *A,
    const float *B,
    const index_t batch,
    const index_t height,
    const index_t K,
    const index_t width,
    float *C,
    const bool transpose_a,
    const bool transpose_b) {
    memset(C, 0, sizeof(float)* batch * height * width);

    // It is better to use large block size if it fits for fast cache.
    // Assume l1 cache size is 32k, we load three blocks at a time (A, B, C),
    // the block size should be sqrt(32k / sizeof(T) / 3).
    // As number of input channels of convolution is normally power of 2, and
    // we have not optimized tiling remains, we use the following magic number
    const index_t block_size = 64;
    const index_t block_tile_height = RoundUpDiv(height, block_size);
    const index_t block_tile_width = RoundUpDiv(width, block_size);
    const index_t block_tile_k = RoundUpDiv(K, block_size);
    const index_t block_tile[3] = { block_tile_height, block_tile_width,
        block_tile_k };
    const index_t remain_height = height % block_size;
    const index_t remain_width = width % block_size;
    const index_t remain_k = K % block_size;
    const index_t remain[3] = { remain_height, remain_width, remain_k };

#pragma omp parallel for collapse(3)
    for (index_t n = 0; n < batch; ++n) {
        for (index_t bh = 0; bh < block_tile[0]; ++bh) {
            for (index_t bw = 0; bw < block_tile[1]; ++bw) {
                const float *a_base = A + n * height * K;
                const float *b_base = B + n * K * width;
                float *c_base = C + n * height * width;

                const index_t ih_begin = bh * block_size;
                const index_t ih_end =
                    bh * block_size +
                    (bh == block_tile[0] - 1 && remain[0] > 0 ? remain[0] : block_size);
                const index_t iw_begin = bw * block_size;
                const index_t iw_end =
                    bw * block_size +
                    (bw == block_tile[1] - 1 && remain[1] > 0 ? remain[1] : block_size);

                for (index_t bk = 0; bk < block_tile[2]; ++bk) {
                    const index_t ik_begin = bk * block_size;
                    const index_t ik_end =
                        bk * block_size + (bk == block_tile[2] - 1 && remain[2] > 0
                        ? remain[2]
                        : block_size);

                    Tensor trans_a;
                    Tensor trans_b;
                    const float *real_a = nullptr;
                    const float *real_b = nullptr;
                    float *real_c = c_base + (ih_begin * width + iw_begin);
                    index_t stride_a;
                    index_t stride_b;
                    index_t stride_c = width;

                    real_a = a_base + (ih_begin * K + ik_begin);
                    stride_a = K;

                    real_b = b_base + (ik_begin * width + iw_begin);
                    stride_b = width;

                    // inside block:
                    // calculate C[bh, bw] += A[bh, bk] * B[bk, bw] for one k
                    GemmTile(real_a, real_b, ih_end - ih_begin, ik_end - ik_begin,
                        iw_end - iw_begin, stride_a, stride_b, stride_c, real_c);
                }  // bk
            }    // bw
        }      // bh
    }        // n
}

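The body is still the three nested loops of matrix multiplication, except that the basic element is now an entire block rather than a single register tile, as described above. The goal is for the memory touched by a block to stay in the L1 cache, which reduces cache misses during computation. The default block size is 64. Gemm also hands the tail that falls short of 64 to GemmTile: in the last iteration of each loop, the block size passed in may be smaller than 64.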
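The block bookkeeping can be checked in isolation with a small sketch. CeilDiv, BlockRanges and MaxSquareBlockEdge are hypothetical helpers, not MACE functions; CeilDiv is assumed to match what RoundUpDiv does in the listing, and float is assumed to be 4 bytes. Under the comment's own assumption of a 32 KiB L1 cache holding three blocks, the estimate comes out at about 52, so the 64 used in the listing is a power of two slightly above it, and three full 64x64 float blocks take 48 KiB.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Ceiling division (assumed to be the role of RoundUpDiv in the listing).
std::size_t CeilDiv(std::size_t n, std::size_t d) {
  assert(d > 0);
  return (n + d - 1) / d;
}

// Splits [0, extent) into consecutive [begin, end) ranges of at most
// block_size elements, using the same "last block takes the remainder"
// rule as ih_end / iw_end / ik_end in the listing.
std::vector<std::pair<std::size_t, std::size_t>> BlockRanges(
    std::size_t extent, std::size_t block_size) {
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  const std::size_t tiles = CeilDiv(extent, block_size);
  const std::size_t remain = extent % block_size;
  for (std::size_t b = 0; b < tiles; ++b) {
    const std::size_t begin = b * block_size;
    const std::size_t len =
        (b == tiles - 1 && remain > 0) ? remain : block_size;
    ranges.emplace_back(begin, begin + len);
  }
  return ranges;
}

// Largest edge e such that three e x e float blocks fit in cache_bytes,
// i.e. floor(sqrt(cache_bytes / sizeof(float) / 3)).
std::size_t MaxSquareBlockEdge(std::size_t cache_bytes) {
  const std::size_t floats_per_block = cache_bytes / sizeof(float) / 3;
  std::size_t edge = 0;
  while ((edge + 1) * (edge + 1) <= floats_per_block) ++edge;
  return edge;
}
```

For example, BlockRanges(130, 64) produces the ranges [0, 64), [64, 128) and [128, 130), so the final call into GemmTile along that dimension sees an extent of only 2.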

Summary

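This article covered MACE's 1*1 convolution, which in practice calls matrix multiplication to perform the convolution within a single batch. Its gemm algorithm uses two levels of blocked matrix multiplication to avoid, as far as possible, both spilling register variables to the stack and cache misses. In a naive matrix multiplication, computing a single result spans a large range of input memory (an entire row and an entire column), so cache misses and register spills are bound to be frequent.
As the implementation shows, the parts that fall short of the step size both introduce extra branches and miss out on NEON optimization. So when designing a network, choose heights, widths and channel counts that are multiples of 4 and 64 where possible to get better performance.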
