Caffe Source Code Reading Notes (1): SoftmaxLayer

阿新 • Published: 2019-01-13

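The Softmax layer converts an input prediction vector into probabilities: each element lies between 0 and 1, and the elements sum to 1. Softmax loss is the loss obtained by applying the multinomial cross-entropy loss function to the Softmax output. Below we discuss the forward computation and the backward derivation of both, together with their source-code implementation in Caffe. To tie the derivations to the code and deepen understanding, each derivation in this article is immediately followed by its implementation.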

1. Softmax

1.1 Forward computation

1.1.1 Derivation

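Suppose there are K classes and the score z_i of each class has already been computed. Softmax then computes the corresponding probabilities as:
$\mathrm{Softmax}(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}, \quad i = 0, 1, 2, \dots, K-1$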

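This maps each z_i into [0, 1] with the values summing to 1, which gives the probability of the input being predicted as each class.
The forward pass is fairly simple; let's look at the actual implementation.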
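
The formula can be tried outside Caffe with a small standalone sketch. The function name `softmax_sketch` is made up for illustration and is not part of Caffe; it handles a single score vector and already applies the max subtraction that the Caffe code in the next section uses for numerical stability.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a Caffe function): softmax over one score vector.
std::vector<double> softmax_sketch(const std::vector<double>& z) {
  assert(!z.empty());
  // Subtract the maximum score first so that exp() cannot overflow.
  const double z_max = *std::max_element(z.begin(), z.end());
  std::vector<double> a(z.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < z.size(); ++i) {
    a[i] = std::exp(z[i] - z_max);
    sum += a[i];
  }
  // Normalize so that the outputs sum to 1.
  for (double& v : a) v /= sum;
  return a;
}
```

For example, `softmax_sketch({1000.0, 1000.0})` yields 0.5 for both entries, whereas a direct `exp(1000.0)` would overflow to infinity in double precision.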

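1.1.2 Source code

We mainly analyze the Forward_cpu function of the Softmax layer, implemented in Caffe's src/caffe/layers/softmax_layer.cpp. Note that Caffe's implementation first subtracts the maximum from the inputs z_i, which avoids the overflow that very large values could otherwise cause in the subsequent exp() computation.
First, a few variables in the code below that are not obvious:
scale_data: an intermediate buffer that holds intermediate results.
inner_num_: declared in softmax_layer.hpp and set in SoftmaxLayer::Reshape as inner_num_ = bottom[0]->count(softmax_axis_ + 1); that is, the number of positions (e.g. pixels) spanned by the axes after the softmax axis, each of which gets its own probability distribution.
outer_num_: set in the same place as outer_num_ = bottom[0]->count(0, softmax_axis_); it can be understood as the number of samples.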
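
To make the roles of these quantities concrete, here is a minimal sketch of how a blob shape is split around the softmax axis and how the flat offset used in the loops below is formed. All names (`count_range`, `SoftmaxDims`, `split_dims`, `flat_index`) are hypothetical; `count_range` only imitates what `Blob::count(start, end)` returns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Blob::count(start, end): product of shape[start..end).
int count_range(const std::vector<int>& shape, std::size_t start,
                std::size_t end) {
  assert(start <= end && end <= shape.size());
  int n = 1;
  for (std::size_t i = start; i < end; ++i) n *= shape[i];
  return n;
}

struct SoftmaxDims {
  int outer_num;  // product of the axes before the softmax axis
  int channels;   // size of the softmax axis (number of classes)
  int inner_num;  // product of the axes after the softmax axis
};

SoftmaxDims split_dims(const std::vector<int>& shape, std::size_t axis) {
  assert(axis < shape.size());
  return SoftmaxDims{count_range(shape, 0, axis), shape[axis],
                     count_range(shape, axis + 1, shape.size())};
}

// Flat offset of element (i, j, k): outer index i, class j, inner position k.
int flat_index(const SoftmaxDims& d, int i, int j, int k) {
  const int dim = d.channels * d.inner_num;  // elements per outer index
  return i * dim + j * d.inner_num + k;
}
```

With shape (2, 3, 4, 5) and axis 1 this gives outer_num = 2, channels = 3 and inner_num = 20, so element (1, 2, 7) sits at offset 107.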

template <typename Dtype>
void SoftmaxLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  Dtype* top_data = top[0]->mutable_cpu_data();
  Dtype* scale_data = scale_.mutable_cpu_data();    // scale_data is an intermediate buffer for partial results
  int channels = bottom[0]->shape(softmax_axis_);
  int dim = bottom[0]->count() / outer_num_;
  // Initialize the output data to the input data
  caffe_copy(bottom[0]->count(), bottom_data, top_data);
  // We need to subtract the maximum, compute exp, and then normalize.
  for (int i = 0; i < outer_num_; ++i) {
    // Initialize scale_data to the first channel plane of sample i
    caffe_copy(inner_num_, bottom_data + i * dim, scale_data);
    // For each position, find the maximum input score over all classes
    // and store it in scale_data
    for (int j = 0; j < channels; j++) {
      for (int k = 0; k < inner_num_; k++) {
        scale_data[k] = std::max(scale_data[k],
            bottom_data[i * dim + j * inner_num_ + k]);
      }
    }
    // Subtract the maximum
    caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels, inner_num_,
        1, -1., sum_multiplier_.cpu_data(), scale_data, 1., top_data);
    // Compute exp()
    caffe_exp<Dtype>(dim, top_data, top_data);
    // Sum over the channels
    caffe_cpu_gemv<Dtype>(CblasTrans, channels, inner_num_, 1.,
        top_data, sum_multiplier_.cpu_data(), 0., scale_data);
    // Divide by the sum computed above
    for (int j = 0; j < channels; j++) {
      caffe_div(inner_num_, top_data, scale_data, top_data);
      top_data += inner_num_;    // advance the pointer
    }
  }
}


1.2 Backward propagation

1.2.1 Derivation

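As before, let z_i be the input of Softmax and a_i its output. By the chain rule, the partial derivative of the loss with respect to the input z can be computed as:

$\frac{\partial loss}{\partial z} = \frac{\partial loss}{\partial a} \cdot \frac{\partial a}{\partial z}$

Here $\frac{\partial loss}{\partial a}$ is the gradient passed back from the layer above, which is known to this layer, so we only need to compute $\frac{\partial a}{\partial z}$.
From $a_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$:

When i = j,
$\frac{\partial a_i}{\partial z_i} = \frac{e^{z_i}}{\sum_k e^{z_k}} - \frac{e^{z_i} e^{z_i}}{(\sum_k e^{z_k})^2} = a_i - a_i * a_i$

Here * denotes ordinary scalar multiplication.
When i ≠ j,
$\frac{\partial a_i}{\partial z_j} = -\frac{e^{z_i} e^{z_j}}{(\sum_k e^{z_k})^2} = -a_i * a_j$

Therefore,
$\frac{\partial loss}{\partial z_j} = \sum_i \frac{\partial loss}{\partial a_i} \frac{\partial a_i}{\partial z_j} = \left(\frac{\partial loss}{\partial a_j} - \sum_i \frac{\partial loss}{\partial a_i} a_i\right) a_j = \left(\frac{\partial loss}{\partial a_j} - \frac{\partial loss}{\partial a} \cdot a\right) a_j$

Here ⋅ denotes the vector dot product.
Written in vector form, this becomes:
$\frac{\partial loss}{\partial z} = \left(\frac{\partial loss}{\partial a} - \frac{\partial loss}{\partial a} \cdot a\right) \odot a$

That is, (top_diff - top_diff ⋅ top_data) ⊙ top_data, where the scalar dot product is subtracted from every element of top_diff and ⊙ is element-wise multiplication.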
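
As a quick check of the result, the backward rule for a single probability vector can be sketched as follows. The helper name `softmax_backward_sketch` is hypothetical; it subtracts dot(top_diff, top_data) from every element of top_diff and then multiplies element-wise by top_data, without Caffe's batching over outer_num_ and inner_num_.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: given the softmax output (top_data) and the upstream
// gradient dloss/da (top_diff), return dloss/dz for one vector.
std::vector<double> softmax_backward_sketch(
    const std::vector<double>& top_data,
    const std::vector<double>& top_diff) {
  assert(top_data.size() == top_diff.size());
  // Scalar dot product between the upstream gradient and the output.
  double dot = 0.0;
  for (std::size_t i = 0; i < top_data.size(); ++i) {
    dot += top_diff[i] * top_data[i];
  }
  // Subtract the dot product from each element, then multiply element-wise.
  std::vector<double> bottom_diff(top_data.size());
  for (std::size_t i = 0; i < top_data.size(); ++i) {
    bottom_diff[i] = (top_diff[i] - dot) * top_data[i];
  }
  return bottom_diff;
}
```

With a = (0.5, 0.5) and upstream gradient (1, 0) the result is (0.25, -0.25), which agrees with a_0 - a_0 * a_0 = 0.25 and -a_0 * a_1 = -0.25 from the two cases of the derivation.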

1.2.2 Source code

Next we analyze the Backward_cpu function of the Softmax layer, which is also implemented in Caffe's src/caffe/layers/softmax_layer.cpp.

template <typename Dtype>
void SoftmaxLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  const Dtype* top_diff = top[0]->cpu_diff();
  const Dtype* top_data = top[0]->cpu_data();
  Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
  Dtype* scale_data = scale_.mutable_cpu_data();
  int channels = top[0]->shape(softmax_axis_);
  int dim = top[0]->count() / outer_num_;
  caffe_copy(top[0]->count(), top_diff, bottom_diff);    // initialize bottom_diff to the values of top_diff
  for (int i = 0; i < outer_num_; ++i) {
    // Start of the computation for sample i
    for (int k = 0; k < inner_num_; ++k) {
      // Compute dot(top_diff, top_data) along the channel axis
      scale_data[k] = caffe_cpu_strided_dot<Dtype>(channels,
          bottom_diff + i * dim + k, inner_num_,
          top_data + i * dim + k, inner_num_);
    }
    // Subtract the dot product
    caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels, inner_num_, 1,
        -1., sum_multiplier_.cpu_data(), scale_data, 1., bottom_diff + i * dim);
  }
  // Element-wise multiplication
  caffe_mul(top[0]->count(), bottom_diff, top_data, bottom_diff);
}

2. Softmax Loss

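Softmax Loss takes the output probabilities of Softmax as the predicted probabilities and computes the cross-entropy loss against the ground-truth labels. In Caffe it also calls the Softmax layer to carry out the forward pass.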

SoftmaxWithLoss = Multinomial Logistic Loss Layer + Softmax Layer

2.1 Forward computation

2.1.1 Derivation

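Its core formula is:
$loss(\hat{y}, z) = -\log\frac{e^{z_k - m}}{\sum_j e^{z_j - m}} = \log\sum_j e^{z_j - m} - (z_k - m)$

Here $\hat{y}$ is the label, k is the neuron (class index) corresponding to the label of the input image, and m is the maximum of the scores z_j, subtracted mainly for numerical stability.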
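
A minimal sketch of this loss for a single sample, written in the log-sum-exp form log(∑_j exp(z_j - m)) - (z_k - m) with m the maximum score. The helper name `softmax_loss_sketch` is hypothetical, and this is not the code path Caffe takes: as the implementation later in this article shows, Caffe first runs the Softmax layer and then takes -log of the probability at the label index.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: cross-entropy loss of one sample, computed from the
// raw scores z and the index k of the true class.
double softmax_loss_sketch(const std::vector<double>& z, std::size_t k) {
  assert(k < z.size());
  // m is the maximum score; subtracting it keeps exp() from overflowing.
  const double m = *std::max_element(z.begin(), z.end());
  double sum = 0.0;
  for (double v : z) sum += std::exp(v - m);
  return std::log(sum) - (z[k] - m);
}
```

When all K scores are equal, the loss is log(K) regardless of the label; when the score of the true class dominates the others, the loss approaches 0.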

2.2 Backward computation

2.2.1 Derivation

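Its core formula is:

$\frac{\partial loss}{\partial z_i} = \begin{cases} a_i - 1, & i = k \\ a_i, & i \neq k \end{cases}$

where a_i is the Softmax output (the predicted probability) for class i and k is the class index given by the label.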
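
The gradient of the loss with respect to the raw scores is simply the vector of softmax probabilities with 1 subtracted at the label index, which is what the `bottom_diff[...] -= 1` statement in the Backward_cpu listing later in this article does. A minimal per-sample sketch, with the hypothetical name `softmax_loss_grad_sketch` and without ignore labels, weights or normalization:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: gradient of the softmax cross-entropy loss with
// respect to the raw scores of one sample. prob is taken by value so the
// caller's probabilities are left untouched.
std::vector<double> softmax_loss_grad_sketch(std::vector<double> prob,
                                             std::size_t label) {
  assert(label < prob.size());
  // Start from the probabilities and subtract 1 at the true class.
  prob[label] -= 1.0;
  return prob;
}
```

For probabilities (0.5, 0.25, 0.25) and label 0 the gradient is (-0.5, 0.25, 0.25); its entries sum to zero because the probabilities sum to 1.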

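One point worth noting: in the backward pass, SoftmaxWithLossLayer does not delegate part of the work to SoftmaxLayer as it does in the forward pass, but handles everything itself. SoftmaxLayer::Backward_cpu() is therefore left unused.

If training has diverged, the final accuracy is ≈ 0.1 (the model has no predictive power and is purely guessing) and loss ≈ -log(0.1) = 2.3026; these figures assume a 10-class problem. So if you see the loss hovering around 2.3, you should recognize that the network has not converged and that the parameter configuration needs adjusting. As for how to adjust it, that usually comes down to experience.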
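
The value 2.3026 is specific to 10 classes. As a small worked sketch, the loss of a classifier that assigns probability 1/K to every class can be computed for any K; the helper name `chance_level_loss` is made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: cross-entropy loss when every one of num_classes
// classes receives the same probability 1 / num_classes.
double chance_level_loss(int num_classes) {
  assert(num_classes > 0);
  return -std::log(1.0 / num_classes);
}
```

For a 1000-class problem the corresponding plateau is about 6.9078, so the loss value to watch for depends on the number of classes.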

2.3 Usage

2.3.1 Usage in Caffe

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc8"
  bottom: "label"
  top: "loss"
}

The parameters of the SoftmaxWithLoss layer in Caffe are as follows:

// Message that stores parameters shared by loss layers
message LossParameter {
  // If specified, ignore instances with the given label.
  // Instances carrying this label are ignored.
  optional int32 ignore_label = 1;
  // How to normalize the loss for loss layers that aggregate across batches,
  // spatial dimensions, or other dimensions.  Currently only implemented in
  // SoftmaxWithLoss and SigmoidCrossEntropyLoss layers.
  enum NormalizationMode {
    // Divide by the number of examples in the batch times spatial dimensions.
    // Outputs that receive the ignore label will NOT be ignored in computing
    // the normalization factor.
    // The loss of one forward pass is divided by the total number of labels.
    FULL = 0;
    // Divide by the total number of output locations that do not take the
    // ignore_label.  If ignore_label is not set, this behaves like FULL.
    // The loss of one forward pass is divided by the number of valid
    // (non-ignored) labels.
    VALID = 1;
    // Divide by the batch size.
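    // Divides by the batch size. Note that the default set below is VALID;
    // BATCH_SIZE is the default only for SigmoidCrossEntropyLoss.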
    BATCH_SIZE = 2;
    // Do not normalize the loss.
    NONE = 3;
  }
  // For historical reasons, the default normalization for
  // SigmoidCrossEntropyLoss is BATCH_SIZE and *not* VALID.
  optional NormalizationMode normalization = 3 [default = VALID];
  // Deprecated.  Ignored if normalization is specified.  If normalization
  // is not specified, then setting this to false will be equivalent to
  // normalization = BATCH_SIZE to be consistent with previous behavior.
  // If normalize == false, then normalization = BATCH_SIZE.
  // If normalize == true, then normalization = VALID.
  optional bool normalize = 2;
}

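First, let's look at the SoftmaxWithLoss header file. Note that the listing below appears to come from a customized Caffe fork: members such as hard_ratio_, class_weight_ and normalize_type_, as well as the optional third bottom blob, are not part of the stock BVLC Caffe header.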

#ifndef CAFFE_SOFTMAX_WITH_LOSS_LAYER_HPP_
#define CAFFE_SOFTMAX_WITH_LOSS_LAYER_HPP_

#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"

#include "caffe/layers/loss_layer.hpp"
#include "caffe/layers/softmax_layer.hpp"

namespace caffe {

/**
 * @brief Computes the multinomial logistic loss for a one-of-many
 *        classification task, passing real-valued predictions through a
 *        softmax to get a probability distribution over classes.
 *
 * This layer should be preferred over separate
 * SoftmaxLayer + MultinomialLogisticLossLayer
 * as its gradient computation is more numerically stable.
 * At test time, this layer can be replaced simply by a SoftmaxLayer.
 *
 * @param bottom input Blob vector (length 2)
 *   -# @f$ (N \times C \times H \times W) @f$
 *      the predictions @f$ x @f$, a Blob with values in
 *      @f$ [-\infty, +\infty] @f$ indicating the predicted score for each of
 *      the @f$ K = CHW @f$ classes. This layer maps these scores to a
 *      probability distribution over classes using the softmax function
 *      @f$ \hat{p}_{nk} = \exp(x_{nk}) /
 *      \left[\sum_{k'} \exp(x_{nk'})\right] @f$ (see SoftmaxLayer).
 *   -# @f$ (N \times 1 \times 1 \times 1) @f$
 *      the labels @f$ l @f$, an integer-valued Blob with values
 *      @f$ l_n \in [0, 1, 2, ..., K - 1] @f$
 *      indicating the correct class label among the @f$ K @f$ classes
 * @param top output Blob vector (length 1)
 *   -# @f$ (1 \times 1 \times 1 \times 1) @f$
 *      the computed cross-entropy classification loss: @f$ E =
 *        \frac{-1}{N} \sum\limits_{n=1}^N \log(\hat{p}_{n,l_n})
 *      @f$, for softmax output class probabilites @f$ \hat{p} @f$
 */
template <typename Dtype>
class SoftmaxWithLossLayer : public LossLayer<Dtype> {
 public:
   /**
    * @param param provides LossParameter loss_param, with options:
    *  - ignore_label (optional)
    *    Specify a label value that should be ignored when computing the loss.
    *  - normalize (optional, default true)
    *    If true, the loss is normalized by the number of (nonignored) labels
    *    present; otherwise the loss is simply summed over spatial locations.
    */
  explicit SoftmaxWithLossLayer(const LayerParameter& param)
      : LossLayer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "SoftmaxWithLoss"; }
  virtual inline int ExactNumBottomBlobs() const { return -1; }
  virtual inline int MinBottomBlobs() const { return 2; }
  virtual inline int MaxBottomBlobs() const { return 3; }
  virtual inline int ExactNumTopBlobs() const { return -1; }
  virtual inline int MinTopBlobs() const { return 1; }
  virtual inline int MaxTopBlobs() const { return 2; }

 protected:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  /**
   * @brief Computes the softmax loss error gradient w.r.t. the predictions.
   *
   * Gradients cannot be computed with respect to the label inputs (bottom[1]),
   * so this method ignores bottom[1] and requires !propagate_down[1], crashing
   * if propagate_down[1] is set.
   *
   * @param top output Blob vector (length 1), providing the error gradient with
   *      respect to the outputs
   *   -# @f$ (1 \times 1 \times 1 \times 1) @f$
   *      This Blob's diff will simply contain the loss_weight* @f$ \lambda @f$,
   *      as @f$ \lambda @f$ is the coefficient of this layer's output
   *      @f$\ell_i@f$ in the overall Net loss
   *      @f$ E = \lambda_i \ell_i + \mbox{other loss terms}@f$; hence
   *      @f$ \frac{\partial E}{\partial \ell_i} = \lambda_i @f$.
   *      (*Assuming that this top Blob is not used as a bottom (input) by any
   *      other layer of the Net.)
   * @param propagate_down see Layer::Backward.
   *      propagate_down[1] must be false as we can't compute gradients with
   *      respect to the labels.
   * @param bottom input Blob vector (length 2)
   *   -# @f$ (N \times C \times H \times W) @f$
   *      the predictions @f$ x @f$; Backward computes diff
   *      @f$ \frac{\partial E}{\partial x} @f$
   *   -# @f$ (N \times 1 \times 1 \times 1) @f$
   *      the labels -- ignored as we can't compute their error gradients
   */
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  /// Read the normalization mode parameter and compute the normalizer based
  /// on the blob size.  If normalization_mode is VALID, the count of valid
  /// outputs will be read from valid_count, unless it is -1 in which case
  /// all outputs are assumed to be valid.
  virtual Dtype get_normalizer(
      LossParameter_NormalizationMode normalization_mode, Dtype valid_count);

  /// The internal SoftmaxLayer used to map predictions to a distribution.
  // Declares the internal softmax layer.
  shared_ptr<Layer<Dtype> > softmax_layer_;
  /// prob stores the output probability predictions from the SoftmaxLayer.
  // Stores the probabilities output by the softmax layer.
  Blob<Dtype> prob_;
  /// bottom vector holder used in call to the underlying SoftmaxLayer::Forward
  // Bottom vector passed to the softmax layer's forward function.
  vector<Blob<Dtype>*> softmax_bottom_vec_;
  /// top vector holder used in call to the underlying SoftmaxLayer::Forward
  // Top vector passed to the softmax layer's forward function.
  vector<Blob<Dtype>*> softmax_top_vec_;
  // Whether to ignore instances with a certain label.
  // Whether a label needs to be ignored.
  bool has_ignore_label_;
  /// The label indicating that an instance should be ignored.
  int ignore_label_;
  bool has_hard_ratio_;
  float hard_ratio_;
  bool has_hard_mining_label_;
  int hard_mining_label_;
  bool has_class_weight_;
  Blob<Dtype> class_weight_;
  Blob<Dtype> counts_;
  Blob<Dtype> loss_;
  /// How to normalize the output loss.
  // The normalization mode applied to the loss.
  LossParameter_NormalizationMode normalization_;
  bool has_cutting_point_;
  Dtype cutting_point_;
  std::string normalize_type_;

  int softmax_axis_, outer_num_, inner_num_;
};

}  // namespace caffe

The function implementations:

#include <algorithm>
#include <cfloat>
#include <vector>

#include "caffe/layers/softmax_loss_layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

template <typename Dtype>
void SoftmaxWithLossLayer<Dtype>::LayerSetUp(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  LossLayer<Dtype>::LayerSetUp(bottom, top);
  normalize_type_ =
    this->layer_param_.softmax_param().normalize_type();
    // The normalization type is softmax.
  if (normalize_type_ == "Softmax") {
    LayerParameter softmax_param(this->layer_param_);
    softmax_param.set_type("Softmax");
    softmax_layer_ = LayerRegistry<Dtype>::CreateLayer(softmax_param);
    softmax_bottom_vec_.clear();
    softmax_bottom_vec_.push_back(bottom[0]);
    softmax_top_vec_.clear();
    softmax_top_vec_.push_back(&prob_);
    softmax_layer_->SetUp(softmax_bottom_vec_, softmax_top_vec_);
  }
  else if(normalize_type_ == "L2" || normalize_type_ == "L1") {
    LayerParameter normalize_param(this->layer_param_);
    normalize_param.set_type("Normalize");
    softmax_layer_ = LayerRegistry<Dtype>::CreateLayer(normalize_param);
    softmax_bottom_vec_.clear();
    softmax_bottom_vec_.push_back(bottom[0]);
    softmax_top_vec_.clear();
    softmax_top_vec_.push_back(&prob_);
    softmax_layer_->SetUp(softmax_bottom_vec_, softmax_top_vec_);
  }
  else {
    NOT_IMPLEMENTED;
  }

  has_ignore_label_ =
    this->layer_param_.loss_param().has_ignore_label();
  if (has_ignore_label_) {
    ignore_label_ = this->layer_param_.loss_param().ignore_label();
  }
  has_hard_ratio_ =
    this->layer_param_.softmax_param().has_hard_ratio();
  if (has_hard_ratio_) {
    hard_ratio_ = this->layer_param_.softmax_param().hard_ratio();
    CHECK_GE(hard_ratio_, 0);
    CHECK_LE(hard_ratio_, 1);
  }
  has_cutting_point_ =
    this->layer_param_.softmax_param().has_cutting_point();
  if (has_cutting_point_) {
    cutting_point_ = this->layer_param_.softmax_param().cutting_point();
    CHECK_GE(cutting_point_, 0);
    CHECK_LE(cutting_point_, 1);
  }
  has_hard_mining_label_ = this->layer_param_.softmax_param().has_hard_mining_label();
  if (has_hard_mining_label_) {
    hard_mining_label_ = this->layer_param_.softmax_param().hard_mining_label();
  }
  has_class_weight_ = (this->layer_param_.softmax_param().class_weight_size() != 0);
  softmax_axis_ =
    bottom[0]->CanonicalAxisIndex(this->layer_param_.softmax_param().axis());
  if (has_class_weight_) {
    class_weight_.Reshape({ bottom[0]->shape(softmax_axis_) });
    CHECK_EQ(this->layer_param_.softmax_param().class_weight().size(), bottom[0]->shape(softmax_axis_));
    for (int i = 0; i < bottom[0]->shape(softmax_axis_); i++) {
      class_weight_.mutable_cpu_data()[i] = (Dtype)this->layer_param_.softmax_param().class_weight(i);
    }
  }
  else {
    if (bottom.size() == 3) {
      class_weight_.Reshape({ bottom[0]->shape(softmax_axis_) });
      for (int i = 0; i < bottom[0]->shape(softmax_axis_); i++) {
        class_weight_.mutable_cpu_data()[i] = (Dtype)1.0;
      }
    }
  }
  if (!this->layer_param_.loss_param().has_normalization() &&
      this->layer_param_.loss_param().has_normalize()) {
    normalization_ = this->layer_param_.loss_param().normalize() ?
                     LossParameter_NormalizationMode_VALID :
                     LossParameter_NormalizationMode_BATCH_SIZE;
  } else {
    normalization_ = this->layer_param_.loss_param().normalization();
  }
}

template <typename Dtype>
void SoftmaxWithLossLayer<Dtype>::Reshape(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  LossLayer<Dtype>::Reshape(bottom, top);
  softmax_layer_->Reshape(softmax_bottom_vec_, softmax_top_vec_);
  softmax_axis_ =
      bottom[0]->CanonicalAxisIndex(this->layer_param_.softmax_param().axis());
  outer_num_ = bottom[0]->count(0, softmax_axis_);
  inner_num_ = bottom[0]->count(softmax_axis_ + 1);
  counts_.Reshape({ outer_num_, inner_num_ });
  loss_.Reshape({ outer_num_, inner_num_ });
  CHECK_EQ(outer_num_ * inner_num_, bottom[1]->count())
      << "Number of labels must match number of predictions; "
      << "e.g., if softmax axis == 1 and prediction shape is (N, C, H, W), "
      << "label count (number of labels) must be N*H*W, "
      << "with integer values in {0, 1, ..., C-1}.";
  if (bottom.size() == 3) {
    CHECK_EQ(outer_num_ * inner_num_, bottom[2]->count())
      << "Number of loss weights must match number of label.";
  }
  if (top.size() >= 2) {
    // softmax output
    top[1]->ReshapeLike(*bottom[0]);
  }
  if (has_class_weight_) {
    CHECK_EQ(class_weight_.count(), bottom[0]->shape(1));
  }
}

template <typename Dtype>
Dtype SoftmaxWithLossLayer<Dtype>::get_normalizer(
    LossParameter_NormalizationMode normalization_mode, Dtype valid_count) {
  Dtype normalizer;
  switch (normalization_mode) {
    case LossParameter_NormalizationMode_FULL:
      normalizer = Dtype(outer_num_ * inner_num_);
      break;
    case LossParameter_NormalizationMode_VALID:
      if (valid_count == -1) {
        normalizer = Dtype(outer_num_ * inner_num_);
      } else {
        normalizer = valid_count;
      }
      break;
    case LossParameter_NormalizationMode_BATCH_SIZE:
      normalizer = Dtype(outer_num_);
      break;
    case LossParameter_NormalizationMode_NONE:
      normalizer = Dtype(1);
      break;
    default:
      LOG(FATAL) << "Unknown normalization mode: "
          << LossParameter_NormalizationMode_Name(normalization_mode);
  }
  // Some users will have no labels for some examples in order to 'turn off' a
  // particular loss in a multi-task setup. The max prevents NaNs in that case.
  return std::max(Dtype(1.0), normalizer);
}
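// The forward pass mainly uses the softmax layer to output, for each sample,
// the probabilities of all classes. For example, for an input image of a dog
// it outputs the probabilities of dog, cat and monkey: [0.8, 0.1, 0.1].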
template <typename Dtype>
void SoftmaxWithLossLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  // The forward pass computes the softmax prob values.
  softmax_layer_->Forward(softmax_bottom_vec_, softmax_top_vec_);
  const Dtype* prob_data = prob_.cpu_data();
  const Dtype* label = bottom[1]->cpu_data();
  int dim = prob_.count() / outer_num_;
  Dtype count = 0;
  Dtype loss = 0;
  if (bottom.size() == 2) {
    for (int i = 0; i < outer_num_; ++i) {
      for (int j = 0; j < inner_num_; j++) {
        const int label_value = static_cast<int>(label[i * inner_num_ + j]);
        if (has_ignore_label_ && label_value == ignore_label_) {
          continue;
        }
        DCHECK_GE(label_value, 0);
        DCHECK_LT(label_value, prob_.shape(softmax_axis_));
        loss -= log(std::max(prob_data[i * dim + label_value * inner_num_ + j],
          Dtype(FLT_MIN)));
        count += 1;
      }
    }
  }
  else if(bottom.size() == 3) {
    const Dtype* weights = bottom[2]->cpu_data();
    for (int i = 0; i < outer_num_; ++i) {
      for (int j = 0; j < inner_num_; j++) {
        const int label_value = static_cast<int>(label[i * inner_num_ + j]);
        const Dtype weight_value = weights[i * inner_num_ + j] * (has_class_weight_? class_weight_.cpu_data()[label_value] : 1.0);
        if (weight_value == 0) continue;
        if (has_ignore_label_ && label_value == ignore_label_) {
          continue;
        }
        DCHECK_GE(label_value, 0);
        DCHECK_LT(label_value, prob_.shape(softmax_axis_));
        loss -= weight_value * log(std::max(prob_data[i * dim + label_value * inner_num_ + j],
          Dtype(FLT_MIN)));
        count += weight_value;
      }
    }
  }
  top[0]->mutable_cpu_data()[0] = loss / get_normalizer(normalization_, count);
  if (top.size() == 2) {
    top[1]->ShareData(prob_);
  }
}

template <typename Dtype>
void SoftmaxWithLossLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  if (propagate_down[1]) {
    LOG(FATAL) << this->type()
               << " Layer cannot backpropagate to label inputs.";
  }
  if (propagate_down[0]) {
    Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
    const Dtype* prob_data = prob_.cpu_data();
    caffe_copy(prob_.count(), prob_data, bottom_diff);
    const Dtype* label = bottom[1]->cpu_data();
    int dim = prob_.count() / outer_num_;
    Dtype count = 0;
    if (bottom.size() == 2) {
      for (int i = 0; i < outer_num_; ++i) {
        for (int j = 0; j < inner_num_; ++j) {
          const int label_value = static_cast<int>(label[i * inner_num_ + j]);
          if (has_ignore_label_ && label_value == ignore_label_) {
            for (int c = 0; c < bottom[0]->shape(softmax_axis_); ++c) {
              bottom_diff[i * dim + c * inner_num_ + j] = 0;
            }
          }
          else {
            // Implements the backward formula: subtract 1 at the label index.
            bottom_diff[i * dim + label_value * inner_num_ + j] -= 1;
            count += 1;
          }
        }
      }
    }
    else if (bottom.size() == 3) {
      const Dtype* weights = bottom[2]->cpu_data();
      for (int i = 0; i < outer_num_; ++i) {
        for (int j = 0; j < inner_num_; ++j) {
          const int label_value = static_cast<int>(label[i * inner_num_ + j]);
          const Dtype weight_value = weights[i * inner_num_ + j];
          if (has_ignore_label_ && label_value == ignore_label_) {
            for (int c = 0; c < bottom[0]->shape(softmax_axis_); ++c) {
              bottom_diff[i * dim + c * inner_num_ + j] = 0;
            }
          }
          else {
            bottom_diff[i * dim + label_value * inner_num_ + j] -= 1;
            for (int c = 0; c < bottom[0]->shape(softmax_axis_); ++c) {
              bottom_diff[i * dim + c * inner_num_ + j] *= weight_value * (has_class_weight_ ? class_weight_.cpu_data()[label_value] : 1.0);
            }
            if(weight_value != 0) count += weight_value;
          }
        }
      }
    }
    // Scale gradient
    // The scaling of the gradient is determined by the normalization mode.
    Dtype loss_weight = top[0]->cpu_diff()[0] /
                        get_normalizer(normalization_, count);
    caffe_scal(prob_.count(), loss_weight, bottom_diff);
  }
}

#ifdef CPU_ONLY
STUB_GPU(SoftmaxWithLossLayer);
#endif

INSTANTIATE_CLASS(SoftmaxWithLossLayer);
REGISTER_LAYER_CLASS(SoftmaxWithLoss);

}
