
Batch Normalization: Forward and Backward Pass

At the moment there is a wonderful course running at Stanford University, called CS231n - Convolutional Neural Networks for Visual Recognition, held by Andrej Karpathy, Justin Johnson and Fei-Fei Li. Fortunately all the course material is provided for free and all the lectures are recorded and uploaded on Youtube. This class gives a wonderful intro to machine learning/deep learning, coming along with programming assignments.

Batch Normalization

One topic that kept me quite busy for some time was the implementation of Batch Normalization, especially the backward pass. Batch Normalization is a technique to provide any layer in a Neural Network with inputs that are zero mean/unit variance - and this is basically what they like! But BatchNorm consists of one more step which makes this algorithm really powerful. Let's take a look at the BatchNorm algorithm:


Algorithm of Batch Normalization copied from the Paper by Ioffe and Szegedy mentioned above.

Look at the last line of the algorithm. After normalizing the input x, the result is squashed through a linear function with parameters gamma and beta. These are learnable parameters of the BatchNorm layer and basically make it possible to say "Hey!! I don't want zero mean/unit variance input, give me back the raw input - it's better for me." If gamma = sqrt(var(x)) and beta = mean(x), the original activation is restored. This is what makes BatchNorm really powerful. We initialize the BatchNorm parameters to transform the input to zero mean/unit variance distributions, but during training they can learn that any other distribution might be better. Anyway, I don't want to spend too much time on explaining Batch Normalization. If you want to learn more about it, the paper is very well written and here Andrej is explaining BatchNorm in class.
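To make this concrete, here is a tiny numpy check (my own sketch, not from the course material); gamma is set to sqrt(var + eps) rather than sqrt(var) so that the eps term cancels exactly:

import numpy as np

# With gamma = sqrt(var + eps) and beta = mean, BatchNorm reduces to the
# identity and hands back the raw input.
np.random.seed(0)
x = np.random.randn(4, 3)      # small batch: N=4 examples, D=3 features
eps = 1e-5

mu = x.mean(axis=0)
var = x.var(axis=0)
xhat = (x - mu) / np.sqrt(var + eps)

gamma = np.sqrt(var + eps)     # learnable scale set to the batch std
beta = mu                      # learnable shift set to the batch mean
out = gamma * xhat + beta

print(np.allclose(out, x))     # True: the original activations are restored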

Btw: it's called "Batch" Normalization because we perform this transformation and calculate the statistics only for a subpart (a batch) of the entire training set.

Backpropagation

In this blog post I don't want to give a lecture on Backpropagation and Stochastic Gradient Descent (SGD). For now I will assume that whoever reads this post has some basic understanding of these principles. For the rest, let me quote Wiki:

Backpropagation, an abbreviation for “backward propagation of errors”, is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respect to all the weights in the network. The gradient is fed to the optimization method which in turn uses it to update the weights, in an attempt to minimize the loss function.

Uff, sounds tough, eh? I will maybe write another post about this topic, but for now I want to focus on the concrete example of the backward pass through the BatchNorm-Layer.

Computational Graph of Batch Normalization Layer

I think one of the things I learned from the cs231n class that helped me most in understanding backpropagation was the explanation through computational graphs. These graphs are a good way to visualize the computational flow of fairly complex functions via small, piecewise differentiable subfunctions. For the BatchNorm-Layer it would look something like this:


Computational graph of the BatchNorm-Layer. From left to right, following the black arrows, flows the forward pass. The inputs are a matrix X and gamma and beta as vectors. From right to left, following the red arrows, flows the backward pass, which distributes the gradient from the layer above to gamma and beta and all the way back to the input.

I think for all who followed the course or who know the technique, the forward pass (black arrows) is easy and straightforward to read. From input x we calculate the mean of every dimension in the feature space and then subtract this vector of mean values from every training example. With this done, following the lower branch, we calculate the per-dimension variance and with that the entire denominator of the normalization equation. Next we invert it and multiply it with the difference of inputs and means, and we have x_normalized. The last two blobs on the right perform the squashing by multiplying with the input gamma and finally adding beta. Et voilà, we have our Batch-Normalized output.

A vanilla implementation of the forward pass might look like this:

import numpy as np

def batchnorm_forward(x, gamma, beta, eps):

  N, D = x.shape

  #step1: calculate mean
  mu = 1./N * np.sum(x, axis = 0)

  #step2: subtract the mean vector from every training example
  xmu = x - mu

  #step3: following the lower branch - calculate the denominator
  sq = xmu ** 2

  #step4: calculate variance
  var = 1./N * np.sum(sq, axis = 0)

  #step5: add eps for numerical stability, then sqrt
  sqrtvar = np.sqrt(var + eps)

  #step6: invert sqrtvar
  ivar = 1./sqrtvar

  #step7: execute normalization
  xhat = xmu * ivar

  #step8: now the first of the two transformation steps - scale by gamma
  gammax = gamma * xhat

  #step9: shift by beta
  out = gammax + beta

  #store intermediate values for the backward pass
  cache = (xhat,gamma,xmu,ivar,sqrtvar,var,eps)

  return out, cache

Note that for the exercise of the cs231n class we had to do a little more (calculate a running mean and variance as well as implement a different forward pass for training mode and test mode), but for the explanation of the backward pass this piece of code will work. In the cache variable we store some stuff that we need for computing the backward pass, as you will see now!

The power of Chain Rule for backpropagation

For all who kept on reading until now (congratulations!!), we are close to arriving at the backward pass of the BatchNorm-Layer. To fully understand the channeling of the gradient backwards through the BatchNorm-Layer you should have some basic understanding of what the chain rule is. As a little refresher, the following figure exemplifies the use of the chain rule for the backward pass in computational graphs.


The forward pass on the left calculates `z` as a function `f(x,y)` using the input variables `x` and `y` (this could literally be any function; examples are shown in the BatchNorm graph above). The right side of the figure shows the backward pass. Receiving `dL/dz`, the gradient of the loss function with respect to `z` from above, the gradients of `x` and `y` on the loss function can be calculated by applying the chain rule, as shown in the figure.

So again, we only have to multiply the local gradient of the function with the gradient from above to channel the gradient backwards. Some derivatives of basic functions are listed in the course material. If you understand that, and with some more basic knowledge in calculus, what follows is a piece of cake!
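As a concrete toy instance of the figure (my own sketch, with made-up numbers), take f(x,y) = x * y:

# Toy chain rule example in a computational graph.
# Forward: z = f(x, y) = x * y. Backward: receive dL/dz, emit dL/dx and dL/dy.
x, y = 3.0, -2.0
z = x * y          # forward pass: z = -6.0

dz = 5.0           # pretend gradient of the loss w.r.t. z, handed down from above
dx = y * dz        # local gradient dz/dx = y, times the gradient from above -> -10.0
dy = x * dz        # local gradient dz/dy = x, times the gradient from above ->  15.0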

Finally: The Backward Pass of Batch Normalization

In the comments of the code snippet above I already numbered the computational steps with consecutive numbers. The backpropagation follows these steps in reverse order, as we are literally backpassing through the computational graph. We will now take a more detailed look at every single computation of the backward pass and thereby derive, step by step, a naive algorithm for the backward pass.

Step 9


Backward pass through the last summation gate of the BatchNorm-Layer. Enclosed in brackets are the dimensions of the input/output.

Recall that the derivative of a function f = x + y with respect to either of these two variables is 1. This means that to channel a gradient through a summation gate, we only need to multiply by 1. And because the summation of beta during the forward pass is a row-wise summation, during the backward pass we need to sum up the gradient over all of its columns (take a look at the dimensions). So after the first step of backpropagation we already have the gradient for one learnable parameter: beta.
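In formulas (here and below, $d\cdot$ stands for the gradient of the loss with respect to $\cdot$, with the same names as in the code further down):

$$d\beta = \sum_{i=1}^{N} dout_{i,:}, \qquad d(\gamma\hat{x}) = dout.$$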

Step 8


Next follows the backward pass through the multiplication gate of the normalized input and the vector of gamma.

For any function f = x * y, the derivative with respect to one of the inputs is simply the other input variable. This also means that for this step of the backward pass we need the variables used in the forward pass of this gate (luckily stored in the cache of the function above). So again we get the gradients of the two inputs of this gate by applying the chain rule (= multiplying the local gradient with the gradient from above). For gamma, as for beta in step 9, we need to sum up the gradients over dimension N, because the multiplication was again row-wise. So we now have the gradient for the second learnable parameter of the BatchNorm-Layer, gamma, and "only" need to backprop the gradient to the input x, so that we can then backpropagate the gradient to any layer further downwards.
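In formulas, where $\odot$ denotes the element-wise product (with broadcasting over rows):

$$d\gamma = \sum_{i=1}^{N} dout_{i,:} \odot \hat{x}_{i,:}, \qquad d\hat{x} = dout \odot \gamma.$$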

Step 7


This step during the forward pass was the final step of the normalization, combining the two branches (numerator and denominator) of the computational graph. During the backward pass we will calculate the gradients that flow separately back through these two branches.

It's basically the exact same operation as in step 8, so let's not waste much time and continue. The two needed variables xmu and ivar for this step are also stored in the cache variable we pass to the backprop function. (And again: this is one of the main advantages of computational graphs - splitting complex functions into a handful of simple basic operations. Like this you get a lot of repetition!)
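With $\hat{x} = xmu \odot ivar$ in the forward pass, the two outgoing gradients of this multiplication gate (divar and dxmu1 in the code below) are

$$d(ivar) = \sum_{i=1}^{N} d\hat{x}_{i,:} \odot xmu_{i,:}, \qquad dxmu^{(1)} = d\hat{x} \odot ivar.$$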

Step 6


This is a "one input-one output" node where, during the forward pass, we inverted the input (square root of the variance).

The local gradient is visualized in the image and should not be hard to derive by hand. Multiplying it by the gradient from above is what we channel to the next step. sqrtvar is also one of the variables passed in cache.
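Written out, since $ivar = 1/sqrtvar$ in the forward pass, the local gradient is $-1/sqrtvar^2$ and therefore

$$d(sqrtvar) = -\frac{1}{sqrtvar^{2}} \odot d(ivar).$$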

Step 5


Again "one input-one output". This node calculates during the forward pass the denominator of the normalization.

The derivation of the local gradient is little magic and should need no explanation. var and eps are also passed in the cache. No more words to lose!
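Written out, since $sqrtvar = \sqrt{var + \epsilon}$ in the forward pass, the local gradient is $\frac{1}{2}(var + \epsilon)^{-1/2}$ and therefore

$$dvar = \frac{1}{2}\,(var + \epsilon)^{-1/2} \odot d(sqrtvar).$$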

Step 4


Also a "one input-one output" node. During the forward pass the output of this node is the variance of each feature `d for d in [1...D]`.

The derivation of this step's local gradient might look unclear at the very first glance. But it's not that hard in the end. Let's recall that a normal summation gate (see step 9) during the backward pass only transfers the gradient unchanged and evenly to its inputs. With that in mind, it should not be hard to conclude that a column-wise summation during the forward pass means that during the backward pass we evenly distribute the gradient over all rows for each column. And not much more is done here. We create a matrix of ones with the same shape as the input sq of the forward pass, divide it element-wise by the number of rows (that's the local gradient) and multiply it by the gradient from above.
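Written out, since $var = \frac{1}{N}\sum_{i=1}^{N} sq_{i,:}$ in the forward pass:

$$dsq = \frac{1}{N}\,\mathbf{1}_{N\times D} \odot dvar,$$

where $\mathbf{1}_{N\times D}$ is a matrix of ones and the row vector $dvar$ is broadcast over the rows.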

Step 3


This node outputs the square of its input, which during the forward pass was a matrix containing the input `x` subtracted by the per-feature `mean`.

I think for all who followed until here, there is not much to explain for the derivation of the local gradient.
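Written out, since $sq = xmu^{2}$ element-wise in the forward pass, the second branch gradient of xmu (dxmu2 in the code below) is

$$dxmu^{(2)} = 2\, xmu \odot dsq.$$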

Step 2


Now this looks like a more fun gate! Two inputs, two outputs! During the forward pass this node subtracts the per-feature mean row-wise from each training example `n for n in [1...N]`.

Okay, let's see. One of the definitions of backpropagation and computational graphs is that whenever two gradients arrive at one node, we simply add them up. Knowing this, the rest is little magic, as the local gradient for a subtraction is as easy to derive as for a summation. Note that for mu we have to sum up the gradients over the dimension N (as we did before for gamma and beta).
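Written out, since $xmu = x - \mu$ in the forward pass and the two incoming branch gradients come from steps 7 and 3 (dx1 and dmu in the code below):

$$dx^{(1)} = dxmu^{(1)} + dxmu^{(2)}, \qquad d\mu = -\sum_{i=1}^{N}\left(dxmu^{(1)} + dxmu^{(2)}\right)_{i,:}.$$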

Step 1


The function of this node is exactly the same as of step 4. Only that during the forward pass the input was `x` - the input to the BatchNorm-Layer and the output here is `mu`, a vector that contains the mean of each feature.

As this node executes the exact same operation as the one explained in step 4, the backpropagation of the gradient looks the same as well. So let's continue to the last step.
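Written out, since $\mu = \frac{1}{N}\sum_{i=1}^{N} x_{i,:}$ in the forward pass, the second branch gradient of x (dx2 in the code below) is

$$dx^{(2)} = \frac{1}{N}\,\mathbf{1}_{N\times D} \odot d\mu.$$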

Step 0 - Arriving at the Input

I only added this image to visualize once more that at the very end we need to sum up the gradients dx1 and dx2 to get the final gradient dx. This matrix contains the gradient of the loss function with respect to the input of the BatchNorm-Layer. This gradient dx is also what we give as input to the backward pass of the next layer, just as for this layer we received dout from the layer above.

Naive implementation of the backward pass through the BatchNorm-Layer

Putting together every single step, the naive implementation of the backward pass might look something like this:

def batchnorm_backward(dout, cache):

  #unfold the variables stored in cache
  xhat,gamma,xmu,ivar,sqrtvar,var,eps = cache

  #get the dimensions of the input/output
  N,D = dout.shape

  #step9
  dbeta = np.sum(dout, axis=0)
  dgammax = dout #not necessary, but more understandable

  #step8
  dgamma = np.sum(dgammax*xhat, axis=0)
  dxhat = dgammax * gamma

  #step7
  divar = np.sum(dxhat*xmu, axis=0)
  dxmu1 = dxhat * ivar

  #step6
  dsqrtvar = -1. /(sqrtvar**2) * divar

  #step5
  dvar = 0.5 * 1. /np.sqrt(var+eps) * dsqrtvar

  #step4
  dsq = 1. /N * np.ones((N,D)) * dvar

  #step3
  dxmu2 = 2 * xmu * dsq

  #step2
  dx1 = (dxmu1 + dxmu2)
  dmu = -1 * np.sum(dxmu1+dxmu2, axis=0)

  #step1
  dx2 = 1. /N * np.ones((N,D)) * dmu

  #step0
  dx = dx1 + dx2

  return dx, dgamma, dbeta

Note: This is the naive implementation of the backward pass. There exists an alternative implementation, which is even a bit faster, but I personally found the naive implementation way better for the purpose of understanding backpropagation through the BatchNorm-Layer. This well written blog post gives a more detailed derivation of the alternative (faster) implementation. However, there is much more calculus involved. But once you have understood the naive implementation, it should not be too hard to follow.
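For reference, here is a sketch of what such a compact backward pass could look like (my own condensation of the nine steps above, not taken from the linked post; it reuses the same cache as batchnorm_forward):

def batchnorm_backward_alt(dout, cache):
  #unfold the cache and dimensions, exactly as in the naive version
  xhat, gamma, xmu, ivar, sqrtvar, var, eps = cache
  N, D = dout.shape

  dbeta = np.sum(dout, axis=0)
  dgamma = np.sum(dout * xhat, axis=0)

  #gradient w.r.t. xhat, then the steps 7-1 collapsed into one expression
  dxhat = dout * gamma
  dx = 1. / N * ivar * (N * dxhat - np.sum(dxhat, axis=0)
                        - xhat * np.sum(dxhat * xhat, axis=0))

  return dx, dgamma, dbeta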

Some final words

First of all I would like to thank the team of the cs231n class, who generously make all the material freely available. This gives people like me the possibility to take part in high-class courses and learn a lot about deep learning in self-study. (Secondly, it motivated me to write my first blog post!)

And as we have already passed the deadline for the second assignment, I might upload my code to GitHub during the next few days.

What does the gradient flowing through batch normalization look like?

This past week, I have been working on the assignments from the Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition. In particular, I spent a few hours deriving a correct expression to backpropagate the batchnorm regularization (Assignment 2 - Batch Normalization). While this post is mainly for me not to forget about what insights I have gained in solving this problem, I hope it could be useful to others that are struggling with backpropagation.

Batch normalization

Batch normalization is a recent idea introduced by Ioffe et al, 2015 to ease the training of large neural networks. The idea behind it is that neural networks tend to learn better when their input features are uncorrelated with zero mean and unit variance. As each layer within a neural network sees the activations of the previous layer as inputs, the same idea could be applied to each layer. Batch normalization does exactly that by normalizing the activations over the current batch in each hidden layer, generally right before the non-linearity.

To be more specific, for a given input batch $x$ of size $(N,D)$ going through a hidden layer of size $H$, some weights $w$ of size $(D,H)$ and a bias $b$ of size $(H)$, the common layer structure with batch norm looks like the following (a small numpy sketch follows the list):

  1. Affine transformation

     $h = XW + b$,

     where $h$ contains the results of the linear transformation (size $(N,H)$).

  2. Batch normalization

     $y = \gamma \hat{h} + \beta$,

     where $\gamma$ and $\beta$ are learnable parameters and

     $\hat{h} = (h - \mu)(\sigma^2 + \epsilon)^{-1/2}$

     contains the zero mean and unit variance version of $h$ (size $(N,H)$). Indeed, the parameters $\mu$ (size $(H)$) and $\sigma^2$ (size $(H)$) are the respective mean and variance of each activation over the full batch (of size $N$). Note that this expression implicitly assumes broadcasting, as $h$ is of size $(N,H)$ and both $\mu$ and $\sigma$ have size $(H)$. A more correct expression would be

     $\hat{h}_{kl} = (h_{kl} - \mu_l)(\sigma_l^2 + \epsilon)^{-1/2}$,

     where

     $\mu_l = \frac{1}{N}\sum_p h_{pl}, \qquad \sigma_l^2 = \frac{1}{N}\sum_p (h_{pl} - \mu_l)^2$,

     with $k = 1,\dots,N$ and $l = 1,\dots,H$.

  3. Non-linearity

     $a = f(y)$,

     where the non-linearity $f$ now sees a zero mean and unit variance input and $a$ contains the activations (size $(N,H)$).
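Putting the three pieces together in numpy (a sketch of my own, with made-up shapes and ReLU chosen as the non-linearity):

import numpy as np

np.random.seed(1)
N, D, H = 8, 10, 6                 # batch size, input dim, hidden dim (made up)
X = np.random.randn(N, D)
W = np.random.randn(D, H)
b = np.zeros(H)
gamma, beta = np.ones(H), np.zeros(H)
eps = 1e-5

h = X.dot(W) + b                   # 1. affine transformation, shape (N, H)
mu = h.mean(axis=0)                # per-feature mean over the batch, shape (H,)
var = h.var(axis=0)                # per-feature variance over the batch, shape (H,)
h_hat = (h - mu) / np.sqrt(var + eps)
y = gamma * h_hat + beta           # 2. batch normalization
a = np.maximum(0, y)               # 3. non-linearity (ReLU), shape (N, H)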
