
DL-1: Tips for Training Deep Neural Network

Different problems call for different approaches.

E.g., dropout is aimed at improving results on testing data, not training data.

Choosing proper loss

  • Square Error: $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
  • Cross Entropy: $-\sum_{i=1}^{n}\hat{y}_i \ln y_i$
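A minimal Keras sketch of this choice (the 784→500→10 architecture is only a placeholder): the loss is picked at compile time, and for a softmax classifier cross entropy usually trains much better than square error.

```python
# Minimal sketch (TensorFlow/Keras API; layer sizes are illustrative).
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(500, activation="sigmoid", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(loss="categorical_crossentropy",   # vs. "mean_squared_error"
              optimizer="sgd",
              metrics=["accuracy"])
```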

Mini-batch

We do not really minimize the total loss: each mini-batch update only minimizes the loss on that batch.

batch_size: the number of training examples processed per batch;
nb_epoch: the number of passes over the whole training set.
The total number of training examples seen stays the same.

Mini-batch is faster: it performs many more parameter updates in the same amount of training time. (A smaller batch is not always proportionally cheaper per update with parallel computing.)

Mini-batch has better performance!

Shuffle the training examples for each epoch. This is the default of Keras.
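A minimal sketch of mini-batch training with Keras, reusing the `model`, `x_train`, and `y_train` names from the sketch above (the modern API calls the epoch argument `epochs` rather than the older `nb_epoch`):

```python
# batch_size: training examples processed per gradient update.
# epochs: passes over the whole training set.
# shuffle=True (the Keras default) reshuffles the examples every epoch.
model.fit(x_train, y_train,
          batch_size=100,
          epochs=20,
          shuffle=True)
```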

New activation function

Q: Vanishing Gradient Problem

  • Layers closer to the input have smaller gradients
    • They learn very slowly
    • Their weights stay almost random
  • Layers closer to the output have larger gradients
    • They learn very fast
    • They have already converged (based on the nearly random lower layers)

2006: RBM pre-training → 2015: ReLU

ReLU: Rectified Linear Unit
1. Fast to compute
2. Biological reason
3. Equivalent to an infinite number of sigmoids with different biases
4. Can handle the vanishing gradient problem

With ReLU, the active neurons form a thinner, linear network, so the gradients do not become smaller as they propagate backward.
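A small NumPy sketch of ReLU and its gradient: the gradient is either 1 (the neuron is active and behaves linearly) or 0 (the neuron is removed), so the remaining thinner network does not shrink the gradient.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z)."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """ReLU gradient: 1 for active neurons, 0 for dropped ones."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
```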

(Figures: ReLU)

Adaptive Learning Rate

Set the learning rate η carefully.

  • If learning rate is too large, total loss may not decrease after each update.
  • If learning rate is too small, training would be too slow.

Solution:

  • Popular & Simple Idea: Reduce the learning rate by some factor every few epochs.
    • At the beginning, use larger learning rate
    • After several epochs, reduce the learning rate, e.g. 1/t decay: $\eta^t = \eta/\sqrt{t+1}$
  • Learning rate cannot be one-size-fits-all.
    • Giving different parameters different learning rates
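A quick sketch of the 1/t decay mentioned above (plain Python; the starting rate 0.1 is only an example):

```python
import math

def decayed_lr(eta0, t):
    """1/t decay: eta_t = eta0 / sqrt(t + 1), where t is the epoch index."""
    return eta0 / math.sqrt(t + 1)

print([round(decayed_lr(0.1, t), 4) for t in range(5)])
# [0.1, 0.0707, 0.0577, 0.05, 0.0447]
```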

Adagrad: $w \leftarrow w - \eta_w \dfrac{\partial L}{\partial w}$
$\eta_w$: parameter-dependent learning rate

$\eta_w = \dfrac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}$

$\eta$: constant
$g^i$: the value of $\partial L/\partial w$ obtained at the i-th update

Summation of the square of the previous derivatives.

Observation:
1. Learning rate is smaller and smaller for all parameters.
2. Smaller derivatives, larger learning rate, and vice versa.
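A NumPy sketch of the Adagrad update described above; `grad_fn` is a hypothetical function returning ∂L/∂w, and the quadratic example is illustrative:

```python
import numpy as np

def adagrad_update(w, grad_fn, eta=1.0, steps=200, eps=1e-8):
    """Adagrad: w <- w - eta / sqrt(sum of past g^2) * g."""
    g_sq_sum = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)                 # dL/dw at this update
        g_sq_sum += g ** 2             # accumulate squared past derivatives
        w = w - eta / np.sqrt(g_sq_sum + eps) * g
    return w

# Example: minimize L(w) = sum(w^2), whose gradient is 2w.
w = adagrad_update(np.array([3.0, -2.0]), grad_fn=lambda w: 2 * w)
print(w)  # both components have shrunk toward 0
```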

  • Adagrad [John Duchi, JMLR’11]
  • RMSprop
    https://www.youtube.com/watch?v=O3sxAc4hxZU
  • Adadelta [Matthew D. Zeiler, arXiv’12]
  • “No more pesky learning rates” [Tom Schaul, arXiv’12]
  • AdaSecant [Caglar Gulcehre, arXiv’14]
  • Adam [Diederik P. Kingma, ICLR’15]
  • Nadam
    http://cs229.stanford.edu/proj2015/054_report.pdf

Momentum

(Figures: Momentum)
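The figures illustrate momentum; as a hedged sketch of the classical update (not necessarily the exact formulation in the slides), the "movement" combines the previous movement with the current gradient, which helps push through plateaus and small local minima:

```python
import numpy as np

def momentum_update(w, grad_fn, eta=0.01, lam=0.9, steps=200):
    """Classical momentum: movement = lam * previous movement - eta * gradient."""
    v = np.zeros_like(w)               # accumulated movement
    for _ in range(steps):
        g = grad_fn(w)
        v = lam * v - eta * g          # keep part of the previous direction
        w = w + v
    return w

# Example: minimize L(w) = sum(w^2), whose gradient is 2w.
print(momentum_update(np.array([3.0, -2.0]), grad_fn=lambda w: 2 * w))
```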

Overfitting

  • Learning target is defined by the training data.

  • Training data and testing data can be different.

  • The parameters achieving the learning target do not necessarily give good results on the testing data.

  • Panacea for Overfitting

    • Have more training data
    • Create more training data (e.g., by shifting or rotating existing images)

Early Stopping
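Early stopping halts training when performance on a held-out validation set stops improving, instead of driving the training loss all the way down. A minimal Keras sketch (the `patience` value and validation split are illustrative), reusing `model`, `x_train`, and `y_train` from above:

```python
from tensorflow import keras

# Stop when the validation loss has not improved for 3 consecutive epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)

model.fit(x_train, y_train,
          validation_split=0.1,   # hold out 10% of the training data
          batch_size=100,
          epochs=100,
          callbacks=[early_stop])
```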

Regularization

Weight decay is one kind of regularization.
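A minimal sketch of weight decay as L2 regularization on a Keras layer (the coefficient 0.01 is illustrative):

```python
from tensorflow import keras

# L2 weight decay: adds 0.01 * sum(w^2) to the loss for this layer,
# which keeps the weights small (closer to zero) during training.
layer = keras.layers.Dense(
    500,
    activation="relu",
    kernel_regularizer=keras.regularizers.l2(0.01),
)
```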

Dropout

Training

  • Each time before updating the parameters
    1. Each neuron has a p% chance to drop out
      The structure of the network is changed.
    2. Using the new network for training
      For each mini-batch, we resample the dropout neurons.

Testing

**No dropout**
  • If the dropout rate at training is p%, multiply all the weights by (1 − p)% for testing.
  • Assume that the dropout rate is 50%.
    If a weight is w = 1 after training, set w = 0.5 for testing.
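A NumPy sketch of the scheme above: drop each neuron with probability p during training, and at testing keep every neuron but scale by (1 − p) (scaling the activations is equivalent to scaling the weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    """Training: each activation is dropped (set to 0) with probability p."""
    keep = rng.random(a.shape) >= p
    return a * keep

def dropout_test(a, p=0.5):
    """Testing: no dropout; multiply by (1 - p) instead.
    With p = 0.5, a unit that outputs 1 effectively contributes 0.5."""
    return a * (1.0 - p)

a = np.ones(10)
print(dropout_train(a))  # roughly half the entries are zeroed
print(dropout_test(a))   # every entry is scaled to 0.5
```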

Dropout - Intuitive Reason

  • When working in a team, if everyone expects their partners to do the work, nothing gets done in the end.
  • However, if you know your partner will drop out, you will do better.
  • When testing, no one actually drops out, so the results end up being even better.

Dropout is a kind of ensemble

(Figures: dropout as an ensemble)

Network Structure

CNN is a very good example!
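A minimal Keras sketch of the idea (layer choices are illustrative): a CNN bakes prior knowledge about images (local connectivity, weight sharing) into the network structure itself.

```python
from tensorflow import keras

cnn = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
```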
