
Improving Deep Neural Networks, Week 1 - Initialization

Initialization

Welcome to the first assignment of “Improving Deep Neural Networks”.

Training your neural network requires specifying an initial value of the weights. A well-chosen initialization method will help learning.

If you completed the previous course of this specialization, you probably followed our instructions for weight initialization, and it has worked out so far. But how do you choose the initialization for a new neural network? In this notebook, you will see how different initializations lead to different results.

A well-chosen initialization can:
- Speed up the convergence of gradient descent
- Increase the odds of gradient descent converging to a lower training (and generalization) error

To get started, run the following cell to load the packages and the planar dataset you will try to classify.

import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
from init_utils import sigmoid, relu, compute_loss, forward_propagation, backward_propagation
from init_utils import update_parameters, predict, load_dataset, plot_decision_boundary, predict_dec
#%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0)  # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# load image dataset: blue/red dots in circles
train_X, train_Y, test_X, test_Y = load_dataset()
plt.show()

(Figure: the training data - blue and red dots arranged in circles)

1 - Neural Network model

You will use a 3-layer neural network (already implemented for you). Here are the initialization methods you will experiment with:
- Zeros initialization – setting initialization = "zeros" in the input argument.
- Random initialization – setting initialization = "random" in the input argument. This initializes the weights to large random values.
- He initialization – setting initialization = "he" in the input argument. This initializes the weights to random values scaled according to a paper by He et al., 2015.

Instructions: Please quickly read over the code below, and run it. In the next part you will implement the three initialization methods that this model() calls.
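The model() code itself is not reproduced in this post. The following is a condensed sketch of what it does, assuming the helpers imported above (forward_propagation, compute_loss, backward_propagation, update_parameters) and the three initializers implemented below behave as in the course's init_utils; treat it as an outline rather than the exact notebook cell.

def model(X, Y, learning_rate=0.01, num_iterations=15000, print_cost=True, initialization="he"):
    """3-layer network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID."""
    costs = []
    layers_dims = [X.shape[0], 10, 5, 1]

    # pick one of the three initialization methods implemented in this assignment
    if initialization == "zeros":
        parameters = initialize_parameters_zeros(layers_dims)
    elif initialization == "random":
        parameters = initialize_parameters_random(layers_dims)
    elif initialization == "he":
        parameters = initialize_parameters_he(layers_dims)

    for i in range(num_iterations):
        a3, cache = forward_propagation(X, parameters)    # forward pass
        cost = compute_loss(a3, Y)                        # cross-entropy loss
        grads = backward_propagation(X, Y, cache)         # backward pass
        parameters = update_parameters(parameters, grads, learning_rate)
        if print_cost and i % 1000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
            costs.append(cost)

    # plot the learning curve
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (per thousands)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters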

2 - Zero initialization

Exercise: Implement the following function to initialize all parameters to zeros. You'll see later that this does not work well since it fails to "break symmetry", but let's try it anyway and see what happens. Use np.zeros((..,..)) with the correct shapes.

def initialize_parameters_zeros(layers_dims):
    """
    Arguments:
    layers_dims -- python array (list) containing the size of each layer.
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """
    parameters = {}
    L = len(layers_dims)            # number of layers in the network

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###

    return parameters
if __name__ == '__main__':
    parameters = initialize_parameters_zeros([3, 2, 1])
    print("W1 = " + str(parameters["W1"]))
    print("b1 = " + str(parameters["b1"]))
    print("W2 = " + str(parameters["W2"]))
    print("b2 = " + str(parameters["b2"]))

Result:

W1 = [[0. 0. 0.]
      [0. 0. 0.]]
b1 = [[0.]
      [0.]]
W2 = [[0. 0.]]
b2 = [[0.]]

Run the following code to train your model on 15,000 iterations using zeros initialization.

parameters = model(train_X, train_Y, initialization="zeros")
print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

Result:

Cost after iteration 0: 0.6931471805599453
Cost after iteration 1000: 0.6931471805599453
Cost after iteration 2000: 0.6931471805599453
Cost after iteration 3000: 0.6931471805599453
Cost after iteration 4000: 0.6931471805599453
Cost after iteration 5000: 0.6931471805599453
Cost after iteration 6000: 0.6931471805599453
Cost after iteration 7000: 0.6931471805599453
Cost after iteration 8000: 0.6931471805599453
Cost after iteration 9000: 0.6931471805599453
Cost after iteration 10000: 0.6931471805599455
Cost after iteration 11000: 0.6931471805599453
Cost after iteration 12000: 0.6931471805599453
Cost after iteration 13000: 0.6931471805599453
Cost after iteration 14000: 0.6931471805599453
On the train set:
Accuracy: 0.5
On the test set:
Accuracy: 0.5

The performance is really bad: the cost does not really decrease, and the algorithm performs no better than random guessing. Why? Let's look at the details of the predictions and the decision boundary:

print("predictions_train = " + str(predictions_train))
print("predictions_test = " + str(predictions_test))

Result:

predictions_train = [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      0 0 0 0 0 0 0 0 0 0 0 0]]
predictions_test = [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
axes = plt.gca()
axes.set_xlim([-1.5, 1.5])
axes.set_ylim([-1.5, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, np.squeeze(train_Y))

(Figure: decision boundary of the model with zeros initialization)

The model is predicting 0 for every example.

In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, and you might as well be training a neural network with n[l]=1 for every layer; the network is no more powerful than a linear classifier such as logistic regression.

What you should remember:
- The weights W[l] should be initialized randomly to break symmetry.
- It is, however, okay to initialize the biases b[l] to zeros. Symmetry is still broken so long as W[l] is initialized randomly.
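To make the symmetry argument concrete, here is a minimal standalone numpy sketch (a hypothetical one-hidden-layer network, not the assignment's code): with all weights at zero, every hidden unit computes the same activation and receives the same gradient, so no update can ever make the units differ.

import numpy as np

np.random.seed(0)
X = np.random.randn(3, 5)                        # 3 input features, 5 examples
Y = (np.random.rand(1, 5) > 0.5).astype(float)   # binary labels

W1 = np.zeros((4, 3)); b1 = np.zeros((4, 1))     # 4 hidden units, all weights zero
W2 = np.zeros((1, 4)); b2 = np.zeros((1, 1))

# forward pass: ReLU hidden layer, sigmoid output
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)
Z2 = W2 @ A1 + b2
A2 = 1 / (1 + np.exp(-Z2))

# backward pass for binary cross-entropy
m = X.shape[1]
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
dZ1 = (W2.T @ dZ2) * (Z1 > 0)
dW1 = dZ1 @ X.T / m

print(np.allclose(A1, A1[0]))    # True: every hidden unit outputs the same activations
print(np.allclose(dW1, dW1[0]))  # True: every hidden unit gets the same weight gradient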

3 - Random initialization

To break symmetry, let's initialize the weights randomly. Following random initialization, each neuron can then proceed to learn a different function of its inputs. In this exercise, you will see what happens if the weights are initialized randomly, but to very large values.

Exercise: Implement the following function to initialize your weights to large random values (scaled by *10) and your biases to zeros. Use np.random.randn(..,..) * 10 for the weights and np.zeros((.., ..)) for the biases. We are using a fixed np.random.seed(..) to make sure your "random" weights match ours, so don't worry if running your code several times always gives you the same initial values for the parameters.

def initialize_parameters_random(layers_dims):
    """
    Arguments:
    layers_dims -- python array (list) containing the size of each layer.
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """
    np.random.seed(3)               # This seed makes sure your "random" numbers will be the same as ours
    parameters = {}
    L = len(layers_dims)            # integer representing the number of layers

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###

    return parameters
if __name__ == '__main__':
    parameters = initialize_parameters_random([3, 2, 1])
    print("W1 = " + str(parameters["W1"]))
    print("b1 = " + str(parameters["b1"]))
    print("W2 = " + str(parameters["W2"]))
    print("b2 = " + str(parameters["b2"]))

Result:

W1 = [[ 17.88628473   4.36509851   0.96497468]
      [-18.63492703  -2.77388203  -3.54758979]]
b1 = [[0.]
      [0.]]
W2 = [[-0.82741481 -6.27000677]]
b2 = [[0.]]

Run the following code to train your model on 15,000 iterations using random initialization.

parameters = model(train_X, train_Y, initialization="random")
print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

Result:

Cost after iteration 0: inf
Cost after iteration 1000: 0.6247924745506072
Cost after iteration 2000: 0.5980258056061102
Cost after iteration 3000: 0.5637539062842213
Cost after iteration 4000: 0.5501256393526495
Cost after iteration 5000: 0.5443826306793814
Cost after iteration 6000: 0.5373895855049121
Cost after iteration 7000: 0.47157999220550006
Cost after iteration 8000: 0.39770475516243037
Cost after iteration 9000: 0.3934560146692851
Cost after iteration 10000: 0.3920227137490125
Cost after iteration 11000: 0.38913700035966736
Cost after iteration 12000: 0.3861358766546214
Cost after iteration 13000: 0.38497629552893475
Cost after iteration 14000: 0.38276694641706693


(Figure: cost curve during training with large random initialization)

On the train set:
Accuracy: 0.83
On the test set:
Accuracy: 0.86

If you see "inf" as the cost after iteration 0, this is because of numerical roundoff; a more numerically sophisticated implementation would fix this. But it isn't worth worrying about for our purposes.
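If you did want to guard against that overflow, a common fix is to clip the sigmoid output away from exactly 0 and 1 before taking the log. This is only an illustrative sketch (compute_loss_stable and the eps value are my own, not part of the course's init_utils):

import numpy as np

def compute_loss_stable(a3, Y, eps=1e-12):
    """Cross-entropy loss with activations clipped so log() never sees 0."""
    m = Y.shape[1]
    a3 = np.clip(a3, eps, 1 - eps)   # keep a3 strictly inside (0, 1)
    logprobs = Y * np.log(a3) + (1 - Y) * np.log(1 - a3)
    return -np.sum(logprobs) / m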

Anyway, it looks like you have broken symmetry, and this gives better results than before. The model is no longer outputting all 0s.

[[1 0 1 1 0 0 1 1 1 1 1 0 1 0 0 1 0 1 1 0 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 1
  1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 1 1 1 1 0
  0 0 0 0 1 0 1 0 1 1 1 0 0 1 1 1 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 1 0 1 1 0
  1 0 1 1 0 0 1 0 0 1 1 0 1 1 1 0 1 0 0 1 0 1 1 1 1 1 1 1 0 1 1 0 0 1 1 0
  0 0 1 0 1 0 1 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 1 1 0 1 0 1 1 1 1 0 1 1 1
  1 0 1 0 1 0 1 1 1 1 0 1 1 0 1 1 0 1 1 0 1 0 1 1 1 0 1 1 1 0 1 0 1 0 0 1
  0 1 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 0 1 0 0 1 1 0 1 1 1 0 0 0 1 1 0 1 1
  1 1 0 1 1 0 1 1 1 0 0 1 0 0 0 1 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 1 1 1
  1 1 1 1 0 0 0 1 1 1 1 0]]
[[1 1 1 1 0 1 0 1 1 0 1 1 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 1 1 1 1 0 0 0 0 1
  0 1 1 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0
  1 1 1 1 1 0 1 0 0 1 0 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0]]
plt.title("Model with large random initialization")
axes = plt.gca()
axes.set_xlim([-1.5, 1.5])
axes.set_ylim([-1.5, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, np.squeeze(train_Y))

Observations

- The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets such an example wrong it incurs a very high loss for that example. Indeed, when log(a[3]) = log(0), the loss goes to infinity. (See the short numerical sketch after this list.)
- Poor initialization can lead to vanishing/exploding gradients, which also slows down the optimization algorithm.
- If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.
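As a rough numerical illustration of the first observation (a standalone sketch, not part of the assignment): once the pre-activation reaches the magnitudes that *10 weights easily produce, the sigmoid saturates, and the cross-entropy loss for a confidently wrong prediction grows roughly linearly with the pre-activation.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for z in (0.5, 15.0):                  # modest pre-activation vs. one driven by *10 weights
    a = sigmoid(z)                     # predicted probability of class 1
    loss_if_wrong = -np.log(1 - a)     # cross-entropy when the true label is actually 0
    print(f"z = {z:5.1f}   a = {a:.7f}   loss if the label is 0: {loss_if_wrong:.2f}")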

In summary:
- Initializing weights to very large random values does not work well.
- Hopefully initializing with small random values does better. The important question is: how small should these random values be? Let's find out in the next part!

4 - He initialization

Finally, try "He initialization"; this is named for the first author of He et al., 2015. (If you have heard of "Xavier initialization", this is similar, except Xavier initialization uses a scaling factor of sqrt(1/layers_dims[l-1]) for the weights W[l], whereas He initialization uses sqrt(2/layers_dims[l-1]).)
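The exercise code for this part is missing from this post; below is a sketch of initialize_parameters_he in the same style as the two initializers above, using the sqrt(2/layers_dims[l-1]) scaling described in He et al., 2015. In the original assignment this initialization brings the cost down quickly and reaches noticeably higher train and test accuracy than the large random initialization above.

def initialize_parameters_he(layers_dims):
    """
    He initialization: weights drawn from randn and scaled by
    sqrt(2 / layers_dims[l-1]); biases initialized to zeros.
    """
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims)            # number of layers in the network

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))

    return parameters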
