Andrew Ng Deep Learning, Course 2, Week 2 Assignment 3: Optimization Algorithms

I. deeplearning-assignment

So far, the previous exercises have always used gradient descent to update the parameters and minimize the cost function. In this assignment you will learn more advanced optimization methods that speed up learning and may even reach a better final value. A good optimization algorithm can get you a result in a few hours instead of making you wait for days.

1. Gradient Descent

Gradient descent is a basic optimization method in machine learning; when every update step uses all m training examples, it is also called batch gradient descent.

Implement the gradient descent update rule, for l = 1, ..., L:
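
Written out (this is exactly what update_parameters_with_gd implements in the code below, with α the learning rate), the rule for each layer l is:

$$W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}, \qquad b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$$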

2. Mini-Batch Gradient Descent

Now let's learn how to build mini-batches from the training set (X, Y).

Shuffling: as shown below, randomly shuffle the training set (X, Y). Each column of X and Y represents one training example. Note that X and Y are shuffled synchronously, so that after shuffling the i-th example in X is still paired with the i-th label in Y. Shuffling ensures that the examples are split into the different mini-batches at random.

Partitioning: split the shuffled (X, Y) into partitions of a fixed size (mini_batch_size, here 64). Note that the total number of training examples is not necessarily divisible by 64, so the last mini-batch may be smaller than 64, as shown below.
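
As a quick check of the arithmetic: with m = 148 examples and mini_batch_size = 64, partitioning yields ⌊148/64⌋ = 2 complete mini-batches of 64 examples each, plus one final mini-batch of 148 − 2 × 64 = 20 examples.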

Implement random_mini_batches. The shuffling code is already provided; for the partitioning step, the following code shows how to index a particular mini-batch, for example the first and second ones:

first_mini_batch_X = shuffled_X[:, 0 : mini_batch_size]
second_mini_batch_X = shuffled_X[:, mini_batch_size : 2 * mini_batch_size]
...

What you should remember:

  • Shuffling and Partitioning are the two steps required to build mini-batches.
  • The mini-batch size is usually chosen as a power of 2, such as 16, 32, 64, or 128.

3. Momentum

Because mini-batch gradient descent updates the parameters using only a subset of the full training set, the direction of each update varies somewhat, and the optimization oscillates on its way toward convergence. Using momentum reduces these oscillations.

Momentum takes past gradients into account to smooth out the updates. We store the "direction" of previous gradients in a variable v. Formally, v is an exponentially weighted average of the previous gradients. You can picture a ball rolling downhill: v is its "velocity", and the velocity (and momentum) builds up according to the slope and direction of the hill.

The red arrows show the direction taken by one step of mini-batch gradient descent with momentum; the blue points show the direction of the gradient (with respect to the current mini-batch) at each step.

The momentum update rule is, for l = 1, ..., L:
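
Written out (this matches update_parameters_with_momentum in the code below, with β the momentum parameter and α the learning rate):

$$v_{dW^{[l]}} := \beta \, v_{dW^{[l]}} + (1-\beta)\, dW^{[l]}, \qquad W^{[l]} := W^{[l]} - \alpha \, v_{dW^{[l]}}$$

$$v_{db^{[l]}} := \beta \, v_{db^{[l]}} + (1-\beta)\, db^{[l]}, \qquad b^{[l]} := b^{[l]} - \alpha \, v_{db^{[l]}}$$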

Common values for β range from 0.8 to 0.999. If you don't want to tune it, β = 0.9 is usually a reasonable default.
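
A useful rule of thumb from the lectures: an exponentially weighted average with parameter β averages over roughly 1/(1 − β) past values, so β = 0.9 averages over about the last 10 mini-batch gradients and β = 0.98 over about the last 50.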

4. Adam

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp (covered in the lectures) and Momentum.

The Adam update rule is, for l = 1, ..., L:
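
Written out for the W parameters (the b parameters are updated in the same way with db; this matches update_parameters_with_adam in the code below, where β₁ and β₂ are the two moving-average parameters, t counts the updates performed so far, and ε prevents division by zero):

$$v_{dW^{[l]}} := \beta_1 \, v_{dW^{[l]}} + (1-\beta_1)\, dW^{[l]}, \qquad v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1-\beta_1^{t}}$$

$$s_{dW^{[l]}} := \beta_2 \, s_{dW^{[l]}} + (1-\beta_2)\, \big(dW^{[l]}\big)^2, \qquad s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1-\beta_2^{t}}$$

$$W^{[l]} := W^{[l]} - \alpha \, \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon}$$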

5. Model with Different Optimization Algorithms

Let's use the following "moons" dataset to test the different optimization methods.
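
In the assignment the data comes from load_dataset in opt_utils; as a rough sketch, a "moons" dataset like this can be generated with scikit-learn (the sample count and noise level below are illustrative, not necessarily the ones used by load_dataset):

import sklearn.datasets
import matplotlib.pyplot as plt

# Two interleaving half-circles ("moons"); parameters here are illustrative
X, y = sklearn.datasets.make_moons(n_samples=300, noise=0.2, random_state=3)

# Reshape into the (n_features, m) / (1, m) layout the assignment code expects
demo_X = X.T               # shape (2, 300)
demo_Y = y.reshape(1, -1)  # shape (1, 300)

plt.scatter(demo_X[0, :], demo_X[1, :], c=demo_Y.ravel(), s=40, cmap=plt.cm.Spectral)
plt.show()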

We now run this 3-layer neural network with each of the 3 optimization methods in turn (see the model function in the code below for details).

(1) Mini-batch gradient descent

(2) Mini-batch gradient descent with momentum

Because this example is relatively simple, the gain from using momentum is small; on more complex problems the benefit of momentum would be larger.

(3) Mini-batch gradient descent with Adam


II. Algorithm Code

import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math
import sklearn
import sklearn.datasets

from opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
from opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset
from week2.testCases import *

plt.rcParams['figure.figsize'] = (7.0, 4.0)  # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'


def update_parameters_with_gd(parameters, grads, learning_rate):
    # One step of (batch) gradient descent: W := W - learning_rate * dW, b := b - learning_rate * db
    L = len(parameters) // 2  # number of layers in the network
    for l in range(L):
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]
        assert (parameters['W' + str(l + 1)].shape == grads["dW" + str(l + 1)].shape)
        assert (parameters['b' + str(l + 1)].shape == grads["db" + str(l + 1)].shape)

    return parameters


# parameters, grads, learning_rate = update_parameters_with_gd_test_case()
# parameters = update_parameters_with_gd(parameters, grads, learning_rate)
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))


def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    # Build a list of random (mini_batch_X, mini_batch_Y) pairs from (X, Y)
    np.random.seed(seed)
    m = X.shape[1]  # number of training examples
    mini_batches = []

    # Step 1: shuffle X and Y with the same column permutation so example i stays paired with label i
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Step 2: partition (shuffled_X, shuffled_Y) into complete mini-batches of size mini_batch_size
    num_complete_minibatches = math.floor(m / mini_batch_size)
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, k * mini_batch_size: (k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size: (k + 1) * mini_batch_size]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Handle the final mini-batch if m is not a multiple of mini_batch_size (it has fewer examples)
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches


# X_assess, Y_assess, mini_batch_size = random_mini_batches_test_case()
# mini_batches = random_mini_batches(X_assess, Y_assess, mini_batch_size)
# print("shape of the 1st mini_batch_X: " + str(mini_batches[0][0].shape))
# print("shape of the 2nd mini_batch_X: " + str(mini_batches[1][0].shape))
# print("shape of the 3rd mini_batch_X: " + str(mini_batches[2][0].shape))
# print("shape of the 1st mini_batch_Y: " + str(mini_batches[0][1].shape))
# print("shape of the 2nd mini_batch_Y: " + str(mini_batches[1][1].shape))
# print("shape of the 3rd mini_batch_Y: " + str(mini_batches[2][1].shape))
# print("mini batch sanity check: " + str(mini_batches[0][0][0][0:3]))


def initialize_velocity(parameters):
    # Initialize the velocity v with zero arrays of the same shapes as the corresponding gradients
    L = len(parameters) // 2
    v = {}

    for l in range(L):
        v["dW" + str(l + 1)] = np.zeros_like(parameters['W' + str(l + 1)])
        v["db" + str(l + 1)] = np.zeros_like(parameters['b' + str(l + 1)])

    return v


# parameters = initialize_velocity_test_case()
# v = initialize_velocity(parameters)
# print("v[\"dW1\"] = " + str(v["dW1"]))
# print("v[\"db1\"] = " + str(v["db1"]))
# print("v[\"dW2\"] = " + str(v["dW2"]))
# print("v[\"db2\"] = " + str(v["db2"]))


def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    # Momentum update: v := beta * v + (1 - beta) * grad, then parameter := parameter - learning_rate * v
    L = len(parameters) // 2
    for l in range(L):
        v["dW" + str(l + 1)] = beta * v["dW" + str(l + 1)] + (1 - beta) * grads['dW' + str(l + 1)]
        v["db" + str(l + 1)] = beta * v["db" + str(l + 1)] + (1 - beta) * grads['db' + str(l + 1)]
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v["db" + str(l + 1)]

    return parameters, v


# parameters, grads, v = update_parameters_with_momentum_test_case()
# parameters, v = update_parameters_with_momentum(parameters, grads, v, beta=0.9, learning_rate=0.01)
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))
# print("v[\"dW1\"] = " + str(v["dW1"]))
# print("v[\"db1\"] = " + str(v["db1"]))
# print("v[\"dW2\"] = " + str(v["dW2"]))
# print("v[\"db2\"] = " + str(v["db2"]))


def initialize_adam(parameters):
    # Initialize the first-moment estimate v and the second-moment estimate s with zeros
    L = len(parameters) // 2
    v = {}
    s = {}

    for l in range(L):
        v["dW" + str(l + 1)] = np.zeros_like(parameters['W' + str(l + 1)])
        v["db" + str(l + 1)] = np.zeros_like(parameters['b' + str(l + 1)])
        s["dW" + str(l + 1)] = np.zeros_like(parameters['W' + str(l + 1)])
        s["db" + str(l + 1)] = np.zeros_like(parameters['b' + str(l + 1)])

    return v, s


# parameters = initialize_adam_test_case()
# v, s = initialize_adam(parameters)
# print("v[\"dW1\"] = " + str(v["dW1"]))
# print("v[\"db1\"] = " + str(v["db1"]))
# print("v[\"dW2\"] = " + str(v["dW2"]))
# print("v[\"db2\"] = " + str(v["db2"]))
# print("s[\"dW1\"] = " + str(s["dW1"]))
# print("s[\"db1\"] = " + str(s["db1"]))
# print("s[\"dW2\"] = " + str(s["dW2"]))
# print("s[\"db2\"] = " + str(s["db2"]))


def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    L = len(parameters) // 2
    v_corrected = {}  # bias-corrected first moment estimate
    s_corrected = {}  # bias-corrected second moment estimate

    for l in range(L):
        # Moving average of the gradients (first moment)
        v["dW" + str(l + 1)] = beta1 * v["dW" + str(l + 1)] + (1 - beta1) * grads['dW' + str(l + 1)]
        v["db" + str(l + 1)] = beta1 * v["db" + str(l + 1)] + (1 - beta1) * grads['db' + str(l + 1)]

        # Bias correction for the first moment
        v_corrected["dW" + str(l + 1)] = v["dW" + str(l + 1)] / (1 - beta1 ** t)
        v_corrected["db" + str(l + 1)] = v["db" + str(l + 1)] / (1 - beta1 ** t)

        # Moving average of the squared gradients (second moment)
        s["dW" + str(l + 1)] = beta2 * s["dW" + str(l + 1)] + (1 - beta2) * (grads["dW" + str(l + 1)]) ** 2
        s["db" + str(l + 1)] = beta2 * s["db" + str(l + 1)] + (1 - beta2) * (grads["db" + str(l + 1)]) ** 2

        # Bias correction for the second moment
        s_corrected["dW" + str(l + 1)] = s["dW" + str(l + 1)] / (1 - beta2 ** t)
        s_corrected["db" + str(l + 1)] = s["db" + str(l + 1)] / (1 - beta2 ** t)

        # Parameter update, scaling the step by the RMS of recent gradients
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v_corrected["dW" + str(l + 1)] / (
                np.sqrt(s_corrected["dW" + str(l + 1)]) + epsilon)
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v_corrected["db" + str(l + 1)] / (
                np.sqrt(s_corrected["db" + str(l + 1)]) + epsilon)

    return parameters, v, s


# parameters, grads, v, s = update_parameters_with_adam_test_case()
# parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t=2)
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))
# print("v[\"dW1\"] = " + str(v["dW1"]))
# print("v[\"db1\"] = " + str(v["db1"]))
# print("v[\"dW2\"] = " + str(v["dW2"]))
# print("v[\"db2\"] = " + str(v["db2"]))
# print("s[\"dW1\"] = " + str(s["dW1"]))
# print("s[\"db1\"] = " + str(s["db1"]))
# print("s[\"dW2\"] = " + str(s["dW2"]))
# print("s[\"db2\"] = " + str(s["db2"]))


train_X, train_Y = load_dataset()


# print(train_X.shape)
# print(train_Y.shape)


def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, mini_batch_size=64, beta=0.9,
          beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000, print_cost=True):
    # 3-layer neural network that can be trained with "gd", "momentum" or "adam" mini-batch updates
    L = len(layers_dims)  # number of layers in the network
    costs = []
    t = 0  # Adam update counter
    seed = 10  # incremented every epoch so the mini-batches are reshuffled differently each time

    parameters = initialize_parameters(layers_dims)

    if optimizer == "gd":
        pass
    elif optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)

    for i in range(num_epochs):
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)

        for minibatch in minibatches:
            (minibatch_X, minibatch_Y) = minibatch

            # Forward propagation
            a3, caches = forward_propagation(minibatch_X, parameters)

            # Compute cost
            cost = compute_cost(a3, minibatch_Y)

            # Backward propagation
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)

            # Update parameters
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1  # Adam counter
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s,
                                                               t, learning_rate, beta1, beta2, epsilon)

        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i: %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('epochs (per 100)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters


# layers_dims = [train_X.shape[0], 5, 2, 1]
# parameters = model(train_X, train_Y, layers_dims, optimizer="gd")
# predictions = predict(train_X, train_Y, parameters)
# plt.title("Model with Gradient Descent optimization")
# axes = plt.gca()
# axes.set_xlim([-1.5, 2.5])
# axes.set_ylim([-1, 1.5])
# plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

# layers_dims = [train_X.shape[0], 5, 2, 1]
# parameters = model(train_X, train_Y, layers_dims, beta=0.9, optimizer="momentum")
# predictions = predict(train_X, train_Y, parameters)
# plt.title("Model with Momentum optimization")
# axes = plt.gca()
# axes.set_xlim([-1.5, 2.5])
# axes.set_ylim([-1, 1.5])
# plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="adam")
predictions = predict(train_X, train_Y, parameters)
plt.title("Model with Adam optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
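
As an optional extension, the following sketch reuses the model and predict helpers above to train the network once per optimizer (predict in the assignment's opt_utils prints the training accuracy); it is written in the same commented-out style as the alternative runs above:

# layers_dims = [train_X.shape[0], 5, 2, 1]
# for opt_name in ["gd", "momentum", "adam"]:
#     print("Training with optimizer: " + opt_name)
#     parameters = model(train_X, train_Y, layers_dims, optimizer=opt_name)
#     predictions = predict(train_X, train_Y, parameters)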

III. Summary

Gradient descent with momentum is usually helpful, but because the learning rate used here is small and the dataset is simple, its effect is almost negligible.

Adam, by contrast, clearly outperforms both mini-batch gradient descent and momentum. If you ran more epochs on this simple dataset, all three methods would eventually give very good results; still, you have already seen that Adam converges faster.

Advantages of Adam:

  • Relatively low memory requirements (although higher than plain gradient descent and gradient descent with momentum).
  • It usually produces fairly good results even without tuning its hyperparameters.

Comparison of the various optimization algorithms: