1. 程式人生 > >吳恩達Coursera深度學習課程 DeepLearning.ai 程式設計作業——Optimization Methods(2-2)

吳恩達Coursera深度學習課程 DeepLearning.ai 程式設計作業——Optimization Methods(2-2)

Optimization Methods

Until now, you’ve always used Gradient Descent to update the parameters and minimize the cost. In this notebook, you will learn more advanced optimization methods that can speed up learning and perhaps even get you to a better final value for the cost function. Having a good optimization algorithm can be the difference between waiting days vs. just a few hours to get a good result.

Gradient descent goes “downhill” on a cost function JJ. Think of it as trying to do this:

這裡寫圖片描述

At each step of the training, you update your parameters following a certain direction to try to get to the lowest possible point.

Notations: As usual, $\frac{\partial J}{\partial a } = $ da for any variable a

.

To get started, run the following code to import the libraries you will need.

import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math
import sklearn
import sklearn.datasets

from opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
from opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset
from testCases import *

plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

opt_utils.py

import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy.io
import sklearn
import sklearn.datasets

def sigmoid(x):
    """
    Compute the sigmoid of x

    Arguments:
    x -- A scalar or numpy array of any size.

    Return:
    s -- sigmoid(x)
    """
    s = 1/(1+np.exp(-x))
    return s

def relu(x):
    """
    Compute the relu of x

    Arguments:
    x -- A scalar or numpy array of any size.

    Return:
    s -- relu(x)
    """
    s = np.maximum(0,x)
    
    return s

def load_params_and_grads(seed=1):
    np.random.seed(seed)
    W1 = np.random.randn(2,3)
    b1 = np.random.randn(2,1)
    W2 = np.random.randn(3,3)
    b2 = np.random.randn(3,1)

    dW1 = np.random.randn(2,3)
    db1 = np.random.randn(2,1)
    dW2 = np.random.randn(3,3)
    db2 = np.random.randn(3,1)
    
    return W1, b1, W2, b2, dW1, db1, dW2, db2


def initialize_parameters(layer_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the dimensions of each layer in our network
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    b1 -- bias vector of shape (layer_dims[l], 1)
                    Wl -- weight matrix of shape (layer_dims[l-1], layer_dims[l])
                    bl -- bias vector of shape (1, layer_dims[l])
                    
    Tips:
    - For example: the layer_dims for the "Planar Data classification model" would have been [2,2,1]. 
    This means W1's shape was (2,2), b1 was (1,2), W2 was (2,1) and b2 was (1,1). Now you have to generalize it!
    - In the for loop, use parameters['W' + str(l)] to access Wl, where l is the iterative integer.
    """
    
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims) # number of layers in the network

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1])*  np.sqrt(2.0 / layer_dims[l-1])  #請注意這裡的2.0很重要
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        
        assert(parameters['W' + str(l)].shape == layer_dims[l], layer_dims[l-1])
        assert(parameters['W' + str(l)].shape == layer_dims[l], 1)
        
    return parameters


def compute_cost(a3, Y):
    
    """
    Implement the cost function
    
    Arguments:
    a3 -- post-activation, output of forward propagation
    Y -- "true" labels vector, same shape as a3
    
    Returns:
    cost - value of the cost function
    """
    m = Y.shape[1]
    
    logprobs = np.multiply(-np.log(a3),Y) + np.multiply(-np.log(1 - a3), 1 - Y)
    cost = 1./m * np.sum(logprobs)
    
    return cost

def forward_propagation(X, parameters):
    """
    Implements the forward propagation (and computes the loss) presented in Figure 2.
    
    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape ()
                    b1 -- bias vector of shape ()
                    W2 -- weight matrix of shape ()
                    b2 -- bias vector of shape ()
                    W3 -- weight matrix of shape ()
                    b3 -- bias vector of shape ()
    
    Returns:
    loss -- the loss function (vanilla logistic loss)
    """
    
    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    
    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    z1 = np.dot(W1, X) + b1
    a1 = relu(z1)
    z2 = np.dot(W2, a1) + b2
    a2 = relu(z2)
    z3 = np.dot(W3, a2) + b3
    a3 = sigmoid(z3)
    
    cache = (z1, a1, W1, b1, z2, a2, W2, b2, z3, a3, W3, b3)
    
    return a3, cache

def backward_propagation(X, Y, cache):
    """
    Implement the backward propagation presented in figure 2.
    
    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- true "label" vector (containing 0 if cat, 1 if non-cat)
    cache -- cache output from forward_propagation()
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    m = X.shape[1]
    (z1, a1, W1, b1, z2, a2, W2, b2, z3, a3, W3, b3) = cache
    
    dz3 = 1./m * (a3 - Y)
    dW3 = np.dot(dz3, a2.T)
    db3 = np.sum(dz3, axis=1, keepdims = True)
    
    da2 = np.dot(W3.T, dz3)
    dz2 = np.multiply(da2, np.int64(a2 > 0))
    dW2 = np.dot(dz2, a1.T)
    db2 = np.sum(dz2, axis=1, keepdims = True)
    
    da1 = np.dot(W2.T, dz2)
    dz1 = np.multiply(da1, np.int64(a1 > 0))
    dW1 = np.dot(dz1, X.T)
    db1 = np.sum(dz1, axis=1, keepdims = True)
    
    gradients = {"dz3": dz3, "dW3": dW3, "db3": db3,
                 "da2": da2, "dz2": dz2, "dW2": dW2, "db2": db2,
                 "da1": da1, "dz1": dz1, "dW1": dW1, "db1": db1}
    
    return gradients

def predict(X, y, parameters):
    """
    This function is used to predict the results of a  n-layer neural network.
    
    Arguments:
    X -- data set of examples you would like to label
    parameters -- parameters of the trained model
    
    Returns:
    p -- predictions for the given dataset X
    """
    
    m = X.shape[1]
    p = np.zeros((1,m), dtype = np.int)
    
    # Forward propagation
    a3, caches = forward_propagation(X, parameters)
    
    # convert probas to 0/1 predictions
    for i in range(0, a3.shape[1]):
        if a3[0,i] > 0.5:
            p[0,i] = 1
        else:
            p[0,i] = 0

    # print results

    #print ("predictions: " + str(p[0,:]))
    #print ("true labels: " + str(y[0,:]))
    print("Accuracy: "  + str(np.mean((p[0,:] == y[0,:]))))
    
    return p

def load_2D_dataset():
    data = scipy.io.loadmat('datasets/data.mat')
    train_X = data['X'].T
    train_Y = data['y'].T
    test_X = data['Xval'].T
    test_Y = data['yval'].T

    plt.scatter(train_X[0, :], train_X[1, :], c=train_Y, s=40, cmap=plt.cm.Spectral);
    
    return train_X, train_Y, test_X, test_Y

def plot_decision_boundary(model, X, y):
    # Set min and max values and give it some padding
    x_min, x_max = X[0, :].min() - 1, X[0, :].max() + 1
    y_min, y_max = X[1, :].min() - 1, X[1, :].max() + 1
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = model(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.ylabel('x2')
    plt.xlabel('x1')
    plt.scatter(X[0, :], X[1, :], c=y, cmap=plt.cm.Spectral)
    plt.show()
    
def predict_dec(parameters, X):
    """
    Used for plotting decision boundary.
    
    Arguments:
    parameters -- python dictionary containing your parameters 
    X -- input data of size (m, K)
    
    Returns
    predictions -- vector of predictions of our model (red: 0 / blue: 1)
    """
    
    # Predict using forward propagation and a classification threshold of 0.5
    a3, cache = forward_propagation(X, parameters)
    predictions = (a3 > 0.5)
    return predictions

def load_dataset():
    np.random.seed(3)
    train_X, train_Y = sklearn.datasets.make_moons(n_samples=300, noise=.2) #300 #0.2 
    # Visualize the data
    plt.scatter(train_X[:, 0], train_X[:, 1], c=train_Y, s=40, cmap=plt.cm.Spectral);
    train_X = train_X.T
    train_Y = train_Y.reshape((1, train_Y.shape[0]))
    
    return train_X, train_Y

1 - Gradient Descent

A simple optimization method in machine learning is gradient descent (GD). When you take gradient steps with respect to all mm examples on each step, it is also called Batch Gradient Descent.

Warm-up exercise: Implement the gradient descent update rule. The gradient descent rule is, for l=1,...,Ll = 1, ..., L:
(1)W[l]=W[l]αdW[l] W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]} \tag{1}
(2)b[l]=b[l]αdb[l] b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]} \tag{2}

where L is the number of layers and α\alpha is the learning rate. All parameters should be stored in the parameters dictionary. Note that the iterator l starts at 0 in the for loop while the first parameters are W[1]W^{[1]} and b[1]b^{[1]}. You need to shift l to l+1 when coding.

def update_parameters_with_gd(parameters, grads, learning_rate):
    """
    Update parameters using one step of gradient descent
    
    Arguments:
    parameters -- python dictionary containing your parameters to be updated:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients to update each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    learning_rate -- the learning rate, scalar.
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    """

    L = len(parameters) // 2 # number of layers in the neural networks

    # Update rule for each parameter
    for l in range(L):
        ### START CODE HERE ### (approx. 2 lines)
        parameters["W" + str(l+1)] = parameters["W"+str(l+1)]-learning_rate*grads["dW"+str(l+1)]
        parameters["b" + str(l+1)] = parameters["b"+str(l+1)]-learning_rate*grads["db"+str(l+1)]
        ### END CODE HERE ###
        
    return parameters

A variant of this is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient descent where each mini-batch has just 1 example. The update rule that you have just implemented does not change. What changes is that you would be computing gradients on just one training example at a time, rather than on the whole training set. The code examples below illustrate the difference between stochastic gradient descent and (batch) gradient descent.

  • (Batch) Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)
        
  • Stochastic Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation
        a, caches = forward_propagation(X[:,j], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:,j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)

In Stochastic Gradient Descent, you use only 1 training example before updating the gradients. When the training set is large, SGD can be faster. But the parameters will “oscillate” toward the minimum rather than converge smoothly. Here is an illustration of this:

這裡寫圖片描述

Note also that implementing SGD requires 3 for-loops in total:

  1. Over the number of iterations
  2. Over the mm training examples
  3. Over the layers (to update all parameters, from (W[1],b[1])(W^{[1]},b^{[1]}) to (W[L],b[L])(W^{[L]},b^{[L]}))

In practice, you’ll often get faster results if you do not use neither the whole training set, nor only one training example, to perform each update. Mini-batch gradient descent uses an intermediate number of examples for each step. With mini-batch gradient descent, you loop over the mini-batches instead of looping over individual training examples.

這裡寫圖片描述

What you should remember:

  • The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
  • You have to tune a learning rate hyperparameter α\alpha.
  • With a well-turned mini-batch size, usually it outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).

2 - Mini-Batch Gradient descent

Let’s learn how to build mini-batches from the training set (X, Y).

There are two steps:

  • Shuffle: Create a shuffled version of the training set (X, Y) as shown below. Each column of X and Y represents a training example. Note that the random shuffling is done synchronously between X and Y. Such that after the shuffling the ithi^{th} column of X is the example corresponding to the ithi^{th} label in Y. The shuffling step ensures that examples will be split randomly into different mini-batches.

  • Partition: Partition the shuffled (X, Y) into mini-batches of size mini_batch_size (here 64). Note that the number of training examples is not always divisible by mini_batch_size. The last mini batch might be smaller, but you don’t need to worry about this. When the final mini-batch is smaller than the full mini_batch_size, it will look like this:

這裡寫圖片描述

Exercise: Implement random_mini_batches. We coded the shuffling part for you. To help you with the partitioning step, we give you the following code that selects the indexes for the 1st1^{st} and 2nd2^{nd} mini-batches:

first_mini_batch_X = shuffled_X[:, 0 : mini_batch_size]
second_mini_batch_X = shuffled_X[:, mini_batch_size : 2 * mini_batch_size]
...

Note that the last mini-batch might end up smaller than mini_batch_size=64. Let s\lfloor s \rfloor represents ss rounded down to the nearest integer (this is math.floor(s) in Python). If the total number of examples is not a multiple of mini_batch_size=64 then there will be mmini_batch_size\lfloor \frac{m}{mini\_batch\_size}\rfloor mini-batches with a full 64 examples, and the number of examples in the final mini-batch will be (mmini_batch_size×mmini_batch_sizem-mini_\_batch_\_size \times \lfloor \frac{m}{mini\_batch\_size}\rfloor).

def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    """
    Creates a list of random minibatches from (X, Y)
    
    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer
    
    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    
    np.random.seed(seed)            # To make your "random" minibatches the same as ours
    m = X.shape[1]                  # number of training examples
    mini_batches = []
        
    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1,m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m/mini_batch_size) # number of mini batches of size mini_batch_size in your partitionning
    for k in range(0, num_complete_minibatches):
        ### START CODE HERE ### (approx. 2 lines)
        mini_batch_X = shuffled_X[:,k*mini_batch_size:(k+1)*mini_batch_size]
        mini_batch_Y = shuffled_Y[:,k*mini_batch_size:(k+1)*mini_batch_size]
        ### END CODE HERE ###
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        ### START CODE HERE ### (approx. 2 lines)
        mini_batch_X = shuffled_X[:,mini_batch_size*num_complete_minibatches:]
        mini_batch_Y = shuffled_Y[:,mini_batch_size*num_complete_minibatches:]
        ### END CODE HERE ###
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    return mini_batches

3 - Momentum

Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will “oscillate” toward convergence. Using momentum can reduce these oscillations.

Momentum takes into account the past gradients to smooth out the update. We will store the ‘direction’ of the previous gradients in the variable vv. Formally, this will be the exponentially weighted average of the gradient on previous steps. You can also think of vv as the “velocity” of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill.

這裡寫圖片描述

Exercise: Initialize the velocity. The velocity, vv, is a python dictionary that needs to be initialized with arrays of zeros. Its keys are the same as those in the grads dictionary, that is:
for l=1,...,Ll =1,...,L:

v["dW" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["W" + str(l+1)])
v["db" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["b" + str(l+1)])

Note that the iterator l starts at 0 in the for loop while the first parameters are v[“dW1”] and v[“db1”] (that’s a “one” on the superscript). This is why we are shifting l to l+1 in the for loop.

def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    
    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """
    
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    
    # Initialize velocity
    for l in range(L):
        ### START CODE HERE ### (approx. 2 lines)
        v["dW" + str(l+1)] = np.zeros(parameters["W"+str(l+1)].shape)
        v["db" + str(l+1)] = np.zeros(parameters["b"+str(l+1)].shape)
        ### END CODE HERE ###
        
    return v
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    v -- python dictionary containing your updated velocities
    """

    L = len(parameters) // 2 # number of layers in the neural networks
    
    # Momentum update for each parameter
    for l in range(L):
        
        ### START CODE HERE ### (approx. 4 lines)
        # compute velocities
        v["dW" + str(l+1)] = beta*v["dW"+str(l+1)]+(1-beta)*grads["dW"+str(l+1)]
        v["db" + str(l+1)] = beta*v["db"+str(l+1)]+(1-beta)*grads["db"+str(l+1)]
        # update parameters
        parameters["W" + str(l+1)] = parameters["W"+str(l+1)] - v["dW"+str(l+1)]*learning_rate
        parameters["b" + str(l+1)] = parameters["b"+str(l+1)] -v["db"+str(l+1)]*learning_rate
        ### END CODE HERE ###
        
    return parameters, v

4 - Adam

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp (described in lecture) and Momentum.

How does Adam work?

  1. It calculates an exponentially weighted average of past gradients, and stores it in variables vv (before bias correction) and vcorrectedv^{corrected} (with bias correction).
  2. It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables ss (before bias correction) and scorrecteds^{corrected} (with bias correction).
  3. It updates parameters in a direction based on combining information from “1” and “2”.

The update rule is, for l=1,...,Ll = 1, ..., L:

{vdW[l]=β1vdW[l]+(1β1)JW[l]vdW[l]corrected=vdW[l]1(β1)tsdW[l]=β2sdW[l]+(1β2)(JW[l])2sdW[l]corrected=sdW[l]1(β1)tW[l]=W[l]αvdW[l]correctedsdW[l]corrected+ε\begin{cases} v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W^{[l]} } \\ v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\ s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) (\frac{\partial \mathcal{J} }{\partial W^{[l]} })^2 \\ s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_1)^t} \\ W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon} \end{cases}

相關推薦

Coursera深度學習課程 DeepLearning.ai 程式設計作業——Optimization Methods2-2

Optimization Methods Until now, you’ve always used Gradient Descent to update the parameters and minimize the cost. In this noteboo

Coursera深度學習課程 DeepLearning.ai 程式設計作業——Regularization2-1.2

如果資料集沒有很大,同時在訓練集上又擬合得很好,但是在測試集的效果卻不是很好,這時候就要使用正則化來使得其擬合能力不會那麼強。 import numpy as np import sklearn import matplotlib.pyplot as plt

Coursera深度學習課程 DeepLearning.ai 程式設計作業——Convolution model:step by step and application (4.1)

一.Convolutional Neural Networks: Step by Step Welcome to Course 4’s first assignment! In this assignment, you will implement convol

Coursera深度學習課程 deeplearning.ai (5-3) 序列模型和注意力機制--程式設計作業(二):觸發字檢測

Part 2: 觸發字檢測 關鍵詞語音喚醒 觸發字檢測 歡迎來到這個專業課程的最終程式設計任務! 在本週的視訊中,你瞭解瞭如何將深度學習應用於語音識別。在本作業中,您將構建一個語音資料集並實現觸發字檢測演算法(有時也稱為關鍵字檢測或喚醒檢測)。觸發字

Coursera深度學習課程 deeplearning.ai (4-1) 卷積神經網路--程式設計作業

Part 1:卷積神經網路 本週課程將利用numpy實現卷積層(CONV) 和 池化層(POOL), 包含前向傳播和可選的反向傳播。 變數說明 上標[l][l] 表示神經網路的第幾層 上標(i)(i) 表示第幾個樣本 上標[i][i] 表示第幾個mi

Coursera深度學習課程 DeepLearning.ai 提煉筆記1-2-- 神經網路基礎

以下為在Coursera上吳恩達老師的DeepLearning.ai課程專案中,第一部分《神經網路和深度學習》第二週課程部分關鍵點的筆記。筆記並不包含全部小視訊課程的記錄,如需學習筆記中捨棄的內容請至Coursera 或者 網易雲課堂。同時在閱讀以下

Coursera深度學習課程 deeplearning.ai (4-1) 卷積神經網路--課程筆記

本課主要講解了卷積神經網路的基礎知識,包括卷積層基礎(卷積核、Padding、Stride),卷積神經網路的基礎:卷積層、池化層、全連線層。 主要知識點 卷積核: 過濾器,各元素相乘再相加 nxn * fxf -> (n-f+1)x(n-f+1)

Coursera深度學習課程 deeplearning.ai (4-4) 人臉識別和神經風格轉換--課程筆記

Part 1:人臉識別 4.1 什麼是人臉識別? 人臉驗證: 輸入圖片,驗證是不是 A 人臉識別: 有一個庫,輸入圖片,驗證是不是庫裡的一員 人臉識別難度更大,要求準確率更高,因為1%的人臉驗證錯誤在人臉識別中會被放大很多倍。 4.2 O

Coursera深度學習課程 deeplearning.ai (5-1) 迴圈序列模型--程式設計作業(一):構建迴圈神經網路

Part 1: 構建神經網路 歡迎來到本週的第一個作業,這個作業我們將利用numpy實現你的第一個迴圈神經網路。 迴圈神經網路(Recurrent Neural Networks: RNN) 因為有”記憶”,所以在自然語言處理(Natural Languag

Coursera深度學習課程 deeplearning.ai (5-1) 迴圈序列模型--課程筆記

1.1 為什麼選擇序列模型 序列模型的應用 語音識別:將輸入的語音訊號直接輸出相應的語音文字資訊。無論是語音訊號還是文字資訊均是序列資料。 音樂生成:生成音樂樂譜。只有輸出的音樂樂譜是序列資料,輸入可以是空或者一個整數。 情感分類:將輸入的評論句子轉換

Coursera深度學習課程 DeepLearning.ai 提煉筆記5-1-- 迴圈神經網路

Ng最後一課釋出了,撒花!以下為吳恩達老師 DeepLearning.ai 課程專案中,第五部分《序列模型》第一週課程“迴圈神經網路”關鍵點的筆記。 同時我在知乎上開設了關於機器學習深度學習的專欄收錄下面的筆記,以方便大家在移動端的學習。歡迎關

Coursera深度學習課程 deeplearning.ai (5-3) 序列模型和注意力機制--課程筆記

3.1 基礎模型 sequence to sequence sequence to sequence:兩個序列模型組成,前半部分叫做編碼,後半部分叫做解碼。用於機器翻譯。 image to sequence sequence to sequenc

Coursera深度學習課程 deeplearning.ai (5-2) 自然語言處理與詞嵌入--程式設計作業(一):詞向量運算

Part 1: 詞向量運算 歡迎來到本週第一個作業。 由於詞嵌入的訓練計算量龐大切耗費時間長,絕大部分機器學習人員都會匯入一個預訓練的詞嵌入模型。 你將學到: 載入預訓練單詞向量,使用餘弦測量相似度 使用詞嵌入解決類別問題,比如 “Man is to

Coursera深度學習課程 DeepLearning.ai 提煉筆記5-3-- 序列模型和注意力機制

完結撒花!以下為吳恩達老師 DeepLearning.ai 課程專案中,第五部分《序列模型》第三週課程“序列模型和注意力機制”關鍵點的筆記。 同時我在知乎上開設了關於機器學習深度學習的專欄收錄下面的筆記,以方便大家在移動端的學習。歡迎關注我的知

Coursera深度學習課程 deeplearning.ai (4-2) 深度卷積網路:例項探究--課程筆記

本課主要講解了一些典型的卷積神經網路的思路,包括經典神經網路的leNet/AlexNet/VGG, 以及殘差網路ResNet和Google的Inception網路,順便講解了1x1卷積核的應用,便於我們進行學習和借鑑。 2.1 為什麼要進行例項探究 神經

Coursera深度學習課程 deeplearning.ai (2-1) 深度學習實踐--程式設計作業

初始化 一個好的初始化可以做到: 梯度下降的快速收斂 收斂到的對訓練集只有較少錯誤的值 載入資料 import numpy as np import matplotlib.pyplot as plt import sklearn impo

Coursera深度學習課程 DeepLearning.ai 提煉筆記1-3-- 淺層神經網路

以下為在Coursera上吳恩達老師的DeepLearning.ai課程專案中,第一部分《神經網路和深度學習》第三週課程“淺層神經網路”部分關鍵點的筆記。筆記並不包含全部小視訊課程的記錄,如需學習筆記中捨棄的內容請至Coursera 或者 網易雲課堂

Coursera深度學習課程 deeplearning.ai (4-4) 人臉識別和神經風格轉換--程式設計作業

Part 1:Happy House 的人臉識別 本週的第一個作業我們將完成一個人臉識別系統。 人臉識別問題可以分為兩類: 人臉驗證: 輸入圖片,驗證是不是A 1:1 識別 舉例:人臉解鎖手機,人臉刷卡 人臉識別: 有一個庫,輸入圖片,驗證是不是庫裡的

Coursera深度學習課程 deeplearning.ai (5-2) 自然語言處理與詞嵌入--程式設計作業(二):Emojify表情包

Part 2: Emojify 歡迎來到本週的第二個作業,你將利用詞向量構建一個表情包。 你有沒有想過讓你的簡訊更具表現力? emojifier APP將幫助你做到這一點。 所以不是寫下”Congratulations on the promotion! L

Coursera深度學習課程 DeepLearning.ai 提煉筆記1-4-- 深層神經網路

以下為在Coursera上吳恩達老師的DeepLearning.ai課程專案中,第一部分《神經網路和深度學習》第四周課程“深層神經網路”部分關鍵點的筆記。筆記並不包含全部小視訊課程的記錄,如需學習筆記中捨棄的內容請至 Coursera 或者 網易雲課