Coursera Andrew Ng Deep Learning, Course 2: Improving Deep Neural Networks, Week 2 programming assignment code: Optimization methods

阿新 • Published: 2019-01-17

Optimization Methods

Until now, you’ve always used Gradient Descent to update the parameters and minimize the cost. In this notebook, you will learn more advanced optimization methods that can speed up learning and perhaps even get you to a better final value for the cost function. Having a good optimization algorithm can be the difference between waiting days vs. just a few hours to get a good result.

Gradient descent goes “downhill” on a cost function J. Think of it as trying to do this:

Minimizing the cost is like finding the lowest point in a hilly landscape. At each step of the training, you update your parameters following a certain direction to try to get to the lowest possible point.

Notations: As usual, $\frac{\partial J}{\partial a} = da$ for any variable a.

To get started, run the following code to import the libraries you will need.

import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math
import sklearn
import sklearn.datasets

from opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
from opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset
from testCases import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

/home/jovyan/work/week6/opt_utils.py:76: SyntaxWarning: assertion is always true, perhaps remove parentheses?
  assert(parameters['W' + str(l)].shape == layer_dims[l], layer_dims[l-1])
/home/jovyan/work/week6/opt_utils.py:77: SyntaxWarning: assertion is always true, perhaps remove parentheses?
  assert(parameters['W' + str(l)].shape == layer_dims[l], 1)

1 - Gradient Descent

A simple optimization method in machine learning is gradient descent (GD). When you take gradient steps with respect to all m examples on each step, it is also called Batch Gradient Descent.

Warm-up exercise: Implement the gradient descent update rule. The gradient descent rule is, for l=1,...,L:

$$W^{[l]} = W^{[l]} - \alpha \, dW^{[l]} \tag{1}$$
$$b^{[l]} = b^{[l]} - \alpha \, db^{[l]} \tag{2}$$

where L is the number of layers and α is the learning rate. All parameters should be stored in the parameters dictionary. Note that the iterator l starts at 0 in the for loop while the first parameters are $W^{[1]}$ and $b^{[1]}$. You need to shift l to l+1 when coding.

# GRADED FUNCTION: update_parameters_with_gd

def update_parameters_with_gd(parameters, grads, learning_rate):
    """
    Update parameters using one step of gradient descent

    Arguments:
    parameters -- python dictionary containing your parameters to be updated:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients to update each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    learning_rate -- the learning rate, scalar.

    Returns:
    parameters -- python dictionary containing your updated parameters 
    """

    L = len(parameters) // 2 # number of layers in the neural networks

    # Update rule for each parameter
    for l in range(L):
        ### START CODE HERE ### (approx. 2 lines)
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - grads['dW' + str(l+1)] * learning_rate
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - grads['db' + str(l+1)] * learning_rate
        ### END CODE HERE ###

    return parameters

parameters, grads, learning_rate = update_parameters_with_gd_test_case()

parameters = update_parameters_with_gd(parameters, grads, learning_rate)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

W1 = [[ 1.63535156 -0.62320365 -0.53718766]
 [-1.07799357  0.85639907 -2.29470142]]
b1 = [[ 1.74604067]
 [-0.75184921]]
W2 = [[ 0.32171798 -0.25467393  1.46902454]
 [-2.05617317 -0.31554548 -0.3756023 ]
 [ 1.1404819  -1.09976462 -0.1612551 ]]
b2 = [[-0.88020257]
 [ 0.02561572]
 [ 0.57539477]]

Expected Output:

W1	[[ 1.63535156 -0.62320365 -0.53718766] [-1.07799357 0.85639907 -2.29470142]]
b1	[[ 1.74604067] [-0.75184921]]
W2	[[ 0.32171798 -0.25467393 1.46902454] [-2.05617317 -0.31554548 -0.3756023 ] [ 1.1404819 -1.09976462 -0.1612551 ]]
b2	[[-0.88020257] [ 0.02561572] [ 0.57539477]]

A variant of this is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient descent where each mini-batch has just 1 example. The update rule that you have just implemented does not change. What changes is that you would be computing gradients on just one training example at a time, rather than on the whole training set. The code examples below illustrate the difference between stochastic gradient descent and (batch) gradient descent.

(Batch) Gradient Descent:

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)

Stochastic Gradient Descent:

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation
        a, caches = forward_propagation(X[:,j], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:,j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)

In Stochastic Gradient Descent, you use only 1 training example to compute the gradients before updating the parameters. When the training set is large, SGD can be faster. But the parameters will “oscillate” toward the minimum rather than converge smoothly. Here is an illustration of this:

Figure 1 : SGD vs GD
“+” denotes a minimum of the cost. SGD leads to many oscillations to reach convergence. But each step is a lot faster to compute for SGD than for GD, as it uses only one training example (vs. the whole batch for GD).

Note also that implementing SGD requires 3 for-loops in total:
1. Over the number of iterations
2. Over the m training examples
3. Over the layers (to update all parameters, from $(W^{[1]}, b^{[1]})$ to $(W^{[L]}, b^{[L]})$)

In practice, you’ll often get faster results if you use neither the whole training set nor only one training example to perform each update. Mini-batch gradient descent uses an intermediate number of examples for each step. With mini-batch gradient descent, you loop over the mini-batches instead of looping over individual training examples.
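The mini-batch loop described above can be sketched on a toy problem. This is a minimal illustration, not the notebook's model: it fits a one-variable linear model, and the function name `minibatch_gd_linear`, the data, and all hyperparameter values are assumptions made for the example.

```python
import numpy as np

def minibatch_gd_linear(X, Y, batch_size=2, learning_rate=0.05, num_epochs=500, seed=0):
    """Fit y = w*x + b with mini-batch gradient descent (hypothetical toy example)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                      # number of training examples (columns)
    w, b = 0.0, 0.0
    for epoch in range(num_epochs):
        perm = rng.permutation(m)       # reshuffle the examples every epoch
        # Loop over mini-batches rather than over single examples
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            x, y = X[0, idx], Y[0, idx]
            err = w * x + b - y         # prediction error on this mini-batch
            dw = np.mean(err * x)       # gradient of the mean squared error / 2
            db = np.mean(err)
            w -= learning_rate * dw     # same update rule as plain gradient descent
            b -= learning_rate * db
    return w, b

X = np.array([[0.0, 1.0, 2.0, 3.0, 4.0]])
Y = 2.0 * X + 1.0                       # noise-free targets: w = 2, b = 1
w, b = minibatch_gd_linear(X, Y)
print("w =", w, "b =", b)
```

Setting `batch_size` to 1 gives stochastic gradient descent, and setting it to `m` gives batch gradient descent; only the number of examples per update changes.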

Figure 2 : SGD vs Mini-Batch GD
“+” denotes a minimum of the cost. Using mini-batches in your optimization algorithm often leads to faster optimization.

What you should remember:
- The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
- You have to tune a learning rate hyperparameter α.
- With a well-tuned mini-batch size, mini-batch gradient descent usually outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).

2 - Mini-Batch Gradient descent

Let’s learn how to build mini-batches from the training set (X, Y).

There are two steps:
- Shuffle: Create a shuffled version of the training set (X, Y) as shown below. Each column of X and Y represents a training example. Note that the random shuffling is done synchronously between X and Y, so that after the shuffling the ith column of X is the example corresponding to the ith label in Y. The shuffling step ensures that examples will be split randomly into different mini-batches.
- Partition: Partition the shuffled (X, Y) into mini-batches of size mini_batch_size (here 64). Note that the number of training examples is not always divisible by mini_batch_size. The last mini-batch might be smaller, but you don’t need to worry about this. When the final mini-batch is smaller than the full mini_batch_size, it will look like this:

Exercise: Implement random_mini_batches. We coded the shuffling part for you. To help you with the partitioning step, we give you the following code that selects the indexes for the 1st and 2nd mini-batches:

first_mini_batch_X = shuffled_X[:, 0 : mini_batch_size]
second_mini_batch_X = shuffled_X[:, mini_batch_size : 2 * mini_batch_size]
...

Note that the last mini-batch might end up smaller than mini_batch_size=64. Let $\lfloor s \rfloor$ represent $s$ rounded down to the nearest integer (this is math.floor(s) in Python). If the total number of examples is not a multiple of mini_batch_size=64, then there will be $\lfloor \frac{m}{mini\_batch\_size} \rfloor$ mini-batches with a full 64 examples, and the number of examples in the final mini-batch will be $m - mini\_batch\_size \times \lfloor \frac{m}{mini\_batch\_size} \rfloor$.
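As a quick check of this arithmetic, the short sketch below computes the size of every mini-batch for a given m. The helper name `mini_batch_sizes` is hypothetical, and m = 148 is an assumed value chosen because it reproduces the (64, 64, 20) split seen in the test output further down.

```python
import math

def mini_batch_sizes(m, mini_batch_size=64):
    """Return the list of mini-batch sizes for m examples (hypothetical helper)."""
    num_complete = math.floor(m / mini_batch_size)   # number of full mini-batches
    sizes = [mini_batch_size] * num_complete
    remainder = m - mini_batch_size * num_complete   # examples left for the final mini-batch
    if remainder != 0:
        sizes.append(remainder)
    return sizes

print(mini_batch_sizes(148))  # [64, 64, 20]
print(mini_batch_sizes(128))  # [64, 64]
```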

# GRADED FUNCTION: random_mini_batches

def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """

    np.random.seed(seed)            # To make your "random" minibatches the same as ours
    m = X.shape[1]                  # number of training examples
    mini_batches = []

    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1,m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m/mini_batch_size) # number of mini-batches of size mini_batch_size in your partitioning
    for k in range(0, num_complete_minibatches):
        ### START CODE HERE ### (approx. 2 lines)
        mini_batch_X = shuffled_X[:, k * mini_batch_size : (k+1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size : (k+1) * mini_batch_size]
        ### END CODE HERE ###
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        ### START CODE HERE ### (approx. 2 lines)
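        # Index from num_complete_minibatches so this also works when m < mini_batch_size (k would be undefined)
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]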
        ### END CODE HERE ###
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches

X_assess, Y_assess, mini_batch_size = random_mini_batches_test_case()
mini_batches = random_mini_batches(X_assess, Y_assess, mini_batch_size)

print ("shape of the 1st mini_batch_X: " + str(mini_batches[0][0].shape))
print ("shape of the 2nd mini_batch_X: " + str(mini_batches[1][0].shape))
print ("shape of the 3rd mini_batch_X: " + str(mini_batches[2][0].shape))
print ("shape of the 1st mini_batch_Y: " + str(mini_batches[0][1].shape))
print ("shape of the 2nd mini_batch_Y: " + str(mini_batches[1][1].shape)) 
print ("shape of the 3rd mini_batch_Y: " + str(mini_batches[2][1].shape))
print ("mini batch sanity check: " + str(mini_batches[0][0][0][0:3]))

shape of the 1st mini_batch_X: (12288, 64)
shape of the 2nd mini_batch_X: (12288, 64)
shape of the 3rd mini_batch_X: (12288, 20)
shape of the 1st mini_batch_Y: (1, 64)
shape of the 2nd mini_batch_Y: (1, 64)
shape of the 3rd mini_batch_Y: (1, 20)
mini batch sanity check: [ 0.90085595 -0.7612069   0.2344157 ]

Expected Output:

shape of the 1st mini_batch_X	(12288, 64)
shape of the 2nd mini_batch_X	(12288, 64)
shape of the 3rd mini_batch_X	(12288, 20)
shape of the 1st mini_batch_Y	(1, 64)
shape of the 2nd mini_batch_Y	(1, 64)
shape of the 3rd mini_batch_Y	(1, 20)
mini batch sanity check	[ 0.90085595 -0.7612069 0.2344157 ]

What you should remember:
- Shuffling and Partitioning are the two steps required to build mini-batches
- Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.

3 - Momentum

Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will “oscillate” toward convergence. Using momentum can reduce these oscillations.

Momentum takes into account the past gradients to smooth out the update. We will store the ‘direction’ of the previous gradients in the variable v. Formally, this will be the exponentially weighted average of the gradient on previous steps. You can also think of v as the “velocity” of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill.

Figure 3: The red arrows show the direction taken by one step of mini-batch gradient descent with momentum. The blue points show the direction of the gradient (with respect to the current mini-batch) on each step. Rather than just following the gradient, we let the gradient influence v and then take a step in the direction of v.
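The idea of letting the gradient influence v and then stepping along v can be sketched as follows. This is a minimal illustration using the common exponentially weighted average form of momentum; the function name `momentum_step`, the parameter names `beta` and `learning_rate`, and their default values are assumptions for the example, not taken from the text above.

```python
import numpy as np

def momentum_step(w, v, grad, beta=0.9, learning_rate=0.1):
    """One momentum update (illustrative sketch; names and defaults are assumptions)."""
    v = beta * v + (1.0 - beta) * grad   # exponentially weighted average of past gradients
    w = w - learning_rate * v            # step in the direction of v, not of the raw gradient
    return w, v

w = np.array([1.0, -2.0])
v = np.zeros_like(w)                     # velocity starts at zero, like initialize_velocity
noisy_grads = [np.array([1.0, -1.0]), np.array([-0.8, 1.2]), np.array([1.2, -0.9])]
for g in noisy_grads:
    w, v = momentum_step(w, v, g)
    print("grad:", g, "velocity:", v)
```

Because each new gradient contributes only a fraction (1 - beta) to v, sign flips in the mini-batch gradients change the velocity much less than they change the raw gradient, which is what damps the oscillations.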

Exercise: Initialize the velocity. The velocity, v, is a python dictionary that needs to be initialized with arrays of zeros. Its keys are the same as those in the grads dictionary, that is:
for l=1,...,L:

v["dW" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["W" + str(l+1)])
v["db" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["b" + str(l+1)])

Note that the iterator l starts at 0 in the for loop while the first parameters are v[“dW1”] and v[“db1”] (that’s a “one” on the superscript). This is why we are shifting l to l+1 in the for loop.

# GRADED FUNCTION: initialize_velocity

def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl

    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """

    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}

    # Initialize velocity
    for l in range(L):
        ### START CODE HERE ### (approx. 2 lines)
        v["dW" + str(l+1)] = np.zeros(parameters["W" + str(l+1)].shape)
        v["db" + str(l+1)] = np.zeros(parameters["b" + str(l+1)].shape)
        ### END CODE HERE ###

    return v