Machine Learning -- Week 1: Supervised Learning, Hypothesis Function, Cost Function, and the Gradient Descent Algorithm

  • Supervised Learning
    • given labelled data for training, then used to make predictions
    • used for regression problems and classification problems
  • Unsupervised Learning
    • derive structure from data where we don't necessarily know the effect of the variables
    • no feedback based on the prediction results
    • Clustering algorithms are just one type of Unsupervised Learning
    • the Cocktail Party Algorithm is a non-clustering example
  • The difference is whether there is supervision: if the input data are labelled, it is supervised learning; if they are not, it is unsupervised learning.

The difference between classification and regression lies in the type of the output variable.

A quantitative output is called regression, i.e. continuous-variable prediction;
a qualitative output is called classification, i.e. discrete-variable prediction.

For example:

Predicting tomorrow's temperature is a regression task;
predicting whether tomorrow will be cloudy, sunny or rainy is a classification task.

The trained prediction function is conventionally named h (short for hypothesis).

How do we represent h? For example: \(h_{\theta}(x) = \theta_{0} + \theta_{1}x\), where \(\theta_{i}\) are the parameters of the model.

Adjust \(\theta_{i}\) so that \(\sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2}\) is as small as possible, where m is the size of the training set. To minimize the average error rather than the total error, the summation is usually written as \(\frac{1}{2m}\sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2}\) (the extra factor of \(\frac{1}{2}\) simplifies the derivative later).

\(J(\theta_{0}, \theta_{1}) = \frac{1}{2m}\sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2}\) is the so-called cost function, also known as the squared error function.

To summarize:

Hypothesis:
\[ h_{\theta}(x) = \theta _{0} + \theta _{1}x \]
Parameters:
\[ \theta_{0},\theta_{1} \]
Cost Function:
\[ J(\theta_{0}, \theta_{1}) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^{2} \]
Goal:
\[ \underset{\theta_{0},\theta_{1}}{\text{minimize}}\; J(\theta_{0},\theta_{1}) \]
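As a quick check of these definitions, here is a minimal Python sketch (not part of the original notes); the toy data, the function names `hypothesis` and `cost`, and the sample parameter values are assumptions for illustration.

```python
# Minimal sketch of the hypothesis and the squared-error cost function.
# The toy data and parameter values below are illustrative assumptions.

def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x"""
    return theta0 + theta1 * x

def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1 / 2m) * sum_i (h(x_i) - y_i)^2"""
    m = len(xs)
    return sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]       # toy inputs
    ys = [2.0, 4.1, 5.9, 8.2]       # toy targets, roughly y = 2x
    print(cost(0.0, 2.0, xs, ys))   # small cost: theta1 = 2 fits the data well
    print(cost(0.0, 0.0, xs, ys))   # larger cost: h(x) = 0 fits poorly
```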

Use the Gradient Descent Algorithm to minimize the cost function.

Gradient Descent Algorithm (pseudocode):
\[ \begin{aligned} &\text{repeat until convergence}\;\{\\ &\qquad \theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial \theta_{j}}J(\theta_{0},\theta_{1}) \qquad (j = 0, 1)\\ &\} \end{aligned} \]
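To make the update rule concrete, here is a hedged Python sketch of the generic loop above. It estimates the partial derivatives numerically with central differences, so it works for any differentiable cost; the function name `gradient_descent`, the tolerance, and the example cost are illustrative assumptions, not part of the course material.

```python
# Sketch of the generic update rule above, using central-difference
# approximations of the partial derivatives so it works for any cost J.

def gradient_descent(J, theta, alpha=0.1, tol=1e-9, max_iters=10000, eps=1e-6):
    theta = list(theta)
    for _ in range(max_iters):
        # numerically estimate dJ/dtheta_j for every j
        grads = []
        for j in range(len(theta)):
            plus, minus = list(theta), list(theta)
            plus[j] += eps
            minus[j] -= eps
            grads.append((J(plus) - J(minus)) / (2 * eps))
        # compute all gradients first, then update every theta_j together
        new_theta = [t - alpha * g for t, g in zip(theta, grads)]
        if max(abs(n - t) for n, t in zip(new_theta, theta)) < tol:
            return new_theta
        theta = new_theta
    return theta

# Example: minimize J(theta) = (theta0 - 1)^2 + (theta1 + 2)^2, minimum at (1, -2)
print(gradient_descent(lambda th: (th[0] - 1) ** 2 + (th[1] + 2) ** 2, [0.0, 0.0]))
```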

where \(\alpha\) is a number called the learning rate, which controls the step size of gradient descent:

  • if \(\alpha\) is too small: the gradient descent algorithm can be too slow
  • if \(\alpha\) is too large: gradient descent can overshoot the minimum; it may fail to converge, or even diverge
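A tiny numeric experiment shows both failure modes. The sketch below (my own illustration, not from the notes) runs plain gradient descent on \(J(\theta)=\theta^{2}\), whose derivative is \(2\theta\) and whose minimum is at \(\theta=0\), with three assumed values of \(\alpha\).

```python
# Effect of the learning rate on J(theta) = theta^2 (gradient: 2 * theta).
# The specific alpha values are illustrative assumptions.

def run(alpha, theta=1.0, steps=20):
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # one gradient descent step
    return theta

print(run(alpha=0.01))   # too small: still far from 0 after 20 steps
print(run(alpha=0.4))    # reasonable: essentially at the minimum
print(run(alpha=1.1))    # too large: overshoots and diverges (|theta| keeps growing)
```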

An explanation of the algorithm:

When increasing \(\theta_{j}\) causes \(J(\theta_0,\theta_1)\) to increase, the partial derivative is \(>0\), so the effect of the update is to decrease \(\theta_j\); when increasing \(\theta_{j}\) causes \(J(\theta_0,\theta_1)\) to decrease, the partial derivative is \(<0\), so the update increases \(\theta_j\).

In this way \(\theta_j\) gradually slides down toward a point where the gradient is 0.

Moreover, even with a fixed \(\alpha\), the size of each change to \(\theta_{j}\) keeps shrinking during gradient descent, because the partial derivative tends to 0; so there is no need to further reduce \(\alpha\) as a local optimum is approached.
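This shrinking step size is easy to see numerically. The short sketch below (again an illustration with assumed values, using \(J(\theta)=\theta^{2}\)) prints the magnitude of \(\alpha\,\frac{dJ}{d\theta}\) at each iteration while \(\alpha\) stays constant.

```python
# With a fixed alpha, the step size |alpha * dJ/dtheta| still shrinks as theta
# approaches the minimum, because the derivative itself goes to 0.
# J(theta) = theta^2 and alpha = 0.1 are illustrative assumptions.

theta, alpha = 1.0, 0.1
for i in range(5):
    step = alpha * 2 * theta          # alpha * dJ/dtheta, with dJ/dtheta = 2 * theta
    theta -= step
    print(f"iteration {i}: step = {step:.4f}, theta = {theta:.4f}")
# The printed step sizes decrease monotonically even though alpha never changes.
```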

One thing to note: all \(\theta_{i}\) must be updated simultaneously, so we cannot simply apply expressions such as \(\theta_{0} := \theta_{0} - \alpha\frac{\partial}{\partial \theta_{0}}J(\theta_{0},\theta_{1})\) one after another; instead we should write:

$$
\begin{aligned}
temp_0 &:= \theta_0 - \alpha\frac{\partial}{\partial \theta_{0}}J(\theta_{0},\theta_{1})\\
temp_1 &:= \theta_1 - \alpha\frac{\partial}{\partial \theta_{1}}J(\theta_{0},\theta_{1})\\
\theta_{0} &:= temp_0\\
\theta_{1} &:= temp_1
\end{aligned}
$$

Only in this way can we avoid the \(J(\theta_0,\theta_1)\) used in the first and second expressions being inconsistent.

The gradient descent algorithm requires all \(\theta_{i}\) to be updated synchronously, as the sketch below illustrates.
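The sketch below (toy data and \(\alpha\) chosen only for illustration) contrasts one correct simultaneous update with the incorrect sequential version for the linear-regression cost; the two produce different values of \(\theta_1\).

```python
# Contrast of simultaneous vs. sequential updates for a single gradient-descent
# step on linear regression. The toy data and alpha are illustrative assumptions.

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
m = len(xs)
alpha = 0.1

def grad0(t0, t1):
    # dJ/dtheta0 = (1/m) * sum(h(x_i) - y_i)
    return sum((t0 + t1 * x - y) for x, y in zip(xs, ys)) / m

def grad1(t0, t1):
    # dJ/dtheta1 = (1/m) * sum((h(x_i) - y_i) * x_i)
    return sum((t0 + t1 * x - y) * x for x, y in zip(xs, ys)) / m

theta0 = theta1 = 0.0

# Correct: both gradients are evaluated at the OLD (theta0, theta1)
temp0 = theta0 - alpha * grad0(theta0, theta1)
temp1 = theta1 - alpha * grad1(theta0, theta1)
print("simultaneous:", temp0, temp1)

# Incorrect: theta0 is overwritten first, so grad1 sees an inconsistent J
wrong0 = theta0 - alpha * grad0(theta0, theta1)
wrong1 = theta1 - alpha * grad1(wrong0, theta1)   # uses the already-updated theta0
print("sequential:  ", wrong0, wrong1)             # theta1 differs from temp1
```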

Note: different starting points may lead to different local minima.

Substituting the expression for \(J(\theta_0,\theta_1)\) into the Gradient Descent Algorithm gives (pseudocode):
\[ \begin{aligned} &\text{repeat until convergence}\;\{\\ &\qquad \theta_{0} := \theta_{0} - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_{\theta}(x^{(i)})-y^{(i)}\right)\\ &\qquad \theta_{1} := \theta_{1} - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_{\theta}(x^{(i)})-y^{(i)}\right)\cdot x^{(i)}\\ &\} \end{aligned} \]
This form is called the “Batch” Gradient Descent Algorithm:

“Batch”: each step of gradient descent uses all the training examples.
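Putting everything together, here is a minimal batch-gradient-descent sketch for one-variable linear regression that follows the update rules above; the data set, learning rate, and iteration count are assumptions chosen only for illustration.

```python
# Minimal batch gradient descent for one-variable linear regression.
# Every step sums over all m training examples, which is what "batch" refers to.
# Data, alpha, and the iteration count are illustrative assumptions.

def batch_gradient_descent(xs, ys, alpha=0.05, iterations=2000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1]        # roughly y = 2x + 1
print(batch_gradient_descent(xs, ys))  # expect theta0 near 1 and theta1 near 2
```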