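Machine Learning, Week 3: Logistic Regression (Classification), Decision Boundary, Logistic Regression Cost Function, Multiclass Classification, and Regularization (for Logistic and Linear Regression)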
Classification
It's not a good idea to use linear regression for classification problems.
We can use the logistic regression algorithm instead, which is a classification algorithm.
To get \(0\le h_{\theta}(x) \le 1\), we only need the sigmoid function (also called the logistic function):
\[ \large h_\theta(x) = g(\theta^Tx), \quad\text{where}\;g(z) =\frac{1}{1+e^{-z}} \]
Interpretation of \(h_\theta(x)\): \(h_\theta(x)\) = estimated probability that \(y = 1\) on input \(x\)
Note: when \(z=0\), \(g(z)\) is exactly 0.5.
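As a quick check of the properties above, here is a minimal Octave sketch of the sigmoid. The name `sigmoid` and the sample inputs are my own choices, not something defined in these notes.

```octave
% Sigmoid (logistic) function, applied element-wise.
sigmoid = @(z) 1 ./ (1 + exp(-z));

z = [-10 -1 0 1 10];
g = sigmoid(z);
disp(g);            % every value lies strictly between 0 and 1
disp(sigmoid(0));   % g(0) is exactly 0.5
```

Because the function is element-wise, the same handle works on scalars, vectors, and matrices.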
Decision Boundary
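\(h_\theta(x) = P(y=1|x;\theta)\) (\(P\) denotes the predicted probability)
In the lecture example, we predict \(y=1\) if \(h_\theta(x) \ge 0.5\), and \(y=0\) otherwise.
Suppose \(\theta = \begin{bmatrix}-3\\ 1\\ 1 \end{bmatrix}\), so that \(h_\theta(x)=g(-3+x_1+x_2)\).
Since \(g(z) \ge 0.5\) exactly when \(z \ge 0\): "\(y=1\)" \(\Leftrightarrow\) "\(h_\theta(x) \ge 0.5\)" \(\Leftrightarrow\) "\(\theta^Tx \ge 0\)" \(\Leftrightarrow\) "\(-3+x_1+x_2 \ge 0\)"
This gives "\(y=1\)" \(\Leftrightarrow\) "\(x_1+x_2 \ge 3\)"
Whether \(x_1+x_2\) reaches \(3\) determines the value of \(y\); the line \(x_1+x_2=3\) is the decision boundary.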
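The rule above can be sketched in Octave as follows. The three sample points are hypothetical, chosen to sit below, on, and above the boundary.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = [-3; 1; 1];

% Each row is one example [x0 x1 x2], with x0 = 1 as the intercept term.
X = [1 1 1;    % x1 + x2 = 2 < 3
     1 2 1;    % x1 + x2 = 3, on the boundary
     1 3 2];   % x1 + x2 = 5 > 3

h = sigmoid(X * theta);   % estimated P(y = 1 | x; theta)
y_pred = h >= 0.5;        % same decision as X * theta >= 0
disp([h y_pred]);
```

Thresholding `h` at 0.5 and thresholding `X * theta` at 0 give the same predictions, which is why the boundary can be read directly off \(\theta\).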
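Extending to a non-linear decision boundary:
We can also have: predict "\(y=1\)" if \(-1+x_1^2+x_2^2 \ge 0\), with \(\theta = \begin{bmatrix}-1\\ 0\\ 0 \\ 1\\ 1 \end{bmatrix},\;x = \begin{bmatrix}x_0\\ x_1\\ x_2\\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix}1\\ x_1\\ x_2\\ x_1^2 \\ x_2^2 \end{bmatrix}\). The decision boundary is then the unit circle \(x_1^2+x_2^2=1\).
Different choices of \(\theta\) and different constructions of \(x\) yield decision boundaries of many shapes.
The decision boundary is determined by the choice of parameters \(\theta\), not directly by the training set.
We use the training set to fit the parameters \(\theta\).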
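To make the circular boundary concrete, here is a small sketch that builds the feature vector \([1, x_1, x_2, x_1^2, x_2^2]\) and applies the \(\theta\) from the example. The helper name `features` and the test points are assumptions for illustration.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = [-1; 0; 0; 1; 1];

% Map raw inputs (x1, x2) to the feature vector [1, x1, x2, x1^2, x2^2].
features = @(x1, x2) [ones(size(x1)), x1, x2, x1.^2, x2.^2];

x1 = [0; 1; 2];
x2 = [0; 0; 0];
y_pred = sigmoid(features(x1, x2) * theta) >= 0.5;
disp(y_pred');   % inside the unit circle -> 0, on or outside it -> 1
```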
Cost Function
\[ \begin{align} &J(\theta) =\frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})\end{align} \]
In linear regression, the cost we used was: \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x)-y)^2\)
But that choice is not universal. Once the hypothesis function \(h_\theta(x)\) is no longer linear, using \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x)-y)^2\) makes \(J(\theta)\) have many local optima, rather than being the convex function we want.
Logistic Regression Cost Function
\[ Cost(h_\theta(x),y) = \begin{cases} -log(h_\theta(x)) &\quad\text{if } y = 1 \\ -log(1-h_\theta(x)) &\quad\text{if } y = 0 \end{cases} \]
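When \(h_\theta(x)=y\), \(Cost(h_\theta(x),y)=0\).
When \(y=1\) and \(h_\theta(x)\rightarrow0\), \(Cost \rightarrow \infty\); this corresponds to \(\theta^Tx \rightarrow -\infty\).
When \(y=0\) and \(h_\theta(x)\rightarrow1\), \(Cost \rightarrow \infty\); this corresponds to \(\theta^Tx \rightarrow \infty\).
This guarantees that adjusting \(\theta\) to lower the cost pushes \(h_\theta(x)\) toward \(y\), so the predictions match the actual labels more closely.
The \(Cost\) function above can also be written as:
\[ Cost(h_\theta(x),y) = -y\cdot log(h_\theta(x))-(1-y)\cdot log(1-h_\theta(x)) \]
This is equivalent to the cases form above: when \(y=1\) the second term vanishes, and when \(y=0\) the first term vanishes.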
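A short numeric check of that equivalence, as a sketch; the handle name `cost` and the probe values are my own.

```octave
% Single-expression form of the per-example cost.
cost = @(h, y) -y .* log(h) - (1 - y) .* log(1 - h);

h = 0.8;
disp(cost(h, 1));      % equals -log(0.8), the y = 1 branch
disp(cost(h, 0));      % equals -log(0.2), the y = 0 branch
disp(cost(1e-6, 1));   % large cost for a confident but wrong prediction
```

One caveat with this form in floating point: if `h` is exactly 0 or 1, the vanishing term becomes `0 * log(0)`, which evaluates to NaN, so `h` should stay strictly inside (0, 1).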
Therefore:
\[ \begin{align} J(\theta) &=\frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})\\ &= -\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}\cdot log(h_\theta(x^{(i)}))+(1-y^{(i)})\cdot log(1-h_\theta(x^{(i)}))] \end{align} \]
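The formula for \(J(\theta)\) translates almost directly into vectorized Octave. This is a minimal sketch; the toy data and the two candidate \(\theta\) values are hypothetical.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% J(theta) for a design matrix X (m x (n+1)) and labels y (m x 1).
J = @(theta, X, y) -mean(y .* log(sigmoid(X * theta)) ...
                         + (1 - y) .* log(1 - sigmoid(X * theta)));

% Hypothetical toy data: intercept column plus one feature.
X = [1 0; 1 1; 1 2; 1 3];
y = [0; 0; 1; 1];

disp(J([0; 0], X, y));    % log(2), since every h is 0.5
disp(J([-3; 2], X, y));   % lower cost: this theta separates the data
```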
The general form of the gradient descent algorithm is the same as for linear regression (though it differs once \(h_\theta(x)\) is expanded):
\[ \begin{align}&\text{Repeat\{} \\ &\qquad\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\\ &\} \end{align} \]
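A minimal sketch of this loop in Octave, using the gradient of \(J(\theta)\) scaled by \(1/m\). The toy data, the learning rate, and the iteration count are assumptions picked only so the example runs quickly.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical toy data: intercept column plus one feature (not separable).
X = [1 0; 1 1; 1 2; 1 3];
y = [0; 1; 0; 1];
m = size(X, 1);

alpha = 0.1;            % learning rate (assumed)
theta = zeros(2, 1);
for iter = 1:1000
  grad = (1 / m) * X' * (sigmoid(X * theta) - y);   % all partial derivatives at once
  theta = theta - alpha * grad;                     % simultaneous update of every theta_j
end
disp(theta');
```

Computing `grad` from the old `theta` before assigning is what makes the update simultaneous for all \(\theta_j\).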
Other Optimization Algorithms
- Conjugate gradient
- BFGS (Broyden–Fletcher–Goldfarb–Shanno algorithm)
- L-BFGS (Limited-memory BFGS)
Advantages:
- No need to manually pick \(\alpha\)
- Often faster than gradient descent
Disadvantages:
- More complex
Writing these yourself is not recommended, but you can simply call a library:
%{
% Template for the cost function: returns the value of J(theta) in 'jVal'
% and its partial derivatives in 'gradient'.
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(theta)];
  gradient = zeros(n+1, 1);
  gradient(1) = [code to compute ∂J(theta)/∂theta_0];
  gradient(2) = [code to compute ∂J(theta)/∂theta_1];
  ...
  gradient(n+1) = [code to compute ∂J(theta)/∂theta_n];  % Octave indices start at 1
end
%}
% 'GradObj' 'on' tells fminunc that costFunction also returns the gradient.
options = optimset('GradObj', 'on', 'MaxIter', 100);   % MaxIter is a number, not a string
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
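To see the template run end to end, here is a sketch with a concrete cost filled in. The quadratic \(J(\theta) = (\theta_1-5)^2 + (\theta_2-5)^2\) is a stand-in chosen for illustration, not the logistic regression cost; its minimum is known to be at \([5; 5]\), which makes the result easy to check.

```octave
1;  % marks this file as a script so the function below can be defined inline

% Toy cost J(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2, minimized at [5; 5].
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
disp(optTheta');   % close to 5 5
```

No learning rate appears anywhere: `fminunc` chooses its own step sizes, which is the first advantage listed above.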
Multiclass Classification:
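We use the one-vs-all (one-vs-rest) approach.
For each class, split the data into two groups, "this class" and "all remaining classes together", then fit a classifier for this class using the binary classification method from the earlier lectures.
(A classifier here is just a hypothesis.)
This yields \(n\) classifiers, where \(n\) is the total number of classes and \(y\) is the class label:
\[ h_\theta^{(i)}(x) = P(y=i|x;\theta)\qquad (i=1,2,3,\dots,n) \]
That is, given \(x\) and \(\theta\), \(h_\theta^{(i)}(x)\) gives the probability that the class is \(i\).
To make a prediction for a new input \(x\), pick the class \(i\) whose classifier gives the highest probability: \(\arg\max_i h_\theta^{(i)}(x)\)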
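The prediction step can be sketched as below. The parameter matrix `Theta` is hypothetical (one column per already-fitted classifier, \(n = 3\) classes, one feature); fitting those classifiers is not shown.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical parameters for n = 3 classes; column i holds theta for classifier i.
Theta = [-1  0 -1;    % intercepts
         -1  0  1];   % weights on the single feature x1

X = [1 -3; 1 0; 1 4];           % rows are examples [x0 x1]
H = sigmoid(X * Theta);         % H(k, i) = h^{(i)}(x) for example k
[~, y_pred] = max(H, [], 2);    % index of the most probable class in each row
disp(y_pred');                  % 1 2 3
```

The second output of `max` is the position of the maximum, which here is the class label itself.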
Regularization
Regularization addresses overfitting; another term for this problem is high variance.
Overfitting arises from too many features combined with too little training data.
If we have too many features, the learned hypothesis may fit the training set very well (\(J(\theta) \approx 0\)) but fail to generalize to new examples.
Generalize: how well a hypothesis applies even to new examples.
Options for addressing overfitting:
- Reduce the number of features:
  - Manually select which features to keep
  - Model selection algorithm
- Regularization:
  - Keep all features, but reduce the magnitude/values of parameters \(\theta_j\)
  - Works well when we have a lot of features, each of which contributes a bit to predicting \(y\)
Regularized Linear Regression
The idea behind regularization:
Small values for parameters \(\theta_0, \theta_1,\dots,\theta_n\):
- "Simpler" hypothesis
- Less prone to overfitting
That is, make the \(\theta_j\) of overly influential terms very small. For example, with \(\theta_3, \theta_4 \approx 0\): \(\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4 \approx \theta_0 + \theta_1x + \theta_2x^2\)
Gradient Descent
However, regularization is not applied inside \(h_\theta(x)\); it is applied in the cost function \(J(\theta)\):
\[ \large J(\theta) =\frac{1}{2m} [\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 + \lambda\sum_{j=1}^{n}\theta_j^2 ] \]
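Note that the added term (called the regularization term) starts from \(j=1\): it shrinks every parameter except \(\theta_0\). \(\lambda\) is called the regularization parameter and controls the trade-off between two different goals.
The two \(\sum\) terms in this cost function represent those two goals:
- Fit the training data well
- Keep the parameters small
Smaller parameter values give a simpler hypothesis, which helps avoid overfitting.
Note: \(\lambda\) must not be too large. Otherwise \(\theta_1,\dots ,\theta_n \approx 0\), and the hypothesis fails to fit even the training set: high bias, i.e. underfitting.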
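The effect of a very large \(\lambda\) can be seen on a tiny example. This sketch minimizes the regularized \(J(\theta)\) in closed form by setting its gradient to zero, which gives \((X^TX + \lambda L)\theta = X^Ty\) with \(L\) the identity matrix except for a 0 in the \(\theta_0\) position. The data and the helper name `fit` are hypothetical.

```octave
% Hypothetical data lying exactly on y = 2 * x1.
X = [1 1; 1 2; 1 3; 1 4];
y = [2; 4; 6; 8];
L = diag([0 1]);     % penalize theta_1 only, not theta_0

% Minimizer of the regularized J(theta) for a given lambda.
fit = @(lambda) (X' * X + lambda * L) \ (X' * y);

disp(fit(0)');       % close to 0 2: fits the data exactly
disp(fit(1e6)');     % slope near 0, intercept near mean(y) = 5: underfits
```

With \(\lambda = 10^6\) the hypothesis is nearly the constant line \(h_\theta(x) \approx 5\), which is the underfitting described above.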
\[ \begin{align} &\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\ &\qquad \theta_{0}\; \text{:= } \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\ &\qquad \theta_{j}\; \text{:= } \theta_{j} - \alpha[\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j] \qquad (j = 1,2...,n)\\ &\} \end{align} \]
Equivalently:
\[ \begin{align} &\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\ &\qquad \theta_{0}\; \text{:= } \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\ &\qquad \theta_{j}\; \text{:= } \theta_{j}(1-\alpha\frac{\lambda}{m}) - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}\qquad (j = 1,2...,n)\\ &\} \end{align} \]
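The two ways of writing the update can be checked against each other with a single step. A sketch under assumed values: the data, \(\alpha\), \(\lambda\), and the starting \(\theta\) are all hypothetical.

```octave
% Hypothetical toy data: intercept column plus two features.
X = [1 1 2; 1 2 1; 1 3 4; 1 4 3];
y = [3; 3; 7; 7];
m = size(X, 1);
alpha = 0.01;  lambda = 1;       % assumed values
theta = [0.5; 1; -1];

err = X * theta - y;             % h_theta(x^(i)) - y^(i) for linear regression
grad = (1 / m) * X' * err;       % unregularized gradient

% First form: add (lambda / m) * theta_j for j >= 1 only.
reg = (lambda / m) * theta;
reg(1) = 0;                      % theta_0 is not regularized
thetaA = theta - alpha * (grad + reg);

% Second form: shrink theta_j by (1 - alpha * lambda / m), then take the usual step.
shrink = [1; (1 - alpha * lambda / m) * ones(2, 1)];
thetaB = theta .* shrink - alpha * grad;

disp(max(abs(thetaA - thetaB)));   % the two forms agree up to rounding
```

The second form shows what regularization does on each iteration: every \(\theta_j\) with \(j \ge 1\) is first multiplied by a factor slightly below 1.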
Normal Equation
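Review: the earlier normal equation was \(\theta = (X^TX)^{-1}X^Ty\)
With regularization it becomes \(\theta = (X^TX+\lambda \small{\begin{bmatrix}0 \\&1 \\ &&1\\&&&\ddots\\&&&&1 \end{bmatrix}})^{-1}X^Ty,\quad \large\text{if }\lambda \gt 0\)
As for non-invertible (degenerate) matrices, we can still use pinv() in Octave, which computes the pseudoinverse.
However, as long as \(\lambda\) is strictly greater than 0, it can be shown that the sum of the two matrices in the parentheses is invertible.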
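Here is a sketch of the regularized normal equation on a case where \(X^TX\) alone is singular (fewer examples than parameters). The data and \(\lambda\) are hypothetical, and I solve the linear system with Octave's `\` operator rather than forming the inverse explicitly, which is a swapped-in but equivalent way to evaluate the formula.

```octave
% Hypothetical data with m = 2 examples and n = 2 features, so X' * X is singular.
X = [1 1 2; 1 2 4];
y = [1; 2];
lambda = 0.1;                    % any lambda > 0

L = eye(size(X, 2));
L(1, 1) = 0;                     % do not regularize theta_0

A = X' * X + lambda * L;
theta = A \ (X' * y);            % solve the linear system instead of forming inv(A)

disp(rank(X' * X));              % 2: not invertible on its own
disp(rank(A));                   % 3: invertible once lambda > 0
```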
Regularized Logistic Regression
Review: \( J(\theta) = -\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}\, log\,h_\theta(x^{(i)})+(1-y^{(i)})\, log\,(1-h_\theta(x^{(i)}))]\)
As with linear regression, we append a regularization term \(\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2\) to the end of the expression:
\[ J(\theta) = -\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}\, log\,h_\theta(x^{(i)})+(1-y^{(i)})\, log\,(1-h_\theta(x^{(i)}))] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2 \]
Gradient descent (the general form is the same as for linear regression; the only difference is again \(h_\theta(x^{(i)})\)):
\[ \begin{align} &\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\ &\qquad \theta_{0}\; \text{:= } \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\ &\qquad \theta_{j}\; \text{:= } \theta_{j} - \alpha[\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j] \qquad (j = 1,2...,n)\\ &\} \end{align} \]
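In Octave the earlier code template still applies; just remember to include the partial derivative of the regularization term when computing \(\frac{\partial J(\theta)}{\partial \theta_j}\;(\small j=1,2,\dots,n)\):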
%{
% Same template as before: returns the value of J(theta) in 'jVal'
% and its partial derivatives in 'gradient'.
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(theta), including the regularization term];
  gradient = zeros(n+1, 1);
  gradient(1) = [code to compute ∂J(theta)/∂theta_0];    % no regularization term
  gradient(2) = [code to compute ∂J(theta)/∂theta_1];    % includes (lambda/m) * theta_1
  ...
  gradient(n+1) = [code to compute ∂J(theta)/∂theta_n];  % Octave indices start at 1
end
%}
options = optimset('GradObj', 'on', 'MaxIter', 100);   % MaxIter is a number, not a string
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
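For a concrete version of that template, here is a sketch of regularized logistic regression fitted with `fminunc`. The function name `costFunctionReg`, its extra arguments `X`, `y`, `lambda`, and the toy data are all assumptions; the template above takes only `theta`, so an anonymous function is used to bind the rest.

```octave
1;  % marks this file as a script so the function below can be defined inline

% Regularized logistic regression cost and gradient.
% X, y, lambda are extra arguments (an assumption; the template takes only theta).
function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));
  reg = theta;
  reg(1) = 0;                                   % theta_0 is not penalized
  jVal = -mean(y .* log(h) + (1 - y) .* log(1 - h)) ...
         + (lambda / (2 * m)) * sum(reg .^ 2);
  gradient = (1 / m) * X' * (h - y) + (lambda / m) * reg;
end

% Hypothetical toy data: intercept column plus one feature.
X = [1 0; 1 1; 1 2; 1 3];
y = [0; 0; 1; 1];
lambda = 1;

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = ...
    fminunc(@(t) costFunctionReg(t, X, y, lambda), initialTheta, options);
disp(optTheta');
```

This toy data is perfectly separable, so without the penalty the parameters would keep growing; with \(\lambda > 0\) the minimizer is finite.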