[ML] Hung-yi Lee Machine Learning Notes

My GitHub link - course-related code:

https://github.com/YidaoXianren/Machine-Learning-course-note

0. Introduction

Machine Learning: define a set of functions, measure the goodness of a function, pick the best function
Regression outputs a scalar; Classification outputs either (1) yes or no (Binary Classification) or (2) one of several classes (Multi-class Classification)
Choosing a different function set means choosing a different model. The simplest model is the linear model; there are also many nonlinear models, such as deep learning, SVM, decision tree, kNN, ... All of the above are supervised learning, which requires collecting a lot of training data

Semi-supervised learning - some of the data has labels and some does not
Transfer Learning - data not related to the task considered
Unsupervised Learning
Structured Learning - Beyond Classification (the output is a structured object)
Reinforcement Learning - no supervised guidance, only a scoring mechanism that says good or bad (learning from critics)
Blue: scenario; red: task (the problem to solve); green: method.

[Figure: learning map]

1. Regression

output a scalar
Step 1: Model: $y = b + wx_{cp}$, where w and b are parameters (w: weight, b: bias)

Linear Model: $y = b + \sum_i w_ix_i$
Step 2: Goodness of Function - Loss function L: the input is a function, the output is how bad it is
First version of the loss function: $L(f) = \sum_n (\hat y^n - f(x^n_{cp}))^2$, or equivalently $L(w,b) = \sum_n (\hat y^n - (b+wx^n_{cp}))^2$
Step 3: Best Function - $w^*, b^* = arg \min_{w,b}L(w,b) = arg \min_{w,b}\sum_n (\hat y^n-(b+wx^n_{cp}))^2$
Gradient Descent - usable whenever the loss function is differentiable with respect to its parameters; the model does not have to be linear
- Pick an initial value $w^0$ ; - Compute $\frac{dL}{dw}|_{w=w^0}$ ; - Update $w^1 \leftarrow w^0 - \eta \frac{dL}{dw}|_{w=w^0}$ , where $\eta$ is the learning rate. Repeat this step until the gradient is zero.
For two parameters $w^*, b^*$ : - Pick initial values $w^0, b^0$ ; - Compute $\frac{\partial L}{\partial w}|_{w=w^0, b=b^0}, \frac{\partial L}{\partial b}|_{w=w^0, b=b^0}$ ; - Update $w^1 \leftarrow w^0 - \eta \frac{\partial L}{\partial w}|_{w=w^0, b=b^0}$ , $b^1 \leftarrow b^0 - \eta \frac{\partial L}{\partial b}|_{w=w^0, b=b^0}$ . Repeat this step until the gradient is zero.
The goal of the procedure above is a $\theta^*$ that satisfies $\theta^* = arg \min_\theta L(\theta)$
Drawback of gradient descent: it may get stuck at a saddle point or a local minimum
For linear regression the loss function is convex, so this drawback does not arise.
Linear Regression - Gradient descent formula summary:
- $L(w,b) = \sum^{10}_{n=1}(\hat y^n-(b+wx^n_{cp}))^2$
- $\frac{\partial L}{\partial w} = \sum^{10}_{n=1}2(\hat y^n-(b+wx^n_{cp}))(-x^n_{cp})$
- $\frac{\partial L}{\partial b} = \sum^{10}_{n=1}2(\hat y^n-(b+wx^n_{cp}))(-1)$
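The formula summary above can be turned into a few lines of code. This is only a minimal sketch: the function name `gradient_descent`, the toy data, the learning rate and the step count are all assumptions made for illustration, not values from the course.

```python
def gradient_descent(xs, ys, lr=0.01, steps=5000):
    """Fit y = b + w*x by minimizing the summed squared error."""
    w, b = 0.0, 0.0  # initial values w^0, b^0 (an arbitrary choice)
    for _ in range(steps):
        # dL/dw = sum_n 2*(y - (b + w*x)) * (-x)
        grad_w = sum(2 * (y - (b + w * x)) * (-x) for x, y in zip(xs, ys))
        # dL/db = sum_n 2*(y - (b + w*x)) * (-1)
        grad_b = sum(2 * (y - (b + w * x)) * (-1) for x, y in zip(xs, ys))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 1 + 2x (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
```

Because the loss is convex here, the result should not depend on the starting point, only on the learning rate being small enough for the updates to stay stable.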
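A more complex model does not necessarily perform better on test data; it may be overfitting
Remedies for overfitting: 1. collect more training data 2. regularization
Regularization
- $y = b + \sum_i w_ix_i$ , $L = \sum_n(\hat y^n-(b + \sum_i w_ix^n_i))^2 + \lambda \sum_i (w_i)^2$
- Besides choosing a function with a small loss, we also prefer a smooth function (regularization makes the function smoother because the weights w are smaller) - a smoother function is more likely to be correct
- With a large $\lambda$ the function found is smoother; with a small $\lambda$ it is less smooth. As $\lambda$ grows, the objective has to keep the weights small as well as the loss, so relatively less emphasis falls on the training error (compared with no regularization), and the training error increases with $\lambda$ . The test error first decreases and then increases.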

2. Error

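Estimating the mean and variance of a variable x from N samples: the estimator of the mean is $m = \frac{1}{N}\sum_nx^n$ (unbiased, $E[m] = \mu$) ; the estimator of the variance is $s^2 = \frac{1}{N}\sum_n(x^n-m)^2$ , which is biased: $E[s^2] = \frac{N-1}{N}\sigma^2 \neq \sigma^2$ ; we want low bias & low variance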
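The relation $E[s^2] = \frac{N-1}{N}\sigma^2$ can be checked exactly on a tiny example by enumerating every possible sample instead of simulating. The sketch below assumes a made-up population (a fair coin taking values 0 and 1, so $\sigma^2 = 0.25$) and a sample size of 4; the function name is hypothetical.

```python
from itertools import product


def mean_of_sample_variance(population, n):
    """Average s^2 = (1/n) * sum (x - m)^2 over all equally likely samples of size n."""
    total, count = 0.0, 0
    for sample in product(population, repeat=n):  # i.i.d. draws with replacement
        m = sum(sample) / n
        total += sum((x - m) ** 2 for x in sample) / n
        count += 1
    return total / count


population = [0.0, 1.0]  # uniform over {0, 1}: true variance sigma^2 = 0.25
sigma2 = 0.25
n = 4
expected_s2 = mean_of_sample_variance(population, n)
# (n - 1) / n * sigma2 = 0.75 * 0.25 = 0.1875
```

The average of $s^2$ comes out smaller than $\sigma^2$ by the factor $(N-1)/N$, which is the bias the notes refer to.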
With low-degree (simple) models the variance is small, while complicated models lead to large variance. A simple model is less affected by the sampled data.
Bias: measures whether the average of all the $f^*$ is close to $\hat f$ . Here $f^*$ is the best function (model) found in each training run (note: each run uses its own sample of data points), and $\hat f$ is the true function (model).
Simple models have larger bias & smaller variance, while complicated models have smaller bias & larger variance.
If the error comes mostly from large variance, the model is overfitting; if it comes mostly from large bias, the model is underfitting.
If the model cannot fit the training data, the bias is large; if it fits the training data well but not the test data, the variance is large.
For large bias: add more features, or use a more complicated model
For large variance: get more data, or use regularization (all the curves become smoother)
Cross Validation: Training Set, Validation Set, Testing Set (Public, Private)
N-fold Cross Validation: split the data into a training set and a validation set; the training set is used to train the models and the validation set to choose among them. With N folds, each fold takes a turn as the validation set and the results are averaged. Once a model is chosen, retrain its parameters on the whole data set (training set + validation set).
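A minimal sketch of how the folds in N-fold cross validation could be generated. The helper name `k_fold_splits` and the choice to hand the leftover samples to the first folds are assumptions; libraries may split differently.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


splits = list(k_fold_splits(10, 3))
```

Each candidate model would be trained on `train` and scored on `val` for every fold, and the model with the best average validation score is then retrained on all the data.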

3. Gradient Descent

$\theta^* = arg \min_\theta L(\theta)$ , where L is the loss function and $\theta$ the parameters
Suppose $\theta$ has two variables $\{\theta_1, \theta_2\}$ ; then:
$\theta^0 = \left[ \begin{matrix} \theta^0_1 \\ \theta^0_2 \end{matrix} \right]$ ; $\theta^1 = \theta^0 - \eta\triangledown L(\theta^0)$ ; that is, $\left[ \begin{matrix} \theta^1_1 \\ \theta^1_2 \end{matrix} \right] = \left[ \begin{matrix} \theta^0_1 \\ \theta^0_2 \end{matrix} \right] - \eta \left[ \begin{matrix} \partial L(\theta^0)/\partial \theta_1 \\ \partial L(\theta^0)/\partial \theta_2 \end{matrix} \right]$ , where $\triangledown L(\theta)= \left[ \begin{matrix} \partial L(\theta)/\partial \theta_1 \\ \partial L(\theta)/\partial \theta_2 \end{matrix} \right]$
Setting the learning rate:
- Plot the loss against the number of parameter updates and watch the trend;
- Reduce the learning rate by some factor every few epochs - e.g. 1/t decay: $\eta^t = \eta/\sqrt{t+1}$ ;
- Give different learning rates to different parameters - Adagrad - divide the learning rate of each parameter by the root mean square of its previous derivatives
  
  With $\eta^t = \eta/\sqrt{t+1}$ and $\sigma^t = \sqrt{\frac{1}{t+1}\sum^t_{i=0}(g^i)^2}$ , the factor $\sqrt{t+1}$ in $\eta^t / \sigma^t$ cancels, which gives the following form:
- Adagrad update rule: $w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum^t_{i=0}(g^i)^2}}g^t$ , where $g^t = \frac{\partial L(\theta^t)}{\partial w}$ is the current gradient (partial derivative). Dividing by the accumulated past gradients emphasizes the contrast between the current gradient and its history.
- The best step is $\frac{|First\ derivative|}{Second\ derivative}$ . Adagrad in effect approximates this ratio, but is cheaper than computing the second derivative directly.
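The Adagrad update rule above, sketched for a single parameter. The objective $L(w) = (w-3)^2$, the starting point and the small `eps` added to the denominator (to avoid dividing by zero) are assumptions for illustration and are not part of the notes.

```python
import math


def adagrad(grad_fn, w0, lr=1.0, steps=200, eps=1e-8):
    """Minimize a 1-D function with the Adagrad update rule."""
    w, sum_sq = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        sum_sq += g * g  # running sum of all squared gradients so far
        # w^{t+1} = w^t - lr / sqrt(sum of squared gradients) * g^t
        w -= lr / (math.sqrt(sum_sq) + eps) * g
    return w


# L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3); the minimum is at w = 3.
w_star = adagrad(lambda w: 2 * (w - 3), w0=0.0)
```

The effective step size shrinks automatically as squared gradients accumulate, so no hand-tuned decay schedule is needed in this toy case.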
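Stochastic Gradient Descent (makes training faster):
- In Gradient Descent the loss function sums the loss over all examples (update after seeing all examples). SGD picks one example at random, computes the loss on that example alone, and then updates the parameters (update for each example).
- $L^n = (\hat y^n - (b + \sum w_ix^n_i))^2$ , $\theta^i = \theta^{i-1} - \eta \triangledown L^n(\theta^{i-1})$
Feature Scaling (normalization)
- Without feature scaling, input features with different scales make the output react very differently to the same small change in a weight: a weight attached to a small-scale feature changes the output only slightly, while a weight attached to a large-scale feature changes it a lot.
- Each update moves perpendicular to the contour lines of the loss, so updates are more efficient after scaling (the contours are closer to circles, so the update direction points more directly at the minimum).
- How to do feature scaling:
  - for each dimension i: compute the mean ( $m_i$ ) and the standard deviation ( $\sigma_i$ ).
  - replace each value using $x^r_i \leftarrow \frac{x^r_i - m_i}{\sigma_i}$ - after this step, the mean of every dimension is 0 and the variance is 1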
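The two feature-scaling steps above, as a small sketch. The function name `standardize` and the toy rows are assumptions; the code also assumes no dimension is constant (otherwise its standard deviation would be 0).

```python
import math


def standardize(rows):
    """Scale every feature dimension to mean 0 and variance 1."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(dims)]
    stds = [math.sqrt(sum((r[i] - means[i]) ** 2 for r in rows) / n)
            for i in range(dims)]
    return [[(r[i] - means[i]) / stds[i] for i in range(dims)] for r in rows]


# Two features with very different scales (hypothetical values).
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = standardize(data)
```

After scaling, both columns hold the same values even though the raw features differed by two orders of magnitude.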

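Gradient descent can be understood through the Taylor series: only when the learning rate is small enough is the loss well approximated by its first-order term, which is what guarantees that each update moves in a direction that decreases the loss
Cases where Gradient Descent may not work:
- Stuck at local minima
- Stuck at saddle point
- Very slow at the plateau
Analytical solution = Closed-form solution: a solution derived from exact formulas, giving the dependent variable for any value of the independent variables. Numerical solution: an approximate solution obtained by numerical analysis and approximation methods.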

4. Classification: Probabilistic Generative Model

Generative model
- Given x, which class does it belong to:
- $P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}$
- Estimating the probabilities from training data. Consider $C_1$ as class 1, and $C_2$ as class 2.
- Generative Model:
- $P(x) = P(x|C_1)P(C_1) + P(x|C_2)P(C_2)$
- Assume the points are sampled from a Gaussian distribution
- Gaussian distribution: $f_{\mu, \Sigma}(x) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}}exp\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\}$
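- Input - vector x; Output - probability (density) of sampling x. The shape of the function depends on the mean ( $\mu$ ) and the covariance matrix ( $\Sigma$ )
- Idea: assume the examples are points sampled from a Gaussian distribution; estimate the mean and covariance matrix from these points to find that Gaussian, then use its density to compute the probability that a new point belongs to this class
- A Gaussian with any mean and covariance matrix could have generated these points, only with a different likelihood; the parameters with the maximum likelihood are taken as the parameters of this class's Gaussian
- Likelihood of a Gaussian with mean $\mu$ and covariance matrix $\Sigma$ = the probability that the Gaussian samples the observed points
  - The likelihood is the product of the probabilities of all the points of this class
  - $L(\mu,\Sigma) = f_{\mu,\Sigma}(x^1)f_{\mu,\Sigma}(x^2)f_{\mu,\Sigma}(x^3) \cdots f_{\mu,\Sigma}(x^N)$
- The mean and covariance matrix with the maximum likelihood are:
  - $\mu^* = \frac{1}{N}\sum^N_{n=1}x^n$ , $\Sigma^* = \frac{1}{N}\sum^N_{n=1}(x^n - \mu^*)(x^n - \mu^*)^T$
- Letting the classes share one covariance matrix, the likelihood is:
  - $L(\mu^1, \mu^2, \Sigma) = f_{\mu^1,\Sigma}(x^1)f_{\mu^1,\Sigma}(x^2)...f_{\mu^1,\Sigma}(x^{79})\times f_{\mu^2,\Sigma}(x^{80})f_{\mu^2,\Sigma}(x^{81})...f_{\mu^2,\Sigma}(x^{140})$
  - $\mu^1$ and $\mu^2$ are computed as before, each as the mean of its own class's samples.
  - $\Sigma = \frac{79}{140}\Sigma^1 + \frac{61}{140}\Sigma^2$
  - With a shared covariance matrix, the resulting decision boundary is linear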
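To make the generative recipe concrete, here is a one-dimensional sketch (so the covariance matrix reduces to a single variance). The function names and the six toy points are assumptions for illustration; the shared variance is the sample-count-weighted average, mirroring the $\frac{79}{140}\Sigma^1 + \frac{61}{140}\Sigma^2$ formula above.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_shared_variance(class1, class2):
    """Maximum-likelihood means, shared variance, and prior P(C1) for two classes."""
    n1, n2 = len(class1), len(class2)
    mu1, mu2 = sum(class1) / n1, sum(class2) / n2
    var1 = sum((x - mu1) ** 2 for x in class1) / n1
    var2 = sum((x - mu2) ** 2 for x in class2) / n2
    var = (n1 * var1 + n2 * var2) / (n1 + n2)  # weighted average of the two
    return mu1, mu2, var, n1 / (n1 + n2)


def posterior_c1(x, mu1, mu2, var, prior1):
    """P(C1 | x) from Bayes' rule."""
    p1 = gaussian_pdf(x, mu1, var) * prior1
    p2 = gaussian_pdf(x, mu2, var) * (1 - prior1)
    return p1 / (p1 + p2)


class1 = [1.0, 2.0, 3.0]  # hypothetical samples of class 1
class2 = [6.0, 7.0, 8.0]  # hypothetical samples of class 2
params = fit_shared_variance(class1, class2)
```

With equal priors and a shared variance, the posterior crosses 0.5 exactly halfway between the two means (x = 4.5 here), which is the one-dimensional version of the linear boundary.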
Probability Distribution
- For binary features, you can use Bernoulli distributions instead.
  - If assume all dimension are independent, then it is Naive Bayes Classifier.
Posterior Probability
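- This connects the probabilistic model to logistic regression: when the classes share one covariance matrix, the posterior can be written in the form $\sigma(wx+b)$ .
- Derivation of the sigmoid function form
Bayes - prior, posterior, likelihood
- Bayes' rule: $P(y|x) = \frac{P(x|y)P(y)}{P(x)}$
- $P(y|x)$ is the posterior probability, $P(x|y)$ is the conditional probability or likelihood, $P(y)$ is the prior probability, and $P(x)$ is the normalizing constant (evidence)
- So Bayes' rule can also be stated as: posterior = (likelihood * prior) / normalizing constant. That is, the posterior is proportional to the product of the prior and the likelihood.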

5. Logistic Regression

want to find $P_{w,b} (C_1|x)$ , if it is larger than or equal to 0.5, then output C1; otherwise, output C2.
Assume the data is generated based on $f_{w,b}(x) = P_{w,b}(C_1|x)$ , then the probability of generating the data is $L(w,b) = f_{w,b}(x^1)f_{w,b}(x^2)(1-f_{w,b}(x^3))...f_{w,b}(x^N)$ , where $x^1, x^2, x^3, ...$ is training data. Then the most likely $w^*$ and $b^*$ is the one with largest $L(w,b)$ : $w^*, b^* = arg \max_{w,b}L(w,b) = arg \min_{w,b} -lnL(w,b)$
$w^*, b^* = arg \min_{w,b} \sum_n-[\hat y^nlnf_{w,b}(x^n) + (1-\hat y^n)ln(1-f_{w,b}(x^n))]$ - each term of the sum is the cross entropy between two Bernoulli distributions.
Cross entropy
- assume distribution p: $p(x=1) = \hat y^n$ ; $p(x=0) = 1 - \hat y^n$
- assume distribution q: $q(x=1) = f(x^n)$ ; $q(x=0) = 1 - f(x^n)$
- Then the cross entropy is: $H(p,q) = - \sum_xp(x)ln(q(x))$
- Minimizing the cross entropy is the same as maximizing the likelihood
- Note: C denotes the cross entropy
In step 2 of logistic regression we use cross entropy rather than square error: with square error, the partial derivative $\partial L/\partial w_i$ is 0 whenever $f_{w,b}(x^n)$ equals 1 or 0, whether or not that output is correct, so the parameters cannot be updated.
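A small numeric check of the argument above, written with respect to $z = wx + b$ (the derivative with respect to $w_i$ is the same expression times $x_i$). The function names and the value z = -10 are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_cross_entropy(z, y_hat):
    """d/dz of -[y_hat*ln f + (1 - y_hat)*ln(1 - f)], with f = sigmoid(z)."""
    return sigmoid(z) - y_hat


def grad_square_error(z, y_hat):
    """d/dz of (f - y_hat)^2, with f = sigmoid(z)."""
    f = sigmoid(z)
    return 2 * (f - y_hat) * f * (1 - f)


# A badly wrong prediction: the target is 1 but f is very close to 0.
z_bad = -10.0
ce = grad_cross_entropy(z_bad, 1.0)
se = grad_square_error(z_bad, 1.0)
```

With cross entropy the gradient stays close to -1, so the parameters move quickly; with square error the factor $f(1-f)$ nearly cancels the gradient even though the prediction is far from the target.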
Discriminative v.s. Generative
- $P(C_1|x) = \sigma(wx + b)$
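- Discriminative: find w and b directly with logistic regression
- Generative: estimate the parameters of the Gaussians, then derive the corresponding w and b (or use another probabilistic approach)
- The two make different assumptions, so they give different results - on the same dataset, the discriminative model is often more accurate than the generative one
  - Naive Bayes assumes all feature dimensions are independent. For example, when classifying the binary inputs 00, 01, 10 and 11 with unbalanced samples, it may misclassify, because the model may imagine combinations that never occur in the data
  - Advantages of generative models: with the assumption of a probability distribution, less training data is needed; more robust to noise; priors and class-dependent probabilities can be estimated from different sources
Multi-class Classification - Softmax (to be updated: parameter update formulas)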
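The notes leave the softmax details for later. As a placeholder, here is a sketch of the softmax function itself, which turns one score per class into probabilities; the max-subtraction trick and the example scores are assumptions added for illustration, and the parameter update formulas are not covered.

```python
import math


def softmax(zs):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(zs)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([3.0, 1.0, -3.0])
```

The largest score takes most of the probability mass (about 0.88 here), while the ordering of the classes is preserved.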
Limitation of Logistic Regression
- The XOR problem cannot be solved directly - feature transformation can turn it into a solvable problem (not always easy...)
- Cascading logistic regression models (stack several logistic regression units, some doing feature transformation and some doing classification) - this is essentially deep learning
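One possible hand-made feature transformation for XOR, as a sketch: map each point to its distances from (0, 0) and (1, 1). The choice of these two reference points and the threshold 1.7 are assumptions picked by hand for illustration.

```python
import math


def transform(x):
    """Map a 2-D point to its distances from (0, 0) and (1, 1)."""
    return (math.dist(x, (0, 0)), math.dist(x, (1, 1)))


# XOR labels: class 1 when exactly one input is 1.
points = {(0, 0): 0, (1, 1): 0, (0, 1): 1, (1, 0): 1}


def linear_rule(x):
    """A linear decision rule in the transformed space (hand-picked threshold)."""
    d0, d1 = transform(x)
    return 1 if d0 + d1 > 1.7 else 0


predictions = {x: linear_rule(x) for x in points}
```

In the original space no single line separates the classes, but after the transformation a simple threshold on `d0 + d1` does. In a cascaded model, the transformation itself would be learned by the earlier units rather than designed by hand.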

6. Deep Learning

Given network structure, define a function set.
Classical machine learning needs feature transformation when a problem is not separable, whereas deep learning only needs a designed structure: how many layers, and how many neurons per layer
Universality Theorem: Any continuous function f can be realized by a network with one hidden layer given enough hidden neurons

7. Backpropagation

Chain rule
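Forward Pass: for every weight, compute $\frac{\partial z}{\partial w}$ , which equals the input a that the weight is connected to: $a = \frac{\partial z}{\partial w}$
Backward Pass: for every neuron, compute $\frac{\partial l}{\partial z}$ , propagated back from the output end
Finally: $\frac{\partial l}{\partial w} = \frac{\partial z}{\partial w} \times \frac{\partial l}{\partial z} = a \times \frac{\partial l}{\partial z}$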
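The chain-rule product above can be checked on a single sigmoid neuron by comparing it with a numerical derivative. The squared-error loss, the parameter values and the name `a_in` (the input a of the neuron) are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, a_in, y):
    """Squared error of one sigmoid neuron: l = (sigmoid(w*a_in + b) - y)^2."""
    return (sigmoid(w * a_in + b) - y) ** 2


def grad_w(w, b, a_in, y):
    """dl/dw computed as (dz/dw) * (dl/dz)."""
    z = w * a_in + b
    out = sigmoid(z)
    dz_dw = a_in                             # forward pass: dz/dw is the input
    dl_dz = 2 * (out - y) * out * (1 - out)  # backward pass: dl/dz
    return dz_dw * dl_dz


w, b, a_in, y = 0.5, -0.2, 1.5, 1.0
analytic = grad_w(w, b, a_in, y)
h = 1e-6
numeric = (loss(w + h, b, a_in, y) - loss(w - h, b, a_in, y)) / (2 * h)
```

In a multi-layer network `dl_dz` for a hidden neuron would itself be assembled from the `dl_dz` values of the next layer, which is what the backward pass does.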

8. Keras

Example: https://keras.io/
MNIST dataset download: http://yann.lecun.com/exdb/mnist/
- dense - adds a fully connected layer
- activation function: softplus, softsign, sigmoid, tanh, hard_sigmoid, linear, softmax.
- loss function: categorical crossentropy
- optimizer: adam, SGD, RMSprop, Adagrad, Adadelta, Adamax, Nadam
Mini-batch:
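- 1: split the whole training set into groups (mini-batches), assigning examples to each group at random
- 2: randomly initialize the network parameters
- 3: pick the first batch, compute its total loss, and update the network parameters once according to this loss
- 4: repeat step 3 with the next batch until every mini-batch has been used (one complete pass of step 4 is one epoch)
- *batch size = 1 is equivalent to Stochastic gradient descent (SGD); a smaller batch size means more updates in one epoch. A larger batch size is mainly for speed: for the same amount of data, an epoch with a large batch size takes much less time than with a small one. batch size = 10 is more stable and converges faster.
- *a very large batch size can yield worse performance - it easily gets stuck at a local minimum. (SGD or a small batch size mitigates this because the randomness of each batch helps the update jump out of regions where the gradient is 0)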
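A framework-free sketch of steps 1 and 4 above: shuffle the examples and hand them out one mini-batch at a time. This is not how Keras implements batching; the generator name and the fixed seed are assumptions for illustration.

```python
import random


def mini_batches(examples, batch_size, seed=0):
    """Shuffle the examples and yield consecutive batches; one full pass is one epoch."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


batches = list(mini_batches(range(10), batch_size=4))
updates_with_sgd = len(list(mini_batches(range(10), batch_size=1)))
```

With 10 examples, a batch size of 4 gives 3 parameter updates per epoch, while a batch size of 1 (SGD) gives 10, which is the "more updates in one epoch" point made above.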

9. Deep Neural Networks Troubleshooting

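First check the result on the training set. If it is bad, training may be stuck at a local minimum or similar, so go back and revisit the earlier settings; if the training result is good but the testing result is bad, the model is overfitting; if both are good, the DNN is ready to use. *Unlike other machine learning methods (SVM with RBF kernel, decision tree, kNN with k=1), deep learning does not necessarily reach 100% accuracy on the training set
A deeper network is not necessarily better; it may suffer from the Vanishing Gradient Problem: the layers near the input have very small gradients and the layers near the output have much larger ones, so with the same learning rate the early layers learn (update their parameters) very slowly while the later layers learn much faster. The later layers have already converged while the early layers are still almost random, so training stops making progress and is stuck in a situation with poor performance. The cause is the sigmoid activation function: it attenuates its input, squashing values from minus infinity to plus infinity into the range 0-1, so the effect of a change shrinks layer after layer, and with many layers the gradients of the early layers become very small.
- Solution 1: use a dynamic (adaptive) learning rate
- Solution 2: change the activation function to ReLU (Rectified Linear Unit)
  - Reasons to use ReLU: 1. faster to compute than sigmoid (no exponential) 2. biological motivation 3. equivalent to an infinite number of sigmoids with different biases stacked together 4. addresses the vanishing gradient problem
  - With ReLU, every unit whose input is below 0 outputs 0 and therefore has no influence on the rest of the network. Only the units with input above 0, and the units connected to them, remain, so for a given input the network reduces to a thinner linear network and the gradients do not shrink. Note: linear here means locally linear (linear in the neighbourhood of a given input); the network as a whole is still nonlinear - piecewise linear.
  - Variants: Leaky ReLU; Parametric ReLU; Exponential Linear Unit (ELU)
Maxout: each neuron has a learnable activation function (specifically a piecewise linear convex function) - the number of pieces depends on how many elements are put in one group. (ReLU is a special case: two elements per group, hence two pieces.)
- Maxout can be trained: given an input, we know which element is the max in every group, so the network reduces to a thin, linear one (keeping only the max units) and gradient descent applies directly. Note: each example produces a different reduced network, because the local max differs each time, so once all the examples have been seen, all the parameters of the original network get trained.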
Adaptive learning rate
- Adagrad $w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum^t_{i=0}(g^i)^2}}g^t$
- RMSProp (a variant of Adagrad)
- Momentum: momentum of last step minus gradient at present
  - Taking the previous movement into account amounts to taking all past movements into account
- Adam: RMSProp + Momentum
Remedies for overfitting:
- Early Stopping: when overfitting, the training error keeps decreasing as the number of epochs grows, but the testing error first decreases and then increases, so training can be stopped at the epoch with the smallest testing error (in practice a validation set stands in for the testing set).
- Regularization
  - Find a set of weights that not only minimizes the original cost (e.g. square error, cross entropy) but is also close to zero
  - L2 regularization
    - $L'(\theta) = L(\theta) + \lambda \frac{1}{2}\left \| \theta \right \|_2^2$ , $\theta = \{w_1, w_2, ...\}$
    - L2 regularization: $\left\| \theta \right\|_2^2 = (w_1)^2 + (w_2)^2 + ...$ ; Gradient: $\frac{\partial L'}{\partial w} = \frac{\partial L}{\partial w} + \lambda w$
    - Update rule: $w^{t+1} \leftarrow w^t - \eta (\frac{\partial L}{\partial w} + \lambda w^t) = (1-\eta \lambda)w^t - \eta \frac{\partial L}{\partial w}$
    - The factor $(1-\eta\lambda)$ is slightly less than 1, so every update pulls w closer to 0; the gradient term keeps it from actually reaching 0, except for weights that have little effect on L: their partial derivatives are close to 0, so they gradually decay to 0, which has the effect of reducing the number of parameters. This is why L2 regularization is also called weight decay.
  - L1 regularization
    - $L'(\theta) = L(\theta) + \lambda \left \| \theta \right \|_1$ , $\theta = \{w_1, w_2, ...\}$
    - L1 regularization: $\left\| \theta \right\|_1 = |w_1| + |w_2| + ...$ ; Gradient: $\frac{\partial L'}{\partial w} = \frac{\partial L}{\partial w} + \lambda sgn(w)$
    - Update rule: $w^{t+1} \leftarrow w^t - \eta (\frac{\partial L}{\partial w} + \lambda sgn(w^t)) = w^t - \eta \frac{\partial L}{\partial w} - \eta \lambda sgn(w^t)$
    - The last term always pushes w toward 0, whether w is positive or negative.
  - L1 vs. L2: L2 penalizes large weights more strongly (each update subtracts w times a fixed factor), while L1 treats all weights alike (each update subtracts the sign, plus or minus 1, times a fixed value). With L1 the weights can end up very different from each other, with some very close to 0; with L2 the weights are closer together overall, but few are very close to 0.
  - Regularization matters less in DNNs, because the parameters are initialized close to 0 to begin with. Its effect also overlaps with early stopping, so with early stopping regularization is less necessary.
- Dropout:
  - Before each update, each neuron has a p% chance of being dropped - the connections of the dropped neurons disappear as well, which makes the network thinner; the new thinner network is then used for training. (Note: the dropped neurons are resampled for each mini-batch.)
  - At testing time there is no dropout. If the dropout rate during training is p%, every weight is multiplied by (1 - p%) at testing time.
  - Dropout is a kind of ensemble (train many different networks and average them)
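Returning to the L2 and L1 update rules in the regularization notes above, the sketch below applies one update of each to a large and a small weight. The data gradient is set to 0 so that only the regularization term acts; the learning rate, $\lambda$ and the weight values are assumptions for illustration.

```python
def l2_step(w, grad, lr, lam):
    """w <- (1 - lr*lam) * w - lr * grad"""
    return (1 - lr * lam) * w - lr * grad


def sign(w):
    return (w > 0) - (w < 0)


def l1_step(w, grad, lr, lam):
    """w <- w - lr * grad - lr * lam * sgn(w)"""
    return w - lr * grad - lr * lam * sign(w)


lr, lam = 0.1, 0.5
# With a zero data gradient, only the regularization term changes the weights.
big_l2, small_l2 = l2_step(10.0, 0.0, lr, lam), l2_step(0.1, 0.0, lr, lam)
big_l1, small_l1 = l1_step(10.0, 0.0, lr, lam), l1_step(0.1, 0.0, lr, lam)
```

L2 removes 5% of each weight (0.5 from the large one, 0.005 from the small one), while L1 removes the same 0.05 from both. That is the difference behind "L2 penalizes large weights more strongly, L1 treats all weights alike".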
Practice - mnist:
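- If the model does not fit the training data well enough, try increasing the number of neurons in the hidden layers
- Cross entropy works much better than MSE on classification problems.
- To benefit from GPU acceleration, the batch size must be fairly large (e.g. 10000)
- Raising the batch size from 100 to 10000 lowered the accuracy (because there are fewer updates)
- Lowering the batch size from 100 to 1 made training very slow, because the GPU cannot exploit its parallelism
- Making the network deeper runs into vanishing gradients, and the performance does not improve.
- Replacing every sigmoid with ReLU raised the accuracy.
- Without normalization at the start, i.e. keeping the inputs in the 0-255 range, training also fails
  - Note: make a habit of inspecting the training set first. The author's dataset was already in the 0-1 range, and normalizing it once more as in the video made training fail
- Changing the optimizer from SGD to Adam speeds up convergence, as seen in how the accuracy changes
- When noise is added to the testing data, its accuracy is low (while the training accuracy is high); dropout can help. Note: with dropout the performance on the training set gets worse, but the performance on the testing set improves
- Note: the acc that Keras prints at each step is the training accuracy of the current epoch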

10. Convolutional Neural Network (CNN)

Why CNN for image:
- Some patterns are much smaller than the whole image: a neuron does not have to see the whole image to discover the pattern. (convolution)
- The same patterns appear in different regions. (convolution)
- Subsampling the pixels will not change the object. (max pooling)
CNN - Convolution
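- The values in a filter matrix are parameters learned from the training data, but the size of the filter matrix and the number of filters are design choices.
- stride = the distance the filter moves at each step
- The new image produced once a filter has swept the input is called a feature map
- Convolution is a restricted version of a fully connected layer: each element of the filter acts as a weight multiplied with a pixel of the local patch. With a 3x3 filter, for example, each output connects to only 9 inputs, so it is not fully connected. And since the same filter is convolved over the whole image, the weights are shared
- After convolution, each filter yields one channel
CNN - Max Pooling (downsampling)
- Keep the largest pixel value within each n x n region
- Average pooling can also be used for downsampling
CNN - Flatten
- Pull out all the pixel values in every channel and stretch them into an n x 1 vector
- After these steps the result can be fed into a fully connected network and trained with gradient descent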
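The convolution, max pooling and flatten steps above can be traced by hand on a tiny input. This sketch uses plain lists, one square filter and no padding; the 6x6 image and the diagonal-detecting 3x3 filter are made-up illustration data, and in a real CNN the filter values would be learned.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the filter over the image (no padding) and return the feature map."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    return [[sum(image[i * stride + a][j * stride + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(out_size)]
            for i in range(out_size)]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    n = len(feature_map) // size
    return [[max(feature_map[i * size + a][j * size + b]
                 for a in range(size) for b in range(size))
             for j in range(n)]
            for i in range(n)]


def flatten(feature_map):
    """Stretch a 2-D feature map into a flat vector."""
    return [v for row in feature_map for v in row]


image = [[1, 0, 0, 0, 0, 1],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 1, 0, 0],
         [1, 0, 0, 0, 1, 0],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 0, 1, 0]]
kernel = [[1, -1, -1],
          [-1, 1, -1],
          [-1, -1, 1]]  # responds to a top-left to bottom-right diagonal

feature_map = convolve2d(image, kernel)  # 4x4 with stride 1
pooled = max_pool(feature_map)           # 2x2
vector = flatten(pooled)                 # 4 values for the fully connected part
```

The same 9 filter weights are reused at all 16 positions, which is the weight sharing mentioned above; the value 3 appears where the diagonal pattern matches exactly.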
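CNN - Example (pay attention to the number of parameters)
CNN - What does CNN learn
- Degree of the activation of the k-th filter: $a^k = \sum^{11}_{i=1}\sum^{11}_{j=1}a^k_{ij}$ (assume the output of the k-th filter is an 11x11 matrix.)
- Let x be the input image. Using the partial derivatives $\frac{\partial a^k}{\partial x_{ij}}$ and gradient ascent, we can find the image that excites each filter the most, i.e. the feature (pattern) it detects: $x^* = arg \max_x a^k$
- DNNs are easily fooled (e.g. by images that look like snowy noise); adding some extra constraints helps to prevent this.
- Use partial derivatives to find and display the pixels that contribute most to the correct class
- Cover part of the image with a grey box and see whether it can still be recognized, which shows which part matters most for deciding the class
- Style transfer:
  - A Neural Algorithm of Artistic Style
  - https://arxiv.org/abs/1508.06576
- Shallow (Fat+Short) v.s. Deep (Thin+Tall) - the number of neurons must be the same for a fair comparison: deep is better
  - Going deep is in effect modularization. When there are too few examples of some output class, training a single layer directly gives a weak classifier, whereas with modularization the basic classifiers are shared by the following neurons as modules, so they can be trained with little data - each layer uses the previous layer as modules to build classifiers.
  - A network with a single hidden layer can represent any function, but it is very inefficient.
  - So a deep network can get by with less data and fewer parameters, and is less prone to overfitting
  - Deep networks can handle more complex problems
- Speech - modularization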
  - Determine the state each acoustic feature belongs to
  - Gaussian Mixture Model (GMM)
  - In HMM-GMM, all the phonemes are modeled independently - not efficient.
  - DNN input - one acoustic feature; DNN output - probability of each state
- end to end learning
  - what each function should do is learned automatically
  - only provide the input and the output and let each layer learn by itself - there is no need to hand-process the original data much; each hand-crafted step is just replaced by a new layer.

11. Semi-supervised learning

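Semi-supervised learning means part of the training set has no labels; usually there is far more unlabelled data than labelled data.
Transductive learning: unlabeled data is the testing data; Inductive learning: unlabeled data is not the testing data.
The accuracy of semi-supervised learning depends heavily on whether the labels assumed for the unlabelled data are accurate
Semi-supervised Learning for Generative Model
- The unlabeled data $x^u$ help re-estimate $P(C_1), P(C_2), \mu^1, \mu^2, \Sigma$ (after seeing the unlabelled data, the estimated means and covariance of the distributions may move elsewhere), which in turn changes the decision boundary
- The algorithm always converges eventually, but the initialization affects the result it converges to
- Similar to the EM algorithm
- Pay attention to the maximum likelihood formula that takes the unlabelled data into account
Low-density Separation Assumption
- low-density: the density is lowest, i.e. there are the fewest data points, near the boundary between two classes
- Self-training
  - Cannot be used for regression
  - Similar to semi-supervised generative training, but self-training uses hard labels while the generative approach uses soft labels (a point may belong to either class, only with different probabilities)
  - With a neural network only hard labels work: training on soft labels just feeds the network its own output back as the target, so nothing gets updated. hard label - It looks like class 1, then it is class 1.
- Entropy-based Regularization
  - Entropy of $y^u$ (the predicted class distribution of an unlabelled example) evaluates how concentrated the distribution $y^u$ is: $E(y^u) = -\sum^5_{m=1}y^u_mln(y^u_m)$ , and we want it as small as possible. Here m indexes the classes
  - The original loss function only asks for a small cross entropy on the labelled data (first term); now a second term asks for a small entropy on the unlabelled data as well: $L = \sum_{x^r}C(y^r, \hat y^r) + \lambda\sum_{x^u}E(y^u)$
- Semi-supervised SVM
  - Enumerate all possible labellings of the unlabelled data and fit an SVM for each
  - Then pick the labelling that gives the largest margin and the least error
  - The problem is that this is intractable with a lot of data, so some approximation is needed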
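The entropy term from the entropy-based regularization notes above is easy to compute directly. In this sketch the function names and the example distributions are assumptions; the cross entropies of the labelled examples are passed in as plain numbers rather than computed from a model.

```python
import math


def entropy(probs):
    """E(y) = -sum_m y_m * ln(y_m); terms with y_m = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def total_loss(labelled_cross_entropies, unlabelled_outputs, lam):
    """Sum of cross entropies on labelled data + lam * sum of entropies on unlabelled data."""
    return (sum(labelled_cross_entropies)
            + lam * sum(entropy(y) for y in unlabelled_outputs))


confident = entropy([1.0, 0.0, 0.0, 0.0, 0.0])  # all mass on one class
uniform = entropy([0.2] * 5)                    # spread evenly over 5 classes
example_loss = total_loss([0.5], [[0.2] * 5], lam=0.1)
```

A confident prediction has entropy 0 and a uniform one has the maximum entropy ln 5, so minimizing the second term pushes the model toward confident outputs on unlabelled data.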
Smoothness Assumption
- Assumption: "similar" x has the same $\hat y$
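- More precisely: x is not uniformly distributed; if x1 and x2 are close in a high density region, then $\hat y^1$ and $\hat y^2$ are the same.
- Cluster and then Label
- To be added: deep auto-encoder - used to make the classes more distinct before clustering
- Graph-based Approach (spectral clustering) - represent the data points as a graph - two points that are connected through the graph are treated as the same class
  - Graph Construction:
    - First define the similarity between two data points - $s(x^i, x^j)$
    - Add edge
      - k Nearest Neighbor(kNN)
      - e-Neighborhood: connect a point only to the points whose similarity to it exceeds some threshold
    - Edge weight is proportional to
      - Gaussian Radial Basis Function (RBF): $s(x^i, x^j) = exp(-\gamma\left\|x^i-x^j\right\|^2)$
- There must be enough data points, otherwise the labels may fail to propagate
- Quantitative use: define the smoothness of the labels, i.e. how well the labels fit the assumption. Connect every pair of neighbouring data points with an edge, give each edge a weight, and compute the smoothness S over all data, labelled or not. The smaller the smoothness, the better
  - There is also an alternative way to write it
  - Likewise, the smoothness term can be added to the loss function that originally only considered labelled data, followed by gradient descent: $L = \sum_{x^r}C(y^r,\hat y^r) + \lambda S$ ; the added term also acts as a regularizer.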
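The formula for the smoothness S did not survive in these notes. The sketch below assumes one common definition, $S = \frac{1}{2}\sum_{(i,j)} w_{i,j}(y^i - y^j)^2$ with every edge counted once; that definition, the graph and its weights are all assumptions for illustration.

```python
def smoothness(weights, labels):
    """S = 1/2 * sum over edges (i < j) of w_ij * (y_i - y_j)^2."""
    n = len(labels)
    return 0.5 * sum(weights[i][j] * (labels[i] - labels[j]) ** 2
                     for i in range(n) for j in range(i + 1, n))


# Symmetric edge weights of a 4-node graph: 0-1 (2), 0-2 (3), 1-2 (1), 2-3 (1).
W = [[0, 2, 3, 0],
     [2, 0, 1, 0],
     [3, 1, 0, 1],
     [0, 0, 1, 0]]

smooth = smoothness(W, [1, 1, 1, 0])  # labels agree along the heavy edges
rough = smoothness(W, [0, 1, 1, 0])   # labels disagree along the heavy edges
```

The labelling that agrees along the heavily weighted edges gets the smaller S, so adding $\lambda S$ to the loss rewards label assignments that respect the graph.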

12. Unsupervised Learning - Linear Methods (線性降維)

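Clustering & Dimension Reduction (simplify the complex) - only have function input; Generation (create something from nothing) - only have function output
Clustering
- K means
- Hierarchical Agglomerative Clustering (HAC): the number of clusters depends on where the threshold cuts the tree
Distributed Representation - instead of assigning an object to one class, describe it as some percentage of every class
- If the data is originally high-dimensional, expressing it as a distributed representation amounts to dimension reduction.
Dimension Reduction - linear methods
- Feature Selection - inspect the data and simply drop the dimensions that are not useful
- Principal Component Analysis (PCA) - z=Wx
  - Key constraint (assumption): $\left \| w^1 \right \|_2 = 1$
  - We want the variance of the projected z to be as large as possible, which preserves the differences between data points so that they can be told apart. $Var(z_1) = \frac{1}{N}\sum_{z_1}(z_1-\bar z_1)^2$ , where $w^1, z_1$ refer to the first dimension
  - For a second dimension, maximize $Var(z_2) = \frac{1}{N}\sum_{z_2}(z_2-\bar z_2)^2$ subject to $\left \| w^2 \right \|_2 = 1$ and to the second direction being orthogonal to the first: $w^1 \cdot w^2 = 0$
  - Compute as many dimensions of z as needed in this way; the full weight matrix W stacks the $w^i$ in order as its rows, and W is an orthogonal matrix (its rows are mutually orthogonal unit vectors)
  - Derivation of PCA
  - SVD - to be added
  - PCA looks like a neural network with one hidden layer (linear activation function) - autoencoder
  - PCA involves adding up and subtracting some components (images) - e.g. the components of a face PCA look like whole faces rather than parts. The reason is that the values in the two matrices from the SVD can be positive or negative.
    - To force the eigen images to be composable strokes (or parts), use non-negative matrix factorization (NMF): 1. forcing a1, a2, ... to be non-negative (images can only be built by adding eigen images); 2. forcing w1, w2, ... to be non-negative (more like "parts of digits")
  - weakness of PCA
    - Because it is unsupervised, the first principal component may be a direction along which the two classes overlap, so the projections of all the data points get mixed together and cannot be separated.
    - Cannot do non-linear dimension reduction
- Matrix Factorization (commonly used in recommender systems)
  - K: the number of latent factors
  - $R^{M\times K} \cdot R^{K\times N} = R^{M \times N}$
  - If some entries of the matrix are missing, use gradient descent: minimize a loss $L = \sum_{(i,j)}(\ldots)$ over the observed entries $(i,j)$ [the rest of the formula is truncated]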
    