
Coursera | Andrew Ng (02-week-1-1.5): Why Does Regularization Reduce Overfitting?

This series only adds personal study notes and some supplementary derivations on top of the original course material; if there are any mistakes, corrections and feedback are welcome. Having studied Andrew Ng's course, I organized it into text to make looking things up and reviewing more convenient. Since I have been studying English, this series is primarily in English, and I also suggest readers rely mainly on the English with the Chinese as a supplement, to lay the groundwork for reading academic papers in related fields later on. - ZJ

Please credit the author and source when reposting: ZJ, WeChat public account "SelfImprovementLab".

1.5 Why regularization reduces overfitting


(Subtitle source: NetEase Cloud Classroom 網易雲課堂)


**Why does regularization help with overfitting?** Why does it help with reducing variance problems? Let's go through a couple of examples to gain some intuition about how it works. Recall the high-bias, high-variance, and "just right" pictures from our earlier video, which look something like this. Now let's fit a large and deep neural network. I know I haven't drawn this one too large or too deep, but you can think of some neural network that is currently overfitting. You have some cost function J of W, b equal to the sum of the losses. What we did for regularization was add this extra term that penalizes the weight matrices for being too large; that was the Frobenius norm. So why is it that shrinking the L2 norm, or the Frobenius norm, of the parameters might cause less overfitting?
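For reference, here is a minimal numpy sketch of that cost. It is not the course's code: the parameter-dictionary layout (`W1, b1, ..., WL, bL`) and the cross-entropy form are assumptions chosen to match the formulas in this lecture.

```python
import numpy as np

def l2_regularized_cost(AL, Y, parameters, lambd):
    """Cross-entropy cost plus the Frobenius-norm (L2) penalty on all weight matrices.

    AL: network output, shape (1, m); Y: labels, shape (1, m);
    parameters: dict assumed to hold "W1", "b1", ..., "WL", "bL"; lambd: lambda.
    """
    m = Y.shape[1]
    # First term of J: the usual cross-entropy loss averaged over the m examples.
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    # Second term of J: (lambda / 2m) * sum over layers of ||W[l]||_F^2.
    L = len(parameters) // 2
    frobenius_sum = sum(np.sum(np.square(parameters["W" + str(l)]))
                        for l in range(1, L + 1))
    return cross_entropy + (lambd / (2 * m)) * frobenius_sum
```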

(Figure: high bias on the left, "just right" in the middle, high variance on the right, and a large, deep neural network)


One piece of intuition is that if you crank the regularization parameter lambda up to be really, really big, you'll be strongly incentivized to set the weight matrices W reasonably close to zero. So one piece of intuition is that maybe it sets the weights so close to zero for a lot of hidden units that it basically zeroes out a lot of the impact of those hidden units. If that's the case, then this much-simplified neural network becomes a much smaller neural network; in fact, it is almost like a logistic regression unit, but stacked multiple layers deep. And so that will take you from this overfitting case much closer to the other, high-bias case on the left. But hopefully there'll be an intermediate value of lambda that results in something closer to the "just right" case in the middle. So the intuition is that by cranking lambda up to be really big, you'll set W close to zero. In practice this isn't literally what happens, but we can think of it as zeroing out, or at least reducing, the impact of a lot of the hidden units, so you end up with what might feel like a simpler network, one that gets closer and closer to behaving as if you were just using logistic regression. The intuition of completely zeroing out a bunch of hidden units isn't quite right; what actually happens is that the network still uses all the hidden units, but each of them just has a much smaller effect. You do end up with a simpler network, as if you had a smaller network, and that network is therefore less prone to overfitting. I'm not sure whether this intuition helps, but when you implement regularization in the programming exercise, you'll actually see some of these variance-reduction results yourself.
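One quick way to see why a very large lambda pushes the weights toward zero is the gradient-descent update with the regularization term included, the "weight decay" form derived in the previous video:

$$W^{[l]} := W^{[l]} - \alpha\left[(\text{from backprop}) + \frac{\lambda}{m}W^{[l]}\right] = \left(1-\frac{\alpha\lambda}{m}\right)W^{[l]} - \alpha\,(\text{from backprop})$$

Every iteration multiplies $W^{[l]}$ by a factor slightly less than 1, so the larger λ is, the more strongly the weights are shrunk toward zero.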



Here's another attempt at additional intuition for why regularization helps prevent overfitting. For this, I'm going to assume that we're using the tanh activation function, which looks like this: g(z) = tanh(z). If that's the case, notice that so long as z is quite small, so that z takes on only a smallish range of values, you're just using the linear regime of the tanh function. It is only if z is allowed to wander up to larger or smaller values that the activation function starts to become less linear. So the intuition you might take away from this is that if lambda, the regularization parameter, is large, then your parameters will be relatively small, because large parameters are penalized in the cost function. And if the weights W are small, then because z = Wa + b (technically there is also the plus b), if W tends to be very small, z will also be relatively small. In particular, if z ends up taking relatively small values, staying inside this near-linear range, then g(z) will be roughly linear. So it's as if every layer were roughly linear, as if it were just linear regression. And we saw in Course 1 that if every layer is linear, then your whole network is just a linear network. So even a very deep network with a linear activation function is, in the end, only able to compute a linear function. It's not able to fit those very complicated, highly non-linear decision boundaries that let it really overfit to data sets, like the overfitting, high-variance case we saw on the previous slide.
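A quick standalone numpy check of this "linear regime" claim (illustrative only, not course code): for small z, tanh(z) is almost exactly z, and the gap only opens up as |z| grows.

```python
import numpy as np

# tanh(z) versus its linear approximation z: near-identical for small |z|,
# clearly different once z wanders into the saturating region.
for z in [0.01, 0.1, 0.5, 2.0]:
    print(f"z = {z:>4}: tanh(z) = {np.tanh(z):.4f}, |tanh(z) - z| = {abs(np.tanh(z) - z):.4f}")
```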

(Figure: the tanh activation function g(z) = tanh(z), approximately linear for small z)


So just to summarize: if the regularization parameter becomes very large, the parameters W become very small, so z will be relatively small, ignoring the effects of b for now; really, I should say z takes on a small range of values. And so the activation function, if it is tanh, say, will be relatively linear. Your whole neural network will then be computing something not too far from a big linear function, which is a pretty simple function, rather than a very complex, highly non-linear function, and so it is also much less able to overfit. Again, when you implement regularization for yourself in the programming exercise, you'll be able to see some of these effects yourself. Before wrapping up our discussion of regularization, I just want to give you one implementation tip: when implementing regularization, we took our definition of the cost function J and modified it by adding this extra term that penalizes the weights for being too large.



And so if you implement gradient descent, one of the steps to debug gradient descent is to plot the cost function J as a function of the number of iterations of gradient descent, and you want to see that the cost function J decreases monotonically after every iteration of gradient descent. If you're implementing regularization, then please remember that J now has this new definition. If you plot the old definition of J, just this first term, then you might not see it decrease monotonically. So to debug gradient descent, make sure that you're plotting this new definition of J that includes the second term as well; otherwise you might not see J decrease monotonically on every single iteration. So that's it for L2 regularization, which is actually the regularization technique I use the most in training deep learning models. In deep learning, there is another sometimes-used regularization technique called dropout regularization. Let's take a look at that in the next video.
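As a sketch of this tip, here is a self-contained toy example (plain logistic regression with L2 regularization, not the course's deep-network code). The point is only that the quantity recorded for the plot is the new J, cross-entropy plus the penalty term, because that is what gradient descent is actually minimizing:

```python
import numpy as np

np.random.seed(0)
m, n = 200, 5
X = np.random.randn(n, m)                        # inputs, shape (n, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)   # binary labels, shape (1, m)

W = np.random.randn(1, n) * 0.01
b = 0.0
lambd, learning_rate = 0.7, 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

costs = []
for i in range(1000):
    A = sigmoid(W @ X + b)
    # Record the NEW definition of J: cross-entropy plus the L2 penalty term.
    J = (-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
         + (lambd / (2 * m)) * np.sum(np.square(W)))
    costs.append(J)
    # Gradients include the regularization term (lambda / m) * W.
    dZ = A - Y
    dW = dZ @ X.T / m + (lambd / m) * W
    db = np.sum(dZ) / m
    W -= learning_rate * dW
    b -= learning_rate * db

# 'costs' is the curve to plot; with the penalty included it decreases monotonically here.
print(costs[0], costs[100], costs[-1])
```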


Key points:

Why regularization can reduce overfitting

Suppose the neural network structure shown in the figure below is in an overfitting state:

(Figure: a large, deep neural network in an overfitting state)

For the neural network's cost function:

$$J(w^{[1]},b^{[1]},\dots,w^{[L]},b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\left\|w^{[l]}\right\|_F^2$$

With the regularization term added, the intuitive picture is this: if the regularization factor λ is set large enough, then to minimize the cost function the weight matrices W will be pushed toward values close to 0. That is roughly equivalent to eliminating the influence of many neurons, so the large neural network in the figure effectively turns into a much smaller network.

Of course, the explanation above is only an intuitive one. In reality, the hidden-layer neurons all still exist, but their influence becomes much smaller, and so overfitting is avoided.

Mathematical explanation:

Suppose the activation function used by the neurons is g(z) = tanh(z). After adding the regularization term:

(Figure: g(z) = tanh(z), approximately linear in the region where z is small)

As λ increases, $W^{[l]}$ decreases, and so $z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$ also becomes small. As the figure above shows, tanh(z) is approximately linear in the region where z is small, so each layer computes an approximately linear function, the whole network becomes a simple, approximately linear network, and overfitting therefore does not occur.
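To make this chain of reasoning concrete, here is a small standalone numpy experiment (illustrative only; the architecture and weight scales are arbitrary choices). With small weights, a deep tanh network gives almost the same outputs as the identical network with the activation removed, i.e. it behaves like a linear function of its input; with large weights it does not:

```python
import numpy as np

np.random.seed(1)

def forward(x, weights, activation):
    """Forward pass of a bias-free deep network: a <- activation(W a), layer by layer."""
    a = x
    for W in weights:
        a = activation(W @ a)
    return a

layer_dims = [10, 8, 8, 8, 1]
x = np.random.randn(10, 50)   # 50 random input examples

for scale, label in [(0.05, "small weights (large-lambda regime)"),
                     (2.0,  "large weights (little regularization)")]:
    weights = [np.random.randn(layer_dims[l + 1], layer_dims[l]) * scale
               for l in range(len(layer_dims) - 1)]
    out_tanh = forward(x, weights, np.tanh)           # the actual tanh network
    out_linear = forward(x, weights, lambda z: z)     # same weights, identity activation
    gap = np.linalg.norm(out_tanh - out_linear) / np.linalg.norm(out_linear)
    print(f"{label}: relative gap to the purely linear network = {gap:.3f}")
```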

