
Learning sklearn: Linear Models

Linear models make predictions using a linear function of the input features. Algorithms for learning linear models differ in two ways:
(1) How they measure whether a particular combination of coefficients and intercept fits the training data well; different algorithms measure "fit to the training set" in different ways, and this measure is called the loss function.
(2) Whether they use regularization and, if so, which regularization method.

The main parameter of linear models is the regularization parameter. If you assume that only a few features are actually important, you should use L1 regularization; otherwise you should default to L2 regularization.

When working with large datasets, it is worth looking into the solver='sag' option of LogisticRegression and Ridge, which can be faster than the default.
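A minimal sketch of passing that option (the data here is synthetic, purely for illustration): 'sag' is stochastic average gradient descent, and it converges fastest when the features have similar scales.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Ridge

#synthetic data, just to show the option
X, y = make_classification(n_samples = 10000, random_state = 0)

logreg_sag = LogisticRegression(solver = 'sag', max_iter = 1000).fit(X, y)
ridge_sag = Ridge(solver = 'sag')  #Ridge accepts the same solver option
print('Training set score:{:.2f}'.format(logreg_sag.score(X, y)))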

Linear Models for Regression

For regression, the prediction formula is

y = w_i * x_i + b

where x_i is a feature of a single data point, w_i is the slope along each feature axis (the weight of that input feature), w_i and b are parameters learned by the model, and y is the model's prediction.
Learning the parameters w_0 and b on the one-dimensional wave dataset:

import mglearn  
mglearn.plots.plot_linear_regression_wave() 

Linear Regression (Ordinary Least Squares)

Linear regression finds the parameters w and b that minimize the mean squared error between the predictions on the training set and the true regression targets y.

Mean squared error: the sum of the squared differences between the predictions and the true values, divided by the number of samples.
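As a quick worked example (with made-up numbers), the mean squared error can be computed directly:

import numpy as np

y_true = np.array([1.0, 2.0, 3.0])    #hypothetical true targets
y_pred = np.array([1.5, 1.5, 3.5])    #hypothetical predictions
mse = np.mean((y_pred - y_true) ** 2) #(0.25 + 0.25 + 0.25) / 3 = 0.25
print('MSE:{:.2f}'.format(mse))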
#Linear regression predictions on the wave dataset
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn

X,y = mglearn.datasets.make_wave(n_samples = 60)
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 42)

lr = LinearRegression().fit(X_train,y_train)

#sklearn always stores values derived from the training data in attributes that end with a trailing underscore, to separate them from parameters set by the user
print('lr.coef_:{}'.format(lr.coef_))
print('lr.intercept_:{}'.format(lr.intercept_))

#If the scores on the training and test sets are very close, the model may be underfitting
print('Training set score:{:.2f}'.format(lr.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lr.score(X_test,y_test)))
#LinearRegression on a high-dimensional dataset: the extended Boston housing dataset
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn

X,y = mglearn.datasets.load_extended_boston()

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)
lr = LinearRegression().fit(X_train,y_train)

#A gap between training and test set performance is a clear sign of overfitting
print('Training set score:{:.2f}'.format(lr.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lr.score(X_test,y_test)))

Ridge Regression

Ridge regression uses the same prediction formula as ordinary least squares, but adds an L2 regularization constraint so that each feature's influence on the output is as small as possible. A larger alpha means a more constrained model, so we expect the entries of coef_ for a large alpha to be smaller than those for a small alpha.
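For reference, scikit-learn's Ridge minimizes the sum of squared errors plus an L2 penalty on the coefficients, weighted by alpha:

\min_w \; \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2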
#Ridge on a high-dimensional dataset: the extended Boston housing dataset
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import mglearn

X,y = mglearn.datasets.load_extended_boston()

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)
ridge = Ridge().fit(X_train,y_train)

#Ridge scores lower than LinearRegression on the training set but higher on the test set
#The plain linear model overfits the data; Ridge is a more constrained model and is less prone to overfitting
print('Training set score:{:.2f}'.format(ridge.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(ridge.score(X_test,y_test)))
#Tuning alpha: increasing alpha pushes the coefficients closer to 0, which lowers training set performance but may (no guarantee!) improve generalization
#Ridge on a high-dimensional dataset: the extended Boston housing dataset
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import mglearn

X,y = mglearn.datasets.load_extended_boston()

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)

#default alpha = 1.0
ridge = Ridge().fit(X_train,y_train)
print('Training set score:{:.2f}'.format(ridge.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(ridge.score(X_test,y_test)))

#alpha = 10
ridge10 = Ridge(alpha = 10).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(ridge10.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(ridge10.score(X_test,y_test)))

#alpha = 0.1
ridge01 = Ridge(alpha = 0.1).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(ridge01.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(ridge01.score(X_test,y_test)))

#plot the coefficient values
plt.plot(ridge.coef_,'s',label = "Ridge alpha = 1")
plt.plot(ridge10.coef_,'^',label = "Ridge alpha = 10")
plt.plot(ridge01.coef_,'v',label = "Ridge alpha = 0.1")

plt.xlabel("Coefficient index") #x軸對應coef_的元素,x=i對應第i個特徵的係數,y軸表示該係數的具體數值
plt.ylabel("Coefficient magnitude") #係數震級
plt.hlines(0,0,len(ridge.coef_)) #畫橫座標
plt.ylim(-25,25) #設定座標軸的最大最小區間
plt.legend(loc = 'best')
import mglearn
#Fix alpha and vary the amount of training data
#Subsample the Boston housing data and evaluate LinearRegression and Ridge(alpha = 1) on subsets of increasing size
#Learning curves
mglearn.plots.plot_ridge_n_samples()
#The training performance of linear regression decreases as the training set grows
#With enough training data, regularization becomes less important

Lasso

Like Ridge, Lasso also constrains the coefficients to be close to 0, but in a different way, using L1 regularization. The consequence of L1 regularization is that with Lasso some coefficients are exactly 0, which can be seen as a form of automatic feature selection.
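For reference, scikit-learn's Lasso minimizes the (averaged) squared error plus an L1 penalty on the coefficients; it is this L1 term that drives some coefficients exactly to zero:

\min_w \; \frac{1}{2\,n_{\mathrm{samples}}} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1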
#Applying Lasso to the extended Boston housing dataset
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
import mglearn
import numpy as np
import matplotlib.pyplot as plt

X,y = mglearn.datasets.load_extended_boston()

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)
lasso = Lasso().fit(X_train,y_train)

#Poor performance on both the training and the test set indicates underfitting
print('Training set score:{:.2f}'.format(lasso.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lasso.score(X_test,y_test)))
print('Number of features used: {}'.format(np.sum(lasso.coef_ != 0))) #number of features with a nonzero coefficient

#Lasso also has a regularization parameter alpha (default 1.0) that controls how strongly coefficients are pushed toward 0. To reduce underfitting, decrease alpha and increase max_iter (the maximum number of iterations to run)
#This fits a more complex model
lasso001 = Lasso(alpha = 0.01, max_iter = 100000).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(lasso001.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lasso001.score(X_test,y_test)))
print('Number of features used: {}'.format(np.sum(lasso001.coef_ != 0)))

#But if alpha is set too small, the effect of regularization is removed and the model overfits
lasso00001 = Lasso(alpha = 0.0001, max_iter = 100000).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(lasso00001.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lasso00001.score(X_test,y_test)))
print('Number of features used: {}'.format(np.sum(lasso00001.coef_ != 0)))

plt.plot(lasso.coef_,'s',label = 'Lasso alpha = 1')
plt.plot(lasso001.coef_,'^',label = 'Lasso alpha = 0.01')
plt.plot(lasso00001.coef_,'v',label = 'Lasso alpha = 0.0001')

plt.xlabel('Coefficient index')
plt.ylabel('Coefficient magnitude')
plt.legend(ncol = 2,loc = (0,1.05)) #legend laid out in 2 columns
plt.ylim(-25,25)

sklearn also provides the ElasticNet class, which combines the penalties of Lasso and Ridge. It has two parameters to tune: one for the L1 regularization and one for the L2 regularization.
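A minimal sketch of ElasticNet on the same extended Boston data (the alpha and l1_ratio values below are arbitrary, chosen only to show the two knobs; l1_ratio = 1.0 would be pure Lasso and l1_ratio = 0.0 pure Ridge):

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
import mglearn

X,y = mglearn.datasets.load_extended_boston()
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)

#l1_ratio mixes the L1 and L2 penalties; alpha scales the overall penalty
enet = ElasticNet(alpha = 0.01, l1_ratio = 0.5, max_iter = 100000).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(enet.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(enet.score(X_test,y_test)))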

Linear Models for Classification

For classification, the prediction formula is

y = w_i * x_i + b > 0

Instead of returning the weighted sum of the features, the formula thresholds the value at zero: if it is smaller than 0, the predicted class is -1; if it is larger than 0, the predicted class is +1. For linear models for classification, the decision boundary is a linear function of the input; in other words, a linear classifier separates two classes using a line, a plane, or a hyperplane.

#Apply two linear classification models to the forge dataset and visualize the decision boundaries
from sklearn.linear_model import LogisticRegression   #logistic regression
from sklearn.svm import LinearSVC      #linear support vector machine
import matplotlib.pyplot as plt       #needed below for plt.subplots
import mglearn

X,y = mglearn.datasets.make_forge()

fig,axes = plt.subplots(1,2,figsize = (10,3))

for model,ax in zip([LinearSVC(), LogisticRegression()],axes):
    clf = model.fit(X,y)
    #the alpha argument here only controls the transparency of the plotted boundary
    mglearn.plots.plot_2d_separator(clf, X, fill = False, eps = 0.5, ax = ax, alpha = 0.7) #visualize the decision boundary
    mglearn.discrete_scatter(X[:,0],X[:,1],y,ax = ax) #plot the data points
    
    ax.set_title("{}".format(clf.__class__.__name__))
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
    ax.legend(loc = "best")

The LogisticRegression and LinearSVC models use L2 regularization by default. The trade-off parameter that determines the strength of the regularization is called C; a larger C corresponds to weaker regularization.
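A quick sketch of the effect (C values chosen arbitrarily): with a large C, the model tries harder to fit every individual training point.

from sklearn.svm import LinearSVC
import mglearn

X,y = mglearn.datasets.make_forge()

for C in [0.01, 1, 100]:
    svc = LinearSVC(C = C, max_iter = 10000).fit(X,y)
    print('C = {:<6} training accuracy:{:.2f}'.format(C, svc.score(X,y)))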

#Decision boundaries of a linear SVM on the forge dataset for different values of C
import mglearn
mglearn.plots.plot_linear_svc_regularization()

Linear models for classification are very powerful in high-dimensional spaces. When many features are available, avoiding overfitting becomes increasingly important.

#A more detailed analysis of LogisticRegression on the Breast Cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression   #needed for the fits below
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
X_train,X_test,y_train,y_test = train_test_split(cancer.data, cancer.target,stratify = cancer.target,random_state = 42)

#C = 1.0
logreg = LogisticRegression().fit(X_train,y_train)
print("Training set score:{:.2f}".format(logreg.score(X_train,y_train)))
print("Test set score:{:.3f}".format(logreg.score(X_test,y_test)))

#C = 100
logreg100 = LogisticRegression(C = 100).fit(X_train,y_train)
print("Training set score:{:.2f}".format(logreg100.score(X_train,y_train)))
print("Test set score:{:.3f}".format(logreg100.score(X_test,y_test)))

#C = 0.01
logreg001 = LogisticRegression(C = 0.01).fit(X_train,y_train)
print("Training set score:{:.2f}".format(logreg001.score(X_train,y_train)))
print("Test set score:{:.3f}".format(logreg001.score(X_test,y_test)))

plt.plot(logreg.coef_.T,'o',label = "C = 1")
plt.plot(logreg100.coef_.T,'^',label = "C = 100")
plt.plot(logreg001.coef_.T,'v',label = "C = 0.01")

plt.xticks(range(cancer.data.shape[1]),cancer.feature_names,rotation = 90)

plt.hlines(0,0,cancer.data.shape[1])
plt.ylim(-5,5)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.legend()
#The coefficients can tell us which class a given feature is associated with.


#LogisticRegression with L1 regularization
for C, marker in zip([0.001,1,100],['o','^','v']):
    #penalty = "l1" needs a solver that supports L1 (e.g. liblinear) in recent sklearn versions
    lr_l1 = LogisticRegression(C = C, penalty = "l1", solver = "liblinear").fit(X_train,y_train)
    print("Training accuracy of l1 logreg with C ={:.3f}:{:.2f}".format(C,lr_l1.score(X_train,y_train)))
    print("Test accuracy of l1 logreg with C ={:.3f}:{:.2f}".format(C,lr_l1.score(X_test,y_test)))

    plt.plot(lr_l1.coef_.T,marker,label = "C={:.3f}".format(C))
    
plt.xticks(range(cancer.data.shape[1]),cancer.feature_names,rotation = 90)

plt.hlines(0,0,cancer.data.shape[1])
plt.ylim(-5,5)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.legend(loc = 3)

The penalty parameter of the model influences regularization, i.e., whether the model uses all available features or selects only a subset of them.

Linear Models for Multiclass Classification

Many linear classification models are restricted to binary classification and do not extend naturally to the multiclass case. A common technique for extending a binary classification algorithm to multiclass is the one-vs.-rest approach.
In one-vs.-rest, a separate binary model is learned for each class, trying to separate that class from all other classes.
Each class then has its own binary classifier, i.e., its own coefficient vector w and intercept b; the class whose classifier produces the highest score is the predicted class label (see the sketch below).
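A quick, self-contained check of that rule (using the same blobs data as below): taking the argmax over the per-class decision values reproduces predict.

from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC
import numpy as np

X,y = make_blobs(random_state = 42)
linear_svm = LinearSVC().fit(X,y)

#one column of decision values per class; the class with the largest score wins
scores = linear_svm.decision_function(X)  #shape (n_samples, n_classes)
print(np.all(np.argmax(scores, axis = 1) == linear_svm.predict(X)))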
#A two-dimensional toy dataset containing 3 classes
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC     #needed for the classifier below
import numpy as np                    #needed for np.linspace below
import mglearn
import matplotlib.pyplot as plt

X,y = make_blobs(random_state = 42)
mglearn.discrete_scatter(X[:,0],X[:,1],y)

#Train a LinearSVC classifier
linear_svm = LinearSVC().fit(X,y)
print("Coefficient shape:", linear_svm.coef_.shape) #(3, 2): one row per class, one column per feature
print("Intercept shape:", linear_svm.intercept_.shape)

line = np.linspace(-15,15)

for coef,intercept,color in zip(linear_svm.coef_, linear_svm.intercept_,['b','r','g']):
    plt.plot(line, -(line * coef[0] + intercept) / coef[1], c = color)

plt.ylim(-10,15)
plt.xlim(-10,8)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.legend(["Class 0","Class 1","Class 2","Line class 0","Line class 1","Line class 2"], loc = (1.01,0.3))

mglearn.plots.plot_2d_classification(linear_svm, X, fill = True, alpha = .7)  #visualize the decision regions
