Machine Learning: Decision Trees and Titanic Survival Prediction
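A decision tree is a tree structure that resembles a flowchart. Each internal (branch) node tests one feature and routes the sample according to the result of that test, and each leaf node represents a class.
To decide which feature to split on, the information carried by each feature has to be quantified. Common ways of doing so are: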
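ID3: information gain
Conditional entropy:
H(Y|X) = Σ_i p_i · H(Y|X=x_i)
where p_i = P(X=x_i). X and Y are two random variables that are related through their joint probability distribution, and the conditional entropy H(Y|X) measures how much uncertainty remains in Y once the random variable X is known.
Information gain: the difference between the entropy H(Y) and the conditional entropy H(Y|X), defined as:
I(Y,X)=H(Y)−H(Y|X)
The larger the entropy, the more uncertain the outcome. The larger the information gain of a feature, the more suitable that feature is as a split point.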
C4.5: information gain ratio
CART: Gini index
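To make the three criteria concrete, here is a minimal sketch in plain Python. The function names and the eight-row toy dataset are assumptions made for illustration and do not come from the Titanic data. The gain ratio is written in its usual form (information gain divided by the entropy of the feature itself) and the Gini index as 1 minus the sum of squared class proportions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y), in bits, of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    """H(Y|X) = sum_i p_i * H(Y | X = x_i), with p_i = P(X = x_i)."""
    n = len(labels)
    total = 0.0
    for value, count in Counter(feature).items():
        subset = [y for x, y in zip(feature, labels) if x == value]
        total += (count / n) * entropy(subset)
    return total

def information_gain(feature, labels):
    """I(Y, X) = H(Y) - H(Y|X), the ID3 criterion."""
    return entropy(labels) - conditional_entropy(feature, labels)

def gain_ratio(feature, labels):
    """Information gain divided by the entropy of the feature (C4.5 criterion)."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0

def gini(labels):
    """Gini index 1 - sum_k p_k^2 (CART criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Hypothetical toy data: 8 passengers, not taken from train.csv
sex = ['male', 'male', 'male', 'male', 'female', 'female', 'female', 'female']
survived = [0, 0, 0, 1, 1, 1, 1, 0]

print(round(entropy(survived), 4))                # 1.0
print(round(information_gain(sex, survived), 4))  # 0.1887
print(round(gain_ratio(sex, survived), 4))        # 0.1887
print(round(gini(survived), 4))                   # 0.5
```

In this toy case the feature splits the data into two equal halves, so its own entropy is 1 bit and the gain ratio equals the information gain.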
Example: predicting survival on the Titanic
a. Data processing
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Preprocessing: drop unused columns, encode categorical values, fill missing values
def read_dataset(fname):
    # Use the first column as the row index
    data = pd.read_csv(fname,index_col=0)
    # Drop columns that are not used
    data.drop(['Name','Ticket','Cabin'],axis=1,inplace=True)
    # Encode sex: male is 1, female is 0
    data['Sex']=(data['Sex']=='male').astype(int)
    # Encode the port of embarkation as an integer index
    labels = data['Embarked'].unique().tolist()
    data['Embarked'] = data['Embarked'].apply(lambda s: labels.index(s))
    # Fill missing values with 0
    data = data.fillna(0)
    return data

train = read_dataset('train.csv')
train.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked
---|---|---|---|---|---|---|---|---
1 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.2500 | 0
2 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 1
3 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.9250 | 0
4 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1000 | 0
5 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.0500 | 0
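To see what each encoding step does without needing train.csv, the same transformations can be applied to a tiny in-memory frame. This is only a sketch: the three rows below are hypothetical and chosen so that one Age value is missing.

```python
import pandas as pd

# Hypothetical rows for illustration only; they are not taken from train.csv
toy = pd.DataFrame(
    {
        'Sex': ['male', 'female', 'female'],
        'Age': [22.0, None, 26.0],
        'Embarked': ['S', 'C', 'S'],
    },
    index=pd.Index([1, 2, 3], name='PassengerId'),
)

# male -> 1, female -> 0
toy['Sex'] = (toy['Sex'] == 'male').astype(int)

# Each port is replaced by its position in the list of distinct ports,
# in order of first appearance
labels = toy['Embarked'].unique().tolist()
toy['Embarked'] = toy['Embarked'].apply(lambda s: labels.index(s))

# Missing values (here the second Age) become 0
toy = toy.fillna(0)
print(toy)
```

Because the port codes follow the order of first appearance, the same port could receive a different integer in a file whose rows are ordered differently.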
b. Training the model
# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split
y = train['Survived'].values
X = train.drop(['Survived'],axis=1).values
X_train,X_test, y_train,y_test = train_test_split(X,y,test_size=0.2)
print('train dataset:{0}; test dataset: {1}'.format(X_train.shape,X_test.shape))
train dataset:(712, 7); test dataset: (179, 7)
# Fit a decision tree
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train,y_train)
train_score = clf.score(X_train,y_train)
test_score = clf.score(X_test,y_test)
print('train score:{0}; test score:{1}'.format(train_score,test_score))
train score:0.9859550561797753; test score:0.7877094972067039
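The training score is very high (98.6%), while the test score is only 78.8%. This gap shows that the model is overfitting and that the tree needs to be pruned.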
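One way to see where the overfitting comes from is to look at how large an unconstrained tree grows. The sketch below uses synthetic data from make_classification as a stand-in (an assumption, since train.csv is not bundled here) and compares an unconstrained tree with a depth-limited one via get_depth() and get_n_leaves().

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with noisy labels (an assumption; not the Titanic set)
X_demo, y_demo = make_classification(n_samples=600, n_features=7, n_informative=3,
                                     flip_y=0.2, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xtr, ytr)

for name, tree in [('unconstrained', full_tree), ('max_depth=3', small_tree)]:
    print('{0}: depth={1}, leaves={2}, train={3:.3f}, test={4:.3f}'.format(
        name, tree.get_depth(), tree.get_n_leaves(),
        tree.score(Xtr, ytr), tree.score(Xte, yte)))
```

The unconstrained tree keeps splitting until every training sample is classified correctly, which is how it ends up memorizing label noise.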
c. Tuning parameters
The max_depth parameter controls the depth of the decision tree: once the tree reaches the specified depth, it stops splitting.
# Choose the max_depth parameter
def cv_score(d):
    clf = DecisionTreeClassifier(max_depth=d)
    clf.fit(X_train,y_train)
    tr_score = clf.score(X_train,y_train)
    # Score on the held-out split (X_test, y_test)
    val_score = clf.score(X_test,y_test)
    return (tr_score,val_score)
depths = range(2,15)
scores = [cv_score(d) for d in depths]
tr_scores = [s[0] for s in scores]
cv_scores = [s[1] for s in scores]
# Find the index of the highest score on the held-out set
best_score_index = np.argmax(cv_scores)
best_score = cv_scores[best_score_index]
best_param = depths[best_score_index]
print('best param:{0};best score:{1}'.format(best_param, best_score))
best param:6;best score:0.8212290502793296
Relationship between the parameter and the score:
plt.figure(figsize=(6,4),dpi=144)
plt.grid()
plt.xlabel('max depth of decision tree')
plt.ylabel('score')
plt.plot(depths, cv_scores,'.g-',label='cross-validation score')
plt.plot(depths, tr_scores,'.r--',label='training score')
plt.legend()
As the tree gets deeper, the training score keeps rising, while the test score does not rise with it.
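The cv_score helper above evaluates each depth on a single held-out split, so the selected depth can change from one random split to the next. A k-fold estimate is usually steadier. The following is a minimal sketch using cross_val_score on synthetic stand-in data (an assumption; with the real data you would pass X and y instead).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic set)
X_demo, y_demo = make_classification(n_samples=600, n_features=7, n_informative=3,
                                     flip_y=0.2, random_state=0)

depths = range(2, 15)
mean_scores = []
for d in depths:
    # Mean accuracy over 5 folds for a tree limited to depth d
    fold_scores = cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                                  X_demo, y_demo, cv=5)
    mean_scores.append(fold_scores.mean())

best_depth = depths[int(np.argmax(mean_scores))]
print('best depth: {0}; mean CV score: {1:.3f}'.format(best_depth, max(mean_scores)))
```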
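You can also examine min_impurity_split, which sets a threshold on the entropy or Gini impurity: a node is split only if its impurity is above the threshold, otherwise it becomes a leaf. Note that this parameter was deprecated in scikit-learn 0.19 and removed in 1.0 in favor of min_impurity_decrease, so the code below requires an older scikit-learn version.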
def cv_score(val):
    clf = DecisionTreeClassifier(criterion = 'gini', min_impurity_split = val)
    clf.fit(X_train,y_train)
    tr_score = clf.score(X_train,y_train)
    # Score on the held-out split (X_test, y_test)
    val_score = clf.score(X_test,y_test)
    return (tr_score,val_score)
# Define the parameter range, then train and score a model for each value
values = np.linspace(0,0.5,20)
scores = [cv_score(v) for v in values]
tr_scores = [s[0] for s in scores]
cv_scores = [s[1] for s in scores]
# Find the parameter value with the highest score
best_score_index = np.argmax(cv_scores)
best_score = cv_scores[best_score_index]
best_param = values[best_score_index]
# Plot the relationship between the parameter and the score
plt.figure(figsize=(6,4),dpi=144)
plt.grid()
plt.xlabel('threshold of gini impurity')
plt.ylabel('score')
plt.plot(values, cv_scores,'.g-',label='cross-validation score')
plt.plot(values, tr_scores,'.r--',label='train score')
plt.legend()
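When the impurity threshold approaches 0.5, both the training score and the test score drop sharply, which indicates that the model is underfitting.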
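On scikit-learn 1.0 and later, min_impurity_split is no longer accepted. The closest available parameter is min_impurity_decrease, which has a different meaning: a node is split only if the split reduces the weighted impurity by at least the given amount. The sketch below sweeps it on synthetic stand-in data; the data, the score_for helper, and the range of thresholds are all assumptions for illustration, and the numbers are not comparable to the results reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic set)
X_demo, y_demo = make_classification(n_samples=600, n_features=7, n_informative=3,
                                     flip_y=0.2, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

def score_for(threshold):
    """Hypothetical helper: fit one tree and report (leaves, train score, test score)."""
    tree = DecisionTreeClassifier(criterion='gini', min_impurity_decrease=threshold,
                                  random_state=0)
    tree.fit(Xtr, ytr)
    return tree.get_n_leaves(), tree.score(Xtr, ytr), tree.score(Xte, yte)

thresholds = np.linspace(0, 0.1, 6)
results = [score_for(t) for t in thresholds]
for t, (leaves, tr, te) in zip(thresholds, results):
    print('decrease>={0:.2f}: leaves={1}, train={2:.3f}, test={3:.3f}'.format(t, leaves, tr, te))

# A threshold larger than any possible Gini decrease leaves a single-leaf tree,
# which is the underfitting extreme
stump_leaves, _, _ = score_for(1.0)
```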
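d. Tools for model parameter selection
GridSearchCV in sklearn.model_selection can search for the best values of several parameters at once.
The param_grid argument is a dictionary: each key is the name of a parameter to tune, and each value is the list of candidate values to try. It can contain several key-value pairs.
The cv argument sets the cross-validation scheme: cv=5 splits the dataset into 5 parts, and each part in turn serves as the validation set while the other four are used for training.
Outputs: clf.best_params_ holds the best parameters, clf.best_score_ the best score, and clf.cv_results_ all intermediate results computed during the search.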
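The 5-way split described above can be made visible with KFold on ten dummy samples. This is only an illustration of the idea: for a classifier with an integer cv, GridSearchCV uses the stratified variant (StratifiedKFold), which additionally keeps class proportions similar across folds.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)  # ten dummy sample indices

# Each of the 5 folds takes a turn as the validation set
folds = list(KFold(n_splits=5).split(samples))
for i, (train_idx, val_idx) in enumerate(folds):
    print('fold {0}: train={1}, validation={2}'.format(i, train_idx, val_idx))
```

Every sample appears in exactly one validation fold, so each fold's score is computed on data that the corresponding model never saw during training.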
from sklearn.model_selection import GridSearchCV
thresholds = np.linspace(0,0.5,50)
# Set up the parameter grid
param_grid = {'min_impurity_split': thresholds}
clf = GridSearchCV(DecisionTreeClassifier(),param_grid,cv=5)
clf.fit(X,y)
print('best param:{0}\nbest score:{1}'.format(clf.best_params_,clf.best_score_))
best param:{'min_impurity_split': 0.2040816326530612}
best score:0.8204264870931538
Selecting the best parameters across several parameter groups:
entropy_thresholds = np.linspace(0,1,50)
gini_thresholds = np.linspace(0,0.5,50)
# Set up the parameter grid: each dictionary is searched as a separate group
param_grid = [{'criterion':['entropy'],'min_impurity_split':entropy_thresholds},
              {'criterion':['gini'],'min_impurity_split':gini_thresholds},
              {'max_depth':range(2,10)},
              {'min_samples_split':range(2,30,2)}]
clf = GridSearchCV(DecisionTreeClassifier(),param_grid,cv=5)
clf.fit(X,y)
print('best_param:{0}\nbest score:{1}'.format(clf.best_params_,clf.best_score_))
best_param:{'criterion': 'entropy', 'min_impurity_split': 0.5306122448979591}
best score:0.8294051627384961
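For readers on scikit-learn 1.0 or later, a comparable multi-group search can be sketched with parameters that still exist, and cv_results_ can be loaded into a DataFrame to inspect every candidate rather than only the winner. The synthetic data, the variable name search, and the candidate values for min_impurity_decrease are assumptions for illustration, so the scores will not match the ones above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic set)
X_demo, y_demo = make_classification(n_samples=600, n_features=7, n_informative=3,
                                     flip_y=0.2, random_state=0)

# Each dictionary is a separate group; combinations are formed only within a group
param_grid = [
    {'criterion': ['entropy', 'gini'],
     'min_impurity_decrease': [0.0, 0.005, 0.01, 0.02]},
    {'max_depth': list(range(2, 10))},
    {'min_samples_split': list(range(2, 30, 2))},
]
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_demo, y_demo)

print('best param:{0}\nbest score:{1}'.format(search.best_params_, search.best_score_))

# One row per candidate, sorted so the best-ranked candidates come first
results = pd.DataFrame(search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score').head())
```

The grid above yields 8 + 8 + 14 = 30 candidates, each fitted 5 times, so the cost of the search grows quickly as groups and values are added.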