
Machine Learning: Decision Trees and Titanic Survival Prediction

A decision tree is a flowchart-like tree structure: each internal (branch) node tests a feature and branches on the outcome of the test, and each leaf node represents a class.

To decide which feature to split on, the quality of a split has to be quantified. Common criteria are:

ID3: information gain

Entropy measures the uncertainty of a random variable Y:

H(Y) = −Σj P(Y=yj) log P(Y=yj)

Conditional entropy:

H(Y|X) = Σi pi H(Y|X=xi)

where pi = P(X=xi); X and Y are two random variables that may be dependent (i.e., they have a joint probability distribution). The conditional entropy H(Y|X) measures how much uncertainty remains in Y once the random variable X is known.

Information gain: the difference between the entropy H(Y) and the conditional entropy H(Y|X), defined as:
I(Y,X) = H(Y) − H(Y|X)

The higher the entropy, the more uncertain the variable; the larger the information gain from splitting on a feature, the better that feature is as a split point.

C4.5: information gain ratio

CART: Gini impurity (Gini index)
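
To make the criteria concrete, here is a minimal sketch of computing entropy, information gain, and Gini impurity with NumPy (the helper names entropy, information_gain, and gini are illustrative, not from any library):

import numpy as np

def entropy(y):
    # H(Y) = -sum_j p_j * log2(p_j) over the classes appearing in y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(y, x):
    # I(Y, X) = H(Y) - H(Y|X), with H(Y|X) = sum_i P(X=x_i) * H(Y|X=x_i)
    cond = 0.0
    for xi in np.unique(x):
        mask = (x == xi)
        cond += mask.mean() * entropy(y[mask])
    return entropy(y) - cond

def gini(y):
    # Gini impurity: 1 - sum_j p_j^2; 0 means pure, 0.5 is the max for binary labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

# Toy example: survival labels split by sex (1 = male, 0 = female)
y = np.array([0, 0, 1, 1, 1, 0, 1, 0])
x = np.array([1, 1, 1, 0, 0, 0, 0, 1])
print(entropy(y), information_gain(y, x), gini(y))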

Example: predicting Titanic survival

a. Data preprocessing

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Data preprocessing: drop unused columns, encode categorical features, fill missing values
def read_dataset(fname):
    # Use the first column (PassengerId) as the row index
    data = pd.read_csv(fname, index_col=0)
    # Drop columns that are not used as features
    data.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
    # Encode sex: male -> 1, female -> 0
    data['Sex'] = (data['Sex'] == 'male').astype(int)
    # Encode the port of embarkation as its position in the list of unique values
    labels = data['Embarked'].unique().tolist()
    data['Embarked'] = data['Embarked'].apply(lambda s: labels.index(s))
    # Fill the remaining missing values (mainly Age) with 0 -- crude but simple
    data = data.fillna(0)
    return data

train = read_dataset('train.csv')

train.head()
             Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
PassengerId
1                   0       3    1  22.0      1      0   7.2500         0
2                   1       1    0  38.0      1      0  71.2833         1
3                   1       3    0  26.0      0      0   7.9250         0
4                   1       1    0  35.0      1      0  53.1000         0
5                   0       3    1  35.0      0      0   8.0500         0

b. Training the model

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
y = train['Survived'].values
X = train.drop(['Survived'],axis=1).values
X_train,X_test, y_train,y_test = train_test_split(X,y,test_size=0.2)
print('train dataset:{0}; test dataset: {1}'.format(X_train.shape,X_test.shape))
train dataset:(712, 7); test dataset: (179, 7)
# Fit a decision tree classifier with default hyperparameters
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train,y_train)
train_score = clf.score(X_train,y_train)
test_score = clf.score(X_test,y_test)
print('train score:{0}; test score:{1}'.format(train_score,test_score))
train score:0.9859550561797753; test score:0.7877094972067039

The training score is very high (98.6%) while the test score is only 78.8%: the model overfits the training data and needs pruning.
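
To see how severely the unpruned tree has grown, its size can be inspected directly; a minimal sketch (get_depth and get_n_leaves are available in scikit-learn 0.21 and later):

# The fully grown tree keeps splitting until its leaves are (almost) pure,
# which is why it nearly memorizes the training set
print('depth: {0}; leaves: {1}'.format(clf.get_depth(), clf.get_n_leaves()))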

c. Tuning hyperparameters

The max_depth parameter limits the depth of the tree: once a branch reaches the depth limit, it is not split any further.

# Hyperparameter selection: max_depth
def cv_score(d):
    clf = DecisionTreeClassifier(max_depth=d)
    clf.fit(X_train,y_train)
    tr_score = clf.score(X_train,y_train)
    cv_score = clf.score(X_test,y_test)
    return (tr_score,cv_score)

depths = range(2,15)
scores = [cv_score(d) for d in depths]
tr_scores = [s[0] for s in scores]
cv_scores = [s[1] for s in scores]

# Find the index of the best score on the held-out (validation) set
best_score_index = np.argmax(cv_scores)
best_score = cv_scores[best_score_index]
best_param = depths[best_score_index]
print('best param:{0};best score:{1}'.format(best_param, best_score))
best param:6;best score:0.8212290502793296

The relationship between the parameter and the scores:

plt.figure(figsize=(6,4),dpi=144)
plt.grid()
plt.xlabel('max depth of decision tree')
plt.ylabel('score')
plt.plot(depths, cv_scores,'.g-',label='cross-validation score')
plt.plot(depths, tr_scores,'.r--',label='training score')
plt.legend()

As the tree grows deeper, the training score keeps rising, but the test score does not improve with depth.
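
Note that the best depth above is chosen on a single held-out split, so the score is noisy; k-fold cross-validation gives a steadier estimate. A minimal sketch reusing the best_param found above:

from sklearn.model_selection import cross_val_score

# Average accuracy over 5 folds for the chosen depth --
# a fairer estimate than a single train/test split
scores = cross_val_score(DecisionTreeClassifier(max_depth=best_param), X, y, cv=5)
print('mean cv score: {0:.4f}'.format(scores.mean()))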

Another parameter to consider is min_impurity_split, which sets a threshold on a node's entropy or Gini impurity: a node is split only if its impurity is above the threshold, otherwise it becomes a leaf. (Note that min_impurity_split was later deprecated and removed in scikit-learn 1.0; min_impurity_decrease is its replacement in newer versions.)

def cv_score(val):
    # min_impurity_split was removed in scikit-learn 1.0;
    # on newer versions use min_impurity_decrease instead
    clf = DecisionTreeClassifier(criterion='gini', min_impurity_split=val)
    clf.fit(X_train,y_train)
    tr_score = clf.score(X_train,y_train)
    cv_score = clf.score(X_test,y_test)
    return (tr_score,cv_score)

# Train and score a model for each value in the parameter range
values = np.linspace(0,0.5,20)
scores = [cv_score(v) for v in values]
tr_scores = [s[0] for s in scores]
cv_scores = [s[1] for s in scores]

# Find the parameter value with the best validation score
best_score_index = np.argmax(cv_scores)
best_score = cv_scores[best_score_index]
best_param = values[best_score_index]

# Plot parameter vs. score
plt.figure(figsize=(6,4),dpi=144)
plt.grid()
plt.xlabel('threshold of gini impurity')
plt.ylabel('score')
plt.plot(values, cv_scores,'.g-',label='cross-validation score')
plt.plot(values, tr_scores,'.r--',label='training score')
plt.legend()

As the impurity threshold approaches 0.5, both the training and the test score drop sharply, a sign of underfitting: for binary labels the Gini impurity is at most 0.5 (at a 50/50 node, 1 − (0.5² + 0.5²) = 0.5), so a threshold near 0.5 prevents almost every node from splitting and the tree degenerates into a stump.

d. Hyperparameter selection with GridSearchCV

GridSearchCV from sklearn.model_selection automates the search for the best values of several hyperparameters at once.

The param_grid argument is a dictionary whose keys name the hyperparameters to tune and whose values list the candidate settings; it may contain several key-value pairs, and a list of such dictionaries can be passed to search each group independently.

The cv argument controls cross-validation: cv=5 splits the data into 5 folds, and each fold in turn serves as the validation set while the other four folds are used for training.

Outputs: clf.best_params_ holds the best parameters, clf.best_score_ the best (mean cross-validated) score, and clf.cv_results_ all intermediate results of the search.

from sklearn.model_selection import GridSearchCV
thresholds = np.linspace(0,0.5,50)
# Set up the parameter grid
param_grid = {'min_impurity_split': thresholds}
clf = GridSearchCV(DecisionTreeClassifier(),param_grid,cv=5)
clf.fit(X,y)
print('best param:{0}\nbest score:{1}'.format(clf.best_params_,clf.best_score_))
best param:{'min_impurity_split': 0.2040816326530612}
best score:0.8204264870931538
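
cv_results_ is a dictionary of arrays with one entry per candidate; a minimal sketch of pulling out the mean validation scores and plotting them against the thresholds (mean_test_score is the standard GridSearchCV result key):

# One mean validation score per candidate threshold
mean_scores = clf.cv_results_['mean_test_score']

plt.figure(figsize=(6,4), dpi=144)
plt.grid()
plt.xlabel('min_impurity_split')
plt.ylabel('score')
plt.plot(thresholds, mean_scores, '.g-', label='mean cross-validation score')
plt.legend()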

Searching several parameter groups at once for the best combination:

entropy_thresholds = np.linspace(0,1,50)
gini_thresholds = np.linspace(0,0.5,50)

# Set up the parameter grid: each dict in the list is searched independently
param_grid = [{'criterion':['entropy'],'min_impurity_split':entropy_thresholds},
             {'criterion':['gini'],'min_impurity_split':gini_thresholds},
             {'max_depth':range(2,10)},
             {'min_samples_split':range(2,30,2)}]
clf = GridSearchCV(DecisionTreeClassifier(),param_grid,cv=5)
clf.fit(X,y)
print('best_param:{0}\nbest score:{1}'.format(clf.best_params_,clf.best_score_))
best_param:{'criterion': 'entropy', 'min_impurity_split': 0.5306122448979591}
best score:0.8294051627384961
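
With refit=True (the default), GridSearchCV retrains the best configuration on the whole dataset, so clf can be used for prediction directly. A minimal sketch, assuming the Kaggle test file test.csv sits next to train.csv:

# The refit best estimator predicts directly through the GridSearchCV object
test = read_dataset('test.csv')  # the Kaggle test file has no 'Survived' column
# Caveat: read_dataset encodes Embarked by order of appearance within each file,
# so for a real submission the train/test encodings should be shared
predictions = clf.predict(test.values)
print(predictions[:10])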