Hands-On Machine Learning with Scikit-Learn & TensorFlow: Reading Notes, Chapter 6, Decision Trees

阿新 • Published: 2019-01-17

CHAPTER 6 Decision Trees

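Like Support Vector Machines, Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks, and even multioutput tasks.

Decision Trees are also the fundamental components of Random Forests, which are among the most powerful Machine Learning algorithms available today.

Training a Decision Tree classifier on the iris dataset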

from sklearn.datasets import load_iris 
from sklearn.tree import DecisionTreeClassifier

iris = load_iris() 
X = iris.data[:, 2:]  # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2) 
tree_clf.fit(X, y)

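The DecisionTreeClassifier class has several other parameters that restrict the shape of the tree:
min_samples_split (the minimum number of samples a node must have before it can be split), min_samples_leaf (the minimum number of samples a leaf node must have), min_weight_fraction_leaf (same as min_samples_leaf, but expressed as a fraction of the total number of weighted instances),
max_leaf_nodes (the maximum number of leaf nodes), and max_features (the maximum number of features evaluated for splitting at each node).
Increasing the min_* hyperparameters or reducing the max_* hyperparameters regularizes the model.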
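
The effect of these hyperparameters can be seen by comparing an unrestricted tree with a constrained one. The following is a minimal sketch: the dataset (make_moons), its size and noise level, and the particular values min_samples_leaf=10 and max_leaf_nodes=8 are arbitrary choices made for illustration, not values from the book.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Toy data; sample size and noise level are arbitrary illustration values.
X_demo, y_demo = make_moons(n_samples=200, noise=0.3, random_state=0)

# No restrictions: the tree grows until every leaf is pure.
unrestricted = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Larger min_* and smaller max_* values constrain the tree (regularization).
restricted = DecisionTreeClassifier(min_samples_leaf=10, max_leaf_nodes=8,
                                    random_state=0).fit(X_demo, y_demo)

for name, clf in (("unrestricted", unrestricted), ("restricted", restricted)):
    print(name, "depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```

The restricted tree should end up with far fewer leaves, which is what "regularizing" means here: the model has less freedom to memorize individual training instances.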

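The dot command-line tool from the graphviz package converts a .dot file into formats such as PDF or PNG. (export_graphviz writes such a file when out_file is given a file path instead of None.)

Graphviz is an open source graph visualization package: http://www.graphviz.org/
$ dot -Tpng iris_tree.dot -o iris_tree.png

Visualizing the tree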

from sklearn.tree import export_graphviz
from IPython.display import Image  
import pydotplus 

dot_data = export_graphviz(tree_clf, out_file=None,
                         feature_names=iris.feature_names[2:],
                         class_names=iris.target_names,
                         filled=True,   # color nodes by majority class
                         rounded=True)  # draw node boxes with rounded corners
graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())


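One of the many qualities of Decision Trees is that they require very little data preparation. In particular, they do not require feature scaling or centering.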
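
A quick way to convince yourself of this is to train the same tree on raw and on standardized features and compare the predictions. This is only a sketch: the use of StandardScaler and the variable names are assumptions made for the illustration, and the comparison is restricted to the training instances.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_raw = iris.data[:, 2:]   # petal length and width
y_iris = iris.target

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_raw)

tree_raw = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_raw, y_iris)
tree_scaled = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_scaled, y_iris)

# Splits depend only on the ordering of feature values, so the two trees
# should partition the training instances in the same way.
same = np.array_equal(tree_raw.predict(X_raw), tree_scaled.predict(X_scaled))
print("Identical training predictions:", same)
```

The thresholds stored in the two trees differ (they live in different units), but the resulting partition of the training set should not.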

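Scikit-Learn uses the CART algorithm, which produces only binary trees: every non-leaf node has exactly two children (each question has only a yes/no answer). Other algorithms, such as ID3, can produce Decision Trees whose nodes have more than two children.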
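
The binary structure can be inspected directly on a fitted model through its tree_ attribute. The sketch below retrains the depth-2 iris tree so that it is self-contained; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data[:, 2:], iris.target)

tree = clf.tree_
for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1:  # scikit-learn marks "no child" with -1
        print(f"node {node}: leaf")
    else:
        print(f"node {node}: split on feature {tree.feature[node]} "
              f"at {tree.threshold[node]:.2f} -> children {left}, {right}")
```

Every node printed is either a leaf or has exactly one left and one right child; there is no node with a single child or with three.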

Plotting the decision boundary

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[0, 7.5, 0, 3], iris=True, legend=False, plot_training=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)  # filled contours of the predicted class
    if not iris:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    if plot_training:
        plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris-Setosa")
        plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris-Versicolor")
        plt.plot(X[:, 0][y==2], X[:, 1][y==2], "g^", label="Iris-Virginica")
        plt.axis(axes)
    if iris:
        plt.xlabel("Petal length", fontsize=14)
        plt.ylabel("Petal width", fontsize=14)
    else:
        plt.xlabel(r"$x_1$", fontsize=18)
        plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
    if legend:
        plt.legend(loc="lower right", fontsize=14)

plt.figure(figsize=(8, 4))
plot_decision_boundary(tree_clf, X, y)
plt.plot([2.45, 2.45], [0, 3], "k-", linewidth=2)
plt.plot([2.45, 7.5], [1.75, 1.75], "k--", linewidth=2)
plt.plot([4.95, 4.95], [0, 1.75], "k:", linewidth=2)
plt.plot([4.85, 4.85], [1.75, 3], "k:", linewidth=2)
plt.text(1.40, 1.0, "Depth=0", fontsize=15)
plt.text(3.2, 1.80, "Depth=1", fontsize=13)
plt.text(4.05, 0.5, "(Depth=2)", fontsize=11)

plt.show()


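White box versus black box models
As we have seen, Decision Trees are fairly intuitive and their decisions are easy to interpret. Such models are often called white box models. In contrast, Random Forests and neural networks are generally considered black box models. They make great predictions, and you can easily check the calculations they performed to make those predictions. Nevertheless, it is usually hard to explain in simple terms why the predictions were made. For example, if a neural network says that a particular person appears in a picture, it is hard to know what actually contributed to this prediction: did the model recognize that person's eyes? Her mouth? Her nose? Her shoes? Or even the couch she was sitting on? Conversely, Decision Trees provide nice, simple classification rules that can even be applied manually if need be (for example, for flower classification).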
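
To make the "simple classification rules" concrete, scikit-learn can print a fitted tree as plain if/else rules with export_text. A minimal sketch, retraining the depth-2 iris tree so that the snippet is self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data[:, 2:], iris.target)

# One line per test or leaf; indentation shows the depth in the tree.
rules = export_text(clf, feature_names=list(iris.feature_names[2:]))
print(rules)
```

The printed rules are short enough to follow by hand for any given flower, which is exactly what makes the model a white box.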

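Regularization Hyperparameters

Decision Trees make very few assumptions about the training data (unlike linear models, for example, which assume that the data is linear).

Left unconstrained, the tree structure adapts itself to the training data and fits it very closely, most likely overfitting it.

Such a model is often called a nonparametric model, not because it has no parameters (it often has many) but because the number of parameters is not determined before training, so the model structure is free to stick closely to the data.

In contrast, a parametric model such as a linear model has a predetermined number of parameters, so its degrees of freedom are limited, which reduces the risk of overfitting (but increases the risk of underfitting).

Other algorithms work by first training the Decision Tree without restrictions, letting it grow freely, and then pruning (deleting) unnecessary nodes.

A node whose children are all leaf nodes is considered unnecessary if the purity improvement it provides is not statistically significant.

Standard statistical tests, such as the chi-squared (χ²) test, are used to estimate the probability that the improvement is purely the result of chance (this is the null hypothesis).

If this probability, called the p-value, is higher than a given threshold (typically 5%, i.e. a 95% confidence level, controlled by a hyperparameter), the node is considered unnecessary and its children are deleted.

Pruning continues until all unnecessary nodes have been removed.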
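
The statistical test behind this kind of pruning can be sketched with SciPy's chi2_contingency. This is a standalone illustration of the test itself, not a description of what scikit-learn does internally: the helper split_p_value, the threshold constant, and the class counts are all hypothetical.

```python
from scipy.stats import chi2_contingency

P_THRESHOLD = 0.05  # hypothetical pruning threshold (the "5%" above)

def split_p_value(children_class_counts):
    """p-value of a chi-squared independence test for one split.

    Rows are the child nodes, columns are the classes. A high p-value
    means the class distribution barely differs between the children.
    """
    _, p_value, _, _ = chi2_contingency(children_class_counts)
    return p_value

# Made-up class counts in the two children of a candidate node.
informative = [[49, 5], [1, 45]]       # children are much purer than the parent
uninformative = [[10, 10], [11, 9]]    # children look just like the parent

for name, counts in (("informative", informative), ("uninformative", uninformative)):
    p = split_p_value(counts)
    verdict = "prune" if p > P_THRESHOLD else "keep"
    print(f"{name}: p-value = {p:.3g} -> {verdict}")
```

Under these assumptions the first split would be kept and the second would be pruned, because its apparent improvement is easily explained by chance.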

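The code below trains two Decision Trees on the moons dataset. The tree on the left uses the default hyperparameters (no restrictions), and the tree on the right is trained with min_samples_leaf=4. The model on the left is clearly overfitting, and the model on the right will probably generalize better.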

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)

deep_tree_clf1 = DecisionTreeClassifier(random_state=42)
deep_tree_clf2 = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
deep_tree_clf1.fit(Xm, ym)
deep_tree_clf2.fit(Xm, ym)

plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_decision_boundary(deep_tree_clf1, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title("No restrictions", fontsize=16)
plt.subplot(122)
plot_decision_boundary(deep_tree_clf2, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title("min_samples_leaf = {}".format(deep_tree_clf2.min_samples_leaf), fontsize=14)

plt.show()


Regression

Decision Trees can also perform regression tasks. Let's build a regression tree with Scikit-Learn's DecisionTreeRegressor class and train it on a noisy quadratic dataset with max_depth=2.

# Quadratic training set + noise
np.random.seed(42)
m = 200
X = np.random.rand(m, 1)
y = 4 * (X - 0.5) ** 2 + np.random.randn(m, 1) / 10

import matplotlib.pyplot as plt

plt.scatter(X,y,s=5)
plt.show()


from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)

reg_dot_data = export_graphviz(tree_reg, out_file=None, 
                         filled=True, rounded=True,)  
reg_graph = pydotplus.graph_from_dot_data(reg_dot_data)  
Image(reg_graph.create_png())


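This tree looks very similar to the classification tree built earlier. The main difference is that instead of predicting a class in each node, it predicts a value.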

from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(random_state=42, max_depth=2)
tree_reg2 = DecisionTreeRegressor(random_state=42, max_depth=3)
tree_reg1.fit(X, y)
tree_reg2.fit(X, y)

def plot_regression_predictions(tree_reg, X, y, axes=[0, 1, -0.2, 1], ylabel="$y$"):
    x1 = np.linspace(axes[0], axes[1], 500).reshape(-1, 1)
    y_pred = tree_reg.predict(x1)
    plt.axis(axes)
    plt.xlabel("$x_1$", fontsize=18)
    if ylabel:
        plt.ylabel(ylabel, fontsize=18, rotation=0)
    plt.plot(X, y, "b.")
    plt.plot(x1, y_pred, "r.-", linewidth=2, label=r"$\hat{y}$")

plt.figure(figsize=(18, 6))
plt.subplot(121)
plot_regression_predictions(tree_reg1, X, y)
for split, style in ((0.1973, "k-"), (0.0917, "k--"), (0.7718, "k--")):
    plt.plot([split, split], [-0.2, 1], style, linewidth=2)
plt.text(0.21, 0.65, "Depth=0", fontsize=15)
plt.text(0.01, 0.2, "Depth=1", fontsize=13)
plt.text(0.65, 0.8, "Depth=1", fontsize=13)
plt.legend(loc="upper center", fontsize=18)
plt.title("max_depth=2", fontsize=14)

plt.subplot(122)
plot_regression_predictions(tree_reg2, X, y, ylabel=None)
for split, style in ((0.1973, "k-"), (0.0917, "k--"), (0.7718, "k--")):
    plt.plot([split, split], [-0.2, 1], style, linewidth=2)
for split in (0.0458, 0.1298, 0.2873, 0.9040):
    plt.plot([split, split], [-0.2, 1], "k:", linewidth=1)
plt.text(0.3, 0.5, "Depth=2", fontsize=13)
plt.title("max_depth=3", fontsize=14)

plt.show()


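The left plot shows the predictions of the model trained with max_depth=2; with max_depth=3 you get the predictions shown on the right. Notice that the predicted value for each region is always the average target value of the instances in that region. The algorithm splits each region in a way that makes most training instances as close as possible to that predicted value.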
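
The claim that each region predicts the average target value of its instances can be checked numerically. The sketch below regenerates a noisy quadratic dataset (with its own random generator, so the numbers need not match the plots above) and compares each leaf's prediction with the mean target of the training instances that fall into it; the variable names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X_q = rng.rand(200, 1)
y_q = (4 * (X_q - 0.5) ** 2 + rng.randn(200, 1) / 10).ravel()

reg = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_q, y_q)

leaf_ids = reg.apply(X_q)   # index of the leaf each instance ends up in
pred = reg.predict(X_q)

for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    print(f"leaf {leaf}: n={in_leaf.sum()}, "
          f"mean target={y_q[in_leaf].mean():.3f}, "
          f"prediction={pred[in_leaf][0]:.3f}")
```

With the default splitting criterion, the two printed columns should agree for every leaf.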

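Just as for classification tasks, Decision Trees are prone to overfitting when dealing with regression tasks. Without any regularization (that is, with the default hyperparameters), you get the predictions shown on the left, which obviously overfit the training set very badly. Just setting min_samples_leaf=10 produces a much more reasonable model, shown on the right.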

tree_reg1 = DecisionTreeRegressor(random_state=42)
tree_reg2 = DecisionTreeRegressor(random_state=42, min_samples_leaf=10)
tree_reg1.fit(X, y)
tree_reg2.fit(X, y)

x1 = np.linspace(0, 1, 500).reshape(-1, 1)
y_pred1 = tree_reg1.predict(x1)
y_pred2 = tree_reg2.predict(x1)

plt.figure(figsize=(11, 4))

plt.subplot(121)
plt.plot(X, y, "b.")
plt.plot(x1, y_pred1, "r.-", linewidth=2, label=r"$\hat{y}$")
plt.axis([0, 1, -0.2, 1.1])
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", fontsize=18, rotation=0)
plt.legend(loc="upper center", fontsize=18)
plt.title("No restrictions", fontsize=14)

plt.subplot(122)
plt.plot(X, y, "b.")
plt.plot(x1, y_pred2, "r.-", linewidth=2, label=r"$\hat{y}$")
plt.axis([0, 1, -0.2, 1.1])
plt.xlabel("$x_1$", fontsize=18)
plt.title("min_samples_leaf={}".format(tree_reg2.min_samples_leaf), fontsize=14)
plt.show()


Instability

np.random.seed(6)
Xs = np.random.rand(100, 2) - 0.5
ys = (Xs[:, 0] > 0).astype(np.float32) * 2

angle = np.pi / 4
rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
Xsr = Xs.dot(rotation_matrix)

tree_clf_s = DecisionTreeClassifier(random_state=42)
tree_clf_s.fit(Xs, ys)
tree_clf_sr = DecisionTreeClassifier(random_state=42)
tree_clf_sr.fit(Xsr, ys)

plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_decision_boundary(tree_clf_s, Xs, ys, axes=[-0.7, 0.7, -0.7, 0.7], iris=False)
plt.subplot(122)
plot_decision_boundary(tree_clf_sr, Xsr, ys, axes=[-0.7, 0.7, -0.7, 0.7], iris=False)

plt.show()


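Decision Trees love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to rotation of the training set. In the left plot a Decision Tree separates the data easily, but in the right plot, after the data is rotated by 45°, the decision boundary looks unnecessarily convoluted. Although both Decision Trees fit the training set perfectly, the model on the right will very likely generalize poorly.

One way to limit this problem is to use Principal Component Analysis (PCA), which often results in a better orientation of the training data.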
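
A rough sketch of the PCA idea follows. Note the assumption: the dataset here is a hypothetical elongated cloud (not the square dataset used above), chosen so that PCA has a clear main axis to find and that axis is the one separating the classes. When the data has no dominant direction, PCA has nothing to align to and may not help.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Hypothetical elongated cloud: wide along the first axis, narrow along the second.
X_long = np.c_[rng.uniform(-3, 3, 200), rng.uniform(-0.5, 0.5, 200)]
y_long = (X_long[:, 0] > 0).astype(int)

# Rotate by 45 degrees so the class boundary is no longer axis-aligned.
angle = np.pi / 4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
X_rot = X_long.dot(rotation)

tree_rot = DecisionTreeClassifier(random_state=42).fit(X_rot, y_long)

# PCA re-expresses the data along its directions of largest variance.
X_pca = PCA(n_components=2).fit_transform(X_rot)
tree_pca = DecisionTreeClassifier(random_state=42).fit(X_pca, y_long)

print("leaves on rotated data:", tree_rot.get_n_leaves())
print("leaves after PCA:      ", tree_pca.get_n_leaves())
```

Fewer leaves after PCA indicate a simpler decision boundary for the same training accuracy, which is the improvement in orientation the text refers to.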

Q&A

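1. Approximately how deep is a Decision Tree trained (without restrictions) on a training set of 1 million instances?
A: About log2(10^6) ≈ 20 levels. In practice it is slightly deeper, because the resulting tree is generally not perfectly balanced.

2. Is a node's Gini impurity generally lower or higher than its parent's?
A: Generally lower. A single child can have a higher Gini impurity than its parent, but the weighted sum of the children's impurities is typically lower than the parent's.

3. If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?
A: Yes. Decreasing max_depth constrains the model, which regularizes it.

4. If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?
A: No. Decision Trees do not care whether the training data is scaled or centered.

5. If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how long will it take to train another Decision Tree on a training set containing 10 million instances?
A: The training complexity of a Decision Tree is O(n × m log(m)), where m is the number of instances and n the number of features. Multiplying the training set size by 10 therefore multiplies the training time by K = 10 × log(10m) / log(m). With m = 10^6, K ≈ 11.7, so expect roughly 11.7 hours.

6. If your training set contains 100,000 instances, will setting presort=True speed up training?
A: No. Presorting the training set speeds up training only if the dataset is smaller than a few thousand instances. With 100,000 instances, setting presort=True will considerably slow down training.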
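
The arithmetic behind answers 1, 2 and 5 can be checked with a few lines of Python. The class counts used for the Gini example (a parent node holding four instances of class A and one of class B, split into {A, B} and {A, A, A}) are an illustrative assumption, and the gini helper is defined here rather than taken from scikit-learn.

```python
import math

m = 10**6  # number of training instances

# Q1: depth of a well-balanced binary tree with one leaf per instance.
depth = math.ceil(math.log2(m))
print("approximate depth:", depth)

# Q2: a child can be less pure than its parent while the weighted
# impurity of the children is still lower.
def gini(class_counts):
    """Gini impurity of a node, given the number of instances per class."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

parent = gini([4, 1])        # A, B, A, A, A
child_left = gini([1, 1])    # A, B
child_right = gini([3, 0])   # A, A, A
weighted = (2 / 5) * child_left + (3 / 5) * child_right
print(f"parent={parent:.2f}, left child={child_left:.2f}, "
      f"right child={child_right:.2f}, weighted children={weighted:.2f}")

# Q5: training time ratio when the training set grows from m to 10m,
# assuming training cost grows as m * log(m).
ratio = 10 * math.log(10 * m) / math.log(m)
print(f"training time multiplier: {ratio:.1f}")
```

In the Gini example the left child is less pure than the parent, yet the weighted impurity of the two children is lower, which is the situation described in answer 2.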
