Li Hang, "Statistical Learning Methods": Chapter 5, Decision Tree Models

阿新 • Published: 2019-02-09

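Plenty of material is already available online, so this post skips the derivation of the algorithms and only gives the author's Python implementations for reference.

Applicable problem: multi-class classification

Three steps: feature selection, decision tree generation, and decision tree pruning

Common decision tree algorithms:

ID3: features are selected by information gain
C4.5: features are selected by information gain ratio
CART: features are selected by the Gini index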
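
As a rough illustration of the three criteria listed above, the sketch below computes information gain, information gain ratio, and the Gini index in plain Python. The data is the 15-row loan application table from Example 5.2 of the book, reduced here to two features; it is not the MNIST data used in the rest of this post, and all function names are made up for this example.

```python
from collections import Counter
from math import log2

# Loan data from the book's Example 5.2 (assumption: only two of the four
# features are kept): (has_job, owns_house, approved)
ROWS = [
    ("no", "no", "no"), ("no", "no", "no"), ("yes", "no", "yes"),
    ("yes", "yes", "yes"), ("no", "no", "no"), ("no", "no", "no"),
    ("no", "no", "no"), ("yes", "yes", "yes"), ("no", "yes", "yes"),
    ("no", "yes", "yes"), ("no", "yes", "yes"), ("no", "yes", "yes"),
    ("yes", "no", "yes"), ("yes", "no", "yes"), ("no", "no", "no"),
]

def entropy(values):
    """Empirical entropy of a list of discrete values, in bits."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    """H(Y|X): entropy of ys inside each group of equal xs, weighted by group size."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def info_gain(xs, ys):
    """ID3 criterion: g(D, A) = H(D) - H(D|A)."""
    return entropy(ys) - conditional_entropy(xs, ys)

def gain_ratio(xs, ys):
    """C4.5 criterion: g_R(D, A) = g(D, A) / H_A(D)."""
    h_a = entropy(xs)
    return info_gain(xs, ys) / h_a if h_a > 0 else 0.0

def gini(values):
    """Gini index of a list of class labels: 1 - sum_k p_k^2."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())

def gini_split(xs, ys, value):
    """CART criterion: Gini index of D under the binary split A == value / A != value."""
    left = [y for x, y in zip(xs, ys) if x == value]
    right = [y for x, y in zip(xs, ys) if x != value]
    n = len(ys)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

job = [r[0] for r in ROWS]
house = [r[1] for r in ROWS]
label = [r[2] for r in ROWS]

for name, col in (("has_job", job), ("owns_house", house)):
    print(name,
          round(info_gain(col, label), 3),
          round(gain_ratio(col, label), 3),
          round(gini_split(col, label, "yes"), 3))
```

On this table all three criteria prefer `owns_house`: it has the larger information gain and gain ratio and the smaller Gini index, which agrees with the book's Examples 5.2 and 5.4.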

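Test dataset: train.csv (MNIST digits in CSV form: the first column is the label, the remaining 784 columns are pixels)

ID3 algorithm code: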


# encoding=utf-8

import cv2
import time
import numpy as np
import pandas as pd


from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import accuracy_score

# Binarization: pixels above 50 become 0, all other pixels become 1
def binaryzation(img):
    cv_img = img.astype(np.uint8)
    cv2.threshold(cv_img,50,1,cv2.THRESH_BINARY_INV,cv_img)
    return cv_img

def binaryzation_features(trainset):
    features = []

    for img in trainset:
        img = np.reshape(img,(28,28))
        cv_img = img.astype(np.uint8)

        img_b = binaryzation(cv_img)
        # hog_feature = np.transpose(hog_feature) 

        features.append(img_b)

    features = np.array(features)
    features = np.reshape(features,(-1,feature_len))

    return features


class Tree(object):
    def __init__(self,node_type,Class = None, feature = None):
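        self.node_type = node_type  # node type ('internal' or 'leaf')
        self.dict = {}  # maps each value a_i of the splitting feature A_g to the subtree built for it

        self.Class = Class  # class label of a leaf node; None for an internal node
        self.feature = feature  # index of the feature this node splits on (the one with the largest information gain here)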

    def add_tree(self,key,tree):
        self.dict[key] = tree

    def predict(self,features): 
        if self.node_type == 'leaf' or (features[self.feature] not in self.dict):
            return self.Class

        tree = self.dict.get(features[self.feature])
        return tree.predict(features)

# Compute the empirical entropy H(x) of the data x
def calc_ent(x):
    x_value_list = set([x[i] for i in range(x.shape[0])])
    ent = 0.0
    for x_value in x_value_list:
        p = float(x[x == x_value].shape[0]) / x.shape[0]
        logp = np.log2(p)
        ent -= p * logp

    return ent

# Compute the conditional entropy H(y|x)
def calc_condition_ent(x, y):
    x_value_list = set([x[i] for i in range(x.shape[0])])
    ent = 0.0
    for x_value in x_value_list:
        sub_y = y[x == x_value]
        temp_ent = calc_ent(sub_y)
        ent += (float(sub_y.shape[0]) / y.shape[0]) * temp_ent

    return ent

# Compute the information gain g(y, x) = H(y) - H(y|x)
def calc_ent_grap(x,y):
    base_ent = calc_ent(y)
    condition_ent = calc_condition_ent(x, y)
    ent_grap = base_ent - condition_ent

    return ent_grap

# ID3 algorithm
def recurse_train(train_set,train_label,features):

    LEAF = 'leaf'
    INTERNAL = 'internal'

    # Step 1: if all instances in train_set belong to the same class Ck, return a leaf for that class
    label_set = set(train_label)
    if len(label_set) == 1:
        return Tree(LEAF,Class = label_set.pop())

    # Step 2: if the feature set is empty, return a leaf labeled with the majority class
    class_len = [(i,len(list(filter(lambda x:x==i,train_label)))) for i in range(class_num)]  # number of samples in each class
    (max_class,max_len) = max(class_len,key = lambda x:x[1])

    if len(features) == 0:
        return Tree(LEAF,Class = max_class)

    # Step 3: compute the information gain of every feature and select the largest
    max_feature = 0
    max_gda = 0
    D = train_label
    for feature in features:
        # print(type(train_set))
        A = np.array(train_set[:,feature].flat)  # column `feature` of the training set (the values of that feature)
        gda=calc_ent_grap(A,D)
        if gda > max_gda:
            max_gda,max_feature = gda,feature

    # Step 4: if the largest information gain is below the threshold, return a majority-class leaf
    if max_gda < epsilon:
        return Tree(LEAF,Class = max_class)

    # Step 5: split on the selected feature and build one subtree per non-empty subset
    sub_features = list(filter(lambda x:x!=max_feature,features))
    tree = Tree(INTERNAL,feature=max_feature)

    max_feature_col = np.array(train_set[:,max_feature].flat)
    feature_value_list = set([max_feature_col[i] for i in range(max_feature_col.shape[0])])  # values taken by the selected feature (shape[0] is the number of rows)
    for feature_value in feature_value_list:

        index = []
        for i in range(len(train_label)):
            if train_set[i][max_feature] == feature_value:
                index.append(i)

        sub_train_set = train_set[index]
        sub_train_label = train_label[index]

        sub_tree = recurse_train(sub_train_set,sub_train_label,sub_features)
        tree.add_tree(feature_value,sub_tree)

    return tree

def train(train_set,train_label,features):
    return recurse_train(train_set,train_label,features)

def predict(test_set,tree):
    result = []
    for features in test_set:
        tmp_predict = tree.predict(features)
        result.append(tmp_predict)
    return np.array(result)


class_num = 10  # MNIST has 10 labels: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
feature_len = 784  # each MNIST image has 28*28 = 784 features (pixels)
epsilon = 0.001  # information gain threshold

if __name__ == '__main__':

    print("Start read data...")

    time_1 = time.time()

    raw_data = pd.read_csv('../data/train.csv', header=0)  # read the CSV data
    data = raw_data.values

    imgs = data[::, 1::]
    features = binaryzation_features(imgs)  # binarize the images (important: accuracy is very low otherwise)
    labels = data[::, 0]

    # Hold-out validation: randomly pick 33% of the data as the test set and train on the rest
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)
    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))

    # Build the decision tree with the ID3 algorithm
    print('Start training...')
    tree = train(train_features,train_labels,list(range(feature_len)))
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))

    print('Start predicting...')
    test_predict = predict(test_features,tree)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))

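    # print("Predictions:")
    # print(test_predict)
    # A prediction is None when a sample reaches an internal node with an unseen feature value;
    # replace it with -1, which matches no digit label and therefore counts as an error
    test_predict = np.array([-1 if p is None else p for p in test_predict])
    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)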
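
The binarization step in the program above matters because this ID3 implementation creates one child per distinct value of the chosen feature. As a small sketch of what `cv2.threshold(..., 50, 1, cv2.THRESH_BINARY_INV)` does to one row of pixels, here is the same rule in plain Python. The helper name `binarize_row` and the sample pixel values are made up for illustration.

```python
def binarize_row(pixels, thresh=50, maxval=1):
    """Inverse binary threshold: pixels above `thresh` become 0, the rest become `maxval`."""
    return [0 if p > thresh else maxval for p in pixels]

row = [0, 12, 50, 51, 128, 255, 37, 200]  # made-up gray levels in 0..255
binary = binarize_row(row)

print(binary)                           # every pixel is now 0 or 1
print(len(set(row)), len(set(binary)))  # number of distinct values before and after
```

With raw gray levels a node could fan out into as many as 256 children, each backed by very few training samples; after binarization every split is two-way.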

C4.5 algorithm code:

# encoding=utf-8

import cv2
import time
import numpy as np
import pandas as pd


from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import accuracy_score

# Binarization: pixels above 50 become 0, all other pixels become 1
def binaryzation(img):
    cv_img = img.astype(np.uint8)
    cv2.threshold(cv_img,50,1,cv2.THRESH_BINARY_INV,cv_img)
    return cv_img

def binaryzation_features(trainset):
    features = []

    for img in trainset:
        img = np.reshape(img,(28,28))
        cv_img = img.astype(np.uint8)

        img_b = binaryzation(cv_img)
        # hog_feature = np.transpose(hog_feature)
        features.append(img_b)

    features = np.array(features)
    features = np.reshape(features,(-1,feature_len))

    return features


class Tree(object):
    def __init__(self,node_type,Class = None, feature = None):
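        self.node_type = node_type  # node type ('internal' or 'leaf')
        self.dict = {}  # maps each value a_i of the splitting feature A_g to the subtree built for it
        self.Class = Class  # class label of a leaf node; None for an internal node
        self.feature = feature  # index of the feature this node splits on (the one with the largest information gain ratio here)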

    def add_tree(self,key,tree):
        self.dict[key] = tree

    def predict(self,features): 
        if self.node_type == 'leaf' or (features[self.feature] not in self.dict):
            return self.Class

        tree = self.dict.get(features[self.feature])
        return tree.predict(features)

# Compute the empirical entropy H(x) of the data x
def calc_ent(x):
    x_value_list = set([x[i] for i in range(x.shape[0])])
    ent = 0.0
    for x_value in x_value_list:
        p = float(x[x == x_value].shape[0]) / x.shape[0]
        logp = np.log2(p)
        ent -= p * logp

    return ent

# Compute the conditional entropy H(y|x)
def calc_condition_ent(x, y):
    x_value_list = set([x[i] for i in range(x.shape[0])])
    ent = 0.0
    for x_value in x_value_list:
        sub_y = y[x == x_value]
        temp_ent = calc_ent(sub_y)
        ent += (float(sub_y.shape[0]) / y.shape[0]) * temp_ent

    return ent

# Compute the information gain g(y, x) = H(y) - H(y|x)
def calc_ent_grap(x,y):
    base_ent = calc_ent(y)
    condition_ent = calc_condition_ent(x, y)
    ent_grap = base_ent - condition_ent

    return ent_grap

# C4.5 algorithm
def recurse_train(train_set,train_label,features):

    LEAF = 'leaf'
    INTERNAL = 'internal'

    # Step 1: if all instances in train_set belong to the same class Ck, return a leaf for that class
    label_set = set(train_label)
    if len(label_set) == 1:
        return Tree(LEAF,Class = label_set.pop())

    # Step 2: if the feature set is empty, return a leaf labeled with the majority class
    class_len = [(i,len(list(filter(lambda x:x==i,train_label)))) for i in range(class_num)]  # number of samples in each class
    (max_class,max_len) = max(class_len,key = lambda x:x[1])

    if len(features) == 0:
        return Tree(LEAF,Class = max_class)

    # Step 3: compute the information gain ratio of every feature and select the largest
    max_feature = 0
    max_gda = 0
    D = train_label
    for feature in features:
        # print(type(train_set))
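        A = np.array(train_set[:,feature].flat)  # column `feature` of the training set (the values of that feature)
        gda = calc_ent_grap(A,D)
        ent_A = calc_ent(A)  # entropy of the data with respect to the values of feature A
        if ent_A != 0:  # divide by it to get the information gain ratio: the only difference from the ID3 code
            gda /= ent_A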
        if gda > max_gda:
            max_gda,max_feature = gda,feature

    # Step 4: if the largest information gain ratio is below the threshold, return a majority-class leaf
    if max_gda < epsilon:
        return Tree(LEAF,Class = max_class)

    # Step 5: split on the selected feature and build one subtree per non-empty subset
    sub_features = list(filter(lambda x:x!=max_feature,features))
    tree = Tree(INTERNAL,feature=max_feature)

    max_feature_col = np.array(train_set[:,max_feature].flat)
    feature_value_list = set([max_feature_col[i] for i in range(max_feature_col.shape[0])])  # values taken by the selected feature (shape[0] is the number of rows)
    for feature_value in feature_value_list:

        index = []
        for i in range(len(train_label)):
            if train_set[i][max_feature] == feature_value:
                index.append(i)

        sub_train_set = train_set[index]
        sub_train_label = train_label[index]

        sub_tree = recurse_train(sub_train_set,sub_train_label,sub_features)
        tree.add_tree(feature_value,sub_tree)

    return tree

def train(train_set,train_label,features):
    return recurse_train(train_set,train_label,features)

def predict(test_set,tree):
    result = []
    for features in test_set:
        tmp_predict = tree.predict(features)
        result.append(tmp_predict)
    return np.array(result)


class_num = 10  # MNIST has 10 labels: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
feature_len = 784  # each MNIST image has 28*28 = 784 features (pixels)
epsilon = 0.001  # information gain ratio threshold

if __name__ == '__main__':

    print("Start read data...")

    time_1 = time.time()

    raw_data = pd.read_csv('../data/train.csv', header=0)  # read the CSV data
    data = raw_data.values

    imgs = data[::, 1::]
    features = binaryzation_features(imgs)  # binarize the images (important: accuracy is very low otherwise)
    labels = data[::, 0]

    # Hold-out validation: randomly pick 33% of the data as the test set and train on the rest
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)
    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))

    # Build the decision tree with the C4.5 algorithm
    print('Start training...')
    tree = train(train_features,train_labels,list(range(feature_len)))
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))

    print('Start predicting...')
    test_predict = predict(test_features,tree)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))

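    # print("Predictions:")
    # print(test_predict)
    # A prediction is None when a sample reaches an internal node with an unseen feature value;
    # replace it with -1, which matches no digit label and therefore counts as an error
    test_predict = np.array([-1 if p is None else p for p in test_predict])
    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)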

The code is available here: decision_tree/C45.py
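
To see why dividing by the feature's own entropy matters, consider the toy sketch below. The data and function names are invented for illustration: `sample_id` is unique per row, so it has the highest information gain while being useless for prediction, and the gain ratio ranks it below the more informative feature `f`.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical entropy of a list of discrete values, in bits."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def info_gain(xs, ys):
    """Information gain of feature xs with respect to labels ys."""
    n = len(ys)
    cond = 0.0
    for v in set(xs):
        sub = [y for x, y in zip(xs, ys) if x == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(ys) - cond

def gain_ratio(xs, ys):
    """Information gain divided by the entropy of the feature itself."""
    h_a = entropy(xs)
    return info_gain(xs, ys) / h_a if h_a > 0 else 0.0

labels    = [1, 1, 0, 0, 1, 0]
sample_id = [0, 1, 2, 3, 4, 5]  # a different value on every row
f         = [1, 1, 0, 0, 1, 1]  # a genuinely informative binary feature

features = {"sample_id": sample_id, "f": f}
best_by_gain = max(features, key=lambda k: info_gain(features[k], labels))
best_by_ratio = max(features, key=lambda k: gain_ratio(features[k], labels))
print("information gain picks:", best_by_gain)
print("gain ratio picks:", best_by_ratio)
```

Information gain favors features with many distinct values, which is the bias the gain ratio is meant to correct. On the binarized MNIST features used in this post every feature has at most two values, so the two criteria are likely to behave much more alike there.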

CART algorithm code (implemented with sklearn):

# encoding=utf-8

import pandas as pd
import time

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import accuracy_score

from sklearn.tree import DecisionTreeClassifier



if __name__ == '__main__':

    print("Start read data...")
    time_1 = time.time()

    raw_data = pd.read_csv('../data/train.csv', header=0) 
    data = raw_data.values

    features = data[::, 1::]
    labels = data[::, 0]

    # Randomly pick 33% of the data as the test set; the rest is the training set
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)

    time_2 = time.time()
    print('read data cost %f seconds' % (time_2 - time_1))


    print('Start training...') 
    # criterion can be 'gini' or 'entropy'; the default 'gini' is the Gini index used by CART, and 'entropy' selects splits by information gain (the criterion ID3 uses)
    clf = DecisionTreeClassifier(criterion='gini') 
    clf.fit(train_features,train_labels)
    time_3 = time.time()
    print('training cost %f seconds' % (time_3 - time_2))


    print('Start predicting...')
    test_predict = clf.predict(test_features)
    time_4 = time.time()
    print('predicting cost %f seconds' % (time_4 - time_3))


    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is %f" % score)
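
The sklearn version above is trained on the raw pixel values without binarization, because CART splits a numeric feature with a threshold test instead of creating one branch per value. The plain-Python sketch below shows the idea of such a split search on a single feature. It is only an illustration with made-up data and a hypothetical `best_threshold` helper, not scikit-learn's actual implementation.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(xs, ys):
    """Return (threshold, weighted_gini) of the best binary split x <= threshold."""
    n = len(xs)
    values = sorted(set(xs))
    best = (None, float("inf"))
    # Candidate thresholds: midpoints between consecutive distinct values
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

pixel = [0, 10, 20, 200, 220, 255]  # made-up intensities of one pixel across six images
digit = [1, 1, 1, 7, 7, 7]          # made-up labels
threshold, score = best_threshold(pixel, digit)
print(threshold, score)
```

A full tree repeats this search over every feature at every node and recurses on the two resulting subsets.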
