
Python implementation of the watermelon book *Machine Learning*, exercise 4.4: a Gini-index decision tree with pre-pruning and post-pruning

Reference implementation: https://blog.csdn.net/Snoopy_Yuan/article/details/69223240
Yesterday I got frustrated because I couldn't get the tree to render, and some casual Baidu searching didn't help either.
Today's exercise simply swaps information gain for the Gini index; the tree-building logic is otherwise the same.
The original code does have a small bug, which I have already pointed out in a comment under the link above; curious readers can go take a look.

Oddly, though, pre-pruning and post-pruning come out with exactly the same accuracy, so there is probably still a problem somewhere in the program; I will dig into it later. (Then again, with only seven validation samples, accuracy can take just a handful of distinct values, so identical results are not impossible.)
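For reference, the quantities being swapped in (see §4.2.3 of the book): the Gini value of a sample set D is Gini(D) = 1 − Σ_k p_k², where p_k is the proportion of class k in D, and the Gini index of a discrete attribute a is Gini_index(D, a) = Σ_v (|D^v| / |D|) · Gini(D^v), where D^v is the subset of D taking the v-th value of a. OptAttr_Gini below picks the attribute that minimises this index.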

Main program: gini_decision_tree.py

#https://blog.csdn.net/Snoopy_Yuan/article/details/69223240

import pandas as pd

#data_file_encode="gb18030"   # gb18030 covers Chinese plus minority scripts (a 1/2/4-byte variable-length encoding); using it means passing encoding= to open, but that then fails with a gb18030 decode error here
# with open opens the file (mode "r" means read-only) and guarantees it is closed even if an error occurs
with open("/Users/huatong/PycharmProjects/Data/watermelon_33.csv",mode="r") as data_file:
    df=pd.read_csv(data_file)

import decision_tree

# Build the training set: iloc selects rows by integer position; drop removes those rows, leaving the rest as the validation set
index_train = [0, 1, 2, 5, 6, 9, 13, 14, 15, 16]   # same training samples as on page 80 of the book (0-based here vs. the book's 1-based IDs)

df_train = df.iloc[index_train]
df_test = df.drop(index_train)


# generate a full tree
root = decision_tree.TreeGenerate(df_train)
#decision_tree.DrawPNG(root, "decision_tree_full.png")  # could not get this to render, so it stays commented out
print("accuracy of full tree: %.3f" % decision_tree.PredictAccuracy(root, df_test))

# pre-pruning
root = decision_tree.PrePurn(df_train, df_test)
#decision_tree.DrawPNG(root, "decision_tree_pre.png")
print("accuracy of pre-purning tree: %.3f" % decision_tree.PredictAccuracy(root, df_test))

# post-pruning: first grow the full tree, then examine nodes from the bottom up
root = decision_tree.TreeGenerate(df_train)
decision_tree.PostPurn(root, df_test)
#decision_tree.DrawPNG(root, "decision_tree_post.png")
print("accuracy of post-purning tree: %.3f" % decision_tree.PredictAccuracy(root, df_test))

# 5-fold cross-validation
accuracy_scores = []
n = len(df.index)
k = 5
for i in range(k):
    m = int(n / k)
    test = []
    for j in range(i * m, i * m + m):
        test.append(j)

    df_train = df.drop(test)
    df_test = df.iloc[test]
    root = decision_tree.TreeGenerate(df_train)  # generate the tree
    decision_tree.PostPurn(root, df_test)  # post-pruning

    # test the accuracy (use a separate loop variable so the fold index i is not shadowed)
    pred_true = 0
    for idx in df_test.index:
        label = decision_tree.Predict(root, df_test[df_test.index == idx])
        if label == df_test[df_test.columns[-1]][idx]:
            pred_true += 1

    accuracy = pred_true / len(df_test.index)
    accuracy_scores.append(accuracy)

# print the prediction accuracy result
accuracy_sum = 0
print("accuracy: ", end="")
for i in range(k):
    print("%.3f  " % accuracy_scores[i], end="")
    accuracy_sum += accuracy_scores[i]
print("\naverage accuracy: %.3f" % (accuracy_sum / k))


decision_tree.py

# Called by the main program through TreeGenerate (def defines a function)
# A node holds: (1) the attribute this node splits on, e.g. "is the texture clear?";
# (2) the class label, meaningful only for leaf nodes; (3) the child branches keyed
# by attribute value, e.g. 色澤 = 烏黑 / 青綠 / 淺白


class Node(object):   # new-style class
    def __init__(self, attr_init=None, label_init=None, attr_down_init=None):   # special methods carry double underscores fore and aft
        self.attr = attr_init
        self.label = label_init
        # avoid the mutable-default-argument trap: a shared {} default would be reused across every Node
        self.attr_down = attr_down_init if attr_down_init is not None else {}
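
# A hand-built example of the structure (illustration only; the attribute and values
# are hypothetical): a root splitting on 紋理 with two leaf children.
#   leaf_a = Node(None, '是', {})   # leaf predicting 好瓜 = 是
#   leaf_b = Node(None, '否', {})   # leaf predicting 好瓜 = 否
#   demo = Node('紋理', '是', {'清晰': leaf_a, '模糊': leaf_b})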

# Tree builder: takes a data set, returns the root Node of the decision tree
def TreeGenerate(df):
    new_node = Node(None, None, {})
    label_arr = df[df.columns[-1]]   # the 好瓜 (label) column; df.columns[-1] is the last column
    label_count = NodeLabel(label_arr)
    if label_count:  # the label statistics are non-empty
        new_node.label = max(label_count, key=label_count.get)  # majority class; .get returns each key's count
        # return a leaf if all samples belong to one class, or if the sample set is empty
        # (the book also stops when all samples agree on the remaining attributes; that case is not handled here)
        if len(label_count) == 1 or len(label_arr) == 0:
            return new_node
        # choose the optimal splitting attribute by Gini index
        new_node.attr, div_value = OptAttr_Gini(df)
        # discrete attribute (div_value == 0): drop the attribute, then recurse on each value
        if div_value == 0:
            value_count = ValueCount(df[new_node.attr])
            for value in value_count:
                df_v = df[df[new_node.attr].isin([value])]
                df_v = df_v.drop(new_node.attr, axis=1)   # the source had a typo (dv_v), which skipped the drop
                new_node.attr_down[value] = TreeGenerate(df_v)
        else:
            value_l = "<=%.3f" % div_value
            value_r = ">%.3f" % div_value
            df_v_l = df[df[new_node.attr] <= div_value]   # left child
            df_v_r = df[df[new_node.attr] > div_value]    # right child
            new_node.attr_down[value_l] = TreeGenerate(df_v_l)   # keep splitting
            new_node.attr_down[value_r] = TreeGenerate(df_v_r)
    return new_node


# Count the classes in a sample set: input is the label column, output is a dict of class -> count
def NodeLabel(label_arr):
    label_count={}
    for label in label_arr:
        if label in label_count: label_count[label]+=1
        else:label_count[label]=1
    return label_count
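
# e.g. NodeLabel(['是', '是', '否']) returns {'是': 2, '否': 1}; TreeGenerate then takes
# max(label_count, key=label_count.get) to obtain the majority class '是'.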


# Find the optimal splitting attribute: takes a data set, returns the attribute opt_attr and split point div_value (0 for a discrete attribute, the actual threshold for a continuous one)
def OptAttr_Gini(df):
    gini_index = float('Inf')
    for attr_id in df.columns[1:-1]:
        gini_index_tmp, div_value_tmp = GiniIndex(df, attr_id)
        if gini_index_tmp < gini_index:   # we want the smallest Gini index
            gini_index = gini_index_tmp   # the source wrote gini_index_ here, so the running minimum was never updated
            opt_attr = attr_id
            div_value = div_value_tmp
    #print("divide according to:", opt_attr, end=' ')
    #print("divide value is:", div_value)  # would need to distinguish numeric vs. string before printing
    return opt_attr, div_value


# Compute the Gini index of one attribute: takes the data set and an attribute name, returns the Gini index and split point div_value (0 for discrete, actual threshold for continuous)
def GiniIndex(df, attr_id):
    gini_index = 0
    div_value = 0   # split point
    n = len(df[attr_id])  # number of samples
    # continuous attribute
    if df[attr_id].dtype.kind in 'fi':  # float or int; the source's dtype == (float, int) comparison is unreliable
        sub_gini = {}  # maps each candidate split point to the Gini index of that split
        df = df.sort_values([attr_id], ascending=True)  # sort ascending by this attribute; the old DataFrame.sort() raises here, use sort_values
        df = df.reset_index(drop=True)  # sorting scrambles the index, so rebuild it
        data_arr = df[attr_id]
        label_arr = df[df.columns[-1]]
        for i in range(n - 1):
            div = (data_arr[i] + data_arr[i + 1]) / 2   # candidate split points are midpoints of adjacent values
            sub_gini[div] = ( (i + 1) * Gini(label_arr[0:i + 1]) / n ) \
                              + ( (n - i - 1) * Gini(label_arr[i + 1:]) / n )  # the source sliced [i+1:-1], dropping the last sample
        div_value, gini_index = min(sub_gini.items(), key=lambda x: x[1])  # lambda defines an anonymous (inline) function

    # discrete attribute
    else:
        data_arr = df[attr_id]
        label_arr = df[df.columns[-1]]
        value_count = ValueCount(data_arr)
        for key in value_count:
            key_label_arr = label_arr[data_arr == key]
            gini_index += value_count[key] * Gini(key_label_arr) / n
    return gini_index, div_value
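
# Worked example of the continuous branch (toy numbers, not from the data set):
# sorted values [0.3, 0.5, 0.8] give candidate split points (0.3+0.5)/2 = 0.4 and
# (0.5+0.8)/2 = 0.65; each candidate's weighted Gini of the <= and > halves goes
# into sub_gini, and min(sub_gini.items(), key=lambda x: x[1]) picks the smallest.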


# Compute the Gini value (not to be confused with the Gini index)
def Gini(label_arr):
    gini=1
    n=len(label_arr)
    label_count=NodeLabel(label_arr)
    for key in label_count:
        gini -= (label_count[key] / n) * (label_count[key] / n)   # Gini = 1 - Σ_k p_k²
    return gini
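
# Arithmetic check: for labels ['是', '是', '否'], p(是) = 2/3 and p(否) = 1/3, so
# Gini = 1 - (2/3)^2 - (1/3)^2 = 4/9 ≈ 0.444; a pure node gives Gini = 0.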

# Count how many samples take each value of the given attribute
def ValueCount(data_arr):
    value_count={}
    for label in data_arr:
        if label in value_count: value_count[label]+=1
        else: value_count[label]=1
    return value_count

# Predict the label of one sample by walking down from the root node
def Predict(root, df_sample):
    import re  # regular expressions, used to pull the numeric threshold out of branch keys of the form "<=%.3f"

    while root.attr is not None:
        # continuous variable
        if df_sample[root.attr].dtype.kind in 'fi':  # same numeric-dtype check as in GiniIndex
            # get the div_value from root.attr_down
            for key in list(root.attr_down):
                num = re.findall(r"\d+\.?\d*", key)
                div_value = float(num[0])
                break
            if df_sample[root.attr].values[0] <= div_value:
                key = "<=%.3f" % div_value
                root = root.attr_down[key]
            else:
                key = ">%.3f" % div_value
                root = root.attr_down[key]

        # categorical variable
        else:
            key = df_sample[root.attr].values[0]
            # check whether the attr_value in the child branch
            if key in root.attr_down:
                root = root.attr_down[key]
            else:
                break

    return root.label
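
# Usage sketch: Predict expects a one-row dataframe, as produced by the boolean
# indexing in PredictAccuracy below, e.g.
#   label = Predict(root, df_test[df_test.index == df_test.index[0]])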

# Compute prediction accuracy on the validation set
def PredictAccuracy(root, df_test):
    '''
    calculating accuracy of prediction on test set

    @param root: Node, root Node of the decision tree
    @param df_test: dataframe, test data set
    @return accuracy: float
    '''
    if len(df_test.index) == 0: return 0
    pred_true = 0
    for i in df_test.index:
        label = Predict(root, df_test[df_test.index == i])
        if label == df_test[df_test.columns[-1]][i]:
            pred_true += 1
    return pred_true / len(df_test.index)

# Pre-pruning: takes the training and validation sets, returns the root of the pruned tree
def PrePurn(df_train, df_test):

    # grow a new tree
    new_node = Node(None, None, {})
    label_arr = df_train[df_train.columns[-1]]

    label_count = NodeLabel(label_arr)
    if label_count:  # assert the label_count isn't empty
        new_node.label = max(label_count, key=label_count.get)

        # end if there is only 1 class in current node data
        # end if attribution array is empty
        if len(label_count) == 1 or len(label_arr) == 0:
            return new_node

        # calculating the test accuracy up to current node
        a0 = PredictAccuracy(new_node, df_test)

        # get the optimal attribution for a new branching
        new_node.attr, div_value = OptAttr_Gini(df_train)  # via Gini index

        # get the new branch
        if div_value == 0:  # categorical variable
            value_count = ValueCount(df_train[new_node.attr])
            for value in value_count:
                df_v = df_train[df_train[new_node.attr].isin([value])]  # get sub set
                df_v = df_v.drop(new_node.attr, axis=1)
                # child node: label it with the majority class of its own subset
                new_node_child = Node(None, None, {})
                label_arr_child = df_v[df_v.columns[-1]]  # the source read df_train here, giving every child the global majority label
                label_count_child = NodeLabel(label_arr_child)
                new_node_child.label = max(label_count_child, key=label_count_child.get)
                new_node.attr_down[value] = new_node_child

            # calculating to check whether need further branching
            a1 = PredictAccuracy(new_node, df_test)
            if a1 > a0:  # need branching
                for value in value_count:
                    df_v = df_train[df_train[new_node.attr].isin([value])]  # get sub set
                    df_v = df_v.drop(new_node.attr, axis=1)
                    new_node.attr_down[value] = TreeGenerate(df_v)
            else:
                new_node.attr = None
                new_node.attr_down = {}

        else:  # continuous variable: build left and right children
            value_l = "<=%.3f" % div_value
            value_r = ">%.3f" % div_value
            df_v_l = df_train[df_train[new_node.attr] <= div_value]  # get sub set
            df_v_r = df_train[df_train[new_node.attr] > div_value]

            # for child node
            new_node_l = Node(None, None, {})
            new_node_r = Node(None, None, {})
            label_count_l = NodeLabel(df_v_l[df_v_l.columns[-1]])  # the source indexed with df_v_r.columns; same columns, but confusing
            label_count_r = NodeLabel(df_v_r[df_v_r.columns[-1]])
            new_node_l.label = max(label_count_l, key=label_count_l.get)
            new_node_r.label = max(label_count_r, key=label_count_r.get)
            new_node.attr_down[value_l] = new_node_l
            new_node.attr_down[value_r] = new_node_r

            # calculating to check whether need further branching
            a1 = PredictAccuracy(new_node, df_test)
            if a1 > a0:  # need branching
                new_node.attr_down[value_l] = TreeGenerate(df_v_l)
                new_node.attr_down[value_r] = TreeGenerate(df_v_r)
            else:
                new_node.attr = None
                new_node.attr_down = {}

    return new_node
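
# Note on the criterion above: a branch is kept only if a1 > a0 strictly, so a tie in
# validation accuracy prunes it. With the 7-sample validation set the accuracies move
# in steps of 1/7, e.g. a0 = a1 = 4/7 ≈ 0.571 would leave new_node as a leaf.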


# Post-pruning
def PostPurn(root, df_test):
    '''
    post-pruning of a generated decision tree

    @param root: Node, root of the tree
    @param df_test: dataframe, the testing set for the pruning decision
    @return: accuracy score from traversing the tree, or -1 if no pruning occurred below this node
    '''
    # leaf node
    if root.attr is None:
        return PredictAccuracy(root, df_test)

    # calculating the test accuracy on children node
    a1 = 0
    value_count = ValueCount(df_test[root.attr])
    for value in list(value_count):
        df_test_v = df_test[df_test[root.attr].isin([value])]  # get sub set
        if value in root.attr_down:  # root has the value
            a1_v = PostPurn(root.attr_down[value], df_test_v)
        else:  # root doesn't have value
            a1_v = PredictAccuracy(root, df_test_v)
        if a1_v == -1:  # -1 means no pruning back from this child
            return -1
        else:
            a1 += a1_v * len(df_test_v.index) / len(df_test.index)

    # calculating the test accuracy on this node
    node = Node(None, root.label, {})
    a0 = PredictAccuracy(node, df_test)

    # check if need pruning
    if a0 >= a1:
        root.attr = None
        root.attr_down = {}
        return a0
    else:
        return -1

def DrawPNG(root, out_file):
    '''
    visualization of decision tree from root.
    @param root: Node, the root node for tree.
    @param out_file: str, name and path of output file
    '''
    try:
        from pydotplus import graphviz   # both pydotplus and the graphviz binaries must be installed
    except ImportError:
        print("module pydotplus.graphviz not found")

    g = graphviz.Dot()  # generation of new dot

    TreeToGraph(0, g, root)
    g2 = graphviz.graph_from_dot_data(g.to_string())

    g2.write_png(out_file)


def TreeToGraph(i, g, root):
    '''
    build a graph from root on
    @param i: node number in this tree
    @param g: pydotplus.graphviz.Dot() object
    @param root: the root node

    @return i: node number after modified
    @return g: pydotplus.graphviz.Dot() object after modified
    @return g_node: the current root node in graphviz
    '''
    try:
        from pydotplus import graphviz    # both pydotplus and graphviz must be installed
    except ImportError:
        print("module pydotplus.graphviz not found")

    if root.attr is None:
        g_node_label = "Node:%d\n好瓜:%s" % (i, root.label)
    else:
        g_node_label = "Node:%d\n好瓜:%s\n屬性:%s" % (i, root.label, root.attr)
    g_node = i
    g.add_node(graphviz.Node(g_node, label=g_node_label))

    for value in list(root.attr_down):
        i, g_child = TreeToGraph(i + 1, g, root.attr_down[value])
        g.add_edge(graphviz.Edge(g_node, g_child, label=value))

    return i, g_node
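
# Fallback visualisation (my own addition, not in the reference code): since graphviz
# rendering kept failing for me, print the tree as indented text instead. Needs no
# extra packages beyond the Node class above.
def PrintTree(root, depth=0, branch=""):
    '''
    print the subtree under root as indented text, one line per node
    @param root: Node, subtree root
    @param depth: int, current indentation level
    @param branch: str, the attribute value that leads to this node
    '''
    prefix = "    " * depth + ("[%s] " % branch if branch else "")
    if root.attr is None:  # leaf: show the predicted class
        print(prefix + "好瓜: %s" % root.label)
    else:  # internal node: show the splitting attribute, then recurse
        print(prefix + "屬性: %s" % root.attr)
        for value in root.attr_down:
            PrintTree(root.attr_down[value], depth + 1, value)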