
Building a Decision Tree in Python: Solving the Contact Lens Selection Problem

Here is the problem we face: a patient comes to the clinic wanting to be fitted for contact lenses. By asking four questions, we must decide which type of lenses (if any) the patient needs. How do we solve this? We'll use a decision tree. First we download a contact lens dataset, which comes from the UCI Machine Learning Repository. The downloaded lenses.data file looks like this:

1  1  1  1  1  3
2  1  1  1  2  2
3  1  1  2  1  3
4  1  1  2  2  1
5  1  2  1  1  3
6  1  2  1  2  2
7  1  2  2  1  3
8  1  2  2  2  1
9  2  1  1  1  3
10  2  1  1  2  2
11  2  1  2  1  3
12  2  1  2  2  1
13  2  2  1  1  3
14  2  2  1  2  2
15  2  2  2  1  3
16  2  2  2  2  3
17  3  1  1  1  3
18  3  1  1  2  3
19  3  1  2  1  3
20  3  1  2  2  1
21  3  2  1  1  3
22  3  2  1  2  2
23  3  2  2  1  3
24  3  2  2  2  3

Looking at the file: the first column (1 to 24) is the record ID.

The second column (1 to 3) encodes the age of the patient: young, pre-presbyopic, or presbyopic.

The third column (1 or 2) encodes the spectacle prescription: myope or hypermetrope.

The fourth column (1 or 2) encodes whether the eye is astigmatic: no or yes.

The fifth column (1 or 2) encodes the tear production rate: reduced or normal.

The sixth column (1 to 3) is the resulting classification: hard contact lenses (hard), soft contact lenses (soft), or no lenses needed (no lenses).

Now that we have the data, let's write a function that opens the file and builds the dataset:

from numpy import *
import operator
from math import log

def createLensesDataSet():  # build the contact lens dataset from the file
    fr = open('lenses.data')
    allLinesArr = fr.readlines()
    linesNum = len(allLinesArr)
    returnMat = zeros((linesNum, 4))
    statusLabels = ['age of the patient', 'spectacle prescription', 'astigmatic', 'tear production rate']
    classLabelVector = []
    classLabels = ['hard', 'soft', 'no lenses']

    index = 0
    for line in allLinesArr:
        line = line.strip()
        lineList = line.split()  # split on any whitespace; sturdier than splitting on exactly two spaces
        returnMat[index, :] = lineList[1:5]  # columns 2-5 hold the four feature codes
        classIndex = int(lineList[5]) - 1  # class codes 1-3 map to list indices 0-2
        classLabelVector.append(classLabels[classIndex])
        index += 1

    return returnMat.tolist(), statusLabels, classLabelVector

def createLensesAttributeInfo():  # nominal names for each feature's numeric codes
    patientAgeList = ['young', 'pre', 'presbyopic']
    spectacleList = ['myope', 'hyper']
    astigmaticList = ['no', 'yes']
    tearRateList = ['reduced', 'normal']
    return patientAgeList, spectacleList, astigmaticList, tearRateList
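
As a quick sanity check (assuming lenses.data sits in the working directory), the first record, 1 1 1 1 1 3, should come back as the feature row [1.0, 1.0, 1.0, 1.0] with class label 'no lenses':

lensesData, statusLabels, classVector = createLensesDataSet()
print(lensesData[0])   # [1.0, 1.0, 1.0, 1.0] -- features of record 1
print(classVector[0])  # 'no lenses' -- class code 3 maps to index 2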

Next we need to set up the tree's branches. How do we decide which of the four features becomes the first split? This brings us to the concept of Shannon entropy. Entropy measures how uncertain a piece of information is, and it comes up constantly when partitioning datasets.

Its formula is:

$$H(D) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where $p(x_i)$ is the fraction of samples in dataset $D$ that belong to class $x_i$.

Let's start by writing a function that computes the Shannon entropy:

def calcShannonEnt(dataSet):  # compute the Shannon entropy of a dataset
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # tally how often each class label appears
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob, 2)  # accumulate -p * log2(p)
    return shannonEnt

Running this on our dataset gives an entropy of 1.32608752536.
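
As a quick sanity check, the class column contains 4 'hard', 5 'soft', and 15 'no lenses' labels out of 24 records, so the entropy can be recomputed by hand:

from math import log

counts = {'hard': 4, 'soft': 5, 'no lenses': 15}  # class frequencies in lenses.data
total = sum(counts.values())  # 24 records
entropy = -sum((n / total) * log(n / total, 2) for n in counts.values())
print(entropy)  # ~1.32608752536, matching calcShannonEnt on the full dataset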

Next we write a function that partitions a dataset, given the dataset, a feature index, and a feature value:

def splitDataSet(dataSet, axis, value):  # partition the dataset; parameters: dataset, feature index, feature value
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:  # keep only the rows matching the value...
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])  # ...with the split feature removed
            retDataSet.append(reducedFeatVec)
    return retDataSet
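
For example, splitting a tiny dataset on feature 0 with value 1 keeps only the matching rows and strips out that column (the toy data here is made up purely for illustration):

toyData = [[1, 'yes'], [1, 'no'], [0, 'no']]
print(splitDataSet(toyData, 0, 1))  # [['yes'], ['no']]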

To pick the best feature to split on, we need the concept of information gain.

Its formula is:

$$g(D, A) = H(D) - \sum_{v=1}^{V} \frac{|D_v|}{|D|} H(D_v)$$

where $D_v$ is the subset of $D$ in which feature $A$ takes its $v$-th value.

In other words, we take a single feature, compute the weighted sum of the entropies of the subsets produced by each of its values, and subtract that sum from the entropy of the full dataset.

Computing the information gain of the four features yields the following values:

0:0.0393965036461
1:0.0395108354236
2:0.377005230011
3:0.548794940695
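
To see where the winning value comes from, work feature 3 (tear production rate) through the formula by hand: all 12 'reduced' records are 'no lenses', so that subset's entropy is 0, while the 12 'normal' records split into 4 hard, 5 soft, and 3 no lenses:

$$H(D_{\text{normal}}) = -\tfrac{4}{12}\log_2\tfrac{4}{12} - \tfrac{5}{12}\log_2\tfrac{5}{12} - \tfrac{3}{12}\log_2\tfrac{3}{12} \approx 1.5546$$

$$g(D, A_3) = 1.32609 - \left(\tfrac{12}{24} \cdot 0 + \tfrac{12}{24} \cdot 1.5546\right) \approx 0.54879$$

which matches the printed gain for feature index 3.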

Here is the code that computes the information gain and chooses the best feature:

def chooseBestFeatureToSplit(dataSet):  # choose the best feature to split on
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:  # weighted sum of the subset entropies
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        print(str(i)+':'+str(infoGain))
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
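
Once the class label has been appended to each row (the test script below does this), calling the function on the lenses data picks feature 3, in line with the gains printed above:

bestFeat = chooseBestFeatureToSplit(lensesData)  # assumes each row of lensesData ends with its class label
print(bestFeat)  # 3 -> 'tear production rate'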

The computed gains give us the feature priority: tear production rate > astigmatic > spectacle prescription > age of patient.

Next, with the functions above in hand, we can start building the decision tree. We store it as a nested dictionary, where a key denotes a branch node and its value is either the next node or a leaf:

def createTree(dataSet, labels):  # recursively build the decision tree
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):  # all samples share one class: make a leaf
        return classList[0]
    if len(dataSet[0]) == 1:  # no features left to split on: take the majority class
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])  # this feature is consumed at this level
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)

    for value in uniqueVals:
        subLabels = labels[:]  # copy so deeper recursion doesn't mutate this level's label list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

def majorityCnt(classList):  # return the most frequent class in the list
    classCount = {}
    for vote in classList:
        if vote not in classCount: classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)  # items(), not Python 2's iteritems()
    return sortedClassCount[0][0]

With the main functions written, let's add a short test script that prints the decision tree we build:

import trees
import treePlotter

lensesData, labels, vector = trees.createLensesDataSet()
patientAgeList, spectacleList, astigmaticList, tearRateList = trees.createLensesAttributeInfo()
lensesAttributeList = [patientAgeList, spectacleList, astigmaticList, tearRateList]

for i in range(len(lensesData)):
    for j in range(len(lensesData[i])):
        index = int(lensesData[i][j]) - 1
        lensesData[i][j] = lensesAttributeList[j][index]  # numeric code -> nominal string
    lensesData[i].append(str(vector[i]))  # append the class label as the last column

myTree = trees.createTree(lensesData, labels)
print(myTree)


Let's look at the output:

{'tear production rate': {'reduced': 'no lenses', 'normal': {'astigmatic': {'yes': {'spectacle prescription': {'hyper': {'age of the patient': {'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}}, 'no': {'age of the patient': {'pre': 'soft', 'presbyopic': {'spectacle prescription': {'hyper': 'soft', 'myope': 'no lenses'}}, 'young': 'soft'}}}}}}
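
Before visualizing it, note that this nested dict can already classify new patients. Below is a minimal sketch of a recursive lookup (this helper is not part of the listings above; Machine Learning in Action ships an equivalent classify function):

def classify(inputTree, featLabels, testVec):  # walk the tree until a leaf (a string) is reached
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # which feature this node tests
    key = testVec[featIndex]
    if isinstance(secondDict[key], dict):  # a dict value means another decision node
        return classify(secondDict[key], featLabels, testVec)
    return secondDict[key]  # a string means a leaf: the predicted class

featLabels = ['age of the patient', 'spectacle prescription', 'astigmatic', 'tear production rate']
print(classify(myTree, featLabels, ['young', 'myope', 'no', 'normal']))  # 'soft'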

As you can see, this is a long nested dictionary structure, which is hard to read at a glance. To display the decision tree more intuitively, we bring in the plotting library matplotlib and draw the tree.

We write a new treePlotter script. It contains functions that count a decision tree's leaf nodes and measure its depth, which determine the height and width of the canvas layout. A function that computes the midpoint between two node coordinates positions each branch's attribute text, and together these routines draw the tree. Here is the script:

import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['font.sans-serif'] = ['SimHei']  # a font that can render CJK text, in case labels need it

# define the box and arrow styles for the nodes
decisionNode = dict(boxstyle = "sawtooth", fc = "0.8")
leafNode = dict(boxstyle = "round4", fc = "0.8")
arrow_args = dict(arrowstyle = "<-")

def plotNode(nodeTxt, centerPt, parentPt, nodeType):  # draw a node box and an arrow from its parent
    createPlotPlus.ax1.annotate(nodeTxt, xy = parentPt, xycoords = 'axes fraction', xytext = centerPt, textcoords = 'axes fraction', \
                            va = "center", ha = "center", bbox = nodeType, arrowprops = arrow_args)

def getNumLeafs(myTree):  # count the total number of leaf nodes
    numLeafs = 0
    firstStr = list(myTree.keys())[0]  # list() is needed in Python 3, where keys() is a view
    secondDict = myTree[firstStr]
    for k in secondDict.keys():
        if type(secondDict[k]).__name__ == 'dict':  # a dict value is a subtree, not a leaf
            numLeafs += getNumLeafs(secondDict[k])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):  # compute the depth of the decision tree
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for k in secondDict.keys():
        if type(secondDict[k]).__name__ == 'dict':  # a dict value is a subtree: recurse
            thisDepth = 1 + getTreeDepth(secondDict[k])
        else:
            thisDepth = 1

        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth
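
# For the lenses tree printed earlier, these two helpers report
# getNumLeafs(myTree) == 9 and getTreeDepth(myTree) == 4; those
# numbers size the canvas in createPlotPlus below.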

def plotMidText(cntrPt, parentPt, txtString):  # draw text at the midpoint between two coordinates
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlotPlus.ax1.text(xMid-0.05, yMid, txtString, rotation = 30)

def plotTree(myTree, parentPt, nodeTxt):  # draw a subtree given its parent coordinates and edge label text
    numLeafs = getNumLeafs(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.xOff +(1.0 + float(numLeafs)) / 2.0 /plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for k in secondDict.keys():
        if type(secondDict[k]).__name__ =='dict':
            plotTree(secondDict[k], cntrPt, str(k))
        else:
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[k], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(k))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

def createPlotPlus(inTree):  # create the figure for a given decision tree
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks = [], yticks = [])
    createPlotPlus.ax1 = plt.subplot(111, frameon = False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

With this script in place, we call the plotting function from our test code:

treePlotter.createPlotPlus(myTree)

Running it produces the final tree diagram.

And that completes the exercise.


Reference: Machine Learning in Action