
Decision Tree Code from 《機器學習實戰》 (Machine Learning in Action)


22:45:17 2017-08-09

The KNN algorithm is simple and effective and solves many classification problems, but it gives no insight into what the data means: it just grinds through vector distances and assigns a class.

A decision tree fixes that: after classification you can see why a sample was assigned to a given class. It is even clearer when drawn as a diagram, but I haven't learned the plotting part yet; next time.

This post covers only the computation of information entropy, the selection of the best splitting feature, and the construction of the tree. I haven't studied pruning, so it is not included here.
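As a quick sanity check (my own addition, not from the book): the Shannon entropy used throughout is H = -Σ_k p_k · log2(p_k) over the class proportions p_k. For the four-sample toy dataset used below (one 'yes', three 'no'), a minimal standalone computation gives:

    from math import log

    # class proportions of the toy dataset below: 1 'yes' and 3 'no' out of 4 samples
    probs = [1/4, 3/4]
    H = -sum(p * log(p, 2) for p in probs)
    print(H)  # ~0.8113, should match calcuEntropy() in the listing below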

# -*- coding: utf-8 -*-

'''
function: decision tree code from 《機器學習實戰》; the plotting part is not written
note: posting it here so it is easier to reuse later~
date: 2017.8.9
'''

from numpy import *
from math import log
import operator

# compute the Shannon entropy of a dataset
def calcuEntropy(dataSet):
    numOfEntries = len(dataSet)
    featVec = {}
    for data in dataSet:
        currentLabel = data[-1]
        if currentLabel not in featVec.keys():
            featVec[currentLabel] = 1
        else:
            featVec[currentLabel] += 1
    shannonEntropy = 0.0
    for feat in featVec.keys():
        prob = float(featVec[feat]) / numOfEntries
        shannonEntropy += -prob * log(prob, 2)
    return shannonEntropy

# create the sample dataset
def loadDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

'''
function: split the dataset
return: the subset, with the split feature removed, whose rows match the given value
parameters: dataSet: the dataset; axis: the feature to split on; value: the value of feature axis that the returned subset should have
'''
def splitDataSet(dataSet, axis, value):
    retDataSet = []  # a new list, so the original dataset is not modified
    for featVec in dataSet:
        if featVec[axis] == value:  # keep the rows we want and return them later
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

'''
function: find the best splitting feature in the dataset
'''
def chooseBestClassifyFeat(dataSet):
    numOfFeatures = len(dataSet[0]) - 1
    bestFeature = -1  # initialize the best splitting feature
    baseInfoGain = 0.0  # information gain
    baseEntropy = calcuEntropy(dataSet)
    for i in range(numOfFeatures):
        # if numOfFeatures == 1:  # wrong: a single remaining feature does not mean a single class
        #     print('only one feature')
        #     print(dataSet[0][0])
        #     return dataSet[0][0]  # return that feature directly
        featList = [example[i] for example in dataSet]  # every value taken by the i-th feature
        unicVals = set(featList)  # the distinct values of the i-th feature
        newEntropy = 0.0
        for value in unicVals:
            subDataSet = splitDataSet(dataSet, i, value)

            # compute the entropy of each subset; the weighted sum is the entropy of this split
            currentEntropy = calcuEntropy(subDataSet)
            prob = float(len(subDataSet)) / len(dataSet)
            newEntropy += prob * currentEntropy
        newInfoGain = baseEntropy - newEntropy
        if newInfoGain > baseInfoGain:
            bestFeature = i
            baseInfoGain = newInfoGain
    return bestFeature

'''
function: majority vote, used when all features are exhausted but a leaf is still impure
arg: labelList: list of class labels
'''
def majorityCount(labelList):
    classCount = {}
    for label in labelList:
        if label not in classCount.keys():
            classCount[label] = 0
        classCount[label] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    print(sortedClassCount)
    return sortedClassCount[0][0]

'''
function: build the decision tree recursively
arg: dataSet: the dataset; labels: the feature names, not needed by the algorithm itself, e.g. 'flippers' is the meaning of the second feature
'''
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]  # all the class labels
    if classList.count(classList[0]) == len(classList):  # only one class left: return it directly
        return classList[0]
    if len(dataSet[0]) == 1:  # features are used up but the classes are still mixed: majority vote
        return majorityCount(classList)
    bestFeat = chooseBestClassifyFeat(dataSet)
    print('bestFeat = ' + str(bestFeat))
    bestFeatLabel = labels[bestFeat]
    del(labels[bestFeat])  # remove the feature used for this split
    featValues = [example[bestFeat] for example in dataSet]
    myTree = {bestFeatLabel: {}}
    unicVals = set(featValues)
    for value in unicVals:
        labelCopy = labels[:]
        subDataSet = splitDataSet(dataSet, bestFeat, value)
        myTree[bestFeatLabel][value] = createTree(subDataSet, labelCopy)
    return myTree

'''
function: classify with the decision tree
arg: inputTree: the trained decision tree; featLabel: the feature names; testVec: the vector to classify
'''
def classify(inputTree, featLabel, testVec):
    firstStr = list(inputTree.keys())[0]  # in Python 3 dict.keys() does not support indexing, so convert it first
    secondDict = inputTree[firstStr]  # the subtree
    featIndex = featLabel.index(firstStr)  # index() finds the position of the feature with this label
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':  # not a leaf yet, keep classifying
                classLabel = classify(secondDict[key], featLabel, testVec)
            else:
                classLabel = secondDict[key]  # reached a leaf, return the class label
    return classLabel

'''
function: persist the decision tree with the pickle module
'''
def storeTree(inputTree, filename):
    import pickle
    fw = open(filename, 'wb')
    pickle.dump(inputTree, fw)
    fw.close()

'''
function: load a decision tree from a local file
'''
def grabTree(filename):
    import pickle
    fr = open(filename, 'rb')
    return pickle.load(fr)

# test the entropy computation
dataSet, labels = loadDataSet()
shannon = calcuEntropy(dataSet)
print(shannon)

# test splitting the dataset
print(dataSet)
retDataSet = splitDataSet(dataSet, 1, 1)
print(retDataSet)
retDataSet = splitDataSet(dataSet, 1, 0)
print(retDataSet)

# find the best splitting feature
bestFeature = chooseBestClassifyFeat(dataSet)
print(bestFeature)

# test the majority vote
out = majorityCount([1, 1, 2, 2, 2, 1, 2, 2])
print(out)

# build the decision tree
myTree = createTree(dataSet, labels)
print(myTree)

# test the classifier
dataSet, labels = loadDataSet()
classLabel = classify(myTree, labels, [0, 1])
print(classLabel)
classLabel = classify(myTree, labels, [1, 1])
print(classLabel)

# persist the decision tree and read it back
storeTree(myTree, 'classifierStorage.txt')
outTree = grabTree('classifierStorage.txt')
print(outTree)
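For reference, here is the output I would expect from the test code at the bottom of the listing, hand-traced against the four-row toy dataset above (my own trace, not from the book; exact dict/set ordering may vary across Python versions):

    0.8112781244591328
    [[1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    [[1, 'yes'], [0, 'no'], [0, 'no']]
    [[1, 'no']]
    0
    [(2, 5), (1, 3)]
    2
    bestFeat = 0
    bestFeat = 0
    {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
    no
    yes
    {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

The root split is on 'no surfacing' because it has the larger information gain (0.311 vs. 0.122 for 'flippers' on this dataset).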
