Python_sklearn Machine Learning Library Study Notes (4): decision_tree (Decision Trees)

阿新 • Published: 2017-05-23


# Decision tree

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# sklearn.cross_validation and sklearn.grid_search were removed in
# scikit-learn 0.20; both helpers now live in sklearn.model_selection
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

import zipfile
# The dataset is kept zipped to save disk space
z = zipfile.ZipFile('ad-dataset.zip')
# Alternative: read the first archive member directly from the zip
# df = pd.read_csv(z.open(z.namelist()[0]), header=None, low_memory=False)

df = pd.read_csv('.\\tree_data\\ad.data', header=None)
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df[len(df.columns.values) - 1]
# The last column holds the class label
explanatory_variable_columns.remove(len(df.columns) - 1)

y = [1 if e == 'ad.' else 0 for e in response_variable_column]
X = df.loc[:, list(explanatory_variable_columns)]

# Match cells holding a '?' (optionally preceded by spaces) and replace them with -1
X = X.replace(to_replace=r' *\?', value=-1, regex=True)

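X_train, X_test, y_train, y_test = train_test_split(X, y)
# Build the decision tree with the information-gain heuristic (entropy criterion)
pipeline = Pipeline([('clf', DecisionTreeClassifier(criterion='entropy'))])
# Note: min_samples_split=1 was accepted by the scikit-learn version used for
# this run; current releases require an integer >= 2, so drop the 1 there
parameters = {
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (1, 2, 3),
    'clf__min_samples_leaf': (1, 2, 3)
}
# F1 is the harmonic mean of precision and recall
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,
                           verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters:')
best_parameters = grid_search.best_estimator_.get_params()
best_parameters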

Output:

Fitting 3 folds for each of 27 candidates, totalling 81 fits

[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   21.0s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:   34.7s finished

Best score: 0.888
Best parameters:

Out[123]:

{'clf': DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=160,
             max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
             min_samples_split=3, min_weight_fraction_leaf=0.0,
             presort=False, random_state=None, splitter='best'),
 'clf__class_weight': None,
 'clf__criterion': 'entropy',
 'clf__max_depth': 160,
 'clf__max_features': None,
 'clf__max_leaf_nodes': None,
 'clf__min_samples_leaf': 1,
 'clf__min_samples_split': 3,
 'clf__min_weight_fraction_leaf': 0.0,
 'clf__presort': False,
 'clf__random_state': None,
 'clf__splitter': 'best',
 'steps': [('clf',
   DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=160,
               max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
               min_samples_split=3, min_weight_fraction_leaf=0.0,
               presort=False, random_state=None, splitter='best'))]}
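
The log line "Fitting 3 folds for each of 27 candidates, totalling 81 fits" follows directly from the size of the parameter grid. The sketch below recounts it with the standard library only; `count_candidates` is a helper name made up for this illustration, not part of scikit-learn, and the fold count of 3 is simply taken from the log.

```python
from itertools import product

# Grid copied from the decision tree search above
tree_grid = {
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (1, 2, 3),
    'clf__min_samples_leaf': (1, 2, 3),
}


def count_candidates(grid):
    """Number of parameter combinations an exhaustive grid search tries."""
    return len(list(product(*grid.values())))


n_folds = 3  # fold count reported in the log
n_candidates = count_candidates(tree_grid)
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # prints: 27 81
```

Every extra value in any one parameter multiplies the number of fits, which is why the grid is kept small.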

for param_name in sorted(parameters.keys()):
    print('\t%s:%r' % (param_name, best_parameters[param_name]))
predictions = grid_search.predict(X_test)
print(classification_report(y_test, predictions))

Output:

	clf__max_depth:150
	clf__min_samples_leaf:1
	clf__min_samples_split:1
             precision    recall  f1-score   support

          0       0.97      0.99      0.98       703
          1       0.91      0.84      0.87       117

avg / total       0.96      0.96      0.96       820
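
The f1-score column can be checked by hand, since F1 is the harmonic mean of precision and recall. Here is a minimal sketch using the per-class figures printed in the report; `f1_score_from` is a name invented for this illustration. Because the report's precision and recall are already rounded to two decimals, a recomputed value could in general differ in the last digit.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall per class, as printed in the report
reported = {0: (0.97, 0.99), 1: (0.91, 0.84)}
for label, (p, r) in reported.items():
    print(label, round(f1_score_from(p, r), 2))
# prints:
# 0 0.98
# 1 0.87
```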

df.head()

Output:

	0	1	2	3	...	1558
0	125	125	1.0	1	...	ad.
1	57	468	8.2105	1	...	ad.
2	33	230	6.9696	1	...	ad.
3	60	468	7.8	1	...	ad.
4	60	468	7.8	1	...	ad.
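
The `criterion='entropy'` setting used in the pipeline above scores each candidate split by information gain, that is, by how much the split lowers the entropy of the labels. The following is a rough standalone sketch of that quantity, not scikit-learn's implementation; the function names and the toy labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction from splitting `parent` into `left` and `right`."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted


# Hypothetical toy labels: 1 = ad, 0 = not an ad
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]
print(round(information_gain(parent, left, right), 3))
```

A split that leaves one child pure, like `right` here, contributes zero entropy for that child, so the gain depends only on how mixed the other child remains.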

# Decision tree ensembles

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation and sklearn.grid_search were removed in
# scikit-learn 0.20; both helpers now live in sklearn.model_selection
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

df = pd.read_csv('.\\tree_data\\ad.data', header=None, low_memory=False)
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df[len(df.columns.values) - 1]

df.head()

	0	1	2	3	...	1558
0	125	125	1.0	1	...	ad.
1	57	468	8.2105	1	...	ad.
2	33	230	6.9696	1	...	ad.
3	60	468	7.8	1	...	ad.
4	60	468	7.8	1	...	ad.

# The last column describes the targets, so drop it from the feature columns
explanatory_variable_columns.remove(len(df.columns.values) - 1)
y = [1 if e == 'ad.' else 0 for e in response_variable_column]
X = df.loc[:, list(explanatory_variable_columns)]
# Replace cells holding a '?' (optionally preceded by spaces) with -1
X = X.replace(to_replace=r' *\?', value=-1, regex=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline = Pipeline([('clf', RandomForestClassifier(criterion='entropy'))])
# Note: min_samples_split=1 was accepted by the scikit-learn version used for
# this run; current releases require an integer >= 2, so drop the 1 there
parameters = {
    'clf__n_estimators': (5, 10, 20, 50),
    'clf__max_depth': (50, 150, 250),
    'clf__min_samples_split': (1, 2, 3),
    'clf__min_samples_leaf': (1, 2, 3)
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s:%r' % (param_name, best_parameters[param_name]))

Output:

Best score: 0.929
Best parameters:
	clf__max_depth:250
	clf__min_samples_leaf:1
	clf__min_samples_split:3
	clf__n_estimators:50

predictions = grid_search.predict(X_test)
print(classification_report(y_test, predictions))

Output:

             precision    recall  f1-score   support

          0       0.98      1.00      0.99       705
          1       0.97      0.90      0.93       115

avg / total       0.98      0.98      0.98       820
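
The forest above combines many trees, each trained on a bootstrap sample of the training rows. As a loose illustration of the two ingredients, the sketch below draws a bootstrap sample and combines per-tree predictions by majority vote. This is a simplified hard-voting picture with made-up data and helper names; scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting votes.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as bagging does for each tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the most common label among the individual predictions."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))                # hypothetical row indices
sample = bootstrap_sample(rows, rng)  # same size as rows, repeats allowed

tree_votes = [1, 0, 1, 1, 0]          # hypothetical predictions from 5 trees
print(len(sample), majority_vote(tree_votes))  # prints: 10 1
```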
