
Implementing a Random Forest Classifier with scikit-learn

Practicing some of the methods learned in this chapter on my own.

 

First, the core workflow: build a classification model, tune it with a validation curve, and finally use the trained model to make predictions.

In [20]:
# Load the preprocessed data
import pandas as pd

df = pd.read_csv('../data/hr-analytics/hr_data_processed.csv')
df.columns
Out[20]:
Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'work_accident', 'left',
       'promotion_last_5years', 'department_IT', 'department_RandD',
       'department_accounting', 'department_hr', 'department_management',
       'department_marketing', 'department_product_mng', 'department_sales',
       'department_support', 'department_technical', 'salary_high',
       'salary_low', 'salary_medium'],
      dtype='object')
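The department_* and salary_* columns are one-hot encodings of the original categorical fields. The preprocessed file already contains them, but as a rough sketch of how they could have been produced (the raw file path and original column names here are my assumptions, not from the notebook):

# Hypothetical sketch of the preprocessing that produced the one-hot
# department_* and salary_* columns; file path and column names assumed
raw = pd.read_csv('../data/hr-analytics/hr_data.csv')
df = pd.get_dummies(raw, columns=['department', 'salary'])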
In [21]:
# Select the features and the target variable

features = ['satisfaction_level', 'last_evaluation', 'number_project',
            'average_montly_hours', 'time_spend_company', 'work_accident',
            'promotion_last_5years', 'department_IT', 'department_RandD',
            'department_accounting', 'department_hr', 'department_management',
            'department_marketing', 'department_product_mng', 'department_sales',
            'department_support', 'department_technical', 'salary_high',
            'salary_low', 'salary_medium']
X = df[features].values
y = df.left.values
In [33]:
# Use a random forest classifier and compute a validation curve over max_depth
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import validation_curve
import numpy as np

np.random.seed(1)  # fix the seed so the random draws are reproducible
clf = RandomForestClassifier(n_estimators=20)
max_depths = [3, 4, 5, 6, 7, 9, 12, 15, 18, 21]
print('Training {} models'.format(len(max_depths)))
train_scores, test_scores = validation_curve(estimator=clf,
                                             X=X, y=y,
                                             param_name='max_depth',
                                             param_range=max_depths,
                                             cv=5)
 
Training 10 models
In [43]:
def plot_validation_curve(train_scores, test_scores, param_range,
                          xlabel='', log=False):
    '''
    This code is from the scikit-learn docs:
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html

    Also here:
    https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/code/ch06/ch06.ipynb
    '''
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
    fig = plt.figure()
    plt.plot(param_range, train_mean,
             color=sns.color_palette('Set1')[1], marker='o',
             markersize=5, label='training accuracy')
    plt.fill_between(param_range,
                     train_mean + train_std, train_mean - train_std,
                     alpha=0.15, color=sns.color_palette('Set1')[1])
    plt.plot(param_range, test_mean,
             color=sns.color_palette('Set1')[0], linestyle='--',
             marker='s', markersize=5, label='validation accuracy')
    plt.fill_between(param_range,
                     test_mean + test_std, test_mean - test_std,
                     alpha=0.15, color=sns.color_palette('Set1')[0])
    if log:
        plt.xscale('log')
    plt.legend(loc='lower right')
    if xlabel:
        plt.xlabel(xlabel)
    plt.ylabel('Accuracy')
    plt.ylim(0.9, 1.0)
    return fig
In [45]:
import matplotlib.pyplot as plt
import seaborn as sns 
In [47]:
# Plot the validation curve
plot_validation_curve(train_scores, test_scores, max_depths, xlabel='max_depth')
plt.xlim(3, 21)
plt.savefig('../figures/test_classfication_model.png',
            bbox_inches='tight', dpi=300)
In [58]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from IPython.display import display
from mlxtend.plotting import plot_decision_regions

def cross_val_class_score(clf, X, y, cv=10):
    kfold = StratifiedKFold(n_splits=cv).split(X, y)
    class_accuracy = []
    for k, (train, test) in enumerate(kfold):
        clf.fit(X[train], y[train])  # fit the model on the training folds
        y_test = y[test]
        y_pred = clf.predict(X[test])
        # Build the confusion matrix for this fold, then divide its diagonal
        # by the row sums to get the per-class accuracy for class 0 and class 1
        cmat = confusion_matrix(y_test, y_pred)
        class_acc = cmat.diagonal() / cmat.sum(axis=1)
        class_accuracy.append(class_acc)
        print('fold: {:d} accuracy {:s}'.format(k + 1, str(class_acc)))
    return np.array(class_accuracy)
In [61]:
# Show the k-fold cross-validation results
np.random.seed(1)
clf = RandomForestClassifier(n_estimators=200, max_depth=6)
scores = cross_val_class_score(clf, X, y)
print('accuracy {} +/- {}'.format(scores.mean(axis=0), scores.std(axis=0)))
 
fold: 1 accuracy [ 0.99825022  0.88826816]
fold: 2 accuracy [ 0.99825022  0.84033613]
fold: 3 accuracy [ 0.99387577  0.81232493]
fold: 4 accuracy [ 0.99300087  0.85154062]
fold: 5 accuracy [ 0.99475066  0.82633053]
fold: 6 accuracy [ 0.99387577  0.85994398]
fold: 7 accuracy [ 0.99650044  0.87394958]
fold: 8 accuracy [ 0.99650044  0.83473389]
fold: 9 accuracy [ 0.99474606  0.87394958]
fold: 10 accuracy [ 0.99562172  0.89635854]
accuracy [ 0.99553722  0.85577359] +/- [ 0.00172575  0.02614334]
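The per-class accuracies above come from dividing the confusion matrix diagonal by its row sums, which is the recall of each class. A tiny self-contained check with made-up numbers (not from the HR data) shows the computation:

# Hypothetical 2x2 confusion matrix: rows = true class, columns = predicted
import numpy as np
cmat = np.array([[90, 10],   # class 0: 90 correct, 10 misclassified
                 [ 5, 45]])  # class 1: 45 correct, 5 misclassified
per_class_acc = cmat.diagonal() / cmat.sum(axis=1)
print(per_class_acc)  # [0.9 0.9] -> recall for class 0 and class 1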
In [69]:
# Draw a box plot of the per-class accuracies
fig = plt.figure(figsize=(5, 7))
sns.boxplot(data=pd.DataFrame(scores, columns=[0, 1]),
            palette=sns.color_palette('Set1'))
plt.xlabel('Left')
plt.ylabel('accuracy')
plt.show()
In [71]:
# Compute the feature importances, pairing each value with its feature name.
# Note: zip against df[features].columns rather than df.columns; df.columns
# still contains the 'left' target, which would shift the labels out of
# alignment with the 20 features the model was actually trained on.
d = (clf.feature_importances_, df[features].columns)
list(zip(*d))
Out[71]:
[(0.36430881606946935, 'satisfaction_level'),
 (0.10606469651847085, 'last_evaluation'),
 (0.19088737947190054, 'number_project'),
 (0.13082595880187356, 'average_montly_hours'),
 (0.17955451160561237, 'time_spend_company'),
 (0.012101773234080513, 'work_accident'),
 (0.0008113047024873478, 'promotion_last_5years'),
 (0.00021062542962211009, 'department_IT'),
 (0.00077649873359240354, 'department_RandD'),
 (0.00022487937663401313, 'department_accounting'),
 (0.00043794363826079859, 'department_hr'),
 (0.00031980481539390949, 'department_management'),
 (0.00011370864098983321, 'department_marketing'),
 (0.00015365441067812497, 'department_product_mng'),
 (0.00031929963267123197, 'department_sales'),
 (0.00036881031257490304, 'department_support'),
 (0.00039082790477380948, 'department_technical'),
 (0.0050013161512548546, 'salary_high'),
 (0.005775253267745778, 'salary_low'),
 (0.0013529372819138833, 'salary_medium')]
In [75]:
# Visualize the feature importances
pd.Series(clf.feature_importances_,
          name='Feature importance',
          index=df[features].columns).sort_values().plot.barh()
plt.show()
In [76]:
# Print the low-importance features (everything below the top five)
importances = list(pd.Series(clf.feature_importances_,
                             index=df[features].columns)
                   .sort_values(ascending=False).index)
np.array(importances[5:])