Notes on a Small Random Forest Exercise

阿新 • Published: 2018-12-18

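Preface

The code was exported from a Jupyter Notebook.
Along the way I borrowed some data-cleaning idioms; I will add details when I have time.
The palest ink beats the best memory, and writing this down saves me from hunting for the syntax again next time.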

py version

# -*- coding: utf-8 -*-
# @Time    : 18-11-1 10:43 AM
# @Author  : wanghai
# @Email   : 
# @File    : testt.py
# @Software: PyCharm Community Edition

# coding: utf-8

# In[1]:


import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split

# In[2]:


raw_df = pd.read_csv('data.csv')
df1 = raw_df.drop(['apply_id'], axis=1)
# Check whether there are many outliers
df1.describe()


# In[3]:


def scatterplot(x_data, y_data, area, alpha, x_label="", y_label="", title="", color="g"):
    # Plot the arguments that were passed in, not the global x and y
    plt.scatter(x_data, y_data, s=area, alpha=alpha, c=color)
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.legend(loc='upper left')
    plt.show()


# # Data cleaning and label preparation

# In[4]:


# Difference in days between the actual and the planned repayment date
df1['date'] = (pd.to_datetime(df1['act_repay_dt']) - pd.to_datetime(df1['plan_repay_dt'])).dt.total_seconds() / (
    24 * 60 * 60)
# Visualization
x = df1['date']
y = x
area = np.pi * 3
scatterplot(x, y, area, 0.7, x_label="date", y_label="y", title="pay time img")

# In[5]:


date_show = df1['date'].dropna()

# matplotlib histogram
plt.hist(date_show, facecolor='blue', edgecolor='black', bins=155)

# seaborn histogram (kde=False, so no kernel density estimate is drawn)
sns.distplot(date_show, hist=True, kde=False,
             bins=500, color='blue',
             hist_kws={'edgecolor': 'black'})
plt.title('Histogram of pay date')
plt.xlabel('pay date')
plt.ylabel('people count')
plt.show()

# In[6]:


print('The shape of our features is:', df1.shape)

# Label preparation
df1['y'] = np.where((pd.isnull(df1['act_repay_dt'])) | (df1['date'] > 7), 1, 0)

illegal = df1[(pd.isnull(df1['act_repay_dt'])) | (df1['date'] > 7)]
print("Borrowers who have not repaid yet or repaid more than 7 days late: %d, share %.3f" % (len(illegal), float(len(illegal)) / float(len(df1))))
columns = ['act_repay_dt', 'plan_repay_dt', 'date']
# Drop interfering columns (first pass)
df1.drop(columns, inplace=True, axis=1)

# Drop the rows holding the 3 largest values of each column (TODO: this method needs improving)
columns = df1.columns.tolist()
for col in columns:
    indexs = df1.nlargest(3, columns=[col]).index.values
    for i in indexs:
        df1.drop(i, inplace=True)

print('The shape of our features after del is:', df1.shape)
# TODO: compute correlations and drop features with very high correlation coefficients


# In[7]:


df1.head(3)

# # Fill missing values with the column mean

# In[8]:


df1 = df1.fillna(df1.mean())
x = np.array(df1.iloc[:, 0:-1])
y = np.array(df1.iloc[:, -1])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=11)

# dt = DictVectorizer(sparse=False)
# x_train = dt.fit_transform(x_train.to_dict())
# x_test = dt.fit_transform(x_test.to_dict())

print('Training Features Shape:', x_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Features Shape:', x_test.shape)
print('Testing Labels Shape:', y_test.shape)

# In[9]:


# # Decision tree version
# dtc = DecisionTreeClassifier()

# dtc.fit(x_train, y_train)

# dt_predict = dtc.predict(x_test)

# print(dtc.score(x_test, y_test))

# print(classification_report(y_test, dt_predict, target_names=["died", "survived"]))

# Random forest version

rfc = RandomForestClassifier()

rfc.fit(x_train, y_train)

rfc_y_predict = rfc.predict(x_test)
# score() returns the mean accuracy on the given test data and labels.
print("Mean accuracy with mean imputation: {:.2f}".format(rfc.score(x_test, y_test)))

# In[11]:


print("The accuracy/recall rate and other results are as follows：")
print(classification_report(y_test, rfc_y_predict, target_names=["plan_repay", "overdue_repay"]))

# In[12]:


print(rfc_y_predict)

# In[13]:


print(y_test)

# In[14]:


# Feature importances
print(rfc.feature_importances_)

markdown version

import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.cross_validation import train_test_split

/home/c/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

raw_df = pd.read_csv('data.csv')
df1 = raw_df.drop(['×××'], axis = 1)
# Check whether there are many outliers
df1.describe()

def scatterplot(x_data, y_data, area, alpha, x_label="", y_label="", title="", color = "g"):
    # Plot the arguments that were passed in, not the global x and y
    plt.scatter(x_data, y_data, s=area, alpha=alpha, c=color)
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.legend(loc='upper left')
    plt.show()

# Data cleaning and label preparation

# Difference in days between the actual and the planned repayment date
df1['date'] = (pd.to_datetime(df1['×××']) - pd.to_datetime(df1['×××'])).dt.total_seconds()/(24*60*60)
# Visualization
x = df1['date']
y = x
area = np.pi*3
scatterplot(x, y, area, 0.7, x_label="date", y_label="y", title="pay time img")

[Figure: scatter plot titled "pay time img"]

date_show = df1['date'].dropna()

# matplotlib histogram
plt.hist(date_show, facecolor = 'blue', edgecolor = 'black',bins = 155)

# seaborn histogram (kde=False, so no kernel density estimate is drawn)
sns.distplot(date_show, hist=True, kde=False, 
             bins=500, color = 'blue',
             hist_kws={'edgecolor':'black'})
plt.title('Histogram of pay date')
plt.xlabel('pay date')
plt.ylabel('people count')
plt.show()

/home/c/anaconda2/lib/python2.7/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "

[Figure: "Histogram of pay date", x axis "pay date", y axis "people count"]

print('The shape of our features is:', df1.shape)

# Label preparation
df1['y'] = np.where((pd.isnull(df1['act_repay_dt'])) | (df1['date'] > 7), 1, 0)

illegal = df1[(pd.isnull(df1['act_repay_dt'])) | (df1['date']>7)]
print("Borrowers who have not repaid yet or repaid more than 7 days late: %d, share %.3f" % (len(illegal), float(len(illegal)) / float(len(df1))))
columns = ['act_repay_dt', 'plan_repay_dt', 'date']
# Drop interfering columns (first pass)
df1.drop(columns, inplace=True, axis=1)

# Drop the rows holding the 3 largest values of each column (TODO: this method needs improving)
columns = df1.columns.tolist()
for col in columns:
    indexs = df1.nlargest(3, columns=[col]).index.values
    for i in indexs:
        df1.drop(i, inplace=True)

print('The shape of our features after del is:', df1.shape)
# TODO: compute correlations and drop features with very high correlation coefficients

('The shape of our features is:', (12154, 221))
Borrowers who have not repaid yet or repaid more than 7 days late: 1837, share 0.151
('The shape of our features after del is:', (11497, 219))
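The second TODO above (dropping features with very high correlation) could be sketched roughly as follows. This is only an illustration: the helper name `drop_highly_correlated`, the 0.95 threshold and the toy columns are my own assumptions, not part of the original notebook. On the real data it would be applied to the feature columns only, not to the label `y`.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.95):
    """Hypothetical helper: drop one column from every pair whose
    absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data: "b" is almost a scaled copy of "a", "c" is independent noise.
rng = np.random.RandomState(0)
demo = pd.DataFrame({"a": rng.rand(100)})
demo["b"] = demo["a"] * 2 + 0.001 * rng.rand(100)
demo["c"] = rng.rand(100)

reduced, dropped = drop_highly_correlated(demo)
print(dropped)
print(reduced.columns.tolist())
```

Of each correlated pair the later column is the one removed, so the column order decides which feature survives.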

df1.head(3)

# Fill missing values with the column mean

df1 = df1.fillna(df1.mean())
x = np.array(df1.iloc[:,0:-1])
y = np.array(df1.iloc[:,-1])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=11)

# dt = DictVectorizer(sparse=False)
# x_train = dt.fit_transform(x_train.to_dict())
# x_test = dt.fit_transform(x_test.to_dict())

print('Training Features Shape:', x_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Features Shape:', x_test.shape)
print('Testing Labels Shape:', y_test.shape)

('Training Features Shape:', (8047, 218))
('Training Labels Shape:', (8047,))
('Testing Features Shape:', (3450, 218))
('Testing Labels Shape:', (3450,))

# # Decision tree version
# dtc = DecisionTreeClassifier()
 
# dtc.fit(x_train, y_train)
 
# dt_predict = dtc.predict(x_test)
 
# print(dtc.score(x_test, y_test))
# print(classification_report(y_test, dt_predict, target_names=["died", "survived"]))

# Random forest version
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
rfc_y_predict = rfc.predict(x_test)
# score() returns the mean accuracy on the given test data and labels.
print("Mean accuracy with mean imputation: {:.2f}".format(rfc.score(x_test,y_test)))

Mean accuracy with mean imputation: 0.86

print("The accuracy/recall rate and other results are as follows：")
print(classification_report(y_test, rfc_y_predict, target_names=["plan_repay", "overdue_repay"]))

The accuracy/recall rate and other results are as follows：
               precision    recall  f1-score   support

   plan_repay       0.87      0.99      0.92      2976
overdue_repay       0.33      0.04      0.07       474

  avg / total       0.79      0.86      0.81      3450

rfc_y_predict

array([0, 0, 0, ..., 0, 0, 0])

y_test

array([0, 0, 0, ..., 0, 0, 0])

# Feature importances
rfc.feature_importances_
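The raw `feature_importances_` array is hard to read with 218 features. A minimal sketch of pairing the importances with column names and sorting them is shown below. It runs on synthetic data from `make_classification`; the names `f0`..`f5` are placeholders, and on the real data the index would presumably be `df1.columns[:-1]`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
feature_names = ["f%d" % i for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Attach names to the importances and sort from most to least important.
ranking = pd.Series(model.feature_importances_, index=feature_names)
ranking = ranking.sort_values(ascending=False)
print(ranking.head(3))
```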

Tuning

max_features、n_estimators、min_samples_leaf

Setting up cross-validation

cv_parameter = [{'min_samples_leaf':[5,15,25,35], 'n_estimators':[50,200,500], 'max_depth':[2, 3, 5]}]
# n_jobs sets how many jobs run in parallel
clf = GridSearchCV(estimator=rfc,param_grid=cv_parameter, cv=5, n_jobs=1)

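max_depth:

int or None, optional (default=None)
The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.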
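To see what the `max_depth` description means in practice, here is a small sketch that compares the deepest tree in a forest grown with `max_depth=None` against one capped at 3. The data is synthetic and the sizes are arbitrary choices of mine.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to illustrate the parameter.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

depth_by_setting = {}
for depth in (None, 3):
    forest = RandomForestClassifier(n_estimators=20, max_depth=depth,
                                    random_state=0).fit(X, y)
    # Each fitted tree reports its own depth; keep the deepest one.
    depth_by_setting[depth] = max(tree.get_depth() for tree in forest.estimators_)

print(depth_by_setting)
```

Shallower trees train faster and overfit less, which is why small values such as 2, 3 and 5 are worth including in the grid.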

from sklearn.model_selection import GridSearchCV
rfc = RandomForestClassifier(max_features = 'sqrt', random_state = 3)
cv_parameter = [{'n_estimators':[50,200,500], 'min_samples_leaf':[5,15,25,35], 'max_depth':[2, 3, 5]}]
clf = GridSearchCV(estimator=rfc,param_grid=cv_parameter, cv=5, n_jobs=1)

clf.fit(x_train, y_train)
print('Best parameters:')
print(clf.best_params_)
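Without a `scoring` argument, `GridSearchCV` ranks candidates by the classifier's default score, which is accuracy. With only about 15% overdue samples, accuracy barely rewards finding the minority class. The sketch below assumes one would rather tune for F1 on the positive class. It uses synthetic imbalanced data and a smaller grid so it runs quickly; neither is from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data with roughly 15% positives (assumption for illustration).
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.85, 0.15], random_state=0)

param_grid = {"n_estimators": [20, 50],
              "min_samples_leaf": [5, 15],
              "max_depth": [3, 5]}

search = GridSearchCV(
    estimator=RandomForestClassifier(max_features="sqrt", random_state=3),
    param_grid=param_grid,
    cv=5,
    scoring="f1",   # rank candidates by F1 of the positive class, not accuracy
    n_jobs=1,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)   # mean cross-validated F1 of the best candidate
```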


Setting class weights

rfc = RandomForestClassifier(random_state = 3, class_weight={0: 1, 1: 5})
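A quick way to check whether the weights help is to compare recall on the positive class with and without `class_weight`. The following is a sketch on synthetic imbalanced data. The `n_estimators` and `min_samples_leaf` values are my own choices, and the numbers it prints say nothing about the real data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 15% positives (assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.85, 0.15], flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=11)

recalls = {}
for label, weight in (("unweighted", None), ("weighted 1:5", {0: 1, 1: 5})):
    model = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                                   class_weight=weight, random_state=3)
    model.fit(X_train, y_train)
    # Recall of class 1: the share of true positives the model finds.
    recalls[label] = recall_score(y_test, model.predict(X_test))

print(recalls)
```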

About the classification_report results

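[Figure: classification_report output]
The model predicted 25 positive samples, of which 11 were correct, out of 474 true positives in total. Precision is 0.44 and recall is 0.023.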
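The two figures follow directly from those counts, as this short check shows (the variable names are mine):

```python
predicted_positive = 25   # samples the model labelled as overdue
true_positive = 11        # of those, how many really were overdue
actual_positive = 474     # all truly overdue samples in the test set

# Precision: of everything predicted positive, how much was right.
precision = true_positive / float(predicted_positive)
# Recall: of everything actually positive, how much was found.
recall = true_positive / float(actual_positive)

print("precision=%.2f recall=%.3f" % (precision, recall))
# prints: precision=0.44 recall=0.023
```

The precision looks acceptable only because the model almost never predicts the positive class. The recall shows that nearly all overdue cases are missed.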
