Model Building for Loan Delinquency Prediction, Part 2: Ensemble Models
阿新 • Published: 2018-12-24
Task: model building
Build four models (random forest, GBDT, XGBoost, and LightGBM) and score each of them. Any scoring method may be used, for example accuracy or AUC.
1. Installation resources
- Random forest and GBDT are both included in the sklearn package;
- LightGBM: https://github.com/Microsoft/LightGBM
  - It is now published on PyPI, so it can be installed with pip
- XGBoost: https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost and https://github.com/dmlc/xgboost
Tip: if pip runs into slow downloads or timeouts during installation, switch to a mirror index:
sudo pip install -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com lightgbm
2. Data loading and standardization
import pandas as pd
from sklearn.model_selection import train_test_split
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor
import warnings
from sklearn.preprocessing import StandardScaler
warnings.filterwarnings(action ='ignore', category = DeprecationWarning)
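## Load the data
data = pd.read_csv("data_all.csv")
# Features are every column except the target; `status` is the label.
x = data.drop(labels='status', axis=1)
y = data['status']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2018)
print(len(x)) # 4754
## Standardize the data
# Fit the scaler on the training split only, then apply the same
# mean and standard deviation to both splits.
scaler = StandardScaler()
scaler.fit(x_train)
x_train_stand = scaler.transform(x_train)
x_test_stand = scaler.transform(x_test)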
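The split-then-scale pattern can be checked on a small synthetic array. The following is a minimal sketch: the data is randomly generated (it is not the loan dataset), and the names `x_demo`, `y_demo`, and `demo_scaler` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features (assumption: 200 rows, 3 columns).
rng = np.random.RandomState(2018)
x_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.randint(0, 2, size=200)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=2018
)

# Fit on the training split only, then reuse the same statistics for the test split.
demo_scaler = StandardScaler().fit(x_tr)
x_tr_stand = demo_scaler.transform(x_tr)
x_te_stand = demo_scaler.transform(x_te)

print(x_tr_stand.mean(axis=0))  # close to 0 for every column
print(x_tr_stand.std(axis=0))   # close to 1 for every column
print(x_te_stand.mean(axis=0))  # near 0, but generally not exactly 0
```

The test split is only approximately centred because its statistics were never seen by the scaler, which is the intended behaviour.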
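3. Random forest model
Idea: random forest is an algorithm that combines many trees using bagging. Its base learner is the decision tree: each tree is trained on a bootstrap sample of the training data, and the predictions of all trees are aggregated.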
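The bagging idea can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data, not how `RandomForestClassifier` is implemented internally: it uses a hard majority vote, whereas scikit-learn averages the trees' predicted probabilities. The number of trees (25) and all variable names are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, not the loan dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.RandomState(0)
trees = []
for i in range(25):
    # Bootstrap: draw len(X_tr) row indices with replacement.
    idx = rng.randint(0, len(X_tr), size=len(X_tr))
    # max_features="sqrt" considers a random subset of features at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    trees.append(tree)

# Majority vote over the individual tree predictions (labels are 0/1).
votes = np.stack([t.predict(X_te) for t in trees])  # shape (25, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

bagged_acc = (bagged_pred == y_te).mean()
single_acc = (trees[0].predict(X_te) == y_te).mean()
print("single tree:", single_acc, "bagged:", bagged_acc)
```

An odd number of trees is used so that the vote can never tie.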
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
rfc_score = rfc.score(x_test, y_test)
print("The score of RF:",rfc_score)
rfc1 = RandomForestClassifier()
rfc1.fit(x_train_stand, y_train)
rfc1_score = rfc1.score(x_test_stand, y_test)
print("The score of RF(with preprocessing):",rfc1_score)
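Output:
The score of RF: 0.7638402242466713
The score of RF(with preprocessing): 0.7652417659425368
Tree-based models split on feature thresholds, so standardization should not change them materially. Since no random_state is set on the classifiers, the small gap between the two scores is most likely run-to-run randomness rather than an effect of the preprocessing.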
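The task statement mentions both accuracy and AUC, while `score` on a classifier reports accuracy only. A minimal sketch of computing both for a random forest follows. It uses synthetic data, and the 75/25 class balance, the sample size, and the fixed `random_state` are assumptions made for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced binary data (assumption, not the loan dataset).
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.75, 0.25], random_state=2018
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=2018, stratify=y
)

# Fixing random_state makes the result reproducible between runs.
rf = RandomForestClassifier(n_estimators=100, random_state=2018)
rf.fit(X_tr, y_tr)

# Accuracy is computed from hard class predictions.
acc = accuracy_score(y_te, rf.predict(X_te))
# AUC needs a continuous score: use the predicted probability of class 1.
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

print("accuracy:", acc)
print("auc:", auc)
```

On imbalanced labels, AUC is often the more informative of the two, because a model that always predicts the majority class can still reach a high accuracy.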
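4. GBDT model
GBDT stands for Gradient Boosting Decision Tree.
Idea: trees are added one at a time, and each new tree is fitted to the negative gradient of the loss function with respect to the current model's predictions.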
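Fitting the negative gradient can be made concrete for squared loss, where the negative gradient is simply the residual. The following is a hand-rolled sketch on synthetic regression data, not the scikit-learn implementation; the learning rate, tree depth, and number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumption for illustration).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
pred = np.full_like(y, y.mean())  # F_0: a constant initial prediction
mse_history = [np.mean((y - pred) ** 2)]

for _ in range(50):
    # For L = 0.5 * (y - F)^2, the negative gradient w.r.t. F is the residual y - F.
    neg_grad = y - pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, neg_grad)
    # Take a small step in the direction the tree predicts.
    pred = pred + learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - pred) ** 2))

print("initial training MSE:", mse_history[0])
print("final training MSE:", mse_history[-1])
```

With other losses (for example log loss for classification) only the `neg_grad` line changes; the loop structure stays the same.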
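# Note: GradientBoostingRegressor treats the `status` class label as a numeric target.
gbdt = GradientBoostingRegressor()
gbdt.fit(x_train, y_train)
# For a regressor, score() returns R^2 (coefficient of determination), not accuracy.
gbdt_score = gbdt.score(x_test, y_test)
print("The score of GBDT:",gbdt_score)
Output:
The score of GBDT: 0.18118075405980671
This value is an R^2, so it is not directly comparable with the accuracy figures reported for the classifiers. sklearn.ensemble.GradientBoostingClassifier would report accuracy from score() instead.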
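Because `status` is a class label, a classifier gives a score on the same scale as the random forest and XGBoost results. The sketch below contrasts the two `score` definitions on synthetic data. It is an illustration under assumed data, not a re-run of the experiment above, so the numbers it prints say nothing about the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption, not the loan dataset).
X, y = make_classification(n_samples=1000, n_features=20, random_state=2018)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2018)

reg = GradientBoostingRegressor(random_state=2018).fit(X_tr, y_tr)
clf = GradientBoostingClassifier(random_state=2018).fit(X_tr, y_tr)

# Regressor.score() is R^2 of the predicted values against the 0/1 labels.
reg_r2 = reg.score(X_te, y_te)
# Classifier.score() is the fraction of correctly predicted labels.
clf_acc = clf.score(X_te, y_te)

print("regressor R^2:", reg_r2)
print("classifier accuracy:", clf_acc)
```

The two numbers measure different things, which is why an R^2 cannot be placed next to an accuracy in a model comparison.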
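5. XGBoost model
# Use a separate name for the estimator so the imported `xgb` module is not shadowed.
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(x_train, y_train)
xgb_score = xgb_clf.score(x_test, y_test)
print("The score of XGBoost:", xgb_score)
Output: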
The score of XGBoost: 0.7855641205325858
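Problem encountered
DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
if diff:
==> Searching online shows that this is a NumPy deprecation: truth-value checks on empty arrays are deprecated. According to what the author found, the issue has since been fixed upstream.
==> Solution 1: ignore the warning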
import warnings
warnings.filterwarnings(action ='ignore', category = DeprecationWarning)
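A global filter hides every `DeprecationWarning` for the rest of the process. If that is broader than wanted, the standard library's `warnings.catch_warnings` context manager can limit the suppression to a single block. A minimal sketch follows; `noisy` is a hypothetical stand-in for the library call that emits the warning.

```python
import warnings

def noisy():
    # Hypothetical stand-in for a library call that emits a DeprecationWarning.
    warnings.warn("old behaviour", DeprecationWarning)
    return 42

# Suppress the warning only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    quiet_result = noisy()

# The previous filters are restored on exit; record what is emitted now.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_result = noisy()

print(quiet_result, loud_result, len(caught))  # 42 42 1
```

In the script above, the same pattern would wrap only the `fit` and `score` calls that trigger the message.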
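6. LightGBM
Idea: LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:
- Faster training
- Lower memory usage
- Better accuracy
- Support for parallel learning
- Ability to handle large-scale data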
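# Note: LGBMRegressor, like the GBDT regressor above, reports R^2 from score().
gbm = lgb.LGBMRegressor()
gbm.fit(x_train, y_train)
gbm_score = gbm.score(x_test, y_test)
print("The score of LightGBM:", gbm_score)
Output:
The score of LightGBM: 0.18118075405980671
The recorded value is identical to the GBDT result, which indicates that it was printed from gbdt_score rather than gbm_score. The LightGBM score itself is not available from this run.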
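To put all four models on the same footing, one option is to use the classifier variant of each library and compute accuracy and AUC in a single loop. The sketch below runs on synthetic data and is an assumption about how such a comparison could be organised, not the author's code. XGBoost and LightGBM are treated as optional and are skipped if they are not installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption, not the loan dataset).
X, y = make_classification(n_samples=1000, n_features=20, random_state=2018)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2018)

models = {
    "RF": RandomForestClassifier(random_state=2018),
    "GBDT": GradientBoostingClassifier(random_state=2018),
}

# Optional dependencies: add them only when the import succeeds.
try:
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(random_state=2018)
except ImportError:
    pass
try:
    from lightgbm import LGBMClassifier
    models["LightGBM"] = LGBMClassifier(random_state=2018)
except ImportError:
    pass

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    # AUC is computed from the predicted probability of the positive class.
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    results[name] = (acc, auc)
    print(f"{name}: accuracy={acc:.4f}, auc={auc:.4f}")
```

Using one loop guarantees that every model is scored with the same metric on the same split, and it avoids printing one model's score under another model's label.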