
Machine Learning Mini-Goals — Task 8

1. Task

[Task 8 – Feature Engineering 2] Select features with IV values and with a random forest, respectively, then build models and evaluate them.

2. Feature Selection with IV Values

2.1 The IV Value

IV stands for Information Value; it measures how much information (predictive power) a variable carries.

2.1.1 Computing IV

To explain how IV is computed, we first need to understand another concept, WOE, because IV is built on top of WOE.

2.1.2 WOE

WOE stands for "Weight of Evidence." WOE is a way of encoding the original independent variable.
To WOE-encode a variable, we first group it (also called discretization or binning; they all mean the same thing). After grouping, the WOE of the i-th group is computed as follows:
$$\mathrm{WOE}_i = \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \ln\left(\frac{\#y_i / \#y_T}{\#n_i / \#n_T}\right)$$

Here $p_{y_i}$ is the proportion of responding customers in this group (in a risk model these are the defaulting customers; in general, the individuals whose target value is "yes", i.e. 1) out of all responding customers in the sample, and $p_{n_i}$ is the proportion of non-responding customers in this group out of all non-responding customers in the sample. $\#y_i$ is the number of responding customers in the group, $\#n_i$ is the number of non-responding customers in the group, $\#y_T$ is the total number of responding customers in the sample, and $\#n_T$ is the total number of non-responding customers in the sample.

The larger the WOE, the bigger the gap between the two proportions and the more likely the samples in this group are to respond; the smaller the WOE, the smaller the gap and the less likely the samples in this group are to respond.
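To make the formula concrete, here is a small made-up example (the numbers are illustrative only, not from the data set used later): suppose the sample contains 100 responders and 400 non-responders, and one bin holds 20 of the responders and 10 of the non-responders. Then

$$\mathrm{WOE} = \ln\frac{20/100}{10/400} = \ln\frac{0.2}{0.025} = \ln 8 \approx 2.08,$$

so this bin is strongly skewed towards responders.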

2.1.3 The IV Formula

For the i-th group, its contribution to IV is

$$\mathrm{IV}_i = (p_{y_i} - p_{n_i}) \times \mathrm{WOE}_i$$
Once we have the IV of each group of a variable, computing the IV of the whole variable is simple: just add up the group IVs:
$$\mathrm{IV} = \sum_i \mathrm{IV}_i = \sum_i (p_{y_i} - p_{n_i}) \times \mathrm{WOE}_i$$
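The two formulas map directly onto a few lines of NumPy. The following is a minimal sketch of my own (the function and variable names are illustrative, not part of the script later in this post) that computes the WOE dictionary and the IV for a single, already-binned feature:

import numpy as np

def woe_iv_single(binned_x, y, event=1, eps=1e-4):
    """WOE and IV for one already-binned feature.
    binned_x and y are 1-D arrays of equal length; y holds 0/1 labels."""
    binned_x = np.asarray(binned_x)
    y = np.asarray(y)
    event_total = max((y == event).sum(), 1)
    non_event_total = max((y != event).sum(), 1)
    woe, iv = {}, 0.0
    for b in np.unique(binned_x):
        mask = binned_x == b
        rate_event = (y[mask] == event).sum() / event_total
        rate_non_event = (y[mask] != event).sum() / non_event_total
        # floor zero rates so the log is defined (same idea as the 0.0001 floor in the script below)
        rate_event = max(rate_event, eps)
        rate_non_event = max(rate_non_event, eps)
        woe[b] = np.log(rate_event / rate_non_event)
        iv += (rate_event - rate_non_event) * woe[b]
    return woe, iv

For a continuous column you would bin it first, e.g. with pandas' qcut, and pass the resulting bin labels in as binned_x.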

3. Feature Selection with a Random Forest

3.1 Feature Importance

An important property of random forests is that they can estimate the importance of each individual feature variable.

3.1.1 Computation Method

• For each decision tree in the random forest, use its corresponding OOB (out-of-bag) data to compute its out-of-bag error, denoted errOOB1.

• Randomly add noise to feature X across all OOB samples (i.e. randomly perturb the values of feature X), then compute the out-of-bag error again, denoted errOOB2.

• If the forest has Ntree trees, the importance of feature X = Σ(errOOB2 − errOOB1) / Ntree. This expression works as an importance measure because, if randomly injecting noise into a feature makes the out-of-bag accuracy drop sharply, that feature has a large influence on the classification result, i.e. it is highly important. A hedged sketch of this permutation idea is given right after this list.
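Note that the script at the end of this post uses sklearn's feature_importances_, which is impurity-based rather than the OOB permutation measure described above. As a sketch of the permutation idea, sklearn's permutation_importance permutes each feature on a held-out set instead of per-tree OOB data, so it is an approximation of the procedure above, not an exact reimplementation (the data and names below are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

# permute each feature on the validation set and measure the accuracy drop,
# analogous to averaging errOOB2 - errOOB1 over repetitions
perm = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
for idx in perm.importances_mean.argsort()[::-1][:5]:
    print(f"feature {idx}: {perm.importances_mean[idx]:.4f}")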

Experimental Results

(Figure of the experiment results omitted; the accuracy and AUC numbers are reproduced as text at the end of the post.)

Code

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @File    : IV.py
# @Date    : 2018-11-27
# @Author  : 黑桃
# @Software: PyCharm
# Original author credit: Bemyid

# from fancyimpute import BiScaler, KNN, NuclearNormMinimization
import pandas as pd

import math
import numpy as np
from scipy import stats
from sklearn.utils.multiclass import type_of_target

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from mlxtend.classifier import StackingClassifier

from sklearn.metrics import accuracy_score, roc_auc_score
import warnings
import pickle
from sklearn.ensemble import RandomForestClassifier
warnings.filterwarnings("ignore")
path = "E:/MyPython/Machine_learning_GoGoGo/"


"""
0 讀取特徵
"""
print("0 讀取特徵")
f = open(path + 'feature/feature_V2.pkl', 'rb')
train, test, y_train, y_test = pickle.load(f)
data = pd.read_csv(path + 'data_set/data.csv',encoding='gbk')
f.close()


def predict(clf, train, test, y_train, y_test):
    # predictions on the train and test sets
    y_train_pred = clf.predict(train)
    y_test_pred = clf.predict(test)

    y_train_proba = clf.predict_proba(train)[:, 1]
    y_test_proba = clf.predict_proba(test)[:, 1]

    # accuracy
    print('[Accuracy]')
    print('Train:', '%s' % accuracy_score(y_train, y_train_pred), end=' ')
    print('Test:', '%s' % accuracy_score(y_test, y_test_pred))

    # AUC: computed with roc_auc_score (auc on the ROC curve would also work)
    print('[AUC]')
    print('Train:', '%s' % roc_auc_score(y_train, y_train_proba), end=' ')
    print('Test:', '%s' % roc_auc_score(y_test, y_test_proba))


"""
2 隨機森林挑選特徵
"""
forest = RandomForestClassifier(oob_score=True,n_estimators=100, random_state=0,n_jobs=1)
forest.fit(train, y_train)
print('袋外分數:', forest.oob_score_)
predict(forest,train, test, y_train, y_test)
feature_impotance1 = sorted(zip(map(lambda x: '%.4f'%x, forest.feature_importances_), list(train.columns)), reverse=True)
print(feature_impotance1[:10])


def woe(X, y, event=1):
    """Compute the IV of every column of X against the binary target y."""
    res_woe = []
    iv_dict = {}
    for feature in X.columns:
        x = X[feature].values
        # 1) discretize continuous features into bins
        if type_of_target(x) == 'continuous':
            x = discrete(x)
        # 2) compute the WOE and IV of this (binned) feature
        woe_dict, iv = woe_single_x(x, y, feature, event)
        iv_dict[feature] = iv
        res_woe.append(woe_dict)

    return iv_dict


def discrete(x):
    # discretize the feature into 5 equal-frequency bins (quintiles)
    res = np.zeros(x.shape)
    for i in range(5):
        point1 = stats.scoreatpercentile(x, i * 20)
        point2 = stats.scoreatpercentile(x, (i + 1) * 20)
        x1 = x[np.where((x >= point1) & (x <= point2))]
        mask = np.in1d(x, x1)
        res[mask] = i + 1  # values falling in the i-th quintile are labelled i + 1
    return res


def woe_single_x(x, y, feature, event=1):
    # `event` is the label of the positive class
    y = np.asarray(y)  # positional indexing below, so drop any pandas index
    event_total = sum(y == event)
    non_event_total = y.shape[-1] - event_total

    iv = 0
    woe_dict = {}
    for x1 in set(x):  # loop over the bins
        y1 = y[np.where(x == x1)[0]]
        event_count = sum(y1 == event)
        non_event_count = y1.shape[-1] - event_count
        rate_event = event_count / event_total
        rate_non_event = non_event_count / non_event_total

        # floor zero rates so the log stays defined
        if rate_event == 0:
            rate_event = 0.0001
        if rate_non_event == 0:
            rate_non_event = 0.0001
        woei = math.log(rate_event / rate_non_event)
        woe_dict[x1] = woei
        iv += (rate_event - rate_non_event) * woei
    return woe_dict, iv


iv_dict = woe(train, y_train)
# rank features by IV, largest first (list of (feature, iv) tuples)
iv = sorted(iv_dict.items(), key=lambda x: x[1], reverse=True)

# drop a feature only if it falls outside the top 50 of BOTH rankings:
# the random-forest importance ranking and the IV ranking
useless = []
for feature in train.columns:
    if feature in [t[1] for t in feature_importance1[50:]] and feature in [t[0] for t in iv[50:]]:
        useless.append(feature)
        print(feature, iv_dict[feature])



# tabulate the forest's importances before dropping columns, so the index
# still matches the columns the forest was trained on
importance = forest.feature_importances_
imp_result = pd.DataFrame(importance)
imp_result.columns = ["impor"]
rf_impc = pd.Series(importance, index=train.columns).sort_values(ascending=False)

# drop the features judged useless by both rankings
train.drop(useless, axis=1, inplace=True)
test.drop(useless, axis=1, inplace=True)

lr = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')  # liblinear supports the L1 penalty
svm_linear = svm.SVC(C = 0.01, kernel = 'linear', probability=True)
svm_poly =  svm.SVC(C = 0.01, kernel = 'poly', probability=True)
svm_rbf =  svm.SVC(gamma = 0.01, C =0.01 , probability=True)
svm_sigmoid =  svm.SVC(C = 0.01, kernel = 'sigmoid',probability=True)
dt = DecisionTreeClassifier(max_depth=5,min_samples_split=50,min_samples_leaf=60, max_features=9, random_state =2333)
xgb = XGBClassifier(learning_rate =0.1, n_estimators=80, max_depth=3, min_child_weight=5,
                    gamma=0.2, subsample=0.8, colsample_bytree=0.8, reg_alpha=1e-5,
                    objective= 'binary:logistic', nthread=4,scale_pos_weight=1, seed=27)
lgb = LGBMClassifier(learning_rate =0.1, n_estimators=100, max_depth=3, min_child_weight=11,
                    gamma=0.1, subsample=0.5, colsample_bytree=0.9, reg_alpha=1e-5,
                    nthread=4,scale_pos_weight=1, seed=27)

sclf = StackingClassifier(classifiers=[svm_linear, svm_poly, svm_rbf, svm_sigmoid, dt, xgb, lgb], 
                            meta_classifier=lr, use_probas=True,average_probas=False)
sclf.fit(train, y_train.values)
predict(sclf, train, test, y_train, y_test)

Running the script prints:

[Accuracy] Train: 0.8264 Test: 0.7917
[AUC] Train: 0.8921 Test: 0.7895