Financial Risk Control --> Application Scorecard Model --> Feature Engineering (Feature Binning, WOE Encoding). Tags: finance, feature binning, WOE encoding. 2017-07-16 21:26

阿新 • Published: 2019-01-04

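This post covers the feature engineering methods commonly used in application scorecard models. The model most often used for application scorecards is still logistic regression.

First, the data. We have three tables:

Information that is already processed:

Master table
idx: the unique key of each loan; it can be matched against idx in the other 2 files.
UserInfo_*: borrower attribute fields
WeblogInfo_*: web behavior fields
Education_Info*: education and enrollment fields
ThirdParty_Info_PeriodN_*: third-party data fields for time period N
SocialNetwork_*: social network fields
ListingInfo: loan closing date
Target: default label (1 = loan defaulted, 0 = repaid normally)

Information that needs to be derived:

Borrower login information table
ListingInfo: loan closing date
LogInfo1: operation code
LogInfo2: operation category
LogInfo3: login time
idx: the unique key of each loan

Customers perform different operations in different time periods, so it is best to slice time into windows, compute statistics within each window, and derive features from them.

Time slice:

The span between two points in time

Examples: the number of logins in the 30 days before the application date
the number of logins from day 30 to day 59 before the application date

Derivations based on time slices

The average number of logins per month (30 days) over the 180 days before the application date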
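
The three example features above can be sketched in plain Python. This is only an illustration: the function name `count_in_window`, the sample dates, and the boundary convention (a gap of 0 to 29 days counts as "within 30 days") are assumptions, not part of the original data or code.

```python
from datetime import date

def count_in_window(login_dates, apply_date, start_day, end_day):
    """Count logins whose gap to apply_date, in days, lies in [start_day, end_day]."""
    return sum(
        1 for d in login_dates
        if start_day <= (apply_date - d).days <= end_day
    )

# Hypothetical login history of one applicant (illustrative data only).
apply_date = date(2017, 7, 1)
logins = [date(2017, 6, 25), date(2017, 6, 10), date(2017, 5, 20), date(2017, 3, 1)]

last_30 = count_in_window(logins, apply_date, 0, 29)        # the 30 days before application
day_30_to_59 = count_in_window(logins, apply_date, 30, 59)  # day 30 through day 59
last_180 = count_in_window(logins, apply_date, 0, 179)      # the 180 days before application
avg_per_month = last_180 / (180 / 30)                       # average per 30-day month

print(last_30, day_30_to_59, last_180, avg_per_month)
```

Each window is a pair of day offsets, so adding another time slice only means adding another call with different bounds.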

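Commonly used time slices

(1 or 2) months, (1 or 2) quarters, half a year, 1 year, a year and a half, 2 years

Choosing the time slice

Not too long: most samples should still be covered
Not too short: otherwise information is lost

We want the largest time slice to be not too long, yet still contain most of the information. So how large should the largest slice be?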

# coding: utf-8
import pandas as pd
import datetime
import collections
import numpy as np
import random

import matplotlib.pyplot as plt

def TimeWindowSelection(df, daysCol, time_windows):
    '''
    :param df: the dataset containing the variable of days
    :param daysCol: the column of days
    :param time_windows: the list of time windows, e.g. 30, 60, 90, ..., 360
    :return: dict mapping each time window to the share of records it covers
    '''

    freq_tw = {}
    for tw in time_windows:
        freq = (df[daysCol] <= tw).sum()  # number of operations that fall within the window tw
        freq_tw[tw] = freq/float(len(df))  # share of all operations covered by the window tw
    return freq_tw


data1 = pd.read_csv('PPD_LogInfo_3_1_Training_Set.csv', header = 0)
### Extract the applying date of each applicant
data1['logInfo'] = data1['LogInfo3'].map(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d'))
data1['Listinginfo'] = data1['Listinginfo1'].map(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d'))
# gap in days between each login and the loan date
data1['ListingGap'] = (data1['Listinginfo'] - data1['logInfo']).dt.days
timeWindows = TimeWindowSelection(data1, 'ListingGap', range(30,361,30))
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
ax.plot(list(timeWindows.keys()),list(timeWindows.values()),marker='o')
ax.set_xticks([0,30,60,90,120,150,180,210,240,270,300,330,360])
ax.grid()
plt.show()

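[Figure: share of all operations covered by each time window, 30 to 360 days]

The figure shows that operations within the 0-180 day window account for 95% of all operations, and that coverage grows very slowly beyond 180 days. So we choose 180 days as the largest time slice. Any time slice of no more than 180 days can be used to derive features.

We use [7,30,60,90,120,150,180] as the time slices for deriving variables.

Next we choose which useful features to extract:

The number of operations m1 recorded in LogInfo1 and LogInfo2 within each time slice.
The number of distinct operations m2 recorded in LogInfo1 and LogInfo2 within each time slice.
The value of m1/m2 for LogInfo1 and LogInfo2 within each time slice.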

time_window = [7, 30, 60, 90, 120, 150, 180]
var_list = ['LogInfo1','LogInfo2']
data1GroupbyIdx = pd.DataFrame({'Idx':data1['Idx'].drop_duplicates()})
for tw in time_window:
    # earliest date that still falls inside the window of tw days before the loan date
    data1['TruncatedLogInfo'] = data1['Listinginfo'].map(lambda x: x - datetime.timedelta(days=tw))
    temp = data1.loc[data1['logInfo'] >= data1['TruncatedLogInfo']]
    for var in var_list:
        # m1: count the frequency of LogInfo1 and LogInfo2
        count_stats = temp.groupby(['Idx'])[var].count().to_dict()
        data1GroupbyIdx[str(var)+'_'+str(tw)+'_count'] = data1GroupbyIdx['Idx'].map(lambda x: count_stats.get(x,0))
        # m2: count the distinct values of LogInfo1 and LogInfo2
        Idx_UserupdateInfo1 = temp[['Idx', var]].drop_duplicates()
        uniq_stats = Idx_UserupdateInfo1.groupby(['Idx'])[var].count().to_dict()
        data1GroupbyIdx[str(var) + '_' + str(tw) + '_unique'] = data1GroupbyIdx['Idx'].map(lambda x: uniq_stats.get(x,0))
        # m1/m2: the average count per distinct value; set to 0 when the window holds no operations (avoids 0/0)
        data1GroupbyIdx[str(var) + '_' + str(tw) + '_avg_count'] = data1GroupbyIdx[[str(var)+'_'+str(tw)+'_count',str(var) + '_' + str(tw) + '_unique']].\
            apply(lambda x: x.iloc[0]*1.0/x.iloc[1] if x.iloc[1] > 0 else 0.0, axis=1)

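Data cleaning

For categorical variables

        Drop variables whose missing rate exceeds 50%
        Treat missing values in the remaining variables as a state of their own

For continuous variables

        Drop variables whose missing rate exceeds 30%
        Fill missing values in the remaining variables by random sampling

Note: missing values in a continuous variable can also be treated as a state of their own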
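
The rule for continuous variables could look roughly like the sketch below. It assumes missing values are stored as `None`; the helper names `missing_rate` and `impute_by_random_sampling` and the sample data are hypothetical, and "random sampling" is read here as drawing each replacement from the observed values of the same variable.

```python
import random

def missing_rate(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values)

def impute_by_random_sampling(values, rng):
    """Replace each None with a value drawn at random from the observed values."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
income = [3000, None, 5200, 4100, None, 8800, 4100, 6000, 3900, 4500]  # made-up data

rate = missing_rate(income)
if rate > 0.3:
    print("drop this continuous variable, missing rate =", rate)
else:
    income_filled = impute_by_random_sampling(income, rng)
    print(income_filled)
```

Drawing from the observed values keeps the filled variable's distribution close to the original one, which a constant fill such as the mean would not.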

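Feature binning (discretizing continuous variables, or reducing the number of categories of a categorical variable)
Definition of binning

Discretizing a continuous variable
Merging a discrete variable with many states into one with fewer states

Why binning matters and its advantages

Discrete features are easy to add and remove, which makes fast model iteration easy;
Inner products of sparse vectors are fast to compute, and the results are easy to store and to scale;
Discretized features are very robust to outliers: for example, a feature is 1 if age > 30 and 0 otherwise. Without discretization, an outlier such as "age 300" would disturb the model badly;
Logistic regression is a generalized linear model with limited expressive power. After a single variable is discretized into N values, each value gets its own weight, which introduces non-linearity into the model and improves its expressive power and fit;
After discretization features can be crossed, turning M+N variables into M*N variables, which introduces further non-linearity and improves expressive power;
The model is more stable after discretization. For example, if user age is discretized with 20-30 as one interval, a user does not become a completely different person just by getting one year older. Samples close to an interval boundary are of course affected in exactly the opposite way, so choosing the intervals is a craft of its own;
Discretization simplifies the logistic regression model and reduces the risk of overfitting.
Missing values can enter the model as a separate class.
All variables are transformed onto a similar scale.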

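Feature binning methods
[Figure: overview of feature binning methods]

Here we mainly cover the supervised chi-square binning method (ChiMerge).

ChiMerge is a bottom-up (merge-based) discretization method. It relies on the chi-square test: the adjacent intervals with the smallest chi-square value are merged, until a chosen stopping criterion is met.
Basic idea: for an accurate discretization, the relative class frequencies should be fairly consistent within an interval. So if two adjacent intervals have very similar class distributions, they can be merged; otherwise they should stay separate. A low chi-square value indicates similar class distributions.

Binning steps:
[Figure: ChiMerge binning steps]

Note that the instances must be sorted at initialization, and merging is carried out on the sorted order.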
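
To make "adjacent intervals with similar class distributions have a low chi-square value" concrete, here is a minimal sketch of the pairwise statistic on a 2 x 2 table (two adjacent intervals by good/bad counts). The function name `pair_chi2` and the counts are made up for illustration. Note that this is the textbook pairwise form, which is not the same as the simplified `Chi2` helper used in this post's code, where each interval is compared against the overall bad rate.

```python
def pair_chi2(interval_a, interval_b):
    """Chi-square statistic of a 2 x 2 table: two adjacent intervals x (good, bad) counts."""
    rows = [interval_a, interval_b]
    col_totals = [interval_a[0] + interval_b[0], interval_a[1] + interval_b[1]]
    n = sum(col_totals)
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / n
            if expected > 0:  # skip empty cells to avoid dividing by zero
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2

# (good, bad) counts per interval, already sorted by the variable; made-up numbers
intervals = [(90, 10), (88, 12), (60, 40)]
scores = [pair_chi2(intervals[i], intervals[i + 1]) for i in range(len(intervals) - 1)]
merge_at = scores.index(min(scores))  # the most similar adjacent pair is merged first
print(scores, merge_at)
```

The first two intervals have bad rates of 10% and 12%, so their statistic is small and they are the first candidates for merging; the third interval, at 40%, stays separate.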

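Determining the chi-square threshold:

The chi-square value is obtained from the significance level and the degrees of freedom.
The degrees of freedom are one less than the number of classes. For example, with 3 classes the degrees of freedom are 2, and at 90% confidence (10% significance level) the chi-square value is 4.6.

Meaning of the threshold

When the class and the attribute are independent, there is a 90% probability that the computed chi-square value is below 4.6.
A chi-square value above the threshold 4.6 therefore indicates that the attribute and the class are not independent, so the intervals cannot be merged. If the threshold is large, merging happens many times, and the discretized result has few, wide intervals.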
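
The two threshold values that appear in this post (4.6 for 2 degrees of freedom at 90% confidence, and the default 3.841 for 1 degree of freedom at 95% confidence used in the code further down) can be checked with the standard library alone. This sketch relies on two closed forms that hold only for these degrees of freedom; the function names are hypothetical, and for other degrees of freedom a statistics library would be needed.

```python
import math
from statistics import NormalDist

def chi2_critical_df1(confidence):
    """Critical value for 1 degree of freedom: a chi-square(1) variable is a squared standard normal."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z * z

def chi2_critical_df2(confidence):
    """Critical value for 2 degrees of freedom, where the CDF is 1 - exp(-x / 2)."""
    return -2 * math.log(1 - confidence)

print(round(chi2_critical_df2(0.90), 3))  # 3 classes, 90% confidence
print(round(chi2_critical_df1(0.95), 3))  # 2 classes (good/bad), 95% confidence
```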
　　
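Notes:
1. The ChiMerge algorithm recommends a confidence level of 0.90, 0.95 or 0.99, and a maximum number of intervals between 10 and 15.
2. The chi-square threshold can also be ignored. In that case, use a minimum or maximum number of intervals instead, i.e. specify an upper and a lower limit on the number of intervals.
3. A categorical variable that needs binning must first be sorted in some way.

Binning code using the maximum number of intervals: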

def Chi2(df, total_col, bad_col, overallRate):
    '''
    :param df: the dataset containing the total count and bad count
    :param total_col: total count of each value in the variable
    :param bad_col: bad count of each value in the variable
    :param overallRate: the overall bad rate of the training set
    :return: the chi-square value
    '''
    df2 = df.copy()
    df2['expected'] = df[total_col].apply(lambda x: x*overallRate)
    combined = zip(df2['expected'], df2[bad_col])
    chi = [(i[0]-i[1])**2/i[0] for i in combined]
    chi2 = sum(chi)
    return chi2


### ChiMerge_MaxInterval: split the continuous variable using Chi-square value by specifying the max number of intervals
def ChiMerge_MaxInterval_Original(df, col, target, max_interval = 5):
    '''
    :param df: the dataframe containing the column to split and the target column with 1-0
    :param col: the column to split
    :param target: target column with 1-0
    :param max_interval: the maximum number of intervals. If the raw column has no more distinct values than this, no merging is done
    :return: the cut-off points (the maximum value of every bin except the last)
    '''
    # neighbouring intervals are always the ones combined, so sort the distinct values before binning
    colLevels = sorted(set(df[col]))
    N_distinct = len(colLevels)
    if N_distinct <= max_interval:  # nothing to merge
        print("The number of original levels for {} is less than or equal to max intervals".format(col))
        return colLevels[:-1]
    else:
        #Step 1: group the dataset by col and work out the total count & bad count in each level of the raw column
        total = df.groupby([col])[target].count()
        total = pd.DataFrame({'total':total})
        bad = df.groupby([col])[target].sum()
        bad = pd.DataFrame({'bad':bad})
        # join on the index of both sides
        regroup =  total.merge(bad,left_index=True,right_index=True, how='left')
        regroup.reset_index(level=0, inplace=True)
        N = sum(regroup['total'])
        B = sum(regroup['bad'])
        #the overall bad rate will be used in calculating expected bad count
        overallRate = B*1.0/N
        # initially, each single value forms its own interval; after merging it looks like [[1],[2],[3,4]], one inner list per bin
        groupIntervals = [[i] for i in colLevels]
        groupNum = len(groupIntervals)
        while(len(groupIntervals)>max_interval):   #stop once the number of intervals has dropped to max_interval
            # in each step of iteration, we calculate the chi-square value of each interval
            chisqList = []
            for interval in groupIntervals:
                df2 = regroup.loc[regroup[col].isin(interval)]
                chisq = Chi2(df2, 'total','bad',overallRate)
                chisqList.append(chisq)
            #find the interval corresponding to minimum chi-square, and combine with the neighbour with smaller chi-square
            min_position = chisqList.index(min(chisqList))
            if min_position == 0:  # the first interval can only be combined with position 1
                combinedPosition = 1
            elif min_position == groupNum - 1:
                combinedPosition = min_position -1
            else:  # in the middle: combine with whichever neighbour has the smaller chi-square
                if chisqList[min_position - 1]<=chisqList[min_position + 1]:
                    combinedPosition = min_position - 1
                else:
                    combinedPosition = min_position + 1
            groupIntervals[min_position] = groupIntervals[min_position]+groupIntervals[combinedPosition]
            # after combining two intervals, we need to remove one of them
            del groupIntervals[combinedPosition]
            groupNum = len(groupIntervals)
        groupIntervals = [sorted(i) for i in groupIntervals]  # sort the values inside each bin in ascending order
        cutOffPoints = [i[-1] for i in groupIntervals[:-1]]  # the maximum of each bin is a cut-off point
        return cutOffPoints

Using the chi-square threshold as the stopping condition for binning:

def ChiMerge_MinChisq(df, col, target, confidenceVal = 3.841):
    '''
    :param df: the dataframe containing splitted column, and target column with 1-0
    :param col: splitted column
    :param target: target column with 1-0
    :param confidenceVal: the specified chi-square threshold; the default 3.841 corresponds to 1 degree of freedom at confidence level 0.95
    :return: the splitted bins
    '''
    colLevels = set(df[col])
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total':total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad':bad})
    regroup =  total.merge(bad,left_index=True,right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    N = sum(regroup['total'])
    B = sum(regroup['bad'])
    overallRate = B*1.0/N
    colLevels =sorted(list(colLevels))
    groupIntervals = [[i] for i in colLevels]
    groupNum  = len(groupIntervals)
    while(1):   #the termination condition: all the attributes form a single interval; or all the chi-square values are above the threshold
        if len(groupIntervals) == 1:
            break
        chisqList = []
        for interval in groupIntervals:
            df2 = regroup.loc[regroup[col].isin(interval)]
            chisq = Chi2(df2, 'total','bad',overallRate)
            chisqList.append(chisq)
        min_position = chisqList.index(min(chisqList))
        if min(chisqList) >=confidenceVal:
            break
        if min_position == 0:
            combinedPosition = 1
        elif min_position == groupNum - 1:
            combinedPosition = min_position -1
        else:
            if chisqList[min_position - 1]<=chisqList[min_position + 1]:
                combinedPosition = min_position - 1
            else:
                combinedPosition = min_position + 1
        groupIntervals[min_position] = groupIntervals[min_position]+groupIntervals[combinedPosition]
        groupIntervals.remove(groupIntervals[combinedPosition])
        groupNum = len(groupIntervals)
    return groupIntervals

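Unsupervised binning:

Equal-width and equal-frequency partitioning

Equal-width binning
The range from the minimum to the maximum is divided into N equal parts. If A and B are the minimum and maximum, each interval has width W=(B−A)/N and the interval boundaries are A+W, A+2W, ..., A+(N−1)W. Only the boundaries are considered here, so the number of instances in each part may differ.

Equal-frequency binning
The interval boundaries are chosen so that each interval contains roughly the same number of instances. For example, with N=10 each interval should contain about 10% of the instances.

Drawbacks of these two algorithms
For example, with equal-width partitioning into 5 intervals and a highest salary of 50000, everyone with a salary below 10000 falls into the same interval. Equal-frequency partitioning may do the opposite: everyone with a salary above 50000 is put into the 50000 interval. Both algorithms ignore the class each instance belongs to, so whether an instance lands in the right interval is largely a matter of chance.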
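
A small sketch of the two unsupervised schemes, using made-up salary data with one very large value. The function names are hypothetical, and the equal-frequency rule shown (take the value at every n/N-th position of the sorted data as an upper bound) is one simple choice among several.

```python
def equal_width_edges(values, n_bins):
    """Inner boundaries A+W, A+2W, ..., A+(N-1)W with W = (B - A) / N."""
    a, b = min(values), max(values)
    w = (b - a) / n_bins
    return [a + k * w for k in range(1, n_bins)]

def equal_freq_edges(values, n_bins):
    """Upper bounds chosen so that each bin holds roughly len(values) / n_bins items."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins - 1] for k in range(1, n_bins)]

salaries = [2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000, 8000, 50000]  # made-up, skewed

width_edges = equal_width_edges(salaries, 5)
freq_edges = equal_freq_edges(salaries, 5)
in_first_width_bin = sum(v <= width_edges[0] for v in salaries)

print(width_edges)
print(freq_edges)
print(in_first_width_bin, "of", len(salaries), "salaries fall into the first equal-width bin")
```

With skewed data, the equal-width boundaries are stretched by the single large salary, so almost every instance ends up in the first bin, which is the drawback described above.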

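After a feature has been binned, each bin must be WOE-encoded before it can go into model training.

WOE encoding

WOE (weight of evidence)

A supervised encoding that uses the concentration of the target classes within each bin as the encoded value

Advantages
Brings feature values onto a similar scale.
(Empirically, the absolute value of WOE ranges between 0.1 and 3.)
Has a business interpretation.

Drawback
Every bin must contain both good and bad samples.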

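$$\mathrm{WOE}_i = \ln\frac{G_i}{B_i}$$

Information value of a feature

IV (Information Value) measures how much information about the target variable a feature carries

$$\mathrm{IV} = \sum_i (G_i - B_i)\,\ln\frac{G_i}{B_i} = \sum_i (G_i - B_i)\,\mathrm{WOE}_i$$
Breaking down the information value:

Here G_i and B_i are the shares of all good samples and of all bad samples that fall into bin i.
WOE expresses how different the distributions of the two classes are.
(G_i - B_i): weights the importance of that difference.

What the information value is used for
Variable selection:

It is non-negative
A high IV means the feature is strongly associated with the target variable
The target variable must be binary
A very high IV may signal a potential risk
The finer the binning, the higher the IV
Commonly used thresholds:
<= 0.02: not predictive, unusable
0.02 to 0.1: weakly predictive
0.1 to 0.2: somewhat predictive
0.2 +: highly predictive

Note that the IV discussed above is the sum of the IV over all bins of a variable.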
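
As a worked example of these definitions, the sketch below computes WOE per bin and the total IV by hand for three bins with made-up counts. It assumes the convention used in this post's code, WOE = ln(good share / bad share); the bin names and numbers are purely illustrative.

```python
import math

# (good count, bad count) per bin; made-up numbers
bins = {"Bin 0": (400, 20), "Bin 1": (350, 30), "Bin 2": (150, 50)}

total_good = sum(g for g, b in bins.values())
total_bad = sum(b for g, b in bins.values())

woe = {}
iv = 0.0
for name, (good, bad) in bins.items():
    g_share = good / total_good   # G_i: share of all good samples in this bin
    b_share = bad / total_bad     # B_i: share of all bad samples in this bin
    woe[name] = math.log(g_share / b_share)
    iv += (g_share - b_share) * woe[name]   # each term is non-negative

print(woe)
print(iv)
```

Bins where good samples are over-represented get a positive WOE, and bins where bad samples are over-represented get a negative one. Because (G_i - B_i) and the logarithm always share the same sign, every bin contributes a non-negative amount to the IV.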

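Code for computing WOE and IV:

def CalcWOE(df, col, target):
    '''
    :param df: dataframe containing feature and target
    :param col: the column to encode; note that it must already be binned. The WOE of each bin and the total IV are computed from it.
    :param target: good/bad indicator
    :return: the WOE of each bin (as a dictionary) and the IV summed over all bins
    '''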
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    N = sum(regroup['total'])
    B = sum(regroup['bad'])
    regroup['good'] = regroup['total'] - regroup['bad']
    G = N - B
    regroup['bad_pcnt'] = regroup['bad'].map(lambda x: x*1.0/B)
    regroup['good_pcnt'] = regroup['good'].map(lambda x: x * 1.0 / G)
    regroup['WOE'] = regroup.apply(lambda x: np.log(x.good_pcnt*1.0/x.bad_pcnt),axis = 1)
    WOE_dict = regroup[[col,'WOE']].set_index(col).to_dict(orient='index')
    IV = regroup.apply(lambda x: (x.good_pcnt-x.bad_pcnt)*np.log(x.good_pcnt*1.0/x.bad_pcnt),axis = 1)
    IV = sum(IV)
    return {"WOE": WOE_dict, 'IV':IV}

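Some may ask: everything above is supervised binning and supervised WOE encoding, so how can these supervised methods be applied to the prediction set?

Looking at the formulas for supervised chi-square binning and supervised WOE encoding, it is easy to see that both results take the form of a ratio (for chi-square binning, a ratio built from (bad count - expected bad count) and the expected bad count; supervised WOE is similar). Suppose we find that good and bad samples in the prediction set are imbalanced and we need to undersample the bad samples or oversample the good samples. As long as the sampling is uniform, in theory the ratio from chi-square binning does not change, and neither does the WOE ratio. In other words, the chi-square grouping and WOE encoding on the prediction set are the same as on the training set.

So in the training set, after a continuous variable has been binned, we look at each value of that variable: if the value falls into a given bin, it is replaced by that bin's WOE code before going into the model for training.

The prediction set is handled similarly, but a value of the continuous variable in the prediction set may fall into none of the bins. For example, if in the training set [x1,x2] forms one bin and [x3,x4] another, a prediction-set value might be (x2+x3)/2, which lies in neither. So the bins are defined by their maximum values instead: the first bin covers x <= x2, the second covers x2 < x <= x4, and so on. In general, a value that is greater than the maximum of bin i-1 and less than or equal to the maximum of bin i belongs to bin i, and anything above the last cut-off point goes into the last bin. Binned this way, every value that might occur outside the training sample is covered, and any value in the prediction set maps to a bin and to the corresponding WOE code.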
　　

def AssignBin(x, cutOffPoints):
    '''
    :param x: the value of variable
    :param cutOffPoints: the maximum value of each bin, i.e. the cut-off points, in ascending order
    :return: bin number, indexing from 0
    for example, if cutOffPoints = [10,20,30], if x = 7, return Bin 0. If x = 35, return Bin 3
    '''
    numBin = len(cutOffPoints) + 1
    if x<=cutOffPoints[0]:
        return 'Bin 0'
    elif x > cutOffPoints[-1]:
        return 'Bin {}'.format(numBin-1)
    else:
        # bin i+1 covers the right-closed interval (cutOffPoints[i], cutOffPoints[i+1]]
        for i in range(0,numBin-2):
            if cutOffPoints[i] < x <=  cutOffPoints[i+1]:
                return 'Bin {}'.format(i+1)
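
For sorted cut-off points the same lookup can also be written with the standard library's `bisect` module. This is only an alternative sketch, not the code used in this post; the name `assign_bin_bisect` is hypothetical. `bisect_left` returns the index of the first cut-off point that is greater than or equal to `x`, which matches the right-closed bins used by `AssignBin`.

```python
from bisect import bisect_left

def assign_bin_bisect(x, cut_off_points):
    """Return 'Bin k', where k is the index of the first cut-off point that is >= x."""
    return 'Bin {}'.format(bisect_left(cut_off_points, x))

cut_off_points = [10, 20, 30]
for value in (7, 10, 15, 30, 35):
    print(value, assign_bin_bisect(value, cut_off_points))
```

A value equal to a cut-off point (10 or 30 here) stays in the bin whose maximum it is, and anything above the last cut-off point lands in the final bin.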

　　
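If binning turns out to separate good and bad samples perfectly, be careful: the continuous variable may be a post-event variable.

Points to note when binning

For continuous variables:

Bin with ChiMerge
If there are special values, put each special value in a group of its own, e.g. give -1 its own bin.
Work out which bin each value of the continuous variable belongs to (the AssignBin function defined above does this) and replace the original value with the bin number.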

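Check that the bad rate is monotone across the bins after binning. If it is not, keep merging adjacent bins until the bad rate is monotone. (This can be relaxed to a U shape.)

## determine whether the bad rate is monotone along the sortByVar
def BadRateMonotone(df, sortByVar, target):
    # the column df[sortByVar] has already been binned
    df2 = df.sort_values(by=sortByVar)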
    total = df2.groupby([sortByVar])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df2.groupby([sortByVar])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    combined = zip(regroup['total'],regroup['bad'])
    badRate = [x[1]*1.0/x[0] for x in combined]
    badRateMonotone = [badRate[i]<badRate[i+1] for i in range(len(badRate)-1)]
    Monotone = len(set(badRateMonotone))
    if Monotone == 1:
        return True
    else:
        return False

The process above converges, because with only 2 bins the bad rate is necessarily monotone.
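
The post only shows the monotonicity check, not the merge loop itself. A minimal sketch of "keep merging adjacent bins until the bad rate is monotone" might look like this. The rule for choosing which pair to merge (the adjacent pair with the closest bad rates) is an assumption made for the example, and the function names and counts are hypothetical.

```python
def is_monotone(rates):
    """True if the rates are strictly increasing or strictly decreasing."""
    ups = [a < b for a, b in zip(rates, rates[1:])]
    downs = [a > b for a, b in zip(rates, rates[1:])]
    return all(ups) or all(downs)

def merge_until_monotone(bins):
    """bins: list of (total, bad) per bin in sorted order. Returns the merged list."""
    bins = list(bins)
    while len(bins) > 2:
        rates = [bad / total for total, bad in bins]
        if is_monotone(rates):
            break
        # Assumed merge rule: merge the adjacent pair whose bad rates are closest.
        gaps = [abs(rates[i] - rates[i + 1]) for i in range(len(rates) - 1)]
        i = gaps.index(min(gaps))
        bins[i:i + 2] = [(bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])]
    return bins

# made-up (total, bad) counts with bad rates 0.05, 0.12, 0.10, 0.20
example_bins = [(100, 5), (100, 12), (100, 10), (100, 20)]
merged = merge_until_monotone(example_bins)
print(merged)
```

Here the second and third bins break the increasing trend, so they are merged into one bin with a bad rate of 0.11, after which the bad rate rises from bin to bin.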

Check the largest bin: if it holds more than 90% of all the data, drop the variable.

def MaximumBinPcnt(df,col):
    N = df.shape[0]
    total = df.groupby([col])[col].count()
    pcnt = total*1.0/N
    return max(pcnt)

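For categorical variables:

When there are few categories, binning is in principle unnecessary
Otherwise, when there are many categories, replace each original value with its bad rate to obtain a continuous variable, then bin that.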

def BadRateEncoding(df, col, target):
    '''
    :param df: dataframe containing feature and target
    :param col: the feature that needs to be encoded with bad rate, usually categorical type
    :param target: good/bad indicator
    :return: the assigned bad rate to encode the categorical feature
    '''
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    regroup['bad_rate'] = regroup.apply(lambda x: x.bad*1.0/x.total,axis = 1)
    br_dict = regroup[[col,'bad_rate']].set_index([col]).to_dict(orient='index')
    badRateEnconding = df[col].map(lambda x: br_dict[x]['bad_rate'])
    return {'encoding':badRateEnconding, 'br_rate':br_dict}

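Also check the largest bin: if it holds more than 90% of all the data, drop the variable.
When one or more categories have a bad rate of 0, they must be merged with the bin that has the smallest non-zero bad rate.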

### If we find any categories with 0 bad, then we combine these categories with that having smallest non-zero bad rate
def MergeBad0(df,col,target):
    '''
     :param df: dataframe containing feature and target
     :param col: the feature whose zero-bad categories need merging, usually categorical type
     :param target: good/bad indicator
     :return: a dictionary mapping each original category to its new bin
     Call this only when at least one category has a bad rate of 0: the two lowest categories are always merged.
     '''
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left')
    regroup.reset_index(level=0, inplace=True)
    regroup['bad_rate'] = regroup.apply(lambda x: x.bad*1.0/x.total,axis = 1)
    # sort by bad rate and reset the index so that label i is also position i
    regroup = regroup.sort_values(by = 'bad_rate').reset_index(drop=True)
    col_regroup = [[i] for i in regroup[col]]
    for i in range(regroup.shape[0]-1):
        # fold the lowest group into the next one until a non-zero bad rate is reached
        col_regroup[1] = col_regroup[0] + col_regroup[1]
        col_regroup.pop(0)
        if regroup['bad_rate'][i+1] > 0:
            break
    newGroup = {}
    for i in range(len(col_regroup)):
        for g2 in col_regroup[i]:
            newGroup[g2] = 'Bin '+str(i)
    return newGroup

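When the variable separates the target variable perfectly, check carefully whether the variable is legitimate. (It may be a post-event variable.)

Single-variable analysis

Use IV to check whether the variable is useful (the usual threshold range is (0.02, 0.8))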

iv_threshould = 0.02
## k and v are a column name and the IV of that column.
varByIV = [k for k, v in var_IV.items() if v > iv_threshould]
## WOE_dict is a dictionary of dictionaries.
WOE_encoding = []
for k in varByIV:
    if k in trainData.columns:
        trainData[str(k)+'_WOE'] = trainData[k].map(lambda x: WOE_dict[k][x]['WOE'])
        WOE_encoding.append(str(k)+'_WOE')
    elif k+str('_Bin') in trainData.columns:
        k2 = k+str('_Bin')
        trainData[str(k) + '_WOE'] = trainData[k2].map(lambda x: WOE_dict[k][x]['WOE'])
        WOE_encoding.append(str(k) + '_WOE')
    else:
        print("{} cannot be found in trainData".format(k))

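The bad rate of a continuous variable should be monotone (this can be relaxed to a U shape)
No single interval should hold too large a share (generally no more than 90%; if it does, drop the variable)

Multivariate analysis

Pairwise correlation between variables; when two variables are highly correlated, only one can be kept:

Keep the one with the higher IV
Or keep the one with the more balanced bins (the later scores will be more evenly spread)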

from itertools import combinations
from pandas.plotting import scatter_matrix

#### we can check the correlation matrix plot
col_to_index = {WOE_encoding[i]:'var'+str(i) for i in range(len(WOE_encoding))}
#sample from the list of columns, since too many columns cannot be displayed in the single plot
corrCols = random.sample(WOE_encoding, min(15, len(WOE_encoding)))
# rename on a copy rather than in place on a slice of trainData
sampleDf = trainData[corrCols].rename(columns = col_to_index)
scatter_matrix(sampleDf, alpha=0.2, figsize=(6, 6), diagonal='kde')

#alternatively, we check each pair of independent variables, and keep the variable with the higher IV if they are highly correlated
compare = list(combinations(varByIV, 2))  ## every pairwise combination of the variables in varByIV
removed_var = []
roh_thresould = 0.8
for pair in compare:
    (x1, x2) = pair
    roh = np.corrcoef([trainData[str(x1)+"_WOE"],trainData[str(x2)+"_WOE"]])[0,1]
    if abs(roh) >= roh_thresould:
        if var_IV[x1]>var_IV[x2]:  ## keep the one with the larger IV
            removed_var.append(x2)
        else:
            removed_var.append(x1)

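Multivariate analysis: multicollinearity between variables
This is usually measured with VIF, and VIF < 10 is required:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
where R_i^2 is the R² obtained by regressing variable i on all the other variables.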

import numpy as np
from sklearn.linear_model  import LinearRegression


selected_by_corr=[i for i in varByIV if i not in removed_var]
for i in range(len(selected_by_corr)):
    # regress the WOE of variable i on the WOE of all other selected variables
    x0=trainData[selected_by_corr[i]+'_WOE']
    x0=np.array(x0)
    X_Col=[k+'_WOE' for k in selected_by_corr if k!=selected_by_corr[i]]
    X=trainData[X_Col]
    X=np.array(X)
    regr=LinearRegression()
    clr=regr.fit(X,x0)
    x_pred=clr.predict(X)
    R2=1-((x_pred-x0)**2).sum()/((x0-x0.mean())**2).sum()
    vif=1/(1-R2)
    print("The vif for {0} is {1}".format(selected_by_corr[i],vif))

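When a variable Xi is found to have VIF > 10, remove the other variables one at a time. If removing variable Xk brings the VIF below 10, drop whichever of {Xi, Xk} has the smaller IV.
Computing VIF is usually not a required step. After the single-variable processing, the variables go into a logistic regression model for training and prediction; only if the results are very poor is multivariate analysis needed to eliminate multicollinearity.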
