GiveMeSomeCredit: A Credit Scorecard Model

阿新 • Published: 2019-01-09

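Today, banks, consumer finance companies and other lenders widely use credit scoring: each customer is given a score so that the lender can judge whether the customer is a good risk. Scorecards fall into three categories:

A card (Application score card): application scorecard

B card (Behavior score card): behavior scorecard

C card (Collection score card): collection scorecard

The scoring schemes differ in the following ways:

1. They are used at different times: before lending, during the loan, and after lending, respectively.

2. They have different data requirements. The A card can generally be used for credit analysis of loans in the 0-1 year range. The B card is built once the applicant has accumulated some behavior and a larger volume of data, typically 3-5 years. The C card needs even more data, including attributes such as how the customer responds to collection.

3. The model differs by scorecard type. Logistic regression and AHP are common for the A card, while the other two cards often use multi-factor logistic regression, which performs better on accuracy and other measures.

To build a scorecard model, we follow the process below: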

1. Data Preprocessing

The data comes from Kaggle's Give Me Some Credit project. First, let's take a look at it:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
import seaborn as sns
from scipy import stats
import copy

%matplotlib inline

train_data = pd.read_csv('cs-training.csv')
train_data = train_data.iloc[:,1:]
train_data.info()

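1.1 Handling Missing Values

The output shows that MonthlyIncome has many missing values, so we fit a random forest model to impute them. NumberOfDependents has few missing values, so we simply drop the rows where it is missing.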

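# Put MonthlyIncome (column 5) first; column 10 is left out because it also contains missing values
mData = train_data.iloc[:,[5,0,1,2,3,4,6,7,8,9]]
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
train_known = mData[mData.MonthlyIncome.notnull()].to_numpy()
train_unknown = mData[mData.MonthlyIncome.isnull()].to_numpy()
train_X = train_known[:,1:]   # features
train_y = train_known[:,0]    # target: MonthlyIncome
rfr = RandomForestRegressor(random_state=0,n_estimators=200,max_depth=3,n_jobs=-1)
rfr.fit(train_X,train_y)
# Predict the missing incomes and write the rounded values back
predicted_y = rfr.predict(train_unknown[:,1:]).round(0)
train_data.loc[train_data.MonthlyIncome.isnull(),'MonthlyIncome'] = predicted_y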

train_data = train_data.dropna()
train_data = train_data.drop_duplicates()

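1.2 Handling Outliers

With missing values handled, we turn to outliers. An outlier is generally a value that deviates strongly from the rest of the data. In statistics, for example, values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR are commonly treated as outliers. A box plot makes outliers easy to spot, for example: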

train_box = train_data.iloc[:,[3,7,9]]
train_box.boxplot()
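
The 1.5 IQR rule mentioned above can be sketched in a few lines. This is a minimal illustration using only the standard library and a made-up list of values; the helper name `iqr_bounds` is hypothetical and is not part of the original code, which relies on the box plot instead.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the fences (Q1 - k*IQR, Q3 + k*IQR) for a list of numbers."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data (hypothetical): 98 sits far away from the other values
demo = [0, 0, 1, 1, 2, 2, 3, 98]
low, high = iqr_bounds(demo)
outliers = [v for v in demo if v < low or v > high]
print(low, high, outliers)
```

Values outside the two fences are the points a box plot draws beyond its whiskers.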

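Clearly, within these three features there are two groups of samples that deviate from the distribution of the rest, and they can be removed. We also find samples with an age of 0, which is obviously implausible, so they are discarded as outliers too: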

train_data = train_data[train_data['NumberOfTime30-59DaysPastDueNotWorse']<90]
train_data = train_data[train_data.age>0]
train_data['SeriousDlqin2yrs'] = 1-train_data['SeriousDlqin2yrs'] # make good customers 1 and defaulting customers 0

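1.3 Splitting the Data

To check the model's performance properly, we split the data into a training set and a test set. The test set takes 30% of the original data:

from sklearn.model_selection import train_test_split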
y = train_data.iloc[:,0]
X = train_data.iloc[:,1:]
train_X,test_X,train_y,test_y = train_test_split(X,y,test_size =0.3,random_state=0)
ntrain_data = pd.concat([train_y,train_X],axis=1)
ntest_data = pd.concat([test_y,test_X],axis=1)

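2. Exploratory Analysis

Before building a model, we usually carry out exploratory data analysis (EDA) on the data at hand. EDA means exploring existing data (especially raw data obtained from surveys or observation) under as few prior assumptions as possible. Common EDA tools include histograms, scatter plots and box plots.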

age = ntrain_data['age']
sns.distplot(age)

As the plot shows, age is roughly normally distributed, which is consistent with the assumptions of the statistical analysis.

mi = ntrain_data[['MonthlyIncome']]
sns.distplot(mi)

Likewise, income is roughly normally distributed.

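3. Variable Selection

3.1 Binning

First, the features need to be binned. Binning is a way of discretizing continuous features. Common approaches include equal-width, equal-frequency and chi-square binning, and sensible binning can make the model more accurate. Here I use a monotonic binning method often seen in SAS; the Python code comes from another author.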

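def mono_bin(Y, X, n=10):
    # Reduce the number of quantile bins until the bin means of X and Y
    # are perfectly monotonic (|Spearman r| == 1)
    r = 0
    good=Y.sum()              # Y == 1 marks a good customer
    bad=Y.count()-good
    while np.abs(r) < 1:
        d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.qcut(X, n)})
        d2 = d1.groupby('Bucket', as_index = True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1             # after the loop, n is (number of bins used) - 1
    # Per-bin statistics
    d3 = pd.DataFrame({'min': d2.min().X})
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y    # good customers in the bin
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    # WOE = ln((good_i/good)/(bad_i/bad)); the bin size cancels in the ratio
    d3['woe']=np.log((d3['rate']/good)/((1-d3['rate'])/bad))
    d3['goodattribute']=d3['sum']/good
    d3['badattribute']=(d3['total']-d3['sum'])/bad
    iv=((d3['goodattribute']-d3['badattribute'])*d3['woe']).sum()
    # sort_values replaces sort_index(by=...), which was removed from pandas
    d4 = d3.sort_values(by = 'min').reset_index(drop=True)
    woe=list(d4['woe'].round(3))
    # Rebuild the cut points from the quantiles of X, padded with -inf and +inf
    cut=[]
    cut.append(float('-inf'))
    for i in range(1,n+1):
        qua=X.quantile(i/(n+1))
        cut.append(round(qua,4))
    cut.append(float('inf'))
    return d4,iv,cut,woe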

x1_d,x1_iv,x1_cut,x1_woe = mono_bin(train_y,train_X.RevolvingUtilizationOfUnsecuredLines)

x2_d,x2_iv,x2_cut,x2_woe = mono_bin(train_y,train_X.age)

x4_d,x4_iv,x4_cut,x4_woe = mono_bin(train_y,train_X.DebtRatio)

x5_d,x5_iv,x5_cut,x5_woe = mono_bin(train_y,train_X.MonthlyIncome)

We bin RevolvingUtilizationOfUnsecuredLines, age, DebtRatio and MonthlyIncome this way.

The other variables, however, cannot be binned this way, so we choose their cut points by hand:

inf = float('inf')
cutx3 = [-inf, 0, 1, 3, 5, +inf]
cutx6 = [-inf, 1, 2, 3, 5, +inf]
cutx7 = [-inf, 0, 1, 3, 5, +inf]
cutx8 = [-inf, 0, 1, 2, 3, +inf]
cutx9 = [-inf, 0, 1, 3, +inf]
cutx10 = [-inf, 0, 1, 2, 3, 5, +inf]

Taking NumberOfTime30-59DaysPastDueNotWorse as an example:

def woe_value(d1):
    d2 = d1.groupby('Bucket', as_index = True)
    good=train_y.sum()
    bad=train_y.count()-good
    d3 = pd.DataFrame(d2.X.min(), columns = ['min'])
    d3['min']=d2.min().X
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate']/good)/((1-d3['rate'])/bad))
    d3['goodattribute']=d3['sum']/good
    d3['badattribute']=(d3['total']-d3['sum'])/bad
    iv=((d3['goodattribute']-d3['badattribute'])*d3['woe']).sum()
    d4 = d3.sort_values(by = 'min').reset_index(drop=True)
    woe=list(d4['woe'].round(3))
    return d4,iv,woe

d1 = pd.DataFrame({"X": train_X['NumberOfTime30-59DaysPastDueNotWorse'], "Y": train_y})
d1['Bucket'] = d1['X']
d1_x1 = d1.loc[(d1['Bucket']<=0)]
d1_x1.loc[:,'Bucket']="(-inf,0]"


d1_x2 = d1.loc[(d1['Bucket']>0) & (d1['Bucket']<= 1)]
d1_x2.loc[:,'Bucket'] = "(0,1]"


d1_x3 = d1.loc[(d1['Bucket']>1) & (d1['Bucket']<= 3)]
d1_x3.loc[:,'Bucket'] = "(1,3]"


d1_x4 = d1.loc[(d1['Bucket']>3) & (d1['Bucket']<= 5)]
d1_x4.loc[:,'Bucket'] = "(3,5]"


d1_x5 = d1.loc[(d1['Bucket']>5)]
d1_x5.loc[:,'Bucket']="(5,+inf)"
d1 = pd.concat([d1_x1,d1_x2,d1_x3,d1_x4,d1_x5])


x3_d,x3_iv,x3_woe= woe_value(d1)
x3_cut = [float('-inf'),0,1,3,5,float('+inf')]
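
The slice-and-label steps above can also be written with `pd.cut`, which assigns right-closed bins in one call. The following is a sketch under assumptions, not the original author's method: the function name `woe_iv_manual` and the toy data are made up for illustration, and the label convention (1 = good, 0 = bad) follows the article.

```python
import numpy as np
import pandas as pd

def woe_iv_manual(x, y, cut):
    """WOE per bin and total IV for hand-picked, right-closed bins.

    y follows the article's convention: 1 = good customer, 0 = bad customer.
    """
    df = pd.DataFrame({"X": x, "Y": y})
    # pd.cut builds intervals (cut[i], cut[i+1]]
    df["Bucket"] = pd.cut(df["X"], bins=cut)
    grouped = df.groupby("Bucket", observed=True)["Y"]
    good_i = grouped.sum()
    bad_i = grouped.count() - good_i
    good_share = good_i / df["Y"].sum()
    bad_share = bad_i / (df["Y"].count() - df["Y"].sum())
    woe = np.log(good_share / bad_share)
    iv = ((good_share - bad_share) * woe).sum()
    return woe, iv

# Toy data (hypothetical): past-due counts and good/bad flags
x = [0, 0, 0, 0, 1, 2, 3, 4]
y = [1, 1, 1, 0, 1, 0, 0, 0]
woe, iv = woe_iv_manual(x, y, [float("-inf"), 0, float("inf")])
print(woe.round(3).tolist(), round(iv, 4))
```

A bin that contains only good or only bad customers would give an infinite WOE here, so real cut points need at least one of each per bin.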

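During binning we also computed the WOE (Weight of Evidence) and IV (Information Value). The former is needed because all variables must be converted to WOE when building the logistic regression model; the latter gives a good indication of a variable's predictive power. The two values are calculated as follows: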
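
The formulas below restate what the `mono_bin` and `woe_value` code computes. The notation is chosen here for illustration: $G_i$ and $B_i$ are the numbers of good and bad customers in bin $i$, and $G$ and $B$ are the totals over all bins.

```latex
\mathrm{WOE}_i = \ln\!\left(\frac{G_i / G}{B_i / B}\right),
\qquad
\mathrm{IV} = \sum_i \left(\frac{G_i}{G} - \frac{B_i}{B}\right)\mathrm{WOE}_i
```

In the code, `goodattribute` corresponds to $G_i/G$ and `badattribute` to $B_i/B$.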

Before judging by IV, we can first check the correlations between variables to get an intuitive feel for them:

corr = train_data.corr()
xticks = ['x0','x1','x2','x3','x4','x5','x6','x7','x8','x9','x10']
yticks = list(corr.index)
fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
sns.heatmap(corr, annot=True, cmap='rainbow', ax=ax1, annot_kws={'size': 5,  'color': 'blue'})
ax1.set_xticklabels(xticks, rotation=0, fontsize=10)
ax1.set_yticklabels(yticks, rotation=0, fontsize=10)
plt.show()

As the heatmap shows, NumberOfTime30-59DaysPastDueNotWorse, NumberOfOpenCreditLinesAndLoans and NumberOfTime60-89DaysPastDueNotWorse are fairly strongly correlated with the value we want to predict.

Next, let's look at the IV of each variable:

informationValue = []
informationValue.append(x1_iv)
informationValue.append(x2_iv)
informationValue.append(x3_iv)
informationValue.append(x4_iv)
informationValue.append(x5_iv)
informationValue.append(x6_iv)
informationValue.append(x7_iv)
informationValue.append(x8_iv)
informationValue.append(x9_iv)
informationValue.append(x10_iv)
informationValue

index=['x1','x2','x3','x4','x5','x6','x7','x8','x9','x10']
index_num = range(len(index))
ax=plt.bar(index_num,informationValue,tick_label=index)
plt.show()

The usual criteria for judging a variable's predictive power by IV are:

< 0.02: unpredictive

0.02 to 0.1: weak

0.1 to 0.3: medium

0.3 to 0.5: strong

> 0.5: suspicious

As the chart shows, X4, X5, X6, X8 and X10 have relatively low IV values, so these features with weak predictive power can be dropped.
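
The rule of thumb above can be turned into a small helper for screening variables. This is only a sketch: the function name `iv_strength`, the treatment of values that fall exactly on a boundary, the 0.1 cutoff used for `keep`, and the IV values themselves are all assumptions made for illustration, not results from the model above.

```python
def iv_strength(iv):
    """Map an IV value to the rule-of-thumb label from the list above."""
    if iv < 0.02:
        return "unpredictive"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv <= 0.5:
        return "strong"
    return "suspicious"

# Hypothetical IV values, for illustration only
demo_iv = {"x_a": 0.01, "x_b": 0.25, "x_c": 0.62}
labels = {name: iv_strength(v) for name, v in demo_iv.items()}
# Assumed cutoff: keep variables rated at least "medium"
keep = [name for name, v in demo_iv.items() if v >= 0.1]
print(labels, keep)
```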

3.2 WOE Transformation

Next, we convert every feature we need to its WOE, drop the features we do not need, and keep only the WOE-encoded variables:

def trans_woe(var,var_name,x_woe,x_cut):
    woe_name = var_name + '_woe'
    for i in range(len(x_woe)):
        if i == 0:
            var.loc[(var[var_name]<=x_cut[i+1]),woe_name] = x_woe[i]
        elif (i>0) and (i<= len(x_woe)-2):
            var.loc[((var[var_name]>x_cut[i])&(var[var_name]<=x_cut[i+1])),woe_name] = x_woe[i]
        else:
            var.loc[(var[var_name]>x_cut[len(x_woe)-1]),woe_name] = x_woe[len(x_woe)-1]
    return var

x1_name = 'RevolvingUtilizationOfUnsecuredLines'
x2_name = 'age'
x3_name = 'NumberOfTime30-59DaysPastDueNotWorse'
x7_name = 'NumberOfTimes90DaysLate'
x9_name = 'NumberOfTime60-89DaysPastDueNotWorse'

train_X = trans_woe(train_X,x1_name,x1_woe,x1_cut)
train_X = trans_woe(train_X,x2_name,x2_woe,x2_cut)
train_X = trans_woe(train_X,x3_name,x3_woe,x3_cut)
train_X = trans_woe(train_X,x7_name,x7_woe,x7_cut)
train_X = trans_woe(train_X,x9_name,x9_woe,x9_cut)

train_X = train_X.iloc[:,-5:]

The data now looks like this:

4. Model Analysis

4.1 Building the Model

We build the logistic regression model with the statsmodels package:

import statsmodels.api as sm
X1=sm.add_constant(train_X)
logit=sm.Logit(train_y,X1)
result=logit.fit()
print(result.summary())

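The results are as follows:

4.2 Model Validation

Once the model is built, we can feed in the test set and plot the ROC curve to judge the model's accuracy:

1. Apply the WOE transformation to the test set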

test_X = trans_woe(test_X,x1_name,x1_woe,x1_cut)
test_X = trans_woe(test_X,x2_name,x2_woe,x2_cut)
test_X = trans_woe(test_X,x3_name,x3_woe,x3_cut)
test_X = trans_woe(test_X,x7_name,x7_woe,x7_cut)
test_X = trans_woe(test_X,x9_name,x9_woe,x9_cut)

test_X = test_X.iloc[:,-5:]

2. Predict with the fitted model, plot the ROC curve, and compute the AUC

from sklearn import metrics
X3 = sm.add_constant(test_X)
resu = result.predict(X3)
fpr, tpr, threshold = metrics.roc_curve(test_y, resu)
rocauc = metrics.auc(fpr, tpr)
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % rocauc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('TPR')
plt.xlabel('FPR')
plt.show()

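As the plot shows, AUC = 0.85, which is acceptable.

5. Building the Scorecard

5.1 Scoring Criteria

Based on the paper cited above, we have:

a = log(p_good / p_bad)

Score = offset + factor * log(odds)

Before building a standard scorecard, we need to choose a few scorecard parameters: the base score, the PDO (points to double the odds) and the good/bad odds. Here we take 600 as the base score, a PDO of 20 (every additional 20 points doubles the good/bad odds) and good/bad odds of 20.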
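
The two constants in the score formula follow from these parameters. A minimal sketch, assuming the usual scaling conditions (the score equals the base score at the base odds, and rises by PDO points when the odds double); the function names `scaling_params` and `score_from_odds` are illustrative and do not appear in the original code.

```python
import math

def scaling_params(base_score, base_odds, pdo):
    """Solve Score = offset + factor * ln(odds) for factor and offset.

    Conditions: score(base_odds) == base_score and
    score(2 * base_odds) == base_score + pdo.
    """
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return factor, offset

factor, offset = scaling_params(base_score=600, base_odds=20, pdo=20)

def score_from_odds(odds):
    """Score for a given good/bad odds value."""
    return offset + factor * math.log(odds)

print(round(factor, 2), round(offset, 2))
print(score_from_odds(20), score_from_odds(40))
```

Here `factor` plays the role of `p` and `offset` the role of `q` in the next section.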

5.2 Building the Scorecard

p = 20/np.log(2)
q = 600 - 20*np.log(20)/np.log(2)

def get_score(coe,woe,factor):
    scores=[]
    for w in woe:
        score=round(coe*w*factor,0)
        scores.append(score)
    return scores

x_coe = [2.6084,0.6327,0.5151,0.5520,0.5747,0.4074]
baseScore = round(q + p * x_coe[0], 0)

x1_score = get_score(x_coe[1], x1_woe, p)
x2_score = get_score(x_coe[2], x2_woe, p)
x3_score = get_score(x_coe[3], x3_woe, p)
x7_score = get_score(x_coe[4], x7_woe, p)
x9_score = get_score(x_coe[5], x9_woe, p)

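x_coe holds the coefficients obtained from the logistic regression model above. The resulting baseScore is 589 points. get_score returns the score for every bin, as follows: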
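
To see why per-bin points can simply be added up, note that the model's log-odds is the intercept plus the sum of coefficient times WOE, so the score splits into a base part and one part per variable. The sketch below checks this with made-up numbers; the intercept, coefficients and WOE values are hypothetical, not the fitted ones.

```python
import math

factor = 20 / math.log(2)
offset = 600 - factor * math.log(20)

# Hypothetical model: an intercept and one coefficient per variable
intercept = 2.0
coefs = [0.6, 0.5]
# Hypothetical WOE of the bin each variable falls into for one applicant
applicant_woe = [0.8, -0.4]

# Exact score straight from the log-odds
log_odds = intercept + sum(c * w for c, w in zip(coefs, applicant_woe))
exact_score = offset + factor * log_odds

# Scorecard version: rounded base points plus rounded points per bin
base_points = round(offset + factor * intercept)
bin_points = [round(c * w * factor) for c, w in zip(coefs, applicant_woe)]
card_score = base_points + sum(bin_points)
print(base_points, bin_points, card_score, round(exact_score, 2))
```

The two scores differ only by rounding, at most half a point per rounded term.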

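From the binning results of the earlier sections and these scores, we can build the scorecard:

5.3 Automatic Score Calculation

We build a function that returns the score when given the values of x1, x2, x3, x7 and x9.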

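cut_t = [x1_cut,x2_cut,x3_cut,x7_cut,x9_cut]
score = [x1_score,x2_score,x3_score,x7_score,x9_score]   # assumed: points per bin, in the same order as cut_t
def compute_score(x):        # x is a sequence holding the values of x1, x2, x3, x7 and x9
    tot_score = baseScore
    cut_d = copy.deepcopy(cut_t)
    for j in range(len(cut_d)):
        # Insert the value among the cut points; its position gives the bin
        cut_d[j].append(x[j])
        cut_d[j].sort()
        for i in range(len(cut_d[j])):
            if cut_d[j][i] == x[j]:
                tot_score = score[j][i-1] +tot_score
                break        # first match only, so a value equal to a cut point is counted once
    return tot_score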

Let's test it:
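
A self-contained way to try the lookup logic is sketched below. It swaps the append-and-sort search for `bisect_left` from the standard library, and it uses made-up cut points and points per bin rather than the values fitted above; the names `bin_index` and `total_score` are illustrative.

```python
from bisect import bisect_left

def bin_index(cut, value):
    """Index of the right-closed bin (cut[i], cut[i+1]] that contains value."""
    return bisect_left(cut, value) - 1

def total_score(base, cuts, scores, x):
    """Add the points of the bin each input value falls into to the base score."""
    total = base
    for cut, points, value in zip(cuts, scores, x):
        total += points[bin_index(cut, value)]
    return total

inf = float("inf")
# Made-up cut points and points per bin, for illustration only
demo_cuts = [[-inf, 30, 50, inf], [-inf, 0, 1, inf]]
demo_scores = [[-5, 0, 8], [10, -12, -30]]
demo_total = total_score(600, demo_cuts, demo_scores, [50, 0])
print(demo_total)
```

A value that equals a cut point lands in the bin it closes, which matches the `<=` comparisons used in trans_woe.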

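Summary

This completes the behavior scorecard built with Python. Using data from a Kaggle project, this article builds a simple scorecard with logistic regression. In the process, we first cleaned the data, handling missing values and outliers and giving a high-level view of the data distributions. We then processed the features: continuous variables were binned, WOE and IV were computed along the way, and the variables with higher IV were kept and converted to WOE. Finally, we ran logistic regression on the WOE-transformed data, and used the resulting coefficients together with a scoring standard of our own choosing to build the scorecard.

Throughout the process, the data was not mined very deeply. For example, outliers were dropped for only a few variables, and variables that could not be binned automatically were binned by eye, without much investigation of how these choices might affect the model. These points give a direction for later model optimization.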
