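Customer Loan Overdue Prediction [1] - Logistic Regression Model

Task

      Predict whether a loan customer will become overdue. The response variable is status, which takes two values: 0 means not overdue and 1 means overdue.

Code: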

# -*- coding: utf-8 -*-
"""
Created on Thu Nov 15 13:02:11 2018

@author: keepi
"""

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
pd.set_option('display.max_rows', 1000)

# Load the data and fill missing values with the constant 10
data = pd.read_csv('data.csv', encoding='gb18030')
data = data.fillna(10)

# Feature engineering
'''
# Alternative (disabled): map each category to an integer code
n = set(data['reg_preference_for_trad'])
dic = {}
for i,j in enumerate(n):
    dic[j] = i
data['reg_preference_for_trad'] = data['reg_preference_for_trad'].map(dic)
'''
# One-hot encode the categorical column
x_dummy = pd.get_dummies(data['reg_preference_for_trad'])
data = pd.concat([data.drop('reg_preference_for_trad', axis=1), x_dummy], axis=1, sort=False)
# Recent scikit-learn versions require all column names to be strings
data.columns = data.columns.astype(str)
# Drop columns that are not used as features
data = data.drop(['source', 'bank_card_no', 'latest_query_time',
                  'loans_latest_time', 'id_name'], axis=1)

# Split into training and test sets
train, test = train_test_split(data, test_size=0.3, random_state=25)
y_train = train.loc[:, 'status']
train_2 = train.drop('status', axis=1)
y_test = test.loc[:, 'status']
test_2 = test.drop('status', axis=1)

# Train the model and predict
# dual=True is only supported by the liblinear solver (the default in 2018),
# so it is named explicitly to keep the code running on newer scikit-learn
lr = LogisticRegression(C=190, solver='liblinear', dual=True, random_state=535)
lr.fit(train_2, y_train)

y_test_pre = lr.predict(test_2)

# Score
score = f1_score(y_test, y_test_pre, average='macro')
print('Validation set score', score)

Validation set score: 0.43838
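
For reference, average='macro' computes the F1 score of each class separately and takes their unweighted mean, so the minority class counts as much as the majority class. A minimal sketch on made-up labels (the arrays and the helper f1_for_class are illustrative, not taken from data.csv or the original code):

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for illustration only (not from data.csv)
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

def f1_for_class(y_true, y_pred, label):
    """F1 of one class from its true positives, false positives and false negatives."""
    tp = np.sum((y_pred == label) & (y_true == label))
    fp = np.sum((y_pred == label) & (y_true != label))
    fn = np.sum((y_pred != label) & (y_true == label))
    return 2 * tp / (2 * tp + fp + fn)

# Macro F1 by hand: unweighted mean of the per-class F1 scores
per_class = [f1_for_class(y_true, y_pred, c) for c in (0, 1)]
macro_by_hand = float(np.mean(per_class))

# The same quantity from scikit-learn
macro_sklearn = f1_score(y_true, y_pred, average='macro')
print(per_class, macro_by_hand, macro_sklearn)
```

Here class 0 scores 0.75 and class 1 scores 0.5, so the macro F1 is 0.625 even though four of six predictions are correct.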

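Problems encountered

    1. SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

           The cause is that I modified the data in place while processing it: train comes out of train_test_split as a slice of data, so dropping a column from it with inplace=True triggers the warning.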

train.drop('status',axis=1,inplace=True)
# Warning: SettingWithCopyWarning
# Changing it to the line below fixes it
train_2 = train.drop('status',axis=1)
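
The fix above can be sketched on a toy frame. This is an illustration under assumptions: the DataFrame df and its feature columns f1 and f2 are made up, and only the status column name comes from the post. Both options leave the slice returned by train_test_split unmodified:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for data.csv (f1 and f2 are made-up column names)
df = pd.DataFrame({'f1': range(10), 'f2': range(10, 20), 'status': [0, 1] * 5})
train, test = train_test_split(df, test_size=0.3, random_state=25)

# Option 1: drop() without inplace returns a new DataFrame and leaves train untouched
X_train = train.drop('status', axis=1)

# Option 2: take an explicit copy first, then modify the copy in place
train_copy = train.copy()
train_copy.drop('status', axis=1, inplace=True)

print(list(X_train.columns), list(train.columns))
```

Option 1 is what the post uses; option 2 is handy when several in-place modifications follow.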

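    2. Even with the random seed of the train/test split fixed, the score was different on every training run

           This happened because the random seed of the logistic regression itself was not set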

lr = LogisticRegression(C=100,solver='liblinear',dual=True,random_state=535)   # setting random_state is enough
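
A minimal check of this, assuming synthetic data from make_classification in place of the loan features (the helper fit_coef is hypothetical). The liblinear solver shuffles the data using random_state, so two fits with the same seed should yield the same coefficients:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the loan features (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def fit_coef(seed):
    # dual=True requires solver='liblinear', which uses random_state to shuffle the data
    model = LogisticRegression(C=100, solver='liblinear', dual=True, random_state=seed)
    return model.fit(X, y).coef_

# Two independent fits with the same seed
same_seed = bool(np.allclose(fit_coef(535), fit_coef(535)))
print('identical coefficients with the same seed:', same_seed)
```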

    3. A warning appeared when computing the F1 score after predicting with an SVM:

UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

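        The warning says that the F1 score is ill-defined for a label that is never predicted, so it is set to 0: my trained model predicted 1 for every sample, while the test labels contain both 0 and 1. So why does LinearSVC end up predicting only one value after training?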
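
The situation can be reproduced and inspected on made-up labels. This sketch only diagnoses the symptom; it does not explain why LinearSVC collapsed to one class. The zero_division argument is available in scikit-learn 0.22 and later:

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: the model predicts 1 for every sample
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.ones_like(y_true)

# Step 1: check which labels the model actually predicts
labels, counts = np.unique(y_pred, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))

# Step 2: per-class F1; class 0 is never predicted, so its F1 is set to 0.
# zero_division=0 keeps that value but silences the warning.
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
print(per_class, macro)
```

Possible causes worth checking, offered here as untested guesses rather than a diagnosis of this data set: features on very different scales (try StandardScaler before LinearSVC) and class imbalance (try class_weight='balanced').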

 

References

        Dummy variables and one-hot encoding