Machine Learning: Study Notes on Gradient Boosting Machine (GBM) in Python, Part 1: Data Analysis

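Original article: Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python by Aarshay Jain

Translation source: http://blog.csdn.net/han_xiaoyang/article/details/52663170

     I have been reading a blog post on the GBM algorithm translated by Han Xiaoyang (寒小陽); links to the original article and to the translation are given above. So far I have studied the data analysis part in detail. The original article only touches on it briefly, so I tracked down the source code, got it running, added comments, and am sharing my notes here.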

  Data analysis (code + comments):


# coding: utf-8

# In[2]:


import pandas as pd
import numpy as np
get_ipython().run_line_magic('matplotlib', 'inline')


# In[6]:


#Load data:
train = pd.read_csv('Train_nyOWmfK.csv')
test = pd.read_csv('Test_bCtAN1w.csv')


# In[7]:


train.shape, test.shape


# In[8]:


train.dtypes  # check the data type of each column


# In[15]:


#Combine into data:
train['source']= 'train'
test['source'] = 'test'
# Concatenate train and test; ignore_index=True discards the original
# indices and gives the combined frame a single new continuous index
data = pd.concat([train, test], ignore_index=True)
print(data.shape)
print(train.dtypes)


# ## Check missing:

# In[6]:


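'''
A lambda is a single expression, so its body is much simpler than a def.
Because the body is an expression rather than a block of statements, only limited logic fits inside it.
It works as shorthand for a function and lets you define one inline.
Here it counts the number of null values in each column of data.
'''
data.apply(lambda x: sum(x.isnull()))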
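
The per-column null count can also be written without a lambda. A minimal sketch on a made-up toy frame (the column names and values are illustrative and are not taken from the loan dataset):

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; not the loan dataset.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': ['x', None, None],
})

# Vectorized equivalent of toy.apply(lambda x: sum(x.isnull()))
null_counts = toy.isnull().sum()
print(null_counts)
```

Both forms return one count per column; the vectorized form avoids a Python-level loop over every value.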


# ## Look at categories of all object variables:

# In[21]:


var = ['Gender','Salary_Account','Mobile_Verified','Var1','Filled_Form','Device_Type','Var2','Source']
for v in var:
    print('\nDistinct values of column %s and how often each occurs\n' % v)
    print(data[v].value_counts())


# ## Handle Individual Variables:

# ### City Variable:

# In[17]:


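'''
Drop the City column because it has too many distinct values.
axis=0 drops rows (by index label); axis=1 drops columns.
inplace=False (the default) returns a new DataFrame and leaves data unchanged;
inplace=True modifies data itself and returns None.
'''
len(data['City'].unique())
data.drop('City',axis=1,inplace=True)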
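
The meaning of `axis` and `inplace` in `drop` is easy to check on a small example. A minimal sketch, using a made-up two-row frame (the `Income` column and its values are hypothetical):

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({'City': ['A', 'B'], 'Income': [10, 20]})

# axis=1 drops a column; inplace=False (the default) returns a new frame
# and leaves toy untouched.
without_city = toy.drop('City', axis=1)

# axis=0 drops a row, identified by its index label.
without_first_row = toy.drop(0, axis=0)

# inplace=True modifies toy itself and returns None.
result = toy.drop('City', axis=1, inplace=True)
print(result, list(toy.columns))
```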


# ### Determine Age from DOB

# In[18]:


data['DOB'].head()


# In[44]:


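'''
DOB is the exact date of birth. The exact date is not very useful to us,
but the age group may be, so compute the age instead and store it in a new Age column.
This assumes DOB ends with a two-digit year (e.g. '23-May-78'), so the last
two characters are sliced; slicing three would include the '-' and give a negative year.
'''
#print(data['DOB'])
data['Age'] = data['DOB'].apply(lambda x: 115 - int(x[-2:]))
data['Age'].head()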
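
A minimal sketch of the age calculation on toy values, assuming DOB strings end with a two-digit year such as `'23-May-78'` and that every birth year falls in the 1900s (both are assumptions; the constant 115 then presumably stands for 2015 minus 1900):

```python
import pandas as pd

# Toy values in the assumed 'DD-Mon-YY' format; not taken from the dataset.
dob = pd.Series(['23-May-78', '07-Oct-85'])

# Take the last two characters (the two-digit year) and subtract from 115.
age = 115 - dob.str[-2:].astype(int)
print(age.tolist())
```

Slicing the last three characters instead would pick up the `-` separator, which `int` reads as a minus sign.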


# In[41]:


# Drop the original column
data.drop('DOB',axis=1,inplace=True)


# ### EMI_Loan_Submitted

# In[55]:


data.boxplot(column=['EMI_Loan_Submitted'],return_type='axes')  # draw a box plot


# In[46]:


# Create EMI_Loan_Submitted_Missing: 1 when EMI_Loan_Submitted is missing, 0 otherwise. EMI_Loan_Submitted itself is dropped in the next cell.
data['EMI_Loan_Submitted_Missing'] = data['EMI_Loan_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
data[['EMI_Loan_Submitted','EMI_Loan_Submitted_Missing']].head(10)
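
The same missing-value indicator can be built without `apply`. A minimal sketch on toy values (the numbers are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy column for illustration only.
toy = pd.DataFrame({'EMI_Loan_Submitted': [1200.0, np.nan, 850.5, np.nan]})

# isnull() gives a boolean mask; astype(int) turns True/False into 1/0.
toy['EMI_Loan_Submitted_Missing'] = toy['EMI_Loan_Submitted'].isnull().astype(int)
print(toy)
```

This pattern applies equally to the other `*_Missing` indicator columns created in this notebook.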


# In[56]:


#drop original variables:
data.drop('EMI_Loan_Submitted',axis=1,inplace=True)


# ### Employer Name

# In[57]:


len(data['Employer_Name'].value_counts())


# In[59]:


# Employer_Name also has far too many distinct values, so drop it as well
data.drop('Employer_Name',axis=1,inplace=True)


# ### Existing EMI

# In[60]:


# Missing values in Existing_EMI are imputed with 0 (the median), since only 111 values are missing

data.boxplot(column='Existing_EMI',return_type='axes')


# In[61]:


data['Existing_EMI'].describe()


# In[19]:


#Impute by median (0) because just 111 missing:
# Assign the result back instead of calling fillna(inplace=True) on a selected column
data['Existing_EMI'] = data['Existing_EMI'].fillna(0)


# ### Interest Rate:

# In[63]:


#Majority of values missing, so I'll create a new variable stating whether this is missing or not:
data['Interest_Rate_Missing'] = data['Interest_Rate'].apply(lambda x: 1 if pd.isnull(x) else 0)
print(data[['Interest_Rate','Interest_Rate_Missing']].head(10))


# In[62]:


data.drop('Interest_Rate',axis=1,inplace=True)


# ### Lead Creation Date:

# In[64]:


#Drop this variable because intuitively it doesn't appear to affect the outcome much
data.drop('Lead_Creation_Date',axis=1,inplace=True)


# ### Loan Amount and Tenure applied:

# In[65]:


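#Impute with median because only 111 missing:
# Assign the result back instead of calling fillna(inplace=True) on a selected column
data['Loan_Amount_Applied'] = data['Loan_Amount_Applied'].fillna(data['Loan_Amount_Applied'].median())
data['Loan_Tenure_Applied'] = data['Loan_Tenure_Applied'].fillna(data['Loan_Tenure_Applied'].median())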
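
A minimal sketch of median imputation on a toy column (the amounts are made up). `median()` skips missing values by default, so the fill value is computed from the observed entries only:

```python
import numpy as np
import pandas as pd

# Toy column for illustration only.
toy = pd.DataFrame({'Loan_Amount_Applied': [100000.0, np.nan, 300000.0, 200000.0]})

# Median of the three observed values.
median = toy['Loan_Amount_Applied'].median()

# Fill the gap and assign the result back to the column.
toy['Loan_Amount_Applied'] = toy['Loan_Amount_Applied'].fillna(median)
print(median, toy['Loan_Amount_Applied'].tolist())
```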


# ### Loan Amount and Tenure selected

# In[68]:


#High proportion missing so create a new var whether present or not
data['Loan_Amount_Submitted_Missing'] = data['Loan_Amount_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
data['Loan_Tenure_Submitted_Missing'] = data['Loan_Tenure_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)


# In[69]:


# Loan_Amount_Submitted_Missing is 1 when Loan_Amount_Submitted is missing and 0 otherwise; the original Loan_Amount_Submitted is dropped here
# Loan_Tenure_Submitted_Missing is 1 when Loan_Tenure_Submitted is missing and 0 otherwise; the original Loan_Tenure_Submitted is dropped here
data.drop(['Loan_Amount_Submitted','Loan_Tenure_Submitted'],axis=1,inplace=True)


# ### Remove logged-in

# In[26]:


# Drop LoggedIn (Salary_Account is dropped in the next cell)
data.drop('LoggedIn',axis=1,inplace=True)


# ### Remove salary account

# In[27]:


#Salary account has many banks which would have to be manually grouped, so drop it
data.drop('Salary_Account',axis=1,inplace=True)


# ### Processing_Fee

# In[28]:


#High proportion missing so create a new var whether present or not
data['Processing_Fee_Missing'] = data['Processing_Fee'].apply(lambda x: 1 if pd.isnull(x) else 0)
#drop old
data.drop('Processing_Fee',axis=1,inplace=True)


# ### Source

# In[78]:


# Source: keep the top 2 values (S122, S133) as they are and group everything else into 'others'

data['Source'] = data['Source'].apply(lambda x: 'others' if x not in ['S122','S133'] else x)
data['Source'].value_counts()
print(data['Source'])
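
Grouping the less frequent categories can also be done with `isin` and `where` instead of `apply`. A minimal sketch on toy values (`S159` and `S143` are made-up codes used only for illustration):

```python
import pandas as pd

# Toy values for illustration only.
source = pd.Series(['S122', 'S133', 'S159', 'S122', 'S143'])

# Keep values that are in the list; replace everything else with 'others'.
grouped = source.where(source.isin(['S122', 'S133']), 'others')
print(grouped.value_counts())
```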


# ## Final Data:

# In[30]:


data.apply(lambda x: sum(x.isnull()))


# In[31]:


data.dtypes


# ### Numerical Coding:

# In[80]:


# Encode each distinct category value as a distinct integer
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
var_to_encode = ['Device_Type','Filled_Form','Gender','Var1','Var2','Mobile_Verified','Source']
for col in var_to_encode:
    data[col] = le.fit_transform(data[col])


# ### One-Hot Coding

# In[81]:


# get_dummies is the pandas way to do one-hot encoding.
data = pd.get_dummies(data, columns=var_to_encode)
print(data)
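
A minimal sketch of what `get_dummies` produces, on a made-up frame (the values are illustrative). It also accepts string columns directly; label encoding them first, as done above, only changes the suffixes of the generated column names to integers:

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Male'], 'Age': [30, 41, 25]})

# One indicator column per distinct value; untouched columns are kept as-is.
direct = pd.get_dummies(toy, columns=['Gender'])
print(list(direct.columns))
```

Each row has exactly one indicator set among the columns generated from a given variable.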


# ### Separate train & test:

# In[77]:


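print(data['source'])
# .copy() gives independent frames, so later in-place edits do not trigger SettingWithCopyWarning
train = data.loc[data['source']=='train'].copy()
test = data.loc[data['source']=='test'].copy()
#print(train.source)
#print(test.source)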


# In[35]:


train.drop('source',axis=1,inplace=True)
test.drop(['source','Disbursed'],axis=1,inplace=True)


# In[36]:


train.to_csv('train_modified.csv',index=False)
test.to_csv('test_modified.csv',index=False)

     So far I have only studied the data analysis part; I will work through model building and parameter tuning as soon as I can.