
Data Mining Project (1): Predicting New Airbnb Users' First Booking Destinations

Abstract
This article predicts the first booking destinations of new Airbnb users and gives a complete account of the whole process, from data exploration through feature engineering to model building.
Specifically:
1. The data exploration part is based mainly on the pandas library, using common functions such as head(), value_counts(), describe(), isnull(), and unique(), together with matplotlib plots, to understand and explore the data;
2. The feature engineering part extracts year, month, day, season, and weekday from dates, bins ages into segments, computes differences between related date features, groups records by user id to compute counts, means, and standard deviations of several variables, and encodes the data with one-hot encoding and label encoding;
3. The model building part is based mainly on the sklearn and xgboost packages. Several models are called for prediction: logistic regression (Logistic Regression); tree models: DecisionTree, RandomForest, AdaBoost, Bagging, ExtraTree, GradientBoosting; SVM models: SVM-rbf, SVM-poly, SVM-linear; and xgboost. By varying the model parameters and the amount of training data, we observe the NDCG scores to understand how different models, parameters, and data sizes affect the prediction results.

1. Background

About this dataset: In this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user's first booking destination will be. All the users in this dataset are from the USA.

There are 12 possible outcomes of the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU', 'NDF' (no destination found), and 'other'. Please note that 'NDF' is different from 'other' because 'other' means there was a booking, but to a country not included in the list, while 'NDF' means there wasn't a booking.

2. Data Description

The dataset contains six csv files in total:
1. train_users_2.csv - the training set of users
2. test_users.csv - the test set of users
- id: user id
- date_account_created: the date of account creation
- timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
- date_first_booking: date of first booking
- gender: the user's gender
- age: the user's age
- signup_method: the sign-up method
- signup_flow: the page a user came to sign up from
- language: international language preference
- affiliate_channel: what kind of paid marketing
- affiliate_provider: where the marketing is, e.g. google, craigslist, other
- first_affiliate_tracked: what's the first marketing the user interacted with before signing up
- signup_app: the app used to sign up
- first_device_type: the first device type used
- first_browser: the first browser used
- country_destination: the destination country of the booking; this is the target variable you are to predict
3. sessions.csv - web sessions log for users
- user_id: to be joined with the column 'id' in the users table
- action: user action
- action_type: user action type
- action_detail: user action detail
- device_type: device type
- secs_elapsed: seconds elapsed (session duration)
4. sample_submission.csv - correct format for submitting your predictions
- Data download link:
Airbnb New User Bookings - dataset

3. Data Exploration

  • Based on Jupyter Notebook and Python 3

3.1 The train_users_2 and test_users files

Import packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
%matplotlib inline
import datetime
import os
import seaborn as sns  # data visualization
from datetime import date
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer
import pickle  # for saving models
from sklearn.metrics import *
from sklearn.model_selection import *

Read the files

train = pd.read_csv("train_users_2.csv")
test = pd.read_csv("test_users.csv")

View the features contained in the data

print('the columns name of training dataset:\n',train.columns)
print('the columns name of test dataset:\n',test.columns)

Analysis:
1. The train file has one more feature than the test file: country_destination
2. country_destination is the target variable we need to predict
3. Data exploration focuses on the train file; the test file is similar
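As a quick cross-check (a small addition, assuming train and test are loaded as above), the set difference of the two column indexes confirms point 1:

set(train.columns) - set(test.columns)
# {'country_destination'}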

View the data info

print(train.info())


Analysis:
1. The train file contains 213,451 rows and 16 features
2. info() shows each feature's data type and non-null count
3. date_first_booking has many missing values, so we can consider dropping it during feature extraction

Feature analysis:
1. date_account_created

1.1 View the first few rows of date_account_created

print(train.date_account_created.head())


1.2 Count the values of date_account_created

print(train.date_account_created.value_counts().head())
print(train.date_account_created.value_counts().tail())


1.3 Get summary statistics of date_account_created

print(train.date_account_created.describe())


1.4 Observe user growth

dac_train = train.date_account_created.value_counts()
dac_test = test.date_account_created.value_counts()
# convert the dates to datetime type
dac_train_date = pd.to_datetime(dac_train.index)
dac_test_date = pd.to_datetime(dac_test.index)
# compute the number of days since the earliest account creation date
dac_train_day = dac_train_date - dac_train_date.min()
dac_test_day = dac_test_date - dac_train_date.min()
# matplotlib scatter plot
plt.scatter(dac_train_day.days, dac_train.values, color = 'r', label = 'train dataset')
plt.scatter(dac_test_day.days, dac_test.values, color = 'b', label = 'test dataset')

plt.title("Accounts created vs day")
plt.xlabel("Days")
plt.ylabel("Accounts created")
plt.legend(loc = 'upper left')


Analysis:
1. x-axis: days since the earliest account creation date
2. y-axis: number of accounts created on that day
3. The number of account registrations rises sharply over time

2. timestamp_first_active
2.1 View the first few rows

print(train.timestamp_first_active.head())


2.2 Count the values to check for duplicates

print(train.timestamp_first_active.value_counts().unique())

[1]
Analysis: the result [1] shows that every timestamp_first_active value occurs exactly once, i.e. there are no duplicate timestamps
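Equivalently (an alternative check, not in the original), pandas exposes this directly through the Series is_unique attribute:

print(train.timestamp_first_active.is_unique)
# True: every first-activity timestamp occurs exactly once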

2.3 Convert the timestamp to datetime and get summary statistics

tfa_train_dt = train.timestamp_first_active.astype(str).apply(lambda x:  
                                                                    datetime.datetime(int(x[:4]),
                                                                                      int(x[4:6]), 
                                                                                      int(x[6:8]), 
                                                                                      int(x[8:10]), 
                                                                                      int(x[10:12]),
                                                                                      int(x[12:])))
print(tfa_train_dt.describe())

3. date_first_booking
Get summary statistics

print(train.date_first_booking.describe())
print(test.date_first_booking.describe())


Analysis:
1. date_first_booking has a large number of missing values in the train file
2. date_first_booking is entirely missing in the test file
3. The date_first_booking feature can therefore be dropped
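To quantify the claim, isnull().mean() gives the fraction of missing values per column (a hedged sketch; the exact train fraction depends on the data version):

print(train.date_first_booking.isnull().mean())  # roughly 0.58: users who never booked ('NDF') have no booking date
print(test.date_first_booking.isnull().mean())   # 1.0: entirely missing in the test file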

4. age
4.1 Count the values

print(train.age.value_counts().head())


Analysis: user ages are mostly concentrated around 30
4.2 Bar-chart comparison of age groups

#first split age into 4 groups: missing values, too small, reasonable, too large
age_train =[train[train.age.isnull()].age.shape[0],
            train.query('age < 15').age.shape[0],
            train.query("age >= 15 & age <= 90").age.shape[0],
            train.query('age > 90').age.shape[0]]

age_test = [test[test.age.isnull()].age.shape[0],
            test.query('age < 15').age.shape[0],
            test.query("age >= 15 & age <= 90").age.shape[0],
            test.query('age > 90').age.shape[0]]

columns = ['Null', 'age < 15', '15 <= age <= 90', 'age > 90']

# plot
fig, (ax1,ax2) = plt.subplots(1,2,sharex=True, sharey = True,figsize=(10,5))

sns.barplot(columns, age_train, ax = ax1)
sns.barplot(columns, age_test, ax = ax2)

ax1.set_title('training dataset')
ax2.set_title('test dataset')
ax1.set_ylabel('counts')


Analysis: abnormal ages are rare, and there is a sizeable number of missing values

5. Other features
- The remaining features in the train file have few distinct labels, so we can simply one-hot encode them during feature engineering

Use a single bar-plot helper function for all of these features

def feature_barplot(feature, df_train = train, df_test = test, figsize=(10,5), rot = 90, saveimg = False): 
    feat_train = df_train[feature].value_counts()
    feat_test = df_test[feature].value_counts()
    fig_feature, (axis1,axis2) = plt.subplots(1,2,sharex=True, sharey = True, figsize = figsize)
    sns.barplot(feat_train.index.values, feat_train.values, ax = axis1)
    sns.barplot(feat_test.index.values, feat_test.values, ax = axis2)
    axis1.set_xticklabels(axis1.xaxis.get_majorticklabels(), rotation = rot)
    axis2.set_xticklabels(axis2.xaxis.get_majorticklabels(), rotation = rot)
    axis1.set_title(feature + ' of training dataset')
    axis2.set_title(feature + ' of test dataset')
    axis1.set_ylabel('Counts')
    plt.tight_layout()
    if saveimg == True:
        figname = feature + ".png"
        fig_feature.savefig(figname, dpi = 75)

5.1 gender

feature_barplot('gender', saveimg = True)


5.2 signup_method

feature_barplot('signup_method')


5.3 signup_flow

feature_barplot('signup_flow')


5.4 language

feature_barplot('language')


5.5 affiliate_channel

feature_barplot('affiliate_channel')


5.6 first_affiliate_tracked

feature_barplot('first_affiliate_tracked')


5.7 signup_app

feature_barplot('signup_app')


5.8 first_device_type

feature_barplot('first_device_type')


5.9 first_browser

feature_barplot('first_browser')

3.2 The sessions file

Load the data and view the first 10 rows

df_sessions = pd.read_csv('sessions.csv')
df_sessions.head(10)


Rename user_id to id

#this is for merging with the users data later
df_sessions['id'] = df_sessions['user_id']
df_sessions = df_sessions.drop(['user_id'],axis=1) #drop the original user_id column

Check the shape of the data

df_sessions.shape

(10567737, 6)
Analysis: the sessions file has 10,567,737 rows and 6 features

Check missing values

df_sessions.isnull().sum()


Analysis: action, action_type, action_detail, and secs_elapsed have many missing values

Fill the missing values

df_sessions.action = df_sessions.action.fillna('NAN')
df_sessions.action_type = df_sessions.action_type.fillna('NAN')
df_sessions.action_detail = df_sessions.action_detail.fillna('NAN')
df_sessions.isnull().sum()


Analysis:
1. After filling, these three columns contain no more missing values
2. secs_elapsed will be filled later

4. Feature Extraction

  • Having gained some understanding of the data, we proceed to feature extraction

4.1 Feature extraction on the sessions file

1. action

df_sessions.action.head()

df_sessions.action.value_counts().min()

1
Analysis: counting the action values shows that there are many kinds of user actions and that the rarest occurs only once; we can therefore group infrequent actions into an OTHER category

1.1 Relabel action values occurring fewer than 100 times as OTHER

#Action values with low frequency are changed to 'OTHER'
act_freq = 100  #threshold of frequency
act = dict(zip(*np.unique(df_sessions.action, return_counts=True)))
df_sessions.action = df_sessions.action.apply(lambda x: 'OTHER' if act[x] < act_freq else x)
#np.unique(df_sessions.action, return_counts=True) returns the unique action values and their counts as arrays
#zip(*(values, counts)) pairs each unique value with its count; dict() turns the pairs into a lookup table
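The same relabeling can also be expressed with pandas alone; a minimal alternative sketch using value_counts() and map() (act_counts is our name), which should behave identically to the dict-based version above:

act_counts = df_sessions.action.value_counts()  # Series mapping each action to its frequency
df_sessions.action = df_sessions.action.map(lambda x: 'OTHER' if act_counts[x] < act_freq else x)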

2. Refine the features action, action_detail, action_type, device_type, and secs_elapsed
- First, group the session rows by user id
- action: for each user, count the total number of actions and the occurrences of each action value, plus the number of unique action values and the mean and standard deviation of their counts
- action_detail: for each user, count the occurrences of each action_detail value, plus the number of unique values and the mean and standard deviation of their counts
- action_type: for each user, count the occurrences of each action_type value, the number of unique values with their mean and standard deviation, and the total elapsed time per action_type (log-transformed)
- device_type: for each user, count the occurrences of each device_type value, plus the number of unique values and the mean and standard deviation of their counts
- secs_elapsed: fill missing values with 0; for each user compute the sum, mean, standard deviation, and median of secs_elapsed (log-transformed), the log-sum divided by the number of actions, and a histogram of the log-transformed values

#build a frequency-rank index for each categorical value, used to position counts in fixed-length vectors
f_act = df_sessions.action.value_counts().argsort()
f_act_detail = df_sessions.action_detail.value_counts().argsort()
f_act_type = df_sessions.action_type.value_counts().argsort()
f_dev_type = df_sessions.device_type.value_counts().argsort()

#group by user id
dgr_sess = df_sessions.groupby(['id'])
#Loop on dgr_sess to create all the features.
samples = [] #list of per-user feature rows
ln = len(dgr_sess) #number of groups (users) after grouping df_sessions

for g in dgr_sess:  #iterate over the data of each user id in dgr_sess
    gr = g[1]   #DataFrame that contains all the rows for one groupby key, e.g. 'zzywmcn0jv'

    l = []  #an empty list to temporarily hold this user's features

    #the id, for example 'zzywmcn0jv'
    l.append(g[0]) #put the id into the list

    # number of total actions
    l.append(len(gr)) #number of session rows for this user

    #fill missing secs_elapsed values with 0, then take the values
    sev = gr.secs_elapsed.fillna(0).values   #these values are used later

    #action features
    #(how many times each value occurs, number of unique values, mean and std)
    c_act = [0] * len(f_act)
    for i,v in enumerate(gr.action.values): #v is the action value; f_act maps it to a position
        c_act[f_act[v]] += 1
    _, c_act_uqc = np.unique(gr.action.values, return_counts=True)
    #number of unique actions, and the mean and std of their counts
    c_act += [len(c_act_uqc), np.mean(c_act_uqc), np.std(c_act_uqc)]
    l = l + c_act

    #action_detail features
    #(how many times each value occurs, numb of unique values, mean and std)
    c_act_detail = [0] * len(f_act_detail)
    for i,v in enumerate(gr.action_detail.values):
        c_act_detail[f_act_detail[v]] += 1
    _, c_act_det_uqc = np.unique(gr.action_detail.values, return_counts=True)
    c_act_detail += [len(c_act_det_uqc), np.mean(c_act_det_uqc), np.std(c_act_det_uqc)]
    l = l + c_act_detail

    #action_type features (e.g. click)
    #(how many times each value occurs, numb of unique values, mean and std
    #+ log of the sum of secs_elapsed for each value)
    l_act_type = [0] * len(f_act_type)
    c_act_type = [0] * len(f_act_type)
    for i,v in enumerate(gr.action_type.values):
        l_act_type[f_act_type[v]] += sev[i] #accumulate the total secs_elapsed per action type
        c_act_type[f_act_type[v]] += 1  
    l_act_type = np.log(1 + np.array(l_act_type)).tolist() #per-type totals vary widely, so log-transform them
    _, c_act_type_uqc = np.unique(gr.action_type.values, return_counts=True)
    c_act_type += [len(c_act_type_uqc), np.mean(c_act_type_uqc), np.std(c_act_type_uqc)]
    l = l + c_act_type + l_act_type    

    #device_type features
    #(how many times each value occurs, numb of unique values, mean and std)
    c_dev_type  = [0] * len(f_dev_type)
    for i,v in enumerate(gr.device_type.values):
        c_dev_type[f_dev_type[v]] += 1 
    c_dev_type.append(len(np.unique(gr.device_type.values))) 
    _, c_dev_type_uqc = np.unique(gr.device_type.values, return_counts=True)
    c_dev_type += [len(c_dev_type_uqc), np.mean(c_dev_type_uqc), np.std(c_dev_type_uqc)]        
    l = l + c_dev_type    

    #secs_elapsed features
    l_secs = [0] * 5 
    l_log = [0] * 15
    if len(sev) > 0:
        #Simple statistics about the secs_elapsed values.
        l_secs[0] = np.log(1 + np.sum(sev))
        l_secs[1] = np.log(1 + np.mean(sev)) 
        l_secs[2] = np.log(1 + np.std(sev))
        l_secs[3] = np.log(1 + np.median(sev))
        l_secs[4] = l_secs[0] / float(l[1]) #log-sum divided by the total number of actions

        #Values are grouped in 15 intervals. Compute the number of values
        #in each interval.
        #sev = gr.secs_elapsed.fillna(0).values 
        log_sev = np.log(1 + sev).astype(int)
        #np.bincount():Count number of occurrences of each value in array of non-negative ints.  
        l_log = np.bincount(log_sev, minlength=15).tolist()                    
    l = l + l_secs + l_log

    #The list l has the feature values of one sample.
    samples.append(l)

#preparing objects    
samples = np.array(samples) 
samp_ar = samples[:, 1:].astype(np.float16) #feature data (everything except the id)
samp_id = samples[:, 0]   #the id, located in the first column

#create a DataFrame for the extracted features     
col_names = []    #names of the columns
for i in range(len(samples[0])-1):  #minus 1 because the first column is the id
    col_names.append('c_' + str(i))  #generic naming scheme    
df_agg_sess = pd.DataFrame(samp_ar, columns=col_names)
df_agg_sess['id'] = samp_id
df_agg_sess.index = df_agg_sess.id #use the id as the index
df_agg_sess.head()


Analysis: after feature extraction, the sessions data grows from 6 features to 458 features
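The column count can be cross-checked against the loop above (a hedged sketch, assuming the f_* index objects are still in scope; expected is our name): each feature group contributes a fixed number of columns, so the expected width is:

expected = (1                          # total number of actions
            + len(f_act) + 3           # action counts + unique count, mean, std
            + len(f_act_detail) + 3    # action_detail counts + unique count, mean, std
            + 2 * len(f_act_type) + 3  # action_type counts and log durations + stats
            + len(f_dev_type) + 1 + 3  # device_type counts + extra unique count + stats
            + 5 + 15)                  # secs_elapsed statistics + log histogram
print(expected, df_agg_sess.shape[1] - 1)  # -1 excludes the id column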

4.2 Feature extraction on the train and test files

Record the number of rows in the train file and save the target variable
- labels stores the target variable country_destination that we will predict

train = pd.read_csv("train_users_2.csv")
test = pd.read_csv("test_users.csv")
#record the number of train rows so we can split train and test apart later
train_row = train.shape[0]  

# The label we need to predict
labels = train['country_destination'].values

Drop date_first_booking, and drop country_destination from the train file
- During data exploration we found that date_first_booking has too many missing values in both the train and test files, so we drop it
- We drop country_destination because the model will predict it; the predictions are then compared with the saved labels to judge model quality

train.drop(['country_destination', 'date_first_booking'], axis = 1, inplace = True)
test.drop(['date_first_booking'], axis = 1, inplace = True)

Concatenate the train and test files
- so that the same feature extraction can be applied to both

#concatenate train and test
df = pd.concat([train, test], axis = 0, ignore_index = True)

1. timestamp_first_active
1.1 Convert to datetime

tfa = df.timestamp_first_active.astype(str).apply(lambda x: datetime.datetime(int(x[:4]),
                                                                          int(x[4:6]), 
                                                                          int(x[6:8]),
                                                                          int(x[8:10]),
                                                                          int(x[10:12]),
                                                                          int(x[12:])))

1.2 Extract features: year, month, day

# create tfa_year, tfa_month, tfa_day feature
df['tfa_year'] = np.array([x.year for x in tfa])
df['tfa_month'] = np.array([x.month for x in tfa])
df['tfa_day'] = np.array([x.day for x in tfa])

1.3 Extract feature: weekday
- one-hot encode the result

#isoweekday() returns the day of the week: Monday = 1, ..., Sunday = 7
df['tfa_wd'] = np.array([x.isoweekday() for x in tfa]) 
df_tfa_wd = pd.get_dummies(df.tfa_wd, prefix = 'tfa_wd')  # one hot encoding 
df = pd.concat((df, df_tfa_wd), axis = 1) #append the encoded features
df.drop(['tfa_wd'], axis = 1, inplace = True) #drop the original unencoded feature
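The get_dummies / concat / drop pattern above recurs for every categorical feature in this section; a hypothetical convenience helper (one_hot_and_drop is our name, not part of the original code) could reduce the repetition:

def one_hot_and_drop(frame, col, prefix):
    # one-hot encode `col`, append the dummy columns, and drop the original
    dummies = pd.get_dummies(frame[col], prefix=prefix)
    frame = pd.concat((frame, dummies), axis=1)
    return frame.drop([col], axis=1)

# e.g. the weekday block above would become:
# df['tfa_wd'] = np.array([x.isoweekday() for x in tfa])
# df = one_hot_and_drop(df, 'tfa_wd', 'tfa_wd')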

1.4 Extract feature: season
- Since the season depends only on the month and day, we map every date to a single reference year

Y = 2000
seasons = [(0, (date(Y,  1,  1),  date(Y,  3, 20))),  #'winter'
           (1, (date(Y,  3, 21),  date(Y,  6, 20))),  #'spring'
           (2, (date(Y,  6, 21),  date(Y,  9, 22))),  #'summer'
           (3, (date(Y,  9, 23),  date(Y, 12, 20))),  #'autumn'
           (0, (date(Y, 12, 21),  date(Y, 12, 31)))]  #'winter'

def get_season(dt):
    dt = dt.date() #take the date part
    dt = dt.replace(year=Y) #map the year to the reference year 2000
    return next(season for season, (start, end) in seasons if start <= dt <= end)

df['tfa_season'] = np.array([get_season(x) for x in tfa])
df_tfa_season = pd.get_dummies(df.tfa_season, prefix = 'tfa_season') # one hot encoding 
df = pd.concat((df, df_tfa_season), axis = 1)
df.drop(['tfa_season'], axis = 1, inplace = True)

2. date_account_created
2.1 Convert date_account_created to datetime

dac = pd.to_datetime(df.date_account_created)

2.2 Extract features: year, month, day

# create year, month, day feature for dac

df['dac_year'] = np.array([x.year for x in dac])
df['dac_month'] = np.array([x.month for x in dac])
df['dac_day'] = np.array([x.day for x in dac])

2.3 Extract feature: weekday

# create features of weekday for dac

df['dac_wd'] = np.array([x.isoweekday() for x in dac])
df_dac_wd = pd.get_dummies(df.dac_wd, prefix = 'dac_wd')
df = pd.concat((df, df_dac_wd), axis = 1)
df.drop(['dac_wd'], axis = 1, inplace = True)

2.4 Extract feature: season

# create season features for dac

df['dac_season'] = np.array([get_season(x) for x in dac])
df_dac_season = pd.get_dummies(df.dac_season, prefix = 'dac_season')
df = pd.concat((df, df_dac_season), axis = 1)
df.drop(['dac_season'], axis = 1, inplace = True)

2.5 Extract feature: the difference between date_account_created and timestamp_first_active
- i.e. the time from a user's first activity on the Airbnb platform to formal registration

dt_span = dac.subtract(tfa).dt.days 
  • the ten most frequent dt_span values
dt_span.value_counts().head(10)


Analysis: the values concentrate at -1; we can infer that a user who registers on the same day as their first activity gets dt_span = -1
- Extract a categorical feature from the difference: one day, one month, one year, or other
- i.e. the time from first activity to registration is within a day, within a month, within a year, or longer

# create categorical feature: span == -1; -1 < span < 30; 30 <= span <= 365; span > 365
def get_span(dt):
    # dt is an integer
    if dt == -1:
        return 'OneDay'
    elif (dt < 30) & (dt > -1):
        return 'OneMonth'
    elif (dt >= 30) & (dt <= 365):
        return 'OneYear'
    else:
        return 'other'

df['dt_span'] = np.array([get_span(x) for x in dt_span])
df_dt_span = pd.get_dummies(df.dt_span, prefix = 'dt_span')
df = pd.concat((df, df_dt_span), axis = 1)
df.drop(['dt_span'], axis = 1, inplace = True)

2.6 Drop the original features
- After extracting features from timestamp_first_active and date_account_created, remove the originals from the feature list

df.drop(['date_account_created','timestamp_first_active'], axis = 1, inplace = True)

3. age
  • During data exploration we found that most ages fall in the (15, 90) range, but some lie in (1900, 2000); we suspect those users entered their birth year instead of their age, so we preprocess them

#get the age values
av = df.age.values
#these are birth years instead of ages (estimate age as 2014 - value; the data is from 2014)
av = np.where(np.logical_and(av<2000, av>1900), 2014-av, av) 
df['age'] = av

3.1 Bin the ages

# Age has many abnormal values that we need to deal with. 
age = df.age
age.fillna(-1, inplace = True) #fill missing values with -1
div = 15
def get_age(age):
    # age is a float; convert the continuous value to a discrete bucket
    if age < 0:
        return 'NA' #missing value
    elif (age < div):
        return div #ages below 15 map to 15
    elif (age <= div * 2):
        return div*2 #ages in (15, 30] map to 30
    elif (age <= div * 3):
        return div * 3
    elif (age <= div * 4):
        return div * 4
    elif (age <= div * 5):
        return div * 5
    elif (age <= 110):
        return div * 6
    else:
        return 'Unphysical' #implausible age
  • Put the binned ages back into the feature list as a new (one-hot encoded) feature
df['age'] = np.array([get_age(x) for x in age])
df_age = pd.get_dummies(df.age, prefix = 'age')
df = pd.concat((df, df_age), axis = 1)
df.drop(['age'], axis = 1, inplace = True)
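A few sanity checks on the binning function (illustrative assertions, not in the original):

assert get_age(-1) == 'NA'            # missing ages were filled with -1
assert get_age(10) == 15              # below 15 maps to the 15 bucket
assert get_age(28) == 30              # (15, 30] maps to 30
assert get_age(95) == 90              # (75, 110] maps to div * 6 = 90
assert get_age(2014) == 'Unphysical'  # anything above 110 is implausible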

4. Other features
- During data exploration we found that the remaining features have few distinct labels, so we do no further feature extraction and simply one-hot encode them

feat_toOHE = ['gender', 
             'signup_method', 
             'signup_flow', 
             'language', 
             'affiliate_channel', 
             'affiliate_provider', 
             'first_affiliate_tracked', 
             'signup_app', 
             'first_device_type', 
             'first_browser']
#one-hot encode the remaining features
for f in feat_toOHE:
    df_ohe = pd.get_dummies(df[f], prefix=f, dummy_na=True)
    df.drop([f], axis = 1, inplace = True)
    df = pd.concat((df, df_ohe), axis = 1)

4.3 Combine all extracted features

  • Merge the features extracted from the sessions file with those extracted from the train and test files

#merge in the features extracted from the sessions file
df_all = pd.merge(df, df_agg_sess, how='left')
df_all = df_all.drop(['id'], axis=1) #drop the id
df_all = df_all.fillna(-2)  #fill missing values for users without session data

#add a column counting how many negative (missing-marker) values each row has; this also serves as a feature
df_all['all_null'] = np.array([sum(r<0) for r in df_all.values]) 
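The row-wise Python loop can be slow on a frame of this size; a vectorized equivalent (a sketch with the same semantics, assuming every column is numeric at this point) is:

df_all['all_null'] = (df_all.values < 0).sum(axis=1)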

5. Model Building

5.1 Data preparation

1. Separate the data back into train and test
- train_row is the number of train rows recorded earlier

Xtrain = df_all.iloc[:train_row, :]
Xtest = df_all.iloc[train_row:, :]

2. Save the extracted features as csv files

Xtrain.to_csv("Airbnb_xtrain_v2.csv")
Xtest.to_csv("Airbnb_xtest_v2.csv")
#labels.tofile(): write array to a file as text or binary (default)
labels.tofile("Airbnb_ytrain_v2.csv", sep='\n', format='%s') #save the target variable
  • Read the feature files back
xtrain = pd.read_csv("Airbnb_xtrain_v2.csv",index_col=0)
ytrain = pd.read_csv("Airbnb_ytrain_v2.csv", header=None)
xtrain.head()

ytrain.head()


Analysis: after feature extraction, the feature file xtrain has expanded to 665 features; ytrain contains the training set's target variable
3. Label encode the target variable

le = LabelEncoder()
ytrain_le = le.fit_transform(ytrain.values.ravel()) #ravel() flattens the (n, 1) column into the 1-d array LabelEncoder expects
  • before label encoding:
    ['AU', 'CA', 'DE', 'ES', 'FR', 'GB', 'IT', 'NDF', 'NL', 'PT', 'US', 'other']
  • after label encoding:
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
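The fitted encoder keeps the mapping, so predictions can be decoded back to country codes; a small usage sketch:

print(le.classes_)                # the 12 destination labels in sorted order
print(le.inverse_transform([7]))  # ['NDF']: code 7 corresponds to 'NDF'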

4. Take 10% of the data for model training
- to reduce the time spent training models

# Let us take 10% of the data for faster training. 
n = int(xtrain.shape[0]*0.1)
xtrain_new = xtrain.iloc[:n, :]  #training data
ytrain_new = ytrain_le[:n]       #training labels

5. StandardScaling the dataset
- Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with zero mean and unit variance)

X_scaler = StandardScaler()
xtrain_new = X_scaler.fit_transform(xtrain_new)
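One caveat worth making explicit: the scaler is fit on the training subset only, so at prediction time the test features must be transformed with the same fitted scaler rather than refit (a sketch using the Xtest split from step 1; xtest_scaled is our name):

xtest_scaled = X_scaler.transform(Xtest)  # transform, not fit_transform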

5.2 Scoring metric: NDCG

  • NDCG is a metric for measuring ranking quality that takes the relevance of all elements into account
  • Since the target variable we predict is not binary, we use NDCG to score and compare the models
  • For binary targets, we would typically use f1 score, precision, recall, or AUC to score models
from sklearn.metrics import make_scorer

def dcg_score(y_true, y_score, k=5):

    """
    y_true : array, shape = [n_samples]
        Ground truth (true relevance labels).
    y_score : array, shape = [n_samples, n_classes]
        Predicted scores.
    k : int
    """
    order = np.argsort(y_score)[::-1] #sort the scores from high to low
    y_true = np.take(y_true, order[:k]) #keep the relevance labels of the top-k predictions

    gain = 2 ** y_true - 1   

    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gain / discounts)


def ndcg_score(ground_truth, predictions, k=5):

    """
    ground_truth : array, shape = [n_samples]
        True (encoded) labels.
    predictions : array, shape = [n_samples, n_classes]
        Predicted scores (e.g. class probabilities).
    k : int
    """
    lb = LabelBinarizer()
    lb.fit(range(len(predictions) + 1))
    T = lb.transform(ground_truth) #binarize the true labels, one row per sample
    scores = []
    # for each sample, compute actual DCG divided by the best possible DCG
    for y_true, y_score in zip(T, predictions):
        actual = dcg_score(y_true, y_score, k)
        best = dcg_score(y_true, y_true, k)
        scores.append(float(actual) / float(best))
    return np.mean(scores)
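A tiny worked example of dcg_score on a single one-hot relevance row (hypothetical scores):

y_true = np.array([0, 0, 1, 0, 0])  # the true class is index 2
print(dcg_score(y_true, np.array([0.1, 0.2, 0.5, 0.1, 0.1])))  # 1.0: true class ranked first
print(dcg_score(y_true, np.array([0.5, 0.2, 0.3, 0.1, 0.1])))  # ~0.63: ranked second, gain 1/log2(3)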