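Data Mining Project (1): Predicting Booking Outcomes for New Airbnb Users
Abstract
This article predicts the first booking destination of new Airbnb users and walks through the whole process, from data exploration to feature engineering to model building.
In particular:
1. Data exploration is based mainly on the pandas library. It uses common functions such as head(), value_counts(), describe(), isnull() and unique(), together with matplotlib plots, to understand and explore the data.
2. Feature engineering extracts the year, month, day, season and weekday from dates, bins the age, computes differences between related features, and groups records by user id to compute counts, means and standard deviations of several feature variables. Categorical values are then encoded with one hot encoding and label encoding.
3. Model building is based mainly on the sklearn and xgboost packages. Several models are called to make predictions: Logistic Regression; tree models (DecisionTree, RandomForest, AdaBoost, Bagging, ExtraTree, GraBoost); SVM models (SVM-rbf, SVM-poly, SVM-linear); and xgboost. The model parameters and the amount of training data are varied to observe the NDCG score, which shows how the choice of model, the parameters and the data size affect the prediction results.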
1. Background
About this dataset: In this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user’s first booking destination will be. All the users in this dataset are from the USA.
There are 12 possible outcomes of the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’. Please note that ‘NDF’ is different from ‘other’ because ‘other’ means there was a booking, but to a country not included in the list, while ‘NDF’ means there wasn’t a booking.
2. Data description
The dataset contains 6 csv files in total; the ones used in this article are described below.
1. train_users_2.csv - the training set of users
2. test_users.csv - the test set of users
- id: user id
- date_account_created: the date of account creation
- timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
- date_first_booking: date of first booking
- gender
- age
- signup_method: how the user signed up
- signup_flow: the page a user came to sign up from
- language: international language preference
- affiliate_channel: what kind of paid marketing
- affiliate_provider: where the marketing is, e.g. google, craigslist, other
- first_affiliate_tracked: the first marketing the user interacted with before signing up
- signup_app: the app used to sign up
- first_device_type: device type
- first_browser: browser type
- country_destination: this is the target variable you are to predict
3. sessions.csv - web sessions log for users
- user_id: to be joined with the column ‘id’ in users table
- action: user action
- action_type: type of the user action
- action_detail: detail of the user action
- device_type: device type
- secs_elapsed: time spent, in seconds
4. sample_submission.csv - correct format for submitting your predictions
- Data download:
Airbnb new user booking prediction dataset
3. Data exploration
- Based on Jupyter Notebook and Python 3
3.1 The train_users_2 and test_users files
Import packages
import datetime
import os
import pickle #used to save models
from datetime import date
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns #data visualization
import sklearn as sk
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import *
from sklearn.model_selection import *
%matplotlib inline
Read the files
train = pd.read_csv("train_users_2.csv")
test = pd.read_csv("test_users.csv")
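View the features contained in the data
print('the column names of the training dataset:\n',train.columns)
print('the column names of the test dataset:\n',test.columns)
Analysis:
1. The train file has one more feature than the test file: country_destination
2. country_destination is the target variable to predict
3. Data exploration focuses on the train file; the test file is similar
View the data info
print(train.info())
Analysis:
1. The train file contains 213451 rows and 16 features
2. The output lists the data type and non-null count of each feature
3. date_first_booking has many null values, so we can consider dropping it during feature extraction
Feature analysis:
1. date_account_created
1.1 View the first rows of date_account_created
print(train.date_account_created.head())
1.2 Count the values of date_account_created
print(train.date_account_created.value_counts().head())
print(train.date_account_created.value_counts().tail())
1.3 Get summary information for date_account_created
print(train.date_account_created.describe())
1.4 Observe user growth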
dac_train = train.date_account_created.value_counts()
dac_test = test.date_account_created.value_counts()
#convert the dates to datetime type
dac_train_date = pd.to_datetime(dac_train.index)
dac_test_date = pd.to_datetime(dac_test.index)
#compute the number of days since the first registration date in the training set
dac_train_day = dac_train_date - dac_train_date.min()
dac_test_day = dac_test_date - dac_train_date.min()
#plot with matplotlib
plt.scatter(dac_train_day.days, dac_train.values, color = 'r', label = 'train dataset')
plt.scatter(dac_test_day.days, dac_test.values, color = 'b', label = 'test dataset')
plt.title("Accounts created vs day")
plt.xlabel("Days")
plt.ylabel("Accounts created")
plt.legend(loc = 'upper left')
Analysis:
1. x axis: number of days since the first registration date
2. y axis: number of users who registered on that day
3. The number of registrations per day rises sharply over time
2. timestamp_first_active
2.1 View the first rows
print(train.timestamp_first_active.head())
2.2 Count the values to check for duplicates
print(train.timestamp_first_active.value_counts().unique())
[1]
Analysis: the result [1] means every value occurs exactly once, so timestamp_first_active has no duplicates
2.3 Convert the timestamp to a date and get summary information
tfa_train_dt = train.timestamp_first_active.astype(str).apply(lambda x:
    datetime.datetime(int(x[:4]),
                      int(x[4:6]),
                      int(x[6:8]),
                      int(x[8:10]),
                      int(x[10:12]),
                      int(x[12:])))
print(tfa_train_dt.describe())
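The per-row lambda above slices each 14-digit value by hand. As a minimal sketch, and assuming every value follows the YYYYMMDDHHMMSS layout, the same conversion can be done in one vectorized call. The toy values in `raw` are made up for illustration.

```python
import pandas as pd

# Toy values in the same YYYYMMDDHHMMSS layout as timestamp_first_active.
raw = pd.Series([20090319043255, 20140101000936])

# Parse the 14-digit integers in one vectorized call.
parsed = pd.to_datetime(raw.astype(str), format="%Y%m%d%H%M%S")
print(parsed)
```

On a column with a few hundred thousand rows this is typically much faster than `apply` with a Python lambda, and it raises an error on values that do not match the format.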
3. date_first_booking
Get summary information
print(train.date_first_booking.describe())
print(test.date_first_booking.describe())
Analysis:
1. date_first_booking has a large number of missing values in the train file
2. date_first_booking is entirely missing in the test file
3. The date_first_booking feature can be dropped
4. age
4.1 Count the values
print(train.age.value_counts().head())
Analysis: user ages are concentrated around 30
4.2 Bar chart
#First split age into 4 groups: missing values, too small age, reasonable age, too large age
age_train = [train[train.age.isnull()].age.shape[0],
             train.query('age < 15').age.shape[0],
             train.query("age >= 15 & age <= 90").age.shape[0],
             train.query('age > 90').age.shape[0]]
age_test = [test[test.age.isnull()].age.shape[0],
            test.query('age < 15').age.shape[0],
            test.query("age >= 15 & age <= 90").age.shape[0],
            test.query('age > 90').age.shape[0]]
columns = ['Null', 'age < 15', '15 <= age <= 90', 'age > 90']
# plot (x and y are passed by keyword, which recent seaborn versions require)
fig, (ax1,ax2) = plt.subplots(1,2,sharex=True, sharey = True,figsize=(10,5))
sns.barplot(x=columns, y=age_train, ax = ax1)
sns.barplot(x=columns, y=age_test, ax = ax2)
ax1.set_title('training dataset')
ax2.set_title('test dataset')
ax1.set_ylabel('counts')
Analysis: abnormal ages are rare, but a fair number of values are missing
5. Other features
- The other features in the train file have few distinct labels, so we can apply one hot encoding to them directly during feature engineering
A single bar-plot function is used for all of them
def feature_barplot(feature, df_train = train, df_test = test, figsize=(10,5), rot = 90, saveimg = False):
    #count each value of the feature in both datasets
    feat_train = df_train[feature].value_counts()
    feat_test = df_test[feature].value_counts()
    fig_feature, (axis1,axis2) = plt.subplots(1,2,sharex=True, sharey = True, figsize = figsize)
    sns.barplot(x=feat_train.index.values, y=feat_train.values, ax = axis1)
    sns.barplot(x=feat_test.index.values, y=feat_test.values, ax = axis2)
    #rotate the x tick labels of each axis
    axis1.tick_params(axis='x', labelrotation = rot)
    axis2.tick_params(axis='x', labelrotation = rot)
    axis1.set_title(feature + ' of training dataset')
    axis2.set_title(feature + ' of test dataset')
    axis1.set_ylabel('Counts')
    plt.tight_layout()
    if saveimg:
        figname = feature + ".png"
        fig_feature.savefig(figname, dpi = 75)
5.1 gender
feature_barplot('gender', saveimg = True)
5.2 signup_method
feature_barplot('signup_method')
5.3 signup_flow
feature_barplot('signup_flow')
5.4 language
feature_barplot('language')
5.5 affiliate_channel
feature_barplot('affiliate_channel')
5.6 first_affiliate_tracked
feature_barplot('first_affiliate_tracked')
5.7 signup_app
feature_barplot('signup_app')
5.8 first_device_type
feature_barplot('first_device_type')
5.9 first_browser
feature_barplot('first_browser')
3.2 The sessions file
Load the data and view the first 10 rows
df_sessions = pd.read_csv('sessions.csv')
df_sessions.head(10)
Rename user_id to id
#this prepares for the later merge with the users data
df_sessions['id'] = df_sessions['user_id']
df_sessions = df_sessions.drop(['user_id'],axis=1) #axis=1 drops the column
Check the shape of the data
df_sessions.shape
(10567737, 6)
Analysis: the sessions file has 10567737 rows and 6 features
Check missing values
df_sessions.isnull().sum()
Analysis: action, action_type, action_detail and secs_elapsed have many missing values
Fill missing values
df_sessions.action = df_sessions.action.fillna('NAN')
df_sessions.action_type = df_sessions.action_type.fillna('NAN')
df_sessions.action_detail = df_sessions.action_detail.fillna('NAN')
df_sessions.isnull().sum()
Analysis:
1. After filling, action, action_type and action_detail have no missing values
2. secs_elapsed is filled later, during feature extraction
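4. Feature extraction
- With a basic understanding of the data, we move on to feature extraction
4.1 Feature extraction from the sessions file
1. action
df_sessions.action.head()
df_sessions.action.value_counts().min()
1
Analysis: counting the action values shows that there are many kinds of user action and that the rarest occurs only once. Next, we group the actions that occur rarely into a single OTHER class
1.1 Relabel action values that occur fewer than 100 times as OTHER
#Action values with low frequency are changed to 'OTHER'
act_freq = 100 #Threshold of frequency
act = dict(zip(*np.unique(df_sessions.action, return_counts=True)))
df_sessions.action = df_sessions.action.apply(lambda x: 'OTHER' if act[x] < act_freq else x)
#np.unique(df_sessions.action, return_counts=True) returns the distinct action values and their counts as arrays
#zip(*(a,b)) pairs the elements of a and b one by one and returns a zip object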
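The same relabelling can be written without a Python-level lambda. The following is a minimal sketch on toy data, not the article's code: `min_count` is a hypothetical threshold chosen so the effect is visible on six rows, whereas the article uses 100.

```python
import pandas as pd

actions = pd.Series(["show", "show", "show", "search", "search", "rare_click"])
min_count = 2  # hypothetical threshold; the article uses 100

counts = actions.value_counts()
# Map each row to its action's frequency, then overwrite the infrequent ones.
grouped = actions.where(actions.map(counts) >= min_count, "OTHER")
print(grouped.tolist())
```

`Series.where` keeps a value where the condition holds and substitutes the second argument elsewhere, so only the rare action is replaced.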
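2. Refine the features action, action_detail, action_type, device_type and secs_elapsed
- First group the session rows by user id
- action: for each user, count how often each action value occurs, then add the number of distinct action values and the mean and standard deviation of their counts
- action_detail: for each user, count how often each action_detail value occurs, then add the number of distinct values and the mean and standard deviation of their counts
- action_type: for each user, count how often each action_type value occurs, then add the number of distinct values, the mean and standard deviation of their counts, and the total time spent per action_type (log-transformed)
- device_type: for each user, count how often each device_type value occurs, then add the number of distinct values and the mean and standard deviation of their counts
- secs_elapsed: fill missing values with 0, then compute for each user the sum, mean, standard deviation and median of secs_elapsed (all log-transformed), the log of the sum divided by the user's number of actions, and a histogram of the log-transformed secs_elapsed values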
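To make the per-user aggregation concrete, here is a minimal sketch on a toy frame. It is not the article's implementation, which builds the features in an explicit loop; `toy`, `action_counts`, `secs_stats` and the column prefixes are names made up for this illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": ["u1", "u1", "u1", "u2"],
    "action": ["show", "show", "search", "show"],
    "secs_elapsed": [10.0, np.nan, 30.0, 5.0],
})
toy["secs_elapsed"] = toy["secs_elapsed"].fillna(0)

# Per-user count of each action value (one column per action).
action_counts = pd.crosstab(toy["id"], toy["action"]).add_prefix("action_")

# Per-user statistics of secs_elapsed; log1p compresses the large range.
secs_stats = toy.groupby("id")["secs_elapsed"].agg(["sum", "mean", "median"])
secs_stats = np.log1p(secs_stats).add_prefix("log_secs_")

features = action_counts.join(secs_stats)
print(features)
```

Each row of `features` describes one user, which is the shape needed to join the session features back onto the users table by id.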
#Build a lookup from each categorical value to a column position
#(argsort returns a permutation, so every value gets a distinct position)
f_act = df_sessions.action.value_counts().argsort()
f_act_detail = df_sessions.action_detail.value_counts().argsort()
f_act_type = df_sessions.action_type.value_counts().argsort()
f_dev_type = df_sessions.device_type.value_counts().argsort()
#Group by id (a scalar key keeps each group key a plain string; recent pandas yields a 1-tuple for a list key)
dgr_sess = df_sessions.groupby('id')
#Loop on dgr_sess to create all the features.
samples = [] #list of per-user feature rows
ln = len(dgr_sess) #number of groups, i.e. distinct user ids
for g in dgr_sess: #iterate over the data of each id in dgr_sess
    gr = g[1] #data frame that contains all the data for one groupby value, e.g. 'zzywmcn0jv'
    l = [] #temporary list holding this user's features
    #the id, for example 'zzywmcn0jv'
    l.append(g[0]) #store the id first
    # number of total actions
    l.append(len(gr)) #number of session rows for this id
    #fill missing secs_elapsed values with 0, then take the raw values
    sev = gr.secs_elapsed.fillna(0).values #These values are used later.
    #action features
    #(how many times each value occurs, numb of unique values, mean and std)
    c_act = [0] * len(f_act)
    for i,v in enumerate(gr.action.values): #i is the row position, v is the action value
        c_act[f_act[v]] += 1
    _, c_act_uqc = np.unique(gr.action.values, return_counts=True)
    #number of distinct action values, and mean and std of their counts
    c_act += [len(c_act_uqc), np.mean(c_act_uqc), np.std(c_act_uqc)]
    l = l + c_act
    #action_detail features
    #(how many times each value occurs, numb of unique values, mean and std)
    c_act_detail = [0] * len(f_act_detail)
    for i,v in enumerate(gr.action_detail.values):
        c_act_detail[f_act_detail[v]] += 1
    _, c_act_det_uqc = np.unique(gr.action_detail.values, return_counts=True)
    c_act_detail += [len(c_act_det_uqc), np.mean(c_act_det_uqc), np.std(c_act_det_uqc)]
    l = l + c_act_detail
    #action_type features, e.g. click
    #(how many times each value occurs, numb of unique values, mean and std
    #+ log of the sum of secs_elapsed for each value)
    l_act_type = [0] * len(f_act_type)
    c_act_type = [0] * len(f_act_type)
    for i,v in enumerate(gr.action_type.values):
        l_act_type[f_act_type[v]] += sev[i] #total time spent per action type
        c_act_type[f_act_type[v]] += 1
    l_act_type = np.log(1 + np.array(l_act_type)).tolist() #totals vary widely, so take the log
    _, c_act_type_uqc = np.unique(gr.action_type.values, return_counts=True)
    c_act_type += [len(c_act_type_uqc), np.mean(c_act_type_uqc), np.std(c_act_type_uqc)]
    l = l + c_act_type + l_act_type
    #device_type features
    #(how many times each value occurs, numb of unique values, mean and std)
    c_dev_type = [0] * len(f_dev_type)
    for i,v in enumerate(gr.device_type.values):
        c_dev_type[f_dev_type[v]] += 1
    #note: the number of unique values is appended here and again two lines below
    c_dev_type.append(len(np.unique(gr.device_type.values)))
    _, c_dev_type_uqc = np.unique(gr.device_type.values, return_counts=True)
    c_dev_type += [len(c_dev_type_uqc), np.mean(c_dev_type_uqc), np.std(c_dev_type_uqc)]
    l = l + c_dev_type
    #secs_elapsed features
    l_secs = [0] * 5
    l_log = [0] * 15
    if len(sev) > 0:
        #Simple statistics about the secs_elapsed values.
        l_secs[0] = np.log(1 + np.sum(sev))
        l_secs[1] = np.log(1 + np.mean(sev))
        l_secs[2] = np.log(1 + np.std(sev))
        l_secs[3] = np.log(1 + np.median(sev))
        l_secs[4] = l_secs[0] / float(l[1]) #log of the total divided by the number of actions
        #Values are grouped in 15 intervals. Compute the number of values
        #in each interval.
        log_sev = np.log(1 + sev).astype(int)
        #np.bincount(): Count number of occurrences of each value in array of non-negative ints.
        l_log = np.bincount(log_sev, minlength=15).tolist()
    l = l + l_secs + l_log
    #The list l has the feature values of one sample.
    samples.append(l)
#preparing objects
samples = np.array(samples)
samp_ar = samples[:, 1:].astype(np.float16) #feature values, without the id
samp_id = samples[:, 0] #the id is in the first column
#build a dataframe for the extracted features
col_names = [] #name of the columns
for i in range(len(samples[0])-1): #minus 1 because the first entry is the id
    col_names.append('c_' + str(i)) #columns are named c_0, c_1, ...
df_agg_sess = pd.DataFrame(samp_ar, columns=col_names)
df_agg_sess['id'] = samp_id
df_agg_sess.index = df_agg_sess.id #use id as the index
df_agg_sess.head()
Analysis: after feature extraction, the sessions file goes from 6 features to 458 features
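4.2 Feature extraction from the train and test files
Record the number of rows in the train file and save the target variable
- labels stores the target variable we want to predict, country_destination
train = pd.read_csv("train_users_2.csv")
test = pd.read_csv("test_users.csv")
#record the number of rows in train, so that train and test can be separated again later
train_row = train.shape[0]
# The label we need to predict
labels = train['country_destination'].values
Drop date_first_booking, and drop country_destination from the train file
- Data exploration showed that date_first_booking has too many missing values in both the train and test files, so it is dropped
- country_destination is dropped from the features; the model predicts it, and the predictions are compared with the saved labels to judge the model
train.drop(['country_destination', 'date_first_booking'], axis = 1, inplace = True)
test.drop(['date_first_booking'], axis = 1, inplace = True)
Concatenate the train and test files
- This makes it easy to apply the same feature extraction to both
#concatenate train and test
df = pd.concat([train, test], axis = 0, ignore_index = True)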
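The reason for concatenating first is that encodings such as one hot encoding then produce the same columns for both parts. A minimal sketch with toy frames (all names and values here are illustrative, not taken from the dataset):

```python
import pandas as pd

toy_train = pd.DataFrame({"id": ["a", "b"], "gender": ["MALE", "FEMALE"]})
toy_test = pd.DataFrame({"id": ["c"], "gender": ["OTHER"]})
n_train = toy_train.shape[0]

combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)
encoded = pd.get_dummies(combined, columns=["gender"])

# Rows keep their order, so the first n_train rows are the training part.
train_part = encoded.iloc[:n_train]
test_part = encoded.iloc[n_train:]
print(train_part.columns.tolist())
```

Here the training part gets a `gender_OTHER` column even though that value only occurs in the test part, so both parts can be fed to the same model.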
1. timestamp_first_active
1.1 Convert to datetime type
tfa = df.timestamp_first_active.astype(str).apply(lambda x: datetime.datetime(int(x[:4]),
                                                                          int(x[4:6]),
                                                                          int(x[6:8]),
                                                                          int(x[8:10]),
                                                                          int(x[10:12]),
                                                                          int(x[12:])))
1.2 Extract features: year, month, day
# create tfa_year, tfa_month, tfa_day feature
df['tfa_year'] = np.array([x.year for x in tfa])
df['tfa_month'] = np.array([x.month for x in tfa])
df['tfa_day'] = np.array([x.day for x in tfa])
1.3 Extract feature: weekday
- The result is one hot encoded
#isoweekday() returns the day of the week as an integer, Monday: 1 through Sunday: 7
df['tfa_wd'] = np.array([x.isoweekday() for x in tfa])
df_tfa_wd = pd.get_dummies(df.tfa_wd, prefix = 'tfa_wd') # one hot encoding
df = pd.concat((df, df_tfa_wd), axis = 1) #add the encoded df['tfa_wd'] features
df.drop(['tfa_wd'], axis = 1, inplace = True) #drop the original unencoded feature
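The numbering used by `isoweekday()` is easy to confuse with that of `weekday()`. A quick check with two dates chosen for illustration:

```python
import datetime

monday = datetime.date(2014, 1, 6)
sunday = datetime.date(2014, 1, 5)

# isoweekday(): Monday=1 ... Sunday=7
print(monday.isoweekday(), sunday.isoweekday())
# weekday(): Monday=0 ... Sunday=6
print(monday.weekday(), sunday.weekday())
```

Since the values are one hot encoded afterwards, either numbering yields equivalent features; only the column suffixes differ.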
1.4 Extract feature: season
- The season depends only on the month and day, so every date is moved to the same year
Y = 2000
seasons = [(0, (date(Y, 1, 1), date(Y, 3, 20))), #'winter'
           (1, (date(Y, 3, 21), date(Y, 6, 20))), #'spring'
           (2, (date(Y, 6, 21), date(Y, 9, 22))), #'summer'
           (3, (date(Y, 9, 23), date(Y, 12, 20))), #'autumn'
           (0, (date(Y, 12, 21), date(Y, 12, 31)))] #'winter'
def get_season(dt):
    dt = dt.date() #keep only the date part
    dt = dt.replace(year=Y) #move every date to the year 2000
    return next(season for season, (start, end) in seasons if start <= dt <= end)
df['tfa_season'] = np.array([get_season(x) for x in tfa])
df_tfa_season = pd.get_dummies(df.tfa_season, prefix = 'tfa_season') # one hot encoding
df = pd.concat((df, df_tfa_season), axis = 1)
df.drop(['tfa_season'], axis = 1, inplace = True)
2. date_account_created
2.1 Convert date_account_created to datetime type
dac = pd.to_datetime(df.date_account_created)
2.2 Extract features: year, month, day
# create year, month, day feature for dac
df['dac_year'] = np.array([x.year for x in dac])
df['dac_month'] = np.array([x.month for x in dac])
df['dac_day'] = np.array([x.day for x in dac])
2.3 Extract feature: weekday
# create features of weekday for dac
df['dac_wd'] = np.array([x.isoweekday() for x in dac])
df_dac_wd = pd.get_dummies(df.dac_wd, prefix = 'dac_wd')
df = pd.concat((df, df_dac_wd), axis = 1)
df.drop(['dac_wd'], axis = 1, inplace = True)
2.4 Extract feature: season
# create season features for dac
df['dac_season'] = np.array([get_season(x) for x in dac])
df_dac_season = pd.get_dummies(df.dac_season, prefix = 'dac_season')
df = pd.concat((df, df_dac_season), axis = 1)
df.drop(['dac_season'], axis = 1, inplace = True)
2.5 Extract feature: the difference between date_account_created and timestamp_first_active
- i.e. the time from a user's first activity on Airbnb to formal registration
dt_span = dac.subtract(tfa).dt.days
- The 10 most frequent dt_span values
dt_span.value_counts().head(10)
Analysis: the values are concentrated at -1. date_account_created has no time of day and is parsed as midnight, while timestamp_first_active includes the time, so a user who registers on the day of first activity gets a small negative difference whose day component is -1
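Why a same-day registration gives -1 rather than 0 can be checked with plain `datetime` objects. The two timestamps below are made up for illustration:

```python
import datetime

account_created = datetime.datetime(2014, 1, 1)         # date only, i.e. midnight
first_active = datetime.datetime(2014, 1, 1, 0, 9, 36)  # same day, 00:09:36

delta = account_created - first_active
print(delta.days, delta.seconds)
```

A negative timedelta is normalized so that only the day component is negative: minus 576 seconds is stored as -1 day plus 85824 seconds, and `.days` therefore reports -1.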
- Derive a categorical feature from the difference: one day, one month, one year, or other
- i.e. whether the time from first activity to registration is within a day, a month, a year, or something else
# create categorical feature: span = -1; -1 < span < 30; 30 <= span <= 365; other
def get_span(dt):
    # dt is an integer
    if dt == -1:
        return 'OneDay'
    elif -1 < dt < 30:
        return 'OneMonth'
    elif 30 <= dt <= 365:
        return 'OneYear'
    else:
        return 'other'
df['dt_span'] = np.array([get_span(x) for x in dt_span])
df_dt_span = pd.get_dummies(df.dt_span, prefix = 'dt_span')
df = pd.concat((df, df_dt_span), axis = 1)
df.drop(['dt_span'], axis = 1, inplace = True)
2.6 Drop the original features
- After extracting features from timestamp_first_active and date_account_created, drop the original columns from the feature list
df.drop(['date_account_created','timestamp_first_active'], axis = 1, inplace = True)
3. age
#Get the age values
av = df.age.values
- During data exploration we found that most values lie in the range (15, 90), but some lie in the range (1900, 2000). We guess that these users entered their year of birth instead of their age, so we preprocess them
#These are birth years instead of ages (estimating age by doing 2014 - value)
#the data comes from 2014, so we use 2014 - value
av = np.where(np.logical_and(av<2000, av>1900), 2014-av, av)
df['age'] = av
3.1 Bin the age
# Age has many abnormal values that we need to deal with.
age = df.age.fillna(-1) #fill missing values with -1
div = 15
def get_age(age):
    # age is a float number; convert the continuous value to a discrete one
    if age < 0:
        return 'NA' #missing value
    elif (age < div):
        return div #ages below 15 are labelled 15
    elif (age <= div * 2):
        return div*2 #ages from 15 to 30 inclusive are labelled 30
    elif (age <= div * 3):
        return div * 3
    elif (age <= div * 4):
        return div * 4
    elif (age <= div * 5):
        return div * 5
    elif (age <= 110):
        return div * 6
    else:
        return 'Unphysical' #implausible age
- Add the binned age to the feature list as a new feature
df['age'] = np.array([get_age(x) for x in age])
df_age = pd.get_dummies(df.age, prefix = 'age')
df = pd.concat((df, df_age), axis = 1)
df.drop(['age'], axis = 1, inplace = True)
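A similar binning can be sketched with `pd.cut`. This is an alternative, not the article's method, and its boundaries differ slightly from `get_age`: with right-closed intervals an age of exactly 15 falls in the first bin. The `ages`, `edges` and `labels` values are illustrative.

```python
import numpy as np
import pandas as pd

ages = pd.Series([np.nan, 10, 15, 30, 31, 76, 105, 150])

# Right-closed bins labelled by upper edge; the last bin (75, 110] is labelled 90.
edges = [0, 15, 30, 45, 60, 75, 110]
labels = ["15", "30", "45", "60", "75", "90"]
binned = pd.cut(ages, bins=edges, labels=labels)  # (0, 15], (15, 30], ...

# pd.cut returns NaN for both missing and out-of-range ages; tell them apart.
binned = binned.astype(object)
binned[ages.isna()] = "NA"
binned[ages > 110] = "Unphysical"
print(binned.tolist())
```

The result can be passed to `pd.get_dummies` in the same way as the output of `get_age`.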
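4. Other features
- During data exploration we found that the remaining features have few distinct labels, so no further features are extracted from them; they are only one hot encoded
feat_toOHE = ['gender',
              'signup_method',
              'signup_flow',
              'language',
              'affiliate_channel',
              'affiliate_provider',
              'first_affiliate_tracked',
              'signup_app',
              'first_device_type',
              'first_browser']
#one hot encode the other features
for f in feat_toOHE:
    df_ohe = pd.get_dummies(df[f], prefix=f, dummy_na=True)
    df.drop([f], axis = 1, inplace = True)
    df = pd.concat((df, df_ohe), axis = 1)
4.3 Combine all extracted features
- We merge the features extracted from the sessions file with those from the train and test files
#merge in the features extracted from the sessions data
#(reset_index avoids 'id' being both the index name and a column of df_agg_sess, which recent pandas rejects as ambiguous)
df_all = pd.merge(df, df_agg_sess.reset_index(drop=True), on='id', how='left')
df_all = df_all.drop(['id'], axis=1) #drop id
df_all = df_all.fillna(-2) #users without session data get -2 in the session features
#add a column counting the negative (i.e. missing-value placeholder) entries in each row, used as one more feature
df_all['all_null'] = np.array([sum(r<0) for r in df_all.values])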
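The effect of the left merge followed by `fillna(-2)` can be seen on a toy example. This is a sketch with made-up frames; `users` and `sessions` stand in for `df` and `df_agg_sess`, and the count is vectorized instead of looping over rows.

```python
import pandas as pd

users = pd.DataFrame({"id": ["u1", "u2", "u3"], "age_30": [1, 0, 1]})
sessions = pd.DataFrame({"id": ["u1", "u3"], "c_0": [12.0, 3.0]})

# A left merge keeps every user; users without sessions get NaN in c_0.
merged = pd.merge(users, sessions, on="id", how="left")
merged = merged.drop(columns="id").fillna(-2)

# Count the negative entries per row, i.e. the placeholders for missing values.
merged["all_null"] = (merged < 0).sum(axis=1)
print(merged)
```

User `u2` has no session rows, so its `c_0` becomes -2 and its `all_null` count is 1.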
5. Model building
5.1 Data preparation
1. Separate the train and test data
- train_row is the number of train rows recorded earlier
Xtrain = df_all.iloc[:train_row, :]
Xtest = df_all.iloc[train_row:, :]
2. Write the extracted features to csv files
Xtrain.to_csv("Airbnb_xtrain_v2.csv")
Xtest.to_csv("Airbnb_xtest_v2.csv")
#labels.tofile(): Write array to a file as text or binary (default)
labels.tofile("Airbnb_ytrain_v2.csv", sep='\n', format='%s') #store the target variable
- Read the feature files
xtrain = pd.read_csv("Airbnb_xtrain_v2.csv",index_col=0)
ytrain = pd.read_csv("Airbnb_ytrain_v2.csv", header=None)
xtrain.head()
ytrain.head()
Analysis: after feature extraction, the feature file xtrain has grown to 665 features, and ytrain contains the target variable of the training set
3. Apply label encoding to the target variable
le = LabelEncoder()
ytrain_le = le.fit_transform(ytrain.values.ravel()) #ravel() passes a 1-D array, as LabelEncoder expects
- Before label encoding:
['AU', 'CA', 'DE', 'ES', 'FR', 'GB', 'IT', 'NDF', 'NL', 'PT', 'US', 'other']
- After label encoding:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
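A small sketch of how `LabelEncoder` assigns the integer codes and how to map predictions back to country codes. The `countries` array is toy data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

countries = np.array(["NDF", "US", "other", "FR", "US"])

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

print(encoder.classes_)                  # classes are stored in sorted order
print(codes)                             # position of each label in classes_
print(encoder.inverse_transform(codes))  # back to the original strings
```

Because the classes are sorted as strings, the lowercase 'other' comes after all uppercase country codes, which matches the ordering listed above.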
4. Take 10% of the data for model training
- This reduces the time spent training the models
# Let us take 10% of the data for faster training.
n = int(xtrain.shape[0]*0.1)
xtrain_new = xtrain.iloc[:n, :] #training data
ytrain_new = ytrain_le[:n] #target variable of the training data
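The slice above takes the first 10% of rows as they appear in the file. If the rows are ordered in some way, an alternative (not what the article does) is a stratified random subsample, which keeps the class proportions of the full data. A minimal sketch on toy arrays, with `X` and `y` made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)  # 80% class 0, 20% class 1

# Keep half of the rows while preserving the 80/20 class ratio.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0
)
print(len(y_small), int(y_small.sum()))
```

With `train_size=0.1` the same call would give a 10% subsample.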
5. StandardScaling the dataset
- Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance)
X_scaler = StandardScaler()
xtrain_new = X_scaler.fit_transform(xtrain_new)
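When held-out or test rows are scaled later, the usual practice is to reuse the statistics learned from the training rows rather than refit. A minimal sketch with toy arrays (`train_part` and `held_out` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_part = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
held_out = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_part)  # learns mean and std from train_part
held_out_scaled = scaler.transform(held_out)     # reuses them, no refitting

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(held_out_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, while the held-out row is expressed relative to the training statistics.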
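5.2 Scoring metric: NDCG
- NDCG (Normalized Discounted Cumulative Gain) is a metric for ranking quality that takes the relevance of every ranked item into account
- Because the target variable is multi-class rather than binary, we use NDCG to score the models and judge which is better
- For a binary target, we would usually score the model with f1 score, precision, recall or AUC score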
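For intuition, consider the special case in which exactly one country is correct for each user (relevance 1 for the true country, 0 for all others). The ideal DCG is then 1, and NDCG@k reduces to 1 / log2(rank + 1), where rank is the 1-based position of the true country among the top k predictions. The helper `ndcg_single_label` below is a hypothetical illustration of this special case, not the article's scoring function.

```python
import math

def ndcg_single_label(true_label, ranked_predictions, k=5):
    """NDCG@k when exactly one label is relevant (relevance 1, all others 0)."""
    for position, label in enumerate(ranked_predictions[:k]):
        if label == true_label:
            # DCG = (2**1 - 1) / log2(rank + 1) with rank = position + 1;
            # the ideal DCG is 1, so no further normalization is needed.
            return 1.0 / math.log2(position + 2)
    return 0.0  # true label not among the top k predictions

print(ndcg_single_label("FR", ["NDF", "US", "FR", "other", "IT"]))
```

A correct first guess scores 1.0, a correct third guess scores 0.5, and a true country outside the top k scores 0.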
from sklearn.metrics import make_scorer
def dcg_score(y_true, y_score, k=5):
    """
    y_true : array, shape = [n_samples]
        Ground truth (true relevance labels).
    y_score : array, shape = [n_samples, n_classes]
        Predicted scores.
    k : int
        Rank cutoff: only the top k scores are considered.
    """
    order = np.argsort(y_score)[::-1] #indices sorted by score, highest first
    y_true = np.take(y_true, order[:k]) #true relevance of the top k entries, positions [0, k)
    gain = 2 ** y_true - 1
    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gain / discounts)
def ndcg_score(ground_truth, predictions, k=