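Abstract
This article predicts the first-booking destination of new Airbnb users and walks through the whole process, from data exploration through feature engineering to model building.
In particular:
1. Data exploration relies mainly on pandas, using common functions such as head(), value_counts(), describe(), isnull() and unique(), together with matplotlib plots, to understand and explore the data.
2. Feature engineering extracts year, month, day, season and weekday from dates; bins age; computes differences between related features; groups by user id to compute counts, means, standard deviations and similar statistics of several feature variables; and encodes the data with one-hot encoding and label encoding.
3. Model building relies mainly on the sklearn and xgboost packages and calls a range of models for prediction: Logistic Regression; tree models (DecisionTree, RandomForest, AdaBoost, Bagging, ExtraTree, GraBoost); SVM models (SVM-rbf, SVM-poly, SVM-linear); and xgboost. By varying the model parameters and the amount of training data and observing the NDCG score, we see how different models, parameters and data sizes affect the predictions.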

1. Background

About this dataset: in this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user’s first booking destination will be. All the users in this dataset are from the USA.

There are 12 possible outcomes of the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’. Please note that ‘NDF’ is different from ‘other’ because ‘other’ means there was a booking, but to a country not included in the list, while ‘NDF’ means there wasn’t a booking.

2. Data description

The dataset contains 6 csv files in total. The following are described here:
1. train_users_2.csv - the training set of users
2. test_users.csv - the test set of users
- id: user id
- date_account_created: the date of account creation
- timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
- date_first_booking: date of first booking
- gender
- age
- signup_method: how the user signed up
- signup_flow: the page a user came to sign up from
- language: international language preference
- affiliate_channel: what kind of paid marketing
- affiliate_provider: where the marketing is, e.g. google, craigslist, other
- first_affiliate_tracked: the first marketing the user interacted with before signing up
- signup_app: the app used to sign up
- first_device_type: type of the first device
- first_browser: type of the first browser
- country_destination: this is the target variable you are to predict
3. sessions.csv - web sessions log for users
- user_id: to be joined with the column ‘id’ in users table
- action: the user's action
- action_type: the type of action
- action_detail: details of the action
- device_type: type of device
- secs_elapsed: time spent, in seconds
4. sample_submission.csv - correct format for submitting your predictions
- Data download link:
Airbnb new user booking prediction dataset

3. Data exploration

Based on Jupyter notebook and Python 3

3.1 The train_users_2 and test_users files

Import packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
%matplotlib inline
import datetime
import os
import seaborn as sns  # data visualization
from datetime import date
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer
import pickle  # used to save models
from sklearn.metrics import *
from sklearn.model_selection import *

Read the files

train = pd.read_csv("train_users_2.csv")
test = pd.read_csv("test_users.csv")

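Inspect the features in the data

print('the columns name of training dataset:\n',train.columns)
print('the columns name of test dataset:\n',test.columns)

Analysis:
1. The train file has one more feature than the test file: country_destination
2. country_destination is the target variable to predict
3. Data exploration focuses on the train file; the test file is similar

Inspect the data summary

print(train.info())

Analysis:
1. The train file contains 213451 rows and 16 features
2. The output lists each feature's data type and non-null count
3. date_first_booking has many null values, so we can consider dropping it during feature extraction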
Feature analysis:
1. date_account_created

1.1 View the first few rows of date_account_created

print(train.date_account_created.head())

1.2 Count the values of date_account_created

print(train.date_account_created.value_counts().head())
print(train.date_account_created.value_counts().tail())

1.3 Get summary information for date_account_created

print(train.date_account_created.describe())

1.4 Observe user growth

dac_train = train.date_account_created.value_counts()
dac_test = test.date_account_created.value_counts()
# convert the dates to datetime
dac_train_date = pd.to_datetime(dac_train.index)
dac_test_date = pd.to_datetime(dac_test.index)
# number of days since the first sign-up date in the train set
dac_train_day = dac_train_date - dac_train_date.min()
dac_test_day = dac_test_date - dac_train_date.min()
# plot with matplotlib
plt.scatter(dac_train_day.days, dac_train.values, color = 'r', label = 'train dataset')
plt.scatter(dac_test_day.days, dac_test.values, color = 'b', label = 'test dataset')

plt.title("Accounts created vs day")
plt.xlabel("Days")
plt.ylabel("Accounts created")
plt.legend(loc = 'upper left')

Analysis:
1. x axis: number of days since the first sign-up date
2. y axis: number of users who signed up on that day
3. The number of sign-ups per day rises sharply over time

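2. timestamp_first_active
2.1 View the first few rows

print(train.timestamp_first_active.head())

2.2 Count how often each value occurs to check for duplicates

print(train.timestamp_first_active.value_counts().unique())

[1]
Analysis: the result [1] shows that every timestamp_first_active value occurs exactly once, so there are no duplicates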
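
The same check can be written more directly. A minimal sketch on a made-up Series (the values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical timestamps in the same YYYYMMDDHHMMSS integer format.
ts = pd.Series([20090319043255, 20090523174809, 20090609231247])

# value_counts() counts each value; if every count is 1 there are no duplicates.
counts = ts.value_counts()
no_duplicates = counts.unique().tolist() == [1]

# Series.is_unique answers the same question in one step.
same_answer = ts.is_unique
```

Both approaches agree; is_unique simply avoids building the intermediate counts.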

2.3 Convert the timestamps to datetime and get summary information

tfa_train_dt = train.timestamp_first_active.astype(str).apply(lambda x:  
                                                                    datetime.datetime(int(x[:4]),
                                                                                      int(x[4:6]), 
                                                                                      int(x[6:8]), 
                                                                                      int(x[8:10]), 
                                                                                      int(x[10:12]),
                                                                                      int(x[12:])))
print(tfa_train_dt.describe())
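
As an alternative to slicing the digits by hand, pandas can parse this layout with an explicit format string. A minimal sketch, assuming the values always have the 14-digit YYYYMMDDHHMMSS form (the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical sample in the YYYYMMDDHHMMSS integer format used by timestamp_first_active.
raw = pd.Series([20090319043255, 20140630235959])

# Parse the digits with an explicit format string instead of slicing by hand.
parsed = pd.to_datetime(raw.astype(str), format="%Y%m%d%H%M%S")
```

This is vectorized, so it tends to be faster than a Python-level apply on a large column.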

3. date_first_booking
Get summary information

print(train.date_first_booking.describe())
print(test.date_first_booking.describe())

Analysis:
1. date_first_booking has many missing values in the train file
2. date_first_booking is entirely missing in the test file
3. The feature date_first_booking can be dropped

4. age
4.1 Count the values

print(train.age.value_counts().head())

Analysis: user ages cluster around 30
4.2 Bar chart

# first split the ages into 4 groups: missing values, too small age, reasonable age, too large age
age_train =[train[train.age.isnull()].age.shape[0],
            train.query('age < 15').age.shape[0],
            train.query("age >= 15 & age <= 90").age.shape[0],
            train.query('age > 90').age.shape[0]]

age_test = [test[test.age.isnull()].age.shape[0],
            test.query('age < 15').age.shape[0],
            test.query("age >= 15 & age <= 90").age.shape[0],
            test.query('age > 90').age.shape[0]]

columns = ['Null', 'age < 15', 'age', 'age > 90']

# plot
fig, (ax1,ax2) = plt.subplots(1,2,sharex=True, sharey = True,figsize=(10,5))

# pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x = columns, y = age_train, ax = ax1)
sns.barplot(x = columns, y = age_test, ax = ax2)

ax1.set_title('training dataset')
ax2.set_title('test dataset')
ax1.set_ylabel('counts')

Analysis: abnormal ages are rare, and a fair number of values are missing

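5. Other features
- The remaining features in the train file have few distinct labels, so during feature engineering we can simply one-hot encode them

We use bar charts throughout to summarize them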

def feature_barplot(feature, df_train = train, df_test = test, figsize=(10,5), rot = 90, saveimg = False): 
    feat_train = df_train[feature].value_counts()
    feat_test = df_test[feature].value_counts()
    fig_feature, (axis1,axis2) = plt.subplots(1,2,sharex=True, sharey = True, figsize = figsize)
    # pass x and y by keyword; recent seaborn versions no longer accept them positionally
    sns.barplot(x = feat_train.index.values, y = feat_train.values, ax = axis1)
    sns.barplot(x = feat_test.index.values, y = feat_test.values, ax = axis2)
    axis1.set_xticklabels(axis1.xaxis.get_majorticklabels(), rotation = rot)
    axis2.set_xticklabels(axis2.xaxis.get_majorticklabels(), rotation = rot)
    axis1.set_title(feature + ' of training dataset')
    axis2.set_title(feature + ' of test dataset')
    axis1.set_ylabel('Counts')
    plt.tight_layout()
    if saveimg == True:
        figname = feature + ".png"
        fig_feature.savefig(figname, dpi = 75)

5.1 gender

feature_barplot('gender', saveimg = True)

5.2 signup_method

feature_barplot('signup_method')

5.3 signup_flow

feature_barplot('signup_flow')

5.4 language

feature_barplot('language')

5.5 affiliate_channel

feature_barplot('affiliate_channel')

5.6 first_affiliate_tracked

feature_barplot('first_affiliate_tracked')

5.7 signup_app

feature_barplot('signup_app')

5.8 first_device_type

feature_barplot('first_device_type')

5.9 first_browser

feature_barplot('first_browser')

3.2 The sessions file

Load the data and view the first 10 rows

df_sessions = pd.read_csv('sessions.csv')
df_sessions.head(10)

Rename user_id to id

# this makes the later merge with the users data easier
df_sessions['id'] = df_sessions['user_id']
df_sessions = df_sessions.drop(['user_id'],axis=1)  # axis=1 drops the column

Check the shape of the data

df_sessions.shape

(10567737, 6)
Analysis: the sessions file has 10567737 rows and 6 features

Check missing values

df_sessions.isnull().sum()

Analysis: action, action_type, action_detail and secs_elapsed have many missing values

Fill missing values

df_sessions.action = df_sessions.action.fillna('NAN')
df_sessions.action_type = df_sessions.action_type.fillna('NAN')
df_sessions.action_detail = df_sessions.action_detail.fillna('NAN')
df_sessions.isnull().sum()

Analysis:
1. After filling, action, action_type and action_detail have 0 missing values
2. secs_elapsed is filled later

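4. Feature extraction

With a basic understanding of the data, we move on to feature extraction

4.1 Feature extraction from the sessions file

1. action

df_sessions.action.head()

df_sessions.action.value_counts().min()

1
Analysis: counting the action values shows that users perform many kinds of action and that the rarest one occurs only once. Next, we group the infrequent actions into a single OTHER category

1.1 Relabel action values that occur fewer than 100 times as OTHER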

#Action values with low frequency are changed to 'OTHER'
act_freq = 100  #Threshold of frequency
act = dict(zip(*np.unique(df_sessions.action, return_counts=True)))
df_sessions.action = df_sessions.action.apply(lambda x: 'OTHER' if act[x] < act_freq else x)
# np.unique(df_sessions.action, return_counts=True) returns the distinct action values and their counts as arrays
# zip(*(a, b)) pairs the elements of a and b one to one and returns a zip object
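
The same relabelling can be sketched without a Python-level lambda, using value_counts and where. This is an illustrative alternative on hypothetical toy data, with the threshold lowered to 2 so the effect is visible:

```python
import pandas as pd

# Hypothetical toy data; the article uses a threshold of 100 on the real column.
actions = pd.Series(["show", "show", "index", "show", "rare_a", "index", "rare_b"])
threshold = 2

# Map every row to the frequency of its own action value.
freq = actions.map(actions.value_counts())

# Keep frequent actions, replace the rest with 'OTHER'.
grouped = actions.where(freq >= threshold, "OTHER")
```

On a column with millions of rows, the vectorized form usually runs considerably faster than apply with a dictionary lookup.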

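2. Refine the features action, action_detail, action_type, device_type and secs_elapsed
- First group each user's records by user id
- action: for each user, the total number of actions, the count of each action value, the number of distinct values, and the mean and standard deviation of those counts
- action_detail: for each user, the count of each action_detail value, the number of distinct values, and the mean and standard deviation of those counts
- action_type: for each user, the count of each action_type value, the number of distinct values, the mean and standard deviation of those counts, and the total time spent per action_type (log-transformed)
- device_type: for each user, the count of each device_type value, the number of distinct values, and the mean and standard deviation of those counts
- secs_elapsed: fill missing values with 0; for each user compute the log-transformed sum, mean, standard deviation and median of secs_elapsed, the log-transformed sum divided by the number of actions, and how many log-transformed secs_elapsed values fall into each integer bin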

# map each categorical value to a column position
f_act = df_sessions.action.value_counts().argsort()
f_act_detail = df_sessions.action_detail.value_counts().argsort()
f_act_type = df_sessions.action_type.value_counts().argsort()
f_dev_type = df_sessions.device_type.value_counts().argsort()

# group by id (a plain string key, so each group key is the id itself rather than a 1-tuple)
dgr_sess = df_sessions.groupby('id')
#Loop on dgr_sess to create all the features.
samples = []  # one feature row per user
ln = len(dgr_sess)  # number of groups, i.e. users with session data

for g in dgr_sess:  # iterate over each id's data
    gr = g[1]   # data frame that contains all the data for one groupby value, e.g. 'zzywmcn0jv'

    l = []  # temporary list holding this user's features

    #the id    for example:'zzywmcn0jv'
    l.append(g[0])  # put the id in the list

    # number of total actions
    l.append(len(gr))  # number of rows for this id

    # fill missing secs_elapsed values with 0, then take the durations
    sev = gr.secs_elapsed.fillna(0).values   #These values are used later.

    #action features
    #(how many times each value occurs, numb of unique values, mean and std)
    c_act = [0] * len(f_act)
    for i,v in enumerate(gr.action.values):  # i is the row position, v is the action value
        c_act[f_act[v]] += 1
    _, c_act_uqc = np.unique(gr.action.values, return_counts=True)
    # number of distinct action values, plus mean and std of their counts
    c_act += [len(c_act_uqc), np.mean(c_act_uqc), np.std(c_act_uqc)]
    l = l + c_act

    #action_detail features
    #(how many times each value occurs, numb of unique values, mean and std)
    c_act_detail = [0] * len(f_act_detail)
    for i,v in enumerate(gr.action_detail.values):
        c_act_detail[f_act_detail[v]] += 1
    _, c_act_det_uqc = np.unique(gr.action_detail.values, return_counts=True)
    c_act_detail += [len(c_act_det_uqc), np.mean(c_act_det_uqc), np.std(c_act_det_uqc)]
    l = l + c_act_detail

    #action_type features (click, etc.)
    #(how many times each value occurs, numb of unique values, mean and std
    #+ log of the sum of secs_elapsed for each value)
    l_act_type = [0] * len(f_act_type)
    c_act_type = [0] * len(f_act_type)
    for i,v in enumerate(gr.action_type.values):
        l_act_type[f_act_type[v]] += sev[i]  # total time spent per action type
        c_act_type[f_act_type[v]] += 1
    l_act_type = np.log(1 + np.array(l_act_type)).tolist()  # the totals vary widely, so take the log
    _, c_act_type_uqc = np.unique(gr.action_type.values, return_counts=True)
    c_act_type += [len(c_act_type_uqc), np.mean(c_act_type_uqc), np.std(c_act_type_uqc)]
    l = l + c_act_type + l_act_type

    #device_type features
    #(how many times each value occurs, numb of unique values, mean and std)
    c_dev_type  = [0] * len(f_dev_type)
    for i,v in enumerate(gr.device_type.values):
        c_dev_type[f_dev_type[v]] += 1
    # note: the number of unique values is appended here and again two lines below,
    # so it appears twice in the feature vector
    c_dev_type.append(len(np.unique(gr.device_type.values)))
    _, c_dev_type_uqc = np.unique(gr.device_type.values, return_counts=True)
    c_dev_type += [len(c_dev_type_uqc), np.mean(c_dev_type_uqc), np.std(c_dev_type_uqc)]
    l = l + c_dev_type

    #secs_elapsed features (time spent)
    l_secs = [0] * 5
    l_log = [0] * 15
    if len(sev) > 0:
        #Simple statistics about the secs_elapsed values.
        l_secs[0] = np.log(1 + np.sum(sev))
        l_secs[1] = np.log(1 + np.mean(sev))
        l_secs[2] = np.log(1 + np.std(sev))
        l_secs[3] = np.log(1 + np.median(sev))
        l_secs[4] = l_secs[0] / float(l[1])  # log of the total time divided by the number of actions

        #Values are grouped in 15 intervals. Compute the number of values
        #in each interval.
        #sev = gr.secs_elapsed.fillna(0).values
        log_sev = np.log(1 + sev).astype(int)
        #np.bincount():Count number of occurrences of each value in array of non-negative ints.
        l_log = np.bincount(log_sev, minlength=15).tolist()
    l = l + l_secs + l_log

    #The list l has the feature values of one sample.
    samples.append(l)

#preparing objects
samples = np.array(samples)
samp_ar = samples[:, 1:].astype(np.float16)  # all feature columns except the id
samp_id = samples[:, 0]   # the id is in the first column

# build a dataframe for the extracted features
col_names = []    #name of the columns
for i in range(len(samples[0])-1):  # minus 1 because the first entry is the id
    col_names.append('c_' + str(i))  # columns are named c_0, c_1, ...
df_agg_sess = pd.DataFrame(samp_ar, columns=col_names)
df_agg_sess['id'] = samp_id
df_agg_sess.index = df_agg_sess.id  # use the id as the index

df_agg_sess.head()

Analysis: after feature extraction, the sessions data goes from 6 features to 458 features
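
Part of the per-user statistics can also be expressed with groupby and agg. The following is only a sketch on a hypothetical miniature table, not a replacement for the full loop above; it covers the secs_elapsed count, sum, mean and median:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature sessions table with the same column names.
toy = pd.DataFrame({
    "id": ["u1", "u1", "u1", "u2"],
    "secs_elapsed": [10.0, np.nan, 30.0, 5.0],
})

# Fill missing durations with 0, as the loop does.
toy["secs_elapsed"] = toy["secs_elapsed"].fillna(0)

# Per-user count, sum, mean and median in one pass.
stats = toy.groupby("id")["secs_elapsed"].agg(["count", "sum", "mean", "median"])

# log(1 + x) compresses the wide range of durations.
stats["log_sum"] = np.log1p(stats["sum"])
```

The standard deviation is left out on purpose: pandas' std uses ddof=1 by default while np.std uses ddof=0, so the two would not match without passing ddof explicitly.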

4.2 Feature extraction from the train and test files

Record the number of rows in the train file and save the target variable we want to predict
- labels stores the target variable country_destination

train = pd.read_csv("train_users_2.csv")
test = pd.read_csv("test_users.csv")
# record the number of train rows so that train and test can be separated again later
train_row = train.shape[0]  

# The label we need to predict
labels = train['country_destination'].values

Drop date_first_booking, and drop country_destination from the train file
- Data exploration showed that date_first_booking has too many missing values in the train and test files, so we drop it
- We drop country_destination so that the model predicts it; comparing the predictions with the saved labels then tells us how good the model is

train.drop(['country_destination', 'date_first_booking'], axis = 1, inplace = True)
test.drop(['date_first_booking'], axis = 1, inplace = True)

Merge the train and test files
- This lets us apply the same feature extraction to both

# concatenate train and test
df = pd.concat([train, test], axis = 0, ignore_index = True)

1. timestamp_first_active
1.1 Convert to datetime

tfa = df.timestamp_first_active.astype(str).apply(lambda x: datetime.datetime(int(x[:4]),
                                                                          int(x[4:6]), 
                                                                          int(x[6:8]),
                                                                          int(x[8:10]),
                                                                          int(x[10:12]),
                                                                          int(x[12:])))

1.2 Extract features: year, month, day

# create tfa_year, tfa_month, tfa_day feature
df['tfa_year'] = np.array([x.year for x in tfa])
df['tfa_month'] = np.array([x.month for x in tfa])
df['tfa_day'] = np.array([x.day for x in tfa])

1.3 Extract feature: weekday
- One-hot encode the result

# isoweekday() returns the day of the week as an integer, from Monday: 1 to Sunday: 7
df['tfa_wd'] = np.array([x.isoweekday() for x in tfa]) 
df_tfa_wd = pd.get_dummies(df.tfa_wd, prefix = 'tfa_wd')  # one hot encoding 
df = pd.concat((df, df_tfa_wd), axis = 1)  # add the encoded tfa_wd features
df.drop(['tfa_wd'], axis = 1, inplace = True)  # drop the original unencoded feature

1.4 Extract feature: season
- The season depends only on the month and day, so the year is set to a common value

Y = 2000
seasons = [(0, (date(Y,  1,  1),  date(Y,  3, 20))),  #'winter'
           (1, (date(Y,  3, 21),  date(Y,  6, 20))),  #'spring'
           (2, (date(Y,  6, 21),  date(Y,  9, 22))),  #'summer'
           (3, (date(Y,  9, 23),  date(Y, 12, 20))),  #'autumn'
           (0, (date(Y, 12, 21),  date(Y, 12, 31)))]  #'winter'

def get_season(dt):
    dt = dt.date()  # keep only the date part
    dt = dt.replace(year=Y)  # set the year to 2000 for every date
    return next(season for season, (start, end) in seasons if start <= dt <= end)

df['tfa_season'] = np.array([get_season(x) for x in tfa])
df_tfa_season = pd.get_dummies(df.tfa_season, prefix = 'tfa_season') # one hot encoding 
df = pd.concat((df, df_tfa_season), axis = 1)
df.drop(['tfa_season'], axis = 1, inplace = True)

2. date_account_created
2.1 Convert date_account_created to datetime

dac = pd.to_datetime(df.date_account_created)

2.2 Extract features: year, month, day

# create year, month, day feature for dac

df['dac_year'] = np.array([x.year for x in dac])
df['dac_month'] = np.array([x.month for x in dac])
df['dac_day'] = np.array([x.day for x in dac])

2.3 Extract feature: weekday

# create features of weekday for dac

df['dac_wd'] = np.array([x.isoweekday() for x in dac])
df_dac_wd = pd.get_dummies(df.dac_wd, prefix = 'dac_wd')
df = pd.concat((df, df_dac_wd), axis = 1)
df.drop(['dac_wd'], axis = 1, inplace = True)

2.4 Extract feature: season

# create season features for dac

df['dac_season'] = np.array([get_season(x) for x in dac])
df_dac_season = pd.get_dummies(df.dac_season, prefix = 'dac_season')
df = pd.concat((df, df_dac_season), axis = 1)
df.drop(['dac_season'], axis = 1, inplace = True)

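2.5 Extract feature: the difference between date_account_created and timestamp_first_active
- That is, the time from a user's first activity on the Airbnb platform to formal sign-up

dt_span = dac.subtract(tfa).dt.days

The 10 most frequent values of dt_span

dt_span.value_counts().head(10)

Analysis: the values cluster at -1. Presumably dt_span is -1 when a user signs up on the same day as their first activity: date_account_created carries no time of day (it is read as midnight), so subtracting a later timestamp from the same day gives a negative fraction of a day, which .dt.days reports as -1
- Features extracted from the difference: one day, one month, one year, and other
- That is, whether the time from first activity to sign-up is one day, one month, one year, or something else

# create categorical feature: span == -1; -1 < span < 30; 30 <= span <= 365; other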
def get_span(dt):
    # dt is an integer
    if dt == -1:
        return 'OneDay'
    elif (dt < 30) & (dt > -1):
        return 'OneMonth'
    elif (dt >= 30) & (dt <= 365):
        return 'OneYear'
    else:
        return 'other'

df['dt_span'] = np.array([get_span(x) for x in dt_span])
df_dt_span = pd.get_dummies(df.dt_span, prefix = 'dt_span')
df = pd.concat((df, df_dt_span), axis = 1)
df.drop(['dt_span'], axis = 1, inplace = True)
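
Why a same-day sign-up yields -1 rather than 0 can be checked on a single made-up user. A minimal sketch (the timestamps are hypothetical):

```python
import pandas as pd

# Hypothetical user: first active at 04:32 and account created the same day.
tfa_example = pd.Series([pd.Timestamp("2014-01-01 04:32:55")])
dac_example = pd.Series(pd.to_datetime(["2014-01-01"]))  # date only, i.e. midnight

# Midnight minus a later time on the same day is a negative fraction of a day.
span = dac_example.subtract(tfa_example)

# The days component of a negative timedelta is floored, so the result is -1 rather than 0.
span_days = span.dt.days.iloc[0]
```

This is why the 'OneDay' category is keyed on -1 in get_span.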

2.6 Drop the original features
- After extracting features from timestamp_first_active and date_account_created, drop the original columns from the feature list

df.drop(['date_account_created','timestamp_first_active'], axis = 1, inplace = True)

3. age

# get the age values
av = df.age.values

During data exploration we found that most ages fall in the (15, 90) range, but some values fall in the (1900, 2000) range. We suspect these users entered their birth year instead of their age, so we preprocess them

# These are birth years instead of ages (estimate the age as 2014 - value)
# the data is from 2014, hence 2014 - value
av = np.where(np.logical_and(av<2000, av>1900), 2014-av, av) 
df['age'] = av

3.1 Bin the ages

# Age has many abnormal values that we need to deal with. 
age = df.age.fillna(-1)  # fill nulls with -1 (assign the result instead of filling in place on a column view)
div = 15
def get_age(age):
    # age is a float number; convert the continuous value to a discrete bin
    if age < 0:
        return 'NA'  # marks a missing value
    elif (age < div):
        return div  # ages below 15 map to 15
    elif (age <= div * 2):
        return div*2  # ages from 15 up to and including 30 map to 30
    elif (age <= div * 3):
        return div * 3
    elif (age <= div * 4):
        return div * 4
    elif (age <= div * 5):
        return div * 5
    elif (age <= 110):
        return div * 6
    else:
        return 'Unphysical'  # implausible age

Put the binned age into the feature list as a new feature

df['age'] = np.array([get_age(x) for x in age])
df_age = pd.get_dummies(df.age, prefix = 'age')
df = pd.concat((df, df_age), axis = 1)
df.drop(['age'], axis = 1, inplace = True)
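
The same binning can be sketched with pd.cut. This is an illustrative alternative under one stated assumption: ages are whole numbers, so "age < 15" is equivalent to "age <= 14" and right-closed bins reproduce get_age. The sample ages are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a missing value and an implausible one.
ages = pd.Series([np.nan, 12, 15, 30, 31, 89, 105, 150])

# Right-closed bins: (-inf, 14], (14, 30], (30, 45], (45, 60], (60, 75], (75, 110], (110, inf).
bins = [-np.inf, 14, 30, 45, 60, 75, 110, np.inf]
labels = ["15", "30", "45", "60", "75", "90", "Unphysical"]
binned = pd.cut(ages, bins=bins, labels=labels)

# Missing ages fall outside every bin; give them their own 'NA' category.
binned = binned.cat.add_categories("NA").fillna("NA")
```

The result is a categorical column that can be passed to pd.get_dummies in the same way as the output of get_age.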

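4. Other features
- Data exploration showed that the remaining features have few distinct labels, so we do no further feature extraction on them and only apply one-hot encoding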

feat_toOHE = ['gender', 
             'signup_method', 
             'signup_flow', 
             'language', 
             'affiliate_channel', 
             'affiliate_provider', 
             'first_affiliate_tracked', 
             'signup_app', 
             'first_device_type', 
             'first_browser']
# one-hot encode the remaining features
for f in feat_toOHE:
    df_ohe = pd.get_dummies(df[f], prefix=f, dummy_na=True)
    df.drop([f], axis = 1, inplace = True)
    df = pd.concat((df, df_ohe), axis = 1)
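
The effect of dummy_na=True in the loop above is easiest to see on a tiny column. A minimal sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
toy = pd.DataFrame({"first_affiliate_tracked": ["linked", np.nan, "omg", "linked"]})

# dummy_na=True adds an extra indicator column for missing values.
ohe = pd.get_dummies(toy["first_affiliate_tracked"],
                     prefix="first_affiliate_tracked", dummy_na=True)
```

Missing values thus become a category of their own instead of producing a row of all zeros.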

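4.3 Combine all extracted features

We merge the features extracted from the sessions file with those from the train and test files

# merge in the features extracted from the sessions data
# reset_index(drop=True) removes the index named 'id' so the merge key is unambiguous
df_all = pd.merge(df, df_agg_sess.reset_index(drop=True), on='id', how='left')
df_all = df_all.drop(['id'], axis=1)  # drop id
df_all = df_all.fillna(-2)  # users without session data get -2 in the session features

# add a column counting the negative entries in each row (the placeholders for missing values), used as one more feature
df_all['all_null'] = np.array([sum(r<0) for r in df_all.values])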

5. Model building

5.1 Data preparation

1. Separate the train and test data
- train_row is the number of train rows recorded earlier

Xtrain = df_all.iloc[:train_row, :]
Xtest = df_all.iloc[train_row:, :]

2. Write the extracted features to csv files

Xtrain.to_csv("Airbnb_xtrain_v2.csv")
Xtest.to_csv("Airbnb_xtest_v2.csv")
# labels.tofile(): Write array to a file as text or binary (default)
labels.tofile("Airbnb_ytrain_v2.csv", sep='\n', format='%s')  # store the target variable

Read the feature files

xtrain = pd.read_csv("Airbnb_xtrain_v2.csv",index_col=0)
ytrain = pd.read_csv("Airbnb_ytrain_v2.csv", header=None)

xtrain.head()

ytrain.head()

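Analysis: after feature extraction the feature file xtrain has grown to 665 features, and ytrain contains the target variable of the training set
3. Label-encode the target variable

le = LabelEncoder()
ytrain_le = le.fit_transform(ytrain.values.ravel())  # ravel() flattens the (n, 1) column into the 1-D array LabelEncoder expects

Before label encoding:
[‘AU’, ‘CA’, ‘DE’, ‘ES’, ‘FR’, ‘GB’, ‘IT’, ‘NDF’, ‘NL’, ‘PT’, ‘US’, ‘other’]
After label encoding:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]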
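
The mapping can be reproduced on a handful of labels, together with the reverse step that is needed later to turn predicted integers back into country codes. A minimal sketch with hypothetical labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical handful of target labels.
y = np.array(["US", "NDF", "other", "FR", "NDF"])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)  # classes are sorted, then numbered from 0

# inverse_transform maps integers back to the original country codes.
y_decoded = encoder.inverse_transform(y_encoded)
```

Because uppercase letters sort before lowercase ones, 'other' receives the largest code, which matches the ordering listed above.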

4. Take 10% of the data for model training
- This reduces the time spent training the models

# Let us take 10% of the data for faster training. 
n = int(xtrain.shape[0]*0.1)
xtrain_new = xtrain.iloc[:n, :]  # training features
ytrain_new = ytrain_le[:n]       # target variable for the training rows

5. StandardScaling the dataset
- Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance)

X_scaler = StandardScaler()
xtrain_new = X_scaler.fit_transform(xtrain_new)
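
When held-out data is scored later, the usual pattern is to reuse the statistics learned from the training rows rather than fitting a second scaler. A minimal sketch with hypothetical matrices (X_fit and X_new are illustrative names, not variables from this article):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices: 3 training rows and 1 held-out row.
X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

scaler = StandardScaler()
X_fit_scaled = scaler.fit_transform(X_fit)  # learn mean and std from the training rows only
X_new_scaled = scaler.transform(X_new)      # reuse those statistics on new rows
```

After scaling, each training column has mean 0 and unit variance, while new rows are expressed relative to the training distribution.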

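5.2 Scoring metric: NDCG

NDCG (Normalized Discounted Cumulative Gain) is a metric for ranking quality that takes the relevance of every ranked item into account
Because our target variable is not binary, we use NDCG to score the models and judge which is better
For a binary target we would usually score models with the f1 score, precision, recall or AUC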
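
In this task each user has exactly one true destination. If the true country is given relevance 1 and every other country relevance 0, the ideal DCG is 1, so NDCG@k reduces to 1 / log2(rank + 1) when the true country appears at position rank within the top k, and 0 otherwise. The sketch below illustrates that special case; ndcg_single_label is a hypothetical helper written for this illustration, not the article's function:

```python
import numpy as np

def ndcg_single_label(true_label, ranked_predictions, k=5):
    """NDCG@k when exactly one label is relevant (hypothetical helper)."""
    top_k = list(ranked_predictions)[:k]
    if true_label not in top_k:
        return 0.0
    rank = top_k.index(true_label) + 1   # 1-based position of the true label
    return 1.0 / np.log2(rank + 1)       # gain 2**1 - 1 = 1, ideal DCG = 1

first = ndcg_single_label("FR", ["FR", "US", "NDF", "other", "IT"])
second = ndcg_single_label("FR", ["US", "FR", "NDF", "other", "IT"])
missed = ndcg_single_label("FR", ["US", "NDF", "other", "IT", "GB"])
```

A correct first guess scores 1, a correct second guess scores 1 / log2(3), and a true country outside the top 5 scores 0, so the metric rewards placing the right country as high as possible.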

from sklearn.metrics import make_scorer

def dcg_score(y_true, y_score, k=5):

    """
    y_true : array, shape = [n_samples]
        Ground truth (true relevance labels).
    y_score : array, shape = [n_samples, n_classes]
        Predicted scores.
    k : int
        Rank cutoff: only the k highest-scored items are counted.
    """
    order = np.argsort(y_score)[::-1]  # indices sorted by score, highest first
    y_true = np.take(y_true, order[:k])  # true relevance of the k highest-scored items

    gain = 2 ** y_true - 1   

    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gain / discounts)


def ndcg_score(ground_truth, predictions, k= 
 
              
           
              
              
            