
Practical Lessons from Predicting Clicks on Ads at Facebook (GBDT + LR): A Hands-On Implementation

All the code for this post has been uploaded to GitHub; feel free to follow and star!

1. Two Ways to Use GBDT-Built Combined Features

Depending on which downstream model consumes them, there are two ways to use the combined features constructed by GBDT:

  1. GBDT + LR

    As in the original paper, GBDT builds the combined features, which are then one-hot encoded (the code in this practice also takes this approach);

  2. GBDT + FFM, or GBDT + another tree model

    In this case the combined features built by GBDT are not one-hot encoded; the leaf-node indices are used directly. If the GBDT features feed into another tree model, the index information can be used as-is; if they feed into an FFM, the indices are still used but must be reorganized into FFM's input format, as in the sketch below.
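As a concrete illustration, here is a minimal sketch (not from the original post) that converts a leaf-index matrix into libffm's "label field:feature:value" line format, treating each tree as a field; to_ffm_lines, leaf_idx, and labels are hypothetical names:

import numpy as np

def to_ffm_lines(leaf_idx, labels, num_leaves):
    # leaf_idx: (n_samples, n_trees) matrix from pred_leaf=True.
    # Each tree is one FFM field; each leaf becomes a binary feature.
    lines = []
    for row, y in zip(leaf_idx, labels):
        feats = []
        for field, leaf in enumerate(row):
            # Offset leaf ids so feature ids stay unique across trees
            feat_id = field * num_leaves + int(leaf)
            feats.append("{0}:{1}:1".format(field, feat_id))
        lines.append("{0} {1}".format(int(y), " ".join(feats)))
    return lines

# Example: 2 samples, 3 trees, at most 8 leaves per tree
print(to_ffm_lines(np.array([[3, 0, 7], [1, 5, 2]]), [1, 0], num_leaves=8))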

2. Two Ways to Extract the GBDT Combined Features

There are two main ways to extract the combined features from a trained GBDT:

  1. Set pred_leaf=True to obtain each sample's leaf index in every tree; see the XGBoost official documentation for the API:

It turns out this parameter lives in predict. After training on the raw features with some light tuning, calling new_feature = bst.predict(d_test, pred_leaf=True) on the training and test data yields an (nsample, ntrees) matrix: each sample's leaf index in every tree.
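A minimal sketch with the XGBoost native interface (the synthetic data and hyperparameters are illustrative assumptions):

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 5)
y = (np.random.rand(100) > 0.5).astype(int)
d_train = xgb.DMatrix(X, label=y)

params = {'objective': 'binary:logistic', 'max_depth': 3}
bst = xgb.train(params, d_train, num_boost_round=30)

# pred_leaf=True returns each sample's leaf index in every tree
leaf_idx = bst.predict(d_train, pred_leaf=True)
print(leaf_idx.shape)  # (100, 30), i.e. (nsample, ntrees)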

  2. Use apply() (note that when an LR follows, append [:, :, 0] to drop a dimension):


    The code in question uses the apply() method, which was puzzling at first: this method does not appear in the XGBoost native API. Checking the scikit-learn GBDT API confirms that apply() is available there and returns the leaf indices:

Because XGBoost ships both a native interface and a scikit-learn interface, the code differs between the two.

Note that apply() returns one more dimension (n_classes) than calling the native XGBoost predict directly:

This is why, when obtaining the GBDT combined features via apply() in GBDT + LR, the result is usually indexed with [:, :, 0]: it drops the n_classes dimension, as shown below.
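A quick shape check (a sketch, assuming a fitted scikit-learn GradientBoostingClassifier named grd):

leaves = grd.apply(X_train)  # shape: (n_samples, n_estimators, n_classes)
# For binary classification the last axis has length 1, so drop it:
leaves_2d = leaves[:, :, 0]  # shape: (n_samples, n_estimators)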

Example code:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.preprocessing import OneHotEncoder

# Assumed setup: X_train / X_train_lr / X_test and their labels come from
# earlier splits; hyperparameters are omitted for brevity.
grd = GradientBoostingClassifier(n_estimators=10)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()

# Train the GBDT on X_train; this model is later used to construct features
grd.fit(X_train, y_train)
# Fit the one-hot encoder on the GBDT leaf indices
grd_enc.fit(grd.apply(X_train)[:, :, 0])
# Build features with the trained GBDT, one-hot encode them,
# and feed them into the LR model as new features
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)
# Predict on X_test with the trained LR model
y_pred_grd_lm = grd_lm.predict_proba(grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
# Evaluate the predictions with a ROC curve
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)
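Note that the GBDT (grd) and the LR (grd_lm) are fitted on disjoint subsets, X_train versus X_train_lr. If the linear model were trained on the same samples the trees were grown on, it could latch onto leaf assignments that overfit the training data, which is why the training data is deliberately split.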

3. Code Practice

3.1 Introduction

For CTR prediction, we test how well the GBDT + LR approach works.

3.2 Datasets

Two datasets are suggested here. The first is the better fit for CTR; the second is decent too and was used earlier in the DeepFM post. In principle the first would be the better choice, but its archive is over 4 GB, so I use the second one. If you are interested, try the first dataset as well; sharing your results would be very welcome!

3.3 Kaggle CTR Competition

The dataset comes from the Kaggle 2014 Criteo Display Advertising Challenge. The winning solution followed the Facebook paper, using GBDT for feature transformation with an FFM on top.

Competition: https://www.kaggle.com/c/criteo-display-ad-challenge/data
Dataset download: http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/

Winning solution:
https://www.kaggle.com/c/criteo-display-ad-challenge/discussion/10555
PPT: https://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf

3.4 Kaggle Porto Seguro Competition

A prediction task on Kaggle:
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction

The dataset and a Jupyter-notebook version of the walkthrough are available on my GitHub.

The GBDT here is implemented with the LightGBM tree-ensemble framework; XGBoost or scikit-learn's built-in GBDT would work just as well.

Code:

import gc
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Load the data
path = "./data/"
train_file = "train.csv"
test_file = "test.csv"

trainDf = pd.read_csv(path + train_file)
# testDf = pd.read_csv(path + train_file, nrows=1000, skiprows=range(1, 10000))

# Keep all positives and downsample negatives to 20,000 (then shuffle below)
pos_trainDf = trainDf[trainDf['target'] == 1]
neg_trainDf = trainDf[trainDf['target'] == 0].sample(n=20000, random_state=2018)
trainDf = pd.concat([pos_trainDf, neg_trainDf], axis=0).sample(frac=1.0, random_state=2018)
del pos_trainDf
del neg_trainDf
gc.collect()

print(trainDf.shape, trainDf['target'].mean())

trainDf, testDf, _, _ = train_test_split(trainDf, trainDf['target'], test_size=0.25, random_state=2018)

print(trainDf['target'].mean(), trainDf.shape)
print(testDf['target'].mean(), testDf.shape)

"""
59 columns in total, including id and target;
17 bin features, 14 cat features, 26 continuous features.
Code:
columns = trainDf.columns.tolist()
bin_feats = []
cat_feats = []
con_feats = []
for col in  columns:
    if 'bin' in col:
        bin_feats.append(col)
        continue
    if 'cat' in col:
        cat_feats.append(col)
        continue
    if 'id' != col and 'target' != col:
        con_feats.append(col)

print(len(bin_feats), bin_feats)
print(len(cat_feats), cat_feats)
print(len(con_feats), con_feats)
"""
bin_feats = ['ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin',
             'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_calc_15_bin',
             'ps_calc_16_bin', 'ps_calc_17_bin', 'ps_calc_18_bin', 'ps_calc_19_bin', 'ps_calc_20_bin']
cat_feats = ['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat',
             'ps_car_04_cat', 'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat',
             'ps_car_10_cat', 'ps_car_11_cat']
con_feats = ['ps_ind_01', 'ps_ind_03', 'ps_ind_14', 'ps_ind_15', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_11',
             'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15', 'ps_calc_01', 'ps_calc_02', 'ps_calc_03', 'ps_calc_04',
             'ps_calc_05', 'ps_calc_06', 'ps_calc_07', 'ps_calc_08', 'ps_calc_09', 'ps_calc_10', 'ps_calc_11',
             'ps_calc_12', 'ps_calc_13', 'ps_calc_14']

# 2. Feature processing
trainDf = trainDf.fillna(0)
testDf = testDf.fillna(0)

train_sz = trainDf.shape[0]
combineDf = pd.concat([trainDf, testDf], axis=0)
del trainDf
del testDf
gc.collect()

# 2.1 Min-max scale all continuous features
from sklearn.preprocessing import MinMaxScaler

for col in con_feats:
    scaler = MinMaxScaler()
    combineDf[col] = scaler.fit_transform(combineDf[col].values.reshape(-1, 1))

# 2.2 One-hot encode the discrete (bin/cat) features
for col in bin_feats + cat_feats:
    onehotret = pd.get_dummies(combineDf[col], prefix=col)
    combineDf = pd.concat([combineDf, onehotret], axis=1)

# 3. Train the models
label = 'target'
onehot_feats = [col for col in combineDf.columns if col not in ['id', 'target'] + con_feats + cat_feats + bin_feats]
train = combineDf[:train_sz]
test = combineDf[train_sz:]
print("Train.shape: {0}, Test.shape: {0}".format(train.shape, test.shape))
del combineDf

# 3.1 LR model (raw features only)
lr_feats = con_feats + onehot_feats
lr = LogisticRegression(penalty='l2', C=1)
lr.fit(train[lr_feats], train[label].values)


def do_model_metric(y_true, y_pred, y_pred_prob):
    print("Predict 1 percent: {0}".format(np.mean(y_pred)))
    print("Label 1 percent: {0}".format(train[label].mean()))
    from sklearn.metrics import roc_auc_score, accuracy_score
    print("AUC: {0:.3}".format(roc_auc_score(y_true=y_true, y_score=y_pred_prob[:, 1])))
    print("Accuracy: {0}".format(accuracy_score(y_true=y_true, y_pred=y_pred)))


print("Train............")
do_model_metric(y_true=train[label], y_pred=lr.predict(train[lr_feats]), y_pred_prob=lr.predict_proba(train[lr_feats]))

print("\n\n")
print("Test.............")
do_model_metric(y_true=test[label], y_pred=lr.predict(test[lr_feats]), y_pred_prob=lr.predict_proba(test[lr_feats]))

# 3.2 GBDT
lgb_feats = con_feats + cat_feats + bin_feats
categorical_feature_list = cat_feats + bin_feats

import lightgbm as lgb

lgb_params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'metric': 'auc',
    'learning_rate': 0.01,
    'num_leaves': 5,
    'max_depth': 4,
    'min_data_in_leaf': 100,
    'bagging_fraction': 0.8,
    'feature_fraction': 0.8,
    'bagging_freq': 10,
    'lambda_l1': 0.2,
    'lambda_l2': 0.2,
    'scale_pos_weight': 1,
}

lgbtrain = lgb.Dataset(train[lgb_feats].values, label=train[label].values,
                       feature_name=lgb_feats,
                       categorical_feature=categorical_feature_list
                       )
lgbvalid = lgb.Dataset(test[lgb_feats].values, label=test[label].values,
                       feature_name=lgb_feats,
                       categorical_feature=categorical_feature_list
                       )

evals_results = {}
print('train')
lgb_model = lgb.train(lgb_params,
                      lgbtrain,
                      valid_sets=lgbvalid,
                      evals_result=evals_results,
                      num_boost_round=1000,
                      early_stopping_rounds=60,
                      verbose_eval=50,
                      categorical_feature=categorical_feature_list,
                      )

# 3.3 LR + GBDT
train_sz = train.shape[0]
combineDf = pd.concat([train, test], axis=0, ignore_index=True)

# Get the leaf indices (feature transformation)
gbdt_feats_vals = lgb_model.predict(combineDf[lgb_feats], pred_leaf=True)
gbdt_columns = ["gbdt_leaf_indices_" + str(i) for i in range(0, gbdt_feats_vals.shape[1])]

combineDf = pd.concat(
    [combineDf, pd.DataFrame(data=gbdt_feats_vals, index=range(0, gbdt_feats_vals.shape[0]), columns=gbdt_columns)],
    axis=1)

# One-hot encode the GBDT leaf-index features
origin_columns = combineDf.columns
for col in gbdt_columns:
    combineDf = pd.concat([combineDf, pd.get_dummies(combineDf[col], prefix=col)], axis=1)
gbdt_onehot_feats = [col for col in combineDf.columns if col not in origin_columns]

# Restore the train / test splits
train = combineDf[:train_sz]
test = combineDf[train_sz:]
del combineDf
gc.collect()

lr_gbdt_feats = lr_feats + gbdt_onehot_feats

lr_gbdt_model = LogisticRegression(penalty='l2', C=1)
lr_gbdt_model.fit(train[lr_gbdt_feats], train[label])

print("Train................")
do_model_metric(y_true=train[label], y_pred=lr_gbdt_model.predict(train[lr_gbdt_feats]),
                y_pred_prob=lr_gbdt_model.predict_proba(train[lr_gbdt_feats]))

print("Test..................")
do_model_metric(y_true=test[label], y_pred=lr_gbdt_model.predict(test[lr_gbdt_feats]),
                y_pred_prob=lr_gbdt_model.predict_proba(test[lr_gbdt_feats]))
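One caveat on the code above: the negatives were heavily downsampled (only 20,000 kept), so the predicted probabilities are biased upward even though ranking metrics such as AUC are unaffected. The Facebook paper recalibrates with q = p / (p + (1 - p) / w), where p is the prediction learned on downsampled data and w is the negative downsampling rate. A minimal sketch (w must be set to the rate actually used):

def recalibrate(p, w):
    # p: predicted probability from the model trained on downsampled data
    # w: negative downsampling rate, 0 < w <= 1
    return p / (p + (1.0 - p) / w)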


3.5 Generating GBDT Features with apply()

Code:

# coding: utf-8
from sklearn.model_selection import train_test_split
from sklearn import metrics
from xgboost.sklearn import XGBClassifier
import numpy as np


class XgboostFeature():
    ## XGBoost parameters can be passed in;
    ## n_estimators (the number of trees) is also the number of new features, default 30
    def __init__(self, n_estimators=30, learning_rate=0.3, max_depth=3, min_child_weight=1,
                 gamma=0.3, subsample=0.8, colsample_bytree=0.8, objective='binary:logistic',
                 nthread=4, scale_pos_weight=1, reg_alpha=1e-05, reg_lambda=1, seed=27):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.min_child_weight = min_child_weight
        self.gamma = gamma
        self.subsample = subsample
        self.colsample_bytree = colsample_bytree
        self.objective = objective
        self.nthread = nthread
        self.scale_pos_weight = scale_pos_weight
        self.reg_alpha = reg_alpha
        self.reg_lambda = reg_lambda
        self.seed = seed
        print('Xgboost Feature start, new_feature number:', n_estimators)

    def _build_clf(self):
        # Both fit variants use the same classifier configuration
        return XGBClassifier(
            learning_rate=self.learning_rate,
            n_estimators=self.n_estimators,
            max_depth=self.max_depth,
            min_child_weight=self.min_child_weight,
            gamma=self.gamma,
            subsample=self.subsample,
            colsample_bytree=self.colsample_bytree,
            objective=self.objective,
            nthread=self.nthread,
            scale_pos_weight=self.scale_pos_weight,
            reg_alpha=self.reg_alpha,
            reg_lambda=self.reg_lambda,
            seed=self.seed)

    def mergeToOne(self, X, X2):
        # Horizontally stack the raw features and the leaf-index features
        return np.hstack([np.asarray(X), np.asarray(X2)])

    ## Split training: one part fits the GBDT, the other is merged with the new features
    def fit_model_split(self, X_train, y_train, X_test, y_test):
        ## X_train_1 fits the model; X_train_2 is merged with the new features
        X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(
            X_train, y_train, test_size=0.6, random_state=0)
        clf = self._build_clf()
        clf.fit(X_train_1, y_train_1)
        y_pre = clf.predict(X_train_2)
        y_pro = clf.predict_proba(X_train_2)[:, 1]
        print("pred_leaf=T AUC Score: %f" % metrics.roc_auc_score(y_train_2, y_pro))
        print("pred_leaf=T Accuracy: %.4g" % metrics.accuracy_score(y_train_2, y_pre))
        new_feature = clf.apply(X_train_2)
        X_train_new2 = self.mergeToOne(X_train_2, new_feature)
        new_feature_test = clf.apply(X_test)
        X_test_new = self.mergeToOne(X_test, new_feature_test)
        print("Training set is now 40% smaller than before")
        return X_train_new2, y_train_2, X_test_new, y_test

    ## Train on the full training set
    def fit_model(self, X_train, y_train, X_test, y_test):
        clf = self._build_clf()
        clf.fit(X_train, y_train)
        y_pre = clf.predict(X_test)
        y_pro = clf.predict_proba(X_test)[:, 1]
        print("pred_leaf=T AUC Score: %f" % metrics.roc_auc_score(y_test, y_pro))
        print("pred_leaf=T Accuracy: %.4g" % metrics.accuracy_score(y_test, y_pre))
        new_feature = clf.apply(X_train)
        X_train_new = self.mergeToOne(X_train, new_feature)
        new_feature_test = clf.apply(X_test)
        X_test_new = self.mergeToOne(X_test, new_feature_test)
        print("Training set sample number remains the same")
        return X_train_new, y_train, X_test_new, y_test
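A minimal usage sketch for the class above (the toy dataset from make_classification is an assumption for illustration):

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

xf = XgboostFeature(n_estimators=30)
# Either fit on the full training set...
X_train_new, y_train_new, X_test_new, y_test_new = xf.fit_model(X_train, y_train, X_test, y_test)
# ...or split it so the GBDT and the downstream model see disjoint data:
# X_train_new, y_train_new, X_test_new, y_test_new = xf.fit_model_split(X_train, y_train, X_test, y_test)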

4. Templates

4.1 GBDT + LR Template

from scipy.sparse import hstack
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_svmlight_file
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
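The template body is truncated in the original post; below is a minimal sketch of the pipeline these imports suggest. The input path data.libsvm and all hyperparameters are hypothetical.

# Load sparse features in svmlight format (hypothetical path)
X, y = load_svmlight_file("data.libsvm")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 1. Fit the GBDT on the raw features
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3)
gbdt.fit(X_train.toarray(), y_train)

# 2. One-hot encode the leaf indices; ignore leaves unseen at fit time
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
leaves_train = gbdt.apply(X_train.toarray())[:, :, 0]  # (n_samples, n_trees)
leaves_test = gbdt.apply(X_test.toarray())[:, :, 0]
enc.fit(leaves_train)

# 3. Stack the raw sparse features with the one-hot leaf features, then fit the LR
X_train_ext = hstack([X_train, enc.transform(leaves_train)])
X_test_ext = hstack([X_test, enc.transform(leaves_test)])
lr = LogisticRegression()
lr.fit(X_train_ext, y_train)
print("Test accuracy:", lr.score(X_test_ext, y_test))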