Kaggle Case Studies in R: Study Notes (5)

阿新 • Published: 2019-01-20

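Drugstore Sales Forecasting

Outline of this case study:

1. Introduction to xgboost theory

2. Parameters of the xgboost functions in R

3. Case background

4. Data preprocessing

5. xgb model implementation in R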

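1. Introduction to xgboost theory

For this part I simply point to theory write-ups by more experienced authors. The blog posts linked below cover both the principles and the function parameters used in code: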

http://blog.csdn.net/a819825294/article/details/51206410

http://blog.csdn.net/sb19931201/article/details/52557382

http://blog.csdn.net/sb19931201/article/details/52577592

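2. Parameters of the xgboost functions in R

The parameters of the R xgboost package fall into three groups: general parameters, booster (model) parameters, and task parameters. General parameters select the type of booster, a tree model or a linear model; booster parameters depend on which booster was selected in the general parameters; task parameters depend on the learning scenario.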
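As a rough sketch of how the three groups come together in practice, the snippet below builds one named list per group and concatenates them into a single parameter list of the kind passed to xgb.train. The specific values (nthread = 2, eta = 0.1, and so on) are illustrative assumptions, not recommendations from this article.

```r
# One named list per parameter group; all values here are illustrative assumptions.
general_params <- list(booster = "gbtree", nthread = 2)
booster_params <- list(eta = 0.1, max_depth = 6,
                       subsample = 0.8, colsample_bytree = 0.8)
task_params    <- list(objective = "reg:linear", eval_metric = "rmse")

# c() on lists concatenates them into one flat named list.
params <- c(general_params, booster_params, task_params)
str(params)
```

Keeping the groups separate makes it easier to swap the booster (and its dependent parameters) without touching the task settings.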

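General parameters:

booster [default=gbtree]
Selects the base learner: gbtree (tree model) or gblinear (linear model).
silent [default=0]
Set to 1 to suppress run-time messages; leaving it at 0 is recommended.
nthread [defaults to the maximum number of threads available if not set]
Number of threads.
num_pbuffer [set automatically by xgboost, no need to be set by user]
Size of the prediction buffer.
num_feature [set automatically by xgboost, no need to be set by user]
Feature dimension.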

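Booster parameters:

(1) Parameters for the tree booster

eta [default=0.3]
Learning rate (step-size shrinkage); usually set to a smaller value.
range: [0,1]
gamma [default=0]
Minimum loss reduction required to split a node further. It controls whether a split is pruned; the larger the value, the more conservative the algorithm.
range: [0,∞]
max_depth [default=6]
Maximum depth of a tree.
range: [1,∞]
min_child_weight [default=1]
Minimum sum of h (the second-order derivative of the loss) required in each leaf. For 0-1 classification with imbalanced classes, if h is around 0.01, a min_child_weight of 1 means a leaf must contain at least about 100 samples. This parameter strongly affects the results: the smaller it is, the easier it is to overfit.
range: [0,∞]
max_delta_step [default=0]
Applies during the update step. A value of 0 means no constraint; a positive value makes the update step more conservative by capping how large each update can be, which smooths the updates.
range: [0,∞]
subsample [default=1]
Row subsampling ratio. Lower values make the algorithm more conservative and help prevent overfitting, but values that are too small cause underfitting.
range: (0,1]
colsample_bytree [default=1]
Column subsampling ratio applied to the features used when building each tree. Typically set to 0.5-1.
range: (0,1]
lambda [default=1]
L2 regularization on weights.
alpha [default=0]
L1 regularization on weights.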
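The min_child_weight arithmetic above can be checked with a few lines of base R. This is a minimal sketch assuming the logistic loss, for which the second-order derivative with respect to the raw score is h = p(1 - p); the helper names hessian_logistic and min_samples are hypothetical and not part of xgboost.

```r
# Second-order derivative of the logistic loss at predicted probability p.
hessian_logistic <- function(p) p * (1 - p)

# Smallest number of samples whose hessians sum to at least min_child_weight,
# assuming every sample in the leaf has the same hessian h.
min_samples <- function(min_child_weight, h) ceiling(min_child_weight / h)

h <- hessian_logistic(0.01)   # close to 0.01 when p is near 0.01
print(h)
print(min_samples(1, h))      # samples needed with the exact hessian
print(min_samples(1, 0.01))   # samples needed with the rounded value h = 0.01
```

With rare positives the per-sample hessian is tiny, so even the default min_child_weight of 1 already forces leaves to hold on the order of a hundred samples.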

(2) Parameters for the linear booster

lambda [default=0]
L2 regularization on weights.
alpha [default=0]
L1 regularization on weights.
lambda_bias [default=0]
L2 regularization on the bias term (there is no L1 regularization on the bias because it is not important).

Task parameters:

objective [default=reg:linear]
Defines the loss function to minimize. Common choices:
"reg:linear": linear regression
"reg:logistic": logistic regression
"binary:logistic": logistic regression for binary classification, outputs probability
"binary:logitraw": logistic regression for binary classification, outputs the score before the logistic transformation
"count:poisson": Poisson regression for count data, outputs the mean of the Poisson distribution; max_delta_step is set to 0.7 by default in Poisson regression (used to safeguard optimization)
"multi:softmax": multiclass classification using the softmax objective; you also need to set num_class (number of classes)
"multi:softprob": same as softmax, but outputs a vector of ndata * nclass, which can be reshaped to an ndata x nclass matrix. The result contains the predicted probability of each data point belonging to each class.
"rank:pairwise": ranking task that minimizes the pairwise loss
base_score [default=0.5]
The initial prediction score of all instances (global bias).
eval_metric [default according to objective]
Evaluation metric for validation data. A default is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking). Users can add multiple evaluation metrics; Python users should pass the metrics as a list of parameter pairs instead of a map, so that a later 'eval_metric' does not override an earlier one. The choices are:
"rmse": root mean square error
"logloss": negative log-likelihood
"error": binary classification error rate, calculated as #(wrong cases)/#(all cases). Instances with a prediction value larger than 0.5 are treated as positive, the others as negative.
"merror": multiclass classification error rate, calculated as #(wrong cases)/#(all cases)
"mlogloss": multiclass logloss
"auc": area under the curve, for ranking evaluation
"ndcg": normalized discounted cumulative gain
"map": mean average precision
"ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation.
"ndcg-", "map-", "ndcg@n-", "map@n-": in XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. Adding "-" to the metric name makes XGBoost evaluate these scores as 0, to be consistent under some conditions.
seed [default=0]
Random number seed.
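To make a few of the eval_metric definitions above concrete, here is a small base-R sketch that computes "rmse", "logloss", and "error" by hand. The vectors y and p are made-up toy data, and this only mirrors the listed definitions; it is not how xgboost computes them internally.

```r
# Toy data (an assumption for illustration): true 0/1 labels and predicted probabilities.
y <- c(1, 0, 1, 1, 0)
p <- c(0.9, 0.2, 0.4, 0.7, 0.6)

# "rmse": root mean square error
rmse <- sqrt(mean((y - p)^2))

# "logloss": negative log-likelihood, averaged over instances
logloss <- -mean(y * log(p) + (1 - y) * log(1 - p))

# "error": #(wrong cases) / #(all cases), with 0.5 as the positive threshold
error <- mean((p > 0.5) != (y == 1))

print(c(rmse = rmse, logloss = logloss, error = error))
```

Here two of the five predictions fall on the wrong side of 0.5 (the third and the fifth), so the error rate is 0.4.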

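These parameters are almost identical to the Python xgboost parameters listed by the author of the second link in the theory section above; in other words, the R xgboost package uses the same parameters as the Python one.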

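3. Case background

Rossmann operates 3,000 drugstores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their own circumstances, the accuracy of the results can vary considerably.

The fields in the data provided by Kaggle are as follows: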

Field	Description
Id	an Id that represents a (Store, Date) tuple within the test set
Store	a unique Id for each store
Sales	the turnover for any given day (this is the target being predicted)
Customers	the number of customers on a given day
Open	an indicator for whether the store was open: 0 = closed, 1 = open
StateHoliday	indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = none
SchoolHoliday	indicates whether the (Store, Date) was affected by the closure of public schools
StoreType	differentiates between 4 different store models: a, b, c, d
Assortment	describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance	distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year]	gives the approximate year and month when the nearest competitor was opened
Promo	indicates whether a store is running a promo on that day
Promo2	a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
Promo2Since[Year/Week]	describes the year and calendar week when the store started participating in Promo2
PromoInterval	describes the consecutive intervals at which Promo2 is started, naming the months the promotion starts anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, and November of any given year for that store

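4. Data preprocessing

Because this case study is mainly about the xgboost model, little data preprocessing or feature engineering is done. Only two things are handled. First, Kaggle ships the store attributes in a file separate from the training and test sets, so the store dataset has to be merged column-wise with the train and test datasets. Second, the data must be converted into the format xgboost requires; in the R xgboost package, xgb.DMatrix is the function that performs this conversion.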
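The two preprocessing steps can be sketched on toy data with base R alone. The data frames train_toy and store_toy below are invented stand-ins for the Kaggle files, and the final numeric matrix is the kind of object one would then hand to xgb.DMatrix (that call is left out here so the sketch runs without xgboost installed).

```r
# Toy stand-ins for train.csv and store.csv (values are made up).
train_toy <- data.frame(Store = c(1, 2, 1),
                        Sales = c(5263, 6064, 5020),
                        stringsAsFactors = FALSE)
store_toy <- data.frame(Store = c(1, 2),
                        StoreType = c("c", "a"),
                        stringsAsFactors = FALSE)

# Step 1: merge() joins on the column the two frames share, here Store.
merged <- merge(train_toy, store_toy)

# Step 2: encode text columns as integer codes against a fixed level set,
# then convert the feature columns to a numeric matrix.
lv <- sort(unique(merged$StoreType))
merged$StoreType <- as.integer(factor(merged$StoreType, levels = lv))
x <- data.matrix(merged[, c("Store", "StoreType")])
print(x)
```

Fixing the level set up front matters once there are two data sets: the same text value must map to the same integer in both train and test.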

5. Code implementation

Data download: https://www.kaggle.com/c/rossmann-store-sales/data

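library(xgboost)
library(lubridate)

# Adjust the paths to wherever the Kaggle files were downloaded.
# stringsAsFactors = FALSE keeps text columns as character so the encoding loop below applies.
train <- read.csv('D:/R語言kaggle案例實戰/Kaggle第五節課/train.csv', stringsAsFactors = FALSE)
test <- read.csv('D:/R語言kaggle案例實戰/Kaggle第五節課/test.csv', stringsAsFactors = FALSE)
store <- read.csv('D:/R語言kaggle案例實戰/Kaggle第五節課/store.csv', stringsAsFactors = FALSE)  # extra attributes of each store

train <- merge(train, store)  # join on the shared Store column
test <- merge(test, store)

train[is.na(train)] <- 0  # replace missing values with zero
test[is.na(test)] <- 0

train$Date <- as.Date(train$Date)  # convert date strings to Date
test$Date <- as.Date(test$Date)

# Keep only rows where the store was open and sales are non-zero.
train <- train[train$Open == 1 & train$Sales != 0, ]

# Extract month, year, and day for BOTH sets so their columns match.
train$month <- month(train$Date)
train$year <- year(train$Date)
train$day <- day(train$Date)
test$month <- month(test$Date)
test$year <- year(test$Date)
test$day <- day(test$Date)

# Drop the date column and the column with many missing values.
train$Date <- NULL; train$StateHoliday <- NULL
test$Date <- NULL; test$StateHoliday <- NULL

# Features: everything except the target (Sales) and Customers, which the test set lacks.
feature.names <- setdiff(names(train), c("Sales", "Customers"))

# Encode text columns as integers, using the same levels for train and test.
for (f in feature.names) {
  if (is.character(train[[f]])) {
    levels <- unique(c(train[[f]], test[[f]]))
    train[[f]] <- as.integer(factor(train[[f]], levels = levels))
    test[[f]] <- as.integer(factor(test[[f]], levels = levels))
  }
}
tra <- train[, feature.names]

# Custom evaluation function: Kaggle's official metric for this competition,
# root mean square percentage error (RMSPE), computed after undoing the log transform.
RMSPE <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  elab <- exp(as.numeric(labels)) - 1
  epreds <- exp(as.numeric(preds)) - 1
  err <- sqrt(mean((epreds / elab - 1)^2))
  list(metric = "RMSPE", value = err)
}

h <- sample(nrow(train), 10000)  # hold out 10,000 random rows for validation
dval <- xgb.DMatrix(data = data.matrix(tra[h, ]), label = log(train$Sales + 1)[h])
dtrain <- xgb.DMatrix(data = data.matrix(tra[-h, ]), label = log(train$Sales + 1)[-h])

# The watchlist reports model performance on each set at every round.
# Early stopping uses the last entry, so the validation set goes last.
watchlist <- list(train = dtrain, val = dval)

param <- list(objective = "reg:linear",  # named "reg:squarederror" in newer xgboost releases
              booster = "gbtree",
              eta = 0.02,
              max_depth = 12,
              subsample = 0.9,
              colsample_bytree = 0.7,
              num_parallel_tree = 2,
              alpha = 0.0001,
              lambda = 1)

clf <- xgb.train(params = param,
                 data = dtrain,
                 nrounds = 3000,
                 verbose = 0,
                 early_stopping_rounds = 100,  # older releases spelled this early.stop.round
                 watchlist = watchlist,
                 maximize = FALSE,
                 feval = RMSPE)

# predict() needs a numeric matrix with the training features; undo the log(Sales + 1) transform.
ptest <- exp(predict(clf, data.matrix(test[, feature.names]))) - 1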
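
To see what the custom evaluation function measures, the sketch below computes the same root mean square percentage error on three made-up sales figures, and checks that the log(Sales + 1) transform used for the labels is undone by exp(x) - 1. The helper rmspe and the vectors sales and pred are hypothetical and independent of xgboost.

```r
# Root mean square percentage error on the original sales scale.
rmspe <- function(actual, predicted) sqrt(mean((predicted / actual - 1)^2))

sales <- c(100, 200, 400)   # made-up actual sales
pred  <- c(110, 180, 400)   # made-up predictions: +10%, -10%, exact

score <- rmspe(sales, pred)
print(score)

# The model is trained on log(Sales + 1); exp(x) - 1 maps predictions back.
log_labels <- log(sales + 1)
recovered  <- exp(log_labels) - 1
print(all.equal(recovered, sales))
```

The percentage errors are 10%, -10%, and 0, so the score is sqrt(0.02 / 3), roughly 0.082. Because the error is relative, a 10-unit miss on a store selling 100 counts as much as a 20-unit miss on a store selling 200.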
