[R Machine Learning Notes] Gradient Boosted Regression Trees: the gbm package

The gbm package

The gbm package is the R implementation of gradient boosted regression trees (GBRT). GBRT, short for Gradient Boosting Regression Tree, is sometimes also called GBDT.

Wikipedia's definition of GBRT:

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

(Wikipedia, "Gradient boosting")

Using the gbm package in R

The basic model-fitting function:

gbm(formula = formula(data),
    distribution = "bernoulli",
    data = list(),
    weights,
    var.monotone = NULL,
    n.trees = 100,
    interaction.depth = 1,
    n.minobsinnode = 10,
    shrinkage = 0.001,
    bag.fraction = 0.5,
    train.fraction = 1.0,
    cv.folds = 0,
    keep.data = TRUE,
    verbose = "CV",
    class.stratify.cv = NULL,
    n.cores = NULL)

Arguments:

formula
a symbolic description of the model to be fit. The formula may include an offset term (e.g. y~offset(n)+x). If keep.data=FALSE in the initial call to gbm then it is the user’s responsibility to resupply the offset to gbm.more.
The usual model formula, e.g. y ~ x1 + x2, or y ~ . to use all predictors. An offset term can be included (e.g. y ~ offset(n) + x); if keep.data = FALSE, the user has to resupply that offset when calling gbm.more. A sketch follows.
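
A minimal sketch of the offset usage (the data frame mydata and its columns y, n, x are hypothetical):

fit <- gbm(y ~ offset(n) + x,      # the offset term goes inside the formula
           data = mydata,
           distribution = "gaussian",
           n.trees = 100)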
distribution
either a character string specifying the name of the distribution to use or a list with a component name specifying the distribution and any additional parameters needed. If not specified, gbm will try to guess: if the response has only 2 unique values, bernoulli is assumed; otherwise, if the response is a factor, multinomial is assumed; otherwise, if the response has class "Surv", coxph is assumed; otherwise, gaussian is assumed.
The name of the loss distribution goes here, either as a string or, when extra parameters are needed, as a list (sketches follow at the end of this entry).

Currently available options are "gaussian" (squared error), "laplace" (absolute loss), "tdist" (t-distribution loss), "bernoulli" (logistic regression for 0-1 outcomes), "huberized" (huberized hinge loss for 0-1 outcomes), "multinomial" (classification when there are more than 2 classes), "adaboost" (the AdaBoost exponential loss for 0-1 outcomes), "poisson" (count outcomes), "coxph" (right censored observations), "quantile", or "pairwise" (ranking measure using the LambdaMart algorithm).

If quantile regression is specified, distribution must be a list of the form list(name="quantile", alpha=0.25) where alpha is the quantile to estimate. The current version's quantile regression method does not handle non-constant weights and will stop.
If "tdist" is specified, the default degrees of freedom is 4 and this can be controlled by specifying distribution=list(name="tdist", df=DF) where DF is your chosen degrees of freedom.
If "pairwise" regression is specified, distribution must be a list of the form list(name="pairwise", group=..., metric=..., max.rank=...) (metric and max.rank are optional, see below). group is a character vector with the column names of data that jointly indicate the group an instance belongs to (typically a query in Information Retrieval applications). For training, only pairs of instances from the same group and with different target labels can be considered. metric is the IR measure to use, one of
conc:
Fraction of concordant pairs; for binary labels, this is equivalent to the Area under the ROC Curve
mrr:
Mean reciprocal rank of the highest-ranked positive instance
map:
Mean average precision, a generalization of mrr to multiple positive instances
ndcg:
Normalized discounted cumulative gain. The score is the weighted sum (DCG) of the user-supplied target values, weighted by log(rank+1), and normalized to the maximum achievable value. This is the default if the user did not specify a metric.
ndcg and conc allow arbitrary target values, while binary targets {0,1} are expected for map and mrr. For ndcg and mrr, a cut-off can be chosen using a positive integer parameter max.rank. If left unspecified, all ranks are taken into account.
Note that splitting of instances into training and validation sets follows group boundaries and therefore only approximates the specified train.fraction ratio (the same applies to cross-validation folds). Internally, queries are randomly shuffled before training, to avoid bias.
Weights can be used in conjunction with pairwise metrics, however it is assumed that they are constant for instances from the same group.
For details and background on the algorithm, see e.g. Burges (2010).
So far I have only used "gaussian" (squared error) and "laplace" (absolute loss) for regression, "adaboost" (the AdaBoost exponential loss for 0-1 outcomes) for classification, and "poisson" for count outcomes.
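
When the distribution needs extra parameters, the list form is used. Two sketches based on the documentation above (mydata is hypothetical):

# quantile regression for the 25th percentile
fit.q <- gbm(y ~ ., data = mydata,
             distribution = list(name = "quantile", alpha = 0.25))

# t-distribution loss with 6 degrees of freedom instead of the default 4
fit.t <- gbm(y ~ ., data = mydata,
             distribution = list(name = "tdist", df = 6))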
data
an optional data frame containing the variables in the model. By default the variables are taken from environment(formula), typically the environment from which gbm is called. If keep.data=TRUE in the initial call to gbm then gbm stores a copy with the object. If keep.data=FALSE then subsequent calls to gbm.more must resupply the same dataset. It becomes the user’s responsibility to resupply the same data at this point.
The data, as a data.frame. It seems the response cannot contain NA values, or fitting throws an error.
weights
an optional vector of weights to be used in the fitting process. Must be positive but do not need to be normalized. If keep.data=FALSE in the initial call to gbm then it is the user’s responsibility to resupply the weights to gbm.more.
Case weights for the fitting process; I have not needed this yet.
var.monotone
an optional vector, the same length as the number of predictors, indicating which variables have a monotone increasing (+1), decreasing (-1), or arbitrary (0) relationship with the outcome.
Declares an assumed monotone relationship between each predictor and the response; optional, and mainly useful for baking in domain knowledge about the direction of an effect.
n.trees
the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion.
The number of trees to fit, i.e. the number of boosting iterations. In practice, fit a generously large number first and then choose the best count afterwards (see gbm.perf below).
cv.folds
Number of cross-validation folds to perform. If cv.folds>1 then gbm, in addition to the usual fit, will perform a cross-validation, calculate an estimate of generalization error returned in cv.error.
The number of cross-validation folds; with cv.folds > 1 it can be used to extract the optimal number of trees.
interaction.depth
The maximum depth of variable interactions. 1 implies an additive model, 2 implies a model with up to 2-way interactions, etc.
The interaction depth of each tree: 1 gives an additive model (stumps), 2 allows up to two-way interactions, and so on.
n.minobsinnode
minimum number of observations in the trees terminal nodes. Note that this is the actual number of observations not the total weight.
The minimum number of observations in a terminal node.
shrinkage
a shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction.
The shrinkage, also called the learning rate; around 0.1 is a common starting point.
bag.fraction
the fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. If bag.fraction < 1 then running the same model twice will result in similar but different fits. gbm uses the R random number generator so set.seed can ensure that the model can be reconstructed. Preferably, the user can save the returned gbm.object using save.
The fraction of the training set subsampled to build each new tree. Values below 1 make the boosting stochastic, which usually improves generalization; fixing a seed makes the result reproducible, as sketched below.
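
A sketch of the reproducibility point above (Y and data as in the example further down):

set.seed(2023)                     # any fixed seed reproduces the subsampling
fit <- gbm(Y ~ ., data = data, distribution = "gaussian",
           n.trees = 1000, bag.fraction = 0.5)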
train.fraction
The first train.fraction * nrows(data) observations are used to fit the gbm and the remainder are used for computing out-of-sample estimates of the loss function.
The fraction of the data used to fit the model (the first train.fraction * nrow(data) rows); the remainder serves as a held-out test set.
keep.data
a logical variable indicating whether to keep the data and an index of the data stored with the object. Keeping the data and index makes subsequent calls to gbm.more faster at the cost of storing an extra copy of the dataset.
Usually TRUE, which makes later calls to gbm.more faster, as sketched below.
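
With keep.data = TRUE, more trees can be grown later without refitting from scratch. A sketch, assuming gbm1 is a fitted model as in the example below:

# continue the existing fit with 500 additional trees
gbm1 <- gbm.more(gbm1, n.new.trees = 500)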
verbose
If TRUE, gbm will print out progress and performance indicators. If this option is left unspecified for gbm.more then it uses verbose from object.
If TRUE, progress and performance indicators are printed during fitting.
n.cores
The number of CPU cores to use. The cross-validation loop will attempt to send different CV folds off to different cores. If n.cores is not specified by the user, it is guessed using the detectCores function in the parallel package. Note that the documentation for detectCores makes clear that it is not failsafe and could return a spurious number of available cores.
The number of CPU cores to use; the CV folds are distributed across them.

Choosing the optimal number of trees

gbm.perf(object, 
         plot.it = TRUE, 
         oobag.curve = FALSE, 
         overlay = TRUE, 
         method)

object
The fitted gbm model from the previous step.
method
method="OOB" computes the out-of-bag estimate.
method="test" uses the test (or validation) dataset to compute an out-of-sample estimate.
method="cv" extracts the optimal number of iterations using cross-validation, if gbm was called with cv.folds > 1.
"cv" is generally the most reliable choice; typical usage is sketched below.
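
Typical usage, capturing the returned iteration count (gbm1 as in the example below):

best.iter <- gbm.perf(gbm1, method = "cv")   # needs cv.folds > 1 in the gbm() call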

Prediction

predict(object,
        newdata,
        n.trees,
        type="link",
        single.tree=FALSE,
        ...)

object
The fitted model.
newdata
The new data, also as a data.frame; its variable names must match those used when the model was fit.
n.trees
Usually the optimal number of trees just obtained from gbm.perf.
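
Note that the default type = "link" returns predictions on the canonical scale (log odds for "bernoulli", the log scale for "poisson", and so on), while type = "response" transforms them back. A sketch, assuming a hypothetical fitted 0-1 classifier clf and test set test.data:

# probabilities rather than log odds for a bernoulli model
p <- predict(clf, newdata = test.data, n.trees = best.iter, type = "response")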

Example

# A least squares regression example
# create some data

N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N) 
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# fit initial model
gbm1 <-
gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
    data=data,                   # dataset
    var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                 # +1: monotone increase,
                                 #  0: no monotone restrictions
    distribution="gaussian",     # see the help for other choices
    n.trees=1000,                # number of trees
    shrinkage=0.05,              # shrinkage or learning rate,
                                 # 0.001 to 0.1 usually work
    interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,        # fraction of data for training,
                                 # first train.fraction*N used for training
    n.minobsinnode = 10,         # minimum total weight needed in each node
    cv.folds = 3,                # do 3-fold cross-validation
    keep.data=TRUE,              # keep a copy of the dataset with the object
    verbose=FALSE,               # don't print out progress
    n.cores=1)                   # use only a single core (detecting #cores is
                                 # error-prone, so avoided here)

# check performance using an out-of-bag estimator
# OOB underestimates the optimal number of iterations
best.iter <- gbm.perf(gbm1,method="OOB")
print(best.iter)

# check performance using a 50% heldout test set
best.iter <- gbm.perf(gbm1,method="test")
print(best.iter)

# check performance using 5-fold cross-validation
best.iter <- gbm.perf(gbm1,method="cv")
print(best.iter)

# plot the performance
# plot variable influence
summary(gbm1,n.trees=1)         # based on the first tree
summary(gbm1,n.trees=best.iter) # based on the estimated best number of trees

# compactly print the first and last trees for curiosity
print(pretty.gbm.tree(gbm1,1))
print(pretty.gbm.tree(gbm1,gbm1$n.trees))

# make some new data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE))
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N) 
mu <- c(-1,0,1,2)[as.numeric(X3)]

Y <- X1**1.5 + 2 * (X2**.5) + mu + rnorm(N,0,sigma)

data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# predict on the new data using "best" number of trees
# f.predict generally will be on the canonical scale (logit,log,etc.)
f.predict <- predict(gbm1,data2,best.iter)
# least squares error
print(sum((data2$Y-f.predict)^2))

# create marginal plots
# plot variable X1,X2,X3 after "best" iterations
par(mfrow=c(1,3))
plot(gbm1,1,best.iter)
plot(gbm1,2,best.iter)
plot(gbm1,3,best.iter)
par(mfrow=c(1,1))
# contour plot of variables 1 and 2 after "best" iterations
plot(gbm1,1:2,best.iter)
# lattice plot of variables 2 and 3
plot(gbm1,2:3,best.iter)
# lattice plot of variables 3 and 4
plot(gbm1,3:4,best.iter)

# 3-way plots
plot(gbm1,c(1,2,6),best.iter,cont=20)
plot(gbm1,1:3,best.iter)
plot(gbm1,2:4,best.iter)
plot(gbm1,3:5,best.iter)

The final plots are marginal (partial dependence) plots: for each variable, with the other variables held fixed, they draw a curve of how the model's predicted response changes as that variable varies over its range.
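
If the numbers behind a marginal plot are needed rather than the picture, plot.gbm can return the partial dependence grid instead of drawing it:

# partial dependence of the first predictor, returned as a data frame
pd <- plot(gbm1, i.var = 1, n.trees = best.iter, return.grid = TRUE)
head(pd)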