
Article Translation: Chapter 7, Recipes 4-6


4. Performing cross-validation with the caret package

The caret (classification and regression training) package contains many functions in regard to the training process for regression and classification problems. Similar to the e1071 package, it also contains a function to perform the k-fold cross-validation. In this recipe, we will demonstrate how to perform the k-fold cross-validation using the caret package.

Getting ready


In this recipe, we will continue to use the telecom churn dataset as the input data source to perform the k-fold cross-validation.

How to do it...


Perform the following steps to perform the k-fold cross-validation with the caret package:

1. First, set up the control parameter to train with the 10-fold cross-validation in 3 repetitions:

> control = trainControl(method="repeatedcv", number=10, repeats=3)

2. Then, you can train the classification model on the telecom churn data with rpart:

> model = train(churn~., data=trainset, method="rpart", preProcess="scale", trControl=control)

3. Finally, you can examine the output of the generated model:


> model

CART

2315 samples

16 predictor

2 classes: 'yes', 'no'

Pre-processing: scaled

Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 2084, 2083, 2082, 2084, 2083, 2084, ...

Resampling results across tuning parameters:

cp      Accuracy  Kappa  Accuracy SD  Kappa SD
0.0556  0.904     0.531  0.0236       0.155
0.0746  0.867     0.269  0.0153       0.153
0.0760  0.860     0.212  0.0107       0.141

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.05555556.

How it works...


In this recipe, we demonstrate how convenient it is to conduct the k-fold cross-validation using the caret package. In the first step, we set up the training control and select the option to perform the 10-fold cross-validation in three repetitions. The process of repeating the k-fold validation is called repeated k-fold validation, which is used to test the stability of the model. If the model is stable, one should get a similar test result. Then, we apply rpart on the training dataset with the option to scale the data and to train the model with the options configured in the previous step.
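As a sketch of how the tuned model might then be used (assuming, as elsewhere in this chapter, that a held-out split named testset exists alongside trainset), the object returned by train works directly with predict:

```r
# Sketch only: evaluate the caret model on a held-out test set.
# Assumes `model` from the step above and a `testset` split of the churn data.
library(caret)

# Predict class labels with the final fitted model.
predictions <- predict(model, newdata = testset)

# Compare the predicted labels against the true churn labels.
confusionMatrix(predictions, testset$churn)
```

This reuses the same preprocessing (scaling) that was applied during training, since train stores it with the model.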


See also

You can configure the resampling function in trainControl, in which you can specify boot, boot632, cv, repeatedcv, LOOCV, LGOCV, none, oob, adaptive_cv, adaptive_boot, or adaptive_LGOCV. To view more detailed information on how to choose the resampling method, view the trainControl document:

> ?trainControl
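All of the resampling schemes listed above are selected through the same method argument. As a brief sketch (using only standard trainControl parameters), a bootstrap or leave-one-out setup would look like this:

```r
library(caret)

# 25 bootstrap resamples instead of repeated k-fold cross-validation.
boot_control <- trainControl(method = "boot", number = 25)

# Leave-one-out cross-validation; accurate but slow on larger datasets.
loocv_control <- trainControl(method = "LOOCV")

# Either control object can then be passed to train() via trControl,
# exactly as in the recipe above.
```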


5. Ranking the variable importance with the rminer package

Besides using the caret package to generate variable importance, you can use the rminer package to generate the variable importance of a classification model. In the following recipe, we will illustrate how to use rminer to obtain the variable importance of a fitted model.

Getting ready

In this recipe, we will continue to use the telecom churn dataset as the input data source to rank the variable importance.


How to do it...

Perform the following steps to rank the variable importance with rminer:

1. Install and load the package, rminer:

> install.packages("rminer")
> library(rminer)

2. Fit the svm model with the training set:

> model = fit(churn~., trainset, model="svm")

3. Use the Importance function to obtain the variable importance:

> VariableImportance = Importance(model, trainset, method="sensv")

4. Plot the variable importance ranked by the variance:

> L = list(runs=1, sen=t(VariableImportance$imp), sresponses=VariableImportance$sresponses)
> mgraph(L, graph="IMP", leg=names(trainset), col="gray", Grid=10)

Figure 2: The visualization of variable importance using the rminer package

How it works...

Similar to the caret package, the rminer package can also generate the variable importance of a classification model. In this recipe, we first train the svm model on the training dataset, trainset, with the fit function. Then, we use the Importance function to rank the variable importance with a sensitivity measure. Finally, we use mgraph to plot the rank of the variable importance.

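Since the recipe notes that caret can also rank variable importance, a minimal comparison sketch would be the following (assuming a model trained with caret's train function, as in the previous recipe, is available as model):

```r
library(caret)

# varImp extracts variable importance from a model trained with train();
# scale = TRUE rescales the scores to a 0-100 range for easy comparison.
importance <- varImp(model, scale = TRUE)
print(importance)
plot(importance)
```

The two packages use different importance measures (caret's is model-specific, while rminer's sensv is sensitivity-based), so the rankings may not agree exactly.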

6. Finding highly correlated features with the caret package

When performing regression or classification, some models perform better if highly correlated attributes are removed. The caret package provides the findCorrelation function, which can be used to find attributes that are highly correlated to each other. In this recipe, we will demonstrate how to find highly correlated features using the caret package.


How to do it...

Perform the following steps to find highly correlated attributes:

1. Remove the features that are not coded in numeric characters:

> new_train = trainset[, !names(churnTrain) %in% c("churn", "international_plan", "voice_mail_plan")]

2. Then, you can obtain the correlation of each attribute:

> cor_mat = cor(new_train)

3. Next, we use findCorrelation to search for highly correlated attributes with a cut off equal to 0.75:

> highlyCorrelated = findCorrelation(cor_mat, cutoff=0.75)

4. We then obtain the names of the highly correlated attributes:

> names(new_train)[highlyCorrelated]
[1] "total_intl_minutes" "total_day_charge" "total_eve_minutes" "total_night_minutes"

How it works...

In this recipe, we search for highly correlated attributes using the caret package. In order to retrieve the correlation of each attribute, one should first remove nonnumeric attributes. Then, we perform correlation to obtain a correlation matrix. Next, we use findCorrelation to find highly correlated attributes with the cut off set to 0.75. We finally obtain the names of the highly correlated attributes.

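A natural next step, sketched here using the recipe's own variable names, is to drop the flagged columns before retraining a model:

```r
# findCorrelation returns column indices, so the highly correlated
# attributes can be removed from the numeric training data directly.
reduced_train <- new_train[, -highlyCorrelated]

# The reduced set keeps only the less-correlated predictors;
# the difference in column counts shows how many were dropped.
ncol(new_train) - ncol(reduced_train)
```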

--------- Translation excerpted from Baidu Translate

李明玥
