
Classification trees in R

DataCamp Learning


Building a simple decision tree

The loans dataset contains 11,312 randomly selected people who applied for and later received loans from Lending Club, a US-based peer-to-peer lending company.
You will use a decision tree to try to learn patterns in the outcome of these loans (either repaid or default) based on the requested loan amount and credit score at the time of application.
Then, see how the tree’s predictions differ for an applicant with good credit versus one with bad credit.

# Load the rpart package
library(rpart)
# Build a lending model predicting loan outcome from loan amount and credit score
loan_model <- rpart(outcome ~ loan_amount + credit_score, data = loans, method = "class", control = rpart.control(cp = 0))
# Make a prediction for someone with good credit
predict(loan_model, good_credit, type = "class")
# Make a prediction for someone with bad credit
predict(loan_model, bad_credit, type = "class")
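
Beyond hard class labels, predict() for rpart models can also return the underlying class probabilities (type = "prob"), which is handy if you want to rank applicants by estimated risk rather than just label them. A minimal sketch, reusing the good_credit applicant from above:

# Return class probabilities instead of a hard label: one column per outcome level
predict(loan_model, good_credit, type = "prob")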

Visualizing classification trees

Due to government rules to prevent illegal discrimination, lenders are required to explain why a loan application was rejected.
The structure of a classification tree can be depicted visually, which helps explain how the tree makes its decisions.

# Examine the loan_model object
loan_model
# Load the rpart.plot package
library(rpart.plot)
# Plot the loan_model with default settings
rpart.plot(loan_model)
# Plot the loan_model with customized settings
rpart.plot(loan_model, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE)
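
If you also want each node to report its fitted class probabilities and the share of training observations it contains, the extra argument of rpart.plot() can add those annotations. A hedged sketch (extra = 104 combines per-class probabilities with the percentage of observations):

# Annotate each node with class probabilities and percentage of observations
rpart.plot(loan_model, type = 3, extra = 104, fallen.leaves = TRUE)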

[Figure: the plotted loan_model decision tree]

Creating random test datasets

Before building a more sophisticated lending model, it is important to hold out a portion of the loan data so you can simulate how well the model will predict the outcomes of future loan applicants.
As depicted in the following image, you can use 75% of the observations for training and 25% for testing the model.

[Figure: the loans data split into a 75% training set and a 25% test set]
The sample() function can be used to generate a random sample of rows to include in the training set. Simply supply it with the total number of observations and the number needed for training.
Use the resulting vector of row IDs to subset the loans into training and testing datasets.

# Determine the number of rows for training
nrow(loans) * 0.75
# Create a random sample of row IDs
sample_rows <- sample(11312, 8484)
# Create the training dataset
loans_train <- loans[sample_rows, ]
# Create the test dataset
loans_test <- loans[-sample_rows, ]
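
Note that sample() draws a different random subset on each run, so the split, and therefore any accuracy you compute downstream, will vary between sessions unless you fix the random seed first. A minimal sketch of a reproducible split (the seed value 123 is arbitrary):

# Fix the RNG seed so the same 75/25 split can be recreated later
set.seed(123)
sample_rows <- sample(nrow(loans), floor(nrow(loans) * 0.75))
loans_train <- loans[sample_rows, ]
loans_test <- loans[-sample_rows, ]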

Building and evaluating a larger tree

Previously, you created a simple decision tree that used the applicant’s credit score and requested loan amount to predict the loan outcome.
Lending Club has additional information about the applicants, such as home ownership status, length of employment, loan purpose, and past bankruptcies, that may be useful for making more accurate predictions.
Using all of the available applicant data, build a more sophisticated lending model on the random training dataset created previously. Then, use this model to make predictions on the test dataset to estimate its performance on future loan applications.

# Grow a tree using all of the available applicant data
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))
# Make predictions on the test dataset
loans_test$pred <- predict(loan_model, loans_test, type = "class")
# Examine the confusion matrix
table(loans_test$pred, loans_test$outcome)
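
The diagonal of the confusion matrix counts the correct predictions, so overall accuracy falls out directly; a small sketch of both the matrix-based and the vectorized calculation:

# Accuracy from the confusion matrix: correct predictions over all test cases
conf <- table(loans_test$pred, loans_test$outcome)
sum(diag(conf)) / sum(conf)
# Equivalent one-liner
mean(loans_test$pred == loans_test$outcome)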

Preventing overgrown trees

The tree grown on the full set of applicant data became extremely large and complex, with hundreds of splits and leaf nodes containing only a handful of applicants. This tree would be almost impossible for a loan officer to interpret.
Using the pre-pruning methods for early stopping, you can prevent a tree from growing too large and complex. See how the rpart control options for maximum tree depth and minimum split count impact the resulting tree.

# Grow a tree with maxdepth of 6
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, maxdepth = 6))
# Compute the accuracy of the simpler tree
loans_test$pred <- predict(loan_model, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
# Grow a tree with minsplit of 500
loan_model2 <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, minsplit = 500))
# Compute the accuracy of the simpler tree
loans_test$pred2 <- predict(loan_model2, loans_test, type = "class")
mean(loans_test$pred2 == loans_test$outcome)
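
Both early-stopping rules can also be combined in a single rpart.control() call if you want to cap tree depth and require a minimum node size at the same time; a sketch under the same setup as above:

# Apply both pre-pruning rules at once
loan_model3 <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, maxdepth = 6, minsplit = 500))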

Creating a nicely pruned tree

Stopping a tree from growing all the way can lead it to ignore some aspects of the data or miss important trends it may have discovered later.
By using post-pruning, you can intentionally grow a large and complex tree first, then prune it back to a smaller, more efficient size.
In this exercise, you will have the opportunity to construct a visualization of the tree’s performance versus complexity, and use this information to prune the tree to an appropriate level.

# Grow an overly complex tree
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))
# Examine the complexity plot
plotcp(loan_model)
# Prune the tree
loan_model_pruned <- prune(loan_model, cp = 0.0014)
# Compute the accuracy of the pruned tree
loans_test$pred <- predict(loan_model_pruned, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
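
Rather than reading the cp value off the plot by eye (the 0.0014 above was chosen that way), you can select it programmatically from the model's cp table, for example by taking the value with the lowest cross-validated error. A hedged sketch using the cptable component that rpart stores on the fitted model:

# Print the cp table, then pick the cp with the lowest cross-validated error (xerror)
printcp(loan_model)
best_cp <- loan_model$cptable[which.min(loan_model$cptable[, "xerror"]), "CP"]
loan_model_pruned <- prune(loan_model, cp = best_cp)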

Building a random forest model

Even though a forest can contain hundreds of trees, growing a decision tree forest is perhaps even easier than creating a single highly tuned tree.
Using the randomForest package, build a random forest and see how it compares to the single trees you built previously.
Keep in mind that due to the random nature of the forest, the results may vary slightly each time you create the forest.

# Load the randomForest package
library(randomForest)
# Build a random forest model
loan_model <- randomForest(outcome ~ ., data = loans_train)
# Compute the accuracy of the random forest
loans_test$pred <- predict(loan_model, loans_test)
mean(loans_test$pred == loans_test$outcome)
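
Because each forest is grown from random bootstrap samples, fixing a seed makes a run repeatable, and the ntree argument controls how many trees are grown (500 is the package default). A minimal sketch:

# Reproducible forest with an explicit tree count
set.seed(123)
loan_model <- randomForest(outcome ~ ., data = loans_train, ntree = 500)
# Printing the model shows the out-of-bag error estimate
loan_model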

That's all, thanks.