Team 18: Ben Zhang, Maggie Li, Mikey Pedersen, Alex Jacques
8.3.3 Bagging and Random Forests
Bagging
We apply bagging and random forests to the Boston data:
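The output below can be reproduced with something like the following (a minimal sketch; the original code chunk is not echoed in these notes):
library(MASS)   # contains the Boston housing data set
head(Boston)    # first six rows of the predictors and the response medv
dim(Boston)     # 506 observations on 14 variables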
crim zn indus chas nox rm age dis rad tax ptratio black lstat
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7
[1] 506 14
We first create a training set and a test set.
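A sketch of how the split might be created (the seed and the 50/50 split are assumptions chosen to match the ISLR lab; boston.test matches the name used in the code further below):
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)   # indices of the training observations
boston.test <- Boston[-train, "medv"]             # test-set response values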
Bagging is simply a special case of a random forest with m = p. Therefore, the randomForest() function can be used to perform both random forests and bagging.
Setting mtry = 13 means that all 13 predictors are considered at each split, which is exactly bagging.
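The fit summarized below was presumably produced by a call along these lines (a sketch; the seed is an assumption, and the object name bag.boston matches the name used later in these notes):
library(randomForest)
set.seed(1)
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, importance = TRUE)
bag.boston   # printing the fitted object gives the summary shown below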
Call:
randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE, subset = train)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 13
Mean of squared residuals: 11.33119
% Var explained: 85.26
What does importance mean here?
When importance = TRUE, variable importance measures are computed for each predictor. For example, the first column returned by importance() (%IncMSE) reports the mean decrease in prediction accuracy on the out-of-bag samples when that predictor's values are permuted.
Bagging typically results in improved accuracy over prediction using a single tree. Unfortunately, however, it can be difficult to interpret the resulting model. Although the collection of bagged trees is much more difficult to interpret than a single tree, one can obtain an overall summary of the importance of each predictor using the RSS (for bagging regression trees) or the Gini index (for bagging classification trees).
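The table below is the output of importance() on the fitted forest (a sketch, assuming the object is named bag.boston as above):
importance(bag.boston)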
%IncMSE IncNodePurity
crim 3.44038649 813.209885
zn 0.41586673 64.533390
indus 0.11053141 103.424195
chas -0.06224478 9.601055
nox 1.86748958 239.949497
rm 58.06229015 12383.514326
age 2.28073936 319.950062
dis 0.88169124 254.610410
rad 0.12353839 67.816948
tax 0.86364003 139.453388
ptratio 0.38357910 110.896968
black 0.55862963 227.195328
lstat 29.77065187 4846.537272
This plot shows the predicted values plotted against the test values. The predictions made using bagging track the test values closely, indicating a strong correlation between the two.
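A sketch of the code that produces these predictions, the plot, and the test MSE reported below (yhat.bag and boston.test match the names used later in these notes):
yhat.bag <- predict(bag.boston, newdata = Boston[-train, ])
plot(yhat.bag, boston.test)   # predicted vs. observed test values
abline(0, 1)                  # 45-degree reference line
mean((yhat.bag - boston.test)^2)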
[1] 23.4579
The test set MSE associated with the bagged regression tree is 23.4579, which is substantially lower than that obtained using an optimally-pruned single tree.
We can change the number of trees grown by randomForest() using the ntree argument (we try ntree=25 now, whereas the default is 500):
bag.boston <- randomForest(medv~.,data=Boston,subset=train,mtry=13,ntree=25)
yhat.bag <- predict(bag.boston,newdata=Boston[-train ,])
mean((yhat.bag-boston.test)^2)
[1] 22.99145
Random Forest
Growing a random forest proceeds in exactly the same way, except that we use a smaller value of the mtry argument. By default, randomForest() uses p/3 variables when building a random forest of regression trees, and sqrt(p) variables when building a random forest of classification trees. Here we use mtry = 6:
set.seed(1)
rf.boston <- randomForest(medv~., data=Boston, subset=train,
                          mtry=6, importance=TRUE)
yhat.rf <- predict(rf.boston, newdata=Boston[-train,])
mean((yhat.rf-boston.test)^2)
[1] 19.62021
The test set MSE is 19.62021; compared to the 35.28688 obtained with the single regression tree on Wednesday, this indicates that random forests yielded an improvement over a single regression tree in this case.
We use importance() again to view the importance of each variable.
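For example (a sketch assuming the fitted forest is named rf.boston as above; varImpPlot() is an optional extra that displays the same measures graphically):
importance(rf.boston)
varImpPlot(rf.boston)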
%IncMSE IncNodePurity
crim 16.697017 1076.08786
zn 3.625784 88.35342
indus 4.968621 609.53356
chas 1.061432 52.21793
nox 13.518179 709.87339
rm 32.343305 7857.65451
age 13.272498 612.21424
dis 9.032477 714.94674
rad 2.878434 95.80598
tax 9.118801 364.92479
ptratio 8.467062 823.93341
black 7.579482 275.62272
lstat 27.129817 6027.63740
The “IncNodePurity” is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees.
8.3.4 Boosting
Boosting differs from random forests in that the trees are grown sequentially: each new tree is fit using information from the previously grown trees in order to improve accuracy. Because of this sequential nature, boosting can overfit if too many trees are grown; the n.trees argument is the parameter we adjust to control this. Another parameter we can adjust when calling the gbm() function is interaction.depth, which limits the depth of each tree and thereby controls the complexity of the boosted ensemble. Small trees (i.e., shallow depth) are often sufficient when using boosting.
library(gbm)
set.seed(1)
boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4)
The relative influence of each variable in the boosted model (the kind of table returned by summary() on the fitted gbm object) is:
var rel.inf
rm rm 43.9919329
lstat lstat 33.1216941
crim crim 4.2604167
dis dis 4.0111090
nox nox 3.4353017
black black 2.8267554
age age 2.6113938
ptratio ptratio 2.5403035
tax tax 1.4565654
indus indus 0.8008740
rad rad 0.6546400
zn zn 0.1446149
chas chas 0.1443986
The following plots illustrate the marginal effect of the selected variables on the response after integrating out the other variables. As you can see, the median house price increases with rm (the average number of rooms per dwelling) and decreases with lstat (the percentage of lower-status residents in the area).
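A sketch of how such partial dependence plots can be produced with the gbm package's plot method:
par(mfrow = c(1, 2))
plot(boost.boston, i = "rm")      # marginal effect of rm on medv
plot(boost.boston, i = "lstat")   # marginal effect of lstat on medv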
Here we compute predictions on the test data using the model constructed by boosting:
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
[1] 18.84709
Here we introduce two new arguments. shrinkage (default 0.001) is the learning rate: the small default means the model learns slowly from the previously grown trees, while a larger value such as the 0.2 used here learns faster but runs the risk of overfitting and poor performance. verbose controls whether gbm() prints progress and performance indicators while fitting; we set verbose=F to suppress that output.
boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
[1] 18.33455