Team 18: Ben Zhang, Maggie Li, Mikey Pedersen, Alex Jacques
8.3.3 Bagging and Random Forests
Bagging
We apply bagging and random forests to the Boston data:
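The output below can be reproduced with something like the following (a minimal sketch; the original code chunk is not echoed in these notes):
library(MASS)   # contains the Boston housing data set
head(Boston)    # first six rows of the predictors and the response medv
dim(Boston)     # 506 observations on 14 variables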
crim zn indus chas nox rm age dis rad tax ptratio black lstat
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7
[1] 506 14
We first create a training set and a test set.
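A sketch of how the split might be created (the seed and the 50/50 split are assumptions chosen to match the ISLR lab; boston.test matches the name used in the code further below):
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)   # indices of the training observations
boston.test <- Boston[-train, "medv"]             # test-set response values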
Bagging is simply a special case of a random forest with m = p. Therefore, the randomForest() function can be used to perform both random forests and bagging.
Setting mtry = 13 means that all 13 predictors are considered at each split, which is exactly bagging.
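The fit summarized below was presumably produced by a call along these lines (a sketch; the seed is an assumption, and the object name bag.boston matches the name used later in these notes):
library(randomForest)
set.seed(1)
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, importance = TRUE)
bag.boston   # printing the fitted object gives the summary shown below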
Call:
randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE, subset = train)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 13
Mean of squared residuals: 11.33119
% Var explained: 85.26
What does importance mean here?
When importance = TRUE, variable importance measures are computed for each predictor. For example, the first column returned by importance() (%IncMSE) reports the mean decrease in prediction accuracy on the out-of-bag samples when that predictor's values are permuted.
Bagging typically results in improved accuracy over prediction using a single tree. Unfortunately, however, it can be difficult to interpret the resulting model. Although the collection of bagged trees is much more difficult to interpret than a single tree, one can obtain an overall summary of the importance of each predictor using the RSS (for bagging regression trees) or the Gini index (for bagging classification trees).
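The table below is the output of importance() on the fitted forest (a sketch, assuming the object is named bag.boston as above):
importance(bag.boston)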
%IncMSE IncNodePurity
crim 3.44038649 813.209885
zn 0.41586673 64.533390
indus 0.11053141 103.424195
chas -0.06224478 9.601055
nox 1.86748958 239.949497
rm 58.06229015 12383.514326
age 2.28073936 319.950062
dis 0.88169124 254.610410
rad 0.12353839 67.816948
tax 0.86364003 139.453388
ptratio 0.38357910 110.896968
black 0.55862963 227.195328
lstat 29.77065187 4846.537272
This plot shows the predicted values plotted against the test values. The predictions made using bagging track the test values closely, indicating a strong correlation between the two.
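A sketch of the code that produces these predictions, the plot, and the test MSE reported below (yhat.bag and boston.test match the names used later in these notes):
yhat.bag <- predict(bag.boston, newdata = Boston[-train, ])
plot(yhat.bag, boston.test)   # predicted vs. observed test values
abline(0, 1)                  # 45-degree reference line
mean((yhat.bag - boston.test)^2)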
[1] 23.4579
The test set MSE associated with the bagged regression tree is 23.4579, which is substantially lower than that obtained using an optimally-pruned single tree.
We can change the number of trees grown by randomForest() using the ntree argument (we try ntree=25 now, whereas the default is 500):
bag.boston <- randomForest(medv~.,data=Boston,subset=train,mtry=13,ntree=25)
yhat.bag <- predict(bag.boston,newdata=Boston[-train ,])
mean((yhat.bag-boston.test)^2)
[1] 22.99145
Random Forest
Growing a random forest proceeds in exactly the same way, except that we use a smaller value of the mtry argument. By default, randomForest() uses p/3 variables when building a random forest of regression trees, and sqrt(p) variables when building a random forest of classification trees. Here we use mtry = 6:
set.seed(1)
rf.boston <- randomForest(medv~., data=Boston, subset=train,
                          mtry=6, importance=TRUE)
yhat.rf <- predict(rf.boston, newdata=Boston[-train,])
mean((yhat.rf-boston.test)^2)
[1] 19.62021
The test set MSE is 19.62021; compared to the 35.28688 obtained with the single regression tree on Wednesday, this indicates that random forests yielded an improvement over a single regression tree in this case.
We use importance() again to view the importance of each variable.
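For example (a sketch assuming the fitted forest is named rf.boston as above; varImpPlot() is an optional extra that displays the same measures graphically):
importance(rf.boston)
varImpPlot(rf.boston)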
%IncMSE IncNodePurity
crim 16.697017 1076.08786
zn 3.625784 88.35342
indus 4.968621 609.53356
chas 1.061432 52.21793
nox 13.518179 709.87339
rm 32.343305 7857.65451
age 13.272498 612.21424
dis 9.032477 714.94674
rad 2.878434 95.80598
tax 9.118801 364.92479
ptratio 8.467062 823.93341
black 7.579482 275.62272
lstat 27.129817 6027.63740
The “IncNodePurity” is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees.
8.3.4 Boosting
Boosting differs from random forests in that the trees are grown sequentially: each new tree is fit using information from the previously grown trees in order to improve accuracy. Because of this sequential nature, boosting can overfit if too many trees are grown; the n.trees argument is the parameter we adjust to control this. Another parameter we can adjust when calling the gbm() function is interaction.depth, which limits the depth of each tree and thereby controls the complexity of the boosted ensemble. Small trees (i.e., shallow depth) are often sufficient when using boosting.
library(gbm)
set.seed(1)
boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4)
The relative influence of each variable in the boosted model (the kind of table returned by summary() on the fitted gbm object) is:
var rel.inf
rm rm 43.9919329
lstat lstat 33.1216941
crim crim 4.2604167
dis dis 4.0111090
nox nox 3.4353017
black black 2.8267554
age age 2.6113938
ptratio ptratio 2.5403035
tax tax 1.4565654
indus indus 0.8008740
rad rad 0.6546400
zn zn 0.1446149
chas chas 0.1443986
The following plots illustrate the marginal effect of the selected variables on the response after integrating out the other variables. As you can see, the median house price increases with rm (the average number of rooms per dwelling) and decreases with lstat (the percentage of lower-status residents in the area).
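A sketch of how such partial dependence plots can be produced with the gbm package's plot method:
par(mfrow = c(1, 2))
plot(boost.boston, i = "rm")      # marginal effect of rm on medv
plot(boost.boston, i = "lstat")   # marginal effect of lstat on medv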
Here we compute predictions on the test data using the model constructed by boosting:
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
[1] 18.84709
Here we introduce two new arguments. shrinkage (default 0.001) is the learning rate: the small default means the model learns slowly from the previously grown trees, while a larger value such as the 0.2 used here learns faster but runs the risk of overfitting and poor performance. verbose controls whether gbm() prints progress and performance indicators while fitting; we set verbose=F to suppress that output.
boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
[1] 18.33455