Learning algorithm
Each tree is constructed using the following algorithm:
1. Let the number of training cases be N, and the number of variables in the classifier be M.
2. The number m of input variables used to determine the decision at each node of the tree is specified in advance; m should be much smaller than M.
3. Choose a training set for this tree by sampling N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the remaining cases (the out-of-bag cases) to estimate the error of the tree by predicting their classes.
4. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
5. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
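To make the procedure concrete, here is a minimal sketch of growing a single tree of the forest, assuming scikit-learn is available: its DecisionTreeClassifier takes a max_features parameter that performs the per-node random variable selection of step 4 and, with the defaults, grows the tree fully without pruning (step 5). The dataset and random seed are placeholders.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # placeholder dataset
N, M = X.shape            # N training cases, M input variables (step 1)
m = int(np.sqrt(M))       # m << M; sqrt(M) is a common rule of thumb (step 2)

rng = np.random.default_rng(0)

# Step 3: bootstrap sample -- draw N case indices with replacement.
boot = rng.integers(0, N, size=N)
oob = np.setdiff1d(np.arange(N), boot)   # the out-of-bag cases

# Steps 4-5: max_features=m makes the tree consider a fresh random subset
# of m variables at every node; with the defaults the tree is grown fully
# and never pruned.
tree = DecisionTreeClassifier(max_features=m, random_state=0)
tree.fit(X[boot], y[boot])

# Estimate this tree's error on the cases it never saw (step 3).
oob_error = 1.0 - tree.score(X[oob], y[oob])
print(f"single-tree out-of-bag error: {oob_error:.3f}")
```

Repeating this procedure for many trees, each with its own bootstrap sample and its own random variable subsets, yields the forest; classification is by majority vote over the trees.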
Advantages
The advantages of random forests are:
* For many data sets, it produces a highly accurate classifier.
* It handles a very large number of input variables.
* It estimates the importance of each variable in determining the classification.
* It generates an internal unbiased estimate of the generalization error as the forest building progresses, computed from the out-of-bag cases (see the first example after this list).
* It includes a good method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
* It provides an experimental way to detect variable interactions.
* It can balance the error in data sets where the class populations are unbalanced.
* It computes proximities between pairs of cases, useful for clustering, detecting outliers, and (by scaling) visualizing the data (see the second example after this list).
* Using the above, it can be extended to unlabeled data, leading to unsupervised clustering, outlier detection and data views.
* Learning is fast.
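As a concrete illustration of the out-of-bag error estimate and the variable importances, here is a hedged sketch using scikit-learn's RandomForestClassifier; note that its feature_importances_ attribute reports an impurity-based importance, while Breiman's original permutation importance is available separately via sklearn.inspection.permutation_importance. The dataset is a placeholder.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)   # placeholder dataset

forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0)
forest.fit(X, y)

# Internal estimate of the generalization error, computed from the
# out-of-bag cases as the forest is built.
print(f"OOB error estimate: {1.0 - forest.oob_score_:.3f}")

# Importance of each variable in determining the classification
# (impurity-based in scikit-learn).
for i, imp in enumerate(forest.feature_importances_):
    print(f"variable {i}: importance {imp:.3f}")
```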
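Proximities are not exposed directly by scikit-learn, but a minimal sketch can recover them from the leaf assignments returned by apply(): the proximity of two cases is the fraction of trees that route them to the same terminal node. This illustrates the idea rather than reproducing Breiman's reference implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)   # placeholder dataset
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Leaf index of every case in every tree, shape (N, n_trees).
leaves = forest.apply(X)

# prox[i, j] = fraction of trees in which cases i and j share a leaf.
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# A row of uniformly small proximities flags a potential outlier, and
# 1 - prox can be fed to multidimensional scaling to visualize the data.
print(prox.shape, prox[0, :5])
```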