Random forest

In machine learning, a random forest is a classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees. The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler, and "Random Forests" is their trademark. The term came from random decision forests that was first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bagging" idea and Ho's "random subspace method" to construct a collection of decision trees with controlled variations.

Learning algorithm


Each tree is constructed using the following algorithm:

  1. Let the number of training cases be N, and the number of variables in the classifier be M.
  2. We are told the number m of input variables to be used to determine the decision at a node of the tree; m should be much less than M.
  3. Choose a training set for this tree by choosing N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the rest of the cases to estimate the error of the tree, by predicting their classes.
  4. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
  5. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).

Advantages


The advantages of random forest are:

   * For many data sets, it produces a highly accurate classifier.
   * It handles a very large number of input variables.
   * It estimates the importance of variables in determining classification.
   * It generates an internal unbiased estimate of the generalization error as the forest building progresses.
   * It includes a good method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
   * It provides an experimental way to detect variable interactions.
   * It can balance error in class population unbalanced data sets.
   * It computes proximities between cases, useful for clustering, detecting outliers, and (by scaling) visualizing the data.
   * Using the above, it can be extended to unlabeled data, leading to unsupervised clustering, outlier detection and data views.
   * Learning is fast.

No comments:

Post a Comment