Subscribe to RichardOnData here: [ Link ]
In this video I talk about the Random Forest algorithm. Random Forests are one of my favorite machine learning algorithms and one of the most popular overall, and they are useful whether your goal is inference or prediction.
The base learner of the Random Forest is the Decision Tree. Decision Trees are simple and straightforward, but they tend to be high variance -- i.e. they overfit and do not generalize well from a training set to a test set. Random Forests correct for this problem. They are an "ensemble learning method", and the process for creating them is as follows:
1) Draw a bootstrapped sample of the training data
2) Fit a Decision Tree to that sample, considering only a random subset of the available variables at each split
3) Repeat the process many times
4) Tally "votes" for predictions across trees -- the predicted class is the one with the most "votes" (for regression, the predictions are averaged)
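If you want to see those steps in code, here is a minimal sketch using Python's scikit-learn -- the dataset is synthetic and the parameter values are placeholders, not recommendations from the video:

# Minimal Random Forest sketch with scikit-learn (illustrative values only)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# bootstrap=True resamples the training data for each tree (step 1);
# max_features="sqrt" picks a random subset of variables at each split (step 2)
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            bootstrap=True, random_state=42)
rf.fit(X_train, y_train)         # grows all the trees (steps 2-3)
print(rf.score(X_test, y_test))  # accuracy from majority vote (step 4)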
A key feature of Random Forests is that they can be used to produce Variable Importance plots. These rank, from top to bottom, the most "important" variables in the data. What is nice about these is that, while Random Forests are not interpretable the way regression models are, they are constructed in a different way and can detect things like non-linear relationships. See the diagram, where RM is the most important variable, followed by LSTAT, followed by DIS.
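As a rough illustration of how you might draw one of these, assuming the fitted rf from the sketch above (the feature names here are placeholders, not the RM/LSTAT/DIS example from the diagram):

# Sketch of a variable importance plot from a fitted forest
import matplotlib.pyplot as plt
import numpy as np

importances = rf.feature_importances_   # mean decrease in impurity
order = np.argsort(importances)         # sort so the top variable plots last
names = [f"x{i}" for i in range(len(importances))]  # placeholder names

plt.barh([names[i] for i in order], importances[order])
plt.xlabel("Importance (mean decrease in impurity)")
plt.title("Variable importance")
plt.show()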
Some other benefits of Random Forests are:
1) They are not extremely sensitive to outliers
2) They are fairly stable and can handle new data without changing dramatically
3) They have methods for handling missing data (a simple sketch follows this list)
4) They can be used for unsupervised learning (clustering)
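On the missing data point: Breiman's original approach starts with a median "rough fix" for missing values. scikit-learn forests do not implement his proximity-based imputation, so as a stand-in here is a median-imputation pipeline -- a minimal sketch with toy data, all names and values illustrative:

# Sketch: median imputation (akin to Breiman's "rough fix") before a forest
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy data with missing values (illustrative only)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([0, 0, 1, 1])

model = make_pipeline(SimpleImputer(strategy="median"),
                      RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(X, y)
print(model.predict([[np.nan, 2.5]]))  # missing value imputed, then classified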
However, some drawbacks are:
1) They can be slow and memory-intensive
2) Variable importance can become biased if you have: a) a mix of continuous and categorical variables, where the categorical variables have few levels (the importance measure tends to favor the continuous ones); or b) correlated continuous variables (the importance gets split among them)
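A commonly suggested mitigation for this bias is permutation importance computed on held-out data, which scikit-learn provides; conditional inference trees, linked below, are another. A minimal sketch, again assuming the rf and test split from the first example:

# Sketch: permutation importance as a less biased alternative to
# impurity-based importance (assumes rf, X_test, y_test from above)
from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test,
                                n_repeats=10, random_state=42)
# Mean drop in score when each feature is shuffled; larger = more important
print(result.importances_mean)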
See StatQuest with Josh Starmer's video on building random forests: [ Link ]
See Leo Breiman and Adele Cutler's documentation on random forests: [ Link ]
StackExchange discussion on sensitivity to outliers:
[ Link ]
StackExchange discussion on conditional inference trees:
[ Link ]
Abstract on clustering using RF:
[ Link ]
Photo credit for decision tree image:
[ Link ]
Photo credit for variable importance plot:
[ Link ] (Eryk Lewinson, Towards Data Science)