
Random Forest

Anh-Thi Dinh
Machine Learning

What's the idea of Random Forest?

Random forest consists of a (large) number of decision trees operating together (ensemble learning). The class with the most votes from the trees is chosen as the final prediction of the RF. These decision tree models are relatively uncorrelated, so they can protect each other from their individual errors.
An illustration of the random forest's idea.
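As a toy sketch of the majority-vote step (plain Python; the tree predictions here are made up for illustration):

from collections import Counter

# hypothetical class predictions from 5 individual trees for one sample
tree_votes = [1, 0, 1, 1, 0]

# the forest's prediction is the class with the most votes
prediction = Counter(tree_votes).most_common(1)[0][0]
print(prediction)  # 1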
❓ How are the (decision) trees chosen? RF ensures that the chosen trees are not too correlated with each other.
  1. Bagging: each tree is trained on a bootstrap sample, i.e. a sample of size N drawn with replacement from the original training set of size N. For example, if our training data is [1, 2, 3, 4, 5] (size 5), we might give one of our trees the list [1, 2, 2, 5, 5] (see the sketch after this list).
  2. Feature randomness: each tree can only pick from a random subset of the features when splitting a node, so different trees may end up not using some features at all.
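A minimal sketch of the bootstrap sampling in step 1 (using NumPy; this snippet is not from the original post):

import numpy as np

rng = np.random.default_rng(0)
data = np.array([1, 2, 3, 4, 5])

# a bootstrap sample: same size as the data, drawn with replacement
bootstrap = rng.choice(data, size=len(data), replace=True)
print(bootstrap)  # e.g. [1 2 2 5 5]: duplicates allowed, some values missing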
So in our random forest, we end up with trees that are not only trained on different sets of data (thanks to bagging) but also use different features to make decisions. (ref)
For each tree, we can use a Decision Tree Classifier or a Decision Tree Regressor depending on the type of our problem (classification or regression).
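For instance, a single tree of the forest corresponds to scikit-learn's DecisionTreeClassifier (an illustrative sketch, not from the original post):

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

# one decision tree: the building block that the forest aggregates
iris = datasets.load_iris()
tree = DecisionTreeClassifier(criterion='entropy').fit(iris.data, iris.target)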

When do we use Random Forest?

  • Decision tree algorithms easily lead to overfitting; the random forest algorithm can overcome this.
  • It is capable of both regression and classification problems.
  • It can handle a large number of features.
  • It can estimate which features are important in the underlying data being modeled (ref).
  • It is capable of learning without carefully crafted data transformations (ref).
  • It outputs class probabilities for classification problems.

Using RF with Scikit-learn

Random forest classifier

Load the libraries,

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
A sample dataset:

iris = datasets.load_iris() # iris flowers
X = iris.data
y = iris.target
Create a RF classifier (other parameters),

clf = RandomForestClassifier(criterion='entropy', # default is 'gini'
                             n_estimators=8, # number of trees (default is 100 since scikit-learn 0.22)
                             n_jobs=-1) # number of processors being used ("-1" means "all")
☝ If a problem has imbalanced classes, use class_weight="balanced". (ref)
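For example (a minimal sketch of that option):

# weight classes inversely proportional to their frequency in the data
clf = RandomForestClassifier(class_weight="balanced", n_jobs=-1)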
Training & predicting (other methods),

model = clf.fit(X, y)
model.predict([[5, 4, 3, 2]]) # returns: array([1])
model.predict_proba([[5, 4, 3, 2]]) # predict class probabilities

Random forest regression

# load libraries
from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor

# sample: California Housing data
# (the original post used the Boston Housing data, but load_boston
#  was removed in scikit-learn 1.2, so we use another regression dataset)
housing = datasets.fetch_california_housing()
X = housing.data[:, 0:2]
y = housing.target

# train
regr = RandomForestRegressor(random_state=0, n_jobs=-1)
model = regr.fit(X, y)

# predict (on the first 3 samples, for example)
model.predict(X[:3])
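To get a quick sense of how well the regressor fits, one might add (a sketch, not part of the original post):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R^2 scores of the regressor
scores = cross_val_score(regr, X, y, cv=5)
print(scores.mean())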

Select important features in Random Forest

Some preliminaries,

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create a RF classifier
clf = RandomForestClassifier(random_state=0, n_jobs=-1)
Compute the feature importances (ref),

# Train model
model = clf.fit(X, y)

# Calculate feature importances
importances = model.feature_importances_
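For a quick look at the raw numbers (an illustrative addition):

# pair each feature name with its importance score
for name, score in zip(iris.feature_names, importances):
    print(name, round(score, 3))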
Visualize,

# load additional packages
import numpy as np
import matplotlib.pyplot as plt

# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]
# Rearrange feature names so they match the sorted feature importances
names = [iris.feature_names[i] for i in indices]

plt.figure()
plt.title("Feature Importance")
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), names, rotation=90)

plt.show()
Select features with importance greater than a threshold,

from sklearn.feature_selection import SelectFromModel

# Create an object that selects features with importance greater than or equal to a threshold
selector = SelectFromModel(clf, threshold=0.3)

# Create the new feature matrix using the selector
X_important = selector.fit_transform(X, y)

# Train a random forest using only the most important features
model = clf.fit(X_important, y)
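To see which features survived the threshold (an illustrative addition; get_support is part of SelectFromModel's API):

# names of the selected features
selected = [iris.feature_names[i] for i in selector.get_support(indices=True)]
print(selected)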

References

  • Tony Yiu -- Understanding Random Forest.
  • Scikit-learn -- Random Forest Classifier official doc.
  • Scikit-learn -- Random Forest Regression official doc.
  • Chris Albon -- Titanic Competition With Random Forest.
  • The Yhat Blog -- Random Forests in Python.
  • fast.ai -- Introduction to Random Forest and a solution to "Bull Book for Bulldozers" problem on Kaggle.