Last modified on 15 Apr 2020.

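Grid search is a hyperparameter tuning process: it tries every combination of the candidate values you specify and keeps the combination that scores best (usually under cross-validation) for a given model.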
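To make the idea concrete, here is a minimal sketch of an exhaustive search written without scikit-learn. The helper `grid_search` and the scoring function `toy_score` are hypothetical names made up for illustration; in practice the score would be a cross-validated model score.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Try every combination in `grid` and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    # product(...) yields one tuple of values per combination of candidates
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in for a model score; it peaks at a=3, b=1
def toy_score(a, b):
    return -(a - 3) ** 2 - (b - 1) ** 2


best_params, best_score = grid_search(toy_score, {"a": [1, 2, 3, 4], "b": [0, 1, 2]})
print(best_params, best_score)
```

The cost grows with the product of the list lengths, which is why large grids get slow quickly.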

Below is an example of using grid search with a random forest to solve the Titanic problem.

import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier as RF
# Create a DICTIONARY containing all the candidate values of the parameters
parameter_grid = dict(n_estimators=list(range(1, 5001, 1000)),
                      criterion=['gini', 'entropy'],
                      max_features=list(range(1, len(features), 2)),
                      max_depth=[None] + list(range(5, 25, 1)))

# Create a random forest object
random_forest = RF(random_state=0, n_jobs=-1)

# Create a grid search object that uses 5-fold cross-validation
#   and all available cores (n_jobs=-1)
gsc = GridSearchCV(estimator=random_forest,
                   param_grid=parameter_grid,
                   cv=5, verbose=1, n_jobs=-1)
  • estimator: the model we are tuning (here, RF).
  • param_grid: a dictionary that maps each parameter name of the estimator to the list of candidate values to try.
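It helps to count how many models a grid like this one trains before running it. The sketch below rebuilds the same grid with a hypothetical `n_features = 8` standing in for `len(features)`, which is an assumption and not a value from the Titanic data.

```python
from itertools import product

n_features = 8  # assumption: stands in for len(features)

parameter_grid = dict(n_estimators=list(range(1, 5001, 1000)),
                      criterion=['gini', 'entropy'],
                      max_features=list(range(1, n_features, 2)),
                      max_depth=[None] + list(range(5, 25, 1)))

# One candidate per combination of values: 5 * 2 * 4 * 21
n_candidates = sum(1 for _ in product(*parameter_grid.values()))

# With cv=5, every candidate is fitted once per fold
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 840 4200
```

Under that assumption the search fits 4200 forests, some with thousands of trees, so trimming the grid is usually worth it.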

Fit the grid search and get the best parameters:

grid_result = gsc.fit(X, y)
best_params = grid_result.best_params_
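Besides `best_params_`, a fitted `GridSearchCV` also exposes `best_score_` and `cv_results_`. The sketch below shows them on a small synthetic dataset; `X_demo`, `y_demo`, `small_grid` and `search` are illustrative names, and the data comes from `make_classification`, not from the Titanic set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately tiny grid so the example runs quickly
small_grid = {"n_estimators": [10, 50], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=5)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)

# cv_results_ holds one entry per candidate combination
for params, mean in zip(search.cv_results_["params"],
                        search.cv_results_["mean_test_score"]):
    print(params, round(mean, 3))
```

Looking at all the mean scores, not only the best one, shows whether the winner is clearly better or just marginally ahead.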

If you want to build a new classifier with these best_params:

best_clf = RF(n_estimators=best_params["n_estimators"],
              criterion=best_params["criterion"],
              max_features=best_params["max_features"],
              max_depth=best_params["max_depth"])
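Since the keys of `best_params` are exactly the keyword arguments of the estimator, the dictionary can also be unpacked in one go. This is a sketch with hypothetical parameter values standing in for a real `grid_result.best_params_`; passing `random_state` again is my addition, to keep the new forest reproducible.

```python
from sklearn.ensemble import RandomForestClassifier as RF

# Hypothetical values standing in for grid_result.best_params_
best_params = {"n_estimators": 1001, "criterion": "gini",
               "max_features": 3, "max_depth": 7}

# ** unpacks the dictionary into keyword arguments
best_clf = RF(random_state=0, n_jobs=-1, **best_params)

print(best_clf.get_params()["max_depth"])
```

This form keeps working unchanged if more parameters are added to the grid later.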

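Or you can predict with the grid search object directly. With the default refit=True, it refits the best parameter combination on all of X, y and uses that model to predict: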

gsc.fit(X, y)
gsc.predict(X_test)
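The refitted model is also available as `best_estimator_`. The sketch below checks on synthetic data that predicting through the search object and through `best_estimator_` gives the same result; all variable names and the tiny grid are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

search = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=0),
                      {"max_depth": [3, 5]}, cv=3)
search.fit(X_train, y_train)  # refit=True by default

# search.predict delegates to the refitted best estimator
same = np.array_equal(search.predict(X_test),
                      search.best_estimator_.predict(X_test))
print(same)
```

Keeping `best_estimator_` around is handy when you want to save the final model without the rest of the search object.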

Cross-validate the whole grid search (nested cross-validation: each outer fold reruns the full search, so this takes a long time to run!):

cv_scores = cross_val_score(gsc, X, y)
print('Accuracy scores:', cv_scores)
print('Mean of scores:', np.mean(cv_scores))
print('Variance of scores:', np.var(cv_scores))
