Pipeline & GridSearch

What's the idea of Pipeline?

Stack multiple processing steps into a single (scikit-learn) estimator.

An example of a machine learning pipeline with 3 different steps.

Why pipeline?

An example of scaling with cross-validation, with and without a pipeline.
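One common reason to use a pipeline here, and probably what the comparison above illustrates, is data leakage: if the scaler is fit on the whole dataset before cross-validation, every validation fold has already influenced the scaling statistics. The sketch below contrasts the two approaches. The dataset (`load_breast_cancer`), the choice of `StandardScaler` and `SVC`, and `cv=5` are assumptions made for illustration, not taken from the original example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative dataset (an assumption; any labelled dataset would do)
X, y = load_breast_cancer(return_X_y=True)

# Without a pipeline: the scaler sees all rows, including those that later
# serve as validation folds, before cross-validation starts.
X_scaled = StandardScaler().fit_transform(X)
scores_no_pipe = cross_val_score(SVC(), X_scaled, y, cv=5)

# With a pipeline: the scaler is re-fit on the training part of each fold only.
pipe = make_pipeline(StandardScaler(), SVC())
scores_pipe = cross_val_score(pipe, X, y, cv=5)

print('Without pipeline:', scores_no_pipe.mean())
print('With pipeline:   ', scores_pipe.mean())
```

On a small, well-behaved dataset the two means may be very close; the pipeline version is still the one that gives an honest estimate, because nothing from the validation fold is used during fitting.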

Pipeline in Scikit-learn

The sample code below comes from this example.

from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

pca = PCA(n_components=150, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)

Difference between Pipeline and make_pipeline:

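Pipeline: you name each step yourself, as a (name, estimator) pair.

make_pipeline: no need to name the steps (pass the estimators directly); each step's name is generated from its lowercased class name, e.g. pca and svc.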

make_pipeline(PCA(), SVC())

Pipeline(steps=[
    ('principal_component_analysis', PCA()),
    ('support_vector_machine', SVC())
])
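To make the naming difference concrete, here is a small sketch that builds the same two-step pipeline both ways and inspects the step names. The short names `reduce` and `clf` are hypothetical labels chosen for this illustration.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVC

# make_pipeline derives each step name from the lowercased class name
auto = make_pipeline(PCA(), SVC())
auto_names = [name for name, _ in auto.steps]
print(auto_names)  # ['pca', 'svc']

# Pipeline takes explicit (name, estimator) pairs; these names are hypothetical
named = Pipeline(steps=[('reduce', PCA()), ('clf', SVC())])
named_names = [name for name, _ in named.steps]
print(named_names)  # ['reduce', 'clf']

# Either way, a step can be retrieved by its name
clf_step = named.named_steps['clf']
print(type(clf_step).__name__)  # SVC
```

The step names matter later: they are the prefixes used to address parameters in a grid search (for example `svc__C` or `clf__C`).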

Using with GridSearch

# Using with GridSearch (to choose the best parameters)
from sklearn.model_selection import GridSearchCV

# Keys follow the pattern <step name>__<parameter name>:
# "svc" is the step name given by make_pipeline, "C" is a parameter of SVC
param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(model, param_grid, cv=5, verbose=1, n_jobs=-1)

grid_result = grid.fit(X, y)
best_params = grid_result.best_params_

# Predict with the best parameters (the search refits the best model by default)
grid.predict(X_test)
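The snippet above relies on `X`, `y` and `X_test` being defined elsewhere. A self-contained sketch of the same workflow might look like the following. The dataset (`load_digits`), the train/test split, the smaller `n_components`, and the reduced parameter grid are all assumptions chosen so the example runs quickly; they are not the values from the original example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Illustrative data (an assumption): 8x8 digit images flattened to 64 features
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Same pipeline structure as above, with fewer components to suit 64 features
model = make_pipeline(
    PCA(n_components=20, whiten=True, random_state=42),
    SVC(kernel='rbf', class_weight='balanced'),
)

# Small illustrative grid; keys are <step name>__<parameter name>
param_grid = {'svc__C': [1, 10], 'svc__gamma': [0.01, 0.1]}
grid = GridSearchCV(model, param_grid, cv=3)
grid.fit(X_train, y_train)

best_params = grid.best_params_
y_pred = grid.predict(X_test)  # uses the refitted best pipeline

print('Best parameters:', best_params)
print('Test accuracy:', grid.score(X_test, y_test))
```

Fitting the search on the training split only keeps `X_test` untouched, so the final score is measured on data the search never saw.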

In case you want to use the values in best_params:

best_params['svc__C']
best_params['svc__gamma']
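Because the keys of `best_params` already use the `<step>__<parameter>` form, the whole dictionary can also be applied to a fresh pipeline with `set_params`. This is a minimal sketch; the values in `best_params` below are hypothetical placeholders, not results of an actual search.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical values standing in for the result of a grid search
best_params = {'svc__C': 10, 'svc__gamma': 0.001}

# Apply all parameters at once; each key is routed to the matching step
final_model = make_pipeline(PCA(), SVC())
final_model.set_params(**best_params)

svc_step = final_model.named_steps['svc']
print(svc_step.C, svc_step.gamma)  # 10 0.001
```

If the fitted search object is still available, its `best_estimator_` attribute gives an already fitted pipeline with these parameters (when `refit` is left at its default).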

Be careful with cross-validation: passing the grid search to cross_val_score re-runs the whole search on every fold, so it takes a long time to run!

import numpy as np
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(grid, X, y)
print('Accuracy scores:', cv_scores)
print('Mean of score:', np.mean(cv_scores))
print('Variance of scores:', np.var(cv_scores))