Reading: Hands-On ML - Chap 3: Classification

Anh-Thi Dinh
This note serves as a reminder of the book's content, including additional research on the mentioned topics. It is not a substitute for the book. Most images are sourced from the book or referenced.
I've noticed that taking notes on this site while reading the book significantly extends the time it takes to finish it. I've stopped noting everything, as I did in previous chapters, and now continue reading while highlighting and hand-writing notes instead. I plan to return to the detailed style when I have more time.
This book contains 1007 pages of readable content. If you read at a pace of 10 pages per day, it will take you approximately 3.3 months (without missing a day) to finish it. If you aim to complete it in 2 months, you'll need to read at least 17 pages per day.


List of notes for this book

Jupyter notebook for this chapter: on Github, on Colab, on Kaggle.


  • We use the MNIST dataset in this chapter = 70K small images of handwritten digits. ← “Hello world” of ML.
  • Download from OpenML ← use sklearn.datasets.fetch_openml
    • from sklearn.datasets import fetch_openml
      mnist = fetch_openml('mnist_784', as_frame=False)
      # data contains images -> dataframe isn't suitable, so as_frame=False
      X, y = mnist.data, mnist.target
      X.shape # (70000, 784)
  • sklearn.datasets contains 3 types of functions:
    • fetch_* functions such as fetch_openml() to download real-life datasets.
    • load_* functions to load small toy datasets (no need to download)
    • make_* functions to generate fake datasets.
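A quick sketch of the load_* and make_* families (fetch_* needs a network connection, so it is left out). The choice of load_digits and make_classification, and their parameters, is mine for illustration, not the book's.

```python
from sklearn.datasets import load_digits, make_classification

# load_*: small toy dataset bundled with scikit-learn (no download needed)
digits = load_digits()
print(digits.data.shape)  # (1797, 64): 8x8 images flattened to 64 features

# make_*: synthetic dataset generated on the fly
X_fake, y_fake = make_classification(n_samples=100, n_features=5, random_state=42)
print(X_fake.shape, y_fake.shape)
```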
  • 70K images, 784 features. Each image = 28x28 pixels.
  • Plot an image
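    • import matplotlib.pyplot as plt

      def plot_digit(image_data):
          image = image_data.reshape(28, 28)  # 784 features -> 28x28 pixels
          plt.imshow(image, cmap="binary")
          plt.axis("off")

      some_digit = X[0]
      plot_digit(some_digit)
      plt.show()
    • The label of this image: y[0] is '5' (a string, since the labels are stored as strings).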
  • MNIST from fetch_openml() is already split into a training set (first 60K, already shuffled) and test set (last 10K).
    • X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
  • Training set is already shuffled ← good for cross-validation (all folds will be similar).

Training a Binary Classifier

Let’s simplify the problem - “detect only the number 5” ← binary classifier (2 classes, 5 or non-5).
A good place to start is a stochastic gradient descent (SGD, or stochastic GD) classifier ← SGDClassifier ← deals with training instances independently, one at a time ← handles large datasets efficiently, well suited for online learning.
from sklearn.linear_model import SGDClassifier
y_train_5 = (y_train == '5')  # True for all 5s, False for all other digits
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

Performance Measures

Evaluating a classifier is often significantly trickier than evaluating a regressor!

Measuring Accuracy Using Cross-Validation

Use cross_val_score() ← use k-folds.
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
Wow, we get over 95% accuracy with SGD, but is that good? → Let’s try DummyClassifier ← classifies every single image in the most frequent class (non-5) and then use cross_val_score → over 90% accuracy! Why? It’s because only about 10% of the images are 5s! ← If you always guess that an image is not a 5, you’re right about 90% of the time!
→ Accuracy is generally not the preferred measure for classifiers, especially with skewed datasets (some classes are much more frequent than others). ← use confusion matrix (CM) 👈 My note: Confusion matrix & f1-score.
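A minimal sketch of the "always predict the majority class" baseline. The data is a hypothetical skewed dataset (10% positives) standing in for the 5 / non-5 labels, not MNIST itself.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical skewed dataset: 100 positives, 900 negatives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.zeros(1000, dtype=bool)
y_demo[:100] = True

# Always predicts the most frequent class (here: False)
dummy_clf = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(dummy_clf, X_demo, y_demo, cv=3, scoring="accuracy")
print(scores)  # every fold is close to 0.90, although nothing was learned
```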
Sometimes you need more control over the cross-validation process than cross_val_score() offers, so you implement it yourself ← use StratifiedKFold to perform stratified sampling (folds that preserve the percentage of samples of each class).
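A rough sketch of such a hand-written cross-validation loop. The synthetic dataset and the variable names (X_demo, y_demo, fold_accuracies, fold_pos_ratios) are assumptions for illustration; with MNIST you would pass X_train and y_train_5 instead.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

# Hypothetical skewed binary dataset (about 10% positives)
X_demo, y_demo = make_classification(n_samples=600, weights=[0.9], random_state=42)
clf = SGDClassifier(random_state=42)

skfolds = StratifiedKFold(n_splits=3)
fold_accuracies = []
fold_pos_ratios = []
for train_idx, test_idx in skfolds.split(X_demo, y_demo):
    clone_clf = clone(clf)  # fresh, unfitted copy for each fold
    clone_clf.fit(X_demo[train_idx], y_demo[train_idx])
    y_pred = clone_clf.predict(X_demo[test_idx])
    fold_accuracies.append((y_pred == y_demo[test_idx]).mean())
    fold_pos_ratios.append(y_demo[test_idx].mean())

print(fold_accuracies)
print(fold_pos_ratios)  # nearly the same positive ratio in every fold
```

Inside the loop you are free to compute any measure you like, which is the point of doing it by hand.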

Confusion Matrices

General idea: count the number of times instances of class A are classified as class B. Eg. to check how many times the classifier confuses 8s and 0s, check row#8, col#0 of the CM.
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
cm = confusion_matrix(y_train_5, y_train_pred)
cross_val_predict() returns the predictions made on each test fold.
array([[53892,   687],
       [ 1891,  3530]])
CM. Each row → actual class, each column → predicted class.
                               predicted non-5        predicted 5
actual non-5 (negative class)  53892 (TN)             687 (FP - Type I)
actual 5 (positive class)      1891 (FN - Type II)    3530 (TP)
TN = True Negative, FP = False Positive = Type I Error, FN = False Negative = Type II Error, TP = True Positive.
→ A perfect classifier would only have TP and TN, i.e. FP = FN = 0 (every 5 and every non-5 is predicted correctly)!
So, a more concise metric: look at the accuracy of the positive predictions ← precision = TP / (TP + FP). How many of our positive predictions are right?
But what if we always make negative predictions, except for the single positive we are most sure about → precision would be 1/1 = 100%? ← This classifier isn’t useful because it ignores all but one positive instance. → Precision should be used with another metric named recall (also called sensitivity or true positive rate, TPR).
Recall = TP / (TP + FN) = ratio of positive instances that are correctly detected by the classifier. ← Do we miss something?
Figure 3-3. An illustrated confusion matrix.
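As a worked check, precision = TP / (TP + FP) and recall = TP / (TP + FN) can be computed by hand from the confusion matrix printed above. The variable names are mine; the counts are the ones from the array output.

```python
import numpy as np

# Confusion matrix from cross_val_predict above: rows = actual, columns = predicted
cm = np.array([[53892, 687],
               [1891, 3530]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # 3530 / 4217
recall = tp / (tp + fn)     # 3530 / 5421
print(round(precision, 4), round(recall, 4))  # 0.8371 0.6512
```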

Precision and Recall

Scikit-learn gives precision_score and recall_score to compute precision and recall.
from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred) # 0.8370879772350012
recall_score(y_train_5, y_train_pred) # 0.6511713705958311
When it claims an image represents a 5, it is correct only 83.7% of the time. Moreover, it only detects 65.1% of the 5s.
It’s convenient to combine precision and recall into a single metric called F1 score. It’s the harmonic mean of them.
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred) # 0.7325171197343846
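The harmonic mean, F1 = 2 × precision × recall / (precision + recall), can be checked by hand with the two scores above. The comparison with the arithmetic mean is my addition, to show that the harmonic mean is pulled toward the lower of the two values.

```python
precision = 0.8370879772350012
recall = 0.6511713705958311

f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
arithmetic_mean = (precision + recall) / 2          # regular mean, for comparison
print(round(f1, 4), round(arithmetic_mean, 4))  # 0.7325 0.7441
```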
F1 score is high only if both recall and precision are high. However, it’s not always the metric you want: in some contexts, you mostly care about precision, and in other contexts you really care about recall.
  • Care about precision: detecting safe videos for kids → a good classifier = keeps only safe videos ↔ precision high ↔ fewer “wrongly detected” (FP) videos reach children. We don’t care if recall is low in this case (lots of good videos will be rejected, but that is no problem).
  • Care about recall: detecting shoplifters in surveillance images → a good classifier = all shoplifters get caught ↔ recall high ↔ fewer “allowed to pass” (FN). We don’t care if precision is low (a few false alerts, but we won’t miss the bad guys).
→ Unfortunately, increasing precision reduces recall and vice versa. ← precision/recall trade-off.

Precision/Recall Trade-off

Figure 3-4. SGDClassifier: Precision/recall trade-off ranks images by classifier score. Those above the set threshold are positive; a higher threshold results in lower recall but generally higher precision.
  • Higher recall (lower threshold) → we miss fewer 5s but let in many non-5s. Conversely, higher precision (higher threshold) → few non-5s get through but we miss many 5s (lower recall).
  • So which threshold should be used? → Figure 3-5.
    • Figure 3-5. Precision and recall versus the decision threshold. Threshold = 3000.
  • Strategy 2: to select a precision/recall trade-off → plot precision against recall.
    • Figure 3-6. Precision versus recall. → Observe: From 0.8 of recall, precision falls sharply.
  • The choice of precision/recall trade-off depends on your project!
  • Search for the lowest threshold that gives you at least 90% precision.
    • # precisions and thresholds come from precision_recall_curve()
      idx_for_90_precision = (precisions >= 0.90).argmax()
      threshold_for_90_precision = thresholds[idx_for_90_precision]
  • At this threshold, recall is only about 48%. For many applications, 48% recall wouldn’t be great.
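A self-contained sketch of the threshold search above, from decision scores to thresholded predictions. The synthetic 2-feature dataset is a hypothetical stand-in for the 5 / non-5 task, so the resulting recall will not match the 48% obtained on MNIST. Slicing precisions[:-1] is my small safety tweak: precision_recall_curve() returns one more precision value than thresholds.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import cross_val_predict

# Hypothetical skewed dataset: 200 positives shifted away from 1800 negatives
rng = np.random.default_rng(42)
y_demo = np.zeros(2000, dtype=bool)
y_demo[:200] = True
X_demo = rng.normal(size=(2000, 2))
X_demo[y_demo] += 3.0

clf = SGDClassifier(random_state=42)
# Out-of-fold decision scores instead of hard predictions
y_scores = cross_val_predict(clf, X_demo, y_demo, cv=3, method="decision_function")

precisions, recalls, thresholds = precision_recall_curve(y_demo, y_scores)

# Lowest threshold reaching at least 90% precision (drop the extra last entry)
idx_for_90_precision = (precisions[:-1] >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]

# Predict by comparing scores with the chosen threshold instead of calling predict()
y_pred_90 = y_scores >= threshold_for_90_precision
print(precision_score(y_demo, y_pred_90), recall_score(y_demo, y_pred_90))
```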

The ROC Curve

  • ROC = Receiver Operating Characteristic. ← common tool used with binary classifiers.
  • Specificity (true negative rate, TNR) = TN / (TN + FP): how many of the actual negatives are correctly predicted as negative? ← It is used when we care about TN values and don't want to raise false alarms (FP), e.g. a drug test.
  • ROC plots TPR (True Positive Rate) vs FPR (False Positive Rate) = Sensitivity (Recall) vs 1 − Specificity.
  • Use roc_curve.
    • from sklearn.metrics import roc_curve
      import matplotlib.pyplot as plt
      # Jupyter magic, only valid inside a notebook
      %matplotlib inline
      # y_test: true labels, y_pred_prob: scores or probabilities of the positive class
      fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
      # create plot
      plt.plot(fpr, tpr, label='ROC curve')
      plt.plot([0, 1], [0, 1], 'k--') # Dashed diagonal
Figure 3-7. A ROC curve plotting FPR against TPR for all possible thresholds; the black circle highlights the chosen ratio (at 90% precision and 48% recall)
  • Trade-off: the higher the recall (TPR), the more false positives (higher FPR) the classifier produces.
  • A good classifier stays as far away from the dotted line (random classifier) as possible (toward the top-left corner) → Measure the area under the curve (AUC) ← roc_auc_score
  • Perfect classifier will have AUC = 1 (fit the rectangle).
  • The purely random classifier (dotted line) will have AUC = 0.5.
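A small sketch of the three AUC cases above. The scores are made up with NumPy (an assumption for illustration) rather than coming from a trained classifier.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(42)
y_true = np.array([0] * 500 + [1] * 500)

# Hypothetical scores: a perfect ranking, an informative one, and pure noise
perfect_scores = y_true.astype(float)
informative_scores = np.concatenate([rng.normal(0, 1, 500), rng.normal(2, 1, 500)])
random_scores = rng.normal(0, 1, 1000)

fpr, tpr, thresholds = roc_curve(y_true, informative_scores)

auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_informative = roc_auc_score(y_true, informative_scores)
auc_random = roc_auc_score(y_true, random_scores)
print(auc_perfect)      # 1.0
print(auc_informative)  # well above 0.5
print(auc_random)       # close to 0.5
```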

Precision/Recall Curve vs ROC curve

  • Use the precision/recall curve ← when the positive class is rare or when you care more about false positives than false negatives.
  • Otherwise, use ROC.
  • For example, Figure 3-7 displays a satisfactory ROC, but the PR curve suggests there is room for model enhancement (the curve could really be closer to the top-right corner).

Plot curves for Random Forest

  • The precision_recall_curve() function expects labels and scores for each instance, but RandomForestClassifier doesn’t have a decision_function() method. ← Use the probability of the positive class as a score.
    • y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
  • These are estimated probabilities, not actual probabilities ← not so good ← use sklearn.calibration to calibrate these estimates.
Figure 3-8. Comparing PR curves: the random forest classifier is superior to the SGD classifier because its PR curve is much closer to the top-right corner, and it has a greater AUC
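A minimal sketch of turning predict_proba() output into scores for precision_recall_curve(). The synthetic dataset and n_estimators=50 are assumptions to keep it small; with MNIST you would use X_train and y_train_5.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

# Hypothetical skewed dataset: 60 positives, 540 negatives
rng = np.random.default_rng(42)
y_demo = np.zeros(600, dtype=bool)
y_demo[:60] = True
X_demo = rng.normal(size=(600, 2))
X_demo[y_demo] += 3.0

forest_clf = RandomForestClassifier(n_estimators=50, random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_demo, y_demo, cv=3,
                                    method="predict_proba")
print(y_probas_forest.shape)  # (600, 2): one column per class

# Second column = estimated probability of the positive class, used as the score
y_scores_forest = y_probas_forest[:, 1]
precisions_forest, recalls_forest, thresholds_forest = precision_recall_curve(
    y_demo, y_scores_forest)
```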

Multiclass Classification (MC)

  • Multiclass classifiers = Multinomial classifiers = distinguish between more than 2 classes.
  • You can perform MC with multiple binary classifiers (BC).
    • one-versus-the-rest (OvR) or one-versus-all (OvA) strategy: instead of classifying 10 classes (0 to 9) directly, we train 10 BCs, one for each digit (0-detector, 1-detector,…). Then, take the class whose BC outputs the highest score. ← Most BC algorithms prefer this.
    • one-versus-one (OvO) strategy: train a BC for every pair of digits (0vs1, 0vs2,…, 1vs2,…), i.e. N×(N−1)/2 = 45 classifiers for 10 classes. Then the class winning the most duels is the class of the image. ← Advantage: each BC is only trained on the part of the training set containing its 2 classes. ← SVM prefers this.
    • Scikit-Learn automatically detects which strategy to use for the chosen BC.
      • from sklearn.svm import SVC
        svm_clf = SVC(random_state=42)
        svm_clf.fit(X_train[:2000], y_train[:2000])  # y_train, not y_train_5
  • ☝ Sometimes, simply scaling the inputs improves the results (discussed in Chap 2).
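To force a strategy explicitly, scikit-learn provides the OneVsRestClassifier and OneVsOneClassifier wrappers. The sketch below counts the underlying binary classifiers; it uses the small bundled load_digits dataset (8x8 images) instead of MNIST, which is my substitution to avoid a download.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X_digits, y_digits = load_digits(return_X_y=True)  # 10 classes
X_small, y_small = X_digits[:500], y_digits[:500]

ovr_clf = OneVsRestClassifier(SVC(random_state=42)).fit(X_small, y_small)
ovo_clf = OneVsOneClassifier(SVC(random_state=42)).fit(X_small, y_small)

print(len(ovr_clf.estimators_))  # 10: one detector per digit
print(len(ovo_clf.estimators_))  # 45: one per pair of digits, 10 * 9 / 2
```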

Error Analysis

Assuming you have a promising model, we'll explore ways to enhance it by analyzing its errors.
Plot the confusion matrix of the predictions ← a color diagram of the CM is much easier to analyze.
from sklearn.metrics import ConfusionMatrixDisplay
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
plt.rc('font', size=9)  # extra code – make the text smaller
ConfusionMatrixDisplay.from_predictions(
    y_train, y_train_pred,
    sample_weight=(y_train_pred != y_train),  # optional -> Fig 3-10
    normalize="true", values_format=".0%",  # optional -> Fig 3-9 right
)
plt.show()
Figure 3-9. Confusion matrix (left) and the same CM normalized by row (right)
From Figure 3-9: most images are on the main diagonal, indicating good results. However, the cell at row #5, col #5 appears darker; this could be because the model makes more errors on 5s or because there are fewer 5s in the dataset. Solution: use CM normalization. Result: only 82% of the images of 5s are classified correctly.
If you look carefully, you will notice that many digits have been misclassified as 8s, but this is not immediately obvious from this diagram. ← Make the errors stand out by putting zero weight on the correct predictions.
Figure 3-10. Confusion matrix with errors only, normalized by row (left) and by column (right). Eg. 53% = 535/(all incorrect predictions in row #5 of Fig 3-9 left). ← 53% of the errors the model made on images of 5s were misclassifications as 8s.
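The same "errors only" matrix can be computed as plain numbers with confusion_matrix(), which accepts the same sample_weight and normalize arguments. This is a sketch on the bundled load_digits dataset (my substitution for MNIST), so the percentages will differ from the figure.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X_digits, y_digits = load_digits(return_X_y=True)
y_pred = cross_val_predict(SGDClassifier(random_state=42), X_digits, y_digits, cv=3)

# Weight 1 for misclassified instances, 0 for correct ones
errors_only = (y_pred != y_digits).astype(int)
cm_errors = confusion_matrix(y_digits, y_pred,
                             sample_weight=errors_only, normalize="true")

# Each row now shows how the errors made on that digit are distributed
print(np.round(cm_errors, 2))
```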
→ Think more about reducing the false 8s:
  • More training data for digits that look like 8s (but are not) ← so the classifier learns to distinguish them from real 8s.
  • An algorithm for counting the number of closed loops (8 has 2, 6 has 1, 5 has none).
We can boost our training dataset via data augmentation, which tweaks images, like shifting or rotating. Other methods are also viable.
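A minimal sketch of one augmentation step, shifting an image by whole pixels with scipy.ndimage.shift. The helper name shift_image and the single-pixel test image are hypothetical; order=0 is my choice so that pixel values are copied exactly rather than interpolated.

```python
import numpy as np
from scipy.ndimage import shift

def shift_image(image_flat, dx, dy, side=28):
    """Shift a flattened square image by (dx, dy) pixels, padding with zeros."""
    image = image_flat.reshape(side, side)
    shifted = shift(image, [dy, dx], order=0, mode="constant", cval=0)
    return shifted.reshape(-1)

# Hypothetical 28x28 image with a single bright pixel at row 10, column 10
image = np.zeros((28, 28))
image[10, 10] = 255
moved = shift_image(image.reshape(-1), dx=1, dy=0).reshape(28, 28)
print(moved[10, 11])  # the bright pixel moved one column to the right
```

Each shifted copy keeps the label of the original image, so appending the copies to the training set enlarges it without new labeling work.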

Multilabel Classification

  • Classifier can output multiple classes for each instance.
  • Eg: face-recognition classifier: detects multiple faces in an image. ← it outputs [True, False, True] for Alice, Bob, Charlie in the image ← multilabel classification (outputs multiple binary tags)
  • Eg: KNeighborsClassifier to classify each image in MNIST into 2 labels — large (7,8,9) or odd.
    • import numpy as np
      from sklearn.neighbors import KNeighborsClassifier
      y_train_large = (y_train >= '7')
      y_train_odd = (y_train.astype('int8') % 2 == 1)
      y_multilabel = np.c_[y_train_large, y_train_odd]
      knn_clf = KNeighborsClassifier()
      knn_clf.fit(X_train, y_multilabel)
  • To evaluate, one way: measure F1 score of each label and then compute the average score.
  • ClassifierChain arranges binary classifiers into a chain, where each model predicts using the input features plus the predictions of all the models that come before it in the chain.
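A sketch of both ideas above: the macro-averaged F1 score over the labels, and a ClassifierChain. It runs on the bundled load_digits dataset (integer labels, my substitution for MNIST); wrapping SVC in the chain and the slice sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.multioutput import ClassifierChain
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X_digits, y_digits = load_digits(return_X_y=True)        # integer labels 0..9
y_multilabel = np.c_[y_digits >= 7, y_digits % 2 == 1]    # columns: [large, odd]

# F1 per label, then the unweighted average over the two labels
knn_clf = KNeighborsClassifier()
y_knn_pred = cross_val_predict(knn_clf, X_digits, y_multilabel, cv=3)
macro_f1 = f1_score(y_multilabel, y_knn_pred, average="macro")
print(macro_f1)

# Chain: the second model also sees the first model's prediction as a feature
chain_clf = ClassifierChain(SVC(), cv=3, random_state=42)
chain_clf.fit(X_digits[:1000], y_multilabel[:1000])
chain_pred = chain_clf.predict(X_digits[1000:1001])
print(chain_pred)  # one row: [large, odd]
```

Using average="weighted" instead would weight each label by its support, which matters when some labels are much rarer than others.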

Multioutput Classification

  • Multioutput–multiclass classification = Multioutput Classification.
  • A generalization of multilabel classification: each label can be multiclass (has more than 2 possible values).
  • Eg: A system that removes noise from images. Output: multilabel (one label per pixel) and each label can have multiple values (0 to 255).
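A rough sketch of such a denoising system with KNeighborsClassifier. It uses the bundled load_digits images (64 pixels with intensities 0 to 16) instead of MNIST, and the noise level and train/test split are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X_digits, _ = load_digits(return_X_y=True)  # pixel intensities 0..16
X_clean = X_digits.astype(int)

# Inputs are noisy images, targets are the clean images (one label per pixel)
rng = np.random.default_rng(42)
noise = rng.integers(0, 5, size=X_clean.shape)
X_noisy = X_clean + noise
y_clean = X_clean

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_noisy[:1500], y_clean[:1500])

clean_digit = knn_clf.predict(X_noisy[1500:1501])
print(clean_digit.shape)  # (1, 64): one predicted intensity per pixel
```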