Reading: Hands-On ML - Chap 3: Classification

Anh-Thi Dinh
👉 List of all notes for this book. IMPORTANT UPDATE Nov 18, 2024: I've stopped taking detailed notes from the book and now only highlight and annotate directly in the PDF files/book. With so many books to read, I don't have time to type everything. In the future, if I make notes while reading a book, they'll contain only the most notable points (for me).
📔
Jupyter notebook for this chapter: on Github, on Colab, on Kaggle.

MNIST

  • We use the MNIST dataset in this chapter = 70K small images of handwritten digits. ← the “Hello world” of ML.
  • Download from OpenML.org. ← use sklearn.datasets.fetch_openml
    • from sklearn.datasets import fetch_openml

      mnist = fetch_openml('mnist_784', as_frame=False)
      # data contains images -> a dataframe isn't suitable, so as_frame=False
      X, y = mnist.data, mnist.target
      X.shape  # (70000, 784)
  • sklearn.datasets contains 3 types of functions:
    • fetch_* functions such as fetch_openml() to download real-life datasets.
    • load_* functions to load small toy datasets (no need to download)
    • make_* functions to generate fake datasets.
  • 70K images, 784 features. Each image = 28x28 pixels.
  • Plot an image
    • import matplotlib.pyplot as plt

      def plot_digit(image_data):
          image = image_data.reshape(28, 28)
          plt.imshow(image, cmap="binary")
          plt.axis("off")

      some_digit = X[0]
      plot_digit(some_digit)
      plt.show()
y[0]  # '5' (labels are strings)
  • MNIST from fetch_openml() is already split into a training set (first 60K, already shuffled) and test set (last 10K).
    • X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
  • The training set is already shuffled ← good for cross-validation (all folds will be similar).

Training a Binary Classifier

Let’s simplify the problem - “detect only the number 5” ← binary classifier (2 classes, 5 or non-5).
A good classifier to start with is the stochastic gradient descent (SGD) classifier ← SGDClassifier ← deals with training instances independently, one at a time ← handles large datasets efficiently, well suited for online learning.
from sklearn.linear_model import SGDClassifier

y_train_5 = (y_train == '5')  # True for all 5s, False for all other digits

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

sgd_clf.predict([some_digit])

Performance Measures

Evaluating a classifier is often significantly trickier than evaluating a regressor!

Measuring Accuracy Using Cross-Validation

Use cross_val_score() ← uses k-fold cross-validation.
from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
Wow, we get 95% accuracy with SGD, but is it actually good? → Let’s try DummyClassifier ← it classifies every single image into the most frequent class (non-5). Running cross_val_score on it → over 90% accuracy! Why? Because only about 10% of the images are 5s ← if you always guess that an image is not a 5, you’re right about 90% of the time.
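A quick sketch of that baseline check (reusing X_train and y_train_5 from above; the exact accuracy depends on the run):
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

dummy_clf = DummyClassifier()  # default strategy predicts the most frequent class
dummy_clf.fit(X_train, y_train_5)
print(any(dummy_clf.predict(X_train)))  # False: it never predicts a 5
cross_val_score(dummy_clf, X_train, y_train_5, cv=3, scoring="accuracy")  # ~90%+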
→ Accuracy isn’t the preferred measure for classifiers, especially with skewed datasets (some classes are much more frequent than others). ← use a confusion matrix (CM) 👈 My note: Confusion matrix & f1-score.
Sometimes you may want to implement a custom cross-validation yourself to get more control over the measure ← use StratifiedKFold to perform stratified sampling (folds that preserve the percentage of samples for each class). A sketch follows below.
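A minimal sketch of such a custom loop, assuming sgd_clf, X_train and y_train_5 from above:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3)  # add shuffle=True if the dataset is not already shuffled

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)  # fresh copy of the classifier for each fold
    X_train_folds, y_train_folds = X_train[train_index], y_train_5[train_index]
    X_test_fold, y_test_fold = X_train[test_index], y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    print(sum(y_pred == y_test_fold) / len(y_pred))  # accuracy on this fold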

Confusion Matrices

General idea: count the number of times instances of class A are classified as class B. E.g., to see how many times the classifier confuses 8s with 0s, look at row #8, column #0 of the CM.
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
cm = confusion_matrix(y_train_5, y_train_pred)
cross_val_predict() returns the predictions made on each test fold.
array([[53892,   687],
       [ 1891,  3530]])
CM. Each row → actual class, each column → predicted class.
                               predicted non-5       predicted 5
actual non-5 (negative class)  53892 (TN)            687 (FP - Type I)
actual 5 (positive class)      1891 (FN - Type II)   3530 (TP)
TN = True Negative, FP = False Positive = Type I Error, FN = False Negative, TP = True Positive.
→ A perfect classifier would have only TP and TN (it predicts every 5 and non-5 correctly, i.e., FP = FN = 0)!
So, a more concise metric: look at the accuracy of the positive predictions ← precision = TP / (TP + FP). How many of the positives we predict are actually right?
But what if we always make negative predictions, except for the single positive we’re most sure about? → precision would be 1/1 = 100%! ← Such a classifier isn’t useful because it ignores all but one positive instance. → precision should be used together with another metric named recall (also called sensitivity or true positive rate - TPR).
Recall = ratio of positive instances that are correctly detected by the classifier = TP / (TP + FN). ← Do we miss anything?
Figure 3-3. An illustrated confusion matrix.

Precision and Recall

Scikit-learn gives precision_score and recall_score to compute precision and recall.
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)  # 0.8370879772350012
recall_score(y_train_5, y_train_pred)  # 0.6511713705958311
When it claims an image represents a 5, it is correct only 83.7% of the time. Moreover, it only detects 65.1% of the 5s.
It’s convenient to combine precision and recall into a single metric called the F1 score: the harmonic mean of the two, F1 = 2 × precision × recall / (precision + recall).
from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)  # 0.7325171197343846
The F1 score is high only if both recall and precision are high. However, it’s not always the metric you want: in some contexts you mostly care about precision, and in other contexts you really care about recall.
  • Care about precision: detecting videos that are safe for kids → a good classifier keeps only safe videos ↔ high precision ↔ fewer unsafe videos wrongly let through (FP) reach children. We don’t care if recall is low in this case (lots of good videos will be rejected, but that’s acceptable).
  • Care about recall: detecting shoplifters in surveillance images → a good classifier catches (nearly) all shoplifters ↔ high recall ↔ fewer shoplifters slipping through (FN). We don’t care if precision is low (a few false alerts, but we won’t miss the bad guys).
→ Unfortunately, increasing precision reduces recall and vice versa. ← precision/recall trade-off.

Precision/Recall Trade-off

Figure 3-4. The precision/recall trade-off (SGDClassifier): images are ranked by their classifier score; those above the chosen decision threshold are considered positive. The higher the threshold, the lower the recall, but (in general) the higher the precision.
  • Higher recall (lower threshold) → we don’t miss 5s, but we let many non-5s in. Conversely, higher precision (higher threshold) → there aren’t many non-5s, but we miss many 5s (lower recall).
  • So which threshold should be used? → Figure 3-5.
    • Figure 3-5. Precision and recall versus the decision threshold. Threshold = 3000.
  • Another strategy to select a good precision/recall trade-off: plot precision directly against recall.
    • Figure 3-6. Precision versus recall. → Observe: From 0.8 of recall, precision falls sharply.
  • The choice of precision/recall trade-off depends on your project!
  • Search for the lowest threshold that gives you at least 90% precision (see the sketch after this list for how precisions and thresholds are computed).
    • idx_for_90_precision = (precisions >= 0.90).argmax()
      threshold_for_90_precision = thresholds[idx_for_90_precision]
  • At that threshold, recall is only about 48% ← for many applications, that wouldn’t be great.
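A minimal end-to-end sketch of this threshold search, assuming sgd_clf, X_train and y_train_5 from above (the other variable names are just illustrative):
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

# get decision scores instead of predictions, so we can pick our own threshold
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

idx_for_90_precision = (precisions >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]

y_train_pred_90 = (y_scores >= threshold_for_90_precision)
precision_score(y_train_5, y_train_pred_90)  # >= 0.90 by construction
recall_score(y_train_5, y_train_pred_90)     # ~0.48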

The ROC Curve

  • ROC = Receiver Operating Characteristic. ← common tool used with binary classifiers.
  • Specificity (true negative rate) = ratio of actual negatives that are correctly classified = TN / (TN + FP). ← Used when we care about TN and want to avoid false alarms (FP), e.g., a drug test.
  • The ROC curve plots TPR (true positive rate = sensitivity = recall) against FPR (false positive rate = 1 − specificity) for all possible thresholds.
  • Use roc_curve.
    • from sklearn.metrics import roc_curve
      import matplotlib.pyplot as plt
      %matplotlib inline

      # y_scores = decision scores (see the threshold sketch earlier)
      fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

      plt.plot(fpr, tpr, label='ROC curve')
      plt.plot([0, 1], [0, 1], 'k--')  # dashed diagonal = random classifier
      plt.show()
Figure 3-7. A ROC curve plotting FPR against TPR for all possible thresholds; the black circle highlights the chosen point (at 90% precision and ~48% recall)
  • Trade-off: the higher the recall (TPR), the more false positives (FPR) the classifier produces.
  • A good classifier stays as far away from the dotted line (a purely random classifier) as possible (toward the top-left corner) → measure the area under the curve (AUC) ← roc_auc_score (sketch after this list).
  • A perfect classifier has AUC = 1 (the curve fills the whole unit square).
  • A purely random classifier (the dotted diagonal) has AUC = 0.5.
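A short sketch of the AUC computation, reusing y_scores from the threshold sketch above:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)  # the closer to 1, the better the classifier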
 

Precision/Recall Curve vs ROC curve

Figure 3-6. Precision versus recall. → Observe: From 0.8 of recall, precision falls sharply.
Figure 3-7. A ROC curve plotting FPR against TPR for all possible thresholds; the black circle highlights the chosen ratio (at 90% precision and 48% recall)
  • Use the precision/recall curve ← when the positive class is rare or when you care more about FP than FN.
  • Otherwise, use ROC.
  • For example, Figure 3-7 displays a satisfactory ROC, but the PR curve suggests there is room for model enhancement (the curve could really be closer to the top-right corner).

Plot curves for Random Forest

  • precision_recall_curve() expects labels and scores for each instance, but RandomForestClassifier doesn’t have a decision_function() method. ← use the probability of the positive class as the score (full sketch after this list).
    • y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
  • These are estimated probabilities, not actual probabilities ← they may be poorly calibrated ← use sklearn.calibration to calibrate these estimates.
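A sketch of the full comparison, assuming X_train, y_train_5 and the precisions/recalls of the SGD classifier from the earlier threshold sketch; forest_clf is defined here:
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
y_scores_forest = y_probas_forest[:, 1]  # probability of the positive class, used as the score

precisions_forest, recalls_forest, thresholds_forest = precision_recall_curve(y_train_5, y_scores_forest)

plt.plot(recalls_forest, precisions_forest, label="Random Forest")
plt.plot(recalls, precisions, "--", label="SGD")  # PR curve computed earlier
plt.xlabel("Recall"); plt.ylabel("Precision"); plt.legend()
plt.show()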
Figure 3-8. Comparing PR curves: the random forest classifier is superior to the SGD classifier because its PR curve is much closer to the top-right corner, and it has a greater AUC

Multiclass Classification (MC)

  • Multiclass classifiers = Multinomial classifiers = distinguish between more than 2 classes.
  • You can perform MC with multiple binary classifiers (BC).
    • one-versus-the-rest (OvR) or one-versus-all (OvA) strategy: instead of classifying 10 classes (0 to 9) directly, we train 10 BCs, one per digit (0-detector, 1-detector, …), then take the class whose BC outputs the highest score. ← preferred for most BC algorithms.
    • one-versus-one (OvO) strategy: train a BC for every pair of digits (0 vs 1, 0 vs 2, …, 1 vs 2, …), then the class winning the most duels is the predicted class. ← Advantage: each BC only needs to be trained on the part of the training set containing its 2 classes. ← preferred for SVMs (they scale poorly with training set size).
    • Scikit-Learn automatically detects which strategy to use for the chosen BC (you can also force one; see the sketch after this list).
      • from sklearn.svm import SVC

        svm_clf = SVC(random_state=42)
        svm_clf.fit(X_train[:2000], y_train[:2000])  # y_train, not y_train_5
  • ☝ Sometimes, just scaling the inputs can improve the results (discussed in Chap 2).
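If you want to force a specific strategy, here is a small sketch using OneVsRestClassifier (there is an analogous OneVsOneClassifier); the 2000-sample slice just keeps training fast, as above:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

ovr_clf = OneVsRestClassifier(SVC(random_state=42))  # force OvR instead of SVC's default OvO
ovr_clf.fit(X_train[:2000], y_train[:2000])
ovr_clf.predict([some_digit])
len(ovr_clf.estimators_)  # 10 underlying binary classifiers, one per digit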

Error Analysis

Assuming you have a promising model, we'll explore ways to enhance it by analyzing its errors.
Plot the confusion matrix of the predictions ← a color diagram of the CM is much easier to analyze.
from sklearn.metrics import ConfusionMatrixDisplay

# X_train_scaled = standardized inputs (e.g., StandardScaler on X_train.astype("float64"))
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
plt.rc('font', size=9)  # make the text smaller
ConfusionMatrixDisplay.from_predictions(
    y_train, y_train_pred,
    sample_weight=(y_train_pred != y_train),  # optional: keep errors only -> Fig 3-10
    normalize="true", values_format=".0%",    # optional: normalize by row -> Fig 3-9 right
)
plt.show()
Figure 3-9. Confusion matrix (left) and the same CM normalized by row (right)
From Figure 3-9: most images lie on the main diagonal, indicating good results. However, row #5 and column #5 look slightly darker ← not necessarily poor performance; there are simply fewer 5s in the dataset ← normalize the CM by row. Result: only 82% of the 5s were classified correctly.
If you look carefully, you will notice that many digits have been misclassified as 8s, but this is not immediately obvious from this diagram ← put zero weight on the correct predictions (the sample_weight line above) to keep only the errors.
Figure 3-10. Confusion matrix with errors only, normalized by row (left) and by column (right). E.g., 53% = 535 / (all incorrect predictions on row #5 of Fig 3-9 left). ← 53% of the errors the model made on images of 5s were misclassifications as 8s.
→ Focus on reducing the false 8s:
  • Gather more training data for digits that look like 8s, so the classifier learns to distinguish them from real 8s.
  • An algorithm to count the number of closed loops (8 has two, 6 has one, 5 has none).
We can also boost the training set via data augmentation, which tweaks the images, e.g., shifting or rotating them (a small sketch below). Other methods are also viable.
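A minimal sketch of one such augmentation (shifting), assuming SciPy is available; shift_image is an illustrative helper:
from scipy.ndimage import shift

def shift_image(image, dx, dy):
    # shift a flattened 28x28 MNIST image by (dx, dy) pixels, filling with black
    image = image.reshape((28, 28))
    shifted = shift(image, [dy, dx], cval=0)
    return shifted.reshape([-1])

# e.g., add 4 shifted copies (left/right/up/down) of a training image
augmented = [shift_image(X_train[0], dx, dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]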

Multilabel Classification

  • Classifier can output multiple classes for each instance.
  • E.g., a face-recognition classifier that detects several people in the same image ← it outputs [True, False, True] for Alice, Bob, Charlie ← multilabel classification (outputs multiple binary tags).
  • E.g., KNeighborsClassifier to classify each image in MNIST into 2 labels — large (7, 8, 9) and odd.
    • import numpy as np
      from sklearn.neighbors import KNeighborsClassifier

      y_train_large = (y_train >= '7')
      y_train_odd = (y_train.astype('int8') % 2 == 1)
      y_multilabel = np.c_[y_train_large, y_train_odd]

      knn_clf = KNeighborsClassifier()
      knn_clf.fit(X_train, y_multilabel)

      knn_clf.predict([some_digit])
  • To evaluate, one approach: measure the F1 score of each label and then average the scores (sketch after this list).
  • ClassifierChain (in sklearn.multiclass) arranges binary classifiers into a chain, where each model makes its prediction using the input features plus the predictions of the previous models in the chain.
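A small evaluation sketch under that approach, assuming knn_clf and y_multilabel from the block above:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")  # use "weighted" to give more weight to frequent labels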

Multioutput Classification

  • Multioutput-multiclass classification (= multioutput classification) is a generalization of multilabel classification where each label can be multiclass (i.e., have more than 2 possible values).
  • E.g., a system that removes noise from images. Output: multilabel (one label per pixel), and each label can take many values (pixel intensity from 0 to 255). A sketch follows below.
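A rough sketch of that denoising setup with k-NN, assuming X_train and X_test from MNIST; the noise level and variable names are illustrative:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(42)
X_train_mod = X_train + np.random.randint(0, 100, (len(X_train), 784))  # noisy inputs
X_test_mod = X_test + np.random.randint(0, 100, (len(X_test), 784))
y_train_mod = X_train  # targets = the clean images (one label per pixel, values 0..255)

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[0]])  # predicted clean image
plot_digit(clean_digit)  # reuse the helper defined in the MNIST section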