- This chapter uses the MNIST dataset: 70K small images of **handwritten digits**. ← the “Hello world” of ML.

- Download it from OpenML.org ← use `sklearn.datasets.fetch_openml`.

```
from sklearn.datasets import fetch_openml

# The data are images, so a DataFrame isn't suitable -> as_frame=False
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist.data, mnist.target
X.shape  # (70000, 784)
```

`sklearn.datasets` contains 3 types of functions:

- `fetch_*` functions, such as `fetch_openml()`, to download real-life datasets.

- `load_*` functions to load small toy datasets (no need to download).

- `make_*` functions to generate fake datasets.
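To make the other two families concrete, here is a minimal sketch. It assumes `load_digits()` (a small 8x8 digits dataset bundled with Scikit-Learn) and `make_classification()` as representatives; the sizes passed to the generator are arbitrary illustration values, not something from the chapter.

```python
from sklearn.datasets import load_digits, make_classification

# load_*: a small toy dataset shipped with Scikit-Learn (no download needed)
digits = load_digits()
X_digits, y_digits = digits.data, digits.target
print(X_digits.shape)  # (1797, 64): 1,797 images of 8x8 pixels

# make_*: a synthetic dataset; the sizes are arbitrary illustration values
X_fake, y_fake = make_classification(n_samples=100, n_features=5, random_state=42)
print(X_fake.shape)  # (100, 5)
```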

- 70K images, 784 features. Each image = 28x28 pixels.

- Plot an image

```
import matplotlib.pyplot as plt

def plot_digit(image_data):
    image = image_data.reshape(28, 28)  # 784 features -> 28x28 pixels
    plt.imshow(image, cmap="binary")
    plt.axis("off")

some_digit = X[0]
plot_digit(some_digit)
plt.show()
```

- MNIST from `fetch_openml()` is already split into a training set (first 60K, already shuffled) and a test set (last 10K).

`X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]`

- The training set is already shuffled ← good for cross-validation (all folds will be similar).

Let’s simplify the problem: “detect only the number 5” ← a **binary classifier** (2 classes, 5 or non-5).

A good place to start is a ***stochastic gradient descent*** (SGD, or stochastic GD) classifier ← `SGDClassifier` ← it deals with training instances independently, one at a time ← handles large datasets efficiently and is well suited for online training.

```
from sklearn.linear_model import SGDClassifier

# Binary target: True for all 5s, False for all other digits (labels are strings)
y_train_5 = (y_train == '5')

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

sgd_clf.predict([some_digit])
```

Evaluating a classifier is often **significantly trickier** than evaluating a regressor!

Use `cross_val_score()` ← uses k-fold cross-validation.

```
from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
```

Wow, 95% accuracy with SGD, but is that good? → Let’s try `DummyClassifier` ← it classifies every single image in the most frequent class (non-5); then use `cross_val_score` → 90% accuracy! Why? Because only about 10% of the images are 5s! ← If you always guess that an image is *not* a 5, you’re right about 90% of the time!

→ Accuracy isn’t the preferred measure for classifiers, especially with **skewed datasets** (some classes are much more frequent than others). ← use a **confusion matrix** (CM) instead.

Sometimes you may want to implement cross-validation yourself for more control over the process ← use `StratifiedKFold` to perform *stratified sampling* (folds that preserve the percentage of samples for each class).

👉 My note: Confusion matrix & f1-score.
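The baseline comparison and the hand-written cross-validation can be sketched as follows. This is only an illustration under one loud assumption: the small `load_digits()` dataset stands in for MNIST so that it runs in seconds, so the exact scores will differ from the ones quoted in these notes.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: load_digits() (8x8 images) stands in for MNIST
X, y = load_digits(return_X_y=True)
y_5 = (y == 5)  # binary target: 5 vs. non-5 (about 10% positives)

# Baseline: always predict the most frequent class (non-5)
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_scores = cross_val_score(dummy_clf, X, y_5, cv=3, scoring="accuracy")
print(dummy_scores)  # close to 0.9 per fold, without learning anything

# Hand-written cross-validation with stratified folds
skfolds = StratifiedKFold(n_splits=3)
sgd_clf = SGDClassifier(random_state=42)
manual_scores = []
for train_idx, test_idx in skfolds.split(X, y_5):
    clone_clf = clone(sgd_clf)  # fresh, untrained copy for each fold
    clone_clf.fit(X[train_idx], y_5[train_idx])
    y_pred = clone_clf.predict(X[test_idx])
    manual_scores.append(np.mean(y_pred == y_5[test_idx]))  # fold accuracy
print(manual_scores)
```

Inside the loop you are free to compute any measure you like instead of accuracy, which is the main reason to write the loop by hand.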

**General idea**: count the number of times instances of class A are classified as class B. Eg. to check how many times the classifier confuses 8s and 0s, check row#8, col#0 of the CM.

```
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
cm = confusion_matrix(y_train_5, y_train_pred)
```

`cross_val_predict()` returns the predictions made on each test fold. The resulting confusion matrix:

```
array([[53892,   687],
       [ 1891,  3530]])
```

CM: each row → *actual class*, each column → *predicted class*.

TN = *True Negative*, FP = *False Positive* = Type I Error, FN = *False Negative*, TP = *True Positive*.

→ A perfect classifier would **only have** TP and TN (it predicts only the right 5s and non-5s, i.e. FP = FN = 0)!

So, a more concise metric: look at the accuracy of the positive predictions ← **precision** ← How many of our positive predictions are right?

But what if we always make negative predictions, except for the single positive we are most sure about → precision would be 1/1 = 100%? ← This classifier isn’t useful because it ignores all but one positive instance. → Precision should be used together with another metric named **recall** (also called *sensitivity* or the *true positive rate*, **TPR**).

**Recall** = ratio of positive instances that are correctly detected by the classifier. ← Do we miss something?

Scikit-learn gives `precision_score` and `recall_score` to compute precision and recall.

```
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)  # 0.8370879772350012
recall_score(y_train_5, y_train_pred)  # 0.6511713705958311
```

When it claims an image represents a 5, it is correct only 83.7% of the time. Moreover, it only detects 65.1% of the 5s.
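Both numbers follow directly from the four cells of the confusion matrix reported earlier. The short check below recomputes them by hand; it uses only the matrix values from these notes, and the variable names are my own.

```python
import numpy as np

# Confusion matrix from the notes: rows = actual class, columns = predicted class
cm = np.array([[53892, 687],
               [1891, 3530]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of predicted 5s that really are 5s
recall = tp / (tp + fn)     # share of actual 5s that were detected
print(round(precision, 4))  # 0.8371
print(round(recall, 4))     # 0.6512
```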

It’s convenient to combine precision and recall into a single metric called the **F1 score**. It’s the *harmonic mean* of the two.

```
from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)  # 0.7325171197343846
```

The F1 score is high only if both recall and precision are high. However, it’s not always the metric you want: in some contexts you mostly care about precision, and in others you really care about recall.
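To see why a harmonic mean behaves this way, here is a small sketch. The helper `f1()` is my own illustration (not a Scikit-Learn function), and the second call uses a made-up lopsided classifier.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported above for the 5-detector
f1_sgd = f1(0.8370879772350012, 0.6511713705958311)
print(round(f1_sgd, 4))  # 0.7325

# Hypothetical lopsided classifier: the arithmetic mean would be 0.55,
# but the harmonic mean stays close to the weaker of the two values
f1_lopsided = f1(1.0, 0.1)
print(round(f1_lopsided, 4))  # 0.1818
```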

**Care about precision**: detecting videos that are safe for kids → a good classifier keeps only safe videos ↔ high precision ↔ fewer “wrongly detected” (FP) videos reach children. We don’t care if recall is low in this case (lots of good videos will be rejected, but that’s no problem).

**Care about recall**: detecting shoplifters in surveillance images → a good classifier catches all shoplifters ↔ high recall ↔ fewer “allowed to pass” (FN). We don’t care if precision is low (a few false alerts, but we won’t miss the bad guys).

→ Unfortunately, increasing precision reduces recall and vice versa. ← **precision/recall trade-off**.

- Higher recall (lower threshold) → we don’t miss 5s, but we let many non-5s through. Conversely, higher precision (higher threshold) → few non-5s get through, but we miss many 5s (lower recall).
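The threshold idea can be tried directly with `decision_function()`, which returns one score per instance. The sketch below again assumes `load_digits()` as a fast stand-in for MNIST, and the "high" threshold is an arbitrary pick (half of the largest score), not a recommended value.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import recall_score

# Assumption: load_digits() stands in for MNIST to keep the sketch fast
X, y = load_digits(return_X_y=True)
y_5 = (y == 5)

sgd_clf = SGDClassifier(random_state=42).fit(X, y_5)

# One decision score per instance; predict() is equivalent to a threshold of 0
y_scores = sgd_clf.decision_function(X)
pred_default = (y_scores > 0)
same_as_predict = bool((pred_default == sgd_clf.predict(X)).all())

# Arbitrary higher threshold: it can only remove positive predictions,
# so recall cannot go up (and precision typically improves)
high_threshold = y_scores.max() / 2
pred_high = (y_scores > high_threshold)

recall_default = recall_score(y_5, pred_default)
recall_high = recall_score(y_5, pred_high)
print(recall_default, recall_high)
```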

- So which threshold should be used? → Figure 3-5.

**Strategy 2**: to select a precision/recall trade-off → plot precision against recall.

- The choice of precision/recall trade-off depends on your project!

- Search for the lowest threshold that gives you at least 90% precision.

```
# argmax() returns the index of the first True, i.e. the lowest qualifying threshold
idx_for_90_precision = (precisions >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]
```
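The snippet above relies on `precisions` and `thresholds`, which come from `precision_recall_curve()`. A self-contained sketch of the whole step might look like this; it assumes `load_digits()` in place of MNIST, so the recall you get at 90% precision will not match the figure quoted in these notes.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import cross_val_predict

# Assumption: load_digits() stands in for MNIST
X, y = load_digits(return_X_y=True)
y_5 = (y == 5)

# Out-of-fold decision scores instead of hard predictions
sgd_clf = SGDClassifier(random_state=42)
y_scores = cross_val_predict(sgd_clf, X, y_5, cv=3, method="decision_function")

# precisions and recalls have one more entry than thresholds
precisions, recalls, thresholds = precision_recall_curve(y_5, y_scores)

# Lowest threshold that reaches at least 90% precision
idx_for_90_precision = (precisions >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]

# Apply the threshold by hand instead of calling predict()
y_pred_90 = (y_scores >= threshold_for_90_precision)
precision_90 = precision_score(y_5, y_pred_90)
recall_90 = recall_score(y_5, y_pred_90)
print(precision_90, recall_90)
```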

- For many applications, 48% recall wouldn’t be great.

- ROC = *Receiver Operating Characteristic* ← a common tool used with binary classifiers.

**Specificity** (true negative rate) = TN / (TN + FP): how many of the actual negatives are correctly predicted as negative? ← It is used when we care about TN values and don't want to raise false alarms (FP), e.g. a drug test.

- ROC plots TPR (True Positive Rate) vs FPR (False Positive Rate) = Sensitivity (Recall) vs 1 – Specificity.

- Use `roc_curve`.

```
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
# In a Jupyter notebook, also run: %matplotlib inline

# y_test: true binary labels; y_pred_prob: scores/probabilities of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# create plot
plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--')  # Dashed diagonal
plt.show()
```
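The plotting snippet above assumes `y_test` and `y_pred_prob` already exist. As a self-contained sketch (again assuming `load_digits()` as a stand-in for MNIST, and skipping the plot), the following computes the ROC points and checks the relation between FPR and specificity at one threshold:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import cross_val_predict

# Assumption: load_digits() stands in for MNIST
X, y = load_digits(return_X_y=True)
y_5 = (y == 5)
y_scores = cross_val_predict(SGDClassifier(random_state=42), X, y_5, cv=3,
                             method="decision_function")

# One (FPR, TPR) point per threshold, from (0, 0) up to (1, 1)
fpr, tpr, thresholds = roc_curve(y_5, y_scores)
is_monotonic = bool(np.all(np.diff(fpr) >= 0) and np.all(np.diff(tpr) >= 0))

# At a single threshold (0 here), FPR is exactly 1 - specificity
y_pred = (y_scores >= 0)
tn, fp, fn, tp = confusion_matrix(y_5, y_pred).ravel()
specificity = tn / (tn + fp)
fpr_at_0 = fp / (fp + tn)
print(fpr_at_0, 1 - specificity)
```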

**Trade-off**: the higher the recall (TPR), the more false positives (higher FPR) the classifier produces.

- A good classifier stays as far away from the dotted line (random classifier) as possible (toward the top-left corner) → Measure the *area under the curve* (**AUC**) ← `roc_auc_score`

- A perfect classifier will have AUC = 1 (its curve fills the whole unit square).

- The purely random classifier (dotted line) will have AUC = 0.5.

- Use the precision/recall (PR) curve ← when the positive class is rare or when you care more about false positives than false negatives.

- Otherwise, use ROC.

- For example, Figure 3-7 displays a satisfactory ROC, but the PR curve suggests there is room for model enhancement (the curve could really be closer to the top-right corner).

- `precision_recall_curve()` expects labels and scores for each instance, but `RandomForestClassifier` doesn’t have a `decision_function()` method. ← use the probability of the positive class as a score.

`y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")`

- These are **estimated** probabilities, not actual probabilities ← not so good ← use `sklearn.calibration` to calibrate these estimates.
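Putting the last few bullets together, here is a hedged sketch of using a random forest's positive-class probability as the score for both the PR curve and the ROC AUC. It assumes `load_digits()` instead of MNIST and an arbitrary `n_estimators=50` to keep it quick.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, roc_auc_score
from sklearn.model_selection import cross_val_predict

# Assumption: load_digits() stands in for MNIST
X, y = load_digits(return_X_y=True)
y_5 = (y == 5)

# n_estimators=50 is an arbitrary choice to keep the example fast
forest_clf = RandomForestClassifier(n_estimators=50, random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X, y_5, cv=3,
                                    method="predict_proba")

# Column 0 = P(non-5), column 1 = P(5): keep the positive class as the score
y_scores_forest = y_probas_forest[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_5, y_scores_forest)
auc = roc_auc_score(y_5, y_scores_forest)
print(auc)
```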

*Multiclass classifiers* = *Multinomial classifiers* = distinguish between more than 2 classes.

- You can perform multiclass classification (MC) with multiple binary classifiers (BC).

- *one-versus-the-rest* (OvR) or *one-versus-all* (OvA) strategy: instead of classifying 10 classes (0 to 9) at once, we train 10 BCs, one for each digit (0-detector, 1-detector, …). Then take the class whose BC outputs the highest score. ← Most BCs prefer this.

- *one-versus-one* (OvO) strategy: train a BC for every pair of digits (0 vs 1, 0 vs 2, …, 1 vs 2, …). Then the class that wins the most duels is the class of the image. ← Advantage: each BC is trained only on the part of the training set containing its 2 classes. ← SVM likes this.

- Scikit-Learn automatically detects which strategy to use for the chosen BC.

```
from sklearn.svm import SVC

svm_clf = SVC(random_state=42)
svm_clf.fit(X_train[:2000], y_train[:2000])  # y_train, not y_train_5
```

- ☝ Sometimes, simply scaling the inputs can improve the results (discussed in Chap 2).
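The OvR and OvO strategies described above can also be forced explicitly with `OneVsRestClassifier` and `OneVsOneClassifier`. The sketch below assumes a 500-image slice of `load_digits()` rather than MNIST; it mainly shows how many binary classifiers each strategy trains for 10 classes.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Assumption: a small slice of load_digits() stands in for MNIST
X, y = load_digits(return_X_y=True)
X_small, y_small = X[:500], y[:500]

# OvR: one binary classifier per class
ovr_clf = OneVsRestClassifier(SVC(random_state=42)).fit(X_small, y_small)
print(len(ovr_clf.estimators_))  # 10

# OvO: one binary classifier per pair of classes, 10 * 9 / 2 = 45
ovo_clf = OneVsOneClassifier(SVC(random_state=42)).fit(X_small, y_small)
print(len(ovo_clf.estimators_))  # 45

# A plain SVC handles multiclass on its own and exposes one score per class
svm_clf = SVC(random_state=42).fit(X_small, y_small)
scores = svm_clf.decision_function(X_small[:1])
print(scores.shape)  # (1, 10)
print(svm_clf.classes_[scores.argmax()])  # class with the highest score
```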

Assuming you have a promising model, we'll explore ways to enhance it by analyzing its errors.

Plot the confusion matrix of the predictions ← a color diagram of the CM is much easier to analyze.

```
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_predict

# X_train_scaled: the training set after scaling the inputs
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
plt.rc('font', size=9)  # extra code – make the text smaller
ConfusionMatrixDisplay.from_predictions(
    y_train, y_train_pred,
    sample_weight=(y_train_pred != y_train),  # if set: errors only -> Fig 3.10
    normalize="true", values_format=".0%",  # if set: normalized by row -> Fig 3.9 right
)
plt.show()
```

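From Figure 3-9: most images are on the main diagonal, which indicates good results. However, row #5 and col #5 appear darker, which may not be due to poor performance but simply to there being fewer 5s in the dataset. Solution: normalize the CM (divide each value by the total number of images in its row). Result: only 82% of the 5s are classified correctly.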

If you look carefully, you will notice that many digits have been misclassified as 8s, but this is not immediately obvious from this diagram. ← make the errors stand out by putting zero weight on the correct predictions (Fig 3.10).

→ Focus on reducing the false 8s:

- More data for digits that look like 8s but are not ← so the classifier learns to distinguish them from real 8s.

- An algorithm for counting the number of closed loops (8 has 2, 6 has 1, 5 has none).

We can enlarge our training dataset via **data augmentation**, which tweaks existing images, for example by slightly shifting or rotating them. Other methods are also viable.

- A classifier can output multiple classes for each instance.

- Eg: a face-recognition classifier detects multiple faces in an image. ← it outputs `[True, False, True]` for Alice, Bob, Charlie in the image ← **multilabel classification** (outputs multiple binary tags).

- Eg: `KNeighborsClassifier` to classify each image in MNIST with 2 labels: large (7, 8, 9) or not, and odd or not.

```
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= '7')  # labels are strings: '7', '8', '9'
y_train_odd = (y_train.astype('int8') % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

knn_clf.predict([some_digit])  # one [large, odd] pair per instance
```

- To evaluate, one way: measure F1 score of each label and then compute the average score.
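One way to carry out that evaluation is `f1_score` with `average="macro"` (plain average over labels). The sketch below assumes `load_digits()` in place of MNIST, where the targets are integers rather than strings; `average="weighted"` is shown as an alternative that weights each label by its support.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Assumption: load_digits() stands in for MNIST (integer targets here)
X, y = load_digits(return_X_y=True)
y_multilabel = np.c_[y >= 7, y % 2 == 1]  # columns: large, odd

knn_clf = KNeighborsClassifier()
y_pred = cross_val_predict(knn_clf, X, y_multilabel, cv=3)

# "macro": F1 per label, then an unweighted average
f1_macro = f1_score(y_multilabel, y_pred, average="macro")
# "weighted": each label's F1 weighted by its number of positive instances
f1_weighted = f1_score(y_multilabel, y_pred, average="weighted")
print(f1_macro, f1_weighted)
```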

`ClassifierChain` (in `sklearn.multioutput`) arranges binary classifiers into a chain, where each model predicts using the input features plus the predictions of the previous models in the chain.
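A minimal usage sketch, assuming the same two labels (large, odd) built from `load_digits()` rather than MNIST, and `SVC` as the base estimator (any classifier could be used instead):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.multioutput import ClassifierChain
from sklearn.svm import SVC

# Assumption: load_digits() stands in for MNIST
X, y = load_digits(return_X_y=True)
y_multilabel = np.c_[y >= 7, y % 2 == 1]  # columns: large, odd

# cv=3: each model in the chain is trained on out-of-fold predictions
# of the previous models rather than on the true labels
chain_clf = ClassifierChain(SVC(), cv=3, random_state=42)
chain_clf.fit(X[:1000], y_multilabel[:1000])

chain_pred = chain_clf.predict(X[1000:1001])
print(chain_pred)  # one row with a 0/1 value per label
```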

- Multioutput-multiclass classification = Multioutput classification.

- Each label can be multiclass (has more than 2 possible values).

- Eg: a system that removes noise from images. Output: multilabel (one label per pixel) and each label can have multiple values (0 to 255).
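The denoising example can be sketched with a `KNeighborsClassifier` trained to map noisy images to clean ones. This is only an illustration under several assumptions: `load_digits()` (8x8 images with intensities 0 to 16) replaces MNIST, the noise level is arbitrary, and the train/test split at index 1500 is my own choice.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

# Assumption: load_digits() stands in for MNIST; intensities are 0..16 here
X, _ = load_digits(return_X_y=True)
X = X.astype(np.int64)  # integer pixel values are used as class labels

# Inputs = noisy images, targets = the original clean images
rng = np.random.default_rng(42)
noise = rng.integers(0, 5, size=X.shape)  # arbitrary noise level
X_noisy = X + noise

X_train_noisy, y_train_clean = X_noisy[:1500], X[:1500]
X_test_noisy, y_test_clean = X_noisy[1500:], X[1500:]

# 64 outputs (one per pixel), each one a multiclass label
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_noisy, y_train_clean)

clean_digit = knn_clf.predict(X_test_noisy[:1])
print(clean_digit.reshape(8, 8))
```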