## Information

This note serves as a reminder of the book's content, plus some additional research on the topics mentioned. It is not a substitute for the book. Most images are taken from the book or from the referenced sources.

I've noticed that taking notes on this site while reading significantly extends the time it takes to finish the book. I've stopped noting everything as I did in previous chapters; instead, I keep reading and just highlight or hand-write notes.

*I plan to return to the detailed style when I have more time.*

## List of notes for this book

- We use the MNIST dataset in this chapter = 70K small images of **handwritten digits**. ← the “Hello world” of ML.

- Download it from OpenML.org ← use `sklearn.datasets.fetch_openml`.

```
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)
# data contains images -> a dataframe isn't suitable, so as_frame=False
X, y = mnist.data, mnist.target
X.shape  # (70000, 784)
```

`sklearn.datasets` contains 3 types of functions:

- `fetch_*` functions such as `fetch_openml()` to download real-life datasets.
- `load_*` functions to load small toy datasets (no need to download).
- `make_*` functions to generate fake datasets.

- 70K images, 784 features. Each image = 28x28 pixels.

- Plot an image

```
import matplotlib.pyplot as plt

def plot_digit(image_data):
    image = image_data.reshape(28, 28)
    plt.imshow(image, cmap="binary")
    plt.axis("off")

some_digit = X[0]
plot_digit(some_digit)
plt.show()
```

- MNIST from `fetch_openml()` is already split into a training set (the first 60K images, already shuffled) and a test set (the last 10K).

`X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]`

- The training set is already shuffled ← good for cross-validation (all folds will be similar).

Let’s simplify the problem - “detect only the number 5” ← **binary classifier** (2 classes: 5 or non-5).

A good model to start with is the *stochastic gradient descent* (SGD, or stochastic GD) classifier ← `SGDClassifier` ← deals with training instances independently, one at a time ← handles large datasets efficiently, well suited for online learning.

```
from sklearn.linear_model import SGDClassifier

y_train_5 = (y_train == '5')  # True for all 5s, False for all other digits

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

sgd_clf.predict([some_digit])
```

Evaluating a classifier is often **significantly trickier** than evaluating a regressor!

Use `cross_val_score()` ← uses k-folds.

```
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
```

Wow, we get 95% accuracy with SGD, but is it good? → Let’s try `DummyClassifier` ← it classifies every single image into the most frequent class (non-5). Then use `cross_val_score` → 90% accuracy! Why? Because only about 10% of the images are 5s! ← If you always guess that an image is *not* a 5, you’re right 90% of the time! (A sketch of this baseline check is below.)
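
A minimal sketch of this dummy baseline (assuming `X_train` and `y_train_5` from earlier; by default `DummyClassifier` always predicts the most frequent class):

```
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

dummy_clf = DummyClassifier()  # default strategy: always predict the most frequent class
cross_val_score(dummy_clf, X_train, y_train_5, cv=3, scoring="accuracy")  # ~0.90
```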

→ Accuracy isn’t the preferred measure for classifiers, especially with **skewed datasets** (some classes are much more frequent than others). ← use a **confusion matrix** (CM). 👉 My note: Confusion matrix & f1-score.

Sometimes you can implement a custom cross-validation yourself to better control the measure ← use `StratifiedKFold` to perform *stratified sampling* (folds that preserve the percentage of samples for each class).

**General idea**: count the number of times instances of class A are classified as class B. E.g., to check how many times the classifier confuses 8s and 0s, check row #8, col #0 of the CM.

```
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
cm = confusion_matrix(y_train_5, y_train_pred)
```

`cross_val_predict()` returns the predictions made on each test fold.

```
array([[53892,   687],
       [ 1891,  3530]])
```

CM: each row → *actual class*, each column → *predicted class*.

TN = *True Negative*, FP = *False Positive* (Type I error), FN = *False Negative*, TP = *True Positive*.

→ A perfect classifier would **only have** TP and TN (it only predicts 5 and non-5 correctly, so FP = FN = 0)!

So, a more concise metric: look at the accuracy of the positive predictions ← **precision** ← How many of what we predict are right?

But what if we always make negative predictions (except for the single positive we are pretty sure about)? → precision would be 1/1 = 100%! ← This classifier isn’t useful because it ignores all but one positive instance. → precision should be used with another metric named **recall** (also called *sensitivity* or *true positive rate*, TPR). **Recall** = ratio of positive instances that are correctly detected by the classifier. ← Do we miss something?
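
A small sketch making the definitions explicit, computed from the `cm` obtained above:

```
tn, fp, fn, tp = cm.ravel()   # a binary confusion matrix flattens to [TN, FP, FN, TP]
precision = tp / (tp + fp)    # accuracy of the positive predictions
recall = tp / (tp + fn)       # fraction of actual positives that are detected
```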

Scikit-learn gives `precision_score` and `recall_score` to compute precision and recall.

```
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)  # 0.8370879772350012
recall_score(y_train_5, y_train_pred)  # 0.6511713705958311
```

When it claims an image represents a 5, it is correct only 83.7% of the time. Moreover, it only detects 65.1% of the 5s.

It’s convenient to combine precision and recall into a single metric called the **F1 score**. It’s the *harmonic mean* of them.

```
from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)  # 0.7325171197343846
```
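
For reference, the harmonic mean that `f1_score` computes, using the precision and recall values above:

```
f1 = 2 / (1 / precision + 1 / recall)   # = 2 * precision * recall / (precision + recall) ≈ 0.7325
```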

The F1 score is high only if both recall and precision are high. However, it isn’t always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall.

**Care about precision**: detecting videos that are safe for kids → a good classifier keeps only safe videos ↔ high precision ↔ fewer “wrongly detected” (FP) videos reach children. We don’t care if recall is low in this case (lots of good videos will be missed, but that’s no problem).

**Care about recall**: detecting shoplifters in surveillance images → a good classifier catches all shoplifters ↔ high recall ↔ fewer “allowed to pass” (FN). We don’t care if precision is low (a few false alerts, but we won’t miss the bad guys).

→ Unfortunately, increasing precision reduces recall and vice versa ← the **precision/recall trade-off**.

- Higher recall (lower threshold) → we don’t miss 5s, but we let many non-5s through. Conversely, higher precision (higher threshold) → there aren’t many non-5s, but we miss many 5s (lower recall).

- So which threshold should be used? → Figure 3-5.

**Strategy 2**: to select the precision/recall trade-off → plot precision against recall (see the sketch below).
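
A minimal sketch for getting the scores and the curve data used below (assuming `sgd_clf`, `X_train`, and `y_train_5` from earlier):

```
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

# decision scores instead of predictions, so we can vary the threshold ourselves
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
```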

- The choice of precision/recall trade-off depends on your project!

- Search for the lowest threshold that gives you at least 90% precision.

```
idx_for_90_precision = (precisions >= 0.90).argmax()  # first index where precision >= 90%
threshold_for_90_precision = thresholds[idx_for_90_precision]
```

- For many applications, 48% recall wouldn’t be great.

- ROC = *Receiver Operating Characteristic* ← a common tool used with binary classifiers.

**Specificity** = the proportion of actual negatives that are correctly identified, TN / (TN + FP). ← Used when we care about the TN values and don’t want false alarms (FP), e.g. a drug test.

- The ROC curve plots the TPR (True Positive Rate = Sensitivity = Recall) against the FPR (False Positive Rate = 1 − Specificity).

- Use `roc_curve`.

```
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
%matplotlib inline

# y_pred_prob: decision scores or positive-class probabilities for the labels in y_test
# (in this chapter, y_train_5 and the y_scores from cross_val_predict play these roles)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# create plot
plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--')  # dashed diagonal = random classifier
plt.show()
```

**Trade-off**: the higher the recall (TPR), the more false positives (FPR) the classifier produces.

- A good classifier stays as far away from the dotted line (random classifier) as possible (toward the top-left corner) → measure the *area under the curve* (**AUC**) ← `roc_auc_score` (see the sketch below).
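
A one-line sketch (assuming `y_train_5` and the `y_scores` computed earlier):

```
from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)  # 1.0 = perfect classifier, 0.5 = purely random
```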

- A perfect classifier will have AUC = 1 (its ROC curve fills the whole unit square).

- The purely random classifier (dotted line) will have AUC = 0.5.

- Use the precision/recall curve ← when the positive class is rare, or when you care more about false positives than false negatives.

- Otherwise, use ROC.

- For example, Figure 3-7 displays a satisfactory ROC curve, but the PR curve suggests there is room for improvement (it could be closer to the top-right corner).

- The `precision_recall_curve()` function expects labels and scores for each instance, but `RandomForestClassifier` doesn’t have a `decision_function()` method. ← use the probability of the positive class as a score (see the sketch after the next line).

`y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")`
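
A small sketch of turning those probabilities into scores for the curve (here `forest_clf` is assumed to be a fitted-by-cross-validation `RandomForestClassifier`, as in the line above):

```
from sklearn.metrics import precision_recall_curve

# predict_proba returns one column per class -> column 1 is the probability of the positive class
y_scores_forest = y_probas_forest[:, 1]
precisions_forest, recalls_forest, thresholds_forest = precision_recall_curve(
    y_train_5, y_scores_forest)
```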

- These are **estimated** probabilities, not actual probabilities ← not so good ← use `sklearn.calibration` to calibrate these estimates.

*Multiclass classifiers* = *multinomial classifiers* = distinguish between more than 2 classes.

- You can perform multiclass classification (MC) with multiple binary classifiers (BC).
  - *One-versus-the-rest* (OvR), or *one-versus-all* (OvA), strategy: instead of classifying 10 classes (0 to 9), we train 10 BCs, one for each digit (0-detector, 1-detector, …). Then take the class whose BC gives the highest score. ← Most BCs prefer this.
  - *One-versus-one* (OvO) strategy: train a BC for every pair of digits (0 vs 1, 0 vs 2, …, 1 vs 2, …). The class that wins the most duels is the class of the image. ← Advantage: each BC only trains on the part of the training set containing its 2 classes. ← SVMs prefer this.
- Scikit-Learn automatically detects which strategy to use for the chosen BC.

```
from sklearn.svm import SVC

svm_clf = SVC(random_state=42)
svm_clf.fit(X_train[:2000], y_train[:2000])  # y_train, not y_train_5
```
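
If you want to force a strategy instead of relying on the automatic choice, a hedged sketch using `OneVsRestClassifier` (there is an analogous `OneVsOneClassifier`):

```
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

ovr_clf = OneVsRestClassifier(SVC(random_state=42))
ovr_clf.fit(X_train[:2000], y_train[:2000])
ovr_clf.predict([some_digit])
```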

- ☝ Sometimes just scaling the inputs can improve the results (discussed in Chap 2); a sketch follows.
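
A minimal sketch of that scaling step; it also produces the `X_train_scaled` used in the confusion-matrix code below:

```
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype("float64"))
```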

Assuming you have a promising model, we'll explore ways to enhance it by analyzing its errors.

Plot the confusion matrix of the predictions ← a color diagram of the CM is much easier to analyze.

```
from sklearn.metrics import ConfusionMatrixDisplay

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
plt.rc('font', size=9)  # extra code – make the text smaller
ConfusionMatrixDisplay.from_predictions(
    y_train, y_train_pred,
    sample_weight=(y_train_pred != y_train),  # optional: weight only the errors -> Fig 3-10
    normalize="true", values_format=".0%"     # optional: normalize -> Fig 3-9 right
)
plt.show()
```

From Figure 3-9: most images fall on the main diagonal, indicating good results. However, row #5 and column #5 appear slightly darker, not necessarily due to poor performance but because there are fewer 5s in the dataset. Solution: normalize the CM. Result: only 82% of the 5s are classified correctly.

If you look carefully, you will notice that many digits have been misclassified as 8s, but this is not immediately obvious from this diagram ← fix it by putting zero weight on the correct predictions (the `sample_weight` line above).

→ Think more about reducing the false 8s:

- Gather more data for digits that look like 8s, so the classifier can learn to distinguish them from real 8s.

- An algorithm that counts the number of closed loops (8 has 2, 6 has 1, 5 has 0).

We can boost our training dataset via **data augmentation**, which tweaks the images, e.g. by shifting or rotating them (a small sketch below). Other methods are also viable.
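
A hedged sketch of one such augmentation, shifting an image by a few pixels with `scipy.ndimage.shift` (the `shift_image` helper name is mine):

```
from scipy.ndimage import shift

def shift_image(image, dx, dy):
    """Return a copy of a flattened 28x28 image shifted by (dx, dy) pixels."""
    image = image.reshape((28, 28))
    shifted = shift(image, [dy, dx], cval=0)
    return shifted.reshape([-1])

shifted_digit = shift_image(some_digit, dx=5, dy=0)  # shift 5 pixels to the right
```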

- A classifier can output multiple classes for each instance.

- E.g. a face-recognition classifier that detects multiple faces in an image ← it outputs `[True, False, True]` for Alice, Bob, Charlie (Alice and Charlie are in the image, Bob isn’t) ← **multilabel classification** (outputs multiple binary tags).

- E.g. use `KNeighborsClassifier` to classify each image in MNIST with 2 labels: large (7, 8, 9) or odd.

```
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= '7')
y_train_odd = (y_train.astype('int8') % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

knn_clf.predict([some_digit])
```

- To evaluate: one way is to measure the F1 score of each label and then compute the average score (sketch below).
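
A sketch of that averaging, following the same cross-validation pattern as before (`average="macro"` gives each label equal weight; this can be slow on the full set):

```
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")
```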

`ClassifierChain` arranges binary classifiers into a chain, where each model predicts using the input features plus the previous models' predictions (sketch below).
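
A minimal sketch of using it (assuming the `y_multilabel` targets from above; `cv=3` makes the chain train on out-of-fold predictions):

```
from sklearn.multioutput import ClassifierChain
from sklearn.svm import SVC

chain_clf = ClassifierChain(SVC(), cv=3, random_state=42)
chain_clf.fit(X_train[:2000], y_multilabel[:2000])
chain_clf.predict([some_digit])
```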

- Multioutput-multiclass classification = multioutput classification.

- Each label can be multiclass (has more than 2 possible values).

- E.g. a system that removes noise from images. Output: multilabel (one label per pixel) and each label can have multiple values (pixel intensity, 0 to 255). A sketch follows.
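
A hedged sketch of such a denoising setup, adapted from the chapter's idea (add random noise to the images and use the clean images as targets; reuses `KNeighborsClassifier` imported above):

```
import numpy as np

np.random.seed(42)
X_train_mod = X_train + np.random.randint(0, 100, (len(X_train), 784))  # noisy inputs
X_test_mod = X_test + np.random.randint(0, 100, (len(X_test), 784))
y_train_mod = X_train  # targets = the original clean images
y_test_mod = X_test

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[0]])  # a denoised digit (plot it with plot_digit)
```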