Reading: Hands-On ML - Quick notes (Chapter 4 — 9)

👉 List of all notes for this book. IMPORTANT UPDATE Nov 18, 2024: I've stopped taking detailed notes from the book and now only highlight and annotate directly in the PDF files/book. With so many books to read, I don't have time to type everything. In the future, if I make notes while reading a book, they'll contain only the most notable points (for me).

Chapter 4. Training Models

MSE cost function for linear regression model

Closed-form solution: nghiệm bằng công thức toán, trực tiếp

@ operator performs matrix multiplication. Dùng cho Numpy, TF, PyTorch, JAX, not on pure Python.

Normal eqn & SVD approach chậm với n (#features) nhưng lại khá nhanh với m (#instances). ← nhiều features, GD nhanh hơn

Gradient Descent: pp optimization tổng quát. Ý tưởng chính là tweak parameters iteratively để minimize cost functions. ← compute the local gradient và đi theo hướng dốc đó từ đỉnh đến đáy (đáy = min loss)

With feature scaling → go faster! ← StandardScaler

Batch gradient descent → use whole batch of training data at each step to computer the gradient

Learning rate: slow → too long, high → underfitting

Figure 4-8. Gradient descent with various learning rates

Find good learning rate → use grid search.
How many #epochs? → high but with condition (tolerance) to stop when gradient is tiny. Smaller tolerance → longer to train each epoch.

Batch GD → whole training set → slow → ngược lại hoàn toàn là Stochastic GD ← pick random one instance at each step.

Good: When cost functin is irregular → SGD helps jumping out of local min.
Bad: never come to optimal.

→ Adjust learning rate (use learning schedule): Start wide then smaller smaller. ← simulated annealing algo (luyện kim: kim loại nóng chảy làm nguội từ từ)

Some estimators also have a partial_fit() method that you can call to run a single round of training on one or more instances.

warm_start=True with fit() will continue to train where it left off.

Mini-batch GD: each step, based on 1 < mini batches < full

vs SGD: tận dụng lợi thế của GPU, opt matrix operations để có performance boost.

Polynomial Regression (khi data ko phải đường thẳng) ← PolynomialFeatures

How to tell a model is underfitting or overfitting?

Recall: use cross-validation → well on training but poor on cross-val → overfitting. Poor on both → underfitting.
Learning curves: plots of training errors and validation errors vs training iteration. ← learning_curve()

Figure 4-15. Learning curves ← no gap between 2 curves ← underfiting → both training and val curves are high

→ to fix: choose again: better model or better features.

Figure 4-16. Learning curves for the 10th-degree polynomial model ← overfitting → training curve is better and val curve

→ to fix: more data

Bias / variance trade-off

Bias (thành kiến) → wrong assumption (nghĩ linear nhưng lại là bật cao) ← high bias → underfitting
Variance → high variance → overfitting
Irreducible error ← data bị noise ← clean data

→ Gọi là trade-off vì tăng bật → tăng variance nhưng lại giảm bias và ngược lại.

A good way to reduce overfitting → regularized

Ridge regression (Ridge) ← use

Lasso regression (Least absolute shrinkage and selection operator regression, Lasso) ← use

An important characteristic of lasso regression is that it tends to eliminate the weights of the least important features (i.e., set them to zero).

Elastic net regression: a middle ground between ridge regression and lasso regression.

(source) Ridge regression can't zero coefficients, resulting in all or none in the model. Unlike this, LASSO provides parameter shrinkage and variable selection. For highly correlated covariates, consider the Elastic Net instead of the LASSO.
It’s important to scale the data before regularization.
Use what? → Avoid not using regulariztion; if only a few features are useful → lasso/elastic; when #features > #instances → use elastic.

Early stopping: stop training as long as the validation error reaches min.

Figure 4-20. Early stopping regularization

copy.deepcopy() → copies both model hyperparams & learned params whereas sklearn.base.clone() only copies hyperparams.

Logistic regression: we can use regression for classification (0/1) → estimate the probability an instance belongs to which class. ← gọi là “regression” vì là extension của linear regression (chỉ cần apply cái sigmoid function trước khi output)

ML Coursera 3 - w3: Logistic Regression | Notes of Thi
Regularized logistic regression | Notes of Thi
Sigmoid → 2 classes ← can be generalized to support multiple classes
Softmax → multiple classes ← softmax regression / multinomial logistic regression
Read more about these activation functions in this note.

Iris dataset: contains the sepal and petal length and width of 150 iris flowers of three different species: Iris setosa, Iris versicolor, and Iris virginica.

Figure 4-22. Flowers of three iris plant species

The softmax regression classifiier predicts only one class at a time (multiclass, not multioutput) ← cannot use it to recognize multiple people in one picture.

ScikitLearn’s LogisticRegression classifier uses softmax regression automatically when you train it on more than two classes.

Chapter 5. Support Vector Machines (SVG)

Note: Support Vector Machine (SVM)

capable of performing linear or nonlinear classification, regression, and even novelty detection

shine with small to medium-sized nonlinear datasets (i.e., hundreds to thousands of instances), especially for classification tasks

Linear SVG classification

Figure 5-1. Large margin classification. Left: The left plot shows the decision boundaries of three possible linear classifiers. Right: the solid line in the plot on the right represents the decision boundary of an SVM classifier

think of an SVM classifier as fitting the widest possible street
Add more instances not affect the decision boundary (only instances located on the edge ← support vectors)
SVMs are sensitive to the feature scales

Figure 5-2. Sensitivity to feature scales. Scaled (right) is much better.

Hard margin = strictly impose that all instances must be off the street and on the correct side → only works with linearly separable data + sensitive to outliers

Figure 5-3. Hard margin sensitivity to outliers

Margin violations = có nhiều data ở trong street.
soft margin classification = good balance between keeping the street as large as possible and limiting the margin violations
hyperparameters

Figure 5-4. Large margin (left) versus fewer margin violations (right)

If SVM overfitting → reduce !

C low → less overfitting, wider street, more support vectors, more violations. → (if too much) underfitting.

Nonlinear SVG Classification

Figure 5-5. Adding features to make a dataset linearly separable

Use kernel tricks: The kernel trick makes it possible to get the same result as if you had added many polynomial features, even with a very high degree, without actually having to add them.

Figure 5-7. SVM classifiers with a polynomial kernel

Another technique to tackle nonlinear problems is to add features computed using a similarity function ← How much each instance resembles a particular landmark. ← computationally expensive

❓Chưa hiểu ví dụ cái này lắm (p. 253)

Gaussian RBF Kernel

Figure 5-9. SVM classifiers using an RBF kernel

An illustration of using gamma. In high-gamma case, we only consider points nearby the hyperplane, it may cause an overfitting.

The LinearSVC class is based on the liblinear library ← optimized algo for linear SVG, not support kernels
The SVC class is based on the libsvm library ← support kernel trick

SVM Regression

Ngược cái classification (tạo cái street rộng nhất có thể, hạn chế violations rơi vào giữa street) thì cái regression là tạo cái street có thể chứa nhiều instances nhất có thể và cũng hạn chế violations rơi ra ngoài)

Figure 5-10. SVM regression, controlled by hyperparameter .

Ngược cái classification (support vectors ở bên trong để xác định margin), support vectors của regression thì ở bên ngoài cùng cũng để xác định margins. ← Reducing increase #support vectors and vice versa.

Under the hood of linear SVM Classifier

To make the street (margin) larger → make smaller.

Figure 5-12. A smaller weight vector results in a larger margin

To train an SVM: using quadratic programming problem OR using gradient descent OR solving the dual problem.
The dual problem:

The solution to the dual problem typically gives a lower bound to the solution of the primal problem, but under some conditions it can have the same solution as the primal problem ← SVM have these conditions.
The dual problem is faster to solve than the primal one when the number of training instances is smaller than the number of features. the dual problem makes the kernel trick possible, while the primal problem does not.

In ML, a kernel is a function capable of computing the dot product , based only on the original vectors and , without having to compute (or even to know about) the transformation .

Chapter 6. Decision Trees

Note: Decision Tree Classifier

Note: Decision Tree Regression

To visualize Decision Trees, use Graphviz.

DT doesn’t (or very little) require data preparations.

Scikit-learn uses the CART algo which produces only binary trees (node have max 2 children).

ID3 algo can produces nodes that have more than 2 children.

Figure 6-1. Iris decision tree

Figure 6-2. Decision tree decision boundaries

White box models = intuitive + decisions are easy to interpret. Eg: decision trees.

Black box models = otherwise. Eg: random forests and NN. ← hard to know what contributed to this prediction

the CART algorithm is a greedy algorithm. A greedy algorithm often produces a solution that’s reasonably good but not guaranteed to be optimal

Complexity of the prediction = ← independent of the number of features.

Complexity of the training = ← Comparing all features on all samples at each node results

predictions are very fast, even when dealing with large training sets.

By default, the DecisionTreeClassifier class uses the Gini impurity measure, but you can select the entropy impurity measure instead by setting the criterion hyperparameter to "entropy".

Use which one? → most of the time, they gives the same results. Gini is slightly faster.
when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees

Entropy → zeros when molecules are still and well ordered. In Information Theory, entropy is zero when all messages are identical. In ML, a set’s entropy is zero when it contains instances of only one class.

Decision tree makes very few assumption about the training data.

It’s nonparametric model ← not many params + free to stick closely to the data → most likely overfitting. parametric model (linear model) ← predetermined params + reduce overfitting but increase underfitting.

DT Regression (Note: Decision Tree Regression) ← The main difference is that instead of predicting a class in each node, it predicts a value

This prediction is the average target value of the 110 training instances associated with this leaf node (for value=0.111)

Figure 6-4. A decision tree for regression

Figure 6-5. Predictions of two decision tree regression models

predicted value for each region is always the average target value of the instances in that region.

Instead of minimize the impurity (for classification), we minimize MSE using CART algo.

Weakness:

DT loves orthogonal decision boundaries ← sensity to the data’s orientation.

Figure 6-7. Sensitivity to training set rotation

To overcome the sensitivity to data’s orientation → transform (using PCA) before.

Transform from this (Figure 6-2. Decision tree decision boundaries)

to this (Figure 6-8. A tree’s decision boundaries on the scaled and PCA-rotated iris dataset)

Decision Trees Have a High Variance ← small changes to the hyperparameters or to the data may produce very different models
Tuỳ vào cách chọn số features mà mỗi lần train có thể cho ra kết quả khác nhau ← dùng random forest sẽ tốt hơn (Note: ) vì sẽ lấy trung bình predictions over many trees.

Chapter 7. Ensemble Learning and Random Forests

Wisdom of the crowd = khi một câu trả lời kết hợp từ rất rất nhiều người tốt hơn 1 chuyên gia.

Ensemble = prediction of a group of predictors often gets better than with the best individual predictor.

Random forest = one of the most powerful ML algo available today.

they require very little preprocessing, they’re great for getting a prototype up and running quickly.

Voting Classifiers

Figure 7-2. Hard voting classifier predictions

Thậm chí kết hợp những weak learner (chả hơn random guess là bao) lại với nhau lại cho kết quả của 1 strong learner (high accuracy). ← giải thích bằng the law of large number

biased coin that has a 51% chance of coming up heads and 49% chance of coming up tails

Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms.

sklearn.ensemble.VotingClassifier — scikit-learn 1.4.2 documentation

Soft voting = If all classifiers are able to estimate class probabilities ← voting sẽ dựa trên prob của class để classify, nhiều weight hơn cho máy cái confident votes. Ví dụ: predictor dự đoán “1” prop là 60%, còn predictor dự đoán “2” có prop là 90% thì cái 90% sẽ có nhiều weight hơn.

Bagging and pasting ensembles

Use the same training algorithm for every predictor but train them on different random subsets of the training set.

Sampling with replacement (instances can be sampled several times for the same predictor) → bagging (boostrap aggregating) ← bootstrap param

Sampling without replacement → pasting

Both allows instances to be sampled several times across multiple predictors.

predictors can all be trained in parallel, via different CPU cores or even different servers.

Random forests

Note:

is an ensemble of decision trees, generally via the bagging method.

Random forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.

Boosting

combine several weak learners into a strong learner.

AdaBoost → correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfit. ← modify weights of instances

Figure 7-7. AdaBoost sequential training with instance weight updates

Figure 7-8. Decision boundaries of consecutive predictors. Bên trái là dùng learning_rate=1 nhưng áp dụng AdaBoost 5 lần cũng cho kq tương tự như bên phải nếu dùng lr=0.5. Cái sau fix cái trước.

Gradient Boosting: #AdaBoost ngoại trừ việc ko tập trung vào weights mà là residual errors.

1tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
2tree_reg1.fit(X, y)
3
4y2 = y - tree_reg1.predict(X)
5tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)
6tree_reg2.fit(X, y2)
7
8# ...
9
10sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

Figure 7-9. In this depiction of gradient boosting, the first predictor (top left) is trained normally, then each consecutive predictor (middle left and lower left) is trained on the previous predictor’s residuals; the right column shows the resulting ensemble’s predictions

GBRT = gradient tree boosting = gradient boosted regression trees.

→ Drawback: training cannot be parallelized. ← ko scale ngon bằng bagging và pasting.

Histogram-Based Gradient Boosting. It works by binning the input features, replacing them with integers.

HGB can train hundreds of times faster than regular GBRT on large datasets
binning causes a precision loss

TensorFlow Decision Forests ← take a look from TF

Stacking

It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation

Figure 7-12. Training the blender in a stacking ensemble

Chapter 8. Dimensionality Reduction (DR)

Too many features make the training slow + hard to find good solution ← curse of dimensionality ← in real world problem, it’s possible to reduce the #features. → speed up training

First try to train your system with the original data before considering using dimensionality reduction.

DR is good for data visualization too (reduce to 2/3 dim ← gain info like clusters.

Well, there’s just plenty of space in high dimensions. As a result, high-dimensional datasets are at risk of being very sparse → most training instances are likely to be far away from each other → new instance is far away from training instances → prediction is less reliable

More dimensions, more overfiting

in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions

2 main approaches to reduce dimensionality: projection and manifold learning (manifold = đa tạp).

Projection

From this (Figure 8-2. A 3D dataset lying close to a 2D subspace)

To this (Figure 8-3. The new 2D dataset after projection)

Manifold Learning

Figure 8-4. Swiss roll dataset ← 2D manifold = a 2D shape that can be bent and twisted in a higher-dimensional space

Figure 8-5. Squashing by projecting onto a plane (left) versus unrolling the Swiss roll (right)

→ Can use LocallyLinearEmbedding to unroll this Swiss (check LLE).

Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called manifold learning.

Dimensionality reduction often speeds up model training, but its impact on the solution quality varies with the dataset.

Figure 8-6. The decision boundary may not always be simpler with lower dimensions

PCA (Principal component analysis)

Note:
Idea: First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.
PCA identifies the axis that accounts for the largest amount of variance in the training set. The ith axis is called the ith principal component
how can you find the principal components of a training set → singular value decomposition (SVD)
PCA assumes that the dataset is centered around the origin.
explained_variance_ratio_ = The ratio indicates the proportion of the dataset’s variance that lies along each principal component.

1cumsum = np.cumsum(pca.explained_variance_ratio_)
2d = np.argmax(cumsum >= 0.95) + 1

1pca = PCA(n_components=0.95) # the ratio of variance you wish to preserve

Elbow

Figure 8-8. Explained variance as a function of the number of dimensions

Projecting down to d dimensions

Project sao cho preserve as much variance as possible.

Can use in a pipeline where using PCA to reduce the dimension + then apply a AI model. Use RandomizedSearchCV to find a good combination of hyperparameters for both.
It’s possible to inverse the PCA (pca.inverse_transform(X_reduced))
Incremental PCA: nhược điểm của preceding PCA là cần whole training set to fit in memory → use IncrementalPCA (split into mini batches) ← alternative, use NumPy’s memmap

Random Projection

using random linear projection.
two similar instances will remain similar after the projection, and two very different instances will remain very different.
johnson_lindenstrauss_min_dim & GaussianRandomProjection / SparseRandomProjection
no training is required
random projection is a simple, fast, memory-efficient, and surprisingly powerful dimensionality reduction algorithm that you should keep in mind, especially when you deal with high-dimensional datasets

LLE (locally linear embedding)

a nonlinear dimensionality reduction (NLDR) technique + does not rely on projections
LocallyLinearEmbedding to unroll Swiss roll.
How it works?

Tìm k-nn của each instance x trước → construct linear function of these neighbors ← points fixed ← find weights. ← weights ở đây cho thấy relationships between instances. ← chúng ta muốn preserve mấy relationship này as much as possible sau khi projection.
Sau khi có weights rùi, project training instances to d-dim space (d<n) + dựa vào cái weights đã tìm ở trên (weights fixed) → thiết lặp pt relationship và tìm các điểm xung quanh gần với x chiếu nhất.
Consider when data is nonlinear

Other Dimensionality Reduction Techniques

Multidimensional scaling (MDS)
Isomap
t-SNE ← used for visualization (visualize clusters of instances)
Linear discriminant analysis (LDA) ← good tech to resuce dim before running anothe classification algo.

Chapter 9 - Unsupervised Learning Techniques

Most are supervised learning. Huge potential in unsupervised learning.

the most common unsupervised learning task: dimensionality reduction

Other tasks: clustering, anomaly detection, density estimation (also used for anomaly detection, vissualization)

Classification vs clustering

Figure 9-1. Classification (left) versus clustering (right)

❓Use clustering in dimensionality reduction??? → read more

K-Means.

Note:
The computational complexity of the algorithm is generally linear with regard to the number of instances m, the number of clusters k, and the number of dimensions n
k-means is generally one of the fastest clustering algorithms.
Although the algorithm is guaranteed to converge, it may not converge to the right solution (to local min instead) ← improving the centroid initialization
Run the algo mutiple times with different random init and keep the best solution ← param n_init ← so sáng giữa các init bằng metric là inertia_

The inertia is not a good performance metric when trying to choose k because it keeps getting lower as we increase k.

.score() returns negative because it must follow Scikit’s rule “greater is better”
k-means++ (2006 paper) ← chọn init thông minh hơn (select centroids that are distant from one another) ← KMeans uses this by default.
algorithm="elkan" ← 2003: on some large dataset and many clusters ← bỏ qua mấy bước distance calculations → tuy nhiên pp này ko phải lúc nào cũng nhanh, thậm chí chậm, phụ thuộc dataset.
Using minibatch → MiniBatchKMeans (2010) ← speed up 3/4 times. → thích hợp cho clusters huge dataset.

Although the mini-batch k-means algorithm is much faster than the regular k-means algorithm, its inertia is generally slightly worse.

Choose #cluster k?

elbow ← ko phải là tối ưu

silhouette score ← more precise ← Read this note.

shape’s height indicates the number of instances in the cluster, and its width represents the sorted silhouette coefficients of the instances in the cluster (wider is better). The vertical dashed lines represent the mean silhouette score for each number of clusters. ← Nếu những instances có coeff nằm bên trái dashline → bad ← instance much close to other clusters

→ 4 & 5 là ok nhất, nếu muốn mấy clusters có similar sizes thì chọn 5.

k-means does not behave very well when the clusters have varying sizes, different densities, or nonspherical shapes ← depending on the data, different clustering algorithms may perform better
It is important to scale the input features (see Chapter 2) before you run k-means

HDBSCAN

Note:
Based on local density estimation.
This algorithm works well if all the clusters are well separated by low-density regions
It cannot predict which cluster a new instance belong to (doesn’t have predict() method, but fit_predict()) ← we can easily use with another classifier method like KNeighborsClassifier

1from sklearn.neighbors import KNeighborsClassifier
2
3knn = KNeighborsClassifier(n_neighbors=50)
4knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])

DBSCAN can identify any #clusters at any shape, robust to outliers, only 2 hyperparams.
Doesn’t scale well on large dataset ← computational complexity .
HDBSCAN (hierarchical DBSCAN) is better.

Using clusters for image segmentation (SOTA là dùng NN nhưng chapter này dùng clustering)

Image segmentation is the task of partitioning an image into multiple segments

Figure 9-12. Image segmentation using k-means with various numbers of color clusters

color segmentation (giống màu là segment) ← total forest in a region
semantic segmentation (cùng object type thì segment) ← safe driving car, segment người đi đường (pedestrian)
instance segmentation (cùng individual obect thì segment) ← mỗi ngưởi đi đường là 1 object riêng.

Using Clustering for Semi-Supervised Learning

Giả sử chỉ có 50 labeled instances (random) → 74.8% (train on full labeled dataset thì là 90.7%)
Cluster thành 50 clusters → chọn 50 hình đại diện (representative images) → train → 84.9%
Label all instances in each instance as its representative (label propagation) → train → 89.4% ← propagate labels auto: LabelSpreading & LabelPropagation or SelfTrainingClassifier
Ignore 1% of instances that are farthest from their cluster center (eliminate some outliers) → 90.9%

Active learning: nhiều strategies → most common là uncertainty sampling

Model trained on labeled instances → sau đó predict all other unlabeled instances
Instances mà model uncertain most (prop is lowest) → đưa cho expert for labeling
Iterate the process.

Other clustering methods:

Agglomerative clustering → This approach can capture clusters of various shapes; it also produces a flexible and informative cluster tree instead of forcing you to choose a particular cluster scale ← scale nicely to large #instances
BIRCH → for very large datasets + #features not too large (<20) ← use limit memory with large dataset
Note: ← has some features as DBSCAN (any #clusters of any shape) + 1 hyperparam (radius of the circle) + not suited for large datasets.
Affinity propagation ← instances repeatedly exchange messages between one another until every instance has elected another instance (or itself) to represent it. ← giống bầu cử: bầu cho kẻ giống mình và muốn kẻ đó thắng. ← không cần chọn #clusters trước. + diff-size clusters ← not suited for large dataset
Spectral clustering ← not good for large data + diff sizes.

Hard clustering: assigning each instance to a single cluster.

K-Means clustering is an example of a hard clustering algorithm where each data point is assigned to the cluster with the nearest centroid.
Each instance chỉ thuộc 1 cluster thui.

Soft clustering: give each instance a score per cluster, allowing for more flexibility in the assignment process

Mỗi instance có thể thuộc

Gaussian Mixtures model (GMM)

instances were generated from a mixture of several Gaussian distributions
All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid ← work great with clusters of this shape
Not work well with clusters with very diff shapes.
Idea of algo ← expectation-maximization

init cluster params randomly → assign instances to clusters → max step → update clusters
EM likes generalization of k-means ← not only find centers but also size, shape and orientation, also relative weights.

Like K-Means, EM can end up converging to poor solutions, so it needs to be run several times
predict() for hard clustering, predict_proba() for soft clustering.
GMM is a generative model (you can sample new instance from it) ← .sample()
also possible to estimate the density of the model at any given location
Use for anomaly ← any instance located in a low-density region can be considered an anomaly
Choose #cluster k?

KMeans use silhouette scores but GMM cannot use it because the clusters aren’t spherical or have diff sizes.
find the model that minimizes a theoretical information criterion
Dùng Bayesian Gaussian Mixture Models ← ước định trước 1 n_clusters lớn hơn n_clusters optimal, sau đó thuật toán sẽ tự eliminate những cái dư.

“probability” vs “likelihood” in stats: given a stats model with some param

“prob” → describes how likely future x is (knowing )
“likelihood” → describes how likely a set are after x known.
It is important to understand that the likelihood function is not a probability distribution

👉

Reading: Hands-On ML - Quick notes (Chapter 10 — 13)