Reading: Hands-On ML - Quick notes (from Chapter 4)

Anh-Thi Dinh
draft
I've found that taking notes on this site while reading a book significantly increases the time it takes to finish. I've stopped documenting everything as I did in previous chapters. Now, I use this note as a single place to record what I find interesting (for my personal reference). Unlike previous chapters, this is not intended for sharing but rather a rough draft (a mix of Vietnamese, English and even French). While some might find it helpful, I don't recommend using these notes.
This book contains 1007 pages of readable content. If you read at a pace of 10 pages per day, it will take you approximately 3.3 months (without missing a day) to finish it. If you aim to complete it in 2 months, you'll need to read at least 17 pages per day.

List of notes for this book


Chapter 4 - Training Models

  • Closed-form solution: a solution given directly by a mathematical formula.
  • The @ operator performs matrix multiplication. It works in NumPy, TF, PyTorch, JAX, but not on pure Python lists.
  • The Normal equation & the SVD approach are slow with respect to n (#features) but quite fast with respect to m (#instances). ← with many features, GD is faster
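A minimal sketch (toy data and variable names are my own) of the closed-form normal equation, using the @ operator for the matrix products:

```python
import numpy as np

# Toy linear data: y ≈ 4 + 3x + Gaussian noise
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)

X_b = np.c_[np.ones((m, 1)), X]    # add the bias feature x0 = 1 to every instance
# Normal equation: theta_hat = (X^T X)^(-1) X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_hat)                   # roughly [[4.], [3.]]
```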
  • Gradient Descent: a generic optimization method. The main idea is to tweak parameters iteratively to minimize the cost function. ← compute the local gradient and follow that downhill direction from the top to the bottom (bottom = min loss)
  • With feature scaling → go faster! ← StandardScaler
  • Batch gradient descent → uses the whole training set at each step to compute the gradient
    • Learning rate: too low → training takes too long, too high → the algorithm may diverge (jumps across the valley)
      • Figure 4-8. Gradient descent with various learning rates
    • Find good learning rate → use grid search.
    • How many #epochs? → set a large number but with a stopping condition (tolerance): stop when the gradient becomes tiny. A smaller tolerance → training takes longer to converge.
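A rough batch gradient descent sketch on toy linear-regression data (the learning rate and epoch count are arbitrary choices of mine, not the book's):

```python
import numpy as np

m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]     # add the bias feature x0 = 1

eta = 0.1                            # learning rate
theta = np.random.randn(2, 1)        # random initialization
for epoch in range(1000):
    # gradient of the MSE cost computed on the WHOLE training set
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients
print(theta)                         # roughly [[4.], [3.]]
```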
  • Batch GD → whole training set → slow → the complete opposite is Stochastic GD ← pick one random instance at each step.
    • Good: when the cost function is irregular → SGD helps jump out of local minima.
    • Bad: it never settles at the optimum.
    • → Adjust the learning rate (use a learning schedule): start large, then gradually make it smaller and smaller. ← like the simulated annealing algorithm (in metallurgy: molten metal is cooled down slowly)
    • Some estimators also have a partial_fit() method that you can call to run a single round of training on one or more instances.
      • warm_start=True with fit() will continue to train where it left off.
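A hedged example of incremental training with scikit-learn's SGDRegressor (the toy data and hyperparameter values are mine, not from the book):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

X = np.random.rand(200, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * np.random.randn(200)

# partial_fit(): one round of training on the given instances per call
sgd = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=42)
for _ in range(50):
    sgd.partial_fit(X, y)

# warm_start=True: each fit() call resumes from where the previous one stopped
sgd2 = SGDRegressor(max_iter=5, warm_start=True, tol=None, random_state=42)
for _ in range(10):
    sgd2.fit(X, y)
```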
  • Mini-batch GD: at each step, compute the gradients on a small random set of instances (1 instance < mini-batch < full training set)
    • vs SGD: takes advantage of GPUs and optimized matrix operations for a performance boost.
  • Polynomial Regression (when the data isn't a straight line) ← PolynomialFeatures
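A quick polynomial-regression sketch with PolynomialFeatures (degree 2 and the toy quadratic data are just for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy quadratic data: y = 0.5x^2 + x + 2 + noise
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + np.random.randn(100)

poly_reg = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         LinearRegression())
poly_reg.fit(X, y)
print(poly_reg.predict([[1.5]]))
```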
  • How to tell a model is underfitting or overfitting?
    • Recall: use cross-validation → well on training but poor on cross-val → overfitting. Poor on both → underfitting.
    • Learning curves: plots of the training error and the validation error vs the training iteration. ← learning_curve() (see the sketch below)
      • Figure 4-15. Learning curves ← no gap between the 2 curves → underfitting → both the training and validation curves are high
        → to fix: choose again: a better model or better features.
        Figure 4-16. Learning curves for the 10th-degree polynomial model ← overfitting → the training curve is better than the validation curve (there is a gap)
        → to fix: more data
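A minimal learning_curve() sketch (the model, data and plotting choices are placeholders of mine) to draw training vs validation error curves:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

X = 6 * np.random.rand(200, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + np.random.randn(200)

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5,
    scoring="neg_root_mean_squared_error")

plt.plot(train_sizes, -train_scores.mean(axis=1), "r-+", label="train")
plt.plot(train_sizes, -valid_scores.mean(axis=1), "b-", label="valid")
plt.xlabel("training set size"); plt.ylabel("RMSE"); plt.legend()
plt.show()   # a large gap between the curves suggests overfitting
```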
  • Bias / variance trade-off
    • Bias → wrong assumptions (e.g., assuming the data is linear when it is actually a high-degree polynomial) ← high bias → underfitting
    • Variance → excessive sensitivity to small variations in the training data ← high variance → overfitting
    • Irreducible error ← comes from noise in the data ← reduce it by cleaning the data
    • → Called a trade-off because increasing model complexity (the degree) increases variance but decreases bias, and vice versa.
  • A good way to reduce overfitting → regularization
    • Ridge regression (Ridge) ← uses an ℓ2 penalty term on the weights.
    • Lasso regression (Least Absolute Shrinkage and Selection Operator regression, Lasso) ← uses an ℓ1 penalty term on the weights.
      • An important characteristic of lasso regression is that it tends to eliminate the weights of the least important features (i.e., set them to zero).
    • Elastic net regression: a middle ground between ridge regression and lasso regression.
    • (source) Ridge regression can't zero coefficients, resulting in all or none in the model. Unlike this, LASSO provides parameter shrinkage and variable selection. For highly correlated covariates, consider the Elastic Net instead of the LASSO.
    • It’s important to scale the data before regularization.
    • Which one to use? → Avoid skipping regularization entirely; if only a few features are useful → lasso/elastic net; when #features > #instances → prefer elastic net. (See the sketch below.)
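A small comparison sketch of the three regularized models, with scaling first (the alpha / l1_ratio values are arbitrary, not recommendations):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.randn(100, 5)
y = 3 * X[:, 0] - 2 * X[:, 1] + np.random.randn(100)   # only 2 useful features

for reg in (Ridge(alpha=1.0),                       # l2 penalty
            Lasso(alpha=0.1),                       # l1 penalty, zeroes out useless weights
            ElasticNet(alpha=0.1, l1_ratio=0.5)):   # mix of l1 and l2
    model = make_pipeline(StandardScaler(), reg)    # scale before regularizing
    model.fit(X, y)
    print(type(reg).__name__, np.round(reg.coef_, 2))
```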
  • Early stopping: stop training as soon as the validation error reaches its minimum.
    • Figure 4-20. Early stopping regularization
  • copy.deepcopy() → copies both model hyperparams & learned params whereas sklearn.base.clone() only copies hyperparams.
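A hedged early-stopping sketch: train SGDRegressor one round at a time and keep a deepcopy() of the model with the lowest validation error (the split, number of epochs and eta0 are my own choices):

```python
from copy import deepcopy

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = 6 * np.random.rand(300, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + np.random.randn(300)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)

sgd = SGDRegressor(eta0=0.002, random_state=42)
best_rmse, best_model = float("inf"), None
for epoch in range(500):
    sgd.partial_fit(X_train, y_train)                      # one training round
    rmse = np.sqrt(mean_squared_error(y_valid, sgd.predict(X_valid)))
    if rmse < best_rmse:                                   # validation error at a new minimum
        best_rmse, best_model = rmse, deepcopy(sgd)        # snapshot params + hyperparams
```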
  • Logistic regression: we can use regression for classification (0/1) → estimate the probability that an instance belongs to a given class. ← it's called “regression” because it's an extension of linear regression (just apply the sigmoid function before the output)
  • Iris dataset: contains the sepal and petal length and width of 150 iris flowers of three different species: Iris setosa, Iris versicolor, and Iris virginica.
    • Figure 4-22. Flowers of three iris plant species
  • The softmax regression classifier predicts only one class at a time (it is multiclass, not multioutput) ← you cannot use it to recognize multiple people in one picture.
  • Scikit-Learn’s LogisticRegression classifier uses softmax regression automatically when you train it on more than two classes.
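A short sketch (the value C=30 is arbitrary) of LogisticRegression trained on all three iris classes, i.e. softmax regression under the hood:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = iris.target                      # 3 classes -> softmax (multinomial)

softmax_reg = LogisticRegression(C=30, random_state=42)
softmax_reg.fit(X, y)
print(softmax_reg.predict([[5, 2]]))                  # predicted class
print(softmax_reg.predict_proba([[5, 2]]).round(2))   # one probability per class
```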

Chapter 5 - Support Vector Machines (SVM)

  • capable of performing linear or nonlinear classification, regression, and even novelty detection
  • shine with small to medium-sized nonlinear datasets (i.e., hundreds to thousands of instances), especially for classification tasks
  • Linear SVM classification
    • Figure 5-1. Large margin classification. Left: decision boundaries of three possible linear classifiers. Right: the solid line is the decision boundary of an SVM classifier
    • think of an SVM classifier as fitting the widest possible street
    • Adding more training instances doesn't affect the decision boundary (it is fully determined by the instances located on the edge of the street ← support vectors)
    • SVMs are sensitive to the feature scales
      • Figure 5-2. Sensitivity to feature scales. Scaled (right) is much better.
    • Hard margin = strictly impose that all instances must be off the street and on the correct side → only works with linearly separable data + sensitive to outliers
      • Figure 5-3. Hard margin sensitivity to outliers
    • Margin violations = instances that end up in the middle of the street (or on the wrong side).
    • soft margin classification = good balance between keeping the street as large as possible and limiting the margin violations
    • hyperparameters
      • Figure 5-4. Large margin (left) versus fewer margin violations (right)
        If the SVM is overfitting → reduce C!
        Low C → wider street, more support vectors, more margin violations, less overfitting → (if C is too low) underfitting.
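A minimal soft-margin linear SVM sketch with scaling and the C hyperparameter (the value C=1 is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2)               # Iris virginica vs the rest

# Lower C -> wider street, more margin violations (more regularization)
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1, random_state=42))
svm_clf.fit(X, y)
print(svm_clf.predict([[5.5, 1.7]]))
```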
  • Nonlinear SVM Classification
    • Figure 5-5. Adding features to make a dataset linearly separable
    • Use kernel tricks: The kernel trick makes it possible to get the same result as if you had added many polynomial features, even with a very high degree, without actually having to add them.
      • Figure 5-7. SVM classifiers with a polynomial kernel
    • Another technique to tackle nonlinear problems is to add features computed using a similarity function ← How much each instance resembles a particular landmark. ← computationally expensive
      • ❓ I don't quite understand this example yet (p. 253)
    • Gaussian RBF Kernel
      • Figure 5-9. SVM classifiers using an RBF kernel
        An illustration of using gamma: in the high-gamma case each instance's range of influence is narrower, so the decision boundary becomes more irregular, which may cause overfitting.
    • The LinearSVC class is based on the liblinear library ← an optimized algorithm for linear SVMs, no kernel support
    • The SVC class is based on the libsvm library ← support kernel trick
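A hedged sketch of kernelized SVMs on a nonlinear toy dataset (the degree, coef0, gamma and C values are illustrative only):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# Polynomial kernel: same effect as adding degree-3 polynomial features, without adding them
poly_kernel_svm = make_pipeline(StandardScaler(),
                                SVC(kernel="poly", degree=3, coef0=1, C=5))
# Gaussian RBF kernel: higher gamma -> narrower influence per instance -> risk of overfitting
rbf_kernel_svm = make_pipeline(StandardScaler(),
                               SVC(kernel="rbf", gamma=5, C=0.001))

for model in (poly_kernel_svm, rbf_kernel_svm):
    model.fit(X, y)
    print(model.predict([[0.5, 0.0]]))
```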
  • SVM Regression
    • The reverse of classification (which builds the widest possible street while limiting violations that fall inside the street): regression builds a street that fits as many instances as possible while limiting violations that fall outside it.
      • Figure 5-10. SVM regression, with the street width controlled by the hyperparameter ε.
    • The opposite of classification (where the support vectors lie on/inside the street and determine the margin): the support vectors in regression lie outside the street and likewise determine the margins. ← Reducing ε increases the number of support vectors and vice versa.
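A short SVM-regression sketch where epsilon sets the width of the street (the values are my own, on toy linear data):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR, LinearSVR

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100)

# Linear SVM regression: epsilon controls the street width
lin_svr = make_pipeline(StandardScaler(), LinearSVR(epsilon=0.5, random_state=42))
# Kernelized SVM regression, for nonlinear data
svr = make_pipeline(StandardScaler(), SVR(kernel="poly", degree=2, C=100, epsilon=0.1))

for model in (lin_svr, svr):
    model.fit(X, y)
    print(model.predict([[1.0]]))
```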
  • Under the hood of linear SVM Classifier
    • To make the street (margin) larger → make the weight vector w smaller.
      • Figure 5-12. A smaller weight vector results in a larger margin
    • To train an SVM: solve the quadratic programming (QP) problem directly, OR use gradient descent, OR solve the dual problem.
    • The dual problem:
      • The solution to the dual problem typically gives a lower bound on the solution of the primal problem, but under some conditions it can have the same solution as the primal problem ← the SVM problem meets these conditions.
      • The dual problem is faster to solve than the primal one when the number of training instances is smaller than the number of features. Moreover, the dual problem makes the kernel trick possible, while the primal problem does not.
    • In ML, a kernel is a function capable of computing the dot product φ(a)ᵀφ(b) based only on the original vectors a and b, without having to compute (or even to know about) the transformation φ.
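A tiny numerical check (toy vectors of mine) of the kernel idea: for the 2nd-degree polynomial kernel, K(a, b) = (aᵀb)² equals φ(a)ᵀφ(b) without ever computing φ explicitly:

```python
import numpy as np

def phi(x):
    # Explicit 2nd-degree feature map of a 2D vector
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

a, b = np.array([2.0, 3.0]), np.array([4.0, 5.0])

lhs = phi(a) @ phi(b)   # dot product in the transformed feature space
rhs = (a @ b) ** 2      # 2nd-degree polynomial kernel on the original vectors
print(lhs, rhs)         # both 529.0
```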

Chapter 6. Decision Trees

  • To visualize Decision Trees, use Graphviz.
  • DTs require very little data preparation (they don't need feature scaling or centering).
    • Scikit-learn uses the CART algorithm, which produces only binary trees (nodes have at most 2 children).
  • The ID3 algorithm can produce nodes that have more than 2 children.
Figure 6-1. Iris decision tree
Figure 6-2. Decision tree decision boundaries
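A hedged sketch of training a small tree on the iris data and exporting it for Graphviz (the file name and max_depth are my choices):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Render afterwards with: dot -Tpng iris_tree.dot -o iris_tree.png
export_graphviz(tree_clf, out_file="iris_tree.dot",
                feature_names=["petal length (cm)", "petal width (cm)"],
                class_names=iris.target_names, rounded=True, filled=True)
```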
  • White box models = intuitive + decisions are easy to interpret. Eg: decision trees.
  • Black box models = otherwise. Eg: random forests and NN. ← hard to know what contributed to this prediction
  • the CART algorithm is a greedy algorithm. A greedy algorithm often produces a solution that’s reasonably good but not guaranteed to be optimal
  • Complexity of prediction = O(log₂(m)) ← independent of the number of features.
  • Complexity of training = O(n × m log₂(m)) ← the algorithm compares all features on all samples at each node.
  • predictions are very fast, even when dealing with large training sets.
  • By default, the DecisionTreeClassifier class uses the Gini impurity measure, but you can select the entropy impurity measure instead by setting the criterion hyperparameter to "entropy".
    • Which one to use? → Most of the time they give the same results. Gini is slightly faster to compute.
    • when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees
  • Entropy → zeros when molecules are still and well ordered. In Information Theory, entropy is zero when all messages are identical. In ML, a set’s entropy is zero when it contains instances of only one class.
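A quick numeric check of both impurity measures for one node's class proportions, Gini G = 1 − Σ p_k² and entropy H = −Σ p_k log₂ p_k (the 0/49/5 class counts are just an example):

```python
import numpy as np

p = np.array([0 / 54, 49 / 54, 5 / 54])   # class proportions in a node

gini = 1 - np.sum(p ** 2)
entropy = -np.sum([pk * np.log2(pk) for pk in p if pk > 0])
print(round(gini, 3), round(entropy, 3))  # ≈ 0.168 and ≈ 0.445
```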
  • Decision trees make very few assumptions about the training data.
  • It's a nonparametric model ← the number of parameters is not determined before training + the model is free to stick closely to the data → likely to overfit. A parametric model (e.g., a linear model) ← a predetermined number of parameters → lower risk of overfitting but higher risk of underfitting.
  • DT Regression (Note: Decision Tree Regression) ← The main difference is that instead of predicting a class in each node, it predicts a value
  • This prediction is the average target value of the 110 training instances associated with this leaf node (for value=0.111)
Figure 6-4. A decision tree for regression
Figure 6-5. Predictions of two decision tree regression models
  • predicted value for each region is always the average target value of the instances in that region.
  • Instead of minimizing the impurity (as in classification), we minimize the MSE, still using the CART algorithm.
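A minimal regression-tree sketch (noisy quadratic toy data; max_depth=2 is arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(200, 1) - 0.5
y = X[:, 0] ** 2 + 0.025 * np.random.randn(200)

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)
print(tree_reg.predict([[0.21]]))   # = average target value of the matching leaf
```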
  • Weaknesses:
    • DTs love orthogonal decision boundaries ← sensitive to the data's orientation.
      • Figure 6-7. Sensitivity to training set rotation
    • To overcome the sensitivity to data’s orientation → transform (using PCA) before.
      • Transform from this (Figure 6-2. Decision tree decision boundaries)
        to this (Figure 6-8. A tree’s decision boundaries on the scaled and PCA-rotated iris dataset)
    • Decision trees have a high variance: small changes to the hyperparameters or to the data may produce very different models.
    • Depending on which features happen to be selected, each training run may produce a different result ← using a random forest (Note: Random Forest) is better because it averages predictions over many trees.
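A hedged sketch of rotating the data with PCA before fitting a tree, as mentioned above (the pipeline layout is my own, not the book's exact code):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target    # petal length & width

# Scale, rotate onto the principal components, then fit the tree
pca_tree = make_pipeline(StandardScaler(), PCA(),
                         DecisionTreeClassifier(max_depth=2, random_state=42))
pca_tree.fit(X, y)
print(pca_tree.score(X, y))
```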

Chapter 7. Ensemble Learning and Random Forests

 
→ 300