Reading: Hands-On ML - Chap 1: The ML Landscape

Anh-Thi Dinh
This note serves as a reminder of the book's content, including additional research on the mentioned topics. It is not a substitute for the book. Most images are sourced from the book or referenced.
I've noticed that taking notes on this site while reading the book significantly extends the time it takes to finish the book. I've stopped noting everything, as in previous chapters, and now continue reading by highlighting and hand-writing notes instead. I plan to return to the detailed style when I have more time.
This book contains 1007 pages of readable content. If you read at a pace of 10 pages per day, it will take you approximately 3.3 months (without missing a day) to finish it. If you aim to complete it in 2 months, you'll need to read at least 17 pages per day.




This book is organized in 2 parts:


Chapter 1 introduces a lot of fundamental concepts (and jargon) that every data scientist should know by heart. If you are already familiar with machine learning basics, you may want to skip directly to Chapter 2.
Jupyter notebook for this chapter: on GitHub, on Colab, on Kaggle.

What Is Machine Learning?

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. — Tom Mitchell, 1997.
Example: email spam filter ← give it examples of spam/non-spam emails so that it can learn to flag spam.
  • Training set: the examples the system uses to learn. Each training example is called a training instance (or sample).
  • Model: the part of an ML system that learns and makes predictions. Examples: Neural Networks, Random Forests,…
  • T = the task of flagging spam in new emails. E = the training data. The performance measure P needs to be defined, eg. the ratio of correctly classified emails ← it’s called accuracy.
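The accuracy measure mentioned above can be sketched in a few lines. This is a minimal illustration, not code from the book: the function name `accuracy` and the label encoding (1 = spam, 0 = not spam) are assumptions made for the example.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical example: 1 = spam, 0 = not spam.
true_labels = [1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0]
spam_accuracy = accuracy(predicted, true_labels)  # 4 of the 5 emails are classified correctly
```

Here P is the value of `spam_accuracy`; a learning system improves if this number goes up as it sees more training data (E).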

Why Use Machine Learning?

(1) The traditional programming approach → if the problem is difficult, your program becomes a long list of complex rules → hard to maintain.
(2) The machine learning approach.
For example, some words like “4U” in the subject:
  • With (1), we write a rule for every pattern we can think of → spammers switch to “For U” → we need to update (1) again → … → bad!
  • (2) detects the frequent patterns of words in the spam examples, so it can pick up the new ones without new hand-written rules.
(3) Automatically adapting to change.
(4) Machine learning can help humans learn.
Data Mining = digging into large amounts of data to discover hidden patterns.
Machine learning is great for:
  1. Problems for which existing solutions require a lot of fine-tuning or long lists of rules (a machine learning model can often simplify code and perform better than the traditional approach)
  1. Complex problems for which using a traditional approach yields no good solution (the best machine learning techniques can perhaps find a solution)
  1. Fluctuating environments (a machine learning system can easily be retrained on new data, always keeping it up to date)
  1. Getting insights about complex problems and large amounts of data

Examples of Applications

  • Analyzing images of products on a production line to automatically classify them ← Image Classification ← using CNNs, Transformers.
  • Detecting tumors in brain scans ← Image Segmentation ← also using CNNs and Transformers.
  • Automatically classifying news articles ← NLP (Natural Language Processing) ← use RNN (Recurrent Neural Networks), Transformers.
  • Automatically flagging offensive comments on discussion forums ← Text classification.
  • Summarizing long documents automatically ← Text summarization.
  • Creating a chatbot or a personal assistant ← NLU (Natural Language Understanding), Question-Answering modules.
  • Forecasting your company’s revenue next year, based on many performance metrics ← Linear Regression, Polynomial Regression, SVM (Support Vector Machine), Random Forest, Neural Networks.
  • Making your app react to voice commands ← Speech Recognition ← RNNs, CNNs, Transformers.
  • Detecting credit card fraud ← Anomaly Detection ← Isolation Forests, Gaussian mixture models, Autoencoders.
  • Segmenting clients based on their purchases so that you can design a different marketing strategy for each segment ← Clustering ← K-Means, DBSCAN,…
  • Representing a complex, high-dimensional dataset in a clear and insightful diagram ← Data Visualization, Dimensionality Reduction
  • Recommending a product that a client may be interested in, based on past purchases ← Recommender System.
  • Building an intelligent bot for a game ← Reinforcement Learning

Types of Machine Learning Systems

Classify types of ML based on:
  • Supervised during training? ← supervised, unsupervised, semi-supervised, self-supervised,…
  • Can they learn incrementally on the fly? ← online learning vs batch learning
  • Comparing new data to known data? Or detecting patterns in the training data and building a predictive model? ← Instance-based learning vs model-based learning.
These criteria are not exclusive: they can be combined.

Training Supervision

Supervised Learning

  • Training set fed to the algo includes the solutions ← labels
    • A labeled training set for spam classification (an example of supervised learning)
  • Classification: train on examples labeled with their classes → the model classifies new instances.
  • Regression: predicts a target (eg. price of a car) given a set of features/predictors/attributes (eg. mileage, age, brand,…). Some regression models can also be used for classification ← eg. Logistic Regression, which outputs the probability of belonging to a class.
    • A regression problem: predict a value, given an input feature (there are usually multiple input features, and sometimes multiple output values)

Unsupervised learning

  • Training data is unlabeled. Clustering can be used to detect groups of similar data. If you use hierarchical clustering, it may subdivide each group into smaller groups.
An unlabeled training set for unsupervised learning
  • Visualization is an example of unsupervised learning. ← These algorithms try to preserve as much structure as they can.
    • Example of a t-SNE visualization highlighting semantic clusters.
  • Dimensionality reduction: simplify the data without losing too much information. ← merge correlated features into one.
  • Anomaly Detection: eg. detecting unusual credit card transactions, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm. ← the system learns what “normal” looks like + meets a new instance → decides whether it’s anomalous or not.
    • Anomaly detection
  • Novelty detection: similar to anomaly detection, it looks for new instances that look different from all instances in the training set.
  • Association rule learning: dig into large amounts of data → find patterns and relations between features. Eg. relations between products bought in a supermarket.

Semi-supervised learning

These are algos dealing with data that is only partially labeled. Eg. Google Photos groups the faces in your photos on its own (unsupervised), then needs just one label per person to name that person in every photo.
Semi-supervised learning with two classes (triangles and squares): the unlabeled examples (circles) help classify a new instance (the cross) into the triangle class rather than the square class, even though it is closer to the labeled squares
Most semi-supervised algorithms = unsupervised + supervised. Eg. use clustering to label the unlabeled data, then train a supervised algo on the resulting fully labeled data.

Self-supervised learning

Generate a fully labeled dataset from a fully unlabeled one.
Self-supervised learning example: input (left) and target (right)
A large amount of unlabeled data can be processed by masking certain parts in an image and training a model to reconstruct the missing parts. Additionally, the model can classify species such as cats and dogs, although it may not know their specific names yet. Later on, we can map this knowledge to the labeled names that humans use.
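The "labels come from the data itself" idea can be sketched with a text analogue of the image masking described above. This is a simplified illustration under stated assumptions: the `<mask>` token and the function name `make_masked_examples` are hypothetical, and a real pipeline would mask image patches or tokens at random rather than every position.

```python
MASK = "<mask>"  # hypothetical placeholder token


def make_masked_examples(tokens):
    """Turn one unlabeled sequence into (input, target) training pairs.

    Each pair hides one token; the hidden token becomes the label.
    """
    examples = []
    for i, token in enumerate(tokens):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        examples.append((masked, token))
    return examples


pairs = make_masked_examples(["the", "cat", "sat"])
first_input, first_target = pairs[0]
```

A fully unlabeled sequence yields a fully labeled dataset: every pair has an input with a gap and the original content as its target.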
Transfer learning = transferring knowledge from one task to another task. ← one of the most important techniques in ML.

Reinforcement learning

Agent = the learning system. → it can observe the env + select and perform actions + get rewards (or penalties). ← it must find the best strategy (policy).
Reinforcement learning
Example: DeepMind’s AlphaGo beat Ke Jie (the number one ranked Go player at the time) by learning from millions of games and then playing many games against itself.
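To make the agent / action / reward / policy vocabulary concrete, here is a toy sketch. It is an epsilon-greedy multi-armed bandit, a much simpler setting than AlphaGo and not the method DeepMind used. The environment (a fixed reward per action), the function name `run_bandit`, and all parameter values are assumptions made for the example.

```python
import random


def run_bandit(rewards, steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a toy environment with one fixed reward per action."""
    rng = random.Random(seed)
    n_actions = len(rewards)
    estimates = [0.0] * n_actions  # the agent's value estimate for each action
    counts = [0] * n_actions

    for step in range(steps):
        if step < n_actions:
            action = step                      # try every action once
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore a random action
        else:
            # Exploit: pick the action currently believed to be best.
            action = max(range(n_actions), key=lambda a: estimates[a])
        reward = rewards[action]               # the environment returns a reward
        counts[action] += 1
        # Incremental average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]

    # The learned policy: always choose the action with the highest estimate.
    policy = max(range(n_actions), key=lambda a: estimates[a])
    return policy, estimates


best_action, value_estimates = run_bandit([1.0, 5.0, 2.0])
```

The agent is never told which action is correct; it only observes rewards and gradually settles on the policy that collects the most.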

Batch Learning vs Online Learning

Batch Learning

The model is trained on all the available data, offline ← Offline Learning.
  • The model’s performance tends to decay because the world keeps changing → model rot or data drift.
  • If you want a batch learning system to know about new data → retrain from scratch on the full dataset (new + old).
  • This is not efficient (time / resources consumption).

Online learning

  • It feeds the system data sequentially (mini batches) ← quick and cheap, new data can be learnt on the fly.
In online learning, a model is trained and launched into production, and then it keeps learning as new data comes in.
  • Can be used if the data changes fast or you have limited computing resources.
  • Can be used to train on huge datasets that cannot fit in one machine’s main memory (out-of-core learning).
    • Using online learning to handle huge datasets
  • Learning rate = how fast the system should adapt to changing data. Too high → adapts quickly but also quickly forgets old data. Too low → learns slowly but is less sensitive to noise in the new data.
  • Weakness: The system is vulnerable to bad data being fed while it is live. To address this, set up a mechanism to turn off learning if a drop in performance is detected.
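The effect of the learning rate can be seen with a very small sketch: a running estimate that is nudged toward each new instance as it arrives. The update rule, the function name `online_mean`, and the data stream are assumptions chosen for illustration, not the book's code.

```python
def online_mean(stream, learning_rate):
    """Update a running estimate one instance at a time."""
    estimate = 0.0
    for x in stream:
        # Move the estimate toward the new instance by a fraction of the error.
        estimate += learning_rate * (x - estimate)
    return estimate


# Hypothetical stream whose underlying value shifts from 0 to 10.
stream = [0.0] * 20 + [10.0] * 5
fast = online_mean(stream, learning_rate=0.5)   # adapts quickly to the shift
slow = online_mean(stream, learning_rate=0.05)  # adapts slowly, remembers old data longer
```

After only five new instances, the high learning rate has almost caught up with the new value while the low one is still dominated by the old data. The same mechanism explains the weakness above: with a high learning rate, a burst of bad data would be absorbed just as quickly.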

Instance-Based Learning vs Model-Based Learning

  • One way to categorize ML systems is by how they generalize.
  • Goal: perform well on new instances (predictions), not only on the training data.

Instance-Based Learning

Learn the examples by heart + use a measure of similarity to generalize to new cases, eg. flag emails that “look alike” known spam emails.
Instance-based learning
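A minimal sketch of this idea for the spam example: store the training emails as they are, and label a new email like its most similar stored email (1-nearest neighbor). The similarity measure (number of shared words), the function names, and the emails are assumptions made for the example.

```python
def similarity(email_a, email_b):
    """Count the words two emails have in common (a very crude similarity)."""
    return len(set(email_a.lower().split()) & set(email_b.lower().split()))


def classify(new_email, training_set):
    """Return the label of the most similar stored training email."""
    _, label = max(training_set, key=lambda item: similarity(new_email, item[0]))
    return label


# Hypothetical training instances, memorized as-is.
training_set = [
    ("win free money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("free prize claim now", "spam"),
    ("lunch on monday", "ham"),
]

label_1 = classify("claim your free money", training_set)
label_2 = classify("agenda for the monday meeting", training_set)
```

There is no training step beyond storing the data; all the work happens at prediction time, when the new instance is compared to the known ones.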

Model-Based Learning

Generalize from dataset → build a model → use this model to make predictions.
Model-based learning

Typical ML workflow

You want to know if money makes people happy?
  1. Plot the dataset ← studying the data
  1. Based on the plot, the trend looks linear (life satisfaction goes up more or less linearly as GDP per capita increases) → choose a linear model ← model selection
  1. Plot the model
    1. A few possible linear models → after training, we can choose the blue one!
  1. How do we know which model is the best? → measure how good it is with a utility function (or fitness function), or how bad it is with a cost function. ← For linear regression, we usually use a cost function (it measures the distance between the linear model’s predictions and the training examples) ← objective: minimize the cost function!
  1. Predict new data ← inference
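The workflow above (select a linear model, minimize a cost function, then predict) can be sketched without any library. The numbers below are made up, the mean squared error is one common choice of cost function assumed here, and the names `fit_line` and `mse` are hypothetical; the book's notebook uses Scikit-Learn instead.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = theta0 + theta1 * x (this minimizes the MSE cost)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


def mse(theta0, theta1, xs, ys):
    """Cost function: mean squared distance between predictions and examples."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up training data: GDP per capita (arbitrary units) vs life satisfaction.
gdp = [1.0, 2.0, 3.0, 4.0]
satisfaction = [5.0, 5.5, 6.5, 7.0]

theta0, theta1 = fit_line(gdp, satisfaction)       # training
cost = mse(theta0, theta1, gdp, satisfaction)      # cost of the trained model
other_cost = mse(4.0, 0.8, gdp, satisfaction)      # cost of another candidate line
prediction = theta0 + theta1 * 5.0                 # inference on a new GDP value
```

Among all possible lines, the fitted one has the lowest cost on the training data, which is what "choosing the best model" means in step 4.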

Main Challenges of ML

2 things can go wrong in training models → “bad model” & “bad data”.
  • Insufficient Quantity of Training Data → a child easily learns to recognize “an apple” from a few examples, but ML models need a lot of data to do the same!
    • In this paper, Microsoft researchers show that, with enough data, very different models perform almost identically!
      The idea: for complex problems, data may matter more than algorithms!
      However, we usually don’t have that much data!
  • Nonrepresentative Training Data → to generalize well, the training data needs to be representative of the new cases you want to generalize to!
    • An example of “Nonrepresentative Training Data” (blue dots without red dots) → add more data (red dots), the old predicting model isn’t good anymore!
    • If the sample is too small → sampling noise (nonrepresentative data by chance). Even a large sample can be nonrepresentative if the sampling method is flawed ← sampling bias!
  • Poor-Quality Data: it’s worth spending time cleaning up the training data. ← most data scientists spend a significant part of their time doing that!
  • Irrelevant Features: garbage in, garbage out. A critical part of the success of a machine learning project is coming up with a good set of features to train on ← Feature engineering. 2 steps:
    • Feature selection: select the most useful features to train.
    • Feature extraction: combine existing features to make a more useful one.
  • Overfitting the training data: the model performs well on the training data, but it does not generalize well.
    • Overfitting the training data
      Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. Possible solutions:
    • Simplify model: fewer parameters, reducing the number of attributes, constraining the model,…
    • Gather more data.
    • Reduce the noise in data.
    • Regularization = constraining a model to make it simpler and reduce the risk of overfitting. The amount of regularization to apply during learning is controlled by a hyperparameter.
      You want to find the right balance between fitting the training data perfectly and keeping the model simple enough to ensure that it will generalize well.
      Regularization reduces the risk of overfitting
      Tuning hyperparameters is an important part of building a machine learning system!
  • Underfitting the training data: your model is too simple to learn the underlying structure of the data. Possible solutions:
    • Select a more powerful model (more parameters)
    • Better features (feature engineering)
    • Reduce the constraints on the model (eg. reducing the regularization hyperparameter).
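To illustrate how a regularization hyperparameter constrains a model (see the overfitting and underfitting bullets above), here is a sketch using one common form, L2 (ridge) regularization, on a one-parameter model. The choice of ridge, the name `fit_ridge_slope`, and the data are assumptions made for the example.

```python
def fit_ridge_slope(xs, ys, alpha):
    """Fit y = theta * x by minimizing sum((theta * x - y)^2) + alpha * theta^2.

    Setting the derivative to zero gives theta = sum(x*y) / (sum(x^2) + alpha).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


# Made-up data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unregularized = fit_ridge_slope(xs, ys, alpha=0.0)  # fits the training data perfectly
regularized = fit_ridge_slope(xs, ys, alpha=14.0)   # slope is pulled toward zero
```

A larger `alpha` forces a smaller slope, so the model fits the training data less closely. Too much regularization leads to underfitting, which is why reducing it appears among the underfitting solutions.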

Testing and Validating

  • Split the data into 2 sets: training set (used to train the model) & test set (used to check whether the model works well). It’s common to use 80% for training and 20% for testing, but this depends on the size of the dataset (with a very large dataset, a much smaller test fraction is enough).
  • Evaluate your model with test set → get generalization error (out-of-sample error).
  • If training error is low but generalization error is high → overfitting.
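A minimal sketch of the 80/20 split described above, using only the standard library. The function name mirrors Scikit-Learn's `train_test_split`, but this is a simplified stand-in written for illustration, and the fixed seed is an assumption to keep the split reproducible.

```python
import random


def train_test_split(instances, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = train_test_split(range(100))
```

The model is trained only on `train_set`; the error measured on `test_set` is the estimate of the generalization error.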

Hyperparameter Tuning and Model Selection

Problem: you have a model → how do you choose the value of the regularization hyperparameter? → train 100 different models using 100 different values → evaluate each on the test set → keep the best value → but on real data, the model performs worse than expected ← Why? Because the hyperparameter was tuned to fit that particular test set!
Common solution: holdout validation.
Model selection using holdout validation
Holdout validation: split training set into “new” training set + validation set (or development set or dev set).
Process: train multiple models (with various hyperparameters) on the “new” training set → select the model that performs best on the validation set → retrain the best model on the whole training set (new + validation) → final model → evaluate on the test set.
  • Validation set is too small → model evaluations are imprecise → you may select a “suboptimal” model.
  • Validation set is too large → the remaining training set is much smaller than the full training set → bad, because candidates are compared after training on much less data than the final model will get: it’s like “selecting the fastest sprinter to participate in a marathon” ← solution: perform repeated cross-validation (use many small validation sets and average the evaluations) ← weakness: training time is multiplied by the number of validation sets!
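The holdout validation process can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the one-parameter ridge-style model, the candidate values for the hyperparameter `alpha`, and the made-up data already split into three parts.

```python
def fit_slope(xs, ys, alpha):
    """Ridge-style fit of y = theta * x; alpha is the regularization hyperparameter."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mse(theta, xs, ys):
    """Mean squared error of the model y = theta * x on the given data."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: reduced training set, validation set, test set.
x_train, y_train = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
x_val, y_val = [5.0, 6.0], [10.1, 11.8]
x_test, y_test = [7.0, 8.0], [14.2, 15.9]

candidates = [0.0, 1.0, 10.0]

# 1. Train one model per hyperparameter value on the reduced training set,
#    and score each of them on the validation set.
val_errors = {a: mse(fit_slope(x_train, y_train, a), x_val, y_val) for a in candidates}

# 2. Keep the hyperparameter value with the lowest validation error.
best_alpha = min(val_errors, key=val_errors.get)

# 3. Retrain on training + validation data, then evaluate once on the test set.
final_theta = fit_slope(x_train + x_val, y_train + y_val, best_alpha)
test_error = mse(final_theta, x_test, y_test)
```

The test set is touched only once, at the very end, so `test_error` is not biased by the hyperparameter search.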

Data Mismatch

In some cases it is easy to obtain a large amount of training data, but that data may not be perfectly representative of the data seen in production.
For example, when building a mobile app to recognize flowers, a model trained on pictures downloaded from the web may not perform well on pictures taken with the app. ← when the model performs poorly on the validation set, we don’t know whether the cause is overfitting or the mismatch!
Remember: validation set & test set must be representative of the data you expect to use in production!
Solution: use a train-dev set (a part of the training data held out from training).
Idea: train on “train” + evaluate on “train-dev” → if it’s poor, it’s overfitting. Otherwise → no overfitting → evaluate on “dev” → if it’s poor, it’s data mismatch! → when it’s good → evaluate on test → when it’s good → production.
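The train-dev decision chain can be written as a small helper. This is only a sketch: the function name `diagnose`, the use of error rates, and the `tolerance` threshold for what counts as "poor" are hypothetical choices, since the note does not specify any.

```python
def diagnose(train_error, train_dev_error, dev_error, tolerance=0.05):
    """Rough diagnosis from error rates (lower is better).

    `tolerance` is a hypothetical gap above which a result counts as "poor".
    """
    if train_dev_error - train_error > tolerance:
        # Poor on held-out data from the same distribution as the training set.
        return "overfitting"
    if dev_error - train_dev_error > tolerance:
        # Fine on train-dev, poor on production-like data.
        return "data mismatch"
    return "ok: evaluate on the test set"


case_overfit = diagnose(0.02, 0.20, 0.22)
case_mismatch = diagnose(0.02, 0.03, 0.25)
case_ok = diagnose(0.02, 0.03, 0.04)
```

The order of the checks matters: overfitting is ruled out first on the train-dev set, so a later gap on the dev set can be attributed to the mismatch.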
No free lunch theorem
If you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. In practice you make some reasonable assumptions about the data and evaluate only a few reasonable models.


Read the book.