In this chapter you will work through an example project end to end.
In this chapter, we use the California Housing Prices dataset (or download it from the author’s repository).
This data includes metrics such as the population, median income, and median housing price for each block group (called “district” for short).
Your model should learn from this data → predict the median housing price in any district.
You should pull out the ML project checklist (Appendix A in the book) for each project.
Ask questions to frame the problem and choose the methods.
- Question: What exactly is the business objective? (Building a model isn’t the final goal.) → Business objective: decide whether it’s worth investing in a given area.
- Question: What does the current solution look like (if any)? ← a reference for performance → Prices are currently estimated manually by experts. ← Their estimates were off by more than 30%.
Pipeline = a sequence of data processing components (a data pipeline). Components are fairly self-contained and each can be handled by a different team, which keeps the whole system robust.
- Question: What kind of training supervision does the model need (supervised, unsupervised, semi-supervised, self-supervised, or reinforcement)? Is it classification, regression, or something else? Batch learning or online learning?
- supervised ← model trained with labeled examples.
- multiple regression ← predict a value using multiple features.
- univariate regression ← predict a single value for each district. If we want to predict multiple values → multivariate regression.
- Batch learning ← no continuous flow of data, no need to adjust rapidly to changing data, and the data is small enough to fit in memory.
If the data were huge → split batch learning across multiple servers (using the MapReduce technique) or use online learning.
- A typical measure for regression: Root Mean Square Error (RMSE)
It corresponds to the Euclidean norm (or $\ell_2$ norm, noted $\lVert \cdot \rVert_2$ or just $\lVert \cdot \rVert$).
It is more sensitive to outliers than the MAE below. If outliers are rare (bell-shaped data) → RMSE performs well!
- If there are many outlier districts, we can use Mean Absolute Error (MAE = Average Absolute Deviation)
It corresponds to the $\ell_1$ norm, noted $\lVert \cdot \rVert_1$, also called the Manhattan norm (it measures the distance between two points in a city if you can only travel along orthogonal city blocks).
- $\ell_k$ norm of a vector $\mathbf{v}$ containing $n$ elements: $\lVert \mathbf{v} \rVert_k = \left( |v_1|^k + |v_2|^k + \dots + |v_n|^k \right)^{1/k}$
$\ell_0$ = number of nonzero elements in the vector.
$\ell_\infty$ = maximum absolute value in the vector.
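To make the difference between the two metrics concrete, here is a minimal sketch with made-up numbers (the arrays are illustrative, not taken from the housing data). RMSE is the $\ell_2$ norm of the error vector divided by $\sqrt{m}$, and MAE is its $\ell_1$ norm divided by $m$, where $m$ is the number of instances.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: penalizes large errors more strongly."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.sqrt(np.mean(errors ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: every error counts proportionally to its size."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.mean(np.abs(errors))

# Illustrative values only; the last prediction is a large (outlier) miss
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 300.0])
errors = y_pred - y_true

rmse_value = rmse(y_true, y_pred)
mae_value = mae(y_true, y_pred)

# The same quantities expressed through vector norms
rmse_from_norm = np.linalg.norm(errors, ord=2) / np.sqrt(len(errors))
mae_from_norm = np.linalg.norm(errors, ord=1) / len(errors)
largest_error = np.linalg.norm(errors, ord=np.inf)  # l_inf norm

print(rmse_value, mae_value, largest_error)
```

The single large miss pulls the RMSE well above the MAE, which is the outlier sensitivity described in the notes.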
It is beneficial to communicate with other teams in the pipeline to understand the assumptions regarding the overall problems. If there are any changes, adjust your methods accordingly to adapt to them.
It’s time for the code. Check the official Jupyter notebooks here. In this chapter, we run these notebooks using Google Colab at this link.
Thi: I skip some sections related to the usage of Google Colab and Jupyter Notebook. The code in this note is just snippets.
→ 10 attributes.
- 20640 instances → small (by ML standards)
- `total_bedrooms` has missing values.
- `ocean_proximity` isn’t numeric ← it’s a categorical attribute (check Fig 3.
Check with histograms.
```python
import matplotlib.pyplot as plt

# One histogram per numerical attribute
housing.hist(bins=50, figsize=(12, 8))
save_fig("attribute_histogram_plots")  # extra code: save_fig() is a helper defined in the notebook
plt.show()
```
Some remarks:
- `median_income` isn’t expressed in plain US dollars. → ask the team that collected the data → it’s scaled (1 unit ≈ 10,000 USD) and capped (at 15 for higher median incomes and at 0.5 for lower ones). ← Thi: values beyond the caps are lumped together.
- `housing_median_age` & `median_house_value` were capped too, but the cap on `median_house_value` is a problem because it is the target we want to predict. → ask the client team whether they need precise predictions beyond 500k → either collect proper labels for the capped districts (>500k) or remove those districts.
- Attributes have very different scales → (later) feature scaling.
- Many histograms are skewed right → harder to detect patterns → need to be transformed (more symmetrical / bell-shaped)
- Why set a test set aside now? → your brain is an amazing pattern detection system → if you look at the test data, you may overfit to it (by you) ← called data snooping bias
- Check the code in the notebook. We hold out 20% of the data as the test set.
- Problem: If we split randomly → the test set changes at each run → over time, you (and your model) get to see the whole dataset
- Solution: save the test set on the 1st run and reuse it in subsequent runs, OR fix the random seed ← weakness: both break when the dataset is updated with new data.
→ We should use an id for each instance (eg. compute its hash) → the test set stays consistent across multiple runs, even after the dataset is updated.
```python
from zlib import crc32

import numpy as np

def is_id_in_test_set(identifier, test_ratio):
    # An instance goes to the test set if the hash of its id falls below the threshold
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    # Returns (train_set, test_set)
    return data.loc[~in_test_set], data.loc[in_test_set]
```
- The code above needs an “id” column → check and use stable attributes to generate a consistent id for each instance! Eg. `longitude` and `latitude`.
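As a rough illustration of the id-based split, the sketch below builds an id from `longitude` and `latitude` on a small synthetic table (an assumption: random coordinates, not the real housing data) and checks that appending a new row leaves earlier assignments unchanged. The id formula `longitude * 1000 + latitude` is one simple choice; because the hash truncates the id to an integer, districts sharing the same coarse location end up on the same side of the split.

```python
from zlib import crc32

import numpy as np
import pandas as pd

def is_id_in_test_set(identifier, test_ratio):
    # Hash the id and compare it with a threshold proportional to test_ratio
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Synthetic stand-in for the housing data (hypothetical values)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "longitude": rng.uniform(-124.3, -114.3, size=1000).round(2),
    "latitude": rng.uniform(32.5, 42.0, size=1000).round(2),
})
demo["id"] = demo["longitude"] * 1000 + demo["latitude"]
train_set, test_set = split_data_with_id_hash(demo, 0.2, "id")

# Append a new district and split again: old rows must stay where they were
new_row = pd.DataFrame({"longitude": [-120.0], "latitude": [35.0]})
new_row["id"] = new_row["longitude"] * 1000 + new_row["latitude"]
bigger = pd.concat([demo, new_row], ignore_index=True)
bigger_train, bigger_test = split_data_with_id_hash(bigger, 0.2, "id")

print(len(train_set), len(test_set))
```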
- Use scikit-learn
```python
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
```
There are more methods in `sklearn.model_selection`.
- Stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum so that the test set is representative of the whole population.
- Eg. make sure the male/female ratio in the test set matches the one in the population.
- Suppose some experts told you that `median_income` (Fig 2-8) is very important → make sure the test set is representative of the various income categories.
- Use `pd.cut()` to categorize an attribute.
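A minimal sketch of stratified sampling on an income category, assuming synthetic incomes (a lognormal draw, not the real data) and bin edges chosen for illustration. `pd.cut()` builds the categories and the `stratify` argument of `train_test_split()` keeps their proportions in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic median incomes (hypothetical distribution, in tens of thousands)
rng = np.random.default_rng(42)
demo = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=2000)})

# Bucket the continuous attribute into 5 income categories
demo["income_cat"] = pd.cut(
    demo["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Stratify on the category so both sets keep the same proportions
strat_train_set, strat_test_set = train_test_split(
    demo, test_size=0.2, stratify=demo["income_cat"], random_state=42
)

overall = demo["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test_set["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "test": in_test}))
```

Once the split is done, the helper column `income_cat` can be dropped from both sets.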
- Make sure to put the test set aside and explore only the training set.
- You can create an exploration set if the training set is large, just for exploring the data.
- Make a copy to work with: `housing = strat_train_set.copy()`.
Use a scatterplot to visualize all districts.
```python
# Geographical scatterplot; a low alpha makes high-density areas stand out
housing.plot(kind="scatter", x="longitude", y="latitude", grid=True, alpha=0.2)
plt.show()
```
```python
# Circle size (s) = district population, color (c) = median house value
housing.plot(
    kind="scatter", x="longitude", y="latitude", grid=True,
    s=housing["population"] / 100, label="population",
    c="median_house_value", cmap="jet", colorbar=True,
    legend=True, sharex=False, figsize=(10, 7)
)
plt.show()
```
- Standard correlation coefficient or Pearson’s r between every pair of attributes.
- The correlation coefficient $r \in [-1, 1]$. ~1 → strong positive correlation. ~-1 → strong negative correlation. 0 → no linear correlation. ← The correlation coefficient only measures linear correlations
- Use Pandas `scatter_matrix()` to check the correlation between attributes → plot every numerical attribute against every other numerical attribute.
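The point that Pearson’s r only captures linear relationships can be seen on a small synthetic example (the columns `x`, `linear` and `quadratic` are made up for illustration): a perfectly deterministic but quadratic dependence has a correlation close to 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5000)

demo = pd.DataFrame({
    "x": x,
    "linear": 3 * x + rng.normal(scale=0.1, size=5000),  # linear in x, plus noise
    "quadratic": x ** 2,                                 # fully determined by x, but not linear
})

# Pearson's r between every pair of columns
corr_matrix = demo.corr()
print(corr_matrix["x"].sort_values(ascending=False))
```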
- From Fig 2-14, the most promising attribute to predict the median house value is the median income. → zoom in on it
- The correlation is quite strong.
- Points aren’t too dispersed.
- The price cap is visible as a horizontal line at $500k
- There are other “less obvious” straight lines around 450k, 350k, 280k ← you may want to remove these districts to prevent your algorithm from learning to reproduce these quirks.
- If some attributes have a skewed-right distribution → transform them (eg. computing their logarithm or square root).
- Last thing before preparing the data for the pipeline → try out various attribute combinations.
- eg: #rooms per household = #rooms / #households, #bedrooms / #rooms, population / #households,…
→ Found that `bedrooms_ratio` is good to use (its corr = -0.256397) → houses with a lower bedroom/room ratio tend to be more expensive.
```python
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
```
- You should write functions to perform the data preparation steps instead of doing them manually!
- You should also separate the predictors and the labels, since you don’t necessarily want to apply the same transformations to the predictors and the target values
- Most ML algorithms cannot work with missing features. Eg. `total_bedrooms` has missing values. Options:
- Remove the corresponding districts.
- Remove the whole attribute.
- Set the missing values to some value (zero, mean, median,…) ← called imputation ← `sklearn.impute`
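A small sketch of median imputation with `SimpleImputer` from `sklearn.impute`, on a made-up four-row table (the values are illustrative). The imputer learns the medians in `fit()` and stores them in `statistics_`, so the same values can later be applied to the validation and test sets.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing value in total_bedrooms (hypothetical numbers)
demo = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "total_rooms": [400.0, 800.0, 1200.0, 2000.0],
})

imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(demo)  # returns a NumPy array

print(imputer.statistics_)  # medians learned per column
print(X)
```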
Scikit-Learn Design ← paper: “API design for machine learning software: experiences from the scikit-learn project”
`ocean_proximity` → text attribute → check its values → it’s a categorical attribute
- Most ML algorithms prefer to work with numbers → convert categories to numbers.
- Check `sklearn.preprocessing`
- Suppose we use `OrdinalEncoder` to convert categories to 0, 1, 2, 3,… ← Issue: ML algorithms will assume that 2 nearby values are more similar than 2 distant ones! ← It may be fine if the categories are ordered, like “bad” < “average” < “good” < “excellent”, but not for `ocean_proximity`!
→ Solution: use one-hot encoding = create one binary attribute per category! ← the output of `OneHotEncoder` is a SciPy sparse matrix (an efficient representation for matrices that contain mostly zeros ← saves memory and speeds up computations).
- Pandas also has `pd.get_dummies(df)`, which offers similar functionality to sklearn’s `OneHotEncoder`, but we prefer the latter (smarter) because it remembers which categories it was trained on.
- `OneHotEncoder` can also detect unknown categories and raise an exception, whereas `get_dummies()` cannot (it creates a new column)!
- If an attribute has a large number of categories → one-hot encoding produces many input features → not good → replace the categorical attribute with related numerical ones.
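The difference between the two encoders shows up as soon as new data contains a category that was absent at training time. The sketch below uses tiny made-up frames; it sets `handle_unknown="ignore"` so the unseen category becomes a row of zeros (with the default setting, `OneHotEncoder` raises an error instead).

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"]})
new = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "ISLAND"]})  # "ISLAND" was never seen

# The encoder remembers the categories seen during fit
encoder = OneHotEncoder(handle_unknown="ignore")
train_1hot = encoder.fit_transform(train)      # SciPy sparse matrix
new_1hot = encoder.transform(new).toarray()    # same 3 columns as in training

# get_dummies only looks at the data it is given
dummies = pd.get_dummies(new)

print(encoder.categories_)
print(new_1hot)
print(list(dummies.columns))
```

The encoder keeps the three training columns, while `get_dummies()` produces a different set of columns for the new data.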
- One of the most important transformations you need to apply to your data is feature scaling.
- ML algorithms don’t perform well when numerical attributes have very different scales.
- Eg. #rooms values are much larger than median income values → the model is biased: it will focus more on #rooms.
- 2 common ways: min-max scaling (or normalization) & standardization (z-score normalization)
- min-max scaling: values scaled to [0,1] (or another range) ← $x' = (x - \min) / (\max - \min)$ ← `MinMaxScaler`
- standardization: $z = (x - \mu) / \sigma$ ($\mu$ is the mean, $\sigma$ is $x$’s standard deviation). It doesn’t restrict values to a specific range, but it’s much less affected by outliers. ← `StandardScaler`
- Warning: never use `fit()` or `fit_transform()` on anything other than the training set ← once fitted, use the scaler’s `transform()` on the other sets!
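A short sketch of both scalers, fitted on a toy training column only and then applied to new values (all numbers are made up). It also shows that a new value beyond the training maximum lands outside [0, 1] after min-max scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier
X_new = np.array([[2.0], [200.0]])                         # unseen data

# Fit on the training set only
min_max_scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
std_scaler = StandardScaler().fit(X_train)

train_minmax = min_max_scaler.transform(X_train)
train_std = std_scaler.transform(X_train)

# Reuse the fitted scalers on new data: transform() only, never fit()
new_minmax = min_max_scaler.transform(X_new)
new_std = std_scaler.transform(X_new)

print(train_minmax.ravel())
print(new_minmax.ravel())
```

If out-of-range values are a problem, `MinMaxScaler` has a `clip` hyperparameter that can be set to `True`.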
- If a feature has a heavy tail (ie. values far from the mean are not exponentially rare) → need to shrink the heavy tail first, then scale.
- Heavy tail to the right ← replace the feature with its square root.
- Power law distribution ← replace the feature with its logarithm ← Fig 2-17.
- Another approach: bucketizing the feature (chopping its distribution into roughly equal-sized buckets, replacing each feature value with the index of the bucket it belongs to)
- A feature has a multimodal distribution (i.e., with two or more clear peaks, called modes) → Strategies:
Method 1: Bucketize it, but this time treating the bucket IDs as categories, rather than as numerical values.
Method 2: Add a feature for each mode, representing the similarity between the feature (eg. the housing median age) and that mode, using a radial basis function (RBF). The most common type of RBF is the Gaussian RBF, where the output value decays exponentially as the input value moves away from the fixed point.
Eg. the similarity between the housing age $x$ and 35 is $\exp(-\gamma (x - 35)^2)$. The hyperparameter $\gamma$ (gamma) determines how quickly the similarity measure decays as $x$ moves away from 35.
```python
from sklearn.metrics.pairwise import rbf_kernel
```
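As a quick check of what `rbf_kernel` computes, the sketch below applies it to a few made-up housing ages and compares the result with the Gaussian RBF formula $\exp(-\gamma (x - 35)^2)$ written out by hand. The value `gamma=0.1` is only an example.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# A few illustrative housing median ages, as a 2D column
ages = np.array([[5.0], [30.0], [35.0], [40.0], [52.0]])

# Similarity of each age to the fixed point 35
age_simil_35 = rbf_kernel(ages, [[35.0]], gamma=0.1)

# Same thing computed directly from the formula
manual = np.exp(-0.1 * (ages - 35.0) ** 2)

print(age_simil_35.ravel())
```

The similarity equals 1 at age 35 and drops quickly on both sides; a smaller gamma makes it decay more slowly.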
- The target values may also need to be transformed. ← then use the transformer’s `inverse_transform()` method to map the model’s predictions back to the original scale.
- Use `TransformedTargetRegressor` ← give it a regression model & a label transformer, then fit it on the training set with the unscaled labels. After that, just use `.predict()` as normal.
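A minimal sketch of `TransformedTargetRegressor` on made-up data (a perfectly linear toy relationship, chosen so the result is easy to verify). The labels are passed unscaled; the wrapper scales them for training and inverts the scaling at prediction time.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data: the label is an exact linear function of the feature
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = 50_000.0 * X.ravel() + 100_000.0

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler(),  # applied to the labels, not the features
)
model.fit(X, y)  # unscaled labels go in

# Predictions come back in the original label scale
predictions = model.predict(np.array([[6.0]]))
print(predictions)
```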
```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(housing[["population"]])
```
A log-transformer
```python
from sklearn.metrics.pairwise import rbf_kernel

rbf_transformer = FunctionTransformer(rbf_kernel, kw_args=dict(Y=[[35.]], gamma=0.1))
age_simil_35 = rbf_transformer.transform(housing[["housing_median_age"]])
```
A transformer that computes the same Gaussian RBF similarity measure
```python
sf_coords = 37.7749, -122.41
sf_transformer = FunctionTransformer(rbf_kernel, kw_args=dict(Y=[sf_coords], gamma=0.1))
sf_simil = sf_transformer.transform(housing[["latitude", "longitude"]])
```
How to add a feature that measures the geographic similarity between each district and San Francisco
Custom transformers are useful to combine features too:
```python
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
ratio_transformer.transform(np.array([[1., 2.], [3., 4.]]))
```
A transformer that computes the ratio between the input features 0 and 1
You can create your own custom transformer class with methods such as `fit()`, `transform()`, and `fit_transform()`. These are the only methods you need to implement.
- Use `TransformerMixin` as a base class → you get `fit_transform()` for free.
- Use `BaseEstimator` as a base class (and avoid using `*args` and `**kwargs` in the constructor) → you get `get_params()` and `set_params()` for free.
A custom transformer can (and often does) use other estimators in its implementation.
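Here is a simplified sketch of a custom transformer that relies on another estimator internally. The class name `KMeansSimilarity` and the synthetic points are hypothetical and only for illustration: it fits `KMeans` in `fit()` and returns the Gaussian RBF similarity to each cluster center in `transform()`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

class KMeansSimilarity(BaseEstimator, TransformerMixin):
    """Hypothetical example: similarity of each sample to k-means cluster centers."""

    def __init__(self, n_clusters=3, gamma=1.0, random_state=None):
        # Only store hyperparameters here (no *args/**kwargs) so get_params() works
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None):
        # Learned attributes end with an underscore, following the sklearn convention
        self.kmeans_ = KMeans(n_clusters=self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X)
        return self  # fit() must return self

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

# Three well-separated synthetic blobs
rng = np.random.default_rng(42)
centers = [[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]]
points = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 2)) for c in centers])

cluster_simil = KMeansSimilarity(n_clusters=3, gamma=1.0, random_state=42)
similarities = cluster_simil.fit_transform(points)  # provided by TransformerMixin

print(similarities.shape)
print(cluster_simil.get_params())  # provided by BaseEstimator
```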
Check whether your custom estimator respects Scikit-Learn’s API by passing an instance to `check_estimator()`.
`ClusterSimilarity` ← a transformer that uses k-means to locate the clusters, then measures the Gaussian RBF similarity between each district and all cluster centers. (Figure 2-19)
- Many transformation steps need to be executed in the right order → scikit-learn has `Pipeline` to help.
```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])
```
An example of using `Pipeline`
- Pipeline = a list of name/estimator pairs.
- name = any unique string not containing a double underscore (`__`)
- estimators = must all be transformers (must have `fit_transform()`) ← except the last one, which can be anything!
- If you don’t want to name the transformers → use `make_pipeline()` instead.
```python
from sklearn.pipeline import make_pipeline

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
```
- The pipeline calls its steps sequentially, passing each output to the next step, and it exposes the same methods as the final estimator. ← eg. Above code → the pipeline acts as a transformer (its last estimator is `StandardScaler`)
- Pipeline supports indexing, ie. `pipeline[1]`.
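A small runnable sketch of the numerical pipeline on a toy array (values are made up), showing the auto-generated step names from `make_pipeline()` and two ways to reach a step: by index and through `named_steps`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy numerical data with one missing value
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_prepared = num_pipeline.fit_transform(X)  # impute, then standardize

# make_pipeline() names each step after its class, in lowercase
step_names = [name for name, _ in num_pipeline.steps]

scaler = num_pipeline[1]                              # access by index
imputer = num_pipeline.named_steps["simpleimputer"]   # access by name

print(step_names)
print(X_prepared)
```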
- Use `ColumnTransformer` to have a single transformer that handles all columns (numerical, categorical…) ← it applies the appropriate transformer to each column.
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = [...]
cat_attribs = [...]

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"))

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])
```
Use `make_column_transformer` and `make_column_selector` if you don’t want to name the transformers or list the column names yourself.
```python
from sklearn.compose import make_column_selector, make_column_transformer

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object)),
)
```
- Missing values: numerical features ← median, categorical features ← most frequent category.
- Categorical feature ← one-hot encoded.
- Ratio features will be added ← `bedrooms_ratio`, `rooms_per_house`, `people_per_house`.
- Cluster similarity features will also be added.
- Features with a long tail will be replaced by their logarithm.
- All numerical features will be standardized.
- Try simple linear regression first (use `LinearRegression`), then use `mean_squared_error` to measure the performance. ← result: RMSE ≈ 68k (whereas most districts’ `median_house_value` ranges between 120k and 265k) ← not very satisfying (of course)! ← underfitting! ← the features don’t provide enough info or the model isn’t powerful enough!
Options to improve: (1) a more powerful model, (2) better features, (3) fewer constraints on the model.
- Try `DecisionTreeRegressor` ← more powerful, able to find complex nonlinear relationships! (Chapter 6) ← result: RMSE = 0 on the training set ← overfitting! ← How to be sure? → evaluate the model on a validation set
- Option 1: Use `train_test_split()` to split the training set into a smaller training set and a validation set → train on the smaller training set and evaluate on the validation set.
- Option 2: Use k-fold cross-validation (`cross_val_score`) → randomly splits the training set into 10 nonoverlapping subsets (folds) → trains & evaluates 10 times (each time picking a different fold for validation and training on the 9 other folds) ← result: 66.8K±2K ← bad!
- Remark: cross-validation expects a utility function (greater is better) rather than a cost function (lower is better) → the scoring function is the negative of the RMSE.
- If the training error is low but the validation error is high → overfitting!
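The overfitting pattern described above can be reproduced on a small synthetic regression problem (a noisy sine curve, chosen only for illustration): an unconstrained decision tree reaches a training RMSE of about 0, while 10-fold cross-validation reveals a clearly larger error. Note the minus sign, needed because `neg_root_mean_squared_error` returns negative values.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: noisy sine curve (not the housing data)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=300)

tree_reg = DecisionTreeRegressor(random_state=42)

# Error measured on the data the tree was trained on
tree_reg.fit(X, y)
train_rmse = np.sqrt(np.mean((tree_reg.predict(X) - y) ** 2))

# 10-fold cross-validation: scores are negative RMSEs, so flip the sign
tree_rmses = -cross_val_score(
    tree_reg, X, y, scoring="neg_root_mean_squared_error", cv=10
)

print(train_rmse)
print(tree_rmses.mean(), tree_rmses.std())
```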
- Option 3: use `RandomForestRegressor` (Chapter 7) = train many decision trees on random subsets of the features, then average their predictions. ← ensemble model ← result: 47K±1K (much better) ← However, if we train `RandomForestRegressor` on the training set and measure the RMSE on it → 17K (much lower than 47K) → there is still overfitting! ← Solution: regularize the model, get more data,…
- At this stage, try several models → goal: a shortlist (2 to 5) of promising models.
After having a shortlist, the next stage is to fine-tune them!
- You can, but shouldn’t, fiddle with the hyperparameters manually until you find a great combination → use `GridSearchCV` instead (it searches for you)!
- Given which hyperparameters + which values to try out → it uses cross-validation to evaluate every combination.
- TIP: Using a Scikit-Learn pipeline for preprocessing lets you adjust preprocessing and model hyperparameters simultaneously. If pipeline fitting is costly, set the pipeline’s `memory` hyperparameter to a cache directory path.
- Sample code
- 2 dictionaries in `param_grid` → 3x3 + 3x2 = 15 combinations.
- Train the pipeline 3 times per combination (`cv=3`)
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# The grid below expects `preprocessing` to contain a step named "geo"
# that has an n_clusters hyperparameter
full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42)),
])
param_grid = [
    {'preprocessing__geo__n_clusters': [5, 8, 10],
     'random_forest__max_features': [4, 6, 8]},
    {'preprocessing__geo__n_clusters': [10, 15],
     'random_forest__max_features': [6, 8, 10]},
]
grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,
                           scoring='neg_root_mean_squared_error')
grid_search.fit(housing, housing_labels)
```
→ Total: 3x15=45 rounds of training.
`RandomizedSearchCV` is often preferable, especially when the hyperparameter search space is large.
- It evaluates a fixed number of combinations, selecting a random value for each hyperparameter at every iteration.
- Each hyperparameter → provide either a list of possible values or a probability distribution.
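A compact sketch of `RandomizedSearchCV` on synthetic data (the dataset, the search ranges and `n_iter=5` are illustrative assumptions). One hyperparameter is sampled from a distribution (`scipy.stats.randint`), the other from a list.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (not the housing data)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

param_distribs = {
    "max_features": randint(low=1, high=6),  # integers 1..5, sampled at random
    "n_estimators": [10, 20],                # a plain list also works
}

rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distribs,
    n_iter=5,   # number of random combinations to evaluate
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
rnd_search.fit(X, y)

print(rnd_search.best_params_)
print(-rnd_search.best_score_)  # best cross-validated RMSE
```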
- There are also `HalvingRandomSearchCV` and `HalvingGridSearchCV` ← they use computational resources more efficiently ← Idea: in the first rounds, they evaluate many candidates with limited resources (eg. a part of the training data), then only the best candidates go on to the next rounds (with more resources).
- Combine the models that perform best. The group (“ensemble”) often performs better than the best individual model. Check more in Chapter 7.
- Gain insights from the best models (eg. the importance of each feature) → check and drop the less useful features. ← do this automatically: `sklearn.feature_selection.SelectFromModel`
- Now is also a good time to ensure that your model not only works well on average, but also on all categories of districts.
- You are ready to evaluate the final model on the test set.
- You need to know how precise the error estimate from the test set is ← compute a 95% confidence interval for the generalization error using `scipy.stats.t.interval()`.
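A sketch of the confidence interval computation, assuming synthetic labels and predictions in place of the real test set results (the names `y_test` and `final_predictions` and the noise level are illustrative). The interval is computed on the mean squared error, then the square root maps it to an RMSE interval.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for test labels and model predictions
rng = np.random.default_rng(42)
y_test = rng.uniform(100_000, 500_000, size=1000)
final_predictions = y_test + rng.normal(scale=40_000, size=1000)

squared_errors = (final_predictions - y_test) ** 2
final_rmse = np.sqrt(squared_errors.mean())

# 95% Student-t interval around the mean squared error
confidence = 0.95
mse_low, mse_high = stats.t.interval(
    confidence,
    len(squared_errors) - 1,          # degrees of freedom
    loc=squared_errors.mean(),
    scale=stats.sem(squared_errors),  # standard error of the mean
)
rmse_low, rmse_high = np.sqrt(mse_low), np.sqrt(mse_high)

print(final_rmse, (rmse_low, rmse_high))
```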
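- After a lot of hyperparameter tuning, the performance on the test set is usually slightly worse than the cross-validation score, because the system ends up fine-tuned to the validation data. Resist tweaking the hyperparameters to improve the test score; such improvements are unlikely to generalize to new data.
- Save and load the model with `joblib`!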
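A minimal sketch of saving and reloading a fitted model with `joblib` (the toy model and the file name `my_model.pkl` are placeholders). A temporary directory keeps the example self-contained.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model standing in for the final pipeline
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "my_model.pkl")
    joblib.dump(model, model_path)      # save the fitted model
    reloaded = joblib.load(model_path)  # load it back

reloaded_prediction = reloaded.predict(np.array([[4.0]]))
print(reloaded_prediction)
```

Any custom classes or functions the model depends on must be importable in the environment that loads the file.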
- An example of deploying your model
- We can use Google’s Vertex AI to upload and deploy our model. 👈 My note: Google Vertex AI
- After deployment, you have to monitor the system and the model too. ← To see whether it’s still working well or needs to be improved.
- You should probably automate the whole process as much as possible.
- You should trigger alerts when something goes wrong.
- Make sure you keep backups of the models and have a process to roll back to a previous model. Keep backups for every model version, dataset version,…
- ML involves a lot of infrastructure (MLOps - ML Operations) → Chapter 19.
- Much of the work is in the data preparation step.
- Understanding the overall process and mastering a few machine learning algorithms can be more beneficial than solely focusing on exploring advanced algorithms.
- Kaggle is a good place for you to start an A-Z project.