In this chapter you will work through an example project end to end.

In this chapter, we use the **California Housing Prices dataset** (or download it from the author’s repository). This data includes metrics such as the population, median income, and median housing price for each block group (called a “**district**” for short). Your model should learn from this data → predict the median housing price in any district.

You should pull out this ML project checklist (Appendix A in the book) for each project.

*Ask questions to find the methods.*

**Question**: What exactly is the business objective? (Building a model isn’t the final goal.) → Business objective: is it worth investing in a given area?

**Question**: What does the current solution look like (if any)? ← a reference for performance → Prices are currently estimated manually by experts. ← Their estimates were off by **more than 30%**.

**Pipeline** = a sequence of data processing components is called a data *pipeline*. **Each component is handled by a team. The whole process is robust.**

**Question**: What kind of training supervision does the model need (supervised, unsupervised, semi-supervised, self-supervised, or reinforcement)? Is it classification, regression, or something else? Should it use batch learning or online learning?

- **supervised** ← the model is trained with *labeled* examples.
- **multiple regression** ← predict a value using multiple features.
- **univariate regression** ← predict a *single value* for each district. If we want to predict multiple values → *multivariate regression*.
- **Batch learning** ← no continuous flow of data, no need to adjust to changing data rapidly, and the data is small.

If the data were huge → split batch learning across multiple servers (using the **MapReduce** technique) or use online learning.

- A typical performance measure for regression: *Root Mean Square Error* (RMSE).

  It corresponds to the *Euclidean norm* (or $\ell_2$ *norm*, noted $\lVert \cdot \rVert_2$ or just $\lVert \cdot \rVert$). It is more sensitive to outliers than the MAE below. If outliers are rare (bell-shaped data) → RMSE performs well!

- If there are outlier districts, we can use the **Mean Absolute Error** (MAE, also called *Average Absolute Deviation*).

  It corresponds to the $\ell_1$ *norm*, noted $\lVert \cdot \rVert_1$, or *Manhattan norm* (it measures the distance between two points in a city if you can only travel along orthogonal city blocks).

- The $\ell_k$ *norm* of a vector $\mathbf{v}$ containing $n$ elements: $\lVert \mathbf{v} \rVert_k = \left(|v_1|^k + |v_2|^k + \dots + |v_n|^k\right)^{1/k}$.

  $\ell_0$ = number of nonzero elements in the vector.

  $\ell_\infty$ = maximum absolute value in the vector.
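As a quick numerical check of the definitions above, here is a minimal sketch with made-up numbers (the arrays `y_true` and `y_pred` are invented for illustration). With $m$ instances and prediction errors $e_i$, RMSE is $\sqrt{\tfrac{1}{m}\sum_i e_i^2}$ (the $\ell_2$ norm of the error vector divided by $\sqrt{m}$) and MAE is $\tfrac{1}{m}\sum_i |e_i|$ (the $\ell_1$ norm divided by $m$).

```python
import numpy as np

# Made-up labels and predictions; the last prediction is far off (an outlier error).
y_true = np.array([200.0, 150.0, 300.0, 250.0])
y_pred = np.array([210.0, 140.0, 300.0, 350.0])

errors = y_pred - y_true          # [10, -10, 0, 100]
m = len(errors)

rmse = np.sqrt(np.mean(errors ** 2))
mae = np.mean(np.abs(errors))

# The same quantities expressed through vector norms.
rmse_from_norm = np.linalg.norm(errors, ord=2) / np.sqrt(m)
mae_from_norm = np.linalg.norm(errors, ord=1) / m

# l0 "norm" = number of nonzero elements, l-infinity norm = max absolute value.
l0 = np.linalg.norm(errors, ord=0)
l_inf = np.linalg.norm(errors, ord=np.inf)

print(f"RMSE={rmse:.2f}, MAE={mae:.2f}, l0={l0}, l_inf={l_inf}")
```

The single large error (100) pulls the RMSE (about 50.5) well above the MAE (30), which is the outlier sensitivity mentioned above.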

It is beneficial to communicate with other teams in the pipeline to understand the assumptions regarding the overall problems. If there are any changes, adjust your methods accordingly to adapt to them.

It’s time for the code. Check the official Jupyter notebooks here. In this chapter, we run these notebooks using Google Colab at this link.

**Thi**: I skip some sections related to the usage of Google Colab and Jupyter Notebook. The code in this note is just snippets.

→ 10 attributes.

- 20640 instances → small (by ML standards).
- `total_bedrooms` has missing values.
- `ocean_proximity` isn’t numeric ← it’s a categorical attribute (check Fig 3.

Check with histograms.

```
import matplotlib.pyplot as plt

# `housing` is the DataFrame loaded earlier in the notebook.
housing.hist(bins=50, figsize=(12, 8))
save_fig("attribute_histogram_plots")  # extra code (helper defined in the notebook)
plt.show()
```

Some remarks:

- `median_income` isn’t expressed in plain US$. → Ask the team that collected the data → it’s scaled (1 unit = 10k$) and capped (values >15 and <0.5 each go into 1 bin). ← *Thi: lumped together.*
- `housing_median_age` and `median_house_value` were capped too, but the cap on `median_house_value` is a problem because it is the target we want to predict. → Ask the client team whether they need precise predictions beyond 500k. → If so, collect proper labels for the capped districts (>500k) or remove those districts.
- Attributes have very different scales → (later) **feature scaling**.
- Many histograms are **skewed right** → harder to detect patterns → they need to be transformed (more symmetrical / bell-shaped).

- Why set aside a test set now? → Your brain is an amazing pattern detection system → looking at the test set early leads to overfitting (by you) ← called **data snooping bias**.
- Check the code in the notebook. We set aside **20% of the data** for the test set.

**Problem**: If we split the test set randomly → it will change at each run → over time, you (or the model) get to see the whole dataset.

**Solution**: Save the test set on the first run and reuse it in subsequent runs, OR set the random seed. ← **Weakness**: both break when we have new data.

→ We should use the id of each instance (e.g. compute its hash) to decide whether it goes into the test set → the test set stays consistent across multiple runs.

```
from zlib import crc32

import numpy as np

def is_id_in_test_set(identifier, test_ratio):
    # An instance is in the test set if its hash falls in the lowest test_ratio fraction.
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]  # (train set, test set)
```
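To see why hashing ids keeps the split stable, here is a small sketch on a made-up table. The `toy` and `bigger` DataFrames and their `id` column are assumptions for illustration only; the two functions are repeated so that the snippet runs on its own.

```python
from zlib import crc32

import numpy as np
import pandas as pd

def is_id_in_test_set(identifier, test_ratio):
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Made-up dataset: 1,000 rows with a stable integer id.
toy = pd.DataFrame({"id": np.arange(1000), "value": np.arange(1000) * 2.0})
train_set, test_set = split_data_with_id_hash(toy, 0.2, "id")

# Later, the dataset grows by 200 new rows (ids 1000..1199).
bigger = pd.DataFrame({"id": np.arange(1200), "value": np.arange(1200) * 2.0})
train_bigger, test_bigger = split_data_with_id_hash(bigger, 0.2, "id")

# Old test rows stay in the test set; old training rows never move into it.
old_test_kept = set(test_set["id"]).issubset(set(test_bigger["id"]))
old_train_leaked = set(train_set["id"]) & set(test_bigger["id"])

print(len(test_set), len(test_bigger), old_test_kept, len(old_train_leaked))
```

Because membership depends only on each row's id, adding rows never reshuffles the rows that were already assigned.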

- The code above needs an “id” column → check for one, or use stable attributes to generate a consistent id for each instance, e.g. `longitude` and `latitude`.

- Use scikit-learn

```
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
```

There are more methods in `sklearn.model_selection`.

**Stratified sampling**: the population is divided into subgroups called *strata*, and a representative sample is taken from each stratum to ensure the test set represents the entire population.

- E.g. in a survey, make sure the male/female ratio in the sample is representative of the population.
- Suppose some experts told you that `median_income` (Fig 2-8) is very important → make sure the test set is representative of the various income categories.
- Use `pd.cut()` to categorize a numerical attribute.
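The stratified split described above can be sketched as follows. This is a minimal example on synthetic incomes, not the notebook's exact code: the `toy` DataFrame, the bin edges, and the labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: incomes between 0.5 and 15).
rng = np.random.default_rng(42)
toy = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, size=1000)})

# Turn the continuous attribute into 5 income categories (illustrative bin edges).
toy["income_cat"] = pd.cut(
    toy["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Stratify on the category so each stratum keeps its share in the test set.
strat_train_set, strat_test_set = train_test_split(
    toy, test_size=0.2, stratify=toy["income_cat"], random_state=42
)

overall = toy["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test_set["income_cat"].value_counts(normalize=True).sort_index()
max_gap = (overall - in_test).abs().max()
print(f"largest proportion gap between full data and test set: {max_gap:.4f}")
```

The category proportions in the test set should be nearly identical to those in the full dataset, which is the point of stratifying.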

- Make sure to put the test set aside and explore only the training set.
- You can sample an *exploration set* if the training set is large, just for exploring the data.
- Make a copy to work with: `housing = strat_train_set.copy()`.

Use a **scatterplot** to visualize all districts.

```
housing.plot(kind="scatter", x="longitude", y="latitude", grid=True, alpha=0.2)
plt.show()
```

```
housing.plot(
    kind="scatter", x="longitude", y="latitude", grid=True,
    s=housing["population"] / 100, label="population",  # marker size = population
    c="median_house_value", cmap="jet", colorbar=True,  # color = price
    legend=True, sharex=False, figsize=(10, 7)
)
plt.show()
```

Compute the *standard correlation coefficient* (or *Pearson’s r*) between every pair of attributes.

- The correlation coefficient ranges from -1 to 1. ~1 → strong positive correlation. ~-1 → strong negative correlation. 0 → no linear correlation. ← The correlation coefficient only measures linear correlations.
- Use Pandas `scatter_matrix()` to check the correlation between attributes → it plots every numerical attribute against every other numerical attribute.
- From Fig 2-14, the most promising attribute to predict the median house value is the median income. → Zoom in on it:
  - The correlation is quite strong.
  - Points aren’t too dispersed.
  - The price cap is visible as a horizontal line at $500k.
  - There are other “less obvious” straight lines around 450k, 350k, 280k ← you may want to remove these districts to prevent your algorithm from learning to reproduce these quirks.

- If some attributes have a skewed-right distribution → transform them (e.g. compute their logarithm or square root).
- Last thing before preparing the data for the pipeline → try out various attribute combinations.
  - E.g. rooms per household = #rooms / #households, bedroom ratio = #bedrooms / #rooms, population / #households, …

→ Found that `bedrooms_ratio` is good to use (its corr = -0.256397) → houses with a lower bedroom/room ratio tend to be more expensive.

`housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]`
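The correlation check and the attribute combination above can be sketched on synthetic data. The `toy` DataFrame below is generated at random purely for illustration (its correlations say nothing about the real dataset), and `numeric_only=True` assumes a reasonably recent pandas version.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (all values are made up).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=n),
    "total_rooms": rng.integers(500, 5000, size=n).astype(float),
})
toy["total_bedrooms"] = toy["total_rooms"] * rng.uniform(0.1, 0.3, size=n)
toy["median_house_value"] = (
    40_000 * toy["median_income"] + rng.normal(scale=50_000, size=n)
)

# Attribute combination: a ratio is often more informative than raw counts.
toy["bedrooms_ratio"] = toy["total_bedrooms"] / toy["total_rooms"]

# Pearson's r between every pair of numerical attributes.
corr_matrix = toy.corr(numeric_only=True)

# Rank the attributes by their linear correlation with the target.
ranking = corr_matrix["median_house_value"].sort_values(ascending=False)
print(ranking)
```

Sorting one column of the correlation matrix is a quick way to see which attributes (including newly combined ones) look most promising.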

- You should write functions to perform this step instead of doing it manually!

- You should also separate the predictors and the labels, since you don’t necessarily want to apply the same transformations to the predictors and the target values

- Most ML algorithms cannot work with *missing features*. E.g. `total_bedrooms` has missing values. Options:
  - Remove the corresponding districts.
  - Remove the whole attribute.
  - Set the missing values to some value (zero, mean, median, …) ← called **imputation** ← `sklearn.impute`
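A minimal sketch of median imputation with `SimpleImputer`, using a tiny made-up table (the `toy` DataFrame is an assumption for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up table with one missing value in total_bedrooms.
toy = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0, 40.0],
    "total_bedrooms": [2.0, np.nan, 6.0, 10.0],
})

imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(toy)  # learns the medians, then fills the gaps

print(imputer.statistics_)  # the median of each column
print(X)
```

The imputer stores one median per column in `statistics_`, so the same fitted object can later fill missing values in the validation set, the test set, or new data.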

Scikit-Learn design ← paper: “*API design for machine learning software: experiences from the scikit-learn project*”.

`ocean_proximity` → text attribute → check its values → it’s a categorical attribute.

- Most ML algorithms prefer to work with numbers → convert categories to numbers.
- Check `sklearn.preprocessing`.
- Suppose we use `OrdinalEncoder` to convert categories to 0, 1, 2, 3, … ← **Issue**: ML algorithms assume that two nearby values are more similar than two distant values! ← That may be fine if the categories are ordered, like “bad” < “average” < “good” < “excellent”, but not for `ocean_proximity`!

→ **Solution**: use **one-hot encoding** = create one binary attribute per category! ← The output of `OneHotEncoder` is a SciPy *sparse matrix* (a very efficient representation for matrices that contain mostly zeros ← saves memory and speeds up computations).

- Pandas also has `pd.get_dummies(df)`, which has the same functionality as Scikit-Learn’s `OneHotEncoder`, but we prefer **the latter (smarter)** because it remembers which categories it was trained on.
- `OneHotEncoder` can also detect unknown categories and raise an exception, whereas `get_dummies()` cannot (it just creates a new column)!
- If an attribute has a large number of categories → one-hot encoding is not a good fit (too many input features) → replace the categorical attribute with related numerical ones.
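The difference between `OneHotEncoder` and `get_dummies()` can be seen in a small sketch. The two DataFrames are made up for illustration; `handle_unknown="ignore"` is the option that encodes an unseen category as all zeros instead of raising.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"]}
)
# New data containing a category that was never seen during training.
new = pd.DataFrame({"ocean_proximity": ["<2H OCEAN", "INLAND"]})

# The encoder remembers the categories it was fitted on.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded_train = encoder.fit_transform(train)      # SciPy sparse matrix
encoded_new = encoder.transform(new).toarray()    # unknown category → all zeros

# With the default handle_unknown="error", the unknown category raises.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True

# get_dummies() has no memory: it simply builds columns from what it sees.
dummies = pd.get_dummies(new)

print(encoder.categories_[0], encoded_new, raised, list(dummies.columns))
```

The encoder always outputs the three columns it learned, whereas `get_dummies()` produces a column layout that depends on the data passed in, which would not match what a trained model expects.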

- One of the most important transformations you need to apply to your data is **feature scaling**.
- ML algorithms don’t perform well when numerical attributes have very different scales.
  - E.g. #rooms has a much larger range than median income → without scaling, the model is biased: it will focus more on #rooms.
- 2 common ways: **min-max scaling** (or *normalization*) & **standardization** (*z-score normalization*).
  - **min-max scaling**: values are rescaled to [0, 1] (or another range) ← $x' = \dfrac{x - \min}{\max - \min}$ ← `MinMaxScaler`
  - **standardization**: $x' = \dfrac{x - \mu}{\sigma}$ ($\mu$ is the mean, $\sigma$ is the feature’s standard deviation). It doesn’t restrict values to a specific range, but it’s much less affected by outliers. ← `StandardScaler`

**Warning**: never use `fit()` or `fit_transform()` on anything other than the training set ← once the scaler is fitted, you can use it to `transform()` any other set!
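The fit-on-train, transform-everything-else rule can be sketched with two tiny made-up arrays (`X_train` and `X_test` are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [3.0], [5.0]])
X_test = np.array([[7.0]])  # lies outside the range seen during training

# Fit on the training set only; the scalers remember min/max and mean/std.
min_max = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
std = StandardScaler().fit(X_train)

train_scaled = min_max.transform(X_train)
test_min_max = min_max.transform(X_test)   # (7 - 1) / (5 - 1)
test_std = std.transform(X_test)           # (7 - 3) / std of the training set

print(train_scaled.ravel(), test_min_max.ravel(), test_std.ravel())
```

Note that the test value maps to 1.5 under min-max scaling: values outside the training range can fall outside [0, 1], which is expected because the scaler is never refitted on the test set.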

- If a feature has a **heavy tail** (i.e. values far from the mean are not exponentially rare) → shrink the heavy tail first, then scale.
  - Heavy tail to the right ← replace the feature with its square root.
  - *Power law distribution* ← replace the feature with its logarithm ← Fig 2-17.
  - Another approach: *bucketizing* the feature (chopping its distribution into roughly equal-sized buckets, and replacing each feature value with the index of the bucket it belongs to).
- If a feature has a **multimodal distribution** (i.e., with two or more clear peaks, called *modes*) → Strategies:

  *Method 1:* Bucketize it, but this time treat the bucket IDs as categories, rather than as numerical values.

  *Method 2:* Add a feature for each mode, representing the similarity between the feature (e.g. the housing median age) and that mode, using a **radial basis function** (RBF). The most common type of RBF is the **Gaussian RBF**, whose output value decays exponentially as the input value moves away from the fixed point.

  E.g. the Gaussian RBF similarity between the housing age $x$ and 35 is $\exp(-\gamma (x - 35)^2)$. The hyperparameter $\gamma$ determines how quickly the similarity measure decays as $x$ moves away from 35.

  `from sklearn.metrics.pairwise import rbf_kernel`

- The target values may also need to be transformed. ← Then use the transformer’s `inverse_transform()` method to map the predictions back to the original scale.
- Or use `TransformedTargetRegressor` ← give it a regression model & a label transformer, then fit it on the training set with the unscaled labels. After that, just use `.predict()` as normal.
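Both options can be compared in a short sketch. The data here is a made-up, perfectly linear toy problem (an assumption for illustration), so both approaches should predict about 500 for the new input.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([100.0, 200.0, 300.0, 400.0])
X_new = np.array([[5.0]])

# Option 1: scale the labels by hand, then invert the scaling on the predictions.
target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(y.reshape(-1, 1))
manual_model = LinearRegression().fit(X, scaled_labels)
manual_pred = target_scaler.inverse_transform(manual_model.predict(X_new))

# Option 2: let TransformedTargetRegressor handle both directions.
wrapped_model = TransformedTargetRegressor(
    regressor=LinearRegression(), transformer=StandardScaler()
)
wrapped_model.fit(X, y)              # fitted with the unscaled labels
wrapped_pred = wrapped_model.predict(X_new)

print(manual_pred.ravel(), wrapped_pred)
```

The wrapper is less error-prone because the inverse transformation cannot be forgotten at prediction time.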

```
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(housing[["population"]])
```

A log-transformer

```
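from sklearn.metrics.pairwise import rbf_kernel

# Similarity between each district's housing median age and 35
rbf_transformer = FunctionTransformer(rbf_kernel, kw_args=dict(Y=[[35.]], gamma=0.1))
age_simil_35 = rbf_transformer.transform(housing[["housing_median_age"]])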
```

A transformer that computes the Gaussian RBF similarity measure described earlier

```
sf_coords = 37.7749, -122.41  # (latitude, longitude) of San Francisco
sf_transformer = FunctionTransformer(rbf_kernel, kw_args=dict(Y=[sf_coords], gamma=0.1))
sf_simil = sf_transformer.transform(housing[["latitude", "longitude"]])
```

How to add a feature that will measure the geographic similarity between each district and San Francisco

Custom transformers are also useful for combining features:

```
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
ratio_transformer.transform(np.array([[1., 2.], [3., 4.]]))  # → [[0.5], [0.75]]
```

A transformer that computes the ratio between input features 0 and 1

You can create your own custom transformer class. It needs three methods: `fit()` (which must return `self`), `transform()`, and `fit_transform()`. These are the only methods you need to implement.

- Use `TransformerMixin` as a base class → you get `fit_transform()` for free.
- Use `BaseEstimator` as a base class (and avoid using `*args` and `**kwargs` in the constructor) → you get `get_params()` and `set_params()` for free.

A custom transformer can (and often does) use other estimators in its implementation.
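As an illustration, here is a sketch of a custom transformer that wraps another estimator, modelled on the `ClusterSimilarity` transformer these notes mention. The exact code in the book's notebook may differ; the default hyperparameter values and the random coordinates below are assumptions.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    """Gaussian RBF similarity between each sample and k-means cluster centers."""

    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        # No *args/**kwargs, so BaseEstimator can provide get_params()/set_params().
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        # Learned attributes end with an underscore, by scikit-learn convention.
        self.kmeans_ = KMeans(n_clusters=self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self  # fit() must return self

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]

# Made-up 2D coordinates standing in for (latitude, longitude).
rng = np.random.default_rng(42)
coords = rng.uniform(0.0, 1.0, size=(100, 2))

cluster_simil = ClusterSimilarity(n_clusters=3, gamma=1.0, random_state=42)
similarities = cluster_simil.fit_transform(coords)  # provided by TransformerMixin
print(similarities.shape, cluster_simil.get_params()["n_clusters"])
```

Only `__init__()`, `fit()`, and `transform()` are written by hand here; `fit_transform()`, `get_params()`, and `set_params()` come from the two base classes.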

Check whether your custom estimator respects Scikit-Learn’s API by passing an instance to `check_estimator()`.

`ClusterSimilarity` ← a transformer that uses k-means to locate the clusters, then measures the Gaussian RBF similarity between each district and all cluster centers. (Figure 2-19)

- Many transformation steps need to be executed in the right order → Scikit-Learn has `Pipeline` to help.

```
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])
```

An example of using `Pipeline`.

- A pipeline takes a **list** of name/estimator pairs.
  - name = any unique string not containing `__` (double underscore).
  - estimators = all must be transformers (i.e. they must have `fit_transform()`) ← except the last one, which can be anything!
- If you don’t want to name the transformers → use `make_pipeline()` instead.

`num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())`

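- When you call a pipeline’s `fit()` method, it calls `fit_transform()` **sequentially** on each transformer, passing the output of each call to the next one, until it reaches the final estimator, for which it just calls `fit()`. The pipeline exposes the same methods as its final estimator. ← E.g. in the code above → the pipeline acts as a transformer (its last estimator is `StandardScaler`).
- A pipeline supports indexing, i.e. `pipeline[1]`.
- Use `ColumnTransformer` to have a single transformer → handle all columns (numerical, categorical, …) ← it applies the appropriate transformer to each column.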

```
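from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

num_attribs = [...]  # names of the numerical columns
cat_attribs = [...]  # names of the categorical columns

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"))

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])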
```

Use `make_column_transformer` and `make_column_selector` if you don’t want to name the transformers or list the columns by hand.

```
from sklearn.compose import make_column_selector, make_column_transformer

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object)),
)
```

Recap of what the full preprocessing pipeline does:

- Missing values: numerical features ← median, categorical features ← most frequent category.
- Categorical features ← one-hot encoded.
- A few ratio features are added ← `bedrooms_ratio`, `rooms_per_house`, `people_per_house`.
- Cluster similarity features will also be added.
- Features with a long tail will be replaced by their logarithm.
- All numerical features will be standardized.

- Try a simple linear regression first (use `LinearRegression`), then use `mean_squared_error` to measure the performance (RMSE) on the training set. ← Result: RMSE ≈ 68k (whereas the `median_housing_values` range is between 120k and 265k) ← not very satisfying (of course)! ← **Underfitting**! ← Either the features don’t provide enough information or the model isn’t powerful enough!

Options to improve: (1) more powerful model, (2) better features, (3) reduce constraints on the model.

- Try `DecisionTreeRegressor` ← more powerful, able to find complex nonlinear relationships! (Chapter 6) ← Result: an RMSE of 0 on the training set ← most likely **overfitting**! ← How to be sure? → Split the training set into a training set and a validation set.
- Option 1: Use `train_test_split()` to split the training set into a smaller training set and a validation set → train again on the smaller training set and evaluate on the validation set.

- Option 2: Use **k-fold cross-validation** (`cross_val_score`) → it randomly splits the training set into 10 nonoverlapping subsets (*folds*) → trains & evaluates the model 10 times (each time picking one fold for validation and training on the 9 other folds) ← Result: 66.8K±2K ← bad!

**Remark**: Cross-validation expects a scoring function where *greater is better* (the opposite of a cost function, where *lower is better*) → that is why the scoring is the negative RMSE.

- If the training error is low but the validation error is high → overfitting!

- Option 3: use `RandomForestRegressor` (Chapter 7) = train many decision trees on random subsets of the features, then average their predictions. ← an **ensemble** **model** ← Result: 47K±1K (much better). ← However, if we train the `RandomForestRegressor` on the whole training set and measure the RMSE on it → 17K (much lower than 47K) → there is still overfitting! ← Solutions: regularize the model, get more data, …
- At this stage, try several models → **goal**: a shortlist (2 to 5) of promising models.
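The gap between training error and cross-validation error described above can be reproduced on a toy problem. This is only a sketch: the noisy sine data below is synthetic and unrelated to the housing dataset, so the numbers will not match the ones quoted in these notes.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: a sine curve plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

tree = DecisionTreeRegressor(random_state=42)

# Error on the data the tree was trained on: an unconstrained tree memorizes it.
tree.fit(X, y)
train_rmse = np.sqrt(np.mean((tree.predict(X) - y) ** 2))

# 10-fold cross-validation; the score is the NEGATIVE RMSE (greater is better),
# so flip the sign to get RMSE values back.
cv_rmses = -cross_val_score(
    tree, X, y, scoring="neg_root_mean_squared_error", cv=10
)

print(f"training RMSE: {train_rmse:.3f}")
print(f"cross-validation RMSE: {cv_rmses.mean():.3f} ± {cv_rmses.std():.3f}")
```

A training RMSE near zero combined with a clearly larger cross-validation RMSE is the overfitting signature mentioned above.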

Once you have a shortlist, the next stage is to fine-tune the models!

- You can, but **shouldn’t**, play with the hyperparameters **manually** until you find a great combination → use `GridSearchCV` instead (it searches for you)!
- You tell it which hyperparameters to experiment with and which values to try out → it uses cross-validation to evaluate all the possible combinations.

**TIP**: Using a Scikit-Learn pipeline for preprocessing lets you tune preprocessing and model hyperparameters simultaneously. If fitting the pipeline is costly, set the pipeline’s `memory` parameter to a cache directory path.

- Sample code:
  - 2 dictionaries in `param_grid` → 3×3 + 2×3 = 15 combinations.
  - The pipeline is trained 3 times per combination (`cv=3`).

```
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42)),
])
# "preprocessing__geo__n_clusters" = n_clusters of the "geo" step inside "preprocessing"
# (this assumes the notebook's full preprocessing, which has a step named "geo").
param_grid = [
    {'preprocessing__geo__n_clusters': [5, 8, 10],
     'random_forest__max_features': [4, 6, 8]},
    {'preprocessing__geo__n_clusters': [10, 15],
     'random_forest__max_features': [6, 8, 10]},
]
grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,
                           scoring='neg_root_mean_squared_error')
grid_search.fit(housing, housing_labels)
```

→ Total: 3x15=45 rounds of training.

`RandomizedSearchCV` is often preferable, especially when the hyperparameter search space is large.

- It evaluates a **fixed** number of combinations, selecting a **random value** for each hyperparameter at every iteration.
- For each hyperparameter → provide either a list of possible values or a probability distribution.
- There are also `HalvingRandomSearchCV` and `HalvingGridSearchCV` ← they use computational resources more efficiently ← **Idea**: in the first rounds, many candidates are evaluated with limited resources (e.g. a part of the training data); only the best candidates go on to the next rounds, which get more resources.
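A minimal sketch of a randomized search, assuming a small synthetic regression problem (the data, the search ranges, and `n_iter=5` are all made up for illustration):

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data: 4 features, the target depends on the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=120)

# One hyperparameter gets a distribution, the other a list of values.
param_distribs = {
    "max_features": randint(low=1, high=5),  # integers 1..4
    "n_estimators": [10, 20],
}

rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distribs,
    n_iter=5,   # fixed budget: only 5 random combinations are evaluated
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
rnd_search.fit(X, y)

print(rnd_search.best_params_, -rnd_search.best_score_)
```

The budget is controlled by `n_iter` rather than by the size of the grid, so widening a search range does not increase the training cost.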

- Combine the models that perform best. A group (an *ensemble*) often performs better than the best individual model. Check more in Chapter 7.
- Gain insights from the best models (e.g. the relative importance of each feature) → check and drop the less useful features. ← To do this automatically: `sklearn.feature_selection.SelectFromModel`
- Now is also a good time to ensure that your model works well not only on average, but also on all categories of districts.

- You are ready to evaluate the final model on the test set.
- You need to know how precise the error estimate from the test set is ← compute a 95% *confidence interval* for the generalization error using `scipy.stats.t.interval()`.
- After a lot of hyperparameter tuning, the performance on the test set is usually slightly worse than what cross-validation measured, because the system ends up fine-tuned to the validation data. Resist tweaking the hyperparameters to make the test-set numbers look better, as the improvements are unlikely to generalize to new data.
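The confidence interval mentioned above can be computed as in the following sketch. The labels and predictions are synthetic (assumptions for illustration); in practice they would be the test-set labels and the final model's predictions.

```python
import numpy as np
from scipy import stats

# Synthetic test labels and predictions (prediction error std = 50,000).
rng = np.random.default_rng(42)
y_test = rng.uniform(100_000, 500_000, size=500)
final_predictions = y_test + rng.normal(scale=50_000, size=500)

squared_errors = (final_predictions - y_test) ** 2
rmse = np.sqrt(squared_errors.mean())

# 95% t-interval for the mean squared error, then take the square root
# to express the bounds on the RMSE scale.
confidence = 0.95
low, high = np.sqrt(
    stats.t.interval(
        confidence,
        len(squared_errors) - 1,           # degrees of freedom
        loc=squared_errors.mean(),
        scale=stats.sem(squared_errors),   # standard error of the mean
    )
)

print(f"RMSE = {rmse:,.0f}, 95% CI = [{low:,.0f}, {high:,.0f}]")
```

A wide interval is a sign that the test set is too small to pin down the generalization error precisely.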

- Save and load the model with `joblib`!
- An example of deploying your model:
  - We can use Google’s Vertex AI to upload and deploy our model. 👈 My note: Google Vertex AI
- After deployment, you have to monitor the system and the model too. ← To see whether it’s still working well or needs to be improved.
- You should probably automate the whole process as much as possible.
- You should trigger alerts when something goes wrong.
- Make sure you keep backups of the models and have a rollback process to a previous model. Keep backups of model versions, dataset versions, …
- ML involves a lot of infrastructure (**MLOps** = ML Operations) → Chapter 19.
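Saving and reloading a model with `joblib`, as mentioned above, might look like this. It is a minimal sketch: the tiny linear model and the file name `my_model.pkl` are placeholders, not the chapter's final model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder model trained on made-up data.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X, y)

# Save the fitted model to disk (a temporary directory is used here).
model_path = os.path.join(tempfile.mkdtemp(), "my_model.pkl")
joblib.dump(model, model_path)

# Later (e.g. in production), reload it and predict as usual.
reloaded = joblib.load(model_path)
X_new = np.array([[4.0]])
same_predictions = np.allclose(model.predict(X_new), reloaded.predict(X_new))
print(same_predictions)
```

Any custom classes or functions the model relies on must be importable before calling `joblib.load()`, and the same file can serve as a versioned backup for rollbacks.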

- Much of the work is in the data preparation step.

- Understanding the overall process and mastering a few machine learning algorithms can be more beneficial than solely focusing on exploring advanced algorithms.

- Kaggle is a good place for you to start an A-Z project.