
Evaluation metrics describe how well a model performs. Basically, we compare the actual values with the predicted values to measure how far off the model is. In this post, I try to understand the meaning and usage of some popular error metrics in model evaluation.

## What’s an error of the model? (regression metrics)

The error of a model is the difference between the data points and the trend line generated by the algorithm. There are many ways to calculate this difference (regression metrics):

• Mean Absolute Error (MAE): $MAE = \frac{1}{n}\sum_{j=1}^n \vert y_j - \hat{y}_j \vert$
• Mean Squared Error (MSE): $MSE = \frac{1}{n}\sum_{j=1}^n (y_j - \hat{y}_j)^2$
• Root Mean Squared Error (RMSE): $RMSE = \sqrt{\frac{1}{n}\sum_{j=1}^n (y_j - \hat{y}_j)^2}$
• Relative Absolute Error (RAE): $RAE = \dfrac{\sum_{j=1}^n\vert y_j-\hat{y}_j\vert}{\sum_{j=1}^n\vert y_j-\bar{y}\vert}$ ($\bar{y}$ is the mean value of $y$)
• Relative Squared Error (RSE): $RSE = \dfrac{\sum_{j=1}^n (y_j-\hat{y}_j)^2}{\sum_{j=1}^n (y_j-\bar{y})^2}$
• R squared: $R^2 = 1 - RSE$
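To see the formulas above side by side, here is a minimal sketch in plain Python. The function name `regression_metrics` and the toy values are my own assumptions for illustration; in practice you would probably reach for the ready-made functions in `sklearn.metrics`.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, RAE, RSE and R^2 from two equal-length lists."""
    n = len(y_true)
    y_mean = sum(y_true) / n  # \bar{y} in the formulas
    abs_err = [abs(y - p) for y, p in zip(y_true, y_pred)]
    sq_err = [(y - p) ** 2 for y, p in zip(y_true, y_pred)]

    mae = sum(abs_err) / n
    mse = sum(sq_err) / n
    rmse = math.sqrt(mse)
    # Relative errors compare the model against always predicting the mean
    rae = sum(abs_err) / sum(abs(y - y_mean) for y in y_true)
    rse = sum(sq_err) / sum((y - y_mean) ** 2 for y in y_true)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "RAE": rae, "RSE": rse, "R2": 1 - rse}

# Hypothetical toy values, not taken from any real dataset
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that RAE and RSE are undefined when every $y_j$ equals the mean (the denominator is zero); this sketch does not guard against that case.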

What’s their meaning?

• MAE : It’s just the average size of the errors, the easiest one to understand. All individual differences carry the same weight; none counts more than the others.
• MSE : It emphasizes “larger” errors because of the squared term. The higher this value, the worse the model is.
• MSE is more popular than MAE because in MAE all errors count equally, while in MSE the bigger errors have much more influence on the final value.
• RMSE : the most popular because it is interpretable in the same units as the response vector $y$.
• RAE : It’s normalized: the units of $y$ cancel out, so the value doesn’t depend on the unit of $y$.
• RSE : It’s used for calculating $R^2$.
• $R^2$ : It represents how close the data values are to the fitted regression line. The higher the R-squared, the better the model fits your data.
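The difference between MAE and MSE is easier to see on numbers. The sketch below uses two hypothetical sets of predictions (made up for illustration) with the same total absolute error: one spreads it evenly, the other puts it all in a single large miss.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10.0, 10.0, 10.0, 10.0]
even_errors = [12.0, 8.0, 12.0, 8.0]      # four errors of size 2
one_big_error = [10.0, 10.0, 10.0, 18.0]  # a single error of size 8

print(mae(y_true, even_errors), mae(y_true, one_big_error))  # 2.0 2.0
print(mse(y_true, even_errors), mse(y_true, one_big_error))  # 4.0 16.0
```

MAE cannot tell the two cases apart, while MSE is four times larger for the single big miss.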

The choice of metric depends on:

• the type of model,
• your data type,
• your domain knowledge.

## How about the metrics for classification models?

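• Jaccard index (Jaccard similarity coefficient) : $J(y, \hat{y}) = \dfrac{\vert y \cap \hat{y} \vert}{\vert y \cup \hat{y} \vert}$, i.e. the size of the intersection of the actual and predicted label sets divided by the size of their union.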

• $0 \le J \le 1$.
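• Higher is better : when the common part of the actual and predicted labels is bigger, the numerator (the intersection) is bigger and the denominator (the union) is smaller.
sklearn.metrics.jaccard_score  # replaces jaccard_similarity_score (deprecated in scikit-learn 0.21, removed in 0.23)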
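As a rough illustration of the Jaccard index for a binary problem, the sketch below treats the samples labeled positive as a set and compares the actual set with the predicted one. The helper `jaccard_index`, the toy labels, and the choice to return 0.0 for an empty union are all assumptions of mine, not scikit-learn's implementation.

```python
def jaccard_index(y_true, y_pred, positive=1):
    """Intersection over union of the actual and predicted positive samples."""
    true_pos = {i for i, y in enumerate(y_true) if y == positive}
    pred_pos = {i for i, p in enumerate(y_pred) if p == positive}
    union = true_pos | pred_pos
    if not union:
        return 0.0  # convention chosen for this sketch when nothing is positive
    return len(true_pos & pred_pos) / len(union)

# Hypothetical labels
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(jaccard_index(y_true, y_pred))  # 0.5 (2 samples in common, 4 in the union)
```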

• F1-Score (F-score, F-measure) : I have written another post about this.

• $0 \le F_1 \le 1$.
• Higher is better : look at the formula of $F_1$ to see the reason. It needs both the precision and the recall to be bigger. In other words, the False Positive count is smaller (fewer wrongly selected items) and the False Negative count is smaller too (fewer wrongly non-selected items).
• Hard to remember TP, TN, FP, FN? Read this.
sklearn.metrics.f1_score
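To make the TP/FP/FN bookkeeping concrete, here is a minimal sketch that counts them by hand and then combines precision and recall into $F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$. The helper names and the toy labels are hypothetical; returning 0.0 when there are no true positives is a convention I picked for the sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    return tp, fp, fn, tn

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    if tp == 0:
        return 0.0  # convention for this sketch: no true positives means F1 = 0
    precision = tp / (tp + fp)  # share of selected items that are correct
    recall = tp / (tp + fn)     # share of positive items that were selected
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
print(f1(y_true, y_pred))
```

Here precision and recall are both 2/3, so $F_1$ is 2/3 as well; lowering either FP or FN raises the score.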

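• Log Loss : Logarithmic loss (also known as log loss) measures the performance of a classifier whose predicted output is a probability value between 0 and 1.
• We can calculate the log loss for each row using the log loss equation, which measures how far each prediction is from the actual label, and then average over all rows: $LogLoss = -\frac{1}{n}\sum_{j=1}^n \left[ y_j \log \hat{y}_j + (1 - y_j)\log(1 - \hat{y}_j) \right]$, where $\hat{y}_j$ is the predicted probability that row $j$ belongs to class 1.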

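• $0 \le LogLoss < \infty$ : a perfect classifier scores 0, and there is no upper bound because a confident wrong prediction is penalized more and more heavily.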
• Smaller is better.
sklearn.metrics.log_loss
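A small sketch of the binary log loss, written by hand to show how each row contributes. The function name, the clipping constant `eps`, and the toy probabilities are assumptions for illustration only.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical labels and predicted probabilities of class 1
y_true = [1, 0, 1]
good_probs = [0.9, 0.1, 0.8]  # confident and right
bad_probs = [0.1, 0.9, 0.2]   # confident and wrong

good = binary_log_loss(y_true, good_probs)
bad = binary_log_loss(y_true, bad_probs)
print(f"good: {good:.4f}, bad: {bad:.4f}")
```

On this toy data the confident-but-wrong predictions give a loss above 2, while the good predictions stay close to 0.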

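• Accuracy score : the fraction of predictions that exactly match the actual labels. In scikit-learn:

sklearn.metrics.accuracy_score(y_test, y_pred)
# For binary & multiclass classification, the removed jaccard_similarity_score
# returned the same value as accuracy_score; its replacement jaccard_score does not.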


## The idea of K-fold cross validation?

• We use CV to estimate how well an ML model would generalize to new data. It helps detect overfitting and underfitting.
• The CV set and the training set must come from the same distribution! Why? Check this.
• We split the data into K folds; each fold is used once as the CV set while the remaining folds form the training set. After that, we average the K scores and use this average to compare candidate models and choose the best one.

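from sklearn.model_selection import KFold

# X and y are assumed to be NumPy arrays of features and labels defined elsewhere
kf = KFold(n_splits=2)
kf.get_n_splits(X)  # number of folds (2 here)
for train_index, test_index in kf.split(X):
    # Each fold serves once as the test (CV) set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]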
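To show what happens inside the loop, here is a sketch of K-fold cross validation without scikit-learn. It splits the indices into contiguous folds (no shuffling), "trains" a deliberately trivial model that always predicts the mean of the training targets, scores each fold with MSE, and averages the fold scores. The function names, the mean-predictor model, and the toy data are all assumptions made for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate_mean_model(y, k):
    """Return (average MSE, per-fold MSEs) for a mean-predictor model."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        # "Training": the model is just the mean of the training targets
        prediction = sum(y[i] for i in train_idx) / len(train_idx)
        # "Validation": MSE on the held-out fold
        fold_mse = sum((y[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
        scores.append(fold_mse)
    return sum(scores) / k, scores

# Hypothetical targets
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mean_score, fold_scores = cross_validate_mean_model(y, k=3)
print(fold_scores, mean_score)
```

The average over folds is the number you would compare across candidate models; a single fold score can be misleading, as the spread between the fold values here suggests.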


Source of figures used in this post: k-fold, jaccard, f1-score.