Last modified on 05 Jun 2020.

Confusion matrix

                 actual (yes)   actual (no)
predict (yes)    TP             FP
predict (no)     FN             TN
  • True Positive (TP): we predict Positive and it is really Positive.
  • True Negative (TN): we predict Negative and it is really Negative.
  • False Negative (FN): we predict Negative but it is actually Positive.
  • False Positive (FP): we predict Positive but it is actually Negative.
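
To make the four cells concrete, here is a minimal pure-Python sketch that counts them from two label lists. The function name `confusion_counts` and the toy labels are made up for illustration; they do not come from any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, really positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, really negative
    return tp, fp, fn, tn


# Hypothetical toy labels: 1 = "yes", 0 = "no".
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```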

[Figure: “This guy is pregnant?”]

How to remember?

  • True/False indicates whether what we predicted is right or wrong.
  • Positive/Negative is what we predicted (yes or no).

Type I / Type II errors

  • FP = Type I error = rejection of a true null hypothesis = an actual negative is predicted wrongly = what we predict positive is actually negative.
  • FN = Type II error = non-rejection of a false null hypothesis = an actual positive is predicted wrongly = what we predict negative is actually positive.

Why is the confusion matrix (CM) important?

It gives a general view of our model and helps answer “is it really good?” through metrics such as precision and recall!

Precision & Recall

                 actual (yes)   actual (no)
predict (yes)    TP             FP             Precision
predict (no)     FN             TN
  • Precision: How many of our positive predictions are really positive? (It checks the accuracy of our positive predictions.)

    $$\mathrm{precision} = \dfrac{\mathrm{true\, positive}}{\mathrm{positively\, predicted\, results}} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}.$$

  • Recall: How many of the actual positive results are captured by our positive predictions? (Do we miss some positives by predicting them as negative?)

    $$\mathrm{recall} = \dfrac{\mathrm{true\, positive}}{\mathrm{positively\, actual\, results}} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$
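
As a quick worked example of the two formulas, the sketch below computes precision and recall from confusion-matrix counts. The counts are hypothetical, and returning 0.0 when a denominator is zero is just a convention chosen here.

```python
def precision(tp, fp):
    # Share of positive predictions that are really positive: TP / (TP + FP).
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Share of actual positives that we predicted positive: TP / (TP + FN).
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts.
tp, fp, fn, tn = 30, 10, 20, 940

p = precision(tp, fp)  # 30 / 40 = 0.75
r = recall(tp, fn)     # 30 / 50 = 0.6
print(p, r)
```

Note that TN does not appear in either formula: neither metric rewards the model for correctly predicted negatives.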

When to use?

  • Precision matters most when a “wrongly predicted yes” (FP) is costly (e.g. “Is this email spam?”: the model says yes but the answer is actually no, and we lose an important email!).
  • Recall matters most when a “wrongly predicted no” (FN) is costly (e.g. in the banking industry, “Is this transaction fraudulent?”: the model says no but the answer is actually yes, and we lose money!).
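
One common way to act on this preference is to move the decision threshold of a model that outputs scores. This goes slightly beyond the notes above, so treat it as an illustrative sketch: the scores, the labels, and the helper `precision_recall_at` are all made up.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when we predict "yes" for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores and true labels (1 = "yes", 0 = "no").
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

strict = precision_recall_at(0.85, scores, labels)   # precision 1.0, recall 0.4
middle = precision_recall_at(0.5, scores, labels)    # precision 0.8, recall 0.8
lenient = precision_recall_at(0.25, scores, labels)  # precision 5/7, recall 1.0
print(strict, middle, lenient)
```

On this toy data, a strict threshold removes every FP at the price of missing positives (the spam-filter preference), while a lenient one catches every positive at the price of more false alarms (the fraud-detection preference).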


High precision and low recall, or vice versa? The F1-score gives us a balance between precision and recall: it is their harmonic mean.

$$f_1 = \left(\frac{\mathrm{recall}^{-1} + \mathrm{precision}^{-1}}{2}\right)^{-1} = 2 \times \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$
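
The sketch below evaluates this formula for a balanced and a lopsided pair of hypothetical values and compares the result with the plain arithmetic mean. The helper name `f1_from` is made up, and returning 0.0 when both inputs are zero is a convention chosen here.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_from(0.8, 0.8)       # 0.8
lopsided = f1_from(1.0, 0.1)       # 0.2 / 1.1, about 0.18
arithmetic_mean = (1.0 + 0.1) / 2  # 0.55

print(balanced, lopsided, arithmetic_mean)
```

The harmonic mean is pulled toward the smaller of the two values, so a model cannot hide a very poor recall behind a perfect precision the way it could with an arithmetic mean.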

The F1-score depends on which class we label “positive”: “Is this email spam?” is very different from “Is this email not spam?”
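
To see how much the choice of the positive class matters, the sketch below scores the same hypothetical predictions twice, once with “spam” as positive and once with “not spam” as positive. It uses the equivalent count form $f_1 = 2\mathrm{TP} / (2\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$, which follows from substituting the precision and recall definitions; the helper name and the counts are made up.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts with "spam" as the positive class.
tp, fp, fn, tn = 5, 5, 5, 985

f1_spam = f1_from_counts(tp, fp, fn)  # 10 / 20 = 0.5

# Same predictions with "not spam" as the positive class:
# TP and TN swap roles, and so do FP and FN.
f1_not_spam = f1_from_counts(tn, fn, fp)  # 1970 / 1980, about 0.995

print(f1_spam, f1_not_spam)
```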

When to use F1-Score?

  • When you need a balance between precision and recall.
  • When we have a “skewed class” problem (uneven class distribution: too many “yes” and very few “no”, for example).
  • When improving one of precision and recall makes the other drop too much: the F1-score will then be very small!

How to choose f1-score value?

Normally, $f_1 \in (0, 1]$, and the higher the value, the better our model is.

  • In the best case ($f_1 = 1$), both precision and recall reach $100\%$.
  • If one of precision and recall gets a very small value (close to 0), $f_1$ is very small too: our model is not good!

What if we prefer one of precision and recall over the other? We consider $f_{\beta}$ [ref]:

$$f_{\beta} = (1 + \beta^2) \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}$$

$f_1$ is the special case of $f_{\beta}$ with $\beta = 1$. For other values:

  • When precision is more important than recall, we choose $\beta < 1$ (usually $\beta = 0.5$).
  • When recall is more important than precision, we choose $\beta > 1$ (usually $\beta = 2$).
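
A small sketch of how $\beta$ shifts the score, using a hypothetical model whose precision is higher than its recall. The helper name `fbeta` and the input values are made up for illustration.

```python
def fbeta(precision, recall, beta):
    """F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Hypothetical model: precise, but it misses half of the positives.
p, r = 0.9, 0.5

f_half = fbeta(p, r, 0.5)  # about 0.78, closer to precision
f_one = fbeta(p, r, 1)     # about 0.64, the usual F1
f_two = fbeta(p, r, 2)     # about 0.55, closer to recall

print(f_half, f_one, f_two)
```

With $\beta = 0.5$ the strong precision dominates and the score goes up; with $\beta = 2$ the weak recall dominates and the score goes down.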

Accuracy / Specificity

  • Accuracy: What fraction of all our predictions is correct?

    $$\mathrm{accuracy} = \dfrac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$

  • Specificity: How many of the actual negative results do we correctly predict as negative?

    $$\mathrm{specificity} = \dfrac{\mathrm{TN}}{\mathrm{FP} + \mathrm{TN}}$$

When to use?

  • Accuracy is used when we have symmetric datasets (roughly balanced classes, where false positives and false negatives matter about equally).
  • Specificity is used when we care about the TN values and don’t want to raise false alarms (FP), e.g. in a drug test.
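
The sketch below shows why accuracy alone can mislead on an asymmetric dataset. It assumes a hypothetical dataset with 10 “yes” and 990 “no” samples and a model that always predicts “no”; the function names are made up.

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + tn + fp + fn)


def specificity(fp, tn):
    return tn / (fp + tn) if (fp + tn) else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical skewed dataset; the model never predicts "yes".
tp, fp, fn, tn = 0, 0, 10, 990

acc = accuracy(tp, fp, fn, tn)  # 990 / 1000 = 0.99
spec = specificity(fp, tn)      # 990 / 990 = 1.0
rec = recall(tp, fn)            # 0 / 10 = 0.0

print(acc, spec, rec)
```

Accuracy and specificity look excellent here even though the model never finds a single positive, which is why recall and the F1-score are preferred for skewed classes.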

Confusion Matrix & F1-Score with Scikit-learn

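import numpy as np
from sklearn.metrics import confusion_matrix

# scikit-learn puts actual classes on rows and predicted classes on columns
# (transposed relative to the table above).
labels = np.unique(y_true)  # class labels, not the number of samples
confusion_matrix(y_true, y_pred, labels=labels)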
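
A self-contained version with hypothetical toy labels, showing how to unpack the four cells in the binary case. With `labels=[0, 1]`, scikit-learn returns the matrix as `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 1 = "yes", 0 = "no".
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# labels=[0, 1] fixes the order: index 0 = "no", index 1 = "yes".
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)  # 3 1 1 3
```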

Precision / Recall / f1-score / support

from sklearn.metrics import classification_report
# classification_report returns a formatted string, so print it.
print(classification_report(y_test, y_pred))
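
When only the individual numbers are needed, scikit-learn also provides `precision_score`, `recall_score` and `f1_score`. A small sketch on hypothetical toy labels (the `target_names` are made up too):

```python
from sklearn.metrics import (classification_report, f1_score,
                             precision_score, recall_score)

# Hypothetical toy labels: 1 = "yes", 0 = "no".
y_test = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# By default these treat label 1 as the positive class.
p = precision_score(y_test, y_pred)  # TP / (TP + FP) = 3 / 4
r = recall_score(y_test, y_pred)     # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_test, y_pred)        # harmonic mean of the two

print(p, r, f1)
# target_names follow the sorted label order: 0 -> "no", 1 -> "yes".
print(classification_report(y_test, y_pred, target_names=["no", "yes"]))
```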

ROC curve

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
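# Jupyter/IPython magic; remove this line when running as a plain script.
%matplotlib inline

# y_pred_prob: predicted probability (or score) of the positive class for each test sample.
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# Plot the ROC curve against the diagonal of a random classifier.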
plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random guess')
_ = plt.xlabel('False Positive Rate')
_ = plt.ylabel('True Positive Rate')
_ = plt.title('ROC Curve')
_ = plt.xlim([-0.02, 1])
_ = plt.ylim([0, 1.02])
_ = plt.legend(loc="lower right")
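
The snippet above assumes that `y_test` and `y_pred_prob` already exist. As a sketch without plotting, the example below feeds hypothetical labels and scores to `roc_curve` and also summarizes the curve with `roc_auc_score` (the area under the ROC curve, which these notes do not otherwise cover).

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted scores for the positive class.
y_test = [1, 1, 0, 1, 1, 0, 1, 0]
y_pred_prob = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]

# Each (fpr, tpr) pair is one point of the curve, obtained by predicting
# "yes" whenever the score is at least the matching threshold.
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print("AUC:", auc)
```

Here TPR is the recall defined earlier and FPR equals 1 − specificity, so the curve traces the trade-off between catching positives and raising false alarms as the threshold moves.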


References

  1. Classification: Precision and Recall - Google Developers, Machine Learning Crash Course.
  2. Classification: Check Your Understanding (Accuracy, Precision, Recall) - Google Developers, Machine Learning Crash Course.
  3. F-measure versus Accuracy - NLP blog.
  4. Accuracy, Precision, Recall or F1? - Koo Ping Shung, Towards Data Science.
  5. Dealing with Imbalanced data: undersampling, oversampling and proper cross-validation - Marco Altini.
  6. Accuracy, Recall, Precision, F-Score & Specificity, which to optimize on? - Salma Ghoneim, Towards Data Science.