Confusion matrix
               actual (yes)  actual (no)
predict (yes)  TP            FP
predict (no)   FN            TN
 True Positive (TP): what we predict Positive is really Positive.
 True Negative (TN): what we predict Negative is really Negative.
 False Negative (FN): what we predict Negative is actually Positive.
 False Positive (FP): what we predict Positive is actually Negative.
Example question: “Is this man pregnant?” Predicting “yes” here would be a False Positive.
How to remember?
 True/False indicates whether what we predicted is right or wrong.
 Positive/Negative is what we predicted (yes or no).
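The naming rule above can be checked by counting the four cells directly. The following is a minimal sketch in plain Python; the function name `confusion_counts` and the sample labels are made up for illustration and are not part of any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem (hypothetical helper)."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:      # we said "Positive"
            if actual == positive:
                tp += 1                # ... and we were right  -> True Positive
            else:
                fp += 1                # ... and we were wrong  -> False Positive
        else:                          # we said "Negative"
            if actual == positive:
                fn += 1                # ... and we were wrong  -> False Negative
            else:
                tn += 1                # ... and we were right  -> True Negative
    return tp, fp, fn, tn


# Made-up labels: 1 = yes, 0 = no
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```

The second word of each name comes from the prediction, the first from comparing it with the actual label.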
Type I / Type II errors
 FP = Type I error = rejecting a true null hypothesis = an actual negative is predicted wrongly = what we predict positive is actually negative.
 FN = Type II error = failing to reject a false null hypothesis = an actual positive is predicted wrongly = what we predict negative is actually positive.
Why is the confusion matrix important?
It gives a general view of our model and helps answer “is it really good?” through precision and recall.
Precision & Recall
               actual (yes)  actual (no)
predict (yes)  TP            FP           Precision
predict (no)   FN            TN
               Recall
Precision: How many of our positive predictions are really positive? (It checks the accuracy of our positive predictions.)
$$ \mathrm {precision} = \dfrac{\mathrm{true\, positive}}{\mathrm{positively\, predicted\, results}} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}. $$

Recall: How many of the actual positives do our predictions capture? (Do we miss some positives by predicting them as negative?)
$$ \mathrm {recall} = \dfrac{\mathrm{true\, positive}}{\mathrm{positively\, actual\, results}} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}. $$
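As a worked example of the two formulas, here is a small sketch in plain Python. The counts are made up, and returning `0.0` when the denominator is zero is an assumption of this sketch, not something the formulas define.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 if nothing was predicted positive (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 if there are no actual positives (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up counts: 40 positive predictions, of which 30 are right; 20 positives missed
tp, fp, fn = 30, 10, 20
print(precision(tp, fp))  # 0.75 -> 75% of our "yes" predictions are correct
print(recall(tp, fn))     # 0.6  -> we find 60% of the actual "yes" cases
```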
When to use?
 Precision matters most when a “wrongly predicted yes” (FP) is costly (e.g. “Is this email spam?”: the model says yes but the answer is actually no, and we lose an important email!).
 Recall matters most when a “wrongly predicted no” (FN) is costly (e.g. in the banking industry, “Is this transaction fraudulent?”: the model says no but the answer is actually yes, and we lose money!).
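One common way to act on such a preference is to move the decision threshold of a model that outputs scores. The text above does not prescribe this, so treat it as an illustrative assumption. In the sketch below, the helper `precision_recall_at`, the labels and the scores are all made up.

```python
def precision_recall_at(y_true, scores, threshold):
    """Predict "yes" when score >= threshold, then return (precision, recall)."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r


# Made-up data: 1 = yes, 0 = no, with a model score for each sample
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.20, 0.10]

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(threshold, round(p, 2), round(r, 2))
```

On this toy data, a low threshold catches every positive (recall 1.0) at the cost of more false alarms, while a high threshold makes every "yes" correct (precision 1.0) but misses half of the positives.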
F1-Score
High precision and low recall, or vice versa? The F1-Score gives us a balance between precision and recall.
$$ f_1 = \left({\frac {\mathrm {recall} ^{-1}+\mathrm {precision} ^{-1}}{2}}\right)^{-1}=2\times {\frac {\mathrm {precision} \cdot \mathrm {recall} }{\mathrm {precision} +\mathrm {recall} }}. $$
The F1-Score depends on which class we label as “positive”. “Is this email spam?” is very different from “Is this email not spam?”
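To see how much the choice of the positive class matters, the sketch below scores the same predictions twice, once with "spam" as positive and once with "ham" (not spam). The helper `f1_for_label` and the ten emails are made up for illustration.

```python
def f1_for_label(y_true, y_pred, positive):
    """F1 of a binary problem when `positive` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    if tp == 0:
        return 0.0  # assumed convention when there is no true positive
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up data: 2 spam emails and 8 normal ("ham") emails
y_true = ["spam", "spam"] + ["ham"] * 8
y_pred = ["spam", "ham", "spam"] + ["ham"] * 7

print(f1_for_label(y_true, y_pred, "spam"))  # 0.5
print(f1_for_label(y_true, y_pred, "ham"))   # 0.875
```

The predictions are identical in both calls; only the question being asked changes.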
When to use the F1-Score?
 When you need a balance between precision and recall.
 When we have a “skewed class” problem (uneven class distribution, too many “yes” and very few “no”, for example).
 When improving one of precision and recall makes the other drop too much: the F1-Score then becomes very small and exposes the problem.
How to interpret the F1-Score value?
In general, $f_1\in [0,1]$, and the higher the value, the better our model.
 In the best case ($f_1=1$), both precision and recall reach $100\%$.
 If one of precision and recall is very small (close to 0), $f_1$ is also very small and our model is not good!
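A quick numeric check of this behaviour: the sketch below compares the F1-Score (a harmonic mean) with the plain arithmetic mean for a few made-up precision/recall pairs. Returning `0.0` when both inputs are zero is an assumption of this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero, by assumption)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up (precision, recall) pairs
for p, r in [(1.0, 1.0), (0.9, 0.9), (1.0, 0.1), (0.99, 0.01)]:
    arithmetic_mean = (p + r) / 2
    print(p, r, round(arithmetic_mean, 3), round(f1_score(p, r), 3))
```

For the pair (1.0, 0.1) the arithmetic mean is still 0.55, while the F1-Score is about 0.18, so a single weak metric cannot hide behind a strong one.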
What if we value one of precision and recall more than the other? We consider $f_{\beta}$:
$$ f_{\beta} = ( 1 + \beta^2)\frac{\text{precision}\cdot\text{recall}}{\beta^2\cdot\text{precision} + \text{recall}} $$
$f_1$ is the special case of $f_{\beta}$ with $\beta=1$. For other values of $\beta$:
 When precision is more important than recall, we choose $\beta < 1$ (usually choose $\beta=0.5$).
 When recall is more important than precision, we choose $\beta > 1$ (usually choose $\beta=2$).
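The effect of $\beta$ can be seen by evaluating the formula for a model with high precision and low recall. This is a minimal sketch; the function name `f_beta`, the zero-denominator convention and the input values are assumptions made for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """(1 + beta^2) * P * R / (beta^2 * P + R); 0.0 if the denominator is zero."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Made-up model: precise (0.9) but with low recall (0.5)
precision, recall = 0.9, 0.5
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(precision, recall, beta), 4))
```

With $\beta=0.5$ the score is pulled toward the (high) precision, and with $\beta=2$ toward the (low) recall, so the same model scores highest under $f_{0.5}$ and lowest under $f_2$.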
Accuracy / Specificity

Accuracy: What fraction of all our predictions are correct?
$$ \mathrm{accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN} $$

Specificity: How many of the actual negatives do we correctly predict as negative?
$$ \mathrm{specificity} = \dfrac{TN}{FP + TN} $$
When to use?
 Accuracy is used when we have symmetric datasets (the classes are roughly balanced).
 Specificity is used when we care about TN values and don’t want to raise false alarms (FP), e.g. in a drug test.
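The reason accuracy needs a symmetric dataset can be illustrated with an extreme, made-up case: 990 actual negatives, 10 actual positives, and a model that always answers "no". The helper names below are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def specificity(tn, fp):
    """TN / (FP + TN); 0.0 if there are no actual negatives (assumed convention)."""
    return tn / (fp + tn) if (fp + tn) else 0.0


# A model that always predicts "no" on 10 positives and 990 negatives
tp, fp, fn, tn = 0, 0, 10, 990
print(accuracy(tp, tn, fp, fn))  # 0.99
print(specificity(tn, fp))       # 1.0
print(tp / (tp + fn))            # 0.0 -> recall: every positive is missed
```

Accuracy and specificity both look excellent here, yet the model never finds a single positive, which is why recall or the F1-Score is the safer choice on skewed classes.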
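Confusion Matrix & F1-Score with Scikit-learn
import numpy as np
from sklearn.metrics import confusion_matrix
# y_true, y_pred: 1-D arrays of actual and predicted labels
labels = np.unique(y_true)  # the class labels (target.shape[0] would be the number of samples)
# Note: scikit-learn puts actual classes in rows and predicted classes in columns,
# so for labels [0, 1] the result is [[TN, FP], [FN, TP]].
confusion_matrix(y_true, y_pred, labels=labels)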
Precision / Recall / F1-Score / support
from sklearn.metrics import classification_report
# classification_report returns a string, so print it to get the formatted table
print(classification_report(y_test, y_pred))
ROC curve
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
%matplotlib inline
# y_pred_prob: predicted scores or probabilities of the positive class, not hard 0/1 labels
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# create plot
plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random guess')  # dashed diagonal
_ = plt.xlabel('False Positive Rate')
_ = plt.ylabel('True Positive Rate')
_ = plt.title('ROC Curve')
_ = plt.xlim([-0.02, 1])
_ = plt.ylim([0, 1.02])
_ = plt.legend(loc="lower right")
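To make clear what the plotted curve contains, here is a rough plain-Python sketch of how ROC points can be computed by sweeping the decision threshold over the scores. It is not scikit-learn's implementation; the function `roc_points` and the data are made up, and it assumes 0/1 labels with both classes present.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0)."""
    n_pos = sum(y_true)                # number of actual positives (labels are 0/1)
    n_neg = len(y_true) - n_pos        # number of actual negatives
    points = [(0.0, 0.0)]              # threshold above every score: nothing predicted "yes"
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))  # FPR = FP / (FP + TN), TPR = recall
    return points


# Made-up labels and scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
for fpr, tpr in roc_points(y_true, scores):
    print(round(fpr, 2), round(tpr, 2))
```

Each step lowers the threshold, so the curve moves from (0, 0) toward (1, 1); the closer it bends toward the top-left corner, the better the scores separate the two classes.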