F1 score

The F1 score is the harmonic mean of precision and recall, combining them into a single classification metric:

$F_{1} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}$

The two forms are equivalent: the second follows from substituting the confusion-matrix definitions $\text{precision} = TP / (TP + FP)$ and $\text{recall} = TP / (TP + FN)$ into the first and simplifying.
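A quick numeric check of the equivalence between the two forms, as a minimal sketch. The counts `tp`, `fp`, and `fn` are made-up values chosen only for illustration.

```python
import math

# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fp, fn = 40, 10, 30

precision = tp / (tp + fp)  # 40 / 50
recall = tp / (tp + fn)     # 40 / 70

# First form: harmonic mean of precision and recall.
f1_from_pr = 2 * precision * recall / (precision + recall)

# Second form: computed directly from the counts.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f1_from_pr, f1_from_counts)
assert math.isclose(f1_from_pr, f1_from_counts)
```

Note that true negatives appear in neither form: F1 ignores how many negatives were correctly rejected.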

Why harmonic mean

The harmonic mean is high only when both of its inputs are high. If precision is great and recall is terrible, or vice versa, the F1 score sits close to the smaller of the two rather than at their average.

Compare with the arithmetic mean. A classifier with precision 0.99 and recall 0.01 has:

Arithmetic mean: $(0.99 + 0.01) /2 = 0.5$ — sounds fine.
F1 score (harmonic mean): $2 \cdot 0.99 \cdot 0.01/ (0.99 + 0.01) \approx 0.02$ — correctly reflects that the classifier is useless.

This is what makes F1 a sterner test than the arithmetic mean. It punishes imbalances between precision and recall, and rewards classifiers that are reasonably good at both.
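The comparison can be reproduced for a few precision/recall pairs. This is a minimal sketch: the helper names `arithmetic_mean` and `harmonic_mean` and the example pairs are illustrative assumptions, not part of any library.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2


def harmonic_mean(p, r):
    # Taken to be 0 when both inputs are 0, to avoid dividing by zero.
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


# (precision, recall) pairs, made up for illustration.
pairs = [(0.99, 0.01), (0.9, 0.5), (0.7, 0.7)]

for p, r in pairs:
    print(
        f"p={p:.2f} r={r:.2f} "
        f"arithmetic={arithmetic_mean(p, r):.3f} "
        f"harmonic={harmonic_mean(p, r):.3f}"
    )
```

The harmonic mean never exceeds the arithmetic mean, and the two coincide only when precision equals recall.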

When F1 is the right summary

F1 is the standard single-number summary when:

We care about both precision and recall and don’t want to pick one.
The dataset is imbalanced and accuracy would be misleading.
The application is in information retrieval or natural-language processing, where F1 is the conventional metric.
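To make the imbalance point concrete, here is a small sketch with a made-up dataset of 990 negatives and 10 positives, scored against a degenerate classifier that always predicts the majority class. All data and variable names are illustrative assumptions.

```python
# Toy imbalanced dataset (made up): 990 negatives, 10 positives.
y_true = [0] * 990 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 1000

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)

# Convention assumed here: F1 is taken as 0 if the denominator is 0.
denom = 2 * tp + fp + fn
f1 = 2 * tp / denom if denom else 0.0

print(f"accuracy={accuracy:.2f}, F1={f1:.2f}")
```

Accuracy comes out at 0.99 even though the classifier never finds a single positive, while F1 is 0.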

F1 is not the right summary when:

The costs of FP and FN are very different. In that case, weighted variants — $F_{β}$ with $β < 1$ favoring precision or $β > 1$ favoring recall — are more appropriate.
The output is genuinely continuous (regression) — F1 is for classification.
The threshold matters and we want to see performance across thresholds. A single F1 value describes one decision threshold only; ROC curve and AUC are better for that purpose.
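The threshold dependence is easy to see by recomputing F1 at a few cut-offs. This is a sketch under stated assumptions: `f1_at_threshold` is a hypothetical helper, and the labels and scores are made up. It does not compute an ROC curve; it only shows that one classifier yields different F1 values at different thresholds.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 when every score >= threshold is labelled positive."""
    y_pred = [int(s >= threshold) for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is taken as 0 if the denominator is 0.
    return 2 * tp / denom if denom else 0.0


# Made-up labels and classifier scores, for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

for threshold in (0.2, 0.5, 0.75):
    f1 = f1_at_threshold(y_true, scores, threshold)
    print(f"threshold={threshold}: F1={f1:.3f}")
```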

Generalized $F_{β}$

The $F_{β}$ score generalizes F1 by treating recall as $β$ times as important as precision:

$F_{β} = \frac{(1 + β^{2}) \cdot \text{precision} \cdot \text{recall}}{β^{2} \cdot \text{precision} + \text{recall}}$

$F_{1}$ — equal weight (the harmonic mean above).
$F_{0.5}$ — weights precision more.
$F_{2}$ — weights recall more (common in medical screening).
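The effect of $β$ can be checked by evaluating the formula above for one classifier at the three values listed. This is a minimal sketch: `f_beta` is a hypothetical helper written for this note, and the precision and recall values are made up.

```python
def f_beta(precision, recall, beta):
    """F-beta computed from precision and recall."""
    denom = beta**2 * precision + recall
    # Convention assumed here: return 0 if the denominator is 0.
    return (1 + beta**2) * precision * recall / denom if denom else 0.0


# Made-up classifier: perfect recall, mediocre precision.
precision, recall = 0.5, 1.0

for beta in (0.5, 1, 2):
    print(f"beta={beta}: F={f_beta(precision, recall, beta):.3f}")
```

Because recall is the stronger of the two here, the score rises as $β$ grows and recall counts for more.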

In scikit-learn

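from sklearn.metrics import f1_score, fbeta_score

# y_test holds the true labels and y_pred the predicted labels,
# both assumed to be defined earlier (binary labels by default).
f1 = f1_score(y_test, y_pred)
f2 = fbeta_score(y_test, y_pred, beta=2)     # weights recall more
f05 = fbeta_score(y_test, y_pred, beta=0.5)  # weights precision more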

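For multi-class problems, the average parameter chooses how per-class scores are combined, for example 'macro', 'micro', or 'weighted'. The default, 'binary', only applies to two-class labels.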
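To show what two of those averaging modes compute, here is a hand-rolled sketch of macro and micro averaging. It is not scikit-learn's implementation: the helpers `per_class_counts` and `f1_from_counts` are hypothetical, and the three-class labels are made up.

```python
def per_class_counts(y_true, y_pred, label):
    """One-vs-rest TP, FP, FN counts for a single class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is taken as 0 if the denominator is 0.
    return 2 * tp / denom if denom else 0.0


# Made-up three-class labels, for illustration only.
y_true = ["a", "a", "a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c", "a"]

labels = sorted(set(y_true))
counts = [per_class_counts(y_true, y_pred, c) for c in labels]

# Macro: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_from_counts(*c) for c in counts) / len(labels)

# Micro: pool the counts over all classes, then compute a single F1.
micro_f1 = f1_from_counts(*(sum(col) for col in zip(*counts)))

print(f"macro={macro_f1:.3f}, micro={micro_f1:.3f}")
```

Macro averaging gives every class the same weight regardless of its size, whereas micro averaging lets frequent classes dominate.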

Idriss Rami — Notes
