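The F1 score is the harmonic mean of precision $P$ and recall $R$, combining them into a single classification metric:
$$F_1 = \frac{2PR}{P + R} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$
The two forms are equivalent: the second comes from substituting the confusion-matrix definitions $P = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$ and $R = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$ and simplifying.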
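The simplification can be spelled out in a few steps. This is a sketch that writes $P$ and $R$ for precision and recall and $\mathrm{TP}$, $\mathrm{FP}$, $\mathrm{FN}$ for the confusion-matrix counts; the symbol names are a notational assumption.

$$
\begin{aligned}
P &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \\
PR &= \frac{\mathrm{TP}^2}{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})}, \qquad
P + R = \frac{\mathrm{TP}\,(2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN})}{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})} \\
F_1 &= \frac{2PR}{P + R} = \frac{2\,\mathrm{TP}^2}{\mathrm{TP}\,(2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN})} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
\end{aligned}
$$

True negatives never appear in the result, which is why F1 does not depend on how many negatives the classifier gets right.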
Why harmonic mean
The harmonic mean has the property that both factors must be high for the result to be high. If precision is great and recall is terrible — or vice versa — the F1 score is closer to the smaller of the two, not their average.
Compare with the arithmetic mean. A classifier with precision 0.99 and recall 0.01 has:
- Arithmetic mean: $(0.99 + 0.01)/2 = 0.50$, which sounds fine.
- F1 score (harmonic mean): $2 \cdot 0.99 \cdot 0.01 / (0.99 + 0.01) \approx 0.02$, which correctly reflects that the classifier is useless.
This is what makes F1 a sterner test than the arithmetic mean. It punishes imbalances between precision and recall, and rewards classifiers that are reasonably good at both.
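The comparison is easy to reproduce numerically. The sketch below uses two hypothetical helper functions (`arithmetic_mean` and `f1`, not part of any library) and a few illustrative precision/recall pairs to show how the two means diverge as the values become unbalanced.

```python
def arithmetic_mean(precision: float, recall: float) -> float:
    """Plain average of precision and recall."""
    return (precision + recall) / 2


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative (precision, recall) pairs: very unbalanced, somewhat
# unbalanced, and perfectly balanced.
pairs = [(0.99, 0.01), (0.90, 0.50), (0.70, 0.70)]

for p, r in pairs:
    print(f"P={p:.2f} R={r:.2f}  "
          f"arithmetic={arithmetic_mean(p, r):.3f}  F1={f1(p, r):.3f}")
```

The two means coincide only when precision equals recall; the further apart the values are, the further F1 falls below the arithmetic mean.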
When F1 is the right summary
F1 is the standard single-number summary when:
- We care about both precision and recall and don’t want to pick one.
- The dataset is imbalanced and accuracy would be misleading.
- The application is in information retrieval or natural-language processing, where F1 is the conventional metric.
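The imbalanced-data case is worth seeing with numbers. The following sketch uses made-up confusion-matrix counts (1000 samples, only 10 of them positive) and two hypothetical helpers to show accuracy looking excellent while F1 exposes a weak classifier.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion-matrix counts; true negatives are not used."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts: 10 positives, of which only 2 are found,
# plus 3 false alarms among the 990 negatives.
tp, fn = 2, 8
fp, tn = 3, 987

print(f"accuracy = {accuracy(tp, fp, fn, tn):.3f}")
print(f"F1       = {f1_from_counts(tp, fp, fn):.3f}")
```

Accuracy is dominated by the 987 easy negatives, while F1 ignores them entirely and reflects the 8 missed positives.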
F1 is not the right summary when:
- The costs of FP and FN are very different. In that case, weighted $F_\beta$ variants, with $\beta < 1$ favoring precision or $\beta > 1$ favoring recall, are more appropriate.
- The output is genuinely continuous (regression) — F1 is for classification.
- The threshold matters and we want to see performance across thresholds — ROC curve and AUC are better.
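To illustrate the threshold point: F1 is computed from hard predictions, so a single value describes only one operating point. The sketch below uses invented labels and scores and a hypothetical helper `f1_at_threshold` to show the score changing as the decision threshold moves.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 after turning scores into 0/1 predictions at the given threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented example data: true labels and classifier scores.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]

for threshold in (0.30, 0.50, 0.75):
    print(f"threshold={threshold:.2f}  "
          f"F1={f1_at_threshold(y_true, scores, threshold):.3f}")
```

The same scores yield quite different F1 values depending on the cutoff, which is the information a threshold-sweeping summary is meant to capture.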
Generalized $F_\beta$
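The $F_\beta$ score generalizes F1 to weight recall differently from precision:
$$F_\beta = (1 + \beta^2)\,\frac{PR}{\beta^2 P + R}$$
- $\beta = 1$: equal weight (the harmonic mean above).
- $\beta < 1$ (for example $\beta = 0.5$): weights precision more.
- $\beta > 1$ (for example $\beta = 2$): weights recall more (common in medical screening).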
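The effect of the weighting can be checked directly. This is a minimal sketch assuming the usual definition $F_\beta = (1 + \beta^2)\,PR / (\beta^2 P + R)$; the helper name `fbeta` and the precision/recall values are illustrative.

```python
def fbeta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta from precision and recall; beta > 1 leans toward recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Illustrative classifier with higher recall than precision.
precision, recall = 0.5, 0.8

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={fbeta(precision, recall, beta):.3f}")
```

Because recall is the stronger of the two values here, the score rises as $\beta$ grows and moves toward recall, and falls toward precision as $\beta$ shrinks.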
In scikit-learn
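from sklearn.metrics import f1_score, fbeta_score
# y_test: true labels; y_pred: predicted labels from a fitted classifier
f1 = f1_score(y_test, y_pred)                # beta = 1
f2 = fbeta_score(y_test, y_pred, beta=2)     # weights recall more
f05 = fbeta_score(y_test, y_pred, beta=0.5)  # weights precision more
For multi-class problems, the average= parameter chooses how per-class scores are combined (for example "macro", "micro", or "weighted").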
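To make the averaging choices concrete, here is a pure-Python sketch of what the macro, weighted, and micro averages compute. It is an illustration of the definitions, not scikit-learn's implementation; the helper names and the toy labels are assumptions.

```python
from collections import Counter


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion-matrix counts (0.0 when undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_class_counts(y_true, y_pred):
    """One-vs-rest TP/FP/FN counts for every class seen in the data."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class gains a false positive
            counts[t]["fn"] += 1  # true class gains a false negative
    return counts


def averaged_f1(y_true, y_pred, average: str) -> float:
    counts = per_class_counts(y_true, y_pred)
    per_class = {c: f1_from_counts(**k) for c, k in counts.items()}
    if average == "macro":     # unweighted mean over classes
        return sum(per_class.values()) / len(per_class)
    if average == "weighted":  # mean weighted by number of true samples
        support = Counter(y_true)
        return sum(per_class[c] * support[c] for c in per_class) / len(y_true)
    if average == "micro":     # pool the counts, then compute one F1
        tp = sum(k["tp"] for k in counts.values())
        fp = sum(k["fp"] for k in counts.values())
        fn = sum(k["fn"] for k in counts.values())
        return f1_from_counts(tp, fp, fn)
    raise ValueError(f"unknown average: {average!r}")


# Toy three-class example with unequal class sizes.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 2]

for mode in ("macro", "weighted", "micro"):
    print(f"{mode:>8}: {averaged_f1(y_true, y_pred, mode):.3f}")
```

On the same inputs, `f1_score(y_true, y_pred, average="macro")` and its "weighted" and "micro" counterparts should agree with these values. Macro averaging treats every class equally, so a small class counts as much as a large one, whereas weighted and micro averaging let the larger classes dominate.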