Specificity measures: of all the actual negatives, what fraction did the model correctly identify as negative?
The denominator is all the negatives — those the model correctly classified (TN) plus those it incorrectly flagged (FP).
Also called the true negative rate (TNR). It’s the mirror image of recall (sensitivity, TPR):
- Recall (TPR) — fraction of actual positives caught.
- Specificity (TNR) — fraction of actual negatives correctly identified.
High specificity means the model rarely raises false alarms. The applications that care most about specificity are those where a false alarm is costly:
- Spam filtering, where a legitimate email going to the spam folder (FP) is more harmful than a few spam messages getting through.
- Drug testing, where wrongly flagging an innocent person is worse than missing some users.
- Quality control, where wrongly rejecting good product is more expensive than missing the occasional defect.
The complement of specificity is the False positive rate (FPR):
FPR is what the ROC curve plots on the x-axis (against TPR on the y-axis), so specificity and FPR are essentially the same information in different forms.
The recall-specificity tradeoff
Tuning a classifier’s threshold trades recall against specificity:
- Lower threshold (more eager to predict positive) → higher recall, lower specificity.
- Higher threshold (more conservative) → lower recall, higher specificity.
The ROC curve traces out this tradeoff across all thresholds. The right operating point depends on the costs in the application.
In scikit-learn
scikit-learn doesn’t have a standalone specificity_score, but it’s easy to compute from the Confusion matrix:
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
specificity = tn / (tn + fp)Or equivalently as the recall of the negative class — pass pos_label=0 to recall_score. The wine-quality classifier from the Introduction to Data Science textbook ends up with recall 0.84 and specificity 0.55 — a real asymmetry, the kind of thing accuracy alone wouldn’t reveal.