Binary Classification Metrics
Metrics for evaluating models that predict between two classes (positive/negative, yes/no, 1/0).
Overview
Binary classification metrics fall into two categories:
- Threshold-dependent: Require binary predictions (0/1)
- Precision, Recall, F-score, Specificity
- Threshold-independent: Use probability scores
- AUC, Brier score, Log loss
Quick Reference
| Metric | Input Type | Range | Best For |
|---|---|---|---|
auc | Probabilities | [0, 1] | Model comparison |
precision | Labels | [0, 1] | Minimizing FP |
recall | Labels | [0, 1] | Minimizing FN |
fbeta_score | Labels | [0, 1] | Balanced evaluation |
mcc | Labels | [-1, 1] | Imbalanced data |
brier_score | Probabilities | [0, 1] | Calibration |
ROC-Based Metrics
Area Under ROC Curve
auc(actual, predicted_probs)Interpretation:
- AUC = 1.0: Perfect ranking
- AUC = 0.5: Random guessing
- AUC < 0.5: Worse than random (flip predictions!)
Guidelines:
- 0.9-1.0: Excellent
- 0.8-0.9: Good
- 0.7-0.8: Fair
- 0.6-0.7: Poor
- 0.5-0.6: Fail
Gini Coefficient
gini_coefficient(actual, predicted_probs)Relationship: Gini = 2 × AUC - 1
KS Statistic
ks_statistic(actual, predicted_probs)When to use: Credit scoring, marketing response modeling.
Probability Calibration Metrics
Log Loss
ll(actual, predicted_probs) # Elementwise
logloss(actual, predicted_probs) # MeanWhen to use:
- Training neural networks (cross-entropy loss)
- When probability values matter, not just ranking
Brier Score
brier_score(actual, predicted_probs)Interpretation:
- 0: Perfect calibration
- 0.25: Random guessing for balanced data
- 1: Complete miscalibration
When to use: Weather forecasting, medical prognosis - anywhere probability calibration matters.
Precision and Recall
Precision (Positive Predictive Value)
precision(actual, predicted_labels)Interpretation: Of all samples predicted positive, what fraction are actually positive?
Optimize for precision when: False positives are costly
- Spam detection (don't mark good email as spam)
- Legal discovery (don't flag innocent documents)
Recall (Sensitivity, True Positive Rate)
recall(actual, predicted_labels)
sensitivity(actual, predicted_labels) # AliasInterpretation: Of all actual positives, what fraction did we detect?
Optimize for recall when: False negatives are costly
- Disease screening (don't miss sick patients)
- Fraud detection (don't miss fraudulent transactions)
- Security threats (don't miss actual threats)
F-Score
fbeta_score(actual, predicted_labels; beta=1.0)Choosing beta:
- β = 1: Equal weight to precision and recall (F1)
- β = 0.5: Precision weighted 2× more than recall
- β = 2: Recall weighted 2× more than precision
Formula: F_β = (1 + β²) × (precision × recall) / (β² × precision + recall)
Specificity and NPV
specificity(actual, predicted_labels)
npv(actual, predicted_labels)Relationships:
- Sensitivity (recall) + FNR = 1
- Specificity + FPR = 1
- Precision + FDR = 1
- NPV + FOR = 1
Error Rates
fpr(actual, predicted_labels) # False Positive Rate
fnr(actual, predicted_labels) # False Negative RateCombined Metrics
Youden's J (Informedness)
youden_j(actual, predicted_labels)Use case: Finding optimal threshold that maximizes sensitivity + specificity.
Markedness
markedness(actual, predicted_labels)Interpretation: How marked (informative) are positive and negative predictions?
Fowlkes-Mallows Index
fowlkes_mallows_index(actual, predicted_labels)Likelihood Ratios (Medical/Diagnostic)
positive_likelihood_ratio(actual, predicted_labels)
negative_likelihood_ratio(actual, predicted_labels)
diagnostic_odds_ratio(actual, predicted_labels)Interpretation of LR+:
- LR+ > 10: Strong evidence for positive
- LR+ = 5-10: Moderate evidence
- LR+ = 2-5: Weak evidence
- LR+ = 1: Useless test
Interpretation of LR-:
- LR- < 0.1: Strong evidence for negative
- LR- = 0.1-0.2: Moderate evidence
- LR- = 0.2-0.5: Weak evidence
- LR- = 1: Useless test
Business Metrics
Lift
lift(actual, predicted_probs; percentile=0.1)Interpretation: How many times better than random in the top X%?
- Lift = 3 in top 10%: 3× more positives than random
Gain
gain(actual, predicted_probs; percentile=0.1)Interpretation: What percentage of all positives are captured in top X%?
Usage Examples
Complete Binary Classification Evaluation
using UnifiedMetrics
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_probs = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]
predicted_labels = predicted_probs .>= 0.5
println("=== Threshold-Independent ===")
println("AUC: ", round(auc(actual, predicted_probs), digits=3))
println("Gini: ", round(gini_coefficient(actual, predicted_probs), digits=3))
println("KS Statistic: ", round(ks_statistic(actual, predicted_probs), digits=3))
println("Brier Score: ", round(brier_score(actual, predicted_probs), digits=3))
println("Log Loss: ", round(logloss(actual, predicted_probs), digits=3))
println("\n=== Threshold-Dependent (threshold=0.5) ===")
println("Precision: ", round(precision(actual, predicted_labels), digits=3))
println("Recall: ", round(recall(actual, predicted_labels), digits=3))
println("F1 Score: ", round(fbeta_score(actual, predicted_labels), digits=3))
println("Specificity: ", round(specificity(actual, predicted_labels), digits=3))
println("MCC: ", round(mcc(actual, predicted_labels), digits=3))Comparing Different Thresholds
using UnifiedMetrics
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]
for threshold in [0.3, 0.5, 0.7]
labels = probs .>= threshold
println("Threshold: $threshold")
println(" Precision: $(round(precision(actual, labels), digits=2))")
println(" Recall: $(round(recall(actual, labels), digits=2))")
println(" F1: $(round(fbeta_score(actual, labels), digits=2))")
println(" Youden's J: $(round(youden_j(actual, labels), digits=2))")
endMedical Diagnostic Evaluation
using UnifiedMetrics
# Disease screening results
actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] # 1 = has disease
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0] # Test results
println("=== Diagnostic Performance ===")
println("Sensitivity: ", round(sensitivity(actual, predicted), digits=3))
println("Specificity: ", round(specificity(actual, predicted), digits=3))
println("PPV (Precision): ", round(precision(actual, predicted), digits=3))
println("NPV: ", round(npv(actual, predicted), digits=3))
println("LR+: ", round(positive_likelihood_ratio(actual, predicted), digits=2))
println("LR-: ", round(negative_likelihood_ratio(actual, predicted), digits=2))
println("DOR: ", round(diagnostic_odds_ratio(actual, predicted), digits=1))Marketing/Business Application
using UnifiedMetrics
# Customer response prediction
actual_responded = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted_scores = [0.9, 0.8, 0.3, 0.7, 0.6, 0.5, 0.4, 0.2, 0.1, 0.05]
println("=== Campaign Targeting ===")
for pct in [0.1, 0.2, 0.3, 0.5]
println("Top $(Int(pct*100))%:")
println(" Lift: $(round(lift(actual_responded, predicted_scores, percentile=pct), digits=2))x")
println(" Gain: $(round(gain(actual_responded, predicted_scores, percentile=pct)*100, digits=1))%")
endHandling Imbalanced Data
using UnifiedMetrics
# Highly imbalanced: 95% negative, 5% positive
actual = vcat(fill(0, 95), fill(1, 5))
predicted = vcat(fill(0, 100)) # Naive: always predict negative
println("=== Naive Model on Imbalanced Data ===")
println("Accuracy: ", accuracy(actual, predicted)) # 0.95 - misleading!
println("Recall: ", recall(actual, predicted)) # 0.0 - reveals the problem
println("MCC: ", mcc(actual, predicted)) # 0.0 - correctly shows failure
# A better model
predicted_better = vcat(fill(0, 90), fill(1, 5), fill(0, 3), fill(1, 2))
println("\n=== Better Model ===")
println("Accuracy: ", accuracy(actual, predicted_better))
println("Recall: ", recall(actual, predicted_better))
println("Precision: ", precision(actual, predicted_better))
println("MCC: ", round(mcc(actual, predicted_better), digits=3))See the API Reference for complete function documentation.