Binary Classification Metrics

Metrics for evaluating models that predict between two classes (positive/negative, yes/no, 1/0).

Overview

Binary classification metrics fall into two categories:

  1. Threshold-dependent: Require binary predictions (0/1)
    • Precision, Recall, F-score, Specificity
  2. Threshold-independent: Use probability scores
    • AUC, Brier score, Log loss

Quick Reference

MetricInput TypeRangeBest For
aucProbabilities[0, 1]Model comparison
precisionLabels[0, 1]Minimizing FP
recallLabels[0, 1]Minimizing FN
fbeta_scoreLabels[0, 1]Balanced evaluation
mccLabels[-1, 1]Imbalanced data
brier_scoreProbabilities[0, 1]Calibration

ROC-Based Metrics

Area Under ROC Curve

auc(actual, predicted_probs)

Interpretation:

  • AUC = 1.0: Perfect ranking
  • AUC = 0.5: Random guessing
  • AUC < 0.5: Worse than random (flip predictions!)

Guidelines:

  • 0.9-1.0: Excellent
  • 0.8-0.9: Good
  • 0.7-0.8: Fair
  • 0.6-0.7: Poor
  • 0.5-0.6: Fail

Gini Coefficient

gini_coefficient(actual, predicted_probs)

Relationship: Gini = 2 × AUC - 1

KS Statistic

ks_statistic(actual, predicted_probs)

When to use: Credit scoring, marketing response modeling.

Probability Calibration Metrics

Log Loss

ll(actual, predicted_probs)      # Elementwise
logloss(actual, predicted_probs) # Mean

When to use:

  • Training neural networks (cross-entropy loss)
  • When probability values matter, not just ranking

Brier Score

brier_score(actual, predicted_probs)

Interpretation:

  • 0: Perfect calibration
  • 0.25: Random guessing for balanced data
  • 1: Complete miscalibration

When to use: Weather forecasting, medical prognosis - anywhere probability calibration matters.

Precision and Recall

Precision (Positive Predictive Value)

precision(actual, predicted_labels)

Interpretation: Of all samples predicted positive, what fraction are actually positive?

Optimize for precision when: False positives are costly

  • Spam detection (don't mark good email as spam)
  • Legal discovery (don't flag innocent documents)

Recall (Sensitivity, True Positive Rate)

recall(actual, predicted_labels)
sensitivity(actual, predicted_labels)  # Alias

Interpretation: Of all actual positives, what fraction did we detect?

Optimize for recall when: False negatives are costly

  • Disease screening (don't miss sick patients)
  • Fraud detection (don't miss fraudulent transactions)
  • Security threats (don't miss actual threats)

F-Score

fbeta_score(actual, predicted_labels; beta=1.0)

Choosing beta:

  • β = 1: Equal weight to precision and recall (F1)
  • β = 0.5: Precision weighted 2× more than recall
  • β = 2: Recall weighted 2× more than precision

Formula: F_β = (1 + β²) × (precision × recall) / (β² × precision + recall)

Specificity and NPV

specificity(actual, predicted_labels)
npv(actual, predicted_labels)

Relationships:

  • Sensitivity (recall) + FNR = 1
  • Specificity + FPR = 1
  • Precision + FDR = 1
  • NPV + FOR = 1

Error Rates

fpr(actual, predicted_labels)  # False Positive Rate
fnr(actual, predicted_labels)  # False Negative Rate

Combined Metrics

Youden's J (Informedness)

youden_j(actual, predicted_labels)

Use case: Finding optimal threshold that maximizes sensitivity + specificity.

Markedness

markedness(actual, predicted_labels)

Interpretation: How marked (informative) are positive and negative predictions?

Fowlkes-Mallows Index

fowlkes_mallows_index(actual, predicted_labels)

Likelihood Ratios (Medical/Diagnostic)

positive_likelihood_ratio(actual, predicted_labels)
negative_likelihood_ratio(actual, predicted_labels)
diagnostic_odds_ratio(actual, predicted_labels)

Interpretation of LR+:

  • LR+ > 10: Strong evidence for positive
  • LR+ = 5-10: Moderate evidence
  • LR+ = 2-5: Weak evidence
  • LR+ = 1: Useless test

Interpretation of LR-:

  • LR- < 0.1: Strong evidence for negative
  • LR- = 0.1-0.2: Moderate evidence
  • LR- = 0.2-0.5: Weak evidence
  • LR- = 1: Useless test

Business Metrics

Lift

lift(actual, predicted_probs; percentile=0.1)

Interpretation: How many times better than random in the top X%?

  • Lift = 3 in top 10%: 3× more positives than random

Gain

gain(actual, predicted_probs; percentile=0.1)

Interpretation: What percentage of all positives are captured in top X%?

Usage Examples

Complete Binary Classification Evaluation

using UnifiedMetrics

actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_probs = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]
predicted_labels = predicted_probs .>= 0.5

println("=== Threshold-Independent ===")
println("AUC: ", round(auc(actual, predicted_probs), digits=3))
println("Gini: ", round(gini_coefficient(actual, predicted_probs), digits=3))
println("KS Statistic: ", round(ks_statistic(actual, predicted_probs), digits=3))
println("Brier Score: ", round(brier_score(actual, predicted_probs), digits=3))
println("Log Loss: ", round(logloss(actual, predicted_probs), digits=3))

println("\n=== Threshold-Dependent (threshold=0.5) ===")
println("Precision: ", round(precision(actual, predicted_labels), digits=3))
println("Recall: ", round(recall(actual, predicted_labels), digits=3))
println("F1 Score: ", round(fbeta_score(actual, predicted_labels), digits=3))
println("Specificity: ", round(specificity(actual, predicted_labels), digits=3))
println("MCC: ", round(mcc(actual, predicted_labels), digits=3))

Comparing Different Thresholds

using UnifiedMetrics

actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]

for threshold in [0.3, 0.5, 0.7]
    labels = probs .>= threshold
    println("Threshold: $threshold")
    println("  Precision: $(round(precision(actual, labels), digits=2))")
    println("  Recall: $(round(recall(actual, labels), digits=2))")
    println("  F1: $(round(fbeta_score(actual, labels), digits=2))")
    println("  Youden's J: $(round(youden_j(actual, labels), digits=2))")
end

Medical Diagnostic Evaluation

using UnifiedMetrics

# Disease screening results
actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1 = has disease
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]  # Test results

println("=== Diagnostic Performance ===")
println("Sensitivity: ", round(sensitivity(actual, predicted), digits=3))
println("Specificity: ", round(specificity(actual, predicted), digits=3))
println("PPV (Precision): ", round(precision(actual, predicted), digits=3))
println("NPV: ", round(npv(actual, predicted), digits=3))
println("LR+: ", round(positive_likelihood_ratio(actual, predicted), digits=2))
println("LR-: ", round(negative_likelihood_ratio(actual, predicted), digits=2))
println("DOR: ", round(diagnostic_odds_ratio(actual, predicted), digits=1))

Marketing/Business Application

using UnifiedMetrics

# Customer response prediction
actual_responded = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted_scores = [0.9, 0.8, 0.3, 0.7, 0.6, 0.5, 0.4, 0.2, 0.1, 0.05]

println("=== Campaign Targeting ===")
for pct in [0.1, 0.2, 0.3, 0.5]
    println("Top $(Int(pct*100))%:")
    println("  Lift: $(round(lift(actual_responded, predicted_scores, percentile=pct), digits=2))x")
    println("  Gain: $(round(gain(actual_responded, predicted_scores, percentile=pct)*100, digits=1))%")
end

Handling Imbalanced Data

using UnifiedMetrics

# Highly imbalanced: 95% negative, 5% positive
actual = vcat(fill(0, 95), fill(1, 5))
predicted = vcat(fill(0, 100))  # Naive: always predict negative

println("=== Naive Model on Imbalanced Data ===")
println("Accuracy: ", accuracy(actual, predicted))  # 0.95 - misleading!
println("Recall: ", recall(actual, predicted))       # 0.0 - reveals the problem
println("MCC: ", mcc(actual, predicted))             # 0.0 - correctly shows failure

# A better model
predicted_better = vcat(fill(0, 90), fill(1, 5), fill(0, 3), fill(1, 2))
println("\n=== Better Model ===")
println("Accuracy: ", accuracy(actual, predicted_better))
println("Recall: ", recall(actual, predicted_better))
println("Precision: ", precision(actual, predicted_better))
println("MCC: ", round(mcc(actual, predicted_better), digits=3))

See the API Reference for complete function documentation.