Choosing the Right Metric

This guide helps you select the appropriate metric for your machine learning task. The right metric depends on your problem type, data characteristics, and business requirements.

Quick Decision Guide

What type of problem are you solving?

| Problem Type | Go to Section |
| --- | --- |
| Predicting continuous values (prices, temperatures, etc.) | Regression Metrics |
| Predicting categories (multi-class) | Classification Metrics |
| Predicting yes/no outcomes | Binary Classification Metrics |
| Ranking items (search, recommendations) | Information Retrieval Metrics |
| Predicting future values in a sequence | Time Series Metrics |

Regression Metrics

Decision Flowchart

START: Regression Problem
    |
    v
Do you need interpretable units?
    |
    +-- YES --> Do you want to penalize large errors more?
    |               |
    |               +-- YES --> Use RMSE
    |               |
    |               +-- NO --> Use MAE
    |
    +-- NO --> Do you need scale-independent comparison?
                    |
                    +-- YES --> Are there zeros in actual values?
                    |               |
                    |               +-- YES --> Use SMAPE or WMAPE
                    |               |
                    |               +-- NO --> Use MAPE
                    |
                    +-- NO --> Use R² (explained_variation)

When to Use Each Metric

| Metric | Use When | Avoid When |
| --- | --- | --- |
| MAE | You want average error in original units; outliers should not dominate | You need to heavily penalize large errors |
| RMSE | Large errors are particularly bad; you want same units as target | Outliers are present and acceptable |
| MAPE | You need percentage errors for stakeholder communication | Actual values contain zeros or near-zeros |
| SMAPE | You need percentage errors and have zeros | You need asymmetric error treatment |
| R² | You want to know proportion of variance explained | Comparing models on different datasets |
| MASE | Comparing forecasts across different scales | Non-time-series data |

Detailed Recommendations

For General Regression Tasks

Primary metric: rmse or mae

  • Use rmse when large errors are costly (e.g., predicting house prices where a $100K error is much worse than ten $10K errors)
  • Use mae when all errors matter equally (e.g., predicting delivery times)

# Standard evaluation
rmse(actual, predicted)  # Penalizes large errors
mae(actual, predicted)   # Treats all errors equally
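
The calls above stand in for whatever metrics package you use; to make the trade-off concrete, here is a from-scratch Python sketch (the `*_example` names are illustrative, not library functions):

```python
import numpy as np

def mae_example(actual, predicted):
    """Mean Absolute Error: the average of |actual - predicted|."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(a - p))

def rmse_example(actual, predicted):
    """Root Mean Squared Error: squaring lets large errors dominate."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.sqrt(np.mean((a - p) ** 2))

# Same MAE, very different RMSE: many small errors vs. one large error
actual = [100.0, 100.0, 100.0, 100.0]
spread = [110.0, 110.0, 110.0, 110.0]   # four 10-unit errors
spiked = [140.0, 100.0, 100.0, 100.0]   # one 40-unit error
print(mae_example(actual, spread), mae_example(actual, spiked))    # 10.0 10.0
print(rmse_example(actual, spread), rmse_example(actual, spiked))  # 10.0 20.0
```

The two forecasts are indistinguishable under MAE, but RMSE doubles for the one with the single large miss, which is exactly the behavior you want when big errors are disproportionately costly.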

For Percentage-Based Reporting

Primary metric: mape, smape, or wmape

# When actual values are always positive and non-zero
mape(actual, predicted)

# When actual values may be zero
smape(actual, predicted)  # Symmetric, bounded [0, 2]
wmape(actual, predicted)  # Weighted by actuals

# To detect systematic bias
mpe(actual, predicted)  # Positive = under-prediction
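
A minimal Python sketch of the percentage metrics (illustrative `*_example` names, not the library calls above) shows why zeros matter:

```python
import numpy as np

def mape_example(actual, predicted):
    """Mean Absolute Percentage Error; divides by actuals, so zeros break it."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs((a - p) / a))

def smape_example(actual, predicted):
    """Symmetric MAPE, bounded [0, 2]; the denominator is only zero
    when actual and predicted are both zero."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(2.0 * np.abs(a - p) / (np.abs(a) + np.abs(p)))

print(mape_example([100, 200], [110, 180]))  # 0.1 (two 10% errors)
print(smape_example([0, 200], [10, 180]))    # finite despite the zero actual
```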

For Model Comparison

Primary metric: explained_variation (R²) or adjusted_r2

# Basic R²
explained_variation(actual, predicted)  # 1 = perfect, 0 = mean baseline

# When comparing models with different numbers of features
adjusted_r2(actual, predicted, n_features)
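
For reference, here is the arithmetic behind these calls in plain Python (hypothetical `*_example` names; the adjustment uses the standard sample-size and feature-count correction):

```python
import numpy as np

def r2_example(actual, predicted):
    """R-squared = 1 - SS_res / SS_tot; 1 = perfect, 0 = mean baseline."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    ss_res = np.sum((a - p) ** 2)
    ss_tot = np.sum((a - a.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2_example(actual, predicted, n_features):
    """Adjusted R-squared: penalizes extra features that add no fit."""
    n = len(actual)
    r2 = r2_example(actual, predicted)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

actual = [3.0, 5.0, 7.0, 9.0]
print(r2_example(actual, actual))                            # 1.0 (perfect fit)
print(adjusted_r2_example(actual, [3.1, 5.2, 6.8, 9.1], 2))  # below plain R-squared
```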

For Robust Models (Outlier-Resistant)

Primary metric: huber_loss or mdae

# Huber loss: quadratic for small errors, linear for large
huber_loss(actual, predicted, delta=1.0)

# Median Absolute Error: robust to outliers
mdae(actual, predicted)
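
The robustness claim is easy to verify with a from-scratch sketch (illustrative names; standard Huber definition, quadratic inside delta and linear outside):

```python
import numpy as np

def huber_loss_example(actual, predicted, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond; averaged over samples."""
    err = np.abs(np.asarray(actual, float) - np.asarray(predicted, float))
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quadratic, linear))

def mdae_example(actual, predicted):
    """Median Absolute Error: a single huge outlier barely moves it."""
    return np.median(np.abs(np.asarray(actual, float) - np.asarray(predicted, float)))

# Errors are 1, 1, 1, 100: the outlier drags MAE up to 25.75, but MdAE stays at 1
print(mdae_example([0, 0, 0, 0], [1, -1, 1, 100]))  # 1.0
```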

For Skewed Target Variables

Primary metric: rmsle or msle

# For targets spanning multiple orders of magnitude (prices, populations)
rmsle(actual, predicted)  # Penalizes under-prediction more

For Count Data or GLMs

Primary metric: mean_poisson_deviance or mean_gamma_deviance

# For count data (website visits, number of purchases)
mean_poisson_deviance(actual, predicted)

# For positive continuous data with variance ~ mean²
mean_gamma_deviance(actual, predicted)
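
For intuition, the Poisson deviance can be written out directly. This is a hedged sketch (hypothetical name, assumes strictly positive predictions) of the usual formula 2 * mean(y * log(y / mu) - (y - mu)):

```python
import numpy as np

def mean_poisson_deviance_example(actual, predicted):
    """Mean Poisson deviance; the y*log(y/mu) term is taken as 0 when y == 0."""
    y, mu = np.asarray(actual, float), np.asarray(predicted, float)
    safe_y = np.where(y > 0, y, 1.0)  # placeholder so log() stays defined
    log_term = np.where(y > 0, y * np.log(safe_y / mu), 0.0)
    return 2.0 * np.mean(log_term - (y - mu))

print(mean_poisson_deviance_example([3, 0, 5], [3.0, 0.5, 5.0]))  # small but nonzero
print(mean_poisson_deviance_example([3, 5], [3.0, 5.0]))          # 0.0 for a perfect fit
```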

Classification Metrics

Decision Flowchart

START: Multi-class Classification
    |
    v
Is your dataset balanced?
    |
    +-- YES --> Use accuracy() or ce()
    |
    +-- NO --> Use balanced_accuracy() or cohens_kappa()
                    |
                    v
               Do you need a single summary metric?
                    |
                    +-- YES --> For binary: mcc()
                    |           For ordinal: ScoreQuadraticWeightedKappa()
                    |
                    +-- NO --> Use confusion_matrix() for detailed analysis

When to Use Each Metric

| Metric | Use When | Avoid When |
| --- | --- | --- |
| accuracy | Classes are balanced; simple reporting needed | Imbalanced datasets |
| balanced_accuracy | Classes are imbalanced | You need per-class details |
| cohens_kappa | You want to account for chance agreement | N/A |
| mcc | Binary classification; best single metric | Multi-class (use macro-averaged) |
| confusion_matrix | You need detailed error analysis | Simple summary is sufficient |

Detailed Recommendations

For Balanced Datasets

accuracy(actual, predicted)  # Simple and interpretable

For Imbalanced Datasets

# Macro-averaged recall across classes
balanced_accuracy(actual, predicted)

# Accounts for chance agreement
cohens_kappa(actual, predicted)
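
To see why plain accuracy fails here while balanced accuracy does not, here is a short sketch (hypothetical `balanced_accuracy_example` name, computing macro-averaged recall):

```python
import numpy as np

def balanced_accuracy_example(actual, predicted):
    """Average of per-class recall: every class counts equally, whatever its size."""
    a, p = np.asarray(actual), np.asarray(predicted)
    recalls = [np.mean(p[a == c] == c) for c in np.unique(a)]
    return float(np.mean(recalls))

# 90% majority class: always predicting 0 scores 0.9 accuracy but only 0.5 here
actual    = [0] * 9 + [1]
predicted = [0] * 10
print(balanced_accuracy_example(actual, predicted))  # 0.5
```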

For Ordinal Classification

When classes have a natural order (e.g., ratings 1-5):

# Penalizes predictions farther from true class
ScoreQuadraticWeightedKappa(actual, predicted, min_rating=1, max_rating=5)

For Multi-Label Classification

# Fraction of incorrect labels
hamming_loss(actual_matrix, predicted_matrix)

Binary Classification Metrics

Decision Flowchart

START: Binary Classification
    |
    v
What type of predictions do you have?
    |
    +-- Probabilities (0-1) --> Do you need threshold-independent evaluation?
    |                               |
    |                               +-- YES --> Use auc() or gini_coefficient()
    |                               |
    |                               +-- NO --> What matters more?
    |                                               |
    |                                               +-- Calibration --> brier_score() or logloss()
    |                                               |
    |                                               +-- Ranking --> ks_statistic()
    |
    +-- Binary Labels (0/1) --> What is your priority?
                                    |
                                    +-- Balance precision/recall --> fbeta_score()
                                    |
                                    +-- Minimize false positives --> precision()
                                    |
                                    +-- Minimize false negatives --> recall()
                                    |
                                    +-- Single best metric --> mcc()

When to Use Each Metric

| Metric | Use When | Avoid When |
| --- | --- | --- |
| auc | Comparing models; threshold hasn't been chosen | You need a specific operating point |
| precision | False positives are costly (spam detection) | Missing positives is worse |
| recall | False negatives are costly (disease detection) | False alarms are problematic |
| fbeta_score | You need to balance precision and recall | Clear priority for one over the other |
| mcc | Imbalanced data; need single summary metric | You need a threshold-independent metric |
| brier_score | Probability calibration matters | Ranking is more important |

Detailed Recommendations

For Model Selection (Before Choosing Threshold)

# Area Under ROC Curve - threshold independent
auc(actual, predicted_scores)

# Gini coefficient (= 2*AUC - 1)
gini_coefficient(actual, predicted_scores)

# Maximum separation between classes
ks_statistic(actual, predicted_scores)
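
Why AUC is threshold-independent is clearest from its rank formulation: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative. A sketch via the Mann-Whitney U statistic (hypothetical name; assumes no tied scores):

```python
import numpy as np

def auc_example(actual, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    a, s = np.asarray(actual), np.asarray(scores, float)
    ranks = np.empty(len(s))
    ranks[np.argsort(s)] = np.arange(1, len(s) + 1)  # rank 1 = lowest score
    n_pos = int(np.sum(a == 1))
    n_neg = len(a) - n_pos
    return (ranks[a == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = auc_example(actual, scores)
print(auc_value)          # 0.75: 3 of the 4 positive/negative pairs ranked correctly
print(2 * auc_value - 1)  # 0.5: the Gini coefficient
```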

For Probability Calibration

When you need well-calibrated probabilities:

# Mean squared error of probabilities
brier_score(actual, predicted_probs)  # Lower is better

# Cross-entropy loss
logloss(actual, predicted_probs)  # Lower is better
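
The two losses reward calibration differently; a quick sketch (hypothetical names) shows log loss punishing a confidently wrong prediction far harder than the Brier score does:

```python
import numpy as np

def brier_score_example(actual, predicted_probs):
    """Mean squared error between 0/1 outcomes and predicted probabilities."""
    a, p = np.asarray(actual, float), np.asarray(predicted_probs, float)
    return np.mean((a - p) ** 2)

def logloss_example(actual, predicted_probs, eps=1e-15):
    """Cross-entropy; clipping keeps log() finite at probabilities of 0 or 1."""
    a = np.asarray(actual, float)
    p = np.clip(np.asarray(predicted_probs, float), eps, 1 - eps)
    return -np.mean(a * np.log(p) + (1 - a) * np.log(1 - p))

# The second prediction is 99% confident and wrong
print(brier_score_example([1, 0], [0.9, 0.99]))  # ~0.495
print(logloss_example([1, 0], [0.9, 0.99]))      # ~2.36, dominated by the bad case
```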

For Threshold-Based Evaluation

After choosing a classification threshold:

# Convert probabilities to labels
predicted_labels = predicted_probs .>= threshold

# When false positives are costly (spam filter, fraud detection)
precision(actual, predicted_labels)

# When false negatives are costly (disease screening, security threats)
recall(actual, predicted_labels)
sensitivity(actual, predicted_labels)  # Same as recall

# Balanced metric
fbeta_score(actual, predicted_labels)         # F1: equal weight
fbeta_score(actual, predicted_labels, beta=0.5)  # Favor precision
fbeta_score(actual, predicted_labels, beta=2.0)  # Favor recall
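
The effect of beta follows directly from the formula; here is a from-scratch sketch (hypothetical name, standard F-beta definition):

```python
def fbeta_score_example(actual, predicted, beta=1.0):
    """F-beta from precision and recall; beta > 1 weights recall more heavily."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# precision = 2/3, recall = 1/2: recall is the weak spot
actual    = [1, 1, 1, 1, 0]
predicted = [1, 1, 0, 0, 1]
print(fbeta_score_example(actual, predicted))            # F1 ~ 0.571
print(fbeta_score_example(actual, predicted, beta=2.0))  # lower: recall weighted up
print(fbeta_score_example(actual, predicted, beta=0.5))  # higher: precision weighted up
```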

For Medical/Diagnostic Applications

# Sensitivity (true positive rate)
sensitivity(actual, predicted_labels)

# Specificity (true negative rate)
specificity(actual, predicted_labels)

# Youden's J (optimal threshold criterion)
youden_j(actual, predicted_labels)

# Likelihood ratios for clinical decision making
positive_likelihood_ratio(actual, predicted_labels)
negative_likelihood_ratio(actual, predicted_labels)
diagnostic_odds_ratio(actual, predicted_labels)

For Imbalanced Data

The single best metric for binary classification with imbalanced data:

# Matthews Correlation Coefficient: accounts for all quadrants of confusion matrix
mcc(actual, predicted_labels)  # Range: [-1, 1], 0 = random
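
The "all quadrants" claim can be checked directly; a minimal sketch of MCC from the confusion-matrix cells (hypothetical name; returns 0 when the denominator degenerates):

```python
import math

def mcc_example(actual, predicted):
    """Matthews Correlation Coefficient from tp, tn, fp, fn."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# 95% accuracy by always predicting the majority class -- but MCC sees through it
actual    = [0] * 95 + [1] * 5
predicted = [0] * 100
print(mcc_example(actual, predicted))  # 0.0
```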

For Business Applications

# Lift: how much better than random in top X%
lift(actual, predicted_scores, percentile=0.1)

# Gain: what % of positives captured in top X%
gain(actual, predicted_scores, percentile=0.1)

Information Retrieval Metrics

Decision Flowchart

START: Ranking/Retrieval Problem
    |
    v
Do you have graded relevance scores?
    |
    +-- YES --> Use ndcg() or dcg()
    |
    +-- NO (binary relevance) --> What matters more?
                                      |
                                      +-- Finding first relevant item --> mrr()
                                      |
                                      +-- Finding all relevant items --> recall_at_k()
                                      |
                                      +-- Precision of top results --> precision_at_k()
                                      |
                                      +-- Balance of both --> f1_at_k() or mapk()

When to Use Each Metric

| Metric | Use When | Avoid When |
| --- | --- | --- |
| ndcg | Relevance is graded (0-5 stars) | Binary relevance only |
| mrr | Only first relevant result matters | All relevant items matter |
| map@k | Ranking quality across positions matters | Only top-1 or top-k matters |
| recall@k | Coverage of relevant items is priority | Precision matters more |
| precision@k | Quality of top results is priority | Missing relevant items is costly |
| hit_rate | At least one relevant in top-k is success | Need finer granularity |

Detailed Recommendations

For Search Engines

# Graded relevance (best for search)
ndcg(relevance_scores, k=10)

# Mean NDCG across queries
mean_ndcg(relevances_list, k=10)

# Mean Reciprocal Rank (how quickly users find what they want)
mrr(actual_list, predicted_list)

For Recommendation Systems

# Did we show at least one good item?
hit_rate(actual_list, predicted_list, k=10)

# How many relevant items did we show?
recall_at_k(actual, predicted, k=10)

# What fraction of shown items are relevant?
precision_at_k(actual, predicted, k=10)

# Balanced metric
f1_at_k(actual, predicted, k=10)

# Catalog coverage (diversity)
coverage(predicted_list, full_catalog)

# Novelty (recommending non-obvious items)
novelty(predicted_list, item_popularity)

# Average precision at k
apk(10, relevant_products, retrieved_products)

# Mean AP across queries
mapk(10, relevant_lists, retrieved_lists)
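
The top-k metrics above reduce to simple set arithmetic; an illustrative Python sketch (hypothetical `*_example` names; real library signatures may differ):

```python
def precision_at_k_example(relevant, retrieved, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def recall_at_k_example(relevant, retrieved, k):
    """Fraction of all relevant items that appear in the top-k."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

relevant  = ["a", "b", "c", "d"]
retrieved = ["a", "x", "b", "y", "z"]
print(precision_at_k_example(relevant, retrieved, 3))  # 2/3 of the top-3 are relevant
print(recall_at_k_example(relevant, retrieved, 3))     # 2 of the 4 relevant items found
```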

Time Series Metrics

Decision Flowchart

START: Time Series Forecasting
    |
    v
What aspect of forecast quality matters?
    |
    +-- Point forecast accuracy --> Is scale-independent comparison needed?
    |                                   |
    |                                   +-- YES --> mase() or theil_u2()
    |                                   |
    |                                   +-- NO --> rmse() or mae()
    |
    +-- Directional accuracy --> directional_accuracy()
    |
    +-- Forecast bias --> tracking_signal() or forecast_bias()
    |
    +-- Prediction intervals --> coverage_probability() or winkler_score()

When to Use Each Metric

| Metric | Use When | Avoid When |
| --- | --- | --- |
| mase | Comparing across series with different scales | Single series evaluation |
| rmsse | Scale-independent; sensitive to large errors | Outliers acceptable |
| tracking_signal | Monitoring for systematic bias | One-time evaluation |
| directional_accuracy | Direction matters more than magnitude | Magnitude accuracy critical |
| winkler_score | Evaluating prediction intervals | Point forecasts only |
| theil_u2 | Comparing to naive benchmark | Absolute accuracy needed |

Detailed Recommendations

For Comparing Forecasts Across Different Series

The M-competition recommended metrics:

# Mean Absolute Scaled Error (most recommended)
mase(actual, predicted, m=1)     # Non-seasonal
mase(actual, predicted, m=12)    # Monthly data with yearly seasonality
mase(actual, predicted, m=7)     # Daily data with weekly seasonality

# Root Mean Squared Scaled Error
rmsse(actual, predicted, m=1)

Interpretation:

  • MASE < 1: Better than naive forecast
  • MASE = 1: Same as naive forecast
  • MASE > 1: Worse than naive forecast
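
That interpretation follows from the definition: MASE divides your forecast's MAE by the MAE of a (seasonal-)naive forecast at lag m. Here is a simplified sketch that scales on the evaluation series itself (the canonical definition scales by in-sample naive errors on the training data):

```python
import numpy as np

def mase_example(actual, predicted, m=1):
    """MAE scaled by the MAE of the lag-m (seasonal-)naive forecast."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    naive_mae = np.mean(np.abs(a[m:] - a[:-m]))  # naive benchmark error
    return np.mean(np.abs(a - p)) / naive_mae

actual    = [10, 12, 14, 16, 18]
predicted = [11, 13, 13, 17, 17]   # off by 1 at every step
print(mase_example(actual, predicted, m=1))  # 0.5: half the naive forecast's error
```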

For Single Series Evaluation

# Standard metrics in original units
mae(actual, predicted)
rmse(actual, predicted)

# Percentage-based (avoid if zeros present)
mape(actual, predicted)
wmape(actual, predicted)  # Handles zeros better

For Detecting Forecast Bias

# Normalized measure of cumulative error
tracking_signal(actual, predicted)
# Interpretation: values outside [-4, 4] indicate systematic bias

# Simple bias (positive = under-forecasting)
forecast_bias(actual, predicted)

For Comparing to Benchmark

# Theil's U2: comparison to naive forecast
theil_u2(actual, predicted, m=1)
# < 1: better than naive, > 1: worse than naive

# Theil's U1: normalized error
theil_u1(actual, predicted)
# 0 = perfect, 1 = worst

For Direction Prediction (Trading, etc.)

# What fraction of up/down movements were predicted correctly?
directional_accuracy(actual, predicted)
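
A short sketch (hypothetical name) makes the definition concrete: compare the signs of consecutive changes in the actual and forecast series:

```python
import numpy as np

def directional_accuracy_example(actual, predicted):
    """Fraction of steps where actual and forecast move in the same direction."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.sign(np.diff(a)) == np.sign(np.diff(p))))

actual    = [100, 105, 103, 108, 107]   # up, down, up, down
predicted = [101, 104, 102, 104, 108]   # up, down, up, up -- last move missed
print(directional_accuracy_example(actual, predicted))  # 0.75
```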

For Probabilistic Forecasts / Prediction Intervals

# Does the interval contain the actual value at expected rate?
coverage_probability(actual, lower, upper)
# Should match your confidence level (e.g., 0.95 for 95% intervals)

# Interval score (rewards narrow intervals, penalizes misses)
winkler_score(actual, lower, upper, alpha=0.05)

# Quantile forecast evaluation
pinball_loss_series(actual, predicted_quantile, quantile=0.9)
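
Pinball loss's asymmetry is the whole point: for quantile q, under-prediction costs q per unit while over-prediction costs 1 - q. A sketch with an illustrative name:

```python
import numpy as np

def pinball_loss_example(actual, predicted, quantile=0.9):
    """Average pinball (quantile) loss for a single quantile forecast."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    err = a - p
    return np.mean(np.where(err >= 0, quantile * err, (quantile - 1) * err))

# At q = 0.9, missing low is 9x costlier than missing high by the same amount
print(pinball_loss_example([10.0], [8.0], quantile=0.9))   # 1.8
print(pinball_loss_example([10.0], [12.0], quantile=0.9))  # ~0.2
```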

For Preserving Temporal Structure

# Does the forecast maintain autocorrelation patterns?
autocorrelation_error(actual, predicted, max_lag=10)

Common Mistakes to Avoid

Regression

  1. Using MAPE with zeros: MAPE is undefined when actual values are zero. Use SMAPE or WMAPE instead.
  2. Ignoring scale: When comparing models across different datasets, use scale-independent metrics (R², MAPE, MASE).
  3. Only using R²: R² can be misleading for non-linear relationships. Always check residual plots.

Classification

  1. Using accuracy on imbalanced data: A model predicting the majority class always achieves high accuracy. Use balanced_accuracy, MCC, or per-class metrics.
  2. Optimizing for wrong metric: If false negatives are costly (medical diagnosis), optimize for recall, not precision.

Binary Classification

  1. Comparing AUC across very different datasets: AUC can be misleading if class distributions differ significantly.
  2. Ignoring calibration: High AUC doesn't mean probabilities are well-calibrated. Check Brier score.
  3. Using accuracy on imbalanced data: Use MCC instead.

Information Retrieval

  1. Using NDCG with binary relevance: While valid, simpler metrics (MAP, MRR) may be more interpretable.
  2. Ignoring position: Metrics like precision don't account for ranking. Use NDCG or MRR.

Time Series

  1. Not using scaled metrics: Raw MAE/RMSE can't be compared across series with different scales.
  2. Ignoring seasonality in MASE: Set m to match your data's seasonal period.
  3. Only checking point accuracy: Also evaluate bias (tracking_signal) and intervals (coverage_probability).

Metric Selection Summary Table

| Scenario | Recommended Metric | Alternative |
| --- | --- | --- |
| General regression | RMSE | MAE |
| Regression with outliers | Huber loss | MdAE |
| Stakeholder reporting | MAPE (if no zeros) | SMAPE |
| Imbalanced binary classification | MCC | Balanced accuracy |
| Medical diagnosis | Sensitivity + Specificity | Youden's J |
| Search ranking | NDCG | MRR |
| Recommendation system | Hit rate, Recall@k | MAP@k |
| Forecast comparison | MASE | RMSSE |
| Forecast monitoring | Tracking signal | Forecast bias |
| Prediction intervals | Coverage + Winkler | Pinball loss |