Log-Loss Is Not Enough: How Identical Scores Hide an 11-Point ROI Gap

Seven XGBoost training runs on the same La Liga dataset. Five of them land within 0.0025 log-loss of each other. Their away-betting returns diverge by up to 11 percentage points. The metric you are optimising is not measuring what you think it is.

Jun 2026 · 12 min read

Cross-entropy loss is the standard evaluation metric for probabilistic classifiers. It penalises confident wrong predictions more than uncertain ones, which makes it a reasonable proxy for calibration quality. In most classification settings, two models within 0.002 log-loss of each other are, for practical purposes, equivalent.
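As a minimal sketch of that penalty structure, the snippet below computes three-class log-loss by hand. The function name `log_loss` and the probability rows are illustrative assumptions, not values from the experiments.

```python
import math

def log_loss(probs, outcomes):
    """Mean multiclass cross-entropy.

    probs: list of [p_home, p_draw, p_away] rows, each summing to 1.
    outcomes: list of class indices (0=home, 1=draw, 2=away).
    """
    eps = 1e-15  # guard against log(0)
    total = 0.0
    for row, y in zip(probs, outcomes):
        total += -math.log(max(row[y], eps))
    return total / len(outcomes)

# The away win (class 2) is the actual result in every case below.
confident_wrong = log_loss([[0.80, 0.15, 0.05]], [2])
uncertain_wrong = log_loss([[0.40, 0.30, 0.30]], [2])
uniform = log_loss([[1 / 3, 1 / 3, 1 / 3]], [2])

print(f"confident wrong:  {confident_wrong:.4f}")
print(f"uncertain wrong:  {uncertain_wrong:.4f}")
print(f"uniform baseline: {uniform:.4f}")
```

The uniform row is the natural reference point: predicting one third for every outcome always scores ln 3, about 1.0986, so the log-loss values reported in this brief sit roughly 0.12 below it.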

Football prediction is not most settings.

The goal here is not to produce a well-calibrated distribution for its own sake. The goal is to find bets where the model's probability differs from the bookmaker's implied probability by enough to justify a stake. That requires something log-loss does not directly measure: directional edge on specific outcome classes. A model can be globally well-calibrated and still be systematically wrong about away wins. A model can have mediocre overall log-loss and still price away wins better than the market does.
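The comparison between model probability and implied probability can be sketched as follows. The odds and model output are made-up illustration values, and removing the bookmaker margin by proportional normalisation is an assumption (one common choice); the article does not say how implied probabilities were derived.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to probabilities, removing the bookmaker margin
    by proportional normalisation (an assumed method)."""
    raw = [1.0 / o for o in decimal_odds]
    overround = sum(raw)  # exceeds 1.0 when the book carries a margin
    return [r / overround for r in raw]

def edges(model_probs, decimal_odds):
    """Model probability minus implied probability, per outcome."""
    implied = implied_probabilities(decimal_odds)
    return [m - i for m, i in zip(model_probs, implied)]

# Hypothetical match: home, draw, away.
odds = [1.80, 3.60, 4.80]
model = [0.50, 0.25, 0.25]

implied = implied_probabilities(odds)
edge = edges(model, odds)
for label, i, e in zip(["home", "draw", "away"], implied, edge):
    print(f"{label}: implied={i:.3f} edge={e:+.3f}")
```

In this toy case the model disagrees with the market in opposite directions on home and away, which is exactly the kind of per-class difference an aggregate log-loss figure averages over.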

The question this brief investigates is whether XGBoost's hyperparameter search landscape has multiple local optima that reach similar log-loss through fundamentally different internal representations, and whether those representations produce different betting value. The answer, with real walk-forward validation data across seven experiments on KairoX Tempo's La Liga model, is yes, and the difference is large enough to matter.

The Multimodality Problem

Optuna's TPE sampler explores the hyperparameter space efficiently, but "efficiently" does not mean it finds a single global optimum. XGBoost's loss surface for tabular football data has at least four distinct attractor basins, each producing a qualitatively different kind of model that happens to score similarly on CV log-loss.

Across seven La Liga training runs (EXP-002 through EXP-007 plus the production v1.0.0 model), each run settled in a differently labelled regime:

Regime	Learning rate	n_estimators	Max depth	Mean log-loss	Away ROI
Slow-Deep (EXP-002)	0.044	484	5	0.9753	+5.41%
Fast-Shallow (EXP-003)	0.034	683	5	0.9764	+15.58%
Slow-Shallow-Reg (EXP-004)	0.016	674	3	0.9770	+13.33%
Medium-Slow (EXP-005)	0.020	562	6	0.9761	+16.69%
Medium-Fast (EXP-006)	0.083	603	7	0.9771	+10.42%
Medium (v1.0.0)	0.048	581	8	0.9746	+7.48%
Slow-Deep-ELO (EXP-007)	0.014	759	7	0.9831	+20.38%

Seven training runs, each with its own regime label. Log-loss spread across the top five: 0.0025. Away ROI spread: 11.28pp.

The top five experiments cluster within 0.0025 log-loss of each other. Their away-betting ROI spans from +5.41% to +16.69%. The best log-loss model (v1.0.0, 0.9746) ranks sixth out of seven on Away ROI (+7.48%), ahead of only EXP-002. The worst overall calibration model (EXP-007, ECE=0.0337) produces the highest Away ROI (+20.38%).

Figure 1: XGBoost regime map, learning rate vs n_estimators with Away ROI as a color gradient. Each regime occupies a distinct region of the hyperparameter space. The best log-loss model (v1.0.0, top-right) is not the best Away ROI model.

The Anti-Calibration Finding

The most counterintuitive result is in the per-class ECE breakdown. EXP-006 is the best-calibrated model across all seven runs, with an overall ECE of 0.0193. Its per-class error is remarkably balanced: 0.019 on home wins, 0.019 on draws, 0.020 on away wins. By any calibration metric, this is the model you want.

It also produces the third-lowest Away ROI of the seven runs (+10.42%), ahead of only v1.0.0 and EXP-002.

Run	ECE overall	ECE (Away)	Away ROI
EXP-006 (Medium-Fast)	0.0193	0.0200	+10.42%
EXP-002 (Slow-Deep)	0.0251	0.0168	+5.41%
EXP-003 (Fast-Shallow)	0.0251	0.0133	+15.58%
EXP-005 (Medium-Slow)	0.0292	0.0256	+16.69%
EXP-007 (Slow-Deep-ELO)	0.0337	0.0377	+20.38%

Per-class ECE and Away ROI. EXP-003 has the best Away-class calibration (0.0133) and reasonable ROI. EXP-007 has the worst overall calibration and the best returns.
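For readers who want to reproduce a per-class figure like the ones in the table, here is a minimal sketch of one-vs-rest ECE. The ten equal-width bins and the function name `class_ece` are assumptions; the article does not state which binning scheme was used, and the four toy matches are illustrative only.

```python
def class_ece(probs, outcomes, cls, n_bins=10):
    """Expected calibration error for one class (one-vs-rest, equal-width bins)."""
    bins = [[] for _ in range(n_bins)]
    for row, y in zip(probs, outcomes):
        p = row[cls]
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, 1.0 if y == cls else 0.0))
    n = len(outcomes)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)    # average predicted probability
        freq = sum(hit for _, hit in b) / len(b)  # observed frequency in the bin
        ece += (len(b) / n) * abs(mean_p - freq)
    return ece

# Toy data: rows are [p_home, p_draw, p_away]; outcomes use 0/1/2.
probs = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2], [0.3, 0.3, 0.4], [0.2, 0.3, 0.5]]
outcomes = [0, 2, 2, 1]

away_ece = class_ece(probs, outcomes, cls=2)
print(f"Away-class ECE on toy data: {away_ece:.3f}")
```

With only four matches the number is meaningless as a calibration estimate; the point is the mechanics. Overall ECE averages this kind of error over all classes, which is how a model can look balanced in aggregate while one class carries most of the betting edge.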

EXP-003 offers an interesting middle case: it has the best Away-class ECE of all seven runs (0.0133), and its Away ROI is positive in all three validation seasons (+21.8%, +16.1%, +10.5%), never dropping below +10%. Good away-class calibration does correlate with away-class betting edge here, but the overall ECE ranking tells you very little about which model to deploy.

The Murphy decomposition clarifies why. Brier score decomposes into three terms: reliability (how close predicted probabilities are to observed frequencies; lower is better), resolution (how far the observed frequencies at each forecast level sit from the overall base rate; higher is better), and uncertainty (the variance of the outcome itself, which no model can change). EXP-007 has the worst reliability score (0.001600) and the highest Away ROI. v1.0.0 and EXP-002 have the highest resolution scores (most discriminative predictions overall) but rank among the lowest Away ROI models. High resolution in aggregate does not mean high discrimination on the specific subset of matches where market mispricing exists.
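The three terms can be computed directly. The sketch below is a minimal binary (one-vs-rest) version that groups forecasts by exact value, so the identity Brier = reliability - resolution + uncertainty holds exactly. The function names and toy forecasts are assumptions for illustration, not the pipeline or data behind the figures quoted above.

```python
from collections import defaultdict

def murphy_decomposition(forecasts, hits):
    """Return (reliability, resolution, uncertainty) for binary forecasts."""
    n = len(forecasts)
    base_rate = sum(hits) / n
    groups = defaultdict(list)
    for f, h in zip(forecasts, hits):
        groups[f].append(h)  # group outcomes by identical forecast value
    reliability = sum(
        len(g) * (f - sum(g) / len(g)) ** 2 for f, g in groups.items()
    ) / n
    resolution = sum(
        len(g) * (sum(g) / len(g) - base_rate) ** 2 for g in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

def brier(forecasts, hits):
    """Mean squared error between forecast probability and 0/1 outcome."""
    return sum((f - h) ** 2 for f, h in zip(forecasts, hits)) / len(hits)

# Toy away-win forecasts (illustrative values only).
forecasts = [0.2, 0.2, 0.2, 0.2, 0.5, 0.5, 0.5, 0.5]
hits = [0, 0, 0, 1, 1, 1, 0, 1]

rel, res, unc = murphy_decomposition(forecasts, hits)
print(f"reliability={rel:.4f} resolution={res:.4f} uncertainty={unc:.4f}")
print(f"brier={brier(forecasts, hits):.4f} recomposed={rel - res + unc:.4f}")
```

When forecasts are continuous and have to be binned, the identity only holds approximately, so reliability values from different binning schemes are not directly comparable.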

Figure 2: Per-class ECE heatmap and Murphy decomposition scatter for all seven regimes. Left: ECE heatmap by regime and class. Right: Murphy decomposition scatter for the Away class. EXP-007 (worst reliability, bottom-right) delivers the best Away ROI. EXP-006 (best overall calibration) sits at mid-range returns.

What the SHAP Profiles Reveal

The SHAP importance comparison between EXP-002 and EXP-005 is the most diagnostic result in this investigation. These two models differ by only 0.0008 log-loss (0.9753 against 0.9761). Their feature weighting diverges by more than 50% on eight of the top twelve features.

Feature	EXP-002 SHAP	EXP-005 SHAP	Relative diff
away_team_away_win_rate_roll_10	0.0183	0.0037	79.9%
xg_op_delta_roll_3	0.0045	0.0015	65.3%
passes_acc_delta_roll_10	0.0610	0.0238	60.9%
xg_delta_roll_3	0.0061	0.0025	59.5%
matchday	0.0160	0.0072	55.4%
points_gap	0.1046	0.0502	52.0%

EXP-002 vs EXP-005: log-loss difference of 0.0008, SHAP divergence up to 79.9%. These are not the same model with different seeds.
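The relative differences can be re-derived from the two SHAP columns. The formula used here, absolute difference divided by the larger of the two values, is an assumption that reproduces the table to within rounding; the article does not state its definition. Because the inputs are the rounded table entries, recomputed percentages can differ from the published column by up to about a point.

```python
# Mean absolute SHAP values copied from the table: (EXP-002, EXP-005).
shap_values = {
    "away_team_away_win_rate_roll_10": (0.0183, 0.0037),
    "xg_op_delta_roll_3": (0.0045, 0.0015),
    "passes_acc_delta_roll_10": (0.0610, 0.0238),
    "xg_delta_roll_3": (0.0061, 0.0025),
    "matchday": (0.0160, 0.0072),
    "points_gap": (0.1046, 0.0502),
}

def relative_diff(a, b):
    """Absolute difference as a fraction of the larger value (assumed definition)."""
    return abs(a - b) / max(a, b)

rel = {name: relative_diff(a, b) for name, (a, b) in shap_values.items()}

for name, value in sorted(rel.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:34s} {value:6.1%}")
```

In every row listed, the larger value belongs to EXP-002.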

EXP-002 concentrates weight on points_gap (0.1046) and passes_acc_delta_roll_10 (0.0610). It is building a model around standing quality gaps and possession dominance. EXP-005 distributes importance more evenly across process metrics, giving roughly equal weight to xG signals, momentum features, and positional context.

The mechanism behind EXP-002's lower Away ROI is plausible from this profile. A model that over-weights points gap is encoding a strong home-advantage prior: when the home team is significantly better on standings, the model pushes probability toward the home win. Away wins in La Liga disproportionately happen when the quality gap is smaller or reversed, and EXP-002 treats the away team's own form very differently from EXP-005 (mean absolute SHAP for away_team_away_win_rate_roll_10 is 0.0183 in EXP-002 against 0.0037 in EXP-005, a 79.9% relative difference). The model is learning a version of football where the table position dominates, which is accurate on average but leaves away-win edges mispriced.

Figure 3: SHAP importance heatmap across all seven regimes and the top twelve features (mean absolute SHAP by feature and regime). The two columns that stand out are EXP-002 (heavy concentration on points_gap and passes_acc) and EXP-007 (redistributed toward xg and elo signals). Similar log-loss, different internal representations.

Fold Stability and What It Costs

EXP-004 (Slow-Shallow-Regularized) is the most stable model in the set by log-loss variance across folds (σ=0.0009, about seven times tighter than EXP-007's σ=0.0065). Its Away ROI across the three validation seasons is +17.4%, +19.4%, +5.9%: two strong seasons and a weaker third, but always positive.

EXP-003 (Fast-Shallow) is also positive in all three seasons, but on a monotonically declining path: +21.8%, +16.1%, +10.5%. The declining trend over time could indicate genuine regime decay (La Liga outcomes becoming more predictable for home wins in recent seasons as the gap between top teams and the rest widens) or simply fold-level variance. More seasons would be needed to distinguish the two.

EXP-006, the best-calibrated model, produces the most stable Away ROI range: +8.6% to +14.5%. If the deployment criterion is minimising the worst-case season rather than maximising expected returns, EXP-006 makes a legitimate case despite its lower peak.
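The trade-off between mean return and worst-case season can be made explicit with a few lines of arithmetic. The sketch below uses only the per-season figures quoted above for EXP-003 and EXP-004 (EXP-006 is left out because only its range is given). An unweighted season average need not match the Away ROI in the regime table, which may pool bets across seasons; how that figure is aggregated is not stated, so treat the means here as a rough summary.

```python
from statistics import mean, pstdev

# Away ROI per validation season, in percent, as quoted in the text.
season_away_roi = {
    "EXP-003": [21.8, 16.1, 10.5],
    "EXP-004": [17.4, 19.4, 5.9],
}

summary = {}
for run, rois in season_away_roi.items():
    summary[run] = {
        "mean": mean(rois),      # unweighted average over seasons
        "sigma": pstdev(rois),   # population standard deviation across seasons
        "worst": min(rois),      # worst-case season
    }

for run, s in summary.items():
    print(f"{run}: mean={s['mean']:.2f} sigma={s['sigma']:.2f} worst={s['worst']:.1f}")
```

With three seasons per run, the standard deviation is a very noisy estimate, which is the same caveat the text raises about distinguishing decay from fold-level variance.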

The feature-set confound

EXP-002 (V1 features) and EXP-005 (V4 features) use different feature sets. Part of the ROI divergence between them may come from feature content rather than hyperparameters alone. An ablation using the same feature set with both hyperparameter configurations would isolate the pure hyperparameter effect. That experiment has not been run. The SHAP divergence analysis makes it unlikely that features alone explain the full gap, but the confound exists and is worth acknowledging.

What This Changes About Model Selection

The standard workflow for XGBoost model selection is: run hyperparameter search, pick the configuration with the best CV log-loss, deploy. This investigation suggests that workflow is incomplete for betting applications.

The issue is not that log-loss is a bad metric. It is that football betting generates returns only on bets placed, and bets are placed only on the subset of matches where the model's probability exceeds the implied probability from market odds by a meaningful margin. The model's accuracy on the other 85% of matches, which log-loss weights heavily, is irrelevant to deployment performance.
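To make the "returns only on bets placed" point concrete, here is a minimal sketch of a threshold-based Away ROI calculation. The flat one-unit stake, the use of raw implied probability (margin not removed), the 0.05 threshold, and the four toy matches are all assumptions for illustration; none of them is taken from the KairoX Tempo pipeline.

```python
def away_roi(matches, min_edge):
    """ROI of flat 1-unit away bets placed only where model edge >= min_edge.

    Returns (roi, number_of_bets). Matches below the threshold contribute
    nothing, however well or badly the model predicted them.
    """
    staked = 0.0
    profit = 0.0
    for m in matches:
        implied = 1.0 / m["away_odds"]  # raw implied probability (assumed)
        if m["p_away"] - implied >= min_edge:
            staked += 1.0
            profit += (m["away_odds"] - 1.0) if m["away_won"] else -1.0
    return (profit / staked if staked else 0.0), int(staked)

# Hypothetical matches.
matches = [
    {"p_away": 0.32, "away_odds": 4.0, "away_won": True},   # bet placed, wins
    {"p_away": 0.22, "away_odds": 5.0, "away_won": False},  # edge too small
    {"p_away": 0.40, "away_odds": 3.0, "away_won": False},  # bet placed, loses
    {"p_away": 0.10, "away_odds": 6.0, "away_won": False},  # negative edge
]

roi, n_bets = away_roi(matches, min_edge=0.05)
print(f"bets placed: {n_bets} of {len(matches)}, ROI: {roi:+.1%}")
```

Log-loss would score all four predictions; the ROI figure depends on two of them.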

A more complete selection criterion for this use case would include:

CV log-loss as a necessary condition, not a sufficient one. Any model that doesn't beat the uniform baseline meaningfully gets filtered out first.
Per-class ECE as a secondary filter, specifically on the Away class, which is the primary source of market mispricing in La Liga.
OOF ROI on Away bets at the target confidence threshold, computed across all three validation seasons, with stability (σ across folds) weighted alongside mean return.
SHAP profile review: does the model give meaningful weight to per-team momentum features, or has it collapsed onto standing-based proxies?
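The first three criteria can be sketched as a simple filter-then-rank step. The figures are copied from the tables above for the five runs with a reported Away-class ECE. The two thresholds (`MIN_MARGIN`, `MAX_ECE_AWAY`) are hypothetical placeholders chosen only to show the mechanics, so the resulting order is not a deployment recommendation. Fold stability and the SHAP review are left out because the article does not give the inputs for every run.

```python
import math

# Figures copied from the two tables above.
runs = {
    "EXP-002": {"log_loss": 0.9753, "ece_away": 0.0168, "away_roi": 5.41},
    "EXP-003": {"log_loss": 0.9764, "ece_away": 0.0133, "away_roi": 15.58},
    "EXP-005": {"log_loss": 0.9761, "ece_away": 0.0256, "away_roi": 16.69},
    "EXP-006": {"log_loss": 0.9771, "ece_away": 0.0200, "away_roi": 10.42},
    "EXP-007": {"log_loss": 0.9831, "ece_away": 0.0377, "away_roi": 20.38},
}

UNIFORM_BASELINE = math.log(3)  # log-loss of predicting 1/3 for each outcome
MIN_MARGIN = 0.05               # hypothetical: required improvement over baseline
MAX_ECE_AWAY = 0.03             # hypothetical: cap on Away-class ECE

def shortlist(candidates):
    """Apply the log-loss and Away-ECE filters, then rank by Away ROI."""
    passed = [
        name for name, r in candidates.items()
        if r["log_loss"] <= UNIFORM_BASELINE - MIN_MARGIN
        and r["ece_away"] <= MAX_ECE_AWAY
    ]
    return sorted(passed, key=lambda name: candidates[name]["away_roi"], reverse=True)

ranked = shortlist(runs)
print(ranked)
```

Note what the placeholder ECE cap does: it removes EXP-007, the run with the highest Away ROI. Where that cap is set is a judgment call about how much miscalibration to accept in exchange for return, which is the tension the calibration section describes.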

None of this is novel in isolation. What the data shows here is that the gap between the best and worst regime on a single dataset, with near-identical log-loss, is roughly a threefold difference in Away ROI (+5.41% against +16.69%). The choice of which regime to deploy is not a minor tuning decision.

Open question: home ROI across regimes

Most regimes show negative or near-zero Home ROI consistently. This suggests a shared calibration bias toward home wins that persists regardless of hyperparameter configuration, and is therefore more likely a feature representation problem than a tuning problem. The question of what feature or data characteristic is driving the systematic home-win overpricing is unresolved.