Diego Garcia


When 100% Accuracy Is a Red Flag

I built a market regime classifier that scored perfect out-of-sample accuracy. It took two iterations to realize the model was not learning anything: it was just reading its own label back.

Mar 24, 2026
6 min read
Kairox Vector · XGBoost · Regime Detection · ML Engineering

The model trained. The evaluation ran. The terminal printed OOS accuracy: 1.00. Most people would screenshot that and ship. I almost did.

The model in question is the regime detector for Kairox Vector, an XGBoost classifier that reads macro indicators every morning and outputs one of four labels: trending_bull, trending_bear, ranging_low_vol, or ranging_high_vol. These labels condition everything downstream. Get the regime wrong and the right signal in the wrong context destroys returns.

So a 100% accurate regime classifier should be cause for celebration. Except in machine learning, perfect is almost never real. It usually means you made a mistake you have not found yet.

How the Labels Were Built

Before training any model, you need labeled historical data. For regime classification, I used a rules-based approach grounded in market structure: a period is trending_bull when the 50-day SMA crosses above the 200-day SMA (the golden cross) and price stays above the 50-day SMA. trending_bear is the mirror image. Periods where those conditions are mixed, neither clearly trending, get classified as ranging, with VIX level splitting them into high and low volatility variants.
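A minimal sketch of that labeling logic, assuming daily SPY closes and VIX levels in a pandas DataFrame (the column names and the VIX threshold here are illustrative, not the production values):

    import pandas as pd

    def label_regimes(df: pd.DataFrame, vix_split: float = 20.0) -> pd.Series:
        # assumes daily rows with 'spy_close' and 'vix' columns
        sma50 = df["spy_close"].rolling(50).mean()
        sma200 = df["spy_close"].rolling(200).mean()

        bull = (sma50 > sma200) & (df["spy_close"] > sma50)  # golden cross, price above SMA50
        bear = (sma50 < sma200) & (df["spy_close"] < sma50)  # mirror image

        # everything else is ranging; VIX splits the variants
        # (the first 200 rows lack SMAs and default to ranging in this sketch)
        labels = pd.Series("ranging_low_vol", index=df.index)
        labels[df["vix"] >= vix_split] = "ranging_high_vol"
        labels[bull] = "trending_bull"
        labels[bear] = "trending_bear"
        return labels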

These are well-understood technical signals. They are not perfect, but they produce sensible ground truth: 2022 shows up as bear, the 2020 recovery as bull, flat periods with elevated VIX land in ranging high-vol. The labeling logic is sound.

The problem came in what I fed the model as features.

The Bug That Looked Like a Win

The feature set included everything that felt relevant for characterizing market regime: SPY returns over multiple horizons, realized volatility, FRED macro series (fed funds rate, CPI, yield curve, credit spreads, dollar index), and crucially three SMA-derived features: sma50_vs_sma200, spy_pct_above_sma50, and spy_pct_above_sma200.

Those last three features are not independent signals. They are the label rules. sma50_vs_sma200 > 0 is literally the golden cross condition. spy_pct_above_sma50 > 0 is the second half of the trending bull definition. The model was handed the answer sheet as a training feature.

XGBoost did not need to learn anything about FEDFUNDS or credit spreads or yield curve shape. It learned one thing: a 2-node decision tree that mapped the SMA features directly to their label. Everything else, all the macro context that was supposed to make this model useful, was ignored entirely.
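The leak is short enough to write down. Assuming a feature matrix X with the columns named above and the rules-based labels y, the bull label can be rebuilt from the two leaked columns in one boolean expression:

    # the feature comparison *is* the label rule
    reconstructed_bull = (X["sma50_vs_sma200"] > 0) & (X["spy_pct_above_sma50"] > 0)
    assert (reconstructed_bull == (y == "trending_bull")).all()  # passes, which is the problem

An assertion like that passing is exactly what a gradient-boosted tree can discover in two splits.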

[Figure: feature importance, v2 vs. v3b]

v2 · SMA features in X: OOS accuracy 1.00
v3b · SMA features excluded: OOS accuracy 0.85
Legend: SMA-derived (label leakage) · macro / return features · FRED macro features · return / volatility features

Left: the tautological model. The three SMA features account for virtually all predictive weight. Every macro feature scores near zero. Right: after removing the leakage. VIX, yield curve, and credit spreads now lead. The model is reading the macro environment.

The chart is unambiguous. This is not a model with good features and great performance. It is a model that learned to decode its own label encoding. Remove the three SMA features and the model is useless. It has learned nothing about what drives market regimes.

The Tautology, Visualized

The failure mode has a name: label leakage. It happens when information about the target variable sneaks into the training features, either directly or through a transformation that preserves the signal. The model does not generalize. It memorizes.

A useful analogy: imagine training a model to predict whether it is raining, and including "is the ground wet" as a feature. The model achieves perfect accuracy. Of course it does. Wet ground is caused by rain. You have not built a weather predictor, you have built a ground-moisture detector that happens to correlate with rain. Take it to a city where someone just cleaned the streets and it fails completely.

In this case the circular dependency looks like this:

Label definition (sma50 > sma200 → trending_bull)
    → training features X (includes sma50_vs_sma200)
    → XGBoost model (learns a 2-split rule)
    → model output (OOS accuracy = 1.000)
    ↺ SMA values copied into X reproduce the label: TAUTOLOGY

The circular dependency that produced 100% accuracy. Labels are defined by SMA crossover logic. Those same SMA values were included as training features. XGBoost learned to invert the label definition, not to model the regime.

The fix is conceptually simple: keep the SMA features for label generation, but exclude them entirely from the training feature matrix. The model now has to learn regime from fundamentals. Macro environment, return dynamics, volatility structure. If macro conditions genuinely explain regime structure, XGBoost should predict SMA-based regimes from fundamentals alone, detecting regime transitions from deteriorating macro data before the price structure confirms them. That is the whole point of using macro features in the first place.
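In code, the fix is a single drop plus a guard so the leak cannot silently come back (a sketch; the list name and DataFrame are illustrative):

    # SMA columns exist only to generate labels; they must never enter the model
    LABEL_ONLY = ["sma50_vs_sma200", "spy_pct_above_sma50", "spy_pct_above_sma200"]

    X = features.drop(columns=LABEL_ONLY)
    assert not set(LABEL_ONLY) & set(X.columns), "label-rule features leaked into X"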

What Honest Accuracy Looks Like

After removing the SMA features from training, accuracy dropped from 1.00 to 0.83. Adding class weighting to address the bear-market imbalance (bears represent only around 10% of the training set) brought it to 0.85.
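A sketch of one common way to apply that weighting with XGBoost, via scikit-learn's balanced per-sample weights (the hyperparameters are illustrative, not the production values):

    from sklearn.preprocessing import LabelEncoder
    from sklearn.utils.class_weight import compute_sample_weight
    from xgboost import XGBClassifier

    # XGBoost wants integer classes; encode the four regime strings
    y_enc = LabelEncoder().fit_transform(y_train)

    # 'balanced' upweights rare classes, e.g. the ~10% of rows labeled bear
    weights = compute_sample_weight(class_weight="balanced", y=y_enc)

    model = XGBClassifier(n_estimators=300, max_depth=4)
    model.fit(X_train, y_enc, sample_weight=weights)

With the leak gone and the classes rebalanced, the feature importance chart looked completely different.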

VIX ratio, T10Y2Y, BAMLH0A0HYM2. Volatility term structure, yield curve inversion, credit spreads. These are the features that show up in every serious regime framework. The model found them without being told which ones mattered.

The classification report tells a similar story. The honest model performs worse on aggregate accuracy. That is expected: reality is harder than label reconstruction. But it performs correctly across the regimes it has enough signal for.

v2 · SMA features in X (TAUTOLOGICAL)

    Class             Prec   Recall   F1
    trending_bull     1.00   1.00     1.00
    trending_bear     1.00   1.00     1.00
    ranging_low_vol   1.00   1.00     1.00
    ranging_high_vol  1.00   1.00     1.00
    OOS accuracy: 1.000

v3b · SMA features excluded (HONEST)

    Class             Prec   Recall   F1
    trending_bull     0.88   0.94     0.91
    trending_bear     0.00   0.00     0.00
    ranging_low_vol   0.73   0.56     0.64
    ranging_high_vol  0.85   0.94     0.90
    OOS accuracy: 0.850

Perfect metrics across the board in v2 are not a sign of quality. They are a confession. v3b's more modest numbers represent a model that actually learned something.

The Bear Gate and Why It Failed Honestly

The validation gate for bear market precision (≥ 0.65 out-of-sample) showed 0.000. Ten bear days in the test set, ten predicted as ranging_high_vol. At first glance this looks like a critical failure.

Those ten days were April 16 to 30, 2025, the Trump tariff shock. The SMA death cross technically appeared, triggering a bear label. But the macro environment looked nothing like a structural bear: FEDFUNDS being cut, unemployment at historic lows, credit spreads not blown out. Compare that to 2022, where every macro signal confirmed the bear simultaneously.

2022 bear market — macro confirmed:

  • FEDFUNDS: 0.08% to 4.33% within the year
  • CPI: 8.5% peak YoY
  • T10Y2Y: deeply inverted (−0.5% to −1.0%)
  • HY credit spreads: 600bps+
  • Model bear precision on 2022 data: 0.960 ✓

April 2025 tariff shock — macro did not confirm:

  • FEDFUNDS: being cut, not hiked
  • Unemployment: low, not deteriorating
  • Credit spreads: elevated but not blown out
  • Death cross lasted 10 days, immediately reversed
  • Model prediction: ranging_high_vol ✓

The model was arguably more correct than its label. A 10-day technical crossover driven by tariff panic and immediately reversed is a volatility event, not a structural bear. The model placed it in ranging_high_vol because that is what the macro data said, and that is the more nuanced and useful classification.

The gate failure is a test-set design limitation: the chronological 80/20 split puts 2022 in training and leaves no genuine macro-driven bear in the test window. Proper evaluation requires walk-forward cross-validation across multiple complete market cycles. That is a Phase 9 problem.
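For reference, the walk-forward loop could be as simple as scikit-learn's expanding-window splitter; a sketch of the Phase 9 shape, not the current code:

    from sklearn.model_selection import TimeSeriesSplit
    from xgboost import XGBClassifier

    # each fold trains on everything before its test slice, never after it
    tscv = TimeSeriesSplit(n_splits=5)
    for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
        model = XGBClassifier()  # fresh model per fold
        model.fit(X.iloc[train_idx], y_enc[train_idx])
        score = model.score(X.iloc[test_idx], y_enc[test_idx])
        print(f"fold {fold}: OOS accuracy {score:.3f}")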

The Practical Lesson

Perfect accuracy is not evidence that a model works. It is a signal to go looking for what you did wrong. Three questions catch most label leakage early (a sketch of the second and third checks follows the list):

  1. Does any training feature share a formula with the label definition? Even partial overlap is dangerous.
  2. What does feature importance show? If one feature dominates completely, the model has found a shortcut.
  3. What happens to accuracy when you remove the top feature? If it collapses, that feature was load-bearing in the wrong way.
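Those last two checks fit in one smoke test. The thresholds below are illustrative judgment calls, and the sketch assumes DataFrame features and an sklearn-style model with feature_importances_:

    import numpy as np

    def leakage_smoke_test(model, X_train, y_train, X_test, y_test):
        model.fit(X_train, y_train)
        imp = model.feature_importances_
        top = int(np.argmax(imp))
        base = model.score(X_test, y_test)
        print(f"top feature {X_train.columns[top]!r} holds {imp[top]:.0%} of importance")
        if imp[top] > 0.5:  # illustrative dominance threshold
            print("-> one feature dominates; suspect a shortcut")

        # ablation: drop the top feature, refit, compare
        keep = X_train.columns.delete(top)
        model.fit(X_train[keep], y_train)
        ablated = model.score(X_test[keep], y_test)
        print(f"accuracy {base:.3f} -> {ablated:.3f} without it")
        if base - ablated > 0.2:  # illustrative collapse threshold
            print("-> accuracy collapses on removal; that feature was load-bearing in the wrong way")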

The version of this model that passed every gate had learned nothing. The version that failed the bear gate had learned something real: what structural macro deterioration looks like, and how it differs from a volatility correction. That distinction is exactly what a regime model is supposed to capture.

The model that scores 0.85 and misses ten edge-case bear days is more valuable than the model that scores 1.00 by memorizing its labels. One of them will work in production. The other will silently fail the moment it sees data it has not already labeled.

The v2 model was briefly considered production-ready. It would have run in the daily DAG and produced regime outputs that looked entirely reasonable, correctly labeling 2022 as bear, 2023 to 2024 as bull, printing confidence scores that felt believable. I only caught it because I went looking.

A model that gets 100% on a test it wrote itself is not intelligent. It is cheating. The goal was never to reconstruct labels. It was to detect the macro conditions that cause regimes before the price structure confirms them.