Diego Garcia


Rolling Windows vs Cumulative Stats: Why "Recent Form" Made My Predictions Worse


Dec 30, 2025
16 min read
Machine Learning · Sports Analytics · Feature Engineering · Model Performance

TL;DR - The Momentum Trap That Cost Me 3.4 Percentage Points

I thought tracking "last 5 matches" would capture team momentum better than season averages. It made sense: recent form should predict future performance better than distant history. So I rebuilt my football prediction model to use rolling windows instead of cumulative statistics.

Validation accuracy dropped from 60.8% to 57.4%. Test accuracy crashed to 53.2%.

The problem wasn't the concept, it was variance instability. Rolling windows have wildly different reliability across the season. Matchday 3: a team's form is based on 2 matches (coin flips). Matchday 25: same stat is based on 24 matches (actual signal). XGBoost couldn't learn a consistent relationship because the feature's meaning changed every week.

When I switched back to cumulative season statistics (the same expanding calculation, just averaging over all matches played so far), accuracy jumped back to 60.8%. Training time dropped by 40%. Feature importance became interpretable again.

The lesson: In noisy domains like football, recency bias amplifies noise instead of capturing signal. Cumulative stats smooth variance and generalize better. This article shows you exactly why rolling windows fail, when they work, and how to choose the right temporal aggregation for your problem.

The Intuition That Destroyed My Model

When I built Hercules (my first 1X2 predictor), I used cumulative season statistics:

# What Hercules did (worked well)
def calculate_cumulative_ppg(matches_df):
    """Season-long points per game (expanding average over all matches so far)."""
    matches_df = matches_df.sort_values('date')
    for team in matches_df['team_id'].unique():
        for season in matches_df['season'].unique():
            season_matches = matches_df[
                (matches_df['team_id'] == team) &
                (matches_df['season'] == season)
            ]
            # Average over ALL season matches so far (note: includes the current row;
            # shift by one match if the feature must only use pre-match information)
            matches_df.loc[season_matches.index, 'season_ppg'] = (
                season_matches['points'].expanding().mean()
            )
    return matches_df

But then I hypothesized momentum mattered. Teams on winning streaks outperform expectations. Recent form predicts future results better than season averages.

Made perfect sense. So I rewrote everything:

# What I changed to (disaster)
def calculate_rolling_ppg(matches_df):
    """Points per game over the last 5 matches."""
    matches_df = matches_df.sort_values('date')
    for team in matches_df['team_id'].unique():
        team_matches = matches_df[matches_df['team_id'] == team]
        # Average over the LAST 5 matches only (fewer than 5 early in the season)
        matches_df.loc[team_matches.index, 'rolling_ppg_5'] = (
            team_matches['points'].rolling(window=5, min_periods=1).mean()
        )
    return matches_df

I also added "momentum indicators" (gradient of rolling averages) and "form volatility" (standard deviation of recent results). The model now had 15 new features, all focused on capturing short-term patterns.
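The exact feature code isn't shown here, but "gradient of rolling averages" and "form volatility" typically look something like this in pandas; the column names (momentum_5, form_volatility_5) are mine for illustration, not from the original pipeline:

import pandas as pd

def add_form_features(team_matches: pd.DataFrame) -> pd.DataFrame:
    """Sketch of short-term 'form' features for one team's matches."""
    team_matches = team_matches.sort_values('date').copy()
    team_matches['rolling_ppg_5'] = (
        team_matches['points'].rolling(window=5, min_periods=1).mean()
    )
    # "Momentum": discrete gradient of the rolling average (change since last match)
    team_matches['momentum_5'] = team_matches['rolling_ppg_5'].diff()
    # "Form volatility": standard deviation of the last 5 results
    team_matches['form_volatility_5'] = (
        team_matches['points'].rolling(window=5, min_periods=2).std()
    )
    return team_matches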

Results after retraining:

Metric                 Cumulative (Old)    Rolling Windows (New)
Validation Accuracy    60.8%               57.4%
Test Accuracy          60.8%               53.2%
Training Time          8 minutes           13 minutes
Feature Stability      High                Garbage

I'd made predictions worse while making training slower. And I had no idea why.

The Hidden Problem: Variance Instability

The breakthrough came when I plotted prediction confidence across matchdays.

Figure 1: Prediction confidence by matchday, rolling vs cumulative.

The rolling window model was erratic. Early season: predictions swung wildly between "no idea" (45% confidence) and "very sure" (75% confidence). Late season: stable but wrong.

Why? Because the features themselves were unstable.

Wait, both approaches use expanding calculations early in the season (min_periods=1). So why does cumulative work better? Because the sample size is part of what the feature means. 'rolling_ppg_5' claims to be a 5-match average even when only 2 matches exist, and it keeps that name all season. 'season_ppg' means 'everything we've seen so far', so the model can learn to distrust it early and trust it late.

Matchday 3 scenario:

  • Team A: [Win, Loss] = 1.5 PPG over 2 matches
  • Team B: [Win, Win] = 3.0 PPG over 2 matches
  • Model sees: "Team B is 2x stronger! High confidence prediction."
  • Reality: Two matches is noise. Could easily reverse.

Matchday 25 scenario:

  • Team A: [15W, 7D, 2L] = 2.13 PPG over 24 matches
  • Team B: [18W, 4D, 2L] = 2.42 PPG over 24 matches
  • Model sees: "Team B is ~13% stronger. Moderate confidence."
  • Reality: 24 matches is signal. Meaningful difference.

Same feature name (rolling_ppg_5), completely different reliability depending on when in the season you look. XGBoost learned "rolling_ppg > 2.5 = strong predictor" from late-season matches, where the last five results come from a stabilized team, then applied that threshold to early-season noise.

Cumulative stats don't have this problem:

Matchday 3:

  • Team A: 1.5 PPG over 2 matches (explicitly uncertain)
  • Team B: 3.0 PPG over 2 matches (explicitly uncertain)
  • Model learns: "Low sample sizes = low confidence"

Matchday 25:

  • Team A: 2.13 PPG over 24 matches (high certainty)
  • Team B: 2.42 PPG over 24 matches (high certainty)
  • Model learns: "High sample sizes = trust the difference"

The sample size is part of the feature with cumulative stats. XGBoost can learn that early-season values are noisier. With rolling windows, the sample size is hidden, the model sees "5 matches" whether it's early or late season.

The Mathematics of Why This Fails

The variance of a sample mean decreases with sample size:

Var(sample_mean) = σ² / n

where σ² is the population variance and n is the sample size.

For rolling windows (fixed n=5):

  • Matchday 5: Var(PPG) = σ² / 5
  • Matchday 15: Var(PPG) = σ² / 5
  • Matchday 35: Var(PPG) = σ² / 5

Constant variance, but the meaning changes because a team's true strength evolves over the season.

For cumulative stats (variable n):

  • Matchday 5: Var(PPG) = σ² / 5 (high variance = model cautious)
  • Matchday 15: Var(PPG) = σ² / 15 (medium variance = model moderate)
  • Matchday 35: Var(PPG) = σ² / 35 (low variance = model confident)

Decreasing variance that naturally encodes uncertainty.
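A minimal Monte Carlo sketch makes the two variance profiles concrete; the win/draw/loss probabilities below are invented for illustration, not fitted to my data:

import numpy as np

rng = np.random.default_rng(42)
# Synthetic team: P(win, draw, loss) = 0.45 / 0.30 / 0.25 -> 3 / 1 / 0 points (assumed)
points = rng.choice([3, 1, 0], size=(10_000, 38), p=[0.45, 0.30, 0.25])

for matchday in (5, 15, 35):
    played = points[:, :matchday]              # results available at this matchday
    rolling_5 = played[:, -5:].mean(axis=1)    # last-5 average (window already full here)
    cumulative = played.mean(axis=1)           # expanding average over all matches so far
    print(f"matchday {matchday:2d}: "
          f"std(rolling_5)={rolling_5.std():.3f}  "
          f"std(cumulative)={cumulative.std():.3f}")

The rolling standard deviation stays roughly flat at σ/√5 all season, while the cumulative one keeps shrinking like σ/√n.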

I tested this by tracking feature variance across the season:

Figure 2: Feature variance patterns across the season, rolling vs cumulative.

The rolling window variance barely moved, around 0.45 standard deviation throughout the season. But the reliability of that 0.45 changed drastically. Early season: 0.45 std on 2-5 matches = noise. Late season: 0.45 std on last 5 of 35 matches = different noise, but from a stabilized distribution.

Cumulative variance fell steadily, following the σ²/n curve almost perfectly. By matchday 20, cumulative PPG had a quarter of the variance of rolling PPG, making it a far more stable predictor.

What XGBoost Actually Learned

I ran SHAP analysis to understand how the model used these features.

Figure 3: SHAP feature importance for rolling vs cumulative features, early vs late season.

Early season (matchdays 1-10):

  • Rolling window features: SHAP values ranged from -0.35 to +0.42 for the same feature value (e.g., rolling_ppg=2.0)
  • Model couldn't decide if high rolling PPG meant "genuinely strong" or "lucky start"
  • Cumulative features: Tight SHAP distributions (±0.08), consistent interpretation

Late season (matchdays 25-38):

  • Rolling windows improved but still showed ±0.25 spread
  • Cumulative features: Remained stable at ±0.10 spread

The model learned different relationships for the same feature depending on matchday. That's overfitting by definition: memorizing training-season quirks instead of learning generalizable patterns.
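For reference, the check behind Figure 3 can be sketched roughly like this. It assumes a trained binary XGBoost model and a feature table with a matchday column; all names are illustrative, not the Prometheus code:

import numpy as np
import pandas as pd
import shap  # assumes the shap package is installed

def shap_spread_by_period(model, X: pd.DataFrame, feature: str = 'rolling_ppg_5') -> dict:
    """Spread of SHAP values for one feature, early vs late season.

    Assumes a binary classifier (a multiclass 1X2 model returns one SHAP array
    per class) and a 'matchday' column in X; names are illustrative.
    """
    explainer = shap.TreeExplainer(model)
    values = explainer.shap_values(X)          # (n_samples, n_features) for binary models
    col = list(X.columns).index(feature)
    early = (X['matchday'] <= 10).to_numpy()
    late = (X['matchday'] >= 25).to_numpy()
    return {
        'early_spread': float(np.ptp(values[early, col])),
        'late_spread': float(np.ptp(values[late, col])),
    }

A much wider early-season spread for the same feature value is exactly the "model can't decide" behavior described above.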

The Overfitting Signature

When I plotted training vs validation accuracy over epochs, the problem became obvious:

Figure 4: Training vs validation accuracy over time, rolling vs cumulative.

Rolling window model:

  • Training accuracy: 68% (model thinks it's doing great!)
  • Validation accuracy: 57% (reality check)
  • Test accuracy: 53% (disaster)
  • Gap: 15 percentage points between training and test

Cumulative model:

  • Training accuracy: 63%
  • Validation accuracy: 60.8%
  • Test accuracy: 60.8%
  • Gap: 2 percentage points (healthy generalization)

The rolling window model learned to exploit early-season variance in the training data. When it saw rolling_ppg_5 > 2.8 at matchday 4 of the 21-22 season, it learned "predict home win with 75% confidence" because that happened to work in training. But in the 24-25 season, rolling_ppg_5 > 2.8 at matchday 4 was noise; it could go either way.

Cumulative stats don't have season-specific patterns. season_ppg = 2.13 over 24 matches means roughly the same thing in any season.

The Counter-Intuitive Fix

Here's the twist: a cumulative stat is just a rolling window whose size never stops growing. At the code level, expanding().mean() and a rolling mean with an oversized window produce identical numbers. The difference is semantic, not mathematical.

# These produce IDENTICAL values (as long as the series has fewer than 999 rows)
rolling_avg = df['points'].rolling(window=999, min_periods=1).mean()
cumulative_avg = df['points'].expanding(min_periods=1).mean()
# Both grow the window as more data arrives
# Both average over all available data up to the current row
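A quick sanity check on a toy series confirms the equivalence (the numbers are arbitrary):

import numpy as np
import pandas as pd

s = pd.Series([3, 1, 0, 3, 3, 1, 0, 3])           # a short run of match points
roll = s.rolling(window=999, min_periods=1).mean()
expand = s.expanding(min_periods=1).mean()
print(np.allclose(roll, expand))                  # True: same numbers, different framing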

But they represent fundamentally different features:

Rolling window interpretation:

  • "Team's form in last 5 matches"
  • Implies: Recent > old
  • Creates: Recency bias that amplifies noise

Cumulative interpretation:

  • "Team's season performance"
  • Implies: All data matters equally
  • Creates: Variance reduction through larger samples

Same math, different philosophy, 7 percentage point accuracy difference.

The key insight: In noisy domains, recency bias kills you.

Football has ~0.4-0.6 goals of random variance per match. A "hot streak" of 3 wins could be:

  1. True skill increase (momentum)
  2. Opponent weakness (schedule luck)
  3. Random variance (regresses to mean)

Rolling windows assume #1. Cumulative stats assume #3. Empirically, #3 is right more often.
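A back-of-the-envelope simulation shows how cheap "hot streaks" are; the win probability below is made up, and the only question is how often three straight wins appear with no momentum at all:

import numpy as np

rng = np.random.default_rng(0)
p_win = 0.40                                  # an unremarkable team (assumed, illustrative)
wins = rng.random((100_000, 38)) < p_win      # 100k simulated 38-match seasons, i.i.d. results

# A 3-win streak exists wherever three consecutive matches are all wins
streak3 = wins[:, :-2] & wins[:, 1:-1] & wins[:, 2:]
has_streak = streak3.any(axis=1)
print(f"{has_streak.mean():.0%} of purely random seasons contain at least one 3-win streak")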

When Rolling Windows Do Work

I'm not saying rolling windows are always wrong. They work when:

1. True momentum exists (rare in football)

  • Example: Basketball has real hot-hand effects (confidence, rhythm)
  • Example: Trading strategies where recent volatility predicts future volatility
  • Football momentum is mostly illusion (regression to mean)

2. Concept drift is fast (not football)

  • Example: Adversarial games where strategies evolve rapidly
  • Football: team quality changes slowly over months, not weeks

3. Sample size is large (not early season)

  • If your rolling window is 20 matches, variance is reasonable
  • But then you lose data and can't make predictions for the first 20 matches

I tested different window sizes:

Window Size    Early Season Var    Late Season Var    Test Accuracy
Last 3         0.62                0.58               51.2%
Last 5         0.45                0.41               53.2%
Last 10        0.32                0.29               56.1%
Last 20        0.22                0.19               58.5%
Cumulative     0.95 → 0.15         0.15               60.8%

Larger windows help by reducing variance, but:

  • You lose early-season predictions (no data yet)
  • You still don't encode sample size in the feature
  • Cumulative stats do both better

The Production Impact

After reverting to cumulative stats, I backtested the entire 24-25 season:

Figure 5: 24-25 season backtest (€10,000 start, fractional Kelly), cumulative P&L for rolling vs cumulative features.

Rolling window model:

  • Final bankroll: €8,760 (-12.4% ROI)
  • Bets placed: 156
  • Win rate: 51.3%
  • Max drawdown: 18.7%

Cumulative model:

  • Final bankroll: €12,180 (+21.8% ROI)
  • Bets placed: 98 (fewer false edges)
  • Win rate: 63.3%
  • Max drawdown: 8.9%

The rolling window model placed 60% more bets because it saw "edges" from early-season variance that didn't exist. The cumulative model was more selective, only betting when sample sizes were large enough to trust.

€3,420 difference from a single feature engineering decision.
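For context, the staking rule behind the backtest is fractional Kelly. The article doesn't show that code, so here is a minimal sketch; the quarter-Kelly fraction and the example odds are assumptions, not the Prometheus settings:

def fractional_kelly_stake(bankroll: float, model_prob: float, decimal_odds: float,
                           fraction: float = 0.25) -> float:
    """Kelly stake scaled down by a safety fraction.

    f* = (b*p - q) / b, with b = decimal_odds - 1, p = model probability, q = 1 - p.
    A negative f* means no edge, so the stake is clamped to zero.
    """
    b = decimal_odds - 1.0
    p, q = model_prob, 1.0 - model_prob
    f_star = (b * p - q) / b
    return max(0.0, bankroll * fraction * f_star)

# Example: €10,000 bankroll, model says 55% home win, bookmaker offers 2.10
print(round(fractional_kelly_stake(10_000, 0.55, 2.10)))   # ~352 at quarter-Kelly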

Implementation Checklist

If you're building temporal ML models, here's how to choose:

Use cumulative stats when:

  • Domain has high noise (sports, finance, social behavior)
  • Sample size varies significantly over time
  • You want feature importance to be interpretable
  • Concept drift is slow (team quality evolves over months)

Use rolling windows when:

  • True momentum effects exist (verified, not assumed)
  • Concept drift is fast (strategies change weekly)
  • You can afford to lose early predictions
  • Window size is large enough (20+ samples) to reduce variance

Decision framework:

def choose_temporal_aggregation(domain_noise, concept_drift_speed,
                                sample_size_variation, momentum_verified=False):
    """Rough decision rule, not a law of nature."""
    if domain_noise == 'high' and sample_size_variation == 'high':
        return 'cumulative'              # Variance reduction dominates

    if concept_drift_speed == 'fast':
        return 'rolling_large_window'    # Need recency, but keep variance low

    if momentum_verified:                # Requires an A/B test or literature, not intuition
        return 'rolling_optimized_window'

    return 'cumulative'                  # Default to stability

Red flags that you need cumulative:

  1. Feature importance changes drastically across time periods (a quick check is sketched after this list)
  2. Early-time predictions are garbage, late-time predictions are good
  3. Validation accuracy << Training accuracy (overfitting)
  4. Feature SHAP values have high variance for same feature value
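For the first red flag, a quick check is to fit throwaway models on early and late matchdays and compare their importances. Everything below (names, hyperparameters, the matchday split) is illustrative, not the Prometheus code:

import numpy as np
import pandas as pd
from xgboost import XGBClassifier

def importance_drift(X: pd.DataFrame, y, matchday: np.ndarray, split: int = 19) -> dict:
    """Per-feature gap between importances learned on early vs late matchdays.

    X: feature table, y: integer-encoded labels, matchday: per-row matchday.
    Large gaps mean the model uses the same feature very differently across the season.
    """
    importances = []
    for mask in (matchday <= split, matchday > split):
        model = XGBClassifier(n_estimators=200, max_depth=4)
        model.fit(X[mask], np.asarray(y)[mask])
        importances.append(model.feature_importances_)
    return {
        name: abs(float(early) - float(late))
        for name, early, late in zip(X.columns, importances[0], importances[1])
    }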

The Broader Lesson

This isn't just about football prediction. Every time-series problem faces this:

Demand forecasting:

  • Rolling windows: "Sales last 7 days"
  • Problem: Black Friday sales spike 400%, and the rolling window treats the new baseline as 4x higher
  • Cumulative: "Year-to-date daily average"
  • Benefit: The spike gets averaged as 1 day out of 365, correctly treated as an outlier, so one-off events don't break the model

Fraud detection:

  • Rolling windows: "Transactions last 30 days"
  • Problem: Legitimate large purchase looks fraudulent if window is small
  • Cumulative: "Account lifetime transaction profile"
  • Benefit: Large samples reveal true behavior

Medical diagnosis:

  • Rolling windows: "Symptoms last 3 visits"
  • Problem: Chronic conditions disappear from feature space
  • Cumulative: "Lifetime symptom frequency"
  • Benefit: Long-term patterns aren't forgotten

The meta-pattern: When noise > signal, larger samples > recent samples.

Your intuition says "recent is relevant." Mathematics says "recent is unstable." Listen to the math.

Conclusion: Fighting Your Instincts

I spent 2 weeks building a rolling window feature system because "everyone knows recent form matters." Lost 7 percentage points of accuracy. Took 1 day to revert to cumulative stats and beat my previous best.

The lesson isn't that rolling windows are always bad; it's that intuition misleads you in high-noise domains. What feels like signal (hot streaks, momentum, recent form) is often just variance regressing to the mean.

Three questions before you use rolling windows:

  1. Is the variance reduction worth losing early predictions? (Usually no)
  2. Does true momentum exist, or are you seeing regression to mean? (Test it)
  3. Can your model learn different relationships for different sample sizes? (Usually no)

If any answer is "no," use cumulative stats. They're boring. They ignore sexy concepts like momentum. And they work.

The Prometheus system now uses cumulative stats everywhere. Cronos (1X2): 60.8% test accuracy. Hyperion (goals): 61.7%. Coeus (corners): 72.5%. All built on season-long averages, not recent form.

Start simple. Accumulate data. Let variance decrease naturally. Make money.