Walk-forward without leakage: a checklist that's saved me
Most leakage bugs don’t look like leakage. They look like a model that’s just good, and a model that’s just good is the most expensive thing in research, because nobody questions it until it goes live. Leakage doesn’t announce itself; it presents as a clean equity curve.

This is the checklist I run before I’ll trust a backtest number, roughly in order of how often each one catches me.
1. Are your features point-in-time?
The cheap question: at time t, is every feature computed from data that was observable strictly before t?
The harder question: is every feature computed from data that was observable at the speed it was observable? End-of-day prices become available with a lag. Fundamentals are restated. Sentiment data is timestamped to the publish time, not the trade time.
The framework-level fix: bake a point_in_time attribute into every feature definition, and reject at build time any model whose target overlaps with the lookback of any input.
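A minimal sketch of that fix, in Python. FeatureSpec, its field names, and required_embargo_days are all mine, not any particular library's:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureSpec:
    name: str
    lookback_days: int         # span of raw data the feature reads
    publication_lag_days: int  # delay before that data is observable
                               # (EOD prices, restated fundamentals, ...)

    @property
    def point_in_time_lookback(self) -> int:
        # At decision time t the newest usable observation is already
        # publication_lag_days old, so the real window is wider than
        # the nominal lookback.
        return self.lookback_days + self.publication_lag_days


def required_embargo_days(features: list[FeatureSpec],
                          target_horizon_days: int) -> int:
    """Reject unsafe definitions at build time, and return the embargo
    the walk-forward split must enforce (checks 2 and 3 below)."""
    for f in features:
        if f.publication_lag_days < 0:
            raise ValueError(f"{f.name}: negative lag reads the future")
    max_lookback = max(f.point_in_time_lookback for f in features)
    return target_horizon_days + max_lookback
```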
2. Are your folds chronological and gapped?
Random k-fold on a time series is leakage by design. Walk-forward is the minimum, and walk-forward needs an embargo gap between train and test — large enough to cover the longest feature lookback plus the holding period. The embargo is not optional; the framework should enforce it.
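The shape of that split, sketched over integer positions into a chronologically sorted dataset; pass it the embargo computed by the build-time check above:

```python
def walk_forward_splits(n_samples: int, train_size: int,
                        test_size: int, embargo: int):
    """Yield chronological (train, test) index ranges with an enforced
    embargo gap between the end of train and the start of test."""
    start = 0
    while True:
        train_end = start + train_size
        test_start = train_end + embargo  # the gap is not optional
        test_end = test_start + test_size
        if test_end > n_samples:
            return
        yield range(start, train_end), range(test_start, test_end)
        start += test_size                # roll forward one test window
```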
3. Is your target horizon disjoint from your feature horizon?
If a feature is a 20-day rolling stat and your target is forward 20-day return, the windows can touch. They shouldn’t. A target horizon of h and a maximum feature lookback of L means the safe step between folds is h + L, not h.
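To make the arithmetic concrete, a toy check under the convention that a feature at decision date t reads the L days up to t, and the target realises over the h days after it:

```python
def windows_touch(t1: int, t2: int, L: int, h: int) -> bool:
    # The target of a trade decided at t1 realises over (t1, t1 + h];
    # the feature at decision date t2 reads (t2 - L, t2]. If the later
    # feature window reaches back into the earlier target window, the
    # fold spacing leaks.
    return t2 - L < t1 + h

L, h = 20, 20
print(windows_touch(0, 39, L, h))  # True:  a step of 39 days leaks
print(windows_touch(0, 40, L, h))  # False: a step of h + L is safe
```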
4. Did you re-fit normalisations per fold?
Centring or scaling using the full-sample mean is a quiet form of leakage — it tells the training set what the test set looked like. Compute scalers within the training portion of each fold only.
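In scikit-learn terms this is one of the few leaks a Pipeline closes for you, because fit() only ever sees the train fold. A sketch, reusing the walk_forward_splits generator from check 2:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.standard_normal((1000, 5)), rng.standard_normal(1000)

for train_idx, test_idx in walk_forward_splits(len(X), train_size=500,
                                               test_size=50, embargo=40):
    model = make_pipeline(StandardScaler(), Ridge())
    model.fit(X[train_idx], y[train_idx])  # scaler stats from train only
    preds = model.predict(X[test_idx])     # train-fold mean/std applied to test
```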
5. Did you re-rank cross-sectionally, not over time?
Cross-sectional momentum that ranks “the highest return of the last six months across the universe” is fine. The same code with a typo’d window that ranks over time accidentally compares today’s ticker to its own future. Easy to introduce, hard to spot in plots — but visible if you compute the average per-fold rank correlation of feature values to the future version of themselves.
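In pandas the typo is a single axis argument, and the diagnostic is a few lines. The iid dummy panel below stands in for a real dates-by-tickers feature DataFrame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=500)
panel = pd.DataFrame(rng.standard_normal((500, 50)), index=dates,
                     columns=[f"T{i}" for i in range(50)])

ranks_ok  = panel.rank(axis=1)  # cross-sectional: across the universe, per date
ranks_bug = panel.rank(axis=0)  # the typo: per ticker, over its whole history,
                                # so today's rank already encodes the future

def future_self_corr(feature: pd.DataFrame, h: int) -> float:
    """Average per-date Spearman correlation between a feature and its
    own value h days ahead (the diagnostic described above)."""
    corrs = feature.corrwith(feature.shift(-h), axis=1, method="spearman")
    return float(corrs.mean())
```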
6. Did transaction costs survive the optimisation loop?
Costs are leakage’s cousin. A model tuned to maximise Sharpe with zero costs and then “cost-adjusted” at the end is a different model than one optimised with costs in the loop. Bake the cost into the objective, not the report.
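A sketch of what "in the loop" means, assuming pandas DataFrames of weights and gross returns aligned on date, and an illustrative 10 bps per unit of turnover:

```python
def net_sharpe(weights, gross_returns, cost_per_turnover=0.0010):
    """Per-period Sharpe with costs inside the objective. Both inputs
    are dates-by-assets DataFrames; the cost level is illustrative."""
    turnover = weights.diff().abs().sum(axis=1)          # notional traded each rebalance
    pnl = (weights.shift() * gross_returns).sum(axis=1)  # yesterday's weights earn today
    net = pnl - cost_per_turnover * turnover
    return net.mean() / net.std()  # tune against this, report the same number
```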
7. Did the universe survive survivorship?
The cheapest way to a great backtest is to use today’s universe. Reconstitute the universe at the rebalance date — including delisted and acquired names — or accept that your Sharpe has a known upward bias.
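One workable shape for that, assuming a membership table with one row per listing spell and end_date left empty while a name is still listed:

```python
import pandas as pd

def universe_at(membership: pd.DataFrame, date: pd.Timestamp) -> list[str]:
    """Tickers actually in the universe on `date`, including names that
    were later delisted or acquired. Columns assumed:
    ticker, start_date, end_date (NaT while still listed)."""
    live = (membership["start_date"] <= date) & (
        membership["end_date"].isna() | (membership["end_date"] >= date)
    )
    return membership.loc[live, "ticker"].tolist()
```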
8. Did you look at the trade list?
Aggregate stats hide a lot. Take the worst-performing 20 trades and read them — what feature value got them in, what feature value got them out, when. If those reads make obvious sense, the model is probably fine. If they don’t, you’re either learning something new or you’re leaking.
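Mechanically this is three lines, assuming a trades DataFrame with one row per closed trade; the column names are illustrative:

```python
worst = trades.sort_values("pnl").head(20)
cols = ["ticker", "entry_time", "entry_feature",
        "exit_time", "exit_feature", "pnl"]
print(worst[cols].to_string())  # then actually read them, row by row
```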
This isn’t a complete list. It’s the list that has caught me. Most of these become free if the framework owns the loop, which is most of why I keep building frameworks.