alpha-bench — a research framework for mid-term alpha models
The framework you spend most of your day in. Declarative model definitions, walk-forward CV, leakage-safe feature evaluation, and a benchmark grid that lets a quant compare three weeks of model variants over coffee.
Research Engineer (platform lead)
Python · PyTorch · scikit-learn · DuckDB · Polars · Hydra
Runnable companion:
extras/alpha-bench-mini/ — a ~250-line stdlib-only Python sketch of the executor, the model spec, and the walk-forward loop, with eight unit tests. Five minutes to clone and run; it covers the central concepts without trying to be production framework code.
Why it exists
A research team’s velocity is bounded by how cheaply they can ask the next question. The previous setup at the firm I’m modelling this on made every question expensive: notebooks were the unit of work, runs weren’t reproducible, leakage was a constant audit burden, and the gap between “this looked good in research” and “this is what production sees” was wide enough to lose ideas in.
alpha-bench collapses that. A model is a small Python class with a declarative spec; the framework owns the rest: data wiring, feature caching, walk-forward partitioning, evaluation, and the comparison grid.
What it looks like
```python
from alpha_bench import Model, walk_forward, features as F

class XSMomentum(Model):
    universe = "HK.large_cap"
    horizon = "10d"
    features = [F.ret(window=20), F.ret(window=60), F.turnover(20)]
    target = F.fwd_ret("10d")
    estimator = "ridge(alpha=auto)"
    cv = walk_forward(train="3y", test="3m", step="1m", embargo="5d")
    objective = "rank_ic"
```
That’s it. `rb run xsmom` does the rest:
- resolves the feature graph (and reuses anything already materialised)
- partitions the walk-forward folds with an embargo to prevent leakage
- trains, evaluates, and writes one row per fold to a results table
- registers the artefact with a stable hash so a teammate can `rb replay xsmom@abc123` and get bit-exact numbers
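The walk-forward partitioning with an embargo is the piece most worth seeing in code. Here is a minimal sketch in the spirit of the companion repo, not alpha-bench's actual implementation: the `walk_forward_folds` helper and its day-count parameters are hypothetical, but the invariant is the real one — every test window starts a full embargo after its training window ends, so forward-looking targets computed near the boundary cannot leak into training.

```python
from datetime import date, timedelta

def walk_forward_folds(start, end, train_days, test_days, step_days, embargo_days):
    """Yield (train_start, train_end, test_start, test_end) tuples.

    The embargo gap between train_end and test_start keeps labels that look
    forward past the training boundary out of the training set.
    """
    folds = []
    cursor = start + timedelta(days=train_days)  # end of the first training window
    while cursor + timedelta(days=embargo_days + test_days) <= end:
        train_start = cursor - timedelta(days=train_days)
        train_end = cursor
        test_start = train_end + timedelta(days=embargo_days)
        test_end = test_start + timedelta(days=test_days)
        folds.append((train_start, train_end, test_start, test_end))
        cursor += timedelta(days=step_days)  # roll the whole window forward
    return folds

# Mirrors the XSMomentum spec: 3y train, 3m test, 1m step, 5d embargo.
folds = walk_forward_folds(
    date(2018, 1, 1), date(2023, 1, 1),
    train_days=3 * 365, test_days=90, step_days=30, embargo_days=5,
)
```

Because the fold geometry is pure arithmetic over dates, it is trivial to unit-test the no-peek invariant once and enforce it at the framework boundary.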
Design choices that aged well
- Features as first-class objects, not columns. A feature carries its own type, lag, refit cadence, point-in-time semantics, and a hash of its dependencies. Two features with the same hash share a cache slot regardless of who defined them.
- One executor, two surfaces. The CLI and the notebook helper both call the same `bench.run(...)`. The notebook just rehydrates artefacts and renders a `BenchmarkGrid` widget; quants stay in the notebook, but the framework owns the run.
- Walk-forward by default. No model can be registered without walk-forward evaluation. The embargo window is enforced at the framework boundary; you cannot accidentally peek.
How it lands in production
Every promoted model emits a deployment manifest the platform reads directly. The same feature graph runs in production — no rewrites, no parallel codepaths. See signal-stream for the serving half of the loop.
What I’d push next
- A first-class portfolio-construction layer so the framework can evaluate combinations of alphas under realistic capacity assumptions
- Lake-aware caching with content-addressed feature shards
- Per-quant cost dashboards so the grid is also a budget signal