shipped · platform · mlops

alpha-bench — a research framework for mid-term alpha models

The framework you spend most of your day in. Declarative model definitions, walk-forward CV, leakage-safe feature evaluation, and a benchmark grid that lets a quant compare three weeks of model variants over coffee.

Role

Research Engineer (platform lead)

Stack

Python · PyTorch · scikit-learn · DuckDB · Polars · Hydra

  • model variants / day: 120+
  • median backtest: 9.2s
  • reproducibility: bit-exact

Runnable companion: extras/alpha-bench-mini/ — a ~250-line stdlib-only Python sketch of the executor, the model spec, and the walk-forward loop, with eight unit tests. Five minutes to clone and run; covers the central concepts without trying to be production framework code.

Why it exists

A research team’s velocity is bounded by how cheaply they can ask the next question. The previous setup at the firm I’m modelling this on made every question expensive: notebooks were the unit of work, runs weren’t reproducible, leakage was a constant audit burden, and the gap between “this looked good in research” and “this is what production sees” was wide enough to lose ideas in.

alpha-bench collapses that cost. A model is a small Python class with a declarative spec; the framework owns the rest: data wiring, feature caching, walk-forward partitioning, evaluation, and the comparison grid.

What it looks like

from alpha_bench import Model, walk_forward, features as F

class XSMomentum(Model):
    universe   = "HK.large_cap"
    horizon    = "10d"
    features   = [F.ret(window=20), F.ret(window=60), F.turnover(20)]
    target     = F.fwd_ret("10d")
    estimator  = "ridge(alpha=auto)"
    cv         = walk_forward(train="3y", test="3m", step="1m", embargo="5d")
    objective  = "rank_ic"

That’s it. rb run xsmom does the rest:

  • resolves the feature graph (and reuses anything already materialised)
  • partitions the walk-forward folds with an embargo to prevent leakage (sketched below)
  • trains, evaluates, and writes one row per fold to a results table
  • registers the artefact with a stable hash so a teammate can rb replay xsmom@abc123 and get bit-exact numbers
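
The embargo partitioning is the piece people ask about most, so here is a minimal, stdlib-only sketch in the spirit of the mini companion; the function name and calendar-day arithmetic are illustrative, not the framework's real API:

from datetime import date, timedelta

def walk_forward_folds(start, end, train_days, test_days, step_days, embargo_days):
    """Yield (train, test) date ranges. The embargo gap between the end of
    each training window and the start of its test window keeps forward
    returns that overlap the test horizon out of training."""
    folds = []
    cursor = start
    while True:
        train_end = cursor + timedelta(days=train_days)
        test_start = train_end + timedelta(days=embargo_days)  # leakage gap
        test_end = test_start + timedelta(days=test_days)
        if test_end > end:
            break
        folds.append(((cursor, train_end), (test_start, test_end)))
        cursor += timedelta(days=step_days)  # roll the window forward
    return folds

# roughly "3y train / 3m test / 1m step / 5d embargo", in calendar days
for train_range, test_range in walk_forward_folds(
    date(2018, 1, 1), date(2024, 1, 1), 3 * 365, 90, 30, 5
):
    print("train", train_range, "test", test_range)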

Design choices that aged well

  • Features as first-class objects, not columns. A feature carries its own type, lag, refit cadence, point-in-time semantics, and a hash of its dependencies. Two features with the same hash share a cache slot regardless of who defined them (see the sketch after this list).
  • One executor, two surfaces. The CLI and the notebook helper both call the same bench.run(...). The notebook just rehydrates artefacts and renders a BenchmarkGrid widget — quants stay in the notebook, but the framework owns the run.
  • Walk-forward by default. No model can be registered without walk-forward evaluation. The embargo window is enforced at the framework boundary; you cannot accidentally peek.
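
To make the hashing claim concrete, a stdlib-only sketch of a content-hashed feature spec; the class name and fields are illustrative, not the real objects:

import hashlib
import json
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FeatureSpec:
    """Illustrative feature spec: identity is a hash of its definition,
    so two identically defined features share one cache slot."""
    name: str
    window: int
    lag: int = 0
    refit: str = "monthly"
    deps: tuple = field(default_factory=tuple)  # hashes of upstream features

    def spec_hash(self) -> str:
        payload = json.dumps(
            {"name": self.name, "window": self.window, "lag": self.lag,
             "refit": self.refit, "deps": sorted(self.deps)},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Two quants defining "20-day return" independently land on the same hash,
# so the feature is materialised once and cached once.
assert FeatureSpec("ret", 20).spec_hash() == FeatureSpec("ret", 20).spec_hash()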

How it lands in production

Every promoted model emits a deployment manifest the platform reads directly. The same feature graph runs in production — no rewrites, no parallel codepaths. See signal-stream for the serving half of the loop.
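
For a sense of shape only (a guess at the fields, not the production schema), the manifest carries everything the serving side needs to rehydrate the research run:

# Hypothetical manifest shape; field names are illustrative.
manifest = {
    "model": "xsmom",
    "artefact": "xsmom@abc123",        # stable hash, the same one rb replay uses
    "universe": "HK.large_cap",
    "horizon": "10d",
    "features": ["ret_20", "ret_60", "turnover_20"],  # same feature graph as research
    "estimator": "ridge(alpha=auto)",
    "evaluation": {"cv": "walk_forward", "embargo": "5d", "objective": "rank_ic"},
}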

What I’d push next

  • A first-class portfolio-construction layer so the framework can evaluate combinations of alphas under realistic capacity assumptions
  • Lake-aware caching with content-addressed feature shards
  • Per-quant cost dashboards so the grid is also a budget signal