feature-forge — a DAG-based feature pipeline for trading data
Treat features like a build system: declarative dependencies, point-in-time correctness baked in, content-addressed caching, and the same graph at training time and serving time.
Research Engineer
Python · Polars · Arrow · Parquet · DuckDB · Prefect
Runnable companion:
extras/feature-forge-mini/ — a ~250-line, stdlib-only Python sketch of the DAG, content-addressed hashes, and point-in-time enforcement. Edit any feature’s body and watch its hash plus every descendant’s hash rotate; unrelated features keep their hashes. Thirteen unit tests.
The problem
Most feature stores ship two things: a registry and an online lookup. What they rarely give you is the part quants actually want — a clean mental model for how a feature is built from other features.
feature-forge is a build system for that. Every feature is a node; every dependency is an edge; every edge carries a lag.
A feature, defined
```python
@feature(
    deps=[mid_price, returns(window=1)],
    refit="daily",
    lag="1d",
)
def vol_ewma(mid_price, returns_1):
    return ewma(returns_1.abs(), span=20)
```
feature-forge records that vol_ewma reads mid_price and returns_1 from yesterday’s window, and it refuses at build time to let you write a feature that would consume a future bar.
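The registration half of that decorator fits in a few lines of stdlib Python. This is an illustrative sketch, not feature-forge’s actual API — names like REGISTRY and LookaheadError are made up here. It records each feature’s dependency edges and lag, and rejects at definition time any lag that would read a future bar:

```python
REGISTRY = {}  # feature name -> recorded edges, lag, refit policy

class LookaheadError(Exception):
    """A feature tried to read data from the future."""

def feature(deps=(), lag="0d", refit=None):
    def wrap(fn):
        days = int(lag.rstrip("d"))  # parse lags like "1d"
        if days < 0:  # a negative lag would consume a future bar
            raise LookaheadError(f"{fn.__name__}: lag {lag!r} looks ahead")
        REGISTRY[fn.__name__] = {
            "deps": [d.__name__ for d in deps],  # edges into the DAG
            "lag_days": days,
            "refit": refit,
        }
        return fn
    return wrap

# The graph is built as a side effect of importing definitions:
@feature()
def mid_price():
    ...

@feature(deps=[mid_price], refit="daily", lag="1d")
def vol_ewma(mid_price):
    ...
```

Because registration happens at import time, a look-ahead bug fails the build before any data is touched.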
Why a DAG, not a notebook chain
The same DAG runs in three places:
| context | executor | output |
|---|---|---|
| research backtest | local / cluster | parquet shards in the feature lake |
| training | same | a materialised, joinable table per fold |
| serving | streaming | a Kafka topic per feature, replayable |
One graph, three executors. A model trained on vol_ewma(span=20) gets exactly the same definition in production. Drift is caught at schema-hash check time, not when PnL diverges.
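That check is simple in principle — a minimal sketch, with hypothetical function names and made-up hash values: at training time the model pins each feature’s content hash in a manifest; at serving start those hashes are compared against the definitions actually loaded, and any mismatch aborts before a single prediction is made:

```python
def check_drift(training_manifest, live_hashes):
    # Every feature the model was trained on must resolve to the same
    # content hash in the serving process.
    drifted = sorted(name for name, h in training_manifest.items()
                     if live_hashes.get(name) != h)
    if drifted:
        raise RuntimeError(f"schema-hash drift in: {drifted}")

trained = {"mid_price": "c3d4", "vol_ewma": "a1b2"}
check_drift(trained, {"mid_price": "c3d4", "vol_ewma": "a1b2"})  # passes
```

A redefined feature, a missing feature, or a changed upstream all surface as the same failure mode: a hash mismatch at startup.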
Content-addressed caching
A feature’s hash is the hash of its code plus the hashes of its parents. Two researchers writing the “same” feature get the same hash and one cache hit; a tiny code change invalidates only the downstream subgraph. The lake is content-addressed, so promotion to production is a rename, not a recompute.
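The hashing rule is easy to demonstrate — a toy sketch with made-up feature bodies standing in for real definitions. A node’s hash digests its own code plus its parents’ hashes, so an upstream edit rotates every descendant while unrelated features keep their hashes:

```python
import hashlib

def node_hash(name, body, parent_hashes):
    # hash(feature) = sha256 of its own code plus its parents' hashes
    h = hashlib.sha256(name.encode())
    h.update(body.encode())
    for p in sorted(parent_hashes):
        h.update(p.encode())
    return h.hexdigest()[:12]

def graph_hashes(bodies, deps):
    hashes = {}
    def resolve(name):  # recurse up the DAG, memoising as we go
        if name not in hashes:
            parents = [resolve(d) for d in deps.get(name, [])]
            hashes[name] = node_hash(name, bodies[name], parents)
        return hashes[name]
    for name in bodies:
        resolve(name)
    return hashes

# Toy graph: mid_price -> returns -> vol_ewma, plus an unrelated spread.
bodies = {"mid_price": "bid+ask/2", "returns": "diff(log(mid))",
          "vol_ewma": "ewma(abs(r), 20)", "spread": "ask - bid"}
deps = {"returns": ["mid_price"], "vol_ewma": ["returns"]}

before = graph_hashes(bodies, deps)
bodies["mid_price"] = "(bid + ask) / 2"  # fix a bug upstream
after = graph_hashes(bodies, deps)
# mid_price, returns, and vol_ewma rotate; spread is untouched
```

The cache key is the node hash, so “promotion is a rename” falls out for free: the promoted artifact already exists under that hash.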
What I’d ship next
- A “what-if” mode that simulates retroactive schema changes against a frozen lake snapshot — answer “what does our last 18 months of research look like if we redefine this feature?” in minutes
- Cost-aware scheduling: a feature carries its expected wall-clock and bytes, and the scheduler refuses to fan out a 4-hour graph during the trading day without explicit override