shipped · dataplatform

feature-forge — a DAG-based feature pipeline for trading data

Treat features like a build system: declarative dependencies, point-in-time correctness baked in, content-addressed caching, and the same graph at training time and serving time.

Role

Research Engineer

Stack

Python · Polars · Arrow · Parquet · DuckDB · Prefect

feature graph nodes: 1.4k
cache hit rate: 94%
PIT violations: 0 in prod

Runnable companion: extras/feature-forge-mini/ — a ~250-line stdlib-only Python sketch of the DAG, content-addressed hashes, and point-in-time enforcement. Edit any feature’s body and watch its hash plus every descendant’s hash rotate; unrelated features keep their hashes. Thirteen unit tests.

The problem

Most feature stores ship two things: a registry and an online lookup. What they rarely give you is the part quants actually want — a clean mental model for how a feature is built from other features.

feature-forge is a build system for that. Every feature is a node; every dependency is an edge; every edge carries a lag.

A feature, defined

@feature(
    deps=[mid_price, returns(window=1)],
    refit="daily",
    lag="1d",
)
def vol_ewma(mid_price, returns_1):
    return ewma(returns_1.abs(), span=20)

feature-forge records that vol_ewma reads mid_price and returns_1 from yesterday’s window, and it refuses at build time to let you write a feature that would consume a future bar.
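A stdlib-only sketch of how a decorator like this might record dependency edges and refuse a future-reading lag at definition time. The `REGISTRY` dict and `parse_lag` helper are illustrative assumptions, not feature-forge's actual internals:

```python
REGISTRY = {}  # hypothetical registry: feature name -> metadata

def feature(deps=(), refit="daily", lag="1d"):
    """Toy decorator: records dependency edges and rejects any lag
    that would let a feature consume a future bar."""
    def parse_lag(s):
        # this sketch only understands day-granularity lags, e.g. "1d"
        if not s.endswith("d"):
            raise ValueError(f"unsupported lag {s!r}")
        return int(s[:-1])

    def wrap(fn):
        days = parse_lag(lag)
        if days < 0:
            # a negative lag means reading tomorrow's bar today:
            # refuse when the feature is defined, not when it runs
            raise ValueError(f"{fn.__name__}: lag {lag!r} reads the future")
        REGISTRY[fn.__name__] = {
            "deps": [getattr(d, "__name__", str(d)) for d in deps],
            "refit": refit,
            "lag_days": days,
            "fn": fn,
        }
        return fn
    return wrap
```

The key design point is that the check lives in the decorator, so an illegal feature can never even be registered, let alone scheduled.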

Why a DAG, not a notebook chain

The same DAG runs in three places:

| context | executor | output |
| --- | --- | --- |
| research backtest | local / cluster | parquet shards in the feature lake |
| training | same | a materialised, joinable table per fold |
| serving | streaming | a Kafka topic per feature, replayable |

One graph, three executors. A model trained on vol_ewma(span=20) gets exactly the same definition in production. Drift is caught at schema-hash check time, not when PnL diverges.

Content-addressed caching

A feature’s hash is the hash of its code plus the hashes of its parents. Two researchers writing the “same” feature get the same hash and one cache hit; a tiny code change invalidates only the downstream subgraph. The lake is content-addressed, so promotion to production is a rename, not a recompute.
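The hash recurrence above can be sketched in a few lines of stdlib Python. The `feature_hash` function and the toy three-node chain are assumptions for illustration, not the production hashing scheme:

```python
import hashlib

def feature_hash(src: str, parent_hashes=()) -> str:
    """Content address = sha256(feature source + sorted parent hashes).
    Editing any ancestor rotates every descendant's hash; features
    off the edited path keep theirs."""
    h = hashlib.sha256()
    h.update(src.encode())
    for p in sorted(parent_hashes):  # sorted: parent order doesn't matter
        h.update(p.encode())
    return h.hexdigest()[:12]

# toy chain: mid_price -> returns_1 -> vol_ewma
mid = feature_hash("def mid_price(): ...")
ret = feature_hash("def returns_1(mid): ...", [mid])
vol = feature_hash("def vol_ewma(ret): ...", [ret])

# edit mid_price's body: its hash and both descendants' hashes rotate
mid2 = feature_hash("def mid_price(): ...  # v2")
ret2 = feature_hash("def returns_1(mid): ...", [mid2])
vol2 = feature_hash("def vol_ewma(ret): ...", [ret2])
```

Two researchers who write byte-identical source with the same parents land on the same address, which is what makes the "one cache hit" and rename-only promotion properties fall out for free.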

What I’d ship next

  • A “what-if” mode that simulates retroactive schema changes against a frozen lake snapshot — answer “what does our last 18 months of research look like if we redefine this feature?” in minutes
  • Cost-aware scheduling: a feature carries its expected wall-clock and bytes, and the scheduler refuses to fan out a 4-hour graph during the trading day without explicit override
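The cost-aware gate in that last bullet might look something like this. Everything here is a hypothetical sketch of future work: the `FeatureCost` shape, the session window, and the one-hour budget are all invented for illustration:

```python
from dataclasses import dataclass
from datetime import time

@dataclass
class FeatureCost:
    name: str
    wall_clock_s: float  # expected build time for this node
    bytes_out: int       # expected materialised output size

def may_schedule(costs, now,
                 session=(time(9, 30), time(16, 0)),
                 budget_s=3600.0, override=False):
    """Refuse to fan out an expensive graph during the trading session
    unless the caller explicitly overrides."""
    total = sum(c.wall_clock_s for c in costs)
    in_session = session[0] <= now <= session[1]
    return override or not in_session or total <= budget_s
```

The point of attaching costs to features rather than to jobs is that the scheduler can price any subgraph before running it, just by summing over the nodes it would touch.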