MLOps for quant research isn't MLOps for ML
Web-MLOps wants to retrain on yesterday's data and ship to A/B. Quant-MLOps wants to defend bit-exact reproducibility of a model that traded a year ago. Same vocabulary, different platform.
The MLOps vocabulary — registry, lineage, monitoring, canary — comes from a world where the model serves a request and the cost of a wrong prediction is one user-visit’s worth of friction. Quant trading uses the same words but doesn’t mean the same things.
A few of the differences that matter when you’re building the platform.
“Reproducible” means something stronger
In web MLOps, reproducible-ish is fine: rerun the training pipeline, expect approximately the same model, ship the new one. In quant research, reproducible means bit-exact — a year from now, when the trading desk questions a fill that happened today, you need to recover the exact model, the exact features, the exact data window, and the exact signal value, with no drift.
Concretely: every random seed is recorded, every library version is pinned in the registered artefact (not just in a requirements file), and the feature graph is content-addressed so a downstream node’s hash changes whenever any ancestor’s code changes. “Trust me it’s the same model” doesn’t make it past compliance.
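A minimal sketch of what content-addressing the feature graph can look like, assuming a toy node type — the names and hashing scheme are illustrative, not any particular library's API. Each node's hash folds in its own code plus every ancestor's hash, so a change anywhere upstream shows up in every downstream hash:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class FeatureNode:
    """One node in the feature graph: a named transform plus its upstream deps."""
    name: str
    code: str                                   # source of the transform, e.g. inspect.getsource(fn)
    parents: list["FeatureNode"] = field(default_factory=list)

    def content_hash(self) -> str:
        # A node's hash covers its own code *and* every ancestor's hash,
        # so editing any upstream transform changes every downstream hash.
        h = hashlib.sha256()
        h.update(self.code.encode())
        for parent in sorted(self.parents, key=lambda p: p.name):
            h.update(parent.content_hash().encode())
        return h.hexdigest()

# Illustrative graph: raw mid-price -> returns -> rolling volatility.
mid = FeatureNode("mid_price", "mid = (bid + ask) / 2")
rets = FeatureNode("returns", "rets = mid.pct_change()", parents=[mid])
vol = FeatureNode("rolling_vol", "vol = rets.rolling(20).std()", parents=[rets])

print(vol.content_hash())  # changes if the code of mid_price or returns changes
```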
Monitoring is bimodal
Web ML monitors one thing well: prediction quality. Quant has to monitor two things, and they’re not the same.
- Infra freshness (is the feature up to date? is the schema what the model expects? is latency within SLO?) — pages the on-call.
- Alpha drift (is the model still earning? is its return profile changing?) — goes to research, not to ops.
Mixing them up is how you get woken at 3am for a Sharpe that slipped from 1.4 to 1.3. The platform should route them differently by design.
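A sketch of that routing rule, with made-up channel names and a made-up alert shape; the point is that severity is decided by the class of the alert, not by threshold tuning:

```python
from dataclasses import dataclass
from enum import Enum, auto

class AlertKind(Enum):
    INFRA = auto()   # stale feature, schema mismatch, latency over SLO
    ALPHA = auto()   # drifting return profile, decaying Sharpe

@dataclass
class Alert:
    kind: AlertKind
    message: str

def route(alert: Alert) -> str:
    # Infra problems page whoever is on call; alpha drift lands in the
    # research queue to be looked at during market hours.
    if alert.kind is AlertKind.INFRA:
        return "pagerduty:oncall"
    return "ticket:research-review"

print(route(Alert(AlertKind.INFRA, "feature 'rolling_vol' is 45 minutes stale")))
print(route(Alert(AlertKind.ALPHA, "30d Sharpe 1.3 vs trailing-year 1.4")))
```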
“Deploy” needs a contract, not a script
In a typical ML deploy, you push a model and it serves traffic. In quant, you push a model and it consumes a market-data stream that must match what it saw in training. The deploy is a contract: the feature graph the model was trained on, the schema of that graph, and the data window over which the contract was validated.
This is why the model registry carries the feature graph hash, not just the weights. A deploy that doesn’t pass a schema check at load time is rejected — at the boundary, not at the first divergent signal.
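A sketch of what that boundary check might look like, with hypothetical field names rather than a real registry API: the registry entry carries the graph hash, the schema, and the validation window, and the loader refuses to proceed if the live side disagrees:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeployContract:
    """What the registry stores alongside the weights (illustrative fields)."""
    model_id: str
    feature_graph_hash: str                  # content hash of the training-time feature graph
    feature_schema: dict[str, str]           # column name -> dtype
    validation_window: tuple[str, str]       # data window the contract was validated over

class SchemaMismatch(RuntimeError):
    pass

def load_model(contract: DeployContract, live_graph_hash: str, live_schema: dict[str, str]):
    # Reject at the boundary: refuse to load if the live feature graph or its
    # schema differs from what the model was trained against.
    if live_graph_hash != contract.feature_graph_hash:
        raise SchemaMismatch(
            f"feature graph drifted: {live_graph_hash[:8]} != {contract.feature_graph_hash[:8]}"
        )
    if live_schema != contract.feature_schema:
        raise SchemaMismatch("live feature schema differs from the training-time schema")
    ...  # only now fetch the weights and hand them to the serving layer
```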
Promotion is two questions, not one
A web model gets promoted when it beats the incumbent on a held-out metric. A trading model gets promoted when it beats the incumbent and the team can explain why. The framework should require both — a numeric report (which it generates) and a written rationale (which it stores alongside the artefact).
The second one is what stops “good number, deploy it” from being a regular occurrence. It also gives the next person to look at the model, six months later, a clue.
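One way to make the second question non-optional, sketched with hypothetical types: the promotion call simply refuses to run without both the generated report and a human-written rationale.

```python
from dataclasses import dataclass

@dataclass
class PromotionRequest:
    model_id: str
    backtest_report: dict      # generated by the framework: returns, drawdown, turnover...
    rationale: str             # written by a human: why this model beats the incumbent

def promote(req: PromotionRequest) -> None:
    # Both halves are mandatory: a numeric edge without an explanation
    # (or an explanation without numbers) does not get promoted.
    if not req.backtest_report:
        raise ValueError("promotion refused: no backtest report attached")
    if len(req.rationale.strip()) < 200:   # arbitrary floor, tune to taste
        raise ValueError("promotion refused: rationale missing or too thin")
    ...  # store the report and rationale next to the artefact, then flip the pointer
```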
Rollback is the unit of safety
In web ML, the safe move during an incident is often a slow rollback: roll percentages back, watch, decide. In trading, the safe move is instant. The registry’s promotion API is a write; the rollback API is one call away and has been load-tested. If you can’t roll back in under two minutes, you don’t have a registry, you have a database.
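A toy illustration of why the pointer-flip model matters, not a real registry API: “production” is a pointer to a version, and rollback is a single atomic swap rather than a redeploy.

```python
import time

# Illustrative registry: "production" is just a pointer to a model version.
_registry = {
    "production": "model_v42",
    "previous": "model_v41",
}

def rollback() -> float:
    """Swap production back to the previous version; return elapsed seconds."""
    start = time.monotonic()
    _registry["production"], _registry["previous"] = (
        _registry["previous"],
        _registry["production"],
    )
    return time.monotonic() - start

elapsed = rollback()
assert elapsed < 120, "if rollback takes minutes, it isn't a registry"
print(_registry["production"])  # -> model_v41
```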
Different shape, same vocabulary. Most of the platform work I care about is reading the MLOps literature and asking, “what would this mean if a wrong prediction cost the firm money?” — and rebuilding the bit that doesn’t survive contact with that question.