MLOps for quant research isn't MLOps for ML
Web-MLOps wants to retrain on yesterday's data and ship to A/B. Quant-MLOps wants to defend bit-exact reproducibility of a model that traded a year ago. Same vocabulary, different platform.
The MLOps vocabulary — registry, lineage, monitoring, canary — comes from a world where the model serves a request and the cost of a wrong prediction is one user-visit’s worth of friction. Quant trading uses the same words but doesn’t mean the same things.
A few of the differences that matter when you’re building the platform.
“Reproducible” means something stronger
In web MLOps, reproducible-ish is fine: rerun the training pipeline, expect approximately the same model, ship the new one. In quant research, reproducible means bit-exact — a year from now, when the trading desk questions a fill that happened today, you need to recover the exact model, the exact features, the exact data window, and the exact signal value, with no drift.
Concretely: every random seed is recorded, every library version is pinned in the registered artefact (not just in a requirements file), and the feature graph is content-addressed so a downstream node’s hash changes whenever any ancestor’s code changes. “Trust me it’s the same model” doesn’t make it past compliance.
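A minimal sketch of what content-addressing the feature graph can look like, assuming a toy node type — the names and hashing scheme are illustrative, not any particular library's API. Each node's hash folds in its own code plus every ancestor's hash, so a change anywhere upstream shows up in every downstream hash:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class FeatureNode:
    """One node in the feature graph: a named transform plus its upstream deps."""
    name: str
    code: str                                   # source of the transform, e.g. inspect.getsource(fn)
    parents: list["FeatureNode"] = field(default_factory=list)

    def content_hash(self) -> str:
        # A node's hash covers its own code *and* every ancestor's hash,
        # so editing any upstream transform changes every downstream hash.
        h = hashlib.sha256()
        h.update(self.code.encode())
        for parent in sorted(self.parents, key=lambda p: p.name):
            h.update(parent.content_hash().encode())
        return h.hexdigest()

# Illustrative graph: raw mid-price -> returns -> rolling volatility.
mid = FeatureNode("mid_price", "mid = (bid + ask) / 2")
rets = FeatureNode("returns", "rets = mid.pct_change()", parents=[mid])
vol = FeatureNode("rolling_vol", "vol = rets.rolling(20).std()", parents=[rets])

print(vol.content_hash())  # changes if the code of mid_price or returns changes
```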
Monitoring is bimodal
Web ML monitors one thing well: prediction quality. Quant has to monitor two things, and they’re not the same.
- Infra freshness (is the feature up to date? is the schema what the model expects? is latency within SLO?) — pages the on-call.
- Alpha drift (is the model still earning? is its return profile changing?) — goes to research, not to ops.
Mixing them up is how you get woken at 3am for a Sharpe that slipped from 1.4 to 1.3. The platform should route them differently by design.
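A sketch of that routing rule, with made-up channel names and a made-up alert shape; the point is that severity is decided by the class of the alert, not by threshold tuning:

```python
from dataclasses import dataclass
from enum import Enum, auto

class AlertKind(Enum):
    INFRA = auto()   # stale feature, schema mismatch, latency over SLO
    ALPHA = auto()   # drifting return profile, decaying Sharpe

@dataclass
class Alert:
    kind: AlertKind
    message: str

def route(alert: Alert) -> str:
    # Infra problems page whoever is on call; alpha drift lands in the
    # research queue to be looked at during market hours.
    if alert.kind is AlertKind.INFRA:
        return "pagerduty:oncall"
    return "ticket:research-review"

print(route(Alert(AlertKind.INFRA, "feature 'rolling_vol' is 45 minutes stale")))
print(route(Alert(AlertKind.ALPHA, "30d Sharpe 1.3 vs trailing-year 1.4")))
```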
“Deploy” needs a contract, not a script
In a typical ML deploy, you push a model and it serves traffic. In quant, you push a model and it consumes a market-data stream that must match what it saw in training. The deploy is a contract: the feature graph the model was trained on, the schema of that graph, and the data window over which the contract was validated.
This is why the model registry carries the feature graph hash, not just the weights. A deploy that doesn’t pass a schema check at load time is rejected — at the boundary, not at the first divergent signal.
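A sketch of what that boundary check might look like, with hypothetical field names rather than a real registry API: the registry entry carries the graph hash, the schema, and the validation window, and the loader refuses to proceed if the live side disagrees:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeployContract:
    """What the registry stores alongside the weights (illustrative fields)."""
    model_id: str
    feature_graph_hash: str                  # content hash of the training-time feature graph
    feature_schema: dict[str, str]           # column name -> dtype
    validation_window: tuple[str, str]       # data window the contract was validated over

class SchemaMismatch(RuntimeError):
    pass

def load_model(contract: DeployContract, live_graph_hash: str, live_schema: dict[str, str]):
    # Reject at the boundary: refuse to load if the live feature graph or its
    # schema differs from what the model was trained against.
    if live_graph_hash != contract.feature_graph_hash:
        raise SchemaMismatch(
            f"feature graph drifted: {live_graph_hash[:8]} != {contract.feature_graph_hash[:8]}"
        )
    if live_schema != contract.feature_schema:
        raise SchemaMismatch("live feature schema differs from the training-time schema")
    ...  # only now fetch the weights and hand them to the serving layer
```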
Promotion is two questions, not one
A web model gets promoted when it beats the incumbent on a held-out metric. A trading model gets promoted when it beats the incumbent and the team can explain why. The framework should require both — a numeric report (which it generates) and a written rationale (which it stores alongside the artefact).
The second one is what stops “good number, deploy it” from being a regular occurrence. It also gives the next person to look at the model, six months later, a clue.
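One way to make the second question non-optional, sketched with hypothetical types: the promotion call simply refuses to run without both the generated report and a human-written rationale.

```python
from dataclasses import dataclass

@dataclass
class PromotionRequest:
    model_id: str
    backtest_report: dict      # generated by the framework: returns, drawdown, turnover...
    rationale: str             # written by a human: why this model beats the incumbent

def promote(req: PromotionRequest) -> None:
    # Both halves are mandatory: a numeric edge without an explanation
    # (or an explanation without numbers) does not get promoted.
    if not req.backtest_report:
        raise ValueError("promotion refused: no backtest report attached")
    if len(req.rationale.strip()) < 200:   # arbitrary floor, tune to taste
        raise ValueError("promotion refused: rationale missing or too thin")
    ...  # store the report and rationale next to the artefact, then flip the pointer
```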
Rollback is the unit of safety
In web ML, the safe move during an incident is often a slow rollback: roll percentages back, watch, decide. In trading, the safe move is instant. The registry’s promotion API is a write; the rollback API is one call away and has been load-tested. If you can’t roll back in under two minutes, you don’t have a registry, you have a database.
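A toy illustration of why the pointer-flip model matters, not a real registry API: “production” is a pointer to a version, and rollback is a single atomic swap rather than a redeploy.

```python
import time

# Illustrative registry: "production" is just a pointer to a model version.
_registry = {
    "production": "model_v42",
    "previous": "model_v41",
}

def rollback() -> float:
    """Swap production back to the previous version; return elapsed seconds."""
    start = time.monotonic()
    _registry["production"], _registry["previous"] = (
        _registry["previous"],
        _registry["production"],
    )
    return time.monotonic() - start

elapsed = rollback()
assert elapsed < 120, "if rollback takes minutes, it isn't a registry"
print(_registry["production"])  # -> model_v41
```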
Different shape, same vocabulary. Most of the platform work I care about is reading the MLOps literature and asking, “what would this mean if a wrong prediction cost the firm money?” — and rebuilding the bit that doesn’t survive contact with that question.