Evaluation & MLOps
Model Registry, Lineage, and Reproducibility
The infrastructure that answers "which dataset, code, and hyperparameters produced this checkpoint?" - and why you only miss it the first time you cannot reproduce a model.
intermediate · 7 min read
A trained model is a binary blob with no intrinsic memory of how it came to exist. Six months later, you have a production incident, and the question "what data was this trained on, what hyperparameters, what code revision?" needs to be answerable in minutes, not weeks. A model registry plus a lineage discipline is what makes that possible.
What a model registry stores
A registry is not just a file store. It tracks four things that together let you trust a checkpoint:
| Concept | What it captures |
|---|---|
| Registered model | A named container for a logical model (e.g. "support-triage-classifier") |
| Model version | An immutable checkpoint, automatically numbered |
| Lineage links | Pointers back to the training run, dataset hash, and code revision |
| Aliases and stages | Mutable references like @champion, @candidate, staging, production |
The dominant open-source option is MLflow's Model Registry - centralised store, REST APIs, lineage linking from versions back to the originating experiment run, alias-based deployment workflow. Weights and Biases Registry covers similar ground with a stronger experiment-tracking UX. Neptune, Vertex AI Model Registry, and SageMaker Model Registry are the other common picks. Whichever you pick, the discipline is more important than the tool.
The lineage problem
Reproducibility requires answering, for any checkpoint:
- Which data? Exact dataset version, including any filtering, deduplication, and tokenisation steps. A "v2" label is not enough; you need a content hash.
- Which code? Git SHA of the training repo, plus pinned versions of every dependency.
requirements.txtwith==pins, or a lockfile. - Which hyperparameters? Full config including seed, learning rate schedule, batch size, optimiser state, and any preprocessing flags.
- Which compute? Hardware, driver versions, and numerical-precision settings - bf16 vs fp16 vs fp32 can change downstream metrics non-trivially.
Miss any one of these and reproducibility collapses. The painful failure mode is silent - the same code on the same data produces a slightly different model six months later, and you do not know which axis drifted.
Dataset versioning: DVC
Git tracks code well and data poorly. DVC (Data Version Control) is the established answer: it stores data in cloud object storage (S3, GCS, Azure) and tracks pointers in git, so git checkout of an old commit also gives you the matching data revision. It also handles pipeline definitions so you can re-run the training stages in dependency order. For teams that already live in git, DVC is the lowest-friction way to make data reproducible.
Model cards: the human-readable lineage record
Mitchell et al.'s 2018 "Model Cards for Model Reporting" paper proposed short documents that accompany every model release, covering intended use, training data, evaluation results stratified by demographic and contextual slices, and known limitations. The format is now standard - Hugging Face Hub renders the README.md in any model repo as a model card with structured YAML metadata for licence, training datasets, base model lineage, and evaluation results. If you publish a model anywhere, write the model card. If you only train internally, write an internal one anyway; it is the document your auditor and your future self will ask for.
Trade-offs
- Registry discipline imposes friction. Every training run now needs config, metadata, and dataset hashes logged. Teams that skip this in week one regret it in month six.
- Registry tools assume team scale. For a solo researcher, MLflow plus DVC is overkill - a structured directory with a config.yaml and a README per checkpoint covers 80% of the value.
- Lineage breaks at the data source. If your training data comes from a live API or a constantly-updating warehouse, "reproducibility" requires snapshotting the source, not just versioning your local copy.