Model Registry, Lineage, and Reproducibility

A trained model is a binary blob with no intrinsic memory of how it came to exist. Six months later, you have a production incident, and the question "what data was this trained on, what hyperparameters, what code revision?" needs to be answerable in minutes, not weeks. A model registry plus a lineage discipline is what makes that possible.

What a model registry stores

A registry is not just a file store. It tracks four things that together let you trust a checkpoint:

Concept	What it captures
Registered model	A named container for a logical model (e.g. "support-triage-classifier")
Model version	An immutable checkpoint, automatically numbered
Lineage links	Pointers back to the training run, dataset hash, and code revision
Aliases and stages	Mutable references like `@champion`, `@candidate`, `staging`, `production`
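
To make the four concepts concrete, here is a minimal in-memory sketch. It is not the API of MLflow or any other registry; the class names, field names, URIs, and revision strings are all hypothetical. It only illustrates the split between immutable versions and mutable aliases.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelVersion:
    """An immutable checkpoint plus its lineage links (hypothetical fields)."""
    version: int
    artifact_uri: str
    run_id: str
    dataset_hash: str
    code_revision: str


@dataclass
class RegisteredModel:
    """A named container holding numbered versions and mutable aliases."""
    name: str
    versions: list = field(default_factory=list)
    aliases: dict = field(default_factory=dict)

    def register(self, artifact_uri, run_id, dataset_hash, code_revision):
        # Version numbers are assigned automatically, starting at 1.
        version = ModelVersion(len(self.versions) + 1, artifact_uri,
                               run_id, dataset_hash, code_revision)
        self.versions.append(version)
        return version

    def get_version(self, version_number):
        if not 1 <= version_number <= len(self.versions):
            raise KeyError(f"{self.name} has no version {version_number}")
        return self.versions[version_number - 1]

    def set_alias(self, alias, version_number):
        # Aliases are the only mutable part; repointing one changes what deploys.
        self.get_version(version_number)  # raises if the version is unknown
        self.aliases[alias] = version_number

    def resolve(self, alias):
        return self.get_version(self.aliases[alias])


model = RegisteredModel("support-triage-classifier")
v1 = model.register("s3://example-bucket/ckpt-1", "run-a", "sha256:aaa", "1a2b3c4")
v2 = model.register("s3://example-bucket/ckpt-2", "run-b", "sha256:bbb", "5d6e7f8")
model.set_alias("champion", v1.version)
model.set_alias("candidate", v2.version)
model.set_alias("champion", v2.version)  # promotion: the alias moves, versions do not
print(model.resolve("champion").code_revision)  # prints 5d6e7f8
```

Because `ModelVersion` is frozen, an attempt to edit a registered version raises an error, which mirrors the rule that a promotion or rollback should only ever move an alias.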

The dominant open-source option is MLflow's Model Registry, which provides a centralised store, REST APIs, lineage links from each version back to the originating experiment run, and an alias-based deployment workflow. Weights & Biases Registry covers similar ground with a stronger experiment-tracking UX. Neptune, Vertex AI Model Registry, and SageMaker Model Registry are the other common picks. Whichever you pick, the discipline is more important than the tool.

The lineage problem

Reproducibility requires answering, for any checkpoint:

Which data? Exact dataset version, including any filtering, deduplication, and tokenisation steps. A "v2" label is not enough; you need a content hash.
Which code? Git SHA of the training repo, plus pinned versions of every dependency: a requirements.txt with == pins, or a lockfile.
Which hyperparameters? Full config including seed, learning rate schedule, batch size, optimiser settings, and any preprocessing flags.
Which compute? Hardware, driver versions, and numerical-precision settings - bf16 vs fp16 vs fp32 can change downstream metrics non-trivially.

Miss any one of these and reproducibility collapses. The painful failure mode is silent - the same code on the same data produces a slightly different model six months later, and you do not know which axis drifted.
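
One way to capture the four axes at the start of a training run is to write a single lineage record next to the checkpoint. The sketch below uses only the Python standard library. The function names and record keys are assumptions, not a standard schema, and it records only basic platform details; GPU model and driver versions would need a framework-specific query that is left out here.

```python
import hashlib
import json
import platform
import subprocess
import sys
import tempfile
from pathlib import Path


def content_hash(path):
    """SHA-256 over a file, or over every file in a directory in sorted order."""
    root = Path(path)
    if root.is_file():
        files = [root]
    else:
        files = sorted(p for p in root.rglob("*") if p.is_file())
    digest = hashlib.sha256()
    for file in files:
        # Include the relative path so that renames change the hash too.
        rel = file.name if root.is_file() else file.relative_to(root).as_posix()
        digest.update(rel.encode("utf-8") + b"\0")
        with file.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def git_revision(repo_dir="."):
    """Return the current commit SHA, or None outside a git checkout."""
    try:
        result = subprocess.run(["git", "rev-parse", "HEAD"], cwd=repo_dir,
                                capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def lineage_record(data_path, config, repo_dir="."):
    """Bundle the data, code, hyperparameter, and compute axes into one dict."""
    return {
        "data_sha256": content_hash(data_path),
        "code_revision": git_revision(repo_dir),
        "config": config,  # seed, schedule, precision, and so on
        "compute": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
            "machine": platform.machine(),
        },
    }


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.jsonl").write_text('{"text": "hi", "label": 0}\n')
    record = lineage_record(tmp, {"seed": 0, "lr": 3e-4, "precision": "bf16"})
    print(json.dumps(record, indent=2, sort_keys=True))
```

Storing the record as sorted JSON makes two runs easy to diff, which is usually the fastest way to see which axis drifted.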

Dataset versioning: DVC

Git tracks code well and data poorly. DVC (Data Version Control) is the established answer: it stores data in cloud object storage (S3, GCS, Azure) and tracks small pointer files in git. Checking out an old commit restores those pointers, and running dvc checkout (or dvc pull, if the data is not yet in the local cache) then restores the matching data revision. DVC also handles pipeline definitions, so you can re-run the training stages in dependency order. For teams that already live in git, DVC is the lowest-friction way to make data reproducible.
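
The pointer idea can be illustrated with a toy content-addressed store. This is a conceptual sketch only, not DVC's on-disk format or API: `snapshot`, `restore`, and the JSON pointer layout are hypothetical, a local directory stands in for object storage, and two pointer files stand in for two git commits.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def snapshot(data_file, cache_dir, pointer_file):
    """Copy data into a content-addressed cache and write a small pointer file."""
    data_file, cache_dir = Path(data_file), Path(cache_dir)
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():
        shutil.copyfile(data_file, cached)
    # Only this pointer would be committed to git; the data stays in the cache.
    pointer = {"path": data_file.name, "sha256": digest}
    Path(pointer_file).write_text(json.dumps(pointer))
    return digest


def restore(pointer_file, cache_dir, dest_dir):
    """Materialise the data revision that a pointer file refers to."""
    pointer = json.loads(Path(pointer_file).read_text())
    dest = Path(dest_dir) / pointer["path"]
    shutil.copyfile(Path(cache_dir) / pointer["sha256"], dest)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    data = work / "train.csv"
    cache = work / "cache"  # stands in for S3, GCS, or Azure storage
    data.write_text("text,label\nhello,0\n")
    snapshot(data, cache, work / "train.v1.json")
    data.write_text("text,label\nhello,0\nbye,1\n")
    snapshot(data, cache, work / "train.v2.json")
    out = work / "restored"
    out.mkdir()
    restored = restore(work / "train.v1.json", cache, out)
    print(restored.read_text())
```

The pointer file is a few dozen bytes regardless of dataset size, which is why it can live comfortably in git while the data itself does not.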

Model cards: the human-readable lineage record

Mitchell et al.'s 2018 "Model Cards for Model Reporting" paper proposed short documents that accompany every model release, covering intended use, training data, evaluation results stratified by demographic and contextual slices, and known limitations. The format is now standard - Hugging Face Hub renders the README.md in any model repo as a model card with structured YAML metadata for licence, training datasets, base model lineage, and evaluation results. If you publish a model anywhere, write the model card. If you only train internally, write an internal one anyway; it is the document your auditor and your future self will ask for.
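
As a rough sketch of what such a card looks like on disk, the helper below renders a README.md with a YAML metadata block followed by Markdown sections. The `license`, `datasets`, and `base_model` keys follow Hugging Face's model card metadata; the function name, the dataset identifier, and all values are illustrative assumptions, and the YAML writer is deliberately naive (it does no quoting or escaping).

```python
def render_model_card(metadata, sections):
    """Render YAML front matter followed by Markdown sections."""
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    for title, body in sections.items():
        lines += ["", f"## {title}", "", body.strip()]
    return "\n".join(lines) + "\n"


card = render_model_card(
    metadata={
        "license": "apache-2.0",
        "datasets": ["example-org/support-tickets"],  # hypothetical dataset id
        "base_model": "distilbert-base-uncased",      # illustrative choice
    },
    sections={
        "Intended use": "Routing inbound support tickets to the right queue.",
        "Training data": "Describe the source, filtering, and content hash here.",
        "Evaluation": "Report metrics per slice, not only in aggregate.",
        "Limitations": "List known failure modes and out-of-scope uses.",
    },
)
print(card)
```

Generating the card from the same config and lineage data that the training run logs keeps the human-readable record from drifting away from the machine-readable one.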

Trade-offs

Registry discipline imposes friction. Every training run now needs config, metadata, and dataset hashes logged. Teams that skip this in week one regret it in month six.
Registry tools assume team scale. For a solo researcher, MLflow plus DVC is overkill - a structured directory with a config.yaml and a README per checkpoint covers most of the value.
Lineage breaks at the data source. If your training data comes from a live API or a constantly updating warehouse, reproducibility requires snapshotting the source, not just versioning your local copy.
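
For the solo-researcher case mentioned above, the structured directory can be as simple as the following sketch. The function name, directory layout, and config keys are assumptions; values are written with `json.dumps` so that a flat config stays valid YAML without a third-party library.

```python
import json
import tempfile
from pathlib import Path


def write_checkpoint_dir(root, name, config, notes):
    """Create <root>/<name>/ holding config.yaml and README.md for one checkpoint."""
    ckpt_dir = Path(root) / name
    ckpt_dir.mkdir(parents=True, exist_ok=False)  # refuse to overwrite a past run
    # JSON scalars are valid YAML, so a flat dict can be written line by line.
    yaml_lines = [f"{key}: {json.dumps(value)}" for key, value in sorted(config.items())]
    (ckpt_dir / "config.yaml").write_text("\n".join(yaml_lines) + "\n")
    (ckpt_dir / "README.md").write_text(f"# {name}\n\n{notes.strip()}\n")
    return ckpt_dir


with tempfile.TemporaryDirectory() as tmp:
    ckpt = write_checkpoint_dir(
        tmp,
        "run-001",
        {"seed": 0, "batch_size": 32, "precision": "bf16"},
        "Baseline run. Data hash and git SHA go here.",
    )
    print((ckpt / "config.yaml").read_text())
```

Failing when the directory already exists is a deliberate choice: it keeps a rerun from silently replacing the record of an earlier checkpoint.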
