Pitfalls in ML-for-Science

In 2022 Sayash Kapoor and Arvind Narayanan went looking for a pattern behind a string of retracted or unreplicable machine-learning results across the sciences. They found one, and it had a name: data leakage. Their survey documented reproducibility failures caused by leakage in 17 fields at first count, affecting hundreds of papers; the running tally on their project page now spans roughly 30 fields and 648 affected papers, from medicine and neuroimaging to law, ecology, and dermatology. The unifying thread is not fraud and rarely incompetence. It is a set of subtle methodological errors that inflate reported performance, survive peer review, and only surface when someone tries to reproduce the result on genuinely held-out data. This concept is the skeptic's checklist for reading an ML-for-science headline, including the AlphaFold- and GNoME-style claims in the neighbouring concepts.

The taxonomy of data leakage

Leakage is any way that information from the evaluation set influences the model, so the reported score measures memorisation or contamination rather than the ability to predict something genuinely new. Kapoor and Narayanan enumerate eight distinct types; the ones worth internalising cluster into a few families.

No held-out test set at all. The model is evaluated on the same data it trained on, or hyperparameters are tuned on the test set until the number looks good. Reported accuracy then measures fit, not generalisation. This still appears in published work.
Pre-processing on the full dataset. Imputation, feature scaling, feature selection, or oversampling (SMOTE) computed over train and test together. The test set has now leaked its statistics into the pipeline before the split. The fix is to fit every transform on the training fold only and then apply it to the test fold, inside the cross-validation loop, never before it. Resampling steps such as SMOTE should touch the training fold alone, so the test fold keeps its original class balance.
Temporal leakage. Training on data that postdates the prediction target. A model that "predicts" a 2020 outcome using features that were only measurable in 2021 will look brilliant and be useless. Any forecasting task needs a time-respecting split, not a random one.
Non-independence between train and test (duplicates and groups). Rows that are not independent (multiple scans from the same patient, several sentences from the same document, augmented copies of one image) get scattered across the split. The model recognises the individual, not the phenomenon. Grouped splits (split by patient, by subject, by site) are the fix.
Feature leakage (proxies for the label). A feature that encodes the answer. The textbook case: a diagnosis dataset where the presence of a treatment code perfectly predicts the disease, because you only get the treatment once diagnosed. The model learns the shortcut and collapses in deployment where the proxy is absent.
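
The pre-processing item above is easy to demonstrate on data with no signal at all. The sketch below is a minimal illustration, assuming scikit-learn is available; the dataset is synthetic noise and the sizes (100 rows, 2000 features, 20 selected) are arbitrary choices, not figures from the survey. It selects features once on the full dataset, then again inside each training fold via a pipeline, and compares the cross-validated accuracy.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))              # pure noise: no feature carries signal
y = rng.permutation(np.repeat([0, 1], 50))    # balanced labels, unrelated to X

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: feature selection sees every row, including rows later used for testing.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_selected, y, cv=cv).mean()

# Honest: selection is part of the pipeline, so it is refit on each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"selection before the split: {leaky:.2f}")
print(f"selection inside the loop:  {honest:.2f}")
```

Because the labels are independent of the features, any accuracy well above chance in the first number is produced by the leak alone. The second number should sit near 0.5.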

The pattern across all eight is the same: the evaluation is easier than the real task, so the score is an overestimate of real-world skill.
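
For the temporal and non-independence cases, the fix is mostly a matter of choosing the right splitter. The following is a small sketch, assuming scikit-learn; the twelve "scans" and four patient IDs are hypothetical, and the rows are assumed to be sorted by acquisition time. It checks that a grouped split never shares a patient across train and test, and that a time-respecting split never trains on rows later than the ones it tests on.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Hypothetical data: 12 scans from 4 patients, rows already sorted by time.
patient_id = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
X = np.arange(12).reshape(-1, 1)

# Grouped split: every scan from a given patient lands on one side only.
shared_patients = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, groups=patient_id):
    overlap = set(patient_id[train_idx]) & set(patient_id[test_idx])
    shared_patients.append(len(overlap))

# Time-respecting split: all training rows precede all test rows.
train_before_test = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    train_before_test.append(train_idx.max() < test_idx.min())

print("patients shared per fold:", shared_patients)
print("train precedes test:", train_before_test)
```

A random `KFold` on the same rows would generally fail both checks, which is the point: the split has to encode what "new" means for the task.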

Benchmark performance is not a scientific claim

A number on a benchmark answers a narrow question: on this dataset, split this way, this model scored X. A scientific claim is broader: this method predicts the phenomenon in the world. The gap between the two is where most ML-for-science overreach lives.

The clearest demonstration in the Kapoor and Narayanan work is their civil-war-prediction case study. Several papers had reported that complex ML models substantially beat classical logistic regression at forecasting armed conflict. When the authors corrected the leakage in each, the complex-model advantage evaporated; logistic regression was competitive or better. The headline ("ML predicts civil war") was real as a benchmark number and false as a scientific claim, and the difference was entirely leakage.

Two habits protect you here. First, always ask what a simple, honest baseline scores. If a linear model or a majority-class predictor is within a point or two of the deep model, the deep model is not the story. Second, separate the predictive claim ("this classifier is accurate") from the causal or mechanistic claim ("this feature drives the outcome"). Predictive accuracy, even when real, licenses almost nothing about mechanism.
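
The first habit can be made routine by scoring the baselines in the same loop as the headline model. The sketch below assumes scikit-learn and uses a synthetic, imbalanced dataset from `make_classification`; the model choices and the 80/20 class weighting are illustrative assumptions, not a reconstruction of any study discussed here.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset, with roughly 80% of rows in one class.
X, y = make_classification(
    n_samples=500, n_features=20, weights=[0.8, 0.2], random_state=0
)

models = {
    "majority_class": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Same folds and same metric for every model, so the numbers are comparable.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name:20s} {score:.3f}")
```

On imbalanced data the majority-class row alone is often sobering: an accuracy near 0.8 here requires no learning at all, so the complex model's margin over that row, not its raw score, is the number worth reporting.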

Distribution shift: from curated benchmark to deployment

Even a leakage-free benchmark can mislead, because the benchmark distribution is curated and the deployment distribution is not. Medical imaging is the canonical example: a pneumonia detector trained on scans from one hospital learned to read the scanner model and portable-versus-fixed-machine metadata baked into the image, both correlated with sickness at that site, and its accuracy dropped sharply at a new hospital. Nothing leaked in the technical sense; the training and test sets simply shared a spurious correlation the deployment world did not.

The general failure is that models exploit whatever correlation minimises training loss, including shortcuts that are artefacts of how the benchmark was assembled. A benchmark score is a claim about the benchmark's distribution. Extending it to deployment requires external validation on data from a genuinely different source, ideally a different site, cohort, or time period, collected by people who were not optimising against it.
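
The shortcut mechanism can be reproduced in a toy setting. The sketch below is a synthetic illustration, assuming NumPy and scikit-learn; `make_site`, the 0.95 and 0.50 agreement rates, and the noise scale are all hypothetical choices and do not model any real hospital data. Each site has one weak genuine signal and one binary "shortcut" flag (think of a scanner-type marker) that tracks the label at the training site but not at the external one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def make_site(n, shortcut_agreement, rng):
    """Simulate one site: a weak real signal plus a flag that matches the
    label with probability `shortcut_agreement`."""
    y = rng.integers(0, 2, size=n)
    signal = y + rng.normal(scale=2.0, size=n)            # weak genuine signal
    agrees = rng.random(n) < shortcut_agreement
    shortcut = np.where(agrees, y, 1 - y).astype(float)   # site artefact
    return np.column_stack([signal, shortcut]), y

rng = np.random.default_rng(0)
X_a, y_a = make_site(2000, 0.95, rng)   # training site: flag tracks the label
X_b, y_b = make_site(2000, 0.50, rng)   # external site: flag is uninformative

# Internal validation: a clean held-out split from the same site.
X_tr, X_te, y_tr, y_te = train_test_split(X_a, y_a, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

internal = model.score(X_te, y_te)
external = model.score(X_b, y_b)
print(f"internal held-out accuracy: {internal:.2f}")
print(f"external-site accuracy:     {external:.2f}")
```

The internal split is leakage-free by construction, yet it cannot reveal the problem, because train and test share the same artefact. Only the second site does.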

Reporting standards: the REFORMS response

If leakage is subtle enough to pass review, the remedy is not sharper reviewers but better disclosure. Kapoor and Narayanan proposed model info sheets: a short structured form where authors state, per model, how the train/test split was made, what pre-processing touched which data, how features were selected, and how the reported performance connects to the scientific claim. The sheets cannot stop a false claim, but they make the leakage-prone decisions visible in one place instead of scattered across a methods section.
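
One way to picture such a sheet is as a structured record with one field per disclosure, which makes an unanswered question visible. The sketch below is only an illustration in plain Python: the class name, field names, and example answers are hypothetical and do not reproduce the published template, which is a longer questionnaire.

```python
from dataclasses import dataclass, fields

@dataclass
class ModelInfoSheet:
    """Hypothetical per-model record mirroring the four disclosures above."""
    split_procedure: str = ""       # how the train/test split was made
    preprocessing_scope: str = ""   # which pre-processing touched which data
    feature_selection: str = ""     # how features were selected
    claim_link: str = ""            # how the score supports the scientific claim

    def unanswered(self) -> list[str]:
        # A blank field marks a leakage-prone decision the authors left undisclosed.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

sheet = ModelInfoSheet(
    split_procedure="Grouped 5-fold cross-validation, split by patient ID",
    preprocessing_scope="Scaler fit on training folds only",
)
missing = sheet.unanswered()
print("still undisclosed:", missing)
```

The value is in the empty fields: a reader or referee can see at a glance which decisions were never described.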

That idea was generalised by a 19-author consortium into REFORMS (Reporting Standards for Machine Learning Based Science), a 32-item checklist plus paired guidelines published in Science Advances. It covers study design, data, modelling, evaluation, and the claims drawn from results, and it is written for three audiences at once: authors designing a study, referees reviewing one, and journals setting policy. Reading a paper against the checklist, and noting which items are actually addressed rather than merely mentioned, is a fast way to locate the weak joint in an ML-for-science result.

When it falls down

This concept is itself the failure-mode catalogue, so the interesting cases are the ones the checklist above does not cleanly catch.

Leakage that survives peer review. The 648-paper tally is the count that was caught. Reviewers rarely rerun the pipeline; they read prose. Leakage lives in code and data handling, not in the text, so a clean-sounding methods section is weak evidence. The only strong evidence is a reproduction on independently collected data, which almost never happens before publication.
Benchmark saturation and contamination. Once a benchmark is public and heavily optimised against, high scores stop measuring capability and start measuring familiarity with the benchmark. For anything scraped from the web, including the corpora that train large models, the test set may already be in the training data. A near-perfect score on a well-known benchmark is now a reason for suspicion, not celebration.
The incentive to publish the optimistic split. A researcher who tries several splits, several feature sets, and several models, then reports the best, has p-hacked without touching a p-value. The published number is the maximum over many silent attempts, not an unbiased estimate. Nothing in a single paper reveals this; only pre-registration, held-out final test sets touched once, and independent replication do.
Correct predictions, wrong conclusion. The hardest case is a model that genuinely predicts well and is then over-interpreted as revealing mechanism or causation. The accuracy is real; the scientific claim built on it is not. No amount of leakage-checking fixes a category error about what prediction licenses.
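
The optimistic-split item above is a statement about the maximum of several noisy estimates, and a short simulation shows its size. The sketch below assumes NumPy; the test-set size, the number of attempts, and the number of simulated studies are arbitrary. Every attempt is a coin-flip classifier with true accuracy 0.5, and attempts are treated as independent, which is a simplifying assumption: real attempts on the same data are correlated, so the inflation in practice would typically be smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test = 100       # held-out examples per evaluation
n_attempts = 50    # splits x feature sets x models tried (hypothetical)
n_studies = 2000   # simulated research projects

# Each attempt has no real skill: correct predictions ~ Binomial(n_test, 0.5).
acc = rng.binomial(n_test, 0.5, size=(n_studies, n_attempts)) / n_test

single_report = acc[:, 0].mean()       # always report the first attempt
best_report = acc.max(axis=1).mean()   # always report the best attempt

print(f"mean accuracy, first attempt reported: {single_report:.3f}")
print(f"mean accuracy, best attempt reported:  {best_report:.3f}")
```

The first figure is an unbiased estimate of chance performance; the second is what ends up in the paper when only the best run is written up. A final test set touched once removes the gap, because there is then only one attempt to take the maximum over.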
