Diversity Metrics and Collapse Detection

When a fine-tuned Llama model generates 10,000 synthetic instructions and every tenth one starts with "Certainly! Here's how to...", you have a problem that no loss curve will tell you about. Diversity collapse in synthetic data is an invisible failure: the model trains, the accuracy metrics look fine, and only weeks later does someone notice that the deployed model sounds like it wandered into a conversational rut and never left.

Measuring and monitoring diversity is therefore not a quality-of-life concern - it is a precondition for safe use of any synthetic pipeline.

What Diversity Actually Measures

Diversity in a dataset is not the same as randomness. A dataset of 1,000 completely random strings would score high on entropy-based metrics but be useless. What we want is coverage of the semantic, stylistic, and structural space that the model should handle.

Three distinct axes matter:

Axis	What collapses first	Symptom
Lexical	Vocabulary and surface patterns	Repeated phrases, formulaic openings
Semantic	Topic and intent distribution	Over-representation of a few instruction types
Structural	Response length, format, complexity	All answers converging to the same template

Metrics must address each axis differently.

Distinct-N and Self-BLEU

The simplest family of lexical metrics counts novel n-grams. Distinct-N (for n = 1, 2, 3) computes the fraction of unique n-grams in the corpus:

distinct_n = |unique_n_grams| / |total_n_grams|

A corpus where every response begins with the same ten tokens will have a depressed distinct-2 score even if the remainder is varied. Self-BLEU inverts this: compute the BLEU score of each sample against all other samples in the corpus, then average. High self-BLEU means the corpus is paraphrasing itself.

These metrics are fast but shallow. A model that varies its opening tokens while repeating the same underlying reasoning pattern will fool them.
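As a rough illustration of the distinct-N formula above, the following sketch computes corpus-level distinct-n in plain Python. The whitespace tokenisation and the two toy corpora are assumptions made for the example; a real pipeline would use the generator's own tokenizer. Self-BLEU is not sketched here because it depends on an external BLEU implementation.

```python
from collections import Counter
from typing import Iterable, List


def distinct_n(texts: Iterable[str], n: int) -> float:
    """Fraction of n-grams in the corpus that are unique (corpus-level distinct-n)."""
    counts: Counter = Counter()
    for text in texts:
        tokens: List[str] = text.lower().split()  # assumption: whitespace tokens
        # slide a window of width n over the tokens; short texts contribute nothing
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Hypothetical toy corpora: one with a shared opening, one without.
formulaic = [
    "certainly here is how to sort a list",
    "certainly here is how to parse a file",
    "certainly here is how to open a socket",
]
varied = [
    "sort the list with the built-in function",
    "parsing starts by reading each line",
    "open a socket before sending bytes",
]
```

On these toy inputs the shared opening alone pulls distinct-2 down to 13/21 for the formulaic corpus, while the varied corpus, which repeats no bigram, scores 1.0.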

The Vendi Score

Friedman and Dieng (2022) introduced the Vendi Score as a principled diversity metric grounded in information theory. Given a set of \(n\) samples and a user-defined positive semidefinite similarity function \(k\) with \(k(x, x) = 1\), construct the normalised kernel matrix \(K\) where \(K_{ij} = k(x_i, x_j) / n\), so that its eigenvalues are non-negative and sum to 1. The Vendi Score is:

\[\text{VS}(S) = \exp\!\left(-\sum_i \lambda_i \log \lambda_i\right) = \exp(H(\lambda))\]

where \(\lambda_1, \ldots, \lambda_n\) are the eigenvalues of \(K\), with the convention \(0 \log 0 = 0\). This is the exponential of the Shannon entropy of the eigenvalue spectrum.

The score has a clean intuition: if all samples are identical, \(K\) has rank 1, a single eigenvalue equals 1, entropy is 0, and Vendi Score equals 1. If all samples are maximally dissimilar, eigenvalues are uniform, entropy is maximised, and the score equals \(n\). The similarity function can be anything - cosine similarity over sentence embeddings for semantic diversity, Tanimoto coefficient for molecular data, or pixel-space similarity for images.

Crucially, the Vendi Score requires no reference distribution. This matters for synthetic pipelines where you do not always have a ground-truth distribution to compare against.
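A minimal sketch of the definition above, assuming cosine similarity over embeddings as the similarity function and NumPy for the eigendecomposition. The function name and the 1e-12 cut-off for numerically zero eigenvalues are choices made for this example, not part of the published method.

```python
import numpy as np


def vendi_score(embeddings: np.ndarray) -> float:
    """Vendi Score under a cosine-similarity kernel.

    embeddings: (n_samples, dim) array with no all-zero rows.
    """
    # unit-normalise rows so that x @ x.T is the cosine-similarity matrix
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    k = (x @ x.T) / x.shape[0]          # K_ij = k(x_i, x_j) / n, so trace(K) = 1
    eigvals = np.linalg.eigvalsh(k)     # K is symmetric, so eigvalsh applies
    eigvals = eigvals[eigvals > 1e-12]  # drop zeros and round-off negatives (0 log 0 = 0)
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))
```

The two limiting cases from the text make a quick sanity check: five identical rows give a score of 1, and four mutually orthogonal rows give a score of 4.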

The Diversity Coefficient

Miranda et al. (2023) proposed the diversity coefficient, defined as the expected distance between Task2Vec embeddings of batches sampled from a corpus. It measures variability in the data itself rather than in any model's outputs. In interventional experiments across 44 GPT-2 and LLaMAv2 models, the authors report that training on data with a higher diversity coefficient leads to better evaluation performance, which they present as evidence that diversity in the formal sense causally improves performance rather than merely correlating with it.

Collapse Detection in Practice

Monitoring diversity is straightforward if you instrument the pipeline. A minimal setup tracks three statistics at the end of every synthetic generation batch:

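import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def diversity_stats(embeddings: np.ndarray) -> dict:
    """
    embeddings: (n_samples, dim) sentence-transformer outputs, n_samples >= 2
    Returns mean pairwise cosine similarity, mean nearest-neighbour
    similarity, and an entropy-based effective rank of the embedding matrix.
    """
    n = embeddings.shape[0]
    sim = cosine_similarity(embeddings)
    # average over the n * (n - 1) off-diagonal pairs only, so that
    # self-similarity does not bias the mean
    off_diag = ~np.eye(n, dtype=bool)
    mean_sim = sim[off_diag].mean()
    # nearest-neighbour similarity (highest non-self similarity per row);
    # -inf on the diagonal keeps a sample from matching itself
    np.fill_diagonal(sim, -np.inf)
    nn_sim = sim.max(axis=1).mean()
    # effective rank: exp of the entropy of the normalised singular values.
    # np.linalg.matrix_rank with a small tolerance reports full rank for
    # almost any real embedding matrix, so it cannot register collapse.
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    eff_rank = np.exp(-np.sum(p * np.log(p)))
    return {
        "mean_pairwise_similarity": float(mean_sim),
        "mean_nn_similarity": float(nn_sim),
        "effective_rank": float(eff_rank),
    }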

Track these across generations. A rising mean_pairwise_similarity is a collapse signal: the corpus as a whole is contracting. A rising mean_nn_similarity points to near-duplicates even when the global average still looks healthy. A falling effective_rank of the embedding matrix is a third signal: if your 10,000 instruction embeddings span only 40 effective dimensions where earlier generations spanned 150+, the generator has implicitly converged to a low-dimensional manifold.

The Self-Consuming Loop Problem

Alemohammad et al. (2023) formalised why collapse compounds across generations. In their Model Autophagy Disorder (MAD) framework, when the generation-\(t+1\) model is trained on synthetic data from generation \(t\) without enough fresh real data, quality (precision) or diversity (recall) degrades progressively. This is not a hypothesis - they demonstrated it analytically and empirically across multiple generative architectures.

The intuition is straightforward: a generative model cannot perfectly cover the tails of the distribution it was trained on. Its synthetic outputs therefore have slightly narrowed support. Train the next model on those outputs, and the tails narrow again. After enough generations, the model captures little beyond the modes of the original distribution.

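The cure identified by Gerstgrasser et al. (2024) is data accumulation rather than data replacement: keep the original real data and every prior generation of synthetic data, and add each new generation to the pool. Under this regime, they prove for a linear-regression setting that the test error has a finite upper bound independent of the number of model-fitting iterations, and they report the same pattern in experiments on deep generative models. Replacement converges toward collapse; accumulation does not.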
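The contrast between replacement and accumulation can be seen in a toy simulation. The sketch below is not the experimental setup of either paper: it assumes a one-dimensional Gaussian as the "model", fits it by sample mean and standard deviation, and uses a hypothetical function name and arbitrary sizes (10 samples per generation, 300 generations).

```python
import random
import statistics


def final_std(generations: int, n: int, accumulate: bool, seed: int = 0) -> float:
    """Std of the training pool after repeatedly fitting and resampling a Gaussian."""
    rng = random.Random(seed)
    pool = [rng.gauss(0.0, 1.0) for _ in range(n)]  # generation 0: the "real" data
    for _ in range(generations):
        # "train": fit a Gaussian to whatever is currently in the pool
        mu, sigma = statistics.fmean(pool), statistics.stdev(pool)
        # "generate": draw n synthetic samples from the fitted model
        synthetic = [rng.gauss(mu, sigma) for _ in range(n)]
        # accumulate keeps everything seen so far; replace keeps only the new batch
        pool = pool + synthetic if accumulate else synthetic
    return statistics.stdev(pool)


# Example usage:
#   final_std(300, 10, accumulate=False)  # replacement regime
#   final_std(300, 10, accumulate=True)   # accumulation regime
```

Under replacement, each refit loses a little spread in expectation and the losses compound, so the pool's standard deviation shrinks toward zero. Under accumulation, each new batch is a shrinking fraction of the pool, so the standard deviation stays close to that of the original data.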

Operational Monitoring Checklist

Putting this together, a practical diversity monitoring protocol for a synthetic-data pipeline should include:

Per-batch Vendi Score over sentence-embedding similarity. Alert if it drops below a threshold calibrated on a reference human-written corpus.
Self-BLEU at bigram and trigram level. A jump of more than 15% from baseline warrants investigation.
Embedding effective rank. Track the effective rank of the embedding matrix for a 2,000-sample random draw from each generation's outputs.
Topic cluster entropy. Cluster embeddings with k-means (k=50 is reasonable for a 10k dataset) and compute the entropy over cluster assignments. A near-uniform distribution scores high; mass concentrating in a few dominant clusters indicates topic collapse.
Retention of real data. Never replace; always accumulate. Flag any pipeline step that discards prior real data without explicit justification.
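For the topic cluster entropy item in the list above, a minimal sketch follows. It assumes cluster labels have already been produced elsewhere (for example by k-means over the embeddings), and it divides by log k so that the result lies in [0, 1]; that normalisation is a convenience added here, not something the checklist prescribes.

```python
import math
from collections import Counter
from typing import Sequence


def cluster_entropy(labels: Sequence[int], k: int) -> float:
    """Shannon entropy of cluster assignments, normalised by log(k) to [0, 1]."""
    if k < 2 or not labels:
        return 0.0
    n = len(labels)
    counts = Counter(labels)  # clusters with no members contribute 0 to the sum
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return abs(h) / math.log(k)  # abs() only turns a possible -0.0 into 0.0
```

A value near 1 means assignments are spread evenly over the k clusters; a value near 0 means almost everything landed in one cluster.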

The thresholds are pipeline-specific. Calibrate them on a human-authored reference set from the same domain, and treat deviations of 2 standard deviations from the reference mean as soft alerts and deviations of 3 standard deviations as hard stops.
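The soft and hard thresholds can be sketched as a z-score check. The example assumes the reference statistics come from computing the same metric on several samples of the human-authored set, and it treats deviations in either direction as alerts; both are assumptions for illustration, and the function name is hypothetical.

```python
import statistics
from typing import Sequence


def alert_level(value: float, reference: Sequence[float]) -> str:
    """Classify a batch metric against reference values: 'ok', 'soft', or 'hard'."""
    mean = statistics.fmean(reference)
    std = statistics.stdev(reference)  # needs at least two reference values
    if std == 0:
        raise ValueError("reference values are constant; cannot form a z-score")
    z = abs(value - mean) / std        # two-sided: drops and jumps both count
    if z >= 3:
        return "hard"
    if z >= 2:
        return "soft"
    return "ok"
```

For a metric where only one direction signals collapse, such as a falling Vendi Score, a one-sided variant that drops the absolute value would avoid alerting on improvements.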

When It Falls Down

Fast models on slow distributions. If your base distribution changes - domain shift in the real world, a new coding paradigm, a different user demographic - diversity metrics calibrated on old data become misleading. A highly diverse synthetic corpus in the old distribution may be a collapsed corpus relative to the new one.

Vendi Score is expensive at scale. Computing the full eigenvalue decomposition of an \(n \times n\) kernel matrix is \(O(n^3)\) in time and \(O(n^2)\) in memory. For datasets beyond a few thousand samples, use a subsample or an approximation such as the Nyström method. At a million samples the kernel matrix alone has \(10^{12}\) entries, so the exact computation is infeasible without approximation.
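One further shortcut, separate from the subsampling and Nyström options named above, applies when the similarity is a plain inner product of embeddings, as with cosine similarity on unit-normalised vectors. The \(n \times n\) matrix \(XX^\top / n\) and the \(d \times d\) matrix \(X^\top X / n\) share the same non-zero eigenvalues, so the score can be computed in \(O(nd^2 + d^3)\). The sketch below assumes NumPy and a hypothetical function name; it does not help with arbitrary kernels.

```python
import numpy as np


def vendi_score_linear(embeddings: np.ndarray) -> float:
    """Vendi Score for a cosine kernel, computed from the (dim x dim) matrix X^T X / n.

    embeddings: (n_samples, dim) array with no all-zero rows.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cov = (x.T @ x) / x.shape[0]        # same non-zero eigenvalues as (x @ x.T) / n
    eigvals = np.linalg.eigvalsh(cov)   # symmetric matrix, so eigvalsh applies
    eigvals = eigvals[eigvals > 1e-12]  # drop zeros and round-off negatives
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))
```

With 384-dimensional sentence embeddings this replaces an eigendecomposition of size \(n\) with one of size 384, whatever the number of samples. The score is also capped at \(d\) in this setting, since at most \(d\) eigenvalues are non-zero.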

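Distinct-N is sensitive to length. The number of unique n-grams saturates as text accumulates while the denominator keeps growing, so longer outputs mechanically score lower on distinct-N regardless of their semantic range, and a generator that drifts toward shorter outputs can look more diverse than it is. Compare corpora at a fixed token budget, or prefer embedding-based metrics when generation length is uncontrolled.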

Effective rank conflates compression with collapse. A tight, high-quality corpus of similar-length structured instructions may naturally have low embedding rank without any pathological collapse. Always pair rank with semantic cluster entropy before sounding an alarm.

Diversity and quality trade off. A diverse corpus is not necessarily a good one. Increasing sampling temperature increases lexical variety but also increases incoherence. Filtering for quality reduces the corpus and can inadvertently reduce diversity. The two objectives must be optimised jointly, not sequentially.