Multilingual Data Balancing

English accounts for roughly 46% of Common Crawl by byte count. Swahili accounts for under 0.01%. If you train a multilingual model on those raw proportions, the model learns English well and Swahili barely at all. Multilingual data balancing is the set of decisions that sits between "dump Common Crawl" and "train a model that is genuinely useful in 100 languages."

Why Raw Proportions Fail

A web crawl is not a neutral sample of human language. English dominates because English speakers produced more digital text earlier. High-resource languages (English, Chinese, Russian, Spanish) have orders of magnitude more data than low-resource ones (Yoruba, Burmese, most indigenous languages). Training on raw proportions produces two interrelated problems:

Data starvation in low-resource languages. The model never sees enough Swahili sentences to learn its morphology, so responses to Swahili prompts either hallucinate or switch to English mid-response.
Forgetting pressure for low-resource languages. Parameters shared across languages are pulled predominantly by gradients from majority-language examples. Low-resource languages' representations are gradually overwritten.

Conneau et al. (XLM-R, 2019) studied this empirically across 100 languages and found a clear trade-off between positive cross-lingual transfer and capacity dilution. Adding languages initially helps low-resource languages through transfer, but once too many languages compete for a fixed model capacity, per-language performance degrades overall.

Temperature-Based Upsampling

The most widely adopted balancing heuristic is temperature sampling, introduced for multilingual masked-language models and inherited by later LLMs. You define a sampling probability for language \(l\) as:

\[q_l = \frac{p_l^{\,1/T}}{\sum_{l'} p_{l'}^{\,1/T}}\]

where \(p_l\) is the raw proportion of language \(l\) in the corpus and \(T\) is the temperature hyperparameter.

At \(T = 1\) you recover the unmodified distribution.
As \(T \to \infty\) the distribution flattens toward uniform (\(q_l \to 1/L\) for \(L\) languages).
In practice \(T = 2\) to \(T = 5\) is common. mT5, for example, sampled its 101 languages with an exponent \(\alpha = 0.3\), which corresponds to \(T = 1/\alpha \approx 3.3\) in this notation.

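The practical effect: a language with a 0.01% raw share can rise to the order of 1% at \(T = 5\), roughly a 100x increase, although the exact figure depends on how the other languages are distributed because of the normalising sum. No new text is created, so tokens from that language are repeated; the model sees them more often, but the repeated examples are no longer independent samples.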

import numpy as np

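def temperature_sample(raw_counts: dict[str, int], T: float) -> dict[str, float]:
    """Map raw per-language counts to temperature-scaled sampling probabilities."""
    # T = 1 returns the raw proportions; larger T flattens toward uniform.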
    langs = list(raw_counts)
    counts = np.array([raw_counts[l] for l in langs], dtype=float)
    probs = counts / counts.sum()
    tempered = probs ** (1.0 / T)
    tempered /= tempered.sum()
    return dict(zip(langs, tempered))

# Example: 3 languages, huge English imbalance
counts = {"en": 1_000_000, "sw": 1_000, "yo": 200}
print(temperature_sample(counts, T=1.0))  # en ≈ 0.999
print(temperature_sample(counts, T=5.0))  # en ≈ 0.70, sw ≈ 0.18, yo ≈ 0.13

Upsampled tokens are drawn by repeating the existing corpus, not by generating new text, so the effective vocabulary and style are bounded by what was collected.
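
To make the repetition concrete, the sketch below estimates how many passes over each language's data a given training budget implies. The function name `epochs_per_language` and the `budget_tokens` parameter are illustrative assumptions, not part of any particular training framework, and the counts are the same toy numbers used above.

```python
def epochs_per_language(token_counts, T, budget_tokens):
    """Estimate how many passes over each language's data a token budget implies."""
    total = sum(token_counts.values())
    # Temperature-scaled weights p_l ** (1/T); normalising them gives q_l.
    weights = {lang: (c / total) ** (1.0 / T) for lang, c in token_counts.items()}
    z = sum(weights.values())
    # Tokens drawn from a language = q_l * budget; divide by what is available.
    return {
        lang: (w / z) * budget_tokens / token_counts[lang]
        for lang, w in weights.items()
    }

# Toy counts (assumed, for illustration only).
counts = {"en": 1_000_000, "sw": 1_000, "yo": 200}
epochs = epochs_per_language(counts, T=5.0, budget_tokens=1_000_000)
for lang, e in epochs.items():
    print(f"{lang}: {e:.1f} passes over the available data")
```

With these toy numbers English is seen less than once, while the two small languages are repeated hundreds of times, which is the regime where memorisation becomes a concern.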

Mixture Strategies Beyond Temperature

Temperature sampling is not the only mechanism. Several complementary strategies appear in practice:

Strategy	Mechanism	When to use
Temperature sampling	Rescale raw proportions by \(p^{1/T}\)	General-purpose multilingual pretraining
Hard caps	Never exceed \(N\) tokens per language	Keep dominant languages from swamping the mixture, even at low \(T\)
Translation augmentation	Translate high-resource text into a low-resource target language	Language has almost no native web text
Corpus sourcing	Add curated books, Wikipedia, Bible translations	Low-resource languages under-represented in crawls
Two-stage training	Broad multilingual base, then language-specific continued pretraining	Downstream tasks concentrated in specific languages
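
As a rough sketch of how the first two rows can be combined, the function below clips each language's count at a cap before applying temperature scaling. The name `capped_temperature_mix` and the `cap` parameter are hypothetical, and real pipelines may apply the cap at a different stage.

```python
def capped_temperature_mix(token_counts, T, cap):
    """Clip per-language counts at `cap`, then apply temperature scaling."""
    # Hard cap: no language contributes more than `cap` tokens to the proportions.
    capped = {lang: min(c, cap) for lang, c in token_counts.items()}
    total = sum(capped.values())
    # Temperature step on the capped proportions.
    weights = {lang: (c / total) ** (1.0 / T) for lang, c in capped.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Toy counts (assumed, for illustration only).
counts = {"en": 1_000_000, "sw": 1_000, "yo": 200}
uncapped = capped_temperature_mix(counts, T=2.0, cap=float("inf"))
capped = capped_temperature_mix(counts, T=2.0, cap=10_000)
print(f"en share uncapped: {uncapped['en']:.2f}, capped: {capped['en']:.2f}")
```

The cap reduces the dominant language's share without requiring a higher temperature, so the smaller languages are not flattened toward each other any more than they already were.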

BLOOM's ROOTS corpus (2022) combined hard caps with curated sources: 46 natural languages were included with deliberate over-representation of African languages, using Wikipedia dumps, legal documents, and partnered datasets rather than relying on Common Crawl alone for those languages.

Translation augmentation is attractive but carries a caveat: machine-translated text can introduce translationese artefacts (simpler syntax, calque structures) that the model then learns and reproduces. Filtering translated sentences out of evaluation sets matters if you train on translated data.

What the Balancing Decision Actually Affects

Three downstream quantities move when you change the language mixture:

Cross-lingual transfer. If language \(A\) and language \(B\) share morphological or syntactic structure, training on \(A\) improves zero-shot performance on \(B\) even without explicit \(B\) data. The XLM-R result shows this transfer is real but bounded; beyond a certain number of competing languages the shared parameters cannot hold all the representations simultaneously.

Native-language fluency. Raising low-resource languages' share of training tokens improves their fluency, but past a point that depends on model capacity it often lowers English benchmark scores on the same model. This is the "curse of multilinguality" trade-off: per-language quality versus breadth.

Tokeniser fertility. A tokeniser trained on raw web proportions will over-partition low-resource text into single characters or byte fallback tokens. A word in Yoruba that is four tokens in a balanced tokeniser vocabulary may be twelve tokens in an English-biased one, directly increasing inference cost and context budget consumption for those users.
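
Fertility can be measured as average tokens per whitespace-separated word. The sketch below uses a toy greedy longest-match tokeniser with single-character fallback; both vocabularies are made up for illustration and stand in for an English-biased and a more balanced vocabulary. A real measurement would use the actual tokeniser under evaluation.

```python
def greedy_tokenize(word, vocab):
    """Longest-match-first segmentation with single-character fallback."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first; a single character always matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

def fertility(text, vocab):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    n_tokens = sum(len(greedy_tokenize(w, vocab)) for w in words)
    return n_tokens / len(words)

# Hypothetical vocabularies, for illustration only.
english_biased = {"the", "ing", "pen", "in", "an"}
balanced = {"nina", "ku", "penda", "habari"}

text = "ninakupenda habari"  # Swahili sample text
print(fertility(text, english_biased))  # 7.0 tokens per word
print(fertility(text, balanced))        # 2.0 tokens per word
```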

When It Falls Down

Availability ceilings. For truly low-resource languages (under 10M tokens publicly available), upsampling means the model sees the same sentences dozens of times over the course of training. Memorisation rather than generalisation is the likely result. No balancing policy overcomes data absence.

Script and encoding inconsistency. Many low-resource languages appear in multiple scripts or under multiple Unicode normalisations across different sources. If balancing aggregates these without normalisation, the "language" label covers multiple script variants that the model treats as distinct vocabularies, diluting the effective count further.
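
The Unicode half of this problem can be handled with the standard library. The sketch below, which assumes NFC as the target form, shows the same Yoruba word counted as two distinct types until it is normalised. The helper name `count_types` is illustrative. Normalisation does not address text written in genuinely different scripts, which needs transliteration or separate handling.

```python
import unicodedata
from collections import Counter

def count_types(lines, normalise=True):
    """Count whitespace-separated word types, optionally NFC-normalising first."""
    counts = Counter()
    for line in lines:
        if normalise:
            line = unicodedata.normalize("NFC", line)
        counts.update(line.split())
    return counts

# The same word encoded two ways: precomposed U+1ECD versus "o" + combining dot below (U+0323).
precomposed = "\u1ecdm\u1ecd"
decomposed = "o\u0323mo\u0323"

raw = count_types([precomposed, decomposed], normalise=False)
clean = count_types([precomposed, decomposed], normalise=True)
print(len(raw), len(clean))  # 2 distinct types before normalisation, 1 after
```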

Distribution shift inside a language. "Arabic" spans Modern Standard Arabic, Moroccan Darija, Egyptian colloquial, and Gulf dialects, which differ enough that upsampling one dialect does not proportionally help the others. Treating language as a monolithic label is a simplification that balancing alone cannot fix.

Evaluation blind spots. Many multilingual benchmarks (XNLI, TyDi QA) cover only 10-20 languages. Optimising the balancing ratio against these benchmarks may inadvertently harm languages not in the evaluation suite. The improvement you measure is not the improvement you delivered.

Interaction with deduplication. Order matters. Deduplicating before balancing shrinks the raw counts that the sampling weights are computed from, and low-resource languages can shrink disproportionately when much of their small web presence consists of mirrored or templated pages. Deduplicating after balancing instead strips out the very repetitions that upsampling introduced, partly undoing it. Pipelines differ on which order they use.
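
A small sketch of the first ordering effect: sampling weights computed from raw document counts differ from weights computed after exact deduplication. The corpus is a toy example, and the normalisation used for hashing (strip plus lowercase) is an assumption made for illustration; production pipelines typically use near-duplicate detection instead.

```python
import hashlib

def dedup(docs):
    """Exact deduplication on a lightly normalised form of each document."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def temperature_weights(doc_counts, T):
    """Temperature-scaled sampling probabilities from per-language counts."""
    total = sum(doc_counts.values())
    weights = {lang: (c / total) ** (1.0 / T) for lang, c in doc_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Toy corpus (assumed): half of the Swahili documents are duplicates.
corpus = {
    "en": [f"english document {i}" for i in range(8)],
    "sw": ["habari ya leo", "Habari ya leo ", "karibu sana", "karibu sana"],
}

before = temperature_weights({l: len(d) for l, d in corpus.items()}, T=2.0)
after = temperature_weights({l: len(dedup(d)) for l, d in corpus.items()}, T=2.0)
print(f"sw weight from raw counts: {before['sw']:.3f}, after dedup: {after['sw']:.3f}")
```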

Further Reading