Vocabulary Size Trade-offs

Choosing a tokeniser vocabulary size forces a three-way tension between sequence length, embedding table memory, and coverage of rare or multilingual text.

GPT-2 shipped with a 50,257-token vocabulary. LLaMA-2 used 32,000. Gemma extended to 256,000. None of these is obviously wrong; each reflects deliberate bets about the corpus, compute budget, and downstream use. Vocabulary size is also very hard to change after pretraining: a new vocabulary invalidates the learned embedding and output layers, and recovering quality takes substantial further training, so getting it wrong is expensive.

The three-way tension

Every vocabulary size decision sits at the intersection of three competing pressures.

Sequence length vs. compression. A larger vocabulary encodes more text per token. BPE adds one token per merge, so raising V means running more merges, and more of the common character sequences get absorbed into single tokens. As an illustration, a 32k vocabulary might need about 1.4 tokens for an average English word where a 128k vocabulary needs about 1.2. Because attention compute is quadratic in sequence length and the KV cache grows linearly with it, shorter sequences reduce both training compute and inference memory. The effect compounds over a full pretraining run: rough estimates suggest that doubling the vocabulary from 32k to 64k for a monolingual English corpus reduces token count by roughly 10-15%, which translates directly into cheaper training at scale.
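To make the compounding concrete, here is a minimal sketch of how a given token reduction translates into savings. It assumes a simplified cost model in which KV cache and MLP cost scale linearly with sequence length and the attention-score term scales quadratically; the function name is illustrative, not from any library.

```python
def savings_from_token_reduction(reduction: float) -> dict:
    """Relative savings when a tokeniser emits `reduction` fewer tokens.

    'linear' covers costs proportional to sequence length (KV cache,
    MLP FLOPs); 'quadratic' covers the n**2 attention-score term.
    This is a simplified cost model, not a measurement.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    remaining = 1.0 - reduction
    return {"linear": reduction, "quadratic": 1.0 - remaining ** 2}


# The 10-15% range quoted in the text for doubling 32k -> 64k.
for r in (0.10, 0.15):
    s = savings_from_token_reduction(r)
    print(f"{r:.0%} fewer tokens: linear {s['linear']:.1%}, "
          f"quadratic {s['quadratic']:.1%}")
```

Under this model the quadratic term shrinks almost twice as fast as the token count, which is why the benefit is most visible at long sequence lengths.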

Embedding table memory. The token embedding matrix has shape (V, d_model). For a model with d_model = 4096 and V = 128,000 in bfloat16, that matrix is 128,000 x 4096 x 2 bytes, or roughly 1 GB. The unembedding (logit projection) layer has the same shape. If its weights are tied to the input embedding, the two share one matrix; if they are untied, the vocabulary-related parameters cost about 2 GB. At V = 32,000 the untied pair costs about 0.5 GB. For a 7B-parameter model this is not catastrophic, but small models feel it: with d_model = 2048, a 256k vocabulary puts about 524M parameters (roughly 1 GB in bfloat16) into the embedding table alone, and twice that if the output projection is untied, which is comparable to the whole transformer stack of a ~1B-parameter model - the tail wagging the dog.
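The arithmetic above is easy to reproduce. The following sketch computes the bytes held by the vocabulary-related matrices; the function name and the `tied` flag are illustrative assumptions, and GB here means 10^9 bytes.

```python
def vocab_param_bytes(vocab_size, d_model, bytes_per_param=2, tied=True):
    """Bytes used by the input embedding plus the output projection.

    With tied weights the two layers share a single (V, d_model) matrix;
    untied, there are two matrices of that shape.
    """
    matrices = 1 if tied else 2
    return vocab_size * d_model * bytes_per_param * matrices


GB = 1e9  # decimal gigabytes
for v in (32_000, 128_000, 256_000):
    tied_gb = vocab_param_bytes(v, 4096) / GB
    untied_gb = vocab_param_bytes(v, 4096, tied=False) / GB
    print(f"V={v:>7,}: tied {tied_gb:.2f} GB, untied {untied_gb:.2f} GB")
```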

Rare token coverage. A vocabulary that is too small leaves even common multi-character units split across several tokens. More critically, rare languages, technical jargon, and code identifiers fragment into long character-level sequences, blowing up sequence length and making it harder for the model to learn sensible representations for those units. The standard measure here is fertility: the average number of tokens per word. A well-calibrated vocabulary should yield fertility close to 1 for high-resource languages and should not exceed 5-6 for low-resource ones; if it does, those languages are effectively penalised at inference time wherever usage is billed per token.

Model             Vocab size   Typical English fertility
GPT-2             50,257       ~1.3
LLaMA / LLaMA-2   32,000       ~1.4
Mistral-7B        32,000       ~1.4
Gemma             256,000      ~1.1
Qwen              151,936      ~1.1

(Fertility numbers are approximate and vary by text domain.)
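Fertility is simple to measure for any tokeniser. The sketch below assumes whitespace-delimited words and tokenises each word in isolation, which ignores leading-space conventions used by real tokenisers; `toy_tokenize` is a hypothetical stand-in so the example runs without external libraries.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words found in texts")
    return n_tokens / n_words


def toy_tokenize(word, chunk=4):
    """Hypothetical tokeniser: fixed chunks of up to `chunk` characters."""
    return [word[i:i + chunk] for i in range(0, len(word), chunk)]


sample = ["vocabulary size trade-offs", "rare tokens fragment"]
print(fertility(sample, toy_tokenize))
```

To compare real tokenisers, pass their encode function as `tokenize` and run the same text sample through each; the numbers are only comparable on identical text.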

How BPE merges interact with vocabulary size

Byte-Pair Encoding (BPE), originally a data-compression algorithm that Sennrich et al. (2016) adapted to subword segmentation for machine translation, is now the dominant algorithm for LLM tokenisers. It grows the vocabulary by iteratively merging the most frequent adjacent pair of symbols (bytes or characters) in the training corpus. Vocabulary size V is essentially the number of merge operations plus the size of the base alphabet.

A key insight: the first few thousand merges capture extremely high-frequency patterns (common English morphemes, punctuation clusters, whitespace conventions), and the marginal value of each additional merge decreases as V grows. The training loop itself is simple:

# Pseudocode for BPE training (naive version)
vocab = set(base_alphabet)
while len(vocab) < V_target:
    pair = most_frequent_adjacent_pair(corpus)
    if pair is None:        # nothing left to merge
        break
    vocab.add(merge(pair))
    corpus = apply_merge(corpus, pair)
    # Naive cost: one O(corpus_size) scan per merge

At V = 32k, short, frequent suffixes such as -tion or -ness are typically already single tokens, but many longer or less frequent words still split into two or three pieces. At V = 128k, more of those get absorbed. The corpus used to train BPE matters as much as V: if the corpus is 99% English, a 128k vocabulary simply allocates the extra merges to rare English words rather than to other languages, which is generally wasteful if the model is intended to be multilingual.
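A runnable version of the merge loop makes the diminishing returns visible. This is a minimal, naive sketch over a word-count dictionary, not the implementation used by any production tokeniser; the function names, the tie-breaking rule, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def merge_word(symbols, pair):
    """Replace every adjacent occurrence of `pair` in a symbol tuple."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_counts, v_target):
    """Naive BPE training over a {word: count} dict.

    Returns (vocab, merges); merges lists (pair, frequency) in the
    order the merges were learned.
    """
    corpus = {tuple(w): c for w, c in word_counts.items()}
    vocab = {s for w in corpus for s in w}
    merges = []
    while len(vocab) < v_target:
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:  # every word is a single symbol
            break
        # Ties are broken by the pair itself so the result is deterministic.
        pair, freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merges.append((pair, freq))
        vocab.add(pair[0] + pair[1])
        merged = Counter()
        for symbols, count in corpus.items():
            merged[merge_word(symbols, pair)] += count
        corpus = dict(merged)
    return vocab, merges


counts = {"nation": 5, "station": 4, "ration": 3, "nest": 2}  # toy corpus
vocab, merges = train_bpe(counts, v_target=14)
for pair, freq in merges:
    print(pair, freq)
```

The frequency attached to each successive merge never increases, so every extra vocabulary slot covers less text than the one before it. On a real corpus the same curve decides how much a jump from 32k to 128k actually buys.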

SentencePiece (Kudo and Richardson, 2018) implements a unigram language model tokeniser alongside BPE; both algorithms expose the same vocabulary size hyperparameter V and face the same fundamental tension.

The multilingual amplifier

For monolingual English models, vocabulary size is primarily a compute-vs-coverage trade-off. For multilingual models it becomes a fairness issue. Languages that are underrepresented in the BPE training corpus end up with very few dedicated tokens, so their text fragments into individual characters or short byte sequences. This imposes:

  1. Higher inference cost for users of those languages (more tokens per query).
  2. Worse model quality on those languages because the model rarely sees natural word-level units.
  3. A feedback loop: poor quality discourages use, reducing data collection for future models.

Practical approach: set a floor on per-language token count during BPE training. Some teams explicitly upsample low-resource languages in the BPE training corpus (separate from the pretraining corpus), allocating a fraction of the vocabulary budget specifically to non-English coverage. Qwen's 151,936-token vocabulary was explicitly designed to give Chinese characters and subwords dedicated allocations; its Chinese fertility is close to 1 token per character.
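One common way to upsample low-resource languages is exponent-smoothed sampling, where each language is drawn with probability proportional to its corpus share raised to a power alpha below 1. The text does not say which scheme any particular team used, so treat this as an assumed technique; the language codes, counts, and alpha value are hypothetical.

```python
def sampling_weights(token_counts, alpha=0.3):
    """Sampling probability per language, proportional to share ** alpha.

    alpha = 1.0 reproduces the raw corpus proportions;
    alpha = 0.0 samples all languages uniformly.
    """
    total = sum(token_counts.values())
    if total <= 0:
        raise ValueError("token_counts must contain positive counts")
    scaled = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


counts = {"en": 900_000, "zh": 90_000, "sw": 10_000}  # hypothetical corpus sizes
for lang, p in sampling_weights(counts).items():
    print(f"{lang}: raw {counts[lang] / sum(counts.values()):.1%} -> sampled {p:.1%}")
```

Lower alpha shifts more of the merge budget toward small languages at the cost of slightly worse compression for the dominant one.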

Sequence length, context windows, and KV cache

The connection between vocabulary size and runtime cost runs through sequence length. Suppose you double V and gain a 12% reduction in average tokens per document. For a model with a 4096-token context window, those 12% fewer tokens mean:

  • About 14% more text fits in the same context window (1 / 0.88 ≈ 1.136).
  • KV cache size scales linearly with sequence length; shorter sequences free VRAM during inference.
  • Training throughput improves because more text fits in a batch with a fixed token budget.
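The points above can be put into numbers. The sketch below assumes a standard multi-head attention cache (keys and values stored for every layer) and uses an illustrative 7B-class configuration (32 layers, 32 KV heads, head dimension 128, bfloat16); these values are assumptions, not taken from the text.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache size for one sequence: keys and values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def extra_text_capacity(token_reduction):
    """Fractional increase in text that fits in a fixed token window."""
    return 1.0 / (1.0 - token_reduction) - 1.0


full = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128)
shorter = kv_cache_bytes(int(4096 * 0.88), n_layers=32, n_kv_heads=32, head_dim=128)
print(f"KV cache at 4096 tokens: {full / 2**30:.2f} GiB")
print(f"KV cache after 12% fewer tokens: {shorter / 2**30:.2f} GiB")
print(f"Extra text in a fixed window: {extra_text_capacity(0.12):.1%}")
```

Note that a 12% token reduction gives slightly more than 12% extra text in a fixed window, because capacity scales as 1 / (1 - r).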

The flip side: each forward pass now computes a softmax over 2V logits instead of V. For a standard linear output head, the FLOPs of the final projection scale with d_model x V per token position. For a 4096-dim model, going from a 32k to a 128k vocabulary adds roughly 393M parameters (96,000 x 4096) to the output projection - not free, but usually outweighed by the sequence length savings at scale.
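Whether the larger head is "outweighed" depends on model size. The sketch below uses the common approximation of 6 FLOPs per parameter per training token and counts the output projection as d_model x V parameters. The 20% token reduction for 32k to 128k and the two body sizes are assumptions chosen for illustration.

```python
def training_flops(n_tokens, body_params, d_model, vocab_size):
    """Approximate training FLOPs: 6 * (body + output head params) * tokens."""
    head_params = d_model * vocab_size
    return 6 * (body_params + head_params) * n_tokens


def cost_ratio(body_params, d_model=4096, v_small=32_000, v_large=128_000,
               token_reduction=0.20):
    """Training cost with v_large relative to v_small for the same text."""
    base = training_flops(1.0, body_params, d_model, v_small)
    big = training_flops(1.0 - token_reduction, body_params, d_model, v_large)
    return big / base


extra_head_params = (128_000 - 32_000) * 4096
print(f"extra output-projection parameters: {extra_head_params:,}")
for body in (6.5e9, 0.3e9):  # hypothetical non-vocabulary parameter counts
    print(f"body {body / 1e9:.1f}B params: cost ratio {cost_ratio(body):.2f}")
```

Under these assumptions the larger vocabulary is a net saving for the 6.5B body but a net cost for the 0.3B body, which matches the observation that very large vocabularies hurt small models most.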

A rough rule of thumb: for a primarily English monolingual corpus, vocabulary sizes between 32k and 64k are Pareto-efficient. Below 32k, fertility degrades noticeably even for English. Above 128k, the embedding table becomes a bottleneck for small models. Multilingual targets justify 100k-256k.

When it falls down

Vocabulary mismatch after domain adaptation. If you fine-tune a general-purpose model on a narrow technical domain (genomics, legal text, source code), the vocabulary learned during pretraining may fragment that domain's key terms badly. A vocabulary trained on web text may split the gene name BRCA1 or the code token __init__ into several short pieces. Fine-tuning does not change the segmentation: the model still sees those terms as multi-token sequences and has to reassemble their meaning across positions. The usual fix is vocabulary expansion (adding domain tokens and initialising embeddings for them), but this requires continued pretraining to stabilise, not just supervised fine-tuning.
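As a sketch of the expansion step, one common heuristic is to initialise each new token's embedding as the mean of the embeddings of the pieces it replaces. The text does not prescribe an initialisation, so this choice is an assumption; the table is a plain list of lists and the token ids are hypothetical.

```python
def init_new_token_embedding(embeddings, subtoken_ids):
    """Mean of the existing embeddings of the pieces a new token replaces."""
    if not subtoken_ids:
        raise ValueError("need at least one sub-token id")
    dim = len(embeddings[0])
    mean = [0.0] * dim
    for tid in subtoken_ids:
        for j, x in enumerate(embeddings[tid]):
            mean[j] += x
    return [m / len(subtoken_ids) for m in mean]


def expand_vocab(embeddings, new_tokens):
    """Append one row per new token.

    new_tokens maps a token string to the ids of its current pieces.
    Returns (expanded_table, {token: new_id}); the input is not modified.
    """
    table = [row[:] for row in embeddings]
    new_ids = {}
    for token, piece_ids in new_tokens.items():
        new_ids[token] = len(table)
        table.append(init_new_token_embedding(embeddings, piece_ids))
    return table, new_ids


old_table = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # toy 3-token, 2-dim table
table, ids = expand_vocab(old_table, {"BRCA1": [0, 2]})
print(ids, table[-1])
```

The new rows start in a plausible region of embedding space, but they have seen no gradient yet, so the continued-pretraining requirement still applies.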

Extremely large vocabularies and rare token under-training. With V = 256k, some tokens in the tail of the distribution, especially those that were frequent in the tokeniser's training data but are rare in the pretraining corpus, can appear only a handful of times even in a trillion-token run. Their embeddings stay close to their initial values because they receive very little gradient signal. Models can exhibit erratic behaviour on inputs that contain these rare tokens, including confusing one rare token with another.
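A simple first check for this failure mode is to count how often each token id actually occurs in a sample of the tokenised pretraining data. The sketch below is a minimal version of that audit; the threshold of 100 and the toy stream are assumptions, and a low count is only a hint that a token may be under-trained.

```python
from collections import Counter


def undertrained_candidates(token_ids, vocab_size, min_count=100):
    """Token ids that occur fewer than `min_count` times in the stream."""
    counts = Counter(token_ids)
    return [t for t in range(vocab_size) if counts[t] < min_count]


stream = [0] * 5 + [1] * 2 + [3]  # toy tokenised corpus over a 4-token vocab
print(undertrained_candidates(stream, vocab_size=4, min_count=2))
```

Tokens that never occur at all are included, since the scan runs over every id in the vocabulary and not just over the ids seen in the data.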

BPE tokenisation is greedy and context-sensitive. Standard BPE encodes a string by applying the learned merges in priority order. The result is deterministic for a given input but is not guaranteed to be the shortest possible segmentation, and the same word can tokenise differently depending on its surroundings (e.g., whether it appears at the start of a sentence or mid-sentence after a space). This boundary sensitivity is not caused by vocabulary size, but larger vocabularies tend to contain more overlapping variants of the same word and so make it more visible.
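A toy encoder shows the boundary effect. This sketch applies merges in learned order and treats the space as an ordinary symbol; real tokenisers add pre-tokenisation rules on top, and the merge lists here are hypothetical.

```python
def bpe_encode(text, merges):
    """Encode by repeatedly applying the highest-priority applicable merge.

    `merges` is a list of symbol pairs, earliest-learned (highest
    priority) first. Ties between positions go to the leftmost pair.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(text)
    while len(symbols) > 1:
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


small = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]
large = small + [(" ", "hello")]  # one extra merge in a bigger vocabulary

for name, merges in (("small", small), ("large", large)):
    print(name, bpe_encode("hello", merges), bpe_encode(" hello", merges))
```

With the smaller merge list the word maps to the same token in both positions; with the extra merge, the mid-sentence form becomes a separate token that the model has to learn independently.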

Vocabulary size and quantisation. Quantising the embedding table to int8 or int4 introduces a rounding error in every row. A larger vocabulary has more rows, including many rarely updated tail rows whose value ranges may differ from those of frequent tokens, so a quantisation scale can fit them poorly. This can produce more visible quality degradation on text that hits the tail of the vocabulary.
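To see why a row's value range matters, here is a minimal sketch of symmetric per-row absmax int8 quantisation. It is one simple scheme among many, and the two example rows are made up; whether tail tokens really have unusual ranges in a given model is an empirical question.

```python
def quantize_row_int8(row):
    """Symmetric per-row absmax quantisation to integers in [-127, 127]."""
    absmax = max(abs(x) for x in row)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize_row(q, scale):
    """Map quantised integers back to floats."""
    return [v * scale for v in q]


def max_abs_error(row):
    """Largest absolute round-trip error for one row."""
    q, scale = quantize_row_int8(row)
    return max(abs(a - b) for a, b in zip(row, dequantize_row(q, scale)))


typical = [0.02, -0.01, 0.03, 0.00]  # hypothetical well-behaved row
outlier = [0.02, -0.01, 0.03, 2.00]  # same row with one large value
print(f"typical row error: {max_abs_error(typical):.6f}")
print(f"outlier row error: {max_abs_error(outlier):.6f}")
```

The round-trip error is bounded by half the row's scale, so a single large value coarsens the grid for every other entry in that row.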

Further reading

  • Sennrich, Haddow, and Birch. "Neural Machine Translation of Rare Words with Subword Units." ACL 2016. https://arxiv.org/abs/1508.07909 - the original BPE paper; section 3 covers the vocabulary size choice directly.
  • Kudo and Richardson. "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing." EMNLP 2018. https://arxiv.org/abs/1808.06226 - describes the toolkit, which supports both BPE and unigram LM segmentation with a fixed, user-specified vocabulary size.
  • Touvron et al. "LLaMA: Open and Efficient Foundation Language Models." 2023. https://arxiv.org/abs/2302.13971 - illustrates the 32k vocabulary choice for a large-scale English-dominant model.
  • Bai et al. "Qwen Technical Report." 2023. https://arxiv.org/abs/2309.16609 - shows a 151k vocabulary designed for Chinese-English bilingual coverage.