Training a Tokeniser on a Corpus

GPT-2's tokeniser was trained on WebText, an English-heavy corpus scraped from Reddit outbound links. Reuse it for a Thai-language model and a word like "กรุงเทพมหานคร" (Bangkok) becomes a stream of 30+ tokens, most of them individual bytes. That model spends dozens of context positions on a single proper noun, where the English equivalent typically costs only a token or two. The tokeniser is not a neutral preprocessing step; it is a design decision that propagates forward into every training step, every inference call, and every downstream task.

What a Subword Tokeniser Actually Learns

Before training can begin it helps to be precise about what the optimisation target is. A subword tokeniser learns a vocabulary - a finite set of token strings - and a segmentation rule for splitting arbitrary text into that vocabulary.

Three dominant algorithms are in use:

Algorithm	Training objective	Merge strategy	Guarantees coverage?
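BPE (Byte Pair Encoding)	Compress text into the fewest tokens	Bottom-up greedy merges of the most frequent pair	Yes (with a byte-level base alphabet or byte fallback)
WordPiece	Maximise likelihood of training corpus	Bottom-up merges scored by likelihood gain	No ([UNK] for words it cannot segment; ##-prefixed fragments otherwise)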
Unigram Language Model	Maximise unigram LM probability	Top-down pruning from large init vocab	Yes (byte fallback configurable)

BPE, adapted by Sennrich et al. (2016) for neural machine translation from an earlier data-compression algorithm, is the most widely deployed in modern LLMs. The key insight is that starting from a character-level (or byte-level) alphabet gives a closed base vocabulary, so any text built from those base symbols can be encoded without unknown tokens, while iterative pair merges let the model learn common morphemes, subwords, and whole words as single tokens without committing to a fixed word list.

The training algorithm, stripped to its logic:

1. Initialise vocabulary V = {all characters in corpus} ∪ {special tokens}
2. Count all adjacent pair frequencies across the corpus
3. Merge the most frequent pair (a, b) → ab; add ab to V
4. Repeat steps 2-3 until |V| == target_vocab_size

Each merge reduces the total token count in the corpus by one per occurrence of the merged pair. After k merges the vocabulary has absorbed k greedily chosen pairs, each the most frequent at the moment it was picked. The resulting merge table, ordered by when each merge was learned, is the complete tokeniser: given new text, split it into base symbols and replay the merges in order.
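The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production trainer: it assumes the corpus has already been split into words, never merges across word boundaries, omits special tokens, and breaks frequency ties by first occurrence. The function names train_bpe, apply_merge, and encode are made up for this sketch.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words; return the merge table."""
    # Step 1: every word starts as a tuple of single characters, weighted by count.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent pair frequencies across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Step 3: merge the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[apply_merge(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges):
    """Tokenise a new word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)
print(merges)                    # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(encode("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it segments into two learned subwords. Real trainers reach the same result with incremental pair-count updates instead of recounting the whole corpus on every iteration.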

Byte-level BPE (used in GPT-2 and LLaMA-3, among others; LLaMA-1 and LLaMA-2 use SentencePiece BPE with byte fallback, which gives the same guarantee by a different route) seeds V with the 256 raw bytes rather than Unicode code points. This eliminates the unknown-token problem entirely - every conceivable byte sequence is representable - at the cost of making non-ASCII scripts token-inefficient unless the training corpus reflects them.
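To make the cost concrete, the snippet below counts the base symbols a byte-level tokeniser starts from, before any merges apply. It uses only the standard library; the comparison word "Bangkok" is chosen here purely for illustration.

```python
word = "กรุงเทพมหานคร"  # the Thai example from the introduction
code_points = len(word)
utf8_bytes = word.encode("utf-8")

# Byte-level BPE starts from one symbol per byte, so the byte count is the
# worst-case token count for a word whose bytes never merge.
print(f"Thai:    {code_points} code points, {len(utf8_bytes)} bytes")

english = "Bangkok"
print(f"English: {len(english)} code points, {len(english.encode('utf-8'))} bytes")
```

Every Thai code point occupies three bytes in UTF-8, while ASCII letters occupy one, so the Thai word starts at triple the symbol count per character and recovers only if the corpus contains enough Thai for its byte sequences to merge.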

The Role of the Training Corpus

The corpus you feed the tokeniser trainer determines which merges happen first, and therefore which strings become single tokens. This has compounding effects.

Frequency drives merge priority. If your corpus is 95% English, common English words and morphemes will be merged early. A Chinese character that appears infrequently will rarely, if ever, merge with adjacent characters into a multi-character word token before the vocabulary budget runs out. Chinese text then tokenises at character granularity at best (and at byte granularity in a byte-level tokeniser), consuming far more of the model's context window than an equivalent English sentence.

Corpus composition should mirror pretraining data, not the open web. If you intend to pretrain on 30% code, 50% English web text, and 20% multilingual text, draw your tokeniser training data from the same mixture at the same proportions. Deviating even modestly means the vocabulary allocation is miscalibrated: you may allocate vocabulary slots to rare English n-grams that would have been better spent on common Python keywords.
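One way to keep the proportions honest is to sample the tokeniser corpus with explicit mixture weights. The sketch below is a simplification under stated assumptions: it samples by document count, whereas real mixtures are usually specified in bytes or tokens, and the source names, the in-memory document lists, and the function sample_tokeniser_corpus are hypothetical stand-ins for real dataset readers.

```python
import random
from collections import Counter


def sample_tokeniser_corpus(sources, weights, n_docs, seed=0):
    """Draw `n_docs` (source, document) pairs in the target mixture proportions.

    `sources` maps a source name to a list of documents; `weights` maps the
    same names to the intended pretraining fractions.
    """
    rng = random.Random(seed)
    names = list(sources)
    # Pick a source for each slot according to the mixture weights,
    # then pick a document uniformly (with replacement) from that source.
    picks = rng.choices(names, weights=[weights[n] for n in names], k=n_docs)
    return [(name, rng.choice(sources[name])) for name in picks]


# Toy stand-ins for real shards.
sources = {
    "code": ["def f(x): return x", "import os"],
    "english_web": ["The cat sat on the mat.", "Breaking news today."],
    "multilingual": ["สวัสดี", "Bonjour le monde"],
}
weights = {"code": 0.30, "english_web": 0.50, "multilingual": 0.20}

sample = sample_tokeniser_corpus(sources, weights, n_docs=10_000)
counts = Counter(name for name, _ in sample)
for name in sources:
    print(f"{name:>13}: {counts[name] / len(sample):.1%}")
```

Printing the realised fractions is a cheap check that the sample written to disk actually matches the mixture you planned.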

Practical scale. Training BPE to a vocabulary of 32k-100k tokens over a corpus in the range of tens of gigabytes typically completes in a few minutes to a couple of hours on a single CPU machine with efficient implementations. The Hugging Face tokenizers library (Rust-backed) advertises tokenising a gigabyte of text in under 20 seconds on a server CPU, and training is similarly CPU-bound. You do not need GPU resources for this step.

A minimal BPE training call via the tokenizers library:

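from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

# Byte-level BPE: the pre-tokeniser maps text to byte symbols before merges
# are learned, and the matching decoder maps tokens back to text.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = decoders.ByteLevel()

trainer = BpeTrainer(
    vocab_size=50_000,
    min_frequency=2,          # ignore pairs seen fewer than 2 times
    special_tokens=["<|pad|>", "<|eos|>", "<|unk|>"],
    initial_alphabet=ByteLevel.alphabet(),  # seed V with all 256 byte symbols
)
tokenizer.train(files=["corpus_shard_00.txt", "corpus_shard_01.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")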

min_frequency matters: setting it too low lets hapax pairs consume vocabulary slots. Setting it too high discards rare but useful technical tokens (chemical formulae, code identifiers).

Vocabulary Size and the Fertility Trade-off

Fertility is the average number of tokens produced per word (or per Unicode code point for languages written without whitespace word boundaries, such as Thai or Chinese). A well-calibrated tokeniser for a language should have fertility close to 1.0 for that language: common words are single tokens, rare ones split into two or three.

Vocabulary size controls the fertility-parameter trade-off:

Larger vocabulary (100k-250k) reduces fertility, shortens sequences, and reduces the FLOPs needed to process a given amount of text. It also increases the embedding matrix size, which grows as vocab_size * embedding_dim. For a 128k vocabulary at dimension 4096 that is roughly 0.5 billion parameters for the input embeddings alone, and about twice that if the output projection is not tied to them.
Smaller vocabulary (16k-32k) shrinks the embedding table but fragments text more. The resulting longer sequences consume more of the context window and can hurt reasoning over long contexts.
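The parameter side of the trade-off is plain arithmetic. The helper below, with the made-up name embedding_params, assumes a single embedding table of shape vocab_size by embedding_dim and optionally an untied output projection of the same shape.

```python
def embedding_params(vocab_size, embedding_dim, tied=True):
    """Parameters in the token embedding table (doubled if the output projection is untied)."""
    table = vocab_size * embedding_dim
    return table if tied else 2 * table


for vocab_size in (32_000, 128_000, 250_000):
    params = embedding_params(vocab_size, 4096)
    print(f"{vocab_size:>7,} tokens x 4096 dims = {params / 1e9:.2f}B parameters")
```

At a fixed embedding dimension the cost is linear in vocabulary size, so its share of the total parameter budget matters far more for a small model than for a large one.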

LLaMA-1 used 32k tokens; LLaMA-3 expanded to 128k specifically to improve multilingual and code tokenisation. GPT-4 reportedly uses a 100k+ vocabulary. There is no universal optimum; it depends on the language mix and model size.

A useful diagnostic before finalising the vocabulary is to compute per-language fertility on held-out samples. If Arabic or Thai shows fertility of 4-6 when English is near 1.2-1.5, the tokeniser is under-representing those languages and you should either rebalance the training corpus or increase vocabulary size.
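A fertility check of this kind needs very little code. The sketch below accepts any tokenize callable that returns a list of tokens; toy_tokenize is a hypothetical stand-in that cuts each word into chunks of up to four characters, used only so the example runs without a trained tokeniser.

```python
def fertility(texts, tokenize, unit="word"):
    """Average tokens per whitespace-delimited word, or per non-space code point."""
    n_tokens = 0
    n_units = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        if unit == "word":
            n_units += len(text.split())
        else:
            n_units += sum(1 for ch in text if not ch.isspace())
    return n_tokens / n_units if n_units else float("nan")


def toy_tokenize(text):
    # Hypothetical stand-in for a real tokeniser: chunks of up to 4 characters per word.
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


held_out = {
    "short English words": (["the cat sat on the mat"], "word"),
    "long English words": (["tokenisation matters"], "word"),
    "Thai (per code point)": (["กรุงเทพมหานคร"], "char"),
}
for label, (texts, unit) in held_out.items():
    print(f"{label:>22}: {fertility(texts, toy_tokenize, unit):.2f}")
```

In practice you would pass the trained tokeniser's encode function and a few thousand held-out sentences per language, then compare the numbers side by side.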

When It Falls Down

Domain shift after the tokeniser is fixed. The tokeniser is trained once and frozen before model training begins. If the corpus changes composition later - say you add a large code dataset mid-training - the tokeniser cannot adapt. Common programming keywords that were rare in the original corpus will fragment into multiple tokens, harming code generation quality. This is why it is worth investing in a representative corpus before the tokeniser training run, not after.

Rare scripts and under-resourced languages. Languages with limited web presence contribute too few examples to produce useful merges. A tokeniser trained on 1TB of English and 1GB of Amharic will produce nearly-character-level segmentation for Amharic. This is not just an efficiency problem; it means the model is processing those languages through a qualitatively different representational bottleneck.

Numerals and code identifiers. Naive BPE tends to produce inconsistent splits for numbers: "2023", "2024", and "2025" may segment differently, making arithmetic generalisation harder. Some practitioners use digit-split pre-tokenisation (forcing each digit to its own token) or character-level fallback for numerals to mitigate this. Neither is a complete solution.
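Digit splitting happens before BPE sees the text, so it can be illustrated with a regular expression alone. The pattern and the function name digit_split below are illustrative choices, not a standard: each digit becomes its own piece, while runs of other characters and runs of whitespace stay together.

```python
import re

# One digit per piece; runs of non-digit, non-space characters and runs of
# whitespace are kept whole. Every character matches exactly one alternative,
# so joining the pieces reproduces the input.
_PIECES = re.compile(r"\d|[^\d\s]+|\s+")


def digit_split(text):
    """Pre-tokenise `text` so that no piece contains more than one digit."""
    return _PIECES.findall(text)


print(digit_split("2023"))  # ['2', '0', '2', '3']
print(digit_split("2024"))  # ['2', '0', '2', '4']
print(digit_split("rose 15%"))  # ['rose', ' ', '1', '5', '%']
```

Because merges are then learned within pieces, no multi-digit token can form, and "2023", "2024", and "2025" all segment the same way. The Hugging Face tokenizers library ships a Digits pre-tokeniser with an individual_digits option that serves the same purpose.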

Vocabulary mismatch with downstream tasks. If your model is later fine-tuned or adapted for a domain that was absent from the tokeniser corpus (medical records, legal text with Latin phrases, a new programming language), every user of that domain pays a latency and quality tax: text fragments into more tokens, and the tax cannot be fully removed without changing the tokeniser, which in practice means retraining the model.

Token boundary artefacts. Because the tokeniser sees text as a flat byte or character sequence, the same surface string may tokenise differently depending on what precedes it. "Washington" at the start of a sentence may be one token; " Washington" with a leading space may be a different token. This is expected behaviour in byte-level BPE, but it creates subtle brittleness in prompting and can cause tokenisation-sensitive evaluations to produce misleading results.
