Tokenisation and BPE

Models do not see characters or words. They see integer tokens drawn from a fixed vocabulary. How you build that vocabulary determines how efficiently the model can represent text - and which languages or domains it secretly disadvantages.

Byte-pair encoding

BPE starts from a byte alphabet, then iteratively merges the most frequent adjacent pair until the vocabulary reaches its target size (typically 30k-200k). The result is a vocabulary that represents common English fragments as single tokens and rare strings as several.

"unhappiness" -> ["un", "happiness"]    # both common merges
"obfuscation" -> ["o", "bf", "us", "cation"]    # falls back to smaller pieces

Why it matters

Cost. Tokens are billed per input/output. A language with under-represented vocabulary (Hindi, Vietnamese, Arabic) costs 3-5x more tokens than English for the same content.
Context window. A 128k token window is much smaller in non-English text.
Numeric tasks. Old GPT models split numbers as single digits; newer models tokenise multi-digit chunks. The choice silently changes arithmetic capability.

SentencePiece and tiktoken

The two dominant implementations are SentencePiece (Llama, T5, PaLM) and tiktoken (GPT, Claude). They differ in details but the BPE skeleton is the same. When debugging weird model behaviour, dump the tokenisation - many bugs hide in unexpected splits.