NLP Foundations
Tokenisation and BPE
How models turn raw text into integer tokens, and why the vocabulary choice silently shapes model behaviour.
beginner · 6 min read
Models do not see characters or words. They see integer tokens drawn from a fixed vocabulary. How you build that vocabulary determines how efficiently the model can represent text - and which languages or domains it secretly disadvantages.
Byte-pair encoding
BPE starts from a base alphabet (individual bytes, in the byte-level variant), then iteratively merges the most frequent adjacent pair of symbols in the training corpus until the vocabulary reaches its target size (typically 30k-200k). The result is a vocabulary that represents fragments common in the training data, which for most models means English, as single tokens and rare strings as several.
"unhappiness" -> ["un", "happiness"] # both common merges
"obfuscation" -> ["o", "bf", "us", "cation"] # falls back to smaller pieces
Why it matters
- Cost. API usage is billed per token, for both input and output. A language that is under-represented in the vocabulary (Hindi, Vietnamese, Arabic) can cost roughly 3-5x more tokens than English for the same content, depending on the tokenizer.
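- Context window. Because the same content takes more tokens, a 128k-token window holds far less non-English text; at a 4x token cost it fits about as much as a 32k window does in English.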
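- Numeric tasks. Older GPT tokenizers (GPT-2, GPT-3) merged digits like any other characters, so a number could split into arbitrary, inconsistent chunks; some later tokenizers split numbers into single digits (Llama 2) or cap digit groups at three digits (GPT-4). The choice silently changes arithmetic capability.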
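Two of these effects can be made concrete without a real tokenizer. The sketch below rests on stated assumptions: byte_level_start counts UTF-8 bytes, which is the token count a byte-level BPE would produce if no merge applied (the worst case for an under-represented script), and split_digits imitates a fixed digit-chunking rule. Both helper names are hypothetical, and real tokenizers will differ in detail.

```python
import re


def byte_level_start(text):
    """Worst-case token count for byte-level BPE: one token per UTF-8 byte."""
    return len(text.encode("utf-8"))


def split_digits(number, max_len):
    """Split a digit string left to right into chunks of at most `max_len` digits."""
    return re.findall(rf"\d{{1,{max_len}}}", number)


# Latin letters take 1 byte each; Devanagari code points take 3 bytes each,
# so unmerged Hindi text starts from roughly three times as many symbols.
for sample in ["hello", "नमस्ते"]:
    print(sample, len(sample), "code points,", byte_level_start(sample), "bytes")

# The same number under two chunking policies.
print(split_digits("1234567", 1))  # single digits
print(split_digits("1234567", 3))  # chunks of up to three digits
```

Merges learned from the training data shrink the byte count for well-represented languages far more than for others, which is where the cost and context-window gap comes from.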
SentencePiece and tiktoken
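The two dominant implementations are SentencePiece (Llama 1 and 2, T5, PaLM) and tiktoken (OpenAI's GPT models). They differ in details, such as how whitespace and pre-tokenisation are handled, and SentencePiece also offers a unigram model as an alternative to BPE, but where BPE is used the skeleton is the same. When debugging weird model behaviour, dump the tokenisation: many bugs hide in unexpected splits.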
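A small helper makes dumping the tokenisation routine. The sketch below is library-agnostic: dump_tokens is a hypothetical name, and it takes the tokenizer's encode and single-token decode functions as arguments. A toy one-token-per-byte tokenizer stands in for a real one so the example runs on its own.

```python
def dump_tokens(text, encode, decode_one):
    """Return (token_id, token_bytes) pairs so unexpected splits are easy to spot."""
    return [(tok, decode_one(tok)) for tok in encode(text)]


# Toy stand-in tokenizer (hypothetical): one token per UTF-8 byte.
def toy_encode(text):
    return list(text.encode("utf-8"))


def toy_decode_one(token_id):
    return bytes([token_id])


# repr() keeps leading spaces and odd bytes visible in the output.
for tok, piece in dump_tokens("hi there", toy_encode, toy_decode_one):
    print(tok, repr(piece))
```

With tiktoken installed, the same helper should work by passing enc.encode and enc.decode_single_token_bytes from an encoding such as tiktoken.get_encoding("cl100k_base"); treat that pairing as an assumption about your setup. Look for leading spaces attached to tokens and for numbers or identifiers split in places you did not expect.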