Tokens & Tokenization
Models don't see words — they see tokens. Understanding BPE/SentencePiece changes how you reason about context windows, cost, and weird edge cases.
beginner · 6 min read
Why tokens, not words
A tokenizer splits text into tokens — usually sub-word pieces. The model's vocabulary is a fixed list (often 50k–256k entries); every input is mapped to integer IDs from that list.
Why not just split on whitespace? Two reasons:
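- A word-level vocabulary can never cover every word. Any out-of-vocabulary word (a typo, a new name, a rare technical term) would have no ID to map to.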
- Common pieces like "ing", "tion", or "def" compress text efficiently across languages and code.
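To make the two reasons concrete, here is a minimal sketch that contrasts a word-level lookup with a sub-word lookup. Both vocabularies are invented for illustration, and the greedy longest-match encoder is a simplification chosen for brevity, not how any particular production tokenizer works.

```python
# Toy vocabularies, invented for illustration only.
WORD_VOCAB = {"the": 0, "model": 1, "is": 2, "running": 3}
SUBWORD_VOCAB = {"the": 0, "model": 1, "is": 2, "runn": 3, "ing": 4, "token": 5, "iz": 6}

# Add single letters as a fallback so any lowercase word can be encoded.
for ch in "abcdefghijklmnopqrstuvwxyz":
    SUBWORD_VOCAB.setdefault(ch, len(SUBWORD_VOCAB))


def encode_words(text):
    """Whitespace splitting: raises KeyError on any out-of-vocabulary word."""
    return [WORD_VOCAB[word] for word in text.split()]


def encode_subwords(text):
    """Greedy longest-match against the sub-word vocabulary (a simplification)."""
    ids = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest remaining piece first, then shrink it.
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in SUBWORD_VOCAB:
                    ids.append(SUBWORD_VOCAB[piece])
                    i = j
                    break
            else:
                raise ValueError(f"cannot encode character {word[i]!r}")
    return ids


if __name__ == "__main__":
    sentence = "the model is tokenizing"
    print(encode_subwords(sentence))  # [0, 1, 2, 5, 6, 4]
    try:
        encode_words(sentence)
    except KeyError as missing:
        print("word-level lookup failed on", missing)
```

The word-level encoder has no entry for "tokenizing", while the sub-word encoder covers it with the pieces "token", "iz", and "ing".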
Byte-Pair Encoding (BPE) in one paragraph
Start from raw bytes. Repeatedly find the most-frequent adjacent pair of tokens in your training corpus and merge them into a new token. Stop when you hit your target vocabulary size. The result is a tokenizer where common substrings are single tokens and rare strings fall back to many small tokens.
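The merge loop described above can be sketched in a few lines of Python. This is a teaching sketch under simplifying assumptions: it trains on a single byte string, breaks ties by first occurrence, and stops early once no pair repeats. Real tokenizers add pre-splitting rules and much faster data structures. The function name `train_bpe` and its parameters are illustrative, not from any library.

```python
from collections import Counter


def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to `num_merges` merges over a byte sequence.

    Returns (merges, tokens): the learned merges as (left_id, right_id, new_id)
    tuples, and the corpus re-encoded with those merges applied.
    """
    tokens = list(corpus)  # start from raw bytes: IDs 0..255
    merges = []
    next_id = 256  # the first ID not used by a raw byte
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so another merge would not compress
        # Replace each non-overlapping occurrence of the pair, left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        merges.append((left, right, next_id))
        next_id += 1
    return merges, tokens


if __name__ == "__main__":
    merges, tokens = train_bpe(b"aaabdaaabac", num_merges=3)
    print(merges[0])    # (97, 97, 256): the pair "aa" is merged first
    print(len(tokens))  # 5, down from 11 raw bytes
```

In this sketch the target vocabulary size is 256 + `num_merges`, which is the stopping condition the paragraph describes.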
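GPT models use BPE. SentencePiece (used by LLaMA and T5) is a tokenizer library that trains BPE or unigram models directly on raw Unicode text. It treats whitespace as an ordinary symbol (shown as ▁), so it needs no language-specific pre-splitting.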
What this changes for you
- Cost & context windows are measured in tokens, not words. As a rough guide: 1 English word ≈ 1.3 tokens; 1 Chinese character ≈ 1 token; 100 lines of Python ≈ 600 tokens.
- Spelling and digit math can fail in surprising ways: the model doesn't see "12345" as five digits, it sees one or two tokens.
- Whitespace is significant. " cat" and "cat" are often different tokens. This is why prompt formatting matters.
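For quick budgeting, the rough English ratio above can be turned into a back-of-the-envelope estimator. This is only a sketch: the constant comes from the rule of thumb in this section, the function names and the `reserved_for_output` parameter are illustrative, and real counts depend on the tokenizer, so use an actual tokenizer when the number matters.

```python
# Rule-of-thumb ratio from this section; real counts depend on the tokenizer.
TOKENS_PER_ENGLISH_WORD = 1.3


def estimate_english_tokens(text: str) -> int:
    """Estimate the token count of English prose from its word count."""
    return round(len(text.split()) * TOKENS_PER_ENGLISH_WORD)


def fits_in_context(text: str, context_window: int, reserved_for_output: int = 0) -> bool:
    """Check whether the estimated prompt plus reserved output tokens fit."""
    return estimate_english_tokens(text) + reserved_for_output <= context_window


if __name__ == "__main__":
    prompt = "one two three four five six seven eight nine ten"
    print(estimate_english_tokens(prompt))  # 13
    print(fits_in_context(prompt, context_window=20, reserved_for_output=10))  # False
```

Because the estimate can be off in either direction, leaving some headroom below the context window is a reasonable precaution.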
Try it
Open tiktokenizer.vercel.app and paste in a paragraph. Notice how it splits ordinary words, and then how it splits your name.