Tokens & Tokenization

Why tokens, not words

A tokenizer splits text into tokens — usually sub-word pieces. The model's vocabulary is a fixed list (often 50k–256k entries); every input is mapped to integer IDs from that list.

Why not just split on whitespace? Two reasons:

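A word-level vocabulary has no ID for a word it has never seen (an out-of-vocabulary word), so new names, typos, and rare terms would break the model.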
Common pieces like ing, tion, or def compress text efficiently across languages and code.
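
To make the two reasons concrete, here is a minimal sketch in Python. Both vocabularies and both function names are hypothetical, and the sub-word encoder uses greedy longest-match, a simplification that is not how a BPE tokenizer actually applies its merges. It only shows that a word-level lookup fails on an unseen word while sub-word pieces still cover it.

```python
# Hypothetical toy vocabularies, chosen only for illustration.
WORD_VOCAB = {"the": 0, "token": 1, "tokens": 2}
SUBWORD_PIECES = ["token", "iz", "ing", "un", "s",
                  "t", "o", "k", "e", "n", "i", "z", "g", "u"]
SUBWORD_VOCAB = {piece: i for i, piece in enumerate(SUBWORD_PIECES)}


def word_level_encode(text):
    """Split on whitespace; raises KeyError for any word not in WORD_VOCAB."""
    return [WORD_VOCAB[word] for word in text.split()]


def subword_encode(word):
    """Greedy longest-match: repeatedly take the longest known piece."""
    ids, pos = [], 0
    while pos < len(word):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in SUBWORD_VOCAB:
                ids.append(SUBWORD_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[pos]!r}")
    return ids


print(word_level_encode("the tokens"))   # [0, 2]
print(subword_encode("tokenizing"))      # [0, 1, 2] -> token + iz + ing
try:
    word_level_encode("the tokenizing")
except KeyError as missing:
    print("out of vocabulary:", missing)
```

The single-character pieces at the end of the toy vocabulary play the role of a fallback: any word built from those letters can be encoded, just with more tokens.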

Byte-Pair Encoding (BPE) in one paragraph

Start from raw bytes. Repeatedly find the most-frequent adjacent pair of tokens in your training corpus and merge them into a new token. Stop when you hit your target vocabulary size. The result is a tokenizer where common substrings are single tokens and rare strings fall back to many small tokens.
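
The merge loop described above can be sketched in a few lines of Python. This is a toy version under stated assumptions: the function name train_bpe is hypothetical, the corpus is a single string, the loop stops after a fixed number of merges rather than at a target vocabulary size, and ties are broken by whichever pair was seen first. Production trainers typically pre-split text into words and are far more optimized.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn up to num_merges merges from corpus, starting from raw UTF-8 bytes."""
    # Start with one token per byte.
    tokens = [bytes([b]) for b in corpus.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair in the current token sequence.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        if pair_counts[best] < 2:
            break  # no pair repeats, so another merge would not compress anything
        left, right = best
        merges.append(best)
        # Rewrite the sequence, replacing each occurrence of the pair.
        merged, out, i = left + right, [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merges, tokens


merges, tokens = train_bpe("aaabdaaabac", num_merges=3)
print(merges)  # learned merges, in order
print(tokens)  # the corpus re-expressed with the merged tokens
```

On this corpus the frequent substring aaab ends up as a single token, while the rare d and c stay as single bytes, which is the behavior the paragraph describes.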

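GPT models use byte-level BPE. SentencePiece (used by T5 and the original LLaMA models) is a tokenizer library rather than a separate algorithm: it trains a BPE or unigram model directly on raw Unicode text and treats whitespace as an ordinary symbol (written ▁), so no language-specific word splitting is needed first.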

What this changes for you

Cost and context windows are measured in tokens, not words. As a rough guide (the exact ratios depend on the tokenizer and the text): 1 English word ≈ 1.3 tokens; 1 Chinese character ≈ 1 token; 100 lines of Python ≈ 600 tokens.
Spelling and digit math can fail in surprising ways. The model doesn't see 12345 as five separate digits; depending on the tokenizer, it sees one or two multi-digit tokens.
Whitespace is significant. "cat" and " cat" (with a leading space) are often different tokens. This is why prompt formatting matters.
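
As a back-of-the-envelope illustration of the first point, the 1.3 tokens-per-word rule of thumb can be turned into a quick budget check. The function names and the reserved_for_output parameter are hypothetical, the ratio only applies to English prose, and a real tokenizer should be used whenever the count matters.

```python
def estimate_tokens(text):
    """Rough estimate for English prose: word count times 1.3, rounded up."""
    words = len(text.split())
    # Integer arithmetic for ceil(words * 1.3), avoiding float rounding.
    return (words * 13 + 9) // 10


def fits_context(text, context_window, reserved_for_output=0):
    """True if the estimated prompt plus reserved output tokens fit the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window


prompt = "the quick brown fox jumps over the lazy dog"
print(estimate_tokens(prompt))                               # 9 words -> 12
print(fits_context(prompt, context_window=2048, reserved_for_output=500))
```

Because the estimate can be off in either direction, leaving some headroom below the actual context limit is a sensible default.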

Try it

Open tiktokenizer.vercel.app and paste in a paragraph. Notice where the token boundaries fall in ordinary words, then see how it splits your name.