
Foundations

Tokens & Tokenization

Models don't see words — they see tokens. Understanding BPE/SentencePiece changes how you reason about context windows, cost, and weird edge cases.

beginner · 6 min read

Why tokens, not words

A tokenizer splits text into tokens — usually sub-word pieces. The model's vocabulary is a fixed list (often 50k–256k entries); every input is mapped to integer IDs from that list.

Why not just split on whitespace? Two reasons:

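  1. No word list can cover every word, so any out-of-vocabulary word (a typo, a new name, a rare compound) would have no ID to map to.
  2. Common pieces like ing, tion, or def recur across many words, languages, and code, so a fixed-size sub-word vocabulary compresses text efficiently.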
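The contrast between the two approaches can be sketched with a toy vocabulary. Everything here is hypothetical and chosen for illustration: the vocabularies, the function names, and the greedy longest-match splitting rule, which is a simple stand-in rather than how BPE actually segments text.

```python
import string

# Hypothetical toy vocabularies, for illustration only.
WORD_VOCAB = {"token": 0, "walking": 1}
PIECES = ["token", "walk", "ing", "tion"] + list(string.ascii_lowercase)
SUBWORD_VOCAB = {piece: i for i, piece in enumerate(PIECES)}


def encode_word_level(text, vocab):
    """Split on whitespace and look up each whole word."""
    ids = []
    for word in text.split():
        if word not in vocab:
            # This is the out-of-vocabulary problem: there is no ID to emit.
            raise KeyError(f"out-of-vocabulary word: {word!r}")
        ids.append(vocab[word])
    return ids


def split_subwords(word, vocab):
    """Greedy longest-match split (an assumption, not real BPE).

    Single letters are in the vocabulary, so any lowercase ASCII word
    can always be split, even if only into individual characters.
    """
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise KeyError(f"no piece covers {word[i]!r}")
    return pieces


def encode_subwords(word, vocab):
    """Map a word to the integer IDs of its sub-word pieces."""
    return [vocab[p] for p in split_subwords(word, vocab)]
```

With these toy vocabularies, `encode_word_level("tokenization", WORD_VOCAB)` raises a `KeyError`, while `split_subwords("tokenization", SUBWORD_VOCAB)` falls back to a mix of known pieces and single letters.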

Byte-Pair Encoding (BPE) in one paragraph

Start from raw bytes, so that any input can be encoded. Repeatedly find the most frequent adjacent pair of tokens in your training corpus and merge it into a new token, adding one vocabulary entry per merge. Stop when you hit your target vocabulary size. The result is a tokenizer where common substrings are single tokens and rare strings fall back to many small tokens.
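The training loop described above can be sketched in a few lines. This is a minimal sketch under simplifying assumptions: the function name `train_bpe` is hypothetical, the corpus is treated as one flat byte sequence, and the stopping rule is a merge count rather than a vocabulary size. Production tokenizers typically pre-split the text and count pairs much more efficiently.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from `corpus`, starting from raw bytes.

    Returns (merges, tokens): the learned pairs in order, and the corpus
    re-encoded with those merges applied.
    """
    # Start with one token per byte.
    tokens = [bytes([b]) for b in corpus.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging gains nothing
        merges.append((left, right))

        # Replace each occurrence of the pair, scanning left to right.
        merged, out, i = left + right, [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merges, tokens
```

For example, `train_bpe("aaabdaaabac", 3)` first merges the pair of `a` bytes, and after three merges the repeated substring `aaab` has become a single token while `d`, `a`, and `c` remain single bytes.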

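GPT models use byte-level BPE. SentencePiece (used by LLaMA and T5) is a library rather than a separate algorithm: it trains BPE or unigram models directly on raw Unicode text, with no language-specific pre-splitting, and encodes each space as a visible ▁ symbol so the original text can be reconstructed exactly.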

What this changes for you

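  • Cost and context windows are measured in tokens, not words. As a rough guide: 1 English word ≈ 1.3 tokens; 1 Chinese character ≈ 1 token; 100 lines of Python ≈ 600 tokens. These ratios shift with the tokenizer and the text, so count real tokens when the number matters.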
  • Spelling and digit arithmetic can fail in surprising ways: the model doesn't see 12345 as five digits; it sees one or two tokens.
  • Whitespace is significant. "cat" and " cat" (with a leading space) are often different tokens. This is why prompt formatting matters.
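The digit and whitespace effects in the list above can be illustrated with a pre-tokenization sketch. Many BPE tokenizers first cut text into chunks with a regular expression and only merge within each chunk. The pattern below is a simplified, hypothetical one, loosely modeled on that idea and limited to ASCII letters; it is not the pattern any particular model uses, and the chunks it produces are upper bounds on tokens, not final tokens.

```python
import re

# Simplified, hypothetical pre-tokenization pattern (an assumption):
#   " ?[A-Za-z]+"       a word, with an optional leading space attached
#   "\d{1,3}"           digits in groups of at most three
#   " ?[^\sA-Za-z\d]+"  punctuation, with an optional leading space
#   "\s+"               any remaining whitespace
PRE_TOKEN_PATTERN = re.compile(r" ?[A-Za-z]+|\d{1,3}| ?[^\sA-Za-z\d]+|\s+")


def pre_tokenize(text):
    """Split text into chunks; a tokenizer would merge only inside each chunk."""
    return PRE_TOKEN_PATTERN.findall(text)


# The second "cat" carries its leading space, so it is a different string
# from the first and would map to a different token ID.
print(pre_tokenize("cat cat"))

# A five-digit number is cut into two chunks, never five single digits.
print(pre_tokenize("12345"))
```

Because the leading space travels with the word, a prompt that ends in a trailing space or starts a word without one can produce a different token sequence from the one the model saw most often in training.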

Try it

Open tiktokenizer.vercel.app and paste in a paragraph. Notice where the token boundaries fall, and how your own name gets split.