Patches Over Tokens: How the Byte Latent Transformer Kills the Tokenizer

Ask GPT-4 how many times the letter "r" appears in "strawberry" and, for a long stretch of 2023 and 2024, it would confidently answer two. The model was not stupid. It simply never saw the letters. The word arrived as a small handful of subword tokens, each an opaque integer, and the spelling was gone before the first attention layer ever ran. Tokenization had already made the decision that spelling did not matter, and no amount of scale inside the model could undo a choice made outside it.

The Byte Latent Transformer (BLT), introduced by Artidoro Pagnoni and colleagues at Meta in December 2024 (Pagnoni et al., 2024, Byte Latent Transformer: Patches Scale Better Than Tokens, arXiv:2412.09871), is the first byte-level architecture to match a strong tokenizer-based model at the 8-billion-parameter scale while spending fewer FLOPs at inference. It does this not by making bytes cheap enough to process one at a time, but by learning where to group them. The grouping is driven by information content: predictable stretches of text collapse into long units, and surprising stretches get cut fine so the heavy part of the network can look closely.

Why this matters: A tokenizer is a compression scheme frozen before training begins, and it silently fixes how much compute each character receives. BLT replaces that fixed budget with a dynamic one keyed to the data itself, which is why it can be both cheaper and more robust than the token-based model it replaces.

TL;DR

Tokenization is a preprocessing step that compresses text into a fixed vocabulary of subword integers. It saves compute but discards character-level structure and hard-codes a compute budget per unit of text before the model exists.
BLT works directly on raw UTF-8 bytes (a vocabulary of 256) and groups them into patches whose boundaries are chosen dynamically at runtime.
Patch boundaries are set by a small entropy model: when the next byte is hard to predict, BLT starts a new patch; when text is predictable, patches grow long.
Compute is allocated where it is needed. A heavy latent global transformer runs once per patch, not once per byte, so long patches over predictable text are nearly free.
In a FLOP-controlled scaling study up to 8B parameters and 4T training bytes, BLT matches Llama 3 style tokenizer models and can trade patch size against model size to get better scaling at fixed inference cost.
Byte-level access buys robustness: better spelling and character manipulation, resistance to noisy input, and stronger performance on low-resource languages that BPE vocabularies underserve.
The idea did not appear from nowhere. MegaByte, MambaByte, and SpaceByte built the runway; H-Net (July 2025) pushes past BLT by making the segmentation itself fully learned.

At a Glance

BLT is best understood as a light-heavy-light sandwich. Two small byte-level networks sit at the edges, and one large transformer sits in the middle operating on a much shorter sequence of patches.

flowchart LR
  Bytes["Raw bytes<br/>(vocab 256)"] --> Enc["Local encoder<br/>light, per byte"]
  Enc --> Patch["Patch vectors<br/>short sequence"]
  Patch --> Global["Latent global transformer<br/>heavy, per patch"]
  Global --> Dec["Local decoder<br/>light, per byte"]
  Dec --> Out["Next bytes"]
  class Bytes blue
  class Enc,Dec purple
  class Patch slate
  class Global purple
  class Out teal
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

The whole design turns on one number: how many bytes land in each patch. Make patches long and the expensive middle transformer runs rarely. Make them short and the model looks carefully at difficult text. BLT's contribution is a principled way to decide, byte by byte, which regime applies.

[IMAGE: Annotated schematic of the three BLT modules stacked vertically, with sequence length shrinking from ~6000 bytes at the encoder to ~1000 patches at the global transformer and expanding back, arrows labelled with the cross-attention that bridges byte space and patch space]

Before Bytes: Why Tokenization Existed in the First Place

To see what BLT removes, start with what tokenization solved. Early neural language models operated on words, which forced an awkward choice: a vocabulary large enough to cover a language is enormous and still fails on names, typos, and code. Byte Pair Encoding, borrowed from data compression and adapted for NLP by Rico Sennrich and colleagues (Sennrich et al., 2016, Neural Machine Translation of Rare Words with Subword Units, arXiv:1508.07909), split the difference. BPE starts from characters and greedily merges the most frequent adjacent pairs until it reaches a target vocabulary size, so common words become single tokens and rare ones fall back to fragments. SentencePiece (Kudo and Richardson, 2018, arXiv:1808.06226) made the scheme language-agnostic and reversible.
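
The merge loop described above can be sketched in a few lines. This is a toy illustration on a hypothetical four-word corpus, not the reference implementation from Sennrich et al.; the function name `learn_bpe_merges` and the tie-breaking rule (the first pair seen wins) are assumptions made for the example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Toy BPE training: learn up to `num_merges` merges from a list of words."""
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Greedy step: pick the most frequent pair (ties go to the first seen).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical corpus: word frequencies chosen only for illustration.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, corpus = learn_bpe_merges(words, num_merges=3)
print(merges)
print(dict(corpus))
```

The point to notice is that the merge list is fixed once training ends. Every later occurrence of a word is split by replaying the same merges, whatever the context.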

The payoff is compute. A modern BPE vocabulary of 100,000 to 200,000 tokens packs roughly four bytes of English text into each token on average, so a transformer sees a sequence four times shorter than the raw bytes. Since attention cost grows with the square of sequence length, that shortening is not a minor optimization; it is the reason large models on long documents are affordable at all.
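
The arithmetic behind that claim is short enough to check directly. The sketch below assumes full (non-sparse) self-attention, a hypothetical 8,192-byte document, and the rough four-bytes-per-token average quoted above.

```python
def attention_pairs(seq_len):
    """Number of query-key pairs in full self-attention over seq_len positions."""
    return seq_len * seq_len


n_bytes = 8192          # hypothetical document length in bytes
bytes_per_token = 4     # rough English average, as quoted in the text
n_tokens = n_bytes // bytes_per_token

# Shrinking the sequence by 4x shrinks the pairwise attention work by 4^2.
ratio = attention_pairs(n_bytes) / attention_pairs(n_tokens)
print(n_tokens, ratio)
```

A fourfold shorter sequence therefore means sixteen times fewer attention pairs, which is the saving a byte-level model has to recover some other way.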

But the compression is lossy in a way that matters. The token boundaries are fixed by frequency statistics gathered before training, and they encode assumptions that leak into everything the model does.

timeline
  title From Characters to Learned Patches
  2016 : BPE subword units (Sennrich)
  2018 : SentencePiece language agnostic tokens
  2021 : ByT5 and CANINE token-free encoders
  2023 : MegaByte fixed-size byte patches
  2024 : MambaByte and SpaceByte byte models
  2024 : BLT entropy-based dynamic patches
  2025 : H-Net fully learned dynamic chunking

The token-free line of work is older than BLT. CANINE (Clark et al., 2021, arXiv:2103.06874) and ByT5 (Xue et al., 2021, ByT5: Towards a token-free future, arXiv:2105.13626) showed that transformers could read raw characters or bytes and gain robustness to noise and spelling, at a steep efficiency cost because every byte became a sequence position. ByT5 was competitive on noisy and multilingual tasks but slow, and it never displaced subword models for general pretraining. The open problem it left behind was simple to state and hard to solve: keep the byte-level access, lose the byte-level compute bill.

[IMAGE: Side-by-side before/after comparison of the string "def tokenize():" showing BPE splitting it into 5 opaque token IDs versus BLT keeping all 15 bytes visible and grouping them into 3 entropy-derived patches, with patch boundaries drawn as vertical bars]

How BLT Actually Works

BLT keeps the efficiency trick that made byte-level models viable (a small local model feeding a large global one) and adds the missing piece: the boundaries between local and global units are chosen by the data.

The compute problem, stated precisely

A transformer's cost per forward pass scales with the number of positions it processes. If a document is $N$ bytes and the heavy transformer runs on every byte, the global attention cost scales as $O(N^2)$ and the feed-forward cost as $O(N)$, with a per-position constant set by the model width. Tokenization reduces $N$ by a roughly fixed factor of about four. BLT reduces it by a factor equal to the average patch size, which is not fixed but chosen per region of text.

Let $P$ be the average number of bytes per patch. The heavy global transformer sees about $N/P$ positions. The two light modules still touch all $N$ bytes, but they are deliberately small, so the dominant cost sits in the global model and falls roughly as $1/P$. The design question becomes: how do you pick patch boundaries so that $P$ is large where it is safe to be coarse and small where detail matters?
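
A toy cost model makes the $1/P$ behavior concrete. The unit costs below are hypothetical placeholders rather than measured FLOPs, and the sketch deliberately ignores the quadratic attention term so that only the per-position effect is visible.

```python
def relative_cost(n_bytes, patch_size, global_cost=100.0, local_cost=5.0):
    """Toy per-document cost in arbitrary units.

    global_cost: assumed cost of one heavy-model position (one per patch).
    local_cost:  assumed combined cost of the two light modules per byte.
    """
    global_positions = n_bytes / patch_size
    return global_positions * global_cost + n_bytes * local_cost


for p in (1, 4, 8):
    print(p, relative_cost(1000, p))
```

The global term shrinks as patches grow, while the local term stays put. That fixed floor is why the light modules have to be kept genuinely small.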

Entropy patching: let surprise draw the lines

BLT answers with a separately trained, small byte-level language model whose only job is to estimate, for each position, how uncertain the next byte is. Concretely, that model gives a distribution over the 256 possible next bytes, and BLT computes its Shannon entropy:

\[H(x_i) = -\sum_{v=1}^{256} p(x_i = v \mid x_{<i}) \, \log p(x_i = v \mid x_{<i})\]

High entropy means the next byte is genuinely uncertain: the start of a new word, a rare name, the first digit of an unpredictable number. Low entropy means the continuation is nearly forced: the middle of a common word, the closing bytes of a familiar keyword. BLT places a patch boundary wherever the byte is surprising and lets predictable runs flow together.
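
To get a feel for the scale of these values, the entropy formula can be evaluated on two hand-made distributions over 256 bytes. The distributions are illustrative assumptions, not outputs of BLT's entropy model, and the natural logarithm (nats) is assumed because the formula above does not fix a base.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats of a discrete distribution (zero terms skipped)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Maximally uncertain: every one of the 256 byte values is equally likely.
uniform = [1.0 / 256] * 256
# Nearly forced: one byte value holds 99% of the probability mass.
peaked = [0.99] + [0.01 / 255] * 255

h_uniform = shannon_entropy(uniform)
h_peaked = shannon_entropy(peaked)
print(h_uniform, h_peaked)
```

The uniform case reaches the ceiling of $\log 256$ nats, while the peaked case sits close to zero. Boundary thresholds live somewhere between those two extremes.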

The paper uses two related rules for drawing a boundary at position $i$. The global-threshold rule starts a new patch when $H(x_i) > \theta_g$. The approximate monotonicity rule starts a new patch when entropy jumps sharply relative to the previous byte, $H(x_i) - H(x_{i-1}) > \theta_r$. The second rule catches the moment surprise spikes even when absolute entropy stays moderate. Both are computed in a single pass by the small entropy model before the main network runs.
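
Both rules reduce to a comparison per byte. The sketch below applies them to a made-up list of per-byte entropies; the function names, the threshold values, and the convention that the first byte always opens a patch are assumptions for illustration.

```python
def patch_starts_global(entropies, theta_g):
    """Indices that open a new patch under the global-threshold rule."""
    starts = [0]  # assumption: the first byte always opens a patch
    for i in range(1, len(entropies)):
        if entropies[i] > theta_g:
            starts.append(i)
    return starts


def patch_starts_relative(entropies, theta_r):
    """Indices that open a new patch when entropy jumps versus the previous byte."""
    starts = [0]
    for i in range(1, len(entropies)):
        if entropies[i] - entropies[i - 1] > theta_r:
            starts.append(i)
    return starts


# Hypothetical entropies (in nats) for ten consecutive bytes.
entropies = [3.0, 0.4, 0.2, 0.1, 2.5, 0.3, 0.2, 1.2, 1.1, 0.2]

global_starts = patch_starts_global(entropies, theta_g=2.0)
relative_starts = patch_starts_relative(entropies, theta_r=0.8)
print(global_starts, relative_starts)
```

In this toy input the byte at index 7 has an entropy of only 1.2, below the global threshold, yet it is a full nat above its predecessor. The relative rule cuts there and the global rule does not.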

flowchart TD
  Start["Read byte x_i"] --> Ent["Entropy model scores H(x_i)"]
  Ent --> Check{"H(x_i) over threshold<br/>or sharp jump?"}
  Check -->|Yes| Cut["Close current patch<br/>start new patch"]
  Check -->|No| Grow["Append byte to<br/>current patch"]
  Cut --> Next["Advance to x_i+1"]
  Grow --> Next
  Next --> Start
  class Start,Ent blue
  class Check amber
  class Cut rose
  class Grow emerald
  class Next slate
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

The elegance is that patching now tracks predictability in context rather than corpus-wide frequency. A tokenizer splits "unbelievable" the same way every time regardless of context. BLT can keep it in one patch when the surrounding text makes it predictable and slice it finely when the context makes its bytes hard to predict.

Three modules, two directions of attention

BLT's network is three transformers wired together by cross-attention rather than by reshaping.

The local encoder is a shallow byte-level transformer. It reads every byte, augments each with hashed n-gram embeddings that fold in local context (so a byte "knows" the few bytes around it), and then, using the patch boundaries from the entropy model, pools its byte representations into one vector per patch via a cross-attention step where patch queries attend to their constituent byte keys.
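
The hashed n-gram idea can be sketched without any learned weights: each byte position looks up one bucket per n-gram size, and the bucket ids would then index embedding tables whose rows are added to the byte's own embedding. The hash function, its constants, the n-gram sizes, and the table size below are all assumptions for illustration, not the paper's exact configuration.

```python
def ngram_bucket(ngram, table_size, base=257, modulus=(1 << 61) - 1):
    """Polynomial hash of a byte n-gram, folded into a table of `table_size` rows."""
    h = 0
    for b in ngram:
        # b + 1 so that zero bytes still change the hash state.
        h = (h * base + b + 1) % modulus
    return h % table_size


def ngram_bucket_ids(data, sizes=(3, 4, 5), table_size=50_000):
    """For each byte position, map n -> bucket id of the n-gram ending at that byte."""
    ids = []
    for i in range(len(data)):
        row = {}
        for n in sizes:
            if i + 1 >= n:  # only when there is enough left context
                row[n] = ngram_bucket(data[i + 1 - n : i + 1], table_size)
        ids.append(row)
    return ids


bucket_ids = ngram_bucket_ids(b"return total")
for position, row in enumerate(bucket_ids):
    print(position, row)
```

Because the tables are hashed rather than enumerated, unrelated n-grams can collide in the same bucket. The design accepts that noise in exchange for a fixed table size.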

The latent global transformer is the heavy model and holds most of the parameters and depth. It sees only the patch vectors, a sequence of length $N/P$, and does the real language modeling: long-range attention, reasoning, world knowledge. Everything expensive happens here, and it happens on the short sequence.

The local decoder is a second shallow byte-level transformer. It takes the processed patch vectors and, through cross-attention in the opposite direction (byte queries attending to patch keys), expands them back out to per-byte predictions, generating the actual next bytes one at a time.

graph TD
  In["Byte stream"] --> LE["Local encoder<br/>byte transformer + n-gram hash"]
  LE -->|"cross-attn<br/>bytes to patches"| PV["Patch vectors"]
  PV --> GT["Latent global transformer<br/>bulk of parameters"]
  GT --> PV2["Refined patch vectors"]
  PV2 -->|"cross-attn<br/>patches to bytes"| LD["Local decoder<br/>byte transformer"]
  LD --> Out["Next-byte predictions"]
  EM["Entropy model<br/>small byte LM"] -.->|"patch boundaries"| LE
  EM -.->|"patch boundaries"| LD
  class In blue
  class LE,LD,GT purple
  class PV,PV2 slate
  class EM amber
  class Out teal
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

Notice what is absent: an embedding table indexed by a 128,000-entry vocabulary, and its matching output projection. BLT's input vocabulary is 256. The parameters that a token model spends on a giant embedding matrix, often hundreds of millions of them, are freed to go into the layers that actually compute.
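
A back-of-the-envelope count shows the size of that saving. The hidden width of 4096 is an assumed value for illustration, and the sketch counts an untied input embedding plus output projection.

```python
def embedding_params(vocab_size, d_model, tied=False):
    """Parameters in the input embedding plus the output projection."""
    table = vocab_size * d_model
    return table if tied else 2 * table


d_model = 4096  # assumed hidden width, not taken from the text

token_params = embedding_params(128_000, d_model)
byte_params = embedding_params(256, d_model)
print(f"{token_params:,} vs {byte_params:,}")
```

This counts only the 256-entry byte table. The hashed n-gram embedding tables used by the local encoder add parameters of their own, so the net saving depends on how large those tables are made.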

[IMAGE: Anatomy figure of the local encoder cross-attention, showing 8 byte vectors on the left flowing into 2 patch query vectors on the right, with attention weights drawn as edge thickness so the pooling is visible]

Seeing It in Motion

The two most confusing parts of BLT are how a boundary decision made by one model steers a second model, and how generation proceeds when the network predicts bytes but reasons over patches. A sequence view makes both concrete.

sequenceDiagram
  participant Bytes as Byte stream
  participant Entrpy as Entropy model
  participant Encdr as Local encoder
  participant Globl as Global transformer
  participant Decdr as Local decoder
  Bytes->>Entrpy: score next-byte uncertainty
  Entrpy->>Encdr: patch boundaries
  Encdr->>Encdr: embed bytes, add n-gram context
  Encdr->>Globl: pool bytes into patch vectors
  Globl->>Globl: attend across patches
  Globl->>Decdr: refined patch vectors
  Decdr->>Bytes: predict next byte
  Note over Bytes,Decdr: repeat per generated byte, reusing patch cache

During generation the process is incremental. Each new byte is scored by the entropy model, which decides whether it extends the current patch or opens a new one. If the patch is still open, the decoder can often predict the next byte cheaply without re-running the global transformer, because the governing patch vector has not changed. When a boundary triggers, the global model runs once to produce a fresh patch vector. This is where the inference savings come from in practice: on predictable text, the expensive model sits idle for stretches of bytes.

stateDiagram-v2
  [*] --> InPatch
  InPatch --> InPatch: low-entropy byte, extend patch
  InPatch --> Boundary: high-entropy byte
  Boundary --> RunGlobal: new patch vector needed
  RunGlobal --> InPatch: global transformer step
  InPatch --> [*]: sequence ends

The state machine captures the core trade the architecture makes at runtime: most bytes keep you in the cheap InPatch loop, and only surprising bytes pay for a RunGlobal step.
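
The state machine can be traced over a short, made-up entropy sequence to see how rarely the heavy step fires. The state names follow the diagram; the entropies, the threshold, and the use of the global-threshold rule alone are assumptions for illustration.

```python
def trace_states(entropies, theta_g):
    """Label each byte with the state it triggers in the generation loop."""
    trace = []
    for i, h in enumerate(entropies):
        if i == 0 or h > theta_g:
            # A boundary: the global transformer must produce a new patch vector.
            trace.append("RunGlobal")
        else:
            # Still inside the open patch: only the light decoder runs.
            trace.append("InPatch")
    return trace


# Hypothetical per-byte entropies for eight generated bytes.
entropies = [3.0, 0.4, 0.2, 0.1, 2.5, 0.3, 0.2, 0.1]
trace = trace_states(entropies, theta_g=2.0)
global_fraction = trace.count("RunGlobal") / len(trace)
print(trace, global_fraction)
```

Here the heavy model runs on two of the eight bytes. On noisier text the same loop would trip the boundary far more often.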

By the Numbers

The quantitative case for BLT rests on the relationship between patch size and compute, and on the FLOP-controlled comparison against tokenizer models. The figures below come from the BLT paper unless labelled as approximate context.

Quantity	Token-based LLM (BPE)	BLT (bytes plus patches)
Input vocabulary	~100k to 200k subwords	256 bytes
Units the heavy model sees	~1 per 4 bytes (fixed)	~1 per patch, dynamic
Reported average patch size	not applicable	roughly 4 to 8 bytes, data dependent (approx.)
Embedding table parameters	hundreds of millions	negligible
Inference FLOP change vs. matched token model	baseline	up to about 50% fewer
Largest scale studied	matched	8B parameters, 4T training bytes

The mechanism that produces the FLOP saving is worth stating as an equation. If a token model processes $N/4$ positions (four bytes per token) and BLT processes $N/P$ patches, the ratio of heavy-model positions is:

\[\frac{\text{BLT positions}}{\text{token positions}} = \frac{N/P}{N/4} = \frac{4}{P}\]

When the average patch grows past four bytes, BLT runs its expensive model on fewer positions than the token model does, which is how it can come out ahead on inference cost even though it also pays for two small byte-level networks. The paper's headline is that this is not a fixed win: because the vocabulary no longer pins the sequence length, BLT can grow patch size and model size together at a fixed inference budget, giving a scaling curve that pulls away from tokenizer models rather than running parallel to them.
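
Plugging a few patch sizes into the ratio shows where the crossover sits. The patch sizes are illustrative, and the sketch compares heavy-model positions only, leaving out the cost of the two light byte-level modules.

```python
from fractions import Fraction


def heavy_position_ratio(avg_patch_bytes, bytes_per_token=4):
    """BLT heavy-model positions divided by token-model positions (4 / P)."""
    return Fraction(bytes_per_token, avg_patch_bytes)


for p in (4, 6, 8):
    print(p, heavy_position_ratio(p))
```

At an average of four bytes per patch the two models tie on positions. At eight, BLT's global transformer runs on half as many.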

The robustness evidence is summarized here qualitatively, but it is consistent. Because the model sees characters, it performs markedly better on tasks that require manipulating them (spelling, character counting, case transformation) of the kind collected in the CUTE benchmark (Edman et al., 2024, arXiv:2409.15452). It degrades more gracefully on corrupted or noisy input, and it narrows the gap on low-resource languages whose scripts are poorly represented by a BPE vocabulary tuned on English-heavy data.

[IMAGE: Log-log plot of inference FLOPs versus model performance, showing the BLT scaling line diverging below the BPE line as patch and model size grow together, annotated with the crossover point near patch size 4]

A Concrete Example

Take the Python fragment `    return total  # done`. Written out with its four leading spaces and the two spaces before the comment, it is 24 bytes. Walk it through both systems.

A BPE tokenizer trained on code might split this into roughly seven tokens: the indentation whitespace, return, a space, total, two spaces, the #, and done. Seven positions for the heavy model, and the individual characters of return and total are invisible; the model cannot, from the token alone, know that total starts with "t".

Now BLT. The entropy model scans the bytes. The four leading spaces are trivially predictable after the first, so entropy collapses and they fold into one long patch. The "r" that begins return is a surprise (a new token could begin many ways), so a boundary opens there; but "eturn" is nearly forced once "r" appears in this context, so those bytes flow into the same patch. The space, then "total", behave similarly: a boundary at "t", a low-entropy tail. The comment marker # is surprising and opens a patch; " done" trails predictably behind it.

Region	Bytes	Entropy signal	BLT action
Indentation	4 spaces	very low after first	one long patch
`return` plus trailing space	7 bytes	spike at "r", flat after	boundary at "r", then extend
`total` plus two trailing spaces	7 bytes	spike at "t", flat after	boundary at "t", then extend
`# done`	6 bytes	spike at "#", flat after	boundary at "#", then extend

The result is four patches over 24 bytes, an average patch size of six. The heavy transformer runs four times instead of the token model's seven, and every byte remains individually addressable by the local decoder. If a downstream task asks the model to rename total to count, BLT is operating in exactly the representation where that edit lives; the token model has to reason about opaque integers whose spelling it was never shown.
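
The byte and patch counts can be checked mechanically. The sketch below hard-codes the three boundaries from the walkthrough (at the "r" of `return`, the "t" of `total`, and the "#") instead of running an entropy model. It also assumes that whitespace trailing a word stays in that word's patch, so its per-patch byte counts follow from that attribution.

```python
# Four leading spaces, one space after "return", two spaces before "#".
line = b"    return total  # done"

# Patch starts: byte 0, then the three boundary bytes from the walkthrough.
starts = [0, line.index(b"return"), line.index(b"total"), line.index(b"#")]
ends = starts[1:] + [len(line)]
patches = [line[s:e] for s, e in zip(starts, ends)]

sizes = [len(p) for p in patches]
average = len(line) / len(patches)
print(patches, sizes, average)
```

Moving any single boundary changes the individual patch sizes but not the total. The average of six bytes per patch depends only on there being four patches over 24 bytes.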

The trade is visible in the same example. Those four leading spaces became one patch, but if the entropy thresholds were set slightly differently, they might have merged with `return` or split into two patches, changing the compute profile. Patch boundaries are tunable, and the tuning interacts with everything downstream.

Where It Breaks

BLT is not free of sharp edges, and the paper is candid about several.

The entropy model is a dependency and a failure surface. It is a separate small network that must be trained and must run before or alongside the main model. If its uncertainty estimates are poorly calibrated, patches are drawn in the wrong places: too coarse over content that needed attention, too fine over boilerplate, wasting the compute BLT was supposed to save. The segmentation is a heuristic bolted in front of the model rather than a fully learned part of it, which is precisely the seam that later work targets.

Variable-length patches complicate the systems layer. Standard transformer inference assumes fixed-size units and batches them into neat tensors. When patch sizes vary across a batch and across positions, keeping GPUs saturated requires care, and the KV-cache bookkeeping is messier than in a token model where every step advances by exactly one unit. The theoretical FLOP savings only become wall-clock savings with an implementation that handles this ragged structure efficiently.
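
One way to see the problem is to measure how much of a batch is padding when ragged sequences are padded out to a rectangle. This is a deliberately naive batching scheme assumed for illustration, not a description of how any BLT implementation batches, and the patch counts are made up.

```python
def padding_waste(patches_per_sequence):
    """Fraction of slots that are padding after padding every row to the longest."""
    longest = max(patches_per_sequence)
    total_slots = longest * len(patches_per_sequence)
    used_slots = sum(patches_per_sequence)
    return 1 - used_slots / total_slots


# Hypothetical batch: three inputs of equal byte length that the entropy
# model happened to cut into different numbers of patches.
patch_counts = [4, 9, 6]
waste = padding_waste(patch_counts)
print(round(waste, 3))
```

In a token model, inputs with the same token count fill the rectangle exactly. Here nearly a third of the global transformer's slots would be padding, which is the gap an efficient ragged implementation has to close.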

There is also an ecosystem cost. A decade of infrastructure assumes tokens: context windows are quoted in tokens, prices are per token, datasets are pre-tokenized, and evaluation harnesses count tokens. A byte-patch model does not slot cleanly into that world, and "how long is this document" no longer has a fixed answer because it depends on how the entropy model happened to patch it.

Finally, the win is regime-dependent. On text that is already highly compressible, BLT's long patches shine. On dense, high-entropy content (code with many rare identifiers, heavily multilingual text, mathematical notation), patches stay short, average patch size falls toward the token model's ratio, and the efficiency advantage narrows. BLT is strongest exactly where tokenizers waste the most, and closest to parity where they were already efficient.

Alternative Designs

BLT sits in a lineage of attempts to delete or dynamize tokenization, and it is both fairer and more useful to see the alternatives side by side.

Approach	How it segments	Strengths	Weaknesses
BPE tokenizer	Fixed merges from frequency	Cheap, mature tooling, strong baselines	Frozen boundaries, no character access, poor on rare scripts
ByT5 / CANINE	No segmentation, pure bytes/chars	Maximum robustness, simplest pipeline	Long sequences, expensive at scale
MegaByte	Fixed-size patches	Sub-quadratic, simple, parallel decoding	Boundaries ignore content, splits mid-word
SpaceByte	Boundaries at spaces and delimiters	Cheap heuristic aligned to words	Fails on scripts without spaces, still a heuristic
MambaByte	State space model over bytes	Linear scaling in length, no attention blowup	Different architecture, patching not the focus
BLT	Entropy-driven dynamic patches	Content-aware, matches tokenizers at 8B, robust	Separate entropy model, ragged systems layer
H-Net	Fully learned dynamic chunking	End-to-end, no external segmenter	Newer, heavier training machinery

MegaByte (Yu et al., 2023, MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers, arXiv:2305.07185) is BLT's direct ancestor: the same local-global patch idea, but with patches of a fixed size, so a boundary can land in the middle of a word and the model cannot spend more compute on hard regions. SpaceByte (Slagle, 2024, arXiv:2404.14408) improved on that by putting boundaries at spaces, a cheap and surprisingly effective heuristic for space-delimited languages, though it collapses for scripts like Chinese that do not delimit words with spaces. MambaByte (Wang et al., 2024, MambaByte: Token-free Selective State Space Model, arXiv:2401.13660) attacked the length problem from a different direction entirely, using a Mamba state space backbone whose cost grows linearly rather than quadratically with sequence length, so it can afford to stay at the byte level without patching at all.
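
The difference between fixed-size and space-based boundaries is easy to see on a short string. Both functions below are simplified stand-ins: the fixed-size cutter mirrors the idea behind MegaByte's patches, and the space rule is a reduced version of SpaceByte's heuristic that treats only the ASCII space as a delimiter.

```python
def fixed_patches(data, size):
    """Cut the byte string into consecutive chunks of `size` bytes."""
    return [data[i : i + size] for i in range(0, len(data), size)]


def space_patches(data):
    """Start a new patch at each non-space byte that follows a space (simplified)."""
    patches, start = [], 0
    for i in range(1, len(data)):
        if data[i - 1 : i] == b" " and data[i : i + 1] != b" ":
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches


text = b"the cat sat on the mat"
print(fixed_patches(text, 4))   # fixed cuts drift out of step with the words
print(space_patches(text))      # one patch per word

# A script without spaces gives the space rule nothing to cut on.
chinese = "你好世界".encode("utf-8")
print(len(chinese), len(space_patches(chinese)))
```

The fixed cutter splits "the" and "mat" across patches as soon as word lengths stop matching the patch size. The space rule keeps words whole but returns the entire Chinese string as a single patch.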

The most important successor is H-Net (Hwang, Wang, and Gu, 2025, Dynamic Chunking for End-to-End Hierarchical Sequence Modeling, arXiv:2507.07955). It makes the move BLT stops just short of: rather than a separate entropy model deciding boundaries, H-Net learns the segmentation jointly with the rest of the network through a differentiable dynamic-chunking mechanism, and it can stack the idea hierarchically so that chunks of bytes form chunks of chunks. Where BLT's patching is a smart heuristic in front of the model, H-Net's chunking is part of the model, trained by the same gradients. Early results report byte-level H-Nets outperforming strong tokenized baselines across languages and modalities, which suggests the direction BLT opened is far from exhausted.

[IMAGE: Comparison matrix heatmap with rows MegaByte / SpaceByte / MambaByte / BLT / H-Net and columns for content-aware, hierarchical, learned-boundaries, handles-no-space-scripts, matched-at-8B, shaded to show BLT and H-Net dominating the later columns]

How It Is Used in Practice

BLT is a research architecture, not yet a default in shipping products, and honesty requires saying so plainly. The comparison points are internal FLOP-controlled studies, not a fleet of deployed byte-level assistants. Meta released model weights and code for BLT, which lets others reproduce the scaling claims and probe the robustness behavior directly rather than taking them on faith.

The near-term practical pull is strongest in three places. Multilingual and low-resource settings benefit because byte access sidesteps the vocabulary bias that makes BPE inefficient on underrepresented scripts, where a single character can cost several tokens. Code and structured text benefit because character-level edits, exact string matching, and format-sensitive generation happen in the model's native representation rather than being reconstructed from tokens. And any pipeline that has to survive messy input (user typos, mixed encodings, adversarial Unicode) gains from a model that never had a brittle preprocessing step to break in the first place.

The operational considerations are the flip side of the failure modes. A production BLT system needs an inference stack that handles variable patch lengths without stalling the GPU, a training pipeline that keeps the entropy model in sync with the main model, and monitoring that reasons about bytes and patches rather than tokens. These are solvable engineering problems, but they are not the problems the current ecosystem is tooled for, and that gap is a real adoption cost independent of the model's quality.

Insights Worth Remembering

A tokenizer is a compute-allocation policy in disguise. It decides, before training, how many FLOPs each stretch of text receives, and that decision is frozen for the life of the model.
BLT's core move is to replace a static allocation with a dynamic one keyed to entropy, spending compute in proportion to how surprising the data is.
The efficiency and the robustness are the same idea seen from two sides. Keeping bytes visible is what enables character-level competence; grouping them by surprise is what keeps it affordable.
Long patches are the payoff and the risk. They deliver the FLOP savings, but a mis-drawn boundary over important content is compute spent in the wrong place.
The vocabulary was carrying hidden parameters. Deleting a 128k-entry embedding table frees hundreds of millions of parameters to move into the layers that compute.
BLT wins most where tokenizers waste most and approaches parity where they were already efficient, so the benefit is a function of your data, not a constant.
The heuristic seam (a separate entropy model) is the obvious thing to make learnable, which is exactly the thread H-Net pulls.

Open Questions

The measured results establish that byte-level models can match tokenizer models at 8B parameters under FLOP control. What remains open is how far the curve bends. BLT's scaling study stops at 8B and 4T bytes; whether the crossover advantage holds, grows, or shrinks at frontier scale is not yet shown, and it is the question that decides whether byte-level pretraining becomes standard or stays a specialist tool.

Whether the entropy model should exist at all is contested. BLT treats segmentation as a preprocessing decision made by a separate network; H-Net argues it should be learned end to end. The evidence so far favors more learning and less heuristic, but H-Net is new and its training cost and stability at scale are still being characterized, so calling the debate settled would be premature.

The systems question is genuinely unresolved. Variable-length patches make theoretical FLOP savings hard to convert into wall-clock and hardware-utilization savings, and how much of BLT's advantage survives contact with production inference kernels is an empirical matter that the published work does not fully answer. It is plausible that the biggest near-term gains come from co-designing the patching scheme with the serving stack rather than from the model architecture alone.

Finally, there is the multimodal prize. If a model reasons over learned chunks of raw bytes, the same machinery could in principle ingest bytes of any kind (text, audio, images) without modality-specific tokenizers. Both the BLT and H-Net lines gesture at this, and it is the most exciting possibility the work opens, but it remains a direction rather than a demonstrated result.