Structured Generation and Constrained Decoding: Making LLMs Predictable
June 02, 2026 · 25 min read
Language models are, by construction, unreliable format producers. Ask GPT-4 or Claude for JSON and you will usually get JSON. Ask ten thousand times in a production pipeline and you will get JSON with trailing commas, missing closing braces, hallucinated fields, and the occasional apologetic paragraph explaining why it cannot comply. The fix is not better prompts. The fix is constrained decoding: a family of techniques that intervene at the token-sampling step to make structurally invalid output impossible, not unlikely.
This matters because the entire agentic stack, every tool call, every function invocation, every data extraction pipeline, depends on the model producing output that machines can parse without error handling that dwarfs the model call itself. Constrained decoding turns language models from probabilistic text generators into reliable structured-data engines.
Why this matters: Every LLM-powered product that calls a function, fills a database row, or drives an agent loop depends on structured output. When that output is malformed, retries spike latency, cost doubles, and downstream systems break silently. Constrained decoding eliminates this entire failure class by construction, not by hope.
TL;DR
- Constrained decoding masks invalid tokens at every generation step, guaranteeing that the output conforms to a target grammar or schema.
- The core mechanism converts a JSON Schema (or regex, or context-free grammar) into a finite-state machine, then uses that FSM to compute a token mask before each sampling step.
- OpenAI, Anthropic, and Google all ship native structured output modes that use constrained decoding server-side. Open-source engines (Outlines, Guidance, XGrammar, SGLang) do the same locally.
- Performance overhead is negligible in modern implementations: under 50 microseconds per token for grammar checking, against 10-50 milliseconds per token for model inference.
- Constrained decoding guarantees syntactic validity but not semantic correctness. The model can produce perfectly formatted JSON that contains wrong answers.
- Quality degradation is real but measurable: masking high-probability tokens distorts the model's distribution, and recent algorithms like ASAp (Park et al., NeurIPS 2024) address this with provably correct renormalization.
- Coalescence and jump-forward decoding exploit deterministic structure to skip LLM calls entirely, making constrained generation faster than unconstrained generation in many cases.
- The Pydantic-to-schema-to-constraint pipeline has become the standard integration pattern, with libraries like Instructor providing retry logic and validation across 15+ providers.
At a Glance
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart LR
subgraph Input["Developer Specification"]
PS["Pydantic model"]
JS["JSON Schema"]
RX["Regex pattern"]
CFG["Context-free grammar"]
end
subgraph Compile["Compilation (offline)"]
FSM["Build FSM / PDA"]
IDX["Index: state -> valid tokens"]
end
subgraph Decode["Constrained Decoding (per token)"]
LG["Model produces logits"]
MK["Mask invalid tokens"]
SM["Sample from valid set"]
ADV["Advance FSM state"]
end
subgraph Output["Guaranteed Output"]
VJ["Valid JSON"]
TC["Correct tool call"]
SE["Schema-compliant extraction"]
end
PS --> JS
JS --> FSM
RX --> FSM
CFG --> FSM
FSM --> IDX
IDX --> MK
LG --> MK
MK --> SM
SM --> ADV
ADV -->|"next token"| MK
SM --> Output
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
class PS,JS,RX,CFG blue
class FSM,IDX purple
class LG,MK,SM,ADV teal
class VJ,TC,SE emerald
Before Constrained Decoding
For most of the LLM era, getting structured output from a language model meant asking nicely and hoping. The history of structured generation tracks a steady shift from "please format this correctly" to "you cannot format this incorrectly."
%%{init: {'theme': 'base', 'themeVariables': {'cScale0': '#1e40af', 'cScale1': '#6d28d9', 'cScale2': '#b45309', 'cScale3': '#be123c', 'cScale4': '#047857', 'cScale5': '#0e7490', 'cScale6': '#1e40af', 'cScaleLabel0': '#e2e8f0', 'cScaleLabel1': '#e2e8f0', 'cScaleLabel2': '#e2e8f0', 'cScaleLabel3': '#e2e8f0', 'cScaleLabel4': '#e2e8f0', 'cScaleLabel5': '#e2e8f0', 'cScaleLabel6': '#e2e8f0', 'textColor': '#e2e8f0', 'lineColor': '#94a3b8', 'fontSize': '16px'}}}%%
timeline
title From Hope to Guarantee: Structured Output Evolution
2020-2022 : Prompt engineering era
: "Return valid JSON" in system prompts
: Regex post-processing and retry loops
: PICARD (Scholak et al.) constrains text-to-SQL
2023 : OpenAI ships JSON mode (Nov 2023) - valid JSON, no schema enforcement
: Willard and Louf publish FSM-based guided generation (arXiv 2307.09702)
: Outlines library launches with regex-to-FSM compilation
: Microsoft Guidance introduces token healing
: LMQL provides SQL-like constraint syntax from ETH Zurich
2024 : OpenAI Structured Outputs with JSON Schema enforcement (Aug 2024)
: Google Gemini adds response_schema at Google I/O (May 2024)
: SGLang compressed FSM with jump-forward decoding (NeurIPS 2024)
: Park et al. Grammar-Aligned Decoding addresses quality loss (NeurIPS 2024)
: XGrammar ships pushdown automata for context-free grammars
2025-2026 : Anthropic Claude adds strict structured outputs (Nov 2025)
: XGrammar becomes default backend for vLLM, SGLang, TensorRT-LLM
: Coalescence makes structured generation faster than unconstrained
: Sub-40 microsecond per-token overhead achieved
The prompt engineering era (2020-2022) relied on instructions like "You must return valid JSON" combined with post-generation parsing and retry loops. This worked tolerably for demos. In production, with thousands of concurrent requests, a 5% parse failure rate meant hundreds of retries per minute, each doubling latency and cost.
OpenAI's JSON mode (November 2023) guaranteed syntactically valid JSON but enforced no schema. You got valid JSON, but it might have the wrong fields, wrong types, or an entirely unexpected structure. The gap between "valid JSON" and "JSON matching my schema" turned out to be enormous.
The real breakthrough came from the research side. Willard and Louf's 2023 paper, "Efficient Guided Generation for Large Language Models" (arXiv:2307.09702), demonstrated that autoregressive text generation can be reformulated as transitions between states of a finite-state machine. This insight, implemented in the Outlines library, made schema-guaranteed output a practical reality for any model, not just proprietary APIs.
[IMAGE: Side-by-side comparison showing a prompt-based JSON generation attempt with malformed output on the left, and constrained decoding producing guaranteed-valid output on the right. Caption: "The difference between asking for structure and enforcing it."]
How Constrained Decoding Actually Works
The mechanism is elegant. A language model generates text one token at a time by producing a probability distribution (logits) over its entire vocabulary, typically 32,000 to 128,000 tokens. Constrained decoding inserts a logit processor between the model's output layer and the sampling step. This processor checks each candidate token against a grammar state and sets the logits of all invalid tokens to negative infinity before sampling occurs.
The Core Math
At each generation step \(t\), the model produces raw logits \(z_t \in \mathbb{R}^{|V|}\) over vocabulary \(V\). Standard sampling converts these to probabilities via softmax:
\[P(v_i | x_{<t}) = \frac{\exp(z_{t,i})}{\sum_{j=1}^{|V|} \exp(z_{t,j})}\]
Constrained decoding defines a set of valid tokens \(V_t^{\text{valid}} \subseteq V\) based on the current grammar state \(s_t\), then applies a mask:
\[\tilde{z}_{t,i} = \begin{cases} z_{t,i} & \text{if } v_i \in V_t^{\text{valid}} \\ -\infty & \text{otherwise} \end{cases}\]
After masking, the probabilities are renormalized over valid tokens only:
\[P_{\text{constrained}}(v_i | x_{<t}) = \begin{cases} \frac{\exp(z_{t,i})}{\sum_{v_j \in V_t^{\text{valid}}} \exp(z_{t,j})} & \text{if } v_i \in V_t^{\text{valid}} \\ 0 & \text{otherwise} \end{cases}\]
The model's weights are never modified. Only the sampling distribution changes. This is why constrained decoding is model-agnostic: it works with any autoregressive language model.
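The masking and renormalization formulas can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's logit processor: the function name `constrained_probs`, the four-token vocabulary, and the valid set are all made up for the example.

```python
import math

def constrained_probs(logits, valid_ids):
    """Softmax over `logits` restricted to `valid_ids`; masked tokens get probability 0."""
    if not valid_ids:
        raise ValueError("grammar state has no valid tokens")
    # Apply the mask: invalid positions become -inf.
    masked = [z if i in valid_ids else -math.inf for i, z in enumerate(logits)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # toy vocabulary of four tokens (hypothetical values)
valid = {1, 2}                  # assume the grammar allows only tokens 1 and 2 here
probs = constrained_probs(logits, valid)
```

Token 0 has the highest raw logit but ends up with probability zero, while tokens 1 and 2 keep their relative odds. That is the whole intervention: the model's preference among valid tokens is preserved, and everything else is removed.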
[IMAGE: Diagram showing a vocabulary heatmap where most tokens are grayed out (masked) and only a handful of valid tokens retain their probability mass. Caption: "Token masking in action: at each step, only grammar-valid tokens retain probability."]
From Schema to Finite-State Machine
The practical question is: how do you compute \(V_t^{\text{valid}}\) efficiently? The answer involves a compilation pipeline that converts high-level specifications into state machines.
Step 1: Schema to Regex. A JSON Schema defines allowed types, field names, nesting, and constraints. This gets compiled into a (potentially very long) regular expression that matches all and only the valid JSON strings conforming to that schema. Libraries like Outlines use this conversion, handling quantifiers, optional fields, and enum values.
Step 2: Regex to FSM. Regular expressions are mathematically equivalent to finite-state automata. The interegular library converts the regex into a deterministic finite automaton (DFA) where each state represents a position in the valid output, and transitions correspond to characters.
Step 3: FSM to Token Index. Here is where the tokenization challenge appears. LLMs do not generate characters; they generate tokens, which are variable-length byte sequences produced by BPE (Byte Pair Encoding). The string "name" might be a single token, or it might be tokenized as "na" + "me", depending on context. The system must compute, for each FSM state, which tokens in the vocabulary would produce a valid transition. This precomputation creates an index: a mapping from each state to its set of valid token IDs.
Step 4: Runtime Masking. During generation, the system tracks the current FSM state. At each step, it looks up the valid token set for that state (an O(1) hash map operation), constructs the mask, applies it, and advances the state after sampling.
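Steps 2 through 4 can be sketched with a hand-written character-level DFA and a toy vocabulary. The DFA below (for the regex `-?[0-9]+`), the vocabulary, and the helper names are assumptions for illustration; real libraries build the automaton from the schema and index the model's full tokenizer vocabulary.

```python
DIGITS = "0123456789"

# Character-level DFA for the regex -?[0-9]+ (state 2 is the accepting state).
DFA = {
    0: {"-": 1, **{d: 2 for d in DIGITS}},
    1: {d: 2 for d in DIGITS},
    2: {d: 2 for d in DIGITS},
}

# Toy vocabulary with multi-character tokens, as a BPE tokenizer would have.
VOCAB = ["-", "1", "2", "12", "-1", "--", "a", "1a"]

def walk(state, text):
    """Follow `text` character by character; return the end state, or None if stuck."""
    for ch in text:
        state = DFA.get(state, {}).get(ch)
        if state is None:
            return None
    return state

def build_index(dfa, vocab):
    """Precompute, for each state, {token_id: next_state} for every usable token."""
    index = {}
    for state in dfa:
        index[state] = {}
        for token_id, token in enumerate(vocab):
            end = walk(state, token)
            if end is not None:
                index[state][token_id] = end
    return index

INDEX = build_index(DFA, VOCAB)

def mask_logits(logits, state):
    """Runtime step: one dictionary lookup, then mask everything not in the index."""
    allowed = INDEX[state]
    return [z if i in allowed else float("-inf") for i, z in enumerate(logits)]
```

The index is built once per schema; at generation time the per-step work is a lookup plus the mask. Note that the token `-1` is valid from the start state even though it spans two DFA transitions, which is exactly the character-versus-token mismatch the index exists to resolve.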
[IMAGE: Pipeline diagram showing JSON Schema flowing through regex compilation, FSM construction, token index building, and finally runtime masking during generation. Caption: "The compilation pipeline: from developer-facing schema to inference-time token mask."]
Beyond Regular Languages: Pushdown Automata
Pure FSMs handle regular languages, but JSON is not regular. Nested objects and arrays require tracking depth, which regular expressions cannot express. The solution: pushdown automata (PDA), essentially FSMs augmented with a stack.
XGrammar, which has become the default structured generation backend for vLLM, SGLang, and TensorRT-LLM, implements this approach. It models JSON generation as a collection of FSMs where the stack tracks nesting context, and each FSM handles one level of the structure. When the model opens a new object or array, the PDA pushes a new FSM onto the stack. When it closes one, it pops.
This distinction matters in practice. Early FSM-only approaches could handle flat JSON reliably but struggled with deeply nested schemas. PDA-based approaches handle arbitrary nesting at the cost of slightly more complex state management.
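The role of the stack can be shown with a deliberately reduced sketch that tracks only the bracket skeleton of a JSON document, ignoring keys, values, and commas. The function names are hypothetical, and the rule that a new object or array may always be opened is a simplification; this is not how XGrammar is implemented.

```python
CLOSER = {"{": "}", "[": "]"}

def allowed_structural_chars(stack):
    """Bracket characters that are valid next, given the stack of open brackets."""
    allowed = set(CLOSER)  # simplification: a new object or array may always open
    if stack:
        allowed.add(CLOSER[stack[-1]])  # only the innermost open bracket may close
    return allowed

def advance(stack, ch):
    """Return the new stack after consuming `ch`, or raise if `ch` is not allowed."""
    if ch not in allowed_structural_chars(stack):
        raise ValueError(f"{ch!r} is not valid here")
    if ch in CLOSER:
        return stack + [ch]  # push on open
    return stack[:-1]        # pop on close

def is_balanced(text):
    """True if `text` (brackets only) opens and closes in a properly nested order."""
    stack = []
    try:
        for ch in text:
            stack = advance(stack, ch)
    except ValueError:
        return False
    return not stack
```

A finite-state machine would need a separate state for every possible nesting depth; the stack lets the same few rules apply at any depth, which is why recursive schemas need the PDA formulation.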
Token Healing
There is a subtle problem at the boundary between the prompt and the generated text, and more generally wherever a constraint forces a particular token path. When the "natural" tokenization would violate the grammar or is cut off by the end of the prompt, the model may be pushed into a non-canonical tokenization that it rarely encountered during training. Microsoft's Guidance library introduced token healing to address this: it removes the last token of the prompt and constrains the first generated token to start with the removed token's text, so the model can choose the tokenization it would naturally have used across the boundary. This small correction measurably improves output quality, contributing roughly a 3% accuracy gain across benchmarks (Geng et al., 2025).
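A minimal sketch of the token healing idea, using a toy string vocabulary. The vocabulary and the `heal` function are assumptions for illustration and do not reflect Guidance's actual API.

```python
# Toy vocabulary of token strings (hypothetical; a real tokenizer has tens of thousands).
VOCAB = ["http", ":", "://", "//", "www", " the"]

def heal(prompt_tokens):
    """Drop the last prompt token and return (shortened prompt, allowed first-token ids).

    The first generated token must start with the text of the removed token, so the
    model is free to pick a longer token that spans the old prompt boundary.
    """
    *kept, removed = prompt_tokens
    allowed_ids = [i for i, tok in enumerate(VOCAB) if tok.startswith(removed)]
    return kept, allowed_ids

# The prompt ends in ":"; the model would normally have seen "://" as a single token.
kept, allowed = heal(["http", ":"])
```

Here the healed prompt is `["http"]` and the first generated token may be either `:` or `://`. Without healing, the model would be forced to continue after a lone `:` token, a boundary it rarely saw in training.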
Seeing It in Motion
The Decoding Loop
%%{init: {'theme': 'base', 'themeVariables': {'actorBkg': '#1e40af', 'actorTextColor': '#fff', 'actorBorder': '#3b82f6', 'signalColor': '#94a3b8', 'signalTextColor': '#e2e8f0', 'labelBoxBkgColor': '#1e293b', 'labelBoxBorderColor': '#334155', 'labelTextColor': '#e2e8f0', 'loopTextColor': '#e2e8f0', 'noteBkgColor': '#1e293b', 'noteTextColor': '#e2e8f0', 'noteBorderColor': '#475569', 'activationBorderColor': '#3b82f6', 'activationBkgColor': '#1e3a5f', 'fontSize': '16px'}}}%%
sequenceDiagram
participant App as Application
participant LLM as Language Model
participant LP as Logit Processor
participant FSM as Grammar FSM
App->>LLM: Prompt + Schema
Note over FSM: Compile schema to FSM (cached)
loop For each token
LLM->>LP: Raw logits (32K-128K values)
LP->>FSM: Query current state
FSM-->>LP: Valid token set for state s_t
LP->>LP: Mask invalid tokens to -inf
LP-->>LLM: Masked logits
LLM->>LLM: Sample from valid distribution
LLM->>FSM: Advance state with chosen token
end
LLM-->>App: Guaranteed schema-valid output
Jump-Forward and Coalescence
One of the most counterintuitive findings in constrained decoding: it can be faster than unconstrained generation. Two techniques make this possible.
Jump-forward decoding (introduced in SGLang) identifies sequences of FSM states with only one valid transition each. If the schema requires the field name "temperature" and the model has just generated the opening quote, the next eleven characters, plus the closing quote and colon, are deterministic. Instead of spending one decoding step per token across that span, the system appends those characters directly and skips ahead to the next branching point. SGLang's RadixAttention mechanism automatically reuses the KV cache for the skipped tokens, avoiding redundant computation.
Coalescence (from dottxt/Outlines) generalizes this idea. When multiple token paths through the FSM converge to the same generated string, the system picks the longest valid token and skips the rest. For a schema with fixed field names, most of the structural scaffolding (braces, quotes, colons, commas, field names) is deterministic, and the LLM only needs to generate the actual values.
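The core of jump-forward decoding, finding runs of states with a single outgoing transition, can be sketched on a character-level DFA. The helper names and the toy automaton are assumptions; production systems also re-tokenize the forced text before appending it, which this sketch omits.

```python
def build_literal_dfa(literal, start=0):
    """Chain of states that accepts exactly `literal`; returns (dfa, end_state)."""
    dfa = {}
    for offset, ch in enumerate(literal):
        dfa[start + offset] = {ch: start + offset + 1}
    return dfa, start + len(literal)

def jump_forward(dfa, state):
    """Follow states with exactly one outgoing transition.

    Returns the forced text and the state where the model must choose again.
    """
    forced = []
    seen = set()  # guard against cycling through single-transition loops
    while state not in seen and len(dfa.get(state, {})) == 1:
        seen.add(state)
        (ch, state), = dfa[state].items()
        forced.append(ch)
    return "".join(forced), state

# Scaffold for a required field, followed by a branching point for the number value.
dfa, branch = build_literal_dfa('{"temperature":')
dfa[branch] = {ch: branch + 1 for ch in "-0123456789"}  # digit or minus sign
text, state = jump_forward(dfa, 0)
```

All fifteen scaffold characters are emitted without consulting the model, and decoding resumes at the first state where more than one continuation is possible.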
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart TD
A["Schema: {'name': str, 'age': int}"] --> B["Standard decoding: 15+ LLM calls"]
A --> C["Coalescence decoding"]
C --> D["LLM call 1: generate value for 'name'"]
C --> E["Skip: append deterministic tokens"]
C --> F["LLM call 2: generate value for 'age'"]
C --> G["Skip: append closing brace"]
B --> H["Output: 15 LLM calls"]
F --> I["Output: 2 LLM calls + direct appends"]
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff
class A blue
class B,H rose
class C,D,F amber
class E,G,I emerald
The dottxt team reports a 5x speedup from coalescence on typical JSON schemas (dottxt engineering blog). The SGLang paper reports up to 6.4x higher throughput than baseline inference systems across its benchmark suite, which includes JSON decoding with the compressed FSM (Zheng et al., NeurIPS 2024).
[IMAGE: Visualization of a compressed FSM showing singular transition paths highlighted in green (skipped by jump-forward) and branching points in orange (requiring LLM calls). Caption: "Compressed FSM: green paths are deterministic and skipped; orange nodes require model sampling."]
By the Numbers
Performance data from recent benchmarks paints a clear picture of where the field stands.
Framework Comparison (Geng et al., January 2025)
| Metric | Guidance | Outlines | Llamacpp | XGrammar | OpenAI API | Gemini API |
|---|---|---|---|---|---|---|
| Schema coverage (GlaiveAI) | 96% | 95% | 95% | -- | 31% | 6% |
| Schema coverage (GitHub Medium) | 69% | 29% | 57% | -- | -- | -- |
| Schema coverage (GitHub Hard) | 41% | 3% | 39% | -- | -- | -- |
| Compliance rate | 87-100% | 6-83% | 85-100% | -- | 92-100% | 92-100% |
| Grammar compile time (GlaiveAI) | 0.00s | 3.48s | 0.03s | <0.01s | N/A | N/A |
| Time per output token (GlaiveAI) | 6.37ms | 30.33ms | 17.70ms | -- | N/A | N/A |
| Time per output token (GitHub Medium) | 7.57ms | 46.57ms | 29.08ms | -- | N/A | N/A |
| Task accuracy improvement vs. unconstrained | +3-4% | +1-3% | +2% | -- | -- | -- |
Provider-Hosted Structured Output Performance
| Provider | Feature | Release | Schema Guarantee | Compilation Latency | Per-Token Overhead |
|---|---|---|---|---|---|
| OpenAI | Structured Outputs | Aug 2024 | 100% on supported schemas | <10s first call, cached after | Negligible (server-side) |
| Google | Gemini response_schema | May 2024 | High on supported subset | Minimal | Negligible (server-side) |
| Anthropic | Strict structured outputs | Nov 2025 | 100% with strict: true | Not disclosed | Negligible (server-side) |
| SGLang + XGrammar | Grammar-guided | 2024-2025 | 100% on compiled grammars | <0.01s | <40 microseconds/token |
[IMAGE: Bar chart comparing grammar compilation time across Guidance (near zero), XGrammar (near zero), Llamacpp (30ms), and Outlines (3.48s) on the GlaiveAI dataset. Caption: "Compilation overhead varies dramatically across frameworks."]
Key Numerical Findings
The overhead story has changed dramatically. XGrammar achieves under 40 microseconds per token for grammar checking. The model's own inference costs 10-50 milliseconds per token. That makes the grammar overhead less than 0.4% of total generation time. Guidance's llguidance backend reaches roughly 50 microseconds per token with negligible startup costs.
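As a quick check of that figure, taking the quoted 40 microseconds of grammar checking against the two ends of the quoted inference range:

\[\frac{40\ \mu\text{s}}{10\ \text{ms}} = \frac{40 \times 10^{-6}\ \text{s}}{10 \times 10^{-3}\ \text{s}} = 0.4\%, \qquad \frac{40\ \mu\text{s}}{50\ \text{ms}} = \frac{40 \times 10^{-6}\ \text{s}}{50 \times 10^{-3}\ \text{s}} = 0.08\%\]

The 0.4% bound corresponds to the fastest model in the range; for slower models the relative overhead is smaller still.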
Outlines' initial compilation can be slow, sometimes 40 seconds to over 10 minutes for complex schemas, but this is a one-time cost amortized across all subsequent requests using that schema. Caching the compiled FSM eliminates this penalty entirely for repeated use.
OpenAI reports that their first request with a new schema incurs up to 10 seconds of processing, occasionally up to a minute for very complex schemas, but subsequent requests see no latency penalty due to server-side caching.
A Concrete Example
Let us trace the generation of a simple JSON object step by step, showing exactly how the FSM constrains each token choice.
Target schema (Pydantic):
from typing import Literal
from pydantic import BaseModel

class WeatherReport(BaseModel):
    city: str
    temperature: float
    unit: Literal["celsius", "fahrenheit"]
Equivalent JSON Schema:
{
  "type": "object",
  "properties": {
    "city": {"type": "string"},
    "temperature": {"type": "number"},
    "unit": {"enum": ["celsius", "fahrenheit"]}
  },
  "required": ["city", "temperature", "unit"],
  "additionalProperties": false
}
Step-by-step constrained generation:
| Step | FSM State | Valid Tokens | Model Choice | Why |
|---|---|---|---|---|
| 1 | Start | { only | { | JSON object must open with brace |
| 2 | Object opened | " only | " | First character of field name |
| 3-6 | Field name | city" (deterministic) | city" | Only valid field; coalescence skips LLM |
| 7 | After field name | : only | : | Deterministic separator |
| 8 | Before string value | " only | " | String value must open with quote |
| 9-N | String content | Any valid string tokens | San, Francisco | Model generates freely within string constraints |
| N+1 | String content | " + string tokens | " | Model chooses to close the string |
| N+2 | After value | , only | , | More required fields remain |
| ... | Temperature field | Deterministic field name | "temperature": | Coalescence skips this entire sequence |
| ... | Number value | Digit tokens, ., - | 18.5 | Model generates, constrained to valid numbers |
| ... | Unit field | Deterministic name | "unit":" | Coalescence skips again |
| ... | Enum value | celsius" or fahrenheit" | celsius" | Only two valid completions |
| Final | End | } only | } | Must close object |
The total generation involves perhaps 5-6 actual LLM calls (for the string value, the number value, and the enum choice), with all structural tokens appended deterministically. The model is free to be creative where creativity matters (what city? what temperature?) and mechanically constrained where structure matters (field names, types, punctuation).
[IMAGE: Animated step-through of the FSM states during this generation, showing the state machine highlighting the current node and available transitions at each step. Caption: "Walking the FSM: each step constrains the vocabulary to only valid continuations."]
Where It Breaks
Constrained decoding is not a silver bullet. Its failure modes are well-characterized and worth understanding before deploying.
Semantic Correctness Is Not Guaranteed
The most important limitation: a schema can enforce that the temperature field contains a number, but it cannot enforce that the number is meteorologically plausible. Constrained decoding guarantees syntax, not semantics. A perfectly schema-valid response of {"city": "London", "temperature": 847.3, "unit": "celsius"} will sail through every grammar check.
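A post-generation plausibility check closes part of this gap. The sketch below uses only the standard library; the bounds in `PLAUSIBLE_RANGE` and the function name are illustrative assumptions, not values taken from any standard.

```python
import json

# Hypothetical plausibility bounds in degrees, chosen for illustration only.
PLAUSIBLE_RANGE = {"celsius": (-90.0, 60.0), "fahrenheit": (-130.0, 140.0)}

def semantic_errors(raw):
    """Return a list of plausibility problems for a schema-valid WeatherReport string."""
    report = json.loads(raw)
    low, high = PLAUSIBLE_RANGE[report["unit"]]
    errors = []
    if not low <= report["temperature"] <= high:
        errors.append(
            f"temperature {report['temperature']} is outside [{low}, {high}] {report['unit']}"
        )
    if not report["city"].strip():
        errors.append("city is empty")
    return errors

bad = semantic_errors('{"city": "London", "temperature": 847.3, "unit": "celsius"}')
good = semantic_errors('{"city": "London", "temperature": 18.5, "unit": "celsius"}')
```

The grammar guarantees the keys and types that this function relies on, so the check can focus purely on meaning. The two layers are complementary: constrained decoding removes parse errors, and validation catches implausible values.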
Distribution Distortion
Park et al. (NeurIPS 2024) formalized what practitioners had observed: masking high-probability tokens distorts the model's output distribution. When the model "wants" to generate a token that the grammar forbids at the current position, the probability mass redistributes to grammatically valid tokens in a way that does not preserve the relative ordering the model intended (Park et al., 2024).
The effect is most pronounced when strict format requirements force many low-entropy decisions in sequence (braces, quotes, commas, field names). Each forced decision is a small perturbation; repeated perturbations across many steps induce trajectory bias. On reasoning-intensive tasks, this can degrade accuracy by 10-30% compared to unconstrained generation.
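The ASAp algorithm (adaptive sampling with approximate expected futures) addresses this by using earlier samples to refine an estimate of how likely each prefix is to lead to a grammatical completion, so that repeated sampling converges toward the LLM's distribution conditioned on the grammar constraint. The guarantee is provable, but it holds in the limit, and the extra sampling adds computational overhead.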
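A two-step toy model makes the distortion concrete. The probabilities and the grammar below are invented for illustration; the code compares the distribution produced by per-step masking with the model's true distribution conditioned on validity, which is the target that grammar-aligned decoding aims for.

```python
from fractions import Fraction as F

# Toy two-step model: P(first token) and P(second token | first token).
FIRST = {"A": F(1, 2), "B": F(1, 2)}
SECOND = {
    "A": {"ok": F(1, 10), "bad": F(9, 10)},
    "B": {"ok": F(9, 10), "bad": F(1, 10)},
}

def is_valid(seq):
    """Toy grammar: a sequence is valid only if it ends in 'ok'."""
    return seq[-1] == "ok"

def true_conditional():
    """P(sequence | sequence is valid) under the unconstrained model."""
    joint = {(a, b): FIRST[a] * SECOND[a][b] for a in FIRST for b in SECOND[a]}
    valid = {s: p for s, p in joint.items() if is_valid(s)}
    total = sum(valid.values())
    return {s: p / total for s, p in valid.items()}

def masked_decoding():
    """Distribution induced by per-step masking with local renormalization."""
    out = {}
    for a, pa in FIRST.items():  # both first tokens can still reach a valid string
        allowed = {b: p for b, p in SECOND[a].items() if is_valid((a, b))}
        norm = sum(allowed.values())
        for b, p in allowed.items():
            out[(a, b)] = pa * p / norm
    return out

target = true_conditional()
greedy = masked_decoding()
```

Per-step masking assigns the sequence starting with `A` a probability of 1/2, while the model's own conditional distribution gives it only 1/10. The mask cannot see that `A` almost always leads to an invalid continuation, so it over-commits to that branch and then forces the one valid token.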
Schema Complexity Limits
Not all JSON schemas compile cleanly. The benchmark study by Geng et al. found dramatic differences in schema coverage across frameworks and datasets: on GlaiveAI, Guidance and Outlines covered 96% and 95% of schemas respectively, but on the "hard" GitHub schemas Guidance dropped to 41% and Outlines to 3%. Common failure triggers include deeply nested recursive schemas, complex anyOf/oneOf combinations, and string patterns with Unicode character classes.
Closed-source APIs (OpenAI, Gemini) take a conservative approach, rejecting schemas they cannot guarantee, which yields low coverage but near-perfect compliance on accepted schemas.
Tokenization Mismatches
LLMs process tokens, but grammars express constraints over characters. The mapping between characters and tokens is many-to-many, and context-dependent. The string "true" might be tokenized as a single token in one context and as "tr" + "ue" in another. This creates edge cases where the FSM's character-level transitions do not cleanly align with token boundaries, requiring careful handling to avoid infinite loops or invalid states.
[IMAGE: Illustration of a tokenization mismatch where the same string "temperature" gets different BPE tokenizations depending on preceding context, and how the FSM handles both paths. Caption: "Tokenization is context-dependent; the FSM must handle all valid token decompositions."]
Alternative Designs
Not all approaches to structured output use constrained decoding. The design space includes fundamentally different strategies, each with distinct tradeoffs.
| Approach | Mechanism | Validity Guarantee | Quality Impact | Latency Impact | Schema Flexibility |
|---|---|---|---|---|---|
| Prompt-based | System prompt says "return JSON" | None; ~76-95% success rate | None (unconstrained) | None | Any format describable in text |
| JSON mode | Provider guarantees valid JSON | Valid JSON, no schema | Minimal | Minimal | Any JSON structure |
| Post-validation + retry | Parse output, retry on failure | Eventual (with retry budget) | None per attempt | 2-5x on failures | Any parseable format |
| Fine-tuning | Train model on structured examples | High but not 100% | Improved for trained schemas | None at inference | Fixed to training distribution |
| FSM constrained | Token masking via finite-state machine | 100% for regular languages | -3% to +4% depending on task | <0.4% overhead + compilation | Regex-expressible schemas |
| PDA constrained | Token masking via pushdown automaton | 100% for context-free languages | Similar to FSM | Slightly higher than FSM | Recursive/nested schemas |
| Provider-hosted | Server-side constrained decoding | 100% on accepted schemas | Tuned per model | First-call compilation cost | Provider-defined subset |
| Grammar-aligned (ASAp) | Corrected sampling distribution | 100% with provable fidelity | Minimal degradation | Higher per-token cost | Same as PDA |
The industry has converged on constrained decoding as the default for production use, with provider-hosted implementations for API consumers and open-source engines (XGrammar, Guidance, Outlines) for self-hosted models. Fine-tuning remains relevant for specialized domains where the model needs to learn which values to produce, not just what structure to follow.
[IMAGE: Decision tree for choosing a structured output strategy based on requirements: self-hosted vs API, schema complexity, latency sensitivity, and accuracy requirements. Caption: "Choosing the right structured output strategy depends on your deployment model and schema complexity."]
How It Is Used in Practice
Function Calling and Tool Use
Every major LLM provider now ships function calling as a first-class feature, and constrained decoding is the mechanism that makes it reliable. When you define a tool with a JSON Schema input specification, the provider's inference engine constrains the model's output to match that schema exactly.
OpenAI introduced strict: true in tool definitions (August 2024), requiring additionalProperties: false and explicit required arrays. With strict mode, their evaluations show 100% schema adherence on supported schemas, up from approximately 40% with earlier models (OpenAI, 2024).
Anthropic released structured outputs in public beta (November 2025) with both JSON mode and strict tool calling. As their documentation states: structured outputs "compile your JSON schema into a grammar and actively restrict token generation during inference. The model literally cannot produce tokens that would violate your schema" (Anthropic, 2025).
Google Gemini uses response_schema based on OpenAPI 3.0 schema definitions, supporting Pydantic (Python) and Zod (JavaScript) for schema specification (Google, 2024).
Data Extraction Pipelines
Extracting structured records from unstructured text is perhaps the most common production use case. Medical record parsing, invoice processing, resume screening: all require the model to fill a fixed schema from variable input. Without constrained decoding, these pipelines need extensive error handling for missing fields, wrong types, and malformed output. With it, every response is guaranteed parseable.
The Pydantic Integration Pattern
The dominant integration pattern in Python codebases uses Pydantic as the schema definition layer. The Instructor library (python.useinstructor.com) exemplifies this:
- Define a Pydantic model describing the desired output.
- Pass it to the LLM call (Instructor handles schema conversion).
- Get back a typed, validated Python object.
- If validation fails (semantic checks, custom validators), Instructor automatically retries with the error message fed back to the model.
This pattern works across 15+ providers and has become the de facto standard for Python-based LLM applications. The retry-with-feedback loop is particularly valuable: when the model produces a schema-valid but semantically wrong response (say, a negative age), Pydantic's validators catch it and the model gets another attempt with explicit error context.
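The retry-with-feedback loop can be sketched without any provider SDK. Everything here is a stand-in: `fake_model` replaces a real LLM call, a dataclass replaces the Pydantic model, and none of the names correspond to Instructor's actual API.

```python
import json
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

def validate(raw):
    """Parse `raw` into a Person, raising ValueError on semantic problems."""
    data = json.loads(raw)
    person = Person(name=str(data["name"]), age=int(data["age"]))
    if person.age < 0:
        raise ValueError(f"age must be non-negative, got {person.age}")
    return person

def extract_with_retries(call_model, prompt, max_retries=2):
    """Call the model, validate, and re-prompt with the error message on failure."""
    messages = [prompt]
    for _ in range(max_retries + 1):
        raw = call_model(messages)
        try:
            return validate(raw)
        except ValueError as err:
            # Feed the failed output and the validator's message back to the model.
            messages += [raw, f"Validation failed: {err}. Please correct the output."]
    raise RuntimeError("no valid output within the retry budget")

def fake_model(messages):
    """Stand-in for an LLM call: wrong on the first attempt, corrected after feedback."""
    if len(messages) > 1:
        return '{"name": "Ada", "age": 36}'
    return '{"name": "Ada", "age": -36}'

person = extract_with_retries(fake_model, "Extract the person from: Ada, 36.")
```

With constrained decoding in place, the retry budget is spent only on semantic failures such as the negative age, never on malformed JSON. A bounded `max_retries` keeps a persistently wrong model from looping indefinitely.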
Agentic Workflows
Agent systems amplify the stakes of structured output. A single malformed tool call in a multi-step reasoning chain can cascade: the agent retries, re-sends the full conversation history (multiplying token consumption), and may enter a loop. Constrained decoding eliminates this failure class entirely, which is why every major agent framework (LangGraph, CrewAI, AutoGen) defaults to structured tool calling when available.
[IMAGE: Architecture diagram of an agent loop showing the LLM generating constrained tool calls, the executor running them, and results feeding back. Red X marks where malformed output would break the loop without constrained decoding. Caption: "In agent loops, a single malformed tool call cascades into retries and token waste."]
Self-Hosted Inference
For organizations running their own models, the open-source constrained decoding stack is mature. SGLang with XGrammar is the current performance leader for structured generation workloads, achieving approximately 3x higher throughput than vLLM on constrained decoding tasks (SqueezeBits, 2025). The stack handles JSON Schema, regex, and EBNF grammar constraints with under 40 microseconds of per-token overhead.
vLLM supports both XGrammar and Outlines as grammar backends. XGrammar implements its pushdown automaton in C++ with multithreaded grammar compilation, and it caches tokenizer data to minimize startup cost. The compilation happens once per schema and is cached across requests.
Insights Worth Remembering
- Constrained decoding is model-agnostic. It modifies the sampling distribution, not the model weights. Any autoregressive model works.
- Schema compilation is the hidden cost. The per-token overhead is negligible (<50 microseconds), but the first-time compilation of a complex schema can take seconds to minutes. Cache aggressively.
- Deterministic tokens should not cost LLM calls. Coalescence and jump-forward decoding exploit the fact that most structural tokens in JSON are deterministic given the schema. Skipping them makes constrained generation faster than unconstrained generation for structured tasks.
- Syntactic validity is not semantic correctness. A perfectly formatted JSON response can contain hallucinated values. Constrained decoding solves the structure problem, not the accuracy problem. Combine it with Pydantic validators for semantic checks.
- Distribution distortion is real and quantifiable. Masking tokens changes the distribution the model samples from. For high-stakes reasoning tasks, consider Grammar-Aligned Decoding (ASAp) or allow the model to reason freely before constraining its final output.
- Provider APIs are conservative; open-source engines are flexible. OpenAI and Gemini reject schemas they cannot guarantee. Guidance and XGrammar attempt any schema, with varying success. Choose based on whether you need coverage or compliance.
- Token healing matters at boundaries. The mismatch between character-level grammars and token-level generation creates subtle quality issues. Guidance's token healing provides a consistent 3% accuracy improvement.
- The "think then constrain" pattern is emerging. Recent research (2025-2026) shows that letting the model reason freely in unconstrained space before constraining its final structured output preserves both reasoning quality and structural guarantees.
- LMQL demonstrated that constraints can reduce cost. By eagerly evaluating constraints during generation, LMQL achieves 26-85% cost savings through early termination of invalid paths, reducing total API calls by up to 80% (ETH Zurich SRI Lab).
- The convergence is real. All major providers, open-source engines, and agent frameworks now support constrained decoding. The technique has moved from research novelty to infrastructure default in under three years.
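The first and fifth points above both follow from where the intervention happens: on the logits, after the model has run. A minimal sketch, using a made-up four-token vocabulary and invented logit values, shows the mask removing disallowed tokens and the remaining probabilities being renormalized. The function names are illustrative and do not come from any particular library.

```python
import math
from typing import Dict, Set


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a token -> logit mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def constrained_probs(logits: Dict[str, float], allowed: Set[str]) -> Dict[str, float]:
    """Drop tokens the grammar forbids, then renormalize over what is left."""
    masked = {tok: val for tok, val in logits.items() if tok in allowed}
    if not masked:
        raise ValueError("grammar allows no token in the vocabulary")
    return softmax(masked)


# Toy logits (assumed values): the model would rather start with prose than JSON.
logits = {"{": 2.0, "Sure": 3.0, "[": 0.5, "\n": 0.0}
allowed = {"{", "["}  # what a JSON grammar might permit as the first token

free = softmax(logits)
constrained = constrained_probs(logits, allowed)
```

No weights are touched, which is why the technique works with any autoregressive model. Within this single step the relative odds between `{` and `[` are unchanged; the probability mass that belonged to `Sure` and the newline is redistributed over the allowed tokens, which is the shift in sampling distribution that the distortion point refers to.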
Open Questions
Can we constrain semantics, not just syntax? Current systems enforce that a field is a number, but not that it is a reasonable number. Integrating value-range constraints, cross-field consistency checks, and factual grounding into the decoding loop remains unsolved. Pydantic validators handle this post-generation, but doing it during generation would eliminate wasted tokens.
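To make the gap concrete, here is a sketch of the kind of post-generation check the paragraph describes. It uses only the standard library as a stand-in for a Pydantic validator, and the `Invoice` fields and tolerance are hypothetical. Both inputs would pass a schema that only requires three numeric fields; only one is internally consistent.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Invoice:
    # Hypothetical record; a structural schema would only say "three numbers".
    subtotal: float
    tax: float
    total: float


def semantic_errors(raw: str) -> List[str]:
    """Run value-range and cross-field checks on already well-formed JSON."""
    inv = Invoice(**json.loads(raw))
    errors = []
    if inv.subtotal < 0:
        errors.append("subtotal must be non-negative")
    if not 0 <= inv.tax <= inv.subtotal:
        errors.append("tax must be between 0 and subtotal")
    if abs(inv.subtotal + inv.tax - inv.total) > 0.01:
        errors.append("total must equal subtotal + tax")
    return errors


good = '{"subtotal": 100.0, "tax": 8.0, "total": 108.0}'
bad = '{"subtotal": 100.0, "tax": 8.0, "total": 180.0}'

good_errors = semantic_errors(good)
bad_errors = semantic_errors(bad)
```

The second payload is structurally valid but fails the cross-field check, and that failure is only discovered after every token has already been generated and paid for.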
What is the right grammar formalism? Regular expressions, context-free grammars, and pushdown automata each trade expressiveness for efficiency. Recent work on Earley-driven dynamic pruning suggests that more powerful formalisms can be made practical, but the field has not settled on a standard.
How should models be trained to work with constraints? OpenAI trains models specifically to understand schema structure, achieving 93% accuracy before constrained decoding brings it to 100%. Should all models be trained with schema-awareness, or is the decoding-time approach sufficient?
Will speculative decoding compose with constrained decoding? Speculative decoding uses a small draft model to propose tokens that a larger model verifies. Combining this with grammar constraints could yield multiplicative speedups, but ensuring the draft model's proposals satisfy grammar constraints adds complexity. Early results are promising but not yet production-ready.
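One piece of the added complexity can be sketched in isolation: before the large model verifies a draft, the proposed tokens have to be checked against the grammar, and everything from the first violation onward is discarded. The toy below assumes a fixed-template grammar and hypothetical helper names; it leaves out the target model's verification step entirely and is not how any production engine implements this.

```python
from typing import Callable, List, Set

# Toy grammar (assumption): the output must be the tokens { "ok" : true|false }
TEMPLATE: List[Set[str]] = [{"{"}, {'"ok"'}, {":"}, {"true", "false"}, {"}"}]


def allowed_next(prefix: List[str]) -> Set[str]:
    """Tokens the toy grammar permits after the given prefix."""
    position = len(prefix)
    return TEMPLATE[position] if position < len(TEMPLATE) else set()


def grammar_filter_draft(
    prefix: List[str],
    draft: List[str],
    allowed: Callable[[List[str]], Set[str]],
) -> List[str]:
    """Keep the longest draft prefix in which every token is grammar-valid."""
    state = list(prefix)
    accepted: List[str] = []
    for token in draft:
        if token not in allowed(state):
            break  # the rest of the draft is unusable once the grammar is violated
        accepted.append(token)
        state.append(token)
    return accepted


draft = ["{", '"ok"', ":", "maybe", "}"]
accepted = grammar_filter_draft([], draft, allowed_next)
```

Here the draft is cut at `maybe`, so only the first three tokens would go forward for verification. A draft model that is unaware of the grammar wastes part of its speculation budget this way, which is one reason the composition is harder than it first appears.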
What happens when schemas conflict with the model's knowledge? If a schema requires an enum of ["true", "false"] but the factually correct answer is "unknown", the model is forced to lie structurally. The interaction between structural constraints and truthfulness is underexplored.
[IMAGE: Research frontier diagram showing open problems in constrained decoding: semantic constraints, trained schema awareness, speculative composition, and truthfulness under structural pressure. Caption: "The next frontier: moving from syntactic guarantees to semantic ones."]
Sources and Further Reading
- Willard, B. T. and Louf, R. "Efficient Guided Generation for Large Language Models." arXiv:2307.09702, 2023. https://arxiv.org/abs/2307.09702 - The foundational paper establishing FSM-based guided generation.
- Park, K., Wang, J., Berg-Kirkpatrick, T. et al. "Grammar-Aligned Decoding." NeurIPS 2024. https://arxiv.org/abs/2405.21047 - Identifies and addresses quality degradation from constrained decoding via the ASAp algorithm.
- Zheng, L. et al. "SGLang: Efficient Execution of Structured Language Model Programs." NeurIPS 2024. https://arxiv.org/abs/2312.07104 - Introduces compressed FSM and jump-forward decoding.
- Geng, S. et al. "Generating Structured Outputs from Language Models: Benchmark and Studies." arXiv:2501.10868, January 2025. https://arxiv.org/abs/2501.10868 - Comprehensive benchmark comparing six constrained decoding frameworks.
- OpenAI. "Introducing Structured Outputs in the API." August 2024. https://openai.com/index/introducing-structured-outputs-in-the-api/
- Anthropic. "Structured Outputs." Claude API Documentation, November 2025. https://docs.claude.com/en/docs/build-with-claude/structured-outputs
- Google. "Structured Output - Gemini API." 2024. https://ai.google.dev/gemini-api/docs/structured-output
- dottxt. "Coalescence: Making LLM Inference 5x Faster." dottxt Engineering Blog. https://blog.dottxt.ai/coalescence.html
- LMSYS. "Fast JSON Decoding for Local LLMs with Compressed Finite State Machine." February 2024. https://www.lmsys.org/blog/2024-02-05-compressed-fsm/
- vLLM Project. "Structured Decoding in vLLM: A Gentle Introduction." January 2025. https://vllm-project.github.io/2025/01/14/struct-decode-intro.html
- Outlines (dottxt-ai). Structured Text Generation Library. https://github.com/dottxt-ai/outlines
- XGrammar (MLC-AI). Fast, Flexible and Portable Structured Generation. https://github.com/mlc-ai/xgrammar
- Microsoft Guidance. Control LM Output. https://github.com/guidance-ai/llguidance
- LMQL. A Programming Language for LLM Interaction. ETH Zurich SRI Lab. https://lmql.ai/
- Instructor. Structured Outputs for LLMs. https://python.useinstructor.com/
- SqueezeBits. "Guided Decoding Performance on vLLM and SGLang." 2025. https://blog.squeezebits.com/guided-decoding-performance-vllm-sglang