
Structured Generation and Constrained Decoding: Making LLMs Predictable

June 02, 2026 · 25 min read

Language models are, by construction, unreliable format producers. Ask GPT-4 or Claude for JSON and you will usually get JSON. Ask ten thousand times in a production pipeline and you will get JSON with trailing commas, missing closing braces, hallucinated fields, and the occasional apologetic paragraph explaining why it cannot comply. The fix is not better prompts. The fix is constrained decoding: a family of techniques that intervene at the token-sampling step to make structurally invalid output impossible, not unlikely.

This matters because the entire agentic stack, every tool call, every function invocation, every data extraction pipeline, depends on the model producing output that machines can parse without error handling that dwarfs the model call itself. Constrained decoding turns language models from probabilistic text generators into reliable structured-data engines.

Why this matters: Every LLM-powered product that calls a function, fills a database row, or drives an agent loop depends on structured output. When that output is malformed, retries spike latency, cost doubles, and downstream systems break silently. Constrained decoding eliminates this entire failure class by construction, not by hope.

TL;DR

  • Constrained decoding masks invalid tokens at every generation step, guaranteeing that the output conforms to a target grammar or schema.
  • The core mechanism converts a JSON Schema (or regex, or context-free grammar) into a finite-state machine, then uses that FSM to compute a token mask before each sampling step.
  • OpenAI, Anthropic, and Google all ship native structured output modes that use constrained decoding server-side. Open-source engines (Outlines, Guidance, XGrammar, SGLang) do the same locally.
  • Performance overhead is negligible in modern implementations: under 50 microseconds per token for grammar checking, against 10-50 milliseconds per token for model inference.
  • Constrained decoding guarantees syntactic validity but not semantic correctness. The model can produce perfectly formatted JSON that contains wrong answers.
  • Quality degradation is real but measurable: masking high-probability tokens distorts the model's distribution, and recent algorithms like ASAp (Park et al., NeurIPS 2024) address this with provably correct renormalization.
  • Coalescence and jump-forward decoding exploit deterministic structure to skip LLM calls entirely, making constrained generation faster than unconstrained generation in many cases.
  • The Pydantic-to-schema-to-constraint pipeline has become the standard integration pattern, with libraries like Instructor providing retry logic and validation across 15+ providers.

At a Glance

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart LR
    subgraph Input["Developer Specification"]
        PS["Pydantic model"]
        JS["JSON Schema"]
        RX["Regex pattern"]
        CFG["Context-free grammar"]
    end
    subgraph Compile["Compilation (offline)"]
        FSM["Build FSM / PDA"]
        IDX["Index: state -> valid tokens"]
    end
    subgraph Decode["Constrained Decoding (per token)"]
        LG["Model produces logits"]
        MK["Mask invalid tokens"]
        SM["Sample from valid set"]
        ADV["Advance FSM state"]
    end
    subgraph Output["Guaranteed Output"]
        VJ["Valid JSON"]
        TC["Correct tool call"]
        SE["Schema-compliant extraction"]
    end

    PS --> JS
    JS --> FSM
    RX --> FSM
    CFG --> FSM
    FSM --> IDX
    IDX --> MK
    LG --> MK
    MK --> SM
    SM --> ADV
    ADV -->|"next token"| MK
    SM --> Output

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
    classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
    classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
    classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff

    class PS,JS,RX,CFG blue
    class FSM,IDX purple
    class LG,MK,SM,ADV teal
    class VJ,TC,SE emerald

Before Constrained Decoding

For most of the LLM era, getting structured output from a language model meant asking nicely and hoping. The history of structured generation tracks a steady shift from "please format this correctly" to "you cannot format this incorrectly."

%%{init: {'theme': 'base', 'themeVariables': {'cScale0': '#1e40af', 'cScale1': '#6d28d9', 'cScale2': '#b45309', 'cScale3': '#be123c', 'cScale4': '#047857', 'cScale5': '#0e7490', 'cScale6': '#1e40af', 'cScaleLabel0': '#e2e8f0', 'cScaleLabel1': '#e2e8f0', 'cScaleLabel2': '#e2e8f0', 'cScaleLabel3': '#e2e8f0', 'cScaleLabel4': '#e2e8f0', 'cScaleLabel5': '#e2e8f0', 'cScaleLabel6': '#e2e8f0', 'textColor': '#e2e8f0', 'lineColor': '#94a3b8', 'fontSize': '16px'}}}%%
timeline
    title From Hope to Guarantee: Structured Output Evolution
    2020-2022 : Prompt engineering era
             : "Return valid JSON" in system prompts
             : Regex post-processing and retry loops
             : PICARD (Scholak et al.) constrains text-to-SQL
    2023 : OpenAI ships JSON mode (Nov 2023) - valid JSON, no schema enforcement
         : Willard and Louf publish FSM-based guided generation (arXiv 2307.09702)
         : Outlines library launches with regex-to-FSM compilation
         : Microsoft Guidance introduces token healing
         : LMQL provides SQL-like constraint syntax from ETH Zurich
    2024 : Google Gemini adds response_schema at Google I/O (May 2024)
         : OpenAI Structured Outputs with JSON Schema enforcement (Aug 2024)
         : SGLang compressed FSM with jump-forward decoding (NeurIPS 2024)
         : Park et al. Grammar-Aligned Decoding addresses quality loss (NeurIPS 2024)
         : XGrammar ships pushdown automata for context-free grammars
    2025-2026 : Anthropic Claude adds strict structured outputs (Nov 2025)
             : XGrammar becomes default backend for vLLM, SGLang, TensorRT-LLM
             : Coalescence makes structured generation faster than unconstrained
             : Sub-40 microsecond per-token overhead achieved

The prompt engineering era (2020-2022) relied on instructions like "You must return valid JSON" combined with post-generation parsing and retry loops. This worked tolerably for demos. In production, with thousands of concurrent requests, a 5% parse failure rate meant hundreds of retries per minute, each doubling latency and cost.

OpenAI's JSON mode (November 2023) guaranteed syntactically valid JSON but enforced no schema. You got valid JSON, but it might have the wrong fields, wrong types, or an entirely unexpected structure. The gap between "valid JSON" and "JSON matching my schema" turned out to be enormous.

The real breakthrough came from the research side. Willard and Louf's 2023 paper, "Efficient Guided Generation for Large Language Models" (arXiv:2307.09702), demonstrated that autoregressive text generation can be reformulated as transitions between states of a finite-state machine. This insight, implemented in the Outlines library, made schema-guaranteed output a practical reality for any model, not just proprietary APIs.

[IMAGE: Side-by-side comparison showing a prompt-based JSON generation attempt with malformed output on the left, and constrained decoding producing guaranteed-valid output on the right. Caption: "The difference between asking for structure and enforcing it."]

How Constrained Decoding Actually Works

The mechanism is elegant. A language model generates text one token at a time by producing a vector of unnormalized scores (logits) over its entire vocabulary, typically 32,000 to 128,000 tokens, which softmax then turns into a probability distribution. Constrained decoding inserts a logit processor between the model's output layer and the sampling step. This processor checks each candidate token against the current grammar state and sets the logits of all invalid tokens to negative infinity before sampling occurs.

The Core Math

At each generation step \(t\), the model produces raw logits \(z_t \in \mathbb{R}^{|V|}\) over vocabulary \(V\). Standard sampling converts these to probabilities via softmax:

\[P(v_i | x_{<t}) = \frac{\exp(z_{t,i})}{\sum_{j=1}^{|V|} \exp(z_{t,j})}\]

Constrained decoding defines a set of valid tokens \(V_t^{\text{valid}} \subseteq V\) based on the current grammar state \(s_t\), then applies a mask:

\[\tilde{z}_{t,i} = \begin{cases} z_{t,i} & \text{if } v_i \in V_t^{\text{valid}} \\ -\infty & \text{otherwise} \end{cases}\]

After masking, the probabilities are renormalized over valid tokens only:

\[P_{\text{constrained}}(v_i | x_{<t}) = \begin{cases} \frac{\exp(z_{t,i})}{\sum_{v_j \in V_t^{\text{valid}}} \exp(z_{t,j})} & \text{if } v_i \in V_t^{\text{valid}} \\ 0 & \text{otherwise} \end{cases}\]

The model's weights are never modified. Only the sampling distribution changes. This is why constrained decoding is model-agnostic: it works with any autoregressive language model.
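The masking and renormalization formulas above can be sketched in a few lines of plain Python. This is a minimal illustration on a made-up five-token vocabulary; the function name `constrained_probs` and the example logits are assumptions for the sketch, not any library's API.

```python
import math

def constrained_probs(logits, valid_ids):
    """Mask invalid tokens to -inf, then softmax over what remains."""
    if not valid_ids:
        raise ValueError("grammar state allows no tokens")
    masked = [z if i in valid_ids else -math.inf for i, z in enumerate(logits)]
    peak = max(masked)                           # subtract max for numerical stability
    exps = [math.exp(z - peak) for z in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of 5 tokens; the grammar state allows only tokens 1 and 3.
logits = [2.0, 1.0, 0.5, 1.0, -1.0]
probs = constrained_probs(logits, valid_ids={1, 3})
print(probs)  # [0.0, 0.5, 0.0, 0.5, 0.0]
```

Token 0 has the highest raw logit but ends up with zero probability, which is exactly the situation in which masking changes what the model would otherwise have produced.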

[IMAGE: Diagram showing a vocabulary heatmap where most tokens are grayed out (masked) and only a handful of valid tokens retain their probability mass. Caption: "Token masking in action: at each step, only grammar-valid tokens retain probability."]

From Schema to Finite-State Machine

The practical question is: how do you compute \(V_t^{\text{valid}}\) efficiently? The answer involves a compilation pipeline that converts high-level specifications into state machines.

Step 1: Schema to Regex. A JSON Schema defines allowed types, field names, nesting, and constraints. This gets compiled into a (potentially very long) regular expression that matches all and only the valid JSON strings conforming to that schema. Libraries like Outlines use this conversion, handling quantifiers, optional fields, and enum values.

Step 2: Regex to FSM. Regular expressions are mathematically equivalent to finite-state automata. The interegular library converts the regex into a deterministic finite automaton (DFA) where each state represents a position in the valid output, and transitions correspond to characters.

Step 3: FSM to Token Index. Here is where the tokenization challenge appears. LLMs do not generate characters; they generate tokens, which are variable-length byte sequences produced by BPE (Byte Pair Encoding). The string "name" might be a single token, or it might be tokenized as "na" + "me", depending on context. The system must compute, for each FSM state, which tokens in the vocabulary would produce a valid transition. This precomputation creates an index: a mapping from each state to its set of valid token IDs.

Step 4: Runtime Masking. During generation, the system tracks the current FSM state. At each step, it looks up the valid token set for that state (an O(1) hash map operation), constructs the mask, applies it, and advances the state after sampling.
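Steps 2 through 4 can be sketched end to end on a toy problem. The example below hand-builds a character-level DFA for the regex `yes|no`, precomputes the state-to-token index over an invented ten-token vocabulary, and runs a greedy decoding loop against a fake model. All names (`DFA`, `VOCAB`, `build_index`, `fake_logits`) are hypothetical; real libraries derive the DFA from the regex automatically and index vocabularies of tens of thousands of tokens.

```python
# Character-level DFA for the regex `yes|no` (hand-built for illustration).
DFA = {0: {"y": 1, "n": 4}, 1: {"e": 2}, 2: {"s": 3}, 4: {"o": 3}}
ACCEPT = {3}
VOCAB = ["y", "e", "s", "ye", "es", "yes", "n", "o", "no", "maybe"]

def walk(state, token):
    """Follow a token's characters through the DFA; None if any step fails."""
    for ch in token:
        state = DFA.get(state, {}).get(ch)
        if state is None:
            return None
    return state

def build_index(states):
    """Precompute state -> {token_id: next_state} over the whole vocabulary."""
    index = {}
    for s in states:
        index[s] = {}
        for tid, tok in enumerate(VOCAB):
            nxt = walk(s, tok)
            if nxt is not None:
                index[s][tid] = nxt
    return index

INDEX = build_index(states=[0, 1, 2, 3, 4])

def fake_logits():
    """Stand-in for a model that strongly prefers the invalid token 'maybe'."""
    return [1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 5.0]

def generate():
    state, out = 0, []
    while state not in ACCEPT:
        allowed = INDEX[state]                       # dictionary lookup per step
        logits = fake_logits()
        tid = max(allowed, key=lambda t: logits[t])  # greedy pick among valid tokens
        out.append(VOCAB[tid])
        state = allowed[tid]                         # advance the FSM
    return out

tokens = generate()
print(tokens, "->", "".join(tokens))  # ['ye', 's'] -> yes
```

The index is where the tokenization problem is absorbed: from the start state, the single characters `y` and `n`, the partial token `ye`, and the whole words `yes` and `no` are all valid, while `maybe` never appears in any state's entry.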

[IMAGE: Pipeline diagram showing JSON Schema flowing through regex compilation, FSM construction, token index building, and finally runtime masking during generation. Caption: "The compilation pipeline: from developer-facing schema to inference-time token mask."]

Beyond Regular Languages: Pushdown Automata

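Pure FSMs handle regular languages, but JSON in general is not regular. Nested objects and arrays require tracking depth, which regular expressions cannot express for unbounded nesting. A non-recursive schema fixes the maximum depth, so it can still be expanded into a (long) regular expression, which is why the regex pipeline above works for many schemas; recursive schemas and free-form JSON cannot be handled that way. The solution: pushdown automata (PDA), essentially FSMs augmented with a stack.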

XGrammar, which has become the default structured generation backend for vLLM, SGLang, and TensorRT-LLM, implements this approach. It models JSON generation as a collection of FSMs where the stack tracks nesting context, and each FSM handles one level of the structure. When the model opens a new object or array, the PDA pushes a new FSM onto the stack. When it closes one, it pops.

This distinction matters in practice. Early FSM-only approaches could handle flat JSON reliably but struggled with deeply nested schemas. PDA-based approaches handle arbitrary nesting at the cost of slightly more complex state management.
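To make the role of the stack concrete, here is a deliberately tiny sketch that tracks only brackets. It is not how XGrammar is implemented, and it ignores keys, values, commas, and strings; the names `allowed_next` and `accepts` are invented for the illustration.

```python
CLOSER = {"[": "]", "{": "}"}

def allowed_next(stack):
    """Structural characters valid in the current nesting context."""
    allowed = set(CLOSER)                # opening a new container is always legal
    if stack:
        allowed.add(CLOSER[stack[-1]])   # only the matching closer is legal
    return allowed

def accepts(text):
    """Run the stack machine over a bracket-only string."""
    stack = []
    for ch in text:
        if ch not in allowed_next(stack):
            return False
        if ch in CLOSER:
            stack.append(ch)             # push on open
        else:
            stack.pop()                  # pop on close
    return not stack                     # accept only when everything is closed

print(accepts("[{[]}]"), accepts("[}"))  # True False
```

The set returned by `allowed_next` depends on the top of the stack rather than on a fixed state number, which is what lets the same small machine handle nesting of any depth.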

Token Healing

There is a subtle problem at the boundary between the prompt and the generated text. When the prompt ends partway through what the tokenizer would normally encode as a single token, or when constrained decoding forces the model down an unusual token path (because the "natural" tokenization would violate the grammar), the result is a non-canonical tokenization that the model rarely encountered during training. Microsoft's Guidance library introduced token healing to address this: it backs up one token at the prompt boundary and constrains the first generated token to start with the text of the removed token, so the model can choose a longer token that spans the boundary. This small correction measurably improves output quality, contributing roughly a 3% accuracy gain across benchmarks (Geng et al., 2025).
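A minimal sketch of the healing step, assuming a made-up vocabulary in which `://` is a single token. A prompt ending in `http:` leaves the tokenizer with a lone `:`, and healing lets the model pick `://` instead. The `heal` function and the vocabulary are illustrative assumptions, not Guidance's implementation.

```python
VOCAB = ["The link is ", "http", ":", "://", "//", "example", ".com"]

def heal(prompt_tokens):
    """Drop the last prompt token and list the tokens allowed to replace it."""
    *kept, removed = prompt_tokens
    # The first generated token must start with the removed token's text,
    # so the model may choose a longer token that spans the boundary.
    allowed = [tok for tok in VOCAB if tok.startswith(removed)]
    return kept, removed, allowed

kept, removed, allowed = heal(["The link is ", "http", ":"])
print(kept, removed, allowed)  # ['The link is ', 'http'] : [':', '://']
```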

Seeing It in Motion

The Decoding Loop

%%{init: {'theme': 'base', 'themeVariables': {'actorBkg': '#1e40af', 'actorTextColor': '#fff', 'actorBorder': '#3b82f6', 'signalColor': '#94a3b8', 'signalTextColor': '#e2e8f0', 'labelBoxBkgColor': '#1e293b', 'labelBoxBorderColor': '#334155', 'labelTextColor': '#e2e8f0', 'loopTextColor': '#e2e8f0', 'noteBkgColor': '#1e293b', 'noteTextColor': '#e2e8f0', 'noteBorderColor': '#475569', 'activationBorderColor': '#3b82f6', 'activationBkgColor': '#1e3a5f', 'fontSize': '16px'}}}%%
sequenceDiagram
    participant App as Application
    participant LLM as Language Model
    participant LP as Logit Processor
    participant FSM as Grammar FSM

    App->>LLM: Prompt + Schema
    Note over FSM: Compile schema to FSM (cached)
    
    loop For each token
        LLM->>LP: Raw logits (32K-128K values)
        LP->>FSM: Query current state
        FSM-->>LP: Valid token set for state s_t
        LP->>LP: Mask invalid tokens to -inf
        LP-->>LLM: Masked logits
        LLM->>LLM: Sample from valid distribution
        LLM->>FSM: Advance state with chosen token
    end
    
    LLM-->>App: Guaranteed schema-valid output

Jump-Forward and Coalescence

One of the most counterintuitive findings in constrained decoding: it can be faster than unconstrained generation. Two techniques make this possible.

Jump-forward decoding (introduced in SGLang) identifies sequences of FSM states with only one valid transition each. If the schema requires the field name "temperature" and the model has just generated the opening quote, the next eleven characters and the closing quote are deterministic. Instead of spending a forward pass on each token of that span, the system appends the text directly and skips ahead to the next branching point. SGLang's RadixAttention mechanism reuses the KV cache of the already-processed prefix, so only the appended text needs to be encoded.

Coalescence (from dottxt/Outlines) generalizes this idea. When multiple token paths through the FSM converge to the same generated string, the system picks the longest valid token and skips the rest. For a schema with fixed field names, most of the structural scaffolding (braces, quotes, colons, commas, field names) is deterministic, and the LLM only needs to generate the actual values.
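The core of jump-forward can be sketched at the character level: from the current state, keep following transitions as long as there is exactly one. The DFA below is built by hand for outputs of the form `{"age":<digits>}`; `chain` and `jump_forward` are hypothetical helper names, and production engines work on tokens and must re-tokenize the forced text.

```python
def chain(dfa, start, text):
    """Add a run of single-transition states spelling `text`; return the end state."""
    state = start
    for ch in text:
        nxt = len(dfa)
        dfa[state] = {ch: nxt}
        dfa[nxt] = {}
        state = nxt
    return state

def jump_forward(dfa, state):
    """Collect forced characters while the state has exactly one way out."""
    forced = []
    while len(dfa[state]) == 1:
        ch, state = next(iter(dfa[state].items()))
        forced.append(ch)
    return "".join(forced), state

DIGITS = "0123456789"
dfa = {0: {}}
before_value = chain(dfa, 0, '{"age":')       # deterministic scaffolding
in_number, final = len(dfa), len(dfa) + 1
dfa[before_value] = {d: in_number for d in DIGITS}
dfa[in_number] = {**{d: in_number for d in DIGITS}, "}": final}
dfa[final] = {}

text, state = jump_forward(dfa, 0)
print(repr(text), state == before_value)      # '{"age":' True
```

From the start state the whole prefix `{"age":` is appended without consulting the model; the loop stops at the first state with more than one outgoing transition, where real sampling is needed.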

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart TD
    A["Schema: {'name': str, 'age': int}"] --> B["Standard decoding: 15+ LLM calls"]
    A --> C["Coalescence decoding"]
    
    C --> D["LLM call 1: generate value for 'name'"]
    C --> E["Skip: append deterministic tokens"]
    C --> F["LLM call 2: generate value for 'age'"]
    C --> G["Skip: append closing brace"]

    B --> H["Output: 15 LLM calls"]
    F --> I["Output: 2 LLM calls + direct appends"]

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
    classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
    classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff

    class A blue
    class B,H rose
    class C,D,F amber
    class E,G,I emerald

The dottxt team reports a 5x speedup from coalescence on typical JSON schemas (dottxt engineering blog). SGLang's compressed FSM achieves up to 6.4x higher throughput compared to baseline inference engines (Zheng et al., NeurIPS 2024).

[IMAGE: Visualization of a compressed FSM showing singular transition paths highlighted in green (skipped by jump-forward) and branching points in orange (requiring LLM calls). Caption: "Compressed FSM: green paths are deterministic and skipped; orange nodes require model sampling."]

By the Numbers

Performance data from recent benchmarks paints a clear picture of where the field stands.

Framework Comparison (Geng et al., January 2025)

| Metric | Guidance | Outlines | Llamacpp | XGrammar | OpenAI API | Gemini API |
|---|---|---|---|---|---|---|
| Schema coverage (GlaiveAI) | 96% | 95% | 95% | -- | 31% | 6% |
| Schema coverage (GitHub Medium) | 69% | 29% | 57% | -- | -- | -- |
| Schema coverage (GitHub Hard) | 41% | 3% | 39% | -- | -- | -- |
| Compliance rate | 87-100% | 6-83% | 85-100% | -- | 92-100% | 92-100% |
| Grammar compile time (GlaiveAI) | 0.00s | 3.48s | 0.03s | <0.01s | N/A | N/A |
| Time per output token (GlaiveAI) | 6.37ms | 30.33ms | 17.70ms | -- | N/A | N/A |
| Time per output token (GitHub Medium) | 7.57ms | 46.57ms | 29.08ms | -- | N/A | N/A |
| Task accuracy improvement vs. unconstrained | +3-4% | +1-3% | +2% | -- | -- | -- |

Source: Geng et al., "Generating Structured Outputs from Language Models: Benchmark and Studies," January 2025

Provider-Hosted Structured Output Performance

| Provider | Feature | Release | Schema Guarantee | Compilation Latency | Per-Token Overhead |
|---|---|---|---|---|---|
| OpenAI | Structured Outputs | Aug 2024 | 100% on supported schemas | <10s first call, cached after | Negligible (server-side) |
| Google Gemini | response_schema | May 2024 | High on supported subset | Minimal | Negligible (server-side) |
| Anthropic | Strict structured outputs | Nov 2025 | 100% with strict: true | Not disclosed | Negligible (server-side) |
| SGLang + XGrammar | Grammar-guided | 2024-2025 | 100% on compiled grammars | <0.01s | <40 microseconds/token |

[IMAGE: Bar chart comparing grammar compilation time across Guidance (near zero), XGrammar (near zero), Llamacpp (30ms), and Outlines (3.48s) on the GlaiveAI dataset. Caption: "Compilation overhead varies dramatically across frameworks."]

Key Numerical Findings

The overhead story has changed dramatically. XGrammar achieves under 40 microseconds per token for grammar checking. The model's own inference costs 10-50 milliseconds per token. That makes the grammar overhead less than 0.4% of total generation time. Guidance's llguidance backend reaches roughly 50 microseconds per token with negligible startup costs.
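As a quick check of that percentage, using the two ends of the inference range quoted above and treating inference time as the total:

\[\frac{40\,\mu\text{s}}{10\,\text{ms}} = \frac{40 \times 10^{-6}\,\text{s}}{10 \times 10^{-3}\,\text{s}} = 0.4\%, \qquad \frac{40\,\mu\text{s}}{50\,\text{ms}} = \frac{40 \times 10^{-6}\,\text{s}}{50 \times 10^{-3}\,\text{s}} = 0.08\%\]

So 0.4% is the worst case, reached only for the fastest models in that range; for slower models the grammar check is an even smaller share.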

Outlines' initial compilation can be slow, sometimes 40 seconds to over 10 minutes for complex schemas, but this is a one-time cost amortized across all subsequent requests using that schema. Caching the compiled FSM eliminates this penalty entirely for repeated use.

OpenAI reports that their first request with a new schema incurs up to 10 seconds of processing, occasionally up to a minute for very complex schemas, but subsequent requests see no latency penalty due to server-side caching.

A Concrete Example

Let us trace the generation of a simple JSON object step by step, showing exactly how the FSM constrains each token choice.

Target schema (Pydantic):

from typing import Literal

from pydantic import BaseModel


class WeatherReport(BaseModel):
    city: str
    temperature: float
    unit: Literal["celsius", "fahrenheit"]

Equivalent JSON Schema:

{
  "type": "object",
  "properties": {
    "city": {"type": "string"},
    "temperature": {"type": "number"},
    "unit": {"enum": ["celsius", "fahrenheit"]}
  },
  "required": ["city", "temperature", "unit"],
  "additionalProperties": false
}

Step-by-step constrained generation:

| Step | FSM State | Valid Tokens | Model Choice | Why |
|---|---|---|---|---|
| 1 | Start | { only | { | JSON object must open with brace |
| 2 | Object opened | " only | " | First character of field name |
| 3-6 | Field name | city" (deterministic) | city" | Only valid field; coalescence skips LLM |
| 7 | After field name | : only | : | Deterministic separator |
| 8 | Before string value | " only | " | String value must open with quote |
| 9-N | String content | Any valid string tokens | San, Francisco | Model generates freely within string constraints |
| N+1 | String content | " + string tokens | " | Model chooses to close the string |
| N+2 | After value | , only | , | More required fields remain |
| ... | Temperature field | Deterministic field name | "temperature": | Coalescence skips this entire sequence |
| ... | Number value | Digit tokens, ., - | 18.5 | Model generates, constrained to valid numbers |
| ... | Unit field | Deterministic name | "unit":" | Coalescence skips again |
| ... | Enum value | celsius" or fahrenheit" | celsius" | Only two valid completions |
| Final | End | } only | } | Must close object |

The total generation involves perhaps 5-6 actual LLM calls (for the string value, the number value, and the enum choice), with all structural tokens appended deterministically. The model is free to be creative where creativity matters (what city? what temperature?) and mechanically constrained where structure matters (field names, types, punctuation).
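The split between fixed scaffolding and free values is easy to see if the schema is written out as a regular expression by hand. The sketch below is a simplified assumption of what a schema-to-regex step might produce for WeatherReport: it fixes the key order, allows no whitespace, and uses a loose pattern for JSON strings. It is not the output of any particular library.

```python
import re

STRING = r'"(?:[^"\\]|\\.)*"'                             # simplified JSON string
NUMBER = r'-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?'   # JSON number
UNIT = r'"(?:celsius|fahrenheit)"'                        # the enum

# Literal pieces are the deterministic scaffolding; the three sub-patterns
# are the only places where the model actually has a choice.
WEATHER = re.compile(
    r'\{"city":' + STRING + r',"temperature":' + NUMBER + r',"unit":' + UNIT + r'\}'
)

good = '{"city":"San Francisco","temperature":18.5,"unit":"celsius"}'
bad = '{"city":"San Francisco","temperature":"warm","unit":"kelvin"}'
print(bool(WEATHER.fullmatch(good)), bool(WEATHER.fullmatch(bad)))  # True False
```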

[IMAGE: Animated step-through of the FSM states during this generation, showing the state machine highlighting the current node and available transitions at each step. Caption: "Walking the FSM: each step constrains the vocabulary to only valid continuations."]

Where It Breaks

Constrained decoding is not a silver bullet. Its failure modes are well-characterized and worth understanding before deploying.

Semantic Correctness Is Not Guaranteed

The most important limitation: a schema can enforce that the temperature field contains a number, but it cannot enforce that the number is meteorologically plausible. Constrained decoding guarantees syntax, not semantics. A perfectly schema-valid response of {"city": "London", "temperature": 847.3, "unit": "celsius"} will sail through every grammar check.
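Checks of this kind have to run after parsing. The sketch below uses only the standard library as a stand-in for a schema-level validator; the plausibility bounds in `PLAUSIBLE_RANGE` are illustrative assumptions chosen for the example, not authoritative climate limits.

```python
import json

# Illustrative bounds only; pick limits that suit the application.
PLAUSIBLE_RANGE = {"celsius": (-90.0, 60.0), "fahrenheit": (-130.0, 140.0)}

def semantic_errors(raw: str) -> list[str]:
    """Checks the grammar cannot express; assumes `raw` is already schema-valid."""
    report = json.loads(raw)
    low, high = PLAUSIBLE_RANGE[report["unit"]]
    errors = []
    if not low <= report["temperature"] <= high:
        errors.append(
            f"temperature {report['temperature']} outside [{low}, {high}] {report['unit']}"
        )
    if not report["city"].strip():
        errors.append("city is empty")
    return errors

hot = '{"city": "London", "temperature": 847.3, "unit": "celsius"}'
mild = '{"city": "London", "temperature": 18.5, "unit": "celsius"}'
print(semantic_errors(hot))
print(semantic_errors(mild))  # []
```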

Distribution Distortion

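Park et al. (NeurIPS 2024) formalized what practitioners had observed: masking high-probability tokens distorts the model's output distribution. When the model "wants" to generate a token that the grammar forbids at the current position, its probability mass is redistributed to the grammatically valid tokens. Per-step renormalization keeps the relative probabilities of the surviving tokens, but the resulting distribution over complete sequences no longer matches the model's own distribution conditioned on the output being grammatical: prefixes that the model would rarely have completed validly end up over-represented.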

The effect is most pronounced when strict format requirements force many low-entropy decisions in sequence (braces, quotes, commas, field names). Each forced decision is a small perturbation; repeated perturbations across many steps induce trajectory bias. On reasoning-intensive tasks, this can degrade accuracy by 10-30% compared to unconstrained generation.

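The ASAp algorithm (Adaptive Sampling with Approximate Expected Futures) addresses this by using previously sampled outputs to estimate how likely each prefix is to lead to a grammatical completion, then reweighting token probabilities accordingly. Over repeated sampling it provably converges to the LLM's distribution conditioned on the grammar constraint, at the cost of additional computation.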
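The gap between per-step masking and the model's grammar-conditioned distribution shows up even on a two-token toy model. The probabilities and the three-string grammar below are invented for illustration, and the target distribution is computed by brute-force enumeration, which is feasible only at this scale and is not how ASAp itself works.

```python
# Toy model over two-token sequences from the vocabulary {a, b}.
FIRST = {"a": 0.5, "b": 0.5}
SECOND = {"a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.5, "b": 0.5}}
GRAMMAR = {"aa", "ba", "bb"}              # the sequence "ab" is forbidden

def model_prob(seq):
    return FIRST[seq[0]] * SECOND[seq[0]][seq[1]]

def true_conditional():
    """P(seq | seq is grammatical): renormalize over whole sequences."""
    mass = {s: model_prob(s) for s in GRAMMAR}
    total = sum(mass.values())
    return {s: p / total for s, p in mass.items()}

def local_masking():
    """Token-by-token masking with renormalization at each step."""
    out = {}
    valid_first = {t for t in FIRST if any(s[0] == t for s in GRAMMAR)}
    z1 = sum(FIRST[t] for t in valid_first)
    for t1 in valid_first:
        valid_second = {t for t in "ab" if t1 + t in GRAMMAR}
        z2 = sum(SECOND[t1][t] for t in valid_second)
        for t2 in valid_second:
            out[t1 + t2] = (FIRST[t1] / z1) * (SECOND[t1][t2] / z2)
    return out

masked, target = local_masking(), true_conditional()
print(round(masked["aa"], 3), round(target["aa"], 3))  # 0.5 0.091
```

After the first token `a`, the model puts 90% of its mass on the forbidden `b`. Local masking hands all of that mass to `aa`, so `aa` is sampled half the time even though it accounts for only about 9% of the model's grammatical probability mass.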

Schema Complexity Limits

Not all JSON schemas compile cleanly. The benchmark study by Geng et al. found dramatic differences in schema coverage across frameworks: Guidance and Outlines both handled about 95% of schemas in the GlaiveAI dataset, but on the "hard" schemas from GitHub, Guidance's coverage fell to 41% and Outlines managed only 3%. Common failure triggers include deeply nested recursive schemas, complex anyOf/oneOf combinations, and string patterns with Unicode character classes.

Closed-source APIs (OpenAI, Gemini) take a conservative approach, rejecting schemas they cannot guarantee, which yields low coverage but near-perfect compliance on accepted schemas.

Tokenization Mismatches

LLMs process tokens, but grammars express constraints over characters. The mapping between characters and tokens is many-to-many, and context-dependent. The string "true" might be tokenized as a single token in one context and as "tr" + "ue" in another. This creates edge cases where the FSM's character-level transitions do not cleanly align with token boundaries, requiring careful handling to avoid infinite loops or invalid states.
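To see how many token paths a single string can have, the sketch below enumerates every way to split "true" using a small invented vocabulary. The `tokenizations` helper is illustrative; a real vocabulary is far larger, and a real tokenizer emits one canonical split even though the grammar engine must accept the others.

```python
VOCAB = {"t", "r", "u", "e", "tr", "ue", "true"}

def tokenizations(text):
    """All ways to split `text` into vocabulary tokens."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in VOCAB:
            results += [[head] + rest for rest in tokenizations(text[end:])]
    return results

paths = tokenizations("true")
for path in paths:
    print(path)
print(len(paths))  # 5
```

Every one of these five paths spells the same four characters, so a character-level FSM must map each of them onto the same sequence of states.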

[IMAGE: Illustration of a tokenization mismatch where the same string "temperature" gets different BPE tokenizations depending on preceding context, and how the FSM handles both paths. Caption: "Tokenization is context-dependent; the FSM must handle all valid token decompositions."]

Alternative Designs

Not all approaches to structured output use constrained decoding. The design space includes fundamentally different strategies, each with distinct tradeoffs.

| Approach | Mechanism | Validity Guarantee | Quality Impact | Latency Impact | Schema Flexibility |
|---|---|---|---|---|---|
| Prompt-based | System prompt says "return JSON" | None; ~76-95% success rate | None (unconstrained) | None | Any format describable in text |
| JSON mode | Provider guarantees valid JSON | Valid JSON, no schema | Minimal | Minimal | Any JSON structure |
| Post-validation + retry | Parse output, retry on failure | Eventual (with retry budget) | None per attempt | 2-5x on failures | Any parseable format |
| Fine-tuning | Train model on structured examples | High but not 100% | Improved for trained schemas | None at inference | Fixed to training distribution |
| FSM constrained | Token masking via finite-state machine | 100% for regular languages | -3% to +4% depending on task | <0.4% overhead + compilation | Regex-expressible schemas |
| PDA constrained | Token masking via pushdown automaton | 100% for context-free languages | Similar to FSM | Slightly higher than FSM | Recursive/nested schemas |
| Provider-hosted | Server-side constrained decoding | 100% on accepted schemas | Tuned per model | First-call compilation cost | Provider-defined subset |
| Grammar-aligned (ASAp) | Corrected sampling distribution | 100% with provable fidelity | Minimal degradation | Higher per-token cost | Same as PDA |

The industry has converged on constrained decoding as the default for production use, with provider-hosted implementations for API consumers and open-source engines (XGrammar, Guidance, Outlines) for self-hosted models. Fine-tuning remains relevant for specialized domains where the model needs to learn which values to produce, not just what structure to follow.

[IMAGE: Decision tree for choosing a structured output strategy based on requirements: self-hosted vs API, schema complexity, latency sensitivity, and accuracy requirements. Caption: "Choosing the right structured output strategy depends on your deployment model and schema complexity."]

How It Is Used in Practice

Function Calling and Tool Use

Every major LLM provider now ships function calling as a first-class feature, and constrained decoding is the mechanism that makes it reliable. When you define a tool with a JSON Schema input specification, the provider's inference engine constrains the model's output to match that schema exactly.

OpenAI introduced strict: true in tool definitions (August 2024), requiring additionalProperties: false and explicit required arrays. With strict mode, their evaluations show 100% schema adherence on supported schemas, up from approximately 40% with earlier models (OpenAI, 2024).
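Because providers reject schemas that do not meet such requirements, it can be useful to check a schema locally before sending it. The sketch below covers only the two requirements named above, reading "explicit required arrays" as every property being listed in `required` (an assumption); the `strict_mode_problems` helper is hypothetical, and the provider's documentation remains the authority on the full supported subset.

```python
def strict_mode_problems(schema: dict, path: str = "$") -> list[str]:
    """Recursively check additionalProperties and required on every object."""
    problems = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not listed in required: {missing}")
        for name, sub in props.items():
            problems += strict_mode_problems(sub, f"{path}.{name}")
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems += strict_mode_problems(schema["items"], f"{path}[]")
    return problems

weather = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature": {"type": "number"},
        "unit": {"enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city", "temperature", "unit"],
    "additionalProperties": False,
}
loose = {"type": "object", "properties": {"query": {"type": "string"}}}

print(strict_mode_problems(weather))  # []
print(strict_mode_problems(loose))
```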

Anthropic released structured outputs in public beta (November 2025) with both JSON mode and strict tool calling. As their documentation states: structured outputs "compile your JSON schema into a grammar and actively restrict token generation during inference. The model literally cannot produce tokens that would violate your schema" (Anthropic, 2025).

Google Gemini uses response_schema based on OpenAPI 3.0 schema definitions, supporting Pydantic (Python) and Zod (JavaScript) for schema specification (Google, 2024).

Data Extraction Pipelines

Extracting structured records from unstructured text is perhaps the most common production use case. Medical record parsing, invoice processing, resume screening: all require the model to fill a fixed schema from variable input. Without constrained decoding, these pipelines need extensive error handling for missing fields, wrong types, and malformed output. With it, every response is guaranteed parseable.

The Pydantic Integration Pattern

The dominant integration pattern in Python codebases uses Pydantic as the schema definition layer. The Instructor library (python.useinstructor.com) exemplifies this:

  1. Define a Pydantic model describing the desired output.
  2. Pass it to the LLM call (Instructor handles schema conversion).
  3. Get back a typed, validated Python object.
  4. If validation fails (semantic checks, custom validators), Instructor automatically retries with the error message fed back to the model.

This pattern works across 15+ providers and has become the de facto standard for Python-based LLM applications. The retry-with-feedback loop is particularly valuable: when the model produces a schema-valid but semantically wrong response (say, a negative age), Pydantic's validators catch it and the model gets another attempt with explicit error context.
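The retry-with-feedback loop can be sketched without any external dependency. This is not Instructor's implementation or API: `validate_person` is a plain-Python stand-in for a Pydantic model with a custom validator, and `fake_model` is a hypothetical model that corrects itself once it sees the error message.

```python
import json

def validate_person(raw: str) -> dict:
    """Stand-in for schema parsing plus a semantic validator."""
    person = json.loads(raw)            # JSONDecodeError is a ValueError subclass
    if person["age"] < 0:
        raise ValueError("age must be non-negative")
    return person

def extract_with_retries(call_model, prompt: str, max_retries: int = 2) -> dict:
    """Retry, feeding each validation error back into the next attempt."""
    messages = [prompt]
    for _ in range(max_retries + 1):
        raw = call_model(messages)
        try:
            return validate_person(raw)
        except ValueError as err:
            messages.append(f"Previous answer {raw} was rejected: {err}. Try again.")
    raise RuntimeError(f"no valid answer after {max_retries + 1} attempts")

def fake_model(messages):
    """Hypothetical model: wrong on the first try, fixed after feedback."""
    age = -3 if len(messages) == 1 else 30
    return json.dumps({"name": "Ada", "age": age})

person = extract_with_retries(fake_model, "Extract the person from: Ada, 30.")
print(person)  # {'name': 'Ada', 'age': 30}
```

The retry budget matters: each retry re-sends the conversation with the error appended, so an unbounded loop would reintroduce the cost problem that constrained decoding was meant to remove.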

Agentic Workflows

Agent systems amplify the stakes of structured output. A single malformed tool call in a multi-step reasoning chain can cascade: the agent retries, re-sends the full conversation history (multiplying token consumption), and may enter a loop. Constrained decoding eliminates this failure class entirely, which is why every major agent framework (LangGraph, CrewAI, AutoGen) defaults to structured tool calling when available.

[IMAGE: Architecture diagram of an agent loop showing the LLM generating constrained tool calls, the executor running them, and results feeding back. Red X marks where malformed output would break the loop without constrained decoding. Caption: "In agent loops, a single malformed tool call cascades into retries and token waste."]

Self-Hosted Inference

For organizations running their own models, the open-source constrained decoding stack is mature. SGLang with XGrammar is the current performance leader for structured generation workloads, achieving approximately 3x higher throughput than vLLM on constrained decoding tasks (SqueezeBits, 2025). The stack handles JSON Schema, regex, and EBNF grammar constraints with under 40 microseconds of per-token overhead.

vLLM supports both XGrammar and Outlines as grammar backends. XGrammar implements its pushdown automaton in C++, compiles grammars using multiple threads, and caches tokenizer data to minimize startup cost. The compilation happens once per schema and is cached across requests.

Insights Worth Remembering

  1. Constrained decoding is model-agnostic. It modifies the sampling distribution, not the model weights. Any autoregressive model works.

  2. Schema compilation is the hidden cost. The per-token overhead is negligible (<50 microseconds), but the first-time compilation of a complex schema can take seconds to minutes. Cache aggressively.

  3. Deterministic tokens should not cost LLM calls. Coalescence and jump-forward decoding exploit the fact that most structural tokens in JSON are deterministic given the schema. Skipping them makes constrained generation faster than unconstrained generation for structured tasks.

  4. Syntactic validity is not semantic correctness. A perfectly formatted JSON response can contain hallucinated values. Constrained decoding solves the structure problem, not the accuracy problem. Combine it with Pydantic validators for semantic checks.

  5. Distribution distortion is real and quantifiable. Masking tokens changes the distribution the model samples from. For high-stakes reasoning tasks, consider Grammar-Aligned Decoding (ASAp) or allow the model to reason freely before constraining its final output.

  6. Provider APIs are conservative; open-source engines are flexible. OpenAI and Gemini reject schemas they cannot guarantee. Guidance and XGrammar attempt any schema, with varying success. Choose based on whether you need coverage or compliance.

  7. Token healing matters at boundaries. The mismatch between character-level grammars and token-level generation creates subtle quality issues. Guidance's token healing contributes roughly a 3% accuracy improvement across benchmarks.

  8. The "think then constrain" pattern is emerging. Recent research (2025-2026) shows that letting the model reason freely in unconstrained space before constraining its final structured output preserves both reasoning quality and structural guarantees.

  9. LMQL demonstrated that constraints can reduce cost. By eagerly evaluating constraints during generation, LMQL achieves 26-85% cost savings through early termination of invalid paths, reducing total API calls by up to 80% (ETH Zurich SRI Lab).

  10. The convergence is real. All major providers, open-source engines, and agent frameworks now support constrained decoding. The technique has moved from research novelty to infrastructure default in under three years.
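
Insights 1 and 5 can be seen in a toy example. The sketch below assumes a four-token vocabulary with made-up logits and a hypothetical grammar that permits only `{` or `[` at this position. It touches nothing but the logits, which is why the technique is model-agnostic, and it shows how masking moves probability mass onto the surviving tokens.

```python
import math


def softmax(logits: dict) -> dict:
    """Numerically stable softmax over a token -> logit mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def constrained_distribution(logits: dict, allowed: set) -> dict:
    """Mask disallowed tokens to -inf, then renormalise over the rest."""
    if not any(tok in allowed for tok in logits):
        raise ValueError("grammar allows no token in the vocabulary")
    masked = {
        tok: (val if tok in allowed else -math.inf)
        for tok, val in logits.items()
    }
    return softmax(masked)


# Toy next-token logits at the start of a response (values are made up).
logits = {"Sure": 3.0, "{": 1.0, "[": 0.5, "I": 0.2}
allowed = {"{", "["}  # what a hypothetical JSON grammar permits here

free = softmax(logits)
constrained = constrained_distribution(logits, allowed)
```

Within a single step the ratio between allowed tokens is unchanged; only the mass that belonged to masked tokens is redistributed. Applying this renormalisation greedily at every step is what lets the sequence-level distribution drift from the one the model would have produced.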

Open Questions

Can we constrain semantics, not just syntax? Current systems enforce that a field is a number, but not that it is a reasonable number. Integrating value-range constraints, cross-field consistency checks, and factual grounding into the decoding loop remains unsolved. Pydantic validators handle this post-generation, but doing it during generation would eliminate wasted tokens.
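
A post-generation semantic check of the kind described here might look like the following. It is a dependency-free sketch standing in for Pydantic validators, not Pydantic itself; the invoice fields, the tolerance, and the `semantic_errors` helper are all assumptions made for illustration.

```python
import json


def semantic_errors(raw: str) -> list:
    """Return semantic problems in output that is already schema-valid JSON."""
    data = json.loads(raw)
    subtotal, tax, total = data["subtotal"], data["tax"], data["total"]
    errors = []
    # Value-range constraints: structure alone cannot enforce these.
    if subtotal < 0:
        errors.append("subtotal must be non-negative")
    if not 0 <= tax <= subtotal:
        errors.append("tax must lie between 0 and subtotal")
    # Cross-field consistency: the fields must agree with each other.
    if abs(subtotal + tax - total) > 0.01:
        errors.append("total must equal subtotal + tax")
    return errors


ok = semantic_errors('{"subtotal": 100.0, "tax": 8.0, "total": 108.0}')
bad = semantic_errors('{"subtotal": 100.0, "tax": 8.0, "total": 180.0}')
```

Both inputs would pass a schema that only says "three numbers", yet the second is wrong. Because the check runs after generation, every token of the bad response has already been paid for.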

What is the right grammar formalism? Regular expressions (recognized by finite-state machines) and context-free grammars (recognized by pushdown automata) trade expressiveness for efficiency in different ways. Recent work on Earley-driven dynamic pruning suggests that more powerful formalisms can be made practical, but the field has not settled on a standard.
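
The expressiveness gap is easiest to see with nesting. A recognizer with a fixed number of states cannot track arbitrarily deep brackets, while a stack can. The toy checker below is a sketch of the stack-based idea only; it ignores JSON strings, commas, and values, and `is_balanced` is a name chosen for this example.

```python
PAIRS = {"]": "[", "}": "{"}


def is_balanced(text: str) -> bool:
    """Stack-based check that brackets and braces nest correctly."""
    stack = []
    for ch in text:
        if ch in "[{":
            stack.append(ch)  # remember which closer is owed
        elif ch in PAIRS:
            # A closer must match the most recent unmatched opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # every opener must have been closed


deep = is_balanced("[{[[{}]]}]")
crossed = is_balanced("[{]}")
unclosed = is_balanced("[[")
```

The stack is the extra memory that buys arbitrary nesting depth, and maintaining it per sequence during decoding is where the additional cost comes from.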

How should models be trained to work with constraints? OpenAI trains its models to follow complex schemas, reporting 93% on its schema-following benchmark from training alone before constrained decoding brings it to 100%. Should all models be trained with schema-awareness, or is the decoding-time approach sufficient?

Will speculative decoding compose with constrained decoding? Speculative decoding uses a small draft model to propose tokens that a larger model verifies. Combining this with grammar constraints could yield multiplicative speedups, but ensuring the draft model's proposals satisfy grammar constraints adds complexity. Early results are promising but not yet production-ready.
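
One possible arrangement, sketched under strong simplifying assumptions, is to truncate the draft at the first token the grammar rejects before handing it to the verifier. The toy grammar (nested square brackets) and both function names are hypothetical, and a real engine would advance an incremental matcher instead of re-scanning the prefix on every token.

```python
from typing import Callable, List, Sequence


def bracket_prefix_ok(tokens: Sequence[str]) -> bool:
    """Toy grammar: a prefix is viable if no bracket closes before it opens."""
    depth = 0
    for tok in tokens:
        if tok == "[":
            depth += 1
        elif tok == "]":
            depth -= 1
        else:
            return False  # token outside the toy vocabulary
        if depth < 0:
            return False
    return True


def grammar_filter_draft(
    prefix: Sequence[str],
    draft: Sequence[str],
    prefix_ok: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Keep the longest run of draft tokens that stays grammar-viable."""
    accepted: List[str] = []
    for tok in draft:
        if not prefix_ok([*prefix, *accepted, tok]):
            break  # everything after the first violation is discarded
        accepted.append(tok)
    return accepted


prefix = ["["]                     # tokens already committed
draft = ["[", "]", "]", "]", "["]  # proposal from a hypothetical draft model
kept = grammar_filter_draft(prefix, draft, bracket_prefix_ok)
```

Here the fourth draft token would close a bracket that was never opened, so only the first three survive. The larger model would then verify that shortened run as usual.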

What happens when schemas conflict with the model's knowledge? If a schema requires an enum of ["true", "false"] but the factually correct answer is "unknown", the model is forced to lie structurally. The interaction between structural constraints and truthfulness is underexplored.

[IMAGE: Research frontier diagram showing open problems in constrained decoding: semantic constraints, trained schema awareness, speculative composition, and truthfulness under structural pressure. Caption: "The next frontier: moving from syntactic guarantees to semantic ones."]

Sources and Further Reading

  1. Willard, B. T. and Louf, R. "Efficient Guided Generation for Large Language Models." arXiv:2307.09702, 2023. https://arxiv.org/abs/2307.09702 - The foundational paper establishing FSM-based guided generation.

  2. Park, K., Wang, J., Berg-Kirkpatrick, T. et al. "Grammar-Aligned Decoding." NeurIPS 2024. https://arxiv.org/abs/2405.21047 - Identifies and addresses quality degradation from constrained decoding via the ASAp algorithm.

  3. Zheng, L. et al. "SGLang: Efficient Execution of Structured Language Model Programs." NeurIPS 2024. https://arxiv.org/abs/2312.07104 - Introduces compressed FSM and jump-forward decoding.

  4. Geng, S. et al. "Generating Structured Outputs from Language Models: Benchmark and Studies." arXiv:2501.10868, January 2025. https://arxiv.org/abs/2501.10868 - Comprehensive benchmark comparing six constrained decoding frameworks.

  5. OpenAI. "Introducing Structured Outputs in the API." August 2024. https://openai.com/index/introducing-structured-outputs-in-the-api/

  6. Anthropic. "Structured Outputs." Claude API Documentation, November 2025. https://docs.claude.com/en/docs/build-with-claude/structured-outputs

  7. Google. "Structured Output - Gemini API." 2024. https://ai.google.dev/gemini-api/docs/structured-output

  8. dottxt. "Coalescence: Making LLM Inference 5x Faster." dottxt Engineering Blog. https://blog.dottxt.ai/coalescence.html

  9. LMSYS. "Fast JSON Decoding for Local LLMs with Compressed Finite State Machine." February 2024. https://www.lmsys.org/blog/2024-02-05-compressed-fsm/

  10. vLLM Project. "Structured Decoding in vLLM: A Gentle Introduction." January 2025. https://vllm-project.github.io/2025/01/14/struct-decode-intro.html

  11. Outlines (dottxt-ai). Structured Text Generation Library. https://github.com/dottxt-ai/outlines

  12. XGrammar (MLC-AI). Fast, Flexible and Portable Structured Generation. https://github.com/mlc-ai/xgrammar

  13. Microsoft Guidance. Control LM Output. https://github.com/guidance-ai/llguidance

  14. LMQL. A Programming Language for LLM Interaction. ETH Zurich SRI Lab. https://lmql.ai/

  15. Instructor. Structured Outputs for LLMs. https://python.useinstructor.com/

  16. SqueezeBits. "Guided Decoding Performance on vLLM and SGLang." 2025. https://blog.squeezebits.com/guided-decoding-performance-vllm-sglang
