Programmatic and Templated Data Generation

The FLAN dataset (Wei et al., 2021) converted 62 NLP benchmarks into instruction-following training examples not by calling a language model, but by writing a set of text templates for each task and filling them with existing labelled data. A sentiment analysis benchmark became an instruction-following example by instantiating a template like "Classify the sentiment of the following review as positive or negative: {review_text}\nAnswer: {label}". No model was consulted. No generation budget was spent. Thousands of training examples were created in seconds. That template-filling step contributed to FLAN achieving strong zero-shot generalisation across unseen tasks, and the core technique predates large-scale generation by many years.

Programmatic and templated generation is the family of methods that build training examples through code, structured templates, and formal grammars rather than (or alongside) unconstrained model generation. It is the part of the synthetic data toolkit where a software engineer, not a prompt engineer, is most at home.

The Spectrum from Template to Generator

A useful way to place methods is along a spectrum ordered by how much of the output is determined by code versus model:

Method	Code controls	Model controls	Example
Pure template	Format, content, label	Nothing	FLAN task templates
Slot-filling with model	Format, label schema	Slot values	"Paraphrase {src}" with LLM fill
Grammar-guided generation	Token grammar / structure	Surface wording	Constrained decoding with CFGs
Programmatic pipeline	Routing, filtering, composition	Sub-tasks	Instruction + code executor loop
Pure model generation	Nothing	Everything	Self-Instruct, Evol-Instruct

Most real pipelines sit somewhere in the middle three rows. Where a pipeline sits determines its trade-off between control and diversity.

Template-Based Generation: Slot Filling at Scale

A template is a string with typed placeholders. An example schema for a classification task might look like:

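TEMPLATE = (
    "You are given a {domain} document.\n"
    "Document: {body}\n"
    "Does this document discuss {topic}? Answer yes or no.\n"
    "Answer: {label}"
)

# Placeholder text; in practice this would come from your own document store.
contract_snippet = "The Supplier shall indemnify the Buyer against all third-party claims."

example = TEMPLATE.format(
    domain="legal",
    body=contract_snippet,
    topic="indemnification clauses",
    label="yes",
)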

The power of this approach is that every slot can be drawn from a controlled distribution. If you want 30% of your training examples to involve legal text and 70% financial text, you set that ratio in code. If you want uniform coverage over 20 topics, you iterate a loop. If you want hard negatives where the label is "no", you sample from a curated set of documents that do not match. These guarantees are hard to obtain when a model generates everything end-to-end: prompting can nudge the mix, but the output still follows whatever distribution the model already favours, which tends to mirror its training distribution.
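A minimal sketch of this kind of distribution control, assuming a hypothetical in-memory corpus. The `DOCS` entries, the topic names, and the `sample_example` helper are illustrative and not taken from any real dataset; only the 30/70 split comes from the text above.

```python
import random

TEMPLATE = (
    "You are given a {domain} document.\n"
    "Document: {body}\n"
    "Does this document discuss {topic}? Answer yes or no.\n"
    "Answer: {label}"
)

# Hypothetical corpus: (domain, topic the document is about, text).
DOCS = [
    ("legal", "indemnification clauses", "The Supplier shall indemnify the Buyer against all claims."),
    ("legal", "termination rights", "Either party may terminate on thirty days' written notice."),
    ("financial", "interest rates", "The loan accrues interest at 4.5% per annum."),
    ("financial", "dividend policy", "The board declared a quarterly dividend of 12 cents."),
]
DOMAIN_WEIGHTS = {"legal": 0.3, "financial": 0.7}  # the ratio is set in code


def sample_example(rng, negative_rate=0.5):
    """Draw one example with a controlled domain mix and label balance."""
    domain = rng.choices(list(DOMAIN_WEIGHTS), weights=list(DOMAIN_WEIGHTS.values()))[0]
    in_domain = [doc for doc in DOCS if doc[0] == domain]
    _, true_topic, body = rng.choice(in_domain)
    if rng.random() < negative_rate:
        # Hard negative: a topic from the same domain that this document does not cover.
        topic = rng.choice([t for _, t, _ in in_domain if t != true_topic])
        label = "no"
    else:
        topic, label = true_topic, "yes"
    text = TEMPLATE.format(domain=domain, body=body, topic=topic, label=label)
    return {"domain": domain, "label": label, "text": text}


rng = random.Random(0)  # fixed seed so the dataset is reproducible
examples = [sample_example(rng) for _ in range(2000)]
legal_share = sum(e["domain"] == "legal" for e in examples) / len(examples)
print(f"legal share: {legal_share:.2f}")
```

Because the ratio and the negative rate are ordinary parameters, changing the mix is a one-line edit that can be reviewed and versioned like any other code change.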

FLAN's original templates were hand-written (roughly ten templates per task), but the same principle scales. The T0 family (Sanh et al., 2022) drew on a much larger pool of prompts, written by many contributors and covering over 170 NLP datasets, and reported that zero-shot performance on held-out tasks tended to improve with the number of distinct prompt formulations used per dataset during training, not just the number of examples.

Grammar-Based and Compositional Generation

A context-free grammar (CFG) can describe a language of valid training examples exactly. This is used heavily in two domains: mathematical reasoning and structured prediction.

For mathematical word problems, a CFG might encode productions like:

Problem  -> "{name} has {N} {item}s. {event}. How many {item}s does {name} have now?"
Event    -> "{name} gives {K} {item}s to {person}"
         | "{name} receives {K} {item}s from {person}"
         | "{name} loses {K} {item}s"
N, K     -> integer drawn from [1, 100]
name     -> drawn from names corpus
item     -> drawn from concrete nouns

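Sampling from this grammar generates syntactically valid problems with known correct answers, because the sampler knows N, K, and which event it chose. Semantic validity needs one constraint the productions alone do not express: K must not exceed N when items are given away or lost, or the answer goes negative. The verifiable ground truth is a first-class output of the program, not a noisy by-product of model generation. GSM8K, widely used for maths reasoning evaluations, was written by human problem writers rather than sampled from a grammar, but its problems have the same compositional shape: a short chain of elementary arithmetic steps ending in a single checkable answer.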
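The grammar above can be sampled with a few lines of code. The sketch below is one possible reading of it, with small stand-in word lists in place of the names and nouns corpora (all hypothetical). It draws integers from [2, 100] rather than [1, 100] so the plural "{item}s" always reads correctly, and it caps K at N for events that remove items so the answer never goes negative.

```python
import random

NAMES = ["Ada", "Ben", "Chloe"]         # stand-in for a names corpus
PEOPLE = ["Dev", "Elena", "Farid"]      # recipients and givers
ITEMS = ["apple", "marble", "sticker"]  # nouns that pluralise with a plain "s"

# Each Event production: (text template, sign of its effect on the count).
EVENTS = [
    ("{name} gives {k} {item}s to {person}", -1),
    ("{name} receives {k} {item}s from {person}", +1),
    ("{name} loses {k} {item}s", -1),
]


def sample_problem(rng):
    """Sample one word problem together with its ground-truth answer."""
    name, person, item = rng.choice(NAMES), rng.choice(PEOPLE), rng.choice(ITEMS)
    template, sign = rng.choice(EVENTS)
    n = rng.randint(2, 100)
    # Removing more items than exist would be nonsensical, so cap K at N there.
    k = rng.randint(2, n) if sign < 0 else rng.randint(2, 100)
    event = template.format(name=name, k=k, item=item, person=person)
    question = f"{name} has {n} {item}s. {event}. How many {item}s does {name} have now?"
    return {"question": question, "answer": n + sign * k}


rng = random.Random(0)
problems = [sample_problem(rng) for _ in range(500)]
print(problems[0]["question"])
print(problems[0]["answer"])
```

The answer is computed from the same variables that fill the template, so the label cannot drift from the question text.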

A more sophisticated form is grammar-guided constrained decoding, where a formal grammar is applied at inference time to force the model to produce only valid outputs. Libraries such as Outlines (Willard and Louf, 2023) implement this by computing, at each decoding step, which tokens are valid continuations under the grammar, and masking the rest. This turns a free-form language model into a guaranteed-format generator, useful when the downstream consumer of the synthetic data requires strict JSON, SQL, or another structured format.
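The masking step can be illustrated without any real model. The toy below is not the Outlines API; it is a hand-rolled sketch in which `fake_logits` stands in for a language model, the vocabulary has eight tokens, and the "grammar" is a small finite-state machine (a full CFG would also need a stack). All names are hypothetical.

```python
import json
import math
import random

VOCAB = ['{', '}', '"label"', ':', '"yes"', '"no"', 'maybe', '<eos>']

# Finite-state grammar for the language {"label": "yes"} | {"label": "no"}.
# state -> {allowed token: next state}; None marks the end of generation.
TRANSITIONS = {
    0: {'{': 1},
    1: {'"label"': 2},
    2: {':': 3},
    3: {'"yes"': 4, '"no"': 4},
    4: {'}': 5},
    5: {'<eos>': None},
}


def fake_logits(rng):
    """Stand-in for a language model: one arbitrary score per vocabulary token."""
    return [rng.gauss(0.0, 1.0) for _ in VOCAB]


def constrained_decode(rng):
    """Greedy decoding where tokens the grammar forbids are masked to -inf."""
    state, output = 0, []
    while state is not None:
        logits = fake_logits(rng)
        allowed = TRANSITIONS[state]
        masked = [s if tok in allowed else -math.inf for tok, s in zip(VOCAB, logits)]
        token = VOCAB[masked.index(max(masked))]
        if token != '<eos>':
            output.append(token)
        state = allowed[token]
    return " ".join(output)


text = constrained_decode(random.Random(0))
record = json.loads(text)  # always parses, whatever the scores were
print(record)
```

The model still chooses between "yes" and "no" at state 3, so the grammar fixes the format while leaving the content decision to the scores.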

Programmatic Pipelines: Composing Code and Model Calls

The most expressive programmatic generators interleave code execution with model calls. A schematic example for synthetic code-generation data:

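def build_dataset(problem_library, llm, sandbox):
    """Keep only model-written solutions that pass every unit test.

    `llm` and `sandbox` are placeholders for a model client and an isolated
    code executor; their method names here are illustrative, not a real API.
    """
    dataset = []
    for problem_spec in problem_library:
        # Step 1: generate a candidate solution (model call)
        code = llm.generate(f"Write Python to: {problem_spec.description}")

        # Step 2: run the code against the spec's tests (deterministic)
        result = sandbox.execute(code, test_cases=problem_spec.tests)

        # Step 3: accept only correct solutions (programmatic gate)
        if result.all_passed:
            dataset.append({
                "instruction": problem_spec.description,
                "response": code,
                "tests_passed": True,
            })
    return dataset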

This is rejection sampling (covered in its own concept), but the key point here is that the acceptance criterion is a deterministic program, not another model. The unit test is the oracle. For the behaviour the tests actually cover, no model-predicted quality score can match the certainty of "the code ran and produced the right output". OpenMathInstruct-1 (Toshniwal et al., 2024) applied this principle to generate 1.8 million maths problem-solution pairs, using Mixtral to produce candidate solutions and an automatic check of each final answer against the known ground truth to verify correctness. The best model fine-tuned on this data reached 84.6% on GSM8K and 50.7% on MATH, with a training set built entirely from open-licensed models.
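A self-contained version of the gate, with hard-coded candidate strings standing in for model output. The `passes_tests` helper and the `add` task are hypothetical, and `exec` is used only to keep the sketch short: it is not a sandbox, and a real pipeline would run untrusted code in an isolated process or container.

```python
def passes_tests(source, func_name, tests):
    """Return True only if the candidate defines func_name and passes every test.

    WARNING: exec() offers no isolation; this is for illustration only.
    """
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as rejection.
        return False


# Stand-ins for three model samples answering "Write Python to add two numbers".
candidates = [
    "def add(a, b):\n    return a - b\n",  # runs, but wrong
    "def add(a, b):\n    return a + b\n",  # correct
    "def add(a, b) return a + b\n",        # does not parse
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

accepted = [c for c in candidates if passes_tests(c, "add", tests)]
print(len(accepted))
```

Only the second candidate survives, and the decision involves no model judgement: a candidate is kept if and only if every assertion holds.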

The pipeline pattern is also used for data augmentation rather than data creation. A programmatic augmenter might:

Swap named entities (replace "London" with "Berlin") while preserving all semantic structure and the label.
Apply back-translation (English to French then back) via a deterministic pipeline call.
Corrupt specific token types (mask numbers, redact email addresses) to produce noise-robust training examples.

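Each of these transformations is reproducible and cheap to audit. Entity swapping and token corruption need no generative model at all; back-translation does rely on a translation model, but with a fixed model and deterministic decoding it behaves like any other pipeline step.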
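The entity-swap and corruption steps can be sketched with regular expressions alone. The swap table, the placeholder tokens `[EMAIL]` and `[NUM]`, and the example record are illustrative choices, and the email pattern is deliberately simple rather than standards-complete.

```python
import re

CITY_SWAPS = {"London": "Berlin", "Paris": "Madrid"}  # illustrative swap table
CITY_RE = re.compile(r"\b(" + "|".join(map(re.escape, CITY_SWAPS)) + r")\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")    # simplified email pattern
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def swap_entities(text):
    """Replace each known city with its counterpart; the label is untouched."""
    return CITY_RE.sub(lambda m: CITY_SWAPS[m.group(1)], text)


def redact_emails(text):
    return EMAIL_RE.sub("[EMAIL]", text)


def mask_numbers(text):
    return NUMBER_RE.sub("[NUM]", text)


example = {
    "text": "Contact ana@example.com about the 3 offices in London.",
    "label": "logistics",
}
# Redact emails before masking numbers so digits inside addresses are not split.
augmented = {**example, "text": mask_numbers(redact_emails(swap_entities(example["text"])))}
print(augmented["text"])
```

Each function is pure, so the same input always yields the same output and every augmented example can be traced back to its source.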

Distribution Control: The Principal Advantage

The phrase "control over distribution" appears in almost every paper that proposes a programmatic generation method, and it is worth making precise.

When a model generates training examples freely, the probability of example \(x\) appearing in the dataset is roughly proportional to \(p_{\text{model}}(x)\): whatever the model finds likely. Rare phenomena (uncommon reasoning patterns, edge-case formats, minority dialect structures) are underrepresented. Common phenomena are overrepresented. The training set inherits the model's existing biases.

A programmatic pipeline breaks this coupling. The probability of example \(x\) appearing in the dataset is whatever the sampling distribution of the code specifies. Want exactly 1,000 examples of each of 50 arithmetic operation types? Write a loop. Want adversarial negatives that are semantically close to positives but structurally different? Write a perturbation function. The dataset design becomes a software engineering problem, not a prompt engineering problem, and is therefore more tractable to analyse, version, and test.
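As a sketch of the "write a loop" point, the code below produces an exact quota per operation type and then checks that quota with an assertion, the way one would test any other software artefact. It uses three operation types rather than fifty, and the prompt wording and helper names are hypothetical.

```python
import operator
import random
from collections import Counter

# Three operation types for brevity; extend the table to cover more.
OPS = {
    "add": ("+", operator.add),
    "sub": ("-", operator.sub),
    "mul": ("*", operator.mul),
}
PER_OP = 1000  # exact quota per operation type


def make_example(op_name, rng):
    symbol, fn = OPS[op_name]
    a, b = rng.randint(0, 99), rng.randint(0, 99)
    return {"op": op_name, "prompt": f"What is {a} {symbol} {b}?", "answer": fn(a, b)}


rng = random.Random(0)
dataset = [make_example(op, rng) for op in OPS for _ in range(PER_OP)]
rng.shuffle(dataset)  # mix the types so training order carries no signal

# The dataset's distribution is now something a unit test can verify.
counts = Counter(ex["op"] for ex in dataset)
assert all(count == PER_OP for count in counts.values())
print(dict(counts))
```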

This matters most for evaluation data. When measuring a model on a capability, you want the evaluation set to have known, controlled properties. Model-generated eval data risks circularity: a model tends to score well on data generated by a similar model. Programmatically generated eval data, derived from a formal specification, can be genuinely novel to the model under evaluation.

When It Falls Down

Template rigidity limits naturalness. Training a model on slot-filled templates produces a model that generates responses matching the template's style. If all training examples share the same sentence structure ("Given the following {X}, identify the {Y}"), the fine-tuned model may mirror that structure in its own outputs, producing unnatural-sounding responses in deployment. The FLAN team observed this and addressed it by writing many template variants per task. The general fix is to combine programmatic generation with a paraphrase or style-variation pass (using a model), but this adds cost and reintroduces distribution drift.

Grammar coverage gaps. A CFG that describes arithmetic word problems cannot generate the geometry or combinatorics problems that live outside its productions. Coverage is exactly what the grammar specifies, nothing more. Practitioners often underestimate the cost of writing comprehensive grammars for complex domains; a grammar that looks complete typically has corner cases that appear in real evaluation data but not in the grammar's reachable set.

Oracle brittleness. Programmatic acceptance criteria are only as good as the oracle. Unit tests catch functional bugs but not logical errors in the solution design. A code snippet that passes five test cases may still be subtly wrong in ways the tests do not probe. In natural language domains, there is often no oracle at all; the only available verifier is another model, which collapses the programmatic advantage back into the model-driven regime.
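A small illustration of a candidate that passes five tests and is still wrong. The leap-year task and the specific test years are chosen for this example only; `calendar.isleap` from the standard library serves as the reference.

```python
import calendar


def is_leap_year(year):
    # Buggy candidate: ignores the century rule (divisible by 100 but not by 400).
    return year % 4 == 0


# Five plausible-looking tests, none of which probes a non-leap century year.
tests = [(2024, True), (2023, False), (2000, True), (1996, True), (2019, False)]
passed = all(is_leap_year(year) == expected for year, expected in tests)
print(passed)

# The gap only shows up on an input the tests never exercised.
print(is_leap_year(1900), calendar.isleap(1900))
```

The gate would accept this candidate, so the dataset's "verified" label is only as strong as the test suite behind it.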

Combinatorial explosion in slot space. A template with five independent slots, each with 100 possible values, has 10 billion potential instantiations. Sampling this space uniformly is fine, but it is easy to produce extremely imbalanced coverage when slot distributions are correlated in the real world. A "document + label" template where most documents are positive but most programmatic negatives are constructed artificially produces a training set that does not reflect real negative examples. The code controls the distribution, but only as well as the programmer's model of the real distribution.
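The arithmetic, and the correlation problem, can both be checked in a few lines. The `OBSERVED_PAIRS` list is a hypothetical stand-in for slot combinations seen in real data; the point is only that independent sampling reaches combinations the real distribution never produces.

```python
import math
from itertools import product

# Five independent slots with 100 values each.
slot_sizes = [100] * 5
total = math.prod(slot_sizes)
print(total)  # 100**5 = 10**10, i.e. 10 billion

# Hypothetical (domain, topic) pairs that actually co-occur in real documents.
OBSERVED_PAIRS = [
    ("legal", "indemnification clauses"),
    ("legal", "termination rights"),
    ("financial", "interest rates"),
]
domains = sorted({domain for domain, _ in OBSERVED_PAIRS})
topics = sorted({topic for _, topic in OBSERVED_PAIRS})

# Sampling each slot independently can reach every cell of the cross product,
# including pairings that never appear in the observed data.
independent = set(product(domains, topics))
implausible = independent - set(OBSERVED_PAIRS)
print(len(independent), len(implausible))
```

Sampling whole tuples from observed data, or conditioning one slot on another, keeps the generator inside the region the real distribution occupies.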

Evaluation contamination through specification leakage. If the formal specification used to generate training examples was written with knowledge of benchmark questions (even indirectly), the trained model may appear to generalise when it is actually recalling structural patterns from the spec. This is especially insidious in grammar-based generation where the grammar author may have inspected benchmark problems while writing productions.