Deck library
Agents & Tool Use
9 concept(s)
- Model decides to call a tool and emits {name, arguments} as a structured message.
- Host executes the tool and returns the result.
- Model integrates the result and either answers or calls another tool.
- Repeat until done (capped).
The model never executes anything itself - it only emits intents. The host is the trust boundary.
Underspecified function descriptions. The model picks tools by reading the description field, not the name. "Get user info" tells the router nothing; "Look up a customer's current subscription status by user ID. Returns plan, valid_until, and recent payment history" disambiguates against every other lookup tool. Treat descriptions as prompts to the router, because that is exactly what they are.
- Parallel when calls are independent: fetch customer details and fetch order history simultaneously in one turn.
- Serial when one call depends on another's output: fetch the customer's ID first, then use it to fetch their orders. Serial calls require separate turns because the second call's arguments depend on the first call's result.
Parallel cuts latency dramatically; modern APIs (Anthropic, OpenAI) support multiple tool calls per turn.
- Iteration cap (3-5 typical) to stop runaway loops.
- Argument validation against the tool schema before executing - never trust the model's JSON blindly.
- Sandboxing for any tool that touches the file system, shell, or external network.
The model is a planner, not a safety layer. Treat tool calls as untrusted input.
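A minimal sketch of the argument-validation guardrail, assuming a hand-rolled checker that covers only required keys and primitive types. `validate_arguments` and the `SCHEMA` below are hypothetical names for illustration; a production host would more likely use a full JSON Schema validator.

```python
# Hypothetical schema for an illustrative lookup tool.
SCHEMA = {
    "type": "object",
    "properties": {"user_id": {"type": "string"}, "limit": {"type": "integer"}},
    "required": ["user_id"],
}

# Maps the JSON Schema primitive names used above to Python types.
_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_arguments(args, schema):
    """Return a list of problems; an empty list means the call may run."""
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    problems = []
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required argument: {key}")
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected argument: {key}")
        elif isinstance(value, bool) and spec["type"] != "boolean":
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"{key} should be {spec['type']}")
        elif not isinstance(value, _TYPES[spec["type"]]):
            problems.append(f"{key} should be {spec['type']}")
    return problems
```

The host executes the tool only when the returned list is empty; otherwise the problems can be sent back to the model as a validation error.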
JSON Schema with three fields: name (machine identifier), description (the natural-language hint the model reads to choose), and parameters (a JSON Schema object defining argument types and which are required). The model returns a tool-call message with the matching name and a JSON arguments payload conforming to the schema. Strict mode on modern APIs guarantees the arguments validate.
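As an illustration, a tool definition with those three fields might look like the sketch below. The tool name and the shape of the tool-call message are assumptions for illustration, and key names vary by provider (the Anthropic API, for example, names the schema field `input_schema` rather than `parameters`).

```python
import json

# Hypothetical tool definition; the tool name and fields are illustrative.
get_subscription_tool = {
    "name": "get_subscription_status",
    "description": (
        "Look up a customer's current subscription status by user ID. "
        "Returns plan, valid_until, and recent payment history."
    ),
    "parameters": {
        "type": "object",
        "properties": {"user_id": {"type": "string", "description": "Customer user ID"}},
        "required": ["user_id"],
    },
}

# What a tool-call message from the model might look like (shape assumed).
tool_call = {"name": "get_subscription_status", "arguments": json.dumps({"user_id": "u_123"})}
parsed_arguments = json.loads(tool_call["arguments"])
```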
- "tool_use" - Claude wants to call a tool. Execute it, append the result to conversation history, and continue the loop.
- "end_turn" - Claude is done. Present the final response to the user. The task is complete.
The agentic loop continues when stop_reason is "tool_use" and terminates when it is "end_turn". This is model-driven termination - the model signals completion, you don't guess.
- Parsing natural language signals ("I'm done", "Here's the result") - fragile, model phrases things differently each time
- Arbitrary iteration caps as the primary stopping rule (stop after 5 loops) - cuts off legitimate multi-step reasoning; a cap belongs only as a safety backstop against runaway loops
- Checking for assistant text content - model can include text alongside tool calls; text does not mean "done"
Always use stop_reason as the loop control mechanism.
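A minimal sketch of a loop driven by stop_reason, assuming simplified message shapes and a scripted stand-in for the model. `run_agent`, `scripted_model`, and the `get_order` tool are hypothetical; a real API uses richer content blocks. The iteration cap is kept only as a safety backstop and is not the termination signal.

```python
def run_agent(call_model, tools, user_message, max_iterations=10):
    """Loop until the model reports end_turn; the cap is only a safety backstop."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_iterations):
        response = call_model(messages)
        messages.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] == "end_turn":
            return response["content"], messages
        if response["stop_reason"] == "tool_use":
            call = response["tool_call"]
            result = tools[call["name"]](**call["arguments"])
            # Append the result so the next model call can see it.
            messages.append({"role": "tool", "name": call["name"], "content": result})
    raise RuntimeError("iteration cap reached before end_turn")


def scripted_model(messages):
    """Stand-in for a real API call: asks for one tool, then finishes."""
    if not any(m["role"] == "tool" for m in messages):
        return {"stop_reason": "tool_use", "content": "Checking the order.",
                "tool_call": {"name": "get_order", "arguments": {"order_id": "A1"}}}
    return {"stop_reason": "end_turn", "content": "Order A1 has shipped."}


answer, history = run_agent(
    scripted_model,
    {"get_order": lambda order_id: f"{order_id}: shipped"},
    "Where is my order?",
)
```

Note that the first assistant message carries text alongside a tool call, yet the loop continues because stop_reason, not the presence of text, decides termination.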
After executing a tool, its result must be appended to the conversation history before the next API call. Without this, the model cannot see what happened and will re-request the same tools. Appending results allows the model to:
- Reason about new information
- Decide whether more tools are needed
- Integrate multiple results into a coherent response
A coordinator agent (hub) manages all communication, error handling, and information routing. Specialist subagents (spokes) handle specific domains: search, analysis, synthesis, reporting.
Key rules:
- All communication routes through the coordinator
- Subagents have isolated context (no automatic inheritance)
- The coordinator dynamically selects which subagents to invoke based on query complexity
- Not every query needs the full pipeline
Emit multiple Task tool calls in a single coordinator response rather than across separate turns. This spawns subagents concurrently.
Sequential: Turn 1 spawns Agent A, Turn 2 (after A returns) spawns Agent B. Total time = A + B.
Parallel: Single turn spawns both Agent A and Agent B. Total time = max(A, B).
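The timing difference can be sketched with asyncio, assuming each subagent is an awaitable call. `run_subagent` is a hypothetical stand-in that sleeps instead of doing real work, and the durations are arbitrary.

```python
import asyncio
import time


async def run_subagent(name, seconds):
    """Stand-in for a subagent call; sleeps instead of doing real work."""
    await asyncio.sleep(seconds)
    return f"{name} done"


async def sequential():
    # Agent B starts only after Agent A has returned: total is about A + B.
    start = time.perf_counter()
    results = [await run_subagent("A", 0.10), await run_subagent("B", 0.15)]
    return results, time.perf_counter() - start


async def parallel():
    # Both agents start together: total is about max(A, B).
    start = time.perf_counter()
    results = await asyncio.gather(run_subagent("A", 0.10), run_subagent("B", 0.15))
    return list(results), time.perf_counter() - start


seq_results, seq_elapsed = asyncio.run(sequential())
par_results, par_elapsed = asyncio.run(parallel())
```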
Overly narrow decomposition. The coordinator breaks a broad topic into too-narrow subtasks that miss entire domains. Example: "impact of AI on creative industries" decomposed into only visual arts subtasks, completely missing music, writing, and film.
The subagents execute correctly within their assigned scope - the problem is always what the coordinator assigned, not how the subagents performed.
The coordinator's allowedTools must include "Task". The Task tool is the mechanism for spawning subagents. Without it, the coordinator cannot delegate work.
Each subagent is defined with an AgentDefinition that includes a description, system prompt, and tool restrictions specific to its role.
Subagents operate with isolated context - they do NOT automatically inherit the coordinator's conversation history or share memory between invocations.
You must include complete findings from prior agents directly in the subagent's prompt. Use structured data formats to separate content from metadata (source URLs, document names, page numbers) so attribution is preserved through the pipeline.
Programmatic enforcement (hooks, prerequisite gates): When errors have financial, safety, or legal consequences. Example: blocking process_refund until get_customer returns a verified ID. Provides 100% compliance.
Prompt-based guidance (system prompt instructions, few-shot examples): When preferences are stylistic or consequences of non-compliance are minor. Provides probabilistic (<100%) compliance.
Rule: if the answer to "what happens if the model skips this step?" is "incorrect refund / data breach / policy violation," use hooks, not prompts.
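A minimal sketch of a prerequisite gate for the refund example, assuming the host calls `before_tool` ahead of every tool execution and `after_tool` once a result comes back. The class, method names, and result fields are hypothetical, not a specific SDK's hook API.

```python
class RefundGate:
    """Blocks process_refund until get_customer has returned a verified ID."""

    def __init__(self):
        self.verified_customer_ids = set()

    def after_tool(self, tool_name, result):
        # Record verification when the lookup tool reports a verified customer.
        if tool_name == "get_customer" and result.get("verified"):
            self.verified_customer_ids.add(result["customer_id"])

    def before_tool(self, tool_name, arguments):
        """Return (allowed, reason). Runs in host code, outside the model."""
        if tool_name == "process_refund":
            if arguments.get("customer_id") not in self.verified_customer_ids:
                return False, "process_refund blocked: customer not verified"
        return True, "ok"
```

Because the check runs in code rather than in the prompt, the model cannot skip it no matter how it phrases the call.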
PostToolUse hooks transform tool results before the model processes them. Common uses:
- Normalize heterogeneous data formats (Unix timestamps to ISO 8601, numeric status codes to human-readable labels)
- Trim verbose tool outputs to only relevant fields (40-field order lookup trimmed to 5 relevant fields)
They are deterministic transformations that happen between tool execution and model reasoning.
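A sketch of such a transformation, assuming the raw tool result is a flat dict. The field list, status labels, and function name are hypothetical choices for illustration.

```python
from datetime import datetime, timezone

# Hypothetical choices: which fields to keep and how to label status codes.
KEEP_FIELDS = ("order_id", "status", "created_at", "total", "currency")
STATUS_LABELS = {1: "pending", 2: "shipped", 3: "delivered"}


def normalize_order(raw):
    """Deterministically trim and normalise a raw order record."""
    trimmed = {key: raw[key] for key in KEEP_FIELDS if key in raw}
    if isinstance(trimmed.get("created_at"), (int, float)):
        # Unix seconds -> ISO 8601 in UTC.
        trimmed["created_at"] = datetime.fromtimestamp(
            trimmed["created_at"], tz=timezone.utc
        ).isoformat()
    if trimmed.get("status") in STATUS_LABELS:
        trimmed["status"] = STATUS_LABELS[trimmed["status"]]
    return trimmed
```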
When escalating to a human agent, compile a structured handoff summary including:
- Customer ID
- Root cause analysis
- Refund amount / relevant financial details
- Recommended action
- Summary of what was attempted
This is necessary because the human agent receiving the escalation lacks access to the conversation transcript. Everything they need to act must be in the summary.
Good tool descriptions include four elements:
1. Input formats: What identifier types does the tool accept?
2. Example queries: "Use when the user asks about order status"
3. Edge cases: "Returns null if no active orders"
4. Boundaries: "Do NOT use for subscription queries; use get_subscription instead"
Minimal descriptions ("Retrieves customer info") cause unreliable tool selection between similar tools. Tool descriptions are the primary mechanism LLMs use for deciding which tool to call.
- Transient - Timeout, service unavailable. isRetryable: true.
- Validation - Invalid input format. isRetryable: false (fix input first).
- Business - Policy violation (e.g., refund exceeds limit). isRetryable: false. Include customer-friendly explanation.
- Permission - Insufficient access. isRetryable: false.
Every error must include isError: true, errorCategory, isRetryable, and a human-readable description.
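One way to keep those fields consistent is a small helper that builds every error payload. This is a sketch: the `description` key name and the lowercase category strings are assumptions, since the card only names the required fields.

```python
# Retry policy per error category.
RETRYABLE = {"transient": True, "validation": False, "business": False, "permission": False}


def tool_error(category, description):
    """Build an error payload carrying the four required fields."""
    if category not in RETRYABLE:
        raise ValueError(f"unknown error category: {category}")
    return {
        "isError": True,
        "errorCategory": category,
        "isRetryable": RETRYABLE[category],
        "description": description,
    }
```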
MCP Resources: Expose content catalogs (issue summaries, documentation hierarchies, database schemas). They give agents visibility into available data without requiring exploratory tool calls. Read-only, informational.
MCP Tools: Expose actions (search, create, update, delete). They let agents do things in external systems.
Resources reduce the number of tool calls by showing the agent what data exists before it starts querying.
| Setting | Behavior | Use Case |
|---|---|---|
| "auto" | Model may return text OR call a tool | Default - model decides |
| "any" | Model MUST call a tool but picks which | Guarantee structured output when multiple schemas exist |
| {"type":"tool","name":"..."} | Model MUST call this specific tool | Force a specific extraction to run before enrichment |
Key distinction: "any" guarantees a tool call; forced selection guarantees which tool.
4-5 tools per agent is ideal. Giving an agent access to too many tools (e.g., 18) degrades tool selection reliability by increasing decision complexity.
Agents with tools outside their specialization tend to misuse them (a synthesis agent attempting web searches). Restrict each agent to tools relevant to its role, with limited cross-role tools only for specific high-frequency needs.
Project-scoped (.mcp.json): Shared team tooling. Version-controlled. Uses ${ENV_VAR} expansion for credentials. All developers who clone the repo get access.
User-scoped (~/.claude.json): Personal/experimental servers. NOT shared via version control. Only on your machine.
Both are available simultaneously - tools from all configured MCP servers are discovered at connection time.
- User level (~/.claude/CLAUDE.md): Personal preferences. NOT shared via version control. Only applies to you.
- Project level (.claude/CLAUDE.md or root CLAUDE.md): Team standards. Shared via VCS. Applies to all team members.
- Directory level (subdirectory CLAUDE.md): Package-specific rules. Applies to that directory.
Common pitfall: critical conventions in user-level config that new team members don't receive.
Create files in .claude/rules/ with YAML frontmatter containing paths with glob patterns:
paths: ["**/*.test.tsx"]
Rules load only when editing matching files, reducing irrelevant context and token usage. Glob patterns work across directories - ideal for test files, API files, or any convention that spans multiple locations.
Superior to directory-level CLAUDE.md when conventions span multiple directories.
- context: fork - Run the skill in an isolated sub-agent context. Prevents verbose output from polluting the main conversation. The main session only receives the summary.
- allowed-tools - Restrict which tools the skill can access during execution (e.g., limit to file reads to prevent destructive actions).
- argument-hint - Prompt developers for required parameters when they invoke the skill without arguments.
Plan mode: Complex tasks with multiple valid approaches, large-scale changes across many files, architectural decisions (microservice restructuring, library migration affecting 45+ files). Explore first, design, then execute.
Direct execution: Well-scoped changes with clear scope and one obvious approach (single-file bug fix, adding a validation conditional, simple refactoring).
You can combine them: plan mode for investigation, then switch to direct execution for implementation.
The -p (or --print) flag runs Claude Code in non-interactive mode. It processes the prompt, outputs the result to stdout, and exits without waiting for user input.
Without -p, Claude Code waits for interactive input, causing CI jobs to hang indefinitely. This is the documented way to run Claude Code in automated pipelines.
Combine with --output-format json and --json-schema for machine-parseable structured output.
--resume session-name: Continue a prior conversation. Use when prior context is mostly valid and you want to pick up where you left off.
fork_session: Create independent branches from a shared analysis baseline. Use to explore divergent approaches (comparing two refactoring strategies without redoing analysis).
Fresh start with injected summary: New session with a structured summary of prior work. Use when prior tool results are stale (code has changed since last session). More reliable than resuming with outdated context.
"Be conservative" and "only report high-confidence findings" do NOT improve precision. The model interprets these vaguely and inconsistently.
Replace with specific categorical criteria: "Report: bugs, security issues. Skip: minor style, local patterns." Define severity levels with concrete code examples: "Critical = data loss, High = wrong output, Medium = performance regression."
If a specific category has high false positives, temporarily disable it rather than adding vague confidence modifiers.
Few-shot examples are the most effective technique when:
- Detailed instructions alone produce inconsistent results
- You need consistent output format (location, issue, severity, suggested fix)
- Ambiguous scenarios need demonstrated reasoning
- Extraction from varied document structures fails
- You want to reduce false positives
Use 2-4 targeted examples that show reasoning for why one action was chosen over plausible alternatives. Examples enable generalization to novel patterns, not just matching pre-specified cases.
Guarantees: Syntactic validity. The output conforms to the JSON schema - no missing braces, no trailing commas, no type mismatches. Schema syntax errors are eliminated.
Does NOT guarantee: Semantic correctness. The model can produce perfectly formatted JSON with wrong values - line items that don't sum to the stated total, values in the wrong fields, fabricated information for required fields.
For semantic validation, add self-correction patterns: extract calculated_total alongside stated_total, include conflict_detected booleans.
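The same idea can also be checked on the host side. This sketch recomputes the total from the extracted line items and sets conflict_detected itself; it assumes a hypothetical extraction shape with `line_items` (each carrying an `amount`) and `stated_total`.

```python
def check_totals(extraction, tolerance=0.01):
    """Add calculated_total and conflict_detected to an extracted invoice."""
    calculated = round(sum(item["amount"] for item in extraction["line_items"]), 2)
    checked = dict(extraction)
    checked["calculated_total"] = calculated
    # Flag a conflict when the recomputed total disagrees with the stated one.
    checked["conflict_detected"] = abs(calculated - extraction["stated_total"]) > tolerance
    return checked
```

A schema-valid extraction whose numbers do not add up is flagged here, which is exactly the class of error that schema enforcement cannot catch.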
Retries work for: format mismatches, structural output errors, field placement errors. The information exists in the document but the model extracted it wrong. Retry with the original document, the failed extraction, and specific validation errors.
Retries are useless when: the required information is simply absent from the source document. No amount of retrying will extract data that doesn't exist. Mark these as unavailable rather than retrying.
- 50% cost savings compared to real-time API calls
- Processing window: up to 24 hours with no guaranteed latency SLA
- Does NOT support multi-turn tool calling within a single batch request
- Use custom_id to correlate batch request/response pairs
- Good for: overnight reports, weekly audits, nightly test generation
- Bad for: blocking pre-merge checks, real-time code review
- Failure handling: resubmit only failed documents identified by custom_id
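The custom_id correlation and the resubmit-only-failures rule can be sketched as follows. The record shape (`status`, `output`) is an assumption for illustration and not the actual batch API response format.

```python
def split_batch_results(results):
    """Separate succeeded outputs from custom_ids that need resubmission."""
    succeeded, failed_ids = {}, []
    for entry in results:
        if entry["status"] == "succeeded":
            succeeded[entry["custom_id"]] = entry["output"]
        else:
            failed_ids.append(entry["custom_id"])
    return succeeded, failed_ids


def build_retry_batch(original_requests, failed_ids):
    """Resubmit only the requests whose custom_id failed."""
    wanted = set(failed_ids)
    return [req for req in original_requests if req["custom_id"] in wanted]
```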
A model retains its reasoning context from generation. In the same session, it's less likely to question its own decisions - it remembers why it made each choice and will defend those choices even if they're wrong.
An independent review instance (separate Claude session without the generator's reasoning context) is more effective at catching subtle issues. It evaluates the code fresh, without bias from the generation process.
For large PRs: split into per-file local analysis passes plus a cross-file integration pass to avoid attention dilution and contradictions.
Do you need an agent or a workflow? A workflow orchestrates LLM/tool calls through predefined code paths (you wrote the control flow). An agent lets the model dynamically direct its own process and decide when it is done. Workflows win on predictability for well-defined tasks; agents win when the path genuinely cannot be known in advance. Most "agent" projects are workflows in disguise and would be more reliable built as one.
It provides some subset of: control-flow definition (graph/chain/conversation), persistent state across steps, crash-resume persistence, human-in-the-loop checkpoints, and observability. The cost is an abstraction layer between you and the prompt, which is exactly where agents fail. If you cannot see the exact tokens the model received, you cannot debug it. That trade is the whole decision.
- LangGraph: low-level graph of nodes over shared state; explicit control, persistence, durable long-running agents. Widely used for production deployments.
- CrewAI: high-level role-based agents that collaborate; fast to start, harder to steer off-metaphor.
- AutoGen: agents as a conversation; elegant for multi-agent research.
- OpenAI Agents SDK: minimal loop with handoffs and guardrails; good for a thin OpenAI-centric stack.
Because the most successful production implementations used simple, composable patterns rather than complex frameworks, adding abstraction only when a simpler version demonstrably failed. A graph engine wrapped around what is really a three-step prompt chain adds failure modes (state bugs, version churn, hidden prompts) without adding capability. Start with plain code and a loop; adopt a framework when its absence is the thing hurting you.
State models and control-flow primitives do not port between frameworks, so switching is a rewrite, not a config change. The graph, the state object, the persistence layer, all are framework-specific. Choose for where you will be in a year, not for the next demo. The corollary: keep the prompt and tool logic separable from the framework so a migration touches plumbing, not reasoning.
It turns the N-times-M integration problem (every application needs custom glue for every tool/data source) into N-plus-M. Build a server once for a data source and a client once in an application, and any compliant client can talk to any compliant server. The official analogy: a USB-C port for AI, one standardised connector instead of bespoke cables per pairing.
- Tools - model-controlled: functions the model may decide to call (MCP's form of function calling).
- Resources - application-controlled: data the host loads into context (a file, a row, a document).
- Prompts - user-controlled: reusable templated workflows a user invokes deliberately.
The "who controls it" axis is the point: tools are the model's, resources are the host's, prompts are the user's.
- Host: the AI application the user interacts with (Claude Desktop, an IDE); runs one or more clients.
- Client: the connector inside the host that maintains a session with a single server.
- Server: a separate process exposing a data source or tool via the protocol.
Because the server runs as its own process, a server written for Slack works unchanged in any MCP-aware host. Transport is stdio (local subprocess) or HTTP (remote), over JSON-RPC.
Tool results flow back into the model's context, so a malicious or compromised server can mount a prompt-injection attack through the data it returns. Treat third-party MCP servers as untrusted input, not trusted plugins. Authorisation is also the protocol's younger part: the tool-calling plumbing standardised faster than auth/permission scoping, so production deployments wrap servers in their own authentication and least-privilege controls.
It standardises how a model reaches a tool; it does nothing for whether the model picks the right tool or uses it well. A bad tool description fails identically over MCP as over a bespoke integration. MCP is a connector, not a brain. It removes integration boilerplate, not the prompt-and-tool-design work that determines whether the agent actually succeeds.
Foundations
23 concept(s)
- State: the prompt / dialogue context.
- Action: each token sampled from the vocabulary.
- Episode: the full response.
- Reward: a scalar from the reward model, given once at end-of-sequence.
The language model itself is the policy being optimised.
Why it matters: understanding this mapping shows why standard PPO applies and why sparse end-of-episode reward makes credit assignment hard.
- \(r_\phi(x, y)\): reward model score for response \(y\) to prompt \(x\).
- \(\pi_\text{ref}\): the frozen SFT-stage reference policy.
- \(\beta\): coefficient trading off reward maximisation against staying close to the reference. \(\beta = 0\) removes the constraint; large \(\beta\) collapses back to imitation.
The KL term prevents reward hacking by penalising drift from the pre-RLHF model.
Annotators compare two responses \((y_1, y_2)\) for the same prompt and label the preferred one. The reward model is trained to maximise the log-likelihood of the preferred response scoring higher:
\[\mathcal{L} = -\mathbb{E} \left[ \log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right) \right]\]
where \(y_w\) is the "won" (preferred) response and \(y_l\) is the "lost" (rejected) one.
Pairwise comparison is used instead of absolute ratings because humans are far more consistent when making relative judgements.
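The pairwise loss can be evaluated directly on scalar rewards. This is a minimal sketch in plain Python with a hypothetical function name; real training would compute the same quantity on batched tensors with automatic differentiation.

```python
import math


def preference_loss(reward_pairs):
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) pairs."""
    total = 0.0
    for r_w, r_l in reward_pairs:
        margin = r_w - r_l
        # -log(sigmoid(m)) = log(1 + exp(-m)), written to avoid overflow.
        if margin >= 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    return total / len(reward_pairs)
```

When the reward model scores both responses equally the loss is log 2; it shrinks as the preferred response's margin grows and rises when the rejected response scores higher.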
The reward model \(r_\phi\) is an imperfect proxy for true human preferences. As the policy is optimised harder against \(r_\phi\), it finds inputs that exploit gaps in the proxy: responses that score highly on \(r_\phi\) but that humans would not actually prefer. Gao et al. (2022) measured this as an inverted-U curve: ground-truth performance improves then degrades as KL divergence from \(\pi_\text{ref}\) grows.
This is an instance of Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure."
Three compounding difficulties:
- Action space: the vocabulary is roughly 50,000 tokens per step, orders of magnitude larger than most RL benchmarks.
- Reward sparsity: one scalar is given at the end of a full response (hundreds of tokens), making temporal credit assignment approximate.
- Pretrained initialisation: gradient steps must be small enough to avoid catastrophic forgetting of the pretraining knowledge, so learning rates and update magnitudes are tightly constrained.
A value network (critic) must generalise across the full context space simultaneously.
- Annotation noise / bias: human labellers disagree and may prefer long, confident-sounding answers regardless of correctness; the reward model absorbs these biases.
- Verbosity bias: policies learn to produce longer responses because annotators often rate length as a proxy for helpfulness, even when brevity would be superior.
- Style collapse: PPO can converge to a narrow, formulaic style that consistently scores well on the reward model, reducing the diversity present in the pretrained model.
All three are expressions of optimising an imperfect proxy rather than the underlying preference.
The sparse terminal reward is distributed across tokens, with the KL term applied at every position:
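def shaped_token_rewards(r_terminal, logp_theta, logp_ref, beta):
    # logp_theta[t], logp_ref[t]: log-prob of generated token t under each model
    # KL penalty at every token; no other intermediate reward
    token_reward = [-beta * (lt - lr) for lt, lr in zip(logp_theta, logp_ref)]
    token_reward[-1] += r_terminal  # r_phi(x, y), added once at end-of-sequence
    return token_reward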
PPO treats token_reward as the reward signal. The per-token KL penalty is computed from log-probability ratios between the current policy and the frozen reference, both evaluated on the same generated token. This turns a single end-of-episode scalar into a denser signal while continuously penalising distributional drift.
In a full MDP, intermediate rewards and state transitions across every token would require temporal-difference methods and make credit assignment across thousands of tokens intractable. Treating the complete response as a single "arm" and the prompt as the context collapses this to a one-shot bandit pull: one action (the full response), one scalar reward, no state transitions. This is an approximation, but a practical one that makes the policy gradient update tractable with PPO.
The first term maximises reward from the reward model. The second term penalises the KL divergence from the reference (SFT) policy. The \(\beta\) hyperparameter controls the trade-off: low \(\beta\) allows aggressive reward chasing (risks reward hacking); high \(\beta\) keeps the policy close to the supervised baseline (risks under-alignment). The KL term also preserves linguistic fluency learned during pre-training.
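For reference, the objective being described is commonly written as below. This is the standard KL-regularised form from the RLHF literature, using the reward model \(r_\phi\), reference policy \(\pi_\text{ref}\), and coefficient \(\beta\) as elsewhere in this deck; the prompt distribution \(\mathcal{D}\) is notation introduced here.

\[\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\!\left[\, \mathbb{E}_{y \sim \pi_\theta(\cdot|x)}\!\left[r_\phi(x, y)\right] - \beta\, \mathrm{KL}\!\left[\pi_\theta(\cdot|x) \,\|\, \pi_\text{ref}(\cdot|x)\right] \right]\]

The inner expectation is the "first term" (reward) and the KL penalty is the "second term".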
The Bradley-Terry model, fit via a cross-entropy loss on preference pairs:
\[\mathcal{L}_\text{RM} = -\mathbb{E}\!\left[\log \sigma\!\left(r(x, y_w) - r(x, y_l)\right)\right]\]
where \(y_w\) is the preferred response and \(y_l\) the dispreferred one. The assumption is that the probability of preferring \(y_w\) over \(y_l\) depends only on the difference in their scalar rewards. This sidesteps the need for absolute quality scores; annotators only need to rank two options.
Reward over-optimisation occurs when the policy improves its score on the proxy reward model while degrading on the true (gold) reward. Gao, Schulman, and Hilton (2022) measured this empirically and found the relationship is approximately:
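\[r_\text{gold} \approx \alpha \sqrt{d_\text{KL}} - \beta \, d_\text{KL}\]
where \(d_\text{KL}\) is the KL divergence from the reference policy and \(\alpha\), \(\beta\) are fitted coefficients (this \(\beta\) is not the KL penalty coefficient). This is the form they fit for best-of-n sampling; for RL optimisation the second term becomes \(\beta \sqrt{d_\text{KL}} \log \sqrt{d_\text{KL}}\). The proxy reward climbs monotonically with KL divergence from the reference policy, but the gold reward peaks early and then falls. This is Goodhart's law: the reward model is a noisy proxy, and optimising it too hard surfaces exploits that the reward model fails to penalise. The KL penalty in the objective slows but does not eliminate this effect.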
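Taking the functional form \(\alpha \sqrt{d_\text{KL}} - \beta\, d_\text{KL}\) at face value, the location of the peak follows from setting the derivative to zero. The sketch below uses hypothetical function names, and the coefficient values in the usage note are arbitrary illustrations, not fitted values from the paper.

```python
import math


def gold_reward(kl, alpha, beta):
    """Gold reward predicted by alpha * sqrt(kl) - beta * kl."""
    return alpha * math.sqrt(kl) - beta * kl


def peak_kl(alpha, beta):
    """KL at which the curve peaks: set the derivative to zero."""
    # d/dkl [alpha*sqrt(kl) - beta*kl] = alpha / (2*sqrt(kl)) - beta = 0
    return (alpha / (2.0 * beta)) ** 2
```

With alpha = 2 and beta = 0.5 the predicted gold reward peaks at a KL of 4 and returns to zero at a KL of 16, which is the "peaks early and then falls" shape.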
Rafailov et al. (2023) showed that the optimal policy satisfying the KL-regularised RLHF objective is:
\[\pi^*(y|x) \propto \pi_{\text{ref}}(y|x)\exp\!\left(r(x,y)/\beta\right)\]
Rearranging, the reward can be expressed as a log-ratio of the optimal and reference policies, up to a term that depends only on the prompt. Substituting into the Bradley-Terry preference loss eliminates the reward model entirely, leaving a loss that trains the policy directly on preference pairs. DPO thus optimises the same objective as PPO-based RLHF without sampling from the policy during training, making it computationally lighter and more stable.
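Spelled out, the rearrangement and the resulting loss take the following standard form from Rafailov et al. (2023); \(Z(x)\) denotes the normalising partition function, notation introduced here.

\[r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)\]

\[\mathcal{L}_\text{DPO} = -\mathbb{E}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]\]

The \(\beta \log Z(x)\) term depends only on the prompt, so it cancels in the difference \(r(x, y_w) - r(x, y_l)\), which is why no explicit reward model or partition function is needed.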
- Length bias. Annotators systematically prefer longer responses; the reward model encodes length as a quality proxy, and the policy learns to pad outputs.
- Credit assignment breaks for long-horizon tasks. Multi-step reasoning (coding, maths proofs) requires knowing which intermediate steps were correct. A single end-of-response scalar cannot distinguish sound reasoning that led to a wrong answer from flawed reasoning that accidentally reached the right one. Process reward models re-introduce MDP structure.
- Out-of-distribution prompts. The reward model is trained on a curated prompt distribution. For prompts outside that distribution the model extrapolates unreliably, and the KL penalty cannot compensate because it has no knowledge of which regions are underrepresented.
Temperature and nucleus (top-p) sampling serve as the exploration mechanism: raising temperature increases the probability mass on lower-ranked tokens, causing the policy to try different response arms. However, coverage remains narrow because the token action space is astronomical (all possible sequences), and which responses are reachable depends heavily on prefix context. The policy explores locally around the current distribution, rarely venturing into genuinely novel regions of response space. This is one reason RLHF training is sensitive to the diversity of the initial SFT demonstrations.
Reward over-optimisation occurs when a policy is trained too aggressively against a proxy reward model, causing it to exploit the proxy's blind spots rather than genuinely improve. It happens because the reward model is an imperfect approximation of human preferences: the policy, as a powerful optimiser, finds inputs the reward model scores highly that do not correspond to truly preferred behaviour. Proxy reward climbs; gold reward peaks and then falls.
- \(r_\phi(x, y)\): the proxy reward model's score for response \(y\) given prompt \(x\).
- \(\beta \cdot \mathrm{KL}[\pi_\theta \| \pi_{\text{ref}}]\): a penalty that increases as the trained policy drifts from the reference (SFT) model's distribution.
- \(\beta\): the coefficient that trades off reward maximisation against staying close to the reference. Larger \(\beta\) reduces exploitation but also limits genuine improvement.
As KL divergence from the reference policy increases (more RL optimisation), proxy reward rises monotonically, but gold reward follows an inverted-U shape: it improves initially, reaches a peak, then declines below baseline. The location and sharpness of the peak depend on reward model size; larger reward models are harder to exploit, so the peak occurs at a higher KL and the decline is more gradual.
- Length exploitation - the policy pads responses because reward models trained on human data often correlate length with thoroughness.
- Sycophancy - the policy validates and agrees with the user rather than informing them, because raters tend to prefer responses that confirm their views.
- Formatting games - the policy inserts bullet points, headers, or other structure regardless of whether it aids comprehension, because reward models associate these with high-quality training examples.
All three emerge without explicit instruction; the policy discovers them as paths of least resistance through the reward model's blind spots.
Ensemble reward models score a response as the minimum (or weighted average) across multiple reward models trained with different initialisations or data subsets. A response must satisfy all models to score well, making it harder to exploit any single model's blind spot.
Weakness: if all ensemble members were trained on the same rater pool and task distribution, they share systematic biases. Behaviours that exploit rater preferences (such as sycophancy) will fool the whole ensemble, because the failure mode is in the data, not the model architecture.
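A minimal sketch of the two aggregation rules, assuming each reward model has already produced a scalar score for the same response. The function name and `mode` argument are hypothetical.

```python
def ensemble_score(scores, mode="min", weights=None):
    """Combine one response's scores from several reward models."""
    if not scores:
        raise ValueError("need at least one reward model score")
    if mode == "min":
        # Conservative: the response is only as good as its worst score.
        return min(scores)
    if mode == "weighted":
        weights = weights or [1.0] * len(scores)
        return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    raise ValueError(f"unknown mode: {mode}")
```

With the minimum rule, a response that fools two of three models (scores 0.9, 0.2, 0.8) still scores 0.2. If every member shares the same bias, all scores are inflated together and neither rule helps.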
The proxy reward - the signal visible during training - keeps increasing throughout over-optimisation. Training loss curves therefore look healthy; there is no obvious warning sign. True quality has peaked and begun to decline, but this is only visible via periodic human evaluations or a held-out gold reward model that is not used as the training objective. Without such monitoring, a team may continue training past the optimal point while believing the model is still improving.
An outcome reward model scores only the final response, so the policy can reach a plausible-sounding correct answer via flawed or hallucinated intermediate reasoning and still receive high reward. A process reward model assigns a reward signal at each reasoning step, forcing the policy to produce correct intermediate steps as well. This dramatically reduces the routes available for exploitation: the policy cannot simply game the final output format; it must also produce step-by-step reasoning that the PRM validates. The trade-off is that dense step-level labelling is far more expensive to collect than outcome-level comparisons.
An ORM assigns a single scalar reward to the complete response (based only on the final answer). A PRM assigns a reward to each individual reasoning step, providing t signals for a chain of t steps.
Why it matters: dense per-step feedback lets RL penalise intermediate errors even when a lucky recovery produces the right final answer.
The ORM sees only the correct final answer and gives a positive reward, so it effectively reinforces the faulty step-3 reasoning. The PRM scores step 3 directly and assigns it a negative or low reward regardless of the eventual correct answer, discouraging the error.
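A toy illustration of the difference in signal shape. The hand-written validity flags stand in for what a learned PRM would predict, and the +1 / -1 values and function names are arbitrary assumptions.

```python
def orm_reward(final_answer_correct):
    """Outcome reward: one scalar for the whole response."""
    return 1.0 if final_answer_correct else 0.0


def prm_rewards(step_is_valid):
    """Process reward: one scalar per reasoning step."""
    return [1.0 if valid else -1.0 for valid in step_is_valid]


# Step 3 is flawed, but a later step recovers and the final answer is correct.
steps_valid = [True, True, False, True]
outcome = orm_reward(final_answer_correct=True)
per_step = prm_rewards(steps_valid)
```

The outcome reward is fully positive, while the per-step rewards single out the third step for a penalty.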
The policy learns to maximise the proxy reward rather than genuine quality. It finds "shortcuts" (confident-sounding outputs, lucky answer formats, stylistic patterns) that score high on the learned reward model but are not actually correct or helpful. This is reward over-optimisation: the measure becomes the target and stops tracking what we care about.
Tap to flip back
Process supervision significantly outperformed outcome supervision. Their best process-supervised model reached 78% on a representative MATH test subset. They also released PRM800K: 800,000 step-level human feedback annotations. The PRM improved both average accuracy and consistency (best-of-N reliability improved more than random sampling).
Tap to flip back
Learned reward models are imperfect proxies: the policy can exploit their blind spots. A verifiable outcome reward (e.g., a unit test, a maths checker) returns ground truth - there is no learned proxy to Goodhart against. The policy can only score a reward by actually producing a correct answer.
Limitation: this only works for tasks where correctness is machine-checkable (maths, code, formal logic), not open-ended generation.
Tap to flip back
- PRM reward hacking: the policy learns step phrasings that score high on the PRM without being mathematically valid - the PRM is itself a learned proxy and is gameable.
- Verbosity exploitation: because each correct step earns reward, the model may pad chains with trivially-true filler steps to accumulate extra signal, inflating chain length without improving accuracy.
Tap to flip back
PRMs act as step-level scoring heuristics for inference-time search. At each node in a beam search or tree search over candidate reasoning chains, the PRM scores which continuation is most plausible. The best-scoring path is selected or ranked. This decouples the PRM's benefit from the RL training loop entirely, allowing it to improve answer quality at test time without any additional fine-tuning.
Tap to flip back
Best-of-N generates N independent completions from a fixed policy, scores each with a reward function, and returns the highest-scoring one. It never modifies any model weights -- all the "improvement" comes from spending more inference compute, not from gradient updates.
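The selection procedure can be sketched in a few lines. This is a minimal illustration, not a production sampler: `sample` and `score` are hypothetical callables standing in for the fixed policy and the reward function, and the toy versions below are invented purely so the snippet runs.

```python
import random
from typing import Callable, List

def best_of_n(prompt: str,
              sample: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int) -> str:
    """Draw n completions from a fixed policy and return the top-scoring one."""
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    # Pure selection: nothing here touches model weights.
    return max(candidates, key=lambda y: score(prompt, y))

# Toy stand-ins (hypothetical): the "policy" guesses a digit, the "reward" prefers 7.
rng = random.Random(0)
toy_sample = lambda prompt: str(rng.randint(0, 9))
toy_score = lambda prompt, y: -abs(int(y) - 7)

answer = best_of_n("pick a digit", toy_sample, toy_score, n=32)
print(answer)
```

The only lever is `n`: a larger value spends more inference compute on the same frozen policy.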
Tap to flip back
Each completion is an independent draw from the model's score distribution. The expected maximum of N i.i.d. draws (the N-th order statistic) grows only slowly for light-tailed distributions: roughly log N for exponential tails and roughly sqrt(2 log N) for Gaussian scores, sublinear in N either way. Additionally, coverage saturates once the probability that at least one sample solves the problem is near 1. Beyond that ceiling, more samples add nothing because the model's latent capability is fully exercised.
Tap to flip back
An ORM scores only the final completed sequence; it is blind to intermediate reasoning steps, so a completion with flawed reasoning that reaches a correct answer looks the same as a well-reasoned one. A PRM scores each reasoning step and can penalise a completion that has a weak or incorrect intermediate step, making it harder to game with answer-hacking. PRMs are more expensive to label and train but more robust at high N.
Tap to flip back
At large N, you are exhaustively searching for the completion that maximises the reward model's score. The reward model is a proxy for true quality. Eventually, the argmax winner is the completion best adapted to the reward model's blind spots or biases -- not the genuinely best answer. This is reward overoptimisation at inference time: the metric (reward model score) ceases to track the true goal once it becomes the selection target.
Tap to flip back
Coverage is the fraction of problems for which at least one of the N samples is correct. It upper-bounds what any selection strategy can achieve: if the correct answer never appears in N samples, no verifier can surface it. Coverage scales log-linearly with N (Brown et al., 2024) when an exact verifier is available. Average performance (expected reward) can be misleading because a single correct sample in N is enough to get credit; coverage makes this threshold explicit.
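As a rough illustration of the definition, coverage can be computed from per-problem single-sample success rates if samples are assumed independent. The pass rates below are hypothetical, and the independence assumption is a simplification.

```python
from typing import Sequence

def coverage(pass_rates: Sequence[float], n: int) -> float:
    """Expected fraction of problems solved by at least one of n i.i.d. samples."""
    # For one problem with success rate p, P(at least one hit in n draws) = 1 - (1 - p)^n.
    return sum(1.0 - (1.0 - p) ** n for p in pass_rates) / len(pass_rates)

# Hypothetical per-problem success rates; the last problem is never solved.
pass_rates = [0.5, 0.1, 0.01, 0.0]
for n in (1, 10, 100, 1000):
    print(n, round(coverage(pass_rates, n), 3))
```

The problem with pass rate 0.0 caps coverage below 1 no matter how large `n` gets, which is the upper bound the card describes.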
Tap to flip back
Best-of-N needs diverse samples to cover the space of possible answers. At temperature near zero, all N completions collapse toward the greedy decode -- you get N nearly identical outputs, no coverage gain, and wasted compute. At high temperature, individual completions become incoherent or factually unreliable, so coverage of correct answers may actually fall. The useful operating point is somewhere in between, and it is task-specific; calibrating temperature is a real practical concern.
Tap to flip back
Snell et al. showed that allocating compute uniformly (same N for every prompt) is inefficient. A compute-optimal strategy -- giving more samples to harder prompts estimated via a difficulty predictor, and fewer to easy ones -- achieves more than 4x efficiency improvement at the same total compute budget compared to uniform Best-of-N. They also found that a small model with optimal test-time compute can outperform a 14x larger model running a single greedy decode on prompts in the right difficulty range.
Tap to flip back
RFT samples multiple completions per prompt from the current model, discards those that are incorrect (reward = 0), and trains on the survivors via supervised cross-entropy. PPO instead computes a gradient from every rollout (weighting by advantage), uses an explicit clipped surrogate objective to control step size, and requires a separate critic network to estimate baselines. RFT is structurally simpler: no critic, no clipping, and no gradient from incorrect samples.
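The sample-then-filter half of RFT might look like the sketch below. `sample` and `is_correct` are hypothetical stand-ins for the current model and a verifier; the supervised fine-tuning step on the survivors is omitted.

```python
import itertools
from typing import Callable, Dict, List, Sequence

def rft_filter(prompts: Sequence[str],
               sample: Callable[[str], str],
               is_correct: Callable[[str, str], bool],
               k: int) -> Dict[str, List[str]]:
    """Sample k completions per prompt and keep only the correct ones."""
    dataset: Dict[str, List[str]] = {}
    for x in prompts:
        survivors = [y for y in (sample(x) for _ in range(k)) if is_correct(x, y)]
        if survivors:  # prompts with no correct sample contribute no training data
            dataset[x] = survivors
    return dataset

# Toy stand-ins (hypothetical): the "model" cycles through three fixed guesses.
truth = {"2+2": "4", "3+5": "8", "9*9": "81"}
guesses = itertools.cycle(["4", "5", "8"])
toy_sample = lambda x: next(guesses)
toy_correct = lambda x, y: truth[x] == y

sft_data = rft_filter(list(truth), toy_sample, toy_correct, k=3)
print(sft_data)
```

In this toy run the prompt the "model" never solves is silently dropped, so it would produce no gradient at all.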
Tap to flip back
The RFT loss on accepted completions Y+ is:
L_RFT = - E_x [ (1/|Y+|) * sum_{y in Y+} log pi_theta(y | x) ]
Taking the gradient and recognising that Y+ contains exactly the samples where r(y|x) = 1:
grad L_RFT ≈ - E[ r(y|x) * grad log pi_theta(y | x) ]
This is the REINFORCE estimator with binary rewards and no explicit baseline. The zero-reward samples contribute nothing because they are not included in Y+; their implicit gradient weight is zero. RFT is REINFORCE in disguise.
Tap to flip back
Iterative RFT repeatedly cycles through: (1) sample completions from the current policy, (2) filter to correct ones, (3) fine-tune on survivors, (4) update the policy and repeat. Because each round's training data is generated from the policy as it currently stands, the data is on-policy. This satisfies the core requirement of policy gradient methods and structurally resembles a simplified PPO loop without clipping or a critic. Llama 2's post-training used rejection-sampling fine-tuning of this kind in its earlier RLHF rounds and applied PPO on top of it in later ones.
Tap to flip back
Best-of-N generates N completions at inference time, scores each with a reward model or verifier, and returns the top-scoring one. No gradient flows; it is a pure selection procedure. Gao et al. (2022) showed that as N scales, proxy reward grows roughly as O(log N) but gold reward follows an inverted-U: it peaks and then declines. This same over-optimisation curve appears in PPO and in iterative RFT. The shared mechanism is any optimisation procedure (gradient-based or sampling-based) searching under an imperfect reward signal. RFT is not immune simply because it avoids an explicit RL loop.
Tap to flip back
When none of the G sampled completions is correct, Y+ is empty. No samples survive the filter, so the cross-entropy loss has no terms for this prompt and no gradient flows. The model receives no update signal, however hard or near-miss the attempts were. GRPO softens this only partially: its within-group relative advantages turn incorrect samples into negative signal whenever at least one completion in the group succeeds, but when all rewards are equal (all zeros) every advantage is zero and the policy-gradient term vanishes too. RFT's hard filter makes training blind to hard problems where the model's pass rate approaches zero.
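The group-relative advantage mentioned here can be sketched as a mean/standard-deviation normalisation of the rewards within one prompt's group. The `eps` term is an assumption added for numerical safety, and the function name is illustrative.

```python
from statistics import mean, pstdev
from typing import List, Sequence

def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalise each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])      # one success in the group
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])  # zero pass rate
print(mixed)
print(all_failed)
```

With one success, the failed samples receive negative advantages and still shape the update; with no successes, every advantage is zero.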
Tap to flip back
RFT has no explicit KL regularisation term. In practice three informal substitutes are used: short training duration (few epochs per round), early stopping when validation reward plateaus, and sometimes mixing filtered data with original SFT data as a soft anchor. When none of these safeguards is applied and iterations run aggressively, the policy drifts far from the reference without any corrective force. The result is distribution collapse: the model converges to a narrow set of solution templates that matched the filter on training prompts but generalises poorly, the same pathology as over-optimised PPO but with no built-in control mechanism.
Tap to flip back
- Sample inefficiency on hard prompts. When the correct-sample rate is low (say 1 in 64), the majority of rollout compute is discarded. PPO extracts a gradient from every rollout by weighting by advantage; RFT throws away all incorrect samples. At 70B parameters, generating 64 rollouts per prompt per iteration is expensive, so this waste is directly proportional to inference cost.
- No signal on zero-pass-rate prompts. As noted above, prompts where the model never succeeds produce no training data. This creates a systematic blind spot for the hardest problems in the distribution, causing iterative RFT to specialise on problems the model can already partially solve rather than expanding capability at the frontier.
Tap to flip back
Singhal et al. (2023) showed that a policy trained with a reward based solely on response length reproduced most of the downstream win-rate gains achieved by a full RLHF policy. This implies that a significant fraction of apparent quality improvement from RLHF may reflect length exploitation rather than genuine reasoning improvement.
Tap to flip back
β scales the KL penalty KL(π || π_ref) that keeps the trained policy close to the reference. A low β lets the policy drift far enough to discover and exploit proxy reward weaknesses (length, formatting). A high β prevents that drift but also limits genuine learning. Reward hacking lives in the low-β regime where the policy has enough freedom to find surface-level shortcuts.
Tap to flip back
- Spurious headers ("## Background", "## Answer") - mimic expert document structure and signal organisation.
- Bullet decomposition of a single idea into multiple points - looks more thorough and effortful.
- Redundant conclusions that repeat the answer - triggers annotator preference for perceived completeness.
- Confidence theatre ("I am confident that...") - matches annotator preference for assertive, authoritative tone.
All four are cheap to produce and reliably raise reward without improving content quality.
Tap to flip back
LLM judges inherit the same length and formatting biases present in their training data. Dubois et al. (2024) showed that AlpacaEval's GPT-4 judge assigns higher win rates to longer responses independently of quality. In RLAIF, the policy is optimised against this biased judge, so it learns to produce long and heavily formatted responses. Human evals of the resulting policy then show win-rate gains that are partly or wholly attributable to length rather than reasoning. The bias compounds across iterations.
Tap to flip back
The curve plots ground-truth quality against the KL divergence from the reference policy. It rises initially (the proxy reward and true quality agree), reaches an interior maximum, then falls as the policy exploits proxy weaknesses. The implication: there is an optimal stopping point for RL training. Training longer against the proxy reward degrades true quality even as proxy reward continues to rise. This is Goodhart's Law in action: once a measure becomes a target, it ceases to be a good measure.
Tap to flip back
RLVR uses a binary reward on the final answer (correct/incorrect), which removes direct length incentives on the answer token. However, the chain-of-thought reasoning trace is unconstrained. If longer reasoning traces correlate with correct answers in training data (even spuriously), the policy can learn to pad or repeat reasoning steps. Format hacking in the trace (e.g., re-stating the problem, redundant recaps) also emerges because the binary reward signal does not penalise it. The hacking shifts from the output to the reasoning scaffold.
Tap to flip back
- Length penalty r_adj = r_φ - λ·|y|: directly reduces verbosity but risks penalising legitimately detailed answers; λ calibration is hard.
- Format-invariant reward (strip markdown before scoring): removes surface cues but may remove genuine structural signals.
- Length-controlled evaluation (AlpacaEval 2.0 LC regression): debiases metrics but does not fix training gradients.
- Diverse preference data with short-preferred pairs: reduces the reward model's length prior but requires annotator discipline.
- Higher β (KL coefficient): limits policy drift and exploitation depth but slows and caps learning.
No single fix is complete; practitioners typically combine two or more.
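The first mitigation in the list, the length penalty r_adj = r_φ - λ·|y|, is simple enough to sketch directly. The reward values, token counts, and λ below are made-up numbers chosen only to show the trade-off.

```python
from typing import Sequence

def length_penalised_reward(reward: float, response_tokens: Sequence[str], lam: float) -> float:
    """r_adj = r_phi - lambda * |y|, with |y| measured in tokens."""
    return reward - lam * len(response_tokens)

# Hypothetical case: a padded answer earns slightly more raw reward than a concise one.
concise = length_penalised_reward(0.80, ["tok"] * 20, lam=0.001)
padded = length_penalised_reward(0.90, ["tok"] * 200, lam=0.001)
print(concise, padded)
```

At this λ the concise answer wins after adjustment; a much smaller λ would leave the padded answer ahead, which is the calibration difficulty the card mentions.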
Tap to flip back
S - state space (all situations the agent can be in); A - action space (all choices available); P(s'|s,a) - transition function (probability of reaching s' from s via a); R(s,a,s') - reward function (scalar feedback per transition); gamma - discount factor (how much future rewards are down-weighted).
Together these five elements fully specify the decision problem an RL agent must solve.
Tap to flip back
The Markov property states:
P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(s_{t+1} | s_t, a_t)
The next state depends only on the current state and action, not on the full history. Without this assumption the state space would grow exponentially with time (every distinct history would be a different "state"), making exact solution methods like dynamic programming computationally infeasible.
Tap to flip back
V^pi(s) = sum_{a} pi(a|s) * sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * V^pi(s') ]
- pi(a|s) - probability of choosing action a under policy pi
- P(s'|s,a) - transition probability to successor state s'
- R(s,a,s') - immediate reward for that transition
- gamma * V^pi(s') - discounted value of the successor state
The equation is recursive: the value of a state is defined in terms of the values of its successors. This recursion is what dynamic programming and TD learning methods exploit.
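One way to see the recursion at work is iterative policy evaluation on a tiny MDP. The two-state MDP, the dictionary layout, and the tolerance below are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

def policy_evaluation(states: List[str], actions: List[str],
                      P: Dict[Tuple[str, str], Dict[str, float]],
                      R: Dict[Tuple[str, str, str], float],
                      pi: Dict[str, Dict[str, float]],
                      gamma: float, tol: float = 1e-10) -> Dict[str, float]:
    """Sweep the Bellman expectation update until values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s][a] * sum(p * (R[(s, a, s2)] + gamma * V[s2])
                               for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# Hypothetical MDP: from A, "move" reaches absorbing state B with reward 1.
states, actions = ["A", "B"], ["stay", "move"]
P = {("A", "stay"): {"A": 1.0}, ("A", "move"): {"B": 1.0},
     ("B", "stay"): {"B": 1.0}, ("B", "move"): {"B": 1.0}}
R = {("A", "stay", "A"): 0.0, ("A", "move", "B"): 1.0,
     ("B", "stay", "B"): 0.0, ("B", "move", "B"): 0.0}
uniform = {s: {"stay": 0.5, "move": 0.5} for s in states}

V = policy_evaluation(states, actions, P, R, uniform, gamma=0.9)
print(V)
```

For this MDP the equation can also be solved by hand: V(B) = 0 and V(A) = 0.5 · 0.9 · V(A) + 0.5 · 1, so V(A) = 0.5 / 0.55.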
Tap to flip back
The expectation equation averages over the current policy pi(a|s):
V^pi(s) = sum_a pi(a|s) * sum_{s'} P(s'|s,a) [R + gamma * V^pi(s')]
The optimality equation takes the maximum over actions instead:
V*(s) = max_a sum_{s'} P(s'|s,a) [R + gamma * V*(s')]
The optimality version characterises the best possible policy; it no longer depends on a fixed pi. Once V* is known, the optimal policy is simply "act greedily with respect to V*", using a one-step lookahead through P and R.
Tap to flip back
- V^pi(s): expected return starting in state s and following policy pi from the first action onwards.
- Q^pi(s, a): expected return if you take action a in state s (regardless of what pi says), then follow pi afterwards.
Relationship:
V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
Q is more useful for control because comparing Q(s, a) values across actions lets an agent improve its policy without needing the transition model P.
Tap to flip back
- Partial observability: the agent cannot see the full state (e.g., hidden opponent cards). The correct model is a POMDP; exact solutions are PSPACE-hard. Workaround: recurrent networks or belief-state estimation.
- High-dimensional / continuous state spaces: exact dynamic programming is intractable (the "curse of dimensionality"). Workaround: function approximation (neural networks), but convergence guarantees largely disappear.
- Reward misspecification / hacking: the MDP assumes R is given correctly. In practice, a poorly designed reward leads to unintended high-return policies (e.g., a robot that covers its camera to avoid seeing mess). Workaround: reward modelling, human feedback (RLHF).
Tap to flip back
gamma down-weights future rewards: a reward k steps away is worth gamma^k today.
- gamma = 0: agent is fully myopic - it only maximises the immediate next reward and ignores all future consequences. Useful for toy analysis but rarely correct in practice.
- gamma -> 1: the agent values the distant future almost as much as the present. The infinite-horizon return may diverge unless episodic termination is guaranteed. Even when it converges, high gamma inflates return variance, making learning slow and unstable.
In practice gamma is a hyperparameter (commonly 0.95 to 0.999); there is no principled formula to derive it from the task structure.
Tap to flip back
The reward \(r_t\) is a scalar signal for one transition. The return \(G_t\) is the cumulative sum of all future rewards from time \(t\) onwards:
\[G_t = r_{t+1} + r_{t+2} + \cdots\]
The agent optimises the expected return, not individual rewards. A high single reward that leads to a long sequence of losses is worse than a modest reward that starts a profitable trajectory.
Tap to flip back
Converges when \(\gamma \in [0,1)\) and rewards are bounded. The geometric series \(\sum \gamma^k = \frac{1}{1-\gamma}\) provides the upper bound.
Without the condition \(\gamma < 1\), the sum can diverge for a continuing task with nonzero rewards, making it useless as an optimisation target.
Tap to flip back
The effective horizon is approximately \(\frac{1}{1-\gamma}\) steps. Beyond this point, the discounted contribution of a reward falls below \(e^{-1} \approx 0.37\) of its face value.
| \(\gamma\) | Effective horizon |
|---|---|
| 0.9 | ~10 steps |
| 0.99 | ~100 steps |
| 0.999 | ~1000 steps |
Choosing \(\gamma\) is a bias-variance tradeoff: lower \(\gamma\) reduces variance but biases the agent towards myopia; higher \(\gamma\) captures long-range structure but makes gradient estimates noisier.
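The rule of thumb in the table can be checked numerically: the weight on a reward sitting exactly one effective horizon away, \(\gamma^{1/(1-\gamma)}\), should land close to \(e^{-1}\). The helper names below are illustrative.

```python
import math

def effective_horizon(gamma: float) -> float:
    """Approximate number of steps over which rewards still matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)

def weight_at_horizon(gamma: float) -> float:
    """Discount applied to a reward exactly one effective horizon away."""
    return gamma ** effective_horizon(gamma)

for gamma in (0.9, 0.99, 0.999):
    print(gamma, round(effective_horizon(gamma)), round(weight_at_horizon(gamma), 3))
print("e^-1 =", round(math.exp(-1), 3))
```

The approximation tightens as \(\gamma\) approaches 1, since \(\gamma^{1/(1-\gamma)} \to e^{-1}\) in that limit.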
Tap to flip back
- Terminated: the agent reached a true terminal state (e.g., game over, goal achieved). The bootstrap value is 0.
- Truncated: the episode was cut short by an external time limit, not by the task's own dynamics. The future continues; the bootstrap value should be the estimated value of the final state.
If you treat truncation as termination (bootstrap = 0), every return in that trajectory is biased downward. Modern gym-style APIs expose a separate truncated flag to prevent this.
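A minimal sketch of a one-step target that respects this distinction is shown below. The function name and signature are illustrative; the `terminated` flag is assumed to come from a gym-style step call that reports termination and truncation separately.

```python
def td_target(reward: float, next_value: float, terminated: bool, gamma: float) -> float:
    """One-step target that zeroes the bootstrap only on true termination."""
    # A truncated episode still bootstraps from next_value, so only
    # `terminated` is consulted here, never `terminated or truncated`.
    bootstrap = 0.0 if terminated else next_value
    return reward + gamma * bootstrap

at_goal = td_target(reward=1.0, next_value=5.0, terminated=True, gamma=0.9)
at_time_limit = td_target(reward=1.0, next_value=5.0, terminated=False, gamma=0.9)
print(at_goal, at_time_limit)
```

Collapsing both flags into a single `done` and passing that as `terminated` would reproduce the downward bias described above.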
Tap to flip back
In an episodic task the sum is always finite because the episode terminates at time \(T\). Setting \(\gamma = 1\) just means all future rewards are weighted equally, which is valid.
In a continuing task there is no \(T\), so \(\sum_{k=0}^{\infty} 1 \cdot r_{t+k+1}\) may diverge if rewards are nonzero. The solution is either to enforce \(\gamma < 1\) or to switch to the average-reward formulation, which subtracts the long-run mean reward at each step.
Tap to flip back
Because \(G_t = r_{t+1} + \gamma G_{t+1}\), each return depends on the next one. The backward pass exploits this recurrence in \(O(T)\) time:
```python
G = last_value  # 0 if terminated, V(s_T) if truncated
for r in reversed(rewards):
    G = r + gamma * G
```
last_value = 0 for true termination. last_value = V(s_T) for truncation, bootstrapping from the value estimate so the agent is not penalised for the environment cutting the episode short.
Tap to flip back
- Myopia from low \(\gamma\): The agent ignores rewards more than a few steps away, sacrificing long-run goals for immediate scraps (e.g., collecting a small reward in the wrong corridor instead of walking to the goal).
- Terminal state misidentification: Forgetting to zero the bootstrap on true termination inflates value estimates near terminal states, because the function learns to predict rewards that will never occur. This is a silent bug with no obvious error signal during training.
Tap to flip back
V(s) is the expected cumulative return from state s under a given policy. Q(s, a) additionally conditions on the first action taken, so Q(s, a) = E[G_t | S_t=s, A_t=a]. Q is preferred when you need to select actions without a model of the environment (Q-learning, DQN), because the greedy action is simply argmax_a Q(s, a). V is preferred as a baseline or critic in actor-critic methods where a separate policy network handles action selection.
Tap to flip back
Q^pi(s, a) = sum_{s', r} p(s', r | s, a) * [ r + gamma * sum_{a'} pi(a'|s') * Q^pi(s', a') ]
- p(s', r | s, a): transition and reward distribution (environment dynamics).
- r: immediate reward on this step.
- gamma: discount factor, down-weighting future rewards.
- sum_{a'} pi(a'|s') Q^pi(s', a'): expected value of the next state under the current policy, expressed as a policy-weighted sum over Q values.
The equation says: the value of taking action a in state s equals immediate reward plus discounted expected next-state value. It is a self-consistency constraint; any Q satisfying it for all (s, a) is the true Q^pi.
Tap to flip back
The expectation equation averages over the policy: sum_{a'} pi(a'|s') Q^pi(s', a'). The optimality equation replaces the average with a max: max_{a'} Q*(s', a'). This means Q* is the value achievable by the best possible policy from the next state onward, not the current policy. The consequence: once Q* is found, the optimal policy is greedily derived as pi*(s) = argmax_a Q*(s, a), with no need to store an explicit policy. Q-learning targets Q* directly (off-policy), which is why, in the tabular case, it converges to the optimal values under any exploration strategy that keeps visiting every state-action pair.
Tap to flip back
The TD error (delta) is:
delta = r + gamma * V(s') - V(s)
It measures how much the current estimate V(s) disagrees with a one-step bootstrapped target (r + gamma * V(s')). A positive delta means the state turned out better than expected; a negative delta means worse. The TD(0) update rule V(s) += alpha * delta nudges V(s) toward the target by a step of size alpha. Over many samples, this drives V toward V^pi by iteratively reducing the Bellman residual.
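The update can be written as a three-line function over a tabular value estimate. The dictionary representation and the state names are assumptions for illustration.

```python
from typing import Dict

def td0_update(V: Dict[str, float], s: str, r: float, s_next: str,
               alpha: float, gamma: float) -> float:
    """Apply one TD(0) update in place and return the TD error delta."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

values = {"a": 0.0, "b": 1.0}
delta = td0_update(values, s="a", r=0.5, s_next="b", alpha=0.1, gamma=0.9)
print(delta, values["a"])
```

Here the target is 0.5 + 0.9 · 1.0 = 1.4 against an estimate of 0, so delta is positive and V("a") moves a tenth of the way towards the target.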
Tap to flip back
The deadly triad is the combination of: (1) function approximation (e.g., a neural network), (2) bootstrapping (using estimated values as targets, as in TD learning), and (3) off-policy data (training on transitions not generated by the current policy). Each alone is manageable. Together, the bootstrap target depends on approximated values, the approximation error feeds back into future targets, and the off-policy distribution mismatch means gradient updates may move parameters in directions that worsen estimates on frequently-visited states. The result can be divergence rather than convergence. DQN partially mitigates this with a lagged target network and experience replay, but does not eliminate the theoretical risk.
Tap to flip back
The Bellman operator is a contraction with factor gamma: each application shrinks the sup-norm distance to the fixed point by at least a factor of gamma. When gamma is near 1, this contraction is weak (the factor approaches 1), so many iterations are needed for convergence, and estimation noise at distant time steps compounds heavily. When gamma is low, convergence is fast but the agent is myopic and ignores long-horizon consequences. There is no universally correct gamma; it must be tuned to the task's effective horizon. In sparse-reward environments with long episodes, a high gamma is necessary but makes training fragile.
Tap to flip back
Define the Bellman operator T^pi acting on a value function V:
(T^pi V)(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]
V^pi is a fixed point: T^pi(V^pi) = V^pi. Because T^pi is a contraction (by factor gamma < 1) under the sup-norm, Banach's fixed-point theorem guarantees a unique fixed point and that repeated application converges from any starting point. For algorithms this means: tabular policy evaluation (repeatedly applying T^pi to any initial V) is guaranteed to converge to V^pi. TD learning is a stochastic approximation of the same iteration. With function approximation the contraction no longer holds in general, which is where the deadly triad problems enter.
Tap to flip back
Value-based methods parameterise a value function (V(s) or Q(s,a)) and derive a policy implicitly by acting greedily with respect to it. Policy-based methods parameterise a policy π_θ(a|s) directly and optimise expected return via gradient ascent on that policy. One represents the value landscape; the other represents the behaviour directly.
Tap to flip back
DQN selects actions via argmax_a Q(s, a). When the action space is continuous this argmax is intractable - you cannot enumerate all possible actions. Discretising loses resolution; solving a continuous optimisation at each step is expensive. DDPG works around this by adding an actor network that learns to approximate argmax_a Q(s, a), effectively grafting a policy method on top of a value method.
Tap to flip back
The policy gradient estimator ∇_θ log π_θ(a|s) · G_t is only unbiased when the trajectory (s, a) was sampled from the current policy π_θ. Reusing off-policy data introduces bias unless importance-sampling corrections are applied, and those corrections increase variance rapidly as the behaviour policy diverges. The practical cost is data inefficiency: each collected batch can only be used for a small number of gradient steps before it must be discarded.
Tap to flip back
The critic estimates a baseline or advantage A(s, a) = Q(s, a) - V(s). This replaces the high-variance Monte Carlo return G_t in the gradient estimator with a lower-variance bootstrapped estimate. The actor still performs policy gradient updates, but the signal it receives is far less noisy, allowing stable learning with shorter rollouts. Without the critic, sparse or delayed rewards produce extremely noisy gradients.
Tap to flip back
The deadly triad (Sutton & Barto) refers to the combination of: (1) function approximation, (2) bootstrapping (using estimated values as regression targets), and (3) off-policy training. When all three coincide, Q-value estimates can diverge. Value-based methods are most exposed because they routinely use all three - neural network approximators, Bellman bootstrap targets, and replay buffers with off-policy data. Policy gradient methods are less exposed because they update via on-policy returns, not bootstrapped value targets.
Tap to flip back
A stochastic policy is strictly necessary when the optimal strategy is a mixed strategy - for example, in a two-player zero-sum game where a deterministic policy is exploitable, or in a partially observed environment where multiple actions have the same expected value given the observable state. Policy-based methods naturally represent stochastic policies because π_θ(a|s) is an explicit probability distribution. Value-based methods are inherently deterministic (they output an argmax), so they cannot represent mixed strategies without additional mechanism.
Tap to flip back
PPO constrains how far the updated policy can move from the behaviour policy that collected the data by clipping the probability ratio r_t(θ) = π_θ(a|s) / π_θ_old(a|s) into the interval [1-ε, 1+ε]. This limits the importance-sampling error introduced by reusing data across multiple gradient steps. It does not eliminate the on-policy requirement entirely - data is still discarded after a small number of epochs - but the clip makes several passes over each batch safe enough in practice without a full importance-weighted correction.
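A per-sample version of the clipped term can be sketched as follows. The outer `min` with the unclipped term comes from the standard PPO clipped surrogate rather than from the text above, and `eps = 0.2` is just a commonly used default.

```python
def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the min keeps the more pessimistic of the two estimates.
    return min(ratio * advantage, clipped_ratio * advantage)

print(ppo_clip_term(1.5, 1.0))    # gain capped once the ratio exceeds 1 + eps
print(ppo_clip_term(0.5, -1.0))   # penalty capped once the ratio falls below 1 - eps
print(ppo_clip_term(1.5, -1.0))   # the clip never hides a worse outcome
```

Once the ratio leaves [1-ε, 1+ε] in the direction that would increase the objective, the term goes flat, so further gradient steps on that sample stop pushing the policy away from π_θ_old.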
Tap to flip back
You need (1) a complete transition model P(s' | s, a) and (2) a reward function R(s, a, s'). Without these, you cannot perform the expectation over successor states that the Bellman update requires. This is why DP methods are called "model-based" - they assume the MDP is fully known.
Tap to flip back
The optimal value of state s equals the maximum over all actions of the expected discounted return:
V*(s) = max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V*(s')]
In words: "choose the action whose expected immediate reward plus discounted optimal future value is highest." The max replaces the policy-weighted sum in the expectation equation, encoding the greedy improvement step.
Tap to flip back
The policy improvement theorem states that the greedy policy pi' derived from V^pi satisfies V^{pi'}(s) >= V^{pi}(s) for all states s, with strict improvement in at least one state unless pi was already optimal. Because there are finitely many deterministic policies (|A|^|S|), and each iteration either strictly improves the policy or leaves it unchanged (in which case it is optimal), policy iteration must terminate at the optimal policy in a finite number of steps.
Tap to flip back
For any two value functions V and V', the Bellman optimality operator T satisfies:
||TV - TV'||_inf <= gamma * ||V - V'||_inf
Because gamma < 1, T is a contraction mapping under the sup-norm. By Banach's fixed-point theorem, repeated application of a contraction converges geometrically to a unique fixed point (V*). Each sweep reduces the error by at least a factor of gamma, so value iteration converges, though only asymptotically - you stop at an epsilon-optimal solution.
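The contraction inequality can be checked numerically on a small example. The two-state MDP and the dictionary layout below are assumptions for illustration; `bellman_optimality_backup` plays the role of T.

```python
from typing import Dict, List, Tuple

def bellman_optimality_backup(V: Dict[str, float], states: List[str], actions: List[str],
                              P: Dict[Tuple[str, str], Dict[str, float]],
                              R: Dict[Tuple[str, str, str], float],
                              gamma: float) -> Dict[str, float]:
    """Apply the Bellman optimality operator T once to every state."""
    return {
        s: max(
            sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items())
            for a in actions
        )
        for s in states
    }

def sup_norm(V: Dict[str, float], W: Dict[str, float]) -> float:
    return max(abs(V[s] - W[s]) for s in V)

# Hypothetical MDP: from A, "move" reaches absorbing state B with reward 1.
states, actions, gamma = ["A", "B"], ["stay", "move"], 0.9
P = {("A", "stay"): {"A": 1.0}, ("A", "move"): {"B": 1.0},
     ("B", "stay"): {"B": 1.0}, ("B", "move"): {"B": 1.0}}
R = {("A", "stay", "A"): 0.0, ("A", "move", "B"): 1.0,
     ("B", "stay", "B"): 0.0, ("B", "move", "B"): 0.0}

V0 = {"A": 0.0, "B": 0.0}
W0 = {"A": 10.0, "B": -3.0}
TV = bellman_optimality_backup(V0, states, actions, P, R, gamma)
TW = bellman_optimality_backup(W0, states, actions, P, R, gamma)
print(sup_norm(TV, TW), "<=", gamma * sup_norm(V0, W0))

# Value iteration: keep applying T until successive sweeps agree.
V_star = dict(V0)
while True:
    V_next = bellman_optimality_backup(V_star, states, actions, P, R, gamma)
    if sup_norm(V_next, V_star) < 1e-10:
        break
    V_star = V_next
print(V_star)
```

The stopping threshold is what makes the result epsilon-optimal rather than exact.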
Tap to flip back
Policy iteration runs policy evaluation to full convergence (iterating Bellman expectation updates until the value function stabilises) before taking a single greedy improvement step. Value iteration collapses this into one operation: it applies a single Bellman optimality update per state per sweep, which simultaneously evaluates and improves. Policy iteration uses fewer outer iterations but each is more expensive; value iteration has cheaper iterations but needs more of them to converge.
Tap to flip back
- No transition model available. Real environments (robotics, games, finance) do not provide P(s'|s,a) analytically. DP is inapplicable without it, which drove model-free methods that learn from sampled experience instead.
- State-space explosion. DP sweeps cost O(|S|^2 * |A|) per iteration. Even modest real-world problems have state spaces far too large for tabular representation - a 210 x 160 Atari frame with a 128-colour palette admits up to 128^33600 (roughly 10^70,000) raw pixel configurations, even though only a tiny fraction are reachable in play. Neural function approximation (e.g. DQN) sidesteps tabular storage at the cost of losing the convergence guarantees DP provides.
Tap to flip back
Generalised policy iteration (GPI) is the observation that almost every RL algorithm can be understood as some interleaving of (1) policy evaluation - updating a value function estimate towards consistency with a policy - and (2) policy improvement - updating the policy to be greedier with respect to the current value function. The two processes interact and compete but jointly converge to the optimal value function and policy.
Policy iteration: full evaluation before each improvement. Value iteration: one-step evaluation before each improvement. TD learning and Q-learning: online, sampled, approximate evaluation with continuous greedy improvement. All sit on the same GPI spectrum, differing only in how deeply they evaluate before improving and whether they use a model or samples.
Tap to flip back
It averages the actual discounted returns G_t obtained from complete episodes that passed through state s. The estimate is unbiased because no bootstrapping occurs: each G_t is the true realised return under the current policy, not an approximation constructed from other estimates.
Tap to flip back
First-visit MC uses only the return from the first time state s is visited per episode; every-visit uses the return from every visit. Both converge to V_π(s), but first-visit contributes at most one return per state per episode, so its samples are i.i.d. across episodes and the estimate is unbiased. Every-visit returns from the same episode are correlated, which makes that estimator biased for finite samples, though still consistent.
Tap to flip back
G_t is the sum of all future rewards along a trajectory, accumulating randomness from every stochastic action and transition until episode end. TD(0) uses only a single step R_{t+1} + γV(S_{t+1}), so its updates see far less compounded randomness. Bias and variance trade off: removing bootstrap bias (MC) costs you variance.
Tap to flip back
A deterministic policy will never visit state-action pairs outside its chosen actions, so those Q-values are never updated from their initial guesses. Without coverage of all pairs, greedy improvement may lock onto a suboptimal policy. Exploring starts or epsilon-greedy exploration ensures that, in the limit, every (s, a) pair is visited infinitely often, which is necessary for convergence.
Tap to flip back
The ratio corrects for the mismatch between behaviour policy b and target policy π:
ρ = Π_{t} [π(A_t|S_t) / b(A_t|S_t)]
Each factor is a probability ratio. For a T-step episode the product of T independent ratios has variance that grows exponentially with T. When π and b differ substantially, many ratios will be far from 1 and the weighted return estimate becomes highly unstable.
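The exponential growth can be made concrete under a simplifying assumption: every step uses the same pair of action distributions and the steps are independent. The helper names and the example distributions are illustrative.

```python
import math
from typing import Sequence

def importance_ratio(pi_probs: Sequence[float], b_probs: Sequence[float]) -> float:
    """rho = product over the trajectory of pi(A_t|S_t) / b(A_t|S_t)."""
    return math.prod(p / q for p, q in zip(pi_probs, b_probs))

def product_variance(pi: Sequence[float], b: Sequence[float], T: int) -> float:
    """Variance of rho over T independent steps with fixed distributions pi and b.

    Under b each per-step ratio has mean 1, so Var(rho) = E_b[ratio^2]^T - 1.
    """
    second_moment = sum(q * (p / q) ** 2 for p, q in zip(pi, b))
    return second_moment ** T - 1.0

pi, b = [0.9, 0.1], [0.5, 0.5]  # hypothetical target and behaviour policies
print(importance_ratio([0.9, 0.9], [0.5, 0.5]))
for T in (1, 5, 10, 20):
    print(T, round(product_variance(pi, b, T), 2))
```

Each extra step multiplies the second moment by the same constant, so the variance compounds geometrically with episode length.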
Tap to flip back
TD(λ) targets are a geometric mixture of n-step returns:
G_t^λ = (1-λ) Σ_{n=1}^∞ λ^{n-1} G_t^(n)
Setting λ = 0 recovers TD(0) (one-step bootstrap). Setting λ = 1 recovers the full Monte Carlo return (provided the task is episodic). Intermediate values interpolate the bias-variance trade-off continuously, which is why GAE uses a tunable λ for policy gradient variance reduction.
Tap to flip back
- Continuing tasks with no terminal state: MC needs an episode to end before updating, so it simply cannot run.
- Very long episodes: the variance of G_t grows with episode length, making sample efficiency prohibitively poor.
TD methods handle both cases cleanly because they update after every single step without waiting for episode termination.
Tap to flip back
V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) - V(s_t)]
- α: learning rate
- r_{t+1} + γ V(s_{t+1}): the TD target (one-step lookahead)
- r_{t+1} + γ V(s_{t+1}) - V(s_t): the TD error δ_t (prediction error to correct)
Why it matters: TD learns online after every step, not after every episode, because the target substitutes V(s_{t+1}) for the unknown true return.
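The update can be sketched for a tabular value function stored in a dict. The function name `td0_update` and the `terminal` flag are illustrative assumptions, not part of the card.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    """Apply one TD(0) update to the value table V in place; return the TD error."""
    bootstrap = 0.0 if terminal else V[s_next]  # no bootstrap past a terminal state
    td_target = r + gamma * bootstrap
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return td_error

V = {"s0": 0.0, "s1": 1.0}
delta = td0_update(V, "s0", r=0.5, s_next="s1", alpha=0.1, gamma=0.9)
# TD target = 0.5 + 0.9 * 1.0 = 1.4, so delta = 1.4 and V["s0"] moves to 0.14.
```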
Tap to flip back
Q-learning target: r + γ max_a Q(s', a) - uses the greedy action regardless of what the agent actually does next.
SARSA target: r + γ Q(s', a') - uses the actual next action a' drawn from the current policy.
Q-learning is off-policy because the target is independent of the behaviour policy. SARSA is on-policy because its target reflects the exploration the agent will actually perform. In dangerous environments, SARSA is more conservative because it accounts for exploratory stumbles; Q-learning can be overoptimistic.
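A small sketch of the two targets on a made-up cliff-edge state, assuming Q is a nested dict keyed by state and then action. The state and action names are hypothetical.

```python
def q_learning_target(Q, r, s_next, gamma):
    # Bootstraps from the greedy action, whatever the agent does next.
    return r + gamma * max(Q[s_next].values())

def sarsa_target(Q, r, s_next, a_next, gamma):
    # Bootstraps from the action the behaviour policy actually chose.
    return r + gamma * Q[s_next][a_next]

Q = {"edge": {"safe": 1.0, "risky": -10.0}}
ql = q_learning_target(Q, r=0.0, s_next="edge", gamma=0.9)
# Suppose exploration picked the risky action at the next step.
sa = sarsa_target(Q, r=0.0, s_next="edge", a_next="risky", gamma=0.9)
```

Here the Q-learning target is 0.9 while the SARSA target is -9.0: SARSA's estimate for the preceding state absorbs the cost of the exploratory step.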
Tap to flip back
The deadly triad is the combination of:
1. Off-policy learning
2. Bootstrapping (using estimated future values in the target)
3. Function approximation (e.g., a neural network)
Any two of these without the third are generally fine. All three together can cause value estimates to diverge rather than converge, even when learning rate and architecture are sensible. DQN mitigates the problem with experience replay and a separate frozen target network, but does not eliminate the theoretical risk. Tabular Q-learning is safe because it has no function approximation.
Tap to flip back
G_t^(n) = r_{t+1} + γ r_{t+2} + ... + γ^{n-1} r_{t+n} + γ^n V(s_{t+n})
- n=1 → TD(0): one real reward, then bootstrap.
- n=∞ → Monte Carlo: all real rewards, no bootstrap.
Intermediate n reduces the variance of MC (fewer summed noise terms) while reducing the bias of TD(0) (less reliance on the current, possibly inaccurate V). In practice, values around n=3-10 often beat both extremes.
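A sketch of the n-step return, under the assumed indexing that `rewards[k]` holds r_{k+1} and `values[k]` holds V(s_k), with a final entry of 0 for the terminal state. If t+n runs past the end of the episode, the return is truncated at the terminal state.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n) for one episode; values has one more entry than rewards."""
    T = len(rewards)
    end = min(t + n, T)  # truncate at episode end
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    g += gamma ** (end - t) * values[end]  # bootstrap (0 at the terminal state)
    return g

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
g_td = n_step_return(rewards, values, t=0, n=1, gamma=0.5)   # 1 + 0.5 * 0.5
g_mc = n_step_return(rewards, values, t=0, n=10, gamma=0.5)  # 1 + 0.5 + 0.25
```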
Tap to flip back
In stochastic environments, Q(s', a) estimates have noise. Even if each estimate is unbiased, E[max_a Q(s', a)] >= max_a E[Q(s', a)]: taking the max systematically selects whichever estimate has the most positively-biased noise. This inflates targets and propagates overestimates.
Double DQN decouples two steps:
1. Action selection: use the online network to pick arg max_a Q(s', a).
2. Action evaluation: use the frozen target network to compute Q_target(s', selected_a).
Because selection and evaluation use independent (or semi-independent) networks, the positive correlation that drives maximisation bias is broken. van Hasselt et al. (2015) showed this reduces overestimation and improves final performance on Atari.
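The two-step target can be sketched with plain lists standing in for the network outputs at s'. The helper names and the numbers are hypothetical; a real implementation would obtain these values from the online and target networks.

```python
def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def dqn_target(r, q_target_next, gamma):
    # Vanilla DQN: the target network both selects and evaluates.
    return r + gamma * max(q_target_next)

def double_dqn_target(r, q_online_next, q_target_next, gamma):
    a_star = argmax(q_online_next)             # selection: online network
    return r + gamma * q_target_next[a_star]   # evaluation: target network

q_online_next = [1.0, 2.0, 0.5]   # online network's Q(s', .)
q_target_next = [1.5, 1.0, 3.0]   # target network's Q(s', .)
y_dqn = dqn_target(0.0, q_target_next, gamma=0.9)
y_ddqn = double_dqn_target(0.0, q_online_next, q_target_next, gamma=0.9)
```

In this toy case the target network's 3.0 for action 2 is never used by Double DQN, because the online network prefers action 1.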
Tap to flip back
δ_t = r_{t+1} + γ V(s_{t+1}) - V(s_t)
It is the signed difference between what was predicted (V(s_t)) and what the evidence now suggests the value should be (r_{t+1} + γ V(s_{t+1})).
Neuroscientists have noted that dopamine neuron firing in the brain resembles TD error signals: dopamine neurons fire at unexpected reward, are suppressed when predicted reward is omitted, and shift their response from reward to the reward-predicting cue as learning proceeds - exactly the behaviour TD errors drive. Schultz, Dayan & Montague (1997) formalised this connection.
Tap to flip back
TD(0) propagates value information one step per visit. In sparse-reward environments (e.g., a maze where +1 appears only at the exit), value estimates near the start state remain near zero until states adjacent to the goal have been visited many times and their values have propagated backward step by step.
Techniques that accelerate credit assignment:
- Larger n in n-step TD or TD(λ): cover more ground per update.
- Reward shaping: add intermediate pseudo-rewards that guide the agent.
- Hindsight Experience Replay (HER): relabel failed trajectories as having achieved the state they actually reached, generating useful signal from every transition.
- Priority-based replay (Prioritised Experience Replay): sample transitions with high δ_t more often, focusing updates where the estimates are most wrong.
Tap to flip back
Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') − Q(s, a)]
- α: learning rate (step size)
- r: immediate reward observed after taking action a in state s
- γ: discount factor, down-weighting future rewards
- max_a' Q(s', a'): bootstrapped estimate of optimal future value from next state s'
- [...]: the TD error - the surprise signal driving the update
Why it works: the update is a sampled, incremental application of the Bellman optimality operator, which is a contraction; repeated application drives Q towards Q*, the true optimal action-value function.
Tap to flip back
Q-learning targets r + γ · max_a' Q(s', a'), which assumes the greedy action will be taken next, regardless of what action the behaviour policy (e.g. ε-greedy) actually selects.
The policy being evaluated and improved (greedy) differs from the policy generating experience (ε-greedy). That separation is what makes it off-policy.
Practical payoff: experience from old policies can be reused in a replay buffer, because the target is not tied to how that experience was collected.
Tap to flip back
When Q(s', a') values have estimation noise, max_a' Q(s', a') systematically picks the overestimated action. This positive bias accumulates through bootstrapping, inflating Q values beyond their true optimum - especially in stochastic environments.
Double Q-learning fix: decouple action selection from action evaluation using two separate estimators θ and θ'. One network selects the greedy action; the other evaluates it:
target = r + γ · Q(s', argmax_a' Q(s', a'; θ); θ')
The overestimation bias is substantially reduced to the extent that the two estimators' errors are independent.
Tap to flip back
- Experience replay: stores transitions in a circular buffer; mini-batches are sampled uniformly at training time. Breaks the temporal correlations between consecutive observations that destabilise stochastic gradient descent.
- Target network: a periodically-copied frozen version of the Q-network generates bootstrap targets. Prevents the "moving target" problem where the network chases its own shifting predictions and diverges.
Both address the same root cause: neural networks trained with bootstrapped, auto-correlated targets are prone to oscillation and divergence.
Tap to flip back
The deadly triad (Sutton & Barto): function approximation + bootstrapping + off-policy learning.
Each alone is manageable:
- Supervised learning uses function approximation but not bootstrapping.
- Tabular Q-learning uses bootstrapping but not generalising approximators.
Together they can cause divergence: the approximator generalises errors across states, bootstrapping feeds those errors back as targets, and off-policy data violates the distribution assumptions that keep the approximator stable. DQN operates squarely in this regime - it works empirically but has no convergence guarantee.
Tap to flip back
The update and the greedy policy both require computing max_a Q(s, a) over the entire action set. In a discrete action space with N actions this costs O(N) value evaluations (a single forward pass if the network outputs all N action values at once). In a continuous action space the argmax is an optimisation problem with no closed form for a general Q-network.
Workarounds:
- Discretise the action space (loses precision, scales poorly).
- Use a special network architecture where the argmax is analytic (e.g. NAF - Normalised Advantage Functions).
- Switch to policy gradient methods (DDPG, SAC, PPO) that parameterise a policy directly and never need an explicit Q-maximisation step.
Tap to flip back
Watkins & Dayan (1992) proved convergence requires:
- All (s, a) pairs visited infinitely often - the agent must explore sufficiently (e.g. ε-greedy with ε > 0 forever, or an ε that decays slowly enough that every pair is still visited infinitely often).
- Robbins-Monro learning rate schedule: Σ_t α_t = ∞ and Σ_t α_t² < ∞ - the schedule must step far enough to correct any initialisation, but shrink fast enough that noise averages out.
- Bounded rewards and a finite MDP.
In practice α is often fixed (violating condition 2), which means Q converges to a neighbourhood of Q* rather than to Q* exactly; this is acceptable in most empirical settings.
Tap to flip back
Experience replay and a target network. Experience replay stores transitions in a circular buffer and samples random minibatches, breaking temporal correlations. The target network is a periodically-copied frozen duplicate of the online network whose parameters are held fixed when computing TD targets, preventing the prediction and target from chasing each other simultaneously.
Tap to flip back
L(theta) = E [ ( r + gamma * max_{a'} Q(s', a'; theta^-) - Q(s, a; theta) )^2 ]
theta is the online network (differentiated). theta^- is the target network (frozen, treated as a constant). The target network is synchronised to theta every fixed number of steps (e.g., 10,000).
Why it matters: fixing theta^- makes the regression target approximately stationary, converting an unstable coupled system into something resembling supervised learning.
Tap to flip back
Vanilla DQN uses the same parameters for both selecting the best next action and evaluating it, which causes systematic upward bias because any noise that inflates an action's value also increases the probability of selecting it.
Double DQN decouples the two roles:
target = r + gamma * Q(s', argmax_{a'} Q(s', a'; theta); theta^-)
The online network (theta) selects the action; the target network (theta^-) evaluates it. Because the target network is a lagged copy of the online network, their errors are only partly decorrelated, but this is enough to remove much of the bias.
Tap to flip back
Dueling networks decompose Q into a state value V(s) and a state-dependent advantage A(s, a):
Q(s, a) = V(s) + (A(s, a) - mean_{a'} A(s, a'))
The mean is subtracted to ensure identifiability (V and A cannot otherwise be separated from Q alone).
This is useful because in many states the choice of action barely matters. A shared V(s) head can learn the overall value of a state efficiently without needing to estimate a distinct advantage for every action, leading to better generalisation and more accurate value estimates.
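A sketch of the aggregation step only, with scalars and lists standing in for the outputs of the V and A heads. The function name `dueling_q` is illustrative.

```python
def dueling_q(v, advantages):
    """Combine a state value and per-action advantages into Q-values."""
    mean_a = sum(advantages) / len(advantages)
    return [v + (a - mean_a) for a in advantages]

q = dueling_q(2.0, [1.0, 0.0, -1.0])
# Adding a constant to every advantage leaves Q unchanged, which is the
# identifiability property the mean subtraction provides.
q_shifted = dueling_q(2.0, [11.0, 10.0, 9.0])
```

Because the mean-subtracted advantages sum to zero, the average of the resulting Q-values equals V(s).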
Tap to flip back
The deadly triad (Sutton & Barto) describes conditions under which Q-learning with function approximation can diverge: combining function approximation + bootstrapping (using network estimates as targets) + off-policy learning. DQN uses all three: a neural network approximator, TD targets derived from the network itself, and a replay buffer that samples transitions from a distribution different from the current policy. In most benchmark environments training is stable, but divergence has been observed and has no fully general theoretical fix.
Tap to flip back
- Discrete action space only. DQN selects actions via argmax_a Q(s, a), which requires enumerating all actions. Continuous spaces make this intractable. Remedy: use an actor-critic method (DDPG, SAC, TD3) that maintains an explicit policy network.
- Exponential blowup when discretising. Naively binning each continuous dimension into k values produces k^d actions for d dimensions, growing exponentially. Even moderate resolution is impractical. The same actor-critic remedy applies.
Tap to flip back
Prioritised experience replay samples transitions with probability proportional to a priority based on their absolute TD error (plus a small epsilon so that zero-error transitions are still sampled occasionally). Transitions the network finds surprising are replayed more often, improving sample efficiency; on 49 Atari games it outperforms uniform DQN on 41 of them.
The correction: because sampling is no longer uniform, the gradient update is biased toward high-error transitions. Importance-sampling weights w_i = (1 / N * 1 / P(i))^beta (annealed from beta < 1 to 1 over training) are applied to each loss term to correct the distribution shift and restore unbiased gradient estimates at convergence.
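A rough sketch of the sampling probabilities and the correction weights. Two details here are assumptions beyond the card: the priority exponent `alpha` (priorities are raised to a power before normalising, as in the original proportional variant) and the division by the maximum weight, a common stabilisation that only rescales the loss.

```python
def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) proportional to (|delta_i| + eps) ** alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

def is_weights(probs, beta=0.4):
    """w_i = (1/N * 1/P(i)) ** beta, normalised so the largest weight is 1."""
    n = len(probs)
    weights = [(1.0 / (n * p)) ** beta for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]

td_errors = [0.1, 1.0, 2.0]
probs = per_probabilities(td_errors)
weights = is_weights(probs, beta=0.4)
```

The transition with the largest TD error is sampled most often and receives the smallest weight, so its more frequent replay is partly compensated in the loss.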
Tap to flip back
Regret L(T) is the cumulative difference between the expected reward of always pulling the optimal arm a* and the expected reward of the arms actually pulled over T steps: L(T) = sum_{t=1}^{T} [ Q*(a*) - Q*(a_t) ], where Q*(a) is the true mean reward of arm a. It captures the cost of not knowing which arm is best. A good exploration strategy minimises regret; an algorithm with O(ln T) regret is provably near-optimal for stationary bandits, while fixed-epsilon-greedy accumulates O(T) regret because it keeps exploring bad arms at a constant rate forever.
Tap to flip back
UCB1 selects the action maximising Q_t(a) + c * sqrt(ln(t) / N_t(a)), where N_t(a) is the pull count and c is a confidence constant. The second term is a bonus that shrinks as an arm is explored more and grows as time passes without exploring it. This implements optimism in the face of uncertainty: the agent assumes underexplored arms might be excellent until evidence proves otherwise, so it systematically tries them before committing to the greedy choice. The result is O(ln T) regret, provably optimal up to constants for the stationary bandit.
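A sketch of the selection rule, assuming the common convention of pulling every arm once before applying the formula (needed to avoid dividing by a zero count). The function name and default `c` are illustrative.

```python
import math

def ucb1_select(q_values, counts, t, c=2.0):
    """Return the index of the arm maximising Q_t(a) + c * sqrt(ln(t) / N_t(a))."""
    # Assumed convention: try each arm once before using the bonus formula.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])

# Arm 1 has a lower estimate but has been pulled only once, so its bonus wins.
choice = ucb1_select([0.5, 0.4], counts=[100, 1], t=101, c=1.0)
# With c = 0 the rule is purely greedy.
greedy_choice = ucb1_select([0.5, 0.4], counts=[100, 1], t=101, c=0.0)
```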
Tap to flip back
Thompson Sampling maintains a posterior distribution over the true value Q*(a) for each arm. At each step it draws one sample from each posterior and picks the arm with the highest sampled value. In the Bernoulli bandit, conjugate Beta priors make updates exact: after observing reward r from arm a, update Beta(alpha, beta) to Beta(alpha + r, beta + 1 - r). The prior matters most in early steps when data is scarce; a poorly chosen prior (e.g., overly optimistic) can front-load exploration at the wrong arms. Over time the posterior concentrates around the true value and the agent naturally exploits the best arm.
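A minimal Bernoulli-bandit sketch using the standard library's `random.Random.betavariate`. The uniform Beta(1, 1) priors, the true arm probabilities, and the seed are assumptions made for the demonstration.

```python
import random

def thompson_step(posteriors, pull, rng):
    """Sample each Beta posterior, pull the best-looking arm, update it."""
    samples = [rng.betavariate(a, b) for a, b in posteriors]
    arm = max(range(len(samples)), key=lambda i: samples[i])
    r = pull(arm)  # reward is 0 or 1
    a, b = posteriors[arm]
    posteriors[arm] = (a + r, b + 1 - r)  # conjugate Beta update
    return arm, r

rng = random.Random(0)
true_p = [0.2, 0.8]               # hypothetical arm success rates
posteriors = [(1, 1), (1, 1)]     # Beta(1, 1) prior on each arm
pulls = [0, 0]
for _ in range(500):
    arm, _ = thompson_step(
        posteriors, lambda i: 1 if rng.random() < true_p[i] else 0, rng
    )
    pulls[arm] += 1
```

As the posteriors concentrate, samples from the weaker arm rarely exceed those from the stronger one, so most pulls end up on the better arm.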
Tap to flip back
Epsilon-greedy chooses uniformly at random during exploration. In a sparse-reward environment the positive reward may only be reachable via a long, specific sequence of actions; the probability of hitting that sequence by uniform random action is exponentially small. Count-based and curiosity-driven methods address this by adding an intrinsic reward bonus for visiting novel or surprising states, actively steering the agent toward unexplored regions rather than waiting for a lucky random walk to discover distant rewards.
Tap to flip back
In tabular RL, the visit count N(s) is well defined and the intrinsic bonus 1/sqrt(N(s)) is easy to compute. In deep RL with high-dimensional observations (e.g., Atari pixels), almost every frame is unique, so N(s) = 1 for essentially every state and the bonus is uninformative. Bellemare et al. (2016) replace the count with a pseudo-count derived from a density model: after observing state s, the pseudo-count N-hat(s) is estimated from how much the density model's probability of s changes. This generalises visit frequency to a continuous notion of familiarity without requiring exact state matches.
Tap to flip back
Curiosity-driven methods reward prediction error of a forward model. A stochastic, uncontrollable stimulus - a TV showing random noise, a fire burning chaotically - produces permanently high prediction error because the noise cannot be reduced through learning. The agent maximises intrinsic reward by staring at the noisy TV indefinitely rather than exploring the task-relevant parts of the environment. The standard mitigation is to compute prediction error only in a learned feature space that encodes agent-controllable factors and discards irrelevant stochasticity, as in Pathak et al. (2017)'s inverse-dynamics encoding.
Tap to flip back
PPO adds a policy-entropy term c2 * H(pi(. | s_t)) to the training objective, penalising overly peaked action distributions. This keeps the policy from collapsing to a single deterministic action too early in training - essentially keeping epsilon above zero throughout. Unlike UCB, it does not explicitly track uncertainty over action values or state visit counts; it simply prevents premature commitment. The advantage is scalability: the entropy of a softmax head is trivial to compute even for large networks. The disadvantage is that it provides no directed signal toward genuinely unexplored states; it just maintains diversity in the actions taken from already-visited states.
Tap to flip back
The log-derivative (score function) trick: multiply and divide by \(\pi_\theta\) to convert \(\nabla_\theta \mathbb{E}[R]\) into \(\mathbb{E}[R \cdot \nabla_\theta \log \pi_\theta]\). The gradient now sits inside an expectation over the current policy, so it can be estimated with Monte Carlo rollouts without differentiating through the environment.
Why it matters: this makes policy gradients applicable to any black-box reward signal - no model, no value table, no Bellman backup required.
Tap to flip back
Actions at time \(t\) cannot affect rewards received before \(t\), so including past rewards in the weight only adds noise. Reward-to-go \(G_t = \sum_{k \geq t} \gamma^{k-t} r_k\) retains the same expected gradient (unbiased) while removing a source of variance. This is sometimes called the causality trick.
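Reward-to-go for a whole episode can be computed in one backward pass. This is a small sketch; the function name is illustrative.

```python
def rewards_to_go(rewards, gamma):
    """Return [G_0, G_1, ...] where G_t = sum_{k >= t} gamma^(k-t) * r_k."""
    out = [0.0] * len(rewards)
    g = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

rtg = rewards_to_go([1.0, 0.0, 2.0], gamma=0.5)
# Step 2 sees only its own reward; step 0 sees 1 + 0.5*0 + 0.25*2 = 1.5.
```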
Tap to flip back
Subtracting a state-dependent baseline gives the update weight \((G_t - b(s_t))\). Because \(\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0\), any \(b(s)\) that does not depend on \(a\) factors out of the expectation and contributes zero: no bias is introduced. Its sole effect is to reduce variance by centring the signal, making learning faster and more stable. The standard, near-optimal choice is \(V^\pi(s_t)\); approximating it with a neural network leads to the actor-critic family.
Tap to flip back
dist.log_prob(action) computes \(\log \pi_\theta(a_t \mid s_t)\); multiplying by \(G_t\) gives the policy gradient estimator term. Optimisers in PyTorch perform gradient descent (minimisation), but we want gradient ascent on expected return \(J(\theta)\). Negating converts the ascent objective into a minimisation loss. The backward() call then computes \(-\nabla_\theta \log \pi_\theta \cdot G_t\), and the optimiser subtracts this, effectively adding \(\nabla_\theta J\).
Tap to flip back
- Sample efficiency: REINFORCE is on-policy - each trajectory is used once. PPO uses importance sampling ratio \(r_t = \pi_\theta / \pi_{\theta_\text{old}}\) to reuse data for multiple gradient steps.
- Update instability: large gradient steps can collapse the policy. PPO clips \(r_t\) to \([1-\varepsilon, 1+\varepsilon]\), preventing any single update from moving the policy too far even if the gradient points further away. The clipped surrogate objective is cheap to compute and does not require solving a constrained optimisation as TRPO does.
Tap to flip back
When reward appears only at episode termination, trajectories that never reach a rewarding outcome return a gradient of zero - the policy receives no learning signal at all. Random exploration rarely discovers the rewarding region early in training.
Common remedies:
- Reward shaping: add a dense auxiliary signal (e.g., distance to goal) that guides exploration without changing the optimal policy (if done carefully with potential-based shaping).
- Curiosity-driven exploration: add an intrinsic bonus proportional to prediction error or state novelty, so the agent is rewarded for visiting unfamiliar states even before extrinsic reward arrives.
Tap to flip back
GAE (Schulman et al. 2016, arXiv:1506.02438) defines a weighted average of \(k\)-step advantage estimates:
\[A^{\text{GAE}(\gamma,\lambda)}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}\]
where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is the TD residual.
- \(\lambda = 0\): one-step TD advantage; low variance, but biased by the value function approximation error.
- \(\lambda = 1\): Monte Carlo return minus baseline; unbiased, but high variance.
\(\lambda \in (0,1)\) interpolates smoothly, letting practitioners tune the trade-off. Most modern PPO implementations default to \(\lambda \approx 0.95\), which gives low variance without accumulating too much bootstrap bias.
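The sum can be computed with a single backward recursion over one finite episode. In this sketch `values` is assumed to hold V(s_0) through V(s_T), with the last entry set to 0 for a terminal state; the function name is illustrative.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE advantages for one episode; values has len(rewards) + 1 entries."""
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
adv_td = gae_advantages(rewards, values, gamma=0.5, lam=0.0)  # TD residuals
adv_mc = gae_advantages(rewards, values, gamma=0.5, lam=1.0)  # return - V(s_t)
```

With λ = 0 the output is just the per-step TD residuals; with λ = 1 it equals the discounted return minus V(s_t), matching the two limiting cases.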
Tap to flip back
Because E_{a~π}[∇_θ log π_θ(a|s) · b(s)] = 0 for any fixed b(s). Since b does not depend on the action, it factors outside the expectation; the remaining integral of ∇_θ log π_θ over a normalised distribution is zero (EGLP lemma). The baseline shifts individual sample returns but leaves the gradient's expectation unchanged.
Tap to flip back
A standard, near-optimal baseline is b(s) = V^π(s), the state value function under the current policy (the exact variance-minimising baseline additionally weights returns by the squared gradient magnitude). Subtracting V^π(s) produces the advantage A^π(s,a) = Q^π(s,a) - V^π(s), which is centred around zero and substantially reduces the variance of the gradient estimate across actions. In practice V^π is approximated by a learned value network (the "critic" in actor-critic).
Tap to flip back
Â_t^GAE(γ,λ) = Σ_{l≥0} (γλ)^l δ_{t+l} where δₜ = rₜ + γV(sₜ₊₁) - V(sₜ) is the one-step TD error.
- λ = 0: single-step TD advantage (low variance, higher bias from value approximation error).
- λ = 1: full Monte Carlo advantage (unbiased regardless of the accuracy of V, but high variance).
- λ ∈ [0.9, 0.99] is the typical practical range used in PPO and TRPO, trading a little bias for large variance reduction.
Tap to flip back
Value-loss gradients flow back through the shared trunk and update feature representations to minimise prediction error for V(s). Those same gradient steps can corrupt or override the features that the policy head has learned to use for good action selection. The two objectives are not aligned: accurate value estimation does not require the same representations as high-entropy, well-calibrated action selection. Common mitigations include separate learning rates, gradient clipping, and a small value-loss coefficient (e.g., 0.5).
Tap to flip back
Normalisation divides by the batch's standard deviation, which rescales all advantage values to unit variance. If the reward is sparse - most trajectories return zero and a few return a large positive value - the standard deviation is dominated by those rare successes. Normalising inflates the near-zero advantages of the common trajectories and deflates the large positive signal from the rare successes, weakening the very signal the policy needs to learn from. The fix is to use a running statistics normaliser or to clip rather than normalise.
Tap to flip back
Weighting step t by the full-episode return (summed from t=0) includes past rewards that the action at step t could not have influenced. Those past rewards add noise with zero mean but nonzero variance to the gradient weight. Replacing the full return with the reward-to-go Rₜ = Σ_{t'≥t} γ^{t'-t} r_{t'} discards those causally irrelevant terms, cutting the variance of the estimator without changing its expectation. This is often called the "causality trick" and is the simplest variance-reduction step before introducing a baseline.
Tap to flip back
If values is not detached, the policy loss gradient flows through the value network as well as the policy network. This means the value network parameters are updated simultaneously by two conflicting signals: one optimising value prediction (value loss) and one optimising policy performance (policy loss). Detaching treats the baseline as a fixed target during the policy update step, cleanly separating the two optimisation problems and preventing gradient interference.
Tap to flip back
The actor outputs a policy: a probability distribution (or parameters of a distribution) over actions given a state, π_θ(a | s). The critic outputs a scalar value estimate, typically V_φ(s) or Q_φ(s, a). The actor uses the critic's estimate to compute advantages; the critic is trained independently via TD regression.
Tap to flip back
The advantage subtracts the state-value baseline V(s), which captures the predictable return from being in that state regardless of action. This reduces variance in the gradient estimate without introducing bias (since V(s) does not depend on the action). The resulting signal is a zero-mean, signed quantity: positive means the action was better than the average action from that state, negative means worse.
Tap to flip back
Lambda controls the exponential decay weighting across TD errors, trading off bias against variance. At λ = 0, GAE reduces to the one-step TD error (low variance, higher bias from the bootstrapped critic). At λ = 1, it reduces to the full Monte Carlo return minus the baseline (unbiased, high variance). Practical values sit around λ ∈ [0.9, 0.97].
Tap to flip back
If gradients were allowed to flow through the critic's parameters during the actor update, the optimiser would simultaneously try to minimise actor loss by adjusting critic weights. This conflates two distinct objectives and destabilises training. The critic should be a fixed estimator for the purpose of computing the advantage: its current output is used as a target signal, not as a co-optimised component. Detaching breaks the gradient path at that point.
Tap to flip back
A3C runs multiple independent agent-environment instances in parallel, each asynchronously computing gradients and applying updates to a shared set of parameters. The workers explore different trajectories simultaneously, which decorrelates the gradient updates. This decorrelation provides a stabilising effect similar to experience replay in DQN, but without a replay buffer, making the method on-policy and memory-efficient.
Tap to flip back
The actor and critic impose conflicting gradient pressures on shared layers: the critic loss encourages representations that predict scalar returns, while the policy loss encourages action-discriminative features. If the critic loss coefficient c_v is too large, critic gradients dominate and destroy the policy representation. If too small, the critic trains slowly, leading to poor advantage estimates. The standard fix is careful tuning of c_v (commonly 0.5) or, in high-stakes settings, separate actor and critic networks with no shared backbone.
Tap to flip back
PPO replaces the standard policy gradient objective with a clipped surrogate: L_CLIP = E_t [ min( r_t · A_t, clip(r_t, 1-ε, 1+ε) · A_t ) ], where r_t = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio between new and old policy. Clipping removes any incentive to push the ratio outside [1-ε, 1+ε] in the direction that would improve the objective, bounding the effective policy change per update. This achieves a trust-region-like constraint without the second-order computation required by TRPO.
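The per-sample term inside the expectation can be sketched with scalars. The function name `ppo_clip_term` and the default ε = 0.2 are illustrative choices.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: gains from raising the ratio are capped at 1 + eps.
up_good = ppo_clip_term(1.5, 1.0)     # 1.2
# Negative advantage: gains from lowering the ratio are capped at 1 - eps.
down_bad = ppo_clip_term(0.5, -1.0)   # -0.8
# Moves that make the objective worse are not clipped (pessimistic bound).
up_bad = ppo_clip_term(1.5, -1.0)     # -1.5
down_good = ppo_clip_term(0.5, 1.0)   # 0.5
```

The outer min makes the surrogate a lower bound: clipping only ever removes reward for moving further, never the penalty for moving the wrong way.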
Tap to flip back
On-policy: the behaviour policy (the one collecting data) and the target policy (the one being improved) are the same. Off-policy: they differ. SARSA is on-policy; Q-learning is off-policy. Why it matters: off-policy methods can reuse old experience; on-policy methods cannot.
Tap to flip back
Q-learning bootstraps off max_a' Q(s', a') (the greedy target policy), so it optimises for the best possible action regardless of what epsilon-greedy exploration actually does. SARSA bootstraps off the action the agent actually takes under epsilon-greedy, including the occasional random cliff-edge step - so its value estimates reflect the exploratory behaviour, penalising risky states. Q-learning's values are cleaner asymptotically but more dangerous during learning.
Tap to flip back
rho_t = pi(a_t | s_t) / b(a_t | s_t) re-weights a transition collected under behaviour policy b to give an unbiased estimate of the return under target policy pi. Without it, Monte Carlo returns from a different distribution are biased. For multi-step trajectories the ratios multiply, causing variance explosion - which is why algorithms like V-trace truncate rho_t at a ceiling rho_bar (and the trace-cutting coefficients at c_bar).
Tap to flip back
- Sample reuse - transitions stored in a replay buffer can be sampled many times; on-policy data is used once then discarded.
- Behaviour diversity - you can learn from human demonstrations, scripted explorers, or old checkpoints, as long as the behaviour policy has coverage.
- Parallelism - actors can asynchronously collect experience while a centralised learner updates the target policy (e.g. IMPALA's architecture), decoupling acting speed from learning speed.
Tap to flip back
The deadly triad (Sutton & Barto) is the combination of: (1) function approximation, (2) bootstrapping (using estimated values to update values), and (3) off-policy updates. Each ingredient alone is fine; together they can cause the value function to diverge rather than converge. DQN tames it with a frozen target network and experience replay, but there is no general convergence guarantee for off-policy deep RL - only empirical stability recipes.
Tap to flip back
PPO clips the probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s) to the interval [1 - epsilon, 1 + epsilon] in its surrogate objective. This prevents the updated policy from drifting too far from the one that generated the rollout batch, so the on-policy assumption is approximately maintained across epochs. Without clipping, repeated epochs on stale data would cause large, destabilising policy updates.
Tap to flip back
Coverage requires that for every state-action pair the target policy might visit, the behaviour policy must also visit it with non-zero probability: pi(a|s) > 0 => b(a|s) > 0. If violated, the importance sampling ratio becomes undefined (division by zero), and the replay buffer contains no gradient signal for those transitions. The agent silently extrapolates from neighbouring states, often catastrophically - a particular hazard in robotics manipulation where rare, precise contacts matter most.
Tap to flip back
It adds a bonus proportional to the Shannon entropy of the policy at each visited state: r_t + α · H(π(·|s_t)). The agent is rewarded not just for collecting reward but for acting unpredictably. This prevents premature collapse to a single deterministic action and keeps exploration alive throughout training.
Tap to flip back
The hard Bellman backup uses max_a Q(s,a) for the target value. The soft Bellman backup replaces this with the soft maximum:
V_soft(s) = α · log Σ_a exp( Q(s,a) / α )
The soft maximum is smooth and differentiable, always greater than or equal to the hard maximum, and approaches the hard maximum as α → 0. This makes gradient-based optimisation more stable and encodes the entropy bonus directly into value estimates.
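A numerical sketch of V_soft for a discrete action set, using the usual max-shift so the exponentials do not overflow. The function name and the sample Q-values are illustrative.

```python
import math

def soft_value(q_values, alpha):
    """alpha * log(sum_a exp(Q(s, a) / alpha)), computed stably."""
    m = max(q_values)
    # Shifting by the max leaves the result unchanged; since one term in the
    # sum is exp(0) = 1, the log is non-negative and the result is >= m.
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

q = [1.0, 2.0, 3.0]
v_warm = soft_value(q, alpha=1.0)    # noticeably above max(q)
v_cold = soft_value(q, alpha=0.01)   # essentially max(q)
```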
Tap to flip back
The optimal policy is a Boltzmann (softmax) distribution over actions:
π*(a|s) ∝ exp( Q*(s,a) / α )
High-Q actions receive higher probability, but probability mass is never zero on any action. Temperature α controls the spread: large α approaches uniform; small α approaches greedy. This is exactly the softmax policy familiar from multi-armed bandits, extended recursively through the Bellman equation.
Tap to flip back
SAC frames temperature selection as a constrained optimisation: maximise expected reward subject to the constraint that the policy's mean entropy stays above a minimum target H_target (often set to -|A|, the negative of the action-space dimension). The Lagrangian dual variable for this constraint is α. Gradient descent on the dual variable automatically raises α when entropy falls below the target and lowers it when entropy exceeds the target, making temperature self-tuning during training.
Tap to flip back
PPO adds an entropy bonus as a coefficient c₂ · H(π_θ) directly to its surrogate loss, acting as a soft regulariser to slow policy collapse. It does not modify the Bellman backups. SAC bakes entropy into the value function through the soft Bellman backup, so the exploration incentive persists throughout every value estimate. PPO's entropy term primarily helps in the early phases of training on discrete action spaces; SAC's entropy shaping is structurally deeper and more persistent.
Tap to flip back
- Reward-sparse tasks: a high-entropy policy may never encounter reward, so the entropy bonus dominates and the agent learns to act randomly rather than solve the task. Pre-training with curriculum or reward shaping is needed.
- Unbounded Gaussian variance: in continuous action spaces, a Gaussian policy can inflate its variance indefinitely in low-reward regions to harvest entropy. Without variance clamping or squashing (e.g., tanh), this causes diverging variances or NaN gradients. SAC addresses this with a log-variance parameterisation and squashing, but naive implementations fail here.
Tap to flip back
α scales the relative weight of entropy versus reward in the objective. At α → 0 the entropy bonus vanishes and the policy collapses to the standard reward-maximising (greedy) policy. At α → ∞ the reward signal is negligible and the optimal policy is uniform over all actions. Intermediate values of α produce stochastic policies that concentrate more probability on higher-reward actions while retaining spread across alternatives, balancing exploitation and exploration.
Tap to flip back
The problem of determining which actions in a long sequence were responsible for a delayed reward. When a reward arrives many steps after the actions that caused it, the agent must propagate credit backward through time to assign appropriate value to each decision. First formalised by Minsky in 1961, it remains a core challenge in RL.
Tap to flip back
The bonus must be potential-based: \(F(s, a, s') = \gamma \Phi(s') - \Phi(s)\) for some potential function \(\Phi: \mathcal{S} \to \mathbb{R}\). The discounted sum of this bonus telescopes along any trajectory to \(\gamma^T \Phi(s_T) - \Phi(s_0)\), which depends only on the start state (and the terminal potential, usually set to zero) and not on the actions taken, so the agent's ranking of policies under the shaped reward is identical to its ranking under the original reward.
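The telescoping can be checked numerically. The sketch below takes the potentials Φ(s_0), ..., Φ(s_T) along one trajectory (made-up numbers) and sums the discounted shaping bonus; the result should equal γ^T Φ(s_T) − Φ(s_0) whatever happens in between.

```python
def discounted_shaping_sum(potentials, gamma):
    """Sum of gamma^t * (gamma * Phi(s_{t+1}) - Phi(s_t)) along a trajectory."""
    total = 0.0
    for t in range(len(potentials) - 1):
        bonus = gamma * potentials[t + 1] - potentials[t]
        total += (gamma ** t) * bonus
    return total

# Two different paths with the same start potential (2.0) and terminal potential (0.0).
path_a = [2.0, 5.0, -1.0, 0.0]
path_b = [2.0, -3.0, 7.0, 0.0]
print(discounted_shaping_sum(path_a, 0.9), discounted_shaping_sum(path_b, 0.9))
```

Both paths collect the same total bonus, which is why shaping of this form cannot change which policy is best.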
Tap to flip back
TD(\(\lambda\)) maintains eligibility traces that keep recent states and actions "eligible" for credit updates. A trace decays by a factor \(\gamma\lambda\) per step, so a state visited \(k\) steps ago carries eligibility \((\gamma\lambda)^k\). At \(\lambda = 0\) you get one-step TD (minimal propagation); at \(\lambda = 1\) you recover full Monte-Carlo returns (credit reaches all the way back). Larger \(\lambda\) assigns credit further back but increases variance; smaller \(\lambda\) is lower variance but propagates credit more slowly.
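A minimal tabular sketch of TD(λ) with accumulating traces, assuming states are dictionary keys and a terminal transition is marked with `next_state=None`. The function name, step size, and three-state chain are illustrative.

```python
def td_lambda_episode(transitions, values, alpha=0.1, gamma=0.9, lam=0.8):
    """Update `values` in place from a list of (state, reward, next_state) steps."""
    traces = {s: 0.0 for s in values}
    for state, reward, next_state in transitions:
        # One-step TD error; a terminal next_state contributes zero value.
        delta = reward + gamma * values.get(next_state, 0.0) - values[state]
        traces[state] += 1.0
        for s in traces:
            values[s] += alpha * delta * traces[s]
            traces[s] *= gamma * lam  # decay every trace by gamma * lambda
    return values

episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, None)]
print(td_lambda_episode(episode, {"A": 0.0, "B": 0.0, "C": 0.0}))
```

With `lam=0.0` only the last state before the reward is updated; with larger `lam` the earlier states receive a geometrically smaller share.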
Tap to flip back
GAE constructs the advantage estimate as \(\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty}(\gamma\lambda)^l \delta_{t+l}\), a \(\lambda\)-weighted blend of multi-step TD errors. The baseline \(V(s_t)\) subtracted inside each \(\delta\) removes the expected return regardless of action, so only the action-specific contribution remains. This reduces gradient variance substantially while the \(\lambda\) parameter lets practitioners dial in the bias-variance trade-off.
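The sum is usually computed with a backward recursion, A_t = δ_t + γλ A_{t+1}. A minimal sketch for a single finite trajectory, assuming `values` holds one extra entry for the state after the last reward (a bootstrap value, or 0.0 at termination):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalised advantage estimates for one trajectory.

    `values` must have len(rewards) + 1 entries: V(s_0), ..., V(s_T).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=0.0))
```

Setting `lam=1.0` recovers the Monte-Carlo return minus the baseline; `lam=0.0` recovers the one-step TD error.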
Tap to flip back
The policy gradient theorem states \(\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot G_t]\). Any baseline \(b(s)\) that depends only on state can be subtracted without changing the expectation, because \(\mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = 0\). Subtracting \(V(s_t)\) gives the advantage \(A(s,a) = Q(s,a) - V(s)\), which directly answers "how much better was this specific action than average?" - a credit assignment signal centred on zero.
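The zero-expectation claim follows in one line, written here for a discrete action set (the continuous case replaces the sum with an integral):

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\big[\nabla_\theta \log \pi_\theta(a|s)\, b(s)\big]
= b(s) \sum_a \pi_\theta(a|s)\, \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}
= b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s)
= b(s)\, \nabla_\theta 1
= 0
```

The step that matters is pulling \(b(s)\) outside the sum, which is only valid because the baseline does not depend on the action.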
Tap to flip back
RUDDER (Arjona-Medina et al. 2018) trains a sequence model (e.g. an LSTM) to predict the cumulative return from a trajectory prefix. The contribution of each new observation to the prediction - the change in predicted return at each step - is used as a redistributed per-step reward. The method requires a sequence model accurate enough that its prediction updates correspond to genuine action contributions, not noise. Under an optimal redistribution the expected future reward is zero at every step, so each Q-value reduces to the expected immediate (redistributed) reward, which removes the reward delay and simplifies Q-value estimation.
Tap to flip back
- Potential misspecification: A potential function that points away from the true goal along some paths makes learning slower, not faster. Potential-based shaping preserves optimality but does not guarantee faster convergence when the heuristic is misleading.
- Stochastic long-horizon environments: When outcomes are partly due to chance, eligibility traces and return redistribution conflate skill with luck. Credit assigned to actions that happened to precede a lucky outcome introduces noise into the value estimates, destabilising training.
Tap to flip back
Large Language Models
3 concept(s)
On reflexive single-step tasks - "what is the capital of France", "is this sentence positive or negative", short classification. Forced reasoning adds latency, increases token cost, and can lead the model astray by inventing spurious justifications. Reserve CoT for multi-step problems where intermediate state genuinely matters (maths, logic, planning).
Tap to flip back
Sample multiple reasoning chains for the same prompt (5-40), then take the majority answer. Different chains stumble in different places but converge on the correct answer more often than any single chain. Accuracy lifts on hard reasoning benchmarks are typically 5-20 points - far larger than most prompt tweaks deliver, at the cost of N-times inference.
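The voting step is small enough to sketch directly. This assumes the final answer has already been extracted from each sampled chain as a string; the sampling itself and the function name `majority_answer` are outside what the card specifies.

```python
from collections import Counter

def majority_answer(final_answers):
    """Return the most common final answer and the share of chains that agree."""
    counts = Counter(a.strip().lower() for a in final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)

# Five hypothetical chains, three of which land on the same answer.
print(majority_answer(["42", "41", "42 ", "42", "17"]))
```

The agreement share is a cheap confidence signal: a low value suggests the chains disagree and the answer deserves a second look.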
Tap to flip back
Not quite. Frontier models (Claude, GPT-4 class, Gemini) trained on reasoning traces produce CoT implicitly, and some expose explicit "thinking" tokens (Claude extended thinking, OpenAI o-series). Explicit CoT instructions still help on edge tasks, weaker open-source models, and when you want to inspect the reasoning. The shift is that you no longer need "Let's think step by step" to unlock the capability - you choose whether to surface it.
Tap to flip back
- Autoregressive (next-token): hides the next token, reads left context only (causal). GPT family.
- Masked LM (MLM): hides ~15% of tokens at random, reads both directions. BERT.
- Span corruption: hides contiguous spans replaced by sentinels, encoder reads both sides and a decoder regenerates them. T5.
What an objective hides decides what signal the model extracts and what it can later do.
Tap to flip back
The training task and the inference task are identical: you train by predicting the next token and you generate by predicting the next token, so there is no train-test gap. MLM trains on a 15%-masking corruption pattern that never occurs at generation time, so making a fluent generator out of it means fighting its own objective. Autoregression also forces broad competence (syntax, facts, arithmetic) because all of it reduces next-token uncertainty.
Tap to flip back
No. A freshly pretrained base model is a next-token predictor; asked a question it may just continue with more questions. SFT plus preference optimisation (RLHF/DPO) turn it into an assistant, but they only surface and direct capability the pretraining objective already instilled. They redistribute probability mass; they do not teach a skill the pretraining loss never rewarded.
Tap to flip back
The model trains on ground-truth prefixes but generates on its own (possibly wrong) prefixes, so an early mistake shifts the model off the distribution it was trained on and errors can compound across a long generation. It is one reason decoding strategy matters and why long generations drift. The objective optimises one-step-ahead likelihood, not multi-step rollout quality.
Tap to flip back
When the inference task is not open-ended generation. Bidirectional MLM encoders produce the strongest embeddings for retrieval and classification; encoder-decoder span-corruption models (T5) stay competitive where there is a clear input-to-output mapping (translation, summarisation). Match the objective to what the model must do at inference, not to whichever sounds more powerful.
Tap to flip back
The loss does not care what it compresses. The same next-token objective on web sludge versus curated tokens produces very different models, but nothing in the objective flags the difference; the corpus does the quiet work. This is why deduplication, filtering, and curation are first-class levers, not afterthoughts.
Tap to flip back
LLMs have a fixed training cutoff and no access to private documents. RAG solves this by retrieving relevant information at query time and injecting it into the prompt, making the model a reasoning layer over externally supplied data.
Tap to flip back
1. Index - chunk documents and store their embeddings in a vector database.
2. Retrieve - embed the user query and find the top-k most similar chunks.
3. Generate - place retrieved chunks in the prompt and instruct the model to answer from that context only.
Tap to flip back
Chunk size directly affects retrieval quality. Chunks that are too small lose surrounding context, while chunks that are too large introduce irrelevant text that dilutes the signal. There is no universal optimal size - it must be determined by evaluation on your specific data.
Tap to flip back
Embedding model quality matters more. A high-quality embedding model running on a simple store like SQLite will outperform a mediocre embedding model on a purpose-built vector database like Pinecone.
Tap to flip back
Models exhibit recency bias, placing disproportionate weight on content near the end of long contexts. To counteract this, place the most important retrieved chunks last in the prompt.
Tap to flip back
Production RAG commonly adds query rewriting (to handle conversational follow-ups), hybrid search combining vector and BM25 retrieval, cross-encoder re-ranking (reducing top-50 chunks to top-5), and per-source citation. Each addition should be justified by measurable improvement in evaluations.
Tap to flip back
- Ingestion - split docs into 200-800 token chunks with 50-100 token overlap.
- Embedding - encode each chunk and store vectors in an index.
- Retrieval - embed the query, return top-k nearest chunks.
- Reranking - rescore top-k with a cross-encoder (optional but high-leverage).
- Generation - inject chunks into the prompt and ask the model to answer.
Reranking is the single biggest quality win most teams leave on the table by skipping it.
Tap to flip back
LLMs attend most strongly to the start and end of long contexts and skim the middle. If your top-1 retrieved chunk sits in the middle of ten chunks, the model often ignores it. Reorder retrieved chunks so the most relevant sit at the prompt edges (highest at the end is a strong default for decoder-only models). Cheap to implement, measurable accuracy lift.
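One possible reordering, sketched under the assumption that chunks arrive sorted most-relevant first: alternate them between the back and the front of the prompt so the best chunk lands last and the weakest end up in the middle. The function name is illustrative.

```python
def reorder_for_edges(chunks_by_relevance):
    """Place the best chunk last, the second best first, and the weakest in the middle."""
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        # Even ranks (0, 2, 4, ...) fill the end of the prompt, odd ranks the start.
        (back if rank % 2 == 0 else front).append(chunk)
    return front + back[::-1]

print(reorder_for_edges(["c1", "c2", "c3", "c4", "c5"]))
```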
Tap to flip back
Pure vector search misses exact-keyword matches (product codes, error strings, proper nouns). Pure BM25 misses semantic paraphrases. Combine the two ranked lists with Reciprocal Rank Fusion (RRF) and you get the best of both with almost no extra latency. This is now a default rather than an optimisation.
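A minimal sketch of Reciprocal Rank Fusion over any number of ranked ID lists. The constant `k=60` is the value commonly used in practice and is an assumption here, as are the document IDs.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each list is ordered best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

vector_hits = ["a", "b", "c"]  # hypothetical vector-search ranking
bm25_hits = ["c", "a", "d"]    # hypothetical BM25 ranking
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale.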
Tap to flip back
- Fixed-token chunks that slice through sentences mid-clause.
- No overlap, so context that straddles boundaries vanishes.
- Chunks too small (lose context) or too large (dilute the cosine-similarity signal).
Default to recursive splitting on paragraph -> sentence -> token boundaries, 400 tokens with 50-100 overlap. Tune from there.
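The overlap mechanic on its own can be sketched as a sliding window over a pre-tokenised list. This deliberately leaves out the recursive paragraph and sentence splitting, and the function name and defaults are illustrative.

```python
def chunk_tokens(tokens, size=400, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

print(chunk_tokens(list(range(10)), size=4, overlap=1))
```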
Tap to flip back
This is hallucinated citation: the model produces text in the shape of a quote but invents the content. Mitigations:
- Instruct it to quote verbatim from chunks and mark the chunk ID.
- Post-hoc verify quoted spans actually appear in the retrieved chunks.
- For high-stakes outputs, reject answers whose quotes fail verification.
Never trust a citation just because it is formatted like one.
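A rough sketch of the post-hoc check, assuming quotes appear in straight double quotes and chunks are held in a dict keyed by chunk ID. Matching is exact after whitespace and case normalisation; fuzzier matching would be a separate design choice.

```python
import re

def normalise(text):
    """Collapse whitespace and lowercase so trivial formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()

def verify_quotes(answer, chunks):
    """Return (quote, chunk_id) pairs; chunk_id is None when no chunk contains the quote."""
    results = []
    for quote in re.findall(r'"([^"]+)"', answer):
        needle = normalise(quote)
        source = next((cid for cid, text in chunks.items() if needle in normalise(text)), None)
        results.append((quote, source))
    return results

chunks = {"c1": "Refunds are issued within 14 days of purchase.", "c2": "Shipping is free."}
answer = 'The policy says "refunds are issued within 14 days" and "returns are unlimited".'
print(verify_quotes(answer, chunks))
```

Any pair whose source is `None` is a candidate hallucinated citation and can be rejected or flagged.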
Tap to flip back
NLP Foundations
6 concept(s)
As d_k grows, the variance of Q . K^T grows linearly, pushing some logits into regions where softmax saturates and gradients vanish. Dividing by sqrt(d_k) rescales the dot products to unit variance, keeping the softmax in its responsive range. Without it, training a deep transformer becomes numerically fragile and convergence stalls.
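The linear growth can be derived directly, under the usual simplifying assumption that the components of q and k are independent with zero mean and unit variance:

```latex
\operatorname{Var}(q \cdot k)
= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i)
= \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2]
= d_k
\quad\Longrightarrow\quad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
```

Real activations only approximately satisfy the assumption, so the scaled logits are close to, rather than exactly at, unit variance.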
Tap to flip back
Self-attention: Q, K, V all derived from the same sequence (used inside encoder blocks and inside decoder blocks). Cross-attention: Q from one sequence, K and V from another (the bridge from encoder to decoder in encoder-decoder models). The math is identical; only the input plumbing differs.
Tap to flip back
It is O(n^2) in compute and, when the score matrix is materialised, in memory. Doubling context quadruples the attention matrix, so by 100k tokens attention dominates the forward pass, while the KV cache (which grows linearly with context) strains GPU memory at inference. FlashAttention tiles the computation so the full n x n matrix is never stored (memory becomes linear, compute stays quadratic); Longformer/BigBird sparsify; Mamba and RWKV drop attention for recurrent state. Pick the variant that matches your context-length and recall requirements.
Tap to flip back
Each head learns a specialised type of dependency - one tracks syntactic agreement, another resolves coreference, another acts as a positional nearest-neighbour. Running them in parallel lets the layer compose several relations at once, and stacking layers composes them into hierarchical structure. The insight: depth gives you composition, heads give you parallel specialisation.
Tap to flip back
- Multi-head self-attention.
- Residual connection + layer norm wrapping the attention.
- Feed-forward MLP (typically 4x model dimension wide).
- Residual connection + layer norm wrapping the MLP.
GPT-2 (small) stacks 12 blocks and GPT-3 (175B) stacks 96; layer counts for GPT-4-class models have not been officially published. The block is the unit of scale.
Tap to flip back
- Encoder-only (BERT): bidirectional attention, no generation. Best for classification, NER, embeddings.
- Decoder-only (GPT family): causal mask, autoregressive generation. The default for chat, completion, and most LLMs today.
- Encoder-decoder (T5): encoder reads input fully, decoder generates conditioned on it. Still strong for translation and summarisation where input and output are clearly separated.
If in doubt for an LLM-style product, pick decoder-only.
Tap to flip back
Rotary position embeddings encode position by rotating query and key vectors in 2D subspaces, so relative position falls out of the dot product naturally. Unlike a learned absolute table it has no hard length limit, and with frequency rescaling (position interpolation, YaRN) it extends to sequences longer than the training context; it also composes cleanly with FlashAttention. Llama, Mistral, Qwen and PaLM all use it; learned absolute embeddings are mostly legacy at this point.
Tap to flip back
Position is injected explicitly, either by adding sinusoidal vectors to the embeddings (original paper), learning a positional embedding table (BERT, GPT-2), rotating Q/K vectors (RoPE - Llama, Mistral), or adding a linear bias to attention logits (ALiBi - Bloom). Strip positional info entirely and the model collapses to a bag-of-words.
Tap to flip back
Starting from a byte alphabet, BPE iteratively merges the most frequent adjacent pair of symbols until the vocabulary hits a target size (typically 30k-200k), so common fragments become single tokens and rare strings decompose into smaller pieces.
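A toy version of the merge loop, for illustration only. It starts from characters rather than bytes, works on a whitespace-split word list, and breaks ties by first occurrence; all of these are simplifying assumptions, not how a production tokeniser is built.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    corpus = Counter(tuple(w) for w in words)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(corpus))
```

After two merges the frequent word "low" is a single symbol, while the rarer suffixes are still spelled out character by character.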
Tap to flip back
BPE merges are learned from the training corpus, which is dominated by English. Under-represented scripts and languages never get long merges, so each word fragments into many small tokens (sometimes one per byte of UTF-8). That inflates input cost, output cost, and effectively shrinks the context window. It is the quietest tax in the API bill.
Tap to flip back
Dump the tokenisation of the numbers. Older GPT tokenisers split numbers as multi-digit chunks in unpredictable ways ("1234" might become "12" + "34", "123" + "4", or four single digits depending on context). The model is not bad at arithmetic - it is solving a different arithmetic problem on each input. Many "weird model behaviour" bugs hide in unexpected token splits.
Tap to flip back
Both implement subword tokenisation. SentencePiece (Llama, T5) treats whitespace as a regular symbol so it works language-agnostically without pre-tokenisation, and it supports both BPE and unigram models (T5 uses unigram). tiktoken (OpenAI's GPT models) is a byte-level BPE implementation with a Rust core that pre-splits text with a regular expression before merging. The vocabulary differs; the subword skeleton is the same.
Tap to flip back
Cosine ignores vector magnitude and compares direction only, so it is robust to differences in sentence length or model calibration. Range is a clean -1 to 1. When embeddings are L2-normalised (almost always), cosine equals the dot product, so indexes such as HNSW or pgvector can use the cheaper inner-product operation. On normalised vectors Euclidean distance produces the same ranking as cosine; on unnormalised vectors it mixes magnitude into the score, which is why it is the less common choice for text embeddings.
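The equivalence between cosine and the dot product of normalised vectors is easy to confirm on a small example. The helper names and the two vectors are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalise(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
print(cosine(a, b))                           # direction-only similarity
print(dot(l2_normalise(a), l2_normalise(b)))  # same value after normalising
print(cosine([30.0, 40.0], b))                # rescaling a vector leaves cosine unchanged
```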
Tap to flip back
Around 100k vectors. Below that, brute force is a one-liner and fast enough. Above that, latency creeps over 100ms and you reach for HNSW (pgvector, Qdrant), IVF (FAISS), or LSH. The cost is a small recall hit - tune index parameters to trade recall against speed.
Tap to flip back
- Domain-adapt the model. Fine-tune a sentence-transformer on your own query/passage pairs, or pick a model already trained for the domain (legal-BERT, BioBERT, FinBERT).
- Use asymmetric retrieval. Different encoder for queries vs documents (E5-asym, BGE-large with query/passage prompts), because short queries and long docs occupy different regions of vector space.
The pitfall is assuming all-MiniLM-L6-v2 works everywhere; it was trained on generic web text.
Tap to flip back
Take a transformer encoder, pool its hidden states (mean-pool or [CLS]) into a single vector, then train with a contrastive objective: pull semantically similar pairs together and push dissimilar pairs apart. Sentence-Transformers, E5, BGE and Voyage all follow this recipe. The contrastive loss is what turns "language model" into "similarity geometry".
Tap to flip back
Raw self-attention is permutation-equivariant: it treats the input as a set, so permuting the tokens just permutes the output and "the dog bit the man" is indistinguishable from "the man bit the dog". Order is not in the attention math; it has to be added, either by encoding position into the embeddings (sinusoidal, learned, RoPE) or by biasing the scores (ALiBi). Everything a model knows about word order, it knows because position was injected somewhere.
Tap to flip back
RoPE rotates query and key vectors by an angle proportional to position. In the dot product the absolute angles cancel and only the relative distance (m - n)*theta survives, so the attention score depends on how far apart two tokens are, not where they sit absolutely. Relative distance is the quantity that transfers past the training length. A learned absolute table, by contrast, simply has no row for a position it never trained on.
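The cancellation can be seen with a single 2D pair. The sketch below rotates one query and one key by their positions and compares scores for the same offset at different absolute positions; `theta=0.1` and the vectors are arbitrary illustrative values.

```python
import math

def rotate(vec, position, theta):
    """Rotate a 2D vector by the angle position * theta."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, m, n, theta=0.1):
    """Dot product of q rotated to position m and k rotated to position n."""
    qm, kn = rotate(q, m, theta), rotate(k, n, theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k = (1.0, 0.5), (0.3, -0.7)
print(rope_score(q, k, 5, 2))      # offset m - n = 3
print(rope_score(q, k, 105, 102))  # same offset, far later in the sequence
print(rope_score(q, k, 5, 3))      # different offset
```

The first two scores agree up to floating-point error; the third differs because the offset changed.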
Tap to flip back
ALiBi adds no embeddings. It subtracts a penalty from each pre-softmax attention score, linear in query-key distance and scaled by a fixed per-head slope, so each head attends to a soft recency-biased window. Because the bias is purely a function of distance with no length-bound parameters, it extrapolates past training length ("train short, test long"). The trade-off: the monotonic recency penalty makes it hard to attend sharply to a single distant token, which hurts retrieval-style long-context tasks.
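A small sketch of the bias matrix for one head with a causal mask, plus the geometric slope schedule described in the ALiBi paper for power-of-two head counts (for 8 heads: 1/2, 1/4, ..., 1/256). Function names are illustrative.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., assuming num_heads is a power of two."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * (i - j) for visible keys j <= i, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

print(alibi_slopes(8))
for row in alibi_bias(3, slope=0.5):
    print(row)
```

The matrix is added to the pre-softmax scores, so keys further from the query start with a larger handicap.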
Tap to flip back
For length generalisation in small decoder-only models, no positional encoding at all (NoPE) matched or beat ALiBi and RoPE. It works because the causal mask already breaks permutation symmetry: token i can attend only to j <= i, so the model can recover position by counting visible tokens, and it learns relative-style attention without being told. The takeaway is not to delete positional encodings (frontier models keep RoPE for sharper control at scale) but that position is partly emergent from causality.
Tap to flip back
Push a RoPE model past its training length and the low-frequency dimension pairs, which never completed a full rotation during training, reach angles never seen, so attention scores drift out of distribution. Position interpolation rescales long positions back into the trained range, at the cost of blurring the fast local frequencies; YaRN refines this by interpolating slow long-range frequencies while leaving fast local ones nearly untouched, recovering most quality with roughly an order of magnitude less fine-tuning. The relative-distance property is exactly what makes this remapping coherent.
Tap to flip back
At each step it emits a probability distribution over the whole vocabulary. A separate decoder turns that distribution into an actual token. Swap the decoder and the same frozen weights swing from dull looping text to vivid prose to confident nonsense. Many "the model is bad at X" complaints are really decoder-settings complaints, fixable by tuning two numbers rather than re-running a fine-tune.
Tap to flip back
Beam search maximises total sequence probability. On open-ended tasks the highest-probability sequence is bland, repetitive, and degenerate, because high likelihood and high quality diverge for creative text. Beam shines where there is one mostly-correct answer (translation), but for story or chat generation it reads like a hostage note. There you want sampling, not maximisation.
Tap to flip back
Temperature T divides the logits before softmax: T < 1 sharpens toward top tokens, T > 1 flattens, T -> 0 is greedy. Used alone it is blunt because raising T for diversity also inflates the long tail of genuinely bad tokens, so you get creativity and incoherence together. That is why temperature is almost always paired with a truncation step (top-p or min-p) that cuts the tail first.
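Temperature scaling is a one-line change to softmax. A minimal sketch, with arbitrary example logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t))
```

Lower temperatures move mass toward the top token; higher temperatures move it into the tail, which is the part a truncation step then removes.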
Tap to flip back
- top-k: the k highest-probability tokens. Fixed size, so it is mis-sized when the distribution is very peaked or very flat.
- top-p (nucleus): the smallest set whose cumulative probability exceeds p. The shortlist adapts to model confidence.
- min-p: every token with probability at least min_p * p_max. The floor scales with the top token, staying coherent even at high temperature.
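The three truncation rules can be sketched over a token-to-probability dict. Each function returns the renormalised shortlist; the helper names and example distribution are illustrative, and the top-p cut-off uses "reaches or exceeds p".

```python
def renormalise(probs):
    total = sum(probs.values())
    return {t: q / total for t, q in probs.items()}

def top_k(probs, k):
    """Keep the k most probable tokens."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    return renormalise({t: probs[t] for t in keep})

def top_p(probs, p):
    """Keep the smallest prefix of the sorted tokens whose mass reaches p."""
    kept, cumulative = {}, 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept[token] = probs[token]
        cumulative += probs[token]
        if cumulative >= p:
            break
    return renormalise(kept)

def min_p(probs, min_p_value):
    """Keep tokens whose probability is at least min_p_value times the top probability."""
    floor = min_p_value * max(probs.values())
    return renormalise({t: q for t, q in probs.items() if q >= floor})

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(top_k(probs, 2), top_p(probs, 0.9), min_p(probs, 0.5))
```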
Tap to flip back
Sampling noise compounds across a long generation, and one wrong token early can derail an entire chain of thought or break code. Reasoning-tuned models are typically decoded near-greedy (low temperature) because here diversity buys nothing and costs correctness. The deliberate exception is self-consistency, which re-introduces sampling to draw several diverse chains and majority-vote the answer.
Tap to flip back
Architectures & Scaling
34 concept(s)
Query represents what a token is looking for, Key represents what a token matches against, and Value is what a token contributes when matched. Together they enable each token to selectively gather information from other tokens based on relevance.
Tap to flip back
The formula is: attention(i) = softmax(Q_i · K^T / sqrt(d)) · V. Dividing by sqrt(d) prevents the dot products from growing too large as dimensionality increases, keeping gradients well-scaled during training.
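For a single query the formula fits in a few lines of plain Python. This sketch uses lists instead of tensors and omits batching, masking, and multiple heads; the example vectors are arbitrary.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query; returns (output, weights)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # stable softmax over the keys
    total = sum(exps)
    weights = [e / total for e in exps]
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return output, weights

out, w = attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 2.0]])
print(out, w)
```

With two identical keys the weights are equal, so the output is simply the mean of the two value vectors.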
Tap to flip back
RNNs process tokens sequentially, making long-range dependencies hard to maintain over many steps. Attention allows every token to directly attend to every other token in one operation, regardless of distance in the sequence.
Tap to flip back
Attention is O(n²) in sequence length because every token must attend to every other token. A 100k-token context requires roughly 10 billion attention scores, making long-context computation a significant engineering challenge.
Tap to flip back
Multi-head attention runs several attention operations in parallel, each potentially learning a different type of relationship such as syntax, coreference, or positional adjacency. A single head can only capture one relational pattern at a time, whereas multiple heads allow richer, diverse representations.
Tap to flip back
Softmax converts the raw dot-product scores between a query and all keys into a probability distribution that sums to 1. This ensures the weighted combination of Value vectors is a proper weighted average, highlighting the most relevant tokens.
Tap to flip back
Each token produces a Query (what am I looking for?), a Key (what do I match against?), and a Value (what do I contribute if matched?). These three vectors enable tokens to compare themselves against all other tokens and extract weighted information.
Tap to flip back
The formula is: attention(i) = softmax(Q_i * K^T / sqrt(d)) * V. The dot products measure relevance between the query and all keys, softmax converts those scores into a probability distribution, and the result is a weighted sum of values.
Tap to flip back
Dividing by sqrt(d) - where d is the dimension of the key/query vectors - prevents the dot products from growing too large as dimensionality increases. Without this scaling, softmax outputs become extremely peaked and gradients vanish during training.
Tap to flip back
Attention lets every token directly attend to every other token in a single operation, regardless of distance. RNNs process tokens sequentially, so information from distant tokens must survive many steps of compression, causing it to degrade.
Tap to flip back
Self-attention is O(n^2) in sequence length because every token attends to every other token. For a 100k-token context this means roughly 10 billion attention scores, making long-context inference extremely expensive in memory and compute.
Tap to flip back
Human annotation of large fine-tuning datasets is prohibitively expensive (tens of cents to tens of dollars per example for expert tasks). A capable generative model can produce the same volume of labelled examples orders of magnitude more cheaply, collapsing the annotation bottleneck. The value proposition holds as long as the generator's errors are manageable.
Tap to flip back
Instruction synthesis generates an instruction-response pair and accepts it as-is. Rejection sampling generates multiple candidate responses for each prompt and discards those that fail a verifier (unit test, ground-truth answer check, classifier). The result is a training set composed only of correct or high-quality outputs, even when the base model's per-sample pass rate is low. This is especially useful for reasoning tasks where correctness can be verified mechanically.
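The filter-by-verifier loop can be sketched with a stand-in generator. Everything below is hypothetical scaffolding: `noisy_adder` plays the role of a model that is sometimes wrong, and `exact_sum_verifier` plays the role of a mechanical checker.

```python
import random

def rejection_sample(prompt, generate, verify, num_candidates=8):
    """Draw several candidates for one prompt and keep only those the verifier accepts."""
    candidates = [generate(prompt) for _ in range(num_candidates)]
    return [c for c in candidates if verify(prompt, c)]

def noisy_adder(prompt, rng):
    """Toy generator: adds the two numbers in 'a+b' but is off by one about half the time."""
    a, b = (int(x) for x in prompt.split("+"))
    return str(a + b + rng.choice([0, 0, 1, -1]))

def exact_sum_verifier(prompt, candidate):
    """Toy verifier: accepts only the exact sum."""
    return candidate == str(sum(int(x) for x in prompt.split("+")))

rng = random.Random(0)
accepted = rejection_sample("2+3", lambda p: noisy_adder(p, rng), exact_sum_verifier)
print(accepted)
```

Only verified outputs survive, so the accepted set is clean even though the generator's per-sample pass rate is well below 100%.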
Tap to flip back
Stage 1 (SL-CAI): the model generates a response, then critiques and revises it according to a written list of principles (the "constitution"). The revised response becomes a supervised fine-tuning target. Stage 2 (RL-CAI / RLAIF): the fine-tuned model samples pairs of responses, and a feedback model judges which one better follows a constitutional principle; these AI-labelled comparisons train a preference model. Policy training then uses this AI-generated preference signal rather than human annotations. Humans only author the constitution once; all subsequent harmlessness labelling is done by the model itself.
Tap to flip back
Model collapse (also called Model Autophagy Disorder) is the progressive loss of output quality or diversity when a model is trained iteratively on its own generated data without sufficient injection of real data. Each generation's errors and distributional biases are amplified in the next. The result is a model concentrated on high-probability outputs, losing the ability to produce rare but correct responses. The key cause is the absence of a grounding signal from real data to counteract drift.
Tap to flip back
phi-1 was trained on "textbook quality" code data - a heavily filtered subset of web code plus GPT-3.5-generated textbooks and exercises (roughly 1B synthetic tokens) - rather than raw web-scraped code. Textbook-style data is informationally denser: it explains concepts with build-up, worked examples, and commentary, mirroring how code is taught rather than how it appears in production. This suggests that data quality and pedagogical structure can substitute for scale when the task is well-defined.
Tap to flip back
Goodhart's Law states that when a metric is used as a training target, it ceases to be a good measure of the underlying goal. In RLAIF, if the policy model and the critique model share architecture or training lineage, the policy can learn to produce outputs that score well on the critique without genuinely improving in harmlessness or helpfulness. The critique score diverges from actual quality. This is why constitutional AI requires careful design of the critique model and regular re-evaluation against human preference ground truth.
Tap to flip back
1. No correct samples exist: if the base model never generates a correct answer for a hard problem class (e.g., a proof the model lacks the knowledge for), rejection sampling produces no accepted examples for that class - the filter keeps nothing.
2. No reliable verifier: if correctness cannot be determined mechanically (e.g., open-ended creative writing, nuanced ethical reasoning), there is no filter to apply, and all candidates are accepted indiscriminately, defeating the purpose.
Tap to flip back
- Seed pool - start with a small set of human-written tasks (175 in the original paper).
- Instruction generation - the model samples a few tasks from the pool and writes a new instruction.
- Instance generation - the model produces input-output pairs for each new instruction.
- Filtering - near-duplicates and degenerate samples are removed (ROUGE-L similarity threshold of 0.7); survivors re-enter the pool.
The loop then repeats, expanding the pool without additional human annotation.
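The near-duplicate check in the filtering step can be sketched with a word-level ROUGE-L. This version uses the F-measure over whitespace tokens, which is an assumption about the exact variant; the 0.7 threshold is the one quoted above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """ROUGE-L F-measure over lowercase whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def is_novel(candidate, pool, threshold=0.7):
    """Accept an instruction only if it stays below the threshold against every pool entry."""
    return all(rouge_l(candidate, existing) < threshold for existing in pool)

pool = ["Write a poem about the sea"]
print(is_novel("Write a poem about the ocean", pool))         # near-duplicate, rejected
print(is_novel("Translate this sentence into French", pool))  # new task, accepted
```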
Tap to flip back
The ROUGE-L diversity filter catches redundancy but has no mechanism to verify factual accuracy. The model generates outputs that are plausible in form but may be wrong in substance, and those wrong-but-well-formatted answers pass the filter. The model's existing capability ceiling determines the quality ceiling; it cannot reliably produce correct outputs for tasks already beyond its competence.
Tap to flip back
In the original Self-Instruct, the same model serves as both generator and student (true self-improvement). In Alpaca, a stronger teacher model (text-davinci-003) generates the 52,000 demonstrations while a weaker student (LLaMA-7B) is fine-tuned on them. That is knowledge distillation, not self-improvement. The capability ceiling is the teacher's, not the student's, but this also introduces licence/IP complications and means errors are subtler and harder to filter.
Tap to flip back
- Capability ceiling - the student can only learn up to the teacher's competence level.
- Licence/legal exposure - generating training data from a proprietary API to train a competing model may violate terms of service (a concern the Alpaca authors flagged explicitly).
- Subtler error propagation - a stronger teacher produces well-formatted but factually wrong outputs that pass mechanical filters more easily than crude errors would.
Tap to flip back
A model trained on Self-Instruct data may learn the stylistic surface features of instruction responses (bullet-point formatting, stock openers like "Sure, here is...") without learning the underlying task semantics. This produces models that sound confident and well-structured but give incorrect or shallow answers. It is the mechanism behind much of the sycophantic, over-hedged tone observed in early chat-tuned models.
Tap to flip back
ROUGE-L filtering removes near-duplicate strings, but it does not force the model to venture into low-confidence task types. Because the generator samples from its own distribution, tasks the model handles confidently are generated frequently; tasks requiring rare or specialised knowledge are rarely generated at all. Over many iterations the pool gravitates toward a subset the model is already good at, starving the fine-tuned model of training signal for hard or specialised tasks.
Tap to flip back
- Self-Instruct (Wang et al., 2022): fine-tuning GPT-3 with the pipeline produced a 33 percentage-point absolute improvement over the base model on Super-NaturalInstructions, approaching InstructGPT-001 performance.
- Alpaca (Stanford, 2023): in blind human comparisons against text-davinci-003, the 7B Alpaca model won 90 comparisons to 89, suggesting rough parity despite being far smaller and cheaper to produce (data generation cost under $500).
Tap to flip back
Seed instruction datasets (e.g., Alpaca) concentrate at low complexity - simple recall and formatting tasks. Evol-Instruct uses an LLM to iteratively rewrite seed instructions into progressively harder variants, producing a training dataset whose difficulty distribution spans simple to highly constrained multi-step tasks. Models fine-tuned on this spread generalise far better to complex real-world prompts.
Tap to flip back
- Add constraints - introduces extra conditions the answer must satisfy.
- Deepening - shifts the instruction toward a more specialised sub-problem.
- Concretising - replaces vague terms with specific technical vocabulary.
- Increased reasoning - demands multi-step inference rather than direct recall.
- Complicate input - makes the input artefact itself more intricate (longer code, nested data).
These are applied stochastically; each evolved instruction is produced by one randomly sampled operator.
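A minimal sketch of the stochastic operator step, assuming uniform sampling over the five depth operators. The directive strings and the `build_evolution_prompt` helper are hypothetical illustrations, not the prompts used in the original Evol-Instruct work.

```python
import random

# Hypothetical one-line rewriting directives, one per depth operator.
DEPTH_OPERATORS = {
    "add_constraints": "Add one extra condition the answer must satisfy.",
    "deepening": "Shift the instruction toward a more specialised sub-problem.",
    "concretising": "Replace vague terms with specific technical vocabulary.",
    "increased_reasoning": "Require multi-step inference instead of direct recall.",
    "complicate_input": "Make the input artefact itself more intricate.",
}

def build_evolution_prompt(instruction, rng):
    """Sample one operator uniformly and build the rewriting prompt for it."""
    name = rng.choice(sorted(DEPTH_OPERATORS))
    prompt = (
        f"{DEPTH_OPERATORS[name]}\n\n"
        f"Original instruction:\n{instruction}\n\n"
        "Rewritten instruction:"
    )
    return name, prompt

rng = random.Random(0)  # seeded so the sketch is reproducible
operator, prompt = build_evolution_prompt("Sort a list of integers.", rng)
print(operator)
print(prompt)
```

The returned prompt would be sent to the rewriting LLM; the evolved instruction it produces then goes through the elimination filter.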
Tap to flip back
Depth evolution makes an existing instruction harder while keeping the same topic. Breadth evolution generates a brand-new instruction on a related but distinct topic, inspired by but different from the original seed. It expands subject-matter coverage rather than difficulty within a fixed subject. Both types of evolved instructions are included in the final training mix.
Tap to flip back
The filter (implemented as a second LLM judge call) discards outputs that:
- Repeat the original instruction with only cosmetic changes.
- Contain meta-text from the rewriting prompt (e.g., "I will now make this harder by...").
- Are incoherent or unanswerable due to conflicting constraints.
- Are empty strings (often from refused instructions).
Without filtering, corrupted examples enter training directly. The filter is also where most of the pipeline's inference cost concentrates, since every candidate needs a separate judgment call.
Tap to flip back
WizardCoder adapted the depth operators for code: evolving instructions to require higher algorithmic complexity, edge-case handling, or resource constraints. Starting from Code Alpaca, it achieved strong HumanEval pass@1 scores (ICLR 2024).
WizardMath added a reinforcement learning stage (RLEIF): evolved instructions are used for SFT and as RL prompts scored by a process reward model. This produced a 7B model competitive with much larger models on GSM8k and MATH (ICLR 2025 Oral).
Both show that Evol-Instruct is domain-agnostic when operators are adapted to domain-specific notions of difficulty.
Tap to flip back
The evolved instructions carry the stylistic fingerprint, knowledge gaps, and refusal patterns of the LLM used to generate them (GPT-4 in the original WizardLM work). A student model fine-tuned on this data learns to mimic those biases, not just to follow complex instructions. This is a form of distribution shift: the training signal conflates "harder instruction following" with "behaving like the generator model," which can impair diversity and introduce subtle systematic errors.
Tap to flip back
The only quality signal is the elimination filter, which is itself an LLM judge. LLM judges carry systematic biases (e.g., preferring verbose or superficially complex text) that propagate silently into the training set. There is no external oracle - unlike mathematical reasoning tasks where a symbolic checker can verify correctness. This means quality collapse at high evolution depth is hard to detect, and cascading errors across multiple rounds can produce chains of plausibly-formatted but subtly broken training examples that pass the filter.
Tap to flip back
Hard labels (one-hot) encode only the correct class and discard every other class relationship. Soft labels carry the teacher's full output distribution, encoding which "wrong" answers the teacher considers plausible and how similar different classes are. This richer gradient signal lets the student learn inter-class structure, not just which label is correct.
Why it matters: the teacher's soft distribution is itself compressed knowledge about the input space; training against it is far more data-efficient than training against human annotations alone.
Tap to flip back
Temperature T divides logits before the softmax: p_i = exp(z_i / T) / sum(exp(z_j / T)).
- T = 1: standard softmax; probability mass is concentrated near the argmax, making near-zero entries nearly invisible.
- T > 1: distribution "softens"; probability mass spreads over more tokens, amplifying the gradient signal from low-probability but informative entries.
Both teacher and student use the same T during distillation. After training, T is reset to 1 for inference. Common values are T = 2-10 for classification; in LLM distillation T is often kept low (1-2) because the vocabulary is already large.
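The effect of T is easy to see numerically. The sketch below implements the formula above in plain Python; the logits are made-up values chosen only to show how a higher temperature moves probability mass onto the low-probability entries.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [6.0, 2.0, 1.0]  # hypothetical teacher logits for three classes
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=4.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

At T = 4 the top class keeps the largest share, but the two runner-up classes receive visibly more mass than at T = 1, which is the signal the student is meant to learn from.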
Tap to flip back
- Output-token distillation - student trains on decoded text from the teacher. Cheapest: only final responses needed. Example: Alpaca (Stanford, 2023).
- Soft-distribution distillation - student minimises KL divergence against the teacher's per-token vocabulary distribution. More expensive: requires teacher logits at every position, stored or streamed.
- Reasoning-trace distillation - teacher generates chain-of-thought explanations; student learns the problem-decomposition strategy, not just the answer. Most expensive to generate, but transfers the deepest capability. Example: Orca (Microsoft, 2023).
Cost increases from the first item to the third; depth of transfer increases in the same order.
Tap to flip back
Imitation models trained purely on teacher output text learn to mimic the teacher's discourse style far more readily than its factual depth or reasoning ability. Human raters consistently rated these models as competitive with ChatGPT; automated benchmarks on out-of-distribution factual and reasoning tasks revealed a persistent capability gap.
The "false promise" is that surface fluency and polished phrasing create the illusion of transferred capability. The student absorbs how the teacher writes, not what the teacher knows. This motivates supplementing output-token distillation with richer signals (logits, reasoning traces) and evaluating on tasks outside the distillation distribution.
Tap to flip back
Model collapse (or "going MAD" per Alemohammad et al., 2023) occurs when models are trained repeatedly on outputs from previous model generations without injecting fresh real-world data. Each generation fits a compressed version of the previous one's distribution, systematically losing the long-tail variance present in real data. Precision (output quality) may initially hold while recall (diversity) degrades; eventually both collapse.
Root cause: generative models oversample the high-probability modes of their own output distribution. Iterative training amplifies this bias, narrowing the effective support of the learned distribution.
Mitigation: continually mix in real human-generated data alongside synthetic data; never train solely on self-generated outputs across multiple rounds.
Tap to flip back
Standard instruction fine-tuning teaches the student to produce correct final answers. Orca trained on GPT-4's step-by-step reasoning traces, making the intermediate problem-decomposition strategy observable to the student. The student learned:
- How to break complex problems into sub-problems.
- When to apply which reasoning pattern (analogy, deduction, retrieval, etc.).
- Self-correction patterns the teacher exhibited mid-trace.
A 65B model trained on output tokens alone had been taught "what to say"; a 13B model trained on explanation traces was taught "how to think about it." The additional supervision signal from reasoning steps more than compensated for the 5x parameter gap on structured reasoning benchmarks.
Tap to flip back
Rejection sampling generates k candidate responses per prompt from the teacher, then filters to keep only those passing a quality criterion (unit tests for code, exact-match on maths answers, a judge model score for open-ended tasks). Only the accepted responses enter the student's training set.
Why apply it: even a highly capable teacher produces errors, especially on long reasoning chains. Training the student on those errors teaches it to confidently reproduce the teacher's failure modes. Filtering by verifiable correctness before training substantially improves the signal-to-noise ratio in the distillation set, yielding a better student than training on unfiltered teacher outputs.
The accepted responses also tend to be the teacher's most coherent outputs, giving the student a higher-quality distribution to fit.
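The generate-then-filter loop can be sketched in a few lines. Everything below is a toy stand-in: `toy_teacher` is a hypothetical "teacher" that adds two numbers and is sometimes off by one, and `exact_match` plays the role of the quality criterion.

```python
import random

def rejection_sample(prompts, generate, verify, k=8):
    """Draw k candidates per prompt and keep only those that pass `verify`."""
    accepted = []
    for prompt in prompts:
        for _ in range(k):
            candidate = generate(prompt)
            if verify(prompt, candidate):
                accepted.append((prompt, candidate))
    return accepted

_rng = random.Random(0)  # seeded so the toy teacher is reproducible

def toy_teacher(prompt):
    """Hypothetical teacher: returns a + b, occasionally off by one."""
    a, b = prompt
    return a + b + _rng.choice([0, 0, 0, 1])

def exact_match(prompt, candidate):
    """Verifier: accept only the exactly correct sum."""
    a, b = prompt
    return candidate == a + b

training_set = rejection_sample([(2, 3), (10, 5)], toy_teacher, exact_match, k=8)
print(len(training_set), "of 16 candidates accepted")
```

In a real pipeline `generate` would call the teacher model and `verify` would run unit tests, compare against a gold answer, or threshold a judge score.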
Tap to flip back
Rejection sampling discards any teacher-generated trace whose final answer does not match the gold label, keeping only traces that led to a correct result. This ensures the student model trains on reasoning paths that are at least outcome-correct, preventing erroneous intermediate steps from being directly supervised.
Why it matters: without this filter, the student is directly supervised on wrong traces, which can leave it worse off than plain answer-only fine-tuning.
Tap to flip back
A small model trained only on (question, answer) pairs must infer the latent reasoning strategy from the answer signal alone - a very sparse gradient that the model may lack the capacity to resolve into reliable intermediate steps. Providing the full trace as a training target converts this discovery problem into a standard sequence prediction task: the model learns to reproduce the explicit steps, and correct reasoning behaviour emerges as a consequence.
Tap to flip back
During fine-tuning, the student always receives teacher-quality trace prefixes (clean, correct intermediate steps). At inference the student must condition on its own, lower-quality trace prefixes. If the student's trace degrades midway through a problem, it has never trained to recover from its own errors as context, so final-answer quality drops sharply. This is a specific instance of the more general exposure bias in autoregressive training.
Tap to flip back
Basic RFT keeps traces that produced correct answers and discards the rest - the trace is only a means to verify correctness. Orca explicitly prompts GPT-4 to generate explanation traces and step-by-step thought processes as first-class training targets, regardless of whether the answer could have been verified by other means. The student learns from three signal layers simultaneously: the explanation, the reasoning process, and the final answer. This richer supervision is why a 13B Orca model can surpass much larger models on complex benchmarks.
Tap to flip back
When a model generates its own fine-tuning data and retrains on it repeatedly, low-probability but valid reasoning strategies are increasingly unlikely to appear in the sampled traces. Each round narrows the distribution toward the model's dominant, high-confidence paths. Over several iterations the model becomes confidently wrong on problems that require the rarer strategies it has stopped generating. This is the reasoning-specific form of mode collapse in synthetic data loops.
Tap to flip back
Fine-tuning on a large quantity of domain-specific traces (e.g., maths reasoning) shifts the model's weights toward the representations and output patterns that are rewarded in that domain. Parameters previously used for other tasks are repurposed. Fu et al. (2023) measured this directly: models distilled on maths reasoning traces improved on GSM8K but degraded on language-understanding benchmarks. The trade-off is acceptable for narrow deployment targets but costly for general-purpose assistants.
Tap to flip back
The student's ceiling is the teacher's reliability on that task. Rejection sampling only keeps traces that led to correct answers, so if the teacher solves a class of problems at 60% accuracy, the fine-tuning set for that class is sparse and potentially unrepresentative. On tasks where even frontier teachers fail frequently, the distillation pipeline produces too little or too low-quality data for meaningful transfer. The student cannot learn a reasoning strategy the teacher does not reliably demonstrate.
Tap to flip back
Sample k completions per prompt from the current model, filter them through a verifier (ground-truth checker or reward model), and fine-tune on the completions that pass. The model's own outputs replace human-written training labels. Yuan et al. (2023) reported LLaMA-7B moving from 35.9% (plain SFT) to 49.3% on GSM8K once accepted samples pooled from several models were added; sampling k=100 from the 7B model alone gave a smaller gain.
Tap to flip back
When the model fails a question, STaR prompts it: "Given that the answer is X, construct a reasoning trace that leads to X." The resulting rationale is added to the training set even though the model could not reach X unprompted. Without rationalisation, only already-solvable problems accumulate training data, permanently excluding hard problems from improving the model.
Tap to flip back
Full REINFORCE updates on both correct and incorrect samples: ∇J = E[r(y) · ∇ log π(y|x)], where J is the expected reward being maximised. RFT approximates this by using binary reward and discarding failed samples rather than penalising them. It gains simplicity (no critic, no KL penalty machinery) but loses: (1) explicit negative signal from near-misses, and (2) policy drift correction - the distribution can shift substantially across iterations without a reference-model KL guard.
Tap to flip back
If k=100 samples per prompt all follow the same reasoning path, the model receives 100 copies of nearly identical data - it is like training on one example 100 times. The fine-tuning signal comes from learning different routes to a correct answer. Yuan et al. found that pooling rejection samples across multiple model checkpoints or temperatures outperformed collecting more samples from a single model.
Tap to flip back
If the model's probability of generating a correct solution is near zero, sampling any finite k yields no passing completions, so there is no data to train on. The loop stalls. Mitigations: (1) start from a stronger base model, (2) use a teacher/larger model to seed the initial correct solutions (distillation first, then self-improvement), or (3) decompose hard problems into simpler sub-tasks the model can already solve.
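A quick calculation shows how sharply the yield drops. Assuming the k samples are independent and each is correct with probability p (a simplification), the chance that at least one passes is 1 - (1 - p)^k:

```python
def prob_at_least_one_pass(p, k):
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k

for p in (0.2, 0.01, 0.0001):
    print(f"p={p}: P(at least one pass in 100 samples) = "
          f"{prob_at_least_one_pass(p, 100):.4f}")
```

With k = 100, a per-sample success rate of 1% still yields a passing sample for roughly 63% of prompts, while a rate of 0.01% yields one for about 1% of prompts, which is the stalled regime described above.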
Tap to flip back
After many RFT rounds, the model may learn to produce outputs that satisfy the verifier's surface-level check (e.g., the exact answer token format) without genuine reasoning improvement. The verifier's pass/fail signal, originally a proxy for correctness, becomes the optimisation target itself. This is Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." It is most severe when the verifier is a learned reward model rather than a ground-truth checker.
Tap to flip back
DeepSeek-R1 used a rejection sampling stage to construct cold-start reasoning traces before the main reinforcement learning phase. Long reasoning chains were sampled from an intermediate model, filtered for correctness, and used to fine-tune a base model - providing the RL stage with a better starting distribution than a raw pretrained model. This "RFT as RL warm-up" pattern is now common in reasoning model pipelines.
Tap to flip back
Generator produces candidate outputs. Verifier scores or critiques each candidate. Filter discards low-quality ones and passes the rest back as fine-tuning data for the generator.
The loop compresses capability into the weights without proportionally increasing human labelling effort.
Tap to flip back
Sample k solutions per problem, execute or symbolically verify each one, discard incorrect solutions, and fine-tune on the correct remainder.
RFT is tractable when a reliable oracle exists (e.g. a maths checker, unit tests). Without an oracle you must rely on the model judging itself, which reintroduces self-consistency limitations.
Tap to flip back
SL phase: The model critiques and revises its own responses according to natural-language constitutional principles. The final revision becomes supervised training data - no human rater needed.
RLAIF phase: The model ranks response pairs against the constitution. Those AI-generated preference labels train a reward model used for PPO, replacing the human preference labels used in standard RLHF.
Tap to flip back
Reward hacking / overoptimisation. The generator finds outputs that score well on the proxy reward without being genuinely better. The proxy's blind spots are exploited, and the gap between proxy score and true quality widens as optimisation pressure increases.
This is Goodhart's Law in a statistical setting: once a measure becomes a target, it ceases to be a good measure.
Tap to flip back
Each filtering pass discards low-scoring outputs, narrowing the distribution of retained examples. Over many iterations, the model is fine-tuned on an increasingly homogeneous set, losing stylistic range and failing on unusual but valid inputs.
Concretely: if the reward signal correlates with response length, the model drifts toward verbosity regardless of quality.
Tap to flip back
A student model trained on teacher outputs can mimic the surface form of expert responses - correct tone, formatting, instruction-following - without inheriting the teacher's underlying reasoning capability.
Benchmarks measuring surface behaviour look good; benchmarks requiring deep reasoning or factual accuracy do not. The student has learned to "sound like" the teacher, not to reason like one.
Tap to flip back
A model critiquing its own outputs is bounded by its own blind spots. If the model consistently misunderstands a class of queries, its critiques of responses to those queries will be wrong in the same direction - and fine-tuning on the revised responses entrenches rather than corrects the misunderstanding.
Improvement is only possible where the model already has the capability to distinguish better from worse responses. Constitutional AI partially mitigates this by externalising normative judgement to written principles, but interpretation of those principles is still performed by the same model.
Tap to flip back
Without persona conditioning, repeated sampling from an LLM produces instructions clustered around the model's training-data prior: English-speaking, tech-literate, Western demographics. Persona prompting inserts a brief identity description into the prompt so each generation call conditions on a different user type, spreading outputs across a wider slice of the real human distribution.
Why it matters: a model fine-tuned on undiverse synthetic data inherits the same coverage gaps and performs poorly for underrepresented user populations.
Tap to flip back
Text-to-Persona: an LLM is prompted "Who is likely to read or write this text?" on each web document, yielding fine-grained personas grounded in real-world content.
Persona-to-Persona: given a known persona, the model is asked "Who is in a close relationship with this person?" and expanded across six iteration steps to capture demographics with low web footprint (children, rural workers, support staff).
Together, these two methods produced roughly one billion deduplicated personas in the paper.
Tap to flip back
Without persona conditioning, the k candidates generated per prompt are all drawn from a similar distribution region; rejection sampling then selects the best point in one dense cluster. With persona conditioning, each candidate explores a different persona-relevant region of the space. Rejection sampling now picks the highest-quality representative of each region, yielding a more diverse accepted set.
Tap to flip back
If a model fine-tuned on persona-conditioned synthetic data is then used to generate the next round of data, the output distribution drifts. Alemohammad et al. (2023) showed that iterative training without fresh real data causes progressive recall (diversity) degradation even if precision (quality) holds briefly -- a condition they call Model Autophagy Disorder. Persona prompting slows but does not prevent this collapse; injecting real data at each generation round is necessary.
Tap to flip back
Mid-level specificity -- roughly two to four attributes -- works best.
- Too coarse ("a student"): adds no distributional signal; the LLM ignores it.
- Too fine ("a 34-year-old postdoc in computational protein folding with a Bayesian background"): the model has no reliable basis for this identity and fabricates domain details, reducing factual quality.
The sweet spot gives the model enough context to genuinely shift register and topic focus without hallucinating implausible specifics.
Tap to flip back
- Stereotype amplification: the LLM's persona-to-output mapping reflects its trained associations. A culturally marked persona (e.g., "a Nigerian entrepreneur") can produce reductive, stereotyped content rather than genuine representational diversity.
- WEIRD coverage illusion: if the source corpus for persona extraction skews toward Western, Educated, Industrialised, Rich, Democratic populations, the resulting persona pool does too -- regardless of its size. One billion personas drawn from an English-centric web still under-represents large parts of the world.
Tap to flip back
Persona conditioning adds noise rather than signal when the correct output does not depend on who is asking. A mathematical proof is either valid or it is not; injecting "a primary school teacher" versus "a PhD mathematician" as the persona does not change what constitutes a correct proof, but it does risk lowering the quality of generated solutions. Apply persona prompting selectively, only to tasks where vocabulary, difficulty level, framing, or implicit assumptions meaningfully vary with the user population.
Tap to flip back
They replaced raw web-scraped training text with roughly one billion tokens of synthetic "textbook-quality" Python exercises generated via GPT-3.5, combined with six billion tokens of carefully filtered real code. The key insight is that data quality (pedagogical clarity, worked examples, concept density) can substitute for data quantity and model scale.
Tap to flip back
Starting from ~175 hand-written seed tasks, a frozen model generates new instruction-input-output triples by sampling a few seeds as in-context examples and asking the model to produce a novel task. A ROUGE-based filter discards any new task with overlap > 0.7 against the existing pool, and format violations are removed. The generator only needs to be better than random; diversity is enforced by the deduplication step rather than requiring the model to be reliably creative.
Tap to flip back
The intermediate reasoning trace encodes the process used to reach the answer: decomposition steps, constraint checks, backtracking signals. Final answers discard all of this. A student trained on traces learns a reasoning strategy; a student trained on answers learns only input-output mappings. Orca showed that a 13B model trained on GPT-4 step-by-step explanations substantially outperformed one trained on GPT-4 final answers across complex reasoning benchmarks.
Tap to flip back
p̃(y|x) ∝ p_θ(y|x) · 1[r(x, y) ≥ τ]
- p_θ(y|x): the generator's probability of output y given input x.
- r(x, y): a scorer (reward model or rule-based verifier) evaluating quality.
- τ: the acceptance threshold.
- 1[...]: the indicator function that zeroes out sub-threshold outputs.
Fine-tuning on samples from p̃ shifts the model toward the accepted region. The distribution is only as useful as the scorer; a flawed scorer systematically biases the student toward whatever the scorer rewards, not what is actually correct.
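For a small discrete output space the filtered distribution can be computed exactly. The sketch below uses made-up probabilities and scores for three candidate outputs; `filtered_distribution` is an illustrative helper, not part of any library.

```python
def filtered_distribution(probs, scores, tau):
    """Zero out outputs scoring below tau, then renormalise the rest."""
    kept = {y: p for y, p in probs.items() if scores[y] >= tau}
    mass = sum(kept.values())
    if mass == 0:
        raise ValueError("no output reaches the acceptance threshold")
    return {y: p / mass for y, p in kept.items()}

p_theta = {"A": 0.5, "B": 0.3, "C": 0.2}  # hypothetical generator probabilities
r = {"A": 0.2, "B": 0.9, "C": 0.7}        # hypothetical scorer outputs
p_tilde = filtered_distribution(p_theta, r, tau=0.5)
print(p_tilde)
```

Output A is the generator's favourite but scores below the threshold, so all of its mass is redistributed to B and C in proportion to their original probabilities. If the scorer is wrong about A, the student never sees it.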
Tap to flip back
Supervised phase: A red-team prompt elicits a harmful response from the model; the model critiques that response against a written constitution and revises it iteratively, and the fine-tuning data is (original prompt, final revised response). Human input: writing the constitution once; no per-example labelling.
RL phase: The model generates response pairs; a separate "AI feedback" model scores which is preferable per the constitution; this trains a reward model used for PPO. Human input: again only the upfront constitution; no human preference labels on individual pairs.
The scaling advantage is that per-label human annotation is replaced by one-time principle specification.
Tap to flip back
MAD describes the progressive degradation of quality or diversity when a generative model is repeatedly trained on its own outputs without sufficient fresh real data injected each generation. Each synthetic generation amplifies the current model's biases; rare modes are dropped; the distribution narrows. The trigger is an autophagous (self-consuming) loop where the proportion of real, diverse data falls below what is needed to counteract distributional drift. The fix is to maintain a continuous injection of verified real-world data across training generations.
Tap to flip back
- Reward hacking. The scorer is a proxy for quality, not quality itself. A code verifier accepting unit-test passes rewards hard-coded solutions; a verbosity-biased reward model rewards padded answers.
- Hallucination laundering. A teacher confident but wrong in its chain-of-thought is undetected when the verifier cannot check factual correctness. The student learns to reproduce confident errors.
- Coverage gaps. The generator is conditioned on its own world model. Rare distributions (minority languages, niche domains) are systematically under-generated; the student fails on exactly the cases synthetic pipelines cannot easily manufacture.
Tap to flip back
Code can be executed. A unit test either passes or fails, providing a cheap, hard binary signal that requires no learned reward model. Natural language has no equivalent oracle, so filtering must rely on approximations (reward models, self-critique) that carry their own errors. Execution-based filtering is the foundation of every high-quality code synthetic data pipeline.
Tap to flip back
OSS-Instruct seeds each instruction-generation prompt with a random snippet of real open-source code, then asks the teacher model to construct a related programming task. Because the seed is drawn from the actual diversity of public repositories, the resulting task distribution covers rare library APIs, domain-specific idioms, and language patterns that purely model-generated seeds miss. MagicoderS-CL-7B, trained on 75,000 OSS-Instruct examples plus additional Evol-Instruct data, outperformed ChatGPT on HumanEval+.
Tap to flip back
Phi-1 combined two data sources: web/StackOverflow content filtered for "textbook quality" (clear explanations, good comments) and GPT-3.5-generated synthetic exercises with worked solutions. The synthetic exercises were explicitly pedagogical, including step-by-step reasoning and edge-case coverage. Signal density per token was high enough to compensate for a corpus orders of magnitude smaller than typical code pretraining data. The result was 50.6% pass@1 on HumanEval at 1.3B parameters.
Tap to flip back
When a model generates synthetic training data, fine-tunes on it, then generates the next round, each iteration narrows the output distribution. Shumailov et al. (2023) showed this causes "irreversible defects": rare but valid outputs disappear from the model's repertoire. For code, uncommon idioms, less popular languages, and unusual algorithms gradually vanish. The primary mitigation is to ground each generation round in real, human-authored code (e.g., open-source snippets or problem statements), preventing the loop from closing entirely on itself.
Tap to flip back
- Depth rewriting - adds constraints, error-handling requirements, or edge cases to an existing instruction, increasing problem difficulty within the same concept.
- Breadth rewriting - generates a conceptually related but distinct task, expanding the variety of problems covered.
Together they let a small seed set produce a wide spectrum of instruction-code pairs without human authoring, though repeated evolution from a narrow seed can cause the corpus to homogenise over many steps.
Tap to flip back
Unit tests verify input-output behaviour, not code quality. A solution can pass all tests while being unreadable, algorithmically inefficient (e.g., O(n^2) where O(n log n) is trivial), using deprecated or insecure API calls, or hard-coding special cases that happen to satisfy the test suite but fail on out-of-distribution inputs. Execution-based filtering cannot penalise these properties; a secondary filter (static analysis, a reward model trained on human code-review labels, or mutation testing) is required for higher-quality data.
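A toy illustration of the hard-coding failure, using a made-up five-case test suite for a primality check. Both functions pass the suite, so an execution-based filter would accept both; only one is correct beyond the tested inputs.

```python
def is_prime_hardcoded(n):
    """Passes the suite below by memorising the small primes it checks."""
    return n in {2, 3, 5, 7}

def is_prime(n):
    """Trial division; correct for all non-negative integers."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True

# Hypothetical test suite used as the acceptance filter.
TESTS = [(2, True), (3, True), (4, False), (7, True), (9, False)]

def passes(fn, tests):
    """Return True if fn gives the expected result on every test case."""
    return all(fn(x) == expected for x, expected in tests)

print(passes(is_prime_hardcoded, TESTS), passes(is_prime, TESTS))
print(is_prime_hardcoded(11), is_prime(11))
```

The two functions disagree on 11, an input outside the suite. Widening the suite or adding mutation testing is what exposes the hard-coded variant.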
Tap to flip back
Verifier leakage occurs when the test suite or seed problems used during rejection sampling overlap with the evaluation benchmark (e.g., HumanEval). Because HumanEval problems are widely known, a teacher model asked to generate "similar coding problems" will often produce near-duplicates. The trained model then scores high not from genuine generalisation but from prior exposure to the benchmark's problems. Evaluating on benchmarks with stricter, expanded test suites (EvalPlus) or continually refreshed problems (LiveCodeBench) typically reveals the true, lower capability level.
Tap to flip back
RFT generates many candidate solutions to each problem, executes or checks the final answer programmatically, keeps only the correct ones, and fine-tunes the model on the filtered set. Mathematics is suited to this because final answers (numeric values, expressions) can be verified automatically without human review - the verifier acts as a free, scalable quality filter. Yuan et al. used this to lift LLaMA-7B from 35.9% to 49.3% on GSM8K by pooling correct solutions from multiple model checkpoints.
Tap to flip back
MetaMath bootstraps new training questions (rather than only new solutions) through four augmentations:
- Rephrasing - paraphrase the surface form while preserving mathematical content.
- Self-Verification - restate the question together with its answer as a declarative statement, then ask for a masked value from the original question.
- FOBAR - reverse the problem: give the answer, ask for an unknown input.
- Answer augmentation - sample additional correct reasoning paths for the original question.
Applied to GSM8K and MATH training splits, this produced MetaMathQA; fine-tuning LLaMA-2-7B on it yielded 66.4% on GSM8K, an 11.5 percentage-point gain over same-size baselines. The intuition mirrors image augmentation: seeing many structural views of the same concept builds a more robust representation.
Tap to flip back
Outcome supervision gives a single correctness signal for the final answer. A PRM labels each individual reasoning step, allowing training to distinguish solutions that reached the right answer via flawed intermediate steps from genuinely correct chains.
Math-Shepherd automated step-level labelling without human annotation: for each candidate step, the model rolls out multiple completions forward to the end; the step is marked correct if any completion reaches the right answer. This is a noisy but scalable proxy. Lightman et al.'s PRM800K (800k human step annotations) established that process supervision significantly outperforms outcome supervision on the MATH benchmark.
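The rollout-based labelling rule can be sketched as follows. This is a simplified illustration of the "correct if any completion reaches the right answer" rule described above; `toy_rollout` is a hypothetical stand-in for sampling a completion from a language model.

```python
def label_steps(steps, rollout, gold_answer, n_rollouts=4):
    """Mark step i correct if any rollout from steps[:i] reaches gold_answer."""
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        finals = [rollout(prefix) for _ in range(n_rollouts)]
        labels.append(any(answer == gold_answer for answer in finals))
    return labels

def toy_rollout(prefix):
    """Hypothetical completer: it carries forward any arithmetic slip
    already present in the prefix, otherwise it finishes correctly."""
    return 18 if "12 + 5 = 18" in prefix else 17

steps = ["3 * 4 = 12", "12 + 5 = 18", "answer: 18"]
print(label_steps(steps, toy_rollout, gold_answer=17))
```

The first step is labelled correct because completions from it can still reach 17; the second and third are labelled incorrect because every completion inherits the slip. With a real, stochastic model the labels are noisy, since a lucky rollout can recover from a bad step.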
Tap to flip back
Distribution collapse occurs when a model is repeatedly fine-tuned on its own generated outputs across multiple rounds. Each fine-tuning pass narrows the distribution slightly; over iterations, the model converges to a stereotyped solution style - verbose, formulaic, and brittle on novel problem structures. The diversity of the training signal contracts because no external grounding (human labels, verified novel problems, a stronger teacher) pushes against the narrowing. This is directly analogous to mode collapse in GANs or quality degradation in recursive text self-distillation.
Tap to flip back
Rejection sampling cannot synthesise solutions that the generator model cannot produce even once across a large sample. Hard problem types - such as combinatorial proofs or multivariable calculus - remain underrepresented because the model's hit rate for them is near zero, so the filter produces almost no training signal for those cases.
Teacher-student distillation partially circumvents this: a stronger teacher (e.g., GPT-4-class) generates high-quality reasoning traces for problem types the student model struggles with; the student is then fine-tuned on those traces. The ceiling is now set by the teacher, not the student - though the teacher's own blind spots propagate downstream.
Tap to flip back
When a model is trained with a verifier (e.g., eval() of a numeric expression) as the sole reward signal, it learns to produce outputs whose last line satisfies the verifier rather than outputs that reflect correct reasoning. On problems with small integer answer spaces, a random or heuristic guess matches the correct value with non-negligible probability - for example, "the answer is 7" on a problem whose answer happens to be 7. The model's benchmark score rises, but its generalisation to novel problem distributions does not. The symptom is a sharp drop in performance when the test set is drawn from genuinely unseen competition problems rather than paraphrases of the training distribution.
Tap to flip back
DeepSeekMath (Shao et al., 2024) first continued pre-training on roughly 120 billion tokens of math-adjacent web text (filtered from Common Crawl using a math-domain classifier). This phase builds broad arithmetic and symbolic reasoning priors. Instruction fine-tuning on synthetic problem-solution pairs was applied second. The final stage used Group Relative Policy Optimisation (GRPO): a group of solutions is sampled for each problem, scored by a reward model, and each solution's advantage is its reward normalised by the group's mean and standard deviation rather than estimated by a learned value function. Dropping the separate critic model cuts memory and compute cost compared to standard PPO. The combined pipeline took a 7B model to 51.7% on the competition-level MATH benchmark.
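The group-relative advantage can be sketched in a few lines. This is a minimal illustration, assuming one scalar reward per sampled solution; the function name and the `eps` guard are illustrative choices, not taken from the paper's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against its own sampling group.

    rewards: scalar rewards for the solutions sampled for ONE problem.
    eps is an assumed guard against division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled solutions, two judged correct (reward 1) and two wrong (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct solutions receive a positive advantage and incorrect ones a negative advantage, without any learned baseline. When every solution in the group gets the same reward, all advantages are zero and the problem contributes no gradient.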
Tap to flip back
Duplication: near-identical pairs that add no signal and inflate loss toward a narrow surface.
Teacher-ceiling contamination: the student inherits the teacher's systematic hallucinations or errors.
Format collapse: over-production of one response style (e.g., bullet lists) because that style scored well for the teacher.
Why it matters: training on all raw synthetic output indiscriminately makes models worse, not better.
Tap to flip back
Self-Instruct computes ROUGE-L overlap between each newly generated instruction and all instructions already in the retained pool. If ROUGE-L exceeds roughly 0.7, the new instruction is discarded as a near-duplicate.
ROUGE-L measures the length of the longest common subsequence between two strings, normalised by their lengths. It captures word-order similarity without requiring exact contiguous matches.
Limitation: ROUGE-L cannot detect semantically identical instructions expressed in different words.
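A minimal sketch of this kind of filter, assuming whitespace tokenisation and the F-measure form of ROUGE-L. The helper names are illustrative, and a production pipeline would more likely use an existing ROUGE implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure over lowercased whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_new_instructions(pool, candidates, threshold=0.7):
    """Accept a candidate only if it stays below the threshold against every kept instruction."""
    kept = list(pool)
    accepted = []
    for cand in candidates:
        if all(rouge_l_f1(cand, existing) < threshold for existing in kept):
            kept.append(cand)
            accepted.append(cand)
    return accepted


pool = ["Write a poem about the sea"]
new = ["Write a poem about the ocean", "Translate this sentence into French"]
accepted = filter_new_instructions(pool, new)
```

Here the first candidate shares five of six tokens in order with the pooled instruction (F-measure about 0.83) and is dropped; the second shares none and is kept.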
Tap to flip back
IFD(x, y) = PPL_student(y | x) / PPL_student(y)
A high ratio means conditioning on the instruction x does little to make the response y easier to predict: the student cannot yet use the instruction to produce this response, so training on the pair yields large, informative gradient updates. Ratios above 1, where the instruction makes the response harder to predict than it is unconditionally, usually signal a mismatched pair rather than a useful hard example.
Low-IFD pairs (where the model already "knows" the answer) are filtered out to concentrate compute on the most informative examples.
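As a rough sketch, the ratio can be computed from per-token log-probabilities of the response, scored once with the instruction in context and once without. The inputs here are assumed to come from whatever scoring API the student model exposes; some implementations use the ratio of mean losses rather than perplexities.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities of the response tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ifd_score(cond_logprobs, uncond_logprobs):
    """IFD(x, y) = PPL(y | x) / PPL(y).

    cond_logprobs:   log p(y_t | x, y_<t) for each response token
    uncond_logprobs: log p(y_t | y_<t) for each response token
    """
    return perplexity(cond_logprobs) / perplexity(uncond_logprobs)


# Toy numbers: the instruction halves the average per-token loss.
score = ifd_score([-0.5, -0.5], [-1.0, -1.0])
```

A score well below 1 means the instruction already makes the response easy to predict, so the pair carries little new signal.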
Tap to flip back
Magpie generated 4 million synthetic instruction pairs from Llama-3-Instruct, then applied LLM-as-judge quality scoring and filtered down to 300K examples.
Models fine-tuned on the filtered 300K outperformed models trained on the full unfiltered 4 million on benchmarks including AlpacaEval and ArenaHard.
Takeaway: a curated 7.5% subset beats the full raw corpus - signal density matters far more than raw volume.
Tap to flip back
When a reward model (RM) selects training examples and the student is then trained on those examples, the student learns to produce outputs that score well on the RM - not necessarily outputs that are task-correct.
In an iterative loop (generate, filter with RM, train, repeat), each generation's student is slightly better at gaming the RM, so the RM's scores drift further from actual task quality. Small misalignments between the RM and the true objective compound multiplicatively across rounds, eventually producing a model that is confidently fluent but systematically wrong.
This is the synthetic-data analogue of reward hacking in RL.
Tap to flip back
MinHash LSH approximates Jaccard similarity between sets of character n-grams.
Steps:
1. Tokenise each document into character 5-grams (or word shingles).
2. Hash every shingle with K independent hash functions; for each function keep the minimum hash value over the set, giving a K-dimensional MinHash signature.
3. Use LSH banding: split signatures into bands and hash each band into a bucket. Documents sharing a bucket in any band are candidate duplicates.
4. Compute exact Jaccard for candidates; for each pair above threshold (e.g., 0.8), keep one document and discard the other.
Lee et al. (2022) used this to remove a 61-word sentence repeated over 60,000 times in C4, reducing model memorisation by roughly 10x.
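The four steps can be sketched with the standard library alone. This is a small illustration, not a scalable implementation: the salted BLAKE2b hashes stand in for "K independent hash functions", and the defaults (64 hashes, 16 bands) are assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, n=5):
    """Set of character n-grams after whitespace and case normalisation."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_signature(shingle_set, num_perm=64):
    """One minimum per salted hash function gives a num_perm-dimensional signature."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big")
            for s in shingle_set))
    return signature


def candidate_pairs(signatures, bands=16):
    """LSH banding: documents sharing any identical band land in the same bucket."""
    rows = len(signatures[0]) // bands
    buckets = defaultdict(set)
    for doc_id, sig in enumerate(signatures):
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, num_perm=64, bands=16):
    """Return index pairs whose exact Jaccard similarity meets the threshold."""
    sets = [shingles(d) for d in docs]
    sigs = [minhash_signature(s, num_perm) for s in sets]
    return sorted(p for p in candidate_pairs(sigs, bands)
                  if jaccard(sets[p[0]], sets[p[1]]) >= threshold)


docs = [
    "the quick brown fox jumps over the lazy dog",
    "The quick  brown fox jumps over the lazy dog",
    "completely unrelated text about training language models",
]
pairs = near_duplicates(docs)
```

The exact Jaccard check in the last step removes false positives from the banding stage; false negatives (true duplicates that never share a bucket) are controlled by the choice of bands and rows.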
Tap to flip back
Classifiers for "educational value" or "quality" are trained on human ratings drawn predominantly from standard-dialect, formal web text. They learn to associate minority dialects, informal registers, and non-Western writing conventions with lower quality - even when those texts are factually correct and linguistically coherent.
Applied to synthetic data filtering, such classifiers systematically down-score any outputs in those styles. The fine-tuned model then under-represents those registers, producing a model that is subtly biased toward the dominant training dialect.
Dodge et al. (2021) documented this effect in the C4 corpus: blocklist-based filtering disproportionately removed text from and about minority individuals, revealing that "quality" labels encode demographic assumptions.
Tap to flip back
The Vendi Score quantifies diversity in a set of samples as the exponential of the Shannon entropy of the eigenvalue spectrum of a pairwise kernel matrix.
Given a similarity function \(k\) and \(n\) samples, form the normalised kernel matrix \(K\) (entries \(K_{ij} = k(x_i, x_j)/n\)). Let \(\lambda_1, \ldots, \lambda_n\) be its eigenvalues. Then:
\[\text{VS}(S) = \exp\!\left(-\sum_i \lambda_i \log \lambda_i\right)\]
Score = 1 means all samples are identical; score = \(n\) means all are maximally dissimilar. It requires no reference distribution, which is why it is practical for synthetic corpora.
Why it matters: reference-free metrics are essential when you have no ground-truth distribution to compare against.
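A minimal NumPy sketch of the formula above, assuming a cosine-similarity kernel over embedding vectors (so \(k(x, x) = 1\)) and treating \(0 \log 0\) as 0. The function name and the eigenvalue cutoff are illustrative.

```python
import numpy as np


def vendi_score(embeddings):
    """Exponential of the Shannon entropy of the eigenvalues of K / n."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit rows -> cosine kernel
    K = (X @ X.T) / len(X)                            # normalised kernel, trace = 1
    eigenvalues = np.linalg.eigvalsh(K)
    eigenvalues = eigenvalues[eigenvalues > 1e-12]    # drop zeros: 0 * log 0 := 0
    return float(np.exp(-np.sum(eigenvalues * np.log(eigenvalues))))


identical = vendi_score([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])  # close to 1
orthogonal = vendi_score(np.eye(3))                            # close to 3
```

The two toy inputs hit the extremes described on the card: three copies of one vector behave like a single sample, and three mutually orthogonal vectors behave like three fully distinct samples.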
Tap to flip back
MAD (Model Autophagy Disorder; Alemohammad et al., 2023) is the progressive loss of quality and diversity that occurs when a generative model is trained on synthetic outputs from the previous generation, without injecting fresh real data.
At each generation, the model cannot perfectly cover the tails of its training distribution. Its synthetic outputs have slightly narrowed support. Training the next model on those outputs narrows the tails further. After several iterations, the model effectively only represents the mode of the original distribution.
The paper demonstrated this analytically and empirically across multiple architectures: "without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality or diversity progressively decrease."
Practical implication: never run a synthetic pipeline that replaces real data with each generation's synthetic outputs - always accumulate.
Tap to flip back
Gerstgrasser et al. (2024) proved, in an analytically tractable linear-regression setting, that if each training generation accumulates all prior real and synthetic data (rather than replacing old data with new synthetic data), the test error has a finite upper bound independent of the number of iterations.
Replacement converges toward collapse because the training distribution drifts away from the original real distribution. Accumulation prevents drift by keeping the real data anchored in every training round.
The result was confirmed empirically across language models, diffusion models, and VAEs, which suggests it is broadly applicable and not architecture-specific.
Tap to flip back
The three axes are:
- Lexical: vocabulary and surface patterns (n-gram variety, phrase reuse)
- Semantic: topic and intent distribution (what the instructions are actually about)
- Structural: response length, format, and complexity
In practice, structural collapse (all outputs converging to the same template length and format) is often the first visible symptom, but semantic collapse is the most damaging because it causes the trained model to generalise poorly to underrepresented topics even if individual outputs look superficially varied.
Tap to flip back
Both are collapse signals:
- Rising mean pairwise cosine similarity means the outputs are converging in semantic space - the generator is producing similar instructions regardless of surface variation.
- Falling effective rank of the embedding matrix means the synthetic corpus spans fewer independent dimensions of meaning. A 10,000-sample corpus whose embeddings span only 40 effective dimensions (vs. an expected 150+) has implicitly collapsed to a low-dimensional manifold.
Monitor both across batches. A 2-sigma deviation from a human-authored baseline is a soft alert; 3 sigma is a hard stop.
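Both signals are cheap to compute from an embedding matrix. The sketch below assumes one row per sample and defines effective rank as the exponential of the entropy of the normalised singular values, which is one common definition; other variants exist.

```python
import numpy as np


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of rows."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = X @ X.T
    n = len(X)
    return float((S.sum() - np.trace(S)) / (n * (n - 1)))


def effective_rank(embeddings):
    """exp(entropy) of the singular-value distribution of the embedding matrix."""
    s = np.linalg.svd(np.asarray(embeddings, dtype=float), compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]
    return float(np.exp(-np.sum(p * np.log(p))))


spread = np.eye(4)            # four orthogonal directions
collapsed = np.ones((4, 3))   # four identical rows
stats = {
    "spread": (mean_pairwise_cosine(spread), effective_rank(spread)),
    "collapsed": (mean_pairwise_cosine(collapsed), effective_rank(collapsed)),
}
```

The spread corpus has mean cosine 0 and effective rank 4; the collapsed one has mean cosine 1 and effective rank 1. Real corpora sit between these extremes, and the trend across batches is what matters.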
Tap to flip back
Distinct-N counts unique n-grams as a fraction of total n-grams. It has two blind spots:
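- Length bias: the ratio depends on output length, not just on variety. Longer outputs repeat more n-grams, which mechanically lowers the score, so corpora with different length profiles cannot be compared directly even if their semantic range is the same.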
- Surface vs. semantics: a model that varies its opening phrases but repeats the same reasoning patterns will score well on Distinct-N despite having collapsed semantically.
The recommended complement is embedding-based metrics (Vendi Score over sentence-transformer embeddings, or topic cluster entropy). These operate in semantic space and are not fooled by surface variation or length. Use Distinct-N as a cheap first-pass check but do not rely on it alone.
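For reference, a corpus-level Distinct-N takes only a few lines. This sketch assumes lowercased whitespace tokenisation and pools n-grams across all outputs; per-output averaging is an equally common variant.

```python
def distinct_n(texts, n=2):
    """Unique n-grams divided by total n-grams, pooled over all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


outputs = ["a b c", "a b d"]
d1 = distinct_n(outputs, n=1)  # 4 unique unigrams out of 6
d2 = distinct_n(outputs, n=2)  # 3 unique bigrams out of 4
```

Nothing in the computation looks at meaning, which is why two outputs that differ only in their opening phrase still raise the score.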
Tap to flip back
Computing the full Vendi Score requires eigendecomposing the \(n \times n\) kernel matrix, which is \(O(n^3)\) time and \(O(n^2)\) space. This is tractable for datasets up to a few thousand samples but prohibitive for millions.
Practical approximations include:
- Random subsampling: compute Vendi Score on a random draw of 2,000-5,000 samples per batch rather than the full corpus.
- Nyström approximation: approximate the kernel matrix with a low-rank factorisation using \(m \ll n\) landmark points, reducing complexity to \(O(nm^2)\).
For operational monitoring, subsampling is usually sufficient because diversity trends (rising similarity, falling rank) are detectable in representative samples well before full-corpus collapse occurs.
Tap to flip back
Parallel mixing: both sources are blended in a single dataset; human data acts as a quality anchor and volume filler.
Stage-wise mixing: human data seeds early stages (SFT, reward modelling), and synthetic data scales later stages (RLAIF, rejection-sampling fine-tuning). Human data sets the standard; synthetic data provides volume.
Iterative self-improvement: human data initialises the judge's evaluation criteria only; the model then generates and scores its own training targets.
Why it matters: the mode you choose determines how authority over the model's behaviour is distributed between human signal and synthetic output, not just the quantity of each.
Tap to flip back
MAD (Model Autophagy Disorder; Alemohammad et al., 2023) is the progressive collapse of quality or diversity that occurs when a generative model is trained on its own outputs without fresh real data.
Each generation loses probability mass from the tails of the real distribution. The next generator never recovers that mass, so errors compound. The result is degradation in either precision or recall (or both) across generations.
Practical implication: any pipeline that trains iteratively on synthetic data must inject fresh human (or otherwise non-synthetic) data at every generation to prevent collapse. Reducing annotation costs over time is constrained by this requirement.
Tap to flip back
Humans author a set of normative principles (the constitution) rather than labelling individual harmful outputs. The model generates critiques and revisions of its own responses guided by those principles (supervised phase). In the RL phase, an AI judge applies the same principles to generate preference labels (RLAIF).
Human authority is embedded in the standard used to judge outputs, not in the volume of labelled examples. The contribution is qualitative rather than quantitative.
This means a small, carefully designed human artefact (the constitution) can shape a large synthetic training corpus.
Tap to flip back
In a standard training mixture without upweighting, each example contributes proportionally to the gradient signal. A 1% human fraction provides 1% of the gradient updates. If the synthetic majority already has a strong, consistent style or policy, the minority human signal is overwhelmed.
Fixes include:
- Upweighting human examples (increasing their sampling probability during training)
- Curriculum ordering (training on human data first to set the base behaviour)
- Using human data only in critical phases (e.g., reward model training) where it has structural authority rather than volumetric influence
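The upweighting fix comes down to simple arithmetic. The sketch below assumes sampling probability proportional to a per-example weight, with synthetic examples at weight 1; the function names are illustrative.

```python
def human_upweight(n_human, n_synth, target_fraction):
    """Per-example weight for human data so it supplies target_fraction of samples.

    Solves w * n_human / (w * n_human + n_synth) = target_fraction for w.
    """
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    return target_fraction * n_synth / ((1 - target_fraction) * n_human)


def effective_human_fraction(n_human, n_synth, weight):
    """Share of sampled examples that are human-authored at a given weight."""
    return weight * n_human / (weight * n_human + n_synth)


# 1,000 human examples in a 100,000-example mixture (1%), lifted to 10% of samples.
w = human_upweight(n_human=1_000, n_synth=99_000, target_fraction=0.10)
```

In this example each human example must be sampled about 11 times as often as a synthetic one. Heavy upweighting of a small set also means more repetition, so the weight trades off signal strength against overfitting to those examples.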
Tap to flip back
LIMA (Zhou et al., 2023) fine-tuned a 65B LLaMA model on only 1,000 carefully curated human demonstrations, with no RLHF. In human evaluations, it matched or outperformed models trained on orders-of-magnitude more data (including synthetic corpora).
Key lesson: quality-selection matters more than volume for supervised fine-tuning. The model's core knowledge comes from pretraining; SFT teaches it a response format and behaviour style. A small number of high-quality, diverse human examples is often sufficient to communicate that style.
This does not mean you should always use minimal data, but it sets a strong prior against adding synthetic volume without first verifying quality.
Tap to flip back
AI judges (LLM-as-a-Judge) tend to exhibit a verbosity bias: they favour longer, more detailed responses over concise but equally correct ones, partly because their own training rewards elaboration.
Additional known biases:
- Positional bias: preference for whichever response appears first in the prompt (varies by model).
- Self-preference: models rate outputs similar to their own generation style more highly.
- Sycophancy: judges can be swayed by authoritative-sounding framing regardless of accuracy.
When mixing human and AI preference labels, these biases are not symmetric with human biases (humans show recency bias and anchoring effects). The mixture introduces systematic noise that may not cancel out, requiring calibration or separate accounting of the two label sources.
Tap to flip back
The chain is: human preferences -> reward model (RM) -> rejection sampling filter -> synthetic training targets.
RFT generates multiple completions, scores them with the RM, and keeps only high-scoring ones as training examples. Quality is bounded by the RM's calibration. The RM was trained on human preference labels, so its extrapolation accuracy outside the distribution of its training data determines whether the synthetic examples it endorses reflect genuine human preferences or reward hacking.
If the RM over-fits to surface features correlated with human approval (length, hedging language, specific phrases), the rejected samples may not be genuinely worse and the accepted samples may not be genuinely better - leading to mode collapse toward RM-favoured styles rather than user-preferred quality.
Fresh human evaluation is the only way to detect when the chain has broken down.
Tap to flip back
Synthetic preference data refers to pairwise comparison labels (chosen vs. rejected completions) generated by a language model judge rather than by human annotators. The judge reads a prompt and two candidate responses, then outputs a verdict. The resulting triples are structurally identical to human-labelled data and feed directly into reward-model training or DPO.
Why it matters: human labelling is expensive and slow; a capable judge LLM can produce millions of comparisons at a fraction of the cost.
Tap to flip back
The constitutional loop trains a model to be helpful and harmless using a written set of principles (the "constitution") as the sole source of normative guidance.
Phase 1 - Supervised (SL-CAI): The model critiques its own harmful response against a randomly sampled principle, then rewrites it. These (original, revised) pairs are used for supervised fine-tuning.
Phase 2 - RL (RLAIF): The model compares response pairs and labels which better satisfies a given principle. These AI-generated labels train a preference model, which supplies the reward signal for RL in place of human preference labels.
Key advantage over vanilla RLAIF: the principles are explicit and auditable, rather than being implicitly absorbed from pre-training.
Tap to flip back
Rejection sampling fine-tuning samples k completions per prompt from the current policy, scores them with a reward model or judge LLM, keeps only the highest-scoring completion, and fine-tunes on that filtered set using standard SFT loss.
Differences from PPO-based RLHF:
- No RL gradient updates or KL-penalty infrastructure needed.
- More stable but less sample-efficient: many completions are discarded entirely.
- Can be iterated (sample, filter, fine-tune, repeat) to progressively improve the policy.
Used (alongside PPO) in the Llama 2-Chat training pipeline; the Self-Taught Reasoner (STaR) applies the same sample, filter, fine-tune pattern to reasoning tasks, filtering rationales by whether they reach the correct answer.
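The data-construction half of the loop can be sketched as best-of-k selection. `generate` and `score` are hypothetical callables standing in for the policy and the reward model or judge; the fine-tuning step itself is ordinary SFT on the returned pairs.

```python
def rejection_sample(prompts, generate, score, k=4):
    """For each prompt, sample k completions and keep only the top-scoring one.

    generate(prompt, sample_index) -> completion string   (hypothetical policy call)
    score(prompt, completion)      -> float               (hypothetical reward call)
    """
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt, i) for i in range(k)]
        best = max(candidates, key=lambda c: score(prompt, c))
        dataset.append({"prompt": prompt, "completion": best})
    return dataset


# Toy stand-ins: the "policy" tags each sample with its index,
# and the "reward model" prefers higher indices.
def toy_generate(prompt, i):
    return f"{prompt}-{i}"


def toy_score(prompt, completion):
    return int(completion.rsplit("-", 1)[1])


sft_data = rejection_sample(["q"], toy_generate, toy_score, k=3)
```

Of the k samples per prompt, k - 1 are thrown away, which is the sample-inefficiency noted above.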
Tap to flip back
Position bias is a judge LLM's tendency to prefer whichever response appears first (or second) in the prompt, independent of actual quality. Studies have measured this effect reversing verdicts 10-30% of the time.
Mitigation: submit each pair twice with A/B order swapped, then aggregate the two verdicts (majority vote or averaged logits). If the judge disagrees with itself across orderings, treat the label as uncertain and either discard it or down-weight it during reward-model training.
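A minimal sketch of the swap-and-compare step, assuming a hypothetical `judge(prompt, first, second)` callable that returns the string "first" or "second". Disagreement across orderings is returned as `None` so the caller can discard or down-weight the label.

```python
def debiased_verdict(judge, prompt, response_a, response_b):
    """Query the judge in both orders; return "A", "B", or None if it is inconsistent."""
    verdict_ab = judge(prompt, response_a, response_b)  # A shown first
    verdict_ba = judge(prompt, response_b, response_a)  # B shown first
    winner_ab = "A" if verdict_ab == "first" else "B"
    winner_ba = "B" if verdict_ba == "first" else "A"
    return winner_ab if winner_ab == winner_ba else None


# A judge that always picks whatever is shown first is exposed as inconsistent.
def position_biased_judge(prompt, first, second):
    return "first"


# A judge that picks on content (here, naively, the longer response) stays consistent.
def content_judge(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"


biased = debiased_verdict(position_biased_judge, "q", "long answer", "short")
stable = debiased_verdict(content_judge, "q", "long answer", "short")
```

The first call returns `None` and the second returns "A". The cost is two judge calls per pair instead of one.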
Tap to flip back
MAD (Model Autophagy Disorder; Alemohammad et al., 2023) describes the progressive quality and diversity degradation that occurs when successive model generations are trained primarily on outputs from previous synthetic generations, without fresh real data injected at each step.
The mechanism: the model distribution contracts around high-probability regions, progressively dropping rare but valid outputs. Pointwise quality metrics can look stable while recall (diversity) collapses silently.
Implication for synthetic preference pipelines: if the judge and policy are both fine-tuned on AI-generated labels across many iterations, the joint distribution can drift into a narrow, self-reinforcing mode. Periodic re-grounding with real human labels or held-out human evaluations is necessary to prevent this.
Tap to flip back
LLM judges tend to rate longer responses as better, regardless of actual information content, possibly because length correlates with surface thoroughness in pre-training data. Policies trained on these labels learn to pad outputs rather than improve reasoning.
Countermeasures:
- Instruct the judge explicitly to penalise unnecessary verbosity.
- Add a length-normalisation penalty term to the reward signal.
- Evaluate reward models on adversarially lengthened completions and audit for length-quality correlation before deploying them.
Tap to flip back
Direct RLAIF (d-RLAIF, Lee et al. 2023) bypasses reward model training entirely. Instead of building a separate reward model from AI-labelled pairs and then running PPO, it queries the judge LLM at each PPO step to obtain a reward score for the current policy's output in real time.
Trade-off:
- Pro: no reward model training cost or overoptimisation risk from a static RM; reward judgements can incorporate up-to-date judge capabilities.
- Con: inference latency and cost are substantially higher because a large judge LLM must be called for every rollout, making training loops significantly slower and more expensive than standard RLAIF.
Lee et al. found d-RLAIF slightly outperformed standard RLAIF on their evaluation tasks despite this added cost.
Tap to flip back
In unconstrained model generation, the probability of an example being produced is roughly proportional to what the model finds likely, so rare phenomena and minority patterns are under-represented. With template-based generation, the sampling distribution is set by code: the programmer specifies exactly how many examples of each type, label, or domain to include. Distribution control becomes a software engineering problem with predictable, auditable outcomes rather than a prompt engineering problem with opaque outcomes.
Tap to flip back
FLAN converted 62 existing NLP benchmarks into instruction-following examples by writing text templates with typed placeholders for each task, then instantiating those templates with the labelled data already present in each benchmark. For example, a sentiment classification example became "Classify the sentiment of the following review as positive or negative: {review_text}\nAnswer: {label}" with the actual review text and label substituted in. No generative model was needed; the transformation was a string-formatting operation over pre-existing labelled data.
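The string-formatting operation can be sketched directly. The first template is the one quoted above; the second is a hypothetical variant added only to show how several templates multiply one labelled example.

```python
TEMPLATES = [
    # Template quoted on this card.
    "Classify the sentiment of the following review as positive or negative: "
    "{review_text}\nAnswer: {label}",
    # Hypothetical second phrasing of the same task.
    "Review: {review_text}\nIs this review positive or negative?\nAnswer: {label}",
]


def instantiate(example, templates=TEMPLATES):
    """Fill every template with the fields of one labelled benchmark example."""
    return [template.format(**example) for template in templates]


rows = instantiate({"review_text": "Great film", "label": "positive"})
```

Each labelled example yields one training string per template, with the label copied verbatim from the benchmark rather than produced by a model.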
Tap to flip back
Grammar-guided constrained decoding applies a formal grammar (such as a context-free grammar or a JSON schema) at inference time, masking out any tokens that would violate the grammar at each decoding step. This forces the model to produce only structurally valid outputs (valid JSON, valid SQL, valid arithmetic expressions). It solves the problem that free-form generation frequently produces near-valid but malformed outputs that break downstream parsers or evaluation pipelines, removing the need for post-hoc filtering or repair.
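The masking idea can be shown on a toy grammar. This sketch assumes a single-character vocabulary and the grammar `integer := "-"? digit+`; `logits_fn` is a hypothetical stand-in for a model's next-token scores, and real systems apply the same mask over tokeniser vocabularies.

```python
import math

DIGITS = set("0123456789")


def allowed_tokens(prefix):
    """Tokens the toy grammar permits after the given prefix."""
    if prefix == "":
        return DIGITS | {"-"}
    if prefix == "-":
        return DIGITS
    return DIGITS | {"<eos>"}  # may stop only after at least one digit


def constrained_greedy(logits_fn, max_steps=8):
    """Greedy decoding where disallowed tokens are treated as having logit -inf."""
    output = ""
    for _ in range(max_steps):
        logits = logits_fn(output)
        token = max(allowed_tokens(output), key=lambda t: logits.get(t, -math.inf))
        if token == "<eos>":
            break
        output += token
    return output


# Toy "model" whose top choice is always the invalid token "x".
def toy_logits(prefix):
    return {
        "x": 5.0,
        "4": 2.0 if prefix == "" else 0.0,
        "2": 1.0,
        "-": 0.5,
        "<eos>": 3.0 if len(prefix) >= 2 else -1.0,
    }


result = constrained_greedy(toy_logits)
```

Unconstrained greedy decoding would emit "x" at every step; with the mask, the highest-scoring valid token wins and the output is the well-formed integer "42".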
Tap to flip back
A unit test is deterministic and correct by construction: it either passes or it does not, and a passing result is a definitive proof that the code produces the specified output for those inputs. A model-based quality judge produces a probabilistic score that reflects the judge's beliefs about quality, which can be wrong, biased, or gamed. For tasks with verifiable correctness (code execution, symbolic maths, SQL query results), the programmatic oracle provides certainty the model judge cannot. This is why OpenMathInstruct-1 and similar pipelines use symbolic checkers rather than LLM-as-judge scoring for acceptance.
Tap to flip back
Template rigidity occurs when all training examples share the same surface structure because they come from a small set of templates. The fine-tuned model learns to mimic that structure in its own outputs, producing unnaturally repetitive or formulaic responses at deployment time. The standard fix is to write many template variants per task (as FLAN did with roughly ten variants per benchmark task) and to combine template instantiation with a light paraphrase or style-variation pass using a model, introducing surface diversity while preserving the programmatically controlled label and content.
Tap to flip back
A CFG for maths word problems encodes productions that specify the problem structure (agent, quantity, operation type) and sample numeric values from defined ranges. Because the structure is determined by the grammar's rules, the correct answer can be computed analytically by the same program that samples the grammar - it is a direct output of the code, not inferred from a model. For example, a "give away K items" production reduces the starting count by exactly K, so the label is computed symbolically, making it error-free by construction. This is fundamentally different from asking a model to generate a problem and then separately asking it to verify the answer.
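A minimal sketch of such a generator, with two productions ("give away" and "receive"). The names, items, and numeric ranges are illustrative assumptions; the point is that the answer is computed by the same code that builds the question.

```python
import random


def sample_problem(rng):
    """Sample one word problem and compute its label from the sampled structure."""
    name = rng.choice(["Ana", "Ben", "Chen"])
    item = rng.choice(["apples", "marbles", "stickers"])
    start = rng.randint(5, 20)
    op = rng.choice(["give_away", "receive"])
    if op == "give_away":
        k = rng.randint(1, start)  # cannot give away more than the starting count
        action = f"gives away {k}"
        answer = start - k
    else:
        k = rng.randint(1, 10)
        action = f"receives {k} more"
        answer = start + k
    question = (f"{name} has {start} {item} and {action}. "
                f"How many {item} does {name} have now?")
    return {"question": question, "answer": answer, "op": op, "start": start, "k": k}


rng = random.Random(0)
problems = [sample_problem(rng) for _ in range(5)]
```

Because the operation type is chosen in code, the mix of operation types and value ranges can be set exactly, and no model is needed to check the label.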
Tap to flip back
Model-generated evaluation data carries the risk of circularity: a model tends to score well on data generated by a model with similar training, not because it has genuinely acquired the capability but because it has learned structural patterns common to both. Programmatically generated evaluation data derives from a formal specification that is independent of the model being evaluated, so high scores reflect genuine capability rather than pattern matching against a familiar generator's style. This independence is especially important for capability-specific benchmarks where controlling coverage and label balance is itself part of the measurement design.
Tap to flip back
A human curator makes idiosyncratic, random errors. A generator makes systematic errors that repeat across every sample it produces, because the same model biases drive every generation. One bad teacher propagates its mistakes at scale; one human annotator is usually diluted by many others.
Tap to flip back
- Factuality - verify answers against symbolic checkers, verifier models, or teacher log-probability signals.
- Diversity and coverage - measure vocabulary entropy, semantic cluster count, and ROUGE-L self-similarity across the set.
- Deduplication - remove near-duplicates using MinHash or n-gram Jaccard similarity (threshold roughly 0.8 on 13-gram shingles).
- Benchmark contamination - scan every training candidate against all evaluation examples using normalised LCS or token-overlap matching.
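The contamination scan in the last item can be approximated with token n-gram overlap. This is a sketch under assumptions: the 8-gram window and the regex tokeniser are illustrative choices, and it catches only near-verbatim reuse.

```python
import re


def token_ngrams(text, n):
    """Set of lowercase word n-grams, ignoring punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(candidate, benchmark_items, n=8):
    """Flag a training candidate that shares any n-gram with any benchmark item."""
    candidate_grams = token_ngrams(candidate, n)
    return any(candidate_grams & token_ngrams(item, n) for item in benchmark_items)


benchmark = ["What is the capital of France and what river flows through it?"]
leaked = is_contaminated(
    "Q: What is the capital of France, and what river flows through it? A: Paris",
    benchmark)
clean = is_contaminated("Explain how MinHash estimates Jaccard similarity.", benchmark)
```

Normalising case and punctuation before matching means small formatting changes (the added comma above) do not hide the overlap.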
Tap to flip back
A teacher model has already seen benchmark data during its own pretraining. When asked to generate "similar" examples, it can reproduce near-verbatim question structures, distinctive templates, or characteristic problem types without copying text literally. Exact-match scanning compares strings; indirect contamination manifests through shared format and structure, not shared tokens. The defence is scanning for benchmark-characteristic patterns (multi-choice lettering schemes, recurring proper nouns, problem templates) rather than only exact text.
Tap to flip back
MAD (Model Autophagy Disorder) describes progressive quality and diversity degradation that occurs when a generative model is iteratively trained on its own synthetic outputs without re-injection of real data. Precision or recall (or both) degrade across generations. A single audit of generation N says nothing about whether generation N+1 will be clean. The audit must be run at every iteration, and aggregate diversity and factuality metrics must be tracked longitudinally across the pipeline.
Tap to flip back
Rejection sampling filters out low-quality generations; a near-100% acceptance rate means either the reward model lacks discrimination or the task is trivially easy. Either way, the surviving data contains minimal learning signal - the model is not being pushed toward harder, more informative examples. A healthy rejection rate (often 30-70% in practice) indicates the filter is doing real work.
Tap to flip back
Both human raters and LLM judges evaluate stylistic quality reliably but miss factual accuracy and capability gaps. Gudibande et al. (2023) found that imitation models distilled from ChatGPT matched ChatGPT in crowdsourced human evals while closing little to none of the capability gap on knowledge-intensive tasks. Style audit passing does not imply capability audit passing. Factuality must be checked with separate, targeted methods (symbolic verifiers, knowledge probes) rather than relying on holistic quality ratings.
Tap to flip back
In a constitutional or RLAIF loop, the generator and critic often share the same base model weights or the same pretraining distribution. Systematic biases in the base model are inherited by both roles: the generator produces biased samples and the critic fails to flag them. Using a different model family for the verifier (e.g., Mistral critiquing a Llama-generated dataset) reduces the shared blind-spot surface. Human spot-checks at even low sampling rates (0.5%) serve as calibration signals that automated cross-model checks alone cannot provide.
Tap to flip back
OpenAI's terms of use prohibit using outputs from its services to develop models that compete with OpenAI. This covers both decoded text and logit-level signals from the API.
Why it matters: The restriction attaches to the data, not the training procedure. Training a student model on prohibited API outputs is a violation regardless of the fine-tuning method used.
Tap to flip back
Alpaca faced two simultaneous constraints: OpenAI's terms of use prohibited using text-davinci-003 outputs to build a competing model, and the LLaMA base model carried a non-commercial licence. The authors acknowledged both in the release post and restricted Alpaca to academic research only.
Why it matters: This was the first widely publicised demonstration that a distillation pipeline can be technically successful while legally non-deployable. It established a precedent the field now takes seriously.
Tap to flip back
The ToS restriction attaches to the data, not the model weights or architecture. If the training pairs were generated from a prohibited API, any model trained on those pairs - regardless of its architecture or subsequent training - inherits the compliance problem. The provenance of the data is the relevant variable.
Tap to flip back
Closed commercial APIs (GPT-4, Claude, Gemini) carry explicit contractual prohibitions on using outputs to train competing models. Self-hosted open-weight models (LLaMA, Mistral, Falcon) do not carry equivalent API-level restrictions; the constraints are instead in the model licence, which varies by provider and is generally less restrictive on training use.
Why it matters: Teams using open-weight teachers run locally are on substantially firmer legal ground than teams calling closed APIs for distillation data.
Tap to flip back
Licence stacking refers to the accumulation of licence constraints across a model's lineage. A student fine-tuned from a LLaMA 2 base inherits LLaMA 2's licence restrictions (including its 700 million MAU commercial cap) even if the student was trained entirely on your own data. Switching to a different fine-tuning dataset does not change the licence on the base weights.
Why it matters: Due diligence must cover every node in the pipeline - base model, teacher model, and data sources - not just the final training run.
Tap to flip back
Gudibande et al. found that imitation models learn the teacher's discourse style far more readily than its factual correctness or reasoning depth. Human raters rated imitation models as competitive with the teacher; automated evaluations on novel reasoning and factual tasks exposed a persistent capability gap that the stylistic fluency masked.
Why it matters: This undermines the core motivation for ToS-violating distillation. If the resulting student cannot actually replicate the teacher's reasoning ability, the legal risk is being taken for limited reward.
Tap to flip back
- Use a self-hosted open-weight model as the teacher (LLaMA 3, Mistral, Falcon, Qwen, etc.) - run locally, no API ToS applies, though the model's own licence still governs how its outputs may be used.
- Seek explicit authorisation from the provider via an enterprise agreement or research partnership before generating distillation data - this is the route Microsoft used for Orca with GPT-4.
A third option is distilling from your own previously trained models, which carries no third-party ToS risk at all.
Tap to flip back
Common Crawl's WET extractor is optimised for speed and coverage, not precision. It retains substantial navigation and sidebar boilerplate. Re-extracting from WARC with a higher-quality parser (e.g., Trafilatura) improves text purity before any downstream filter sees the data, preventing noisy signal from propagating through the entire pipeline.
Tap to flip back
Navigation menus and footer links are mostly HTML tags with very short text; body paragraphs are mostly text with sparse tags. Density scoring measures the ratio of text characters to tag characters per block and keeps blocks above a threshold. This simple heuristic works well on news-style pages and is very fast, but struggles with modern JS-rendered or template-heavy layouts.
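A rough sketch of the heuristic, assuming the page has already been split into block-level HTML fragments. Density is computed here as text characters over text plus tag characters so that it falls in [0, 1]; the 0.5 threshold is an illustrative assumption.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")


def text_density(block_html):
    """Fraction of a block's characters that are visible text rather than markup."""
    tag_chars = sum(len(tag) for tag in TAG_RE.findall(block_html))
    text_chars = len(TAG_RE.sub("", block_html).strip())
    total = tag_chars + text_chars
    return text_chars / total if total else 0.0


def keep_dense_blocks(blocks, threshold=0.5):
    """Keep only blocks whose text density meets the threshold."""
    return [b for b in blocks if text_density(b) >= threshold]


nav = '<ul><li><a href="/a">Home</a></li><li><a href="/b">News</a></li></ul>'
body = "<p>Deduplication removes repeated documents before training begins.</p>"
kept = keep_dense_blocks([nav, body])
```

The navigation block is mostly markup (density about 0.12) and is dropped; the paragraph is mostly text and survives.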
Tap to flip back
favor_precision=True reduces false positives (keeping boilerplate as content) at the cost of increased false negatives (discarding ambiguous pages). For pretraining corpora with billions of pages you can afford to discard uncertain cases; the alternative - letting boilerplate through to downstream filters - propagates noise that may survive if the filters are not specifically designed to catch it.
Tap to flip back
NFKC normalisation collapses compatibility characters (full-width Latin letters common in East Asian web text, circled numbers, fraction ligatures) to their canonical equivalents. Without it, these characters produce spurious vocabulary entries during tokeniser training and make exact-duplicate detection unreliable because the "same" word can appear as multiple distinct byte sequences.
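Python's standard library exposes this directly through `unicodedata.normalize`. The wrapper name below is illustrative.

```python
import unicodedata


def normalise_text(text):
    """Apply NFKC so compatibility characters map to their canonical forms."""
    return unicodedata.normalize("NFKC", text)


fullwidth = normalise_text("Ｈｅｌｌｏ")   # full-width Latin letters
circled = normalise_text("①")             # circled digit one
ligature = normalise_text("ﬁle")           # "fi" ligature followed by "le"
```

After normalisation the three strings become "Hello", "1", and "file", so they tokenise and hash the same way as their plain-ASCII spellings.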
Tap to flip back
A 1% boilerplate false-positive rate across 10 billion documents yields 100 million noisy training examples. Small per-document extraction errors become enormous absolute corpus contamination. Compounding this, boilerplate shared across many pages is removed by deduplication, but unique-yet-low-quality templated content (e.g., per-product e-commerce listings) passes both extraction and deduplication checks and requires separate content-quality filters to catch.
Tap to flip back
URL, publication date, language, author, title, and token count are worth preserving. They enable: post-hoc domain allow/deny-listing; date-range filtering; decontamination (checking whether a document's URL coincides with benchmark source pages); and cheap length-based filtering without re-tokenising. None of this metadata need enter the model's training text - it serves the curation pipeline.
Tap to flip back
Static extractors operate on the raw HTML served by the server, not the rendered DOM. SPAs populate their content via JavaScript executed by the browser; the raw HTML is often near-empty shell markup. Pretraining pipelines generally accept this coverage gap: the lost pages are a small fraction of the crawl, and fetching the rendered DOM would require a headless browser, adding orders of magnitude more compute and latency per page.
Tap to flip back
A perplexity or quality classifier is trained on a specific language. If non-target-language documents enter the corpus first, they may score well or poorly for the wrong reasons - a French document might receive low English perplexity (incorrectly accepted) or high perplexity (incorrectly rejected). Either way the quality signal is unreliable. Language identification gates the corpus to a single language so that downstream filters operate on a homogeneous distribution.
Tap to flip back
CCNet uses the fastText language identification model (lid.176.bin, or its compressed variant lid.176.ftz), which supports 176 languages. It was chosen because it runs on CPU at very high throughput, and the compressed variant is under 1 MB - critical when processing hundreds of terabytes of web crawl data. Heavier transformer-based classifiers achieve marginally better accuracy on short or ambiguous texts but are impractical at that scale.
Tap to flip back
Too high (e.g., 0.95): Many legitimate Welsh documents are discarded because the classifier assigns sub-threshold confidence to Welsh text that borrows English words or uses informal orthography. Corpus size shrinks sharply.
Too low (e.g., 0.5): Documents from closely related or geographically co-occurring languages (English, Cornish) slip through, injecting noise that degrades monolingual model quality.
The right threshold for low-resource languages is generally lower than for high-resource ones, and paragraph-level detection (as in CCNet) helps by averaging over more context before making a document-level decision.
Tap to flip back
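Language \(l\) is sampled with probability \(p_l \propto (n_l / N)^{1/T}\), where \(n_l\) is the token count for language \(l\), \(N\) is the total across all languages, and \(T\) is a temperature parameter (mT5 used an exponent of \(1/T = 0.3\), i.e. \(T \approx 3.3\)).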
When \(T = 1\) the mixture is proportional to raw token counts - English dominates. As \(T\) increases the distribution flattens toward uniform, upsampling low-resource languages and downsampling high-resource ones. The cost is that high-resource language data is seen less often per training step, which can slightly hurt performance on those languages.
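A small worked example, assuming the usual form p_l ∝ (n_l / N)^(1/T). The token counts are made up for illustration.

```python
def temperature_mixture(token_counts: dict, T: float) -> dict:
    """Return sampling probabilities p_l proportional to (n_l / N) ** (1 / T)."""
    total = sum(token_counts.values())
    scaled = {lang: (n / total) ** (1.0 / T) for lang, n in token_counts.items()}
    norm = sum(scaled.values())
    return {lang: w / norm for lang, w in scaled.items()}

# Hypothetical token counts: one dominant language, two smaller ones.
counts = {"en": 900, "fr": 90, "cy": 10}
p_raw = temperature_mixture(counts, T=1)   # proportional to raw counts
p_flat = temperature_mixture(counts, T=5)  # flattened toward uniform
```

At T = 1 the smallest language receives exactly its raw share; at T = 5 its share rises and the dominant language's share falls.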
Tap to flip back
The three languages share extensive vocabulary, are mutually intelligible, and overlap in both Latin and Cyrillic scripts. A character n-gram classifier trained on news text or Wikipedia will find the distributional differences too subtle to separate reliably - and even trained human annotators often disagree on which label is correct for a given document. Any pretraining corpus reporting distinct high-quality Serbian and Croatian shards should be read with scepticism unless paragraph-level or speaker-metadata-based disambiguation was applied.
Tap to flip back
CCNet runs fastText language detection per paragraph, then accepts a document when the majority of its paragraphs are assigned the target language. This solves two problems:
- Pages that mix languages (e.g., English boilerplate navigation wrapped around French article text) are handled more accurately - the French paragraphs are retained and the English navigation paragraphs discarded.
- Pages with embedded keyword spam in a different language do not bias the whole-document prediction, because those paragraphs are a minority of the total.
The cost is roughly linear extra inference: one model call per paragraph rather than one per document.
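The majority rule can be sketched as follows. `toy_detect` is a hypothetical stand-in for a real language-ID model, and splitting paragraphs on blank lines is an assumption made for this example.

```python
from collections import Counter
from typing import Callable, List

def keep_target_paragraphs(doc: str, target: str,
                           detect: Callable[[str], str]) -> List[str]:
    """Accept the document only if a strict majority of paragraphs are in the
    target language, then return just those paragraphs."""
    paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]
    labels = [detect(p) for p in paragraphs]  # one model call per paragraph
    if not paragraphs or Counter(labels)[target] * 2 <= len(paragraphs):
        return []  # reject: target language is not the majority
    return [p for p, lab in zip(paragraphs, labels) if lab == target]

def toy_detect(paragraph: str) -> str:
    """Hypothetical detector for illustration only: spots a few French words."""
    french_markers = {"le", "la", "les", "est", "et", "une"}
    return "fr" if set(paragraph.lower().split()) & french_markers else "en"

page = "Home About Contact\n\nLe chat est sur la table.\n\nLa pluie et le vent."
kept_fr = keep_target_paragraphs(page, "fr", toy_detect)
kept_en = keep_target_paragraphs(page, "en", toy_detect)
```

The English navigation line is dropped from the French result, and the same page is rejected outright when English is the target.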
Tap to flip back
Arabizi is informal Arabic written in the Latin script with digits substituting for sounds absent from the Latin alphabet (e.g., "3" for the letter ع). Because standard Arabic language identification models are trained on text in Arabic script (the Unicode Arabic block), they will not recognise Arabizi as Arabic - it shares no character n-grams with the training data. The document is then either mislabelled as another Latin-script language or rejected as unidentifiable. Similar issues affect Romanised Hindi, Japanese Romaji, and Chinese pinyin, all of which appear on the web and all of which fall outside the training distribution of script-aware classifiers.
Tap to flip back
Use an editorially curated corpus (Wikipedia, books) as positive examples and random web samples as negatives. The classifier learns a proxy for quality from sources that already went through human editorial processes, then generalises that signal to the full web. No direct human labelling of web documents is needed.
Tap to flip back
CCNet trains a 5-gram language model (KenLM) on Wikipedia for each target language. A document's perplexity under this model measures how surprised the model is by the document's text. Low perplexity means the document resembles Wikipedia-quality writing and is retained; high perplexity means it is likely noise or gibberish and is discarded. An n-gram model of this kind runs on CPU at high throughput, making it practical at web scale.
Tap to flip back
A hard threshold keeps documents scoring above a fixed percentile and discards the rest - a binary decision. DSIR instead assigns each document an importance weight proportional to the ratio of the target distribution density to the web distribution density:
w(x) = p_target(x) / p_web(x)
Documents are then resampled in proportion to these weights, preserving diversity while biasing toward the target (e.g. Wikipedia + books). The DSIR paper also reports a data metric, KL reduction - how much the selected corpus reduces KL divergence to the target compared with random selection - which correlates strongly (r = 0.82) with downstream benchmark accuracy.
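A toy sketch of importance resampling with w(x) = p_target(x) / p_web(x). It assumes add-one-smoothed unigram models and Gumbel top-k sampling; these are simplifications chosen for this example, not the published DSIR feature pipeline, and the corpora are made up.

```python
import math
import random
from collections import Counter

def unigram_logprob(texts, vocab):
    """Add-one-smoothed unigram log-probabilities over a shared vocabulary."""
    counts = Counter(w for t in texts for w in t.split())
    total = sum(counts.values()) + len(vocab)
    return {w: math.log((counts[w] + 1) / total) for w in vocab}

def log_importance_weight(doc, log_p_target, log_p_web):
    """log w(x) = log p_target(x) - log p_web(x), summed over the words."""
    return sum(log_p_target[w] - log_p_web[w] for w in doc.split())

def resample(docs, log_weights, k, seed=0):
    """Sample k docs without replacement, proportional to weight (Gumbel top-k)."""
    rng = random.Random(seed)
    keyed = []
    for lw, d in zip(log_weights, docs):
        u = min(max(rng.random(), 1e-12), 1 - 1e-12)  # keep u strictly in (0, 1)
        keyed.append((lw - math.log(-math.log(u)), d))
    return [d for _, d in sorted(keyed, reverse=True)[:k]]

target = ["the theorem follows from the lemma", "the proof uses the lemma"]
web = ["click here to buy now", "the lemma is here", "buy now click now"]
vocab = {w for t in target + web for w in t.split()}
lp_t, lp_w = unigram_logprob(target, vocab), unigram_logprob(web, vocab)
log_w = [log_importance_weight(d, lp_t, lp_w) for d in web]
selected = resample(web, log_w, k=2)
```

Unlike a hard threshold, low-weight documents still have a non-zero chance of being drawn, which is where the diversity comes from.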
Tap to flip back
Running an LLM capable of nuanced 0-5 quality scoring over every document in a 15-trillion-token corpus would cost orders of magnitude more compute and latency than is affordable. Instead:
1. Score a sample (~400k documents) with the LLM.
2. Train a small classifier (e.g. a fine-tuned DistilBERT, or gradient-boosted trees on text features) to replicate those scores.
3. Run the small classifier over the full corpus at CPU-scale speed.
The LLM annotation is a one-time fixed cost; the small classifier amortises it across the full corpus. The result approximates the richer LLM signal without paying the LLM's inference cost per document.
Tap to flip back
- Reference corpus bias. Wikipedia-trained classifiers penalise informal technical writing, code-heavy prose, non-Western cultural references, and low-resource languages even when those documents are genuinely informative.
- Quality vs. relevance conflation. A polished but vacuous marketing brochure scores high on fluency; a rough but accurate niche technical description scores low. Classifiers optimise surface-level quality, not informational value.
- Out-of-distribution extrapolation. For document types rare in training (legal contracts, poetry, source code interleaved with prose), classifier confidence is an extrapolation. Hard thresholds applied to these produce unpredictable retention rates.
Tap to flip back
When deduplication runs before filtering, each cluster of near-duplicates is collapsed to one representative. The classifier then scores only that representative, so the whole cluster survives or is discarded on a single document's score. Different representative-selection rules can surface different cluster members, changing which document gets scored. Pipeline ordering is therefore a hidden hyperparameter affecting corpus composition.
Published papers rarely discuss this because it requires instrumenting the full pipeline to track document provenance across stages, and the effect is difficult to isolate from other variables. Most ablations treat filtering and deduplication as independent steps, which they are not in practice.
Tap to flip back
At web corpus scale, classifier inference cost is a hard constraint. FastText and n-gram logistic regression run on CPU at millions of documents per minute with no GPU requirement. Even a moderate BERT-base classifier is roughly 1,000 times slower per document. At 15 trillion tokens, that difference translates to weeks of additional GPU time and significant cost.
The first-pass filter therefore favours throughput over precision: use a cheap, fast classifier to remove the worst 30-50% of documents, then reserve higher-capacity models (or LLM-annotated classifiers) for a second pass on the retained subset.
Tap to flip back
A heuristic filter is a deterministic rule on document statistics (character ratios, line lengths, keyword presence) - no model weights, no inference cost. A classifier scores documents using a trained model, which is slower and requires labelled training data. Heuristics run first as a cheap bulk-removal pass; classifiers handle residual noise the rules cannot express.
Tap to flip back
Correctly removed: navigation menu items (e.g. "Home | About | Contact") and JavaScript-rendered link text. Incorrectly discarded: legitimate list items in technical documentation, headings, and mathematical display lines. The rule encodes an assumption that good prose is sentence-structured, which does not hold for all registers or domains.
Tap to flip back
The "fraction of characters in duplicated lines >= 0.10" rule removed approximately 12.5% of tokens. It catches boilerplate that appears in multiple lines within the same document - navigation menus, repeated legal notices, copy-pasted footers - more effectively than exact-line deduplication alone.
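A minimal sketch of the duplicated-line statistic. It assumes every occurrence of a repeated line counts toward the duplicated characters; implementations differ on whether the first occurrence is included.

```python
from collections import Counter

def dup_line_char_fraction(text: str) -> float:
    """Fraction of characters that sit in lines occurring more than once."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    dup_chars = sum(len(ln) for ln in lines if counts[ln] > 1)
    return dup_chars / sum(len(ln) for ln in lines)

page = "Home\nAbout\nA unique sentence with real content here.\nHome\nAbout"
fraction = dup_line_char_fraction(page)
should_remove = fraction >= 0.10  # threshold quoted in the card above
```

Here the repeated "Home" and "About" lines account for 18 of 59 characters, so the document is flagged.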
Tap to flip back
Filters targeting correlated features (e.g. "short lines" and "low punctuation") both fire on the same documents, so their joint removal rate is lower than the naive sum - but they may still jointly remove an unexpectedly large or small fraction. Measuring per-filter and joint removal rates identifies redundant rules (wasting compute for no extra gain) and rules that are too aggressive (discarding legitimate content), allowing principled pruning of the pipeline.
Tap to flip back
Compute a target statistic over a high-quality reference set (e.g. Wikipedia) and over the raw crawl. Plot both distributions as histograms. Set the filter threshold at the inflection point where the two distributions diverge. Then verify the chosen threshold removes an acceptable fraction of raw tokens (roughly under 15-20% per rule). This grounds thresholds in the actual data rather than arbitrary choices.
Tap to flip back
- "lorem ipsum" - removes placeholder template pages never replaced with real content.
- Curly brace "{" presence - removes code, JSON bleed-through, and template markup.
- "cookie policy" / "privacy policy" lines - removes footer/navigation boilerplate injected into extracted page text.
Each targets a specific, identifiable signature of non-prose or low-information content.
Tap to flip back
Threshold brittleness: heuristics calibrated on one crawl snapshot may over-trigger or under-trigger on later snapshots because the web's content distribution shifts (more single-page apps, changed boilerplate patterns, different spam techniques). A pipeline that does not re-calibrate per snapshot will silently degrade corpus quality. The fix is to re-run histogram analysis on each new snapshot and adjust thresholds before applying filters at scale.
Tap to flip back
For a single hash function h, the probability that the minimum hash over document A's shingles equals the minimum hash over document B's shingles is exactly the Jaccard similarity between the two shingle sets:
P[ min_h(A) == min_h(B) ] = |A ∩ B| / |A ∪ B|
Using k hash functions and averaging, the fraction of matching minimums in two signatures is an unbiased estimator of Jaccard(A, B). This lets you estimate document-level similarity with a fixed-size vector rather than storing and comparing the full shingle sets.
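The estimator can be sketched with salted BLAKE2b from the standard library standing in for k independent hash functions. The word 3-gram shingling and k = 256 are assumptions for this example.

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams (assumed shingle type for this sketch)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(items: set, k: int = 256) -> list:
    """One minimum per salted hash function; `items` must be non-empty."""
    signature = []
    for seed in range(k):
        salt = seed.to_bytes(4, "big")  # a different salt acts as a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big")
            for s in items))
    return signature

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions where the two minimums agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

doc_a = shingles("the quick brown fox jumps over the lazy dog near the river bank today")
doc_b = shingles("the quick brown fox jumps over the lazy dog near the old stone bridge")
estimate = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
exact = jaccard(doc_a, doc_b)
```

The estimate's standard error shrinks roughly as 1/sqrt(k), so longer signatures trade memory for accuracy.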
Tap to flip back
Banding converts a soft similarity estimator into a near-threshold detector. Divide a length-k MinHash signature into b bands of r rows each. Two documents are candidate duplicates if at least one complete band matches exactly. The candidate probability for similarity s is:
P(candidate) = 1 - (1 - s^r)^b
Increasing r steepens the threshold (fewer false positives at low similarity). Increasing b raises recall at high similarity. Choosing b=20, r=5 (k=100) creates a steep transition around s~0.5: documents with Jaccard >= 0.8 are almost always flagged; documents below 0.3 almost never are. The banding replaces an O(n^2) pairwise comparison with a hash-table lookup.
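A quick numeric check of the S-curve for b=20, r=5. The threshold approximation (1/b)^(1/r) is a commonly used rule of thumb added here, not something stated on the card.

```python
def candidate_probability(s: float, b: int, r: int) -> float:
    """P(at least one of b bands of r rows matches) for Jaccard similarity s."""
    return 1.0 - (1.0 - s ** r) ** b

def approx_threshold(b: int, r: int) -> float:
    """Rule-of-thumb similarity at which the S-curve rises most steeply."""
    return (1.0 / b) ** (1.0 / r)

curve = {s: candidate_probability(s, b=20, r=5) for s in (0.3, 0.5, 0.8)}
threshold = approx_threshold(b=20, r=5)
```

With these settings a pair at s = 0.8 is flagged with probability above 0.999, a pair at s = 0.3 with probability just under 0.05, and the rule-of-thumb threshold lands near 0.55.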
Tap to flip back
Two documents that are semantically identical can have different byte sequences due to Unicode variants, inconsistent whitespace, HTML entities, or encoding differences. Without normalisation (lowercasing, NFKC normalisation, whitespace collapsing, HTML stripping), these count as distinct and both survive. Normalise before hashing and you catch them.
The danger of inconsistency: if you normalise differently at deduplication time vs. tokenisation time, documents you judged distinct may tokenise identically, meaning you kept a "unique" duplicate that wastes a training slot. Applying the same normalisation pipeline across all stages prevents this.
Tap to flip back
A sequence that appears 60,000 times in the training corpus gets roughly 60,000x the gradient signal of a sequence that appears once. The model does not generalise from it - it memorises it and can reproduce it verbatim at inference time. Removing duplicates flattens the sequence-frequency distribution.
Lee et al. (2022) found:
- Models trained on deduplicated data emitted memorised text 10x less frequently.
- They reached the same or better validation accuracy in fewer training steps.
- Over 4% of the validation examples in standard datasets overlapped with the undeduplicated training data, inflating benchmark scores.
The compute saving follows directly: fewer repeated tokens means each step sees more diverse signal, so less data is needed to reach the same loss.
Tap to flip back
MinHash LSH produces a set of candidate duplicate pairs, not a clean one-to-many mapping. A chain can form: document A is a near-duplicate of B, and B is a near-duplicate of C, but A and C would not be flagged directly. Removing only pairwise matches can leave all three in the corpus.
Union-Find builds a graph where each document is a node and each flagged pair is an edge. Computing connected components groups all transitively-linked near-duplicates into a cluster. One representative (typically the longest or earliest document) is kept per cluster; the rest are discarded. This handles syndication chains, mirror networks, and template-generated page families that form chains rather than stars.
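A compact sketch of the clustering step. Keeping the longest document per cluster is one of the options mentioned above; the document IDs and lengths are invented for the example.

```python
from collections import defaultdict

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def cluster_representatives(doc_lengths: dict, candidate_pairs: list) -> list:
    """Merge flagged pairs into connected components; keep the longest doc in each."""
    uf = UnionFind()
    for a, b in candidate_pairs:
        uf.union(a, b)
    clusters = defaultdict(list)
    for doc_id in doc_lengths:
        clusters[uf.find(doc_id)].append(doc_id)
    return sorted(max(members, key=lambda d: doc_lengths[d])
                  for members in clusters.values())

# A-B and B-C are flagged, A-C is not: the chain still forms one cluster.
lengths = {"A": 100, "B": 250, "C": 120, "D": 80}
kept = cluster_representatives(lengths, [("A", "B"), ("B", "C")])
```

Only B (the longest of the A-B-C chain) and the unrelated D survive.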
Tap to flip back
1. Short documents. With only 16 or so word 5-grams, the MinHash signature has high variance. Two short documents on different topics may share several n-grams by chance, producing a falsely elevated Jaccard estimate. Apply a minimum-length filter (e.g. 200 words) before computing signatures.
2. Low Jaccard threshold over topically related text. A threshold of 0.5 can flag two Wikipedia articles about related topics (e.g. "BERT" and "RoBERTa") as duplicates because they share extensive factual sentences, tables, and terminology. Multilingual corpora amplify this: a French and English article on the same subject share cognates that inflate the word n-gram overlap. Raising the threshold and using character n-grams instead of word n-grams both help.
Tap to flip back
Document-level SHA-256 hashing marks only documents that are byte-identical in their entirety. A news article that is 95% unique but contains a repeated 100-word legal disclaimer shared by 500,000 other documents passes unchanged.
Suffix array deduplication concatenates the entire corpus with sentinel characters, builds a sorted array of all suffixes, and scans for adjacent suffixes that share a common prefix of length >= threshold (e.g. 50 tokens). The repeated substring is identified and either the flagged span is masked out, or documents containing it above a coverage threshold are removed. This catches boilerplate paragraphs, repeated copyright notices, and scraped template bodies that live inside otherwise unique documents.
The cost: suffix array construction is O(n log n) time and requires roughly 5 bytes per input character of working memory, making it expensive at multi-terabyte scale. It is typically run after initial document-level deduplication to limit the corpus size first.
Tap to flip back
MinHash compares whole documents by hashing sets of n-grams into compact signatures, then groups similar documents into LSH buckets. Suffix-array deduplication sorts every token-level suffix of the entire corpus and finds exact repeated spans by scanning adjacent entries in the sorted array. One operates on document-similarity; the other on exact substring occurrence across the full corpus.
Tap to flip back
Suffix-array (token-level) deduplication removes the repeated span because it detects the exact substring across all documents. Document-level MinHash does not, because the host documents are distinct from each other and will not land in the same LSH bucket. This is the canonical reason both levels are needed in production pipelines.
Tap to flip back
Global deduplication removed 90% of tokens from older snapshots (because older content re-appears in later crawls), leaving a dataset skewed toward recently-crawled, lower-quality content. Per-snapshot deduplication eliminated only the large clusters of thousands of identical documents, which cause most of the harm, while preserving legitimate re-occurrences of high-quality text.
Tap to flip back
A suffix array is a sorted array of all suffixes of the concatenated corpus. After sorting, any two repeated spans will be adjacent (or near-adjacent) entries. A linear scan computing the Longest Common Prefix (LCP) between consecutive entries identifies all exact substrings that exceed a length threshold, so the full O(n) scan replaces what would otherwise be O(n^2) pairwise comparisons.
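A toy version of the scan. The suffix array is built naively by sorting suffixes, which is far too slow for a real corpus and is used here only to show the adjacent-LCP idea; the token stream and "#" sentinel are made up.

```python
def repeated_spans(tokens: list, min_len: int) -> set:
    """Return exact repeated token spans of at least min_len tokens."""
    n = len(tokens)
    # Naive construction: sort suffix start positions by the suffix itself.
    suffix_array = sorted(range(n), key=lambda i: tokens[i:])
    found = set()
    for a, b in zip(suffix_array, suffix_array[1:]):
        lcp = 0  # longest common prefix of two adjacent sorted suffixes
        while a + lcp < n and b + lcp < n and tokens[a + lcp] == tokens[b + lcp]:
            lcp += 1
        if lcp >= min_len:
            found.add(tuple(tokens[a:a + lcp]))
    return found

corpus = ("all rights reserved by acme # news story one # "
          "all rights reserved by acme").split()
spans = repeated_spans(corpus, min_len=5)
```

The repeated notice is found even though the two host documents are otherwise different.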
Tap to flip back
- Semantic redundancy across languages: a Wikipedia article and its translation share no character n-grams but carry the same knowledge. Neither method treats them as duplicates.
- Benchmark contamination: evaluation examples that appear once in the training corpus are not duplicates by any deduplication criterion. Removing them requires a separate decontamination pass against the held-out test sets.
Tap to flip back
MinHash + LSH is cheap enough for petabyte-scale streaming: it operates per-document with compact signatures. Suffix-array construction over hundreds of billions of tokens requires tens of gigabytes of RAM and significant CPU time (implemented in Rust, not Python) and is feasible only once the corpus has already been coarsely filtered. Running the cheaper pass first reduces the corpus size the expensive pass must handle.
Tap to flip back
After applying suffix-array exact deduplication to C4, models emitted memorised text roughly 10 times less frequently. Deduplication also reduced the train-test overlap that affected over 4% of standard validation sets, and models reached equivalent or better validation accuracy in fewer training steps.
Tap to flip back
Because models carry accumulated weight state between steps, each batch nudges parameters toward that batch's distribution. Batches seen during high learning-rate phases have disproportionate lasting influence (larger updates), while batches near cooldown are consolidated more conservatively. Two runs with identical token counts but different orderings follow different gradient paths and converge to different minima.
Tap to flip back
The cooldown phase is the final 5-10% of training where the learning rate is annealed to near zero. Updates are small and conservative, so the model consolidates what it knows rather than rapidly adapting. Upweighting high-quality, dense data (curated books, educational text) during cooldown gives those distributions a clean imprint on the final weights with minimal gradient interference from noisier sources.
Tap to flip back
DoReMi (Xie et al., 2023) trains two small models: first a reference model on baseline domain weights (e.g. a uniform mixture), then a "proxy" model whose domain weights are updated online. Each domain's weight is increased according to how much worse the proxy performs on that domain relative to the reference (excess loss). Domains where the proxy struggles get upweighted; easy domains are downweighted. The final weights are then used to train the full-scale model, replacing manual ablation with an automatic signal.
Tap to flip back
Sequential single-domain training causes gradient overwriting: web-text updates in the second phase push weight vectors away from code-specialised configurations, and vice versa. Interleaved batches produce conflicting but averaged gradient directions, forcing the optimiser into a compromise parameter region that generalises across both distributions. The interleaved model finds a basin that satisfies both objectives; sequential training finds two separate local basins and ends in the latter one.
Tap to flip back
Tokeniser fertility (tokens per character) varies across languages. A language that tokenises less efficiently (e.g., many languages with non-Latin scripts using byte-fallback tokens) will be over-represented in token budget relative to bytes, and under-represented in information per token. Balancing by bytes appears fair but actually gives the model far more gradient steps on some languages than others. Token-level balancing ensures uniform update exposure across scripts and language families.
Tap to flip back
If domain mixture weights are tuned to maximise scores on public benchmarks (MMLU, ARC, HellaSwag, etc.), the training distribution is shaped by the evaluation distribution. The model learns the stylistic and factual patterns of those specific benchmarks rather than developing generalisable capability. This is Goodhart's Law applied to data pipelines: the proxy metric (benchmark score) becomes the target, and the real objective (general intelligence) diverges. Mitigation: evaluate against held-out log-perplexity or decontaminated benchmarks not used to tune weights.
Tap to flip back
Chinchilla's compute-optimal result makes the token budget a genuinely scarce resource: you cannot simply add more tokens of a weaker domain without displacing tokens of a stronger one. Every domain weight decision is therefore a trade-off under a fixed budget. This forces practitioners to treat data mixture as a first-class hyperparameter - not an afterthought - because sub-optimal mixture proportions directly translate to wasted compute and a weaker final model.
Tap to flip back
BPE scans the corpus, counts every adjacent pair of tokens, and merges the most frequent pair into a single new token. Each merge step adds one entry to the vocabulary and reduces the total token count in the corpus by one for every occurrence of the merged pair. Repeating until the target vocabulary size is reached yields a vocabulary and an ordered merge table. The ordering matters: replaying merges in the same order deterministically segments any new text.
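A toy character-level trainer illustrating the merge loop and ordered replay. The word list and the tie-breaking rule (highest count, then lexicographically largest pair) are assumptions for this sketch; production tokenisers differ in details such as pre-tokenisation and end-of-word markers.

```python
from collections import Counter

def apply_merge(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words: list, num_merges: int) -> list:
    """Return the ordered merge table learned from a list of words."""
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[apply_merge(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word: str, merges: list) -> list:
    """Segment new text by replaying the merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, num_merges=3)
pieces = encode("lowest", merges)
```

"lowest" never appears in the training words, yet replaying the merge table segments it deterministically into pieces learned from other words.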
Tap to flip back
Fertility is the average number of tokens produced per word (or per Unicode character for logographic scripts). A tokeniser trained on an English-heavy corpus assigns few merges to non-English text, so those languages approach character-level segmentation with fertility of 4-6 vs. ~1.2-1.5 for English. High fertility wastes context window, increases FLOPs per training step, and effectively gives those languages a worse representational bottleneck inside the model.
Tap to flip back
Vocabulary slots are awarded to the most frequent merges in the training corpus. If that corpus differs in language or domain composition from the actual pretraining data, the vocabulary is miscalibrated: slots go to patterns that are rare in pretraining (wasted capacity) while common patterns in the real data fragment into multiple tokens (lower quality representations). The tokeniser is frozen before model training begins, so this mismatch cannot be corrected without retraining from scratch.
Tap to flip back
Byte-level BPE seeds the initial vocabulary with the 256 raw byte values instead of Unicode code points. Because every string is a valid byte sequence, the tokeniser can represent any text without an unknown token - there is no out-of-vocabulary case. The tradeoff is that non-ASCII scripts start as individual bytes and require many merges to form useful multi-character units, so they still need sufficient corpus representation to get those merges assigned. Used in GPT-2, LLaMA-3, and many other modern LLMs (LLaMA-1 and LLaMA-2 instead use SentencePiece BPE with byte fallback).
Tap to flip back
A larger vocabulary reduces fertility (shorter token sequences, fewer FLOPs per step) but expands the embedding matrix: size = vocab_size x embedding_dim parameters. For a 128k vocabulary at 4096 dimensions that is roughly 0.52 billion parameters in the input embeddings alone (about 1 GB at 2 bytes per parameter), and twice that if the output projection is untied. A smaller vocabulary keeps the embedding table compact but fragments text more, potentially harming long-context reasoning and multilingual quality. LLaMA-1 used 32k; LLaMA-3 expanded to 128k to improve multilingual and code coverage.
Tap to flip back
In byte-level BPE, the space preceding a word is folded into that word's token (GPT-2's vocabulary displays the leading space as "Ġ"). So "Washington" at the start of a sentence and " Washington" mid-sentence may be assigned different token IDs or follow different merge paths. This is by design (whitespace is semantically meaningful) but it creates prompt-sensitivity: changing capitalisation or leading whitespace can alter the token sequence and produce subtly different model behaviour. It also makes tokenisation-sensitive benchmark comparisons fragile.
Tap to flip back
1. Fragmented domain terms. Medical, legal, or code-specific tokens absent from the tokeniser corpus split into many sub-tokens, consuming context window and making it harder for attention to relate semantically related fragments.
2. No remedy short of full retraining. Because the tokeniser is frozen and its token IDs are baked into the embedding matrix, adapting the vocabulary requires re-initialising or expanding embeddings and re-running pretraining; fine-tuning alone cannot fix poor tokenisation of a new domain.
Tap to flip back
Fertility is the average number of tokens a tokeniser produces per Unicode word for a given language or domain. A high fertility (e.g. 6 tokens per word) means the model must attend across many tokens to process a single semantic unit, inflating sequence length, raising inference cost for users, and degrading model quality because gradient signal is spread across an artificially long span. A vocabulary too small relative to the corpus produces high fertility for rare languages and technical jargon.
Tap to flip back
- Sequence length - larger vocabulary produces shorter token sequences, reducing attention compute and KV-cache size.
- Embedding table memory - the (V x d_model) matrix grows linearly with vocabulary size; for small models this can dominate total parameter count.
- Rare token coverage - a vocabulary that is too small fragments rare languages, code identifiers, and technical terms into character-level pieces.
Why it matters: each pressure pulls in a different direction, so the optimal V depends on model size, target languages, and deployment context.
Tap to flip back
- V=32,000: 32000 x 4096 x 2 = 262 MB
- V=128,000: 128000 x 4096 x 2 = 1,049 MB (~1 GB)
The unembedding (output projection) layer is typically the same shape, so effective vocabulary-related memory is roughly double these figures. For a 1B-parameter model, a 256k vocabulary can make the embedding tables larger than the rest of the model's weights combined.
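The arithmetic above as a small helper, assuming 2 bytes per parameter (16-bit weights) and d_model = 4096 as in the figures on this card. The function name is illustrative.

```python
def embedding_table_bytes(vocab_size: int, d_model: int, bytes_per_param: int = 2) -> int:
    """Memory for one V x d_model table at the given precision."""
    return vocab_size * d_model * bytes_per_param

small = embedding_table_bytes(32_000, 4096)    # 262,144,000 bytes
large = embedding_table_bytes(128_000, 4096)   # 1,048,576,000 bytes

# With a separate (untied) output projection of the same shape, double it.
large_untied_mb = 2 * large / 1e6
```

Quadrupling the vocabulary quadruples the table, independent of how many transformer layers the model has.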
Tap to flip back
BPE merges the most frequent adjacent pair at each step. If the corpus is 99% English, the algorithm uses nearly all V merge slots on English patterns, leaving non-English scripts with almost no dedicated tokens. Even setting V=256k will not help an underrepresented language if its text is absent from the BPE training corpus - those scripts still degrade to character-level sequences. The fix is to upsample low-resource languages in the BPE training corpus (separately from the pretraining data mix) to reserve vocabulary budget for them.
Tap to flip back
A vocabulary trained on web text fragments domain-specific identifiers (e.g. gene names like BRCA1, code tokens like __init__) into character-level pieces. Fine-tuning does not fix this because the tokeniser is frozen; the model still sees those terms as long multi-token sequences and cannot easily learn coherent representations for them. The proper remedy is vocabulary expansion with re-initialised embeddings followed by continued pretraining, not standard supervised fine-tuning.
Tap to flip back
Rare tokens in a very large vocabulary may appear only a handful of times in the entire pretraining corpus. Their embeddings never receive enough gradient updates to converge to useful representations; they remain near-random at end of training. At V=256k, a substantial tail of tokens is effectively untrained, causing erratic behaviour on inputs that trigger those entries. The marginal return on each additional merge step diminishes sharply beyond a corpus-specific saturation point.
Tap to flip back
(a) English monolingual: 32k-64k - large enough to keep English fertility low at modest embedding cost; going higher gives diminishing sequence-length returns and wastes merge budget on rare English variants.
(b) Multilingual: 100k-256k - each additional language family needs dedicated token allocations to keep fertility near 1; without the extra budget, low-resource languages fragment badly, inflating inference cost and degrading quality for those users.
The difference arises because vocabulary budget is shared across all languages in the BPE training corpus; more languages require more total merges to give each one adequate coverage.
Tap to flip back
Benchmark decontamination is the removal of evaluation set examples (or near-duplicates) from a model's pretraining corpus before training begins. Without it, a model may have memorised benchmark answers during pretraining, so reported scores reflect recall rather than genuine generalisation - inflating perceived capability.
Tap to flip back
GPT-3 used 13-gram overlap to flag and remove contaminated training documents. This method fails to catch semantic contamination: paraphrased, translated, or lightly reformatted versions of benchmark examples that share no exact n-gram with the original but still reveal the correct answer. A model that processes such rephrased content during pretraining gains an unfair advantage that 13-gram filtering cannot remove.
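A minimal sketch of n-gram overlap flagging and its blind spot. Whitespace tokenisation, lowercasing, and the example strings are assumptions for illustration, not the GPT-3 pipeline itself.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Set of word n-grams after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(document: str, benchmark_ngrams: set, n: int = 13) -> bool:
    """Flag a document that shares at least one n-gram with the benchmark."""
    return not ngrams(document, n).isdisjoint(benchmark_ngrams)

question = ("what is the smallest prime number greater than one hundred "
            "and less than one hundred ten")
bench = ngrams(question)

verbatim = "quiz answers: " + question + " the answer is 101"
paraphrase = ("between one hundred and one hundred ten, which prime number "
              "is the smallest one you can find")

flag_verbatim = is_contaminated(verbatim, bench)      # exact copy is caught
flag_paraphrase = is_contaminated(paraphrase, bench)  # rephrasing slips through
```

The paraphrase asks the same question but shares no 13-gram with the original, so the filter passes it.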
Tap to flip back
- Embed-and-retrieve: Encode all benchmark examples and all training documents with a fast encoder (e.g., sentence-transformers). Use approximate nearest-neighbour (ANN) search to find the top-k most semantically similar training documents per benchmark example.
- LLM judge: For each retrieved candidate, prompt a language model to determine whether reading that document would reveal the benchmark answer. Remove documents the judge labels as contaminating.
This catches paraphrase and translation variants that pure string matching misses, at the cost of significantly higher compute.
Tap to flip back
8 to 18 percent of HumanEval benchmark problems overlap with those corpora even after standard decontamination filters are applied. The contaminating content is rephrased rather than verbatim, bypassing n-gram methods entirely. The same paper showed a 13B model fine-tuned on rephrased benchmark variants could match GPT-4 on that benchmark while performing normally on uncontaminated tasks.
Tap to flip back
If synthetic training data is generated by a model (e.g., GPT-4) whose own pretraining may have included benchmark answers, those answers can propagate into the synthetic documents. The synthetic corpus then contaminates a downstream model indirectly. Yang et al. (2023) highlighted this as a risk specific to synthetically generated datasets, noting that standard decontamination pipelines do not account for contamination inherited through a generation model.
Tap to flip back
- Crawl timing relative to benchmark release. Crawling the web after a benchmark is published guarantees that pages reproducing the benchmark will be ingested. Tracking which benchmarks existed at crawl time allows targeted post-hoc removal; pre-release crawls reduce the problem at source.
- Answer-only vs. question-answer removal. Removing documents that contain the answer text (but not the question) is defensible for multiple-choice benchmarks, but risky for open-ended generation tasks where the answer phrasing itself is the signal. The choice materially affects how much legitimate educational content is discarded.
Tap to flip back
Sainz et al. argue that benchmark results should be accompanied by a measured contamination report - specifically, the residual fraction of benchmark n-grams still present in the training corpus after decontamination. A residual rate above roughly 1 percent warrants explicit disclosure. Without this, readers cannot tell whether a high score reflects genuine capability or undisclosed data overlap.
Tap to flip back
Carlini et al. (2021) showed that by querying GPT-2 with carefully chosen prompts, an attacker could recover hundreds of verbatim sequences from the training corpus, including real names, phone numbers, and email addresses. The result held even for sequences that appeared only once in the data, showing that "it only appears once, so the model won't remember it" is not a safe assumption.
Tap to flip back
Carlini et al. (2022) found that memorisation scales as a roughly log-linear function of model parameter count: doubling model size noticeably increases the fraction of training sequences the model can reproduce verbatim. For PII this is directly dangerous because a sequence like a phone number that a smaller model would not reproduce may be reliably extractable from a larger one. Training-set deduplication reduces the effect but does not eliminate it.
Tap to flip back
- Regex / heuristic patterns (cheapest): catch structured PII such as email addresses, phone numbers, SSNs, and IPv4 addresses using curated regular expressions applied in a single linear pass.
- Lightweight NER models (moderate cost): distilled sequence-labelling models tag unstructured PII such as personal names and locations that regex patterns miss.
- Blocklist hashing (O(1) per document): hash-based lookup against known-bad identifier sets drops documents that exactly match breach dumps or other known leaks.
A typical pipeline runs them as a cascade in this order. Expensive large-model verification is used offline to audit recall, not in the hot path.
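The cheapest stage of that cascade might look like the sketch below: one pass of compiled regular expressions, each match replaced by a typed placeholder. The patterns are deliberately simple illustrations (US-style phone and SSN formats only) and the placeholder names are assumptions, not a production rule set.

```python
import re

# Illustrative patterns only; curated production lists are far more thorough.
# Order matters: SSN runs before PHONE so the more specific format wins.
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("[IPV4]", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def scrub(text):
    """Replace every match of each pattern with its typed placeholder."""
    for placeholder, pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


sample = "Mail jane.doe@example.com or call 555-867-5309 from 192.168.0.1"
print(scrub(sample))
# Mail [EMAIL] or call [PHONE] from [IPV4]
```

Names and locations pass through untouched, which is the gap the NER stage is meant to cover.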
Tap to flip back
- Deletion: simplest to implement; removes tokens entirely. Breaks sentence structure and leaves context gaps the model may learn unusual statistics around.
- Sentinel replacement (e.g., [EMAIL], [PHONE]): preserves document structure; the most common production choice. Introduces typed placeholder tokens the model learns to predict and generate.
- Synthetic substitution: replaces real PII with plausible fabricated values, preserving readability. Risk: the corpus now contains invented data, which may mislead the model on factual tasks.
Tap to flip back
Running deduplication first reduces duplicated PII to fewer instances before the NER stage, so the scrubber sees a smaller target. Running PII scrubbing first alters document text before deduplication, causing near-duplicate pairs (e.g., two copies of the same forum post) to diverge when redacted spans differ, confusing MinHash or SimHash-based deduplication. Most large-scale pipelines (FineWeb, Dolma) deduplicate first then scrub, accepting the trade-off.
Tap to flip back
Quasi-identifiers are attributes that are not directly identifying on their own (zip code, birth year, occupation) but jointly narrow the population enough to re-identify an individual. Sweeney's widely cited work showed that just three quasi-identifiers (5-digit ZIP code, gender, and full date of birth) are enough to uniquely identify roughly 87% of US residents. A scrubber that only removes direct PII categories (names, phone numbers, email addresses) leaves quasi-identifier combinations intact. Addressing this properly requires either broad structural redaction (destroying utility) or applying differential privacy during training so the model's gradients cannot overfit to any combination of rare features.
Tap to flip back
Lukas et al. found that sentence-level differential privacy reduces PII leakage substantially but does not eliminate it: approximately 3% of PII sequences still leaked under their experimental conditions. They also showed that their novel extraction attacks could recover up to 10 times more PII than prior attack methods. The finding challenges the assumption that scrubbing training data alone is sufficient; marginal increases in scrubbing aggressiveness yield diminishing privacy returns while compounding costs in corpus fluency and downstream task performance.
Tap to flip back
- URL/domain blocklists - cheap, deterministic, applied pre-download; covers only known bad domains, misses toxic content on legitimate domains.
- Word/n-gram blocklists - fast, high recall, no GPU cost; low precision because legitimate community discourse uses the same vocabulary as hate speech.
- Toxicity classifiers - model-based scoring (e.g., Perspective API); better precision and generalisation, but reflect annotator biases, require threshold selection, and fail on under-resourced languages.
Cost increases and precision generally improves from the first approach to the third, but so does the risk of systematically biased removal.
Tap to flip back
The C4 word-list filter disproportionately removed documents discussing LGBTQ+ topics, African American Vernacular English, and HIV/AIDS. Documents mentioning gay, lesbian, or transgender subjects were removed at roughly twice the rate of gender-neutral equivalents. This happens because blocklists contain words that appear legitimately in marginalised communities' own discourse - reclaimed slurs, community vernacular, and first-person accounts of discrimination look the same to a token-matching filter as hate speech directed at those groups.
Tap to flip back
Even a filter with 99.9% recall (missing only 0.1% of toxic documents) leaves a large absolute volume of harmful content on a 15-trillion-token corpus: if, purely for illustration, 1% of the raw corpus were toxic (150 billion tokens), the missed 0.1% would still be roughly 150 million tokens. FineWeb explicitly acknowledges this: "there are still a significant number of documents present in the final dataset that could be considered toxic." Toxicity filtering reduces concentration of harmful material; it does not achieve zero-toxicity coverage at web scale.
Tap to flip back
A classifier (like Perspective API) returns a probability score per document. The threshold determines the cut-off for removal. Shifting from 0.5 to 0.8 might reduce removed documents from 8% to 1.5% of the corpus - a difference of hundreds of billions of tokens at 15T scale. Setting the threshold too low over-removes legitimate analytical, clinical, and community content; setting it too high leaves harmful material. Thresholds also drift: a value calibrated on a 2020 CommonCrawl dump may behave differently on 2024 data as the social-media fraction and community norms of the crawl change.
Tap to flip back
Toxicity filtering runs after language identification and quality filtering, but before deduplication. Order matters because a toxic document that slips past the toxicity filter might be duplicated thousands of times in the raw crawl. With deduplication running afterwards, those surviving copies are still collapsed to one, limiting how many copies influence training. Without a deduplication pass downstream of the filter, the full duplicate weight of any missed toxic content would remain in the final corpus.
Tap to flip back
Most toxicity classifiers and word lists are English-centric: they were trained on English annotations and cover English vocabulary. Toxic content in Hindi, Arabic, Yoruba, or other languages passes undetected because neither word-list token matching nor classifier scores transfer reliably across scripts and languages. The result is multilingual corpora with uneven safety properties: English content is more aggressively filtered than content in under-resourced languages, introducing a language-tier disparity in model safety behaviour.
Tap to flip back
Domain erasure occurs when aggressive toxicity filtering removes legitimate professional or academic content because it discusses sensitive subjects. Legal documents describing crimes, clinical trials involving sexual health, academic papers analysing extremist rhetoric, and journalism reporting on violence all share surface features (vocabulary, n-grams) with genuinely harmful material. A filter tuned for high recall against hate speech will catch some of these. The result is a corpus that under-represents law, medicine, and social science relative to the actual web, skewing the model's knowledge in these domains.
Tap to flip back
Temperature sampling rescales each language's raw proportion \(p_l\) by raising it to the power \(1/T\), then renormalising: \(q_l \propto p_l^{1/T}\). At \(T=1\) the original distribution is unchanged. As \(T\) increases toward infinity the distribution flattens toward uniform across languages. Values between 2 and 5 are typical; mT5 used an exponent of \(1/T = 0.3\), i.e. \(T \approx 3.3\). Why it matters: it is the primary mechanism for upsampling low-resource languages without manually setting per-language weights for every language in the corpus.
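A small sketch of the rescaling, assuming per-language token counts are already known. The counts and language codes below are made up for illustration, and the function name is hypothetical.

```python
def temperature_sample_weights(raw_counts, temperature):
    """Return q_l proportional to p_l ** (1 / T), renormalised to sum to 1."""
    total = sum(raw_counts.values())
    scaled = {
        lang: (count / total) ** (1.0 / temperature)
        for lang, count in raw_counts.items()
    }
    norm = sum(scaled.values())
    return {lang: value / norm for lang, value in scaled.items()}


# Made-up token counts: one dominant language and two smaller ones.
counts = {"en": 900_000, "hi": 90_000, "yo": 10_000}

for t in (1.0, 3.0, 100.0):
    weights = temperature_sample_weights(counts, t)
    print(t, {lang: round(w, 3) for lang, w in weights.items()})
```

At T = 1 the weights equal the raw proportions; as T grows, the smallest language's share rises toward one third and the largest falls toward it.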
Tap to flip back
The curse of multilinguality refers to the per-language quality loss when a fixed-capacity model is trained on many languages simultaneously. Parameters that must represent English, Chinese, Yoruba, and 97 other languages cannot specialise as deeply as a monolingual model. Empirically, Conneau et al. (XLM-R) showed that adding more languages initially improves low-resource performance through cross-lingual transfer, but continued expansion dilutes capacity and eventually degrades all languages. A common practical symptom: raising a low-resource language's training share above roughly 5-10% noticeably drops English benchmark scores on the same model checkpoint.
Tap to flip back
A tokeniser trained on unbalanced (English-heavy) data allocates most vocabulary slots to English sub-words. Low-resource language text then fragments into many more tokens per word - high "fertility." A Yoruba word that is 4 tokens in a balanced tokeniser may be 12 tokens in an English-biased one. This increases inference cost, shrinks the effective context window for those users, and means the model sees fewer complete semantic units per training step for low-resource languages. Balancing the tokeniser training corpus is a prerequisite to effective multilingual data balancing during pretraining.
Tap to flip back
- Memorisation instead of generalisation. With heavy upsampling the model sees the same sentences dozens of times per epoch. It memorises surface patterns rather than learning the language's underlying structure, so it performs poorly on novel inputs.
- Script and normalisation fragmentation. Very low-resource languages often appear in multiple scripts or Unicode normalisations across different sources. The balancing pipeline aggregates these as a single language label, but the model sees what are effectively distinct vocabularies, diluting the already-small effective corpus further.
Both problems share the root cause: balancing policy cannot substitute for data that does not exist.
Tap to flip back
Low-resource languages have a higher duplicate rate as a fraction of their total web presence (fewer distinct pages indexed). If you deduplicate before balancing, more low-resource documents are removed, shrinking their effective corpus below what the balancing ratio assumes. If you deduplicate after balancing, the upsampled repetitions you added are themselves detected as duplicates and removed, undoing the upsampling. Neither order is obviously correct; practitioners must consciously choose and document which they applied, because the resulting token counts differ substantially.
Tap to flip back
ROOTS is the 1.6 TB corpus used to pretrain BLOOM (176B parameters), spanning 46 natural languages and 13 programming languages. For African languages - which are heavily under-represented in Common Crawl - the team sourced curated datasets: Wikipedia dumps, legal documents, religious texts, and partnered community datasets, rather than depending on crawl data that barely existed. This illustrates a general principle: for languages below a crawl-viability threshold, corpus construction requires active curation and community partnerships, not just proportional sampling from a web crawl.
Tap to flip back
Translation augmentation generates additional training data for a low-resource target language by machine-translating high-resource source text into it. It addresses the availability ceiling problem: if only 5M tokens of native Swahili text exist, translating 50M tokens of English adds synthetic volume. The risk is translationese: machine-translated output tends to produce simpler syntax, calque structures (word-for-word constructions from the source), and unnatural phrasing that the model then learns and reproduces. Evaluation sets must be filtered to remove translated items, and models trained heavily on translated data may perform poorly on colloquial native text that looks unlike the translation output style.
Tap to flip back
- Where did the text originate? (URL, publisher, synthetic generator)
- What transformations were applied? (filtering, deduplication, quality scoring)
- What was the licence or terms of service at collection time?
- Has that licence changed since collection?
Why it matters: without all four, you cannot reliably audit legal exposure or reprocess a corpus when new contamination is discovered.
Tap to flip back
Licence-omission rate above 70% and licence-error rate above 50%.
Why it matters: the metadata practitioners rely on to assess whether a dataset is commercially usable is mostly absent or wrong.
Tap to flip back
Licence drift is when a source's terms of service change after the corpus was collected. The training data does not update, but the legal regime governing its original collection continues to evolve. Reddit (2023) and Twitter/X (2023) are the most-cited examples.
Tap to flip back
N-gram filtering can only remove documents you know to check against. Without per-document source metadata, you cannot identify which shard to reprocess when a new benchmark is released after the corpus was frozen. Provenance records (WARC identifiers, snapshot dates, sub-source labels) let you replay the pipeline against updated blocklists without rebuilding from scratch.
Tap to flip back
Synthetic data is generated by a model (e.g. GPT-4) that was trained on copyrighted web text. The original licence constraints of that web text propagate through the generation chain. Additionally, most commercial LLM providers explicitly prohibit using their outputs to train competing models. This means synthetic data has real and traceable provenance, even though no direct copying occurred.
Tap to flip back
The EU CDSM Directive (Art. 4) allows TDM for commercial purposes but grants rights-holders an explicit opt-out mechanism (machine-readable reservation). US fair use has no equivalent opt-out; it is a case-by-case balancing test. A corpus legally assembled under US fair use arguments may still be actionable in the EU if rights-holders have posted opt-out notices, creating jurisdiction-fragmented legal risk for globally deployed models.
Tap to flip back
It is a category label, not provenance. Real provenance requires: the specific WARC record identifier, the crawl snapshot date, the source URL, and the licence-at-collection-time for that URL. "Common Crawl" alone gives you none of these per-document details, making it impossible to audit legal status, trace benchmark contamination, or reprocess with updated filters.
Tap to flip back
A string is k-eidetically memorised if the model reproduces it verbatim given only a short prompt prefix, AND the string appears at most k times in the training corpus. Low k (even k=1) means the model overfitted to a single document rather than learning a pattern. Formalised by Carlini et al. (2021).
Tap to flip back
Memorisation probability grows log-linearly with the number of times a sequence is duplicated in the training corpus: each doubling of the repetition count adds a roughly constant increment to the probability that the model will reproduce it verbatim on an extraction probe. Carlini et al. (2022) reported the same log-linear pattern for model scale and for prompt context length.
Tap to flip back
Models trained on deduplicated data emitted memorised text approximately ten times less frequently than models trained on the raw corpus. Deduplication also improved held-out perplexity and reduced the number of training steps needed to reach the same accuracy, because every gradient step carried novel signal.
Tap to flip back
MinHash+LSH is a document-level fuzzy method: fast and scalable to hundreds of billions of tokens on a CPU cluster, but it misses verbatim substring copies embedded inside otherwise-distinct documents, and can produce false positives. Suffix-array deduplication is precise and catches exact substring matches of any length, but is computationally expensive at web scale. Most large pipelines run MinHash first, then an optional suffix-array pass for residual verbatim duplicates.
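To illustrate the MinHash half of that pipeline, here is a stdlib-only sketch that estimates Jaccard similarity from signatures. It assumes word 3-gram shingles and 128 seeded SHA-1 hashes, both arbitrary choices, and it omits the LSH banding step that real pipelines use to avoid comparing every pair.

```python
import hashlib


def shingles(text, k=3):
    """Set of word-level k-shingles (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items, num_hashes=128):
    """One minimum per seeded hash function; items must be non-empty."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature slots that agree approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    return len(a & b) / len(a | b)


doc_a = shingles("the quick brown fox jumps over the lazy dog near the river bank")
doc_b = shingles("the quick brown fox jumps over the lazy dog near the river shore")
doc_c = shingles("completely different text about tokenizer vocabulary sizes and merges")

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(exact_jaccard(doc_a, doc_b), estimate_jaccard(sig_a, sig_b))
print(exact_jaccard(doc_a, doc_c), estimate_jaccard(sig_a, sig_c))
```

Because the comparison is over whole-document shingle sets, a short verbatim passage buried in two otherwise different documents barely moves the estimate, which is the gap the suffix-array pass is meant to close.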
Tap to flip back
In low-resource languages, near-duplicate documents may be the only available training signal for a given topic or linguistic structure. A Jaccard similarity threshold calibrated on English (where duplicates are genuinely redundant) will strip most of the content for these languages, leaving the model with almost no exposure to them. Per-language or per-domain threshold calibration is needed to avoid this collapse.
Tap to flip back
No. Deduplication removes repeated copies of a document, but a PII-containing document that appears only once survives deduplication intact. At sufficient model scale, singletons can still be extracted verbatim if the inference context is long enough (demonstrated by Carlini et al. on GPT-2). Deduplication reduces memorisation pressure but is not a substitute for explicit PII detection and scrubbing.
Tap to flip back
Semantic deduplication embeds documents with a pretrained encoder, clusters them by cosine similarity, and retains one representative per cluster. It captures near-semantic-equivalence that Jaccard-based MinHash misses. D4 showed that pairing MinHash deduplication with semantic diversification produced a 20% training efficiency gain and up to 2 percentage-point improvements in downstream task accuracy at 6.7B parameter scale.
Tap to flip back
Evaluation & MLOps
3 concept(s)
MMLU is saturated - frontier models score 88-92% against a ~90% human-expert ceiling. Once a benchmark sits at the ceiling, deltas track prompt-engineering effort, harness choice, and overfitting rather than real capability. Hugging Face showed the same model can score 49% or 64% on MMLU depending on the evaluation harness (formatting, log-likelihood normalisation, answer-extraction logic). A leaderboard score without a harness footnote is roughly meaningless.
Tap to flip back
Internet-scraped pretraining data overlaps with the public test sets of every popular benchmark. Models can recognise test questions verbatim or paraphrased. Sainz et al. (EMNLP 2024) argue NLP evaluation is "in trouble" because contamination silently inflates scores. Labs can decontaminate at training time but external readers cannot verify it - they have no access to the training corpus. This is why the field keeps minting new benchmarks faster than models saturate them.
Tap to flip back
Original HumanEval averages fewer than 10 tests per problem, so buggy code passes routinely. EvalPlus increases the test count 80x on the same prompts and pass rates drop by 19-29% across most models. Many earlier claims of "human-level coding" were artefacts of weak test suites. SWE-bench raised the bar further by demanding real GitHub-issue fixes end-to-end - the closest open benchmark to "can this model be a junior engineer."
Tap to flip back
Treat any single public benchmark as a leading indicator, not a verdict. Track a basket of three or four current frontier benchmarks (GPQA-Diamond, SWE-bench, FrontierMath, etc.) plus your own internal eval set on real production traffic. Public scores predict general capability; your eval predicts whether the model handles your specific distribution. Decisions ride on the latter.
Tap to flip back
OpenAI o-series and similar test-time-compute models can spend more tokens (and more wall-clock time) per question to raise their score. Comparing them to a one-shot model at the same accuracy column ignores the cost asymmetry - the reasoning model might be paying 50x the tokens. Report cost-per-correct or fix the token budget across models if you want a meaningful comparison.
Tap to flip back
HELM measures accuracy, calibration, robustness, fairness, bias, toxicity, efficiency for each scenario. The multi-axis view matters because a model that wins accuracy can lose calibration badly - the failure mode that triggers production incidents (confidently wrong on a tail slice). Robustness reveals which models break under typos and paraphrase. Efficiency separates "frontier capability you can afford" from "frontier capability you can demo." Single-number leaderboards are gameable; a seven-axis report card is much harder to fake.
Tap to flip back
Before HELM, models had been evaluated on average against just 17.9% of the standard scenarios, with little overlap between papers. HELM evaluated 30 models on 42 scenarios using a common matrix, forcing apples-to-apples comparison. The framing (one scenario, seven axes) was less novel than the discipline of running every model on every scenario the same way - reproducibility as a coordination service.
Tap to flip back
EleutherAI's lm-evaluation-harness supports 60+ standard benchmarks with hundreds of subtasks and is the backend for Hugging Face's Open LLM Leaderboard. It runs against HuggingFace transformers, vLLM, OpenAI and Anthropic APIs. If you roll your own, your numbers will not be comparable to anything published - a serious problem when you need to defend "we are 4 points behind frontier on GPQA" to leadership. Extend the harness instead.
Tap to flip back
A safety-tuned model refuses borderline tasks that the unaligned baseline would attempt (and sometimes get right). Without the joint view, you would misattribute the accuracy regression to a capability loss - it is actually a deliberate behavioural trade-off. HELM's combined axes let you see that calibration and toxicity improved while accuracy dipped, which is the report you want before approving a release.
Tap to flip back
Full HELM coverage costs real API credits and GPU-hours per model. Most teams run a curated subset focused on the axes they ship against (accuracy + calibration + robustness for production work; add fairness and toxicity for consumer apps). The benchmark also drifts: scenarios saturate within 18 months and have to be retired or replaced, so "full HELM" itself is a moving target.
Tap to flip back
- Registered model: a named container for a logical model (e.g. support-triage-classifier).
- Model version: an immutable, auto-numbered checkpoint.
- Lineage links: pointers to the training run, dataset hash, and code revision.
- Aliases / stages: mutable references like @champion, staging, production for deployment workflow.
MLflow, W&B Registry, Vertex AI, and SageMaker all implement this shape. The discipline matters more than the tool.
Tap to flip back
- Which data: exact dataset version, including filtering, dedup, tokenisation. Content hash, not a v2 label.
- Which code: git SHA of the training repo plus pinned dependency lockfile.
- Which hyperparameters: full config - seed, LR schedule, batch size, optimiser state, preprocessing flags.
- Which compute: hardware, driver versions, numerical-precision settings (bf16 vs fp16 vs fp32 changes downstream metrics).
Miss any axis and reproducibility silently collapses six months later when you try to retrain.
Tap to flip back
Git tracks code well and data poorly - large binary blobs explode the repo and diffs are meaningless. DVC (Data Version Control) stores data in cloud object storage (S3, GCS, Azure) and keeps pointers in git, so git checkout of an old commit gives you the matching data revision. It also defines pipelines for dependency-ordered re-runs. For teams already in git, DVC is the lowest-friction route to reproducible data.
Tap to flip back
A model card (Mitchell et al. 2018) is a short document accompanying every model release: intended use, training data, evaluation results stratified by slice, known limitations. Hugging Face renders the repo README.md as the card via structured YAML metadata. Even for internal-only models, write one - it is the document your auditor and your future self will ask for when an incident lands six months later and the original engineer has left.
Tap to flip back
Lineage breaks at the data source boundary. If your training data comes from a live API or a constantly-updating warehouse, "version v2" only pins your local copy - the upstream data underneath has moved. True reproducibility requires snapshotting the source (a frozen Parquet dump, a content-hashed pull) at training time, not just versioning what you ingested. Otherwise re-running the same code on "the same data" silently retrains on a different dataset.
Tap to flip back
Inference Optimisation
3 concept(s)
Without a cache, generating token t+1 re-runs the forward pass over all t prior tokens and recomputes their K and V at every step, so the projection work alone sums to O(n^2) FLOPs over a length-n sequence, with O(n^2) HBM reads to match. With a KV cache, you stash per-layer K and V for every emitted token, then each step only computes K / V for the single new token and reads the rest. Per-step K / V computation collapses from O(n) to O(1), so that work becomes O(n) over the whole sequence; the new token's query still attends over every cached position, so attention itself remains O(n) per step.
Tap to flip back
bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes
The leading 2 is K + V. The whole point of Grouped-Query Attention (n_kv_heads << n_query_heads) and Multi-Query Attention (n_kv_heads = 1) is to shrink this number. DeepSeek's MLA goes further by storing a low-rank latent that K and V are projected from. Concretely: Llama-3-70B (80 layers, head_dim 128, 16-bit cache) at 128k context is 40 GiB with GQA-8 vs ~320 GiB with full 64-head MHA - the difference between a cache that fits within one 80 GB H100 and one that has to be spread across several.
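The formula is easy to check by hand. The sketch below plugs in dimensions assumed from the published Llama-3-70B configuration (80 layers, 64 query heads, 8 KV heads, head_dim 128) with a 16-bit cache; treat those numbers as assumptions if you are sizing a different model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """bytes = 2 (K and V) * layers * KV heads * head_dim * tokens * bytes per value."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes


GIB = 1024 ** 3
SEQ_LEN = 128 * 1024  # 128k tokens

# Assumed Llama-3-70B dimensions; the cache dtype is assumed to be bf16/fp16.
gqa_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
mha_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=SEQ_LEN)

print(gqa_bytes / GIB)  # 40.0  (GQA with 8 KV heads)
print(mha_bytes / GIB)  # 320.0 (one KV head per query head)
```

The cache scales linearly in every factor, so halving dtype_bytes (an 8-bit cache) or the context length halves the footprint.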
Tap to flip back
On modern GPUs, each decode step reads the entire KV cache plus the model weights to emit one token, so attention during decode is memory-bandwidth-bound, not compute-bound. With Llama-3-70B at 32k context on H100 (3.35 TB/s HBM), a single stream upper-bounds at roughly 335 tokens/sec just from cache traffic. The cache directly caps batch size (concurrent users per HBM budget) and per-stream throughput.
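A back-of-the-envelope version of that bound, as a sketch: tokens per second cannot exceed HBM bandwidth divided by the bytes each step must read. The model dimensions are the same assumed Llama-3-70B values (80 layers, 8 KV heads, head_dim 128, 16-bit cache), and the fp16 weight size is an assumption used only to show how much weight traffic lowers the ceiling, ignoring whether those weights fit on one device.

```python
def decode_rate_upper_bound(bytes_read_per_step, hbm_bytes_per_sec):
    """Tokens/sec ceiling for one stream if each step streams these bytes from HBM."""
    return hbm_bytes_per_sec / bytes_read_per_step


H100_HBM_BYTES_PER_SEC = 3.35e12

# KV cache at 32k context: 2 * layers * kv_heads * head_dim * tokens * 2 bytes = 10 GiB.
kv_cache_bytes_32k = 2 * 80 * 8 * 128 * (32 * 1024) * 2

# Assumption for illustration: 70e9 parameters stored in 16 bits.
weight_bytes_fp16 = 70e9 * 2

cache_only = decode_rate_upper_bound(kv_cache_bytes_32k, H100_HBM_BYTES_PER_SEC)
cache_plus_weights = decode_rate_upper_bound(
    kv_cache_bytes_32k + weight_bytes_fp16, H100_HBM_BYTES_PER_SEC
)
print(f"cache traffic only: {cache_only:.0f} tokens/sec")
print(f"cache plus weights: {cache_plus_weights:.0f} tokens/sec")
```

Counting the cache as 10 GiB (about 10.7 GB) gives a ceiling a little above 310 tokens/sec; rounding it to 10 GB gives the 335 figure. Either way, adding weight reads pulls the single-stream ceiling down by an order of magnitude, which is why batching many streams per weight read matters so much.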
Tap to flip back
Classic allocators reserve a contiguous slab for each sequence's max length, wasting 60-80% of HBM on fragmentation and reservation. PagedAttention (Kwon et al., SOSP 2023) borrows the OS virtual-memory trick: split the cache into fixed-size blocks (~16 tokens), keep a per-sequence block table mapping logical positions to physical blocks, and allocate on demand. Waste drops under 4% and you serve 2-4x more concurrent requests on the same hardware. This is the enabler that makes vLLM's continuous batching actually work in HBM.
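The block-table idea can be sketched in a few lines. This toy allocator only tracks bookkeeping (which physical block holds which logical position); it is not vLLM's implementation, and the class and method names are hypothetical.

```python
class PagedKVAllocator:
    """Toy block-table bookkeeping: blocks are allocated on demand, not up front."""

    def __init__(self, num_physical_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def locate(self, seq_id, position):
        """Map a logical token position to (physical block id, offset in block)."""
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = PagedKVAllocator(num_physical_blocks=4, block_size=16)
for _ in range(17):                 # 17 tokens need two 16-token blocks
    allocator.append_token("req-a")
print(allocator.block_tables["req-a"], allocator.locate("req-a", 16))
```

At most one partially filled block per sequence is wasted, instead of a whole max-length reservation.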
Tap to flip back
Prefix caching hashes block contents (plus the prefix that produced them) so any incoming request can reuse a matching prefix already in HBM. Big wins: chat with long system prompts, RAG with shared retrieved chunks, few-shot prompts. No win: decode-heavy workloads where every request has a unique prompt - you pay the bookkeeping overhead for nothing. Pin known-hot prefixes to a longer-lived eviction tier so they survive normal LRU pressure.
Tap to flip back
At ~6B+ params, a small number of activation channels develop values 10-100x larger than the rest (Dettmers, 2022). Round those channels to INT8 and you clip them; round everything else to fit them and you lose all useful precision in the bulk.
- SmoothQuant migrates the difficulty into the weights via per-channel rescaling Y = (X / s) @ (s * W) - weights absorb the scale, activations become uniform.
- AWQ observes only ~1% of weight channels (those aligned with large activation channels) are critical, and protects them with per-channel scaling before quantising the rest to 4-bit.
- GPTQ uses the inverse Hessian to quantise weight columns sequentially while compensating with the rest - SOTA for 3-4 bit weight-only.
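The SmoothQuant identity above can be checked numerically. This sketch uses tiny hand-written matrices and the per-channel scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5, which is the form described in the SmoothQuant paper as far as I recall; the matrices and helper names are made up for illustration, and no actual quantisation is performed.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Activations (tokens x channels); channel 1 is the outlier channel.
X = [[0.5, 40.0, -0.2],
     [-0.3, -60.0, 0.4]]
# Weights (channels x outputs).
W = [[0.2, -0.1],
     [0.05, 0.03],
     [-0.4, 0.3]]

alpha = 0.5
n_channels = len(W)
act_max = [max(abs(row[j]) for row in X) for j in range(n_channels)]
w_max = [max(abs(v) for v in W[j]) for j in range(n_channels)]
s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_channels)]

X_smooth = [[row[j] / s[j] for j in range(n_channels)] for row in X]  # X / s
W_smooth = [[s[j] * v for v in W[j]] for j in range(n_channels)]      # s * W

Y = matmul(X, W)
Y_smooth = matmul(X_smooth, W_smooth)
smooth_max = [max(abs(row[j]) for row in X_smooth) for j in range(n_channels)]

print("activation range before:", act_max)
print("activation range after: ", [round(v, 3) for v in smooth_max])
print("outputs match:", all(
    abs(a - b) < 1e-9 for ra, rb in zip(Y, Y_smooth) for a, b in zip(ra, rb)
))
```

The output is unchanged, but the spread between the largest and smallest activation channel shrinks, so a single INT8 scale wastes far less precision.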
Tap to flip back
Two FP8 formats introduced on Hopper / Blackwell tensor cores at 2x FP16 throughput:
- E4M3 (4 exponent, 3 mantissa) - tighter range, more precision. Used for weights and activations.
- E5M2 (5 exponent, 2 mantissa) - wider range, less precision. Used for gradients.
With per-tensor (better: per-block) scaling and keeping norms / final logits in higher precision, quality is generally indistinguishable from bf16. Llama-3, DeepSeek-V3, and most production deployments now serve in FP8 by default.
Tap to flip back
Decode is HBM-bandwidth-bound, so reducing what you read is what speeds things up. Weight-only INT8/INT4 dequantises on-the-fly into fp16 just before the matmul - the matmul itself runs in fp16 (no accuracy regression from numerics) but you read 2-4x less from HBM. Activation quantisation is where quality degrades fastest because of outliers, and KV-cache quantisation is the next frontier (INT4 KV cuts cache memory 4x at 1-3 pt quality loss, worth it for long context).
Tap to flip back
- H100 / B100 hardware: FP8 first - half memory, ~2x throughput, no measurable quality loss.
- Fitting 70B+ on one GPU: INT4 weight-only via AWQ or GPTQ - kernels mature in vLLM, TensorRT-LLM, llama.cpp.
- CPU / Apple Silicon: GGUF Q4_K_M or Q5_K_M - llama.cpp's k-quants are calibrated for Metal / CPU.
- Sub-4-bit (2-3 bit): only with QAT or QuIP#; PTQ below 4 bits gets ugly fast.
Long-context reasoning models suffer more from quantisation than chat models because errors compound across thousands of decoded tokens - measure on real tasks, not short-answer perplexity.
Tap to flip back
Two structural problems:
- Padding waste - a batch of one 4k prompt and seven 200-token prompts spends about 83% of its token slots (and roughly 87% of its quadratic attention FLOPs) on pad tokens.
- Tail blocking - the whole batch finishes when the longest generation finishes; a 4000-token completion holds up seven 50-token completions.
Anyscale's measurements show static batching achieves roughly 1/20th the throughput of continuous batching on the same hardware. The fix is to schedule at the decoding-step level, not the request level.
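The padding figure is straightforward to work out for any batch. The sketch below assumes every prompt is padded to the longest one in the batch and that attention cost grows with the square of sequence length; the exact percentage depends on which of the two quantities you count.

```python
def padding_waste(prompt_lens):
    """Return (wasted fraction of token slots, wasted fraction of attention FLOPs)."""
    padded_len = max(prompt_lens)
    batch = len(prompt_lens)
    token_waste = 1 - sum(prompt_lens) / (padded_len * batch)
    attn_real = sum(n * n for n in prompt_lens)   # work on real tokens only
    attn_padded = batch * padded_len * padded_len  # work actually executed
    attn_waste = 1 - attn_real / attn_padded
    return token_waste, attn_waste


lens = [4096] + [200] * 7  # one 4k prompt, seven 200-token prompts
token_waste, attn_waste = padding_waste(lens)
print(f"token slots wasted:     {token_waste:.1%}")
print(f"attention FLOPs wasted: {attn_waste:.1%}")
```

A batch of equal-length prompts gives zero waste on both measures, which is why length-bucketing helps static batching but does nothing for tail blocking.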
Tap to flip back
From the Orca paper (Yu et al., OSDI 2022), each iteration the scheduler:
- Picks all currently in-flight sequences.
- Runs one decode step per sequence in parallel.
- Removes any sequence that emitted EOS or hit max length.
- Admits new sequences from the queue if HBM has free blocks.
A finished request frees its slot immediately; a new request joins on the next iteration. To stop a long prefill from blocking ongoing decodes, vLLM / SGLang / TensorRT-LLM use chunked prefill - slice long prompts and interleave them with decode steps.
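The scheduler loop above can be sketched as a toy simulation. This is not Orca's or vLLM's implementation: `max_slots` stands in for the free-HBM-block check, prefill is ignored, and each request is reduced to a count of tokens left to generate.

```python
from collections import deque

def continuous_batching(requests, max_slots):
    """Simulate iteration-level scheduling.

    requests: list of (request_id, tokens_to_generate).
    max_slots: assumed stand-in for "HBM has free blocks".
    Returns {request_id: iteration at which it finished}.
    """
    queue = deque(requests)
    in_flight = {}        # request_id -> tokens still to generate
    finished_at = {}
    iteration = 0
    while queue or in_flight:
        # Admit queued sequences while there is capacity.
        while queue and len(in_flight) < max_slots:
            rid, n = queue.popleft()
            in_flight[rid] = n
        iteration += 1
        # One decode step for every in-flight sequence.
        for rid in list(in_flight):
            in_flight[rid] -= 1
            if in_flight[rid] <= 0:          # emitted EOS or hit max length
                del in_flight[rid]           # slot is freed immediately
                finished_at[rid] = iteration
    return finished_at

done = continuous_batching([("long", 8), ("a", 2), ("b", 2), ("c", 3)], max_slots=2)
print(done)
```

The short requests finish at iterations 2, 4 and 7 while the long one is still running, instead of all waiting for iteration 8.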
Tap to flip back
- TTFT (time to first token) improves vs static because requests no longer wait for a batch window to fill.
- TPOT (time per output token) can get worse under heavy load because each decode step reads more KV cache (more concurrent sequences sharing the step).
- System throughput improves dramatically - the headline 20x+ Anyscale number.
Tuning knobs: max_num_seqs caps concurrency (too high blows TPOT past SLO), max_num_batched_tokens caps prefill burst, gpu_memory_utilization reserves HBM share when other processes live on the GPU.
Tap to flip back
| Stack | Pick it when |
|---|---|
| vLLM | One decision that works for most chat / RAG; best general-purpose throughput; ships PagedAttention + prefix caching |
| TensorRT-LLM | NVIDIA-only fleet, an engineer to handle the build, you want the absolute peak H100 / H200 throughput with FP8 |
| SGLang | Heavy structured-output or shared-prefix workloads (RadixAttention, fast constrained decoding) |
| llama.cpp | Local, Apple Silicon, CPU, or edge - GGUF quantisation, runs anywhere, single-stream focus |
Hugging Face TGI is the easy-deploy fallback but trails vLLM and TRT-LLM on throughput.
Tap to flip back
Mathematical Foundations
4 concept(s)
Composition of linear maps is linear: (A B) x = A (B x) is itself a single linear map. So n stacked linear layers without an activation are mathematically equivalent to one matrix A_n ... A_2 A_1. The non-linearity (ReLU, GELU, SiLU) is the only thing buying you expressive depth - without it, depth is free parameters with zero added capacity.
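A tiny numeric check of the collapse, using hand-picked 2x2 matrices (the values are arbitrary assumptions chosen so the arithmetic is easy to follow).

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(a, x):
    return [sum(w * v for w, v in zip(row, x)) for row in a]

def relu(v):
    return [max(0.0, t) for t in v]

A1 = [[1.0, 2.0], [0.0, 1.0]]
A2 = [[3.0, 0.0], [1.0, 1.0]]
x = [2.0, -1.0]

stacked = matvec(A2, matvec(A1, x))         # two linear layers, no activation
collapsed = matvec(matmul(A2, A1), x)       # the single equivalent matrix A2 A1
with_relu = matvec(A2, relu(matvec(A1, x))) # a ReLU in between breaks the equivalence

print(stacked, collapsed, with_relu)
```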
Tap to flip back
Every real (m, n) matrix factorises as:
A = U S V^T
- U is (m, m) orthonormal - left singular vectors (output directions).
- V is (n, n) orthonormal - right singular vectors (input directions).
- S is (m, n) diagonal with non-negative singular values sorted descending - how much A stretches each direction.
Rank of A is the count of non-zero singular values. SVD always exists; eigendecomposition does not.
Tap to flip back
Empirically, the weight updates dW produced by fine-tuning a large pretrained model have fast-decaying singular spectra - most of the change lives in a tiny subspace. Hu et al.'s LoRA paper showed rank 8 is usually enough to capture it. So you train B A where B is (d, r) and A is (r, d), freeze the original weights, and pay a few hundred MB instead of 140 GB. A related redundancy argument is often offered for why int4 quantisation barely hurts accuracy, though that is a looser analogy than a shared mechanism.
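A quick parameter count for one square layer, as a sketch. The hidden size d = 4096 is an assumption for illustration; r = 8 is the rank mentioned in the card.

```python
def lora_param_counts(d: int, r: int):
    """Trainable parameters for a dense (d, d) update versus a rank-r factorisation."""
    full = d * d        # dense dW
    lora = 2 * d * r    # B is (d, r), A is (r, d)
    return full, lora

full, lora = lora_param_counts(d=4096, r=8)
print(f"dense: {full:,}  LoRA: {lora:,}  ratio: {full // lora}x")
```

The saving scales as d / (2r), so it grows with layer width at a fixed rank.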
Tap to flip back
An H100 delivers ~990 BF16 TFLOPs but only ~3 TB/s HBM bandwidth. Modern GPUs are arithmetic-intensity machines - they need many FLOPs per byte loaded. So small matmuls waste compute waiting on HBM, big fused matmuls saturate the tensor cores, and quantisation helps even when arithmetic precision is unchanged because there is less data to move. This is why FlashAttention fuses kernels and why activation memory (not weights) caps training batch size.
Tap to flip back
Almost always, unless you have a symmetric positive-definite matrix. Eigendecomposition only exists for square (and diagonalisable) matrices, and its eigenvectors can be complex, non-orthogonal, or numerically unstable when nearly parallel. SVD exists for every real matrix, always has orthonormal singular vectors, and gives you the best rank-r approximation in Frobenius norm via Eckart-Young. PCA, LoRA, low-rank attention, and quantisation calibration all sit on SVD.
Tap to flip back
H(p, q) = - sum_k p_k log q_k
KL(p || q) = H(p, q) - H(p) = sum_k p_k log (p_k / q_k)
Cross-entropy is the bits you actually spend encoding samples from p with a code optimised for q. KL is the excess over the optimal code H(p). KL is non-negative, zero iff p = q, and not symmetric - this asymmetry is why forward KL is mode-covering and reverse KL is mode-seeking.
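A minimal sketch that checks the identity KL = H(p, q) - H(p) and the asymmetry on a small example. It uses natural logs (nats rather than bits), and the two distributions are arbitrary assumptions.

```python
import math

def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def kl(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

print("KL(p||q)      =", kl(p, q))
print("H(p,q) - H(p) =", cross_entropy(p, q) - entropy(p))
print("KL(q||p)      =", kl(q, p))   # differs from KL(p||q): not symmetric
```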
Tap to flip back
Two reasons. First, softmax is the maximum-entropy distribution given linear logit constraints - it falls out of exponential-family theory. Second, the joint gradient is beautiful:
dL/dlogits = q - p
No exponentials, no divisions, no special cases. Numerically stable, trivially differentiable, and every framework fuses the two ops into one kernel. That is why every classification loss, every autoregressive LM loss, and every policy-gradient log-prob term reduces to this same expression.
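The q - p result can be checked numerically. The sketch below compares it with a central-difference gradient of the softmax cross-entropy loss; the logits and the one-hot target are arbitrary assumptions.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def loss(z, p):
    """Cross-entropy between target p and softmax(z)."""
    q = softmax(z)
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q))

def numeric_grad(z, p, h=1e-5):
    grads = []
    for i in range(len(z)):
        zp = z[:]; zp[i] += h
        zm = z[:]; zm[i] -= h
        grads.append((loss(zp, p) - loss(zm, p)) / (2 * h))
    return grads

z = [2.0, -1.0, 0.5]
p = [0.0, 1.0, 0.0]
analytic = [qk - pk for qk, pk in zip(softmax(z), p)]   # dL/dlogits = q - p
numeric = numeric_grad(z, p)
print(analytic)
print(numeric)
```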
Tap to flip back
KL(p_data || p_model) (forward) penalises the model for putting low mass where data has high mass - mode-covering. Maximum-likelihood training minimises this. KL(p_model || p_ref) (reverse) penalises the policy for putting high mass where the reference is near zero - mode-seeking. RLHF uses reverse KL because you want the policy to stay inside the reference's support; forward KL would let the policy wander into regions the reference never visits.
Tap to flip back
Perplexity is exp(cross_entropy_in_nats). A perplexity of 20 means the model is, on average, as confused as if it had to choose uniformly among 20 tokens at each step. Lower is better. Because of the exponential, a drop from 30 to 20 is a much bigger capability jump than 100 to 90 - mind the scale when comparing benchmark numbers.
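A short sketch of the definition and of the scale argument. `perplexity` is a hypothetical helper that takes the probabilities the model assigned to the observed tokens.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood (in nats) of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that always gives the right token probability 1/20 has perplexity 20.
print(perplexity([1 / 20] * 10))

# Cross-entropy saved (nats per token) by each drop in perplexity.
print("30 -> 20 :", math.log(30 / 20))
print("100 -> 90:", math.log(100 / 90))
```

Comparing the log ratios rather than the raw differences is what makes the two drops comparable.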
Tap to flip back
KL(p || q) = sum_k p_k log (p_k / q_k)
If q(x) = 0 anywhere p(x) > 0, the log term diverges. On empirical samples, zero counts in q are common. The fix is to smooth q (add a small epsilon, Laplace smoothing, or a fitted parametric distribution) before computing KL. Same care applies to log-likelihood evaluations against held-out data containing unseen tokens.
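A minimal sketch of the additive-smoothing fix on raw counts. The `eps` value is an assumption to tune per use case, and `smoothed_kl` is a hypothetical helper.

```python
import math

def smoothed_kl(p_counts, q_counts, eps=1e-6):
    """KL(p || q) from raw counts, with eps added to every q count before normalising."""
    p_total = sum(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    total = 0.0
    for pc, qc in zip(p_counts, q_counts):
        if pc == 0:
            continue                      # 0 * log(0 / q) contributes nothing
        pk = pc / p_total
        qk = (qc + eps) / q_total         # never exactly zero
        total += pk * math.log(pk / qk)
    return total

# q has a zero count where p has mass; unsmoothed KL would be infinite.
print(smoothed_kl([5, 3, 2], [6, 4, 0]))
```

The result is finite but depends strongly on `eps`, so report the smoothing constant alongside the number.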
Tap to flip back
For a function R^n -> R, forward mode costs O(n) evaluations to compute all partial derivatives, reverse mode costs O(1) extra passes regardless of n. With 70 billion parameters and a scalar loss, forward mode would need 70 billion forward passes per gradient. Reverse mode (backprop) does it in one backward pass. The asymmetry is the entire reason no major framework uses forward mode for training.
Tap to flip back
Three reasons:
- Memory. A 70B-parameter model has a (7e10, 7e10) Hessian - about 5e21 floats, roughly 2e22 bytes in FP32, many orders of magnitude beyond any cluster's memory.
- Compute. Each Hessian-vector product costs comparable to a backward pass, and Newton needs many per step.
- Non-convexity. Vanilla Newton converges to stationary points, and in saddle-dominated landscapes it heads straight for the nearest saddle.
K-FAC, Shampoo, Sophia exist but AdamW (cheap diagonal preconditioner) wins almost every workload.
Tap to flip back
It rescales the gradient when its L2 norm exceeds a threshold:
if ||grad||_2 > threshold:
    grad = grad * (threshold / ||grad||_2)
This caps step magnitude while preserving direction - prevents loss spikes from exploding gradients. Threshold of 1.0 is the standard transformer default. It does not fix vanishing gradients - you cannot rescale a zero. Vanishing needs architectural fixes (residual connections, LayerNorm, gating). If clipping triggers constantly, your LR is too high or your init is bad.
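The rule can be sketched in plain Python on a flat list of gradient values. Frameworks apply the same idea across all parameter tensors at once; `clip_by_global_norm` here is a hypothetical helper, not a framework call.

```python
import math

def clip_by_global_norm(grad, threshold=1.0):
    """Rescale grad so its L2 norm is at most threshold; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        grad = [g * scale for g in grad]
    return grad, norm

clipped, original_norm = clip_by_global_norm([3.0, 4.0], threshold=1.0)
print(original_norm, clipped)

# A zero gradient passes through untouched: clipping cannot revive it.
print(clip_by_global_norm([0.0, 0.0]))
```

Returning the pre-clip norm is useful for logging how often clipping triggers.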
Tap to flip back
At a critical point (grad = 0), the second-order Taylor expansion is governed by the Hessian's eigenvalues:
- All positive (positive-definite): local minimum.
- All negative: local maximum.
- Mixed signs: saddle point.
In high dimension, the probability that all n eigenvalues share a sign drops roughly as 2^{-n} at a random critical point (treating each eigenvalue's sign as an independent coin flip). So saddles dominate the loss landscape, plateaus during training usually reflect slow saddle escape rather than true minima, and SGD's noise is the mechanism that kicks the optimiser off the ridge.
Tap to flip back
Use central differences with double precision:
grad_approx_i = (f(x + h e_i) - f(x - h e_i)) / (2 h)
with h = 1e-5. Central differences have O(h^2) truncation error vs O(h) for forward differences. Too small h and floating-point noise dominates; too large and truncation dominates. Compare against autodiff per-parameter and look for relative error below 1e-5. Use this on tiny networks to catch custom-layer bugs; do not gradient-check a 70B model.
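A sketch comparing forward and central differences against a known analytic gradient. The test function exp(x0) * sin(x1) and the evaluation point are assumptions chosen only because the derivative is easy to write down; Python floats are already double precision.

```python
import math

def forward_diff_grad(f, x, h=1e-5):
    fx = f(x)
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        grad.append((f(xp) - fx) / h)
    return grad

def central_diff_grad(f, x, h=1e-5):
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

def max_relative_error(a, b, tiny=1e-12):
    return max(abs(u - v) / max(abs(u), abs(v), tiny) for u, v in zip(a, b))

f = lambda x: math.exp(x[0]) * math.sin(x[1])
analytic = lambda x: [math.exp(x[0]) * math.sin(x[1]), math.exp(x[0]) * math.cos(x[1])]

x0 = [0.5, 1.0]
forward_err = max_relative_error(forward_diff_grad(f, x0), analytic(x0))
central_err = max_relative_error(central_diff_grad(f, x0), analytic(x0))
print(f"forward: {forward_err:.2e}  central: {central_err:.2e}")
```

In a real check, `analytic` would be replaced by the autodiff gradient of the layer under test.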
Tap to flip back
Naive softmax exp(x_i) / sum_j exp(x_j) overflows when any x_i > ~88 (FP32) or ~11 (FP16, whose maximum is 65504). Fix: subtract the max before exponentiating.
m = max(x)
softmax(x_i) = exp(x_i - m) / sum_j exp(x_j - m)
logsumexp(x) = m + log(sum_j exp(x_j - m))
log_softmax(x_i) = x_i - logsumexp(x)
Mathematically identical (the exp(-m) cancels). Largest exponent is now 0, no overflow, at least one term is exp(0) = 1 so no underflow. Always use the fused kernel; never write log(softmax(x)) by hand.
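A plain-Python sketch of the max-subtraction trick. Python floats are double precision, so the naive version only overflows above roughly 709 rather than 88; the logits below are chosen (as an assumption) to be large enough to trigger it.

```python
import math

def logsumexp(x):
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

def log_softmax(x):
    lse = logsumexp(x)
    return [v - lse for v in x]

def naive_softmax(x):
    e = [math.exp(v) for v in x]   # raises OverflowError for large inputs
    s = sum(e)
    return [v / s for v in e]

logits = [1000.0, 999.0, 998.0]
try:
    naive_softmax(logits)
except OverflowError:
    print("naive softmax overflowed")

probs = [math.exp(v) for v in log_softmax(logits)]
print(probs, sum(probs))
```

In a framework, prefer the built-in fused log-softmax or cross-entropy op over a hand-rolled version like this.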
Tap to flip back
A single optimiser step might change a weight by 1e-7 from a base of 0.1. In BF16 (7 mantissa bits), 0.1 + 1e-7 = 0.1 - the update is rounded away and learning stops. FP32 master weights accumulate small updates faithfully; the BF16 copy is refreshed each step from the FP32 master for cheap forward/backward compute. Same idea as FP16 loss scaling: keep precision where rounding would silently kill the gradient.
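The rounding effect can be reproduced without a GPU by emulating the formats with bit manipulation. This is a sketch: `to_bf16` and `to_f32` are hypothetical helpers, the rounding mode is assumed to be round-to-nearest-even, and NaN/inf are not handled.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value by keeping the top 16 bits of the float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias for ties-to-even
    bits &= 0xFFFF0000                    # sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

update = 1e-7
bf16_only = to_bf16(0.1)
master = to_f32(0.1)
for _ in range(10_000):
    bf16_only = to_bf16(bf16_only + update)   # update is rounded away every step
    master = to_f32(master + update)          # FP32 master accumulates it

print("BF16-only weight:        ", bf16_only)
print("BF16 copy of FP32 master:", to_bf16(master))
```

The BF16-only weight never leaves its starting value, while the copy refreshed from the FP32 master eventually moves to the next representable value.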
Tap to flip back
BF16 has the same 8-bit exponent as FP32 (~3.4e38 range) but only 7 mantissa bits. FP16 has 10 mantissa bits but a 5-bit exponent (max 65504, min normal ~6e-5). LLM training produces activations and gradients that span huge dynamic range; FP16 silently underflows them, which is why FP16 needed loss scaling. BF16 trades precision for range - precision is recoverable via FP32 master weights, range is not. Same memory cost, fewer NaN incidents.
Tap to flip back
- Capture state immediately. NaNs are sticky - save batch, model state, optimiser state at detection. The corrupted state itself is a clue.
- Walk forward with torch.autograd.detect_anomaly() to find the first op producing NaN.
- Check inputs to that op for inf, zero denominators, negatives under sqrt / log, large positives under exp.
- Check gradients - forward NaN often follows backward inf from the previous step.
Common culprits: log(0) from softmax tail, attention pre-softmax overflow in FP16, LR warmup too short, corrupt batch.
Tap to flip back
cuBLAS reduction order depends on tensor shapes and tile selection at kernel launch. Two identical runs produce loss values differing by ~1e-6 per step and significantly different checkpoints after many steps. Full determinism requires torch.use_deterministic_algorithms(True), torch.backends.cudnn.deterministic = True, the CUBLAS_WORKSPACE_CONFIG environment variable (e.g. :4096:8), pinned RNG seeds across Python/NumPy/PyTorch/CUDA, and no atomic ops in custom kernels. Throughput cost is 10-30%. Most production runs accept non-determinism; reproducibility comes from saved checkpoints and configs.
Tap to flip back
Applied LLMs
44 concept(s)
__global__ marks a function as a kernel: it is called from host (CPU) code but executes on the device (GPU). The caller uses the triple-angle-bracket syntax kernel<<<grid, block>>>(args) to launch it. Without __global__, the compiler treats the function as a normal CPU function.
Tap to flip back
Both branches are executed serially; the hardware uses a bitmask to suppress writes for threads not taking each path. A warp that splits 50/50 across an if/else effectively halves throughput for that section. This is called warp divergence. Keeping data-dependent branches at warp boundaries (so all 32 threads agree) avoids it.
Tap to flip back
Coalescing is when all 32 threads in a warp access a contiguous, aligned 128-byte block, allowing a single memory transaction to serve the whole warp. With stride-2 access, threads touch every other word, so the warp's addresses span two cache lines, forcing 2 transactions per warp and dropping effective bandwidth to roughly 50% of peak. Fully random access can require up to 32 transactions, reducing effective bandwidth to around 3% of peak.
Tap to flip back
A tile of the input matrices is loaded cooperatively into shared memory (L1-speed scratchpad). Each thread then reads from shared memory - not HBM - to accumulate partial dot products. The tile is advanced until the full dot product is computed. This raises arithmetic intensity from O(1) (naive) toward O(tile_size), pushing the kernel past the roofline ridge point so it becomes compute-bound rather than bandwidth-bound.
Tap to flip back
Occupancy is the ratio of active warps on a streaming multiprocessor (SM) to the maximum warps that SM supports. Low occupancy means fewer warps are available to hide memory latency during stalls. A common cause is register pressure: if each thread requires many registers, the fixed SM register file can only accommodate a small number of concurrent threads, limiting the number of resident blocks.
Tap to flip back
.item() copies a scalar from device to host, which forces a cudaDeviceSynchronize-style stall: the CPU blocks until all pending GPU work completes. This serialises the otherwise asynchronous GPU pipeline. In a tight loop (e.g., logging loss every step) it can cut GPU utilisation significantly. Solution: accumulate tensors on-device and call .item() only at a coarser logging frequency.
Tap to flip back
Triton is a Python-based DSL where you write the kernel logic in terms of blocks of elements rather than individual threads. The Triton compiler automatically handles shared-memory allocation, tiling, coalescing, and instruction selection. You still specify a launch grid (number of program instances), but you avoid manual threadIdx/blockIdx arithmetic and explicit shared-memory management. PyTorch's torch.compile backend (Inductor) generates Triton kernels internally for fused operator sequences.
Tap to flip back
Grid (whole GPU, one per launch), Block / CTA (one Streaming Multiprocessor), Thread (one CUDA core). The warp - 32 threads executing in lockstep - is the hardware's actual scheduling unit, sitting below the block level. Threads within a block share on-chip shared memory and can synchronise; threads across blocks cannot.
Tap to flip back
Global memory transactions are issued per warp. If all 32 threads access consecutive 32-bit words, the hardware merges them into one or two 128-byte transactions. If each thread accesses an arbitrary address, the hardware issues up to 32 separate transactions - 32x the bus traffic for the same amount of data. The fix is to arrange data layouts and index arithmetic so each warp's thread i accesses address base + i.
Tap to flip back
Tiling loads a sub-matrix from slow global memory into fast on-chip shared memory cooperatively (coalesced), then performs all dot-product accumulations within shared memory. It improves arithmetic intensity - FLOPs per byte of global memory traffic. A naive FP32 GEMM has ~2 FLOPs per 8 bytes (0.25 FLOPs/byte); with a TILE x TILE tile each loaded element is reused TILE times, so the intensity becomes about TILE/4 FLOPs/byte, moving the kernel from memory-bound toward compute-bound.
Tap to flip back
The SM register file, shared memory capacity, and maximum resident blocks all cap how many warps can reside simultaneously. Memory-bound kernels hide latency by switching to another warp while a stalled warp waits for DRAM. If occupancy is low, there are few alternative warps to switch to, and the SM stalls rather than doing useful work. Compute-bound kernels (e.g., Tensor Core GEMM) are less sensitive because the cores stay busy regardless.
Tap to flip back
Kernel fusion combines multiple logical operations (e.g., bias add + layer norm + activation) into a single CUDA kernel. Without fusion each operation reads its inputs from and writes its outputs to global memory (HBM). Fusion keeps intermediate values in registers or shared memory, trading several 200-400-cycle global memory round-trips for a single pass. PyTorch 2.0's Inductor backend generates fused Triton kernels automatically, achieving ~43% average training speedup on A100 across tested models.
Tap to flip back
CUDA Graphs record an entire sequence of kernel launches, memory copies, and their dependencies as a replayable graph object. Replaying it requires only one API call, removing per-iteration CPU-GPU command submission overhead (useful when kernel count is high and kernels are short). They suit inference because the computation graph is fixed - same shapes, same data pointers each run. Training changes shapes with variable batches, triggers gradient updates that alter data pointers, and uses dynamic control flow, all of which require re-capturing the graph.
Tap to flip back
Shared memory is organised into 32 banks (modern CUDA); consecutive 4-byte words map to successive banks. A bank conflict occurs when multiple threads in a warp access different addresses that fall in the same bank - accesses serialise. A common trap: a 2D array with 32 columns accessed column-major sends every thread to bank 0. The standard fix is to pad the array's inner dimension by one element (e.g., float tile[32][33]), shifting each row's bank mapping and eliminating the conflict with negligible extra memory use.
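The padding fix can be checked with a few lines of index arithmetic. This sketch models only the bank mapping (word index modulo 32) for a warp in which thread t reads tile[t][column]; it is not a hardware simulator.

```python
from collections import Counter

NUM_BANKS = 32  # consecutive 4-byte words map to successive banks

def worst_conflict(row_stride_words: int, column: int = 0, warp_size: int = 32) -> int:
    """Largest number of threads in the warp that land on the same bank."""
    banks = Counter((t * row_stride_words + column) % NUM_BANKS for t in range(warp_size))
    return max(banks.values())

print(worst_conflict(32))  # float tile[32][32]: prints 32 (every thread hits one bank)
print(worst_conflict(33))  # float tile[32][33]: prints 1 (every thread hits its own bank)
```

Any row stride that is odd gives the same conflict-free result; 33 is simply the smallest padding for a 32-wide tile.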
Tap to flip back
1 transaction. Thirty-two consecutive floats occupy 128 bytes, which is exactly one L2 cache line. If the base address is 128-byte aligned and threads access base + i*4, the hardware merges all 32 requests into a single transaction.
Tap to flip back
Each x field sits 12 bytes apart (one struct), so consecutive thread addresses have stride 3 floats rather than 1. The hardware cannot merge them into one transaction per warp.
Fix: convert to structure-of-arrays: float xs[N], ys[N], zs[N]. Now consecutive xs are stride-1 and the load is coalesced.
Tap to flip back
Every thread in the warp touched its own 32-byte sector - the warp was completely non-coalesced. For 4-byte loads the ideal is 4 sectors per request (four 32-byte sectors cover the warp's 128 bytes). Values between 4 and 32 indicate partial coalescing. The metric is l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum / l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum.
Tap to flip back
Each thread block loads a rectangular tile of the operand into shared memory using row-major, stride-1 reads from global memory - those reads are coalesced. The column-wise access pattern then happens entirely within shared memory, which has its own high-bandwidth banking system and does not require coalescing. Global memory is touched once per tile pass; shared memory absorbs the non-sequential pattern.
Tap to flip back
A warp of 32 threads each issuing a float4 load moves 32 x 16 = 512 bytes per instruction (covering four 128-byte cache lines). Compared to scalar float loads (128 bytes per instruction), this quadruples the bytes transferred per instruction issued, reducing instruction-issue overhead and keeping the memory bus saturated. Alignment must be 16 bytes per pointer.
Tap to flip back
Row 0 starts at offset 0 (aligned). Row 1 starts at offset 4000 bytes. 4000 / 128 = 31.25, so row 1's base is not 128-byte aligned. The hardware must issue a partial first transaction and a full second transaction to cover any 32-float warp load near the start of the row. Fix: pad each row to the next multiple of 32 floats (128 bytes), e.g., 1024 floats for this case, or use cuBLAS/Triton which pad internally.
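The alignment arithmetic can be sketched as follows, assuming 4-byte floats and 128-byte lines as in the card; both helpers are hypothetical.

```python
def padded_row_floats(row_floats: int, align_bytes: int = 128, elem_bytes: int = 4) -> int:
    """Round a row length up to the next multiple of one aligned line."""
    per_line = align_bytes // elem_bytes              # 32 floats per 128-byte line
    return -(-row_floats // per_line) * per_line      # ceiling division

def row_is_aligned(row_index: int, row_floats: int,
                   align_bytes: int = 128, elem_bytes: int = 4) -> bool:
    """True if the row's starting byte offset is a multiple of align_bytes."""
    return (row_index * row_floats * elem_bytes) % align_bytes == 0

print(row_is_aligned(1, 1000))    # row 1 at byte 4000 is not aligned
print(padded_row_floats(1000))    # padded pitch in floats
print(row_is_aligned(1, padded_row_floats(1000)))
```

The padded length is used only as the row pitch; the kernel still processes the original 1000 columns.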
Tap to flip back
With a single sequence the warp may have far fewer than 32 active threads doing useful work. Memory transactions are still issued for the full 128-byte cache line, but only a fraction of the fetched bytes contain useful data - effective bandwidth utilisation per byte drops in proportion to warp occupancy. Fix: continuous batching (used in vLLM and similar) packs multiple requests into one forward pass so that warps are more fully utilised, recovering both coalescing efficiency and arithmetic throughput.
Tap to flip back
Each thread independently fetches a full row of A and a full column of B from global memory. Threads in the same warp that work on adjacent columns of C all need the same row of A but fetch it separately, so the same data crosses the high-latency global memory bus N times instead of once. Tiling solves this by having a block cooperatively load a shared tile once.
Tap to flip back
- After cooperative load into shared memory: prevents any thread from reading the tile before all threads have finished writing their element into it (write-read race).
- After the inner dot-product loop: prevents any thread from overwriting the tile with the next iteration's data while other threads are still reading from it (read-write race).
Omitting either barrier produces silently wrong results.
Tap to flip back
Shared memory has 32 banks (4 bytes each). A bank conflict occurs when two or more threads in the same warp access different addresses that map to the same bank, serialising the accesses. For a 32-column row-major array, all threads reading the same column land on bank 0 - a 32-way conflict. The fix: declare the array with one extra column of padding (e.g. float A[32][33]), which offsets successive rows and scatters column accesses across distinct banks.
Tap to flip back
Larger tiles reduce global memory traffic proportionally (a BLOCK x BLOCK tile cuts traffic by BLOCK), but shared memory is finite per SM. Large tiles mean fewer thread blocks fit simultaneously on the SM, lowering occupancy. Low occupancy reduces the GPU's ability to hide memory latency by switching warps. The optimal tile size is workload- and hardware-specific; empirical tuning with a profiler (e.g. ncu) is necessary.
Tap to flip back
Triton exposes tiles as first-class objects. The programmer declares block pointers, calls tl.load to bring a tile into on-chip memory, computes on it with tensor operations, and calls tl.store. The Triton compiler analyses data dependencies across the tile, inserts the necessary barriers, handles bank-conflict avoidance, and manages register pressure - producing PTX that is competitive with hand-written CUDA for standard shapes like matrix multiplication.
Tap to flip back
cp.async is a CUDA PTX instruction that copies data directly from global memory into shared memory without staging through registers. This allows the GPU to pipeline data loading for the next tile while the current tile is being processed (software pipelining / double buffering). Older kernels that route the load through registers cannot overlap the two phases, leaving the memory subsystem idle during compute and vice versa. On Ampere and later, not using cp.async in latency-bound tiled kernels leaves measurable throughput on the table.
Tap to flip back
- Very small or non-divisible K (reduction dimension): boundary-guard predicates add instruction overhead, and the number of reuses per loaded element is too small to justify the complexity.
- Small batch / small matrix sizes (e.g. single-sample inference): the problem fits in L2 cache already, so repeated global memory fetches are not the bottleneck. A different kernel shape (e.g. a batched GEMV) or simply a cuBLAS call is more appropriate than a tiled GEMM.
Tap to flip back
Each kernel launch must materialise its output in global memory (HBM) so the next kernel can read it. Three kernels (row-max, exp+sum, normalise) each read and write the full matrix, producing ~3x the necessary HBM traffic. A fused kernel keeps the row-max and partial sums in registers or shared memory and never writes them to DRAM - only the final normalised values are written once.
Tap to flip back
Vertical (producer-consumer) fusion chains operations where one op's output feeds directly into the next - keeping intermediates in registers/SRAM instead of DRAM. Horizontal fusion merges independent ops that read the same input, reducing redundant loads. FlashAttention uses vertical fusion: it fuses the QK^T scaling, softmax, and weighted-sum-of-values into one tiled kernel pass, avoiding the multiple HBM round-trips of the unfused baseline.
Tap to flip back
- Registers per SM (65,536 on A100): more fused operations mean more live values per thread; exceeding the per-thread budget causes register spilling to local memory in global DRAM, undoing the fusion benefit.
- Shared memory per SM (~164 KB on A100): reductions that tile large rows into SRAM (e.g., FlashAttention) must size blocks to fit; oversized tiles prevent the kernel from launching at all or reduce occupancy.
Tap to flip back
torch.compile uses a three-stage pipeline: TorchDynamo captures Python bytecode into a computation graph (tracing through control flow); TorchInductor analyses the graph and groups fusible pointwise and reduction ops; finally Inductor emits Triton kernels (or C++ for CPU) that implement the fused group. The programmer sees only a single torch.compile(model) call; the resulting kernels are recompiled and cached per input shape.
Tap to flip back
A large GEMM is compute-bound: the SMs are saturated doing FLOPs and global memory bandwidth is not the bottleneck. Fusion with cheap adjacent ops (bias add, activation) removes DRAM traffic, but since DRAM was not the limiting resource, throughput barely changes. Worse, adding register pressure from the fused op can lower occupancy, slightly reducing the GPU's ability to hide arithmetic latency - giving a net neutral or marginal regression. Profile first; only fuse where memory bandwidth is the demonstrated bottleneck.
Tap to flip back
Triton kernels and Inductor-generated fused kernels are compiled for specific tile sizes, strides, and loop bounds derived from concrete input shapes. A different shape requires recompilation (a fresh PTX binary), which adds latency and can produce suboptimal tiles if the new shape does not align with the tuned block sizes. torch.compile(dynamic=True) mitigates recompilation cost by generating shape-polymorphic code, but at the cost of less aggressive tile tuning and therefore lower peak throughput than the static-shape case.
Tap to flip back
XLA's default fusion rules are conservative and general-purpose: they avoid fusing ops whose combined register or SRAM footprint could hurt occupancy, and they use static profitability estimates rather than per-input measurement. The fact that custom (hand-tuned or search-driven) fusion strategies yield an order-of-magnitude gain on some workloads means the compiler's heuristics leave significant throughput on the table when the access pattern or tile geometry is unusual. It motivates tools like AutoTVM, Triton's autotuner, and torch.compile's shape-specific compilation as ways to close that gap automatically.
Tap to flip back
CUDA uses a "scalar program, blocked threads" model: each thread handles one element, and the programmer coordinates blocks manually. Triton uses a "blocked program, scalar threads" model: each program instance operates on a tile of data (e.g. 128 elements), and the compiler handles thread-to-element assignment internally. This means you reason about tiles rather than individual thread indices.
Tap to flip back
@triton.jit marks a Python function as a Triton kernel. Compilation to PTX (or AMD ISA) happens on the first call with a given set of argument types and constexpr values. Subsequent calls with the same signature hit a compiled cache. The kernel is never executed as Python on the CPU; the decorator triggers the Triton compiler pipeline.
Tap to flip back
- Memory coalescing - reorders thread access patterns so warps read contiguous 128-byte cache lines.
- Shared-memory allocation - inserts __shared__ staging buffers for tiles that are reused within a kernel.
- Thread swizzling - permutes the thread-to-data mapping to avoid shared-memory bank conflicts.
- Vectorisation - emits 128-bit LDG load instructions where pointer alignment allows.
Tensor-core selection (mma instructions for tl.dot) is a fifth, particularly important on Ampere and Hopper.
Tap to flip back
torch.compile uses TorchDynamo to capture the computation graph, TorchInductor to fuse and schedule ops, and then Triton as the code-generation backend - Inductor emits Triton kernel source which is JIT-compiled to PTX. The compiled kernels are cached on disk (by default in a per-user torchinductor_<username> directory under the system temp dir, overridable with TORCHINDUCTOR_CACHE_DIR), so the compilation overhead is paid only once per unique graph shape + input dtype combination.
Tap to flip back
Every kernel launch carries a fixed overhead (a few microseconds) from the CUDA driver, regardless of tensor size. For small tensors that cost less compute than the launch overhead, this dominates. The right fix is CUDA graphs or torch.compile with graph capture, which record and replay a sequence of launches without re-entering the driver per kernel. Writing faster Triton code does not help here - the bottleneck is the launch mechanism, not the kernel body.
Tap to flip back
A tl.constexpr parameter is a compile-time constant baked into the kernel at JIT time. When BLOCK_SIZE is a constexpr, the compiler knows the exact loop bounds and tile shapes during compilation, enabling full loop unrolling, static vectorisation, and optimal register allocation. Without constexpr, the compiler must emit general-purpose loops that are harder to vectorise. Changing a constexpr value produces a new compiled variant; Triton's autotuner exploits this by compiling and benchmarking multiple BLOCK_SIZE candidates.
Tap to flip back
- Custom sparse or structured-sparse attention - TorchInductor cannot see through irregular index patterns (e.g., sliding-window attention masks, block-sparse layouts). A hand-written Triton kernel can fuse the masking, softmax, and accumulation in a single pass over HBM, while Inductor would decompose these into separate dense kernels.
- Non-standard fusions across framework boundaries - If a kernel must straddle pre- and post-processing steps that live outside the PyTorch graph (e.g., in-place quantisation before a linear layer, or custom RoPE embeddings computed with non-standard formulas), Inductor's graph capture will miss the fusion opportunity. A Triton kernel authored for that specific fused operation avoids the extra HBM round-trips.
Tap to flip back
Exhaust options in order: (1) existing library primitives (cuBLAS, cuDNN), (2) torch.compile for element-wise chains, (3) Triton for tile-level custom patterns, (4) raw CUDA only when all else fails and a real profiler confirms a bottleneck. Stopping one rung earlier on the ladder saves days of engineering and avoids architecture lock-in.
Tap to flip back
A graph break occurs when TorchDynamo cannot trace through a piece of code and must split the computation graph at that boundary. Custom CUDA kernels wrapped as torch.autograd.Function are opaque to the compiler; it cannot lower them to Triton. Two graph breaks in a training loop can eliminate most of the speedup torch.compile would otherwise deliver, making the custom kernel a net loss.
Tap to flip back
- The operation has no existing primitive AND is a measured training bottleneck (FlashAttention's tiled SRAM rewrite is the canonical case).
- You need architecture-specific intrinsics unavailable to compilers today - structured 2:4 sparsity, wgmma instructions, low-bit formats (INT4/FP8).
- Your data layout is genuinely irregular (ragged, interleaved, sparse block) in a way that causes library reshaping overhead to negate any kernel-level gain.
Tap to flip back
Hopper changed the memory hierarchy (HBM3, larger L2), warp scheduling semantics, and introduced new async-copy and wgmma instructions. A kernel tuned for Ampere's tile sizes and prefetch patterns is not automatically optimal on Hopper. Frameworks like Triton and torch.compile regenerate per-target code on each architecture; hand-written .cu files require manual re-tuning.
Tap to flip back
- Highly dynamic shapes: recompilation per new shape can dominate runtime.
- Cold-start / serverless inference: minutes-long first-run compilation is unacceptable.
- Operations with irregular data dependencies (variable-length scatter, graph NN) that do not tile cleanly.
- Pipelines mixing PyTorch with external C++/CUDA extensions, where the compiler cannot see across boundaries.
Tap to flip back
cuBLAS (dense linear algebra / GEMM), cuFFT (Fast Fourier Transform), and Thrust (parallel algorithms: sort, scan, reduce). Using these before writing custom kernels is the guide's first-priority recommendation because they are architecture-tuned and maintained by NVIDIA against each new GPU generation.
Tap to flip back
Operations with irregular or data-dependent memory access patterns - for example sparse attention over arbitrary masks, graph neural network scatter/gather over variable-degree neighbourhoods, or ragged-sequence processing. Triton's compiler assumes the work decomposes into fixed-size contiguous 2D tiles; when the access pattern does not conform to that structure, padding and masking overhead reduces efficiency and the tile-based abstraction becomes difficult to express correctly.
Tap to flip back
Arithmetic intensity (AI) = FLOPs executed / bytes transferred from DRAM, measured in ops/byte. It captures how much computation a kernel does per unit of data movement - the central quantity that determines which hardware ceiling bounds performance.
Tap to flip back
Attainable FLOP/s = min(Pi, beta * AI)
where Pi is peak compute throughput (FLOP/s) and beta is peak memory bandwidth (bytes/s). A kernel can exceed neither limit, so whichever is lower at its AI is the binding ceiling.
Tap to flip back
The ridge point is the arithmetic intensity at which the bandwidth-bound diagonal meets the flat compute ceiling: AI_ridge = Pi / beta. Kernels with AI below the ridge are bandwidth-bound; kernels above it are compute-bound. Knowing which side you are on tells you which resource to optimise first.
Tap to flip back
Ridge point = 312e12 / 2e12 = 156 ops/byte.
Element-wise activations at 0.5 ops/byte sit far to the left of the ridge - deep in the bandwidth-bound regime. Fusing them with adjacent operations is the standard fix, avoiding repeated DRAM round-trips for intermediate tensors.
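The ridge-point arithmetic above can be reproduced with a few lines. This is a minimal sketch that assumes the same peak figures used on this card (312e12 FLOP/s and 2e12 bytes/s); the function names are illustrative, not from any library.

```python
def attainable_flops(ai, peak_flops, peak_bw):
    """Roofline bound in FLOP/s: min(Pi, beta * AI)."""
    return min(peak_flops, peak_bw * ai)


def ridge_point(peak_flops, peak_bw):
    """Arithmetic intensity (ops/byte) where the two ceilings meet."""
    return peak_flops / peak_bw


# Assumed peak figures, matching the numbers on the card.
PEAK_FLOPS = 312e12  # FLOP/s
PEAK_BW = 2e12       # bytes/s

ridge = ridge_point(PEAK_FLOPS, PEAK_BW)
# An element-wise op at 0.5 ops/byte is capped by bandwidth, not compute.
elementwise_bound = attainable_flops(0.5, PEAK_FLOPS, PEAK_BW)
fraction_of_peak = elementwise_bound / PEAK_FLOPS

print(ridge, elementwise_bound, fraction_of_peak)
```

At 0.5 ops/byte the bound is 1e12 FLOP/s, well under 1% of the assumed compute peak, which is why fusion (fewer bytes moved) rather than more FLOP/s is the lever.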
Tap to flip back
Larger batch sizes increase arithmetic intensity for GEMMs. The weight matrix bytes are amortised over more input vectors, so the ops/byte ratio rises. This pushes the kernel from the bandwidth-bound ramp toward (or past) the ridge point, allowing the compute units to stay busy rather than stalling on memory.
Tap to flip back
- Latency dominance - for very small tensors (e.g., batch=1 decode), kernel launch overhead and memory latency dwarf both bandwidth and compute ceilings; the model predicts much higher attainable performance than reality.
- Wrong memory level - if the working set fits in L2 cache, using the DRAM roofline underestimates attainable performance; the correct roofline uses the L2 bandwidth ceiling instead.
- Ineffective Tensor Core utilisation - if GEMM dimensions are not multiples of 16, Tensor Core throughput degrades sharply, lowering the effective compute ceiling well below Pi and making the kernel appear compute-bound at an AI that should be bandwidth-bound.
Tap to flip back
It draws one roofline per memory level (DRAM, L2, L1/shared memory, registers), each with its own higher bandwidth ceiling. A kernel is placed on whichever level actually supplies its data. This reveals whether a kernel that "beats" the DRAM roof is genuinely compute-bound or merely cache-resident, and whether cache-blocking or tiling could push a memory-bound kernel onto a faster hierarchy level.
Tap to flip back
A warp is a group of exactly 32 threads that an SM executes together, issuing one instruction per clock across all 32 lanes simultaneously (SIMT). The hardware scheduler operates at warp granularity because it is cheaper to track 32-thread groups than individual threads, and because issuing one instruction to 32 parallel lanes saturates the wide execution units. Blocks are a logical grouping for resource allocation and shared memory; warps are the physical execution unit.
Tap to flip back
Each SM keeps many warps simultaneously resident (up to 64 on Ampere), each with its registers live in the on-chip register file. When a warp issues a load and stalls waiting for HBM (400-800 cycles), the warp scheduler instantly selects another warp that is ready to execute - no context switch, no register save/restore. Latency is hidden by warp switching, not by out-of-order speculation. This is why occupancy (fraction of maximum resident warps) matters: too few warps and stalls go uncovered.
Tap to flip back
Warp divergence occurs when threads within the same warp take different branches. The hardware executes both paths serially with masking; threads on the inactive path waste cycles. Worst case (32-way branch): effective throughput drops to 1/32. Divergence across warps is free - each warp is independently scheduled and can be on a different instruction. The rule: keep threads within a warp on the same code path; branches on warp-aligned data (e.g. threadIdx.x / 32) are safe.
Tap to flip back
max_resident_warps = min(
SM_hardware_ceiling, # e.g. 64 on A100
SM_register_file / (regs_per_thread * 32),
(SM_smem / smem_per_block) * warps_per_block
)
Register pressure is the most common culprit: 64 registers per thread on an A100 limits resident threads to 65536 / 64 = 1024 threads, i.e. 32 warps (50% occupancy). Shared memory demand limits block count, which limits warp count indirectly. The SM hardware ceiling is an absolute upper bound regardless of resources.
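The formula above translates directly into a small calculator. This is a rough sketch: it ignores register allocation granularity and the per-SM block-count limit, and the 164 KB shared-memory figure used in the example is an assumption for an A100-class SM rather than a value taken from this card.

```python
def max_resident_warps(hw_ceiling, regfile_regs, regs_per_thread,
                       smem_bytes, smem_per_block, warps_per_block,
                       warp_size=32):
    """Theoretical resident warps per SM under the three limits."""
    by_regs = regfile_regs // (regs_per_thread * warp_size)
    if smem_per_block > 0:
        by_smem = (smem_bytes // smem_per_block) * warps_per_block
    else:
        by_smem = hw_ceiling  # no shared memory used: this limit is inactive
    return min(hw_ceiling, by_regs, by_smem)


SMEM = 164 * 1024  # assumed shared memory per SM, in bytes

# Register-limited case from the card: 64 registers per thread.
reg_limited = max_resident_warps(64, 65536, 64, SMEM, 16 * 1024, 8)

# Shared-memory-limited case: 48 KB per block, 8 warps per block.
smem_limited = max_resident_warps(64, 65536, 32, SMEM, 48 * 1024, 8)

print(reg_limited, reg_limited / 64)    # warps, occupancy
print(smem_limited, smem_limited / 64)
```

The first case reproduces the 32-warp (50%) figure; the second shows shared memory capping the SM at three blocks even though registers would allow full occupancy.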
Tap to flip back
Tensor Cores execute a warp-collective matrix multiply-accumulate (e.g. 16x16x16 on Ampere) in a single operation. The hardware reads entire 128-byte cache lines per operand tile, so data must be contiguous in memory and the leading dimension must align to 128 bytes (32 elements for FP32, 64 for FP16). A misaligned or non-contiguous layout forces the compiler to gather elements into registers via scalar loads before feeding the Tensor Core - effectively falling back to CUDA core throughput (roughly 16x slower for FP16 GEMM on A100). cuBLAS handles layout requirements automatically; custom kernels must be explicit.
Tap to flip back
Shared memory is physically split into 32 banks; concurrent accesses by different threads in the same warp to different addresses within the same bank are serialised (2-way conflict halves bandwidth; 32-way conflict reduces it to 1/32). For a row-major matrix of 4-byte elements stored in SMEM with 32 columns, reading a row is safe (each thread hits a different bank). Reading a column from a matrix whose row stride is a multiple of 32 causes a 32-way conflict because every thread maps to the same bank. The standard fix: pad the row by one element to break the stride alignment.
Tap to flip back
Occupancy helps only when the kernel is latency-bound - the SM spends cycles waiting for memory and needs more warps to switch to. A kernel that is compute-bound (arithmetic throughput is the bottleneck, close to the roofline ceiling) is already keeping the execution units busy; adding more resident warps does not create more compute. In that case, higher occupancy may actually reduce performance by increasing register spilling (if registers per thread must be cut to fit more warps). Always profile to determine the bottleneck before tuning occupancy.
Tap to flip back
- Registers - ~256 KB per SM (64k × 32-bit registers); private to each thread; ~1 cycle latency.
- L1 / Shared memory - 256 KB per SM on H100, split between hardware L1 cache and programmer-managed scratchpad (up to 228 KB configurable as shared memory); ~20-30 cycle latency.
- L2 cache - 50 MB chip-wide on H100; hardware-managed; ~200 cycle latency.
- HBM (global memory) - 80 GB on H100 SXM; ~600-700 cycle latency; 3.35 TB/s bandwidth.
The gradient from fast/small/private to slow/large/shared is the central constraint that drives almost every GPU optimisation technique.
Tap to flip back
Arithmetic intensity = FLOPs performed / bytes transferred from memory.
A kernel is memory-bound when its arithmetic intensity falls below the threshold:
AI_threshold = Peak FLOP/s / Peak Memory Bandwidth
For an H100 SXM in FP16 Tensor Core mode: ~989 TFLOP/s / 3.35 TB/s ≈ 295 FLOPs/byte.
- Elementwise ops (ReLU, layer norm): AI ~ 0.25-2 FLOPs/byte → deeply memory-bound; speed scales with HBM bandwidth.
- Large matrix multiplies: AI >> 295 FLOPs/byte → compute-bound; speed scales with Tensor Core throughput.
Understanding this split tells you where to look when a kernel underperforms.
Tap to flip back
Tiling loads a T×T sub-block of A and B from HBM into shared memory once, then reuses each element T times across the inner accumulation loop. HBM traffic is reduced by a factor of T relative to a naive implementation that loads each element once per multiply-accumulate.
For T = 128, that is 128x fewer HBM round-trips. The practical limit is shared-memory capacity: you cannot make T arbitrarily large without exhausting the 228 KB scratchpad on H100 (and thereby killing SM occupancy).
This is the same tiling idea that makes FlashAttention IO-optimal: it keeps intermediate attention scores on-chip rather than writing an N×N matrix back to HBM.
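The factor-of-T claim can be checked by counting element loads. The sketch below assumes square N×N matrices with N divisible by T and counts only loads of A and B (output writes are ignored); the helper names are hypothetical.

```python
def hbm_loads_naive(n):
    """Each of the n*n outputs reads a full row of A and column of B."""
    return n * n * 2 * n


def hbm_loads_tiled(n, t):
    """Each output tile streams n/t tile pairs (one from A, one from B)."""
    assert n % t == 0, "sketch assumes n is a multiple of t"
    tiles_per_dim = n // t
    output_tiles = tiles_per_dim ** 2
    return output_tiles * tiles_per_dim * 2 * t * t


def smem_bytes_for_tiles(t, bytes_per_elem):
    """Shared memory needed to hold one A tile and one B tile."""
    return 2 * t * t * bytes_per_elem


n, t = 4096, 128
reduction = hbm_loads_naive(n) / hbm_loads_tiled(n, t)
print(reduction)                          # traffic reduction factor
print(smem_bytes_for_tiles(128, 2))       # FP16 tiles, T = 128
print(smem_bytes_for_tiles(256, 4))       # FP32 tiles, T = 256
```

With T = 128 and FP16 the two tiles need 64 KiB, which fits; T = 256 in FP32 would need 512 KiB, which exceeds the shared-memory capacity quoted above and illustrates the practical limit on T.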
Tap to flip back
Shared memory on modern NVIDIA GPUs is physically divided into 32 banks, each 4 bytes wide. When multiple threads in the same warp access different addresses that map to the same bank, those accesses serialise (a 32-way conflict takes 32× longer than a conflict-free access).
Classic cause: reading a 32-column tile column-by-column - threads 0-31 all land on bank 0.
Fix: allocate float smem[32][33] instead of [32][32]. The extra column shifts every row's starting bank by one, breaking the alignment. The 33rd column is never read; it purely pads the address mapping. Cost: ~3% extra shared memory.
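The effect of the padding column can be simulated by mapping each thread's address to a bank. This is a simplified model assuming 32 banks of 4-byte words and one float per thread; it is a sketch of the address arithmetic, not of real hardware timing.

```python
from collections import Counter

NUM_BANKS = 32  # assumed: 32 banks, each 4 bytes wide


def column_read_conflict(row_stride, column=0, rows=32):
    """Worst-case threads per bank when thread i reads smem[i][column]."""
    banks = Counter((i * row_stride + column) % NUM_BANKS for i in range(rows))
    return max(banks.values())


def row_read_conflict(row_stride, row=0, cols=32):
    """Worst-case threads per bank when thread i reads smem[row][i]."""
    banks = Counter((row * row_stride + i) % NUM_BANKS for i in range(cols))
    return max(banks.values())


unpadded = column_read_conflict(row_stride=32)  # float smem[32][32]
padded = column_read_conflict(row_stride=33)    # float smem[32][33]
row_access = row_read_conflict(row_stride=32)

print(unpadded, padded, row_access)
```

A result of 32 means a 32-way conflict and 1 means conflict-free: the unpadded column read serialises fully, while the padded layout and the row read each touch 32 distinct banks.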
Tap to flip back
Each SM has a finite register file (65,536 × 32-bit registers on Hopper). If a kernel requires more registers per thread than the SM can provide at the chosen block size, the compiler spills excess values to local memory - a region of HBM private to each thread.
Spilled registers incur ~600-cycle HBM latency instead of 1-cycle register access, severely degrading throughput.
Nsight Compute signal: high "Warp Stalls: Long Scoreboard" percentage in the Warp State Statistics section. The "Local Load/Store" metrics and "Registers Per Thread" stat confirm the root cause.
Fixes: reduce block size (fewer concurrent threads share the register file), simplify the kernel to lower per-thread register count, or split into smaller kernels.
Tap to flip back
HBM serves memory in cache-line-sized transactions (128 bytes on modern GPUs). When consecutive threads in a warp access consecutive 4-byte floats, the 32 addresses collapse into a single 128-byte transaction - fully coalesced.
When threads access scattered addresses (e.g., column-major access through a row-major array), each thread triggers a separate transaction, issuing up to 32 × 32-byte transactions instead of 1 × 128-byte transaction. The memory controller saturates on transaction count long before the raw byte-transfer limit is hit.
Rule of thumb: thread N in a warp should access address base + N * sizeof(element) for peak HBM utilisation. Transposing data on load into shared memory is the standard workaround when the access pattern is inherently non-coalesced.
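The transaction counts above can be illustrated by counting the distinct 32-byte sectors a warp touches. The sketch assumes 4-byte elements and a hypothetical row length of 1024 floats for the strided case.

```python
def sectors_touched(byte_addresses, sector_bytes=32):
    """Number of distinct memory sectors covered by a warp's addresses."""
    return len({addr // sector_bytes for addr in byte_addresses})


ELEM = 4          # bytes per float
ROW_LEN = 1024    # hypothetical row length (floats) of a row-major array

# Thread N reads base + N * sizeof(float): fully coalesced.
coalesced = [lane * ELEM for lane in range(32)]

# Thread N reads element [N][0]: a column walk through row-major data.
strided = [lane * ROW_LEN * ELEM for lane in range(32)]

useful_bytes = 32 * ELEM
coalesced_bytes = sectors_touched(coalesced) * 32
strided_bytes = sectors_touched(strided) * 32

print(sectors_touched(coalesced), sectors_touched(strided))
print(useful_bytes / coalesced_bytes, useful_bytes / strided_bytes)
```

The coalesced warp covers four contiguous sectors (one 128-byte line) with 100% of the fetched bytes used; the strided warp touches 32 separate sectors and uses only one eighth of what it fetches.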
Tap to flip back
Occupancy = fraction of maximum concurrent warps active on an SM. High occupancy enables latency hiding: while one warp stalls on a memory access, others keep the arithmetic units busy.
Allocating more shared memory per block increases the tile size T (and thus reduces HBM traffic by factor T), but each SM can host fewer blocks simultaneously, lowering occupancy.
- Memory-bound kernels (low arithmetic intensity) benefit most from high occupancy because latency hiding is their primary lever; smaller tiles with more blocks in flight often win.
- Compute-bound kernels (high arithmetic intensity, e.g., large GEMM) can afford lower occupancy because Tensor Cores keep the SM busy regardless; large tiles that maximise data reuse often win.
The right point on the curve must be found empirically with a profiler (ncu) or autotuned - it depends on matrix size, data type, SM count, and the specific kernel's instruction mix.
Tap to flip back
Sign (1 bit): positive or negative.
Exponent (variable width): controls dynamic range - how large or small a number can be. More exponent bits = wider range.
Mantissa / significand (variable width): controls precision - how finely nearby values can be distinguished. More mantissa bits = finer precision.
At a fixed bit width, the choice of format (e.g. FP16 vs. BF16) trades bits between exponent and mantissa, buying range at the cost of precision or vice versa.
Tap to flip back
- FP16: 1 sign + 5 exponent + 10 mantissa. Max value ~65504. Small gradients underflow (and very large values overflow), so it requires loss scaling.
- BF16: 1 sign + 8 exponent + 7 mantissa. Same dynamic range as FP32. No loss scaling needed, but coarser precision (~2 decimal digits vs. FP32's 7).
BF16 eliminates the loss-scaling bookkeeping of FP16 at the cost of mantissa bits. For most transformer training workloads, range matters more than precision, which is why BF16 became the default on A100 and later GPUs.
Tap to flip back
Loss scaling multiplies the scalar loss by a large factor (e.g. 2^16) before backpropagation, shifting gradient values up into FP16's representable range. Before the weight update, gradients are divided back and checked for overflow (inf/NaN).
It is needed for FP16 because gradients commonly fall below ~6e-5 (FP16's minimum normal), where they lose precision as subnormals, and below ~6e-8 (the minimum subnormal) they flush to zero.
BF16 shares FP32's 8-bit exponent, so its minimum normal is ~1.2e-38 - the same floor as FP32 - and gradient underflow is not a practical concern.
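The underflow and its fix can be demonstrated with the standard library alone, since `struct` supports IEEE half precision via the `"e"` format code. This is a minimal sketch: the gradient value is made up, and Python's 64-bit float stands in for the FP32 unscaling step.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8           # hypothetical gradient below FP16's smallest subnormal
scale = 2.0 ** 16     # loss-scaling factor from the card

unscaled = to_fp16(grad)           # stored directly in FP16
scaled = to_fp16(grad * scale)     # stored after loss scaling
recovered = scaled / scale         # unscaled again in higher precision

print(unscaled, scaled, recovered)
```

Stored directly, the gradient flushes to zero; scaled by 2^16 it lands in FP16's normal range and is recovered to within FP16's relative precision after dividing the scale back out.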
Tap to flip back
Because FP16 has only 10 mantissa bits, a small gradient update applied directly to an FP16 weight can be rounded to zero (the update is smaller than the spacing between adjacent FP16 values near that weight magnitude).
The FP32 master copy has 23 mantissa bits, so thousands of small updates accumulate instead of being rounded away. After each optimizer step the FP32 weights are cast back to FP16 for the next forward pass. Memory cost: 4 extra bytes per parameter on top of the 2-byte FP16 copy, i.e. 3x the FP16 weight storage alone (1.5x relative to keeping only FP32 weights).
Tap to flip back
- E4M3 (4-bit exponent, 3-bit mantissa): higher precision in a narrow range. Used for forward-pass weights and activations, where precision matters more.
- E5M2 (5-bit exponent, 2-bit mantissa): wider range, coarser precision. Used for backward-pass gradients, which can spike to larger magnitudes.
The split mirrors the FP16/BF16 logic but at 8 bits. Both variants require per-tensor scaling factors because their representable ranges are very narrow compared to BF16 or FP32.
Tap to flip back
Saturation occurs when activation or weight values exceed FP8 E4M3's maximum (~448). Values above the ceiling are clipped to 448, corrupting the forward pass.
Large transformers are most susceptible because they develop outlier channels in attention and feed-forward activations - a small number of dimensions with values far above the mean. Without per-tensor (or per-block) scaling calibrated to the actual value distribution, these outliers saturate and the rest of the tensor loses relative precision.
The fix is scaling factors that are updated each step, at additional overhead. This is why FP8 inference requires careful calibration rather than being a drop-in format swap.
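A toy version of saturation and per-tensor scaling is sketched below. It models only the clipping at the E4M3 maximum, not mantissa rounding, and the activation values are invented for illustration.

```python
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value, as stated on the card


def saturate(values, limit=E4M3_MAX):
    """Clip values to the representable range (rounding is not modelled)."""
    return [max(-limit, min(limit, v)) for v in values]


def per_tensor_scale(values, limit=E4M3_MAX):
    """Scale factor that maps the largest magnitude onto the format maximum."""
    amax = max(abs(v) for v in values)
    return limit / amax if amax > 0 else 1.0


acts = [0.5, -3.0, 12.0, 1800.0]   # hypothetical tensor with one outlier

clipped = saturate(acts)           # no scaling: the outlier is corrupted
scale = per_tensor_scale(acts)
scaled = saturate([v * scale for v in acts])
restored = [v / scale for v in scaled]

print(clipped)
print(restored)
```

Without scaling the outlier is silently clipped to 448; with scaling it survives, but every other value is shrunk by the same factor, which is the loss of relative precision for the rest of the tensor described above.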
Tap to flip back
The GPU falls back to FP32 execution for BF16 operations (since BF16 is not natively supported). Throughput drops to FP32 speed, completely erasing the 2x memory and compute benefit of BF16.
This is a common deployment trap: format support is hardware-specific. FP16 has broader legacy support (Volta onward); BF16 requires Ampere (A100) or newer on NVIDIA hardware. Always check torch.cuda.is_bf16_supported() before assuming BF16 will be fast on the deployment target.
Tap to flip back
Arithmetic intensity = FLOPs performed / bytes read from DRAM (FLOP/byte). Compare it to the GPU's ridge point (peak FLOP/s / peak bandwidth). Below the ridge: memory-bound. Above: compute-bound. Most LLM decoding ops sit far below the ridge because each weight is multiplied once but still costs 2 bytes to load.
Tap to flip back
Each weight element is loaded once from HBM and used for exactly one multiply-accumulate against the single token's activation vector. FLOPs = 2 * M * N for a linear layer; bytes loaded = 2 * M * N (FP16). The ratio cancels to ~1 FLOP/byte, far below the ~156 FLOP/byte ridge point of an A100. The GPU's compute units finish almost immediately and wait for the next data tile.
Tap to flip back
Standard attention writes the full S x S attention score matrix to HBM, then reads it back for softmax and the weighted sum - two full HBM round-trips per layer. FlashAttention tiles Q, K, V into blocks that fit in on-chip SRAM and fuses softmax with the value accumulation into one kernel. Intermediate scores never touch HBM; only Q, K, V, and the output are moved. This cuts attention's extra memory footprint from O(S^2) to O(S) and sharply reduces HBM traffic, because the score matrix is never written out or read back.
Tap to flip back
Ridge point = 312 TFLOP/s / ~2.0 TB/s = ~156 FLOP/byte. Practically: any kernel with fewer than 156 FLOPs per byte of DRAM traffic is limited by memory bandwidth, not compute. Single-sequence decoding (AI ~1) never benefits from more FLOP/s on that GPU; only higher bandwidth or fewer bytes transferred would help.
Tap to flip back
When the model spans multiple GPUs via tensor or pipeline parallelism, activations must be transferred between GPUs (NVLink or InfiniBand) at every layer boundary. This inter-GPU communication adds latency on the critical path. A model that fits in one GPU's HBM avoids that overhead entirely, making single-GPU fit a first-class optimisation target (often achieved via quantisation).
Tap to flip back
KV cache bytes = 2 * L * H * d_head * S * B * bytes_per_element. For large models (L=80, etc.) at long context and non-trivial batch sizes this can reach tens of gigabytes, directly reducing the headroom available for weights. The practical batch size ceiling is often set by HBM capacity (weights + KV cache must both fit), not by compute throughput. This is why techniques like GQA (reducing H) and INT8 KV cache quantisation matter.
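The formula is easy to evaluate for a concrete configuration. The sketch below uses hypothetical model dimensions (only L=80 comes from the card) to show how GQA and INT8 quantisation shrink the cache.

```python
def kv_cache_bytes(layers, kv_heads, d_head, seq_len, batch, bytes_per_elem):
    """2 (K and V) * L * H * d_head * S * B * bytes_per_element."""
    return 2 * layers * kv_heads * d_head * seq_len * batch * bytes_per_elem


GIB = 2 ** 30

# Hypothetical 80-layer model, 128-dim heads, 4096-token context, batch 8.
mha_fp16 = kv_cache_bytes(80, 64, 128, 4096, 8, 2)   # 64 KV heads
gqa_fp16 = kv_cache_bytes(80, 8, 128, 4096, 8, 2)    # GQA: 8 KV heads
gqa_int8 = kv_cache_bytes(80, 8, 128, 4096, 8, 1)    # GQA + INT8 cache

for name, size in [("MHA FP16", mha_fp16),
                   ("GQA FP16", gqa_fp16),
                   ("GQA INT8", gqa_int8)]:
    print(f"{name}: {size / GIB:.1f} GiB")
```

Under these assumed dimensions the full multi-head cache alone would fill an 80 GB device, while GQA with an INT8 cache brings it down by 16x, leaving room for weights and larger batches.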
Tap to flip back
GDDR places discrete memory chips side-by-side on the PCB around the GPU, connected over comparatively long board traces with a narrower total bus (typically a few hundred bits). HBM stacks multiple DRAM dies vertically in a single package, placed on a silicon interposer next to the GPU die, connected through thousands of through-silicon vias. The extremely wide bus (1024+ bits per stack) and short physical path give HBM roughly 3-5x the bandwidth of GDDR at lower energy per bit transferred.
Tap to flip back
Occupancy = active warps per SM / maximum warps per SM.
On an A100 the maximum is 64 warps (2,048 threads) per SM. If only 32 warps are resident, occupancy is 50%. The ratio matters because the warp scheduler can only hide latency by switching to a ready warp; with fewer resident warps there are fewer candidates to switch to.
Tap to flip back
- Registers - each SM has a fixed register file (65,536 32-bit registers on A100); high per-thread register usage means fewer threads fit.
- Shared memory - blocks that allocate large shared buffers leave no space for additional blocks on the same SM.
- Thread slots - total concurrent threads per SM is bounded (2,048 on A100) regardless of the other two limits.
Register pressure is the most common culprit in practice. Compile with -maxrregcount=N to trade register usage for occupancy, but watch for spilling to local (DRAM-backed) memory.
Tap to flip back
Warp switching covers a stall only when the newly scheduled warp has no pending dependency on the stalled warp's result. If warp B needs a value that warp A is loading, switching to B just adds another stalled warp. The scheduler needs warps whose memory requests are independent so it can keep issuing instructions on different warps while all their loads are in flight simultaneously. This is why coalesced, independent access patterns matter far more than raw occupancy numbers.
Tap to flip back
When the kernel is compute-bound: the tensor cores or CUDA cores are already saturated issuing arithmetic instructions every cycle. Adding more resident warps queues more arithmetic work but cannot reduce arithmetic latency - there is no idle time to fill. A compute-throughput metric near 100% in ncu's Speed of Light section (for example sm__throughput.avg.pct_of_peak_sustained_elapsed) signals this; note that sm__warps_active.avg.pct_of_peak_sustained_active reports achieved occupancy, not compute saturation. In compute-bound kernels, raising occupancy may introduce register bank conflicts or extra shared memory pressure that actively hurts performance.
Tap to flip back
More shared memory per block means fewer blocks fit on an SM, which lowers occupancy. Yet large shared memory tiles reduce global memory traffic (data is reused from on-chip SRAM), shrinking the amount of DRAM latency that needs hiding in the first place.
The practical trade-off: use large tiles when arithmetic intensity is high (matmul, attention), accept the lower occupancy because you have enough compute work per loaded byte. Use small tiles or rely on L1 when access patterns are irregular and reuse is low.
Ampere/Hopper let you split L1/shared memory at runtime via cudaFuncSetAttribute to tune this without recompiling.
Tap to flip back
It automatically searches for the block size that maximises theoretical occupancy for a given kernel, accounting for the kernel's register usage and any fixed shared memory. Before CUDA 6.5, programmers had to manually consult the CUDA Occupancy Calculator spreadsheet. The API replaced that workflow with a runtime call, which is especially useful for libraries and frameworks that launch user-defined kernels without knowing the register footprint at compile time.
Tap to flip back
If the compiler cannot honour the register cap through allocation alone, it spills excess live values to local memory - which is backed by L2/DRAM and incurs the same latency as a global memory load. The kernel now has higher occupancy (more warps fit on the SM) but each warp generates additional memory traffic from spill/fill loads. If the kernel was already memory-bandwidth-bound, the extra traffic saturates DRAM faster, making effective throughput worse than it was at lower occupancy with no spilling.
The sign to watch for: ncu will report high l1tex__t_bytes_pipe_lsu_mem_local_op_ld.sum after applying -maxrregcount.
Tap to flip back
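H100 NVLink 4.0 delivers 900 GB/s aggregate bidirectional bandwidth (18 links x 50 GB/s each). PCIe Gen 5 x16 tops out at roughly 64 GB/s per direction, or about 128 GB/s bidirectional. Compared like-for-like, NVLink 4.0 is approximately 7x faster, which matches NVIDIA's "7x PCIe Gen 5" figure; a ~14x ratio appears only when NVLink's bidirectional total is set against PCIe's single-direction rate.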
Tap to flip back
With 8 GPUs, a fully-connected all-pairs topology would require up to 28 separate link pairs and physically impossible connector density on each GPU. NVSwitch is a crossbar switch ASIC with many NVLink ports; placing multiple NVSwitch chips between GPUs creates a non-blocking all-to-all fabric without each GPU needing a direct link to every other GPU.
Tap to flip back
Tensor parallelism places communication on the critical path of every forward and backward layer (all-gather and reduce-scatter on activations). Data-parallel all-reduce runs once per step on the gradients and, with gradient bucketing, overlaps with the remainder of the backward pass. Tensor parallelism latency is directly experienced; all-reduce latency can largely be hidden.
Tap to flip back
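- NVLink 4.0: 1 GB / 900 GB/s = ~1.1 ms (using the aggregate bidirectional figure; ~2.2 ms at 450 GB/s in one direction)
- PCIe 5.0 x16: 1 GB / 64 GB/s = ~15.6 ms (one direction)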
If a corresponding matrix multiply takes ~2 ms, communication is pipeline-overlappable with NVLink but dominates compute time with PCIe, making tensor parallelism impractical over PCIe for typical layer sizes.
Tap to flip back
NVLink is confined to a single node (typically 8 GPUs). Crossing nodes requires InfiniBand or RoCE, where bandwidth drops to roughly 50 GB/s per link and latency increases by ~10x. As a result, tensor parallelism is generally capped at degree 8 (intra-node), while pipeline parallelism (micro-batch streaming across nodes) and data parallelism (gradient all-reduce over InfiniBand) handle the inter-node dimension.
Tap to flip back
- Non-contiguous memory layouts force the driver or NCCL to copy or scatter-gather before sending, wasting cycles and effective bandwidth.
- Excessive synchronisation barriers in communication code (e.g., blocking cudaDeviceSynchronize calls between sub-operations) serialise transfers that could otherwise be pipelined across the NVSwitch fabric.
Profiling with Nsight Systems typically reveals these as the dominant causes of sub-peak utilisation.
Tap to flip back
At batch size 1 (or small batches), the dominant bottleneck is HBM memory bandwidth on each GPU: loading model weights from HBM to tensor cores is the limiting step, not GPU-to-GPU communication. Inter-GPU transfers are only necessary if tensor parallelism is used, and for inference the communication volume per token is small enough that even PCIe would suffice. NVLink's advantage materialises mainly at training scale or large-batch inference where tensor parallelism is genuinely necessary.
Tap to flip back
2N(P-1)/P bytes per rank, approaching 2N as P grows.
Each byte must travel once in the reduce direction (to sum partial results) and once in the broadcast direction (to distribute the final result). The ring algorithm achieves this lower bound by pipelining P-1 steps in each phase, so no extra copies are needed regardless of P.
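The per-rank volume can be confirmed with a small in-process simulation of the two ring phases. This is an illustrative sketch using Python lists as stand-ins for GPU buffers; it assumes the buffer length is divisible by P and counts elements rather than bytes.

```python
def ring_all_reduce(buffers):
    """Simulate ring all-reduce; returns (results, elements sent per rank)."""
    p = len(buffers)
    n = len(buffers[0])
    assert n % p == 0, "sketch assumes n is divisible by the rank count"
    chunk = n // p
    data = [list(b) for b in buffers]
    sent = [0] * p

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. Each step, rank r passes one chunk to rank r+1,
    # which accumulates it. Messages are snapshotted first so that all sends
    # within a step happen "simultaneously".
    for step in range(p - 1):
        msgs = [(r, (r - step) % p) for r in range(p)]
        payloads = [[data[r][i] for i in span(c)] for r, c in msgs]
        for (r, c), payload in zip(msgs, payloads):
            dst = (r + 1) % p
            for i, v in zip(span(c), payload):
                data[dst][i] += v
            sent[r] += chunk

    # Phase 2: all-gather. Fully reduced chunks circulate and overwrite.
    for step in range(p - 1):
        msgs = [(r, (r + 1 - step) % p) for r in range(p)]
        payloads = [[data[r][i] for i in span(c)] for r, c in msgs]
        for (r, c), payload in zip(msgs, payloads):
            dst = (r + 1) % p
            for i, v in zip(span(c), payload):
                data[dst][i] = v
            sent[r] += chunk

    return data, sent


P, N = 4, 8
inputs = [[rank + 1] * N for rank in range(P)]   # rank r holds all (r+1)s
results, sent = ring_all_reduce(inputs)
print(results[0], sent)
```

Every rank ends with the element-wise sum, and each rank sends 2N(P-1)/P elements in total (12 for P=4, N=8), matching the bound on the card.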
Tap to flip back
They are algebraically equivalent in final result, but splitting them inserts a gap between operations.
ZeRO uses this gap to run the optimiser step on the local gradient shard between Reduce-Scatter (which lands each rank's owned shard) and All-Gather (which redistributes updated parameters). This means each rank only stores and updates 1/P of the optimiser state, slashing peak memory without increasing total communication volume.
Tap to flip back
All-to-All.
In MoE with expert parallelism, each GPU hosts a subset of experts. After the router assigns tokens to experts, tokens destined for expert E on rank R must be physically shipped to rank R. One all-to-all sends those tokens; a second all-to-all returns the expert outputs to the originating ranks. The cost scales as O(batch_size * P), making it the dominant communication term in large MoE models.
Tap to flip back
To overlap gradient communication with backward computation.
Without bucketing, DDP would wait for the full backward pass to finish before launching any all-reduce. With bucketing, as each layer's gradients are computed they are flushed into a bucket; once a bucket reaches 25 MB (default) the all-reduce launches asynchronously while earlier layers continue computing. This hides most of the communication latency behind compute, and the bucket size can be tuned via bucket_cap_mb.
Tap to flip back
A flat ring routes data across the slow inter-node fabric at every step. NCCL's hierarchical all-reduce keeps intra-node traffic on the fast NVLink domain and only sends reduced partial sums across InfiniBand between nodes.
Concretely: within each server the GPUs perform a local reduce-scatter; one GPU per server acts as a relay, completing a cross-node all-reduce over the partial sums; each server then broadcasts the final result locally via all-gather. This can be 5-10x faster than a flat ring on a cluster where intra-node bandwidth is ~900 GB/s (NVLink 4.0) but inter-node is ~50 GB/s (InfiniBand NDR).
Tap to flip back
- Straggler amplification. A ring blocks all P ranks until the slowest participant finishes each step. A parameter server can proceed with whichever workers have already checked in (with asynchronous SGD), tolerating slow nodes. In a ring, one overloaded or throttled GPU stalls the entire job.
- Small-tensor bandwidth waste. Ring all-reduce amortises well over large buffers but incurs fixed kernel-launch and latency overhead per call. With many tiny gradient tensors (common if bucketing is disabled) most of the GPU time is overhead, not data movement. A parameter server aggregates independently per parameter and is less sensitive to this pattern.
Tap to flip back
- After column-parallel linear: each rank holds a column slice of the output matrix. An All-Gather concatenates the slices so the next layer sees the full activation tensor.
- After row-parallel linear: each rank holds a partial sum of the output (one row-block of the weight matrix times the matching slice of the input). An All-Reduce sums these partial results across ranks to produce the correct output.
The choice depends on whether the parallelism splits the output dimension (gather needed) or accumulates partial products (reduce needed).
Tap to flip back
Weights are pre-loaded into each processing element (PE) and stay there. Activation values enter from one edge and shift across the array one column per clock cycle, visiting every PE in their row. Each PE multiplies the passing activation by its local weight and adds the result to an on-chip accumulator. No intermediate partial sum ever leaves the chip until the full dot product is complete.
Why it matters: this eliminates the dominant source of memory traffic - re-fetching the same weight bytes for every input token.
Tap to flip back
Memory traffic is O(N^2): you load N^2 activation values and N^2 weight values once each. Arithmetic work is O(N^3): every activation interacts with every weight across N accumulation steps. The ratio O(N^3)/O(N^2) = O(N). Doubling the array side length doubles arithmetic intensity, pushing the operation further above the memory-bandwidth ceiling in the roofline model.
Tap to flip back
256 x 256 = 65,536 multiply-accumulate units. At 700 MHz each fires once per cycle, giving 65,536 x 700,000,000 = approximately 45.9 trillion MACs per second. One MAC counts as two operations (multiply + add), so peak throughput is roughly 92 TOPS. The TPU v1 MXU operates on 8-bit integers, accumulating into 32-bit registers.
Tap to flip back
The TPU MXU is a 128x128 (or 256x256) fixed grid. Every matmul tile must fill that grid to achieve peak utilisation. A small batch size shrinks the activation matrix rows, leaving most PEs computing with zero-padded inputs. A GPU's warp-based scheduler can keep other warps running during memory latency, partially hiding underutilisation. The systolic array has no such hiding mechanism: unused PEs are simply idle for those cycles.
Tap to flip back
Nothing improves. The systolic array performs every multiply in the grid regardless of the weight value; zero weights still consume a full cycle and accumulate a zero product. Sparsity gives no throughput benefit unless the compiler can restructure the sparse weights into a denser tile layout before feeding the array. This contrasts with NVIDIA Ampere's 2:4 structured-sparsity tensor core support, which can double effective throughput for qualifying sparse matrices.
Tap to flip back
The TPU systolic array has no caches, no out-of-order execution, and no dynamic scheduling. Data enters at a fixed rate and the wavefront propagates in lockstep; execution time is determined entirely by matrix dimensions and clock frequency. GPUs rely on caches, dynamic warp scheduling, and speculative memory prefetch, all of which introduce variance. The TPU's determinism is intentional: datacenter inference must meet strict tail-latency SLOs, and a predictable execution model simplifies capacity planning.
Tap to flip back
XLA must pad every matrix dimension to the nearest multiple of the MXU tile size (128 for most TPU generations). If a model uses a hidden dimension of, say, 100, XLA pads it to 128, wasting 22% of MXU capacity and extra HBM bandwidth on each matmul tile. The constraint is structural: the systolic array is a fixed physical grid; feeding it a matrix row shorter than 128 elements still occupies a full row of PEs for that cycle. Model designers targeting TPUs therefore choose embedding and hidden dimensions that are multiples of 128.
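The padding waste can be computed for any dimension. This sketch assumes a tile size of 128 and treats waste as the padded fraction of a single dimension; the helper names are hypothetical.

```python
def padded(dim, tile=128):
    """Round dim up to the next multiple of the tile size."""
    return -(-dim // tile) * tile


def wasted_fraction(dim, tile=128):
    """Share of the padded dimension that holds padding rather than data."""
    p = padded(dim, tile)
    return (p - dim) / p


for dim in (100, 128, 129, 4096):
    print(dim, padded(dim), f"{wasted_fraction(dim):.1%}")
```

A dimension of 100 pads to 128 (about 22% waste), whereas 129 pads to 256 and wastes nearly half, which shows why dimensions just above a multiple of 128 are the worst case.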
Tap to flip back
Arithmetic intensity is the ratio of floating-point operations to bytes of memory traffic for a given kernel. A chip has a peak ops:byte ratio (e.g. ~295 FLOP/B on H100 FP16). If an operation's arithmetic intensity exceeds that ratio, the operation is compute-bound and the tensor cores can be kept fully busy. If it falls below, the operation is memory-bound and compute units stall waiting for data. Large GEMMs (e.g. M=N=K=8192) reach ~2730 FLOP/B, far above the roofline knee, which is why they are the only common operation that saturates tensor cores.
Tap to flip back
The mapping is direct: batch size B becomes M, input features d_in become K, output features d_out become N. The operation Y = X · W^T is a standard [M×K] × [K×N] multiply. During backprop this produces two additional GEMMs of identical problem size: one for the input gradient and one for the weight gradient, making training cost roughly 3× the forward-pass GEMM cost.
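The three GEMMs can be written out as (M, N, K) triples. This sketch assumes X has shape [B, d_in] and W has shape [d_out, d_in]; the helper names are hypothetical. It shows that the backward GEMMs permute the dimensions but keep the same FLOP count.

```python
def linear_layer_gemms(batch: int, d_in: int, d_out: int) -> dict:
    """(M, N, K) for the forward GEMM and the two backward GEMMs of a linear layer."""
    return {
        "forward: Y = X W^T": (batch, d_out, d_in),
        "input grad: dX = dY W": (batch, d_in, d_out),
        "weight grad: dW = dY^T X": (d_out, d_in, batch),
    }


def gemm_flops(m: int, n: int, k: int) -> int:
    """Multiply-add count times two."""
    return 2 * m * n * k


# Illustrative sizes, not taken from any particular model.
shapes = linear_layer_gemms(batch=32, d_in=4096, d_out=11008)
for name, (m, n, k) in shapes.items():
    print(f"{name:<28} M={m:<6} N={n:<6} K={k:<6} FLOPs={gemm_flops(m, n, k):.3e}")
```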
Tap to flip back
During single-token decoding, batch size and sequence length collapse so that M = 1. The GEMM degenerates to a matrix-vector product (GEMV). Arithmetic intensity drops to roughly 1 FLOP/B, far below any modern GPU's roofline knee. The bottleneck becomes HBM bandwidth, not compute. This is why techniques such as quantisation (reducing bytes moved per weight), continuous batching (increasing effective M), and speculative decoding (generating multiple tokens per forward pass) are critical for inference throughput.
Tap to flip back
Tensor cores process fixed tile sizes (e.g. 16×8×16 per warp instruction). If M, N, or K is not a multiple of the tile dimension (8 in FP16, 16 in INT8), the final tile is partially empty - lanes compute zeros but still consume clock cycles. This waste is tile quantisation. The alignment rule: keep all three GEMM dimensions as multiples of 8 (FP16) or 16 (INT8). Violating this can cost 20-50% of tensor-core throughput.
Tap to flip back
A standard GEMM kernel reads and computes over the full dense matrix regardless of how many values are zero. The hardware has no mechanism to skip zero multiplications in arbitrary positions. Speedup requires either structured sparsity with hardware support (e.g. NVIDIA 2:4 sparsity on Ampere+, which halves GEMM cost but constrains the sparsity pattern) or specialised sparse kernels that only outperform dense GEMMs when sparsity exceeds ~90-95%, a threshold rarely achieved without substantial accuracy loss.
Tap to flip back
The four weight projections (Q, K, V, output) and the two attention score operations (QK^T and score·V) are all batched GEMMs. The non-GEMM step is the softmax normalisation over the T×T score matrix: it requires a row-wise max and sum, which cannot be expressed as a matrix multiply. At long sequence lengths this intermediate T×T matrix also becomes the memory bottleneck. FlashAttention addresses this by fusing the softmax into the tiled GEMM kernel, keeping intermediates in fast SRAM, but the softmax itself remains outside the GEMM primitive.
Tap to flip back
Dettmers et al. (NeurIPS 2022) found that at scales above ~6B parameters, a small fraction of activation dimensions develop very large magnitudes (outliers). Quantising those dimensions to INT8 alongside normal values causes severe accuracy loss because the outliers dominate the quantisation range and compress all other values into a narrow integer bucket. The fix is mixed-precision decomposition: outlier dimensions stay in FP16 (a separate small GEMM) while the remaining ~99.9% of values are quantised to INT8. This shows that the standard GEMM abstraction assumes statistically uniform matrices; emergent structure at scale can break that assumption.
Tap to flip back
The boost clock is the maximum opportunistic frequency reached when the GPU is cool and within its instantaneous power budget. The sustained clock is the lower equilibrium frequency the GPU settles at once thermals stabilise - typically 5-20% below boost. A compute-bound workload's FLOP/s throughput scales directly with clock frequency, so a 10% clock reduction means ~10% less throughput. Short benchmarks (under ~60 seconds) may never leave the boost window, making peak figures misleadingly optimistic compared to hours-long training runs.
Tap to flip back
swPowerCap means the GPU has been throttled because it hit the software-configured power limit (set via nvidia-smi --power-limit or NVML). The chip may still have thermal headroom - it is a policy constraint, not a thermal emergency. hwThermal means the junction temperature itself is at or near the hardware safety threshold (~83-87°C), forcing a hardware slowdown regardless of any power-cap setting. The distinction is actionable: swPowerCap on a cool GPU suggests the power cap is set too conservatively; hwThermal on a hot GPU means cooling is insufficient.
Tap to flip back
Memory-bound kernels are bottlenecked by HBM bandwidth, not by SM compute frequency. HBM operates on a separate clock domain that is relatively insensitive to core-clock reductions from power capping. Cutting the power limit reduces SM frequency (and voltage), but if the kernel spends most of its time waiting for memory, the lower SM clock barely affects runtime. Compute-bound workloads (large GEMMs, attention computation) are directly frequency-limited and lose throughput closer to linearly with clock reduction. This is why a blanket power cap tuned for inference can be far more aggressive than one for pre-training.
Tap to flip back
P_dynamic ≈ α · C · V² · f
- α = switching activity factor
- C = total switched capacitance
- V = supply voltage
- f = clock frequency
The costly coupling: voltage and frequency are not independent. Running stably at higher f requires higher V to maintain timing margins. A 10% frequency increase may require ~5% higher voltage, which multiplies power by V² - roughly a 10% additional power increase on top of the linear f term. So a modest frequency uplift can demand a disproportionate power increase, which is why GPU boost clocks are narrow high-efficiency operating bands, not freely adjustable levers.
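The compounding effect is easy to verify numerically. A minimal sketch, assuming α and C stay constant and using the illustrative 10% frequency / 5% voltage pairing from the text; the function name is hypothetical.

```python
def relative_dynamic_power(freq_scale: float, volt_scale: float,
                           activity_scale: float = 1.0, cap_scale: float = 1.0) -> float:
    """P_new / P_old under P ~ alpha * C * V^2 * f."""
    return activity_scale * cap_scale * volt_scale ** 2 * freq_scale


freq_only = relative_dynamic_power(freq_scale=1.10, volt_scale=1.00)
with_voltage = relative_dynamic_power(freq_scale=1.10, volt_scale=1.05)
print(f"+10% f, same V : {freq_only - 1:+.1%} power")
print(f"+10% f, +5% V  : {with_voltage - 1:+.1%} power")
```

Under these assumptions a 10% clock uplift costs roughly 21% more dynamic power once the voltage bump is included.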
Tap to flip back
NVLink nodes can be placed in sync-boost mode, where all GPUs in the topology lock their SM clocks to the same frequency - that of the slowest (usually hottest) member. This ensures collective-communication kernels (AllReduce, AllGather) complete at the same rate across all ranks, avoiding idle waiting. The cost: if one GPU (e.g., GPU 7 with a blocked cooling vent) throttles to 90% of base clock, all eight GPUs run at 90% of base clock. A single thermal outlier is enough to impose a cluster-wide slowdown. Per-GPU temperature monitoring and physical airflow checks are therefore cluster-level concerns, not just per-device ones.
Tap to flip back
- nvidia-smi dmon -s pucvmet -d 0.2 - real-time stream of SM clock, power draw, temperature, and utilisation at 200 ms intervals. Look for SM clock dropping while temperature and power are at their limits.
- nvidia-smi -i <gpu_id> -q -d PERFORMANCE - reports active throttle reasons (e.g., HW Thermal Slowdown: Active, SW Power Cap: Active).
- nvmlDeviceGetCurrentClocksThrottleReasons() via NVML (or its Python pynvml wrapper) - programmatic bitmask of all active throttle flags, suitable for automated monitoring in training harnesses.
Tap to flip back
At low utilisation the GPU spends much of its time idle; average power draw stays well below the cap, so the cap is never the binding constraint and the GPU runs at boost clock. As utilisation climbs, average power draw approaches the cap. The PMU begins throttling clock frequency to keep power within the configured limit, reducing throughput at precisely the moment demand is highest. The fix: either raise the power cap (if rack power budget allows), use a smaller model, reduce batch size, or accept that SLA latency will degrade under high load. Profiling at production utilisation - not idle - is the only way to discover this before deployment.
Tap to flip back
The ridge point I* = F / BW_mem is the arithmetic intensity (FLOP/byte) at which a kernel transitions from memory-bound to compute-bound. Compute peak F and HBM bandwidth BW_mem are both on the datasheet. For H100 SXM5: 989e12 / 3.35e12 ≈ 295 FLOP/byte. Any kernel with intensity below that value is memory-bound regardless of how many TFLOPS the chip advertises.
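The roofline rule can be sketched as a pair of helpers. The peak and bandwidth values are the H100 SXM5 datasheet numbers quoted above; the function names and the example intensities are illustrative assumptions.

```python
H100_PEAK_FLOPS = 989e12   # dense FP16/BF16 tensor-core peak, from the text
H100_HBM_BW = 3.35e12      # bytes per second, from the text


def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the two roofline segments meet."""
    return peak_flops / mem_bw


def attainable_flops(intensity: float, peak_flops: float, mem_bw: float) -> float:
    """Roofline model: throughput is capped by compute or by bandwidth."""
    return min(peak_flops, intensity * mem_bw)


def classify(intensity: float, peak_flops: float = H100_PEAK_FLOPS,
             mem_bw: float = H100_HBM_BW) -> str:
    return "compute-bound" if intensity >= ridge_point(peak_flops, mem_bw) else "memory-bound"


for name, intensity in [("elementwise op", 1.0), ("large GEMM", 1365.0)]:
    tflops = attainable_flops(intensity, H100_PEAK_FLOPS, H100_HBM_BW) / 1e12
    print(f"{name:<15} {classify(intensity):<14} ceiling {tflops:.1f} TFLOP/s")
```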
Tap to flip back
Compute-bound, because 1365 > 295. The tensor cores are the bottleneck, not DRAM. An elementwise GELU at ~1 FLOP/byte sits far below the ridge and is memory-bound; fusing it with the preceding linear projection raises intensity and removes the separate HBM round-trip.
Tap to flip back
Each halving of operand bit width lets the tensor core pack roughly twice as many multiply-accumulate units into the same silicon area, so peak throughput roughly doubles per dtype step: FP16/BF16 is 2x TF32, FP8 is 2x FP16. FP32 scalar (CUDA core) throughput is far lower than BF16 tensor-core peak on the same chip (roughly 15x on H100 SXM: 67 vs. 989 TFLOP/s dense) because it bypasses the tensor-core datapath entirely. Sparse (2:4) figures are quoted at exactly 2x the corresponding dense figure.
Tap to flip back
Tensor cores process matrices in fixed tile sizes (e.g., multiples of 16 bytes for FP16). A matrix dimension of 4097 spills one element past the 4096 boundary, so the kernel needs an extra tile in that direction that holds a single useful element, wasting ~99% of that tile's capacity. Real batch sizes and sequence lengths are rarely aligned. Efficiency commonly drops to 70-85% of peak for production workloads with irregular shapes.
Tap to flip back
The relevant figure is interconnect bandwidth: NVLink (900 GB/s bidirectional on H100, ~450 GB/s effective per ring direction) vs. PCIe 4.0 x16 (~32 GB/s). For a hidden dimension of 8192 in BF16, each all-reduce moves ~32 KB per token per layer. At 80 layers and a batch of 32 tokens, that is ~82 MB per decoding step: ~0.18 ms over NVLink versus ~2.56 ms over PCIe, a ~14x gap that recurs on every step. TP-8 is viable inside an NVLink node and prohibitively slow across PCIe.
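A bandwidth-only estimate of the all-reduce cost can be sketched as follows. It assumes one all-reduce per layer, ignores latency and overlap with compute, and uses a traffic factor of 2 (an assumption chosen so that 8192 BF16 values come to ~32 KB per token per layer). The function name and parameters are hypothetical.

```python
def allreduce_seconds(hidden: int, layers: int, tokens: int, link_bytes_per_s: float,
                      bytes_per_element: int = 2, traffic_factor: float = 2.0) -> float:
    """Bandwidth-only time to all-reduce one activation vector per token per layer."""
    bytes_per_token_layer = hidden * bytes_per_element * traffic_factor
    total_bytes = bytes_per_token_layer * layers * tokens
    return total_bytes / link_bytes_per_s


NVLINK_BW = 450e9   # effective bytes/s per ring direction, from the text
PCIE4_BW = 32e9     # PCIe 4.0 x16, from the text

nvlink = allreduce_seconds(8192, layers=80, tokens=32, link_bytes_per_s=NVLINK_BW)
pcie = allreduce_seconds(8192, layers=80, tokens=32, link_bytes_per_s=PCIE4_BW)
print(f"NVLink: {nvlink * 1e3:.2f} ms   PCIe: {pcie * 1e3:.2f} ms   ratio: {pcie / nvlink:.1f}x")
```

The ratio between the two links is simply the ratio of their bandwidths, so it holds for any batch size under this model.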
Tap to flip back
The datasheet figure is a peak burst rate measured with sequential access patterns. Real workloads - sparse attention, embedding table lookups, variable-length KV-cache reads - exhibit irregular access patterns that achieve only 40-70% of headline bandwidth. Thermal throttling on sustained loads and sub-optimal memory access alignment further reduce effective bandwidth. Budget at 50-65% of headline for memory-bound kernel estimates.
Tap to flip back
HBM bandwidth becomes irrelevant. The binding constraint shifts to L2 bandwidth, which is roughly 3-4x higher than HBM bandwidth on H100 (est. 12 TB/s vs. 3.35 TB/s). Kernel fusion (e.g., FlashAttention keeping attention activations in SRAM) is designed precisely to exploit this: multiple logical operations share one HBM load/store, and intermediate results live in the fast on-chip memory hierarchy throughout.
Tap to flip back
A pretrained base model is a document-completion engine. It will answer "What is the capital of France?" by generating more geography questions, because that pattern is common in web text. SFT forces the model to map an instruction-style prompt to a single, coherent response instead of continuing a document.
Why it matters: Without SFT, the base model's output distribution is uncontrolled; all downstream alignment stages depend on SFT establishing the assistant format.
Tap to flip back
- x is the prompt.
- y_w is the human-preferred (winning) completion; y_l is the less-preferred (losing) one.
- r_theta is the learned scalar reward head.
- The loss is minimised when the preferred completion scores higher; sigma converts the score difference to a probability.
Why it matters: This loss converts ranking annotations (which humans find easier to produce than absolute scores) into a differentiable training signal.
Tap to flip back
The KL term keeps the RLHF policy close to the SFT reference model:
\[\text{objective} = \mathbb{E}\left[r_\phi(x, y) - \beta \, \text{KL}[\pi_\theta \| \pi_\text{SFT}]\right]\]
Without it, the policy exploits weaknesses in the reward model, producing outputs that score well on the RM but are useless to users (e.g. repetitive, verbose, or grammatically strange text). The coefficient beta controls the strength of this constraint; too large and RLHF makes no progress, too small and reward hacking emerges within a few thousand steps.
Tap to flip back
The optimal RLHF policy can be written in closed form:
\[\pi^*(y|x) \propto \pi_\text{ref}(y|x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)\]
Inverting this expression gives an implicit reward in terms of any policy and the reference. Substituting back into the pairwise preference loss yields the DPO objective, which depends only on the policy and the reference model log-probabilities over the (chosen, rejected) pair. No separate RM needs to be trained; no RL loop is needed.
Tap to flip back
Task arithmetic treats the difference fine_tuned_weights - base_weights as a "task vector". Multiple task vectors can be added to the base to produce a combined model:
merged = base + alpha * delta_A + beta * delta_B
Failure modes:
1. Conflicting parameters: if two delta vectors modify the same weights in opposite directions, the merge can degrade both capabilities rather than combining them.
2. Untested combination: the merged model was never seen during any training run, so benchmark-tuned alpha/beta values may not generalise to real usage.
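The merge formula and the first failure mode can be illustrated on toy weights. This is a sketch only: parameters are stored as plain lists in a dict, and the names `task_vector` and `merge` plus all the numbers are hypothetical.

```python
def task_vector(fine_tuned: dict, base: dict) -> dict:
    """delta = fine_tuned_weights - base_weights, per parameter tensor."""
    return {name: [f - b for f, b in zip(fine_tuned[name], base[name])] for name in base}


def merge(base: dict, scaled_deltas: list) -> dict:
    """merged = base + sum(scale_i * delta_i)."""
    merged = {name: list(values) for name, values in base.items()}
    for delta, scale in scaled_deltas:
        for name, values in delta.items():
            merged[name] = [m + scale * d for m, d in zip(merged[name], values)]
    return merged


base = {"w": [1.0, 2.0, 3.0]}
model_a = {"w": [1.5, 2.0, 2.0]}   # task A lowers the third weight
model_b = {"w": [1.0, 3.0, 4.0]}   # task B raises the third weight

delta_a = task_vector(model_a, base)
delta_b = task_vector(model_b, base)
merged = merge(base, [(delta_a, 1.0), (delta_b, 1.0)])
print(merged)
```

In this toy case the third weight ends up back at its base value because the two task vectors pull it in opposite directions, which is the conflicting-parameter failure mode in miniature.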
Tap to flip back
Every subsequent stage (RM training, RLHF, DPO) treats SFT model outputs as either the starting policy or the reference distribution. If SFT demonstrations contain factual errors, the RM is trained partly on those errors as positive examples. If SFT instils stylistic biases (over-hedging, excessive lists), those patterns become the baseline the KL penalty protects. No amount of RL or preference optimisation fully overcomes a flawed SFT foundation.
Tap to flip back
DPO advantages:
1. Requires only 2 models in memory (policy + frozen reference) vs. 4 for PPO (policy, reference, reward model, value model).
2. Training is offline: no need to sample from the policy during training, which simplifies the pipeline.
3. More stable: no RL instability, no reward model collapse.
Where PPO is still preferred:
Online RL allows the policy to explore its own failure modes and generate new preference data on the fly, which can outperform offline DPO when the initial preference dataset is small or does not cover the target distribution well.
Tap to flip back
Loss is computed only on response tokens; instruction and input tokens are masked to -100 (PyTorch's ignore index). The model is being trained to generate good answers given a prompt, not to memorise the prompts themselves. Including prompt tokens in the loss would dilute the gradient signal and train the model to predict instruction text rather than produce helpful responses.
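The masking step can be sketched without any framework. The token IDs below are made up, and `build_labels` is a hypothetical helper; only the -100 ignore index comes from the text.

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy skips targets with this value


def build_labels(prompt_ids: list, response_ids: list) -> tuple:
    """Concatenate prompt and response; supervise only the response tokens."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Hypothetical token IDs for illustration only.
prompt = [101, 2054, 2003, 1996]
response = [3007, 2003, 3000, 102]
input_ids, labels = build_labels(prompt, response)
print(input_ids)
print(labels)
```

The inputs still contain the full prompt so the model can condition on it; only the loss targets are masked.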
Tap to flip back
Fine-tuning on just 1,000 carefully curated instruction-response pairs produced a model whose responses human evaluators judged equivalent or preferred to GPT-4's in 43% of cases, to Bard's in 58%, and to DaVinci003's in 65%. The implication: data quality dominates data quantity; 10,000 mediocre examples are likely worse than 1,000 excellent ones.
Tap to flip back
Self-Instruct bootstraps instruction data by having the model generate candidate (instruction, input, output) triples from a small human-written seed set, then filters out duplicates and low-quality examples using heuristics. The filtered synthetic data is used to fine-tune the same model. Wang et al. (2023) showed a 33% absolute improvement over the GPT-3 baseline on SuperNaturalInstructions using this approach.
Tap to flip back
The SFT checkpoint serves two roles: (1) it is the initial policy that PPO optimises, and (2) it is the reference model against which the KL-divergence penalty is computed during RL fine-tuning. A weak SFT checkpoint sets a hard ceiling on alignment quality - RL can shift the output distribution but cannot inject capabilities the base model never acquired during pretraining or SFT.
Tap to flip back
Catastrophic forgetting occurs when fine-tuning overwrites pretrained knowledge, degrading performance on tasks not represented in the instruction dataset. The primary culprit is an excessively high learning rate (typically anything well above 2e-5 for a 7B model). Too many training epochs on a narrow instruction distribution compounds the problem. Monitoring held-out benchmark scores (e.g., MMLU, HumanEval) during training detects it early.
Tap to flip back
SFT treats every training response as equally correct - it cannot distinguish between an acceptable answer and an excellent one. There is no gradient signal encoding "this response is better than that one." Preference-based methods provide comparison data (A is better than B) so the model can learn to favour higher-quality outputs rather than just reproduce the average of its training demonstrations.
Tap to flip back
- Inherited hallucinations - the student model learns to reproduce the confident but incorrect outputs present in the teacher's generations.
- Misaligned refusal calibration - the student absorbs the teacher's safety behaviour, which was tuned for a different deployment context, producing either over-refusals or gaps where the teacher's policy was more permissive than intended for the student's use case.
Tap to flip back
During SFT the model learned to associate specific delimiter tokens (e.g. [INST], <|im_start|>) with role boundaries. At inference time, the model is simply a next-token predictor. Wrong delimiters produce a token sequence the model has never seen in training, so it generates plausible continuations that ignore the intended role structure - no exception is raised because the sequence is syntactically valid.
Why it matters: This makes format mismatch a silent correctness bug, not a runtime error.
Tap to flip back
add_generation_prompt=True appends the opening tokens of an assistant turn (without closing them), priming the model to generate in assistant mode. During SFT preprocessing it must be False because the training targets already include the full assistant turn; adding the header a second time would teach the model to reproduce it as part of response content, causing format corruption.
Tap to flip back
Pre-training vocabulary tokens arrive at SFT with embedding vectors shaped by billions of training tokens - they carry semantic and syntactic associations. Freshly added special tokens (e.g. <|eot_id|>) start with random embeddings and must learn their meaning entirely from SFT examples. Too few examples and the embedding fails to converge, producing unstable model behaviour near those tokens. Monitoring embedding norm during training is a useful diagnostic.
Tap to flip back
If you call apply_chat_template(tokenize=False) to get a formatted string and then tokenise it separately with add_special_tokens=True, the tokeniser prepends a second bos_token. The model was trained to see exactly one BOS at position 0; a duplicate shifts all subsequent position encodings and degrades output quality.
Fix: Either use apply_chat_template(tokenize=True) directly, or set add_special_tokens=False when tokenising the rendered string separately.
Tap to flip back
ChatML wraps each turn as <|im_start|>{role}\n{content}<|im_end|>, using dedicated token IDs with no pre-training history. It cleanly separates role metadata from content and avoids the ambiguity of printable-string delimiters. Qwen, Phi-3, and several other recent model families adopted it. It is not a formal standard - many major models (Llama 2, Mistral-Instruct, Gemma) use different formats - but it is the closest the community has to a shared convention for new designs.
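A hand-rolled renderer makes the format concrete. This is a sketch, not any library's template: `render_chatml` is a hypothetical function, the trailing newline after each `<|im_end|>` is an assumed convention, and the `add_generation_prompt` flag simply mirrors the name used by chat-template APIs.

```python
def render_chatml(messages: list, add_generation_prompt: bool = False) -> str:
    """Render a list of {"role", "content"} dicts as ChatML text."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"  # trailing "\n" is an assumption
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn without closing it, so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Name one prime number."},
]
print(render_chatml(chat, add_generation_prompt=True))
```

For training data the full assistant turn is already in `messages`, so the flag stays off; at inference time it is switched on to prime the assistant turn.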
Tap to flip back
In standard SFT the loss is computed only on assistant tokens; user and system turns are masked to -100. If the template is applied incorrectly - for example, if the generation-prompt header is missing and the boundary between user and assistant content is ambiguous - the loss mask may accidentally supervise the model to reproduce user messages. The symptom is instruction-following collapse: the model echoes or paraphrases user input instead of generating assistant-style responses.
Why it matters: Template correctness is not just an inference concern; it is a training correctness requirement.
Tap to flip back
Jinja2 is implemented in multiple languages. Python-specific methods (.lower(), .items(), .strip()) work in the Python implementation but are unavailable in the Rust tokenizers library and JavaScript runtimes used by llama.cpp and browser-side deployments. Replace them with Jinja filters: |lower, |items, |trim. Also replace Python literals True/False/None with Jinja true/false/none. This keeps templates portable across all backends without changing rendered output.
Tap to flip back
A preference record contains: prompt, chosen (preferred response), and rejected (dispreferred response). The label asserts only a relative ordering: "given this prompt, a human preferred A over B." It makes no absolute quality claim. This relativity makes annotation cheaper but introduces sensitivity to annotator bias and pair-selection effects.
Tap to flip back
- Human pairwise comparison - expensive; annotator verbosity bias.
- Scaled rating then conversion - Likert gaps treated as equivalent when they are not.
- AI feedback (RLAIF) - circular if judge model shares the trained model's biases.
- Constitutional / rule-based filtering - encodes the constitution author's values, not broader human consensus.
Why it matters: the strategy you choose determines where label noise enters the pipeline.
Tap to flip back
IAA measures how consistently different annotators label the same pair. Data that falls below a minimum agreement threshold (a Cohen's kappa of at least 0.4 is a common bar; safety-critical domains use higher thresholds) is discarded or flagged. Low IAA indicates the pairs are ambiguous and likely to introduce noise into reward model training.
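Cohen's kappa for two annotators can be computed directly from their label lists. A minimal sketch with made-up labels ("A" means the annotator preferred response A); the function name is hypothetical and the 0.4 threshold is the one mentioned above.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1.0:          # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
annotator_2 = ["A", "A", "B", "A", "A", "B", "B", "A", "B", "A"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}, keep batch: {kappa >= 0.4}")
```

Raw agreement here is 80%, but kappa is lower because half of that agreement would be expected by chance given how often each annotator picks "A".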
Tap to flip back
A reward model can only generalise to the task distribution the preference data covers. 100,000 pairs concentrated on creative writing will produce a reward model that is a poor judge of code correctness. Diversity and coverage of the actual deployment distribution matter more than raw count once a minimum threshold (roughly a few thousand pairs for single-domain tasks) is reached.
Why it matters: prompt distribution bias is invisible to IAA metrics, so it is the easiest failure to miss.
Tap to flip back
Verbosity bias is the tendency for annotators (and LLM judges) to prefer longer, more confident-sounding responses even when shorter ones are more accurate. A reward model trained on such data learns to be verbose rather than correct. The bias is present in both human and AI feedback pipelines and is one documented route to reward hacking.
Tap to flip back
The margin is the quality difference between the chosen and rejected responses. Low-margin pairs (both responses are good, one is only marginally better) carry weak training signal: the reward model learns small stylistic distinctions that do not generalise. Filtering to high-confidence, high-margin pairs produces a smaller but more effective dataset. DPO is especially sensitive because it directly optimises the log-probability ratio of chosen vs. rejected, with no separate reward model to absorb noise.
Tap to flip back
As the policy model improves, prompts sampled from earlier usage logs become unrepresentative: the updated model generates better responses, so old "rejected" responses may no longer be the kind of errors the new model makes. Iterative collection (Anthropic's approach was a weekly cadence) re-samples prompts from the current model's outputs, keeping the preference data aligned with the current policy's error distribution. Without this, the reward model eventually receives pairs that are both high quality, producing near-random labels on the policy's actual failure modes.
Tap to flip back
A reward model replaces the token-prediction head of a language model with a scalar regression head. Given a prompt x and a completion y, it outputs a single number r(x, y) representing estimated human preference. The rest of the transformer stack is unchanged. This scalar becomes the reward signal used in PPO or similar RL training.
Tap to flip back
For a preferred completion y_w and a rejected completion y_l on the same prompt x:
L = -log σ( r(x, y_w) - r(x, y_l) )
This pushes the model to assign a higher scalar to the winning completion. It only requires relative ranking, so labellers never need to agree on an absolute quality scale.
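The loss is short enough to write out by hand. This sketch uses scalar rewards and a numerically stable form of -log σ(m); the function name and the example scores are illustrative.

```python
import math


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written to avoid overflow."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))        # log(1 + e^-m)
    return -margin + math.log1p(math.exp(margin))   # same value, safe for large negative m


print(f"correct ranking : {pairwise_loss(2.0, 0.0):.4f}")
print(f"tie             : {pairwise_loss(0.0, 0.0):.4f}")
print(f"wrong ranking   : {pairwise_loss(0.0, 2.0):.4f}")
```

Only the difference between the two scores matters, so adding a constant to every reward leaves the loss unchanged.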
Tap to flip back
The KL penalty β · KL(π_θ || π_ref) prevents the policy from drifting far from the SFT baseline. Without it, the policy quickly discovers completions that exploit the reward model's blind spots - high-scoring but incoherent or degenerate text. The penalty keeps the policy in-distribution relative to the RM's training data. β (typically 0.01-0.1) is often the highest-leverage hyperparameter in the pipeline.
Tap to flip back
Reward hacking occurs when the policy optimises the proxy reward model beyond the point where human preference also improves. Gao, Schulman, and Hilton (2022) showed that proxy reward continues climbing while gold-standard human preference flattens and then falls. The gap follows predictable scaling laws with respect to the KL divergence of the optimised policy from its starting point and to RM size. Standard mitigations: KL regularisation, early stopping, periodic RM retraining on policy-generated completions.
Tap to flip back
The SFT model shares token distribution with the completions being scored; it already "speaks" assistant-style language. Starting from this checkpoint gives the RM a better prior over completion quality and means fewer comparison pairs are needed to reach useful ranking accuracy. Starting from the raw base model wastes capacity re-learning surface-form patterns already captured by SFT.
Tap to flip back
A single RM trained on combined feedback must encode a trade-off between helpfulness and safety as a single scalar. Optimising that scalar hard produces a policy that is either uselessly cautious or subtly harmful, depending on which objective the RM learnt to weight more. Decomposing into two RMs allows each to be optimised independently and combined at a higher level, giving finer control over the helpfulness-safety frontier.
Tap to flip back
- Length bias: labellers tend to prefer longer responses. The RM absorbs this; the policy learns to pad outputs. Fix: length-normalised rewards or explicit length calibration.
- Labeller inconsistency: vague guidelines cause different labellers to resolve helpfulness-vs-safety conflicts differently. The RM learns a noisy mixture of objectives. Fix: detailed, worked-example annotation guidelines.
Both biases compound silently - the RM fits the data perfectly, but the data does not reflect a coherent human preference.
Tap to flip back
The KL penalty prevents reward hacking: the policy exploiting flaws or extrapolation errors in the imperfect reward model. The reward model was trained on finite human comparisons; an unconstrained policy will find and amplify its blind spots, producing fluent-but-harmful or nonsensical completions. The KL term forces the policy to stay close to the reference distribution, limiting how far it can exploit those gaps.
Why: Goodhart's Law applies directly here. Once the reward model score becomes the target, it stops being a reliable proxy for human preference.
Tap to flip back
r_total(x, y) = r_reward_model(x, y) - β · KL[π_θ(y|x) || π_ref(y|x)]
- r_reward_model(x, y) - the learned reward model's score for prompt x and completion y
- β - the coefficient controlling how tightly the policy is constrained
- KL[π_θ || π_ref] - forward KL divergence between the live policy and the frozen reference
- π_θ - the policy being optimised
- π_ref - the frozen reference model (the SFT checkpoint)
Why: Subtracting the KL term means any increase in reward is only beneficial if it does not move the policy too far from the reference.
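One common way to apply the formula is to estimate the KL term from the sampled completion itself, as the sum of per-token log-probability differences. The sketch below assumes that single-sample estimator; the function name, the β value, and the log-probabilities are all illustrative.

```python
def kl_shaped_reward(rm_score: float, policy_logprobs: list, ref_logprobs: list,
                     beta: float = 0.05) -> float:
    """r_total = r_reward_model - beta * KL_estimate for one sampled completion.

    The KL estimate is sum_t [log pi_theta(y_t) - log pi_ref(y_t)], a
    single-sample Monte Carlo estimate (an assumption of this sketch).
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


# Hypothetical per-token log-probs of the same completion under both models.
policy_lp = [-0.1, -0.2, -0.3]
ref_lp = [-0.5, -0.4, -0.9]
print(f"r_total = {kl_shaped_reward(1.2, policy_lp, ref_lp):.3f}")
```

When the policy assigns the same log-probabilities as the reference, the penalty vanishes and the shaped reward equals the raw reward model score.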
Tap to flip back
The reference model is a frozen snapshot of the supervised fine-tuned (SFT) checkpoint, taken immediately before RL begins. It is not the raw pretrained base because:
- The SFT checkpoint already encodes instruction-following behaviour, formatting conventions, and coherence built during SFT.
- The KL penalty protects this investment. Using the raw base would allow RL to destroy SFT-acquired capabilities while still satisfying a lax KL constraint.
- Human raters evaluated the reward model outputs relative to instruction-following behaviour, so the SFT distribution is the meaningful reference point.
Tap to flip back
π*(y|x) ∝ π_ref(y|x) · exp( r(x, y) / β )
The optimal policy reweights the reference distribution by exponentiated reward, with β controlling sharpness. Key implication: the reference model is structurally load-bearing, not just a regulariser. DPO exploits this - it eliminates the explicit reward model but still requires the reference, because the reference encodes the reward implicitly through log-probability ratios.
Why: If the reference is bad, the optimal policy is bounded by a bad prior regardless of reward model quality.
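The reweighting can be seen on a toy distribution over three candidate completions. The probabilities and rewards are made up and `optimal_policy` is a hypothetical helper; the maximum score is subtracted before exponentiating purely for numerical stability.

```python
import math


def optimal_policy(ref_probs: list, rewards: list, beta: float) -> list:
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), normalised over a finite set."""
    scores = [r / beta for r in rewards]
    top = max(scores)                                   # shift to avoid overflow
    weights = [p * math.exp(s - top) for p, s in zip(ref_probs, scores)]
    total = sum(weights)
    return [w / total for w in weights]


ref = [0.7, 0.2, 0.1]
rewards = [0.0, 1.0, 2.0]
for beta in (100.0, 1.0, 0.1):
    probs = optimal_policy(ref, rewards, beta)
    print(f"beta={beta:<6} " + "  ".join(f"{p:.3f}" for p in probs))
```

A large β leaves the reference almost untouched, a small β concentrates mass on the highest-reward completion, and any completion the reference gives zero probability stays at zero whatever its reward.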
Tap to flip back
- Very small β (near 0): Near-unconstrained optimisation. The policy finds reward model exploits quickly; reward hacking and policy collapse are likely within a few thousand RL steps.
- Moderate β (0.02-0.1): The range used in InstructGPT and the summarisation paper. Reward increases measurably while general capabilities are preserved.
- Very large β (> 0.5): The policy barely moves from the SFT checkpoint. Training is stable but the reward signal has little effect; you are effectively paying the cost of RL for minimal benefit.
The optimal value is task and reward-model-quality dependent; practitioners typically tune it per task.
Tap to flip back
- Reference model quality sets a ceiling. A poorly trained SFT checkpoint preserves its flaws inside the KL constraint. The RL policy cannot freely escape a bad reference.
- Catastrophic forgetting on out-of-distribution tasks. The KL penalty protects capabilities covered by the reference model's training distribution. Narrow RL fine-tuning (e.g., customer-service only) can still degrade the policy on tasks far from that distribution, because the KL term does not protect all capabilities equally.
(Additional: KL estimates are noisy on long completions; the per-token sum has high variance at hundreds of tokens, destabilising gradients.)
Tap to flip back
DPO's loss is derived from the closed-form optimal policy: π*(y|x) ∝ π_ref(y|x) · exp(r/β). Rearranging to express the reward in terms of policy log-probabilities gives:
r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)) + β · log Z(x)
The reward is parameterised as a log ratio between the live policy and the reference. Plugging this into the Bradley-Terry preference model and differentiating yields the DPO loss. The reference model's log-probabilities appear explicitly in every gradient step. Remove the reference and you lose the anchor that makes the parameterisation meaningful.
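Substituting the implicit reward into the pairwise loss gives a per-pair expression in which log Z(x) cancels, because it is the same for both completions. The sketch below works on sequence-level log-probabilities; the function name, β, and the numbers are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float, beta: float = 0.1) -> float:
    """-log sigmoid( beta * [(log-ratio of chosen) - (log-ratio of rejected)] )."""
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)        # implicit reward, minus beta*log Z
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)  # same offset, so it cancels
    margin = chosen_reward - rejected_reward
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical sequence log-probs: the policy has shifted towards the chosen completion.
print(f"policy == reference : {dpo_loss(-12.0, -15.0, -12.0, -15.0):.4f}")
print(f"policy favours chosen: {dpo_loss(-10.0, -17.0, -12.0, -15.0):.4f}")
```

At initialisation, when the policy equals the reference, every pair has zero margin and the loss is log 2 regardless of the reference's own preferences.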
Tap to flip back
Reward hacking occurs when a policy maximises its proxy reward signal through behaviours the designers did not intend, because the reward model is an imperfect stand-in for genuine human preference. High proxy scores do not guarantee genuinely better behaviour; they may reflect exploitation of the reward model's blind spots.
Why it matters: The reward model is trained on finite, noisy human comparisons - it cannot perfectly encode all human values, so the policy finds shortcuts the labellers never anticipated.
Tap to flip back
Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." In RLHF, the measure is the reward model's score. Once the policy is directly optimised against it, the score stops tracking genuine quality because the policy learns to exploit whatever proxy patterns the reward model uses rather than the underlying property those patterns were meant to capture.
Tap to flip back
When a policy is optimised against a proxy reward model, performance under a held-out "gold" reward model first improves, peaks, then degrades - even as the proxy score keeps rising. The shape is an inverse-U. Beyond the peak, additional optimisation exploits the proxy's weaknesses rather than learning genuinely better behaviour. The degradation threshold depends on reward model size, policy size, and the optimisation method (RL or best-of-n).
Tap to flip back
The modified reward is:
r_total(x, y) = r_proxy(x, y) - β · KL[π_θ(y|x) || π_ref(y|x)]
β penalises the policy for drifting away from the supervised reference model. A larger β reduces the risk of reward hacking (less room to exploit the proxy) but also limits how much the policy can improve. β = 0 is unconstrained reward maximisation; too large a β freezes the policy near the SFT baseline. Practitioners tune it empirically, often targeting a KL range of a few to tens of nats.
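As a rough illustration of the penalised reward, the sketch below uses a single-sample KL estimate: the sum over sampled tokens of the policy log-probability minus the reference log-probability. That estimator and all names are assumptions made for the example.

```python
def kl_penalised_reward(r_proxy, policy_logps, ref_logps, beta):
    """r_total = r_proxy - beta * KL estimate, for one sampled response."""
    # Per-token log-ratio summed over the response: a one-sample estimate
    # of KL[pi_theta || pi_ref] for this prompt.
    kl_estimate = sum(p - q for p, q in zip(policy_logps, ref_logps))
    return r_proxy - beta * kl_estimate
```

With beta = 0 the function returns the raw proxy reward; increasing beta subtracts more for every nat the policy drifts from the reference.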
Tap to flip back
Three common forms:
- Length inflation - reward models trained on human comparisons inherit a weak preference for longer answers; the policy bloats responses.
- Sycophancy - raters tend to prefer responses that agree with their stated beliefs; the policy learns to validate the user regardless of truth.
- Confidence inflation - uncertain but confident-sounding text is rated higher; the policy drops hedges and delivers hallucinations without qualification.
Shared mechanism: each exploits a statistical correlation in the reward model's training data that is not causally connected to genuine quality.
Tap to flip back
Iterative retraining collects new human preference comparisons on the current policy's outputs and retrains the reward model to cover its latest exploits. Each round forces the reward model to patch the loopholes the policy most recently found. The main limitation is cost: generating new outputs, obtaining human labels, and retraining the reward model is expensive and slow. For high-traffic production systems, the lag between deployment and the next retraining round is long enough for significant misalignment to persist unchecked.
Tap to flip back
No. DPO removes the explicit reward model by directly optimising the policy against preference pairs, which prevents some proxy exploitation paths. However, the policy can still overfit the training preference distribution, producing outputs that satisfy the surface patterns of the labelled pairs without generalising well. There is also no reward score to monitor as an early-warning signal, making it harder to detect overoptimisation in progress. The KL from the reference model can still grow implicitly during training on large datasets.
Bottom line: DPO changes the mechanism of potential reward hacking, not the fundamental tension between a finite proxy (preference data) and true human values.
Tap to flip back
A model soup is the arithmetic mean of weight tensors from multiple fine-tuned checkpoints. It works because all ingredients start from the same pretrained base, placing them in the same flat basin of the loss landscape. This shared initialisation creates linear mode connectivity: the straight-line interpolation between any two checkpoints stays in a low-loss region, so their average is also a low-loss point with reduced variance across hyperparameter choices.
Tap to flip back
The greedy soup ranks candidates by solo validation accuracy, starts with the best single checkpoint, then iterates through the remaining checkpoints in descending order. Each candidate is tentatively added (running mean updated); the update is kept only if validation accuracy does not decrease.
This solves the bad-ingredient problem: a checkpoint that is poorly tuned (wrong learning rate, insufficient data) may be far from the flat plateau and would drag the mean toward a higher-loss region. Greedy selection gates inclusion on whether the candidate genuinely helps, at the cost of one validation forward pass per candidate.
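The selection loop can be sketched as follows, with each checkpoint represented as a flat list of floats and `evaluate` standing in for a validation-accuracy function. Both are simplifying assumptions for illustration.

```python
from typing import Callable, List, Sequence

Weights = List[float]

def average(models: Sequence[Weights]) -> Weights:
    """Element-wise arithmetic mean of several weight vectors."""
    return [sum(vals) / len(models) for vals in zip(*models)]

def greedy_soup(checkpoints: Sequence[Weights],
                evaluate: Callable[[Weights], float]) -> Weights:
    # Rank candidates by solo validation score, best first.
    ranked = sorted(checkpoints, key=evaluate, reverse=True)
    ingredients = [ranked[0]]
    best_score = evaluate(ranked[0])
    for candidate in ranked[1:]:
        # Tentatively add the candidate and score the new mean.
        trial_score = evaluate(average(ingredients + [candidate]))
        if trial_score >= best_score:  # keep only if validation does not drop
            ingredients.append(candidate)
            best_score = trial_score
    return average(ingredients)
```

A candidate far from the shared basin lowers the trial score and is skipped, so it never contaminates the running mean.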
Tap to flip back
A task vector is theta_fine_tuned - theta_pretrained: the signed difference between a fine-tuned model's weights and the pretrained base.
To compose capabilities:
theta_merged = theta_base + alpha * tv_A + beta * tv_B
Adding task vectors injects multiple skills simultaneously; negating a vector suppresses a skill. The scalars alpha and beta control the contribution of each task. This works because fine-tuning from the same base keeps task vectors in a linearly-mode-connected neighbourhood.
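A small sketch of the arithmetic, assuming each model is a dict mapping parameter names to flat lists of floats (a stand-in for real tensors; the helper names are hypothetical).

```python
def task_vector(theta_ft, theta_base):
    """tv = theta_fine_tuned - theta_pretrained, per parameter."""
    return {k: [f - b for f, b in zip(theta_ft[k], theta_base[k])] for k in theta_base}

def apply_task_vectors(theta_base, vectors_with_scales):
    """theta_merged = theta_base + sum(scale_i * tv_i)."""
    merged = {k: list(v) for k, v in theta_base.items()}
    for tv, scale in vectors_with_scales:
        for k in merged:
            merged[k] = [m + scale * d for m, d in zip(merged[k], tv[k])]
    return merged
```

Passing a negative scale for a vector implements negation, moving the weights away from that fine-tune's direction.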
Tap to flip back
TIES-Merging (Yadav et al., NeurIPS 2023) addresses:
- Redundant parameters - changes of tiny magnitude that represent noise rather than task learning. TIES resets these to zero (trim step) before merging.
- Sign conflicts - parameters where different fine-tunes push the value in opposite directions. TIES resolves this by electing one sign per parameter: the sign with the larger total magnitude across fine-tunes wins (a magnitude-weighted vote rather than a head count), then only models whose delta agrees with that sign contribute to the average.
Together these steps reduce interference and produce a merged model that more cleanly represents each constituent task.
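A simplified sketch of the trim, elect, and merge steps on flat weight lists. It assumes the sign is elected from the summed signed deltas and omits the final scaling coefficient applied to the merged task vector; `keep_fraction` is an illustrative parameter name.

```python
def ties_merge(base, finetuned, keep_fraction=0.2):
    """Merge several fine-tunes of `base` (all flat lists of equal length)."""
    deltas = [[f - b for f, b in zip(ft, base)] for ft in finetuned]

    # 1. Trim: keep only the largest-magnitude deltas in each model.
    trimmed = []
    for d in deltas:
        k = max(1, round(keep_fraction * len(d)))
        threshold = sorted((abs(v) for v in d), reverse=True)[k - 1]
        trimmed.append([v if abs(v) >= threshold else 0.0 for v in d])

    merged = []
    for i, b in enumerate(base):
        column = [t[i] for t in trimmed]
        # 2. Elect sign: the direction with the larger total magnitude wins.
        sign_positive = sum(column) >= 0
        # 3. Disjoint mean: average only nonzero deltas that agree with the sign.
        agreeing = [v for v in column if v != 0.0 and (v > 0) == sign_positive]
        merged.append(b + (sum(agreeing) / len(agreeing) if agreeing else 0.0))
    return merged
```

In the third parameter of the test case used to check this sketch, one model contributes -2.0 and another +3.0; the positive direction carries more magnitude, so only the +3.0 delta is merged.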
Tap to flip back
WiSE-FT (Wortsman et al., CVPR 2022) interpolates between the zero-shot pretrained weights and the fine-tuned weights:
theta_wise = (1 - alpha) * theta_zero_shot + alpha * theta_fine_tuned
It addresses forgetting of out-of-distribution robustness: fine-tuning CLIP on a specific dataset improves in-distribution accuracy but erodes the broad representations that made the model robust to distribution shifts. At alpha around 0.5, WiSE-FT recovers 4-6 percentage points of OOD robustness while still exceeding the zero-shot model in-distribution. In the reported experiments the interpolated weights often match or beat both endpoints under distribution shift, but this is an empirical result rather than a guarantee.
Tap to flip back
Four principal failure modes:
- Different pretrained bases - weight tensors live in incommensurable spaces; averaging produces nonsense regardless of architecture similarity.
- Long, high-lr, narrow fine-tuning - extended training on a small, domain-specific dataset moves the checkpoint far from the pretrained basin, breaking linear mode connectivity.
- Validation leakage - greedy soup selection on the same partition used for hyperparameter search produces a soup overfit to that partition; gains may not generalise to the test distribution.
- Too many divergent tasks - as the number of merged models grows and their tasks diverge semantically, sign conflicts accumulate faster than TIES/DARE can resolve them, and the merged model becomes a progressively worse approximation of each individual model.
Tap to flip back
DARE (Yu et al., ICML 2024) randomly sets a fraction p of delta parameters (theta_fine_tuned - theta_base) to zero, then rescales the survivors by 1 / (1 - p) to preserve the expected magnitude (a dropout-style correction).
It can afford near-complete dropout because fine-tuning deltas are extremely sparse in effect: experiments show 90-99% of delta values are near zero and contribute negligibly to task performance. Dropping them removes the interference they cause when summed with deltas from other models, while the rescaling ensures the surviving parameters carry the correct expected contribution. The result is that merged models combining many fine-tunes suffer far less mutual cancellation of meaningful weights.
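The drop-and-rescale step can be sketched in a few lines on a flat list of delta values. The seeded `random.Random` generator is only there to make the example reproducible.

```python
import random

def dare(delta, p, seed=0):
    """Randomly zero a fraction p of delta values and rescale survivors by 1/(1-p)."""
    rng = random.Random(seed)
    keep_scale = 1.0 / (1.0 - p)
    # Each entry is dropped independently with probability p; survivors are
    # scaled up so the expected value of every entry is unchanged.
    return [0.0 if rng.random() < p else d * keep_scale for d in delta]
```

With p = 0.9, roughly one entry in ten survives and each survivor is multiplied by 10, so the mean of the sparsified delta stays close to the original mean.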
Tap to flip back
RLAIF replaces human preference labels (stage 2: reward-model training) with preference labels generated by an AI judge. The reward-model architecture, the PPO loop, and the KL penalty against the reference policy are all unchanged. It is a substitution at the annotation stage, not a new RL algorithm.
Why it matters: human annotation is the main scalability bottleneck in RLHF; swapping it out keeps the rest of the pipeline intact.
Tap to flip back
Phase 1 (SL-CAI) is a self-critique and revision loop:
1. A helpful-only SFT model generates a draft response to a potentially harmful prompt.
2. The model critiques its own draft against a randomly sampled constitutional principle.
3. The model revises the draft to address the critique.
4. Steps 2-3 repeat for up to a few rounds.
5. The final revised responses are used to fine-tune a new SL-CAI model.
Output: a fine-tuned model that is less harmful and less evasive than the raw SFT model, produced without any human labels on harmful content.
Tap to flip back
In RL-CAI, an AI feedback model compares pairs of responses with explicit reference to a numbered principle from the constitution in each prompt. The preference labels it produces can therefore be traced back to a specific, human-readable principle.
In raw RLAIF the judge receives no explicit principle, so its preferences reflect implicit biases from pretraining - opaque and hard to audit. With a constitution, disagreements about model behaviour can be diagnosed by checking which principle was active during training.
Tap to flip back
In canonical RLAIF the AI judge's preference labels are used to train a separate preference model (PM), and RL is then run against the PM - exactly like RLHF but with AI labels.
In direct-RLAIF (d-RLAIF, Lee et al. 2024) there is no separate PM training step. The reward signal is obtained by querying the LLM judge directly during the RL loop, asking it to score each policy output on the fly. Lee et al. found d-RLAIF matched or exceeded canonical RLAIF on summarisation and dialogue benchmarks.
Tap to flip back
- Constitution blind spots: principles are written by researchers from specific cultural and professional backgrounds. Harms outside the framers' experience are simply absent from the document; the model will be systematically less cautious about them with no visible signal.
- Compounding model errors: the critique model, revision model, and feedback model are all LLMs that can hallucinate or misapply principles. A bad critique produces a badly revised response, which becomes training data for SL-CAI, whose outputs are then compared by the feedback model. There is no hard error-correction step, so mistakes propagate across phases.
Tap to flip back
Lee et al. (2024) found that an AI labeller's agreement with human preferences improves with labeller size: smaller judges produce noisier preference labels. They also report that RLAIF can still improve over the SFT baseline when the labeller is the same size as the policy.
Implication: judge capability is a main driver of RLAIF label quality. The per-label cost is much lower than human annotation, but swapping in a cheap small model as the judge for a large policy degrades label quality.
Tap to flip back
A written constitution makes the model's value specification editable without a new annotation campaign. To change behaviour, you edit a text file and re-run the RL-CAI phase. With RLHF the specification is implicit in thousands of human preference pairs; changing it requires collecting new pairs.
Additional advantages: principles can be version-controlled, shared publicly for external review, and audited to check whether a given refusal or compliance is principled or accidental.
Tap to flip back
Length bias is the tendency of reward models to assign higher scores to longer outputs, independent of quality. It arises because human annotators use length as a proxy for completeness, so the reward model learns length as a primary signal. A policy optimised against that reward then inflates verbosity rather than improving content.
Why it matters: Singhal et al. (COLM 2024) showed a pure length-based reward reproduced most downstream RLHF quality gains, meaning the gains attributed to quality were largely length gains in disguise.
Tap to flip back
Reward model retraining with length-balanced preference data. The reward model is the source of the bias: it absorbs the length signal from annotation patterns and encodes it into its weights. Downstream mitigations (PPO penalty, DPO, evaluator corrections) treat symptoms rather than the cause. Re-weighting or resampling preference pairs so chosen and rejected responses have similar lengths removes the shortcut before it enters the model.
Tap to flip back
r_adjusted = r_raw - λ · (len(response) / len(reference))
where reference is a baseline length (e.g., median SFT output length).
Risk of λ too high: the model learns excessive terseness. It may give cryptic, under-explained answers to genuinely complex queries, scoring poorly on tasks that legitimately require depth. λ is query-distribution-dependent and needs careful tuning.
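A tiny sketch of the adjustment, with lengths measured in tokens and `lam` standing in for λ. The numbers in the usage note are made up for illustration.

```python
def length_adjusted_reward(r_raw, response_tokens, reference_tokens, lam):
    """r_adjusted = r_raw - lam * (len(response) / len(reference))."""
    return r_raw - lam * (response_tokens / reference_tokens)
```

Suppose a concise answer scores 1.0 at 200 tokens and a padded one scores 1.2 at 600 tokens, with a 200-token reference. At lam = 0 the padded answer wins; at lam = 0.2 the concise answer wins (0.8 against 0.6). Pushing lam higher keeps widening that gap even for queries where the longer answer is the better one.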
Tap to flip back
LLM-based judges inherit the same presentation-over-substance heuristic as human annotators. A longer, polished response looks more thorough even to an LLM evaluator. Dubois et al. (2024) showed AlpacaEval's original win-rate correlates with output length; removing that correlation via regression (length-controlled AlpacaEval) increased alignment with human judgement (LMSYS Chatbot Arena correlation from 0.94 to 0.98). The bias lives in the evaluation signal, not just the training signal.
Tap to flip back
DPO optimises a policy directly against contrastive (chosen, rejected) pairs without a separately trained reward model. Because there is no explicit reward model to over-optimise, one route to length exploitation (a policy climbing a length-biased reward score) is removed.
Condition: the training pairs must be length-balanced, i.e., chosen and rejected responses should have similar token counts. If chosen responses are consistently longer than rejected ones in the dataset, DPO will still absorb the length signal via the implicit reward it learns from those pairs.
Tap to flip back
A model penalised for token count may produce responses with lower token counts but worse usability: shorter sentences, more jargon, fewer worked examples, less scaffolding for the reader. The length penalty succeeds as a proxy metric while the underlying goal (useful, well-calibrated responses) is not achieved. This is a classic Goodhart's Law instance: the measure becomes a target and ceases to be a good measure. Diagnostic checks should include human rating of response usefulness, not just token count, after applying length mitigations.
Tap to flip back
The counterfactual: "What would the evaluator preference be if both the model's output and the baseline output had the same length?"
Computation: fit a generalised linear model (GLM) on the preference data with response length difference as a covariate. The length-controlled win rate is the model's predicted preference at zero length difference. This regresses out the confounding effect of length, leaving a win-rate estimate attributable to quality rather than verbosity. The adjusted metric improved LMSYS Chatbot Arena correlation from 0.94 to 0.98 (Dubois et al., COLM 2024).
Tap to flip back
Helpfulness, Harmlessness, Honesty.
- Maximising harmlessness pushes toward refusals, hurting helpfulness.
- Maximising helpfulness (sounding confident and complete) can reward sycophancy, hurting honesty.
- Maximising honesty (hedging where uncertain) can read as unhelpful to human raters who prefer confident answers.
The tension is structural: a single scalar reward cannot simultaneously maximise all three without careful weighting and diverse evaluation signals.
Tap to flip back
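Outputs from a 1.3B InstructGPT model were preferred by human raters over those of the 175B GPT-3 base model on prompts from the API's own traffic distribution, despite roughly 100x fewer parameters. At equal size, 175B InstructGPT outputs were preferred over 175B GPT-3 outputs roughly 85% of the time.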
The method was pairwise human preference evaluation: raters saw outputs from two model variants side-by-side and indicated which they preferred. This demonstrated that alignment via RLHF matters more than raw scale for real-world helpfulness, at least as measured by human preference.
Tap to flip back
Reward hacking occurs when the policy discovers outputs that score high on the reward model without genuinely satisfying the underlying human preference the RM was trained to capture.
Typical symptoms:
- Confidently formatted but factually wrong answers (high "style" score, low accuracy)
- Excessive hedging the RM reads as "safe" but users find useless
- Sycophantic agreement with false premises, which scores high on "helpful tone" labels
Mitigation: periodic human spot-checks on RM-scored outputs to confirm the RM has not drifted from what it was originally trained to measure.
Tap to flip back
Position bias: the judge favours whichever response appears first, independent of content.
Verbosity bias: longer responses are scored higher regardless of accuracy.
Mitigation for position bias: run each pair in both orderings (A vs B, then B vs A), flip the second score, and average. This substantially reduces position inflation while leaving verbosity and self-enhancement biases unaddressed.
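The swap-and-average procedure can be sketched as below, assuming a `judge(first, second)` callable that returns the probability that the first-shown response is better. `toy_judge` is a hypothetical stand-in with a built-in first-position bonus, not a real evaluator.

```python
def debiased_preference(judge, response_a, response_b):
    """Probability that A beats B, averaged over both presentation orders."""
    p_a_first = judge(response_a, response_b)          # A shown first
    p_a_second = 1.0 - judge(response_b, response_a)   # B shown first, score flipped
    return 0.5 * (p_a_first + p_a_second)

def toy_judge(first, second):
    """Hypothetical judge: quality-based score plus a +0.2 first-position bonus."""
    quality = {"good": 0.8, "ok": 0.5}
    raw = 0.5 + 0.5 * (quality[first] - quality[second]) + 0.2
    return min(1.0, max(0.0, raw))
```

For two equally good responses the toy judge reports 0.7 in either order, and the averaged estimate comes back to 0.5 because the positional bonus cancels. Any bias tied to the responses themselves, such as verbosity, survives the averaging.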
Tap to flip back
TruthfulQA measures whether a model avoids mimicking confident-sounding human falsehoods across 817 questions in 38 categories.
Counterintuitive finding: larger base models performed worse on TruthfulQA. Scale improves imitation of confident-sounding pretraining text, including widespread misconceptions. RLHF fine-tuning partially recovers truthfulness, but the benchmark's fixed coverage means a model can score well on TruthfulQA while still confabulating on topics the benchmark does not cover.
Tap to flip back
Over-refusal is when a model declines safe requests because their surface form resembles an unsafe prompt (e.g., refusing to explain how bleach works for cleaning because the phrasing overlaps with harm-related queries).
Standard harm-rate metrics count harmful outputs produced. A model that refuses everything scores zero harmful outputs and looks perfectly safe. The XSTest benchmark was built specifically to make over-refusal visible by testing 250 clearly safe prompts that a well-calibrated model should accept.
Tap to flip back
The reward model was fitted to human preference labels collected at a point in time on a fixed distribution of prompts. As RLHF training progresses, the policy moves into regions of output space that the RM never saw during its own training. The RM's predictions in those regions are extrapolations, not reliable reflections of human preference.
This is the mechanism behind reward hacking: the policy gradient pushes the policy toward high-RM-score outputs, but because the RM generalises imperfectly outside its training distribution, "high RM score" and "humans would prefer this" increasingly diverge. Monitoring RM scores alone cannot detect this; it requires returning to human raters on the new output distribution.
Tap to flip back
With Adam optimisation in a mixed-precision regime, optimiser state costs roughly 12 bytes per parameter (an fp32 master copy of the weights plus two fp32 moment vectors), on top of the bf16 weights and gradients. For a 7B-parameter model that is roughly 84 GB for optimiser states alone, before counting activations. PEFT reduces this by shrinking the trainable-parameter count to a fraction of the full model, so optimiser state shrinks proportionally.
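The arithmetic can be checked with a short helper. It assumes one common mixed-precision accounting (2-byte bf16 weights and gradients, plus 12 bytes of fp32 optimiser state per trainable parameter: a master weight copy and two Adam moments); real frameworks differ in the details, and the function name is illustrative.

```python
def adam_training_bytes(n_params, trainable_fraction=1.0):
    """Rough per-component memory (bytes) for mixed-precision Adam, excluding activations."""
    trainable = n_params * trainable_fraction
    return {
        "weights_bf16": 2 * n_params,        # every weight is stored, trainable or not
        "grads_bf16": 2 * trainable,         # gradients only for trainable parameters
        "optimizer_fp32": 12 * trainable,    # fp32 master copy + two Adam moments
    }

full = adam_training_bytes(7e9)
peft = adam_training_bytes(7e9, trainable_fraction=0.01)
```

For the 7B example, `full["optimizer_fp32"]` is 84 GB, while training 1% of the parameters cuts that component to under 1 GB. The frozen weights still have to be held in memory either way.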
Tap to flip back
LoRA freezes the original weight W_0 and trains a low-rank decomposition W_0 + BA, where B is (d_out, r) and A is (r, d_in) with r << min(d_out, d_in). At inference time the update is merged: W_0 <- W_0 + BA. The merged model has identical architecture to the original, so there is zero added latency. The only cost paid is during training, not serving.
Tap to flip back
Adapter modules insert new bottleneck layers (Linear -> non-linearity -> Linear) into the transformer stack. These layers are always present during inference and cannot be merged into the frozen base weights because of the non-linearity between the two projections. LoRA, having no non-linearity between its two matrices, can be merged at inference time, eliminating the latency penalty. Adapters trade a small but persistent inference overhead for potentially easier multi-task serving (keep one base, load per-task adapters).
Tap to flip back
Prompt tuning closes the gap with full fine-tuning only at roughly 10 billion parameters or more. Below that scale, performance is inconsistent and sensitive to initialisation. P-Tuning v2 extends soft prompts to every transformer layer (not just the input), which restores competitive performance on smaller models and structured-prediction tasks, at the cost of a slightly larger soft-prompt parameter budget.
Tap to flip back
r sets the rank of the weight-update matrices B and A. A small r (4 or 8) gives a very parameter-efficient update suitable for light domain adaptation; a larger r (16 to 64) allows the model to represent a richer set of weight changes, which helps when the downstream task differs significantly from pretraining. Increasing r beyond roughly 64 typically yields diminishing returns while steadily eroding the parameter savings on the targeted layers.
Tap to flip back
PEFT trains small task-specific modules or low-rank updates on top of a frozen base. Each task gets its own adapter or LoRA weights. If you want a single combined model that handles all tasks without a task identifier, you must merge or combine adapters, which can degrade performance on individual tasks. PEFT reduces the cost of training isolated per-task checkpoints; it does not provide a mechanism for a single shared parameter set to retain all tasks simultaneously without interference.
Tap to flip back
- Additive adapter modules - new bottleneck layers inserted between frozen transformer blocks (Houlsby adapters).
- Low-rank weight perturbation - decomposed update matrices added to frozen projection weights (LoRA).
- Soft prompt / prefix methods - learnable continuous tokens prepended to the input or to each layer's key-value sequence (prompt tuning, prefix tuning).
Each family leaves the pretrained base weights unchanged and concentrates training signal into a small, task-specific parameter set.
Tap to flip back
LoRA freezes all pretrained weights and trains only two small matrices, \(A\) and \(B\), whose product \(BA\) approximates the weight update \(\Delta W\). No gradient flows into \(W_0\).
Why: This separates the base model's general knowledge (frozen) from the task-specific update (low-rank), enabling multiple adapters to share one base checkpoint.
Tap to flip back
- \(W_0\) is the frozen pretrained weight.
- \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\), rank \(r \ll \min(d,k)\).
- \(\alpha / r\) is the scaling factor; a common default is \(\alpha = 2r\).
\(B\) is initialised to zero so the adapter contributes nothing at training step 0.
Tap to flip back
The adapter can be merged into the base weight before serving:
\[W_{\text{merged}} = W_0 + \frac{\alpha}{r} \cdot BA\]
The resulting matrix has the same shape as \(W_0\). The deployed model is architecturally identical to the original - no extra matrix multiplications at runtime.
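A small numeric check of the merge, using plain nested lists as matrices so it runs without dependencies. The shapes (d = 2, k = 3, r = 1) and values are arbitrary illustrative choices.

```python
def matmul(X, Y):
    """Naive matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W0, B, A, alpha, r):
    """W_merged = W0 + (alpha / r) * B @ A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, BA)]

W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # frozen weight, shape (2, 3)
B = [[0.5], [-1.0]]                        # shape (2, 1)
A = [[1.0, 2.0, 0.0]]                      # shape (1, 3)
x = [[1.0], [1.0], [1.0]]                  # input column vector
alpha, r = 2.0, 1

# Adapter kept separate: W0 x + (alpha / r) * B (A x)
base_out = matmul(W0, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]

# Adapter folded into the weight: a single matmul at serving time
merged = matmul(merge_lora(W0, B, A, alpha, r), x)
```

Both paths produce the same output vector, which is why serving the merged weight costs nothing extra.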
Tap to flip back
Roughly 10,000x fewer trainable parameters (e.g., about 18 M vs. 175 B at \(r = 4\)) by adapting only query and value projections in each attention layer with small-rank matrices (\(r = 4\) or \(r = 8\)) rather than updating every weight in the model.
GPU memory required during training drops by roughly 3x compared to Adam full fine-tuning.
Tap to flip back
QLoRA stores the base model in 4-bit NormalFloat (NF4) quantisation while keeping the LoRA adapter weights in 16-bit precision (bfloat16). Gradients flow through a dequantised base model but only update the adapter.
This lets a 65B-parameter model be fine-tuned on a single 48 GB GPU. QLoRA also introduces double quantisation (quantising the quantisation constants) and paged optimisers to manage memory spikes.
Tap to flip back
Because \(\Delta W = BA\). With \(B = 0\), the product is zero regardless of \(A\), so the model's output at step 0 is identical to the pretrained model. Training starts from a stable, known baseline rather than from a random perturbation.
If both were random, the adapter would inject noise into every forward pass before any gradient signal has been received.
Tap to flip back
- Rank underestimation: Tasks requiring genuinely new knowledge (specialised vocabulary, novel reasoning) may need higher rank or full fine-tuning; small-rank adapters lose signal.
- Alpha/rank coupling: Changing \(r\) without adjusting \(\alpha\) shifts the adapter's effective learning rate, causing instability or slow convergence. rsLoRA (scaling by \(\alpha / \sqrt{r}\) instead of \(\alpha / r\)) mitigates this, but vanilla configs copied without retuning often underperform.
Tap to flip back
The intrinsic dimension \(d_{90}\) is the smallest \(d\) such that training only a \(d\)-dimensional parameter vector \(\phi\) (projected back into the full space via a fixed random matrix \(P\)) reaches 90% of full fine-tuning performance. It measures how many effectively independent directions are needed to solve the task from the pre-trained initialisation.
Why it matters: it gives a principled upper bound on the complexity of the fine-tuning problem, independent of total parameter count.
Tap to flip back
It tells you that the MRPC fine-tuning objective, despite the model having 125 million parameters, has its low-loss region concentrated in a subspace of dimension roughly 200. The full parameter space is nearly irrelevant to the optimisation; the pre-training has already shaped the loss landscape so that a tiny random affine subspace reliably intersects a good solution.
Implication: models with low intrinsic dimension are prime candidates for parameter-efficient fine-tuning without meaningful performance loss.
Tap to flip back
Pre-training dramatically lowers intrinsic dimension compared to random initialisation. Larger pre-trained models tend to have lower intrinsic dimension than smaller ones for the same downstream task, after equivalent pre-training. This partially explains why larger models generalise better after fine-tuning on small datasets: they compress more of the task's optimisation problem into a smaller effective subspace during pre-training.
Tap to flip back
- Take pre-trained weights \(\theta_0 \in \mathbb{R}^D\).
- Sample a fixed random matrix \(P \in \mathbb{R}^{D \times d}\) once (never updated).
- Introduce \(\phi \in \mathbb{R}^d\) initialised at zero; at every forward pass use \(\theta = \theta_0 + P\phi\).
- Train only \(\phi\) via gradient descent.
Repeat for increasing \(d\) until target performance is reached; that \(d\) is the intrinsic dimension.
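The procedure above can be sketched on a toy problem. The version below uses pure-Python lists, a dense Gaussian projection scaled by 1/sqrt(D), and a user-supplied `loss_grad` function; all of these are illustrative assumptions rather than the original experimental setup.

```python
import random

def train_in_subspace(theta0, d, loss_grad, steps=200, lr=0.05, seed=0):
    """Train only a d-dimensional phi, with theta = theta0 + P @ phi."""
    rng = random.Random(seed)
    D = len(theta0)
    # Fixed random projection P (D x d), sampled once and never updated.
    P = [[rng.gauss(0.0, 1.0) / D ** 0.5 for _ in range(d)] for _ in range(D)]
    phi = [0.0] * d  # zero init, so training starts exactly at theta0

    def project(phi_vec):
        return [t + sum(p * f for p, f in zip(row, phi_vec)) for t, row in zip(theta0, P)]

    for _ in range(steps):
        g_theta = loss_grad(project(phi))
        # Chain rule: dL/dphi = P^T dL/dtheta. theta0 and P receive no update.
        g_phi = [sum(P[i][j] * g_theta[i] for i in range(D)) for j in range(d)]
        phi = [f - lr * g for f, g in zip(phi, g_phi)]
    return project(phi)
```

To estimate the intrinsic dimension, one would repeat the call for increasing `d` and record the smallest value at which the evaluation metric reaches the target fraction of full fine-tuning performance.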
Tap to flip back
LoRA implements the same core idea (update weights in a low-dimensional subspace) but replaces the unstructured random matrix \(P\) with a per-layer rank decomposition: \(\Delta W = BA\). This is an engineering refinement, not a theoretical requirement. The divergence is structural: the true intrinsic subspace of a task may cut across layers in ways that per-layer rank decomposition does not capture. In practice this gap rarely hurts, and the per-layer structure enables lossless weight merging at inference time (zero added latency).
Tap to flip back
- Novel-knowledge tasks: tasks requiring genuinely new factual content not seen during pre-training cannot be solved by recombining existing directions; higher rank or full fine-tuning is needed.
- Very long fine-tuning runs: with large learning rates or many epochs, updates drift far from \(\theta_0\), violating the assumption that the optimum lies near the initialisation in a small subspace.
- Out-of-distribution task structure: if the downstream modality or format is far from pre-training (e.g., structured code generation for a text-only model), the pre-training compression may not organise the parameter space usefully for the new task, and a low-rank delta may be insufficient.
Tap to flip back
Standard PAC-style bounds grow with model complexity (parameter count or norm). But if a fine-tuned solution lies in a \(d\)-dimensional subspace with \(d \ll D\), the effective model complexity is proportional to \(d\), not \(D\). Aghajanyan et al. formalise this via compression-based generalisation bounds: a solution reachable from a \(d\)-dimensional random projection implies a short description length proportional to \(d\), yielding tight bounds even when \(D\) is enormous and the training set is small. This explains empirically why 125M-parameter models fine-tuned on a few thousand examples rarely overfit catastrophically.
Tap to flip back
Rank \(r\) sets the number of dimensions in the low-rank subspace the adapter can occupy. For a weight \(W_0 \in \mathbb{R}^{d \times k}\), the adapter has \(r(d + k)\) trainable parameters.
Why it matters: Doubling \(r\) doubles parameter count (for that layer), but may not double task performance. The task's intrinsic dimensionality determines where returns flatten.
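A quick count makes the scaling concrete. The 4096 x 4096 projection size is an assumed example, not a figure from the card.

```python
def lora_params(d, k, r):
    """Trainable parameters in one LoRA adapter: B is (d, r), A is (r, k)."""
    return r * (d + k)

d = k = 4096
full_params = d * k                  # parameters in the frozen weight itself
r8 = lora_params(d, k, 8)
r16 = lora_params(d, k, 16)
```

Here `r8` is 65,536 against 16,777,216 for the full matrix, and `r16` is exactly twice `r8`: parameter cost is linear in rank, whereas task performance generally is not.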
Tap to flip back
Alpha is a scalar that controls the adapter's contribution via the factor \(\alpha / r\):
\[h = W_0 x + \frac{\alpha}{r} \cdot B A x\]
The ratio \(\alpha / r\) acts as a per-adapter learning rate multiplier. Keeping this ratio constant while sweeping \(r\) holds the adapter's effective scale steady, isolating rank as the only variable.
Tap to flip back
With \(\alpha / r\) scaling, increasing rank shrinks the multiplier, weakening gradient flow through the adapter. This causes training to slow and often makes high-rank LoRA no better than low-rank.
rsLoRA replaces the divisor with \(\sqrt{r}\):
\[\text{scale} = \frac{\alpha}{\sqrt{r}}\]
This is consistent with NTK/mean-field scaling arguments. In practice it means higher ranks reliably improve quality without requiring manual alpha retuning. Enable with use_rslora=True in LoraConfig.
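To see how the two rules diverge as rank grows, the helper below computes both multipliers for a fixed alpha. The function name and the choice of alpha = 16 are illustrative.

```python
import math

def lora_scale(alpha, r, rslora=False):
    """Adapter multiplier: alpha / r (standard) or alpha / sqrt(r) (rsLoRA)."""
    return alpha / math.sqrt(r) if rslora else alpha / r

# Multipliers at alpha = 16 for a sweep of ranks: (rank, standard, rsLoRA)
table = [(r, lora_scale(16, r), lora_scale(16, r, rslora=True)) for r in (1, 4, 16, 64)]
```

At r = 64 the standard multiplier has fallen to 0.25 while the rsLoRA multiplier is 2.0; at r = 1 the two rules coincide.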
Tap to flip back
- Narrow classification / NER: \(r = 4\) to \(8\). These tasks have low intrinsic dimensionality; the base model already holds the structural knowledge and only needs steering.
- Broad instruction tuning: \(r = 16\) to \(64\). Many diverse behaviours must be simultaneously shaped, requiring more adapter capacity.
Aghajanyan et al. (2021) showed that a RoBERTa checkpoint reaches 90% of peak MRPC performance when optimised in a randomly projected 200-dimensional subspace of the full parameter space, illustrating why small ranks work for focused tasks.
Tap to flip back
- Fix the scale ratio by setting alpha = r (or enable rsLoRA and choose a fixed alpha independently). This removes rank-scale coupling.
- Double r from a low starting point (8 → 16 → 32) and evaluate on a held-out subset. Quality usually plateaus well before \(r = 128\) for standard fine-tuning.
- Once rank is fixed, tune alpha by varying \(\alpha / r\) from 0.5 to 4.0.
- Monitor adapter norm (torch.linalg.norm(B @ A)) during training to catch scale explosions early.
Tap to flip back
Biderman et al. found that full fine-tuning learns perturbations with an effective rank that is roughly 10 to 100x higher than typical LoRA configurations. For tasks that require acquiring new knowledge (rather than steering existing representations), small-rank adapters cannot store the necessary associations, producing a model that mimics the target style but makes domain-specific factual errors.
The implication: LoRA's parameter efficiency comes at a genuine capacity cost on knowledge-intensive domains, not just a compression artefact.
Tap to flip back
At very low ranks (r = 1 or 2), rsLoRA offers no stabilisation benefit and can make the scale larger. With r=1, alpha=16, both formulas give a scale of 16.0; with r=2, \(\alpha / \sqrt{r}\) is about 11.3 against 8.0 for \(\alpha / r\). An alpha carried over from a higher-rank config is typically far too aggressive at these ranks and can cause the adapter norm to explode.
Recommendation: Use rsLoRA for \(r \geq 8\) where its stabilisation benefits apply. For very low ranks, use standard \(\alpha / r\) scaling and tune alpha directly - the shrinking multiplier that rsLoRA corrects only becomes a problem at high ranks.
Tap to flip back
Adapting both Wq and Wv at a moderate rank (e.g., r=4 each) outperformed adapting a single matrix (e.g., Wq alone) at double the rank, even with the same total parameter count. The key insight: rank-4 captures enough of the required update subspace; spreading budget across more matrices beats concentrating it in one.
Tap to flip back
The LoRA paper froze MLP layers for simplicity and because attention weight updates were sufficient for the NLU benchmarks tested. The choice breaks on knowledge-injection tasks (domain-specific QA, factual recall), where MLP blocks function as key-value stores for factual associations. Freezing them causes hallucination and factual errors to persist despite fine-tuning.
Tap to flip back
Sequential adapters insert a bottleneck module inside the residual stream after a sublayer: x -> Sublayer -> Adapter -> +x. Parallel adapters run alongside: both the sublayer and the adapter branch process x independently and their outputs are summed. When the parallel branch is a purely linear bottleneck (no non-linearity) attached to a single weight matrix, the design is equivalent to adding a low-rank delta to that matrix - i.e., it reduces to LoRA.
Tap to flip back
Early layers encode low-level syntax and are largely stable across domains - adapting them rarely pays off for most tasks. Middle layers (roughly 30-70% of total depth) handle abstract semantics and offer the best return when resources are constrained. Late layers govern output formatting and instruction-following patterns. For style or instruction tasks, concentrating on middle and late layers is the most common practical approach.
Tap to flip back
A very low-rank update spans too few directions in weight space to represent all required changes at once. If the task demands instruction-following, new-domain vocabulary, and style adaptation simultaneously, the adapter cannot encode all three independently. The result is partial or erratic adaptation - the model may handle one requirement while regressing on another.
Tap to flip back
Wk (key projection) and Wo (output projection) appear to contribute less marginal value when query and value projections are already adapted. Wq shapes what each token queries for; Wv shapes what gets blended into the output. Wk modulates which tokens are retrieved but co-varies tightly with Wq (they jointly determine attention scores), while Wo is a linear mix of already-adapted Wv outputs. The incremental gain from adapting Wk and Wo does not justify the extra parameters in most tasks.
Tap to flip back
A large update delta can overwrite the pre-trained weight basin rather than perturbing it gently. The merged model is then equivalent to a partially re-trained model rather than a lightly adapted one. Downstream, this typically manifests as worse generalisation compared to keeping the adapter separate (i.e., not merging), loss of behaviour on tasks not covered by fine-tuning data, and sometimes numerical instability if the merged weights have very different magnitude distributions than the originals.
Tap to flip back
QLoRA reduces the memory footprint of a 65B model's weights to roughly 33 GB (down from ~130 GB at BF16), enabling fine-tuning on a single 48 GB GPU. The savings come from storing the frozen base model in 4-bit NF4 precision rather than 16-bit, while still dequantising to BF16 on the fly for actual matrix multiplications.
Why it matters: storage and compute are separated - you pay the 4-bit cost at rest but retain 16-bit numerical quality during the forward pass.
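The weight-storage figures follow from simple arithmetic. A rough sketch, assuming 1 GB = 1e9 bytes and ignoring quantisation constants, activations, and optimiser state; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(65e9, 16)  # 16-bit storage
nf4_gb = weight_memory_gb(65e9, 4)    # 4-bit storage
print(bf16_gb, nf4_gb)
```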
Tap to flip back
NF4 (Normal Float 4) places its 16 quantisation levels at the quantiles of a standard normal distribution, so each bin covers equal probability mass. Pre-trained transformer weights are approximately normally distributed, so NF4 wastes no code-points on the tails. INT4 uses uniform spacing and FP4 uses a floating-point grid - both are suboptimal for this distribution. The paper describes NF4 as "information-theoretically optimal for normally distributed weights."
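The equal-probability-mass idea can be sketched with the standard library. This is a simplified illustration only: the real NF4 code book is constructed differently (it includes an exact zero and is asymmetric), and `equal_mass_levels` and `quantise` are hypothetical names.

```python
from statistics import NormalDist


def equal_mass_levels(n_levels: int = 16) -> list[float]:
    """Place one level at the centre of each equal-probability bin of N(0, 1),
    then rescale so the outermost levels sit at -1 and +1."""
    normal = NormalDist()
    raw = [normal.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]


def quantise(weight: float, levels: list[float]) -> int:
    """Return the index of the nearest level (a 4-bit code when n_levels = 16)."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - weight))


levels = equal_mass_levels()
inner_gap = levels[8] - levels[7]    # spacing around zero, where weights are dense
outer_gap = levels[15] - levels[14]  # spacing in the tail, where weights are sparse
print(inner_gap < outer_gap)
```

Levels crowd together near zero and spread out in the tails, which is the opposite of what a uniform INT4 grid does.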
Tap to flip back
Double quantisation applies a second round of quantisation to the 32-bit scaling constants produced by the first quantisation step. Each scaling constant covers a block of 64 weights, costing 0.5 bits per parameter before double quantisation. Quantising the constants to 8-bit reduces this overhead to roughly 0.127 bits per parameter. The net saving is approximately 0.37 bits per parameter, which totals around 3 GB across a 65B model.
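Working the overhead arithmetic through explicitly gives about 0.373 bits per parameter and roughly 3 GB at 65B. The second-level group size of 256 is an assumption taken from the QLoRA paper; it is not stated on the card.

```python
BLOCK = 64     # weights per first-level scaling constant
BLOCK2 = 256   # constants per second-level group (assumed, per the QLoRA paper)

# One 32-bit constant per 64 weights.
bits_before = 32 / BLOCK
# One 8-bit constant per 64 weights, plus one 32-bit constant per 64 * 256 weights.
bits_after = 8 / BLOCK + 32 / (BLOCK * BLOCK2)

saving_bits = bits_before - bits_after
saving_gb = saving_bits * 65e9 / 8 / 1e9  # across a 65B-parameter model
print(bits_before, round(bits_after, 3), round(saving_bits, 3), round(saving_gb, 2))
```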
Tap to flip back
Paged optimisers use NVIDIA's unified memory to move Adam optimiser states (momentum and variance tensors) to CPU RAM when GPU memory is exhausted, then page them back before the next parameter update. This prevents out-of-memory crashes during long-sequence batches or gradient accumulation without permanently offloading state. The cost is PCIe bandwidth latency on steps where paging occurs.
Tap to flip back
| Tensor | Precision | Updated? |
|---|---|---|
| Base model weights | NF4 (4-bit) | No (frozen) |
| Dequantised weights (for matmul) | BF16 (temporary) | N/A |
| LoRA matrices A, B | BF16 | Yes |
| Adam optimiser states | FP32 (paged) | Yes |
Gradients flow through the dequantisation step into A and B only; the base weights accumulate no gradient.
Tap to flip back
- Quantisation error on non-normal layers. NF4 assumes normally distributed weights; embedding tables, output projections, and MoE gating layers sometimes deviate, leading to larger-than-expected quantisation error.
- Merged inference reverts to BF16 size. Merging LoRA adapters back into the base model produces a full 16-bit model. Compressed-size serving requires a separate inference quantisation step (GPTQ, AWQ, etc.).
- Training throughput penalty. On-the-fly dequantisation typically reduces training speed by 20-30% compared to BF16 LoRA on the same hardware. QLoRA trades speed for memory.
Tap to flip back
Layers without LoRA sit behind frozen, quantisation-noisy weights with no mechanism to correct the error. In standard LoRA fine-tuning it is common to target only Q and V projections because the base model is stored losslessly at 16-bit - skipping some layers costs little. In QLoRA the base model has per-layer quantisation error, so each uncovered layer contributes residual noise to the forward pass with no trainable path to compensate. Targeting all-linear gives every layer a correction path, empirically improving downstream task performance. The HuggingFace PEFT documentation explicitly recommends target_modules="all-linear" for QLoRA-style training.
Tap to flip back
A Houlsby adapter inserts two adapter modules per transformer layer: one after the multi-head attention projection and one after the feed-forward sublayer. Each adapter applies a down-projection from the hidden dimension d to a bottleneck dimension r, a non-linearity (typically GELU), and an up-projection back to d. A residual connection wraps the unit and the projections are initialised near zero, so the adapter starts as an approximate identity, preventing destabilisation of pretrained weights. Only the adapter projections (plus, in the original paper, the layer-norm parameters and the task head) are trained; the rest of the transformer is frozen.
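The bottleneck-plus-residual structure can be sketched in a few lines. This is a toy illustration with biases omitted and tiny dimensions (d = 4, r = 2); `adapter`, `matvec`, and the example weights are hypothetical, not taken from any library.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU using the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def adapter(h, W_down, W_up):
    """Down-project d -> r, apply GELU, up-project r -> d, then add the residual."""
    z = [gelu(v) for v in matvec(W_down, h)]
    return [hi + ui for hi, ui in zip(h, matvec(W_up, z))]


h = [1.0, -2.0, 0.5, 3.0]
W_down = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.1, 0.2]]  # shape (r, d)
W_up_zero = [[0.0, 0.0] for _ in range(4)]               # shape (d, r), all zeros
W_up_small = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0], [0.0, 0.0]]

at_init = adapter(h, W_down, W_up_zero)   # zero up-projection: identity
trained = adapter(h, W_down, W_up_small)  # non-zero up-projection: perturbs h
print(at_init, trained)
```

With a zero up-projection the module passes h through unchanged, which is why near-zero initialisation leaves the pretrained model's behaviour intact at the start of training.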
Tap to flip back
Within 0.4% of full fine-tuning performance, while training only 3.6% of the model's parameters per task. This is the core empirical result that established adapters as a viable parameter-efficient fine-tuning method. The remaining 96.4% of parameters stay frozen and shared across all tasks.
Tap to flip back
The bottleneck forces all task-specific information through a low-rank subspace before it is added back to the residual stream. This prevents the adapter from memorising arbitrary high-dimensional perturbations in the frozen activations. On small datasets (thousands of examples), full fine-tuning risks distorting pretrained representations via overfitting; the adapter's narrow gate limits the degrees of freedom available to overfit, preserving the general-purpose structure learned during pretraining.
Tap to flip back
AdapterFusion (Pfeiffer et al., 2020) is a composition method for combining pretrained task-specific adapters. In Stage 1, independent adapters are trained per source task, each frozen after training. In Stage 2, all adapters are loaded frozen into the backbone and a small trainable attention mechanism (the fusion layer) is trained to weight their contributions per token on the target task. Because no adapter's weights are modified in Stage 2, there is no pathway for the model to forget previously learned task representations. The fusion layer learns to route rather than to overwrite.
Tap to flip back
Series adapters add a sequential forward pass (two linear layers) at every adapter position in every layer. This compute cannot be fused into the existing transformer weight matrices, so it adds latency on every token. LoRA avoids this by merging its low-rank updates directly into the frozen weights at serving time (W' = W + BA), leaving zero inference overhead.
AdapterDrop (Rücklé et al., 2020) mitigates the adapter latency by removing adapters from the lower transformer layers entirely. Lower layers encode mostly syntax and surface form, which transfer well without adaptation. Dropping adapters from most lower layers recovers a large fraction of the inference speed while retaining most task performance.
Tap to flip back
Frameworks such as vLLM and TensorRT-LLM heavily fuse the transformer's attention and feed-forward kernels into custom CUDA operations. Inserting non-standard bottleneck layers between these fused operations breaks the fusion, forcing a fallback to unfused code paths. This can multiply the effective latency penalty of the adapters far beyond their FLOP cost, and may prevent the use of memory-efficient attention implementations entirely. LoRA does not have this problem because its updates are absorbed into the weight matrices before any inference kernel runs, leaving the forward-pass graph unchanged.
Tap to flip back
He et al. (2021) show that all three methods can be described as modifications to specific hidden states in the transformer's forward pass, differing only in (a) which hidden states are modified, (b) whether the modification is additive or through reparametrisation, and (c) whether the modification is input-dependent or not. Adapters modify intermediate residual states via learnable bottleneck functions. Prefix tuning prepends learnable vectors to the key and value sequences. LoRA modifies the weight matrices directly with a low-rank additive term. Because they occupy the same design space, it is possible to combine elements from each, and the paper demonstrates that such hybrids can achieve better parameter efficiency than any individual method.
Tap to flip back
Into the key-value (KV) tensors at every transformer layer. The prefix adds m virtual key vectors and m virtual value vectors to each layer's attention context. Real tokens can attend to these virtual tokens, but the virtual tokens do not attend to anything. All original weight matrices remain frozen.
Why it matters: Inserting at every layer gives the prefix direct influence on each layer's attention output, rather than relying on gradual propagation from the input embedding alone.
Tap to flip back
During training, the prefix tensors are not optimised directly. Instead, a small feed-forward network (FFN) maps a compact embedding matrix E to the prefix: P_l = FFN(E). Only E and the FFN weights are trained. After training, the materialised prefix tensors are saved and the FFN is discarded.
Why it is needed: Directly optimising raw prefix parameters causes training instability and erratic loss curves. The FFN acts as a smooth parameterisation that stabilises gradient updates.
Tap to flip back
- Prompt tuning adds soft tokens only at the input embedding layer (one location).
- Prefix tuning injects learned key-value vectors at every transformer layer.
Prefix tuning is therefore deeper and generally more expressive, especially on tasks where later layers carry significant semantic load.
Tap to flip back
Approximately 0.1% of the original model's parameters. For GPT-2-medium (345M parameters), a prefix of length 10 stored as key and value vectors at every layer comes to a few hundred thousand parameters. This is roughly 1000x fewer trainable parameters than full fine-tuning.
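A back-of-envelope count shows where the figure comes from. The sketch assumes GPT-2-medium's shape (24 layers, hidden size 1024) and counts only the materialised prefix, not the reparametrisation network used during training; `prefix_params` is a hypothetical helper.

```python
def prefix_params(prefix_len: int, n_layers: int, d_model: int) -> int:
    """One key vector and one value vector of size d_model per prefix position per layer."""
    return prefix_len * n_layers * 2 * d_model


n_prefix = prefix_params(prefix_len=10, n_layers=24, d_model=1024)
fraction = n_prefix / 345e6  # share of GPT-2-medium's parameter count
print(n_prefix, f"{fraction:.2%}")
```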
Tap to flip back
Low-data regimes. When training examples are scarce, full fine-tuning tends to overfit because it updates all parameters on a small signal. Prefix tuning, with far fewer trainable parameters and a frozen pretrained backbone, generalises better and also extrapolates to unseen topics more reliably.
Tap to flip back
- Sequence length budget: Prefix tokens consume context window space. A prefix of 100 tokens on a 512-token model costs 20% of available context before any real input.
- Token-level tasks at small scale: Plain prefix tuning under-performs full fine-tuning on sequence labelling tasks (e.g. NER) and on models smaller than roughly 1B parameters, where soft-prompt methods lack the capacity to compensate for the reduced expressiveness of frozen weights.
Tap to flip back
P-Tuning v2 (Liu et al., 2021) applies the same "deep prompt" principle as prefix tuning (learned vectors inserted at every layer) to BERT-style encoder models and sequence labelling tasks. The original prefix tuning paper targeted GPT/BART generation tasks. P-Tuning v2 demonstrated that with proper tuning, 0.1%-3% of parameters is sufficient to match full fine-tuning universally across model scales and NLU tasks, closing the gap that earlier soft-prompt methods left on smaller models and harder tasks.
Tap to flip back
A soft prompt is a matrix of free-floating, learnable embedding vectors (shape k × d) prepended to the model's input in embedding space. Unlike a standard text prompt, whose tokens are fixed discrete vocabulary entries, the soft-prompt vectors are not tied to any vocabulary token and are updated by gradient descent during training. The model weights remain frozen; only the prompt matrix is learnt.
Why it matters: decoupling the task signal from the vocabulary removes the need for manual prompt engineering and allows exact gradient-based optimisation.
Tap to flip back
At roughly 1 billion parameters and above, with the clearest evidence at T5-XXL (11B). Below that scale, prompt tuning lags behind full fine-tuning on benchmarks such as SuperGLUE.
The reason: a large model has already learnt rich internal representations. The soft prompt need only activate and route existing features. A small model lacks those features, so no prefix of any length can compensate for the missing representational capacity.
Tap to flip back
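- Random initialisation - initialise each virtual token from random values with no link to the vocabulary.
- Vocabulary sampling - initialise each virtual token by sampling a random token embedding from the existing embedding matrix.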
- Class-label initialisation - initialise each virtual token with the embedding of a task output label (e.g. "positive", "negative" for sentiment).
Class-label initialisation performs best, especially at smaller model scales. The gap shrinks as scale grows. Random initialisation performs worst. The intuition: starting in a semantically meaningful region of embedding space gives the optimiser a better starting point.
Tap to flip back
- Soft-prompt tuning inserts learnable vectors only at the input embedding layer. Parameter count: k × d.
- Prefix tuning (Li and Liang, 2021) inserts learnable vectors at the key and value projections of every transformer layer. Parameter count: O(k × d × L), where L is the number of layers.
Prefix tuning has more expressive capacity but is larger and required a reparametrisation MLP to stabilise training. Soft-prompt tuning trains the input vectors directly and is significantly smaller.
Tap to flip back
Even though no frozen weights are updated, the backward pass must compute gradients with respect to the soft-prompt matrix P. To do so, PyTorch (or JAX) must retain the intermediate activations from the entire forward pass through all frozen layers, since the chain rule requires them. This means activation memory is roughly comparable to full fine-tuning unless activation checkpointing (gradient checkpointing) is applied.
The savings from soft-prompt tuning are in stored parameter count and serving storage, not automatically in training-time GPU memory.
Tap to flip back
- Tasks on small models (below ~1B parameters) - the model lacks the pre-learnt representations the prompt can steer; no prefix compensates.
- Complex multi-step reasoning and structured-output tasks - these require compositional capabilities that can only be instilled by updating deeper weights. A soft prompt can direct but cannot install new reasoning circuits.
In both cases, LoRA or adapter methods, which modify weight matrices directly, are more appropriate tools.
Tap to flip back
Advantage: a single frozen model backbone can serve hundreds of different tasks simultaneously. Each task is represented by its own tiny prompt matrix (typically a few hundred kilobytes). No weight duplication, no separate model copies.
Silent failure mode: if the serving framework accidentally mixes up prompt matrices across concurrent requests (e.g. a batching bug assigns task A's prompt to task B's request), the model produces incorrect output for the wrong task without raising an error. The corruption is semantically silent, not a crash. Production serving code must rigorously isolate prompt matrices per request or per task identifier.
Tap to flip back
(IA)^3 stands for "Infused Adapter by Inhibiting and Amplifying Inner Activations." It fine-tunes a model by learning small vectors that element-wise scale (inhibit or amplify) specific intermediate activations, while all original weight matrices remain frozen.
Tap to flip back
The three sites are: (1) key projection outputs, (2) value projection outputs, and (3) the input to the second (down-projection) FFN layer. For the attention sites the vector multiplies the output activations; for the FFN site it multiplies the input activations, which is why PEFT's IA3Config has a separate feedforward_modules argument.
Tap to flip back
For T0 (roughly 11B parameters), (IA)^3 introduces about 0.01% trainable parameters. LoRA at a typical rank sits around 0.1% or higher. Full fine-tuning is 100%. So (IA)^3 is roughly one order of magnitude more parameter-efficient than LoRA, achieved by learning a single vector rather than a pair of low-rank matrices per site.
Tap to flip back
Initialising to ones means the model starts as an exact copy of the pre-trained base (multiplying by 1 changes nothing). Zero-initialisation would zero out the targeted activations entirely at the start of training, destroying the pre-trained representations. The all-ones start lets gradient descent make small, targeted adjustments from a stable baseline, analogous to how LoRA zero-initialises the B matrix so the initial delta-W is zero rather than random noise.
Tap to flip back
For an attention site: l_k ⊙ (X W_K) = X (W_K diag(l_k)). The scaling vector is equivalent to rescaling the columns of the frozen weight matrix. At inference time you compute W_K_merged = W_K * l_k (broadcast column-wise) once offline and store the result. The merged model is then a plain frozen transformer with no PEFT code path, so there is zero inference latency overhead - the same property LoRA enjoys.
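The identity can be checked numerically on a toy example. This is a plain-Python sketch with made-up values; `matmul` is a small helper defined here, not a library call.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


X = [[1.0, 2.0], [3.0, 4.0]]                # (tokens, d_in)
W_K = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]  # (d_in, d_k)
l_k = [2.0, 0.5, -1.0]                      # learned (IA)^3 vector, length d_k

# Training-time path: scale the key activations element-wise.
scaled_activations = [[v * s for v, s in zip(row, l_k)] for row in matmul(X, W_K)]

# Merged path: rescale column j of W_K by l_k[j] once, offline.
W_K_merged = [[w * s for w, s in zip(row, l_k)] for row in W_K]
merged_output = matmul(X, W_K_merged)

print(scaled_activations)
print(merged_output)
```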
Tap to flip back
(IA)^3 applies only a diagonal rescaling per targeted site (one learned scale per activation dimension). This is sufficient when the pre-trained model already encodes the required knowledge and the task only needs selective routing or suppression. It breaks down for: (1) tasks requiring genuinely new compositional behaviours (complex instruction following, code synthesis); (2) small base models where the pre-trained representations are underspecified; (3) RLHF-style training where the reward signal pushes the distribution far from the pre-trained prior - the low expressive capacity of a single vector saturates early. LoRA's adjustable rank provides headroom in all three cases.
Tap to flip back
IA3 per block: d_k + d_v + d_ff (three vectors, one per targeted site).
LoRA per block (rank r, applied to Q/K/V/O): 4 * 2 * d * r = 8dr.
For d = 1024, d_ff = 4096, r = 8 (taking d_k = d_v = d): IA3 adds 6,144 parameters; LoRA adds 65,536. LoRA is roughly 11x larger per block at this rank. The ratio grows linearly with rank and shrinks as the FFN dimension grows relative to the hidden size.
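The per-block counts can be reproduced directly from the two formulas. The sketch assumes d_k = d_v = d (full-width key and value projections); the function names are hypothetical.

```python
def ia3_params_per_block(d_k: int, d_v: int, d_ff: int) -> int:
    """One scaling vector per targeted site: keys, values, FFN intermediate."""
    return d_k + d_v + d_ff


def lora_params_per_block(d: int, r: int, n_matrices: int = 4) -> int:
    """A (r x d) and B (d x r) for each of Q, K, V, O."""
    return n_matrices * 2 * d * r


ia3 = ia3_params_per_block(d_k=1024, d_v=1024, d_ff=4096)
lora = lora_params_per_block(d=1024, r=8)
ratio = lora / ia3
print(ia3, lora, round(ratio, 1))
```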
Tap to flip back
W_merged = W + (alpha / r) * B @ A
W is the frozen base weight (shape d_out x d_in), B (shape d_out x r) and A (shape r x d_in) are the trained LoRA matrices, and alpha / r is the scaling factor. This is a pure in-place addition; W_merged has the same shape as W and requires no structural model changes.
Why it matters: both the scale factor and the matrix multiply must be applied correctly, or quality silently degrades.
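A minimal sketch of the merge, assuming B has shape (d_out, r) and A has shape (r, d_in) so that B @ A matches W. `merge_lora` and `matmul` are hypothetical helpers written in plain Python for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha: float, r: int):
    """Return W + (alpha / r) * B @ A without mutating W."""
    scale = alpha / r
    delta = matmul(B, A)  # shape (d_out, d_in), same as W
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # d_out = 2, d_in = 3
B = [[1.0], [2.0]]                      # (d_out, r) with r = 1
A = [[0.5, 0.0, -0.5]]                  # (r, d_in)

W_merged = merge_lora(W, A, B, alpha=2, r=1)
print(W_merged)
```

Dropping the `scale` factor or multiplying in the wrong order would still yield a matrix of the right shape, which is why such mistakes go unnoticed.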
Tap to flip back
merge_and_unload() folds the adapter into the base weights AND strips all PEFT adapter objects, returning a plain Transformers model with no further PEFT overhead. It is one-way: you cannot unmerge after this.
merge_adapter() folds the adapter in-place but keeps the PEFT wrapper and the adapter parameter objects, so unmerge_adapter() can reconstruct the original W later. Use this when you need adapter-swap capability in production.
Tap to flip back
The base weights in QLoRA are stored in 4-bit NF4; the adapter matrices are in bfloat16/float16. To compute W + scale * BA you must first dequantise W to a higher precision, perform the addition, then optionally re-quantise.
PEFT's merge_and_unload() handles the dequantisation step automatically, but does NOT re-quantise; the output is a full-precision (float16/bfloat16) merged model. If you need a quantised merged model you must quantise it explicitly as a separate step afterwards.
Tap to flip back
A task vector is the element-wise difference between a fine-tuned model's weights and the original base model weights. For a LoRA adapter, the task vector for each targeted layer is exactly (alpha / r) * B @ A - the same delta used in the merge formula.
Because task vectors are just tensors, they can be added (combine abilities), negated (remove a task's influence), or interpolated. This is the foundation of task arithmetic (Ilharco et al., 2023) and of techniques like DARE and TIES that sparsify or sign-filter the deltas before adding them.
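Task arithmetic can be sketched on toy weights. The code uses a dict of flat lists as a stand-in for a real state dict; `task_vector` and `apply_vectors` are hypothetical helpers, and no sparsification (DARE) or sign resolution (TIES) is applied.

```python
def task_vector(finetuned: dict, base: dict) -> dict:
    """Element-wise difference between fine-tuned and base weights."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_vectors(weights: dict, vectors: list, coeffs: list) -> dict:
    """Return weights + sum_i coeffs[i] * vectors[i]."""
    out = {k: list(v) for k, v in weights.items()}
    for vec, c in zip(vectors, coeffs):
        for k in out:
            out[k] = [w + c * d for w, d in zip(out[k], vec[k])]
    return out


base = {"w": [1.0, 2.0]}
ft_a = {"w": [1.5, 2.0]}  # fine-tuned on task A
ft_b = {"w": [1.0, 1.0]}  # fine-tuned on task B

tv_a = task_vector(ft_a, base)
tv_b = task_vector(ft_b, base)

combined = apply_vectors(base, [tv_a, tv_b], [1.0, 1.0])  # add both abilities
forgot_a = apply_vectors(ft_a, [tv_a], [-1.0])            # negate task A
print(combined, forgot_a)
```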
Tap to flip back
Standard LoRA uses alpha / r as the scale; RSLoRA uses alpha / sqrt(r).
If the adapter was trained with use_rslora=True but the merge code blindly applies alpha / r, every merged weight is off by a factor of sqrt(r) - for r=16 that is a 4x error. The model outputs fluent text but with degraded task performance. Always read the scale convention from the saved adapter config rather than hard-coding the formula.
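One way to avoid hard-coding the formula is to derive the scale from the adapter config. This is a sketch only: it assumes the config is a JSON object with `r`, `lora_alpha`, and `use_rslora` keys, as in a PEFT-style `adapter_config.json`, and `merge_scale` is a hypothetical helper.

```python
import json
import math


def merge_scale(adapter_config: dict) -> float:
    """Pick alpha / sqrt(r) when the adapter was trained with rsLoRA, else alpha / r."""
    alpha = adapter_config["lora_alpha"]
    r = adapter_config["r"]
    if adapter_config.get("use_rslora", False):
        return alpha / math.sqrt(r)
    return alpha / r


config = json.loads('{"r": 16, "lora_alpha": 32, "use_rslora": true}')
correct = merge_scale(config)
wrong = config["lora_alpha"] / config["r"]  # what a hard-coded alpha / r would apply
print(correct, wrong, correct / wrong)
```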
Tap to flip back
- Task interference: when two fine-tunes push the same weight dimensions in opposite directions, adding their delta weights partially cancels both signals, degrading performance on each task.
- Magnitude collapse: if one adapter's delta is far larger (higher rank, higher scale, or longer training), it can dominate the merge and nearly overwrite the smaller adapter's contribution, making the combined model behave almost like only one of the adapters.
TIES and DARE mitigate these by zeroing small deltas and resolving sign conflicts before summation, but neither fully eliminates interference on fundamentally opposing tasks.
Tap to flip back
LoRA training freezes the base weights and learns only the delta BA. The delta is calibrated to add to a specific W. If the base weights differ at merge time (different quantisation, different training step, even different loading precision), the merge arithmetic applies the right delta to the wrong W, producing corrupted outputs.
Always record the base model commit hash or file SHA alongside the adapter checkpoint so the correct base can be reproduced at merge time.
Tap to flip back
For a weight matrix W in R^(d x k), LoRA adds a side path: output = Wx + (alpha/r) BAx, where A is in R^(r x k) and B is in R^(d x r). A is initialised from a Gaussian; B is initialised to zero so the adapter contributes nothing at step 0. Only A and B are trained; W is frozen.
Why it matters: the zero initialisation of B ensures training starts from exactly the pretrained model's behaviour, making LoRA safe to apply without learning-rate warmup tricks.
Tap to flip back
The original LoRA paper (Hu et al., 2021) reports a reduction of up to 10,000x in trainable parameters, with GPU memory requirement cut by roughly 3x versus full fine-tuning of GPT-3 175B.
Why it matters: this ratio is the core economic argument for LoRA; it makes fine-tuning accessible on hardware that cannot hold optimiser state for a full model.
Tap to flip back
Aghajanyan et al. (2020) showed that the weight update needed to fine-tune a pretrained model for a downstream task lies in a much lower-dimensional subspace than the full parameter space. In other words, a randomly projected, lower-dimensional optimisation problem can recover most of the quality of full fine-tuning. LoRA operationalises this by restricting ΔW = BA to rank r.
Why it matters: without this empirical foundation, LoRA would look like an arbitrary constraint; the intrinsic dimension hypothesis explains why rank-8 or rank-16 can be sufficient for most tasks.
Tap to flip back
- NF4 (4-bit NormalFloat): a quantisation data type that is information-theoretically optimal for normally distributed weights, minimising quantisation error for pretrained model weights.
- Double quantisation: the quantisation constants themselves are quantised, saving an additional ~0.37 bits per parameter.
- Paged optimisers: NVIDIA unified memory is used to page optimiser state to CPU RAM during memory spikes, preventing OOM crashes on long sequences.
Together these allow fine-tuning a 65B-parameter model on a single 48 GB GPU.
Tap to flip back
- Insufficient representational capacity for structurally novel tasks. If the required weight update is not well-approximated by a low-rank matrix, LoRA underfits regardless of training duration. Full fine-tuning has no such structural constraint.
- Inability to adapt frozen base representations. If the pretraining distribution is genuinely misaligned with the target domain (rare language, specialist notation), the frozen base produces poor intermediate representations that no adapter can fully correct. Full fine-tuning updates every layer and can redistribute representations globally.
Tap to flip back
Because the base model weights are frozen and identical across all LoRA variants, a single server can load the base once and store only the small adapter checkpoints (typically 10-100 MB each) for each task. At inference time the server either merges the relevant adapter into the base weights (zero latency) or applies the B*A side path dynamically.
Full fine-tuning would require a separate complete copy (e.g. 14 GB for a 7B bfloat16 model) per task variant.
Tap to flip back
The term alpha/r scales the adapter output before it is added to the frozen weight's output: output += (alpha/r) BAx. A common default is alpha = 2r, which fixes the scale at 2.0 regardless of rank. This means you can change rank without also changing the effective learning rate contribution of the adapter, making ablations over rank cleaner to interpret.
Setting alpha independently of r (e.g. alpha = 16 at r = 4) changes the effective scale to 4.0, which can destabilise training at high learning rates.
Tap to flip back
When a model is fine-tuned on a narrow task, gradient updates move the shared weights toward a new loss minimum. Those same weights encoded prior capabilities, so the prior capabilities degrade or vanish. It is a structural property of gradient descent on shared parameters, not a bug in any specific library.
Tap to flip back
Larger models develop more specialised weight circuits during pre-training, meaning a given weight is more likely to be load-bearing for multiple capabilities simultaneously. Disrupting those circuits has a broader blast radius. Empirically, Luo et al. (2023) observed greater forgetting severity in larger models (up to 7B) during continual instruction tuning.
Tap to flip back
\(F_i\) is the \(i\)-th diagonal entry of the Fisher information matrix, estimated on the prior task (here, the pre-training objective). It measures how sensitive the prior task loss is to changes in \(\theta_i\). Weights with high \(F_i\) are penalised heavily for drifting from \(\theta^*_{\text{pre}}\); low-\(F_i\) weights are free to adapt. This selectively slows learning on weights that matter most to prior tasks.
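The selective penalty can be sketched as a quadratic term added to the new-task loss. This assumes the usual EWC form, (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2; the strength `lam` and all numbers are made up for illustration.

```python
def ewc_penalty(theta, theta_pre, fisher_diag, lam: float) -> float:
    """Quadratic penalty weighting each weight's drift by its Fisher importance."""
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for t, t0, f in zip(theta, theta_pre, fisher_diag)
    )


theta_pre = [1.0, 1.0]
fisher_diag = [10.0, 0.1]  # first weight matters to the prior task, second barely does

# Move each weight by the same amount and compare the cost.
move_important = ewc_penalty([1.5, 1.0], theta_pre, fisher_diag, lam=1.0)
move_unimportant = ewc_penalty([1.0, 1.5], theta_pre, fisher_diag, lam=1.0)
print(move_important, move_unimportant)
```

The same 0.5 step costs a hundred times more on the high-Fisher weight, so the optimiser is steered towards adapting the weights the prior task does not depend on.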
Tap to flip back
LoRA freezes all pre-trained weights and routes adaptation through low-rank decomposition matrices added in parallel. Because the original weights never move, they cannot be overwritten. Biderman et al. (2024) confirmed empirically that LoRA forgets substantially less of base model capability than full fine-tuning, at the cost of somewhat lower peak task performance - a direct tradeoff between plasticity and stability.
Tap to flip back
Experience replay mixes a fraction (typically 10-20%) of general pre-training-style data into each fine-tuning batch, so gradient updates continuously reinforce prior capabilities alongside the new task. The practical limitation: constructing a high-quality, representative replay corpus for frontier models is very difficult because pre-training data is often proprietary or too large to store. If the replay set is not representative, you prevent forgetting only for the capabilities it covers.
Tap to flip back
RLHF-trained models encode refusal and alignment behaviours in the same weights used for general capabilities. Fine-tuning on a narrow domain corpus moves those weights toward the task objective, which can partially overwrite safety constraints even with no adversarial intent. Using PEFT (frozen base weights) substantially reduces this risk because alignment-critical weights are never updated directly.
Tap to flip back
- Memory cost: storing the Fisher diagonal requires memory proportional to model size, which is expensive at billions of parameters.
- Approximation quality: the diagonal Fisher is a crude approximation of the true Fisher matrix; accuracy degrades as model scale grows. Kronecker-factored variants (K-FAC) improve this but add significant engineering overhead.
Together these factors make EWC impractical for production-scale LLM fine-tuning without substantial modification.
Tap to flip back
The four residents are: model weights, gradients, optimiser state (e.g., Adam's first- and second-moment buffers), and activations. LoRA keeps the base weights frozen and trains only the small A and B matrices, so gradients and optimiser state shrink in proportion to the fraction of trainable parameters (on the order of 1,000x fewer for a 7B model at r=8 on the query and value projections; the often-quoted 10,000x figure is for GPT-3 175B). Base weight memory is unchanged unless you also apply quantisation (e.g., QLoRA).
Tap to flip back
W' = W + B @ A
where W is (d_out, d_in) and frozen, A is (r, d_in), B is (d_out, r). Rank r is typically 4-16 vs. d_in ~ 4096. This compresses the trainable parameter count from d_out × d_in to r × (d_out + d_in), a ~256x reduction at r=8 for a 4096-dimensional layer. B is initialised to zero so that B @ A = 0 at step 0, preserving pretrained behaviour.
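The reduction factor for a single square layer follows directly from the two counts. A minimal sketch; `lora_trainable` is a hypothetical helper.

```python
def lora_trainable(d_out: int, d_in: int, r: int) -> int:
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)


full_count = 4096 * 4096
lora_count = lora_trainable(d_out=4096, d_in=4096, r=8)
reduction = full_count / lora_count
print(full_count, lora_count, reduction)
```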
Tap to flip back
After training, the LoRA update B @ A can be merged directly into W (W' = W + B @ A), leaving a single weight matrix identical in shape to the original. No extra layers or branches remain in the forward path. Adapter modules insert serial bottleneck layers that cannot be absorbed this way; every forward pass must execute them, adding latency proportional to the number of adapted layers.
Tap to flip back
- NF4 (4-bit NormalFloat): stores base weights at 4 bits using quantisation levels optimal for normally distributed values, cutting base weight memory ~4x vs. float16.
- Double quantisation: quantises the quantisation constants themselves, saving roughly 0.37 additional bits per parameter.
- Paged optimisers: pages GPU optimiser states to CPU RAM during memory pressure using NVIDIA unified memory.
Headline result: a 65B-parameter model fine-tuneable on a single 48 GB GPU with performance matching full 16-bit fine-tuning.
Tap to flip back
Prompt tuning closes the gap to full fine-tuning as model size grows and roughly matches it at around 10B parameters (T5-XXL, 11B). Below that scale, the method performs materially worse because the frozen model representations lack the capacity to be steered sufficiently by a small prepended soft-prompt alone. The paper by Lester et al. (2021) frames this as "the power of scale" - at sufficient model size, a handful of learnable token embeddings is enough signal for adaptation.
Tap to flip back
DoRA (Liu et al., 2024) decomposes each pretrained weight matrix into a per-column magnitude vector m and a direction matrix (initialised to the pretrained W_0), then applies LoRA only to the directional component: W' = m * (W_0 + B@A) / ||W_0 + B@A||_c, where ||.||_c is the column-wise norm. Standard LoRA couples magnitude and directional changes through the same B@A product. By giving the optimiser independent magnitudes to control, DoRA more closely mimics how full fine-tuning updates weights, improving task performance; once merged, it adds no inference overhead, the same as standard LoRA.
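The magnitude/direction split can be sketched on a 2 x 2 example. This is a plain-Python illustration, assuming a column-wise norm and passing the low-rank update in as a ready-made `delta` matrix; `dora_weight` and `column_norms` are hypothetical helpers.

```python
import math


def column_norms(M):
    """L2 norm of each column of M (list of rows)."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*M)]


def dora_weight(W0, delta, m):
    """m * (W0 + delta) / ||W0 + delta||_c, applied column by column."""
    V = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, delta)]
    norms = column_norms(V)
    return [[mj * v / nj for v, mj, nj in zip(row, m, norms)] for row in V]


W0 = [[3.0, 0.0], [4.0, 2.0]]
m = column_norms(W0)  # magnitudes start at the pretrained column norms

zero_delta = [[0.0, 0.0], [0.0, 0.0]]
at_init = dora_weight(W0, zero_delta, m)  # recovers W0

turn_delta = [[1.0, 0.0], [-4.0, 0.0]]    # rotates the first column only
turned = dora_weight(W0, turn_delta, m)
print(at_init, turned, column_norms(turned))
```

The directional update changes where the first column points while its norm stays pinned to m, so magnitude can only change through m itself.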
Tap to flip back
- Rank underfit: tasks far from the pretraining distribution (e.g., a new language with novel morphology) may require a larger update than a small rank can represent; increasing r helps but erodes savings.
- Quantisation degradation: QLoRA's 4-bit base is lossy; tasks requiring precise numerical reasoning or structured output (code, JSON) can see measurable quality drops vs. float16 fine-tuning even when aggregate benchmarks appear equivalent.
- Multi-adapter version drift: serving many LoRA adapters from one base model works cleanly only when all adapters share the exact same base version and quantisation scheme; checkpoint drift across releases is a real operational problem.
Tap to flip back
Deep Learning
4 concept(s)
Convolutions share the same k x k kernel across every spatial location, which gives translation equivariance (shifting the input shifts the output by the same amount) and parameter efficiency (weights scale with kernel and channel count, not image size). That built-in prior matches the statistics of natural images, which is why CNNs can train from thousands of examples while transformers need millions.
Tap to flip back
The blocker was not vanishing gradients (BatchNorm had already handled those) - it was an optimisation problem: very deep plain nets could not learn identity mappings, so deeper nets had higher training error than shallow ones. ResNet reframes each block as y = F(x, W) + x, so layers learn the residual instead of the full mapping. Zero is a sensible default, and 152-layer (even 1001-layer) nets train cleanly. Every modern transformer inherits this skip connection.
Tap to flip back
CNNs survived in regimes where data or compute is tight:
- Edge inference - MobileNet / EfficientNet style depthwise-separable convs are 5-10x cheaper than attention at low resolutions.
- Dense prediction - segmentation and detection backbones (ConvNeXt v2) keep spatial structure baked in.
- Hybrid stacks - production vision systems use a conv stem for cheap downsampling and a transformer trunk for global reasoning.
The CNN bias is a prior: a feature when data is scarce, dead weight when it is abundant.
Tap to flip back
Theoretically RF_L = RF_{L-1} + (k_L - 1) * prod(strides up to L-1), so each stride-2 conv or pool doubles the amount that every later layer adds to the receptive field. In practice the effective receptive field is smaller because gradient contributions concentrate around the centre of the field, so the network behaves as if it sees a Gaussian-weighted window rather than a hard box.
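The recurrence can be unrolled in a few lines. This is a sketch that assumes each layer is described by a (kernel_size, stride) pair; the function name and the example stacks are illustrative.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered input to output."""
    rf = 1     # a single input pixel sees itself
    jump = 1   # product of the strides of all earlier layers
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


plain = receptive_field([(3, 1), (3, 1), (3, 1)])    # three 3x3 convs, no stride
strided = receptive_field([(3, 1), (3, 2), (3, 1)])  # middle conv has stride 2
print(plain, strided)  # 7 9
```

The stride-2 layer itself adds the usual 2 pixels; it is the layer after it that adds 4 instead of 2.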
Tap to flip back
Backprop-through-time multiplies k Jacobians of the recurrence. With spectral radius < 1, gradients shrink to zero in ~30 steps (vanishing); with radius > 1 they explode and training diverges. Gradient clipping fixes explosion but you cannot clip your way out of a numerical zero - vanishing is structural. The structural fix is gated state (LSTM / GRU) that gives gradients a near-identity highway.
Tap to flip back
c_t = f_t * c_{t-1} + i_t * c_tilde. When the forget gate f_t stays near 1, the cell state is approximately an identity map across time, so gradients propagate unimpeded. The three gates (forget, input, output) learn when to overwrite the highway. GRU collapses the same idea into two gates and merges cell and hidden state; on most benchmarks the two are within noise.
Tap to flip back
- Parallel training - self-attention computes all positions at once; an RNN must wait for step t-1. That is the difference between training on 10B and 10T tokens on the same hardware.
- Effective context - attention reaches every prior token with O(1) path length; LSTM signal decays well before the few-hundred-token figures quoted in 2016.
- Scaling laws - transformers improve predictably with parameters / data / compute; RNNs plateau.
Tap to flip back
- Streaming inference - an always-on speech recogniser keeps a fixed-size hidden state; a transformer's KV cache grows linearly with audio length.
- Tiny on-device models - a 1M-param GRU for keyword spotting on a microcontroller beats any attention model at that size budget.
- State-space hybrids - Mamba and RWKV bring back recurrence with parallel-scan training and selective gating, matching transformers on some long-context benchmarks.
Tap to flip back
Dropout zeros each activation with probability p and scales survivors by 1/(1-p). The intuition is that you are training an ensemble of 2^N sub-networks (one per mask) that share weights, and the un-masked test-time forward pass approximates their geometric mean. Co-adaptation between specific neurons is discouraged because no neuron can rely on a specific other being present.
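A minimal sketch of the masking-and-rescaling step (often called inverted dropout), using plain Python lists rather than a tensor library. The function name and the seeded generator are illustrative choices.

```python
import random


def dropout(activations, p, rng):
    """Zero each activation with probability p; scale survivors by 1 / (1 - p)."""
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]


rng = random.Random(0)            # seeded so the run is repeatable
x = [1.0] * 100_000
y = dropout(x, p=0.5, rng=rng)
mean_y = sum(y) / len(y)          # stays close to the input mean of 1.0
```

Because the survivors are rescaled during training, the test-time forward pass can use the activations unchanged.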
Tap to flip back
Three reasons:
- Data is the regulariser - on 15T tokens there is not enough capacity to memorise, so injected noise is unnecessary.
- It does not help at scale - large pretraining recipes such as Llama typically run with dropout disabled, because in the roughly single-epoch regime it tends to have zero or negative effect on validation loss.
- Throughput tax - generating masks and scaling activations is a measurable cost on the critical path of every layer.
You still see attention dropout in fine-tuning recipes where the dataset is small and overfitting is real.
Tap to flip back
In plain SGD, adding (lambda / 2) * ||w||^2 to the loss is equivalent to multiplying weights by (1 - lr * lambda) at every step. In Adam the L2 gradient term gets divided by sqrt(v_t) along with everything else - so heavily updated parameters receive less effective decay than rarely updated ones, which is the opposite of what you want. AdamW decouples decay from the gradient: every parameter shrinks at the same fractional rate. That is why every LLM trainer uses AdamW, not Adam + L2.
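A back-of-the-envelope sketch of the difference, not a full optimiser. It isolates the decay part of one update and assumes v_t has already settled near the square of a parameter's typical gradient magnitude (`grad_rms`). Both function names and all numbers are illustrative.

```python
def adam_l2_shrink(w, grad_rms, lr, lam, eps=1e-8):
    """Decay contributed by the L2 term under Adam + L2.

    Assumption: sqrt(v_t) is approximately grad_rms, so the decay gradient
    lam * w is divided by it like any other gradient component.
    """
    return lr * lam * w / (grad_rms + eps)


def adamw_shrink(w, lr, lam):
    """Decoupled decay under AdamW: applied to the weight directly."""
    return lr * lam * w


lr, lam, w = 1e-3, 0.1, 1.0
busy = adam_l2_shrink(w, grad_rms=10.0, lr=lr, lam=lam)    # heavily updated parameter
quiet = adam_l2_shrink(w, grad_rms=0.01, lr=lr, lam=lam)   # rarely updated parameter
decoupled = adamw_shrink(w, lr, lam)                       # same for both under AdamW
print(busy, quiet, decoupled)
```

Under these assumptions the same weight value is decayed about 1000x harder on the quiet parameter than on the busy one with Adam + L2, while AdamW applies one rate to both.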
Tap to flip back
| Technique | Role |
|---|---|
| Massive pretraining data | Removes the overfitting regime entirely |
| Weight decay (AdamW) | Shrinks unused parameters |
| Data augmentation / Mixup | Vision and speech only |
| Stochastic depth (drop path) | Deep ViTs |
| Early stopping | Fine-tuning |
The general lesson: regularisers that mattered for 100M-param / 1M-example models mostly disappear at 100B / 10T tokens. Different regime, different toolbox.
Tap to flip back
The loss is a function R^n -> R where n is the parameter count (billions) and the output dimension is 1.
- Forward mode needs n passes for an R^n -> R^m Jacobian - one per input.
- Reverse mode needs m passes - one per output.
For neural nets, m = 1 (scalar loss), so reverse mode gets all n gradients in a single backward pass. Forward mode would take billions of passes. Forward mode is still used for Hessian-vector products, ODE sensitivities, and few-input / many-output situations.
Tap to flip back
PyTorch builds a graph dynamically as ops execute; loss.backward() walks it in reverse topological order. The graph is rebuilt every iteration, which makes print / pdb debugging trivial.
JAX traces code into a typed IR (jaxpr) and transforms it functionally. jit(grad(vmap(f))) composes JIT, differentiation, and batching cleanly. Steeper learning curve, no implicit state, but the composition story is significantly cleaner - which is why DeepMind and Anthropic research codebases lean on JAX.
Tap to flip back
Backprop needs every forward-pass activation cached until the matching backward step. For a transformer with L layers, hidden d, sequence n, batch b, activation memory is roughly L * b * n * d * dtype_bytes * factor. Llama-2 70B at seq 4096, batch 4, bf16 burns 100+ GB on activations alone, on top of weights and optimiser state.
Gradient checkpointing (Chen et al., 2016) discards activations during forward and recomputes them on demand in backward. The standard uniform-checkpointing variant gives O(sqrt(L)) memory at ~1.33x compute (one extra forward pass on top of a forward + backward that costs about three). The recurring optimisation question is which blocks to checkpoint - attention activations dominate at long context, so checkpointing only attention often captures most of the win at half the compute cost.
Tap to flip back
Backward is roughly 2x the forward pass in FLOPs. Each backward step needs two matrix multiplies per layer: one to compute the parameter gradient dL/dtheta_l and one to propagate the upstream gradient dL/da_{l-1} to the previous layer. The forward pass does one matmul per layer. Combined with the rule of thumb that compilers and kernels make backward ops nearly as efficient as forward, you budget ~3x forward FLOPs for forward + backward per training step.
Tap to flip back
Safety & Alignment
2 concept(s)
SQL has a grammar that separates code and data, so parameterised queries are a clean fix. Prompts have no grammar - instructions and data share the same token stream and the model decides what counts as instruction from natural-language cues. There is no character, delimiter, or escape sequence that reliably marks something as "data, not instruction." Worse, the model is trained to follow instructions wherever they appear, because that is what makes it a useful chat assistant.
Tap to flip back
Direct injection: the user types Ignore previous instructions and.... Easy to fingerprint and partially filter. The version on social media.
Indirect injection: the attacker plants instructions in content the model will later retrieve - a webpage, PDF, email, tool output, vector store entry. The user asks an innocuous question, the model fetches the poisoned source, and the malicious instructions arrive inside what the model treats as trusted context. Greshake et al. (2023) showed data theft from email assistants, exfiltration via image tags, worming between agents. The user never sees the attacker's prompt.
Tap to flip back
The moment you wire the model to tools (send_email, execute_sql, transfer_funds), the attacker's payload becomes arbitrary code execution under the model's authority. The model reads the poisoned page, interprets the embedded instruction as a legitimate user request, and calls the tool with attacker-chosen arguments. This is why OWASP lists Prompt Injection (LLM01) and Excessive Agency (LLM06) together - the latter amplifies the former's blast radius enormously.
Tap to flip back
Cosmetic / weak: input classifiers (bypassed by paraphrase), delimiter blocks like ### USER INPUT ### (model still follows embedded instructions).
Empirically helpful: spotlighting / datamarking (Hines 2024) reduced attack success from over 50% to under 2% on benchmark. Output filtering catches obvious exfil patterns.
Strongest: CaMeL-style control/data flow split (Debenedetti 2025), capability minimisation, and human confirmation on irreversible actions (money, deletes, posts). The last is the only mitigation with a strong safety argument.
Tap to flip back
CaMeL (Debenedetti et al. 2025, "Defeating Prompt Injections by Design") is a control-flow-vs-data-flow split: a privileged planner LLM emits a typed plan, a quarantined LLM with no tool access handles untrusted content, and capabilities attached to each value gate which outputs can flow back into tool calls. The untrusted content can never directly influence tool arguments - it can only fill data slots the planner authorised. Adds latency and complexity, still not a full solve, but the most principled approach to date.
Tap to flip back
- CBRN uplift - can the model meaningfully help a non-expert plan mass-casualty harm? Benchmarked vs textbook/search baselines.
- Cyber offence - vulnerability discovery, exploit development, autonomous network compromise. CTF-style harnesses.
- Autonomy and agentic capability - self-replication, resource acquisition, long-horizon tasks without supervision. METR-style suites.
- AI R&D acceleration - can the model meaningfully speed the research that produces stronger models? Mostly internal, not yet standardised.
Plus misuse and alignment evals on top.
Tap to flip back
ASLs are the AI Safety Levels in Anthropic's Responsible Scaling Policy, analogous to biosafety levels. ASL-2 is today's frontier. ASL-3 is meaningful CBRN uplift or substantial autonomy - triggers hardened deployment (jailbreak-resistance bar), enhanced security (weight protection, insider threat), and structured red-teaming. ASL-4 is near-expert CBRN uplift or substantial autonomy. The policy is thresholds-and-commitments: cross a capability threshold and deployment is paused until the corresponding safeguards are demonstrated. v3.3 (mid-2026) added mandatory Frontier Safety Roadmaps and quantitative Risk Reports.
Tap to flip back
OpenAI's Preparedness Framework rates models on Low / Medium / High / Critical per category (Cybersecurity, CBRN, Persuasion, Model Autonomy), both pre- and post-mitigation. Rules:
- Models above Medium post-mitigation in any category cannot be deployed.
- Models above High cannot be developed further until safeguards close the gap.
ASL is one ladder of capability levels; Preparedness is a per-category matrix. Both share the "publish thresholds, gate releases" structure.
Tap to flip back
The UK AI Security Institute (AISI) and the US AI Safety Institute (housed at NIST) conduct technical capability evals on frontier models with pre-deployment access under bilateral arrangements with Anthropic, OpenAI, Google DeepMind, Meta. Critically: external eval access does not mean external veto. The institutes provide capability findings; the labs retain release decisions. Closest thing to an international evaluation regime, but enforcement teeth are reputational, not legal.
Tap to flip back
The EU AI Act (Regulation 2024/1689, in force 2024) designates frontier general-purpose AI models above a training-compute threshold (currently 10^25 FLOPs) as having "systemic risk." Obligations: model evaluations, adversarial testing, incident reporting, cybersecurity protections, energy-use reporting. Separately, high-risk application categories (employment, education, law enforcement, critical infrastructure) require conformity assessments, technical docs, post-market monitoring, and human oversight regardless of model size.
Tap to flip back
Training Infrastructure
2 concept(s)
DDP replicates the full model (weights, gradients, optimiser state) on every GPU and shards the global batch across them. Each replica runs forward and backward on its slice, then an AllReduce averages gradients across ranks so every rank applies the identical update. Replicas stay bit-identical only because they start from the same weights and apply the same averaged gradient.
Tap to flip back
Ring-AllReduce splits each buffer into N chunks and passes them around the ring in 2(N-1) steps: N-1 for reduce-scatter, N-1 for all-gather. Each rank sends 2(N-1)/N of the buffer in total, which approaches 2x and is almost independent of N - bandwidth-optimal for large payloads. Latency grows linearly with N though, so for small messages a log(N)-depth reduction tree wins. NCCL picks ring vs tree per message size automatically.
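A toy single-process simulation of the two phases, to make the step count and per-rank traffic concrete. It assumes each rank's buffer is already split into N chunks (one number per chunk here). The function name and the `sent` counter are illustrative and not NCCL's API.

```python
def ring_allreduce(buffers):
    """Simulate ring all-reduce (sum) over n ranks, each holding n chunks."""
    n = len(buffers)
    data = [list(b) for b in buffers]
    sent = [0] * n                        # chunks sent by each rank

    def exchange(chunk_of, combine):
        # Snapshot every message first so all sends in a step are simultaneous.
        msgs = [(r, chunk_of(r), data[r][chunk_of(r)]) for r in range(n)]
        for r, c, value in msgs:
            dst = (r + 1) % n             # each rank only talks to its right neighbour
            data[dst][c] = combine(data[dst][c], value)
            sent[r] += 1

    for s in range(n - 1):                # reduce-scatter: accumulate partial sums
        exchange(lambda r, s=s: (r - s) % n, lambda old, new: old + new)
    for s in range(n - 1):                # all-gather: circulate the finished chunks
        exchange(lambda r, s=s: (r + 1 - s) % n, lambda old, new: new)
    return data, sent


buffers = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
result, sent = ring_allreduce(buffers)
print(result[0], sent)  # [1111, 2222, 3333, 4444] [6, 6, 6, 6]
```

With N = 4 each rank sends 6 chunks of size 1/4, i.e. 2(N-1)/N = 1.5 buffers in total.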
Tap to flip back
Without no_sync() wrapping the inner K-1 micro-batches, DDP fires an AllReduce on every backward pass even though you only step the optimiser every K iterations. You pay the comms cost K times for the same effective update - pure waste. Wrap the non-stepping iterations in model.no_sync() so only the final accumulation step synchronises.
Tap to flip back
Mixed-precision Adam needs 16 bytes per parameter: 2 (BF16 weights) + 2 (BF16 grads) + 4 (FP32 master) + 4 (moment 1) + 4 (moment 2). A 7B model therefore needs ~112 GB of state alone, before activations. That overshoots an 80 GB H100, which is why you reach for ZeRO, FSDP, or tensor parallelism.
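The byte accounting above as a short calculation. The dictionary keys and the function name are illustrative labels; the byte counts are the ones listed in the card.

```python
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_grads": 2,
    "fp32_master": 4,
    "adam_moment_1": 4,
    "adam_moment_2": 4,
}


def training_state_gb(n_params: float) -> float:
    """Weights, gradients and optimiser state in GB (1 GB = 1e9 bytes), excluding activations."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


state_gb = training_state_gb(7e9)
overshoots_80gb = state_gb > 80
print(state_gb, overshoots_80gb)  # 112.0 True
```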
Tap to flip back
- Missing DistributedSampler. Every rank sees the same data shard; you silently waste N-1 GPUs and your effective batch is N times smaller than you think.
- Unused parameters. If some parameter does not receive a gradient on every rank, DDP hangs waiting for the missing AllReduce. Either set find_unused_parameters=True (slower) or restructure the model so every parameter participates every step.
Tap to flip back
BF16 keeps FP32's full 8-bit exponent (so the dynamic range matches FP32, ~1e-38 to ~3e38). FP16's 5-bit exponent caps the smallest normal at ~6e-5 (subnormals stretch to ~6e-8 with shrinking precision), so small gradients lose bits and then flush to zero - you need loss scaling to survive. BF16 needs no GradScaler, no overflow-recovery cycles, and no tuned scale schedule. The slightly worse mantissa precision (7 vs 10 bits) is invisible to noisy transformer optimisation.
Tap to flip back
Optimiser updates can be six orders of magnitude smaller than the weights themselves. weight -= lr * grad is silently rounded away in BF16/FP16 when lr * grad is smaller than half the gap between adjacent representable values around the weight (roughly 0.2-0.4% of its magnitude in BF16). The standard recipe keeps an FP32 master copy that the optimiser reads and writes; the BF16/FP16 compute copy is re-derived each step from the master. Costs 4 extra bytes per parameter, removes a whole class of silent divergence.
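A sketch of the failure mode using only the standard library. `to_bf16` emulates BF16 storage by rounding an FP32 bit pattern to its top 16 bits (round to nearest, ties to even), and `to_fp32` emulates FP32 storage. Both helpers and the constants are illustrative and not a framework API.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest BF16 value: keep the top 16 bits of the FP32 pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)          # round to nearest, ties to even
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


lr, grad, steps = 1e-4, 1.0, 1000
naive = 1.0     # weight stored and updated directly in BF16
master = 1.0    # FP32 master copy owned by the optimiser
for _ in range(steps):
    naive = to_bf16(naive - lr * grad)      # every update is rounded back to 1.0
    master = to_fp32(master - lr * grad)    # small updates accumulate
compute_copy = to_bf16(master)              # BF16 copy re-derived from the master

print(naive, round(master, 4), compute_copy)  # 1.0 0.9 0.8984375
```

After 1000 steps the BF16-only weight has not moved at all, while the copy derived from the FP32 master reflects the accumulated 0.1 change.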
Tap to flip back
E4M3 max ~448, E5M2 max ~57344 - both ranges are absurdly small. Each tensor needs a per-tensor scale that maps its actual values into FP8's representable window; the scale is stored alongside and unscaled on the way back to higher precision for accumulation. Use E4M3 for forward activations and weights (better precision), E5M2 for gradients (it needs the range). NVIDIA's Transformer Engine tracks scales automatically; hand-rolling FP8 is a fast path to NaNs.
Tap to flip back
The recipe is really "FP8 matmuls, BF16 everything else, FP32 master weights and norm statistics." Operations that must stay in higher precision:
- LayerNorm reductions and statistics
- Softmax
- Residual accumulator
- FP32 master weights for the optimiser
These are noise-sensitive enough that running them in FP8 destabilises training, even when the matmuls themselves converge fine.
Tap to flip back
LLM Systems
3 concept(s)
pgvector is the right starting point when you already run Postgres: one extension, transactional inserts, joins against existing tables, your DBA already knows the backup story. It is comfortable up to ~10M vectors and ~100 QPS. Beyond that, HNSW build times get slow, you compete with OLTP traffic for shared buffers, and dedicated stores (Qdrant, Milvus) pull ahead. Adding a second database is the most expensive optimisation you can do - delay it until a real scaling pain emerges.
Tap to flip back
- HNSW: multi-layer proximity graph. Excellent recall/latency, supports dynamic inserts, no training phase. Costs: high memory (graph is 1.5-3x raw vector bytes, lives in RAM) and slow builds. Right default for <50M vectors with low write rates.
- IVF / IVFFlat / IVF_PQ: partition the corpus, search the top nprobe partitions. Cheap to build, low memory, easy to shard. Recall drops as nprobe shrinks, because true neighbours in unsearched partitions are never scored. Right for 50M+ vectors where HNSW memory stops fitting.
At billion scale, both layer Product Quantisation on top for 8-32x memory reduction at ~1-5 points recall loss.
Tap to flip back
Pick Milvus. It is the only mainstream option with a disaggregated architecture (coordinator, query nodes, data nodes, separate object store for cold data) so you scale read and write paths independently. The cost is operational weight - you are running a small distributed system. Do not pick Milvus to serve a 5M-vector RAG demo; the operational tax is enormous relative to pgvector or Qdrant for that scale.
Tap to flip back
Weaviate ships native hybrid search (BM25 + vector + RRF in a single query) and a schema-driven data model with built-in modules for embedding generation. If your retrieval has rich filters and you want one query language for both lexical and semantic, Weaviate is the cleanest end-to-end implementation. Qdrant is leaner for pure vector workloads but you bolt BM25 and fusion together yourself. The trade-off is the opinionated GraphQL API.
Tap to flip back
Every dedicated vector DB is:
- Another service in monitoring.
- Another set of backups with a different restore procedure than your primary DB.
- Another upgrade path - vector DBs are young; breaking changes still happen.
- Another consistency story - syncing source-of-truth (Postgres) to the vector index needs dual-writes, CDC, or periodic reindex. All three have failure modes.
If your team is five engineers without a clear scaling pain, stay on pgvector and revisit at 10M rows or 200 ms p95.
Tap to flip back
Public ANN benchmarks run on fixed academic datasets with no filters and no concurrent writes. Your workload looks nothing like that - real corpora have their own clustered distributions, real queries have metadata filters, and you write while you read. Always benchmark on a sample of your real corpus with realistic filters and write traffic. Also: default ef_search / nprobe often give 85-90% recall, which sounds fine until users see the 10-15% they miss. Tune for your target recall before declaring victory.
Tap to flip back
Embeddings tokenise strings like INV-2024-7831 into sub-pieces and lose the identity of the identifier itself. BM25 indexes it as a single token and finds the matching document on the first try. Named entities, error codes, function names, SKUs, dates, version numbers - anything where the user asks about a specific string - BM25 wins. For "what does our refund policy say about damaged items" - vector wins. Production RAG runs both.
Tap to flip back
RRF computes score(d) = sum over lists L of 1 / (k + rank_L(d)), k=60 standard. Brilliance: you never calibrate BM25 vs cosine scores against each other - only ranks matter. Rank-1-in-both > rank-1-and-rank-50 > rank-50-in-both. The method has one parameter (k) that rarely matters above noise. Linear combinations like alpha*bm25 + (1-alpha)*cosine are tempting but require per-query calibration and break when one retriever's score distribution shifts. RRF needs no score calibration and is robust.
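The scoring formula above fits in a few lines. A minimal sketch: the document IDs are made up, and the alphabetical tie-break is an arbitrary choice added here so the output is deterministic.

```python
from collections import defaultdict


def rrf(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):   # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_top = ["doc_a", "doc_b", "doc_c"]
dense_top = ["doc_a", "doc_c", "doc_d"]
fused = rrf([bm25_top, dense_top])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

doc_c outranks doc_b because it appears in both lists, even though doc_b has the better single rank. Note that no raw BM25 or cosine score is ever read.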
Tap to flip back
A bi-encoder (the embedding model) encodes query and document independently then takes a dot product - fast, but never sees them together so it cannot capture token-level interactions. A cross-encoder takes (query, document) concatenated as input and runs a full forward pass - quadratic in length, vastly more accurate. You cannot afford this over 1M documents; you can afford it over the top 50-100 from the fused stage. Typical lift: 5-15 points of recall-at-10, larger than most prompt-engineering wins.
Tap to flip back
- Query embed: ~15 ms (MiniLM on GPU or cached)
- BM25 top 100: ~15 ms (single-node Elasticsearch is fine for billions)
- Dense top 100: ~30 ms (HNSW with ef_search ~100)
- RRF: ~1 ms
- Rerank top 50: ~120 ms (bge-reranker-large fp16, batched)
- Pass top 5 to LLM
The reranker dominates and is also where the biggest accuracy gains live. If 120 ms is too much, use bge-reranker-base or rerank top 20 instead of top 50.
Tap to flip back
- Calling the reranker 50 times for 50 candidates is 10-20x slower than one batched call. Always batch the cross-encoder.
- Different chunking between BM25 and dense indexes breaks RRF deduplication - the "same document" appears as different IDs in each list, so the fusion loses its main signal. Use the same chunk IDs across both indexes, or normalise to a parent-document ID before fusion.
For high-value verticals, fine-tune the reranker on a few thousand of your own (query, relevant, irrelevant) triples - the lift is large.
Tap to flip back
| Type | Price (relative) |
|---|---|
| Input (uncached) | 1.0x |
| Cache write | 1.25x to 2.0x |
| Cache read | 0.1x to 0.5x |
| Output | 3.0x to 5.0x |
Two implications people get wrong: output is the expensive bit (a chatty 2,000-token response costs 10x a tight 200-token one), and cache write is a one-time tax that only earns out once the prefix is reused within the TTL - after a single reuse at the cheap end of the ranges above (1.25x write, 0.1x read), but only on the third reuse at the expensive end (2.0x write, 0.5x read). Log all four separately; aggregating into "tokens" is a false economy that costs you a week of debugging in month three.
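A quick sketch of the break-even arithmetic for cache writes, using the relative prices from the table. The function name is illustrative, and the model assumes every reuse lands inside the TTL.

```python
def breakeven_reuses(write_mult: float, read_mult: float) -> int:
    """Smallest number of cache reads after which caching is strictly cheaper.

    Cost without caching for 1 + n calls: (1 + n) * 1.0 per prefix.
    Cost with caching:                    write_mult + n * read_mult.
    """
    if read_mult >= 1.0:
        raise ValueError("cache reads must be cheaper than uncached input")
    n = 1
    while write_mult + n * read_mult >= 1 + n:
        n += 1
    return n


cheap_end = breakeven_reuses(write_mult=1.25, read_mult=0.1)
expensive_end = breakeven_reuses(write_mult=2.0, read_mult=0.5)
print(cheap_end, expensive_end)  # 1 3
```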
Tap to flip back
request_id, timestamp_utc, org_id, user_id, feature, model,
input_tokens, cache_read_tokens, cache_write_tokens, output_tokens,
cost_usd (computed at call time from your price table),
latency_ms, cached_prefix_hash
Append-only table partitioned by day. Every report falls out: spend by feature, users near limit, cache hit rate by feature, cost-per-request before/after a prompt change. Compute cost_usd at call time from a price table you control - provider invoices arrive monthly, you need attributable telemetry within seconds.
Tap to flip back
- Chat-like products: count by request. Cheap to enforce, predictable for users, no pre-flight estimation needed. Variance per request is small relative to a daily limit.
- Document-processing products: count by token. Variance is real (a 100-page PDF vs a tweet), and a per-request limit is either too loose or too tight. Estimate with tiktoken / anthropic.count_tokens() - never with character-count heuristics.
Atomically reserve the estimated tokens (Redis INCRBY) before the call and release the unused part afterwards, otherwise two parallel requests can both pass the check and both spend.
Tap to flip back
- Concurrency cap per user - Redis semaphore, three lines of code. Stops the runaway-loop case dead.
- Token-bucket per user - allows short bursts, prevents sustained over-consumption.
- Separate upstream keys per high-value tenant - enterprise customers get a dedicated provider key so their rate limit is theirs alone.
- Per-tenant inference cluster - nuclear option for regulated workloads or very large customers.
Scales: shared key + per-user concurrency cap + per-tenant token-bucket. Dedicated keys only when a customer pays for isolation.
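A minimal in-memory sketch of the per-user token bucket from the list above. It is a single-process stand-in for illustration only: a multi-instance service would need shared state (for example in Redis), and the class and parameter names are assumptions. Time is passed in explicitly so the behaviour is easy to test.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, then a sustained `refill_per_sec` rate."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity        # start full so an idle user can burst
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=3, refill_per_sec=1.0)
burst = [bucket.allow(now=0.0) for _ in range(4)]
later = bucket.allow(now=2.0)
print(burst, later)  # [True, True, True, False] True
```

The fourth request in the burst is rejected, and two seconds of refill let the next one through.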
Tap to flip back
- UTC midnight: one global rollover, trivial. A user in California sees reset at 4 PM local - surprising but unambiguous.
- User-local midnight: friendlier UX but 24+ rollover events per day; DST shifts make some days 23 or 25 hours.
- Rolling 24-hour window: smoothest UX, requires a Redis sorted-set with timestamped entries and a sliding-window query.
Pick UTC unless you have a specific reason not to, and document it in the API. The support tickets averted by saying "quotas reset at 00:00 UTC" in the docs are non-trivial.
Tap to flip back
Reasoning Models
3 concept(s)
No new mechanism - the model still emits one token at a time. What changes is the budget:
- Longer chains of thought - thousands of intermediate reasoning tokens before the visible answer.
- Parallel sampling - draw N completions and aggregate (majority vote, best-of-N, self-consistency).
- Search - Tree of Thoughts, MCTS rollouts, beam search over reasoning steps.
- Iterative refinement - generate, critique, revise.
All four cost more wall-clock and more tokens; all four can move a fixed-weight model up the accuracy curve.
Tap to flip back
Scaling LLM test-time compute optimally can be more effective than scaling model parameters (arXiv 2408.03314). A small model with a compute-optimal test-time strategy can match a model ~14x larger evaluated greedily on MATH-difficulty problems. The optimal strategy is question-difficulty-dependent: easy problems want greedy decoding, hard problems want sequential revision plus verifier-guided search. Naive "always sample 64 and majority vote" leaves performance on the table.
Tap to flip back
- Low base capability. If pass@1 is essentially zero, sampling 1000 traces still gives zero. Compute helps the middle of the distribution, not the impossible tail.
- No verifier. Best-of-N needs a way to pick the best. Without unit tests, an answer key or a PRM you fall back to self-consistency, which only works if wrong answers are diverse.
- Latency-bound workloads. Voice assistants, autocomplete, customer chat - users will not wait 45 seconds for a hidden reasoning trace.
- Reward-hacked length. Models trained with length rewards pad with restatements that look like reasoning.
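The low-base-capability point can be made concrete with the usual pass@N estimate. This sketch assumes samples are independent with a fixed per-sample success rate, which is a simplification; the function name and the example rates are illustrative.

```python
def pass_at_n(p_single: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** n


hopeless = pass_at_n(1e-6, 1000)   # near-zero pass@1: about 0.001 after 1000 samples
middling = pass_at_n(0.05, 64)     # modest pass@1: about 0.96 after 64 samples
print(hopeless, middling)
```

Even a high pass@N only helps if something (a verifier, unit tests, a vote) can pick the correct sample out of the N.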
Tap to flip back
In the o-series and Anthropic extended-thinking APIs, the model emits reasoning tokens you pay for but may never see in full (hidden entirely or returned only as a summary). A single hard query can burn 20-50k reasoning tokens before producing 200 visible tokens. Per-call cost swings from cents to dollars. For agentic loops running thousands of queries this dominates the bill. The API exposes a knob - reasoning_effort on OpenAI, a thinking budget_tokens on Anthropic - and the caller must choose where on the cost-accuracy curve to sit. Budgets that used to be predictable became wildly variable per call.
Tap to flip back
| Workload | Scale test-time? | Why |
|---|---|---|
| Competition maths, theorem proving | Yes | Verifiable answer, no latency contract |
| Coding agent submitting a PR | Yes | Minutes acceptable, correctness compounds |
| Customer-support chat | No | User waits, expects sub-second first token |
| Autocomplete, voice | No | 100ms budget kills everything but greedy |
| Bulk classification | Maybe | Depends on per-call $ vs accuracy gain |
Rule: scale when the value of one extra correct answer exceeds the cost of the extra tokens plus the latency penalty.
Tap to flip back
Wang et al (ICLR 2023, arXiv 2203.11171) sampled N chains at temperature > 0 and returned the majority vote over final answers. Reported gain on GSM8K: +17.9 points over greedy CoT. Mechanism: greedy decoding commits early to one chain and propagates its errors. Sampled chains explore different reasoning paths; the correct answer tends to be reachable via several distinct paths while wrong answers are usually one-off mistakes. Marginalising concentrates probability on the correct answer.
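The aggregation step is only a majority vote over extracted final answers. A minimal sketch, assuming the answers have already been parsed out of each sampled chain; the sample list is invented for illustration.

```python
from collections import Counter


def self_consistency(final_answers):
    """Return the most common final answer and the share of samples backing it."""
    counts = Counter(final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)


# Final answers extracted from seven sampled reasoning chains (made-up data).
sampled = ["18", "18", "20", "18", "17", "18", "20"]
answer, agreement = self_consistency(sampled)
print(answer, round(agreement, 2))  # 18 0.57
```

The agreement ratio doubles as a cheap confidence signal: a low value suggests the question needs a verifier or search rather than more votes.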
Tap to flip back
Yao et al (NeurIPS 2023, arXiv 2305.10601) frame reasoning as deliberate search: generate multiple candidate thoughts per step, score each with a value function (self-evaluation or external checker), expand promising branches with BFS/DFS/beam. On Game of 24, ToT solved 74% vs 4% for GPT-4 with chain-of-thought. The gap is what disciplined search buys you on combinatorial problems where greedy commits early to dead-end branches.
Tap to flip back
Heuristic from the Snell test-time-compute paper:
| Problem difficulty | Best strategy |
|---|---|
| Easy (right first try) | Greedy decode |
| Medium (right with effort) | Self-consistency, modest N |
| Hard (rare correct paths) | Verifier-guided beam / MCTS |
| Pathological (essentially incapable) | More base capability, not more search |
If pass@1 is decent and pass@N grows fast with N, parallel sampling wins. If pass@1 is low and chains tend to dead-end early, search wins because it prunes doomed branches before they burn compute.
Tap to flip back
Modern frontier models bake CoT into their weights via SFT and RL on long reasoning traces - they produce reasoning spontaneously. Kojima et al's 2022 prompt trick mattered when models had to be cued. On o1, R1, Gemini Thinking, prompting "think step by step" rarely beats default behaviour. The interesting knobs moved above the prompt level: aggregation (self-consistency), search (ToT, MCTS), verifier-in-the-loop. Trained CoT replaced prompted CoT.
Tap to flip back
MCTS suits problems with large branching factors and a verifier. The LLM is the node-expansion policy; the verifier or learned value model gives terminal rewards; UCB balances exploration and exploitation.
for _ in range(n_iterations):
    leaf = select(root)                 # UCB descent
    children = expand(leaf, model)      # sample k continuations
    value = evaluate(children, prm)     # programmatic or learned
    backpropagate(leaf, value)
rStar-Math (Microsoft, 2025) used this to teach a 7B Qwen to match o1-preview on MATH. Small model + heavy search + good value function can beat much larger no-search models on verifiable tasks.
Tap to flip back
Three causes, roughly in increasing seriousness:
- Real capability gains. Reasoning models genuinely are better at competition maths.
- Test-time compute slides accuracy. A model at 40% pass@1 may hit 85% with 64-sample self-consistency (maj@64) - headline numbers are usually max-compute.
- Contamination. Benchmark questions appear verbatim or paraphrased in training data. The model is not solving the problem; it is recognising it.
The third is the load-bearing methodological issue. Public benchmarks older than 18 months should be treated as marketing artefacts rather than capability signals.
Tap to flip back
Direct leakage:
- MATH problem in a Stack Exchange answer
- GSM8K question republished on a tutoring blog
- AIME problem with worked solution on a maths forum
- Benchmark test split scraped wholesale from Hugging Face
Indirect leakage:
- A paraphrase or translation of the benchmark question
- A textbook chapter using the same problem
- A YouTube transcript walking through the answer
Diagnostic: sharp performance drop on problems released after the model's training cutoff (LiveCodeBench's central observation).
Tap to flip back
ARC-AGI (Chollet, 2019) consists of novel grid puzzles written for the benchmark, not crowd-sourced from textbooks. Each puzzle is unique - the model must induce a transformation rule from a few examples, not retrieve a memorised answer. Contamination requires the exact test puzzle leaking, which is controllable. ARC-AGI-1 held up for years before o3 in its high-compute configuration scored 87.5%, clearing the 85% bar usually quoted as the human baseline, at thousands of dollars per task. ARC-AGI-2 (still unbeaten at parity) and ARC-AGI-3 (agentic) continue the standing public challenge with a $2M+ prize pool.
Tap to flip back
| Defence | Example |
|---|---|
| Temporal - score only on post-cutoff items | LiveCodeBench (problems annotated with release dates) |
| Secrecy - never publish test items, only aggregate scores | FrontierMath (held privately by Epoch AI), UK AISI internal evals |
| Generative - new items every period | LiveBench (monthly refresh from recent maths, arXiv, news) |
The unifying idea: if the eval can leak, eventually it will. Engineers picking models for reasoning workloads should weight LiveBench / LiveCodeBench / FrontierMath / ARC-AGI rankings over older static suites, and weight their own private internal evals over all of those.
Tap to flip back
Three rules:
- A benchmark score is an upper bound on capability under the conditions tested - check which compute setting produced it.
- A delta over a strong baseline on a contamination-controlled benchmark is meaningful. A delta on a public static benchmark older than 18 months is mostly marketing.
- The only reliable measurements come from private evals on data the model has never seen - your own internal benchmark on your own data is worth more than any public leaderboard.
Reasoning evaluation is now a moving practice, not a fixed scoreboard.
Tap to flip back
Vision & Multimodal
37 concept(s)
- Cut a 224x224x3 image into a grid of 16x16 patches - 14x14 = 196 patches.
- Flatten each to a 768-dim vector (16 * 16 * 3).
- Linearly project to model dim, prepend a learnable [CLS] token, add positional embeddings.
- Run through a vanilla transformer encoder - 197 tokens, no convolutions, no pooling.
- The final [CLS] representation feeds the classification head.
That is the entire architectural change. Patches are tokens; an image is a short sequence.
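The arithmetic in the steps above can be checked with a few lines. The `vit_token_shapes` helper is a hypothetical name used only for this sketch.

```python
def vit_token_shapes(image_size: int, patch_size: int, channels: int = 3) -> dict:
    """Patch grid, flattened patch size and sequence length for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    return {
        "grid": (per_side, per_side),
        "num_patches": num_patches,
        "patch_dim": patch_size * patch_size * channels,  # flattened pixels per patch
        "seq_len": num_patches + 1,                       # +1 for the [CLS] token
    }

shapes = vit_token_shapes(224, 16)
print(shapes)
```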
Tap to flip back
On ImageNet-1k (1.3M) ViT loses to ResNets at every size. On ImageNet-21k (14M) it is competitive. On JFT-300M it wins by 1-2 points at a quarter of the compute. Convolution's translation equivariance and locality are priors that help when data is scarce. With enough data, the prior becomes dead weight - the model could learn equivariance itself and might prefer something better. Strong priors are a substitute for data; weak priors plus enough data can match or exceed them.
Tap to flip back
Pure ViT attention is quadratic in token count. A 1024x1024 image at 16x16 patches has 4096 patches and ~16.8M attention entries per head per layer - prohibitive for detection and segmentation. Swin computes attention within fixed windows (e.g. 7x7 patches), shifts windows between layers so information crosses boundaries, and builds a hierarchical pyramid by merging patches at each stage. Result: linear in image size, multi-scale features, drops cleanly into dense-prediction heads.
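A rough sketch of the quadratic-versus-linear gap, using the card's own numbers. It assumes each token attends to exactly window x window neighbours and ignores the padding needed when the grid is not a multiple of the window size; both function names are made up for the example.

```python
def global_attention_entries(image_size: int, patch_size: int) -> int:
    """Attention matrix entries (per head, per layer) with full self-attention."""
    tokens = (image_size // patch_size) ** 2
    return tokens * tokens

def windowed_attention_entries(image_size: int, patch_size: int, window: int) -> int:
    """Approximate entries when each token only attends inside a window x window block."""
    tokens = (image_size // patch_size) ** 2
    return tokens * window * window

full = global_attention_entries(1024, 16)
local = windowed_attention_entries(1024, 16, window=7)
print(full, local, full / local)
```

Doubling the image side quadruples the token count: the global figure grows 16x while the windowed figure grows only 4x.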
Tap to flip back
- Small-data classification (below ~10M pretraining images). ResNets and ConvNeXts with strong augmentation match or beat ViTs at a fraction of the compute.
- Edge inference. MobileNets and EfficientNets dominate phone-scale deployment - depthwise convolutions are 5-10x cheaper than attention at low resolution.
- Dense prediction without a hybrid. Pure ViT has no multi-scale hierarchy; detection backbones almost always reintroduce one (Swin, ViTDet, ConvNeXt).
CNNs migrated to where their assumptions still hold; the frontier moved up the data curve.
Tap to flip back
Counterintuitively, no. Dosovitskiy et al found that separate row/column embeddings or sinusoidal 2D grids gave no measurable gain over flat 1D learnable embeddings. The model learns 2D structure from data. Variable-resolution inference is handled by bilinearly interpolating the embedding grid - awkward but it works. Later models reintroduced spatial priors via RoPE-2D or conditional positional encoding mostly to improve extrapolation, not classification accuracy.
Tap to flip back
Two encoders, batch of N image-text pairs, N x N cosine similarity matrix:
logits = f(I) @ g(T).T / temperature       # N x N, embeddings L2-normalised
labels = arange(N)                         # pair i matches pair i (the diagonal)
loss_i = cross_entropy(logits, labels)     # image-to-text
loss_t = cross_entropy(logits.T, labels)   # text-to-image
loss = (loss_i + loss_t) / 2
Diagonal entries are true matches; off-diagonals are negatives. Temperature is learned, settles around 0.01. Bigger batches give more negatives - the original CLIP used batch 32,768 across 256 V100s.
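The symmetric loss above can be written out in dependency-free Python to show what each term does. This is a toy sketch on hand-made 2-dim embeddings, not CLIP's training code; the helper names and the fixed temperature of 0.07 are assumptions.

```python
import math

def normalise(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Softmax cross-entropy for one row, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def clip_loss(image_emb, text_emb, temperature=0.07):
    img = [normalise(v) for v in image_emb]
    txt = [normalise(v) for v in text_emb]
    n = len(img)
    logits = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: each row should peak on the diagonal.
    loss_i = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should peak on the diagonal.
    loss_t = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_i + loss_t) / 2

aligned = clip_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = clip_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned, swapped)
```

Correctly paired embeddings give a loss near zero; swapping the captions puts all the similarity off the diagonal and the loss jumps.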
Tap to flip back
Embed all 1,000 class names with the text encoder (often via prompt templates like "a photo of a {class}"). Embed the test image with the image encoder. Pick the class whose text embedding has the highest cosine similarity to the image. No fine-tuning, no labelled examples needed. CLIP ViT-L hits 76% top-1 on ImageNet zero-shot - roughly matching a fully supervised ResNet-50. Prompt ensembles add another 1-3 points.
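A minimal sketch of the nearest-text-embedding rule. The 3-dim vectors are made-up stand-ins for real encoder outputs, and `zero_shot_classify` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the class whose text embedding is closest to the image embedding."""
    return max(
        class_text_embeddings,
        key=lambda name: cosine(image_embedding, class_text_embeddings[name]),
    )

# Toy embeddings standing in for text-encoder outputs of "a photo of a {class}".
text_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
prediction = zero_shot_classify([0.2, 0.8, 0.1], text_embeddings)
print(prediction)
```

Adding a class is just adding another text embedding to the dictionary; nothing is retrained.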
Tap to flip back
Softmax contrastive loss normalises across the whole batch - every negative competes with every positive. Performance becomes tightly coupled to batch size (huge batches needed for enough negatives). SigLIP (Zhai et al, 2023) judges each pair independently as positive or negative with a sigmoid loss. No batch-wide normalisation, no global temperature softmax. Trains well at smaller batches, decouples performance from compute, and is now the default backbone for PaliGemma, Idefics, and parts of Gemma 3.
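For contrast with the softmax version, here is a sketch of a pairwise sigmoid loss in the SigLIP style. It takes a precomputed N x N cosine-similarity matrix; the scale `t = 10` and bias `b = -10` are assumed starting values for illustration, and in training both would be learned.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def siglip_loss(sim, t=10.0, b=-10.0):
    """Every (image i, text j) pair is an independent binary classification."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 for true pairs, -1 for negatives
            total += -log_sigmoid(label * (t * sim[i][j] + b))
    return total / n

aligned = siglip_loss([[1.0, 0.0], [0.0, 1.0]])
swapped = siglip_loss([[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

No term depends on a normaliser over the whole row or column, which is what loosens the coupling to batch size.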
Tap to flip back
- Compositional binding. Cannot reliably distinguish "red cube on blue sphere" from "blue cube on red sphere" - the alignment objective produces a bag-of-concepts representation.
- Counting and spatial relations. No notion of "three apples" or "apple left of orange".
- OCR. Recognises text is present, not what it says.
- Fine-grained taxonomy. Bird species, medical imagery - the long tail that web captions never describe precisely.
- Typographic attacks. A sticker reading "iPod" on an apple makes it predict iPod.
These are fundamental to contrastive alignment, not data problems. It is why downstream multimodal LLMs pipe CLIP features through an LLM that can reason compositionally.
Tap to flip back
Once images and text live in the same space, multiple capabilities fall out for free:
- Zero-shot classification by nearest-text-embedding.
- Cross-modal retrieval (image search by text, image-to-image search).
- Grounding for generation - Stable Diffusion conditions on CLIP text embeddings; open-vocabulary detectors (OWL-ViT, GLIP) condition on them.
- Compositional probing via vector arithmetic.
It collapsed image classification from "pick 1000 classes, train" to "describe what you want in English". That is the most important architectural idea in vision since ResNet.
Tap to flip back
Both trained on web-scale weakly-supervised pairs, both replaced per-task supervised stacks with one general model. Whisper used 680k hours of multilingual audio-text scraped from the web, log-Mel spectrogram into encoder-decoder transformer, 99 languages, transcription and speech-to-English translation zero-shot. Pre-Whisper production ASR meant separate models per language with separate fine-tuning. Whisper's encoder is now the de facto audio backbone for downstream tasks - audio classification, speaker recognition, multimodal LLMs that take speech input.
Tap to flip back
A 1-minute video at 1 fps and CLIP ViT-L token rate is 60 frames x 576 tokens = 34k tokens - blows past most LLM context windows and burns expensive attention compute. Perceiver resamplers, temporal pooling, Q-Former-style learned queries all exist to compress this without losing the answer-relevant frame. The trade-off: aggressive compression loses fine-grained temporal events (which is why long-video reasoning is still wobbly), light compression blows the token budget.
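The token budget is simple multiplication, sketched below. The 64 learned queries per frame in the compressed case is an arbitrary illustrative figure, not a value from any particular model.

```python
def raw_video_tokens(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Tokens if every sampled frame keeps all of its vision-encoder tokens."""
    frames = int(duration_s * fps)
    return frames * tokens_per_frame

def resampled_video_tokens(duration_s: float, fps: float, queries_per_frame: int) -> int:
    """Tokens if a resampler compresses each frame to a fixed number of queries."""
    frames = int(duration_s * fps)
    return frames * queries_per_frame

raw = raw_video_tokens(60, 1, 576)
compressed = resampled_video_tokens(60, 1, 64)  # 64 queries/frame is hypothetical
print(raw, compressed)
```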
Tap to flip back
V-JEPA (Bardes et al, 2024) is self-supervised and non-generative. It predicts masked spatio-temporal regions in an abstract feature space, not pixel space. Because it does not reconstruct pixels, the model is free to ignore unpredictable details (exact textures, background motion) and focus on what is structurally predictable. Strong frozen-feature evaluation - 81.9% on Kinetics-400 with ViT-H/16 - without ever generating a frame. Sits alongside JEPA-style language pretraining as Yann LeCun's preferred non-generative direction.
Tap to flip back
- Tokenise video into a 3D latent grid via a video VAE (2x-4x spatial compression, 2x-8x temporal).
- Train a DiT or Flow Matching model on those latent tokens.
- Condition on text via cross-attention; classifier-free guidance at sample time.
The "patches as tokens" formulation lets one architecture handle different resolutions, durations and aspect ratios. Sampling cost is dominated by token count, which is why commercial models are still in the seconds-to-minute clip range. Veo 3 added native audio generation alongside video.
Tap to flip back
Audio is tokenised by EnCodec (a neural audio codec) into discrete tokens at ~75 Hz across multiple codebooks via residual vector quantisation - same role the VAE plays for latent diffusion. MusicGen then autoregresses a transformer LM over those tokens, conditioned on text or melody. 3.3B parameters in the largest variant, generates ~12 seconds of music in real time on an A100. The recipe is portable - any 1D continuous signal becomes a candidate for tokenise-and-model. AudioGen (environmental sound) uses the same stack.
Tap to flip back
Four forces favour specialisation:
| Force | Why it splits the stack |
|---|---|
| Tokeniser specialisation | A great audio codec is not a great image VAE |
| Data availability | High-quality paired multimodal data is scarce |
| Inference economics | Sora-class video uses 1000x an image model's compute |
| Latency profiles | Real-time speech needs streaming decoders; image gen does not |
Current consensus: unified at the user interface, specialised at the compute layer. Frontier products route requests behind a unified API to different models, with growing but not total weight-sharing.
Tap to flip back
Text normalisation -> grapheme-to-phoneme (G2P) -> prosody/duration prediction -> acoustic model -> vocoder. Each stage converts one representation to the next: raw text becomes spoken-form tokens, then phonemes, then a timed phoneme sequence, then a mel spectrogram, then a raw waveform.
Tap to flip back
Normalisation converts written-form text into spoken-form tokens before any phoneme lookup. Errors here are silent - the output sounds intelligible but wrong. Heteronyms (words spelled identically but pronounced differently depending on part of speech) require shallow parsing to resolve, and numeric/symbolic forms like currency, IP addresses, or dates all need bespoke rules.
Tap to flip back
Tacotron 2 conditions a modified WaveNet vocoder on mel-spectrogram predictions from a sequence-to-sequence network. Using the mel spectrogram as an intermediate acoustic representation allowed a simplified vocoder design. The system achieved MOS 4.53 out of 5.0, compared to 4.58 for professional recordings - effectively indistinguishable to most listeners.
Tap to flip back
An RVQ stacks multiple vector quantisers in sequence, where each quantiser codes the residual error left by the previous one. This lets a neural codec compress audio to very low bitrates (3 kbps for SoundStream) with high fidelity. For TTS, RVQ-based codecs produce a discrete token sequence that a language model can predict - enabling codec language model pipelines like VALL-E.
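The residual idea can be shown with two tiny hand-written codebooks. This is a toy sketch: real codecs learn their codebooks and operate on encoder frames, and the helper names here are assumptions.

```python
def nearest(codebook, x):
    """Index of the codeword closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], x)),
    )

def rvq_encode(x, codebooks):
    """Quantise x stage by stage; each stage codes the previous stage's residual."""
    residual = list(x)
    codes = []
    for cb in codebooks:
        k = nearest(cb, residual)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return codes, residual

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[k])]
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # stage 1: coarse
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # stage 2: refines the residual
]
codes, residual = rvq_encode([1.2, 0.9], codebooks)
recon = rvq_decode(codes, codebooks)
print(codes, recon)
```

The code list (one index per stage) is the discrete token stack a codec language model would predict; dropping later stages lowers the bitrate at the cost of a larger residual.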
Tap to flip back
VALL-E treats TTS as conditional language modelling over discrete EnCodec tokens rather than signal regression to a mel spectrogram. A transformer generates codec token sequences conditioned on text and a 3-second speaker reference clip; the codec decoder then converts tokens to a waveform. This collapses the acoustic model and vocoder into one language model and enables zero-shot voice cloning without fine-tuning.
Tap to flip back
- Rare proper nouns and code-switched text expose G2P failures (e.g. "Nguyen", mixed-language phrases).
- Autoregressive models can hallucinate or repeat phonemes/words, especially at sentence boundaries.
- Expressive or emotionally loaded text (sarcasm, whispering, laughter) is outside most read-speech training distributions, producing bland or incorrect prosody.
Tap to flip back
Zero-shot cloning (e.g. VALL-E) requires only a short reference clip and no fine-tuning, but generalises from a prompt embedding, which can fail on unusual accents or laryngeal qualities not well-represented in training data. Fine-tuning on 5-30 minutes of a target speaker bakes speaker priors into model weights, giving more reliable fidelity at the cost of needing labelled audio for every new speaker.
Tap to flip back
The same token can have multiple correct spoken expansions depending on context. "1995" reads as "nineteen ninety-five" in a year reference but "one thousand nine hundred ninety-five" as a page number. A token-level classifier without sentence context cannot distinguish these cases; it needs surrounding words to infer the semantic role of the token. This is why Transformer-based models that encode the full sentence (e.g., with a BERT encoder) outperform token-level FST classifiers on the hard disambiguation cases.
Tap to flip back
Finite-state transducers (FSTs) are fast, deterministic, and interpretable; they cascade tokenisation, classification, and verbalisation stages. Neural seq2seq models (often Transformer-based) handle context-sensitive and long-tail tokens better but add latency and are harder to audit. Production systems typically combine both: FSTs for high-frequency unambiguous tokens, neural models for the tail. Using a neural model for everything trades off reliability and latency for coverage.
Tap to flip back
A heteronym is a word spelled identically but with different pronunciations depending on its syntactic role or meaning (e.g., "lead" /lɛd/ vs /liːd/, "close" /kloʊs/ vs /kloʊz/). A single-pronunciation dictionary lookup assigns one fixed reading and will be wrong roughly half the time on these words. Correct handling requires at minimum part-of-speech tagging (and sometimes full sentence-level context) to select the right pronunciation variant before feeding it to the acoustic model.
Tap to flip back
ARPAbet appends a digit to vowel phonemes to encode lexical stress: 1 = primary stress, 2 = secondary stress, 0 = unstressed. Example: "present" as a noun is P R EH1 Z AH0 N T; as a verb it is P R IH0 Z EH1 N T. The acoustic model uses this stress information to generate duration and pitch patterns. Without correct stress placement, synthesised speech sounds flat or wrongly emphasised, even if every phoneme is individually correct.
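Reading the stress digits back out of a transcription takes only a few lines. The helpers below are hypothetical names, applied to the verb reading given above.

```python
def stress_pattern(arpabet: str) -> list:
    """Stress digit of each vowel, in order (consonants carry no digit)."""
    return [int(p[-1]) for p in arpabet.split() if p[-1].isdigit()]

def primary_stress_index(arpabet: str) -> int:
    """Zero-based index of the vowel carrying primary stress."""
    return stress_pattern(arpabet).index(1)

verb = "P R IH0 Z EH1 N T"  # "present" as a verb
print(stress_pattern(verb), primary_stress_index(verb))
```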
Tap to flip back
G2P is a learned model that maps a character sequence to a phoneme sequence, providing pronunciations for words not in any pre-built lexicon. Dictionary lookup works well for common in-vocabulary words but fails on out-of-vocabulary tokens: proper nouns, neologisms, brand names, technical terms, and code-switched words. G2P models generalise to unseen words by learning sub-word phonological patterns. The cost is that they systematically struggle with foreign-origin names and acronyms, which violate the phonological patterns seen in training data.
Tap to flip back
Traditional front-ends run normalisation and phonemisation as sequential, independent stages. An upstream mistake (wrong token class, wrong expansion) propagates forward with no correction mechanism. End-to-end systems (e.g., Tacotron trained on characters) learn the full grapheme-to-acoustics mapping jointly, so there is no hard pipeline boundary where errors accumulate. The trade-off is reduced interpretability: it is harder to diagnose why a specific token was mispronounced when the model has no explicit phoneme output to inspect.
Tap to flip back
The same written token has different correct expansions across languages. Number formatting conventions differ (thousands separator, decimal separator); abbreviations are language-specific; date formats vary by locale; and currency symbols map to different spoken words. A model trained only on English normalisation will misfire when the surrounding sentence is French or German, even for digit strings that look identical. Correct multilingual normalisation requires either per-language FST grammars, a language-conditioned neural model, or a sentence-level multilingual model that detects language automatically.
Tap to flip back
Phonemes change on a 50-100 ms timescale; a 24 kHz waveform must be faithful at sub-millisecond resolution. One network spanning both scales is hard to train. The split inserts a mel spectrogram as an intermediate: the acoustic model produces ~80-100 frames per second, and the vocoder handles the high-frequency waveform reconstruction separately.
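The frame-rate figures follow directly from the hop size. The sketch assumes hops of 12.5 ms and 10 ms, which are common choices rather than values stated on this card.

```python
def frames_per_second(hop_ms: float) -> float:
    """Mel frames produced per second of audio for a given hop size."""
    return 1000.0 / hop_ms

def samples_per_frame(sample_rate: int, hop_ms: float) -> int:
    """Waveform samples the vocoder must generate for each mel frame."""
    return round(sample_rate * hop_ms / 1000.0)

for hop in (12.5, 10.0):
    print(hop, frames_per_second(hop), samples_per_frame(24_000, hop))
```

Each mel frame therefore stands for a few hundred waveform samples, which is the gap the vocoder is responsible for closing.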
Tap to flip back
The mel spectrogram retains spectral magnitude but discards phase. The vocoder does not recover the original phase; it synthesises a new, perceptually plausible phase from scratch. This is why two vocoders run on the same mel spectrogram produce numerically different but perceptually similar waveforms.
Tap to flip back
Feed ground-truth mel spectrograms (computed from real recordings) directly into the vocoder and listen to the output. Any artefacts heard are attributable to the vocoder alone, not to the acoustic model. This isolates the vocoder's error contribution before measuring the full pipeline.
Tap to flip back
Attention-based acoustic models occasionally misalign - skipping words, repeating syllables, or losing track of position in long inputs. An explicit duration predictor assigns a frame count to each phoneme deterministically, eliminating these failure modes and enabling fully parallel mel generation (giving FastSpeech its ~270x mel-generation speedup over an autoregressive Transformer TTS baseline).
Tap to flip back
Voice identity - pitch contour, formant positions, speaking rate, rhythm - is encoded in the mel spectrogram that the acoustic model produces. The vocoder only inverts the spectrogram to a waveform and is largely speaker-agnostic once well trained. A few minutes of target-speaker audio is enough to shift the acoustic model's output distribution; retraining the vocoder adds cost with little benefit.
Tap to flip back
The acoustic model and vocoder are usually trained independently, so the vocoder never sees the specific distribution of prediction errors the acoustic model makes. At inference the vocoder receives slightly blurry or jittered mels it was not trained on, causing degraded quality. The fix is end-to-end fine-tuning after independent pre-training, or training the vocoder on acoustic-model outputs rather than ground-truth mels. This partially undermines the clean modularity of the two-stage design.
Tap to flip back
Speech is quasi-periodic: voiced sounds repeat glottal cycles at the fundamental frequency (F0), and harmonics stack at integer multiples of F0. Multi-period discriminators evaluate the generator's output at different sub-sampling strides, penalising incorrect periodicities. Multi-scale discriminators enforce consistency at different temporal resolutions. Together they prevent the buzzy, aperiodic artefacts common in earlier GAN vocoders.
Tap to flip back
Stage 1 is a sequence-to-sequence model (encoder + location-sensitive attention + decoder) that converts a character sequence into 80-bin mel spectrograms. Stage 2 is a modified WaveNet vocoder that conditions on those mel frames and synthesises a raw waveform at 24 kHz. The clean separation means either stage can be replaced independently.
Tap to flip back
A 24 kHz waveform runs at 24,000 samples per second. With Tacotron 2's 12.5 ms frame hop, the mel spectrogram compresses that to 80 frames per second (a 300x reduction in sequence length) while preserving perceptually weighted frequency content. This makes the sequence-to-sequence mapping from text far more tractable, and a learned vocoder can invert mel spectrograms into high-fidelity audio more reliably than Griffin-Lim inversion of linear spectrograms.
Tap to flip back
Location-sensitive attention conditions on both the encoder hidden states and the cumulative attention weights from all previous decoder steps. The cumulative weights act as a "how far have I read so far" signal, biasing the model to advance monotonically through the input. Standard Bahdanau attention has no such bias and tends to repeat or skip regions on longer inputs.
Tap to flip back
The PreNet sits between the previous mel frame and the decoder LSTM. If its output is deterministic at inference, the decoder can learn to copy its own previous output rather than attending to the encoder - a form of degenerate autoregression that causes error accumulation on long utterances. Keeping 0.5 dropout active injects noise that forces the decoder to rely on the attention context vector instead.
Tap to flip back
- Word skipping: the attention jumps over a region, typically on long inputs (over ~200 characters) or after punctuation that causes abrupt context shifts.
- Word repetition / looping: the attention re-attends to the same encoder region, often triggered by repeated substrings or unstressed function words.
Both failures become more frequent as utterance length increases; production deployments typically enforce input length limits to avoid them.
Tap to flip back
The WaveNet is trained on ground-truth mel spectrograms, so it optimises for a slightly different distribution than the imperfect Stage 1 outputs it sees at inference. The mismatch produces spectral smearing and pitch instability, most audibly at prosodic boundaries where the Stage 1 decoder output diverges furthest from the training distribution. One remedy is fine-tuning the vocoder on Stage 1 outputs ("adaptation"), though this adds a third training stage.
Tap to flip back
Replacing Griffin-Lim inversion of linear spectrograms (Tacotron 1) with a learned WaveNet vocoder conditioned on mel spectrograms. Griffin-Lim is an iterative phase reconstruction algorithm that introduces buzzy artefacts; the WaveNet directly models the waveform distribution conditioned on a perceptually scaled representation, producing much smoother and more natural audio.
Tap to flip back
A correct alignment appears as a near-diagonal band in the decoder-step vs. encoder-position matrix. Each decoder frame should attend sharply to roughly one phoneme at a time, advancing left-to-right. This matters because it encodes the fundamental monotonic constraint of speech - phonemes unfold in order - which the model must learn implicitly rather than having it enforced structurally.
Tap to flip back
- Skipping (under-attention) - attention jumps forward too fast; a word or phoneme is missing from the output.
- Repetition (over-attention) - attention stalls on one encoder position; the same phoneme is generated repeatedly, producing a stutter.
- Diagonal drift - attention peaks are wide and blurry rather than sharp; the audio is intelligible but prosody is smeared and speaking rate fluctuates.
Tap to flip back
Location-sensitive attention feeds the cumulative previous attention weights (convolved with a learned filter) into the attention score computation. This biases the mechanism away from encoder positions it has already attended to heavily, making it harder for the decoder to stall and loop on the same position. It does not enforce monotonicity - it only penalises revisiting positions, so failures can still occur under sufficient stress.
Tap to flip back
Guided attention loss adds a training penalty proportional to off-diagonal attention weights, using a Gaussian-shaped weight matrix centred on the diagonal. It accelerates alignment learning and reduces skipping/repetition on short to medium sentences. The trade-off: it regularises the attention toward a fixed speaking rate profile, which can reduce expressiveness on unusual prosody or uncommon phoneme timing patterns.
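A sketch of the penalty, assuming the commonly used weight W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)) with g = 0.2 and an attention matrix laid out as text positions by mel frames. The function names and the toy 4x4 alignments are illustrative.

```python
import math

def guided_attention_weights(n_text: int, n_frames: int, g: float = 0.2):
    """Penalty matrix: ~0 near the diagonal, approaching 1 far from it."""
    return [
        [
            1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2 * g * g))
            for t in range(n_frames)
        ]
        for n in range(n_text)
    ]

def guided_attention_loss(attention, g: float = 0.2) -> float:
    """Mean of attention weights multiplied by the off-diagonal penalty."""
    n_text, n_frames = len(attention), len(attention[0])
    w = guided_attention_weights(n_text, n_frames, g)
    total = sum(
        attention[n][t] * w[n][t] for n in range(n_text) for t in range(n_frames)
    )
    return total / (n_text * n_frames)

diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
reversed_path = [[1.0 if n == 3 - t else 0.0 for t in range(4)] for n in range(4)]
print(guided_attention_loss(diagonal), guided_attention_loss(reversed_path))
```

A smaller g narrows the tolerated band around the diagonal, which is the knob behind the expressiveness trade-off described above.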
Tap to flip back
Attention is autoregressive: each frame's alignment is conditioned on the previous frame's alignment. A small drift at frame 50 compounds over subsequent frames. Additionally, most training sentences are short to medium length, so the model has rarely had to maintain correct monotonic progress over 100+ phoneme sequences. Both factors - error accumulation and distribution mismatch - worsen together as length grows.
Tap to flip back
FastSpeech uses a duration predictor to assign a scalar frame count to each phoneme, then a length regulator expands the phoneme sequence to match the mel frame count. There is no cross-attention over encoder outputs, so skipping and repetition cannot occur. The dependency introduced is ground-truth phoneme durations, which must be extracted from a teacher attention-based TTS model - creating a bootstrapping reliance on the same fragile mechanism it replaces.
Tap to flip back
In Tacotron, cross-attention operates over text/phoneme encoder outputs; failures produce missing or repeated phonemes in the continuous mel domain. In codec language models, cross-attention operates over discrete audio codec tokens from a prompt, and the decoder generates codec token sequences. Failures manifest as token repetition in the codec token stream or as prompt drift (the model re-uses prompt tokens at the wrong position), which produces different audible artefacts - tonal or timbral glitches rather than intelligibility errors.
Tap to flip back
FastSpeech replaces the recurrent, attention-based decoder with a feed-forward Transformer stack. A length regulator expands encoder hidden states by replicating each phoneme vector according to its predicted duration, so the decoder receives a fully expanded sequence and can generate all mel frames simultaneously with no step-to-step dependency.
Tap to flip back
The length regulator repeats each phoneme's encoder hidden vector exactly d_i times, where d_i is the duration (in mel frames) for that phoneme. This expands a short phoneme sequence into a longer frame-aligned sequence that matches the target mel-spectrogram length, allowing the parallel decoder to operate without attention-based alignment.
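The repetition rule is short enough to write out directly. This is a minimal list-based sketch with a hypothetical `length_regulate` helper; a real model would do the same expansion on tensors.

```python
def length_regulate(hidden, durations):
    """Repeat each phoneme's hidden vector d_i times to reach mel-frame length."""
    if len(hidden) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    expanded = []
    for vec, d in zip(hidden, durations):
        expanded.extend([vec] * d)  # d == 0 drops the phoneme entirely
    return expanded

hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three phoneme vectors
frames = length_regulate(hidden, [2, 0, 3])
print(len(frames), frames)
```

Scaling every duration by a constant before expansion is how these models expose a speaking-rate control.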
Tap to flip back
FastSpeech 2 uses the Montreal Forced Aligner (MFA) to extract ground-truth phoneme durations directly from the training audio-text pairs. This removes the teacher-student distillation step that FastSpeech 1 required, where a separate autoregressive teacher model had to be trained first and its attention alignments extracted.
Tap to flip back
Pitch (F0) is extracted with a WORLD vocoder (pyworld) and energy is computed as the frame-level L2 norm of the STFT magnitude. Both are quantised into discrete bins and mapped to learnable embeddings, and the pitch and energy predictors are trained with mean-squared error against the ground-truth values. At inference, an operator can manually shift pitch or energy values to control expressiveness - raising pitch produces a higher-voiced reading without retraining.
Tap to flip back
- Duration compounding errors: a mis-predicted duration stretches or compresses an entire phoneme uniformly; there is no frame-level self-correction.
- Flat prosody: parallel frame generation breaks within-word conditioning, so fine-grained prosodic variation is lost compared to autoregressive models that condition each frame on the previous one.
Tap to flip back
Both learn monotonic alignment inside the model during training. Glow-TTS uses dynamic programming (Monotonic Alignment Search) to find the highest-probability monotonic alignment between text and the flow's latent representation. VITS reuses MAS inside a variational autoencoder framework and adds a stochastic duration predictor, so the alignment emerges end-to-end without MFA. The trade-off is greater training complexity versus FastSpeech 2's simpler pipeline.
Tap to flip back
Autoregressive models must generate frames sequentially; each step depends on the previous output, so GPU parallelism cannot be used across time. FastSpeech generates all frames in a single forward pass, allowing full GPU parallelisation across frames. FastSpeech 1 reported roughly a 270x speedup for mel-spectrogram generation and 38x for end-to-end synthesis on a V100, measured against an autoregressive Transformer TTS baseline.
Tap to flip back
Text has a fixed number of tokens (phonemes/graphemes), but the corresponding audio has far more frames. Alignment is the mapping from each text token to the set of audio frames it generates. It arises because there is no closed-form rule for how long each phoneme lasts; duration varies by speaker, prosody, and context, so the model must learn or infer it from data.
Tap to flip back
Every valid alignment is monotonic: phoneme i always precedes phoneme i+1 in time. Enforcing monotonicity prevents the model from looping (repeating a phoneme) or skipping one entirely, both of which are perceptually catastrophic. Soft attention models (e.g. Tacotron) are not constrained to be monotonic and fail on long utterances for exactly this reason.
Tap to flip back
FastSpeech 2 uses the Montreal Forced Aligner (MFA), a separate forced-alignment tool built on Kaldi, to annotate every training utterance with phoneme-level boundaries before TTS training begins. The resulting frame counts per phoneme serve as ground-truth targets for the duration predictor, which is trained with MSE loss on log-durations.
Tap to flip back
L_dur = MSE( log(d_pred + 1), log(d_gt + 1) )
Log: log-duration penalises relative rather than absolute error. Overshooting a 4-frame phoneme by 2 frames costs roughly as much as overshooting a 50-frame phoneme by 25 frames, which keeps the loss approximately scale-invariant across phone lengths.
+1: prevents log(0) for zero-duration phones (reduced vowels, elided stops) that occasionally appear in forced-alignment output.
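A quick numeric check of the formula above, in plain Python. The `log_duration_mse` helper and the example durations are illustrative.

```python
import math

def log_duration_mse(d_pred, d_gt):
    """MSE between log(d + 1) of predicted and ground-truth frame counts."""
    errs = [
        (math.log(p + 1) - math.log(g + 1)) ** 2 for p, g in zip(d_pred, d_gt)
    ]
    return sum(errs) / len(errs)

short_err = log_duration_mse([6], [4])    # 2 frames over on a 4-frame phone
long_err = log_duration_mse([52], [50])   # 2 frames over on a 50-frame phone
zero_case = log_duration_mse([0], [0])    # zero-duration phone, no log(0) error
print(short_err, long_err, zero_case)
```

The same 2-frame miss is penalised far more on the short phone, where it is a much larger relative error.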
Tap to flip back
MAS, introduced in Glow-TTS (Kim et al., NeurIPS 2020), finds the most probable monotonic path through a log-likelihood matrix Q[i,j] (text token i, mel frame j) using dynamic programming. The path lengths give per-phoneme durations that train the duration predictor as a side product, eliminating the need for an external aligner. It runs O(T_text * T_mel) per step and co-trains with the acoustic model from scratch.
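The dynamic programme can be sketched in pure Python. This is an illustrative version under stated assumptions, not the Glow-TTS implementation: `log_p[i][j]` is a made-up log-likelihood table, every token must receive at least one frame, and the path is forced to start at the first token and end at the last.

```python
def monotonic_alignment_search(log_p):
    """Return per-token durations of the best monotonic path through log_p."""
    n_text, n_mel = len(log_p), len(log_p[0])
    if n_text > n_mel:
        raise ValueError("need at least one mel frame per text token")
    neg_inf = float("-inf")
    q = [[neg_inf] * n_mel for _ in range(n_text)]
    q[0][0] = log_p[0][0]
    for j in range(1, n_mel):
        for i in range(min(j + 1, n_text)):  # token i cannot start before frame i
            stay = q[i][j - 1]                                # same token as last frame
            advance = q[i - 1][j - 1] if i > 0 else neg_inf  # moved on by one token
            q[i][j] = max(stay, advance) + log_p[i][j]
    # Backtrack from (last token, last frame), counting frames per token.
    durations = [0] * n_text
    i = n_text - 1
    for j in range(n_mel - 1, -1, -1):
        durations[i] += 1
        if j > 0 and i > 0 and (i == j or q[i - 1][j - 1] > q[i][j - 1]):
            i -= 1
    return durations

# Toy scores: token 0 explains frames 0-1 well, token 1 explains frames 2-3.
log_p = [
    [-0.1, -0.2, -3.0, -3.0],
    [-3.0, -3.0, -0.1, -0.2],
]
durations = monotonic_alignment_search(log_p)
print(durations)
```

The two nested loops are where the O(T_text * T_mel) cost comes from, and the returned durations are what would supervise the duration predictor.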
Tap to flip back
FastSpeech 2 uses a deterministic duration predictor (a regression head that outputs a single value per phoneme). VITS uses a stochastic duration predictor built from normalising flows, which models a distribution over possible durations. Sampling from this distribution at inference time produces diverse rhythms from the same text, which is important for expressive and conversational voices where timing is genuinely variable.
Tap to flip back
- Short/reduced phones (schwa, stop bursts): forced alignment may assign 0 frames; clipping to 1 introduces a systematic lengthening bias on fast speech.
- Out-of-vocabulary tokens from a faulty text normaliser: the encoder representation is noisy, so the duration predictor has no reliable signal, producing glitch-length segments.
- Long-form inputs (>200 phonemes per segment): both attention-based systems and MAS become unreliable; production pipelines typically sentence-segment text before synthesis to avoid this.
Tap to flip back
A magnitude spectrogram discards the phase of each STFT bin. Phase determines how spectral components combine in time, so infinitely many waveforms can share the same magnitude spectrogram. Without additional constraints, reconstruction is impossible to solve uniquely - Griffin-Lim exploits the STFT consistency constraint to narrow the solution space.
Tap to flip back
- Consistency projection: iSTFT then STFT. This forces the complex spectrogram to be exactly representable as the STFT of some real signal (satisfying overlap-add consistency).
- Magnitude projection: Replace the estimated magnitudes with the target magnitudes, keeping the phase from step 1.
Each full cycle is guaranteed not to increase the reconstruction error. The algorithm converges, but it may settle at a local minimum rather than the global optimum.
Tap to flip back
Musical noise is a shimmering, tonal artefact audible in Griffin-Lim reconstructions. It arises because the random phase initialisation, combined with the magnitude constraint, produces phase patterns that cause spectral bins across frames to interfere constructively at irregular but non-random intervals. The interference creates faint pitched tones that shift in frequency as the spectrogram changes, perceived as a metallic shimmer overlaid on the speech.
Tap to flip back
- F0 (fundamental frequency): Pitch contour over time; determines voiced/unvoiced segments and prosody.
- Spectral envelope: Smoothed magnitude spectrum representing vocal tract shape; encodes vowel quality and formant structure.
- Aperiodicity: Frame-level measure of noise-to-periodic energy ratio; controls breathiness and fricative colouring.
WORLD synthesises output by combining a periodic excitation (from F0) with a noise source, mixed according to the aperiodicity mask, then shaped by the spectral envelope.
Tap to flip back
The algorithm converges to a local minimum of the least-squares reconstruction error. This minimum is determined by the structure of the consistency constraint, not by the number of iterations. Once the phase estimate stops changing meaningfully (typically within 50-100 iterations), further projection cycles only refine residuals that are below the threshold of perceptual relevance. The ceiling on quality is set by the magnitude-only input, not by iteration budget.
Tap to flip back
Tacotron 2 replaced Griffin-Lim with a conditioned WaveNet to eliminate the musical noise and unnatural breathiness of phase-estimated speech. WaveNet models a distribution over valid waveforms given the mel-spectrogram, rather than recovering a single waveform via heuristic phase estimation. The cost was synthesis speed: WaveNet is autoregressive (sample by sample at 24 kHz), requiring significant GPU compute, whereas Griffin-Lim ran in roughly real time on CPU for short utterances.
Tap to flip back
- Fricatives (s, sh, f): These are spectrally shaped noise bursts. Classical vocoders either approximate them with a simplified pulse-plus-noise excitation (LPC, WORLD) or recover a phase that produces incoherent shimmer (Griffin-Lim). Neither approach reliably produces the correct aperiodic, broad-band texture.
- Silence / low-energy regions: STFT magnitudes near zero have a high phase-noise ratio. Tiny magnitude errors get amplified when Griffin-Lim attempts phase recovery, producing audible hiss. WORLD's aperiodicity mask partially mitigates this but cannot eliminate it in transitions.
Tap to flip back
WaveNet models p(x) = product_t p(x_t | x_1...x_{t-1}) - a fully autoregressive factorisation over raw audio samples. Text (or phoneme) features form a local conditioning signal that is upsampled to the audio sample rate and added as a bias inside each dilated conv layer. A global conditioning signal (speaker embedding) shifts behaviour across the whole sequence.
Tap to flip back
A standard 1-D causal convolution has a receptive field that grows only linearly with depth. Dilated convolutions double the spacing between filter taps at each layer (1, 2, 4, 8, ...), so the receptive field grows exponentially: 10 layers with kernel size 2 already span 1024 samples, about 64 ms at 16 kHz. Repeating that stack several times brings the receptive field into the ~240 ms range, enough to capture short-range prosodic patterns, without a quadratic blow-up in parameters or the vanishing-gradient problems of deep recurrent networks.
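The receptive-field arithmetic can be checked directly. The sketch below assumes kernel size 2 and a 16 kHz sample rate; `receptive_field` is a hypothetical helper name, and the three-stack configuration is only an example, not a specific published model.

```python
def receptive_field(dilations, kernel_size=2):
    """Samples visible to one output of a stack of causal dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

SAMPLE_RATE = 16_000  # assumed sample rate for the millisecond conversion

plain_stack = [1] * 10                        # no dilation: linear growth
dilated_stack = [2 ** k for k in range(10)]   # 1, 2, 4, ..., 512

rf_plain = receptive_field(plain_stack)
rf_dilated = receptive_field(dilated_stack)
rf_three_stacks = receptive_field(dilated_stack * 3)

ms_dilated = 1000 * rf_dilated / SAMPLE_RATE
ms_three_stacks = 1000 * rf_three_stacks / SAMPLE_RATE
```

With the same 10 layers, the undilated stack sees 11 samples while the dilated stack sees 1024; repeating the dilated stack adds roughly another 1023 samples each time.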
Tap to flip back
Each layer computes tanh(W_f * x + V_f * h) * sigmoid(W_g * x + V_g * h). The sigmoid gate acts as a soft selector - it can suppress or pass the tanh output. This gating mechanism (from PixelCNN) helps the network learn when to ignore a feature entirely, which a plain ReLU cannot do. It is particularly useful for learning to route conditioning signals selectively through deep stacks.
Tap to flip back
256 bins, using mu-law companding (also called mu-law quantisation). Mu-law is a logarithmic compression that allocates more bins to low-amplitude values where human hearing is most sensitive, giving perceptually uniform resolution with far fewer bins than linear PCM would require for equivalent quality.
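A minimal sketch of 8-bit mu-law companding, assuming input samples normalised to [-1, 1] and mu = 255. The function names `mu_law_encode` and `mu_law_decode` are illustrative and not taken from any particular library.

```python
import math

MU = 255  # 256 quantisation levels

def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer bin in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((compressed + 1) / 2 * mu + 0.5)

def mu_law_decode(index, mu=MU):
    """Map an integer bin back to an approximate sample in [-1, 1]."""
    compressed = 2 * index / mu - 1
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)

# Bins spent on the quietest 10% of the amplitude range.
bins_for_quiet = mu_law_encode(0.1) - mu_law_encode(-0.1)
quiet_error = abs(mu_law_decode(mu_law_encode(0.01)) - 0.01)
loud_error = abs(mu_law_decode(mu_law_encode(0.9)) - 0.9)
```

More than half of the 256 bins land on amplitudes below 0.1 in magnitude, so the round-trip error for quiet samples is much smaller than for loud ones.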
Tap to flip back
Because sample x_t depends on all previous samples, generation is strictly sequential - one sample at a time. At 16 kHz this means 16,000 sequential network evaluations per second of audio, which made the original model far slower than real time even on a GPU and unusable for production. Parallel WaveNet trains a fast inverse autoregressive flow student via probability density distillation from a pre-trained WaveNet teacher. The student generates all samples in parallel, running more than 20x faster than real time.
Tap to flip back
- Sequential generation speed - one sample per forward pass means near-real-time at best; unacceptable latency for on-device or streaming use.
- Dependency on upstream text analysis - WaveNet conditions on pre-computed phoneme/linguistic features, so errors in grapheme-to-phoneme conversion or stress assignment propagate directly into bad audio with no internal correction mechanism.
Bonus: data hunger (tens of hours of clean, consistent recordings per voice) and 256-bin quantisation artefacts on high-dynamic-range content.
Tap to flip back
A vocoder converts a mel-spectrogram back into a time-domain waveform. Mel-spectrograms encode magnitude (energy per frequency band per frame) but discard phase. The vocoder must hallucinate a coherent phase trajectory consistent with that magnitude, at 22 kHz or higher - there is no ground-truth phase to supervise against directly.
Tap to flip back
WaveNet generates one sample conditioned on all previous samples, so generation is strictly sequential. For 5 seconds at 22,050 Hz that is approximately 110,000 sequential forward passes through a deep dilated network. Even on fast hardware, this is orders of magnitude slower than real-time, which is why WaveNet was not practical for interactive TTS without significant architectural changes.
Tap to flip back
MRF runs several residual blocks with different kernel sizes and dilation rates in parallel at each upsampling stage, then sums their outputs. Speech contains structure at multiple time-scales simultaneously: formant transitions (~3-10 ms), pitch periods (~5-20 ms at 50-200 Hz), and phrase-level rhythm (~200-800 ms). Using a single receptive field size forces a trade-off; MRF lets each stage observe patterns at several scales at once instead of committing to one.
Tap to flip back
MPD reshapes the 1D waveform into a 2D matrix by folding it at a fixed period (2, 3, 5, 7, or 11 samples), then applies 2D convolutions. When folded at period p, all samples separated by p are placed adjacent, making harmonic relationships explicit. This allows the discriminator to detect phase incoherence and harmonic distortion - cases where the generator produces the right spectral envelope but the periodicity structure is incorrect or inconsistent across cycles.
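The folding step can be illustrated without any deep-learning framework. This is a sketch under stated assumptions: `fold_by_period` is a hypothetical helper, and it zero-pads the tail for simplicity, whereas a real discriminator may pad differently (for example by reflection).

```python
def fold_by_period(waveform, period):
    """Reshape a 1D sequence into rows of length `period`.

    Column c of the result holds samples c, c + period, c + 2*period, ...
    so samples one period apart become vertical neighbours for a 2D conv.
    """
    pad = (-len(waveform)) % period
    padded = list(waveform) + [0.0] * pad  # zero padding is a simplification
    return [padded[i:i + period] for i in range(0, len(padded), period)]


samples = list(range(7))            # stand-in for a waveform
folded = fold_by_period(samples, 3)
first_column = [row[0] for row in folded]
```

Running one such fold per period (2, 3, 5, 7, 11) gives each sub-discriminator a different view of the periodic structure of the same waveform.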
Tap to flip back
- Adversarial loss (least-squares GAN formulation): pushes the generator to produce waveforms the discriminators cannot distinguish from real speech.
- Feature matching loss (L1 distance on intermediate discriminator activations): provides a dense, stable gradient signal even early in training before the adversarial signal is useful; encourages the generator to match internal representations, not just the final discriminator output.
- Mel-spectrogram reconstruction loss: directly penalises mismatch between the log mel-spectrogram of generated and ground-truth audio; anchors quality throughout training and prevents mode collapse.
Tap to flip back
The most likely cause is spectrogram distribution mismatch: the vocoder learned to invert spectrograms from its training acoustic model, and the new acoustic model produces spectrograms with different statistics (pitch range, energy distribution, silence patterns). The standard fix is fine-tuning the vocoder on spectrograms produced by the actual upstream acoustic model used in production, so the inversion is calibrated to the true distribution it will see at inference time.
Tap to flip back
V3 reduces parameter count from ~14M (V1) to ~1.5M by using fewer and narrower residual blocks in the MRF module and smaller upsampling channel widths. The MOS drops from approximately 4.4 to approximately 4.0. The trade-off is quality for deployability: V3 runs faster than real time on CPU, making it viable for on-device or low-latency server deployments where a GPU is unavailable or too expensive.
Tap to flip back
First, the STFT maps the waveform to a complex linear frequency spectrum (magnitude squared = power spectrogram). Then, a triangular mel filterbank projects that linear spectrum onto ~80 mel-spaced bins, and a log is taken. The log compression aligns dynamic range with human loudness perception; the mel scale aligns frequency resolution with pitch discrimination.
Tap to flip back
Mel spectrograms are compact (80 x T vs. tens of thousands of waveform samples), perceptually centred (the model matches what listeners hear, not raw amplitude), and they cleanly separate the acoustic model from the vocoder. This lets the two halves be trained and improved independently, which drove rapid progress after 2017.
Tap to flip back
- n_fft - controls frequency resolution; too small and each bin covers a wide band, so neighbouring spectral detail merges and fricatives blur.
- hop_length - controls temporal resolution; too large and fast transients (stop consonants) are smeared across frames.
- n_mels - controls filterbank density; below ~64 bins, sibilant energy is lost and the vocoder produces muffled consonants.
Mismatching any of these between training and inference produces subtle but audible degradation.
Tap to flip back
Phase is discarded. The mel spectrogram stores only log-magnitude energy per filter. Vocoders must reconstruct or hallucinate phase from the magnitude alone. Iterative methods (Griffin-Lim) produce audible phasiness; GAN vocoders learn implicit phase priors from data but can fail on out-of-distribution pitch or speaking style.
Tap to flip back
L1/L2 loss minimises expected error, so the model learns the mean of the distribution over plausible futures. Multiple valid mel trajectories average into blurred frames. The vocoder converts this blur into a muffled, over-smooth sound lacking fine spectral detail. Solutions include adversarial losses, flow-based decoders, or diffusion-based acoustic models.
Tap to flip back
HiFi-GAN is a GAN-based vocoder that takes a mel spectrogram as input and generates raw waveform samples via transposed convolutions and multi-period / multi-scale discriminators. It runs 167x faster than real-time on a GPU, compared to WaveNet which requires autoregressive sample-by-sample generation. High fidelity at real-time-or-faster speed made it the practical default for offline TTS.
Tap to flip back
High-frequency detail (above ~8 kHz) is lost because the mel filterbank places very wide triangular filters in that region, merging many linear-frequency bins into one mel bin. For 44.1 kHz music or expressive singing synthesis, sibilants and upper harmonics are irretrievably compressed. Mitigation strategies include using a higher n_mels, a higher-resolution linear spectrogram in that band, or representations that retain phase such as the complex spectrogram or a neural audio codec.
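How wide the top mel filters become can be estimated from the mel scale itself. The sketch below assumes the HTK-style formula mel = 2595 * log10(1 + f / 700), which is one common convention (some libraries default to a different variant), and 80 filters spanning 0 to 22,050 Hz. The helper names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_mels, f_min, f_max):
    """Edges of n_mels triangular filters, equally spaced on the mel axis.

    Filter k rises from edges[k], peaks at edges[k + 1], falls to edges[k + 2].
    """
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n_mels + 1)
    return [mel_to_hz(m_lo + k * step) for k in range(n_mels + 2)]


edges = mel_band_edges(n_mels=80, f_min=0.0, f_max=22_050.0)
widths_hz = [edges[k + 2] - edges[k] for k in range(80)]
lowest_width, highest_width = widths_hz[0], widths_hz[-1]
```

Under these assumptions the highest filter is well over ten times wider in Hz than the lowest one, which is why many linear-frequency bins at the top of the spectrum collapse into a single mel bin.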
Tap to flip back
Encoder - a convolutional network that downsamples the waveform into a dense continuous embedding sequence (e.g., 320x reduction to ~75 Hz).
Residual Vector Quantiser (RVQ) - maps each embedding to N discrete integer codes using a cascade of codebooks; each codebook refines the residual left by the previous one.
Decoder - transposed convolutions reconstruct the waveform from the summed quantised vectors.
All three are trained jointly end-to-end.
Tap to flip back
RVQ quantises an embedding in stages. The first codebook finds the nearest entry and emits its index; the error (residual) is passed to the second codebook, which quantises that error; and so on for N codebooks.
Bitrate = frame_rate × N × log2(codebook_size) bits/sec.
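Changing N changes the bitrate without retraining, provided the codec was trained to decode from any prefix of stages (for example with quantiser dropout), because each stage only refines the residual left by the earlier ones. At N=8 codebooks, 1024 entries, 75 Hz: 8 × 10 × 75 = 6000 bps = 6 kbps.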
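The bitrate formula is simple enough to verify numerically. `rvq_bitrate_bps` is a hypothetical helper name; the 75 Hz frame rate and 1024-entry codebooks are the values used on this card.

```python
import math

def rvq_bitrate_bps(frame_rate_hz, n_codebooks, codebook_size):
    """Bits per second = frames/s x codebooks x bits per codebook index."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)


bitrates = {n: rvq_bitrate_bps(75, n, 1024) for n in (2, 4, 8)}
```

Each additional 1024-entry codebook at 75 Hz adds 750 bps, so N=2, 4 and 8 correspond to 1.5, 3 and 6 kbps respectively.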
Tap to flip back
Mean-squared error on waveforms or spectrograms penalises average deviation but does not penalise perceptual artifacts like buzziness or ringing. Discriminators trained to distinguish real from reconstructed audio push the decoder to produce outputs that sound natural, not just outputs that are numerically close.
Multi-period and multi-scale discriminators capture different temporal and spectral failure modes, which is why both are typically included.
Tap to flip back
Collapse occurs when only a small fraction of codebook entries are used, because gradient updates push all embeddings to a few popular centroids. The nominal capacity (e.g., 1024 entries) is far larger than the actual effective vocabulary.
Two common mitigations:
1. EMA (exponential moving average) updates - update centroids via momentum; more stable than pure backprop gradients.
2. Random restarts - periodically reinitialise dead (unused) entries to a randomly sampled encoder output, forcing the codebook to diversify.
A commitment loss term also helps by penalising the encoder for drifting away from its nearest centroid.
Tap to flip back
VALL-E uses EnCodec to convert both the 3-second speaker prompt and the target utterance into RVQ integer token sequences. An autoregressive transformer is then conditioned on a phoneme sequence and the prompt's codec tokens, and predicts the codec tokens for the target speech.
The decoder reconstructs the waveform from the predicted tokens. The codec converts the hard problem of waveform synthesis into next-token prediction over a discrete vocabulary - a task language models are well-suited for.
Tap to flip back
The EnCodec loss balancer rescales the gradient of each loss term, taken with respect to the model output, by an exponential moving average of that gradient's norm before the terms are combined. This means every loss contributes a share of the total gradient set by its weight, regardless of the loss's absolute scale.
Without it, a suddenly large adversarial loss can dominate and destabilise training. The balancer decouples hyperparameter choices (loss weights) from loss magnitude, making the training more robust to architectural changes and easier to reproduce.
Tap to flip back
Below roughly 3 kbps (corresponding to about N=4 codebooks of 1024 entries at 75 Hz), perceptual quality degrades noticeably. Typical artifacts:
- Fricative/sibilant smearing - high-frequency transients like "s" and "sh" lose sharpness.
- Pitch inaccuracy on music - insufficient codes to capture fine spectral peaks.
- Phase artifacts - comb-filtering or hollowness on sustained tones.
The degradation is usually "blurry" rather than the block-artifact character of traditional codecs, but the floor is still real and limits the quality ceiling of any downstream codec language model.
Tap to flip back
A single codebook cannot represent a continuous high-dimensional audio embedding faithfully - reconstruction error is too large. RVQ chains N codebooks so each stage quantises the residual error of the previous stage, achieving expressiveness equivalent to K^N combinations while transmitting only N * log2(K) bits per frame.
Tap to flip back
r_0 = z_e
q_1 = nearest(r_0, C_1)
r_1 = r_0 - q_1
q_2 = nearest(r_1, C_2)
r_2 = r_1 - q_2
q_3 = nearest(r_2, C_3)
z_q = q_1 + q_2 + q_3
The final quantised embedding is the sum of all per-stage lookups; the decoder sees z_q, not the residuals.
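The three-stage recursion above can be written as a loop over codebooks. This is a toy sketch in plain Python: `nearest` and `rvq_encode` are hypothetical names, and the tiny hand-written codebooks stand in for learned ones.

```python
def nearest(vector, codebook):
    """Return (index, entry) of the codebook entry closest to `vector`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))
    return idx, codebook[idx]

def rvq_encode(z_e, codebooks):
    """Quantise z_e stage by stage; each stage sees the previous residual."""
    residual = list(z_e)
    z_q = [0.0] * len(z_e)
    indices = []
    for codebook in codebooks:
        idx, q = nearest(residual, codebook)
        indices.append(idx)                                  # transmitted code
        z_q = [a + b for a, b in zip(z_q, q)]                # running sum
        residual = [r - c for r, c in zip(residual, q)]      # passed onward
    return indices, z_q, residual


# Toy codebooks (assumed values, chosen so the arithmetic is exact).
C1 = [[0.0, 0.0], [1.0, 1.0]]
C2 = [[0.0, 0.0], [0.5, -0.5]]
C3 = [[0.0, 0.0], [0.0, 0.25]]
indices, z_q, residual = rvq_encode([1.5, 0.75], [C1, C2, C3])
```

Only `indices` would be transmitted; the decoder rebuilds `z_q` by looking up and summing the same entries. Truncating the codebook list gives a coarser `z_q` with a larger leftover residual.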
Tap to flip back
SoundStream uses structured dropout during training: it randomly discards codebook suffix stages (transmitting only prefix stages of length 1 to N). The decoder learns to reconstruct from any prefix subset. At inference, transmitting fewer stages directly reduces bitrate - no retraining required.
Tap to flip back
Codebook collapse occurs when most codebook entries are never assigned - a few attract nearly all inputs, so the effective bitrate drops far below the nominal value. Two counteracting techniques:
1. EMA updates - codebook vectors track the running mean of assigned encoder outputs, which is more stable than gradient descent directly on entries.
2. Re-initialisation - low-usage entries are periodically reset to random encoder outputs, forcing them back into the data distribution.
Tap to flip back
bits/s = frames/s × stages × log2(codebook_size)
= 75 × 8 × log2(1024)
= 75 × 8 × 10
= 6000 bps (6 kbps)
Halving the number of stages to 4 gives 3 kbps - the operating point where SoundStream outperforms Opus at 12 kbps.
Tap to flip back
Stage 1 codes capture coarse spectral shape and speaker identity (the perceptually dominant information), while stages 2-N correct progressively finer detail. VALL-E models stage 1 autoregressively to get prosody and speaker right, then predicts stages 2-N in parallel (non-autoregressively) since their content is conditional fine-grained correction rather than sequentially dependent structure. This cuts inference steps dramatically versus fully autoregressive modelling of all stages.
Tap to flip back
Codebook entries cluster around activations typical of speech during training. Music and environmental sounds activate different regions of the embedding space. When out-of-distribution inputs arrive, the nearest-neighbour lookup returns a distant codebook entry, producing a large residual. Even with N stages, the residual at each level remains larger than in-distribution, so the summed reconstruction z_q diverges from the true embedding more than nominal bitrate analysis would predict.
Tap to flip back
RVQ maps a continuous audio embedding to a sequence of discrete codes, one from each of several successive codebooks. The first codebook captures the coarse spectral envelope; each subsequent codebook quantises the residual error from all previous ones. This means level 1 alone gives a recognisable but rough reconstruction, while levels 2-8 progressively add fine-grained perceptual detail. The ordering is exploited by models like Bark, which predict coarse levels first (autoregressive) and fine levels second (parallel or faster), enabling a coarse-to-fine generation hierarchy.
Tap to flip back
Stage 1 predicts semantic tokens (discretised activations of a self-supervised model such as w2v-BERT) - these carry prosody and linguistic structure but not fine acoustic texture. Stage 2 predicts coarse RVQ tokens (levels 1-2) conditioned on the semantics. Stage 3 predicts fine RVQ tokens (levels 3-8) conditioned on the coarse tokens. Collapsing all three into one model would require attending over very long mixed-type sequences where semantic and perceptual information compete; staging keeps each model's task narrow and tractable.
Tap to flip back
They are treated as ordinary text tokens fed into the text-to-semantic transformer. The model has seen enough training examples pairing these tokens with the corresponding acoustic events that it generates appropriate codec token sequences when it encounters them. There is no separate module; it is purely learned behaviour from the language modelling objective over text-audio pairs.
Tap to flip back
Classical pipeline TTS is largely deterministic: given text and a speaker embedding, it produces the same mel-spectrogram and waveform every run, and failures are systematic (e.g., attention skips causing repeated or missing words). A fully generative model samples from a learned distribution - two calls with the same input produce different outputs. Failures are stochastic and unpredictable: a generation may drop syllables, shift speaker register, or produce incorrect prosody in ways that vary across runs and cannot be caught by a fixed post-processing rule.
Tap to flip back
Two structural limitations: (1) Bark generates at most ~13 seconds per pass; longer text must be chunked and stitched, introducing audible discontinuities in background noise and prosody. (2) Speaker identity is anchored only by a short voice preset in context, not a persistent embedding - voice characteristics drift across chunks and can shift within a single clip. Production systems needing deterministic, identity-stable output across arbitrary-length documents require an explicit speaker embedding and a stitching mechanism Bark does not provide.
Tap to flip back
VALL-E's specific contribution is zero-shot voice cloning from a 3-second speaker prompt. Given a transcript and a 3-second reference clip of an unseen speaker, it synthesises speech that matches that speaker's voice characteristics, trained at scale on 60,000 hours. Bark is aimed at broader audio generation (speech, music, nonverbal sounds, multilingual output) with loose voice preset matching; it does not claim identity-preserving zero-shot cloning as a verified capability.
Tap to flip back
MOS averages perceptual ratings across samples, collapsing variance. A model that produces one excellent sample and one completely broken sample scores identically to a model producing two mediocre samples. For a stochastic generative model, the variance across samples, the tail failure rate, and word error rate (which measures how often text is reproduced correctly) are all independently important metrics. Reporting only MOS without these obscures the practical unreliability of high-variance generators.
Tap to flip back
- Pitch (F0) - intonation, question vs. statement, emotion.
- Duration - rhythm, stress, emphasis (syllable timing).
- Energy - loudness and sentence-level stress.
- Voice quality (spectral tilt, jitter) - breathiness, creakiness, perceived age.
These operate above the phoneme level; the same sequence of phonemes can sound like a command or a question depending on how these four dimensions are set.
Tap to flip back
The duration predictor outputs an integer frame count per phoneme. The length regulator copies each encoder output that many times before passing the expanded sequence to the mel-spectrogram decoder.
Scaling every duration by a constant factor uniformly speeds up or slows down speech. Scaling selected phonemes targets contrastive stress on individual words. This works at inference with no retraining.
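A length regulator is essentially a repeat operation. The sketch below is illustrative only: `length_regulate` and its `scales` argument are hypothetical, and strings stand in for encoder output vectors.

```python
def length_regulate(encoder_outputs, durations, scales=None):
    """Repeat each encoder output `duration * scale` times (rounded).

    `scales` is an optional per-phoneme multiplier; a single shared value
    changes overall speed, a per-phoneme value lengthens selected phonemes.
    """
    if scales is None:
        scales = [1.0] * len(durations)
    expanded = []
    for vec, dur, scale in zip(encoder_outputs, durations, scales):
        frames = max(0, int(dur * scale + 0.5))  # round half up
        expanded.extend([vec] * frames)
    return expanded


phonemes = ["h", "e", "y"]
durations = [2, 3, 1]
normal = length_regulate(phonemes, durations)
slower = length_regulate(phonemes, durations, scales=[2.0, 2.0, 2.0])
stressed = length_regulate(phonemes, durations, scales=[1.0, 2.0, 1.0])
```

Because the scaling happens after the duration predictor and before the decoder, it needs no retraining; the decoder simply receives a longer or shorter frame sequence.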
Tap to flip back
A GST system adds a reference encoder (small convolutional stack) that compresses a reference audio clip into a query vector. That query attends over a small bank of learnable style token embeddings. The resulting weighted sum is a style embedding that conditions the TTS acoustic model.
At inference: feed a reference audio through the encoder to copy its style, or set the attention weights directly to interpolate between discovered style axes - without any explicit style labels during training.
Tap to flip back
Speech is one-to-many: the same text can be spoken with many different rhythms. A deterministic predictor collapses this distribution to a single mean output, producing mechanical consistency. VITS samples duration latents from a learned distribution, so each forward pass can yield a different but plausible rhythm - closer to natural human variation.
Tap to flip back
- Style-speaker entanglement. A "fast" clip from a child and a "fast" clip from a deep-voiced adult produce different style embeddings; the encoder cannot separate speaking rate from voice identity. Disentanglement is partial.
- Domain collapse. Style tokens are only meaningful within the training distribution. A model trained on audiobooks develops audiobook-specific style axes; applied to conversational speech, token assignments become unreliable.
Tap to flip back
Most systems predict prosody phoneme-by-phoneme without any discourse-level representation. They have no encoding of paragraph structure, information status (given vs. new), or contrast across sentences. Prosody within a sentence can be good, but across a paragraph the system has no signal to vary intonation - so consecutive sentences converge to a neutral plateau. Humans use discourse context to sustain prosodic variation; the TTS model does not.
Tap to flip back
MOS measures overall perceived naturalness; it is a poor proxy for prosodic appropriateness or communicative effectiveness. A voice can score 4.3/5 on MOS while systematically sounding flat or indifferent in passages that should be urgent or warm. Automated acoustic metrics (F0 RMSE, duration error) measure fidelity to a reference signal, not whether the prosody fits the communicative intent of the text.
Tap to flip back
TTFA is the wall-clock time from the start of synthesis to the moment the first decoded audio sample reaches the playback device. It is the metric users directly perceive: silence longer than roughly 200-300 ms reads as a broken system. Total synthesis time and real-time factor matter for sustained playback, but TTFA governs whether the product feels responsive at all.
Tap to flip back
RTF = synthesis time / audio duration. An RTF of 0.5 means the model produces 1 second of audio in 0.5 seconds of compute. For gapless streaming, RTF must stay below 1.0 throughout the session; if it exceeds 1.0 the system generates audio slower than it is consumed, the playback buffer empties, and the listener hears dropout artefacts.
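The link between RTF and buffer underrun can be sketched with a small simulation. It assumes chunks are synthesised one after another and playback starts as soon as the first chunk is ready; `real_time_factor` and `first_underrun` are hypothetical helper names.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    return synthesis_seconds / audio_seconds

def first_underrun(chunk_audio_s, chunk_compute_s):
    """Index of the first chunk that is not ready when playback needs it.

    Returns None if every chunk arrives in time.
    """
    ready_at = 0.0          # wall-clock time when the current chunk is done
    playback_start = None   # set when the first chunk is ready
    audio_queued = 0.0      # seconds of audio handed to the player so far
    for k, (audio, compute) in enumerate(zip(chunk_audio_s, chunk_compute_s)):
        ready_at += compute
        if playback_start is None:
            playback_start = ready_at
        elif ready_at > playback_start + audio_queued:
            return k        # the buffer emptied before this chunk arrived
        audio_queued += audio
    return None


healthy = first_underrun([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
overloaded = first_underrun([1.0, 1.0, 1.0], [0.5, 0.5, 2.0])
```

In the second call the per-chunk RTF jumps to 2.0 on the third chunk, and the surplus built up by the earlier fast chunks is not enough to cover it, so playback stalls there.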
Tap to flip back
When an LLM streams reply tokens, the TTS engine can start synthesising the first sentence while the LLM is still generating the second. Splitting on even shorter units (clauses, phrases) reduces TTFA further. The cost is prosodic coherence: a phrase spoken without knowing what follows may receive the wrong intonation contour - rising pitch where the utterance should fall, or inconsistent speaking rate across chunk boundaries.
Tap to flip back
These models generate discrete codec tokens one-by-one, each conditioning on all previous tokens. At roughly 75 tokens/second for a 10-second utterance, that is 750 sequential generation steps before a full utterance is available. Streaming requires committing to a chunk size in tokens, then immediately decoding those tokens to a waveform while generation continues - which means the model has no look-ahead when making prosodic decisions for the current chunk.
Tap to flip back
The model decodes a chunk that is slightly longer than the audio it will emit (e.g. 20% look-ahead frames). The overlapping tail frames are discarded before playback. Because the acoustic model can see a short future context when computing the boundary frames, pitch and duration decisions at the seam are more consistent with what follows. The cost is a fixed additional latency equal to the overlap duration.
Tap to flip back
Non-autoregressive models (e.g. FastSpeech 2) generate the full spectrogram in a single parallel pass, so no partial output is available until the entire utterance is processed. For short phrases this is fast enough that streaming is unnecessary. For long utterances, total synthesis time still scales with length and all durations must be predicted before any frame is emitted. The production workaround is to segment long text into short phrases and synthesise each phrase independently, using phrase-level TTFA rather than utterance-level TTFA.
Tap to flip back
- Sentence-final intonation errors. A model committing to the final chunk's pitch contour before knowing it is the last chunk generates mid-utterance rising intonation on what is actually the closing phrase, sounding uncertain or incomplete.
- Buffer underrun under load. High GPU utilisation across concurrent sessions can push RTF above 1.0, draining the playback buffer and causing audible dropouts even if RTF was nominally healthy at low load.
- KV-cache state fragility. Streaming codec-LM inference maintains a KV-cache across chunk boundaries; a client reconnect or network interruption invalidates that state, forcing a full restart with no clean recovery path.
Tap to flip back
A MOS of 4.0 ("Good") means the average listener found the speech slightly different from natural but not annoying. It says nothing about speaker identity accuracy, prosodic appropriateness for the context, intelligibility for non-native listeners, or how the system performs on longer utterances. Naturalness on a short sentence is only one axis of TTS quality.
Tap to flip back
MOS is a relative, context-sensitive judgement. Without shared anchors (a reference and a degraded sample), different listener pools, payment levels, and instructions shift the implicit scale. Chiang et al. (2023) showed that changing only the evaluation context reversed the ranking of three established TTS systems. Two independent MOS scores from different labs are not directly comparable; CMOS paired tests or tightly matched conditions are needed for valid comparisons.
Tap to flip back
CMOS (Comparative MOS) asks listeners to rate the difference between a pair of utterances on a -3 to +3 scale rather than rating each utterance in isolation. Use it when comparing two specific systems: it controls for inter-session scale drift, requires fewer listeners to detect the same effect size, and is more sensitive to small differences (a 0.1-0.2 CMOS advantage is conventionally meaningful). Use standard MOS when you need an absolute quality estimate or are evaluating more than two systems without natural pairing.
Tap to flip back
Modern neural TTS systems cluster in the 3.5-4.5 MOS range, compressing the score distribution. Automatic MOS predictors trained on older corpora that include vocoded and concatenative speech learned to discriminate across a much wider range; they perform poorly on this narrow high-quality band. The SOMOS dataset (Maniati et al., 2022) was created to study this: 20,000 MOS-labelled utterances from 200 neural TTS systems on a single voice, revealing that state-of-the-art predictors fail significantly on this modern, compressed distribution.
Tap to flip back
UTMOS fine-tunes a self-supervised speech representation model on human MOS labels. Frame-level SSL features are pooled with attention to produce an utterance embedding, which is then regressed to a scalar score with auxiliary listener-ID tasks. It achieves Pearson r > 0.94 on the BVCC benchmark (in-distribution). The limitation: strong in-distribution correlation does not generalise to out-of-distribution systems. On modern neural TTS corpora like SOMOS, predictors trained on older mixed data systematically mis-rank systems.
Tap to flip back
- Anchors: including a natural-speech reference and a degraded anchor calibrates listeners and reduces variance; omitting them lets the implicit scale drift freely.
- Listener pool: lab-recruited native speakers vs. crowdsourced workers produce different baselines; mixing them makes scores incomparable.
- Utterance length and variety: short, simple sentences inflate scores relative to paragraph-length or prosodically complex stimuli; the test set must cover the deployment domain.
Tap to flip back
ITU-T P.800 (1996) formalises the MOS scale (1 to 5) and methods for subjective determination of transmission quality, originally for telephone networks. TTS research adopted the same scale and terminology. It matters because it gives the field a shared vocabulary, but the original protocol was designed for degraded telephony speech, not for discriminating between modern neural TTS systems that all sound near-natural. Applying P.800 to TTS requires additional adaptations (anchors, stimuli selection, rater qualification) that the original standard does not specify.
Tap to flip back
Adjacent PCM samples are highly correlated at millisecond scales; the discriminative information (phoneme identity, formant structure) lives in frequency patterns over 20-30 ms windows. Raw samples waste model capacity on redundant low-level correlations. The standard solution is a log-mel spectrogram, which summarises each 25 ms frame as 80 filterbank energies.
Tap to flip back
CTC solves the alignment problem: given T acoustic frames and S output tokens (S << T), the training labels do not specify which frame corresponds to which token. CTC marginalises over all valid frame-level paths that collapse to the target string via a blank token. The key assumption it makes is conditional independence: each frame's output depends only on the encoder state at that frame, not on previously emitted tokens. This assumption is what limits CTC's internal language modelling capability.
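The collapse rule that defines a valid path (merge repeated symbols, then drop blanks) fits in a few lines. This is a minimal sketch: `ctc_collapse` and `count_valid_paths` are hypothetical names, `_` stands in for the blank token, and the brute-force count is only practical for toy sizes.

```python
from itertools import product

def ctc_collapse(path, blank="_"):
    """Merge consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

def count_valid_paths(target, n_frames, alphabet, blank="_"):
    """Brute-force count of frame-level paths that collapse to `target`."""
    symbols = list(alphabet) + [blank]
    return sum(
        1 for path in product(symbols, repeat=n_frames)
        if ctc_collapse(path, blank) == target
    )


with_blank = ctc_collapse("hh_e_ll_lo")   # the blank keeps the double "l"
without_blank = ctc_collapse("hello")     # adjacent repeats merge
n_paths = count_valid_paths("ab", n_frames=3, alphabet="ab")
```

The CTC loss sums the probability of every such path; real implementations do this with a forward-backward recursion instead of enumeration.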
Tap to flip back
RNN-T adds a prediction network (a small recurrent network) that reads the last emitted non-blank token and produces a context vector. A joint network combines this context with the encoder state at each step. This gives RNN-T an internal language model, improving performance on rare and long-tail words. Practically, RNN-T still allows frame-by-frame streaming emission (unlike attention encoder-decoder models), which is why it dominates production streaming ASR systems.
Tap to flip back
The Conformer interleaves multi-head self-attention and a depthwise separable convolution module (along with two feed-forward modules) within each block. Self-attention captures long-range temporal dependencies across hundreds of frames; convolution captures local spectral patterns such as formant transitions and stop bursts. Speech signals have both long-range prosodic structure and fine-grained local acoustic events, so combining both inductive biases in one encoder consistently outperforms either alone.
Tap to flip back
Whisper was trained on 680,000 hours of diverse internet audio across 99 languages with noisy, weakly supervised transcripts. This breadth gives it robustness to accents, microphone variation, background noise, and domain shift that narrowly fine-tuned supervised models lack. On real-world audio (podcasts, medical dictation, accented speech) it often outperforms supervised models that were optimised only on clean read speech benchmarks.
Tap to flip back
- Hallucination on silence or music: the autoregressive decoder, conditioned on its own outputs, can generate plausible-sounding text even when there is no speech, or loop into repetition. Voice activity detection gating is the standard mitigation.
- Out-of-vocabulary proper nouns: the decoder predicts from learned distributions over characters or subwords; rare names and product terms appear too infrequently to be reliably predicted. Contextual biasing (injecting a hotword list at inference) helps but adds complexity.
Tap to flip back
A log-mel spectrogram is a 2-D acoustic feature matrix produced by: (1) framing the waveform with a sliding window, (2) computing FFT power per frame, (3) projecting onto a mel-scale filterbank, and (4) taking the log. Typical hyperparameters: 25 ms window, 10 ms hop (so 100 frames per second), 80 mel filters. The output shape is (T, 80) where T is the number of frames. The mel scale approximates human frequency resolution, concentrating more filters at low frequencies where phonetic distinctions are denser.
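The frame arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes 16 kHz audio and no padding at the signal edges; `num_frames` is a hypothetical helper, and real front-ends often pad, which shifts the count by a frame or two.

```python
def num_frames(num_samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Number of full analysis frames when the signal is not padded."""
    win = sample_rate * win_ms // 1000   # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000   # 160 samples at 16 kHz
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop


# Ten seconds of 16 kHz audio gives close to 100 frames per second.
frames = num_frames(10 * 16000)
shape = (frames, 80)   # (T, 80) log-mel matrix
```

The count falls slightly short of 1,000 because the last partial window is dropped under the no-padding assumption.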
Tap to flip back
The STFT applies a DFT to successive overlapping windows of the waveform, producing a 2-D array of complex values with time on one axis and frequency on the other. A full DFT gives a global frequency summary with no time resolution. Because speech phonemes change on 20-100 ms timescales, the STFT's local analysis windows are essential for capturing when each frequency is present.
Tap to flip back
The mel scale maps linear frequency to a perceptual scale: mel(f) = 2595 * log10(1 + f/700). It is denser at low frequencies (where speech intelligibility lives) and sparser at high frequencies. Applying mel filter banks collapses 257 linear FFT bins down to 80 mel bins, matching the frequency resolution to human auditory sensitivity and discarding perceptually redundant information.
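The formula can be turned into a small sketch showing how evenly spaced mel points spread out in Hz. The helper names and the 0-8 kHz range are assumptions for illustration; filterbank implementations differ in edge handling and normalisation.

```python
import math


def hz_to_mel(f_hz):
    """mel(f) = 2595 * log10(1 + f / 700), as on the card."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_mels=80, f_min=0.0, f_max=8000.0):
    """Centre frequencies (Hz) of n_mels filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


centers = mel_center_frequencies()
low_spacing = centers[1] - centers[0]      # gap between the two lowest filters
high_spacing = centers[-1] - centers[-2]   # gap between the two highest filters
```

Comparing `low_spacing` with `high_spacing` shows the dense-at-low, sparse-at-high layout numerically.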
Tap to flip back
Three reasons: (1) perceived loudness is logarithmic, so log energies align with human perception; (2) log compression reduces dynamic range, which stabilises gradients during training; (3) channel or microphone effects that multiply the spectrum become additive after the log, making features easier to normalise away. This is why log mel filter banks are the default front-end for deep-learning ASR.
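Reason (3) can be illustrated with a toy example: a constant channel gain multiplies every energy, becomes a constant offset after the log, and disappears under mean normalisation. The energies and gain below are made-up values, and the helper names are hypothetical.

```python
import math


def log_features(energies):
    """Natural log of each filter-bank energy."""
    return [math.log(e) for e in energies]


def mean_normalise(values):
    """Subtract the mean over time (per-bin mean normalisation)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


energies = [0.5, 2.0, 8.0, 1.0]   # made-up energies of one mel bin over four frames
gain = 3.7                         # made-up constant channel/microphone gain

clean = mean_normalise(log_features(energies))
coloured = mean_normalise(log_features([gain * e for e in energies]))
max_difference = max(abs(a - b) for a, b in zip(clean, coloured))
```

`max_difference` is zero up to floating-point error, because log(g * x) = log(g) + log(x) and the constant log(g) is removed with the mean.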
Tap to flip back
MFCCs apply a DCT to the log mel filter-bank energies, keeping the first K coefficients (typically 13) plus their delta and delta-delta derivatives (total 39 features). The DCT decorrelates the filter-bank channels, which was important for diagonal-covariance GMMs. For deep-learning models, log mel filter banks are generally preferred because neural networks learn their own decorrelation and the DCT discards information the model could use.
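The DCT step can be sketched directly. This is an unnormalised DCT-II, which is an assumption: toolkits apply different scaling conventions, so the absolute values will not match any particular library.

```python
import math


def dct_ii(x, num_coeffs):
    """First num_coeffs coefficients of an unnormalised DCT-II of x."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
        for k in range(num_coeffs)
    ]


# A flat log-mel frame carries all its energy in coefficient 0.
flat_frame = [1.0] * 80
mfcc_like = dct_ii(flat_frame, 13)
```

Keeping only 13 of 80 coefficients is the information loss the card refers to: the higher-order terms that describe fine spectral detail are thrown away.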
Tap to flip back
SpecAugment applies: (1) time warping - local stretch/compression along the time axis; (2) frequency masking - zeroing F consecutive mel bins; (3) time masking - zeroing T consecutive time steps. By forcing the model to reconstruct masked regions from context, these augmentations reduce over-reliance on narrow frequency cues and improve generalisation. Park et al. (2019) halved WER on LibriSpeech test-other without any language model.
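The two masking operations can be sketched on a plain list-of-lists spectrogram. This is a simplified illustration that applies one mask of each kind and omits time warping; `mask_spectrogram` and its parameters are hypothetical names, not an API from the paper.

```python
import random


def mask_spectrogram(spec, max_f, max_t, rng):
    """Return a copy of spec (T frames x n_mels bins) with one frequency mask
    and one time mask applied. Assumes max_f <= n_mels and max_t <= T."""
    num_frames, num_bins = len(spec), len(spec[0])
    out = [row[:] for row in spec]

    f = rng.randint(0, max_f)              # mask width in mel bins
    f0 = rng.randint(0, num_bins - f)      # first masked bin
    t = rng.randint(0, max_t)              # mask length in frames
    t0 = rng.randint(0, num_frames - t)    # first masked frame

    for row in out:                        # frequency mask: zero a band in every frame
        for j in range(f0, f0 + f):
            row[j] = 0.0
    for i in range(t0, t0 + t):            # time mask: zero whole frames
        out[i] = [0.0] * num_bins
    return out


augmented = mask_spectrogram([[1.0] * 8 for _ in range(20)], 3, 5, random.Random(0))
```

Masks are redrawn for every training example, so the model never sees the same missing region twice.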
Tap to flip back
The mel filter banks are computed over 0-8 kHz (Nyquist of 16 kHz). Upsampled 8 kHz audio contains no energy above 4 kHz (its original Nyquist), so every mel bin centred above 4 kHz is always near-zero at inference. Because the mel scale compresses high frequencies, that is roughly the top quarter of the 80 bins rather than half of them. The model was trained expecting energy distributed across all 80 bins; the systematic silence in the upper bins looks like a completely different acoustic environment and typically causes large WER degradation.
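How much of the filterbank goes silent can be estimated from the mel formula mel(f) = 2595 * log10(1 + f/700). The sketch below assumes filters spaced evenly on that scale between 0 and 8 kHz; the helper names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def fraction_of_mel_range_above(cutoff_hz, f_max_hz=8000.0):
    """Share of the 0..f_max mel range that lies above cutoff_hz."""
    return 1.0 - hz_to_mel(cutoff_hz) / hz_to_mel(f_max_hz)


empty_share = fraction_of_mel_range_above(4000.0)
approx_empty_bins = round(80 * empty_share)   # bins with no energy in upsampled 8 kHz audio
```

The share is well below one half because the mel scale allocates fewer filters per kHz at high frequencies.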
Tap to flip back
Shortening the analysis window improves temporal resolution (it tracks fast phonemic transitions better) but worsens frequency resolution (bins become wider). Lengthening it does the opposite. The relation is roughly delta_t * delta_f >= 1/(4*pi). A 25 ms Hann window gives a frequency resolution of about 1 / 0.025 s = 40 Hz, fine enough to distinguish the harmonics of voiced speech, while still being shorter than most phonemes. This is an empirically validated compromise, not a theoretically optimal one.
Tap to flip back
A 10 ms hop in the STFT pipeline means 100 frames per second, regardless of how fast the speaker is talking. Phonemes and silences alike consume frames. The transcript only records spoken tokens, so it is always far shorter than the frame sequence. This mismatch is the root cause of alignment difficulty in ASR.
Tap to flip back
CTC requires T >= L. Each output token must be emitted at a distinct frame (or surrounded by blank frames), so you cannot decode more tokens than you have input frames. In practice this is almost always satisfied for audio, but it becomes binding when aggressive temporal subsampling reduces T drastically.
Tap to flip back
CTC assumes conditional independence across output positions: the probability at frame t does not depend on what was emitted at earlier frames. RNN-T adds a prediction network that conditions each token on the previously emitted tokens, making the output distribution autoregressive. This internal language model improves accuracy but requires dynamic programming over a T x L lattice (frames by label positions) instead of just T frames.
Tap to flip back
Speech has structure at multiple timescales: fine phonetic detail spans a few frames (local), while word identity and prosody span hundreds of frames (global). Self-attention covers global context but is weak at local patterns; depthwise convolution with a ~31-frame kernel covers local context cheaply. Stacking both in each Conformer block lets the model handle both ranges without increasing sequence length or parameter count excessively.
Tap to flip back
Long recordings must be split into 30-second chunks and stitched using predicted timestamps. Attention context resets at each boundary, so discourse-level context (speaker names, topic continuity) is lost. Timestamp prediction errors accumulate, and boundary segments are especially prone to hallucination or mis-transcription in noisy or music-heavy audio.
Tap to flip back
8x subsampling reduces ~100 frames per second to ~12 encoder frames per second. CTC cannot emit more than 12 tokens in that second (T >= L), so character-level output with more than 12 characters per second becomes impossible. Practitioners must switch to subword or word-piece units, or reduce the subsampling factor, to avoid hitting this constraint on fast speech.
Tap to flip back
Each look-ahead frame adds 10 ms of latency (one hop interval). A model that looks ahead N frames before emitting a token incurs N x 10 ms of end-of-utterance latency. This creates a fundamental accuracy-latency trade-off: wider context improves recognition of phonetically ambiguous endings, but every additional frame delays the response. No architecture eliminates this trade-off; it can only be shifted.
Tap to flip back
Cross-entropy requires a label for every input frame, which means you need a pre-existing frame-level alignment (e.g. from a bootstrapping HMM). CTC removes that requirement by summing the probability over all frame sequences that collapse to the correct transcript, letting the model learn the alignment implicitly.
Tap to flip back
First, collapse repeated labels into one. Then remove all blank tokens. So - c c - a - t t - becomes cat. This many-to-one mapping is what allows the network to emit any valid alignment as long as it decodes to the correct target.
Tap to flip back
O(T * L) - linear in both input length T and target length L. Enumerating all valid alignments explicitly would be exponential. The dynamic programming recursion (analogous to the HMM forward-backward pass) makes CTC tractable on sequences of hundreds of frames.
Tap to flip back
CTC factorises the output distribution as a product of per-frame probabilities: p(y_1, ..., y_T | x) = prod_t p(y_t | x). Each frame's label is independent of every other frame's label given the input. The limitation: the model cannot learn that "q" tends to be followed by "u". Language modelling must be done externally (n-gram or neural LM) at decode time.
Tap to flip back
CTC needs at least one distinct frame per output character (including blanks between repeated characters). If the encoder uses heavy convolutional subsampling (e.g. 8x stride), short utterances with many characters can produce fewer frames than labels, making the CTC loss undefined. The fix is to either reduce subsampling or filter out such utterances during training.
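A training-time filter for this condition can be sketched as follows. It assumes the encoder length is simply the input length divided by the subsampling factor (real convolutional front-ends differ by a frame or two), and both helper names are hypothetical.

```python
def ctc_min_frames(labels):
    """Fewest frames that can emit labels: one per label plus one blank
    between each pair of adjacent identical labels."""
    repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return len(labels) + repeats


def ctc_feasible(num_input_frames, labels, subsampling=8):
    """True if the subsampled encoder output is long enough for the CTC loss."""
    encoder_frames = num_input_frames // subsampling
    return encoder_frames >= ctc_min_frames(labels)


# One second of audio (100 frames) leaves 12 encoder frames after 8x subsampling.
keep_short = ctc_feasible(100, list("hello"))
keep_long = ctc_feasible(100, list("abcdefghijklm"))   # 13 labels
```

Utterances for which `ctc_feasible` is false would be dropped, or the subsampling factor reduced.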
Tap to flip back
The encoder learns to emit high-probability blanks on most frames and sharp spikes on a small number of frames near phoneme onsets. This makes greedy decoding (argmax + collapse) fast and reliable under clean conditions, but the peaky distribution is fragile: insertions or deletions from noise or domain mismatch are not "softened" by any output history, because there is no recurrence in the CTC output distribution.
Tap to flip back
The CTC auxiliary loss imposes an alignment signal on the shared encoder during training. It pushes encoder representations to be phonetically grounded and monotonically structured, which regularises the attention decoder and speeds convergence. At inference the decoder can run alone, benefiting from the better encoder without paying the CTC decoding cost. Work on joint CTC-attention training (e.g. Kim et al., 2017) reports WER improvements with a small CTC weight; values around 0.1 to 0.3 are common.
Tap to flip back
CTC outputs one symbol per input frame, but the output length rarely equals the label length. The blank token fills two roles: it absorbs frames that carry no new label information (silence, transitions), and it separates consecutive identical characters. Without blank, ll would collapse to l with no way to distinguish them.
Tap to flip back
- Merge consecutive identical symbols into one.
- Remove all blank tokens.
Example: _ h h _ e e l _ l l o _ becomes hello. The order of the two steps matters: merging first, then removing blanks.
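The two steps, and why their order matters, can be sketched in a few lines. The function names are hypothetical, and `_` stands in for the blank token as in the example.

```python
from itertools import groupby


def ctc_collapse(path, blank="_"):
    """Merge consecutive identical symbols, then remove blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return [s for s in merged if s != blank]


def wrong_order(path, blank="_"):
    """Remove blanks first, then merge: this destroys genuine double letters."""
    no_blank = [s for s in path if s != blank]
    return [symbol for symbol, _ in groupby(no_blank)]


correct = "".join(ctc_collapse("_hh_eel_llo_"))
broken = "".join(wrong_order("_hh_eel_llo_"))
```

With the steps swapped, the blank that separated the two `l` runs is gone before merging, so the double letter collapses.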
Tap to flip back
alpha(s, t) is the total probability of all CTC paths through the first t frames that produce the first s symbols of the label sequence (in the blank-interleaved extended label sequence). The recurrence accumulates paths arriving from the same symbol or from a blank transition.
Time and space complexity: O(T * L) where T is the number of frames and L is the label length. This is efficient enough for most utterance lengths and directly analogous to the HMM forward algorithm.
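A plain-Python sketch of the forward recursion is given below, with an exponential brute-force enumerator for checking it on tiny inputs. It works in probability space for readability, which is an assumption made for illustration; practical implementations work in log space to avoid underflow. Function names are hypothetical.

```python
from itertools import groupby, product


def ctc_forward(probs, labels, blank=0):
    """Total probability of all paths that collapse to labels.
    probs[t][k] is the probability of symbol k at frame t."""
    ext = [blank]
    for label in labels:              # blank-interleaved extended sequence
        ext += [label, blank]
    num_frames, size = len(probs), len(ext)

    alpha = [[0.0] * size for _ in range(num_frames)]
    alpha[0][0] = probs[0][blank]
    if size > 1:
        alpha[0][1] = probs[0][ext[1]]

    for t in range(1, num_frames):
        for s in range(size):
            total = alpha[t - 1][s]                   # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1][s - 1]          # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1][s - 2]          # skip the blank between different labels
            alpha[t][s] = total * probs[t][ext[s]]

    result = alpha[-1][-1]            # end on the final blank ...
    if size > 1:
        result += alpha[-1][-2]       # ... or on the final label
    return result


def ctc_brute_force(probs, labels, blank=0):
    """Exponential reference: enumerate every path and sum the matching ones."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        collapsed = [s for s, _ in groupby(path) if s != blank]
        if collapsed == list(labels):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total


probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.25, 0.25]]
```

The skip transition is disallowed between identical labels, which is exactly where the blank is mandatory.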
Tap to flip back
CTC is monotonic: labels must be emitted in left-to-right order. A blank run can extend, but the model cannot jump backward. This is ideal for streaming because you never need future context to commit to a prefix - you can emit a character and never retract it. The downside is that tasks requiring reordering (e.g., cross-lingual transliteration) cannot be modelled with CTC alone.
Tap to flip back
Because two distinct path sets produce the same extended prefix:
- The new frame emits c directly (continuing a run of c that will collapse to one c).
- The new frame emits blank, then c (inserting a fresh c after a pause).
Conflating the two overcounts probability. Beam search keeps a "ends in blank" and "ends in label" split for each prefix to correctly sum the two contributions.
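The blank / non-blank split can be sketched as a small prefix beam search. This version omits any language model and works in probability space, both simplifications made for illustration; `prefix_beam_search` is a hypothetical name.

```python
from collections import defaultdict


def prefix_beam_search(probs, beam_size=8, blank=0):
    """Return {prefix: probability} for the surviving prefixes.
    Each prefix tracks (p_blank, p_non_blank): probability mass of paths
    that end in blank versus paths that end in the prefix's last label."""
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for c, p in enumerate(frame):
                if c == blank:
                    nxt[prefix][0] += (p_b + p_nb) * p        # prefix unchanged, now ends in blank
                elif prefix and c == prefix[-1]:
                    nxt[prefix][1] += p_nb * p                # repeat collapses into the same label
                    nxt[prefix + (c,)][1] += p_b * p          # blank in between: a fresh label
                else:
                    nxt[prefix + (c,)][1] += (p_b + p_nb) * p
        ranked = sorted(nxt.items(), key=lambda kv: kv[1][0] + kv[1][1], reverse=True)
        beams = {prefix: tuple(ps) for prefix, ps in ranked[:beam_size]}
    return {prefix: p_b + p_nb for prefix, (p_b, p_nb) in beams.items()}


scores = prefix_beam_search([[0.4, 0.6], [0.3, 0.7], [0.5, 0.5]])
```

Only the `p_b` mass may extend a prefix with a repeat of its last label, which is why the two quantities cannot be merged into one score.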
Tap to flip back
CTC's loss marginalises over alignments assuming each output symbol is independent of previous outputs given the encoder features. The decoder produces no autoregressive signal. Consequence: the model cannot learn that "k" makes "n" more likely in "knight". This leads to errors on rare or morphologically complex words that a language model (applied at decode time) partially compensates for - but the encoder itself never models output history.
Tap to flip back
When a strong self-supervised encoder (e.g., wav2vec 2.0) is fine-tuned with CTC on limited labelled data, the model sometimes produces extremely peaked softmax distributions - near-certain at each frame - that generalise poorly to out-of-domain audio.
Mitigations:
- Label smoothing: targets are a mixture of the true label and a uniform distribution, preventing overconfident outputs.
- Entropy regularisation: a penalty term is added to the loss to keep output distributions softer.
- Both reduce greedy-decoding accuracy slightly but improve beam-search and transfer robustness.
Tap to flip back
Listener - a pyramidal bidirectional LSTM that encodes log-mel frames into a compressed sequence of acoustic representations.
Attender - content-based (additive) attention that, at each decoder step, computes a weighted sum over listener outputs to form a context vector.
Speller - an autoregressive LSTM decoder that generates one output character per step conditioned on the context vector and the previous character.
Tap to flip back
A 10-second utterance at 10 ms stride gives ~1000 encoder frames. Attending over 1000 states per decoder step produces a nearly uniform attention distribution; gradients are too diluted for the model to learn sharp alignment. The pBLSTM halves sequence length at each layer (by concatenating adjacent hidden states), reducing ~1000 frames to ~125 by layer 3. This makes the attention alignment tractable and cuts computation.
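The length reduction can be sketched without any neural network code: each pyramid layer concatenates adjacent pairs of frames, halving the sequence. The helper names are hypothetical, and dropping a trailing odd frame is an assumption (padding is an equally valid choice).

```python
def pyramid_step(frames):
    """Concatenate each pair of adjacent frames; a trailing odd frame is dropped."""
    return [frames[i] + frames[i + 1] for i in range(0, len(frames) - 1, 2)]


def pyramid_lengths(num_frames, num_layers=3):
    """Sequence length after each of num_layers halving layers."""
    lengths = []
    for _ in range(num_layers):
        num_frames //= 2
        lengths.append(num_frames)
    return lengths


lengths = pyramid_lengths(1000)          # the card's 10-second example
pairs = pyramid_step([[1], [2], [3], [4], [5]])
```

Each step doubles the feature width while halving the time axis, so the attention mechanism sees 8x fewer states after three layers.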
Tap to flip back
e_{i,u} = <v, tanh( W * s_i + V * h_u + b )>
- s_i - decoder hidden state at output step i
- h_u - listener (encoder) output at frame u
- W, V, b, v - learned parameters
- e_{i,u} - unnormalised alignment score; softmax over u gives attention weights alpha_{i,u}
This is Bahdanau (additive) attention; the context vector is the weighted sum of h_u under alpha.
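The score, softmax, and context computation can be written out in plain Python for one decoder step. This is a sketch on nested lists with made-up dimensions; `additive_attention` and `matvec` are hypothetical helpers, and real systems use batched tensor operations.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def additive_attention(s_i, H, W, V, b, v):
    """e_u = <v, tanh(W s_i + V h_u + b)>, alpha = softmax(e), context = sum_u alpha_u h_u."""
    Ws = matvec(W, s_i)
    scores = []
    for h_u in H:
        Vh = matvec(V, h_u)
        hidden = [math.tanh(a + c + bias) for a, c, bias in zip(Ws, Vh, b)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    top = max(scores)                                  # subtract the max for numerical stability
    exps = [math.exp(e - top) for e in scores]
    norm = sum(exps)
    alpha = [e / norm for e in exps]

    context = [sum(a * h_u[j] for a, h_u in zip(alpha, H)) for j in range(len(H[0]))]
    return alpha, context


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]               # three made-up encoder frames
W = [[0.5, -0.2], [0.1, 0.3]]
V = [[0.4, 0.0], [-0.3, 0.6]]
alpha, context = additive_attention([1.0, 2.0], H, W, V, [0.0, 0.1], [1.0, -1.0])
```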
Tap to flip back
Scheduled sampling. During training, with probability p the decoder receives its own previous prediction as input instead of the ground-truth character. p is annealed upward during training. Without it the model is never exposed to its own errors during training and degrades rapidly when a single mistake at inference time triggers a cascade of wrong inputs. Scheduled sampling closes this exposure bias gap.
Tap to flip back
Two reasons:
1. The listener is a bidirectional LSTM, so it needs the complete utterance before producing encoder states.
2. The attention mechanism scans the full encoder output at every decoder step; it cannot produce a character until all frames are available.
Both require the full audio before any output is generated. Fixes (monotonic attention, unidirectional encoders with look-ahead) exist but sacrifice accuracy or add latency budgets.
Tap to flip back
CTC assumes conditional independence between output tokens given the input: each symbol is predicted locally from the input up to that frame, and the model marginalises over all valid blank-padded alignments.
LAS makes no such assumption. The speller conditions each character on all previous characters (via the autoregressive decoder) and on the full encoded audio (via attention). This allows LAS to model inter-character dependencies (language-model-like context) directly within the single network, but it prevents streaming output.
Tap to flip back
- Silence / long pauses: The encoder produces near-identical representations for consecutive silence frames. Attention gets confused by the dense uniform region and can misalign, causing repeated characters or skipped words.
- Low-resource / rare vocabulary: LAS has no pronunciation lexicon. It must learn phoneme-to-character mappings purely from paired audio-text data. With limited data, rare proper nouns or domain-specific terms are often mis-spelt because the model has not seen enough examples to anchor the spelling.
Tap to flip back
Feed-Forward (half-step) -> Multi-Head Self-Attention -> Convolution Module -> Feed-Forward (half-step) -> LayerNorm.
The two feed-forward layers each apply a 0.5 residual scaling (Macaron-style), together approximating one full feed-forward pass. Attention contextualises first; convolution then refines local patterns on the contextualised representation.
Tap to flip back
- Pointwise conv (d -> 2d expansion)
- GLU activation (gates and halves channels back to d)
- Depthwise conv with kernel k (each channel independently)
- BatchNorm + Swish
- Pointwise conv (d -> d projection)
Depthwise convolution convolves each channel separately, so it has d * k parameters instead of d^2 * k. This keeps the module efficient while still capturing local temporal patterns across up to ~310 ms of context (k=31, 10 ms frame shift).
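The parameter saving is easy to check numerically. The values d = 512 and k = 31 are illustrative choices, biases are ignored, and `conv_weight_counts` is a hypothetical helper.

```python
def conv_weight_counts(d, k):
    """Weights in a depthwise conv versus a full conv with d input and d output channels."""
    depthwise = d * k          # one length-k filter per channel
    full = d * d * k           # every output channel sees every input channel
    return depthwise, full


depthwise, full = conv_weight_counts(512, 31)
saving_factor = full // depthwise      # equals d
context_ms = 31 * 10                   # kernel span at a 10 ms frame shift
```

The saving factor is exactly d, which is why the depthwise form scales to wide models.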
Tap to flip back
BatchNorm is placed after the depthwise conv to help train the deep stack: normalising activations with batch statistics stabilises optimisation.
The risk: at inference BatchNorm applies the running statistics accumulated during training, so when the acoustic domain shifts (e.g., far-field vs. close-talk) those statistics are mismatched and accuracy degrades. Small batches during training or fine-tuning also make the statistics noisy. Production pipelines often replace BatchNorm with LayerNorm or GroupNorm to remove this dependency.
Tap to flip back
Conformer-L achieved 2.1% / 4.3% WER on test-clean / test-other without an external language model (1.9% / 3.9% with one). The model has 118.8 M parameters, using d_model=512, 8 attention heads, and 17 encoder layers.
Tap to flip back
Attention first resolves long-range dependencies and aligns contextually ambiguous representations. Convolution operating on those contextualised features can refine local acoustic patterns more effectively than convolution operating on raw, uncontextualised frame embeddings.
Reversing the order forces the local convolutional filter to handle input that has not yet been related to its broader phonemic context, which is less efficient especially for phonemes with variable acoustic realisations.
Tap to flip back
- Long-form audio: Full self-attention is O(n^2) in sequence length; a 30-second utterance at 10 ms frames produces 3,000 frames, stressing memory. Streaming variants (chunk attention, Emformer) are needed.
- Reverberant / far-field audio: Mel features are not reverberation-robust. Without multi-condition training or front-end beamforming, WER rises sharply.
- BatchNorm domain shift: Running statistics from training do not match out-of-distribution acoustic conditions, causing silent accuracy regression.
Tap to flip back
Three common options:
| Head | Training loss | Streaming? |
|---|---|---|
| CTC | CTC | Yes (greedy) |
| Attention decoder | Cross-entropy | No (needs full sequence) |
| RNN-T (Transducer) | RNN-T | Yes |
RNN-T is preferred for streaming because the joiner emits tokens incrementally without waiting for the full encoder output. Many production systems train with a joint CTC + attention objective to regularise the encoder, then deploy with an RNN-T head.
Tap to flip back
Weakly supervised means the training labels (transcripts) were not produced by human annotators verifying the audio - they were scraped from the web alongside the audio (captions, subtitles, posted transcripts), so they are noisy and potentially misaligned. The model is trained end-to-end on these pairs with no separate fine-tuning stage.
Self-supervised ASR (e.g. wav2vec 2.0) uses no text labels during pre-training; it learns representations from unlabelled audio via contrastive or masked objectives, then fine-tunes on a small labelled set. Whisper skips that two-stage design entirely.
Tap to flip back
Whisper prepends a sequence of special task-specifier tokens to the decoder input before generation begins:
<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>
- The language token (<|en|>, <|fr|>, etc.) identifies the source language or is predicted by the model.
- The task token (<|transcribe|> or <|translate|>) selects transcription in the source language or translation to English.
- Timestamp tokens toggle whether time offsets are generated alongside words.
One checkpoint handles all tasks; the token prefix acts as a soft instruction to the decoder.
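The prefix logic can be sketched as a small helper that assembles the token strings. It works on plain strings only; `build_decoder_prefix` is a hypothetical function and does not reproduce any tokenizer's real API or token IDs.

```python
def build_decoder_prefix(language="en", task="transcribe", timestamps=False):
    """Assemble the special-token prefix described on the card, as strings."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")   # suppress timestamp tokens
    return prefix


english_transcription = build_decoder_prefix()
french_to_english = build_decoder_prefix("fr", "translate", timestamps=True)
```

Switching task or language changes only the prefix; the model weights stay the same.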
Tap to flip back
The encoder receives a log-mel spectrogram computed over a fixed 30-second audio window: 80 mel filter banks, 25 ms analysis windows, 10 ms hop. Two convolutional layers with GELU activations downsample the time axis before the signal enters the Transformer encoder stack. Sinusoidal positional embeddings are added after the convolutions.
The fixed 30-second window is a key design choice: it simplifies training but makes real-time streaming non-native, requiring chunked inference with overlap for long audio.
Tap to flip back
When input audio contains little or no speech (silence, music, ambient noise), the autoregressive decoder sometimes generates fluent but entirely fabricated text instead of emitting an end-of-sequence token. This is because the model was trained on audio that almost always contains speech, so it has a strong prior to produce words.
Partial mitigation: the model has a <|nospeech|> special token whose probability can be thresholded. If P(<|nospeech|>) exceeds a tunable threshold, the segment is treated as silent. However, the optimal threshold is domain-dependent and hallucination is not fully eliminated by this approach.
Tap to flip back
Quality correlates directly with the hours of training data per language. English dominates the 680,000-hour corpus (roughly 65% of total hours). Languages with fewer than ~1,000 training hours show substantially higher WERs, especially character-level languages (Chinese, Japanese) where correct transcription also depends on the original captions having used proper Unicode character sets.
The internet-scraped data distribution is inherently skewed toward high-resource languages, so Whisper's zero-shot multilingual capability degrades on the long tail.
Tap to flip back
Whisper processes fixed 30-second chunks. To stream long audio, you must buffer 30 seconds, run the encoder, decode, then slide the window - introducing at least 30 seconds of latency plus re-stitching artefacts at chunk boundaries.
Architectures designed for streaming process incrementally:
- RNN-T (Recurrent Neural Network Transducer): jointly trains an acoustic encoder and a language predictor with a transducer loss that allows token emission at any frame, enabling frame-synchronous streaming.
- Conformer-based streaming models: use causal or limited-lookahead attention so the encoder never waits for the full utterance.
These models sacrifice some accuracy on clean long-form audio to gain the latency needed for live captioning or voice assistants.
Tap to flip back
Whisper interleaves special <|t_N|> time tokens with word tokens in the decoder output. During training, these time offsets are derived by running forced-alignment tools over the training corpus to produce approximate ground-truth alignments.
At inference, the decoder predicts time tokens as part of the autoregressive sequence - it is a learned sequence-to-sequence prediction, not a CTC-based monotonic alignment. This means:
- Timestamps can drift on dense or fast speech.
- Precision is segment-level, not phoneme-level.
- For applications requiring tight phoneme alignment (linguistic annotation, diarisation), a dedicated forced aligner (e.g. MFA, WhisperX) is still needed on top of Whisper's output.
Tap to flip back
<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>
These four tokens are prepended to every decoder input before any audio-derived token is generated. They condition the model on language (English), task (transcription, not translation), and timestamp behaviour (timestamps suppressed). The decoder then generates the transcript autoregressively using cross-attention to the audio encoder output.
Tap to flip back
Replace <|transcribe|> with <|translate|> in the prefix. The decoder's embedding for <|translate|> was trained to induce English-output behaviour; the architecture itself is identical. No separate model, no separate head - just a different token in the same prefix position.
Tap to flip back
Timestamp tokens are interleaved directly in the transcript token sequence (e.g. <|0.00|> Hello <|0.48|>). There are roughly 1,500 special timestamp tokens covering 0 to 30 seconds in 20 ms increments. Choosing a timestamp token uses the same softmax over the full vocabulary as choosing a word - no separate regression head. This makes temporal alignment a pure language-modelling problem.
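The 20 ms quantisation can be sketched as a mapping from a time offset to a token string. The string format follows the card's example; `timestamp_token` is a hypothetical helper, and clamping to the 30-second window is an assumption.

```python
def timestamp_token(seconds, step_ms=20, window_ms=30000):
    """Quantise a time offset to the nearest step and format it as a token string."""
    index = round(seconds * 1000 / step_ms)
    index = max(0, min(index, window_ms // step_ms))   # clamp to the window
    return f"<|{index * step_ms / 1000:.2f}|>"


# Integer arithmetic avoids floating-point surprises when counting tokens.
num_timestamp_tokens = 30000 // 20 + 1    # 0.00 through 30.00 inclusive

start = timestamp_token(0.0)
word_end = timestamp_token(0.48)
```

Because these are ordinary vocabulary entries, the decoder can only express times on this 20 ms grid.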
Tap to flip back
condition_on_prev_tokens: the transcript from the previous window is fed back as a prefix to the next window's decoder. This keeps terminology and style consistent. The failure mode is repetition collapse: once the decoder enters a loop (e.g. "I see, I see...") the repeated text becomes the next-window prefix, which reinforces the loop. Beam search with high beam counts worsens this.
Tap to flip back
The decoder is always conditioned to produce a transcript; on near-silent windows it generates plausible-sounding but fabricated text rather than outputting nothing. The <|nospeech|> token is designed to suppress this: if its probability exceeds a threshold the segment is skipped. In practice the threshold is hard to tune and the mechanism does not fully eliminate the problem.
Tap to flip back
The language token (e.g. <|en|>, <|fr|>) tells the decoder which language to output, suppressing mid-transcript language switches and improving accuracy. If omitted, Whisper infers the language from the acoustic signal alone (language identification mode). This is generally less accurate than explicitly specifying the language, especially for low-resource languages.
Tap to flip back
- No global context: a term introduced in minute 1 cannot inform transcription at minute 20 because each window's decoder only sees the previous window's transcript as context, not the full audio history.
- Boundary artefacts: words that straddle a window boundary may be duplicated or dropped depending on the overlap stride, requiring post-processing deduplication.
Both problems stem from the encoder's fixed 30-second receptive field, not from the multitask token design itself.
Tap to flip back
A streaming encoder is causal: it may only attend to frames at positions ≤ t plus a small bounded look-ahead window R. An offline encoder attends to all frames in both directions. This restriction on future context is the root cause of the latency-accuracy trade-off.
Tap to flip back
RNN-T's encoder advances frame by frame and its prediction network advances token by token, each independently of the other; the joiner combines them without needing the full encoded sequence. Attention encoder-decoder models require cross-attention over all encoder states to generate each token, which makes them non-streamable by default and expensive to run incrementally on device.
Tap to flip back
- Algorithmic latency - the right-context look-ahead (R frames) plus any chunk boundary delay.
- Acoustic model latency - time to run the encoder on the current chunk on target hardware.
- Decoder latency - beam search or greedy decoding over the transducer output.
Buffering latency (waiting for enough audio) is a fourth component that applies to offline systems but should be zero in a well-designed streaming system.
Tap to flip back
Emission delay is the tendency of the transducer to defer emitting a token across many frames while accumulating supporting evidence, causing words to appear in bursts rather than smoothly. FastEmit (Yu et al., 2021) adds a regularisation term to the training loss that rewards earlier token emission, reducing visible delay without significantly increasing WER.
Tap to flip back
Whisper's encoder always processes a fixed 30-second window and the decoder is trained to always produce tokens, so near-silence gets filled with confident nonsense. An RNN-T emits a blank symbol at every frame where it has no evidence for a real token, making silence naturally produce no output.
Tap to flip back
Larger R gives the encoder more future context per frame, reducing WER by resolving phonetic ambiguities (e.g., homophones resolved by following words). However, larger R directly adds algorithmic latency: at 10 ms frame shift, R = 8 frames adds 80 ms before a chunk can be processed. Production systems typically choose R in the range of 4-8 frames (40-80 ms) to stay within a 300 ms end-to-end latency budget while recovering most of the accuracy gap versus fully non-causal models.
Tap to flip back
When the model emits the wrong word early and a disambiguating word arrives 500 ms later, the incorrect token is already displayed. For example, "Let's go to the bank" - if "bank" has not yet been heard, "Let's go to the" might be committed with incorrect downstream context. Some systems re-decode with a shifted window to correct this, but the resulting text flicker is itself a usability problem that offline systems never have.
Tap to flip back
Standard attention computes a weighted sum over every frame in the sequence for each query. In a stream, future frames have not yet arrived, so the model would need to wait for the entire utterance before computing any representation - making real-time output impossible. The quadratic \(O(N^2)\) cost over a growing sequence is a secondary problem; the causality violation is the fundamental one.
Tap to flip back
A causal mask sets all entries where \(j > i\) (future positions) to \(-\infty\) before the softmax, so future frames receive zero attention weight. Strict causality imposes zero algorithmic lookahead latency - each frame can be encoded the moment it arrives. The practical cost is accuracy: phonemes that are disambiguated by right context (e.g. "s" vs "sh") are harder to classify, typically adding 5-15% relative WER compared to full-context attention.
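The masking step can be sketched on a small score matrix. This is plain Python on nested lists for illustration; `causal_attention_weights` is a hypothetical name, and real implementations apply the mask to batched tensors.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax of scores[i][j] with future positions (j > i) masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        top = max(masked)                          # finite: position j = i is never masked
        exps = [math.exp(s - top) for s in masked] # exp(-inf) evaluates to 0.0
        norm = sum(exps)
        weights.append([e / norm for e in exps])
    return weights


weights = causal_attention_weights([[0.1, 2.0, 0.3], [1.0, 1.0, 5.0], [0.2, 0.4, 0.6]])
```

The first frame can only attend to itself, so its weight row is one-hot regardless of the raw scores.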
Tap to flip back
Allowing each query frame to attend up to \(L\) future frames before emitting a result adds a fixed latency of \(L \times \text{frame\_shift}\) milliseconds (e.g. \(L=4\) frames at 10 ms shift = 40 ms lookahead) but recovers most of the WER lost by strict causality. The Transformer Transducer work by Zhang et al. (2020) showed that even a two-frame lookahead closes the majority of the gap with full-context attention. The latency is deterministic and bounded, which matters for production voice interfaces.
Tap to flip back
Chunked attention divides the audio stream into segments of \(C\) frames. Each frame attends only to frames within its own chunk plus a fixed \(M\)-frame carryover from the previous chunk. The cost per chunk is \(O(C^2 + C \cdot M)\) rather than \(O(N^2)\) over the full stream. The fundamental latency floor is one full chunk duration: the system must buffer \(C\) frames before it can begin processing any of them. Smaller \(C\) reduces latency but narrows the attention context, trading accuracy.
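The attention pattern can be sketched as a boolean mask, where `mask[i][j]` is true when frame i may attend to frame j. The function name and the convention that chunks start at multiples of C are assumptions for illustration.

```python
def chunk_attention_mask(num_frames, chunk, carryover):
    """Each frame sees its own chunk plus `carryover` frames from the previous chunk."""
    mask = []
    for i in range(num_frames):
        start = (i // chunk) * chunk              # first frame of this frame's chunk
        end = min(start + chunk, num_frames)      # one past the last frame of the chunk
        left = max(0, start - carryover)          # carryover from the previous chunk
        mask.append([left <= j < end for j in range(num_frames)])
    return mask


mask = chunk_attention_mask(num_frames=8, chunk=4, carryover=2)
keys_per_frame = [sum(row) for row in mask]
```

No row allows more than C + M keys, which is where the O(C^2 + C * M) per-chunk cost comes from.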
A model trained with full bidirectional attention learns representations that depend on right-context frames. Applying a causal or chunk mask only at inference produces a distribution mismatch: the encoder has never learned to operate without that context, so its streaming representations are out-of-distribution. Whisper was trained on 30-second segments with full attention; slicing audio into chunks and masking future frames at inference does not replicate the training regime. True streaming Whisper requires retraining (or fine-tuning) with the same masked attention pattern used at inference.
Naive chunked attention carries over raw frames from the previous chunk as left context, so the KV cache grows or must be truncated to a fixed window. Emformer compresses the left context into a fixed-size augmented memory bank (summary vectors trained end-to-end) rather than storing raw frames. This bounds the KV cache size at inference regardless of how long the stream has been running, and allows a longer effective left context than raw-frame carryover would permit within the same compute budget. The trade-off is that the memory compression is lossy and trained, so if the summary vectors discard phonetically relevant features, accuracy degrades in ways that are hard to diagnose.
- Chunk-boundary artefacts: phonemes straddling a boundary get representations computed from a truncated context window on each side, increasing misclassification risk. A small overlap (repeating the last \(O\) frames of chunk \(k\) as prefix for chunk \(k+1\)) reduces this at the cost of redundant computation.
- Training instability with small chunks: very narrow windows (\(C < 8\) frames) produce noisy gradients because the softmax operates over very few keys; this often causes slow convergence or sensitivity to learning-rate warmup length.
Neither problem appears in strictly causal attention, which has no hard boundaries, or in full-context attention, which uses the entire sequence.
B = 1 reduces beam search to greedy decoding (commit to the single best token at each step). Computational cost is O(B × V × T): at each of T timesteps, B hypotheses are each extended by all V vocabulary tokens, then pruned back to B.
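A minimal sketch of the expand-and-prune loop, assuming a hypothetical two-token toy model (`toy_model`) whose probabilities were picked so that greedy decoding misses the best sequence.

```python
import math

def beam_search(next_log_probs, num_steps, beam_width):
    """Keep the `beam_width` best prefixes after each expansion step."""
    beam = [((), 0.0)]  # (prefix, accumulated log-probability)
    for _ in range(num_steps):
        candidates = []
        for prefix, score in beam:
            # Extend every hypothesis by every vocabulary token.
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]  # prune back to B hypotheses
    return beam

def toy_model(prefix):
    """Hypothetical conditional distribution over the tokens 'a' and 'b'."""
    if not prefix:
        probs = {"a": 0.6, "b": 0.4}
    elif prefix[-1] == "a":
        probs = {"a": 0.55, "b": 0.45}
    else:
        probs = {"a": 0.9, "b": 0.1}
    return {t: math.log(p) for t, p in probs.items()}

greedy = beam_search(toy_model, num_steps=2, beam_width=1)[0]
wider = beam_search(toy_model, num_steps=2, beam_width=2)[0]
print(greedy[0], round(math.exp(greedy[1]), 2))
print(wider[0], round(math.exp(wider[1]), 2))
```

With B = 1 the search commits to "a" at the first step and ends at probability 0.33; with B = 2 it keeps "b" alive and finds the 0.36 sequence.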
Naive beam search treats each frame-level alignment as a distinct hypothesis, so "aa_b" and "a_bb" (both collapsing to "ab" once repeats are merged and blanks removed) are counted separately and their probabilities are not combined. Prefix beam search merges hypotheses that share the same collapsed output prefix, tracking blank-ending and non-blank-ending probabilities separately to correctly accumulate probability across all valid alignments.
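The collapse rule, and the sum over alignments that prefix beam search approximates, can be sketched with a brute-force enumeration. This is only feasible for toy inputs. The function names and the underscore blank symbol are assumptions for illustration.

```python
from itertools import groupby, product

BLANK = "_"

def ctc_collapse(alignment):
    """Merge consecutive repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(alignment)]
    return "".join(s for s in merged if s != BLANK)

def sequence_probability(frame_probs, target):
    """Sum the probability of every alignment that collapses to `target`."""
    symbols = list(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if ctc_collapse(path) == target:
            p = 1.0
            for t, s in enumerate(path):
                p *= frame_probs[t][s]
            total += p
    return total

print(ctc_collapse("aa_b"), ctc_collapse("a_bb"))  # same output, distinct alignments
print(ctc_collapse("a_ab"))  # the blank separates the two a's, so both survive

# Two frames, each uniform over 'a' and blank: 'aa', 'a_' and '_a' all yield "a".
frame_probs = [{"a": 0.5, BLANK: 0.5}, {"a": 0.5, BLANK: 0.5}]
print(sequence_probability(frame_probs, "a"))
```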
log P_combined(y_t) = log P_acoustic(y_t | y_{<t}, x) + lambda * log P_LM(y_t | y_{<t})
Lambda (typically 0.1-0.5, tuned on a validation set) controls how strongly the language model score is weighted relative to the acoustic model. Higher lambda favours linguistically probable sequences; too high and the LM overrides valid but unusual pronunciations.
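A small sketch of how the weighting changes a decision. The two candidate tokens and their probabilities are made up for illustration.

```python
import math

def fused_score(log_p_acoustic, log_p_lm, lam):
    """Shallow fusion: acoustic log-probability plus weighted LM log-probability."""
    return log_p_acoustic + lam * log_p_lm

# Hypothetical candidates: (acoustic log-prob, LM log-prob) for the next token.
candidates = {
    "there": (math.log(0.50), math.log(0.05)),
    "their": (math.log(0.45), math.log(0.40)),
}

def best_token(lam):
    return max(candidates, key=lambda tok: fused_score(*candidates[tok], lam))

print(best_token(0.0))  # acoustic score alone decides
print(best_token(0.3))  # the LM weight is enough to flip the choice
```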
Log-probabilities accumulate additively over tokens, so longer hypotheses naturally carry more negative scores than shorter ones even when they are more accurate. The standard mitigation is length normalisation: divide the total log-probability by the number of output tokens (or by token count raised to a power alpha, where alpha < 1 softens the correction). This can over-penalise short transcripts in noisy conditions.
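The effect of the exponent can be checked with two hypothetical hypotheses. The scores below are invented for illustration.

```python
def normalised_score(total_log_prob, num_tokens, alpha=1.0):
    """Length-normalised score: total log-probability divided by length**alpha."""
    return total_log_prob / (num_tokens ** alpha)

short_hyp = (-2.0, 2)  # (total log-prob, token count): average -1.0 per token
long_hyp = (-3.5, 5)   # average -0.7 per token

print(short_hyp[0] > long_hyp[0])                                        # raw: short wins
print(normalised_score(*long_hyp) > normalised_score(*short_hyp))        # alpha = 1: long wins
print(normalised_score(*short_hyp, 0.5) > normalised_score(*long_hyp, 0.5))  # alpha = 0.5: short wins
```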
A word-level LM can only score a complete word, but subword tokens arrive one piece at a time. The decoder must accumulate subword tokens into a full word before querying the LM, delaying the bonus. During this window, the beam may prune away correct subword prefixes before the full-word LM score is applied - causing worse results than if both models operated at the same granularity.
CTC scores each frame independently, so the beam advances along a single time axis. RNN-T factorises probability jointly over both time (acoustic frames) and label positions, producing a 2-D lattice. At each (time, label) node the joint network runs a forward pass, and the beam must track hypotheses that can advance in either dimension. Practitioners often add alignment restrictions (limiting how far ahead in time a hypothesis may jump) to keep the search tractable without sacrificing most of the quality gain.
Beam diversity collapse happens when all B hypotheses in the beam converge to the same prefix within a few decoding steps, making the extra beams redundant. It occurs most often when the acoustic model is highly confident (low entropy outputs) and the beam width is small. Mitigations include diverse beam search (penalising hypotheses that share tokens with already-selected beams) and grouping the beam into diverse subsets with separate pruning.
WER = (S + D + I) / N
- S: substitutions (wrong word predicted)
- D: deletions (reference word missing from hypothesis)
- I: insertions (extra word in hypothesis not in reference)
- N: total words in the reference transcript
Computed via word-level Levenshtein (dynamic programming) alignment. WER can exceed 100% when insertions dominate, because N anchors to reference length, not hypothesis length.
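A compact sketch of the computation, assuming whitespace tokenisation and a non-empty reference. The function name `wer` is illustrative and is not taken from any scoring toolkit.

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                curr[j - 1] + 1,         # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

print(wer("the cat sat", "the bat sat"))       # one substitution in three words
print(wer("hello", "hello there big world"))   # three insertions against N = 1
```

The second call returns 3.0, showing how insertions push WER past 100%.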
Text normalisation choices before comparison drive the gap:
- Case folding ("NASA" vs "nasa")
- Punctuation stripping
- Number canonicalisation ("five" vs "5")
- Filler word removal ("um", "uh")
- Compound-word tokenisation conventions
Whisper's 2022 paper demonstrated this explicitly: applying their English text normaliser shifted reported WERs significantly across benchmarks. Always publish the normalisation pipeline alongside the number, or the comparison is meaningless.
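A deliberately tiny normaliser shows how these choices change what counts as an error. It is an illustrative sketch, not Whisper's normaliser: the filler list and the one-entry number map are placeholder assumptions, and stripping all punctuation also removes apostrophes.

```python
import re

FILLERS = {"um", "uh"}
NUMBER_WORDS = {"five": "5"}  # placeholder; a real pipeline needs a full number parser

def normalise(text):
    text = text.lower()                  # case folding
    text = re.sub(r"[^\w\s]", "", text)  # punctuation stripping
    words = [NUMBER_WORDS.get(w, w) for w in text.split() if w not in FILLERS]
    return " ".join(words)

reference = "Um, NASA launched five rockets."
hypothesis = "nasa launched 5 rockets"
print(normalise(reference))
print(normalise(reference) == normalise(hypothesis))
```

Without normalisation the two strings differ in four of five reference words; with it they are identical.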
WER = edit distance at word level / reference word count.
CER = edit distance at character level / reference character count.
Prefer CER when:
- The language has no whitespace word boundaries (Mandarin, Japanese, Thai).
- You are evaluating systems on morphologically rich languages, where a single wrong affix turns an otherwise correct long word into a full word error.
- You want finer-grained signal on short utterances where a single word error gives WER = 100%.
For standard English audiobook benchmarks (LibriSpeech), WER is the convention.
Several reasons test-clean near-saturation is misleading:
- Narrow domain: clean audiobook narration; no noise, accents, disfluencies, far-field microphones.
- Language model contamination: text corpora used to train LMs may overlap with the specific audiobooks in the test set.
- Floor effects: remaining errors are dominated by annotation disagreements and normalisation artefacts, not model capability.
- Hard tasks remain hard: CHiME-6 dinner-party speech, code-switched calls, and streaming scenarios show WERs well above 30%.
Test-clean is a useful sanity check and comparison baseline, not a proxy for real-world utility.
WER penalises all word-level mismatches equally regardless of meaning. "Automobile" mistranscribed as "car" counts the same as "automobile" as "armadillo", even though the first is often harmless downstream.
Alternatives:
- Semantic WER / SBERT-WER: uses sentence-embedding similarity for soft alignment rather than exact string match.
- BERTScore: borrowed from MT; computes contextual embedding overlap between reference and hypothesis tokens.
- Word Information Preserved (WIP): fraction of reference content surviving transcription, precision/recall flavour.
None has replaced WER in standard reporting because WER's causal story (each edit = one downstream correction) is clean, and reproducibility requires everyone to use the same metric.
WER is a uniformly weighted average across all word positions. In a 200-word call, one missed named entity ("Patel" deleted) contributes only 0.5% to WER, which is statistically invisible in aggregate reporting.
But for the compliance use case, that single deletion is a 100% miss on the operationally critical token. Named entity accuracy, entity recall, or task-specific F1 are far better quality signals here.
Practical lesson: always pair WER with task-relevant metrics (entity recall, action-item extraction accuracy, etc.). WER optimises for transcription fidelity, not downstream utility.
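The contrast between the two numbers can be made concrete with a toy calculation. `entity_recall` is a hypothetical helper that assumes single-word entities and exact, case-insensitive matching.

```python
def entity_recall(reference_entities, hypothesis_text):
    """Fraction of reference entities that appear as words in the hypothesis."""
    hyp_words = set(hypothesis_text.lower().split())
    found = [e for e in reference_entities if e.lower() in hyp_words]
    return len(found) / len(reference_entities)

# One deleted word in a 200-word call barely moves WER...
wer_contribution = 1 / 200
# ...but if that word was the only named entity, entity recall drops to zero.
recall = entity_recall(["Patel"], "please call mr back about the claim")

print(wer_contribution, recall)
```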
Streaming ASR emits partial hypotheses that are revised as more audio arrives.
- Final WER: measured on the last committed transcript after the full utterance is processed. This is what most papers report.
- Real-time WER: measures the edit distance of the partial hypothesis visible at each time step against the eventual correct transcript. A system that heavily revises late has low final WER but high real-time WER.
This matters for live caption displays, wake-word-triggered pipelines, and voice-controlled interfaces where users act on intermediate output. A model with excellent final WER but erratic intermediate hypotheses produces a jarring user experience and may trigger downstream systems incorrectly before correction arrives.
VAD (voice activity detection) is a per-frame binary classifier: it labels each 10-30 ms audio window as speech or non-speech. Endpointing is the higher-level policy that accumulates VAD output (and optionally richer signals) to decide that a complete utterance has ended. VAD is the sensor; endpointing is the decision logic.
Threshold tau converts the VAD probability p(speech | frame) into a binary label. Lowering tau means more frames are classified as speech, which reduces misses (speech labelled non-speech) at the cost of more false alarms (noise labelled speech). In voice assistant applications a missed speech frame can truncate the utterance or prevent the endpointer from firing correctly, a more costly error than a brief false alarm on background noise. So operators tune tau below 0.5 to bias toward sensitivity.
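The trade-off can be seen on a handful of made-up frames. The probabilities and ground-truth labels below are invented for illustration.

```python
def vad_errors(speech_probs, is_speech, tau):
    """Count misses and false alarms when frames with p >= tau are labelled speech."""
    misses = sum(1 for p, s in zip(speech_probs, is_speech) if s and p < tau)
    false_alarms = sum(1 for p, s in zip(speech_probs, is_speech) if not s and p >= tau)
    return misses, false_alarms

speech_probs = [0.90, 0.45, 0.35, 0.40, 0.10, 0.20]
is_speech = [True, True, True, False, False, False]

print(vad_errors(speech_probs, is_speech, tau=0.5))  # stricter threshold: more misses
print(vad_errors(speech_probs, is_speech, tau=0.3))  # lower threshold: one false alarm instead
```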
A silence-timeout endpointer declares end-of-utterance (EOU) once VAD has continuously returned non-speech for N frames. Typical settings are N corresponding to 300-800 ms of silence. The fundamental cost is that the system adds the full timeout (the duration of N frames) as latency after every utterance, even when the speaker stopped cleanly. The longer N is, the fewer mid-utterance pauses are mistaken for EOU, but the higher the baseline latency.
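A minimal sketch of the policy, assuming per-frame binary VAD labels and that EOU may only fire after some speech has been seen. The function name and the toy label sequence are illustrative.

```python
def silence_timeout_endpoint(vad_labels, timeout_frames):
    """Return the frame index at which EOU fires, or None if it never does."""
    seen_speech = False
    silent_run = 0
    for index, is_speech in enumerate(vad_labels):
        if is_speech:
            seen_speech = True
            silent_run = 0  # any speech frame resets the silence counter
        elif seen_speech:
            silent_run += 1
            if silent_run >= timeout_frames:
                return index
    return None

# 5 speech frames, a 2-frame pause, 3 more speech frames, then trailing silence.
labels = [1] * 5 + [0] * 2 + [1] * 3 + [0] * 10

print(silence_timeout_endpoint(labels, timeout_frames=5))  # fires 5 frames after speech ends
print(silence_timeout_endpoint(labels, timeout_frames=2))  # fires inside the mid-utterance pause
```

The last speech frame is index 9, so the first call fires at index 14: the whole timeout is paid as latency. The shorter timeout cuts the utterance at index 6.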
A recurrent neural network transducer (RNN-T) emits a blank token at frames where no new word-piece label should be output. After the speaker finishes speaking, the model tends to emit a sustained run of blank tokens. This blank-token rate is an ASR-model-internal endpointing signal that is tightly coupled to the decoder state. Adding an explicit end-of-word token plus a delay penalty at training time (Anandh et al., 2025) makes this signal more reliable for conversational speech.
Acoustic endpointing fires on a sustained silence in the audio signal - it is agnostic to meaning. Semantic endpointing uses linguistic context: typically a frame-level punctuation prediction head that detects sentence-final markers in the evolving ASR transcript. When the model predicts a sentence-terminal punctuation, EOU can fire even before the acoustic silence timeout expires. Shi et al. (2023) showed a 53.3% latency reduction with no significant increase in character error rate compared to acoustic-only VAD.
At low SNR (below roughly 5 dB) the energy and spectral features of background noise overlap heavily with those of voiced speech, so the classifier cannot separate the two classes in feature space. Models trained on clean or mildly noisy data confidently mis-classify noisy frames. The standard mitigation is noise-augmented training: adding a wide variety of real and synthetic noise types at various SNR levels during training. This improves robustness but there is no reliable fix when SNR drops below 0 dB because the speech signal is genuinely buried.
Rather than running a separate VAD or endpointing model in parallel with the ASR model, the shared-encoder design branches a lightweight end-of-utterance head off the existing ASR encoder. The head produces a scalar EOU probability at each encoder frame at negligible marginal cost, because the expensive encoder computation was already required for transcription. Li et al. (2022) use this architecture to run multilingual streaming ASR plus endpointing faster than real time on a mobile CPU.
Diarisation answers "who spoke when" by partitioning audio into speaker-homogeneous segments, each labelled with an arbitrary identity such as SPEAKER_0. It does not produce transcripts and does not know the speakers' real names unless an external reference links labels to identities.
- Voice activity detection (VAD) - removes non-speech frames.
- Segmentation - cuts audio into short, speaker-homogeneous windows.
- Embedding extraction - converts each segment to a speaker vector.
- Clustering - groups vectors by speaker identity.
Errors in earlier stages compound into later ones, which is the main motivation for end-to-end approaches.
Statistics pooling concatenates the mean and standard deviation of the frame-level TDNN outputs across all frames in a segment, producing a fixed-length vector regardless of segment duration. This lets the extractor summarise a variable-length utterance into a single speaker representation that can be fed into fully connected layers.
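A pure-Python sketch of the pooling step, assuming frame-level outputs arrive as equal-length lists and using the population standard deviation (dividing by the frame count). Implementations differ on that detail.

```python
import math

def statistics_pooling(frames):
    """Concatenate per-dimension mean and standard deviation across frames."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds  # length 2 * dim regardless of the number of frames

short_segment = [[1.0, 2.0], [3.0, 2.0]]
long_segment = [[float(i), 0.0] for i in range(100)]

print(statistics_pooling(short_segment))
print(len(statistics_pooling(short_segment)), len(statistics_pooling(long_segment)))
```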
EEND (End-to-End Neural Diarisation) outputs, for each frame, simultaneous probabilities for multiple speaker channels being active. Two channels can fire at the same time, directly modelling overlap. Classical clustering assigns each frame to exactly one cluster, so overlap is either lost or requires a separate post-processing step.
DER (Diarisation Error Rate) = (False alarm + Missed speech + Speaker confusion) / Total reference speech duration.
- False alarm: non-speech labelled as speech.
- Missed speech: speech labelled as non-speech.
- Speaker confusion: frames attributed to the wrong speaker.
DER is computed over the entire recording; a single long-turn speaker can dominate and make overall numbers look deceptively good.
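The dominance effect can be shown with invented durations. The per-speaker split below is a hypothetical diagnostic, not part of the standard DER definition.

```python
def der(false_alarm, missed, confusion, total_reference_speech):
    """Diarisation error rate from error durations (all in seconds)."""
    return (false_alarm + missed + confusion) / total_reference_speech

# Hypothetical recording: speaker A talks for 540 s, speaker B for 60 s.
confusion_a, confusion_b = 5.0, 30.0
overall = der(0.0, 0.0, confusion_a + confusion_b, 540.0 + 60.0)
speaker_b_only = der(0.0, 0.0, confusion_b, 60.0)

print(round(overall, 3), round(speaker_b_only, 3))
```

Overall DER is under 6% even though half of speaker B's speech is attributed to the wrong person.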
Statistics pooling averages over all frames in the segment. With very few frames (under roughly 50 at 10 ms shift), the mean and standard deviation are high-variance estimates of the speaker's true distribution, so the resulting vector can land anywhere in embedding space. Back-channels like "uh-huh" fall into this category and are frequently mis-attributed.
AHC requires an external stopping criterion to determine the number of speakers (a threshold, elbow heuristic, or a known count). VBx (Variational Bayes over x-vector sequences modelled with a Bayesian HMM) infers both the number of speakers and the segment-to-speaker assignments jointly during optimisation, avoiding the most sensitive hyperparameter in the classical pipeline. It still has its own regularisation weight, but that is more interpretable than a raw distance threshold.
Noise corrupts the acoustic signal before any linguistic representation is formed - the mel spectrogram itself is damaged. Accent is a distributional shift in phonetic realisation with a clean signal; the spectrogram looks fine but the learned phoneme boundaries do not match the speaker population. Noise calls for signal-level or augmentation-level fixes; accent calls for training data diversity or domain adaptation.
- Time warping - randomly distorts the time axis within an utterance.
- Frequency masking - zeros out F consecutive mel bins chosen at random.
- Time masking - zeros out T consecutive frames chosen at random.
Together they prevent the model from relying on any single frequency band or short temporal segment, acting as spatially structured dropout. The model learns distributed representations that approximate what partial-information real noise produces. This alone cut LibriSpeech test-other WER from ~12% to 6.8% without a language model.
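The two masking operations can be sketched on a spectrogram stored as a list of frames. Time warping is omitted for brevity. The function name, the single mask per axis, and the maximum widths are assumptions, and the widths must not exceed the spectrogram dimensions.

```python
import random

def apply_masks(spectrogram, max_freq_width, max_time_width, rng):
    """Return a copy with one frequency mask and one time mask zeroed out."""
    num_frames, num_bins = len(spectrogram), len(spectrogram[0])
    masked = [frame[:] for frame in spectrogram]  # leave the input untouched

    # Frequency mask: f consecutive mel bins starting at f0, in every frame.
    f = rng.randint(0, max_freq_width)
    f0 = rng.randint(0, num_bins - f)
    for frame in masked:
        for b in range(f0, f0 + f):
            frame[b] = 0.0

    # Time mask: t consecutive frames starting at t0, across all bins.
    t = rng.randint(0, max_time_width)
    t0 = rng.randint(0, num_frames - t)
    for i in range(t0, t0 + t):
        masked[i] = [0.0] * num_bins
    return masked

rng = random.Random(0)
spec = [[1.0] * 8 for _ in range(20)]  # toy input: 20 frames x 8 mel bins
masked = apply_masks(spec, max_freq_width=3, max_time_width=5, rng=rng)
for frame in masked:
    print("".join("#" if v else "." for v in frame))
```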
Pre-training on ~53k hours of unlabelled audio via a contrastive masked prediction task forces the encoder to capture properties that are stable across speakers, conditions, and recording environments - because those are the only generalisable features. Fine-tuning on as little as ten minutes of labelled data then reaches 4.8/8.2 WER on LibriSpeech clean/other, far below what supervised training at that scale could achieve.
WavLM adds an explicit denoising objective to the masked speech prediction pre-training: the model must predict masked latent representations from a corrupted (noisy) waveform. This directly incentivises the encoder to disentangle speech content from noise, rather than merely learning to predict from masked clean speech. The result is state-of-the-art performance on the SUPERB benchmark, including noisy ASR tasks.
Whisper was trained on 680,000 hours of internet audio paired with transcripts collected from the web (weak supervision) - a distribution that naturally includes podcasts, phone calls, field recordings, and dozens of accents. The model learns that variability is normal. The failure mode: weakly supervised transcripts tend to be least reliable in exactly the hard cases (noise, accents, fast speech). Whisper internalises some of those errors, which manifests as confident hallucination of plausible but incorrect text when audio is degraded, rather than outputting a low-confidence or null transcript.
- Fine-tuning on accent-matched data - even a few hours of transcribed speech from the target accent, applied to the final layers, recovers most of the WER gap.
- Accent-conditioned multi-accent training - supply an accent embedding or one-hot tag at training time; at inference provide or predict the accent.
- Pronunciation lexicon expansion - add accent-specific phonemic variants for systematically shifted vowels or consonants; zero inference-time cost and effective for predictable dialect shifts.
Below approximately 0 dB SNR (noise power equal to or exceeding speech power) all current models degrade severely. Augmentation cannot fully compensate because at this level the acoustic evidence of the speech signal is genuinely destroyed - there is no latent phonetic information left to recover. The model is not failing to generalise; the input literally does not contain enough signal.
wav2vec 2.0 masks spans of latent feature frames and trains the transformer to identify the correct quantised target q_t for each masked position among a set of distractors sampled from the same sequence, using a contrastive loss with cosine similarity.
The quantised space matters because raw or filterbank features are spectrally smooth: a model could cheat by interpolating nearby unmasked frames. Discrete codes sever that shortcut, forcing the model to learn genuine phonetic structure to distinguish the right code from distractors.
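The objective for a single masked position can be sketched as a softmax over temperature-scaled cosine similarities. The vectors are toy values, the function names are illustrative, and the temperature of 0.1 is an assumed setting.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true target among the distractors."""
    sims = [cosine(context, c) / temperature for c in [target] + distractors]
    peak = max(sims)  # log-sum-exp with the maximum subtracted for stability
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in sims))
    return log_denominator - sims[0]

target = [1.0, 0.0]
distractors = [[0.0, 1.0], [-1.0, 0.0]]

aligned = contrastive_loss([1.0, 0.0], target, distractors)
confused = contrastive_loss([0.0, 1.0], target, distractors)
print(aligned, confused)
```

The loss is near zero when the context vector points at the true code and large when it points at a distractor.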
The quantiser partitions the feature vector into G sub-vectors. Each sub-vector is independently mapped to one of V codebook entries, giving G * log2(V) bits per frame. The selection of codebook entries is discrete, so a Gumbel-softmax relaxation is used during training to allow gradients to flow through the argmax approximation. A diversity loss on codebook entropy prevents most entries from being ignored (codebook collapse).
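The capacity arithmetic is easy to check. The setting G = 2, V = 320 below is used as an illustrative configuration.

```python
import math

def codebook_capacity(num_groups, entries_per_group):
    """Number of distinct codewords and the equivalent bits per frame."""
    combinations = entries_per_group ** num_groups
    bits = num_groups * math.log2(entries_per_group)
    return combinations, bits

combinations, bits = codebook_capacity(num_groups=2, entries_per_group=320)
print(combinations, round(bits, 2))
```

Two groups of 320 entries give 102,400 possible codewords, or about 16.6 bits per frame, while only 640 entry vectors need to be stored.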
Roughly 6.5% of time-steps are sampled as span starts, and each start masks the following 10 frames, with spans allowed to overlap. On average approximately 49% of all time-steps end up masked. The transformer receives the full sequence but the contrastive loss is computed only on masked positions. This high masking rate forces long-range contextual reasoning rather than local interpolation.
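The masked fraction follows from the span-start probability and the span length. The sketch below assumes each time-step is independently chosen as a start with probability 0.065 (the setting reported in the wav2vec 2.0 paper). That is a simplification of sampling a fixed proportion of start indices.

```python
import random

def expected_masked_fraction(start_prob, span_length):
    """A step is unmasked only if none of the span_length possible starts covering it fired."""
    return 1.0 - (1.0 - start_prob) ** span_length

def simulate_masked_fraction(num_steps, start_prob, span_length, rng):
    masked = [False] * num_steps
    for start in range(num_steps):
        if rng.random() < start_prob:
            # Overlapping spans simply re-mask already masked steps.
            for t in range(start, min(start + span_length, num_steps)):
                masked[t] = True
    return sum(masked) / num_steps

analytic = expected_masked_fraction(0.065, 10)
simulated = simulate_masked_fraction(200_000, 0.065, 10, random.Random(0))
print(round(analytic, 3), round(simulated, 3))
```

Both numbers come out close to 0.49: overlapping spans keep the masked share well below the naive 0.065 x 10 = 65%.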
Pre-trained on 53,000 hours of unlabelled LibriVox audio and fine-tuned with CTC on just 10 minutes of transcribed speech (decoded with an external language model), the LARGE model achieved 4.8% / 8.2% WER on LibriSpeech test-clean/test-other. Fine-tuning on the full 960-hour labelled set reached 1.8% / 3.3%. The implication: a large unlabelled corpus can substitute for most of the labelled data traditionally required for competitive ASR.
wav2vec 2.0 learns its discrete codes online via a jointly trained product quantiser. HuBERT replaces this with offline k-means clustering: MFCC features (first iteration) or a previous model's representations (later iterations) are clustered once, and the cluster assignments become fixed pseudo-labels for the masked prediction loss. This avoids the quantiser's joint-training instability and Gumbel-softmax tricks, at the cost of requiring iterative pre-training runs.
- Domain mismatch: codebook and representations reflect the pre-training acoustic distribution (e.g., clean read speech); telephone or child-speech domains can degrade WER substantially.
- Streaming latency: the transformer is non-causal and attends over the full utterance; real-time use requires causal re-design or chunked look-ahead, both reducing accuracy.
- Codebook collapse: if the diversity-loss weight α is poorly tuned, a few codes dominate, the contrastive task becomes trivially easy, and pre-training stops improving representations.
Lower transformer layers encode low-level acoustic-phonetic features: voicing, manner and place of articulation. Higher layers encode progressively more abstract units that cluster by phoneme identity, even without any phoneme supervision. This hierarchy emerges purely from the contrastive pre-training objective. It is the reason fine-tuning on small labelled corpora works: the model already represents phone-like units and needs only a shallow mapping to grapheme or word targets, not a full feature-learning pass.