← Blog

Sleep-Time Compute: Thinking About the Context Before the Question Arrives

June 29, 2026 · 22 min read

A reasoning model handed a hard math word problem will happily burn ten thousand tokens before it answers. Most of those tokens are not spent reading the question. They are spent re-deriving facts about the context the question lives in: parsing the same paragraph of givens, recomputing the same intermediate quantities, re-establishing the same relationships, every single time a query touches that context. When the same document, codebase, or user profile is queried again an hour later, the model starts from zero. The reasoning is thrown away the instant the response is streamed.

In April 2025, a group from UC Berkeley and Letta asked an awkward question about that waste: what if the model did the context-reasoning before the user ever typed anything? Their paper, Sleep-time Compute (Lin et al., 2025, arXiv:2504.13171), introduces a third axis of scaling that sits alongside the two we already know. Train-time compute makes the weights better. Test-time compute makes a single answer better. Sleep-time compute makes the context better, offline, while the system is idle, so that when a question finally arrives the model has far less to figure out under the latency clock.

Why this matters: Every production LLM system already pays for the same context to be re-read on every request. Sleep-time compute reframes that recomputation as a one-time, offline investment. On the benchmarks the authors built, it cuts the test-time compute needed to hit a given accuracy by about 5x, and when several questions share one context it drops average cost per query by 2.5x. The catch is that it only pays off when you can guess what the user will ask.

TL;DR

  • Test-time compute is paid at the worst possible moment. The user is waiting, the meter is running, and the model is re-deriving facts about a context it has seen before. Sleep-time compute shifts that work to idle periods between requests.
  • The core trick is statefulness. Instead of a stateless prompt that bundles context and question together, the context persists as a mutable object. A background process reasons over it and writes an enriched representation back, so the live query starts from a head start, not a blank slate.
  • The measured wins are concrete. Roughly 5x less test-time compute for the same accuracy on the authors' Stateful GSM-Symbolic and Stateful AIME tasks, with accuracy climbing up to 13% and 18% respectively when sleep-time compute is scaled up (Lin et al., 2025).
  • Amortization is where the economics flip. When many related queries hit the same context, the offline cost is spread across all of them. On Multi-Query GSM-Symbolic the average cost per query falls about 2.5x.
  • It is not prompt caching. Prompt caching reuses raw KV tensors for identical prefixes. Sleep-time compute reuses reasoning, produces new content that was never in the prompt, and survives small edits to the query.
  • Predictability is the hard constraint. The benefit correlates with how guessable the query is from the context alone. When questions are adversarially unrelated to what the model anticipated, sleep-time compute can be wasted work.
  • It descends from agent-memory research. The same lab built MemGPT; sleep-time compute is the natural next step once you treat the context window as managed, persistent memory rather than a disposable buffer.

At a Glance

The whole idea fits in one contrast: a stateless pipeline does all its thinking after the question lands, while a sleep-time pipeline front-loads the context reasoning.

flowchart LR
  subgraph Standard["Standard test-time scaling"]
    Q1[User query plus context] --> R1[Reason from scratch]
    R1 --> A1[Answer]
  end
  subgraph SleepTime["Sleep-time compute"]
    C2[Raw context] --> S2[Offline reasoning while idle]
    S2 --> E2[Enriched context]
    Q2[User query] --> L2[Short live reasoning]
    E2 --> L2
    L2 --> A2[Answer]
  end
  class Q1,C2,Q2 blue
  class R1,S2,L2 purple
  class A1,A2 teal
  class E2 emerald
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff

The expensive purple box in the standard pipeline runs on the user's clock. In the sleep-time pipeline, the heaviest purple box runs offline, and the live box that the user waits on is small.

[IMAGE: Side-by-side latency timeline, two horizontal bars labeled "standard" and "sleep-time", with the standard bar showing one long block under a stopwatch icon and the sleep-time bar showing a short block under the stopwatch plus a shaded "offline" block before it]

Before Sleep-Time Compute

To see why this idea arrived when it did, follow the compute the field has been willing to spend, and when.

For years the only knob that mattered was pretraining. Bigger models, more tokens, better loss. The 2024 shift, crystallized by OpenAI's o1 and then made reproducible by open work, was that you could spend compute at inference and get reasoning gains that rivaled scaling the model itself. Snell et al., 2024, Scaling LLM Test-Time Compute Optimally, arXiv:2408.03314 showed that allocating inference compute well, choosing between sequential revision and parallel sampling based on difficulty, could be over 4x more efficient than a naive best-of-N baseline, and on some problems beat a 14x larger model. Muennighoff et al., 2025, s1: Simple test-time scaling, arXiv:2501.19393 then showed the mechanism could be almost embarrassingly simple: fine-tune on 1,000 curated traces and force the model to keep thinking by appending the token "Wait", and a 32B model beats o1-preview on competition math by up to 27%.

That progress came with a bill. Every one of those gains is paid in latency and dollars at request time. The reasoning is also amnesiac: it treats each prompt as a self-contained universe. A reasoning model asked two questions about the same financial report parses that report twice, with full chain-of-thought both times.

Meanwhile a parallel line of work was learning to treat context as something that persists. Packer et al., 2023, MemGPT: Towards LLMs as Operating Systems, arXiv:2310.08560, from the same Berkeley group, borrowed virtual-memory ideas from operating systems to page information in and out of a limited context window, giving an agent durable memory. Once context is a managed, persistent object rather than a string you throw away, a new question becomes obvious: why not improve that stored context during the gaps between requests? Sleep-time compute is that question answered.

timeline
  title From bigger models to thinking while idle
  2020 : Scaling laws favor larger pretrained models
  2023 : MemGPT treats context as paged, persistent memory
  2024 : o1 and Snell et al. establish test-time compute scaling
  2025 : s1 shows budget forcing reproduces reasoning gains
  2025 : Sleep-time compute moves context reasoning offline

[IMAGE: Annotated line plot of accuracy versus log test-time compute, one curve for standard test-time scaling and one for sleep-time compute, with the sleep-time curve shifted left to mark the roughly 5x reduction in test-time tokens at matched accuracy]

How Sleep-Time Compute Actually Works

The mechanism rests on a single structural change to how a query is represented, and three processes that exploit it.

Splitting the prompt into context and query

A normal prompt is one blob: the background material plus the question, concatenated. Sleep-time compute requires you to factor that blob. Formally the prompt \(p\) is decomposed into a context \(c\) and a query \(q\). The context is the part that is stable and shared, a document, a codebase, a user's history, a problem's givens. The query is the part that arrives late and varies, the actual question asked against that context.

This factoring is the whole game. Once \(c\) is separable, it can be reasoned about before \(q\) exists. The system precomputes an enriched context \(c'\) by running the model over \(c\) with an anticipatory instruction: infer what is likely to be asked, derive useful intermediate results, surface implications, and write them down. At test time the model is given \(c'\) and \(q\) together, and because the heavy lifting is already encoded in \(c'\), it can answer with far less reasoning.

You can think of it as a function. Standard inference computes \(a = f(c, q)\) with a large compute budget \(B_\text{test}\) spent entirely after \(q\) lands. Sleep-time inference computes \(c' = g(c)\) offline with budget \(B_\text{sleep}\), then \(a = f(c', q)\) with a much smaller \(B_\text{test}\). The total compute may be similar or even larger, but its placement on the timeline is what changes.

What the offline pass actually produces

The enriched context is not a summary. A summary throws information away to save space; sleep-time compute adds information that was never stated. Given a word problem's setup, the offline pass might compute the totals, rates, and ratios the problem implies. Given a codebase, it might trace which functions call a changed module and what the likely failure modes are. The model is, in effect, doing the predictable part of the reasoning in advance and storing the answer.

flowchart TD
  Raw[Raw context c] --> Pass[Anticipatory reasoning pass]
  Pass --> Derive[Derive implied quantities]
  Pass --> Predict[Predict likely questions]
  Pass --> Surface[Surface hidden relationships]
  Derive --> Write[Write enriched context c prime]
  Predict --> Write
  Surface --> Write
  Write --> Store[(Persistent context store)]
  class Raw blue
  class Pass,Derive,Predict,Surface purple
  class Write emerald
  class Store slate
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

Scaling the offline budget

Sleep-time compute is itself a scalable axis. You can run the anticipatory pass once with a small budget, or sample it many times, or let a reasoning model think at length about the context. The paper finds that pouring more compute into the offline pass keeps lifting test-time accuracy, up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME relative to the no-sleep baseline (Lin et al., 2025). Crucially, scaling sleep-time compute and scaling test-time compute trace out different points on the accuracy-versus-latency frontier, and the sleep-time curve dominates in the regime where you care about latency.

The tradeoff this introduces is speculative work. The offline pass spends compute on reasoning the user may never need. That is acceptable precisely when the spend is amortized, which is the next idea.

Amortization across queries

A single context is rarely queried once. A report gets many questions; a codebase gets many edits; a user profile informs many sessions. The offline cost \(B_\text{sleep}\) is paid once per context, but it benefits every query against that context. If \(N\) queries share one context, the effective sleep-time cost per query is \(B_\text{sleep} / N\). As \(N\) grows, the amortized cost vanishes and only the small per-query test-time budget remains. On the authors' Multi-Query GSM-Symbolic, where each context carries several related questions, this amortization drove average cost per query down by about 2.5x.

[IMAGE: Hyperbolic decay curve of amortized sleep-time cost per query on the y-axis versus number of queries per context on the x-axis, annotated to show the curve flattening toward the fixed per-query test-time floor as N grows]

Seeing It in Motion

Two more views make the runtime behavior concrete: the request-time interaction, and the lifecycle of a context as it cycles between idle enrichment and live use.

The sequence below shows what happens across an idle period and a subsequent query. Note that the user only ever waits on the final exchange.

sequenceDiagram
  actor User
  participant Agent as Agent runtime
  participant Sleeper as Background worker
  participant Store as Context store
  Sleeper->>Store: Read raw context
  Sleeper->>Sleeper: Anticipatory reasoning, idle time
  Sleeper->>Store: Write enriched context
  Note over Sleeper,Store: Happens before any query, off the latency path
  User->>Agent: Ask question
  Agent->>Store: Fetch enriched context
  Agent->>Agent: Short reasoning with head start
  Agent->>User: Answer with low latency

The same context moves through a small state machine. It is built once, refreshed when the underlying material changes, and consumed many times in between.

stateDiagram-v2
  [*] --> Raw
  Raw --> Enriching: idle worker picks it up
  Enriching --> Ready: enriched context written
  Ready --> Serving: query arrives
  Serving --> Ready: answer returned
  Ready --> Stale: underlying context edited
  Stale --> Enriching: re-run anticipatory pass
  Ready --> [*]: context evicted

[IMAGE: Annotated screenshot-style figure of an enriched context object, raw givens on the left panel and machine-derived quantities on the right panel, with arrows linking each derived value to the source line it was inferred from]

Architecturally, a production deployment separates the latency-critical serving path from the elastic offline path so that background reasoning never contends with live requests.

graph TD
  subgraph Online["Online path, latency critical"]
    API[Query API] --> Router[Context router]
    Router --> Infer[Inference server]
    Infer --> Resp[Response]
  end
  subgraph Offline["Offline path, throughput oriented"]
    Queue[Idle work queue] --> Worker[Sleep-time worker pool]
    Worker --> Enrich[Enriched context builder]
  end
  Store[(Shared context store)]
  Router --> Store
  Enrich --> Store
  Store --> Infer
  class API,Queue blue
  class Router,Infer,Worker,Enrich purple
  class Resp teal
  class Store slate
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

By the Numbers

The headline figures come from the original paper's two stateful benchmarks and its multi-query extension. The reductions below are measured at matched accuracy, meaning the sleep-time system reaches the same score with far fewer test-time tokens.

Metric Standard test-time scaling With sleep-time compute Source
Test-time compute for matched accuracy (Stateful GSM-Symbolic) baseline about 5x less Lin et al., 2025
Test-time compute for matched accuracy (Stateful AIME) baseline about 5x less Lin et al., 2025
Peak accuracy lift, scaling sleep-time (Stateful GSM-Symbolic) baseline up to +13% Lin et al., 2025
Peak accuracy lift, scaling sleep-time (Stateful AIME) baseline up to +18% Lin et al., 2025
Average cost per query (Multi-Query GSM-Symbolic) baseline about 2.5x lower Lin et al., 2025

A note on the cost weighting. The 2.5x per-query figure assumes test-time tokens are roughly 10x more valuable than sleep-time tokens, a ratio the authors chose to reflect that users pay for latency, not just tokens. That assumption is reasonable for interactive products and weaker for batch workloads where nobody is waiting. The asymmetry is the economic foundation of the whole technique, so it is worth stating plainly rather than burying.

It also helps to place sleep-time compute against the other reuse mechanisms it is often confused with, on the dimension of what gets reused.

Mechanism What is reused Survives query change Produces new content
Prompt caching Raw KV tensors for identical prefix No, prefix must match No
Retrieval (RAG) Retrieved passages, recomputed each call Yes No
Sleep-time compute Anticipatory reasoning over context Yes Yes

[IMAGE: Grouped bar chart comparing tokens spent at test time for standard versus sleep-time across both benchmarks, with a second axis or inset showing the amortized cost-per-query curve falling as the number of queries per context increases]

A Concrete Example

Take a Stateful GSM-Symbolic-style problem. The context \(c\) is a setup:

A bakery makes 240 croissants each morning. It sells them in boxes of 6. Each box sells for $18. The flour for one croissant costs $0.40. Rent is $120 per day.

In a standard pipeline this paragraph sits idle until a question arrives, then gets re-read and reasoned over from scratch every time.

The sleep-time worker, given only \(c\) and an instruction to anticipate, derives a small table of quantities the problem clearly implies, and writes them into \(c'\):

Derived quantity Value Reasoning
Boxes per day 40 240 / 6
Daily revenue $720 40 boxes times $18
Daily flour cost $96 240 times $0.40
Daily fixed cost $120 rent
Daily profit $504 720 minus 96 minus 120

Now three queries arrive over the afternoon. With the enriched context, each is nearly a lookup:

  • "What is the daily profit?" The answer $504 is already derived. Test-time reasoning is a single step.
  • "How many boxes are sold per day?" Already present: 40.
  • "If rent rises to $150, what is profit?" The model reuses the $600 gross margin (revenue minus flour) it can read off the enriched context and subtracts $150, reaching $450 in one step instead of recomputing revenue and flour cost from the raw givens.

In the standard pipeline, each of these three questions would independently re-derive revenue, flour cost, and the arithmetic chain. Across three queries that is the same derivation done three times, each on the user's clock. The sleep-time pipeline did it once, offline. The amortized cost of the enrichment is one-third of its total, and every live answer came back fast. This is the 2.5x story in miniature: the more questions the bakery's owner asks before the numbers change, the closer the per-query cost drops toward just the lookup.

The failure case is also visible here. If the query is "What color is the bakery's awning?" none of the anticipated derivations help, and the offline compute was wasted. Predictability is what separates the profit case from the awning case.

Where It Breaks

The technique has sharp edges, and the paper is candid about most of them.

Unpredictable queries waste the offline pass. The central finding from the authors' analysis is that the benefit of sleep-time compute correlates strongly with how predictable the query is from the context. When the question is something the model could reasonably anticipate, enrichment pays off. When it is orthogonal to everything the model precomputed, the offline tokens are pure overhead, and a standard pipeline that reasons on demand would have been cheaper. There is no free lunch for genuinely surprising questions.

Stale context is wrong context. Enriched context is a cache of reasoning, and like any cache it can go stale. If the underlying document, code, or user state changes after enrichment, the precomputed derivations may be subtly wrong, which is more dangerous than missing, because they read as authoritative. Production systems need an invalidation story, the Stale transition in the state diagram above, and that story is application-specific.

The offline pass can hallucinate into the context. Because \(c'\) contains model-generated content presented as established fact, errors in the anticipatory pass get baked in and then trusted at test time. A wrong intermediate quantity computed during sleep becomes a confident wrong premise during the live answer. The enrichment step inherits every reasoning failure mode of the base model, and now those failures persist.

It assumes a clean context-query split. Many real prompts do not factor cleanly. A conversational turn where the question reframes the context, or a task where the relevant context depends on the question, resists the decomposition the method requires. The benchmarks were deliberately constructed to have a stable shared context, which is the friendly case.

[IMAGE: Heatmap of benefit versus query predictability and number of queries per context, two-axis grid shaded from rose for wasted-effort cells to emerald for high-payoff cells, with the bakery worked example plotted as a point]

Idle capacity is not free in every deployment. The method trades latency for throughput by assuming there is spare compute during idle periods. On a maxed-out cluster or a serverless cost model where you pay per invocation, "idle" compute still has a price, and the elegant timeline-shift becomes a straightforward extra bill that only the amortization math can justify.

Alternative Designs

Sleep-time compute is one of several ways to avoid paying full reasoning cost on every request. They are not mutually exclusive, and the right system often layers them.

Approach Strengths Weaknesses Best when
Pure test-time scaling Maximal accuracy, no staleness, no anticipation needed Full latency and cost on every query Queries are rare, unpredictable, or one-off
Prompt caching Large cost and latency cuts, exact and safe Only helps identical prefixes, reuses computation not reasoning High-volume identical system prompts
Retrieval-augmented generation Scales to huge corpora, fresh by construction Recomputes reasoning each call, retrieval can miss Knowledge too large for context
Sleep-time compute Cuts test-time compute about 5x, amortizes across queries Wasted on unpredictable queries, staleness risk Stable context, many predictable queries
Fine-tuning on the context Fastest inference, knowledge in weights Expensive, slow to update, per-context impractical Context is fixed and queried at massive scale

Prompt caching is the closest cousin and the easiest to confuse with sleep-time compute, so the distinction is worth drawing precisely. Anthropic's prompt caching, launched in August 2024, lets a cached prefix be re-read at roughly 10% of the normal input-token price, with a one-time write premium of about 25%, yielding up to a 90% cost reduction for heavily reused prefixes (Anthropic, 2024). But it reuses the raw key-value tensors of an identical prefix. Change one token and the cache misses. It never produces a fact that was not already in the prompt. Sleep-time compute reuses reasoning, generates content that was never in \(c\), and tolerates query variation because the enrichment is semantic rather than positional. The two compose well: cache the enriched context's prefix and you get both savings at once.

How It Is Used in Practice

The most natural home for sleep-time compute is a stateful agent, which is exactly the product space Letta operates in. An agent with persistent memory already stores a durable context per user or per task. Adding a background worker that reasons over that memory during idle periods is an incremental change to an architecture that already separates storage from inference, the lineage that runs from MemGPT to today's agent runtimes.

The paper closes with a case study on a realistic agentic software-engineering task, where the "context" is a codebase and the queries are issues to resolve. Anticipating likely questions about a code change, which tests it affects, which call sites matter, before the developer asks, is precisely the kind of bounded, predictable reasoning that enrichment captures well. This is a more honest demonstration than the math benchmarks because the context is large and messy and the staleness problem is real, since code changes constantly.

For an engineer evaluating whether to adopt it, the decision reduces to three questions. Is there a stable context that many queries share? Are those queries predictable enough that anticipation lands more often than it misses? Is there genuinely idle or cheap off-peak capacity to do the work? Three yes answers make sleep-time compute compelling. A single firm no usually means a simpler caching or retrieval layer will serve better.

[IMAGE: Decision-flow schematic with three diamonds, "stable shared context", "predictable queries", "idle capacity available", routing to either "adopt sleep-time compute" or "use caching or RAG instead"]

Insights Worth Remembering

  • Latency is the real currency. The technique only makes economic sense because a token spent while the user waits is worth far more than a token spent while the system is idle. Sleep-time compute is best understood as arbitrage between those two prices.
  • Statefulness is a precondition, not a detail. None of this works on a stateless request. The moment context persists and can be mutated between requests, a whole class of offline optimizations opens up, and this is one of them.
  • Enrichment adds, summarization subtracts. The offline pass is valuable precisely because it writes down facts that were never stated, not because it compresses. Confusing the two leads to building the wrong thing.
  • Amortization is the multiplier. A single query barely benefits. The economics improve linearly with the number of queries per context, so the technique rewards systems where contexts are long-lived and heavily revisited.
  • Predictability is measurable and decisive. The correlation between query predictability and benefit means you can estimate, before deploying, whether sleep-time compute will help on your traffic by asking how guessable your queries are.
  • A reasoning cache can be confidently wrong. Storing derived reasoning as fact means storing its errors as fact too. The safety profile is different from a verbatim cache, and that difference deserves explicit handling.

Open Questions

Several threads remain genuinely unsettled.

The first is what to precompute. The paper anticipates queries with a general instruction, but a learned policy that predicts the distribution of likely queries from a context could target the offline budget far better. Whether that policy is best learned, retrieved from query logs, or reasoned out on the fly is open.

The second is invalidation. Treating enriched context as a cache raises the classic hard problem of knowing when it is stale. For code and live data this is acute, and the field does not yet have a clean, general answer beyond application-specific heuristics. This is an engineering problem the evidence has not resolved; it is currently handled case by case.

The third is the interaction with continued test-time scaling. Sleep-time and test-time compute are different axes, and the paper shows they trace different frontiers, but the optimal split of a fixed total budget between offline and online, as a function of predictability and amortization, is not yet characterized. A compute-optimal theory for the two-axis case, in the spirit of what Snell et al. did for the single axis, would be valuable and does not yet exist.

Finally, there is a speculative but appealing direction: continuous, self-directed sleep-time reasoning, where an agent decides for itself what to think about during downtime, closer to consolidation than to query anticipation. That blurs into memory and self-improvement research and is, for now, more aspiration than result.

Sources and Further Reading

Foundational Papers

Important Follow-up Work

Technical Blogs

Additional Resources

Sign in to save and react.
Share Copied

Related reading