Agent Memory Systems: Episodic, Semantic, and the Architecture of Remembering

A coding agent fixes a flaky test for you on Monday. On Thursday you ask it to fix a similar one, and it starts from zero, re-deriving the same diagnosis, asking the same questions, making the same wrong first guess. Nothing is broken. The model did exactly what it was built to do: it answered from the tokens in front of it, and on Thursday those tokens did not include Monday. The context window is not memory. It is a workbench that gets wiped clean between sessions.

The instinct is to make the workbench bigger. Frontier models now advertise context windows of hundreds of thousands to a million tokens, and the implicit promise is that a large enough window makes memory a non-problem: just keep everything. That promise breaks on two facts. First, attention degrades inside long contexts; models systematically lose information placed in the middle of a long input (Liu et al., 2023, Lost in the Middle, arXiv:2307.03172). Second, even a million tokens is a few weeks of heavy use, after which you are back to deciding what to keep. A larger window is a larger desk, not a longer memory.

Why this matters: The difference between an agent that feels like a tool and one that feels like a colleague is almost entirely a memory-architecture decision, not a model-capability one. Two agents on the same base model behave like different products depending on what they write down and what they recall.

TL;DR

The context window is working memory, not long-term memory. Long-term memory is a separate subsystem that writes experiences to external storage and retrieves a small relevant slice back into context on demand.
Borrowing from cognitive science, useful agent memory splits into episodic (what happened, timestamped events) and semantic (distilled facts and preferences), plus procedural (learned how-to). They are stored, summarized, and retrieved differently.
Reflection (also called rollup or consolidation) is the step that turns a flood of raw events into compact, reusable knowledge. Ablations show it is load-bearing, not cosmetic: in the Generative Agents study, removing it made agents measurably less believable (Park et al., 2023, arXiv:2304.03442).
Retrieval is not pure vector search. The durable recipe scores memories by a weighted blend of recency, relevance, and importance, which fixes failure modes that cosine similarity alone produces.
Memory is not RAG. RAG reads a fixed external corpus; agent memory writes its own corpus from its own history, which adds consistency, staleness, and conflict-resolution problems RAG never faces.
Production systems report large wins from getting this right: Mem0 reports 91.6 on the LoCoMo benchmark while cutting token cost and p95 latency by roughly 90% versus stuffing the full history into context (Chhikara et al., 2025, Mem0, arXiv:2504.19413).

At a Glance

The whole system is a loop: an agent acts, writes selected traces to memory, periodically consolidates them, and on the next turn retrieves a small relevant set back into the prompt.

flowchart LR
  U[User turn] --> WM[Working memory<br/>context window]
  WM --> ACT[Agent acts]
  ACT --> WR[Write selected events]
  WR --> EP[Episodic store]
  EP --> REF[Reflection<br/>consolidation]
  REF --> SM[Semantic store]
  EP --> RET[Retrieval]
  SM --> RET
  RET --> WM
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
  class U blue
  class WM,ACT slate
  class WR,REF,RET purple
  class EP,SM teal

The diagram hides the hard parts, which are the three purple boxes: deciding what to write, deciding how to consolidate, and deciding what to retrieve. The rest of this article is about those three decisions.

[IMAGE: Anatomy figure of the three memory types stacked as labelled cards (episodic, semantic, procedural), each card showing one example entry and its storage backend, with a caption noting how each is summarized and retrieved differently]

[IMAGE: Annotated schematic of the memory loop, with each purple stage (write/consolidate/retrieve) expanded into a callout listing its one hard question, laid over the flowchart above]

Before Persistent Memory

Early LLM applications had no memory abstraction at all. A chatbot kept the last few turns in the prompt and dropped everything older once the window filled. The first patch was a sliding window: keep the most recent \(k\) tokens, discard the rest. This is amnesia with a delay.

The next step was conversation summarization: when the window fills, ask the model to compress older turns into a paragraph and prepend that. LangChain shipped this pattern widely under names like ConversationSummaryMemory. It helps, but a single rolling summary is lossy in a way you cannot control; a fact mentioned once forty turns ago is averaged into oblivion.

The conceptual leap came from treating memory the way an operating system treats RAM. MemGPT framed the context window as physical memory and an external database as disk, with the agent itself issuing function calls to page information in and out (Packer et al., 2023, MemGPT: Towards LLMs as Operating Systems, arXiv:2310.08560). Around the same time, the Generative Agents work showed that believable long-running agents need more than storage; they need a memory stream plus a reflection mechanism that synthesizes higher-level observations from raw events (Park et al., 2023, Generative Agents, arXiv:2304.03442). These two papers fixed the vocabulary the field still uses.

timeline
  title Evolution of Agent Memory
  2022 : Sliding-window chat history
       : Drop oldest tokens
  2023 : Rolling summaries (LangChain)
       : MemGPT OS-style paging
       : Generative Agents memory stream and reflection
  2024 : LoCoMo and LongMemEval benchmarks
       : Recency relevance importance retrieval matures
  2025 : Agentic memory (A-MEM) self-organizing notes
       : Mem0 production memory layer
  2026 : Memory as a first-class product surface

The shift from 2022 to 2025 is a shift in where the intelligence lives. In 2022 memory was a buffer; by 2025 memory had become an active subsystem that decides, writes, links, and forgets on its own.

How Agent Memory Actually Works

Human memory research has long separated episodic memory (specific events, tied to a time and place) from semantic memory (general facts abstracted away from any single event). The distinction is usually credited to Endel Tulving's 1972 chapter "Episodic and Semantic Memory," and it maps cleanly onto what agents need. Episodic memory is "on 2026-06-15 the user said the staging deploy failed with an OOM error." Semantic memory is "the user's staging cluster is memory-constrained." The second is a distillation of the first plus several other events, and it is the one you usually want to retrieve.

A third category, procedural memory, holds learned procedures: the sequence of steps that worked last time, the tool-call recipe for a recurring task. Agents increasingly store these as reusable skills rather than re-deriving them.

The storage tiers

A working memory system is tiered, and the tiers differ in latency, capacity, and persistence.

graph TD
  subgraph Fast[In-context, volatile]
    WM[Working memory<br/>current prompt]
    SC[Scratchpad<br/>chain of thought]
  end
  subgraph Slow[External, durable]
    EP[Episodic store<br/>timestamped events]
    SM[Semantic store<br/>facts and profile]
    PR[Procedural store<br/>skills and recipes]
    VEC[(Vector index)]
    KV[(Key-value and graph)]
  end
  WM --> EP
  EP --> VEC
  SM --> VEC
  SM --> KV
  PR --> KV
  VEC --> WM
  KV --> WM
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  class WM,SC slate
  class EP,SM,PR teal
  class VEC,KV purple

Working memory is the prompt itself: fast to read, expensive per token, wiped between sessions. The external tiers are slow to query but effectively unbounded and durable. The job of the memory system is to move the right few kilobytes across that boundary at the right moment.
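To make the tiers concrete, here is a minimal sketch of what a record in the external tier might look like. The names `MemoryType`, `MemoryRecord`, and `MemoryStore` are hypothetical, the 0-to-1 importance scale is an assumption, and a plain dict stands in for the real database and vector index.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class MemoryType(Enum):
    EPISODIC = "episodic"      # timestamped events
    SEMANTIC = "semantic"      # distilled facts and preferences
    PROCEDURAL = "procedural"  # reusable skills and recipes


@dataclass
class MemoryRecord:
    id: str
    type: MemoryType
    content: str
    importance: float          # assumed scale: 0.0 to 1.0
    created_at: datetime
    last_accessed: datetime
    supports: list[str] = field(default_factory=list)  # ids of source records


class MemoryStore:
    """Stand-in for the durable tier: a dict here, a database in practice."""

    def __init__(self) -> None:
        self._records: dict[str, MemoryRecord] = {}

    def write(self, record: MemoryRecord) -> None:
        self._records[record.id] = record

    def by_type(self, type_: MemoryType) -> list[MemoryRecord]:
        return [r for r in self._records.values() if r.type is type_]


now = datetime(2026, 6, 15, tzinfo=timezone.utc)
store = MemoryStore()
store.write(MemoryRecord("e1", MemoryType.EPISODIC,
                         "staging worker pod OOMKilled during batch job",
                         0.8, now, now))
store.write(MemoryRecord("s1", MemoryType.SEMANTIC,
                         "User's staging cluster is memory-constrained",
                         0.9, now, now, supports=["e1"]))
print([r.id for r in store.by_type(MemoryType.SEMANTIC)])  # ['s1']
```

Keeping `supports` on every derived record costs little at write time and is what later makes provenance checks and deletion tractable.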

The write decision

Not every token deserves to be remembered. Writing everything reproduces the long-context problem one layer down: a bloated store whose retrieval is noisy. Writing too little loses the fact you will need on Thursday.

Practical systems use an extraction step. After a turn or a session, a model pass decides what is memory-worthy and rewrites it into a clean, self-contained statement. Mem0's pipeline, for instance, extracts candidate facts from a conversation and then runs an update operation that compares each candidate against existing memories, choosing among add, update, delete, or no-op (Chhikara et al., 2025, arXiv:2504.19413). That update step is what keeps the store from accumulating five slightly different versions of the same fact.

sequenceDiagram
  participant A as Agent
  participant E as Extractor
  participant M as Memory store
  participant J as Conflict resolver
  A->>E: Session transcript
  E->>E: Pull candidate facts
  E->>M: Query similar existing memories
  M-->>J: Top matches
  J->>J: add, update, delete, or skip
  J->>M: Commit decision
  Note over J,M: Dedup and conflict handling<br/>happen at write time, not read time

Doing conflict resolution at write time is a deliberate tradeoff. It makes writes more expensive (an extra model call and a retrieval), but it keeps reads cheap and consistent. The alternative, writing blindly and resolving conflicts at read time, pushes cost and ambiguity onto every single query.
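The write-time decision can be sketched in a few lines. This is a toy under loud assumptions: the `Fact` shape and the `decide` and `commit` helpers are hypothetical, and an exact match on (subject, attribute) stands in for the similarity search and model judgment a real system such as Mem0 would use. The delete operation is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str    # e.g. "user"
    attribute: str  # e.g. "employer"
    value: str


Store = dict[tuple[str, str], Fact]


def decide(candidate: Fact, store: Store) -> str:
    """Return 'add', 'update', or 'noop' for one candidate fact."""
    existing = store.get((candidate.subject, candidate.attribute))
    if existing is None:
        return "add"
    if existing.value == candidate.value:
        return "noop"      # already known: do not store a duplicate
    return "update"        # same slot, new value: overwrite the stale fact


def commit(candidate: Fact, store: Store) -> str:
    op = decide(candidate, store)
    if op in ("add", "update"):
        store[(candidate.subject, candidate.attribute)] = candidate
    return op


store: Store = {}
ops = [
    commit(Fact("user", "employer", "Acme"), store),
    commit(Fact("user", "employer", "Acme"), store),
    commit(Fact("user", "employer", "Globex"), store),
]
print(ops)  # ['add', 'noop', 'update']
```

After the three writes the store holds exactly one employer fact, which is the property the paragraph above is paying for: reads never have to choose between versions.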

Reflection: turning events into knowledge

Raw episodic memory grows linearly with interaction and is mostly low-value. Reflection is the consolidation pass that reads a batch of episodes and writes back higher-order conclusions. In Generative Agents, the system periodically pauses, retrieves the most salient recent memories, asks the model what high-level questions they raise, answers those questions, and stores the answers as new memories with citations to the supporting events (Park et al., 2023, arXiv:2304.03442). A pile of observations like "Klaus is reading at the library again" becomes the durable note "Klaus is deeply engaged in his research project."

This is the step people are tempted to cut, and it is the step that matters most. The Generative Agents ablation is blunt: removing reflection caused agent behavior to degrade noticeably in believability, because without consolidation the agent only ever sees disconnected fragments and never the pattern they form.
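A consolidation pass can be sketched as a function that picks salient episodes, asks a summarizer for a higher-level note, and records which events support it. This is a simplified illustration, not the Generative Agents procedure: `reflect` and `fake_summarize` are hypothetical names, salience is approximated by importance alone, and the summarizer is a string-joining stand-in for a model call.

```python
from typing import Callable

Episode = dict  # {"id": str, "content": str, "importance": float}


def reflect(
    episodes: list[Episode],
    summarize: Callable[[list[str]], str],
    top_n: int = 3,
) -> dict:
    """Consolidate the most important episodes into one semantic note."""
    salient = sorted(episodes, key=lambda e: e["importance"], reverse=True)[:top_n]
    note = summarize([e["content"] for e in salient])
    return {
        "type": "semantic",
        "content": note,
        "supports": [e["id"] for e in salient],  # provenance for audit and deletion
    }


def fake_summarize(contents: list[str]) -> str:
    # Stand-in for a model call; a real system would prompt an LLM here.
    return f"Pattern across {len(contents)} events: " + "; ".join(contents)


episodes = [
    {"id": "e0", "content": "slow query on staging", "importance": 0.6},
    {"id": "e1", "content": "worker pod OOMKilled on staging", "importance": 0.8},
    {"id": "e2", "content": "asked a CSS question", "importance": 0.2},
]
note = reflect(episodes, fake_summarize, top_n=2)
print(note["supports"])  # ['e1', 'e0']
```

Passing the summarizer in as a parameter keeps the consolidation logic testable without a model, and makes it easy to swap in a stricter prompt later.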

[IMAGE: Before/after panel showing a raw episodic stream of 12 timestamped observations on the left collapsing into 3 reflected semantic notes on the right, with arrows showing which events support which note]

Retrieval: more than cosine similarity

Once memories exist, the question is which handful to load into the next prompt. Naive vector search retrieves the \(k\) memories whose embeddings are closest to the query. That fails in predictable ways: it returns a memory that is topically similar but stale, or it ignores a recent critical event because the query happened to be phrased differently.

The durable fix, again from Generative Agents, scores each memory by a weighted sum of three signals:

\[\text{score} = \alpha \cdot \text{recency} + \beta \cdot \text{relevance} + \gamma \cdot \text{importance}\]

Recency is an exponential decay over time since last access, so old memories fade unless re-touched. Relevance is embedding similarity between query and memory. Importance is a self-assessed score the model assigns when the memory is created (a mundane "ate breakfast" rates low; "had a fight with a roommate" rates high). The three are normalized and combined, and the top results are loaded. This is closer to how human recall works than pure similarity, and it removes the most jarring retrieval failures.
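The scoring rule translates almost directly into code. The sketch below is one possible reading, with several assumptions: the decay rate of 0.995 per hour, unit weights, min-max normalization, and the two-dimensional toy embeddings are all illustrative choices, and `rank`, `min_max`, and `cosine` are hypothetical helper names.

```python
import math


def min_max(values: list[float]) -> list[float]:
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, memories, now_hours, decay=0.995, weights=(1.0, 1.0, 1.0), k=2):
    """Return the ids of the top-k memories by the weighted sum of three signals."""
    recency = min_max([decay ** (now_hours - m["last_access_hours"]) for m in memories])
    relevance = min_max([cosine(query_vec, m["embedding"]) for m in memories])
    importance = min_max([m["importance"] for m in memories])
    a, b, g = weights
    scored = [
        (a * r + b * s + g * i, m["id"])
        for m, r, s, i in zip(memories, recency, relevance, importance)
    ]
    return [mid for _, mid in sorted(scored, reverse=True)[:k]]


memories = [
    {"id": "stale-but-similar", "embedding": [1.0, 0.0], "importance": 3, "last_access_hours": 0},
    {"id": "recent-critical", "embedding": [0.8, 0.6], "importance": 9, "last_access_hours": 700},
    {"id": "recent-trivia", "embedding": [0.0, 1.0], "importance": 1, "last_access_hours": 710},
]
print(rank([1.0, 0.0], memories, now_hours=720))
# ['recent-critical', 'stale-but-similar']
```

Pure cosine similarity would have ranked the stale memory first, since its embedding matches the query exactly. The blended score puts the recent, important memory ahead of it.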

flowchart TD
  Q[Query] --> EMB[Embed query]
  EMB --> CAND[Candidate memories]
  CAND --> R1[Recency score<br/>exp decay]
  CAND --> R2[Relevance score<br/>cosine sim]
  CAND --> R3[Importance score<br/>self-rated]
  R1 --> SUM[Weighted sum]
  R2 --> SUM
  R3 --> SUM
  SUM --> TOPK[Top-k into context]
  TOPK --> ANS[Answer]
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  class Q,EMB blue
  class R1,R2,R3,SUM purple
  class TOPK,ANS teal

Newer systems push organization further. A-MEM builds a self-organizing network of memory notes inspired by the Zettelkasten method: each new memory generates a structured note with keywords and tags, the system links it to related existing notes, and adding a memory can trigger updates to the notes it connects to (Xu et al., 2025, A-MEM: Agentic Memory for LLM Agents, arXiv:2502.12110). Retrieval then traverses links, not just nearest neighbors, which helps multi-hop questions where the answer is two associations away.
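Link traversal can be illustrated with a small breadth-first expansion from the notes a similarity search returns. This is not A-MEM's actual algorithm; `expand_via_links`, the note ids, and the adjacency map are hypothetical, and a real system would also score and prune the expanded set.

```python
def expand_via_links(seeds: list[str], links: dict[str, set[str]], hops: int = 1) -> list[str]:
    """Breadth-first expansion of seed note ids along note-to-note links."""
    seen = list(seeds)
    frontier = list(seeds)
    for _ in range(hops):
        next_frontier = []
        for note_id in frontier:
            # sorted() makes the traversal order deterministic
            for neighbor in sorted(links.get(note_id, ())):
                if neighbor not in seen:
                    seen.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return seen


links = {
    "report-job": {"staging-limits"},
    "staging-limits": {"report-job", "prefers-config-fixes"},
    "prefers-config-fixes": {"staging-limits"},
}
print(expand_via_links(["report-job"], links, hops=2))
# ['report-job', 'staging-limits', 'prefers-config-fixes']
```

The third note shares no direct link with the seed, so nearest-neighbor search from "report-job" alone could miss it. Two hops along links reach it.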

By the Numbers

Memory systems are evaluated on long-horizon conversational benchmarks. LoCoMo contains conversations averaging about 27 sessions and 600 turns, with question types spanning single-hop, multi-hop, temporal, open-domain, and adversarial (Maharana et al., 2024, Evaluating Very Long-Term Conversational Memory, arXiv:2402.17753). LongMemEval adds 500 curated questions across five abilities and reports that commercial assistants drop roughly 30% in accuracy on information that must be recalled across sustained interaction (Wu et al., 2024, LongMemEval, arXiv:2410.10813).

The headline production result comes from Mem0, which reports the following on LoCoMo relative to a full-context baseline that stuffs the entire history into the prompt. The cost and latency figures are the practical argument for memory: you get comparable or better accuracy while reading a fraction of the tokens.

Metric	Mem0	Full-context baseline	Source
LoCoMo overall (J score)	91.6	lower across categories	Chhikara et al., 2025
p95 latency	~91% lower	baseline	Chhikara et al., 2025
Token cost per query	~90% lower	baseline	Chhikara et al., 2025
LongMemEval (reported)	94.8	lower	Chhikara et al., 2025

Treat single-vendor benchmark numbers as the vendor's measurement, not a settled fact; LoCoMo and LongMemEval scores move with prompt details and judge models. The directional claim is robust across independent work, though: selective memory plus retrieval beats dumping the full history, on both quality and cost.

The cost asymmetry is worth making concrete in complexity terms.

Strategy	Tokens read per query	Write cost	Recall quality
Full history in context	\(O(n)\) in history length	none	degrades for mid-context content
Sliding window	\(O(w)\), fixed	none	loses anything older than \(w\)
Single rolling summary	\(O(1)\) summary	\(O(1)\) per fill	lossy, uncontrolled
Retrieval over memory	\(O(k)\), fixed small	\(O(1)\) extract per turn	strong if write and retrieval are good

The bottom row is the only one whose read cost stays flat as history grows while recall can stay high, provided the write and retrieval steps are good. That combination is why most production systems build on it.
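The complexity table can be turned into rough arithmetic. Every number below is an assumption chosen for illustration (an 8,000-token window, a 500-token summary, five retrieved memories of about 60 tokens each), and `tokens_read` is a hypothetical helper, not a measurement of any real system.

```python
def tokens_read(strategy: str, history_tokens: int, *, window: int = 8_000,
                summary: int = 500, k: int = 5, memory_tokens: int = 60) -> int:
    """Tokens of remembered material read per query under each strategy."""
    if strategy == "full_history":
        return history_tokens                          # O(n): grows with history
    if strategy == "sliding_window":
        return min(history_tokens, window)             # O(w): capped at the window
    if strategy == "rolling_summary":
        return min(history_tokens, summary)            # O(1): one summary
    if strategy == "retrieval":
        return min(history_tokens, k * memory_tokens)  # O(k): k small memories
    raise ValueError(f"unknown strategy: {strategy}")


strategies = ("full_history", "sliding_window", "rolling_summary", "retrieval")
for n in (10_000, 100_000, 1_000_000):
    print(n, {s: tokens_read(s, n) for s in strategies})
```

Under these assumptions the retrieval strategy reads 300 tokens per query at every history length, while full history reads the entire history each time.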

[IMAGE: Log-scale line chart, tokens-read-per-query versus conversation length, four lines (full history, sliding window, rolling summary, retrieval) showing the retrieval line staying flat while full-history rises linearly]

A Concrete Example

Walk a coding assistant across three sessions and watch the memory state change.

Session 1 (Monday). The user says: "Our staging deploy keeps dying. Logs show OOMKilled on the worker pod." The agent diagnoses a memory leak in a batch job, suggests lowering batch size, and it works. At session end the extractor writes two memories:

id	type	content	importance
e1	episodic	2026-06-15: staging worker pod OOMKilled during batch job	8
s1	semantic	User's staging cluster is memory-constrained; batch jobs risk OOM	9

Session 2 (Tuesday). The user asks an unrelated CSS question. The extractor adds one low-importance episodic memory (importance 2) and writes nothing semantic. This week's sessions have now contributed three memories. Nothing about Monday is retrieved, because nothing about Monday is relevant: the relevance of s1 to a CSS query is near zero. Its recency and importance alone would still earn it a nonzero weighted score, so this outcome assumes the retriever also applies a minimum-relevance cutoff, a common safeguard.

Reflection (overnight). A consolidation pass reviews recent history. It notices e1 and e0, an earlier event about a slow query that was recorded before this walkthrough begins, and synthesizes a new semantic note:

id	type	content	supports
s2	semantic	User repeatedly hits resource limits on staging; prefers config fixes over scaling up	e1, e0

Session 3 (Thursday). The user says: "The nightly report job just failed again on staging." The query embeds close to s1 and s2. Retrieval scoring:

memory	recency	relevance	importance	weighted score
s2	0.7	0.81	0.9	0.80
s1	0.5	0.79	0.9	0.73
e1	0.4	0.74	0.8	0.65
css note	0.6	0.05	0.2	0.28

The top three load into context. The agent opens with "Given staging's memory limits and that you prefer config fixes, let me check the report job's batch size before suggesting a bigger node." That single sentence, the thing that makes the agent feel like it remembers you, is produced entirely by the memory layer. The base model never changed.

[IMAGE: Annotated retrieval-scoring table rendered as a heatmap, rows are the four candidate memories and columns are recency/relevance/importance/weighted-score, cells shaded by value, with the winning row s2 outlined to show the reflected note beating its source event]

Notice what the weights did. The CSS note was written on Tuesday, so its recency is fairly high, but its near-zero relevance keeps it out of the top three. And s2, the reflected note, outscores the raw event e1 it was derived from, because consolidation produced a note that is fresher, more relevant to the query, and rated more important. That is reflection earning its place.
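The Session 3 ranking can be checked with a few lines of arithmetic. Equal weights are an assumption here: the table does not state its weights, but a plain average of the three signals reproduces the s2, s1, and e1 rows. The `weighted` helper is a hypothetical name.

```python
# (recency, relevance, importance) for each candidate, taken from the table
candidates = {
    "s2": (0.7, 0.81, 0.9),
    "s1": (0.5, 0.79, 0.9),
    "e1": (0.4, 0.74, 0.8),
    "css note": (0.6, 0.05, 0.2),
}


def weighted(signals, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three signals; equal weights are an assumption."""
    return sum(w * s for w, s in zip(weights, signals))


scores = {name: round(weighted(sig), 2) for name, sig in candidates.items()}
top3 = sorted(scores, key=scores.get, reverse=True)[:3]
print(top3)  # ['s2', 's1', 'e1']
```

Changing the weights changes the product. Doubling the recency weight, for example, narrows the gap between the CSS note and e1, which is the kind of tuning decision the ranking hides.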

Where It Breaks

Memory systems fail in ways that are subtle precisely because the agent stays fluent while being wrong.

Stale memory. A fact written months ago ("user works at Acme") persists after it stops being true. Without an update or decay mechanism, retrieval keeps surfacing it confidently. This is why write-time conflict resolution and recency-weighting exist, and why a pure append-only vector store is a trap.

Contradiction. Two memories disagree ("prefers TypeScript" and "switched the team to Go"), and naive retrieval may load both, leaving the model to guess. LongMemEval isolates exactly this as the knowledge updates ability, and it is one of the abilities where systems lose accuracy (Wu et al., 2024, arXiv:2410.10813).

[IMAGE: Grouped bar chart of benchmark accuracy by question category (single-hop, multi-hop, temporal, knowledge-update, adversarial), one bar group per memory strategy, annotated to highlight the steep drop every system shows on the knowledge-update and temporal categories]

Over-retrieval. Loading too many memories reintroduces the lost-in-the-middle problem the system was built to avoid; the relevant note gets buried among marginally relevant ones. More retrieved context is not more recall.

Reflection drift. Consolidation is itself a model call, so it can hallucinate a generalization the events do not support, then store it as a confident semantic fact. A wrong reflection is more dangerous than a wrong episode because it is phrased as settled knowledge and gets retrieved preferentially.

Privacy and deletion. Persistent memory means a system now retains user data across sessions, which raises real obligations: a user asking to be forgotten must propagate through episodic stores, semantic distillations, and any reflections derived from them. Deleting e1 does not delete the semantic note s2 that quietly encoded the same fact.
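Propagating a deletion is only possible if derived notes record their sources. The sketch below assumes each record carries a `supports` list of source ids, as in the reflection table earlier. `cascade_delete` is a hypothetical helper, and it implements the most conservative policy: any note derived from a deleted record is deleted too.

```python
def cascade_delete(root_id: str, store: dict[str, dict]) -> set[str]:
    """Delete root_id and every record that transitively cites it in `supports`."""
    doomed = {root_id}
    changed = True
    while changed:  # repeat until no new dependents are found
        changed = False
        for rec_id, rec in store.items():
            if rec_id not in doomed and doomed & set(rec.get("supports", ())):
                doomed.add(rec_id)
                changed = True
    for rec_id in doomed:
        store.pop(rec_id, None)
    return doomed


store = {
    "e0": {"content": "slow query on staging", "supports": []},
    "e1": {"content": "worker pod OOMKilled", "supports": []},
    "s2": {"content": "user repeatedly hits resource limits", "supports": ["e1", "e0"]},
}
removed = cascade_delete("e1", store)
print(sorted(removed), sorted(store))  # ['e1', 's2'] ['e0']
```

A gentler alternative is to re-run reflection over the surviving sources instead of deleting the derived note outright. Either way, the decision needs the provenance links to exist.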

Alternative Designs

There is no single architecture; the choices trade simplicity against capability.

Approach	Strengths	Weaknesses	Best when
Full context, no memory	trivial, no infra, no staleness	\(O(n)\) cost, mid-context loss, capped by window	sessions are short and self-contained
Rolling summary	cheap, simple, one extra call	lossy, cannot recover dropped detail	casual chat, low stakes
Vector RAG over history	strong recall, mature tooling	no consolidation, stale and conflicting entries	factual recall dominates, facts rarely change
Recency-relevance-importance	human-like recall, robust ranking	needs tuning of weights and decay	long-running personal or assistant agents
Self-organizing graph (A-MEM)	multi-hop, evolving structure	more moving parts, higher write cost	complex reasoning over linked history
OS-style paging (MemGPT)	unbounded effective context, agent-controlled	agent must learn to manage memory well	document and long-session analysis

A useful way to read this table: the top rows optimize for simplicity, the bottom rows for fidelity over long horizons. Most production systems land in the middle, combining a vector store with recency-relevance-importance scoring and a periodic reflection job, then add graph links only where multi-hop questions justify the complexity.

[IMAGE: 2x3 comparison grid of the six architectures, each cell a small schematic of its data flow, color-coded by read cost (green cheap to red expensive) and recall fidelity (border thickness)]

How It Is Used in Practice

The pattern has crossed from research into shipped products. ChatGPT's memory feature and similar assistant memories store user facts across sessions and surface them later. Their internals are not fully public, but the observable behavior matches the episodic-plus-semantic split described here, wrapped in user-facing controls to view and delete entries. Mem0 and Zep offer memory as a managed layer that sits between an app and the model, handling extraction, storage, and retrieval so each product does not rebuild it (Chhikara et al., 2025, arXiv:2504.19413). Frameworks such as LangGraph and LlamaIndex ship memory modules that implement summary, vector, and entity-based recall as composable pieces.

The engineering considerations that dominate at scale are unglamorous. Write amplification: every turn potentially triggers an extraction call and a retrieval, so memory roughly doubles the model calls per interaction unless you batch consolidation offline. Index maintenance: vectors produced by different embedding models are not comparable, so changing the embedding model forces a re-index of the whole store. Multi-tenancy: memories must be strictly partitioned per user, and a retrieval bug that crosses that boundary is a data-leak incident, not a quality regression. Cost attribution: the cheapest memory is the one you never write, so aggressive importance thresholds at write time pay off more than clever retrieval later.
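Two of these concerns, tenant partitioning and embedding-model changes, can be enforced structurally rather than by convention. The sketch below is a toy: `TenantScopedIndex` and the model names `embed-v1` and `embed-v2` are hypothetical, and a real deployment would apply the same keys as filters or namespaces in its vector database.

```python
class TenantScopedIndex:
    """Toy index keyed by (user_id, embedding_model); no cross-tenant reads."""

    def __init__(self, embedding_model: str) -> None:
        self.embedding_model = embedding_model
        self._rows: dict[tuple[str, str], list[dict]] = {}

    def add(self, user_id: str, memory: dict) -> None:
        self._rows.setdefault((user_id, self.embedding_model), []).append(memory)

    def candidates(self, user_id: str) -> list[dict]:
        # Only rows for this user AND the current embedding model are visible.
        return list(self._rows.get((user_id, self.embedding_model), []))

    def needs_reindex(self, user_id: str) -> bool:
        # True if this user still has rows embedded with a different model.
        return any(uid == user_id and model != self.embedding_model and rows
                   for (uid, model), rows in self._rows.items())


index = TenantScopedIndex("embed-v1")  # hypothetical model name
index.add("alice", {"id": "s1", "content": "staging is memory-constrained"})
index.add("bob", {"id": "b1", "content": "prefers Go"})
print([m["id"] for m in index.candidates("alice")])  # ['s1']

index.embedding_model = "embed-v2"     # model change: old vectors are not comparable
print(index.candidates("alice"), index.needs_reindex("alice"))  # [] True
```

Because the user id is part of the lookup key, a query for one user cannot return another user's rows. The same keying makes a pending re-index visible instead of silently mixing vectors from two models.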

[IMAGE: Sequence/timing diagram of one production turn, annotated with the two extra model calls memory adds (extract, retrieve) and where an offline reflection job runs out of band]

Insights Worth Remembering

A bigger context window postpones the memory problem; it does not solve it. The moment your interaction history exceeds the window, you are choosing what to keep, which is memory by another name.
The expensive, intelligent work in a memory system happens at write and consolidation time, not read time. Systems that write carefully read cheaply.
Reflection is the highest-leverage component and the one most often skipped. Raw events are cheap to store and nearly useless to retrieve; distilled notes are what make an agent seem to understand you.
Retrieval ranking is a product decision disguised as a similarity function. The weights on recency, relevance, and importance encode what your agent considers worth remembering.
Memory turns quality bugs into privacy bugs. Once you persist user data across sessions, deletion and contradiction handling are correctness requirements, not nice-to-haves.
The same base model, given a good memory layer versus none, behaves like two different products. Memory is where a lot of perceived intelligence actually lives.

Open Questions

What does an agent forget, and on what schedule? Human memory decays adaptively, keeping what is reinforced and shedding the rest. Exponential recency decay is a crude proxy. Principled forgetting that protects important rare facts while pruning noise is unsolved, and it is an empirical question rather than a purely philosophical one: benchmarks like LongMemEval can measure it.

Can memory be learned rather than engineered? Current systems hand-wire the write, reflect, and retrieve policies. Recent work explores training the management policy itself with reinforcement learning so the agent learns what to remember from outcomes, which would replace the hand-tuned weights with learned ones. This is an active direction, and whether it generalizes beyond benchmark distributions is not yet established.

How should shared memory work across agents? When a fleet of agents serves the same organization, should they share a memory pool, and how do you reconcile contradictory writes from agents with different views? The cross-agent case is largely unexplored relative to the single-agent one.

Where is the boundary between memory and fine-tuning? Procedural memory that an agent reuses thousands of times starts to look like something that should be in the weights. When a learned skill graduates from retrieval to parameters is an open design question, and the answer probably depends on how often the skill is used and how fast it changes.

Sources and Further Reading

Foundational Papers

Tulving, E., 1972, Episodic and Semantic Memory, in Organization of Memory, Academic Press. The original cognitive-science distinction the agent taxonomy borrows from.
Park et al., 2023, Generative Agents: Interactive Simulacra of Human Behavior, arXiv:2304.03442. Memory stream, reflection, and recency-relevance-importance retrieval.
Packer et al., 2023, MemGPT: Towards LLMs as Operating Systems, arXiv:2310.08560. OS-style tiered memory with agent-controlled paging.

Important Follow-up Work

Xu et al., 2025, A-MEM: Agentic Memory for LLM Agents, arXiv:2502.12110. Self-organizing, Zettelkasten-style linked memory notes.
Chhikara et al., 2025, Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory, arXiv:2504.19413. Production memory layer with extract-and-update writes and benchmark results.

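Benchmarks

Maharana et al., 2024, Evaluating Very Long-Term Conversational Memory of LLM Agents, arXiv:2402.17753. The LoCoMo benchmark for multi-session conversational recall.
Wu et al., 2024, LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory, arXiv:2410.10813. Five memory abilities, including knowledge updates.
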
Context

Liu et al., 2023, Lost in the Middle: How Language Models Use Long Contexts, arXiv:2307.03172. Why a bigger window is not a substitute for memory.