Custom AI Agents

Part 3 — Memory

How does an agent remember things — and why is “a vector database of old messages” the wrong answer?

5 min · Updated June 2026

Context is what the agent sees right now. Memory is what survives when the current context is gone — across turns, across sessions, across the agent’s entire operational life. A customer-service agent that forgets your last three conversations is not really an agent. It is a goldfish with tools.

Q3.1 — What is the difference between context and memory?

Context is the live working window: everything the model can see when it makes its current decision. Memory is the persistent store that outlasts any single window. The two interact: memory is queried to populate context, and context is processed to update memory. Conflating them is the root cause of most poorly designed agent memory layers.
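The read/write cycle described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not any framework's API: MemoryStore, build_context, and run_turn are hypothetical names, retrieval is naive keyword overlap, and the extraction step that a real system would hand to a model is passed in as a plain function.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical persistent store: facts keyed by user, outliving any one context."""
    facts: dict[str, list[str]] = field(default_factory=dict)

    def query(self, user_id: str, message: str, limit: int = 3) -> list[str]:
        # Naive relevance: count words shared between the message and each fact.
        words = set(message.lower().split())
        scored = [
            (len(words & set(fact.lower().split())), fact)
            for fact in self.facts.get(user_id, [])
        ]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [fact for _, fact in ranked[:limit]]

    def write(self, user_id: str, fact: str) -> None:
        self.facts.setdefault(user_id, []).append(fact)


def build_context(store, user_id, message):
    """Memory -> context: only the retrieved facts enter the live window."""
    recalled = store.query(user_id, message)
    return "\n".join(["Known facts:", *recalled, "User:", message])


def run_turn(store, user_id, message, extract_fact):
    """One turn: read memory into context, then process the turn back into memory."""
    context = build_context(store, user_id, message)
    new_fact = extract_fact(message)  # assumption: stand-in for a model-driven extractor
    if new_fact:
        store.write(user_id, new_fact)
    return context
```

The point of the split is that the context string is rebuilt and discarded every turn, while the store is the only thing that persists.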

The wrong answer — the one teams reach for first — is to treat memory as a vector database of raw old messages. A good memory layer extracts salient facts, consolidates them (merging, updating, resolving contradictions), and sometimes forgets deliberately. That extraction-and-consolidation loop is the actual product.
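To make the extraction-and-consolidation loop concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the Fact record, the rule-based extract function (a real system would use a model here), the last-write-wins consolidation policy, and the age-based forgetting rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Fact:
    key: str              # what the fact is about, e.g. "my plan"
    value: str            # current best-known value
    updated_at: datetime  # when the fact was last confirmed


def extract(message: str) -> list[tuple[str, str]]:
    """Stand-in extractor: pulls 'key is value' statements out of raw text.
    Purely illustrative; a real extractor would be model-driven."""
    pairs = []
    for sentence in message.split("."):
        if " is " in sentence:
            key, value = sentence.split(" is ", 1)
            pairs.append((key.strip().lower(), value.strip()))
    return pairs


def consolidate(memory: dict[str, Fact], pairs, now: datetime) -> None:
    """Merge new facts in; a newer value for the same key replaces the old one,
    which is the simplest possible way to resolve a contradiction."""
    for key, value in pairs:
        memory[key] = Fact(key, value, now)


def forget(memory: dict[str, Fact], now: datetime, max_age: timedelta) -> None:
    """Deliberate forgetting: drop facts not refreshed within max_age."""
    for key in [k for k, f in memory.items() if now - f.updated_at > max_age]:
        del memory[key]
```

Note what is stored: two short facts, not the raw messages. A customer who says their plan is basic in January and enterprise in February ends up with one fact, not two conflicting transcripts.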

Q3.2 — What does the layered memory architecture look like?

The 2026 architecture is explicitly layered, and the field has converged on vocabulary borrowed from human memory:

Short-term / working memory is the current task’s state. Technically this is checkpointing: the agent’s state is saved at each step so a multi-step task can survive a crash and resume. In the dominant Python stack this is LangGraph’s PostgresSaver and its equivalents. Scope: one thread, one task.
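The checkpoint-and-resume idea can be shown without any framework. The sketch below is not LangGraph's PostgresSaver API; it is a hypothetical Checkpointer backed by SQLite from the Python standard library, with a simulated crash to show the resume path. The table layout and the run_task helper are assumptions made for the example.

```python
import json
import sqlite3


class Checkpointer:
    """Hypothetical thread-scoped checkpointer (SQLite here for brevity)."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT, step INTEGER, state TEXT, "
            "PRIMARY KEY (thread_id, step))"
        )

    def save(self, thread_id: str, step: int, state: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, step, json.dumps(state)),
        )
        self.conn.commit()

    def latest(self, thread_id: str):
        """Return (step, state) of the newest checkpoint, or an empty start state."""
        row = self.conn.execute(
            "SELECT step, state FROM checkpoints WHERE thread_id = ? "
            "ORDER BY step DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else (-1, {"done": []})


def run_task(cp: Checkpointer, thread_id: str, steps: list[str], crash_at=None):
    """Run steps in order, saving state after each; resume after the last checkpoint."""
    last_step, state = cp.latest(thread_id)
    for i in range(last_step + 1, len(steps)):
        if i == crash_at:
            raise RuntimeError(f"simulated crash before step {i}")
        state["done"].append(steps[i])  # stand-in for real work at this step
        cp.save(thread_id, i, state)
    return state
```

Calling run_task a second time with the same thread id skips the steps already recorded, which is the whole point: the scope is one thread and one task, nothing more.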

Long-term memory persists across sessions, scoped to a user or entity. This further splits three ways:

Episodic: specific past interactions (“last Tuesday the customer disputed a charge”).
Semantic: distilled facts and preferences (“this customer is on the enterprise plan, prefers email”).
Procedural: learned behaviours and rules (“for this account type, always escalate refunds over $500”).
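One way to picture the three-way split is as a single per-entity record with three differently shaped fields. This is a sketch of a possible data model, not a schema from any of the frameworks discussed; the class names, field names, and the render method are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Episode:
    """Episodic: a specific past interaction."""
    when: date
    summary: str


@dataclass
class Rule:
    """Procedural: a learned behaviour or rule."""
    condition: str
    action: str


@dataclass
class LongTermMemory:
    """Hypothetical per-entity record; one instance per user or account."""
    entity_id: str
    episodic: list[Episode] = field(default_factory=list)
    semantic: dict[str, str] = field(default_factory=dict)  # distilled facts
    procedural: list[Rule] = field(default_factory=list)

    def render(self) -> str:
        """Flatten the three layers into text that can be placed in context."""
        lines = [f"{k}: {v}" for k, v in self.semantic.items()]
        lines += [f"{e.when.isoformat()}: {e.summary}" for e in self.episodic]
        lines += [f"if {r.condition} then {r.action}" for r in self.procedural]
        return "\n".join(lines)
```

The shapes differ for a reason: episodes are dated and append-only, semantic facts are keyed so they can be overwritten, and procedures are condition/action pairs the agent can apply.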

Q3.3 — What memory frameworks are available, and what are they optimised for?

There is a healthy ecosystem of dedicated memory frameworks in the Python world. The honest summary: they all work, they are optimised for different things, and you cannot trust their published head-to-head numbers. Each vendor benchmarks on a test set that flatters its own design.

Mem0 is a hybrid of vector, graph, and key-value storage that passively extracts facts from conversations. Open-source core, very fast to integrate, strong for personalisation-heavy domains like insurance customer service or retail concierge. It is the best default when speed-to-value matters most. It is weaker on temporal reasoning: questions like “what was true as of last March” are not its strength.

Zep, built on the open-source Graphiti, is a temporal knowledge graph that tracks how facts change over time and when. This is the right choice when temporal validity is central: a clinical history where a diagnosis evolves, a regulatory state that changes, a customer whose plan tier shifted. The tradeoff is that building the graph is expensive and there is often a lag between ingesting a fact and being able to retrieve it.
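The core idea behind temporal validity is that facts are closed off rather than overwritten, so a query can ask what was true on a given date. The sketch below illustrates that idea only; it is not Zep's or Graphiti's data model or API. TemporalFact, TemporalStore, assert_fact, and as_of are hypothetical names, and the store assumes facts arrive in chronological order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TemporalFact:
    subject: str
    attribute: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still true"


class TemporalStore:
    """Hypothetical store: facts are closed off, never overwritten."""

    def __init__(self):
        self.facts: list[TemporalFact] = []

    def assert_fact(self, subject, attribute, value, on: date) -> None:
        # Close the currently open fact for this subject/attribute, if any.
        # Assumption: facts are asserted in chronological order.
        for fact in self.facts:
            if (fact.subject, fact.attribute) == (subject, attribute) and fact.valid_to is None:
                fact.valid_to = on
        self.facts.append(TemporalFact(subject, attribute, value, on))

    def as_of(self, subject, attribute, when: date) -> Optional[str]:
        """Return the value that was valid on the given date, if one is recorded."""
        for fact in self.facts:
            if (fact.subject, fact.attribute) != (subject, attribute):
                continue
            if fact.valid_from <= when and (fact.valid_to is None or when < fact.valid_to):
                return fact.value
        return None
```

With a plan change recorded in September, as_of for a date in March returns the old tier and as_of for today returns the new one. A store that only keeps the latest value cannot answer the first question at all.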

Letta (formerly MemGPT) models memory like an operating system, with tiers, and crucially it lets the agent edit its own memory. Pick this when memory autonomy is the actual product, for example a long-running research analyst that curates its own knowledge base. It is less a memory layer you bolt on and more a whole runtime you adopt.
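The operating-system analogy is easier to see in code: a small tier that is always in context, a larger tier that is not, and memory operations exposed to the model as tools. The sketch below is illustrative only and is not Letta's API; SelfEditingMemory, the tool names, the character budget, and the dispatch helper are all assumptions.

```python
class SelfEditingMemory:
    """Hypothetical two-tier memory: a small in-context block plus an archive."""

    def __init__(self, core_limit: int = 200):
        self.core: list[str] = []     # always placed in context
        self.archive: list[str] = []  # out of context, searched on demand
        self.core_limit = core_limit  # character budget for the core tier

    def core_append(self, text: str) -> str:
        self.core.append(text)
        # Evict oldest entries to the archive once the budget is exceeded.
        while sum(len(t) for t in self.core) > self.core_limit and len(self.core) > 1:
            self.archive.append(self.core.pop(0))
        return "ok"

    def core_replace(self, old: str, new: str) -> str:
        for i, entry in enumerate(self.core):
            if old in entry:
                self.core[i] = entry.replace(old, new)
                return "ok"
        return "not found"

    def archive_search(self, query: str) -> list[str]:
        return [t for t in self.archive if query.lower() in t.lower()]


def dispatch(memory: SelfEditingMemory, call: dict):
    """Route a tool call emitted by the model to the matching memory operation."""
    tools = {
        "core_append": memory.core_append,
        "core_replace": memory.core_replace,
        "archive_search": memory.archive_search,
    }
    return tools[call["name"]](**call["args"])
```

The design choice that matters is who decides what to write. Here the model issues the edits itself through tool calls, rather than a background process extracting facts on its behalf.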

LangMem provides episodic, semantic, and procedural memory that integrates natively with LangGraph. Frictionless if you are already on LangGraph; not portable if you are not.

Cognee builds a full knowledge graph up front, before any query is run. Good for local-first, privacy-critical, graph-reasoning use cases.

Q3.4 — What should I do about the conflicting benchmarks?

Mem0 and Zep have each published scores that contradict the other's, each using a different long-conversation test set (LoCoMo versus LongMemEval), and the two test sets measure meaningfully different things at different scales. The lesson is not who won. The lesson is: build a small, labelled test set from your own vertical’s data and measure on that. A generic memory benchmark tells you almost nothing about whether the system will correctly recall that this patient is allergic to that drug.
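A labelled test set of this kind does not need much machinery. The sketch below assumes each memory system is wrapped in a hypothetical adapter exposing ingest(message) and recall(question); the MemoryCase record, the substring-match scoring rule, and the baseline class are likewise assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MemoryCase:
    history: list[str]  # messages the memory system ingests first
    question: str       # what is asked later, in a fresh context
    must_contain: str   # labelled ground truth the recalled text must include


def evaluate(make_system: Callable[[], object], cases: list[MemoryCase]) -> float:
    """Fraction of cases where the recalled text contains the labelled answer.
    Assumes a hypothetical adapter exposing ingest(str) and recall(str) -> str."""
    hits = 0
    for case in cases:
        system = make_system()  # fresh instance so cases cannot leak into each other
        for message in case.history:
            system.ingest(message)
        recalled = system.recall(case.question)
        hits += case.must_contain.lower() in recalled.lower()
    return hits / len(cases) if cases else 0.0


class LastMessageBaseline:
    """Deliberately weak baseline: remembers only the most recent message."""

    def __init__(self):
        self.last = ""

    def ingest(self, message: str) -> None:
        self.last = message

    def recall(self, question: str) -> str:
        return self.last
```

Run every candidate system and a trivial baseline through the same cases. If a framework cannot clearly beat the baseline on your own allergy, plan-tier, or escalation cases, its published benchmark score is beside the point.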

Q3.5 — What is the pragmatic starting recommendation?

For most vertical agents: use your orchestrator’s checkpointer for short-term thread state, and layer Mem0 (personalisation default) or Zep/Graphiti (when time matters) for long-term memory. Do not over-engineer this before you have real conversations to learn from.