Memory Systems in AI Assistants

Working, structured, and retrieval memory for assistants.

Page content

Memory turns assistants from reactive to persistent, but it is also where many systems quietly rot. Surveys argue the short-term versus long-term split is no longer enough for modern agent memory; OpenAI and LangGraph SDKs point to a simpler stack — working memory, durable state, and retrieval.

Assistants need working memory for the current run, durable state for stable facts and preferences, and retrieval memory for relevant supporting context. My slightly opinionated view is that structured state is underused, vector retrieval is overused, and most memory failures come from promotion and injection policy rather than storage choice.

The other important point is that memory does not automatically fix long context. LoCoMo shows that very long-term conversational recall remains hard, and “Lost in the Middle” shows that simply throwing more tokens at the model can degrade performance when relevant information lands in the middle of the prompt. Good memory systems are selective, layered, and explicit about precedence.

This guide sits in the AI Systems Memory hub as the cross-framework map for the memory layer inside AI Assistant Architecture.

Abstract memory system for an AI assistant as layered notebooks, vector points, and structured cards

How to think about assistant memory

Assistant memory is not the same problem as PKM, wikis, or standalone RAG pipelines — PKM vs RAG vs Wiki vs Memory Systems maps those paradigms at the knowledge-architecture level. This guide stays one layer down, in the runtime contracts assistants actually implement.

The cleanest way to think about memory is not as “chat history”, but as a set of storage contracts with different jobs. One store preserves the active thread. Another store keeps durable user state. Another supports semantic lookup over documents or past interactions. OpenAI’s memory guidance for personalisation makes this explicit by separating global and session memory, while LangGraph separates thread-level persistence from long-term stores across conversations.

Memory matters because production assistants repeat work, revisit goals, and operate across days or weeks. Generative Agents popularised the pattern of storing experiences, reflecting on them, and retrieving them dynamically for future planning. MemGPT pushed that further by modelling memory as tiers and movement between fast and slow stores. More recent systems such as A-MEM and Mem0 focus on linking, consolidation, and deployment efficiency rather than just recall volume.

Types of memory

Production assistants typically need three cooperating layers. The FAQ above names them; the sections below explain how each behaves in real systems.

Short-term memory

Short-term memory is the working context of the current conversation or run. OpenAI Sessions automatically prepend conversation history before each run and append new items after each run. LangGraph implements the same idea as thread-level persistence through a checkpointer. This layer keeps local coherence, but it is also the first thing that explodes when tool results, file reads, or long chats pile up.

Long-term retrieval memory

Long-term retrieval memory stores items that are looked up when relevant rather than replayed every turn. That overlaps with RAG as a retrieval technique, but it is not the whole assistant memory story — wikis and PKM corpora often feed the index while structured state and session memory live elsewhere, as the PKM/RAG/wiki/memory comparison above makes clear. In classical RAG, the model combines parametric memory with non-parametric memory such as a dense vector index. Self-RAG improves on naive retrieval by making retrieval on-demand rather than fixed for every request. In practical assistant systems, this is usually the vector store or searchable transcript layer.

Structured memory

Structured memory stores durable facts, preferences, or constraints in explicit fields with precedence rules. OpenAI’s personalisation cookbook is unusually clear here. Global and session memory have different roles, the latest user instruction wins, session memory can override global memory for the current task, and memory that conflicts with current user intent should trigger clarification rather than silent obedience. This is why structured state is often better than retrieval for stable preferences, policies, or standing constraints.

Retrieval mechanics

A typical retrieval flow has five steps: capture, encode, search, rerank or filter, then inject. Pinecone, Weaviate, Qdrant, Redis, and Milvus all document variants of this pattern. Some support dense vectors only, others support hybrid retrieval that combines semantic and lexical search, and some expose metadata filters or namespaces for tenancy and scope control. The engineering point is straightforward. Retrieval quality depends as much on filtering, chunking, and ranking strategy as on the embedding model itself.

Hybrid retrieval is usually the sensible default when queries mix meaning and exact terms. Weaviate documents hybrid search with an alpha parameter balancing vector and keyword components, Qdrant supports hybrid and multi-stage queries through its Query API and score-fusion methods, and Milvus describes dense, sparse, and hybrid retrieval in the same system. That matters for assistants because users often ask for both approximate meaning and exact identifiers, file names, revision numbers, or product codes. When the lexical side lives in Postgres or Elasticsearch rather than inside the vector database, PostgreSQL full text search vs Elasticsearch helps you choose where keyword search should run in production.

One more opinionated point: retrieval should not decide policy. It should supply candidates. The assistant still needs structured rules for precedence, privacy, recency, and conflict resolution. OpenAI’s state-based memory example makes this explicit, and it is a much healthier pattern than pretending similarity search alone can resolve contradictory user state.

Common issues

The most common failure is stale or contradictory memory. OpenAI’s long-term memory cookbook calls memory consolidation the most sensitive and error-prone stage, listing context poisoning, memory loss, duplicate memories, and contradiction handling as core concerns. That is correct, and it is where many assistants fail quietly. They remember too much, too early, and without a rule for forgetting.

The second failure is context overload. LangGraph warns that long conversations can exceed the LLM context window and recommends trimming, deletion, summarisation, or checkpoint management. OpenClaw similarly prunes old tool outputs from in-memory context while preserving the full on-disk transcript. These are not optional optimisations. They are required if your assistant reads, searches, or executes anything non-trivial.

The third failure is assuming long context equals reliable recall. LoCoMo shows that long-term conversational memory is still difficult, and “Lost in the Middle” shows position sensitivity inside long prompts. If memory is important, do not rely on brute-force prompt stuffing. Use compaction, retrieval, and explicit state.

Tradeoffs

The vector database layer is where many assistant teams make early platform bets. The comparison below focuses on documented product characteristics that matter for assistant memory design.

System What stands out Best fit
Pinecone Managed vector database with integrated embedding, reranking, metadata filters, namespaces, and support for dense, sparse, and BM25-style full-text in one schema Teams that want managed retrieval with minimal infra
Weaviate Open-source vector database storing objects and vectors, with semantic and hybrid search and strong RAG positioning Teams that want open-source flexibility with hybrid retrieval
Qdrant AI-native vector search with filtering, hybrid and multi-stage queries, plus an embedded offline-capable Edge mode Teams that want search control, edge deployment, or strong filtering
pgvector Vector similarity search inside Postgres, with exact and approximate search plus ACID, JOINs, and recovery features Teams already standardised on Postgres and relational data
Milvus Cloud-native vector database with disaggregated storage and compute, plus dense, sparse, and hybrid retrieval Large-scale retrieval workloads and distributed deployments

Once you pick a backend, operating it is a data infrastructure problem — Postgres with pgvector for session metadata and vectors on one stack, or Neo4j when retrieval memory is graph-shaped rather than flat chunks.

The latency and cost pattern below is a design synthesis based on the operational models described in OpenAI Sessions and compaction guidance, LangGraph memory management, OpenAI state-based memory, and the documented retrieval behaviour of Redis and vector stores. It is intentionally qualitative, because real numbers depend on corpus size, embedding model, network placement, and caching.

Memory tactic Read latency Write latency Token cost pressure Infra cost When it is worth it
Raw session history Lowest Lowest Highest Lowest Simple multi-turn chat and short runs
Summary or compaction memory Low to medium Medium, because summarisation itself is a model step Medium to low Low to medium Long-running work where the active run must continue
Structured profile and state Low Medium Low Low Durable preferences, rules, and standing constraints
Vector or hybrid retrieval Medium Medium Low to medium Medium Large corpora, searchable history, document grounding
Full replay of everything High and increasingly unstable Low Highest Low infra, high model spend Almost never, except tiny corpora and debugging

Implementation examples

OpenAI’s current stack gives two useful reference patterns. The first is Sessions for short-term continuity across runs. The second is state-based long-term memory, where structured profile fields and global memory notes are injected at session start, session notes are distilled during the run, and a consolidation step promotes only durable items into global memory. That inject → reason → distill → consolidate loop is one of the clearest public memory patterns available right now.

LangGraph provides a similar but framework-agnostic split. Checkpointers handle short-term thread memory and stores handle long-term search across conversations. The store can be searched inside nodes at runtime, which makes it a good reference design for assistants that need explicit orchestration rather than hidden framework magic.

Hermes is a useful public example of layered memory in the wild. Its built-in memory uses MEMORY.md, USER.md, and SQLite FTS5 session search, while external provider plugins add graph memory, semantic retrieval, automatic fact extraction, and user modelling. The full mechanics are documented in Hermes Agent Memory System, and the eight pluggable backends are compared in Agent memory providers compared.

OpenClaw offers a different take, with session pruning, optional active memory that runs before the main reply, and an opt-in Dreaming system for background memory consolidation. Those examples are worth paying attention to because they treat memory as an operational subsystem, not just a retrieval trick. For how OpenClaw maps onto the wider five-layer assistant stack, see the OpenClaw system overview.

Research prototypes point in the same direction. MemGPT uses hierarchical memory tiers and control flow for context management, A-MEM uses dynamic indexing and linking inspired by Zettelkasten, and Mem0 reports better accuracy with much lower p95 latency and token cost than full-context baselines on LoCoMo. You do not need to copy these systems wholesale, but their shared lesson is clear. Memory quality comes from selection and organisation, not from storing everything forever.

When memory helps versus hurts

Memory helps when the assistant repeatedly encounters stable preferences, durable constraints, reusable workflow lessons, or large external corpora that cannot fit in a prompt. OpenAI’s reliable agents guide makes the distinction well. Compaction helps the current long-running run continue, while memory helps future runs reuse workflow lessons. That is the right mental model for most business assistants.

Memory hurts when the task is one-shot, the user state changes often, the retrieval index is noisy, or the system cannot reconcile conflicts. OpenAI’s travel-memory example warns that session memory should not automatically become global memory, and it explicitly states that memory is not a security boundary. If your assistant treats every recalled string as truth, you have built a confusion engine, not a memory system.

A selective memory loop

The simplest robust memory loop is selective and staged. Load durable state, retrieve supporting context, answer, capture only candidate memories, then consolidate later. Both OpenAI’s state-based pattern and recent memory papers move in this direction.

agent-memory-sequence-diagram

Without tracing and evals, memory changes are hard to debug. When you promote new facts or change retrieval policy, pair those changes with the observability patterns in Observability for LLM Systems so you can see which layer injected what.

Take-Away

The practical memory stack for assistants is not “just use a vector DB”. It is working memory for the live run, structured state for durable truth, retrieval memory for supporting evidence, and a conservative consolidation policy that forgets as deliberately as it remembers. Recent research and current SDK guidance both point in that direction.

For the full assistant stack around this layer, start with AI Assistant Architecture. For Hermes-specific bounded memory and provider plugins, follow Hermes Agent Memory System and Agent memory providers compared.

Subscribe

Get new posts on AI systems, Infrastructure, and AI engineering.