AI agents without memory are like employees who forget everything each morning. They can accomplish one-off tasks, but they cannot build lasting understanding of your business, your customers, or your processes. A team of Elastic engineers just published a persistent memory architecture achieving 0.89 recall on their internal benchmarks. Here is how to reproduce this approach in your own projects.
Why Memory Is the Missing Link for AI Agents
A large language model is stateless by design: unless the application supplies earlier context, each conversation with the models behind ChatGPT or Claude starts from scratch. The model does not remember what you told it yesterday, your preferences, or your project history. This limitation is acceptable for occasional use, but it becomes a major obstacle for business applications.
Take the example of a customer support agent. Without memory, it will treat each ticket independently, ignoring that the same customer has had the same problem three times this month, or that this particular customer prefers detailed technical responses rather than simplified explanations. With persistent memory, the agent can adapt its behavior, automatically escalate recurring cases, and personalize its responses.
According to a Gartner 2025 study, companies that implement contextual memory in their AI agents see a 34% improvement in customer satisfaction and a 28% reduction in average resolution time.
Elastic Solution Architecture
The architecture proposed by the Elastic team relies on three main components: a vector store for semantic embeddings, a structured index for metadata, and an orchestration layer that decides when and how to access memory.
Vector Store: The Semantic Backbone
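The vector store holds a numerical representation of each interaction. Each message, document, or action is converted into a vector: 1,536 dimensions with OpenAI's text-embedding-ada-002 or text-embedding-3-small, and typically 384 to 768 dimensions with open-source sentence-transformers models (384 for all-MiniLM-L6-v2, 768 for all-mpnet-base-v2). The dimension count configured in the index must match the model you choose.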
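Elasticsearch offers a major advantage here: hybrid search in a single engine. One query can combine semantic search (kNN) with structured filters (date, type, user) and full-text relevance. Dedicated vector databases such as Pinecone or Weaviate also support metadata filters, but Elasticsearch adds mature full-text search, aggregations, and lifecycle management alongside them. This combination is crucial for business agents.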
Concrete example: when a support agent receives a question, it can search for similar interactions but filter only on the last three months and the same customer segment. This contextual precision makes the difference between 0.7 recall and 0.89 recall.
Structured Index: Metadata That Gives Meaning
Embeddings alone are not enough. Each memory entry requires structured metadata: timestamp, user identifier, interaction type, detected sentiment, actions performed, outcome obtained.
This metadata enables several operations that are impossible with vectors alone. Temporal filtering ensures recent information is prioritized. User segmentation prevents information leakage between clients. Action tracking makes it possible to analyze what worked and what did not.
The Elastic team recommends explicit mapping schemas rather than dynamic mapping. This guarantees type consistency and optimizes search performance.
Orchestration Layer: The Conductor
The orchestration layer decides when to query memory, which fragments to retrieve, and how to integrate them into the agent's context. This logic is often implemented in LangChain, LlamaIndex, or a custom framework.
The most effective pattern is "retrieve-then-filter" (often called retrieve-then-rerank). The agent first retrieves a broad set of relevant memories (top 50, for example), then applies re-ranking based on the current task to select the 5-10 most useful. This double pass significantly improves relevance while controlling token costs.
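The two passes can be sketched in a few lines of Python. This is a minimal illustration, not the Elastic team's implementation: the Memory class, the retrieve_then_rerank function, and the keyword_overlap re-ranker are hypothetical names, and the word-overlap scoring is a toy stand-in for a real re-ranking model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Memory:
    text: str
    retrieval_score: float  # first-pass score, e.g. kNN similarity


def retrieve_then_rerank(
    candidates: List[Memory],
    task_relevance: Callable[[Memory], float],
    broad_k: int = 50,
    final_k: int = 5,
) -> List[Memory]:
    """Keep the broad_k best first-pass hits, then re-rank them for the task."""
    # Pass 1: broad recall, ordered by the store's own retrieval score.
    broad = sorted(candidates, key=lambda m: m.retrieval_score, reverse=True)[:broad_k]
    # Pass 2: task-aware re-ranking; only final_k memories reach the prompt.
    return sorted(broad, key=task_relevance, reverse=True)[:final_k]


def keyword_overlap(task: str) -> Callable[[Memory], float]:
    """Toy re-ranker (an assumption): count words shared with the task."""
    task_words = set(task.lower().split())
    return lambda m: float(len(task_words & set(m.text.lower().split())))


memories = [
    Memory("refund for late delivery", 0.9),
    Memory("password reset steps", 0.8),
    Memory("late delivery complaint again", 0.7),
]
top = retrieve_then_rerank(memories, keyword_overlap("late delivery"), broad_k=3, final_k=2)
print([m.text for m in top])
```

Keeping final_k small is what controls token cost: only the re-ranked memories are placed in the agent's prompt, whatever the size of the first pass.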
Step-by-Step Implementation
Step 1: Configure the Elasticsearch Index
Create an index with the following mapping to support hybrid search:
{
  "mappings": {
    "properties": {
      "content": { "type": "text" },
      "embedding": {
        "type": "dense_vector",
        "dims": 1536,
        "index": true,
        "similarity": "cosine"
      },
      "timestamp": { "type": "date" },
      "user_id": { "type": "keyword" },
      "interaction_type": { "type": "keyword" },
      "sentiment": { "type": "float" },
      "metadata": { "type": "object" }
    }
  }
}
This mapping supports kNN searches on the embedding field combined with filters and aggregations on the metadata fields.
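As an illustration of a filtered kNN search against this mapping, the sketch below builds a search request body as a plain Python dictionary. The function name build_memory_search, the example user ID, and the default values for k and num_candidates are assumptions; the body follows the shape of the Elasticsearch 8.x kNN search option and would be sent to the index's _search endpoint with whichever client you use.

```python
import json
from typing import Dict, List


def build_memory_search(
    query_vector: List[float],
    user_id: str,
    since: str = "now-3M",  # Elasticsearch date math: the last three months
    k: int = 10,
    num_candidates: int = 50,
) -> Dict:
    """Build a kNN search body restricted to one user and a time window."""
    return {
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
            # The filter is applied during the vector search, so the k hits
            # returned all belong to this user and time window.
            "filter": {
                "bool": {
                    "filter": [
                        {"term": {"user_id": user_id}},
                        {"range": {"timestamp": {"gte": since}}},
                    ]
                }
            },
        },
        "_source": ["content", "timestamp", "interaction_type"],
    }


# Placeholder vector; a real query uses the embedding of the user's question.
body = build_memory_search([0.0] * 1536, user_id="customer-42")
print(json.dumps(body["knn"]["filter"], indent=2))
```

The query vector must have the same number of dimensions as the dims value in the mapping, otherwise Elasticsearch rejects the request.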
Step 2: Ingestion Pipeline
Each interaction must be transformed before storage. The pipeline comprises three steps: extracting relevant content, generating the embedding, enriching metadata.
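The three steps might look like the following sketch. The input keys (message, type), the function names, and especially fake_embed are assumptions made for illustration: fake_embed is a deterministic placeholder with no semantic meaning, to be replaced by a real embedding call whose output size matches the dims value of the index.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List


def fake_embed(text: str, dims: int = 8) -> List[float]:
    """Placeholder embedding (NOT semantic): swap in a real model call."""
    vec = [0.0] * dims
    for i, ch in enumerate(text):
        vec[i % dims] += ord(ch) / 1000.0
    return vec


def build_memory_doc(
    raw: Dict, embed: Callable[[str], List[float]] = fake_embed
) -> Dict:
    """Turn one raw interaction into a document ready for indexing."""
    # Step 1: extract the relevant content (here: collapse stray whitespace).
    content = " ".join(raw["message"].split())
    # Step 2: generate the embedding.
    embedding = embed(content)
    # Step 3: enrich with metadata, using the field names of the index mapping.
    return {
        "content": content,
        "embedding": embedding,
        "timestamp": raw.get("timestamp") or datetime.now(timezone.utc).isoformat(),
        "user_id": raw["user_id"],
        "interaction_type": raw.get("type", "message"),
    }


doc = build_memory_doc({"message": "  Order   late again ", "user_id": "u1"})
print(doc["content"], len(doc["embedding"]))
```

Passing the embedding function as a parameter keeps the pipeline independent of the provider, which makes it easy to compare a hosted API with a local model.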
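For embedding generation, you have two main options. The OpenAI API (text-embedding-3-small model) offers excellent performance for $0.02 per million tokens. Local models like sentence-transformers/all-MiniLM-L6-v2 eliminate recurring API costs but require you to host inference yourself (a GPU helps at volume, though a model this small also runs on CPU) and to set dims in the mapping to the model's output size (384 for this model).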
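For a Moroccan SME processing 10,000 interactions per month, the OpenAI API will cost a few dollars monthly at most: at $0.02 per million tokens, even 1,000 tokens per interaction adds up to 10 million tokens, or about $0.20. A local model becomes worth considering from roughly 100,000 monthly interactions, provided you already have the infrastructure, although at this price the savings on API fees are small and data residency or latency are usually the stronger reasons.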
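The embedding bill is simple arithmetic, which the sketch below makes explicit. The figure of 500 tokens per interaction is an assumption; the real value depends heavily on message length and on how often content is re-embedded, so measure it on your own data.

```python
def monthly_embedding_cost(
    interactions: int,
    tokens_per_interaction: int,
    usd_per_million_tokens: float = 0.02,  # text-embedding-3-small list price cited above
) -> float:
    """Estimated monthly embedding cost in USD."""
    total_tokens = interactions * tokens_per_interaction
    return total_tokens / 1_000_000 * usd_per_million_tokens


for volume in (10_000, 100_000):
    cost = monthly_embedding_cost(volume, tokens_per_interaction=500)
    print(f"{volume} interactions/month: ${cost:.2f}")
```

Under that assumption, 10,000 interactions come to 5 million tokens, or about $0.10 per month, and 100,000 interactions to about $1.00. Embedding fees are therefore rarely the dominant line in the budget.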
Step 3: Retrieval Strategy
Effective retrieval combines multiple signals. Semantic similarity identifies contextually relevant memories. Recency favors fresh information. Frequency highlights recurring patterns.
The Elastic team uses a combined scoring formula:
final_score = 0.6 * semantic_score + 0.25 * recency_score + 0.15 * frequency_score
These weights are adjustable according to the use case. A support agent will prioritize recency (recent customer issues). A sales agent will prioritize frequency (frequently mentioned products).
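A minimal sketch of this formula follows. Only the weighted sum comes from the formula above; how recency_score and frequency_score are normalized to the 0-1 range is not specified, so the exponential decay with a 30-day half-life and the logarithmic frequency scaling are assumptions chosen for illustration.

```python
import math
from datetime import datetime, timezone
from typing import Tuple


def recency_score(ts: datetime, now: datetime, half_life_days: float = 30.0) -> float:
    """Assumed normalization: 1.0 for a brand-new memory, halving every half-life."""
    age_days = max((now - ts).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def frequency_score(count: int, max_count: int) -> float:
    """Assumed normalization: log-scaled count relative to the most frequent item."""
    if max_count <= 0:
        return 0.0
    return math.log1p(count) / math.log1p(max_count)


def final_score(
    semantic: float,
    recency: float,
    frequency: float,
    weights: Tuple[float, float, float] = (0.6, 0.25, 0.15),
) -> float:
    """Weighted sum of the three signals, each expected in the 0-1 range."""
    w_sem, w_rec, w_freq = weights
    return w_sem * semantic + w_rec * recency + w_freq * frequency


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
seen = datetime(2025, 1, 1, tzinfo=timezone.utc)
score = final_score(0.8, recency_score(seen, now), frequency_score(3, 10))
print(round(score, 3))
```

Passing a different weights tuple, for example one with a higher second value, is how a support agent would shift the balance toward recency.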
Step 4: Lifecycle Management
Memory is not eternal. A retention policy avoids accumulating obsolete data and associated storage costs.
The recommended approach distinguishes three data tiers. Hot memory (0-30 days) stays on fast SSD nodes. Warm memory (30-180 days) migrates to less expensive HDD nodes. Cold memory (beyond 180 days) is archived or deleted according to regulatory requirements.
Elasticsearch manages this policy via Index Lifecycle Management (ILM), which automates migrations between tiers.
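A policy for these three tiers could be expressed as follows. This is a sketch under stated assumptions: the function name and the priority values are illustrative, the policy does not use rollover (so min_age is counted from index creation, which suits time-based indices such as one per month), and the actual move between node tiers relies on the cluster having hot, warm, and cold data tiers configured. Check the behavior against your Elasticsearch version before relying on it.

```python
import json
from typing import Dict, Optional


def build_memory_ilm_policy(
    warm_after: str = "30d",
    cold_after: str = "180d",
    delete_after: Optional[str] = None,
) -> Dict:
    """Build the body for an ILM policy (PUT _ilm/policy/<policy-name>)."""
    phases: Dict[str, Dict] = {
        # Hot: recent memories, recovered first after a node restart.
        "hot": {"actions": {"set_priority": {"priority": 100}}},
        # Warm: entered once the index is older than warm_after.
        "warm": {"min_age": warm_after, "actions": {"set_priority": {"priority": 50}}},
        # Cold: rarely queried memories kept for compliance or analysis.
        "cold": {"min_age": cold_after, "actions": {"set_priority": {"priority": 0}}},
    }
    if delete_after is not None:
        # Only delete when the retention rules that apply to you allow it.
        phases["delete"] = {"min_age": delete_after, "actions": {"delete": {}}}
    return {"policy": {"phases": phases}}


policy = build_memory_ilm_policy(delete_after="365d")
print(json.dumps(policy, indent=2))
```

The 365-day deletion in the usage line is a placeholder; the right value depends on the regulatory requirements mentioned above.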
Integration with Agent Frameworks
LangChain
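LangChain offers native integration with Elasticsearch via ElasticsearchStore (in the langchain-elasticsearch package; the older ElasticVectorSearch class is deprecated). To add the memory layer, use ConversationBufferWindowMemory, which keeps the most recent exchanges, combined with custom retrieval over the long-term store.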
LangChain's advantage is its flexibility. You can easily switch between different vector stores during development, then optimize for Elasticsearch in production.
LlamaIndex
LlamaIndex offers a higher-level abstraction with "chat engines." Elasticsearch integration is done via ElasticsearchStore, which automatically handles indexing and retrieval.
For use cases requiring structured memory (task lists, user preferences), LlamaIndex offers "knowledge graphs" that complement vector memory well.
Custom Framework
For high-volume applications or specific requirements, a custom framework may be preferable. The basic architecture comprises three services: an ingestion service (receives and transforms interactions), a memory service (manages storage and retrieval), and an orchestration service (decides when and how to use memory).
This separation allows scaling each component independently and facilitates debugging in production.
Concrete Use Cases for Moroccan Businesses
Multilingual Customer Support
A Moroccan call center manages French-speaking, Arabic-speaking, and English-speaking customers. The AI agent memorizes each customer's linguistic preferences, problem history, and preferred tone (formal vs informal).
Expected result: 40% reduction in first response time, 25% improvement in CSAT. These figures are based on similar implementations documented by Elastic with European clients.
B2B Sales Agent
A food wholesaler uses an AI agent to manage recurring orders. The agent memorizes each customer's ordering patterns, anticipates probable stockouts, and suggests complementary products based on history.
Persistent memory here transforms a simple order-taking tool into a truly proactive sales assistant.
Internal Legal Assistant
A law firm deploys an AI assistant for case law research. The agent memorizes ongoing cases, each lawyer's research preferences, and precedents already identified.
This application illustrates the importance of segmentation: each lawyer should only see their own cases and research, never those of colleagues.
Performance and Costs
Elastic Team Benchmarks
On their test dataset (1 million interactions, 50,000 users), the architecture achieves 0.89 recall with P99 latency under 200 ms. These results were obtained with 3 Elasticsearch nodes of 16 GB RAM each.
Cost Estimate for an SME
For a company processing 50,000 monthly interactions, here is a cost estimate:
Elasticsearch infrastructure (3 nodes on Elastic Cloud) costs approximately $400 per month. OpenAI embeddings represent at most about $10 per month, a generous ceiling at $0.02 per million tokens. S3 storage for archives adds approximately $5 per month.
The total cost of approximately $415 monthly compares favorably to managed vector databases like Pinecone or Zilliz, which often charge $500-1000 for similar volumes.
Cost Optimization
Several levers reduce costs without sacrificing performance. Using open-source embeddings eliminates recurring OpenAI costs. Vector compression (quantization) reduces storage by 50-75%. Aggressive archiving limits hot data volume.
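The 50-75% range follows directly from the number of bytes stored per dimension, as this back-of-the-envelope calculation shows. It counts the raw vector payload only; index structures are excluded, and some engines keep the original float vectors alongside the quantized ones, so the real saving depends on your configuration. The function name and the volume of one million vectors are assumptions for illustration.

```python
def vector_storage_gb(num_vectors: int, dims: int, bytes_per_dim: int) -> float:
    """Raw vector payload in GB (1 GB = 1e9 bytes), excluding index overhead."""
    return num_vectors * dims * bytes_per_dim / 1e9


NUM_VECTORS = 1_000_000
DIMS = 1536

float32_gb = vector_storage_gb(NUM_VECTORS, DIMS, 4)  # 4 bytes per dimension
float16_gb = vector_storage_gb(NUM_VECTORS, DIMS, 2)  # 2 bytes per dimension
int8_gb = vector_storage_gb(NUM_VECTORS, DIMS, 1)     # 1 byte per dimension

for label, size in (("float32", float32_gb), ("float16", float16_gb), ("int8", int8_gb)):
    saving = 1 - size / float32_gb
    print(f"{label}: {size:.3f} GB ({saving:.0%} smaller than float32)")
```

Going from 4 bytes to 2 bytes per dimension halves the payload, and going to 1 byte cuts it by three quarters, at the price of a small loss in ranking precision that should be measured on your own recall benchmark.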
Security and Compliance
Data Isolation
In a multi-tenant context, each organization must have its own isolated memory. Elasticsearch supports this isolation via separate indexes or document-level filters.
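For compliance with Moroccan Law 09-08, enforced by the CNDP (Commission Nationale de contrôle de la protection des Données à caractère Personnel), ensure personal data is encrypted at rest and in transit, and that retention policies respect regulatory requirements.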
Right to Erasure
AI memory poses a challenge for the right to erasure (GDPR Article 17, CNDP equivalent). You must be able to identify and delete all memory entries linked to a specific user.
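The recommended solution: systematically tag each entry with a user identifier and index that identifier as a keyword field (as user_id is in the mapping above), so that all of a user's entries can be found and deleted with a single query. Remember that archives and derived data, such as summaries, must be covered as well.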
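In Elasticsearch, an erasure request can be expressed as a term query on the user identifier and sent to the index's _delete_by_query endpoint. The sketch below only builds that request body and adds an in-memory stand-in for testing; both function names are hypothetical, and the field name user_id is taken from the mapping shown earlier.

```python
from typing import Dict, List


def build_erasure_query(user_id: str) -> Dict:
    """Body for POST /<memory-index>/_delete_by_query targeting one user."""
    # A term query on a keyword field matches the identifier exactly.
    return {"query": {"term": {"user_id": user_id}}}


def erase_user_locally(docs: List[Dict], user_id: str) -> List[Dict]:
    """In-memory stand-in for delete-by-query, useful in unit tests."""
    return [d for d in docs if d.get("user_id") != user_id]


store = [
    {"user_id": "u1", "content": "prefers detailed answers"},
    {"user_id": "u2", "content": "asked about invoices"},
    {"user_id": "u1", "content": "reported a login issue"},
]
remaining = erase_user_locally(store, "u1")
print(build_erasure_query("u1"), len(remaining))
```

Logging each erasure request with its date and the number of deleted entries gives you evidence to show a regulator, without keeping the personal data itself.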
Conclusion
Persistent memory transforms AI agents from one-off gadgets into true business assistants. The Elasticsearch architecture presented here offers an optimal balance between performance (0.89 recall), cost (approximately $400 per month for an SME), and flexibility (hybrid search, structured metadata).
For Moroccan businesses developing custom AI applications, this architecture constitutes a solid foundation. It can be deployed progressively, starting with a pilot use case before extending to the entire organization.
The next challenge will be integrating episodic memories (dated events) and semantic memories (general knowledge) in a unified architecture. Promising research work is emerging in this field, notably at DeepMind and Anthropic.
FAQ
What is the difference between vector memory and RAG?
RAG (Retrieval-Augmented Generation) is a technique that uses vector memory as a context source. Vector memory is the "where" to store, RAG is the "how" to use. A complete memory system combines both: vector storage + retrieval strategy + prompt integration.
Elasticsearch vs specialized vector databases: which to choose?
Elasticsearch excels for hybrid cases that combine vector similarity with structured filters (date, user, type) and full-text search in one engine. Specialized databases (Pinecone, Weaviate) can offer better performance for pure vector search. For most business applications, filtering and keyword search matter as much as vector similarity, which favors Elasticsearch.
How to manage data volume growth?
Three complementary strategies: progressive archiving (Elasticsearch ILM), embedding compression (quantization), and automatic summarization (periodically condensing old interactions into summaries). This last technique uses an LLM to create "meta-memories" that capture the essence without retaining every detail.
Does AI memory pose bias risks?
Yes. If an agent memorizes biased interactions, it will reproduce those biases. Mitigation involves regular auditing of memorized patterns, automatic bias detection (via dedicated classifiers), and the ability to partially "reset" memory to eliminate problematic associations.
Can this architecture be used without Elasticsearch?
Yes. The principles (hybrid search, structured metadata, lifecycle management) apply to other stacks. PostgreSQL with pgvector, MongoDB with Atlas Search, or even custom combinations (Pinecone + PostgreSQL) can implement a similar architecture. The choice depends on your existing stack and team skills.
