Skip to main content

Pipecat Voice AI Persistent Memory: Add Memory to Your Voice Pipeline

· 11 min read
Ben Bartholomew
Hindsight Team

Pipecat Voice AI Persistent Memory

If you build voice AI pipelines with Pipecat, you know it handles real-time speech processing, LLM integration, and streaming synthesis beautifully. But there's one critical thing it doesn't do: remember anything between calls. Every voice conversation starts fresh. Your voice agent has no idea what the user said in yesterday's call, what their preferences are, or what it already researched.

Adding Pipecat persistent memory doesn't require building a custom RAG system or managing separate vector databases. With the hindsight-pipecat integration, you can wire long-term memory into any voice pipeline with a single FrameProcessor that recalls context before responding and retains conversation content after each turn.

TL;DR

  • Pipecat voice pipelines have no built-in memory; each conversation starts from scratch.
  • hindsight-pipecat adds automatic recall (before LLM) and retain (after each turn) without code changes to the pipeline.
  • One-line setup: add HindsightMemoryService between the user aggregator and LLM service.
  • On each turn, Hindsight recalls relevant past conversations and injects them as system messages.
  • Works with any Pipecat transport (Daily, Deepgram, OpenAI, Cartesia, etc.) and any LLM provider.

The Problem: Pipecat Voice Agents Have No Persistent Memory

Pipecat is a solid framework for building voice AI agents. Real-time speech-to-text, streaming text-to-speech, LLM integration, and a clean frame-based pipeline architecture make it a popular choice for voice applications. But it ships with no memory layer.

Every voice conversation starts from zero. The agent doesn't know what the user said last week. It doesn't know their preferences, their history, or what context might be relevant to the current call. As a result, voice agents that interact with users over multiple sessions must re-establish all context in every conversation.

You could pass conversation history to the LLM context. But that's just chat history—it doesn't consolidate repeated information, doesn't build a knowledge graph of entities and relationships, and grows linearly until it exceeds your context window.

Real voice agent memory is different:

  • Extracting structured facts from conversations (names, preferences, past requests)
  • Building relationships between entities (users, topics, decisions)
  • Retrieving relevant context from days, weeks, or months of past conversations
  • Synthesizing coherent answers from scattered memory across multiple calls

That's what Hindsight provides. And with hindsight-pipecat, you get this without rebuilding your pipeline.


How Pipecat Persistent Memory Works

Hindsight is a memory engine that runs locally or in the cloud. It doesn't just store raw conversation text. Instead, it extracts structured entities and relationships, builds a knowledge graph, and indexes everything for multi-strategy retrieval. Semantic search, BM25 keyword matching, graph traversal, and temporal ranking all work together to surface the most relevant memories.

The hindsight-pipecat integration connects this engine to your Pipecat pipeline through a FrameProcessor that sits between your user context aggregator and LLM service.

Voice Input
↓ (STT)
User Aggregator
↓ (OpenAILLMContextFrame arrives)
HindsightMemoryService ← Memory goes here
├─ Retain previous turn (async, non-blocking)
└─ Recall relevant past conversations
↓ (inject as system message)
LLM Service
↓ (generates response with memory context)
Response ↓
Assistant Aggregator
↓ (TTS)
Voice Output

Here's what happens on each turn:

  1. New turn starts. A user message comes in and the user aggregator creates an OpenAILLMContextFrame.
  2. Recall. The HindsightMemoryService extracts the latest user message and searches Hindsight for relevant past conversations. Results are injected as a system message before the LLM sees the context.
  3. Retain. Any new complete user+assistant exchange is sent to Hindsight asynchronously (non-blocking), so the pipeline stays responsive.
  4. Forward. The enriched context frame is passed downstream to the LLM service, which now has access to relevant memories.

Memory accumulates across calls. By the third or fourth conversation, recall starts surfacing useful context that the pipeline didn't have to re-establish in the current call.


Setting Up Pipecat Persistent Memory

The setup process is minimal. Install the integration, choose your Hindsight deployment, and add one line to your pipeline.

Step 1: Install the Pipecat Integration

pip install hindsight-pipecat

Step 2: Choose Your Hindsight Deployment

Pick either Hindsight Cloud (recommended, no self-hosting) or Local (run your own daemon).

Sign up free for Hindsight Cloud — managed infrastructure, no daemon to run, memory syncs across your devices.

Option 2b: Local Hindsight

Run Hindsight locally:

pip install hindsight-all
export HINDSIGHT_API_LLM_API_KEY=YOUR_OPENAI_KEY
hindsight-api

This runs at http://localhost:8888 with embedded Postgres, local embeddings, and local reranking.

Step 3: Add Memory to Your Pipeline

Create a Hindsight memory service and add it to your pipeline:

For Hindsight Cloud:

from pipecat.pipeline.pipeline import Pipeline
from hindsight_pipecat import HindsightMemoryService

memory = HindsightMemoryService(
bank_id="user-123",
hindsight_api_url="https://api.hindsight.vectorize.io",
api_key="hsk_your_token_here",
)

pipeline = Pipeline([
transport.input(),
stt_service,
user_aggregator,
memory, # ← add between user_aggregator and LLM
llm_service,
assistant_aggregator,
tts_service,
transport.output(),
])

For Local Hindsight:

memory = HindsightMemoryService(
bank_id="user-123",
hindsight_api_url="http://localhost:8888",
)

The memory service now:

  • Recalls relevant past conversations before every LLM call
  • Automatically retains new exchanges after each turn
  • Runs asynchronously so your pipeline stays responsive

Configuration and Tuning

The HindsightMemoryService accepts several parameters to control memory behavior:

HindsightMemoryService(
bank_id="user-123", # Required: memory bank to use
hindsight_api_url="...", # Hindsight API URL
api_key="hsk_...", # API key (Hindsight Cloud)
recall_budget="mid", # "low", "mid", or "high"
recall_max_tokens=4096, # Max tokens for recall results
enable_recall=True, # Inject memories before LLM
enable_retain=True, # Store turns after each exchange
memory_prefix="Relevant memories from past conversations:\n",
)

Recall Budget

The recall_budget parameter controls how much memory data gets injected before the LLM:

  • "low": Fast recalls, fewer results. Good for latency-sensitive applications or tight context windows.
  • "mid" (default): Balanced trade-off. 50-200ms latency, moderate result count.
  • "high": Exhaustive recalls. Maximum results, highest latency. Good for complex conversations where context matters most.

For real-time voice, "low" or "mid" are usually best. You want memory recall to happen in 100-300ms so the voice agent responds quickly.

Selective Enable/Disable

You can disable recall or retain individually:

# Recall-only: search past conversations but don't store new ones
memory = HindsightMemoryService(
bank_id="user-123",
enable_retain=False,
enable_recall=True,
)

# Retain-only: store conversations but don't recall them
memory = HindsightMemoryService(
bank_id="user-123",
enable_retain=True,
enable_recall=False,
)

This is useful for multi-agent architectures where one pipeline gathers information while another agent answers questions using only recalled memory.

Global Configuration

If you have multiple pipelines sharing the same Hindsight instance, use global configuration:

from hindsight_pipecat import configure

configure(
hindsight_api_url="http://localhost:8888",
api_key="hsk_...",
recall_budget="mid",
)

# Now create services without repeating connection details
memory = HindsightMemoryService(bank_id="user-123")

Real-World Example: Customer Support Voice Agent

Here's a complete example of a customer support voice agent with persistent memory:

import os
from pipecat.services.openai import OpenAILLMService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.transports.daily import DailyTransport
from pipecat.pipeline.pipeline import Pipeline
from hindsight_pipecat import HindsightMemoryService

# Setup Hindsight memory with customer ID as bank_id
customer_id = "customer-abc-123"
memory = HindsightMemoryService(
bank_id=customer_id,
hindsight_api_url="https://api.hindsight.vectorize.io",
api_key=os.getenv("HINDSIGHT_API_KEY"),
recall_budget="mid",
)

# Setup services
llm = OpenAILLMService(model="gpt-4o-mini")
stt = DeepgramSTTService(model="nova-2")
tts = CartesiaTTSService(model="sonic-english")
transport = DailyTransport(room_url="https://...", token="...")

# Build pipeline with memory
pipeline = Pipeline([
transport.input(),
stt,
transport.user_aggregator,
memory, # ← Memory service here
llm,
transport.assistant_aggregator,
tts,
transport.output(),
])

On the first call, the agent knows nothing about the customer. By the third or fourth call, Hindsight has extracted:

  • Customer name, account status, past issues
  • Preferences for communication style
  • Technical details from previous troubleshooting
  • Resolution history (what worked, what didn't)

Each new call starts with relevant context automatically injected, so the agent can pick up exactly where it left off and respond more intelligently.


Use Cases for Voice AI with Persistent Memory

Customer Support

Support agents remember customer history, previous issues, and what solutions worked. Faster resolution, fewer repetitive questions, better customer experience.

Personal Assistant

Assistants remember user preferences, calendar, habits, and context. Multi-session conversations feel natural instead of starting from scratch each time.

Sales and Booking

Agents remember customer needs, budget, preferences, and past interactions. More targeted recommendations, better close rates, less context-building.

Healthcare and Wellness

Voice health assistants remember patient history, medications, symptoms from past conversations. Continuity of care without manual note review.

Accessibility

Voice-first interfaces for users with visual impairments benefit enormously from persistent context. The agent remembers where you were in a process, what you were doing last time, and can help you pick up without repetition.

Multi-Session Research

Voice agents that gather research across multiple conversations can consolidate findings, spot patterns, and synthesize insights from accumulated data.


Advanced Patterns

Multi-User Pipelines

If a single pipeline handles multiple users (e.g., a shared bot), use the user ID in the bank_id:

user_id = extract_user_from_request()
memory = HindsightMemoryService(
bank_id=f"user-{user_id}",
hindsight_api_url="https://api.hindsight.vectorize.io",
api_key=os.getenv("HINDSIGHT_API_KEY"),
)

Each user gets isolated memory. No cross-contamination.

Scoped Memory Banks

For complex applications, use separate banks for different scopes:

# Support agent per customer
support_memory = HindsightMemoryService(bank_id=f"support-{customer_id}")

# Research agent per project
research_memory = HindsightMemoryService(bank_id=f"research-{project_id}")

# Agent-specific context
agent_memory = HindsightMemoryService(bank_id=f"agent-{agent_role}")

Isolation ensures memory doesn't leak between contexts.

Testing and Development

For testing, use a temporary bank and disable retention:

memory = HindsightMemoryService(
bank_id="test-session",
enable_retain=False, # Don't pollute persistent memory
enable_recall=True, # But do test recall
)

Or switch to a local test Hindsight instance that's separate from production.


Edge Cases and Troubleshooting

Latency

Recall lookups add 50-300ms depending on recall_budget and memory size. For voice applications where response time matters, use "low" budget and monitor latency.

memory = HindsightMemoryService(
bank_id="user-123",
recall_budget="low",
recall_max_tokens=2048, # Limit tokens to reduce latency
)

Privacy and Data Retention

If you need to delete all memories for a user (GDPR, user request), use the Hindsight Cloud dashboard or API to clear the bank:

# Via API
curl -X DELETE https://api.hindsight.vectorize.io/banks/user-123 \
-H "Authorization: Bearer hsk_..."

Async Frame Handling

The HindsightMemoryService processes frames asynchronously. If you need synchronous behavior (rare), you must wait for the retain operation. In practice, async is better—it keeps your pipeline responsive.


Pipecat Memory: Tradeoffs and Alternatives

When Pipecat Persistent Memory Makes Sense

Pipecat persistent memory works best for voice agents that interact with users across multiple sessions over days, weeks, or months. Good use cases include customer support bots, personal assistants, accessibility interfaces, and any voice agent where remembering past interactions improves quality.

When Not to Use It

Skip persistent memory for stateless APIs or one-shot voice transactions where context doesn't compound over time. Also skip if your pipeline is purely transactional (e.g., order confirmation calls) where no learning is needed.

How It Compares to Alternatives

ApproachStrengthsWeaknessesBest For
Hindsight + PipecatMulti-strategy retrieval (semantic + BM25 + graph + temporal), structured entity extraction, low-latency recallRequires Hindsight server or cloud accountVoice agents with multi-session context needs
Manual LLM contextSimple, no dependencies, full controlGrows linearly with conversation length, no entity extractionShort single-session voice calls
Custom vector storeFull control over retrievalYou build chunking, indexing, and reranking yourselfTeams with existing vector infrastructure
Transcript storage onlySimple audit trailNo structured retrieval, no synthesisCompliance-focused voice recording

For most voice pipelines that need persistent memory across sessions with minimal changes, the Hindsight integration is the fastest path to production.


Recap: Pipecat Persistent Memory Integration

Adding persistent memory to Pipecat voice pipelines doesn't require rebuilding your architecture. The hindsight-pipecat integration provides a single FrameProcessor that:

  • Recalls relevant past conversations before each LLM call
  • Retains new exchanges asynchronously after each turn
  • Stays responsive with configurable recall budgets for latency-sensitive voice applications

Just drop it into your pipeline between the user aggregator and LLM service. Memory accumulates over calls and compounds in value as your voice agent learns more about each user.

For Python developers building Pipecat voice agents that need persistent memory, this is the simplest path from stateless to stateful.


Next Steps

  • Try it locally: pip install hindsight-all hindsight-pipecat and test with the interactive chat example
  • Use Hindsight Cloud: Skip self-hosting with a free account
  • Run the basic example: python examples/basic_pipeline.py to see the full voice pipeline with memory
  • Test interactive recall: python examples/interactive_chat.py --bank demo-user to see recall in action
  • Configure recall budget: Tune "low", "mid", or "high" based on your latency requirements
  • Inspect memories: Use the Hindsight Cloud dashboard to browse extracted facts and entities from conversations
  • Explore other integrations: Add memory to Pydantic AI agents, CrewAI agents, or any framework via MCP
  • Read the Pipecat docs: Learn more about Pipecat pipelines and services to extend your voice agent further