I Gave 100+ LLMs a Permanent Memory With One Python Package
hindsight-litellm adds persistent memory to any LLM provider via LiteLLM — OpenAI, Anthropic, Groq, Azure, Bedrock, Vertex AI, and 100+ more. Three lines of setup, and every LLM call automatically gets context from past conversations.
The Problem: Stateless LLM Calls
You build an app with an LLM. A user talks to it on Monday and comes back on Tuesday. The LLM has no idea who they are.
This is true for every provider. OpenAI, Anthropic, Groq, Azure — every API call is a blank slate. The LLM doesn't remember preferences, past conversations, or anything you've discussed before.
Most teams work around this by stuffing the last N messages into the context window. But that's not memory — it's a sliding window that drops everything older than your token limit. The user's preferences from last week? Gone. The project context from last month? Gone.
What if every LLM call automatically had the right context from past conversations?
The Fix: Three Lines, Any Provider
hindsight-litellm hooks into LiteLLM to intercept every LLM call. Before the call, it retrieves relevant memories from Hindsight and injects them into the prompt. After the call, it stores the conversation for future retrieval.
Your code ──→ hindsight_litellm.completion()
│
├─→ 1. Recall/Reflect from Hindsight (relevant memories)
├─→ 2. Inject memories into prompt
├─→ 3. Forward to LLM (any provider via LiteLLM)
├─→ 4. Store conversation to Hindsight
└─→ 5. Return response
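As a rough mental model, the five steps above can be sketched in plain Python. Everything here (`MEMORY`, `fake_recall`, `fake_llm`, `completion_with_memory`) is a hypothetical stand-in for illustration — the real package wires this flow into LiteLLM rather than implementing it this naively:

```python
# Toy sketch of the recall → inject → call → store loop.
MEMORY = []  # pretend memory bank

def fake_recall(query):
    # Real retrieval is semantic; this stand-in just does substring matching.
    return [m for m in MEMORY if any(w in m.lower() for w in query.lower().split())]

def fake_llm(messages):
    return f"(model reply to {len(messages)} messages)"

def completion_with_memory(messages, query):
    memories = fake_recall(query)                                         # 1. recall
    if memories:
        context = "Relevant memories:\n" + "\n".join(memories)
        messages = [{"role": "system", "content": context}] + messages    # 2. inject
    response = fake_llm(messages)                                         # 3. forward to LLM
    MEMORY.extend(m["content"] for m in messages if m["role"] == "user")  # 4. store
    return response                                                       # 5. return

completion_with_memory([{"role": "user", "content": "I use FastAPI"}], "user stack")
print(completion_with_memory(
    [{"role": "user", "content": "What framework do I use?"}],
    "framework FastAPI",
))
```

On the second call the earlier "I use FastAPI" message is recalled and injected as a system message before the model is invoked — which is the whole trick, minus the hard parts (semantic retrieval, consolidation, provider routing) that Hindsight and LiteLLM handle.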
Here's the setup:
```python
import hindsight_litellm

hindsight_litellm.configure(
    hindsight_api_url="http://localhost:8888",
)

hindsight_litellm.set_defaults(
    bank_id="my-agent",
)

hindsight_litellm.enable()
```
That's it. Now every completion() call has memory:
```python
response = hindsight_litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Help me with my Python project"}],
    hindsight_query="What do I know about the user's Python project?",
)
```
The hindsight_query tells Hindsight what to search for in memory. The response comes back with context from past conversations — the LLM knows about the user's project, their preferences, their tech stack.
Note: You can also use Hindsight Cloud and skip the self-hosted setup entirely.
Any Provider, Same Memory
Because the integration runs through LiteLLM, it works with every provider LiteLLM supports. Switch models freely — memory follows:
```python
messages = [{"role": "user", "content": "What did we discuss last time?"}]
query = "What have I discussed with this user?"

# OpenAI
hindsight_litellm.completion(model="gpt-4o", messages=messages, hindsight_query=query)

# Anthropic
hindsight_litellm.completion(model="claude-sonnet-4-20250514", messages=messages, hindsight_query=query)

# Groq
hindsight_litellm.completion(model="groq/llama-3.1-70b-versatile", messages=messages, hindsight_query=query)

# Azure OpenAI
hindsight_litellm.completion(model="azure/gpt-4", messages=messages, hindsight_query=query)

# AWS Bedrock
hindsight_litellm.completion(model="bedrock/anthropic.claude-3", messages=messages, hindsight_query=query)

# Google Vertex AI
hindsight_litellm.completion(model="vertex_ai/gemini-pro", messages=messages, hindsight_query=query)
```
Same bank, same memories, any model. Migrate from OpenAI to Anthropic and your agent still knows everything.
Two Memory Modes
Recall: Raw Memories
The default mode retrieves individual memory facts and injects them as a numbered list:
```python
hindsight_litellm.set_defaults(bank_id="my-agent", use_reflect=False)
```
The LLM sees something like:
```
# Relevant Memories
1. [WORLD] User is building a FastAPI application
2. [WORLD] User prefers pytest for testing
3. [OBSERVATION] User likes type hints
```
Best for precise, factual context where the LLM needs individual data points.
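The numbered-list format above is simple to reproduce. As a sketch of my own (not the package's internals — the real memory objects come from Hindsight, and the dict shape here is assumed for illustration):

```python
# Turn recalled facts into the "# Relevant Memories" numbered list shown above.
def format_memories(memories):
    lines = ["# Relevant Memories"]
    for i, m in enumerate(memories, start=1):
        lines.append(f"{i}. [{m['fact_type']}] {m['text']}")
    return "\n".join(lines)

facts = [
    {"fact_type": "WORLD", "text": "User is building a FastAPI application"},
    {"fact_type": "WORLD", "text": "User prefers pytest for testing"},
    {"fact_type": "OBSERVATION", "text": "User likes type hints"},
]
print(format_memories(facts))
```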
Reflect: Synthesized Context
Reflect mode synthesizes memories into a coherent paragraph instead of raw facts:
```python
hindsight_litellm.set_defaults(
    bank_id="my-agent",
    use_reflect=True,
    reflect_context="I am a coding assistant helping with Python projects.",
)
```
The LLM sees something like:
```
The user is an experienced Python developer working on a FastAPI application.
They prefer pytest for testing and value type hints. In past conversations,
they asked about async patterns and database migrations.
```
Best for natural, conversational context. The reflect_context parameter shapes how Hindsight reasons about the memories without affecting what it retrieves.
Direct Memory APIs
Sometimes you want to query memory outside of an LLM call. The direct APIs give you full control:
Recall — query raw memories
```python
from hindsight_litellm import recall

memories = recall("what projects is the user working on?", budget="mid")
for m in memories:
    print(f"- [{m.fact_type}] {m.text}")
```
Reflect — synthesized context
```python
from hindsight_litellm import reflect

result = reflect(
    "what do you know about the user's preferences?",
    context="I am a customer support agent.",
)
print(result.text)
```
Retain — store memories manually
```python
from hindsight_litellm import retain

retain(
    content="User mentioned they're switching from Flask to FastAPI",
    context="Discussion about web frameworks",
)
```
All three have async variants: arecall(), areflect(), aretain().
Per-Call Overrides
Defaults are convenient, but sometimes you need per-call control:
```python
response = hindsight_litellm.completion(
    model="gpt-4o-mini",
    messages=[...],
    hindsight_query="What do I know about Alice?",
    hindsight_bank_id="other-agent",      # Different bank for this call
    hindsight_budget="high",              # More thorough retrieval
    hindsight_reflect_context="Currently helping with onboarding",
)
```
Any default can be overridden with a hindsight_* prefix.
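The precedence rule is easy to picture: per-call `hindsight_*` kwargs win over whatever `set_defaults()` established. A toy merge (hypothetical names and dict shape, just to illustrate the prefix-stripping and override order, not the package's actual internals):

```python
# Illustrative precedence sketch: per-call hindsight_* kwargs override defaults.
DEFAULTS = {"bank_id": "my-agent", "budget": "mid", "reflect_context": None}

def resolve_settings(**kwargs):
    overrides = {
        k.removeprefix("hindsight_"): v
        for k, v in kwargs.items()
        if k.startswith("hindsight_")
    }
    return {**DEFAULTS, **overrides}  # per-call values win

settings = resolve_settings(hindsight_bank_id="other-agent", hindsight_budget="high")
print(settings["bank_id"], settings["budget"])  # untouched keys keep their defaults
```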
Native SDK Wrappers
If you use the OpenAI or Anthropic SDKs directly (without LiteLLM), there are native wrappers:
OpenAI
```python
from openai import OpenAI
from hindsight_litellm import wrap_openai

client = OpenAI()
wrapped = wrap_openai(
    client,
    bank_id="my-agent",
    hindsight_api_url="http://localhost:8888",
)

response = wrapped.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What do you know about me?"}],
)
```
Anthropic
```python
from anthropic import Anthropic
from hindsight_litellm import wrap_anthropic

client = Anthropic()
wrapped = wrap_anthropic(
    client,
    bank_id="my-agent",
    hindsight_api_url="http://localhost:8888",
)

response = wrapped.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
```
Same memory, no LiteLLM dependency.
Context Manager
For temporary memory integration:
```python
from hindsight_litellm import hindsight_memory
import litellm

with hindsight_memory(
    hindsight_api_url="http://localhost:8888",
    bank_id="user-123",
):
    response = litellm.completion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
        hindsight_query="greeting context",
    )

# Memory integration is automatically disabled after the block
```
Bank Missions
Tell Hindsight what kind of knowledge the bank should build. This shapes how memories are consolidated into mental models. Pass mission and bank_name to configure():
```python
hindsight_litellm.configure(
    hindsight_api_url="http://localhost:8888",
    mission="""This agent routes customer support requests.
Remember which issue types go to which teams (billing, technical, sales).
Track customer preferences and past resolutions.""",
    bank_name="Customer Support Router",
)
```
Recap
- `hindsight-litellm` gives any LLM persistent memory across conversations
- Works with 100+ providers via LiteLLM, plus native OpenAI and Anthropic wrappers
- Three lines of setup: `configure()`, `set_defaults()`, `enable()`
- Two modes: `recall` for raw facts, `reflect` for synthesized context
- Direct APIs (`recall`, `reflect`, `retain`) for manual memory control
- Per-call overrides, bank missions, async support, and debug mode
The integration handles memory retrieval, prompt injection, and conversation storage automatically. You just call completion() as usual — the LLM remembers everything.
Next Steps
- Try it locally: `pip install hindsight-all hindsight-litellm` and run the quick start above
- Use Hindsight Cloud: Skip self-hosting with a free account
- Explore memory modes: Try `use_reflect=True` for synthesized context vs raw facts
- Set a bank mission: Shape what knowledge your agent accumulates
- Inspect with debug mode: Set `verbose=True` and call `get_last_injection_debug()` to see exactly which memories were injected
