
I Gave 100+ LLMs a Permanent Memory With One Python Package

· 6 min read
Ben Bartholomew
Hindsight Team

hindsight-litellm adds persistent memory to any LLM provider via LiteLLM — OpenAI, Anthropic, Groq, Azure, Bedrock, Vertex AI, and 100+ more. Three lines of setup, and every LLM call automatically gets context from past conversations.


The Problem: Stateless LLM Calls

You build an app with an LLM. User talks to it on Monday. Comes back Tuesday. The LLM has no idea who they are.

This is true for every provider. OpenAI, Anthropic, Groq, Azure — every API call is a blank slate. The LLM doesn't remember preferences, past conversations, or anything you've discussed before.

Most teams work around this by stuffing the last N messages into the context window. But that's not memory — it's a sliding window that drops everything older than your token limit. The user's preferences from last week? Gone. The project context from last month? Gone.
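To make the failure mode concrete, here is a minimal sketch of the sliding-window workaround (illustrative only, not part of hindsight-litellm; the word-count tokenizer is a stand-in for a real one):

```python
# The common workaround: keep only the newest messages that fit a token
# budget, silently dropping everything older.

def sliding_window(messages, max_tokens,
                   count_tokens=lambda m: len(m["content"].split())):
    kept, used = [], 0
    # Walk from newest to oldest, keeping messages until the budget runs out.
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = [
    {"role": "user", "content": "I prefer pytest for testing"},  # last week
    {"role": "user", "content": "My project uses FastAPI"},      # yesterday
    {"role": "user", "content": "Help me write a test"},         # today
]
print(sliding_window(history, max_tokens=9))
```

With a budget of 9 "tokens", the two recent messages survive and the week-old preference is dropped, exactly the failure described above.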

What if every LLM call automatically had the right context from past conversations?


The Fix: Three Lines, Any Provider

hindsight-litellm hooks into LiteLLM to intercept every LLM call. Before the call, it retrieves relevant memories from Hindsight and injects them into the prompt. After the call, it stores the conversation for future retrieval.

Your code ──→ hindsight_litellm.completion()

├─→ 1. Recall/Reflect from Hindsight (relevant memories)
├─→ 2. Inject memories into prompt
├─→ 3. Forward to LLM (any provider via LiteLLM)
├─→ 4. Store conversation to Hindsight
└─→ 5. Return response
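The five steps above can be sketched in plain Python. This is a toy model of the intercept pattern, not the package's internals: the in-memory list and naive keyword matcher stand in for a Hindsight memory bank, and `fake_llm` stands in for a provider call.

```python
MEMORY_STORE = []  # stands in for a Hindsight memory bank

def recall(query):
    # 1. Retrieve memories relevant to the query (naive keyword match here).
    return [m for m in MEMORY_STORE if any(w in m for w in query.lower().split())]

def completion_with_memory(llm, messages, query):
    memories = recall(query)                                  # 1. recall
    if memories:
        context = "# Relevant Memories\n" + "\n".join(
            f"{i}. {m}" for i, m in enumerate(memories, start=1))
        messages = [{"role": "system", "content": context}] + messages  # 2. inject
    response = llm(messages)                                  # 3. forward to LLM
    MEMORY_STORE.extend(                                      # 4. store
        m["content"].lower() for m in messages if m["role"] == "user")
    return response                                           # 5. return

fake_llm = lambda msgs: f"(saw {len(msgs)} messages)"
completion_with_memory(fake_llm, [{"role": "user", "content": "I use FastAPI"}], "frameworks")
print(completion_with_memory(fake_llm, [{"role": "user", "content": "Help me test"}], "fastapi usage"))
```

The second call recalls the stored FastAPI fact and injects it as a system message, so the LLM sees two messages instead of one.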

Here's the setup:

import hindsight_litellm

hindsight_litellm.configure(
    hindsight_api_url="http://localhost:8888",
)

hindsight_litellm.set_defaults(
    bank_id="my-agent",
)

hindsight_litellm.enable()

That's it. Now every completion() call has memory:

response = hindsight_litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Help me with my Python project"}],
    hindsight_query="What do I know about the user's Python project?",
)

The hindsight_query tells Hindsight what to search for in memory. The response comes back with context from past conversations — the LLM knows about the user's project, their preferences, their tech stack.

Note: You can also use Hindsight Cloud and skip the self-hosted setup entirely.


Any Provider, Same Memory

Because the integration runs through LiteLLM, it works with every provider LiteLLM supports. Switch models freely — memory follows:

messages = [{"role": "user", "content": "What did we discuss last time?"}]
query = "What have I discussed with this user?"

# OpenAI
hindsight_litellm.completion(model="gpt-4o", messages=messages, hindsight_query=query)

# Anthropic
hindsight_litellm.completion(model="claude-sonnet-4-20250514", messages=messages, hindsight_query=query)

# Groq
hindsight_litellm.completion(model="groq/llama-3.1-70b-versatile", messages=messages, hindsight_query=query)

# Azure OpenAI
hindsight_litellm.completion(model="azure/gpt-4", messages=messages, hindsight_query=query)

# AWS Bedrock
hindsight_litellm.completion(model="bedrock/anthropic.claude-3", messages=messages, hindsight_query=query)

# Google Vertex AI
hindsight_litellm.completion(model="vertex_ai/gemini-pro", messages=messages, hindsight_query=query)

Same bank, same memories, any model. Migrate from OpenAI to Anthropic and your agent still knows everything.


Two Memory Modes

Recall: Raw Memories

The default mode retrieves individual memory facts and injects them as a numbered list:

hindsight_litellm.set_defaults(bank_id="my-agent", use_reflect=False)

The LLM sees something like:

# Relevant Memories
1. [WORLD] User is building a FastAPI application
2. [WORLD] User prefers pytest for testing
3. [OBSERVATION] User likes type hints

Best for precise, factual context where the LLM needs individual data points.
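As a rough illustration of how recalled facts could be rendered into that numbered block (the formatting logic and dict field names here are assumptions, not the package's internals):

```python
# Render recalled memory facts as the numbered-list block shown above.

def format_memories(memories):
    lines = ["# Relevant Memories"]
    for i, m in enumerate(memories, start=1):
        lines.append(f"{i}. [{m['fact_type']}] {m['text']}")
    return "\n".join(lines)

block = format_memories([
    {"fact_type": "WORLD", "text": "User is building a FastAPI application"},
    {"fact_type": "WORLD", "text": "User prefers pytest for testing"},
])
print(block)
```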

Reflect: Synthesized Context

Reflect mode synthesizes memories into a coherent paragraph instead of raw facts:

hindsight_litellm.set_defaults(
    bank_id="my-agent",
    use_reflect=True,
    reflect_context="I am a coding assistant helping with Python projects.",
)

The LLM sees something like:

The user is an experienced Python developer working on a FastAPI application.
They prefer pytest for testing and value type hints. In past conversations,
they asked about async patterns and database migrations.

Best for natural, conversational context. The reflect_context parameter shapes how Hindsight reasons about the memories without affecting what it retrieves.


Direct Memory APIs

Sometimes you want to query memory outside of an LLM call. The direct APIs give you full control:

Recall — query raw memories

from hindsight_litellm import recall

memories = recall("what projects is the user working on?", budget="mid")
for m in memories:
    print(f"- [{m.fact_type}] {m.text}")

Reflect — synthesized context

from hindsight_litellm import reflect

result = reflect(
    "what do you know about the user's preferences?",
    context="I am a customer support agent.",
)
print(result.text)

Retain — store memories manually

from hindsight_litellm import retain

retain(
    content="User mentioned they're switching from Flask to FastAPI",
    context="Discussion about web frameworks",
)

All three have async variants: arecall(), areflect(), aretain().
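The async variants are handy for firing several memory queries concurrently. A sketch with a stubbed coroutine standing in for arecall() (the stub and its return shape are assumptions for illustration):

```python
import asyncio

async def arecall(query):
    await asyncio.sleep(0)  # stands in for a network round-trip to Hindsight
    return [f"memory matching {query!r}"]

async def gather_context():
    # Run both memory queries concurrently instead of back-to-back.
    projects, prefs = await asyncio.gather(
        arecall("user's current projects"),
        arecall("user's preferences"),
    )
    return projects + prefs

print(asyncio.run(gather_context()))
```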


Per-Call Overrides

Defaults are convenient, but sometimes you need per-call control:

response = hindsight_litellm.completion(
    model="gpt-4o-mini",
    messages=[...],
    hindsight_query="What do I know about Alice?",
    hindsight_bank_id="other-agent",  # Different bank for this call
    hindsight_budget="high",  # More thorough retrieval
    hindsight_reflect_context="Currently helping with onboarding",
)

Any default can be overridden with a hindsight_* prefix.
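The prefix convention boils down to a simple merge: per-call hindsight_* kwargs win over the configured defaults. A sketch of that logic (illustrative, not the package's internals):

```python
DEFAULTS = {"bank_id": "my-agent", "budget": "mid"}

def resolve_settings(defaults, **kwargs):
    settings = dict(defaults)
    for key, value in kwargs.items():
        if key.startswith("hindsight_"):
            # Strip the prefix and let the per-call value override the default.
            settings[key[len("hindsight_"):]] = value
    return settings

print(resolve_settings(DEFAULTS, hindsight_bank_id="other-agent", hindsight_budget="high"))
```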


Native SDK Wrappers

If you use the OpenAI or Anthropic SDKs directly (without LiteLLM), there are native wrappers:

OpenAI

from openai import OpenAI
from hindsight_litellm import wrap_openai

client = OpenAI()
wrapped = wrap_openai(
    client,
    bank_id="my-agent",
    hindsight_api_url="http://localhost:8888",
)

response = wrapped.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What do you know about me?"}],
)

Anthropic

from anthropic import Anthropic
from hindsight_litellm import wrap_anthropic

client = Anthropic()
wrapped = wrap_anthropic(
    client,
    bank_id="my-agent",
    hindsight_api_url="http://localhost:8888",
)

response = wrapped.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)

Same memory, no LiteLLM dependency.


Context Manager

For temporary memory integration:

from hindsight_litellm import hindsight_memory
import litellm

with hindsight_memory(
    hindsight_api_url="http://localhost:8888",
    bank_id="user-123",
):
    response = litellm.completion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
        hindsight_query="greeting context",
    )
# Memory integration automatically disabled after the block

Bank Missions

Tell Hindsight what kind of knowledge the bank should build. This shapes how memories are consolidated into mental models. Pass mission and bank_name to configure():

hindsight_litellm.configure(
    hindsight_api_url="http://localhost:8888",
    mission="""This agent routes customer support requests.
Remember which issue types go to which teams (billing, technical, sales).
Track customer preferences and past resolutions.""",
    bank_name="Customer Support Router",
)

Recap

  • hindsight-litellm gives any LLM persistent memory across conversations
  • Works with 100+ providers via LiteLLM, plus native OpenAI and Anthropic wrappers
  • Three lines of setup: configure(), set_defaults(), enable()
  • Two modes: recall for raw facts, reflect for synthesized context
  • Direct APIs (recall, reflect, retain) for manual memory control
  • Per-call overrides, bank missions, async support, and debug mode

The integration handles memory retrieval, prompt injection, and conversation storage automatically. You just call completion() as usual — the LLM shows up with the relevant history.


Next Steps

  • Try it locally: pip install hindsight-all hindsight-litellm and run the quick start above
  • Use Hindsight Cloud: Skip self-hosting with a free account
  • Explore memory modes: Try use_reflect=True for synthesized context vs raw facts
  • Set a bank mission: Shape what knowledge your agent accumulates
  • Inspect with debug mode: Set verbose=True and call get_last_injection_debug() to see exactly what memories are injected