LiteLLM
Universal LLM memory integration via LiteLLM. Add persistent memory to any LLM application with just a few lines of code.
Features
- Universal LLM Support - Works with 100+ LLM providers via LiteLLM (OpenAI, Anthropic, Groq, Azure, AWS Bedrock, Google Vertex AI, and more)
- Simple Integration - Just configure, enable, and use
hindsight_litellm.completion() - Automatic Memory Injection - Relevant memories are injected into prompts before LLM calls
- Automatic Conversation Storage - Conversations are stored to Hindsight for future recall
- Two Memory Modes - Choose between
reflect(synthesized context) orrecall(raw memory retrieval) - Direct Memory APIs - Query, synthesize, and store memories manually
- Native Client Wrappers - Alternative wrappers for OpenAI and Anthropic SDKs
Installation
pip install hindsight-litellm
Quick Start
import hindsight_litellm
# Configure and enable memory integration
hindsight_litellm.configure(
hindsight_api_url="http://localhost:8888",
bank_id="my-agent",
)
hindsight_litellm.enable()
# Use the convenience wrapper - memory is automatically injected and stored
response = hindsight_litellm.completion(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "What did we discuss about AI?"}]
)
How It Works
When you call completion(), the following happens automatically:
- Memory Retrieval - Hindsight is queried for relevant memories based on the conversation
- Prompt Injection - Memories are injected into the system message
- LLM Call - The enriched prompt is sent to the LLM
- Conversation Storage - The conversation is stored to Hindsight for future recall
- Response Returned - You receive the response as normal
Configuration Options
hindsight_litellm.configure(
# Required
hindsight_api_url="http://localhost:8888", # Hindsight API server URL
bank_id="my-agent", # Memory bank ID
api_key="your-api-key", # Optional API key for authentication
# Optional - Memory behavior
store_conversations=True, # Store conversations after LLM calls
inject_memories=True, # Inject relevant memories into prompts
use_reflect=False, # Use reflect API (synthesized) vs recall (raw memories)
reflect_include_facts=False, # Include source facts with reflect responses
max_memories=None, # Maximum memories to inject (None = unlimited)
max_memory_tokens=4096, # Maximum tokens for memory context
recall_budget="mid", # Recall budget: "low", "mid", "high"
fact_types=["world", "agent"], # Filter fact types to inject
# Optional - Bank Configuration
bank_name="My Agent", # Human-readable display name for the memory bank
mission="This agent...", # Instructions guiding what Hindsight should remember
# Optional - Advanced
injection_mode="system_message", # or "prepend_user"
excluded_models=["gpt-3.5*"], # Exclude certain models
verbose=True, # Enable verbose logging and debug info
)
Bank Configuration
The mission and bank_name parameters configure the memory bank itself. When provided, configure() will automatically create or update the bank with these settings.
hindsight_litellm.configure(
hindsight_api_url="http://localhost:8888",
bank_id="support-router",
bank_name="Customer Support Router",
mission="""You're a customer support router - keep track of which types of issues
should go to which teams (billing, technical, sales), customer preferences for
communication channels, and past issue resolutions.""",
)
Memory Modes: Reflect vs Recall
- Recall mode (
use_reflect=False, default): Retrieves raw memory facts and injects them as a numbered list. Best when you need precise, individual memories. - Reflect mode (
use_reflect=True): Synthesizes memories into a coherent context paragraph. Best for natural, conversational memory context.
# Recall mode - raw memories
hindsight_litellm.configure(
bank_id="my-agent",
use_reflect=False, # Default
)
# Injects: "1. [WORLD] User prefers Python\n2. [MENTAL MODEL] User prefers simple code..."
# Reflect mode - synthesized context
hindsight_litellm.configure(
bank_id="my-agent",
use_reflect=True,
)
# Injects: "Based on previous conversations, the user is a Python developer who..."
Multi-Provider Support
Works with any LiteLLM-supported provider:
import hindsight_litellm
hindsight_litellm.configure(
hindsight_api_url="http://localhost:8888",
bank_id="my-agent",
)
hindsight_litellm.enable()
# OpenAI
hindsight_litellm.completion(model="gpt-4o", messages=[...])
# Anthropic
hindsight_litellm.completion(model="claude-3-5-sonnet-20241022", messages=[...])
# Groq
hindsight_litellm.completion(model="groq/llama-3.1-70b-versatile", messages=[...])
# Azure OpenAI
hindsight_litellm.completion(model="azure/gpt-4", messages=[...])
# AWS Bedrock
hindsight_litellm.completion(model="bedrock/anthropic.claude-3", messages=[...])
# Google Vertex AI
hindsight_litellm.completion(model="vertex_ai/gemini-pro", messages=[...])
Direct Memory APIs
Recall - Query raw memories
from hindsight_litellm import configure, recall
configure(bank_id="my-agent", hindsight_api_url="http://localhost:8888")
memories = recall("what projects am I working on?", budget="mid")
for m in memories:
print(f"- [{m.fact_type}] {m.text}")
Reflect - Get synthesized context
from hindsight_litellm import configure, reflect
configure(bank_id="my-agent", hindsight_api_url="http://localhost:8888")
result = reflect("what do you know about the user's preferences?")
print(result.text)
Retain - Store memories
from hindsight_litellm import configure, retain
configure(bank_id="my-agent", hindsight_api_url="http://localhost:8888")
result = retain(
content="User mentioned they're working on a machine learning project",
context="Discussion about current projects",
)
Async APIs
from hindsight_litellm import arecall, areflect, aretain
# Async versions of all memory APIs
memories = await arecall("what do you know about me?")
context = await areflect("summarize user preferences")
result = await aretain(content="New information to remember")
Native Client Wrappers
Alternative to LiteLLM callbacks for direct SDK integration.
OpenAI Wrapper
from openai import OpenAI
from hindsight_litellm import wrap_openai
client = OpenAI()
wrapped = wrap_openai(
client,
bank_id="my-agent",
hindsight_api_url="http://localhost:8888",
)
response = wrapped.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "What do you know about me?"}]
)
Anthropic Wrapper
from anthropic import Anthropic
from hindsight_litellm import wrap_anthropic
client = Anthropic()
wrapped = wrap_anthropic(
client,
bank_id="my-agent",
hindsight_api_url="http://localhost:8888",
)
response = wrapped.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
messages=[{"role": "user", "content": "Hello!"}]
)
Debug Mode
When verbose=True, you can inspect exactly what memories are being injected:
from hindsight_litellm import configure, enable, completion, get_last_injection_debug
configure(
bank_id="my-agent",
hindsight_api_url="http://localhost:8888",
verbose=True,
)
enable()
response = completion(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "What's my favorite color?"}]
)
# Inspect what was injected
debug = get_last_injection_debug()
if debug:
print(f"Mode: {debug.mode}") # "reflect" or "recall"
print(f"Injected: {debug.injected}") # True/False
print(f"Results: {debug.results_count}")
print(f"Memory context:\n{debug.memory_context}")
Context Manager
from hindsight_litellm import hindsight_memory
import litellm
with hindsight_memory(bank_id="user-123"):
response = litellm.completion(model="gpt-4", messages=[...])
# Memory integration automatically disabled after context
Disabling and Cleanup
from hindsight_litellm import disable, cleanup
# Temporarily disable memory integration
disable()
# Clean up all resources (call when shutting down)
cleanup()
API Reference
Main Functions
| Function | Description |
|---|---|
configure(...) | Configure global Hindsight settings |
enable() | Enable memory integration with LiteLLM |
disable() | Disable memory integration |
is_enabled() | Check if memory integration is enabled |
cleanup() | Clean up all resources |
Configuration Functions
| Function | Description |
|---|---|
get_config() | Get current configuration |
is_configured() | Check if Hindsight is configured |
reset_config() | Reset configuration to defaults |
Memory Functions
| Function | Description |
|---|---|
recall(query, ...) | Synchronously query raw memories |
arecall(query, ...) | Asynchronously query raw memories |
reflect(query, ...) | Synchronously get synthesized memory context |
areflect(query, ...) | Asynchronously get synthesized memory context |
retain(content, ...) | Synchronously store a memory |
aretain(content, ...) | Asynchronously store a memory |
Debug Functions
| Function | Description |
|---|---|
get_last_injection_debug() | Get debug info from last memory injection |
clear_injection_debug() | Clear stored debug info |
Client Wrappers
| Function | Description |
|---|---|
wrap_openai(client, ...) | Wrap OpenAI client with memory |
wrap_anthropic(client, ...) | Wrap Anthropic client with memory |
Requirements
- Python >= 3.10
- litellm >= 1.40.0
- A running Hindsight API server