LiteLLM

Universal LLM memory integration via LiteLLM. Add persistent memory to any LLM application with just a few lines of code.

Features

  • Universal LLM Support - Works with 100+ LLM providers via LiteLLM (OpenAI, Anthropic, Groq, Azure, AWS Bedrock, Google Vertex AI, and more)
  • Simple Integration - Just configure, set defaults, enable, and use hindsight_litellm.completion()
  • Automatic Memory Injection - Relevant memories are injected into prompts before LLM calls
  • Automatic Conversation Storage - Conversations are stored to Hindsight for future recall (async by default for performance)
  • Two Memory Modes - Choose between reflect (synthesized context) or recall (raw memory retrieval)
  • Direct Memory APIs - Query, synthesize, and store memories manually
  • Native Client Wrappers - Alternative wrappers for OpenAI and Anthropic SDKs
  • Debug Mode - Inspect exactly what memories are being injected

Installation

pip install hindsight-litellm

Quick Start

import hindsight_litellm

# Step 1: Configure static settings
hindsight_litellm.configure(
    hindsight_api_url="http://localhost:8888",
    verbose=True,
)

# Step 2: Set defaults (bank_id is required)
hindsight_litellm.set_defaults(
    bank_id="my-agent",
    use_reflect=True,  # Use reflect for synthesized context
)

# Step 3: Enable memory integration
hindsight_litellm.enable()

# Step 4: Use completion with memory
response = hindsight_litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What did we discuss about AI?"}],
    hindsight_query="What do I know about AI discussions?",
)

Important: When inject_memories=True (default), you can provide hindsight_query to specify what to search for in memory. If omitted, the last user message is used as the query.
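For example, these two calls run the same memory search (assuming the Quick Start setup above):

# Explicit query
hindsight_litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What did we discuss about AI?"}],
    hindsight_query="What did we discuss about AI?",
)

# Query omitted - the last user message is used instead
hindsight_litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What did we discuss about AI?"}],
)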

How It Works

When you call completion(), the following happens automatically:

  1. Memory Retrieval - Hindsight is queried for relevant memories based on the conversation
  2. Prompt Injection - Memories are injected into the system message
  3. LLM Call - The enriched prompt is sent to the LLM
  4. Conversation Storage - The conversation is stored to Hindsight for future recall
  5. Response Returned - You receive the response as normal
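Conceptually, this is equivalent to wiring the direct memory APIs together by hand. The sketch below is illustrative only (the exact injected format is internal to the library) and assumes the client is already configured:

import litellm
from hindsight_litellm import recall, retain

messages = [{"role": "user", "content": "What did we discuss about AI?"}]

# 1-2. Retrieve relevant memories and inject them into a system message
memories = recall(messages[-1]["content"])
memory_block = "\n".join(f"{i + 1}. {m.text}" for i, m in enumerate(memories))
enriched = [{"role": "system", "content": f"Relevant memories:\n{memory_block}"}] + messages

# 3. Send the enriched prompt to the LLM
response = litellm.completion(model="gpt-4o-mini", messages=enriched)

# 4. Store the exchange so it can be recalled later (async by default)
retain(content=f"User: {messages[-1]['content']}\nAssistant: {response.choices[0].message.content}")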

Configuration Options

The API is split into two functions for clarity:

1. configure() - Static Settings

Settings that typically don't change during a session:

hindsight_litellm.configure(
    # Required
    hindsight_api_url="http://localhost:8888",  # Hindsight API server URL

    # Optional - Authentication
    api_key="your-api-key",  # API key for Hindsight authentication

    # Optional - Memory behavior
    store_conversations=True,  # Store conversations after LLM calls
    inject_memories=True,      # Inject relevant memories into prompts
    sync_storage=False,        # False = async storage (default, better performance)
                               # True = sync storage (blocks, raises errors immediately)

    # Optional - Advanced
    injection_mode="system_message",  # How to inject: "system_message" or "prepend_user"
    excluded_models=["gpt-3.5*"],     # Exclude certain models from interception
    verbose=True,                     # Enable verbose logging and debug info
)

2. set_defaults() - Per-Call Defaults

Default values for per-call settings. These can be overridden on individual calls using hindsight_* kwargs:

hindsight_litellm.set_defaults(
    # Required
    bank_id="my-agent",  # Memory bank ID

    # Optional - Memory retrieval
    budget="mid",                     # Budget level: "low", "mid", "high"
    fact_types=["world", "opinion"],  # Filter fact types to retrieve
    max_memories=10,                  # Maximum memories to inject (None = unlimited)
    max_memory_tokens=4096,           # Maximum tokens for memory context
    include_entities=True,            # Include entity observations in recall

    # Optional - Reflect mode
    use_reflect=True,             # Use reflect API (synthesized) vs recall (raw memories)
    reflect_include_facts=False,  # Include source facts in debug info
    reflect_context="I am a delivery agent finding recipients.",  # Context for reflect reasoning
    reflect_response_schema={...},  # JSON Schema for structured reflect output

    # Optional - Debugging
    trace=False,                   # Enable trace info for debugging
    document_id="conversation-1",  # Document ID for grouping conversations
)

3. Per-Call Overrides

Override any default on individual calls using hindsight_* kwargs:

response = hindsight_litellm.completion(
    model="gpt-4o-mini",
    messages=[...],
    hindsight_query="Where is Alice located?",         # Custom query for memory lookup
    hindsight_reflect_context="Currently on floor 3",  # Per-call reflect context override
    # hindsight_bank_id="other-bank",                  # Override bank_id for this call
)

Bank Configuration: mission

Use set_bank_mission() to configure what the memory bank should learn and remember (used for mental models):

hindsight_litellm.set_bank_mission(
    mission="""This agent routes customer support requests to the appropriate team.
Remember which types of issues should go to which teams (billing, technical, sales).
Track customer preferences for communication channels and past issue resolutions.""",
    name="Customer Support Router",  # Optional display name
)

Memory Modes: Reflect vs Recall

  • Recall mode (use_reflect=False, default): Retrieves raw memory facts and injects them as a numbered list. Best when you need precise, individual memories.
  • Reflect mode (use_reflect=True): Synthesizes memories into a coherent context paragraph. Best for natural, conversational memory context.

# Recall mode - raw memories
hindsight_litellm.set_defaults(bank_id="my-agent", use_reflect=False)
# Injects: "1. [WORLD] User prefers Python\n2. [OPINION] User dislikes Java..."

# Reflect mode - synthesized context
hindsight_litellm.set_defaults(bank_id="my-agent", use_reflect=True)
# Injects: "Based on previous conversations, the user is a Python developer who..."

# Reflect with context - shapes LLM reasoning (not retrieval)
hindsight_litellm.set_defaults(
    bank_id="my-agent",
    use_reflect=True,
    reflect_context="I am a delivery agent looking for package recipients.",
)

Multi-Provider Support

Works with any LiteLLM-supported provider:

import hindsight_litellm

hindsight_litellm.configure(hindsight_api_url="http://localhost:8888")
hindsight_litellm.set_defaults(bank_id="my-agent")
hindsight_litellm.enable()

messages = [{"role": "user", "content": "Hello!"}]

# OpenAI
hindsight_litellm.completion(model="gpt-4o", messages=messages, hindsight_query="greeting")

# Anthropic
hindsight_litellm.completion(model="claude-sonnet-4-20250514", messages=messages, hindsight_query="greeting")

# Groq
hindsight_litellm.completion(model="groq/llama-3.1-70b-versatile", messages=messages, hindsight_query="greeting")

# Azure OpenAI
hindsight_litellm.completion(model="azure/gpt-4", messages=messages, hindsight_query="greeting")

# AWS Bedrock
hindsight_litellm.completion(model="bedrock/anthropic.claude-3", messages=messages, hindsight_query="greeting")

# Google Vertex AI
hindsight_litellm.completion(model="vertex_ai/gemini-pro", messages=messages, hindsight_query="greeting")

Direct Memory APIs

Recall - Query raw memories

from hindsight_litellm import configure, set_defaults, recall

configure(hindsight_api_url="http://localhost:8888")
set_defaults(bank_id="my-agent")

memories = recall("what projects am I working on?", budget="mid")
for m in memories:
    print(f"- [{m.fact_type}] {m.text}")

Reflect - Get synthesized context

from hindsight_litellm import configure, set_defaults, reflect

configure(hindsight_api_url="http://localhost:8888")
set_defaults(bank_id="my-agent")

result = reflect("what do you know about the user's preferences?")
print(result.text)

# With context to shape the response (doesn't affect retrieval)
result = reflect(
    query="what do I know about Alice?",
    context="I am a delivery agent looking for package recipients.",
)

Retain - Store memories

from hindsight_litellm import configure, set_defaults, retain, get_pending_retain_errors

configure(hindsight_api_url="http://localhost:8888")
set_defaults(bank_id="my-agent")

# Async retain (default) - fast, non-blocking
result = retain(
    content="User mentioned they're working on a machine learning project",
    context="Discussion about current projects",
)

# Sync retain - blocks until complete, raises errors immediately
result = retain(
    content="Critical information that must be stored",
    context="Important data",
    sync=True,
)

# Check for async retain errors (call periodically)
errors = get_pending_retain_errors()
if errors:
    for e in errors:
        print(f"Background retain failed: {e}")

Async APIs

import asyncio

from hindsight_litellm import arecall, areflect, aretain

# Async versions of all memory APIs - await them inside an async context
async def main():
    memories = await arecall("what do you know about me?")
    context = await areflect("summarize user preferences")
    result = await aretain(content="New information to remember")

asyncio.run(main())

Native Client Wrappers

Alternative to LiteLLM callbacks for direct SDK integration.

OpenAI Wrapper

from openai import OpenAI
from hindsight_litellm import wrap_openai

client = OpenAI()
wrapped = wrap_openai(
    client,
    bank_id="my-agent",
    hindsight_api_url="http://localhost:8888",
)

response = wrapped.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What do you know about me?"}],
)

Anthropic Wrapper

from anthropic import Anthropic
from hindsight_litellm import wrap_anthropic

client = Anthropic()
wrapped = wrap_anthropic(
    client,
    bank_id="my-agent",
    hindsight_api_url="http://localhost:8888",
)

response = wrapped.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)

Debug Mode

When verbose=True, you can inspect exactly what memories are being injected:

from hindsight_litellm import configure, set_defaults, enable, completion, get_last_injection_debug

configure(hindsight_api_url="http://localhost:8888", verbose=True)
set_defaults(bank_id="my-agent", use_reflect=True)
enable()

response = completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's my favorite color?"}],
    hindsight_query="What is the user's favorite color?",
)

# Inspect what was injected
debug = get_last_injection_debug()
if debug:
    print(f"Mode: {debug.mode}")          # "reflect" or "recall"
    print(f"Injected: {debug.injected}")  # True/False
    print(f"Results: {debug.results_count}")
    print(f"Memory context:\n{debug.memory_context}")
    if debug.error:
        print(f"Error: {debug.error}")

Context Manager

Use the hindsight_memory() context manager to scope memory integration to a block:

from hindsight_litellm import hindsight_memory
import litellm

with hindsight_memory(bank_id="user-123"):
    response = litellm.completion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello!"}],
        hindsight_query="greeting context",
    )
# Memory integration is automatically disabled after the context exits

Disabling and Cleanup

from hindsight_litellm import disable, cleanup

# Temporarily disable memory integration
disable()

# Clean up all resources (call when shutting down)
cleanup()

API Reference

Main Functions

  • configure(...) - Configure static Hindsight settings (API URL, auth, storage options)
  • set_defaults(...) - Set defaults for per-call settings (bank_id, budget, reflect options)
  • enable() - Enable memory integration with LiteLLM
  • disable() - Disable memory integration
  • is_enabled() - Check if memory integration is enabled
  • cleanup() - Clean up all resources

Configuration Functions

  • get_config() - Get current static configuration
  • get_defaults() - Get current per-call defaults
  • is_configured() - Check if Hindsight is configured with a bank_id
  • reset_config() - Reset all configuration to defaults
  • set_document_id(id) - Convenience function to update document_id
  • set_bank_mission(...) - Set mission/instructions for a memory bank (for mental models)
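
A quick sketch of these helpers (illustrative; the shape of the returned objects is not specified here):

from hindsight_litellm import get_config, get_defaults, is_configured, reset_config, set_document_id

if is_configured():
    print(get_config())    # current static settings
    print(get_defaults())  # current per-call defaults

set_document_id("conversation-2")  # group subsequent conversations under a new document
reset_config()                     # back to a clean slate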

Memory Functions

  • recall(query, ...) - Query raw memories (sync)
  • arecall(query, ...) - Query raw memories (async)
  • reflect(query, ...) - Get synthesized memory context (sync)
  • areflect(query, ...) - Get synthesized memory context (async)
  • retain(content, sync=False, ...) - Store a memory (async by default; use sync=True to block)
  • aretain(content, ...) - Store a memory (async)

Error Tracking Functions

  • get_pending_retain_errors() - Get and clear errors from background retain operations
  • get_pending_storage_errors() - Get and clear errors from background conversation storage
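
get_pending_storage_errors() works the same way as the retain variant shown earlier; for example:

from hindsight_litellm import get_pending_storage_errors

# Call periodically to surface background conversation-storage failures
for e in get_pending_storage_errors():
    print(f"Background storage failed: {e}")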

Debug Functions

  • get_last_injection_debug() - Get debug info from the last memory injection
  • clear_injection_debug() - Clear stored debug info

Client Wrappers

  • wrap_openai(client, ...) - Wrap an OpenAI client with memory
  • wrap_anthropic(client, ...) - Wrap an Anthropic client with memory

Streaming

Streaming responses (stream=True) are fully supported. When streaming is detected, the response is automatically wrapped to collect chunks as they are consumed. Once the stream is fully consumed (or the context manager exits), the complete conversation is stored to Hindsight.

This works across all integration modes:

  • Monkeypatch wrappers (enable() / completion() / acompletion()) — streaming responses are wrapped transparently
  • Native client wrappers (wrap_openai(), wrap_anthropic()) — same chunk-collection behavior
  • Callback handler — streaming responses are skipped in the callback since the callback doesn't control the return value; use the monkeypatch or native wrapper modes for streaming with storage
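
For example, a minimal streaming call (a sketch; assumes configure(), set_defaults(), and enable() have already run as in the Quick Start):

import hindsight_litellm

response = hindsight_litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Tell me a story."}],
    hindsight_query="storytelling preferences",
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")
# Once the stream is fully consumed, the conversation is stored to Hindsight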

Requirements

  • Python >= 3.10
  • litellm >= 1.83.0
  • A running Hindsight API server