# Hindsight Documentation

> Agent Memory that Works Like Human Memory

This file contains the complete Hindsight documentation for LLM consumption.
Generated: 2026-07-03T15:08:45.860870+00:00

---


## File: developer/retain.md

# Retain: How Hindsight Stores Memories

When you call `retain()`, Hindsight transforms conversations and documents into structured, searchable memories that preserve meaning and context.

## What Retain Does

```mermaid
graph LR
    A[Your Content] --> B[Extract Facts]
    B --> C[Identify Entities]
    C --> D[Build Connections]
    D --> E[Memory Bank]
```

---

## Rich Fact Extraction

Hindsight doesn't just store what was said — it captures **why**, **how**, and **what it means**.

### What Gets Captured

When you retain "Alice joined Google last spring and was thrilled about the research opportunities", Hindsight extracts:

**The core facts:**
- Alice joined Google
- This happened last spring

**The emotions and meaning:**
- She was thrilled
- It represented an important opportunity

**The reasoning:**
- She chose it for the research opportunities

This rich extraction means you can later ask "Why did Alice join Google?" and get a meaningful answer, not just "she joined Google."

### Preserving Context

Traditional systems fragment information:
- "Bob suggested Summer Vibes"
- "Alice wanted something unique"
- "They chose Beach Beats"

Hindsight preserves the full narrative:
- "Alice and Bob discussed naming their summer party playlist. Bob suggested 'Summer Vibes' because it's catchy, but Alice wanted something unique. They ultimately decided on 'Beach Beats' for its playful tone."

This means search results include the full context, not disconnected fragments.

---

## Two Types of Facts

Every fact is classified by **whose perspective it captures** — the agent that owns the bank, or the outside world:

| Type           | What it captures                                                              | Example |
|----------------|------------------------------------------------------------------------------|---------|
| **experience** | The bank's own agent acting, observing, or interacting — its first-person history | "I recommended Python to Alice" |
| **world**      | Facts about other people, places, things, and events                          | "Alice works at Google" |

The split is decided by **who is speaking**, not by grammar. A first-person statement is an `experience` only when the speaker *is* the bank's agent. The same words said by someone else are a `world` fact about that person:

- Agent's own log — "I patched the auth bug" → **experience** (the agent did it).
- A user talking to the agent — "I bought a Tesla" → **world** (a fact about the *user*, not the agent).

Two things steer this correctly:

- **Set a human-readable bank `name`** (the agent's name). It identifies who "the agent" is. If left unset it defaults to the `bank_id`; a `bank_id` that is a routing key (e.g. `my-agent::channel-456::user-789`) is not a usable speaker name, so give the bank a real name.
- **Describe the speaker in each item's `context`** when retaining transcripts or third-party content. For a chat log, a context like *"Customer Maria is speaking"* ensures her first-person statements are stored as `world` facts about Maria rather than mistaken for the agent's own experiences. The `context` takes precedence over the bank name when the two disagree.
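
As a rough illustration of the rule above, the following sketch decides the fact type purely from who is speaking. It is a simplification: the actual classification happens inside Hindsight's LLM extraction step, and `classify_fact`, `speaker`, and `agent_name` are hypothetical names used only for this example.

```python
def classify_fact(speaker: str, agent_name: str) -> str:
    """Return 'experience' when the speaker is the bank's agent, else 'world'.

    Hypothetical helper: it only mirrors the "who is speaking" rule and
    ignores everything else the real extraction step considers.
    """
    same_speaker = speaker.strip().lower() == agent_name.strip().lower()
    return "experience" if same_speaker else "world"


# The agent's own log: "I patched the auth bug"
own_log = classify_fact(speaker="SupportBot", agent_name="SupportBot")

# A user talking to the agent: "I bought a Tesla"
user_message = classify_fact(speaker="Maria", agent_name="SupportBot")
```

Here `own_log` is `"experience"` and `user_message` is `"world"`, matching the two bullet examples earlier in this section.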

**Note:** Observations are consolidated automatically in the background after `retain()` operations complete. This consolidation process synthesizes patterns from new facts into the bank's knowledge base.

---

## Entity Recognition

Hindsight automatically identifies and tracks **entities** — the people, organizations, and concepts that matter.

### What Gets Recognized

- **People:** "Alice", "Dr. Smith", "Bob Chen"
- **Organizations:** "Google", "MIT", "OpenAI"
- **Places:** "Paris", "Central Park", "California"
- **Products & Concepts:** "Python", "TensorFlow", "machine learning"

### Entity Resolution

The same entity mentioned different ways gets unified:
- "Alice" + "Alice Chen" + "Alice C." → one person
- "Bob" + "Robert Chen" → one person (nickname resolution)

**Why it matters:** You can ask "What do I know about Alice?" and get everything, even if she was mentioned as "Alice Chen" in some conversations.

### Context-Aware Disambiguation

If "Alice" repeatedly appears alongside "Google" and "Stanford", a new mention of "Alice" together with those entities is likely the same person. Hindsight uses co-occurrence patterns to disambiguate common names.

### Entity Labels

You can define a controlled vocabulary of `key:value` classification labels (e.g. `pedagogy:scaffolding`, `engagement:active`) that are extracted at retain time and stored as entities. Because labels become entities, they automatically link related memories in the knowledge graph and improve both semantic and keyword retrieval. Labels can optionally also write to the memory unit's tags, enabling standard tag-based filtering during recall and reflect.

See [entity_labels in the bank config](/developer/api/memory-banks#entity-labels) for full configuration details.

---

## Building Connections

Memories aren't isolated — Hindsight creates a **knowledge graph** with four types of connections:

### Entity Connections

All facts mentioning the same entity are linked together.

**Enables:** "Tell me everything about Alice" → retrieves all Alice-related facts

### Time-Based Connections

Facts close in time are connected, with stronger links for closer dates.

**Enables:** "What else happened around then?" → finds contextually related events

### Meaning-Based Connections

Semantically similar facts are linked, even if they use different words.

**Enables:** "Tell me about similar topics" → finds thematically related information

### Causal Connections

Cause-effect relationships are explicitly tracked.

**Enables:** "Why did this happen?" → trace reasoning chains

**Example:** "Alice felt burned out" ← caused by ← "She worked 80-hour weeks"

---

## Understanding Time

Hindsight tracks **two temporal dimensions**:

### When It Happened

For events (meetings, trips, milestones), Hindsight records when they occurred.
- "Alice got married in June 2024" → occurred in June 2024

For general facts (preferences, characteristics), there's no specific occurrence time.
- "Alice prefers Python" → ongoing preference

### When You Learned It

Hindsight also tracks when you told it each fact.

**Why both?**

Imagine in January 2025, someone tells you "Alice got married in June 2024":
- **Historical queries** work: "What did Alice do in 2024?" → finds the marriage
- **Recency ranking** works: Recent mentions get priority in search
- **Temporal reasoning** works: "What happened before her marriage?" → finds earlier events

Without this distinction, old information would either be unsearchable by date or treated as irrelevant.
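
The two timestamps can be pictured as two separate fields on each fact. The sketch below is illustrative only: the `Fact` class and its field names are assumptions rather than Hindsight's schema, and the exact day in June is made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    text: str
    occurred_on: Optional[date]  # when the event happened (None for ongoing facts)
    learned_on: date             # when the fact was retained


facts = [
    Fact("Alice got married", date(2024, 6, 15), date(2025, 1, 10)),
    Fact("Alice prefers Python", None, date(2023, 3, 2)),
]

# Historical query ("What did Alice do in 2024?") filters on occurrence time.
in_2024 = [f.text for f in facts if f.occurred_on and f.occurred_on.year == 2024]

# Recency ranking orders by when each fact was learned, newest first.
by_recency = [f.text for f in sorted(facts, key=lambda f: f.learned_on, reverse=True)]
```

The marriage is found by the 2024 filter even though it was only learned in 2025, and it still ranks first by recency.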

---

## Tagging Memories

Tags enable visibility scoping—useful when one memory bank serves multiple users but each should only see relevant memories.

- **Item tags**: Tag individual memories with specific scopes
- **Document tags**: Apply tags to all items in a batch
- **Tag filtering**: Filter during recall/reflect by tags

See [Retain API](./api/retain) for code examples and [Recall API](./api/recall) for filtering options.

---

## What You Get

After `retain()` completes:

- **Structured facts** that preserve meaning, emotions, and reasoning
- **Unified entities** that resolve different name variations
- **Knowledge graph** with entity, temporal, semantic, and causal links
- **Temporal grounding** for both historical and recency-based queries
- **Optional tags** for filtering during recall

All stored in your isolated **memory bank**, ready for `recall()` and `reflect()`.

---

## Steering Extraction with a Mission

By default, `retain()` extracts all significant facts from the content. You can narrow this focus with a **retain mission** (`retain_mission`) — a plain-language description of what this bank should pay attention to.

```
e.g. Always include technical decisions, API design choices, and architectural trade-offs.
     Ignore meeting logistics, greetings, and social exchanges.
```

The mission is injected into the extraction prompt alongside the built-in rules — it steers the LLM without replacing the extraction logic. It works with any extraction mode (`concise`, `verbose`, `custom`).

For finer control, you can also change the **extraction mode**:

| Mode | When to use |
|------|-------------|
| `concise` *(default)* | General-purpose — selective, fast |
| `verbose` | When you need richer facts with full context and relationships |
| `custom` | When you want to write your own extraction rules entirely |

Set `retain_mission` and `retain_extraction_mode` via the [bank config API](/developer/api/memory-banks#retain-configuration) or the [`HINDSIGHT_API_RETAIN_MISSION`](/developer/configuration#retain) environment variable.

---

## Observation Consolidation

After `retain()` completes, Hindsight automatically triggers **observation consolidation** in the background. This process:

1. Analyzes new facts against existing observations
2. Creates new observations when patterns emerge
3. Refines existing observations with new evidence
4. Tracks which facts support each observation

This happens asynchronously — your `retain()` call returns immediately while consolidation runs in the background.

See [Observations](./observations) for details on how consolidation works.

---

## Memory Defense and Source Provenance

### receipt_uri (optional)

Type: `string`.

Optional pointer into an external receipt or co-signature system. Stored as-is and surfaced in `security_events.receipt_uri` for any Memory Defense decision on this item.

### 422 — Memory Defense violation

When Memory Defense is enabled on the target bank and **every** item in the batch is blocked by policy, the request returns 422 with a violation list:

```json
{
  "detail": {
    "violations": [
      { "index": 0, "detector": "prompt_injection", "severity": "high", "message": "..." }
    ]
  }
}
```

Partial-block batches return 200 with the un-blocked items processed; blocked items are silently dropped from the result with their decisions recorded in `security_events`.
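
A client can tell the two outcomes apart by status code. The sketch below parses the 422 body shown above; `blocked_violations` is a hypothetical helper, and it assumes that other 422 responses (for example plain validation errors) may carry a differently shaped `detail`, which it ignores.

```python
import json
from typing import Any


def blocked_violations(status_code: int, body: str) -> list[dict[str, Any]]:
    """Return the Memory Defense violation list from a 422 body, else []."""
    if status_code != 422:
        return []
    detail = json.loads(body).get("detail")
    # Assumption: only a dict-shaped "detail" carries Memory Defense violations.
    if not isinstance(detail, dict):
        return []
    return detail.get("violations", [])


sample_body = json.dumps({
    "detail": {
        "violations": [
            {"index": 0, "detector": "prompt_injection", "severity": "high", "message": "..."}
        ]
    }
})

violations = blocked_violations(422, sample_body)
blocked_indexes = [v["index"] for v in violations]
```

For a partial block (status 200), the helper returns an empty list; the decisions for dropped items have to be read from `security_events` instead.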

See [Memory Defense](./memory-defense/index.md) for the full guide.

---

## Next Steps

- [**Observations**](./observations) — How knowledge is consolidated after retain
- [**Recall**](./retrieval) — How multi-strategy search retrieves relevant memories
- [**Reflect**](./reflect) — How the agentic loop uses observations
- [**Retain API**](./api/retain) — Code examples and parameters


---


## File: developer/retrieval.md

# Recall: How Hindsight Retrieves Memories

When you call `recall()`, Hindsight uses multiple search strategies in parallel to find the most relevant memories, regardless of how you phrase your query.

```mermaid
graph LR
    Q[Query] --> S[Semantic]
    Q --> K[Keyword]
    Q --> G[Graph]
    Q --> T[Temporal]

    S --> RRF[RRF Fusion]
    K --> RRF
    G --> RRF
    T --> RRF

    RRF --> CE[Cross-Encoder]
    CE --> R[Results]
```

---

## The Challenge of Memory Recall

Different queries need different search approaches:

- **"Alice works at Google"** → needs exact name matching
- **"Where does Alice work?"** → needs semantic understanding
- **"What did Alice do last spring?"** → needs temporal reasoning
- **"Why did Alice leave?"** → needs causal relationship tracing

No single search method handles all these well. Hindsight solves this with **TEMPR** — four complementary strategies that run in parallel.

---

## Four Search Strategies

### Semantic Search

**What it does:** Understands the *meaning* behind words, not just the words themselves.

**Best for:**
- Conceptual matches: "Alice's job" → "Alice works as a software engineer"
- Paraphrasing: "Bob's expertise" → "Bob specializes in machine learning"
- Synonyms: "meeting" matches "conference", "discussion", "gathering"

**Why it matters:** You can ask questions naturally without matching exact keywords.

---

### Keyword Search

**What it does:** Finds exact terms and names, even when they're spelled uniquely.

**Best for:**
- Proper nouns: "Google", "Alice Chen", "MIT"
- Technical terms: "PostgreSQL", "HNSW", "TensorFlow"
- Unique identifiers: URLs, product names, specific phrases

**Why it matters:** Catches results that mention specific names or terms, even when they're semantically distant from your query.

**Backends:** Hindsight ships five pluggable keyword-search backends (four BM25 implementations plus a native TF-IDF fallback), selected via
`HINDSIGHT_API_TEXT_SEARCH_EXTENSION`:

| Backend | What it uses | Citus-compatible? |
|---|---|---|
| `native` | PostgreSQL `tsvector` + `ts_rank_cd` (TF-IDF, not true BM25) | Yes |
| `vchord` | `vchord_bm25` extension | No |
| `pg_textsearch` | Timescale `pg_textsearch` extension | No |
| `pgroonga` | PGroonga (Groonga) full-text extension, `TokenBigram` polyglot tokenizer | No |
| `pg_search` | ParadeDB `pg_search` extension, configurable tokenizer (e.g. `jieba`, `chinese_compatible`, `ngram`) via `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_PG_SEARCH_TOKENIZER` | Yes |

If you need true BM25 ranking on a horizontally scaled Postgres (Citus) cluster,
`pg_search` is the only option. See the [`pg_search` docker-compose example](https://github.com/vectorize-io/hindsight/tree/main/docker/docker-compose/pg_search).

---

### Graph Traversal

**What it does:** Follows connections between entities to find indirectly related information.

**Best for:**
- Indirect relationships: "What does Alice do?" → Alice → Google → Google's products
- Entity exploration: "Bob's colleagues" → Bob → co-workers → shared projects
- Multi-hop reasoning: "Alice's team's achievements"

**Why it matters:** Retrieves facts that aren't semantically or lexically similar but are **structurally connected** through the knowledge graph.

**Example:** Even if Alice and her manager are never mentioned together, graph traversal can find the manager through shared projects or team relationships.

---

### Temporal Search

**What it does:** Understands time expressions and filters by when events occurred.

**Best for:**
- Historical queries: "What did Alice do in 2023?"
- Time ranges: "What happened last spring?"
- Relative time: "What did Bob work on last year?"
- Before/after: "What happened before Alice joined Google?"

**How it works:** When the query contains a time reference, Hindsight parses it into a date window, then retrieves the memories whose time overlaps that window. Within the window it selects candidates by **semantic relevance to the query** — not by recency — so the most relevant in-window memory is never dropped just because other memories happen to be more recent. It then **spreads the selection across the window's range**: the window is divided into time-buckets and the strongest match from each populated bucket is taken first, so a "what happened in 2023?" query surfaces memories from across the whole year rather than clustering on whichever stretch is densest. Time is also used as a *scoring* signal — memories closer to the center of the window get a small boost (see [Temporal proximity signal](#temporal-proximity-signal)).

**Why it matters:** Enables precise historical queries that stay relevant *and* representative of the whole period. It also stays fast and meaningful on memory banks whose timestamps are densely clustered (for example when a large batch is ingested with one date): because selection is relevance-first, a time window that happens to match most of the bank still returns the best matches rather than an arbitrary slice.
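
One way to picture the bucketed selection described above is the sketch below. It is not Hindsight's implementation: the bucket count, the `limit`, the tuple layout, and the fill-up pass for leftover slots are all assumptions chosen to keep the example small.

```python
from datetime import date


def spread_select(candidates, start, end, buckets=4, limit=3):
    """Pick candidates by relevance while spreading them across a date window.

    candidates: list of (text, date, relevance) tuples dated inside [start, end].
    """
    span_days = max((end - start).days, 1)
    best_per_bucket = {}
    for item in candidates:
        _text, when, relevance = item
        # Map the date to a bucket index in [0, buckets - 1].
        idx = max(0, min((when - start).days * buckets // span_days, buckets - 1))
        if idx not in best_per_bucket or relevance > best_per_bucket[idx][2]:
            best_per_bucket[idx] = item
    # First pass: the strongest match from each populated bucket.
    first_pass = sorted(best_per_bucket.values(), key=lambda c: c[2], reverse=True)
    # Second pass (assumed): fill any remaining slots by relevance.
    leftovers = sorted(
        (c for c in candidates if c not in first_pass),
        key=lambda c: c[2],
        reverse=True,
    )
    return (first_pass + leftovers)[:limit]


candidates = [
    ("Q1 planning", date(2023, 1, 20), 0.9),
    ("Q1 retro", date(2023, 2, 10), 0.8),
    ("Q1 offsite", date(2023, 3, 5), 0.7),
    ("Summer launch", date(2023, 7, 1), 0.6),
    ("Year-end review", date(2023, 12, 15), 0.5),
]
picked = [c[0] for c in spread_select(candidates, date(2023, 1, 1), date(2023, 12, 31))]
```

With three memories clustered in Q1 and one each in July and December, a pure top-3 by relevance would return only Q1 memories. The bucketed pass returns one memory from each populated part of the year instead.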

---

## Result Fusion

After the four strategies run, results are **fused together**:

- Memories appearing in **multiple strategies** rank higher (consensus)
- **Rank matters more than score** (robust across different scoring systems)
- Final results are **re-ranked** using a neural model that considers query-memory interaction

**Why fusion matters:** A fact that's both semantically similar AND mentions the right entity will rank higher than one that's only semantically similar.

---

## Why Multiple Strategies?

Consider the query: **"What did Alice say about Python last spring?"**

- **Semantic** finds facts about Alice's views on programming
- **Keyword** ensures "Python" is actually mentioned
- **Graph** connects Alice → programming languages → related entities
- **Temporal** filters to "last spring" timeframe

The **fusion** of all four gives you exactly what you're looking for, even though no single strategy would suffice.

---

## Token Budget Management

Hindsight is built for AI agents, not humans. Traditional search systems return "top-k" results, but agents don't think in terms of result counts—they think in tokens. An agent's context window is measured in tokens, and that's exactly how Hindsight measures results.

**How it works:**
- Top-ranked memories are selected first
- Selection stops when the token budget is exhausted
- You specify the context budget; Hindsight fills it with the most relevant memories

**Parameters you control:**
- `max_tokens`: How much memory content to return (default: 4096 tokens)
- `budget`: Search depth level (low, mid, high)
- `types`: Filter by world, experience, observation, or all
- `tags`: Filter memories by visibility tags
- `tags_match`: How to match tags (see [Recall API](./api/recall) for all options)

### Expanding Context: Chunks

Memories are distilled facts: concise, but sometimes missing nuance. When your agent needs deeper context, you can optionally retrieve **chunks**, the raw source text that generated each memory:

```
Memory: "Alice prefers Python over JavaScript"
Chunk:  "Alice mentioned she prefers Python over JavaScript, mainly because
         of its data science ecosystem, though she admits JS is better for
         frontend work and she's been learning TypeScript lately."
```

Use `include_chunks=True` with `max_chunk_tokens` to control the token budget for chunks. This is useful when generating responses that need verbatim quotes or when context matters (e.g., "What exactly did Alice say about the project?").

---

## Tuning Recall: Quality vs Latency

Different use cases require different trade-offs between **recall quality** and **response speed**. Two parameters control this:

### Budget: Search Depth

Controls how thoroughly Hindsight explores the memory bank. It affects graph traversal depth and the size of the candidate pool that feeds cross-encoder re-ranking:

| Budget | Best For | Trade-off |
|--------|----------|-----------|
| **low** | Quick lookups, simple queries | Fast, may miss indirect connections |
| **mid** | Most queries, balanced | Good coverage, reasonable speed |
| **high** | Complex queries requiring deep exploration | Thorough, slower |

**Example:** "What did Alice's manager's team work on?" benefits from high budget to traverse multiple hops (Alice → manager → team → projects) and evaluate more candidates.

### Max Tokens: Context Window Size

Controls how much memory content to return:

| Max Tokens | ~Pages of Text | Best For | Trade-off |
|------------|----------------|----------|-----------|
| **2048** | ~2 pages | Focused answers, fast LLM | Fewer memories, faster |
| **4096** (default) | ~4 pages | Balanced context | Good coverage, standard |
| **8192** | ~8 pages | Comprehensive context | More memories, slower LLM |

**Example:** "Summarize everything about Alice" benefits from higher max_tokens to include more facts.

### Two Independent Dimensions

Budget and max_tokens control different aspects of recall:

| Parameter | What it controls | Latency impact | Example |
|-----------|------------------|----------------|---------|
| **Budget** | How thoroughly to explore memories | Search time | High budget finds Alice → manager → team → projects |
| **Max Tokens** | How much context to return | LLM processing time | High tokens returns more memories to the agent |

**They're independent.** Common combinations:

| Budget | Max Tokens | Use Case |
|--------|------------|----------|
| high | low | Deep search, return only the best results |
| low | high | Quick search, return everything found |
| high | high | Comprehensive research queries |
| low | low | Fast chatbot responses |

### Recommended Configurations

| Use Case | Budget | Max Tokens | Why |
|----------|--------|------------|-----|
| **Chatbot replies** | low | 2048 | Fast responses, focused context |
| **Document Q&A** | mid | 4096 | Balanced coverage and speed |
| **Research queries** | high | 8192 | Comprehensive, multi-hop reasoning |
| **Real-time search** | low | 2048 | Minimize latency |

---

## Scoring & Ranking Deep Dive

This section explains exactly how Hindsight turns raw retrieval results into a final ranked list. The pipeline has four stages: **RRF fusion**, **cross-encoder reranking**, **combined scoring**, and **token truncation**.

### Stage 1: Reciprocal Rank Fusion (RRF)

After all strategies run in parallel, their results are merged using [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf). RRF combines ranked lists by rewarding items that appear highly ranked across multiple strategies, without relying on raw scores (which aren't comparable across different retrieval methods).

**Formula:**

```
score(d) = Σ  1 / (k + rank_i(d))
           i
```

Where:
- **k = 60** (smoothing constant — prevents top-ranked items from dominating)
- **rank_i(d)** = position of document *d* in strategy *i* (1-indexed)
- The sum runs over all strategies where *d* appears

**Within RRF, all four strategies are weighted equally** — fusion uses rank position, not the source, so no strategy gets an implicit multiplier. You can, however, deliberately bias a source with [`HINDSIGHT_API_RECALL_STRATEGY_BOOSTS`](./configuration): that boost is applied at a separate stage — before the reranking pre-filter cap and again after reranking — not inside the RRF fusion above.

**Why RRF over raw score merging?** Each retrieval strategy produces scores on a different scale (cosine similarity, BM25 tf-idf, graph activation). These scores aren't comparable — a BM25 score of 12.5 and a cosine similarity of 0.85 don't mean the same thing. RRF sidesteps this by using only rank positions, making it robust across any scoring system without requiring calibration.

**Example:** A memory ranked #1 in semantic and #5 in BM25:
```
RRF score = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318
```

A memory ranked #1 in semantic only:
```
RRF score = 1/(60+1) = 0.0164
```

The first memory ranks higher because it has **consensus** across strategies.
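
The formula and the worked examples above can be reproduced in a few lines. This is a minimal sketch of RRF itself; the function name `rrf_fuse` and the memory IDs are placeholders, not Hindsight internals.

```python
from collections import defaultdict


def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> dict[str, float]:
    """Fuse ranked lists. Each list is ordered best-first; ranks are 1-indexed."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, memory_id in enumerate(ranked_ids, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    return dict(scores)


# "m1" is ranked #1 by semantic search and #5 by BM25.
consensus = rrf_fuse({
    "semantic": ["m1"],
    "bm25": ["x1", "x2", "x3", "x4", "m1"],
})["m1"]

# "m9" is ranked #1 by semantic search only.
single = rrf_fuse({"semantic": ["m9"]})["m9"]
```

Rounded to four decimals, `consensus` is 0.0318 and `single` is 0.0164, the same values as in the worked examples.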

---

### Stage 2: Cross-Encoder Reranking

RRF gives a good initial ranking, but it's based on positions, not on deep query-document understanding. The cross-encoder evaluates each candidate against the query as a pair, producing a relevance score.

**Pre-filtering:** Before reranking, candidates are trimmed to the top **300** (by RRF score) to limit computational cost. This is configurable via `HINDSIGHT_API_RERANKER_MAX_CANDIDATES`. If [`HINDSIGHT_API_RECALL_STRATEGY_BOOSTS`](./configuration) is set, the boost is applied to the RRF scores before this cut, so candidates from a favoured source are more likely to survive it.

**Why rerank after RRF?** RRF is position-based — it knows a memory ranked well across strategies, but it never actually reads the query and the memory together. The cross-encoder does: it takes the query and each candidate as a pair and produces a relevance score based on their full interaction. This catches nuances that position-based fusion misses, like a memory that ranked #1 in keyword search because it matched a common term but is actually irrelevant to the query's intent.

**Score normalization:** Cross-encoders output raw logits (which can be negative). Scores that already fall within [0, 1] — as returned by calibrated external API rerankers (e.g. Cohere, SiliconFlow, ZeroEntropy, Alibaba, Jina) — are passed through unchanged to preserve their absolute confidence. Raw logits outside [0, 1] are normalized to [0, 1] using the sigmoid function:

```
CE_normalized = 1 / (1 + e^(-raw_logit))
```
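
A minimal sketch of that normalization rule, assuming it is applied per score; `normalize_ce` is a hypothetical name. The negative branch uses an algebraically equivalent form of the sigmoid to avoid overflow on very negative logits.

```python
import math


def normalize_ce(raw_score: float) -> float:
    """Pass scores already in [0, 1] through; squash other values with a sigmoid."""
    if 0.0 <= raw_score <= 1.0:
        return raw_score  # already calibrated, keep absolute confidence
    if raw_score < 0.0:
        # Same value as 1 / (1 + e^(-x)), but safe for large negative x.
        e = math.exp(raw_score)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(-raw_score))


calibrated = normalize_ce(0.85)   # unchanged
positive_logit = normalize_ce(2.0)
negative_logit = normalize_ce(-3.0)
```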

**Batch processing:** Candidates are scored in batches — **32 pairs** for the local reranker, **128 pairs** for TEI.

:::tip No cross-encoder?
When running without a cross-encoder (e.g., slim image with no external reranker), the system falls back to RRF-derived scores: candidates are assigned synthetic scores spread across [0.1, 1.0] based on their RRF rank, so the combined scoring boosts below still work meaningfully.
:::
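
The tip above does not specify how the fallback scores are spaced. Assuming an even (linear) spread by rank, the fallback could look like this sketch; `synthetic_scores` is a hypothetical name.

```python
def synthetic_scores(n: int, low: float = 0.1, high: float = 1.0) -> list[float]:
    """Spread n scores evenly from `high` (rank 1) down to `low` (rank n).

    Assumption: linear spacing. The real fallback may space scores differently.
    """
    if n <= 0:
        return []
    if n == 1:
        return [high]
    step = (high - low) / (n - 1)
    return [high - i * step for i in range(n)]


fallback = synthetic_scores(4)  # roughly 1.0, 0.7, 0.4, 0.1
```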

---

### Stage 3: Combined Scoring (Boosts)

The normalized cross-encoder score is adjusted by three **multiplicative boosts** that incorporate signals the cross-encoder can't see: recency, temporal proximity, and evidence strength.

**Why multiplicative instead of additive?** Additive boosts (e.g., `CE + 0.1 × recency`) would give the same absolute bonus to every candidate regardless of relevance. A barely-relevant memory could leapfrog a highly-relevant one just by being recent. Multiplicative boosts keep adjustments proportional to the base relevance score — a +10% nudge on a high-relevance memory is a bigger absolute change than +10% on a low-relevance one. This ensures secondary signals never overpower the primary relevance judgment.

**Formula:**

```
final_score = CE_normalized × recency_boost × temporal_boost × proof_count_boost
```

Each boost is centered at 1.0 (neutral) and controlled by an alpha that caps how much it can swing:

```
boost = 1 + α × (signal - 0.5)
```

| Boost | α | Max adjustment | What it rewards |
|-------|---|----------------|-----------------|
| **Recency** | 0.2 | ±10% | Recent memories over older ones |
| **Temporal proximity** | 0.2 | ±10% | Memories close to a queried time window |
| **Proof count** | 0.1 | ±5% | Observations backed by more evidence |
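
Putting the two formulas together, a sketch of combined scoring might look like this. The alpha values come from the table above; the function and parameter names are illustrative.

```python
def boost(signal: float, alpha: float) -> float:
    """Map a signal in [0, 1] to a multiplier centered at 1.0."""
    return 1.0 + alpha * (signal - 0.5)


def final_score(ce_normalized: float, recency: float, temporal: float, proof_norm: float) -> float:
    """Multiply the cross-encoder score by the three boosts."""
    return (
        ce_normalized
        * boost(recency, alpha=0.2)
        * boost(temporal, alpha=0.2)
        * boost(proof_norm, alpha=0.1)
    )


neutral = final_score(0.8, recency=0.5, temporal=0.5, proof_norm=0.5)
boosted = final_score(0.8, recency=1.0, temporal=1.0, proof_norm=1.0)
```

With every signal at 0.5 the score is left unchanged; with every signal at 1.0 it is scaled by 1.10 × 1.10 × 1.05.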

#### Recency signal

Linear decay over 365 days from the memory's occurrence date:

```
recency = clamp(1.0 - days_ago / 365, 0.1, 1.0)
```

A memory from the query timestamp has recency 1.0 (+10% boost). A memory from 6 months before the query timestamp has recency ~0.5 (neutral). A memory more than a year before the query timestamp has recency 0.1 (-8% penalty). If no `query_timestamp` is provided, the server's current time is used. Memories without dates get 0.5 (neutral — no boost or penalty).
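
The recency rule can be sketched as below. Whether days are counted fractionally or as whole days is not stated above, so fractional days are an assumption here; `recency_signal` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def recency_signal(occurred_at: Optional[datetime], query_time: datetime) -> float:
    """Linear decay over 365 days, clamped to [0.1, 1.0]; undated memories are neutral."""
    if occurred_at is None:
        return 0.5
    days_ago = (query_time - occurred_at).total_seconds() / 86400
    return min(max(1.0 - days_ago / 365, 0.1), 1.0)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
same_day = recency_signal(now, now)                              # 1.0
half_year = recency_signal(now - timedelta(days=182.5), now)     # 0.5
two_years = recency_signal(now - timedelta(days=730), now)       # clamped to 0.1
undated = recency_signal(None, now)                              # 0.5
```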

#### Temporal proximity signal

Only active when the query contains a time reference (e.g., "last spring", "in 2023"). Measures how close a memory's date is to the center of the queried time window:

```
temporal_proximity = 1.0 - min(days_from_center / (window_days / 2), 1.0)
```

A memory at the center of the window gets 1.0 (+10% boost). A memory at the edge gets 0.0 (-10% penalty). For non-temporal queries, all memories get 0.5 (neutral).
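
A sketch of the proximity formula follows. Treating undated memories as neutral and the handling of a zero-length window are assumptions, and `temporal_proximity` is a hypothetical name.

```python
from datetime import date
from typing import Optional


def temporal_proximity(
    memory_date: Optional[date],
    window_start: Optional[date],
    window_end: Optional[date],
) -> float:
    """1.0 at the window center, 0.0 at or beyond the edges, 0.5 when not applicable."""
    if memory_date is None or window_start is None or window_end is None:
        return 0.5  # non-temporal query (or, by assumption, an undated memory)
    window_days = (window_end - window_start).days
    if window_days <= 0:
        # Assumption: a single-day window only matches that exact day.
        return 1.0 if memory_date == window_start else 0.0
    half_window = window_days / 2
    days_from_center = abs((memory_date - window_start).days - half_window)
    return 1.0 - min(days_from_center / half_window, 1.0)


start, end = date(2023, 1, 1), date(2023, 12, 31)
at_center = temporal_proximity(date(2023, 7, 2), start, end)   # 1.0
at_edge = temporal_proximity(date(2023, 1, 1), start, end)     # 0.0
quarter = temporal_proximity(date(2023, 4, 2), start, end)     # 0.5
no_window = temporal_proximity(date(2023, 4, 2), None, None)   # 0.5
```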

#### Proof count signal

For observation-type memories, rewards those backed by more evidence using a logarithmic curve:

```
proof_norm = clamp(0.5 + ln(proof_count) / 10, 0.0, 1.0)
```

| Proof count | proof_norm | Boost |
|-------------|-----------|-------|
| 1 | 0.5 | Neutral |
| 3 | 0.61 | +1.1% |
| 10 | 0.73 | +2.3% |
| 150+ | 1.0 | +5% (max) |
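
The table rows can be reproduced from the formula with the sketch below. Treating a proof count below 1 as neutral is an assumption (the formula is undefined at 0), and the function names are illustrative.

```python
import math


def proof_norm(proof_count: int) -> float:
    """Logarithmic evidence signal, clamped to [0, 1]."""
    if proof_count < 1:
        return 0.5  # assumption: no usable count means no boost or penalty
    return min(max(0.5 + math.log(proof_count) / 10, 0.0), 1.0)


def proof_boost(proof_count: int, alpha: float = 0.1) -> float:
    """Multiplier applied to the cross-encoder score."""
    return 1.0 + alpha * (proof_norm(proof_count) - 0.5)


rows = {n: (round(proof_norm(n), 2), proof_boost(n)) for n in (1, 3, 10, 150)}
```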

#### Maximum combined range

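With all boosts at the limits their alphas allow:
- **Best case:** ×1.10 × 1.10 × 1.05 ≈ **+27%**
- **Worst case:** ×0.90 × 0.90 × 0.95 ≈ **-23%**

The worst case is a theoretical bound. Because the recency signal is clamped at 0.1 (×0.92 rather than ×0.90) and `proof_norm` is 0.5 (neutral) at a proof count of 1, the formulas above bottom out near ×0.92 × 0.90 × 1.00 ≈ **-17%**, assuming every observation has at least one supporting fact.

The boosts are intentionally conservative: they nudge the ranking without overriding cross-encoder relevance.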

---

### Stage 4: Token Truncation

After scoring, results are sorted by `final_score` and selected top-down until the `max_tokens` budget is exhausted. Only the memory text counts toward the budget — metadata is free.
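
A minimal sketch of this selection step is shown below. It assumes selection stops at the first memory that no longer fits, and it uses a whitespace word count as a crude stand-in for a real tokenizer; both are assumptions, as are the function and field names.

```python
from typing import Callable


def select_within_budget(
    memories: list[dict],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[dict]:
    """Take memories in descending final_score order until the budget is spent."""
    selected = []
    used = 0
    for memory in sorted(memories, key=lambda m: m["final_score"], reverse=True):
        cost = count_tokens(memory["text"])  # only the text counts, not metadata
        if used + cost > max_tokens:
            break
        selected.append(memory)
        used += cost
    return selected


memories = [
    {"text": "Bob joined MIT", "final_score": 0.4},
    {"text": "Alice works at Google", "final_score": 0.9},
    {"text": "Alice prefers Python over JavaScript", "final_score": 0.7},
]
kept = [m["text"] for m in select_within_budget(memories, max_tokens=9)]
```

With a budget of 9 "tokens", the two highest-scoring memories (4 + 5 words) fit and the third is cut.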

---

### How Budget Maps to Pipeline Parameters

The `budget` parameter (low/mid/high) controls **search depth** — how many candidates each strategy considers. Each level maps to a **recall budget** number that flows through every pipeline stage:

| Budget | Recall budget (fixed mode) | Env var override |
|--------|---------------------------|-----------------|
| **low** | 100 | `HINDSIGHT_API_RECALL_BUDGET_FIXED_LOW` |
| **mid** | 300 (default) | `HINDSIGHT_API_RECALL_BUDGET_FIXED_MID` |
| **high** | 1000 | `HINDSIGHT_API_RECALL_BUDGET_FIXED_HIGH` |

This recall budget flows through the pipeline as follows:

| Pipeline stage | How the recall budget is used |
|----------------|-------------------------------|
| **Semantic search** | Over-fetches max(recall_budget × 5, 100) from HNSW, trims to recall_budget |
| **BM25 search** | `LIMIT recall_budget` in SQL |
| **Graph traversal** | Explores up to recall_budget nodes |
| **Temporal spreading** | Activates up to recall_budget nodes via links |
| **Result consideration** | Top recall_budget × 2 results considered for token filtering |

Reranking pre-filter (300 candidates) is **independent** of budget — it's a separate knob (`HINDSIGHT_API_RERANKER_MAX_CANDIDATES`).
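
The two tables above can be condensed into a small lookup, sketched below with the documented default values. The dictionary keys and the `stage_limits` name are made up for illustration, and environment-variable overrides are not modeled.

```python
FIXED_BUDGETS = {"low": 100, "mid": 300, "high": 1000}


def stage_limits(budget: str = "mid") -> dict[str, int]:
    """Derive per-stage limits from the fixed recall budget."""
    recall_budget = FIXED_BUDGETS[budget]
    return {
        "semantic_overfetch": max(recall_budget * 5, 100),  # fetched from HNSW
        "semantic_keep": recall_budget,                     # kept after trimming
        "bm25_limit": recall_budget,                        # SQL LIMIT
        "graph_nodes": recall_budget,                       # max nodes explored
        "token_filter_pool": recall_budget * 2,             # considered for token filtering
    }


low = stage_limits("low")
high = stage_limits("high")
```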

:::info Adaptive budgeting
An alternative budget mode scales the recall budget with `max_tokens` instead of using fixed values:

```
recall_budget = clamp(max_tokens × ratio, min, max)
```

| Budget | Ratio | Env var override |
|--------|-------|-----------------|
| low | 2.5% of max_tokens | `HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_LOW` |
| mid | 7.5% of max_tokens | `HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_MID` |
| high | 25% of max_tokens | `HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_HIGH` |

The result is clamped to a floor of **20** (`HINDSIGHT_API_RECALL_BUDGET_MIN`) and a ceiling of **2000** (`HINDSIGHT_API_RECALL_BUDGET_MAX`).

Enable with `HINDSIGHT_API_RECALL_BUDGET_FUNCTION=adaptive`.
:::
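
The adaptive formula can be sketched as follows, using the documented ratios and clamp bounds. Truncating to an integer is an assumption, and `adaptive_recall_budget` is a hypothetical name.

```python
ADAPTIVE_RATIOS = {"low": 0.025, "mid": 0.075, "high": 0.25}


def adaptive_recall_budget(
    max_tokens: int,
    budget: str = "mid",
    floor: int = 20,
    ceiling: int = 2000,
) -> int:
    """recall_budget = clamp(max_tokens * ratio, floor, ceiling)."""
    raw = int(max_tokens * ADAPTIVE_RATIOS[budget])  # assumption: round down
    return min(max(raw, floor), ceiling)


default_mid = adaptive_recall_budget(4096, "mid")    # 307
small_low = adaptive_recall_budget(512, "low")       # raised to the floor, 20
large_high = adaptive_recall_budget(16384, "high")   # capped at the ceiling, 2000
```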

---

### Graph Scoring Detail

The graph traversal (link expansion) combines three independent signals additively for each candidate:

| Signal | Score formula | Range |
|--------|--------------|-------|
| **Entity overlap** | tanh(shared_entity_count × 0.5) | [0, ~1.0] |
| **Semantic link** | Precomputed kNN link weight | [0.7, 1.0] |
| **Causal link** | Causal link weight | [0, 1.0] |

```
graph_score = entity_score + semantic_score + causal_score   ∈ [0, 3]
```

The additive combination rewards **convergent evidence** — a memory connected to the query through multiple signal types ranks higher than one connected through a single strong signal.

**Why tanh for entity scores?** Raw shared-entity count is unbounded — a high-fanout entity like "user" could produce counts of 50+, drowning out the other two signals. `tanh(count × 0.5)` saturates naturally: the first few shared entities matter a lot (1→0.46, 2→0.76, 3→0.91), but additional ones contribute diminishing returns, keeping the entity signal in [0, 1] alongside semantic and causal scores.

**Why additive instead of multiplicative here?** Unlike the combined scoring boosts, graph signals are independent evidence channels, not adjustments to a base score. A memory might be connected only through causal links (no shared entities, no semantic similarity) — multiplicative combination would zero it out. Additive scoring lets each signal contribute independently, and the outer RRF fusion handles ranking across strategies.

**Entity signal example:** A memory sharing 1 entity with the query scores tanh(0.5) ≈ 0.46. Two shared entities score tanh(1.0) ≈ 0.76. Three or more saturate near 0.91+.
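
The additive combination can be checked numerically with a short sketch. The function name and argument names are illustrative; a link weight of 0.0 stands for "no such link".

```python
import math


def graph_score(shared_entities: int, semantic_weight: float = 0.0, causal_weight: float = 0.0) -> float:
    """Sum of the saturating entity signal and the two link weights."""
    entity_score = math.tanh(shared_entities * 0.5)
    return entity_score + semantic_weight + causal_weight


entity_only = [round(graph_score(n), 2) for n in (1, 2, 3)]   # 0.46, 0.76, 0.91
causal_only = graph_score(0, causal_weight=0.6)               # still scores 0.6
convergent = graph_score(2, semantic_weight=0.8, causal_weight=0.5)
```

A memory reachable only through a causal link keeps its causal weight as its score, which a multiplicative combination would have reduced to zero.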

---

## Next Steps

- [**Retain**](./retain) — How memories are stored with rich context
- [**Reflect**](./reflect) — How disposition influences reasoning
- [**Recall API**](./api/recall) — Code examples, parameters, and tag filtering


---


## File: developer/installation.md

# Installation

Hindsight can be deployed in several ways depending on your infrastructure and requirements.

:::tip Don't want to manage infrastructure?
**[Hindsight Cloud](https://ui.hindsight.vectorize.io/signup)** is a fully managed service that handles all infrastructure, scaling, and maintenance — [sign up here](https://ui.hindsight.vectorize.io/signup).
:::

## Supported Platforms

Hindsight runs on **Linux**, **macOS**, and **Windows**:

| Platform | Docker | Bare Metal (pip) | Embedded DB (pg0) | Notes |
|----------|--------|------------------|--------------------|-------|
| **Linux** (x86_64, ARM64) | ✅ | ✅ | ✅ | Fully supported, recommended for production |
| **macOS** (Apple Silicon / arm64) | ✅ | ✅ | ✅ | Fully supported |
| **macOS** (Intel / x86_64) | ✅ | ⚠️ slim only | ✅ | Use `hindsight-all-slim` / `hindsight-api-slim`. The full bundle's local ML models (PyTorch, MLX) publish no Intel-Mac wheels, so `pip install hindsight-all` silently backtracks to a months-old release. Pair the slim bundle with a hosted embeddings/reranker provider or the in-process ONNX backend (`hindsight-api-slim[local-onnx]`). |
| **Windows** (x86_64) | ✅ | ✅ | ✅ | Fully supported — see [Windows setup](#windows) for external PostgreSQL option |

All platforms support the embedded database (pg0) for development. On Windows, you can also use an external PostgreSQL installation — see the [Windows](#windows) section for a step-by-step guide.

---

## Prerequisites

### PostgreSQL

Hindsight requires PostgreSQL 14+ with a vector extension for similarity search. The supported extensions are:

- **pgvector** (default)
- **pgvectorscale**
- **vchord**
- **scann** (AlloyDB)

Configure which one to use with `HINDSIGHT_API_VECTOR_EXTENSION`. See [Configuration](./configuration) for details.

**By default**, Hindsight uses **pg0** — an embedded PostgreSQL that runs locally on your machine. This is convenient for development but **not recommended for production**.

**For production**, use an external PostgreSQL with one of the supported vector extensions:
- **Supabase** — Managed PostgreSQL with pgvector built-in
- **Neon** — Serverless PostgreSQL with pgvector
- **Azure Database for PostgreSQL** — With pgvector and pgvectorscale support
- **Google AlloyDB** / **AlloyDB Omni** — With pgvector and ScaNN support
- **AWS RDS** / **Cloud SQL** — With pgvector extension enabled
- **Self-hosted** — PostgreSQL 14+ with your preferred vector extension

### LLM Provider

You need an LLM API key for fact extraction, entity resolution, and answer generation. See [Models](./models) for supported providers, model recommendations, and configuration.

### Hardware

Hindsight is designed to run on commodity hardware. The footprint depends mainly on whether the **full** image (which bundles local embedding and reranker models) or the **slim** image (which delegates those to external providers) is used.

| Component | Minimum RAM | Recommended RAM | Notes |
|-----------|-------------|-----------------|-------|
| **API — Full image** | 1.5 GB | 2 GB | Loads local BGE embedder (~130 MB) and MiniLM cross-encoder (~90 MB) into memory, plus PyTorch/ONNX runtime arenas. Idle RSS settles around 0.8–1.0 GB; expect 1.2–1.5 GB under load. |
| **API — Slim image** | 512 MB | 1 GB | No local models. Steady-state RSS is dominated by Python runtime and DB connections. Requires [external embedding and reranker providers](./configuration#embeddings) (e.g. TEI, OpenAI, Cohere). |
| **Control Plane (UI)** | 128 MB | 256 MB | Next.js process, lightweight. |
| **Worker** (if separated) | Same as API image variant | Same as API image variant | Workers load the same models as the API server. |
| **PostgreSQL** | 512 MB | 1 GB+ | Scales with the number of memories and indexes. |

:::tip Reducing the footprint
The bulk of the full image's memory comes from the bundled embedding and reranker models and their PyTorch/ONNX runtimes. To shrink the deployment to a few hundred MB of RAM, switch to the **slim** image and configure [external embedding and reranker providers](./configuration#embeddings).
:::

**CPU vs GPU:** Two vCPUs with no GPU are enough for development and basic workloads. For production traffic, the local reranker (cross-encoder) is the main bottleneck and typically benefits from a GPU to keep recall latency reasonable; alternatively, offload reranking to an [external reranker provider](./configuration#embeddings) (e.g. TEI, Cohere) on dedicated GPU hardware.
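
As a rough planning aid, the RAM figures from the table above can be summed per deployment. The sketch below is only arithmetic over those published numbers; the component keys and the `ram_budget` helper are hypothetical names, and PostgreSQL's recommended value is a floor ("1 GB+") rather than a fixed size.

```python
# (minimum, recommended) RAM in MB, copied from the hardware table above
RAM_MB = {
    "api-full": (1536, 2048),
    "api-slim": (512, 1024),
    "control-plane": (128, 256),
    "postgresql": (512, 1024),
}

def ram_budget(components):
    """Return (minimum_mb, recommended_mb) summed over the chosen components."""
    minimum = sum(RAM_MB[name][0] for name in components)
    recommended = sum(RAM_MB[name][1] for name in components)
    return minimum, recommended

print(ram_budget(["api-full", "control-plane", "postgresql"]))  # (2176, 3328)
print(ram_budget(["api-slim", "control-plane", "postgresql"]))  # (1152, 2304)
```

A separated worker uses the same footprint as the API variant, so it can be modelled by listing the API entry twice.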

---

## Docker

**Best for**: Quick start, development, small deployments

Run everything in one container with embedded PostgreSQL:

```bash
export OPENAI_API_KEY=sk-xxx

docker run -it --pull always --name hindsight --restart unless-stopped -p 8888:8888 -p 9999:9999 \
  -e HINDSIGHT_API_LLM_API_KEY=$OPENAI_API_KEY \
  -v hindsight-data:/home/hindsight/.pg0 \
  ghcr.io/vectorize-io/hindsight:latest
```

- **API Server**: http://localhost:8888
- **Control Plane** (Web UI): http://localhost:9999

:::note Persisting data: named volume vs. host bind mount
The container runs as a non-root user (UID 1000). The `hindsight-data` **named volume** above is recommended — Docker creates it owned by the container user, so it works with no extra setup.

If you instead bind-mount a **host directory** (`-v $HOME/.hindsight-docker:/home/hindsight/.pg0`), that directory must be writable by UID 1000, or the embedded database fails to start with `Permission denied`. Either `chown` the directory to UID 1000, or run the container as your host user: `--user $(id -u):$(id -g) -e HOME=/home/hindsight` (after `chown`-ing the directory to your own UID).
:::
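
Before bind-mounting a host directory, it can help to check whether UID 1000 would be able to write to it. The following is a rough sketch, assuming a POSIX host; the `writable_by_uid` helper is hypothetical and only inspects the owner and "other" permission bits, ignoring group membership and ACLs.

```python
import os
import stat

CONTAINER_UID = 1000  # UID the container runs as, per the note above

def writable_by_uid(path: str, uid: int = CONTAINER_UID) -> bool:
    """Rough check: is `path` a directory that `uid` can write to?

    Only the owner and "other" write bits are considered, so treat a
    False result as a hint to investigate, not as proof.
    """
    info = os.stat(path)
    if not stat.S_ISDIR(info.st_mode):
        return False
    if info.st_uid == uid:
        return bool(info.st_mode & stat.S_IWUSR)
    return bool(info.st_mode & stat.S_IWOTH)

# Example host directory from the note above
data_dir = os.path.expanduser("~/.hindsight-docker")
if os.path.isdir(data_dir) and not writable_by_uid(data_dir):
    print(f"{data_dir} is not writable by UID {CONTAINER_UID}; chown it first")
```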

All published images are [signed with Cosign](#verifying-image-signatures) — verification is optional.

:::tip Set a stable `HINDSIGHT_API_WORKER_ID` in production
The worker uses the container hostname as its identity, which Docker sets to the container ID by default. That value changes on every restart, so any task that was being processed when the container went down stays parked under the old ID with no way for the new container to recognize it as its own.

Set `HINDSIGHT_API_WORKER_ID` to a stable value (e.g., `-e HINDSIGHT_API_WORKER_ID=hindsight-prod`) so the worker keeps the same identity across restarts. This is recommended even for single-container deployments. For diagnosis and recovery commands, see [Admin CLI - Recovering stuck operations](./admin-cli#recovering-stuck-or-zombie-operations).
:::
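
The effect of an unstable worker identity can be shown with a toy model. This is a simplified sketch, not Hindsight's task queue: the two helpers are hypothetical, the hostnames are made-up container IDs, and the only behavior taken from the text above is that an explicit `HINDSIGHT_API_WORKER_ID` replaces the hostname as the worker's identity.

```python
def resolve_worker_id(env: dict, hostname: str) -> str:
    """An explicit HINDSIGHT_API_WORKER_ID wins; otherwise fall back to the hostname."""
    return env.get("HINDSIGHT_API_WORKER_ID") or hostname

def orphaned_tasks(claims: dict, current_id: str) -> list:
    """Tasks still recorded under an ID other than the current worker's."""
    return [task for task, owner in claims.items() if owner != current_id]

for env in ({}, {"HINDSIGHT_API_WORKER_ID": "hindsight-prod"}):
    before = resolve_worker_id(env, hostname="3f2a9c1b7d10")  # first container
    claims = {"task-1": before}                               # task in flight at shutdown
    after = resolve_worker_id(env, hostname="8e41d07a55c2")   # container after restart
    print(after, orphaned_tasks(claims, after))
# 8e41d07a55c2 ['task-1']
# hindsight-prod []
```

Without the variable, the restarted container no longer matches the ID stored on the in-flight task; with it, the task is recognized as its own.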

### Docker Image Variants

| Variant | Size (AMD64) | Size (ARM64) | When to use |
|---------|--------------|--------------|-------------|
| **Full** (`latest`) | ~9 GB | ~3.7 GB | Default. Works out of the box with no external services except the LLM. |
| **Slim** (`slim`) | ~500 MB | ~500 MB | Use when you already rely on external services for embeddings and reranking (OpenAI, Cohere, TEI). Significantly smaller image, faster deploys. Requires [external providers](./configuration#embeddings). |

The slim image corresponds to the [`hindsight-api-slim`](#bare-metal-pip) pip package. See [Configuration](./configuration#embeddings) for external provider options.

### Bundling Custom Models in a Custom Image

:::tip Production deployments with non-default local models
If you use a non-default local embedder or reranker, bake the models into a custom image at build time rather than enabling the Helm `modelCache` PVC. See [`docker/docker-compose/custom-models/`](https://github.com/vectorize-io/hindsight/tree/main/docker/docker-compose/custom-models) for a runnable example.
:::

### Available Tags

```bash
# Standalone (API + Control Plane)
ghcr.io/vectorize-io/hindsight:latest        # Full, latest release
ghcr.io/vectorize-io/hindsight:latest-slim   # Slim, latest release
ghcr.io/vectorize-io/hindsight:0.4.9         # Full, specific version
ghcr.io/vectorize-io/hindsight:0.4.9-slim    # Slim, specific version

# API only
ghcr.io/vectorize-io/hindsight-api:latest
ghcr.io/vectorize-io/hindsight-api:latest-slim

# Control Plane only
ghcr.io/vectorize-io/hindsight-control-plane:latest
```

### Verifying image signatures

Images are signed with [Cosign](https://docs.sigstore.dev/cosign/signing/overview/) keyless OIDC. To verify any tag:

```bash
cosign verify ghcr.io/vectorize-io/hindsight:<tag> \
  --certificate-identity-regexp '^https://github\.com/vectorize-io/hindsight/\.github/workflows/(sign-images|release)\.yml@.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com
```
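
To see which signing identities the `--certificate-identity-regexp` above accepts, the same pattern can be tried in Python. This is only an illustration: Cosign evaluates the expression with Go's regexp engine, which agrees with Python's `re` for the constructs used here, and the example identity strings are made up rather than taken from a real certificate.

```python
import re

IDENTITY_PATTERN = re.compile(
    r"^https://github\.com/vectorize-io/hindsight/\.github/workflows/(sign-images|release)\.yml@.*"
)

def identity_accepted(identity: str) -> bool:
    """True if the certificate identity matches the documented pattern."""
    return IDENTITY_PATTERN.match(identity) is not None

# Illustrative identities (hypothetical refs)
print(identity_accepted(
    "https://github.com/vectorize-io/hindsight/.github/workflows/release.yml@refs/tags/v0.4.9"
))  # True
print(identity_accepted(
    "https://github.com/someone-else/hindsight/.github/workflows/release.yml@refs/heads/main"
))  # False
```

Only the `sign-images.yml` and `release.yml` workflows in the `vectorize-io/hindsight` repository match, so an image signed by a fork or by a different workflow fails verification.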

---

## Helm / Kubernetes

**Best for**: Production deployments, auto-scaling, cloud environments

```bash
# Install with built-in PostgreSQL
helm install hindsight oci://ghcr.io/vectorize-io/charts/hindsight \
  --set api.llm.provider=groq \
  --set api.llm.apiKey=gsk_xxxxxxxxxxxx \
  --set postgresql.enabled=true

# Or use external PostgreSQL
helm install hindsight oci://ghcr.io/vectorize-io/charts/hindsight \
  --set api.llm.provider=groq \
  --set api.llm.apiKey=gsk_xxxxxxxxxxxx \
  --set postgresql.enabled=false \
  --set api.database.url=postgresql://user:pass@postgres.example.com:5432/hindsight

# Install a specific version
helm install hindsight oci://ghcr.io/vectorize-io/charts/hindsight --version 0.1.3

# Upgrade to latest
helm upgrade hindsight oci://ghcr.io/vectorize-io/charts/hindsight
```

**Requirements**:
- Kubernetes cluster (GKE, EKS, AKS, or self-hosted)
- Helm 3.8+

### Distributed Workers

For high-throughput deployments, enable dedicated worker pods to scale task processing independently:

```bash
helm install hindsight oci://ghcr.io/vectorize-io/charts/hindsight \
  --set worker.enabled=true \
  --set worker.replicaCount=3
```

The chart deploys workers as a StatefulSet, so each pod gets a stable name (e.g. `hindsight-worker-0`) that the worker uses as its `HINDSIGHT_API_WORKER_ID`. Tasks claimed by a pod are recognized as its own across restarts. If you swap the chart for a plain Deployment, set `HINDSIGHT_API_WORKER_ID` explicitly per replica — otherwise hostnames are randomized and previously-claimed tasks become orphaned. See [Admin CLI - Recovering stuck operations](./admin-cli#recovering-stuck-or-zombie-operations) for diagnosis.

See [Services - Worker Service](./services#worker-service) for configuration details and architecture.

See the [Helm chart values.yaml](https://github.com/vectorize-io/hindsight/tree/main/helm/hindsight/values.yaml) for all chart options.

---

## Bare Metal (pip)

**Best for**: Running Hindsight as a standalone service on a host machine.

### Install

```bash
pip install hindsight-api        # Full — works out of the box
pip install hindsight-api-slim   # Slim — requires external services for embeddings, reranking, and the database
```

When using `hindsight-api-slim`, you must configure external providers for all model operations. See [Configuration](./configuration#embeddings) for details.

### Run with Embedded Database

For development and testing, Hindsight can run with an embedded PostgreSQL (pg0):

```bash
export HINDSIGHT_API_LLM_PROVIDER=groq
export HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx

hindsight-api
```

This creates a database in `~/.hindsight/data/` and starts the API on http://localhost:8888.
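
Scripts that start `hindsight-api` and then call it may need to wait until the port accepts connections. The helper below is a generic sketch that only checks TCP reachability of port 8888; it does not assume any particular health endpoint, and the `wait_for_port` name is hypothetical.

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not listening yet; retry until the deadline
    return False
```

For example, `wait_for_port("localhost", 8888)` returns `True` as soon as the API server is listening and `False` if nothing is listening after 60 seconds.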

### Run with External PostgreSQL

For production, connect to your own PostgreSQL instance:

```bash
export HINDSIGHT_API_DATABASE_URL=postgresql://user:pass@localhost:5432/hindsight
export HINDSIGHT_API_LLM_PROVIDER=groq
export HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx

hindsight-api
```

**Note**: The database must exist and have pgvector enabled (`CREATE EXTENSION vector;`).
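
The pgvector prerequisite can be checked from Python before starting the server. This is a minimal sketch that works with any DB-API cursor using `%s` placeholders; the `has_pgvector` helper is hypothetical, and using psycopg 3 as the driver is an assumption rather than something Hindsight requires.

```python
def has_pgvector(cursor) -> bool:
    """True if the `vector` extension is installed in the connected database."""
    cursor.execute("SELECT 1 FROM pg_extension WHERE extname = %s", ("vector",))
    return cursor.fetchone() is not None

# Usage with psycopg 3 (assumed driver; connection URL as in the example above):
#   import psycopg
#   with psycopg.connect("postgresql://user:pass@localhost:5432/hindsight") as conn:
#       with conn.cursor() as cur:
#           print(has_pgvector(cur))
```

`pg_extension` is PostgreSQL's catalog of installed extensions, so the query returns a row only after `CREATE EXTENSION vector;` has been run in that database.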

### CLI Options

```bash
hindsight-api --port 9000          # Custom port (default: 8888)
hindsight-api --host 127.0.0.1     # Bind to localhost only
hindsight-api --workers 4          # Multiple worker processes
hindsight-api --log-level debug    # Verbose logging
```

### Control Plane

The Control Plane (Web UI) can be run standalone using npx:

```bash
npx @vectorize-io/hindsight-control-plane --api-url http://localhost:8888
```

This connects to your running API server and provides a visual interface for managing memory banks, exploring entities, and testing queries.

#### Options

| Option | Environment Variable | Default | Description |
|--------|---------------------|---------|-------------|
| `-p, --port` | `PORT` | 9999 | Port to listen on |
| `-H, --hostname` | `HOSTNAME` | 0.0.0.0 | Hostname to bind to |
| `-a, --api-url` | `HINDSIGHT_CP_DATAPLANE_API_URL` | http://localhost:8888 | Hindsight API URL |
| | `HINDSIGHT_CP_ACCESS_KEY` | *(none)* | Access key to protect the Control Plane UI. When set, users must enter this key to log in. |

#### Examples

```bash
# Run on custom port
npx @vectorize-io/hindsight-control-plane --port 3000 --api-url http://localhost:8888

# Using environment variables
export HINDSIGHT_CP_DATAPLANE_API_URL=http://api.example.com
npx @vectorize-io/hindsight-control-plane

# Production deployment
PORT=80 HINDSIGHT_CP_DATAPLANE_API_URL=https://api.hindsight.io npx @vectorize-io/hindsight-control-plane
```

---

## Windows

**Best for**: Running Hindsight natively on Windows without Docker

Hindsight works on Windows with the embedded database (pg0) out of the box — just install and run:

```powershell
pip install hindsight-api

# PowerShell syntax (in cmd.exe, use: set NAME=value)
$env:HINDSIGHT_API_LLM_PROVIDER = "openai"
$env:HINDSIGHT_API_LLM_API_KEY = "sk-xxx"
$env:HINDSIGHT_API_LLM_MODEL = "gpt-4o-mini"

hindsight-api
```

### Using External PostgreSQL (optional)

If you prefer to use your own PostgreSQL instance instead of the embedded database:

```powershell
# Install PostgreSQL
winget install PostgreSQL.PostgreSQL.17

# Build pgvector (requires Visual Studio Build Tools)
git clone https://github.com/pgvector/pgvector.git
cd pgvector

# Open "x64 Native Tools Command Prompt for VS" and run:
set PGROOT=C:\Program Files\PostgreSQL\17
nmake /F Makefile.win
nmake /F Makefile.win install

# Create the database and enable the vector extension
psql -U postgres -c "CREATE DATABASE hindsight;"
psql -U postgres -d hindsight -c "CREATE EXTENSION vector;"
```

Then run Hindsight pointing to your database:

```powershell
pip install hindsight-api

# PowerShell syntax (in cmd.exe, use: set NAME=value)
$env:HINDSIGHT_API_DATABASE_URL = "postgresql://postgres@localhost:5432/hindsight"
$env:HINDSIGHT_API_LLM_PROVIDER = "openai"
$env:HINDSIGHT_API_LLM_API_KEY = "sk-xxx"
$env:HINDSIGHT_API_LLM_MODEL = "gpt-4o-mini"

hindsight-api
```

- **API Server**: http://localhost:8888

:::tip
You can also use the slim package (`pip install hindsight-api-slim`) if you configure external providers for embeddings and reranking. See [Configuration](./configuration#embeddings) for details.
:::

### Windows + China Network Notes

If you are running Hindsight on Windows on a network subject to mainland China's access restrictions:

1. DeepSeek works well for `HINDSIGHT_API_LLM_PROVIDER`, but DeepSeek does not provide an embeddings endpoint.
2. Use local embeddings (recommended for privacy and reliability in restricted networks).
3. Set `HF_ENDPOINT=https://hf-mirror.com` before starting Hindsight so Hugging Face model downloads use a China-accessible mirror.

```powershell
# PowerShell syntax (in cmd.exe, use: set NAME=value)
$env:HF_ENDPOINT = "https://hf-mirror.com"

$env:HINDSIGHT_API_LLM_PROVIDER = "deepseek"
$env:HINDSIGHT_API_LLM_API_KEY = "sk-your-deepseek-key"
$env:HINDSIGHT_API_LLM_MODEL = "deepseek-v4-flash"
$env:HINDSIGHT_API_LLM_BASE_URL = "https://api.deepseek.com"

$env:HINDSIGHT_API_EMBEDDINGS_PROVIDER = "local"
$env:HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL = "BAAI/bge-small-en-v1.5"

$env:HINDSIGHT_API_RERANKER_PROVIDER = "flashrank"

hindsight-api
```

The `HF_ENDPOINT` variable is used by Hugging Face tooling (`huggingface_hub`), not by Hindsight itself.
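
Because `HF_ENDPOINT` is read from the process environment, it has to be present in the environment of the process that downloads the models. When Hindsight is launched from a Python script, one option is to pass it explicitly, as in the sketch below. The `hindsight_env` and `launch` helpers are hypothetical, and the LLM settings are assumed to be set in the parent environment already.

```python
import os
import subprocess

def hindsight_env(base=None) -> dict:
    """Copy an environment and add the mirror and local-model settings shown above."""
    env = dict(os.environ if base is None else base)
    env.update({
        "HF_ENDPOINT": "https://hf-mirror.com",
        "HINDSIGHT_API_EMBEDDINGS_PROVIDER": "local",
        "HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL": "BAAI/bge-small-en-v1.5",
        "HINDSIGHT_API_RERANKER_PROVIDER": "flashrank",
    })
    return env

def launch() -> subprocess.Popen:
    """Start the `hindsight-api` command with the prepared environment."""
    return subprocess.Popen(["hindsight-api"], env=hindsight_env())
```

Calling `launch()` starts the server as a child process that inherits the mirror setting, without changing the environment of the calling script.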

---

## Embedded in a Python Application

**Best for**: Using Hindsight programmatically from Python without running a separate server process.

```bash
pip install hindsight-all        # Full — works out of the box (Linux, Windows, Apple Silicon Macs)
pip install hindsight-all-slim   # Slim — requires external services for embeddings, reranking, and the database
```

On Intel (x86_64) Macs, install `hindsight-all-slim` — see [Supported Platforms](#supported-platforms).
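
Setup scripts that run on mixed hardware can pick the package from the platform. This is a small sketch of the rule from the Supported Platforms table; the `recommended_package` helper is hypothetical, and note that an x86_64 Python build running under Rosetta on Apple Silicon may also report `x86_64`.

```python
import platform

def recommended_package(system: str, machine: str) -> str:
    """Pick the pip package for a platform, following the Supported Platforms table."""
    if system == "Darwin" and machine == "x86_64":
        return "hindsight-all-slim"  # the full bundle has no Intel-Mac wheels
    return "hindsight-all"

print(recommended_package(platform.system(), platform.machine()))
```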

`hindsight-all` offers two ways to embed the server in your application:

**In-process** (`HindsightServer`): the server runs in a background thread inside your application. Best when you want the tightest integration and are already managing your own process lifecycle.

```python
from hindsight import HindsightServer, HindsightClient

with HindsightServer(llm_provider="openai", llm_api_key="sk-xxx") as server:
    client = HindsightClient(base_url=server.url)
    client.retain(bank_id="alice", content="Alice prefers concise answers.")
    results = client.recall(bank_id="alice", query="How should I respond to Alice?")
```

**Managed subprocess** (`HindsightEmbedded`): the server runs as a background daemon process, shared across multiple Python processes or sessions. The daemon starts on first use and shuts down automatically after an idle timeout.

```python
from hindsight import HindsightEmbedded

client = HindsightEmbedded(llm_provider="openai", llm_api_key="sk-xxx")
client.retain(bank_id="alice", content="Alice prefers concise answers.")
results = client.recall(bank_id="alice", query="How should I respond to Alice?")
```

See the [Python SDK](../sdks/python.md) for the full API reference.

---

## Next Steps

- [Configuration](./configuration.md) — Environment variables and settings
- [Models](./models.mdx) — ML models and providers
- [Monitoring](./monitoring.md) — Metrics and observability


---


## File: developer/configuration.md

# Configuration

Complete reference for configuring Hindsight services through environment variables.

Hindsight has two services, each with its own configuration prefix:

| Service | Prefix | Description |
|---------|--------|-------------|
| **API Service** | `HINDSIGHT_API_*` | Core memory engine |
| **Control Plane** | `HINDSIGHT_CP_*` | Web UI |

---

## API Service

The API service handles all memory operations (retain, recall, reflect).

### Database

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_DATABASE_URL` | PostgreSQL connection string | `pg0` (embedded) |
| `HINDSIGHT_API_READ_DATABASE_URL` | Optional read-replica PostgreSQL URL. When set, recall queries (semantic, BM25, graph, temporal) are routed through a separate connection pool against this URL, offloading the primary. Typically points to a read-only endpoint (e.g., CNPG's `<cluster>-ro` service or Aurora reader endpoint). | Unset (uses primary) |
| `HINDSIGHT_API_MIGRATION_DATABASE_URL` | Direct PostgreSQL URL for running migrations, bypassing connection poolers (e.g. PgBouncer). When set, advisory locks and Alembic migrations use this URL instead of `DATABASE_URL`. | Falls back to `DATABASE_URL` |
| `HINDSIGHT_API_DATABASE_SCHEMA` | PostgreSQL schema name for tables | `public` |
| `HINDSIGHT_API_RUN_MIGRATIONS_ON_STARTUP` | Run database migrations on API startup | `true` |
| `HINDSIGHT_API_MIGRATION_CONCURRENCY` | Number of tenant schemas to migrate concurrently (PostgreSQL only). Each schema runs in its own process; within a schema migrations are always sequential. Each worker has a fixed startup cost (~1–2s to boot a fresh interpreter), so this only pays off with **many** schemas (roughly tens or more) or slow/high-latency migrations — for a handful of schemas it is slower than sequential. Each worker uses ~3 database connections, so keep `concurrency × 3` within your database's spare `max_connections` (and any PgBouncer pool limit). `1` = fully sequential. Measured at 20k schemas: the per-restart no-op resweep dropped from ~60min to ~11min (≈5×) at `concurrency=12`. | `1` |
| `HINDSIGHT_API_DATABASE_BACKEND` | Database engine backend: `postgresql` or `oracle` (Oracle 23ai) | `postgresql` |

If not provided, the server uses embedded `pg0` — convenient for development but not recommended for production.

The `DATABASE_SCHEMA` setting allows you to use a custom PostgreSQL schema instead of the default `public` schema. This is useful for:
- Multi-database setups where you want Hindsight tables in a dedicated schema
- Hosting platforms (e.g., Supabase) where `public` schema is reserved or shared
- Organizational preferences for schema naming conventions

```bash
# Example: Using a custom schema
export HINDSIGHT_API_DATABASE_URL=postgresql://user:pass@host:5432/dbname
export HINDSIGHT_API_DATABASE_SCHEMA=hindsight
```

Migrations will automatically create the schema if it doesn't exist and create all tables in the configured schema.

### Database Connection Pool

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_DB_POOL_MIN_SIZE` | Minimum connections in the primary pool | `5` |
| `HINDSIGHT_API_DB_POOL_MAX_SIZE` | Maximum connections in the primary pool | `100` |
| `HINDSIGHT_API_READ_DB_POOL_MIN_SIZE` | Minimum connections in the read-replica pool (only used when `READ_DATABASE_URL` is set) | Falls back to `DB_POOL_MIN_SIZE` |
| `HINDSIGHT_API_READ_DB_POOL_MAX_SIZE` | Maximum connections in the read-replica pool (only used when `READ_DATABASE_URL` is set) | Falls back to `DB_POOL_MAX_SIZE` |
| `HINDSIGHT_API_DB_COMMAND_TIMEOUT` | PostgreSQL command timeout in seconds (asyncpg client-side) | `60` |
| `HINDSIGHT_API_DB_ACQUIRE_TIMEOUT` | Connection acquisition timeout in seconds | `30` |
| `HINDSIGHT_API_DB_STATEMENT_TIMEOUT` | Postgres `statement_timeout` applied to every pool connection, in seconds. Server-side safety net for runaway queries. Does **not** apply to Alembic migrations (which run on a separate psycopg2 engine). Set to `0` to disable. | `600` |

For high-concurrency workloads, increase `DB_POOL_MAX_SIZE`. Each concurrent recall or reflect operation can use 2-4 connections.
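
The 2-4 connections per operation figure translates into a simple sizing estimate. The helpers below are hypothetical and only do the arithmetic; they ignore connections used by background work such as retain processing or migrations.

```python
def pool_size_range(concurrent_ops: int) -> tuple:
    """Connections needed if each operation holds 2 (best case) to 4 (worst case)."""
    return concurrent_ops * 2, concurrent_ops * 4

def max_concurrent_ops(pool_max_size: int = 100) -> tuple:
    """(worst case, best case) number of operations a pool can serve without waiting."""
    return pool_max_size // 4, pool_max_size // 2

print(pool_size_range(40))      # (80, 160)
print(max_concurrent_ops(100))  # (25, 50)
```

With the default maximum of 100, the pool covers roughly 25 to 50 concurrent operations before callers start waiting on `DB_ACQUIRE_TIMEOUT`.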

To run migrations manually (e.g., before starting the API), use the admin CLI:

```bash
# Migrate the base schema plus all discovered tenant schemas
hindsight-admin run-db-migration

# Or migrate a specific schema only:
hindsight-admin run-db-migration --schema tenant_acme
```
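
When raising `HINDSIGHT_API_MIGRATION_CONCURRENCY`, the guidance of about 3 connections per migration worker gives an upper bound on a safe value. The sketch below is only that arithmetic; the helper name and the example connection counts are hypothetical.

```python
CONNECTIONS_PER_WORKER = 3  # "~3 database connections" per migration worker

def max_migration_concurrency(max_connections: int, in_use: int, reserve: int = 0) -> int:
    """Largest concurrency whose workers fit in the spare connections (at least 1)."""
    spare = max_connections - in_use - reserve
    return max(1, spare // CONNECTIONS_PER_WORKER)

print(max_migration_concurrency(max_connections=100, in_use=55))  # 15

# Speedup reported for 20k schemas at concurrency=12: ~60 min down to ~11 min
print(round(60 / 11, 1))  # 5.5
```

If a pooler such as PgBouncer sits in front of the database, its pool limit should be used as `max_connections` in this estimate.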

### Vector Extension

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_VECTOR_EXTENSION` | Vector index algorithm: `pgvector`, `vchord`, `pgvectorscale`, or `scann` | `pgvector` |

Hindsight supports four PostgreSQL vector extensions:

#### **pgvector** (HNSW - default)
- In-memory index using Hierarchical Navigable Small World algorithm
- Works well for most embeddings and dataset sizes
- Fast for small-medium datasets (under 10M vectors)
- Higher memory usage for large datasets
- Most widely deployed and supported

#### **pgvectorscale** (DiskANN - recommended for scale) ⭐
- Disk-based index using StreamingDiskANN algorithm
- **28x lower p95 latency** and **16x higher throughput** vs Pinecone's storage-optimized (s1) index, according to Timescale's published benchmark
- **60-75% cost reduction** at scale (SSDs cheaper than RAM)
- Superior filtering performance with streaming retrieval model
- Optimized for large datasets (10M+ vectors)
- Supports both **pgvectorscale** (open source) and **pg_diskann** (Azure)
- **Installation:**
  - Open source/self-hosted: `CREATE EXTENSION vector; CREATE EXTENSION vectorscale CASCADE;`
  - Azure PostgreSQL: `CREATE EXTENSION vector; CREATE EXTENSION pg_diskann CASCADE;`

#### **vchord** (vchordrq)
- Alternative high-performance vector index
- Optimized for high-dimensional embeddings (3000+ dimensions)
- Includes integrated BM25 search capabilities
- Requires `vchord` extension

#### **scann** (AlloyDB ScaNN)
- Google's ScaNN index, available on **AlloyDB** and **AlloyDB Omni**
- Uses a single global vector index in `AUTO` mode (per-bank partial indexes are not used)
- **Installation:** `CREATE EXTENSION vector; CREATE EXTENSION alloydb_scann CASCADE;`
- **Index build is deferred** until a table reaches **10,000 populated embedding rows** — AlloyDB cannot build a ScaNN AUTO index on a near-empty table. Until that threshold is crossed, recall falls back to a sequential scan; the global index is built on the next API startup once enough rows exist.
- A ready-to-use Docker Compose stack is provided at [`docker/docker-compose/alloydb/docker-compose.yaml`](https://github.com/vectorize-io/hindsight/blob/main/docker/docker-compose/alloydb/docker-compose.yaml) for running Hindsight against AlloyDB Omni locally.

**When to use pgvectorscale (DiskANN):**
- Large datasets (10M+ vectors) ⭐
- Complex filtering requirements
- Cost-sensitive deployments
- Production workloads requiring high throughput
- When disk I/O is not a bottleneck

**When to use pgvector (HNSW):**
- Small-medium datasets (under 10M vectors)
- Maximum query speed when all data fits in memory
- Simple nearest-neighbor queries without filters
- Standard PostgreSQL deployment preference

**When to use vchord:**
- High-dimensional embeddings (3000+ dimensions)
- Want integrated BM25 search
- Already using vchord for text search

**When to use scann:**
- Running on Google **AlloyDB** or **AlloyDB Omni**
- Want managed ScaNN with `AUTO` mode tuning
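
The guidance above can be condensed into a small decision helper. This is a sketch of the bullet lists only: the function is hypothetical, the thresholds (10M vectors, 3000 dimensions) come from the lists, and the order in which the rules are checked is an assumption, since the text does not rank them.

```python
def suggest_vector_extension(num_vectors: int, dimensions: int,
                             on_alloydb: bool = False) -> str:
    """Suggest a value for HINDSIGHT_API_VECTOR_EXTENSION (heuristic only)."""
    if on_alloydb:
        return "scann"            # AlloyDB / AlloyDB Omni
    if dimensions >= 3000:
        return "vchord"           # high-dimensional embeddings
    if num_vectors >= 10_000_000:
        return "pgvectorscale"    # large datasets, disk-based index
    return "pgvector"             # default for small-medium datasets

print(suggest_vector_extension(num_vectors=2_000_000, dimensions=384))   # pgvector
print(suggest_vector_extension(num_vectors=50_000_000, dimensions=768))  # pgvectorscale
```

Factors such as filtering complexity, cost, and existing extensions are not modelled here and may justify a different choice.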

**Switching extensions:**

If you need to switch from one extension to another:
1. Set `HINDSIGHT_API_VECTOR_EXTENSION` to your desired extension (`pgvector`, `vchord`, `pgvectorscale`, or `scann`)
2. If your database has existing data, you'll get an error with migration instructions (note: switching **to** `scann` is allowed even with data — the existing index is dropped and rebuilt as ScaNN once the table has at least 10,000 embedding rows)
3. For empty databases, indexes will be automatically recreated on startup

**Learn more:**
- [HNSW vs. DiskANN comparison](https://www.tigerdata.com/learn/hnsw-vs-diskann)
- [pgvectorscale GitHub](https://github.com/timescale/pgvectorscale)

### Text Search Extension

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_TEXT_SEARCH_EXTENSION` | Text search backend: `native`, `vchord`, `pg_textsearch`, `pgroonga`, or `pg_search` | `native` |
| `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE` | PostgreSQL text search dictionary used by the `native` backend (e.g. `english`, `french`, `simple`, `zhparser`) | `english` |
| `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_PG_SEARCH_TOKENIZER` | ParadeDB `pg_search` tokenizer used when creating BM25 indexes. Empty uses ParadeDB's default tokenizer (`unicode_words`). | unset |
| `HINDSIGHT_API_LLM_OUTPUT_LANGUAGE` | When set, forces every LLM-generated artifact (retain facts, consolidation observations, reflect responses) into this language. Free-form (e.g. `Spanish`, `Japanese`). | unset |

Hindsight supports five backends for BM25 keyword retrieval:
- **native** — PostgreSQL's built-in full-text search (`tsvector` + GIN). Language configurable.
- **vchord** — VectorChord BM25 (uses the `llmlingua2` multilingual tokenizer).
- **pg_textsearch** — Timescale's pg_textsearch extension. English-only.
- **pgroonga** — pgroonga full-text search. Multilingual / CJK out of the box.
- **pg_search** — ParadeDB pg_search. True BM25; the only backend that is Citus-compatible.

To switch backends: set `HINDSIGHT_API_TEXT_SEARCH_EXTENSION`. With existing data, you'll get an error and migration instructions; with an empty database the columns/indexes are recreated automatically on startup.

`HINDSIGHT_API_TEXT_SEARCH_EXTENSION_PG_SEARCH_TOKENIZER` only applies when `HINDSIGHT_API_TEXT_SEARCH_EXTENSION=pg_search`, and only when BM25 indexes are created. Changing it for an existing database requires rebuilding the `pg_search` indexes or recreating the database. Supported values are empty/unset, `unicode_words`, `simple`, `whitespace`, `literal`, `literal_normalized`, `chinese_compatible`, `icu`, `jieba`, `source_code`, `chinese_lindera`/`lindera(chinese)`, `japanese_lindera`/`lindera(japanese)`, `korean_lindera`/`lindera(korean)`, `ngram(min,max)`, and `edge_ngram(min,max)`.

For non-English banks (especially CJK) and the language/extraction-language tradeoffs, see the [Multilingual Support](./multilingual) page.

### LLM Provider

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_LLM_PROVIDER` | Provider: `openai`, `openai-codex`, `claude-code`, `anthropic`, `gemini`, `groq`, `minimax`, `deepseek`, `zai`, `opencode-go`, `nous`, `fireworks`, `ollama`, `ollama-cloud`, `lmstudio`, `llamacpp`, `vertexai`, `bedrock`, `litellm`, `litellmrouter`, `volcano`, `openrouter`, `requesty`, `none` | `openai` |
| `HINDSIGHT_API_LLM_API_KEY` | API key for LLM provider | - |
| `HINDSIGHT_API_LLM_MODEL` | Model name | `gpt-5-mini` |
| `HINDSIGHT_API_LLM_BASE_URL` | Custom LLM endpoint | Provider default |
| `HINDSIGHT_API_LLM_MAX_CONCURRENT` | Max concurrent LLM requests | `32` |
| `HINDSIGHT_API_LLM_MAX_RETRIES` | Max retry attempts for LLM API calls | `3` |
| `HINDSIGHT_API_LLM_INITIAL_BACKOFF` | Initial retry backoff in seconds (exponential backoff) | `1.0` |
| `HINDSIGHT_API_LLM_MAX_BACKOFF` | Max retry backoff cap in seconds | `60.0` |
| `HINDSIGHT_API_LLM_TIMEOUT` | LLM request timeout in seconds | `120` |
| `HINDSIGHT_API_LLM_REASONING_EFFORT` | Reasoning effort for providers/models that support it (for example `low`, `medium`, `high`, `xhigh`) | `low` |
| `HINDSIGHT_API_LLM_TEMPERATURE` | Global override for the sampling temperature of internal LLM calls. Set a number in `[0.0, 2.0]`, or `none` (also `default`/`off`/empty) to **omit** the temperature parameter entirely — required for models that reject explicit temperatures, e.g. Azure `gpt-5.5`, which only accepts its default value. Per-operation variables below override this. | Per-operation defaults |
| `HINDSIGHT_API_LLM_TEMPERATURE_VERIFICATION` | Temperature for the startup connection check. Number in `[0.0, 2.0]` or `none` to omit. Overrides `HINDSIGHT_API_LLM_TEMPERATURE`. | `0.0` |
| `HINDSIGHT_API_LLM_TEMPERATURE_RETAIN` | Temperature for fact extraction during retain. Number in `[0.0, 2.0]` or `none` to omit. Overrides `HINDSIGHT_API_LLM_TEMPERATURE`. | `0.1` |
| `HINDSIGHT_API_LLM_TEMPERATURE_REFLECT` | Temperature for the reflect "thinking" step. Number in `[0.0, 2.0]` or `none` to omit. Overrides `HINDSIGHT_API_LLM_TEMPERATURE`. | `0.9` |
| `HINDSIGHT_API_LLM_TEMPERATURE_CONSOLIDATION` | Temperature for consolidation (mental-model delta and dedup). Number in `[0.0, 2.0]` or `none` to omit. Overrides `HINDSIGHT_API_LLM_TEMPERATURE`. | `0.0` |
| `HINDSIGHT_API_LLM_SEND_BANK_AS_USER` | Tag outbound LLM and embedding calls with `user=<bank_id>` so gateways (OpenRouter usage accounting, LiteLLM, Helicone) can attribute spend per bank. When enabled, the bank id is transmitted to the upstream provider as the end-user identifier. | `false` |
| `HINDSIGHT_API_LLM_GROQ_SERVICE_TIER` | Groq service tier: `on_demand`, `flex`, `auto` | `auto` |
| `HINDSIGHT_API_LLM_OPENAI_SERVICE_TIER` | OpenAI service tier: `flex` for 50% cost savings (OpenAI Flex Processing) | None (default) |
| `HINDSIGHT_API_LLM_BEDROCK_SERVICE_TIER` | Bedrock service tier: `flex` for 50% cost savings (best-effort inference), `priority` (guaranteed throughput), or `reserved` (provisioned capacity) | Unset (default tier) |
| `HINDSIGHT_API_LLM_GEMINI_SERVICE_TIER` | Gemini service tier: `flex` for 50% cost savings (best-effort inference) | Unset (default tier) |
| `HINDSIGHT_API_LLM_EXTRA_BODY` | JSON dict of extra request-body params (e.g. `temperature`, `top_p`, `max_tokens`) merged into every LLM call. Applied across the OpenAI-compatible, Fireworks, Anthropic, Gemini/VertexAI and LiteLLM (incl. Bedrock/Router) providers. Each provider merges them in its own native parameter space, so use that provider's field names (e.g. `max_tokens` for OpenAI/Anthropic vs `max_output_tokens` for Gemini). Also useful for custom model servers (e.g. vLLM `chat_template_kwargs`). | `null` |
| `HINDSIGHT_API_LLM_DEFAULT_HEADERS` | JSON dict passed as `default_headers` to provider SDK clients. Used by operators routing through proxies / request-tracing middleware (e.g. Cloudflare AI Gateway, Helicone, corporate proxies). Currently wired into the Anthropic provider; other providers can opt in. | `null` |
| `HINDSIGHT_API_LLM_STRICT_SCHEMA` | Grammar-enforce structured output via `json_schema` `strict: true` instead of the soft "schema-in-prompt + `json_object`" path. Use it with weaker self-hosted models that return prose preambles, markdown ` ```json ` fences, or invalid JSON — which otherwise fail to parse and wedge retain/consolidation. Applies to OpenAI-compatible backends (OpenAI, llama.cpp, vLLM) and LiteLLM; Gemini already enforces its native `response_schema` regardless, and providers without a strict mode ignore it. | `false` |
| `HINDSIGHT_API_LLM_GEMINI_SAFETY_SETTINGS` | JSON-encoded list of `{category, threshold}` dicts for Gemini/VertexAI content safety filtering | `null` |
| `HINDSIGHT_API_LLM_PROMPT_CACHE_ENABLED` | Reuse the fixed system prefix via the provider's explicit prompt cache, billed at the cached-input rate (Gemini/Vertex `CachedContent`). The cached prefix is shared across all banks and soft-fails to an uncached call. Set to `false` to disable. See [Models](./models#provider-capabilities). | `true` |
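
Because `HINDSIGHT_API_LLM_EXTRA_BODY` and `HINDSIGHT_API_LLM_DEFAULT_HEADERS` must be JSON dicts, it can be easier to generate the values than to hand-quote them in a shell. The following is a minimal sketch, not part of Hindsight: the helper name `as_env_json` and the header name are hypothetical, and the parameter names are the OpenAI-style fields mentioned in the table.

```python
import json


def as_env_json(params: dict) -> str:
    """Serialize a dict to compact JSON suitable for an environment variable."""
    if not isinstance(params, dict):
        raise TypeError("value must be a JSON object (dict)")
    return json.dumps(params, separators=(",", ":"))


# OpenAI/Anthropic-style field names; Gemini would use max_output_tokens instead.
extra_body = as_env_json({"temperature": 0.2, "max_tokens": 2048})
# Hypothetical tracing header for a proxy in front of the provider.
default_headers = as_env_json({"x-request-source": "hindsight"})

print(f"export HINDSIGHT_API_LLM_EXTRA_BODY='{extra_body}'")
print(f"export HINDSIGHT_API_LLM_DEFAULT_HEADERS='{default_headers}'")
```

Wrapping the printed value in single quotes keeps the shell from interpreting the double quotes inside the JSON.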

**Provider Examples**

```bash
# Groq (recommended for fast inference)
export HINDSIGHT_API_LLM_PROVIDER=groq
export HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=openai/gpt-oss-20b
# For free tier users: override to on_demand if you get service_tier errors
# export HINDSIGHT_API_LLM_GROQ_SERVICE_TIER=on_demand

# OpenAI
export HINDSIGHT_API_LLM_PROVIDER=openai
export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=gpt-4o
# Optional: Use Flex Processing for 50% cost savings (with variable latency)
# export HINDSIGHT_API_LLM_OPENAI_SERVICE_TIER=flex

# Gemini
export HINDSIGHT_API_LLM_PROVIDER=gemini
export HINDSIGHT_API_LLM_API_KEY=xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=gemini-2.0-flash
# Optional: Use Gemini Flex for 50% cost savings (best-effort inference)
# export HINDSIGHT_API_LLM_GEMINI_SERVICE_TIER=flex

# Anthropic
export HINDSIGHT_API_LLM_PROVIDER=anthropic
export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=claude-sonnet-4-20250514

# Vertex AI (Google Cloud - uses native genai SDK)
export HINDSIGHT_API_LLM_PROVIDER=vertexai
export HINDSIGHT_API_LLM_MODEL=gemini-2.0-flash-001
export HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID=your-gcp-project-id
export HINDSIGHT_API_LLM_VERTEXAI_REGION=us-central1
# Optional: use ADC (gcloud auth application-default login) or provide service account key:
# export HINDSIGHT_API_LLM_VERTEXAI_SERVICE_ACCOUNT_KEY=/path/to/service-account-key.json

# Ollama (local, no API key)
export HINDSIGHT_API_LLM_PROVIDER=ollama
export HINDSIGHT_API_LLM_BASE_URL=http://localhost:11434/v1
export HINDSIGHT_API_LLM_MODEL=llama3

# Ollama Cloud (hosted Ollama endpoint, requires API key)
export HINDSIGHT_API_LLM_PROVIDER=ollama-cloud
export HINDSIGHT_API_LLM_API_KEY=your-ollama-cloud-api-key
export HINDSIGHT_API_LLM_MODEL=gemma3:12b

# LM Studio (local, no API key)
export HINDSIGHT_API_LLM_PROVIDER=lmstudio
export HINDSIGHT_API_LLM_BASE_URL=http://localhost:1234/v1
export HINDSIGHT_API_LLM_MODEL=your-local-model

# llama.cpp (built-in local inference, no external server needed)
export HINDSIGHT_API_LLM_PROVIDER=llamacpp
# No API key, base URL, or external server required.
# Auto-downloads Gemma 4 E2B (~3.5 GB GGUF) on first run.
# See "Built-in llama.cpp" section below for all configuration options.

# OpenAI-compatible endpoint
export HINDSIGHT_API_LLM_PROVIDER=openai
export HINDSIGHT_API_LLM_BASE_URL=https://your-endpoint.com/v1
export HINDSIGHT_API_LLM_API_KEY=your-api-key
export HINDSIGHT_API_LLM_MODEL=your-model-name

# OpenAI Codex (ChatGPT Plus/Pro subscription - uses OAuth, no API key needed)
export HINDSIGHT_API_LLM_PROVIDER=openai-codex
export HINDSIGHT_API_LLM_MODEL=gpt-5.4-mini
# No API key needed - uses OAuth tokens from ~/.codex/auth.json
# For long-running services, set CODEX_HOME to a dedicated auth directory so
# Hindsight doesn't share (and lose) its refresh token with another Codex process.
# See Models docs → "Isolating Codex auth for long-running services".
# export CODEX_HOME=/var/lib/hindsight/codex

# Claude Code (Claude Pro/Max subscription - uses OAuth, no API key needed)
export HINDSIGHT_API_LLM_PROVIDER=claude-code
export HINDSIGHT_API_LLM_MODEL=claude-sonnet-4-5-20250929
# No API key needed - uses claude auth login credentials

# Volcano Engine (ByteDance - OpenAI-compatible)
export HINDSIGHT_API_LLM_PROVIDER=volcano
export HINDSIGHT_API_LLM_API_KEY=your-api-key
export HINDSIGHT_API_LLM_BASE_URL=https://ark.cn-beijing.volces.com/api/v3
export HINDSIGHT_API_LLM_MODEL=doubao-pro-32k

# OpenRouter (OpenAI-compatible, access 100+ models)
export HINDSIGHT_API_LLM_PROVIDER=openrouter
export HINDSIGHT_API_LLM_API_KEY=your-openrouter-api-key
export HINDSIGHT_API_LLM_MODEL=qwen/qwen3.5-9b

# Requesty (OpenAI-compatible gateway)
export HINDSIGHT_API_LLM_PROVIDER=requesty
export HINDSIGHT_API_LLM_API_KEY=your-requesty-api-key
export HINDSIGHT_API_LLM_MODEL=openai/gpt-4o-mini

# DeepSeek (OpenAI-compatible, https://api.deepseek.com)
export HINDSIGHT_API_LLM_PROVIDER=deepseek
export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=deepseek-v4-flash
# Notes:
# - `deepseek-v4-flash` defaults to thinking mode at the API level (treated as
#   `deepseek-reasoner`). Hindsight handles this transparently; the reflect
#   agent will not crash with "deepseek-reasoner does not support this
#   tool_choice".
# - Use `deepseek-v4-pro` for the higher-quality reasoning route.
# - Use `deepseek-chat` for the non-thinking alias (faster, cheaper).

# z.ai (Zhipu GLM series, OpenAI-compatible, https://z.ai)
export HINDSIGHT_API_LLM_PROVIDER=zai
export HINDSIGHT_API_LLM_API_KEY=your-zai-api-key
export HINDSIGHT_API_LLM_MODEL=glm-4.5-flash  # or glm-4.5-air for the paid tier
# Default base_url: https://api.z.ai/api/coding/paas/v4 (override with HINDSIGHT_API_LLM_BASE_URL if needed)

# opencode-go (OpenAI-compatible)
export HINDSIGHT_API_LLM_PROVIDER=opencode-go
export HINDSIGHT_API_LLM_API_KEY=your-opencode-go-api-key
export HINDSIGHT_API_LLM_MODEL=deepseek-v4-flash
# Default base_url: https://opencode.ai/zen/go/v1 (override with HINDSIGHT_API_LLM_BASE_URL if needed)

# Nous Portal (OpenAI-compatible; no API key — uses your `hermes portal` login)
export HINDSIGHT_API_LLM_PROVIDER=nous
export HINDSIGHT_API_LLM_MODEL=deepseek/deepseek-v4-flash
# No API key needed — reads a rotating JWT from ~/.hermes/auth.json (run `hermes portal` first).
# Default base_url: https://inference-api.nousresearch.com/v1 (override with HINDSIGHT_API_LLM_BASE_URL if needed)
# See the "Nous Portal Setup" section in the Models guide for the login flow.

# AWS Bedrock (native support - no API key needed, uses AWS credentials)
export HINDSIGHT_API_LLM_PROVIDER=bedrock
export HINDSIGHT_API_LLM_MODEL=us.amazon.nova-2-lite-v1:0
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_REGION_NAME=us-east-1
# Optional: Use Flex tier for 50% cost savings (with variable latency)
# export HINDSIGHT_API_LLM_BEDROCK_SERVICE_TIER=flex

# LiteLLM (100+ providers via LiteLLM SDK)
# Azure OpenAI via LiteLLM
export HINDSIGHT_API_LLM_PROVIDER=litellm
export HINDSIGHT_API_LLM_API_KEY=your-azure-api-key
export HINDSIGHT_API_LLM_MODEL=azure/gpt-4o

# Together AI via LiteLLM
export HINDSIGHT_API_LLM_PROVIDER=litellm
export HINDSIGHT_API_LLM_API_KEY=your-together-api-key
export HINDSIGHT_API_LLM_MODEL=together_ai/meta-llama/Llama-3-70b-chat-hf

# No LLM (chunk storage + semantic search only, no API key needed)
export HINDSIGHT_API_LLM_PROVIDER=none
# Retain automatically uses chunks mode (no fact extraction)
# Recall works normally (semantic search, BM25, graph retrieval)
# Reflect returns HTTP 400 (requires an LLM)
# Consolidation/observations are disabled
```

:::tip OpenAI Codex, Claude Code & Vertex AI Setup
For detailed setup instructions for **OpenAI Codex** (ChatGPT Plus/Pro), **Claude Code** (Claude Pro/Max), and **Vertex AI** (Google Cloud), see the [Models documentation](./models#openai-codex-setup-chatgpt-pluspro).
:::

### LLM Router (LiteLLM Router)

`HINDSIGHT_API_LLM_PROVIDER=litellmrouter` runs the default LLM through [LiteLLM's `Router`](https://docs.litellm.ai/docs/routing). The config JSON is forwarded verbatim — for fallback chains, load-balancing, rate limits, routing strategies, and the rest of the supported keys, see the [LiteLLM Router docs](https://docs.litellm.ai/docs/routing). Hindsight always issues completions against `model_name: "default"`, so include at least one entry with that name.

| Variable | Description |
|----------|-------------|
| `HINDSIGHT_API_LLM_LITELLMROUTER_CONFIG` | JSON object passed to `litellm.Router(**config)`. Required when provider is `litellmrouter`. |
| `HINDSIGHT_API_{RETAIN,REFLECT,CONSOLIDATION}_LLM_LITELLMROUTER_CONFIG` | Per-operation overrides. Fall back to the default config when unset. |

```bash
export HINDSIGHT_API_LLM_PROVIDER=litellmrouter
export HINDSIGHT_API_LLM_LITELLMROUTER_CONFIG='{
  "model_list": [
    {"model_name": "default",  "litellm_params": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."}},
    {"model_name": "fallback", "litellm_params": {"model": "anthropic/claude-sonnet-4-5", "api_key": "sk-ant-..."}}
  ],
  "fallbacks": [{"default": ["fallback"]}]
}'
```

The config is a credential field — never returned by the bank-config API. Hindsight already retries calls; set `"num_retries": 0` in the Router config to avoid double-retries. Batch APIs aren't supported in router mode.
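
The constraints above (an entry named `default`, Router retries disabled) can be checked before the variable is exported. This is a sketch under stated assumptions: `build_router_config` is a hypothetical helper, and the only Router keys it uses are the ones already shown in this section (`model_list`, `fallbacks`, `num_retries`).

```python
import json


def build_router_config(model_list: list, fallbacks: list) -> str:
    """Return Router config JSON after checking the constraints described above."""
    names = [entry["model_name"] for entry in model_list]
    if "default" not in names:
        raise ValueError('model_list needs an entry with model_name "default"')
    config = {
        "model_list": model_list,
        "fallbacks": fallbacks,
        # Hindsight already retries calls, so disable Router-level retries.
        "num_retries": 0,
    }
    return json.dumps(config)


config_json = build_router_config(
    model_list=[
        {"model_name": "default",
         "litellm_params": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."}},
        {"model_name": "fallback",
         "litellm_params": {"model": "anthropic/claude-sonnet-4-5", "api_key": "sk-ant-..."}},
    ],
    fallbacks=[{"default": ["fallback"]}],
)
print(f"export HINDSIGHT_API_LLM_LITELLMROUTER_CONFIG='{config_json}'")
```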

### Multi-LLM Strategies (failover / round-robin)

Configure additional LLMs **by index** alongside the primary, then choose a strategy for routing across them. This is a provider-agnostic alternative to the LiteLLM Router: the indexed LLMs can be any mix of providers, each fully configured.

The unindexed `HINDSIGHT_API_LLM_*` config is the **primary** (member 0). Extra members are numbered from 1:

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_LLM_<n>_PROVIDER` | Provider for extra member `n` (`n` = 1, 2, ...). Presence of this var defines the member; indices must be contiguous from 1. | - |
| `HINDSIGHT_API_LLM_<n>_API_KEY` | API key for member `n` (required unless the provider needs none). | - |
| `HINDSIGHT_API_LLM_<n>_MODEL` | Model for member `n`. | Provider default |
| `HINDSIGHT_API_LLM_<n>_BASE_URL` | Base URL for member `n`. | Provider default |
| `HINDSIGHT_API_LLM_<n>_REASONING_EFFORT` | Reasoning effort for member `n`. | `HINDSIGHT_API_LLM_REASONING_EFFORT` |
| `HINDSIGHT_API_LLM_<n>_EXTRA_BODY` / `_DEFAULT_HEADERS` | Per-member JSON overrides. | - |
| `HINDSIGHT_API_LLM_<n>_BEDROCK_SERVICE_TIER` / `_GEMINI_SERVICE_TIER` | Per-member service tier. | - |
| `HINDSIGHT_API_LLM_<n>_VERTEXAI_PROJECT_ID` / `_VERTEXAI_REGION` / `_VERTEXAI_SERVICE_ACCOUNT_KEY` | Per-member Vertex AI project, region, and service-account key path (for a `vertexai` member). Each falls back to the global `HINDSIGHT_API_LLM_VERTEXAI_*` when unset. | Global / `us-central1` / ADC |
| `HINDSIGHT_API_LLM_<n>_LITELLMROUTER_CONFIG` | Per-member LiteLLM Router config JSON (for a `litellmrouter` member). Falls back to the global `HINDSIGHT_API_LLM_LITELLMROUTER_CONFIG` when unset. | - |
| `HINDSIGHT_API_LLM_STRATEGY` | JSON routing strategy across the chain. Unset = single primary LLM (no change). | - |

The strategy JSON supports two modes:

- `{"mode": "failover"}` — try members in order (primary first); on a member's failure (after its own retries) advance to the next.
- `{"mode": "round-robin"}` — rotate the starting member per request to spread load, then fall through the rest on failure. Add `"weights": [3, 1, ...]` (positive ints, one per member, primary first) for an **unbalanced** rotation.

```bash
# Primary OpenAI, failover to Groq then Anthropic
export HINDSIGHT_API_LLM_PROVIDER=openai
export HINDSIGHT_API_LLM_API_KEY=sk-...
export HINDSIGHT_API_LLM_1_PROVIDER=groq
export HINDSIGHT_API_LLM_1_API_KEY=gsk_...
export HINDSIGHT_API_LLM_2_PROVIDER=anthropic
export HINDSIGHT_API_LLM_2_API_KEY=sk-ant-...
export HINDSIGHT_API_LLM_STRATEGY='{"mode": "failover"}'

# Weighted round-robin for a two-member chain (primary + member 1; weights need one
# entry per member): serve the primary 3x as often as member 1
export HINDSIGHT_API_LLM_STRATEGY='{"mode": "round-robin", "weights": [3, 1]}'
```
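
To make the two modes concrete, the sketch below computes the order in which members would be tried for each request. It is an illustration rather than Hindsight's implementation: the function names are hypothetical, index 0 stands for the primary, and the way weights expand into a rotation schedule is an assumption (the real interleaving may differ).

```python
from itertools import count


def failover_order(n_members: int) -> list:
    """Failover: always try the primary (0) first, then each extra member."""
    return list(range(n_members))


def make_round_robin(weights: list):
    """Return a function that yields the attempt order for each new request.

    Assumption: a weight of 3 means the member starts 3 out of every
    sum(weights) requests; the remaining members follow in index order.
    """
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be positive ints, one per member")
    # Expand weights into a repeating schedule of starting members.
    schedule = [i for i, w in enumerate(weights) for _ in range(w)]
    counter = count()

    def next_order() -> list:
        start = schedule[next(counter) % len(schedule)]
        rest = [i for i in range(len(weights)) if i != start]
        return [start] + rest

    return next_order


next_order = make_round_robin([3, 1])
orders = [next_order() for _ in range(4)]
print(orders)  # [[0, 1], [0, 1], [0, 1], [1, 0]]
```

With weights `[3, 1]`, three of every four requests start on the primary and one starts on member 1; in every case the other member remains available as a fallback.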

**Per-operation chains.** Each operation can define its own members + strategy with the `RETAIN` / `REFLECT` / `CONSOLIDATION` prefix (e.g. `HINDSIGHT_API_RETAIN_LLM_1_PROVIDER`, `HINDSIGHT_API_RETAIN_LLM_STRATEGY`). A per-operation slot with no indexed members (or no strategy) inherits the global chain.

The indexed members are credential fields — never returned by the bank-config API and server-level only (not per-bank configurable). **Batch retain** runs on the primary member only; failover/round-robin apply to the interactive retain/reflect/consolidation calls.
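
The naming rules for indexed members and per-operation chains can be sketched as a small lookup over the environment. Everything here is illustrative: `discover_members` and `chain_for` are hypothetical names, only the extra members are returned (the primary is omitted), and stopping at the first gap in the indices is an assumption, since this section only states that indices must be contiguous.

```python
def discover_members(env: dict, prefix: str = "HINDSIGHT_API_LLM") -> list:
    """Collect indexed members 1, 2, ... until the first missing _PROVIDER."""
    members = []
    n = 1
    while f"{prefix}_{n}_PROVIDER" in env:
        members.append({
            "index": n,
            "provider": env[f"{prefix}_{n}_PROVIDER"],
            "model": env.get(f"{prefix}_{n}_MODEL"),  # None means provider default
        })
        n += 1
    return members


def chain_for(env: dict, operation: str) -> list:
    """Use the per-operation chain if it has members and a strategy, else the global one."""
    op_prefix = f"HINDSIGHT_API_{operation}_LLM"
    op_members = discover_members(env, op_prefix)
    if op_members and f"{op_prefix}_STRATEGY" in env:
        return op_members
    return discover_members(env)


env = {
    "HINDSIGHT_API_LLM_1_PROVIDER": "groq",
    "HINDSIGHT_API_LLM_2_PROVIDER": "anthropic",
    "HINDSIGHT_API_LLM_4_PROVIDER": "openai",  # not picked up here: index 3 is missing
    "HINDSIGHT_API_RETAIN_LLM_1_PROVIDER": "openai",
    "HINDSIGHT_API_RETAIN_LLM_STRATEGY": '{"mode": "failover"}',
}
print([m["provider"] for m in chain_for(env, "RETAIN")])   # ['openai']
print([m["provider"] for m in chain_for(env, "REFLECT")])  # ['groq', 'anthropic']
```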

### Built-in llama.cpp

The `llamacpp` provider runs a llama.cpp server as a managed subprocess — no external LLM server needed. On first run it auto-downloads a default GGUF model (~3.5 GB). Requires the `local-llm` extra: `pip install 'hindsight-api-slim[local-llm]'`.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_LLAMACPP_MODEL_PATH` | Path to a GGUF model file. If not set, auto-downloads `gemma-4-E2B-it-Q4_K_M` from HuggingFace. | Auto-download |
| `HINDSIGHT_API_LLAMACPP_GPU_LAYERS` | Number of layers to offload to GPU. `-1` = all layers (recommended). `0` = CPU only. | `-1` |
| `HINDSIGHT_API_LLAMACPP_CONTEXT_SIZE` | Context window size in tokens. | `8192` |
| `HINDSIGHT_API_LLAMACPP_CHAT_FORMAT` | Chat template format. `null` = auto-detect from GGUF metadata (recommended). | Auto-detect |
| `HINDSIGHT_API_LLAMACPP_NO_GRAMMAR` | Disable JSON grammar enforcement. Faster inference but less reliable JSON output. | `false` |
| `HINDSIGHT_API_LLAMACPP_EXTRA_ARGS` | Space-separated extra CLI args passed to the llama.cpp server (e.g. `--n_threads 8 --type_k 1`). | - |

```bash
# Minimal setup (auto-downloads model, uses GPU)
export HINDSIGHT_API_LLM_PROVIDER=llamacpp

# Custom model with tuning
export HINDSIGHT_API_LLM_PROVIDER=llamacpp
export HINDSIGHT_API_LLM_MAX_CONCURRENT=2
export HINDSIGHT_API_LLAMACPP_MODEL_PATH=~/.hindsight/models/my-model.gguf
export HINDSIGHT_API_LLAMACPP_CONTEXT_SIZE=16384
export HINDSIGHT_API_LLAMACPP_NO_GRAMMAR=true  # faster, less reliable JSON
export HINDSIGHT_API_LLAMACPP_EXTRA_ARGS="--n_threads 8"
```

:::note
The llama.cpp server is shared across all LLM operations (retain, reflect, consolidation). Set `HINDSIGHT_API_LLM_MAX_CONCURRENT=2` to allow retain and consolidation to run concurrently without blocking each other.
:::

### Per-Operation LLM Configuration

Different memory operations have different requirements. **Retain** (fact extraction) benefits from models with strong structured output capabilities, while **Reflect** (reasoning/response generation) can use lighter, faster models. Configure a separate model for each operation to balance cost and performance.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_RETAIN_LLM_PROVIDER` | LLM provider for retain operations | Falls back to `HINDSIGHT_API_LLM_PROVIDER` |
| `HINDSIGHT_API_RETAIN_LLM_API_KEY` | API key for retain LLM | Falls back to `HINDSIGHT_API_LLM_API_KEY` |
| `HINDSIGHT_API_RETAIN_LLM_MODEL` | Model for retain operations | Falls back to `HINDSIGHT_API_LLM_MODEL` |
| `HINDSIGHT_API_RETAIN_LLM_BASE_URL` | Base URL for retain LLM | Falls back to `HINDSIGHT_API_LLM_BASE_URL` |
| `HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT` | Extra cap on concurrent retain LLM requests, composed with the global cap. Unset → only the global cap applies. | Unset |
| `HINDSIGHT_API_RETAIN_LLM_MAX_RETRIES` | Max retries for retain | Falls back to `HINDSIGHT_API_LLM_MAX_RETRIES` |
| `HINDSIGHT_API_RETAIN_LLM_INITIAL_BACKOFF` | Initial backoff for retain retries (seconds) | Falls back to `HINDSIGHT_API_LLM_INITIAL_BACKOFF` |
| `HINDSIGHT_API_RETAIN_LLM_MAX_BACKOFF` | Max backoff cap for retain retries (seconds) | Falls back to `HINDSIGHT_API_LLM_MAX_BACKOFF` |
| `HINDSIGHT_API_RETAIN_LLM_TIMEOUT` | Timeout for retain requests (seconds) | Falls back to `HINDSIGHT_API_LLM_TIMEOUT` |
| `HINDSIGHT_API_REFLECT_LLM_PROVIDER` | LLM provider for reflect operations | Falls back to `HINDSIGHT_API_LLM_PROVIDER` |
| `HINDSIGHT_API_REFLECT_LLM_API_KEY` | API key for reflect LLM | Falls back to `HINDSIGHT_API_LLM_API_KEY` |
| `HINDSIGHT_API_REFLECT_LLM_MODEL` | Model for reflect operations | Falls back to `HINDSIGHT_API_LLM_MODEL` |
| `HINDSIGHT_API_REFLECT_LLM_BASE_URL` | Base URL for reflect LLM | Falls back to `HINDSIGHT_API_LLM_BASE_URL` |
| `HINDSIGHT_API_REFLECT_LLM_MAX_CONCURRENT` | Extra cap on concurrent reflect LLM requests, composed with the global cap. Unset → only the global cap applies. | Unset |
| `HINDSIGHT_API_REFLECT_LLM_MAX_RETRIES` | Max retries for reflect | Falls back to `HINDSIGHT_API_LLM_MAX_RETRIES` |
| `HINDSIGHT_API_REFLECT_LLM_INITIAL_BACKOFF` | Initial backoff for reflect retries (seconds) | Falls back to `HINDSIGHT_API_LLM_INITIAL_BACKOFF` |
| `HINDSIGHT_API_REFLECT_LLM_MAX_BACKOFF` | Max backoff cap for reflect retries (seconds) | Falls back to `HINDSIGHT_API_LLM_MAX_BACKOFF` |
| `HINDSIGHT_API_REFLECT_LLM_TIMEOUT` | Timeout for reflect requests (seconds) | Falls back to `HINDSIGHT_API_LLM_TIMEOUT` |
| `HINDSIGHT_API_CONSOLIDATION_LLM_PROVIDER` | LLM provider for observation consolidation | Falls back to `HINDSIGHT_API_LLM_PROVIDER` |
| `HINDSIGHT_API_CONSOLIDATION_LLM_API_KEY` | API key for consolidation LLM | Falls back to `HINDSIGHT_API_LLM_API_KEY` |
| `HINDSIGHT_API_CONSOLIDATION_LLM_MODEL` | Model for consolidation operations | Falls back to `HINDSIGHT_API_LLM_MODEL` |
| `HINDSIGHT_API_CONSOLIDATION_LLM_BASE_URL` | Base URL for consolidation LLM | Falls back to `HINDSIGHT_API_LLM_BASE_URL` |
| `HINDSIGHT_API_CONSOLIDATION_LLM_MAX_CONCURRENT` | Extra cap on concurrent consolidation LLM requests, composed with the global cap. Unset → only the global cap applies. | Unset |
| `HINDSIGHT_API_CONSOLIDATION_LLM_MAX_RETRIES` | Max retries for consolidation | Falls back to `HINDSIGHT_API_LLM_MAX_RETRIES` |
| `HINDSIGHT_API_CONSOLIDATION_LLM_INITIAL_BACKOFF` | Initial backoff for consolidation retries (seconds) | Falls back to `HINDSIGHT_API_LLM_INITIAL_BACKOFF` |
| `HINDSIGHT_API_CONSOLIDATION_LLM_MAX_BACKOFF` | Max backoff cap for consolidation retries (seconds) | Falls back to `HINDSIGHT_API_LLM_MAX_BACKOFF` |
| `HINDSIGHT_API_CONSOLIDATION_LLM_TIMEOUT` | Timeout for consolidation requests (seconds) | Falls back to `HINDSIGHT_API_LLM_TIMEOUT` |

:::tip When to Use Per-Operation Config
- **Retain**: Use models with strong structured output (e.g., GPT-4o, Claude) for accurate fact extraction
- **Reflect**: Use faster/cheaper models (e.g., GPT-4o-mini, Groq) for reasoning and response generation
- **Recall**: Does not use an LLM (pure retrieval), so no configuration is needed
:::

**Example: Separate Models for Retain and Reflect**

```bash
# Default LLM (used as fallback)
export HINDSIGHT_API_LLM_PROVIDER=openai
export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=gpt-4o

# Use GPT-4o for retain (strong structured output)
export HINDSIGHT_API_RETAIN_LLM_MODEL=gpt-4o

# Use faster/cheaper model for reflect
export HINDSIGHT_API_REFLECT_LLM_PROVIDER=groq
export HINDSIGHT_API_REFLECT_LLM_API_KEY=gsk_xxxxxxxxxxxx
export HINDSIGHT_API_REFLECT_LLM_MODEL=llama-3.3-70b-versatile
```
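
The "Falls back to" column in the table above amounts to a two-step lookup. The sketch below shows that lookup for the example configuration; `resolve_llm_setting` is a hypothetical helper, not a Hindsight function, and it does not apply to the `*_MAX_CONCURRENT` caps, which compose with the global cap instead of falling back.

```python
from typing import Optional


def resolve_llm_setting(env: dict, operation: str, parameter: str) -> Optional[str]:
    """Return the per-operation value if set, else the global HINDSIGHT_API_LLM_* value."""
    per_op = env.get(f"HINDSIGHT_API_{operation}_LLM_{parameter}")
    if per_op is not None:
        return per_op
    return env.get(f"HINDSIGHT_API_LLM_{parameter}")


env = {
    "HINDSIGHT_API_LLM_PROVIDER": "openai",
    "HINDSIGHT_API_LLM_MODEL": "gpt-4o",
    "HINDSIGHT_API_REFLECT_LLM_PROVIDER": "groq",
    "HINDSIGHT_API_REFLECT_LLM_MODEL": "llama-3.3-70b-versatile",
}
print(resolve_llm_setting(env, "REFLECT", "MODEL"))        # llama-3.3-70b-versatile
print(resolve_llm_setting(env, "CONSOLIDATION", "MODEL"))  # gpt-4o (global fallback)
```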

**Example: Tuning Retry Behavior for Rate-Limited APIs**

```bash
# For Anthropic with tight rate limits (10k output tokens/minute)
export HINDSIGHT_API_LLM_PROVIDER=anthropic
export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=claude-sonnet-4-20250514

# Reduce concurrent requests for retain to avoid rate limits
export HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT=3

# Fail faster with fewer retries
export HINDSIGHT_API_RETAIN_LLM_MAX_RETRIES=3

# Or increase backoff times to wait out rate limit windows
export HINDSIGHT_API_RETAIN_LLM_INITIAL_BACKOFF=2.0  # Start at 2s instead of 1s
export HINDSIGHT_API_RETAIN_LLM_MAX_BACKOFF=120.0    # Cap at 2min instead of 1min
```
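
To see what the backoff settings above imply, the sketch below lists the wait before each retry. It assumes a plain exponential schedule that doubles each time with no jitter; this section only documents the initial value and the cap, so the growth factor and the retry count used here are illustrative assumptions.

```python
def backoff_delays(initial: float, maximum: float, retries: int, factor: float = 2.0) -> list:
    """Delay before each retry: initial * factor**attempt, capped at maximum."""
    return [min(initial * factor ** attempt, maximum) for attempt in range(retries)]


delays = backoff_delays(initial=2.0, maximum=120.0, retries=8)
print(delays)       # [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 120.0, 120.0]
print(sum(delays))  # 366.0 seconds of waiting in the worst case
```

Under these assumptions the cap is reached on the seventh retry, so raising `MAX_BACKOFF` only matters when enough retries are allowed to reach it.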

:::note Per-operation concurrency composes with the global cap
`HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT`, `HINDSIGHT_API_REFLECT_LLM_MAX_CONCURRENT`, and
`HINDSIGHT_API_CONSOLIDATION_LLM_MAX_CONCURRENT` add an extra cap that applies *on top of*
`HINDSIGHT_API_LLM_MAX_CONCURRENT`. A retain call counts against both the retain cap and the
global cap; a reflect call without a per-op cap is bounded only by the global cap.

To reserve headroom for live chat/reflect on a rate-limited provider, cap retain and
consolidation below the global value — e.g. global=4, retain=1, consolidation=1 leaves
two slots that retain/consolidation cannot consume.

Unlike the per-operation timeout and retry/backoff knobs, the `*_LLM_MAX_CONCURRENT`
caps are process-global semaphores read from the environment once at startup. They are
server-level only (not overridable per tenant/bank) and a change requires a restart.
:::
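
The composition described in the note can be modelled with two levels of semaphores. This is a minimal sketch, not Hindsight's code: the class and method names are hypothetical, `asyncio.sleep` stands in for the LLM call, and acquiring the per-operation slot before the global one is a design choice made here so that queued retain calls do not hold global slots.

```python
import asyncio


class ConcurrencyCaps:
    """A global cap plus optional per-operation caps; a call must hold both."""

    def __init__(self, global_cap: int, per_op: dict):
        self._global = asyncio.Semaphore(global_cap)
        self._per_op = {op: asyncio.Semaphore(n) for op, n in per_op.items()}
        self.active = {}
        self.peak = {}

    async def run(self, op: str, seconds: float = 0.01) -> None:
        op_sem = self._per_op.get(op)
        # Take the per-operation slot first so waiting calls of a capped
        # operation do not occupy global slots that other operations could use.
        if op_sem is not None:
            await op_sem.acquire()
        try:
            async with self._global:
                self.active[op] = self.active.get(op, 0) + 1
                self.peak[op] = max(self.peak.get(op, 0), self.active[op])
                await asyncio.sleep(seconds)  # stand-in for the LLM request
                self.active[op] -= 1
        finally:
            if op_sem is not None:
                op_sem.release()


async def demo() -> dict:
    # Mirrors the example above: global=4, retain=1, consolidation=1.
    caps = ConcurrencyCaps(global_cap=4, per_op={"retain": 1, "consolidation": 1})
    calls = [caps.run("retain") for _ in range(5)] + [caps.run("reflect") for _ in range(5)]
    await asyncio.gather(*calls)
    return caps.peak


peak = asyncio.run(demo())
print(peak)  # retain never exceeds 1 concurrent call; reflect is bounded by the global cap
```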

### Embeddings

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_EMBEDDINGS_PROVIDER` | Provider: `local`, `onnx`, `tei`, `openai`, `openai-codex`, `openrouter`, `requesty`, `cohere`, `google`, `zeroentropy`, `litellm`, or `litellm-sdk` | `local` |
| `HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL` | Model for local provider | `BAAI/bge-small-en-v1.5` |
| `HINDSIGHT_API_EMBEDDINGS_LOCAL_TRUST_REMOTE_CODE` | Allow loading models with custom code (security risk, disabled by default) | `false` |
| `HINDSIGHT_API_EMBEDDINGS_LOCAL_FORCE_CPU` | Force CPU mode for local embeddings (avoids MPS/XPC issues on macOS) | `false` |
| `HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID` | Hugging Face model repo for the ONNX provider. Used for auto-download and as the tokenizer fallback. | `intfloat/multilingual-e5-small` |
| `HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_PATH` | Local path to the ONNX graph. When unset, Hindsight downloads `HINDSIGHT_API_EMBEDDINGS_ONNX_FILE` from `HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID`. | - |
| `HINDSIGHT_API_EMBEDDINGS_ONNX_TOKENIZER_NAME_OR_PATH` | Hugging Face tokenizer repo or local tokenizer directory. Set this when using `HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_PATH`. | Falls back to `HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID` |
| `HINDSIGHT_API_EMBEDDINGS_ONNX_FILE` | ONNX file path inside the Hugging Face repo. Hindsight also downloads the conventional external-data sidecar with `_data` suffix when present. | `onnx/model.onnx` |
| `HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS` | Expected embedding dimensions. Startup fails if the loaded model returns a different size. | Auto-detected |
| `HINDSIGHT_API_EMBEDDINGS_ONNX_MAX_TOKENS` | Max tokenizer length for ONNX embeddings. | `512` |
| `HINDSIGHT_API_EMBEDDINGS_ONNX_POOLING` | Pooling strategy for token embeddings: `mean` or `cls`. Ignored when the ONNX graph returns a pre-pooled 2-D embedding output. | `mean` |
| `HINDSIGHT_API_EMBEDDINGS_ONNX_NORMALIZE` | L2-normalize ONNX vectors before storage. | `true` |
| `HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX` | Prefix applied to query/search text before ONNX embedding. Keep `query: ` for E5 models; set to empty for non-E5 models such as MiniLM or BGE. | `query: ` |
| `HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX` | Prefix applied to stored memory/document text before ONNX embedding. Keep `passage: ` for E5 models; set to empty for non-E5 models such as MiniLM or BGE. | `passage: ` |
| `HINDSIGHT_API_EMBEDDINGS_ONNX_OUTPUT_NAME` | Optional ONNX output name to request when an exported graph exposes a pooled embedding output. | - |
| `HINDSIGHT_API_EMBEDDINGS_TEI_URL` | TEI server URL | - |
| `HINDSIGHT_API_EMBEDDINGS_OPENAI_API_KEY` | OpenAI API key (falls back to `HINDSIGHT_API_LLM_API_KEY`) | - |
| `HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL` | OpenAI embedding model | `text-embedding-3-small` |
| `HINDSIGHT_API_EMBEDDINGS_OPENAI_BASE_URL` | Custom base URL for OpenAI-compatible API (e.g., Azure OpenAI) | - |
| `HINDSIGHT_API_EMBEDDINGS_OPENAI_BATCH_SIZE` | Max inputs per `embeddings.create` call for `openai`/`openrouter` providers — lower this when the upstream endpoint enforces stricter limits (e.g. DashScope caps at 10) | `100` |
| `HINDSIGHT_API_EMBEDDINGS_OPENAI_DIMENSIONS` | Optional requested output dimensions for OpenAI `text-embedding-3` models (e.g., `384` to match an existing pgvector schema) | - |
| `HINDSIGHT_API_EMBEDDINGS_OPENROUTER_API_KEY` | OpenRouter API key for embeddings (falls back to `HINDSIGHT_API_OPENROUTER_API_KEY`, then `HINDSIGHT_API_LLM_API_KEY`) | - |
| `HINDSIGHT_API_EMBEDDINGS_OPENROUTER_MODEL` | OpenRouter embedding model | `perplexity/pplx-embed-v1-0.6b` |
| `HINDSIGHT_API_EMBEDDINGS_REQUESTY_API_KEY` | Requesty API key for embeddings (falls back to `HINDSIGHT_API_REQUESTY_API_KEY`, then `HINDSIGHT_API_LLM_API_KEY`) | - |
| `HINDSIGHT_API_EMBEDDINGS_REQUESTY_MODEL` | Requesty embedding model | `openai/text-embedding-3-small` |
| `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_API_KEY` | ZeroEntropy API key for embeddings | - |
| `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_MODEL` | ZeroEntropy embedding model | `zembed-1` |
| `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_BASE_URL` | Custom base URL for ZeroEntropy-compatible API | `https://api.zeroentropy.dev` |
| `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_DIMENSIONS` | Output dimensions for `zembed-1`. Supported values: `2560`, `1280`, `640`, `320`, `160`, `80`, `40` | `1280` |
| `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_ENCODING_FORMAT` | Response encoding: `float` or `base64`. Hindsight decodes either format to float vectors before storage. | `float` |
| `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_BATCH_SIZE` | Max inputs per ZeroEntropy embed request | `100` |
| `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_LATENCY` | Optional latency mode: `fast` or `slow`. Leave unset to use ZeroEntropy's default routing. | - |
| `HINDSIGHT_API_EMBEDDINGS_COHERE_API_KEY` | Cohere API key for embeddings (falls back to `HINDSIGHT_API_COHERE_API_KEY`) | - |
| `HINDSIGHT_API_EMBEDDINGS_COHERE_MODEL` | Cohere embedding model | `embed-english-v3.0` |
| `HINDSIGHT_API_EMBEDDINGS_COHERE_BASE_URL` | Custom base URL for Cohere-compatible API (e.g., Azure-hosted) | - |
| `HINDSIGHT_API_EMBEDDINGS_COHERE_OUTPUT_DIMENSIONS` | Output embedding dimensions for Cohere (e.g., `256`, `512`, `1024`). When set, overrides the model's default dimension. | - |
| `HINDSIGHT_API_EMBEDDINGS_LITELLM_API_BASE` | LiteLLM proxy base URL for embeddings (falls back to `HINDSIGHT_API_LITELLM_API_BASE`) | `http://localhost:4000` |
| `HINDSIGHT_API_EMBEDDINGS_LITELLM_API_KEY` | LiteLLM proxy API key for embeddings (optional, depends on proxy config; falls back to `HINDSIGHT_API_LITELLM_API_KEY`) | - |
| `HINDSIGHT_API_EMBEDDINGS_LITELLM_MODEL` | LiteLLM embedding model (use provider prefix, e.g., `cohere/embed-english-v3.0`) | `text-embedding-3-small` |
| `HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_API_KEY` | LiteLLM SDK API key for direct embedding provider access (optional — omit for providers that use ambient credentials, e.g. AWS Bedrock with IAM) | - |
| `HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_MODEL` | LiteLLM SDK embedding model (use provider prefix, e.g., `cohere/embed-english-v3.0`) | `cohere/embed-english-v3.0` |
| `HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_API_BASE` | Custom base URL for LiteLLM SDK embeddings (optional) | - |
| `HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_OUTPUT_DIMENSIONS` | Optional output embedding dimensions (provider-dependent, e.g., `768` for Gemini embedding models) | - |
| `HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_ENCODING_FORMAT` | Encoding format for embedding responses. Set to empty string to omit the parameter (needed for Voyage AI, Gemini). | `float` |
| `HINDSIGHT_API_EMBEDDINGS_GEMINI_API_KEY` | Gemini API key for embeddings (falls back to `HINDSIGHT_API_LLM_API_KEY`) | - |
| `HINDSIGHT_API_EMBEDDINGS_GEMINI_MODEL` | Gemini embedding model. The `gemini-embedding-2` family (e.g. `gemini-embedding-2-preview`) is supported on both the Gemini API and Vertex AI — because these multimodal models aggregate a multi-input request into one embedding, Hindsight automatically embeds one input per call to keep per-fact vectors. | `gemini-embedding-001` |
| `HINDSIGHT_API_EMBEDDINGS_GEMINI_OUTPUT_DIMENSIONALITY` | Output embedding dimensions (Gemini supports configurable dimensionality) | `768` |
| `HINDSIGHT_API_EMBEDDINGS_GEMINI_FORCE_IPV4` | Force the Gemini embeddings client to use an IPv4-only HTTP transport. Useful in environments where IPv6 egress is broken (e.g. some Docker/VPC setups) and AAAA DNS records cause long hangs. | `false` |
| `HINDSIGHT_API_EMBEDDINGS_VERTEXAI_PROJECT_ID` | Vertex AI project ID for embeddings (falls back to `HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID`) | - |
| `HINDSIGHT_API_EMBEDDINGS_VERTEXAI_REGION` | Vertex AI region for embeddings (falls back to `HINDSIGHT_API_LLM_VERTEXAI_REGION`) | - |
| `HINDSIGHT_API_EMBEDDINGS_VERTEXAI_SERVICE_ACCOUNT_KEY` | Service account key for Vertex AI embeddings (falls back to `HINDSIGHT_API_LLM_VERTEXAI_SERVICE_ACCOUNT_KEY`) | - |

Embedding provider selection, credentials, base URLs, model choices, dimensions, encoding format, batch sizes, and latency modes are static server-level settings. They are not hierarchical per-bank overrides. The ONNX settings above are also static, matching the existing `embeddings_local_*` settings.
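
For readers unfamiliar with the `ONNX_POOLING` and `ONNX_NORMALIZE` options in the table above, the sketch below shows what mean pooling, CLS pooling, and L2 normalization compute on a toy input. It uses plain Python lists instead of ONNX Runtime tensors, the function names are hypothetical, and skipping padding tokens via an attention mask is an assumption about how mean pooling is typically done rather than a statement about Hindsight's implementation.

```python
import math


def mean_pool(token_vectors: list, attention_mask: list) -> list:
    """Average token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dims = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]


def cls_pool(token_vectors: list) -> list:
    """Use the first ([CLS]) token's vector as the sentence embedding."""
    return list(token_vectors[0])


def l2_normalize(vector: list) -> list:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


tokens = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]  # last row is padding
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)  # [1.5, 2.0]
unit = l2_normalize(pooled)       # [0.6, 0.8]
print(pooled, unit, cls_pool(tokens))
```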

#### Local ONNX embeddings

The ONNX provider runs embedding models in-process with ONNX Runtime. Install the optional deps when building your own API environment:

```bash
pip install 'hindsight-api-slim[local-onnx]'
# or, in this repository:
uv sync --project hindsight-api-slim --extra local-onnx
```

You can either let Hindsight download the model from Hugging Face at startup by setting `HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID`, or pre-download the ONNX graph and tokenizer files under the Hindsight repository root.

```bash
cd /path/to/hindsight
mkdir -p models

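# Export both variables so the Python heredoc below can read them from os.environ
export MODEL_ID=intfloat/multilingual-e5-small
export MODEL_DIR=models/intfloat__multilingual-e5-small

uv run --project hindsight-api-slim --extra local-onnx python - <<'PY'
import os

from huggingface_hub import snapshot_download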

snapshot_download(
    repo_id=os.environ["MODEL_ID"],
    local_dir=os.environ["MODEL_DIR"],
    allow_patterns=[
        "onnx/model.onnx",
        "onnx/model.onnx_data",
        "*.json",
        "*.txt",
        "*.model",
    ],
)
PY
```

Then start Hindsight with paths relative to the repository root:

```bash
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=onnx
export HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_PATH=./models/intfloat__multilingual-e5-small/onnx/model.onnx
export HINDSIGHT_API_EMBEDDINGS_ONNX_TOKENIZER_NAME_OR_PATH=./models/intfloat__multilingual-e5-small
export HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS=384
export HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX="query: "
export HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX="passage: "
```

For Docker deployments, mount the same model directory and use container paths:

```yaml
services:
  hindsight:
    volumes:
      - ./models:/app/models:ro
    environment:
      HINDSIGHT_API_EMBEDDINGS_PROVIDER: onnx
      HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_PATH: /app/models/intfloat__multilingual-e5-small/onnx/model.onnx
      HINDSIGHT_API_EMBEDDINGS_ONNX_TOKENIZER_NAME_OR_PATH: /app/models/intfloat__multilingual-e5-small
      HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS: "384"
      HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX: "query: "
      HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX: "passage: "
```

Model-specific examples:

```bash
# sentence-transformers/all-MiniLM-L6-v2: 384 dimensions, no E5 prefixes
export HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID=sentence-transformers/all-MiniLM-L6-v2
export HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS=384
export HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX=""
export HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX=""

# intfloat/multilingual-e5-small: 384 dimensions, keep E5 prefixes
export HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID=intfloat/multilingual-e5-small
export HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS=384
export HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX="query: "
export HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX="passage: "

# sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2: 384 dimensions, no E5 prefixes
export HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
export HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS=384
export HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX=""
export HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX=""

# BAAI/bge-m3: 1024 dimensions, no E5 prefixes; keep onnx/model.onnx_data next to model.onnx
export HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID=BAAI/bge-m3
export HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS=1024
export HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX=""
export HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX=""
```

:::warning
Do not mix embeddings from different models in the same vector index. Switching from `local` to `onnx`, or changing ONNX models, requires re-embedding existing memories/documents even when the vector dimensions happen to match. For example, `BAAI/bge-small-en-v1.5` and `intfloat/multilingual-e5-small` both produce 384-dimensional vectors, but their embedding spaces are not semantically comparable.
:::

:::warning
The default ONNX query/passage prefixes (`query: ` and `passage: `) are for E5 models. Clear both prefix variables for non-E5 models such as MiniLM or BGE, otherwise Hindsight will prepend E5-style text to models that were not trained with that format.
:::
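
To make the effect of the prefix variables concrete, here is a minimal sketch of prefixing text before it is embedded. The helper `apply_prefix` is hypothetical and only illustrates the resulting strings; it is not Hindsight's internal code.

```python
def apply_prefix(texts, prefix):
    """Prepend a prefix to every text; an empty prefix leaves texts unchanged."""
    return [prefix + text for text in texts]


# E5-style models: queries and passages carry different prefixes.
e5_query = apply_prefix(["where does Alice work?"], "query: ")
e5_passage = apply_prefix(["Alice joined Google."], "passage: ")

# Non-E5 models (MiniLM, BGE): both prefix variables cleared to "".
plain_query = apply_prefix(["where does Alice work?"], "")
```

With the prefixes cleared, the model receives the text exactly as stored, which is what models trained without the E5 format expect.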

#### Common Pitfall: Provider-Specific Embedding Env Var Names

Embedding environment variables include a provider segment in the key name:

`HINDSIGHT_API_EMBEDDINGS_{PROVIDER}_{PARAMETER}`

For example, when `HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai`:

| Wrong | Correct |
|---|---|
| `HINDSIGHT_API_EMBEDDINGS_BASE_URL` | `HINDSIGHT_API_EMBEDDINGS_OPENAI_BASE_URL` |
| `HINDSIGHT_API_EMBEDDINGS_MODEL` | `HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL` |
| `HINDSIGHT_API_EMBEDDINGS_API_KEY` | `HINDSIGHT_API_EMBEDDINGS_OPENAI_API_KEY` |

This differs from LLM variables, which follow `HINDSIGHT_API_LLM_{PARAMETER}` without a provider segment.

:::warning
If embedding keys are misnamed, Hindsight may fall back to default OpenAI embedding settings (for example, `text-embedding-3-small`) and fail with auth errors against the wrong endpoint.
:::
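
A quick pre-flight check can catch the misnamed keys from the table above before startup. The sketch below is an illustration only: the function name `find_misnamed_embedding_keys` is hypothetical, and the suggested replacement is a hint, since parameter names vary between providers.

```python
import os

PREFIX = "HINDSIGHT_API_EMBEDDINGS_"
# Unprefixed parameters from the "Wrong" column of the table above.
GENERIC_PARAMETERS = ("BASE_URL", "MODEL", "API_KEY")


def find_misnamed_embedding_keys(env, provider_segment="OPENAI"):
    """Return {wrong_key: suggested_key} for keys that lack a provider segment."""
    suggestions = {}
    for parameter in GENERIC_PARAMETERS:
        wrong_key = PREFIX + parameter
        if wrong_key in env:
            suggestions[wrong_key] = f"{PREFIX}{provider_segment}_{parameter}"
    return suggestions


# Inspect the current process environment and report likely mistakes.
for wrong, suggested in find_misnamed_embedding_keys(os.environ).items():
    print(f"{wrong} looks misnamed; did you mean {suggested}?")
```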

#### DeepSeek and Embeddings

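DeepSeek is supported as an **LLM** provider, but it does **not** expose an embeddings endpoint. If your LLM is DeepSeek, use a different embedding provider (for example `local`, `openai`, `cohere`, or `google`).

The following block lists example settings for each embedding provider; use the one that matches your setup: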

```bash
# Local (default) - uses SentenceTransformers
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=local
export HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL=BAAI/bge-small-en-v1.5

# Local with custom model requiring trust_remote_code
# WARNING: Only enable trust_remote_code for models you trust (security risk)
# export HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL=your-custom-model
# export HINDSIGHT_API_EMBEDDINGS_LOCAL_TRUST_REMOTE_CODE=true

# OpenAI - cloud-based embeddings
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai
export HINDSIGHT_API_EMBEDDINGS_OPENAI_API_KEY=your-openai-api-key  # or reuses HINDSIGHT_API_LLM_API_KEY
export HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL=text-embedding-3-small  # 1536 dimensions by default
# export HINDSIGHT_API_EMBEDDINGS_OPENAI_DIMENSIONS=384  # optional reduced output size

# OpenAI Codex OAuth - uses existing ChatGPT/Codex login, no API key needed
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai-codex
export HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL=text-embedding-3-small  # 1536 dimensions by default
# export HINDSIGHT_API_EMBEDDINGS_OPENAI_DIMENSIONS=384  # optional reduced output size

# Azure OpenAI - embeddings via Azure endpoint
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai
export HINDSIGHT_API_EMBEDDINGS_OPENAI_API_KEY=your-azure-api-key
export HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL=text-embedding-3-small
export HINDSIGHT_API_EMBEDDINGS_OPENAI_BASE_URL=https://your-resource.openai.azure.com/openai/deployments/your-deployment

# TEI - HuggingFace Text Embeddings Inference (recommended for production)
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=tei
export HINDSIGHT_API_EMBEDDINGS_TEI_URL=http://localhost:8080

# OpenRouter - access 100+ embedding models
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=openrouter
export HINDSIGHT_API_EMBEDDINGS_OPENROUTER_API_KEY=your-openrouter-api-key  # or reuses HINDSIGHT_API_LLM_API_KEY
export HINDSIGHT_API_EMBEDDINGS_OPENROUTER_MODEL=perplexity/pplx-embed-v1-0.6b

# ZeroEntropy - zembed-1
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=zeroentropy
export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_API_KEY=your-api-key
export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_MODEL=zembed-1
export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_DIMENSIONS=1280
# export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_ENCODING_FORMAT=base64  # optional
# export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_LATENCY=fast  # optional

# Cohere - cloud-based embeddings
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=cohere
export HINDSIGHT_API_EMBEDDINGS_COHERE_API_KEY=your-api-key
export HINDSIGHT_API_EMBEDDINGS_COHERE_MODEL=embed-english-v3.0  # 1024 dimensions
# Optional: override output dimensions (for Matryoshka-capable models)
# export HINDSIGHT_API_EMBEDDINGS_COHERE_OUTPUT_DIMENSIONS=512

# Azure-hosted Cohere - embeddings via custom endpoint
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=cohere
export HINDSIGHT_API_EMBEDDINGS_COHERE_API_KEY=your-azure-api-key
export HINDSIGHT_API_EMBEDDINGS_COHERE_MODEL=embed-english-v3.0
export HINDSIGHT_API_EMBEDDINGS_COHERE_BASE_URL=https://your-azure-cohere-endpoint.com

# LiteLLM proxy - unified gateway for multiple providers
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=litellm
export HINDSIGHT_API_EMBEDDINGS_LITELLM_API_BASE=http://localhost:4000
export HINDSIGHT_API_EMBEDDINGS_LITELLM_API_KEY=your-litellm-key  # optional
export HINDSIGHT_API_EMBEDDINGS_LITELLM_MODEL=text-embedding-3-small  # or cohere/embed-english-v3.0

# Google - Gemini API (API key auth)
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=google
export HINDSIGHT_API_EMBEDDINGS_GEMINI_API_KEY=xxxxxxxxxxxx  # or reuses HINDSIGHT_API_LLM_API_KEY
export HINDSIGHT_API_EMBEDDINGS_GEMINI_MODEL=gemini-embedding-001  # 768 dimensions (default)
# export HINDSIGHT_API_EMBEDDINGS_GEMINI_OUTPUT_DIMENSIONALITY=768  # configurable: 256, 512, 768, 1024, etc.

# Google - Vertex AI auth (auto-detected when project ID is set)
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=google
export HINDSIGHT_API_EMBEDDINGS_GEMINI_MODEL=gemini-embedding-001
export HINDSIGHT_API_EMBEDDINGS_VERTEXAI_PROJECT_ID=your-gcp-project-id  # falls back to HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID
# export HINDSIGHT_API_EMBEDDINGS_VERTEXAI_REGION=us-central1  # falls back to HINDSIGHT_API_LLM_VERTEXAI_REGION
# export HINDSIGHT_API_EMBEDDINGS_VERTEXAI_SERVICE_ACCOUNT_KEY=/path/to/key.json  # falls back to LLM config, or uses ADC

# LiteLLM SDK - direct API access without proxy server (recommended)
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=litellm-sdk
export HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_API_KEY=your-provider-api-key
export HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_MODEL=cohere/embed-english-v3.0
# Optional: request a specific output dimension when the provider supports it
# export HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_OUTPUT_DIMENSIONS=768

# Supported LiteLLM SDK embedding providers:
# - cohere/embed-english-v3.0 (1024 dimensions)
# - openai/text-embedding-3-small (1536 dimensions)
# - together_ai/togethercomputer/m2-bert-80M-8k-retrieval
# - huggingface/sentence-transformers/all-MiniLM-L6-v2
# - voyage/voyage-2
```

#### Embedding Dimensions

Hindsight automatically detects the embedding dimension from the model at startup and adjusts the database schema accordingly. The default model (`BAAI/bge-small-en-v1.5`) produces 384-dimensional vectors, while OpenAI models produce 1536 or 3072 dimensions.

For `litellm-sdk`, if you set `HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_OUTPUT_DIMENSIONS`, startup uses that output size when the underlying provider supports LiteLLM's `dimensions` parameter (otherwise behavior is unchanged). The same dimension-change rules below apply.

For `zeroentropy`, zembed-1 supports `2560`, `1280`, `640`, `320`, `160`, `80`, and `40` dimensions. ZeroEntropy's API default is `2560`; Hindsight defaults to `1280` so the provider works with the default pgvector HNSW index. Use `2560` with a vector extension that supports higher-dimensional indexes, such as DiskANN/pgvectorscale or ScaNN.
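
The dimension choice for `zembed-1` can be checked up front. The sketch below validates a requested size against the supported list and reports whether it fits an ordinary pgvector HNSW index. The helper name is hypothetical, and the 2000-dimension ceiling is pgvector's documented index limit for the `vector` type, not a Hindsight setting.

```python
# Output sizes zembed-1 supports, as listed above.
ZEMBED_1_DIMENSIONS = (2560, 1280, 640, 320, 160, 80, 40)
# pgvector indexes on the `vector` type are limited to 2000 dimensions.
PGVECTOR_INDEX_MAX_DIMENSIONS = 2000


def fits_default_pgvector_index(dimensions):
    """Validate a zembed-1 dimension and report whether pgvector HNSW can index it."""
    if dimensions not in ZEMBED_1_DIMENSIONS:
        raise ValueError(f"zembed-1 does not support {dimensions} dimensions")
    return dimensions <= PGVECTOR_INDEX_MAX_DIMENSIONS
```

Under these assumptions `1280` fits the default index while `2560` does not, which matches the guidance to pair `2560` with an extension that supports higher-dimensional indexes.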

:::warning Dimension Changes
Once memories are stored, you cannot change the embedding dimension without losing data. If you need to switch to a model with different dimensions:

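1. **Empty database**: The schema is adjusted automatically on startup.
2. **Existing data**: Either delete all memories first, or use a model with matching dimensions. Matching dimensions only avoid the schema change; as noted above, vectors produced by a different model still need to be re-embedded.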

Supported OpenAI embedding dimensions:
- `text-embedding-3-small`: 1536 dimensions
- `text-embedding-3-large`: 3072 dimensions
- `text-embedding-ada-002`: 1536 dimensions (legacy)

Google's `gemini-embedding-001` produces 3072 dimensions natively but supports configurable output dimensionality. Set `HINDSIGHT_API_EMBEDDINGS_GEMINI_OUTPUT_DIMENSIONALITY` to control the output size (default: 768).

ZeroEntropy's `zembed-1` supports Matryoshka dimensions: `2560`, `1280`, `640`, `320`, `160`, `80`, and `40`. Hindsight defaults to `1280` for this provider.
:::

### Reranker

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_RERANKER_PROVIDER` | Provider: `local`, `tei`, `cohere`, `openrouter`, `zeroentropy`, `siliconflow`, `alibaba`, `google`, `flashrank`, `litellm`, `litellm-sdk`, `jina-mlx`, or `rrf` | `local` |
| `HINDSIGHT_API_RERANKER_LOCAL_MODEL` | Model for local provider | `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| `HINDSIGHT_API_RERANKER_LOCAL_MAX_CONCURRENT` | Max concurrent local reranking (prevents CPU thrashing under load) | `4` |
| `HINDSIGHT_API_RERANKER_LOCAL_TRUST_REMOTE_CODE` | Allow loading models with custom code (security risk, disabled by default) | `false` |
| `HINDSIGHT_API_RERANKER_LOCAL_FORCE_CPU` | Force CPU mode for local reranker (avoids MPS/XPC issues on macOS) | `false` |
| `HINDSIGHT_API_RERANKER_LOCAL_FP16` | Half-precision (FP16) inference for the local reranker. 27–36% faster on MPS; quality-identical. Disabled by default to avoid regressions on non-MPS deployments — some CPUs lack native FP16 support. | `false` |
| `HINDSIGHT_API_RERANKER_LOCAL_BUCKET_BATCHING` | Sort pairs by token length before batching to reduce padding waste. 36–54% faster across models; quality-identical by construction. | `false` |
| `HINDSIGHT_API_RERANKER_LOCAL_BATCH_SIZE` | Batch size for local reranker `predict()`. Optimal value varies by hardware and model (smaller batches can outperform larger ones on MPS). | `32` |
| `HINDSIGHT_API_RERANKER_TEI_URL` | TEI server URL | - |
| `HINDSIGHT_API_RERANKER_TEI_BATCH_SIZE` | Batch size for TEI reranking | `128` |
| `HINDSIGHT_API_RERANKER_TEI_MAX_CONCURRENT` | Max concurrent TEI reranking requests | `8` |
| `HINDSIGHT_API_RERANKER_TEI_HTTP_TIMEOUT` | HTTP request timeout for TEI reranker (seconds). Increase when using a slower CPU-based reranker under load. | `30.0` |
| `HINDSIGHT_API_RERANKER_OPENROUTER_API_KEY` | OpenRouter API key for reranking (falls back to `HINDSIGHT_API_OPENROUTER_API_KEY`, then `HINDSIGHT_API_LLM_API_KEY`) | - |
| `HINDSIGHT_API_RERANKER_OPENROUTER_MODEL` | OpenRouter rerank model | `cohere/rerank-v3.5` |
| `HINDSIGHT_API_RERANKER_OPENROUTER_TIMEOUT` | HTTP request timeout for OpenRouter reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_OPENROUTER_BASE_URL` | Rerank endpoint URL (point at a Cohere-compatible gateway/proxy for metering) | `https://openrouter.ai/api/v1/rerank` |
| `HINDSIGHT_API_RERANKER_COHERE_API_KEY` | Cohere API key for reranking (falls back to `HINDSIGHT_API_COHERE_API_KEY`) | - |
| `HINDSIGHT_API_RERANKER_COHERE_MODEL` | Cohere rerank model | `rerank-english-v3.0` |
| `HINDSIGHT_API_RERANKER_COHERE_BASE_URL` | Custom base URL for any Cohere-compatible `/rerank` endpoint (Azure AI Foundry, Jina, Voyage, self-hosted BGE, etc.). When set, the `cohere` provider bypasses the Cohere SDK and calls the endpoint directly via HTTP. | - |
| `HINDSIGHT_API_RERANKER_COHERE_TIMEOUT` | Request timeout for the Cohere reranker (seconds). Applies to both the native Cohere SDK and the Cohere-compatible HTTP path enabled by `HINDSIGHT_API_RERANKER_COHERE_BASE_URL`. | `60.0` |
| `HINDSIGHT_API_RERANKER_LITELLM_API_BASE` | LiteLLM proxy base URL for reranking (falls back to `HINDSIGHT_API_LITELLM_API_BASE`) | `http://localhost:4000` |
| `HINDSIGHT_API_RERANKER_LITELLM_API_KEY` | LiteLLM proxy API key for reranking (optional, depends on proxy config; falls back to `HINDSIGHT_API_LITELLM_API_KEY`) | - |
| `HINDSIGHT_API_RERANKER_LITELLM_MODEL` | LiteLLM **proxy** rerank model (use provider prefix, e.g., `cohere/rerank-english-v3.0`) | `cohere/rerank-english-v3.0` |
| `HINDSIGHT_API_RERANKER_LITELLM_TIMEOUT` | HTTP request timeout for the LiteLLM proxy reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_LITELLM_SDK_API_KEY` | LiteLLM **SDK** API key for direct reranking (no proxy needed) | - |
| `HINDSIGHT_API_RERANKER_LITELLM_SDK_MODEL` | LiteLLM SDK rerank model (e.g., `deepinfra/Qwen3-reranker-8B`) | `cohere/rerank-english-v3.0` |
| `HINDSIGHT_API_RERANKER_LITELLM_SDK_API_BASE` | Custom API base URL for LiteLLM SDK (optional) | - |
| `HINDSIGHT_API_RERANKER_LITELLM_SDK_TIMEOUT` | Request timeout for the LiteLLM SDK reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_LITELLM_MAX_TOKENS_PER_DOC` | Truncate documents to this many tokens before sending to the reranker (applies to both `litellm` and `litellm-sdk`). Use for models with small context windows (e.g. set to `900` for a 1024-token limit model). Unset by default (no truncation). | - |
| `HINDSIGHT_API_RERANKER_ZEROENTROPY_API_KEY` | ZeroEntropy API key for reranking | - |
| `HINDSIGHT_API_RERANKER_ZEROENTROPY_MODEL` | ZeroEntropy rerank model (`zerank-2`, `zerank-2-small`) | `zerank-2` |
| `HINDSIGHT_API_RERANKER_ZEROENTROPY_BASE_URL` | Custom base URL for ZeroEntropy-compatible API (e.g., mock server, proxy, or self-hosted deployment) | `https://api.zeroentropy.dev` |
| `HINDSIGHT_API_RERANKER_ZEROENTROPY_TIMEOUT` | HTTP request timeout for ZeroEntropy reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_SILICONFLOW_API_KEY` | SiliconFlow API key for reranking | - |
| `HINDSIGHT_API_RERANKER_SILICONFLOW_MODEL` | SiliconFlow rerank model (e.g., `BAAI/bge-reranker-v2-m3`) | `BAAI/bge-reranker-v2-m3` |
| `HINDSIGHT_API_RERANKER_SILICONFLOW_BASE_URL` | Base URL for the SiliconFlow `/rerank` endpoint | `https://api.siliconflow.cn/v1` |
| `HINDSIGHT_API_RERANKER_SILICONFLOW_TIMEOUT` | HTTP request timeout for SiliconFlow reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_ALIBABA_API_KEY` | Alibaba Cloud DashScope API key for reranking | - |
| `HINDSIGHT_API_RERANKER_ALIBABA_MODEL` | DashScope rerank model | `qwen3-rerank` |
| `HINDSIGHT_API_RERANKER_ALIBABA_TIMEOUT` | HTTP request timeout for the Alibaba Cloud DashScope reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_GOOGLE_PROJECT_ID` | Google Cloud project ID for Discovery Engine reranking (falls back to `HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID`) | - |
| `HINDSIGHT_API_RERANKER_GOOGLE_MODEL` | Google Discovery Engine ranking model | `semantic-ranker-default-004` |
| `HINDSIGHT_API_RERANKER_GOOGLE_SERVICE_ACCOUNT_KEY` | Path to service account JSON key (falls back to `HINDSIGHT_API_LLM_VERTEXAI_SERVICE_ACCOUNT_KEY`). If unset, uses ADC. | - |
| `HINDSIGHT_API_RERANKER_GOOGLE_TIMEOUT` | HTTP request timeout for Google Discovery Engine reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_FLASHRANK_MODEL` | FlashRank model for fast CPU-based reranking | `ms-marco-MiniLM-L-12-v2` |
| `HINDSIGHT_API_RERANKER_FLASHRANK_CACHE_DIR` | Cache directory for FlashRank models | System default |
| `HINDSIGHT_API_RERANKER_FLASHRANK_CPU_MEM_ARENA` | Enable ONNX Runtime CPU memory arena for FlashRank. When `true`, ONNX pre-allocates a memory arena that never shrinks, causing RSS to grow monotonically. `false` trades slightly slower per-call allocation for bounded RSS. | `false` |
| `HINDSIGHT_API_RERANKER_JINA_MLX_MODEL_PATH` | Local path to downloaded `jina-reranker-v3-mlx` model (auto-downloads from HuggingFace if unset) | - |

```bash
# Local (default) - uses SentenceTransformers CrossEncoder
export HINDSIGHT_API_RERANKER_PROVIDER=local
export HINDSIGHT_API_RERANKER_LOCAL_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2

# Local with custom model requiring trust_remote_code (e.g., jina-reranker-v2)
# WARNING: Only enable trust_remote_code for models you trust (security risk)
export HINDSIGHT_API_RERANKER_PROVIDER=local
export HINDSIGHT_API_RERANKER_LOCAL_MODEL=jinaai/jina-reranker-v2-base-multilingual
export HINDSIGHT_API_RERANKER_LOCAL_TRUST_REMOTE_CODE=true

# TEI - for high-performance inference
export HINDSIGHT_API_RERANKER_PROVIDER=tei
export HINDSIGHT_API_RERANKER_TEI_URL=http://localhost:8081

# OpenRouter - access reranking models via OpenRouter
export HINDSIGHT_API_RERANKER_PROVIDER=openrouter
export HINDSIGHT_API_RERANKER_OPENROUTER_API_KEY=your-openrouter-api-key  # or reuses HINDSIGHT_API_LLM_API_KEY
export HINDSIGHT_API_RERANKER_OPENROUTER_MODEL=cohere/rerank-v3.5

# Cohere - cloud-based reranking
export HINDSIGHT_API_RERANKER_PROVIDER=cohere
export HINDSIGHT_API_RERANKER_COHERE_API_KEY=your-api-key
export HINDSIGHT_API_RERANKER_COHERE_MODEL=rerank-english-v3.0

# Any Cohere-compatible /rerank endpoint (Azure AI Foundry, Jina, Voyage, self-hosted BGE, etc.)
#
# Setting HINDSIGHT_API_RERANKER_COHERE_BASE_URL switches the `cohere` provider
# off the Cohere SDK and onto a plain HTTP client that speaks the standard
# Cohere rerank wire format:
#   Request:  POST {base_url}  (or {base_url}/rerank, depending on host)
#             Authorization: Bearer <api_key>
#             {"model": "...", "query": "...", "documents": [...], "return_documents": false}
#   Response: {"results": [{"index": 0, "relevance_score": 0.9}, ...]}
#
# Any service implementing this contract works here. For Azure AI Foundry the
# base_url is the full invoke URL; for SiliconFlow you can also use the
# dedicated `siliconflow` provider below.
export HINDSIGHT_API_RERANKER_PROVIDER=cohere
export HINDSIGHT_API_RERANKER_COHERE_API_KEY=your-api-key
export HINDSIGHT_API_RERANKER_COHERE_MODEL=rerank-english-v3.0  # whatever model the endpoint serves
export HINDSIGHT_API_RERANKER_COHERE_BASE_URL=https://your-cohere-compatible-endpoint.com

# ZeroEntropy - cloud-based reranking (state-of-the-art accuracy)
export HINDSIGHT_API_RERANKER_PROVIDER=zeroentropy
export HINDSIGHT_API_RERANKER_ZEROENTROPY_API_KEY=your-api-key
export HINDSIGHT_API_RERANKER_ZEROENTROPY_MODEL=zerank-2  # or zerank-2-small
# export HINDSIGHT_API_RERANKER_ZEROENTROPY_BASE_URL=https://your-custom-endpoint.com  # optional

# SiliconFlow - cloud reranking via SiliconFlow's Cohere-compatible /rerank endpoint
export HINDSIGHT_API_RERANKER_PROVIDER=siliconflow
export HINDSIGHT_API_RERANKER_SILICONFLOW_API_KEY=your-api-key
export HINDSIGHT_API_RERANKER_SILICONFLOW_MODEL=BAAI/bge-reranker-v2-m3
# export HINDSIGHT_API_RERANKER_SILICONFLOW_BASE_URL=https://api.siliconflow.cn/v1  # default

# Alibaba Cloud DashScope - qwen3-rerank via Cohere-compatible /reranks endpoint
export HINDSIGHT_API_RERANKER_PROVIDER=alibaba
export HINDSIGHT_API_RERANKER_ALIBABA_API_KEY=your-dashscope-api-key  # or set DASHSCOPE_API_KEY
export HINDSIGHT_API_RERANKER_ALIBABA_MODEL=qwen3-rerank  # default, can omit

# LiteLLM proxy - unified gateway for multiple reranking providers (requires running LiteLLM proxy server)
export HINDSIGHT_API_RERANKER_PROVIDER=litellm
export HINDSIGHT_API_RERANKER_LITELLM_API_BASE=http://localhost:4000
export HINDSIGHT_API_RERANKER_LITELLM_API_KEY=your-litellm-key  # optional
export HINDSIGHT_API_RERANKER_LITELLM_MODEL=cohere/rerank-english-v3.0  # or voyage/rerank-2, together_ai/...

# LiteLLM SDK - direct API access without proxy (recommended for simplicity)
export HINDSIGHT_API_RERANKER_PROVIDER=litellm-sdk
export HINDSIGHT_API_RERANKER_LITELLM_SDK_API_KEY=your-deepinfra-api-key
export HINDSIGHT_API_RERANKER_LITELLM_SDK_MODEL=deepinfra/Qwen3-reranker-8B  # or cohere/rerank-english-v3.0, etc.

# Google Discovery Engine - cloud-based semantic reranking
export HINDSIGHT_API_RERANKER_PROVIDER=google
export HINDSIGHT_API_RERANKER_GOOGLE_PROJECT_ID=your-gcp-project-id
export HINDSIGHT_API_RERANKER_GOOGLE_SERVICE_ACCOUNT_KEY=/path/to/service-account.json  # optional, uses ADC if unset
export HINDSIGHT_API_RERANKER_GOOGLE_MODEL=semantic-ranker-default-004  # or semantic-ranker-fast-004

# Jina MLX - Apple Silicon native reranking (no GPU/cloud required)
# Model (~1.2 GB) is downloaded automatically from HuggingFace Hub on first use.
export HINDSIGHT_API_RERANKER_PROVIDER=jina-mlx
```
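
The Cohere-compatible wire format described in the comments above can be exercised with a few lines of standard-library Python. This is a sketch for testing an endpoint by hand, not Hindsight's client: the function names are hypothetical, and the `Content-Type` header is an assumption that is not part of the documented contract.

```python
import json
import urllib.request


def build_rerank_request(base_url, api_key, model, query, documents):
    """Build the POST request with the documented JSON body and Bearer auth."""
    payload = {
        "model": model,
        "query": query,
        "documents": documents,
        "return_documents": False,
    }
    return urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumption: JSON endpoints expect this
        },
        method="POST",
    )


def rank_documents(response_body, documents):
    """Map each result index back to its document, highest relevance first."""
    results = json.loads(response_body)["results"]
    ordered = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in ordered]
```

Pass the request to `urllib.request.urlopen` and feed the response body to `rank_documents` to confirm that an endpoint speaks this format before pointing `HINDSIGHT_API_RERANKER_COHERE_BASE_URL` at it.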

#### LiteLLM Proxy vs SDK

- **`litellm`**: Requires running a separate LiteLLM proxy server. Good for centralized configuration, rate limiting, and caching.
- **`litellm-sdk`**: Direct API access without proxy. Simpler setup, lower latency, fewer infrastructure components.

Both support the same providers:
- **Cohere** (`cohere/rerank-english-v3.0`, `cohere/rerank-multilingual-v3.0`)
- **DeepInfra** (`deepinfra/Qwen3-reranker-8B`, `deepinfra/bge-reranker-v2-m3`)
- **Together AI** (`together_ai/Salesforce/Llama-Rank-V1`)
- **HuggingFace** (`huggingface/BAAI/bge-reranker-v2-m3`)
- **Voyage AI** (`voyage/rerank-2`)
- **Jina AI** (`jina_ai/jina-reranker-v2`)
- **AWS Bedrock** (`bedrock/...`)

#### Jina MLX (Apple Silicon)

The `jina-mlx` provider uses [`jinaai/jina-reranker-v3-mlx`](https://huggingface.co/jinaai/jina-reranker-v3-mlx), optimized for Apple Silicon. The model (~1.2 GB) is downloaded from HuggingFace Hub automatically on first startup and cached locally.

:::note License
`jina-reranker-v3-mlx` is licensed under CC BY-NC 4.0. Contact Jina AI for commercial usage.
:::

### Authentication

By default, Hindsight runs without authentication. For production deployments, enable API key authentication using the built-in tenant extension:

```bash
# Enable the built-in API key authentication
export HINDSIGHT_API_TENANT_EXTENSION=hindsight_api.extensions.builtin.tenant:ApiKeyTenantExtension
export HINDSIGHT_API_TENANT_API_KEY=your-secret-api-key
```

When enabled, all requests must include the API key in the `Authorization` header:

```bash
curl -H "Authorization: Bearer your-secret-api-key" \
  http://localhost:8888/v1/default/banks
```

Requests without a valid API key receive a `401 Unauthorized` response.
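
The same request can be made from Python. The sketch below uses only the standard library and mirrors the `curl` call above; the helper names `build_banks_request` and `fetch_banks` are hypothetical, and returning `None` on a 401 is just one possible way to handle a rejected key.

```python
import urllib.error
import urllib.request


def build_banks_request(api_key, base_url="http://localhost:8888"):
    """Build a GET request for the banks endpoint with the Bearer key attached."""
    return urllib.request.Request(
        f"{base_url}/v1/default/banks",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def fetch_banks(api_key, base_url="http://localhost:8888", timeout=10.0):
    """Send the request; return the raw response body, or None on 401 Unauthorized."""
    request = build_banks_request(api_key, base_url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        if error.code == 401:
            return None  # missing or invalid API key
        raise
```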

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_TENANT_EXTENSION` | Dotted path to the loaded tenant extension. Set to `hindsight_api.extensions.builtin.tenant:ApiKeyTenantExtension` to require an API key on every request. | *(none; auth disabled)* |
| `HINDSIGHT_API_TENANT_API_KEY` | Shared API key checked by the built-in API-key extension. Sent by clients as `Authorization: Bearer <key>`. | *(none)* |

If you are enabling Memory Defense, see `docs/developer/memory-defense/` for the policy schema, detector catalog, and audit trail.

:::tip Custom Authentication
For advanced authentication (JWT, OAuth, multi-tenant schemas), implement a custom `TenantExtension`. See the [Extensions documentation](./extensions.md) for details.
:::

### Server

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_HOST` | Bind address | `0.0.0.0` |
| `HINDSIGHT_API_PORT` | Server port | `8888` |
| `HINDSIGHT_API_BASE_PATH` | Base path for API when behind reverse proxy (e.g., `/hindsight`) | `""` (root) |
| `HINDSIGHT_API_WORKERS` | Number of uvicorn worker processes | `1` |
| `HINDSIGHT_API_ACCESS_LOG` | Enable uvicorn access log (`true`, `1`, `yes`, `on` to enable) | `false` |
| `HINDSIGHT_API_LOG_LEVEL` | Log level: `debug`, `info`, `warning`, `error` | `info` |
| `HINDSIGHT_API_LOG_FORMAT` | Log format: `text` or `json` (structured logging for cloud platforms) | `text` |
| `HINDSIGHT_API_LOG_JSON_FIELDS` | Comma-separated allowlist of JSON log fields to emit (e.g. `severity,message,tenant`). Available: `severity`, `message`, `timestamp`, `logger`, `tenant`, `exception`. Empty = all fields. | `""` (all) |
| `HINDSIGHT_API_MCP_ENABLED` | Enable MCP server at `/mcp/{bank_id}/` | `true` |
| `HINDSIGHT_API_MODEL_INIT_TIMEOUT` | Wall-clock cap (seconds) on startup model/connection initialization. If loading the embeddings model or the cross-encoder, or verifying the LLM, blocks (e.g. an offline model download or an unreachable provider), the server fails fast with a clear error instead of hanging forever. Increase if a legitimate first-time model download needs more time. | `300` |

### Retrieval

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_GRAPH_RETRIEVER` | Graph retrieval algorithm | `link_expansion` |
| `HINDSIGHT_API_LINK_EXPANSION_PER_ENTITY_LIMIT` | Max target units expanded per entity in `link_expansion` graph retrieval (LATERAL fanout cap per entity; bounds high-fanout entities). | `200` |
| `HINDSIGHT_API_LINK_EXPANSION_TIMEOUT` | Timeout (seconds) for the per-entity graph expansion query in `link_expansion` retrieval. | `10` |
| `HINDSIGHT_API_RECALL_MAX_CONCURRENT` | Max concurrent recall operations per worker (backpressure) | `32` |
| `HINDSIGHT_API_RECALL_CONNECTION_BUDGET` | Max concurrent DB connections per recall operation | `4` |
| `HINDSIGHT_API_RECALL_MAX_QUERY_TOKENS` | Maximum token length of a recall query; requests exceeding this limit are rejected with HTTP 400 | `500` |
| `HINDSIGHT_API_RERANKER_MAX_CANDIDATES` | Max candidates to rerank per recall (RRF pre-filters the rest) | `300` |
| `HINDSIGHT_API_SEMANTIC_MIN_SIMILARITY` | Minimum cosine similarity a candidate must reach to be returned by the semantic retrieval strategy. Must be between `0` and `1`. | `0.3` |
| `HINDSIGHT_API_BM25_MIN_SCORE` | Minimum BM25 score a row must exceed to enter fusion. Gates out zero-score, non-matching rows on backends (notably `vchord`) whose operator ranks every document instead of pre-filtering to query-term matches. `0` keeps only genuine term matches; raise it to require stronger matches. | `0` |
| `HINDSIGHT_API_RECALL_MAX_CANDIDATES_PER_SOURCE` | Cap on candidates each retrieval source (semantic, BM25, graph, temporal) contributes to RRF, applied before the global reranker cap. Prevents one over-expanding backend from filling the reranker budget on its own. `0` disables the cap. | `0` |
| `HINDSIGHT_API_RECALL_STRATEGY_BOOSTS` | Prioritise one or more retrieval sources over the others on recall, as a comma-separated `strategy:level` list (e.g. `graph:high` to strongly favour graph hits, or `graph:high,bm25:low`). Strategies: `semantic`, `bm25`, `graph`, `temporal`. Levels: `low` (gentle — mainly protects the source's candidates from being dropped before reranking), `medium` (moderate preference), `high` (strong — the source dominates the candidate pool and outranks most other matches, only a strong direct match still wins). The boost is applied in two places: before the reranker cap (so favoured candidates survive the `HINDSIGHT_API_RERANKER_MAX_CANDIDATES` budget) and after reranking (to nudge them up the final order); a named level is used because those two stages live on different score scales. Only the strategies you list are boosted — any you omit keep their normal weight (no implicit boost). A strategy written without a level (`graph` or `graph:`) defaults to `medium`. Empty disables the feature. | _(empty)_ |
| `HINDSIGHT_API_RECENCY_DECAY_FUNCTION` | Shape of the recency boost applied during reranking — how a memory's age is turned into a small freshness adjustment to its final rank. `linear` (default) decays in a straight line from full freshness (today) to a floor reached at `HINDSIGHT_API_RECENCY_DECAY_LINEAR_WINDOW_DAYS`. `exponential` decays by half-life: a memory is treated as neutral (no boost or penalty) at `HINDSIGHT_API_RECENCY_DECAY_HALFLIFE_DAYS`, younger memories are boosted and older ones penalised, with a smooth fade rather than a hard cutoff. `none` disables recency entirely (age never affects ranking). | `linear` |
| `HINDSIGHT_API_RECENCY_DECAY_LINEAR_WINDOW_DAYS` | For the `linear` decay function: the number of days over which a memory fades from full freshness to the minimum. Only used when `HINDSIGHT_API_RECENCY_DECAY_FUNCTION=linear`. | `365` |
| `HINDSIGHT_API_RECENCY_DECAY_HALFLIFE_DAYS` | For the `exponential` decay function: the age (in days) at which a memory is considered neutral — younger memories get a recency boost, older ones a penalty. Smaller values favour very recent memories more aggressively. Only used when `HINDSIGHT_API_RECENCY_DECAY_FUNCTION=exponential`. | `90` |
| `HINDSIGHT_API_MENTAL_MODEL_REFRESH_CONCURRENCY` | Max concurrent mental model refreshes | `8` |
| `HINDSIGHT_API_ENABLE_MENTAL_MODEL_HISTORY` | Track history of content changes to each mental model (previous content + timestamp), stored one row per change in the `mental_model_history` table. Set to `false` to disable entirely — no history rows are written, reducing storage if audit trails are not needed. **This is how you turn the feature off** (not a zero cap). | `true` |
| `HINDSIGHT_API_MENTAL_MODEL_HISTORY_MAX_ENTRIES` | Max history rows kept per mental model. On each refresh the previous version is inserted into the `mental_model_history` table and the oldest rows beyond this cap are deleted, so per-model history can't grow without bound. `0` or a negative value **removes the cap** (history then grows with every refresh — unbounded); to turn history off entirely set `HINDSIGHT_API_ENABLE_MENTAL_MODEL_HISTORY=false` instead. | `50` |
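
The `HINDSIGHT_API_RECALL_STRATEGY_BOOSTS` format described above can be validated before deployment with a small parser. This is a sketch of the documented syntax only: the function name is hypothetical, and the whitespace tolerance and the `ValueError` on unknown names are assumptions about how strict the real parser is.

```python
STRATEGIES = {"semantic", "bm25", "graph", "temporal"}
LEVELS = {"low", "medium", "high"}
DEFAULT_LEVEL = "medium"  # used when a strategy is written without a level


def parse_strategy_boosts(raw):
    """Parse 'strategy:level,...' into {strategy: level}; empty input means no boosts."""
    boosts = {}
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue
        strategy, _, level = entry.partition(":")
        strategy = strategy.strip()
        level = level.strip() or DEFAULT_LEVEL
        if strategy not in STRATEGIES:
            raise ValueError(f"unknown strategy: {strategy!r}")
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        boosts[strategy] = level
    return boosts
```

Strategies missing from the returned mapping keep their normal weight, matching the "no implicit boost" rule.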

#### Graph Retrieval Algorithm

- **`link_expansion`** (default): Fast graph expansion from semantic seeds via entity co-occurrence, semantic kNN, and causal links. Target latency under 100ms.

#### Recall budget mapping

The recall request takes a `budget` parameter (`low` / `mid` / `high`, default `mid`) that maps to an integer `thinking_budget` used by every retrieval method (semantic, BM25, graph, temporal). These knobs control that mapping. They are hierarchical — overridable per bank via the [config API](#hierarchical-configuration).

Two functions are available:

- **`fixed`** (default — preserves legacy behavior): `thinking_budget = recall_budget_fixed_<level>` (independent of `max_tokens`).
- **`adaptive`**: `thinking_budget = round(max_tokens * recall_budget_adaptive_<level>)`, clamped to `[recall_budget_min, recall_budget_max]`. Useful when callers vary `max_tokens` and you want retrieval breadth to scale with the requested output size.
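
The two mapping functions can be sketched as follows, using the default values documented for the variables in this section. The function name `thinking_budget` and the module-level constants are illustrative, not Hindsight's internal names.

```python
# Defaults for the HINDSIGHT_API_RECALL_BUDGET_* variables.
FIXED = {"low": 100, "mid": 300, "high": 1000}
ADAPTIVE = {"low": 0.025, "mid": 0.075, "high": 0.25}
BUDGET_MIN, BUDGET_MAX = 20, 2000


def thinking_budget(level, max_tokens, function="fixed"):
    """Map a recall budget level to an integer thinking_budget."""
    if function == "fixed":
        # Independent of max_tokens.
        return FIXED[level]
    if function == "adaptive":
        # Scale with the requested output size, then clamp to [min, max].
        scaled = round(max_tokens * ADAPTIVE[level])
        return max(BUDGET_MIN, min(BUDGET_MAX, scaled))
    raise ValueError(f"unknown budget function: {function!r}")
```

With these defaults, a `mid` request for 4096 tokens yields 300 under `fixed` and 307 under `adaptive`, while very small or very large `max_tokens` values are pinned to the floor and ceiling.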

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_RECALL_BUDGET_FUNCTION` | Mapping function: `fixed` or `adaptive`. | `fixed` |
| `HINDSIGHT_API_RECALL_BUDGET_FIXED_LOW` | Items per retrieval method per fact type when `budget=low` and function is `fixed`. | `100` |
| `HINDSIGHT_API_RECALL_BUDGET_FIXED_MID` | Items per retrieval method per fact type when `budget=mid` and function is `fixed`. | `300` |
| `HINDSIGHT_API_RECALL_BUDGET_FIXED_HIGH` | Items per retrieval method per fact type when `budget=high` and function is `fixed`. | `1000` |
| `HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_LOW` | Ratio of request `max_tokens` used when `budget=low` and function is `adaptive`. | `0.025` |
| `HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_MID` | Ratio of request `max_tokens` used when `budget=mid` and function is `adaptive`. | `0.075` |
| `HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_HIGH` | Ratio of request `max_tokens` used when `budget=high` and function is `adaptive`. | `0.25` |
| `HINDSIGHT_API_RECALL_BUDGET_MIN` | Lower bound of the clamp applied to the adaptive result. | `20` |
| `HINDSIGHT_API_RECALL_BUDGET_MAX` | Upper bound of the clamp applied to the adaptive result. | `2000` |

### Retain

Controls the retain (memory ingestion) pipeline.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS` | Max completion tokens for fact extraction LLM calls | `64000` |
| `HINDSIGHT_API_RETAIN_CHUNK_SIZE` | Max characters per chunk for fact extraction. Larger chunks require fewer LLM calls but may lose context. | `3000` |
| `HINDSIGHT_API_RETAIN_STRUCTURED_CHUNK_SIZE` | Max characters for a single JSONL line or conversation turn to keep whole. Unset uses `HINDSIGHT_API_RETAIN_CHUNK_SIZE`. Must be a positive integer when set. | - |
| `HINDSIGHT_API_RETAIN_EXTRACTION_MODE` | Fact extraction mode: `concise`, `verbose`, `verbatim`, `chunks`, or `custom` | `concise` |
| `HINDSIGHT_API_RETAIN_MISSION` | What this bank should pay attention to during extraction. Steers the LLM without replacing the extraction rules — works alongside any extraction mode. | - |
| `HINDSIGHT_API_RETAIN_CUSTOM_INSTRUCTIONS` | Full prompt override for fact extraction (only used when mode is `custom`). Replaces built-in extraction rules entirely. | - |
| `HINDSIGHT_API_RETAIN_EXTRACT_CAUSAL_LINKS` | Extract causal relationships between facts | `true` |
| `HINDSIGHT_API_RETAIN_BATCH_ENABLED` | Use LLM Batch API for fact extraction (50% cost savings, only with async operations) | `false` |
| `HINDSIGHT_API_RETAIN_MAX_CONCURRENT` | Max concurrent retain DB phases (HNSW reads + writes). Limits I/O contention during high-concurrency ingestion. | `4` |
| `HINDSIGHT_API_RETAIN_BATCH_TOKENS` | Max characters per sub-batch for async retain auto-splitting | `10000` |
| `HINDSIGHT_API_RETAIN_CHUNK_BATCH_SIZE` | Max chunks per streaming batch when retain ingests long documents. Each chunk produces roughly 17 facts, so the default 100 chunks ≈ 1700 facts per batch. Lower to cap memory/LLM pressure on large documents; raise for smaller chunks. Configurable per bank. | `100` |
| `HINDSIGHT_API_RETAIN_ENTITY_LOOKUP` | Entity lookup method during retain: `full` (exact match) or `trigram` (fuzzy trigram matching) | `trigram` |
| `HINDSIGHT_API_RETAIN_ENTITY_RESOLUTION_BATCH_SIZE` | Max unique entity names per fuzzy candidate lookup query (`trigram` on PG, `oracle_fuzzy` on Oracle). Bounds query size so very wide retain batches don't time out a single `unnest(...)` join on banks with many entities. | `100` |
| `HINDSIGHT_API_RETAIN_DEFAULT_STRATEGY` | Default retain strategy name. When set, all retain calls without an explicit `strategy` parameter use this strategy. | - |
| `HINDSIGHT_API_RETAIN_BATCH_POLL_INTERVAL_SECONDS` | Batch API polling interval in seconds | `60` |
| `HINDSIGHT_API_STORE_DOCUMENT_TEXT` | Persist the raw source text alongside extracted memories. Set to `false` to skip storing it. Static, server-level. | `true` |
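
As a rough sizing aid, the chunking defaults above can be turned into back-of-the-envelope numbers. The sketch below assumes every chunk is filled to the maximum size, which the real chunker may not do, and reuses the "roughly 17 facts per chunk" figure from the table. The function name `estimate_retain_load` is hypothetical.

```python
import math


def estimate_retain_load(
    text_chars: int,
    chunk_size: int = 3000,        # HINDSIGHT_API_RETAIN_CHUNK_SIZE default
    chunk_batch_size: int = 100,   # HINDSIGHT_API_RETAIN_CHUNK_BATCH_SIZE default
    facts_per_chunk: int = 17,     # approximate figure quoted in the table
) -> dict:
    """Estimate chunks, streaming batches, and facts for one document."""
    chunks = math.ceil(text_chars / chunk_size)
    batches = math.ceil(chunks / chunk_batch_size)
    return {
        "chunks": chunks,
        "batches": batches,
        "approx_facts": chunks * facts_per_chunk,
    }
```

Under these assumptions, a 1,000,000-character document comes out at about 334 chunks in 4 streaming batches. Treat the result as an order-of-magnitude estimate only.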

> **Batch-capable providers.** `HINDSIGHT_API_RETAIN_BATCH_ENABLED=true` only works with a retain LLM provider that implements a batch API: `openai`, `groq`, `gemini`, or `fireworks`. Any other provider fails fast at startup. Batch always requires async retain (`async=true`); a synchronous retain with batch enabled returns an error.
>
> **Gemini** uses the [Gemini Batch API](https://ai.google.dev/gemini-api/docs/batch-api) (flat 50% input + output discount, 24h SLA — typically minutes). It needs no extra settings beyond `HINDSIGHT_API_RETAIN_BATCH_ENABLED=true` and an API-key `gemini` provider; Vertex AI (`vertexai`) is not batch-capable.

#### Fireworks batch inference

Fireworks AI's batch API is **not** OpenAI `/v1/batches`-compatible — it is a proprietary, account-scoped dataset/job workflow on a separate control-plane host (`https://api.fireworks.ai`), distinct from the OpenAI-compatible inference host (`https://api.fireworks.ai/inference/v1`). Hindsight adapts it transparently, so enabling batch is the same as any other provider plus one required setting: your Fireworks **account id**. (This is separate from the existing LiteLLM `fireworks_ai/...` online path, which is unaffected.)

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_FIREWORKS_ACCOUNT_ID` | Fireworks account id. **Required** for `fireworks` batch retain — the control-plane endpoints are `/v1/accounts/{account_id}/...`. Static, server-level. | - |
| `HINDSIGHT_API_FIREWORKS_BATCH_BASE_URL` | Fireworks batch control-plane host. | `https://api.fireworks.ai` |
| `HINDSIGHT_API_FIREWORKS_BATCH_MAX_WAIT_SECONDS` | Max time to wait for a batch job before surfacing a failure. Guards against the Fireworks gotcha where a non-batch-eligible model leaves the job `PENDING` forever. | `86400` (24h) |

```bash
# Fireworks batch retain (50% cost savings, async only)
export HINDSIGHT_API_RETAIN_LLM_PROVIDER=fireworks
export HINDSIGHT_API_RETAIN_LLM_API_KEY=fw_xxxxxxxxxxxx
export HINDSIGHT_API_RETAIN_LLM_MODEL=accounts/fireworks/models/llama-v3p1-8b-instruct
export HINDSIGHT_API_FIREWORKS_ACCOUNT_ID=your-account-id
export HINDSIGHT_API_RETAIN_BATCH_ENABLED=true
```
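
Before starting the server, a small preflight check can catch the two batch misconfigurations described above: a provider without a batch API, and a missing Fireworks account id. This is a hypothetical helper, not part of Hindsight. It reads only the retain-specific provider variable shown in the example above and assumes no fallback to other variables.

```python
import os

# Providers listed above as batch-capable.
BATCH_PROVIDERS = {"openai", "groq", "gemini", "fireworks"}


def check_batch_config(env=None) -> list:
    """Return a list of human-readable problems; empty means the config looks consistent."""
    env = os.environ if env is None else env
    problems = []
    if env.get("HINDSIGHT_API_RETAIN_BATCH_ENABLED", "false").lower() != "true":
        return problems  # batch is off, nothing to validate
    provider = env.get("HINDSIGHT_API_RETAIN_LLM_PROVIDER", "")
    if provider not in BATCH_PROVIDERS:
        problems.append(f"provider {provider!r} is not batch-capable")
    if provider == "fireworks" and not env.get("HINDSIGHT_API_FIREWORKS_ACCOUNT_ID"):
        problems.append("HINDSIGHT_API_FIREWORKS_ACCOUNT_ID is required for fireworks batch retain")
    return problems
```

Calling `check_batch_config()` with no arguments inspects the current process environment; passing a dict makes the check easy to unit test.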

> **Entity labels** (`entity_labels`) and **free-form entity extraction** (`entities_allow_free_form`) are configured per bank via the [bank config API](/developer/api/memory-banks#retain-configuration), not as global environment variables — each bank can have its own controlled vocabulary. See [Entity Labels](/developer/retain#entity-labels) for details.

#### Skip storing raw document text

By default Hindsight keeps a verbatim copy of everything you retain so you can later read the source, re-process a document, or export it. For deployments that only want to keep the extracted memories (facts, entities, mental models) and not the source text, set:

```bash
export HINDSIGHT_API_STORE_DOCUMENT_TEXT=false
```

When disabled, the full retain pipeline still runs — chunking, fact extraction, embedding, and entity linking are unchanged, so **memory quality and recall are not affected** (recall reads from the extracted memories, never from the raw text). The difference is purely what gets persisted:

- `documents.original_text` is stored as `NULL` instead of the raw payload.
- The raw chunk text is dropped (stored as empty), while the chunk's content hash is still kept so incremental re-retain of the same document continues to deduplicate correctly.

**`update_mode="append"` is rejected when text storage is disabled.** Append rebuilds a document by reading back its previously stored text and adding to it. With nothing stored, appending would silently drop the prior content, so an append retain returns an error instead. Use `update_mode="replace"` (the default).

**Features that degrade when text storage is disabled** (because they read the source text back):

- **Document export** carries no source text; re-importing such a bank cannot re-run extraction from the original payload.
- **Reading a document's source** (the get-document, list-chunks, and get-chunk endpoints, including their MCP equivalents) returns empty content.
- **Recall with `include_chunks=true`** returns empty `chunk_text` — the facts themselves are unaffected, but the surrounding source-chunk context is no longer available.
- **Reflect** no longer offers the `expand` tool (which fetches a memory's source chunk/document), and its `recall` step stops attaching source chunks, since there is no stored text to return. Reflection over the extracted memories is otherwise unaffected.
- **Reprocessing a document** from its stored text is a no-op (there is nothing to reprocess).

This is a static, server-level setting and cannot be overridden per bank.

#### Customizing retain: when to use what

There are five levels of customization for the retain pipeline. Start with the simplest that covers your needs:

| Goal | Use |
|------|-----|
| Steer what topics to focus on or deprioritize | `HINDSIGHT_API_RETAIN_MISSION` |
| Extract more detail per fact | `HINDSIGHT_API_RETAIN_EXTRACTION_MODE=verbose` |
| Store chunks as-is, LLM extracts metadata | `HINDSIGHT_API_RETAIN_EXTRACTION_MODE=verbatim` |
| Store chunks as-is, zero LLM cost | `HINDSIGHT_API_RETAIN_EXTRACTION_MODE=chunks` |
| Completely replace the extraction rules | `HINDSIGHT_API_RETAIN_EXTRACTION_MODE=custom` + `HINDSIGHT_API_RETAIN_CUSTOM_INSTRUCTIONS` |

**`HINDSIGHT_API_RETAIN_MISSION` — steer extraction without replacing it (recommended starting point)**

Tell the bank what to pay attention to during extraction, in plain language. The mission is injected into the extraction prompt alongside the built-in rules, so it narrows focus without replacing the underlying logic. It works with every LLM-based extraction mode (`concise`, `verbose`, `verbatim`, `custom`) and is ignored in `chunks` mode, which makes no LLM call.

```bash
export HINDSIGHT_API_RETAIN_MISSION="Focus on technical decisions, architecture choices, and team member expertise. Deprioritize social or personal information."
```

**`HINDSIGHT_API_RETAIN_EXTRACTION_MODE=verbose` — more detail per fact**

Use when you need richer, more detailed facts that carry full context and relationships. Slower and uses more tokens than `concise`.

**`HINDSIGHT_API_RETAIN_EXTRACTION_MODE=verbatim` — store chunks as-is**

Each chunk is stored as a single memory unit with its original text preserved exactly — no summarization or rewriting. The LLM still runs to extract entities, temporal information, and location so the chunk is fully indexed and retrievable. Useful for RAG-style indexing, document ingestion pipelines, or benchmarks where you want the original text in memory rather than LLM-generated summaries.

```bash
export HINDSIGHT_API_RETAIN_EXTRACTION_MODE=verbatim
```

**`retain_strategies` / `retain_default_strategy` — per-call extraction strategy**

Named strategies let you ingest different content types into the same bank using different extraction settings. A strategy is a set of hierarchical field overrides applied on top of the resolved bank config.

Any field in the hierarchical config can be overridden per strategy, including `retain_extraction_mode`, `retain_chunk_size`, `retain_structured_chunk_size`, `entity_labels`, `entities_allow_free_form`, `retain_mission`, etc.

Configure strategies via the bank config API:

```json
{
  "retain_default_strategy": "conversations",
  "retain_strategies": {
    "conversations": {
      "retain_extraction_mode": "concise",
      "retain_chunk_size": 3000,
      "retain_structured_chunk_size": 12000
    },
    "documents": {
      "retain_extraction_mode": "chunks",
      "retain_chunk_size": 800,
      "entity_labels": null,
      "entities_allow_free_form": false
    }
  }
}
```

Then specify the strategy at retain time:

```python
# Uses default strategy ("conversations")
client.retain(bank_id, items=[{"content": "Alice joined the team today"}])

# Explicitly use document strategy
client.retain(bank_id, items=[{"content": "...document text..."}], strategy="documents")
```

If no `strategy` is specified in a retain call, `retain_default_strategy` is used. If neither is set, the bank/global config applies directly.
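
The resolution order described above (explicit `strategy`, then `retain_default_strategy`, then the plain bank/global config) can be sketched as a simple dictionary overlay. This is an illustration only: the function name `resolve_retain_config` is hypothetical, and raising on an unknown strategy name is an assumption, since the behavior for that case is not documented here.

```python
from typing import Optional


def resolve_retain_config(bank_config: dict, strategy: Optional[str] = None) -> dict:
    """Overlay the selected strategy's overrides on top of the resolved bank config."""
    strategies = bank_config.get("retain_strategies") or {}
    # An explicit per-call strategy wins; otherwise fall back to the bank default.
    name = strategy or bank_config.get("retain_default_strategy")
    resolved = {
        key: value
        for key, value in bank_config.items()
        if key not in ("retain_strategies", "retain_default_strategy")
    }
    if name is None:
        return resolved  # neither is set: bank/global config applies directly
    if name not in strategies:
        # Assumption: unknown names are treated as an error in this sketch.
        raise KeyError(f"unknown retain strategy: {name!r}")
    resolved.update(strategies[name])
    return resolved
```

Note that a strategy can set a field to `null` (`None` in Python), as the `documents` strategy does for `entity_labels`; the overlay keeps that explicit value rather than falling back to the bank-level setting.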

**`HINDSIGHT_API_RETAIN_EXTRACTION_MODE=chunks` — zero LLM cost**

Each chunk is stored as-is with no LLM call whatsoever. No entity extraction, no temporal indexing — only embeddings are generated for semantic search. User-provided entities passed via `RetainContent.entities` are the sole source of entity data. Use when ingestion speed and cost matter more than structured metadata.

```bash
export HINDSIGHT_API_RETAIN_EXTRACTION_MODE=chunks
```

**`HINDSIGHT_API_RETAIN_EXTRACTION_MODE=custom` + `HINDSIGHT_API_RETAIN_CUSTOM_INSTRUCTIONS` — full control**

Replaces the built-in selectivity rules entirely. The structural parts of the prompt (output format, temporal handling, coreference resolution) remain intact — only the extraction guidelines are replaced.

Use this when `retain_mission` isn't sufficient and you need strict inclusion/exclusion logic.

```bash
export HINDSIGHT_API_RETAIN_EXTRACTION_MODE=custom
export HINDSIGHT_API_RETAIN_CUSTOM_INSTRUCTIONS="ONLY extract facts that are:
✅ Technical decisions and their rationale
✅ Architecture patterns and design choices
✅ Performance metrics and benchmarks

DO NOT extract:
❌ Greetings or social conversation
❌ Process chatter (\"let me check\", \"one moment\")
❌ Anything that would not be useful in 6 months"
```

### File Processing

Configuration for the file upload and conversion pipeline (used by `POST /v1/default/banks/{bank_id}/files/retain`).

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_ENABLE_FILE_UPLOAD_API` | Enable the file upload API endpoint | `true` |
| `HINDSIGHT_API_ENABLE_DOCUMENT_EXPORT_API` | Enable the [document export](./api/memory-banks.mdx#document-export--import) endpoint (`GET /document-transfer`) | `true` |
| `HINDSIGHT_API_ENABLE_DOCUMENT_IMPORT_API` | Enable the [document import](./api/memory-banks.mdx#document-export--import) endpoint (`POST /document-transfer`) | `true` |
| `HINDSIGHT_API_FILE_PARSER` | Server-side default parser or fallback chain (comma-separated, e.g. `iris,markitdown`) | `markitdown` |
| `HINDSIGHT_API_FILE_PARSER_ALLOWLIST` | Comma-separated list of parsers clients are allowed to request. If not set, all registered parsers are allowed. | — |
| `HINDSIGHT_API_FILE_CONVERSION_MAX_BATCH_SIZE` | Max files per upload request | `10` |
| `HINDSIGHT_API_FILE_CONVERSION_MAX_BATCH_SIZE_MB` | Max total upload size per request (MB) | `100` |
| `HINDSIGHT_API_FILE_DELETE_AFTER_RETAIN` | Delete stored files after memory extraction completes | `true` |

#### Parser selection

Clients can override the server default by passing `parser` in the request body of `POST /v1/default/banks/{bank_id}/files/retain`. Both the server default and the per-request field accept a single parser name or an ordered **fallback chain** — each parser is tried in sequence until one succeeds.

```bash
# Server default: try iris first, fall back to markitdown if iris fails
export HINDSIGHT_API_FILE_PARSER=iris,markitdown

# Restrict what clients may request (optional — defaults to all registered parsers)
export HINDSIGHT_API_FILE_PARSER_ALLOWLIST=markitdown,iris
```

Per-request override, sent in the JSON body of the file retain endpoint:

```json
{
  "parser": "iris",
  "files_metadata": [
    { "document_id": "report" },
    { "document_id": "fallback_doc", "parser": ["iris", "markitdown"] }
  ]
}
```

Clients that request a parser not in the allowlist receive HTTP 400.
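
The selection rules above (server default, per-request override, allowlist check, ordered fallback) can be sketched as follows. This is a simplified model rather than the server's code: the names `select_chain`, `convert_with_fallback`, and `ParserNotAllowed` are hypothetical, and parsers are represented as plain callables that raise on failure.

```python
class ParserNotAllowed(ValueError):
    """Stands in for the HTTP 400 returned for a parser outside the allowlist."""


def parse_chain(value) -> list:
    """Accept 'iris,markitdown' or ['iris', 'markitdown'] and return a clean list."""
    if isinstance(value, str):
        value = value.split(",")
    return [name.strip() for name in value if name.strip()]


def select_chain(server_default, requested=None, allowlist=None) -> list:
    """Pick the per-request chain if given, otherwise the server default."""
    if not requested:
        return parse_chain(server_default)
    chain = parse_chain(requested)
    if allowlist is not None:  # unset allowlist: every registered parser is allowed
        allowed = set(parse_chain(allowlist))
        rejected = [name for name in chain if name not in allowed]
        if rejected:
            raise ParserNotAllowed(f"parser(s) not allowed: {rejected}")
    return chain


def convert_with_fallback(chain, parsers, data):
    """Try each parser in order; return (parser_name, result) for the first success."""
    errors = []
    for name in chain:
        try:
            return name, parsers[name](data)
        except Exception as exc:  # any failure moves on to the next parser
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))
```

In this sketch the allowlist is applied only to client-requested parsers, matching the description of `HINDSIGHT_API_FILE_PARSER_ALLOWLIST` as the set of parsers clients may request.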

#### Parser: markitdown (default)

Local file-to-markdown conversion using [Microsoft's markitdown](https://github.com/microsoft/markitdown). No external service is required by default.

**Supported formats:** PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, images (JPG, PNG — requires optional OCR for text extraction), audio (MP3, WAV — transcription), HTML, TXT, MD, CSV.

For image workloads, MarkItDown can optionally use an OpenAI-compatible OCR/vision endpoint. This is disabled by default. Without it, image uploads fail with an actionable configuration error instead of low-level parser output. When enabled, configure the MarkItDown OCR API key, base URL, and model explicitly; they do not inherit from `HINDSIGHT_API_LLM_*` because MarkItDown uses the OpenAI SDK directly. The selected endpoint must implement OpenAI Chat Completions and the selected model must support image input.

This OCR path uses MarkItDown's image converter hook. It applies to image inputs such as JPG and PNG (and image handling inside converters that consume MarkItDown's `llm_client`), but it does not rasterize scanned PDF pages into images. Scanned PDFs with no text layer may still extract poorly through the default PDF converter. For scanned PDFs or complex document layouts, use an OCR-capable document parser such as `iris` or `llama_parse`, or configure a parser fallback chain like `llama_parse,markitdown`.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_ENABLED` | Enable MarkItDown image OCR through an OpenAI-compatible OCR/vision endpoint | `false` |
| `HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_API_KEY` | API key for MarkItDown OCR; required when OCR is enabled | — |
| `HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_BASE_URL` | OpenAI-compatible Chat Completions base URL for MarkItDown OCR; required when OCR is enabled | — |
| `HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_MODEL` | OCR/vision model with image-input support; required when OCR is enabled | — |
| `HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_PROMPT` | OCR prompt passed to MarkItDown's image converter | Built-in OCR prompt |

```bash
# Configure a dedicated OpenAI-compatible OCR/vision endpoint for MarkItDown OCR
export HINDSIGHT_API_FILE_PARSER=markitdown
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_ENABLED=true
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_API_KEY=your-vision-api-key
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_BASE_URL=https://vision.example/v1
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_MODEL=ocr-or-vision-model
```

#### Parser: iris

Cloud-based extraction via [Vectorize Iris](https://docs.vectorize.io/build-deploy/extract-information/understanding-iris/). Higher quality extraction for complex documents, powered by a remote AI service.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_FILE_PARSER_IRIS_TOKEN` | Vectorize API token | — |
| `HINDSIGHT_API_FILE_PARSER_IRIS_ORG_ID` | Vectorize organization ID | — |

**Supported formats:** PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, images (JPG, JPEG, PNG, GIF, BMP, TIFF, WEBP), HTML, TXT, MD, CSV.

```bash
# Use iris as the only parser
export HINDSIGHT_API_FILE_PARSER=iris
export HINDSIGHT_API_FILE_PARSER_IRIS_TOKEN=your-vectorize-token
export HINDSIGHT_API_FILE_PARSER_IRIS_ORG_ID=your-org-id

# Or: try iris first, fall back to markitdown if iris fails or rejects the file type
export HINDSIGHT_API_FILE_PARSER=iris,markitdown
```

#### Parser: llama_parse

Cloud-based extraction via [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse) (LlamaIndex). Strong extraction for complex layouts — tables, charts, multi-column PDFs.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_FILE_PARSER_LLAMA_PARSE_API_KEY` | LlamaCloud API key (typically starts with `llx-`) | — |

**Supported formats:** PDF, DOCX, PPTX, XLSX, HTML, EPUB, RTF, TXT, and many more — see the [LlamaParse docs](https://docs.cloud.llamaindex.ai/llamaparse/features/supported_document_types) for the full list.

```bash
# Use llama_parse as the only parser
export HINDSIGHT_API_FILE_PARSER=llama_parse
export HINDSIGHT_API_FILE_PARSER_LLAMA_PARSE_API_KEY=llx-your-api-key

# Or: try llama_parse first, fall back to markitdown
export HINDSIGHT_API_FILE_PARSER=llama_parse,markitdown
```

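The upload limits and post-retain cleanup from the File Processing table apply regardless of which parser is selected:

```bash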
# Increase batch limits for large file imports
export HINDSIGHT_API_FILE_CONVERSION_MAX_BATCH_SIZE=20
export HINDSIGHT_API_FILE_CONVERSION_MAX_BATCH_SIZE_MB=500

# Keep files after processing (useful for debugging or re-processing)
export HINDSIGHT_API_FILE_DELETE_AFTER_RETAIN=false
```

### File Storage

Files uploaded via the file retain API are stored in an object storage backend before conversion. Choose the backend that fits your infrastructure.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_FILE_STORAGE_TYPE` | Storage backend: `native`, `s3`, `gcs`, or `azure` | `native` |

#### Native (PostgreSQL)

Files are stored as `BYTEA` in the `file_storage` table. No additional infrastructure required. Suitable for development and small deployments.

```bash
# Native storage is the default — no additional configuration needed
export HINDSIGHT_API_FILE_STORAGE_TYPE=native
```

#### S3 / S3-Compatible

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_FILE_STORAGE_S3_BUCKET` | S3 bucket name | - |
| `HINDSIGHT_API_FILE_STORAGE_S3_REGION` | AWS region | - |
| `HINDSIGHT_API_FILE_STORAGE_S3_ENDPOINT` | Custom endpoint URL (for S3-compatible stores like MinIO, Cloudflare R2, Tigris) | AWS default |
| `HINDSIGHT_API_FILE_STORAGE_S3_ACCESS_KEY_ID` | AWS access key ID | - |
| `HINDSIGHT_API_FILE_STORAGE_S3_SECRET_ACCESS_KEY` | AWS secret access key | - |

For S3-compatible providers that don't expose AWS-style regions (MinIO, Cloudflare R2, Tigris), set `HINDSIGHT_API_FILE_STORAGE_S3_REGION=auto`. The value is required for SigV4 request signing but is ignored by the service.

```bash
# AWS S3
export HINDSIGHT_API_FILE_STORAGE_TYPE=s3
export HINDSIGHT_API_FILE_STORAGE_S3_BUCKET=my-hindsight-files
export HINDSIGHT_API_FILE_STORAGE_S3_REGION=us-east-1
export HINDSIGHT_API_FILE_STORAGE_S3_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export HINDSIGHT_API_FILE_STORAGE_S3_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# S3-compatible (MinIO, Cloudflare R2, etc.)
export HINDSIGHT_API_FILE_STORAGE_TYPE=s3
export HINDSIGHT_API_FILE_STORAGE_S3_BUCKET=my-bucket
export HINDSIGHT_API_FILE_STORAGE_S3_REGION=auto
export HINDSIGHT_API_FILE_STORAGE_S3_ENDPOINT=https://your-minio.example.com
export HINDSIGHT_API_FILE_STORAGE_S3_ACCESS_KEY_ID=minioadmin
export HINDSIGHT_API_FILE_STORAGE_S3_SECRET_ACCESS_KEY=minioadmin

# Tigris (S3-compatible, single global endpoint)
export HINDSIGHT_API_FILE_STORAGE_TYPE=s3
export HINDSIGHT_API_FILE_STORAGE_S3_BUCKET=my-hindsight-bucket
export HINDSIGHT_API_FILE_STORAGE_S3_REGION=auto
export HINDSIGHT_API_FILE_STORAGE_S3_ENDPOINT=https://t3.storage.dev
export HINDSIGHT_API_FILE_STORAGE_S3_ACCESS_KEY_ID=tid_your_access_key
export HINDSIGHT_API_FILE_STORAGE_S3_SECRET_ACCESS_KEY=tsec_your_secret_key
```

#### Google Cloud Storage

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_FILE_STORAGE_GCS_BUCKET` | GCS bucket name | - |
| `HINDSIGHT_API_FILE_STORAGE_GCS_SERVICE_ACCOUNT_KEY` | Path to service account JSON key file | ADC if not set |

```bash
export HINDSIGHT_API_FILE_STORAGE_TYPE=gcs
export HINDSIGHT_API_FILE_STORAGE_GCS_BUCKET=my-hindsight-files
# Optional: use service account key file (otherwise falls back to ADC)
export HINDSIGHT_API_FILE_STORAGE_GCS_SERVICE_ACCOUNT_KEY=/path/to/key.json
```

#### Azure Blob Storage

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_FILE_STORAGE_AZURE_CONTAINER` | Azure container name | - |
| `HINDSIGHT_API_FILE_STORAGE_AZURE_ACCOUNT_NAME` | Azure storage account name | - |
| `HINDSIGHT_API_FILE_STORAGE_AZURE_ACCOUNT_KEY` | Azure storage account key | - |

```bash
export HINDSIGHT_API_FILE_STORAGE_TYPE=azure
export HINDSIGHT_API_FILE_STORAGE_AZURE_CONTAINER=hindsight-files
export HINDSIGHT_API_FILE_STORAGE_AZURE_ACCOUNT_NAME=mystorageaccount
export HINDSIGHT_API_FILE_STORAGE_AZURE_ACCOUNT_KEY=base64encodedkey==
```

#### Storage Backend Comparison

| Backend | Best For | Notes |
|---------|----------|-------|
| `native` | Development, small deployments | No extra infrastructure, stored in PostgreSQL |
| `s3` | Production, AWS deployments | Works with any S3-compatible store |
| `gcs` | Production, GCP deployments | Supports ADC for keyless auth |
| `azure` | Production, Azure deployments | Uses account key auth |

:::tip Production Recommendation
For production deployments, use `s3`, `gcs`, or `azure` to avoid storing large binary files in your PostgreSQL database. Set `HINDSIGHT_API_FILE_DELETE_AFTER_RETAIN=true` (the default) to delete files after memory extraction, which minimizes storage costs.
:::

### Observations (Experimental) {#observations}

Observations are deduplicated, evidence-grounded knowledge consolidated from multiple facts. Each observation tracks its supporting memories and a proof count, and is refined — not overwritten — when new evidence arrives.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_ENABLE_OBSERVATIONS` | Enable observation consolidation | `true` |
| `HINDSIGHT_API_ENABLE_AUTO_CONSOLIDATION` | Automatically trigger consolidation after retain, delete, and update operations. When `false`, consolidation only runs when explicitly triggered via the [consolidate endpoint](/developer/api/operations#consolidation). Configurable per bank. | `true` |
| `HINDSIGHT_API_CONSOLIDATION_RECONCILE_INTERVAL_SECONDS` | Interval for the background sweep that re-schedules consolidation for banks with unconsolidated facts but no consolidation in progress — recovering facts left unscheduled when a consolidation operation failed terminally (e.g. the LLM provider was unavailable). Only applies to banks with auto-consolidation enabled. `0` disables the sweep. | `300` |
| `HINDSIGHT_API_MENTAL_MODEL_REFRESH_TICK_SECONDS` | How often the background loop checks for cron-scheduled mental models that are due for a refresh. This is only the *check* cadence; the actual schedule is the per-model `trigger.refresh_cron` expression set on the mental model. A due model is refreshed only when it is stale (new memories in its scope since the last refresh). `0` disables the sweep. | `60` |
| `HINDSIGHT_API_ENABLE_OBSERVATION_HISTORY` | Track history of changes to each observation (previous text/tags/dates + timestamp), stored one row per change in the `observation_history` table. Set to `false` to disable entirely — no history rows are written. **This is how you turn the feature off** (not a zero cap). | `true` |
| `HINDSIGHT_API_OBSERVATION_HISTORY_MAX_ENTRIES` | Max history rows kept per observation. On each update the previous version is inserted into the `observation_history` table and the oldest rows beyond this cap are deleted, so an often-reinforced observation's history can't grow without bound. `0` or a negative value **removes the cap** (unbounded); to turn history off entirely set `HINDSIGHT_API_ENABLE_OBSERVATION_HISTORY=false` instead. | `50` |
| `HINDSIGHT_API_CONSOLIDATION_MAX_ATTEMPTS` | Outer retry attempts for the consolidation LLM batch call. Each attempt uses the inner retry budget (`HINDSIGHT_API_CONSOLIDATION_LLM_MAX_RETRIES`). Worst-case API calls per batch = `MAX_ATTEMPTS × (LLM_MAX_RETRIES + 1)`. | `3` |
| `HINDSIGHT_API_CONSOLIDATION_BATCH_SIZE` | Memories to load per batch (internal optimization) | `50` |
| `HINDSIGHT_API_CONSOLIDATION_MAX_MEMORIES_PER_ROUND` | Maximum memories processed per consolidation round. When the limit is reached, the job yields its worker slot and re-queues itself so other banks get fair scheduling. Mental model refreshes only run on the final round. `0` = unlimited. Configurable per bank. | `100` |
| `HINDSIGHT_API_CONSOLIDATION_MAX_TOKENS` | Max tokens for recall when finding related observations during consolidation | `1024` |
| `HINDSIGHT_API_CONSOLIDATION_MAX_COMPLETION_TOKENS` | Max completion tokens requested for each consolidation LLM batch call. Unset by default, so each provider keeps its implicit output budget. Set this when a provider applies a low hidden cap (e.g. Bedrock imported models) that truncates consolidation output. | `unset` |
| `HINDSIGHT_API_CONSOLIDATION_LLM_BATCH_SIZE` | Number of facts sent to the LLM in a single consolidation call. Higher values reduce LLM calls and improve throughput at the cost of larger prompts. Set to `1` to disable batching. Configurable per bank. | `8` |
| `HINDSIGHT_API_CONSOLIDATION_DEDUP_THRESHOLD` | Cosine similarity at/above which a newly-created or freshly-updated observation is reconciled against an existing near-identical one via a focused 1-by-1 LLM "merge or keep" call (the model reads both texts, so a number/negation/entity difference is respected). Catches near-duplicate observations that weaker consolidation models emit even when shown the twin, as well as duplicates that arise when an update rewrites an observation into a near-twin of another. Set to `1.0` to disable. Postgres only — consolidation skips reconciliation on Oracle regardless of this value. | `0.97` |
| `HINDSIGHT_API_CONSOLIDATION_LLM_PARALLELISM` | Maximum number of tag groups consolidated concurrently within one consolidation op. Each group acquires per-scope locks before processing, so groups whose write scopes overlap (e.g. under `per_tag` / `all_combinations` / explicit-list `observation_scopes`) automatically serialise on the overlapping scopes — actual concurrency may be lower than this cap when scopes contend. Set to `1` for fully sequential behaviour. Higher values raise peak LLM QPS and connection-pool usage during consolidation proportionally — tune down if your LLM provider rate-limits tightly or your DB pool is small. Configurable per bank. | `4` |
| `HINDSIGHT_API_CONSOLIDATION_RECALL_BUDGET` | Budget level for the recall pass inside consolidation (`low`, `mid`, `high`). Lower budgets fetch fewer candidate rows, reducing peak memory usage on large banks. | `low` |
| `HINDSIGHT_API_CONSOLIDATION_SOURCE_FACTS_MAX_TOKENS` | Total token budget for source facts included with observations in the consolidation prompt. `-1` = unlimited. Configurable per bank. | `4096` |
| `HINDSIGHT_API_CONSOLIDATION_SOURCE_FACTS_MAX_TOKENS_PER_OBSERVATION` | Per-observation token cap for source facts in the consolidation prompt. Each observation independently gets at most this many tokens of source facts. `-1` = unlimited. Configurable per bank. | `256` |
| `HINDSIGHT_API_OBSERVATIONS_MISSION` | What this bank should synthesise into durable observations. Replaces the built-in consolidation rules — leave unset to use the server default. | - |
| `HINDSIGHT_API_MAX_OBSERVATIONS_PER_SCOPE` | Maximum number of observations allowed per tag scope. When the limit is reached, consolidation will only update or delete existing observations — no new ones are created. Applies per tag scope (e.g., per-tag when using `per_tag` observation scopes). Observations with no tags are not subject to this limit. `-1` = unlimited. Configurable per bank. | `-1` |
| `HINDSIGHT_API_OBSERVATION_SCOPE_LIMITS` | Per-scope overrides of `MAX_OBSERVATIONS_PER_SCOPE`, as a JSON array of `{"scope": [tag-globs], "limit": int}` rules. Each `scope` is a list of [fnmatch](https://docs.python.org/3/library/fnmatch.html) globs; a consolidation scope matches under *exact cover* — every tag must be matched by a glob and every glob must match a tag, so `["shared"]` matches the scope `{shared}` but not `{run_1, shared}`. The first matching rule wins; scopes that match no rule fall back to `MAX_OBSERVATIONS_PER_SCOPE`. Example: `[{"scope": ["shared"], "limit": -1}, {"scope": ["run_*", "shared"], "limit": 50}]` keeps the `{shared}` scope unlimited while capping each `{run_*, shared}` scope at 50. Configurable per bank. | - |
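
The exact-cover matching used by `HINDSIGHT_API_OBSERVATION_SCOPE_LIMITS` is easier to see in code. The sketch below models the rule as described in the table, using Python's `fnmatch.fnmatchcase` for case-sensitive glob matching (the choice of `fnmatchcase` over `fnmatch` is an assumption). The function names are hypothetical and this is not Hindsight's implementation.

```python
import json
from fnmatch import fnmatchcase


def scope_matches(globs, tags) -> bool:
    """Exact cover: every tag is matched by a glob AND every glob matches a tag."""
    every_tag_covered = all(any(fnmatchcase(t, g) for g in globs) for t in tags)
    every_glob_used = all(any(fnmatchcase(t, g) for t in tags) for g in globs)
    return every_tag_covered and every_glob_used


def limit_for_scope(tags, rules, default_limit=-1) -> int:
    """First matching rule wins; otherwise fall back to MAX_OBSERVATIONS_PER_SCOPE."""
    if not tags:
        return -1  # untagged observations are not subject to the limit
    for rule in rules:
        if scope_matches(rule["scope"], tags):
            return rule["limit"]
    return default_limit


# The example value from the table above.
RULES = json.loads(
    '[{"scope": ["shared"], "limit": -1}, {"scope": ["run_*", "shared"], "limit": 50}]'
)
```

With `RULES` and a default of 10, the scope `{shared}` is unlimited, `{run_1, shared}` is capped at 50, and `{run_1}` alone matches no rule (the `shared` glob has no tag to match), so it falls back to 10.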

#### Customizing observations: when to use what

| Goal | Use |
|------|-----|
| Default behavior: durable specific facts, no ephemeral state | Leave unset |
| Change what observations *are* for this bank (different shape, different purpose) | `HINDSIGHT_API_OBSERVATIONS_MISSION` |

**`HINDSIGHT_API_OBSERVATIONS_MISSION` — redefine what this bank synthesises**

By default, observations are durable, specific beliefs consolidated from memories — the kind of knowledge that stays true over time (preferences, skills, relationships, recurring patterns). Each one is grounded in the source memories that support it. Ephemeral state is filtered out. Contradictions are tracked with temporal markers rather than overwriting the prior belief.

Set `HINDSIGHT_API_OBSERVATIONS_MISSION` to replace this definition entirely. Write a plain-language description of what observations should be for your use case. The LLM will use this instead of the default rules when deciding what to create or update. Leave it unset to keep the server default.

:::tip When to use observations_mission
Use it when the default durable-knowledge behavior doesn't match your use case. Common scenarios:
- You want **broader event summaries** rather than isolated facts
- You want observations **grouped by time period** (weekly, monthly)
- You want a **different granularity** (one observation per project rather than per fact)
- You have a **domain-specific** notion of what's worth remembering
:::

**Example: Weekly event summaries**

```bash
export HINDSIGHT_API_OBSERVATIONS_MISSION="Observations are broad summaries of project events grouped by week. Each observation should capture what happened, what was decided, and what was blocked — not individual facts. Merge related events into cohesive weekly narratives."
```

**Example: Person-centric knowledge**

```bash
export HINDSIGHT_API_OBSERVATIONS_MISSION="Observations are durable facts about specific named people: their preferences, skills, relationships, and behavioral patterns. Only create observations for facts that are stable over time and tied to a named individual."
```

**Example: Support ticket patterns**

```bash
export HINDSIGHT_API_OBSERVATIONS_MISSION="Observations are recurring patterns in customer support interactions: common failure modes, frequently requested features, and pain points that appear across multiple tickets."
```

### Reflect

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_REFLECT_MAX_ITERATIONS` | Max tool call iterations before forcing a response | `10` |
| `HINDSIGHT_API_REFLECT_MAX_CONTEXT_TOKENS` | Max accumulated context tokens in the reflect loop before forcing final synthesis. Prevents `context_length_exceeded` errors on large banks. Lower this if your LLM has a context window smaller than 128K. | `100000` |
| `HINDSIGHT_API_REFLECT_WALL_TIMEOUT` | Wall-clock timeout in seconds for the entire reflect operation. If exceeded, the request returns HTTP 504. | `300` |
| `HINDSIGHT_API_REFLECT_MISSION` | Global reflect mission (identity and reasoning framing). Overridden per bank via config API. | - |
| `HINDSIGHT_API_REFLECT_SOURCE_FACTS_MAX_TOKENS` | Token budget for source facts in `search_observations` during reflect. `-1` disables source facts (default), `0` enables with no limit, `>0` enables with a token budget. Hierarchical — can be overridden per bank via config API. | `-1` |

#### Internal recall (used by mental model refresh)

These knobs control the recall tool that runs inside `reflect_async` (e.g. when refreshing a mental model). They are hierarchical — overridable per bank via the config API, and individually overridable per mental model via the `trigger.include_chunks`, `trigger.recall_max_tokens`, and `trigger.recall_chunks_max_tokens` fields.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_RECALL_INCLUDE_CHUNKS` | Whether the internal recall returns raw chunk text alongside facts. Set `false` to skip chunks and save prompt budget. | `true` |
| `HINDSIGHT_API_RECALL_MAX_TOKENS` | Token budget for facts returned by the internal recall. | `2048` |
| `HINDSIGHT_API_RECALL_CHUNKS_MAX_TOKENS` | Token budget for raw chunks returned by the internal recall. | `1000` |

#### Disposition

Disposition traits control how the bank reasons during reflect operations. Each trait is on a scale of 1–5. These are hierarchical — they can be overridden per bank via the [config API](./configuration.md#hierarchical-configuration).

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_DISPOSITION_SKEPTICISM` | How skeptical vs trusting (1=trusting, 5=skeptical) | `3` |
| `HINDSIGHT_API_DISPOSITION_LITERALISM` | How literally to interpret information (1=flexible, 5=literal) | `3` |
| `HINDSIGHT_API_DISPOSITION_EMPATHY` | How much to consider emotional context (1=detached, 5=empathetic) | `3` |

### MCP Server

Configuration for MCP server endpoints.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_MCP_ENABLED` | Enable MCP server at `/mcp/{bank_id}/` | `true` |
| `HINDSIGHT_API_MCP_ENABLED_TOOLS` | Comma-separated allowlist of MCP tools to expose globally (empty = all tools) | - |
| `HINDSIGHT_API_MCP_STATELESS` | Use stateless HTTP transport (POST-only). When `false`, enables stateful mode with GET/SSE support for server-initiated messages | `false` |
| `HINDSIGHT_API_MCP_AUTH_TOKEN` | Bearer token for MCP authentication (optional) | - |
| `HINDSIGHT_API_MCP_LOCAL_BANK_ID` | Memory bank ID for local MCP | `mcp` |
| `HINDSIGHT_API_MCP_INSTRUCTIONS` | Additional instructions appended to retain/recall tool descriptions | - |

**Tool Access Control:**

`HINDSIGHT_API_MCP_ENABLED_TOOLS` restricts which MCP tools are registered at the server level. This is useful for read-only deployments or limiting surface area:

```bash
# Expose only recall (read-only deployment)
export HINDSIGHT_API_MCP_ENABLED_TOOLS=recall

# Expose recall and reflect only
export HINDSIGHT_API_MCP_ENABLED_TOOLS=recall,reflect
```

Available tool names: `retain`, `recall`, `reflect`, `list_banks`, `create_bank`, `list_mental_models`, `get_mental_model`, `create_mental_model`, `update_mental_model`, `delete_mental_model`, `refresh_mental_model`, `list_directives`, `create_directive`, `delete_directive`, `list_memories`, `get_memory`, `list_documents`, `get_document`, `delete_document`, `list_operations`, `get_operation`, `cancel_operation`, `list_tags`, `get_bank`, `get_bank_stats`, `update_bank`, `delete_bank`, `clear_memories`.

This can also be overridden per bank via the [config API](#hierarchical-configuration):

```bash
# Restrict a specific bank to read-only MCP access
curl -X PATCH http://localhost:8888/v1/default/banks/my-bank/config \
  -H "Content-Type: application/json" \
  -d '{"updates": {"mcp_enabled_tools": ["recall"]}}'
```

When a bank-level `mcp_enabled_tools` is set, tools not in the list return a clear error when invoked (they still appear in the tools list for MCP protocol compatibility).

**MCP Authentication:**

By default, the MCP endpoint is open. For production deployments, set `HINDSIGHT_API_MCP_AUTH_TOKEN` to require Bearer token authentication:

```bash
export HINDSIGHT_API_MCP_AUTH_TOKEN=your-secret-token
```

Clients must then include the token in the `Authorization` header. See [MCP Server documentation](./mcp-server.md#authentication) for details.

**Local MCP instructions:**

```bash
# Example: instruct MCP to also store assistant actions
export HINDSIGHT_API_MCP_INSTRUCTIONS="Also store every action you take, including tool calls and decisions made."
```

### Distributed Workers

Configuration for background task processing. By default, the API processes tasks internally. For high-throughput deployments, run dedicated workers. See [Services - Worker Service](./services#worker-service) for details.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_WORKER_ENABLED` | Enable internal worker in API process | `true` |
| `HINDSIGHT_API_WORKER_ID` | Unique worker identifier | hostname |
| `HINDSIGHT_API_WORKER_POLL_INTERVAL_MS` | Database polling interval in milliseconds | `500` |
| `HINDSIGHT_API_WORKER_MAX_RETRIES` | Max retries before marking task failed | `3` |
| `HINDSIGHT_API_WORKER_TASK_RETRY_BACKOFF_SECONDS` | Seconds between retries on transient task failure | `60` |
| `HINDSIGHT_API_WORKER_HTTP_PORT` | HTTP port for worker metrics/health (worker CLI only) | `8889` |
| `HINDSIGHT_API_WORKER_MAX_SLOTS` | Maximum concurrent tasks per worker (total across all operation types) | `10` |
| `HINDSIGHT_API_WORKER_CONSOLIDATION_MAX_SLOTS` | Reserved slots for consolidation tasks within `WORKER_MAX_SLOTS` (bank-serialization preserved) | `2` |
| `HINDSIGHT_API_WORKER_CONSOLIDATION_BANK_PRIORITY` | Per-bank priority for consolidation scheduling (see note below) | _(unset)_ |
| `HINDSIGHT_API_WORKER_RETAIN_MAX_SLOTS` | Reserved slots for retain tasks within `WORKER_MAX_SLOTS` | `0` |
| `HINDSIGHT_API_WORKER_FILE_CONVERT_RETAIN_MAX_SLOTS` | Reserved slots for file_convert_retain tasks within `WORKER_MAX_SLOTS` | `0` |
| `HINDSIGHT_API_WORKER_REFRESH_MENTAL_MODEL_MAX_SLOTS` | Reserved slots for refresh_mental_model tasks within `WORKER_MAX_SLOTS` | `0` |
| `HINDSIGHT_API_WORKER_GRAPH_MAINTENANCE_MAX_SLOTS` | Reserved slots for graph_maintenance tasks within `WORKER_MAX_SLOTS` | `0` |
| `HINDSIGHT_API_WORKER_IMPORT_DOCUMENTS_MAX_SLOTS` | Reserved slots for import_documents tasks within `WORKER_MAX_SLOTS` | `0` |

:::note Slot reservations and shared pool
Per-operation `*_MAX_SLOTS` values are **reservations within** `WORKER_MAX_SLOTS`, not additive pools. The sum of all reservations must not exceed `WORKER_MAX_SLOTS` (startup raises `ValueError` otherwise). Remaining capacity (`WORKER_MAX_SLOTS - sum of reservations`) forms a **shared pool** usable by any operation type on a first-come basis; operation types whose reserved capacity is full can also overflow into the shared pool. Consolidation's bank-serialization constraint (no two consolidation tasks for the same bank concurrently) is preserved regardless of which pool claims the slot.

Example: `MAX_SLOTS=10, CONSOLIDATION=2, RETAIN=3, REFRESH_MENTAL_MODEL=2` → shared pool = `10 - (2+3+2) = 3`.

With the defaults (`MAX_SLOTS=10`, `CONSOLIDATION_MAX_SLOTS=2`, all other reservations `0`), 2 slots are always reserved for consolidation and the remaining 8 form the shared pool for any operation type. Set `CONSOLIDATION_MAX_SLOTS=0` to release consolidation's reserved capacity into the shared pool.
:::
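
The reservation arithmetic from the note above can be sketched as follows. The helper name `shared_pool_size` and the dictionary layout are hypothetical; only the rule itself (reservations must not exceed `WORKER_MAX_SLOTS`, and the remainder is shared) comes from the text.

```python
def shared_pool_size(max_slots, reservations):
    """Return the shared-pool capacity left after per-operation reservations."""
    reserved = sum(reservations.values())
    if reserved > max_slots:
        # Mirrors the documented startup behavior of raising ValueError.
        raise ValueError(
            f"reservations ({reserved}) exceed WORKER_MAX_SLOTS ({max_slots})"
        )
    return max_slots - reserved


# The example from the note: 10 - (2 + 3 + 2) = 3
print(shared_pool_size(10, {"consolidation": 2, "retain": 3, "refresh_mental_model": 2}))

# The defaults: 2 slots reserved for consolidation, 8 shared
print(shared_pool_size(10, {"consolidation": 2}))
```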

:::note Consolidation bank priority
`HINDSIGHT_API_WORKER_CONSOLIDATION_BANK_PRIORITY` controls which banks' consolidation tasks are claimed first when a slot becomes available. Format: comma-separated `bank-pattern:priority` pairs where higher numbers mean higher priority. Patterns support `*` as a wildcard; a bare `*` is the catch-all default for unlisted banks (defaults to `1` if omitted).

Example:
```bash
HINDSIGHT_API_WORKER_CONSOLIDATION_BANK_PRIORITY="shadow-*:10,staging-*:5,*:1"
```

This ensures `shadow-*` banks are always consolidated before others, even if their tasks were submitted later. Useful for deployments with asymmetric bank sizes where a large bank might be starved by many small banks cycling through limited slots. Bank-serialization (max one concurrent consolidation per bank) is preserved regardless of priority. When unset, consolidation tasks are claimed in `created_at` order (default behavior).
:::
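
As an illustration of the `bank-pattern:priority` format, the sketch below parses a priority string and looks up the priority for a bank. The function names are hypothetical, and the choice that the first matching pattern in listed order wins is an assumption; the text above does not say how overlapping patterns are resolved.

```python
import re


def parse_bank_priority(spec):
    """Parse 'pattern:priority,pattern:priority' into (pattern, priority) pairs."""
    rules = []
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue
        pattern, _, priority = pair.rpartition(":")
        rules.append((pattern, int(priority)))
    return rules


def _matches(pattern, bank_id):
    # Only `*` is treated as a wildcard; everything else is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, bank_id) is not None


def priority_for(bank_id, rules, default=1):
    """Assumption: the first matching pattern wins; unlisted banks get `default`."""
    for pattern, priority in rules:
        if _matches(pattern, bank_id):
            return priority
    return default


rules = parse_bank_priority("shadow-*:10,staging-*:5,*:1")
banks = ["prod-a", "staging-1", "shadow-eu"]

# Higher priority first, as the scheduler would claim them.
print(sorted(banks, key=lambda b: -priority_for(b, rules)))
# ['shadow-eu', 'staging-1', 'prod-a']
```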

### Performance Optimization

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_SKIP_LLM_VERIFICATION` | Skip LLM connection check on startup | `false` |

#### Bank stats cache

`get_bank_stats` aggregates over `memory_links` (joining `memory_units`), which can be a multi-second scan on banks with millions of rows. Because the result is intentionally approximate (it backs a UI widget and the freshness hint inside `reflect`), it is cached per `(schema, bank)` for a short TTL (60 seconds by default), which also coalesces concurrent misses onto a single in-flight query. Tune it for high-concurrency or large-bank deployments:

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_BANK_STATS_CACHE_TTL_SECONDS` | Time-to-live (seconds) for the `get_bank_stats` result cache. `0` disables caching, so every call runs the query. | `60` |
| `HINDSIGHT_API_BANK_STATS_CACHE_MAX_ENTRIES` | Maximum number of cached `(schema, bank)` entries before LRU eviction. Bounds memory in deployments with many banks. | `1024` |
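
To show how the two settings interact, here is a minimal sketch of a TTL cache with LRU eviction keyed by `(schema, bank)`. The class name `TTLCache` and its methods are hypothetical and not Hindsight's implementation; in particular, this sketch does not coalesce concurrent misses onto one in-flight query.

```python
import time
from collections import OrderedDict


class TTLCache:
    def __init__(self, ttl_seconds=60.0, max_entries=1024, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        if self.ttl_seconds <= 0:
            return compute()  # a TTL of 0 disables caching entirely
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            self._entries.move_to_end(key)  # mark as most recently used
            return hit[1]
        value = compute()
        self._entries[key] = (now + self.ttl_seconds, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used
        return value


cache = TTLCache(ttl_seconds=60, max_entries=1024)
stats = cache.get_or_compute(("public", "my-bank"), lambda: {"links": 42})
print(stats)
```

A smaller `max_entries` bounds memory at the cost of more recomputation when many banks are queried in rotation.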

#### Native thread pools

When local embeddings or reranking run in-process, the underlying BLAS/ML libraries (OpenBLAS, OpenMP, MKL) each spawn a worker pool sized to the host CPU count. Because Hindsight already parallelizes across requests via its own thread-pool executors, those native pools oversubscribe the CPU — on a many-core host the process can accumulate well over 100 native threads. This inflates memory and, under contention, can degrade throughput. Hindsight therefore bounds each native pool to **16 threads** (or the number of *available* CPUs, whichever is smaller) by default, capping runaway growth on large hosts while leaving within-call parallelism intact.

"Available" CPUs is the budget actually granted to the process — the smallest of the CPU-affinity set, the cgroup CPU quota (`--cpus` / cpuset), and the host core count. This matters in containers: the BLAS libraries otherwise size their pools to the *host's* cores even when the container is limited to a few, so a `--cpus=4` container on a 64-core host would spawn far more BLAS threads than it can run.

To tune, set any of these to the desired thread count; the value you set is always honored (the default applies only when the variable is unset):

| Variable | Description | Default |
|----------|-------------|---------|
| `OMP_NUM_THREADS` | OpenMP worker threads (torch, ONNX Runtime, some BLAS builds) | `min(16, available CPUs)` |
| `OPENBLAS_NUM_THREADS` | OpenBLAS worker threads (numpy's default BLAS) | `min(16, available CPUs)` |
| `MKL_NUM_THREADS` | Intel MKL worker threads (numpy/torch when MKL-backed) | `min(16, available CPUs)` |
| `NUMEXPR_NUM_THREADS` | numexpr expression-engine threads | `min(16, available CPUs)` |

For a server handling many concurrent requests, lower values (down to `1`) favor request-level parallelism and minimize thread count; for low-concurrency deployments running large local-model batches, the default leaves room for within-call parallelism. These must be set in the process environment before startup (the libraries read them once, at load time), so they cannot be overridden per-tenant or per-bank.
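
The `min(16, available CPUs)` default described above can be approximated with the following sketch. It is an illustration under stated assumptions, not Hindsight's code: the function names are hypothetical, only the cgroup v2 `cpu.max` file is consulted, and rounding a fractional quota up is a choice made here.

```python
import math
import os


def cgroup_cpu_quota(path="/sys/fs/cgroup/cpu.max"):
    """Return the cgroup v2 CPU quota in whole CPUs, or None if unlimited or unreadable."""
    try:
        with open(path) as f:
            quota, period = f.read().split()
        if quota == "max":
            return None
        return max(1, math.ceil(int(quota) / int(period)))
    except (OSError, ValueError):
        return None


def available_cpus():
    """Smallest of host core count, CPU-affinity set, and cgroup quota."""
    candidates = [os.cpu_count() or 1]
    if hasattr(os, "sched_getaffinity"):  # not available on every platform
        candidates.append(len(os.sched_getaffinity(0)))
    quota = cgroup_cpu_quota()
    if quota is not None:
        candidates.append(quota)
    return max(1, min(candidates))


def default_native_threads(cap=16):
    return min(cap, available_cpus())


# setdefault keeps any value the operator already exported.
# This must run before numpy/torch are imported, since they read it at load time.
os.environ.setdefault("OMP_NUM_THREADS", str(default_native_threads()))
print(os.environ["OMP_NUM_THREADS"])
```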

### Webhooks

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_WEBHOOK_URL` | Global webhook URL for event delivery | - (disabled) |
| `HINDSIGHT_API_WEBHOOK_SECRET` | HMAC signing secret for webhook payloads | - (unsigned) |
| `HINDSIGHT_API_WEBHOOK_EVENT_TYPES` | Comma-separated list of event types to deliver via webhook | `consolidation.completed` |
| `HINDSIGHT_API_WEBHOOK_DELIVERY_POLL_INTERVAL_SECONDS` | How often the webhook delivery worker polls for pending deliveries (seconds) | `30` |

### Audit Logging

Audit logging captures mutating operations (retain, recall, reflect, bank config updates, [Memory Defense](memory-defense/index.md) redact/block actions, etc.) into an `audit_log` table, queryable via the `/audit-logs` endpoint.

**Audit logging is disabled by default.** With `HINDSIGHT_API_AUDIT_LOG_ENABLED=false`, the `audit_log` table stays empty and `/audit-logs` returns `{"total": 0, "items": []}` regardless of activity. Set the flag to `true` and restart the API to start capturing events.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_AUDIT_LOG_ENABLED` | Master switch for audit logging. Must be `true` for any audit events to be written. | `false` |
| `HINDSIGHT_API_AUDIT_LOG_ACTIONS` | Comma-separated allowlist of action types to audit (empty = all eligible actions) | `""` |
| `HINDSIGHT_API_AUDIT_LOG_RETENTION_DAYS` | Number of days to retain audit log entries. `-1` = keep forever. | `-1` |

### LLM Request Tracing

LLM request tracing records every LLM call Hindsight makes — for retain, reflect, and consolidation — into an `llm_requests` table, queryable per bank via the `/llm-requests` endpoint. Each row captures the input messages, the model output, token usage (input / output / cached / total, taken from the provider response), finish reason, provider/model, timing, and caller metadata. **Failed calls are recorded too** (`status = "error"` with the error message), so the table is useful for debugging what the LLM is doing and why a call failed. Capture is wired into the OpenTelemetry GenAI recording path (the same `record_llm_call` hook used for OTLP span export), so it stays consistent with the provider-reported request details.

**LLM request tracing is enabled by default**, with traced rows retained for 1 day. To disable it entirely set `HINDSIGHT_API_LLM_TRACE_ENABLED=false` and restart the API — the `llm_requests` table then stays empty and `/llm-requests` returns `{"total": 0, "items": []}` regardless of activity.

> **Note:** Traced rows contain the full prompt and model output, which may include sensitive memory content and can be large. Use `HINDSIGHT_API_LLM_TRACE_MAX_CHARS` to bound how much of each payload is stored, tighten `HINDSIGHT_API_LLM_TRACE_RETENTION_DAYS`, or set `HINDSIGHT_API_LLM_TRACE_ENABLED=false` to turn tracing off in sensitive environments.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_LLM_TRACE_ENABLED` | Master switch for LLM request tracing. Must be `true` for any calls to be recorded. | `true` |
| `HINDSIGHT_API_LLM_TRACE_SCOPES` | Comma-separated allowlist of call scopes to trace (e.g. `retain_extract_facts,reflect`; empty = all scopes) | `""` |
| `HINDSIGHT_API_LLM_TRACE_RETENTION_DAYS` | Number of days to retain trace rows. `-1` = keep forever. | `1` |
| `HINDSIGHT_API_LLM_TRACE_MAX_CHARS` | Truncate stored input/output beyond this many characters (keeps the row, stores a truncated preview). | `50000` |

### Programmatic Configuration

You can also configure the API programmatically using `MemoryEngine.from_env()`:

```python
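import asyncio
from hindsight_api import MemoryEngine

async def main() -> None:
    # Build the engine from the HINDSIGHT_API_* environment variables
    memory = MemoryEngine.from_env()
    await memory.initialize()

asyncio.run(main())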
```

---

## Observability & Tracing

Hindsight provides OpenTelemetry-based observability for LLM calls, conforming to GenAI semantic conventions.

### OpenTelemetry Tracing

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_OTEL_TRACES_ENABLED` | Enable distributed tracing for LLM calls | `false` |
| `HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP endpoint URL (e.g., Grafana LGTM, Langfuse, etc.) | - |
| `HINDSIGHT_API_OTEL_EXPORTER_OTLP_HEADERS` | Headers for OTLP exporter (format: "key1=value1,key2=value2") | - |
| `HINDSIGHT_API_OTEL_SERVICE_NAME` | Service name for traces | `hindsight-api` |
| `HINDSIGHT_API_OTEL_DEPLOYMENT_ENVIRONMENT` | Deployment environment name (e.g., development, staging, production) | `development` |
| `HINDSIGHT_API_METRICS_INCLUDE_BANK_ID` | Include `bank_id` in OTel metric attributes. Enable only for deployments with few banks — high cardinality causes unbounded memory growth. | `false` |
| `HINDSIGHT_API_METRICS_BACKLOG_ENABLED` | Expose async-operation queue depth and consolidation-backlog gauges (`hindsight_async_operations`, `hindsight_consolidation_backlog`, `hindsight_consolidation_failed`). Runs periodic per-schema `COUNT` queries on a background task. | `false` |

**Features:**
- Full prompts and completions recorded as events
- Token usage tracking (input/output)
- Model and provider information
- Error tracking with finish reasons
- Conforms to OpenTelemetry GenAI semantic conventions v1.37+

**OTLP-Compatible Backends:**

The tracing implementation uses standard OTLP HTTP protocol, so it works with any OTLP-compatible backend:
- **Grafana LGTM** (Recommended for local dev): All-in-one stack with Tempo traces, Loki logs, Mimir metrics, and Grafana UI
- **Langfuse**: LLM-focused observability and analytics
- **OpenLIT**: Built-in LLM dashboards, cost tracking
- **DataDog, New Relic, Honeycomb**: Commercial platforms

**Example Configuration:**

```bash
# Enable tracing
export HINDSIGHT_API_OTEL_TRACES_ENABLED=true

# Configure endpoint (example: OpenLIT Cloud)
export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.openlit.io
export HINDSIGHT_API_OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer olit-xxx"

# Optional: Custom service name and environment
export HINDSIGHT_API_OTEL_SERVICE_NAME=hindsight-production
export HINDSIGHT_API_OTEL_DEPLOYMENT_ENVIRONMENT=production
```

**Local Development:**

For local development, we recommend the Grafana LGTM stack, which provides traces, metrics, and logs in a single container:

```bash
./scripts/dev/start-grafana.sh
```

See `scripts/dev/grafana/README.md` for detailed setup instructions.

Other options: See `scripts/dev/openlit/README.md` for OpenLIT or `scripts/dev/jaeger/README.md` for standalone Jaeger.

### Metrics

Hindsight exposes Prometheus metrics at the `/metrics` endpoint, including:
- LLM call duration and token usage
- Operation duration (retain/recall/reflect)
- HTTP request metrics
- Database connection pool metrics

Metrics are always enabled and available at `http://localhost:8888/metrics`.

---

## Control Plane

The Control Plane is the web UI for managing memory banks.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_CP_DATAPLANE_API_URL` | URL of the API service | `http://localhost:8888` |
| `HINDSIGHT_CP_DATAPLANE_API_KEY` | Bearer token the Control Plane sends as `Authorization: Bearer <key>` on every request to the API service. Required when the API service is auth-protected; omit for a public API. | *(none — no `Authorization` header sent)* |
| `HINDSIGHT_CP_ACCESS_KEY` | Access key to protect the Control Plane UI. When set, users must enter this key to log in. | *(none — auth disabled)* |
| `HINDSIGHT_CP_MAX_UPLOAD_SIZE` | Maximum size of a single file-upload request the Control Plane accepts before truncating it. Accepts a size string (`100mb`, `1gb`) or a number of bytes. Raise this to upload files larger than the default, and keep it in line with the API's `HINDSIGHT_API_FILE_CONVERSION_MAX_BATCH_SIZE_MB`. | `100mb` |
| `NEXT_PUBLIC_BASE_PATH` | Base path for Control Plane UI when behind reverse proxy (e.g., `/hindsight`) | `""` (root) |

```bash
# Point Control Plane to a remote API service
export HINDSIGHT_CP_DATAPLANE_API_URL=http://api.example.com:8888

# Authenticate to an auth-protected API service
export HINDSIGHT_CP_DATAPLANE_API_KEY=my-dataplane-bearer-token

# Protect the Control Plane with an access key
export HINDSIGHT_CP_ACCESS_KEY=my-secret-key
```

### Hierarchical Configuration

Hindsight supports per-bank configuration overrides through a hierarchical system: **Global (env vars) → Tenant → Bank**.
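
Conceptually, resolution walks the chain from the most specific level to the least specific one. The sketch below illustrates that precedence with plain dictionaries; the function name `resolve_config` and the example values are hypothetical, and the real resolver also applies permission filtering that is not shown here.

```python
from collections import ChainMap


def resolve_config(global_defaults, tenant_overrides, bank_overrides):
    """Bank overrides win over tenant overrides, which win over global defaults."""
    return dict(ChainMap(bank_overrides, tenant_overrides, global_defaults))


# Illustrative values only.
global_defaults = {"retain_chunk_size": 3000, "retain_extraction_mode": "concise"}
tenant_overrides = {"retain_extraction_mode": "verbose"}
bank_overrides = {"retain_chunk_size": 4000}

resolved = resolve_config(global_defaults, tenant_overrides, bank_overrides)
print(resolved["retain_chunk_size"])       # 4000 (from the bank)
print(resolved["retain_extraction_mode"])  # verbose (from the tenant)
```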

#### Type-Safe Config Access

To prevent accidentally using global defaults when bank-specific overrides exist, Hindsight enforces type-safe config access:

**In Application Code:**
```python
from hindsight_api.config import get_config

# ✅ Access static (infrastructure) fields
config = get_config()
host = config.host  # OK - static field
port = config.port  # OK - static field

# ❌ Attempting to access bank-configurable fields raises an error
chunk_size = config.retain_chunk_size  # ConfigFieldAccessError!
```

**Error Message:**
```
ConfigFieldAccessError: Field 'retain_chunk_size' is bank-configurable and cannot
be accessed from global config. Use ConfigResolver.resolve_full_config(bank_id, context)
to get bank-specific config.
```

**For Bank-Specific Config:**
```python
# Internal code that needs bank-specific settings
from hindsight_api.config_resolver import ConfigResolver

# Resolve full config for a specific bank (run inside an async function;
# `config_resolver` is an existing ConfigResolver instance)
config = await config_resolver.resolve_full_config(bank_id, request_context)
chunk_size = config.retain_chunk_size  # ✅ Uses bank-specific value
```

This design prevents bugs where global defaults are silently used instead of bank overrides: the mistake surfaces as an explicit `ConfigFieldAccessError` during development rather than as wrong behavior in production.

#### Security Model

Configuration fields are categorized for security:

1. **Configurable Fields** - Safe behavioral settings that can be customized per-bank:
   - Retention: `retain_chunk_size`, `retain_structured_chunk_size`, `retain_extraction_mode`, `retain_mission`, `retain_custom_instructions`
   - Observations: `enable_observations`, `enable_auto_consolidation`, `observations_mission`, `max_observations_per_scope`
   - MCP access control: `mcp_enabled_tools`

2. **Credential Fields** - NEVER exposed or configurable via API:
   - API keys: `*_api_key` (all LLM API keys)
   - Infrastructure: `*_base_url` (all base URLs)

3. **Static Fields** - Server-level only, cannot be overridden:
   - Infrastructure: `database_url`, `port`, `host`, `worker_count`
   - Provider/Model selection: `llm_provider`, `llm_model` (requires presets - not yet implemented)
   - Performance tuning: `llm_max_concurrent`, `llm_timeout`, retrieval settings, optimization flags

#### Enabling the API

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_ENABLE_BANK_CONFIG_API` | Enable per-bank config API | `true` |
| `HINDSIGHT_API_ENABLE_BANK_LLM_HEALTH` | Enable the per-bank LLM connectivity probe (`POST /v1/default/banks/{bank_id}/health/llm`). It makes a real provider call, so it is **off by default** — enable it to expose the endpoint. Returns status only — never the provider/model/endpoint. | `false` |
| `HINDSIGHT_API_ENABLE_DRY_RUN_EXTRACT` | Enable the dry-run extraction preview endpoint (`POST /v1/default/banks/{bank_id}/memories/dry-run-extract`). Runs extraction only — makes a real LLM call but stores nothing. Set to `false` to remove the endpoint (returns `404`). | `true` |
| `HINDSIGHT_API_DEFAULT_BANK_TEMPLATE` | Bank template manifest (JSON) applied automatically to every newly-created bank. See below. | _(unset)_ |

##### `HINDSIGHT_API_DEFAULT_BANK_TEMPLATE`

Server-level default bank template. When set, the manifest is applied once
to every bank the server creates — triggered the first time a bank is
touched (via `PUT /v1/default/banks/{bank_id}`, `/import`, `/retain`, etc.).
The value is a JSON-encoded `BankTemplateManifest` with the same shape
accepted by `POST /v1/default/banks/{bank_id}/import` (see the `bank`,
`mental_models`, and `directives` sections in the Bank Templates API).

Precedence: fields set by the template become per-bank overrides, so they
take precedence over the equivalent `HINDSIGHT_API_*` env-var defaults
(e.g. `HINDSIGHT_API_RETAIN_EXTRACTION_MODE`). Users can still override
individual fields later via `PATCH /v1/default/banks/{bank_id}/config`;
the template is **not** re-applied on subsequent accesses, so explicit
overrides are never clobbered.

A malformed manifest (bad JSON, unknown version, schema errors) is logged
and ignored — bank creation still succeeds with plain defaults, so a
broken server-level setting cannot wedge all callers.

Example (compact, single-line JSON as required by env vars):

```bash
export HINDSIGHT_API_DEFAULT_BANK_TEMPLATE='{"version":"1","bank":{"reflect_mission":"Help support agents remember customer interactions.","retain_extraction_mode":"verbose","disposition_empathy":5},"directives":[{"name":"Be concise","content":"Always respond concisely.","priority":10}]}'
```
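
Hand-writing single-line JSON is error-prone, so one option is to generate the value. The sketch below builds the same manifest as the example above and prints a shell-safe `export` line; it uses only the Python standard library, and the manifest fields are copied from that example.

```python
import json
import shlex

manifest = {
    "version": "1",
    "bank": {
        "reflect_mission": "Help support agents remember customer interactions.",
        "retain_extraction_mode": "verbose",
        "disposition_empathy": 5,
    },
    "directives": [
        {"name": "Be concise", "content": "Always respond concisely.", "priority": 10}
    ],
}

# separators=(",", ":") removes all optional whitespace, giving one compact line.
compact = json.dumps(manifest, separators=(",", ":"))

# shlex.quote wraps the value so the shell passes it through unchanged.
print(f"export HINDSIGHT_API_DEFAULT_BANK_TEMPLATE={shlex.quote(compact)}")
```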

#### API Endpoints

- `GET /v1/default/banks/{bank_id}/config` - View resolved config (filtered by permissions)
- `PATCH /v1/default/banks/{bank_id}/config` - Update bank overrides (only allowed fields)
- `DELETE /v1/default/banks/{bank_id}/config` - Reset to defaults

#### Permission System

Tenant extensions can control which fields banks are allowed to modify via `get_allowed_config_fields()`:

```python
class CustomTenantExtension(TenantExtension):
    async def get_allowed_config_fields(self, context, bank_id):
        # Return exactly one of the following options.

        # Option 1: Allow all configurable fields
        return None

        # Option 2: Allow specific fields only
        # return {"retain_chunk_size", "retain_custom_instructions"}

        # Option 3: Read-only (no modifications)
        # return set()
```

#### Examples

```bash
# Update retention settings for a bank
curl -X PATCH http://localhost:8888/v1/default/banks/my-bank/config \
  -H "Content-Type: application/json" \
  -d '{
    "updates": {
      "retain_chunk_size": 4000,
      "retain_extraction_mode": "custom",
      "retain_custom_instructions": "Focus on technical details and implementation specifics"
    }
  }'

# Note: retain_extraction_mode must be "custom" to use retain_custom_instructions

# View resolved config (respects permissions)
curl http://localhost:8888/v1/default/banks/my-bank/config

# Reset to defaults
curl -X DELETE http://localhost:8888/v1/default/banks/my-bank/config
```

**Security Notes:**
- Credentials (API keys, base URLs) are never returned in responses
- Only configurable fields can be modified
- Responses are filtered by tenant permissions
- Attempting to set credentials returns a 400 error

### Reverse Proxy / Subpath Deployment

To deploy Hindsight under a subpath (e.g., `example.com/hindsight/`):

1. Set both environment variables to the same path:
   ```bash
   HINDSIGHT_API_BASE_PATH=/hindsight
   NEXT_PUBLIC_BASE_PATH=/hindsight
   ```

2. Configure your reverse proxy to:
   - Forward `/hindsight/*` requests to Hindsight
   - Preserve the full path in forwarded requests
   - Set appropriate proxy headers (X-Forwarded-Proto, X-Forwarded-For)

**Example: Nginx Configuration**

```nginx
location /hindsight/ {
    proxy_pass http://localhost:8888/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
}
```

**Example: Traefik Configuration**

```yaml
http:
  routers:
    hindsight:
      rule: "PathPrefix(`/hindsight`)"
      service: hindsight
      middlewares:
        - hindsight-stripprefix

  middlewares:
    hindsight-stripprefix:
      stripPrefix:
        prefixes:
          - "/hindsight"

  services:
    hindsight:
      loadBalancer:
        servers:
          - url: "http://localhost:8888"
```

**Important Notes:**
- The base path must start with `/` and should NOT end with `/`
- Both API and Control Plane should use the same base path
- After setting environment variables, restart both services
- OpenAPI docs will be available at `<base-path>/docs` (e.g., `/hindsight/docs`)

**Complete Examples:**

See `docker/compose-examples/` directory for:
- Nginx configuration files (`simple.conf`, `api-and-control-plane.conf`)
- Docker Compose setups (`docker-compose.yml`, `reverse-proxy-only.yml`)
- Traefik and other reverse proxy examples
- Full deployment documentation

---

## Example .env File

```bash
# API Service
HINDSIGHT_API_DATABASE_URL=postgresql://hindsight:hindsight_dev@localhost:5432/hindsight
# HINDSIGHT_API_DATABASE_SCHEMA=public  # optional, defaults to 'public'
HINDSIGHT_API_LLM_PROVIDER=groq
HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx

# Authentication (optional, recommended for production)
# HINDSIGHT_API_TENANT_EXTENSION=hindsight_api.extensions.builtin.tenant:ApiKeyTenantExtension
# HINDSIGHT_API_TENANT_API_KEY=your-secret-api-key

# File storage (optional, defaults to PostgreSQL native storage)
# HINDSIGHT_API_FILE_STORAGE_TYPE=s3
# HINDSIGHT_API_FILE_STORAGE_S3_BUCKET=my-hindsight-files
# HINDSIGHT_API_FILE_STORAGE_S3_REGION=us-east-1
# HINDSIGHT_API_FILE_STORAGE_S3_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
# HINDSIGHT_API_FILE_STORAGE_S3_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# Control Plane
HINDSIGHT_CP_DATAPLANE_API_URL=http://localhost:8888
```

---

For configuration issues not covered here, please [open an issue](https://github.com/vectorize-io/hindsight/issues) on GitHub.


---


## File: developer/rag-vs-hindsight.md

# RAG vs Memory

Traditional RAG (Retrieval-Augmented Generation) retrieves documents similar to a query. Hindsight provides structured memory with temporal reasoning, entity understanding, and belief formation.

## Capability Comparison

| Capability | RAG | Hindsight |
|------------|-----|-----------|
| **Search strategy** | Semantic similarity only | Semantic + keyword + graph + temporal |
| **Multi-hop reasoning** | Limited to retrieved chunks | Graph traversal across entity relationships |
| **Temporal queries** | Keyword matching ("spring") | Date parsing and range filtering |
| **Entity understanding** | None | Entity resolution, co-occurrence tracking |
| **Knowledge consolidation** | Stateless | Mental models that synthesize and evolve |
| **Disposition** | None | 3 traits (skepticism, literalism, empathy) influence interpretation |

## Architecture Comparison

### RAG

| Step | Operation |
|------|-----------|
| 1 | Embed query |
| 2 | Vector similarity search |
| 3 | Return top-k chunks |
| 4 | Generate response |

Single retrieval strategy. No state between queries.

### Hindsight

| Step | Operation |
|------|-----------|
| 1 | Parse query (extract temporal expressions, entities) |
| 2 | Execute 4 parallel retrievals: semantic, BM25, graph, temporal |
| 3 | Fuse results with RRF |
| 4 | Rerank with cross-encoder |
| 5 | Apply disposition traits |
| 6 | Generate response |

Multiple retrieval strategies. Persistent state across sessions.
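
Step 3 above fuses the four result lists with Reciprocal Rank Fusion (RRF). The following is a minimal sketch of the general RRF technique, not Hindsight's actual implementation: the constant `k=60` is a commonly used default assumed here, and the memory IDs and the function name `reciprocal_rank_fusion` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of IDs into one list, best first.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical result lists from the four retrieval strategies.
semantic = ["m1", "m2", "m3"]
bm25 = ["m2", "m1", "m4"]
graph = ["m5", "m2"]
temporal = ["m2", "m3"]

fused = reciprocal_rank_fusion([semantic, bm25, graph, temporal])
fused_ids = [item_id for item_id, _ in fused]
print(fused_ids)
```

An item that appears near the top of several lists (here `m2`) outranks an item that is first in only one list, which is why RRF works without comparable scores across strategies.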

## Example Scenarios

### Multi-Hop Reasoning

**Stored facts:**
- "Alice is the tech lead on Project Atlas"
- "Project Atlas uses Kubernetes"
- "Kubernetes cluster had an outage Tuesday"

**Query:** "Was Alice affected by recent issues?"

| System | Result |
|--------|--------|
| RAG | Retrieves facts about Alice only (no semantic similarity to "issues") |
| Hindsight | Traverses Alice → Project Atlas → Kubernetes → outage via entity links |
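
The traversal in the table can be illustrated with a breadth-first search over entity links. This is only a sketch of the idea: the `links` dictionary and the `find_path` helper are hypothetical, and Hindsight's real graph retrieval may differ.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical entity links derived from the three stored facts above.
links: Dict[str, List[str]] = {
    "Alice": ["Project Atlas"],
    "Project Atlas": ["Kubernetes"],
    "Kubernetes": ["outage"],
}


def find_path(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search returning the shortest path from start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no chain of links connects the two entities


path = find_path(links, "Alice", "outage")
print(" → ".join(path) if path else "no connection")
```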

### Temporal Queries

**Stored facts with timestamps:**
- March: "Alice started microservices migration"
- April: "Alice completed auth service"
- October: "Alice focusing on performance"

**Query:** "What did Alice do last spring?"

| System | Result |
|--------|--------|
| RAG | Returns all Alice facts regardless of date |
| Hindsight | Parses "last spring" → March-May, filters to that range |
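
To make the range filtering concrete, here is a small sketch that resolves "last spring" to a date range and filters facts by it. It assumes northern-hemisphere spring (March 1 to May 31) and interprets "last" as the most recent spring that has fully ended; the `last_spring` helper, the years, and the exact dates are hypothetical and are not Hindsight's parser.

```python
from datetime import date
from typing import Tuple


def last_spring(today: date) -> Tuple[date, date]:
    """Return (start, end) of the most recent spring that has fully ended.

    Assumes spring runs from March 1 to May 31.
    """
    year = today.year if today > date(today.year, 5, 31) else today.year - 1
    return date(year, 3, 1), date(year, 5, 31)


# Hypothetical stored facts with event dates.
facts = [
    (date(2024, 3, 10), "Alice started microservices migration"),
    (date(2024, 4, 22), "Alice completed auth service"),
    (date(2024, 10, 5), "Alice focusing on performance"),
]

start, end = last_spring(date(2024, 11, 1))
in_range = [text for when, text in facts if start <= when <= end]
print(in_range)
```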

### Entity Understanding

**Stored facts about a user across sessions:**
- "Pro subscription"
- "Mobile app crashes in settings"
- "Switched to annual billing"
- "Desktop app working fine"

**Query:** "What do you know about my account?"

| System | Result |
|--------|--------|
| RAG | Lists disconnected facts |
| Hindsight | Returns connected facts via entity graph: subscription status, billing, known issues |

### Knowledge Evolution

- **Week 1:** User struggles with async Python, succeeds with threads
- **Week 3:** User asks about asyncio, implements async database calls

| System | Behavior |
|--------|----------|
| RAG | No memory of progression |
| Hindsight | Consolidates mental model "user prefers sync" → refines to "user growing comfortable with async" |

## When to Use Each

| Use Case | Recommended |
|----------|-------------|
| Document Q&A over static corpus | RAG |
| Search with no temporal requirements | RAG |
| AI assistants with persistent memory | Hindsight |
| Applications requiring entity tracking | Hindsight |
| Systems needing consistent disposition | Hindsight |
| Temporal queries ("last month", "in 2023") | Hindsight |


---


## File: sdks/python.md

# Python Client

Official HTTP client for the Hindsight API. Use this when you have a Hindsight server already running — locally, in Docker, or as a managed service — and you want a typed Python client to talk to it.

If you want to **embed and run a Hindsight server in your Python process** (no external server required), see [Embedded Python (hindsight-all)](./hindsight-all.md) instead.

## Installation

```bash
pip install hindsight-client
```

## Quick Start

```python
from hindsight_client import Hindsight

client = Hindsight(base_url="http://localhost:8888")

# Retain a memory
client.retain(bank_id="my-bank", content="Alice works at Google")

# Recall memories
results = client.recall(bank_id="my-bank", query="What does Alice do?")
for r in results.results:
    print(r.text)

# Reflect - generate a contextual answer
answer = client.reflect(bank_id="my-bank", query="Tell me about Alice")
print(answer.text)
```

## Client Initialization

```python
from hindsight_client import Hindsight

client = Hindsight(
    base_url="http://localhost:8888",  # Hindsight API URL
    timeout=30.0,                       # Request timeout in seconds
    # api_key="your-api-key",          # Optional bearer token
)

# Core operations
client.retain(bank_id="test", content="Hello world")
results = client.recall(bank_id="test", query="Hello")

# Organized API namespaces
client.banks.create(bank_id="test", name="Test Bank")
models = client.mental_models.list(bank_id="test")
directives = client.directives.list(bank_id="test")
memories = client.memories.list(bank_id="test")
```

## Core Operations

### Version and Feature Checks

```python
version = client.get_version()

print(version.api_version)

if not version.features.mcp:
    raise RuntimeError("This server does not expose the MCP endpoint")
```

The async client method is available as `await client.aget_version()`.

### Retain (Store Memory)

```python
# Simple
client.retain(
    bank_id="my-bank",
    content="Alice works at Google as a software engineer",
)

# With options
from datetime import datetime

client.retain(
    bank_id="my-bank",
    content="Alice got promoted",
    context="career update",
    timestamp=datetime(2024, 1, 15),
    document_id="conversation_001",
    metadata={"source": "slack"},
    retain_async=False,  # Set True for background processing
)
```

### Retain Batch

```python
client.retain_batch(
    bank_id="my-bank",
    items=[
        {"content": "Alice works at Google", "context": "career"},
        {"content": "Bob is a data scientist", "context": "career"},
    ],
    document_id="conversation_001",
    retain_async=False,  # Set True for background processing
)
```

### Recall (Search)

```python
# Simple - returns a RecallResponse; the memories are in .results
results = client.recall(
    bank_id="my-bank",
    query="What does Alice do?",
)

for r in results.results:
    print(f"{r.text} (type: {r.type})")

# With options
results = client.recall(
    bank_id="my-bank",
    query="What does Alice do?",
    types=["world", "observation"],  # Filter by fact type
    max_tokens=4096,
    budget="high",  # low, mid, or high
)
```

### Recall with Chunks

```python
# Returns RecallResponse with source chunks
response = client.recall(
    bank_id="my-bank",
    query="What does Alice do?",
    types=["world", "experience"],
    budget="mid",
    max_tokens=4096,
    include_chunks=True,
    max_chunk_tokens=500
)

print(f"Found {len(response.results)} memories")
for r in response.results:
    print(f"  - {r.text}")
    if r.chunks:
        print(f"    Source: {r.chunks[0].text[:100]}...")
```

### Reflect (Generate Response)

```python
answer = client.reflect(
    bank_id="my-bank",
    query="What should I know about Alice?",
    budget="low",  # low, mid, or high
    context="preparing for a meeting",
)

print(answer.text)  # Generated response
```

## Bank Management

### Create Bank

```python
client.create_bank(
    bank_id="my-bank",
    name="Assistant",
    mission="You're a helpful AI assistant - keep track of user preferences and conversation history.",
    disposition={
        "skepticism": 3,    # 1-5: trusting to skeptical
        "literalism": 3,    # 1-5: flexible to literal
        "empathy": 3,       # 1-5: detached to empathetic
    },
)
```
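
Since each trait is an integer from 1 to 5, it can be useful to validate a disposition before sending it. The helper below is a hypothetical client-side check, not part of `hindsight_client`; it assumes all three traits must be supplied, which the server may not actually require.

```python
from typing import Dict

TRAITS = ("skepticism", "literalism", "empathy")


def validate_disposition(disposition: Dict[str, int]) -> Dict[str, int]:
    """Check that exactly the three traits are present and each is an int in 1-5."""
    unknown = set(disposition) - set(TRAITS)
    missing = set(TRAITS) - set(disposition)
    if unknown or missing:
        raise ValueError(
            f"unknown traits: {sorted(unknown)}, missing traits: {sorted(missing)}"
        )
    for name, value in disposition.items():
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    return disposition


checked = validate_disposition({"skepticism": 3, "literalism": 3, "empathy": 3})
```

The returned dictionary can then be passed as the `disposition` argument shown above.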

### List Memories

```python
client.list_memories(
    bank_id="my-bank",
    type="world",  # Optional: filter by type
    search_query="Alice",  # Optional: text search
    limit=100,
    offset=0,
)
```
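
The `limit` and `offset` parameters suggest offset-based pagination. The sketch below shows one way to walk all pages with a generic helper. `iter_pages` and `fake_fetch` are hypothetical; in real use the fetch function would wrap `client.list_memories(...)` and return the list of items from the response, whose field name is not shown in this document.

```python
from typing import Callable, Iterator, List


def iter_pages(fetch_page: Callable[[int, int], List], limit: int = 100) -> Iterator:
    """Yield items page by page until a page comes back shorter than `limit`."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            break  # a short (or empty) page means there is nothing left
        offset += limit


# Stand-in data source so the example runs without a server.
fake_store = [f"memory-{i}" for i in range(250)]


def fake_fetch(limit: int, offset: int) -> List[str]:
    return fake_store[offset:offset + limit]


all_items = list(iter_pages(fake_fetch, limit=100))
print(len(all_items))
```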

## Async Support

All methods have async versions prefixed with `a`:

```python
import asyncio

from hindsight_client import Hindsight

async def main():
    client = Hindsight(base_url="http://localhost:8888")

    # Async retain
    await client.aretain(bank_id="my-bank", content="Hello world")

    # Async recall
    results = await client.arecall(bank_id="my-bank", query="Hello")
    for r in results.results:
        print(r.text)

    # Async reflect
    answer = await client.areflect(bank_id="my-bank", query="What did I say?")
    print(answer.text)

    client.close()

asyncio.run(main())
```

## Context Manager

```python
from hindsight_client import Hindsight

with Hindsight(base_url="http://localhost:8888") as client:
    client.retain(bank_id="my-bank", content="Hello")
    results = client.recall(bank_id="my-bank", query="Hello")
# Client automatically closed
```


---


## File: sdks/nodejs.md

# TypeScript / JavaScript Client

Official TypeScript/JavaScript client for the Hindsight API. Supports **Node.js** and **Deno**.

## Installation

### Node.js

```bash
npm install @vectorize-io/hindsight-client
```

### Deno

No installation needed — import directly via the `npm:` specifier:

```typescript
import { HindsightClient } from 'npm:@vectorize-io/hindsight-client';
```

## Quick Start

```typescript
import { HindsightClient } from '@vectorize-io/hindsight-client';

const client = new HindsightClient({ baseUrl: 'http://localhost:8888' });

// Retain a memory
await client.retain('my-bank', 'Alice works at Google');

// Recall memories
const response = await client.recall('my-bank', 'What does Alice do?');
for (const r of response.results) {
    console.log(r.text);
}

// Reflect - generate response with disposition
const answer = await client.reflect('my-bank', 'Tell me about Alice');
console.log(answer.text);
```

## Client Initialization

```typescript
import { HindsightClient } from '@vectorize-io/hindsight-client';

const client = new HindsightClient({
    baseUrl: 'http://localhost:8888',
});
```

## Core Operations

### Version and Feature Checks

```typescript
const version = await client.getVersion();

console.log(version.api_version);

if (!version.features.mcp) {
    throw new Error('This server does not expose the MCP endpoint');
}
```

### Retain (Store Memory)

```typescript
// Simple
await client.retain('my-bank', 'Alice works at Google');

// With options
await client.retain('my-bank', 'Alice got promoted', {
    timestamp: new Date('2024-01-15'),
    context: 'career update',
    metadata: { source: 'slack' },
    async: false,  // Set true for background processing
});
```

### Retain Batch

```typescript
await client.retainBatch('my-bank', [
    { content: 'Alice works at Google', context: 'career' },
    { content: 'Bob is a data scientist', context: 'career' },
], {
    async: false,
});
```

### Recall (Search)

```typescript
// Simple - returns RecallResponse
const response = await client.recall('my-bank', 'What does Alice do?');

for (const r of response.results) {
    console.log(`${r.text} (type: ${r.type})`);
}

// With options
const filtered = await client.recall('my-bank', 'What does Alice do?', {
    types: ['world', 'observation'],  // Filter by fact type
    maxTokens: 4096,
    budget: 'high',  // 'low', 'mid', or 'high'
});
```

### Reflect (Generate Response)

```typescript
const answer = await client.reflect('my-bank', 'What should I know about Alice?', {
    budget: 'low',  // 'low', 'mid', or 'high'
    context: 'preparing for a meeting',
});

console.log(answer.text);       // Generated response
```

## Bank Management

### Create Bank

```typescript
await client.createBank('my-bank', {
    name: 'Assistant',
    mission: "You're a helpful AI assistant - keep track of user preferences and conversation history.",
    disposition: {
        skepticism: 3,   // 1-5: trusting to skeptical
        literalism: 3,   // 1-5: flexible to literal
        empathy: 3,      // 1-5: detached to empathetic
    },
});
```

### List Memories

```typescript
const response = await client.listMemories('my-bank', {
    type: 'world',  // Optional filter
    q: 'Alice',     // Optional text search
    limit: 100,
    offset: 0,
});
console.log(response);
```

## Document Management

### Get Document

```typescript
const doc = await client.getDocument('my-bank', 'conversation_001');
if (doc) {
    console.log(doc);  // null when document not found
}
```

### List Documents

```typescript
const response = await client.listDocuments('my-bank', {
    limit: 50,
    offset: 0,
});
console.log(response);
```

### Update Document

```typescript
await client.updateDocument('my-bank', 'conversation_001', {
    tags: ['important', 'meeting-notes'],
});
```

### Delete Document

```typescript
await client.deleteDocument('my-bank', 'conversation_001');
```


---


## File: sdks/cli.md

# CLI Reference

The Hindsight CLI provides command-line access to memory operations and bank management. All commands follow the [OpenAPI specification](/api-reference), so you can use `--help` on any command to see all available options.

## Installation

```bash
curl -fsSL https://hindsight.vectorize.io/get-cli | bash
```

## Configuration

Configure the API URL:

```bash
# Interactive configuration
hindsight configure

# Or set directly
hindsight configure --api-url http://localhost:8888

# With API key for authentication
hindsight configure --api-url http://localhost:8888 --api-key your-api-key

# Or use environment variables (highest priority)
export HINDSIGHT_API_URL=http://localhost:8888
export HINDSIGHT_API_KEY=your-api-key
```

### Named Profiles

When you need to switch between multiple Hindsight deployments (e.g. local,
staging, production) without constantly rewriting `~/.hindsight/config`, use
named profiles. Each profile is a TOML file at
`~/.hindsight/cli-profiles/<name>.toml` and is selected per-invocation with
`-p/--profile` (or by setting `$HINDSIGHT_PROFILE`).

```bash
# Create (or overwrite) a profile
hindsight profile create prod \
  --api-url https://api.hindsight.vectorize.io \
  --api-key hsk_...

# List and inspect profiles
hindsight profile list
hindsight profile show prod

# Use a profile for a single command
hindsight -p prod bank list

# Or make it sticky for the current shell
export HINDSIGHT_PROFILE=prod
hindsight bank list

# Remove a profile
hindsight profile delete prod -y
```

Profile files are written with `0600` permissions on Unix so the API key is
only readable by the owner.

**Configuration precedence** (highest first):

1. Environment variables (`HINDSIGHT_API_URL`, `HINDSIGHT_API_KEY`)
2. Named profile — explicit `-p <name>`, otherwise `$HINDSIGHT_PROFILE`
3. Shared config file (`~/.hindsight/config`, written by `hindsight configure`)
4. Default (`http://localhost:8888`)

`HINDSIGHT_API_URL` / `HINDSIGHT_API_KEY` always override profile values, which
makes it safe to use `-p` in scripts while letting CI inject credentials via
environment.
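
The precedence order above can be modeled as a first-match lookup across sources. This is an illustrative sketch only: `resolve_setting`, the dictionary keys, and the sample values are hypothetical and do not reflect how the CLI is implemented.

```python
from typing import Dict, Optional

DEFAULT_API_URL = "http://localhost:8888"


def resolve_setting(
    key: str,
    env: Dict[str, str],
    profile: Dict[str, str],
    shared_config: Dict[str, str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the first value found, checking sources from highest to lowest priority."""
    for source in (env, profile, shared_config):
        value = source.get(key)
        if value:  # treat missing and empty values as "not set"
            return value
    return default


# Hypothetical inputs: a profile sets both values, CI overrides only the key.
env = {"api_key": "key-from-ci"}
profile = {"api_url": "https://api.hindsight.vectorize.io", "api_key": "key-from-profile"}
shared_config = {"api_url": "http://staging.internal:8888"}

api_url = resolve_setting("api_url", env, profile, shared_config, DEFAULT_API_URL)
api_key = resolve_setting("api_key", env, profile, shared_config)
print(api_url, api_key)
```

Here the URL comes from the profile while the key comes from the environment, which matches the scripted `-p` plus CI-injected credentials case described above.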

## Core Commands

### Retain (Store Memory)

Store a single memory:

```bash
hindsight memory retain <bank_id> "Alice works at Google as a software engineer"

# With context
hindsight memory retain <bank_id> "Bob loves hiking" --context "hobby discussion"

# Queue for background processing
hindsight memory retain <bank_id> "Meeting notes" --async

# With an event date (ISO 8601 datetime or date)
hindsight memory retain <bank_id> "Project launched" --timestamp 2024-01-15

# Store without a timestamp (overrides the default of "now")
hindsight memory retain <bank_id> "Background fact" --timestamp unset
```
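
The three `--timestamp` behaviors (omitted, an ISO 8601 value, or `unset`) can be summarized in a short sketch. `parse_timestamp_arg` is a hypothetical helper written for illustration; how the CLI actually parses values, including how it treats time zones for date-only input, is not specified here.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_timestamp_arg(value: Optional[str], now: datetime) -> Optional[datetime]:
    """Map a --timestamp style argument to a datetime, or None for 'unset'."""
    if value is None:
        return now   # flag omitted: default to "now"
    if value == "unset":
        return None  # explicitly store without a timestamp
    # Rewrite a trailing 'Z' because older Python versions reject it.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    # fromisoformat accepts both 'YYYY-MM-DD' and full ISO 8601 datetimes.
    return datetime.fromisoformat(value)


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(parse_timestamp_arg("2024-01-15", now))
print(parse_timestamp_arg("unset", now))
```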

### Retain Files

Bulk import from files:

```bash
# Single file
hindsight memory retain-files <bank_id> notes.txt

# Directory (recursive by default)
hindsight memory retain-files <bank_id> ./documents/

# With context
hindsight memory retain-files <bank_id> meeting-notes.txt --context "team meeting"

# With a named retain strategy (see retain_strategies in bank config)
hindsight memory retain-files <bank_id> ./documents/ --strategy conversations

# Background processing
hindsight memory retain-files <bank_id> ./data/ --async
```

### Recall (Search)

Search memories. Recall combines semantic, keyword, graph, and temporal retrieval:

```bash
hindsight memory recall <bank_id> "What does Alice do?"

# With options
hindsight memory recall <bank_id> "hiking recommendations" \
  --budget high \
  --max-tokens 8192

# Filter by fact type
hindsight memory recall <bank_id> "query" --fact-type world,observation

# Filter by tags
hindsight memory recall <bank_id> "query" --tags work,project \
  --tags-match all

# Pin results to a specific time
hindsight memory recall <bank_id> "query" --query-timestamp "2026-01-15T00:00:00Z"

# Show trace information
hindsight memory recall <bank_id> "query" --trace
```

### Reflect (Generate Response)

Generate a response using memories and bank disposition:

```bash
hindsight memory reflect <bank_id> "What do you know about Alice?"

# With additional context
hindsight memory reflect <bank_id> "Should I learn Python?" --context "career advice"

# Higher budget for complex questions
hindsight memory reflect <bank_id> "Summarize my week" --budget high

# Filter by fact type
hindsight memory reflect <bank_id> "query" \
  --fact-types world,experience \
  --exclude-mental-models
```

### Memory History

View the observation history for a specific memory unit:

```bash
hindsight memory history <bank_id> <memory_id>
```

### Clear Observations

Remove all observations for a memory unit, keeping the core fact:

```bash
hindsight memory clear-observations <bank_id> <memory_id>

# Skip confirmation prompt
hindsight memory clear-observations <bank_id> <memory_id> -y
```

## Bank Management

### List Banks

```bash
hindsight bank list
```

### View Disposition

```bash
hindsight bank disposition <bank_id>
```

### Set Disposition

```bash
hindsight bank set-disposition <bank_id> --skepticism 3 --literalism 4 --empathy 5
```

### View Statistics

```bash
hindsight bank stats <bank_id>
```

### Set Bank Name

```bash
hindsight bank name <bank_id> "My Assistant"
```

### Set Mission

```bash
hindsight bank mission <bank_id> "I am a helpful AI assistant interested in technology"
```

### Clear Observations (Bank-wide)

Remove all observations across the entire bank:

```bash
hindsight bank clear-observations <bank_id>

# Skip confirmation prompt
hindsight bank clear-observations <bank_id> -y
```

### Recover Consolidation

Recover from a failed or stuck consolidation:

```bash
hindsight bank consolidation-recover <bank_id>
```

## Document Management

```bash
# List documents
hindsight document list <bank_id>

# Get document details
hindsight document get <bank_id> <document_id>

# Update document metadata
hindsight document update <bank_id> <document_id> --context "updated context"

# Delete document and its memories
hindsight document delete <bank_id> <document_id>
```

## Entity Management

```bash
# List entities
hindsight entity list <bank_id>

# Get entity details
hindsight entity get <bank_id> <entity_id>
```

## Operation Management

Track and manage async operations (retain-files, consolidation, etc.):

```bash
# List operations
hindsight operation list <bank_id>

# Get operation status
hindsight operation get <bank_id> <operation_id>

# Cancel a pending operation
hindsight operation cancel <bank_id> <operation_id>

# Retry a failed operation
hindsight operation retry <bank_id> <operation_id>
```

## Webhook Management

Configure event delivery hooks for bank activity:

```bash
# List webhooks
hindsight webhook list <bank_id>

# Create a webhook (defaults to consolidation.completed events)
hindsight webhook create <bank_id> https://example.com/hook

# Create with specific events and signing secret
hindsight webhook create <bank_id> https://example.com/hook \
  --event-types retain.completed,consolidation.completed \
  --secret my-hmac-secret

# Update a webhook
hindsight webhook update <bank_id> <webhook_id> --url https://new-url.com

# Delete a webhook
hindsight webhook delete <bank_id> <webhook_id>

# View delivery history
hindsight webhook deliveries <bank_id> <webhook_id>
```
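
When a webhook is created with `--secret`, the receiver would normally verify each delivery against that secret. The sketch below shows a typical HMAC check, under loud assumptions: that the signature is a hex-encoded HMAC-SHA256 of the raw request body, and that it arrives in some request header. Neither the algorithm, the encoding, nor the header name is confirmed by this document, so check the webhook reference before relying on it.

```python
import hashlib
import hmac


def sign_payload(secret: str, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw request body (assumed scheme)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(secret: str, body: bytes, received_signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, received_signature)


# Hypothetical delivery: the body shape and signature format are assumptions.
body = b'{"event": "consolidation.completed", "bank_id": "demo"}'
signature = sign_payload("my-hmac-secret", body)

print(verify_signature("my-hmac-secret", body, signature))
```

Using `hmac.compare_digest` rather than `==` avoids leaking information through comparison timing.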

## Audit Logs

Inspect the audit trail for a bank:

```bash
# List audit entries
hindsight audit list <bank_id>

# Filter by action and transport
hindsight audit list <bank_id> --action recall --transport mcp

# Filter by date range
hindsight audit list <bank_id> \
  --start-date "2026-04-01T00:00:00Z" \
  --end-date "2026-04-10T00:00:00Z"

# Pagination
hindsight audit list <bank_id> --limit 50 --offset 100
```

## Output Formats

```bash
# Pretty (default)
hindsight memory recall <bank_id> "query"

# JSON
hindsight memory recall <bank_id> "query" -o json

# YAML
hindsight memory recall <bank_id> "query" -o yaml
```

## Global Options

| Flag | Description |
|------|-------------|
| `-v, --verbose` | Show detailed output including request/response |
| `-o, --output <format>` | Output format: pretty, json, yaml |
| `--help` | Show help |
| `--version` | Show version |

## Control Plane UI

Launch the web-based Control Plane UI directly from the CLI:

```bash
hindsight ui
```

This runs the Control Plane locally on port 9999 using the API URL from your configuration. The UI provides:

- **Memory bank management** — Browse and manage all your banks
- **Entity explorer** — Visualize the knowledge graph
- **Query testing** — Interactive recall and reflect testing
- **Operation history** — View ingestion and processing logs

:::tip
The UI command requires Node.js to be installed. It automatically downloads and runs the `@vectorize-io/hindsight-control-plane` package via npx.
:::

## Interactive Explorer

Launch the TUI explorer for visual navigation of your memory banks:

```bash
hindsight explore
```

The explorer provides an interactive terminal interface to:

- **Browse memory banks** — View all banks and their statistics
- **Search memories** — Run recall queries with real-time results
- **Inspect entities** — Explore the knowledge graph and entity relationships
- **View facts** — Browse world facts, experiences, and observations
- **Navigate documents** — See source documents and their extracted memories

### Keyboard Shortcuts

| Key | Action |
|-----|--------|
| `↑/↓` | Navigate items |
| `Enter` | Select / Expand |
| `Tab` | Switch panels |
| `/` | Search |
| `q` | Quit |


## Example Workflow

```bash
# Configure API URL
hindsight configure --api-url http://localhost:8888

# Store some memories
hindsight memory retain demo "Alice works at Google"
hindsight memory retain demo "Bob is a data scientist"
hindsight memory retain demo "Alice and Bob are colleagues"

# Search memories
hindsight memory recall demo "Who works with Alice?"

# Generate a response
hindsight memory reflect demo "What do you know about the team?"

# Check bank disposition
hindsight bank disposition demo
```


---


## File: developer/admin-cli.md

# Admin CLI

The `hindsight-admin` CLI provides administrative commands for managing your Hindsight deployment, including database migrations, backup, and restore operations.

## Installation

The admin CLI is included with the `hindsight-api` package — installing it puts the `hindsight-admin` executable on your `PATH`:

```bash
pip install hindsight-api
# or
uv add hindsight-api
```

## Running the CLI

`hindsight-admin` connects **directly to PostgreSQL** — it does not call the HTTP API. It reads the **same configuration as the API service** (environment variables, and a `.env` file in the current working directory), so it operates on whatever database `HINDSIGHT_API_DATABASE_URL` points to:

- **Default**: `pg0`, the embedded development database (must be run on the host that owns the pg0 data).
- **Production**: set `HINDSIGHT_API_DATABASE_URL=postgresql://user:pass@host:5432/hindsight`.

Because it talks to the database directly (binary `COPY`, `TRUNCATE`, etc.), the admin CLI is **PostgreSQL-only** (not supported on Oracle). Run it on the same host/container as your API deployment so it inherits the right configuration and has network access to the database:

```bash
# Bare metal / virtualenv (with the API's env or a .env in the working dir)
hindsight-admin worker-status

# Docker — exec into the API container
docker exec -it hindsight-api hindsight-admin backup /data/backup.zip

# Kubernetes — exec into an API pod
kubectl exec deploy/hindsight-api -- hindsight-admin run-db-migration
```

Use `--schema` to target a specific tenant schema (commands default to the configured base schema). See [Environment Variables](#environment-variables) below.

## Commands

### run-db-migration

Run database migrations to the latest version. By default this migrates the base schema plus all tenant schemas discovered by the tenant extension. Use `--schema` for targeted migration of one schema. This is useful when you want to run migrations separately from API startup (e.g., in CI/CD pipelines or before deploying a new version).

```bash
hindsight-admin run-db-migration [OPTIONS]
```

**Options:**

| Option | Description | Default |
|--------|-------------|---------|
| `--schema`, `-s` | Database schema to run migrations on. If omitted, migrate the base schema plus all discovered tenant schemas. | All schemas |
| `--embedding-dimension` | Expected embedding dimension to enforce after migrations. Omit to skip the post-migration dimension sync. | Skipped |
| `--skip-extension-reconcile` | Skip the post-migration vector / text-search index reconcile (it only does work when `HINDSIGHT_API_VECTOR_EXTENSION` / `HINDSIGHT_API_TEXT_SEARCH_EXTENSION` differs from a schema's existing indexes). Makes a no-change re-migration across many tenant schemas much faster; only use when the backend is unchanged. | Reconcile runs |

**Examples:**

```bash
# Run migrations on the base schema plus all discovered tenant schemas
hindsight-admin run-db-migration

# Run migrations on a specific tenant schema
hindsight-admin run-db-migration --schema tenant_acme
```

:::tip Disabling Auto-Migrations
To disable automatic migrations on API startup, set `HINDSIGHT_API_RUN_MIGRATIONS_ON_STARTUP=false`. This is useful when you want to run migrations as a separate step in your deployment pipeline.
:::

---

### backup

Create a backup of all Hindsight data to a zip file.

```bash
hindsight-admin backup OUTPUT [OPTIONS]
```

**Arguments:**

| Argument | Description |
|----------|-------------|
| `OUTPUT` | Output file path (will add `.zip` extension if not present) |

**Options:**

| Option | Description | Default |
|--------|-------------|---------|
| `--schema`, `-s` | Database schema to backup | `public` |

**Examples:**

```bash
# Backup to a file
hindsight-admin backup /backups/hindsight-2024-01-15.zip

# Backup a specific tenant schema
hindsight-admin backup /backups/tenant-acme.zip --schema tenant_acme
```

The backup includes:
- Memory banks and their configuration
- Documents and chunks
- Entities and their relationships
- Memory units (facts, experiences, observations)
- Entity cooccurrences and memory links
- Mental models and directives
- Webhooks and file storage
- Internal operational tables (async operations, audit log, graph-maintenance queue, and similar bookkeeping) so a restore reproduces a faithful full-database snapshot

:::note Consistency
Backups are created within a database transaction with `REPEATABLE READ` isolation, ensuring a consistent snapshot across all tables.
:::

---

### restore

Restore data from a backup file. **Warning: This deletes all existing data in the target schema.**

```bash
hindsight-admin restore INPUT [OPTIONS]
```

**Arguments:**

| Argument | Description |
|----------|-------------|
| `INPUT` | Input backup file (.zip) |

**Options:**

| Option | Description | Default |
|--------|-------------|---------|
| `--schema`, `-s` | Database schema to restore to | `public` |
| `--yes`, `-y` | Skip confirmation prompt | `false` |

**Examples:**

```bash
# Restore with confirmation prompt
hindsight-admin restore /backups/hindsight-2024-01-15.zip

# Restore without confirmation (for scripts)
hindsight-admin restore /backups/hindsight-2024-01-15.zip --yes

# Restore to a specific tenant schema
hindsight-admin restore /backups/tenant-acme.zip --schema tenant_acme --yes
```

:::warning Data Loss
Restore will **delete all existing data** in the target schema before importing the backup. Always verify you have a recent backup before performing a restore.
:::

---

### decommission-worker

Release all tasks owned by a worker, resetting them from "processing" back to "pending" status so they can be picked up by other workers.

```bash
hindsight-admin decommission-worker WORKER_ID [OPTIONS]
```

**Arguments:**

| Argument | Description |
|----------|-------------|
| `WORKER_ID` | ID of the worker to decommission |

**Options:**

| Option | Description | Default |
|--------|-------------|---------|
| `--schema`, `-s` | Database schema | `public` |
| `--yes`, `-y` | Skip confirmation prompt | `false` |

**Examples:**

```bash
# Before scaling down - release tasks from workers being removed
hindsight-admin decommission-worker hindsight-worker-4
hindsight-admin decommission-worker hindsight-worker-3

# Release tasks from a crashed worker
hindsight-admin decommission-worker worker-2

# For a specific tenant schema
hindsight-admin decommission-worker worker-1 --schema tenant_acme
```

**When to Use:**

- **Scaling down**: Before removing worker replicas in Kubernetes
- **Graceful removal**: When taking a worker offline for maintenance
- **Crash recovery**: If a worker crashed while processing tasks
- **Stuck worker**: When a worker is unresponsive

:::tip Finding Worker IDs
Worker IDs default to the hostname. In Kubernetes StatefulSets, this is the pod name (e.g., `hindsight-worker-0`). You can also set a custom ID with `HINDSIGHT_API_WORKER_ID` or `--worker-id`.
:::

---


### decommission-workers

Release all currently-processing tasks from every worker, resetting them from "processing" back to "pending" status. Use this when one or more workers have crashed or been removed without graceful shutdown and you don't know which worker IDs to target.

```bash
hindsight-admin decommission-workers [OPTIONS]
```

**Options:**

| Option | Description | Default |
|--------|-------------|---------|
| `--schema`, `-s` | Database schema | `public` |
| `--yes`, `-y` | Skip confirmation prompt | `false` |

**Examples:**

```bash
# Release all processing tasks across all workers (with confirmation)
hindsight-admin decommission-workers

# Skip the confirmation prompt (useful in scripts)
hindsight-admin decommission-workers --yes

# Release tasks in a specific tenant schema
hindsight-admin decommission-workers --schema tenant_acme
```

**When to Use:**

- **Unknown dead workers**: Multiple workers crashed and you do not know their IDs
- **Fleet-wide recovery**: After an infrastructure event where many workers went down
- **"Just fix everything"**: A quick reset of every in-flight task when per-worker cleanup is overkill

:::warning Disruptive
This releases **every** processing task regardless of worker, including tasks owned by healthy workers. Prefer `decommission-worker <WORKER_ID>` when you know which workers need cleanup.
:::

---

### worker-status

Show all currently-processing tasks grouped by worker, including operation type, bank, how long each task has been running, and when it was last updated. Useful for identifying orphaned tasks before decommissioning.

```bash
hindsight-admin worker-status [OPTIONS]
```

**Options:**

| Option | Description | Default |
|--------|-------------|---------|
| `--schema`, `-s` | Database schema | `public` |

**Examples:**

```bash
# Show all processing tasks across all workers
hindsight-admin worker-status

# Show processing tasks for a specific tenant schema
hindsight-admin worker-status --schema tenant_acme
```

**When to Use:**

- **Before decommissioning**: Inspect which workers have stale tasks and how long they have been stuck
- **Debugging throughput**: Diagnose why the queue is not draining (are tasks stuck in processing?)
- **Worker health check**: Spot workers whose `last_update_ago` keeps growing, indicating a dead or unresponsive worker
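
The health check described above amounts to flagging workers whose tasks have not been updated for too long. The sketch below illustrates that logic on hypothetical rows; the dictionary shape, the 10-minute threshold, and `find_stale_workers` are assumptions, and the real `worker-status` output format may differ.

```python
from datetime import timedelta
from typing import Dict, List


def find_stale_workers(tasks: List[Dict], threshold: timedelta = timedelta(minutes=10)) -> List[str]:
    """Return sorted worker IDs that own at least one task not updated within `threshold`."""
    return sorted({t["worker_id"] for t in tasks if t["last_update_ago"] > threshold})


# Hypothetical rows, loosely modeled on what worker-status reports.
tasks = [
    {"worker_id": "hindsight-worker-0", "last_update_ago": timedelta(seconds=20)},
    {"worker_id": "hindsight-worker-1", "last_update_ago": timedelta(minutes=45)},
    {"worker_id": "hindsight-worker-1", "last_update_ago": timedelta(minutes=50)},
    {"worker_id": "hindsight-worker-2", "last_update_ago": timedelta(minutes=3)},
]

stale = find_stale_workers(tasks)
print(stale)
```

Workers flagged this way are candidates for `hindsight-admin decommission-worker <WORKER_ID>`.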

---

### export-bank

Export an entire bank to a portable ZIP archive — documents, facts, observations, bank configuration, mental models, directives, and webhooks. Embeddings are **never** included; they are regenerated on import. This is the source half of a cross-instance migration (e.g. moving to a different embedding model, vector extension, or text-search backend). PostgreSQL only.

```bash
hindsight-admin export-bank --bank <BANK_ID> --output <FILE.zip> [OPTIONS]
```

**Options:**

| Option | Description | Default |
|--------|-------------|---------|
| `--bank`, `-b` | Bank id to export. | (required) |
| `--output`, `-o` | Path to write the `.zip` archive. | (required) |
| `--schema`, `-s` | Schema the bank lives in. | base schema |
| `--include-history` | Also export operational history (`audit_log`, `llm_requests`). | `false` |

**Examples:**

```bash
hindsight-admin export-bank --bank my-bank --output my-bank.zip

# include operational history
hindsight-admin export-bank --bank my-bank --output my-bank.zip --include-history
```

Read-only — safe to run against a live instance.

---

### import-bank

Restore a whole-bank archive (produced by `export-bank`) into **this** instance. Facts are re-embedded with this instance's configured embedding model and links/indexes are rebuilt; bank configuration, mental models, directives, and webhooks are restored exactly. No LLM fact-extraction runs, and because a migration restores state, it does **not** fire webhooks or re-run consolidation. PostgreSQL only.

```bash
hindsight-admin import-bank --archive <FILE.zip> [OPTIONS]
```

**Options:**

| Option | Description | Default |
|--------|-------------|---------|
| `--archive`, `-a` | Path to the `.zip` produced by `export-bank`. | (required) |
| `--schema`, `-s` | Target schema. | base schema |
| `--target-bank` | Override the bank id (defaults to the archive's source bank). | source bank |
| `--include-history` | Also restore history if present in the archive. | `false` |

**Examples:**

```bash
hindsight-admin import-bank --archive my-bank.zip
```

Run this against an instance configured with the **target** embedding model / vector extension / text-search backend — that's what re-embedding uses.

:::warning Target bank must not exist
Import restores a **whole bank** (config, facts, mental models, …) — it is **not a merge**. If a bank with the target id already exists, the command fails. Delete that bank first, or use `--target-bank` to restore under a fresh id.
:::

---

## Migrating a bank to a new instance

Changing a bank's **embedding model** (e.g. a 384-dim encoder → a 1024-dim one), **vector extension** (pgvector / vchord / pgvectorscale), or **text-search backend** can't be done in place on a populated bank — the stored vectors and indexes are tied to those settings. Because every embedding and index is a deterministic function of text already on disk, the supported path is to **move the bank to a fresh instance configured with the new settings and re-derive everything there — with no LLM re-extraction**.

`export-bank` / `import-bank` carry documents, facts, observations, bank config, mental models, directives, and webhooks — but never embeddings, which the target instance regenerates with its own model.

**Blue-green runbook:**

1. Stand up a **new instance** on a fresh database, configured with the new embedding model / vector extension / text-search backend.
2. Quiesce writes to the source bank (maintenance window) and run `hindsight-admin backup` for safety.
3. Export from the source, then import into the target:
   ```bash
   # on the source instance:
   hindsight-admin export-bank --bank my-bank --output my-bank.zip
   # on the target instance (configured with the new settings):
   hindsight-admin import-bank --archive my-bank.zip
   ```
4. Verify on the target: run representative recall queries and compare results.
5. Cut traffic over to the new instance. The old instance stays as an instant rollback until you're confident.
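
Step 4 can be made measurable by running the same query on both instances and comparing the top results. The sketch below is a minimal illustration, assuming you have already collected the ranked result texts from recall on the source and the target. The function name `overlap_at_k` and the sample results are hypothetical and not part of Hindsight. Because the target re-embeds with a different model, expect high overlap rather than identical rankings.

```python
def overlap_at_k(source_results, target_results, k=10):
    """Fraction of the source's top-k results that also appear in the target's top-k."""
    source_top = source_results[:k]
    target_top = set(target_results[:k])
    if not source_top:
        return 1.0  # nothing to compare
    hits = sum(1 for item in source_top if item in target_top)
    return hits / len(source_top)


# Hypothetical recall outputs for one query, most relevant first.
source = ["Alice joined Google", "Alice leads research", "Bob suggested Summer Vibes"]
target = ["Alice leads research", "Alice joined Google", "They chose Beach Beats"]

score = overlap_at_k(source, target, k=3)
print(f"overlap@3 = {score:.2f}")
```

Which overlap is acceptable depends on the query set and on how different the two embedding models are, so choose the threshold from your own representative queries.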

:::note Why a new instance, not in-place
The embedding model is server-level, and a bank's `memory_units.embedding` column has a single dimension shared across the schema, so a different-dimension or different-backend bank needs its own instance/database. The old vectors are never mutated, which makes rollback trivial.
:::

---

## Recovering stuck or zombie operations

A "zombie" operation is one stuck in `processing` indefinitely because the worker that claimed it is gone. The most common cause is an unstable `HINDSIGHT_API_WORKER_ID`: when it defaults to the container hostname, a Docker restart produces a new container ID, the new worker doesn't recognize the old worker's claims as its own, and those tasks are stranded.

**How to spot them:**

```bash
# List processing tasks grouped by worker — workers with a growing last_update_ago are dead
hindsight-admin worker-status

# Bank-level counters; pending_consolidation that never decreases is the usual symptom
curl -s http://localhost:8888/v1/default/banks/<bank_id>/stats
```

**How to recover:**

```bash
# You know which worker is dead (e.g. from worker-status):
hindsight-admin decommission-worker <old-worker-id>

# You don't know — release every processing task across the fleet:
hindsight-admin decommission-workers
```

Both commands reset `processing` rows back to `pending` so a live worker can claim them on the next poll.
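
The effect of the two commands can be modelled in a few lines. This is only a sketch of the semantics described above, not the CLI's implementation: the row shape (`status`, `worker_id`) and the function name `release_tasks` are assumptions made for illustration.

```python
def release_tasks(tasks, worker_id=None):
    """Reset 'processing' tasks to 'pending' and return how many were released.

    With worker_id, only that worker's claims are released (like decommission-worker).
    With worker_id=None, every processing task is released (like decommission-workers).
    """
    released = 0
    for task in tasks:
        if task["status"] != "processing":
            continue
        if worker_id is not None and task["worker_id"] != worker_id:
            continue
        task["status"] = "pending"
        task["worker_id"] = None  # assumption: the claim is cleared on release
        released += 1
    return released


tasks = [
    {"id": 1, "status": "processing", "worker_id": "old-container-id"},
    {"id": 2, "status": "processing", "worker_id": "hindsight-prod"},
    {"id": 3, "status": "completed", "worker_id": "old-container-id"},
]
print(release_tasks(tasks, worker_id="old-container-id"))
```

Only task 1 is released here: task 2 belongs to a different worker and task 3 is not in `processing`.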

**How to prevent it:**

Set `HINDSIGHT_API_WORKER_ID` to a stable value so worker identity survives restarts:

- **Docker**: pass `-e HINDSIGHT_API_WORKER_ID=hindsight-prod` (or per-replica names if running multiple containers)
- **Kubernetes (Helm)**: the chart's StatefulSet uses the pod name automatically — no extra config needed
- **Bare metal / pip**: pass `--worker-id <name>` or set the env var per process

See [Installation - Docker](./installation#docker) and [Configuration - Distributed Workers](./configuration#distributed-workers).

---

## Environment Variables

The admin CLI uses the same environment variables as the API service. The most important one is:

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_DATABASE_URL` | PostgreSQL connection string | `pg0` (embedded) |

**Example:**

```bash
# Use a specific database
export HINDSIGHT_API_DATABASE_URL=postgresql://user:pass@localhost:5432/hindsight
hindsight-admin backup /backups/mybackup.zip
```


---


## File: developer/api/bank-templates.mdx

# Bank Templates

Declarative JSON manifests for creating pre-configured memory banks with a single API call.




## Overview

A bank template is a JSON manifest that describes a bank's full setup: configuration overrides, mental models, directives, and more. Instead of making multiple API calls to configure a bank, you submit one manifest and the API provisions everything.

Templates are useful for:
- **Replication** — stamp out identically-configured banks for multiple users or agents
- **Onboarding** — new users start with a known-good configuration instead of configuring from scratch
- **Sharing** — distribute recommended setups as portable JSON files
- **Framework integrations** — ship a recommended template alongside your integration

Browse the [Bank Templates Hub](/templates) for ready-to-use templates.

## Manifest Schema

```json
{
  "version": "1",
  "bank": {
    "reflect_mission": "...",
    "retain_mission": "...",
    "retain_extraction_mode": "concise | verbose | custom | chunks",
    "retain_custom_instructions": "...",
    "retain_chunk_size": 2048,
    "retain_structured_chunk_size": 8192,
    "disposition_skepticism": 3,
    "disposition_literalism": 3,
    "disposition_empathy": 3,
    "enable_observations": true,
    "observations_mission": "...",
    "entity_labels": [{ "key": "sentiment", "type": "value", "values": [{ "value": "positive" }, { "value": "negative" }] }],
    "entities_allow_free_form": true
  },
  "mental_models": [
    {
      "id": "unique-lowercase-id",
      "name": "Human-Readable Name",
      "source_query": "The query that generates this mental model's content",
      "tags": ["optional", "tags"],
      "max_tokens": 2048,
      "trigger": {
        "refresh_after_consolidation": false,
        "fact_types": ["world", "experience", "observation"],
        "exclude_mental_models": false,
        "exclude_mental_model_ids": []
      }
    }
  ],
  "directives": [
    {
      "name": "directive-name",
      "content": "The directive instruction text",
      "priority": 0,
      "is_active": true,
      "tags": ["optional", "tags"]
    }
  ]
}
```

### Fields

| Field | Required | Description |
|-------|----------|-------------|
| `version` | Yes | Schema version. Currently `"1"`. |
| `bank` | No | Bank configuration overrides. Omit to leave config unchanged. |
| `mental_models` | No | Mental models to create or update. Omit to leave unchanged. |
| `directives` | No | Directives to create or update. Omit to leave unchanged. |

All of `bank`, `mental_models`, and `directives` are optional. Omit any section to leave that part of the bank unchanged.

### Bank Config Fields

All fields in `bank` are optional. Only the fields you include will be set as per-bank overrides — everything else inherits from the server/tenant defaults.

| Field | Type | Description |
|-------|------|-------------|
| `reflect_mission` | string | Mission/context for reflect operations |
| `retain_mission` | string | Steers what gets extracted during retain |
| `retain_extraction_mode` | string | `concise`, `verbose`, `custom`, or `chunks` |
| `retain_custom_instructions` | string | Custom extraction prompt (requires `retain_extraction_mode` set to `custom`) |
| `retain_chunk_size` | integer | Target max characters per content chunk |
| `retain_structured_chunk_size` | integer | Max characters for a single JSONL line or conversation turn to keep whole; defaults to `retain_chunk_size` when unset |
| `disposition_skepticism` | integer (1-5) | How skeptical the disposition is |
| `disposition_literalism` | integer (1-5) | How literal the disposition is |
| `disposition_empathy` | integer (1-5) | How empathetic the disposition is |
| `enable_observations` | boolean | Toggle observation consolidation |
| `observations_mission` | string | Controls what gets synthesised into observations |
| `entity_labels` | object[] | Controlled vocabulary as label groups — see [Memory Banks → entity_labels](./memory-banks#entity-labels) |
| `entities_allow_free_form` | boolean | Allow entities outside the label vocabulary |

### Mental Model Fields

| Field | Required | Description |
|-------|----------|-------------|
| `id` | Yes | Unique ID (lowercase alphanumeric with hyphens). Used to match on re-import. |
| `name` | Yes | Human-readable name |
| `source_query` | Yes | The query that generates this model's content via reflect |
| `tags` | No | Tags for scoped visibility. Default: `[]` |
| `max_tokens` | No | Max tokens for generated content (256-8192). Default: `2048` |
| `trigger` | No | Trigger settings for auto-refresh |

### Directive Fields

| Field | Required | Description |
|-------|----------|-------------|
| `name` | Yes | Directive name. Used as the match key on re-import. |
| `content` | Yes | The directive instruction text. |
| `priority` | No | Priority value (higher = more important). Default: `0` |
| `is_active` | No | Whether the directive is active. Default: `true` |
| `tags` | No | Tags for categorization. Default: `[]` |
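
The field rules in the tables above lend themselves to a quick local pre-check before you submit a manifest. The sketch below covers only the constraints stated on this page (required fields, the `max_tokens` range, and the ID format). The function name `check_manifest` is hypothetical, and the regular expression is one reading of "lowercase alphanumeric with hyphens". The server's own validation remains authoritative.

```python
import re

# Assumption: "lowercase alphanumeric with hyphens" read as this character class.
ID_PATTERN = re.compile(r"^[a-z0-9-]+$")


def check_manifest(manifest):
    """Return a list of problems found; an empty list means the pre-check passed."""
    problems = []
    if manifest.get("version") != "1":
        problems.append('version must be "1"')
    for i, model in enumerate(manifest.get("mental_models", [])):
        for field in ("id", "name", "source_query"):
            if not model.get(field):
                problems.append(f"mental_models[{i}]: missing {field}")
        if model.get("id") and not ID_PATTERN.match(model["id"]):
            problems.append(f"mental_models[{i}]: id must be lowercase alphanumeric with hyphens")
        max_tokens = model.get("max_tokens", 2048)
        if not isinstance(max_tokens, int) or not 256 <= max_tokens <= 8192:
            problems.append(f"mental_models[{i}]: max_tokens must be between 256 and 8192")
    for i, directive in enumerate(manifest.get("directives", [])):
        for field in ("name", "content"):
            if not directive.get(field):
                problems.append(f"directives[{i}]: missing {field}")
    return problems


manifest = {
    "version": "1",
    "mental_models": [
        {"id": "Team Overview", "name": "Team Overview",
         "source_query": "Who is on the team?", "max_tokens": 128}
    ],
    "directives": [{"name": "be-direct"}],
}
for problem in check_manifest(manifest):
    print(problem)
```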

## Import

Import a manifest into a bank. If the bank doesn't exist, it's created automatically.



### Behavior

- **Config**: all `bank` fields are applied as per-bank config overrides
- **Mental models**: matched by `id` — existing models are updated, new ones are created
- **Directives**: matched by `name` — existing directives are updated, new ones are created
- **Async**: mental model content is generated asynchronously. The response includes `operation_ids` to track progress.
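
The matching rules for mental models and directives amount to an upsert keyed on `id` or `name`. The sketch below illustrates that idea under two assumptions that this page does not confirm: an update merges fields rather than replacing the item wholesale, and items absent from the manifest are left in place. The helper name `upsert_by_key` is hypothetical.

```python
def upsert_by_key(existing, incoming, key):
    """Merge incoming items into existing ones, matching on `key`.

    Matched items are updated (incoming fields win); unmatched items are appended.
    Items that are absent from `incoming` are left untouched.
    """
    merged = {item[key]: dict(item) for item in existing}
    for item in incoming:
        merged.setdefault(item[key], {}).update(item)
    return list(merged.values())


existing_models = [
    {"id": "team-overview", "name": "Team", "max_tokens": 2048},
    {"id": "roadmap", "name": "Roadmap", "max_tokens": 2048},
]
manifest_models = [
    {"id": "team-overview", "name": "Team Overview"},   # matched by id: updated
    {"id": "risks", "name": "Open Risks"},              # new id: created
]
for model in upsert_by_key(existing_models, manifest_models, key="id"):
    print(model["id"], "->", model["name"])
```

Directives follow the same pattern with `key="name"`.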

### Dry Run

Validate a manifest without applying changes:



The response describes what *would* happen (which config would be applied, which mental models would be created) without making any changes. If the manifest is invalid, the API returns HTTP 400 with a detailed error message.

## Export

Export a bank's current config overrides, mental models, and directives as a manifest:



The exported manifest only includes config fields that were explicitly set as per-bank overrides — not the fully resolved config (which includes server/tenant defaults). This means the exported manifest is portable: importing it into a new bank only overrides the fields that were intentionally customized.
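
The difference between overrides and resolved config can be shown with plain dictionaries. This is a sketch of the layering idea only: the `resolve` helper is hypothetical, and the default values on the second server are invented for illustration.

```python
def resolve(defaults, overrides):
    """Effective config: server/tenant defaults with per-bank overrides layered on top."""
    return {**defaults, **overrides}


# Per-bank overrides are the only thing an exported manifest carries in its "bank" section.
bank_overrides = {"retain_extraction_mode": "verbose"}

source_defaults = {"retain_extraction_mode": "concise", "retain_chunk_size": 3000}
target_defaults = {"retain_extraction_mode": "concise", "retain_chunk_size": 4000}  # hypothetical

print(resolve(source_defaults, bank_overrides))
print(resolve(target_defaults, bank_overrides))
```

On the target, the imported bank keeps its customized extraction mode but picks up the target's own chunk size, because `retain_chunk_size` was never overridden.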

### Round-trip

Export from one bank and import into another to replicate the setup:



## JSON Schema

The manifest format is defined by a JSON Schema. Fetch the live schema from your server:



The static schema is also available at [bank-template-schema.json](/bank-template-schema.json).

## Control Plane

The control plane bank creation dialog includes an optional "Import from template" toggle. Enable it to paste a manifest JSON and pre-configure the bank on creation.

You can also export any bank's template from the bank Settings page via **Actions → Export Template**, which copies the manifest JSON to your clipboard.

## Versioning

The `version` field enables forward-compatible schema evolution. The current version is `"1"`.

When future versions are released:
- Older manifests are automatically upgraded to the current schema on import
- Export always produces the latest version
- The API rejects manifests with a version newer than what the server supports (with a clear error message suggesting an upgrade)

This means old templates keep working indefinitely — no need to manually update them.
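
The three rules above boil down to a single comparison. The sketch below assumes that versions are integer strings that can be compared numerically, which this page implies but does not state. The function name and return labels are hypothetical.

```python
def check_version(manifest_version, supported):
    """Classify a manifest version against the newest version the server supports."""
    version = int(manifest_version)
    if version > supported:
        raise ValueError(
            f"manifest version {version} is newer than supported version {supported}; "
            "upgrade the server"
        )
    return "upgrade-on-import" if version < supported else "current"


print(check_version("1", supported=1))
```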


---


## File: developer/api/documents.mdx

# Documents

Track and manage document sources in your memory bank. Documents provide traceability — knowing where memories came from.




:::tip Prerequisites
Make sure you've completed the [Quick Start](./quickstart) and understand [how retain works](./retain).
:::

## What Are Documents?

Documents are containers for retained content. They help you:

- **Track sources** — Know which PDF, conversation, or file a memory came from
- **Update content** — Re-retain a document to update its facts
- **Delete in bulk** — Remove all memories from a document at once
- **Organize memories** — Group related facts by source

## Chunks

When you retain content, Hindsight splits it into chunks before extracting facts. These chunks are stored alongside the extracted memories, preserving the original text segments.

**Why chunks matter:**
- **Context preservation** — Chunks contain the raw text that generated facts, useful when you need the exact wording
- **Richer recall** — Including chunks in recall provides surrounding context for matched facts

:::tip Include Chunks in Recall
Use `include_chunks=True` in your recall calls to get the original text chunks alongside fact results. See [Recall](./recall) for details.
:::

## Retain with Document ID

Associate retained content with a document:



## Update Documents

Re-retaining with the same `document_id` **replaces** the old content:



## Get Document

Retrieve a document's original text and metadata. This is useful for expanding document context after a recall operation returns memories with document references.



## Update Document

Update mutable fields on an existing document without re-processing the content. Currently supports updating `tags`.



:::info Observations are re-consolidated
When tags change, any consolidated observations derived from the document's memories are invalidated and queued for re-consolidation under the new tags. Co-source memories from other documents that shared those observations are also reset.
:::

## Delete Document

Remove a document and all its associated memories:



:::warning
Deleting a document permanently removes all memories extracted from it. This action cannot be undone.
:::

## List Documents

List documents in a bank with optional filtering by ID and tags.



### Filtering Options

| Parameter | Description |
|---|---|
| `q` | Case-insensitive substring match on document ID. `report` matches `report-2024`, `annual-report`, etc. |
| `tags` | Filter by document tags. Accepts multiple values. |
| `tags_match` | How to match tags (default: `any_strict`). See below. |
| `limit` / `offset` | Pagination. Default limit is 100. |

**`tags_match` modes:**

| Mode | Behaviour |
|---|---|
| `any_strict` *(default)* | Document must have **at least one** of the specified tags. Untagged docs excluded. |
| `any` | Same as `any_strict` but also includes untagged documents. |
| `all_strict` | Document must have **all** specified tags. Untagged docs excluded. |
| `all` | Same as `all_strict` but also includes untagged documents. |
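
The two tables can be restated as small predicates. This is a local sketch of the documented behaviour, not the server's implementation. The function names are hypothetical, and the sketch assumes at least one filter tag is supplied.

```python
MODES = ("any_strict", "any", "all_strict", "all")


def tags_match(doc_tags, filter_tags, mode="any_strict"):
    """Return True if a document with `doc_tags` passes the tag filter."""
    if mode not in MODES:
        raise ValueError(f"unknown tags_match mode: {mode}")
    doc, wanted = set(doc_tags), set(filter_tags)
    if not doc:
        # Untagged documents pass only in the non-strict modes.
        return mode in ("any", "all")
    if mode in ("any", "any_strict"):
        return bool(doc & wanted)   # at least one requested tag present
    return wanted <= doc            # every requested tag present


def id_matches(doc_id, q):
    """Case-insensitive substring match on the document ID, as for `q`."""
    return q.lower() in doc_id.lower()


print(tags_match(["finance", "q4"], ["q4", "legal"]))             # any_strict
print(tags_match(["finance", "q4"], ["q4", "legal"], "all"))      # missing "legal"
print(tags_match([], ["q4"], "any"))                              # untagged, non-strict
print(id_matches("annual-report", "REPORT"))
```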

## Document Response Format

```json
{
  "id": "meeting-2024-03-15",
  "bank_id": "my-bank",
  "original_text": "Alice presented the Q4 roadmap...",
  "content_hash": "abc123def456",
  "memory_unit_count": 12,
  "nodes_by_fact_type": {
    "world": 5,
    "experience": 4,
    "observation": 3
  },
  "created_at": "2024-03-15T14:00:00Z",
  "updated_at": "2024-03-15T14:00:00Z"
}
```

## Next Steps

- [**Operations**](./operations) — Monitor background tasks
- [**Memory Banks**](./memory-banks) — Configure bank settings


---


## File: developer/api/main-methods.mdx

# Main Methods

Hindsight provides three core operations: **retain**, **recall**, and **reflect**.




:::tip Prerequisites
Make sure you've [installed Hindsight](../installation) and completed the [Quick Start](./quickstart).
:::

## Retain: Store Information

Store conversations, documents, and facts into a memory bank.



**What happens:** Content is processed by an LLM to extract rich facts, identify entities, and build connections in a knowledge graph.

**See:** [Retain Details](./retain) for advanced options and parameters.

---

## Recall: Search Memories

Search for relevant memories using multi-strategy retrieval.



**What happens:** Four search strategies (semantic, keyword, graph, temporal) run in parallel; their results are then fused and reranked.

**See:** [Recall Details](./recall) for tuning quality vs latency.

---

## Reflect: Reason with Disposition

Generate disposition-aware responses using memories and observations.



**What happens:** Memories and observations are recalled, bank disposition is applied, and the LLM reasons through the evidence to generate a response.

**See:** [Reflect Details](./reflect) for disposition configuration.

---

## Comparison

| Feature | Retain | Recall | Reflect |
|---------|--------|--------|---------|
| **Purpose** | Store information | Find information | Reason about information |
| **Input** | Raw text/documents | Search query | Question/prompt |
| **Output** | Memory IDs | Ranked facts + observations | Reasoned response |
| **Uses LLM** | Yes (extraction) | No | Yes (generation) |
| **Uses observations** | No | Yes | Yes |
| **Disposition** | No | No | Yes |

---

## Next Steps

- [**Retain**](./retain) — Advanced options for storing memories
- [**Recall**](./recall) — Tuning search quality and performance
- [**Reflect**](./reflect) — Configuring disposition
- [**Memory Banks**](./memory-banks) — Managing memory bank disposition


---


## File: developer/api/memories.mdx

# Memories

A **memory unit** is the atomic fact Hindsight extracts and stores. This page covers the endpoints for working with individual memory units — reading and listing them, inspecting how a derived observation evolved, and **curating** them (correcting, retiring, or restoring). Ingesting and querying memories is covered separately in [Retain](./retain.mdx) and [Recall](./recall.mdx).




## Endpoints

| Method | Endpoint | Purpose |
|---|---|---|
| `GET` | `/v1/default/banks/{bank}/memories/list` | List/filter memory units in a bank |
| `GET` | `/v1/default/banks/{bank}/memories/{id}` | Fetch a single memory unit |
| `GET` | `/v1/default/banks/{bank}/memories/{id}/history` | Refresh history of a derived observation |
| `PATCH` | `/v1/default/banks/{bank}/memories/{id}` | Curate: edit / invalidate / restore |
| `DELETE` | `/v1/default/banks/{bank}/memories/{id}/observations` | Clear a memory's derived observations |

## List memory units

List the memory units in a bank. The response includes each unit's `fact_type` (`world` | `experience` | `observation`), `state` (`valid` | `invalidated`), entities, occurred dates, and — for facts a user has edited — an `edited_at` timestamp. Invalidated rows are **included by default** so curation stays auditable; filter with `state=`.
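
As a rough illustration, the request URL for the list endpoint can be assembled as follows. The path comes from the endpoint table above and `state` is the filter named in this section. The base URL `http://localhost:8888` is an assumption that depends on your deployment, and the helper name is hypothetical. Other query parameters are not covered here.

```python
from urllib.parse import quote, urlencode


def list_memories_url(base_url, bank_id, state=None):
    """Build the URL for listing memory units, optionally filtered by state."""
    path = f"/v1/default/banks/{quote(bank_id, safe='')}/memories/list"
    query = f"?{urlencode({'state': state})}" if state else ""
    return f"{base_url.rstrip('/')}{path}{query}"


# Only active facts, hiding invalidated rows:
print(list_memories_url("http://localhost:8888", "my-bank", state="valid"))
```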



## Fetch a single memory unit



For a **derived observation**, the history endpoint returns how it was refreshed over time as new source facts arrived:



## Curation: editing, invalidating & pruning

Memory is append-only by design — but sometimes a stored fact is **wrong**, has gone **stale**, or is a **duplicate**. Curation lets you correct or retire individual memories without losing the audit trail. Retired facts are moved out of the active set, so recall never returns them, while remaining fully recoverable.

### When to reach for what

Not every "bad memory" needs the same tool. Pick by *why* it's bad:

| The memory is… | Use | Why |
|---|---|---|
| **Wrong because the whole bank extracts badly** (e.g. consistently wrong subject) | Fix the bank's `retain_mission` / `observations_mission`, then **reprocess** the document | Systematic problems are best fixed at the source, then replayed — see [Retain](./retain.mdx) and [Observations](../observations.mdx). |
| **Wrong as a one-off** (a single misextracted fact) | **Edit** the memory | Corrects the fact and regenerates everything derived from it. |
| **No longer true, with nothing to replace it** (decommissioned server, a tool that was fixed, a role that changed) | **Invalidate** the memory | Nothing in the pipeline knows the world changed, so you tell it explicitly. |
| **A duplicate or superseded fact** | **Invalidate** the memory | Removes the noise from recall while keeping the audit trail. |
| **Superseded by a newer fact you're storing anyway** (e.g. "likes BMW" → "likes Toyota") | Just retain the new fact | Consolidation already reconciles in-stream contradictions into a single observation. |

The rule of thumb: **if Hindsight could have known, let consolidation handle it; if only you know, curate it.**

Only raw **world** and **experience** facts can be curated. Observations are *derived* — they regenerate from their sources, so you curate the underlying facts, not the observation. A `PATCH` on an observation returns `400`.

### Edit a memory

Correct what the LLM extracted. You can change the **text**, **context**, **occurred dates**, **fact type**, and **entities** — anything the extractor could have gotten wrong. Hindsight re-embeds the fact, drops the observations and links derived from the old version, and re-consolidates, so downstream knowledge reflects the correction. Edited facts are marked with an `edited_at` timestamp (surfaced as an **Edited** badge in the control plane).

You don't need to rebuild anything yourself: an edit **automatically recomputes the knowledge graph and links** in the background. The fact's entity associations are re-resolved from the new text/entities, its temporal and semantic links are re-derived, and consolidation re-runs — all triggered by the edit. The PATCH returns as soon as the change is committed; the graph/observation rebuild happens asynchronously right after.



You can correct the dates, fact type, and entities the same way. For `context`, `occurred_start`, and `occurred_end`, an empty string `""` clears the field and omitting it leaves it unchanged. For `entities`, a list **replaces** the fact's entity set (names are resolved/find-or-created the same way retain does) and `[]` detaches them all; omitting it leaves them unchanged.
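
The distinction between omitting a field, sending `""`, and sending `[]` is easy to get wrong, so here is a sketch of those rules applied to an in-memory fact. It models only the semantics described above. The helper name `apply_edit` is hypothetical, and representing a cleared field as `None` is an assumption.

```python
def apply_edit(fact, patch):
    """Return a copy of `fact` with `patch` applied using the rules above."""
    updated = dict(fact)
    for field in ("context", "occurred_start", "occurred_end"):
        if field in patch:
            # "" clears the field; any other value replaces it.
            updated[field] = None if patch[field] == "" else patch[field]
    if "entities" in patch:
        # A list replaces the whole entity set; [] detaches everything.
        updated["entities"] = list(patch["entities"])
    return updated


fact = {
    "context": "team meeting",
    "occurred_start": "2024-03-15T14:00:00Z",
    "occurred_end": "2024-03-15T15:00:00Z",
    "entities": ["Alice", "Google"],
}
# Clear the end date, replace the entities, leave everything else unchanged.
print(apply_edit(fact, {"occurred_end": "", "entities": ["Alice"]}))
```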



### Invalidate a memory (reversible)

Soft-retire a fact. An invalidated memory:

- **disappears from recall**, consolidation, and the knowledge graph,
- has its **links pruned** and its **derived observations re-computed** without it,
- **stays in the bank** for audit (visible via the memory and document views), and
- can be **restored** at any time.



Restoring moves the fact back into the active set and re-consolidates:



Behind the scenes, invalidating **moves** the row out of the active `memory_units` table into a separate archive, so recall and consolidation never need a "skip invalidated" filter — the rows simply aren't there.
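
The move-to-archive design can be pictured with two dictionaries. This is a toy model for intuition, not Hindsight's storage code: the class name `BankModel` and its methods are hypothetical.

```python
class BankModel:
    """Toy model of a bank with an active set and a separate archive."""

    def __init__(self):
        self.active = {}   # stands in for the active memory_units table
        self.archive = {}  # stands in for the separate archive

    def invalidate(self, memory_id):
        # Move the row out of the active set; it stays recoverable.
        self.archive[memory_id] = self.active.pop(memory_id)

    def restore(self, memory_id):
        # Move the row back into the active set.
        self.active[memory_id] = self.archive.pop(memory_id)

    def recall_candidates(self):
        # No "skip invalidated" filter is needed: archived rows are not here.
        return list(self.active.values())


bank = BankModel()
bank.active = {"m1": "Server db-01 hosts billing", "m2": "Alice joined Google"}
bank.invalidate("m1")
print(bank.recall_candidates())
bank.restore("m1")
print(sorted(bank.recall_candidates()))
```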

:::note Documents are the source of truth
A memory is extracted from a document. Editing or invalidating a memory does **not** change the document it came from — that's deliberate: the document stays as an accurate historical record. As a result, **reprocessing a document resets curation** of the facts it produced (extraction runs fresh from the original text). Fix systematic issues at the mission level and reprocess; use edit/invalidate for the residue.
:::

### A pruning workflow

To clean up duplicates and reduce noise, cluster duplicates from `memories/list`, then **invalidate** them. Recall is clean immediately, and the audit trail is preserved.


---


## File: developer/api/memory-banks.mdx

# Memory Banks

Memory banks are isolated containers that store all memory-related data for a specific context or use case.




## What is a Memory Bank?

A memory bank is a complete, isolated storage unit containing:

- **Memories** — Facts and information retained from conversations
- **Documents** — Files and content indexed for retrieval
- **Entities** — People, places, concepts extracted from memories
- **Relationships** — Connections between entities in the knowledge graph
- **Directives** — Hard rules the agent must follow during reflect operations

Banks are completely isolated from each other — memories stored in one bank are not visible to another.

You don't need to pre-create a bank. Hindsight will automatically create it with default settings when you first use it.

:::tip Prerequisites
Make sure you've completed the [Quick Start](./quickstart) to install the client and start the server.
:::

## Creating a Memory Bank



## Bank Configuration

Each memory bank can be configured independently per operation. Configuration can be set via the [bank config API](#updating-configuration), the [Control Plane UI](/), or [server-wide environment variables](/developer/configuration).

### retain_mission {#retain-configuration}

A plain-language description of what this bank should pay attention to during extraction. The mission is injected into the extraction prompt alongside the built-in rules — it steers focus without replacing the extraction logic.

```
e.g. Always include technical decisions, API design choices, and architectural trade-offs.
     Ignore meeting logistics, greetings, and social exchanges.
```

Works alongside any extraction mode. Leave blank for general-purpose extraction.

### retain_extraction_mode

Controls how aggressively facts are extracted:

| Mode | Description |
|------|-------------|
| `concise` *(default)* | Selective — only facts worth remembering long-term |
| `verbose` | Captures more detail per fact; slower and uses more tokens |
| `custom` | Write your own extraction rules via `retain_custom_instructions` |

### retain_custom_instructions

Only active when `retain_extraction_mode` is `custom`. Replaces the built-in extraction rules entirely with your own instructions.

### retain_chunk_size

Maximum number of characters per chunk when splitting content for fact extraction. Larger chunks mean fewer LLM calls but may reduce extraction quality on long inputs; smaller chunks improve granularity at the cost of more calls.

Default: `3000`

### retain_structured_chunk_size

Maximum number of characters for a single JSONL line or conversation turn to keep whole when it exceeds `retain_chunk_size`. When unset, the limit is exactly `retain_chunk_size`; set a larger value for structured logs or chat transcripts where splitting a single record would lose useful context.

Default: unset, which uses `retain_chunk_size`

See [Retain configuration](/developer/configuration#retain) for environment variable names and defaults.
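
How the two size settings interact can be expressed as a single check. The sketch below shows only the decision of whether one record (a JSONL line or conversation turn) stays whole. It does not reproduce the actual chunking algorithm, the function name is hypothetical, and treating the limit as inclusive is an assumption.

```python
def keeps_record_whole(record, chunk_size=3000, structured_chunk_size=None):
    """Return True if a single structured record is short enough to stay in one piece."""
    # When unset, the structured limit falls back to retain_chunk_size.
    limit = structured_chunk_size if structured_chunk_size is not None else chunk_size
    return len(record) <= limit


long_turn = "x" * 5000  # a 5,000-character conversation turn

print(keeps_record_whole(long_turn))                              # default limit of 3000
print(keeps_record_whole(long_turn, structured_chunk_size=8192))  # raised limit
```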

### entity_labels {#entity-labels}

Defines a controlled vocabulary of `key:value` classification labels extracted at retain time and stored as entities. Because labels become entities, they automatically link memories in the knowledge graph (two memories with `pedagogy:scaffolding` are linked), improve semantic and BM25 retrieval, and optionally filter memories via the standard `tags`/`tags_match` API when `tag: true` is set on a group.

Each entry in `entity_labels` is a **label group** — one classification dimension:

```json
{
  "entity_labels": [
    {
      "key": "engagement",
      "description": "Student engagement level during the session",
      "type": "value",
      "optional": true,
      "values": [
        { "value": "active",  "description": "Student is actively participating" },
        { "value": "passive", "description": "Student is listening but not participating" }
      ]
    },
    {
      "key": "pedagogy",
      "description": "Teaching strategies used",
      "type": "multi-values",
      "values": [
        { "value": "scaffolding",          "description": "Breaking complex tasks into smaller steps" },
        { "value": "direct_instruction",   "description": "Explicit explanation by the teacher" },
        { "value": "socratic_questioning", "description": "Guiding through questions rather than answers" }
      ]
    }
  ]
}
```

| Field | Default | Description |
|-------|---------|-------------|
| `key` | — | Label group identifier. Becomes the prefix in `key:value` entities (or `key:field:value` for `"map"`). |
| `description` | `""` | Shown to the LLM to guide label assignment. |
| `type` | `"value"` | `"value"` → pick one enum value; `"multi-values"` → pick multiple; `"text"` → free-form string; `"map"` → structured group with named fields. |
| `values` | `[]` | Allowed values for `"value"` and `"multi-values"` types. Ignored for `"text"` and `"map"`. |
| `fields` | `{}` | Field definitions for `"map"` types. Each field is itself typed (`"text"`, `"value"`, `"multi-values"`, or nested `"map"`). Ignored for non-map types. |
| `optional` | `true` | When `true` the LLM may skip the label if not applicable. When `false` the LLM must always assign a value. Has no effect on `"multi-values"` groups (always optional). |
| `tag` | `false` | When `true`, extracted `key:value` labels are also written as tags on the memory unit, enabling filtering via `tags`/`tags_match` in recall/reflect. |

**Enum groups** (`type: "value"` or `type: "multi-values"`): the LLM picks from the predefined `values` list; anything outside the list is silently dropped. Vocabulary stays stable and graph links stay tight. Use `"multi-values"` when a fact can belong to several values at once.

**Free-text groups** (`type: "text"`): the LLM writes any string. Use the `description` field to provide examples and guidance. Graph clustering is less reliable than with enum groups because the model may phrase the same concept differently across sessions.

```json
{
  "key": "topic",
  "description": "Specific subject being discussed. Examples: algebra, quadratic equations, geometry.",
  "type": "text",
  "optional": true,
  "values": []
}
```

**Map groups** (`type: "map"`): defines a structured entity type with named fields. Each field is itself typed (`"text"`, `"value"`, `"multi-values"`, or nested `"map"`) so you can describe rich entities like a person with name, role, and organization. Each extracted field is stored as a flat `key:field:value` entity string (e.g. `person:name:Alice`), reusing the existing entity storage with no schema changes — so map fields participate in the knowledge graph and retrieval the same way single-value labels do.

```json
{
  "key": "person",
  "description": "A person mentioned in the text",
  "type": "map",
  "fields": {
    "name":         { "type": "text", "description": "Full name of the person" },
    "role":         { "type": "text", "description": "Job title or role" },
    "organization": { "type": "text", "description": "Company or organization" }
  }
}
```
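
To make the flat `key:value` and `key:field:value` encoding concrete, the sketch below turns an extracted value for one label group into entity strings. It illustrates the storage format described above rather than Hindsight's extraction code: the function name is hypothetical, and nested maps and typed map fields are deliberately left out.

```python
def label_entities(group, extracted):
    """Convert the value extracted for one label group into flat entity strings."""
    key, kind = group["key"], group.get("type", "value")
    if kind in ("value", "multi-values"):
        allowed = {v["value"] for v in group.get("values", [])}
        picked = [extracted] if isinstance(extracted, str) else list(extracted)
        # Anything outside the predefined vocabulary is silently dropped.
        return [f"{key}:{value}" for value in picked if value in allowed]
    if kind == "text":
        return [f"{key}:{extracted}"]
    if kind == "map":
        # Simplification: every field is treated as free text.
        return [
            f"{key}:{field}:{value}"
            for field, value in extracted.items()
            if field in group["fields"]
        ]
    raise ValueError(f"unknown label type: {kind}")


pedagogy = {
    "key": "pedagogy",
    "type": "multi-values",
    "values": [{"value": "scaffolding"}, {"value": "direct_instruction"}],
}
person = {
    "key": "person",
    "type": "map",
    "fields": {"name": {"type": "text"}, "role": {"type": "text"}},
}

print(label_entities(pedagogy, ["scaffolding", "lecturing"]))
print(label_entities(person, {"name": "Alice", "role": "Engineer"}))
```

The unknown value `lecturing` is dropped, while each map field becomes its own entity string.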

### entities_allow_free_form

By default, entity labels are extracted **alongside** regular named entities (people, places, concepts). Set to `false` to disable free-form extraction so only label entities are stored:

```json
{
  "entity_labels": [...],
  "entities_allow_free_form": false
}
```

### enable_observations {#observations-configuration}

Toggles observation consolidation on or off. When `false`, no consolidation runs for this bank — neither automatic nor manual. Defaults to `true` when the observations feature is enabled on the server.

### enable_auto_consolidation

Controls whether consolidation runs automatically after retain, delete, and update operations. When `false`, consolidation only runs when explicitly triggered via the [consolidate endpoint](/developer/observations#trigger-consolidation). Defaults to `true`.

This is useful when you want full control over consolidation timing — for example, batching many retains before consolidating, or running [targeted consolidation](/developer/observations#targeted-consolidation) for specific scopes only.

### observations_mission

Defines what this bank should synthesise into durable observations. Replaces the built-in consolidation rules entirely — leave blank to use the server default.

```
e.g. Observations are stable facts about people and projects.
     Always include preferences, skills, and recurring patterns.
     Ignore one-off events and ephemeral state.
```

### consolidation_llm_batch_size

Number of facts sent to the LLM in a single consolidation call. Higher values reduce LLM calls and improve throughput at the cost of larger prompts. Set to `1` to disable batching. Leave unset to use the server default (`8`).

### consolidation_source_facts_max_tokens

Total token budget for source facts included with observations in the consolidation prompt. Source facts give the LLM evidence to compare new facts against existing observations. `-1` = unlimited. Leave unset to use the server default (`-1`).

### consolidation_source_facts_max_tokens_per_observation

Per-observation token cap for source facts in the consolidation prompt. Each observation independently gets at most this many tokens of source facts, preventing a single observation with many source facts from consuming the entire budget. `-1` = unlimited. Leave unset to use the server default (`256`).

See [Observations configuration](/developer/configuration#observations) for environment variable names and defaults.

### reflect_mission

A first-person narrative that provides identity and framing context for `reflect`. The agent uses this to ground its reasoning and apply a consistent perspective.

```
e.g. You are a senior engineering assistant.
     Always ground answers in documented decisions and rationale.
     Ignore speculation. Be direct and precise.
```

### disposition_skepticism

How skeptical vs trusting the bank is when evaluating claims during `reflect`. Scale 1–5.



| Value | Behaviour |
|-------|-----------|
| `1` | Trusting — accepts information at face value |
| `3` *(default)* | Balanced |
| `5` | Skeptical — questions and doubts claims |

### disposition_literalism

How literally to interpret information during `reflect`. Scale 1–5.

| Value | Behaviour |
|-------|-----------|
| `1` | Flexible — reads between the lines, considers context |
| `3` *(default)* | Balanced |
| `5` | Literal — takes things exactly as stated |

### disposition_empathy

How much to weight emotional context when reasoning during `reflect`. Scale 1–5.

| Value | Behaviour |
|-------|-----------|
| `1` | Detached — focuses on facts and logic |
| `3` *(default)* | Balanced |
| `5` | Empathetic — considers emotional context |

:::info
Disposition traits and `reflect_mission` only affect the `reflect` operation. `retain_mission` and `observations_mission` are separate per-operation settings.
:::

### mcp_enabled_tools

An allowlist of MCP tool names that are enabled for this bank. When set, only the listed tools can be invoked; any tool not in the list returns an error (tools still appear in the MCP tools list for protocol compatibility). Set to `null` (or omit) to allow all tools.

```json
["recall", "reflect"]
```

Available tool names: `retain`, `recall`, `reflect`, `list_banks`, `create_bank`, `list_mental_models`, `get_mental_model`, `create_mental_model`, `update_mental_model`, `delete_mental_model`, `refresh_mental_model`, `list_directives`, `create_directive`, `delete_directive`, `list_memories`, `get_memory`, `list_documents`, `get_document`, `delete_document`, `list_operations`, `get_operation`, `cancel_operation`, `list_tags`, `get_bank`, `get_bank_stats`, `update_bank`, `delete_bank`, `clear_memories`.
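
As a rough model of the allowlist semantics, the check might look like the sketch below. The function name is hypothetical, and treating an empty list as "no tools allowed" is an assumption drawn from "only the listed tools can be invoked".

```python
from typing import Optional


def is_tool_enabled(tool: str, enabled_tools: Optional[list[str]]) -> bool:
    """Return True if `tool` may be invoked under the bank's allowlist."""
    # null (or omitted) means every tool is allowed.
    if enabled_tools is None:
        return True
    # Assumption: an empty list allows nothing.
    return tool in enabled_tools


print(is_tool_enabled("recall", ["recall", "reflect"]))
print(is_tool_enabled("retain", ["recall", "reflect"]))
```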

### llm_gemini_safety_settings

Controls content filtering thresholds for Gemini and VertexAI providers. Accepts a list of safety setting objects in the [Google AI safety settings format](https://ai.google.dev/api/generate-content#v1beta.SafetySetting). When `null` (default), Gemini's built-in safety defaults are used.

```json
[
  {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
  {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"}
]
```

Only applies when `HINDSIGHT_API_LLM_PROVIDER` is `gemini` or `vertexai`.

### recall_budget_function {#recall-budget-configuration}

Selects how the [`recall` request's `budget` parameter](./recall) (`low` / `mid` / `high`) maps to the internal `thinking_budget` integer used by every retrieval method (semantic, BM25, graph, temporal). Two functions are supported:

| Function | Behaviour |
|----------|-----------|
| `fixed` *(default)* | `thinking_budget = recall_budget_fixed_<level>` — independent of `max_tokens`. Preserves legacy behavior. |
| `adaptive` | `thinking_budget = round(max_tokens * recall_budget_adaptive_<level>)`, clamped to `[recall_budget_min, recall_budget_max]`. Retrieval breadth scales with the requested output size. |

```json
{
  "recall_budget_function": "adaptive",
  "recall_budget_adaptive_low": 0.05,
  "recall_budget_adaptive_mid": 0.1,
  "recall_budget_adaptive_high": 0.3,
  "recall_budget_min": 30,
  "recall_budget_max": 1500
}
```

### recall_budget_fixed_low / recall_budget_fixed_mid / recall_budget_fixed_high

When `recall_budget_function` is `fixed` (the default), these positive integers are used directly as the per-method retrieval limit for each `budget` level. Defaults: `100` / `300` / `1000` — exactly matching the legacy hardcoded mapping.

### recall_budget_adaptive_low / recall_budget_adaptive_mid / recall_budget_adaptive_high

When `recall_budget_function` is `adaptive`, these positive ratios multiply the request's `max_tokens` to derive the per-method retrieval limit. Defaults: `0.025` / `0.075` / `0.25` — chosen to roughly match the fixed defaults at `max_tokens = 4096`.

### recall_budget_min / recall_budget_max

Floor and ceiling applied to the result of the adaptive function (after the ratio multiplication). Both must be positive integers and `min ≤ max`. Defaults: `20` / `2000`.

See [Recall budget mapping](/developer/configuration#recall-budget-mapping) for environment variable names and full defaults.
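
The two mapping functions can be sketched as follows, using the defaults listed above. This is an illustration of the documented formulas, not the server's code; the function name and keyword arguments are hypothetical, and Python's `round` may treat exact halves differently from the server.

```python
FIXED_DEFAULTS = {"low": 100, "mid": 300, "high": 1000}
ADAPTIVE_DEFAULTS = {"low": 0.025, "mid": 0.075, "high": 0.25}


def thinking_budget(level, max_tokens, function="fixed",
                    fixed=FIXED_DEFAULTS, adaptive=ADAPTIVE_DEFAULTS,
                    budget_min=20, budget_max=2000):
    """Map a recall budget level to the per-method retrieval limit."""
    if function == "fixed":
        # Independent of max_tokens.
        return fixed[level]
    if function == "adaptive":
        raw = round(max_tokens * adaptive[level])
        # Clamp to [budget_min, budget_max].
        return max(budget_min, min(budget_max, raw))
    raise ValueError(f"unknown recall_budget_function: {function!r}")


for level in ("low", "mid", "high"):
    print(level, thinking_budget(level, 4096), thinking_budget(level, 4096, "adaptive"))
```

At `max_tokens = 4096` the adaptive results land close to the fixed defaults, which is the relationship the default ratios were chosen for; smaller or larger requests scale until they hit the floor or ceiling.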

### memory_defense {#memory_defense}

Per-bank Memory Defense policy. Defaults to absent (Memory Defense disabled on this bank).

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Master switch. |
| `default_action` | `allow`\|`redact`\|`quarantine`\|`block` | `allow` | Fallback action when no rule matches. |
| `protected_tag_namespaces` | `list[str]` | `[]` | Writes with tags in these namespaces (`ns:*`) are subject to the `protected_key` detector. |
| `immutable_tag_namespaces` | `list[str]` | `[]` | Writes to these namespaces are blocked. |
| `rules` | `list[Rule]` | `[]` | Detector-to-action mappings (see below). |
| `detector_overrides` | `dict` | `{}` | Per-detector tuning (e.g. `size_anomaly.max_size`). |

`Rule` shape:

| Field | Required | Description |
|---|---|---|
| `on` | yes | Detector name (`prompt_injection`, `sensitive_data`, `protected_key`, `immutable_key`, `size_anomaly`) or `*` for any. |
| `action` | yes | One of `allow`, `redact`, `quarantine`, `block`. |
| `min_severity` | no | Minimum severity (`low`, `medium`, `high`, `critical`) for the rule to fire. Defaults to `low`. |

Invalid policies are rejected on PATCH with HTTP 422.

See the [Memory Defense guide](../memory-defense/index.md) for usage examples.
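
Since invalid policies are rejected with HTTP 422, a client-side pre-check can catch obvious mistakes before sending a PATCH. The sketch below only checks the enumerated values from the tables above; the function name is hypothetical, and the server's validation may well be stricter.

```python
ACTIONS = {"allow", "redact", "quarantine", "block"}
DETECTORS = {"prompt_injection", "sensitive_data", "protected_key",
             "immutable_key", "size_anomaly", "*"}
SEVERITIES = {"low", "medium", "high", "critical"}


def policy_errors(policy: dict) -> list[str]:
    """Return human-readable problems found in a memory_defense policy."""
    errors = []
    if policy.get("default_action", "allow") not in ACTIONS:
        errors.append("default_action: unknown action")
    for i, rule in enumerate(policy.get("rules", [])):
        if rule.get("on") not in DETECTORS:
            errors.append(f"rules[{i}].on: unknown or missing detector")
        if rule.get("action") not in ACTIONS:
            errors.append(f"rules[{i}].action: unknown or missing action")
        # min_severity is optional and defaults to "low".
        if rule.get("min_severity", "low") not in SEVERITIES:
            errors.append(f"rules[{i}].min_severity: unknown severity")
    return errors


policy = {
    "enabled": True,
    "default_action": "allow",
    "rules": [{"on": "prompt_injection", "action": "block", "min_severity": "high"}],
}
print(policy_errors(policy))
```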

---

## Updating Configuration

Bank configuration fields (retain mission, extraction mode, observations mission, etc.) are managed via a **separate config API**, not the `create_bank` call. This lets you change operational settings independently from the bank's identity and disposition.

### Setting Configuration Overrides



You can update any subset of fields — only the keys you provide are changed.

### Reading the Current Configuration



The response distinguishes:
- **`config`** — the fully resolved configuration (server defaults merged with bank overrides)
- **`overrides`** — only the fields explicitly overridden for this bank

### Resetting to Defaults



This removes all bank-level overrides. The bank reverts to server-wide defaults (set via environment variables).

You can also update configuration directly from the [Control Plane UI](/) — navigate to a bank and open the **Configuration** tab.
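
The relationship between server defaults, bank overrides, and the resolved `config` can be modelled in a few lines. This is an in-memory sketch of the semantics described above, not a client for the config API; the helper names are hypothetical and the default values shown are only examples.

```python
def resolve(defaults: dict, overrides: dict) -> dict:
    """Resolved config: server defaults with bank overrides layered on top."""
    return {**defaults, **overrides}


def patch_overrides(overrides: dict, changes: dict) -> dict:
    """Partial update: only the provided keys change."""
    return {**overrides, **changes}


server_defaults = {"enable_auto_consolidation": True, "consolidation_llm_batch_size": 8}
overrides = {}

overrides = patch_overrides(overrides, {"consolidation_llm_batch_size": 16})
overrides = patch_overrides(overrides, {"enable_auto_consolidation": False})
print(resolve(server_defaults, overrides))

overrides = {}  # reset: all bank-level overrides removed
print(resolve(server_defaults, overrides))
```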

---

## Directives

Directives are hard rules that the agent must follow during [reflect](./reflect) operations. Unlike disposition traits, which influence *how* the agent reasons, directives are explicit instructions that are *always* enforced.

:::info
Directives only affect the `reflect` operation. They are injected into prompts and the agent is required to comply with them in all responses.
:::

### When to Use Directives

Use directives for rules that must never be violated:

- **Language/style constraints**: "Always respond in formal English"
- **Privacy rules**: "Never share personal data with third parties"
- **Domain constraints**: "Prefer conservative investment recommendations"
- **Behavioral guardrails**: "Always cite sources when making claims"

### Creating Directives



### Listing Directives



### Updating Directives



### Deleting Directives



### Directives vs Disposition

| Aspect | Directives | Disposition |
|--------|------------|-------------|
| **Nature** | Hard rules, must be followed | Soft influence on reasoning style |
| **Enforcement** | Strict — responses are rejected if violated | Flexible — shapes interpretation |
| **Use case** | Compliance, guardrails, constraints | Personality, character, tone |
| **Example** | "Never recommend specific stocks" | High skepticism: questions claims |

---

## Document export & import

Move documents — and the facts already extracted from them — between banks **without re-running the LLM**. Useful for testing a different embedding model, or copying data between banks/instances without paying for re-extraction. The archive carries documents, raw chunks, and extracted facts (entities by canonical name, causal links) — but **no embeddings or database ids**. On import, facts are re-embedded with the *target* bank's model and entities/links are recomputed against it, so imported documents are integrated with whatever already exists there.

### Export documents

`GET /v1/default/banks/{bank_id}/document-transfer` — synchronous; streams a ZIP archive.

```bash
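# whole bank
curl -H "Authorization: Bearer $API_KEY" \
  "$HINDSIGHT_URL/v1/default/banks/my-bank/document-transfer" -o my-bank.zip

# whole bank, including consolidated observations
curl -H "Authorization: Bearer $API_KEY" \
  "$HINDSIGHT_URL/v1/default/banks/my-bank/document-transfer?include_observations=true" -o my-bank-with-observations.zip

# specific documents (repeat document_id; cannot be combined with include_observations)
curl -H "Authorization: Bearer $API_KEY" \
  "$HINDSIGHT_URL/v1/default/banks/my-bank/document-transfer?document_id=doc-1&document_id=doc-2" -o subset.zip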
```

| Query param | Description |
|-------------|-------------|
| `document_id` | Repeatable. Export only these documents; omit for the whole bank. |
| `include_observations` | Also export consolidated observations (default `false`). Only valid for a **whole-bank** export — combining it with `document_id` returns `400`. |
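
A small helper can build the export URL and refuse the one combination the endpoint rejects. This is a sketch under stated assumptions: the function name and the base URL are hypothetical, and only the two query parameters from the table are handled.

```python
from urllib.parse import quote, urlencode


def export_url(base_url, bank_id, document_ids=(), include_observations=False):
    """Build the document-transfer export URL for a bank."""
    if include_observations and document_ids:
        # The endpoint answers 400 for this combination, so fail early.
        raise ValueError("include_observations requires a whole-bank export")
    # document_id is repeatable: one query pair per document.
    params = [("document_id", doc_id) for doc_id in document_ids]
    if include_observations:
        params.append(("include_observations", "true"))
    url = f"{base_url}/v1/default/banks/{quote(bank_id, safe='')}/document-transfer"
    return f"{url}?{urlencode(params)}" if params else url


print(export_url("http://localhost:8888", "my-bank", ["doc-1", "doc-2"]))
```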

### Import documents

`POST /v1/default/banks/{bank_id}/document-transfer` — multipart upload (`file` = the ZIP). Runs as a **background operation** (re-embedding + entity resolution can take a while), so it returns `202` with an `operation_id`; poll the bank's operations endpoint for status and the result counts in `result_metadata`.

```bash
curl -H "Authorization: Bearer $API_KEY" -F "file=@my-bank.zip" \
  "$HINDSIGHT_URL/v1/default/banks/other-bank/document-transfer?on_conflict=replace"
# -> {"operation_id": "…", "status": "pending"}

curl -H "Authorization: Bearer $API_KEY" \
  "$HINDSIGHT_URL/v1/default/banks/other-bank/operations/$OPERATION_ID"
# -> {"status":"completed","result_metadata":{"documents_imported":3,"facts_imported":42,"observations_imported":5,...}}
```

`on_conflict` controls what happens when a document id already exists in the target bank:

| Mode | Behavior |
|------|----------|
| `skip` (default) | Leave the existing document untouched. |
| `replace` | Delete the existing document's data and re-import. |
| `new-id` | Import a copy under a freshly generated id. |

### Observations

Consolidated observations are excluded by default; the target bank regenerates them from the imported facts during consolidation. Pass `include_observations=true` to carry them over instead: they are restored without any LLM call, and their source references are remapped to the imported facts (which are marked as consolidated so the target won't re-consolidate them).

Because an observation can be derived from facts spanning several documents, `include_observations` is only supported on a **whole-bank export** (omit `document_id`); combining it with a document subset returns `400`.

:::warning Imported observations are inserted as-is — no merge
They are not merged or deduplicated against observations already in the target bank (consolidation merges related observations; import does not). Prefer importing observations into a fresh/empty bank, or omit `include_observations` and let the target consolidate the imported facts itself.
:::

### Enabling / disabling

Both endpoints are gated by server-level flags (default `true`). A disabled endpoint returns `404`, and `/version` reports the state under `features.document_export_api` / `features.document_import_api` (the control plane hides the buttons accordingly).

| Variable | Gates |
|----------|-------|
| `HINDSIGHT_API_ENABLE_DOCUMENT_EXPORT_API` | `GET …/document-transfer` |
| `HINDSIGHT_API_ENABLE_DOCUMENT_IMPORT_API` | `POST …/document-transfer` |

## Migrating a bank to a new instance

To move a bank to an instance configured with a different **embedding model**, **vector extension**, or **text-search backend**, none of which can be changed in place on a populated bank, export the whole bank and import it into the new instance. There, every embedding and index is re-derived from the stored text with **no LLM re-extraction**. The export carries documents, facts, observations, bank config, mental models, directives, and webhooks (never embeddings).

Use the `hindsight-admin export-bank` / `import-bank` commands and follow the blue-green runbook in **[Admin CLI → Migrating a bank to a new instance](../admin-cli.md#migrating-a-bank-to-a-new-instance)**.


---


## File: developer/api/mental-models.mdx

# Mental Models

User-curated summaries that provide high-quality, pre-computed answers for common queries.




## What Are Mental Models?

Mental models are **saved reflect responses** that you curate for your memory bank. When you create a mental model, Hindsight runs a reflect operation with your source query and stores the result. During future reflect calls, these pre-computed summaries are checked first — providing faster, more consistent answers.

```mermaid
graph LR
    A[Create Mental Model] --> B[Run Reflect]
    B --> C[Store Result]
    C --> D[Future Queries]
    D --> E{Match Found?}
    E -->|Yes| F[Return Mental Model]
    E -->|No| G[Run Full Reflect]
```

### Why Use Mental Models?

| Benefit | Description |
|---------|-------------|
| **Consistency** | Same answer every time for common questions |
| **Speed** | Pre-computed responses are returned instantly |
| **Quality** | Manually curated summaries you've reviewed |
| **Control** | Define exactly how key topics should be answered |

### Hierarchical Retrieval

During reflect, the agent checks sources in priority order:

1. **Mental Models** — User-curated summaries (highest priority)
2. **Observations** — Consolidated knowledge
3. **Raw Facts** — Ground truth memories

Mental models are checked first because they represent your explicitly curated knowledge.

---

## Create a Mental Model

Creating a mental model runs a reflect operation in the background and saves the result:



### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `name` | string | Yes | Human-readable name for the mental model |
| `source_query` | string | Yes | The query to run to generate content |
| `id` | string | No | Custom ID for the mental model (alphanumeric lowercase with hyphens). Auto-generated if omitted. |
| `tags` | list | No | Tags that scope the model during reflect **and** filter source memories during refresh. Defaults to `all_strict` matching, so only memories carrying every listed tag are read. See [Tags and Visibility](#tags-and-visibility). |
| `max_tokens` | int | No | Maximum tokens for the mental model content |
| `trigger` | object | No | Trigger settings (see [Automatic Refresh](#automatic-refresh)) |

---

## Create with Custom ID

Assign a stable, human-readable ID to a mental model so you can retrieve or update it by name instead of relying on the auto-generated UUID:



:::tip
Custom IDs must be lowercase alphanumeric and may contain hyphens (e.g. `team-policies`, `q4-status`). If a mental model with that ID already exists, the request is rejected.
:::

---

## Automatic Refresh

Mental models can be configured to **automatically refresh** when observations are updated. This keeps them in sync with the latest knowledge without manual intervention.

### Trigger Settings

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `mode` | `"full"` \| `"delta"` | `"full"` | Refresh strategy. See [Refresh Mode](#refresh-mode) below. |
| `refresh_after_consolidation` | bool | false | Automatically refresh after observations consolidation |
| `refresh_cron` | string \| null | null | UTC 5-field cron expression for scheduled refreshes, such as `"0 3 * * *"` for daily at 03:00 UTC |

When `refresh_after_consolidation` is enabled, the mental model will be re-generated every time the bank's observations are consolidated — ensuring it always reflects the latest synthesized knowledge.

When `refresh_cron` is set, Hindsight checks the schedule on the server's mental-model refresh tick and refreshes the model only if memories in its scope have changed since the last refresh. `refresh_cron` and `refresh_after_consolidation` are mutually exclusive, so a model refreshes either after consolidation or on a fixed UTC schedule, not both.
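
The trigger constraints above can be checked before a request is sent. The sketch below is a client-side sanity check with a hypothetical function name; it only counts cron fields and does not validate their contents, so the server remains the authority.

```python
def trigger_errors(trigger: dict) -> list[str]:
    """Return problems found in a mental model `trigger` object."""
    errors = []
    if trigger.get("mode", "full") not in ("full", "delta"):
        errors.append('mode must be "full" or "delta"')
    cron = trigger.get("refresh_cron")
    if cron is not None and len(cron.split()) != 5:
        errors.append("refresh_cron must be a 5-field cron expression")
    if cron is not None and trigger.get("refresh_after_consolidation", False):
        errors.append("refresh_cron and refresh_after_consolidation are mutually exclusive")
    return errors


print(trigger_errors({"mode": "delta", "refresh_cron": "0 3 * * *"}))
print(trigger_errors({"refresh_cron": "0 3 * * *", "refresh_after_consolidation": True}))
```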

### Refresh Mode

Two strategies are available for how a refresh produces the new content:

- **`full`** *(default)* — every refresh regenerates the entire content from scratch. Simple and predictable: the LLM synthesises a fresh document from the retrieved memories. Best when the document is short, when you want every refresh to potentially restructure the output, or when you're not yet sure what the final shape should be.

- **`delta`** — refresh emits a list of typed *operations* (add a section, append a bullet, replace a block, remove a stale paragraph) against the document's existing structure, then renders the result. Sections that aren't targeted by any operation are copied through **byte-identical** — no paraphrasing, no whitespace drift, no list-style normalisation. Best for long-lived "playbook"–style mental models where you want stability across refreshes and only the genuinely changed parts to move.

Delta mode falls back to a full regeneration automatically in two cases:
1. The mental model has no existing content yet (nothing to anchor edits on).
2. The `source_query` has changed since the last refresh (the topic has shifted; the existing structure may no longer apply).

If the LLM call fails or returns an empty answer, the existing content is preserved — refreshes never overwrite a populated document with an empty one.
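
The fallback and preservation rules can be summarised in a short sketch. The function names and the `last_refreshed_query` argument are hypothetical (how Hindsight tracks the previous query is not specified here), and treating whitespace-only output as empty is an assumption.

```python
from typing import Optional


def effective_refresh_mode(mode: str, existing_content: str,
                           source_query: str, last_refreshed_query: Optional[str]) -> str:
    """Decide whether a refresh runs as 'delta' or falls back to 'full'."""
    if mode != "delta":
        return "full"
    if not existing_content:
        return "full"  # nothing to anchor edits on
    if source_query != last_refreshed_query:
        return "full"  # the topic has shifted
    return "delta"


def content_after_refresh(existing_content: str, generated: Optional[str]) -> str:
    """Keep the existing content when the LLM fails (None) or returns nothing."""
    if generated is None or not generated.strip():
        return existing_content
    return generated


print(effective_refresh_mode("delta", "", "team policies", None))
print(effective_refresh_mode("delta", "## Rules", "team policies", "team policies"))
```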

| Use Case | Recommended Mode | Why |
|----------|-----------------|-----|
| Skill / playbook docs | `delta` | Sections live for many refreshes; only specific rules change |
| Onboarding summaries | `delta` | Adding new team members shouldn't restructure the doc |
| Real-time dashboards | `full` | Each refresh is a fresh snapshot |
| Short FAQ summaries | `full` | Whole-document regeneration is cheap and unambiguous |



### When to Use Automatic Refresh

| Use Case | Automatic Refresh | Why |
|----------|-------------------|-----|
| **Real-time dashboards** | ✅ Enabled | Status should always be current |
| **Policy summaries** | ❌ Disabled | Policies change infrequently, manual refresh preferred |
| **User preferences** | ✅ Enabled | Preferences evolve with new interactions |
| **FAQ answers** | ❌ Disabled | Answers are curated, should be reviewed before updating |

:::tip
Enable automatic refresh for mental models that need to stay current. Disable it for curated content where you want to review changes before they go live.
:::

---

## List Mental Models



---

## Get a Mental Model



### Detail Levels

Both **List** and **Get** endpoints accept an optional `detail` query parameter that controls how much data is returned. This is useful for reducing response size, especially in agent boot flows or MCP clients where context budget is limited.

| Level | Fields Returned | Use Case |
|-------|----------------|----------|
| `metadata` | `id`, `bank_id`, `name`, `tags`, `last_refreshed_at`, `created_at` | Inventory — "what models exist?" |
| `content` | All metadata fields + `source_query`, `content`, `max_tokens`, `trigger` | Agent boot — "what do the models say?" |
| `full` (default) | All fields including `reflect_response` | Deep inspection — "what evidence backs this model?" |

```bash
# List only names and tags (smallest response)
curl "$BASE_URL/v1/default/banks/$BANK_ID/mental-models?detail=metadata"

# List with content but without provenance chains
curl "$BASE_URL/v1/default/banks/$BANK_ID/mental-models?detail=content"

# Get full detail (default behavior)
curl "$BASE_URL/v1/default/banks/$BANK_ID/mental-models/$MODEL_ID?detail=full"
```

The `detail` parameter is also available in the MCP tools:

```json
{"bank_id": "my-bank", "detail": "metadata"}
```

:::tip
Use `detail=content` for agent orientation flows. It includes everything the agent needs to understand the models without the heavyweight `reflect_response` provenance chains, which can exceed 200KB for banks with many models.
:::

### Response Fields

| Field | Type | Detail Level | Description |
|-------|------|-------------|-------------|
| `id` | string | metadata | Unique mental model ID |
| `bank_id` | string | metadata | Memory bank ID |
| `name` | string | metadata | Human-readable name |
| `tags` | list | metadata | Tags for filtering |
| `last_refreshed_at` | string | metadata | When the mental model was last updated |
| `created_at` | string | metadata | When the mental model was created |
| `source_query` | string | content | The query used to generate content |
| `content` | string | content | The generated mental model text |
| `max_tokens` | int | content | Maximum tokens for the mental model content |
| `trigger` | object | content | Trigger settings (see [Automatic Refresh](#automatic-refresh)) |
| `reflect_response` | object | full | Full reflect response including `based_on` provenance facts |
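
The field-to-level mapping in the table can be expressed as a projection. The server applies `detail` itself; the sketch below (with hypothetical names) is only meant for cases such as trimming an already fetched `full` response before placing it in an agent's context.

```python
METADATA_FIELDS = ["id", "bank_id", "name", "tags", "last_refreshed_at", "created_at"]
CONTENT_FIELDS = METADATA_FIELDS + ["source_query", "content", "max_tokens", "trigger"]
FULL_FIELDS = CONTENT_FIELDS + ["reflect_response"]

FIELDS_BY_DETAIL = {
    "metadata": METADATA_FIELDS,
    "content": CONTENT_FIELDS,
    "full": FULL_FIELDS,
}


def project(model: dict, detail: str = "full") -> dict:
    """Keep only the fields that belong to the requested detail level."""
    allowed = FIELDS_BY_DETAIL[detail]
    return {key: value for key, value in model.items() if key in allowed}


model = {"id": "team-policies", "name": "Team policies", "content": "...",
         "reflect_response": {"based_on": []}}
print(sorted(project(model, "content")))
```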

---

## Refresh a Mental Model

Re-run the source query to update the mental model with current knowledge:



Refreshing is useful when:
- New memories have been retained that affect the topic
- Observations have been updated
- You want to ensure the mental model reflects current knowledge

---

## Clear a Mental Model

Clear a mental model's content so the next refresh performs a **full re-synthesis** from scratch, regardless of the model's trigger mode.

This is useful for delta-mode models that have accumulated drift over many incremental refreshes. Small inaccuracies can compound over time because each delta refresh sees only the facts added since the previous refresh. Clearing and then refreshing produces a clean baseline from all facts.



The clear operation is synchronous and resets the content to an empty string. The model's configuration (name, source query, trigger settings) is preserved. Since the content is now empty, the next `/refresh` call will always perform a full regeneration — even if the model's trigger mode is set to `delta`.

:::tip
For long-lived delta-mode mental models, consider scheduling a periodic clear + refresh (e.g. every 48 hours) to keep the content accurate while still benefiting from incremental delta updates in between.
:::

---

## Update a Mental Model

Update the mental model's name:



---

## Delete a Mental Model



---

## Tags and Visibility

Mental models support the same tag system as memories. When you assign tags to a mental model, those tags control both **which memories it reads** during refresh and **when it is surfaced** during reflect.

### How tags affect mental model refresh

:::warning
Adding tags to a mental model narrows the pool of source memories its refresh can read from. If no memories carry those tags yet, refresh will return empty content (e.g. `"I cannot find any information…"`) even though direct `reflect` on the same query works. Backfill tags on the relevant memories first, or override the default via `trigger.tags_match` / `trigger.tag_groups`.
:::

When a mental model is refreshed (manually or automatically), it runs an internal reflect call to regenerate its content. If the mental model has tags, that reflect call uses `all_strict` tag matching — meaning it will only read memories that carry **all** of the mental model's tags. Untagged memories are excluded.

```
Mental model tags: ["user:alice"]

During refresh, it reads:
  ✅ "Alice prefers async communication"     — has "user:alice"
  ✅ "Team uses Slack for announcements"      — has "user:alice" (plus other tags)
  ❌ "Company policy: no meetings on Fridays" — untagged, excluded
  ❌ "Bob dislikes long meetings"             — no "user:alice" tag
```

This means a mental model tagged `["user:alice"]` will also pick up memories tagged `["user:alice", "team"]` — extra tags on a memory don't disqualify it. Only the mental model's own tags are required to be present.
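
The `all_strict` rule used during refresh amounts to a subset check. The sketch below uses a hypothetical function name and assumes that a mental model without tags applies no tag filter at all.

```python
def readable_during_refresh(memory_tags: list[str], model_tags: list[str]) -> bool:
    """True if a memory carries every tag of the mental model (all_strict)."""
    if not model_tags:
        # Assumption: an untagged mental model does not filter by tag.
        return True
    # Extra tags on the memory are fine; untagged memories never match.
    return set(model_tags) <= set(memory_tags)


model_tags = ["user:alice"]
print(readable_during_refresh(["user:alice"], model_tags))
print(readable_during_refresh(["user:alice", "team"], model_tags))
print(readable_during_refresh([], model_tags))
print(readable_during_refresh(["user:bob"], model_tags))
```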

### How tags affect mental model lookup during reflect

When you call `reflect` with tags, those same tags are used to filter which mental models the agent can see. A mental model is visible only if its tags overlap with the tags on the reflect request.

For more details on tag matching modes (`any`, `any_strict`, `all`, `all_strict`) and worked examples, see the [Recall tags reference](./recall#tags).

### Listing mental model tags

`GET /v1/default/banks/{bank_id}/tags` accepts a `source` query parameter that selects which tag space to enumerate:

- `source=memories` *(default)* — tags attached to memory units.
- `source=mental_models` — tags attached to mental models in this bank.

Use the `mental_models` source to populate autocomplete or filter UIs over mental-model tags, distinct from the (typically larger) memory tag set.

---

## History

Every time a mental model's content changes (via refresh or manual update), the previous version is saved with a timestamp. You can retrieve the full change log with the history endpoint:



### Response

The endpoint returns a list of history entries, most recent first:

| Field | Type | Description |
|-------|------|-------------|
| `previous_content` | string \| null | The content before this change (`null` if not available) |
| `changed_at` | string | ISO 8601 timestamp of when the change occurred |

Each entry captures the **content before the change** and when it happened. The current content is returned by the standard [Get a Mental Model](#get-a-mental-model) endpoint.
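
Because each entry stores the content *before* a change and the list is ordered most recent first, a chronological list of versions can be rebuilt by reversing the history and appending the current content. The helper name and sample data below are hypothetical.

```python
def versions_oldest_first(history: list[dict], current_content: str) -> list:
    """Rebuild the sequence of contents from oldest to newest."""
    # History arrives most recent first; walk it backwards to go forward in time.
    older = [entry["previous_content"] for entry in reversed(history)]
    # The newest version is not in the history; it is the model's current content.
    return older + [current_content]


history = [
    {"previous_content": "v2 text", "changed_at": "2026-02-01T00:00:00Z"},
    {"previous_content": "v1 text", "changed_at": "2026-01-01T00:00:00Z"},
]
print(versions_oldest_first(history, "v3 text"))
```

Entries whose `previous_content` is `null` appear as `None` in the result, so callers that diff versions should skip them.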

:::note
History tracking is enabled by default. Set `HINDSIGHT_API_ENABLE_MENTAL_MODEL_HISTORY=false` to disable it.
:::

---

## Use Cases

| Use Case | Example |
|----------|---------|
| **FAQ Answers** | Pre-compute answers to common customer questions |
| **Onboarding Summaries** | "What should new team members know?" |
| **Status Reports** | "What's the current project status?" refreshed weekly |
| **Policy Summaries** | "What are our security policies?" |

---

## Next Steps

- [**Reflect**](./reflect) — How the agentic loop uses mental models
- [**Observations**](/developer/observations) — How knowledge is consolidated
- [**Operations**](./operations) — Track async mental model creation


---


## File: developer/api/operations.mdx

# Operations

Hindsight runs several maintenance and ingestion tasks asynchronously instead of blocking the API call that triggers them. These tasks share a single queue (`async_operations`) and a single worker pool, and the same REST endpoints — list, status, cancel, retry — work across every type.

This page explains each operation type, when it fires, and how to inspect or manage it.




:::tip Prerequisites
Make sure you've completed the [Quick Start](./quickstart) and understand [how retain works](./retain).
:::

## How operations work

When an API call needs background work, the request handler writes a row to the `async_operations` table with `status=pending` and returns immediately. A worker (running either in-process inside the API by default, or as a dedicated service — see [Services - Worker Service](../services#worker-service)) polls the table, claims pending rows, executes the corresponding handler, and marks the row `completed` or `failed`.

By default, every operation runs in-process: no external queue, no extra process to deploy. The same code paths support scaling out to dedicated worker processes when throughput demands it.

### Lifecycle

| Status | Meaning |
|--------|---------|
| `pending` | The row is queued. Either no worker has picked it up yet, or an extension has parked it via `next_retry_at` in the future (e.g., for backpressure). |
| `processing` | A worker has claimed the row and is actively running the handler. |
| `completed` | The handler returned successfully. |
| `failed` | The handler raised. `error_message` carries the reason; you can re-queue with `POST /…/retry`. |
| `cancelled` | The operation was cancelled via `DELETE /…/operations/{id}` before a worker picked it up. Cancelling a `processing` operation is not supported. |

The worker retries failed operations up to `HINDSIGHT_API_WORKER_MAX_RETRIES` times before settling on `failed`. Deterministic failures (e.g., invalid embedding dimensions, integrity violations) skip retries — they won't succeed by re-running.
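
From a client's point of view, `completed`, `failed`, and `cancelled` are the states where polling can stop. The sketch below shows one way to wait for an operation; `fetch_status` is a hypothetical callable that returns the operation's JSON as a dict (for example, by calling the operation status endpoint), and the interval and timeout values are arbitrary.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_operation(fetch_status: Callable[[], dict], interval: float = 1.0,
                       timeout: float = 300.0, sleep=time.sleep,
                       clock=time.monotonic) -> dict:
    """Poll until the operation reaches a terminal status or the timeout expires."""
    deadline = clock() + timeout
    while True:
        operation = fetch_status()
        if operation["status"] in TERMINAL_STATUSES:
            return operation
        if clock() >= deadline:
            raise TimeoutError("operation did not finish in time")
        sleep(interval)


# Simulated status sequence instead of real HTTP calls.
responses = iter([{"status": "pending"}, {"status": "processing"}, {"status": "completed"}])
print(wait_for_operation(lambda: next(responses), sleep=lambda _: None))
```

Injecting `sleep` and `clock` keeps the helper testable without real delays.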

## Operation types

Every operation has an `operation_type` in the database and a `task_type` in the payload. They're usually the same.

### `retain`

Submitted by `POST /v1/default/banks/{bank_id}/memories` with `async=true`, or by the multi-item `retain_batch` call. The handler runs the same pipeline as a synchronous retain: fact extraction (LLM), embedding generation, entity resolution, and link creation (temporal, semantic).

Use async retain when you're ingesting thousands of items and don't want the HTTP call to hold for minutes. The `operation_id` in the response lets you poll for completion.

#### Parent op: `retain_batch`

For large submissions, Hindsight automatically splits the input into sub-batches and creates a single `retain_batch` parent operation that tracks the children. The parent's status reflects the aggregate: `pending` until at least one child is running, `processing` while children execute, `completed` once every child has completed successfully, and `failed` if any child failed. Each child is itself a `retain` operation linked to the parent, so you can drill in for per-batch error messages.

When you list operations, the parent and its children all appear by default. Pass `exclude_parents=true` to hide the aggregate rows and show only individual `retain` jobs.
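
The aggregate status of a `retain_batch` parent can be approximated from its children's statuses. This sketch makes two assumptions that the text does not settle: a failed child marks the parent `failed` as soon as it is observed, and `cancelled` children are not modelled. The function name is hypothetical.

```python
def parent_status(child_statuses: list[str]) -> str:
    """Approximate the aggregate status of a retain_batch parent."""
    # Assumption: any failed child takes precedence over everything else.
    if any(status == "failed" for status in child_statuses):
        return "failed"
    if child_statuses and all(status == "completed" for status in child_statuses):
        return "completed"
    # At least one child has started (or already finished) while others remain.
    if any(status in ("processing", "completed") for status in child_statuses):
        return "processing"
    return "pending"


print(parent_status(["pending", "pending"]))
print(parent_status(["completed", "processing", "pending"]))
print(parent_status(["completed", "completed"]))
print(parent_status(["completed", "failed"]))
```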

### `file_convert_retain`

Submitted by file upload endpoints. The handler runs MIME-specific conversion (PDF → text, DOCX → text, etc.) and then passes the extracted text into the retain pipeline. Failures here are **non-retryable** by default — a corrupted PDF or missing OCR won't improve on rerun, so the operation goes straight to `failed`.

Which parser runs (`markitdown`, `iris`, or `llama_parse`) is selected per deployment via `HINDSIGHT_API_FILE_PARSER`, and clients can override it per request — see [Configuration → File Processing](../configuration#file-processing).

### `consolidation`

Produces **observations** from new world/experience memories. See [Observations](../observations) for what they are and how they're synthesized.

Triggered automatically:

- After every retain that added world/experience facts (gated by per-bank `enable_auto_consolidation` and `enable_observations`).
- After deletes that invalidated existing observations (the source memory disappeared → derived observations are stale → re-run with the surviving co-source memories).
- Manually via `POST /v1/default/banks/{bank_id}/consolidate`. Pass `observation_scopes` to consolidate only memories matching specific tag combinations.

**Bank-deduped**: while one `consolidation` job is pending for a bank, repeat submits return the existing `operation_id` instead of stacking. Once the job starts processing, the next submit becomes the next pending slot.
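
The dedup behaviour can be pictured with a small in-memory model. `ConsolidationQueue` and its methods are hypothetical and only illustrate the "one pending slot per bank" rule; the real implementation lives in the operations table.

```python
import itertools


class ConsolidationQueue:
    """Toy model: at most one *pending* consolidation job per bank."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}  # bank_id -> operation_id of the pending job

    def submit(self, bank_id):
        # A repeat submit while a job is pending returns the existing id.
        if bank_id in self._pending:
            return self._pending[bank_id]
        operation_id = f"op-{next(self._ids)}"
        self._pending[bank_id] = operation_id
        return operation_id

    def start_processing(self, bank_id):
        # Once a worker claims the job, the pending slot is free again.
        return self._pending.pop(bank_id, None)
```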

### `refresh_mental_model`

A mental model has a `source_query` that defines which memories it summarizes. The handler re-runs that query, re-summarizes the result, and updates the model's content in place.

Triggered either manually via `POST /v1/default/banks/{bank_id}/mental-models/{id}/refresh`, or automatically by the auto-refresh schedule for mental models that have one configured.

### `graph_maintenance`

Reconciles derived state that goes stale after a delete. Every invocation runs three passes:

1. **Link top-up.** Drains the `graph_maintenance_queue` (units whose outgoing temporal/semantic links lost a neighbour). For each, if the unit is under its cap (20 temporal, 50 semantic), Hindsight re-runs the same probes retain uses and inserts the missing links. Without this, the retain pipeline's top-K capping would leave surviving units permanently under-capped after every delete — degrading graph-expansion recall.
2. **Orphan entity prune.** Deletes entities in the bank with no remaining `unit_entities` references. FK `ON DELETE CASCADE` on `entity_cooccurrences` then removes any cooccurrence row pointing at a pruned entity.
3. **Stale cooccurrence prune.** Cleans up `entity_cooccurrences` rows where both endpoints still exist but no current memory_unit references both of them — the cooccurrence was real when it was recorded, but every unit that witnessed it has since been deleted.
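
Passes 2 and 3 above amount to set operations over the surviving units. The sketch below is a simplified in-memory model, not the SQL Hindsight runs: `prune_graph` and its argument shapes are assumptions made for illustration.

```python
from itertools import combinations


def prune_graph(unit_entities, entities, cooccurrences):
    """Return (live_entities, live_cooccurrences) after a delete.

    unit_entities: dict of surviving unit id -> set of entity ids
    entities:      set of all entity ids in the bank
    cooccurrences: set of (entity_a, entity_b) tuples, each sorted
    """
    # Orphan entity prune: keep only entities some surviving unit references.
    referenced = set().union(*unit_entities.values())
    live_entities = entities & referenced

    # Stale cooccurrence prune: keep a pair only if one unit still references
    # both endpoints. Pairs touching a pruned entity fall out automatically,
    # which mirrors the cascade described in pass 2.
    witnessed = set()
    for ents in unit_entities.values():
        witnessed.update(combinations(sorted(ents), 2))
    return live_entities, cooccurrences & witnessed
```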

Bank-deduped at submit time, so concurrent triggers against the same bank coalesce into one drain.

**Triggers:** any delete that removes memory_units — `DELETE /documents/{id}`, `DELETE /memories/{id}`, and re-retaining an existing `document_id` (the upsert path). A full bank wipe (`delete_bank`) is a no-op: there's nothing left in the bank to maintain.

### `webhook_delivery`

After certain operations complete (e.g., consolidation finishing on a bank with a registered webhook), Hindsight enqueues a `webhook_delivery` task. The handler POSTs the payload to the configured URL and retries on transient failures.

## Endpoints

All paths below are scoped by `bank_id`.

### List operations

```bash
GET /v1/default/banks/{bank_id}/operations
```

Query parameters:

| Param | Description |
|-------|-------------|
| `status` | Filter by `pending`, `processing`, `completed`, `failed`, `cancelled`. |
| `type` | Filter by `retain`, `file_convert_retain`, `consolidation`, `refresh_mental_model`, `graph_maintenance`, `webhook_delivery`. |
| `limit` | 1–100, default 20. |
| `offset` | Pagination offset. |
| `exclude_parents` | Exclude parent batch operations from results (large `retain_batch` calls create one parent + N children). |
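
As a minimal sketch, the query string can be assembled with the standard library. `list_operations_url` is a hypothetical helper and the base URL in the usage note is a placeholder; only the path and parameter names come from the table above.

```python
from urllib.parse import quote, urlencode

VALID_STATUSES = {"pending", "processing", "completed", "failed", "cancelled"}


def list_operations_url(base_url, bank_id, status=None, op_type=None,
                        limit=20, offset=0, exclude_parents=False):
    """Build the list-operations URL with validated query parameters."""
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    params = {"limit": limit, "offset": offset}
    if status is not None:
        params["status"] = status
    if op_type is not None:
        params["type"] = op_type
    if exclude_parents:
        params["exclude_parents"] = "true"
    path = f"/v1/default/banks/{quote(bank_id, safe='')}/operations"
    return f"{base_url}{path}?{urlencode(params)}"
```

For example, `list_operations_url("http://localhost:8888", "my-bank", status="failed", limit=50)` yields a URL that lists up to 50 failed operations for `my-bank`.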


</Tabs>

`items_count` is operation-specific — non-zero only for retain-shaped operations (it counts content items in the submission).

### Get operation status


</Tabs>

Query parameters:

| Param | Description |
|-------|-------------|
| `include_payload` | Include the raw task payload (the submission params) in the response as `task_payload`. Default `false`; may be large. |

A few response fields are worth calling out:

| Field | Description |
|-------|-------------|
| `updated_at` | When the operation's row last changed — claim, progress heartbeat, or completion. |
| `progress` | Last-known progress snapshot for a running operation, or `null` if none was recorded (for example, operations that completed instantly, or rows created before progress tracking existed). |
| `task_payload` | The raw submission params; only populated when `include_payload=true`. |

`progress` is written at coarse phase/batch boundaries (consolidation, batch retain) and lets you tell a healthy long-running job from a frozen one: if `processed` keeps advancing across polls the job is alive; identical numbers with no movement in `at` mean it's stuck. Its shape:

| Field | Description |
|-------|-------------|
| `stage` | Coarse phase the operation last reported (e.g. `processing_batch`). |
| `at` | ISO-8601 timestamp when this snapshot was written. |
| `processed` | Units of work finished so far (sub-batches, memories), when known. |
| `total` | Total units of work for the operation, when known. |
| `detail` | Operation-specific counters (e.g. `observations_created`, `round`, `items_in_sub_batch`). |
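
One way to apply the liveness check on the client side is to compare the `progress` snapshots from two consecutive polls. `is_stalled` is a hypothetical helper; how long to wait between polls before trusting the result is left to the caller.

```python
def is_stalled(previous, current):
    """Return True when two consecutive `progress` snapshots show no movement."""
    if previous is None or current is None:
        # No snapshot was recorded, so there is nothing to compare.
        return False
    same_count = previous.get("processed") == current.get("processed")
    same_time = previous.get("at") == current.get("at")
    return same_count and same_time
```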

### Cancel a pending operation

Returns `409` if the operation is already in `processing`, `completed`, or `failed` state.


</Tabs>

### Retry a failed operation

The row's status resets to `pending` and the worker picks it up again. Returns `409` if the operation isn't in `failed` or `cancelled` state.


</Tabs>
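
The status rules for cancel and retry can be summarised as two predicates. These helpers are hypothetical; the text does not state how cancel treats an operation that is already `cancelled`, so the sketch conservatively allows cancel only from `pending`.

```python
def can_cancel(status):
    """Only operations a worker has not yet picked up can be cancelled."""
    return status == "pending"


def can_retry(status):
    """Retry is accepted for `failed` or `cancelled`; anything else gets 409."""
    return status in ("failed", "cancelled")
```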

## Async retain example

Submit a batch asynchronously and poll until the operation completes:


</Tabs>
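
A minimal polling sketch, independent of any particular client. `fetch_status` is a hypothetical callable you supply (for instance, a thin wrapper around the get-operation-status endpoint) that returns a dict with a `status` field; the interval and timeout defaults are arbitrary.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_operation(fetch_status, operation_id, interval=1.0,
                       timeout=300.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the operation reaches a terminal status or time runs out."""
    deadline = clock() + timeout
    while True:
        operation = fetch_status(operation_id)
        if operation["status"] in TERMINAL_STATUSES:
            return operation
        if clock() >= deadline:
            raise TimeoutError(f"operation {operation_id} did not finish in time")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop easy to test without real delays.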

## Worker tuning

Each worker has a single concurrency budget (`HINDSIGHT_API_WORKER_MAX_SLOTS`, default 10) shared across all operation types. Per-type slot reservations (`HINDSIGHT_API_WORKER_<TYPE>_MAX_SLOTS`) carve out guaranteed capacity within that budget; remaining slots form a shared pool any type can use. See [Configuration → Worker Configuration](../configuration#distributed-workers) for the full table.

For most deployments the defaults are fine. Reserve slots for an operation type if you've seen it starved by a flood of another type (e.g., a long `file_convert_retain` blocking `graph_maintenance` on a deletion-heavy workload).

## Next Steps

- [**Documents**](./documents) — Track document sources
- [**Memory Banks**](./memory-banks) — Configure bank settings


---


## File: developer/api/quickstart.mdx

# Quick Start

Get up and running with Hindsight in 60 seconds.




## Clients


## Start the API Server


</Tabs>

:::tip LLM Provider
Hindsight requires an LLM with structured output support. Recommended: **Groq** with `gpt-oss-20b` for fast, cost-effective inference.
See [LLM Providers](/developer/models#llm) for more details.
:::

---

## Use the Client


</Tabs>

---

## What's Happening

| Operation | What it does |
|-----------|--------------|
| **Retain** | Content is processed, facts are extracted, entities are identified and linked in a knowledge graph |
| **Recall** | Four search strategies (semantic, keyword, graph, temporal) run in parallel to find relevant memories |
| **Reflect** | Retrieved memories are used to generate a disposition-aware response |

---

## Integrations

Browse all supported integrations in the [Integrations Hub](/integrations).

## Next Steps

- [**Retain**](./retain) — Advanced options for storing memories
- [**Recall**](./recall) — Search and retrieval strategies
- [**Reflect**](./reflect) — Disposition-aware reasoning
- [**Memory Banks**](./memory-banks) — Configure disposition and mission
- [**Server Deployment**](/developer/installation) — Docker Compose, Helm, and production setup


---


## File: developer/api/recall.mdx

# Recall Memories

Retrieve memories from a bank using multi-strategy recall.

When you **recall**, Hindsight runs four retrieval strategies in parallel — semantic similarity, keyword (BM25), graph traversal, and temporal — then fuses and reranks the results into a single ranked list. The response contains structured facts, not raw documents.




:::info How Recall Works
Learn about the four retrieval strategies (semantic, keyword, graph, temporal) and RRF fusion in the [Recall Architecture](/developer/retrieval) guide.
:::

:::tip Prerequisites
Make sure you've completed the [Quick Start](./quickstart) to install the client and start the server.
:::

## Basic Recall


</Tabs>

---

## Parameters

### query

The natural language question or statement to search for. This is the only required field. The query drives all four retrieval strategies simultaneously: it is embedded for semantic search, tokenized for BM25 keyword search, used to seed graph traversal, and parsed for temporal expressions. After retrieval, the raw query text is also passed to the cross-encoder reranker to re-score every candidate. Queries exceeding 500 tokens are rejected.

### types

Controls which categories of memory facts are searched. Accepted values are `world` (objective facts), `experience` (events and conversations), and `observation` (deduplicated, evidence-grounded beliefs consolidated from multiple memories). When omitted, all three types are searched.

Each type runs the full four-strategy retrieval pipeline independently, so narrowing `types` reduces both the result set and query cost.


</Tabs>

:::tip About Observations
Observations are deduplicated, evidence-grounded beliefs consolidated from multiple facts — preferences, recurring patterns, and durable learnings the memory bank has built up. Each observation references its supporting memories (with exact quotes), and is refined rather than overwritten when new evidence arrives. They are created and maintained automatically in the background after retain operations.
:::

### prefer_observations

Because observations are consolidated from raw facts, recalling `observation` alongside `world` and `experience` can return the same information twice — once as the raw fact and once folded into an observation. With `prefer_observations` you get the best of both: you still recall every type, but whenever an observation in the results was built from a raw fact, that raw fact is dropped so the observation supersedes it. The freed slots are backfilled with the next-best results, so you don't lose coverage.

This lets you ask for everything without choosing between "raw facts only" (no consolidation) and "observations only" (which may lag behind the latest retains while consolidation catches up). **Disabled by default** — set it to `true` to opt in. It has no effect unless both `observation` and at least one of `world`/`experience` are included in `types`.
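
The drop-and-backfill behaviour can be approximated as follows. This is a sketch under simplifying assumptions: `apply_prefer_observations` is a hypothetical function, a plain result count (`limit`) stands in for the token budget, and each observation is assumed to carry its `source_fact_ids`.

```python
def apply_prefer_observations(ranked, limit):
    """Drop raw facts superseded by an observation in the result window.

    `ranked` is the full candidate list in relevance order. Freed slots are
    backfilled from further down the list, then the check is repeated.
    """
    dropped = set()
    while True:
        window = [r for r in ranked if r["id"] not in dropped][:limit]
        superseded = {
            fact_id
            for r in window
            if r["type"] == "observation"
            for fact_id in (r.get("source_fact_ids") or [])
        }
        newly_dropped = {r["id"] for r in window if r["id"] in superseded}
        if not newly_dropped:
            return window
        dropped |= newly_dropped
```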

### budget

Controls retrieval depth and breadth. Accepted values are `low`, `mid` (default), and `high`. Use `low` for fast simple lookups, `mid` for balanced everyday queries, and `high` when you need to find indirect connections or exhaustive coverage.


</Tabs>

### max_tokens

The maximum number of tokens the returned facts can collectively occupy. Defaults to `4096`. Only the `text` field of each fact is counted toward this budget — metadata, tags, entities, and other fields are not included. After reranking, facts are included in relevance order until this budget is exhausted — so you always get the most relevant memories that fit. Hindsight is designed for agents, which think in tokens rather than result counts: set `max_tokens` to however much of your context window you want to allocate to memories.


</Tabs>
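
The budget rule can be sketched as a greedy pass over the reranked list. Two assumptions to flag: whitespace splitting stands in for the real tokenizer, and the sketch stops at the first fact that does not fit rather than skipping it to try smaller ones. `fit_to_budget` and `count_tokens` are hypothetical names.

```python
def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_budget(ranked_facts, max_tokens=4096):
    """Keep facts in relevance order until the `text` token budget is spent."""
    kept, used = [], 0
    for fact in ranked_facts:
        cost = count_tokens(fact["text"])  # only `text` counts toward the budget
        if used + cost > max_tokens:
            break
        kept.append(fact)
        used += cost
    return kept
```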

### query_timestamp

An ISO 8601 datetime representing when the query is being asked, from the user's perspective. When provided, it is used as the anchor for resolving relative temporal expressions in the query and for recency scoring — for example, if the query says "last month" and `query_timestamp` is `2023-05-30`, the temporal search window becomes approximately April 2023, and recency boosts are calculated as of May 30, 2023. Without it, the server's current time is used as the anchor. This field matters most for replaying historical conversations or building agents that need time-anchored recall.
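
To make the anchoring concrete, here is one way "last month" could be resolved against a `query_timestamp`. `last_month_window` is a hypothetical helper; Hindsight's actual temporal parser may choose slightly different boundaries.

```python
from datetime import datetime, timedelta, timezone


def last_month_window(anchor):
    """Return the half-open [start, end) range of the month before `anchor`."""
    first_of_this_month = anchor.replace(
        day=1, hour=0, minute=0, second=0, microsecond=0
    )
    # Step back one day into the previous month, then snap to its first day.
    start = (first_of_this_month - timedelta(days=1)).replace(day=1)
    return start, first_of_this_month


anchor = datetime(2023, 5, 30, tzinfo=timezone.utc)
start, end = last_month_window(anchor)  # covers April 2023
```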

### include

An optional object controlling supplementary data returned alongside the main facts.

#### chunks

When enabled, the response includes the raw source text chunks from which each fact was extracted. Chunks are fetched before the `max_tokens` filter, so setting `max_tokens=0` returns no facts but can still return chunks. The `max_tokens` sub-option (default `8192`) controls the total chunk token budget independently of the main fact budget. This is useful when agents need surrounding context beyond the extracted fact text.

:::note
When `include.chunks` is enabled, chunks are fetched based on the top-scored reranked results before token filtering. The last chunk is truncated (not dropped) to fit exactly within the budget, and each chunk carries a `truncated` flag indicating whether it was cut.
:::
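
The truncate-rather-than-drop rule might look like this in miniature. As before, whitespace-separated words stand in for tokens, and `pack_chunks` is a hypothetical name rather than a Hindsight function.

```python
def pack_chunks(chunks, max_tokens=8192):
    """Fill the chunk budget in order, truncating the last chunk to fit."""
    packed, remaining = [], max_tokens
    for chunk in chunks:
        if remaining <= 0:
            break
        words = chunk["text"].split()
        truncated = len(words) > remaining
        if truncated:
            words = words[:remaining]  # cut the chunk instead of dropping it
        packed.append(
            {"id": chunk["id"], "text": " ".join(words), "truncated": truncated}
        )
        remaining -= len(words)
    return packed
```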

#### source_facts

When enabled and `types` includes `observation`, each observation result is accompanied by the original contributing facts it was synthesized from. Source facts are returned in a top-level `source_facts` dict keyed by fact ID, and each observation result carries a `source_fact_ids` list for cross-referencing. Facts are deduplicated across observations. The `max_tokens` sub-option (default `4096`) limits the total token budget for source facts.


</Tabs>
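
On the client side, the cross-reference is a dictionary lookup. The sketch assumes the response has already been parsed into a dict with the `results` and `source_facts` fields described above; `sources_by_observation` is a hypothetical helper.

```python
def sources_by_observation(response):
    """Map each observation id to the texts of its contributing facts."""
    source_facts = response.get("source_facts") or {}
    mapping = {}
    for result in response["results"]:
        if result["type"] != "observation":
            continue
        fact_ids = result.get("source_fact_ids") or []
        # A source fact may be missing if the source-facts budget cut it off.
        mapping[result["id"]] = [
            source_facts[fid]["text"] for fid in fact_ids if fid in source_facts
        ]
    return mapping
```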

#### entities

Enabled by default. When active, each returned fact includes the canonical names of entities associated with it. Set to `null` to skip the entity JOIN query and reduce response size. The `max_tokens` sub-option (default `500`) is a future-facing guard for entity data.

### tags

Filters recall to only memories that match the specified tags. When omitted, all memories regardless of tags are eligible. Tag filtering is applied at the database level across all four retrieval strategies, not as a post-processing step.

The `tags_match` parameter controls the filtering logic:

| Mode | Untagged memories | Match condition |
|------|-------------------|-----------------|
| `any` (default) | Included | Memory has **at least one** of the specified tags |
| `any_strict` | Excluded | Memory has **at least one** of the specified tags |
| `all` | Included | Memory has **all** of the specified tags |
| `all_strict` | Excluded | Memory has **all** of the specified tags |
| `exact` | Excluded | Memory has **exactly** the specified tag set |

#### Scenario setup

Consider a bank with these four memories:

| Memory | Tags |
|--------|------|
| "Alice prefers async communication" | `["user:alice"]` |
| "Bob dislikes long meetings" | `["user:bob"]` |
| "Team uses Slack for announcements" | `["user:alice", "team"]` |
| "Company policy: no meetings on Fridays" | *(untagged)* |

#### `any` — OR matching, includes untagged (default)

Returns memories that have **at least one** matching tag, plus untagged memories.


</Tabs>

Use this for **shared global knowledge + user-specific** patterns, where untagged memories represent information everyone should see.

#### `any_strict` — OR matching, excludes untagged

Same as `any` but untagged memories are excluded.


</Tabs>

Use this when memories are **fully partitioned by tags** and untagged memories should never be visible.

#### `all` — AND matching, includes untagged

Returns memories that have **every** specified tag, plus untagged memories.


</Tabs>

Use this when memories must belong to a **specific intersection** of scopes (e.g., only memories relevant to both a user and a project), while still surfacing shared global knowledge.

#### `all_strict` — AND matching, excludes untagged

Returns memories that have **every** specified tag, and excludes untagged memories.


</Tabs>

Use this for strict scope enforcement where a memory must explicitly belong to **all** specified contexts.

:::tip Extra tags are fine
A memory with tags `["user:alice", "team", "project:x"]` will still match a filter of `["user:alice", "team"]` under `all_strict` — extra tags on the memory are not a problem. The filter only requires the memory to contain **at least** the specified tags.
:::

#### `exact` — set equality, excludes untagged

Returns memories whose tag set is exactly equal to the specified tags, regardless of tag order. Unlike `all_strict`, memories with extra tags do not match.

Use this when filtering a precise observation scope returned by `GET /v1/default/banks/{bank_id}/observations/scopes`, where `["user:alice"]` should not also match observations scoped to `["user:alice", "project:x"]`.

:::tip Filter to global (untagged) observations only
The empty scope is a real scope — it's where `observation_scopes: "shared"` consolidation writes. Set `tags_match: "exact"` with **no tags** (omit `tags`, or pass `[]`) to recall **only** untagged/global memories and exclude every tagged one:

```json
{ "query": "...", "tags": [], "tags_match": "exact" }
```

With any other `tags_match` mode, absent or empty `tags` means "no tag filter" (all memories are eligible). Only under `exact` do absent/empty tags select "the global scope". This is the way to read back just the global observations after you've started using more specific scopes.
:::
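
The five modes can be expressed as a single predicate and checked against the scenario memories above. This is an illustrative model of the documented semantics, not the database-level filter Hindsight applies; `tags_match` (as a function) and `select` are hypothetical names.

```python
def tags_match(memory_tags, filter_tags, mode="any"):
    """Return True if a memory passes the tag filter under `mode`."""
    memory, wanted = set(memory_tags), set(filter_tags or [])
    if mode == "exact":
        return memory == wanted  # empty filter selects only untagged memories
    if not wanted:
        return True  # no tag filter: every memory is eligible
    if not memory:
        return mode in ("any", "all")  # strict modes exclude untagged memories
    if mode in ("any", "any_strict"):
        return bool(memory & wanted)
    if mode in ("all", "all_strict"):
        return wanted <= memory  # extra tags on the memory are fine
    raise ValueError(f"unknown mode: {mode!r}")


memories = {
    "alice": ["user:alice"],
    "bob": ["user:bob"],
    "slack": ["user:alice", "team"],
    "policy": [],
}


def select(tags, mode):
    return [name for name, mt in memories.items() if tags_match(mt, tags, mode)]
```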

### tag_groups

`tag_groups` is a list of compound boolean tag filters. The groups in the list are AND-ed together at the top level. Each group is a recursive boolean expression: a **leaf** node `{tags, match}`, or a **compound** node `{and: [...]}`, `{or: [...]}`, or `{not: ...}`.

`tag_groups` and `tags` / `tags_match` can be used simultaneously — they are AND-ed together.

#### Leaf node

```json
{ "tags": ["step:5", "step:8"], "match": "any_strict" }
```

`match` accepts the same values as `tags_match`: `any`, `all`, `any_strict`, `all_strict`, `exact`. Defaults to `any_strict`.

#### Compound nodes

```json
{ "and": [ <TagGroup>, <TagGroup>, ... ] }
{ "or":  [ <TagGroup>, <TagGroup>, ... ] }
{ "not": <TagGroup> }
```

#### Examples

**Step filter AND user scope** — two top-level groups AND-ed:

```json
{
  "tag_groups": [
    { "tags": ["step:5", "step:8", "step:12"], "match": "any_strict" },
    { "tags": ["user:ep_42"], "match": "all_strict" }
  ]
}
```

**Nested OR inside AND** — user must match, plus either step OR priority:

```json
{
  "tag_groups": [
    { "tags": ["user:alice"], "match": "all_strict" },
    { "or": [
        { "tags": ["step:5"], "match": "any_strict" },
        { "tags": ["priority:high"], "match": "all_strict" }
    ]}
  ]
}
```

**Exclusion** — user must match, but archived memories are excluded:

```json
{
  "tag_groups": [
    { "tags": ["user:alice"], "match": "all_strict" },
    { "not": { "tags": ["archived"], "match": "any_strict" } }
  ]
}
```
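
To show how these expressions compose, here is a small recursive evaluator. It is a client-side model under the semantics described in this section, with hypothetical function names; the server applies the equivalent logic in its own queries.

```python
def leaf_matches(memory, tags, match="any_strict"):
    """Evaluate a leaf `{tags, match}` node against a set of memory tags."""
    wanted = set(tags)
    if match == "exact":
        return memory == wanted
    if not wanted:
        return True
    if not memory:
        return match in ("any", "all")
    if match in ("any", "any_strict"):
        return bool(memory & wanted)
    if match in ("all", "all_strict"):
        return wanted <= memory
    raise ValueError(f"unknown match mode: {match!r}")


def group_matches(memory_tags, group):
    """Recursively evaluate one tag group (leaf or compound node)."""
    if "and" in group:
        return all(group_matches(memory_tags, g) for g in group["and"])
    if "or" in group:
        return any(group_matches(memory_tags, g) for g in group["or"])
    if "not" in group:
        return not group_matches(memory_tags, group["not"])
    return leaf_matches(
        set(memory_tags), group["tags"], group.get("match", "any_strict")
    )


def passes_tag_groups(memory_tags, tag_groups):
    """Top-level groups are AND-ed together."""
    return all(group_matches(memory_tags, g) for g in tag_groups)
```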

### trace

When set to `true`, the response includes a detailed debug trace covering the query embedding, entry points, per-strategy retrieval results, RRF fusion candidates, reranked results, temporal constraints detected, and per-phase timings. Has no effect on the retrieval logic itself. Useful for understanding why specific memories were or were not returned.

### min_scores

An optional object of per-stage score floors, each compared **inclusively** (`>=`) against the matching field of a result's [`scores`](#scores) and AND-ed together. Any field you leave unset imposes no floor; omitting `min_scores` entirely (the default) applies no score filtering at all. The four fields operate at **two different levels of the pipeline**:

| field | level | effect |
|---|---|---|
| `semantic` | retrieval | minimum vector similarity, pushed into the SQL — prunes weak vector matches **before** fusion (overrides the global similarity minimum for this request) |
| `keyword` | retrieval | minimum keyword/full-text (BM25) score, pushed into the SQL — prunes weak keyword matches before fusion |
| `reranker` | post-query | minimum normalized cross-encoder score, applied to the ranked results |
| `final` | post-query | minimum final ranking score, applied to the ranked results |

```json
{ "query": "...", "min_scores": { "reranker": 0.5 } }
```

The retrieval-level floors (`semantic`/`keyword`) change *which candidates are considered*, so they can also change the final ordering; the post-query floors (`reranker`/`final`) only drop already-ranked results. Because freed slots are **not** backfilled, any floor can return fewer results than the budget allows.

**Use floors with care.** The reranker's scores are reliable for *ordering* but not as *absolute* values — a clearly-relevant memory can score `~0.001` on one query and `~1.0` on another, so a fixed cutoff risks silently dropping good results. Calibrate any threshold against the scores you actually observe (recall with no `min_scores` first and inspect the [`scores`](#scores) object).

Each threshold is compared against the matching field in the response [`scores`](#scores) object. See the note under [`scores`](#scores) on why the scale is relative, not absolute, before relying on a fixed threshold.
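
The post-query floors behave like an inclusive, AND-ed filter over each result's `scores`. The sketch below mirrors that on the client; `apply_min_scores` is a hypothetical helper, and keeping results whose score for a floored field is `null` is an assumption, since the text does not say how missing scores are treated.

```python
def apply_min_scores(results, min_scores=None):
    """Drop ranked results that fall below any configured floor."""
    floors = {k: v for k, v in (min_scores or {}).items() if v is not None}

    def passes(scores):
        for field, floor in floors.items():
            value = scores.get(field)
            # Assumption: a missing score is not subject to the floor.
            if value is not None and value < floor:
                return False
        return True

    # No backfill: the list simply gets shorter.
    return [r for r in results if passes(r["scores"])]
```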

---

## Response

### results

The main list of recalled facts, ordered by relevance. Relevance is computed by running four retrieval strategies in parallel — semantic similarity, BM25 keyword, graph traversal, and temporal — fusing their rankings with Reciprocal Rank Fusion (RRF), then re-scoring the merged candidates with a cross-encoder reranker against the original query.
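
For readers unfamiliar with Reciprocal Rank Fusion, a minimal version is shown below. The constant `k=60` is the value commonly used in the RRF literature, not a documented Hindsight setting, and the tie-break by id is only there to make the output deterministic.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked id lists into one using Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every item it contains.
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: (-scores[item], item))


fused = rrf_fuse([
    ["a", "b", "c"],  # semantic
    ["b", "a", "d"],  # keyword
    ["b", "c"],       # graph
    [],               # temporal found nothing
])
```

Items surfaced by several strategies accumulate score, which is why `b` ends up first here.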

Each result carries a [`scores`](#scores) object (see below). Treat these as **relative** signals: they reflect the ranking within a single query, not an absolute, cross-query confidence — a `0.8` from one query is not comparable to a `0.8` from another. For most agents the right approach is to consume memories in order and let `max_tokens` determine how many fit, rather than filtering by score. The `scores` object (and the [`min_scores`](#min_scores) parameter) exist for callers that want to inspect the ranking or drop a low-confidence tail; calibrate any threshold against the scores you see on an unfiltered query.

Each item in `results` has the following fields:

#### id

The unique identifier of this fact. Use it to cross-reference with `source_facts` or for application-level deduplication.

#### text

The extracted fact text as stored in the memory bank.

#### type

The fact category: `world` for objective information, `experience` for events and conversations, or `observation` for consolidated knowledge synthesized over time.

#### context

The context label provided when the fact was retained (e.g., `"team meeting"`, `"slack"`). `null` if none was set.

#### metadata

The key-value string pairs attached when the fact was retained. `null` if none were set.

#### tags

The visibility-scoping tags attached to this fact.

#### entities

A list of canonical entity name strings linked to this fact. Only populated when `include.entities` is enabled (the default). `null` otherwise.

#### occurred_start / occurred_end

ISO 8601 datetimes representing when the described event started and ended. Extracted by the LLM from the content during retain. `null` if the content had no temporal information.

#### mentioned_at

ISO 8601 datetime of when this fact was retained into the bank.

#### document_id

The document ID this fact belongs to, as set during retain.

#### chunk_id

The ID of the source text chunk this fact was extracted from. Used to cross-reference with `chunks` in the response when `include.chunks` is enabled.

#### source_fact_ids

For `observation`-type results only: the IDs of the original facts this observation was synthesized from. Cross-references with `source_facts` in the response. `null` for other types or when `include.source_facts` is not enabled.

#### scores

An object of the per-stage scores for this result. `null` for `source_facts` entries, which are attached by provenance rather than ranked. Fields:

- **`final`** — the score this fact was ranked by (cross-encoder relevance × recency/temporal/evidence boosts). `results` is ordered by it descending. A relative signal, not a calibrated probability (see the note above).
- **`reranker`** — the cross-encoder's normalized relevance (`0`–`1`). `null` when the deployment uses a passthrough reranker (RRF/interleave modes).
- **`semantic`** — the raw vector cosine similarity (`0`–`1`). `null` if this result was not surfaced by semantic search.
- **`keyword`** — the raw keyword/full-text (BM25) score (`≥ 0`, unbounded). `null` if this result was not surfaced by keyword search.

Each field is also a valid [`min_scores`](#min_scores) floor.

---

### source_facts

A dict keyed by fact ID containing full `RecallResult` objects for the source facts that contributed to observation results. Only present when `include.source_facts` is enabled. Facts are deduplicated — if two observations share a source fact, it appears once.

### chunks

A dict keyed by chunk ID containing the raw source text chunks associated with the returned facts. Only present when `include.chunks` is enabled. Each chunk has `id`, `text`, `chunk_index`, and `truncated` (whether the text was cut to fit the token budget).

### entities

A dict keyed by canonical entity name containing entity state objects. Only present when `include.entities` is enabled. Each entry has `entity_id`, `canonical_name`, and `observations`.

### trace

A debug object present only when `trace: true` was set in the request. Contains per-phase timings, retrieval breakdowns, and RRF fusion details.


---


## File: developer/api/reflect.mdx

# Reflect

Generate a grounded, disposition-aware response using an agentic reasoning loop.

When you call **reflect**, Hindsight runs an agentic loop that autonomously searches the memory bank using multiple retrieval tools, applies the bank's disposition traits to shape the reasoning style, and produces a final answer grounded in what it found. Unlike recall — which returns raw facts — reflect returns a synthesized response written by the LLM.




:::info How Reflect Works
Learn about disposition-driven reasoning in the [Reflect Architecture](/developer/reflect) guide.
:::

:::tip Prerequisites
Make sure you've completed the [Quick Start](./quickstart) to install the client and start the server.
:::

## Basic Usage


</Tabs>

---

## Parameters

### query

The question or prompt to reflect on. This is the only required field. If you have situational context that should influence the answer, include it directly in the query rather than as a separate field.

### budget

Controls how thoroughly the agent explores the memory bank before answering. Accepted values are `low` (default), `mid`, and `high`. At `low`, the agent does a shallow search optimized for speed. At `mid`, it checks multiple sources when the question warrants it. At `high`, it performs deep exploration across all knowledge levels and may use multiple query variations to find indirect connections. Use `high` for complex questions that require synthesizing information from many sources.


</Tabs>

### max_tokens

Limits the length of the final generated response. Defaults to `4096`. This does not affect how much the agent can retrieve during the agentic loop — only the final answer length.

### response_schema

An optional JSON Schema object. When provided, the LLM generates a response that conforms to the schema and the response includes a `structured_output` field with the result parsed accordingly. The `text` field will be empty since only a single structured LLM call is made. Use this when you need to process the response programmatically rather than display it as prose.


</Tabs>
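
A short sketch of building a reflect request body with a schema and reading the result back. Only the field names `query`, `budget`, `response_schema`, `structured_output`, and `text` come from this page; the helper names and the example schema are hypothetical.

```python
import json


def build_reflect_body(query, response_schema=None, budget="low"):
    """Serialize a reflect request body, adding the schema only when given."""
    body = {"query": query, "budget": budget}
    if response_schema is not None:
        body["response_schema"] = response_schema
    return json.dumps(body)


def read_answer(response):
    """Prefer `structured_output`; fall back to the prose `text` field."""
    structured = response.get("structured_output")
    return structured if structured is not None else response["text"]


# Example JSON Schema (illustrative only).
schema = {
    "type": "object",
    "properties": {"recommendation": {"type": "string"}},
    "required": ["recommendation"],
}
body = build_reflect_body("Should we adopt async standups?", schema)
```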

### tags

Filters which memories the agent can access during reflection. Works identically to [recall tags](./recall#tags) — only memories matching the specified tags are considered. The `tags_match` parameter controls the matching logic (`any`, `all`, `any_strict`, `all_strict`, `exact`) with the same semantics as recall.


</Tabs>

### include

Controls optional supplementary data returned alongside the main response.

#### include.facts

When enabled, the response includes a `based_on` object listing the memories, mental models, and directives the agent actually used to construct the answer. Only sources retrieved during the agent loop can appear here — citations are validated to prevent hallucinated references. Useful for transparency and verification.


</Tabs>

#### include.tool_calls

When enabled, the response includes a `trace` object with the full execution log of every tool call and LLM call made during the agentic loop, including inputs, outputs, and durations. Set `output: false` to include only tool inputs for a smaller payload. Useful for debugging why the agent reached a particular conclusion.

---

## Response

### text

The synthesized answer as a well-formatted markdown string. This is the primary output of reflect. Empty when `response_schema` is provided (use `structured_output` instead in that case).

### structured_output

The LLM's response parsed according to the `response_schema` provided in the request. Only present when `response_schema` was set. `null` otherwise.

### based_on

The sources the agent used to construct the answer. Only present when `include.facts` was enabled. Contains three fields:

- `memories` — a list of memory facts (world, experience, observation) that were retrieved and cited. Each item has `id`, `text`, `type`, `context`, `occurred_start`, and `occurred_end`.
- `mental_models` — a list of mental models that were used. Each item has `id`, `text`, and `context`.
- `directives` — a list of directives that were enforced during reasoning. Each item has `id`, `name`, and `content`.

### usage

Token usage for all LLM calls made during the agentic loop: `input_tokens`, `output_tokens`, and `total_tokens`. Useful for cost tracking.

### trace

The full execution log of the agentic loop. Only present when `include.tool_calls` was enabled. Contains:

- `tool_calls` — each tool invocation with `tool` name (`lookup`, `recall`, `learn`, `expand`), `input`, `output` (if `output: true`), `duration_ms`, and `iteration` number.
- `llm_calls` — each LLM call with `scope` (e.g., `"agent_1"`, `"final"`) and `duration_ms`.


---


## File: developer/api/retain.mdx

# Ingest Data

Store documents, conversations, and raw content into Hindsight to automatically extract and create memories.

When you **retain** content, Hindsight doesn't just store the raw text: it analyzes the content to extract facts, identify entities, and build a connected knowledge graph. This turns unstructured information into structured, queryable memories.




:::info How Retain Works
Learn about fact extraction, entity resolution, and graph construction in the [Retain Architecture](/developer/retain) guide.
:::

:::tip Prerequisites
Make sure you've completed the [Quick Start](./quickstart) to install the client and start the server.
:::

## Store a Document

A single retain call accepts one or more **items**. Each item is a piece of raw content — a conversation, a document, a note — that Hindsight will analyze and decompose into one or many memories. The content itself is never stored verbatim; what gets stored are the structured facts the LLM extracts from it.


</Tabs>

### Retaining a Conversation

A full conversation should be retained as a single item. The LLM can parse any format — plain text, JSON, Markdown, or any structured representation — as long as it clearly conveys who said what and when. The example below uses a simple `Name (timestamp): text` format.


</Tabs>

When the conversation grows — a new message arrives — just retain again with the full updated content and the same `document_id`. Hindsight will delete the previous version and reprocess from scratch, so memories always reflect the latest state of the conversation.

---

## Parameters

### content

The raw text to store. This is the only required field. Hindsight chunks the content, sends each chunk to the LLM for fact extraction, and stores the resulting structured facts — not the original text. A single `content` value can produce many memories depending on how much information it contains.

### timestamp

When the event described in the content actually occurred. Three forms are accepted:

| Value | Behaviour |
|-------|-----------|
| Omitted / `null` | Defaults to the current time at ingestion. |
| ISO 8601 string (e.g. `"2024-01-15T10:30:00Z"`) | Uses the provided datetime. |
| `"unset"` | Stores the content **without any timestamp**. Use this for timeless material such as reference documents, books, or fictional content where no real event time exists. |

The timestamp is injected into the LLM fact-extraction prompt so the model can resolve relative temporal references in the content — for example, if the content says "last Monday", the model uses the provided timestamp as the anchor to pin down the actual date. When `"unset"` is passed the prompt shows `Event Date: Unknown`, allowing the model to correctly return `N/A` for the `when` field of every extracted fact. Providing a real timestamp also enables temporal recall queries like "What happened last spring?" to work correctly.
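
The three accepted forms can be produced by one small client-side helper. This is a sketch, not part of the Hindsight client: the function name is hypothetical, and treating naive `datetime` values as UTC is an assumption of this example.

```python
from datetime import datetime, timezone

UNSET = "unset"


def normalize_timestamp(value):
    """Map a caller-side value to one of the three accepted forms."""
    if value is None:
        return None  # omitted: the server defaults to the ingestion time
    if value == UNSET:
        return UNSET  # timeless content: stored without any timestamp
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumption of this sketch: naive datetimes are treated as UTC.
            value = value.replace(tzinfo=timezone.utc)
        return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    if isinstance(value, str):
        # Validate only; older Python versions need "+00:00" instead of "Z".
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return value
    raise TypeError(f"unsupported timestamp: {value!r}")


print(normalize_timestamp(datetime(2024, 1, 15, 10, 30)))
```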

### context

A short label describing the source or situation — for example `"team meeting"`, `"slack"`, or `"support ticket"`. It is injected directly into the LLM prompt, so it actively shapes how facts are extracted. The same sentence can mean something very different depending on context: "the project was terminated" in a `"performance review"` context versus a `"product roadmap"` context produces different memories.

Providing context consistently is one of the highest-leverage things you can do to improve memory quality.


</Tabs>

### metadata

Arbitrary key-value string pairs that provide context about this item. For example: `{"source": "slack", "channel": "engineering", "thread_id": "T123"}`. Metadata is included in the fact extraction prompt, so the LLM can use it as additional context when extracting facts — for instance, knowing the document title or source can improve accuracy. It is also stored on each memory unit and returned with every recalled memory, letting you do client-side filtering or static enrichment without extra lookups — for example, linking a memory back to its source URL, thread ID, or any application-specific identifier.

### document_id

A caller-supplied string that groups one or more items under a logical document. This field is the key to making retain **idempotent**.

When you provide a `document_id`, Hindsight upserts the document: if a document with that ID already exists in the bank, it and all its associated memories are deleted before the new content is processed and inserted. This means you can safely re-run retain on updated content — for example, a chat thread that grew since last time — without accumulating duplicate memories.

If you omit `document_id`, Hindsight assigns a random UUID per request, so re-ingesting the same content will create duplicate memories.

### update_mode

Controls how Hindsight handles an existing document when you retain with a `document_id` that already exists.

| Value | Behaviour |
|-------|-----------|
| `"replace"` *(default)* | Deletes the old document and all its memories, then processes the new content from scratch. This is the standard upsert described above. |
| `"append"` | Concatenates the new content onto the existing document text and reprocesses the combined document. Delta retain automatically skips unchanged chunks, so only the new portion triggers LLM extraction. |

Append mode requires a `document_id` — without one there is no existing document to append to.

**When to use append**: Use `"append"` for content that grows incrementally — for example, a log file, a journal, or a chat transcript where you receive new messages one at a time. Instead of re-sending the entire history on each update, send only the new content with `update_mode: "append"` and Hindsight will efficiently merge it with what it already has.

```json
{
  "items": [
    {
      "content": "New entry to add to the existing document.",
      "document_id": "my-growing-doc",
      "update_mode": "append"
    }
  ]
}
```
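
To make the difference between the two modes concrete, here is a toy in-memory model. It is not Hindsight's implementation: the fixed-size chunking and the hash-based check for unchanged chunks are assumptions chosen only to illustrate why an append triggers extraction for the new portion alone.

```python
import hashlib

CHUNK_SIZE = 40  # hypothetical; real chunking is internal to Hindsight


def chunks(text):
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]


def digest(chunk):
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


class ToyBank:
    def __init__(self):
        self.docs = {}  # document_id -> full document text
        self.seen = {}  # document_id -> digests of chunks already extracted

    def retain(self, content, document_id, update_mode="replace"):
        """Store the content and return the chunks that would need LLM extraction."""
        if update_mode == "append" and document_id in self.docs:
            text = self.docs[document_id] + content
            known = self.seen[document_id]
        else:
            # "replace", or the first write for this ID: start from scratch
            text = content
            known = set()
        todo = [c for c in chunks(text) if digest(c) not in known]
        self.docs[document_id] = text
        self.seen[document_id] = {digest(c) for c in chunks(text)}
        return todo


bank = ToyBank()
first = bank.retain("x" * 40 + "y" * 40, "my-growing-doc")
second = bank.retain("z" * 10, "my-growing-doc", update_mode="append")
third = bank.retain("z" * 10, "my-growing-doc")  # replace: old text is discarded
print(len(first), len(second), len(third))
```

In this model an append whose existing text does not end on a chunk boundary also reprocesses the final, now longer, chunk.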

### entities

A list of entities you want to guarantee are recognized, merged with any entities the LLM extracts automatically. Each entry has a `text` field (the entity name) and an optional `type` (e.g., `"PERSON"`, `"ORG"`, `"CONCEPT"` — defaults to `"CONCEPT"` if omitted).

Use this when you know certain entities are important but the LLM might miss them or refer to them inconsistently across different parts of the content. Providing entities explicitly ensures they are always linked in the knowledge graph.
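
A small helper can keep entity lists consistent across calls. The sketch below assumes nothing beyond the `text` and `type` fields described above; the helper name and sample content are hypothetical.

```python
def make_entities(specs):
    """Accept plain names or (name, type) pairs and return entity dicts."""
    entities = []
    for spec in specs:
        if isinstance(spec, str):
            # type omitted: the server defaults it to "CONCEPT"
            entities.append({"text": spec})
        else:
            name, entity_type = spec
            entities.append({"text": name, "type": entity_type})
    return entities


item = {
    "content": "Alice from Acme asked about the Atlas migration.",
    "entities": make_entities(
        [("Alice", "PERSON"), ("Acme", "ORG"), "Atlas migration"]
    ),
}
print(item["entities"])
```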

### tags and document_tags

Tags control **visibility scoping** — which memories are visible during recall. A memory is only returned if its tags intersect with the tags filter provided in the recall request. This makes tags useful when a single memory bank serves multiple users or sessions and each should only see their own memories.

Use consistent naming patterns to keep tag filtering predictable. Common conventions: `user:<id>` for per-user scoping, `session:<id>` for session isolation, `room:<id>` for chat rooms, `topic:<name>` for category filtering. The bank also exposes a list-tags endpoint that returns all tags with their memory counts, useful for UI autocomplete or wildcard expansion.

See [Recall API](./recall#tags) for filtering by tags during retrieval.
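
The scoping rule can be pictured with a few lines of Python. This sketch models only the intersection rule stated above; the Recall API may offer other matching modes and may treat untagged memories differently. The helper names and sample memories are hypothetical.

```python
def user_tag(user_id):
    return f"user:{user_id}"


def session_tag(session_id):
    return f"session:{session_id}"


def is_visible(memory_tags, filter_tags):
    """True when the memory shares at least one tag with the recall filter."""
    return bool(set(memory_tags) & set(filter_tags))


memories = [
    {"text": "Prefers dark mode", "tags": [user_tag("u1"), session_tag("s9")]},
    {"text": "Asked about billing", "tags": [user_tag("u2")]},
]
visible = [m["text"] for m in memories if is_visible(m["tags"], [user_tag("u1")])]
print(visible)
```

Building tags through helpers like these keeps the `user:<id>` and `session:<id>` patterns identical at retain and recall time.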

### observation_scopes

Controls which [observations](../observations) this memory contributes to during consolidation. Each scope runs an independent pass, creating or updating observations tagged with only that scope's tags.

:::info Scope isolation
During consolidation, Hindsight uses `all_strict` matching to find existing observations to update — only observations whose tags exactly match the current scope are considered. This keeps scopes isolated: a memory consolidated under `["student:alice"]` will never bleed into an observation tagged `["student:alice", "teacher:bob"]`.
:::

The examples below use a lesson transcript retained with `tags: ["student:alice", "teacher:bob", "session-id:s1"]`.

#### combined *(default)*

One consolidation pass using all tags together. The resulting observation is tagged with the full set.

- Observations created: `["student:alice", "teacher:bob", "session-id:s1"]`
- ✗ *"What does Alice struggle with across all her sessions?"* — no match, because no observation was ever built for `student:alice` alone
- ✗ *"How does Bob teach?"* — no match for `teacher:bob` alone
- ✓ *"What happened in session s1 with Alice and Bob?"* — exact match

**Use when** the memory is meaningful only as a whole and you never need to query any single tag in isolation.

#### shared

One consolidation pass over a single global, **untagged** scope. The memory's own tags are ignored for observation scoping (they stay on the source facts for recall filtering), so memories with *different* tags all consolidate into the **same** observation.

- Observations created: one untagged observation (`[]`)
- ✓ Untagged observations match every recall regardless of tag filter
- Deduplicates across volatile per-call tags: if each session is retained with a unique `session-id:…` tag, `combined` and `per_tag` create a fresh observation every session (the tag never repeats), whereas `shared` folds them all into one.

**Use when** your tags are per-call provenance (e.g. session ids) that you want for recall filtering and debugging but not as a consolidation boundary — keep the tag on `tags` and set `observation_scopes: "shared"`.

:::caution `shared` vs `[[]]` vs `[]`
`shared` is equivalent to the explicit scope `[[]]` — a list containing one empty scope. Do **not** confuse it with `[]` (an empty list), which declares *zero* scopes and silently falls back to `combined`.
:::

#### per_tag

One consolidation pass per individual tag. Each tag gets its own isolated observation that grows with every new memory sharing that tag.

- Observations created: `["student:alice"]` · `["teacher:bob"]` · `["session-id:s1"]`
- ✓ *"What does Alice struggle with across all her sessions?"*
- ✓ *"How does Bob teach?"*
- ✓ *"What happened in session s1?"*
- ✗ *"How does Alice perform specifically with Bob?"* (no observation for the `["student:alice", "teacher:bob"]` combination)
- ✗ *"How does Bob teach in session s1 specifically?"* (no observation for `["teacher:bob", "session-id:s1"]`)

**Use when** content involves multiple tags that each represent an independent subject — the most common choice for multi-party content like conversations, lessons, or support sessions.

#### all_combinations

One pass per subset of tags — singles, pairs, triples, and so on. For 3 tags that is 7 passes.

- Observations created: all `"per_tag"` scopes above, plus `["student:alice", "teacher:bob"]` · `["student:alice", "session-id:s1"]` · `["teacher:bob", "session-id:s1"]` · `["student:alice", "teacher:bob", "session-id:s1"]`
- ✓ All questions from `"per_tag"` above
- ✓ *"How does Alice perform specifically with Bob?"* — matched by `["student:alice", "teacher:bob"]`

**Use when** you need observations at every granularity — per tag, per pair, per group.

#### custom

Pass an explicit list of tag sets. Each inner list is one scope.

```json
[["student:alice"], ["teacher:bob"], ["teacher:bob", "session-id:s1"]]
```

- Observations created: exactly those three scopes — nothing more
- ✓ *"What does Alice struggle with?"*
- ✓ *"How does Bob teach?"*
- ✓ *"How does Bob teach in session s1 specifically?"*
- ✗ *"What happened in session s1 regardless of teacher?"* — `["session-id:s1"]` alone was not included

**Use when** you know exactly which combinations are meaningful and want to avoid unnecessary passes.
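
The scope expansion for each mode can be written down in a few lines. The function below is an illustrative model of the rules described in this section, not Hindsight's code; the name `expand_scopes` is hypothetical.

```python
from itertools import combinations


def expand_scopes(tags, observation_scopes="combined"):
    """Return the tag sets (scopes) that each get a consolidation pass."""
    if observation_scopes == "combined" or observation_scopes == []:
        # [] declares zero scopes and falls back to combined
        return [list(tags)]
    if observation_scopes == "shared":
        return [[]]  # one global, untagged scope
    if observation_scopes == "per_tag":
        return [[tag] for tag in tags]
    if observation_scopes == "all_combinations":
        # every non-empty subset: singles, pairs, triples, ...
        return [
            list(combo)
            for size in range(1, len(tags) + 1)
            for combo in combinations(tags, size)
        ]
    if isinstance(observation_scopes, list):
        return [list(scope) for scope in observation_scopes]  # custom
    raise ValueError(f"unknown observation_scopes: {observation_scopes!r}")


tags = ["student:alice", "teacher:bob", "session-id:s1"]
for mode in ("combined", "shared", "per_tag", "all_combinations"):
    print(mode, len(expand_scopes(tags, mode)))
```

The number of passes for `all_combinations` is `2**n - 1` for `n` tags, so it grows quickly; a custom list is the usual way to keep only the combinations you query.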

### Response

The synchronous retain response includes:

- `success` — whether the operation completed without errors
- `bank_id` — the memory bank that received the content
- `items_count` — number of items processed
- `async` — whether processing ran asynchronously
- `usage` — token usage for the LLM calls (`input_tokens`, `output_tokens`, `total_tokens`), only present for synchronous operations

---

## Batch Ingestion

Multiple items can be submitted in a single request. Batch ingestion is the recommended approach — it reduces network overhead and lets Hindsight optimize extraction across related content.


</Tabs>


---

## Files

Upload files directly — Hindsight converts them to text and extracts memories automatically. File processing always runs asynchronously and returns operation IDs for tracking.

**Supported formats:** PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, images (JPG, PNG, GIF, etc. — OCR support depends on the configured parser), audio (MP3, WAV, FLAC, etc. — transcription), HTML, and plain text formats (TXT, MD, CSV, JSON, YAML, etc.)


</Tabs>

The response contains `operation_ids`, one per uploaded file, which you can poll via `GET /v1/default/banks/{bank_id}/operations` to track progress.

Upload up to 10 files per request (max 100 MB total). Each file becomes a separate document with optional per-file metadata:


</Tabs>
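
If you have more files than one request allows, you can group them client-side before uploading. The sketch below is a hypothetical helper, not part of the Hindsight client, and it reads "100 MB" as 100 MiB, which is an assumption; adjust the constant if your server enforces a different limit.

```python
MAX_FILES = 10
MAX_TOTAL_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" read as 100 MiB


def split_into_requests(files):
    """Group (name, size_bytes) pairs into uploads that respect both limits."""
    batches, current, current_bytes = [], [], 0
    for name, size in files:
        if size > MAX_TOTAL_BYTES:
            raise ValueError(f"{name} alone exceeds the per-request size limit")
        # Start a new request when either limit would be exceeded.
        if len(current) == MAX_FILES or current_bytes + size > MAX_TOTAL_BYTES:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


files = [(f"report-{i}.pdf", 1024) for i in range(12)]
print([len(batch) for batch in split_into_requests(files)])
```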

:::info File Storage
Uploaded files are stored server-side (PostgreSQL by default, or S3/GCS/Azure for production). Configure storage via `HINDSIGHT_API_FILE_STORAGE_TYPE`. See [Configuration](../configuration#file-processing) for details.
:::

---

## Async Ingestion

For large batches, use async ingestion to avoid blocking your application:


</Tabs>

When `async: true`, the call returns immediately with an `operation_id`. Processing runs in the background via the worker service. No `usage` metrics are returned for async operations.

### Cut Costs 50% with Provider Batch APIs

When using async retain, enable the provider Batch API to reduce LLM fact-extraction costs by 50%. OpenAI, Groq, and Gemini all offer this discount in exchange for a processing window of up to 24 hours — a trade-off that's typically invisible when retain already runs in the background.

```bash
export HINDSIGHT_API_RETAIN_BATCH_ENABLED=true
```

Hindsight submits fact extraction calls as a batch job to the provider, polls for completion, and processes results automatically. No changes to your API calls are needed.

:::note
Batch API cost savings require `async: true` in your retain request and a compatible provider (OpenAI, Groq, or Gemini).
:::


---


## File: developer/api/webhooks.mdx

# Webhooks

Hindsight can notify your application in real-time when memory events occur by sending HTTP POST requests to a URL you configure.

## Delivery and Retries

Webhooks are registered per memory bank and fire automatically when matching events occur. Each delivery attempt is tracked, and failed deliveries are retried on a schedule of increasing delays:

| Attempt | Delay after failure |
|---------|---------------------|
| 1 | 5 seconds |
| 2 | 5 minutes |
| 3 | 30 minutes |
| 4 | 2 hours |
| 5 | 5 hours |
| 6 | Permanent failure |

A delivery is considered failed if your endpoint returns a non-2xx status code or does not respond within the configured timeout (default 30 seconds). After 6 failed attempts, the delivery is marked as permanently failed and no further retries are made.
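
The schedule in the table can be expressed as data, which is handy when sizing how long your endpoint may be down before events are lost. This is a reader-side sketch of the documented delays, not the server's code; the names are hypothetical.

```python
from datetime import timedelta

# Delay applied after each failed attempt, taken from the table above.
RETRY_DELAYS = [
    timedelta(seconds=5),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=5),
]
MAX_ATTEMPTS = len(RETRY_DELAYS) + 1  # the 6th failure is permanent


def delay_after_failure(attempt):
    """Return the wait before the next attempt, or None when no retry remains."""
    if not 1 <= attempt <= MAX_ATTEMPTS:
        raise ValueError(f"attempt must be between 1 and {MAX_ATTEMPTS}")
    if attempt == MAX_ATTEMPTS:
        return None
    return RETRY_DELAYS[attempt - 1]


total_retry_window = sum(RETRY_DELAYS, timedelta())
print(total_retry_window)
```

Summing the delays gives a window of 7 hours, 35 minutes, and 5 seconds between the first and last attempt, not counting the time each request itself takes.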

:::info At-least-once delivery
Webhook delivery tasks are queued in the same database transaction as the primary operation (e.g. the retain or consolidation write). This means if the server crashes after committing but before sending, the delivery task survives and will be retried. As a result, **your endpoint may receive the same event more than once** — use the `operation_id` field to deduplicate if needed.
:::
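
A receiver can deduplicate with a small "seen" set. The sketch below keys on the event type, the `operation_id`, and `data.document_id` together. That composite key is a design choice of this example: a batch retain fires one `retain.completed` event per item, and those events may share an `operation_id`, so keying on `operation_id` alone could discard distinct events. The class name is hypothetical, and the in-memory set stands in for a durable store.

```python
class EventDeduplicator:
    """Remember which webhook events were already handled."""

    def __init__(self):
        # In production, use a durable store such as a database table.
        self._seen = set()

    @staticmethod
    def key(event):
        data = event.get("data") or {}
        return (event["event"], event["operation_id"], data.get("document_id"))

    def first_time(self, event):
        """Return True the first time an event is seen, False on redelivery."""
        k = self.key(event)
        if k in self._seen:
            return False
        self._seen.add(k)
        return True


dedup = EventDeduplicator()
event = {
    "event": "retain.completed",
    "bank_id": "my-bank",
    "operation_id": "a1b2c3d4e5f6",
    "data": {"document_id": "doc-abc123"},
}
print(dedup.first_time(event), dedup.first_time(event))
```

Items retained without a `document_id` all map to the same key within one operation, so this key cannot tell them apart.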

## Event Types

### `consolidation.completed`

Fired after Hindsight finishes consolidating new memories into observations for a bank.

**Payload:**

```json
{
  "event": "consolidation.completed",
  "bank_id": "my-bank",
  "operation_id": "a1b2c3d4e5f6",
  "status": "completed",
  "timestamp": "2026-03-04T12:00:00Z",
  "data": {
    "observations_created": 3,
    "observations_updated": 1,
    "observations_deleted": null,
    "error_message": null
  }
}
```

**`data` fields:**

| Field | Type | Description |
|-------|------|-------------|
| `observations_created` | `integer \| null` | Number of new observations created |
| `observations_updated` | `integer \| null` | Number of existing observations updated |
| `observations_deleted` | `integer \| null` | Number of observations deleted |
| `error_message` | `string \| null` | Set when `status` is `"failed"` |

**`status` values:** `"completed"` or `"failed"`

---

### `retain.completed`

Fired once per document after a retain operation completes (both synchronous and asynchronous). When retaining a batch of N documents, N separate events are fired.

**Payload:**

```json
{
  "event": "retain.completed",
  "bank_id": "my-bank",
  "operation_id": "a1b2c3d4e5f6",
  "status": "completed",
  "timestamp": "2026-03-04T12:00:01Z",
  "data": {
    "document_id": "doc-abc123",
    "tags": ["meeting", "q1-2026"]
  }
}
```

**`data` fields:**

| Field | Type | Description |
|-------|------|-------------|
| `document_id` | `string \| null` | The document ID if one was provided in the retain request |
| `tags` | `string[] \| null` | Document-level tags applied during retain |

**Notes:**
- For async retain (`async: true`), `operation_id` matches the `operation_id` returned by the retain API.
- For sync retain, `operation_id` is a generated identifier for tracing purposes.
- One event is fired per content item in the retain request.

---

### `memory_defense.triggered`

Fired when a bank's [Memory Defense](../memory-defense/index.md) policy acts on a retained item — once per item that is **redacted** or **blocked**. Items that pass cleanly do not fire an event. Requires a Memory Defense policy enabled on the bank and a webhook subscribed to this event type.

**Payload:**

```json
{
  "event": "memory_defense.triggered",
  "bank_id": "my-bank",
  "operation_id": "a1b2c3d4e5f6",
  "status": "redact",
  "timestamp": "2026-03-04T12:00:02Z",
  "data": {
    "action": "redact",
    "detector": "sensitive_data",
    "document_id": "doc-abc123",
    "matched_types": ["github_token", "aws_access_key"],
    "message": "Sensitive data pattern matched: github_token, aws_access_key"
  }
}
```

**`data` fields:**

| Field | Type | Description |
|-------|------|-------------|
| `action` | `string` | Action taken on the item: `"redact"` or `"block"` |
| `detector` | `string \| null` | The detector that matched (`"sensitive_data"`) |
| `document_id` | `string \| null` | The document ID if one was provided in the retain request |
| `matched_types` | `string[] \| null` | Labels of the redaction patterns that fired (e.g. `github_token`, `ssn_us`) |
| `message` | `string \| null` | Human-readable summary of what matched |

**`status` values:** mirrors `data.action` — `"redact"` or `"block"`.

**Notes:**
- A `redact` event means the secret was scrubbed and the redacted memory was still stored. A `block` event means the item was dropped; if every item in the retain request is blocked, the retain call returns `422`.


---


## File: developer/development.md

# Development Guide

Guide to setting up a local development environment for contributing to Hindsight.

## Prerequisites

- Python 3.11+
- [uv](https://docs.astral.sh/uv/) - Fast Python package manager
- Docker and Docker Compose
- An LLM provider: an API key for OpenAI or Groq, or a local Ollama instance

## Local Development Setup

### 1. Clone the Repository

```bash
git clone https://github.com/vectorize-io/hindsight.git
cd hindsight
```

### 2. Install Dependencies

```bash
uv sync
```

### 3. Start PostgreSQL

Start only the database via Docker:

```bash
cd docker && docker-compose up -d postgres
```

### 4. Configure Environment

```bash
cp .env.example .env
```

Edit `.env` with your LLM API key:

```bash
# Database (connects to Docker postgres)
HINDSIGHT_API_DATABASE_URL=postgresql://hindsight:hindsight_dev@localhost:5432/hindsight

# LLM Provider (choose one)
HINDSIGHT_API_LLM_PROVIDER=groq
HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx
HINDSIGHT_API_LLM_MODEL=llama-3.1-70b-versatile
```

### 5. Start the API Server

```bash
./scripts/start-server.sh --env local
```

The server will be available at http://localhost:8888.

## Running Tests

```bash
# Run all tests
uv run pytest

# Run specific test file
uv run pytest tests/test_retrieval.py

# Run with verbose output
uv run pytest -v
```

## Code Generation

### Regenerate API Clients

When you modify the OpenAPI spec, regenerate the clients:

```bash
./scripts/generate-clients.sh
```

This generates:
- Python client in `hindsight-clients/python/`
- TypeScript client in `hindsight-clients/typescript/`

### Export OpenAPI Schema

```bash
./scripts/export-openapi.sh
```

## Project Structure

```
hindsight/
├── hindsight-api/          # Main API server
│   ├── hindsight_api/
│   │   ├── api/           # HTTP endpoints
│   │   ├── engine/        # Memory engine, retrieval, reasoning
│   │   └── web/           # Server entry point
│   └── tests/
├── hindsight-clients/      # Generated SDK clients
│   ├── python/
│   └── typescript/
├── hindsight-control-plane/ # Admin UI (Next.js)
├── docker/                 # Docker Compose setup
└── scripts/               # Development scripts
```

## Contributing

1. Create a feature branch from `main`
2. Make your changes
3. Run tests: `uv run pytest`
4. Submit a pull request

## Troubleshooting

### Database Connection Issues

Ensure PostgreSQL is running (run this from the `docker/` directory, where the Compose setup lives):

```bash
docker-compose ps
```

Check database connectivity:

```bash
psql postgresql://hindsight:hindsight_dev@localhost:5432/hindsight
```

### ML Model Download

On first run, Hindsight downloads embedding and reranking models. This may take a few minutes. Models are cached in `~/.cache/huggingface/`.

### Port Conflicts

If port 8888 is in use:

```bash
HINDSIGHT_API_PORT=8889 ./scripts/start-server.sh --env local
```


---


## File: developer/extensions.md

# Extensions

Extensions allow you to customize and extend Hindsight behavior without modifying core code. They enable multi-tenancy, custom authentication, additional HTTP endpoints, and operation hooks.

---

## Available Extensions

### TenantExtension

Handles multi-tenancy and API key authentication. Validates incoming requests and determines which PostgreSQL schema to use for database operations, enabling tenant isolation at the database level.

**Built-in: ApiKeyTenantExtension**

A simple implementation that validates API keys against an environment variable and uses the `public` schema for all authenticated requests.

```bash
HINDSIGHT_API_TENANT_EXTENSION=hindsight_api.extensions.builtin.tenant:ApiKeyTenantExtension
HINDSIGHT_API_TENANT_API_KEY=your-secret-key
```

**Built-in: SupabaseTenantExtension**

Validates [Supabase](https://supabase.com) JWTs and provides multi-tenant memory isolation. Each authenticated user gets their own PostgreSQL schema (`{prefix}_{user_id}`), ensuring complete data separation. Performs local JWT verification using JWKS for optimal performance (no network call per request).

```bash
HINDSIGHT_API_TENANT_EXTENSION=hindsight_api.extensions.builtin.supabase_tenant:SupabaseTenantExtension
HINDSIGHT_API_TENANT_SUPABASE_URL=https://your-project.supabase.co
# Optional - only needed for legacy HS256 projects or health check
HINDSIGHT_API_TENANT_SUPABASE_SERVICE_KEY=your-service-role-key
```

See the [source code](https://github.com/vectorize-io/hindsight/blob/main/hindsight-api-slim/hindsight_api/extensions/builtin/supabase_tenant.py) for complete configuration options and implementation details.

For other multi-tenant setups with separate schemas per tenant (e.g., custom JWT-based auth), implement a custom `TenantExtension`.

---

### HttpExtension

Adds custom HTTP endpoints under the `/ext/` path prefix. Useful for adding domain-specific APIs that integrate with Hindsight's memory engine.

Provides two router methods:
- `get_router(memory)` — returns a FastAPI router mounted at `/ext/`
- `get_root_router(memory)` — returns a FastAPI router mounted at the application root (for well-known endpoints or other paths that must be at specific locations). Returns `None` by default.

**No built-in implementation** - implement your own to add custom endpoints.

```bash
HINDSIGHT_API_HTTP_EXTENSION=mypackage.ext:MyHttpExtension
```

---

### OperationValidatorExtension

Hooks into retain/recall/reflect operations for validation and monitoring. Use cases include:
- Rate limiting and quota enforcement
- Permission checks and content filtering
- Audit logging and usage tracking
- Custom metrics collection

**No built-in implementation** - implement your own based on your requirements.

```bash
HINDSIGHT_API_OPERATION_VALIDATOR_EXTENSION=mypackage.validators:MyValidator
```

---

### MCPExtension

Registers additional MCP (Model Context Protocol) tools on the Hindsight MCP server. Enables external packages to add custom tools without modifying core code.

**No built-in implementation** - implement your own to add custom MCP tools.

```bash
HINDSIGHT_API_MCP_EXTENSION=mypackage.mcp:MyMCPExtension
```

---

## Writing Custom Extensions

### Extension Basics

Extensions are Python classes loaded via environment variables:

```bash
HINDSIGHT_API_<TYPE>_EXTENSION=mypackage.module:MyExtensionClass
```

Configuration is passed via prefixed environment variables:

```bash
HINDSIGHT_API_<TYPE>_SOME_CONFIG=value
# Extension receives: {"some_config": "value"}
```
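
The mapping from environment variables to an extension class and its config dict can be sketched as follows. This is an illustration of the convention described above, not Hindsight's actual loader; both function names are hypothetical.

```python
import importlib
import os


def extension_settings(ext_type, environ=None):
    """Split HINDSIGHT_API_<TYPE>_* variables into (class path, config dict)."""
    environ = os.environ if environ is None else environ
    prefix = f"HINDSIGHT_API_{ext_type.upper()}_"
    class_path = None
    config = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        key = name[len(prefix):].lower()
        if key == "extension":
            class_path = value  # e.g. "mypackage.module:MyExtensionClass"
        else:
            config[key] = value  # e.g. {"some_config": "value"}
    return class_path, config


def load_class(class_path):
    """Import 'package.module:ClassName' and return the class object."""
    module_name, _, class_name = class_path.partition(":")
    return getattr(importlib.import_module(module_name), class_name)


env = {
    "HINDSIGHT_API_TENANT_EXTENSION": "collections:OrderedDict",  # stand-in class
    "HINDSIGHT_API_TENANT_JWT_SECRET": "s3cret",
    "HINDSIGHT_API_PORT": "8888",
}
path, config = extension_settings("tenant", env)
print(path, config, load_class(path).__name__)
```

All config values arrive as strings, so an extension that needs numbers or booleans has to parse them itself.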

All extensions support lifecycle hooks:
- `on_startup()` - Called when the application starts
- `on_shutdown()` - Called when the application shuts down

Extensions have access to an `ExtensionContext` that provides:
- `run_migration(schema)` - Run database migrations for a schema
- `get_memory_engine()` - Get the MemoryEngine interface

### Example: Custom TenantExtension with JWT

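```python
import jwt  # PyJWT

from hindsight_api.extensions import TenantExtension, TenantContext, AuthenticationError
# Assumed import path for RequestContext; verify against your installed version
from hindsight_api.models import RequestContext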

class JwtTenantExtension(TenantExtension):
    def __init__(self, config: dict[str, str]):
        super().__init__(config)
        self.jwt_secret = config.get("jwt_secret")
        if not self.jwt_secret:
            raise ValueError("HINDSIGHT_API_TENANT_JWT_SECRET is required")

    async def authenticate(self, context: RequestContext) -> TenantContext:
        token = context.api_key
        if not token:
            # Optional headers dict is forwarded in HTTP/MCP error responses
            raise AuthenticationError("Bearer token required")

        try:
            payload = jwt.decode(token, self.jwt_secret, algorithms=["HS256"])
            tenant_id = payload.get("tenant_id")
            if not tenant_id:
                raise AuthenticationError("Missing tenant_id in token")
            return TenantContext(schema_name=f"tenant_{tenant_id}")
        except jwt.InvalidTokenError as e:
            raise AuthenticationError(str(e)) from e
```

`AuthenticationError` accepts an optional `headers` dict that is forwarded in both HTTP and MCP error responses. This is useful for returning custom headers like `WWW-Authenticate`:

```python
raise AuthenticationError(
    "Authorization required",
    headers={"WWW-Authenticate": 'Bearer realm="example"'},
)
```

### Example: Custom HttpExtension

```python
from fastapi import APIRouter
from hindsight_api.engine import MemoryEngine
from hindsight_api.extensions import HttpExtension

class MyHttpExtension(HttpExtension):
    def get_router(self, memory: MemoryEngine) -> APIRouter:
        router = APIRouter()

        @router.get("/hello")
        async def hello():
            return {"message": "Hello from extension!"}

        @router.post("/custom/{bank_id}/action")
        async def custom_action(bank_id: str):
            # Access memory engine for database operations
            pool = await memory._get_pool()
            # ... custom logic
            return {"status": "ok"}

        return router

    def get_root_router(self, memory: MemoryEngine) -> APIRouter | None:
        """Optional: mount routes at the application root (not under /ext/)."""
        router = APIRouter()

        @router.get("/.well-known/my-metadata")
        async def metadata():
            return {"version": "1.0"}

        return router
```

Routes from `get_router` are available at `/ext/hello`, `/ext/custom/{bank_id}/action`, etc.
Routes from `get_root_router` are mounted at the app root (e.g., `/.well-known/my-metadata`).

### Example: Custom OperationValidatorExtension

```python
from hindsight_api.extensions import (
    OperationValidatorExtension,
    ValidationResult,
    PrecheckContext,
    RetainContext,
    RecallContext,
    ReflectContext,
    RetainResult,
)

class MyValidator(OperationValidatorExtension):
    # Pre-body validation (optional)
    async def precheck(self, ctx: PrecheckContext) -> ValidationResult:
        if ctx.content_length is not None and ctx.content_length > 10_000_000:
            return ValidationResult.reject("Payload is too large")
        return ValidationResult.accept()

    # Pre-operation validation (required)
    async def validate_retain(self, ctx: RetainContext) -> ValidationResult:
        # Implement your validation logic
        return ValidationResult.accept()
        # Or reject: return ValidationResult.reject("Reason")

    async def validate_recall(self, ctx: RecallContext) -> ValidationResult:
        return ValidationResult.accept()

    async def validate_reflect(self, ctx: ReflectContext) -> ValidationResult:
        return ValidationResult.accept()

    # Post-operation hooks (optional)
    async def on_retain_complete(self, result: RetainResult) -> None:
        # Log usage, update metrics, send notifications, etc.
        pass
```

`precheck` runs before the request body is read or deserialized. Its
`PrecheckContext.content_length` is the parsed `Content-Length` header as an
integer, or `None` when the header is missing or cannot be parsed (for example,
chunked transfer encoding). Use it for cheap size-aware quota or cost guards;
the full `validate_*` hooks still run after parsing and should enforce precise
per-operation limits.

#### Deferring an operation

In addition to `accept` and `reject`, a `validate_*` hook can ask the
worker to **requeue** the operation for a future time by raising
`DeferOperation`. Use this for backpressure (rate-limited upstream,
quota window not yet open, dependency warming up) — unlike a retry, it
does not increment `retry_count` or write `error_message`. The worker
sets `next_retry_at` to your `exec_date` and the task is invisible to
claim queries until that time.

```python
from datetime import datetime, timedelta, timezone

from hindsight_api.extensions import (
    DeferOperation,
    OperationValidatorExtension,
    RetainContext,
    ValidationResult,
)


class QuotaAwareValidator(OperationValidatorExtension):
    async def validate_retain(self, ctx: RetainContext) -> ValidationResult:
        # _quota_available is a placeholder for your own quota check
        if not await self._quota_available(ctx.bank_id):
            raise DeferOperation(
                exec_date=datetime.now(timezone.utc) + timedelta(minutes=5),
                reason="bank quota window exhausted",
            )
        return ValidationResult.accept()
```

`DeferOperation` is **worker-only**: do not raise it from
`validate_recall` or `validate_reflect` in synchronous HTTP request
paths — there is no queue to defer to and it will surface as a 500.
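
When the quota resets on a fixed schedule, deferring to the start of the next window avoids waking the task too early. The helper below is a sketch of one way to compute an `exec_date`; fixed, UTC-aligned windows are an assumption of this example, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def next_window_start(now, window):
    """Return the start of the first fixed window that begins after `now`."""
    windows_passed = (now - EPOCH) // window  # timedelta // timedelta gives an int
    return EPOCH + (windows_passed + 1) * window


now = datetime(2026, 3, 4, 12, 7, 30, tzinfo=timezone.utc)
print(next_window_start(now, timedelta(minutes=5)).isoformat())
```

Inside a `validate_retain` hook, the returned value could be passed as `exec_date` when raising `DeferOperation`.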

### Example: Custom MCPExtension

```python
from mcp.server.fastmcp import FastMCP
from hindsight_api.extensions import MCPExtension
from hindsight_api.engine import MemoryEngine

class MyMCPExtension(MCPExtension):
    async def register_tools(self, mcp: FastMCP, memory: MemoryEngine) -> None:
        @mcp.tool()
        async def custom_search(query: str) -> str:
            """Custom MCP tool for specialized search."""
            # Access memory engine for operations
            pool = await memory._get_pool()
            # ... custom logic
            return f"Results for: {query}"
```

---

## Deploying Custom Extensions

### With Docker

Mount your extension package as a volume and set the environment variable:

```yaml
# docker-compose.yml
services:
  hindsight-api:
    image: vectorize/hindsight-api:latest
    volumes:
      - ./my_extensions:/app/my_extensions
    environment:
      - HINDSIGHT_API_TENANT_EXTENSION=my_extensions.auth:JwtTenantExtension
      - HINDSIGHT_API_TENANT_JWT_SECRET=${JWT_SECRET}
      - PYTHONPATH=/app
```

Or build a custom image with your extensions:

```dockerfile
FROM vectorize/hindsight-api:latest
COPY my_extensions /app/my_extensions
ENV PYTHONPATH=/app
```

### Bare Metal

Install your extension package in the same Python environment as Hindsight:

```bash
# Install Hindsight
pip install hindsight-api

# Install your extension package
pip install ./my-extensions
# or
pip install my-extensions-package

# Configure
export HINDSIGHT_API_TENANT_EXTENSION=my_extensions.auth:JwtTenantExtension
export HINDSIGHT_API_TENANT_JWT_SECRET=your-secret

# Run
hindsight-api
```

---

## Contributing Extensions

Custom extensions that solve common use cases are welcome contributions to the Hindsight project. If you've built an extension for:

- Authentication providers (OAuth, SAML, API gateways)
- Rate limiting or quota management
- Audit logging integrations
- Metrics exporters (Datadog, New Relic, etc.)
- Custom HTTP endpoints for specific platforms

Consider contributing it to the `hindsight_api.extensions.builtin` package. Open an issue or pull request on [GitHub](https://github.com/vectorize-io/hindsight) to discuss your extension.


---


## File: developer/index.mdx

# Overview

## Why Hindsight?

AI agents forget everything between sessions. Every conversation starts from zero, with no context about who you are, what you've discussed, or what the assistant has learned. This isn't just an implementation detail; it fundamentally limits what AI agents can do.

**The problem is harder than it looks:**

- **Simple vector search isn't enough**: "What did Alice do last spring?" requires temporal reasoning, not just semantic similarity
- **Facts get disconnected**: Knowing "Alice works at Google" and "Google is in Mountain View" should let you answer "Which city does Alice work in?" even if you never stored that directly
- **AI agents need to consolidate knowledge**: A coding assistant that remembers "the user prefers functional programming" should consolidate this into an observation and weigh it when making recommendations
- **Context matters**: The same information means different things to different memory banks with different personalities

Hindsight solves these problems with a memory system designed specifically for AI agents.

## What Hindsight Does

```mermaid
graph LR
    subgraph app["<b>Your Application</b>"]
        Agent[AI Agent]
    end

    subgraph hindsight["<b>Hindsight</b>"]
        API[API Server]

        subgraph bank["<b>Memory Bank</b>"]
            direction TB
            MentalModels[Mental Models]
            Observations[Observations]
            MemEnt[Memories & Entities]
            Chunks[Chunks]
            Documents[Documents]

            MentalModels --> Observations --> MemEnt --> Chunks --> Documents
        end
    end

    Agent -->|retain| API
    Agent -->|recall| API
    Agent -->|reflect| API

    API --> bank
```

**Your AI agent** stores information via `retain()`, searches with `recall()`, and reasons with `reflect()`. All three operations run against its dedicated **memory bank**.

## Key Components

### Memory Types

Hindsight organizes knowledge into a hierarchy of facts and consolidated knowledge:

| Type | What it stores | Example |
|------|----------------|---------|
| **Mental Model** | User-curated summaries for common queries | "Team communication best practices" |
| **Observation** | Automatically consolidated knowledge from facts | "User was a React enthusiast but has now switched to Vue" (captures history) |
| **World Fact** | Objective facts received | "Alice works at Google" |
| **Experience Fact** | Bank's own actions and interactions | "I recommended Python to Bob" |

During `reflect`, the agent checks sources in priority order: **Mental Models → Observations → Raw Facts**.

### Multi-Strategy Retrieval (TEMPR)

Four search strategies run in parallel:

```mermaid
graph LR
    Q[Query] --> S[Semantic]
    Q --> K[Keyword]
    Q --> G[Graph]
    Q --> T[Temporal]

    S --> RRF[RRF Fusion]
    K --> RRF
    G --> RRF
    T --> RRF

    RRF --> CE[Cross-Encoder]
    CE --> R[Results]
```

| Strategy | Best for |
|----------|----------|
| **Semantic** | Conceptual similarity, paraphrasing |
| **Keyword (BM25)** | Names, technical terms, exact matches |
| **Graph** | Related entities, indirect connections |
| **Temporal** | "last spring", "in June", time ranges |
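
The fusion step in the diagram can be illustrated with a small Reciprocal Rank Fusion (RRF) function. This is a minimal sketch of the general technique, not Hindsight's implementation: the constant `k = 60` is the value commonly used in the RRF literature and is an assumption here, and the memory IDs are made up.

```python
from collections import defaultdict


def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    Each strategy contributes 1 / (k + rank) for every ID it returned,
    so items ranked well by several strategies rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, memory_id in enumerate(ranked_ids, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical per-strategy results, best match first.
rankings = {
    "semantic": ["m1", "m2", "m3"],
    "keyword": ["m2", "m4"],
    "graph": ["m2", "m1"],
    "temporal": ["m3"],
}
fused = rrf_fuse(rankings)
print([memory_id for memory_id, _ in fused])
```

Because RRF only looks at ranks, it can combine strategies whose raw scores are on different scales (cosine similarity, BM25, graph distance) without normalizing them first.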

### Observation Consolidation

After memories are retained, Hindsight automatically consolidates related facts into **observations** — deduplicated, evidence-grounded beliefs that the bank has built up across many memories:

- **Deduplication**: Overlapping facts are merged into a single durable observation instead of piling up as repeats
- **Evidence tracking**: Each observation references the source memories (with exact quotes) that support it, plus a proof count
- **Continuous refinement**: Observations are updated — not overwritten — when new evidence supports, contradicts, or extends them; history is preserved
- **Freshness awareness**: When newer memories have been retained but not yet consolidated, `reflect` treats the affected observations as stale and verifies them against raw facts before relying on them

### Mission, Directives & Disposition

Memory banks can be configured to shape how the agent reasons during `reflect`:

| Configuration | Purpose | Example |
|---------------|---------|---------|
| **Mission** | Natural language identity for the bank | "I am a research assistant specializing in ML. I prefer simplicity over cutting-edge." |
| **Directives** | Hard rules the agent must follow | "Never recommend specific stocks", "Always cite sources" |
| **Disposition** | Soft traits that influence reasoning style | Skepticism, literalism, empathy (1-5 scale) |

The **mission** tells Hindsight what knowledge to prioritize and provides context for reasoning. **Directives** are guardrails and compliance rules that must never be violated. **Disposition traits** subtly influence interpretation style.

These settings only affect the `reflect` operation, not `recall`.

## Clients & Languages


## Integrations

Browse all supported integrations in the [Integrations Hub](/integrations).

## Next Steps

### Getting Started
- [**Quick Start**](/developer/api/quickstart) — Install and get up and running in 60 seconds
- [**RAG vs Hindsight**](/developer/rag-vs-hindsight) — See how Hindsight differs from traditional RAG with real examples

### Core Concepts
- [**Retain**](/developer/retain) — How memories are stored with multi-dimensional facts
- [**Recall**](/developer/retrieval) — How TEMPR's 4-way search retrieves memories
- [**Reflect**](/developer/reflect) — How mission, directives, and disposition shape reasoning

### API Methods
- [**Retain**](/developer/api/retain) — Store information in memory banks
- [**Recall**](/developer/api/recall) — Search and retrieve memories
- [**Reflect**](/developer/api/reflect) — Agentic reasoning with memory
- [**Mental Models**](/developer/api/mental-models) — User-curated summaries for common queries
- [**Memory Banks**](/developer/api/memory-banks) — Configure mission, directives, and disposition
- [**Documents**](/developer/api/documents) — Manage document sources
- [**Operations**](/developer/api/operations) — Monitor async tasks

### Deployment
- [**Server Setup**](/developer/installation) — Deploy with Docker Compose, Helm, or pip


---


## File: developer/mcp-server.md

# MCP Server

Hindsight includes a built-in [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) server that allows AI assistants to store and retrieve memories directly.

## Access

The MCP server is **enabled by default** and mounted at `/mcp` on the API server. Each memory bank has its own MCP endpoint:

```
http://localhost:8888/mcp/{bank_id}/
```

For example, to connect to the memory bank `alice`:
```
http://localhost:8888/mcp/alice/
```

To disable the MCP server, set the environment variable:

```bash
export HINDSIGHT_API_MCP_ENABLED=false
```

## Authentication

By default, the MCP endpoint is **open** (no authentication required).

To enable authentication, configure the API key tenant extension:

```bash
export HINDSIGHT_API_TENANT_EXTENSION=hindsight_api.extensions.builtin.tenant:ApiKeyTenantExtension
export HINDSIGHT_API_TENANT_API_KEY=your-secret-key
```

When authentication is enabled, include your API key in the `Authorization` header:

### Claude Code

```bash
claude mcp add --transport http hindsight http://localhost:8888/mcp \
  --header "Authorization: Bearer your-secret-key" \
  --header "X-Bank-Id: my-bank"
```

### Claude Desktop

Add to your `claude_desktop_config.json` (on macOS, `~/Library/Application Support/Claude/claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "hindsight": {
      "url": "http://localhost:8888/mcp",
      "headers": {
        "Authorization": "Bearer your-secret-key",
        "X-Bank-Id": "my-bank"
      }
    }
  }
}
```

### Direct HTTP Request

```bash
curl -X POST http://localhost:8888/mcp \
  -H "Authorization: Bearer your-secret-key" \
  -H "X-Bank-Id: my-bank" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc": "2.0", "method": "tools/list", "id": 1}'
```

If the key is missing or invalid, requests will receive a `401 Unauthorized` response.
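
The same `tools/list` call can be made from Python using only the standard library. This is a sketch that mirrors the curl command above; `build_tools_list_request` and `send` are hypothetical helper names, and the URL, key, and bank ID are the placeholder values from the examples.

```python
import json
import urllib.request


def build_tools_list_request(base_url: str, api_key: str, bank_id: str):
    """Return (url, headers, body) for an MCP tools/list JSON-RPC call."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "X-Bank-Id": bank_id,
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    }
    body = json.dumps({"jsonrpc": "2.0", "method": "tools/list", "id": 1}).encode("utf-8")
    return base_url, headers, body


def send(url: str, headers: dict, body: bytes) -> str:
    """POST the request and return the raw response text.

    urllib raises urllib.error.HTTPError for a 401 response.
    """
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


url, headers, body = build_tools_list_request(
    "http://localhost:8888/mcp", "your-secret-key", "my-bank"
)
print(headers["Authorization"])
# To actually send it against a running server: print(send(url, headers, body))
```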

## Bank Selection

The memory bank is resolved in this priority order:

1. **URL path** (highest priority): `http://localhost:8888/mcp/my-bank/`
2. **X-Bank-Id header**: `--header "X-Bank-Id: my-bank"`
3. **Default**: Uses `HINDSIGHT_MCP_BANK_ID` env var (default: "default")
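
The priority order above can be sketched as a small resolver. This is an illustration of the documented rules rather than the server's code; `resolve_bank_id` is a hypothetical name, and the path parsing assumes the `/mcp/{bank_id}/` layout shown earlier.

```python
def resolve_bank_id(path: str, headers: dict[str, str], env: dict[str, str]) -> str:
    """Pick the bank for a request: URL path, then header, then default."""
    # 1. URL path: /mcp/{bank_id}/
    parts = [part for part in path.split("/") if part]
    if len(parts) >= 2 and parts[0] == "mcp":
        return parts[1]
    # 2. X-Bank-Id header (HTTP header names are case-insensitive)
    for name, value in headers.items():
        if name.lower() == "x-bank-id" and value:
            return value
    # 3. Environment default, falling back to "default"
    return env.get("HINDSIGHT_MCP_BANK_ID", "default")


print(resolve_bank_id("/mcp/alice/", {"X-Bank-Id": "bob"}, {}))  # path wins
print(resolve_bank_id("/mcp", {"X-Bank-Id": "bob"}, {}))         # header used
print(resolve_bank_id("/mcp/", {}, {}))                          # default
```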

## Per-Bank Endpoints

Unlike traditional MCP servers where tools require explicit identifiers, Hindsight uses **per-bank endpoints**. The `bank_id` is part of the URL path, so tools don't need to specify which bank to use—it's implicit from the connection.

This design:
- **Simplifies tool usage** — no need to pass `bank_id` with every call
- **Enforces isolation** — each MCP connection is scoped to a single bank
- **Enables multi-tenant setups** — connect different users to different endpoints

## Two Modes

The MCP server operates in two modes depending on the URL:

| Mode | URL | Tools | bank_id |
|------|-----|-------|---------|
| **Single-bank** | `/mcp/{bank_id}/` | 27 tools (memory, mental models, directives, documents, operations, tags, bank management) | Implicit from URL |
| **Multi-bank** | `/mcp/` | All 30 tools including `list_banks`, `create_bank`, `get_bank_stats` | Explicit `bank_id` parameter on each tool |

**Single-bank mode** (recommended) scopes all operations to the bank in the URL. Tools don't expose a `bank_id` parameter.

**Multi-bank mode** exposes all tools with an optional `bank_id` parameter, plus bank management tools (`list_banks`, `create_bank`, `get_bank_stats`).

## Tool Metadata and Instructions

Hindsight can append deployment-specific guidance to the `retain` and `recall` MCP tool descriptions. Set `HINDSIGHT_API_MCP_INSTRUCTIONS` on the API server when clients should see local rules, such as which tags to use or which memories should be retained.

```bash
export HINDSIGHT_API_MCP_INSTRUCTIONS="Use project:<name> tags for project-specific memories."
```

MCP clients that read tool annotations also receive safety hints from the built-in tools:

- Read-only operations such as `recall`, `reflect`, `list_*`, and `get_*` are marked with `readOnlyHint: true`.
- Delete, clear, and invalidate operations are marked with `destructiveHint: true`.
- `openWorldHint` is `false` for the built-in tools because Hindsight operates on its configured memory store rather than the open internet.
- Write operations such as `retain`, `create_*`, `update_*`, `refresh_mental_model`, and `cancel_operation` are not marked destructive.
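
A client can use these hints to decide when to ask the user before running a tool. The sketch below is one possible client-side policy, not something Hindsight ships; `needs_confirmation` is a hypothetical name, and the tool descriptors are hand-written samples shaped like the `annotations` object in an MCP `tools/list` response.

```python
def needs_confirmation(tool: dict) -> bool:
    """Return True when a tool call should be confirmed by the user first."""
    annotations = tool.get("annotations") or {}
    if annotations.get("readOnlyHint") is True:
        return False  # read-only tools never modify the memory bank
    # Only explicitly destructive tools are gated in this sketch.
    return annotations.get("destructiveHint") is True


# Illustrative descriptors; a real client would read them from tools/list.
sample_tools = [
    {"name": "recall", "annotations": {"readOnlyHint": True, "openWorldHint": False}},
    {"name": "retain", "annotations": {"openWorldHint": False}},
    {"name": "delete_document", "annotations": {"destructiveHint": True, "openWorldHint": False}},
]
gated = [tool["name"] for tool in sample_tools if needs_confirmation(tool)]
print(gated)
```

Annotations are hints, so a stricter client may choose to gate every tool that is not marked read-only.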

---

## Available Tools

### retain

Store information to long-term memory.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `content` | string | Yes | The fact or memory to store |
| `context` | string | No | Category for the memory (default: `general`) |
| `timestamp` | string | No | ISO 8601 timestamp for when the event occurred |
| `tags` | list[string] | No | Tags for organizing and filtering this memory |
| `metadata` | object | No | Key-value metadata to attach (e.g., `{"source": "slack"}`) |
| `document_id` | string | No | Associate this memory with an existing document |

**Example:**
```json
{
  "name": "retain",
  "arguments": {
    "content": "User prefers Python over JavaScript for backend development",
    "context": "programming_preferences",
    "tags": ["user:alice", "preferences"]
  }
}
```

**When to use:**
- User shares personal facts, preferences, or interests
- Important events or milestones are mentioned
- Decisions, opinions, or goals are stated
- Work context or project details are discussed
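
The example above shows only the tool name and arguments. On the wire, an MCP client wraps them in a JSON-RPC `tools/call` request. A minimal sketch of that envelope follows; `make_tool_call` is a hypothetical helper, and most MCP client libraries build this for you.

```python
import itertools
import json

# Monotonic JSON-RPC request IDs for this client session.
_request_ids = itertools.count(1)


def make_tool_call(name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request for the given tool."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_request_ids),
            "method": "tools/call",
            "params": {"name": name, "arguments": arguments},
        }
    )


payload = make_tool_call(
    "retain",
    {
        "content": "User prefers Python over JavaScript for backend development",
        "context": "programming_preferences",
        "tags": ["user:alice", "preferences"],
    },
)
print(payload)
```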

---

### sync_retain

Store information to long-term memory and wait for completion. Unlike [`retain`](#retain) (which is asynchronous), `sync_retain` blocks until the memory is fully stored and immediately available for recall — useful for read-after-write flows where you query right after storing.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `content` | string | Yes | The fact or memory to store |
| `context` | string | No | Category for the memory (default: `general`) |
| `timestamp` | string | No | ISO 8601 timestamp for when the event occurred |
| `tags` | list[string] | No | Tags for organizing and filtering this memory |
| `metadata` | object | No | Key-value metadata to attach (e.g., `{"source": "slack"}`) |
| `document_id` | string | No | Associate this memory with an existing document |

**Example:**
```json
{
  "name": "sync_retain",
  "arguments": {
    "content": "User prefers Python over JavaScript for backend development",
    "context": "programming_preferences",
    "tags": ["user:alice", "preferences"]
  }
}
```

**When to use:**
- You need the memory queryable immediately after storing (read-after-write)
- A workflow step depends on the stored memory being available before continuing
- Otherwise prefer `retain` (asynchronous) to avoid blocking on storage

---

### recall

Search memories to provide personalized responses.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `query` | string | Yes | Natural language search query |
| `max_tokens` | integer | No | Maximum tokens to return (default: 4096) |
| `budget` | string | No | Search thoroughness: `low`, `mid`, or `high` (default: `high`) |
| `types` | list[string] | No | Filter by fact type: `world`, `experience`, `observation`. Defaults to all |
| `tags` | list[string] | No | Filter memories by tags |
| `tags_match` | string | No | Tag matching mode: `any` (default) or `all` |
| `query_timestamp` | string | No | ISO 8601 timestamp — recall as if asking at this point in time; anchors relative temporal expressions and recency scoring |
| `min_scores` | object | No | Optional per-stage score floors, e.g. `{"reranker": 0.5}`. Keys: `semantic`/`keyword` (retrieval-level cutoffs), `reranker`/`final` (post-ranking). All inclusive and AND-ed; omit for no filtering. Reranker scores aren't calibrated across queries — calibrate before use |

**Example:**
```json
{
  "name": "recall",
  "arguments": {
    "query": "What are the user's programming language preferences?",
    "tags": ["preferences"],
    "budget": "high"
  }
}
```

**When to use:**
- Start of conversation to recall relevant context
- Before making recommendations
- When user asks about something they may have mentioned before
- To provide continuity across conversations
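
The `min_scores` description says the floors are inclusive and AND-ed. The sketch below illustrates that rule for the two post-ranking keys, `reranker` and `final`. It is an interpretation of the parameter description, not the server's filtering code: the result shape and score values are invented, and the retrieval-level `semantic` and `keyword` cutoffs are left out because how they apply per strategy is not described here.

```python
from typing import Optional

POST_RANKING_STAGES = ("reranker", "final")


def passes_floors(scores: dict[str, float], min_scores: Optional[dict[str, float]]) -> bool:
    """True when every configured post-ranking floor is met (inclusive)."""
    if not min_scores:
        return True  # omitting min_scores means no filtering
    return all(
        scores[stage] >= floor
        for stage, floor in min_scores.items()
        if stage in POST_RANKING_STAGES
    )


# Hypothetical ranked results with per-stage scores.
results = [
    {"id": "m1", "scores": {"reranker": 0.82, "final": 0.74}},
    {"id": "m2", "scores": {"reranker": 0.50, "final": 0.41}},
    {"id": "m3", "scores": {"reranker": 0.31, "final": 0.66}},
]
floors = {"reranker": 0.5, "final": 0.4}
kept = [r["id"] for r in results if passes_floors(r["scores"], floors)]
print(kept)
```

Here `m2` survives because a score equal to the floor passes, while `m3` is dropped for missing a single floor even though its `final` score is high.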

---

### reflect

Generate thoughtful analysis by synthesizing stored memories with the bank's personality.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `query` | string | Yes | The question or topic to reflect on |
| `context` | string | No | Optional context about why this reflection is needed |
| `budget` | string | No | Search budget: `low`, `mid`, or `high` (default: `low`) |
| `max_tokens` | integer | No | Maximum tokens in the response (default: 4096) |
| `response_schema` | object | No | JSON Schema for structured output. When provided, the response includes a `structured_output` field |
| `tags` | list[string] | No | Filter memories by tags before reflecting |
| `tags_match` | string | No | Tag matching mode: `any` (default) or `all` |
| `include_trace` | boolean | No | Include `tool_trace` and `llm_trace` debugging output. Defaults to `false` to keep responses small |

**Example:**
```json
{
  "name": "reflect",
  "arguments": {
    "query": "Based on my past decisions, what architectural style do I prefer?",
    "budget": "mid",
    "tags": ["architecture"]
  }
}
```

**When to use:**
- When reasoned analysis is needed, not just fact retrieval
- Questions like "What should I do?" rather than "What did I say?"
- Synthesizing patterns across multiple memories
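
The `tags` and `tags_match` parameters narrow which memories feed the reflection. The sketch below shows one plausible reading of the two modes, inferred from their names: `any` keeps a memory that shares at least one tag with the filter, and `all` requires every requested tag. Treat these semantics as an assumption; `tags_match` here is a hypothetical local function.

```python
def tags_match(memory_tags: list[str], wanted: list[str], mode: str = "any") -> bool:
    """Check a memory's tags against a tag filter."""
    if not wanted:
        return True  # no filter requested
    memory_set, wanted_set = set(memory_tags), set(wanted)
    if mode == "any":
        return bool(memory_set & wanted_set)
    if mode == "all":
        return wanted_set <= memory_set
    raise ValueError(f"unknown tags_match mode: {mode!r}")


memory = ["architecture", "user:alice"]
print(tags_match(memory, ["architecture", "preferences"], mode="any"))
print(tags_match(memory, ["architecture", "preferences"], mode="all"))
```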

---

### create_mental_model

Create a mental model — a living document that stays current with your memories. Mental models are pre-computed reflections that get automatically refreshed as new memories are stored.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `name` | string | Yes | Human-readable name for the mental model |
| `source_query` | string | Yes | The query used to generate and refresh the model |
| `mental_model_id` | string | No | Custom ID (alphanumeric lowercase with hyphens). Auto-generated if not provided |
| `tags` | list[string] | No | Tags for organizing and filtering models |
| `max_tokens` | integer | No | Maximum tokens for model content (default: 2048) |
| `trigger_refresh_after_consolidation` | boolean | No | Auto-refresh this model after memory consolidation (default: `false`) |

**Example:**
```json
{
  "name": "create_mental_model",
  "arguments": {
    "name": "Team Directory",
    "source_query": "Who works here and what do they do?",
    "tags": ["team", "people"]
  }
}
```

Content generation runs asynchronously. The response includes an `operation_id` to track progress.
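
If you supply your own `mental_model_id`, it helps to validate it on the client before calling the tool. The sketch below checks the "alphanumeric lowercase with hyphens" rule from the parameter table and derives an ID from a display name. The exact rule the server enforces is not spelled out here, so the pattern is an assumption (it also rejects leading, trailing, and doubled hyphens), and both function names are hypothetical.

```python
import re

# Lowercase letters and digits, in groups separated by single hyphens.
_ID_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_mental_model_id(value: str) -> bool:
    """True when the whole string matches the assumed ID format."""
    return bool(_ID_PATTERN.fullmatch(value))


def slugify(name: str) -> str:
    """Derive a candidate ID from a human-readable name."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


candidate = slugify("Team Directory")
print(candidate, is_valid_mental_model_id(candidate))
```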

---

### list_mental_models

List all mental models in a bank, optionally filtered by tags.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `tags` | list[string] | No | Filter models by tags |

---

### get_mental_model

Retrieve a specific mental model by ID, including its full content.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `mental_model_id` | string | Yes | The ID of the mental model to retrieve |

---

### update_mental_model

Update a mental model's metadata or settings.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `mental_model_id` | string | Yes | The ID of the mental model to update |
| `name` | string | No | New name |
| `source_query` | string | No | New source query |
| `tags` | list[string] | No | New tags |
| `max_tokens` | integer | No | New max tokens |
| `trigger_refresh_after_consolidation` | boolean | No | Auto-refresh after consolidation. Only set when you want to change this setting |

---

### delete_mental_model

Permanently delete a mental model.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `mental_model_id` | string | Yes | The ID of the mental model to delete |

---

### refresh_mental_model

Re-generate a mental model's content from the latest memories. Runs asynchronously.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `mental_model_id` | string | Yes | The ID of the mental model to refresh |

---

### clear_mental_model

Clear a mental model's content while keeping its definition. After clearing, call `refresh_mental_model` to rebuild it from the latest memories.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `mental_model_id` | string | Yes | The ID of the mental model to clear |

---

### list_banks (multi-bank mode only)

List all available memory banks.

---

### create_bank (multi-bank mode only)

Create a new memory bank or retrieve an existing one.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `bank_id` | string | Yes | The ID for the new bank |
| `name` | string | No | Human-friendly name for the bank |
| `mission` | string | No | Mission describing who the agent is and what they're trying to accomplish |

---

### list_directives

List all directives in a bank. Directives are instructions that guide how the memory system processes and responds to queries.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `tags` | list[string] | No | Filter directives by tags |
| `active_only` | boolean | No | Only return active directives (default: `true`) |

---

### create_directive

Create a new directive in a bank.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `name` | string | Yes | Human-readable name for the directive |
| `content` | string | Yes | The directive content/instruction |
| `priority` | integer | No | Priority level (higher = more important) |
| `is_active` | boolean | No | Whether the directive is active (default: `true`) |
| `tags` | list[string] | No | Tags for organizing directives |

---

### delete_directive

Delete a directive by ID.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `directive_id` | string | Yes | The ID of the directive to delete |

---

### list_memories

Browse stored memories with optional filtering and pagination.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `type` | string | No | Filter by fact type: `world`, `experience`, or `observation` |
| `q` | string | No | Search query to filter memories |
| `limit` | integer | No | Maximum number of results (default: 100) |
| `offset` | integer | No | Number of results to skip for pagination (default: 0) |
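
The `limit` and `offset` parameters support a standard pagination loop. The sketch below walks all pages through an injected callable so it runs without a server. It assumes each call returns a plain list of memory objects, which may not match the tool's real response shape; `iter_memories` and `fake_list_memories` are hypothetical names.

```python
from typing import Callable, Iterator


def iter_memories(
    list_memories: Callable[..., list[dict]], page_size: int = 100, **filters
) -> Iterator[dict]:
    """Yield every memory by advancing offset until a short page appears."""
    offset = 0
    while True:
        page = list_memories(limit=page_size, offset=offset, **filters)
        yield from page
        if len(page) < page_size:
            return  # last page reached
        offset += page_size


# Stand-in for the real tool call, backed by 250 fake memories.
_FAKE_MEMORIES = [{"id": f"mem-{i}", "type": "world"} for i in range(250)]


def fake_list_memories(limit: int = 100, offset: int = 0, **_filters) -> list[dict]:
    return _FAKE_MEMORIES[offset : offset + limit]


all_ids = [memory["id"] for memory in iter_memories(fake_list_memories)]
print(len(all_ids))
```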

---

### get_memory

Retrieve a specific memory by ID.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `memory_id` | string | Yes | The ID of the memory to retrieve |

---

### list_documents

List documents that have been ingested into the memory bank.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `q` | string | No | Search query to filter documents |
| `limit` | integer | No | Maximum number of results (default: 100) |

---

### get_document

Retrieve a specific document by ID, including its metadata.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `document_id` | string | Yes | The ID of the document to retrieve |

---

### delete_document

Delete a document and all memories linked to it.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `document_id` | string | Yes | The ID of the document to delete |

---

### list_operations

List async operations (retain processing, mental model refresh, etc.) with optional status filtering.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `status` | string | No | Filter by status: `pending`, `running`, `completed`, `failed`, `cancelled` |
| `limit` | integer | No | Maximum number of results (default: 100) |

---

### get_operation

Get the status and details of an async operation.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `operation_id` | string | Yes | The ID of the operation to check |
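
A typical use of `get_operation` is to poll until an async task finishes. The sketch below does that with an injected status function so it runs without a server. The terminal states come from the status values listed for `list_operations`; the function name, polling interval, and timeout are illustrative choices.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_operation(
    get_status: Callable[[str], str],
    operation_id: str,
    interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the operation reaches a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(operation_id)
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operation {operation_id} still {status!r}")
        sleep(interval)


# Stand-in for get_operation: reports three successive states.
_states = iter(["pending", "running", "completed"])
final_status = wait_for_operation(
    lambda _operation_id: next(_states), "op-123", sleep=lambda _seconds: None
)
print(final_status)
```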

---

### cancel_operation

Cancel a pending or running async operation.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `operation_id` | string | Yes | The ID of the operation to cancel |

---

### list_tags

List all unique tags used in a bank, optionally filtered by pattern.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `q` | string | No | Glob pattern to filter tags (e.g., `project:*`) |
| `limit` | integer | No | Maximum number of results (default: 100) |
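
To preview what a glob such as `project:*` would select, Python's `fnmatch` module gives a close local approximation. This is a sketch only: the server's glob dialect may differ in details such as case sensitivity or character classes, and `filter_tags` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def filter_tags(tags: list[str], pattern: str, limit: int = 100) -> list[str]:
    """Return up to `limit` tags that match a shell-style glob pattern."""
    return [tag for tag in tags if fnmatchcase(tag, pattern)][:limit]


known_tags = ["project:alpha", "user:alice", "project:beta", "preferences"]
print(filter_tags(known_tags, "project:*"))
```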

---

### get_bank

Get information about a memory bank, including its name, mission, and disposition.

---

### get_bank_stats (multi-bank mode only)

Get statistics for a memory bank (node/link counts).

---

### update_bank

Update a memory bank's name and/or any bank-level configuration fields. Only provided fields are updated; omitted fields remain unchanged.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `name` | string | No | Human-friendly display name for the bank |
| `mission` | string | No | **Deprecated** — alias for `config_updates.reflect_mission` |
| `config_updates` | object | No | Dictionary of configuration fields to update. Supports all bank-configurable fields (see below). Non-configurable or credential fields are rejected |

The `config_updates` object accepts any bank-configurable field by its Python field name, including:

- `reflect_mission` — mission/context for Reflect operations
- `retain_mission` — steers what gets extracted during `retain()`
- `retain_extraction_mode` — `concise` (default), `verbose`, or `custom`
- `retain_custom_instructions` — custom extraction prompt (active when mode is `custom`)
- `retain_chunk_size` — target maximum characters for each content chunk
- `retain_structured_chunk_size` — maximum characters for a single JSONL line or conversation turn to keep whole
- `retain_chunk_batch_size` — number of chunks to process in parallel
- `enable_observations` — toggle observation consolidation after `retain()`
- `observations_mission` — controls observation synthesis rules
- `disposition_skepticism` — critical evaluation level (1–5)
- `disposition_literalism` — literal vs. abstract interpretation (1–5)
- `disposition_empathy` — emotional context consideration (1–5)
- `entity_labels` — controlled vocabulary for entity classification
- `entities_allow_free_form` — allow labels outside `entity_labels`
- `recall_include_chunks` — include raw chunks in recall results
- `recall_max_tokens` — max tokens for recall results
- `mcp_enabled_tools` — tool allowlist for this bank
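
Since the server rejects invalid fields, a small client-side check can catch obvious mistakes before the call. The sketch below validates the two constraints stated in the list above (disposition traits on a 1 to 5 scale, and the three extraction modes) and builds the tool arguments. `build_update_bank_call` is a hypothetical helper, and it does not attempt to validate any other field.

```python
DISPOSITION_FIELDS = {
    "disposition_skepticism",
    "disposition_literalism",
    "disposition_empathy",
}
EXTRACTION_MODES = {"concise", "verbose", "custom"}


def build_update_bank_call(**config_updates) -> dict:
    """Validate known fields and return an update_bank tool call."""
    for field, value in config_updates.items():
        if field in DISPOSITION_FIELDS:
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{field} must be an integer from 1 to 5")
        if field == "retain_extraction_mode" and value not in EXTRACTION_MODES:
            raise ValueError(f"retain_extraction_mode must be one of {sorted(EXTRACTION_MODES)}")
    return {"name": "update_bank", "arguments": {"config_updates": dict(config_updates)}}


call = build_update_bank_call(disposition_skepticism=4, retain_extraction_mode="verbose")
print(call)
```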

---

### delete_bank

Permanently delete a memory bank and all its data (memories, documents, entities, mental models).

---

### clear_memories

Clear all memories from a bank without deleting the bank itself. Optionally filter by fact type to only clear specific kinds of memories.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `type` | string | No | Fact type to clear: `world`, `experience`, or `observation`. If not specified, clears all |

---

## Integration with AI Assistants

The MCP server can be used with any MCP-compatible AI assistant. See the [Authentication](#authentication) section above for Claude Code and Claude Desktop configuration examples.

Each user can have their own configuration pointing to their personal memory bank using either:
- A bank-specific URL path like `/mcp/alice/` (recommended)
- The `X-Bank-Id` header


---


## File: developer/memory-defense/index.md

# Memory Defense

Hindsight scrubs secrets and PII from retain content using a 45-pattern regex set. Each match is replaced with a `[REDACTED:type]` marker before content reaches memory units or the document body. The feature is configured per bank and disabled by default.

## How it works

Memory Defense is opt-in per bank. The extension is always present, but it sits dormant until you give a bank a policy that turns it on. When a policy is set, every memory the agent writes to that bank is scanned before it lands in storage. When the scanner recognizes a credential, an API key, a database connection string, or a known PII format, the matched substring is replaced with a redaction marker like `[REDACTED:github_token]`.

The scrubbed version is what actually gets stored. Memory units and document bodies persist the redacted text, so future recall responses, exports, and reflect operations never see the original secret.

A policy only affects future retain calls on the bank where it is set. Existing memories are not retroactively scanned when you add or change a policy.

## Configuring Memory Defense

Memory Defense is configured per bank via the bank's `memory_defense` config field. You can set the policy at bank creation time or update it later via `PATCH /v1/{tenant}/banks/{bank_id}/config`.

The open-source version implements the `sensitive_data` rule with two possible actions:

- **`redact`** — replace each matched secret with a `[REDACTED:type]` marker and store the scrubbed memory.
- **`block`** — drop any item that contains a match. If every item in a retain request is blocked, the call returns `422`.

A minimal policy:

```json
{
  "memory_defense": {
    "enabled": true,
    "rules": [
      { "on": "sensitive_data", "action": "redact" }
    ]
  }
}
```

Once that policy is on a bank, every retain to that bank is screened with the 45 redaction patterns documented below.
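
To make the two actions concrete, here is a small sketch of regex-based screening with `redact` and `block` behavior. It is an illustration, not the bundled scanner: the three regexes are simplified stand-ins written from the shapes in the tables below, and `apply_policy` is a hypothetical function.

```python
import re

# Simplified stand-ins for three of the bundled patterns.
PATTERNS = {
    "github_token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "aws_access_key": re.compile(r"AKIA[A-Z0-9]{16}"),
    "ssn_us": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def apply_policy(text: str, action: str = "redact"):
    """Return (text_to_store_or_None, matched_labels) for one retain item."""
    if action not in ("redact", "block"):
        raise ValueError(f"unknown action: {action!r}")
    matched = [label for label, pattern in PATTERNS.items() if pattern.search(text)]
    if not matched:
        return text, []  # clean content passes through unchanged
    if action == "block":
        return None, matched  # the item is dropped entirely
    for label in matched:
        text = PATTERNS[label].sub(f"[REDACTED:{label}]", text)
    return text, matched


sample = "Deploy with ghp_" + "a" * 36 + " and SSN 123-45-6789"
stored, labels = apply_policy(sample, action="redact")
print(stored)
print(labels)
```

With `action="block"`, the same input returns `None` for the stored text, which corresponds to the item being dropped rather than scrubbed.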

:::note Existing memories are not retroactively scanned
Enabling Memory Defense on a bank only affects future retain calls. Memories already in the bank are not re-scanned or modified when you add or change a policy. If you need to scrub a bank that already contains unredacted content, you have to re-ingest the affected memories or remove them manually.
:::

### Disabled by default

Memory Defense is off on every bank until you set a policy. A bank with no `memory_defense` field, with `enabled: false`, or with no `sensitive_data` rule is treated identically: the extension returns ALLOW and content passes through unchanged. To stop redacting on a bank that has it on, set `enabled: false` or remove the policy.

## Notifications

When an item is redacted or blocked, Hindsight fires a [`memory_defense.triggered` webhook](../api/webhooks.mdx#memory_defensetriggered) if a webhook on the bank is subscribed to that event type. The payload reports the action taken, the document ID, and which redaction patterns matched — useful for routing security alerts to a SIEM or Slack. Clean items fire no event.

The same redact/block decisions are also recorded as `memory_defense` entries in the [audit log](../configuration.md#audit-logging) (when audit logging is enabled), with the action and matched pattern labels in the entry metadata.

## Patterns covered

The 45 bundled patterns cover the categories below.

### AI and LLM providers

| Label | Catches |
|---|---|
| `anthropic_key` | `sk-ant-...` |
| `openai_key`, `openai_project_key`, `openai_admin_key` | `sk-...`, `sk-proj-...`, `sk-admin-...` |
| `google_api_key` | `AIza...` (39 chars) |
| `google_oauth_token` | `ya29.<token>` |
| `xai_key` | `xai-...` |
| `groq_key` | `gsk_...` |
| `huggingface_token` | `hf_...` |
| `replicate_token` | `r8_...` |
| `perplexity_key` | `pplx-...` |
| `databricks_token` | `dapi<hex32>` |

### Cloud providers

| Label | Catches |
|---|---|
| `aws_access_key` | `AKIA<16>` |
| `aws_session_token` | `ASIA<16>` |
| `digitalocean_token` | `dop_v1_<hex64>` |

### Source control and CI

| Label | Catches |
|---|---|
| `github_fg_pat` | `github_pat_...` |
| `github_token` | `ghp_<36>` |
| `github_app_token` | `ghs_<36>` |
| `github_user_token` | `ghu_<36>` |
| `github_refresh` | `ghr_<36>` |
| `github_oauth` | `gho_<36>` |
| `gitlab_pat` | `glpat-...` |
| `npm_token` | `npm_...` |
| `pypi_token` | `pypi-AgEIcHlwaS5vcmc...` |

### Payment processors

| Label | Catches |
|---|---|
| `stripe_secret` | `sk_live_...`, `sk_test_...` |
| `stripe_restricted` | `rk_live_...`, `rk_test_...` |
| `square_token` | `sq0...` |
| `braintree_token` | `access_token$production$...` |

### Communications and email

| Label | Catches |
|---|---|
| `slack_token` | `xoxb-`, `xoxp-`, `xoxa-`, `xoxr-` |
| `slack_webhook` | `https://hooks.slack.com/services/...` |
| `twilio_api_key` | `SK<hex32>` |
| `twilio_account_sid` | `AC<hex32>` |
| `sendgrid_key` | `SG.<22>.<43>` |
| `mailgun_key` | `key-<32>` |
| `discord_bot` | `<MNO><23>.<6>.<27>` |
| `telegram_bot` | `<8-10 digits>:<35>` |

### Commerce

| Label | Catches |
|---|---|
| `shopify_token` | `shpat_<hex32>` |

### Database connection strings

| Label | Catches |
|---|---|
| `db_url_postgres` | `postgres://user:pass@host` or `postgresql://...` |
| `db_url_mysql` | `mysql://user:pass@host` |
| `db_url_mongodb` | `mongodb://user:pass@host` or `mongodb+srv://...` |

### Private keys, JWTs, and generic credentials

| Label | Catches |
|---|---|
| `private_key_pem` | `-----BEGIN ... PRIVATE KEY-----` PEM blocks |
| `jwt` | `eyJ<header>.eyJ<payload>.<signature>` |

### PII (US defaults)

| Label | Catches |
|---|---|
| `credit_card` | 13 to 19 digits with regular separators |
| `ssn_us` | `123-45-6789` shape |


---


## File: developer/models.mdx

# Models

Hindsight uses several machine learning models for different tasks.

## Overview

- **LLM** — Fact extraction, reasoning, and generation. Provider-specific, fully configurable.
- **Embedding** — Vector representations for semantic search. Default: `BAAI/bge-small-en-v1.5`.
- **Cross-Encoder** — Reranking search results. Default: `cross-encoder/ms-marco-MiniLM-L-6-v2`.

Embedding and cross-encoder models are downloaded automatically from HuggingFace on first run.

---

## LLM

Used for fact extraction, entity resolution, mental model consolidation, and answer synthesis.

**Supported providers:** Hindsight ships built-in providers (see the examples under Configuration below). It also supports **any OpenAI-compatible API** (e.g., Azure OpenAI, Together AI, Fireworks) and **100+ providers via LiteLLM** (e.g., AWS Bedrock, Azure OpenAI, Together AI).

:::tip OpenAI-Compatible Providers
Hindsight works with any provider that exposes an OpenAI-compatible API (e.g., Azure OpenAI). Simply set `HINDSIGHT_API_LLM_PROVIDER=openai` and configure `HINDSIGHT_API_LLM_BASE_URL` to point to your provider's endpoint.

See [Configuration](./configuration#llm-provider) for setup examples.
:::

:::tip AWS Bedrock
Set `HINDSIGHT_API_LLM_PROVIDER=bedrock` to use AWS Bedrock models directly. Model names use Bedrock model IDs (e.g., `us.amazon.nova-2-lite-v1:0`). No API key is required — authentication uses AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION_NAME`) or IAM roles. For 50% cost savings on throughput, set `HINDSIGHT_API_LLM_BEDROCK_SERVICE_TIER=flex` (see [Configuration](./configuration#llm-provider)).

See [Configuration](./configuration#llm-provider) for setup examples.
:::

:::tip Built-in llama.cpp (fully local, no API key)
Set `HINDSIGHT_API_LLM_PROVIDER=llamacpp` to run a built-in llama.cpp server with no external dependencies. A Gemma 4 E2B GGUF model (~3.5 GB) is auto-downloaded on first run. Requires the `local-llm` extra: `pip install 'hindsight-api-slim[local-llm]'`.

The published Docker image does not bundle `llama-cpp-python` (to keep the image small). For a runnable Docker setup that adds it on top, see [`docker/docker-compose/local-llm/`](https://github.com/vectorize-io/hindsight/tree/main/docker/docker-compose/local-llm).

See [Configuration](./configuration#built-in-llamacpp) for all options.
:::

:::tip LiteLLM Provider (Azure, Together AI, and more)
Set `HINDSIGHT_API_LLM_PROVIDER=litellm` to use any model supported by [LiteLLM](https://docs.litellm.ai/docs/providers), including **Azure OpenAI**, **Together AI**, **Fireworks AI**, and many more. Model names use LiteLLM's provider prefix format (e.g., `azure/gpt-4o`).

See [Configuration](./configuration#llm-provider) for setup examples.
:::

:::tip LiteLLM Router (fallback chains, load-balancing, per-deployment limits)
Set `HINDSIGHT_API_LLM_PROVIDER=litellmrouter` to run the default LLM through [LiteLLM's Router](https://docs.litellm.ai/docs/routing) — ordered fallback across deployments, load-balanced same-tier routing, weighted picks, per-deployment `rpm`/`tpm` limits, and cooldowns are all available via the [`Router` config](https://docs.litellm.ai/docs/routing#fallbacks). Hindsight passes the JSON config through verbatim.

See [Configuration](./configuration#llm-router-litellm-router) for setup.
:::

### Provider Capabilities

Beyond basic generation, some providers support optional features that lower cost or latency. Hindsight uses each feature automatically when the configured provider supports it.


- **Batch API** — submits bulk retain extraction through the provider's asynchronous batch endpoint, typically at ~50% lower cost. Used automatically when available; otherwise calls run synchronously.
- **Explicit prompt caching** — reuses the large, fixed system prefix that retain (fact extraction), consolidation, and the reflect tool-loop send on every call, billing it at the provider's cached-input rate. On Gemini/Vertex this uses the `CachedContent` API. **On by default**; disable with `HINDSIGHT_API_LLM_PROMPT_CACHE_ENABLED=false`. Hindsight structures these prompts so the cached prefix is **bank-agnostic** — one cache is shared across all banks rather than one per bank/mission, and creation soft-fails to an uncached call, so it never breaks a request.

:::note
A blank "Explicit prompt caching" cell does not mean a provider has no caching. OpenAI, for example, caches a stable leading prompt prefix **automatically** server-side, so it benefits with no configuration. Anthropic supports caching via `cache_control` breakpoints, which can be wired up through the same provider hook. The column tracks only Hindsight's explicit `get_or_create_cached_prefix` hook, which Gemini/Vertex implement today.
:::
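
The two properties described above (a bank-agnostic cache and soft-failing creation) can be sketched as follows. This is not Hindsight's implementation: `cached_prefix_handle`, `prefix_key`, and the `create` callback are hypothetical names, and the in-memory dictionary stands in for whatever store the real hook uses.

```python
import hashlib
from typing import Callable, Dict, Optional

# Hypothetical in-process store of provider cache handles.
_prefix_handles: Dict[str, str] = {}


def prefix_key(model: str, system_prefix: str) -> str:
    """Key on model + fixed prefix only, so every bank maps to the same entry."""
    return hashlib.sha256(f"{model}\n{system_prefix}".encode("utf-8")).hexdigest()


def cached_prefix_handle(
    model: str,
    system_prefix: str,
    create: Callable[[str], str],
) -> Optional[str]:
    """Return a cache handle, or None so the caller falls back to an uncached call."""
    key = prefix_key(model, system_prefix)
    if key in _prefix_handles:
        return _prefix_handles[key]
    try:
        handle = create(system_prefix)
    except Exception:
        # Soft-fail: a cache creation error must never break the request.
        return None
    _prefix_handles[key] = handle
    return handle
```

Because the key contains no bank identifier, a second bank sending the same prefix reuses the first bank's handle instead of creating a new cache.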

### Benchmarks

Not sure which model to use? The **[Model Leaderboard](https://benchmarks.hindsight.vectorize.io/)** benchmarks models across accuracy, speed, cost, and reliability for retain, reflect, and observation consolidation so you can pick the right trade-off for your use case.

[![Model Leaderboard](/img/leaderboard.png)](https://benchmarks.hindsight.vectorize.io/)

### Tested Models

The following models have been tested and verified to work correctly with Hindsight:

| Provider | Model |
|----------|-------|
| **OpenAI** | `gpt-5.2` |
| **OpenAI** | `gpt-5` |
| **OpenAI** | `gpt-5-mini` |
| **OpenAI** | `gpt-5-nano` |
| **OpenAI** | `gpt-4.1-mini` |
| **OpenAI** | `gpt-4.1-nano` |
| **OpenAI** | `gpt-4o-mini` |
| **Anthropic** | `claude-sonnet-4-20250514` |
| **Anthropic** | `claude-3-5-sonnet-20241022` |
| **Gemini** | `gemini-3.5-flash` |
| **Gemini** | `gemini-3.1-pro-preview` |
| **Gemini** | `gemini-3.1-flash-lite` |
| **Groq** | `openai/gpt-oss-120b` |
| **Groq** | `openai/gpt-oss-20b` |

### Provider Default Models

Each provider has a recommended default model that is used when `HINDSIGHT_API_LLM_MODEL` is not explicitly set. This simplifies configuration: specify only the provider and you get a sensible default.


**Example:** Setting just the provider uses its default model:
```bash
# Uses claude-haiku-4-5 automatically
export HINDSIGHT_API_LLM_PROVIDER=anthropic
export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx
```

You can override the default by explicitly setting `HINDSIGHT_API_LLM_MODEL`:
```bash
# Override to use Sonnet instead
export HINDSIGHT_API_LLM_PROVIDER=anthropic
export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=claude-sonnet-4-5-20250929
```

This also applies to per-operation overrides:
```bash
# Global: OpenAI gpt-4o-mini (default)
export HINDSIGHT_API_LLM_PROVIDER=openai

# Retain: Anthropic claude-haiku-4-5 (default)
export HINDSIGHT_API_RETAIN_LLM_PROVIDER=anthropic
```
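
The fallback order implied by these examples can be sketched as below. Treat it as an illustration under assumptions: only the two defaults mentioned in the comments above are included, the `HINDSIGHT_API_<OPERATION>_LLM_MODEL` variable name is assumed by analogy with the provider variable, and the sketch assumes a per-operation provider does not inherit the global model.

```python
from typing import Mapping, Optional, Tuple

# Only the defaults mentioned in the examples above; not the full provider table.
PROVIDER_DEFAULT_MODELS = {
    "anthropic": "claude-haiku-4-5",
    "openai": "gpt-4o-mini",
}


def resolve_llm(env: Mapping[str, str], operation: Optional[str] = None) -> Tuple[str, str]:
    """Resolve (provider, model) for an operation such as "RETAIN", or globally."""
    provider = None
    model = None
    if operation:
        provider = env.get(f"HINDSIGHT_API_{operation}_LLM_PROVIDER")
        if provider:
            # Assumed variable name, by analogy with the PROVIDER variable.
            model = env.get(f"HINDSIGHT_API_{operation}_LLM_MODEL")
    if not provider:
        provider = env["HINDSIGHT_API_LLM_PROVIDER"]
        model = env.get("HINDSIGHT_API_LLM_MODEL")
    return provider, model or PROVIDER_DEFAULT_MODELS[provider]
```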

### Using Other Models

Models not listed above may work with Hindsight, but they must support **at least 65,000 output tokens** for reliable fact extraction. If you need support for a model that doesn't meet this requirement, please [open an issue](https://github.com/vectorize-io/hindsight/issues) to request an exception.

:::tip Models with Limited Output Tokens
If your model only supports 32k or fewer output tokens (e.g., some older models), you can reduce the retain completion token limit:

```bash
# For models that support 32k output tokens
export HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS=32000

# For models that support 16k output tokens
export HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS=16000
```

**Important:** `HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS` must be greater than `HINDSIGHT_API_RETAIN_CHUNK_SIZE` (default: 3000). The system will validate this on startup and provide an error message if the configuration is invalid.
:::
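
A pre-flight check for the constraint above might look like this. It is a sketch, not the startup validation Hindsight runs: `retain_limits_valid` is a hypothetical helper, and it only checks an explicitly set completion limit because the built-in default for that variable is not stated here.

```python
from typing import Mapping

DEFAULT_CHUNK_SIZE = 3000  # default for HINDSIGHT_API_RETAIN_CHUNK_SIZE stated above


def retain_limits_valid(env: Mapping[str, str]) -> bool:
    """Return True when the completion limit is strictly greater than the chunk size."""
    raw_limit = env.get("HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS")
    if raw_limit is None:
        # No override set; this sketch does not model the built-in default.
        return True
    chunk_size = int(env.get("HINDSIGHT_API_RETAIN_CHUNK_SIZE", DEFAULT_CHUNK_SIZE))
    return int(raw_limit) > chunk_size
```

Running a check like this in a deploy script surfaces a bad combination before the service reports it at startup.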

:::warning Groq free tier is not suitable for Hindsight
Groq's free tier only allows 8,000 tokens per minute — far below what Hindsight needs for a single retain call (~64k). Free-tier Groq models therefore can't be used with Hindsight; use a paid Groq tier or a different provider.
:::

### Configuration

```bash
# Groq (recommended)
export HINDSIGHT_API_LLM_PROVIDER=groq
export HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=openai/gpt-oss-20b

# OpenAI
export HINDSIGHT_API_LLM_PROVIDER=openai
export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=gpt-4o

# Gemini
export HINDSIGHT_API_LLM_PROVIDER=gemini
export HINDSIGHT_API_LLM_API_KEY=xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=gemini-3.5-flash

# Anthropic
export HINDSIGHT_API_LLM_PROVIDER=anthropic
export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=claude-sonnet-4-20250514

# Ollama (local)
export HINDSIGHT_API_LLM_PROVIDER=ollama
export HINDSIGHT_API_LLM_BASE_URL=http://localhost:11434/v1
export HINDSIGHT_API_LLM_MODEL=llama3

# Ollama Cloud (hosted Ollama endpoint, requires API key)
export HINDSIGHT_API_LLM_PROVIDER=ollama-cloud
export HINDSIGHT_API_LLM_API_KEY=your-ollama-cloud-api-key
export HINDSIGHT_API_LLM_MODEL=gemma3:12b

# LM Studio (local)
export HINDSIGHT_API_LLM_PROVIDER=lmstudio
export HINDSIGHT_API_LLM_BASE_URL=http://localhost:1234/v1
export HINDSIGHT_API_LLM_MODEL=your-local-model

# MiniMax (1M context window)
export HINDSIGHT_API_LLM_PROVIDER=minimax
export HINDSIGHT_API_LLM_API_KEY=your-minimax-api-key
export HINDSIGHT_API_LLM_MODEL=MiniMax-M3  # or MiniMax-M2.7 for the previous generation

# DeepSeek (https://api.deepseek.com)
export HINDSIGHT_API_LLM_PROVIDER=deepseek
export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=deepseek-v4-flash  # or deepseek-v4-pro / deepseek-chat / deepseek-reasoner

# z.ai (Zhipu GLM series, OpenAI-compatible, https://z.ai)
export HINDSIGHT_API_LLM_PROVIDER=zai
export HINDSIGHT_API_LLM_API_KEY=your-zai-api-key
export HINDSIGHT_API_LLM_MODEL=glm-4.5-flash  # or glm-4.5-air for the paid tier

# opencode-go (OpenAI-compatible)
export HINDSIGHT_API_LLM_PROVIDER=opencode-go
export HINDSIGHT_API_LLM_API_KEY=your-opencode-go-api-key
export HINDSIGHT_API_LLM_MODEL=deepseek-v4-flash

# Atlas Cloud (OpenAI-compatible, https://www.atlascloud.ai)
export HINDSIGHT_API_LLM_PROVIDER=atlas
export HINDSIGHT_API_LLM_API_KEY=your-atlascloud-api-key  # base_url defaults to https://api.atlascloud.ai/v1
export HINDSIGHT_API_LLM_MODEL=deepseek-ai/deepseek-v4-pro  # reasoning model; also Qwen / GLM / Kimi / MiniMax, etc.

# Nous Portal (OpenAI-compatible; no API key — uses your `hermes portal` login)
export HINDSIGHT_API_LLM_PROVIDER=nous
export HINDSIGHT_API_LLM_MODEL=deepseek/deepseek-v4-flash  # any Nous-hosted slug
# No API key needed — reads a rotating JWT from ~/.hermes/auth.json (see "Nous Portal Setup" below)

# Vertex AI (Google Cloud)
export HINDSIGHT_API_LLM_PROVIDER=vertexai
export HINDSIGHT_API_LLM_MODEL=gemini-3.1-flash-lite
export HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID=your-gcp-project-id
# Optional: region (default: us-central1)
# export HINDSIGHT_API_LLM_VERTEXAI_REGION=us-central1
# Optional: service account key (otherwise uses ADC)
# export HINDSIGHT_API_LLM_VERTEXAI_SERVICE_ACCOUNT_KEY=/path/to/key.json
```

**Note:** The LLM is the primary bottleneck for retain operations. See [Performance](./performance) for optimization strategies.

---

### OpenAI Codex Setup (ChatGPT Plus/Pro)

Use your ChatGPT Plus or Pro subscription for Hindsight without separate OpenAI Platform API costs.

**Prerequisites:**
- Active ChatGPT Plus or Pro subscription
- Node.js/npm installed (for Codex CLI)

**Setup Steps:**

1. **Install Codex CLI:**
   ```bash
   npm install -g @openai/codex
   ```

2. **Login with ChatGPT credentials:**
   ```bash
   codex auth login
   ```
   This opens a browser window to authenticate with your ChatGPT account and saves OAuth tokens to `~/.codex/auth.json`.

3. **Verify authentication:**
   ```bash
   ls ~/.codex/auth.json  # Should show the auth file exists
   ```

4. **Configure Hindsight:**
   ```bash
   export HINDSIGHT_API_LLM_PROVIDER=openai-codex
   # export HINDSIGHT_API_LLM_MODEL=gpt-5.3-codex  # defaults to gpt-5.4-mini
   # No API key needed - reads from ~/.codex/auth.json automatically
   ```

5. **Start Hindsight:**
   ```bash
   hindsight-api
   ```

You can use any model supported by the OpenAI Codex CLI.

**Important Notes:**
- OAuth tokens are stored in `~/.codex/auth.json`
- Tokens refresh automatically when needed
- Usage is billed to your ChatGPT subscription (not separate API costs)
- For personal development use only (see ChatGPT Terms of Service)

#### Isolating Codex auth for long-running services

By default Hindsight reads Codex credentials from `~/.codex/auth.json` — the
same file the `@openai/codex` CLI, editor plugins, and other agent runtimes use.
This is convenient for local development but can cause a subtle failure mode when
Hindsight runs as a **long-lived service** (systemd unit, container, background
daemon) alongside another Codex process:

- Codex refresh tokens are single-use and rotate on refresh.
- If another process refreshes the shared token, Hindsight's long-running
  process is left holding a stale refresh token.
- Recall and `/health` keep working (the database and API are fine), but
  `/reflect` fails with an error such as:
  ```text
  Codex refresh_token is permanently invalid (error.code=refresh_token_reused).
  Run 'codex auth login' to re-authenticate.
  ```

To avoid this, give the Hindsight service its **own dedicated Codex auth home**
via the `CODEX_HOME` environment variable. Hindsight honors `CODEX_HOME` exactly
like the `@openai/codex` CLI: when set, it reads `$CODEX_HOME/auth.json` instead
of `~/.codex/auth.json`.

```bash
# Dedicated credentials directory for the Hindsight service
export CODEX_HOME=/var/lib/hindsight/codex

# One-time login into that isolated home (opens a browser / device-code flow)
codex auth login   # writes $CODEX_HOME/auth.json

export HINDSIGHT_API_LLM_PROVIDER=openai-codex
hindsight-api
```

For a systemd unit, set it in the service definition so it never shares auth
with an interactive Codex session:

```ini
[Service]
Environment=CODEX_HOME=/var/lib/hindsight/codex
```

After a fresh login into the dedicated home and restarting only the Hindsight
service, `/reflect` uses its own token that other Codex processes will not
rotate out from under it.

`CODEX_HOME` is also honored by the `openai-codex` embeddings provider.
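
The lookup rule described in this section (use `$CODEX_HOME/auth.json` when `CODEX_HOME` is set, otherwise `~/.codex/auth.json`) can be sketched in a few lines. `codex_auth_path` is a hypothetical helper for checking which file a given environment would resolve to, not a Hindsight function.

```python
import os
from pathlib import Path
from typing import Mapping


def codex_auth_path(env: Mapping[str, str] = os.environ) -> Path:
    """Return the auth.json path implied by CODEX_HOME, or the default location."""
    codex_home = env.get("CODEX_HOME")
    base = Path(codex_home) if codex_home else Path.home() / ".codex"
    return base / "auth.json"
```

Calling it with the service's environment and with an interactive shell's environment is a quick way to confirm the two no longer point at the same file.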

---

### Nous Portal Setup (Hermes)

Use your [Nous Portal](https://portal.nousresearch.com) subscription for Hindsight via the Hermes CLI login — no static API key required.

**Prerequisites:**
- A Nous Portal account
- The [Hermes](https://hermes-agent.nousresearch.com) CLI installed

**Setup Steps:**

1. **Log in to Nous Portal:**
   ```bash
   hermes portal
   ```
   This opens a browser to authenticate with Nous Portal and saves OAuth credentials to `~/.hermes/auth.json`.

2. **Verify authentication:**
   ```bash
   hermes portal status  # should show "Auth: ✓ logged in"
   ```

3. **Configure Hindsight:**
   ```bash
   export HINDSIGHT_API_LLM_PROVIDER=nous
   # export HINDSIGHT_API_LLM_MODEL=deepseek/deepseek-v4-flash  # defaults to deepseek/deepseek-v4-flash
   # No API key needed — reads from ~/.hermes/auth.json automatically
   ```

4. **Start Hindsight:**
   ```bash
   hindsight-api
   ```

You can use any model hosted on the Nous Portal inference API.

**Important Notes:**
- Credentials are read from `~/.hermes/auth.json` (the same store the Hermes agent uses) — no static API key in Hindsight's config.
- The short-lived inference JWT is refreshed automatically, before expiry and reactively on a 401.
- Refreshes coordinate with a running Hermes agent through the shared auth store, so the two never disrupt each other's session.
- Default base URL: `https://inference-api.nousresearch.com/v1` (override with `HINDSIGHT_API_LLM_BASE_URL`).
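
To illustrate the "before expiry" half of the refresh behavior, the sketch below reads the standard `exp` claim from a JWT and decides whether a refresh is due. It assumes the inference token carries an `exp` claim; the 60-second leeway and the function names are hypothetical, and the signature is deliberately not verified because the check is only a local scheduling hint.

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> Optional[float]:
    """Read the `exp` claim (seconds since the epoch) without verifying the signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return claims.get("exp")


def needs_refresh(token: str, now: Optional[float] = None, leeway: float = 60.0) -> bool:
    """Return True once the token is within `leeway` seconds of expiring."""
    exp = jwt_expiry(token)
    if exp is None:
        return False  # no expiry claim; rely on the reactive 401 path instead
    current = time.time() if now is None else now
    return current >= exp - leeway
```

A proactive check like this reduces how often the reactive 401 path is needed, but it does not replace it, since a token can be revoked before its `exp` time.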

---

### Claude Code Setup (Claude Pro/Max)

Use your Claude Pro or Max subscription for Hindsight without separate Anthropic API costs.


:::warning Terms of Service Notice

This integration uses the Claude Agent SDK with your personal Claude Pro/Max subscription
credentials. You must be logged into Claude Code on your own machine before using this provider.

**Please be aware:**

- Anthropic's [Agent SDK documentation](https://docs.claude.com/en/api/agent-sdk/overview)
  states that third-party developers should not offer claude.ai login or rate limits for
  their products. Hindsight does **not** perform any login on your behalf — it uses
  credentials you've already authenticated via `claude auth login`.
- In January 2026, Anthropic [enforced restrictions](https://paddo.dev/blog/anthropic-walled-garden-crackdown/)
  against third-party tools using Claude subscription OAuth tokens. Those restrictions
  targeted tools that **spoofed the Claude Code client identity** — Hindsight uses the
  official Claude Agent SDK instead.
- This provider is intended for **local, personal development use only**. Do not use it
  in production deployments or shared environments.
- Anthropic's terms may change. If you want guaranteed compliance, use the `anthropic`
  provider with an API key instead.
- Usage counts against your Claude Pro/Max subscription limits.

For production or team use, we recommend using `HINDSIGHT_API_LLM_PROVIDER=anthropic` with
an API key from the [Anthropic Console](https://console.anthropic.com/).

:::


**Prerequisites:**
- Active Claude Pro or Max subscription
- Claude Code CLI installed

**Setup Steps:**

1. **Install Claude Code CLI:**
   ```bash
   npm install -g @anthropic-ai/claude-code
   # Or via Homebrew
   brew install --cask claude-code
   ```

2. **Login with Claude credentials:**
   ```bash
   claude auth login
   ```
   This opens a browser window to authenticate with your Claude account. Authentication is automatically managed by the Claude Agent SDK.

3. **Verify authentication:**
   ```bash
   claude --version
   # Should show version without errors
   ```

4. **Configure Hindsight:**
   ```bash
   export HINDSIGHT_API_LLM_PROVIDER=claude-code
   # No API key needed - uses claude auth login credentials
   ```

5. **Start Hindsight:**
   ```bash
   hindsight-api
   ```

You can use any model supported by Claude Code CLI.

**Important Notes:**
- Authentication handled by Claude Agent SDK (uses bundled CLI)
- Credentials managed securely by Claude Code
- Usage billed to your Claude subscription (not separate API costs)
- For personal development use only (see Claude Terms of Service)


---

### Vertex AI Setup (Google Cloud)

Google Cloud's Vertex AI provides access to Gemini models via the native Google GenAI SDK.

**Prerequisites:**
- GCP project with Vertex AI API enabled
- IAM role `roles/aiplatform.user` for your credentials

**Environment Variables:**

| Variable | Description | Required |
|----------|-------------|----------|
| `HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID` | Your GCP project ID | Yes |
| `HINDSIGHT_API_LLM_VERTEXAI_REGION` | GCP region (e.g., `us-central1`) | No (default: `us-central1`) |
| `HINDSIGHT_API_LLM_VERTEXAI_SERVICE_ACCOUNT_KEY` | Path to service account JSON key file | No (uses ADC if not set) |

**Authentication Methods:**

1. **Application Default Credentials (ADC)** - Recommended for development
   ```bash
   # Setup ADC
   gcloud auth application-default login

   # Configure Hindsight
   export HINDSIGHT_API_LLM_PROVIDER=vertexai
   export HINDSIGHT_API_LLM_MODEL=gemini-3.1-flash-lite
   export HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID=your-project-id
   ```

2. **Service Account Key** - Recommended for production
   ```bash
   # Create service account and download key
   gcloud iam service-accounts create hindsight-api
   gcloud projects add-iam-policy-binding your-project-id \
     --member="serviceAccount:hindsight-api@your-project-id.iam.gserviceaccount.com" \
     --role="roles/aiplatform.user"
   gcloud iam service-accounts keys create key.json \
     --iam-account=hindsight-api@your-project-id.iam.gserviceaccount.com

   # Configure Hindsight
   export HINDSIGHT_API_LLM_PROVIDER=vertexai
   export HINDSIGHT_API_LLM_MODEL=gemini-3.1-flash-lite
   export HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID=your-project-id
   export HINDSIGHT_API_LLM_VERTEXAI_SERVICE_ACCOUNT_KEY=/path/to/key.json
   ```

**Notes:**
- Model names can optionally include the `google/` prefix (e.g., `google/gemini-3.1-flash-lite`) — it will be stripped automatically
- The native SDK handles token refresh automatically
- Uses service account credentials if provided, otherwise falls back to ADC
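
The two notes about prefix stripping and credential fallback can be expressed as a short sketch. The helper names are hypothetical and the return labels (`"service_account"`, `"adc"`) are illustrative only; the real provider hands these choices to the Google GenAI SDK.

```python
from typing import Mapping


def normalize_vertex_model(name: str) -> str:
    """Drop an optional leading "google/" from a Vertex AI model name."""
    return name.removeprefix("google/")


def vertex_credential_source(env: Mapping[str, str]) -> str:
    """Pick the service account key when configured, otherwise fall back to ADC."""
    if env.get("HINDSIGHT_API_LLM_VERTEXAI_SERVICE_ACCOUNT_KEY"):
        return "service_account"
    return "adc"
```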

---

## Embedding Model

Converts text into dense vector representations for semantic similarity search.

**Default:** `BAAI/bge-small-en-v1.5` (384 dimensions, ~130MB)

### Supported Providers

| Provider | Description | Best For |
|----------|-------------|----------|
| `local` | SentenceTransformers (default) | Development, low latency |
| `onnx` | In-process ONNX Runtime embedder (no Ollama/TEI/API sidecar) | Lightweight local CPU, multilingual |
| `openai` | OpenAI embeddings API | Production, high quality |
| `openai-codex` | OpenAI embeddings via Codex OAuth (ChatGPT Plus/Pro, no API key) | Existing ChatGPT/Codex subscribers |
| `openrouter` | OpenRouter embeddings (OpenAI-compatible gateway) | Multi-provider setups |
| `cohere` | Cohere embeddings API | Production, multilingual |
| `google` | Google embeddings (Gemini API or Vertex AI) | Production, multilingual, high quality |
| `tei` | HuggingFace Text Embeddings Inference | Production, self-hosted |
| `zeroentropy` | ZeroEntropy zembed-1 | Production, high quality retrieval |
| `litellm` | LiteLLM proxy (unified gateway) | Multi-provider setups |
| `litellm-sdk` | LiteLLM SDK (direct API, no proxy) | Multi-provider, simpler setup |

### Local Models

| Model | Dimensions | Use Case |
|-------|------------|----------|
| `BAAI/bge-small-en-v1.5` | 384 | Default, fast, good quality |
| `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | 384 | Multilingual (50+ languages) |

### OpenAI Models

| Model | Dimensions | Use Case |
|-------|------------|----------|
| `text-embedding-3-small` | 1536 | Default OpenAI, cost-effective |
| `text-embedding-3-large` | 3072 | Higher quality, more expensive |
| `text-embedding-ada-002` | 1536 | Legacy model |

### Google Models

| Model | Dimensions | Use Case |
|-------|------------|----------|
| `gemini-embedding-001` | 768 (configurable) | Default Google, general purpose |
| `gemini-embedding-2-preview` | 768 (configurable) | Gemini Embedding 2 family; multimodal, one vector per input |

Google's `gemini-embedding-001` supports configurable output dimensionality via truncation. Google recommends 768, 1536, or 3072 dimensions; set the value with `HINDSIGHT_API_EMBEDDINGS_GEMINI_OUTPUT_DIMENSIONALITY` (default: 768).

The `gemini-embedding-2` family, including `gemini-embedding-2-preview`, is supported on both the Gemini API and Vertex AI. These models aggregate multi-input requests, so Hindsight automatically embeds one input per call to keep per-fact vectors aligned.
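
For readers unfamiliar with dimensionality truncation, the sketch below shows the general idea: keep the first `dims` components and rescale to unit length so cosine similarity still behaves as expected. This is a generic illustration; the truncation itself is requested from the embedding service through the output-dimensionality setting, and whether any re-normalization happens on the Hindsight side is not stated here.

```python
import math
from typing import List, Sequence


def truncate_embedding(vector: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head
```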

### Cohere Models

| Model | Dimensions | Use Case |
|-------|------------|----------|
| `embed-english-v3.0` | 1024 | English text |
| `embed-multilingual-v3.0` | 1024 | 100+ languages |

### ZeroEntropy Models

| Model | Dimensions | Use Case |
|-------|------------|----------|
| `zembed-1` | 1280 default (2560/1280/640/320/160/80/40 configurable) | High quality asymmetric retrieval |

Hindsight sends retained memory text to ZeroEntropy as `document` inputs and recall/search text as `query` inputs. ZeroEntropy's API default is 2560 dimensions; Hindsight defaults to 1280 so pgvector HNSW works without changing the vector extension.

:::warning Embedding Dimensions
Hindsight automatically detects the embedding dimension at startup and adjusts the database schema. Once memories are stored, you cannot change dimensions without losing data.
:::

**Configuration Examples:**

```bash
# Local provider (default)
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=local
export HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL=BAAI/bge-small-en-v1.5

# ONNX provider (in-process local CPU, no Ollama/TEI/API sidecar; pip install hindsight-api-slim[local-onnx])
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=onnx
export HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID=intfloat/multilingual-e5-small
export HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS=384

# OpenAI
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai
export HINDSIGHT_API_EMBEDDINGS_OPENAI_API_KEY=sk-xxxxxxxxxxxx
export HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL=text-embedding-3-small

# Cohere
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=cohere
export HINDSIGHT_API_COHERE_API_KEY=your-api-key
export HINDSIGHT_API_EMBEDDINGS_COHERE_MODEL=embed-english-v3.0

# Google (API key auth)
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=google
export HINDSIGHT_API_EMBEDDINGS_GEMINI_API_KEY=xxxxxxxxxxxx
export HINDSIGHT_API_EMBEDDINGS_GEMINI_MODEL=gemini-embedding-001

# Google (Vertex AI auth - auto-detected when project ID is set)
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=google
export HINDSIGHT_API_EMBEDDINGS_GEMINI_MODEL=gemini-embedding-001
export HINDSIGHT_API_EMBEDDINGS_VERTEXAI_PROJECT_ID=your-gcp-project-id

# TEI (self-hosted)
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=tei
export HINDSIGHT_API_EMBEDDINGS_TEI_URL=http://localhost:8080

# ZeroEntropy
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=zeroentropy
export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_API_KEY=your-api-key
export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_MODEL=zembed-1
export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_DIMENSIONS=1280

# LiteLLM proxy
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=litellm
export HINDSIGHT_API_LITELLM_API_BASE=http://localhost:4000
export HINDSIGHT_API_EMBEDDINGS_LITELLM_MODEL=text-embedding-3-small

# LiteLLM SDK (direct, no proxy)
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=litellm-sdk
export HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_API_KEY=sk-xxxxxxxxxxxx
export HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_MODEL=openai/text-embedding-3-small
```

See [Configuration](./configuration#embeddings) for all options including Azure OpenAI and custom endpoints.

---

## Cross-Encoder (Reranker)

Reranks initial search results to improve precision.

**Default:** `cross-encoder/ms-marco-MiniLM-L-6-v2` (~85MB)

### Supported Providers

| Provider | Description | Best For |
|----------|-------------|----------|
| `local` | SentenceTransformers CrossEncoder (default) | Development, low latency |
| `cohere` | Cohere rerank API | Production, high quality |
| `openrouter` | OpenRouter rerank API (Cohere-compatible gateway) | Multi-provider setups |
| `zeroentropy` | ZeroEntropy rerank API (zerank-2) | Production, state-of-the-art accuracy |
| `siliconflow` | SiliconFlow rerank API (Cohere-compatible `/rerank` endpoint) | Users in China or anyone on SiliconFlow's platform |
| `alibaba` | Alibaba Cloud DashScope rerank API (qwen3-rerank) | Users on Alibaba Cloud / DashScope |
| `google` | Google Discovery Engine ranking API (REST + Google auth) | Production, GCP integration |
| `tei` | HuggingFace Text Embeddings Inference | Production, self-hosted |
| `flashrank` | FlashRank (lightweight, fast) | Resource-constrained environments |
| `litellm` | LiteLLM proxy (unified gateway) | Multi-provider setups |
| `litellm-sdk` | LiteLLM SDK (direct API, no proxy) | Multi-provider, simpler setup |
| `jina-mlx` | Jina rerank v3 via Apple Silicon MLX (local, no API key) | Apple Silicon (M1+) local inference |
| `rrf` | RRF-only (no neural reranking) | Testing, minimal resources |
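
For context on the `rrf` option, Reciprocal Rank Fusion combines several ranked lists by summing `1 / (k + rank)` for each document. The sketch below is the textbook form with the commonly used `k = 60`; the constant and input lists Hindsight uses internally are not documented here, and `rrf_fuse` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[Hashable]], k: int = 60) -> List[Hashable]:
    """Fuse ranked lists: each document scores the sum of 1 / (k + rank) over lists."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

Because RRF uses only rank positions, it needs no model download or API call, which is why it suits testing and minimal-resource setups.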

### Local Models

| Model | Use Case |
|-------|----------|
| `cross-encoder/ms-marco-MiniLM-L-6-v2` | Default, fast |
| `cross-encoder/ms-marco-MiniLM-L-12-v2` | Higher accuracy |
| `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1` | Multilingual |

### Cohere Models

| Model | Use Case |
|-------|----------|
| `rerank-english-v3.0` | English text |
| `rerank-multilingual-v3.0` | 100+ languages |

### ZeroEntropy Models

| Model | Use Case |
|-------|----------|
| `zerank-2` | Flagship multilingual reranker (default) |
| `zerank-2-small` | Faster, lighter variant |

### SiliconFlow Models

SiliconFlow hosts a range of open-weight rerankers behind a Cohere-compatible `/rerank` endpoint:

| Model | Use Case |
|-------|----------|
| `BAAI/bge-reranker-v2-m3` | Multilingual, strong default |
| `Qwen/Qwen3-Reranker-8B` | Larger, higher accuracy |

### Alibaba Cloud Models

Alibaba Cloud DashScope exposes `qwen3-rerank` via a Cohere-compatible `/reranks` endpoint:

| Model | Use Case |
|-------|----------|
| `qwen3-rerank` | 100+ languages, default |

### LiteLLM Supported Providers

LiteLLM supports multiple reranking providers via the `/rerank` endpoint:

| Provider | Model Example |
|----------|---------------|
| Cohere | `cohere/rerank-english-v3.0` |
| Together AI | `together_ai/...` |
| Voyage AI | `voyage/rerank-2` |
| Jina AI | `jina_ai/...` |
| AWS Bedrock | `bedrock/...` |

**Configuration Examples:**

```bash
# Local provider (default)
export HINDSIGHT_API_RERANKER_PROVIDER=local
export HINDSIGHT_API_RERANKER_LOCAL_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2

# Cohere
export HINDSIGHT_API_RERANKER_PROVIDER=cohere
export HINDSIGHT_API_COHERE_API_KEY=your-api-key
export HINDSIGHT_API_RERANKER_COHERE_MODEL=rerank-english-v3.0

# Cohere-compatible endpoint (Azure AI Foundry, Jina, Voyage, self-hosted BGE, ...)
# Setting COHERE_BASE_URL switches the provider off the Cohere SDK and onto a
# plain HTTP client that speaks the standard rerank wire format:
#   POST {base_url}  Authorization: Bearer <key>
#   {"model","query","documents","return_documents":false}
#   -> {"results":[{"index","relevance_score"}, ...]}
export HINDSIGHT_API_RERANKER_PROVIDER=cohere
export HINDSIGHT_API_RERANKER_COHERE_API_KEY=your-api-key
export HINDSIGHT_API_RERANKER_COHERE_MODEL=rerank-v3.5  # whatever model the endpoint serves
export HINDSIGHT_API_RERANKER_COHERE_BASE_URL=https://your-endpoint.example/rerank

# ZeroEntropy (state-of-the-art accuracy)
export HINDSIGHT_API_RERANKER_PROVIDER=zeroentropy
export HINDSIGHT_API_RERANKER_ZEROENTROPY_API_KEY=your-api-key
export HINDSIGHT_API_RERANKER_ZEROENTROPY_MODEL=zerank-2  # default, can omit

# SiliconFlow (Cohere-compatible /rerank endpoint)
export HINDSIGHT_API_RERANKER_PROVIDER=siliconflow
export HINDSIGHT_API_RERANKER_SILICONFLOW_API_KEY=your-api-key
export HINDSIGHT_API_RERANKER_SILICONFLOW_MODEL=BAAI/bge-reranker-v2-m3  # default, can omit

# Alibaba Cloud DashScope (qwen3-rerank)
export HINDSIGHT_API_RERANKER_PROVIDER=alibaba
export HINDSIGHT_API_RERANKER_ALIBABA_API_KEY=your-dashscope-api-key
export HINDSIGHT_API_RERANKER_ALIBABA_MODEL=qwen3-rerank  # default, can omit

# TEI (self-hosted)
export HINDSIGHT_API_RERANKER_PROVIDER=tei
export HINDSIGHT_API_RERANKER_TEI_URL=http://localhost:8081

# FlashRank (lightweight)
export HINDSIGHT_API_RERANKER_PROVIDER=flashrank

# LiteLLM proxy
export HINDSIGHT_API_RERANKER_PROVIDER=litellm
export HINDSIGHT_API_LITELLM_API_BASE=http://localhost:4000
export HINDSIGHT_API_RERANKER_LITELLM_MODEL=cohere/rerank-english-v3.0

# RRF-only (no neural reranking)
export HINDSIGHT_API_RERANKER_PROVIDER=rrf
```

See [Configuration](./configuration#reranker) for all options including Azure-hosted endpoints and batch settings.
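
The Cohere-compatible wire format described in the comments of the example above can be sketched as a pair of helpers: one builds the request body, the other applies the response to reorder documents. The function names are hypothetical and no HTTP call is made, so this only illustrates the payload shapes.

```python
from typing import Any, Dict, List


def build_rerank_request(model: str, query: str, documents: List[str]) -> Dict[str, Any]:
    """Build the JSON body for POST {base_url} in the standard rerank wire format."""
    return {
        "model": model,
        "query": query,
        "documents": documents,
        "return_documents": False,
    }


def order_by_relevance(documents: List[str], response: Dict[str, Any]) -> List[str]:
    """Reorder documents using the index/relevance_score pairs in the response."""
    results = sorted(response["results"], key=lambda r: r["relevance_score"], reverse=True)
    return [documents[r["index"]] for r in results]
```

Since `return_documents` is false, the response carries only indices and scores, so the caller must keep the original document list to map results back.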


---


## File: developer/monitoring.md

# Monitoring

Hindsight provides comprehensive observability through Prometheus metrics, OpenTelemetry distributed tracing, and pre-built Grafana dashboards.

## Local Development

For local observability, use the Grafana LGTM (Loki, Grafana, Tempo, Mimir) all-in-one stack:

```bash
./scripts/dev/start-monitoring.sh
```

This starts a single Docker container providing:
- **Grafana UI**: http://localhost:3000 (anonymous admin access)
- **Traces (Tempo)**: OTLP endpoint at http://localhost:4318 (HTTP) and http://localhost:4317 (gRPC)
- **Metrics (Prometheus/Mimir)**: Scrapes http://localhost:8888/metrics automatically
- **Logs (Loki)**: Available for log aggregation
- **Pre-built Dashboards**: Hindsight Operations, LLM Metrics, API Service

**Enable tracing in your API:**
```bash
export HINDSIGHT_API_OTEL_TRACES_ENABLED=true
export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```

:::note Production Deployment
The local monitoring stack is for development only. In production, deploy Grafana LGTM separately or use commercial platforms (Grafana Cloud, DataDog, New Relic, etc.).
:::

## Grafana Dashboards

Pre-built dashboards are available in [`monitoring/grafana/dashboards/`](https://github.com/anthropics/hindsight/tree/main/monitoring/grafana/dashboards). Import these JSON files into your Grafana instance:

| Dashboard | Description |
|-----------|-------------|
| **Hindsight Operations** | Operation rates, latency percentiles, per-bank metrics |
| **Hindsight LLM Metrics** | LLM calls, token usage, latency by scope/provider |
| **Hindsight API Service** | HTTP requests, error rates, DB pool, process metrics |

The dashboards are automatically provisioned when using the monitoring stack script.

## Metrics Endpoint

Hindsight exposes Prometheus metrics at `/metrics`:

```bash
curl http://localhost:8888/metrics
```
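To read a metric from a script, the response can be parsed in a few lines. This is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format; the `parse_metrics` helper and the sample payload are illustrative, not actual Hindsight output, and the simple regex does not handle label values that contain `}`.

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value.
LINE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def parse_metrics(text: str) -> dict[tuple[str, str], float]:
    """Map (metric name, raw label string) to the sample value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = LINE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[(name, labels or "")] = float(value)
    return samples


# Illustrative payload. A real one could be fetched with
# urllib.request.urlopen("http://localhost:8888/metrics").read().decode()
sample = """\
# HELP hindsight_db_pool_size Current number of connections in the pool
# TYPE hindsight_db_pool_size gauge
hindsight_db_pool_size 8
hindsight_db_pool_idle 3
"""

metrics = parse_metrics(sample)
active = metrics[("hindsight_db_pool_size", "")] - metrics[("hindsight_db_pool_idle", "")]
print(active)
```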

## Available Metrics

### Operation Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `hindsight.operation.duration` | Histogram | operation, bank_id, source, budget, max_tokens, success | Duration of operations in seconds |
| `hindsight.operation.total` | Counter | operation, bank_id, source, budget, max_tokens, success | Total number of operations executed |

**Labels:**
- `operation`: Operation type (`retain`, `recall`, `reflect`, plus async worker task types such as `consolidation`)
- `bank_id`: Memory bank identifier
- `source`: Where the operation was triggered from (`api`, `reflect`, `internal`, `worker`)
- `budget`: Budget level if specified (`low`, `mid`, `high`)
- `max_tokens`: Max tokens if specified
- `success`: Whether the operation succeeded (`true`, `false`)

The `source` label allows distinguishing between:
- `api`: Direct API calls from clients
- `reflect`: Internal recall calls made during reflect operations
- `internal`: Other internal operations
- `worker`: Async worker completions recorded when a claimed task reaches a terminal outcome

For `source="worker"`, the `success` label is a completion-throughput signal:
`false` means the task raised out to the poller after retries were exhausted or
an unexpected error occurred. Failures handled inside the executor and returned
normally still record `success="true"` here; use
`hindsight_async_operations{status="failed"}` for authoritative async operation
failure status.

### LLM Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `hindsight.llm.duration` | Histogram | provider, model, scope, success | Duration of LLM API calls in seconds |
| `hindsight.llm.calls.total` | Counter | provider, model, scope, success | Total number of LLM API calls |
| `hindsight.llm.tokens.input` | Counter | provider, model, scope, success, token_bucket | Input tokens for LLM calls |
| `hindsight.llm.tokens.output` | Counter | provider, model, scope, success, token_bucket | Output tokens from LLM calls |

**Labels:**
- `provider`: LLM provider (`openai`, `anthropic`, `gemini`, `groq`, `ollama`, `lmstudio`, `bedrock`, `litellm`)
- `model`: Model name (e.g., `gpt-4`, `claude-3-sonnet`)
- `scope`: What the LLM call is for (`memory`, `reflect`, `consolidation`, `answer`)
- `success`: Whether the call succeeded (`true`, `false`)
- `token_bucket`: Token count bucket for cardinality control (`0-100`, `100-500`, `500-1k`, `1k-5k`, `5k-10k`, `10k-50k`, `50k+`)
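
The bucketing behind `token_bucket` can be sketched as a lookup over the listed boundaries. This is an illustration only: the `token_bucket` function name is hypothetical, and the handling of exact boundary values (lower edge inclusive, upper edge exclusive) is an assumption, since the labels above do not say which bucket a count of exactly 100 belongs to.

```python
import bisect

# Upper edges and labels taken from the `token_bucket` label values listed above.
EDGES = [100, 500, 1_000, 5_000, 10_000, 50_000]
LABELS = ["0-100", "100-500", "500-1k", "1k-5k", "5k-10k", "10k-50k", "50k+"]


def token_bucket(count: int) -> str:
    """Return the bucket label for a token count.

    Assumption: each bucket includes its lower edge and excludes its upper
    edge, so exactly 100 tokens lands in "100-500".
    """
    return LABELS[bisect.bisect_right(EDGES, count)]


print(token_bucket(42), token_bucket(100), token_bucket(7_500), token_bucket(80_000))
```

Collapsing raw counts into seven labels keeps the number of time series bounded no matter how widely request sizes vary.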

### HTTP Request Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `hindsight.http.duration` | Histogram | method, endpoint, status_code, status_class | Duration of HTTP requests in seconds |
| `hindsight.http.requests.total` | Counter | method, endpoint, status_code, status_class | Total number of HTTP requests |
| `hindsight.http.requests.in_progress` | UpDownCounter | method, endpoint | Number of HTTP requests currently being processed |

**Labels:**
- `method`: HTTP method (`GET`, `POST`, `PUT`, `DELETE`)
- `endpoint`: Request path (normalized to reduce cardinality - UUIDs replaced with `{id}`)
- `status_code`: HTTP status code (`200`, `400`, `500`, etc.)
- `status_class`: Status code class (`2xx`, `4xx`, `5xx`)
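
The `endpoint` normalization can be pictured as a regex substitution. This is a minimal sketch assuming only canonical hyphenated UUIDs are replaced; `normalize_endpoint` is a hypothetical name, and the actual implementation may normalize other path segments as well.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def normalize_endpoint(path: str) -> str:
    """Replace every UUID in a request path with the literal '{id}'."""
    return UUID_RE.sub("{id}", path)


path = "/v1/default/banks/my-bank/memories/123e4567-e89b-12d3-a456-426614174000/observations"
print(normalize_endpoint(path))
```

Without this step every distinct memory ID would create its own label value, and therefore its own time series.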

### Database Pool Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `hindsight.db.pool.size` | Gauge | - | Current number of connections in the pool |
| `hindsight.db.pool.idle` | Gauge | - | Number of idle connections in the pool |
| `hindsight.db.pool.min` | Gauge | - | Minimum pool size |
| `hindsight.db.pool.max` | Gauge | - | Maximum pool size |

### Process Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `hindsight.process.cpu.seconds` | Gauge | type | Process CPU time in seconds |
| `hindsight.process.memory.bytes` | Gauge | type | Process memory usage in bytes |
| `hindsight.process.open_fds` | Gauge | - | Number of open file descriptors |
| `hindsight.process.threads` | Gauge | - | Number of active threads |

**Labels:**
- `type` (CPU): `user` or `system`
- `type` (Memory): `rss_max` (maximum resident set size)

### Histogram Buckets

Custom bucket boundaries are configured for better percentile accuracy:

**Operation Duration Buckets (seconds):**
```
0.1, 0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 5.0, 7.5, 10.0, 15.0, 20.0, 30.0, 60.0, 120.0
```

**LLM Duration Buckets (seconds):**
```
0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 15.0, 30.0, 60.0, 120.0
```

**HTTP Duration Buckets (seconds):**
```
0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0
```

## Prometheus Configuration

```yaml
scrape_configs:
  - job_name: 'hindsight'
    static_configs:
      - targets: ['localhost:8888']
```

## Example Queries

### Average operation latency by type
```promql
sum by (operation) (rate(hindsight_operation_duration_sum[5m])) / sum by (operation) (rate(hindsight_operation_duration_count[5m]))
```

### LLM calls per minute by provider
```promql
sum by (provider) (rate(hindsight_llm_calls_total[1m])) * 60
```

### P95 LLM latency
```promql
histogram_quantile(0.95, sum by (le) (rate(hindsight_llm_duration_bucket[5m])))
```

### Total tokens consumed by model
```promql
sum by (model) (hindsight_llm_tokens_input_total) + sum by (model) (hindsight_llm_tokens_output_total)
```

### Internal vs API recall operations
```promql
sum by (source) (rate(hindsight_operation_total{operation="recall"}[5m]))
```

### HTTP requests per second by endpoint
```promql
sum by (endpoint) (rate(hindsight_http_requests_total[1m]))
```

### HTTP error rate (5xx)
```promql
sum(rate(hindsight_http_requests_total{status_class="5xx"}[5m])) / sum(rate(hindsight_http_requests_total[5m]))
```

### P95 HTTP latency
```promql
histogram_quantile(0.95, sum by (le) (rate(hindsight_http_duration_seconds_bucket[5m])))
```

### Database pool utilization
```promql
hindsight_db_pool_size / hindsight_db_pool_max
```

### Active database connections
```promql
hindsight_db_pool_size - hindsight_db_pool_idle
```

### CPU usage rate
```promql
rate(hindsight_process_cpu_seconds{type="user"}[1m])
```

---

## Distributed Tracing

Hindsight supports OpenTelemetry distributed tracing for memory operations and LLM calls, following GenAI semantic conventions v1.37+.

### Configuration

See [Configuration - OpenTelemetry Tracing](./configuration#opentelemetry-tracing) for environment variables.

**Quick Start:**
```bash
# Enable tracing
export HINDSIGHT_API_OTEL_TRACES_ENABLED=true
export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

# View traces with Grafana LGTM (local dev)
./scripts/dev/start-monitoring.sh
# Open http://localhost:3000 → Explore → Tempo
```

Hindsight works with any OTLP-compatible backend (Grafana LGTM, Langfuse, OpenLIT, DataDog, New Relic, Honeycomb, [Pydantic Logfire](https://logfire.pydantic.dev), etc.).

### Span Hierarchy

**Parent Spans (Operations):**
- `hindsight.retain` - Memory ingestion
- `hindsight.recall` - Memory retrieval
  - `hindsight.recall_embedding` - Query embedding
  - `hindsight.recall_retrieval` - Parallel search (semantic, BM25, graph, temporal)
  - `hindsight.recall_fusion` - Reciprocal Rank Fusion
  - `hindsight.recall_rerank` - Cross-encoder reranking
- `hindsight.reflect` - Agentic reasoning
  - `hindsight.reflect_tool_call` - Tool execution (recall, lookup, etc.)
- `hindsight.consolidation` - Observation synthesis
- `hindsight.mental_model_refresh` - Mental model updates

**Child Spans (LLM Calls):**
- Named by scope (e.g., `hindsight.memory`, `hindsight.reflect`)
- Contain full prompts/completions as events
- Follow GenAI semantic conventions for attributes
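
For readers unfamiliar with the step timed by the `hindsight.recall_fusion` span, Reciprocal Rank Fusion merges several ranked lists by summing `1 / (k + rank)` for each item. The sketch below shows the general technique, not Hindsight's implementation; the constant `k = 60` is the value commonly used in the literature and is an assumption here, as are the memory IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; items ranked high in several lists score highest."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for a stable order.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


semantic = ["m1", "m2", "m3"]  # hypothetical result of the semantic arm
bm25 = ["m2", "m4", "m1"]      # hypothetical result of the BM25 arm
fused = reciprocal_rank_fusion([semantic, bm25])
print(fused)
```

Because only ranks are used, the arms do not need comparable score scales.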

### Span Attributes

**Operation Spans:**
- `hindsight.operation` - Operation type
- `hindsight.bank_id` - Memory bank ID
- `hindsight.query` - Query text (truncated to 100 chars)
- `hindsight.fact_types` - Fact types for recall
- `hindsight.thinking_budget` - Budget allocation
- `hindsight.max_tokens` - Token limit

**LLM Spans (GenAI Semantic Conventions):**
- `gen_ai.operation.name` - Always `"chat"`
- `gen_ai.provider.name` - Provider (`openai`, `anthropic`, `google`, etc.)
- `gen_ai.request.model` - Model name
- `gen_ai.usage.input_tokens` - Input tokens
- `gen_ai.usage.output_tokens` - Output tokens
- `hindsight.scope` - LLM call purpose (`memory`, `reflect`, `consolidation`, etc.)

**Events:**
- `gen_ai.client.inference.operation.details` - Full prompts and completions


---


## File: developer/multilingual.md

# Multilingual Support

Hindsight automatically detects the language of your input and responds in the same language. This means facts, entities, and reflect responses are preserved in their original language without translation to English.

## How It Works

```mermaid
graph LR
    A[Chinese Input] --> B[Language Detection]
    B --> C[Extract Facts in Chinese]
    C --> D[Chinese Entities]
    D --> E[Chinese Response]
```

When you retain content or reflect on a query, Hindsight:

1. **Detects the input language** automatically from the content
2. **Extracts facts in the original language** - preserving nuance and meaning
3. **Stores entities in their native script** - 张伟 stays 张伟, not "Zhang Wei"
4. **Responds in the same language** - queries in Chinese get Chinese answers

---

## Retain with Non-English Content

When you retain content in any language, Hindsight extracts and stores facts in that same language.

### Example: Chinese Content

```python
from hindsight import Hindsight

hindsight = Hindsight()

# Retain Chinese content
hindsight.retain(
    bank_id="user-123",
    content="""
    张伟是一位资深软件工程师，在腾讯工作了五年。
    他专门研究分布式系统，并领导了公司微服务架构的开发。
    """,
    context="团队概述"
)

# Query in Chinese - get Chinese results
results = hindsight.recall(
    bank_id="user-123",
    query="告诉我关于张伟的信息"
)

# Facts are returned in Chinese:
# - 张伟是一位资深软件工程师，在腾讯工作了五年
# - 张伟专门研究分布式系统，并领导了公司微服务架构的开发
```

### Example: Japanese Content

```python
hindsight.retain(
    bank_id="user-123",
    content="""
    田中さんはソフトウェアエンジニアで、東京のスタートアップで働いています。
    彼女はPythonとTypeScriptが得意で、毎日コードレビューをしています。
    """,
    context="チームプロフィール"
)

# Query in Japanese
results = hindsight.recall(
    bank_id="user-123",
    query="田中さんについて教えてください"
)
```

---

## Reflect with Non-English Queries

The `reflect` operation also respects the input language, generating thoughtful responses in the same language as the query.

### Example: Chinese Reflection

```python
# Store facts about team members (in Chinese)
hindsight.retain(
    bank_id="team-eval",
    content="张伟是一位优秀的软件工程师，完成了五个重大项目。他总是按时交付，代码整洁有良好的文档。",
    context="绩效评估"
)

hindsight.retain(
    bank_id="team-eval",
    content="李明最近加入团队。他错过了第一个截止日期，代码有很多bug。",
    context="绩效评估"
)

# Reflect in Chinese
result = hindsight.reflect(
    bank_id="team-eval",
    query="谁是更可靠的工程师？"
)

# Response is in Chinese:
# "我认为张伟更可靠。张伟完成了五个重大项目，按时交付，代码质量高..."
```

---

## Mixed Language Content

Hindsight handles mixed-language content gracefully, preserving both languages where appropriate.

### Example: Chinese Text with English Company Names

```python
hindsight.retain(
    bank_id="user-123",
    content="""
    王芳在Google北京办公室工作，她是一名高级产品经理。
    之前她在Microsoft和Amazon工作过。
    她负责管理YouTube在中国市场的推广策略。
    """,
    context="员工资料"
)

# Facts preserve both languages:
# - 王芳在Google北京办公室工作，担任高级产品经理
# - 王芳曾在Microsoft和Amazon工作过
# - 王芳负责管理YouTube在中国市场的推广策略
```

---

## Supported Languages

**Hindsight's multilingual support depends entirely on your LLM's language capabilities.** Hindsight instructs the LLM to detect the input language and respond in that same language. If your LLM supports a language, Hindsight will work with it.

Most modern LLMs (GPT-4, Claude, Gemini, Llama 3, etc.) support dozens of languages including:

- **East Asian**: Chinese (Simplified/Traditional), Japanese, Korean
- **European**: Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian
- **Middle Eastern**: Arabic, Hebrew, Turkish
- **South Asian**: Hindi, Bengali, Tamil
- **Southeast Asian**: Thai, Vietnamese, Indonesian

**To verify support for your target language**, test your LLM directly with content in that language. If the LLM can understand and generate text in the language, Hindsight will preserve it correctly.

---

## Configuring for Multilingual Use

For optimal multilingual performance, configure all four components of the pipeline:

### 1. LLM (Required)
Your LLM must support the target languages. Most modern LLMs do, but verify with your specific model.

### 2. Embedding Model (Recommended)
The default embedding model (`BAAI/bge-small-en-v1.5`) is **English-only**. For multilingual content, use a multilingual embedding model:

```bash
# In your .env file
HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL=BAAI/bge-m3
```

**Recommended multilingual embedding models:**
| Model | Languages | Notes |
|-------|-----------|-------|
| `BAAI/bge-m3` | 100+ | Best overall multilingual performance |
| `intfloat/multilingual-e5-large` | 100+ | Good alternative |
| `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | 50+ | Lighter weight |

### 3. Reranker Model (Recommended)
The default reranker (`cross-encoder/ms-marco-MiniLM-L-6-v2`) is **English-only**. For multilingual content, use a multilingual reranker:

```bash
# In your .env file
HINDSIGHT_API_RERANKER_LOCAL_MODEL=BAAI/bge-reranker-v2-m3
```

**Recommended multilingual reranker models:**
| Model | Languages | Notes |
|-------|-----------|-------|
| `BAAI/bge-reranker-v2-m3` | 100+ | Best multilingual reranking |
| `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1` | 14 | Lighter alternative |

### 4. BM25 / Full-Text Search Backend

The semantic (embedding) arm covers cross-lingual matches by meaning. Hindsight runs a BM25 keyword arm in parallel, and **BM25 is inherently within-language** — it's character/token matching against a tokenizer's lexemes. The default `native` backend uses PostgreSQL's English dictionary, which produces poor results for non-English content (and no useful tokenization at all for Chinese / Japanese / Korean, which lack whitespace word boundaries).
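
The whitespace problem is easy to see in a few lines. The sketch below is a simplified illustration, not how PostgreSQL or any of the backends tokenize internally: it compares naive whitespace splitting with character bigrams, the general idea behind bigram-style tokenizers.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace, which is roughly what works for English."""
    return text.split()


def char_bigrams(text: str) -> list[str]:
    """Overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


english = "Alice joined Google"
chinese = "张伟在腾讯工作"

print(whitespace_tokens(english))  # three separate words
print(whitespace_tokens(chinese))  # the whole sentence as a single token
print(char_bigrams(chinese))       # includes the company name as its own token
```

With whitespace splitting, a keyword query for `腾讯` cannot match the Chinese sentence because the only indexed token is the full sentence; with bigrams, `腾讯` is one of the indexed tokens.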

There are two knobs that interact:

- `HINDSIGHT_API_TEXT_SEARCH_EXTENSION` — selects the backend (`native`, `vchord`, `pg_textsearch`, `pgroonga`, or `pg_search`).
- `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE` — selects the PostgreSQL dictionary used by the `native` backend (default: `english`).

Pick the backend based on the languages your bank stores:

| Backend | Multilingual / CJK | Notes |
|---------|--------------------|-------|
| `native` | Only languages with a built-in PostgreSQL dictionary, mostly European (English, French, German, Spanish, Italian, Portuguese, Russian, Dutch, Swedish, Norwegian, Danish, Finnish, Hungarian, Turkish, Arabic, plus `simple`). CJK requires a third-party dictionary like `zhparser`. | Stock PostgreSQL, no extra extensions. Configure the language via `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE`. |
| `vchord` | Multilingual via `llmlingua2` tokenizer. | Best when you're already using vchord for vector search. |
| `pg_textsearch` | English only (hardcoded). | Industry-standard BM25 ranking + Block-Max WAND. |
| `pgroonga` | **Yes — out of the box.** Single index handles English, CJK, and mixed-script content via the `TokenBigram` polyglot tokenizer + `NormalizerNFKC150` Unicode normalization. | Recommended for non-English / mixed-language banks. Requires the `pgroonga` extension. See `docker/docker-compose/pgroonga/`. |
| `pg_search` | Multilingual via configurable tokenizer (e.g. `chinese_compatible`, `jieba`, `chinese_lindera`, `japanese_lindera`, `korean_lindera`, `ngram`). | ParadeDB `pg_search` extension; the only Citus-compatible BM25 backend. Tokenizer set via `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_PG_SEARCH_TOKENIZER`. See `docker/docker-compose/pg_search/`. |

**Choosing for a single-language bank** (e.g. all Spanish content):
```bash
HINDSIGHT_API_TEXT_SEARCH_EXTENSION=native
HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE=spanish
```

**Choosing for a CJK or mixed-language bank**:
```bash
HINDSIGHT_API_TEXT_SEARCH_EXTENSION=pgroonga
```

The language setting applies only to the `native` backend: `pgroonga`'s tokenizer is set at index creation and ignores `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE`.
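
The two recommendations above can be folded into a small decision helper. This is a sketch under stated assumptions: `text_search_env` is a hypothetical function, the language set is copied from the `native` row of the table, and it only chooses between `native` and `pgroonga`, ignoring the other backends.

```python
# Dictionaries listed for the `native` backend in the table above.
NATIVE_LANGUAGES = {
    "english", "french", "german", "spanish", "italian", "portuguese",
    "russian", "dutch", "swedish", "norwegian", "danish", "finnish",
    "hungarian", "turkish", "arabic",
}


def text_search_env(languages: set[str]) -> dict[str, str]:
    """Suggest environment variables for the languages a bank stores."""
    langs = {lang.lower() for lang in languages}
    if len(langs) == 1 and langs <= NATIVE_LANGUAGES:
        (lang,) = langs  # single language with a built-in dictionary
        return {
            "HINDSIGHT_API_TEXT_SEARCH_EXTENSION": "native",
            "HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE": lang,
        }
    # CJK, unsupported, or mixed-language content: one polyglot index.
    return {"HINDSIGHT_API_TEXT_SEARCH_EXTENSION": "pgroonga"}


print(text_search_env({"Spanish"}))
print(text_search_env({"Chinese", "English"}))
```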

#### Forcing the LLM Output Language

Independently of the BM25 backend, `HINDSIGHT_API_LLM_OUTPUT_LANGUAGE` forces every LLM-generated artifact into a single language regardless of the source content. This applies uniformly to:

- **Retain** — fact text, context, and entity names extracted from source documents.
- **Consolidation** — observations / mental models synthesized from those facts.
- **Reflect** — the final natural-language response returned by the reflect API.

```bash
# Every LLM call (retain, consolidation, reflect) emits Spanish regardless of source language.
HINDSIGHT_API_LLM_OUTPUT_LANGUAGE=Spanish
```

Common patterns:
- **Aligned, single-language bank**: `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE=spanish` + `HINDSIGHT_API_LLM_OUTPUT_LANGUAGE=Spanish` — store, index, and respond in Spanish even when sources are mixed.
- **Mixed-language bank with multilingual indexing**: `HINDSIGHT_API_TEXT_SEARCH_EXTENSION=pgroonga` + leave `HINDSIGHT_API_LLM_OUTPUT_LANGUAGE` unset — preserve source-language facts; pgroonga handles all of them in one index; reflect responds in the query's language.
- **Cross-lingual unification**: `HINDSIGHT_API_LLM_OUTPUT_LANGUAGE=English` — every fact, observation, and reflect response in English regardless of source. Useful when the consumer (an English-only LLM, dashboard, or downstream pipeline) needs uniform output.

Leave `HINDSIGHT_API_LLM_OUTPUT_LANGUAGE` unset to preserve the source/query language across the pipeline (the default).

---

## Best Practices

### 1. Use Multilingual Models for Non-English Content
If you primarily work with non-English content, configure multilingual embedding and reranker models. English-only models will still store your content correctly, but semantic search quality will be degraded.

### 2. Keep Content in One Language Per Retain Call
While mixed content works, keeping each `retain` call in a single language produces more consistent results.

### 3. Query in the Same Language as Your Content
For best results, query using the same language as your stored content. Cross-language queries (e.g., English query for Chinese content) may work but results can vary depending on your embedding model.

---

## Technical Details

Multilingual support is implemented through LLM prompt instructions rather than external language detection libraries. This approach:

- **Requires no additional dependencies**
- **Works with any LLM** that supports multiple languages
- **Handles edge cases** like mixed-language content naturally
- **Preserves semantic meaning** better than rule-based translation

The LLM is instructed to:
1. Detect the input language
2. Extract all facts, entities, and descriptions in that same language
3. Never translate into English; produce English output only when the input itself is in English


---


## File: developer/observations.mdx

# Observations: Knowledge Consolidation

After memories are retained, Hindsight automatically consolidates related facts into **observations** — deduplicated, evidence-grounded beliefs the bank has built up from multiple memories. Each observation tracks its supporting evidence (with exact quotes) and a proof count, and is refined rather than overwritten when new evidence arrives.

```mermaid
graph LR
    A[New Facts] --> B[Consolidation Engine]
    B --> C{Existing Observation?}
    C -->|Yes| D[Refine Observation]
    C -->|No| E[Create Observation]
    D --> F[Observations]
    E --> F
```

---

## What Are Observations?

Observations are **consolidated knowledge** built from multiple facts. Unlike raw facts — which are individual pieces of information — observations represent deduplicated beliefs, preferences, and learnings grounded in accumulated evidence. They are not summaries the LLM invents on the fly: each observation is backed by specific source memories, carries a proof count, and evolves as new evidence supports, contradicts, or extends it.

| Raw Facts | Observation |
|-----------|--------------|
| "Alice prefers Python" | "Alice is a Python-focused developer who values readability and simplicity" |
| "Alice dislikes verbose code" | |
| "Alice recommends type hints" | |

Observations provide:
- **Deduplication**: One durable belief instead of many overlapping facts
- **Grounding**: Every observation references the specific memories (with quotes) that support it
- **Evolution**: Refined as evidence strengthens, weakens, or contradicts it — history is preserved
- **Freshness awareness**: When newer memories haven't been consolidated yet, `reflect` treats the affected observations as stale and verifies them against raw facts
- **Efficiency**: Condensed knowledge for faster retrieval

---

## How Consolidation Works

### Automatic Background Processing

After `retain()` completes, the consolidation engine runs automatically:

1. **New facts analyzed** — Each new fact is compared against existing observations
2. **Pattern detection** — Related facts are grouped and synthesized
3. **Observation creation/update** — New observations are created or existing ones refined
4. **Evidence tracking** — Each observation maintains references to supporting facts

### Near-Duplicate Reconciliation

Consolidation can still produce two observations that say the same thing in slightly different words — for example when a weaker model writes a near-identical observation instead of refining the existing one, or when refining an observation reshapes its wording so it overlaps another one. Left alone, these near-duplicates clutter recall with redundant beliefs.

When enabled, Hindsight reconciles them automatically. Whenever an observation is created **or** updated, it is compared against the existing observations it most closely resembles. If one is highly similar, a focused check decides whether to **merge** them into a single belief (folding both sets of supporting evidence together) or **keep** them separate. Because the check reads the full text of both, observations that differ in a meaningful detail — a number, a negation, a named entity or language — are correctly kept apart rather than collapsed.

This is controlled by the [`HINDSIGHT_API_CONSOLIDATION_DEDUP_THRESHOLD`](/developer/configuration#observations) setting: the cosine similarity at or above which two observations are reconciled. It is **enabled by default** (`0.97`); a lower value reconciles more aggressively, and `1.0` disables it. Reconciliation runs on PostgreSQL deployments only — it is skipped on Oracle regardless of the threshold.
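
The threshold acts as a first gate before the focused merge-or-keep check. The sketch below illustrates that gate only, assuming observations are compared by the cosine similarity of non-zero embedding vectors; the function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_reconcile_candidate(a: list[float], b: list[float], threshold: float = 0.97) -> bool:
    """True if the pair should go on to the merge-or-keep check."""
    if threshold >= 1.0:  # the documented way to disable reconciliation
        return False
    return cosine_similarity(a, b) >= threshold


existing = [1.0, 0.0]
near_duplicate = [1.0, 0.1]  # similarity about 0.995
unrelated = [1.0, 1.0]       # similarity about 0.707

print(is_reconcile_candidate(existing, near_duplicate))
print(is_reconcile_candidate(existing, unrelated))
```

Passing the gate does not force a merge; it only selects pairs for the focused check that reads both texts.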

Reconciliation only compares observations **within the same tag scope**. If you tag retains with a unique per-call value (e.g. a `session-id`), each session lands in its own scope and never dedups against the others — producing one near-identical observation per session. To consolidate across those volatile tags, retain with [`observation_scopes: "shared"`](/developer/api/retain#shared), which scopes observations to one global, untagged belief while leaving the session tag on the source facts for recall filtering.

### Disabling Auto-Consolidation

Set `HINDSIGHT_API_ENABLE_AUTO_CONSOLIDATION=false` (or configure per-bank via the [bank config API](/developer/api/memory-banks#observations-configuration)) to prevent consolidation from running automatically after retain, delete, and update operations. When disabled, consolidation only runs when you explicitly call the [consolidate endpoint](#trigger-consolidation).

This is useful when you want full control over consolidation timing — for example, batching many retains before consolidating, or running consolidation only for specific scopes.

### Targeted Consolidation

By default, consolidation processes **all** unconsolidated memories in a bank. You can scope it to specific tag sets using the `observation_scopes` parameter on the consolidate endpoint:

```python
# Consolidate only memories tagged with user:alice
client.consolidate(
    bank_id="my-bank",
    observation_scopes=[["user:alice"]]
)

# Consolidate memories for alice OR the engineering team
client.consolidate(
    bank_id="my-bank",
    observation_scopes=[["user:alice"], ["team:engineering"]]
)
```

Each scope is a list of tags. A memory matches a scope if its tags **contain all** tags in that scope. For example, scope `["user:alice"]` matches memories tagged `["user:alice", "team:eng"]`.
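
The matching rule is a subset test per scope, combined with OR across scopes. A minimal sketch with hypothetical helper names:

```python
def matches_scope(memory_tags: list[str], scope: list[str]) -> bool:
    """A memory matches a scope when it carries every tag in that scope."""
    return set(scope) <= set(memory_tags)


def matches_any_scope(memory_tags: list[str], scopes: list[list[str]]) -> bool:
    """Several scopes are alternatives: matching any one of them is enough."""
    return any(matches_scope(memory_tags, scope) for scope in scopes)


scopes = [["user:alice"], ["team:engineering"]]
print(matches_scope(["user:alice", "team:eng"], ["user:alice"]))  # extra tags are fine
print(matches_any_scope(["team:engineering"], scopes))
print(matches_any_scope(["user:bob"], scopes))
```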

When `observation_scopes` is omitted, all unconsolidated memories are processed (backward compatible).

### Evidence-Based Evolution

Observations evolve as new evidence arrives:

| Event | What the bank learns | Observation state |
|-------|---------------------|----------------|
| **Day 1** | "Redis is open source under BSD license" | "Redis is excellent for caching — fast, reliable, and OSS-friendly" (2 supporting facts) |
| **Day 2** | "Redis has great community support" | Observation reinforced (3 supporting facts) |
| **Day 30** | "Redis changed license to SSPL" | Observation refined: "Redis is technically strong, but has license concerns for cloud" |
| **Day 45** | "Valkey forked Redis under BSD" | New observation: "Consider Valkey for new projects requiring true OSS" |

### Handling Contradictory Evidence

What happens when a new fact contradicts an existing observation?

The consolidation engine doesn't blindly overwrite — it **reconciles** the contradiction by capturing the evolution:

**Example: User preference changes**

| Time | Fact | Observation |
|------|------|--------------|
| Week 1 | "User says they love React" | "User prefers React for frontend development" |
| Week 2 | "User praises React's component model" | "User is enthusiastic about React, particularly its component model" |
| Week 3 | "User says they've switched to Vue and won't use React anymore" | "User was previously a React enthusiast who appreciated its component model, but has now switched to Vue and no longer uses React" |

Notice how the final observation captures the **full journey** — not just "User prefers Vue" but the complete evolution of their preference. This nuanced understanding means:

- Your agent won't recommend React tutorials to someone who explicitly moved away from it
- Your agent understands *why* this matters (they were enthusiastic before, so this is a deliberate choice)
- Your agent can reference this history when relevant ("I know you used to work with React...")

The system:
1. **Detects the conflict** — New fact contradicts existing observation
2. **Preserves history** — Incorporates the previous understanding into the new observation
3. **Creates nuanced observation** — Synthesizes a richer understanding that captures the change
4. **Updates freshness** — Marks the observation as recently updated

**Example: Correcting misinformation**

| Time | Fact | Observation |
|------|------|--------------|
| Day 1 | "Alice works at Google" | "Alice is a Google employee" |
| Day 10 | "Alice actually works at Meta, not Google" | "Alice works at Meta (previously thought to work at Google)" |

When a fact explicitly corrects previous information, the observation is updated to reflect the correction while noting the previous understanding. The raw facts are always preserved, so you can trace back to see what was originally stated and when it was corrected.

---

## Observations in Retrieval

Observations are automatically included in both `recall()` and `reflect()` operations:

### In Recall

Observations are returned alongside raw facts and can be included or excluded with the `types` parameter.


### In Reflect

The reflect agent uses **hierarchical retrieval**:

1. **[Mental Models](/developer/api/mental-models)** — User-curated summaries (highest priority)
2. **Observations** — Consolidated knowledge with freshness awareness
3. **Raw Facts** — Ground truth for verification

The agent automatically queries observations and uses them to inform its reasoning.

---

## Freshness Awareness

Observations track when they were last updated. During reflect, the agent considers freshness:

- **Fresh observations**: Used directly for reasoning
- **Stale observations**: Agent verifies against current facts before relying on them

This ensures responses stay accurate even as the underlying data changes.
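
One way to picture the staleness check is shown below. It is a sketch under assumptions, not the actual logic: the `Memory` and `Observation` classes, the `created_at` and `updated_at` field names, and the way related memories are selected are hypothetical; only the `consolidated_at` marker is named elsewhere in this documentation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Memory:
    created_at: datetime
    consolidated_at: Optional[datetime]  # None means not yet consolidated


@dataclass
class Observation:
    updated_at: datetime


def is_stale(observation: Observation, related_memories: list[Memory]) -> bool:
    """Stale if a related memory is newer than the observation and unconsolidated."""
    return any(
        memory.consolidated_at is None and memory.created_at > observation.updated_at
        for memory in related_memories
    )


observation = Observation(updated_at=datetime(2026, 1, 10))
consolidated = [Memory(datetime(2026, 1, 5), datetime(2026, 1, 6))]
with_pending = consolidated + [Memory(datetime(2026, 1, 12), None)]

print(is_stale(observation, consolidated))
print(is_stale(observation, with_pending))
```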

---

## Observation Scopes

By default, observations are scoped to all of a memory's tags combined. The `observation_scopes` retain parameter lets you control this — building separate observations per tag, per combination, or with a custom list of scopes. This is key when a single memory carries multiple tags and you want each tag to accumulate its own observations independently.

See [`observation_scopes` in the Retain API](./api/retain#observation_scopes) for the full explanation and options.

To inspect the scopes that already exist in a bank, call `GET /v1/default/banks/{bank_id}/observations/scopes`. The response lists each exact tag set with its observation count; the empty tag list is the global scope. Use a returned scope as `tags` with `tags_match: "exact"` when you need to filter to that precise observation scope without also matching observations that carry extra tags. To recall **only** the global scope — the untagged observations written by `observation_scopes: "shared"` — pass an empty list with exact matching: `tags: []`, `tags_match: "exact"`.
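
The difference that exact matching makes can be sketched as a set comparison. This illustrates the semantics described above and is not the server's code; `exact_match` is a hypothetical helper and the tag values are examples.

```python
def exact_match(observation_tags: list[str], filter_tags: list[str]) -> bool:
    """Exact matching: the observation's tag set must equal the filter's tag set."""
    return set(observation_tags) == set(filter_tags)


global_scope: list[str] = []              # untagged observation from "shared" scoping
alice_scope = ["user:alice"]
alice_eng_scope = ["user:alice", "team:eng"]

print(exact_match(alice_scope, ["user:alice"]))      # same tag set
print(exact_match(alice_eng_scope, ["user:alice"]))  # extra tag, so no match
print(exact_match(global_scope, []))                 # empty filter selects the global scope
```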

---

## Observations Mission

You can define exactly what this bank should synthesise by setting an **observations mission** (`observations_mission`). This replaces the built-in durable-knowledge rules with your own instructions, letting you control what shape observations take.

```
e.g. Observations are stable facts about people and projects.
     Always include preferences, skills, and recurring patterns.
     Ignore one-off events and ephemeral state.
```

Leave it blank to use the server default — durable, specific facts that stay true over time (preferences, skills, relationships, recurring patterns), with ephemeral state filtered out.

**Examples:**

| `observations_mission` | What gets synthesised |
|------------------------|----------------------|
| *(unset — default)* | Durable facts: preferences, skills, relationships, recurring patterns |
| *"Observations are weekly summaries of sprint outcomes and blockers"* | Broad event summaries grouped by time period |
| *"Observations are stable facts about named individuals only"* | Person-centric knowledge, tied to specific people |
| *"Observations are recurring patterns in customer support interactions"* | Failure modes, common requests, pain points |

Set `observations_mission` via the [bank config API](/developer/api/memory-banks#observations-configuration) or the [`HINDSIGHT_API_OBSERVATIONS_MISSION`](/developer/configuration#observations) environment variable.

---

## Observation Lifecycle & Invalidation

### When Memories Are Deleted

Observations are derived from source memories. When source memories are removed, Hindsight automatically keeps observations consistent:

| Action | Effect on observations |
|--------|----------------------|
| Delete a document | All observations derived from the document's memories are deleted |
| Delete individual memories (by type) | Observations sourced from those memories are deleted |
| Delete an entire bank | All observations are deleted along with everything else |

After deletion, the **remaining source memories** that fed the affected observations have their consolidation state reset, so they will be re-consolidated on the next consolidation run and produce fresh observations.

### Clearing Observations for a Specific Memory

You can clear all observations derived from a single memory without deleting the memory itself. This is useful when you want to force re-synthesis of a memory's contribution to consolidated knowledge.

Use the `DELETE /v1/default/banks/{bank_id}/memories/{memory_id}/observations` endpoint. This will:
1. Delete all observations that list the memory as a source
2. Reset `consolidated_at` on the memory itself and any other source memories that contributed to those observations
3. Trigger a consolidation job so fresh observations are produced automatically
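
A minimal sketch of calling this endpoint from Python's standard library follows. The base URL `http://localhost:8888` is an assumption (the default API port), the helper name is hypothetical, and the request is only built here, not sent.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8888"  # assumption: a locally running API on the default port


def build_clear_observations_request(
    bank_id: str, memory_id: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the DELETE request that clears observations derived from one memory."""
    url = (
        f"{base_url}/v1/default/banks/{quote(bank_id, safe='')}"
        f"/memories/{quote(memory_id, safe='')}/observations"
    )
    return urllib.request.Request(url, method="DELETE")


request = build_clear_observations_request("my-bank", "mem-123")
# urllib.request.urlopen(request)  # uncomment to send against a running server
```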

### Resetting All Observations

To wipe all consolidated knowledge and start over:

```python
# Clear all observations for a bank
client.clear_observations(bank_id="my-bank")
```

This resets the consolidation state for all source memories in the bank, so the next consolidation run will re-derive all observations from scratch.

---

## Trigger Consolidation {#trigger-consolidation}

Use the consolidate endpoint to manually trigger consolidation:

```http
POST /v1/default/banks/{bank_id}/consolidate
Content-Type: application/json

{
  "observation_scopes": [["user:alice"], ["team:engineering"]]
}
```

The request body is optional. When omitted (or sent as an empty body), all unconsolidated memories in the bank are processed.

| Parameter | Type | Description |
|-----------|------|-------------|
| `observation_scopes` | `list[list[str]]` \| `null` | Optional list of tag scopes. Only memories whose tags contain all tags in at least one scope are processed. Omit for a full-bank sweep. |
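
The scope rule in the table can be modelled as a small predicate. This is a sketch of the documented matching rule, not the server's code; `memory_in_scopes` is a hypothetical helper name.

```python
import json
from typing import Optional


def memory_in_scopes(memory_tags: list, observation_scopes: Optional[list]) -> bool:
    """A memory is processed when its tags contain all tags of at least one scope."""
    if observation_scopes is None:
        return True  # no scopes given: full-bank sweep
    tags = set(memory_tags)
    return any(set(scope) <= tags for scope in observation_scopes)


scopes = [["user:alice"], ["team:engineering"]]
request_body = json.dumps({"observation_scopes": scopes})

alice_memory = memory_in_scopes(["user:alice", "topic:hiring"], scopes)
sales_memory = memory_in_scopes(["team:sales"], scopes)
full_sweep = memory_in_scopes(["team:sales"], None)
```

`request_body` is the JSON you would send as the POST body; the predicate shows which memories a given scope list would select.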

## Configuration

Observation consolidation runs automatically by default. You can disable auto-consolidation with [`HINDSIGHT_API_ENABLE_AUTO_CONSOLIDATION`](/developer/configuration#observations) and trigger it on-demand via the [consolidate endpoint](#trigger-consolidation). Monitor consolidation progress via the [Operations API](./api/operations).

---

## Next Steps

- [**Retain**](./retain) — How facts are stored and trigger consolidation
- [**Recall**](./retrieval) — How observations are retrieved
- [**Reflect**](./reflect) — How the agentic loop uses observations
- [**Mental Models**](./api/mental-models) — User-curated summaries for common queries


---


## File: developer/performance.md

# Performance

Hindsight is designed for high-performance semantic memory operations at scale. This page covers performance characteristics, optimization strategies, and best practices.

## Overview

Hindsight's performance is optimized across three key operations:

- **Retain (Ingestion)**: Batch processing with async operations for large-scale memory storage
- **Recall (Search)**: Sub-second semantic search with configurable thinking budgets
- **Reflect (Reasoning)**: Disposition-aware answer generation with controllable compute

## Design Philosophy: Optimized for Fast Reads

Hindsight is **architected from the ground up to prioritize read performance over write performance**. This design decision reflects the typical usage pattern of memory systems: memories are written once but read many times.

The system makes deliberate trade-offs to ensure **sub-second recall operations**:

- **Pre-computed embeddings**: All memory embeddings are generated and indexed during retention
- **Optimized vector search**: HNSW indexes enable fast approximate nearest neighbor search
- **Fact extraction at write time**: Complex LLM-based fact extraction happens during retention, not retrieval
- **Structured memory graphs**: Relationships and temporal information are resolved upfront

This means **Recall (search) operations stay fast**, typically 100-600ms, because the heavy lifting has already been done at write time.

### Performance Comparison

| Operation | Typical Latency | Primary Bottleneck | Optimization Strategy            |
|-----------|----------------|-------------------|----------------------------------|
| **Recall** | 100-600ms | Re-ranker (on CPU) | Use GPU for re-ranking, or reduce budget |
| **Reflect** | 800-3000ms | LLM generation | Use faster LLM                   |
| **Retain** | 500ms-2000ms per batch | **LLM fact extraction** | Use high-throughput LLM provider |

Hindsight is designed to ensure your **application's read path (recall/reflect) is always fast**, even if it means spending more time upfront during writes. This is the right trade-off for memory systems where:

- Memories are retained in background processes or during low-traffic periods
- Memories are queried frequently in user-facing, latency-sensitive contexts
- The ratio of reads to writes is high (typically 10:1 or higher)

---

## Retain Performance

**Retain (write) operations are inherently slower** because they involve LLM-based fact extraction, entity recognition, temporal reasoning, relationship mapping, and embedding generation. **The LLM is the primary bottleneck for write latency.**

### Hindsight Doesn't Need a Smart Model

The fact extraction process is structured and well-defined, so smaller, faster models work extremely well. Our recommended model is `gpt-oss-20b` (available via Groq and other providers).

To maximize retention throughput:

1. **Use high-throughput LLM providers**: Choose providers with high requests-per-minute (RPM) limits and low latency
   - **Fast**: [Groq](https://groq.com) with `gpt-oss-20b` or other gpt-oss models; self-hosted models on GPU clusters (vLLM, TGI)
   - **Slow**: Standard cloud LLM providers with rate limits

2. **Batch your operations**: Group related content into batch requests. Send as much data as you want in a single request — the only limit is the HTTP payload size.

3. **Use async mode for large datasets**: Queue operations in the background

4. **Parallel processing**: For very large datasets, use multiple concurrent retention requests with different `document_id` values
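
The parallel-processing step might look like the sketch below. `retain_document` is a hypothetical placeholder for whatever retain call your client exposes; only the fan-out pattern, one request per `document_id`, is the point.

```python
from concurrent.futures import ThreadPoolExecutor


def retain_document(document_id: str, content: str) -> dict:
    """Hypothetical stand-in for a real retain call; swap in your client here."""
    return {"document_id": document_id, "chars": len(content)}


def retain_in_parallel(documents: dict, max_workers: int = 4) -> list:
    """Submit one retain request per document and collect results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(retain_document, doc_id, text)
            for doc_id, text in documents.items()
        ]
        return [future.result() for future in futures]


results = retain_in_parallel({"doc-1": "first transcript", "doc-2": "second one"})
```

Keep `max_workers` at or below what your LLM provider's rate limits can absorb, since each request triggers fact extraction.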

### Automatic Batch Optimization

**When using async retain, Hindsight automatically handles batch sizing for you.** You don't need to manually tune batch sizes or worry about optimal chunking.

How it works:
- **Send large batches**: Submit hundreds or thousands of items in a single async retain request
- **Automatic splitting**: Hindsight automatically splits large batches (>10,000 tokens) into optimized sub-batches
- **Parallel processing**: Sub-batches are processed concurrently in the background
- **Status tracking**: Parent operation aggregates status from all sub-batches
- **Token-based**: Batching uses tiktoken for accurate token counting, not character counts

Benefits:
- Send entire documents or datasets in one API call
- Let Hindsight optimize the processing strategy
- Track overall progress via the parent operation status
- No need to manually split data into small batches
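
For intuition, token-based splitting can be sketched as a greedy packer. This is not Hindsight's actual splitter: the whitespace token counter is a stand-in for tiktoken so the example runs without extra dependencies, and the function names are hypothetical.

```python
from typing import Callable, List


def count_tokens_approx(text: str) -> int:
    """Stand-in token counter (whitespace split); the real system uses tiktoken."""
    return len(text.split())


def split_into_sub_batches(
    items: List[str],
    max_tokens: int = 10_000,
    count_tokens: Callable[[str], int] = count_tokens_approx,
) -> List[List[str]]:
    """Greedily pack items into sub-batches that stay within max_tokens."""
    batches, current, current_tokens = [], [], 0
    for item in items:
        tokens = count_tokens(item)
        # Start a new sub-batch when adding this item would exceed the limit.
        if current and current_tokens + tokens > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(item)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches


sub_batches = split_into_sub_batches(["a b c", "d e", "f g h i"], max_tokens=5)
```

An item larger than the limit ends up alone in its own sub-batch in this sketch; how the server treats oversized items is not covered here.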

### Throughput

Factors affecting throughput:
- Document size and complexity
- LLM provider rate limits (for fact extraction)
- Database write performance
- Available CPU/memory resources

---

## Tuning for Local & Small Environments

Hindsight's defaults are tuned for cloud LLM providers and multi-core servers. When you run it on a laptop, a single GPU box, or against a **local LLM server** (llama.cpp, vLLM, LM Studio, Ollama) with a small fixed slot pool, those defaults can saturate the backend, time out, or thrash the CPU. This section collects the knobs that matter for low-resource setups.

### LLM concurrency

The default `HINDSIGHT_API_LLM_MAX_CONCURRENT=32` assumes a cloud provider that can absorb dozens of parallel requests. A local server with a handful of slots cannot — Hindsight will fill every slot and **starve any other client sharing the endpoint** (your main agent, another app, or a second Hindsight operation).

```bash
export HINDSIGHT_API_LLM_MAX_CONCURRENT=2
```

A value of `2` lets retain and consolidation run concurrently without blocking each other. If the endpoint is **shared** with other clients (other applications, agents, or workflows hitting the same llama-server / vLLM / LM Studio instance), reserve slots for them by lowering further — leave at least one slot free per shared client.

You can also split the budget per operation so background work never crowds out live reads. The per-operation caps compose *on top of* the global cap:

```bash
# global=4, with retain/consolidation capped low so reflect always has headroom
export HINDSIGHT_API_LLM_MAX_CONCURRENT=4
export HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT=1
export HINDSIGHT_API_CONSOLIDATION_LLM_MAX_CONCURRENT=1
```
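
One way to picture how a global cap and per-operation caps interact is a pair of nested semaphores. The sketch below is a simulation under that assumption, not Hindsight's scheduler; `LlmLimiter` and the job mix are hypothetical.

```python
import asyncio
from contextlib import asynccontextmanager


class LlmLimiter:
    """Per-operation semaphores layered over one global semaphore."""

    def __init__(self, global_cap: int, per_operation: dict):
        self._global = asyncio.Semaphore(global_cap)
        self._per_op = {op: asyncio.Semaphore(cap) for op, cap in per_operation.items()}

    @asynccontextmanager
    async def slot(self, operation: str):
        op_sem = self._per_op.get(operation)
        # Take the operation slot first so waiting work never holds a global slot.
        if op_sem is not None:
            await op_sem.acquire()
        try:
            async with self._global:
                yield
        finally:
            if op_sem is not None:
                op_sem.release()


async def simulate() -> dict:
    limiter = LlmLimiter(global_cap=4, per_operation={"retain": 1, "consolidation": 1})
    active = {"retain": 0, "consolidation": 0, "reflect": 0, "total": 0}
    peak = dict(active)

    async def call(operation: str) -> None:
        async with limiter.slot(operation):
            for key in (operation, "total"):
                active[key] += 1
                peak[key] = max(peak[key], active[key])
            await asyncio.sleep(0.01)  # pretend to wait on the LLM
            for key in (operation, "total"):
                active[key] -= 1

    jobs = ["retain"] * 5 + ["consolidation"] * 5 + ["reflect"] * 5
    await asyncio.gather(*(call(op) for op in jobs))
    return peak


peak = asyncio.run(simulate())
```

In this model retain and consolidation never exceed one in-flight call each, so at least two of the four global slots remain available to reflect.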

### Timeouts and retries

Local models on modest hardware generate tokens slowly, and the first request after startup pays a model-load cost. The default `HINDSIGHT_API_LLM_TIMEOUT=120` (seconds) can be too tight for a large local model on CPU; raise it to avoid spurious timeouts and wasted retries:

```bash
export HINDSIGHT_API_LLM_TIMEOUT=300        # allow slow local generation
export HINDSIGHT_API_LLM_MAX_RETRIES=2      # fail faster locally — retries rarely help a slow box
```

A local endpoint isn't rate-limited, so aggressive retry/backoff mostly adds latency on real failures. Lower retries and let genuine errors surface quickly.

### Smaller, faster models — and reasoning effort

Retain (fact extraction) is structured work that does not need a frontier model; reflect can use a lighter model still. On a constrained box, point each operation at the smallest model that holds up:

```bash
# Reflect on a small/fast model; retain on a slightly stronger structured-output model
export HINDSIGHT_API_REFLECT_LLM_MODEL=<small-fast-model>
export HINDSIGHT_API_RETAIN_LLM_MODEL=<structured-output-model>
```

If your model exposes a reasoning/thinking budget, keep it low (the default) — extra reasoning tokens are pure latency for the extraction and consolidation paths:

```bash
export HINDSIGHT_API_LLM_REASONING_EFFORT=low
```

Consolidation sends multiple facts to the LLM in a single call (default 8). On a small model with a limited context window, a large batch produces an oversized prompt and a long, error-prone response. Shrink the batch so each consolidation call stays small and reliable:

```bash
export HINDSIGHT_API_CONSOLIDATION_LLM_BATCH_SIZE=2   # default 8; lower = smaller prompts, more calls
```
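
The trade-off is simple arithmetic: a smaller batch means proportionally more LLM calls. A rough sketch, assuming pending facts are split evenly into batches (the helper name and the figure of 40 facts are illustrative):

```python
import math


def consolidation_calls(pending_facts: int, batch_size: int) -> int:
    """Number of LLM calls needed when facts are sent batch_size at a time."""
    return math.ceil(pending_facts / batch_size)


calls_default = consolidation_calls(40, 8)  # default batch size
calls_small = consolidation_calls(40, 2)    # reduced batch size
```

For 40 pending facts this gives 5 calls at the default and 20 calls at a batch size of 2, each with a prompt roughly a quarter of the size.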

### Built-in llama.cpp tuning

The bundled `llamacpp` provider runs a llama.cpp server as a managed subprocess — no external server needed. Key knobs for small machines:

```bash
export HINDSIGHT_API_LLM_PROVIDER=llamacpp
export HINDSIGHT_API_LLM_MAX_CONCURRENT=2        # retain + consolidation without blocking
export HINDSIGHT_API_LLAMACPP_GPU_LAYERS=-1      # -1 = offload all layers to GPU; 0 = CPU only
export HINDSIGHT_API_LLAMACPP_CONTEXT_SIZE=8192  # lower to save RAM/VRAM; raise for big batches
export HINDSIGHT_API_LLAMACPP_EXTRA_ARGS="--n_threads 8"  # match physical cores on CPU-only boxes
# export HINDSIGHT_API_LLAMACPP_NO_GRAMMAR=true  # faster, but less reliable JSON output
```

See [Built-in llama.cpp](./configuration#built-in-llamacpp) for the full option list.

### Reranker on CPU

Recall's bottleneck on a machine without a GPU is the cross-encoder reranker. The local reranker has several CPU/Apple-Silicon knobs that are quality-neutral but materially faster:

```bash
# Apple Silicon (MPS): half precision is 27–36% faster, quality-identical
export HINDSIGHT_API_RERANKER_LOCAL_FP16=true

# Sort pairs by length before batching — 36–54% faster, quality-identical by construction
export HINDSIGHT_API_RERANKER_LOCAL_BUCKET_BATCHING=true

# Cap reranker parallelism so it doesn't thrash a small CPU under load (default 4)
export HINDSIGHT_API_RERANKER_LOCAL_MAX_CONCURRENT=2

# On macOS, force CPU if MPS/XPC causes instability
# export HINDSIGHT_API_RERANKER_LOCAL_FORCE_CPU=true
```
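
Why sorting by length helps can be shown with a toy cost model: each batch is padded to its longest pair, so grouping similar lengths wastes less compute. The sketch below uses character length as a proxy for token length and is not the reranker's real code; all names are hypothetical.

```python
def pair_length(pair) -> int:
    """Length proxy for a (query, document) pair; real systems count tokens."""
    return len(pair[0]) + len(pair[1])


def bucket_batches(pairs, batch_size: int):
    """Return batches of pair indices, sorted so similar lengths share a batch."""
    order = sorted(range(len(pairs)), key=lambda i: pair_length(pairs[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padded_cost(pairs, batches) -> int:
    """Total padded length: every pair in a batch costs as much as the longest."""
    return sum(
        max(pair_length(pairs[i]) for i in batch) * len(batch) for batch in batches
    )


pairs = [("q", "a" * 10), ("q", "a" * 100), ("q", "a" * 12), ("q", "a" * 98)]
naive = [[0, 1], [2, 3]]                      # batches in arrival order
bucketed = bucket_batches(pairs, batch_size=2)
naive_cost = padded_cost(pairs, naive)
bucketed_cost = padded_cost(pairs, bucketed)
```

Because each pair is scored independently and the batches hold original indices, scores can be written back in the original order, which is why the result is unchanged.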

The biggest single win on CPU is reranking fewer candidates. By default Hindsight reranks up to 300 candidates per recall — shrink that pool to cut cross-encoder work proportionally:

```bash
export HINDSIGHT_API_RERANKER_MAX_CANDIDATES=100   # default 300; RRF pre-filters the rest
```
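
The pre-filter step can be pictured as reciprocal rank fusion (RRF) followed by truncation. The sketch below is the textbook form of RRF; the constant `k=60` is the conventional default from the literature and an assumption here, as are the function name and sample rankings.

```python
from collections import defaultdict


def rrf_prefilter(rankings, max_candidates: int = 100, k: int = 60):
    """Fuse ranked lists with RRF and keep the top max_candidates ids."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:max_candidates]


semantic_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
all_fused = rrf_prefilter([semantic_hits, keyword_hits])
top_two = rrf_prefilter([semantic_hits, keyword_hits], max_candidates=2)
```

Only the ids that survive the cut are passed to the cross-encoder, so its cost scales with `max_candidates` rather than with the size of the raw result lists.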

For a pure-CPU box that struggles with the cross-encoder, `flashrank` is a lighter ONNX-based reranker:

```bash
export HINDSIGHT_API_RERANKER_PROVIDER=flashrank
```

You can also reduce recall work directly: use a lower `budget` (`low`/`mid`) for everyday queries and reserve `high` for comprehensive reasoning. See [Recall Performance](#recall-performance) below.

---

## Recall Performance

### Budget

The `budget` parameter controls the search depth and quality. Choose based on query complexity — comprehensive questions that need thorough analysis benefit from higher budgets:

| Budget | Use Case |
|--------|----------|
| `low` | Quick lookups, real-time chat |
| `mid` | Standard queries, balanced performance |
| `high` | Comprehensive questions, thorough analysis |

### Optimization

1. **Appropriate budgets**: Use lower budgets for simple queries, higher for comprehensive reasoning
2. **Limit result tokens**: Set `max_tokens` to control response size (default: 4096)
3. **Include chunks**: Use `include_chunks` only when you need additional context, since it also retrieves the raw text that generated each memory

### Database Performance

Hindsight uses PostgreSQL with pgvector for efficient vector search:

- **Index type**: HNSW for approximate nearest neighbor search
- **Typical query time**: 10-50ms for vector search on 100K+ facts
- **Scalability**: Tested with millions of facts per bank

## Reflect Performance

### Performance Characteristics

| Component | Latency        | Description |
|-----------|----------------|-------------|
| Memory search | 100-600ms      | Based on budget (low/mid/high) |
| LLM generation | 500-2000ms     | Depends on provider and response length |
| **Total** | **600-2600ms** | Typical end-to-end latency |

### Optimization Strategies

1. **Budget selection**: Use lower budgets when context is sufficient
2. **Context provision**: Provide relevant `context` to reduce recall requirements and steer towards more focused answers

## Best Practices

### Operations
- **Use appropriate budgets**: Don't over-provision for simple queries; use higher budgets for comprehensive reasoning
- **Batch retain operations**: Group related content together for better efficiency
- **Cache frequent queries**: Cache at the application level for repeated queries
- **Profile with trace**: Use the `trace` parameter to identify slow operations
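
Application-level caching of repeated queries could be as small as the sketch below. `RecallCache` and `fake_recall` are hypothetical; the fetch callable stands in for your real recall call, and the injectable clock exists only to make the example deterministic.

```python
import time


class RecallCache:
    """Tiny time-to-live cache keyed by (bank_id, query)."""

    def __init__(self, fetch, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def get(self, bank_id: str, query: str):
        key = (bank_id, query)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the recall call
        value = self._fetch(bank_id, query)
        self._entries[key] = (now, value)
        return value


calls = []


def fake_recall(bank_id: str, query: str):
    """Stand-in for a real recall call; records each invocation."""
    calls.append((bank_id, query))
    return [f"result for {query}"]


now = [0.0]
cache = RecallCache(fake_recall, ttl_seconds=60.0, clock=lambda: now[0])
first = cache.get("my-bank", "Alice")
second = cache.get("my-bank", "Alice")  # served from the cache
now[0] = 61.0
third = cache.get("my-bank", "Alice")   # entry expired, fetched again
```

Choose the TTL with your write rate in mind: cached results will not reflect memories retained after the entry was stored.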

### Scaling
- **Horizontal scaling**: Deploy multiple API instances behind a load balancer with shared PostgreSQL
- **Concurrency**: 100+ simultaneous requests supported; memory search scales with CPU cores
- **LLM rate limits**: Distribute load across multiple API keys/providers (typically 60-500 RPM per key)

### Cost Optimization
- **Use efficient models**: `gpt-oss-20b` via Groq for retain — Hindsight doesn't need frontier models
- **Enable provider Batch API**: Set `HINDSIGHT_API_RETAIN_BATCH_ENABLED=true` with async retain to cut LLM fact-extraction costs by 50% (supported on OpenAI and Groq; results delivered within 24 hours)
- **Control token budgets**: Limit `max_tokens` for recall, use lower budgets when possible
- **Optimize chunks**: Larger chunks (1000-2000 tokens) are more efficient than many small ones

### Monitoring
- **Prometheus metrics**: Available at `/metrics` — track latency percentiles, throughput, and error rates
- **Key metrics**: `hindsight_recall_duration_seconds`, `hindsight_reflect_duration_seconds`, `hindsight_retain_items_total`


---


## File: developer/reflect.mdx

# Reflect: Agentic Reasoning with Disposition

When you call `reflect()`, Hindsight runs an **agentic loop** that autonomously gathers evidence and reasons through the lens of the bank's disposition to generate contextual responses.

```mermaid
graph TB
    subgraph agent["Reflect Agent Loop"]
        A[Query] --> B{Need more info?}
        B -->|Yes| C[Call Tools]
        C --> D[search_mental_models]
        C --> E[search_observations]
        C --> F[recall]
        C --> G[expand]
        D --> B
        E --> B
        F --> B
        G --> B
        B -->|No| H[Generate Response]
    end
    H --> I[Response + Citations]
```

---

## How It Works

Unlike simple retrieval, reflect is an **agentic system** that:

1. **Autonomously gathers evidence** — The agent decides what information it needs and calls appropriate tools
2. **Uses hierarchical retrieval** — Checks mental models first, then observations, then raw facts
3. **Applies disposition** — Shapes reasoning based on the bank's personality traits
4. **Enforces directives** — Hard rules that must be followed in all responses
5. **Cites sources** — Returns which memories and observations were used

### The Agentic Loop

The reflect agent runs in a loop with access to these tools:

| Tool | Purpose | Priority |
|------|---------|----------|
| `search_mental_models` | User-curated summaries | Highest (check first) |
| `search_observations` | Consolidated knowledge | High |
| `recall` | Raw facts (ground truth) | Fallback |
| `expand` | Get more context for a memory | As needed |
| `done` | Complete with final answer | When ready |

The agent:
- **Must gather evidence** before answering (guardrail prevents empty responses)
- **Runs up to 10 iterations** to find relevant information
- **Validates citations** — only IDs that were actually retrieved can be cited
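
The three rules above can be sketched as a bounded loop. This is a simplified model, not the actual reflect agent: the `decide` callable stands in for the LLM, the tool signature is hypothetical, and returning an empty result after the last iteration is an assumption.

```python
MAX_ITERATIONS = 10


def run_reflect_loop(query, decide, tools):
    """decide(query, evidence) returns (tool_name, arg) or ("done", (text, cited_ids))."""
    evidence = {}
    for _ in range(MAX_ITERATIONS):
        action, payload = decide(query, evidence)
        if action == "done":
            if not evidence:
                continue  # guardrail: no answer before any evidence is gathered
            text, cited_ids = payload
            # Only ids that were actually retrieved may be cited.
            citations = [item_id for item_id in cited_ids if item_id in evidence]
            return {"text": text, "citations": citations}
        for item_id, item_text in tools[action](payload):
            evidence[item_id] = item_text
    return {"text": None, "citations": []}  # assumption: give up when the budget runs out


def make_decider():
    """Scripted stand-in for the LLM: tries to finish early, then behaves."""
    state = {"calls": 0}

    def decide(query, evidence):
        state["calls"] += 1
        if state["calls"] == 1:
            return "done", ("premature answer", [])
        if not evidence:
            return "recall", query
        return "done", ("Alice has ML experience.", ["mem-123", "mem-999"])

    return decide


tools = {"recall": lambda q: [("mem-123", "Alice has 5 years of ML experience")]}
result = run_reflect_loop("What do we know about Alice?", make_decider(), tools)
```

The premature answer is rejected because no evidence exists yet, and `mem-999` is dropped from the citations because it was never retrieved.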

### Hierarchical Retrieval Strategy

The agent consults its sources in priority order:

1. **[Mental Models](/developer/api/mental-models)** — User-curated summaries you've pre-computed for common queries
2. **[Observations](/developer/observations)** — Consolidated knowledge with freshness awareness
3. **Raw Facts** — Ground truth for verification when observations are stale

**Mental models** are saved reflect responses that you create for frequently asked questions. They're checked first because they represent explicitly curated knowledge. See the [Mental Models API](/developer/api/mental-models) for how to create and manage them.

If an observation is marked as **stale**, the agent automatically verifies it against current facts.

---

## Why Reflect?

Most AI systems can retrieve facts, but they can't **reason** about them in a consistent way.

### The Problem

Without reflect:
- **No consistent character**: Same question gets different answers each time
- **No knowledge synthesis**: System never connects related facts
- **No reasoning context**: Responses don't reflect accumulated knowledge
- **Generic responses**: Every AI sounds the same

### The Value

With reflect:
- **Consistent character**: A "detail-oriented, cautious" bank emphasizes risks and thorough planning
- **Evolving knowledge**: Observations strengthen and adapt as evidence accumulates
- **Contextual reasoning**: "Based on what I know about your team's remote work success..."
- **Differentiated behavior**: Support bots sound diplomatic, code reviewers sound direct

### When to Use Reflect

| Use `recall()` when... | Use `reflect()` when... |
|------------------------|-------------------------|
| You need raw facts | You need reasoned interpretation |
| You're building your own reasoning | You want disposition-consistent responses |
| You need maximum control | You want the bank to "think" for itself |
| Simple fact lookup | Forming recommendations |

**Example:**
- `recall("Alice")` → Returns all Alice facts and relevant mental models
- `reflect("Should we hire Alice?")` → Agent gathers evidence about Alice, reasons about fit, returns answer with citations

---

## Disposition Traits

When you create a memory bank, you can configure its disposition using three traits. These traits influence how the bank interprets information and reasons during `reflect()`:

| Trait | Scale | Low (1) | High (5) |
|-------|-------|---------|----------|
| **Skepticism** | 1-5 | Trusting, accepts information at face value | Skeptical, questions and doubts claims |
| **Literalism** | 1-5 | Flexible interpretation, reads between the lines | Literal interpretation, takes things at face value |
| **Empathy** | 1-5 | Detached, focuses on facts | Empathetic, considers emotional context |

### Mission: Natural Language Identity

Beyond numeric traits, you can provide a natural language **mission** that describes the bank's identity and reasoning context. The reflect mission frames how the agent reasons and responds:
- Provides identity context: who the agent is and what it cares about
- Shapes how disposition traits are applied in practice
- Keeps reasoning consistent across conversations

:::info Per-operation missions
The reflect mission only affects `reflect()`. To steer what gets extracted during `retain()`, use [`retain_mission`](/developer/api/memory-banks#retain-configuration). To control what gets synthesised into observations, use [`observations_mission`](/developer/api/memory-banks#observations-configuration).
:::

---

## Disposition Shapes Reasoning

Two banks with different dispositions, given identical facts about remote work:

**Bank A** (low skepticism, high empathy):
> "Remote work enables flexibility and work-life balance. The team seems happier and more productive when they can choose their environment."

**Bank B** (high skepticism, low empathy):
> "Remote work claims need verification. What are the actual productivity metrics? The anecdotal benefits may not translate to measurable outcomes."

**Same facts → Different conclusions** because disposition shapes interpretation.

---

## Disposition Presets by Use Case

Different use cases benefit from different disposition configurations:

| Use Case | Recommended Traits | Why |
|----------|-------------------|-----|
| **Customer Support** | skepticism: 2, literalism: 2, empathy: 5 | Trusting, flexible, understanding |
| **Code Review** | skepticism: 4, literalism: 5, empathy: 2 | Questions assumptions, precise, direct |
| **Legal Analysis** | skepticism: 5, literalism: 5, empathy: 2 | Highly skeptical, exact interpretation |
| **Therapist/Coach** | skepticism: 2, literalism: 2, empathy: 5 | Supportive, reads between lines |
| **Research Assistant** | skepticism: 4, literalism: 3, empathy: 3 | Questions claims, balanced interpretation |
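
If you keep presets in application code, a small validated structure helps catch out-of-range values early. The `Disposition` class and `PRESETS` mapping below are hypothetical helpers, not part of the Hindsight client; the values are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disposition:
    """Three disposition traits, each on the documented 1-5 scale."""

    skepticism: int
    literalism: int
    empathy: int

    def __post_init__(self):
        for name in ("skepticism", "literalism", "empathy"):
            value = getattr(self, name)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")


PRESETS = {
    "customer_support": Disposition(skepticism=2, literalism=2, empathy=5),
    "code_review": Disposition(skepticism=4, literalism=5, empathy=2),
    "legal_analysis": Disposition(skepticism=5, literalism=5, empathy=2),
    "therapist_coach": Disposition(skepticism=2, literalism=2, empathy=5),
    "research_assistant": Disposition(skepticism=4, literalism=3, empathy=3),
}

try:
    Disposition(skepticism=0, literalism=3, empathy=3)
    rejected = False
except ValueError:
    rejected = True
```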

---

## Directives: Hard Rules

While disposition traits *influence* reasoning style, **directives** are hard rules that the agent *must* follow. Directives are injected into the prompt and enforced in every response.

### When to Use Directives

Use directives for constraints that must never be violated:

- **Compliance rules**: "Never recommend specific stocks or financial products"
- **Privacy constraints**: "Never share personal data with third parties"
- **Style requirements**: "Always respond in formal English"
- **Domain guardrails**: "Always cite sources when making factual claims"

### Directives vs Disposition

| Aspect | Disposition | Directives |
|--------|-------------|------------|
| **Nature** | Soft influence | Hard rules |
| **Effect** | Shapes interpretation and tone | Must be followed exactly |
| **Violation** | Acceptable (it's a tendency) | Not acceptable |
| **Example** | High skepticism → questions claims | "Never make medical diagnoses" |

:::tip
Use disposition for personality and character. Use directives for compliance and guardrails.
:::

See [Memory Banks: Directives](/developer/api/memory-banks#directives) for how to create and manage directives.

---

## What You Get from Reflect

When you call `reflect()`:

**Returns:**
- **Response text** — Disposition-influenced answer from the agent
- **based_on** — Evidence used: memories, mental models, and directives that grounded the response
- **trace** — Tool calls, LLM calls, and observations accessed (when `include.tool_calls=True`)
- **structured_output** — Parsed response if `response_schema` was provided
- **usage** — Token usage metrics

**Example:**
```json
{
  "text": "Based on Alice's ML expertise and her work at Google, she'd be an excellent fit for the research team lead position...",
  "based_on": {
    "memories": [
      {"id": "mem-123", "text": "Alice has 5 years of ML experience", "type": "world"},
      {"id": "mem-456", "text": "Alice worked at Google on search ranking", "type": "experience"}
    ],
    "mental_models": [],
    "directives": [
      {"id": "dir-001", "name": "Formal Language", "rules": ["Always respond in formal English"]}
    ]
  },
  "usage": {"input_tokens": 1500, "output_tokens": 500, "total_tokens": 2000}
}
```

The agent automatically gathers evidence, validates citations, and generates a grounded response.
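
A response of this shape is easy to post-process, for example to log which evidence backed an answer. The sketch below parses an abbreviated copy of the example above; `summarize_evidence` is a hypothetical helper.

```python
import json

RESPONSE_JSON = """
{
  "text": "Based on Alice's ML expertise and her work at Google...",
  "based_on": {
    "memories": [
      {"id": "mem-123", "text": "Alice has 5 years of ML experience", "type": "world"},
      {"id": "mem-456", "text": "Alice worked at Google on search ranking", "type": "experience"}
    ],
    "mental_models": [],
    "directives": [
      {"id": "dir-001", "name": "Formal Language", "rules": ["Always respond in formal English"]}
    ]
  },
  "usage": {"input_tokens": 1500, "output_tokens": 500, "total_tokens": 2000}
}
"""


def summarize_evidence(response: dict) -> dict:
    """Collect the ids and names of the evidence that grounded a reflect answer."""
    based_on = response.get("based_on", {})
    memories = based_on.get("memories", [])
    return {
        "memory_ids": [memory["id"] for memory in memories],
        "memory_types": sorted({memory["type"] for memory in memories}),
        "directive_names": [d["name"] for d in based_on.get("directives", [])],
        "total_tokens": response.get("usage", {}).get("total_tokens"),
    }


summary = summarize_evidence(json.loads(RESPONSE_JSON))
```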

---

## Why Disposition Matters

Without disposition, all AI assistants sound the same. With disposition:

- **Customer support bots** can be diplomatic and empathetic
- **Code review assistants** can be direct and thorough
- **Creative assistants** can be open to unconventional ideas
- **Risk analysts** can be appropriately cautious

Disposition creates **consistent character** across conversations while observations **evolve with evidence**.

---

## Next Steps

- [**Observations**](./observations) — How knowledge is consolidated
- [**Retain**](./retain) — How rich facts are stored
- [**Recall**](./retrieval) — How multi-strategy search works
- [**Reflect API**](./api/reflect) — Code examples and parameters


---


## File: developer/services.md

# Services

Hindsight consists of three services that can run together or separately depending on your deployment needs.

## API Service

The core memory engine. Handles all memory operations:

- **Retain**: Ingests content, extracts facts, builds knowledge graph
- **Recall**: Semantic search across memories
- **Reflect**: Disposition-aware answer generation

```bash
hindsight-api        # Default port: 8888
```

The API service is stateless and can be horizontally scaled behind a load balancer. All state is stored in PostgreSQL.

By default, the API also processes background tasks (mental model consolidation) internally. For high-throughput deployments, you can disable this and run dedicated workers instead.

## Worker Service

Dedicated task processor for background operations. Uses the **same package and Docker image** as the API service, just with a different entry point.

```bash
hindsight-worker     # Default metrics port: 8889
```

Workers use PostgreSQL as a task broker, polling for pending tasks. Multiple workers can run simultaneously without conflicts.

| Deployment | Internal Worker | Dedicated Workers |
|------------|-----------------|-------------------|
| **Development** | ✅ Simple, all-in-one | ❌ Overkill |
| **Small production** | ✅ Less infrastructure | ❌ Overkill |
| **High throughput** | ❌ API bottleneck | ✅ Scale independently |
| **Long-running tasks** | ❌ Blocks API resources | ✅ Isolated processing |

To use dedicated workers, disable the internal worker in the API and start worker processes:

```bash
# Disable internal worker in API
HINDSIGHT_API_WORKER_ENABLED=false hindsight-api

# Start dedicated workers (run multiple instances)
hindsight-worker --worker-id worker-1
hindsight-worker --worker-id worker-2
```

Each worker exposes `/health` and `/metrics` endpoints for monitoring.

Before scaling down or removing workers, release their tasks with `hindsight-admin decommission-worker <worker-id>`.

See [Configuration - Distributed Workers](./configuration#distributed-workers) for all worker settings and [Installation - Helm](./installation#distributed-workers) for Kubernetes deployment.

## Control Plane

Web UI for managing and exploring your memory banks:

- Browse agents and memory banks
- Explore entities and relationships
- View ingestion history and operations
- Test recall queries interactively

The Control Plane connects to the API service and provides a visual interface for development and debugging.

For bare metal deployments, you can run the Control Plane standalone using npx. See [Installation - Bare Metal](./installation#control-plane) for details.


---


## File: developer/storage.md

# Storage

Hindsight uses PostgreSQL as its primary storage backend, with Oracle AI Database available as an alternative for enterprise deployments.

## Why PostgreSQL?

PostgreSQL provides all capabilities required for a semantic memory system in a single database:

| Capability | Implementation |
|------------|----------------|
| Vector search | pgvector extension with HNSW indexes |
| Full-text search | Built-in tsvector with GIN indexes |
| Relational data | Native PostgreSQL |
| JSON documents | JSONB with indexing |
| Graph queries | Recursive CTEs |
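
To illustrate the last row, here is what a graph traversal with a recursive CTE looks like. The example runs on SQLite from Python's standard library purely so it works without a server; the `links` table is hypothetical and is not Hindsight's schema. The `WITH RECURSIVE` form shown is also valid in PostgreSQL, apart from the `?` placeholders, which are SQLite's parameter style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE links (src TEXT, dst TEXT);
    INSERT INTO links VALUES
        ('alice', 'google'),
        ('google', 'search-ranking'),
        ('bob', 'playlist');
    """
)

# Walk outgoing links from a start entity, up to a maximum depth.
REACHABLE_SQL = """
WITH RECURSIVE reachable(entity, depth) AS (
    SELECT ?, 0
    UNION
    SELECT links.dst, reachable.depth + 1
    FROM links
    JOIN reachable ON links.src = reachable.entity
    WHERE reachable.depth < ?
)
SELECT entity, depth FROM reachable ORDER BY depth, entity;
"""

rows = conn.execute(REACHABLE_SQL, ("alice", 3)).fetchall()
```

The depth limit keeps the traversal bounded even if the graph contains cycles.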

### Reduced System Dependencies

Building exclusively for PostgreSQL simplifies deployment and operations:

- Single connection string to configure
- Single backup and restore strategy
- Single monitoring target
- ACID transactions across all data types
- Single upgrade path

### No Storage Abstraction

Hindsight does not abstract storage behind a generic interface. This is a deliberate trade-off.

We believe PostgreSQL is becoming the standard database API. Its popularity, extension ecosystem, and modularity mean that PostgreSQL-compatible interfaces are appearing everywhere—from serverless offerings to distributed databases. Building for PostgreSQL today means compatibility with a growing ecosystem tomorrow.

Supporting multiple databases would increase flexibility but conflict with our core goals: Hindsight is fully open source and designed to be as simple as possible to run and use. Adding database abstractions introduces complexity in code, testing, documentation, and operations—complexity that we pass on to users.

By building on PostgreSQL, we keep the system simple:
- One set of deployment instructions
- One set of performance characteristics to understand
- One codebase optimized for one backend
- No configuration decisions about which database to use

### Oracle AI Database Support

For enterprise deployments, Hindsight also supports Oracle AI Database with full feature parity. All memory operations—retain, recall, and reflect—work identically on Oracle, making it a drop-in option for organizations that standardize on Oracle infrastructure.

## Development with pg0

For local development, Hindsight uses **[pg0](https://github.com/vectorize-io/pg0)**—an embedded PostgreSQL distribution.

### What is pg0?

pg0 is a single binary containing:
- PostgreSQL server
- pgvector extension (pre-installed)
- Automatic initialization

### Behavior

When no `HINDSIGHT_API_DATABASE_URL` is configured, Hindsight:
1. Starts an embedded PostgreSQL instance on port 5555
2. Initializes the schema
3. Stores data in `~/.hindsight/pg0/`
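
The selection rule can be summarised in a few lines. This is a sketch of the documented behaviour, not the startup code; `resolve_database` and the shape of its return value are hypothetical.

```python
import os
from pathlib import Path


def resolve_database(env=None) -> dict:
    """Pick an external database when a URL is configured, else embedded pg0."""
    env = os.environ if env is None else env
    url = env.get("HINDSIGHT_API_DATABASE_URL")
    if url:
        return {"mode": "external", "url": url}
    return {
        "mode": "embedded",
        "port": 5555,
        "data_dir": str(Path.home() / ".hindsight" / "pg0"),
    }


embedded = resolve_database({})
external = resolve_database(
    {"HINDSIGHT_API_DATABASE_URL": "postgresql://user:pass@db:5432/hindsight"}
)
```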

### Environments

| Environment | Database | Configuration |
|-------------|----------|---------------|
| Development | pg0 (embedded) | Automatic |
| Production | PostgreSQL 15+ | `HINDSIGHT_API_DATABASE_URL` environment variable |

## Requirements

- PostgreSQL 15 or later
- pgvector 0.5.0 or later

Any PostgreSQL instance that satisfies these requirements should work. If you encounter issues with a specific setup, [open a GitHub issue](https://github.com/hindsight-ai/hindsight/issues).

### Tested Managed Services

- AWS RDS (PostgreSQL 15+)
- Google Cloud SQL
- Azure Database for PostgreSQL
- Supabase
- Neon


---


## File: sdks/embed.md

# Daemon CLI (hindsight-embed)

Zero-configuration local memory system with automatic daemon management. Perfect for development, prototyping, and single-user applications.

## Overview

`hindsight-embed` is a zero-configuration SDK that wraps the Hindsight API and PostgreSQL database into a single auto-managed local daemon. It's designed for development, prototyping, and single-user applications where you want memory capabilities without infrastructure overhead.

**How it works:**

1. **First command triggers startup**: When you run any `hindsight-embed` command, it checks if a local daemon is running
2. **Auto-daemon management**: If no daemon exists, it automatically spawns `hindsight-api --daemon` in the background
3. **Embedded database**: The daemon uses `pg0` (embedded PostgreSQL) — no separate database installation required
4. **Command forwarding**: Your command is forwarded to the local daemon via HTTP (localhost:8888)
5. **Auto-shutdown**: After 5 minutes of inactivity (configurable), the daemon gracefully shuts down to free resources

**Key features:**

- **Zero setup** — One `configure` command and you're ready
- **Automatic lifecycle** — Daemon starts on-demand, stops when idle
- **Isolated storage** — Each profile gets its own embedded PostgreSQL database
- **Local-only** — Binds to `127.0.0.1:8888`, not accessible from network
- **Production-grade engine** — Uses the same memory engine as the full API service

Think of it as SQLite for long-term memory — all the power of Hindsight without managing servers.

## Installation

Run via `uvx` (recommended; always uses the latest version), or install persistently with `pipx`:

```bash
# Run directly without installation
uvx hindsight-embed@latest configure

# Or use pipx for persistent installation
pipx install hindsight-embed
```

## Quick Start

### 1. Configure

```bash
# Interactive configuration
hindsight-embed configure

# Or non-interactive via environment variables
export HINDSIGHT_API_LLM_PROVIDER=openai
export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=gpt-4o-mini
hindsight-embed configure
```

Configuration is saved to `~/.hindsight/embed`:

```bash
HINDSIGHT_API_LLM_PROVIDER=openai
HINDSIGHT_API_LLM_MODEL=gpt-4o-mini
HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx

# Daemon settings (macOS: force CPU to avoid MPS/XPC issues)
HINDSIGHT_API_EMBEDDINGS_LOCAL_FORCE_CPU=1
HINDSIGHT_API_RERANKER_LOCAL_FORCE_CPU=1
```
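If you need to read this file from your own tooling, a small parser is enough. The following is a minimal sketch, assuming the file only contains `KEY=VALUE` lines, blank lines, and `#` comments as in the sample above; `parse_env_file` and `load_embed_config` are hypothetical helpers, not part of `hindsight-embed`.

```python
from pathlib import Path
from typing import Dict


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            settings[key.strip()] = value.strip()
    return settings


def load_embed_config(path: Path = Path.home() / ".hindsight" / "embed") -> Dict[str, str]:
    """Read the saved configuration, or return an empty dict if it is absent."""
    return parse_env_file(path.read_text()) if path.exists() else {}
```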

### 2. Use Memory Operations

```bash
# Store a memory
hindsight-embed memory retain default "User prefers dark mode"

# Query memories
hindsight-embed memory recall default "user preferences"

# Reasoning with memory
hindsight-embed memory reflect default "What color scheme should I use?"
```

The daemon starts automatically on first use!

### 3. Open the Control Center (optional)

Use the local control center when you want a browser-based configuration wizard and daemon supervisor:

```bash
# Launch the control center and open the browser
hindsight-embed control start

# Or use the browser wizard instead of terminal prompts during setup
hindsight-embed configure --ui
```

The control center listens on localhost only (`http://localhost:7878` by default) and prints a tokenized URL. It stores the local access token at `~/.hindsight/control.token` and writes logs to `~/.hindsight/control.log`.

```bash
# Pick a different control-center port for this launch
hindsight-embed control start --port 7879

# Start without opening a browser automatically
hindsight-embed control start --no-open

# Check, inspect, or stop the control center
hindsight-embed control status
hindsight-embed control logs -f
hindsight-embed control stop
```

The control center runs as a separate process from the memory daemon. Stopping or restarting the control center does not stop a running daemon.

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_LLM_API_KEY` | **Required**. API key for LLM provider | - |
| `HINDSIGHT_API_LLM_PROVIDER` | LLM provider: `openai`, `anthropic`, `gemini`, `groq`, `minimax`, `ollama` | `openai` |
| `HINDSIGHT_API_LLM_MODEL` | Model name | `gpt-4o-mini` |
| `HINDSIGHT_EMBED_DAEMON_IDLE_TIMEOUT` | Seconds before daemon auto-exits when idle (0 = never) | `300` |
| `HINDSIGHT_EMBED_CONTROL_PORT` | Default port for `hindsight-embed control start` | `7878` |

**Provider Examples:**

```bash
# OpenAI
export HINDSIGHT_API_LLM_PROVIDER=openai
export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=gpt-4o

# Groq (fast inference)
export HINDSIGHT_API_LLM_PROVIDER=groq
export HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=llama-3.3-70b-versatile

# Anthropic
export HINDSIGHT_API_LLM_PROVIDER=anthropic
export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=claude-sonnet-4-20250514
```

## Daemon Management

### Idle Timeout

Customize how long the daemon stays alive when idle:

```bash
# Never timeout (daemon runs until manually stopped)
export HINDSIGHT_EMBED_DAEMON_IDLE_TIMEOUT=0

# Shorter timeout: 1 minute
export HINDSIGHT_EMBED_DAEMON_IDLE_TIMEOUT=60

# Longer timeout: 30 minutes
export HINDSIGHT_EMBED_DAEMON_IDLE_TIMEOUT=1800
```
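The semantics of this setting, including the special value `0`, can be illustrated with a small tracker. This is a sketch of the general idle-timeout technique, not the daemon's actual implementation; the `IdleTracker` class is a hypothetical name.

```python
import time
from typing import Optional


class IdleTracker:
    """Tracks the last activity time and decides when an idle exit is due."""

    def __init__(self, timeout_seconds: int):
        self.timeout_seconds = timeout_seconds
        self.last_activity = time.monotonic()

    def touch(self, now: Optional[float] = None) -> None:
        """Record activity (call this on every handled request)."""
        self.last_activity = time.monotonic() if now is None else now

    def should_exit(self, now: Optional[float] = None) -> bool:
        """True once the idle period has reached the timeout; 0 means never."""
        if self.timeout_seconds <= 0:
            return False
        current = time.monotonic() if now is None else now
        return current - self.last_activity >= self.timeout_seconds
```

The optional `now` argument exists only so the logic can be exercised with fixed timestamps instead of real waiting.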

### Daemon Commands

```bash
# Check daemon status
hindsight-embed daemon status

# View daemon logs in real-time
hindsight-embed daemon logs -f

# Stop daemon manually
hindsight-embed daemon stop
```

### Control Center Commands

```bash
# Start or reuse the local browser control center
hindsight-embed control start

# Check whether it is running
hindsight-embed control status

# View control-center logs
hindsight-embed control logs -f

# Stop the control-center process
hindsight-embed control stop
```

## Commands

All memory operations follow the same interface as the CLI:

### Retain (Store Memory)

```bash
hindsight-embed memory retain <bank_id> "content"

# With context
hindsight-embed memory retain <bank_id> "content" --context "source information"

# Background processing
hindsight-embed memory retain <bank_id> "content" --async
```

### Recall (Search)

```bash
hindsight-embed memory recall <bank_id> "query"

# With budget control
hindsight-embed memory recall <bank_id> "query" --budget high

# Show trace
hindsight-embed memory recall <bank_id> "query" --trace
```

### Reflect (Generate Response)

```bash
hindsight-embed memory reflect <bank_id> "prompt"

# With additional context
hindsight-embed memory reflect <bank_id> "prompt" --context "additional info"
```
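When scripting these commands from Python, it helps to build the argument list in one place and hand it to `subprocess.run`. The helper below is a hypothetical convenience wrapper, not an official API; it only uses the subcommands and flags shown above and does not check which flag belongs to which operation.

```python
import subprocess
from typing import List, Optional


def memory_command(
    operation: str,
    bank_id: str,
    text: str,
    *,
    context: Optional[str] = None,
    budget: Optional[str] = None,
    trace: bool = False,
    run_async: bool = False,
) -> List[str]:
    """Build the argv for `hindsight-embed memory <operation> ...`."""
    if operation not in {"retain", "recall", "reflect"}:
        raise ValueError(f"unknown memory operation: {operation}")
    argv = ["hindsight-embed", "memory", operation, bank_id, text]
    if context is not None:
        argv += ["--context", context]
    if budget is not None:
        argv += ["--budget", budget]
    if trace:
        argv.append("--trace")
    if run_async:
        argv.append("--async")
    return argv


def run_memory_command(argv: List[str]) -> str:
    """Run the command and return its stdout; raises if the CLI exits non-zero."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout
```

Passing a list rather than a shell string avoids quoting problems when the memory content contains spaces or quotes.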

### Bank Management

```bash
# List all banks
hindsight-embed bank list

# View bank stats
hindsight-embed bank stats <bank_id>

# Set bank name
hindsight-embed bank name <bank_id> "My Assistant"

# Set bank mission
hindsight-embed bank mission <bank_id> "I am a helpful AI assistant"
```

## Troubleshooting

### Daemon Won't Start

Check the daemon logs:

```bash
hindsight-embed daemon logs
# Or watch in real-time
hindsight-embed daemon logs -f
```

Common issues:
- **Missing API key**: Set `HINDSIGHT_API_LLM_API_KEY`
- **Port conflict**: Another service using port 8888
- **Permissions**: Check `~/.hindsight/` directory permissions
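To confirm a port conflict quickly, you can test whether something is already accepting connections on port 8888. This is a generic standard-library check, not a Hindsight command; `port_in_use` is a hypothetical helper name.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("port 8888 in use:", port_in_use(8888))
```

A `True` result while the daemon is stopped suggests another service holds the port.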

### Daemon Exits Immediately

Check if you have the idle timeout set too low:

```bash
# Disable idle timeout for debugging
export HINDSIGHT_EMBED_DAEMON_IDLE_TIMEOUT=0
hindsight-embed daemon status
```

### Reset Configuration

```bash
# Remove config file and reconfigure
rm ~/.hindsight/embed
hindsight-embed configure
```

## When to Use

**Perfect for:**
- Development and prototyping
- Single-user applications
- Local-first tools
- Quick experiments with Hindsight

**Not suitable for:**
- Production multi-user deployments
- Network-accessible services
- High-availability requirements
- Multi-tenant applications

For production deployments, use the [API Service](/developer/services) with external PostgreSQL instead.


---


## File: sdks/go.md

# Go Client

Official Go client for the Hindsight API, generated from the OpenAPI 3.1 spec using [OpenAPI Generator](https://github.com/OpenAPITools/openapi-generator).


## Installation

```bash
go get github.com/vectorize-io/hindsight/hindsight-clients/go
```

Requires Go 1.23+.

## Quick Start


## API Structure

The Go client provides access to all Hindsight API operations through structured namespaces:

- **`client.MemoryAPI`** - Retain, recall, reflect operations
- **`client.BanksAPI`** - Bank management
- **`client.DirectivesAPI`** - Directive management
- **`client.MentalModelsAPI`** - Mental model management
- **`client.DocumentsAPI`** - Document operations
- **`client.EntitiesAPI`** - Entity operations
- **`client.OperationsAPI`** - Async operation monitoring

## Working with Nullable Fields

The Go client uses `NullableString`, `NullableTime`, and similar types for optional fields.


## Error Handling


## More Examples

For detailed examples of all operations, see:
- [Python SDK documentation](./python.md) - API concepts are the same
- [Node.js SDK documentation](./nodejs.md) - API concepts are the same
- [OpenAPI specification](https://hindsight.dev/openapi.json) - Complete API reference


---


## File: sdks/hindsight-all-npm.md

# Programmatic API (Node.js)

The `@vectorize-io/hindsight-all` npm package is the Node.js equivalent of the Python [`hindsight-all`](./hindsight-all.md) package. It lets your Node code spawn and supervise a local Hindsight daemon without deploying any server infrastructure — pair it with [`@vectorize-io/hindsight-client`](./nodejs.md) for memory operations.

The daemon runs as a **separate OS process** on `127.0.0.1` (not in your Node process). Your code talks to it over HTTP via `HindsightClient`.

This package **does not ship an HTTP client** — it only owns the server process. Once the daemon is running, talk to it with [`@vectorize-io/hindsight-client`](./nodejs.md) against `server.getBaseUrl()`. The two packages compose: one owns the process, the other owns the API surface.

## How it works

1. `server.start()` resolves the underlying `hindsight-embed` command (via `uvx` from PyPI, or `uv run --directory <path>` for a local checkout).
2. Runs `profile create <name> --merge --port <port> [--env KEY=VALUE ...]` with every entry from `options.env` forwarded as `--env`.
3. Runs `daemon --profile <name> start`.
4. Polls `http://host:port/health` until it returns `200` or the `readyTimeoutMs` budget is exhausted.
5. `server.stop()` runs `daemon --profile <name> stop`.
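The readiness wait in step 4 is a poll-with-deadline loop. The sketch below shows the idea in Python for brevity; it is not the package's source. The function names are hypothetical, and the defaults mirror the `readyTimeoutMs` and `readyPollIntervalMs` values listed in the options table.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(base_url: str, timeout_s: float = 2.0) -> bool:
    """One-shot GET {base_url}/health; True only on HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    ready_timeout_ms: int = 30000,
    poll_interval_ms: int = 1000,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() repeatedly until it succeeds or the time budget runs out."""
    deadline = clock() + ready_timeout_ms / 1000
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_ms / 1000)
```

Typical use would be `wait_until_ready(lambda: probe_health("http://127.0.0.1:9077"))`; the injectable `clock` and `sleep` let the loop be exercised without real waiting.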

The server is intentionally transparent: new daemon env vars or CLI flags never require a wrapper release — pass them through `env`, `extraProfileCreateArgs`, or `extraDaemonStartArgs`.

## Requirements

- **Node.js ≥ 22** — uses global `fetch` and `AbortSignal.timeout`.
- **`uv` / `uvx`** on `PATH` — used to download and run the Hindsight daemon. Install via [docs.astral.sh/uv](https://docs.astral.sh/uv/).

## Install

```bash
npm install @vectorize-io/hindsight-all @vectorize-io/hindsight-client
```

## Example

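```ts
import { HindsightServer, consoleLogger } from '@vectorize-io/hindsight-all';
import { HindsightClient } from '@vectorize-io/hindsight-client';

const server = new HindsightServer({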
  profile: 'my-app',
  port: 9077,
  env: {
    HINDSIGHT_API_LLM_PROVIDER: 'anthropic',
    HINDSIGHT_API_LLM_API_KEY: process.env.ANTHROPIC_API_KEY,
    HINDSIGHT_API_LLM_MODEL: 'claude-sonnet-4-20250514',
    HINDSIGHT_EMBED_DAEMON_IDLE_TIMEOUT: '0',
  },
  logger: consoleLogger,
});

await server.start();

const client = new HindsightClient({ baseUrl: server.getBaseUrl() });
await client.retain('user-123', 'User prefers dark mode.');
const recall = await client.recall('user-123', 'what are the user preferences?');

await server.stop();
```

For a remote Hindsight API, skip the server entirely and point `HindsightClient` directly at the remote URL.

## `HindsightServerOptions`

| Option | Type | Default | Description |
|---|---|---|---|
| `profile` | `string` | `"default"` | Profile name passed to `--profile` on every sub-command. |
| `port` | `number` | `8888` | TCP port the daemon listens on. |
| `host` | `string` | `"127.0.0.1"` | Hostname the daemon binds to (used for health checks). |
| `embedVersion` | `string` | `"latest"` | Version of the underlying `hindsight-embed` package to run via `uvx`. |
| `embedPackagePath` | `string` | — | Local checkout path — takes precedence over `embedVersion`. Uses `uv run --directory` instead of `uvx`. |
| `env` | `Record<string, string \| undefined>` | `{}` | Environment variables passed to the daemon process **and** written into the profile config via `--env KEY=VALUE`. The preferred way to surface any `HINDSIGHT_API_*` / `HINDSIGHT_EMBED_*` setting. |
| `extraProfileCreateArgs` | `string[]` | `[]` | Extra args appended verbatim to `profile create`. |
| `extraDaemonStartArgs` | `string[]` | `[]` | Extra args appended verbatim to `daemon start`. |
| `platformCpuWorkaround` | `boolean` | `true` on macOS | Auto-set `HINDSIGHT_API_EMBEDDINGS_LOCAL_FORCE_CPU=1` and `HINDSIGHT_API_RERANKER_LOCAL_FORCE_CPU=1` to avoid Metal/MPS crashes. Caller-supplied `env` values win over the auto-applied ones. |
| `readyTimeoutMs` | `number` | `30000` | Max time to wait for `/health` to return 200. |
| `readyPollIntervalMs` | `number` | `1000` | Polling interval while waiting for `/health`. |
| `logger` | `Logger` | silent | Pluggable logger (`debug`/`info`/`warn`/`error`). `consoleLogger` and `silentLogger` helpers are exported. |
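The interaction between `env` and `platformCpuWorkaround` is a merge in which caller-supplied values win. The Python sketch below illustrates that precedence and how the result could be expanded into `profile create` arguments. It is not the package's code: the helper names are hypothetical, and skipping unset (`None`) entries is an assumption about how `undefined` values are treated.

```python
import sys
from typing import Dict, List, Mapping, Optional

# Values the workaround applies automatically (from the options table).
CPU_WORKAROUND = {
    "HINDSIGHT_API_EMBEDDINGS_LOCAL_FORCE_CPU": "1",
    "HINDSIGHT_API_RERANKER_LOCAL_FORCE_CPU": "1",
}


def effective_env(
    user_env: Mapping[str, Optional[str]],
    cpu_workaround: Optional[bool] = None,
    platform: str = sys.platform,
) -> Dict[str, str]:
    """Merge auto-applied defaults with caller values; caller values win."""
    if cpu_workaround is None:
        cpu_workaround = platform == "darwin"  # default: on for macOS only
    merged: Dict[str, str] = dict(CPU_WORKAROUND) if cpu_workaround else {}
    for key, value in user_env.items():
        if value is not None:  # assumption: unset entries are dropped
            merged[key] = value
    return merged


def profile_create_args(profile: str, port: int, env: Mapping[str, str]) -> List[str]:
    """Build `profile create <name> --merge --port <port> --env KEY=VALUE ...`."""
    args = ["profile", "create", profile, "--merge", "--port", str(port)]
    for key, value in env.items():
        args += ["--env", f"{key}={value}"]
    return args
```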

## Server methods

| Method | Returns | Description |
|---|---|---|
| `start()` | `Promise<void>` | Configure profile, spawn the daemon, wait for `/health`. Idempotent — safe to re-run. |
| `stop()` | `Promise<void>` | Stop the daemon. Never throws; logs and resolves even on failure. |
| `checkHealth()` | `Promise<boolean>` | One-shot `/health` probe with a 2 s timeout. |
| `getBaseUrl()` | `string` | `http://host:port` — pass this straight to `HindsightClient`. |
| `getProfile()` | `string` | The profile name this server operates on. |

For memory operations (retain, recall, reflect, bank management) use [`@vectorize-io/hindsight-client`](./nodejs.md).


---


## File: sdks/hindsight-all.md

# Programmatic API (Python)

The `hindsight-all` Python package lets your code spawn and manage a local Hindsight daemon without deploying any server infrastructure. It bundles the Hindsight API server, embedded PostgreSQL, and the Python client into one install — `pip install hindsight-all` and you can start a fully-functional Hindsight instance from a few lines of Python.

The daemon runs as a **separate OS process** on `127.0.0.1` (not in your Python process memory). Your code talks to it over HTTP via the bundled `HindsightClient`.

If you already have a Hindsight server running and just need a client, use [Python Client (hindsight-client)](./python.md) instead.

## How it works

`hindsight-all` exposes two primary APIs:

- **`HindsightServer`** — explicit lifecycle. Use it as a context manager when you want deterministic startup/shutdown (e.g. in tests).
- **`HindsightEmbedded`** — auto-managed. Starts a daemon on first use, reuses it across calls, shuts it down after an idle timeout. Easiest for application code that doesn't want to think about lifecycle.

Both end up talking to the same underlying daemon via the same `HindsightClient` HTTP interface — the difference is only how the server process is managed.

## Installation

```bash
pip install hindsight-all
```

The `hindsight-all` wheel bundles `hindsight-api-slim`, `hindsight-client`, and `hindsight-embed` as dependencies, so one `pip install` gets you everything.

On Intel (x86_64) Macs, install `hindsight-all-slim` instead — the full bundle's local ML models have no Intel-Mac wheels. See [Supported Platforms](../developer/installation#supported-platforms).

## `HindsightServer` — explicit lifecycle

Use `HindsightServer` as a context manager when you want the server to start immediately, run for the duration of a block, and shut down cleanly afterwards. Ideal for tests and short-lived scripts.

```python
import os

from hindsight import HindsightServer, HindsightClient

with HindsightServer(
    llm_provider="openai",
    llm_model="gpt-4o-mini",
    llm_api_key=os.environ["OPENAI_API_KEY"],
) as server:
    client = HindsightClient(base_url=server.url)

    client.retain(bank_id="my-bank", content="Alice works at Google")
    results = client.recall(bank_id="my-bank", query="What does Alice do?")
    for r in results:
        print(r.text)

    answer = client.reflect(bank_id="my-bank", query="Tell me about Alice")
    print(answer.text)
# Server is stopped here
```

## `HindsightEmbedded` — auto-managed

`HindsightEmbedded` is the simplest way to use Hindsight in Python. It automatically manages a background daemon for you — starts on first use, stays alive across calls, shuts down after an idle timeout.

```python
import os

from hindsight import HindsightEmbedded

# Server starts automatically on first call
client = HindsightEmbedded(
    profile="myapp",                        # Profile for data isolation
    llm_provider="openai",
    llm_model="gpt-4o-mini",
    llm_api_key=os.environ["OPENAI_API_KEY"],
)

# Use immediately - no manual server management needed
client.retain(bank_id="my-bank", content="Alice works at Google")
results = client.recall(bank_id="my-bank", query="What does Alice do?")

# Server continues running (auto-stops after idle timeout)
# Or explicitly stop it:
client.close(stop_daemon=True)
```

### What's a Profile?

A profile is an isolated Hindsight environment. Each profile gets its own embedded PostgreSQL database (stored in `~/.pg0/instances/hindsight-embed-{profile}/`) and its own API server. Use different profiles to separate environments (dev/prod), applications, or users.
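Because the storage location follows a fixed naming pattern, you can locate or enumerate profile data from a script. The helpers below are hypothetical and rely only on the directory layout stated above; they are not part of `hindsight-all`.

```python
from pathlib import Path
from typing import List, Optional

PREFIX = "hindsight-embed-"


def profile_data_dir(profile: str, home: Optional[Path] = None) -> Path:
    """Return the embedded database directory for a profile."""
    home = Path.home() if home is None else home
    return home / ".pg0" / "instances" / f"{PREFIX}{profile}"


def list_profiles(home: Optional[Path] = None) -> List[str]:
    """List profile names that have a data directory on disk."""
    home = Path.home() if home is None else home
    instances = home / ".pg0" / "instances"
    if not instances.is_dir():
        return []
    return sorted(
        entry.name[len(PREFIX):]
        for entry in instances.iterdir()
        if entry.is_dir() and entry.name.startswith(PREFIX)
    )
```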

### When to use which

| Use case | Pick |
|---|---|
| Tests, short-lived scripts, deterministic startup/shutdown | `HindsightServer` (context manager) |
| Long-running application, auto-start on first use, don't want to manage lifecycle | `HindsightEmbedded` |
| Existing Hindsight server running elsewhere | [`hindsight-client`](./python.md) directly |

## API namespaces

Both `HindsightEmbedded` and `HindsightClient` expose organized API namespaces for bank management, mental models, directives, and memories:

```python
import os

from hindsight import HindsightEmbedded

embedded = HindsightEmbedded(
    profile="myapp",
    llm_provider="openai",
    llm_api_key=os.environ["OPENAI_API_KEY"],
)

# Core operations
embedded.retain(bank_id="test", content="Hello")
results = embedded.recall(bank_id="test", query="Hello")

# Bank management
embedded.banks.create(bank_id="test", name="Test Bank", mission="Help users")
embedded.banks.set_mission(bank_id="test", mission="Updated mission")
embedded.banks.delete(bank_id="test")

# Mental models
embedded.mental_models.create(
    bank_id="test",
    name="User Preferences",
    content="User prefers dark mode"
)
models = embedded.mental_models.list(bank_id="test")

# Directives
embedded.directives.create(
    bank_id="test",
    name="Response Style",
    content="Be concise and friendly"
)
directives = embedded.directives.list(bank_id="test")

# List memories
memories = embedded.memories.list(bank_id="test", type="world", limit=50)
```

API namespaces ensure the daemon is running before each call, so daemon crashes are handled gracefully:

```python
# ✅ GOOD - Uses API namespace (daemon restarts handled)
embedded.banks.create(bank_id="test", name="Test")

# ❌ BAD - Direct client access (daemon crashes NOT handled)
client = embedded.client
client.create_bank(bank_id="test", name="Test")  # Fails if daemon crashed
```
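The pattern behind this behavior is a thin proxy that runs a liveness check before delegating each call. The following is a minimal sketch of that general technique, not Hindsight's implementation; `EnsuredNamespace` and the `ensure_running` callback are hypothetical names.

```python
from typing import Any, Callable


class EnsuredNamespace:
    """Wraps an object so every method call first runs ensure_running()."""

    def __init__(self, target: Any, ensure_running: Callable[[], None]):
        self._target = target
        self._ensure_running = ensure_running

    def __getattr__(self, name: str) -> Any:
        # Only reached for names not found on the proxy itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def wrapper(*args: Any, **kwargs: Any) -> Any:
            self._ensure_running()  # e.g. restart the daemon if it has died
            return attr(*args, **kwargs)

        return wrapper
```

Holding on to the raw client bypasses the wrapper, so no check runs before the call, which matches the difference between the two snippets above.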

For the full reference of retain/recall/reflect methods and their options (which work the same regardless of how you obtain the client) see the [Python Client page](./python.md).


---