# Hindsight Documentation

> Agent Memory that Works Like Human Memory

This file contains the complete Hindsight documentation for LLM consumption.
Generated: 2026-07-03T15:08:45.860870+00:00

---

## File: developer/retain.md

# Retain: How Hindsight Stores Memories

When you call `retain()`, Hindsight transforms conversations and documents into structured, searchable memories that preserve meaning and context.

## What Retain Does

```mermaid
graph LR
    A[Your Content] --> B[Extract Facts]
    B --> C[Identify Entities]
    C --> D[Build Connections]
    D --> E[Memory Bank]
```

---

## Rich Fact Extraction

Hindsight doesn't just store what was said — it captures **why**, **how**, and **what it means**.

### What Gets Captured

When you retain "Alice joined Google last spring and was thrilled about the research opportunities", Hindsight extracts:

**The core facts:**
- Alice joined Google
- This happened last spring

**The emotions and meaning:**
- She was thrilled
- It represented an important opportunity

**The reasoning:**
- She chose it for the research opportunities

This rich extraction means you can later ask "Why did Alice join Google?" and get a meaningful answer, not just "she joined Google."

### Preserving Context

Traditional systems fragment information:
- "Bob suggested Summer Vibes"
- "Alice wanted something unique"
- "They chose Beach Beats"

Hindsight preserves the full narrative:
- "Alice and Bob discussed naming their summer party playlist. Bob suggested 'Summer Vibes' because it's catchy, but Alice wanted something unique. They ultimately decided on 'Beach Beats' for its playful tone."

This means search results include the full context, not disconnected fragments.


---

## Two Types of Facts

Every fact is classified by **whose perspective it captures** — the agent that owns the bank, or the outside world:

| Type           | What it captures                                                             | Example |
|----------------|------------------------------------------------------------------------------|---------|
| **experience** | The bank's own agent acting, observing, or interacting — its first-person history | "I recommended Python to Alice" |
| **world**      | Facts about other people, places, things, and events                         | "Alice works at Google" |

The split is decided by **who is speaking**, not by grammar. A first-person statement is an `experience` only when the speaker *is* the bank's agent. The same words said by someone else are a `world` fact about that person:

- Agent's own log — "I patched the auth bug" → **experience** (the agent did it).
- A user talking to the agent — "I bought a Tesla" → **world** (a fact about the *user*, not the agent).

Two things steer this correctly:

- **Set a human-readable bank `name`** (the agent's name). It identifies who "the agent" is. If left unset it defaults to the `bank_id`; a `bank_id` that is a routing key (e.g. `my-agent::channel-456::user-789`) is not a usable speaker name, so give the bank a real name.
- **Describe the speaker in each item's `context`** when retaining transcripts or third-party content. For a chat log, a context like *"Customer Maria is speaking"* ensures her first-person statements are stored as `world` facts about Maria rather than mistaken for the agent's own experiences. The `context` takes precedence over the bank name when the two disagree.

**Note:** Observations are consolidated automatically in the background after `retain()` operations complete. This consolidation process synthesizes patterns from new facts into the bank's knowledge base.

---

## Entity Recognition

Hindsight automatically identifies and tracks **entities** — the people, organizations, and concepts that matter.

### What Gets Recognized

- **People:** "Alice", "Dr. Smith", "Bob Chen"
- **Organizations:** "Google", "MIT", "OpenAI"
- **Places:** "Paris", "Central Park", "California"
- **Products & Concepts:** "Python", "TensorFlow", "machine learning"

### Entity Resolution

The same entity mentioned different ways gets unified:

- "Alice" + "Alice Chen" + "Alice C." → one person
- "Bob" + "Robert Chen" → one person (nickname resolution)

**Why it matters:** You can ask "What do I know about Alice?" and get everything, even if she was mentioned as "Alice Chen" in some conversations.

### Context-Aware Disambiguation

If "Alice" appears with "Google" and "Stanford" multiple times, a new "Alice" mentioning those is likely the same person. Hindsight uses co-occurrence patterns to disambiguate common names.

### Entity Labels

You can define a controlled vocabulary of `key:value` classification labels (e.g. `pedagogy:scaffolding`, `engagement:active`) that are extracted at retain time and stored as entities. Because labels become entities, they automatically link related memories in the knowledge graph and improve both semantic and keyword retrieval. Labels can optionally also write to the memory unit's tags, enabling standard tag-based filtering during recall and reflect.

See [entity_labels in the bank config](/developer/api/memory-banks#entity-labels) for full configuration details.

---

## Building Connections

Memories aren't isolated — Hindsight creates a **knowledge graph** with four types of connections:

### Entity Connections

All facts mentioning the same entity are linked together.

**Enables:** "Tell me everything about Alice" → retrieves all Alice-related facts

### Time-Based Connections

Facts close in time are connected, with stronger links for closer dates.

**Enables:** "What else happened around then?" → finds contextually related events

### Meaning-Based Connections

Semantically similar facts are linked, even if they use different words.


**Enables:** "Tell me about similar topics" → finds thematically related information

### Causal Connections

Cause-effect relationships are explicitly tracked.

**Enables:** "Why did this happen?" → trace reasoning chains

**Example:** "Alice felt burned out" ← caused by ← "She worked 80-hour weeks"

---

## Understanding Time

Hindsight tracks **two temporal dimensions**:

### When It Happened

For events (meetings, trips, milestones), Hindsight records when they occurred.

- "Alice got married in June 2024" → occurred in June 2024

For general facts (preferences, characteristics), there's no specific occurrence time.

- "Alice prefers Python" → ongoing preference

### When You Learned It

Hindsight also tracks when you told it each fact.

**Why both?** Imagine in January 2025, someone tells you "Alice got married in June 2024":

- **Historical queries** work: "What did Alice do in 2024?" → finds the marriage
- **Recency ranking** works: Recent mentions get priority in search
- **Temporal reasoning** works: "What happened before her marriage?" → finds earlier events

Without this distinction, old information would either be unsearchable by date or treated as irrelevant.

---

## Tagging Memories

Tags enable visibility scoping—useful when one memory bank serves multiple users but each should only see relevant memories.

- **Item tags**: Tag individual memories with specific scopes
- **Document tags**: Apply tags to all items in a batch
- **Tag filtering**: Filter during recall/reflect by tags

See [Retain API](./api/retain) for code examples and [Recall API](./api/recall) for filtering options.

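The two timestamps from "Understanding Time" and the scoping tags described above can be pictured on a single record. This is only a sketch under stated assumptions: the `Fact` class, its field names (`occurred_at`, `mentioned_at`, `tags`), the three helper functions, the `user:alice` tag format, and the exact day values are hypothetical illustrations, not Hindsight's schema or client API.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Fact:
    """Hypothetical record; field names are illustrative, not Hindsight's schema."""
    text: str
    mentioned_at: date                  # when the bank was told the fact
    occurred_at: Optional[date] = None  # when the event happened; None for ongoing facts
    tags: set[str] = field(default_factory=set)


def happened_in_year(facts: list[Fact], year: int) -> list[Fact]:
    """Historical query: match on the occurrence date, ignoring undated facts."""
    return [f for f in facts if f.occurred_at is not None and f.occurred_at.year == year]


def most_recently_mentioned(facts: list[Fact]) -> list[Fact]:
    """Recency ordering: sort on the date the fact was learned, newest first."""
    return sorted(facts, key=lambda f: f.mentioned_at, reverse=True)


def visible_to(facts: list[Fact], tag: str) -> list[Fact]:
    """Tag scoping: keep only facts that carry the given tag."""
    return [f for f in facts if tag in f.tags]


facts = [
    Fact("Alice got married", mentioned_at=date(2025, 1, 15),
         occurred_at=date(2024, 6, 1), tags={"user:alice"}),
    Fact("Alice prefers Python", mentioned_at=date(2024, 11, 3), tags={"user:alice"}),
    Fact("Bob moved to Paris", mentioned_at=date(2023, 5, 2),
         occurred_at=date(2023, 4, 20), tags={"user:bob"}),
]

print([f.text for f in happened_in_year(facts, 2024)])
print(most_recently_mentioned(facts)[0].text)
print([f.text for f in visible_to(facts, "user:bob")])
```

Keeping the two dates separate is what lets the marriage answer a question about 2024 while still ranking as the most recently learned fact.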

---

## What You Get

After `retain()` completes:

- **Structured facts** that preserve meaning, emotions, and reasoning
- **Unified entities** that resolve different name variations
- **Knowledge graph** with entity, temporal, semantic, and causal links
- **Temporal grounding** for both historical and recency-based queries
- **Optional tags** for filtering during recall

All stored in your isolated **memory bank**, ready for `recall()` and `reflect()`.

---

## Steering Extraction with a Mission

By default, `retain()` extracts all significant facts from the content. You can narrow this focus with a **retain mission** (`retain_mission`) — a plain-language description of what this bank should pay attention to.

```
e.g. Always include technical decisions, API design choices, and architectural trade-offs.
Ignore meeting logistics, greetings, and social exchanges.
```

The mission is injected into the extraction prompt alongside the built-in rules — it steers the LLM without replacing the extraction logic. It works with any extraction mode (`concise`, `verbose`, `custom`).

For finer control, you can also change the **extraction mode**:

| Mode | When to use |
|------|-------------|
| `concise` *(default)* | General-purpose — selective, fast |
| `verbose` | When you need richer facts with full context and relationships |
| `custom` | When you want to write your own extraction rules entirely |

Set `retain_mission` and `retain_extraction_mode` via the [bank config API](/developer/api/memory-banks#retain-configuration) or the [`HINDSIGHT_API_RETAIN_MISSION`](/developer/configuration#retain) environment variable.

---

## Observation Consolidation

After `retain()` completes, Hindsight automatically triggers **observation consolidation** in the background. This process:

1. Analyzes new facts against existing observations
2. Creates new observations when patterns emerge
3. Refines existing observations with new evidence
4. Tracks which facts support each observation

This happens asynchronously — your `retain()` call returns immediately while consolidation runs in the background.

See [Observations](./observations) for details on how consolidation works.

---

## Memory Defense and Source Provenance

### receipt_uri (optional)

Type: `string`. Optional pointer into an external receipt or co-signature system. Stored as-is and surfaced in `security_events.receipt_uri` for any Memory Defense decision on this item.

### 422 — Memory Defense violation

When Memory Defense is enabled on the target bank and **every** item in the batch is blocked by policy, the request returns 422 with a violation list:

```json
{
  "detail": {
    "violations": [
      {
        "index": 0,
        "detector": "prompt_injection",
        "severity": "high",
        "message": "..."
      }
    ]
  }
}
```

Partial-block batches return 200 with the un-blocked items processed; blocked items are silently dropped from the result with their decisions recorded in `security_events`.

See [Memory Defense](./memory-defense/index.md) for the full guide.

---

## Next Steps

- [**Observations**](./observations) — How knowledge is consolidated after retain
- [**Recall**](./retrieval) — How multi-strategy search retrieves relevant memories
- [**Reflect**](./reflect) — How the agentic loop uses observations
- [**Retain API**](./api/retain) — Code examples and parameters

---

## File: developer/retrieval.md

# Recall: How Hindsight Retrieves Memories

When you call `recall()`, Hindsight uses multiple search strategies in parallel to find the most relevant memories, regardless of how you phrase your query.

```mermaid
graph LR
    Q[Query] --> S[Semantic]
    Q --> K[Keyword]
    Q --> G[Graph]
    Q --> T[Temporal]
    S --> RRF[RRF Fusion]
    K --> RRF
    G --> RRF
    T --> RRF
    RRF --> CE[Cross-Encoder]
    CE --> R[Results]
```

---

## The Challenge of Memory Recall

Different queries need different search approaches:

- **"Alice works at Google"** → needs exact name matching
- **"Where does Alice work?"** → needs semantic understanding
- **"What did Alice do last spring?"** → needs temporal reasoning
- **"Why did Alice leave?"** → needs causal relationship tracing

No single search method handles all these well. Hindsight solves this with **TEMPR** — four complementary strategies that run in parallel.

---

## Four Search Strategies

### Semantic Search

**What it does:** Understands the *meaning* behind words, not just the words themselves.

**Best for:**
- Conceptual matches: "Alice's job" → "Alice works as a software engineer"
- Paraphrasing: "Bob's expertise" → "Bob specializes in machine learning"
- Synonyms: "meeting" matches "conference", "discussion", "gathering"

**Why it matters:** You can ask questions naturally without matching exact keywords.

---

### Keyword Search

**What it does:** Finds exact terms and names, even when they're spelled uniquely.

**Best for:**
- Proper nouns: "Google", "Alice Chen", "MIT"
- Technical terms: "PostgreSQL", "HNSW", "TensorFlow"
- Unique identifiers: URLs, product names, specific phrases

**Why it matters:** Ensures you never miss results that mention specific names or terms, even if they're semantically distant from your query.

**Backends:** Hindsight ships five pluggable BM25 backends, selected via `HINDSIGHT_API_TEXT_SEARCH_EXTENSION`:

| Backend | What it uses | Citus-compatible? |
|---|---|---|
| `native` | PostgreSQL `tsvector` + `ts_rank_cd` (TF-IDF, not true BM25) | Yes |
| `vchord` | `vchord_bm25` extension | No |
| `pg_textsearch` | Timescale `pg_textsearch` extension | No |
| `pgroonga` | PGroonga (Groonga) full-text extension, `TokenBigram` polyglot tokenizer | No |
| `pg_search` | ParadeDB `pg_search` extension, configurable tokenizer (e.g. `jieba`, `chinese_compatible`, `ngram`) via `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_PG_SEARCH_TOKENIZER` | Yes |

If you need true BM25 ranking on a horizontally scaled Postgres (Citus) cluster, `pg_search` is the only option. See the [`pg_search` docker-compose example](https://github.com/vectorize-io/hindsight/tree/main/docker/docker-compose/pg_search).

---

### Graph Traversal

**What it does:** Follows connections between entities to find indirectly related information.

**Best for:**
- Indirect relationships: "What does Alice do?" → Alice → Google → Google's products
- Entity exploration: "Bob's colleagues" → Bob → co-workers → shared projects
- Multi-hop reasoning: "Alice's team's achievements"

**Why it matters:** Retrieves facts that aren't semantically or lexically similar but are **structurally connected** through the knowledge graph.

**Example:** Even if Alice and her manager are never mentioned together, graph traversal can find the manager through shared projects or team relationships.

---

### Temporal Search

**What it does:** Understands time expressions and filters by when events occurred.

**Best for:**
- Historical queries: "What did Alice do in 2023?"
- Time ranges: "What happened last spring?"
- Relative time: "What did Bob work on last year?"
- Before/after: "What happened before Alice joined Google?"

**How it works:** When the query contains a time reference, Hindsight parses it into a date window, then retrieves the memories whose time overlaps that window.
Within the window it selects candidates by **semantic relevance to the query** — not by recency — so the most relevant in-window memory is never dropped just because other memories happen to be more recent. It then **spreads the selection across the window's range**: the window is divided into time-buckets and the strongest match from each populated bucket is taken first, so a "what happened in 2023?" query surfaces memories from across the whole year rather than clustering on whichever stretch is densest. Time is also used as a *scoring* signal — memories closer to the center of the window get a small boost (see [Temporal proximity signal](#temporal-proximity-signal)).

**Why it matters:** Enables precise historical queries that stay relevant *and* representative of the whole period. It also stays fast and meaningful on memory banks whose timestamps are densely clustered (for example when a large batch is ingested with one date): because selection is relevance-first, a time window that happens to match most of the bank still returns the best matches rather than an arbitrary slice.

---

## Result Fusion

After the four strategies run, results are **fused together**:

- Memories appearing in **multiple strategies** rank higher (consensus)
- **Rank matters more than score** (robust across different scoring systems)
- Final results are **re-ranked** using a neural model that considers query-memory interaction

**Why fusion matters:** A fact that's both semantically similar AND mentions the right entity will rank higher than one that's only semantically similar.

---

## Why Multiple Strategies?
Consider the query: **"What did Alice say about Python last spring?"**

- **Semantic** finds facts about Alice's views on programming
- **Keyword** ensures "Python" is actually mentioned
- **Graph** connects Alice → programming languages → related entities
- **Temporal** filters to "last spring" timeframe

The **fusion** of all four gives you exactly what you're looking for, even though no single strategy would suffice.

---

## Token Budget Management

Hindsight is built for AI agents, not humans. Traditional search systems return "top-k" results, but agents don't think in terms of result counts—they think in tokens. An agent's context window is measured in tokens, and that's exactly how Hindsight measures results.

**How it works:**
- Top-ranked memories selected first
- Stops when token budget is exhausted
- You specify context budget, Hindsight fills it with the most relevant memories

**Parameters you control:**
- `max_tokens`: How much memory content to return (default: 4096 tokens)
- `budget`: Search depth level (low, mid, high)
- `types`: Filter by world, experience, observation, or all
- `tags`: Filter memories by visibility tags
- `tags_match`: How to match tags (see [Recall API](./api/recall) for all options)

### Expanding Context: Chunks

Memories are distilled facts—concise but sometimes missing nuance. When your agent needs deeper context, you can optionally retrieve the source material:

**Chunks** return the raw text that generated each memory—useful when the distilled fact loses important nuance:

```
Memory: "Alice prefers Python over JavaScript"

Chunk: "Alice mentioned she prefers Python over JavaScript, mainly because of its data science ecosystem, though she admits JS is better for frontend work and she's been learning TypeScript lately."
```

Use `include_chunks=True` with `max_chunk_tokens` to control the token budget for chunks.
This is useful when generating responses that need verbatim quotes or when context matters (e.g., "What exactly did Alice say about the project?").

---

## Tuning Recall: Quality vs Latency

Different use cases require different trade-offs between **recall quality** and **response speed**. Two parameters control this:

### Budget: Search Depth

Controls how thoroughly Hindsight explores the memory bank—affecting graph traversal depth, candidate pool size, and cross-encoder re-ranking:

| Budget | Best For | Trade-off |
|--------|----------|-----------|
| **low** | Quick lookups, simple queries | Fast, may miss indirect connections |
| **mid** | Most queries, balanced | Good coverage, reasonable speed |
| **high** | Complex queries requiring deep exploration | Thorough, slower |

**Example:** "What did Alice's manager's team work on?" benefits from high budget to traverse multiple hops (Alice → manager → team → projects) and evaluate more candidates.

### Max Tokens: Context Window Size

Controls how much memory content to return:

| Max Tokens | ~Pages of Text | Best For | Trade-off |
|------------|----------------|----------|-----------|
| **2048** | ~2 pages | Focused answers, fast LLM | Fewer memories, faster |
| **4096** (default) | ~4 pages | Balanced context | Good coverage, standard |
| **8192** | ~8 pages | Comprehensive context | More memories, slower LLM |

**Example:** "Summarize everything about Alice" benefits from higher max_tokens to include more facts.

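The token-budget behaviour described under "Token Budget Management" (take top-ranked memories first, stop when the budget is exhausted) can be sketched as a greedy fill. This is an illustration rather than Hindsight's implementation: `ScoredMemory`, `count_tokens`, and `fill_budget` are hypothetical names, the whitespace word count is a stand-in for a real tokenizer, and stopping at the first memory that does not fit is one reading of "stops when token budget is exhausted".

```python
from dataclasses import dataclass


@dataclass
class ScoredMemory:
    """Hypothetical container: a memory's text plus its final ranking score."""
    text: str
    score: float


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def fill_budget(memories: list[ScoredMemory], max_tokens: int) -> list[ScoredMemory]:
    """Select memories in descending score order until the budget would overflow."""
    selected: list[ScoredMemory] = []
    used = 0
    for memory in sorted(memories, key=lambda m: m.score, reverse=True):
        cost = count_tokens(memory.text)  # only the memory text counts, not metadata
        if used + cost > max_tokens:
            break  # budget exhausted: stop rather than skip ahead to smaller items
        selected.append(memory)
        used += cost
    return selected


memories = [
    ScoredMemory("Alice prefers Python", 0.9),            # 3 tokens
    ScoredMemory("Alice works at Google now", 0.8),       # 5 tokens
    ScoredMemory("Bob likes jazz", 0.7),                  # 3 tokens
]

print([m.text for m in fill_budget(memories, max_tokens=5)])
print([m.text for m in fill_budget(memories, max_tokens=11)])
```

A larger `max_tokens` simply lets the loop run further down the ranked list, which is why it trades LLM processing time for coverage without changing how the search itself runs.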
### Two Independent Dimensions

Budget and max_tokens control different aspects of recall:

| Parameter | What it controls | Latency impact | Example |
|-----------|------------------|----------------|---------|
| **Budget** | How thoroughly to explore memories | Search time | High budget finds Alice → manager → team → projects |
| **Max Tokens** | How much context to return | LLM processing time | High tokens returns more memories to the agent |

**They're independent.** Common combinations:

| Budget | Max Tokens | Use Case |
|--------|------------|----------|
| high | low | Deep search, return only the best results |
| low | high | Quick search, return everything found |
| high | high | Comprehensive research queries |
| low | low | Fast chatbot responses |

### Recommended Configurations

| Use Case | Budget | Max Tokens | Why |
|----------|--------|------------|-----|
| **Chatbot replies** | low | 2048 | Fast responses, focused context |
| **Document Q&A** | mid | 4096 | Balanced coverage and speed |
| **Research queries** | high | 8192 | Comprehensive, multi-hop reasoning |
| **Real-time search** | low | 2048 | Minimize latency |

---

## Scoring & Ranking Deep Dive

This section explains exactly how Hindsight turns raw retrieval results into a final ranked list. The pipeline has four stages: **RRF fusion**, **cross-encoder reranking**, **combined scoring**, and **token truncation**.

### Stage 1: Reciprocal Rank Fusion (RRF)

After all strategies run in parallel, their results are merged using [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf). RRF combines ranked lists by rewarding items that appear highly ranked across multiple strategies, without relying on raw scores (which aren't comparable across different retrieval methods).


**Formula:**

```
score(d) = Σ  1 / (k + rank_i(d))
           i
```

Where:
- **k = 60** (smoothing constant — prevents top-ranked items from dominating)
- **rank_i(d)** = position of document *d* in strategy *i* (1-indexed)
- The sum runs over all strategies where *d* appears

**Within RRF, all four strategies are weighted equally** — fusion uses rank position, not the source, so no strategy gets an implicit multiplier. You can, however, deliberately bias a source with [`HINDSIGHT_API_RECALL_STRATEGY_BOOSTS`](./configuration): that boost is applied at a separate stage — before the reranking pre-filter cap and again after reranking — not inside the RRF fusion above.

**Why RRF over raw score merging?** Each retrieval strategy produces scores on a different scale (cosine similarity, BM25 tf-idf, graph activation). These scores aren't comparable — a BM25 score of 12.5 and a cosine similarity of 0.85 don't mean the same thing. RRF sidesteps this by using only rank positions, making it robust across any scoring system without requiring calibration.

**Example:** A memory ranked #1 in semantic and #5 in BM25:

```
RRF score = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318
```

A memory ranked #1 in semantic only:

```
RRF score = 1/(60+1) = 0.0164
```

The first memory ranks higher because it has **consensus** across strategies.

---

### Stage 2: Cross-Encoder Reranking

RRF gives a good initial ranking, but it's based on positions, not on deep query-document understanding. The cross-encoder evaluates each candidate against the query as a pair, producing a relevance score.

**Pre-filtering:** Before reranking, candidates are trimmed to the top **300** (by RRF score) to limit computational cost. This is configurable via `HINDSIGHT_API_RERANKER_MAX_CANDIDATES`. If [`HINDSIGHT_API_RECALL_STRATEGY_BOOSTS`](./configuration) is set, the boost is applied to the RRF scores before this cut, so candidates from a favoured source are more likely to survive it.

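The RRF formula from Stage 1 and the pre-filter cap described above can be sketched in a few lines. This is a minimal illustration, not Hindsight's code: `rrf_fuse` and `top_candidates` are hypothetical helper names, the memory ids are made up, and strategy boosts are left out.

```python
from collections import defaultdict

RRF_K = 60  # smoothing constant k from the formula


def rrf_fuse(rankings: dict[str, list[str]], k: int = RRF_K) -> dict[str, float]:
    """Sum 1 / (k + rank) over every strategy in which a memory id appears."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, memory_id in enumerate(ranked_ids, start=1):  # ranks are 1-indexed
            scores[memory_id] += 1.0 / (k + rank)
    return dict(scores)


def top_candidates(scores: dict[str, float], max_candidates: int = 300) -> list[str]:
    """Keep the highest-scoring ids, mirroring the cut made before reranking."""
    return sorted(scores, key=scores.get, reverse=True)[:max_candidates]


# "m1" is ranked #1 by semantic search and #5 by keyword search.
rankings = {
    "semantic": ["m1", "m2", "m3", "m4", "m5"],
    "keyword": ["m6", "m7", "m8", "m9", "m1"],
}
fused = rrf_fuse(rankings)

print(round(fused["m1"], 4))     # consensus across two strategies
print(round(fused["m6"], 4))     # ranked #1 by a single strategy
print(top_candidates(fused, 2))
```

Because only positions enter the sum, the function never needs to know whether a list came from cosine similarity, BM25, or graph activation.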

**Why rerank after RRF?** RRF is position-based — it knows a memory ranked well across strategies, but it never actually reads the query and the memory together. The cross-encoder does: it takes the query and each candidate as a pair and produces a relevance score based on their full interaction. This catches nuances that position-based fusion misses, like a memory that ranked #1 in keyword search because it matched a common term but is actually irrelevant to the query's intent.

**Score normalization:** Cross-encoders output raw logits (which can be negative). Scores that already fall within [0, 1] — as returned by calibrated external API rerankers (e.g. Cohere, SiliconFlow, ZeroEntropy, Alibaba, Jina) — are passed through unchanged to preserve their absolute confidence. Raw logits outside [0, 1] are normalized to [0, 1] using the sigmoid function:

```
CE_normalized = 1 / (1 + e^(-raw_logit))
```

**Batch processing:** Candidates are scored in batches — **32 pairs** for the local reranker, **128 pairs** for TEI.

:::tip No cross-encoder?
When running without a cross-encoder (e.g., slim image with no external reranker), the system falls back to RRF-derived scores: candidates are assigned synthetic scores spread across [0.1, 1.0] based on their RRF rank, so the combined scoring boosts below still work meaningfully.
:::

---

### Stage 3: Combined Scoring (Boosts)

The normalized cross-encoder score is adjusted by three **multiplicative boosts** that incorporate signals the cross-encoder can't see: recency, temporal proximity, and evidence strength.

**Why multiplicative instead of additive?** Additive boosts (e.g., `CE + 0.1 × recency`) would give the same absolute bonus to every candidate regardless of relevance. A barely-relevant memory could leapfrog a highly-relevant one just by being recent.
Multiplicative boosts keep adjustments proportional to the base relevance score — a +10% nudge on a high-relevance memory is a bigger absolute change than +10% on a low-relevance one. This ensures secondary signals never overpower the primary relevance judgment.

**Formula:**

```
final_score = CE_normalized × recency_boost × temporal_boost × proof_count_boost
```

Each boost is centered at 1.0 (neutral) and controlled by an alpha that caps how much it can swing:

```
boost = 1 + α × (signal - 0.5)
```

| Boost | α | Max adjustment | What it rewards |
|-------|---|----------------|-----------------|
| **Recency** | 0.2 | ±10% | Recent memories over older ones |
| **Temporal proximity** | 0.2 | ±10% | Memories close to a queried time window |
| **Proof count** | 0.1 | ±5% | Observations backed by more evidence |

#### Recency signal

Linear decay over 365 days from the memory's occurrence date:

```
recency = clamp(1.0 - days_ago / 365, 0.1, 1.0)
```

A memory from the query timestamp has recency 1.0 (+10% boost). A memory from 6 months before the query timestamp has recency ~0.5 (neutral). A memory more than a year before the query timestamp has recency 0.1 (-8% penalty). If no `query_timestamp` is provided, the server's current time is used. Memories without dates get 0.5 (neutral — no boost or penalty).

#### Temporal proximity signal

Only active when the query contains a time reference (e.g., "last spring", "in 2023"). Measures how close a memory's date is to the center of the queried time window:

```
temporal_proximity = 1.0 - min(days_from_center / (window_days / 2), 1.0)
```

A memory at the center of the window gets 1.0 (+10% boost). A memory at the edge gets 0.0 (-10% penalty). For non-temporal queries, all memories get 0.5 (neutral).

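Putting the formulas so far together, the sketch below chains score normalization, the recency and temporal signals, and the multiplicative boosts. It is an illustration under stated assumptions, not Hindsight's implementation: every function and parameter name is hypothetical, and the proof-count signal is passed in as a plain number (0.5 is neutral) rather than computed.

```python
import math
from typing import Optional

RECENCY_ALPHA = 0.2
TEMPORAL_ALPHA = 0.2
PROOF_ALPHA = 0.1


def normalize_ce(raw: float) -> float:
    """Pass scores already in [0, 1] through; squash other raw logits with a sigmoid."""
    if 0.0 <= raw <= 1.0:
        return raw
    if raw > 0:
        return 1.0 / (1.0 + math.exp(-raw))
    exp_raw = math.exp(raw)  # equivalent form that avoids overflow for very negative logits
    return exp_raw / (1.0 + exp_raw)


def boost(signal: float, alpha: float) -> float:
    """boost = 1 + alpha * (signal - 0.5); a signal of 0.5 is neutral."""
    return 1.0 + alpha * (signal - 0.5)


def recency_signal(days_ago: Optional[float]) -> float:
    """Linear decay over 365 days, clamped to [0.1, 1.0]; undated memories are neutral."""
    if days_ago is None:
        return 0.5
    return min(max(1.0 - days_ago / 365.0, 0.1), 1.0)


def temporal_signal(days_from_center: Optional[float], window_days: Optional[float]) -> float:
    """Closeness to the window center; neutral when the query has no time reference."""
    if days_from_center is None or not window_days:
        return 0.5
    return 1.0 - min(days_from_center / (window_days / 2.0), 1.0)


def final_score(raw_ce: float, days_ago: Optional[float] = None,
                days_from_center: Optional[float] = None,
                window_days: Optional[float] = None,
                proof_signal: float = 0.5) -> float:
    """CE_normalized multiplied by the three boosts."""
    return (normalize_ce(raw_ce)
            * boost(recency_signal(days_ago), RECENCY_ALPHA)
            * boost(temporal_signal(days_from_center, window_days), TEMPORAL_ALPHA)
            * boost(proof_signal, PROOF_ALPHA))


print(final_score(0.8))                # all signals neutral
print(final_score(0.8, days_ago=0))    # fresh memory
print(final_score(0.8, days_ago=400))  # older than a year
```

With every signal at its neutral 0.5 the boosts multiply to exactly 1.0, so the cross-encoder score passes through unchanged; the other two calls show the recency boost at its ceiling and at its clamped floor.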
#### Proof count signal

For observation-type memories, rewards those backed by more evidence using a logarithmic curve:

```
proof_norm = clamp(0.5 + ln(proof_count) / 10, 0.0, 1.0)
```

| Proof count | proof_norm | Boost |
|-------------|-----------|-------|
| 1 | 0.5 | Neutral |
| 3 | 0.61 | +1.1% |
| 10 | 0.73 | +2.3% |
| 150+ | 1.0 | +5% (max) |

#### Maximum combined range

With all boosts at their extremes:

- **Best case:** ×1.10 × 1.10 × 1.05 ≈ **+27%**
- **Worst case:** ×0.90 × 0.90 × 0.95 ≈ **-23%**

The boosts are intentionally conservative — they nudge the ranking without overriding cross-encoder relevance.

---

### Stage 4: Token Truncation

After scoring, results are sorted by `final_score` and selected top-down until the `max_tokens` budget is exhausted. Only the memory text counts toward the budget — metadata is free.

---

### How Budget Maps to Pipeline Parameters

The `budget` parameter (low/mid/high) controls **search depth** — how many candidates each strategy considers. Each level maps to a **recall budget** number that flows through every pipeline stage:

| Budget | Recall budget (fixed mode) | Env var override |
|--------|---------------------------|-----------------|
| **low** | 100 | `HINDSIGHT_API_RECALL_BUDGET_FIXED_LOW` |
| **mid** | 300 (default) | `HINDSIGHT_API_RECALL_BUDGET_FIXED_MID` |
| **high** | 1000 | `HINDSIGHT_API_RECALL_BUDGET_FIXED_HIGH` |

This recall budget flows through the pipeline as follows:

| Pipeline stage | How the recall budget is used |
|----------------|-------------------------------|
| **Semantic search** | Over-fetches max(recall_budget × 5, 100) from HNSW, trims to recall_budget |
| **BM25 search** | `LIMIT recall_budget` in SQL |
| **Graph traversal** | Explores up to recall_budget nodes |
| **Temporal spreading** | Activates up to recall_budget nodes via links |
| **Result consideration** | Top recall_budget × 2 results considered for token filtering |

Reranking pre-filter (300 candidates) is **independent** of budget — it's a separate knob (`HINDSIGHT_API_RERANKER_MAX_CANDIDATES`).

:::info Adaptive budgeting
An alternative budget mode scales the recall budget with `max_tokens` instead of using fixed values:

```
recall_budget = clamp(max_tokens × ratio, min, max)
```

| Budget | Ratio | Env var override |
|--------|-------|-----------------|
| low | 2.5% of max_tokens | `HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_LOW` |
| mid | 7.5% of max_tokens | `HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_MID` |
| high | 25% of max_tokens | `HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_HIGH` |

The result is clamped to a floor of **20** (`HINDSIGHT_API_RECALL_BUDGET_MIN`) and a ceiling of **2000** (`HINDSIGHT_API_RECALL_BUDGET_MAX`). Enable with `HINDSIGHT_API_RECALL_BUDGET_FUNCTION=adaptive`.
:::

---

### Graph Scoring Detail

The graph traversal (link expansion) combines three independent signals additively for each candidate:

| Signal | Score formula | Range |
|--------|--------------|-------|
| **Entity overlap** | tanh(shared_entity_count × 0.5) | [0, ~1.0] |
| **Semantic link** | Precomputed kNN link weight | [0.7, 1.0] |
| **Causal link** | Causal link weight | [0, 1.0] |

```
graph_score = entity_score + semantic_score + causal_score    ∈ [0, 3]
```

The additive combination rewards **convergent evidence** — a memory connected to the query through multiple signal types ranks higher than one connected through a single strong signal.

**Why tanh for entity scores?** Raw shared-entity count is unbounded — a high-fanout entity like "user" could produce counts of 50+, drowning out the other two signals. `tanh(count × 0.5)` saturates naturally: the first few shared entities matter a lot (1→0.46, 2→0.76, 3→0.91), but additional ones contribute diminishing returns, keeping the entity signal in [0, 1] alongside semantic and causal scores.

**Why additive instead of multiplicative here?** Unlike the combined scoring boosts, graph signals are independent evidence channels, not adjustments to a base score.
A memory might be connected only through causal links (no shared entities, no semantic similarity) — multiplicative combination would zero it out. Additive scoring lets each signal contribute independently, and the outer RRF fusion handles ranking across strategies.

**Entity signal example:** A memory sharing 1 entity with the query scores tanh(0.5) ≈ 0.46. Two shared entities score tanh(1.0) ≈ 0.76. Three or more saturate near 0.91+.

---

## Next Steps

- [**Retain**](./retain) — How memories are stored with rich context
- [**Reflect**](./reflect) — How disposition influences reasoning
- [**Recall API**](./api/recall) — Code examples, parameters, and tag filtering

---

## File: developer/installation.md

# Installation

Hindsight can be deployed in several ways depending on your infrastructure and requirements.

:::tip Don't want to manage infrastructure?
**[Hindsight Cloud](https://ui.hindsight.vectorize.io/signup)** is a fully managed service that handles all infrastructure, scaling, and maintenance — [sign up here](https://ui.hindsight.vectorize.io/signup).
:::

## Supported Platforms

Hindsight runs on **Linux**, **macOS**, and **Windows**:

| Platform | Docker | Bare Metal (pip) | Embedded DB (pg0) | Notes |
|----------|--------|------------------|--------------------|-------|
| **Linux** (x86_64, ARM64) | ✅ | ✅ | ✅ | Fully supported, recommended for production |
| **macOS** (Apple Silicon / arm64) | ✅ | ✅ | ✅ | Fully supported |
| **macOS** (Intel / x86_64) | ✅ | ⚠️ slim only | ✅ | Use `hindsight-all-slim` / `hindsight-api-slim`. The full bundle's local ML models (PyTorch, MLX) publish no Intel-Mac wheels, so `pip install hindsight-all` silently backtracks to a months-old release. Pair the slim bundle with a hosted embeddings/reranker provider or the in-process ONNX backend (`hindsight-api-slim[local-onnx]`). |
| **Windows** (x86_64) | ✅ | ✅ | ✅ | Fully supported — see [Windows setup](#windows) for external PostgreSQL option |

All platforms support the embedded database (pg0) for development. On Windows, you can also use an external PostgreSQL installation — see the [Windows](#windows) section for a step-by-step guide.

---

## Prerequisites

### PostgreSQL

Hindsight requires PostgreSQL 14+ with a vector extension for similarity search. The supported extensions are:

- **pgvector** (default)
- **pgvectorscale**
- **vchord**
- **scann** (AlloyDB)

Configure which one to use with `HINDSIGHT_API_VECTOR_EXTENSION`. See [Configuration](./configuration) for details.

**By default**, Hindsight uses **pg0** — an embedded PostgreSQL that runs locally on your machine. This is convenient for development but **not recommended for production**.

**For production**, use an external PostgreSQL with one of the supported vector extensions:

- **Supabase** — Managed PostgreSQL with pgvector built-in
- **Neon** — Serverless PostgreSQL with pgvector
- **Azure Database for PostgreSQL** — With pgvector and pgvectorscale support
- **Google AlloyDB** / **AlloyDB Omni** — With pgvector and ScaNN support
- **AWS RDS** / **Cloud SQL** — With pgvector extension enabled
- **Self-hosted** — PostgreSQL 14+ with your preferred vector extension

### LLM Provider

You need an LLM API key for fact extraction, entity resolution, and answer generation. See [Models](./models) for supported providers, model recommendations, and configuration.

### Hardware

Hindsight is designed to run on commodity hardware. The footprint depends mainly on whether the **full** image (which bundles local embedding and reranker models) or the **slim** image (which delegates those to external providers) is used.

| Component | Minimum RAM | Recommended RAM | Notes | |-----------|-------------|-----------------|-------| | **API — Full image** | 1.5 GB | 2 GB | Loads local BGE embedder (~130 MB) and MiniLM cross-encoder (~90 MB) into memory, plus PyTorch/ONNX runtime arenas. Idle RSS settles around 0.8–1.0 GB; expect 1.2–1.5 GB under load. | | **API — Slim image** | 512 MB | 1 GB | No local models. Steady-state RSS is dominated by Python runtime and DB connections. Requires [external embedding and reranker providers](./configuration#embeddings) (e.g. TEI, OpenAI, Cohere). | | **Control Plane (UI)** | 128 MB | 256 MB | Next.js process, lightweight. | | **Worker** (if separated) | Same as API image variant | Same as API image variant | Workers load the same models as the API server. | | **PostgreSQL** | 512 MB | 1 GB+ | Scales with the number of memories and indexes. | :::tip Reducing the footprint The bulk of the full image's memory comes from the bundled embedding and reranker models and their PyTorch/ONNX runtimes. To shrink the deployment to a few hundred MB of RAM, switch to the **slim** image and configure [external embedding and reranker providers](./configuration#embeddings). ::: CPU vs GPU: 2 vCPUs on CPU-only is fine for development and basic workloads. For production traffic, the local reranker (cross-encoder) is the main bottleneck and typically benefits from a GPU to keep recall latency reasonable; alternatively, offload reranking to an [external reranker provider](./configuration#embeddings) (e.g. TEI, Cohere) on dedicated GPU hardware. 
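For rough capacity planning, the figures in the hardware table above can be added up per deployment. The sketch below is a back-of-the-envelope estimate only: it assumes the components' footprints simply add, treats the PostgreSQL "1 GB+" recommendation as its lower bound, and uses a hypothetical helper name (`estimate_ram_mb`) that is not part of Hindsight.

```python
# Figures copied from the hardware table, in MB (1 GB = 1024 MB).
RAM_MB = {
    "api_full": {"minimum": 1536, "recommended": 2048},
    "api_slim": {"minimum": 512, "recommended": 1024},
    "control_plane": {"minimum": 128, "recommended": 256},
    "postgresql": {"minimum": 512, "recommended": 1024},  # "1 GB+" taken as 1 GB
}


def estimate_ram_mb(api_variant: str, tier: str = "minimum", extra_workers: int = 0) -> int:
    """Sum the table's figures for one API, the UI, PostgreSQL, and separate workers.

    Each separate worker is budgeted like the API image variant, as the table states.
    """
    api = RAM_MB[api_variant][tier]
    return api * (1 + extra_workers) + RAM_MB["control_plane"][tier] + RAM_MB["postgresql"][tier]


print(estimate_ram_mb("api_full"))
print(estimate_ram_mb("api_slim", tier="recommended", extra_workers=2))
```

Treat the result as a starting point; PostgreSQL in particular grows with the number of memories and indexes.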
--- ## Docker **Best for**: Quick start, development, small deployments Run everything in one container with embedded PostgreSQL: ```bash export OPENAI_API_KEY=sk-xxx docker run -it --pull always --name hindsight --restart unless-stopped -p 8888:8888 -p 9999:9999 \ -e HINDSIGHT_API_LLM_API_KEY=$OPENAI_API_KEY \ -v hindsight-data:/home/hindsight/.pg0 \ ghcr.io/vectorize-io/hindsight:latest ``` - **API Server**: http://localhost:8888 - **Control Plane** (Web UI): http://localhost:9999 :::note Persisting data: named volume vs. host bind mount The container runs as a non-root user (UID 1000). The `hindsight-data` **named volume** above is recommended — Docker creates it owned by the container user, so it works with no extra setup. If you instead bind-mount a **host directory** (`-v $HOME/.hindsight-docker:/home/hindsight/.pg0`), that directory must be writable by UID 1000, or the embedded database fails to start with `Permission denied`. Either `chown` the directory to UID 1000, or run the container as your host user: `--user $(id -u):$(id -g) -e HOME=/home/hindsight` (after `chown`-ing the directory to your own UID). ::: All published images are [signed with Cosign](#verifying-image-signatures) — verification is optional. :::tip Set a stable `HINDSIGHT_API_WORKER_ID` in production The worker uses the container hostname as its identity, which Docker sets to the container ID by default. That value changes on every restart, so any task that was being processed when the container went down stays parked under the old ID with no way for the new container to recognize it as its own. Set `HINDSIGHT_API_WORKER_ID` to a stable value (e.g., `-e HINDSIGHT_API_WORKER_ID=hindsight-prod`) so the worker keeps the same identity across restarts. This is recommended even for single-container deployments. For diagnosis and recovery commands, see [Admin CLI - Recovering stuck operations](./admin-cli#recovering-stuck-or-zombie-operations). 
::: ### Docker Image Variants | Variant | Size (AMD64) | Size (ARM64) | When to use | |---------|--------------|--------------|-------------| | **Full** (`latest`) | ~9 GB | ~3.7 GB | Default. Works out of the box with no external services except the LLM. | | **Slim** (`slim`) | ~500 MB | ~500 MB | Use when you already rely on external services for embeddings and reranking (OpenAI, Cohere, TEI). Significantly smaller image, faster deploys. Requires [external providers](./configuration#embeddings). | The slim image corresponds to the [`hindsight-api-slim`](#bare-metal-pip) pip package. See [Configuration](./configuration#embeddings) for external provider options. ### Bundling Custom Models in a Custom Image :::tip Production deployments with non-default local models If you use a non-default local embedder or reranker, bake the models into a custom image at build time rather than enabling the Helm `modelCache` PVC. See [`docker/docker-compose/custom-models/`](https://github.com/vectorize-io/hindsight/tree/main/docker/docker-compose/custom-models) for a runnable example. ::: ### Available Tags ```bash # Standalone (API + Control Plane) ghcr.io/vectorize-io/hindsight:latest # Full, latest release ghcr.io/vectorize-io/hindsight:latest-slim # Slim, latest release ghcr.io/vectorize-io/hindsight:0.4.9 # Full, specific version ghcr.io/vectorize-io/hindsight:0.4.9-slim # Slim, specific version # API only ghcr.io/vectorize-io/hindsight-api:latest ghcr.io/vectorize-io/hindsight-api:latest-slim # Control Plane only ghcr.io/vectorize-io/hindsight-control-plane:latest ``` ### Verifying image signatures Images are signed with [Cosign](https://docs.sigstore.dev/cosign/signing/overview/) keyless OIDC. 
To verify any tag, substitute it for `<tag>`:

```bash
cosign verify ghcr.io/vectorize-io/hindsight:<tag> \
  --certificate-identity-regexp '^https://github\.com/vectorize-io/hindsight/\.github/workflows/(sign-images|release)\.yml@.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com
```

---

## Helm / Kubernetes

**Best for**: Production deployments, auto-scaling, cloud environments

```bash
# Install with built-in PostgreSQL
helm install hindsight oci://ghcr.io/vectorize-io/charts/hindsight \
  --set api.llm.provider=groq \
  --set api.llm.apiKey=gsk_xxxxxxxxxxxx \
  --set postgresql.enabled=true

# Or use external PostgreSQL
helm install hindsight oci://ghcr.io/vectorize-io/charts/hindsight \
  --set api.llm.provider=groq \
  --set api.llm.apiKey=gsk_xxxxxxxxxxxx \
  --set postgresql.enabled=false \
  --set api.database.url=postgresql://user:pass@postgres.example.com:5432/hindsight

# Install a specific version
helm install hindsight oci://ghcr.io/vectorize-io/charts/hindsight --version 0.1.3

# Upgrade to latest
helm upgrade hindsight oci://ghcr.io/vectorize-io/charts/hindsight
```

**Requirements**:

- Kubernetes cluster (GKE, EKS, AKS, or self-hosted)
- Helm 3.8+

### Distributed Workers

For high-throughput deployments, enable dedicated worker pods to scale task processing independently:

```bash
helm install hindsight oci://ghcr.io/vectorize-io/charts/hindsight \
  --set worker.enabled=true \
  --set worker.replicaCount=3
```

The chart deploys workers as a StatefulSet, so each pod gets a stable name (e.g. `hindsight-worker-0`) that the worker uses as its `HINDSIGHT_API_WORKER_ID`. Tasks claimed by a pod are recognized as its own across restarts. If you replace the StatefulSet with a plain Deployment, set `HINDSIGHT_API_WORKER_ID` explicitly per replica; otherwise hostnames are randomized and previously claimed tasks become orphaned. See [Admin CLI - Recovering stuck operations](./admin-cli#recovering-stuck-or-zombie-operations) for diagnosis.

See [Services - Worker Service](./services#worker-service) for configuration details and architecture. See the [Helm chart values.yaml](https://github.com/vectorize-io/hindsight/tree/main/helm/hindsight/values.yaml) for all chart options. --- ## Bare Metal (pip) **Best for**: Running Hindsight as a standalone service on a host machine. ### Install ```bash pip install hindsight-api # Full — works out of the box pip install hindsight-api-slim # Slim — requires external services for embeddings, reranking, and the database ``` When using `hindsight-api-slim`, you must configure external providers for all model operations. See [Configuration](./configuration#embeddings) for details. ### Run with Embedded Database For development and testing, Hindsight can run with an embedded PostgreSQL (pg0): ```bash export HINDSIGHT_API_LLM_PROVIDER=groq export HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx hindsight-api ``` This creates a database in `~/.hindsight/data/` and starts the API on http://localhost:8888. ### Run with External PostgreSQL For production, connect to your own PostgreSQL instance: ```bash export HINDSIGHT_API_DATABASE_URL=postgresql://user:pass@localhost:5432/hindsight export HINDSIGHT_API_LLM_PROVIDER=groq export HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx hindsight-api ``` **Note**: The database must exist and have pgvector enabled (`CREATE EXTENSION vector;`). ### CLI Options ```bash hindsight-api --port 9000 # Custom port (default: 8888) hindsight-api --host 127.0.0.1 # Bind to localhost only hindsight-api --workers 4 # Multiple worker processes hindsight-api --log-level debug # Verbose logging ``` ### Control Plane The Control Plane (Web UI) can be run standalone using npx: ```bash npx @vectorize-io/hindsight-control-plane --api-url http://localhost:8888 ``` This connects to your running API server and provides a visual interface for managing memory banks, exploring entities, and testing queries. 
#### Options

| Option | Environment Variable | Default | Description |
|--------|---------------------|---------|-------------|
| `-p, --port` | `PORT` | 9999 | Port to listen on |
| `-H, --hostname` | `HOSTNAME` | 0.0.0.0 | Hostname to bind to |
| `-a, --api-url` | `HINDSIGHT_CP_DATAPLANE_API_URL` | http://localhost:8888 | Hindsight API URL |
| | `HINDSIGHT_CP_ACCESS_KEY` | *(none)* | Access key to protect the Control Plane UI. When set, users must enter this key to log in. |

#### Examples

```bash
# Set the port explicitly (9999 is the default)
npx @vectorize-io/hindsight-control-plane --port 9999 --api-url http://localhost:8888

# Using environment variables
export HINDSIGHT_CP_DATAPLANE_API_URL=http://api.example.com
npx @vectorize-io/hindsight-control-plane

# Production deployment
PORT=80 HINDSIGHT_CP_DATAPLANE_API_URL=https://api.hindsight.io npx @vectorize-io/hindsight-control-plane
```

---

## Windows

**Best for**: Running Hindsight natively on Windows without Docker

Hindsight works on Windows with the embedded database (pg0) out of the box. Install it, set the LLM variables, and run:

```powershell
pip install hindsight-api
# PowerShell syntax; in cmd.exe use `set NAME=value` instead
$env:HINDSIGHT_API_LLM_PROVIDER = "openai"
$env:HINDSIGHT_API_LLM_API_KEY = "sk-xxx"
$env:HINDSIGHT_API_LLM_MODEL = "gpt-4o-mini"
hindsight-api
```

### Using External PostgreSQL (optional)

If you prefer to use your own PostgreSQL instance instead of the embedded database:

```powershell
# Install PostgreSQL
winget install PostgreSQL.PostgreSQL.17

# Build pgvector (requires Visual Studio Build Tools)
git clone https://github.com/pgvector/pgvector.git
cd pgvector
# Open "x64 Native Tools Command Prompt for VS" and run the next three commands there (cmd syntax):
set PGROOT=C:\Program Files\PostgreSQL\17
nmake /F Makefile.win
nmake /F Makefile.win install

# Create the database and enable the vector extension
psql -U postgres -c "CREATE DATABASE hindsight;"
psql -U postgres -d hindsight -c "CREATE EXTENSION vector;"
```

Then run Hindsight pointing to your database:

```powershell
pip install hindsight-api
$env:HINDSIGHT_API_DATABASE_URL = "postgresql://postgres@localhost:5432/hindsight"
$env:HINDSIGHT_API_LLM_PROVIDER = "openai"
$env:HINDSIGHT_API_LLM_API_KEY = "sk-xxx"
$env:HINDSIGHT_API_LLM_MODEL = "gpt-4o-mini"
hindsight-api
```

- **API Server**: http://localhost:8888

:::tip
You can also use the slim package (`pip install hindsight-api-slim`) if you configure external providers for embeddings and reranking. See [Configuration](./configuration#embeddings) for details.
:::

### Windows + China Network Notes

If you are running on Windows behind China network restrictions:

1. DeepSeek works well for `HINDSIGHT_API_LLM_PROVIDER`, but DeepSeek does not provide an embeddings endpoint.
2. Use local embeddings (recommended for privacy and reliability in restricted networks).
3. Set `HF_ENDPOINT=https://hf-mirror.com` before starting Hindsight so Hugging Face model downloads use a China-accessible mirror.

```powershell
$env:HF_ENDPOINT = "https://hf-mirror.com"
$env:HINDSIGHT_API_LLM_PROVIDER = "deepseek"
$env:HINDSIGHT_API_LLM_API_KEY = "sk-your-deepseek-key"
$env:HINDSIGHT_API_LLM_MODEL = "deepseek-v4-flash"
$env:HINDSIGHT_API_LLM_BASE_URL = "https://api.deepseek.com"
$env:HINDSIGHT_API_EMBEDDINGS_PROVIDER = "local"
$env:HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL = "BAAI/bge-small-en-v1.5"
$env:HINDSIGHT_API_RERANKER_PROVIDER = "flashrank"
hindsight-api
```

The `HF_ENDPOINT` variable is used by Hugging Face tooling (`huggingface_hub`), not by Hindsight itself.

---

## Embedded in a Python Application

**Best for**: Using Hindsight programmatically from Python without running a separate server process.

```bash
pip install hindsight-all       # Full: works out of the box (Linux, Windows, Apple Silicon Macs)
pip install hindsight-all-slim  # Slim: requires external services for embeddings, reranking, and the database
```

On Intel (x86_64) Macs, install `hindsight-all-slim`; see [Supported Platforms](#supported-platforms).

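If you script your environment setup, the platform rule above can be checked programmatically. The following is a minimal sketch using only the standard library; `recommended_bundle` is a hypothetical helper, not something Hindsight ships, and it encodes only the Intel-Mac exception described in Supported Platforms.

```python
import platform


def recommended_bundle(system: str, machine: str) -> str:
    """Return the pip bundle suggested by the Supported Platforms table."""
    # Intel Macs get the slim bundle: per the table, the full bundle's local
    # ML dependencies publish no Intel-Mac wheels.
    if system == "Darwin" and machine == "x86_64":
        return "hindsight-all-slim"
    return "hindsight-all"


# Inspect the interpreter that will run `pip install`.
print(recommended_bundle(platform.system(), platform.machine()))
```

Note that `platform.machine()` reports the architecture of the running Python build, which is also what determines which wheels pip can install.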
`hindsight-all` supports two modes of embedding: **In-process** (`HindsightServer`): the server runs in a background thread inside your application. Best when you want the tightest integration and are already managing your own process lifecycle. ```python from hindsight import HindsightServer, HindsightClient with HindsightServer(llm_provider="openai", llm_api_key="sk-xxx") as server: client = HindsightClient(base_url=server.url) client.retain(bank_id="alice", content="Alice prefers concise answers.") results = client.recall(bank_id="alice", query="How should I respond to Alice?") ``` **Managed subprocess** (`HindsightEmbedded`): the server runs as a background daemon process, shared across multiple Python processes or sessions. The daemon starts on first use and shuts down automatically after an idle timeout. ```python from hindsight import HindsightEmbedded client = HindsightEmbedded(llm_provider="openai", llm_api_key="sk-xxx") client.retain(bank_id="alice", content="Alice prefers concise answers.") results = client.recall(bank_id="alice", query="How should I respond to Alice?") ``` See the [Python SDK](../sdks/python.md) for the full API reference. --- ## Next Steps - [Configuration](./configuration.md) — Environment variables and settings - [Models](./models.mdx) — ML models and providers - [Monitoring](./monitoring.md) — Metrics and observability --- ## File: developer/configuration.md # Configuration Complete reference for configuring Hindsight services through environment variables. Hindsight has two services, each with its own configuration prefix: | Service | Prefix | Description | |---------|--------|-------------| | **API Service** | `HINDSIGHT_API_*` | Core memory engine | | **Control Plane** | `HINDSIGHT_CP_*` | Web UI | --- ## API Service The API service handles all memory operations (retain, recall, reflect). 
### Database | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_DATABASE_URL` | PostgreSQL connection string | `pg0` (embedded) | | `HINDSIGHT_API_READ_DATABASE_URL` | Optional read-replica PostgreSQL URL. When set, recall queries (semantic, BM25, graph, temporal) are routed through a separate connection pool against this URL, offloading the primary. Typically points to a read-only endpoint (e.g., CNPG's `-ro` service or Aurora reader endpoint). | Unset (uses primary) | | `HINDSIGHT_API_MIGRATION_DATABASE_URL` | Direct PostgreSQL URL for running migrations, bypassing connection poolers (e.g. PgBouncer). When set, advisory locks and Alembic migrations use this URL instead of `DATABASE_URL`. | Falls back to `DATABASE_URL` | | `HINDSIGHT_API_DATABASE_SCHEMA` | PostgreSQL schema name for tables | `public` | | `HINDSIGHT_API_RUN_MIGRATIONS_ON_STARTUP` | Run database migrations on API startup | `true` | | `HINDSIGHT_API_MIGRATION_CONCURRENCY` | Number of tenant schemas to migrate concurrently (PostgreSQL only). Each schema runs in its own process; within a schema migrations are always sequential. Each worker has a fixed startup cost (~1–2s to boot a fresh interpreter), so this only pays off with **many** schemas (roughly tens or more) or slow/high-latency migrations — for a handful of schemas it is slower than sequential. Each worker uses ~3 database connections, so keep `concurrency × 3` within your database's spare `max_connections` (and any PgBouncer pool limit). `1` = fully sequential. Measured at 20k schemas: the per-restart no-op resweep dropped from ~60min to ~11min (≈5×) at `concurrency=12`. | `1` | | `HINDSIGHT_API_DATABASE_BACKEND` | Database engine backend: `postgresql` or `oracle` (Oracle 23ai) | `postgresql` | If not provided, the server uses embedded `pg0` — convenient for development but not recommended for production. 
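The connection arithmetic behind `HINDSIGHT_API_MIGRATION_CONCURRENCY` in the table above can be sketched as follows. This is an illustrative calculation, not Hindsight code: the helper name and the example connection counts are assumptions, and the constant 3 is the approximate per-worker figure quoted in the table.

```python
from typing import Optional

CONNECTIONS_PER_WORKER = 3  # "~3 database connections" per migration worker


def max_migration_concurrency(spare_connections: int, pooler_limit: Optional[int] = None) -> int:
    """Largest concurrency whose workers fit inside the available connection budget."""
    budget = spare_connections if pooler_limit is None else min(spare_connections, pooler_limit)
    # Never go below 1: concurrency=1 is the fully sequential default.
    return max(1, budget // CONNECTIONS_PER_WORKER)


# Example: 40 spare connections on the database, optionally capped by a 30-connection pooler.
print(max_migration_concurrency(spare_connections=40))
print(max_migration_concurrency(spare_connections=40, pooler_limit=30))
```

This gives only an upper bound from connection limits. As the table notes, raising the value pays off mainly with many tenant schemas, because each worker carries a fixed startup cost.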
The `DATABASE_SCHEMA` setting allows you to use a custom PostgreSQL schema instead of the default `public` schema. This is useful for: - Multi-database setups where you want Hindsight tables in a dedicated schema - Hosting platforms (e.g., Supabase) where `public` schema is reserved or shared - Organizational preferences for schema naming conventions ```bash # Example: Using a custom schema export HINDSIGHT_API_DATABASE_URL=postgresql://user:pass@host:5432/dbname export HINDSIGHT_API_DATABASE_SCHEMA=hindsight ``` Migrations will automatically create the schema if it doesn't exist and create all tables in the configured schema. ### Database Connection Pool | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_DB_POOL_MIN_SIZE` | Minimum connections in the primary pool | `5` | | `HINDSIGHT_API_DB_POOL_MAX_SIZE` | Maximum connections in the primary pool | `100` | | `HINDSIGHT_API_READ_DB_POOL_MIN_SIZE` | Minimum connections in the read-replica pool (only used when `READ_DATABASE_URL` is set) | Falls back to `DB_POOL_MIN_SIZE` | | `HINDSIGHT_API_READ_DB_POOL_MAX_SIZE` | Maximum connections in the read-replica pool (only used when `READ_DATABASE_URL` is set) | Falls back to `DB_POOL_MAX_SIZE` | | `HINDSIGHT_API_DB_COMMAND_TIMEOUT` | PostgreSQL command timeout in seconds (asyncpg client-side) | `60` | | `HINDSIGHT_API_DB_ACQUIRE_TIMEOUT` | Connection acquisition timeout in seconds | `30` | | `HINDSIGHT_API_DB_STATEMENT_TIMEOUT` | Postgres `statement_timeout` applied to every pool connection, in seconds. Server-side safety net for runaway queries. Does **not** apply to Alembic migrations (which run on a separate psycopg2 engine). Set to `0` to disable. | `600` | For high-concurrency workloads, increase `DB_POOL_MAX_SIZE`. Each concurrent recall/think operation can use 2-4 connections. 
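As a rough sizing aid, the 2-4 connections per operation quoted above translate into a pool size like this. The helper names are hypothetical and the calculation is a sketch that ignores connections used by background work such as retain processing.

```python
from typing import Tuple

MIN_CONNS_PER_OP = 2  # lower bound quoted above
MAX_CONNS_PER_OP = 4  # upper bound quoted above


def connections_needed(concurrent_ops: int) -> Tuple[int, int]:
    """Range of pool connections that `concurrent_ops` simultaneous operations may hold."""
    return concurrent_ops * MIN_CONNS_PER_OP, concurrent_ops * MAX_CONNS_PER_OP


def worst_case_ops(pool_max_size: int) -> int:
    """Operations that fit in the pool if every one uses the upper bound."""
    return pool_max_size // MAX_CONNS_PER_OP


# With the default DB_POOL_MAX_SIZE of 100:
print(worst_case_ops(100))
# Connections that 40 concurrent operations may need:
print(connections_needed(40))
```

If the upper end of the range exceeds `HINDSIGHT_API_DB_POOL_MAX_SIZE`, operations wait for a connection until `HINDSIGHT_API_DB_ACQUIRE_TIMEOUT` expires, so size the pool (and the database's `max_connections`) for the peak you expect.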
To run migrations manually (e.g., before starting the API), use the admin CLI: ```bash # Migrate the base schema plus all discovered tenant schemas hindsight-admin run-db-migration # Or migrate a specific schema only: hindsight-admin run-db-migration --schema tenant_acme ``` ### Vector Extension | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_VECTOR_EXTENSION` | Vector index algorithm: `pgvector`, `vchord`, `pgvectorscale`, or `scann` | `pgvector` | Hindsight supports four PostgreSQL vector extensions: #### **pgvector** (HNSW - default) - In-memory index using Hierarchical Navigable Small World algorithm - Works well for most embeddings and dataset sizes - Fast for small-medium datasets (<10M vectors) - Higher memory usage for large datasets - Most widely deployed and supported #### **pgvectorscale** (DiskANN - recommended for scale) ⭐ - Disk-based index using StreamingDiskANN algorithm - **28x lower p95 latency** and **16x higher throughput** vs dedicated vector DBs - **60-75% cost reduction** at scale (SSDs cheaper than RAM) - Superior filtering performance with streaming retrieval model - Optimized for large datasets (10M+ vectors) - Supports both **pgvectorscale** (open source) and **pg_diskann** (Azure) - **Installation:** - Open source/self-hosted: `CREATE EXTENSION vector; CREATE EXTENSION vectorscale CASCADE;` - Azure PostgreSQL: `CREATE EXTENSION vector; CREATE EXTENSION pg_diskann CASCADE;` #### **vchord** (vchordrq) - Alternative high-performance vector index - Optimized for high-dimensional embeddings (3000+ dimensions) - Includes integrated BM25 search capabilities - Requires `vchord` extension #### **scann** (AlloyDB ScaNN) - Google's ScaNN index, available on **AlloyDB** and **AlloyDB Omni** - Uses a single global vector index in `AUTO` mode (per-bank partial indexes are not used) - **Installation:** `CREATE EXTENSION vector; CREATE EXTENSION alloydb_scann CASCADE;` - **Index build is deferred** until a 
table reaches **10,000 populated embedding rows** — AlloyDB cannot build a ScaNN AUTO index on a near-empty table. Until that threshold is crossed, recall falls back to a sequential scan; the global index is built on the next API startup once enough rows exist. - A ready-to-use Docker Compose stack is provided at [`docker/docker-compose/alloydb/docker-compose.yaml`](https://github.com/vectorize-io/hindsight/blob/main/docker/docker-compose/alloydb/docker-compose.yaml) for running Hindsight against AlloyDB Omni locally. **When to use pgvectorscale (DiskANN):** - Large datasets (10M+ vectors) ⭐ - Complex filtering requirements - Cost-sensitive deployments - Production workloads requiring high throughput - When disk I/O is not a bottleneck **When to use pgvector (HNSW):** - Small-medium datasets (<10M vectors) - Maximum query speed when all data fits in memory - Simple nearest-neighbor queries without filters - Standard PostgreSQL deployment preference **When to use vchord:** - High-dimensional embeddings (3000+ dimensions) - Want integrated BM25 search - Already using vchord for text search **When to use scann:** - Running on Google **AlloyDB** or **AlloyDB Omni** - Want managed ScaNN with `AUTO` mode tuning **Switching extensions:** If you need to switch from one extension to another: 1. Set `HINDSIGHT_API_VECTOR_EXTENSION` to your desired extension (`pgvector`, `vchord`, `pgvectorscale`, or `scann`) 2. If your database has existing data, you'll get an error with migration instructions (note: switching **to** `scann` is allowed even with data — the existing index is dropped and rebuilt as ScaNN once the table has at least 10,000 embedding rows) 3. For empty databases, indexes will be automatically recreated on startup **Learn more:** - [HNSW vs. 
DiskANN comparison](https://www.tigerdata.com/learn/hnsw-vs-diskann) - [pgvectorscale GitHub](https://github.com/timescale/pgvectorscale) ### Text Search Extension | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_TEXT_SEARCH_EXTENSION` | Text search backend: `native`, `vchord`, `pg_textsearch`, `pgroonga`, or `pg_search` | `native` | | `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE` | PostgreSQL text search dictionary used by the `native` backend (e.g. `english`, `french`, `simple`, `zhparser`) | `english` | | `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_PG_SEARCH_TOKENIZER` | ParadeDB `pg_search` tokenizer used when creating BM25 indexes. Empty uses ParadeDB's default tokenizer (`unicode_words`). | unset | | `HINDSIGHT_API_LLM_OUTPUT_LANGUAGE` | When set, forces every LLM-generated artifact (retain facts, consolidation observations, reflect responses) into this language. Free-form (e.g. `Spanish`, `Japanese`). | unset | Hindsight supports five backends for BM25 keyword retrieval: - **native** — PostgreSQL's built-in full-text search (`tsvector` + GIN). Language configurable. - **vchord** — VectorChord BM25 (uses the `llmlingua2` multilingual tokenizer). - **pg_textsearch** — Timescale's pg_textsearch extension. English-only. - **pgroonga** — pgroonga full-text search. Multilingual / CJK out of the box. - **pg_search** — ParadeDB pg_search. True BM25; the only backend that is Citus-compatible. To switch backends: set `HINDSIGHT_API_TEXT_SEARCH_EXTENSION`. With existing data, you'll get an error and migration instructions; with an empty database the columns/indexes are recreated automatically on startup. `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_PG_SEARCH_TOKENIZER` only applies when `HINDSIGHT_API_TEXT_SEARCH_EXTENSION=pg_search`, and only when BM25 indexes are created. Changing it for an existing database requires rebuilding the `pg_search` indexes or recreating the database. 
Supported values are empty/unset, `unicode_words`, `simple`, `whitespace`, `literal`, `literal_normalized`, `chinese_compatible`, `icu`, `jieba`, `source_code`, `chinese_lindera`/`lindera(chinese)`, `japanese_lindera`/`lindera(japanese)`, `korean_lindera`/`lindera(korean)`, `ngram(min,max)`, and `edge_ngram(min,max)`. For non-English banks (especially CJK) and the language/extraction-language tradeoffs, see the [Multilingual Support](./multilingual) page. ### LLM Provider | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_LLM_PROVIDER` | Provider: `openai`, `openai-codex`, `claude-code`, `anthropic`, `gemini`, `groq`, `minimax`, `deepseek`, `zai`, `opencode-go`, `nous`, `fireworks`, `ollama`, `ollama-cloud`, `lmstudio`, `llamacpp`, `vertexai`, `bedrock`, `litellm`, `litellmrouter`, `volcano`, `openrouter`, `requesty`, `none` | `openai` | | `HINDSIGHT_API_LLM_API_KEY` | API key for LLM provider | - | | `HINDSIGHT_API_LLM_MODEL` | Model name | `gpt-5-mini` | | `HINDSIGHT_API_LLM_BASE_URL` | Custom LLM endpoint | Provider default | | `HINDSIGHT_API_LLM_MAX_CONCURRENT` | Max concurrent LLM requests | `32` | | `HINDSIGHT_API_LLM_MAX_RETRIES` | Max retry attempts for LLM API calls | `3` | | `HINDSIGHT_API_LLM_INITIAL_BACKOFF` | Initial retry backoff in seconds (exponential backoff) | `1.0` | | `HINDSIGHT_API_LLM_MAX_BACKOFF` | Max retry backoff cap in seconds | `60.0` | | `HINDSIGHT_API_LLM_TIMEOUT` | LLM request timeout in seconds | `120` | | `HINDSIGHT_API_LLM_REASONING_EFFORT` | Reasoning effort for providers/models that support it (for example `low`, `medium`, `high`, `xhigh`) | `low` | | `HINDSIGHT_API_LLM_TEMPERATURE` | Global override for the sampling temperature of internal LLM calls. Set a number in `[0.0, 2.0]`, or `none` (also `default`/`off`/empty) to **omit** the temperature parameter entirely — required for models that reject explicit temperatures, e.g. Azure `gpt-5.5`, which only accepts its default value. 
Per-operation variables below override this. | Per-operation defaults | | `HINDSIGHT_API_LLM_TEMPERATURE_VERIFICATION` | Temperature for the startup connection check. Number in `[0.0, 2.0]` or `none` to omit. Overrides `HINDSIGHT_API_LLM_TEMPERATURE`. | `0.0` | | `HINDSIGHT_API_LLM_TEMPERATURE_RETAIN` | Temperature for fact extraction during retain. Number in `[0.0, 2.0]` or `none` to omit. Overrides `HINDSIGHT_API_LLM_TEMPERATURE`. | `0.1` | | `HINDSIGHT_API_LLM_TEMPERATURE_REFLECT` | Temperature for the reflect "thinking" step. Number in `[0.0, 2.0]` or `none` to omit. Overrides `HINDSIGHT_API_LLM_TEMPERATURE`. | `0.9` | | `HINDSIGHT_API_LLM_TEMPERATURE_CONSOLIDATION` | Temperature for consolidation (mental-model delta and dedup). Number in `[0.0, 2.0]` or `none` to omit. Overrides `HINDSIGHT_API_LLM_TEMPERATURE`. | `0.0` | | `HINDSIGHT_API_LLM_SEND_BANK_AS_USER` | Tag outbound LLM and embedding calls with `user=` so gateways (OpenRouter usage accounting, LiteLLM, Helicone) can attribute spend per bank. When enabled, the bank id is transmitted to the upstream provider as the end-user identifier. | `false` | | `HINDSIGHT_API_LLM_GROQ_SERVICE_TIER` | Groq service tier: `on_demand`, `flex`, `auto` | `auto` | | `HINDSIGHT_API_LLM_OPENAI_SERVICE_TIER` | OpenAI service tier: `flex` for 50% cost savings (OpenAI Flex Processing) | None (default) | | `HINDSIGHT_API_LLM_BEDROCK_SERVICE_TIER` | Bedrock service tier: `flex` for 50% cost savings (best-effort inference), `priority` (guaranteed throughput), or `reserved` (provisioned capacity) | Unset (default tier) | | `HINDSIGHT_API_LLM_GEMINI_SERVICE_TIER` | Gemini service tier: `flex` for 50% cost savings (best-effort inference) | Unset (default tier) | | `HINDSIGHT_API_LLM_EXTRA_BODY` | JSON dict of extra request-body params (e.g. `temperature`, `top_p`, `max_tokens`) merged into every LLM call. Applied across the OpenAI-compatible, Fireworks, Anthropic, Gemini/VertexAI and LiteLLM (incl. Bedrock/Router) providers. 
Each provider merges them in its own native parameter space, so use that provider's field names (e.g. `max_tokens` for OpenAI/Anthropic vs `max_output_tokens` for Gemini). Also useful for custom model servers (e.g. vLLM `chat_template_kwargs`). | `null` | | `HINDSIGHT_API_LLM_DEFAULT_HEADERS` | JSON dict passed as `default_headers` to provider SDK clients. Used by operators routing through proxies / request-tracing middleware (e.g. Cloudflare AI Gateway, Helicone, corporate proxies). Currently wired into the Anthropic provider; other providers can opt in. | `null` | | `HINDSIGHT_API_LLM_STRICT_SCHEMA` | Grammar-enforce structured output via `json_schema` `strict: true` instead of the soft "schema-in-prompt + `json_object`" path. Use it with weaker self-hosted models that return prose preambles, markdown ` ```json ` fences, or invalid JSON — which otherwise fail to parse and wedge retain/consolidation. Applies to OpenAI-compatible backends (OpenAI, llama.cpp, vLLM) and LiteLLM; Gemini already enforces its native `response_schema` regardless, and providers without a strict mode ignore it. | `false` | | `HINDSIGHT_API_LLM_GEMINI_SAFETY_SETTINGS` | JSON-encoded list of `{category, threshold}` dicts for Gemini/VertexAI content safety filtering | `null` | | `HINDSIGHT_API_LLM_PROMPT_CACHE_ENABLED` | Reuse the fixed system prefix via the provider's explicit prompt cache, billed at the cached-input rate (Gemini/Vertex `CachedContent`). The cached prefix is shared across all banks and soft-fails to an uncached call. Set to `false` to disable. See [Models](./models#provider-capabilities). 
| `true` | **Provider Examples** ```bash # Groq (recommended for fast inference) export HINDSIGHT_API_LLM_PROVIDER=groq export HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=openai/gpt-oss-20b # For free tier users: override to on_demand if you get service_tier errors # export HINDSIGHT_API_LLM_GROQ_SERVICE_TIER=on_demand # OpenAI export HINDSIGHT_API_LLM_PROVIDER=openai export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=gpt-4o # Optional: Use Flex Processing for 50% cost savings (with variable latency) # export HINDSIGHT_API_LLM_OPENAI_SERVICE_TIER=flex # Gemini export HINDSIGHT_API_LLM_PROVIDER=gemini export HINDSIGHT_API_LLM_API_KEY=xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=gemini-2.0-flash # Optional: Use Gemini Flex for 50% cost savings (best-effort inference) # export HINDSIGHT_API_LLM_GEMINI_SERVICE_TIER=flex # Anthropic export HINDSIGHT_API_LLM_PROVIDER=anthropic export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=claude-sonnet-4-20250514 # Vertex AI (Google Cloud - uses native genai SDK) export HINDSIGHT_API_LLM_PROVIDER=vertexai export HINDSIGHT_API_LLM_MODEL=gemini-2.0-flash-001 export HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID=your-gcp-project-id export HINDSIGHT_API_LLM_VERTEXAI_REGION=us-central1 # Optional: use ADC (gcloud auth application-default login) or provide service account key: # export HINDSIGHT_API_LLM_VERTEXAI_SERVICE_ACCOUNT_KEY=/path/to/service-account-key.json # Ollama (local, no API key) export HINDSIGHT_API_LLM_PROVIDER=ollama export HINDSIGHT_API_LLM_BASE_URL=http://localhost:11434/v1 export HINDSIGHT_API_LLM_MODEL=llama3 # Ollama Cloud (hosted Ollama endpoint, requires API key) export HINDSIGHT_API_LLM_PROVIDER=ollama-cloud export HINDSIGHT_API_LLM_API_KEY=your-ollama-cloud-api-key export HINDSIGHT_API_LLM_MODEL=gemma3:12b # LM Studio (local, no API key) export HINDSIGHT_API_LLM_PROVIDER=lmstudio export 
HINDSIGHT_API_LLM_BASE_URL=http://localhost:1234/v1 export HINDSIGHT_API_LLM_MODEL=your-local-model # llama.cpp (built-in local inference, no external server needed) export HINDSIGHT_API_LLM_PROVIDER=llamacpp # No API key, base URL, or external server required. # Auto-downloads Gemma 4 E2B (~3.5 GB GGUF) on first run. # See "Built-in llama.cpp" section below for all configuration options. # OpenAI-compatible endpoint export HINDSIGHT_API_LLM_PROVIDER=openai export HINDSIGHT_API_LLM_BASE_URL=https://your-endpoint.com/v1 export HINDSIGHT_API_LLM_API_KEY=your-api-key export HINDSIGHT_API_LLM_MODEL=your-model-name # OpenAI Codex (ChatGPT Plus/Pro subscription - uses OAuth, no API key needed) export HINDSIGHT_API_LLM_PROVIDER=openai-codex export HINDSIGHT_API_LLM_MODEL=gpt-5.4-mini # No API key needed - uses OAuth tokens from ~/.codex/auth.json # For long-running services, set CODEX_HOME to a dedicated auth directory so # Hindsight doesn't share (and lose) its refresh token with another Codex process. # See Models docs → "Isolating Codex auth for long-running services". 
# export CODEX_HOME=/var/lib/hindsight/codex # Claude Code (Claude Pro/Max subscription - uses OAuth, no API key needed) export HINDSIGHT_API_LLM_PROVIDER=claude-code export HINDSIGHT_API_LLM_MODEL=claude-sonnet-4-5-20250929 # No API key needed - uses claude auth login credentials # Volcano Engine (ByteDance - OpenAI-compatible) export HINDSIGHT_API_LLM_PROVIDER=volcano export HINDSIGHT_API_LLM_API_KEY=your-api-key export HINDSIGHT_API_LLM_BASE_URL=https://ark.cn-beijing.volces.com/api/v3 export HINDSIGHT_API_LLM_MODEL=doubao-pro-32k # OpenRouter (OpenAI-compatible, access 100+ models) export HINDSIGHT_API_LLM_PROVIDER=openrouter export HINDSIGHT_API_LLM_API_KEY=your-openrouter-api-key export HINDSIGHT_API_LLM_MODEL=qwen/qwen3.5-9b # Requesty (OpenAI-compatible gateway) export HINDSIGHT_API_LLM_PROVIDER=requesty export HINDSIGHT_API_LLM_API_KEY=your-requesty-api-key export HINDSIGHT_API_LLM_MODEL=openai/gpt-4o-mini # DeepSeek (OpenAI-compatible, https://api.deepseek.com) export HINDSIGHT_API_LLM_PROVIDER=deepseek export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=deepseek-v4-flash # Notes: # - `deepseek-v4-flash` defaults to thinking mode at the API level (treated as # `deepseek-reasoner`). Hindsight handles this transparently; the reflect # agent will not crash with "deepseek-reasoner does not support this # tool_choice". # - Use `deepseek-v4-pro` for the higher-quality reasoning route. # - Use `deepseek-chat` for the non-thinking alias (faster, cheaper). 
# z.ai (Zhipu GLM series, OpenAI-compatible, https://z.ai) export HINDSIGHT_API_LLM_PROVIDER=zai export HINDSIGHT_API_LLM_API_KEY=your-zai-api-key export HINDSIGHT_API_LLM_MODEL=glm-4.5-flash # or glm-4.5-air for the paid tier # Default base_url: https://api.z.ai/api/coding/paas/v4 (override with HINDSIGHT_API_LLM_BASE_URL if needed) # opencode-go (OpenAI-compatible) export HINDSIGHT_API_LLM_PROVIDER=opencode-go export HINDSIGHT_API_LLM_API_KEY=your-opencode-go-api-key export HINDSIGHT_API_LLM_MODEL=deepseek-v4-flash # Default base_url: https://opencode.ai/zen/go/v1 (override with HINDSIGHT_API_LLM_BASE_URL if needed) # Nous Portal (OpenAI-compatible; no API key — uses your `hermes portal` login) export HINDSIGHT_API_LLM_PROVIDER=nous export HINDSIGHT_API_LLM_MODEL=deepseek/deepseek-v4-flash # No API key needed — reads a rotating JWT from ~/.hermes/auth.json (run `hermes portal` first). # Default base_url: https://inference-api.nousresearch.com/v1 (override with HINDSIGHT_API_LLM_BASE_URL if needed) # See the "Nous Portal Setup" section in the Models guide for the login flow. 
# AWS Bedrock (native support - no API key needed, uses AWS credentials) export HINDSIGHT_API_LLM_PROVIDER=bedrock export HINDSIGHT_API_LLM_MODEL=us.amazon.nova-2-lite-v1:0 export AWS_ACCESS_KEY_ID=your-access-key export AWS_SECRET_ACCESS_KEY=your-secret-key export AWS_REGION_NAME=us-east-1 # Optional: Use Flex tier for 50% cost savings (with variable latency) # export HINDSIGHT_API_LLM_BEDROCK_SERVICE_TIER=flex # LiteLLM (100+ providers via LiteLLM SDK) # Azure OpenAI via LiteLLM export HINDSIGHT_API_LLM_PROVIDER=litellm export HINDSIGHT_API_LLM_API_KEY=your-azure-api-key export HINDSIGHT_API_LLM_MODEL=azure/gpt-4o # Together AI via LiteLLM export HINDSIGHT_API_LLM_PROVIDER=litellm export HINDSIGHT_API_LLM_API_KEY=your-together-api-key export HINDSIGHT_API_LLM_MODEL=together_ai/meta-llama/Llama-3-70b-chat-hf # No LLM (chunk storage + semantic search only, no API key needed) export HINDSIGHT_API_LLM_PROVIDER=none # Retain automatically uses chunks mode (no fact extraction) # Recall works normally (semantic search, BM25, graph retrieval) # Reflect returns HTTP 400 (requires an LLM) # Consolidation/observations are disabled ``` :::tip OpenAI Codex, Claude Code & Vertex AI Setup For detailed setup instructions for **OpenAI Codex** (ChatGPT Plus/Pro), **Claude Code** (Claude Pro/Max), and **Vertex AI** (Google Cloud), see the [Models documentation](./models#openai-codex-setup-chatgpt-pluspro). ::: ### LLM Router (LiteLLM Router) `HINDSIGHT_API_LLM_PROVIDER=litellmrouter` runs the default LLM through [LiteLLM's `Router`](https://docs.litellm.ai/docs/routing). The config JSON is forwarded verbatim — for fallback chains, load-balancing, rate limits, routing strategies, and the rest of the supported keys, see the [LiteLLM Router docs](https://docs.litellm.ai/docs/routing). Hindsight always issues completions against `model_name: "default"`, so include at least one entry with that name. 
| Variable | Description |
|----------|-------------|
| `HINDSIGHT_API_LLM_LITELLMROUTER_CONFIG` | JSON object passed to `litellm.Router(**config)`. Required when provider is `litellmrouter`. |
| `HINDSIGHT_API_{RETAIN,REFLECT,CONSOLIDATION}_LLM_LITELLMROUTER_CONFIG` | Per-operation overrides. Fall back to the default config when unset. |

```bash
export HINDSIGHT_API_LLM_PROVIDER=litellmrouter
export HINDSIGHT_API_LLM_LITELLMROUTER_CONFIG='{
  "model_list": [
    {"model_name": "default", "litellm_params": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."}},
    {"model_name": "fallback", "litellm_params": {"model": "anthropic/claude-sonnet-4-5", "api_key": "sk-ant-..."}}
  ],
  "fallbacks": [{"default": ["fallback"]}]
}'
```

The config is a credential field and is never returned by the bank-config API. Hindsight already retries calls; set `"num_retries": 0` in the Router config to avoid double-retries. Batch APIs aren't supported in router mode.

### Multi-LLM Strategies (failover / round-robin)

Configure additional LLMs **by index** alongside the primary, then choose a strategy for routing across them. This is a provider-agnostic alternative to the LiteLLM Router: the indexed LLMs can be any mix of providers, each fully configured.

The unindexed `HINDSIGHT_API_LLM_*` config is the **primary** (member 0). Extra members are numbered from 1:

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_LLM_<n>_PROVIDER` | Provider for extra member `n` (`n` = 1, 2, ...). Presence of this var defines the member; indices must be contiguous from 1. | - |
| `HINDSIGHT_API_LLM_<n>_API_KEY` | API key for member `n` (required unless the provider needs none). | - |
| `HINDSIGHT_API_LLM_<n>_MODEL` | Model for member `n`. | Provider default |
| `HINDSIGHT_API_LLM_<n>_BASE_URL` | Base URL for member `n`. | Provider default |
| `HINDSIGHT_API_LLM_<n>_REASONING_EFFORT` | Reasoning effort for member `n`. | `HINDSIGHT_API_LLM_REASONING_EFFORT` |
| `HINDSIGHT_API_LLM_<n>_EXTRA_BODY` / `_DEFAULT_HEADERS` | Per-member JSON overrides. | - |
| `HINDSIGHT_API_LLM_<n>_BEDROCK_SERVICE_TIER` / `_GEMINI_SERVICE_TIER` | Per-member service tier. | - |
| `HINDSIGHT_API_LLM_<n>_VERTEXAI_PROJECT_ID` / `_VERTEXAI_REGION` / `_VERTEXAI_SERVICE_ACCOUNT_KEY` | Per-member Vertex AI project, region, and service-account key path (for a `vertexai` member). Each falls back to the global `HINDSIGHT_API_LLM_VERTEXAI_*` when unset. | Global / `us-central1` / ADC |
| `HINDSIGHT_API_LLM_<n>_LITELLMROUTER_CONFIG` | Per-member LiteLLM Router config JSON (for a `litellmrouter` member). Falls back to the global `HINDSIGHT_API_LLM_LITELLMROUTER_CONFIG` when unset. | - |
| `HINDSIGHT_API_LLM_STRATEGY` | JSON routing strategy across the chain. Unset = single primary LLM (no change). | - |

The strategy JSON supports two modes:

- `{"mode": "failover"}`: try members in order (primary first); on a member's failure (after its own retries) advance to the next.
- `{"mode": "round-robin"}`: rotate the starting member per request to spread load, then fall through the rest on failure. Add `"weights": [3, 1, ...]` (positive ints, one per member, primary first) for an **unbalanced** rotation.

```bash
# Primary OpenAI, failover to Groq then Anthropic
export HINDSIGHT_API_LLM_PROVIDER=openai
export HINDSIGHT_API_LLM_API_KEY=sk-...
export HINDSIGHT_API_LLM_1_PROVIDER=groq
export HINDSIGHT_API_LLM_1_API_KEY=gsk_...
export HINDSIGHT_API_LLM_2_PROVIDER=anthropic
export HINDSIGHT_API_LLM_2_API_KEY=sk-ant-...
export HINDSIGHT_API_LLM_STRATEGY='{"mode": "failover"}'

# Weighted round-robin for a two-member chain (primary + member 1):
# serve the primary 3x as often as member 1
export HINDSIGHT_API_LLM_STRATEGY='{"mode": "round-robin", "weights": [3, 1]}'
```

**Per-operation chains.** Each operation can define its own members + strategy with the `RETAIN` / `REFLECT` / `CONSOLIDATION` prefix (e.g. `HINDSIGHT_API_RETAIN_LLM_1_PROVIDER`, `HINDSIGHT_API_RETAIN_LLM_STRATEGY`).
A per-operation slot with no indexed members (or no strategy) inherits the global chain. The indexed members are credential fields — never returned by the bank-config API and server-level only (not per-bank configurable). **Batch retain** runs on the primary member only; failover/round-robin apply to the interactive retain/reflect/consolidation calls. ### Built-in llama.cpp The `llamacpp` provider runs a llama.cpp server as a managed subprocess — no external LLM server needed. On first run it auto-downloads a default GGUF model (~3.5 GB). Requires the `local-llm` extra: `pip install 'hindsight-api-slim[local-llm]'`. | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_LLAMACPP_MODEL_PATH` | Path to a GGUF model file. If not set, auto-downloads `gemma-4-E2B-it-Q4_K_M` from HuggingFace. | Auto-download | | `HINDSIGHT_API_LLAMACPP_GPU_LAYERS` | Number of layers to offload to GPU. `-1` = all layers (recommended). `0` = CPU only. | `-1` | | `HINDSIGHT_API_LLAMACPP_CONTEXT_SIZE` | Context window size in tokens. | `8192` | | `HINDSIGHT_API_LLAMACPP_CHAT_FORMAT` | Chat template format. `null` = auto-detect from GGUF metadata (recommended). | Auto-detect | | `HINDSIGHT_API_LLAMACPP_NO_GRAMMAR` | Disable JSON grammar enforcement. Faster inference but less reliable JSON output. | `false` | | `HINDSIGHT_API_LLAMACPP_EXTRA_ARGS` | Space-separated extra CLI args passed to the llama.cpp server (e.g. `--n_threads 8 --type_k 1`). 
| - | ```bash # Minimal setup (auto-downloads model, uses GPU) export HINDSIGHT_API_LLM_PROVIDER=llamacpp # Custom model with tuning export HINDSIGHT_API_LLM_PROVIDER=llamacpp export HINDSIGHT_API_LLM_MAX_CONCURRENT=2 export HINDSIGHT_API_LLAMACPP_MODEL_PATH=~/.hindsight/models/my-model.gguf export HINDSIGHT_API_LLAMACPP_CONTEXT_SIZE=16384 export HINDSIGHT_API_LLAMACPP_NO_GRAMMAR=true # faster, less reliable JSON export HINDSIGHT_API_LLAMACPP_EXTRA_ARGS="--n_threads 8" ``` :::note The llama.cpp server is shared across all LLM operations (retain, reflect, consolidation). Set `HINDSIGHT_API_LLM_MAX_CONCURRENT=2` to allow retain and consolidation to run concurrently without blocking each other. ::: ### Per-Operation LLM Configuration Different memory operations have different requirements. **Retain** (fact extraction) benefits from models with strong structured output capabilities, while **Reflect** (reasoning/response generation) can use lighter, faster models. Configure separate LLM models for each operation to optimize for cost and performance. | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_RETAIN_LLM_PROVIDER` | LLM provider for retain operations | Falls back to `HINDSIGHT_API_LLM_PROVIDER` | | `HINDSIGHT_API_RETAIN_LLM_API_KEY` | API key for retain LLM | Falls back to `HINDSIGHT_API_LLM_API_KEY` | | `HINDSIGHT_API_RETAIN_LLM_MODEL` | Model for retain operations | Falls back to `HINDSIGHT_API_LLM_MODEL` | | `HINDSIGHT_API_RETAIN_LLM_BASE_URL` | Base URL for retain LLM | Falls back to `HINDSIGHT_API_LLM_BASE_URL` | | `HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT` | Extra cap on concurrent retain LLM requests, composed with the global cap. Unset → only the global cap applies. 
| Unset | | `HINDSIGHT_API_RETAIN_LLM_MAX_RETRIES` | Max retries for retain | Falls back to `HINDSIGHT_API_LLM_MAX_RETRIES` | | `HINDSIGHT_API_RETAIN_LLM_INITIAL_BACKOFF` | Initial backoff for retain retries (seconds) | Falls back to `HINDSIGHT_API_LLM_INITIAL_BACKOFF` | | `HINDSIGHT_API_RETAIN_LLM_MAX_BACKOFF` | Max backoff cap for retain retries (seconds) | Falls back to `HINDSIGHT_API_LLM_MAX_BACKOFF` | | `HINDSIGHT_API_RETAIN_LLM_TIMEOUT` | Timeout for retain requests (seconds) | Falls back to `HINDSIGHT_API_LLM_TIMEOUT` | | `HINDSIGHT_API_REFLECT_LLM_PROVIDER` | LLM provider for reflect operations | Falls back to `HINDSIGHT_API_LLM_PROVIDER` | | `HINDSIGHT_API_REFLECT_LLM_API_KEY` | API key for reflect LLM | Falls back to `HINDSIGHT_API_LLM_API_KEY` | | `HINDSIGHT_API_REFLECT_LLM_MODEL` | Model for reflect operations | Falls back to `HINDSIGHT_API_LLM_MODEL` | | `HINDSIGHT_API_REFLECT_LLM_BASE_URL` | Base URL for reflect LLM | Falls back to `HINDSIGHT_API_LLM_BASE_URL` | | `HINDSIGHT_API_REFLECT_LLM_MAX_CONCURRENT` | Extra cap on concurrent reflect LLM requests, composed with the global cap. Unset → only the global cap applies. 
| Unset | | `HINDSIGHT_API_REFLECT_LLM_MAX_RETRIES` | Max retries for reflect | Falls back to `HINDSIGHT_API_LLM_MAX_RETRIES` | | `HINDSIGHT_API_REFLECT_LLM_INITIAL_BACKOFF` | Initial backoff for reflect retries (seconds) | Falls back to `HINDSIGHT_API_LLM_INITIAL_BACKOFF` | | `HINDSIGHT_API_REFLECT_LLM_MAX_BACKOFF` | Max backoff cap for reflect retries (seconds) | Falls back to `HINDSIGHT_API_LLM_MAX_BACKOFF` | | `HINDSIGHT_API_REFLECT_LLM_TIMEOUT` | Timeout for reflect requests (seconds) | Falls back to `HINDSIGHT_API_LLM_TIMEOUT` | | `HINDSIGHT_API_CONSOLIDATION_LLM_PROVIDER` | LLM provider for observation consolidation | Falls back to `HINDSIGHT_API_LLM_PROVIDER` | | `HINDSIGHT_API_CONSOLIDATION_LLM_API_KEY` | API key for consolidation LLM | Falls back to `HINDSIGHT_API_LLM_API_KEY` | | `HINDSIGHT_API_CONSOLIDATION_LLM_MODEL` | Model for consolidation operations | Falls back to `HINDSIGHT_API_LLM_MODEL` | | `HINDSIGHT_API_CONSOLIDATION_LLM_BASE_URL` | Base URL for consolidation LLM | Falls back to `HINDSIGHT_API_LLM_BASE_URL` | | `HINDSIGHT_API_CONSOLIDATION_LLM_MAX_CONCURRENT` | Extra cap on concurrent consolidation LLM requests, composed with the global cap. Unset → only the global cap applies. 
| Unset | | `HINDSIGHT_API_CONSOLIDATION_LLM_MAX_RETRIES` | Max retries for consolidation | Falls back to `HINDSIGHT_API_LLM_MAX_RETRIES` | | `HINDSIGHT_API_CONSOLIDATION_LLM_INITIAL_BACKOFF` | Initial backoff for consolidation retries (seconds) | Falls back to `HINDSIGHT_API_LLM_INITIAL_BACKOFF` | | `HINDSIGHT_API_CONSOLIDATION_LLM_MAX_BACKOFF` | Max backoff cap for consolidation retries (seconds) | Falls back to `HINDSIGHT_API_LLM_MAX_BACKOFF` | | `HINDSIGHT_API_CONSOLIDATION_LLM_TIMEOUT` | Timeout for consolidation requests (seconds) | Falls back to `HINDSIGHT_API_LLM_TIMEOUT` | :::tip When to Use Per-Operation Config - **Retain**: Use models with strong structured output (e.g., GPT-4o, Claude) for accurate fact extraction - **Reflect**: Use faster/cheaper models (e.g., GPT-4o-mini, Groq) for reasoning and response generation - **Recall**: Does not use LLM (pure retrieval), so no configuration needed ::: **Example: Separate Models for Retain and Reflect** ```bash # Default LLM (used as fallback) export HINDSIGHT_API_LLM_PROVIDER=openai export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=gpt-4o # Use GPT-4o for retain (strong structured output) export HINDSIGHT_API_RETAIN_LLM_MODEL=gpt-4o # Use faster/cheaper model for reflect export HINDSIGHT_API_REFLECT_LLM_PROVIDER=groq export HINDSIGHT_API_REFLECT_LLM_API_KEY=gsk_xxxxxxxxxxxx export HINDSIGHT_API_REFLECT_LLM_MODEL=llama-3.3-70b-versatile ``` **Example: Tuning Retry Behavior for Rate-Limited APIs** ```bash # For Anthropic with tight rate limits (10k output tokens/minute) export HINDSIGHT_API_LLM_PROVIDER=anthropic export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=claude-sonnet-4-20250514 # Reduce concurrent requests for retain to avoid rate limits export HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT=3 # Fail faster with fewer retries export HINDSIGHT_API_RETAIN_LLM_MAX_RETRIES=3 # Or increase backoff times to wait out rate limit windows export 
HINDSIGHT_API_RETAIN_LLM_INITIAL_BACKOFF=2.0 # Start at 2s instead of 1s export HINDSIGHT_API_RETAIN_LLM_MAX_BACKOFF=120.0 # Cap at 2min instead of 1min ``` :::note Per-operation concurrency composes with the global cap `HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT`, `HINDSIGHT_API_REFLECT_LLM_MAX_CONCURRENT`, and `HINDSIGHT_API_CONSOLIDATION_LLM_MAX_CONCURRENT` add an extra cap that applies *on top of* `HINDSIGHT_API_LLM_MAX_CONCURRENT`. A retain call counts against both the retain cap and the global cap; a reflect call without a per-op cap is bounded only by the global cap. To reserve headroom for live chat/reflect on a rate-limited provider, cap retain and consolidation below the global value — e.g. global=4, retain=1, consolidation=1 leaves two slots that retain/consolidation cannot consume. Unlike the per-operation timeout and retry/backoff knobs, the `*_LLM_MAX_CONCURRENT` caps are process-global semaphores read from the environment once at startup. They are server-level only (not overridable per tenant/bank) and a change requires a restart. ::: ### Embeddings | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_EMBEDDINGS_PROVIDER` | Provider: `local`, `onnx`, `tei`, `openai`, `openai-codex`, `openrouter`, `requesty`, `cohere`, `google`, `zeroentropy`, `litellm`, or `litellm-sdk` | `local` | | `HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL` | Model for local provider | `BAAI/bge-small-en-v1.5` | | `HINDSIGHT_API_EMBEDDINGS_LOCAL_TRUST_REMOTE_CODE` | Allow loading models with custom code (security risk, disabled by default) | `false` | | `HINDSIGHT_API_EMBEDDINGS_LOCAL_FORCE_CPU` | Force CPU mode for local embeddings (avoids MPS/XPC issues on macOS) | `false` | | `HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID` | Hugging Face model repo for the ONNX provider. Used for auto-download and as the tokenizer fallback. | `intfloat/multilingual-e5-small` | | `HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_PATH` | Local path to the ONNX graph. 
When unset, Hindsight downloads `HINDSIGHT_API_EMBEDDINGS_ONNX_FILE` from `HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID`. | - | | `HINDSIGHT_API_EMBEDDINGS_ONNX_TOKENIZER_NAME_OR_PATH` | Hugging Face tokenizer repo or local tokenizer directory. Set this when using `HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_PATH`. | Falls back to `HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID` | | `HINDSIGHT_API_EMBEDDINGS_ONNX_FILE` | ONNX file path inside the Hugging Face repo. Hindsight also downloads the conventional external-data sidecar with `_data` suffix when present. | `onnx/model.onnx` | | `HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS` | Expected embedding dimensions. Startup fails if the loaded model returns a different size. | Auto-detected | | `HINDSIGHT_API_EMBEDDINGS_ONNX_MAX_TOKENS` | Max tokenizer length for ONNX embeddings. | `512` | | `HINDSIGHT_API_EMBEDDINGS_ONNX_POOLING` | Pooling strategy for token embeddings: `mean` or `cls`. Ignored when the ONNX graph returns a pre-pooled 2-D embedding output. | `mean` | | `HINDSIGHT_API_EMBEDDINGS_ONNX_NORMALIZE` | L2-normalize ONNX vectors before storage. | `true` | | `HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX` | Prefix applied to query/search text before ONNX embedding. Keep `query: ` for E5 models; set to empty for non-E5 models such as MiniLM or BGE. | `query: ` | | `HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX` | Prefix applied to stored memory/document text before ONNX embedding. Keep `passage: ` for E5 models; set to empty for non-E5 models such as MiniLM or BGE. | `passage: ` | | `HINDSIGHT_API_EMBEDDINGS_ONNX_OUTPUT_NAME` | Optional ONNX output name to request when an exported graph exposes a pooled embedding output. 
| - | | `HINDSIGHT_API_EMBEDDINGS_TEI_URL` | TEI server URL | - | | `HINDSIGHT_API_EMBEDDINGS_OPENAI_API_KEY` | OpenAI API key (falls back to `HINDSIGHT_API_LLM_API_KEY`) | - | | `HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL` | OpenAI embedding model | `text-embedding-3-small` | | `HINDSIGHT_API_EMBEDDINGS_OPENAI_BASE_URL` | Custom base URL for OpenAI-compatible API (e.g., Azure OpenAI) | - | | `HINDSIGHT_API_EMBEDDINGS_OPENAI_BATCH_SIZE` | Max inputs per `embeddings.create` call for `openai`/`openrouter` providers — lower this when the upstream endpoint enforces stricter limits (e.g. DashScope caps at 10) | `100` | | `HINDSIGHT_API_EMBEDDINGS_OPENAI_DIMENSIONS` | Optional requested output dimensions for OpenAI `text-embedding-3` models (e.g., `384` to match an existing pgvector schema) | - | | `HINDSIGHT_API_EMBEDDINGS_OPENROUTER_API_KEY` | OpenRouter API key for embeddings (falls back to `HINDSIGHT_API_OPENROUTER_API_KEY`, then `HINDSIGHT_API_LLM_API_KEY`) | - | | `HINDSIGHT_API_EMBEDDINGS_REQUESTY_API_KEY` | Requesty API key for embeddings (falls back to `HINDSIGHT_API_REQUESTY_API_KEY`, then `HINDSIGHT_API_LLM_API_KEY`) | - | | `HINDSIGHT_API_EMBEDDINGS_REQUESTY_MODEL` | Requesty embedding model | `openai/text-embedding-3-small` | | `HINDSIGHT_API_EMBEDDINGS_OPENROUTER_MODEL` | OpenRouter embedding model | `perplexity/pplx-embed-v1-0.6b` | | `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_API_KEY` | ZeroEntropy API key for embeddings | - | | `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_MODEL` | ZeroEntropy embedding model | `zembed-1` | | `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_BASE_URL` | Custom base URL for ZeroEntropy-compatible API | `https://api.zeroentropy.dev` | | `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_DIMENSIONS` | Output dimensions for `zembed-1`. Supported values: `2560`, `1280`, `640`, `320`, `160`, `80`, `40` | `1280` | | `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_ENCODING_FORMAT` | Response encoding: `float` or `base64`. 
Hindsight decodes either format to float vectors before storage. | `float` | | `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_BATCH_SIZE` | Max inputs per ZeroEntropy embed request | `100` | | `HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_LATENCY` | Optional latency mode: `fast` or `slow`. Leave unset to use ZeroEntropy's default routing. | - | | `HINDSIGHT_API_EMBEDDINGS_COHERE_API_KEY` | Cohere API key for embeddings (falls back to `HINDSIGHT_API_COHERE_API_KEY`) | - | | `HINDSIGHT_API_EMBEDDINGS_COHERE_MODEL` | Cohere embedding model | `embed-english-v3.0` | | `HINDSIGHT_API_EMBEDDINGS_COHERE_BASE_URL` | Custom base URL for Cohere-compatible API (e.g., Azure-hosted) | - | | `HINDSIGHT_API_EMBEDDINGS_COHERE_OUTPUT_DIMENSIONS` | Output embedding dimensions for Cohere (e.g., `256`, `512`, `1024`). When set, overrides the model's default dimension. | - | | `HINDSIGHT_API_EMBEDDINGS_LITELLM_API_BASE` | LiteLLM proxy base URL for embeddings (falls back to `HINDSIGHT_API_LITELLM_API_BASE`) | `http://localhost:4000` | | `HINDSIGHT_API_EMBEDDINGS_LITELLM_API_KEY` | LiteLLM proxy API key for embeddings (optional, depends on proxy config; falls back to `HINDSIGHT_API_LITELLM_API_KEY`) | - | | `HINDSIGHT_API_EMBEDDINGS_LITELLM_MODEL` | LiteLLM embedding model (use provider prefix, e.g., `cohere/embed-english-v3.0`) | `text-embedding-3-small` | | `HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_API_KEY` | LiteLLM SDK API key for direct embedding provider access (optional — omit for providers that use ambient credentials, e.g. 
AWS Bedrock with IAM) | - | | `HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_MODEL` | LiteLLM SDK embedding model (use provider prefix, e.g., `cohere/embed-english-v3.0`) | `cohere/embed-english-v3.0` | | `HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_API_BASE` | Custom base URL for LiteLLM SDK embeddings (optional) | - | | `HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_OUTPUT_DIMENSIONS` | Optional output embedding dimensions (provider-dependent, e.g., `768` for Gemini embedding models) | - | | `HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_ENCODING_FORMAT` | Encoding format for embedding responses. Set to empty string to omit the parameter (needed for Voyage AI, Gemini). | `float` | | `HINDSIGHT_API_EMBEDDINGS_GEMINI_API_KEY` | Gemini API key for embeddings (falls back to `HINDSIGHT_API_LLM_API_KEY`) | - | | `HINDSIGHT_API_EMBEDDINGS_GEMINI_MODEL` | Gemini embedding model. The `gemini-embedding-2` family (e.g. `gemini-embedding-2-preview`) is supported on both the Gemini API and Vertex AI — because these multimodal models aggregate a multi-input request into one embedding, Hindsight automatically embeds one input per call to keep per-fact vectors. | `gemini-embedding-001` | | `HINDSIGHT_API_EMBEDDINGS_GEMINI_OUTPUT_DIMENSIONALITY` | Output embedding dimensions (Gemini supports configurable dimensionality) | `768` | | `HINDSIGHT_API_EMBEDDINGS_GEMINI_FORCE_IPV4` | Force the Gemini embeddings client to use an IPv4-only HTTP transport. Useful in environments where IPv6 egress is broken (e.g. some Docker/VPC setups) and AAAA DNS records cause long hangs. 
| `false` | | `HINDSIGHT_API_EMBEDDINGS_VERTEXAI_PROJECT_ID` | Vertex AI project ID for embeddings (falls back to `HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID`) | - | | `HINDSIGHT_API_EMBEDDINGS_VERTEXAI_REGION` | Vertex AI region for embeddings (falls back to `HINDSIGHT_API_LLM_VERTEXAI_REGION`) | - | | `HINDSIGHT_API_EMBEDDINGS_VERTEXAI_SERVICE_ACCOUNT_KEY` | Service account key for Vertex AI embeddings (falls back to `HINDSIGHT_API_LLM_VERTEXAI_SERVICE_ACCOUNT_KEY`) | - | Embedding provider selection, credentials, base URLs, model choices, dimensions, encoding format, batch sizes, and latency modes are static server-level settings. They are not hierarchical per-bank overrides. The ONNX settings above are also static, matching the existing `embeddings_local_*` settings. #### Local ONNX embeddings The ONNX provider runs embedding models in-process with ONNX Runtime. Install the optional deps when building your own API environment: ```bash pip install 'hindsight-api-slim[local-onnx]' # or, in this repository: uv sync --project hindsight-api-slim --extra local-onnx ``` You can either let Hindsight download the model from Hugging Face at startup by setting `HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID`, or pre-download the ONNX graph and tokenizer files under the Hindsight repository root. 
```bash
cd /path/to/hindsight
mkdir -p models

# Export both variables so the Python heredoc below can read them via os.environ
export MODEL_ID=intfloat/multilingual-e5-small
export MODEL_DIR=models/intfloat__multilingual-e5-small

uv run --project hindsight-api-slim --extra local-onnx python - <<'PY'
import os

from huggingface_hub import snapshot_download

# Fetch only the ONNX graph, its external-data sidecar, and the tokenizer/config files
snapshot_download(
    repo_id=os.environ["MODEL_ID"],
    local_dir=os.environ["MODEL_DIR"],
    allow_patterns=[
        "onnx/model.onnx",
        "onnx/model.onnx_data",
        "*.json",
        "*.txt",
        "*.model",
    ],
)
PY
```

Then start Hindsight with paths relative to the repository root:

```bash
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=onnx
export HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_PATH=./models/intfloat__multilingual-e5-small/onnx/model.onnx
export HINDSIGHT_API_EMBEDDINGS_ONNX_TOKENIZER_NAME_OR_PATH=./models/intfloat__multilingual-e5-small
export HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS=384
export HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX="query: "
export HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX="passage: "
```

For Docker deployments, mount the same model directory and use container paths:

```yaml
services:
  hindsight:
    volumes:
      - ./models:/app/models:ro
    environment:
      HINDSIGHT_API_EMBEDDINGS_PROVIDER: onnx
      HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_PATH: /app/models/intfloat__multilingual-e5-small/onnx/model.onnx
      HINDSIGHT_API_EMBEDDINGS_ONNX_TOKENIZER_NAME_OR_PATH: /app/models/intfloat__multilingual-e5-small
      HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS: "384"
      HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX: "query: "
      HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX: "passage: "
```

Model-specific examples:

```bash
# sentence-transformers/all-MiniLM-L6-v2: 384 dimensions, no E5 prefixes
export HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID=sentence-transformers/all-MiniLM-L6-v2
export HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS=384
export HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX=""
export HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX=""

# intfloat/multilingual-e5-small: 384 dimensions, keep E5 prefixes
export HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID=intfloat/multilingual-e5-small
export HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS=384
export HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX="query: "
export HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX="passage: "

# sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2: 384 dimensions, no E5 prefixes
export HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
export HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS=384
export HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX=""
export HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX=""

# BAAI/bge-m3: 1024 dimensions, no E5 prefixes; keep onnx/model.onnx_data next to model.onnx
export HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID=BAAI/bge-m3
export HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS=1024
export HINDSIGHT_API_EMBEDDINGS_ONNX_QUERY_PREFIX=""
export HINDSIGHT_API_EMBEDDINGS_ONNX_PASSAGE_PREFIX=""
```

:::warning
Do not mix embeddings from different models in the same vector index. Switching from `local` to `onnx`, or changing ONNX models, requires re-embedding existing memories/documents even when the vector dimensions happen to match. For example, `BAAI/bge-small-en-v1.5` and `intfloat/multilingual-e5-small` both produce 384-dimensional vectors, but their embedding spaces are not semantically comparable.
:::

:::warning
The default ONNX query/passage prefixes (`query: ` and `passage: `) are for E5 models. Clear both prefix variables for non-E5 models such as MiniLM or BGE; otherwise Hindsight will prepend E5-style text to models that were not trained with that format.
:::

#### Common Pitfall: Provider-Specific Embedding Env Var Names

Embedding environment variables include a provider segment in the key name:

`HINDSIGHT_API_EMBEDDINGS_{PROVIDER}_{PARAMETER}`

For example, when `HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai`:

| Wrong | Correct |
|---|---|
| `HINDSIGHT_API_EMBEDDINGS_BASE_URL` | `HINDSIGHT_API_EMBEDDINGS_OPENAI_BASE_URL` |
| `HINDSIGHT_API_EMBEDDINGS_MODEL` | `HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL` |
| `HINDSIGHT_API_EMBEDDINGS_API_KEY` | `HINDSIGHT_API_EMBEDDINGS_OPENAI_API_KEY` |

This differs from LLM variables, which follow `HINDSIGHT_API_LLM_{PARAMETER}` without a provider segment.

:::warning
If embedding keys are misnamed, Hindsight may fall back to default OpenAI embedding settings (for example, `text-embedding-3-small`) and fail with auth errors against the wrong endpoint.
:::

#### DeepSeek and Embeddings

DeepSeek is supported as an **LLM** provider, but it does **not** expose an embeddings endpoint. If your LLM is DeepSeek, use a different embedding provider (for example `local`, `openai`, `cohere`, or `google`).
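As a minimal sketch of that split, the following pairs a DeepSeek LLM with the default `local` embedding provider. Every variable name and value is taken from elsewhere on this page; the API key is a placeholder, and the model choices are only one possible combination.

```bash
# LLM: DeepSeek (placeholder API key)
export HINDSIGHT_API_LLM_PROVIDER=deepseek
export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx
export HINDSIGHT_API_LLM_MODEL=deepseek-v4-flash

# Embeddings: a separate provider, because DeepSeek exposes no embeddings endpoint
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=local
export HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL=BAAI/bge-small-en-v1.5
```

The other embedding providers are configured along the same lines: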
```bash # Local (default) - uses SentenceTransformers export HINDSIGHT_API_EMBEDDINGS_PROVIDER=local export HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL=BAAI/bge-small-en-v1.5 # Local with custom model requiring trust_remote_code # WARNING: Only enable trust_remote_code for models you trust (security risk) # export HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL=your-custom-model # export HINDSIGHT_API_EMBEDDINGS_LOCAL_TRUST_REMOTE_CODE=true # OpenAI - cloud-based embeddings export HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai export HINDSIGHT_API_EMBEDDINGS_OPENAI_API_KEY=*** # or reuses HINDSIGHT_API_LLM_API_KEY export HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL=text-embedding-3-small # 1536 dimensions by default # export HINDSIGHT_API_EMBEDDINGS_OPENAI_DIMENSIONS=384 # optional reduced output size # OpenAI Codex OAuth - uses existing ChatGPT/Codex login, no API key needed export HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai-codex export HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL=text-embedding-3-small # 1536 dimensions by default # export HINDSIGHT_API_EMBEDDINGS_OPENAI_DIMENSIONS=384 # optional reduced output size # Azure OpenAI - embeddings via Azure endpoint export HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai export HINDSIGHT_API_EMBEDDINGS_OPENAI_API_KEY=your-azure-api-key export HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL=text-embedding-3-small export HINDSIGHT_API_EMBEDDINGS_OPENAI_BASE_URL=https://your-resource.openai.azure.com/openai/deployments/your-deployment # TEI - HuggingFace Text Embeddings Inference (recommended for production) export HINDSIGHT_API_EMBEDDINGS_PROVIDER=tei export HINDSIGHT_API_EMBEDDINGS_TEI_URL=http://localhost:8080 # OpenRouter - access 100+ embedding models export HINDSIGHT_API_EMBEDDINGS_PROVIDER=openrouter export HINDSIGHT_API_EMBEDDINGS_OPENROUTER_API_KEY=your-openrouter-api-key # or reuses HINDSIGHT_API_LLM_API_KEY export HINDSIGHT_API_EMBEDDINGS_OPENROUTER_MODEL=perplexity/pplx-embed-v1-0.6b # ZeroEntropy - zembed-1 export HINDSIGHT_API_EMBEDDINGS_PROVIDER=zeroentropy 
export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_API_KEY=your-api-key
export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_MODEL=zembed-1
export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_DIMENSIONS=1280
# export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_ENCODING_FORMAT=base64  # optional
# export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_LATENCY=fast  # optional

# Cohere - cloud-based embeddings
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=cohere
export HINDSIGHT_API_EMBEDDINGS_COHERE_API_KEY=your-api-key
export HINDSIGHT_API_EMBEDDINGS_COHERE_MODEL=embed-english-v3.0  # 1024 dimensions
# Optional: override output dimensions (for Matryoshka-capable models)
# export HINDSIGHT_API_EMBEDDINGS_COHERE_OUTPUT_DIMENSIONS=512

# Azure-hosted Cohere - embeddings via custom endpoint
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=cohere
export HINDSIGHT_API_EMBEDDINGS_COHERE_API_KEY=your-azure-api-key
export HINDSIGHT_API_EMBEDDINGS_COHERE_MODEL=embed-english-v3.0
export HINDSIGHT_API_EMBEDDINGS_COHERE_BASE_URL=https://your-azure-cohere-endpoint.com

# LiteLLM proxy - unified gateway for multiple providers
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=litellm
export HINDSIGHT_API_EMBEDDINGS_LITELLM_API_BASE=http://localhost:4000
export HINDSIGHT_API_EMBEDDINGS_LITELLM_API_KEY=your-litellm-key  # optional
export HINDSIGHT_API_EMBEDDINGS_LITELLM_MODEL=text-embedding-3-small  # or cohere/embed-english-v3.0

# Google - Gemini API (API key auth)
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=google
export HINDSIGHT_API_EMBEDDINGS_GEMINI_API_KEY=xxxxxxxxxxxx  # or reuses HINDSIGHT_API_LLM_API_KEY
export HINDSIGHT_API_EMBEDDINGS_GEMINI_MODEL=gemini-embedding-001  # 768 dimensions (default)
# export HINDSIGHT_API_EMBEDDINGS_GEMINI_OUTPUT_DIMENSIONALITY=768  # configurable: 256, 512, 768, 1024, etc.
# Google - Vertex AI auth (auto-detected when project ID is set)
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=google
export HINDSIGHT_API_EMBEDDINGS_GEMINI_MODEL=gemini-embedding-001
export HINDSIGHT_API_EMBEDDINGS_VERTEXAI_PROJECT_ID=your-gcp-project-id  # falls back to HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID
# export HINDSIGHT_API_EMBEDDINGS_VERTEXAI_REGION=us-central1  # falls back to HINDSIGHT_API_LLM_VERTEXAI_REGION
# export HINDSIGHT_API_EMBEDDINGS_VERTEXAI_SERVICE_ACCOUNT_KEY=/path/to/key.json  # falls back to LLM config, or uses ADC

# LiteLLM SDK - direct API access without proxy server (recommended)
export HINDSIGHT_API_EMBEDDINGS_PROVIDER=litellm-sdk
export HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_API_KEY=your-provider-api-key
export HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_MODEL=cohere/embed-english-v3.0
# Optional: request a specific output dimension when the provider supports it
# export HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_OUTPUT_DIMENSIONS=768

# Supported LiteLLM SDK embedding providers:
# - cohere/embed-english-v3.0 (1024 dimensions)
# - openai/text-embedding-3-small (1536 dimensions)
# - together_ai/togethercomputer/m2-bert-80M-8k-retrieval
# - huggingface/sentence-transformers/all-MiniLM-L6-v2
# - voyage/voyage-2
```

#### Embedding Dimensions

Hindsight automatically detects the embedding dimension from the model at startup and adjusts the database schema accordingly. The default model (`BAAI/bge-small-en-v1.5`) produces 384-dimensional vectors, while OpenAI models produce 1536 or 3072 dimensions.

For `litellm-sdk`, if you set `HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_OUTPUT_DIMENSIONS`, startup uses that output size when the underlying provider supports LiteLLM's `dimensions` parameter (otherwise behavior is unchanged). The same dimension-change rules below apply.

For `zeroentropy`, zembed-1 supports `2560`, `1280`, `640`, `320`, `160`, `80`, and `40` dimensions.
ZeroEntropy's API default is `2560`; Hindsight defaults to `1280` so the provider works with the default pgvector HNSW index. Use `2560` with a vector extension that supports higher-dimensional indexes, such as DiskANN/pgvectorscale or ScaNN.

:::warning Dimension Changes
Once memories are stored, you cannot change the embedding dimension without losing data. If you need to switch to a model with different dimensions:

1. **Empty database**: The schema is adjusted automatically on startup
2. **Existing data**: Either delete all memories first, or use a model with matching dimensions

Supported OpenAI embedding dimensions:
- `text-embedding-3-small`: 1536 dimensions
- `text-embedding-3-large`: 3072 dimensions
- `text-embedding-ada-002`: 1536 dimensions (legacy)

Google's `gemini-embedding-001` produces 3072 dimensions natively but supports configurable output dimensionality. Set `HINDSIGHT_API_EMBEDDINGS_GEMINI_OUTPUT_DIMENSIONALITY` to control the output size (default: 768).

ZeroEntropy's `zembed-1` supports Matryoshka dimensions: `2560`, `1280`, `640`, `320`, `160`, `80`, and `40`. Hindsight defaults to `1280` for this provider.
:::

### Reranker

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_RERANKER_PROVIDER` | Provider: `local`, `tei`, `cohere`, `openrouter`, `zeroentropy`, `siliconflow`, `alibaba`, `google`, `flashrank`, `litellm`, `litellm-sdk`, `jina-mlx`, or `rrf` | `local` |
| `HINDSIGHT_API_RERANKER_LOCAL_MODEL` | Model for local provider | `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| `HINDSIGHT_API_RERANKER_LOCAL_MAX_CONCURRENT` | Max concurrent local reranking (prevents CPU thrashing under load) | `4` |
| `HINDSIGHT_API_RERANKER_LOCAL_TRUST_REMOTE_CODE` | Allow loading models with custom code (security risk, disabled by default) | `false` |
| `HINDSIGHT_API_RERANKER_LOCAL_FORCE_CPU` | Force CPU mode for local reranker (avoids MPS/XPC issues on macOS) | `false` |
| `HINDSIGHT_API_RERANKER_LOCAL_FP16` | Half-precision (FP16) inference for the local reranker. 27–36% faster on MPS; quality-identical. Disabled by default to avoid regressions on non-MPS deployments, since some CPUs lack native FP16 support. | `false` |
| `HINDSIGHT_API_RERANKER_LOCAL_BUCKET_BATCHING` | Sort pairs by token length before batching to reduce padding waste. 36–54% faster across models; quality-identical by construction. | `false` |
| `HINDSIGHT_API_RERANKER_LOCAL_BATCH_SIZE` | Batch size for local reranker `predict()`. Optimal value varies by hardware and model (smaller batches can outperform larger ones on MPS). | `32` |
| `HINDSIGHT_API_RERANKER_TEI_URL` | TEI server URL | - |
| `HINDSIGHT_API_RERANKER_TEI_BATCH_SIZE` | Batch size for TEI reranking | `128` |
| `HINDSIGHT_API_RERANKER_TEI_MAX_CONCURRENT` | Max concurrent TEI reranking requests | `8` |
| `HINDSIGHT_API_RERANKER_TEI_HTTP_TIMEOUT` | HTTP request timeout for TEI reranker (seconds). Increase when using a slower CPU-based reranker under load. | `30.0` |
| `HINDSIGHT_API_RERANKER_OPENROUTER_API_KEY` | OpenRouter API key for reranking (falls back to `HINDSIGHT_API_OPENROUTER_API_KEY`, then `HINDSIGHT_API_LLM_API_KEY`) | - |
| `HINDSIGHT_API_RERANKER_OPENROUTER_MODEL` | OpenRouter rerank model | `cohere/rerank-v3.5` |
| `HINDSIGHT_API_RERANKER_OPENROUTER_TIMEOUT` | HTTP request timeout for OpenRouter reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_OPENROUTER_BASE_URL` | Rerank endpoint URL (point at a Cohere-compatible gateway/proxy for metering) | `https://openrouter.ai/api/v1/rerank` |
| `HINDSIGHT_API_RERANKER_COHERE_API_KEY` | Cohere API key for reranking (falls back to `HINDSIGHT_API_COHERE_API_KEY`) | - |
| `HINDSIGHT_API_RERANKER_COHERE_MODEL` | Cohere rerank model | `rerank-english-v3.0` |
| `HINDSIGHT_API_RERANKER_COHERE_BASE_URL` | Custom base URL for any Cohere-compatible `/rerank` endpoint (Azure AI Foundry, Jina, Voyage, self-hosted BGE, etc.). When set, the `cohere` provider bypasses the Cohere SDK and calls the endpoint directly via HTTP. | - |
| `HINDSIGHT_API_RERANKER_COHERE_TIMEOUT` | Request timeout for the Cohere reranker (seconds). Applies to both the native Cohere SDK and the Cohere-compatible HTTP path enabled by `HINDSIGHT_API_RERANKER_COHERE_BASE_URL`. | `60.0` |
| `HINDSIGHT_API_RERANKER_LITELLM_API_BASE` | LiteLLM proxy base URL for reranking (falls back to `HINDSIGHT_API_LITELLM_API_BASE`) | `http://localhost:4000` |
| `HINDSIGHT_API_RERANKER_LITELLM_API_KEY` | LiteLLM proxy API key for reranking (optional, depends on proxy config; falls back to `HINDSIGHT_API_LITELLM_API_KEY`) | - |
| `HINDSIGHT_API_RERANKER_LITELLM_MODEL` | LiteLLM **proxy** rerank model (use provider prefix, e.g., `cohere/rerank-english-v3.0`) | `cohere/rerank-english-v3.0` |
| `HINDSIGHT_API_RERANKER_LITELLM_TIMEOUT` | HTTP request timeout for the LiteLLM proxy reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_LITELLM_SDK_API_KEY` | LiteLLM **SDK** API key for direct reranking (no proxy needed) | - |
| `HINDSIGHT_API_RERANKER_LITELLM_SDK_MODEL` | LiteLLM SDK rerank model (e.g., `deepinfra/Qwen3-reranker-8B`) | `cohere/rerank-english-v3.0` |
| `HINDSIGHT_API_RERANKER_LITELLM_SDK_API_BASE` | Custom API base URL for LiteLLM SDK (optional) | - |
| `HINDSIGHT_API_RERANKER_LITELLM_SDK_TIMEOUT` | Request timeout for the LiteLLM SDK reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_LITELLM_MAX_TOKENS_PER_DOC` | Truncate documents to this many tokens before sending to the reranker (applies to both `litellm` and `litellm-sdk`). Use for models with small context windows (e.g. set to `900` for a 1024-token limit model). Unset by default (no truncation). | - |
| `HINDSIGHT_API_RERANKER_ZEROENTROPY_API_KEY` | ZeroEntropy API key for reranking | - |
| `HINDSIGHT_API_RERANKER_ZEROENTROPY_MODEL` | ZeroEntropy rerank model (`zerank-2`, `zerank-2-small`) | `zerank-2` |
| `HINDSIGHT_API_RERANKER_ZEROENTROPY_BASE_URL` | Custom base URL for ZeroEntropy-compatible API (e.g., mock server, proxy, or self-hosted deployment) | `https://api.zeroentropy.dev` |
| `HINDSIGHT_API_RERANKER_ZEROENTROPY_TIMEOUT` | HTTP request timeout for ZeroEntropy reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_SILICONFLOW_API_KEY` | SiliconFlow API key for reranking | - |
| `HINDSIGHT_API_RERANKER_SILICONFLOW_MODEL` | SiliconFlow rerank model (e.g., `BAAI/bge-reranker-v2-m3`) | `BAAI/bge-reranker-v2-m3` |
| `HINDSIGHT_API_RERANKER_SILICONFLOW_BASE_URL` | Base URL for the SiliconFlow `/rerank` endpoint | `https://api.siliconflow.cn/v1` |
| `HINDSIGHT_API_RERANKER_SILICONFLOW_TIMEOUT` | HTTP request timeout for SiliconFlow reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_ALIBABA_API_KEY` | Alibaba Cloud DashScope API key for reranking | - |
| `HINDSIGHT_API_RERANKER_ALIBABA_MODEL` | DashScope rerank model | `qwen3-rerank` |
| `HINDSIGHT_API_RERANKER_ALIBABA_TIMEOUT` | HTTP request timeout for the Alibaba Cloud DashScope reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_GOOGLE_PROJECT_ID` | Google Cloud project ID for Discovery Engine reranking (falls back to `HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID`) | - |
| `HINDSIGHT_API_RERANKER_GOOGLE_MODEL` | Google Discovery Engine ranking model | `semantic-ranker-default-004` |
| `HINDSIGHT_API_RERANKER_GOOGLE_SERVICE_ACCOUNT_KEY` | Path to service account JSON key (falls back to `HINDSIGHT_API_LLM_VERTEXAI_SERVICE_ACCOUNT_KEY`). If unset, uses ADC. | - |
| `HINDSIGHT_API_RERANKER_GOOGLE_TIMEOUT` | HTTP request timeout for Google Discovery Engine reranker (seconds). | `60.0` |
| `HINDSIGHT_API_RERANKER_FLASHRANK_MODEL` | FlashRank model for fast CPU-based reranking | `ms-marco-MiniLM-L-12-v2` |
| `HINDSIGHT_API_RERANKER_FLASHRANK_CACHE_DIR` | Cache directory for FlashRank models | System default |
| `HINDSIGHT_API_RERANKER_FLASHRANK_CPU_MEM_ARENA` | Enable ONNX Runtime CPU memory arena for FlashRank. When `true`, ONNX pre-allocates a memory arena that never shrinks, causing RSS to grow monotonically. `false` trades slightly slower per-call allocation for bounded RSS. | `false` |
| `HINDSIGHT_API_RERANKER_JINA_MLX_MODEL_PATH` | Local path to downloaded `jina-reranker-v3-mlx` model (auto-downloads from HuggingFace if unset) | - |

```bash
# Local (default) - uses SentenceTransformers CrossEncoder
export HINDSIGHT_API_RERANKER_PROVIDER=local
export HINDSIGHT_API_RERANKER_LOCAL_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2

# Local with custom model requiring trust_remote_code (e.g., jina-reranker-v2)
# WARNING: Only enable trust_remote_code for models you trust (security risk)
export HINDSIGHT_API_RERANKER_PROVIDER=local
export HINDSIGHT_API_RERANKER_LOCAL_MODEL=jinaai/jina-reranker-v2-base-multilingual
export HINDSIGHT_API_RERANKER_LOCAL_TRUST_REMOTE_CODE=true

# TEI - for high-performance inference
export HINDSIGHT_API_RERANKER_PROVIDER=tei
export HINDSIGHT_API_RERANKER_TEI_URL=http://localhost:8081

# OpenRouter - access reranking models via OpenRouter
export HINDSIGHT_API_RERANKER_PROVIDER=openrouter
export HINDSIGHT_API_RERANKER_OPENROUTER_API_KEY=your-openrouter-api-key  # or reuses HINDSIGHT_API_LLM_API_KEY
export HINDSIGHT_API_RERANKER_OPENROUTER_MODEL=cohere/rerank-v3.5

# Cohere - cloud-based reranking
export HINDSIGHT_API_RERANKER_PROVIDER=cohere
export HINDSIGHT_API_RERANKER_COHERE_API_KEY=your-api-key
export HINDSIGHT_API_RERANKER_COHERE_MODEL=rerank-english-v3.0

# Any Cohere-compatible /rerank endpoint (Azure AI Foundry, Jina, Voyage, self-hosted BGE, etc.)
#
# Setting HINDSIGHT_API_RERANKER_COHERE_BASE_URL switches the `cohere` provider
# off the Cohere SDK and onto a plain HTTP client that speaks the standard
# Cohere rerank wire format:
#   Request:  POST {base_url}   (or {base_url}/rerank, depending on host)
#             Authorization: Bearer {api_key}
#             {"model": "...", "query": "...", "documents": [...], "return_documents": false}
#   Response: {"results": [{"index": 0, "relevance_score": 0.9}, ...]}
#
# Any service implementing this contract works here.
# For Azure AI Foundry the base_url is the full invoke URL; for SiliconFlow you can also use the
# dedicated `siliconflow` provider below.
export HINDSIGHT_API_RERANKER_PROVIDER=cohere
export HINDSIGHT_API_RERANKER_COHERE_API_KEY=your-api-key
export HINDSIGHT_API_RERANKER_COHERE_MODEL=rerank-english-v3.0  # whatever model the endpoint serves
export HINDSIGHT_API_RERANKER_COHERE_BASE_URL=https://your-cohere-compatible-endpoint.com

# ZeroEntropy - cloud-based reranking (state-of-the-art accuracy)
export HINDSIGHT_API_RERANKER_PROVIDER=zeroentropy
export HINDSIGHT_API_RERANKER_ZEROENTROPY_API_KEY=your-api-key
export HINDSIGHT_API_RERANKER_ZEROENTROPY_MODEL=zerank-2  # or zerank-2-small
# export HINDSIGHT_API_RERANKER_ZEROENTROPY_BASE_URL=https://your-custom-endpoint.com  # optional

# SiliconFlow - cloud reranking via SiliconFlow's Cohere-compatible /rerank endpoint
export HINDSIGHT_API_RERANKER_PROVIDER=siliconflow
export HINDSIGHT_API_RERANKER_SILICONFLOW_API_KEY=your-api-key
export HINDSIGHT_API_RERANKER_SILICONFLOW_MODEL=BAAI/bge-reranker-v2-m3
# export HINDSIGHT_API_RERANKER_SILICONFLOW_BASE_URL=https://api.siliconflow.cn/v1  # default

# Alibaba Cloud DashScope - qwen3-rerank via Cohere-compatible /reranks endpoint
export HINDSIGHT_API_RERANKER_PROVIDER=alibaba
export HINDSIGHT_API_RERANKER_ALIBABA_API_KEY=your-dashscope-api-key  # or set DASHSCOPE_API_KEY
export HINDSIGHT_API_RERANKER_ALIBABA_MODEL=qwen3-rerank  # default, can omit

# LiteLLM proxy - unified gateway for multiple reranking providers (requires running LiteLLM proxy server)
export HINDSIGHT_API_RERANKER_PROVIDER=litellm
export HINDSIGHT_API_RERANKER_LITELLM_API_BASE=http://localhost:4000
export HINDSIGHT_API_RERANKER_LITELLM_API_KEY=your-litellm-key  # optional
export HINDSIGHT_API_RERANKER_LITELLM_MODEL=cohere/rerank-english-v3.0  # or voyage/rerank-2, together_ai/...
# LiteLLM SDK - direct API access without proxy (recommended for simplicity)
export HINDSIGHT_API_RERANKER_PROVIDER=litellm-sdk
export HINDSIGHT_API_RERANKER_LITELLM_SDK_API_KEY=your-deepinfra-api-key
export HINDSIGHT_API_RERANKER_LITELLM_SDK_MODEL=deepinfra/Qwen3-reranker-8B  # or cohere/rerank-english-v3.0, etc.

# Google Discovery Engine - cloud-based semantic reranking
export HINDSIGHT_API_RERANKER_PROVIDER=google
export HINDSIGHT_API_RERANKER_GOOGLE_PROJECT_ID=your-gcp-project-id
export HINDSIGHT_API_RERANKER_GOOGLE_SERVICE_ACCOUNT_KEY=/path/to/service-account.json  # optional, uses ADC if unset
export HINDSIGHT_API_RERANKER_GOOGLE_MODEL=semantic-ranker-default-004  # or semantic-ranker-fast-004

# Jina MLX - Apple Silicon native reranking (no GPU/cloud required)
# Model (~1.2 GB) is downloaded automatically from HuggingFace Hub on first use.
export HINDSIGHT_API_RERANKER_PROVIDER=jina-mlx
```

#### LiteLLM Proxy vs SDK

- **`litellm`**: Requires running a separate LiteLLM proxy server. Good for centralized configuration, rate limiting, and caching.
- **`litellm-sdk`**: Direct API access without proxy. Simpler setup, lower latency, fewer infrastructure components.

Both support the same providers:
- **Cohere** (`cohere/rerank-english-v3.0`, `cohere/rerank-multilingual-v3.0`)
- **DeepInfra** (`deepinfra/Qwen3-reranker-8B`, `deepinfra/bge-reranker-v2-m3`)
- **Together AI** (`together_ai/Salesforce/Llama-Rank-V1`)
- **HuggingFace** (`huggingface/BAAI/bge-reranker-v2-m3`)
- **Voyage AI** (`voyage/rerank-2`)
- **Jina AI** (`jina_ai/jina-reranker-v2`)
- **AWS Bedrock** (`bedrock/...`)

#### Jina MLX (Apple Silicon)

The `jina-mlx` provider uses [`jinaai/jina-reranker-v3-mlx`](https://huggingface.co/jinaai/jina-reranker-v3-mlx), optimized for Apple Silicon. The model (~1.2 GB) is downloaded from HuggingFace Hub automatically on first startup and cached locally.

:::note License
`jina-reranker-v3-mlx` is licensed under CC BY-NC 4.0. Contact Jina AI for commercial usage.
:::

### Authentication

By default, Hindsight runs without authentication. For production deployments, enable API key authentication using the built-in tenant extension:

```bash
# Enable the built-in API key authentication
export HINDSIGHT_API_TENANT_EXTENSION=hindsight_api.extensions.builtin.tenant:ApiKeyTenantExtension
export HINDSIGHT_API_TENANT_API_KEY=your-secret-api-key
```

When enabled, all requests must include the API key in the `Authorization` header:

```bash
curl -H "Authorization: Bearer your-secret-api-key" \
  http://localhost:8888/v1/default/banks
```

Requests without a valid API key receive a `401 Unauthorized` response.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_TENANT_EXTENSION` | Dotted path to the loaded tenant extension. Set to `hindsight_api.extensions.builtin.tenant:ApiKeyTenantExtension` to require an API key on every request. | *(none; auth disabled)* |
| `HINDSIGHT_API_TENANT_API_KEY` | Shared API key checked by the built-in API-key extension. Sent by clients as `Authorization: Bearer <api-key>`. | *(none)* |

If you are enabling Memory Defense, see `docs/developer/memory-defense/` for the policy schema, detector catalog, and audit trail.

:::tip Custom Authentication
For advanced authentication (JWT, OAuth, multi-tenant schemas), implement a custom `TenantExtension`. See the [Extensions documentation](./extensions.md) for details.
:::

### Server

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_HOST` | Bind address | `0.0.0.0` |
| `HINDSIGHT_API_PORT` | Server port | `8888` |
| `HINDSIGHT_API_BASE_PATH` | Base path for API when behind reverse proxy (e.g., `/hindsight`) | `""` (root) |
| `HINDSIGHT_API_WORKERS` | Number of uvicorn worker processes | `1` |
| `HINDSIGHT_API_ACCESS_LOG` | Enable uvicorn access log (`true`, `1`, `yes`, `on` to enable) | `false` |
| `HINDSIGHT_API_LOG_LEVEL` | Log level: `debug`, `info`, `warning`, `error` | `info` |
| `HINDSIGHT_API_LOG_FORMAT` | Log format: `text` or `json` (structured logging for cloud platforms) | `text` |
| `HINDSIGHT_API_LOG_JSON_FIELDS` | Comma-separated allowlist of JSON log fields to emit (e.g. `severity,message,tenant`). Available: `severity`, `message`, `timestamp`, `logger`, `tenant`, `exception`. Empty = all fields. | `""` (all) |
| `HINDSIGHT_API_MCP_ENABLED` | Enable MCP server at `/mcp/{bank_id}/` | `true` |
| `HINDSIGHT_API_MODEL_INIT_TIMEOUT` | Wall-clock cap (seconds) on startup model/connection initialization. If embeddings, the cross-encoder, or LLM verification block (e.g. an offline model download or an unreachable provider), the server fails fast with a clear error instead of hanging forever. Increase if a legitimate first-time model download needs more time. | `300` |

### Retrieval

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_GRAPH_RETRIEVER` | Graph retrieval algorithm | `link_expansion` |
| `HINDSIGHT_API_LINK_EXPANSION_PER_ENTITY_LIMIT` | Max target units expanded per entity in `link_expansion` graph retrieval (LATERAL fanout cap per entity; bounds high-fanout entities). | `200` |
| `HINDSIGHT_API_LINK_EXPANSION_TIMEOUT` | Timeout (seconds) for the per-entity graph expansion query in `link_expansion` retrieval. | `10` |
| `HINDSIGHT_API_RECALL_MAX_CONCURRENT` | Max concurrent recall operations per worker (backpressure) | `32` |
| `HINDSIGHT_API_RECALL_CONNECTION_BUDGET` | Max concurrent DB connections per recall operation | `4` |
| `HINDSIGHT_API_RECALL_MAX_QUERY_TOKENS` | Maximum token length of a recall query; requests exceeding this limit are rejected with HTTP 400 | `500` |
| `HINDSIGHT_API_RERANKER_MAX_CANDIDATES` | Max candidates to rerank per recall (RRF pre-filters the rest) | `300` |
| `HINDSIGHT_API_SEMANTIC_MIN_SIMILARITY` | Minimum cosine similarity a candidate must reach to be returned by the semantic retrieval strategy. Must be between `0` and `1`. | `0.3` |
| `HINDSIGHT_API_BM25_MIN_SCORE` | Minimum BM25 score a row must exceed to enter fusion. Gates out zero-score, non-matching rows on backends (notably `vchord`) whose operator ranks every document instead of pre-filtering to query-term matches. `0` keeps only genuine term matches; raise it to require stronger matches. | `0` |
| `HINDSIGHT_API_RECALL_MAX_CANDIDATES_PER_SOURCE` | Cap on candidates each retrieval source (semantic, BM25, graph, temporal) contributes to RRF, applied before the global reranker cap. Prevents one over-expanding backend from filling the reranker budget on its own. `0` disables the cap. | `0` |
| `HINDSIGHT_API_RECALL_STRATEGY_BOOSTS` | Prioritise one or more retrieval sources over the others on recall, as a comma-separated `strategy:level` list (e.g. `graph:high` to strongly favour graph hits, or `graph:high,bm25:low`). Strategies: `semantic`, `bm25`, `graph`, `temporal`. Levels: `low` (gentle: mainly protects the source's candidates from being dropped before reranking), `medium` (moderate preference), `high` (strong: the source dominates the candidate pool and outranks most other matches; only a strong direct match still wins). The boost is applied in two places: before the reranker cap (so favoured candidates survive the `HINDSIGHT_API_RERANKER_MAX_CANDIDATES` budget) and after reranking (to nudge them up the final order); a named level is used because those two stages live on different score scales. Only the strategies you list are boosted; any you omit keep their normal weight (no implicit boost). A strategy written without a level (`graph` or `graph:`) defaults to `medium`. Empty disables the feature. | _(empty)_ |
| `HINDSIGHT_API_RECENCY_DECAY_FUNCTION` | Shape of the recency boost applied during reranking, that is, how a memory's age is turned into a small freshness adjustment to its final rank. `linear` (default) decays in a straight line from full freshness (today) to a floor reached at `HINDSIGHT_API_RECENCY_DECAY_LINEAR_WINDOW_DAYS`. `exponential` decays by half-life: a memory is treated as neutral (no boost or penalty) at `HINDSIGHT_API_RECENCY_DECAY_HALFLIFE_DAYS`, younger memories are boosted and older ones penalised, with a smooth fade rather than a hard cutoff. `none` disables recency entirely (age never affects ranking). | `linear` |
| `HINDSIGHT_API_RECENCY_DECAY_LINEAR_WINDOW_DAYS` | For the `linear` decay function: the number of days over which a memory fades from full freshness to the minimum. Only used when `HINDSIGHT_API_RECENCY_DECAY_FUNCTION=linear`. | `365` |
| `HINDSIGHT_API_RECENCY_DECAY_HALFLIFE_DAYS` | For the `exponential` decay function: the age (in days) at which a memory is considered neutral; younger memories get a recency boost, older ones a penalty. Smaller values favour very recent memories more aggressively. Only used when `HINDSIGHT_API_RECENCY_DECAY_FUNCTION=exponential`. | `90` |
| `HINDSIGHT_API_MENTAL_MODEL_REFRESH_CONCURRENCY` | Max concurrent mental model refreshes | `8` |
| `HINDSIGHT_API_ENABLE_MENTAL_MODEL_HISTORY` | Track history of content changes to each mental model (previous content + timestamp), stored one row per change in the `mental_model_history` table. Set to `false` to disable entirely: no history rows are written, reducing storage if audit trails are not needed. **This is how you turn the feature off** (not a zero cap). | `true` |
| `HINDSIGHT_API_MENTAL_MODEL_HISTORY_MAX_ENTRIES` | Max history rows kept per mental model. On each refresh the previous version is inserted into the `mental_model_history` table and the oldest rows beyond this cap are deleted, so per-model history can't grow without bound. `0` or a negative value **removes the cap** (history then grows with every refresh, unbounded); to turn history off entirely set `HINDSIGHT_API_ENABLE_MENTAL_MODEL_HISTORY=false` instead. | `50` |

#### Graph Retrieval Algorithm

- **`link_expansion`** (default): Fast graph expansion from semantic seeds via entity co-occurrence, semantic kNN, and causal links. Target latency under 100ms.

#### Recall budget mapping

The recall request takes a `budget` parameter (`low` / `mid` / `high`, default `mid`) that maps to an integer `thinking_budget` used by every retrieval method (semantic, BM25, graph, temporal). These knobs control that mapping. They are hierarchical and can be overridden per bank via the [config API](#hierarchical-configuration).

Two functions are available:

- **`fixed`** (default; preserves legacy behavior): `thinking_budget = recall_budget_fixed_<budget>` (independent of `max_tokens`).
- **`adaptive`**: `thinking_budget = round(max_tokens * recall_budget_adaptive_<budget>)`, clamped to `[recall_budget_min, recall_budget_max]`. Useful when callers vary `max_tokens` and you want retrieval breadth to scale with the requested output size.
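The two mapping functions can be sketched in a few lines. This is an illustration only, not Hindsight's implementation: the function name `thinking_budget` and the module-level constants are hypothetical, the constants hard-code the documented default values, and Python's built-in `round` is assumed to stand in for whatever rounding the server applies.

```python
# Documented defaults for HINDSIGHT_API_RECALL_BUDGET_* (hard-coded for illustration).
FIXED = {"low": 100, "mid": 300, "high": 1000}
ADAPTIVE = {"low": 0.025, "mid": 0.075, "high": 0.25}
BUDGET_MIN, BUDGET_MAX = 20, 2000


def thinking_budget(budget, max_tokens, function="fixed"):
    """Map a recall `budget` level to an integer thinking budget (hypothetical helper)."""
    if function == "fixed":
        # Fixed mode ignores max_tokens entirely.
        return FIXED[budget]
    if function == "adaptive":
        # Scale with the requested output size, then clamp to [min, max].
        raw = round(max_tokens * ADAPTIVE[budget])
        return max(BUDGET_MIN, min(BUDGET_MAX, raw))
    raise ValueError(f"unknown budget function: {function!r}")


print(thinking_budget("mid", 4096))                        # 300
print(thinking_budget("mid", 4096, function="adaptive"))   # 307
print(thinking_budget("low", 400, function="adaptive"))    # 20 (raised to the floor)
print(thinking_budget("high", 16000, function="adaptive")) # 2000 (capped at the ceiling)
```

With these defaults, `adaptive` matches the `fixed` mid value of 300 at a `max_tokens` of 4000, so switching functions changes retrieval breadth mainly for requests well above or below that size.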
| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_RECALL_BUDGET_FUNCTION` | Mapping function: `fixed` or `adaptive`. | `fixed` |
| `HINDSIGHT_API_RECALL_BUDGET_FIXED_LOW` | Items per retrieval method per fact type when `budget=low` and function is `fixed`. | `100` |
| `HINDSIGHT_API_RECALL_BUDGET_FIXED_MID` | Items per retrieval method per fact type when `budget=mid` and function is `fixed`. | `300` |
| `HINDSIGHT_API_RECALL_BUDGET_FIXED_HIGH` | Items per retrieval method per fact type when `budget=high` and function is `fixed`. | `1000` |
| `HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_LOW` | Ratio of request `max_tokens` used when `budget=low` and function is `adaptive`. | `0.025` |
| `HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_MID` | Ratio of request `max_tokens` used when `budget=mid` and function is `adaptive`. | `0.075` |
| `HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_HIGH` | Ratio of request `max_tokens` used when `budget=high` and function is `adaptive`. | `0.25` |
| `HINDSIGHT_API_RECALL_BUDGET_MIN` | Floor for the adaptive function (after clamping). | `20` |
| `HINDSIGHT_API_RECALL_BUDGET_MAX` | Ceiling for the adaptive function (after clamping). | `2000` |

### Retain

Controls the retain (memory ingestion) pipeline.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS` | Max completion tokens for fact extraction LLM calls | `64000` |
| `HINDSIGHT_API_RETAIN_CHUNK_SIZE` | Max characters per chunk for fact extraction. Larger chunks require fewer LLM calls but may lose context. | `3000` |
| `HINDSIGHT_API_RETAIN_STRUCTURED_CHUNK_SIZE` | Max characters for a single JSONL line or conversation turn to keep whole. Unset uses `HINDSIGHT_API_RETAIN_CHUNK_SIZE`. Must be a positive integer when set. | - |
| `HINDSIGHT_API_RETAIN_EXTRACTION_MODE` | Fact extraction mode: `concise`, `verbose`, `verbatim`, `chunks`, or `custom` | `concise` |
| `HINDSIGHT_API_RETAIN_MISSION` | What this bank should pay attention to during extraction. Steers the LLM without replacing the extraction rules; works alongside any extraction mode. | - |
| `HINDSIGHT_API_RETAIN_CUSTOM_INSTRUCTIONS` | Full prompt override for fact extraction (only used when mode is `custom`). Replaces built-in extraction rules entirely. | - |
| `HINDSIGHT_API_RETAIN_EXTRACT_CAUSAL_LINKS` | Extract causal relationships between facts | `true` |
| `HINDSIGHT_API_RETAIN_BATCH_ENABLED` | Use LLM Batch API for fact extraction (50% cost savings, only with async operations) | `false` |
| `HINDSIGHT_API_RETAIN_MAX_CONCURRENT` | Max concurrent retain DB phases (HNSW reads + writes). Limits I/O contention during high-concurrency ingestion. | `4` |
| `HINDSIGHT_API_RETAIN_BATCH_TOKENS` | Max characters per sub-batch for async retain auto-splitting | `10000` |
| `HINDSIGHT_API_RETAIN_CHUNK_BATCH_SIZE` | Max chunks per streaming batch when retain ingests long documents. Each chunk produces roughly 17 facts, so the default 100 chunks ≈ 1700 facts per batch. Lower to cap memory/LLM pressure on large documents; raise for smaller chunks. Configurable per bank. | `100` |
| `HINDSIGHT_API_RETAIN_ENTITY_LOOKUP` | Entity lookup method during retain: `full` (exact match) or `trigram` (fuzzy trigram matching) | `trigram` |
| `HINDSIGHT_API_RETAIN_ENTITY_RESOLUTION_BATCH_SIZE` | Max unique entity names per fuzzy candidate lookup query (`trigram` on PG, `oracle_fuzzy` on Oracle). Bounds query size so very wide retain batches don't time out a single `unnest(...)` join on banks with many entities. | `100` |
| `HINDSIGHT_API_RETAIN_DEFAULT_STRATEGY` | Default retain strategy name. When set, all retain calls without an explicit `strategy` parameter use this strategy. | - |
| `HINDSIGHT_API_RETAIN_BATCH_POLL_INTERVAL_SECONDS` | Batch API polling interval in seconds | `60` |
| `HINDSIGHT_API_STORE_DOCUMENT_TEXT` | Persist the raw source text alongside extracted memories. Set to `false` to skip storing it. Static, server-level. | `true` |

> **Batch-capable providers.** `HINDSIGHT_API_RETAIN_BATCH_ENABLED=true` only works with a retain LLM provider that implements a batch API: `openai`, `groq`, `gemini`, and `fireworks`. Batch always requires async retain (`async=true`); a synchronous retain with batch enabled returns an error. Other providers fail fast at startup.
>
> **Gemini** uses the [Gemini Batch API](https://ai.google.dev/gemini-api/docs/batch-api) (flat 50% input + output discount, 24h SLA, typically minutes). It needs no extra settings beyond `HINDSIGHT_API_RETAIN_BATCH_ENABLED=true` and an API-key `gemini` provider; Vertex AI (`vertexai`) is not batch-capable.

#### Fireworks batch inference

Fireworks AI's batch API is **not** OpenAI `/v1/batches`-compatible. It is a proprietary, account-scoped dataset/job workflow on a separate control-plane host (`https://api.fireworks.ai`), distinct from the OpenAI-compatible inference host (`https://api.fireworks.ai/inference/v1`). Hindsight adapts it transparently, so enabling batch is the same as any other provider plus one required setting: your Fireworks **account id**. (This is separate from the existing LiteLLM `fireworks_ai/...` online path, which is unaffected.)

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_FIREWORKS_ACCOUNT_ID` | Fireworks account id. **Required** for `fireworks` batch retain, because the control-plane endpoints are `/v1/accounts/{account_id}/...`. Static, server-level. | - |
| `HINDSIGHT_API_FIREWORKS_BATCH_BASE_URL` | Fireworks batch control-plane host. | `https://api.fireworks.ai` |
| `HINDSIGHT_API_FIREWORKS_BATCH_MAX_WAIT_SECONDS` | Max time to wait for a batch job before surfacing a failure. Guards against the Fireworks gotcha where a non-batch-eligible model leaves the job `PENDING` forever. | `86400` (24h) |

```bash
# Fireworks batch retain (50% cost savings, async only)
export HINDSIGHT_API_RETAIN_LLM_PROVIDER=fireworks
export HINDSIGHT_API_RETAIN_LLM_API_KEY=fw_xxxxxxxxxxxx
export HINDSIGHT_API_RETAIN_LLM_MODEL=accounts/fireworks/models/llama-v3p1-8b-instruct
export HINDSIGHT_API_FIREWORKS_ACCOUNT_ID=your-account-id
export HINDSIGHT_API_RETAIN_BATCH_ENABLED=true
```

> **Entity labels** (`entity_labels`) and **free-form entity extraction** (`entities_allow_free_form`) are configured per bank via the [bank config API](/developer/api/memory-banks#retain-configuration), not as global environment variables, so each bank can have its own controlled vocabulary. See [Entity Labels](/developer/retain#entity-labels) for details.

#### Skip storing raw document text

By default Hindsight keeps a verbatim copy of everything you retain so you can later read the source, re-process a document, or export it. For deployments that only want to keep the extracted memories (facts, entities, mental models) and not the source text, set:

```bash
export HINDSIGHT_API_STORE_DOCUMENT_TEXT=false
```

When disabled, the full retain pipeline still runs: chunking, fact extraction, embedding, and entity linking are unchanged, so **memory quality and recall are not affected** (recall reads from the extracted memories, never from the raw text). The difference is purely what gets persisted:

- `documents.original_text` is stored as `NULL` instead of the raw payload.
- The raw chunk text is dropped (stored as empty), while the chunk's content hash is still kept so incremental re-retain of the same document continues to deduplicate correctly.

**`update_mode="append"` is rejected when text storage is disabled.** Append rebuilds a document by reading back its previously stored text and adding to it.
With nothing stored, appending would silently drop the prior content, so an append retain returns an error instead. Use `update_mode="replace"` (the default). **Features that degrade when text storage is disabled** (because they read the source text back): - **Document export** carries no source text; re-importing such a bank cannot re-run extraction from the original payload. - **Reading a document's source** (the get-document, list-chunks, and get-chunk endpoints, including their MCP equivalents) returns empty content. - **Recall with `include_chunks=true`** returns empty `chunk_text` — the facts themselves are unaffected, but the surrounding source-chunk context is no longer available. - **Reflect** no longer offers the `expand` tool (which fetches a memory's source chunk/document), and its `recall` step stops attaching source chunks, since there is no stored text to return. Reflection over the extracted memories is otherwise unaffected. - **Reprocessing a document** from its stored text is a no-op (there is nothing to reprocess). This is a static, server-level setting and cannot be overridden per bank. #### Customizing retain: when to use what There are five levels of customization for the retain pipeline. Start with the simplest that covers your needs: | Goal | Use | |------|-----| | Steer what topics to focus on or deprioritize | `HINDSIGHT_API_RETAIN_MISSION` | | Extract more detail per fact | `HINDSIGHT_API_RETAIN_EXTRACTION_MODE=verbose` | | Store chunks as-is, LLM extracts metadata | `HINDSIGHT_API_RETAIN_EXTRACTION_MODE=verbatim` | | Store chunks as-is, zero LLM cost | `HINDSIGHT_API_RETAIN_EXTRACTION_MODE=chunks` | | Completely replace the extraction rules | `HINDSIGHT_API_RETAIN_EXTRACTION_MODE=custom` + `HINDSIGHT_API_RETAIN_CUSTOM_INSTRUCTIONS` | **`HINDSIGHT_API_RETAIN_MISSION` — steer extraction without replacing it (recommended starting point)** Tell the bank what to pay attention to during extraction, in plain language. 
The mission is injected into the extraction prompt alongside the built-in rules, so it narrows focus without replacing the underlying logic. It works with the `concise`, `verbose`, `verbatim`, and `custom` extraction modes and is ignored in `chunks` mode.

```bash
export HINDSIGHT_API_RETAIN_MISSION="Focus on technical decisions, architecture choices, and team member expertise. Deprioritize social or personal information."
```

**`HINDSIGHT_API_RETAIN_EXTRACTION_MODE=verbose` — more detail per fact**

Use when you need richer, more detailed facts that carry full context and relationships. It is slower and uses more tokens than `concise`.

**`HINDSIGHT_API_RETAIN_EXTRACTION_MODE=verbatim` — store chunks as-is**

Each chunk is stored as a single memory unit with its original text preserved exactly, with no summarization or rewriting. The LLM still runs to extract entities, temporal information, and location so the chunk is fully indexed and retrievable. Useful for RAG-style indexing, document ingestion pipelines, or benchmarks where you want the original text in memory rather than LLM-generated summaries.

```bash
export HINDSIGHT_API_RETAIN_EXTRACTION_MODE=verbatim
```

**`retain_strategies` / `retain_default_strategy` — per-call extraction strategy**

Named strategies let you ingest different content types into the same bank using different extraction settings. A strategy is a set of hierarchical field overrides applied on top of the resolved bank config. Any field in the hierarchical config can be overridden per strategy, including `retain_extraction_mode`, `retain_chunk_size`, `retain_structured_chunk_size`, `entity_labels`, `entities_allow_free_form`, `retain_mission`, etc.
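
As a rough illustration of how strategy overrides layer on top of the resolved bank config, the sketch below models both as plain dictionaries. The `resolve_retain_config` helper and the sample values are hypothetical and are not Hindsight's internal implementation; treating an unknown strategy name as an error is also an assumption.

```python
from typing import Any, Optional


def resolve_retain_config(
    bank_config: dict[str, Any],
    strategy: Optional[str] = None,
) -> dict[str, Any]:
    """Return the effective retain settings for one retain call (illustrative only)."""
    strategies = bank_config.get("retain_strategies") or {}
    # Fall back to the bank's default strategy when the call names none.
    name = strategy or bank_config.get("retain_default_strategy")
    if name is None:
        # No strategy anywhere: the bank/global config applies directly.
        return dict(bank_config)
    if name not in strategies:
        # Assumption: an unknown strategy name is rejected rather than ignored.
        raise KeyError(f"unknown retain strategy: {name!r}")
    # Overrides win field by field; an explicit None (JSON null) still overrides.
    return {**bank_config, **strategies[name]}


# Hypothetical bank config, using field names from the list above.
bank = {
    "retain_extraction_mode": "concise",
    "retain_chunk_size": 3000,
    "entities_allow_free_form": True,
    "retain_default_strategy": "conversations",
    "retain_strategies": {
        "conversations": {"retain_chunk_size": 3000},
        "documents": {
            "retain_extraction_mode": "chunks",
            "retain_chunk_size": 800,
            "entities_allow_free_form": False,
        },
    },
}

docs = resolve_retain_config(bank, strategy="documents")
print(docs["retain_extraction_mode"], docs["retain_chunk_size"])  # chunks 800
```

A shallow merge is enough here because each override replaces a whole top-level field; nested merging would only matter if a strategy needed to patch part of a field.
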

Configure strategies via the bank config API:

```json
{
  "retain_default_strategy": "conversations",
  "retain_strategies": {
    "conversations": {
      "retain_extraction_mode": "concise",
      "retain_chunk_size": 3000,
      "retain_structured_chunk_size": 12000
    },
    "documents": {
      "retain_extraction_mode": "chunks",
      "retain_chunk_size": 800,
      "entity_labels": null,
      "entities_allow_free_form": false
    }
  }
}
```

Then specify the strategy at retain time:

```python
# Uses default strategy ("conversations")
client.retain(bank_id, items=[{"content": "Alice joined the team today"}])

# Explicitly use document strategy
client.retain(bank_id, items=[{"content": "...document text..."}], strategy="documents")
```

If no `strategy` is specified in a retain call, `retain_default_strategy` is used. If neither is set, the bank/global config applies directly.

**`HINDSIGHT_API_RETAIN_EXTRACTION_MODE=chunks` — zero LLM cost**

Each chunk is stored as-is with no LLM call whatsoever. No entity extraction, no temporal indexing — only embeddings are generated for semantic search. User-provided entities passed via `RetainContent.entities` are the sole source of entity data. Use when ingestion speed and cost matter more than structured metadata.

```bash
export HINDSIGHT_API_RETAIN_EXTRACTION_MODE=chunks
```

**`HINDSIGHT_API_RETAIN_EXTRACTION_MODE=custom` + `HINDSIGHT_API_RETAIN_CUSTOM_INSTRUCTIONS` — full control**

Replaces the built-in selectivity rules entirely. The structural parts of the prompt (output format, temporal handling, coreference resolution) remain intact — only the extraction guidelines are replaced. Use this when `retain_mission` isn't sufficient and you need strict inclusion/exclusion logic.

```bash
export HINDSIGHT_API_RETAIN_EXTRACTION_MODE=custom
export HINDSIGHT_API_RETAIN_CUSTOM_INSTRUCTIONS="ONLY extract facts that are:
✅ Technical decisions and their rationale
✅ Architecture patterns and design choices
✅ Performance metrics and benchmarks
DO NOT extract:
❌ Greetings or social conversation
❌ Process chatter (\"let me check\", \"one moment\")
❌ Anything that would not be useful in 6 months"
```

### File Processing

Configuration for the file upload and conversion pipeline (used by `POST /v1/default/banks/{bank_id}/files/retain`).

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_ENABLE_FILE_UPLOAD_API` | Enable the file upload API endpoint | `true` |
| `HINDSIGHT_API_ENABLE_DOCUMENT_EXPORT_API` | Enable the [document export](./api/memory-banks.mdx#document-export--import) endpoint (`GET /document-transfer`) | `true` |
| `HINDSIGHT_API_ENABLE_DOCUMENT_IMPORT_API` | Enable the [document import](./api/memory-banks.mdx#document-export--import) endpoint (`POST /document-transfer`) | `true` |
| `HINDSIGHT_API_FILE_PARSER` | Server-side default parser or fallback chain (comma-separated, e.g. `iris,markitdown`) | `markitdown` |
| `HINDSIGHT_API_FILE_PARSER_ALLOWLIST` | Comma-separated list of parsers clients are allowed to request. If not set, all registered parsers are allowed. | — |
| `HINDSIGHT_API_FILE_CONVERSION_MAX_BATCH_SIZE` | Max files per upload request | `10` |
| `HINDSIGHT_API_FILE_CONVERSION_MAX_BATCH_SIZE_MB` | Max total upload size per request (MB) | `100` |
| `HINDSIGHT_API_FILE_DELETE_AFTER_RETAIN` | Delete stored files after memory extraction completes | `true` |

#### Parser selection

Clients can override the server default by passing `parser` in the request body of `POST /v1/default/banks/{bank_id}/files/retain`. Both the server default and the per-request field accept a single parser name or an ordered **fallback chain** — each parser is tried in sequence until one succeeds.
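
The selection rules above can be sketched in a few lines of Python. This is an illustrative model only, not Hindsight's code: `parse_chain`, `check_allowlist`, `convert_with_fallback`, and the stand-in parser callables are hypothetical names, and the idea that a parser signals failure by raising an exception is an assumption.

```python
from typing import Callable, Optional, Sequence, Union

Parser = Callable[[bytes], str]


def parse_chain(value: Union[str, Sequence[str]]) -> list[str]:
    """Normalize 'iris,markitdown' or ['iris', 'markitdown'] into an ordered list."""
    names = value.split(",") if isinstance(value, str) else list(value)
    return [n.strip() for n in names if n.strip()]


def check_allowlist(chain: Sequence[str], allowlist: Optional[set[str]]) -> None:
    """Reject a client-requested parser that is not allowed (None allows everything)."""
    if allowlist is None:
        return
    denied = [n for n in chain if n not in allowlist]
    if denied:
        raise PermissionError(f"parser(s) not allowed: {denied}")


def convert_with_fallback(
    data: bytes, chain: Sequence[str], registry: dict[str, Parser]
) -> str:
    """Try each parser in order and return the first successful result."""
    errors = []
    for name in chain:
        try:
            return registry[name](data)
        except Exception as exc:  # assumption: failure is signalled by an exception
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))


def failing_parser(data: bytes) -> str:
    """Stand-in for a parser that rejects the file."""
    raise ValueError("unsupported file type")


# Hypothetical registry: the first parser fails, the second succeeds.
registry = {"iris": failing_parser, "markitdown": lambda data: data.decode("utf-8")}
chain = parse_chain("iris,markitdown")
check_allowlist(chain, {"iris", "markitdown"})
print(convert_with_fallback(b"# Report", chain, registry))  # falls through to markitdown
```

Collecting the per-parser errors before raising keeps the final failure message actionable when every parser in the chain rejects a file.
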

```bash
# Server default: try iris first, fall back to markitdown if iris fails
export HINDSIGHT_API_FILE_PARSER=iris,markitdown

# Restrict what clients may request (optional; defaults to all registered parsers)
export HINDSIGHT_API_FILE_PARSER_ALLOWLIST=markitdown,iris
```

Per-request override (in the JSON body of the file retain endpoint):

```json
{
  "parser": "iris",
  "files_metadata": [
    { "document_id": "report" },
    { "document_id": "fallback_doc", "parser": ["iris", "markitdown"] }
  ]
}
```

Clients that request a parser not in the allowlist receive HTTP 400.

#### Parser: markitdown (default)

Local file-to-markdown conversion using [Microsoft's markitdown](https://github.com/microsoft/markitdown). No external service is required by default.

**Supported formats:** PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, images (JPG, PNG; text extraction requires the optional OCR), audio (MP3, WAV; transcribed), HTML, TXT, MD, CSV.

For image workloads, MarkItDown can optionally use an OpenAI-compatible OCR/vision endpoint. This is disabled by default. Without it, image uploads fail with an actionable configuration error instead of low-level parser output. When enabled, configure the MarkItDown OCR API key, base URL, and model explicitly; they do not inherit from `HINDSIGHT_API_LLM_*` because MarkItDown uses the OpenAI SDK directly. The selected endpoint must implement OpenAI Chat Completions and the selected model must support image input.

This OCR path uses MarkItDown's image converter hook. It applies to image inputs such as JPG and PNG (and image handling inside converters that consume MarkItDown's `llm_client`), but it does not rasterize scanned PDF pages into images. Scanned PDFs with no text layer may still extract poorly through the default PDF converter. For scanned PDFs or complex document layouts, use an OCR-capable document parser such as `iris` or `llama_parse`, or configure a parser fallback chain like `llama_parse,markitdown`.


| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_ENABLED` | Enable MarkItDown image OCR through an OpenAI-compatible OCR/vision endpoint | `false` |
| `HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_API_KEY` | API key for MarkItDown OCR; required when OCR is enabled | — |
| `HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_BASE_URL` | OpenAI-compatible Chat Completions base URL for MarkItDown OCR; required when OCR is enabled | — |
| `HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_MODEL` | OCR/vision model with image-input support; required when OCR is enabled | — |
| `HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_PROMPT` | OCR prompt passed to MarkItDown's image converter | Built-in OCR prompt |

```bash
# Configure a dedicated OpenAI-compatible OCR/vision endpoint for MarkItDown OCR
export HINDSIGHT_API_FILE_PARSER=markitdown
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_ENABLED=true
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_API_KEY=your-vision-api-key
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_BASE_URL=https://vision.example/v1
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_MODEL=ocr-or-vision-model
```

#### Parser: iris

Cloud-based extraction via [Vectorize Iris](https://docs.vectorize.io/build-deploy/extract-information/understanding-iris/). Higher quality extraction for complex documents, powered by a remote AI service.

| Variable | Description | Default |
|----------|-------------|---------|
| `HINDSIGHT_API_FILE_PARSER_IRIS_TOKEN` | Vectorize API token | — |
| `HINDSIGHT_API_FILE_PARSER_IRIS_ORG_ID` | Vectorize organization ID | — |

**Supported formats:** PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, images (JPG, JPEG, PNG, GIF, BMP, TIFF, WEBP), HTML, TXT, MD, CSV.

```bash # Use iris as the only parser export HINDSIGHT_API_FILE_PARSER=iris export HINDSIGHT_API_FILE_PARSER_IRIS_TOKEN=your-vectorize-token export HINDSIGHT_API_FILE_PARSER_IRIS_ORG_ID=your-org-id # Or: try iris first, fall back to markitdown if iris fails or rejects the file type export HINDSIGHT_API_FILE_PARSER=iris,markitdown ``` #### Parser: llama_parse Cloud-based extraction via [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse) (LlamaIndex). Strong extraction for complex layouts — tables, charts, multi-column PDFs. | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_FILE_PARSER_LLAMA_PARSE_API_KEY` | LlamaCloud API key (typically starts with `llx-`) | — | **Supported formats:** PDF, DOCX, PPTX, XLSX, HTML, EPUB, RTF, TXT, and many more — see the [LlamaParse docs](https://docs.cloud.llamaindex.ai/llamaparse/features/supported_document_types) for the full list. ```bash # Use llama_parse as the only parser export HINDSIGHT_API_FILE_PARSER=llama_parse export HINDSIGHT_API_FILE_PARSER_LLAMA_PARSE_API_KEY=llx-your-api-key # Or: try llama_parse first, fall back to markitdown export HINDSIGHT_API_FILE_PARSER=llama_parse,markitdown ``` ```bash # Increase batch limits for large file imports export HINDSIGHT_API_FILE_CONVERSION_MAX_BATCH_SIZE=20 export HINDSIGHT_API_FILE_CONVERSION_MAX_BATCH_SIZE_MB=500 # Keep files after processing (useful for debugging or re-processing) export HINDSIGHT_API_FILE_DELETE_AFTER_RETAIN=false ``` ### File Storage Files uploaded via the file retain API are stored in an object storage backend before conversion. Choose the backend that fits your infrastructure. | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_FILE_STORAGE_TYPE` | Storage backend: `native`, `s3`, `gcs`, or `azure` | `native` | #### Native (PostgreSQL) Files are stored as `BYTEA` in the `file_storage` table. No additional infrastructure required. 
Suitable for development and small deployments. ```bash # Native storage is the default — no additional configuration needed export HINDSIGHT_API_FILE_STORAGE_TYPE=native ``` #### S3 / S3-Compatible | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_FILE_STORAGE_S3_BUCKET` | S3 bucket name | - | | `HINDSIGHT_API_FILE_STORAGE_S3_REGION` | AWS region | - | | `HINDSIGHT_API_FILE_STORAGE_S3_ENDPOINT` | Custom endpoint URL (for S3-compatible stores like MinIO, Cloudflare R2, Tigris) | AWS default | | `HINDSIGHT_API_FILE_STORAGE_S3_ACCESS_KEY_ID` | AWS access key ID | - | | `HINDSIGHT_API_FILE_STORAGE_S3_SECRET_ACCESS_KEY` | AWS secret access key | - | For S3-compatible providers that don't expose AWS-style regions (MinIO, Cloudflare R2, Tigris), set `HINDSIGHT_API_FILE_STORAGE_S3_REGION=auto`. The value is required for SigV4 request signing but is ignored by the service. ```bash # AWS S3 export HINDSIGHT_API_FILE_STORAGE_TYPE=s3 export HINDSIGHT_API_FILE_STORAGE_S3_BUCKET=my-hindsight-files export HINDSIGHT_API_FILE_STORAGE_S3_REGION=us-east-1 export HINDSIGHT_API_FILE_STORAGE_S3_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE export HINDSIGHT_API_FILE_STORAGE_S3_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY # S3-compatible (MinIO, Cloudflare R2, etc.) 
export HINDSIGHT_API_FILE_STORAGE_TYPE=s3 export HINDSIGHT_API_FILE_STORAGE_S3_BUCKET=my-bucket export HINDSIGHT_API_FILE_STORAGE_S3_REGION=auto export HINDSIGHT_API_FILE_STORAGE_S3_ENDPOINT=https://your-minio.example.com export HINDSIGHT_API_FILE_STORAGE_S3_ACCESS_KEY_ID=minioadmin export HINDSIGHT_API_FILE_STORAGE_S3_SECRET_ACCESS_KEY=minioadmin # Tigris (S3-compatible, single global endpoint) export HINDSIGHT_API_FILE_STORAGE_TYPE=s3 export HINDSIGHT_API_FILE_STORAGE_S3_BUCKET=my-hindsight-bucket export HINDSIGHT_API_FILE_STORAGE_S3_REGION=auto export HINDSIGHT_API_FILE_STORAGE_S3_ENDPOINT=https://t3.storage.dev export HINDSIGHT_API_FILE_STORAGE_S3_ACCESS_KEY_ID=tid_your_access_key export HINDSIGHT_API_FILE_STORAGE_S3_SECRET_ACCESS_KEY=tsec_your_secret_key ``` #### Google Cloud Storage | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_FILE_STORAGE_GCS_BUCKET` | GCS bucket name | - | | `HINDSIGHT_API_FILE_STORAGE_GCS_SERVICE_ACCOUNT_KEY` | Path to service account JSON key file | ADC if not set | ```bash export HINDSIGHT_API_FILE_STORAGE_TYPE=gcs export HINDSIGHT_API_FILE_STORAGE_GCS_BUCKET=my-hindsight-files # Optional: use service account key file (otherwise falls back to ADC) export HINDSIGHT_API_FILE_STORAGE_GCS_SERVICE_ACCOUNT_KEY=/path/to/key.json ``` #### Azure Blob Storage | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_FILE_STORAGE_AZURE_CONTAINER` | Azure container name | - | | `HINDSIGHT_API_FILE_STORAGE_AZURE_ACCOUNT_NAME` | Azure storage account name | - | | `HINDSIGHT_API_FILE_STORAGE_AZURE_ACCOUNT_KEY` | Azure storage account key | - | ```bash export HINDSIGHT_API_FILE_STORAGE_TYPE=azure export HINDSIGHT_API_FILE_STORAGE_AZURE_CONTAINER=hindsight-files export HINDSIGHT_API_FILE_STORAGE_AZURE_ACCOUNT_NAME=mystorageaccount export HINDSIGHT_API_FILE_STORAGE_AZURE_ACCOUNT_KEY=base64encodedkey== ``` #### Storage Backend Comparison | Backend | Best For | Notes | 
|---------|----------|-------| | `native` | Development, small deployments | No extra infrastructure, stored in PostgreSQL | | `s3` | Production, AWS deployments | Works with any S3-compatible store | | `gcs` | Production, GCP deployments | Supports ADC for keyless auth | | `azure` | Production, Azure deployments | Uses account key auth | :::tip Production Recommendation For production deployments, use `s3`, `gcs`, or `azure` to avoid storing large binary files in your PostgreSQL database. Set `HINDSIGHT_API_FILE_DELETE_AFTER_RETAIN=true` (the default) to delete files after memory extraction, which minimizes storage costs. ::: ### Observations (Experimental) {#observations} Observations are deduplicated, evidence-grounded knowledge consolidated from multiple facts. Each observation tracks its supporting memories and a proof count, and is refined — not overwritten — when new evidence arrives. | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_ENABLE_OBSERVATIONS` | Enable observation consolidation | `true` | | `HINDSIGHT_API_ENABLE_AUTO_CONSOLIDATION` | Automatically trigger consolidation after retain, delete, and update operations. When `false`, consolidation only runs when explicitly triggered via the [consolidate endpoint](/developer/api/operations#consolidation). Configurable per bank. | `true` | | `HINDSIGHT_API_CONSOLIDATION_RECONCILE_INTERVAL_SECONDS` | Interval for the background sweep that re-schedules consolidation for banks with unconsolidated facts but no consolidation in progress — recovering facts left unscheduled when a consolidation operation failed terminally (e.g. the LLM provider was unavailable). Only applies to banks with auto-consolidation enabled. `0` disables the sweep. | `300` | | `HINDSIGHT_API_MENTAL_MODEL_REFRESH_TICK_SECONDS` | How often the background loop checks for cron-scheduled mental models that are due for a refresh. 
This is only the *check* cadence; the actual schedule is the per-model `trigger.refresh_cron` expression set on the mental model. A due model is refreshed only when it is stale (new memories in its scope since the last refresh). `0` disables the sweep. | `60` | | `HINDSIGHT_API_ENABLE_OBSERVATION_HISTORY` | Track history of changes to each observation (previous text/tags/dates + timestamp), stored one row per change in the `observation_history` table. Set to `false` to disable entirely — no history rows are written. **This is how you turn the feature off** (not a zero cap). | `true` | | `HINDSIGHT_API_OBSERVATION_HISTORY_MAX_ENTRIES` | Max history rows kept per observation. On each update the previous version is inserted into the `observation_history` table and the oldest rows beyond this cap are deleted, so an often-reinforced observation's history can't grow without bound. `0` or a negative value **removes the cap** (unbounded); to turn history off entirely set `HINDSIGHT_API_ENABLE_OBSERVATION_HISTORY=false` instead. | `50` | | `HINDSIGHT_API_CONSOLIDATION_MAX_ATTEMPTS` | Outer retry attempts for the consolidation LLM batch call. Each attempt uses the inner retry budget (`HINDSIGHT_API_CONSOLIDATION_LLM_MAX_RETRIES`). Worst-case API calls per batch = `MAX_ATTEMPTS × (LLM_MAX_RETRIES + 1)`. | `3` | | `HINDSIGHT_API_CONSOLIDATION_BATCH_SIZE` | Memories to load per batch (internal optimization) | `50` | | `HINDSIGHT_API_CONSOLIDATION_MAX_MEMORIES_PER_ROUND` | Maximum memories processed per consolidation round. When the limit is reached, the job yields its worker slot and re-queues itself so other banks get fair scheduling. Mental model refreshes only run on the final round. `0` = unlimited. Configurable per bank. 
| `100` | | `HINDSIGHT_API_CONSOLIDATION_MAX_TOKENS` | Max tokens for recall when finding related observations during consolidation | `1024` | | `HINDSIGHT_API_CONSOLIDATION_MAX_COMPLETION_TOKENS` | Max completion tokens requested for each consolidation LLM batch call. Unset by default, so each provider keeps its implicit output budget. Set this when a provider applies a low hidden cap (e.g. Bedrock imported models) that truncates consolidation output. | `unset` | | `HINDSIGHT_API_CONSOLIDATION_LLM_BATCH_SIZE` | Number of facts sent to the LLM in a single consolidation call. Higher values reduce LLM calls and improve throughput at the cost of larger prompts. Set to `1` to disable batching. Configurable per bank. | `8` | | `HINDSIGHT_API_CONSOLIDATION_DEDUP_THRESHOLD` | Cosine similarity at/above which a newly-created or freshly-updated observation is reconciled against an existing near-identical one via a focused 1-by-1 LLM "merge or keep" call (the model reads both texts, so a number/negation/entity difference is respected). Catches near-duplicate observations that weaker consolidation models emit even when shown the twin, as well as duplicates that arise when an update rewrites an observation into a near-twin of another. Set to `1.0` to disable. Postgres only — consolidation skips reconciliation on Oracle regardless of this value. | `0.97` | | `HINDSIGHT_API_CONSOLIDATION_LLM_PARALLELISM` | Maximum number of tag groups consolidated concurrently within one consolidation op. Each group acquires per-scope locks before processing, so groups whose write scopes overlap (e.g. under `per_tag` / `all_combinations` / explicit-list `observation_scopes`) automatically serialise on the overlapping scopes — actual concurrency may be lower than this cap when scopes contend. Set to `1` for fully sequential behaviour. 
Higher values raise peak LLM QPS and connection-pool usage during consolidation proportionally — tune down if your LLM provider rate-limits tightly or your DB pool is small. Configurable per bank. | `4` | | `HINDSIGHT_API_CONSOLIDATION_RECALL_BUDGET` | Budget level for the recall pass inside consolidation (`low`, `mid`, `high`). Lower budgets fetch fewer candidate rows, reducing peak memory usage on large banks. | `low` | | `HINDSIGHT_API_CONSOLIDATION_SOURCE_FACTS_MAX_TOKENS` | Total token budget for source facts included with observations in the consolidation prompt. `-1` = unlimited. Configurable per bank. | `4096` | | `HINDSIGHT_API_CONSOLIDATION_SOURCE_FACTS_MAX_TOKENS_PER_OBSERVATION` | Per-observation token cap for source facts in the consolidation prompt. Each observation independently gets at most this many tokens of source facts. `-1` = unlimited. Configurable per bank. | `256` | | `HINDSIGHT_API_OBSERVATIONS_MISSION` | What this bank should synthesise into durable observations. Replaces the built-in consolidation rules — leave unset to use the server default. | - | | `HINDSIGHT_API_MAX_OBSERVATIONS_PER_SCOPE` | Maximum number of observations allowed per tag scope. When the limit is reached, consolidation will only update or delete existing observations — no new ones are created. Applies per tag scope (e.g., per-tag when using `per_tag` observation scopes). Observations with no tags are not subject to this limit. `-1` = unlimited. Configurable per bank. | `-1` | | `HINDSIGHT_API_OBSERVATION_SCOPE_LIMITS` | Per-scope overrides of `MAX_OBSERVATIONS_PER_SCOPE`, as a JSON array of `{"scope": [tag-globs], "limit": int}` rules. Each `scope` is a list of [fnmatch](https://docs.python.org/3/library/fnmatch.html) globs; a consolidation scope matches under *exact cover* — every tag must be matched by a glob and every glob must match a tag, so `["shared"]` matches the scope `{shared}` but not `{run_1, shared}`. 
The first matching rule wins; scopes that match no rule fall back to `MAX_OBSERVATIONS_PER_SCOPE`. Example: `[{"scope": ["shared"], "limit": -1}, {"scope": ["run_*", "shared"], "limit": 50}]` keeps the `{shared}` scope unlimited while capping each `{run_*, shared}` scope at 50. Configurable per bank. | - | #### Customizing observations: when to use what | Goal | Use | |------|-----| | Default behavior: durable specific facts, no ephemeral state | Leave unset | | Change what observations *are* for this bank (different shape, different purpose) | `HINDSIGHT_API_OBSERVATIONS_MISSION` | **`HINDSIGHT_API_OBSERVATIONS_MISSION` — redefine what this bank synthesises** By default, observations are durable, specific beliefs consolidated from memories — the kind of knowledge that stays true over time (preferences, skills, relationships, recurring patterns). Each one is grounded in the source memories that support it. Ephemeral state is filtered out. Contradictions are tracked with temporal markers rather than overwriting the prior belief. Set `HINDSIGHT_API_OBSERVATIONS_MISSION` to replace this definition entirely. Write a plain-language description of what observations should be for your use case. The LLM will use this instead of the default rules when deciding what to create or update. Leave it unset to keep the server default. :::tip When to use observations_mission Use it when the default durable-knowledge behavior doesn't match your use case. Common scenarios: - You want **broader event summaries** rather than isolated facts - You want observations **grouped by time period** (weekly, monthly) - You want a **different granularity** (one observation per project rather than per fact) - You have a **domain-specific** notion of what's worth remembering ::: **Example: Weekly event summaries** ```bash export HINDSIGHT_API_OBSERVATIONS_MISSION="Observations are broad summaries of project events grouped by week. 
Each observation should capture what happened, what was decided, and what was blocked — not individual facts. Merge related events into cohesive weekly narratives." ``` **Example: Person-centric knowledge** ```bash export HINDSIGHT_API_OBSERVATIONS_MISSION="Observations are durable facts about specific named people: their preferences, skills, relationships, and behavioral patterns. Only create observations for facts that are stable over time and tied to a named individual." ``` **Example: Support ticket patterns** ```bash export HINDSIGHT_API_OBSERVATIONS_MISSION="Observations are recurring patterns in customer support interactions: common failure modes, frequently requested features, and pain points that appear across multiple tickets." ``` ### Reflect | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_REFLECT_MAX_ITERATIONS` | Max tool call iterations before forcing a response | `10` | | `HINDSIGHT_API_REFLECT_MAX_CONTEXT_TOKENS` | Max accumulated context tokens in the reflect loop before forcing final synthesis. Prevents `context_length_exceeded` errors on large banks. Lower this if your LLM has a context window smaller than 128K. | `100000` | | `HINDSIGHT_API_REFLECT_WALL_TIMEOUT` | Wall-clock timeout in seconds for the entire reflect operation. If exceeded, the request returns HTTP 504. | `300` | | `HINDSIGHT_API_REFLECT_MISSION` | Global reflect mission (identity and reasoning framing). Overridden per bank via config API. | - | | `HINDSIGHT_API_REFLECT_SOURCE_FACTS_MAX_TOKENS` | Token budget for source facts in `search_observations` during reflect. `-1` disables source facts (default), `0` enables with no limit, `>0` enables with a token budget. Hierarchical — can be overridden per bank via config API. | `-1` | #### Internal recall (used by mental model refresh) These knobs control the recall tool that runs inside `reflect_async` (e.g. when refreshing a mental model). 
They are hierarchical — overridable per bank via the config API, and individually overridable per mental model via the `trigger.include_chunks`, `trigger.recall_max_tokens`, and `trigger.recall_chunks_max_tokens` fields. | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_RECALL_INCLUDE_CHUNKS` | Whether the internal recall returns raw chunk text alongside facts. Set `false` to skip chunks and save prompt budget. | `true` | | `HINDSIGHT_API_RECALL_MAX_TOKENS` | Token budget for facts returned by the internal recall. | `2048` | | `HINDSIGHT_API_RECALL_CHUNKS_MAX_TOKENS` | Token budget for raw chunks returned by the internal recall. | `1000` | #### Disposition Disposition traits control how the bank reasons during reflect operations. Each trait is on a scale of 1–5. These are hierarchical — they can be overridden per bank via the [config API](./configuration.md#hierarchical-configuration). | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_DISPOSITION_SKEPTICISM` | How skeptical vs trusting (1=trusting, 5=skeptical) | `3` | | `HINDSIGHT_API_DISPOSITION_LITERALISM` | How literally to interpret information (1=flexible, 5=literal) | `3` | | `HINDSIGHT_API_DISPOSITION_EMPATHY` | How much to consider emotional context (1=detached, 5=empathetic) | `3` | ### MCP Server Configuration for MCP server endpoints. | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_MCP_ENABLED` | Enable MCP server at `/mcp/{bank_id}/` | `true` | | `HINDSIGHT_API_MCP_ENABLED_TOOLS` | Comma-separated allowlist of MCP tools to expose globally (empty = all tools) | - | | `HINDSIGHT_API_MCP_STATELESS` | Use stateless HTTP transport (POST-only). 
When `false`, enables stateful mode with GET/SSE support for server-initiated messages | `false` | | `HINDSIGHT_API_MCP_AUTH_TOKEN` | Bearer token for MCP authentication (optional) | - | | `HINDSIGHT_API_MCP_LOCAL_BANK_ID` | Memory bank ID for local MCP | `mcp` | | `HINDSIGHT_API_MCP_INSTRUCTIONS` | Additional instructions appended to retain/recall tool descriptions | - | **Tool Access Control:** `HINDSIGHT_API_MCP_ENABLED_TOOLS` restricts which MCP tools are registered at the server level. This is useful for read-only deployments or limiting surface area: ```bash # Expose only recall (read-only deployment) export HINDSIGHT_API_MCP_ENABLED_TOOLS=recall # Expose recall and reflect only export HINDSIGHT_API_MCP_ENABLED_TOOLS=recall,reflect ``` Available tool names: `retain`, `recall`, `reflect`, `list_banks`, `create_bank`, `list_mental_models`, `get_mental_model`, `create_mental_model`, `update_mental_model`, `delete_mental_model`, `refresh_mental_model`, `list_directives`, `create_directive`, `delete_directive`, `list_memories`, `get_memory`, `list_documents`, `get_document`, `delete_document`, `list_operations`, `get_operation`, `cancel_operation`, `list_tags`, `get_bank`, `get_bank_stats`, `update_bank`, `delete_bank`, `clear_memories`. This can also be overridden per bank via the [config API](#hierarchical-configuration): ```bash # Restrict a specific bank to read-only MCP access curl -X PATCH http://localhost:8888/v1/default/banks/my-bank/config \ -H "Content-Type: application/json" \ -d '{"updates": {"mcp_enabled_tools": ["recall"]}}' ``` When a bank-level `mcp_enabled_tools` is set, tools not in the list return a clear error when invoked (they still appear in the tools list for MCP protocol compatibility). **MCP Authentication:** By default, the MCP endpoint is open. 
For production deployments, set `HINDSIGHT_API_MCP_AUTH_TOKEN` to require Bearer token authentication: ```bash export HINDSIGHT_API_MCP_AUTH_TOKEN=your-secret-token ``` Clients must then include the token in the `Authorization` header. See [MCP Server documentation](./mcp-server.md#authentication) for details. **Local MCP instructions:** ```bash # Example: instruct MCP to also store assistant actions export HINDSIGHT_API_MCP_INSTRUCTIONS="Also store every action you take, including tool calls and decisions made." ``` ### Distributed Workers Configuration for background task processing. By default, the API processes tasks internally. For high-throughput deployments, run dedicated workers. See [Services - Worker Service](./services#worker-service) for details. | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_WORKER_ENABLED` | Enable internal worker in API process | `true` | | `HINDSIGHT_API_WORKER_ID` | Unique worker identifier | hostname | | `HINDSIGHT_API_WORKER_POLL_INTERVAL_MS` | Database polling interval in milliseconds | `500` | | `HINDSIGHT_API_WORKER_MAX_RETRIES` | Max retries before marking task failed | `3` | | `HINDSIGHT_API_WORKER_TASK_RETRY_BACKOFF_SECONDS` | Seconds between retries on transient task failure | `60` | | `HINDSIGHT_API_WORKER_HTTP_PORT` | HTTP port for worker metrics/health (worker CLI only) | `8889` | | `HINDSIGHT_API_WORKER_MAX_SLOTS` | Maximum concurrent tasks per worker (total across all operation types) | `10` | | `HINDSIGHT_API_WORKER_CONSOLIDATION_MAX_SLOTS` | Reserved slots for consolidation tasks within `WORKER_MAX_SLOTS` (bank-serialization preserved) | `2` | | `HINDSIGHT_API_WORKER_CONSOLIDATION_BANK_PRIORITY` | Per-bank priority for consolidation scheduling (see note below) | _(unset)_ | | `HINDSIGHT_API_WORKER_RETAIN_MAX_SLOTS` | Reserved slots for retain tasks within `WORKER_MAX_SLOTS` | `0` | | `HINDSIGHT_API_WORKER_FILE_CONVERT_RETAIN_MAX_SLOTS` | Reserved slots for 
file_convert_retain tasks within `WORKER_MAX_SLOTS` | `0` | | `HINDSIGHT_API_WORKER_REFRESH_MENTAL_MODEL_MAX_SLOTS` | Reserved slots for refresh_mental_model tasks within `WORKER_MAX_SLOTS` | `0` | | `HINDSIGHT_API_WORKER_GRAPH_MAINTENANCE_MAX_SLOTS` | Reserved slots for graph_maintenance tasks within `WORKER_MAX_SLOTS` | `0` | | `HINDSIGHT_API_WORKER_IMPORT_DOCUMENTS_MAX_SLOTS` | Reserved slots for import_documents tasks within `WORKER_MAX_SLOTS` | `0` | :::note Slot reservations and shared pool Per-operation `*_MAX_SLOTS` values are **reservations within** `WORKER_MAX_SLOTS`, not additive pools. The sum of all reservations must not exceed `WORKER_MAX_SLOTS` (startup raises `ValueError` otherwise). Remaining capacity (`WORKER_MAX_SLOTS - sum of reservations`) forms a **shared pool** usable by any operation type on a first-come basis; operation types whose reserved capacity is full can also overflow into the shared pool. Consolidation's bank-serialization constraint (no two consolidation tasks for the same bank concurrently) is preserved regardless of which pool claims the slot. Example: `MAX_SLOTS=10, CONSOLIDATION=2, RETAIN=3, REFRESH_MENTAL_MODEL=2` → shared pool = `10 - (2+3+2) = 3`. With the defaults (`MAX_SLOTS=10`, `CONSOLIDATION_MAX_SLOTS=2`, all other reservations `0`), 2 slots are always reserved for consolidation and the remaining 8 form the shared pool for any operation type. Set `CONSOLIDATION_MAX_SLOTS=0` to release consolidation's reserved capacity into the shared pool. ::: :::note Consolidation bank priority `HINDSIGHT_API_WORKER_CONSOLIDATION_BANK_PRIORITY` controls which banks' consolidation tasks are claimed first when a slot becomes available. Format: comma-separated `bank-pattern:priority` pairs where higher numbers mean higher priority. Patterns support `*` as a wildcard; a bare `*` is the catch-all default for unlisted banks (defaults to `1` if omitted). 
Example: ``` HINDSIGHT_API_WORKER_CONSOLIDATION_BANK_PRIORITY="shadow-*:10,staging-*:5,*:1" ``` This ensures `shadow-*` banks are always consolidated before others, even if their tasks were submitted later. Useful for deployments with asymmetric bank sizes where a large bank might be starved by many small banks cycling through limited slots. Bank-serialization (max one concurrent consolidation per bank) is preserved regardless of priority. When unset, consolidation tasks are claimed in `created_at` order (default behavior). ::: ### Performance Optimization | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_SKIP_LLM_VERIFICATION` | Skip LLM connection check on startup | `false` | #### Bank stats cache `get_bank_stats` aggregates over `memory_links` (joining `memory_units`), which can be a multi-second scan on banks with millions of rows. Because the result is intentionally approximate — it backs a UI widget and the freshness hint inside `reflect` — it is cached per `(schema, bank)` for a few tens of seconds, which also coalesces concurrent misses onto a single in-flight query. Tune it for high-concurrency or large-bank deployments: | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_BANK_STATS_CACHE_TTL_SECONDS` | Time-to-live (seconds) for the `get_bank_stats` result cache. `0` disables caching, so every call runs the query. | `60` | | `HINDSIGHT_API_BANK_STATS_CACHE_MAX_ENTRIES` | Maximum number of cached `(schema, bank)` entries before LRU eviction. Bounds memory in deployments with many banks. | `1024` | #### Native thread pools When local embeddings or reranking run in-process, the underlying BLAS/ML libraries (OpenBLAS, OpenMP, MKL) each spawn a worker pool sized to the host CPU count. 
Because Hindsight already parallelizes across requests via its own thread-pool executors, those native pools oversubscribe the CPU — on a many-core host the process can accumulate well over 100 native threads. This inflates memory and, under contention, can degrade throughput. Hindsight therefore bounds each native pool to **16 threads** (or the number of *available* CPUs, whichever is smaller) by default, capping runaway growth on large hosts while leaving within-call parallelism intact. "Available" CPUs is the budget actually granted to the process — the smallest of the CPU-affinity set, the cgroup CPU quota (`--cpus` / cpuset), and the host core count. This matters in containers: the BLAS libraries otherwise size their pools to the *host's* cores even when the container is limited to a few, so a `--cpus=4` container on a 64-core host would spawn far more BLAS threads than it can run. To tune, set any of these to the desired thread count; the value you set is always honored (the default applies only when the variable is unset): | Variable | Description | Default | |----------|-------------|---------| | `OMP_NUM_THREADS` | OpenMP worker threads (torch, ONNX Runtime, some BLAS builds) | `min(16, available CPUs)` | | `OPENBLAS_NUM_THREADS` | OpenBLAS worker threads (numpy's default BLAS) | `min(16, available CPUs)` | | `MKL_NUM_THREADS` | Intel MKL worker threads (numpy/torch when MKL-backed) | `min(16, available CPUs)` | | `NUMEXPR_NUM_THREADS` | numexpr expression-engine threads | `min(16, available CPUs)` | For a server handling many concurrent requests, lower values (down to `1`) favor request-level parallelism and minimize thread count; for low-concurrency deployments running large local-model batches, the default leaves room for within-call parallelism. These must be set in the process environment before startup (the libraries read them once, at load time), so they cannot be overridden per-tenant or per-bank. 
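The `min(16, available CPUs)` default described above can be sketched as follows. This is a minimal illustration, not Hindsight's actual implementation: the helper names (`cgroup_cpu_quota`, `available_cpus`, `apply_native_pool_defaults`) are hypothetical, and the sketch assumes a Linux host with cgroup v2, where the CPU quota is exposed in `/sys/fs/cgroup/cpu.max`.

```python
import math
import os
from pathlib import Path

# The cap of 16 comes from the text above; the helper names are hypothetical.
DEFAULT_CAP = 16
NATIVE_POOL_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def cgroup_cpu_quota(cpu_max_path="/sys/fs/cgroup/cpu.max"):
    """Return the cgroup v2 CPU quota rounded up to whole CPUs.

    Returns None when the quota is unlimited ("max") or the file is unreadable.
    """
    try:
        quota, period = Path(cpu_max_path).read_text().split()
        if quota == "max":
            return None
        return max(1, math.ceil(int(quota) / int(period)))
    except (OSError, ValueError, ZeroDivisionError):
        return None


def available_cpus(cpu_max_path="/sys/fs/cgroup/cpu.max"):
    """Smallest of the host core count, the affinity set, and the cgroup quota."""
    candidates = [os.cpu_count() or 1]
    if hasattr(os, "sched_getaffinity"):  # not available on every platform
        candidates.append(len(os.sched_getaffinity(0)))
    quota = cgroup_cpu_quota(cpu_max_path)
    if quota is not None:
        candidates.append(quota)
    return max(1, min(candidates))


def apply_native_pool_defaults(env, cap=DEFAULT_CAP):
    """Fill in unset thread-count variables; values already set are left alone."""
    default = str(min(cap, available_cpus()))
    for name in NATIVE_POOL_VARS:
        env.setdefault(name, default)
    return env
```

In a real process a call such as `apply_native_pool_defaults(os.environ)` would have to run before numpy, torch, or ONNX Runtime are imported, because the libraries read these variables once at load time.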
### Webhooks | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_WEBHOOK_URL` | Global webhook URL for event delivery | - (disabled) | | `HINDSIGHT_API_WEBHOOK_SECRET` | HMAC signing secret for webhook payloads | - (unsigned) | | `HINDSIGHT_API_WEBHOOK_EVENT_TYPES` | Comma-separated list of event types to deliver via webhook | `consolidation.completed` | | `HINDSIGHT_API_WEBHOOK_DELIVERY_POLL_INTERVAL_SECONDS` | How often the webhook delivery worker polls for pending deliveries (seconds) | `30` | ### Audit Logging Audit logging captures mutating operations (retain, recall, reflect, bank config updates, [Memory Defense](memory-defense/index.md) redact/block actions, etc.) into an `audit_log` table, queryable via the `/audit-logs` endpoint. **Audit logging is disabled by default.** With `HINDSIGHT_API_AUDIT_LOG_ENABLED=false`, the `audit_log` table stays empty and `/audit-logs` returns `{"total": 0, "items": []}` regardless of activity. Set the flag to `true` and restart the API to start capturing events. | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_AUDIT_LOG_ENABLED` | Master switch for audit logging. Must be `true` for any audit events to be written. | `false` | | `HINDSIGHT_API_AUDIT_LOG_ACTIONS` | Comma-separated allowlist of action types to audit (empty = all eligible actions) | `""` | | `HINDSIGHT_API_AUDIT_LOG_RETENTION_DAYS` | Number of days to retain audit log entries. `-1` = keep forever. | `-1` | ### LLM Request Tracing LLM request tracing records every LLM call Hindsight makes — for retain, reflect, and consolidation — into an `llm_requests` table, queryable per bank via the `/llm-requests` endpoint. Each row captures the input messages, the model output, token usage (input / output / cached / total, taken from the provider response), finish reason, provider/model, timing, and caller metadata. 
**Failed calls are recorded too** (`status = "error"` with the error message), so the table is useful for debugging what the LLM is doing and why a call failed. Capture is wired into the OpenTelemetry GenAI recording path (the same `record_llm_call` hook used for OTLP span export), so it stays consistent with the provider-reported request details. **LLM request tracing is enabled by default**, with traced rows retained for 1 day. To disable it entirely set `HINDSIGHT_API_LLM_TRACE_ENABLED=false` and restart the API — the `llm_requests` table then stays empty and `/llm-requests` returns `{"total": 0, "items": []}` regardless of activity. > **Note:** Traced rows contain the full prompt and model output, which may include sensitive memory content and can be large. Use `HINDSIGHT_API_LLM_TRACE_MAX_CHARS` to bound how much of each payload is stored, tighten `HINDSIGHT_API_LLM_TRACE_RETENTION_DAYS`, or set `HINDSIGHT_API_LLM_TRACE_ENABLED=false` to turn tracing off in sensitive environments. | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_LLM_TRACE_ENABLED` | Master switch for LLM request tracing. Must be `true` for any calls to be recorded. | `true` | | `HINDSIGHT_API_LLM_TRACE_SCOPES` | Comma-separated allowlist of call scopes to trace (e.g. `retain_extract_facts,reflect`; empty = all scopes) | `""` | | `HINDSIGHT_API_LLM_TRACE_RETENTION_DAYS` | Number of days to retain trace rows. `-1` = keep forever. | `1` | | `HINDSIGHT_API_LLM_TRACE_MAX_CHARS` | Truncate stored input/output beyond this many characters (keeps the row, stores a truncated preview). | `50000` | ### Programmatic Configuration You can also configure the API programmatically using `MemoryEngine.from_env()`: ```python from hindsight_api import MemoryEngine memory = MemoryEngine.from_env() await memory.initialize() ``` --- ## Observability & Tracing Hindsight provides OpenTelemetry-based observability for LLM calls, conforming to GenAI semantic conventions. 
### OpenTelemetry Tracing | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_OTEL_TRACES_ENABLED` | Enable distributed tracing for LLM calls | `false` | | `HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP endpoint URL (e.g., Grafana LGTM, Langfuse, etc.) | - | | `HINDSIGHT_API_OTEL_EXPORTER_OTLP_HEADERS` | Headers for OTLP exporter (format: "key1=value1,key2=value2") | - | | `HINDSIGHT_API_OTEL_SERVICE_NAME` | Service name for traces | `hindsight-api` | | `HINDSIGHT_API_OTEL_DEPLOYMENT_ENVIRONMENT` | Deployment environment name (e.g., development, staging, production) | `development` | | `HINDSIGHT_API_METRICS_INCLUDE_BANK_ID` | Include `bank_id` in OTel metric attributes. Enable only for deployments with few banks — high cardinality causes unbounded memory growth. | `false` | | `HINDSIGHT_API_METRICS_BACKLOG_ENABLED` | Expose async-operation queue depth and consolidation-backlog gauges (`hindsight_async_operations`, `hindsight_consolidation_backlog`, `hindsight_consolidation_failed`). Runs periodic per-schema `COUNT` queries on a background task. 
| `false` | **Features:** - Full prompts and completions recorded as events - Token usage tracking (input/output) - Model and provider information - Error tracking with finish reasons - Conforms to OpenTelemetry GenAI semantic conventions v1.37+ **OTLP-Compatible Backends:** The tracing implementation uses standard OTLP HTTP protocol, so it works with any OTLP-compatible backend: - **Grafana LGTM** (Recommended for local dev): All-in-one stack with Tempo traces, Loki logs, Mimir metrics, and Grafana UI - **Langfuse**: LLM-focused observability and analytics - **OpenLIT**: Built-in LLM dashboards, cost tracking - **DataDog, New Relic, Honeycomb**: Commercial platforms **Example Configuration:** ```bash # Enable tracing export HINDSIGHT_API_OTEL_TRACES_ENABLED=true # Configure endpoint (example: OpenLIT Cloud) export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.openlit.io export HINDSIGHT_API_OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer olit-xxx" # Optional: Custom service name and environment export HINDSIGHT_API_OTEL_SERVICE_NAME=hindsight-production export HINDSIGHT_API_OTEL_DEPLOYMENT_ENVIRONMENT=production ``` **Local Development:** For local development, we recommend the Grafana LGTM stack which provides traces, metrics, and logs in a single container: ```bash ./scripts/dev/start-grafana.sh ``` See `scripts/dev/grafana/README.md` for detailed setup instructions. Other options: See `scripts/dev/openlit/README.md` for OpenLIT or `scripts/dev/jaeger/README.md` for standalone Jaeger. ### Metrics Hindsight exposes Prometheus metrics at the `/metrics` endpoint, including: - LLM call duration and token usage - Operation duration (retain/recall/reflect) - HTTP request metrics - Database connection pool metrics Metrics are always enabled and available at `http://localhost:8888/metrics`. --- ## Control Plane The Control Plane is the web UI for managing memory banks. 
| Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_CP_DATAPLANE_API_URL` | URL of the API service | `http://localhost:8888` | | `HINDSIGHT_CP_DATAPLANE_API_KEY` | Bearer token the Control Plane sends as `Authorization: Bearer <token>` on every request to the API service. Required when the API service is auth-protected; omit for a public API. | *(none; no `Authorization` header sent)* | | `HINDSIGHT_CP_ACCESS_KEY` | Access key to protect the Control Plane UI. When set, users must enter this key to log in. | *(none; auth disabled)* | | `HINDSIGHT_CP_MAX_UPLOAD_SIZE` | Maximum size of a single file-upload request the Control Plane accepts before truncating it. Accepts a size string (`100mb`, `1gb`) or a number of bytes. Raise this to upload files larger than the default, and keep it in line with the API's `HINDSIGHT_API_FILE_CONVERSION_MAX_BATCH_SIZE_MB`. | `100mb` | | `NEXT_PUBLIC_BASE_PATH` | Base path for Control Plane UI when behind reverse proxy (e.g., `/hindsight`) | `""` (root) | ```bash # Point Control Plane to a remote API service export HINDSIGHT_CP_DATAPLANE_API_URL=http://api.example.com:8888 # Authenticate to an auth-protected API service export HINDSIGHT_CP_DATAPLANE_API_KEY=my-dataplane-bearer-token # Protect the Control Plane with an access key export HINDSIGHT_CP_ACCESS_KEY=my-secret-key ``` ### Hierarchical Configuration Hindsight supports per-bank configuration overrides through a hierarchical system: **Global (env vars) → Tenant → Bank**.
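The Global → Tenant → Bank order can be pictured as a layered merge in which each more specific layer overrides the one before it. The sketch below is illustrative only: `resolve_config` is a hypothetical helper, not Hindsight's `ConfigResolver`, and the values are made up. The field names `retain_chunk_size` and `retain_extraction_mode` are taken from this page.

```python
def resolve_config(global_defaults, tenant_overrides=None, bank_overrides=None):
    """Merge config layers; later (more specific) layers win.

    Hypothetical helper for illustration. A value of None in an override
    layer is treated as "not set" and falls through to the previous layer.
    """
    resolved = dict(global_defaults)
    for layer in (tenant_overrides or {}, bank_overrides or {}):
        resolved.update({key: value for key, value in layer.items() if value is not None})
    return resolved


# Example values are assumptions chosen for the demonstration.
global_defaults = {"retain_chunk_size": 3000, "retain_extraction_mode": "concise"}
tenant_overrides = {"retain_chunk_size": 3500}
bank_overrides = {"retain_chunk_size": 4000, "retain_extraction_mode": None}

config = resolve_config(global_defaults, tenant_overrides, bank_overrides)
print(config)
```

Here the bank-level `retain_chunk_size` wins over both the tenant and the global value, while `retain_extraction_mode` falls back to the global default because no more specific layer sets it.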
#### Type-Safe Config Access To prevent accidentally using global defaults when bank-specific overrides exist, Hindsight enforces type-safe config access: **In Application Code:** ```python from hindsight_api.config import get_config # ✅ Access static (infrastructure) fields config = get_config() host = config.host # OK - static field port = config.port # OK - static field # ❌ Attempting to access bank-configurable fields raises an error chunk_size = config.retain_chunk_size # ConfigFieldAccessError! ``` **Error Message:** ``` ConfigFieldAccessError: Field 'retain_chunk_size' is bank-configurable and cannot be accessed from global config. Use ConfigResolver.resolve_full_config(bank_id, context) to get bank-specific config. ``` **For Bank-Specific Config:** ```python # Internal code that needs bank-specific settings from hindsight_api.config_resolver import ConfigResolver # Resolve full config for a specific bank (`config_resolver` is an existing ConfigResolver instance; `bank_id` and `request_context` come from the caller) config = await config_resolver.resolve_full_config(bank_id, request_context) chunk_size = config.retain_chunk_size # ✅ Uses bank-specific value ``` This design prevents bugs where global defaults are silently used instead of bank overrides: the invalid access raises `ConfigFieldAccessError` as soon as the code runs, so the mistake surfaces during development rather than in production. #### Security Model Configuration fields are categorized for security: 1. **Configurable Fields** - Safe behavioral settings that can be customized per-bank: - Retention: `retain_chunk_size`, `retain_structured_chunk_size`, `retain_extraction_mode`, `retain_mission`, `retain_custom_instructions` - Observations: `enable_observations`, `enable_auto_consolidation`, `observations_mission`, `max_observations_per_scope` - MCP access control: `mcp_enabled_tools` 2. **Credential Fields** - NEVER exposed or configurable via API: - API keys: `*_api_key` (all LLM API keys) - Infrastructure: `*_base_url` (all base URLs) 3.
**Static Fields** - Server-level only, cannot be overridden: - Infrastructure: `database_url`, `port`, `host`, `worker_count` - Provider/Model selection: `llm_provider`, `llm_model` (requires presets - not yet implemented) - Performance tuning: `llm_max_concurrent`, `llm_timeout`, retrieval settings, optimization flags #### Enabling the API | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_ENABLE_BANK_CONFIG_API` | Enable per-bank config API | `true` | | `HINDSIGHT_API_ENABLE_BANK_LLM_HEALTH` | Enable the per-bank LLM connectivity probe (`POST /v1/default/banks/{bank_id}/health/llm`). It makes a real provider call, so it is **off by default** — enable it to expose the endpoint. Returns status only — never the provider/model/endpoint. | `false` | | `HINDSIGHT_API_ENABLE_DRY_RUN_EXTRACT` | Enable the dry-run extraction preview endpoint (`POST /v1/default/banks/{bank_id}/memories/dry-run-extract`). Runs extraction only — makes a real LLM call but stores nothing. Set to `false` to remove the endpoint (returns `404`). | `true` | | `HINDSIGHT_API_DEFAULT_BANK_TEMPLATE` | Bank template manifest (JSON) applied automatically to every newly-created bank. See below. | _(unset)_ | ##### `HINDSIGHT_API_DEFAULT_BANK_TEMPLATE` Server-level default bank template. When set, the manifest is applied once to every bank the server creates — triggered the first time a bank is touched (via `PUT /v1/default/banks/{bank_id}`, `/import`, `/retain`, etc.). The value is a JSON-encoded `BankTemplateManifest` with the same shape accepted by `POST /v1/default/banks/{bank_id}/import` (see the `bank`, `mental_models`, and `directives` sections in the Bank Templates API). Precedence: fields set by the template become per-bank overrides, so they take precedence over the equivalent `HINDSIGHT_API_*` env-var defaults (e.g. `HINDSIGHT_API_RETAIN_EXTRACTION_MODE`). 
Users can still override individual fields later via `PATCH /v1/default/banks/{bank_id}/config`; the template is **not** re-applied on subsequent accesses, so explicit overrides are never clobbered. A malformed manifest (bad JSON, unknown version, schema errors) is logged and ignored — bank creation still succeeds with plain defaults, so a broken server-level setting cannot wedge all callers. Example (compact, single-line JSON as required by env vars): ```bash export HINDSIGHT_API_DEFAULT_BANK_TEMPLATE='{"version":"1","bank":{"reflect_mission":"Help support agents remember customer interactions.","retain_extraction_mode":"verbose","disposition_empathy":5},"directives":[{"name":"Be concise","content":"Always respond concisely.","priority":10}]}' ``` #### API Endpoints - `GET /v1/default/banks/{bank_id}/config` - View resolved config (filtered by permissions) - `PATCH /v1/default/banks/{bank_id}/config` - Update bank overrides (only allowed fields) - `DELETE /v1/default/banks/{bank_id}/config` - Reset to defaults #### Permission System Tenant extensions can control which fields banks are allowed to modify via `get_allowed_config_fields()`: ```python class CustomTenantExtension(TenantExtension): async def get_allowed_config_fields(self, context, bank_id): # Option 1: Allow all configurable fields return None # Option 2: Allow specific fields only return {"retain_chunk_size", "retain_custom_instructions"} # Option 3: Read-only (no modifications) return set() ``` #### Examples ```bash # Update retention settings for a bank curl -X PATCH http://localhost:8888/v1/default/banks/my-bank/config \ -H "Content-Type: application/json" \ -d '{ "updates": { "retain_chunk_size": 4000, "retain_extraction_mode": "custom", "retain_custom_instructions": "Focus on technical details and implementation specifics" } }' # Note: retain_extraction_mode must be "custom" to use retain_custom_instructions # View resolved config (respects permissions) curl 
http://localhost:8888/v1/default/banks/my-bank/config # Reset to defaults curl -X DELETE http://localhost:8888/v1/default/banks/my-bank/config ``` **Security Notes:** - Credentials (API keys, base URLs) are never returned in responses - Only configurable fields can be modified - Responses are filtered by tenant permissions - Attempting to set credentials returns 400 error ### Reverse Proxy / Subpath Deployment To deploy Hindsight under a subpath (e.g., `example.com/hindsight/`): 1. Set both environment variables to the same path: ```bash HINDSIGHT_API_BASE_PATH=/hindsight NEXT_PUBLIC_BASE_PATH=/hindsight ``` 2. Configure your reverse proxy to: - Forward `/hindsight/*` requests to Hindsight - Preserve the full path in forwarded requests - Set appropriate proxy headers (X-Forwarded-Proto, X-Forwarded-For) **Example: Nginx Configuration** ```nginx location /hindsight/ { proxy_pass http://localhost:8888/; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; } ``` **Example: Traefik Configuration** ```yaml http: routers: hindsight: rule: "PathPrefix(`/hindsight`)" service: hindsight middlewares: - hindsight-stripprefix middlewares: hindsight-stripprefix: stripPrefix: prefixes: - "/hindsight" services: hindsight: loadBalancer: servers: - url: "http://localhost:8888" ``` **Important Notes:** - The base path must start with `/` and should NOT end with `/` - Both API and Control Plane should use the same base path - After setting environment variables, restart both services - OpenAPI docs will be available at `/docs` (e.g., `/hindsight/docs`) **Complete Examples:** See `docker/compose-examples/` directory for: - Nginx configuration files (`simple.conf`, `api-and-control-plane.conf`) - Docker Compose setups (`docker-compose.yml`, `reverse-proxy-only.yml`) - Traefik and other reverse proxy examples - Full deployment documentation --- ## Example .env File 
```bash # API Service HINDSIGHT_API_DATABASE_URL=postgresql://hindsight:hindsight_dev@localhost:5432/hindsight # HINDSIGHT_API_DATABASE_SCHEMA=public # optional, defaults to 'public' HINDSIGHT_API_LLM_PROVIDER=groq HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx # Authentication (optional, recommended for production) # HINDSIGHT_API_TENANT_EXTENSION=hindsight_api.extensions.builtin.tenant:ApiKeyTenantExtension # HINDSIGHT_API_TENANT_API_KEY=your-secret-api-key # File storage (optional, defaults to PostgreSQL native storage) # HINDSIGHT_API_FILE_STORAGE_TYPE=s3 # HINDSIGHT_API_FILE_STORAGE_S3_BUCKET=my-hindsight-files # HINDSIGHT_API_FILE_STORAGE_S3_REGION=us-east-1 # HINDSIGHT_API_FILE_STORAGE_S3_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE # HINDSIGHT_API_FILE_STORAGE_S3_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY # Control Plane HINDSIGHT_CP_DATAPLANE_API_URL=http://localhost:8888 ``` --- For configuration issues not covered here, please [open an issue](https://github.com/vectorize-io/hindsight/issues) on GitHub. --- ## File: developer/rag-vs-hindsight.md # RAG vs Memory Traditional RAG (Retrieval-Augmented Generation) retrieves documents similar to a query. Hindsight provides structured memory with temporal reasoning, entity understanding, and belief formation. 
## Capability Comparison | Capability | RAG | Hindsight | |------------|-----|-----------| | **Search strategy** | Semantic similarity only | Semantic + keyword + graph + temporal | | **Multi-hop reasoning** | Limited to retrieved chunks | Graph traversal across entity relationships | | **Temporal queries** | Keyword matching ("spring") | Date parsing and range filtering | | **Entity understanding** | None | Entity resolution, co-occurrence tracking | | **Knowledge consolidation** | Stateless | Mental models that synthesize and evolve | | **Disposition** | None | 3 traits (skepticism, literalism, empathy) influence interpretation | ## Architecture Comparison ### RAG | Step | Operation | |------|-----------| | 1 | Embed query | | 2 | Vector similarity search | | 3 | Return top-k chunks | | 4 | Generate response | Single retrieval strategy. No state between queries. ### Hindsight | Step | Operation | |------|-----------| | 1 | Parse query (extract temporal expressions, entities) | | 2 | Execute 4 parallel retrievals: semantic, BM25, graph, temporal | | 3 | Fuse results with RRF | | 4 | Rerank with cross-encoder | | 5 | Apply disposition traits | | 6 | Generate response | Multiple retrieval strategies. Persistent state across sessions. ## Example Scenarios ### Multi-Hop Reasoning **Stored facts:** - "Alice is the tech lead on Project Atlas" - "Project Atlas uses Kubernetes" - "Kubernetes cluster had an outage Tuesday" **Query:** "Was Alice affected by recent issues?" | System | Result | |--------|--------| | RAG | Retrieves facts about Alice only (no semantic similarity to "issues") | | Hindsight | Traverses Alice → Project Atlas → Kubernetes → outage via entity links | ### Temporal Queries **Stored facts with timestamps:** - March: "Alice started microservices migration" - April: "Alice completed auth service" - October: "Alice focusing on performance" **Query:** "What did Alice do last spring?" 
| System | Result | |--------|--------| | RAG | Returns all Alice facts regardless of date | | Hindsight | Parses "last spring" → March-May, filters to that range | ### Entity Understanding **Stored facts about a user across sessions:** - "Pro subscription" - "Mobile app crashes in settings" - "Switched to annual billing" - "Desktop app working fine" **Query:** "What do you know about my account?" | System | Result | |--------|--------| | RAG | Lists disconnected facts | | Hindsight | Returns connected facts via entity graph: subscription status, billing, known issues | ### Knowledge Evolution **Week 1:** User struggles with async Python, succeeds with threads **Week 3:** User asks about asyncio, implements async database calls | System | Behavior | |--------|----------| | RAG | No memory of progression | | Hindsight | Consolidates mental model "user prefers sync" → refines to "user growing comfortable with async" | ## When to Use Each | Use Case | Recommended | |----------|-------------| | Document Q&A over static corpus | RAG | | Search with no temporal requirements | RAG | | AI assistants with persistent memory | Hindsight | | Applications requiring entity tracking | Hindsight | | Systems needing consistent disposition | Hindsight | | Temporal queries ("last month", "in 2023") | Hindsight | --- ## File: sdks/python.md # Python Client Official HTTP client for the Hindsight API. Use this when you have a Hindsight server already running — locally, in Docker, or as a managed service — and you want a typed Python client to talk to it. If you want to **embed and run a Hindsight server in your Python process** (no external server required), see [Embedded Python (hindsight-all)](./hindsight-all.md) instead. 
## Installation ```bash pip install hindsight-client ``` ## Quick Start ```python from hindsight_client import Hindsight client = Hindsight(base_url="http://localhost:8888") # Retain a memory client.retain(bank_id="my-bank", content="Alice works at Google") # Recall memories results = client.recall(bank_id="my-bank", query="What does Alice do?") for r in results.results: print(r.text) # Reflect - generate a contextual answer answer = client.reflect(bank_id="my-bank", query="Tell me about Alice") print(answer.text) ``` ## Client Initialization ```python from hindsight_client import Hindsight client = Hindsight( base_url="http://localhost:8888", # Hindsight API URL timeout=30.0, # Request timeout in seconds # api_key="your-api-key", # Optional bearer token ) # Core operations client.retain(bank_id="test", content="Hello world") results = client.recall(bank_id="test", query="Hello") # Organized API namespaces client.banks.create(bank_id="test", name="Test Bank") models = client.mental_models.list(bank_id="test") directives = client.directives.list(bank_id="test") memories = client.memories.list(bank_id="test") ``` ## Core Operations ### Version and Feature Checks ```python version = client.get_version() print(version.api_version) if not version.features.mcp: raise RuntimeError("This server does not expose the MCP endpoint") ``` The async client method is available as `await client.aget_version()`. 
### Retain (Store Memory) ```python # Simple client.retain( bank_id="my-bank", content="Alice works at Google as a software engineer", ) # With options from datetime import datetime client.retain( bank_id="my-bank", content="Alice got promoted", context="career update", timestamp=datetime(2024, 1, 15), document_id="conversation_001", metadata={"source": "slack"}, retain_async=False, # Set True for background processing ) ``` ### Retain Batch ```python client.retain_batch( bank_id="my-bank", items=[ {"content": "Alice works at Google", "context": "career"}, {"content": "Bob is a data scientist", "context": "career"}, ], document_id="conversation_001", retain_async=False, # Set True for background processing ) ``` ### Recall (Search) ```python # Simple - returns a RecallResponse; the memories are in .results results = client.recall( bank_id="my-bank", query="What does Alice do?", ) for r in results.results: print(f"{r.text} (type: {r.type})") # With options results = client.recall( bank_id="my-bank", query="What does Alice do?", types=["world", "observation"], # Filter by fact type max_tokens=4096, budget="high", # low, mid, or high ) ``` ### Recall with Chunks ```python # Returns RecallResponse with source chunks response = client.recall( bank_id="my-bank", query="What does Alice do?", types=["world", "experience"], budget="mid", max_tokens=4096, include_chunks=True, max_chunk_tokens=500 ) print(f"Found {len(response.results)} memories") for r in response.results: print(f" - {r.text}") if r.chunks: print(f" Source: {r.chunks[0].text[:100]}...") ``` ### Reflect (Generate Response) ```python answer = client.reflect( bank_id="my-bank", query="What should I know about Alice?", budget="low", # low, mid, or high context="preparing for a meeting", ) print(answer.text) # Generated response ``` ## Bank Management ### Create Bank ```python client.create_bank( bank_id="my-bank", name="Assistant", mission="You're a helpful AI assistant - keep track of user preferences and conversation history.",
disposition={ "skepticism": 3, # 1-5: trusting to skeptical "literalism": 3, # 1-5: flexible to literal "empathy": 3, # 1-5: detached to empathetic }, ) ``` ### List Memories ```python client.list_memories( bank_id="my-bank", type="world", # Optional: filter by type search_query="Alice", # Optional: text search limit=100, offset=0, ) ``` ## Async Support All methods have async versions prefixed with `a`: ```python import asyncio from hindsight_client import Hindsight async def main(): client = Hindsight(base_url="http://localhost:8888") # Async retain await client.aretain(bank_id="my-bank", content="Hello world") # Async recall results = await client.arecall(bank_id="my-bank", query="Hello") for r in results: print(r.text) # Async reflect answer = await client.areflect(bank_id="my-bank", query="What did I say?") print(answer.text) client.close() asyncio.run(main()) ``` ## Context Manager ```python from hindsight_client import Hindsight with Hindsight(base_url="http://localhost:8888") as client: client.retain(bank_id="my-bank", content="Hello") results = client.recall(bank_id="my-bank", query="Hello") # Client automatically closed ``` --- ## File: sdks/nodejs.md # TypeScript / JavaScript Client Official TypeScript/JavaScript client for the Hindsight API. Supports **Node.js** and **Deno**.
## Installation ### Node.js ```bash npm install @vectorize-io/hindsight-client ``` ### Deno No installation needed: import directly via the `npm:` specifier: ```typescript import { HindsightClient } from 'npm:@vectorize-io/hindsight-client'; ``` ## Quick Start ```typescript import { HindsightClient } from '@vectorize-io/hindsight-client'; const client = new HindsightClient({ baseUrl: 'http://localhost:8888' }); // Retain a memory await client.retain('my-bank', 'Alice works at Google'); // Recall memories const response = await client.recall('my-bank', 'What does Alice do?'); for (const r of response.results) { console.log(r.text); } // Reflect - generate response with disposition const answer = await client.reflect('my-bank', 'Tell me about Alice'); console.log(answer.text); ``` ## Client Initialization ```typescript const client = new HindsightClient({ baseUrl: 'http://localhost:8888', }); ``` ## Core Operations ### Version and Feature Checks ```typescript const version = await client.getVersion(); console.log(version.api_version); if (!version.features.mcp) { throw new Error('This server does not expose the MCP endpoint'); } ``` ### Retain (Store Memory) ```typescript // Simple await client.retain('my-bank', 'Alice works at Google'); // With options await client.retain('my-bank', 'Alice got promoted', { timestamp: new Date('2024-01-15'), context: 'career update', metadata: { source: 'slack' }, async: false, // Set true for background processing }); ``` ### Retain Batch ```typescript await client.retainBatch('my-bank', [ { content: 'Alice works at Google', context: 'career' }, { content: 'Bob is a data scientist', context: 'career' }, ], { async: false, }); ``` ### Recall (Search) ```typescript // Simple - returns RecallResponse const response = await client.recall('my-bank', 'What does Alice do?'); for (const r of response.results) { console.log(`${r.text} (type: ${r.type})`); } // With options (separate name so both calls can live in one scope) const filtered = await client.recall('my-bank', 'What does Alice do?', { types: ['world', 'observation'], // Filter by fact type maxTokens: 4096, budget: 'high', // 'low', 'mid', or 'high' }); ``` ###
Reflect (Generate Response) ```typescript const answer = await client.reflect('my-bank', 'What should I know about Alice?', { budget: 'low', // 'low', 'mid', or 'high' context: 'preparing for a meeting', }); console.log(answer.text); // Generated response ``` ## Bank Management ### Create Bank ```typescript await client.createBank('my-bank', { name: 'Assistant', mission: "You're a helpful AI assistant - keep track of user preferences and conversation history.", disposition: { skepticism: 3, // 1-5: trusting to skeptical literalism: 3, // 1-5: flexible to literal empathy: 3, // 1-5: detached to empathetic }, }); ``` ### List Memories ```typescript const response = await client.listMemories('my-bank', { type: 'world', // Optional filter q: 'Alice', // Optional text search limit: 100, offset: 0, }); console.log(response) ``` ## Document Management ### Get Document ```typescript const doc = await client.getDocument('my-bank', 'conversation_001'); if (doc) { console.log(doc); // null when document not found } ``` ### List Documents ```typescript const response = await client.listDocuments('my-bank', { limit: 50, offset: 0, }); console.log(response); ``` ### Update Document ```typescript await client.updateDocument('my-bank', 'conversation_001', { tags: ['important', 'meeting-notes'], }); ``` ### Delete Document ```typescript await client.deleteDocument('my-bank', 'conversation_001'); ``` --- ## File: sdks/cli.md # CLI Reference The Hindsight CLI provides command-line access to memory operations and bank management. All commands follow the [OpenAPI specification](/api-reference), so you can use `--help` on any command to see all available options. 
## Installation ```bash curl -fsSL https://hindsight.vectorize.io/get-cli | bash ``` ## Configuration Configure the API URL: ```bash # Interactive configuration hindsight configure # Or set directly hindsight configure --api-url http://localhost:8888 # With API key for authentication hindsight configure --api-url http://localhost:8888 --api-key your-api-key # Or use environment variables (highest priority) export HINDSIGHT_API_URL=http://localhost:8888 export HINDSIGHT_API_KEY=your-api-key ``` ### Named Profiles When you need to switch between multiple Hindsight deployments (e.g. local, staging, production) without constantly rewriting `~/.hindsight/config`, use named profiles. Each profile is a TOML file at `~/.hindsight/cli-profiles/.toml` and is selected per-invocation with `-p/--profile` (or by setting `$HINDSIGHT_PROFILE`). ```bash # Create (or overwrite) a profile hindsight profile create prod \ --api-url https://api.hindsight.vectorize.io \ --api-key hsk_... # List and inspect profiles hindsight profile list hindsight profile show prod # Use a profile for a single command hindsight -p prod bank list # Or make it sticky for the current shell export HINDSIGHT_PROFILE=prod hindsight bank list # Remove a profile hindsight profile delete prod -y ``` Profile files are written with `0600` permissions on Unix so the API key is only readable by the owner. **Configuration precedence** (highest first): 1. Environment variables (`HINDSIGHT_API_URL`, `HINDSIGHT_API_KEY`) 2. Named profile — explicit `-p `, otherwise `$HINDSIGHT_PROFILE` 3. Shared config file (`~/.hindsight/config`, written by `hindsight configure`) 4. Default (`http://localhost:8888`) `HINDSIGHT_API_URL` / `HINDSIGHT_API_KEY` always override profile values, which makes it safe to use `-p` in scripts while letting CI inject credentials via environment. 
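The precedence order above can be sketched in a few lines of Python. This is only an illustration of the documented lookup order, not the CLI's actual implementation: the function name `resolve_api_url` and the `api_url` dictionary key are hypothetical, and the profile and shared config are passed in as already-parsed dictionaries rather than read from disk.

```python
import os

DEFAULT_API_URL = "http://localhost:8888"  # documented default


def resolve_api_url(env, profile=None, shared_config=None):
    """Return (api_url, source) following the documented precedence.

    `profile` and `shared_config` are hypothetical, already-parsed dicts;
    the real CLI reads them from files under ~/.hindsight/.
    """
    # 1. Environment variable always wins.
    if env.get("HINDSIGHT_API_URL"):
        return env["HINDSIGHT_API_URL"], "environment"
    # 2. Named profile (selected with -p, otherwise $HINDSIGHT_PROFILE).
    if profile and profile.get("api_url"):
        return profile["api_url"], "profile"
    # 3. Shared config file written by `hindsight configure`.
    if shared_config and shared_config.get("api_url"):
        return shared_config["api_url"], "shared config"
    # 4. Built-in default.
    return DEFAULT_API_URL, "default"


# Example: resolve against the real environment with an assumed profile.
url, source = resolve_api_url(
    dict(os.environ),
    profile={"api_url": "https://api.hindsight.vectorize.io"},
)
print(f"{url} (from {source})")
```

The same ordering applies to the API key: a value injected through `HINDSIGHT_API_KEY` takes priority over whatever the selected profile contains.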
## Core Commands ### Retain (Store Memory) Store a single memory: ```bash hindsight memory retain "Alice works at Google as a software engineer" # With context hindsight memory retain "Bob loves hiking" --context "hobby discussion" # Queue for background processing hindsight memory retain "Meeting notes" --async # With an event date (ISO 8601 datetime or date) hindsight memory retain "Project launched" --timestamp 2024-01-15 # Store without a timestamp (overrides the default of "now") hindsight memory retain "Background fact" --timestamp unset ``` ### Retain Files Bulk import from files: ```bash # Single file hindsight memory retain-files notes.txt # Directory (recursive by default) hindsight memory retain-files ./documents/ # With context hindsight memory retain-files meeting-notes.txt --context "team meeting" # With a named retain strategy (see retain_strategies in bank config) hindsight memory retain-files ./documents/ --strategy conversations # Background processing hindsight memory retain-files ./data/ --async ``` ### Recall (Search) Search memories using semantic similarity: ```bash hindsight memory recall "What does Alice do?" # With options hindsight memory recall "hiking recommendations" \ --budget high \ --max-tokens 8192 # Filter by fact type hindsight memory recall "query" --fact-type world,observation # Filter by tags hindsight memory recall "query" --tags work,project \ --tags-match all # Pin results to a specific time hindsight memory recall "query" --query-timestamp "2026-01-15T00:00:00Z" # Show trace information hindsight memory recall "query" --trace ``` ### Reflect (Generate Response) Generate a response using memories and bank disposition: ```bash hindsight memory reflect "What do you know about Alice?" # With additional context hindsight memory reflect "Should I learn Python?" 
--context "career advice" # Higher budget for complex questions hindsight memory reflect "Summarize my week" --budget high # Filter by fact type hindsight memory reflect "query" \ --fact-types world,experience \ --exclude-mental-models ``` ### Memory History View the observation history for a specific memory unit: ```bash hindsight memory history ``` ### Clear Observations Remove all observations for a memory unit, keeping the core fact: ```bash hindsight memory clear-observations # Skip confirmation prompt hindsight memory clear-observations -y ``` ## Bank Management ### List Banks ```bash hindsight bank list ``` ### View Disposition ```bash hindsight bank disposition ``` ### Set Disposition ```bash hindsight bank set-disposition --skepticism 3 --literalism 4 --empathy 5 ``` ### View Statistics ```bash hindsight bank stats ``` ### Set Bank Name ```bash hindsight bank name "My Assistant" ``` ### Set Mission ```bash hindsight bank mission "I am a helpful AI assistant interested in technology" ``` ### Clear Observations (Bank-wide) Remove all observations across the entire bank: ```bash hindsight bank clear-observations # Skip confirmation prompt hindsight bank clear-observations -y ``` ### Recover Consolidation Recover from a failed or stuck consolidation: ```bash hindsight bank consolidation-recover ``` ## Document Management ```bash # List documents hindsight document list # Get document details hindsight document get # Update document metadata hindsight document update --context "updated context" # Delete document and its memories hindsight document delete ``` ## Entity Management ```bash # List entities hindsight entity list # Get entity details hindsight entity get ``` ## Operation Management Track and manage async operations (retain-files, consolidation, etc.): ```bash # List operations hindsight operation list # Get operation status hindsight operation get # Cancel a pending operation hindsight operation cancel # Retry a failed operation hindsight operation 
retry ``` ## Webhook Management Configure event delivery hooks for bank activity: ```bash # List webhooks hindsight webhook list # Create a webhook (defaults to consolidation.completed events) hindsight webhook create https://example.com/hook # Create with specific events and signing secret hindsight webhook create https://example.com/hook \ --event-types retain.completed,consolidation.completed \ --secret my-hmac-secret # Update a webhook hindsight webhook update --url https://new-url.com # Delete a webhook hindsight webhook delete # View delivery history hindsight webhook deliveries ``` ## Audit Logs Inspect the audit trail for a bank: ```bash # List audit entries hindsight audit list # Filter by action and transport hindsight audit list --action recall --transport mcp # Filter by date range hindsight audit list \ --start-date "2026-04-01T00:00:00Z" \ --end-date "2026-04-10T00:00:00Z" # Pagination hindsight audit list --limit 50 --offset 100 ``` ## Output Formats ```bash # Pretty (default) hindsight memory recall "query" # JSON hindsight memory recall "query" -o json # YAML hindsight memory recall "query" -o yaml ``` ## Global Options | Flag | Description | |------|-------------| | `-v, --verbose` | Show detailed output including request/response | | `-o, --output ` | Output format: pretty, json, yaml | | `--help` | Show help | | `--version` | Show version | ## Control Plane UI Launch the web-based Control Plane UI directly from the CLI: ```bash hindsight ui ``` This runs the Control Plane locally on port 9999 using the API URL from your configuration. The UI provides: - **Memory bank management** — Browse and manage all your banks - **Entity explorer** — Visualize the knowledge graph - **Query testing** — Interactive recall and reflect testing - **Operation history** — View ingestion and processing logs :::tip The UI command requires Node.js to be installed. It automatically downloads and runs the `@vectorize-io/hindsight-control-plane` package via npx. 
::: ## Interactive Explorer Launch the TUI explorer for visual navigation of your memory banks: ```bash hindsight explore ``` The explorer provides an interactive terminal interface to: - **Browse memory banks** — View all banks and their statistics - **Search memories** — Run recall queries with real-time results - **Inspect entities** — Explore the knowledge graph and entity relationships - **View facts** — Browse world facts, experiences, and observations - **Navigate documents** — See source documents and their extracted memories ### Keyboard Shortcuts | Key | Action | |-----|--------| | `↑/↓` | Navigate items | | `Enter` | Select / Expand | | `Tab` | Switch panels | | `/` | Search | | `q` | Quit | ## Example Workflow ```bash # Configure API URL hindsight configure --api-url http://localhost:8888 # Store some memories hindsight memory retain demo "Alice works at Google" hindsight memory retain demo "Bob is a data scientist" hindsight memory retain demo "Alice and Bob are colleagues" # Search memories hindsight memory recall demo "Who works with Alice?" # Generate a response hindsight memory reflect demo "What do you know about the team?" # Check bank disposition hindsight bank disposition demo ``` --- ## File: developer/admin-cli.md # Admin CLI The `hindsight-admin` CLI provides administrative commands for managing your Hindsight deployment, including database migrations, backup, and restore operations. ## Installation The admin CLI is included with the `hindsight-api` package — installing it puts the `hindsight-admin` executable on your `PATH`: ```bash pip install hindsight-api # or uv add hindsight-api ``` ## Running the CLI `hindsight-admin` connects **directly to PostgreSQL** — it does not call the HTTP API. 
It reads the **same configuration as the API service** (environment variables, and a `.env` file in the current working directory), so it operates on whatever database `HINDSIGHT_API_DATABASE_URL` points to: - **Default**: `pg0`, the embedded development database (must be run on the host that owns the pg0 data). - **Production**: set `HINDSIGHT_API_DATABASE_URL=postgresql://user:pass@host:5432/hindsight`. Because it talks to the database directly (binary `COPY`, `TRUNCATE`, etc.), the admin CLI is **PostgreSQL-only** (not supported on Oracle). Run it on the same host/container as your API deployment so it inherits the right configuration and has network access to the database: ```bash # Bare metal / virtualenv (with the API's env or a .env in the working dir) hindsight-admin worker-status # Docker — exec into the API container docker exec -it hindsight-api hindsight-admin backup /data/backup.zip # Kubernetes — exec into an API pod kubectl exec deploy/hindsight-api -- hindsight-admin run-db-migration ``` Use `--schema` to target a specific tenant schema (commands default to the configured base schema). See [Environment Variables](#environment-variables) below. ## Commands ### run-db-migration Run database migrations to the latest version. By default this migrates the base schema plus all tenant schemas discovered by the tenant extension. Use `--schema` for targeted migration of one schema. This is useful when you want to run migrations separately from API startup (e.g., in CI/CD pipelines or before deploying a new version). ```bash hindsight-admin run-db-migration [OPTIONS] ``` **Options:** | Option | Description | Default | |--------|-------------|---------| | `--schema`, `-s` | Database schema to run migrations on. If omitted, migrate the base schema plus all discovered tenant schemas. | All schemas | | `--embedding-dimension` | Expected embedding dimension to enforce after migrations. Omit to skip the post-migration dimension sync. 
| Skipped | | `--skip-extension-reconcile` | Skip the post-migration vector / text-search index reconcile (it only does work when `HINDSIGHT_API_VECTOR_EXTENSION` / `HINDSIGHT_API_TEXT_SEARCH_EXTENSION` differs from a schema's existing indexes). Makes a no-change re-migration across many tenant schemas much faster; only use when the backend is unchanged. | Reconcile runs | **Examples:** ```bash # Run migrations on the base schema plus all discovered tenant schemas hindsight-admin run-db-migration # Run migrations on a specific tenant schema hindsight-admin run-db-migration --schema tenant_acme ``` :::tip Disabling Auto-Migrations To disable automatic migrations on API startup, set `HINDSIGHT_API_RUN_MIGRATIONS_ON_STARTUP=false`. This is useful when you want to run migrations as a separate step in your deployment pipeline. ::: --- ### backup Create a backup of all Hindsight data to a zip file. ```bash hindsight-admin backup OUTPUT [OPTIONS] ``` **Arguments:** | Argument | Description | |----------|-------------| | `OUTPUT` | Output file path (will add `.zip` extension if not present) | **Options:** | Option | Description | Default | |--------|-------------|---------| | `--schema`, `-s` | Database schema to backup | `public` | **Examples:** ```bash # Backup to a file hindsight-admin backup /backups/hindsight-2024-01-15.zip # Backup a specific tenant schema hindsight-admin backup /backups/tenant-acme.zip --schema tenant_acme ``` The backup includes: - Memory banks and their configuration - Documents and chunks - Entities and their relationships - Memory units (facts, experiences, observations) - Entity cooccurrences and memory links - Mental models and directives - Webhooks and file storage - Internal operational tables (async operations, audit log, graph-maintenance queue, and similar bookkeeping) so a restore reproduces a faithful full-database snapshot :::note Consistency Backups are created within a database transaction with `REPEATABLE READ` isolation, ensuring a 
consistent snapshot across all tables. ::: --- ### restore Restore data from a backup file. **Warning: This deletes all existing data in the target schema.** ```bash hindsight-admin restore INPUT [OPTIONS] ``` **Arguments:** | Argument | Description | |----------|-------------| | `INPUT` | Input backup file (.zip) | **Options:** | Option | Description | Default | |--------|-------------|---------| | `--schema`, `-s` | Database schema to restore to | `public` | | `--yes`, `-y` | Skip confirmation prompt | `false` | **Examples:** ```bash # Restore with confirmation prompt hindsight-admin restore /backups/hindsight-2024-01-15.zip # Restore without confirmation (for scripts) hindsight-admin restore /backups/hindsight-2024-01-15.zip --yes # Restore to a specific tenant schema hindsight-admin restore /backups/tenant-acme.zip --schema tenant_acme --yes ``` :::warning Data Loss Restore will **delete all existing data** in the target schema before importing the backup. Always verify you have a recent backup before performing a restore. ::: --- ### decommission-worker Release all tasks owned by a worker, resetting them from "processing" back to "pending" status so they can be picked up by other workers. 
```bash hindsight-admin decommission-worker WORKER_ID [OPTIONS] ``` **Arguments:** | Argument | Description | |----------|-------------| | `WORKER_ID` | ID of the worker to decommission | **Options:** | Option | Description | Default | |--------|-------------|---------| | `--schema`, `-s` | Database schema | `public` | | `--yes`, `-y` | Skip confirmation prompt | `false` | **Examples:** ```bash # Before scaling down - release tasks from workers being removed hindsight-admin decommission-worker hindsight-worker-4 hindsight-admin decommission-worker hindsight-worker-3 # Release tasks from a crashed worker hindsight-admin decommission-worker worker-2 # For a specific tenant schema hindsight-admin decommission-worker worker-1 --schema tenant_acme ``` **When to Use:** - **Scaling down**: Before removing worker replicas in Kubernetes - **Graceful removal**: When taking a worker offline for maintenance - **Crash recovery**: If a worker crashed while processing tasks - **Stuck worker**: When a worker is unresponsive :::tip Finding Worker IDs Worker IDs default to the hostname. In Kubernetes StatefulSets, this is the pod name (e.g., `hindsight-worker-0`). You can also set a custom ID with `HINDSIGHT_API_WORKER_ID` or `--worker-id`. ::: ### decommission-workers Release all currently-processing tasks from every worker, resetting them from "processing" back to "pending" status. Use this when one or more workers have crashed or been removed without graceful shutdown and you don't know which worker IDs to target. 
```bash hindsight-admin decommission-workers [OPTIONS] ``` **Options:** | Option | Description | Default | |--------|-------------|---------| | `--schema`, `-s` | Database schema | `public` | | `--yes`, `-y` | Skip confirmation prompt | `false` | **Examples:** ```bash # Release all processing tasks across all workers (with confirmation) hindsight-admin decommission-workers # Skip the confirmation prompt (useful in scripts) hindsight-admin decommission-workers --yes # Release tasks in a specific tenant schema hindsight-admin decommission-workers --schema tenant_acme ``` **When to Use:** - **Unknown dead workers**: Multiple workers crashed and you do not know their IDs - **Fleet-wide recovery**: After an infrastructure event where many workers went down - **"Just fix everything"**: A quick full-queue drain when per-worker cleanup is overkill :::warning Disruptive This releases **every** processing task regardless of worker, including tasks owned by healthy workers. Prefer `decommission-worker ` when you know which workers need cleanup. ::: --- ### worker-status Show all currently-processing tasks grouped by worker, including operation type, bank, how long each task has been running, and when it was last updated. Useful for identifying orphaned tasks before decommissioning. ```bash hindsight-admin worker-status [OPTIONS] ``` **Options:** | Option | Description | Default | |--------|-------------|---------| | `--schema`, `-s` | Database schema | `public` | **Examples:** ```bash # Show all processing tasks across all workers hindsight-admin worker-status # Show processing tasks for a specific tenant schema hindsight-admin worker-status --schema tenant_acme ``` **When to Use:** - **Before decommissioning**: Inspect which workers have stale tasks and how long they have been stuck - **Debugging throughput**: Diagnose why the queue is not draining (are tasks stuck in processing?) 
- **Worker health check**: Spot workers whose `last_update_ago` keeps growing, indicating a dead or unresponsive worker --- ### export-bank Export an entire bank to a portable ZIP archive — documents, facts, observations, bank configuration, mental models, directives, and webhooks. Embeddings are **never** included; they are regenerated on import. This is the source half of a cross-instance migration (e.g. moving to a different embedding model, vector extension, or text-search backend). PostgreSQL only. ```bash hindsight-admin export-bank --bank --output [OPTIONS] ``` **Options:** | Option | Description | Default | |--------|-------------|---------| | `--bank`, `-b` | Bank id to export. | (required) | | `--output`, `-o` | Path to write the `.zip` archive. | (required) | | `--schema`, `-s` | Schema the bank lives in. | base schema | | `--include-history` | Also export operational history (`audit_log`, `llm_requests`). | `false` | **Examples:** ```bash hindsight-admin export-bank --bank my-bank --output my-bank.zip # include operational history hindsight-admin export-bank --bank my-bank --output my-bank.zip --include-history ``` Read-only — safe to run against a live instance. --- ### import-bank Restore a whole-bank archive (produced by `export-bank`) into **this** instance. Facts are re-embedded with this instance's configured embedding model and links/indexes are rebuilt; bank configuration, mental models, directives, and webhooks are restored exactly. No LLM fact-extraction runs, and because a migration restores state, it does **not** fire webhooks or re-run consolidation. PostgreSQL only. ```bash hindsight-admin import-bank --archive [OPTIONS] ``` **Options:** | Option | Description | Default | |--------|-------------|---------| | `--archive`, `-a` | Path to the `.zip` produced by `export-bank`. | (required) | | `--schema`, `-s` | Target schema. | base schema | | `--target-bank` | Override the bank id (defaults to the archive's source bank). 
| source bank | | `--include-history` | Also restore history if present in the archive. | `false` | **Examples:** ```bash hindsight-admin import-bank --archive my-bank.zip ``` Run this against an instance configured with the **target** embedding model / vector extension / text-search backend — that's what re-embedding uses. :::warning Target bank must not exist Import restores a **whole bank** (config, facts, mental models, …) — it is **not a merge**. If a bank with the target id already exists, the command fails. Delete that bank first, or use `--target-bank` to restore under a fresh id. ::: --- ## Migrating a bank to a new instance Changing a bank's **embedding model** (e.g. a 384-dim encoder → a 1024-dim one), **vector extension** (pgvector / vchord / pgvectorscale), or **text-search backend** can't be done in place on a populated bank — the stored vectors and indexes are tied to those settings. Because every embedding and index is a deterministic function of text already on disk, the supported path is to **move the bank to a fresh instance configured with the new settings and re-derive everything there — with no LLM re-extraction**. `export-bank` / `import-bank` carry documents, facts, observations, bank config, mental models, directives, and webhooks — but never embeddings, which the target instance regenerates with its own model. **Blue-green runbook:** 1. Stand up a **new instance** on a fresh database, configured with the new embedding model / vector extension / text-search backend. 2. Quiesce writes to the source bank (maintenance window) and run `hindsight-admin backup` for safety. 3. Export from the source, then import into the target: ```bash # on the source instance: hindsight-admin export-bank --bank my-bank --output my-bank.zip # on the target instance (configured with the new settings): hindsight-admin import-bank --archive my-bank.zip ``` 4. Verify on the target: run representative recall queries and compare results. 5. 
Cut traffic over to the new instance. The old instance stays as an instant rollback until you're confident. :::note Why a new instance, not in-place The embedding model is server-level, and a bank's `memory_units.embedding` column has a single dimension shared across the schema, so a different-dimension or different-backend bank needs its own instance/database. The old vectors are never mutated, which makes rollback trivial. ::: --- ## Recovering stuck or zombie operations A "zombie" operation is one stuck in `processing` indefinitely because the worker that claimed it is gone. The most common cause is an unstable `HINDSIGHT_API_WORKER_ID`: when it defaults to the container hostname, a Docker restart produces a new container ID, the new worker doesn't recognize the old worker's claims as its own, and those tasks are stranded. **How to spot them:** ```bash # List processing tasks grouped by worker — workers with a growing last_update_ago are dead hindsight-admin worker-status # Bank-level counters; pending_consolidation that never decreases is the usual symptom curl -s http://localhost:8888/v1/default/banks//stats ``` **How to recover:** ```bash # You know which worker is dead (e.g. from worker-status): hindsight-admin decommission-worker # You don't know — release every processing task across the fleet: hindsight-admin decommission-workers ``` Both commands reset `processing` rows back to `pending` so a live worker can claim them on the next poll. 
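To make the effect of the two recovery commands concrete, here is a small in-memory model in Python. It is a sketch under stated assumptions, not the admin CLI's code: the real commands operate on PostgreSQL rows, the `Task` class and its field names are hypothetical, and clearing the owner on release is an assumption made for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical stand-in for one row in the operations queue."""
    id: str
    status: str               # "pending" or "processing"
    worker_id: Optional[str]  # worker that claimed the task, if any


def release_tasks(tasks: List[Task], worker_id: Optional[str] = None) -> int:
    """Reset processing tasks to pending and return how many were released.

    With a worker_id this mirrors `decommission-worker`; with None it
    mirrors `decommission-workers` and releases every processing task.
    """
    released = 0
    for task in tasks:
        if task.status != "processing":
            continue
        if worker_id is not None and task.worker_id != worker_id:
            continue
        task.status = "pending"
        task.worker_id = None  # assumption: ownership is cleared on release
        released += 1
    return released


queue = [
    Task("op-1", "processing", "old-container-id"),  # zombie: owner is gone
    Task("op-2", "processing", "hindsight-worker-0"),
    Task("op-3", "pending", None),
]
print(release_tasks(queue, worker_id="old-container-id"))
```

The targeted form leaves `op-2` untouched, which is why it is preferred whenever the dead worker's ID is known.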
**How to prevent it:** Set `HINDSIGHT_API_WORKER_ID` to a stable value so worker identity survives restarts: - **Docker**: pass `-e HINDSIGHT_API_WORKER_ID=hindsight-prod` (or per-replica names if running multiple containers) - **Kubernetes (Helm)**: the chart's StatefulSet uses the pod name automatically — no extra config needed - **Bare metal / pip**: pass `--worker-id ` or set the env var per process See [Installation - Docker](./installation#docker) and [Configuration - Distributed Workers](./configuration#distributed-workers). --- ## Environment Variables The admin CLI uses the same environment variables as the API service. The most important one is: | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_DATABASE_URL` | PostgreSQL connection string | `pg0` (embedded) | **Example:** ```bash # Use a specific database export HINDSIGHT_API_DATABASE_URL=postgresql://user:pass@localhost:5432/hindsight hindsight-admin backup /backups/mybackup.zip ``` --- ## File: developer/api/bank-templates.mdx # Bank Templates Declarative JSON manifests for creating pre-configured memory banks with a single API call. {/* Import raw source files */} ## Overview A bank template is a JSON manifest that describes a bank's full setup: configuration overrides, mental models, directives, and more. Instead of making multiple API calls to configure a bank, you submit one manifest and the API provisions everything. Templates are useful for: - **Replication** — stamp out identically-configured banks for multiple users or agents - **Onboarding** — new users start with a known-good configuration instead of configuring from scratch - **Sharing** — distribute recommended setups as portable JSON files - **Framework integrations** — ship a recommended template alongside your integration Browse the [Bank Templates Hub](/templates) for ready-to-use templates. 
## Manifest Schema ```json { "version": "1", "bank": { "reflect_mission": "...", "retain_mission": "...", "retain_extraction_mode": "concise | verbose | custom | chunks", "retain_custom_instructions": "...", "retain_chunk_size": 2048, "retain_structured_chunk_size": 8192, "disposition_skepticism": 3, "disposition_literalism": 3, "disposition_empathy": 3, "enable_observations": true, "observations_mission": "...", "entity_labels": [{ "key": "sentiment", "type": "value", "values": [{ "value": "positive" }, { "value": "negative" }] }], "entities_allow_free_form": true }, "mental_models": [ { "id": "unique-lowercase-id", "name": "Human-Readable Name", "source_query": "The query that generates this mental model's content", "tags": ["optional", "tags"], "max_tokens": 2048, "trigger": { "refresh_after_consolidation": false, "fact_types": ["world", "experience", "observation"], "exclude_mental_models": false, "exclude_mental_model_ids": [] } } ], "directives": [ { "name": "directive-name", "content": "The directive instruction text", "priority": 0, "is_active": true, "tags": ["optional", "tags"] } ] } ``` ### Fields | Field | Required | Description | |-------|----------|-------------| | `version` | Yes | Schema version. Currently `"1"`. | | `bank` | No | Bank configuration overrides. Omit to leave config unchanged. | | `mental_models` | No | Mental models to create or update. Omit to leave unchanged. | | `directives` | No | Directives to create or update. Omit to leave unchanged. | All of `bank`, `mental_models`, and `directives` are optional. Omit any section to leave that part of the bank unchanged. ### Bank Config Fields All fields in `bank` are optional. Only the fields you include will be set as per-bank overrides — everything else inherits from the server/tenant defaults. 
| Field | Type | Description | |-------|------|-------------| | `reflect_mission` | string | Mission/context for reflect operations | | `retain_mission` | string | Steers what gets extracted during retain | | `retain_extraction_mode` | string | `concise`, `verbose`, `custom`, or `chunks` | | `retain_custom_instructions` | string | Custom extraction prompt (requires `mode=custom`) | | `retain_chunk_size` | integer | Target max characters per content chunk | | `retain_structured_chunk_size` | integer | Max characters for a single JSONL line or conversation turn to keep whole; defaults to `retain_chunk_size` when unset | | `disposition_skepticism` | integer (1-5) | How skeptical the disposition is | | `disposition_literalism` | integer (1-5) | How literal the disposition is | | `disposition_empathy` | integer (1-5) | How empathetic the disposition is | | `enable_observations` | boolean | Toggle observation consolidation | | `observations_mission` | string | Controls what gets synthesised into observations | | `entity_labels` | object[] | Controlled vocabulary as label groups — see [Memory Banks → entity_labels](./memory-banks#entity-labels) | | `entities_allow_free_form` | boolean | Allow entities outside the label vocabulary | ### Mental Model Fields | Field | Required | Description | |-------|----------|-------------| | `id` | Yes | Unique ID (lowercase alphanumeric with hyphens). Used to match on re-import. | | `name` | Yes | Human-readable name | | `source_query` | Yes | The query that generates this model's content via reflect | | `tags` | No | Tags for scoped visibility. Default: `[]` | | `max_tokens` | No | Max tokens for generated content (256-8192). Default: `2048` | | `trigger` | No | Trigger settings for auto-refresh | ### Directive Fields | Field | Required | Description | |-------|----------|-------------| | `name` | Yes | Directive name. Used as the match key on re-import. | | `content` | Yes | The directive instruction text. 
| | `priority` | No | Priority value (higher = more important). Default: `0` | | `is_active` | No | Whether the directive is active. Default: `true` | | `tags` | No | Tags for categorization. Default: `[]` | ## Import Import a manifest into a bank. If the bank doesn't exist, it's created automatically. ### Behavior - **Config**: all `bank` fields are applied as per-bank config overrides - **Mental models**: matched by `id` — existing models are updated, new ones are created - **Directives**: matched by `name` — existing directives are updated, new ones are created - **Async**: mental model content is generated asynchronously. The response includes `operation_ids` to track progress. ### Dry Run Validate a manifest without applying changes: Returns what *would* happen (which config would be applied, which mental models would be created) without making any changes. Returns HTTP 400 with a detailed error message if the manifest is invalid. ## Export Export a bank's current config overrides, mental models, and directives as a manifest: The exported manifest only includes config fields that were explicitly set as per-bank overrides — not the fully resolved config (which includes server/tenant defaults). This means the exported manifest is portable: importing it into a new bank only overrides the fields that were intentionally customized. ### Round-trip Export from one bank and import into another to replicate the setup: ## JSON Schema The manifest format is defined by a JSON Schema. Fetch the live schema from your server: The static schema is also available at [bank-template-schema.json](/bank-template-schema.json). ## Control Plane The control plane bank creation dialog includes an optional "Import from template" toggle. Enable it to paste a manifest JSON and pre-configure the bank on creation. You can also export any bank's template from the bank Settings page via **Actions → Export Template**, which copies the manifest JSON to your clipboard. 
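Before submitting a manifest for import or a dry run, it can be convenient to run a quick local sanity check. The Python sketch below covers only constraints stated in the field tables of this page (the version value, the 1-5 disposition range, the extraction modes, required fields, and the 256-8192 `max_tokens` range). The function name `check_manifest` is hypothetical, the ID regular expression is one interpretation of "lowercase alphanumeric with hyphens", and the server's JSON Schema remains the authoritative definition.

```python
import re

# Assumption: one reading of "lowercase alphanumeric with hyphens".
ID_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
EXTRACTION_MODES = {"concise", "verbose", "custom", "chunks"}


def check_manifest(manifest: dict) -> list:
    """Return a list of human-readable problems; empty means none found."""
    errors = []
    if manifest.get("version") != "1":
        errors.append('version must be "1"')

    bank = manifest.get("bank", {})
    for trait in ("skepticism", "literalism", "empathy"):
        key = f"disposition_{trait}"
        if key in bank and not (isinstance(bank[key], int) and 1 <= bank[key] <= 5):
            errors.append(f"bank.{key} must be an integer from 1 to 5")
    mode = bank.get("retain_extraction_mode")
    if mode is not None and mode not in EXTRACTION_MODES:
        errors.append(f"bank.retain_extraction_mode {mode!r} is not a known mode")

    for i, model in enumerate(manifest.get("mental_models", [])):
        for field in ("id", "name", "source_query"):
            if not model.get(field):
                errors.append(f"mental_models[{i}].{field} is required")
        if model.get("id") and not ID_PATTERN.match(model["id"]):
            errors.append(f"mental_models[{i}].id must be lowercase with hyphens")
        if not 256 <= model.get("max_tokens", 2048) <= 8192:
            errors.append(f"mental_models[{i}].max_tokens must be 256-8192")

    for i, directive in enumerate(manifest.get("directives", [])):
        for field in ("name", "content"):
            if not directive.get(field):
                errors.append(f"directives[{i}].{field} is required")
    return errors


manifest = {
    "version": "1",
    "bank": {"disposition_skepticism": 3},
    "directives": [{"name": "tone", "content": "Be concise."}],
}
print(check_manifest(manifest))
```

A local check like this only catches obvious mistakes early; the dry run is still the way to see what the server would actually apply.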
## Versioning The `version` field enables forward-compatible schema evolution. The current version is `"1"`. When future versions are released: - Older manifests are automatically upgraded to the current schema on import - Export always produces the latest version - The API rejects manifests with a version newer than what the server supports (with a clear error message suggesting an upgrade) This means old templates keep working indefinitely — no need to manually update them. --- ## File: developer/api/documents.mdx # Documents Track and manage document sources in your memory bank. Documents provide traceability — knowing where memories came from. {/* Import raw source files */} :::tip Prerequisites Make sure you've completed the [Quick Start](./quickstart) and understand [how retain works](./retain). ::: ## What Are Documents? Documents are containers for retained content. They help you: - **Track sources** — Know which PDF, conversation, or file a memory came from - **Update content** — Re-retain a document to update its facts - **Delete in bulk** — Remove all memories from a document at once - **Organize memories** — Group related facts by source ## Chunks When you retain content, Hindsight splits it into chunks before extracting facts. These chunks are stored alongside the extracted memories, preserving the original text segments. **Why chunks matter:** - **Context preservation** — Chunks contain the raw text that generated facts, useful when you need the exact wording - **Richer recall** — Including chunks in recall provides surrounding context for matched facts :::tip Include Chunks in Recall Use `include_chunks=True` in your recall calls to get the original text chunks alongside fact results. See [Recall](./recall) for details. ::: ## Retain with Document ID Associate retained content with a document: ## Update Documents Re-retaining with the same document_id **replaces** the old content: ## Get Document Retrieve a document's original text and metadata. 
This is useful for expanding document context after a recall operation returns memories with document references. ## Update Document Update mutable fields on an existing document without re-processing the content. Currently supports updating `tags`. :::info Observations are re-consolidated When tags change, any consolidated observations derived from the document's memories are invalidated and queued for re-consolidation under the new tags. Co-source memories from other documents that shared those observations are also reset. ::: ## Delete Document Remove a document and all its associated memories: :::warning Deleting a document permanently removes all memories extracted from it. This action cannot be undone. ::: ## List Documents List documents in a bank with optional filtering by ID and tags. ### Filtering Options | Parameter | Description | |---|---| | `q` | Case-insensitive substring match on document ID. `report` matches `report-2024`, `annual-report`, etc. | | `tags` | Filter by document tags. Accepts multiple values. | | `tags_match` | How to match tags (default: `any_strict`). See below. | | `limit` / `offset` | Pagination. Default limit is 100. | **`tags_match` modes:** | Mode | Behaviour | |---|---| | `any_strict` *(default)* | Document must have **at least one** of the specified tags. Untagged docs excluded. | | `any` | Same as `any_strict` but also includes untagged documents. | | `all_strict` | Document must have **all** specified tags. Untagged docs excluded. | | `all` | Same as `all_strict` but also includes untagged documents. 
| ## Document Response Format ```json { "id": "meeting-2024-03-15", "bank_id": "my-bank", "original_text": "Alice presented the Q4 roadmap...", "content_hash": "abc123def456", "memory_unit_count": 12, "nodes_by_fact_type": { "world": 5, "experience": 4, "observation": 3 }, "created_at": "2024-03-15T14:00:00Z", "updated_at": "2024-03-15T14:00:00Z" } ``` ## Next Steps - [**Operations**](./operations) — Monitor background tasks - [**Memory Banks**](./memory-banks) — Configure bank settings --- ## File: developer/api/main-methods.mdx # Main Methods Hindsight provides three core operations: **retain**, **recall**, and **reflect**. {/* Import raw source files */} :::tip Prerequisites Make sure you've [installed Hindsight](../installation) and completed the [Quick Start](./quickstart). ::: ## Retain: Store Information Store conversations, documents, and facts into a memory bank. **What happens:** Content is processed by an LLM to extract rich facts, identify entities, and build connections in a knowledge graph. **See:** [Retain Details](./retain) for advanced options and parameters. --- ## Recall: Search Memories Search for relevant memories using multi-strategy retrieval. **What happens:** Four search strategies (semantic, keyword, graph, temporal) run in parallel, results are fused and reranked. **See:** [Recall Details](./recall) for tuning quality vs latency. --- ## Reflect: Reason with Disposition Generate disposition-aware responses using memories and observations. **What happens:** Memories and observations are recalled, bank disposition is applied, and the LLM reasons through the evidence to generate a response. **See:** [Reflect Details](./reflect) for disposition configuration. 
--- ## Comparison | Feature | Retain | Recall | Reflect | |---------|--------|--------|---------| | **Purpose** | Store information | Find information | Reason about information | | **Input** | Raw text/documents | Search query | Question/prompt | | **Output** | Memory IDs | Ranked facts + observations | Reasoned response | | **Uses LLM** | Yes (extraction) | No | Yes (generation) | | **Uses observations** | No | Yes | Yes | | **Disposition** | No | No | Yes | --- ## Next Steps - [**Retain**](./retain) — Advanced options for storing memories - [**Recall**](./recall) — Tuning search quality and performance - [**Reflect**](./reflect) — Configuring disposition - [**Memory Banks**](./memory-banks) — Managing memory bank disposition --- ## File: developer/api/memories.mdx # Memories A **memory unit** is the atomic fact Hindsight extracts and stores. This page covers the endpoints for working with individual memory units — reading and listing them, inspecting how a derived observation evolved, and **curating** them (correcting, retiring, or restoring). Ingesting and querying memories is covered separately in [Retain](./retain.mdx) and [Recall](./recall.mdx). {/* Import raw source files */} ## Endpoints | Method | Endpoint | Purpose | |---|---|---| | `GET` | `/v1/default/banks/{bank}/memories/list` | List/filter memory units in a bank | | `GET` | `/v1/default/banks/{bank}/memories/{id}` | Fetch a single memory unit | | `GET` | `/v1/default/banks/{bank}/memories/{id}/history` | Refresh history of a derived observation | | `PATCH` | `/v1/default/banks/{bank}/memories/{id}` | Curate: edit / invalidate / restore | | `DELETE` | `/v1/default/banks/{bank}/memories/{id}/observations` | Clear a memory's derived observations | ## List memory units List the memory units in a bank. 
The response includes each unit's `fact_type` (`world` | `experience` | `observation`), `state` (`valid` | `invalidated`), entities, occurred dates, and — for facts a user has edited — an `edited_at` timestamp. Invalidated rows are **included by default** so curation stays auditable; filter with `state=`. ## Fetch a single memory unit For a **derived observation**, the history endpoint returns how it was refreshed over time as new source facts arrived: ## Curation: editing, invalidating & pruning Memory is append-only by design — but sometimes a stored fact is **wrong**, has gone **stale**, or is a **duplicate**. Curation lets you correct or retire individual memories without losing the audit trail. Retired facts are moved out of the active set, so recall never returns them, while remaining fully recoverable. ### When to reach for what Not every "bad memory" needs the same tool. Pick by *why* it's bad: | The memory is… | Use | Why | |---|---|---| | **Wrong because the whole bank extracts badly** (e.g. consistently wrong subject) | Fix the bank's `retain_mission` / `observations_mission`, then **reprocess** the document | Systematic problems are best fixed at the source, then replayed — see [Retain](./retain.mdx) and [Observations](../observations.mdx). | | **Wrong as a one-off** (a single misextracted fact) | **Edit** the memory | Corrects the fact and regenerates everything derived from it. | | **No longer true, with nothing to replace it** (decommissioned server, a tool that was fixed, a role that changed) | **Invalidate** the memory | Nothing in the pipeline knows the world changed, so you tell it explicitly. | | **A duplicate or superseded fact** | **Invalidate** the memory | Removes the noise from recall while keeping the audit trail. | | **Superseded by a newer fact you're storing anyway** (e.g. "likes BMW" → "likes Toyota") | Just retain the new fact | Consolidation already reconciles in-stream contradictions into a single observation. 
| The rule of thumb: **if Hindsight could have known, let consolidation handle it; if only you know, curate it.** Only raw **world** and **experience** facts can be curated. Observations are *derived* — they regenerate from their sources, so you curate the underlying facts, not the observation. A `PATCH` on an observation returns `400`. ### Edit a memory Correct what the LLM extracted. You can change the **text**, **context**, **occurred dates**, **fact type**, and **entities** — anything the extractor could have gotten wrong. Hindsight re-embeds the fact, drops the observations and links derived from the old version, and re-consolidates, so downstream knowledge reflects the correction. Edited facts are marked with an `edited_at` timestamp (surfaced as an **Edited** badge in the control plane). You don't need to rebuild anything yourself: an edit **automatically recomputes the knowledge graph and links** in the background. The fact's entity associations are re-resolved from the new text/entities, its temporal and semantic links are re-derived, and consolidation re-runs — all triggered by the edit. The PATCH returns as soon as the change is committed; the graph/observation rebuild happens asynchronously right after. You can correct the dates, fact type, and entities the same way. For `context`, `occurred_start`, and `occurred_end`, an empty string `""` clears the field and omitting it leaves it unchanged. For `entities`, a list **replaces** the fact's entity set (names are resolved/find-or-created the same way retain does) and `[]` detaches them all; omitting it leaves them unchanged. ### Invalidate a memory (reversible) Soft-retire a fact. An invalidated memory: - **disappears from recall**, consolidation, and the knowledge graph, - has its **links pruned** and its **derived observations re-computed** without it, - **stays in the bank** for audit (visible via the memory and document views), and - can be **restored** at any time. 
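The behaviour in the list above can be pictured with a toy in-memory model. This is only a sketch: `ToyBank` and its methods are hypothetical names, and the substring search stands in for real recall.

```python
class ToyBank:
    """In-memory stand-in for a bank; not Hindsight's storage layer."""

    def __init__(self):
        self.active = {}   # memory id -> fact text, visible to recall
        self.archive = {}  # invalidated facts, kept for audit

    def retain(self, memory_id, text):
        self.active[memory_id] = text

    def invalidate(self, memory_id):
        # Move the fact out of the active set rather than deleting it.
        self.archive[memory_id] = self.active.pop(memory_id)

    def restore(self, memory_id):
        # Move the fact back so recall can see it again.
        self.active[memory_id] = self.archive.pop(memory_id)

    def recall(self, term):
        # Naive substring match over active facts only.
        return [text for text in self.active.values() if term.lower() in text.lower()]


bank = ToyBank()
bank.retain("m1", "The staging server is db-old-01")
bank.invalidate("m1")
hits_after_invalidate = bank.recall("staging")  # the fact is archived, not active
bank.restore("m1")
hits_after_restore = bank.recall("staging")     # the fact is active again
```

Because invalidated facts live in a separate collection, `recall` needs no state filter, and nothing is lost when a fact is retired.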
Restoring moves the fact back into the active set and re-consolidates: Behind the scenes, invalidating **moves** the row out of the active `memory_units` table into a separate archive, so recall and consolidation never need a "skip invalidated" filter — the rows simply aren't there. :::note Documents are the source of truth A memory is extracted from a document. Editing or invalidating a memory does **not** change the document it came from — that's deliberate: the document stays as an accurate historical record. As a result, **reprocessing a document resets curation** of the facts it produced (extraction runs fresh from the original text). Fix systematic issues at the mission level and reprocess; use edit/invalidate for the residue. ::: ### A pruning workflow To clean up duplicates and reclaim noise: cluster duplicates from `memories/list`, then **invalidate** them — recall is clean immediately, and the audit trail is preserved. --- ## File: developer/api/memory-banks.mdx # Memory Banks Memory banks are isolated containers that store all memory-related data for a specific context or use case. {/* Import raw source files */} ## What is a Memory Bank? A memory bank is a complete, isolated storage unit containing: - **Memories** — Facts and information retained from conversations - **Documents** — Files and content indexed for retrieval - **Entities** — People, places, concepts extracted from memories - **Relationships** — Connections between entities in the knowledge graph - **Directives** — Hard rules the agent must follow during reflect operations Banks are completely isolated from each other — memories stored in one bank are not visible to another. You don't need to pre-create a bank. Hindsight will automatically create it with default settings when you first use it. :::tip Prerequisites Make sure you've completed the [Quick Start](./quickstart) to install the client and start the server. 
::: ## Creating a Memory Bank ## Bank Configuration Each memory bank can be configured independently per operation. Configuration can be set via the [bank config API](#updating-configuration), the [Control Plane UI](/), or [server-wide environment variables](/developer/configuration). ### retain_mission {#retain-configuration} A plain-language description of what this bank should pay attention to during extraction. The mission is injected into the extraction prompt alongside the built-in rules — it steers focus without replacing the extraction logic. ``` e.g. Always include technical decisions, API design choices, and architectural trade-offs. Ignore meeting logistics, greetings, and social exchanges. ``` Works alongside any extraction mode. Leave blank for general-purpose extraction. ### retain_extraction_mode Controls how aggressively facts are extracted: | Mode | Description | |------|-------------| | `concise` *(default)* | Selective — only facts worth remembering long-term | | `verbose` | Captures more detail per fact; slower and uses more tokens | | `custom` | Write your own extraction rules via `retain_custom_instructions` | ### retain_custom_instructions Only active when `retain_extraction_mode` is `custom`. Replaces the built-in extraction rules entirely with your own instructions. ### retain_chunk_size Maximum number of characters per chunk when splitting content for fact extraction. Larger chunks mean fewer LLM calls but may reduce extraction quality on long inputs; smaller chunks improve granularity at the cost of more calls. Default: `3000` ### retain_structured_chunk_size Maximum number of characters for a single JSONL line or conversation turn to keep whole when it exceeds `retain_chunk_size`. When unset, the limit is exactly `retain_chunk_size`; set a larger value for structured logs or chat transcripts where splitting a single record would lose useful context. 
Default: unset, which uses `retain_chunk_size` See [Retain configuration](/developer/configuration#retain) for environment variable names and defaults. ### entity_labels {#entity-labels} Defines a controlled vocabulary of `key:value` classification labels extracted at retain time and stored as entities. Because labels become entities, they automatically link memories in the knowledge graph (two memories with `pedagogy:scaffolding` are linked), improve semantic and BM25 retrieval, and optionally filter memories via the standard `tags`/`tags_match` API when `tag: true` is set on a group. Each entry in `entity_labels` is a **label group** — one classification dimension: ```json { "entity_labels": [ { "key": "engagement", "description": "Student engagement level during the session", "type": "value", "optional": true, "values": [ { "value": "active", "description": "Student is actively participating" }, { "value": "passive", "description": "Student is listening but not participating" } ] }, { "key": "pedagogy", "description": "Teaching strategies used", "type": "multi-values", "values": [ { "value": "scaffolding", "description": "Breaking complex tasks into smaller steps" }, { "value": "direct_instruction", "description": "Explicit explanation by the teacher" }, { "value": "socratic_questioning", "description": "Guiding through questions rather than answers" } ] } ] } ``` | Field | Default | Description | |-------|---------|-------------| | `key` | — | Label group identifier. Becomes the prefix in `key:value` entities (or `key:field:value` for `"map"`). | | `description` | `""` | Shown to the LLM to guide label assignment. | | `type` | `"value"` | `"value"` → pick one enum value; `"multi-values"` → pick multiple; `"text"` → free-form string; `"map"` → structured group with named fields. | | `values` | `[]` | Allowed values for `"value"` and `"multi-values"` types. Ignored for `"text"` and `"map"`. | | `fields` | `{}` | Field definitions for `"map"` types. 
Each field is itself typed (`"text"`, `"value"`, `"multi-values"`, or nested `"map"`). Ignored for non-map types. | | `optional` | `true` | When `true` the LLM may skip the label if not applicable. When `false` the LLM must always assign a value. Has no effect on `"multi-values"` groups (always optional). | | `tag` | `false` | When `true`, extracted `key:value` labels are also written as tags on the memory unit, enabling filtering via `tags`/`tags_match` in recall/reflect. | **Enum groups** (`type: "value"` or `type: "multi-values"`): the LLM picks from the predefined `values` list; anything outside the list is silently dropped. Vocabulary stays stable and graph links stay tight. Use `"multi-values"` when a fact can belong to several values at once. **Free-text groups** (`type: "text"`): the LLM writes any string. Use the `description` field to provide examples and guidance. Graph clustering is less reliable than with enum groups because the model may phrase the same concept differently across sessions. ```json { "key": "topic", "description": "Specific subject being discussed. Examples: algebra, quadratic equations, geometry.", "type": "text", "optional": true, "values": [] } ``` **Map groups** (`type: "map"`): defines a structured entity type with named fields. Each field is itself typed (`"text"`, `"value"`, `"multi-values"`, or nested `"map"`) so you can describe rich entities like a person with name, role, and organization. Each extracted field is stored as a flat `key:field:value` entity string (e.g. `person:name:Alice`), reusing the existing entity storage with no schema changes — so map fields participate in the knowledge graph and retrieval the same way single-value labels do. 
```json { "key": "person", "description": "A person mentioned in the text", "type": "map", "fields": { "name": { "type": "text", "description": "Full name of the person" }, "role": { "type": "text", "description": "Job title or role" }, "organization": { "type": "text", "description": "Company or organization" } } } ``` ### entities_allow_free_form By default, entity labels are extracted **alongside** regular named entities (people, places, concepts). Set to `false` to disable free-form extraction so only label entities are stored: ```json { "entity_labels": [...], "entities_allow_free_form": false } ``` ### enable_observations {#observations-configuration} Toggles observation consolidation on or off. When `false`, no consolidation runs for this bank — neither automatic nor manual. Defaults to `true` when the observations feature is enabled on the server. ### enable_auto_consolidation Controls whether consolidation runs automatically after retain, delete, and update operations. When `false`, consolidation only runs when explicitly triggered via the [consolidate endpoint](/developer/observations#trigger-consolidation). Defaults to `true`. This is useful when you want full control over consolidation timing — for example, batching many retains before consolidating, or running [targeted consolidation](/developer/observations#targeted-consolidation) for specific scopes only. ### observations_mission Defines what this bank should synthesise into durable observations. Replaces the built-in consolidation rules entirely — leave blank to use the server default. ``` e.g. Observations are stable facts about people and projects. Always include preferences, skills, and recurring patterns. Ignore one-off events and ephemeral state. ``` ### consolidation_llm_batch_size Number of facts sent to the LLM in a single consolidation call. Higher values reduce LLM calls and improve throughput at the cost of larger prompts. Set to `1` to disable batching. 
Leave unset to use the server default (`8`). ### consolidation_source_facts_max_tokens Total token budget for source facts included with observations in the consolidation prompt. Source facts give the LLM evidence to compare new facts against existing observations. `-1` = unlimited. Leave unset to use the server default (`-1`). ### consolidation_source_facts_max_tokens_per_observation Per-observation token cap for source facts in the consolidation prompt. Each observation independently gets at most this many tokens of source facts, preventing a single observation with many source facts from consuming the entire budget. `-1` = unlimited. Leave unset to use the server default (`256`). See [Observations configuration](/developer/configuration#observations) for environment variable names and defaults. ### reflect_mission A first-person narrative that provides identity and framing context for `reflect`. The agent uses this to ground its reasoning and apply a consistent perspective. ``` e.g. You are a senior engineering assistant. Always ground answers in documented decisions and rationale. Ignore speculation. Be direct and precise. ``` ### disposition_skepticism How skeptical vs trusting the bank is when evaluating claims during `reflect`. Scale 1–5. | Value | Behaviour | |-------|-----------| | `1` | Trusting — accepts information at face value | | `3` *(default)* | Balanced | | `5` | Skeptical — questions and doubts claims | ### disposition_literalism How literally to interpret information during `reflect`. Scale 1–5. | Value | Behaviour | |-------|-----------| | `1` | Flexible — reads between the lines, considers context | | `3` *(default)* | Balanced | | `5` | Literal — takes things exactly as stated | ### disposition_empathy How much to weight emotional context when reasoning during `reflect`. Scale 1–5. 
| Value | Behaviour | |-------|-----------| | `1` | Detached — focuses on facts and logic | | `3` *(default)* | Balanced | | `5` | Empathetic — considers emotional context | :::info Disposition traits and `reflect_mission` only affect the `reflect` operation. `retain_mission` and `observations_mission` are separate per-operation settings. ::: ### mcp_enabled_tools An allowlist of MCP tool names that are enabled for this bank. When set, only the listed tools can be invoked; any tool not in the list returns an error (tools still appear in the MCP tools list for protocol compatibility). Set to `null` (or omit) to allow all tools. ```json ["recall", "reflect"] ``` Available tool names: `retain`, `recall`, `reflect`, `list_banks`, `create_bank`, `list_mental_models`, `get_mental_model`, `create_mental_model`, `update_mental_model`, `delete_mental_model`, `refresh_mental_model`, `list_directives`, `create_directive`, `delete_directive`, `list_memories`, `get_memory`, `list_documents`, `get_document`, `delete_document`, `list_operations`, `get_operation`, `cancel_operation`, `list_tags`, `get_bank`, `get_bank_stats`, `update_bank`, `delete_bank`, `clear_memories`. ### llm_gemini_safety_settings Controls content filtering thresholds for Gemini and VertexAI providers. Accepts a list of safety setting objects in the [Google AI safety settings format](https://ai.google.dev/api/generate-content#v1beta.SafetySetting). When `null` (default), Gemini's built-in safety defaults are used. ```json [ {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"}, {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"} ] ``` Only applies when `HINDSIGHT_API_LLM_PROVIDER` is `gemini` or `vertexai`. ### recall_budget_function {#recall-budget-configuration} Selects how the [`recall` request's `budget` parameter](./recall) (`low` / `mid` / `high`) maps to the internal `thinking_budget` integer used by every retrieval method (semantic, BM25, graph, temporal). 
Two functions are supported: | Function | Behaviour | |----------|-----------| | `fixed` *(default)* | `thinking_budget = recall_budget_fixed_{budget}`, where `{budget}` is the request's `low`, `mid`, or `high` level. Independent of `max_tokens`. Preserves legacy behavior. | | `adaptive` | `thinking_budget = round(max_tokens * recall_budget_adaptive_{budget})`, clamped to `[recall_budget_min, recall_budget_max]`. Retrieval breadth scales with the requested output size. | ```json { "recall_budget_function": "adaptive", "recall_budget_adaptive_low": 0.05, "recall_budget_adaptive_mid": 0.1, "recall_budget_adaptive_high": 0.3, "recall_budget_min": 30, "recall_budget_max": 1500 } ``` ### recall_budget_fixed_low / recall_budget_fixed_mid / recall_budget_fixed_high When `recall_budget_function` is `fixed` (the default), these positive integers are used directly as the per-method retrieval limit for each `budget` level. Defaults: `100` / `300` / `1000`, exactly matching the legacy hardcoded mapping. ### recall_budget_adaptive_low / recall_budget_adaptive_mid / recall_budget_adaptive_high When `recall_budget_function` is `adaptive`, these positive ratios multiply the request's `max_tokens` to derive the per-method retrieval limit. Defaults: `0.025` / `0.075` / `0.25`, chosen to roughly match the fixed defaults at `max_tokens = 4096`. For example, the default `mid` ratio gives `round(4096 * 0.075) = 307`, close to the fixed `300`. ### recall_budget_min / recall_budget_max Floor and ceiling applied to the result of the adaptive function (after the ratio multiplication). Both must be positive integers and `min ≤ max`. Defaults: `20` / `2000`. See [Recall budget mapping](/developer/configuration#recall-budget-mapping) for environment variable names and full defaults. ### memory_defense {#memory_defense} Per-bank Memory Defense policy. Defaults to absent (Memory Defense disabled on this bank). | Field | Type | Default | Description | |---|---|---|---| | `enabled` | bool | `false` | Master switch. | | `default_action` | `allow`\|`redact`\|`quarantine`\|`block` | `allow` | Fallback action when no rule matches.
| | `protected_tag_namespaces` | `list[str]` | `[]` | Writes with tags in these namespaces (`ns:*`) are subject to the `protected_key` detector. | | `immutable_tag_namespaces` | `list[str]` | `[]` | Writes to these namespaces are blocked. | | `rules` | `list[Rule]` | `[]` | Detector-to-action mappings (see below). | | `detector_overrides` | `dict` | `{}` | Per-detector tuning (e.g. `size_anomaly.max_size`). | `Rule` shape: | Field | Required | Description | |---|---|---| | `on` | yes | Detector name (`prompt_injection`, `sensitive_data`, `protected_key`, `immutable_key`, `size_anomaly`) or `*` for any. | | `action` | yes | One of `allow`, `redact`, `quarantine`, `block`. | | `min_severity` | no | Minimum severity (`low`, `medium`, `high`, `critical`) for the rule to fire. Defaults to `low`. | Invalid policies are rejected on PATCH with HTTP 422. See the [Memory Defense guide](../memory-defense/index.md) for usage examples. --- ## Updating Configuration Bank configuration fields (retain mission, extraction mode, observations mission, etc.) are managed via a **separate config API**, not the `create_bank` call. This lets you change operational settings independently from the bank's identity and disposition. ### Setting Configuration Overrides You can update any subset of fields — only the keys you provide are changed. ### Reading the Current Configuration The response distinguishes: - **`config`** — the fully resolved configuration (server defaults merged with bank overrides) - **`overrides`** — only the fields explicitly overridden for this bank ### Resetting to Defaults This removes all bank-level overrides. The bank reverts to server-wide defaults (set via environment variables). You can also update configuration directly from the [Control Plane UI](/) — navigate to a bank and open the **Configuration** tab. --- ## Directives Directives are hard rules that the agent must follow during [reflect](./reflect) operations. 
Unlike disposition traits which influence *how* the agent reasons, directives are explicit instructions that are *always* enforced. :::info Directives only affect the `reflect` operation. They are injected into prompts and the agent is required to comply with them in all responses. ::: ### When to Use Directives Use directives for rules that must never be violated: - **Language/style constraints**: "Always respond in formal English" - **Privacy rules**: "Never share personal data with third parties" - **Domain constraints**: "Prefer conservative investment recommendations" - **Behavioral guardrails**: "Always cite sources when making claims" ### Creating Directives ### Listing Directives ### Updating Directives ### Deleting Directives ### Directives vs Disposition | Aspect | Directives | Disposition | |--------|------------|-------------| | **Nature** | Hard rules, must be followed | Soft influence on reasoning style | | **Enforcement** | Strict — responses are rejected if violated | Flexible — shapes interpretation | | **Use case** | Compliance, guardrails, constraints | Personality, character, tone | | **Example** | "Never recommend specific stocks" | High skepticism: questions claims | --- ## Document export & import Move documents — and the facts already extracted from them — between banks **without re-running the LLM**. Useful for testing a different embedding model, or copying data between banks/instances without paying for re-extraction. The archive carries documents, raw chunks, and extracted facts (entities by canonical name, causal links) — but **no embeddings or database ids**. On import, facts are re-embedded with the *target* bank's model and entities/links are recomputed against it, so imported documents are integrated with whatever already exists there. ### Export documents `GET /v1/default/banks/{bank_id}/document-transfer` — synchronous; streams a ZIP archive. 
```bash # whole bank curl -H "Authorization: Bearer $API_KEY" \ "$HINDSIGHT_URL/v1/default/banks/my-bank/document-transfer" -o my-bank.zip # specific documents (observations cannot be included with a subset) curl -H "Authorization: Bearer $API_KEY" \ "$HINDSIGHT_URL/v1/default/banks/my-bank/document-transfer?document_id=doc-1&document_id=doc-2" -o subset.zip # whole bank, including consolidated observations curl -H "Authorization: Bearer $API_KEY" \ "$HINDSIGHT_URL/v1/default/banks/my-bank/document-transfer?include_observations=true" -o my-bank-with-observations.zip ``` | Query param | Description | |-------------|-------------| | `document_id` | Repeatable. Export only these documents; omit for the whole bank. | | `include_observations` | Also export consolidated observations (default `false`). Only valid for a **whole-bank** export; combining it with `document_id` returns `400`. | ### Import documents `POST /v1/default/banks/{bank_id}/document-transfer`: multipart upload (`file` = the ZIP). Runs as a **background operation** (re-embedding + entity resolution can take a while), so it returns `202` with an `operation_id`; poll the bank's operations endpoint for status and the result counts in `result_metadata`. ```bash curl -H "Authorization: Bearer $API_KEY" -F "file=@my-bank.zip" \ "$HINDSIGHT_URL/v1/default/banks/other-bank/document-transfer?on_conflict=replace" # -> {"operation_id": "…", "status": "pending"} curl -H "Authorization: Bearer $API_KEY" \ "$HINDSIGHT_URL/v1/default/banks/other-bank/operations/$OPERATION_ID" # -> {"status":"completed","result_metadata":{"documents_imported":3,"facts_imported":42,"observations_imported":5,...}} ``` `on_conflict` controls what happens when a document id already exists in the target bank: | Mode | Behavior | |------|----------| | `skip` (default) | Leave the existing document untouched. | | `replace` | Delete the existing document's data and re-import. | | `new-id` | Import a copy under a freshly generated id. | ### Observations Consolidated observations are excluded by default; the target bank regenerates them from the imported facts during consolidation.
Pass `include_observations=true` to carry them instead: they're restored with no LLM, their source references remapped to the imported facts (which are marked consolidated so the target won't re-consolidate them). Because an observation can be derived from facts spanning several documents, `include_observations` is only supported on a **whole-bank export** (omit `document_id`); combining it with a document subset returns `400`. :::warning Imported observations are inserted as-is — no merge They are not merged or deduplicated against observations already in the target bank (consolidation merges related observations; import does not). Prefer importing observations into a fresh/empty bank, or omit `include_observations` and let the target consolidate the imported facts itself. ::: ### Enabling / disabling Both endpoints are gated by server-level flags (default `true`). A disabled endpoint returns `404`, and `/version` reports the state under `features.document_export_api` / `features.document_import_api` (the control plane hides the buttons accordingly). | Variable | Gates | |----------|-------| | `HINDSIGHT_API_ENABLE_DOCUMENT_EXPORT_API` | `GET …/document-transfer` | | `HINDSIGHT_API_ENABLE_DOCUMENT_IMPORT_API` | `POST …/document-transfer` | ## Migrating a bank to a new instance To move a bank to an instance configured with a different **embedding model**, **vector extension**, or **text-search backend** — which can't be changed in place on a populated bank — export the whole bank and import it into the new instance, where every embedding and index is re-derived from the stored text with **no LLM re-extraction**. This carries documents, facts, observations, bank config, mental models, directives, and webhooks (never embeddings). Use the `hindsight-admin export-bank` / `import-bank` commands and follow the blue-green runbook in **[Admin CLI → Migrating a bank to a new instance](../admin-cli.md#migrating-a-bank-to-a-new-instance)**. 
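
Returning to the document-transfer export rules described earlier in this file, the sketch below shows one way a client might build the export URL and catch the `document_id` + `include_observations` conflict before the server answers `400`. The helper name `export_url` and its arguments are hypothetical and not part of any Hindsight client; only the path and the two query parameters come from the text above.

```python
from urllib.parse import urlencode


def export_url(base_url, bank_id, document_ids=(), include_observations=False):
    """Build the GET URL for a document export (hypothetical helper)."""
    if include_observations and document_ids:
        # Mirrors the documented rule: the server rejects this combination with 400.
        raise ValueError("include_observations requires a whole-bank export")
    # document_id is repeatable, so emit one query pair per id.
    params = [("document_id", doc_id) for doc_id in document_ids]
    if include_observations:
        params.append(("include_observations", "true"))
    url = f"{base_url.rstrip('/')}/v1/default/banks/{bank_id}/document-transfer"
    return f"{url}?{urlencode(params)}" if params else url
```

Validating on the client side is optional; it only saves a round trip for a request the server is documented to reject.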
--- ## File: developer/api/mental-models.mdx # Mental Models User-curated summaries that provide high-quality, pre-computed answers for common queries. {/* Import raw source files */} ## What Are Mental Models? Mental models are **saved reflect responses** that you curate for your memory bank. When you create a mental model, Hindsight runs a reflect operation with your source query and stores the result. During future reflect calls, these pre-computed summaries are checked first — providing faster, more consistent answers. ```mermaid graph LR A[Create Mental Model] --> B[Run Reflect] B --> C[Store Result] C --> D[Future Queries] D --> E{Match Found?} E -->|Yes| F[Return Mental Model] E -->|No| G[Run Full Reflect] ``` ### Why Use Mental Models? | Benefit | Description | |---------|-------------| | **Consistency** | Same answer every time for common questions | | **Speed** | Pre-computed responses are returned instantly | | **Quality** | Manually curated summaries you've reviewed | | **Control** | Define exactly how key topics should be answered | ### Hierarchical Retrieval During reflect, the agent checks sources in priority order: 1. **Mental Models** — User-curated summaries (highest priority) 2. **Observations** — Consolidated knowledge 3. **Raw Facts** — Ground truth memories Mental models are checked first because they represent your explicitly curated knowledge. --- ## Create a Mental Model Creating a mental model runs a reflect operation in the background and saves the result: ### Parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `name` | string | Yes | Human-readable name for the mental model | | `source_query` | string | Yes | The query to run to generate content | | `id` | string | No | Custom ID for the mental model (alphanumeric lowercase with hyphens). Auto-generated if omitted. | | `tags` | list | No | Tags that scope the model during reflect **and** filter source memories during refresh. 
Defaults to `all_strict` matching, so only memories carrying every listed tag are read. See [Tags and Visibility](#tags-and-visibility). | | `max_tokens` | int | No | Maximum tokens for the mental model content | | `trigger` | object | No | Trigger settings (see [Automatic Refresh](#automatic-refresh)) | --- ## Create with Custom ID Assign a stable, human-readable ID to a mental model so you can retrieve or update it by name instead of relying on the auto-generated UUID: :::tip Custom IDs must be lowercase alphanumeric and may contain hyphens (e.g. `team-policies`, `q4-status`). If a mental model with that ID already exists, the request is rejected. ::: --- ## Automatic Refresh Mental models can be configured to **automatically refresh** when observations are updated. This keeps them in sync with the latest knowledge without manual intervention. ### Trigger Settings | Setting | Type | Default | Description | |---------|------|---------|-------------| | `mode` | `"full"` \| `"delta"` | `"full"` | Refresh strategy. See [Refresh Mode](#refresh-mode) below. | | `refresh_after_consolidation` | bool | false | Automatically refresh after observations consolidation | | `refresh_cron` | string \| null | null | UTC 5-field cron expression for scheduled refreshes, such as `"0 3 * * *"` for daily at 03:00 UTC | When `refresh_after_consolidation` is enabled, the mental model will be re-generated every time the bank's observations are consolidated — ensuring it always reflects the latest synthesized knowledge. When `refresh_cron` is set, Hindsight checks the schedule on the server's mental-model refresh tick and refreshes the model only if memories in its scope have changed since the last refresh. `refresh_cron` and `refresh_after_consolidation` are mutually exclusive, so a model refreshes either after consolidation or on a fixed UTC schedule, not both. 
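
As a minimal sketch of the trigger rules above, the function below fills in the documented defaults and rejects a trigger that sets both `refresh_cron` and `refresh_after_consolidation`. The name `validate_trigger` is hypothetical, and the 5-field check is only a rough shape test, not a full cron parser; the server's own validation may differ.

```python
def validate_trigger(trigger):
    """Return a copy of `trigger` with documented defaults applied (illustrative only)."""
    settings = {
        "mode": "full",
        "refresh_after_consolidation": False,
        "refresh_cron": None,
        **trigger,  # caller-supplied values override the defaults
    }
    if settings["mode"] not in ("full", "delta"):
        raise ValueError("mode must be 'full' or 'delta'")
    cron = settings["refresh_cron"]
    if cron is not None:
        if settings["refresh_after_consolidation"]:
            # The two refresh triggers are documented as mutually exclusive.
            raise ValueError("refresh_cron and refresh_after_consolidation are mutually exclusive")
        if len(cron.split()) != 5:
            # Shape check only: five whitespace-separated fields.
            raise ValueError("refresh_cron must be a 5-field cron expression")
    return settings
```

For example, `validate_trigger({"mode": "delta", "refresh_cron": "0 3 * * *"})` passes, while adding `"refresh_after_consolidation": True` to the same dict raises `ValueError`.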
### Refresh Mode Two strategies are available for how a refresh produces the new content: - **`full`** *(default)* — every refresh regenerates the entire content from scratch. Simple and predictable: the LLM synthesises a fresh document from the retrieved memories. Best when the document is short, when you want every refresh to potentially restructure the output, or when you're not yet sure what the final shape should be. - **`delta`** — refresh emits a list of typed *operations* (add a section, append a bullet, replace a block, remove a stale paragraph) against the document's existing structure, then renders the result. Sections that aren't targeted by any operation are copied through **byte-identical** — no paraphrasing, no whitespace drift, no list-style normalisation. Best for long-lived "playbook"–style mental models where you want stability across refreshes and only the genuinely changed parts to move. Delta mode falls back to a full regeneration automatically in two cases: 1. The mental model has no existing content yet (nothing to anchor edits on). 2. The `source_query` has changed since the last refresh (the topic has shifted; the existing structure may no longer apply). If the LLM call fails or returns an empty answer, the existing content is preserved — refreshes never overwrite a populated document with an empty one. 
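
The fallback and preservation rules above can be sketched as two small functions. This is an illustration of the documented behaviour under stated assumptions, not Hindsight's implementation: the function names, the `last_refreshed_query` argument, and the use of `None` to represent a failed LLM call are all hypothetical.

```python
def effective_refresh_mode(mode, existing_content, source_query, last_refreshed_query):
    """Decide which strategy a refresh would actually use."""
    if mode != "delta":
        return "full"
    if not existing_content:
        # Case 1: nothing to anchor delta edits on.
        return "full"
    if source_query != last_refreshed_query:
        # Case 2: the topic shifted, so the old structure may not apply.
        return "full"
    return "delta"


def apply_refresh_result(existing_content, new_content):
    """Keep the existing content when the refresh produced nothing usable."""
    if new_content is None or not new_content.strip():
        # Failed call (None here) or empty answer: never overwrite with empty.
        return existing_content
    return new_content
```

Keeping the two decisions separate makes the guarantee easy to see: the mode choice happens before the LLM call, and the empty-answer guard applies afterwards regardless of which mode ran.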
| Use Case | Recommended Mode | Why | |----------|-----------------|-----| | Skill / playbook docs | `delta` | Sections live for many refreshes; only specific rules change | | Onboarding summaries | `delta` | Adding new team members shouldn't restructure the doc | | Real-time dashboards | `full` | Each refresh is a fresh snapshot | | Short FAQ summaries | `full` | Whole-document regeneration is cheap and unambiguous | ### When to Use Automatic Refresh | Use Case | Automatic Refresh | Why | |----------|-------------------|-----| | **Real-time dashboards** | ✅ Enabled | Status should always be current | | **Policy summaries** | ❌ Disabled | Policies change infrequently, manual refresh preferred | | **User preferences** | ✅ Enabled | Preferences evolve with new interactions | | **FAQ answers** | ❌ Disabled | Answers are curated, should be reviewed before updating | :::tip Enable automatic refresh for mental models that need to stay current. Disable it for curated content where you want to review changes before they go live. ::: --- ## List Mental Models --- ## Get a Mental Model ### Detail Levels Both **List** and **Get** endpoints accept an optional `detail` query parameter that controls how much data is returned. This is useful for reducing response size, especially in agent boot flows or MCP clients where context budget is limited. | Level | Fields Returned | Use Case | |-------|----------------|----------| | `metadata` | `id`, `bank_id`, `name`, `tags`, `last_refreshed_at`, `created_at` | Inventory — "what models exist?" | | `content` | All metadata fields + `source_query`, `content`, `max_tokens`, `trigger` | Agent boot — "what do the models say?" | | `full` (default) | All fields including `reflect_response` | Deep inspection — "what evidence backs this model?" 
|

```bash
# List only names and tags (smallest response)
curl "$BASE_URL/v1/default/banks/$BANK_ID/mental-models?detail=metadata"

# List with content but without provenance chains
curl "$BASE_URL/v1/default/banks/$BANK_ID/mental-models?detail=content"

# Get full detail (default behavior)
curl "$BASE_URL/v1/default/banks/$BANK_ID/mental-models/$MODEL_ID?detail=full"
```

The `detail` parameter is also available in the MCP tools:

```json
{"bank_id": "my-bank", "detail": "metadata"}
```

:::tip
Use `detail=content` for agent orientation flows. It includes everything the agent needs to understand the models without the heavyweight `reflect_response` provenance chains, which can exceed 200KB for banks with many models.
:::

### Response Fields

| Field | Type | Detail Level | Description |
|-------|------|-------------|-------------|
| `id` | string | metadata | Unique mental model ID |
| `bank_id` | string | metadata | Memory bank ID |
| `name` | string | metadata | Human-readable name |
| `tags` | list | metadata | Tags for filtering |
| `last_refreshed_at` | string | metadata | When the mental model was last updated |
| `created_at` | string | metadata | When the mental model was created |
| `source_query` | string | content | The query used to generate content |
| `content` | string | content | The generated mental model text |
| `max_tokens` | int | content | Maximum tokens for the mental model content |
| `trigger` | object | content | Trigger settings (see [Automatic Refresh](#automatic-refresh)) |
| `reflect_response` | object | full | Full reflect response including `based_on` provenance facts |

---

## Refresh a Mental Model

Re-run the source query to update the mental model with current knowledge:

Refreshing is useful when:

- New memories have been retained that affect the topic
- Observations have been updated
- You want to ensure the mental model reflects current knowledge

---

## Clear a Mental Model

Clear a mental model's content so the next refresh performs a
**full re-synthesis** from scratch, regardless of the model's trigger mode.

This is useful for delta-mode models that have accumulated drift over many incremental refreshes. Small inaccuracies can compound over time because each delta refresh sees only the facts added since the previous refresh. Clearing and then refreshing produces a clean baseline from all facts.

The clear operation is synchronous and resets the content to an empty string. The model's configuration (name, source query, trigger settings) is preserved. Because the content is now empty, the next `/refresh` call always performs a full regeneration, even if the model's trigger mode is set to `delta`.

:::tip
For long-lived delta-mode mental models, consider scheduling a periodic clear + refresh (e.g. every 48 hours) to keep the content accurate while still benefiting from incremental delta updates in between.
:::

---

## Update a Mental Model

Update the mental model's name:

---

## Delete a Mental Model

---

## Tags and Visibility

Mental models support the same tag system as memories. When you assign tags to a mental model, those tags control both **which memories it reads** during refresh and **when it is surfaced** during reflect.

### How tags affect mental model refresh

:::warning
Adding tags to a mental model narrows the pool of source memories its refresh can read from. If no memories carry those tags yet, refresh will return empty content (e.g. `"I cannot find any information…"`) even though direct `reflect` on the same query works. Backfill tags on the relevant memories first, or override the default via `trigger.tags_match` / `trigger.tag_groups`.
:::

When a mental model is refreshed (manually or automatically), it runs an internal reflect call to regenerate its content. If the mental model has tags, that reflect call uses `all_strict` tag matching, meaning it reads only memories that carry **all** of the mental model's tags. Untagged memories are excluded.
```
Mental model tags: ["user:alice"]

During refresh, it reads:
  ✅ "Alice prefers async communication" — has "user:alice"
  ✅ "Team uses Slack for announcements" — has "user:alice" (plus other tags)
  ❌ "Company policy: no meetings on Fridays" — untagged, excluded
  ❌ "Bob dislikes long meetings" — no "user:alice" tag
```

This means a mental model tagged `["user:alice"]` will also pick up memories tagged `["user:alice", "team"]` — extra tags on a memory don't disqualify it. Only the mental model's own tags are required to be present.

### How tags affect mental model lookup during reflect

When you call `reflect` with tags, those same tags are used to filter which mental models the agent can see. A mental model is visible only if its tags overlap with the tags on the reflect request.

For more details on tag matching modes (`any`, `any_strict`, `all`, `all_strict`) and worked examples, see the [Recall tags reference](./recall#tags).

### Listing mental model tags

`GET /v1/default/banks/{bank_id}/tags` accepts a `source` query parameter that selects which tag space to enumerate:

- `source=memories` *(default)* — tags attached to memory units.
- `source=mental_models` — tags attached to mental models in this bank.

Use the `mental_models` source to populate autocomplete or filter UIs over mental-model tags, distinct from the (typically larger) memory tag set.

---

## History

Every time a mental model's content changes (via refresh or manual update), the previous version is saved with a timestamp. You can retrieve the full change log with the history endpoint:

### Response

The endpoint returns a list of history entries, most recent first:

| Field | Type | Description |
|-------|------|-------------|
| `previous_content` | string \| null | The content before this change (`null` if not available) |
| `changed_at` | string | ISO 8601 timestamp of when the change occurred |

Each entry captures the **content before the change** and when it happened.
The current content is returned by the standard [Get a Mental Model](#get-a-mental-model) endpoint. :::note History tracking is enabled by default. Set `HINDSIGHT_API_ENABLE_MENTAL_MODEL_HISTORY=false` to disable it. ::: --- ## Use Cases | Use Case | Example | |----------|---------| | **FAQ Answers** | Pre-compute answers to common customer questions | | **Onboarding Summaries** | "What should new team members know?" | | **Status Reports** | "What's the current project status?" refreshed weekly | | **Policy Summaries** | "What are our security policies?" | --- ## Next Steps - [**Reflect**](./reflect) — How the agentic loop uses mental models - [**Observations**](/developer/observations) — How knowledge is consolidated - [**Operations**](./operations) — Track async mental model creation --- ## File: developer/api/operations.mdx # Operations Hindsight runs several maintenance and ingestion tasks asynchronously instead of blocking the API call that triggers them. These tasks share a single queue (`async_operations`) and a single worker pool, and the same REST endpoints — list, status, cancel, retry — work across every type. This page explains each operation type, when it fires, and how to inspect or manage it. {/* Import raw source files */} :::tip Prerequisites Make sure you've completed the [Quick Start](./quickstart) and understand [how retain works](./retain). ::: ## How operations work When an API call needs background work, the request handler writes a row to the `async_operations` table with `status=pending` and returns immediately. A worker (running either in-process inside the API by default, or as a dedicated service — see [Services - Worker Service](../services#worker-service)) polls the table, claims pending rows, executes the corresponding handler, and marks the row `completed` or `failed`. By default, every operation runs in-process: no external queue, no extra process to deploy. 
The same code paths support scaling out to dedicated worker processes when throughput demands it. ### Lifecycle | Status | Meaning | |--------|---------| | `pending` | The row is queued. Either no worker has picked it up yet, or an extension has parked it via `next_retry_at` in the future (e.g., for backpressure). | | `processing` | A worker has claimed the row and is actively running the handler. | | `completed` | The handler returned successfully. | | `failed` | The handler raised. `error_message` carries the reason; you can re-queue with `POST /…/retry`. | | `cancelled` | The operation was cancelled via `DELETE /…/operations/{id}` before a worker picked it up. Cancelling a `processing` operation is not supported. | The worker retries failed operations up to `HINDSIGHT_API_WORKER_MAX_RETRIES` times before settling on `failed`. Deterministic failures (e.g., invalid embedding dimensions, integrity violations) skip retries — they won't succeed by re-running. ## Operation types Every operation has an `operation_type` in the database and a `task_type` in the payload. They're usually the same. ### `retain` Submitted by `POST /v1/default/banks/{bank_id}/memories` with `async=true`, or by the multi-item `retain_batch` call. The handler runs the same pipeline as a synchronous retain: fact extraction (LLM), embedding generation, entity resolution, and link creation (temporal, semantic). Use async retain when you're ingesting thousands of items and don't want the HTTP call to hold for minutes. The `operation_id` in the response lets you poll for completion. #### Parent op: `retain_batch` For large submissions, Hindsight automatically splits the input into sub-batches and creates a single `retain_batch` parent operation that tracks the children. The parent's status reflects the aggregate — `pending` until at least one child is running, `processing` while children execute, `completed` once every child has finished, `failed` if any child failed. 
Each child is itself a `retain` operation linked to the parent, so you can drill in for per-batch error messages. When you list operations, the parent and its children all appear by default. Pass `exclude_parents=true` to hide the aggregate rows and show only individual `retain` jobs. ### `file_convert_retain` Submitted by file upload endpoints. The handler runs MIME-specific conversion (PDF → text, DOCX → text, etc.) and then passes the extracted text into the retain pipeline. Failures here are **non-retryable** by default — a corrupted PDF or missing OCR won't improve on rerun, so the operation goes straight to `failed`. Which parser runs (`markitdown`, `iris`, or `llama_parse`) is selected per deployment via `HINDSIGHT_API_FILE_PARSER`, and clients can override it per request — see [Configuration → File Processing](../configuration#file-processing). ### `consolidation` Produces **observations** from new world/experience memories. See [Observations](../observations) for what they are and how they're synthesized. Triggered automatically: - After every retain that added world/experience facts (gated by per-bank `enable_auto_consolidation` and `enable_observations`). - After deletes that invalidated existing observations (the source memory disappeared → derived observations are stale → re-run with the surviving co-source memories). - Manually via `POST /v1/default/banks/{bank_id}/consolidate`. Pass `observation_scopes` to consolidate only memories matching specific tag combinations. **Bank-deduped**: while one `consolidation` job is pending for a bank, repeat submits return the existing `operation_id` instead of stacking. Once the job starts processing, the next submit becomes the next pending slot. ### `refresh_mental_model` A mental model has a `source_query` that defines which memories it summarizes. The handler re-runs that query, re-summarizes the result, and updates the model's content in place. 
Triggered either manually via `POST /v1/default/banks/{bank_id}/mental-models/{id}/refresh`, or automatically by the auto-refresh schedule for mental models that have one configured. ### `graph_maintenance` Reconciles derived state that goes stale after a delete. Every invocation runs three passes: 1. **Link top-up.** Drains the `graph_maintenance_queue` (units whose outgoing temporal/semantic links lost a neighbour). For each, if the unit is under its cap (20 temporal, 50 semantic), Hindsight re-runs the same probes retain uses and inserts the missing links. Without this, the retain pipeline's top-K capping would leave surviving units permanently under-capped after every delete — degrading graph-expansion recall. 2. **Orphan entity prune.** Deletes entities in the bank with no remaining `unit_entities` references. FK `ON DELETE CASCADE` on `entity_cooccurrences` then removes any cooccurrence row pointing at a pruned entity. 3. **Stale cooccurrence prune.** Cleans up `entity_cooccurrences` rows where both endpoints still exist but no current memory_unit references both of them — the cooccurrence was real when it was recorded, but every unit that witnessed it has since been deleted. Bank-deduped at submit time, so concurrent triggers against the same bank coalesce into one drain. **Triggers:** any delete that removes memory_units — `DELETE /documents/{id}`, `DELETE /memories/{id}`, and re-retaining an existing `document_id` (the upsert path). A full bank wipe (`delete_bank`) is a no-op: there's nothing left in the bank to maintain. ### `webhook_delivery` After certain operations complete (e.g., consolidation finishing on a bank with a registered webhook), Hindsight enqueues a `webhook_delivery` task. The handler POSTs the payload to the configured URL and retries on transient failures. ## Endpoints All paths below are scoped by `bank_id`. 
### List operations

```bash
GET /v1/default/banks/{bank_id}/operations
```

Query parameters:

| Param | Description |
|-------|-------------|
| `status` | Filter by `pending`, `processing`, `completed`, `failed`, `cancelled`. |
| `type` | Filter by `retain`, `file_convert_retain`, `consolidation`, `refresh_mental_model`, `graph_maintenance`, `webhook_delivery`. |
| `limit` | 1–100, default 20. |
| `offset` | Pagination offset. |
| `exclude_parents` | Exclude parent batch operations from results (large `retain_batch` calls create one parent + N children). |

`items_count` is operation-specific — non-zero only for retain-shaped operations (it counts content items in the submission).

### Get operation status

Query parameters:

| Param | Description |
|-------|-------------|
| `include_payload` | Include the raw task payload (the submission params) in the response as `task_payload`. Default `false`; may be large. |

A few response fields are worth calling out:

| Field | Description |
|-------|-------------|
| `updated_at` | When the operation's row last changed — claim, progress heartbeat, or completion. |
| `progress` | Last-known progress snapshot for a running operation, or `null` if none was recorded (completed-instantly or pre-feature rows). |
| `task_payload` | The raw submission params; only populated when `include_payload=true`. |

`progress` is written at coarse phase/batch boundaries (consolidation, batch retain) and lets you tell a healthy long-running job from a frozen one: if `processed` keeps advancing across polls the job is alive; identical numbers with no movement in `at` mean it's stuck.

Its shape:

| Field | Description |
|-------|-------------|
| `stage` | Coarse phase the operation last reported (e.g. `processing_batch`). |
| `at` | ISO-8601 timestamp when this snapshot was written. |
| `processed` | Units of work finished so far (sub-batches, memories), when known. |
| `total` | Total units of work for the operation, when known.
| | `detail` | Operation-specific counters (e.g. `observations_created`, `round`, `items_in_sub_batch`). | ### Cancel a pending operation Returns `409` if the operation is already in `processing`, `completed`, or `failed` state. ### Retry a failed operation The row's status resets to `pending` and the worker picks it up again. Returns `409` if the operation isn't in `failed` or `cancelled` state. ## Async retain example Submit a batch asynchronously and poll until the operation completes: ## Worker tuning Each worker has a single concurrency budget (`HINDSIGHT_API_WORKER_MAX_SLOTS`, default 10) shared across all operation types. Per-type slot reservations (`HINDSIGHT_API_WORKER__MAX_SLOTS`) carve out guaranteed capacity within that budget; remaining slots form a shared pool any type can use. See [Configuration → Worker Configuration](../configuration#distributed-workers) for the full table. For most deployments the defaults are fine. Reserve slots for an operation type if you've seen it starved by a flood of another type (e.g., a long file_convert_retain blocking graph_maintenance on a deletion-heavy workload). ## Next Steps - [**Documents**](./documents) — Track document sources - [**Memory Banks**](./memory-banks) — Configure bank settings --- ## File: developer/api/quickstart.mdx # Quick Start Get up and running with Hindsight in 60 seconds. {/* Import raw source files */} ## Clients ## Start the API Server :::tip LLM Provider Hindsight requires an LLM with structured output support. Recommended: **Groq** with `gpt-oss-20b` for fast, cost-effective inference. See [LLM Providers](/developer/models#llm) for more details. 
::: --- ## Use the Client --- ## What's Happening | Operation | What it does | |-----------|--------------| | **Retain** | Content is processed, facts are extracted, entities are identified and linked in a knowledge graph | | **Recall** | Four search strategies (semantic, keyword, graph, temporal) run in parallel to find relevant memories | | **Reflect** | Retrieved memories are used to generate a disposition-aware response | --- ## Integrations Browse all supported integrations in the [Integrations Hub](/integrations). ## Next Steps - [**Retain**](./retain) — Advanced options for storing memories - [**Recall**](./recall) — Search and retrieval strategies - [**Reflect**](./reflect) — Disposition-aware reasoning - [**Memory Banks**](./memory-banks) — Configure disposition and mission - [**Server Deployment**](/developer/installation) — Docker Compose, Helm, and production setup --- ## File: developer/api/recall.mdx # Recall Memories Retrieve memories from a bank using multi-strategy recall. When you **recall**, Hindsight runs four retrieval strategies in parallel — semantic similarity, keyword (BM25), graph traversal, and temporal — then fuses and reranks the results into a single ranked list. The response contains structured facts, not raw documents. {/* Import raw source files */} :::info How Recall Works Learn about the four retrieval strategies (semantic, keyword, graph, temporal) and RRF fusion in the [Recall Architecture](/developer/retrieval) guide. ::: :::tip Prerequisites Make sure you've completed the [Quick Start](./quickstart) to install the client and start the server. ::: ## Basic Recall --- ## Parameters ### query The natural language question or statement to search for. This is the only required field. The query drives all four retrieval strategies simultaneously: it is embedded for semantic search, tokenized for BM25 keyword search, used to seed graph traversal, and parsed for temporal expressions. 
After retrieval, the raw query text is also passed to the cross-encoder reranker to re-score every candidate. Queries exceeding 500 tokens are rejected. ### types Controls which categories of memory facts are searched. Accepted values are `world` (objective facts), `experience` (events and conversations), and `observation` (deduplicated, evidence-grounded beliefs consolidated from multiple memories). When omitted, all three types are searched. Each type runs the full four-strategy retrieval pipeline independently, so narrowing `types` reduces both the result set and query cost. :::tip About Observations Observations are deduplicated, evidence-grounded beliefs consolidated from multiple facts — preferences, recurring patterns, and durable learnings the memory bank has built up. Each observation references its supporting memories (with exact quotes), and is refined rather than overwritten when new evidence arrives. They are created and maintained automatically in the background after retain operations. ::: ### prefer_observations Because observations are consolidated from raw facts, recalling `observation` alongside `world` and `experience` can return the same information twice — once as the raw fact and once folded into an observation. With `prefer_observations` you get the best of both: you still recall every type, but whenever an observation in the results was built from a raw fact, that raw fact is dropped so the observation supersedes it. The freed slots are backfilled with the next-best results, so you don't lose coverage. This lets you ask for everything without choosing between "raw facts only" (no consolidation) and "observations only" (which may lag behind the latest retains while consolidation catches up). **Disabled by default** — set it to `true` to opt in. It has no effect unless both `observation` and at least one of `world`/`experience` are included in `types`. ### budget Controls retrieval depth and breadth. 
Accepted values are `low`, `mid` (default), and `high`. Use `low` for fast simple lookups, `mid` for balanced everyday queries, and `high` when you need to find indirect connections or exhaustive coverage.

### max_tokens

The maximum number of tokens the returned facts can collectively occupy. Defaults to `4096`. Only the `text` field of each fact is counted toward this budget; metadata, tags, entities, and other fields are not included. After reranking, facts are included in relevance order until this budget is exhausted, so you always get the most relevant memories that fit.

Hindsight is designed for agents, which think in tokens rather than result counts: set `max_tokens` to however much of your context window you want to allocate to memories.

### query_timestamp

An ISO 8601 datetime representing when the query is being asked, from the user's perspective. When provided, it is used as the anchor for resolving relative temporal expressions in the query and for recency scoring. For example, if the query says "last month" and `query_timestamp` is `2023-05-30`, the temporal search window becomes approximately April 2023, and recency boosts are calculated as of May 30, 2023. Without it, the server's current time is used as the anchor. This field matters most for replaying historical conversations or building agents that need time-anchored recall.

### include

An optional object controlling supplementary data returned alongside the main facts.

#### chunks

When enabled, the response includes the raw source text chunks from which each fact was extracted. Chunks are fetched before the `max_tokens` filter, so setting `max_tokens=0` returns no facts but can still return chunks. The `max_tokens` sub-option (default `8192`) controls the total chunk token budget independently of the main fact budget. This is useful when agents need surrounding context beyond the extracted fact text.

:::note
When `include_chunks` is enabled, chunks are fetched based on the top-scored reranked results before token filtering. The last chunk is truncated (not dropped) to fit exactly within the budget, and each chunk carries a `truncated` flag indicating whether it was cut.
:::

#### source_facts

When enabled and `types` includes `observation`, each observation result is accompanied by the original contributing facts it was synthesized from. Source facts are returned in a top-level `source_facts` dict keyed by fact ID, and each observation result carries a `source_fact_ids` list for cross-referencing. Facts are deduplicated across observations. The `max_tokens` sub-option (default `4096`) limits the total token budget for source facts.

#### entities

Enabled by default. When active, each returned fact includes the canonical names of entities associated with it. Set to `null` to skip the entity JOIN query and reduce response size. The `max_tokens` sub-option (default `500`) is a future-facing guard for entity data.

### tags

Filters recall to only memories that match the specified tags. When omitted, all memories regardless of tags are eligible. Tag filtering is applied at the database level across all four retrieval strategies, not as a post-processing step.
The `tags_match` parameter controls the filtering logic:

| Mode | Untagged memories | Match condition |
|------|-------------------|-----------------|
| `any` (default) | Included | Memory has **at least one** of the specified tags |
| `any_strict` | Excluded | Memory has **at least one** of the specified tags |
| `all` | Included | Memory has **all** of the specified tags |
| `all_strict` | Excluded | Memory has **all** of the specified tags |
| `exact` | Excluded | Memory has **exactly** the specified tag set |

#### Scenario setup

Consider a bank with these four memories:

| Memory | Tags |
|--------|------|
| "Alice prefers async communication" | `["user:alice"]` |
| "Bob dislikes long meetings" | `["user:bob"]` |
| "Team uses Slack for announcements" | `["user:alice", "team"]` |
| "Company policy: no meetings on Fridays" | *(untagged)* |

#### `any`: OR matching, includes untagged (default)

Returns memories that have **at least one** matching tag, plus untagged memories.

Use this for **shared global knowledge + user-specific** patterns, where untagged memories represent information everyone should see.

#### `any_strict`: OR matching, excludes untagged

Same as `any` but untagged memories are excluded.

Use this when memories are **fully partitioned by tags** and untagged memories should never be visible.

#### `all`: AND matching, includes untagged

Returns memories that have **every** specified tag, plus untagged memories.

Use this when memories must belong to a **specific intersection** of scopes (e.g., only memories relevant to both a user and a project), while still surfacing shared global knowledge.

#### `all_strict`: AND matching, excludes untagged

Returns memories that have **every** specified tag, and excludes untagged memories.

Use this for strict scope enforcement where a memory must explicitly belong to **all** specified contexts.
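The matching modes above can be checked against the scenario bank with a small in-memory sketch. This only illustrates the semantics in the table; it is not Hindsight's implementation, which applies the filter in the database. The names `matches`, `BANK`, and `recall_texts` are hypothetical, and treating an omitted or empty filter as "no filter" for the non-`exact` modes is an assumption drawn from the `tags` description.

```python
def matches(memory_tags, filter_tags, mode="any"):
    """Return True if a memory with `memory_tags` passes the tag filter."""
    mem, flt = set(memory_tags), set(filter_tags)
    if mode == "exact":
        return mem == flt               # set equality, tag order is irrelevant
    if not flt:
        return True                     # assumed: omitted/empty filter means no filtering
    if not mem:
        return mode in ("any", "all")   # untagged memories pass only the non-strict modes
    if mode in ("any", "any_strict"):
        return bool(mem & flt)          # at least one specified tag present
    if mode in ("all", "all_strict"):
        return flt <= mem               # every specified tag present, extras allowed
    raise ValueError(f"unknown tags_match mode: {mode}")


# The four memories from the scenario table (hypothetical in-memory stand-in).
BANK = {
    "Alice prefers async communication": ["user:alice"],
    "Bob dislikes long meetings": ["user:bob"],
    "Team uses Slack for announcements": ["user:alice", "team"],
    "Company policy: no meetings on Fridays": [],
}


def recall_texts(filter_tags, mode):
    """List the memory texts that would be eligible under the given filter."""
    return [text for text, tags in BANK.items() if matches(tags, filter_tags, mode)]


print(recall_texts(["user:alice"], "any"))                # Alice, Team, Company policy
print(recall_texts(["user:alice"], "any_strict"))         # Alice, Team
print(recall_texts(["user:alice", "team"], "all"))        # Team, Company policy
print(recall_texts(["user:alice", "team"], "all_strict")) # Team only
print(recall_texts(["user:alice"], "exact"))              # Alice only
```

Note how `all_strict` and `exact` diverge for `["user:alice"]`: the first would also accept the Slack memory (it carries an extra `team` tag), while the second accepts only the memory whose tag set is exactly `["user:alice"]`.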
:::tip Extra tags are fine
A memory with tags `["user:alice", "team", "project:x"]` will still match a filter of `["user:alice", "team"]` under `all_strict`; extra tags on the memory are not a problem. The filter only requires the memory to contain **at least** the specified tags.
:::

#### `exact`: set equality, excludes untagged

Returns memories whose tag set is exactly equal to the specified tags, regardless of tag order. Unlike `all_strict`, memories with extra tags do not match.

Use this when filtering a precise observation scope returned by `GET /v1/default/banks/{bank_id}/observations/scopes`, where `["user:alice"]` should not also match observations scoped to `["user:alice", "project:x"]`.

:::tip Filter to global (untagged) observations only
The empty scope is a real scope: it's where `observation_scopes: "shared"` consolidation writes. Set `tags_match: "exact"` with **no tags** (omit `tags`, or pass `[]`) to recall **only** untagged/global memories and exclude every tagged one:

```json
{
  "query": "...",
  "tags": [],
  "tags_match": "exact"
}
```

With any other `tags_match` mode, absent or empty `tags` means "no tag filter" (all memories are eligible). Only under `exact` do absent/empty tags select "the global scope". This is the way to read back just the global observations after you've started using more specific scopes.
:::

### tag_groups

`tag_groups` is a list of compound boolean tag filters. The groups in the list are AND-ed together at the top level. Each group is a recursive boolean expression: a **leaf** node `{tags, match}`, or a **compound** node `{and: [...]}`, `{or: [...]}`, or `{not: ...}`.

`tag_groups` and `tags` / `tags_match` can be used simultaneously; they are AND-ed together.

#### Leaf node

```json
{ "tags": ["step:5", "step:8"], "match": "any_strict" }
```

`match` accepts the same values as `tags_match`: `any`, `all`, `any_strict`, `all_strict`, `exact`. Defaults to `any_strict`.

#### Compound nodes

```json
{ "and": [ <group>, <group>, ... ] }
{ "or":  [ <group>, <group>, ... ] }
{ "not": <group> }
```

#### Examples

**Step filter AND user scope**: two top-level groups AND-ed:

```json
{
  "tag_groups": [
    { "tags": ["step:5", "step:8", "step:12"], "match": "any_strict" },
    { "tags": ["user:ep_42"], "match": "all_strict" }
  ]
}
```

**Nested OR inside AND**: user must match, plus either step OR priority:

```json
{
  "tag_groups": [
    { "tags": ["user:alice"], "match": "all_strict" },
    { "or": [
      { "tags": ["step:5"], "match": "any_strict" },
      { "tags": ["priority:high"], "match": "all_strict" }
    ]}
  ]
}
```

**Exclusion**: user must match, but archived memories are excluded:

```json
{
  "tag_groups": [
    { "tags": ["user:alice"], "match": "all_strict" },
    { "not": { "tags": ["archived"], "match": "any_strict" } }
  ]
}
```

### trace

When set to `true`, the response includes a detailed debug trace covering the query embedding, entry points, per-strategy retrieval results, RRF fusion candidates, reranked results, temporal constraints detected, and per-phase timings. Has no effect on the retrieval logic itself. Useful for understanding why specific memories were or were not returned.

### min_scores

An optional object of per-stage score floors, each compared **inclusively** (`>=`) against the matching field of a result's [`scores`](#scores) and AND-ed together. Any field you leave unset imposes no floor; omitting `min_scores` entirely (the default) applies no score filtering at all.
The four fields operate at **two different levels of the pipeline**:

| field | level | effect |
|---|---|---|
| `semantic` | retrieval | minimum vector similarity, pushed into the SQL; prunes weak vector matches **before** fusion (overrides the global similarity minimum for this request) |
| `keyword` | retrieval | minimum keyword/full-text (BM25) score, pushed into the SQL; prunes weak keyword matches before fusion |
| `reranker` | post-query | minimum normalized cross-encoder score, applied to the ranked results |
| `final` | post-query | minimum final ranking score, applied to the ranked results |

```json
{
  "query": "...",
  "min_scores": { "reranker": 0.5 }
}
```

The retrieval-level floors (`semantic`/`keyword`) change *which candidates are considered*, so they can also change the final ordering; the post-query floors (`reranker`/`final`) only drop already-ranked results. Because freed slots are **not** backfilled, any floor can return fewer results than the budget allows.

**Use floors with care.** The reranker's scores are reliable for *ordering* but not as *absolute* values: a clearly-relevant memory can score `~0.001` on one query and `~1.0` on another, so a fixed cutoff risks silently dropping good results. Calibrate any threshold against the scores you actually observe (recall with no `min_scores` first and inspect the [`scores`](#scores) object).

Each threshold is compared against the matching field in the response [`scores`](#scores) object. See the note under [`scores`](#scores) on why the scale is relative, not absolute, before relying on a fixed threshold.

---

## Response

### results

The main list of recalled facts, ordered by relevance. Relevance is computed by running four retrieval strategies in parallel (semantic similarity, BM25 keyword, graph traversal, and temporal), fusing their rankings with Reciprocal Rank Fusion (RRF), then re-scoring the merged candidates with a cross-encoder reranker against the original query.
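To make the fusion step concrete, here is a minimal sketch of standard Reciprocal Rank Fusion over four ranked lists of fact IDs. It is a generic illustration of the technique, not Hindsight's code: the function name `rrf_fuse`, the sample IDs, and the constant `k = 60` (a commonly used default for RRF) are assumptions, and the cross-encoder re-scoring that follows fusion is not shown.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists: score(id) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, fact_id in enumerate(ranking, start=1):
            scores[fact_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical per-strategy rankings (best match first in each list).
semantic = ["f1", "f2", "f3"]
keyword = ["f2", "f4"]
graph = ["f3", "f2"]
temporal = ["f5"]

fused = rrf_fuse([semantic, keyword, graph, temporal])
print([fact_id for fact_id, _ in fused])  # ['f2', 'f3', 'f1', 'f5', 'f4']
```

A fact that appears in several lists (`f2` here) accumulates score from each, which is why it outranks `f1` even though `f1` was the top semantic hit. Because only ranks are used, the strategies' raw scores never need to share a scale.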
Each result carries a [`scores`](#scores) object (see below). Treat these as **relative** signals: they reflect the ranking within a single query, not an absolute, cross-query confidence; a `0.8` from one query is not comparable to a `0.8` from another. For most agents the right approach is to consume memories in order and let `max_tokens` determine how many fit, rather than filtering by score. The `scores` object (and the [`min_scores`](#min_scores) parameter) exist for callers that want to inspect the ranking or drop a low-confidence tail; calibrate any threshold against the scores you see on an unfiltered query.

Each item in `results` has the following fields:

#### id

The unique identifier of this fact. Use it to cross-reference with `source_facts` or for application-level deduplication.

#### text

The extracted fact text as stored in the memory bank.

#### type

The fact category: `world` for objective information, `experience` for events and conversations, or `observation` for consolidated knowledge synthesized over time.

#### context

The context label provided when the fact was retained (e.g., `"team meeting"`, `"slack"`). `null` if none was set.

#### metadata

The key-value string pairs attached when the fact was retained. `null` if none were set.

#### tags

The visibility-scoping tags attached to this fact.

#### entities

A list of canonical entity name strings linked to this fact. Only populated when `include.entities` is enabled (the default). `null` otherwise.

#### occurred_start / occurred_end

ISO 8601 datetimes representing when the described event started and ended. Extracted by the LLM from the content during retain. `null` if the content had no temporal information.

#### mentioned_at

ISO 8601 datetime of when this fact was retained into the bank.

#### document_id

The document ID this fact belongs to, as set during retain.

#### chunk_id

The ID of the source text chunk this fact was extracted from.
Used to cross-reference with `chunks` in the response when `include.chunks` is enabled.

#### source_fact_ids

For `observation`-type results only: the IDs of the original facts this observation was synthesized from. Cross-references with `source_facts` in the response. `null` for other types or when `include.source_facts` is not enabled.

#### scores

An object of the per-stage scores for this result. `null` for `source_facts` entries, which are attached by provenance rather than ranked. Fields:

- **`final`**: the score this fact was ranked by (cross-encoder relevance × recency/temporal/evidence boosts). `results` is ordered by it descending. A relative signal, not a calibrated probability (see the note above).
- **`reranker`**: the cross-encoder's normalized relevance (`0`–`1`). `null` when the deployment uses a passthrough reranker (RRF/interleave modes).
- **`semantic`**: the raw vector cosine similarity (`0`–`1`). `null` if this result was not surfaced by semantic search.
- **`keyword`**: the raw keyword/full-text (BM25) score (`≥ 0`, unbounded). `null` if this result was not surfaced by keyword search.

Each field is also a valid [`min_scores`](#min_scores) floor.

---

### source_facts

A dict keyed by fact ID containing full `RecallResult` objects for the source facts that contributed to observation results. Only present when `include.source_facts` is enabled. Facts are deduplicated: if two observations share a source fact, it appears once.

### chunks

A dict keyed by chunk ID containing the raw source text chunks associated with the returned facts. Only present when `include.chunks` is enabled. Each chunk has `id`, `text`, `chunk_index`, and `truncated` (whether the text was cut to fit the token budget).

### entities

A dict keyed by canonical entity name containing entity state objects. Only present when `include.entities` is enabled. Each entry has `entity_id`, `canonical_name`, and `observations`.
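The cross-references described above (`source_fact_ids` into `source_facts`, `chunk_id` into `chunks`) can be resolved client-side with plain dictionary lookups. The sketch below works on a hand-written sample shaped like the documented fields; the sample values and the helper name `with_provenance` are hypothetical, and no Hindsight client call is involved.

```python
# Hypothetical, hand-written response limited to the documented fields.
response = {
    "results": [
        {"id": "obs-1", "type": "observation",
         "text": "Alice prefers async communication",
         "chunk_id": None, "source_fact_ids": ["w-1", "w-2"]},
        {"id": "w-3", "type": "world",
         "text": "Bob dislikes long meetings",
         "chunk_id": "c-9", "source_fact_ids": None},
    ],
    "source_facts": {
        "w-1": {"id": "w-1", "type": "world",
                "text": "Alice asked to move the sync to email", "scores": None},
        "w-2": {"id": "w-2", "type": "world",
                "text": "Alice replies fastest in written threads", "scores": None},
    },
    "chunks": {
        "c-9": {"id": "c-9", "text": "Bob: these meetings run too long.",
                "chunk_index": 0, "truncated": False},
    },
}


def with_provenance(response):
    """Attach supporting fact texts and source chunk text to each result."""
    source_facts = response.get("source_facts") or {}  # absent unless include.source_facts
    chunks = response.get("chunks") or {}              # absent unless include.chunks
    enriched = []
    for result in response["results"]:
        fact_ids = result.get("source_fact_ids") or []  # null for non-observation results
        chunk = chunks.get(result.get("chunk_id")) or {}
        enriched.append({
            "text": result["text"],
            "evidence": [source_facts[i]["text"] for i in fact_ids if i in source_facts],
            "chunk": chunk.get("text"),
        })
    return enriched


for item in with_provenance(response):
    print(item["text"], "|", item["evidence"], "|", item["chunk"])
```

Every lookup tolerates a missing or `null` field, so the same helper works whether or not the `include` options were enabled on the request.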
### trace

A debug object present only when `trace: true` was set in the request. Contains per-phase timings, retrieval breakdowns, and RRF fusion details.

---

## File: developer/api/reflect.mdx

# Reflect

Generate a grounded, disposition-aware response using an agentic reasoning loop.

When you call **reflect**, Hindsight runs an agentic loop that autonomously searches the memory bank using multiple retrieval tools, applies the bank's disposition traits to shape the reasoning style, and produces a final answer grounded in what it found. Unlike recall, which returns raw facts, reflect returns a synthesized response written by the LLM.

:::info How Reflect Works
Learn about disposition-driven reasoning in the [Reflect Architecture](/developer/reflect) guide.
:::

:::tip Prerequisites
Make sure you've completed the [Quick Start](./quickstart) to install the client and start the server.
:::

## Basic Usage

---

## Parameters

### query

The question or prompt to reflect on. This is the only required field. If you have situational context that should influence the answer, include it directly in the query rather than as a separate field.

### budget

Controls how thoroughly the agent explores the memory bank before answering. Accepted values are `low` (default), `mid`, and `high`. At `low`, the agent does a shallow search optimized for speed. At `mid`, it checks multiple sources when the question warrants it. At `high`, it performs deep exploration across all knowledge levels and may use multiple query variations to find indirect connections. Use `high` for complex questions that require synthesizing information from many sources.

### max_tokens

Limits the length of the final generated response. Defaults to `4096`. This does not affect how much the agent can retrieve during the agentic loop, only the final answer length.

### response_schema

An optional JSON Schema object.
When provided, the LLM generates a response that conforms to the schema and the response includes a `structured_output` field with the result parsed accordingly. The `text` field will be empty since only a single structured LLM call is made. Use this when you need to process the response programmatically rather than display it as prose.

### tags

Filters which memories the agent can access during reflection. Works identically to [recall tags](./recall#tags): only memories matching the specified tags are considered. The `tags_match` parameter controls the matching logic (`any`, `all`, `any_strict`, `all_strict`, `exact`) with the same semantics as recall.

### include

Controls optional supplementary data returned alongside the main response.

#### include.facts

When enabled, the response includes a `based_on` object listing the memories, mental models, and directives the agent actually used to construct the answer. Only sources retrieved during the agent loop can appear here; citations are validated to prevent hallucinated references. Useful for transparency and verification.

#### include.tool_calls

When enabled, the response includes a `trace` object with the full execution log of every tool call and LLM call made during the agentic loop, including inputs, outputs, and durations. Set `output: false` to include only tool inputs for a smaller payload. Useful for debugging why the agent reached a particular conclusion.

---

## Response

### text

The synthesized answer as a well-formatted markdown string. This is the primary output of reflect. Empty when `response_schema` is provided (use `structured_output` instead in that case).

### structured_output

The LLM's response parsed according to the `response_schema` provided in the request. Only present when `response_schema` was set. `null` otherwise.

### based_on

The sources the agent used to construct the answer. Only present when `include.facts` was enabled.
Contains three fields:

- `memories`: a list of memory facts (world, experience, observation) that were retrieved and cited. Each item has `id`, `text`, `type`, `context`, `occurred_start`, and `occurred_end`.
- `mental_models`: a list of mental models that were used. Each item has `id`, `text`, and `context`.
- `directives`: a list of directives that were enforced during reasoning. Each item has `id`, `name`, and `content`.

### usage

Token usage for all LLM calls made during the agentic loop: `input_tokens`, `output_tokens`, and `total_tokens`. Useful for cost tracking.

### trace

The full execution log of the agentic loop. Only present when `include.tool_calls` was enabled. Contains:

- `tool_calls`: each tool invocation with `tool` name (`lookup`, `recall`, `learn`, `expand`), `input`, `output` (if `output: true`), `duration_ms`, and `iteration` number.
- `llm_calls`: each LLM call with `scope` (e.g., `"agent_1"`, `"final"`) and `duration_ms`.

---

## File: developer/api/retain.mdx

# Ingest Data

Store documents, conversations, and raw content into Hindsight to automatically extract and create memories.

When you **retain** content, Hindsight doesn't just store the raw text; it intelligently analyzes the content to extract meaningful facts, identify entities, and build a connected knowledge graph. This process transforms unstructured information into structured, queryable memories.

:::info How Retain Works
Learn about fact extraction, entity resolution, and graph construction in the [Retain Architecture](/developer/retain) guide.
:::

:::tip Prerequisites
Make sure you've completed the [Quick Start](./quickstart) to install the client and start the server.
:::

## Store a Document

A single retain call accepts one or more **items**. Each item is a piece of raw content (a conversation, a document, a note) that Hindsight will analyze and decompose into one or many memories.
The content itself is never stored verbatim; what gets stored are the structured facts the LLM extracts from it.

### Retaining a Conversation

A full conversation should be retained as a single item. The LLM can parse any format (plain text, JSON, Markdown, or any structured representation) as long as it clearly conveys who said what and when. The example below uses a simple `Name (timestamp): text` format.

When the conversation grows (a new message arrives), just retain again with the full updated content and the same `document_id`. Hindsight will delete the previous version and reprocess from scratch, so memories always reflect the latest state of the conversation.

---

## Parameters

### content

The raw text to store. This is the only required field. Hindsight chunks the content, sends each chunk to the LLM for fact extraction, and stores the resulting structured facts, not the original text. A single `content` value can produce many memories depending on how much information it contains.

### timestamp

When the event described in the content actually occurred. Three forms are accepted:

| Value | Behaviour |
|-------|-----------|
| Omitted / `null` | Defaults to the current time at ingestion. |
| ISO 8601 string (e.g. `"2024-01-15T10:30:00Z"`) | Uses the provided datetime. |
| `"unset"` | Stores the content **without any timestamp**. Use this for timeless material such as reference documents, books, or fictional content where no real event time exists. |

The timestamp is injected into the LLM fact-extraction prompt so the model can resolve relative temporal references in the content. For example, if the content says "last Monday", the model uses the provided timestamp as the anchor to pin down the actual date. When `"unset"` is passed the prompt shows `Event Date: Unknown`, allowing the model to correctly return `N/A` for the `when` field of every extracted fact. Providing a real timestamp also enables temporal recall queries like "What happened last spring?"
to work correctly.

### context

A short label describing the source or situation, for example `"team meeting"`, `"slack"`, or `"support ticket"`. It is injected directly into the LLM prompt, so it actively shapes how facts are extracted. The same sentence can mean something very different depending on context: "the project was terminated" in a `"performance review"` context versus a `"product roadmap"` context produces different memories. Providing context consistently is one of the highest-leverage things you can do to improve memory quality.

### metadata

Arbitrary key-value string pairs that provide context about this item. For example: `{"source": "slack", "channel": "engineering", "thread_id": "T123"}`.

Metadata is included in the fact extraction prompt, so the LLM can use it as additional context when extracting facts; for instance, knowing the document title or source can improve accuracy. It is also stored on each memory unit and returned with every recalled memory, letting you do client-side filtering or static enrichment without extra lookups, for example linking a memory back to its source URL, thread ID, or any application-specific identifier.

### document_id

A caller-supplied string that groups one or more items under a logical document. This field is the key to making retain **idempotent**.

When you provide a `document_id`, Hindsight upserts the document: if a document with that ID already exists in the bank, it and all its associated memories are deleted before the new content is processed and inserted. This means you can safely re-run retain on updated content (for example, a chat thread that grew since last time) without accumulating duplicate memories. If you omit `document_id`, Hindsight assigns a random UUID per request, so re-ingesting the same content will create duplicate memories.

### update_mode

Controls how Hindsight handles an existing document when you retain with a `document_id` that already exists.

| Value | Behaviour |
|-------|-----------|
| `"replace"` *(default)* | Deletes the old document and all its memories, then processes the new content from scratch. This is the standard upsert described above. |
| `"append"` | Concatenates the new content onto the existing document text and reprocesses the combined document. Delta retain automatically skips unchanged chunks, so only the new portion triggers LLM extraction. |

Append mode requires a `document_id`; without one there is no existing document to append to.

**When to use append**: Use `"append"` for content that grows incrementally, for example a log file, a journal, or a chat transcript where you receive new messages one at a time. Instead of re-sending the entire history on each update, send only the new content with `update_mode: "append"` and Hindsight will efficiently merge it with what it already has.

```json
{
  "items": [
    {
      "content": "New entry to add to the existing document.",
      "document_id": "my-growing-doc",
      "update_mode": "append"
    }
  ]
}
```

### entities

A list of entities you want to guarantee are recognized, merged with any entities the LLM extracts automatically. Each entry has a `text` field (the entity name) and an optional `type` (e.g., `"PERSON"`, `"ORG"`, `"CONCEPT"`; defaults to `"CONCEPT"` if omitted). Use this when you know certain entities are important but the LLM might miss them or refer to them inconsistently across different parts of the content. Providing entities explicitly ensures they are always linked in the knowledge graph.

### tags and document_tags

Tags control **visibility scoping**: which memories are visible during recall. A memory is only returned if its tags intersect with the tags filter provided in the recall request. This makes tags useful when a single memory bank serves multiple users or sessions and each should only see their own memories.

Use consistent naming patterns to keep tag filtering predictable.
Common conventions: `user:` for per-user scoping, `session:` for session isolation, `room:` for chat rooms, `topic:` for category filtering. The bank also exposes a list-tags endpoint that returns all tags with their memory counts, useful for UI autocomplete or wildcard expansion.

See [Recall API](./recall#tags) for filtering by tags during retrieval.

### observation_scopes

Controls which [observations](../observations) this memory contributes to during consolidation. Each scope runs an independent pass, creating or updating observations tagged with only that scope's tags.

:::info Scope isolation
During consolidation, Hindsight uses `all_strict` matching to find existing observations to update: only observations whose tags exactly match the current scope are considered. This keeps scopes isolated: a memory consolidated under `["student:alice"]` will never bleed into an observation tagged `["student:alice", "teacher:bob"]`.
:::

The examples below use a lesson transcript retained with `tags: ["student:alice", "teacher:bob", "session-id:s1"]`.

#### combined *(default)*

One consolidation pass using all tags together. The resulting observation is tagged with the full set.

- Observations created: `["student:alice", "teacher:bob", "session-id:s1"]`
- ✗ *"What does Alice struggle with across all her sessions?"*: no match, because no observation was ever built for `student:alice` alone
- ✗ *"How does Bob teach?"*: no match for `teacher:bob` alone
- ✓ *"What happened in session s1 with Alice and Bob?"*: exact match

**Use when** the memory is meaningful only as a whole and you never need to query any single tag in isolation.

#### shared

One consolidation pass over a single global, **untagged** scope. The memory's own tags are ignored for observation scoping (they stay on the source facts for recall filtering), so memories with *different* tags all consolidate into the **same** observation.
- Observations created: one untagged observation (`[]`)
- ✓ Untagged observations match every recall regardless of tag filter
- Deduplicates across volatile per-call tags: if each session is retained with a unique `session-id:…` tag, `combined` and `per_tag` create a fresh observation every session (the tag never repeats), whereas `shared` folds them all into one.

**Use when** your tags are per-call provenance (e.g. session ids) that you want for recall filtering and debugging but not as a consolidation boundary. Keep the tag on `tags` and set `observation_scopes: "shared"`.

:::caution `shared` vs `[[]]` vs `[]`
`shared` is equivalent to the explicit scope `[[]]`, a list containing one empty scope. Do **not** confuse it with `[]` (an empty list), which declares *zero* scopes and silently falls back to `combined`.
:::

#### per_tag

One consolidation pass per individual tag. Each tag gets its own isolated observation that grows with every new memory sharing that tag.

- Observations created: `["student:alice"]` · `["teacher:bob"]` · `["session-id:s1"]`
- ✓ *"What does Alice struggle with across all her sessions?"*
- ✓ *"How does Bob teach?"*
- ✓ *"What happened in session s1?"*
- ✗ *"How does Alice perform specifically with Bob?"*: no observation for the `["student:alice", "teacher:bob"]` combination
- ✗ *"How does Bob teach in online sessions?"*: no observation for `["teacher:bob", "session-id:s1"]`

**Use when** content involves multiple tags that each represent an independent subject. This is the most common choice for multi-party content like conversations, lessons, or support sessions.

#### all_combinations

One pass per non-empty subset of tags (singles, pairs, triples, and so on). For 3 tags that is 7 passes.
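The scope modes described so far differ only in which tag sets they produce, which a short enumeration makes easy to compare. This is a sketch of the documented semantics, not Hindsight's consolidation code; the function name `observation_scopes` is hypothetical. For `n` tags, `all_combinations` yields `2**n - 1` scopes, which is where the 7 passes for 3 tags come from.

```python
from itertools import combinations


def observation_scopes(tags, mode="combined"):
    """Return the list of tag scopes (one consolidation pass each) for a mode."""
    tags = list(tags)
    if mode == "combined":
        return [tags]                # one pass over the full tag set
    if mode == "shared":
        return [[]]                  # one pass over the single untagged scope
    if mode == "per_tag":
        return [[tag] for tag in tags]
    if mode == "all_combinations":
        # Every non-empty subset: singles, then pairs, then triples, ...
        return [list(subset)
                for size in range(1, len(tags) + 1)
                for subset in combinations(tags, size)]
    raise ValueError(f"unknown observation_scopes mode: {mode}")


lesson_tags = ["student:alice", "teacher:bob", "session-id:s1"]
for mode in ("combined", "shared", "per_tag", "all_combinations"):
    scopes = observation_scopes(lesson_tags, mode)
    print(f"{mode}: {len(scopes)} pass(es)")
```

Since the pass count for `all_combinations` doubles with every added tag, it is worth checking `2**n - 1` against the number of tags you attach before choosing that mode.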
- Observations created: all `"per_tag"` scopes above, plus `["student:alice", "teacher:bob"]` · `["student:alice", "session-id:s1"]` · `["teacher:bob", "session-id:s1"]` · `["student:alice", "teacher:bob", "session-id:s1"]`
- ✓ All questions from `"per_tag"` above
- ✓ *"How does Alice perform specifically with Bob?"*: matched by `["student:alice", "teacher:bob"]`

**Use when** you need observations at every granularity: per tag, per pair, per group.

#### custom

Pass an explicit list of tag sets. Each inner list is one scope.

```json
[["student:alice"], ["teacher:bob"], ["teacher:bob", "session-id:s1"]]
```

- Observations created: exactly those three scopes, nothing more
- ✓ *"What does Alice struggle with?"*
- ✓ *"How does Bob teach?"*
- ✓ *"How does Bob teach in session s1 specifically?"*
- ✗ *"What happened in session s1 regardless of teacher?"*: `["session-id:s1"]` alone was not included

**Use when** you know exactly which combinations are meaningful and want to avoid unnecessary passes.

### Response

The synchronous retain response includes:

- `success`: whether the operation completed without errors
- `bank_id`: the memory bank that received the content
- `items_count`: number of items processed
- `async`: whether processing ran asynchronously
- `usage`: token usage for the LLM calls (`input_tokens`, `output_tokens`, `total_tokens`), only present for synchronous operations

---

## Batch Ingestion

Multiple items can be submitted in a single request. Batch ingestion is the recommended approach: it reduces network overhead and lets Hindsight optimize extraction across related content.

---

## Files

Upload files directly; Hindsight converts them to text and extracts memories automatically. File processing always runs asynchronously and returns operation IDs for tracking.

**Supported formats:** PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, images (JPG, PNG, GIF, etc.; OCR support depends on the configured parser), audio (MP3, WAV, FLAC, etc.; transcription), HTML, and plain text formats (TXT, MD, CSV, JSON, YAML, etc.)

The file retain endpoint always returns asynchronously. The response contains `operation_ids` (one per uploaded file), which you can poll via `GET /v1/default/banks/{bank_id}/operations` to track progress.

Upload up to 10 files per request (max 100 MB total). Each file becomes a separate document with optional per-file metadata:

:::info File Storage
Uploaded files are stored server-side (PostgreSQL by default, or S3/GCS/Azure for production). Configure storage via `HINDSIGHT_API_FILE_STORAGE_TYPE`. See [Configuration](../configuration#file-processing) for details.
:::

---

## Async Ingestion

For large batches, use async ingestion to avoid blocking your application:

When `async: true`, the call returns immediately with an `operation_id`. Processing runs in the background via the worker service. No `usage` metrics are returned for async operations.

### Cut Costs 50% with Provider Batch APIs

When using async retain, enable the provider Batch API to reduce LLM fact-extraction costs by 50%. OpenAI, Groq, and Gemini all offer this discount in exchange for a processing window of up to 24 hours, a trade-off that's typically invisible when retain already runs in the background.

```bash
export HINDSIGHT_API_RETAIN_BATCH_ENABLED=true
```

Hindsight submits fact extraction calls as a batch job to the provider, polls for completion, and processes results automatically. No changes to your API calls are needed.

:::note
Batch API cost savings require `async=true` in your retain request and a compatible provider (OpenAI, Groq, or Gemini).
:::

---

## File: developer/api/webhooks.mdx

# Webhooks

Hindsight can notify your application in real time when memory events occur by sending HTTP POST requests to a URL you configure.

## Delivery and Retries

Webhooks are registered per memory bank and fire automatically when matching events occur.
Each delivery attempt is tracked, and failed deliveries are retried with exponential backoff:

| Attempt | Delay after failure |
|---------|---------------------|
| 1 | 5 seconds |
| 2 | 5 minutes |
| 3 | 30 minutes |
| 4 | 2 hours |
| 5 | 5 hours |
| 6 | Permanent failure |

A delivery is considered failed if your endpoint returns a non-2xx status code or does not respond within the configured timeout (default 30 seconds). After 6 failed attempts, the delivery is marked as permanently failed and no further retries are made.

:::info At-least-once delivery
Webhook delivery tasks are queued in the same database transaction as the primary operation (e.g. the retain or consolidation write). This means if the server crashes after committing but before sending, the delivery task survives and will be retried. As a result, **your endpoint may receive the same event more than once**; use the `operation_id` field to deduplicate if needed.
:::

## Event Types

### `consolidation.completed`

Fired after Hindsight finishes consolidating new memories into observations for a bank.

**Payload:**

```json
{
  "event": "consolidation.completed",
  "bank_id": "my-bank",
  "operation_id": "a1b2c3d4e5f6",
  "status": "completed",
  "timestamp": "2026-03-04T12:00:00Z",
  "data": {
    "observations_created": 3,
    "observations_updated": 1,
    "observations_deleted": null,
    "error_message": null
  }
}
```

**`data` fields:**

| Field | Type | Description |
|-------|------|-------------|
| `observations_created` | `integer \| null` | Number of new observations created |
| `observations_updated` | `integer \| null` | Number of existing observations updated |
| `observations_deleted` | `integer \| null` | Number of observations deleted |
| `error_message` | `string \| null` | Set when `status` is `"failed"` |

**`status` values:** `"completed"` or `"failed"`

---

### `retain.completed`

Fired once per document after a retain operation completes (both synchronous and asynchronous).
When retaining a batch of N documents, N separate events are fired. **Payload:** ```json { "event": "retain.completed", "bank_id": "my-bank", "operation_id": "a1b2c3d4e5f6", "status": "completed", "timestamp": "2026-03-04T12:00:01Z", "data": { "document_id": "doc-abc123", "tags": ["meeting", "q1-2026"] } } ``` **`data` fields:** | Field | Type | Description | |-------|------|-------------| | `document_id` | `string \| null` | The document ID if one was provided in the retain request | | `tags` | `string[] \| null` | Document-level tags applied during retain | **Notes:** - For async retain (`async: true`), `operation_id` matches the `operation_id` returned by the retain API. - For sync retain, `operation_id` is a generated identifier for tracing purposes. - One event is fired per content item in the retain request. --- ### `memory_defense.triggered` Fired when a bank's [Memory Defense](../memory-defense/index.md) policy acts on a retained item — once per item that is **redacted** or **blocked**. Items that pass cleanly do not fire an event. Requires a Memory Defense policy enabled on the bank and a webhook subscribed to this event type. **Payload:** ```json { "event": "memory_defense.triggered", "bank_id": "my-bank", "operation_id": "a1b2c3d4e5f6", "status": "redact", "timestamp": "2026-03-04T12:00:02Z", "data": { "action": "redact", "detector": "sensitive_data", "document_id": "doc-abc123", "matched_types": ["github_token", "aws_access_key"], "message": "Sensitive data pattern matched: github_token, aws_access_key" } } ``` **`data` fields:** | Field | Type | Description | |-------|------|-------------| | `action` | `string` | Action taken on the item: `"redact"` or `"block"` | | `detector` | `string \| null` | The detector that matched (`"sensitive_data"`) | | `document_id` | `string \| null` | The document ID if one was provided in the retain request | | `matched_types` | `string[] \| null` | Labels of the redaction patterns that fired (e.g. 
`github_token`, `ssn_us`) | | `message` | `string \| null` | Human-readable summary of what matched | **`status` values:** mirrors `data.action` — `"redact"` or `"block"`. **Notes:** - A `redact` event means the secret was scrubbed and the redacted memory was still stored. A `block` event means the item was dropped; if every item in the retain request is blocked, the retain call returns `422`. --- ## File: developer/development.md # Development Guide Guide to setting up a local development environment for contributing to Hindsight. ## Prerequisites - Python 3.11+ - [uv](https://docs.astral.sh/uv/) - Fast Python package manager - Docker and Docker Compose - An LLM API key (OpenAI, Groq, or Ollama) ## Local Development Setup ### 1. Clone the Repository ```bash git clone https://github.com/vectorize-io/hindsight.git cd hindsight ``` ### 2. Install Dependencies ```bash uv sync ``` ### 3. Start PostgreSQL Start only the database via Docker: ```bash cd docker && docker-compose up -d postgres ``` ### 4. Configure Environment ```bash cp .env.example .env ``` Edit `.env` with your LLM API key: ```bash # Database (connects to Docker postgres) HINDSIGHT_API_DATABASE_URL=postgresql://hindsight:hindsight_dev@localhost:5432/hindsight # LLM Provider (choose one) HINDSIGHT_API_LLM_PROVIDER=groq HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx HINDSIGHT_API_LLM_MODEL=llama-3.1-70b-versatile ``` ### 5. Start the API Server ```bash ./scripts/start-server.sh --env local ``` The server will be available at http://localhost:8888. 
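After step 5 the server may need a moment before it accepts connections, particularly on first run. The following is a minimal sketch of a readiness check, assuming the default `localhost:8888` address from the step above. `wait_for_port` is a hypothetical helper written for this example using only the Python standard library; it is not part of Hindsight.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper for local development. It only checks that something
    is listening on the port, not that the API itself is healthy.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, a script or test fixture could call `wait_for_port("localhost", 8888)` before sending its first request.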
## Running Tests ```bash # Run all tests uv run pytest # Run specific test file uv run pytest tests/test_retrieval.py # Run with verbose output uv run pytest -v ``` ## Code Generation ### Regenerate API Clients When you modify the OpenAPI spec, regenerate the clients: ```bash ./scripts/generate-clients.sh ``` This generates: - Python client in `hindsight-clients/python/` - TypeScript client in `hindsight-clients/typescript/` ### Export OpenAPI Schema ```bash ./scripts/export-openapi.sh ``` ## Project Structure ``` hindsight/ ├── hindsight-api/ # Main API server │ ├── hindsight_api/ │ │ ├── api/ # HTTP endpoints │ │ ├── engine/ # Memory engine, retrieval, reasoning │ │ └── web/ # Server entry point │ └── tests/ ├── hindsight-clients/ # Generated SDK clients │ ├── python/ │ └── typescript/ ├── hindsight-control-plane/ # Admin UI (Next.js) ├── docker/ # Docker Compose setup └── scripts/ # Development scripts ``` ## Contributing 1. Create a feature branch from `main` 2. Make your changes 3. Run tests: `uv run pytest` 4. Submit a pull request ## Troubleshooting ### Database Connection Issues Ensure PostgreSQL is running: ```bash docker-compose ps ``` Check database connectivity: ```bash psql postgresql://hindsight:hindsight_dev@localhost:5432/hindsight ``` ### ML Model Download On first run, Hindsight downloads embedding and reranking models. This may take a few minutes. Models are cached in `~/.cache/huggingface/`. ### Port Conflicts If port 8888 is in use: ```bash HINDSIGHT_API_PORT=8889 ./scripts/start-server.sh --env local ``` --- ## File: developer/extensions.md # Extensions Extensions allow you to customize and extend Hindsight behavior without modifying core code. They enable multi-tenancy, custom authentication, additional HTTP endpoints, and operation hooks. --- ## Available Extensions ### TenantExtension Handles multi-tenancy and API key authentication. 
Validates incoming requests and determines which PostgreSQL schema to use for database operations, enabling tenant isolation at the database level. **Built-in: ApiKeyTenantExtension** A simple implementation that validates API keys against an environment variable and uses the `public` schema for all authenticated requests. ```bash HINDSIGHT_API_TENANT_EXTENSION=hindsight_api.extensions.builtin.tenant:ApiKeyTenantExtension HINDSIGHT_API_TENANT_API_KEY=your-secret-key ``` **Built-in: SupabaseTenantExtension** Validates [Supabase](https://supabase.com) JWTs and provides multi-tenant memory isolation. Each authenticated user gets their own PostgreSQL schema (`{prefix}_{user_id}`), ensuring complete data separation. Performs local JWT verification using JWKS for optimal performance (no network call per request). ```bash HINDSIGHT_API_TENANT_EXTENSION=hindsight_api.extensions.builtin.supabase_tenant:SupabaseTenantExtension HINDSIGHT_API_TENANT_SUPABASE_URL=https://your-project.supabase.co # Optional - only needed for legacy HS256 projects or health check HINDSIGHT_API_TENANT_SUPABASE_SERVICE_KEY=your-service-role-key ``` See the [source code](https://github.com/vectorize-io/hindsight/blob/main/hindsight-api-slim/hindsight_api/extensions/builtin/supabase_tenant.py) for complete configuration options and implementation details. For other multi-tenant setups with separate schemas per tenant (e.g., custom JWT-based auth), implement a custom `TenantExtension`. --- ### HttpExtension Adds custom HTTP endpoints under the `/ext/` path prefix. Useful for adding domain-specific APIs that integrate with Hindsight's memory engine. Provides two router methods: - `get_router(memory)` — returns a FastAPI router mounted at `/ext/` - `get_root_router(memory)` — returns a FastAPI router mounted at the application root (for well-known endpoints or other paths that must be at specific locations). Returns `None` by default. 
**No built-in implementation** - implement your own to add custom endpoints.

```bash
HINDSIGHT_API_HTTP_EXTENSION=mypackage.ext:MyHttpExtension
```

---

### OperationValidatorExtension

Hooks into retain/recall/reflect operations for validation and monitoring. Use cases include:

- Rate limiting and quota enforcement
- Permission checks and content filtering
- Audit logging and usage tracking
- Custom metrics collection

**No built-in implementation** - implement your own based on your requirements.

```bash
HINDSIGHT_API_OPERATION_VALIDATOR_EXTENSION=mypackage.validators:MyValidator
```

---

### MCPExtension

Registers additional MCP (Model Context Protocol) tools on the Hindsight MCP server. Enables external packages to add custom tools without modifying core code.

**No built-in implementation** - implement your own to add custom MCP tools.

```bash
HINDSIGHT_API_MCP_EXTENSION=mypackage.mcp:MyMCPExtension
```

---

## Writing Custom Extensions

### Extension Basics

Extensions are Python classes loaded via environment variables, where `<TYPE>` is the extension type (`TENANT`, `HTTP`, `OPERATION_VALIDATOR`, or `MCP`):

```bash
HINDSIGHT_API_<TYPE>_EXTENSION=mypackage.module:MyExtensionClass
```

Configuration is passed via environment variables with the same prefix:

```bash
HINDSIGHT_API_<TYPE>_SOME_CONFIG=value
# Extension receives: {"some_config": "value"}
```

All extensions support lifecycle hooks:

- `on_startup()` - Called when the application starts
- `on_shutdown()` - Called when the application shuts down

Extensions have access to an `ExtensionContext` that provides:

- `run_migration(schema)` - Run database migrations for a schema
- `get_memory_engine()` - Get the MemoryEngine interface

### Example: Custom TenantExtension with JWT

```python
import jwt  # PyJWT

from hindsight_api.extensions import TenantExtension, TenantContext, AuthenticationError

# RequestContext must also be imported; its module path is not shown in this example.


class JwtTenantExtension(TenantExtension):
    def __init__(self, config: dict[str, str]):
        super().__init__(config)
        self.jwt_secret = config.get("jwt_secret")
        if not self.jwt_secret:
            raise ValueError("HINDSIGHT_API_TENANT_JWT_SECRET is required")

    async def authenticate(self, context: RequestContext) -> TenantContext:
        token = context.api_key
        if not token:
            # Optional headers dict is forwarded in HTTP/MCP error responses
            raise AuthenticationError("Bearer token required")
        try:
            payload = jwt.decode(token, self.jwt_secret, algorithms=["HS256"])
            tenant_id = payload.get("tenant_id")
            if not tenant_id:
                raise AuthenticationError("Missing tenant_id in token")
            return TenantContext(schema_name=f"tenant_{tenant_id}")
        except jwt.InvalidTokenError as e:
            raise AuthenticationError(str(e))
```

`AuthenticationError` accepts an optional `headers` dict that is forwarded in both HTTP and MCP error responses. This is useful for returning custom headers like `WWW-Authenticate`:

```python
raise AuthenticationError(
    "Authorization required",
    headers={"WWW-Authenticate": 'Bearer realm="example"'},
)
```

### Example: Custom HttpExtension

```python
from fastapi import APIRouter
from hindsight_api.engine import MemoryEngine
from hindsight_api.extensions import HttpExtension


class MyHttpExtension(HttpExtension):
    def get_router(self, memory: MemoryEngine) -> APIRouter:
        router = APIRouter()

        @router.get("/hello")
        async def hello():
            return {"message": "Hello from extension!"}

        @router.post("/custom/{bank_id}/action")
        async def custom_action(bank_id: str):
            # Access memory engine for database operations
            pool = await memory._get_pool()
            # ... custom logic
            return {"status": "ok"}

        return router

    def get_root_router(self, memory: MemoryEngine) -> APIRouter | None:
        """Optional: mount routes at the application root (not under /ext/)."""
        router = APIRouter()

        @router.get("/.well-known/my-metadata")
        async def metadata():
            return {"version": "1.0"}

        return router
```

Routes from `get_router` are available at `/ext/hello`, `/ext/custom/{bank_id}/action`, etc. Routes from `get_root_router` are mounted at the app root (e.g., `/.well-known/my-metadata`).
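The `config` dict passed to `__init__` in the JWT example comes from the prefixed environment variables described under Extension Basics: `HINDSIGHT_API_TENANT_JWT_SECRET` arrives as `config["jwt_secret"]`. As a rough illustration of that mapping, the sketch below strips the prefix and lowercases the remainder. `extension_config` is a hypothetical function written for this example, and skipping the `..._EXTENSION` key is an assumption; Hindsight's actual loader may behave differently.

```python
from collections.abc import Mapping


def extension_config(environ: Mapping[str, str], ext_type: str) -> dict[str, str]:
    """Collect HINDSIGHT_API_<TYPE>_* variables into a lowercase-keyed dict.

    Illustrative only: this mirrors the documented naming convention, not
    Hindsight's real loading code.
    """
    prefix = f"HINDSIGHT_API_{ext_type.upper()}_"
    config: dict[str, str] = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):]
        # Assumption: the variable that selects the extension class is not
        # itself passed to the extension as configuration.
        if not name or name == "EXTENSION":
            continue
        config[name.lower()] = value
    return config
```

Passing `os.environ` and `"tenant"` would yield the settings for a tenant extension, which makes it easy to check from a shell or test which keys an extension should expect.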
### Example: Custom OperationValidatorExtension

```python
from hindsight_api.extensions import (
    OperationValidatorExtension,
    ValidationResult,
    PrecheckContext,
    RetainContext,
    RecallContext,
    ReflectContext,
    RetainResult,
)


class MyValidator(OperationValidatorExtension):
    # Pre-body validation (optional)
    async def precheck(self, ctx: PrecheckContext) -> ValidationResult:
        if ctx.content_length is not None and ctx.content_length > 10_000_000:
            return ValidationResult.reject("Payload is too large")
        return ValidationResult.accept()

    # Pre-operation validation (required)
    async def validate_retain(self, ctx: RetainContext) -> ValidationResult:
        # Implement your validation logic
        return ValidationResult.accept()
        # Or reject: return ValidationResult.reject("Reason")

    async def validate_recall(self, ctx: RecallContext) -> ValidationResult:
        return ValidationResult.accept()

    async def validate_reflect(self, ctx: ReflectContext) -> ValidationResult:
        return ValidationResult.accept()

    # Post-operation hooks (optional)
    async def on_retain_complete(self, result: RetainResult) -> None:
        # Log usage, update metrics, send notifications, etc.
        pass
```

`precheck` runs before the request body is read or deserialized. Its `PrecheckContext.content_length` is the parsed `Content-Length` header as an integer, or `None` when the header is missing or cannot be parsed (for example, chunked transfer encoding). Use it for cheap size-aware quota or cost guards; the full `validate_*` hooks still run after parsing and should enforce precise per-operation limits.

#### Deferring an operation

In addition to `accept` and `reject`, a `validate_*` hook can ask the worker to **requeue** the operation for a future time by raising `DeferOperation`. Use this for backpressure (rate-limited upstream, quota window not yet open, dependency warming up) — unlike a retry, it does not increment `retry_count` or write `error_message`. The worker sets `next_retry_at` to your `exec_date` and the task is invisible to claim queries until that time.

```python
from datetime import datetime, timedelta, timezone

from hindsight_api.extensions import (
    DeferOperation,
    OperationValidatorExtension,
    RetainContext,
    ValidationResult,
)


class QuotaAwareValidator(OperationValidatorExtension):
    async def validate_retain(self, ctx: RetainContext) -> ValidationResult:
        # _quota_available is a placeholder for your own quota check
        if not await self._quota_available(ctx.bank_id):
            raise DeferOperation(
                exec_date=datetime.now(timezone.utc) + timedelta(minutes=5),
                reason="bank quota window exhausted",
            )
        return ValidationResult.accept()
```

`DeferOperation` is **worker-only**: do not raise it from `validate_recall` or `validate_reflect` in synchronous HTTP request paths — there is no queue to defer to and it will surface as a 500.

### Example: Custom MCPExtension

```python
from mcp.server.fastmcp import FastMCP
from hindsight_api.extensions import MCPExtension
from hindsight_api.engine import MemoryEngine


class MyMCPExtension(MCPExtension):
    async def register_tools(self, mcp: FastMCP, memory: MemoryEngine) -> None:
        @mcp.tool()
        async def custom_search(query: str) -> str:
            """Custom MCP tool for specialized search."""
            # Access memory engine for operations
            pool = await memory._get_pool()
            # ... custom logic
            return f"Results for: {query}"
```

---

## Deploying Custom Extensions

### With Docker

Mount your extension package as a volume and set the environment variable:

```yaml
# docker-compose.yml
services:
  hindsight-api:
    image: vectorize/hindsight-api:latest
    volumes:
      - ./my_extensions:/app/my_extensions
    environment:
      - HINDSIGHT_API_TENANT_EXTENSION=my_extensions.auth:JwtTenantExtension
      - HINDSIGHT_API_TENANT_JWT_SECRET=${JWT_SECRET}
      - PYTHONPATH=/app
```

Or build a custom image with your extensions:

```dockerfile
FROM vectorize/hindsight-api:latest
COPY my_extensions /app/my_extensions
ENV PYTHONPATH=/app
```

### Bare Metal

Install your extension package in the same Python environment as Hindsight:

```bash
# Install Hindsight
pip install hindsight-api

# Install your extension package
pip install ./my-extensions
# or pip install my-extensions-package

# Configure
export HINDSIGHT_API_TENANT_EXTENSION=my_extensions.auth:JwtTenantExtension
export HINDSIGHT_API_TENANT_JWT_SECRET=your-secret

# Run
hindsight-api
```

---

## Contributing Extensions

Custom extensions that solve common use cases are welcome contributions to the Hindsight project. If you've built an extension for:

- Authentication providers (OAuth, SAML, API gateways)
- Rate limiting or quota management
- Audit logging integrations
- Metrics exporters (Datadog, New Relic, etc.)
- Custom HTTP endpoints for specific platforms

Consider contributing it to the `hindsight_api.extensions.builtin` package. Open an issue or pull request on [GitHub](https://github.com/vectorize-io/hindsight) to discuss your extension.

---

## File: developer/index.mdx

# Overview

## Why Hindsight?

AI agents forget everything between sessions. Every conversation starts from zero—no context about who you are, what you've discussed, or what the assistant has learned. This isn't just an implementation detail; it fundamentally limits what AI agents can do.
**The problem is harder than it looks:**

- **Simple vector search isn't enough** — "What did Alice do last spring?" requires temporal reasoning, not just semantic similarity
- **Facts get disconnected** — Knowing "Alice works at Google" and "Google is in Mountain View" should let you answer "In which city does Alice work?" even if you never stored that directly
- **AI agents need to consolidate knowledge** — A coding assistant that remembers "the user prefers functional programming" should consolidate this into an observation and weigh it when making recommendations
- **Context matters** — The same information means different things to different memory banks with different personalities

Hindsight solves these problems with a memory system designed specifically for AI agents.

## What Hindsight Does

```mermaid
graph LR
    subgraph app["Your Application"]
        Agent[AI Agent]
    end
    subgraph hindsight["Hindsight"]
        API[API Server]
        subgraph bank["Memory Bank"]
            direction TB
            MentalModels[Mental Models]
            Observations[Observations]
            MemEnt[Memories & Entities]
            Chunks[Chunks]
            Documents[Documents]
            MentalModels --> Observations --> MemEnt --> Chunks --> Documents
        end
    end
    Agent -->|retain| API
    Agent -->|recall| API
    Agent -->|reflect| API
    API --> bank
```

**Your AI agent** stores information via `retain()`, searches with `recall()`, and reasons with `reflect()`. All three operations go through its dedicated **memory bank**.

## Key Components

### Memory Types

Hindsight organizes knowledge into a hierarchy of facts and consolidated knowledge:

| Type | What it stores | Example |
|------|----------------|---------|
| **Mental Model** | User-curated summaries for common queries | "Team communication best practices" |
| **Observation** | Automatically consolidated knowledge from facts | "User was a React enthusiast but has now switched to Vue" (captures history) |
| **World Fact** | Objective facts received | "Alice works at Google" |
| **Experience Fact** | Bank's own actions and interactions | "I recommended
Python to Bob" | During reflect, the agent checks sources in priority order: **Mental Models → Observations → Raw Facts**. ### Multi-Strategy Retrieval (TEMPR) Four search strategies run in parallel: ```mermaid graph LR Q[Query] --> S[Semantic] Q --> K[Keyword] Q --> G[Graph] Q --> T[Temporal] S --> RRF[RRF Fusion] K --> RRF G --> RRF T --> RRF RRF --> CE[Cross-Encoder] CE --> R[Results] ``` | Strategy | Best for | |----------|----------| | **Semantic** | Conceptual similarity, paraphrasing | | **Keyword (BM25)** | Names, technical terms, exact matches | | **Graph** | Related entities, indirect connections | | **Temporal** | "last spring", "in June", time ranges | ### Observation Consolidation After memories are retained, Hindsight automatically consolidates related facts into **observations** — deduplicated, evidence-grounded beliefs that the bank has built up across many memories: - **Deduplication**: Overlapping facts are merged into a single durable observation instead of piling up as repeats - **Evidence tracking**: Each observation references the source memories (with exact quotes) that support it, plus a proof count - **Continuous refinement**: Observations are updated — not overwritten — when new evidence supports, contradicts, or extends them; history is preserved - **Freshness awareness**: when newer memories have been retained but not yet consolidated, `reflect` treats the affected observations as stale and verifies them against raw facts before relying on them ### Mission, Directives & Disposition Memory banks can be configured to shape how the agent reasons during `reflect`: | Configuration | Purpose | Example | |---------------|---------|---------| | **Mission** | Natural language identity for the bank | "I am a research assistant specializing in ML. I prefer simplicity over cutting-edge." 
| | **Directives** | Hard rules the agent must follow | "Never recommend specific stocks", "Always cite sources" | | **Disposition** | Soft traits that influence reasoning style | Skepticism, literalism, empathy (1-5 scale) | The **mission** tells Hindsight what knowledge to prioritize and provides context for reasoning. **Directives** are guardrails and compliance rules that must never be violated. **Disposition traits** subtly influence interpretation style. These settings only affect the `reflect` operation, not `recall`. ## Clients & Languages ## Integrations Browse all supported integrations in the [Integrations Hub](/integrations). ## Next Steps ### Getting Started - [**Quick Start**](/developer/api/quickstart) — Install and get up and running in 60 seconds - [**RAG vs Hindsight**](/developer/rag-vs-hindsight) — See how Hindsight differs from traditional RAG with real examples ### Core Concepts - [**Retain**](/developer/retain) — How memories are stored with multi-dimensional facts - [**Recall**](/developer/retrieval) — How TEMPR's 4-way search retrieves memories - [**Reflect**](/developer/reflect) — How mission, directives, and disposition shape reasoning ### API Methods - [**Retain**](/developer/api/retain) — Store information in memory banks - [**Recall**](/developer/api/recall) — Search and retrieve memories - [**Reflect**](/developer/api/reflect) — Agentic reasoning with memory - [**Mental Models**](/developer/api/mental-models) — User-curated summaries for common queries - [**Memory Banks**](/developer/api/memory-banks) — Configure mission, directives, and disposition - [**Documents**](/developer/api/documents) — Manage document sources - [**Operations**](/developer/api/operations) — Monitor async tasks ### Deployment - [**Server Setup**](/developer/installation) — Deploy with Docker Compose, Helm, or pip --- ## File: developer/mcp-server.md # MCP Server Hindsight includes a built-in [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) 
server that allows AI assistants to store and retrieve memories directly. ## Access The MCP server is **enabled by default** and mounted at `/mcp` on the API server. Each memory bank has its own MCP endpoint: ``` http://localhost:8888/mcp/{bank_id}/ ``` For example, to connect to the memory bank `alice`: ``` http://localhost:8888/mcp/alice/ ``` To disable the MCP server, set the environment variable: ```bash export HINDSIGHT_API_MCP_ENABLED=false ``` ## Authentication By default, the MCP endpoint is **open** (no authentication required). To enable authentication, configure the API key tenant extension: ```bash export HINDSIGHT_API_TENANT_EXTENSION=hindsight_api.extensions.builtin.tenant:ApiKeyTenantExtension export HINDSIGHT_API_TENANT_API_KEY=your-secret-key ``` When authentication is enabled, include your API key in the `Authorization` header: ### Claude Code ```bash claude mcp add --transport http hindsight http://localhost:8888/mcp \ --header "Authorization: Bearer your-secret-key" \ --header "X-Bank-Id: my-bank" ``` ### Claude Desktop Add to `~/.claude_desktop_config.json`: ```json { "mcpServers": { "hindsight": { "url": "http://localhost:8888/mcp", "headers": { "Authorization": "Bearer your-secret-key", "X-Bank-Id": "my-bank" } } } } ``` ### Direct HTTP Request ```bash curl -X POST http://localhost:8888/mcp \ -H "Authorization: Bearer your-secret-key" \ -H "X-Bank-Id: my-bank" \ -H "Content-Type: application/json" \ -H "Accept: application/json, text/event-stream" \ -d '{"jsonrpc": "2.0", "method": "tools/list", "id": 1}' ``` If the key is missing or invalid, requests will receive a `401 Unauthorized` response. ## Bank Selection The memory bank is resolved in this priority order: 1. **URL path** (highest priority): `http://localhost:8888/mcp/my-bank/` 2. **X-Bank-Id header**: `--header "X-Bank-Id: my-bank"` 3. 
**Default**: Uses `HINDSIGHT_MCP_BANK_ID` env var (default: "default") ## Per-Bank Endpoints Unlike traditional MCP servers where tools require explicit identifiers, Hindsight uses **per-bank endpoints**. The `bank_id` is part of the URL path, so tools don't need to specify which bank to use—it's implicit from the connection. This design: - **Simplifies tool usage** — no need to pass `bank_id` with every call - **Enforces isolation** — each MCP connection is scoped to a single bank - **Enables multi-tenant setups** — connect different users to different endpoints ## Two Modes The MCP server operates in two modes depending on the URL: | Mode | URL | Tools | bank_id | |------|-----|-------|---------| | **Single-bank** | `/mcp/{bank_id}/` | 27 tools (memory, mental models, directives, documents, operations, tags, bank management) | Implicit from URL | | **Multi-bank** | `/mcp/` | All 30 tools including `list_banks`, `create_bank`, `get_bank_stats` | Explicit `bank_id` parameter on each tool | **Single-bank mode** (recommended) scopes all operations to the bank in the URL. Tools don't expose a `bank_id` parameter. **Multi-bank mode** exposes all tools with an optional `bank_id` parameter, plus bank management tools (`list_banks`, `create_bank`, `get_bank_stats`). ## Tool Metadata and Instructions Hindsight can append deployment-specific guidance to the `retain` and `recall` MCP tool descriptions. Set `HINDSIGHT_API_MCP_INSTRUCTIONS` on the API server when clients should see local rules, such as which tags to use or which memories should be retained. ```bash export HINDSIGHT_API_MCP_INSTRUCTIONS="Use project: tags for project-specific memories." ``` MCP clients that read tool annotations also receive safety hints from the built-in tools: - Read-only operations such as `recall`, `reflect`, `list_*`, and `get_*` are marked with `readOnlyHint: true`. - Delete, clear, and invalidate operations are marked with `destructiveHint: true`. 
- `openWorldHint` is `false` for the built-in tools because Hindsight operates on its configured memory store rather than the open internet. - Write operations such as `retain`, `create_*`, `update_*`, `refresh_mental_model`, and `cancel_operation` are not marked destructive. --- ## Available Tools ### retain Store information to long-term memory. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `content` | string | Yes | The fact or memory to store | | `context` | string | No | Category for the memory (default: `general`) | | `timestamp` | string | No | ISO 8601 timestamp for when the event occurred | | `tags` | list[string] | No | Tags for organizing and filtering this memory | | `metadata` | object | No | Key-value metadata to attach (e.g., `{"source": "slack"}`) | | `document_id` | string | No | Associate this memory with an existing document | **Example:** ```json { "name": "retain", "arguments": { "content": "User prefers Python over JavaScript for backend development", "context": "programming_preferences", "tags": ["user:alice", "preferences"] } } ``` **When to use:** - User shares personal facts, preferences, or interests - Important events or milestones are mentioned - Decisions, opinions, or goals are stated - Work context or project details are discussed --- ### sync_retain Store information to long-term memory and wait for completion. Unlike [`retain`](#retain) (which is asynchronous), `sync_retain` blocks until the memory is fully stored and immediately available for recall — useful for read-after-write flows where you query right after storing. 
| Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `content` | string | Yes | The fact or memory to store | | `context` | string | No | Category for the memory (default: `general`) | | `timestamp` | string | No | ISO 8601 timestamp for when the event occurred | | `tags` | list[string] | No | Tags for organizing and filtering this memory | | `metadata` | object | No | Key-value metadata to attach (e.g., `{"source": "slack"}`) | | `document_id` | string | No | Associate this memory with an existing document | **Example:** ```json { "name": "sync_retain", "arguments": { "content": "User prefers Python over JavaScript for backend development", "context": "programming_preferences", "tags": ["user:alice", "preferences"] } } ``` **When to use:** - You need the memory queryable immediately after storing (read-after-write) - A workflow step depends on the stored memory being available before continuing - Otherwise prefer `retain` (asynchronous) to avoid blocking on storage --- ### recall Search memories to provide personalized responses. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `query` | string | Yes | Natural language search query | | `max_tokens` | integer | No | Maximum tokens to return (default: 4096) | | `budget` | string | No | Search thoroughness: `low`, `mid`, or `high` (default: `high`) | | `types` | list[string] | No | Filter by fact type: `world`, `experience`, `observation`. Defaults to all | | `tags` | list[string] | No | Filter memories by tags | | `tags_match` | string | No | Tag matching mode: `any` (default) or `all` | | `query_timestamp` | string | No | ISO 8601 timestamp — recall as if asking at this point in time; anchors relative temporal expressions and recency scoring | | `min_scores` | object | No | Optional per-stage score floors, e.g. `{"reranker": 0.5}`. Keys: `semantic`/`keyword` (retrieval-level cutoffs), `reranker`/`final` (post-ranking). 
All inclusive and AND-ed; omit for no filtering. Reranker scores aren't calibrated across queries — calibrate before use | **Example:** ```json { "name": "recall", "arguments": { "query": "What are the user's programming language preferences?", "tags": ["preferences"], "budget": "high" } } ``` **When to use:** - Start of conversation to recall relevant context - Before making recommendations - When user asks about something they may have mentioned before - To provide continuity across conversations --- ### reflect Generate thoughtful analysis by synthesizing stored memories with the bank's personality. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `query` | string | Yes | The question or topic to reflect on | | `context` | string | No | Optional context about why this reflection is needed | | `budget` | string | No | Search budget: `low`, `mid`, or `high` (default: `low`) | | `max_tokens` | integer | No | Maximum tokens in the response (default: 4096) | | `response_schema` | object | No | JSON Schema for structured output. When provided, the response includes a `structured_output` field | | `tags` | list[string] | No | Filter memories by tags before reflecting | | `tags_match` | string | No | Tag matching mode: `any` (default) or `all` | | `include_trace` | boolean | No | Include `tool_trace` and `llm_trace` debugging output. Defaults to `false` to keep responses small | **Example:** ```json { "name": "reflect", "arguments": { "query": "Based on my past decisions, what architectural style do I prefer?", "budget": "mid", "tags": ["architecture"] } } ``` **When to use:** - When reasoned analysis is needed, not just fact retrieval - Questions like "What should I do?" rather than "What did I say?" - Synthesizing patterns across multiple memories --- ### create_mental_model Create a mental model — a living document that stays current with your memories. 
Mental models are pre-computed reflections that get automatically refreshed as new memories are stored. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `name` | string | Yes | Human-readable name for the mental model | | `source_query` | string | Yes | The query used to generate and refresh the model | | `mental_model_id` | string | No | Custom ID (alphanumeric lowercase with hyphens). Auto-generated if not provided | | `tags` | list[string] | No | Tags for organizing and filtering models | | `max_tokens` | integer | No | Maximum tokens for model content (default: 2048) | | `trigger_refresh_after_consolidation` | boolean | No | Auto-refresh this model after memory consolidation (default: `false`) | **Example:** ```json { "name": "create_mental_model", "arguments": { "name": "Team Directory", "source_query": "Who works here and what do they do?", "tags": ["team", "people"] } } ``` Content generation runs asynchronously. The response includes an `operation_id` to track progress. --- ### list_mental_models List all mental models in a bank, optionally filtered by tags. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `tags` | list[string] | No | Filter models by tags | --- ### get_mental_model Retrieve a specific mental model by ID, including its full content. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `mental_model_id` | string | Yes | The ID of the mental model to retrieve | --- ### update_mental_model Update a mental model's metadata or settings. 
| Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `mental_model_id` | string | Yes | The ID of the mental model to update | | `name` | string | No | New name | | `source_query` | string | No | New source query | | `tags` | list[string] | No | New tags | | `max_tokens` | integer | No | New max tokens | | `trigger_refresh_after_consolidation` | boolean | No | Auto-refresh after consolidation. Only set when you want to change this setting | --- ### delete_mental_model Permanently delete a mental model. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `mental_model_id` | string | Yes | The ID of the mental model to delete | --- ### refresh_mental_model Re-generate a mental model's content from the latest memories. Runs asynchronously. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `mental_model_id` | string | Yes | The ID of the mental model to refresh | --- ### clear_mental_model Clear a mental model's content while keeping its definition. After clearing, call `refresh_mental_model` to rebuild it from the latest memories. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `mental_model_id` | string | Yes | The ID of the mental model to clear | --- ### list_banks (multi-bank mode only) List all available memory banks. --- ### create_bank (multi-bank mode only) Create a new memory bank or retrieve an existing one. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `bank_id` | string | Yes | The ID for the new bank | | `name` | string | No | Human-friendly name for the bank | | `mission` | string | No | Mission describing who the agent is and what they're trying to accomplish | --- ### list_directives List all directives in a bank. Directives are instructions that guide how the memory system processes and responds to queries. 
| Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `tags` | list[string] | No | Filter directives by tags | | `active_only` | boolean | No | Only return active directives (default: `true`) | --- ### create_directive Create a new directive in a bank. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `name` | string | Yes | Human-readable name for the directive | | `content` | string | Yes | The directive content/instruction | | `priority` | integer | No | Priority level (higher = more important) | | `is_active` | boolean | No | Whether the directive is active (default: `true`) | | `tags` | list[string] | No | Tags for organizing directives | --- ### delete_directive Delete a directive by ID. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `directive_id` | string | Yes | The ID of the directive to delete | --- ### list_memories Browse stored memories with optional filtering and pagination. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `type` | string | No | Filter by fact type: `world`, `experience`, or `observation` | | `q` | string | No | Search query to filter memories | | `limit` | integer | No | Maximum number of results (default: 100) | | `offset` | integer | No | Number of results to skip for pagination (default: 0) | --- ### get_memory Retrieve a specific memory by ID. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `memory_id` | string | Yes | The ID of the memory to retrieve | --- ### list_documents List documents that have been ingested into the memory bank. 
| Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `q` | string | No | Search query to filter documents | | `limit` | integer | No | Maximum number of results (default: 100) | --- ### get_document Retrieve a specific document by ID, including its metadata. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `document_id` | string | Yes | The ID of the document to retrieve | --- ### delete_document Delete a document and all memories linked to it. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `document_id` | string | Yes | The ID of the document to delete | --- ### list_operations List async operations (retain processing, mental model refresh, etc.) with optional status filtering. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `status` | string | No | Filter by status: `pending`, `running`, `completed`, `failed`, `cancelled` | | `limit` | integer | No | Maximum number of results (default: 100) | --- ### get_operation Get the status and details of an async operation. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `operation_id` | string | Yes | The ID of the operation to check | --- ### cancel_operation Cancel a pending or running async operation. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `operation_id` | string | Yes | The ID of the operation to cancel | --- ### list_tags List all unique tags used in a bank, optionally filtered by pattern. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `q` | string | No | Glob pattern to filter tags (e.g., `project:*`) | | `limit` | integer | No | Maximum number of results (default: 100) | --- ### get_bank Get information about a memory bank, including its name, mission, and disposition. 
--- ### get_bank_stats (multi-bank mode only) Get statistics for a memory bank (node/link counts). --- ### update_bank Update a memory bank's configuration. Updates the bank's name and/or any bank-level configuration fields — only provided fields are updated; omitted fields remain unchanged. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `name` | string | No | Human-friendly display name for the bank | | `mission` | string | No | **Deprecated** — alias for `config_updates.reflect_mission` | | `config_updates` | object | No | Dictionary of configuration fields to update. Supports all bank-configurable fields (see below). Non-configurable or credential fields are rejected | The `config_updates` object accepts any bank-configurable field by its Python field name, including: - `reflect_mission` — mission/context for Reflect operations - `retain_mission` — steers what gets extracted during `retain()` - `retain_extraction_mode` — `concise` (default), `verbose`, or `custom` - `retain_custom_instructions` — custom extraction prompt (active when mode is `custom`) - `retain_chunk_size` — target maximum characters for each content chunk - `retain_structured_chunk_size` — maximum characters for a single JSONL line or conversation turn to keep whole - `retain_chunk_batch_size` — number of chunks to process in parallel - `enable_observations` — toggle observation consolidation after `retain()` - `observations_mission` — controls observation synthesis rules - `disposition_skepticism` — critical evaluation level (1–5) - `disposition_literalism` — literal vs. 
abstract interpretation (1–5) - `disposition_empathy` — emotional context consideration (1–5) - `entity_labels` — controlled vocabulary for entity classification - `entities_allow_free_form` — allow labels outside `entity_labels` - `recall_include_chunks` — include raw chunks in recall results - `recall_max_tokens` — max tokens for recall results - `mcp_enabled_tools` — tool allowlist for this bank --- ### delete_bank Permanently delete a memory bank and all its data (memories, documents, entities, mental models). --- ### clear_memories Clear all memories from a bank without deleting the bank itself. Optionally filter by fact type to only clear specific kinds of memories. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `type` | string | No | Fact type to clear: `world`, `experience`, or `observation`. If not specified, clears all | --- ## Integration with AI Assistants The MCP server can be used with any MCP-compatible AI assistant. See the [Authentication](#authentication) section above for Claude Code and Claude Desktop configuration examples. Each user can have their own configuration pointing to their personal memory bank using either: - A bank-specific URL path like `/mcp/alice/` (recommended) - The `X-Bank-Id` header --- ## File: developer/memory-defense/index.md # Memory Defense Hindsight scrubs secrets and PII from retain content using a 45-pattern regex set. Each match is replaced with a `[REDACTED:type]` marker before content reaches memory units or the document body. The feature is configured per bank and disabled by default. ## How it works Memory Defense is opt-in per bank. The extension is always present, but it sits dormant until you give a bank a policy that turns it on. When a policy is set, every memory the agent writes to that bank is scanned before it lands in storage. 
When the scanner recognizes a credential, an API key, a database connection string, or a known PII format, the matched substring is replaced with a redaction marker like `[REDACTED:github_token]`. The scrubbed version is what actually gets stored. Memory units and document bodies persist the redacted text, so future recall responses, exports, and reflect operations never see the original secret. A policy only affects future retain calls on the bank where it is set. Existing memories are not retroactively scanned when you add or change a policy. ## Configuring Memory Defense Memory Defense is configured per bank via the bank's `memory_defense` config field. You can set the policy at bank creation time or update it later via `PATCH /v1/{tenant}/banks/{bank_id}/config`. The open-source version implements the `sensitive_data` rule with two possible actions: - **`redact`** — replace each matched secret with a `[REDACTED:type]` marker and store the scrubbed memory. - **`block`** — drop any item that contains a match. If every item in a retain request is blocked, the call returns `422`. A minimal policy: ```json { "memory_defense": { "enabled": true, "rules": [ { "on": "sensitive_data", "action": "redact" } ] } } ``` Once that policy is on a bank, every retain to that bank is screened with the 45 redaction patterns documented below. :::note Existing memories are not retroactively scanned Enabling Memory Defense on a bank only affects future retain calls. Memories already in the bank are not re-scanned or modified when you add or change a policy. If you need to scrub a bank that already contains unredacted content, you have to re-ingest the affected memories or remove them manually. ::: ### Disabled by default Memory Defense is off on every bank until you set a policy. A bank with no `memory_defense` field, with `enabled: false`, or with no `sensitive_data` rule is treated identically: the extension returns ALLOW and content passes through unchanged. 
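
The dormant-versus-active decision described above can be sketched in Python. This is an illustration only, not Hindsight's implementation: the function names `active_action` and `screen_item` are hypothetical, and the two regexes are assumptions that merely borrow the `github_token` and `ssn_us` labels from the pattern tables in this document.

```python
import re

# Two illustrative patterns. The labels match the tables in this document;
# the regexes are assumptions, not Hindsight's bundled definitions.
ILLUSTRATIVE_PATTERNS = {
    "github_token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "ssn_us": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def active_action(bank_config):
    """Return the sensitive_data action, or None when the policy is dormant."""
    policy = bank_config.get("memory_defense")
    if not policy or not policy.get("enabled", False):
        return None  # no policy at all, or enabled: false
    for rule in policy.get("rules", []):
        if rule.get("on") == "sensitive_data":
            return rule.get("action")
    return None  # enabled, but no sensitive_data rule


def screen_item(text, bank_config):
    """Return the text to store, or None if the item is dropped."""
    action = active_action(bank_config)
    if action is None:
        return text  # ALLOW: content passes through unchanged
    scrubbed, hits = text, 0
    for label, pattern in ILLUSTRATIVE_PATTERNS.items():
        scrubbed, count = pattern.subn(f"[REDACTED:{label}]", scrubbed)
        hits += count
    if action == "block":
        return None if hits else text  # drop the item only when something matched
    return scrubbed  # "redact": store the scrubbed version


policy_on = {"memory_defense": {"enabled": True,
                                "rules": [{"on": "sensitive_data", "action": "redact"}]}}
sample = "deploy key: ghp_" + "a" * 36
print(screen_item(sample, {}))         # dormant bank: printed unchanged
print(screen_item(sample, policy_on))  # deploy key: [REDACTED:github_token]
```

All three dormant cases (no `memory_defense` field, `enabled: false`, no `sensitive_data` rule) take the same early return, which mirrors the "treated identically" behavior described above.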
To stop redacting on a bank that has it on, set `enabled: false` or remove the policy. ## Notifications When an item is redacted or blocked, Hindsight fires a [`memory_defense.triggered` webhook](../api/webhooks.mdx#memory_defensetriggered) if a webhook on the bank is subscribed to that event type. The payload reports the action taken, the document ID, and which redaction patterns matched — useful for routing security alerts to a SIEM or Slack. Clean items fire no event. The same redact/block decisions are also recorded as `memory_defense` entries in the [audit log](../configuration.md#audit-logging) (when audit logging is enabled), with the action and matched pattern labels in the entry metadata. ## Patterns covered The 45 bundled patterns cover the categories below. ### AI and LLM providers | Label | Catches | |---|---| | `anthropic_key` | `sk-ant-...` | | `openai_key`, `openai_project_key`, `openai_admin_key` | `sk-...`, `sk-proj-...`, `sk-admin-...` | | `google_api_key` | `AIza...` (39 chars) | | `google_oauth_token` | `ya29.` | | `xai_key` | `xai-...` | | `groq_key` | `gsk_...` | | `huggingface_token` | `hf_...` | | `replicate_token` | `r8_...` | | `perplexity_key` | `pplx-...` | | `databricks_token` | `dapi` | ### Cloud providers | Label | Catches | |---|---| | `aws_access_key` | `AKIA<16>` | | `aws_session_token` | `ASIA<16>` | | `digitalocean_token` | `dop_v1_` | ### Source control and CI | Label | Catches | |---|---| | `github_fg_pat` | `github_pat_...` | | `github_token` | `ghp_<36>` | | `github_app_token` | `ghs_<36>` | | `github_user_token` | `ghu_<36>` | | `github_refresh` | `ghr_<36>` | | `github_oauth` | `gho_<36>` | | `gitlab_pat` | `glpat-...` | | `npm_token` | `npm_...` | | `pypi_token` | `pypi-AgEIcHlwaS5vcmc...` | ### Payment processors | Label | Catches | |---|---| | `stripe_secret` | `sk_live_...`, `sk_test_...` | | `stripe_restricted` | `rk_live_...`, `rk_test_...` | | `square_token` | `sq0...` | | `braintree_token` | 
`access_token$production$...` | ### Communications and email | Label | Catches | |---|---| | `slack_token` | `xoxb-`, `xoxp-`, `xoxa-`, `xoxr-` | | `slack_webhook` | `https://hooks.slack.com/services/...` | | `twilio_api_key` | `SK` | | `twilio_account_sid` | `AC` | | `sendgrid_key` | `SG.<22>.<43>` | | `mailgun_key` | `key-<32>` | | `discord_bot` | `<23>.<6>.<27>` | | `telegram_bot` | `<8-10 digits>:<35>` | ### Commerce | Label | Catches | |---|---| | `shopify_token` | `shpat_` | ### Database connection strings | Label | Catches | |---|---| | `db_url_postgres` | `postgres://user:pass@host` or `postgresql://...` | | `db_url_mysql` | `mysql://user:pass@host` | | `db_url_mongodb` | `mongodb://user:pass@host` or `mongodb+srv://...` | ### Private keys, JWTs, and generic credentials | Label | Catches | |---|---| | `private_key_pem` | `-----BEGIN ... PRIVATE KEY-----` PEM blocks | | `jwt` | `eyJ
.eyJ.` | ### PII (US defaults) | Label | Catches | |---|---| | `credit_card` | 13 to 19 digits with regular separators | | `ssn_us` | `123-45-6789` shape | --- ## File: developer/models.mdx # Models Hindsight uses several machine learning models for different tasks. ## Overview - **LLM** — Fact extraction, reasoning, and generation. Provider-specific, fully configurable. - **Embedding** — Vector representations for semantic search. Default: `BAAI/bge-small-en-v1.5`. - **Cross-Encoder** — Reranking search results. Default: `cross-encoder/ms-marco-MiniLM-L-6-v2`. Embedding and cross-encoder models are downloaded automatically from HuggingFace on first run. --- ## LLM Used for fact extraction, entity resolution, mental model consolidation, and answer synthesis. **Supported providers:** Also supports **any OpenAI-compatible API** (e.g., Azure OpenAI, Together AI, Fireworks) and **100+ providers via LiteLLM** (e.g., AWS Bedrock, Azure OpenAI, Together AI). :::tip OpenAI-Compatible Providers Hindsight works with any provider that exposes an OpenAI-compatible API (e.g., Azure OpenAI). Simply set `HINDSIGHT_API_LLM_PROVIDER=openai` and configure `HINDSIGHT_API_LLM_BASE_URL` to point to your provider's endpoint. See [Configuration](./configuration#llm-provider) for setup examples. ::: :::tip AWS Bedrock Set `HINDSIGHT_API_LLM_PROVIDER=bedrock` to use AWS Bedrock models directly. Model names use Bedrock model IDs (e.g., `us.amazon.nova-2-lite-v1:0`). No API key is required — authentication uses AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION_NAME`) or IAM roles. For 50% cost savings on throughput, set `HINDSIGHT_API_LLM_BEDROCK_SERVICE_TIER=flex` (see [Configuration](./configuration#llm-provider)). See [Configuration](./configuration#llm-provider) for setup examples. ::: :::tip Built-in llama.cpp (fully local, no API key) Set `HINDSIGHT_API_LLM_PROVIDER=llamacpp` to run a built-in llama.cpp server with no external dependencies. 
A Gemma 4 E2B GGUF model (~3.5 GB) is auto-downloaded on first run. Requires the `local-llm` extra: `pip install 'hindsight-api-slim[local-llm]'`. The published Docker image does not bundle `llama-cpp-python` (to keep the image small). For a runnable Docker setup that adds it on top, see [`docker/docker-compose/local-llm/`](https://github.com/vectorize-io/hindsight/tree/main/docker/docker-compose/local-llm). See [Configuration](./configuration#built-in-llamacpp) for all options. ::: :::tip LiteLLM Provider (Azure, Together AI, and more) Set `HINDSIGHT_API_LLM_PROVIDER=litellm` to use any model supported by [LiteLLM](https://docs.litellm.ai/docs/providers), including **Azure OpenAI**, **Together AI**, **Fireworks AI**, and many more. Model names use LiteLLM's provider prefix format (e.g., `azure/gpt-4o`). See [Configuration](./configuration#llm-provider) for setup examples. ::: :::tip LiteLLM Router (fallback chains, load-balancing, per-deployment limits) Set `HINDSIGHT_API_LLM_PROVIDER=litellmrouter` to run the default LLM through [LiteLLM's Router](https://docs.litellm.ai/docs/routing) — ordered fallback across deployments, load-balanced same-tier routing, weighted picks, per-deployment `rpm`/`tpm` limits, and cooldowns are all available via the [`Router` config](https://docs.litellm.ai/docs/routing#fallbacks). Hindsight passes the JSON config through verbatim. See [Configuration](./configuration#llm-router-litellm-router) for setup. ::: ### Provider Capabilities Beyond basic generation, some providers support optional features that lower cost or latency. Hindsight uses each feature automatically when the configured provider supports it. - **Batch API** — submits bulk retain extraction through the provider's asynchronous batch endpoint, typically at ~50% lower cost. Used automatically when available; otherwise calls run synchronously. 
- **Explicit prompt caching** — reuses the large, fixed system prefix that retain (fact extraction), consolidation, and the reflect tool-loop send on every call, billing it at the provider's cached-input rate. On Gemini/Vertex this uses the `CachedContent` API. **On by default**; disable with `HINDSIGHT_API_LLM_PROMPT_CACHE_ENABLED=false`. Hindsight structures these prompts so the cached prefix is **bank-agnostic** — one cache is shared across all banks rather than one per bank/mission, and creation soft-fails to an uncached call, so it never breaks a request. :::note A blank "Explicit prompt caching" cell does not mean a provider has no caching. OpenAI, for example, caches a stable leading prompt prefix **automatically** server-side, so it benefits with no configuration; Anthropic supports caching via `cache_control` breakpoints which can be wired up through the same provider hook. The column tracks only Hindsight's explicit `get_or_create_cached_prefix` hook, which Gemini/Vertex implement today. ::: ### Benchmarks Not sure which model to use? The **[Model Leaderboard](https://benchmarks.hindsight.vectorize.io/)** benchmarks models across accuracy, speed, cost, and reliability for retain, reflect, and observation consolidation so you can pick the right trade-off for your use case. 
[![Model Leaderboard](/img/leaderboard.png)](https://benchmarks.hindsight.vectorize.io/) ### Tested Models The following models have been tested and verified to work correctly with Hindsight: | Provider | Model | |----------|-------| | **OpenAI** | `gpt-5.2` | | **OpenAI** | `gpt-5` | | **OpenAI** | `gpt-5-mini` | | **OpenAI** | `gpt-5-nano` | | **OpenAI** | `gpt-4.1-mini` | | **OpenAI** | `gpt-4.1-nano` | | **OpenAI** | `gpt-4o-mini` | | **Anthropic** | `claude-sonnet-4-20250514` | | **Anthropic** | `claude-3-5-sonnet-20241022` | | **Gemini** | `gemini-3.5-flash` | | **Gemini** | `gemini-3.1-pro-preview` | | **Gemini** | `gemini-3.1-flash-lite` | | **Groq** | `openai/gpt-oss-120b` | | **Groq** | `openai/gpt-oss-20b` | ### Provider Default Models Each provider has a recommended default model that's used when `HINDSIGHT_API_LLM_MODEL` is not explicitly set. This makes configuration simpler - just specify the provider and get a sensible default: **Example:** Setting just the provider uses its default model: ```bash # Uses claude-haiku-4-5 automatically export HINDSIGHT_API_LLM_PROVIDER=anthropic export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx ``` You can override the default by explicitly setting `HINDSIGHT_API_LLM_MODEL`: ```bash # Override to use Sonnet instead export HINDSIGHT_API_LLM_PROVIDER=anthropic export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=claude-sonnet-4-5-20250929 ``` This also applies to per-operation overrides: ```bash # Global: OpenAI gpt-4o-mini (default) export HINDSIGHT_API_LLM_PROVIDER=openai # Retain: Anthropic claude-haiku-4-5 (default) export HINDSIGHT_API_RETAIN_LLM_PROVIDER=anthropic ``` ### Using Other Models Other LLM models not listed above may work with Hindsight, but they must support **at least 65,000 output tokens** to ensure reliable fact extraction. 
If you need support for a specific model that doesn't meet this requirement, please [open an issue](https://github.com/hindsight-ai/hindsight/issues) to request an exception. :::tip Models with Limited Output Tokens If your model only supports 32k or fewer output tokens (e.g., some older models), you can reduce the retain completion token limit: ```bash # For models that support 32k output tokens export HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS=32000 # For models that support 16k output tokens export HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS=16000 ``` **Important:** `HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS` must be greater than `HINDSIGHT_API_RETAIN_CHUNK_SIZE` (default: 3000). The system will validate this on startup and provide an error message if the configuration is invalid. ::: :::warning Groq free tier is not suitable for Hindsight Groq's free tier only allows 8,000 tokens per minute — far below what Hindsight needs for a single retain call (~64k). Free-tier Groq models therefore can't be used with Hindsight; use a paid Groq tier or a different provider. 
::: ### Configuration ```bash # Groq (recommended) export HINDSIGHT_API_LLM_PROVIDER=groq export HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=openai/gpt-oss-20b # OpenAI export HINDSIGHT_API_LLM_PROVIDER=openai export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=gpt-4o # Gemini export HINDSIGHT_API_LLM_PROVIDER=gemini export HINDSIGHT_API_LLM_API_KEY=xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=gemini-3.5-flash # Anthropic export HINDSIGHT_API_LLM_PROVIDER=anthropic export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=claude-sonnet-4-20250514 # Ollama (local) export HINDSIGHT_API_LLM_PROVIDER=ollama export HINDSIGHT_API_LLM_BASE_URL=http://localhost:11434/v1 export HINDSIGHT_API_LLM_MODEL=llama3 # Ollama Cloud (hosted Ollama endpoint, requires API key) export HINDSIGHT_API_LLM_PROVIDER=ollama-cloud export HINDSIGHT_API_LLM_API_KEY=your-ollama-cloud-api-key export HINDSIGHT_API_LLM_MODEL=gemma3:12b # LM Studio (local) export HINDSIGHT_API_LLM_PROVIDER=lmstudio export HINDSIGHT_API_LLM_BASE_URL=http://localhost:1234/v1 export HINDSIGHT_API_LLM_MODEL=your-local-model # MiniMax (1M context window) export HINDSIGHT_API_LLM_PROVIDER=minimax export HINDSIGHT_API_LLM_API_KEY=your-minimax-api-key export HINDSIGHT_API_LLM_MODEL=MiniMax-M3 # or MiniMax-M2.7 for the previous generation # DeepSeek (https://api.deepseek.com) export HINDSIGHT_API_LLM_PROVIDER=deepseek export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=deepseek-v4-flash # or deepseek-v4-pro / deepseek-chat / deepseek-reasoner # z.ai (Zhipu GLM series, OpenAI-compatible, https://z.ai) export HINDSIGHT_API_LLM_PROVIDER=zai export HINDSIGHT_API_LLM_API_KEY=your-zai-api-key export HINDSIGHT_API_LLM_MODEL=glm-4.5-flash # or glm-4.5-air for the paid tier # opencode-go (OpenAI-compatible) export HINDSIGHT_API_LLM_PROVIDER=opencode-go export HINDSIGHT_API_LLM_API_KEY=your-opencode-go-api-key export 
HINDSIGHT_API_LLM_MODEL=deepseek-v4-flash # Atlas Cloud (OpenAI-compatible, https://www.atlascloud.ai) export HINDSIGHT_API_LLM_PROVIDER=atlas export HINDSIGHT_API_LLM_API_KEY=your-atlascloud-api-key # base_url defaults to https://api.atlascloud.ai/v1 export HINDSIGHT_API_LLM_MODEL=deepseek-ai/deepseek-v4-pro # reasoning model; also Qwen / GLM / Kimi / MiniMax, etc. # Nous Portal (OpenAI-compatible; no API key — uses your `hermes portal` login) export HINDSIGHT_API_LLM_PROVIDER=nous export HINDSIGHT_API_LLM_MODEL=deepseek/deepseek-v4-flash # any Nous-hosted slug # No API key needed — reads a rotating JWT from ~/.hermes/auth.json (see "Nous Portal Setup" below) # Vertex AI (Google Cloud) export HINDSIGHT_API_LLM_PROVIDER=vertexai export HINDSIGHT_API_LLM_MODEL=gemini-3.1-flash-lite export HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID=your-gcp-project-id # Optional: region (default: us-central1) # export HINDSIGHT_API_LLM_VERTEXAI_REGION=us-central1 # Optional: service account key (otherwise uses ADC) # export HINDSIGHT_API_LLM_VERTEXAI_SERVICE_ACCOUNT_KEY=/path/to/key.json ``` **Note:** The LLM is the primary bottleneck for retain operations. See [Performance](./performance) for optimization strategies. --- ### OpenAI Codex Setup (ChatGPT Plus/Pro) Use your ChatGPT Plus or Pro subscription for Hindsight without separate OpenAI Platform API costs. **Prerequisites:** - Active ChatGPT Plus or Pro subscription - Node.js/npm installed (for Codex CLI) **Setup Steps:** 1. **Install Codex CLI:** ```bash npm install -g @openai/codex ``` 2. **Login with ChatGPT credentials:** ```bash codex auth login ``` This opens a browser window to authenticate with your ChatGPT account and saves OAuth tokens to `~/.codex/auth.json`. 3. **Verify authentication:** ```bash ls ~/.codex/auth.json # Should show the auth file exists ``` 4. 
**Configure Hindsight:** ```bash export HINDSIGHT_API_LLM_PROVIDER=openai-codex # export HINDSIGHT_API_LLM_MODEL=gpt-5.3-codex # defaults to gpt-5.4-mini # No API key needed - reads from ~/.codex/auth.json automatically ``` 5. **Start Hindsight:** ```bash hindsight-api ``` You can use any model supported by the OpenAI Codex CLI. **Important Notes:** - OAuth tokens are stored in `~/.codex/auth.json` - Tokens refresh automatically when needed - Usage is billed to your ChatGPT subscription (not separate API costs) - For personal development use only (see ChatGPT Terms of Service) #### Isolating Codex auth for long-running services By default, Hindsight reads Codex credentials from `~/.codex/auth.json`, the same file that the `@openai/codex` CLI, editor plugins, and other agent runtimes use. This is convenient for local development but can cause a subtle failure mode when Hindsight runs as a **long-lived service** (systemd unit, container, background daemon) alongside another Codex process: - Codex refresh tokens are single-use and rotate on refresh. - If another process refreshes the shared token, Hindsight's long-running process is left holding a stale refresh token. - Recall and `/health` keep working (the database and API are fine), but `/reflect` fails with an error such as: ```text Codex refresh_token is permanently invalid (error.code=refresh_token_reused). Run 'codex auth login' to re-authenticate. ``` To avoid this, give the Hindsight service its **own dedicated Codex auth home** via the `CODEX_HOME` environment variable. Hindsight honors `CODEX_HOME` exactly like the `@openai/codex` CLI: when set, it reads `$CODEX_HOME/auth.json` instead of `~/.codex/auth.json`.
```bash # Dedicated credentials directory for the Hindsight service export CODEX_HOME=/var/lib/hindsight/codex # One-time login into that isolated home (opens a browser / device-code flow) codex auth login # writes $CODEX_HOME/auth.json export HINDSIGHT_API_LLM_PROVIDER=openai-codex hindsight-api ``` For a systemd unit, set it in the service definition so it never shares auth with an interactive Codex session: ```ini [Service] Environment=CODEX_HOME=/var/lib/hindsight/codex ``` After a fresh login into the dedicated home and restarting only the Hindsight service, `/reflect` uses its own token that other Codex processes will not rotate out from under it. `CODEX_HOME` is also honored by the `openai-codex` embeddings provider. --- ### Nous Portal Setup (Hermes) Use your [Nous Portal](https://portal.nousresearch.com) subscription for Hindsight via the Hermes CLI login — no static API key required. **Prerequisites:** - A Nous Portal account - The [Hermes](https://hermes-agent.nousresearch.com) CLI installed **Setup Steps:** 1. **Log in to Nous Portal:** ```bash hermes portal ``` This opens a browser to authenticate with Nous Portal and saves OAuth credentials to `~/.hermes/auth.json`. 2. **Verify authentication:** ```bash hermes portal status # should show "Auth: ✓ logged in" ``` 3. **Configure Hindsight:** ```bash export HINDSIGHT_API_LLM_PROVIDER=nous # export HINDSIGHT_API_LLM_MODEL=deepseek/deepseek-v4-flash # defaults to deepseek/deepseek-v4-flash # No API key needed — reads from ~/.hermes/auth.json automatically ``` 4. **Start Hindsight:** ```bash hindsight-api ``` You can use any model hosted on the Nous Portal inference API. **Important Notes:** - Credentials are read from `~/.hermes/auth.json` (the same store the Hermes agent uses) — no static API key in Hindsight's config. - The short-lived inference JWT is refreshed automatically, before expiry and reactively on a 401. 
- Refreshes coordinate with a running Hermes agent through the shared auth store, so the two never disrupt each other's session. - Default base URL: `https://inference-api.nousresearch.com/v1` (override with `HINDSIGHT_API_LLM_BASE_URL`). --- ### Claude Code Setup (Claude Pro/Max) Use your Claude Pro or Max subscription for Hindsight without separate Anthropic API costs. :::warning Terms of Service Notice This integration uses the Claude Agent SDK with your personal Claude Pro/Max subscription credentials. You must be logged into Claude Code on your own machine before using this provider. **Please be aware:** - Anthropic's [Agent SDK documentation](https://docs.claude.com/en/api/agent-sdk/overview) states that third-party developers should not offer claude.ai login or rate limits for their products. Hindsight does **not** perform any login on your behalf — it uses credentials you've already authenticated via `claude auth login`. - In January 2026, Anthropic [enforced restrictions](https://paddo.dev/blog/anthropic-walled-garden-crackdown/) against third-party tools using Claude subscription OAuth tokens. Those restrictions targeted tools that **spoofed the Claude Code client identity** — Hindsight uses the official Claude Agent SDK instead. - This provider is intended for **local, personal development use only**. Do not use it in production deployments or shared environments. - Anthropic's terms may change. If you want guaranteed compliance, use the `anthropic` provider with an API key instead. - Usage counts against your Claude Pro/Max subscription limits. For production or team use, we recommend using `HINDSIGHT_API_LLM_PROVIDER=anthropic` with an API key from the [Anthropic Console](https://console.anthropic.com/). ::: **Prerequisites:** - Active Claude Pro or Max subscription - Claude Code CLI installed **Setup Steps:** 1. 
**Install Claude Code CLI:** ```bash npm install -g @anthropic-ai/claude-code # Or via Homebrew brew install --cask claude-code ``` 2. **Login with Claude credentials:** ```bash claude auth login ``` This opens a browser window to authenticate with your Claude account. Authentication is automatically managed by the Claude Agent SDK. 3. **Verify authentication:** ```bash claude --version # Should show version without errors ``` 4. **Configure Hindsight:** ```bash export HINDSIGHT_API_LLM_PROVIDER=claude-code # No API key needed - uses claude auth login credentials ``` 5. **Start Hindsight:** ```bash hindsight-api ``` You can use any model supported by the Claude Code CLI. **Important Notes:** - Authentication handled by Claude Agent SDK (uses bundled CLI) - Credentials managed securely by Claude Code - Usage billed to your Claude subscription (not separate API costs) - For personal development use only (see Claude Terms of Service) --- ### Vertex AI Setup (Google Cloud) Google Cloud's Vertex AI provides access to Gemini models via the native Google GenAI SDK. **Prerequisites:** - GCP project with Vertex AI API enabled - IAM role `roles/aiplatform.user` for your credentials **Environment Variables:** | Variable | Description | Required | |----------|-------------|----------| | `HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID` | Your GCP project ID | Yes | | `HINDSIGHT_API_LLM_VERTEXAI_REGION` | GCP region (e.g., `us-central1`) | No (default: `us-central1`) | | `HINDSIGHT_API_LLM_VERTEXAI_SERVICE_ACCOUNT_KEY` | Path to service account JSON key file | No (uses ADC if not set) | **Authentication Methods:** 1. **Application Default Credentials (ADC)** - Recommended for development ```bash # Setup ADC gcloud auth application-default login # Configure Hindsight export HINDSIGHT_API_LLM_PROVIDER=vertexai export HINDSIGHT_API_LLM_MODEL=gemini-3.1-flash-lite export HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID=your-project-id ``` 2.
**Service Account Key** - Recommended for production ```bash # Create service account and download key gcloud iam service-accounts create hindsight-api gcloud projects add-iam-policy-binding your-project-id \ --member="serviceAccount:hindsight-api@your-project-id.iam.gserviceaccount.com" \ --role="roles/aiplatform.user" gcloud iam service-accounts keys create key.json \ --iam-account=hindsight-api@your-project-id.iam.gserviceaccount.com # Configure Hindsight export HINDSIGHT_API_LLM_PROVIDER=vertexai export HINDSIGHT_API_LLM_MODEL=gemini-3.1-flash-lite export HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID=your-project-id export HINDSIGHT_API_LLM_VERTEXAI_SERVICE_ACCOUNT_KEY=/path/to/key.json ``` **Notes:** - Model names can optionally include the `google/` prefix (e.g., `google/gemini-3.1-flash-lite`) — it will be stripped automatically - The native SDK handles token refresh automatically - Uses service account credentials if provided, otherwise falls back to ADC --- ## Embedding Model Converts text into dense vector representations for semantic similarity search. 
**Default:** `BAAI/bge-small-en-v1.5` (384 dimensions, ~130MB) ### Supported Providers | Provider | Description | Best For | |----------|-------------|----------| | `local` | SentenceTransformers (default) | Development, low latency | | `onnx` | In-process ONNX Runtime embedder (no Ollama/TEI/API sidecar) | Lightweight local CPU, multilingual | | `openai` | OpenAI embeddings API | Production, high quality | | `openai-codex` | OpenAI embeddings via Codex OAuth (ChatGPT Plus/Pro, no API key) | Existing ChatGPT/Codex subscribers | | `openrouter` | OpenRouter embeddings (OpenAI-compatible gateway) | Multi-provider setups | | `cohere` | Cohere embeddings API | Production, multilingual | | `google` | Google embeddings (Gemini API or Vertex AI) | Production, multilingual, high quality | | `tei` | HuggingFace Text Embeddings Inference | Production, self-hosted | | `zeroentropy` | ZeroEntropy zembed-1 | Production, high quality retrieval | | `litellm` | LiteLLM proxy (unified gateway) | Multi-provider setups | | `litellm-sdk` | LiteLLM SDK (direct API, no proxy) | Multi-provider, simpler setup | ### Local Models | Model | Dimensions | Use Case | |-------|------------|----------| | `BAAI/bge-small-en-v1.5` | 384 | Default, fast, good quality | | `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | 384 | Multilingual (50+ languages) | ### OpenAI Models | Model | Dimensions | Use Case | |-------|------------|----------| | `text-embedding-3-small` | 1536 | Default OpenAI, cost-effective | | `text-embedding-3-large` | 3072 | Higher quality, more expensive | | `text-embedding-ada-002` | 1536 | Legacy model | ### Google Models | Model | Dimensions | Use Case | |-------|------------|----------| | `gemini-embedding-001` | 768 (configurable) | Default Google, general purpose | | `gemini-embedding-2-preview` | 768 (configurable) | Gemini Embedding 2 family; multimodal, one vector per input | Google's `gemini-embedding-001` supports configurable output dimensionality via 
truncation. Google recommends 768, 1536, or 3072; set the value via `HINDSIGHT_API_EMBEDDINGS_GEMINI_OUTPUT_DIMENSIONALITY`. The default is 768.

The `gemini-embedding-2` family, including `gemini-embedding-2-preview`, is supported on both the Gemini API and Vertex AI. These models aggregate multi-input requests, so Hindsight automatically embeds one input per call to keep per-fact vectors aligned.

### Cohere Models

| Model | Dimensions | Use Case |
|-------|------------|----------|
| `embed-english-v3.0` | 1024 | English text |
| `embed-multilingual-v3.0` | 1024 | 100+ languages |

### ZeroEntropy Models

| Model | Dimensions | Use Case |
|-------|------------|----------|
| `zembed-1` | 1280 default (2560/1280/640/320/160/80/40 configurable) | High quality asymmetric retrieval |

Hindsight sends retained memory text to ZeroEntropy as `document` inputs and recall/search text as `query` inputs. ZeroEntropy's API default is 2560 dimensions; Hindsight defaults to 1280 so pgvector HNSW works without changing the vector extension.

:::warning Embedding Dimensions
Hindsight automatically detects the embedding dimension at startup and adjusts the database schema. Once memories are stored, you cannot change dimensions without losing data.
::: **Configuration Examples:** ```bash # Local provider (default) export HINDSIGHT_API_EMBEDDINGS_PROVIDER=local export HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL=BAAI/bge-small-en-v1.5 # ONNX provider (in-process local CPU, no Ollama/TEI/API sidecar; pip install hindsight-api-slim[local-onnx]) export HINDSIGHT_API_EMBEDDINGS_PROVIDER=onnx export HINDSIGHT_API_EMBEDDINGS_ONNX_MODEL_ID=intfloat/multilingual-e5-small export HINDSIGHT_API_EMBEDDINGS_ONNX_DIMENSIONS=384 # OpenAI export HINDSIGHT_API_EMBEDDINGS_PROVIDER=openai export HINDSIGHT_API_EMBEDDINGS_OPENAI_API_KEY=sk-xxxxxxxxxxxx export HINDSIGHT_API_EMBEDDINGS_OPENAI_MODEL=text-embedding-3-small # Cohere export HINDSIGHT_API_EMBEDDINGS_PROVIDER=cohere export HINDSIGHT_API_COHERE_API_KEY=your-api-key export HINDSIGHT_API_EMBEDDINGS_COHERE_MODEL=embed-english-v3.0 # Google (API key auth) export HINDSIGHT_API_EMBEDDINGS_PROVIDER=google export HINDSIGHT_API_EMBEDDINGS_GEMINI_API_KEY=xxxxxxxxxxxx export HINDSIGHT_API_EMBEDDINGS_GEMINI_MODEL=gemini-embedding-001 # Google (Vertex AI auth - auto-detected when project ID is set) export HINDSIGHT_API_EMBEDDINGS_PROVIDER=google export HINDSIGHT_API_EMBEDDINGS_GEMINI_MODEL=gemini-embedding-001 export HINDSIGHT_API_EMBEDDINGS_VERTEXAI_PROJECT_ID=your-gcp-project-id # TEI (self-hosted) export HINDSIGHT_API_EMBEDDINGS_PROVIDER=tei export HINDSIGHT_API_EMBEDDINGS_TEI_URL=http://localhost:8080 # ZeroEntropy export HINDSIGHT_API_EMBEDDINGS_PROVIDER=zeroentropy export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_API_KEY=your-api-key export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_MODEL=zembed-1 export HINDSIGHT_API_EMBEDDINGS_ZEROENTROPY_DIMENSIONS=1280 # LiteLLM proxy export HINDSIGHT_API_EMBEDDINGS_PROVIDER=litellm export HINDSIGHT_API_LITELLM_API_BASE=http://localhost:4000 export HINDSIGHT_API_EMBEDDINGS_LITELLM_MODEL=text-embedding-3-small # LiteLLM SDK (direct, no proxy) export HINDSIGHT_API_EMBEDDINGS_PROVIDER=litellm-sdk export HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_API_KEY=sk-xxxxxxxxxxxx 
export HINDSIGHT_API_EMBEDDINGS_LITELLM_SDK_MODEL=openai/text-embedding-3-small ``` See [Configuration](./configuration#embeddings) for all options including Azure OpenAI and custom endpoints. --- ## Cross-Encoder (Reranker) Reranks initial search results to improve precision. **Default:** `cross-encoder/ms-marco-MiniLM-L-6-v2` (~85MB) ### Supported Providers | Provider | Description | Best For | |----------|-------------|----------| | `local` | SentenceTransformers CrossEncoder (default) | Development, low latency | | `cohere` | Cohere rerank API | Production, high quality | | `openrouter` | OpenRouter rerank API (Cohere-compatible gateway) | Multi-provider setups | | `zeroentropy` | ZeroEntropy rerank API (zerank-2) | Production, state-of-the-art accuracy | | `siliconflow` | SiliconFlow rerank API (Cohere-compatible `/rerank` endpoint) | Users in China or anyone on SiliconFlow's platform | | `alibaba` | Alibaba Cloud DashScope rerank API (qwen3-rerank) | Users on Alibaba Cloud / DashScope | | `google` | Google Discovery Engine ranking API (REST + Google auth) | Production, GCP integration | | `tei` | HuggingFace Text Embeddings Inference | Production, self-hosted | | `flashrank` | FlashRank (lightweight, fast) | Resource-constrained environments | | `litellm` | LiteLLM proxy (unified gateway) | Multi-provider setups | | `litellm-sdk` | LiteLLM SDK (direct API, no proxy) | Multi-provider, simpler setup | | `jina-mlx` | Jina rerank v3 via Apple Silicon MLX (local, no API key) | Apple Silicon (M1+) local inference | | `rrf` | RRF-only (no neural reranking) | Testing, minimal resources | ### Local Models | Model | Use Case | |-------|----------| | `cross-encoder/ms-marco-MiniLM-L-6-v2` | Default, fast | | `cross-encoder/ms-marco-MiniLM-L-12-v2` | Higher accuracy | | `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1` | Multilingual | ### Cohere Models | Model | Use Case | |-------|----------| | `rerank-english-v3.0` | English text | | `rerank-multilingual-v3.0` | 100+ 
languages | ### ZeroEntropy Models | Model | Use Case | |-------|----------| | `zerank-2` | Flagship multilingual reranker (default) | | `zerank-2-small` | Faster, lighter variant | ### SiliconFlow Models SiliconFlow hosts a range of open-weight rerankers behind a Cohere-compatible `/rerank` endpoint: | Model | Use Case | |-------|----------| | `BAAI/bge-reranker-v2-m3` | Multilingual, strong default | | `Qwen/Qwen3-Reranker-8B` | Larger, higher accuracy | ### Alibaba Cloud Models Alibaba Cloud DashScope exposes `qwen3-rerank` via a Cohere-compatible `/reranks` endpoint: | Model | Use Case | |-------|----------| | `qwen3-rerank` | 100+ languages, default | ### LiteLLM Supported Providers LiteLLM supports multiple reranking providers via the `/rerank` endpoint: | Provider | Model Example | |----------|---------------| | Cohere | `cohere/rerank-english-v3.0` | | Together AI | `together_ai/...` | | Voyage AI | `voyage/rerank-2` | | Jina AI | `jina_ai/...` | | AWS Bedrock | `bedrock/...` | **Configuration Examples:** ```bash # Local provider (default) export HINDSIGHT_API_RERANKER_PROVIDER=local export HINDSIGHT_API_RERANKER_LOCAL_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2 # Cohere export HINDSIGHT_API_RERANKER_PROVIDER=cohere export HINDSIGHT_API_COHERE_API_KEY=your-api-key export HINDSIGHT_API_RERANKER_COHERE_MODEL=rerank-english-v3.0 # Cohere-compatible endpoint (Azure AI Foundry, Jina, Voyage, self-hosted BGE, ...) 
# Setting COHERE_BASE_URL switches the provider off the Cohere SDK and onto a # plain HTTP client that speaks the standard rerank wire format: # POST {base_url} Authorization: Bearer # {"model","query","documents","return_documents":false} # -> {"results":[{"index","relevance_score"}, ...]} export HINDSIGHT_API_RERANKER_PROVIDER=cohere export HINDSIGHT_API_RERANKER_COHERE_API_KEY=your-api-key export HINDSIGHT_API_RERANKER_COHERE_MODEL=rerank-v3.5 # whatever model the endpoint serves export HINDSIGHT_API_RERANKER_COHERE_BASE_URL=https://your-endpoint.example/rerank # ZeroEntropy (state-of-the-art accuracy) export HINDSIGHT_API_RERANKER_PROVIDER=zeroentropy export HINDSIGHT_API_RERANKER_ZEROENTROPY_API_KEY=your-api-key export HINDSIGHT_API_RERANKER_ZEROENTROPY_MODEL=zerank-2 # default, can omit # SiliconFlow (Cohere-compatible /rerank endpoint) export HINDSIGHT_API_RERANKER_PROVIDER=siliconflow export HINDSIGHT_API_RERANKER_SILICONFLOW_API_KEY=your-api-key export HINDSIGHT_API_RERANKER_SILICONFLOW_MODEL=BAAI/bge-reranker-v2-m3 # default, can omit # Alibaba Cloud DashScope (qwen3-rerank) export HINDSIGHT_API_RERANKER_PROVIDER=alibaba export HINDSIGHT_API_RERANKER_ALIBABA_API_KEY=your-dashscope-api-key export HINDSIGHT_API_RERANKER_ALIBABA_MODEL=qwen3-rerank # default, can omit # TEI (self-hosted) export HINDSIGHT_API_RERANKER_PROVIDER=tei export HINDSIGHT_API_RERANKER_TEI_URL=http://localhost:8081 # FlashRank (lightweight) export HINDSIGHT_API_RERANKER_PROVIDER=flashrank # LiteLLM proxy export HINDSIGHT_API_RERANKER_PROVIDER=litellm export HINDSIGHT_API_LITELLM_API_BASE=http://localhost:4000 export HINDSIGHT_API_RERANKER_LITELLM_MODEL=cohere/rerank-english-v3.0 # RRF-only (no neural reranking) export HINDSIGHT_API_RERANKER_PROVIDER=rrf ``` See [Configuration](./configuration#reranker) for all options including Azure-hosted endpoints and batch settings. 
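The Cohere-compatible wire format described in the configuration comments above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Hindsight's implementation: the helper names `build_rerank_request` and `apply_rerank_response` are hypothetical, and only the JSON field names (`model`, `query`, `documents`, `return_documents`, `results`, `index`, `relevance_score`) come from the format shown above. No network call is made; the response is a hand-written sample with made-up scores.

```python
import json


def build_rerank_request(model: str, query: str, documents: list[str]) -> str:
    """Serialize the JSON body for a Cohere-compatible rerank POST (hypothetical helper)."""
    payload = {
        "model": model,
        "query": query,
        "documents": documents,
        "return_documents": False,
    }
    return json.dumps(payload)


def apply_rerank_response(documents: list[str], response_body: str) -> list[tuple[str, float]]:
    """Map each result index back to its document, sorted by descending score."""
    results = json.loads(response_body)["results"]
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in ranked]


docs = ["Alice works at Google", "Bob likes hiking", "Alice joined Google last spring"]
body = build_rerank_request("rerank-v3.5", "Where does Alice work?", docs)

# Hand-written sample response in the documented shape (scores are invented).
sample_response = json.dumps({
    "results": [
        {"index": 2, "relevance_score": 0.71},
        {"index": 0, "relevance_score": 0.93},
        {"index": 1, "relevance_score": 0.02},
    ]
})
reranked = apply_rerank_response(docs, sample_response)
print(reranked[0][0])  # prints "Alice works at Google"
```

Because results refer to documents by `index` and `return_documents` is `false`, the caller has to keep the original document list in order to map scores back to text.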
--- ## File: developer/monitoring.md # Monitoring Hindsight provides comprehensive observability through Prometheus metrics, OpenTelemetry distributed tracing, and pre-built Grafana dashboards. ## Local Development For local observability, use the Grafana LGTM (Loki, Grafana, Tempo, Mimir) all-in-one stack: ```bash ./scripts/dev/start-monitoring.sh ``` This starts a single Docker container providing: - **Grafana UI**: http://localhost:3000 (anonymous admin access) - **Traces (Tempo)**: OTLP endpoint at http://localhost:4318 (HTTP) and http://localhost:4317 (gRPC) - **Metrics (Prometheus/Mimir)**: Scrapes http://localhost:8888/metrics automatically - **Logs (Loki)**: Available for log aggregation - **Pre-built Dashboards**: Hindsight Operations, LLM Metrics, API Service **Enable tracing in your API:** ```bash export HINDSIGHT_API_OTEL_TRACES_ENABLED=true export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 ``` :::note Production Deployment The local monitoring stack is for development only. In production, deploy Grafana LGTM separately or use commercial platforms (Grafana Cloud, DataDog, New Relic, etc.). ::: ## Grafana Dashboards Pre-built dashboards are available in [`monitoring/grafana/dashboards/`](https://github.com/anthropics/hindsight/tree/main/monitoring/grafana/dashboards). Import these JSON files into your Grafana instance: | Dashboard | Description | |-----------|-------------| | **Hindsight Operations** | Operation rates, latency percentiles, per-bank metrics | | **Hindsight LLM Metrics** | LLM calls, token usage, latency by scope/provider | | **Hindsight API Service** | HTTP requests, error rates, DB pool, process metrics | The dashboards are automatically provisioned when using the monitoring stack script. 
## Metrics Endpoint Hindsight exposes Prometheus metrics at `/metrics`: ```bash curl http://localhost:8888/metrics ``` ## Available Metrics ### Operation Metrics | Metric | Type | Labels | Description | |--------|------|--------|-------------| | `hindsight.operation.duration` | Histogram | operation, bank_id, source, budget, max_tokens, success | Duration of operations in seconds | | `hindsight.operation.total` | Counter | operation, bank_id, source, budget, max_tokens, success | Total number of operations executed | **Labels:** - `operation`: Operation type (`retain`, `recall`, `reflect`, plus async worker task types such as `consolidation`) - `bank_id`: Memory bank identifier - `source`: Where the operation was triggered from (`api`, `reflect`, `internal`, `worker`) - `budget`: Budget level if specified (`low`, `mid`, `high`) - `max_tokens`: Max tokens if specified - `success`: Whether the operation succeeded (`true`, `false`) The `source` label allows distinguishing between: - `api`: Direct API calls from clients - `reflect`: Internal recall calls made during reflect operations - `internal`: Other internal operations - `worker`: Async worker completions recorded when a claimed task reaches a terminal outcome For `source="worker"`, the `success` label is a completion-throughput signal: `false` means the task raised out to the poller after retries were exhausted or an unexpected error occurred. Failures handled inside the executor and returned normally still record `success="true"` here; use `hindsight_async_operations{status="failed"}` for authoritative async operation failure status. 
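To make the `source` and `success` labels concrete, here is a small sketch that tallies `hindsight_operation_total` samples from Prometheus text exposition output, the format returned by `/metrics`. The sample lines and their values are invented for illustration, the exposition name is assumed from the example queries on this page, and the parser (`parse_samples` and `totals_by` are hypothetical helpers) handles only simple `name{labels} value` lines rather than the full exposition format.

```python
import re
from collections import defaultdict

# Invented sample in Prometheus text exposition format (values are illustrative).
SAMPLE = '''
# TYPE hindsight_operation_total counter
hindsight_operation_total{operation="recall",bank_id="b1",source="api",success="true"} 40
hindsight_operation_total{operation="recall",bank_id="b1",source="reflect",success="true"} 12
hindsight_operation_total{operation="consolidation",bank_id="b1",source="worker",success="true"} 9
hindsight_operation_total{operation="consolidation",bank_id="b1",source="worker",success="false"} 1
'''

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+([0-9.eE+-]+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str, metric: str):
    """Yield (labels, value) for each sample line of the given metric."""
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(1) == metric:
            yield dict(LABEL_RE.findall(match.group(2))), float(match.group(3))


def totals_by(text: str, metric: str, *keys: str) -> dict:
    """Sum sample values grouped by the chosen label keys."""
    totals = defaultdict(float)
    for labels, value in parse_samples(text, metric):
        totals[tuple(labels.get(k, "") for k in keys)] += value
    return dict(totals)


by_source = totals_by(SAMPLE, "hindsight_operation_total", "source")
worker = totals_by(SAMPLE, "hindsight_operation_total", "source", "success")
print(by_source[("worker",)])  # 10.0
```

Grouping by `source` separates client traffic (`api`) from recall calls issued inside reflect (`reflect`) and from worker completions, which is the same split the `sum by (source)` query performs in PromQL.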
### LLM Metrics | Metric | Type | Labels | Description | |--------|------|--------|-------------| | `hindsight.llm.duration` | Histogram | provider, model, scope, success | Duration of LLM API calls in seconds | | `hindsight.llm.calls.total` | Counter | provider, model, scope, success | Total number of LLM API calls | | `hindsight.llm.tokens.input` | Counter | provider, model, scope, success, token_bucket | Input tokens for LLM calls | | `hindsight.llm.tokens.output` | Counter | provider, model, scope, success, token_bucket | Output tokens from LLM calls | **Labels:** - `provider`: LLM provider (`openai`, `anthropic`, `gemini`, `groq`, `ollama`, `lmstudio`, `bedrock`, `litellm`) - `model`: Model name (e.g., `gpt-4`, `claude-3-sonnet`) - `scope`: What the LLM call is for (`memory`, `reflect`, `consolidation`, `answer`) - `success`: Whether the call succeeded (`true`, `false`) - `token_bucket`: Token count bucket for cardinality control (`0-100`, `100-500`, `500-1k`, `1k-5k`, `5k-10k`, `10k-50k`, `50k+`) ### HTTP Request Metrics | Metric | Type | Labels | Description | |--------|------|--------|-------------| | `hindsight.http.duration` | Histogram | method, endpoint, status_code, status_class | Duration of HTTP requests in seconds | | `hindsight.http.requests.total` | Counter | method, endpoint, status_code, status_class | Total number of HTTP requests | | `hindsight.http.requests.in_progress` | UpDownCounter | method, endpoint | Number of HTTP requests currently being processed | **Labels:** - `method`: HTTP method (`GET`, `POST`, `PUT`, `DELETE`) - `endpoint`: Request path (normalized to reduce cardinality - UUIDs replaced with `{id}`) - `status_code`: HTTP status code (`200`, `400`, `500`, etc.) 
- `status_class`: Status code class (`2xx`, `4xx`, `5xx`) ### Database Pool Metrics | Metric | Type | Labels | Description | |--------|------|--------|-------------| | `hindsight.db.pool.size` | Gauge | - | Current number of connections in the pool | | `hindsight.db.pool.idle` | Gauge | - | Number of idle connections in the pool | | `hindsight.db.pool.min` | Gauge | - | Minimum pool size | | `hindsight.db.pool.max` | Gauge | - | Maximum pool size | ### Process Metrics | Metric | Type | Labels | Description | |--------|------|--------|-------------| | `hindsight.process.cpu.seconds` | Gauge | type | Process CPU time in seconds | | `hindsight.process.memory.bytes` | Gauge | type | Process memory usage in bytes | | `hindsight.process.open_fds` | Gauge | - | Number of open file descriptors | | `hindsight.process.threads` | Gauge | - | Number of active threads | **Labels:** - `type` (CPU): `user` or `system` - `type` (Memory): `rss_max` (maximum resident set size) ### Histogram Buckets Custom bucket boundaries are configured for better percentile accuracy: **Operation Duration Buckets (seconds):** ``` 0.1, 0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 5.0, 7.5, 10.0, 15.0, 20.0, 30.0, 60.0, 120.0 ``` **LLM Duration Buckets (seconds):** ``` 0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 15.0, 30.0, 60.0, 120.0 ``` **HTTP Duration Buckets (seconds):** ``` 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0 ``` ## Prometheus Configuration ```yaml scrape_configs: - job_name: 'hindsight' static_configs: - targets: ['localhost:8888'] ``` ## Example Queries ### Average operation latency by type ```promql rate(hindsight_operation_duration_sum[5m]) / rate(hindsight_operation_duration_count[5m]) ``` ### LLM calls per minute by provider ```promql rate(hindsight_llm_calls_total[1m]) * 60 ``` ### P95 LLM latency ```promql histogram_quantile(0.95, rate(hindsight_llm_duration_bucket[5m])) ``` ### Total tokens consumed by model ```promql sum by (model) (hindsight_llm_tokens_input_total + 
hindsight_llm_tokens_output_total) ``` ### Internal vs API recall operations ```promql sum by (source) (rate(hindsight_operation_total{operation="recall"}[5m])) ``` ### HTTP requests per second by endpoint ```promql sum by (endpoint) (rate(hindsight_http_requests_total[1m])) ``` ### HTTP error rate (5xx) ```promql sum(rate(hindsight_http_requests_total{status_class="5xx"}[5m])) / sum(rate(hindsight_http_requests_total[5m])) ``` ### P95 HTTP latency ```promql histogram_quantile(0.95, sum by (le) (rate(hindsight_http_duration_seconds_bucket[5m]))) ``` ### Database pool utilization ```promql hindsight_db_pool_size / hindsight_db_pool_max ``` ### Active database connections ```promql hindsight_db_pool_size - hindsight_db_pool_idle ``` ### CPU usage rate ```promql rate(hindsight_process_cpu_seconds{type="user"}[1m]) ``` --- ## Distributed Tracing Hindsight supports OpenTelemetry distributed tracing for memory operations and LLM calls, following GenAI semantic conventions v1.37+. ### Configuration See [Configuration - OpenTelemetry Tracing](./configuration#opentelemetry-tracing) for environment variables. **Quick Start:** ```bash # Enable tracing export HINDSIGHT_API_OTEL_TRACES_ENABLED=true export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 # View traces with Grafana LGTM (local dev) ./scripts/dev/start-monitoring.sh # Open http://localhost:3000 → Explore → Tempo ``` Supports any OTLP-compatible backend (Grafana LGTM, Langfuse, OpenLIT, DataDog, New Relic, Honeycomb, [Pydantic Logfire](https://logfire.pydantic.dev), etc.). 
### Span Hierarchy

**Parent Spans (Operations):**
- `hindsight.retain` - Memory ingestion
- `hindsight.recall` - Memory retrieval
  - `hindsight.recall_embedding` - Query embedding
  - `hindsight.recall_retrieval` - Parallel search (semantic, BM25, graph, temporal)
  - `hindsight.recall_fusion` - Reciprocal Rank Fusion
  - `hindsight.recall_rerank` - Cross-encoder reranking
- `hindsight.reflect` - Agentic reasoning
  - `hindsight.reflect_tool_call` - Tool execution (recall, lookup, etc.)
- `hindsight.consolidation` - Observation synthesis
- `hindsight.mental_model_refresh` - Mental model updates

**Child Spans (LLM Calls):**
- Named by scope (e.g., `hindsight.memory`, `hindsight.reflect`)
- Contain full prompts/completions as events
- Follow GenAI semantic conventions for attributes

### Span Attributes

**Operation Spans:**
- `hindsight.operation` - Operation type
- `hindsight.bank_id` - Memory bank ID
- `hindsight.query` - Query text (truncated to 100 chars)
- `hindsight.fact_types` - Fact types for recall
- `hindsight.thinking_budget` - Budget allocation
- `hindsight.max_tokens` - Token limit

**LLM Spans (GenAI Semantic Conventions):**
- `gen_ai.operation.name` - Always `"chat"`
- `gen_ai.provider.name` - Provider (`openai`, `anthropic`, `google`, etc.)
- `gen_ai.request.model` - Model name
- `gen_ai.usage.input_tokens` - Input tokens
- `gen_ai.usage.output_tokens` - Output tokens
- `hindsight.scope` - LLM call purpose (`memory`, `reflect`, `consolidation`, etc.)

**Events:**
- `gen_ai.client.inference.operation.details` - Full prompts and completions

---

## File: developer/multilingual.md

# Multilingual Support

Hindsight automatically detects the language of your input and responds in the same language. This means facts, entities, and reflect responses are preserved in their original language without translation to English.
## How It Works ```mermaid graph LR A[Chinese Input] --> B[Language Detection] B --> C[Extract Facts in Chinese] C --> D[Chinese Entities] D --> E[Chinese Response] ``` When you retain content or reflect on a query, Hindsight: 1. **Detects the input language** automatically from the content 2. **Extracts facts in the original language** - preserving nuance and meaning 3. **Stores entities in their native script** - 张伟 stays 张伟, not "Zhang Wei" 4. **Responds in the same language** - queries in Chinese get Chinese answers --- ## Retain with Non-English Content When you retain content in any language, Hindsight extracts and stores facts in that same language. ### Example: Chinese Content ```python from hindsight import Hindsight hindsight = Hindsight() # Retain Chinese content hindsight.retain( bank_id="user-123", content=""" 张伟是一位资深软件工程师,在腾讯工作了五年。 他专门研究分布式系统,并领导了公司微服务架构的开发。 """, context="团队概述" ) # Query in Chinese - get Chinese results results = hindsight.recall( bank_id="user-123", query="告诉我关于张伟的信息" ) # Facts are returned in Chinese: # - 张伟是一位资深软件工程师,在腾讯工作了五年 # - 张伟专门研究分布式系统,并领导了公司微服务架构的开发 ``` ### Example: Japanese Content ```python hindsight.retain( bank_id="user-123", content=""" 田中さんはソフトウェアエンジニアで、東京のスタートアップで働いています。 彼女はPythonとTypeScriptが得意で、毎日コードレビューをしています。 """, context="チームプロフィール" ) # Query in Japanese results = hindsight.recall( bank_id="user-123", query="田中さんについて教えてください" ) ``` --- ## Reflect with Non-English Queries The `reflect` operation also respects the input language, generating thoughtful responses in the same language as the query. ### Example: Chinese Reflection ```python # Store facts about team members (in Chinese) hindsight.retain( bank_id="team-eval", content="张伟是一位优秀的软件工程师,完成了五个重大项目。他总是按时交付,代码整洁有良好的文档。", context="绩效评估" ) hindsight.retain( bank_id="team-eval", content="李明最近加入团队。他错过了第一个截止日期,代码有很多bug。", context="绩效评估" ) # Reflect in Chinese result = hindsight.reflect( bank_id="team-eval", query="谁是更可靠的工程师?" 
) # Response is in Chinese: # "我认为张伟更可靠。张伟完成了五个重大项目,按时交付,代码质量高..." ``` --- ## Mixed Language Content Hindsight handles mixed-language content gracefully, preserving both languages where appropriate. ### Example: Chinese Text with English Company Names ```python hindsight.retain( bank_id="user-123", content=""" 王芳在Google北京办公室工作,她是一名高级产品经理。 之前她在Microsoft和Amazon工作过。 她负责管理YouTube在中国市场的推广策略。 """, context="员工资料" ) # Facts preserve both languages: # - 王芳在Google北京办公室工作,担任高级产品经理 # - 王芳曾在Microsoft和Amazon工作过 # - 王芳负责管理YouTube在中国市场的推广策略 ``` --- ## Supported Languages **Hindsight's multilingual support depends entirely on your LLM's language capabilities.** Hindsight instructs the LLM to detect the input language and respond in that same language. If your LLM supports a language, Hindsight will work with it. Most modern LLMs (GPT-4, Claude, Gemini, Llama 3, etc.) support dozens of languages including: - **East Asian**: Chinese (Simplified/Traditional), Japanese, Korean - **European**: Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian - **Middle Eastern**: Arabic, Hebrew, Turkish - **South Asian**: Hindi, Bengali, Tamil - **Southeast Asian**: Thai, Vietnamese, Indonesian **To verify support for your target language**, test your LLM directly with content in that language. If the LLM can understand and generate text in the language, Hindsight will preserve it correctly. --- ## Configuring for Multilingual Use For optimal multilingual performance, configure all four components of the pipeline: ### 1. LLM (Required) Your LLM must support the target languages. Most modern LLMs do, but verify with your specific model. ### 2. Embedding Model (Recommended) The default embedding model (`BAAI/bge-small-en-v1.5`) is **English-only**. 
For multilingual content, use a multilingual embedding model: ```bash # In your .env file HINDSIGHT_API_EMBEDDINGS_LOCAL_MODEL=BAAI/bge-m3 ``` **Recommended multilingual embedding models:** | Model | Languages | Notes | |-------|-----------|-------| | `BAAI/bge-m3` | 100+ | Best overall multilingual performance | | `intfloat/multilingual-e5-large` | 100+ | Good alternative | | `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | 50+ | Lighter weight | ### 3. Reranker Model (Recommended) The default reranker (`cross-encoder/ms-marco-MiniLM-L-6-v2`) is **English-only**. For multilingual content, use a multilingual reranker: ```bash # In your .env file HINDSIGHT_API_RERANKER_LOCAL_MODEL=BAAI/bge-reranker-v2-m3 ``` **Recommended multilingual reranker models:** | Model | Languages | Notes | |-------|-----------|-------| | `BAAI/bge-reranker-v2-m3` | 100+ | Best multilingual reranking | | `cross-encoder/mmarco-mMiniLMv2-L12-H384-v1` | 14 | Lighter alternative | ### 4. BM25 / Full-Text Search Backend The semantic (embedding) arm covers cross-lingual matches by meaning. Hindsight runs a BM25 keyword arm in parallel, and **BM25 is inherently within-language** — it's character/token matching against a tokenizer's lexemes. The default `native` backend uses PostgreSQL's English dictionary, which produces poor results for non-English content (and no useful tokenization at all for Chinese / Japanese / Korean, which lack whitespace word boundaries). There are two knobs that interact: - `HINDSIGHT_API_TEXT_SEARCH_EXTENSION` — selects the backend (`native`, `vchord`, `pg_textsearch`, `pgroonga`, or `pg_search`). - `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE` — selects the PostgreSQL dictionary used by the `native` backend (default: `english`). 
Pick the backend based on the languages your bank stores:

| Backend | Multilingual / CJK | Notes |
|---------|--------------------|-------|
| `native` | Only languages with a stock PostgreSQL dictionary: English, French, German, Spanish, Italian, Portuguese, Russian, Dutch, Swedish, Norwegian, Danish, Finnish, Hungarian, Turkish, Arabic, plus `simple`. CJK requires a third-party dictionary like `zhparser`. | Stock PostgreSQL, no extra extensions. Configure the language via `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE`. |
| `vchord` | Multilingual via `llmlingua2` tokenizer. | Best when you're already using vchord for vector search. |
| `pg_textsearch` | English only (hardcoded). | Industry-standard BM25 ranking + Block-Max WAND. |
| `pgroonga` | **Yes, out of the box.** Single index handles English, CJK, and mixed-script content via the `TokenBigram` polyglot tokenizer + `NormalizerNFKC150` Unicode normalization. | Recommended for non-English / mixed-language banks. Requires the `pgroonga` extension. See `docker/docker-compose/pgroonga/`. |
| `pg_search` | Multilingual via configurable tokenizer (e.g. `chinese_compatible`, `jieba`, `chinese_lindera`, `japanese_lindera`, `korean_lindera`, `ngram`). | ParadeDB `pg_search` extension; the only Citus-compatible BM25 backend. Tokenizer set via `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_PG_SEARCH_TOKENIZER`. See `docker/docker-compose/pg_search/`. |

**Choosing for a single-language bank** (e.g. all Spanish content):

```bash
HINDSIGHT_API_TEXT_SEARCH_EXTENSION=native
HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE=spanish
```

**Choosing for a CJK or mixed-language bank**:

```bash
HINDSIGHT_API_TEXT_SEARCH_EXTENSION=pgroonga
```

The `native` language setting does not carry over to `pgroonga`: `pgroonga`'s tokenizer is set at index creation and ignores `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE`.
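To see why word-boundary tokenization fails for CJK text while a bigram tokenizer still finds matches, consider the sketch below. It is a toy illustration in plain Python, not how PostgreSQL or PGroonga tokenize internally: `whitespace_tokens` and `char_bigrams` are hypothetical helpers, and real tokenizers also apply normalization, stemming, and ranking.

```python
def whitespace_tokens(text: str) -> set[str]:
    """Split on whitespace, roughly what a word-boundary tokenizer relies on."""
    return set(text.split())


def char_bigrams(text: str) -> set[str]:
    """Overlapping two-character windows over the text, ignoring whitespace."""
    compact = "".join(text.split())
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


document = "张伟在腾讯工作了五年"  # "Zhang Wei has worked at Tencent for five years"
query = "腾讯"  # "Tencent"

# The sentence contains no whitespace, so it becomes one giant token
# and the query term never matches it.
print(whitespace_tokens(query) <= whitespace_tokens(document))  # False

# Every bigram of the query also occurs in the document, so it matches.
print(char_bigrams(query) <= char_bigrams(document))  # True
```

The same effect is why a dictionary built around whitespace-delimited words gives no useful keyword matches for Chinese, Japanese, or Korean, while an n-gram style tokenizer can index them without a language-specific dictionary.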
#### Forcing the LLM Output Language

Independently of the BM25 backend, `HINDSIGHT_API_LLM_OUTPUT_LANGUAGE` forces every LLM-generated artifact into a single language regardless of the source content. This applies uniformly to:

- **Retain**: fact text, context, and entity names extracted from source documents.
- **Consolidation**: observations / mental models synthesized from those facts.
- **Reflect**: the final natural-language response returned by the reflect API.

```bash
# Every LLM call (retain, consolidation, reflect) emits Spanish regardless of source language.
HINDSIGHT_API_LLM_OUTPUT_LANGUAGE=Spanish
```

Common patterns:

- **Aligned, single-language bank**: `HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE=spanish` + `HINDSIGHT_API_LLM_OUTPUT_LANGUAGE=Spanish`. Store, index, and respond in Spanish even when sources are mixed.
- **Mixed-language bank with multilingual indexing**: `HINDSIGHT_API_TEXT_SEARCH_EXTENSION=pgroonga` + leave `HINDSIGHT_API_LLM_OUTPUT_LANGUAGE` unset. Source-language facts are preserved, pgroonga handles all of them in one index, and reflect responds in the query's language.
- **Cross-lingual unification**: `HINDSIGHT_API_LLM_OUTPUT_LANGUAGE=English`. Every fact, observation, and reflect response is in English regardless of source. Useful when the consumer (an English-only LLM, dashboard, or downstream pipeline) needs uniform output.

Leave `HINDSIGHT_API_LLM_OUTPUT_LANGUAGE` unset to preserve the source/query language across the pipeline (the default).

---

## Best Practices

### 1. Use Multilingual Models for Non-English Content

If you primarily work with non-English content, configure multilingual embedding and reranker models. English-only models will still store your content correctly, but semantic search quality will be degraded.

### 2. Keep Content in One Language Per Retain Call

While mixed content works, keeping each `retain` call in a single language produces more consistent results.

### 3. Query in the Same Language as Your Content

For best results, query using the same language as your stored content. Cross-language queries (e.g., English query for Chinese content) may work but results can vary depending on your embedding model.

---

## Technical Details

Multilingual support is implemented through LLM prompt instructions rather than external language detection libraries. This approach:

- **Requires no additional dependencies**
- **Works with any LLM** that supports multiple languages
- **Handles edge cases** like mixed-language content naturally
- **Preserves semantic meaning** better than rule-based translation

The LLM is instructed to:

1. Detect the input language
2. Extract all facts, entities, and descriptions in that same language
3. Never translate to English unless the input is in English

---

## File: developer/observations.mdx

# Observations: Knowledge Consolidation

After memories are retained, Hindsight automatically consolidates related facts into **observations**: deduplicated, evidence-grounded beliefs the bank has built up from multiple memories. Each observation tracks its supporting evidence (with exact quotes) and a proof count, and is refined rather than overwritten when new evidence arrives.

```mermaid
graph LR
    A[New Facts] --> B[Consolidation Engine]
    B --> C{Existing Observation?}
    C -->|Yes| D[Refine Observation]
    C -->|No| E[Create Observation]
    D --> F[Observations]
    E --> F
```

---

## What Are Observations?

Observations are **consolidated knowledge** built from multiple facts. Unlike raw facts, which are individual pieces of information, observations represent deduplicated beliefs, preferences, and learnings grounded in accumulated evidence. They are not summaries the LLM invents on the fly: each observation is backed by specific source memories, carries a proof count, and evolves as new evidence supports, contradicts, or extends it.
| Raw Facts | Observation | |-----------|--------------| | "Alice prefers Python" | "Alice is a Python-focused developer who values readability and simplicity" | | "Alice dislikes verbose code" | | | "Alice recommends type hints" | | Observations provide: - **Deduplication**: One durable belief instead of many overlapping facts - **Grounding**: Every observation references the specific memories (with quotes) that support it - **Evolution**: Refined as evidence strengthens, weakens, or contradicts it — history is preserved - **Freshness awareness**: when newer memories haven't been consolidated yet, `reflect` treats the affected observations as stale and verifies them against raw facts - **Efficiency**: Condensed knowledge for faster retrieval --- ## How Consolidation Works ### Automatic Background Processing After `retain()` completes, the consolidation engine runs automatically: 1. **New facts analyzed** — Each new fact is compared against existing observations 2. **Pattern detection** — Related facts are grouped and synthesized 3. **Observation creation/update** — New observations are created or existing ones refined 4. **Evidence tracking** — Each observation maintains references to supporting facts ### Near-Duplicate Reconciliation Consolidation can still produce two observations that say the same thing in slightly different words — for example when a weaker model writes a near-identical observation instead of refining the existing one, or when refining an observation reshapes its wording so it overlaps another one. Left alone, these near-duplicates clutter recall with redundant beliefs. When enabled, Hindsight reconciles them automatically. Whenever an observation is created **or** updated, it is compared against the existing observations it most closely resembles. If one is highly similar, a focused check decides whether to **merge** them into a single belief (folding both sets of supporting evidence together) or **keep** them separate. 
Because the check reads the full text of both, observations that differ in a meaningful detail — a number, a negation, a named entity or language — are correctly kept apart rather than collapsed. This is controlled by the [`HINDSIGHT_API_CONSOLIDATION_DEDUP_THRESHOLD`](/developer/configuration#observations) setting: the cosine similarity at or above which two observations are reconciled. It is **enabled by default** (`0.97`); a lower value reconciles more aggressively, and `1.0` disables it. Reconciliation runs on PostgreSQL deployments only — it is skipped on Oracle regardless of the threshold. Reconciliation only compares observations **within the same tag scope**. If you tag retains with a unique per-call value (e.g. a `session-id`), each session lands in its own scope and never dedups against the others — producing one near-identical observation per session. To consolidate across those volatile tags, retain with [`observation_scopes: "shared"`](/developer/api/retain#shared), which scopes observations to one global, untagged belief while leaving the session tag on the source facts for recall filtering. ### Disabling Auto-Consolidation Set `HINDSIGHT_API_ENABLE_AUTO_CONSOLIDATION=false` (or configure per-bank via the [bank config API](/developer/api/memory-banks#observations-configuration)) to prevent consolidation from running automatically after retain, delete, and update operations. When disabled, consolidation only runs when you explicitly call the [consolidate endpoint](#trigger-consolidation). This is useful when you want full control over consolidation timing — for example, batching many retains before consolidating, or running consolidation only for specific scopes. ### Targeted Consolidation By default, consolidation processes **all** unconsolidated memories in a bank. 
You can scope it to specific tag sets using the `observation_scopes` parameter on the consolidate endpoint: ```python # Consolidate only memories tagged with user:alice client.consolidate( bank_id="my-bank", observation_scopes=[["user:alice"]] ) # Consolidate memories for alice OR the engineering team client.consolidate( bank_id="my-bank", observation_scopes=[["user:alice"], ["team:engineering"]] ) ``` Each scope is a list of tags. A memory matches a scope if its tags **contain all** tags in that scope. For example, scope `["user:alice"]` matches memories tagged `["user:alice", "team:eng"]`. When `observation_scopes` is omitted, all unconsolidated memories are processed (backward compatible). ### Evidence-Based Evolution Observations evolve as new evidence arrives: | Event | What the bank learns | Observation state | |-------|---------------------|----------------| | **Day 1** | "Redis is open source under BSD license" | "Redis is excellent for caching — fast, reliable, and OSS-friendly" (2 supporting facts) | | **Day 2** | "Redis has great community support" | Observation reinforced (3 supporting facts) | | **Day 30** | "Redis changed license to SSPL" | Observation refined: "Redis is technically strong, but has license concerns for cloud" | | **Day 45** | "Valkey forked Redis under BSD" | New observation: "Consider Valkey for new projects requiring true OSS" | ### Handling Contradictory Evidence What happens when a new fact contradicts an existing observation? 
The consolidation engine doesn't blindly overwrite — it **reconciles** the contradiction by capturing the evolution: **Example: User preference changes** | Time | Fact | Observation | |------|------|--------------| | Week 1 | "User says they love React" | "User prefers React for frontend development" | | Week 2 | "User praises React's component model" | "User is enthusiastic about React, particularly its component model" | | Week 3 | "User says they've switched to Vue and won't use React anymore" | "User was previously a React enthusiast who appreciated its component model, but has now switched to Vue and no longer uses React" | Notice how the final observation captures the **full journey** — not just "User prefers Vue" but the complete evolution of their preference. This nuanced understanding means: - Your agent won't recommend React tutorials to someone who explicitly moved away from it - Your agent understands *why* this matters (they were enthusiastic before, so this is a deliberate choice) - Your agent can reference this history when relevant ("I know you used to work with React...") The system: 1. **Detects the conflict** — New fact contradicts existing observation 2. **Preserves history** — Incorporates the previous understanding into the new observation 3. **Creates nuanced observation** — Synthesizes a richer understanding that captures the change 4. **Updates freshness** — Marks the observation as recently updated **Example: Correcting misinformation** | Time | Fact | Observation | |------|------|--------------| | Day 1 | "Alice works at Google" | "Alice is a Google employee" | | Day 10 | "Alice actually works at Meta, not Google" | "Alice works at Meta (previously thought to work at Google)" | When a fact explicitly corrects previous information, the observation is updated to reflect the correction while noting the previous understanding. 
The raw facts are always preserved, so you can trace back to see what was originally stated and when it was corrected. --- ## Observations in Retrieval Observations are automatically included in both `recall()` and `reflect()` operations: ### In Recall Observations are returned alongside raw facts and can be filtered with the `types` parameter. ### In Reflect The reflect agent uses **hierarchical retrieval**: 1. **[Mental Models](/developer/api/mental-models)** — User-curated summaries (highest priority) 2. **Observations** — Consolidated knowledge with freshness awareness 3. **Raw Facts** — Ground truth for verification The agent automatically queries observations and uses them to inform its reasoning. --- ## Freshness Awareness Observations track when they were last updated. During reflect, the agent considers freshness: - **Fresh observations**: Used directly for reasoning - **Stale observations**: Agent verifies against current facts before relying on them This ensures responses stay accurate even as the underlying data changes. --- ## Observation Scopes By default, observations are scoped to all of a memory's tags combined. The `observation_scopes` retain parameter lets you control this — building separate observations per tag, per combination, or with a custom list of scopes. This is key when a single memory carries multiple tags and you want each tag to accumulate its own observations independently. See [`observation_scopes` in the Retain API](./api/retain#observation_scopes) for the full explanation and options. To inspect the scopes that already exist in a bank, call `GET /v1/default/banks/{bank_id}/observations/scopes`. The response lists each exact tag set with its observation count; the empty tag list is the global scope. Use a returned scope as `tags` with `tags_match: "exact"` when you need to filter to that precise observation scope without also matching observations that carry extra tags.
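The difference between exact and non-exact tag matching can be illustrated with a small local sketch. This is not Hindsight's implementation: the function names and sample data are hypothetical, and the superset rule used for the non-exact case is an assumption inferred from the phrase "observations that carry extra tags". Only the exact rule, with the empty list standing for the global scope, comes from the text.

```python
from typing import Callable, Dict, Iterable, List


def matches_exact(observation_tags: Iterable[str], filter_tags: Iterable[str]) -> bool:
    """Exact scope match: both tag sets must be identical (order ignored)."""
    return set(observation_tags) == set(filter_tags)


def matches_with_extras(observation_tags: Iterable[str], filter_tags: Iterable[str]) -> bool:
    """ASSUMED non-exact rule: the observation may carry extra tags."""
    return set(filter_tags) <= set(observation_tags)


# Hypothetical observations keyed by an illustrative name, with their scope tags.
OBSERVATIONS: Dict[str, List[str]] = {
    "obs-global": [],  # empty tag list = global scope
    "obs-alice": ["user:alice"],
    "obs-alice-eng": ["user:alice", "team:eng"],
}


def select(filter_tags: List[str], matcher: Callable[..., bool]) -> List[str]:
    """Return the names of observations whose tags satisfy the matcher."""
    return sorted(
        name for name, tags in OBSERVATIONS.items() if matcher(tags, filter_tags)
    )
```

In this sketch, `select(["user:alice"], matches_exact)` keeps only the observation scoped to exactly `user:alice`, while the assumed non-exact rule also returns the one tagged with `team:eng`. An empty filter combined with exact matching selects only the global scope.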
To recall **only** the global scope — the untagged observations written by `observation_scopes: "shared"` — pass an empty list with exact matching: `tags: []`, `tags_match: "exact"`. --- ## Observations Mission You can define exactly what this bank should synthesise by setting an **observations mission** (`observations_mission`). This replaces the built-in durable-knowledge rules with your own instructions, letting you control what shape observations take. ``` e.g. Observations are stable facts about people and projects. Always include preferences, skills, and recurring patterns. Ignore one-off events and ephemeral state. ``` Leave it blank to use the server default — durable, specific facts that stay true over time (preferences, skills, relationships, recurring patterns), with ephemeral state filtered out. **Examples:** | `observations_mission` | What gets synthesised | |------------------------|----------------------| | *(unset — default)* | Durable facts: preferences, skills, relationships, recurring patterns | | *"Observations are weekly summaries of sprint outcomes and blockers"* | Broad event summaries grouped by time period | | *"Observations are stable facts about named individuals only"* | Person-centric knowledge, tied to specific people | | *"Observations are recurring patterns in customer support interactions"* | Failure modes, common requests, pain points | Set `observations_mission` via the [bank config API](/developer/api/memory-banks#observations-configuration) or the [`HINDSIGHT_API_OBSERVATIONS_MISSION`](/developer/configuration#observations) environment variable. --- ## Observation Lifecycle & Invalidation ### When Memories Are Deleted Observations are derived from source memories. 
When source memories are removed, Hindsight automatically keeps observations consistent: | Action | Effect on observations | |--------|----------------------| | Delete a document | All observations derived from the document's memories are deleted | | Delete individual memories (by type) | Observations sourced from those memories are deleted | | Delete an entire bank | All observations are deleted along with everything else | After deletion, the **remaining source memories** that fed the affected observations have their consolidation state reset, so they will be re-consolidated on the next consolidation run and produce fresh observations. ### Clearing Observations for a Specific Memory You can clear all observations derived from a single memory without deleting the memory itself. This is useful when you want to force re-synthesis of a memory's contribution to consolidated knowledge. Use the `DELETE /v1/default/banks/{bank_id}/memories/{memory_id}/observations` endpoint. This will: 1. Delete all observations that list the memory as a source 2. Reset `consolidated_at` on the memory itself and any other source memories that contributed to those observations 3. Trigger a consolidation job so fresh observations are produced automatically ### Resetting All Observations To wipe all consolidated knowledge and start over: ```python # Clear all observations for a bank client.clear_observations(bank_id="my-bank") ``` This resets the consolidation state for all source memories in the bank, so the next consolidation run will re-derive all observations from scratch. --- ## Trigger Consolidation {#trigger-consolidation} Use the consolidate endpoint to manually trigger consolidation: ```http POST /v1/default/banks/{bank_id}/consolidate Content-Type: application/json { "observation_scopes": [["user:alice"], ["team:engineering"]] } ``` The request body is optional. When omitted (or sent as an empty body), all unconsolidated memories in the bank are processed. 
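The request above can be assembled with Python's standard library alone. The sketch below only builds the request object and does not send it. The helper names and the `http://localhost:8888` base URL in the usage note are assumptions for illustration; the path and body shape are taken from the HTTP example. A second helper restates the documented matching rule (a memory qualifies when its tags contain all tags of at least one scope) as a local check, not as the server's actual code.

```python
import json
import urllib.parse
import urllib.request
from typing import Iterable, List, Optional


def build_consolidate_request(
    base_url: str,
    bank_id: str,
    observation_scopes: Optional[List[List[str]]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to the consolidate endpoint."""
    quoted_bank = urllib.parse.quote(bank_id, safe="")
    url = f"{base_url.rstrip('/')}/v1/default/banks/{quoted_bank}/consolidate"
    if observation_scopes is None:
        # The body is optional: omitting it requests a full-bank sweep.
        return urllib.request.Request(url, method="POST")
    body = json.dumps({"observation_scopes": observation_scopes}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def memory_in_scopes(
    memory_tags: Iterable[str],
    observation_scopes: Optional[List[List[str]]],
) -> bool:
    """Local restatement of the rule: tags must contain ALL tags of ANY one scope."""
    if observation_scopes is None:
        return True  # no scopes given: every unconsolidated memory is processed
    tags = set(memory_tags)
    return any(set(scope) <= tags for scope in observation_scopes)
```

For example, `build_consolidate_request("http://localhost:8888", "my-bank", [["user:alice"]])` yields a request that could be passed to `urllib.request.urlopen` against a running instance, with the base URL replaced by your own deployment's address.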
| Parameter | Type | Description | |-----------|------|-------------| | `observation_scopes` | `list[list[str]]` \| `null` | Optional list of tag scopes. Only memories whose tags contain all tags in at least one scope are processed. Omit for a full-bank sweep. | ## Configuration Observation consolidation runs automatically by default. You can disable auto-consolidation with [`HINDSIGHT_API_ENABLE_AUTO_CONSOLIDATION`](/developer/configuration#observations) and trigger it on-demand via the [consolidate endpoint](#trigger-consolidation). Monitor consolidation progress via the [Operations API](./api/operations). --- ## Next Steps - [**Retain**](./retain) — How facts are stored and trigger consolidation - [**Recall**](./retrieval) — How observations are retrieved - [**Reflect**](./reflect) — How the agentic loop uses observations - [**Mental Models**](./api/mental-models) — User-curated summaries for common queries --- ## File: developer/performance.md # Performance Hindsight is designed for high-performance semantic memory operations at scale. This page covers performance characteristics, optimization strategies, and best practices. ## Overview Hindsight's performance is optimized across three key operations: - **Retain (Ingestion)**: Batch processing with async operations for large-scale memory storage - **Recall (Search)**: Sub-second semantic search with configurable thinking budgets - **Reflect (Reasoning)**: Disposition-aware answer generation with controllable compute ## Design Philosophy: Optimized for Fast Reads Hindsight is **architected from the ground up to prioritize read performance over write performance**. This design decision reflects the typical usage pattern of memory systems: memories are written once but read many times. 
The system makes deliberate trade-offs to ensure **sub-second recall operations**: - **Pre-computed embeddings**: All memory embeddings are generated and indexed during retention - **Optimized vector search**: HNSW indexes enable fast approximate nearest neighbor search - **Fact extraction at write time**: Complex LLM-based fact extraction happens during retention, not retrieval - **Structured memory graphs**: Relationships and temporal information are resolved upfront This means **Recall (search) operations are blazingly fast** because all the heavy lifting has already been done. ### Performance Comparison | Operation | Typical Latency | Primary Bottleneck | Optimization Strategy | |-----------|----------------|-------------------|----------------------------------| | **Recall** | 100-600ms | Re-ranker (on CPU) | Use GPU for re-ranking, or reduce budget | | **Reflect** | 800-3000ms | LLM generation | Use faster LLM | | **Retain** | 500ms-2000ms per batch | **LLM fact extraction** | Use high-throughput LLM provider | Hindsight is designed to ensure your **application's read path (recall/reflect) is always fast**, even if it means spending more time upfront during writes. This is the right trade-off for memory systems where: - Memories are retained in background processes or during low-traffic periods - Memories are queried frequently in user-facing, latency-sensitive contexts - The ratio of reads to writes is high (typically 10:1 or higher) --- ## Retain Performance **Retain (write) operations are inherently slower** because they involve LLM-based fact extraction, entity recognition, temporal reasoning, relationship mapping, and embedding generation. **The LLM is the primary bottleneck for write latency.** ### Hindsight Doesn't Need a Smart Model The fact extraction process is structured and well-defined, so smaller, faster models work extremely well. Our recommended model is `gpt-oss-20b` (available via Groq and other providers). 
To maximize retention throughput: 1. **Use high-throughput LLM providers**: Choose providers with high requests-per-minute (RPM) limits and low latency - **Fast**: [Groq](https://groq.com) with `gpt-oss-20b` or other openai-oss models, self-hosted models on GPU clusters (vLLM, TGI) - **Slow**: Standard cloud LLM providers with rate limits 2. **Batch your operations**: Group related content into batch requests. Send as much data as you want in a single request — the only limit is the HTTP payload size. 3. **Use async mode for large datasets**: Queue operations in the background 4. **Parallel processing**: For very large datasets, use multiple concurrent retention requests with different `document_id` values ### Automatic Batch Optimization **When using async retain, Hindsight automatically handles batch sizing for you.** You don't need to manually tune batch sizes or worry about optimal chunking. How it works: - **Send large batches**: Submit hundreds or thousands of items in a single async retain request - **Automatic splitting**: Hindsight automatically splits large batches (>10,000 tokens) into optimized sub-batches - **Parallel processing**: Sub-batches are processed concurrently in the background - **Status tracking**: Parent operation aggregates status from all sub-batches - **Token-based**: Batching uses tiktoken for accurate token counting, not character counts Benefits: - Send entire documents or datasets in one API call - Let Hindsight optimize the processing strategy - Track overall progress via the parent operation status - No need to manually split data into small batches ### Throughput Factors affecting throughput: - Document size and complexity - LLM provider rate limits (for fact extraction) - Database write performance - Available CPU/memory resources --- ## Tuning for Local & Small Environments Hindsight's defaults are tuned for cloud LLM providers and multi-core servers. 
When you run it on a laptop, a single GPU box, or against a **local LLM server** (llama.cpp, vLLM, LM Studio, Ollama) with a small fixed slot pool, those defaults can saturate the backend, time out, or thrash the CPU. This section collects the knobs that matter for low-resource setups. ### LLM concurrency The default `HINDSIGHT_API_LLM_MAX_CONCURRENT=32` assumes a cloud provider that can absorb dozens of parallel requests. A local server with a handful of slots cannot — Hindsight will fill every slot and **starve any other client sharing the endpoint** (your main agent, another app, or a second Hindsight operation). ```bash export HINDSIGHT_API_LLM_MAX_CONCURRENT=2 ``` A value of `2` lets retain and consolidation run concurrently without blocking each other. If the endpoint is **shared** with other clients (other applications, agents, or workflows hitting the same llama-server / vLLM / LM Studio instance), reserve slots for them by lowering further — leave at least one slot free per shared client. You can also split the budget per operation so background work never crowds out live reads. The per-operation caps compose *on top of* the global cap: ```bash # global=4, with retain/consolidation capped low so reflect always has headroom export HINDSIGHT_API_LLM_MAX_CONCURRENT=4 export HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT=1 export HINDSIGHT_API_CONSOLIDATION_LLM_MAX_CONCURRENT=1 ``` ### Timeouts and retries Small models on modest hardware generate tokens slowly, and the first request after startup pays a model-load cost. The default `HINDSIGHT_API_LLM_TIMEOUT=120` (seconds) can be too tight for a large local model on CPU — raise it to avoid spurious timeouts and wasted retries: ```bash export HINDSIGHT_API_LLM_TIMEOUT=300 # allow slow local generation export HINDSIGHT_API_LLM_MAX_RETRIES=2 # fail faster locally — retries rarely help a slow box ``` A local endpoint isn't rate-limited, so aggressive retry/backoff mostly adds latency on real failures. 
Lower retries and let genuine errors surface quickly. ### Smaller, faster models — and reasoning effort Retain (fact extraction) is structured work that does not need a frontier model; reflect can use a lighter model still. On a constrained box, point each operation at the smallest model that holds up: ```bash # Reflect on a small/fast model; retain on a slightly stronger structured-output model export HINDSIGHT_API_REFLECT_LLM_MODEL= export HINDSIGHT_API_RETAIN_LLM_MODEL= ``` If your model exposes a reasoning/thinking budget, keep it low (the default) — extra reasoning tokens are pure latency for the extraction and consolidation paths: ```bash export HINDSIGHT_API_LLM_REASONING_EFFORT=low ``` Consolidation sends multiple facts to the LLM in a single call (default 8). On a small model with a limited context window, a large batch produces an oversized prompt and a long, error-prone response. Shrink the batch so each consolidation call stays small and reliable: ```bash export HINDSIGHT_API_CONSOLIDATION_LLM_BATCH_SIZE=2 # default 8; lower = smaller prompts, more calls ``` ### Built-in llama.cpp tuning The bundled `llamacpp` provider runs a llama.cpp server as a managed subprocess — no external server needed. Key knobs for small machines: ```bash export HINDSIGHT_API_LLM_PROVIDER=llamacpp export HINDSIGHT_API_LLM_MAX_CONCURRENT=2 # retain + consolidation without blocking export HINDSIGHT_API_LLAMACPP_GPU_LAYERS=-1 # -1 = offload all layers to GPU; 0 = CPU only export HINDSIGHT_API_LLAMACPP_CONTEXT_SIZE=8192 # lower to save RAM/VRAM; raise for big batches export HINDSIGHT_API_LLAMACPP_EXTRA_ARGS="--n_threads 8" # match physical cores on CPU-only boxes # export HINDSIGHT_API_LLAMACPP_NO_GRAMMAR=true # faster, but less reliable JSON output ``` See [Built-in llama.cpp](./configuration#built-in-llamacpp) for the full option list. ### Reranker on CPU Recall's bottleneck on a machine without a GPU is the cross-encoder reranker. 
The local reranker has several CPU/Apple-Silicon knobs that are quality-neutral but materially faster: ```bash # Apple Silicon (MPS): half precision is 27–36% faster, quality-identical export HINDSIGHT_API_RERANKER_LOCAL_FP16=true # Sort pairs by length before batching — 36–54% faster, quality-identical by construction export HINDSIGHT_API_RERANKER_LOCAL_BUCKET_BATCHING=true # Cap reranker parallelism so it doesn't thrash a small CPU under load (default 4) export HINDSIGHT_API_RERANKER_LOCAL_MAX_CONCURRENT=2 # On macOS, force CPU if MPS/XPC causes instability # export HINDSIGHT_API_RERANKER_LOCAL_FORCE_CPU=true ``` The biggest single win on CPU is reranking fewer candidates. By default Hindsight reranks up to 300 candidates per recall — shrink that pool to cut cross-encoder work proportionally: ```bash export HINDSIGHT_API_RERANKER_MAX_CANDIDATES=100 # default 300; RRF pre-filters the rest ``` For a pure-CPU box that struggles with the cross-encoder, `flashrank` is a lighter ONNX-based reranker: ```bash export HINDSIGHT_API_RERANKER_PROVIDER=flashrank ``` You can also reduce recall work directly: use a lower `budget` (`low`/`mid`) for everyday queries and reserve `high` for comprehensive reasoning. See [Recall Performance](#recall-performance) below. --- ## Recall Performance ### Budget The `budget` parameter controls the search depth and quality. Choose based on query complexity — comprehensive questions that need thorough analysis benefit from higher budgets: | Budget | Use Case | |--------|----------| | `low` | Quick lookups, real-time chat | | `mid` | Standard queries, balanced performance | | `high` | Comprehensive questions, thorough analysis | ### Optimization 1. **Appropriate budgets**: Use lower budgets for simple queries, higher for comprehensive reasoning 2. **Limit result tokens**: Set `max_tokens` to control response size (default: 4096) 3. 
**Include chunks**: Use `include_chunks` to retrieve the raw text that generated memories when you need additional context ### Database Performance Hindsight uses PostgreSQL with pgvector for efficient vector search: - **Index type**: HNSW for approximate nearest neighbor search - **Typical query time**: 10-50ms for vector search on 100K+ facts - **Scalability**: Tested with millions of facts per bank ## Reflect Performance ### Performance Characteristics | Component | Latency | Description | |-----------|----------------|-------------| | Memory search | 100-600ms | Based on budget (low/mid/high) | | LLM generation | 500-2000ms | Depends on provider and response length | | **Total** | **600-2600ms** | Typical end-to-end latency | ### Optimization Strategies 1. **Budget selection**: Use lower budgets when context is sufficient 2. **Context provision**: Provide relevant `context` to reduce recall requirements and steer towards more focused answers ## Best Practices ### Operations - **Use appropriate budgets**: Don't over-provision for simple queries; use higher budgets for comprehensive reasoning - **Batch retain operations**: Group related content together for better efficiency - **Cache frequent queries**: Cache at the application level for repeated queries - **Profile with trace**: Use the `trace` parameter to identify slow operations ### Scaling - **Horizontal scaling**: Deploy multiple API instances behind a load balancer with shared PostgreSQL - **Concurrency**: 100+ simultaneous requests supported; memory search scales with CPU cores - **LLM rate limits**: Distribute load across multiple API keys/providers (typically 60-500 RPM per key) ### Cost Optimization - **Use efficient models**: `gpt-oss-20b` via Groq for retain — Hindsight doesn't need frontier models - **Enable provider Batch API**: Set `HINDSIGHT_API_RETAIN_BATCH_ENABLED=true` with async retain to cut LLM fact-extraction costs by 50% (supported on OpenAI and Groq; results delivered within 24 hours) - 
**Control token budgets**: Limit `max_tokens` for recall, use lower budgets when possible - **Optimize chunks**: Larger chunks (1000-2000 tokens) are more efficient than many small ones ### Monitoring - **Prometheus metrics**: Available at `/metrics` — track latency percentiles, throughput, and error rates - **Key metrics**: `hindsight_recall_duration_seconds`, `hindsight_reflect_duration_seconds`, `hindsight_retain_items_total` --- ## File: developer/reflect.mdx # Reflect: Agentic Reasoning with Disposition When you call `reflect()`, Hindsight runs an **agentic loop** that autonomously gathers evidence and reasons through the lens of the bank's disposition to generate contextual responses. ```mermaid graph TB subgraph agent["Reflect Agent Loop"] A[Query] --> B{Need more info?} B -->|Yes| C[Call Tools] C --> D[search_mental_models] C --> E[search_observations] C --> F[recall] C --> G[expand] D --> B E --> B F --> B G --> B B -->|No| H[Generate Response] end H --> I[Response + Citations] ``` --- ## How It Works Unlike simple retrieval, reflect is an **agentic system** that: 1. **Autonomously gathers evidence** — The agent decides what information it needs and calls appropriate tools 2. **Uses hierarchical retrieval** — Checks mental models first, then observations, then raw facts 3. **Applies disposition** — Shapes reasoning based on the bank's personality traits 4. **Enforces directives** — Hard rules that must be followed in all responses 5. 
**Cites sources** — Returns which memories and observations were used ### The Agentic Loop The reflect agent runs in a loop with access to these tools: | Tool | Purpose | Priority | |------|---------|----------| | `search_mental_models` | User-curated summaries | Highest (check first) | | `search_observations` | Consolidated knowledge | High | | `recall` | Raw facts (ground truth) | Fallback | | `expand` | Get more context for a memory | As needed | | `done` | Complete with final answer | When ready | The agent: - **Must gather evidence** before answering (guardrail prevents empty responses) - **Runs up to 10 iterations** to find relevant information - **Validates citations** — only IDs that were actually retrieved can be cited ### Hierarchical Retrieval Strategy The agent uses a smart retrieval hierarchy: 1. **[Mental Models](/developer/api/mental-models)** — User-curated summaries you've pre-computed for common queries 2. **[Observations](/developer/observations)** — Consolidated knowledge with freshness awareness 3. **Raw Facts** — Ground truth for verification when observations are stale **Mental models** are saved reflect responses that you create for frequently asked questions. They're checked first because they represent explicitly curated knowledge. See the [Mental Models API](/developer/api/mental-models) for how to create and manage them. If an observation is marked as **stale**, the agent automatically verifies it against current facts. --- ## Why Reflect? Most AI systems can retrieve facts, but they can't **reason** about them in a consistent way. 
### The Problem Without reflect: - **No consistent character**: Same question gets different answers each time - **No knowledge synthesis**: System never connects related facts - **No reasoning context**: Responses don't reflect accumulated knowledge - **Generic responses**: Every AI sounds the same ### The Value With reflect: - **Consistent character**: A "detail-oriented, cautious" bank emphasizes risks and thorough planning - **Evolving knowledge**: Observations strengthen and adapt as evidence accumulates - **Contextual reasoning**: "Based on what I know about your team's remote work success..." - **Differentiated behavior**: Support bots sound diplomatic, code reviewers sound direct ### When to Use Reflect | Use `recall()` when... | Use `reflect()` when... | |------------------------|-------------------------| | You need raw facts | You need reasoned interpretation | | You're building your own reasoning | You want disposition-consistent responses | | You need maximum control | You want the bank to "think" for itself | | Simple fact lookup | Forming recommendations | **Example:** - `recall("Alice")` → Returns all Alice facts and relevant mental models - `reflect("Should we hire Alice?")` → Agent gathers evidence about Alice, reasons about fit, returns answer with citations --- ## Disposition Traits When you create a memory bank, you can configure its disposition using three traits. 
These traits influence how the bank interprets information and reasons during `reflect()`: | Trait | Scale | Low (1) | High (5) | |-------|-------|---------|----------| | **Skepticism** | 1-5 | Trusting, accepts information at face value | Skeptical, questions and doubts claims | | **Literalism** | 1-5 | Flexible interpretation, reads between the lines | Literal interpretation, takes things at face value | | **Empathy** | 1-5 | Detached, focuses on facts | Empathetic, considers emotional context | ### Mission: Natural Language Identity Beyond numeric traits, you can provide a natural-language **mission** that describes the bank's identity and reasoning context. The reflect mission frames how the agent reasons and responds: - Provides identity context: who the agent is and what it cares about - Shapes how disposition traits are applied in practice - Keeps reasoning consistent across conversations :::info Per-operation missions The reflect mission only affects `reflect()`. To steer what gets extracted during `retain()`, use [`retain_mission`](/developer/api/memory-banks#retain-configuration). To control what gets synthesised into observations, use [`observations_mission`](/developer/api/memory-banks#observations-configuration). ::: --- ## Disposition Shapes Reasoning Two banks with different dispositions, given identical facts about remote work: **Bank A** (low skepticism, high empathy): > "Remote work enables flexibility and work-life balance. The team seems happier and more productive when they can choose their environment." **Bank B** (high skepticism, low empathy): > "Remote work claims need verification. What are the actual productivity metrics? The anecdotal benefits may not translate to measurable outcomes." **Same facts → Different conclusions** because disposition shapes interpretation.
--- ## Disposition Presets by Use Case Different use cases benefit from different disposition configurations: | Use Case | Recommended Traits | Why | |----------|-------------------|-----| | **Customer Support** | skepticism: 2, literalism: 2, empathy: 5 | Trusting, flexible, understanding | | **Code Review** | skepticism: 4, literalism: 5, empathy: 2 | Questions assumptions, precise, direct | | **Legal Analysis** | skepticism: 5, literalism: 5, empathy: 2 | Highly skeptical, exact interpretation | | **Therapist/Coach** | skepticism: 2, literalism: 2, empathy: 5 | Supportive, reads between lines | | **Research Assistant** | skepticism: 4, literalism: 3, empathy: 3 | Questions claims, balanced interpretation | --- ## Directives: Hard Rules While disposition traits *influence* reasoning style, **directives** are hard rules that the agent *must* follow. Directives are injected into the prompt and enforced in every response. ### When to Use Directives Use directives for constraints that must never be violated: - **Compliance rules**: "Never recommend specific stocks or financial products" - **Privacy constraints**: "Never share personal data with third parties" - **Style requirements**: "Always respond in formal English" - **Domain guardrails**: "Always cite sources when making factual claims" ### Directives vs Disposition | Aspect | Disposition | Directives | |--------|-------------|------------| | **Nature** | Soft influence | Hard rules | | **Effect** | Shapes interpretation and tone | Must be followed exactly | | **Violation** | Acceptable (it's a tendency) | Not acceptable | | **Example** | High skepticism → questions claims | "Never make medical diagnoses" | :::tip Use disposition for personality and character. Use directives for compliance and guardrails. ::: See [Memory Banks: Directives](/developer/api/memory-banks#directives) for how to create and manage directives. 
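
Returning to the disposition presets listed earlier in this section: the table can be captured as plain data and checked against the 1-5 trait scale. The sketch below is only an illustration. `Disposition`, `PRESETS`, and `preset_for` are hypothetical names and not part of the Hindsight API; the Memory Banks API reference describes how trait values are actually supplied when a bank is created.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disposition:
    """Hypothetical container for the three disposition traits (each 1-5)."""

    skepticism: int
    literalism: int
    empathy: int

    def __post_init__(self) -> None:
        # Reject anything outside the documented 1-5 scale.
        for trait in ("skepticism", "literalism", "empathy"):
            value = getattr(self, trait)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{trait} must be an integer from 1 to 5, got {value!r}")


# Values copied from the presets table; the dictionary keys are illustrative.
PRESETS = {
    "customer_support": Disposition(skepticism=2, literalism=2, empathy=5),
    "code_review": Disposition(skepticism=4, literalism=5, empathy=2),
    "legal_analysis": Disposition(skepticism=5, literalism=5, empathy=2),
    "therapist_coach": Disposition(skepticism=2, literalism=2, empathy=5),
    "research_assistant": Disposition(skepticism=4, literalism=3, empathy=3),
}


def preset_for(use_case: str) -> Disposition:
    """Look up a preset by key, failing loudly on unknown use cases."""
    try:
        return PRESETS[use_case]
    except KeyError:
        raise ValueError(f"no preset defined for {use_case!r}") from None
```

Keeping presets as validated data makes it harder for an out-of-range trait to reach bank configuration unnoticed.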

---

## What You Get from Reflect

When you call `reflect()`:

**Returns:**
- **Response text**: Disposition-influenced answer from the agent
- **based_on**: Evidence used: memories, mental models, and directives that grounded the response
- **trace**: Tool calls, LLM calls, and observations accessed (when `include.tool_calls=True`)
- **structured_output**: Parsed response if `response_schema` was provided
- **usage**: Token usage metrics

**Example:**

```json
{
  "text": "Based on Alice's ML expertise and her work at Google, she'd be an excellent fit for the research team lead position...",
  "based_on": {
    "memories": [
      {"id": "mem-123", "text": "Alice has 5 years of ML experience", "type": "world"},
      {"id": "mem-456", "text": "Alice worked at Google on search ranking", "type": "world"}
    ],
    "mental_models": [],
    "directives": [
      {"id": "dir-001", "name": "Formal Language", "rules": ["Always respond in formal English"]}
    ]
  },
  "usage": {"input_tokens": 1500, "output_tokens": 500, "total_tokens": 2000}
}
```

Both memories are `world` facts because they describe Alice rather than the agent's own actions. The agent automatically gathers evidence, validates citations, and generates a grounded response.

---

## Why Disposition Matters

Without disposition, all AI assistants sound the same. With disposition:

- **Customer support bots** can be diplomatic and empathetic
- **Code review assistants** can be direct and thorough
- **Creative assistants** can be open to unconventional ideas
- **Risk analysts** can be appropriately cautious

Disposition creates **consistent character** across conversations while observations **evolve with evidence**.

---

## Next Steps

- [**Observations**](./observations): How knowledge is consolidated
- [**Retain**](./retain): How rich facts are stored
- [**Recall**](./retrieval): How multi-strategy search works
- [**Reflect API**](./api/reflect): Code examples and parameters

---

## File: developer/services.md

# Services

Hindsight consists of three services that can run together or separately depending on your deployment needs.
## API Service The core memory engine. Handles all memory operations: - **Retain**: Ingests content, extracts facts, builds knowledge graph - **Recall**: Semantic search across memories - **Reflect**: Disposition-aware answer generation ```bash hindsight-api # Default port: 8888 ``` The API service is stateless and can be horizontally scaled behind a load balancer. All state is stored in PostgreSQL. By default, the API also processes background tasks (mental model consolidation) internally. For high-throughput deployments, you can disable this and run dedicated workers instead. ## Worker Service Dedicated task processor for background operations. Uses the **same package and Docker image** as the API service, just with a different entry point. ```bash hindsight-worker # Default metrics port: 8889 ``` Workers use PostgreSQL as a task broker, polling for pending tasks. Multiple workers can run simultaneously without conflicts. | Deployment | Internal Worker | Dedicated Workers | |------------|-----------------|-------------------| | **Development** | ✅ Simple, all-in-one | ❌ Overkill | | **Small production** | ✅ Less infrastructure | ❌ Overkill | | **High throughput** | ❌ API bottleneck | ✅ Scale independently | | **Long-running tasks** | ❌ Blocks API resources | ✅ Isolated processing | To use dedicated workers, disable the internal worker in the API and start worker processes: ```bash # Disable internal worker in API HINDSIGHT_API_WORKER_ENABLED=false hindsight-api # Start dedicated workers (run multiple instances) hindsight-worker --worker-id worker-1 hindsight-worker --worker-id worker-2 ``` Each worker exposes `/health` and `/metrics` endpoints for monitoring. Before scaling down or removing workers, release their tasks with `hindsight-admin decommission-worker `. See [Configuration - Distributed Workers](./configuration#distributed-workers) for all worker settings and [Installation - Helm](./installation#distributed-workers) for Kubernetes deployment. 
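
For monitoring, a small probe against a worker's `/health` endpoint might look like the following. This is a sketch using only the Python standard library. The base URL, including which port serves `/health`, and the assumption that a healthy worker answers with HTTP 200 are not confirmed on this page, so verify both against your deployment. `worker_is_healthy` is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def worker_is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200.

    Assumption: a healthy worker responds with status 200. Connection
    errors, timeouts, and non-2xx responses are all treated as unhealthy.
    """
    url = f"{base_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # URLError covers refused connections and HTTP error statuses;
        # OSError covers low-level socket timeouts.
        return False
```

A call such as `worker_is_healthy("http://localhost:8889")` could be run for each worker before scaling down, with the port adjusted to wherever your workers expose their endpoints.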
## Control Plane Web UI for managing and exploring your memory banks: - Browse agents and memory banks - Explore entities and relationships - View ingestion history and operations - Test recall queries interactively The Control Plane connects to the API service and provides a visual interface for development and debugging. For bare metal deployments, you can run the Control Plane standalone using npx. See [Installation - Bare Metal](./installation#control-plane) for details. --- ## File: developer/storage.md # Storage Hindsight uses PostgreSQL as its primary storage backend, with Oracle AI Database available as an alternative for enterprise deployments. ## Why PostgreSQL? PostgreSQL provides all capabilities required for a semantic memory system in a single database: | Capability | Implementation | |------------|----------------| | Vector search | pgvector extension with HNSW indexes | | Full-text search | Built-in tsvector with GIN indexes | | Relational data | Native PostgreSQL | | JSON documents | JSONB with indexing | | Graph queries | Recursive CTEs | ### Reduced System Dependencies Building exclusively for PostgreSQL simplifies deployment and operations: - Single connection string to configure - Single backup and restore strategy - Single monitoring target - ACID transactions across all data types - Single upgrade path ### No Storage Abstraction Hindsight does not abstract storage behind a generic interface. This is a deliberate trade-off. We believe PostgreSQL is becoming the standard database API. Its popularity, extension ecosystem, and modularity mean that PostgreSQL-compatible interfaces are appearing everywhere—from serverless offerings to distributed databases. Building for PostgreSQL today means compatibility with a growing ecosystem tomorrow. Supporting multiple databases would increase flexibility but conflict with our core goals: Hindsight is fully open source and designed to be as simple as possible to run and use. 
Adding database abstractions introduces complexity in code, testing, documentation, and operations—complexity that we pass on to users. By building on PostgreSQL, we keep the system simple: - One set of deployment instructions - One set of performance characteristics to understand - One codebase optimized for one backend - No configuration decisions about which database to use ### Oracle AI Database Support For enterprise deployments, Hindsight also supports Oracle AI Database with full feature parity. All memory operations—retain, recall, and reflect—work identically on Oracle, making it a drop-in option for organizations that standardize on Oracle infrastructure. ## Development with pg0 For local development, Hindsight uses **[pg0](https://github.com/vectorize-io/pg0)**—an embedded PostgreSQL distribution. ### What is pg0? pg0 is a single binary containing: - PostgreSQL server - pgvector extension (pre-installed) - Automatic initialization ### Behavior When no `HINDSIGHT_API_DATABASE_URL` is configured, Hindsight: 1. Starts an embedded PostgreSQL instance on port 5555 2. Initializes the schema 3. Stores data in `~/.hindsight/pg0/` ### Environments | Environment | Database | Configuration | |-------------|----------|---------------| | Development | pg0 (embedded) | Automatic | | Production | PostgreSQL 15+ | `HINDSIGHT_API_DATABASE_URL` environment variable | ## Requirements - PostgreSQL 15 or later - pgvector 0.5.0 or later Any PostgreSQL instance that satisfies these requirements should work. If you encounter issues with a specific setup, [open a GitHub issue](https://github.com/hindsight-ai/hindsight/issues). ### Tested Managed Services - AWS RDS (PostgreSQL 15+) - Google Cloud SQL - Azure Database for PostgreSQL - Supabase - Neon --- ## File: sdks/embed.md # Daemon CLI (hindsight-embed) Zero-configuration local memory system with automatic daemon management. Perfect for development, prototyping, and single-user applications. 
## Overview `hindsight-embed` is a zero-configuration SDK that wraps the Hindsight API and PostgreSQL database into a single auto-managed local daemon. It's designed for development, prototyping, and single-user applications where you want memory capabilities without infrastructure overhead. **How it works:** 1. **First command triggers startup**: When you run any `hindsight-embed` command, it checks if a local daemon is running 2. **Auto-daemon management**: If no daemon exists, it automatically spawns `hindsight-api --daemon` in the background 3. **Embedded database**: The daemon uses `pg0` (embedded PostgreSQL) — no separate database installation required 4. **Command forwarding**: Your command is forwarded to the local daemon via HTTP (localhost:8888) 5. **Auto-shutdown**: After 5 minutes of inactivity (configurable), the daemon gracefully shuts down to free resources **Key features:** - **Zero setup** — One `configure` command and you're ready - **Automatic lifecycle** — Daemon starts on-demand, stops when idle - **Isolated storage** — Each bank gets its own embedded PostgreSQL database - **Local-only** — Binds to `127.0.0.1:8888`, not accessible from network - **Production-grade engine** — Uses the same memory engine as the full API service Think of it as SQLite for long-term memory — all the power of Hindsight without managing servers. ## Installation Install via `uvx` (recommended - always latest version): ```bash # Run directly without installation uvx hindsight-embed@latest configure # Or use pipx for persistent installation pipx install hindsight-embed ``` ## Quick Start ### 1. 
Configure ```bash # Interactive configuration hindsight-embed configure # Or non-interactive via environment variables export HINDSIGHT_API_LLM_PROVIDER=openai export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=gpt-4o-mini hindsight-embed configure ``` Configuration is saved to `~/.hindsight/embed`: ```bash HINDSIGHT_API_LLM_PROVIDER=openai HINDSIGHT_API_LLM_MODEL=gpt-4o-mini HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx # Daemon settings (macOS: force CPU to avoid MPS/XPC issues) HINDSIGHT_API_EMBEDDINGS_LOCAL_FORCE_CPU=1 HINDSIGHT_API_RERANKER_LOCAL_FORCE_CPU=1 ``` ### 2. Use Memory Operations ```bash # Store a memory hindsight-embed memory retain default "User prefers dark mode" # Query memories hindsight-embed memory recall default "user preferences" # Reasoning with memory hindsight-embed memory reflect default "What color scheme should I use?" ``` The daemon starts automatically on first use! ### 3. Open the Control Center (optional) Use the local control center when you want a browser-based configuration wizard and daemon supervisor: ```bash # Launch the control center and open the browser hindsight-embed control start # Or use the browser wizard instead of terminal prompts during setup hindsight-embed configure --ui ``` The control center listens on localhost only (`http://localhost:7878` by default) and prints a tokenized URL. It stores the local access token at `~/.hindsight/control.token` and writes logs to `~/.hindsight/control.log`. ```bash # Pick a different control-center port for this launch hindsight-embed control start --port 7879 # Start without opening a browser automatically hindsight-embed control start --no-open # Check, inspect, or stop the control center hindsight-embed control status hindsight-embed control logs -f hindsight-embed control stop ``` The control center runs as a separate process from the memory daemon. Stopping or restarting the control center does not stop a running daemon. 
## Environment Variables | Variable | Description | Default | |----------|-------------|---------| | `HINDSIGHT_API_LLM_API_KEY` | **Required**. API key for LLM provider | - | | `HINDSIGHT_API_LLM_PROVIDER` | LLM provider: `openai`, `anthropic`, `gemini`, `groq`, `minimax`, `ollama` | `openai` | | `HINDSIGHT_API_LLM_MODEL` | Model name | `gpt-4o-mini` | | `HINDSIGHT_EMBED_DAEMON_IDLE_TIMEOUT` | Seconds before daemon auto-exits when idle (0 = never) | `0` | | `HINDSIGHT_EMBED_CONTROL_PORT` | Default port for `hindsight-embed control start` | `7878` | **Provider Examples:** ```bash # OpenAI export HINDSIGHT_API_LLM_PROVIDER=openai export HINDSIGHT_API_LLM_API_KEY=sk-xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=gpt-4o # Groq (fast inference) export HINDSIGHT_API_LLM_PROVIDER=groq export HINDSIGHT_API_LLM_API_KEY=gsk_xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=llama-3.3-70b-versatile # Anthropic export HINDSIGHT_API_LLM_PROVIDER=anthropic export HINDSIGHT_API_LLM_API_KEY=sk-ant-xxxxxxxxxxxx export HINDSIGHT_API_LLM_MODEL=claude-sonnet-4-20250514 ``` ## Daemon Management ### Idle Timeout Customize how long the daemon stays alive when idle: ```bash # Never timeout (daemon runs until manually stopped) export HINDSIGHT_EMBED_DAEMON_IDLE_TIMEOUT=0 # Shorter timeout: 1 minute export HINDSIGHT_EMBED_DAEMON_IDLE_TIMEOUT=60 # Longer timeout: 30 minutes export HINDSIGHT_EMBED_DAEMON_IDLE_TIMEOUT=1800 ``` ### Daemon Commands ```bash # Check daemon status hindsight-embed daemon status # View daemon logs in real-time hindsight-embed daemon logs -f # Stop daemon manually hindsight-embed daemon stop ``` ### Control Center Commands ```bash # Start or reuse the local browser control center hindsight-embed control start # Check whether it is running hindsight-embed control status # View control-center logs hindsight-embed control logs -f # Stop the control-center process hindsight-embed control stop ``` ## Commands All memory operations follow the same interface as the CLI: ### Retain 
(Store Memory) ```bash hindsight-embed memory retain "content" # With context hindsight-embed memory retain "content" --context "source information" # Background processing hindsight-embed memory retain "content" --async ``` ### Recall (Search) ```bash hindsight-embed memory recall "query" # With budget control hindsight-embed memory recall "query" --budget high # Show trace hindsight-embed memory recall "query" --trace ``` ### Reflect (Generate Response) ```bash hindsight-embed memory reflect "prompt" # With additional context hindsight-embed memory reflect "prompt" --context "additional info" ``` ### Bank Management ```bash # List all banks hindsight-embed bank list # View bank stats hindsight-embed bank stats # Set bank name hindsight-embed bank name "My Assistant" # Set bank mission hindsight-embed bank mission "I am a helpful AI assistant" ``` ## Troubleshooting ### Daemon Won't Start Check the daemon logs: ```bash hindsight-embed daemon logs # Or watch in real-time hindsight-embed daemon logs -f ``` Common issues: - **Missing API key**: Set `HINDSIGHT_API_LLM_API_KEY` - **Port conflict**: Another service using port 8888 - **Permissions**: Check `~/.hindsight/` directory permissions ### Daemon Exits Immediately Check if you have the idle timeout set too low: ```bash # Disable idle timeout for debugging export HINDSIGHT_EMBED_DAEMON_IDLE_TIMEOUT=0 hindsight-embed daemon status ``` ### Reset Configuration ```bash # Remove config file and reconfigure rm ~/.hindsight/embed hindsight-embed configure ``` ## When to Use **Perfect for:** - Development and prototyping - Single-user applications - Local-first tools - Quick experiments with Hindsight **Not suitable for:** - Production multi-user deployments - Network-accessible services - High-availability requirements - Multi-tenant applications For production deployments, use the [API Service](/developer/services) with external PostgreSQL instead. 
--- ## File: sdks/go.md # Go Client Official Go client for the Hindsight API, generated from the OpenAPI 3.1 spec using [OpenAPI Generator](https://github.com/OpenAPITools/openapi-generator). ## Installation ```bash go get github.com/vectorize-io/hindsight/hindsight-clients/go ``` Requires Go 1.23+. ## Quick Start ## API Structure The Go client provides access to all Hindsight API operations through structured namespaces: - **`client.MemoryAPI`** - Retain, recall, reflect operations - **`client.BanksAPI`** - Bank management - **`client.DirectivesAPI`** - Directive management - **`client.MentalModelsAPI`** - Mental model management - **`client.DocumentsAPI`** - Document operations - **`client.EntitiesAPI`** - Entity operations - **`client.OperationsAPI`** - Async operation monitoring ## Working with Nullable Fields The Go client uses `NullableString`, `NullableTime`, and similar types for optional fields: ## Error Handling ## More Examples For detailed examples of all operations, see: - [Python SDK documentation](./python.md) - API concepts are the same - [Node.js SDK documentation](./nodejs.md) - API concepts are the same - [OpenAPI specification](https://hindsight.dev/openapi.json) - Complete API reference --- ## File: sdks/hindsight-all-npm.md # Programmatic API (Node.js) The `@vectorize-io/hindsight-all` npm package is the Node.js equivalent of the Python [`hindsight-all`](./hindsight-all.md) package. It lets your Node code spawn and supervise a local Hindsight daemon without deploying any server infrastructure — pair it with [`@vectorize-io/hindsight-client`](./nodejs.md) for memory operations. The daemon runs as a **separate OS process** on `127.0.0.1` (not in your Node process). Your code talks to it over HTTP via `HindsightClient`. This package **does not ship an HTTP client** — it only owns the server process. Once the daemon is running, talk to it with [`@vectorize-io/hindsight-client`](./nodejs.md) against `server.getBaseUrl()`. 
The two packages compose: one owns the process, the other owns the API surface.

## How it works

1. `server.start()` resolves the underlying `hindsight-embed` command (via `uvx` from PyPI, or `uv run --directory ` for a local checkout).
2. Runs `profile create --merge --port [--env KEY=VALUE ...]` with every entry from `options.env` forwarded as `--env`.
3. Runs `daemon --profile start`.
4. Polls `http://host:port/health` until it returns `200` or the `readyTimeoutMs` budget is exhausted.
5. `server.stop()` runs `daemon --profile stop`.

The server is intentionally transparent: new daemon env vars or CLI flags never require a wrapper release. Pass them through `env`, `extraProfileCreateArgs`, or `extraDaemonStartArgs`.

## Requirements

- **Node.js ≥ 22**: uses global `fetch` and `AbortSignal.timeout`.
- **`uv` / `uvx`** on `PATH`: used to download and run the Hindsight daemon. Install via [docs.astral.sh/uv](https://docs.astral.sh/uv/).

## Install

```bash
npm install @vectorize-io/hindsight-all @vectorize-io/hindsight-client
```

## Example

```ts
import { HindsightServer, consoleLogger } from '@vectorize-io/hindsight-all';
import { HindsightClient } from '@vectorize-io/hindsight-client';

// Configure a local daemon under its own profile and port.
const server = new HindsightServer({
  profile: 'my-app',
  port: 9077,
  env: {
    HINDSIGHT_API_LLM_PROVIDER: 'anthropic',
    HINDSIGHT_API_LLM_API_KEY: process.env.ANTHROPIC_API_KEY,
    HINDSIGHT_API_LLM_MODEL: 'claude-sonnet-4-20250514',
    HINDSIGHT_EMBED_DAEMON_IDLE_TIMEOUT: '0',
  },
  logger: consoleLogger,
});

// Spawn the daemon and wait for /health.
await server.start();

// Memory operations go through the separate client package.
const client = new HindsightClient({ baseUrl: server.getBaseUrl() });
await client.retain('user-123', 'User prefers dark mode.');
const recall = await client.recall('user-123', 'what are the user preferences?');

await server.stop();
```

For a remote Hindsight API, skip the server entirely and point `HindsightClient` directly at the remote URL.

## `HindsightServerOptions`

| Option | Type | Default | Description |
|---|---|---|---|
| `profile` | `string` | `"default"` | Profile name passed to `--profile` on every sub-command. |
| `port` | `number` | `8888` | TCP port the daemon listens on. |
| `host` | `string` | `"127.0.0.1"` | Hostname the daemon binds to (used for health checks). |
| `embedVersion` | `string` | `"latest"` | Version of the underlying `hindsight-embed` package to run via `uvx`. |
| `embedPackagePath` | `string` | - | Local checkout path. Takes precedence over `embedVersion`. Uses `uv run --directory` instead of `uvx`. |
| `env` | `Record` | `{}` | Environment variables passed to the daemon process **and** written into the profile config via `--env KEY=VALUE`. The preferred way to surface any `HINDSIGHT_API_*` / `HINDSIGHT_EMBED_*` setting. |
| `extraProfileCreateArgs` | `string[]` | `[]` | Extra args appended verbatim to `profile create`. |
| `extraDaemonStartArgs` | `string[]` | `[]` | Extra args appended verbatim to `daemon start`. |
| `platformCpuWorkaround` | `boolean` | `true` on macOS | Auto-set `HINDSIGHT_API_EMBEDDINGS_LOCAL_FORCE_CPU=1` and `HINDSIGHT_API_RERANKER_LOCAL_FORCE_CPU=1` to avoid Metal/MPS crashes. Caller-supplied `env` values win over the auto-applied ones. |
| `readyTimeoutMs` | `number` | `30000` | Max time to wait for `/health` to return 200. |
| `readyPollIntervalMs` | `number` | `1000` | Polling interval while waiting for `/health`. |
| `logger` | `Logger` | silent | Pluggable logger (`debug`/`info`/`warn`/`error`). `consoleLogger` and `silentLogger` helpers are exported. |

## Server methods

| Method | Returns | Description |
|---|---|---|
| `start()` | `Promise` | Configure profile, spawn the daemon, wait for `/health`. Idempotent, so it is safe to re-run. |
| `stop()` | `Promise` | Stop the daemon. Never throws; logs and resolves even on failure. |
| `checkHealth()` | `Promise` | One-shot `/health` probe with a 2 s timeout. |
| `getBaseUrl()` | `string` | `http://host:port`. Pass this straight to `HindsightClient`. |
| `getProfile()` | `string` | The profile name this server operates on. |

For memory operations (retain, recall, reflect, bank management) use [`@vectorize-io/hindsight-client`](./nodejs.md).
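
The readiness wait in step 4 of "How it works" is a plain poll-until-deadline loop. The sketch below restates that logic in Python purely for illustration. It is not the package's source, and `wait_until_ready` with its injectable `probe`, `sleep`, and `clock` callables are hypothetical names. The defaults mirror `readyTimeoutMs` and `readyPollIntervalMs` from the options table.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    ready_timeout_ms: int = 30_000,
    ready_poll_interval_ms: int = 1_000,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call `probe` until it returns True or the timeout budget runs out.

    `probe` stands in for a one-shot health check (for example, an HTTP
    GET on /health that reports whether the status was 200).
    """
    deadline = clock() + ready_timeout_ms / 1000
    while True:
        if probe():
            return True  # daemon answered; stop polling
        if clock() >= deadline:
            return False  # budget exhausted without a healthy response
        sleep(ready_poll_interval_ms / 1000)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays. A caller would typically treat a `False` result as a startup failure.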
--- ## File: sdks/hindsight-all.md # Programmatic API (Python) The `hindsight-all` Python package lets your code spawn and manage a local Hindsight daemon without deploying any server infrastructure. It bundles the Hindsight API server, embedded PostgreSQL, and the Python client into one install — `pip install hindsight-all` and you can start a fully-functional Hindsight instance from a few lines of Python. The daemon runs as a **separate OS process** on `127.0.0.1` (not in your Python process memory). Your code talks to it over HTTP via the bundled `HindsightClient`. If you already have a Hindsight server running and just need a client, use [Python Client (hindsight-client)](./python.md) instead. ## How it works `hindsight-all` exposes two primary APIs: - **`HindsightServer`** — explicit lifecycle. Use it as a context manager when you want deterministic startup/shutdown (e.g. in tests). - **`HindsightEmbedded`** — auto-managed. Starts a daemon on first use, reuses it across calls, shuts it down after an idle timeout. Easiest for application code that doesn't want to think about lifecycle. Both end up talking to the same underlying daemon via the same `HindsightClient` HTTP interface — the difference is only how the server process is managed. ## Installation ```bash pip install hindsight-all ``` The `hindsight-all` wheel bundles `hindsight-api-slim`, `hindsight-client`, and `hindsight-embed` as dependencies, so one `pip install` gets you everything. On Intel (x86_64) Macs, install `hindsight-all-slim` instead — the full bundle's local ML models have no Intel-Mac wheels. See [Supported Platforms](../developer/installation#supported-platforms). ## `HindsightServer` — explicit lifecycle Use `HindsightServer` as a context manager when you want the server to start immediately, run for the duration of a block, and shut down cleanly afterwards. Ideal for tests and short-lived scripts. 
```python
import os

from hindsight import HindsightServer, HindsightClient

# The server starts on entering the block and stops on leaving it.
with HindsightServer(
    llm_provider="openai",
    llm_model="gpt-4o-mini",
    llm_api_key=os.environ["OPENAI_API_KEY"],
) as server:
    client = HindsightClient(base_url=server.url)

    client.retain(bank_id="my-bank", content="Alice works at Google")

    results = client.recall(bank_id="my-bank", query="What does Alice do?")
    for r in results:
        print(r.text)

    answer = client.reflect(bank_id="my-bank", query="Tell me about Alice")
    print(answer.text)

# Server is stopped here
```

## `HindsightEmbedded` — auto-managed

`HindsightEmbedded` is the simplest way to use Hindsight in Python. It manages a background daemon for you: the daemon starts on first use, stays alive across calls, and shuts down after an idle timeout.

```python
import os

from hindsight import HindsightEmbedded

# Server starts automatically on first call
client = HindsightEmbedded(
    profile="myapp",  # Profile for data isolation
    llm_provider="openai",
    llm_model="gpt-4o-mini",
    llm_api_key=os.environ["OPENAI_API_KEY"],
)

# Use immediately - no manual server management needed
client.retain(bank_id="my-bank", content="Alice works at Google")
results = client.recall(bank_id="my-bank", query="What does Alice do?")

# Server continues running (auto-stops after idle timeout)
# Or explicitly stop it:
client.close(stop_daemon=True)
```

### What's a Profile?

A profile is an isolated Hindsight environment. Each profile gets its own embedded PostgreSQL database (stored in `~/.pg0/instances/hindsight-embed-{profile}/`) and its own API server. Use different profiles to separate environments (dev/prod), applications, or users.
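
Because each profile maps to its own data directory, the location for a given profile can be derived from the path template above. A minimal sketch, assuming the template `~/.pg0/instances/hindsight-embed-{profile}/` holds exactly as written; `profile_data_dir` is a hypothetical helper and not part of `hindsight-all`.

```python
from pathlib import Path
from typing import Optional


def profile_data_dir(profile: str, home: Optional[Path] = None) -> Path:
    """Build the data directory path for a profile.

    Assumption: the layout follows the documented template
    ~/.pg0/instances/hindsight-embed-{profile}/ with no other variation.
    """
    base = home if home is not None else Path.home()
    return base / ".pg0" / "instances" / f"hindsight-embed-{profile}"
```

This can help when backing up or clearing a single profile's data, since distinct profile names resolve to distinct directories.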
### When to use which

| Use case | Pick |
|---|---|
| Tests, short-lived scripts, deterministic startup/shutdown | `HindsightServer` (context manager) |
| Long-running application, auto-start on first use, don't want to manage lifecycle | `HindsightEmbedded` |
| Existing Hindsight server running elsewhere | [`hindsight-client`](./python.md) directly |

## API namespaces

Both `HindsightEmbedded` and `HindsightClient` expose organized API namespaces for bank management, mental models, directives, and memories:

```python
import os

from hindsight import HindsightEmbedded

embedded = HindsightEmbedded(
    profile="myapp",
    llm_provider="openai",
    llm_api_key=os.environ["OPENAI_API_KEY"],
)

# Core operations
embedded.retain(bank_id="test", content="Hello")
results = embedded.recall(bank_id="test", query="Hello")

# Bank management
embedded.banks.create(bank_id="test", name="Test Bank", mission="Help users")
embedded.banks.set_mission(bank_id="test", mission="Updated mission")
embedded.banks.delete(bank_id="test")

# Mental models
embedded.mental_models.create(
    bank_id="test",
    name="User Preferences",
    content="User prefers dark mode"
)
models = embedded.mental_models.list(bank_id="test")

# Directives
embedded.directives.create(
    bank_id="test",
    name="Response Style",
    content="Be concise and friendly"
)
directives = embedded.directives.list(bank_id="test")

# List memories
memories = embedded.memories.list(bank_id="test", type="world", limit=50)
```

API namespaces ensure the daemon is running before each call, so daemon crashes are handled gracefully:

```python
# ✅ GOOD - Uses API namespace (daemon restarts handled)
embedded.banks.create(bank_id="test", name="Test")

# ❌ BAD - Direct client access (daemon crashes NOT handled)
client = embedded.client
client.create_bank(bank_id="test", name="Test")  # Fails if daemon crashed
```

For the full reference of retain/recall/reflect methods and their options (which work the same regardless of how you obtain the client), see the [Python Client page](./python.md).

---