
Recall: How Hindsight Retrieves Memories

When you call recall(), Hindsight uses multiple search strategies in parallel to find the most relevant memories, regardless of how you phrase your query.


The Challenge of Memory Recall

Different queries need different search approaches:

  • "Alice works at Google" β†’ needs exact name matching
  • "Where does Alice work?" β†’ needs semantic understanding
  • "What did Alice do last spring?" β†’ needs temporal reasoning
  • "Why did Alice leave?" β†’ needs causal relationship tracing

No single search method handles all these well. Hindsight solves this with TEMPR: four complementary strategies that run in parallel.


Four Search Strategies

Semantic Search

What it does: Understands the meaning behind words, not just the words themselves.

Best for:

  • Conceptual matches: "Alice's job" → "Alice works as a software engineer"
  • Paraphrasing: "Bob's expertise" → "Bob specializes in machine learning"
  • Synonyms: "meeting" matches "conference", "discussion", "gathering"

Why it matters: You can ask questions naturally without matching exact keywords.


Keyword Search

What it does: Finds exact terms and names, even when their spelling is unusual.

Best for:

  • Proper nouns: "Google", "Alice Chen", "MIT"
  • Technical terms: "PostgreSQL", "HNSW", "TensorFlow"
  • Unique identifiers: URLs, product names, specific phrases

Why it matters: Ensures you never miss results that mention specific names or terms, even if they're semantically distant from your query.


Graph Traversal

What it does: Follows connections between entities to find indirectly related information.

Best for:

  • Indirect relationships: "What does Alice do?" → Alice → Google → Google's products
  • Entity exploration: "Bob's colleagues" → Bob → co-workers → shared projects
  • Multi-hop reasoning: "Alice's team's achievements"

Why it matters: Retrieves facts that aren't semantically or lexically similar but are structurally connected through the knowledge graph.

Example: Even if Alice and her manager are never mentioned together, graph traversal can find the manager through shared projects or team relationships.


Temporal Search

What it does: Understands time expressions and filters by when events occurred.

Best for:

  • Historical queries: "What did Alice do in 2023?"
  • Time ranges: "What happened last spring?"
  • Relative time: "What did Bob work on last year?"
  • Before/after: "What happened before Alice joined Google?"

How it works: Combines semantic understanding with time filtering to find events within specific periods.

Why it matters: Enables precise historical queries without losing old information.


Result Fusion

After the four strategies run, results are fused together:

  • Memories appearing in multiple strategies rank higher (consensus)
  • Rank matters more than score (robust across different scoring systems)
  • Final results are re-ranked using a neural model that considers query-memory interaction

Why fusion matters: A fact that's both semantically similar AND mentions the right entity will rank higher than one that's only semantically similar.


Why Multiple Strategies?

Consider the query: "What did Alice say about Python last spring?"

  • Semantic finds facts about Alice's views on programming
  • Keyword ensures "Python" is actually mentioned
  • Graph connects Alice → programming languages → related entities
  • Temporal filters to "last spring" timeframe

The fusion of all four gives you exactly what you're looking for, even though no single strategy would suffice.


Token Budget Management

Hindsight is built for AI agents, not humans. Traditional search systems return "top-k" results, but agents don't think in terms of result counts; they think in tokens. An agent's context window is measured in tokens, and that's exactly how Hindsight measures results.

How it works:

  • Top-ranked memories are selected first
  • Selection stops when the token budget is exhausted
  • You specify the context budget; Hindsight fills it with the most relevant memories

Parameters you control:

  • max_tokens: How much memory content to return (default: 4096 tokens)
  • budget: Search depth level (low, mid, high)
  • types: Filter by world, experience, observation, or all
  • tags: Filter memories by visibility tags
  • tags_match: How to match tags (see Recall API for all options)
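
For concreteness, here is a minimal sketch of a recall() call using these parameters. It assumes a Python client; the import path, client class, and result field names are illustrative, so check the Recall API reference for the real signatures.

```python
# Hypothetical usage sketch: the client class, import path, and result fields
# are placeholders; the parameter names come from this page.
from hindsight import HindsightClient  # placeholder import path

client = HindsightClient()

results = client.recall(
    "What did Alice say about Python last spring?",
    max_tokens=4096,                  # token budget for returned memory content
    budget="mid",                     # search depth: "low", "mid", or "high"
    types=["world", "observation"],   # filter by memory type
    tags=["team-alpha"],              # visibility tag filter
    tags_match="any",                 # how tags are matched (see Recall API)
)

for memory in results.memories:       # field name is an assumption
    print(memory.text)
```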

Expanding Context: Chunks

Memories are distilled facts: concise, but sometimes missing nuance. When your agent needs deeper context, you can optionally retrieve the source material:

Chunks return the raw text that generated each memory, useful when the distilled fact loses important nuance:

Memory: "Alice prefers Python over JavaScript"
Chunk: "Alice mentioned she prefers Python over JavaScript, mainly because
of its data science ecosystem, though she admits JS is better for
frontend work and she's been learning TypeScript lately."

Use include_chunks=True with max_chunk_tokens to control the token budget for chunks. This is useful when generating responses that need verbatim quotes or when context matters (e.g., "What exactly did Alice say about the project?").
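
A hedged sketch of chunk expansion follows; include_chunks and max_chunk_tokens come from this page, while the client and result field names are the same illustrative placeholders used above.

```python
# Hypothetical sketch: pull the raw source chunks alongside the distilled
# memories when verbatim wording matters. Field names are assumptions.
from hindsight import HindsightClient  # placeholder import path

client = HindsightClient()
results = client.recall(
    "What exactly did Alice say about the project?",
    include_chunks=True,       # also return the raw text behind each memory
    max_chunk_tokens=2048,     # separate token budget for chunk content
)

for memory in results.memories:
    print("fact:", memory.text)
    for chunk in memory.chunks:
        print("  source:", chunk.text)   # verbatim source passage
```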


Tuning Recall: Quality vs Latency

Different use cases require different trade-offs between recall quality and response speed. Two parameters control this:

Budget: Search Depth

Controls how thoroughly Hindsight explores the memory bank, affecting graph traversal depth, candidate pool size, and cross-encoder re-ranking:

| Budget | Best For | Trade-off |
|---|---|---|
| low | Quick lookups, simple queries | Fast, may miss indirect connections |
| mid | Most queries, balanced | Good coverage, reasonable speed |
| high | Complex queries requiring deep exploration | Thorough, slower |

Example: "What did Alice's manager's team work on?" benefits from high budget to traverse multiple hops (Alice β†’ manager β†’ team β†’ projects) and evaluate more candidates.

Max Tokens: Context Window Size

Controls how much memory content to return:

| Max Tokens | ~Pages of Text | Best For | Trade-off |
|---|---|---|---|
| 2048 | ~2 pages | Focused answers, fast LLM | Fewer memories, faster |
| 4096 (default) | ~4 pages | Balanced context | Good coverage, standard |
| 8192 | ~8 pages | Comprehensive context | More memories, slower LLM |

Example: "Summarize everything about Alice" benefits from higher max_tokens to include more facts.

Two Independent Dimensions

Budget and max_tokens control different aspects of recall:

| Parameter | What it controls | Latency impact | Example |
|---|---|---|---|
| Budget | How thoroughly to explore memories | Search time | High budget finds Alice → manager → team → projects |
| Max Tokens | How much context to return | LLM processing time | High max_tokens returns more memories to the agent |

They're independent. Common combinations:

| Budget | Max Tokens | Use Case |
|---|---|---|
| high | low | Deep search, return only the best results |
| low | high | Quick search, return everything found |
| high | high | Comprehensive research queries |
| low | low | Fast chatbot responses |

Recommended starting points by use case:

| Use Case | Budget | Max Tokens | Why |
|---|---|---|---|
| Chatbot replies | low | 2048 | Fast responses, focused context |
| Document Q&A | mid | 4096 | Balanced coverage and speed |
| Research queries | high | 8192 | Comprehensive, multi-hop reasoning |
| Real-time search | low | 2048 | Minimize latency |

Scoring & Ranking Deep Dive​

This section explains exactly how Hindsight turns raw retrieval results into a final ranked list. The pipeline has four stages: RRF fusion, cross-encoder reranking, combined scoring, and token truncation.

Stage 1: Reciprocal Rank Fusion (RRF)

After all strategies run in parallel, their results are merged using Reciprocal Rank Fusion. RRF combines ranked lists by rewarding items that appear highly ranked across multiple strategies, without relying on raw scores (which aren't comparable across different retrieval methods).

Formula:

score(d) = Σ_i 1 / (k + rank_i(d))

Where:

  • k = 60 (smoothing constant that prevents top-ranked items from dominating)
  • rank_i(d) = position of document d in strategy i (1-indexed)
  • The sum runs over all strategies where d appears

All four strategies are weighted equally. There are no per-strategy weight multipliers; importance comes from rank position, not the source.

Why RRF over raw score merging? Each retrieval strategy produces scores on a different scale (cosine similarity, BM25 tf-idf, graph activation). These scores aren't comparable: a BM25 score of 12.5 and a cosine similarity of 0.85 don't mean the same thing. RRF sidesteps this by using only rank positions, making it robust across any scoring system without requiring calibration.

Example: A memory ranked #1 in semantic and #5 in BM25:

RRF score = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318

A memory ranked #1 in semantic only:

RRF score = 1/(60+1) = 0.0164

The first memory ranks higher because it has consensus across strategies.
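
The fusion step itself is only a few lines. Here is a minimal sketch that reproduces the worked example above; the strategy names and memory IDs are placeholders.

```python
from collections import defaultdict

K = 60  # RRF smoothing constant

def rrf_fuse(ranked_lists: dict[str, list[str]]) -> list[tuple[str, float]]:
    """Fuse per-strategy rankings into one list with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists.values():
        for rank, memory_id in enumerate(ranking, start=1):  # ranks are 1-indexed
            scores[memory_id] += 1.0 / (K + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Reproduces the worked example: consensus across strategies wins.
fused = rrf_fuse({
    "semantic": ["m1"],                      # m1 ranked #1 in semantic
    "bm25": ["m3", "m4", "m5", "m6", "m1"],  # m1 ranked #5 in BM25
})
# m1 scores 1/61 + 1/65 ≈ 0.0318
```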


Stage 2: Cross-Encoder Reranking

RRF gives a good initial ranking, but it's based on positions, not on deep query-document understanding. The cross-encoder evaluates each candidate against the query as a pair, producing a relevance score.

Pre-filtering: Before reranking, candidates are trimmed to the top 300 (by RRF score) to limit computational cost. This is configurable via HINDSIGHT_API_RERANKER_MAX_CANDIDATES.

Why rerank after RRF? RRF is position-based: it knows a memory ranked well across strategies, but it never actually reads the query and the memory together. The cross-encoder does: it takes the query and each candidate as a pair and produces a relevance score based on their full interaction. This catches nuances that position-based fusion misses, like a memory that ranked #1 in keyword search because it matched a common term but is actually irrelevant to the query's intent.

Score normalization: Cross-encoders output raw logits (which can be negative). These are normalized to [0, 1] using the sigmoid function:

CE_normalized = 1 / (1 + e^(-raw_logit))

Batch processing: Candidates are scored in batches (32 pairs for the local reranker, 128 pairs for TEI).

No cross-encoder?

When running without a cross-encoder (e.g., slim image with no external reranker), the system falls back to RRF-derived scores: candidates are assigned synthetic scores spread across [0.1, 1.0] based on their RRF rank, so the combined scoring boosts below still work meaningfully.
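
As a rough illustration, the normalization (and one plausible shape for the fallback) can be sketched as follows. The sigmoid matches the formula above; the linear spread in the fallback is an assumption, not the exact Hindsight implementation.

```python
import math

def normalize_logit(raw_logit: float) -> float:
    """Map a raw cross-encoder logit to [0, 1] with the sigmoid function."""
    return 1.0 / (1.0 + math.exp(-raw_logit))

def fallback_scores(num_candidates: int) -> list[float]:
    """Assumed fallback shape: synthetic scores spread linearly across
    [0.1, 1.0] by RRF rank (best rank gets 1.0, last gets 0.1).
    The exact spacing Hindsight uses may differ."""
    if num_candidates <= 1:
        return [1.0] * num_candidates
    step = 0.9 / (num_candidates - 1)
    return [1.0 - i * step for i in range(num_candidates)]
```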


Stage 3: Combined Scoring (Boosts)

The normalized cross-encoder score is adjusted by three multiplicative boosts that incorporate signals the cross-encoder can't see: recency, temporal proximity, and evidence strength.

Why multiplicative instead of additive? Additive boosts (e.g., CE + 0.1 × recency) would give the same absolute bonus to every candidate regardless of relevance. A barely-relevant memory could leapfrog a highly-relevant one just by being recent. Multiplicative boosts keep adjustments proportional to the base relevance score: a +10% nudge on a high-relevance memory is a bigger absolute change than +10% on a low-relevance one. This ensures secondary signals never overpower the primary relevance judgment.

Formula:

final_score = CE_normalized × recency_boost × temporal_boost × proof_count_boost

Each boost is centered at 1.0 (neutral) and controlled by an alpha that caps how much it can swing:

boost = 1 + α × (signal - 0.5)

| Boost | α | Max adjustment | What it rewards |
|---|---|---|---|
| Recency | 0.2 | ±10% | Recent memories over older ones |
| Temporal proximity | 0.2 | ±10% | Memories close to a queried time window |
| Proof count | 0.1 | ±5% | Observations backed by more evidence |

Recency signal

Linear decay over 365 days from the memory's occurrence date:

recency = clamp(1.0 - days_ago / 365, 0.1, 1.0)

A memory from today has recency 1.0 (+10% boost). A memory from 6 months ago has recency ~0.5 (neutral). A memory older than a year has recency 0.1 (-8% penalty). Memories without dates get 0.5 (neutral β€” no boost or penalty).

Temporal proximity signal

Only active when the query contains a time reference (e.g., "last spring", "in 2023"). Measures how close a memory's date is to the center of the queried time window:

temporal_proximity = 1.0 - min(days_from_center / (window_days / 2), 1.0)

A memory at the center of the window gets 1.0 (+10% boost). A memory at the edge gets 0.0 (-10% penalty). For non-temporal queries, all memories get 0.5 (neutral).

Proof count signal

For observation-type memories, rewards those backed by more evidence using a logarithmic curve:

proof_norm = clamp(0.5 + ln(proof_count) / 10, 0.0, 1.0)
| Proof count | proof_norm | Boost |
|---|---|---|
| 1 | 0.5 | Neutral |
| 3 | 0.61 | +1.1% |
| 10 | 0.73 | +2.3% |
| 150+ | 1.0 | +5% (max) |
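
Putting the pieces together, here is a sketch of the combined scoring step using the formulas and alphas above. The helper names and the handling of missing signals are illustrative.

```python
import math

def boost(signal: float, alpha: float) -> float:
    """Multiplicative boost centered at 1.0 for a signal in [0, 1]."""
    return 1.0 + alpha * (signal - 0.5)

def recency_signal(days_ago: float | None) -> float:
    if days_ago is None:
        return 0.5  # undated memories are neutral
    return max(0.1, min(1.0, 1.0 - days_ago / 365.0))

def temporal_signal(days_from_center: float | None, window_days: float | None) -> float:
    if days_from_center is None or not window_days:
        return 0.5  # non-temporal queries are neutral
    return 1.0 - min(days_from_center / (window_days / 2.0), 1.0)

def proof_signal(proof_count: int) -> float:
    if proof_count < 1:
        return 0.5  # treat a missing evidence count as neutral (assumption)
    return max(0.0, min(1.0, 0.5 + math.log(proof_count) / 10.0))

def final_score(ce_normalized: float,
                days_ago: float | None = None,
                days_from_center: float | None = None,
                window_days: float | None = None,
                proof_count: int = 1) -> float:
    return (ce_normalized
            * boost(recency_signal(days_ago), 0.2)                          # ±10%
            * boost(temporal_signal(days_from_center, window_days), 0.2)    # ±10%
            * boost(proof_signal(proof_count), 0.1))                        # ±5%

# A memory from today, at the center of the queried window, with 10 proofs:
# ce × 1.10 × 1.10 × ~1.023
```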

Maximum combined range

With all boosts at their extremes:

  • Best case: ×1.10 × 1.10 × 1.05 ≈ +27%
  • Worst case: ×0.90 × 0.90 × 0.95 ≈ -23%

The boosts are intentionally conservative: they nudge the ranking without overriding cross-encoder relevance.


Stage 4: Token Truncation

After scoring, results are sorted by final_score and selected top-down until the max_tokens budget is exhausted. Only the memory text counts toward the budget; metadata is free.
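
A minimal sketch of the truncation loop; the token counter here is a placeholder heuristic, since the real system presumably counts tokens with its own tokenizer.

```python
def truncate_to_budget(ranked: list[tuple[str, float]], max_tokens: int) -> list[str]:
    """Select memories in descending final_score order until the budget is spent."""
    def count_tokens(text: str) -> int:
        return max(1, len(text) // 4)  # rough stand-in: ~4 characters per token

    selected: list[str] = []
    used = 0
    for text, _score in sorted(ranked, key=lambda item: item[1], reverse=True):
        cost = count_tokens(text)      # only memory text counts; metadata is free
        if used + cost > max_tokens:
            break
        selected.append(text)
        used += cost
    return selected
```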


How Budget Maps to Pipeline Parameters

The budget parameter (low/mid/high) controls search depth: how many candidates each strategy considers. Each level maps to a recall budget number that flows through every pipeline stage:

| Budget | Recall budget (fixed mode) | Env var override |
|---|---|---|
| low | 100 | HINDSIGHT_API_RECALL_BUDGET_FIXED_LOW |
| mid | 300 (default) | HINDSIGHT_API_RECALL_BUDGET_FIXED_MID |
| high | 1000 | HINDSIGHT_API_RECALL_BUDGET_FIXED_HIGH |

This recall budget flows through the pipeline as follows:

| Pipeline stage | How the recall budget is used |
|---|---|
| Semantic search | Over-fetches max(recall_budget × 5, 100) from HNSW, trims to recall_budget |
| BM25 search | LIMIT recall_budget in SQL |
| Graph traversal | Explores up to recall_budget nodes |
| Temporal spreading | Activates up to recall_budget nodes via links |
| Result consideration | Top recall_budget × 2 results considered for token filtering |

The reranking pre-filter (300 candidates) is independent of budget; it's a separate knob (HINDSIGHT_API_RERANKER_MAX_CANDIDATES).

Adaptive budgeting

An alternative budget mode scales the recall budget with max_tokens instead of using fixed values:

recall_budget = clamp(max_tokens × ratio, min, max)

| Budget | Ratio | Env var override |
|---|---|---|
| low | 2.5% of max_tokens | HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_LOW |
| mid | 7.5% of max_tokens | HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_MID |
| high | 25% of max_tokens | HINDSIGHT_API_RECALL_BUDGET_ADAPTIVE_HIGH |

The result is clamped to a floor of 20 (HINDSIGHT_API_RECALL_BUDGET_MIN) and a ceiling of 2000 (HINDSIGHT_API_RECALL_BUDGET_MAX).

Enable with HINDSIGHT_API_RECALL_BUDGET_FUNCTION=adaptive.
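
A small sketch of the adaptive computation, using the ratios and default clamp bounds listed above:

```python
ADAPTIVE_RATIOS = {"low": 0.025, "mid": 0.075, "high": 0.25}  # fraction of max_tokens
BUDGET_MIN, BUDGET_MAX = 20, 2000                             # default floor and ceiling

def adaptive_recall_budget(budget: str, max_tokens: int) -> int:
    """Scale the recall budget with max_tokens instead of using fixed values."""
    raw = max_tokens * ADAPTIVE_RATIOS[budget]
    return int(max(BUDGET_MIN, min(BUDGET_MAX, raw)))

# With the default max_tokens of 4096:
#   low → 102, mid → 307, high → 1024
```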


Graph Scoring Detail

The graph traversal (link expansion) combines three independent signals additively for each candidate:

| Signal | Score formula | Range |
|---|---|---|
| Entity overlap | tanh(shared_entity_count × 0.5) | [0, ~1.0] |
| Semantic link | Precomputed kNN link weight | [0.7, 1.0] |
| Causal link | Causal link weight | [0, 1.0] |

graph_score = entity_score + semantic_score + causal_score   ∈ [0, 3]

The additive combination rewards convergent evidence: a memory connected to the query through multiple signal types ranks higher than one connected through a single strong signal.

Why tanh for entity scores? Raw shared-entity count is unbounded: a high-fanout entity like "user" could produce counts of 50+, drowning out the other two signals. tanh(count × 0.5) saturates naturally: the first few shared entities matter a lot (1 → 0.46, 2 → 0.76, 3 → 0.91), but additional ones contribute diminishing returns, keeping the entity signal in [0, 1] alongside semantic and causal scores.

Why additive instead of multiplicative here? Unlike the combined scoring boosts, graph signals are independent evidence channels, not adjustments to a base score. A memory might be connected only through causal links (no shared entities, no semantic similarity); multiplicative combination would zero it out. Additive scoring lets each signal contribute independently, and the outer RRF fusion handles ranking across strategies.

Entity signal example: A memory sharing 1 entity with the query scores tanh(0.5) ≈ 0.46. Two shared entities score tanh(1.0) ≈ 0.76. Three or more saturate near 0.91+.
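
A compact sketch of the additive graph score described above; the link weights are assumed to be precomputed and passed in.

```python
import math

def graph_score(shared_entity_count: int,
                semantic_link_weight: float = 0.0,  # 0 if no kNN link, otherwise in [0.7, 1.0]
                causal_link_weight: float = 0.0) -> float:
    """Add the three independent graph signals; the result lands in [0, 3]."""
    entity_score = math.tanh(shared_entity_count * 0.5)  # saturates: 1 → 0.46, 2 → 0.76, 3 → 0.91
    return entity_score + semantic_link_weight + causal_link_weight

# A memory connected only through a causal link still gets a non-zero score:
print(graph_score(0, semantic_link_weight=0.0, causal_link_weight=0.9))  # 0.9
```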


Next Steps

  • Retain: How memories are stored with rich context
  • Reflect: How disposition influences reasoning
  • Recall API: Code examples, parameters, and tag filtering