Reduce Hindsight Consolidation Memory Fan-Out Safely

If you need to reduce Hindsight consolidation memory fan-out, the recent defaults are a real improvement. Consolidation could previously amplify memory use during internal recall, especially on large banks, where source fact hydration and reranker behavior could keep RSS higher than expected. The new defaults bound that path much more tightly. Keep the configuration guide, the observations guide, the installation guide, and the docs home nearby while you tune it.
The quick answer
- Consolidation recall now defaults to a low budget, which reduces how many candidate rows each recall arm tries to pull in.
- Source facts inside consolidation are now capped by default, instead of staying unlimited.
- FlashRank now defaults to a bounded CPU memory arena setting, which helps prevent monotonically growing RSS after consolidation work completes.
Why consolidation used to fan out
Consolidation is not just a simple merge pass. It can trigger internal recall work so the system can find related observations, hydrate source facts, and decide what to update or combine. On large banks, that can get expensive fast if the recall budget is wide, source facts are effectively unlimited, and the reranker runtime keeps memory arenas hot.
That is the pattern this update is addressing. It does not remove consolidation. It narrows the expensive parts so large banks stay more predictable.
Use the new bounded defaults first
The new baseline is intentionally conservative:
export HINDSIGHT_API_CONSOLIDATION_RECALL_BUDGET=low
export HINDSIGHT_API_CONSOLIDATION_SOURCE_FACTS_MAX_TOKENS=4096
export HINDSIGHT_API_RERANKER_FLASHRANK_CPU_MEM_ARENA=false
Those defaults reduce candidate fan-out, cap how much source evidence gets pulled into the prompt, and stop ONNX Runtime from holding onto an ever-growing CPU arena after consolidation batches finish.
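To confirm that a host has not drifted from these values, a small check like the one below can help. It is a minimal sketch: the variable names and values come from the exports above, while the helper name check_bounded_defaults is hypothetical and not part of Hindsight. It assumes a release that ships the new defaults, so an unset variable is reported as falling back to the built-in default.

```shell
#!/bin/sh
# Sketch: compare the current environment against the bounded defaults.
# check_bounded_defaults is a hypothetical helper, not a Hindsight command.

check_bounded_defaults() {
    status=0
    for pair in \
        "HINDSIGHT_API_CONSOLIDATION_RECALL_BUDGET=low" \
        "HINDSIGHT_API_CONSOLIDATION_SOURCE_FACTS_MAX_TOKENS=4096" \
        "HINDSIGHT_API_RERANKER_FLASHRANK_CPU_MEM_ARENA=false"
    do
        name=${pair%%=*}       # text before the first "="
        expected=${pair#*=}    # text after the first "="
        # Indirect lookup of the variable named in $name; empty if unset.
        eval "actual=\${$name-}"
        if [ -z "$actual" ]; then
            echo "$name is unset (the built-in default applies)"
        elif [ "$actual" = "$expected" ]; then
            echo "$name=$actual (matches the bounded default)"
        else
            echo "$name=$actual (overrides the bounded default $expected)"
            status=1
        fi
    done
    # Non-zero only when at least one variable overrides a bounded default.
    return "$status"
}
```

Calling check_bounded_defaults before starting the API or worker prints one line per variable and returns non-zero if any of them has been widened, which makes it easy to wire into a startup script.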
Tune up only when you have a reason
If you move beyond the defaults, do it on purpose.
- Raise HINDSIGHT_API_CONSOLIDATION_RECALL_BUDGET only when low recall is clearly missing useful related observations.
- Raise HINDSIGHT_API_CONSOLIDATION_SOURCE_FACTS_MAX_TOKENS only when the LLM needs more supporting evidence to make stable updates.
- Review HINDSIGHT_API_CONSOLIDATION_MAX_MEMORIES_PER_ROUND and HINDSIGHT_API_CONSOLIDATION_LLM_BATCH_SIZE in the configuration guide if you want to trade throughput against peak memory pressure.
The point is to keep the expensive path narrow by default, then widen one lever at a time if the bank actually needs it.
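As an illustration of widening a single lever, the fragment below raises only the source facts cap and leaves the other two settings at their bounded defaults. The value 8192 is an arbitrary example chosen for this sketch, not a recommendation from the Hindsight docs; check the configuration guide for the values your version accepts.

```shell
# Sketch: change exactly one setting per experiment.
# 8192 is an illustrative value only (assumption, not a documented recommendation).
export HINDSIGHT_API_CONSOLIDATION_RECALL_BUDGET=low                 # unchanged default
export HINDSIGHT_API_CONSOLIDATION_SOURCE_FACTS_MAX_TOKENS=8192      # the single lever under test
export HINDSIGHT_API_RERANKER_FLASHRANK_CPU_MEM_ARENA=false          # unchanged default
```

Keeping the untouched defaults explicit in the same file makes it obvious afterwards which one setting differed during the run.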
Check the rest of the deployment shape too
Consolidation tuning only solves consolidation. If RSS still looks bad, compare it against the wider deployment:
- are you running the full image instead of slim?
- is the worker colocated with the API on the same small host?
- is PostgreSQL sharing the same memory envelope?
- are you using local reranking when an external reranker would fit better?
That is why the installation guide and the services guide still matter here. Consolidation fan-out is one contributor, not the entire footprint story.
A simple operating playbook
A sane production playbook looks like this:
- Start with the new defaults.
- Watch RSS during large consolidation rounds.
- Tune only one knob at a time.
- Re-test on the same bank shape.
- Keep notes on which change actually moved the needle.
That discipline matters because memory problems often feel mysterious when several variables change at once.
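For the "watch RSS" step, one low-tech option is to sample the process with ps while a consolidation round runs. The sketch below assumes a ps that supports `-o rss=` and `-p` (procps on Linux and the BSD/macOS ps do; minimal BusyBox builds may not). The function names rss_kib and sample_rss are hypothetical, and how you find the worker's PID depends on your deployment.

```shell
#!/bin/sh
# Sketch: log the resident set size (RSS) of one process at a fixed interval.
# rss_kib and sample_rss are hypothetical helper names for this example.

rss_kib() {
    # Print the RSS of process $1 in KiB; prints nothing if the process is gone.
    ps -o rss= -p "$1" | tr -d ' '
}

sample_rss() {
    # Usage: sample_rss PID INTERVAL_SECONDS COUNT
    pid=$1
    interval=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        rss=$(rss_kib "$pid")
        # Stop early if the process has exited.
        if [ -z "$rss" ]; then
            break
        fi
        # One line per sample: Unix timestamp, then RSS in KiB.
        echo "$(date +%s) $rss"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

Running something like `sample_rss "$WORKER_PID" 5 120 > rss-before.log` before a change and again afterwards on the same bank gives two logs you can compare directly, which fits the one-knob-at-a-time approach.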
FAQ
Does a low recall budget hurt normal user recall quality?
No. This setting is specifically for the internal recall pass inside consolidation, not the general recall path users call directly.
Why cap source facts at 4096 tokens?
Because unlimited source fact hydration was one of the worst memory amplifiers on large banks. A cap makes the prompt cost much more predictable.
Should I turn the FlashRank CPU memory arena back on?
Usually no. Leave it off unless you have measured a real need and are comfortable trading bounded RSS for a different allocation pattern.
