Monitoring
Hindsight provides comprehensive observability through Prometheus metrics, OpenTelemetry distributed tracing, and pre-built Grafana dashboards.
Local Development
For local observability, use the Grafana LGTM (Loki, Grafana, Tempo, Mimir) all-in-one stack:
./scripts/dev/start-monitoring.sh
This starts a single Docker container providing:
- Grafana UI: http://localhost:3000 (anonymous admin access)
- Traces (Tempo): OTLP endpoint at http://localhost:4318 (HTTP) and http://localhost:4317 (gRPC)
- Metrics (Prometheus/Mimir): Scrapes http://localhost:8888/metrics automatically
- Logs (Loki): Available for log aggregation
- Pre-built Dashboards: Hindsight Operations, LLM Metrics, API Service
Enable tracing in your API:
export HINDSIGHT_API_OTEL_TRACES_ENABLED=true
export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
The local monitoring stack is for development only. In production, deploy Grafana LGTM separately or use a commercial platform (Grafana Cloud, Datadog, New Relic, etc.).
Grafana Dashboards
Pre-built dashboards are available in monitoring/grafana/dashboards/. Import these JSON files into your Grafana instance:
| Dashboard | Description |
|---|---|
| Hindsight Operations | Operation rates, latency percentiles, per-bank metrics |
| Hindsight LLM Metrics | LLM calls, token usage, latency by scope/provider |
| Hindsight API Service | HTTP requests, error rates, DB pool, process metrics |
The dashboards are automatically provisioned when using the monitoring stack script.
Metrics Endpoint
Hindsight exposes Prometheus metrics at /metrics:
curl http://localhost:8888/metrics
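To consume the endpoint from a script instead of curl, the response can be parsed with a few lines of Python. This is a minimal sketch that assumes the endpoint returns the standard Prometheus text exposition format. `parse_metrics` and `fetch_metrics` are hypothetical helpers, not part of Hindsight, and the sample payload and its values are made up for illustration.

```python
import re
from urllib.request import urlopen

# Matches `name{labels} value` or `name value`; a trailing timestamp is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://localhost:8888/metrics"):
    """Fetch and parse the metrics endpoint (requires a running API)."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Illustrative payload only; real names and values come from the endpoint.
SAMPLE_TEXT = """\
# TYPE hindsight_db_pool_size gauge
hindsight_db_pool_size 10
hindsight_db_pool_idle 4
hindsight_http_requests_total{method="GET",status_class="2xx"} 42
"""

parsed = parse_metrics(SAMPLE_TEXT)
values = {name: value for name, labels, value in parsed if not labels}
active = values["hindsight_db_pool_size"] - values["hindsight_db_pool_idle"]
print(active)
```

Against a running instance, `fetch_metrics()` returns the same tuple structure, so the pool arithmetic above works unchanged.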
Available Metrics
Operation Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| hindsight.operation.duration | Histogram | operation, bank_id, source, budget, max_tokens, success | Duration of operations in seconds |
| hindsight.operation.total | Counter | operation, bank_id, source, budget, max_tokens, success | Total number of operations executed |
Labels:
- operation: Operation type (retain, recall, reflect)
- bank_id: Memory bank identifier
- source: Where the operation was triggered from (api, reflect, internal)
- budget: Budget level if specified (low, mid, high)
- max_tokens: Max tokens if specified
- success: Whether the operation succeeded (true, false)
The source label allows distinguishing between:
- api: Direct API calls from clients
- reflect: Internal recall calls made during reflect operations
- internal: Other internal operations
LLM Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| hindsight.llm.duration | Histogram | provider, model, scope, success | Duration of LLM API calls in seconds |
| hindsight.llm.calls.total | Counter | provider, model, scope, success | Total number of LLM API calls |
| hindsight.llm.tokens.input | Counter | provider, model, scope, success, token_bucket | Input tokens for LLM calls |
| hindsight.llm.tokens.output | Counter | provider, model, scope, success, token_bucket | Output tokens from LLM calls |
Labels:
- provider: LLM provider (openai, anthropic, gemini, groq, ollama, lmstudio)
- model: Model name (e.g., gpt-4, claude-3-sonnet)
- scope: What the LLM call is for (memory, reflect, consolidation, answer)
- success: Whether the call succeeded (true, false)
- token_bucket: Token count bucket for cardinality control (0-100, 100-500, 500-1k, 1k-5k, 5k-10k, 10k-50k, 50k+)
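The token_bucket label keeps cardinality bounded by collapsing any raw token count into one of seven fixed values. The mapping can be sketched as follows. `token_bucket` is a hypothetical helper for illustration, and because the docs do not say which side of a boundary a value such as 100 falls on, this sketch assumes upper bounds are exclusive.

```python
# Exclusive upper bounds paired with the bucket labels listed above.
# ASSUMPTION: boundary handling (is 100 in "0-100" or "100-500"?) is not
# documented; this sketch places boundary values in the higher bucket.
_BUCKETS = [
    (100, "0-100"),
    (500, "100-500"),
    (1_000, "500-1k"),
    (5_000, "1k-5k"),
    (10_000, "5k-10k"),
    (50_000, "10k-50k"),
]


def token_bucket(token_count):
    """Map a raw token count to one of seven coarse bucket labels."""
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    for upper_bound, label in _BUCKETS:
        if token_count < upper_bound:
            return label
    return "50k+"


print([token_bucket(n) for n in (42, 100, 7_500, 120_000)])
# → ['0-100', '100-500', '5k-10k', '50k+']
```

Because the label takes only seven values, adding it to the token counters multiplies the series count by at most seven per provider, model, scope, and success combination.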
HTTP Request Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| hindsight.http.duration | Histogram | method, endpoint, status_code, status_class | Duration of HTTP requests in seconds |
| hindsight.http.requests.total | Counter | method, endpoint, status_code, status_class | Total number of HTTP requests |
| hindsight.http.requests.in_progress | UpDownCounter | method, endpoint | Number of HTTP requests currently being processed |
Labels:
- method: HTTP method (GET, POST, PUT, DELETE)
- endpoint: Request path (normalized to reduce cardinality - UUIDs replaced with {id})
- status_code: HTTP status code (200, 400, 500, etc.)
- status_class: Status code class (2xx, 4xx, 5xx)
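The endpoint and status_class labels both trade detail for lower cardinality. One way such normalization could work is sketched below. The function names and the request path are hypothetical, and the regular expression assumes canonical 8-4-4-4-12 hexadecimal UUIDs, so Hindsight's actual matching rules may differ.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID, case-insensitive.
_UUID = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def normalize_endpoint(path):
    """Replace every UUID in a request path with the literal `{id}`."""
    return _UUID.sub("{id}", path)


def status_class(status_code):
    """Collapse a status code into its class, e.g. 404 -> '4xx'."""
    return f"{status_code // 100}xx"


# Hypothetical path, for illustration only.
path = "/v1/items/3f2b8c1e-9d4a-4e6b-8f1a-2c7d5e9b0a13/detail"
print(normalize_endpoint(path), status_class(503))
# → /v1/items/{id}/detail 5xx
```

Without this step, every distinct resource ID would create its own time series for each HTTP metric.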
Database Pool Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| hindsight.db.pool.size | Gauge | - | Current number of connections in the pool |
| hindsight.db.pool.idle | Gauge | - | Number of idle connections in the pool |
| hindsight.db.pool.min | Gauge | - | Minimum pool size |
| hindsight.db.pool.max | Gauge | - | Maximum pool size |
Process Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| hindsight.process.cpu.seconds | Gauge | type | Process CPU time in seconds |
| hindsight.process.memory.bytes | Gauge | type | Process memory usage in bytes |
| hindsight.process.open_fds | Gauge | - | Number of open file descriptors |
| hindsight.process.threads | Gauge | - | Number of active threads |
Labels:
- type (CPU): user or system
- type (Memory): rss_max (maximum resident set size)
Histogram Buckets
Custom bucket boundaries are configured for better percentile accuracy:
Operation Duration Buckets (seconds):
0.1, 0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 5.0, 7.5, 10.0, 15.0, 20.0, 30.0, 60.0, 120.0
LLM Duration Buckets (seconds):
0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 15.0, 30.0, 60.0, 120.0
HTTP Duration Buckets (seconds):
0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0
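Bucket boundaries affect percentile accuracy because Prometheus only knows how many observations fell below each boundary and interpolates linearly inside the bucket that contains the requested rank. The sketch below reproduces that interpolation for classic histograms in plain Python. The function is illustrative rather than Hindsight code, the bucket counts are invented, and the bucket list is an excerpt of the LLM boundaries truncated after 3.0 for brevity.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Mirrors the linear interpolation Prometheus applies to classic
    histograms; `buckets` must be sorted and end with the +Inf bucket.
    The first bucket is assumed to start at 0.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # Rank lies beyond the last finite boundary, so the best
                # available answer is that boundary itself.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented cumulative counts for an excerpt of the LLM duration buckets.
llm_buckets = [
    (0.1, 0), (0.25, 0), (0.5, 10), (1.0, 60),
    (2.0, 90), (3.0, 100), (math.inf, 100),
]
p50 = histogram_quantile(0.50, llm_buckets)
p95 = histogram_quantile(0.95, llm_buckets)
print(round(p50, 3), round(p95, 3))
```

In this example the P95 estimate lands halfway through the 2.0 to 3.0 bucket regardless of where the observations actually sit inside it, which is why narrower buckets around the latencies of interest give tighter estimates.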
Prometheus Configuration
scrape_configs:
  - job_name: 'hindsight'
    static_configs:
      - targets: ['localhost:8888']
Example Queries
Average operation latency by type
sum by (operation) (rate(hindsight_operation_duration_sum[5m])) / sum by (operation) (rate(hindsight_operation_duration_count[5m]))
LLM calls per minute by provider
sum by (provider) (rate(hindsight_llm_calls_total[1m])) * 60
P95 LLM latency
histogram_quantile(0.95, sum by (le) (rate(hindsight_llm_duration_bucket[5m])))
Total tokens consumed by model
sum by (model) (hindsight_llm_tokens_input_total) + sum by (model) (hindsight_llm_tokens_output_total)
Internal vs API recall operations
sum by (source) (rate(hindsight_operation_total{operation="recall"}[5m]))
HTTP requests per second by endpoint
sum by (endpoint) (rate(hindsight_http_requests_total[1m]))
HTTP error rate (5xx)
sum(rate(hindsight_http_requests_total{status_class="5xx"}[5m])) / sum(rate(hindsight_http_requests_total[5m]))
P95 HTTP latency
histogram_quantile(0.95, sum by (le) (rate(hindsight_http_duration_seconds_bucket[5m])))
Database pool utilization
hindsight_db_pool_size / hindsight_db_pool_max
Active database connections
hindsight_db_pool_size - hindsight_db_pool_idle
CPU usage rate
rate(hindsight_process_cpu_seconds{type="user"}[1m])
Distributed Tracing
Hindsight supports OpenTelemetry distributed tracing for memory operations and LLM calls, following GenAI semantic conventions v1.37+.
Configuration
See Configuration - OpenTelemetry Tracing for environment variables.
Quick Start:
# Enable tracing
export HINDSIGHT_API_OTEL_TRACES_ENABLED=true
export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
# View traces with Grafana LGTM (local dev)
./scripts/dev/start-monitoring.sh
# Open http://localhost:3000 → Explore → Tempo
Hindsight supports any OTLP-compatible backend (Grafana LGTM, Langfuse, OpenLIT, Datadog, New Relic, Honeycomb, etc.).
Span Hierarchy
Parent Spans (Operations):
- hindsight.retain - Memory ingestion
- hindsight.recall - Memory retrieval
  - hindsight.recall_embedding - Query embedding
  - hindsight.recall_retrieval - Parallel search (semantic, BM25, graph, temporal)
  - hindsight.recall_fusion - Reciprocal Rank Fusion
  - hindsight.recall_rerank - Cross-encoder reranking
- hindsight.reflect - Agentic reasoning
  - hindsight.reflect_tool_call - Tool execution (recall, lookup, etc.)
- hindsight.consolidation - Observation synthesis
- hindsight.mental_model_refresh - Mental model updates
Child Spans (LLM Calls):
- Named by scope (e.g., hindsight.memory, hindsight.reflect)
- Contain full prompts/completions as events
- Follow GenAI semantic conventions for attributes
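When inspecting exported traces outside Grafana, the hierarchy can be rebuilt from each span's parent reference. The following is a minimal sketch, assuming spans arrive as flat records with `span_id`, `parent_id`, and `name` fields. Those field names and the IDs are hypothetical, and the example assumes the recall sub-steps are recorded as children of hindsight.recall.

```python
from collections import defaultdict


def build_span_tree(spans):
    """Group flat span records by parent and return (roots, children)."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        parent_id = span.get("parent_id")
        if parent_id is None:
            roots.append(span)
        else:
            children[parent_id].append(span)
    return roots, children


def render(span, children, depth=0):
    """Return indented span names, one line per span, depth-first."""
    lines = ["  " * depth + span["name"]]
    for child in children[span["span_id"]]:
        lines.extend(render(child, children, depth + 1))
    return lines


# Hypothetical flat export; IDs are made up, names come from the lists above.
spans = [
    {"span_id": "a1", "parent_id": None, "name": "hindsight.recall"},
    {"span_id": "b2", "parent_id": "a1", "name": "hindsight.recall_embedding"},
    {"span_id": "c3", "parent_id": "a1", "name": "hindsight.recall_retrieval"},
    {"span_id": "d4", "parent_id": "a1", "name": "hindsight.recall_rerank"},
]
roots, children = build_span_tree(spans)
print("\n".join(render(roots[0], children)))
```

LLM child spans would appear one level deeper under whichever operation span issued the call.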
Span Attributes
Operation Spans:
- hindsight.operation - Operation type
- hindsight.bank_id - Memory bank ID
- hindsight.query - Query text (truncated to 100 chars)
- hindsight.fact_types - Fact types for recall
- hindsight.thinking_budget - Budget allocation
- hindsight.max_tokens - Token limit
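As an illustration of how these attributes fit together, the sketch below assembles them into a plain dictionary and applies the 100-character query truncation. `operation_attributes` is a hypothetical helper, the bank ID and fact type are placeholder values, and how Hindsight actually sets attributes on its spans is not shown in this document.

```python
MAX_QUERY_CHARS = 100  # truncation limit stated in the attribute list


def operation_attributes(operation, bank_id, query=None, fact_types=None,
                         thinking_budget=None, max_tokens=None):
    """Build an attribute dict for an operation span, omitting unset values."""
    candidates = {
        "hindsight.operation": operation,
        "hindsight.bank_id": bank_id,
        "hindsight.query": query[:MAX_QUERY_CHARS] if query else None,
        "hindsight.fact_types": list(fact_types) if fact_types else None,
        "hindsight.thinking_budget": thinking_budget,
        "hindsight.max_tokens": max_tokens,
    }
    return {key: value for key, value in candidates.items() if value is not None}


# Placeholder inputs; "demo-bank" and "example-type" are not real identifiers.
attrs = operation_attributes("recall", "demo-bank", query="x" * 250,
                             fact_types=["example-type"], max_tokens=4096)
print(len(attrs["hindsight.query"]), sorted(attrs))
```

Truncating the query keeps span payloads small and limits how much user text ends up in the tracing backend.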
LLM Spans (GenAI Semantic Conventions):
- gen_ai.operation.name - Always "chat"
- gen_ai.provider.name - Provider (openai, anthropic, google, etc.)
- gen_ai.request.model - Model name
- gen_ai.usage.input_tokens - Input tokens
- gen_ai.usage.output_tokens - Output tokens
- hindsight.scope - LLM call purpose (memory, reflect, consolidation, etc.)
Events:
- gen_ai.client.inference.operation.details - Full prompts and completions