Monitoring

Hindsight exposes Prometheus metrics and OpenTelemetry distributed traces, and ships pre-built Grafana dashboards for both.

Local Development

For local observability, use the Grafana LGTM (Loki, Grafana, Tempo, Mimir) all-in-one stack:

./scripts/dev/start-monitoring.sh

This starts a single Docker container providing:

Grafana UI: http://localhost:3000 (anonymous admin access)
Traces (Tempo): OTLP endpoint at http://localhost:4318 (HTTP) and http://localhost:4317 (gRPC)
Metrics (Prometheus/Mimir): Scrapes http://localhost:8888/metrics automatically
Logs (Loki): Available for log aggregation
Pre-built Dashboards: Hindsight Operations, LLM Metrics, API Service

Enable tracing in your API:

export HINDSIGHT_API_OTEL_TRACES_ENABLED=true
export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

Production Deployment

The local monitoring stack is for development only. In production, deploy Grafana LGTM separately or use a commercial platform such as Grafana Cloud, Datadog, or New Relic.

Grafana Dashboards

Pre-built dashboards are available in monitoring/grafana/dashboards/. Import these JSON files into your Grafana instance:

Dashboard	Description
Hindsight Operations	Operation rates, latency percentiles, per-bank metrics
Hindsight LLM Metrics	LLM calls, token usage, latency by scope/provider
Hindsight API Service	HTTP requests, error rates, DB pool, process metrics

The dashboards are automatically provisioned when using the monitoring stack script.

Metrics Endpoint

Hindsight exposes Prometheus metrics at /metrics:

curl http://localhost:8888/metrics
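
As a quick sanity check, the endpoint's output can also be summarized from a script. The following is a minimal sketch using only the Python standard library; the helper names `fetch_metrics` and `count_series` are illustrative and not part of Hindsight. It assumes the endpoint returns the standard Prometheus text exposition format.

```python
import urllib.request
from collections import defaultdict


def fetch_metrics(url="http://localhost:8888/metrics", timeout=5.0):
    """Download the raw Prometheus text exposition from the endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def count_series(text):
    """Count sample lines per metric name, skipping comments and blank lines."""
    counts = defaultdict(int)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # "# HELP" and "# TYPE" lines carry no samples
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return dict(counts)


# Usage (requires a running Hindsight API):
#   for name, n in sorted(count_series(fetch_metrics()).items()):
#       print(f"{name}: {n} series")
```

A metric whose series count keeps growing over time is usually a sign that one of its labels has high cardinality.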

Available Metrics

Operation Metrics

Metric	Type	Labels	Description
`hindsight.operation.duration`	Histogram	operation, bank_id, source, budget, max_tokens, success	Duration of operations in seconds
`hindsight.operation.total`	Counter	operation, bank_id, source, budget, max_tokens, success	Total number of operations executed

Labels:

operation: Operation type (retain, recall, reflect)
bank_id: Memory bank identifier
source: Where the operation was triggered from (api, reflect, internal)
budget: Budget level if specified (low, mid, high)
max_tokens: Max tokens if specified
success: Whether the operation succeeded (true, false)

The source label allows distinguishing between:

api: Direct API calls from clients
reflect: Internal recall calls made during reflect operations
internal: Other internal operations

LLM Metrics

Metric	Type	Labels	Description
`hindsight.llm.duration`	Histogram	provider, model, scope, success	Duration of LLM API calls in seconds
`hindsight.llm.calls.total`	Counter	provider, model, scope, success	Total number of LLM API calls
`hindsight.llm.tokens.input`	Counter	provider, model, scope, success, token_bucket	Input tokens for LLM calls
`hindsight.llm.tokens.output`	Counter	provider, model, scope, success, token_bucket	Output tokens from LLM calls

Labels:

provider: LLM provider (openai, anthropic, gemini, groq, ollama, lmstudio)
model: Model name (e.g., gpt-4, claude-3-sonnet)
scope: What the LLM call is for (memory, reflect, consolidation, answer)
success: Whether the call succeeded (true, false)
token_bucket: Token count bucket for cardinality control (0-100, 100-500, 500-1k, 1k-5k, 5k-10k, 10k-50k, 50k+)
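
To illustrate how a raw token count could map onto the `token_bucket` label, here is a minimal sketch. The bucket labels come from the list above; the function name and the choice to treat each upper bound as exclusive (so exactly 100 tokens falls in `100-500`) are assumptions, since the boundary behavior is not stated here.

```python
# Exclusive upper bounds paired with the bucket labels listed above.
# Assumption: a count equal to a boundary belongs to the higher bucket.
_TOKEN_BUCKETS = [
    (100, "0-100"),
    (500, "100-500"),
    (1_000, "500-1k"),
    (5_000, "1k-5k"),
    (10_000, "5k-10k"),
    (50_000, "10k-50k"),
]


def token_bucket(count: int) -> str:
    """Return the bucket label for a token count (hypothetical helper)."""
    if count < 0:
        raise ValueError("token count must be non-negative")
    for upper, label in _TOKEN_BUCKETS:
        if count < upper:
            return label
    return "50k+"
```

Bucketing keeps the label to seven possible values, whereas using the raw count as a label would create a new time series for almost every call.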

HTTP Request Metrics

Metric	Type	Labels	Description
`hindsight.http.duration`	Histogram	method, endpoint, status_code, status_class	Duration of HTTP requests in seconds
`hindsight.http.requests.total`	Counter	method, endpoint, status_code, status_class	Total number of HTTP requests
`hindsight.http.requests.in_progress`	UpDownCounter	method, endpoint	Number of HTTP requests currently being processed

Labels:

method: HTTP method (GET, POST, PUT, DELETE)
endpoint: Request path, normalized to reduce cardinality (UUIDs are replaced with {id})
status_code: HTTP status code (200, 400, 500, etc.)
status_class: Status code class (2xx, 4xx, 5xx)
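
The `endpoint` and `status_class` labels can be pictured with a short sketch. This is not Hindsight's implementation: the regular expression, the function names, and the example path are assumptions chosen to match the description above (canonical 8-4-4-4-12 hex UUIDs replaced with `{id}`).

```python
import re

# Canonical textual UUID: 8-4-4-4-12 hexadecimal digits.
_UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def normalize_endpoint(path: str) -> str:
    """Replace every UUID in a request path with the literal '{id}'."""
    return _UUID_RE.sub("{id}", path)


def status_class(status_code: int) -> str:
    """Collapse a status code into its class, e.g. 404 -> '4xx'."""
    return f"{status_code // 100}xx"


# The path below is hypothetical, used only to show the substitution.
example = normalize_endpoint("/banks/123e4567-e89b-12d3-a456-426614174000/memories")
```

Without this normalization, every distinct resource ID would produce its own `endpoint` value and therefore its own set of time series.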

Database Pool Metrics

Metric	Type	Labels	Description
`hindsight.db.pool.size`	Gauge	-	Current number of connections in the pool
`hindsight.db.pool.idle`	Gauge	-	Number of idle connections in the pool
`hindsight.db.pool.min`	Gauge	-	Minimum pool size
`hindsight.db.pool.max`	Gauge	-	Maximum pool size

Process Metrics

Metric	Type	Labels	Description
`hindsight.process.cpu.seconds`	Gauge	type	Process CPU time in seconds
`hindsight.process.memory.bytes`	Gauge	type	Process memory usage in bytes
`hindsight.process.open_fds`	Gauge	-	Number of open file descriptors
`hindsight.process.threads`	Gauge	-	Number of active threads

Labels:

type (CPU): user or system
type (Memory): rss_max (maximum resident set size)

Histogram Buckets

Custom bucket boundaries are configured for better percentile accuracy:

Operation Duration Buckets (seconds):

0.1, 0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 5.0, 7.5, 10.0, 15.0, 20.0, 30.0, 60.0, 120.0

LLM Duration Buckets (seconds):

0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 15.0, 30.0, 60.0, 120.0

HTTP Duration Buckets (seconds):

0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0
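
Bucket boundaries matter because percentiles are estimated from bucket counts, not from raw observations. The sketch below approximates the linear interpolation that Prometheus's `histogram_quantile` performs within a bucket, using the LLM duration boundaries listed above. The function name and the sample counts are illustrative assumptions.

```python
import math

# LLM duration bucket upper bounds (seconds), as listed above.
LLM_BUCKETS = [0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 15.0, 30.0, 60.0, 120.0]


def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Estimate the q-quantile from cumulative bucket counts.

    cumulative_counts has one entry per finite bound plus a final entry
    for the implicit +Inf bucket (the total number of observations).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if len(cumulative_counts) != len(upper_bounds) + 1:
        raise ValueError("expected one count per bound plus the +Inf bucket")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    bounds = list(upper_bounds) + [math.inf]
    for i, (upper, cum) in enumerate(zip(bounds, cumulative_counts)):
        if cum >= rank:
            if math.isinf(upper):
                # Quantile lies beyond the last finite bound; report that bound.
                return upper_bounds[-1]
            lower = upper_bounds[i - 1] if i > 0 else 0.0
            below = cumulative_counts[i - 1] if i > 0 else 0
            in_bucket = cum - below
            if in_bucket == 0:
                return lower
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - below) / in_bucket
    return upper_bounds[-1]


# Made-up data: 100 calls, all finishing within 3 seconds.
counts = [0, 0, 10, 50, 90, 100, 100, 100, 100, 100, 100, 100, 100]
p95 = estimate_quantile(0.95, LLM_BUCKETS, counts)
```

With these counts the P95 falls halfway through the 2.0 to 3.0 second bucket. The estimate can never be more precise than the width of the bucket it lands in, which is why the boundaries are densest where latencies are expected to cluster.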

Prometheus Configuration

scrape_configs:
  - job_name: 'hindsight'
    static_configs:
      - targets: ['localhost:8888']
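
The metric tables use dotted names (for example `hindsight.operation.duration`), while PromQL uses underscore names with type-specific suffixes. The sketch below captures the mapping as it can be inferred from the example queries in this document; it is not taken from the Hindsight source, and some exporter configurations also append unit suffixes that this sketch ignores.

```python
def prometheus_names(otel_name: str, kind: str) -> list[str]:
    """Return the Prometheus series names expected for a dotted metric name.

    kind is one of: "counter", "histogram", "gauge", "updowncounter".
    """
    base = otel_name.replace(".", "_")
    if kind == "counter":
        # Counters carry a "_total" suffix; avoid doubling it.
        return [base if base.endswith("_total") else base + "_total"]
    if kind == "histogram":
        # Histograms expand into three series families.
        return [base + "_bucket", base + "_sum", base + "_count"]
    if kind in ("gauge", "updowncounter"):
        return [base]
    raise ValueError(f"unknown metric kind: {kind}")
```

For example, `prometheus_names("hindsight.llm.tokens.input", "counter")` yields `hindsight_llm_tokens_input_total`, the name used in the token query under Example Queries.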

Example Queries

Average operation latency by type

sum by (operation) (rate(hindsight_operation_duration_sum[5m])) / sum by (operation) (rate(hindsight_operation_duration_count[5m]))

LLM calls per minute by provider

sum by (provider) (rate(hindsight_llm_calls_total[1m])) * 60

P95 LLM latency

histogram_quantile(0.95, sum by (le) (rate(hindsight_llm_duration_bucket[5m])))

Total tokens consumed by model

sum by (model) (hindsight_llm_tokens_input_total) + sum by (model) (hindsight_llm_tokens_output_total)

Internal vs API recall operations

sum by (source) (rate(hindsight_operation_total{operation="recall"}[5m]))

HTTP requests per second by endpoint

sum by (endpoint) (rate(hindsight_http_requests_total[1m]))

HTTP error rate (5xx)

sum(rate(hindsight_http_requests_total{status_class="5xx"}[5m])) / sum(rate(hindsight_http_requests_total[5m]))

P95 HTTP latency

histogram_quantile(0.95, sum by (le) (rate(hindsight_http_duration_bucket[5m])))

Database pool utilization

hindsight_db_pool_size / hindsight_db_pool_max

Active database connections

hindsight_db_pool_size - hindsight_db_pool_idle

CPU usage rate

rate(hindsight_process_cpu_seconds{type="user"}[1m])
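
The queries in this section can also be run from a script through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch using only the standard library. The base URL `http://localhost:9090` is an assumption about where your Prometheus-compatible server listens, and the helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# Assumption: adjust to wherever your Prometheus-compatible API is reachable.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(promql, base_url=PROMETHEUS_URL):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload):
    """Map each series' sorted label pairs to its float sample value."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    values = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        # Each instant-vector sample is [timestamp, "value-as-string"].
        values[labels] = float(series["value"][1])
    return values


def run_query(promql, base_url=PROMETHEUS_URL, timeout=5.0):
    """Execute an instant query and return the parsed result."""
    with urllib.request.urlopen(build_query_url(promql, base_url), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


# Usage (requires a running Prometheus):
#   run_query("hindsight_db_pool_size - hindsight_db_pool_idle")
```

Keeping URL construction and response parsing separate from the network call makes the helpers easy to test without a live server.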

Distributed Tracing

Hindsight supports OpenTelemetry distributed tracing for memory operations and LLM calls, following GenAI semantic conventions v1.37+.

Configuration

See Configuration - OpenTelemetry Tracing for environment variables.

Quick Start:

# Enable tracing
export HINDSIGHT_API_OTEL_TRACES_ENABLED=true
export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

# View traces with Grafana LGTM (local dev)
./scripts/dev/start-monitoring.sh
# Open http://localhost:3000 → Explore → Tempo

Hindsight can export traces to any OTLP-compatible backend (Grafana LGTM, Langfuse, OpenLIT, Datadog, New Relic, Honeycomb, etc.).

Span Hierarchy

Parent Spans (Operations):

hindsight.retain - Memory ingestion
hindsight.recall - Memory retrieval
  - hindsight.recall_embedding - Query embedding
  - hindsight.recall_retrieval - Parallel search (semantic, BM25, graph, temporal)
  - hindsight.recall_fusion - Reciprocal Rank Fusion
  - hindsight.recall_rerank - Cross-encoder reranking
hindsight.reflect - Agentic reasoning
  - hindsight.reflect_tool_call - Tool execution (recall, lookup, etc.)
hindsight.consolidation - Observation synthesis
hindsight.mental_model_refresh - Mental model updates

Child Spans (LLM Calls):

Named by scope (e.g., hindsight.memory, hindsight.reflect)
Contain full prompts/completions as events
Follow GenAI semantic conventions for attributes

Span Attributes

Operation Spans:

hindsight.operation - Operation type
hindsight.bank_id - Memory bank ID
hindsight.query - Query text (truncated to 100 chars)
hindsight.fact_types - Fact types for recall
hindsight.thinking_budget - Budget allocation
hindsight.max_tokens - Token limit

LLM Spans (GenAI Semantic Conventions):

gen_ai.operation.name - Always "chat"
gen_ai.provider.name - Provider (openai, anthropic, google, etc.)
gen_ai.request.model - Model name
gen_ai.usage.input_tokens - Input tokens
gen_ai.usage.output_tokens - Output tokens
hindsight.scope - LLM call purpose (memory, reflect, consolidation, etc.)

Events:

gen_ai.client.inference.operation.details - Full prompts and completions
