Skip to main content

Monitoring

Hindsight provides comprehensive observability through Prometheus metrics, OpenTelemetry distributed tracing, and pre-built Grafana dashboards.

Local Development

For local observability, use the Grafana LGTM (Loki, Grafana, Tempo, Mimir) all-in-one stack:

./scripts/dev/start-monitoring.sh

This starts a single Docker container providing:

Enable tracing in your API:

export HINDSIGHT_API_OTEL_TRACES_ENABLED=true
export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
Production Deployment

The local monitoring stack is for development only. In production, deploy Grafana LGTM separately or use commercial platforms (Grafana Cloud, DataDog, New Relic, etc.).

Grafana Dashboards

Pre-built dashboards are available in monitoring/grafana/dashboards/. Import these JSON files into your Grafana instance:

DashboardDescription
Hindsight OperationsOperation rates, latency percentiles, per-bank metrics
Hindsight LLM MetricsLLM calls, token usage, latency by scope/provider
Hindsight API ServiceHTTP requests, error rates, DB pool, process metrics

The dashboards are automatically provisioned when using the monitoring stack script.

Metrics Endpoint

Hindsight exposes Prometheus metrics at /metrics:

curl http://localhost:8888/metrics

Available Metrics

Operation Metrics

MetricTypeLabelsDescription
hindsight.operation.durationHistogramoperation, bank_id, source, budget, max_tokens, successDuration of operations in seconds
hindsight.operation.totalCounteroperation, bank_id, source, budget, max_tokens, successTotal number of operations executed

Labels:

  • operation: Operation type (retain, recall, reflect)
  • bank_id: Memory bank identifier
  • source: Where the operation was triggered from (api, reflect, internal)
  • budget: Budget level if specified (low, mid, high)
  • max_tokens: Max tokens if specified
  • success: Whether the operation succeeded (true, false)

The source label allows distinguishing between:

  • api: Direct API calls from clients
  • reflect: Internal recall calls made during reflect operations
  • internal: Other internal operations

LLM Metrics

MetricTypeLabelsDescription
hindsight.llm.durationHistogramprovider, model, scope, successDuration of LLM API calls in seconds
hindsight.llm.calls.totalCounterprovider, model, scope, successTotal number of LLM API calls
hindsight.llm.tokens.inputCounterprovider, model, scope, success, token_bucketInput tokens for LLM calls
hindsight.llm.tokens.outputCounterprovider, model, scope, success, token_bucketOutput tokens from LLM calls

Labels:

  • provider: LLM provider (openai, anthropic, gemini, groq, ollama, lmstudio)
  • model: Model name (e.g., gpt-4, claude-3-sonnet)
  • scope: What the LLM call is for (memory, reflect, consolidation, answer)
  • success: Whether the call succeeded (true, false)
  • token_bucket: Token count bucket for cardinality control (0-100, 100-500, 500-1k, 1k-5k, 5k-10k, 10k-50k, 50k+)

HTTP Request Metrics

MetricTypeLabelsDescription
hindsight.http.durationHistogrammethod, endpoint, status_code, status_classDuration of HTTP requests in seconds
hindsight.http.requests.totalCountermethod, endpoint, status_code, status_classTotal number of HTTP requests
hindsight.http.requests.in_progressUpDownCountermethod, endpointNumber of HTTP requests currently being processed

Labels:

  • method: HTTP method (GET, POST, PUT, DELETE)
  • endpoint: Request path (normalized to reduce cardinality - UUIDs replaced with {id})
  • status_code: HTTP status code (200, 400, 500, etc.)
  • status_class: Status code class (2xx, 4xx, 5xx)

Database Pool Metrics

MetricTypeLabelsDescription
hindsight.db.pool.sizeGauge-Current number of connections in the pool
hindsight.db.pool.idleGauge-Number of idle connections in the pool
hindsight.db.pool.minGauge-Minimum pool size
hindsight.db.pool.maxGauge-Maximum pool size

Process Metrics

MetricTypeLabelsDescription
hindsight.process.cpu.secondsGaugetypeProcess CPU time in seconds
hindsight.process.memory.bytesGaugetypeProcess memory usage in bytes
hindsight.process.open_fdsGauge-Number of open file descriptors
hindsight.process.threadsGauge-Number of active threads

Labels:

  • type (CPU): user or system
  • type (Memory): rss_max (maximum resident set size)

Histogram Buckets

Custom bucket boundaries are configured for better percentile accuracy:

Operation Duration Buckets (seconds):

0.1, 0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 5.0, 7.5, 10.0, 15.0, 20.0, 30.0, 60.0, 120.0

LLM Duration Buckets (seconds):

0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0, 15.0, 30.0, 60.0, 120.0

HTTP Duration Buckets (seconds):

0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0

Prometheus Configuration

scrape_configs:
- job_name: 'hindsight'
static_configs:
- targets: ['localhost:8888']

Example Queries

Average operation latency by type

rate(hindsight_operation_duration_sum[5m]) / rate(hindsight_operation_duration_count[5m])

LLM calls per minute by provider

rate(hindsight_llm_calls_total[1m]) * 60

P95 LLM latency

histogram_quantile(0.95, rate(hindsight_llm_duration_bucket[5m]))

Total tokens consumed by model

sum by (model) (hindsight_llm_tokens_input_total + hindsight_llm_tokens_output_total)

Internal vs API recall operations

sum by (source) (rate(hindsight_operation_total{operation="recall"}[5m]))

HTTP requests per second by endpoint

sum by (endpoint) (rate(hindsight_http_requests_total[1m]))

HTTP error rate (5xx)

sum(rate(hindsight_http_requests_total{status_class="5xx"}[5m])) / sum(rate(hindsight_http_requests_total[5m]))

P95 HTTP latency

histogram_quantile(0.95, sum by (le) (rate(hindsight_http_duration_seconds_bucket[5m])))

Database pool utilization

hindsight_db_pool_size / hindsight_db_pool_max

Active database connections

hindsight_db_pool_size - hindsight_db_pool_idle

CPU usage rate

rate(hindsight_process_cpu_seconds{type="user"}[1m])

Distributed Tracing

Hindsight supports OpenTelemetry distributed tracing for memory operations and LLM calls, following GenAI semantic conventions v1.37+.

Configuration

See Configuration - OpenTelemetry Tracing for environment variables.

Quick Start:

# Enable tracing
export HINDSIGHT_API_OTEL_TRACES_ENABLED=true
export HINDSIGHT_API_OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

# View traces with Grafana LGTM (local dev)
./scripts/dev/start-monitoring.sh
# Open http://localhost:3000 → Explore → Tempo

Supports any OTLP-compatible backend (Grafana LGTM, Langfuse, OpenLIT, DataDog, New Relic, Honeycomb, etc.).

Span Hierarchy

Parent Spans (Operations):

  • hindsight.retain - Memory ingestion
  • hindsight.recall - Memory retrieval
    • hindsight.recall_embedding - Query embedding
    • hindsight.recall_retrieval - Parallel search (semantic, BM25, graph, temporal)
    • hindsight.recall_fusion - Reciprocal Rank Fusion
    • hindsight.recall_rerank - Cross-encoder reranking
  • hindsight.reflect - Agentic reasoning
    • hindsight.reflect_tool_call - Tool execution (recall, lookup, etc.)
  • hindsight.consolidation - Observation synthesis
  • hindsight.mental_model_refresh - Mental model updates

Child Spans (LLM Calls):

  • Named by scope (e.g., hindsight.memory, hindsight.reflect)
  • Contain full prompts/completions as events
  • Follow GenAI semantic conventions for attributes

Span Attributes

Operation Spans:

  • hindsight.operation - Operation type
  • hindsight.bank_id - Memory bank ID
  • hindsight.query - Query text (truncated to 100 chars)
  • hindsight.fact_types - Fact types for recall
  • hindsight.thinking_budget - Budget allocation
  • hindsight.max_tokens - Token limit

LLM Spans (GenAI Semantic Conventions):

  • gen_ai.operation.name - Always "chat"
  • gen_ai.provider.name - Provider (openai, anthropic, google, etc.)
  • gen_ai.request.model - Model name
  • gen_ai.usage.input_tokens - Input tokens
  • gen_ai.usage.output_tokens - Output tokens
  • hindsight.scope - LLM call purpose (memory, reflect, consolidation, etc.)

Events:

  • gen_ai.client.inference.operation.details - Full prompts and completions