feat: surface real search observability (p50/p95/p99 latency, cost/query, cache-hit-rate) #313

Description

@DevanshuNEU

Problem

We make or imply latency and cost claims that we cannot back with numbers:

  • Latency: p50/p95/p99 are not measured. Timing instrumentation exists (metrics.record_search, @track_time("cohere_rerank"), @track_time("search_v3")) but is never aggregated into percentiles, so any "sub-300ms" claim is currently unbacked. The Cohere-reranked path is far slower than the non-reranked path.
  • Cost: there is no token or per-query cost tracking anywhere (indexer_optimized.py batches embeddings 100 per call but logs no cost). Cohere rerank is the dominant per-query cost and is untracked.
  • Cache-hit-rate: this already works (cache_hits/cache_misses -> cache_hit_rate in observability.py:369-384); it just needs to be surfaced alongside the rest.

Scope

  • Aggregate existing per-stage timings into p50/p95/p99, split cold vs warm cache and reranked vs non-reranked.
  • Track embedding tokens + Cohere rerank calls per query -> derive cost/query; track cost/index per indexing run.
  • Surface all of the above (+ existing cache-hit-rate) on the /metrics endpoint and/or the dashboard.
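
The first scope item could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `LatencyAggregator`, its `record`/`summary` methods, and the "cache hit = warm" mapping are all hypothetical, and the real wiring into metrics.record_search or @track_time may look different. It uses only the standard library's `statistics.quantiles`.

```python
from collections import defaultdict
from statistics import quantiles


class LatencyAggregator:
    """Hypothetical helper: buckets search durations by cache state and rerank flag."""

    def __init__(self):
        # (cache_state, rerank_state) -> list of durations in milliseconds
        self._samples = defaultdict(list)

    def record(self, duration_ms, *, cache_hit, reranked):
        # Assumption: a cache hit counts as "warm", a miss as "cold".
        key = (
            "warm" if cache_hit else "cold",
            "reranked" if reranked else "non_reranked",
        )
        self._samples[key].append(duration_ms)

    def summary(self):
        out = {}
        for (cache, rerank), samples in self._samples.items():
            if len(samples) < 2:
                continue  # statistics.quantiles needs at least two data points
            # n=100 yields 99 cut points; index i-1 is the i-th percentile.
            cuts = quantiles(samples, n=100, method="inclusive")
            out[f"{cache}/{rerank}"] = {
                "count": len(samples),
                "p50": cuts[49],
                "p95": cuts[94],
                "p99": cuts[98],
            }
        return out
```

Keeping raw samples in memory is the simplest option for a sketch; a long-running service would more likely use a bounded window or a histogram so memory does not grow without limit.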

Acceptance criteria

  • /metrics returns p50/p95/p99 search latency, split cold/warm and reranked/non-reranked.
  • /metrics returns cost-per-query (embedding + rerank) and cost-per-index.
  • Cache-hit-rate displayed alongside.
  • A short doc records the real measured numbers (the resume bullets).
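
For the cost criteria above, one possible shape for deriving cost from usage counters is sketched below. Everything here is an assumption: `Pricing`, `QueryUsage`, and `cost_usd` are hypothetical names, and the unit prices are deliberately left as parameters to be filled in from the providers' current price lists rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical unit prices; supply real values from the providers' pricing."""
    usd_per_1k_embedding_tokens: float
    usd_per_rerank_call: float


@dataclass
class QueryUsage:
    """Hypothetical per-query (or per-indexing-run) usage counters."""
    embedding_tokens: int = 0
    rerank_calls: int = 0


def cost_usd(usage: QueryUsage, pricing: Pricing) -> float:
    # Embedding cost scales with tokens; rerank cost scales with call count.
    embedding = usage.embedding_tokens / 1000 * pricing.usd_per_1k_embedding_tokens
    rerank = usage.rerank_calls * pricing.usd_per_rerank_call
    return embedding + rerank
```

The same function would cover cost-per-index if the indexing run accumulates its embedding tokens into a single `QueryUsage` with `rerank_calls=0`.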
