Problem
We make/imply latency and cost claims we cannot back with numbers:
- Latency: no p50/p95/p99 measured. Timing instrumentation EXISTS (metrics.record_search, @track_time("cohere_rerank"), @track_time("search_v3")) but is not aggregated into percentiles. Any "sub-300ms" claim is currently unbacked, and the Cohere-reranked path is far slower.
- Cost: no token or per-query cost tracking anywhere (indexer_optimized.py batches embeddings 100/call but logs no cost). Cohere rerank is the dominant per-query cost and is untracked.
- Cache-hit-rate: this one already works (cache_hits/cache_misses -> cache_hit_rate in observability.py:369-384); just needs surfacing alongside the rest.
Scope
- Aggregate existing per-stage timings into p50/p95/p99, split cold vs warm cache and reranked vs non-reranked.
- Track embedding tokens + Cohere rerank calls per query -> derive cost/query; track cost/index per indexing run.
- Surface all of the above (+ existing cache-hit-rate) on the /metrics endpoint and/or the dashboard.
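
A minimal sketch of the percentile aggregation in the first scope item, assuming each search reports its duration plus whether it hit the cache and whether it was reranked. LatencyAggregator, record, and summary are hypothetical names for illustration, not existing code in observability.py; the percentiles come from the standard library's statistics.quantiles.

```python
from collections import defaultdict
from statistics import quantiles


class LatencyAggregator:
    """Hypothetical helper: buckets latencies by cache state and rerank path."""

    def __init__(self) -> None:
        # key: ("warm" | "cold", "reranked" | "non_reranked") -> list of ms
        self._samples = defaultdict(list)

    def record(self, duration_ms: float, cache_hit: bool, reranked: bool) -> None:
        key = (
            "warm" if cache_hit else "cold",
            "reranked" if reranked else "non_reranked",
        )
        self._samples[key].append(duration_ms)

    def summary(self) -> dict:
        out = {}
        for (cache, rerank), values in self._samples.items():
            if len(values) < 2:
                continue  # statistics.quantiles needs at least two samples
            # n=100 yields 99 cut points; index i-1 is the i-th percentile.
            cuts = quantiles(values, n=100, method="inclusive")
            out[f"{cache}/{rerank}"] = {
                "p50": cuts[49],
                "p95": cuts[94],
                "p99": cuts[98],
                "count": len(values),
            }
        return out
```

Keeping raw samples in memory is the simplest option for a first pass; a bounded window or histogram buckets would be the usual next step if sample volume becomes a concern.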
Acceptance criteria
- /metrics returns p50/p95/p99 search latency, split cold/warm and reranked/non-reranked.
- /metrics returns cost-per-query (embedding + rerank) and cost-per-index.
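
One way the cost figures could be derived, as a sketch only. QueryUsage, cost_per_query, and cost_per_index are hypothetical names, and the prices are passed in as parameters because no vendor rates are stated in this issue; real values would come from configuration.

```python
from dataclasses import dataclass


@dataclass
class QueryUsage:
    """Hypothetical per-query usage record."""
    embedding_tokens: int
    rerank_calls: int


def cost_per_query(
    usage: QueryUsage,
    embed_usd_per_1k_tokens: float,
    rerank_usd_per_call: float,
) -> float:
    # Embedding cost scales with tokens; rerank cost scales with call count.
    embed = usage.embedding_tokens / 1000 * embed_usd_per_1k_tokens
    rerank = usage.rerank_calls * rerank_usd_per_call
    return embed + rerank


def cost_per_index(batch_token_counts: list[int], embed_usd_per_1k_tokens: float) -> float:
    # One entry per embedding batch issued during an indexing run.
    return sum(batch_token_counts) / 1000 * embed_usd_per_1k_tokens
```

For example, `cost_per_query(QueryUsage(embedding_tokens=1000, rerank_calls=1), 0.5, 2.0)` uses placeholder prices chosen only to make the arithmetic easy to check.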