feat(cache-proposals): proactive outcome evaluation with UI and histo…#225
Merged
…rical signal weighting

Add CacheOutcomeEvaluator, a periodic job (default: 2-min tick, 15-min evaluation window, configurable via OUTCOME_EVAL_TICK_MS and OUTCOME_EVAL_WINDOW_MS env vars) that evaluates applied threshold proposals after sufficient time has passed:

1. Finds applied threshold_adjust proposals past the evaluation window
2. Reads current similarity window metrics from Valkey
3. Compares against the metrics snapshot stored at apply time
4. Writes a verdict (improved / degraded / neutral) back to the proposal's applied_result.details.outcome_evaluation
5. Appends an audit trail entry (outcome_evaluated event type)

Recommendation engine integration:
- getSignalOutcomeHistory reads evaluated verdicts from past proposals
- If a signal has led to degradation in 2+ evaluated proposals and degraded more often than improved, further adjustments from that signal are blocked with reasoning

UI changes:
- HistoryTable: new "Outcome" column with verdict badge (improved/degraded/neutral) for evaluated proposals
- DetailPanel: structured outcome evaluation section showing verdict, signal, evaluation window, human-readable detail, and before/after metrics comparison in a two-column layout

Test script (scripts/test_outcome_evaluator.py):
- Multi-cycle test that verifies the full feedback loop end-to-end
- Stores and checks 300 STSb pairs, triggers autotune, waits for the proactive evaluator to fire, then verifies the verdict is written and the recommendation engine references it

Verified: zero regression on STSb (5K) and SICK (9,927) benchmarks; the new code only adds behavior when there is proposal history to evaluate.
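For illustration, step 1 and the env-var defaults could look roughly like the sketch below. Only OUTCOME_EVAL_TICK_MS, OUTCOME_EVAL_WINDOW_MS, threshold_adjust, and the 2-min / 15-min defaults come from this commit message; the AppliedProposal shape, its field names, and both helper functions are hypothetical and not taken from the PR's code.

```typescript
// Hypothetical proposal shape; field names are assumptions for this sketch.
interface AppliedProposal {
  id: string;
  type: string; // e.g. "threshold_adjust"
  appliedAt: number; // epoch milliseconds
  outcomeEvaluated: boolean; // true once a verdict has been written
}

const DEFAULT_TICK_MS = 2 * 60 * 1000; // 2-min tick
const DEFAULT_WINDOW_MS = 15 * 60 * 1000; // 15-min evaluation window

// Parse a positive millisecond value, falling back to the default when the
// variable is unset, empty, non-numeric, or not positive.
function readMs(
  env: Record<string, string | undefined>,
  name: string,
  fallback: number,
): number {
  const parsed = Number(env[name]);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

function evaluatorConfig(env: Record<string, string | undefined>) {
  return {
    tickMs: readMs(env, "OUTCOME_EVAL_TICK_MS", DEFAULT_TICK_MS),
    windowMs: readMs(env, "OUTCOME_EVAL_WINDOW_MS", DEFAULT_WINDOW_MS),
  };
}

// Step 1: applied threshold_adjust proposals whose evaluation window has
// elapsed and that do not have a verdict yet.
function dueForEvaluation(
  proposals: AppliedProposal[],
  now: number,
  windowMs: number,
): AppliedProposal[] {
  return proposals.filter(
    (p) =>
      p.type === "threshold_adjust" &&
      !p.outcomeEvaluated &&
      now - p.appliedAt >= windowMs,
  );
}
```

In a Node process the config would presumably be built with evaluatorConfig(process.env) and dueForEvaluation called on every tick.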
…evaluator

computeCurrentMetrics was reading the entire similarity window, including entries classified as hit/miss at the OLD threshold. After a threshold adjustment, these stale entries contaminate the metrics: an entry that was a "hit" at the old (looser) threshold may be a "miss" at the new one, but it is still labeled "hit" in the window.

Fix: use ZRANGEBYSCORE with the proposal's applied_at timestamp as the lower bound so only entries recorded AFTER the adjustment are included in the evaluation.
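A minimal sketch of the lower-bound idea, assuming the similarity window is a sorted set whose scores are epoch-millisecond timestamps (the commit message does not spell out the score format). The WindowEntry type and both function names are hypothetical. The first helper only builds the command arguments; the second is an in-memory equivalent of what the server-side range query would return.

```typescript
// Hypothetical in-memory view of one sorted-set entry.
interface WindowEntry {
  member: string;
  score: number; // assumed: epoch ms at which the entry was recorded
}

// Arguments for a ZRANGEBYSCORE call bounded below by the apply time.
// The min bound is inclusive by default; prefixing it with "(" makes it
// exclusive.
function zrangeByScoreArgs(key: string, appliedAtMs: number): string[] {
  return ["ZRANGEBYSCORE", key, String(appliedAtMs), "+inf"];
}

// In-memory equivalent: keep only entries recorded at or after apply time,
// so entries labeled under the old threshold are excluded.
function entriesSince(entries: WindowEntry[], appliedAtMs: number): WindowEntry[] {
  return entries.filter((e) => e.score >= appliedAtMs);
}
```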
jamby77
reviewed
Jun 2, 2026
'applied',
'failed',
'expired',
'outcome_evaluated',
Collaborator
This value should probably be added to the persistence-layer constraints:
- apps/api/src/storage/adapters/sqlite.adapter.ts:1382 — CHECK (event_type IN (...))
- apps/api/src/storage/adapters/postgres.adapter.ts:1623 — same
Member
Author
Good catch! I usually use Postgres, but to keep things simple while running the large number of benchmarks, I used the in-memory storage and missed these. Should be fixed now.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 06e2a19.
…mic audit write

Extract shared similarity-window metrics logic into similarity-metrics.utils.ts so the evaluator and dispatcher stay in sync when formulas change.

Swap the write order in the outcome evaluator: append the audit entry before updating proposal details, so a failed audit insert does not permanently block retry via the outcome_evaluation guard.
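The write-order change can be sketched as follows. This is an illustrative, synchronous simplification and not the evaluator's actual code: the Proposal and AuditEvent shapes, the recordOutcome name, and the flat outcomeEvaluation field (standing in for applied_result.details.outcome_evaluation) are all assumptions.

```typescript
type Verdict = "improved" | "degraded" | "neutral";

// Hypothetical shapes for this sketch.
interface Proposal {
  id: string;
  outcomeEvaluation?: { verdict: Verdict };
}

interface AuditEvent {
  proposalId: string;
  eventType: "outcome_evaluated";
  verdict: Verdict;
}

type AppendAudit = (event: AuditEvent) => void;

// Returns true if the outcome was recorded, false if the guard skipped it.
function recordOutcome(
  proposal: Proposal,
  verdict: Verdict,
  appendAudit: AppendAudit,
): boolean {
  // Guard: a proposal that already carries a verdict is never re-evaluated.
  if (proposal.outcomeEvaluation) return false;

  // Audit first. If this throws, the proposal is left untouched, so the
  // guard above does not trip and a later tick can retry.
  appendAudit({ proposalId: proposal.id, eventType: "outcome_evaluated", verdict });

  // Only after the audit write succeeds is the verdict stored.
  proposal.outcomeEvaluation = { verdict };
  return true;
}
```

With the opposite order, a failed audit insert would leave the verdict stored and the guard would skip every retry. One trade-off of the audit-first order, under the same assumptions, is that a failure in the second write could produce a duplicate audit entry on retry.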

Summary
Closes the self-tuning feedback loop: after a threshold proposal is applied, a background evaluator waits 15 minutes, measures whether the adjustment actually helped, and writes a verdict (improved/degraded/neutral) back to the proposal record. The recommendation engine uses accumulated verdicts to block signals that have historically led to degradation on a given cache. A new "Outcome" column in the proposals history table and a structured detail panel make verdicts visible in the UI.
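The blocking rule described in this PR (a signal degraded in 2+ evaluated proposals and degraded more often than it improved) can be written as a small predicate. The sketch below is hypothetical: SignalHistory, summarize, and isSignalBlocked are illustrative names, and the real getSignalOutcomeHistory may return a different shape.

```typescript
type Verdict = "improved" | "degraded" | "neutral";

// Hypothetical per-signal tally of evaluated verdicts.
interface SignalHistory {
  improved: number;
  degraded: number;
  neutral: number;
}

// Count how often each verdict occurred for one signal.
function summarize(verdicts: Verdict[]): SignalHistory {
  const history: SignalHistory = { improved: 0, degraded: 0, neutral: 0 };
  for (const v of verdicts) history[v] += 1;
  return history;
}

// Blocked when the signal degraded at least twice AND degraded strictly
// more often than it improved. Neutral verdicts do not count either way.
function isSignalBlocked(history: SignalHistory): boolean {
  return history.degraded >= 2 && history.degraded > history.improved;
}
```

Under this reading, a single degraded verdict never blocks a signal, and a signal with two degradations and two improvements stays allowed.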
Changes
Backend:
Frontend:
Test script:
Verified: zero regression on STSb (5K pairs, 5 thresholds) and SICK (9,927 pairs, 5 thresholds) — new code only adds behavior when there is proposal history to evaluate.
Checklist
roborev review --branch or /roborev-review-branch in Claude Code (internal)
Note
Medium Risk
Changes autonomous threshold-tuning behavior (background evaluator + recommendation blocking) and requires DB audit event constraint alignment; mis-evaluation could suppress valid tune proposals, but scope is limited to semantic threshold proposals with existing safeguards.
Overview
Adds a post-apply feedback loop for semantic cache threshold proposals: after the evaluation window, a background job compares similarity-window metrics to the apply-time snapshot, stores an improved / degraded / neutral verdict on the proposal, and records an outcome_evaluated audit event (schema + shared enum + Postgres/SQLite CHECK constraints).
Backend: New CacheOutcomeEvaluator (periodic tick, configurable via OUTCOME_EVAL_* env vars) writes applied_result.details.outcome_evaluation; shared computeMetricsFromSimilarityWindow replaces duplicated logic in the apply dispatcher. Recommendations now read past verdicts per signal and can force optimal when a signal has degraded in 2+ evaluated proposals and degraded more than improved (signalHistoricallyIneffective reasoning).
UI: Proposal history gains an Outcome column; detail panel shows a structured outcome section (verdict, signal, window, before/after metrics) above the raw apply result JSON.
Tests: Multi-cycle test_outcome_evaluator.py script for end-to-end verification against a running monitor.
Reviewed by Cursor Bugbot for commit 1ffc176. Bugbot is set up for automated code reviews on this repo.