feat(cache-proposals): proactive outcome evaluation with UI and histo… by KIvanow · Pull Request #225 · BetterDB-inc/monitor

KIvanow · 2026-05-28T08:21:54Z

Summary

Closes the self-tuning feedback loop: after a threshold proposal is applied, a background evaluator waits 15 minutes, measures whether the adjustment actually helped, and writes a verdict (improved/degraded/neutral) back to the proposal record. The recommendation engine uses accumulated verdicts to block signals that have historically led to degradation on a given cache. A new "Outcome" column in the proposals history table and a structured detail panel make verdicts visible in the UI.

Changes

Backend:

Add CacheOutcomeEvaluator — periodic job (2-min tick, 15-min evaluation window) that finds applied threshold_adjust proposals past the window, reads current similarity window metrics, compares against the snapshot stored at apply time, and writes the verdict to applied_result.details.outcome_evaluation
Add outcome_evaluated audit event type to the shared proposal schema
Add getSignalOutcomeHistory to the recommendation engine — reads past evaluated verdicts for a cache; blocks a signal if it has led to degradation in 2+ proposals and degraded more often than improved
Add signalHistoricallyIneffective reasoning string
Evaluation window and tick interval configurable via OUTCOME_EVAL_WINDOW_MS and OUTCOME_EVAL_TICK_MS env vars (for testing)
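
A minimal sketch of how the two env vars and the "past the window" check could fit together. The defaults (2-minute tick, 15-minute window) and the env var names come from this description; `readEvaluatorConfig`, `findDueProposals`, and the simplified `AppliedProposal` shape are hypothetical names for illustration, not the evaluator's actual code.

```typescript
// Defaults stated in the PR description: 2-minute tick, 15-minute window.
const DEFAULT_TICK_MS = 2 * 60 * 1000;
const DEFAULT_WINDOW_MS = 15 * 60 * 1000;

interface EvaluatorConfig {
  tickMs: number;
  windowMs: number;
}

// Parse a positive integer, falling back when the value is unset or invalid.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Takes the environment as a plain record so the sketch stays self-contained.
function readEvaluatorConfig(env: Record<string, string | undefined>): EvaluatorConfig {
  return {
    tickMs: parsePositiveInt(env["OUTCOME_EVAL_TICK_MS"], DEFAULT_TICK_MS),
    windowMs: parsePositiveInt(env["OUTCOME_EVAL_WINDOW_MS"], DEFAULT_WINDOW_MS),
  };
}

// Hypothetical, simplified proposal shape; the real record has more fields.
interface AppliedProposal {
  id: string;
  type: string;
  appliedAtMs: number;
  outcomeEvaluated: boolean;
}

// A proposal is due once the window has elapsed and no verdict exists yet.
function findDueProposals(
  proposals: AppliedProposal[],
  nowMs: number,
  windowMs: number,
): AppliedProposal[] {
  return proposals.filter(
    (p) =>
      p.type === "threshold_adjust" &&
      !p.outcomeEvaluated &&
      nowMs - p.appliedAtMs >= windowMs,
  );
}
```

In a Node process this would be called as `readEvaluatorConfig(process.env)`; a test run can pass a short window such as `{ OUTCOME_EVAL_WINDOW_MS: "5000" }`.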

Frontend:

HistoryTable: new "Outcome" column with verdict badge (green improved / red degraded / gray neutral)
DetailPanel: structured outcome evaluation section with verdict badge, signal name, evaluation window, human-readable detail line, and before/after metrics comparison in a two-column layout (raw JSON retained below for debugging)
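
The verdict-to-badge mapping above can be sketched as a small lookup. The colors are the ones listed for the HistoryTable column; `outcomeBadge` and the `OutcomeBadge` shape are hypothetical names, and returning `null` for a proposal without a verdict is an assumption about how unevaluated rows are rendered.

```typescript
type Verdict = "improved" | "degraded" | "neutral";
type BadgeColor = "green" | "red" | "gray";

// Color per verdict, as listed in the PR description.
const VERDICT_BADGE_COLOR: Record<Verdict, BadgeColor> = {
  improved: "green",
  degraded: "red",
  neutral: "gray",
};

interface OutcomeBadge {
  label: Verdict;
  color: BadgeColor;
}

// Returns null when the proposal has not been evaluated yet (assumption).
function outcomeBadge(verdict?: Verdict): OutcomeBadge | null {
  if (verdict === undefined) return null;
  return { label: verdict, color: VERDICT_BADGE_COLOR[verdict] };
}
```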

Test script:

scripts/test_outcome_evaluator.py — multi-cycle end-to-end test that stores 300 STSb pairs, triggers autotune, waits for the proactive evaluator to fire (with short window via env vars), and verifies the verdict is written to the proposal record and the recommendation engine references it

Verified: zero regression on STSb (5K pairs, 5 thresholds) and SICK (9,927 pairs, 5 thresholds) — new code only adds behavior when there is proposal history to evaluate.

Checklist

Unit / integration tests added
Docs added / updated
Roborev review passed — run roborev review --branch or /roborev-review-branch in Claude Code (internal)
Competitive analysis done / discussed (internal)
Blog post about it discussed (internal)

Note

Medium Risk
Changes autonomous threshold-tuning behavior (background evaluator + recommendation blocking) and requires DB audit event constraint alignment; mis-evaluation could suppress valid tune proposals, but scope is limited to semantic threshold proposals with existing safeguards.

Overview
Adds a post-apply feedback loop for semantic cache threshold proposals: after the evaluation window, a background job compares similarity-window metrics to the apply-time snapshot, stores an improved / degraded / neutral verdict on the proposal, and records an outcome_evaluated audit event (schema + shared enum + Postgres/SQLite CHECK constraints).
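
As a rough illustration of the before/after comparison, the sketch below classifies a change with a tolerance band. It assumes each metrics snapshot can be reduced to a single higher-is-better score and that small changes count as neutral; neither the score, the `classifyOutcome` name, nor the `0.02` tolerance comes from this PR.

```typescript
type Verdict = "improved" | "degraded" | "neutral";

// Compare an apply-time score with a post-window score.
// `tolerance` is a hypothetical dead band so that noise reads as "neutral".
function classifyOutcome(before: number, after: number, tolerance = 0.02): Verdict {
  const delta = after - before;
  if (delta > tolerance) return "improved";
  if (delta < -tolerance) return "degraded";
  return "neutral";
}
```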

Backend: New CacheOutcomeEvaluator (periodic tick, configurable via OUTCOME_EVAL_* env vars) writes applied_result.details.outcome_evaluation; shared computeMetricsFromSimilarityWindow replaces duplicated logic in the apply dispatcher. Recommendations now read past verdicts per signal and can force an optimal recommendation, blocking further adjustment from that signal, when it has degraded in 2+ evaluated proposals and degraded more often than improved (signalHistoricallyIneffective reasoning).
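
The blocking rule stated above (degraded in 2+ evaluated proposals, and degraded more often than improved) can be sketched as follows. `tallyVerdicts` and `isSignalHistoricallyIneffective` are hypothetical helper names; only the rule itself is taken from the description.

```typescript
type Verdict = "improved" | "degraded" | "neutral";

interface SignalOutcomeHistory {
  improved: number;
  degraded: number;
  neutral: number;
}

// Count past verdicts for one signal on one cache.
function tallyVerdicts(verdicts: Verdict[]): SignalOutcomeHistory {
  const history: SignalOutcomeHistory = { improved: 0, degraded: 0, neutral: 0 };
  for (const verdict of verdicts) {
    history[verdict] += 1;
  }
  return history;
}

// Block only when both conditions from the PR description hold.
function isSignalHistoricallyIneffective(history: SignalOutcomeHistory): boolean {
  return history.degraded >= 2 && history.degraded > history.improved;
}
```

Under this rule a single degraded verdict never blocks a signal, and a signal with two degraded and two improved verdicts stays eligible because it has not degraded more often than it improved.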

UI: Proposal history gains an Outcome column; detail panel shows a structured outcome section (verdict, signal, window, before/after metrics) above the raw apply result JSON.

Tests: Multi-cycle test_outcome_evaluator.py script for end-to-end verification against a running monitor.

Reviewed by Cursor Bugbot for commit 1ffc176. Bugbot is set up for automated code reviews on this repo.

…rical signal weighting

Add CacheOutcomeEvaluator, a periodic job (default: 2-min tick, 15-min evaluation window, configurable via OUTCOME_EVAL_TICK_MS and OUTCOME_EVAL_WINDOW_MS env vars) that evaluates applied threshold proposals after sufficient time has passed:

1. Finds applied threshold_adjust proposals past the evaluation window
2. Reads current similarity window metrics from Valkey
3. Compares against the metrics snapshot stored at apply time
4. Writes a verdict (improved / degraded / neutral) back to the proposal's applied_result.details.outcome_evaluation
5. Appends an audit trail entry (outcome_evaluated event type)

Recommendation engine integration:
- getSignalOutcomeHistory reads evaluated verdicts from past proposals
- If a signal has led to degradation in 2+ evaluated proposals and degraded more often than improved, further adjustments from that signal are blocked with reasoning

UI changes:
- HistoryTable: new "Outcome" column with verdict badge (improved/degraded/neutral) for evaluated proposals
- DetailPanel: structured outcome evaluation section showing verdict, signal, evaluation window, human-readable detail, and before/after metrics comparison in a two-column layout

Test script (scripts/test_outcome_evaluator.py):
- Multi-cycle test that verifies the full feedback loop end-to-end
- Stores and checks 300 STSb pairs, triggers autotune, waits for the proactive evaluator to fire, then verifies the verdict is written and the recommendation engine references it

Verified: zero regression on STSb (5K) and SICK (9,927) benchmarks; the new code only adds behavior when there is proposal history to evaluate.

…evaluator

computeCurrentMetrics was reading the entire similarity window, including entries classified as hit/miss at the OLD threshold. After a threshold adjustment, these stale entries contaminate the metrics: an entry that was a "hit" at the old (looser) threshold may be a "miss" at the new one, but it is still labeled "hit" in the window.

Fix: use ZRANGEBYSCORE with the proposal's applied_at timestamp as the lower bound so only entries recorded AFTER the adjustment are included in the evaluation.
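
A sketch of the post-apply filter described in this commit message. It assumes the similarity window is a sorted set whose scores are record timestamps in milliseconds, which the use of applied_at as a score bound implies but the commit does not state outright. `buildPostApplyRangeArgs`, `entriesSinceApply`, and the `WindowEntry` shape are hypothetical; the second function is an in-memory stand-in for the Valkey call so the behavior can be checked without a server.

```typescript
// Hypothetical window entry: `score` is assumed to be the record time in ms.
interface WindowEntry {
  member: string;
  score: number;
}

// Arguments for `ZRANGEBYSCORE key min max`; both bounds are inclusive by default.
function buildPostApplyRangeArgs(key: string, appliedAtMs: number): string[] {
  return ["ZRANGEBYSCORE", key, String(appliedAtMs), "+inf"];
}

// In-memory equivalent: keep entries recorded at or after the adjustment,
// ordered by ascending score as the command would return them.
function entriesSinceApply(entries: WindowEntry[], appliedAtMs: number): WindowEntry[] {
  return entries
    .filter((entry) => entry.score >= appliedAtMs)
    .sort((a, b) => a.score - b.score);
}
```

Entries older than the bound still carry labels assigned under the previous threshold, so excluding them keeps the evaluation limited to traffic classified at the new one.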

jamby77 · 2026-06-02T05:55:52Z

  'applied',
  'failed',
  'expired',
+  'outcome_evaluated',


This value should probably be added to the persistence-layer constraints:

apps/api/src/storage/adapters/sqlite.adapter.ts:1382 — CHECK (event_type IN (...))

apps/api/src/storage/adapters/postgres.adapter.ts:1623 — same

KIvanow · 2026-06-02

Good catch! I usually use Postgres, but for simplicity during the large number of benchmarks I used the in-memory storage and missed these. Should be fixed now.
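
One way to keep a shared enum and the adapter constraints from drifting apart, as happened in this thread, is to generate the CHECK clause from a single list. This is only a sketch: `buildEventTypeCheck` is a hypothetical helper, and the list below contains just the four event types visible in the diff above, not the full enum.

```typescript
// Only the values visible in the reviewed diff; the real enum has more entries.
const VISIBLE_EVENT_TYPES = ["applied", "failed", "expired", "outcome_evaluated"] as const;

// Build a SQL CHECK clause from a list of allowed values.
// Single quotes are doubled so each value is a valid SQL string literal.
function buildEventTypeCheck(column: string, values: readonly string[]): string {
  const quoted = values.map((value) => `'${value.replace(/'/g, "''")}'`).join(", ");
  return `CHECK (${column} IN (${quoted}))`;
}

const eventTypeCheck = buildEventTypeCheck("event_type", VISIBLE_EVENT_TYPES);
```

Both the SQLite and Postgres adapters could interpolate the same generated clause into their table definitions, although existing tables would still need a migration to pick up a new value.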

jamby77

looks good

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 06e2a19.

…mic audit write

Extract shared similarity-window metrics logic into similarity-metrics.utils.ts so the evaluator and dispatcher stay in sync when formulas change.

Swap the write order in the outcome evaluator: append the audit entry before updating proposal details, so a failed audit insert does not permanently block retry via the outcome_evaluation guard.
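
The write-order change can be illustrated with a small in-memory model. Everything here is hypothetical (`InMemoryStore`, `recordOutcome`, and the synchronous methods are stand-ins for the real storage adapter, which is presumably asynchronous); the point is only that the guard field is written last, so a failed audit insert leaves the proposal eligible for the next tick.

```typescript
// In-memory stand-in for the storage layer, with a switch to simulate one failure.
class InMemoryStore {
  auditLog: string[] = [];
  outcomes = new Map<string, string>();
  failNextAudit = false;

  appendAudit(proposalId: string, eventType: string): void {
    if (this.failNextAudit) {
      this.failNextAudit = false;
      throw new Error("audit insert failed");
    }
    this.auditLog.push(`${proposalId}:${eventType}`);
  }

  saveOutcome(proposalId: string, verdict: string): void {
    this.outcomes.set(proposalId, verdict);
  }
}

// Returns false when the proposal already has an outcome (the retry guard).
function recordOutcome(store: InMemoryStore, proposalId: string, verdict: string): boolean {
  if (store.outcomes.has(proposalId)) return false;
  // Audit first: if this throws, the guard field is still unset and a later
  // tick can retry the whole evaluation.
  store.appendAudit(proposalId, "outcome_evaluated");
  store.saveOutcome(proposalId, verdict);
  return true;
}
```

The trade-off in this sketch is the mirror case: if the details update fails after the audit entry succeeds, a retry appends a second audit entry for the same proposal.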

cursor Bot reviewed May 28, 2026

Comment thread proprietary/cache-proposals/cache-outcome-evaluator.ts Outdated

KIvanow requested a review from jamby77 May 28, 2026 09:03

jamby77 reviewed Jun 2, 2026

fix(cache-proposals): add outcome_evaluated to persistence-layer CHECK constraints (06e2a19)

jamby77 approved these changes Jun 2, 2026

cursor Bot reviewed Jun 2, 2026

Comment thread proprietary/cache-proposals/cache-outcome-evaluator.ts

Comment thread proprietary/cache-proposals/cache-outcome-evaluator.ts

KIvanow merged commit d8cfd2b into master Jun 2, 2026
3 checks passed

KIvanow deleted the feature/outcome-tracking-feedback-loop branch June 2, 2026 12:48

github-actions Bot locked and limited conversation to collaborators Jun 2, 2026

KIvanow merged 4 commits into master from feature/outcome-tracking-feedback-loop
