feat(cache-proposals): proactive outcome evaluation with UI and histo…#225

Merged
KIvanow merged 4 commits into master from feature/outcome-tracking-feedback-loop
Jun 2, 2026

@KIvanow KIvanow commented May 28, 2026

Summary

Closes the self-tuning feedback loop: after a threshold proposal is applied, a background evaluator waits 15 minutes, measures whether the adjustment actually helped, and writes a verdict (improved/degraded/neutral) back to the proposal record. The recommendation engine uses accumulated verdicts to block signals that have historically led to degradation on a given cache. A new "Outcome" column in the proposals history table and a structured detail panel make verdicts visible in the UI.

Changes

Backend:

  • Add CacheOutcomeEvaluator — periodic job (2-min tick, 15-min evaluation window) that finds applied threshold_adjust proposals past the window, reads current similarity window metrics, compares against the snapshot stored at apply time, and writes the verdict to applied_result.details.outcome_evaluation
  • Add outcome_evaluated audit event type to the shared proposal schema
  • Add getSignalOutcomeHistory to the recommendation engine — reads past evaluated verdicts for a cache; blocks a signal if it has led to degradation in 2+ proposals and degraded more often than improved
  • Add signalHistoricallyIneffective reasoning string
  • Evaluation window and tick interval configurable via OUTCOME_EVAL_WINDOW_MS and OUTCOME_EVAL_TICK_MS env vars (for testing)
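As a rough illustration of the configuration described above, the following sketch reads the two env vars with the stated defaults and decides whether an applied proposal is past its evaluation window. Only the env var names, the defaults, and the `threshold_adjust` type come from this PR; `readEvaluatorConfig`, `isDueForEvaluation`, and the `AppliedProposal` shape are hypothetical names used for illustration.

```typescript
// Hypothetical config reader. Only the env var names and defaults come from the PR.
interface EvaluatorConfig {
  windowMs: number;
  tickMs: number;
}

const DEFAULT_WINDOW_MS = 15 * 60 * 1000; // 15-minute evaluation window
const DEFAULT_TICK_MS = 2 * 60 * 1000; // 2-minute tick

// Falls back to the default when the value is missing, non-numeric, or not positive.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

function readEvaluatorConfig(env: Record<string, string | undefined>): EvaluatorConfig {
  return {
    windowMs: parsePositiveInt(env["OUTCOME_EVAL_WINDOW_MS"], DEFAULT_WINDOW_MS),
    tickMs: parsePositiveInt(env["OUTCOME_EVAL_TICK_MS"], DEFAULT_TICK_MS),
  };
}

// Assumed minimal proposal shape, not the real schema.
interface AppliedProposal {
  id: string;
  type: string;
  appliedAt: number; // epoch milliseconds
  outcomeEvaluated: boolean;
}

// A proposal is due once the full window has elapsed since it was applied.
function isDueForEvaluation(p: AppliedProposal, now: number, windowMs: number): boolean {
  return p.type === "threshold_adjust" && !p.outcomeEvaluated && now - p.appliedAt >= windowMs;
}
```

In a Node process the reader would typically be called as `readEvaluatorConfig(process.env)`; taking the environment as a parameter keeps the sketch easy to test with short windows.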

Frontend:

  • HistoryTable: new "Outcome" column with verdict badge (green improved / red degraded / gray neutral)
  • DetailPanel: structured outcome evaluation section with verdict badge, signal name, evaluation window, human-readable detail line, and before/after metrics comparison in a two-column layout (raw JSON retained below for debugging)
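The verdict-to-badge mapping for the Outcome column could look roughly like the sketch below. The three verdicts and their colors are taken from the list above; the `outcomeBadge` helper and its return shape are hypothetical and not the component code from this PR.

```typescript
type Verdict = "improved" | "degraded" | "neutral";

// Colors as listed in the PR description.
const VERDICT_BADGE_COLOR: Record<Verdict, string> = {
  improved: "green",
  degraded: "red",
  neutral: "gray",
};

// Returns null for proposals that have not been evaluated yet, so no badge is rendered.
function outcomeBadge(verdict: Verdict | undefined): { label: string; color: string } | null {
  if (verdict === undefined) return null;
  return { label: verdict, color: VERDICT_BADGE_COLOR[verdict] };
}
```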

Test script:

  • scripts/test_outcome_evaluator.py — multi-cycle end-to-end test that stores 300 STSb pairs, triggers autotune, waits for the proactive evaluator to fire (with short window via env vars), and verifies the verdict is written to the proposal record and the recommendation engine references it

Verified: zero regression on STSb (5K pairs, 5 thresholds) and SICK (9,927 pairs, 5 thresholds) — new code only adds behavior when there is proposal history to evaluate.

Checklist

  • Unit / integration tests added
  • Docs added / updated
  • Roborev review passed — run roborev review --branch or /roborev-review-branch in Claude Code (internal)
  • Competitive analysis done / discussed (internal)
  • Blog post about it discussed (internal)

Note

Medium Risk
Changes autonomous threshold-tuning behavior (background evaluator + recommendation blocking) and requires DB audit event constraint alignment; mis-evaluation could suppress valid tune proposals, but scope is limited to semantic threshold proposals with existing safeguards.

Overview
Adds a post-apply feedback loop for semantic cache threshold proposals: after the evaluation window, a background job compares similarity-window metrics to the apply-time snapshot, stores an improved / degraded / neutral verdict on the proposal, and records an outcome_evaluated audit event (schema + shared enum + Postgres/SQLite CHECK constraints).

Backend: New CacheOutcomeEvaluator (periodic tick, configurable via OUTCOME_EVAL_* env vars) writes applied_result.details.outcome_evaluation; shared computeMetricsFromSimilarityWindow replaces duplicated logic in the apply dispatcher. Recommendations now read past verdicts per signal and can force optimal when a signal has degraded in 2+ evaluated proposals and degraded more than improved (signalHistoricallyIneffective reasoning).

UI: Proposal history gains an Outcome column; detail panel shows a structured outcome section (verdict, signal, window, before/after metrics) above the raw apply result JSON.

Tests: Multi-cycle test_outcome_evaluator.py script for end-to-end verification against a running monitor.

Reviewed by Cursor Bugbot for commit 1ffc176.

…rical signal weighting

Add CacheOutcomeEvaluator — a periodic job (default: 2-min tick, 15-min
evaluation window, configurable via OUTCOME_EVAL_TICK_MS and
OUTCOME_EVAL_WINDOW_MS env vars) that evaluates applied threshold
proposals after sufficient time has passed:

1. Finds applied threshold_adjust proposals past the evaluation window
2. Reads current similarity window metrics from Valkey
3. Compares against the metrics snapshot stored at apply time
4. Writes a verdict (improved / degraded / neutral) back to the
   proposal's applied_result.details.outcome_evaluation
5. Appends an audit trail entry (outcome_evaluated event type)
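
Steps 3 and 4 above amount to comparing two metric snapshots and mapping the difference to a verdict. The commit message does not state which metrics or thresholds the evaluator uses, so the sketch below is only an assumption: it uses a single hypothetical higher-is-better `score` field and a hypothetical `tolerance` band to show the general shape of such a comparison.

```typescript
type Verdict = "improved" | "degraded" | "neutral";

// Hypothetical snapshot shape: one higher-is-better summary metric.
interface MetricsSnapshot {
  score: number;
}

// Changes within +/- tolerance count as neutral (the tolerance value is an assumption).
function judgeOutcome(before: MetricsSnapshot, after: MetricsSnapshot, tolerance = 0.01): Verdict {
  const delta = after.score - before.score;
  if (delta > tolerance) return "improved";
  if (delta < -tolerance) return "degraded";
  return "neutral";
}
```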

Recommendation engine integration:
- getSignalOutcomeHistory reads evaluated verdicts from past proposals
- If a signal has led to degradation in 2+ evaluated proposals and
  degraded more often than improved, further adjustments from that
  signal are blocked with reasoning
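
The blocking rule above (degraded in 2+ evaluated proposals, and degraded more often than improved) can be sketched as a small predicate over a verdict tally. The helper names `tallyVerdicts` and `isSignalHistoricallyIneffective` and the `SignalOutcomeHistory` shape are hypothetical; the PR's `getSignalOutcomeHistory` may return something different.

```typescript
type Verdict = "improved" | "degraded" | "neutral";

// Hypothetical tally shape for one signal on one cache.
interface SignalOutcomeHistory {
  improved: number;
  degraded: number;
  neutral: number;
}

// Counts how often each verdict occurred among evaluated proposals.
function tallyVerdicts(verdicts: readonly Verdict[]): SignalOutcomeHistory {
  const tally: SignalOutcomeHistory = { improved: 0, degraded: 0, neutral: 0 };
  for (const v of verdicts) tally[v] += 1;
  return tally;
}

// Block when the signal degraded at least twice AND degraded more often than it improved.
function isSignalHistoricallyIneffective(history: SignalOutcomeHistory): boolean {
  return history.degraded >= 2 && history.degraded > history.improved;
}
```

Both conditions are needed: a single degradation never blocks a signal, and a signal with two degradations but at least as many improvements stays eligible.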

UI changes:
- HistoryTable: new "Outcome" column with verdict badge (improved/
  degraded/neutral) for evaluated proposals
- DetailPanel: structured outcome evaluation section showing verdict,
  signal, evaluation window, human-readable detail, and before/after
  metrics comparison in a two-column layout

Test script (scripts/test_outcome_evaluator.py):
- Multi-cycle test that verifies the full feedback loop end-to-end
- Stores and checks 300 STSb pairs, triggers autotune, waits for
  the proactive evaluator to fire, then verifies the verdict is
  written and the recommendation engine references it

Verified: zero regression on STSb (5K) and SICK (9,927) benchmarks —
the new code only adds behavior when there is proposal history to
evaluate.
Comment thread proprietary/cache-proposals/cache-outcome-evaluator.ts Outdated
…evaluator

computeCurrentMetrics was reading the entire similarity window including
entries classified as hit/miss at the OLD threshold. After a threshold
adjustment, these stale entries contaminate the metrics — an entry that
was a "hit" at the old (looser) threshold may be a "miss" at the new one,
but it is still labeled "hit" in the window.

Fix: use ZRANGEBYSCORE with the proposal's applied_at timestamp as the
lower bound so only entries recorded AFTER the adjustment are included
in the evaluation.
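
To make the fix concrete, the sketch below shows the command arguments for such a bounded read and an in-memory equivalent of the filter. It assumes the similarity window is a sorted set whose scores are record timestamps in epoch milliseconds; the key name, the `WindowEntry` shape, and both helper names are hypothetical. `ZRANGEBYSCORE key min max` treats `min` as inclusive by default, and a `(` prefix on the bound makes it exclusive.

```typescript
// Assumed entry shape: score is the record timestamp in epoch milliseconds.
interface WindowEntry {
  score: number;
  member: string;
}

// Builds the arguments for a read bounded below by the apply time and unbounded above.
function zrangeByScoreArgs(key: string, appliedAtMs: number): string[] {
  return ["ZRANGEBYSCORE", key, String(appliedAtMs), "+inf"];
}

// In-memory equivalent of the same inclusive lower bound, for illustration and testing.
function entriesSince(entries: readonly WindowEntry[], appliedAtMs: number): WindowEntry[] {
  return entries.filter((e) => e.score >= appliedAtMs);
}
```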
@KIvanow KIvanow requested a review from jamby77 May 28, 2026 09:03
'applied',
'failed',
'expired',
'outcome_evaluated',
Collaborator

This value should probably be added to the persistence-layer constraints:

  • apps/api/src/storage/adapters/sqlite.adapter.ts:1382 — CHECK (event_type IN (...))
  • apps/api/src/storage/adapters/postgres.adapter.ts:1623 — same
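
One way to keep such constraints aligned with the shared enum is to generate the CHECK clause from the same list of values. The sketch below is an assumption, not the adapters' actual code: `AUDIT_EVENT_TYPES` contains only the four values visible in the diff excerpt above (the real enum may have more), and `eventTypeCheckClause` is a hypothetical helper.

```typescript
// Partial list: only the values visible in the diff excerpt. The real enum may contain more.
const AUDIT_EVENT_TYPES = ["applied", "failed", "expired", "outcome_evaluated"] as const;

// Renders a CHECK constraint from a list of allowed values, escaping single quotes.
function eventTypeCheckClause(values: readonly string[]): string {
  const quoted = values.map((v) => `'${v.replace(/'/g, "''")}'`).join(", ");
  return `CHECK (event_type IN (${quoted}))`;
}

const checkClause = eventTypeCheckClause(AUDIT_EVENT_TYPES);
```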

Member Author

Good catch! I usually use Postgres, but for simplicity while running the large number of benchmarks I used the in-memory storage and missed these. Should be fixed now.

Collaborator

@jamby77 jamby77 left a comment

looks good

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 06e2a19.

Comment thread proprietary/cache-proposals/cache-outcome-evaluator.ts
Comment thread proprietary/cache-proposals/cache-outcome-evaluator.ts
…mic audit write

Extract shared similarity-window metrics logic into
similarity-metrics.utils.ts so the evaluator and dispatcher
stay in sync when formulas change.

Swap the write order in the outcome evaluator: append the audit
entry before updating proposal details, so a failed audit insert
does not permanently block retry via the outcome_evaluation guard.
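
The write-order change can be illustrated with a minimal sketch. It assumes the retry guard is simply "skip proposals that already have `outcome_evaluation` set"; the `Proposal` and `AuditLog` shapes and the `recordOutcome` helper are hypothetical, and the calls are synchronous only for brevity.

```typescript
// Hypothetical shapes. Real storage calls would be asynchronous.
interface Proposal {
  id: string;
  details: { outcome_evaluation?: { verdict: string } };
}

interface AuditLog {
  append(entry: { proposalId: string; eventType: "outcome_evaluated" }): void;
}

// Returns true when the outcome was recorded, false when the guard skipped it.
function recordOutcome(proposal: Proposal, verdict: string, audit: AuditLog): boolean {
  // Guard: a proposal that already carries a verdict is never evaluated again.
  if (proposal.details.outcome_evaluation !== undefined) return false;

  // Audit first. If this throws, the proposal is untouched and the next tick retries.
  audit.append({ proposalId: proposal.id, eventType: "outcome_evaluated" });

  // Only after the audit entry exists is the verdict stored, which arms the guard.
  proposal.details.outcome_evaluation = { verdict };
  return true;
}
```

With the opposite order, a failed audit insert would leave the verdict stored and the guard armed, so the audit entry would never be written. Under the assumptions of this sketch, the remaining trade-off is that a failed details update after a successful audit append would cause a second audit entry on retry.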
@KIvanow KIvanow merged commit d8cfd2b into master Jun 2, 2026
3 checks passed
@KIvanow KIvanow deleted the feature/outcome-tracking-feedback-loop branch June 2, 2026 12:48
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 2, 2026
2 participants