diff --git a/docs/context-rot-thresholds-spec.md b/docs/context-rot-thresholds-spec.md new file mode 100644 index 0000000..b3bc592 --- /dev/null +++ b/docs/context-rot-thresholds-spec.md @@ -0,0 +1,256 @@ +# Context-rot thresholds spec + +Pinned 2026-04-28 for GET-28. Owner: Saar AI Usage Coach. + +This document is the human-readable companion to +`lib/context-rot-thresholds.ts`. The runtime file is the source of truth +for the values; this document is the source of truth for **why** the +values are what they are. If Anthropic publishes new MRCR numbers or +ships new models, update both files together. The drift test in +`tests/unit/context-rot-thresholds.test.ts` will fail loudly if a +threshold changes without its provenance fields being updated. + +## What problem are we solving + +Saar's AI Usage Coach exists to teach claude.ai users that context rot +is real, that different models hit it at different points, and that +continuing past the rot zone produces unreliable answers (Claude +"hallucinates earlier details" because retrieval has degraded). + +The previous health indicator used model-agnostic thresholds (70% warn, +90% critical). That was wrong: Sonnet 4.5 retrieves 18.5% at full +window while Opus 4.6 retrieves 76% at the same length. Treating them +identically pushed the wrong message at the wrong percentage. + +GET-28 makes the threshold and the coaching copy per-model and grounds +both in Anthropic's published facts. + +## Anthropic-published facts we rely on + +These are the only externally-sourced figures used in the threshold +table. Every other number is a Saar coaching default (documented below). + +### Fact 1: context rot is real and Anthropic acknowledges it + +> "As token count grows, accuracy and recall degrade, a phenomenon known +> as *context rot*." 
+ +Source: https://platform.claude.com/docs/en/build-with-claude/context-windows + +### Fact 2: MRCR scores at full window for Opus 4.6 and Sonnet 4.5 + +> "on the 8-needle 1M variant of MRCR v2, a needle-in-a-haystack +> benchmark that tests a model's ability to retrieve information +> 'hidden' in vast amounts of text, Opus 4.6 scores 76%, whereas +> Sonnet 4.5 scores just 18.5%." + +Source: https://www.anthropic.com/news/claude-opus-4-6 + +### Fact 3: per-model context window sizes + +From the same context-windows docs: + +> "Claude Mythos Preview, Claude Opus 4.7, Claude Opus 4.6, and Claude +> Sonnet 4.6 have a 1M-token context window. Other Claude models, +> including Claude Sonnet 4.5 and Sonnet 4 (deprecated), have a 200k-token +> context window." + +### Fact 4: server-side compaction availability + +From the same docs page: + +> "Compaction provides server-side summarization that automatically +> condenses earlier parts of a conversation, enabling long-running +> conversations beyond context limits with minimal integration work. +> It is currently available in beta for Claude Opus 4.7, Claude Opus 4.6, +> and Claude Sonnet 4.6." + +This drives our coaching copy distinction: compaction-aware models get +softer "consider starting fresh" copy, non-compaction models get harder +"start a new chat now, use Projects" copy. + +## What is NOT an Anthropic fact + +The following are explicitly Saar coaching defaults, not Anthropic +claims. We say so in the runtime copy when relevant. + +- Anthropic does not publish a rot CURVE. They publish full-window MRCR + scores for two models (Fact 2), not intermediate points. Picking a + "warn at 65%" or "critical at 75%" is Saar's coaching judgment, + derived from those scores. +- Anthropic does not say "warn at X%" or "/compact at Y%". Anything + resembling such a threshold in our copy is our recommendation, not a + quote. +- Anthropic does not publish MRCR figures for Opus 4.7 or Sonnet 4.6. 
+ Our thresholds for those models extrapolate from siblings (Opus 4.6 + and Sonnet 4.5) and from Anthropic's qualitative claim that 4.6+ + models show "marked improvement" in long-context behavior. + +## The threshold table + +| Model prefix | Window | Warn | Crit | Detail-heavy adjust | MRCR @ 1M | Compaction | +|---|---|---|---|---|---|---| +| `claude-opus-4-7` | 1M | 65% | 85% | -15pp | not published | yes | +| `claude-opus-4-6` | 1M | 65% | 85% | -15pp | **76%** | yes | +| `claude-sonnet-4-6` | 1M | 60% | 80% | -15pp | not published | yes | +| `claude-sonnet-4-5` | 200k | 50% | 75% | -15pp | **18.5%** | no | +| `claude-haiku-4-5` | 200k | 50% | 75% | -15pp | not published | no | +| `claude-opus-4-5` | 200k | 50% | 75% | -15pp | not published | no | +| `claude-opus-4-1` | 200k | 50% | 75% | -15pp | not published | no | +| (unknown) | 200k | 50% | 75% | -15pp | not published | no | + +Bolded MRCR values are Anthropic-published and pinned in +`sourceUrl` + `sourceQuote` fields on the matching profile row. + +### Why these warn / critical numbers + +**1M-window Opus models (warn 65, crit 85).** Opus 4.6 retrieves 76% of +hidden information at full 1M window. We treat that as "high-fidelity +retrieval still possible at the limit, so the rot zone starts late." +Warn at two-thirds utilization is a coaching choice that gives the user +room to act. Crit at 85 is a Saar default; 90% is the absolute floor +where any model is unreliable, so 85 leaves a 5pp buffer for the in-rot +warning to register before the absolute floor takes over. Opus 4.7 +inherits the same numbers until Anthropic publishes a separate score. + +**1M-window Sonnet 4.6 (warn 60, crit 80).** No Anthropic-published +MRCR. Anthropic markets Sonnet 4.6 as a long-context improvement over +4.5 but does not put a number on it. We picked thresholds between +Sonnet 4.5 (50/75) and Opus 4.6 (65/85) to reflect the "better than +its predecessor, not as good as Opus" qualitative position. 
+ +**200k-window Sonnet 4.5 (warn 50, crit 75).** Sonnet 4.5 retrieves +just 18.5% at 1M. While the user is on a 200k window (so they cannot +drive it to 1M directly), the curve implied by that endpoint suggests +accuracy is meaningfully degraded by half-window utilization. Warn at +50% is the half-window choice that matches the implied curve. Crit at +75 leaves a 15pp band for the user to react before the absolute floor. + +**200k-window Haiku 4.5 (warn 50, crit 75).** No Anthropic-published +MRCR. Use the conservative 200k profile by default. Haiku is meant for +short, simple questions; in practice it rarely fills its window in a +chat session. + +**Older 200k-window Opus models (4.5, 4.1).** Devanshu confirmed (GET-28 +brief): nobody uses these on claude.ai today. We keep them in the table +so the indicator does not silently fall through to the unknown-model +fallback if anyone has them pinned. + +### Why -15 percentage points for detail-heavy + +Detail-heavy prompts (code blocks, precision keywords like "exact", +"verbatim", "list every") raise the cost of retrieval failure: the user +specifically wants Claude to find earlier details, so a degraded +retrieval rate matters more than for casual questions. + +15 percentage points is the value documented in the GET-28 issue brief. +It moves the warning meaningfully earlier without flooding the user on +a normal Q&A. Detail-heavy never lowers a threshold below +`MIN_THRESHOLD_FLOOR` (30%), guarding against future tuning that stacks +multiple downward adjustments. + +### Why an absolute critical floor at 90% + +`ABSOLUTE_CRITICAL_FLOOR` = 90% is model-agnostic. Even on Opus 4.7, +once the user is at 90%+ of the window, retrieval is unreliable in +practice and the user is one large message away from a hard refusal. +The floor is a safety net so the indicator goes red even if a future +model is added without per-model thresholds. + +## Coaching copy contract + +Three zones, two model classes, optional MRCR citation. 
Generated by +`getRotCoaching(model, contextPct, isDetailHeavy)`. + +### Healthy zone + +- Below 30% context: "Conversation is fresh and responsive." +- 30%+ but under warn: "{N}% of {label}'s {window} window used. Plenty of room." + +Low-friction. Names the model only when there is anything useful to say +beyond "fresh". + +### Approaching zone (warn ≤ pct < crit) + +The educational moment. Three components: + +1. Lede: "Approaching the zone where retrieval declines." +2. MRCR clause when available: "On Anthropic's 8-needle 1M MRCR + benchmark, {label} retrieves {N}% at full window." +3. Action: + - Compaction-aware: "Anthropic's server-side compaction handles long + sessions, but for accuracy-critical work consider starting a new + chat." + - Non-compaction: "For accuracy-critical work, start a new chat now." + +### In-rot zone (pct ≥ crit, or pct ≥ ABSOLUTE_CRITICAL_FLOOR) + +Direct, action-first. + +- Compaction-aware: "{N}% used. Even with compaction, fine details + from earlier may be missed. Start a new chat for new threads of work." +- Non-compaction: "{N}% used. Retrieval is unreliable here. Start a new + chat. Use Projects to keep ongoing work organized." + +### Things the copy never says + +- `/compact`. That is a Claude Code slash command and does not exist + on claude.ai web. Mentioning it would teach the wrong action. +- "Anthropic recommends starting fresh at X%." Anthropic does not + publish that recommendation. The action is Saar's. +- A specific MRCR figure for any model whose profile does not carry + one in `mrcrAt1MPct`. The copy generator omits the citation entirely + when `mrcrAt1MPct` is undefined. + +## Drift policy + +When Anthropic publishes a new fact: + +1. Update `lib/context-rot-thresholds.ts`. Add the new MRCR figure to + the relevant profile's `mrcrAt1MPct`, `sourceUrl`, `sourceQuote`. +2. Update this spec doc with the verbatim quote and the source URL. +3. 
The drift test verifies that any profile with `mrcrAt1MPct` carries + matching `sourceUrl` and `sourceQuote`, and that the quote contains + the figure verbatim. If the test fails, fix the data, do not delete + the assertion. + +When Anthropic deprecates a model: + +1. Keep the row in the table. Old extension installs may still see the + model name on a stale tab. +2. Add a comment marking it deprecated. Do not silently delete; that + would fall through to FALLBACK_PROFILE without explanation. + +When a new model ships: + +1. Add a row. Use the closest sibling's thresholds as defaults. +2. If Anthropic publishes an MRCR figure, fill in the provenance fields + immediately. +3. Update the spec table above and the rationale paragraph. + +## Out of scope (deferred to follow-up tickets) + +- **Detail-heavy on the draft (pre-submit).** Today, we evaluate + detail-heavy on the last *sent* prompt. Pre-submit detection on the + draft would require plumbing the compose-box observer into the health + recompute and is bigger scope. Cycle 2. +- **GET-36 fresh-session false-rot.** The suspected cause named in + GET-36's body (`escalateForProjection` from a never-merged commit) + does not exist on `main`. The bug, if reproducible on `main`, has a + different root cause. Investigate after GET-28 lands. +- **"200% session %" calibration bug.** That is the weekly-cap display, + not context rot. Separate ticket. +- **Visual rot-zone marker on the context bar.** A subtle tick at the + warn threshold. Real value, but a UI ticket. +- **Empirical calibration via Anthropic's `count_tokens` endpoint.** + Requires API-key plumbing and a privacy review. 
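The drift check described under "Drift policy" above is not shown in this part of the change. What follows is a minimal sketch of what such a check might look like, not the actual test. It assumes a trimmed-down profile shape (`ProvenanceFields`) and a hypothetical helper name (`findDriftErrors`); the real test in `tests/unit/context-rot-thresholds.test.ts` may be structured differently.

```typescript
// Hypothetical, trimmed-down view of a profile row: only the fields the
// drift check reads. Field names follow ContextRotProfile.
interface ProvenanceFields {
  modelPrefix: string;
  mrcrAt1MPct?: number;
  sourceUrl?: string;
  sourceQuote?: string;
}

// Hypothetical helper: returns one message per provenance problem.
// An empty array means every published figure is backed by its quote.
function findDriftErrors(profiles: ProvenanceFields[]): string[] {
  const errors: string[] = [];
  for (const profile of profiles) {
    // Rows without a published figure are Saar defaults; nothing to verify.
    if (profile.mrcrAt1MPct === undefined) continue;
    if (!profile.sourceUrl) {
      errors.push(`${profile.modelPrefix}: mrcrAt1MPct set without sourceUrl`);
    }
    if (!profile.sourceQuote) {
      errors.push(`${profile.modelPrefix}: mrcrAt1MPct set without sourceQuote`);
      continue;
    }
    // The quote must contain the figure verbatim, e.g. "18.5%".
    const figure = `${profile.mrcrAt1MPct}%`;
    if (!profile.sourceQuote.includes(figure)) {
      errors.push(`${profile.modelPrefix}: sourceQuote does not contain ${figure}`);
    }
  }
  return errors;
}

// Sample rows for illustration only. The quote is abbreviated from Fact 2,
// and the third row carries a deliberately drifted figure (19, not 18.5).
const abbreviatedQuote = 'Opus 4.6 scores 76%, whereas Sonnet 4.5 scores just 18.5%.';
const announcementUrl = 'https://www.anthropic.com/news/claude-opus-4-6';
const sampleProfiles: ProvenanceFields[] = [
  { modelPrefix: 'claude-opus-4-6', mrcrAt1MPct: 76, sourceUrl: announcementUrl, sourceQuote: abbreviatedQuote },
  { modelPrefix: 'claude-sonnet-4-6' },
  { modelPrefix: 'claude-sonnet-4-5', mrcrAt1MPct: 19, sourceUrl: announcementUrl, sourceQuote: abbreviatedQuote },
];

const driftErrors = findDriftErrors(sampleProfiles);
console.log(driftErrors);
```

In this sketch only the third sample row is reported, because its figure does not appear in the quote. A plain substring match is loose (a figure of 6 would match inside "76%"), so a stricter comparison may be worth considering.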
+ +## Sources + +- Anthropic Opus 4.6 announcement (MRCR scores): + https://www.anthropic.com/news/claude-opus-4-6 +- Anthropic context-windows docs (rot acknowledgment, model windows, + compaction availability): + https://platform.claude.com/docs/en/build-with-claude/context-windows +- GET-28 Linear issue: + https://linear.app/getsaar/issue/GET-28/context-rot-warning-per-model +- Verified live on 2026-04-28 against the live pages above. diff --git a/entrypoints/claude-ai.content.ts b/entrypoints/claude-ai.content.ts index 87a22b0..385aa89 100644 --- a/entrypoints/claude-ai.content.ts +++ b/entrypoints/claude-ai.content.ts @@ -182,6 +182,14 @@ async function initializeMonitoring(): Promise { // Resets on SPA navigation alongside other conversation state. let consecutiveShortFollowUps = 0; + // Whether the last sent prompt demanded precise / exhaustive recall on + // prior context (code blocks or precision keywords). Used by the Health + // Agent to shift its per-model warn and critical thresholds earlier, see + // lib/context-rot-thresholds.ts. Computed in inject.ts on STREAM_COMPLETE + // and forwarded via the bridge so raw prompt text never crosses worlds. + // Resets on SPA navigation alongside other conversation state. + let lastDetailHeavy = false; + // Last known 5-hour session utilization from the Anthropic usage endpoint. // Snapshot before each stream, then read again after STREAM_COMPLETE to // compute the exact session % consumed by that message (delta tracking). @@ -249,10 +257,14 @@ async function initializeMonitoring(): Promise { // buildConvStateFromRecord receives it to avoid duplication. state = applyRestoredConversation(state, record, null); convState = buildConvStateFromRecord(record, state.contextPct ?? 0); + // Restore path: we don't have the live draft so detail-heavy is + // false. Model comes from the stored conversation record. 
const health = computeHealthScore({ contextPct: convState.contextPct, turnCount: convState.turnCount, growthRate: computeGrowthRate(convState.contextHistory), + model: convState.model, + isDetailHeavy: false, }); state = { ...state, health }; } @@ -283,10 +295,14 @@ async function initializeMonitoring(): Promise { cumulativeCost = record.estimatedCost ?? 0; state = applyRestoredConversation(state, record, null); convState = buildConvStateFromRecord(record, state.contextPct ?? 0); + // Async restore path. Same reasoning as the synchronous restore + // block above: detail-heavy unknown -> false; model from record. const health = computeHealthScore({ contextPct: convState.contextPct, turnCount: convState.turnCount, growthRate: computeGrowthRate(convState.contextHistory), + model: convState.model, + isDetailHeavy: false, }); state = { ...state, health }; overlay.render(state); @@ -429,6 +445,13 @@ async function initializeMonitoring(): Promise { contextWindow: ctxWindow, }; + // Update the cached detail-heavy flag from this turn's prompt + // characteristics (inject.ts sets msg.isDetailHeavy from the live + // promptText; raw text never crosses the bridge). The Health + // Agent reads this to shift its per-model thresholds when the + // user is asking for precise / exhaustive recall. + lastDetailHeavy = msg.isDetailHeavy ?? false; + // Compute full next state in one step, render once. 
state = { ...applyStreamComplete(state, msg), @@ -443,6 +466,8 @@ async function initializeMonitoring(): Promise { contextPct: currentContextPct, turnCount: convState.turnCount, growthRate: computeGrowthRate(convState.contextHistory), + model: msg.model, + isDetailHeavy: lastDetailHeavy, }), }; overlay.render(state); @@ -976,6 +1001,7 @@ async function initializeMonitoring(): Promise { cumulativeOutput = 0; cumulativeCost = 0; consecutiveShortFollowUps = 0; + lastDetailHeavy = false; deltaHistory = []; firstTurnDelta = null; dismissed = new Set(); @@ -1006,10 +1032,16 @@ async function initializeMonitoring(): Promise { cumulativeCost = record.estimatedCost ?? 0; state = applyRestoredConversation(state, record, null); convState = buildConvStateFromRecord(record, state.contextPct ?? 0); + // SPA navigation restore. New conversation, no live draft + // text, so detail-heavy resets to false here. Model from + // the stored record. The next STREAM_COMPLETE on the new + // conversation will refresh both fields. const health = computeHealthScore({ contextPct: convState.contextPct, turnCount: convState.turnCount, growthRate: computeGrowthRate(convState.contextHistory), + model: convState.model, + isDetailHeavy: false, }); state = { ...state, health }; overlay.render(state); diff --git a/entrypoints/inject.ts b/entrypoints/inject.ts index 70a6f3d..6760819 100644 --- a/entrypoints/inject.ts +++ b/entrypoints/inject.ts @@ -103,6 +103,7 @@ export default defineUnlistedScript(() => { promptLength?: number; hasCodeBlock?: boolean; isShortFollowUp?: boolean; + isDetailHeavy?: boolean; organizationId?: string; draftCharCount?: number; }) { @@ -424,6 +425,24 @@ export default defineUnlistedScript(() => { const hasCodeBlock = promptText.includes('```'); const isShortFollowUp = promptText.length > 0 && promptText.length < 50; // mirrors SHORT_FOLLOWUP_MAX_CHARS + // Detail-heavy detection. Mirrors lib/prompt-analysis.ts isDetailHeavy() + // and DETAIL_HEAVY_KEYWORDS. 
Inject.ts cannot import from lib/, so the + // keyword list is duplicated here. If you change either side, change both. + // Code fences are treated as detail-heavy without a keyword scan. + const detailHeavyKeywords = [ + 'exact', 'exactly', 'precise', 'precisely', 'verbatim', + 'exhaustive', + 'complete list', 'full list', 'list every', 'list all', + 'every detail', 'all details', + ]; + let isDetailHeavy = hasCodeBlock; + if (!isDetailHeavy && promptText.length > 0) { + const lowered = promptText.toLowerCase(); + for (const keyword of detailHeavyKeywords) { + if (lowered.includes(keyword)) { isDetailHeavy = true; break; } + } + } + // Send final complete event to the content script bridge postSecureBatch({ type: 'STREAM_COMPLETE', @@ -435,6 +454,7 @@ export default defineUnlistedScript(() => { promptLength, hasCodeBlock, isShortFollowUp, + isDetailHeavy, ...(orgId ? { organizationId: orgId } : {}), }); diff --git a/entrypoints/sidepanel/hooks/useDashboardData.ts b/entrypoints/sidepanel/hooks/useDashboardData.ts index 2a904e7..59cb10a 100644 --- a/entrypoints/sidepanel/hooks/useDashboardData.ts +++ b/entrypoints/sidepanel/hooks/useDashboardData.ts @@ -216,10 +216,15 @@ export function useDashboardData(): DashboardData { if (conv) { const growthRate = computeGrowthRate(conv.turns.map(t => t.contextPct)); + // Side panel renders historical / restored conversations. We do not + // have the live draft text here, so detail-heavy is conservatively + // false. Model comes from the stored conversation record. 
const health = computeHealthScore({ contextPct: conv.lastContextPct, turnCount: conv.turnCount, growthRate, + model: conv.model, + isDetailHeavy: false, }); setActiveHealth(health); } else { diff --git a/lib/bridge-validation.ts b/lib/bridge-validation.ts index e7790ca..47da767 100644 --- a/lib/bridge-validation.ts +++ b/lib/bridge-validation.ts @@ -33,6 +33,7 @@ export function isValidBridgeSchema(data: any): boolean { if (data.promptLength !== undefined && typeof data.promptLength !== 'number') return false; if (data.hasCodeBlock !== undefined && typeof data.hasCodeBlock !== 'boolean') return false; if (data.isShortFollowUp !== undefined && typeof data.isShortFollowUp !== 'boolean') return false; + if (data.isDetailHeavy !== undefined && typeof data.isDetailHeavy !== 'boolean') return false; } } diff --git a/lib/context-rot-thresholds.ts b/lib/context-rot-thresholds.ts new file mode 100644 index 0000000..5b731aa --- /dev/null +++ b/lib/context-rot-thresholds.ts @@ -0,0 +1,413 @@ +// lib/context-rot-thresholds.ts +// +// Per-model context rot agent. Single source of truth for: +// 1. The context window size that defines 100% utilization +// 2. The "warn" threshold where Saar starts coaching the user toward fresh +// 3. The "critical" threshold where retrieval is likely already unreliable +// 4. The "detail-heavy" adjustment when the user's prompt demands precision +// 5. The user-facing coaching copy, grounded in Anthropic-published facts +// where they exist and clearly marked as Saar defaults where they don't +// +// No DOM refs, no chrome APIs, no side effects. Pure functions only. +// +// ── Why this file exists ────────────────────────────────────────────────── +// +// Anthropic publishes that "as token count grows, accuracy and recall +// degrade, a phenomenon known as context rot" (platform.claude.com docs, +// build-with-claude/context-windows). 
They publish two anchor data points: +// +// - Opus 4.6 retrieves 76% on the 8-needle 1M variant of MRCR v2 +// - Sonnet 4.5 retrieves 18.5% on the same benchmark at the same length +// +// Source: https://www.anthropic.com/news/claude-opus-4-6 +// +// Anthropic does NOT publish a rot CURVE. They give endpoints, not a graph. +// So picking "the threshold at which a user should be warned" is Saar's +// coaching judgment, derived from those endpoints. This file is that +// judgment, encoded once, with provenance. +// +// Why hard-code rather than infer: +// - We refuse to invent Anthropic claims. If Anthropic says nothing about +// Opus 4.7, we use Opus 4.6's pattern as a sibling and document why. +// - The spec doc (docs/context-rot-thresholds-spec.md) is the human-readable +// companion that explains every choice. If Anthropic publishes new MRCR +// numbers, update both this file and the spec, nowhere else. +// +// ── Detail-heavy adjustment ────────────────────────────────────────────── +// +// Retrieval failure cost rises sharply when the user's prompt demands +// precision. A casual "summarize this thread" can tolerate fuzzy recall; +// "list every parameter we discussed for the auth refactor" cannot. When +// the prompt signals high precision (code blocks, or precision keywords), +// we subtract DETAIL_HEAVY_ADJUSTMENT from both the warn and critical +// thresholds. This shifts the warning earlier, where it matters most. +// +// We floor the adjusted threshold at MIN_THRESHOLD_FLOOR so a stack of +// adjustments cannot push the warning to absurdly low context. +// ───────────────────────────────────────────────────────────────────────── + +// ── Types ──────────────────────────────────────────────────────────────── + +export type RotZone = 'healthy' | 'approaching' | 'in-rot'; + +/** + * One row in the threshold table. Matched by longest model-name prefix. 
+ * Anything in `mrcrAt1MPct` / `sourceUrl` / `sourceQuote` MUST come from a + * verified Anthropic publication. Saar defaults are absent from those fields. + */ +export interface ContextRotProfile { + /** Model-name prefix (longest match wins). */ + modelPrefix: string; + /** Display label used in coaching copy. */ + label: string; + /** Token capacity that defines 100% utilization. Mirrors pricing.json. */ + contextWindow: number; + /** % of window at which Saar starts the "approaching rot" warning. */ + warnAtPct: number; + /** % of window at which Saar declares "in rot zone". */ + criticalAtPct: number; + /** + * Anthropic-published 8-needle 1M MRCR v2 score for this model, if any. + * Only set when we can quote the exact figure verbatim from an + * Anthropic source. Used in coaching copy to show the user the + * primary-source evidence behind the warning. + */ + mrcrAt1MPct?: number; + /** Anthropic primary-source URL for the MRCR figure. */ + sourceUrl?: string; + /** Verbatim quote pinning the figure, for the spec drift test. */ + sourceQuote?: string; + /** + * Whether Anthropic offers server-side compaction for this model on its + * platform. When true, the coaching copy softens, since Anthropic itself + * handles long sessions. When false, the user must act (start fresh, + * use Projects). Compaction beta as of 2026-04: Opus 4.7, Opus 4.6, + * Sonnet 4.6. + */ + hasServerSideCompaction: boolean; +} + +// ── Constants ──────────────────────────────────────────────────────────── + +/** + * Subtracted from both warnAtPct and criticalAtPct when the latest prompt + * is detail-heavy. 15 percentage points is the value documented in the + * GET-28 issue spec: enough to noticeably shift the warning earlier, not + * so much it floods the user on a normal Q&A. + */ +export const DETAIL_HEAVY_ADJUSTMENT = 15; + +/** + * Floor applied after the detail-heavy adjustment. 
Prevents the threshold + * from sliding below a sensible minimum if future tuning combines multiple + * downward adjustments. 30% is far enough below any current model's warn + * threshold that the floor only fires in adjustment edge cases. + */ +export const MIN_THRESHOLD_FLOOR = 30; + +/** + * Absolute hard cap, model-agnostic. No model is reliable above this + * utilization, regardless of its published MRCR score. Acts as a safety + * net so the critical signal always fires near the window limit even if + * the model lookup falls through. This mirrors the legacy DEGRADING_CEIL + * behavior so existing audit assertions about "context >= 90 = critical" + * still hold. + */ +export const ABSOLUTE_CRITICAL_FLOOR = 90; + +/** + * Below this context % the coaching layer keeps the message minimal + * ("fresh and responsive") instead of spelling out the percentage. The + * Health Agent uses the same constant to gate its "fast growth" warning, + * since tiny chats can show large per-turn growth as a percentage with + * no real risk. Lives here so both the threshold agent's coaching copy + * and the health-score rule see the same value without a circular import. + */ +export const LOW_CONTEXT_REASSURANCE_CEIL = 30; + +// Source URLs pinned once. If they 404 in the spec drift test, we know. +const URL_OPUS_4_6_ANNOUNCEMENT = 'https://www.anthropic.com/news/claude-opus-4-6'; +const URL_CONTEXT_WINDOWS_DOCS = 'https://platform.claude.com/docs/en/build-with-claude/context-windows'; + +// Verbatim quote from the Opus 4.6 announcement, pinned for the drift test. +// Both Opus 4.6 and Sonnet 4.5 share this quote since it names both scores. 
+const QUOTE_MRCR_OPUS_4_6_AND_SONNET_4_5 = + 'on the 8-needle 1M variant of MRCR v2, a needle-in-a-haystack benchmark that tests a model\'s ability to retrieve information "hidden" in vast amounts of text, Opus 4.6 scores 76%, whereas Sonnet 4.5 scores just 18.5%.'; + +// ── Threshold table ────────────────────────────────────────────────────── +// +// Order does not affect matching: getRotProfile() scans every row and +// keeps the longest matching prefix. Rows are grouped by window size and +// generation for readability. Each row carries enough context +// (label + window + provenance) that the coaching string can be +// assembled without lookups elsewhere. +// +// Threshold derivation (see docs/context-rot-thresholds-spec.md for full +// rationale): +// - 1M-window models with strong long-context performance (Opus 4.6 at +// 76% MRCR) hold accuracy further into the window. Warn at 65, crit +// at 85. Opus 4.7 inherits this until Anthropic publishes a number. +// - 1M-window Sonnet 4.6 has no published MRCR. Anthropic markets it as +// a long-context improvement over Sonnet 4.5. We warn at 60 (more +// conservative than Opus, less conservative than the 200k Sonnet 4.5) +// and crit at 80. +// - 200k-window models with weak long-context (Sonnet 4.5 at 18.5% +// MRCR) lose accuracy fast as % climbs. Warn at 50, crit at 75. +// - Haiku 4.5 has no published MRCR but a smaller window. Use the +// conservative 200k profile. +// - Older 200k-window Opus models (4.5, 4.1) inherit the 200k profile. +// +// Saar defaults (warn/crit) are NOT Anthropic-published thresholds. The +// spec doc says so explicitly. Only the mrcrAt1MPct field is sourced. + +export const ROT_PROFILES: ContextRotProfile[] = [ + // 1M-window flagships, compaction-aware. Opus 4.6 has a verified MRCR. 
+ { + modelPrefix: 'claude-opus-4-7', + label: 'Opus 4.7', + contextWindow: 1_000_000, + warnAtPct: 65, + criticalAtPct: 85, + hasServerSideCompaction: true, + }, + { + modelPrefix: 'claude-opus-4-6', + label: 'Opus 4.6', + contextWindow: 1_000_000, + warnAtPct: 65, + criticalAtPct: 85, + mrcrAt1MPct: 76, + sourceUrl: URL_OPUS_4_6_ANNOUNCEMENT, + sourceQuote: QUOTE_MRCR_OPUS_4_6_AND_SONNET_4_5, + hasServerSideCompaction: true, + }, + { + modelPrefix: 'claude-sonnet-4-6', + label: 'Sonnet 4.6', + contextWindow: 1_000_000, + warnAtPct: 60, + criticalAtPct: 80, + hasServerSideCompaction: true, + }, + + // 200k-window models, no server-side compaction available. Sonnet 4.5 + // is the rot exemplar: 18.5% MRCR at full window means accuracy is + // measurably degraded well before the user hits the limit. + { + modelPrefix: 'claude-sonnet-4-5', + label: 'Sonnet 4.5', + contextWindow: 200_000, + warnAtPct: 50, + criticalAtPct: 75, + mrcrAt1MPct: 18.5, + sourceUrl: URL_OPUS_4_6_ANNOUNCEMENT, + sourceQuote: QUOTE_MRCR_OPUS_4_6_AND_SONNET_4_5, + hasServerSideCompaction: false, + }, + { + modelPrefix: 'claude-haiku-4', + label: 'Haiku 4.5', + contextWindow: 200_000, + warnAtPct: 50, + criticalAtPct: 75, + hasServerSideCompaction: false, + }, + + // Legacy Opus generations on the 200k window. Devanshu notes nobody + // uses these on claude.ai today, but we keep them in the table so the + // health indicator does not silently fall through to the unknown-model + // fallback for anyone who still has them pinned. + { + modelPrefix: 'claude-opus-4-5', + label: 'Opus 4.5', + contextWindow: 200_000, + warnAtPct: 50, + criticalAtPct: 75, + hasServerSideCompaction: false, + }, + { + modelPrefix: 'claude-opus-4-1', + label: 'Opus 4.1', + contextWindow: 200_000, + warnAtPct: 50, + criticalAtPct: 75, + hasServerSideCompaction: false, + }, +]; + +/** + * Fallback profile for unknown / unrecognized model strings. 
Intentionally + * conservative: a 200k window with the same warn/crit defaults as Sonnet + * 4.5. Better to over-warn an unknown model than to leave the user + * unprotected on a model we have not characterized. + */ +export const FALLBACK_PROFILE: ContextRotProfile = { + modelPrefix: '', + label: 'this model', + contextWindow: 200_000, + warnAtPct: 50, + criticalAtPct: 75, + hasServerSideCompaction: false, +}; + +// ── Lookup ────────────────────────────────────────────────────────────── + +/** + * Returns the rot profile for the given model name. Matches by longest + * prefix so version suffixes (claude-sonnet-4-6-20250514) resolve to the + * canonical entry (claude-sonnet-4-6). Falls back to FALLBACK_PROFILE for + * unknown models. + * + * Lookup is case-insensitive on the input but the table is lowercase, so + * we normalize the input before scanning. + */ +export function getRotProfile(model: string): ContextRotProfile { + if (!model) return FALLBACK_PROFILE; + const normalized = model.toLowerCase(); + + // Longest-prefix scan with a digit-boundary check. Without the boundary + // a hypothetical future model name like "claude-sonnet-4-50" would + // falsely match the "claude-sonnet-4-5" entry, classifying it as the + // 200k profile. Only accept the prefix when the model name ends there + // OR the next character is a non-digit (typically '-' before a date + // suffix like "-20250514"). This way "claude-sonnet-4-5" and + // "claude-sonnet-4-5-20250929" both match the 4-5 row, but a future + // "claude-sonnet-4-50" falls through to the FALLBACK_PROFILE. 
+ let best: ContextRotProfile | null = null; + for (const profile of ROT_PROFILES) { + if (!normalized.startsWith(profile.modelPrefix)) continue; + const nextChar = normalized.charAt(profile.modelPrefix.length); + const boundaryOk = nextChar === '' || nextChar < '0' || nextChar > '9'; + if (!boundaryOk) continue; + if (best === null || profile.modelPrefix.length > best.modelPrefix.length) { + best = profile; + } + } + return best ?? FALLBACK_PROFILE; +} + +/** + * Effective thresholds after applying the detail-heavy adjustment. + * Both warn and crit shift down by DETAIL_HEAVY_ADJUSTMENT, floored at + * MIN_THRESHOLD_FLOOR. + */ +export interface EffectiveThresholds { + warnAtPct: number; + criticalAtPct: number; +} + +export function getEffectiveThresholds( + model: string, + isDetailHeavy: boolean, +): EffectiveThresholds { + const profile = getRotProfile(model); + return { + warnAtPct: applyDetailHeavyAdjustment(profile.warnAtPct, isDetailHeavy), + criticalAtPct: applyDetailHeavyAdjustment(profile.criticalAtPct, isDetailHeavy), + }; +} + +/** + * Apply the detail-heavy adjustment to a single threshold value, with the + * MIN_THRESHOLD_FLOOR guard. Exported so tests can exercise the floor + * branch independently of the threshold table (no real profile in the + * table is low enough to trip the floor today; the floor is here as a + * guard for future tuning that combines downward adjustments). + */ +export function applyDetailHeavyAdjustment(threshold: number, isDetailHeavy: boolean): number { + if (!isDetailHeavy) return threshold; + return Math.max(MIN_THRESHOLD_FLOOR, threshold - DETAIL_HEAVY_ADJUSTMENT); +} + +/** + * Three-way zone classification given the user's current context %, the + * model, and whether their last prompt was detail-heavy. 
+ * + * The zones map to the existing health levels in `lib/health-score.ts`: + * healthy -> Healthy + * approaching -> Degrading + * in-rot -> Critical + * + * Plus, an absolute floor: anything at ABSOLUTE_CRITICAL_FLOOR (90%) or + * above is "in-rot" regardless of model. This guarantees the health + * indicator goes red before the user actually hits the wall. + */ +export function getRotZone( + model: string, + contextPct: number, + isDetailHeavy: boolean, +): RotZone { + if (contextPct >= ABSOLUTE_CRITICAL_FLOOR) return 'in-rot'; + const { warnAtPct, criticalAtPct } = getEffectiveThresholds(model, isDetailHeavy); + if (contextPct >= criticalAtPct) return 'in-rot'; + if (contextPct >= warnAtPct) return 'approaching'; + return 'healthy'; +} + +// ── Coaching copy ──────────────────────────────────────────────────────── +// +// Three zones, two model classes (with-compaction / without-compaction), +// optional MRCR citation when we have one. Copy is claude.ai-specific: +// never references `/compact` (Claude Code feature, not on the web), uses +// "start a new chat" and "use Projects" as the primary actions. + +/** + * Returns the user-facing coaching message for the given model and state. + * + * Healthy zone: low-friction reassurance, names the model and its window. + * Approaching: educational, explains why, cites MRCR when we have it, + * gives a claude.ai-appropriate next step. + * In-rot: direct, action-first. + * + * The detail-heavy flag does not change the copy itself, only the + * threshold at which the zone trips. This is intentional: the user + * already feels the precision pressure, the warning just arrives sooner. 
+ */ +export function getRotCoaching( + model: string, + contextPct: number, + isDetailHeavy: boolean, +): string { + const profile = getRotProfile(model); + const zone = getRotZone(model, contextPct, isDetailHeavy); + const windowLabel = formatWindowLabel(profile.contextWindow); + const pctRounded = Math.round(contextPct); + + if (zone === 'healthy') { + // Low context: low friction. Mention the model only when there is + // anything to say beyond "fresh". + if (contextPct < LOW_CONTEXT_REASSURANCE_CEIL) { + return 'Conversation is fresh and responsive.'; + } + return `${pctRounded}% of ${profile.label}'s ${windowLabel} window used. Plenty of room.`; + } + + if (zone === 'approaching') { + // Educational moment. Cite MRCR when available, soften when + // compaction is present, harden when it is not. + const mrcrClause = profile.mrcrAt1MPct !== undefined + ? ` On Anthropic's 8-needle 1M MRCR benchmark, ${profile.label} retrieves ${profile.mrcrAt1MPct}% at full window.` + : ''; + const action = profile.hasServerSideCompaction + ? "Anthropic's server-side compaction handles long sessions, but for accuracy-critical work consider starting a new chat." + : 'For accuracy-critical work, start a new chat now.'; + return `Approaching the zone where retrieval declines.${mrcrClause} ${action}`; + } + + // in-rot zone + if (profile.hasServerSideCompaction) { + return `${pctRounded}% used. Even with compaction, fine details from earlier may be missed. Start a new chat for new threads of work.`; + } + return `${pctRounded}% used. Retrieval is unreliable here. Start a new chat. Use Projects to keep ongoing work organized.`; +} + +/** + * Formats a context window size for display. 1_000_000 -> "1M", + * 200_000 -> "200k". Keeps the coaching copy short and scannable. + */ +function formatWindowLabel(contextWindow: number): string { + if (contextWindow >= 1_000_000) { + const millions = contextWindow / 1_000_000; + return millions === Math.floor(millions) ? 
`${millions}M` : `${millions.toFixed(1)}M`; + } + const thousands = Math.round(contextWindow / 1_000); + return `${thousands}k`; +} diff --git a/lib/health-score.ts b/lib/health-score.ts index 88bb4bd..81976f9 100644 --- a/lib/health-score.ts +++ b/lib/health-score.ts @@ -1,4 +1,5 @@ // lib/health-score.ts +// // Pure function: conversation metadata -> human-readable health assessment. // No DOM refs, no chrome APIs, no side effects. // @@ -6,10 +7,47 @@ // (Healthy / Degrading / Critical) and a one-line coaching message that // tells the user exactly what is happening and what to do about it. // -// Based on the Chroma context rot research (2025): every frontier LLM -// shows a U-shaped attention curve. Models attend strongly to the beginning -// and end of context, and poorly to the middle. Performance degrades as -// context grows, especially past 50% utilization with high turn counts. +// ── How the score is computed (READ THIS BEFORE EDITING) ──────────────── +// +// Two independent classifiers run in sequence, and the more severe of +// the two wins: +// +// 1. PRIMARY (per-model utilization). The conversation's context % is +// compared to model-specific warn / critical thresholds from +// context-rot-thresholds.ts. This is the load-bearing signal. +// Anthropic's published MRCR scores ground the thresholds where they +// exist; otherwise we use Saar coaching defaults documented in +// docs/context-rot-thresholds-spec.md. +// +// 2. SECONDARY (turn count and growth rate). A long conversation with +// few tokens, or a fast-filling short conversation, deserves +// coaching even before the per-model threshold trips. These are +// weaker signals so they cannot escalate above the primary level +// they would otherwise produce, except for one explicit boost: +// already-degrading + very-long-convo escalates to critical because +// the attention valley research suggests retrieval is falling apart +// regardless of model class. 
+// +// Detail-heavy adjustment: when the user's last prompt demanded precision +// (code blocks, precision keywords), the per-model warn / crit thresholds +// shift earlier. The shift is applied once, in the threshold lookup, so +// every rule below sees the adjusted numbers without special-casing. +// +// Coaching copy comes from context-rot-thresholds.ts when the primary +// classifier fires (model-aware, evidence-grounded). When a secondary +// rule fires alone, we use generic copy that names the model but does +// not invent threshold-specific claims. This keeps copy honest: we never +// quote an MRCR figure on a turn-count-driven warning. +// ───────────────────────────────────────────────────────────────────────── + +import { + ABSOLUTE_CRITICAL_FLOOR, + LOW_CONTEXT_REASSURANCE_CEIL, + getEffectiveThresholds, + getRotCoaching, + getRotProfile, + getRotZone, +} from './context-rot-thresholds'; // ── Types ───────────────────────────────────────────────────────────────────── @@ -27,84 +65,159 @@ export interface HealthScore { // ── Constants ───────────────────────────────────────────────────────────────── -// Context thresholds (percentage of context window). +/** + * Absolute floor: any conversation at or above this context % is critical + * regardless of model. Re-exported from context-rot-thresholds.ts so the + * legacy "DEGRADING_CEIL = 90" assertions in audit tests keep working + * without binding the test file to the new agent path. + */ +export const DEGRADING_CEIL = ABSOLUTE_CRITICAL_FLOOR; + +/** + * Legacy default warn threshold. Pre-GET-28, this was the single + * model-agnostic warn ceiling. Now superseded by per-model thresholds in + * context-rot-thresholds.ts. Kept exported for tests and call sites that + * still reference it as a sentinel value, NOT used by computeHealthScore. + */ export const HEALTHY_CEIL = 70; -export const DEGRADING_CEIL = 90; -// Turn count thresholds. High turn count amplifies context rot. +// Turn count thresholds. 
Long conversations develop attention drift even +// when context is low, per the U-shaped attention curve from the Chroma +// 2025 context-rot research. export const TURN_HEALTHY_CEIL = 10; export const TURN_DEGRADING_CEIL = 20; export const TURN_CRITICAL_CEIL = 30; -// Growth rate threshold (% per turn). Fast-filling conversations degrade sooner. +/** + * Growth rate threshold (% per turn). A conversation filling at this rate + * or faster is on track to hit the per-model warn threshold within a + * handful of messages. Triggers a forward-looking warning before the + * primary classifier would fire. + */ export const FAST_GROWTH_PCT = 8; +/** + * How far below the per-model warn threshold the turn-aware degrading + * rule (Rule 6) starts firing. With warn = 60 (Sonnet 4.6) and offset = + * 10, a 12-turn conversation at 50%+ context is degrading even though + * it has not yet crossed the per-model warn. Rationale: the attention + * valley research shows fidelity erodes as turns accumulate, and the + * 10-point band gives the indicator room to coach the user before the + * primary classifier trips. + */ +export const TURN_AWARE_WARN_OFFSET = 10; + // ── Score computation ───────────────────────────────────────────────────────── export interface HealthInput { - contextPct: number; // 0-100 + /** Current context utilization, 0 to 100, as a % of the model's window. */ + contextPct: number; + /** Number of user-assistant turn pairs so far. */ turnCount: number; - growthRate: number | null; // avg % per turn, null if insufficient data + /** Average upward growth per turn (% per turn), or null if insufficient data. */ + growthRate: number | null; + /** + * Model name as reported by the SSE message_start event (e.g. + * 'claude-sonnet-4-6-20250514'). Used to look up per-model warn / + * critical thresholds and to name the model in coaching copy. Pass + * an empty string for unknown models; the agent will fall back to + * a conservative 200k / 50% / 75% profile. 
+ */ + model: string; + /** + * True when the most recent prompt demanded precise / exhaustive + * recall (code blocks or precision keywords). Computed by + * lib/prompt-analysis.ts isDetailHeavy(). Shifts the per-model warn + * and critical thresholds earlier by DETAIL_HEAVY_ADJUSTMENT. + */ + isDetailHeavy: boolean; } /** * Compute the conversation health score. * - * The score combines context utilization, turn count, and growth rate. - * A conversation can be Critical from context alone (>= DEGRADING_CEIL, 90%) or from - * a combination of moderate context + high turn count (the "attention - * valley" effect from context rot research). + * Returns the worst (most severe) of the primary per-model classifier + * and the secondary turn / growth heuristics. Coaching copy is sourced + * from the rule that won, so a turn-count-driven warning does not pretend + * to quote MRCR data it never used. */ export function computeHealthScore(input: HealthInput): HealthScore { - const { contextPct, turnCount, growthRate } = input; + const { contextPct, turnCount, growthRate, model, isDetailHeavy } = input; - // Rule 1: Very high context is always Critical, regardless of turn count. - if (contextPct >= DEGRADING_CEIL) { + const profile = getRotProfile(model); + const thresholds = getEffectiveThresholds(model, isDetailHeavy); + const zone = getRotZone(model, contextPct, isDetailHeavy); + + // Rule 1 (primary): in-rot zone is always critical. This includes the + // absolute 90% floor since getRotZone honors it. Coaching is the + // model-aware in-rot message: cites the model, mentions compaction + // for 1M models, points 200k models at Projects + new chat. + if (zone === 'in-rot') { return { level: 'critical', label: 'Critical', - coaching: 'Context nearly full. Start a new chat, or use Claude Projects for ongoing work.', + coaching: getRotCoaching(model, contextPct, isDetailHeavy), contextPct, }; } - // Rule 2: High context + many turns = Critical. 
- // The "attention valley": past TURN_DEGRADING_CEIL turns with >= HEALTHY_CEIL (70%) context, - // Claude's attention to mid-conversation details degrades measurably. - if (contextPct >= HEALTHY_CEIL && turnCount > TURN_DEGRADING_CEIL) { + // Rule 2 (secondary boost): if we are already in the approaching zone + // AND the conversation is past TURN_DEGRADING_CEIL turns, we promote + // to critical. Rationale: the per-model warn threshold is the point + // where retrieval starts dropping; combined with a deep conversation, + // the user is statistically losing fidelity from earlier turns even + // if the percentage alone would not yet trip critical. + if (zone === 'approaching' && turnCount > TURN_DEGRADING_CEIL) { return { level: 'critical', label: 'Critical', - coaching: `${turnCount} turns deep. Claude has likely lost detail from early messages.`, + coaching: `${turnCount} turns deep on ${profile.label}. Earlier details are likely missing from recall. Start a new chat.`, contextPct, }; } - // Rule 3: Moderate context + moderate turns = Degrading. - if (contextPct >= HEALTHY_CEIL && turnCount > TURN_HEALTHY_CEIL) { + // Rule 3 (primary): approaching zone without the turn boost is + // degrading. Use the per-model coaching string, which decides whether + // to cite MRCR, mention compaction, or push toward Projects. + if (zone === 'approaching') { return { level: 'degrading', label: 'Degrading', - coaching: 'Earlier details may be fading. Consider starting fresh soon.', + coaching: getRotCoaching(model, contextPct, isDetailHeavy), contextPct, }; } - // Rule 4: Fast growth rate with meaningful context = Degrading. - // Even at low turn counts, a conversation filling at >8%/turn will hit - // Critical within a few more messages. - if (growthRate !== null && growthRate > FAST_GROWTH_PCT && contextPct > 30) { - const remaining = Math.max(0, Math.round((100 - contextPct) / growthRate)); + // ── At this point the primary classifier says "healthy". 
Secondary + // signals can downgrade to "degrading" but never to "critical" on + // their own; the primary classifier is the only path to red. ── + + // Rule 4 (secondary): fast growth at meaningful context is degrading. + // We only fire above LOW_CONTEXT_REASSURANCE_CEIL because tiny chats + // can show large per-turn growth as a percentage with no real risk. + // The remaining-messages estimate uses the post-detail-heavy warn + // threshold so it answers "messages until we hit the warning", not + // "messages until we exhaust the entire window". + if (growthRate !== null && growthRate > FAST_GROWTH_PCT && contextPct > LOW_CONTEXT_REASSURANCE_CEIL) { + const headroom = Math.max(0, thresholds.warnAtPct - contextPct); + // Floor the displayed count at 1: a "0 messages" warning is silly + // when the rule has already decided we should warn. Rounding can + // produce 0 when headroom < growthRate / 2 (e.g. headroom=2pp, + // growthRate=9pp/turn -> 0.22 -> 0). Show "1 message" instead. + const rawRemaining = Math.round(headroom / growthRate); + const remaining = Math.max(1, rawRemaining); + const target = remaining === 1 ? 'message' : 'messages'; return { level: 'degrading', label: 'Degrading', - coaching: `Filling fast. ~${remaining} message${remaining === 1 ? '' : 's'} until context limit.`, + coaching: `Filling fast on ${profile.label}. About ${remaining} ${target} until the rot zone.`, contextPct, }; } - // Rule 5: High turn count alone (even with low context) = mild Degrading. - // Very long conversations develop attention drift regardless of context %. + // Rule 5 (secondary): very long conversations develop attention drift + // even at low utilization. The turn ceiling is generic; no model + // claim attached. if (turnCount > TURN_CRITICAL_CEIL) { return { level: 'degrading', @@ -114,13 +227,31 @@ export function computeHealthScore(input: HealthInput): HealthScore { }; } - // Default: Healthy. 
+ // Rule 6 (degrading boost from turns at high-but-healthy context): + // when context is moderate AND we are past TURN_HEALTHY_CEIL, the + // attention valley starts kicking in even before the per-model warn. + // Threshold: anything within TURN_AWARE_WARN_OFFSET points of warn + // (so a Sonnet 4.5 user at 45% with 12 turns trips this; an Opus + // 4.7 user at 45% with 12 turns does not, because warn = 65). This + // preserves the legacy "70% + 11 turns = degrading" coverage without + // locking it to 70. + const turnAwareWarnFloor = thresholds.warnAtPct - TURN_AWARE_WARN_OFFSET; + if (contextPct >= turnAwareWarnFloor && turnCount > TURN_HEALTHY_CEIL && contextPct > LOW_CONTEXT_REASSURANCE_CEIL) { + return { + level: 'degrading', + label: 'Degrading', + coaching: `${turnCount} turns into ${profile.label}. Earlier details may be fading; consider starting fresh soon.`, + contextPct, + }; + } + + // Default: healthy. Use the per-model coaching for both the very-low + // case (returns "fresh and responsive") and the moderate case + // (returns "X% of {label}'s {window} window used. Plenty of room."). return { level: 'healthy', label: 'Healthy', - coaching: contextPct > 30 - ? `${contextPct.toFixed(0)}% context used. Plenty of room.` - : 'Conversation is fresh and responsive.', + coaching: getRotCoaching(model, contextPct, isDetailHeavy), contextPct, }; } diff --git a/lib/message-types.ts b/lib/message-types.ts index 51ea460..bc0f32f 100644 --- a/lib/message-types.ts +++ b/lib/message-types.ts @@ -40,6 +40,13 @@ export interface StreamCompletePayload { hasCodeBlock?: boolean; /** True when the prompt is shorter than 50 characters (terse follow-up). */ isShortFollowUp?: boolean; + /** + * True when the prompt demands precise / exhaustive recall on prior context + * (code blocks or precision keywords). Computed in inject.ts using the same + * keyword set as lib/prompt-analysis.ts isDetailHeavy(). 
The Health Agent + * reads this to shift its per-model warn/critical thresholds earlier. + */ + isDetailHeavy?: boolean; /** Organization UUID extracted from the API URL. Used for account isolation. */ organizationId?: string; } diff --git a/lib/prompt-analysis.ts b/lib/prompt-analysis.ts index 94d6674..e5f5da9 100644 --- a/lib/prompt-analysis.ts +++ b/lib/prompt-analysis.ts @@ -72,6 +72,38 @@ export const SHORT_FOLLOWUP_MAX_CHARS = 50; /** Number of consecutive short follow-ups required to trigger the chain signal. */ export const FOLLOWUP_CHAIN_MIN_COUNT = 3; +/** + * Keywords that signal the user wants precise, exhaustive recall on the + * conversation so far. When these appear, retrieval failure cost is high, + * and the context-rot agent shifts the warn/critical thresholds earlier. + * + * Substring match, case-insensitive (plain `String.prototype.includes` + * after a single `toLowerCase` on the prompt). Substring rather than + * word-bounded because words like "exact" should still match inside + * "exactly", "exactness", etc., without us listing every inflection. + * + * Tuned conservatively: bare "every" and "all" appear constantly in + * prose ("every user", "I tried all the suggestions") and would flood + * the warning if treated as triggers on their own. We pair them with + * specific nouns ("every detail", "all details", "list every", "list + * all") so the trigger only fires when the user is plausibly asking + * for exhaustive recall. 
+ */ +export const DETAIL_HEAVY_KEYWORDS: ReadonlyArray<string> = [ + 'exact', + 'exactly', + 'precise', + 'precisely', + 'verbatim', + 'exhaustive', + 'complete list', + 'full list', + 'list every', + 'list all', + 'every detail', + 'all details', +] as const; + // ── Model classification ──────────────────────────────────────────────────── /** @@ -113,6 +145,41 @@ export interface DeltaPromptContext { haikuMedianDelta: number | null; } +// ── Detail-heavy detection ────────────────────────────────────────────────── + +/** + * Returns true if the prompt demands precise / exhaustive recall on prior + * context. Two independent signals; either one trips the flag: + * + * 1. The prompt contains a fenced code block. Code-bearing turns almost + * always need exact details from earlier (variable names, line + * numbers, error messages). Any code fence trips it. + * 2. The prompt contains one of the DETAIL_HEAVY_KEYWORDS, matched + * case-insensitively as a substring (the keywords are already + * multi-word or distinct enough that simple substring is safe). + * + * Pure function. No allocation in the hot path beyond the lowercase copy + * needed for case-insensitive matching, and that copy is O(n) on prompt + * length which dwarfs anything in this loop. + * + * Why this lives here and not in context-rot-thresholds.ts: prompt + * inspection is the Prompt Agent's job. The rot agent consumes a flag, + * not raw text. Keeps each agent focused on one input shape. + */ +export function isDetailHeavy(promptText: string): boolean { + if (!promptText) return false; + + // Code fence check first: cheapest, no allocation. + if (promptText.includes('```')) return true; + + // Lowercase once for the keyword scan. 
+ const haystack = promptText.toLowerCase(); + for (const keyword of DETAIL_HEAVY_KEYWORDS) { + if (haystack.includes(keyword)) return true; + } + return false; +} + // ── Analysis ──────────────────────────────────────────────────────────────── /** diff --git a/tests/audit/fuzz.test.ts b/tests/audit/fuzz.test.ts index 828d0a5..a19b984 100644 --- a/tests/audit/fuzz.test.ts +++ b/tests/audit/fuzz.test.ts @@ -181,17 +181,17 @@ describe('number fuzz: calculateCost', () => { describe('number fuzz: computeHealthScore', () => { for (const pct of numFuzz) { test(`contextPct=${pct} does not throw`, () => { - expect(() => computeHealthScore({ contextPct: pct, turnCount: 5, growthRate: null })).not.toThrow(); + expect(() => computeHealthScore({ contextPct: pct, turnCount: 5, growthRate: null, model: 'claude-sonnet-4-6', isDetailHeavy: false })).not.toThrow(); }); } for (const turns of numFuzz) { test(`turnCount=${turns} does not throw`, () => { - expect(() => computeHealthScore({ contextPct: 50, turnCount: turns, growthRate: null })).not.toThrow(); + expect(() => computeHealthScore({ contextPct: 50, turnCount: turns, growthRate: null, model: 'claude-sonnet-4-6', isDetailHeavy: false })).not.toThrow(); }); } for (const rate of numFuzz) { test(`growthRate=${rate} does not throw`, () => { - expect(() => computeHealthScore({ contextPct: 50, turnCount: 5, growthRate: rate })).not.toThrow(); + expect(() => computeHealthScore({ contextPct: 50, turnCount: 5, growthRate: rate, model: 'claude-sonnet-4-6', isDetailHeavy: false })).not.toThrow(); }); } }); diff --git a/tests/audit/health-score-audit.test.ts b/tests/audit/health-score-audit.test.ts index 3d8b153..c022094 100644 --- a/tests/audit/health-score-audit.test.ts +++ b/tests/audit/health-score-audit.test.ts @@ -1,11 +1,16 @@ import { describe, test, expect } from 'vitest'; -// Audit: lib/health-score.ts - rule verification and boundary testing +// Audit: lib/health-score.ts - per-model rule verification and boundary testing. 
+// +// After GET-28 the threshold is per-model. The audit here pins specific +// behaviors using two reference models: +// - Sonnet 4.5 (warn=50, crit=75): 200k window, weak long-context. +// - Opus 4.6 (warn=65, crit=85): 1M window, strong long-context. +// Plus the model-agnostic absolute floor at DEGRADING_CEIL = 90. import { computeHealthScore, computeGrowthRate, - HEALTHY_CEIL, DEGRADING_CEIL, TURN_HEALTHY_CEIL, TURN_DEGRADING_CEIL, @@ -14,105 +19,152 @@ import { type HealthInput, } from '../../lib/health-score'; -// ── computeHealthScore: Rule 1 (context >= DEGRADING_CEIL = 90% = critical) ── +const SONNET_45 = 'claude-sonnet-4-5'; +const OPUS_46 = 'claude-opus-4-6'; -describe('computeHealthScore: Rule 1 (high context = critical)', () => { - test('exactly at DEGRADING_CEIL (90%) is critical', () => { - expect(computeHealthScore({ contextPct: 90, turnCount: 0, growthRate: null }).level).toBe('critical'); +function input(overrides: Partial<HealthInput> = {}): HealthInput { + return { + contextPct: 0, + turnCount: 0, + growthRate: null, + model: SONNET_45, + isDetailHeavy: false, + ...overrides, + }; +} + +// ── Absolute floor: any model >= 90% = critical ─────────────────────────── + +describe('absolute critical floor', () => { + test('exactly at DEGRADING_CEIL (90%) is critical on Sonnet 4.5', () => { + expect(computeHealthScore(input({ contextPct: 90 })).level).toBe('critical'); + }); + + test('above DEGRADING_CEIL is critical on Sonnet 4.5', () => { + expect(computeHealthScore(input({ contextPct: 95 })).level).toBe('critical'); + }); + + test('exactly at DEGRADING_CEIL is critical even on Opus 4.6 (1M)', () => { + expect(computeHealthScore(input({ contextPct: 90, model: OPUS_46 })).level).toBe('critical'); + }); + + test('per-model critical fires before the absolute floor on Opus 4.6', () => { + // Opus 4.6 has the highest per-model crit in the table (85). At + // 89.9% context we are below the absolute 90% floor but already + // past Opus 4.6's per-model crit. 
The result is still critical; + // this asserts the per-model rule does its job before the floor + // ever has to step in. (No model in the table has crit > 90, so + // no current row tests the floor in isolation; the absolute floor + // is exercised separately by the 90% / 95% cases above.) + expect(computeHealthScore(input({ contextPct: 89.9, model: OPUS_46 })).level).toBe('critical'); + }); +}); + +// ── Per-model: Sonnet 4.5 thresholds (50 / 75) ──────────────────────────── + +describe('per-model: Sonnet 4.5 (warn=50, crit=75)', () => { + test('crit boundary: 75% = critical', () => { + expect(computeHealthScore(input({ contextPct: 75 })).level).toBe('critical'); + }); + + test('crit boundary minus 1: 74% with low turns = degrading', () => { + expect(computeHealthScore(input({ contextPct: 74, turnCount: 2 })).level).toBe('degrading'); }); - test('above DEGRADING_CEIL is critical', () => { - expect(computeHealthScore({ contextPct: 95, turnCount: 0, growthRate: null }).level).toBe('critical'); + test('warn boundary: 50% = degrading', () => { + expect(computeHealthScore(input({ contextPct: 50 })).level).toBe('degrading'); }); - test('just below DEGRADING_CEIL is not critical from context alone', () => { - expect(computeHealthScore({ contextPct: 89.9, turnCount: 0, growthRate: null }).level).not.toBe('critical'); + test('just below warn: 49% with few turns = healthy', () => { + expect(computeHealthScore(input({ contextPct: 49, turnCount: 3 })).level).toBe('healthy'); }); }); -// ── Rule 2 (high context + many turns = critical) ────────────────────────── +// ── Per-model: Opus 4.6 thresholds (65 / 85) ────────────────────────────── -describe('computeHealthScore: Rule 2 (context >= 70 + turns > 20)', () => { - test('at threshold boundary: 70% context + 21 turns = critical', () => { - expect(computeHealthScore({ contextPct: 70, turnCount: 21, growthRate: null }).level).toBe('critical'); +describe('per-model: Opus 4.6 (warn=65, crit=85)', () => { + test('crit boundary: 
85% = critical', () => { + expect(computeHealthScore(input({ contextPct: 85, model: OPUS_46 })).level).toBe('critical'); }); - test('at threshold boundary: 70% context + 20 turns = degrading (not critical)', () => { - const result = computeHealthScore({ contextPct: 70, turnCount: 20, growthRate: null }); - expect(result.level).not.toBe('critical'); + test('warn boundary: 65% = degrading', () => { + expect(computeHealthScore(input({ contextPct: 65, model: OPUS_46 })).level).toBe('degrading'); }); - test('below context threshold: 69% + 21 turns = not critical from rule 2', () => { - const result = computeHealthScore({ contextPct: 69, turnCount: 21, growthRate: null }); - // Should be healthy or degrading from rule 5 (turns > 30), not critical - expect(result.level).not.toBe('critical'); + test('just below warn: 64% with few turns = healthy', () => { + expect(computeHealthScore(input({ contextPct: 64, turnCount: 3, model: OPUS_46 })).level).toBe('healthy'); + }); + + test('Opus 4.6 at 80% (degrading) where Sonnet 4.5 would be critical', () => { + const opus = computeHealthScore(input({ contextPct: 80, model: OPUS_46, turnCount: 3 })); + const sonnet = computeHealthScore(input({ contextPct: 80, model: SONNET_45, turnCount: 3 })); + expect(opus.level).toBe('degrading'); + expect(sonnet.level).toBe('critical'); }); }); -// ── Rule 3 (moderate context + moderate turns = degrading) ───────────────── +// ── Turn-based escalation: approaching + many turns -> critical ───────── -describe('computeHealthScore: Rule 3 (context >= 70 + turns > 10)', () => { - test('at threshold: 70% + 11 turns = degrading', () => { - expect(computeHealthScore({ contextPct: 70, turnCount: 11, growthRate: null }).level).toBe('degrading'); +describe('approaching + deep turns escalates to critical', () => { + test('Sonnet 4.5 at 60% (approaching) with > 20 turns = critical', () => { + const h = computeHealthScore(input({ contextPct: 60, turnCount: TURN_DEGRADING_CEIL + 2 })); + 
expect(h.level).toBe('critical'); + expect(h.coaching).toMatch(/turns deep/); }); - test('below turn threshold: 70% + 10 turns = healthy (not degrading from rule 3)', () => { - const result = computeHealthScore({ contextPct: 70, turnCount: 10, growthRate: null }); - expect(result.level).toBe('healthy'); + test('Sonnet 4.5 at 60% (approaching) with 20 turns = degrading (boundary)', () => { + const h = computeHealthScore(input({ contextPct: 60, turnCount: TURN_DEGRADING_CEIL })); + expect(h.level).toBe('degrading'); }); }); -// ── Rule 4 (fast growth + meaningful context = degrading) ────────────────── +// ── Fast growth secondary signal ─────────────────────────────────────────── -describe('computeHealthScore: Rule 4 (fast growth)', () => { - test('growth > 8 with context > 30 = degrading', () => { - expect(computeHealthScore({ contextPct: 31, turnCount: 2, growthRate: 9 }).level).toBe('degrading'); +describe('fast growth secondary', () => { + test('growth > 8 with context > 30 (and below warn) = degrading', () => { + // Sonnet 4.5 warn=50; 35 with fast growth -> below warn but secondary fires. 
+ expect(computeHealthScore(input({ contextPct: 35, turnCount: 2, growthRate: 9 })).level).toBe('degrading'); }); test('growth > 8 but context <= 30 = healthy', () => { - expect(computeHealthScore({ contextPct: 30, turnCount: 2, growthRate: 9 }).level).toBe('healthy'); + expect(computeHealthScore(input({ contextPct: 30, turnCount: 2, growthRate: 9 })).level).toBe('healthy'); }); - test('growth exactly 8 = healthy (not triggered)', () => { - expect(computeHealthScore({ contextPct: 40, turnCount: 2, growthRate: 8 }).level).toBe('healthy'); + test('growth exactly 8 = not triggered', () => { + expect(computeHealthScore(input({ contextPct: 40, turnCount: 2, growthRate: 8 })).level).toBe('healthy'); }); - test('remaining messages calculation in coaching', () => { - const result = computeHealthScore({ contextPct: 40, turnCount: 2, growthRate: 10 }); - expect(result.coaching).toMatch(/~6 messages/); + test('remaining-messages calculation lands in coaching', () => { + // Sonnet 4.5 warn=50, ctx=40, growth=10 -> headroom 10, ~1 message. + const result = computeHealthScore(input({ contextPct: 40, turnCount: 2, growthRate: 10 })); + expect(result.coaching).toMatch(/messages? 
until/); }); }); -// ── Rule 5 (high turn count alone = degrading) ───────────────────────────── +// ── Long-conversation secondary signal ───────────────────────────────────── -describe('computeHealthScore: Rule 5 (turns > 30)', () => { +describe('long-conversation secondary', () => { test('31 turns with low context = degrading', () => { - expect(computeHealthScore({ contextPct: 10, turnCount: 31, growthRate: null }).level).toBe('degrading'); + expect(computeHealthScore(input({ contextPct: 10, turnCount: TURN_CRITICAL_CEIL + 1 })).level).toBe('degrading'); }); - test('30 turns with low context = healthy', () => { - expect(computeHealthScore({ contextPct: 10, turnCount: 30, growthRate: null }).level).toBe('healthy'); + test('30 turns with low context = healthy (below threshold)', () => { + expect(computeHealthScore(input({ contextPct: 10, turnCount: TURN_CRITICAL_CEIL })).level).toBe('healthy'); }); }); -// ── Rule 6 (healthy default) ─────────────────────────────────────────────── - -describe('computeHealthScore: healthy default', () => { - test('fresh conversation = healthy', () => { - const result = computeHealthScore({ contextPct: 0, turnCount: 0, growthRate: null }); - expect(result.level).toBe('healthy'); - expect(result.coaching).toMatch(/fresh/i); - }); +// ── Healthy default ──────────────────────────────────────────────────────── - test('moderate context below 30% gets "plenty of room" coaching', () => { - const result = computeHealthScore({ contextPct: 0, turnCount: 0, growthRate: null }); - expect(result.coaching).toMatch(/fresh and responsive/i); +describe('healthy default', () => { + test('fresh conversation = healthy with fresh-and-responsive copy', () => { + const h = computeHealthScore(input({ contextPct: 0, turnCount: 0 })); + expect(h.level).toBe('healthy'); + expect(h.coaching).toMatch(/fresh/i); }); - test('context above 30% but below 50% shows percentage', () => { - const result = computeHealthScore({ contextPct: 35, turnCount: 2, growthRate: 
null }); - expect(result.coaching).toMatch(/35%/); - expect(result.coaching).toMatch(/plenty of room/i); + test('moderate context (35%) below warn shows percentage in coaching', () => { + const h = computeHealthScore(input({ contextPct: 35, turnCount: 2 })); + expect(h.coaching).toMatch(/35%/); }); }); @@ -120,17 +172,17 @@ describe('computeHealthScore: healthy default', () => { describe('computeHealthScore: output shape', () => { test('contextPct is passed through unchanged', () => { - expect(computeHealthScore({ contextPct: 42.5, turnCount: 0, growthRate: null }).contextPct).toBe(42.5); + expect(computeHealthScore(input({ contextPct: 42.5 })).contextPct).toBe(42.5); }); test('label is always a non-empty string', () => { const inputs: HealthInput[] = [ - { contextPct: 0, turnCount: 0, growthRate: null }, - { contextPct: 50, turnCount: 15, growthRate: null }, - { contextPct: 95, turnCount: 50, growthRate: 20 }, + input({ contextPct: 0 }), + input({ contextPct: 50, turnCount: 15 }), + input({ contextPct: 95, turnCount: 50, growthRate: 20 }), ]; - for (const input of inputs) { - const result = computeHealthScore(input); + for (const i of inputs) { + const result = computeHealthScore(i); expect(result.label.length).toBeGreaterThan(0); expect(result.coaching.length).toBeGreaterThan(0); } @@ -157,17 +209,14 @@ describe('computeGrowthRate', () => { }); test('correct average for uniform growth', () => { - // 10 -> 20 -> 30: two steps, each +10 expect(computeGrowthRate([10, 20, 30])).toBeCloseTo(10, 5); }); test('only counts upward deltas', () => { - // 10 -> 5 -> 15: only one upward delta (+10), average = 10 expect(computeGrowthRate([10, 5, 15])).toBeCloseTo(10, 5); }); test('mixed growth and decline', () => { - // 10 -> 20 -> 15 -> 25: upward deltas: +10, +10. 
Average = 10 expect(computeGrowthRate([10, 20, 15, 25])).toBeCloseTo(10, 5); }); }); @@ -176,16 +225,16 @@ describe('computeGrowthRate', () => { describe('purity', () => { test('same input produces identical output', () => { - const input: HealthInput = { contextPct: 55, turnCount: 12, growthRate: 5 }; - const a = computeHealthScore(input); - const b = computeHealthScore(input); + const i: HealthInput = input({ contextPct: 55, turnCount: 12, growthRate: 5 }); + const a = computeHealthScore(i); + const b = computeHealthScore(i); expect(a).toEqual(b); }); test('input object is not mutated', () => { - const input: HealthInput = { contextPct: 55, turnCount: 12, growthRate: 5 }; - const frozen = { ...input }; - computeHealthScore(input); - expect(input).toEqual(frozen); + const i: HealthInput = input({ contextPct: 55, turnCount: 12, growthRate: 5 }); + const frozen = { ...i }; + computeHealthScore(i); + expect(i).toEqual(frozen); }); }); diff --git a/tests/audit/smoke.test.ts b/tests/audit/smoke.test.ts index e7ddf2e..a2a1598 100644 --- a/tests/audit/smoke.test.ts +++ b/tests/audit/smoke.test.ts @@ -38,7 +38,7 @@ describe('smoke: imports resolve and basic calls succeed', () => { }); test('health-score: computeHealthScore returns valid shape', () => { - const result = computeHealthScore({ contextPct: 10, turnCount: 2, growthRate: null }); + const result = computeHealthScore({ contextPct: 10, turnCount: 2, growthRate: null, model: 'claude-sonnet-4-6', isDetailHeavy: false }); expect(result.level).toBe('healthy'); expect(typeof result.label).toBe('string'); expect(typeof result.coaching).toBe('string'); diff --git a/tests/integration/agent-pipeline.test.ts b/tests/integration/agent-pipeline.test.ts index 7bd13e9..212855a 100644 --- a/tests/integration/agent-pipeline.test.ts +++ b/tests/integration/agent-pipeline.test.ts @@ -95,6 +95,8 @@ describe('agent pipeline: healthy mid-conversation', () => { contextPct: convState.contextPct, turnCount: convState.turnCount, 
growthRate: computeGrowthRate(convState.contextHistory), + model: MODEL, + isDetailHeavy: false, }); it('pricing agent returns a non-null cost', () => { @@ -125,9 +127,13 @@ describe('agent pipeline: healthy mid-conversation', () => { expect(stale).toBeUndefined(); }); - it('health agent returns healthy (12 turns, 55% context)', () => { - // contextPct(55) < HEALTHY_CEIL(70): below threshold, healthy by default - expect(health.level).toBe('healthy'); + it('health agent returns degrading (12 turns, 55% context on Sonnet 4.6)', () => { + // Per-model thresholds (GET-28). Sonnet 4.6 warn = 60, but Rule 6 + // (turn-aware degrading floor = warn - 10 = 50) trips when context + // is within 10 points of warn AND turnCount > TURN_HEALTHY_CEIL. + // 55% with 12 turns satisfies both, so the indicator surfaces + // attention-valley risk before the per-model warn fires. + expect(health.level).toBe('degrading'); }); it('delta coach returns no critical signals at 62% session', () => { @@ -159,6 +165,8 @@ describe('agent pipeline: degrading mid-conversation', () => { contextPct: convState.contextPct, turnCount: convState.turnCount, growthRate: computeGrowthRate(convState.contextHistory), + model: MODEL, + isDetailHeavy: false, }); it('health agent returns degrading (12 turns, 75% context)', () => { @@ -186,6 +194,8 @@ describe('agent pipeline: critical conversation', () => { contextPct: 85, turnCount: 25, growthRate: computeGrowthRate(convState.contextHistory), + model: MODEL, + isDetailHeavy: false, }); const deltaSignals = analyzeDelta(makeDeltaInput({ sessionPct: 88, @@ -315,7 +325,7 @@ describe('agent pipeline: agent isolation', () => { }); it('health agent handles edge case of 0% context', () => { - const health = computeHealthScore({ contextPct: 0, turnCount: 0, growthRate: null }); + const health = computeHealthScore({ contextPct: 0, turnCount: 0, growthRate: null, model: MODEL, isDetailHeavy: false }); expect(health.level).toBe('healthy'); }); @@ -332,7 +342,7 @@ 
describe('agent pipeline: agent isolation', () => { const deltaSigs = analyzeDelta({ currentDelta: null, recentDeltas: [], sessionPct: 0, firstTurnDelta: null, turnCount: 0, }); - const health = computeHealthScore({ contextPct: 0, turnCount: 0, growthRate: null }); + const health = computeHealthScore({ contextPct: 0, turnCount: 0, growthRate: null, model: MODEL, isDetailHeavy: false }); const cost = calculateCost(0, 0, 'unknown'); const preSubmit = computePreSubmitEstimate({ draftCharCount: 0, model: 'unknown', pctPerInputToken: null, currentSessionPct: 0, @@ -429,7 +439,7 @@ describe('agent pipeline: consistency between agents', () => { Array.from({ length: turnCount }, (_, i) => (i + 1) * 3.4), ); - const health = computeHealthScore({ contextPct, turnCount, growthRate }); + const health = computeHealthScore({ contextPct, turnCount, growthRate, model: MODEL, isDetailHeavy: false }); const contextSignals = analyzeContext({ turnCount, contextPct, @@ -446,7 +456,7 @@ describe('agent pipeline: consistency between agents', () => { }); it('healthy health at low context means no threshold signals', () => { - const health = computeHealthScore({ contextPct: 10, turnCount: 3, growthRate: null }); + const health = computeHealthScore({ contextPct: 10, turnCount: 3, growthRate: null, model: MODEL, isDetailHeavy: false }); const signals = analyzeContext({ turnCount: 3, contextPct: 10, diff --git a/tests/integration/restore-pipeline.test.ts b/tests/integration/restore-pipeline.test.ts index 777c920..204d53d 100644 --- a/tests/integration/restore-pipeline.test.ts +++ b/tests/integration/restore-pipeline.test.ts @@ -47,6 +47,8 @@ function restoreConversation(record: ConversationRecord): { contextPct: convState.contextPct, turnCount: convState.turnCount, growthRate: computeGrowthRate(convState.contextHistory), + model: convState.model, + isDetailHeavy: false, }); state = { ...state, health }; @@ -95,6 +97,8 @@ function applyStreamCompleteAfterRestore( contextPct: 
cumulativeContextPct, turnCount: convState.turnCount, growthRate: computeGrowthRate(convState.contextHistory), + model: msg.model, + isDetailHeavy: false, }), }; diff --git a/tests/integration/session-state.test.ts b/tests/integration/session-state.test.ts index 494b8fb..9bd4270 100644 --- a/tests/integration/session-state.test.ts +++ b/tests/integration/session-state.test.ts @@ -109,6 +109,8 @@ function processMessage(session: SessionAccumulator, msg: MessageSpec): SessionA contextPct, turnCount, growthRate: computeGrowthRate(contextHistory), + model: msg.model, + isDetailHeavy: false, }); overlayState = { diff --git a/tests/perf/benchmarks.bench.ts b/tests/perf/benchmarks.bench.ts index 05df5ff..867da8e 100644 --- a/tests/perf/benchmarks.bench.ts +++ b/tests/perf/benchmarks.bench.ts @@ -97,15 +97,15 @@ describe('pricing', () => { describe('health score', () => { bench('computeHealthScore - healthy', () => { - computeHealthScore({ contextPct: 30, turnCount: 5, growthRate: 3 }); + computeHealthScore({ contextPct: 30, turnCount: 5, growthRate: 3, model: 'claude-sonnet-4-6', isDetailHeavy: false }); }); bench('computeHealthScore - degrading', () => { - computeHealthScore({ contextPct: 65, turnCount: 15, growthRate: 6 }); + computeHealthScore({ contextPct: 65, turnCount: 15, growthRate: 6, model: 'claude-sonnet-4-6', isDetailHeavy: false }); }); bench('computeHealthScore - critical', () => { - computeHealthScore({ contextPct: 92, turnCount: 25, growthRate: 10 }); + computeHealthScore({ contextPct: 92, turnCount: 25, growthRate: 10, model: 'claude-sonnet-4-6', isDetailHeavy: false }); }); bench('computeGrowthRate - 12 data points', () => { @@ -227,7 +227,7 @@ describe('full agent pipeline (single turn)', () => { calculateCost(3400, 1200, 'claude-sonnet-4-6-20250514'); // Health const growth = computeGrowthRate(CONVERSATION_STATE.contextHistory); - computeHealthScore({ contextPct: 65, turnCount: 12, growthRate: growth ?? 
0 }); + computeHealthScore({ contextPct: 65, turnCount: 12, growthRate: growth ?? 0, model: 'claude-sonnet-4-6-20250514', isDetailHeavy: false }); // Context intelligence const ctxSignals = analyzeContext(CONVERSATION_STATE); // Prompt analysis diff --git a/tests/perf/memory.test.ts b/tests/perf/memory.test.ts index b8b18a5..6e2f776 100644 --- a/tests/perf/memory.test.ts +++ b/tests/perf/memory.test.ts @@ -114,6 +114,8 @@ describe('memory: full agent pipeline', () => { contextPct, turnCount: i + 1, growthRate: growth ?? 0, + model: 'claude-sonnet-4-6-20250514', + isDetailHeavy: false, }); // Context intelligence diff --git a/tests/unit/context-rot-thresholds.test.ts b/tests/unit/context-rot-thresholds.test.ts new file mode 100644 index 0000000..12e9585 --- /dev/null +++ b/tests/unit/context-rot-thresholds.test.ts @@ -0,0 +1,370 @@ +// tests/unit/context-rot-thresholds.test.ts +// +// Tests for lib/context-rot-thresholds.ts. Three concerns: +// 1. Lookup correctness: every model resolves to its intended profile, +// including the longest-prefix-match guarantee. +// 2. Threshold math: detail-heavy adjustment, floor behavior, zone +// classification at and around boundaries. +// 3. Provenance: any profile that quotes an MRCR figure must carry a +// sourceUrl AND a sourceQuote AND that quote must mention the figure. +// This is the drift assertion: if someone changes a number without +// updating the citation, the test fails loudly. 
+ +import { describe, it, expect } from 'vitest'; +import { + ABSOLUTE_CRITICAL_FLOOR, + DETAIL_HEAVY_ADJUSTMENT, + FALLBACK_PROFILE, + MIN_THRESHOLD_FLOOR, + ROT_PROFILES, + applyDetailHeavyAdjustment, + getEffectiveThresholds, + getRotCoaching, + getRotProfile, + getRotZone, +} from '../../lib/context-rot-thresholds'; + +// ── getRotProfile: lookup ───────────────────────────────────────────────── + +describe('getRotProfile: model lookup', () => { + it('resolves Opus 4.7 by exact prefix', () => { + const p = getRotProfile('claude-opus-4-7'); + expect(p.label).toBe('Opus 4.7'); + expect(p.contextWindow).toBe(1_000_000); + expect(p.hasServerSideCompaction).toBe(true); + }); + + it('resolves Opus 4.6 with full Anthropic suffix', () => { + const p = getRotProfile('claude-opus-4-6-20250514'); + expect(p.label).toBe('Opus 4.6'); + expect(p.mrcrAt1MPct).toBe(76); + }); + + it('resolves Sonnet 4.6 separately from Sonnet 4.5', () => { + const sonnet46 = getRotProfile('claude-sonnet-4-6'); + const sonnet45 = getRotProfile('claude-sonnet-4-5'); + expect(sonnet46.contextWindow).toBe(1_000_000); + expect(sonnet45.contextWindow).toBe(200_000); + expect(sonnet46.warnAtPct).toBeGreaterThan(sonnet45.warnAtPct); + }); + + it('resolves Haiku 4.5 to the 200k profile', () => { + const p = getRotProfile('claude-haiku-4-5'); + expect(p.contextWindow).toBe(200_000); + expect(p.hasServerSideCompaction).toBe(false); + }); + + it('uses the longest matching prefix when multiple could match', () => { + // Both 'claude-opus-4-6' and (hypothetically) 'claude-opus-4' would + // match 'claude-opus-4-6-20250514'. Longest wins. 
+ const p = getRotProfile('claude-opus-4-6-20250514'); + expect(p.modelPrefix).toBe('claude-opus-4-6'); + }); + + it('does not collide on a hypothetical "claude-sonnet-4-50"', () => { + // Without the digit-boundary check, the naive startsWith match + // would classify "claude-sonnet-4-50" as the 4-5 profile (200k + // window), which would silently misroute a future Anthropic + // model into Sonnet 4.5's coaching. We expect the boundary check + // to reject that match and fall through to FALLBACK_PROFILE. + const p = getRotProfile('claude-sonnet-4-50'); + expect(p.modelPrefix).not.toBe('claude-sonnet-4-5'); + expect(p).toEqual(FALLBACK_PROFILE); + }); + + it('still accepts the date-suffixed form like claude-sonnet-4-5-20250929', () => { + // The boundary check should let through non-digit characters + // (typically '-' before the date suffix). Verify the canonical + // versioned model name still resolves to its profile. + const p = getRotProfile('claude-sonnet-4-5-20250929'); + expect(p.modelPrefix).toBe('claude-sonnet-4-5'); + }); + + it('is case-insensitive on the input', () => { + const lower = getRotProfile('claude-sonnet-4-6'); + const upper = getRotProfile('CLAUDE-SONNET-4-6'); + expect(upper).toEqual(lower); + }); + + it('falls back for unknown models', () => { + expect(getRotProfile('llama-3-1')).toEqual(FALLBACK_PROFILE); + expect(getRotProfile('gpt-4-turbo')).toEqual(FALLBACK_PROFILE); + }); + + it('falls back for empty model string', () => { + expect(getRotProfile('')).toEqual(FALLBACK_PROFILE); + }); +}); + +// ── getEffectiveThresholds: detail-heavy adjustment ────────────────────── + +describe('getEffectiveThresholds: detail-heavy adjustment', () => { + it('returns the profile values when not detail-heavy', () => { + const t = getEffectiveThresholds('claude-opus-4-6', false); + expect(t.warnAtPct).toBe(65); + expect(t.criticalAtPct).toBe(85); + }); + + it('subtracts DETAIL_HEAVY_ADJUSTMENT from both thresholds when detail-heavy', () => { + const t = 
getEffectiveThresholds('claude-opus-4-6', true); + expect(t.warnAtPct).toBe(65 - DETAIL_HEAVY_ADJUSTMENT); + expect(t.criticalAtPct).toBe(85 - DETAIL_HEAVY_ADJUSTMENT); + }); + + it('adjusts Sonnet 4.5 thresholds correctly under detail-heavy', () => { + const t = getEffectiveThresholds('claude-sonnet-4-5', true); + expect(t.warnAtPct).toBe(35); // 50 - 15 + expect(t.criticalAtPct).toBe(60); // 75 - 15 + }); + + it('floors at MIN_THRESHOLD_FLOOR for known-low Haiku profile', () => { + // Sanity: every shipping profile after detail-heavy adjustment + // stays at or above the floor. Catches future tuning that drops + // a profile below the floor without bumping the floor itself. + const t = getEffectiveThresholds('claude-haiku-4-5', true); + expect(t.warnAtPct).toBeGreaterThanOrEqual(MIN_THRESHOLD_FLOOR); + expect(t.criticalAtPct).toBeGreaterThanOrEqual(MIN_THRESHOLD_FLOOR); + }); +}); + +// ── applyDetailHeavyAdjustment: direct exercise of the floor branch ───── +// +// No real profile is low enough today that warn - 15 < MIN_THRESHOLD_FLOOR, +// so the floor branch in the production code is dead by inspection. The +// helper is exposed so we can verify the floor actually fires when the +// math demands it. If anyone ever lowers a profile's warn below 45, +// this test catches a silent change in coaching behavior. 
+ +describe('applyDetailHeavyAdjustment: floor branch', () => { + it('returns the threshold unchanged when not detail-heavy', () => { + expect(applyDetailHeavyAdjustment(50, false)).toBe(50); + expect(applyDetailHeavyAdjustment(MIN_THRESHOLD_FLOOR + 5, false)).toBe(MIN_THRESHOLD_FLOOR + 5); + }); + + it('subtracts DETAIL_HEAVY_ADJUSTMENT when detail-heavy and result is above floor', () => { + expect(applyDetailHeavyAdjustment(65, true)).toBe(65 - DETAIL_HEAVY_ADJUSTMENT); + expect(applyDetailHeavyAdjustment(50, true)).toBe(50 - DETAIL_HEAVY_ADJUSTMENT); + }); + + it('floors at MIN_THRESHOLD_FLOOR when adjustment would drop below it', () => { + // 40 - 15 = 25, below the 30 floor: floor wins. + expect(applyDetailHeavyAdjustment(40, true)).toBe(MIN_THRESHOLD_FLOOR); + // 35 - 15 = 20, well below: still 30. + expect(applyDetailHeavyAdjustment(35, true)).toBe(MIN_THRESHOLD_FLOOR); + }); + + it('returns exactly MIN_THRESHOLD_FLOOR at the boundary', () => { + // 45 - 15 = 30, exactly at the floor. + expect(applyDetailHeavyAdjustment(45, true)).toBe(MIN_THRESHOLD_FLOOR); + }); + + it('does not raise a threshold that is already below the floor', () => { + // The function only floors AFTER the subtraction. A caller that + // passes a value below the floor with isDetailHeavy=false should + // get it back unchanged; the floor is a downward guard, not an + // unconditional clamp. 
+ expect(applyDetailHeavyAdjustment(20, false)).toBe(20); + }); +}); + +// ── getRotZone: zone classification ────────────────────────────────────── + +describe('getRotZone: zone classification', () => { + it('returns healthy below the warn threshold', () => { + expect(getRotZone('claude-sonnet-4-6', 30, false)).toBe('healthy'); + expect(getRotZone('claude-opus-4-6', 50, false)).toBe('healthy'); + }); + + it('returns approaching at the warn threshold', () => { + expect(getRotZone('claude-sonnet-4-6', 60, false)).toBe('approaching'); + expect(getRotZone('claude-opus-4-6', 65, false)).toBe('approaching'); + }); + + it('returns approaching just below the critical threshold', () => { + expect(getRotZone('claude-sonnet-4-6', 79, false)).toBe('approaching'); + expect(getRotZone('claude-opus-4-6', 84, false)).toBe('approaching'); + }); + + it('returns in-rot at the critical threshold', () => { + expect(getRotZone('claude-sonnet-4-6', 80, false)).toBe('in-rot'); + expect(getRotZone('claude-opus-4-6', 85, false)).toBe('in-rot'); + }); + + it('detail-heavy shifts the zone earlier', () => { + // Sonnet 4.6 at 50%: healthy when not detail-heavy (warn=60), + // approaching when detail-heavy (warn=45). + expect(getRotZone('claude-sonnet-4-6', 50, false)).toBe('healthy'); + expect(getRotZone('claude-sonnet-4-6', 50, true)).toBe('approaching'); + }); + + it('detail-heavy can push from approaching to in-rot', () => { + // Opus 4.6 at 75%: approaching when not detail-heavy (warn=65, crit=85), + // in-rot when detail-heavy (warn=50, crit=70). + expect(getRotZone('claude-opus-4-6', 75, false)).toBe('approaching'); + expect(getRotZone('claude-opus-4-6', 75, true)).toBe('in-rot'); + }); + + it('respects the absolute critical floor regardless of model', () => { + // Even Opus 4.7 (the most-permissive 1M profile) trips in-rot at 90%. 
+ expect(getRotZone('claude-opus-4-7', ABSOLUTE_CRITICAL_FLOOR, false)).toBe('in-rot'); + expect(getRotZone('claude-opus-4-7', 95, false)).toBe('in-rot'); + }); + + it('classifies unknown models via the fallback profile', () => { + // Fallback uses Sonnet 4.5 numbers (50/75). + expect(getRotZone('unknown-model', 49, false)).toBe('healthy'); + expect(getRotZone('unknown-model', 50, false)).toBe('approaching'); + expect(getRotZone('unknown-model', 75, false)).toBe('in-rot'); + }); +}); + +// ── getRotCoaching: copy ───────────────────────────────────────────────── + +describe('getRotCoaching: copy generation', () => { + it('returns "fresh and responsive" copy for very low context', () => { + const msg = getRotCoaching('claude-sonnet-4-6', 5, false); + expect(msg).toMatch(/fresh and responsive/i); + }); + + it('mentions the model label and window in the healthy zone above 30%', () => { + const msg = getRotCoaching('claude-sonnet-4-6', 45, false); + expect(msg).toContain('Sonnet 4.6'); + expect(msg).toContain('1M'); + expect(msg).toMatch(/45%/); + }); + + it('shows 200k window for Sonnet 4.5', () => { + const msg = getRotCoaching('claude-sonnet-4-5', 35, false); + expect(msg).toContain('Sonnet 4.5'); + expect(msg).toContain('200k'); + }); + + it('cites the MRCR figure verbatim in approaching zone for Sonnet 4.5', () => { + const msg = getRotCoaching('claude-sonnet-4-5', 55, false); + expect(msg).toMatch(/Sonnet 4\.5/); + expect(msg).toMatch(/18\.5%/); + expect(msg).toMatch(/MRCR/); + }); + + it('cites the MRCR figure verbatim in approaching zone for Opus 4.6', () => { + const msg = getRotCoaching('claude-opus-4-6', 70, false); + expect(msg).toMatch(/Opus 4\.6/); + expect(msg).toMatch(/76%/); + expect(msg).toMatch(/MRCR/); + }); + + it('omits MRCR clause for models without a published score', () => { + // Sonnet 4.6 has no MRCR on its profile; copy should not invent one. 
+ const msg = getRotCoaching('claude-sonnet-4-6', 65, false); + expect(msg).not.toMatch(/MRCR/); + }); + + it('uses compaction-aware copy for 1M models in approaching zone', () => { + const msg = getRotCoaching('claude-sonnet-4-6', 65, false); + expect(msg).toMatch(/compaction/i); + }); + + it('uses harder copy for non-compaction models in approaching zone', () => { + const msg = getRotCoaching('claude-sonnet-4-5', 55, false); + expect(msg).toMatch(/start a new chat now/i); + expect(msg).not.toMatch(/compaction/i); + }); + + it('mentions Projects in the in-rot zone for non-compaction models', () => { + const msg = getRotCoaching('claude-sonnet-4-5', 80, false); + expect(msg).toMatch(/Projects/); + expect(msg).toMatch(/start a new chat/i); + }); + + it('uses a softer in-rot message for compaction-aware models', () => { + const msg = getRotCoaching('claude-opus-4-7', 90, false); + expect(msg).toMatch(/compaction/i); + expect(msg).toMatch(/fine details/i); + }); + + it('never includes a `/compact` instruction in any copy', () => { + // claude.ai web does not expose `/compact`. The slash command + // belongs to Claude Code only. This guard prevents drift. 
+ for (const model of ['claude-opus-4-7', 'claude-opus-4-6', 'claude-sonnet-4-6', 'claude-sonnet-4-5', 'claude-haiku-4-5']) { + for (const pct of [10, 50, 70, 85, 95]) { + const msg = getRotCoaching(model, pct, false); + expect(msg).not.toMatch(/\/compact/); + } + } + }); + + it('returns non-empty copy across the full range and all models', () => { + for (const model of ['claude-opus-4-7', 'claude-opus-4-6', 'claude-sonnet-4-6', 'claude-sonnet-4-5', 'claude-haiku-4-5', 'unknown-model']) { + for (const pct of [0, 25, 50, 65, 75, 85, 95]) { + for (const detail of [true, false]) { + const msg = getRotCoaching(model, pct, detail); + expect(msg.length).toBeGreaterThan(0); + } + } + } + }); +}); + +// ── Provenance / drift guard ───────────────────────────────────────────── + +describe('drift guard: every cited fact has provenance', () => { + it('every profile with mrcrAt1MPct also has sourceUrl and sourceQuote', () => { + for (const profile of ROT_PROFILES) { + if (profile.mrcrAt1MPct !== undefined) { + expect(profile.sourceUrl, `${profile.label}: missing sourceUrl`).toBeTruthy(); + expect(profile.sourceQuote, `${profile.label}: missing sourceQuote`).toBeTruthy(); + } + } + }); + + it('every quoted MRCR figure appears verbatim in its sourceQuote', () => { + // Catches the case where someone changes the number but forgets + // to update the quote (or vice versa). The pinned quote currently + // contains both Opus 4.6's 76% and Sonnet 4.5's 18.5%. 
+ for (const profile of ROT_PROFILES) { + if (profile.mrcrAt1MPct !== undefined && profile.sourceQuote) { + expect( + profile.sourceQuote, + `${profile.label}: source quote does not mention ${profile.mrcrAt1MPct}%`, + ).toContain(`${profile.mrcrAt1MPct}%`); + } + } + }); + + it('every sourceUrl looks like an Anthropic URL', () => { + for (const profile of ROT_PROFILES) { + if (profile.sourceUrl) { + expect(profile.sourceUrl).toMatch(/anthropic\.com|claude\.com/); + } + } + }); +}); + +// ── Constants sanity ───────────────────────────────────────────────────── + +describe('constants: sanity bounds', () => { + it('every profile has warn < critical', () => { + for (const profile of ROT_PROFILES) { + expect(profile.warnAtPct).toBeLessThan(profile.criticalAtPct); + } + }); + + it('every profile critical is below the absolute floor', () => { + // Otherwise the per-model crit never fires before the absolute + // safety net, making it dead code. + for (const profile of ROT_PROFILES) { + expect(profile.criticalAtPct).toBeLessThanOrEqual(ABSOLUTE_CRITICAL_FLOOR); + } + }); + + it('every profile has a positive context window', () => { + for (const profile of ROT_PROFILES) { + expect(profile.contextWindow).toBeGreaterThan(0); + } + }); + + it('DETAIL_HEAVY_ADJUSTMENT is positive and meaningful', () => { + expect(DETAIL_HEAVY_ADJUSTMENT).toBeGreaterThan(0); + expect(DETAIL_HEAVY_ADJUSTMENT).toBeLessThan(50); + }); +}); diff --git a/tests/unit/health-score.test.ts b/tests/unit/health-score.test.ts index dd18910..7373a82 100644 --- a/tests/unit/health-score.test.ts +++ b/tests/unit/health-score.test.ts @@ -1,12 +1,18 @@ // tests/unit/health-score.test.ts // Tests for the conversation health score module. +// +// Health score is now per-model: thresholds come from +// lib/context-rot-thresholds.ts. Tests pick a specific model so the +// threshold under test is unambiguous. 
Sonnet 4.5 (warn=50, crit=75) is +// the workhorse fixture because its 200k window matches the historical +// test assumptions and it carries a verified Anthropic MRCR figure. import { describe, it, expect } from 'vitest'; import { computeHealthScore, computeGrowthRate, - HEALTHY_CEIL, DEGRADING_CEIL, + HEALTHY_CEIL, TURN_HEALTHY_CEIL, TURN_DEGRADING_CEIL, TURN_CRITICAL_CEIL, @@ -14,15 +20,36 @@ import { type HealthInput, } from '../../lib/health-score'; +// ── Fixtures ────────────────────────────────────────────────────────────────── + +// Sonnet 4.5 is the rot exemplar (warn=50, crit=75) and matches the +// 200k-window assumptions in the legacy tests. Most cases use it. +const SONNET_45 = 'claude-sonnet-4-5'; + +// Opus 4.6 is the well-behaved 1M model (warn=65, crit=85). Used for +// cases that need to verify model-specific differentiation. +const OPUS_46 = 'claude-opus-4-6'; + +// Sonnet 4.6 (warn=60, crit=80, compaction-aware). Used for copy +// assertions about server-side compaction. 
+const SONNET_46 = 'claude-sonnet-4-6'; + function input(overrides: Partial<HealthInput> = {}): HealthInput { - return { contextPct: 10, turnCount: 3, growthRate: null, ...overrides }; + return { + contextPct: 10, + turnCount: 3, + growthRate: null, + model: SONNET_45, + isDetailHeavy: false, + ...overrides, + }; } // ── Critical conditions ─────────────────────────────────────────────────────── describe('critical health', () => { - it('returns critical when context >= DEGRADING_CEIL', () => { - const h = computeHealthScore(input({ contextPct: DEGRADING_CEIL })); + it('returns critical at the absolute floor (90%) regardless of model', () => { + const h = computeHealthScore(input({ model: OPUS_46, contextPct: DEGRADING_CEIL })); expect(h.level).toBe('critical'); }); @@ -31,50 +58,52 @@ describe('critical health', () => { expect(h.level).toBe('critical'); }); - it('returns critical when context >= HEALTHY_CEIL and turns > TURN_DEGRADING_CEIL', () => { - const h = computeHealthScore(input({ - contextPct: HEALTHY_CEIL, - turnCount: TURN_DEGRADING_CEIL + 1, - })); + it('returns critical at the per-model critical threshold', () => { + // Sonnet 4.5 crit = 75 + const h = computeHealthScore(input({ contextPct: 75 })); expect(h.level).toBe('critical'); }); - it('includes turn count in coaching when critical from turns', () => { - const h = computeHealthScore(input({ - contextPct: 75, - turnCount: 25, - })); - expect(h.coaching).toMatch(/25 turns deep/); + it('Opus 4.6 is only degrading at 75% where Sonnet 4.5 would be critical', () => { + const sonnet = computeHealthScore(input({ model: SONNET_45, contextPct: 75 })); + const opus = computeHealthScore(input({ model: OPUS_46, contextPct: 75 })); + expect(sonnet.level).toBe('critical'); + // 75 is between Opus warn (65) and Opus crit (85) -> degrading.
+ expect(opus.level).toBe('degrading'); + }); + + it('promotes to critical when in approaching zone with deep turn count', () => { + // Sonnet 4.5 approaching at 60% + 25 turns -> rule 2 promotes to critical. + const h = computeHealthScore(input({ contextPct: 60, turnCount: TURN_DEGRADING_CEIL + 5 })); + expect(h.level).toBe('critical'); + expect(h.coaching).toMatch(/turns deep/); }); }); // ── Degrading conditions ────────────────────────────────────────────────────── describe('degrading health', () => { - it('returns degrading when context >= HEALTHY_CEIL and turns > TURN_HEALTHY_CEIL', () => { - const h = computeHealthScore(input({ - contextPct: HEALTHY_CEIL, - turnCount: TURN_HEALTHY_CEIL + 1, - })); + it('returns degrading at the per-model warn threshold', () => { + // Sonnet 4.5 warn = 50 + const h = computeHealthScore(input({ contextPct: 50 })); expect(h.level).toBe('degrading'); }); - it('returns degrading at context just below DEGRADING_CEIL with moderate turns', () => { - const h = computeHealthScore(input({ - contextPct: DEGRADING_CEIL - 1, - turnCount: TURN_HEALTHY_CEIL + 5, - })); + it('returns degrading just below the per-model critical threshold', () => { + const h = computeHealthScore(input({ contextPct: 70 })); expect(h.level).toBe('degrading'); }); it('returns degrading when growth rate is fast and context > 30%', () => { + // 35% on Sonnet 4.5 is below warn=50 -> primary says healthy. + // Fast growth secondary kicks in. const h = computeHealthScore(input({ contextPct: 35, turnCount: 5, growthRate: FAST_GROWTH_PCT + 1, })); expect(h.level).toBe('degrading'); - expect(h.coaching).toMatch(/~\d+ message/); + expect(h.coaching).toMatch(/messages? 
until/); }); it('does not trigger growth-rate degrading when context <= 30%', () => { @@ -94,16 +123,17 @@ describe('degrading health', () => { expect(h.level).toBe('degrading'); }); - it('uses singular "message" when only 1 message remaining', () => { - // contextPct=80, growthRate=15: (100-80)/15 = 1.33, rounds to 1 → "~1 message" - // Rule 4 fires: growthRate(15) > FAST_GROWTH_PCT(8) && contextPct(80) > 30 + it('uses singular "message" when only 1 message remaining to warn', () => { + // Sonnet 4.5 warn=50, contextPct=45, growth=15 -> headroom 5, 5/15 -> 0 + // Pick numbers so headroom/growth rounds to 1. + // contextPct=42, growth=8.1 -> just over FAST_GROWTH_PCT, headroom 8, ~1 message const h = computeHealthScore(input({ - contextPct: 80, + contextPct: 42, turnCount: 3, - growthRate: 15, + growthRate: 8.1, })); expect(h.level).toBe('degrading'); - expect(h.coaching).toMatch(/~1 message until context limit/); + expect(h.coaching).toMatch(/1 message until/); }); }); @@ -116,55 +146,104 @@ describe('healthy', () => { expect(h.label).toBe('Healthy'); }); - it('returns healthy at context just below HEALTHY_CEIL with few turns', () => { - const h = computeHealthScore(input({ - contextPct: HEALTHY_CEIL - 1, - turnCount: TURN_HEALTHY_CEIL, - })); + it('returns healthy below the per-model warn threshold with few turns', () => { + // Sonnet 4.5 warn = 50; 49 with 5 turns is healthy. 
+ const h = computeHealthScore(input({ contextPct: 49, turnCount: 5 })); expect(h.level).toBe('healthy'); }); it('shows "fresh and responsive" for low context', () => { const h = computeHealthScore(input({ contextPct: 10 })); - expect(h.coaching).toMatch(/fresh and responsive/); + expect(h.coaching).toMatch(/fresh and responsive/i); }); - it('shows context percentage for moderate usage', () => { + it('shows context percentage and model label for moderate usage', () => { const h = computeHealthScore(input({ contextPct: 35, turnCount: 5 })); - expect(h.coaching).toMatch(/35% context used/); + expect(h.coaching).toMatch(/35%/); + expect(h.coaching).toMatch(/Sonnet 4\.5/); }); +}); - it('returns healthy when turns are below threshold even with moderate context', () => { - const h = computeHealthScore(input({ - contextPct: 45, - turnCount: TURN_HEALTHY_CEIL, - })); - expect(h.level).toBe('healthy'); +// ── Detail-heavy adjustment ────────────────────────────────────────────────── + +describe('detail-heavy adjustment', () => { + it('shifts the warn threshold earlier on Opus 4.6', () => { + // Opus 4.6 warn = 65 normally, 50 under detail-heavy. + // 55% is healthy on a normal turn but degrading on a precision turn. + const normal = computeHealthScore(input({ model: OPUS_46, contextPct: 55, isDetailHeavy: false })); + const heavy = computeHealthScore(input({ model: OPUS_46, contextPct: 55, isDetailHeavy: true })); + expect(normal.level).toBe('healthy'); + expect(heavy.level).toBe('degrading'); + }); + + it('shifts the critical threshold earlier on Sonnet 4.5', () => { + // Sonnet 4.5 crit = 75 normally, 60 under detail-heavy. + // 62% is degrading on a normal turn but critical on a precision turn. 
+    const normal = computeHealthScore(input({ model: SONNET_45, contextPct: 62, isDetailHeavy: false }));
+    const heavy = computeHealthScore(input({ model: SONNET_45, contextPct: 62, isDetailHeavy: true }));
+    expect(normal.level).toBe('degrading');
+    expect(heavy.level).toBe('critical');
+  });
+});
+
+// ── Coaching copy ────────────────────────────────────────────────────────────
+
+describe('coaching copy: model-aware and evidence-grounded', () => {
+  it('cites Sonnet 4.5 MRCR figure in approaching-zone copy', () => {
+    const h = computeHealthScore(input({ model: SONNET_45, contextPct: 55 }));
+    expect(h.level).toBe('degrading');
+    expect(h.coaching).toMatch(/Sonnet 4\.5/);
+    expect(h.coaching).toMatch(/18\.5%/);
+    expect(h.coaching).toMatch(/MRCR/);
+  });
+
+  it('cites Opus 4.6 MRCR figure in approaching-zone copy', () => {
+    const h = computeHealthScore(input({ model: OPUS_46, contextPct: 70 }));
+    expect(h.level).toBe('degrading');
+    expect(h.coaching).toMatch(/Opus 4\.6/);
+    expect(h.coaching).toMatch(/76%/);
+  });
+
+  it('mentions compaction for Sonnet 4.6 (compaction-aware)', () => {
+    const h = computeHealthScore(input({ model: SONNET_46, contextPct: 65 }));
+    expect(h.coaching).toMatch(/compaction/i);
+  });
+
+  it('does not mention compaction for Sonnet 4.5 (no compaction)', () => {
+    const h = computeHealthScore(input({ model: SONNET_45, contextPct: 55 }));
+    expect(h.coaching).not.toMatch(/compaction/i);
+  });
+
+  it('mentions Projects in critical zone for Sonnet 4.5', () => {
+    const h = computeHealthScore(input({ model: SONNET_45, contextPct: 80 }));
+    expect(h.coaching).toMatch(/Projects/);
+  });
+
+  it('never includes a /compact instruction', () => {
+    for (const model of [SONNET_45, SONNET_46, OPUS_46]) {
+      for (const pct of [10, 50, 70, 85, 95]) {
+        const h = computeHealthScore(input({ model, contextPct: pct, turnCount: 5 }));
+        expect(h.coaching).not.toMatch(/\/compact/);
+      }
+    }
   });
 });
 
 // ── Boundary conditions ───────────────────────────────────────────────────────
 
 describe('boundaries', () => {
-  it('context at exactly HEALTHY_CEIL with exactly TURN_HEALTHY_CEIL turns = healthy', () => {
-    const h = computeHealthScore(input({
-      contextPct: HEALTHY_CEIL,
-      turnCount: TURN_HEALTHY_CEIL,
-    }));
-    // HEALTHY_CEIL with turnCount = TURN_HEALTHY_CEIL (not > TURN_HEALTHY_CEIL): healthy
-    expect(h.level).toBe('healthy');
+  it('Sonnet 4.5 at exactly 50% with few turns = degrading (warn boundary)', () => {
+    const h = computeHealthScore(input({ contextPct: 50, turnCount: 5 }));
+    expect(h.level).toBe('degrading');
   });
 
-  it('context at exactly HEALTHY_CEIL with TURN_HEALTHY_CEIL + 1 = degrading', () => {
-    const h = computeHealthScore(input({
-      contextPct: HEALTHY_CEIL,
-      turnCount: TURN_HEALTHY_CEIL + 1,
-    }));
-    expect(h.level).toBe('degrading');
+  it('Sonnet 4.5 just below warn (49%) with few turns = healthy', () => {
+    const h = computeHealthScore(input({ contextPct: 49, turnCount: 5 }));
+    expect(h.level).toBe('healthy');
   });
 
-  it('context at exactly DEGRADING_CEIL = critical regardless of turns', () => {
-    const h = computeHealthScore(input({ contextPct: DEGRADING_CEIL, turnCount: 1 }));
+  it('absolute floor (90%) overrides per-model regardless of model', () => {
+    const h = computeHealthScore(input({ model: OPUS_46, contextPct: DEGRADING_CEIL, turnCount: 1 }));
     expect(h.level).toBe('critical');
   });
 
@@ -176,37 +255,28 @@ describe('boundaries', () => {
     }));
     expect(h.level).toBe('healthy');
   });
-
-  it('growth rate just above FAST_GROWTH_PCT with context > 30 triggers degrading', () => {
-    const h = computeHealthScore(input({
-      contextPct: 35,
-      turnCount: 5,
-      growthRate: FAST_GROWTH_PCT + 0.1,
-    }));
-    expect(h.level).toBe('degrading');
-  });
 });
 
 // ── Rule priority ─────────────────────────────────────────────────────────────
 
 describe('rule priority', () => {
-  it('context-based critical overrides turn-based degrading', () => {
+  it('absolute critical overrides turn-based degrading', () => {
     const h = computeHealthScore(input({
       contextPct: DEGRADING_CEIL,
       turnCount: TURN_HEALTHY_CEIL + 1,
     }));
     expect(h.level).toBe('critical');
-    expect(h.coaching).toMatch(/nearly full/);
   });
 
-  it('turn+context critical overrides growth-rate degrading', () => {
+  it('per-model critical wins over fast-growth secondary', () => {
+    // Sonnet 4.5 at 80% is in-rot. Fast growth would only have made it
+    // degrading. Verify the more severe primary classifier wins.
     const h = computeHealthScore(input({
-      contextPct: 75,
-      turnCount: TURN_DEGRADING_CEIL + 1,
+      contextPct: 80,
+      turnCount: 3,
       growthRate: FAST_GROWTH_PCT + 5,
     }));
     expect(h.level).toBe('critical');
-    expect(h.coaching).toMatch(/turns deep/);
   });
 });
 
@@ -227,6 +297,18 @@ describe('HealthScore shape', () => {
   });
 });
 
+// ── Legacy constants still exported (back-compat) ─────────────────────────────
+
+describe('legacy exports', () => {
+  it('exports HEALTHY_CEIL = 70 (legacy default warn)', () => {
+    expect(HEALTHY_CEIL).toBe(70);
+  });
+
+  it('exports DEGRADING_CEIL = 90 (absolute critical floor)', () => {
+    expect(DEGRADING_CEIL).toBe(90);
+  });
+});
+
 // ── computeGrowthRate ─────────────────────────────────────────────────────────
 
 describe('computeGrowthRate', () => {
@@ -243,17 +325,14 @@ describe('computeGrowthRate', () => {
   });
 
   it('computes average upward growth', () => {
-    // [10, 20, 30]: two upward steps of 10 each
     expect(computeGrowthRate([10, 20, 30])).toBe(10);
   });
 
   it('ignores downward steps', () => {
-    // [10, 20, 15, 25]: upward steps are 10 and 10, avg = 10
     expect(computeGrowthRate([10, 20, 15, 25])).toBe(10);
   });
 
   it('handles mixed growth rates', () => {
-    // [0, 5, 8, 20]: upward steps are 5, 3, 12. avg = 20/3 ≈ 6.67
     expect(computeGrowthRate([0, 5, 8, 20])).toBeCloseTo(6.67, 1);
   });
 });
diff --git a/tests/unit/prompt-analysis.test.ts b/tests/unit/prompt-analysis.test.ts
index d4bc484..a56767a 100644
--- a/tests/unit/prompt-analysis.test.ts
+++ b/tests/unit/prompt-analysis.test.ts
@@ -5,6 +5,7 @@ import { describe, it, expect } from 'vitest';
 import {
   classifyModelTier,
   analyzePrompt,
+  isDetailHeavy,
   LARGE_PASTE_MIN_CHARS,
   FOLLOWUP_CHAIN_MIN_COUNT,
   type PromptCharacteristics,
@@ -368,3 +369,66 @@ describe('model_suggestion with delta data', () => {
     expect(ms).toBeUndefined();
   });
 });
+
+// ── DETAIL_HEAVY_KEYWORDS / inject.ts mirror drift guard ────────────────────
+//
+// inject.ts cannot import from lib/, so DETAIL_HEAVY_KEYWORDS is mirrored
+// inline as `detailHeavyKeywords`. If the two ever drift, the warning
+// threshold shifts at the bridge layer but not in lib (or vice versa),
+// and the user sees inconsistent coaching. This test reads inject.ts as
+// text and asserts every keyword on the lib side appears verbatim.
+
+describe('inject.ts mirrors DETAIL_HEAVY_KEYWORDS', () => {
+  it('contains every lib-side keyword as a string literal', async () => {
+    const fs = await import('node:fs/promises');
+    const path = await import('node:path');
+    const url = await import('node:url');
+    const here = path.dirname(url.fileURLToPath(import.meta.url));
+    const injectPath = path.resolve(here, '../../entrypoints/inject.ts');
+    const source = await fs.readFile(injectPath, 'utf8');
+    // Re-import to avoid pulling in the whole lib through static deps.
+    const { DETAIL_HEAVY_KEYWORDS } = await import('../../lib/prompt-analysis');
+    for (const keyword of DETAIL_HEAVY_KEYWORDS) {
+      expect(
+        source.includes(`'${keyword}'`),
+        `inject.ts is missing the mirrored keyword '${keyword}'`,
+      ).toBe(true);
+    }
+  });
+});
+
+// ── isDetailHeavy ────────────────────────────────────────────────────────────
+
+describe('isDetailHeavy', () => {
+  it('returns false for empty input', () => {
+    expect(isDetailHeavy('')).toBe(false);
+  });
+
+  it('returns false for casual prose', () => {
+    expect(isDetailHeavy('how do I sort an array in python')).toBe(false);
+    expect(isDetailHeavy('any thoughts on the pricing model?')).toBe(false);
+    expect(isDetailHeavy('all good?')).toBe(false);
+  });
+
+  it('returns true when the prompt contains a fenced code block', () => {
+    expect(isDetailHeavy('here is the code:\n```ts\nconst x = 1;\n```')).toBe(true);
+  });
+
+  it('returns true for precision keywords (case-insensitive)', () => {
+    expect(isDetailHeavy('give me the exact bytes')).toBe(true);
+    expect(isDetailHeavy('Quote it VERBATIM please')).toBe(true);
+    expect(isDetailHeavy('precise reproduction needed')).toBe(true);
+  });
+
+  it('returns true for compound triggers like "list every"', () => {
+    expect(isDetailHeavy('list every parameter we discussed')).toBe(true);
+    expect(isDetailHeavy('full list of files touched')).toBe(true);
+  });
+
+  it('does not over-trigger on the bare word "all" or "every"', () => {
+    // Bare "all" and "every" appear constantly in prose and would
+    // flood the warning if treated as triggers on their own.
+    expect(isDetailHeavy('I tried all the suggestions')).toBe(false);
+    expect(isDetailHeavy('every user reports the same bug')).toBe(false);
+  });
+});