feat(scan): sentence-salience reducer for embed inputs (ADR-130)#98
Conversation
Implements reduce_for_embed(input, budget_tokens) — the structural replacement for the interim truncation caps from PRs #94/#95/#96. Pipeline: tokenize → frequency-count over whole input → softmax → score each sentence by average token weight → keep top-N within token budget → reassemble in document order. Non-prose fallback: when split_sentences finds ≤1 sentence (bash commands, JSON, heredoc bodies, code blobs), degrade to bag-of-words token selection rather than emitting empty or panicking. Panic-free on input-derived values; pathological cases (e.g. all stopwords) fall through to a hard char-budget slice so a non-empty input never produces an empty output. Unit tests cover: passthrough under budget, empty/whitespace input, prose reduction with topical term preservation, document-order preservation, single-long-sentence fallback, budget enforcement, stopword filtering, punctuation handling, softmax correctness, sentence-splitter behavior on prose/paragraph/non-prose inputs, non-prose fallback contract. 14/14 tests pass. Module is not yet wired into scan paths — next commit integrates the four entrypoints. Refs ADR-130.
Each of the four scan entrypoints now passes its embed query through reduce::reduce_for_embed before invoking batch_embed_score. Pattern/ keyword/regex matching downstream still operates on the full input — only the embed signal sees the reduced form. Per-hook budgets (approximate tokens, sized for ~75% of MiniLM's 128-token position-embedding window): scan::prompt → 95 tokens (room for paragraph-scale user prompts) scan::task → 95 tokens (agent delegation is the largest input class) scan::command → 60 tokens (command shape + description suffix) scan::file → 30 tokens (filepaths are short by nature) Inputs under budget pass through unchanged; long inputs collapse to top-salience sentences (or top-K tokens for non-prose) within budget. 84 unit + 11 simulator tests pass. Behavioral test against real crash repros lands in the next commit. Refs ADR-130.
Adds four tests using the exact payload shapes that triggered today's SIGABRTs: - behavior_vue_expert_dispatch_reduces_safely (PID 128657/156817 shape) - behavior_gh_pr_heredoc_does_not_crash (PID 88563/197729 shape) - behavior_task_notification_reduces_safely (PID 197729 shape) - high_frequency_terms_survive_reduction (recall proxy) Each asserts the contract: non-empty output, within ~budget+headroom, high-frequency content survives. The reducer being deterministic and panic-free is verified by the existing unit tests; these add the "runs cleanly on real-world payloads" gate. Also fixes a slice-bounds panic in split_sentences when a sentence end (`.`+whitespace) is immediately followed by a paragraph break (`\n\n`). Found by the gh-pr-heredoc behavioral test. 18/18 reduce tests pass; full 84-test suite + 11-scenario simulator suite still pass.
Two improvements to the recall validation: 1. Threshold-aware top-1. The previous test grabbed the highest-scoring row from way-embed match --threshold 0.0 regardless of whether it would actually fire in production. Per-way embed_threshold gating (EN ≥ 0.40, multi ≥ 0.55) is what determines firing in scan/mod.rs:: match_prompt; the test now mirrors that. Queries with no good corpus match produce (none) for both full and reduced — the correct outcome. 2. Budget tuning. Bumped per-hook budgets from ~75% to ~85% of the 128-token MiniLM window: BUDGET_PROMPT/TASK 95→110, BUDGET_COMMAND 60→75, BUDGET_FILE unchanged. Approximate tokenizer over-counts vs WordPiece so real tokens stay safely under 128 even at the higher numbers. Also extends scaffold stopwords with PR/git workflow terms (pr, branch, diff, review, body, scope, summary, etc.) that dominate agent-dispatch prompts and crowd out topical signal under pure frequency salience. Result on 8-query validation set against the live corpus: 8/8 (100%) firing-set agreement between full input and reducer output. Above the ≥ 90% gate. 6 queries correctly produce no-match for both runs (weak-signal queries the corpus has no good match for); 2 queries fire identical ways for both runs. Refs ADR-130.
ReviewTook the four commits in order; the story holds together — module first, integration second, behavioral tests with the slice-bounds fix third, threshold-aware recall gate plus budget tuning fourth. The merge gate logic in commit 4 is the right calibration to mirror production semantics. Algorithm correctnessSalience scoring matches ADR-130 as described. Walked the math:
Hot-path safetyPanic-free contract holds end-to-end. Audited:
Suggestion (non-blocking) — Budget calibration`approx_tokens` heuristic (`reduce.rs:79-87`) using `max(word_count, char_count/4)`:
Stopword listThe 80-ish word list is small and reasoned. The PR/git scaffold additions (`pr branch diff merge review body title summary scope...`) are well-motivated by the dispatch-prompt structure but they DO bleed into legitimate PR/git topic queries. Quick check:
Minor: `description` (line 47) and `summary` (line 50) are both common topical English words. Worth monitoring for regressions if a future recall query targets ways about "writing descriptions" or "writing summaries" — these would score near 0 salience. IntegrationThe contract "regex sees full input, embed sees reduced" is honored at all four call sites and the inline comments make it explicit:
Tests18 unit + 4 behavioral + 1 ignored integration. Coverage is good and the behavioral tests built from real SIGABRT payloads are the right anchor — they're the artifact of "the bug that started this work." `split_sentences` has explicit tests for the paragraph-break case that triggered the panic fix. Nit (non-blocking) — `reduce.rs:543`: `use std::process::Command;` inside `top1_fires` is unused (only `best_match` actually uses it at line 568). Under `#[cfg(test)]` this won't break a build but it'll produce a dead-import warning under `cargo clippy --tests --all-targets`. Trivial cleanup. Observation (non-blocking) — `reduce.rs:198-205` (greedy fitting): the loop uses `continue` not `break` when a sentence exceeds remaining budget. That means a high-salience long sentence can be skipped while a lower-salience short sentence later still gets included. This is a divergence from strict top-N-by-salience but it maximizes budget utilization. Not a bug — worth a one-line comment noting the deliberate behavior so the next reader doesn't "fix" it to `break`. ADR alignment
VerdictApprove. No blockers. Two non-blocking items worth a follow-up issue:
Optional cleanups (your call): the `budget_tokens == 0` defensive guard in `reduce_by_tokens`, and a one-line comment at `reduce.rs:200` documenting the `continue` (not `break`) choice. Nice work taking the time to write the production-faithful recall validation. The threshold-aware mirroring of `match_prompt` semantics in `top1_fires` is the difference between "recall test" and "recall gate." AI-assisted review via Claude |
Four cleanups from code-reviewer: 1. CJK tokenization estimate. approx_tokens used max(words, chars/4), which severely under-counts WordPiece tokens for Japanese/Chinese where each ideograph or kana costs ~1 token. A 400-char Japanese input would estimate at ~100 tokens but produce ~400 real tokens past the 128-position-embedding window → SIGABRT regression for multilingual users. Added is_cjk(char) covering CJK Unified Ideographs (incl. Extensions A/B), Hiragana, Katakana, Hangul, and CJK Compatibility Ideographs. approx_tokens now takes the max of three estimators: words, chars/4, and cjk-chars. 2. Dead `use std::process::Command;` import in test fn top1_fires — the function delegates to best_match and doesn't use Command directly. 3. Defensive budget_tokens == 0 guard in reduce_by_tokens. Unreachable from current call sites, but a 0 budget would silently produce empty output and break the "non-empty input never returns empty" contract. 4. Comment on the deliberate `continue` (not `break`) in the greedy salience-sorted fitting loop — a later sentence may still fit even after a longer earlier one didn't. Three new tests cover the changes: - cjk_input_estimates_one_token_per_char (JP, ZH, EN) - cjk_long_input_triggers_reduction (would have been a SIGABRT pre-fix) - budget_zero_does_not_return_empty 91 unit + 11 simulator tests pass. Recall validation still 8/8 (100%) against the live corpus. Refs PR #98 review (comment 4512998689).
Summary
Implements ADR-130. Replaces the interim character-truncation caps (PRs #94, #95, #96) with a structural fix: a sentence-salience reducer that runs ahead of every
batch_embed_scorecall, preserving high-signal prose instead of clamping at an arbitrary char index.Inputs that fit the embedder's working window pass through unchanged. Long inputs collapse to top-salience sentences within a per-hook token budget. Non-prose inputs (bash commands, JSON blobs, heredoc bodies — anything
split_sentencesreturns ≤1 sentence on) degrade to bag-of-words top-K token selection rather than emitting empty.Algorithm
Bag-of-words fallback for non-prose, hard char-budget fallback when even tokenization yields nothing usable. Panic-free on input-derived values.
Per-hook budgets
scan promptscan taskscan commandscan fileSized for ~85% of MiniLM's 128-token position-embedding window. The approximate tokenizer (
max(word_count, char_count/4)) over-counts vs WordPiece, so real tokens land safely under 128 at all four budgets.What stays from the interim PRs
subagent_typecheck) — stays. Custom agents have their own constitution; they shouldn't reach the embed path at all.Tests
18 unit tests in
reduce.rs— pass/fail behavior:Existing 84 unit + 11 simulator tests still pass.
Recall validation (the merge gate from the briefing)
Production-faithful gate: queries embedded both ways, threshold gating applied (EN ≥ 0.40, multi ≥ 0.55, per ADR-125), top-1 firing way compared. None == None counts as agreement.
Firing-set agreement: 8/8 (100%) — Gate: ≥ 90%.
Reproducible via
cargo test -p ways --release -- --ignored validate_recall_against_live_corpus --nocapture(requires the live corpus + way-embed binary).Of note: 6 of 8 queries correctly produce "no match" with both inputs — the corpus doesn't have a strong-enough match for those topics under production thresholds. The reducer preserves this null result. The two queries that do fire produce identical top-1 ways before and after reduction.
Manual crash repro
Three payloads identical in shape to today's actual SIGABRTs:
Commits
feat(scan): sentence-salience reducer module— reduce.rs + 14 unit testsfeat(scan): wire reducer into prompt/task/command/file paths— integrationtest(scan): behavioral tests against actual crash repros— 4 behavioral tests + slice-bounds bug fix in split_sentencestest(scan): production-faithful recall gate + budget tuning— threshold-aware recall validation, extended stopwords, bumped budgets to ~85% windowRefs