Skip to content

feat(scan): sentence-salience reducer for embed inputs (ADR-130)#98

Merged
aaronsb merged 5 commits into
mainfrom
feat/adr-130-sentence-salience-reducer
May 21, 2026
Merged

feat(scan): sentence-salience reducer for embed inputs (ADR-130)#98
aaronsb merged 5 commits into
mainfrom
feat/adr-130-sentence-salience-reducer

Conversation

@aaronsb
Copy link
Copy Markdown
Owner

@aaronsb aaronsb commented May 21, 2026

Summary

Implements ADR-130. Replaces the interim character-truncation caps (PRs #94, #95, #96) with a structural fix: a sentence-salience reducer that runs ahead of every batch_embed_score call, preserving high-signal prose instead of clamping at an arbitrary char index.

Inputs that fit the embedder's working window pass through unchanged. Long inputs collapse to top-salience sentences within a per-hook token budget. Non-prose inputs (bash commands, JSON blobs, heredoc bodies — anything split_sentences returns ≤1 sentence on) degrade to bag-of-words top-K token selection rather than emitting empty.

Algorithm

tokenize input → frequency count → softmax (numerically stable)
                                       ↓
                             per-sentence salience =
                             avg(softmax weights of its tokens)
                                       ↓
                       sort sentences by salience desc
                       take top-N within token budget
                       resort by document position
                                       ↓
                            join → embedder input

Bag-of-words fallback for non-prose, hard char-budget fallback when even tokenization yields nothing usable. Panic-free on input-derived values.

Per-hook budgets

Hook Budget (approx tokens)
scan prompt 110
scan task 110
scan command 75
scan file 30

Sized for ~85% of MiniLM's 128-token position-embedding window. The approximate tokenizer (max(word_count, char_count/4)) over-counts vs WordPiece, so real tokens land safely under 128 at all four budgets.

What stays from the interim PRs

Tests

18 unit tests in reduce.rs — pass/fail behavior:

  • passthrough under budget, empty input, prose reduction with topical term preservation
  • document order preservation, single-long-sentence fallback, budget enforcement
  • stopword filtering, punctuation handling, softmax correctness, sentence splitter
  • non-prose fallback (heredoc bodies, repeated-term blobs)
  • behavioral tests against the three actual crash payloads from today (vue-expert dispatch, gh pr create heredoc, task-notification blob)

Existing 84 unit + 11 simulator tests still pass.

Recall validation (the merge gate from the briefing)

Production-faithful gate: queries embedded both ways, threshold gating applied (EN ≥ 0.40, multi ≥ 0.55, per ADR-125), top-1 firing way compared. None == None counts as agreement.

Case Full fires Reduced fires Agree
vue agent dispatch (none) (none)
gh pr create heredoc (none) (none)
task notification meta/subagents meta/subagents
debugging question (none) (none)
schema migration question softwaredev/delivery/migrations softwaredev/delivery/migrations
test structure question (none) (none)
kubectl long args (none) (none)
code review request (none) (none)

Firing-set agreement: 8/8 (100%) — Gate: ≥ 90%.

Reproducible via cargo test -p ways --release -- --ignored validate_recall_against_live_corpus --nocapture (requires the live corpus + way-embed binary).

Of note: 6 of 8 queries correctly produce "no match" with both inputs — the corpus doesn't have a strong-enough match for those topics under production thresholds. The reducer preserves this null result. The two queries that do fire produce identical top-1 ways before and after reduction.

Manual crash repro

Three payloads identical in shape to today's actual SIGABRTs:

  • 11,040-char vue-expert agent dispatch → 0 crashes
  • 3,387-char gh pr create heredoc → 0 crashes, scan path emits hook context as designed
  • 2,864-char task-notification blob → 0 crashes

Commits

  1. feat(scan): sentence-salience reducer module — reduce.rs + 14 unit tests
  2. feat(scan): wire reducer into prompt/task/command/file paths — integration
  3. test(scan): behavioral tests against actual crash repros — 4 behavioral tests + slice-bounds bug fix in split_sentences
  4. test(scan): production-faithful recall gate + budget tuning — threshold-aware recall validation, extended stopwords, bumped budgets to ~85% window

Refs

aaronsb added 4 commits May 21, 2026 13:56
Implements reduce_for_embed(input, budget_tokens) — the structural
replacement for the interim truncation caps from PRs #94/#95/#96.

Pipeline: tokenize → frequency-count over whole input → softmax →
score each sentence by average token weight → keep top-N within
token budget → reassemble in document order.

Non-prose fallback: when split_sentences finds ≤1 sentence (bash
commands, JSON, heredoc bodies, code blobs), degrade to bag-of-words
token selection rather than emitting empty or panicking.

Panic-free on input-derived values; pathological cases (e.g. all
stopwords) fall through to a hard char-budget slice so a non-empty
input never produces an empty output.

Unit tests cover: passthrough under budget, empty/whitespace input,
prose reduction with topical term preservation, document-order
preservation, single-long-sentence fallback, budget enforcement,
stopword filtering, punctuation handling, softmax correctness,
sentence-splitter behavior on prose/paragraph/non-prose inputs,
non-prose fallback contract.

14/14 tests pass. Module is not yet wired into scan paths — next
commit integrates the four entrypoints.

Refs ADR-130.
Each of the four scan entrypoints now passes its embed query through
reduce::reduce_for_embed before invoking batch_embed_score. Pattern/
keyword/regex matching downstream still operates on the full input —
only the embed signal sees the reduced form.

Per-hook budgets (approximate tokens, sized for ~75% of MiniLM's
128-token position-embedding window):

  scan::prompt  → 95 tokens (room for paragraph-scale user prompts)
  scan::task    → 95 tokens (agent delegation is the largest input class)
  scan::command → 60 tokens (command shape + description suffix)
  scan::file    → 30 tokens (filepaths are short by nature)

Inputs under budget pass through unchanged; long inputs collapse to
top-salience sentences (or top-K tokens for non-prose) within budget.

84 unit + 11 simulator tests pass. Behavioral test against real crash
repros lands in the next commit.

Refs ADR-130.
Adds four tests using the exact payload shapes that triggered today's
SIGABRTs:

- behavior_vue_expert_dispatch_reduces_safely (PID 128657/156817 shape)
- behavior_gh_pr_heredoc_does_not_crash (PID 88563/197729 shape)
- behavior_task_notification_reduces_safely (PID 197729 shape)
- high_frequency_terms_survive_reduction (recall proxy)

Each asserts the contract: non-empty output, within ~budget+headroom,
high-frequency content survives. The reducer being deterministic and
panic-free is verified by the existing unit tests; these add the
"runs cleanly on real-world payloads" gate.

Also fixes a slice-bounds panic in split_sentences when a sentence
end (`.`+whitespace) is immediately followed by a paragraph break
(`\n\n`). Found by the gh-pr-heredoc behavioral test.

18/18 reduce tests pass; full 84-test suite + 11-scenario simulator
suite still pass.
Two improvements to the recall validation:

1. Threshold-aware top-1. The previous test grabbed the highest-scoring
   row from way-embed match --threshold 0.0 regardless of whether it
   would actually fire in production. Per-way embed_threshold gating
   (EN ≥ 0.40, multi ≥ 0.55) is what determines firing in scan/mod.rs::
   match_prompt; the test now mirrors that. Queries with no good corpus
   match produce (none) for both full and reduced — the correct outcome.

2. Budget tuning. Bumped per-hook budgets from ~75% to ~85% of the
   128-token MiniLM window: BUDGET_PROMPT/TASK 95→110, BUDGET_COMMAND
   60→75, BUDGET_FILE unchanged. Approximate tokenizer over-counts vs
   WordPiece so real tokens stay safely under 128 even at the higher
   numbers.

Also extends scaffold stopwords with PR/git workflow terms (pr, branch,
diff, review, body, scope, summary, etc.) that dominate agent-dispatch
prompts and crowd out topical signal under pure frequency salience.

Result on 8-query validation set against the live corpus: 8/8 (100%)
firing-set agreement between full input and reducer output. Above the
≥ 90% gate. 6 queries correctly produce no-match for both runs
(weak-signal queries the corpus has no good match for); 2 queries
fire identical ways for both runs.

Refs ADR-130.
@aaronsb
Copy link
Copy Markdown
Owner Author

aaronsb commented May 21, 2026

Review

Took the four commits in order; the story holds together — module first, integration second, behavioral tests with the slice-bounds fix third, threshold-aware recall gate plus budget tuning fourth. The merge gate logic in commit 4 is the right calibration to mirror production semantics.

Algorithm correctness

Salience scoring matches ADR-130 as described. Walked the math:

  • Softmax stability (reduce.rs:152-167): max-shift correct, the if sum == 0.0 guard handles the (effectively unreachable) all--inf case without producing NaN — good defensive code.
  • Empty-token denominator (reduce.rs:187): tokens.len().max(1) as f64 — empty-token sentences score 0/1 = 0.0 not NaN. No divide-by-zero.
  • Tie-breaking under equal salience: Rust's sort_by is stable; ties resolve by enumerate-index (= document order). Then the explicit sort_by_key(|(i, _)| *i) after greedy selection guarantees document order in output. Verified intent.
  • UTF-8 safety in split_sentences (reduce.rs:101-150): The splitter inspects bytes for ASCII .!? and \n (all single-byte, can never share a byte with a multi-byte UTF-8 codepoint), then slices s[start..i+1], s[start..i], s[start..] — every boundary lands on a single-byte ASCII char or is 0/len. Safe across UTF-8 inputs.
  • Trailing-punctuation-at-EOF: is_sentence_end requires i + 1 < bytes.len() AND whitespace at i+1. Trailing `.` without whitespace falls through to the tail block which trims and pushes whatever's left. No infinite loop, no panic.
  • Slice-bounds fix in commit 3 (reduce.rs:115-140): The fix is correct. Pre-fix code did `start = end + 1; i += 1` for sentence-end, leaving the trailing whitespace at `i` to potentially get re-processed as part of a paragraph break — which set `start > end` on the next iteration's slice. Post-fix advances `start` and `i` together past punctuation+whitespace and gates the slice with `start <= i+1`/`start <= i`. Tight.

Hot-path safety

Panic-free contract holds end-to-end. Audited:

  • No `unwrap()` on input-derived data. The two `unwrap_or(...)` calls (lines 194, 242) are on `partial_cmp` Option (NaN handling) — Equal is correct fallback.
  • No raw-index slicing without bounds (verified above).
  • The fallback chain `sentences → tokens → char-budget` (`reduce.rs:209-212, 222-230, 228-229`) guarantees non-empty output from non-empty input. The single edge to flag is below.

Suggestion (non-blocking)reduce.rs:222-229: `reduce_by_tokens` with `budget_tokens == 0` would produce empty output (`take(0)` and `max_chars = 0`). In current code this is unreachable (smallest budget constant is `BUDGET_FILE = 30`), but if a future caller wired in `0` for any reason the non-empty contract would break silently. Defensive option: `let max_chars = (budget_tokens * CHARS_PER_TOKEN).max(1);` and `take(budget_tokens.max(1))`. Cheap insurance.

Budget calibration

`approx_tokens` heuristic (`reduce.rs:79-87`) using `max(word_count, char_count/4)`:

  • English prose / code-heavy bash: overcounts vs WordPiece (claim holds — WordPiece averages ~4 chars/token on English, and the `max` with word count rounds up). Headroom comment in mod.rs is accurate.
  • Multilingual concern (CJK): CJK text has no spaces → word_count = 1 for an entire paragraph. `char_count/4` then drives the estimate, but Japanese/Chinese WordPiece often tokenizes ~1 token per character (or per 2 chars). So for CJK, the heuristic under-estimates real tokens. With `BUDGET_PROMPT = 110` (= ~440 char-budget for CJK), real WordPiece tokens could land at 220-440 — well past the 128 window. This is a genuine gap, but not a blocker for this PR because (a) the observed crashes were English, (b) the multi model takes over for non-English and the system has the belt-and-suspenders char caps from PRs fix(scan): cap bash embed query at 256 chars #95/fix(scan): cap prompt-hook embed query at 1024 chars #96 still in place. Recommend filing a follow-up issue to add a CJK-aware char-budget multiplier (e.g., divide char_count by 2 instead of 4 when non-ASCII ratio crosses a threshold) before removing the interim caps.

Stopword list

The 80-ish word list is small and reasoned. The PR/git scaffold additions (`pr branch diff merge review body title summary scope...`) are well-motivated by the dispatch-prompt structure but they DO bleed into legitimate PR/git topic queries. Quick check:

  • "should I rebase or merge this branch?" → after stopwords: `[rebase, should]`. Salience scoring degrades but the embed input is still full sentences (only sentence selection uses salience), so the embed sees `should I rebase or merge this branch?` verbatim and can match a "git workflow" way semantically.
  • A way like `softwaredev/git/pr-review` won't get stronger signal from a PR query because the topical terms are stopworded out of salience — but won't get weaker signal either, because PR queries are short and pass through unchanged under budget. The risk surfaces only when a long input has PR/git as its actual topic and we need to choose which sentences to keep. Acceptable trade-off given the observed crash payloads.

Minor: `description` (line 47) and `summary` (line 50) are both common topical English words. Worth monitoring for regressions if a future recall query targets ways about "writing descriptions" or "writing summaries" — these would score near 0 salience.

Integration

The contract "regex sees full input, embed sees reduced" is honored at all four call sites and the inline comments make it explicit:

  • `mod.rs:80-81` (prompt): reduced fed to `batch_embed_score`, `query` (unreduced) fed to `match_prompt` at line 95. Clean.
  • `mod.rs:135-136` (task): same shape. Clean.
  • `mod.rs:231-244` (command): regex against unreduced `cmd`. Then `mod.rs:266-267` reduces `query_for_checks` for embed. Clean.
  • `mod.rs:333` (file): regex against unreduced `filepath`. Then `mod.rs:345-346` reduces for embed. Clean.

Tests

18 unit + 4 behavioral + 1 ignored integration. Coverage is good and the behavioral tests built from real SIGABRT payloads are the right anchor — they're the artifact of "the bug that started this work." `split_sentences` has explicit tests for the paragraph-break case that triggered the panic fix.

Nit (non-blocking) — `reduce.rs:543`: `use std::process::Command;` inside `top1_fires` is unused (only `best_match` actually uses it at line 568). Under `#[cfg(test)]` this won't break a build but it'll produce a dead-import warning under `cargo clippy --tests --all-targets`. Trivial cleanup.

Observation (non-blocking) — `reduce.rs:198-205` (greedy fitting): the loop uses `continue` not `break` when a sentence exceeds remaining budget. That means a high-salience long sentence can be skipped while a lower-salience short sentence later still gets included. This is a divergence from strict top-N-by-salience but it maximizes budget utilization. Not a bug — worth a one-line comment noting the deliberate behavior so the next reader doesn't "fix" it to `break`.

ADR alignment

  • ADR-125 (no lexical tier reintroduced): the reducer never touches the corpus side — it only conditions the query before embedding. The matcher (`match_prompt` in mod.rs) is unchanged; per-way `embed_threshold` gating, parent-boost, dual EN/multi paths all preserved. Contract honored.
  • ADR-127 (input reduction not embedded-representation enrichment): the reducer outputs fewer sentences from the source. No vocabulary augmentation, no semantic expansion, no synonym injection — it strictly removes. Contract honored.
  • ADR-130 (this PR): the merge gate (≥ 90% firing-set agreement on the 8-query validation set, run under production threshold semantics) is the gate the ADR specifies and it lands at 8/8 (100%). The ignored test is reproducible and the assertion (`pct >= 90`) is the right gate.

Verdict

Approve. No blockers. Two non-blocking items worth a follow-up issue:

  1. CJK token estimation gap in `approx_tokens` — file before removing the interim char-caps from PRs fix(scan): cap bash embed query at 256 chars #95/fix(scan): cap prompt-hook embed query at 1024 chars #96.
  2. Dead import cleanup at `reduce.rs:543`.

Optional cleanups (your call): the `budget_tokens == 0` defensive guard in `reduce_by_tokens`, and a one-line comment at `reduce.rs:200` documenting the `continue` (not `break`) choice.

Nice work taking the time to write the production-faithful recall validation. The threshold-aware mirroring of `match_prompt` semantics in `top1_fires` is the difference between "recall test" and "recall gate."


AI-assisted review via Claude

Four cleanups from code-reviewer:

1. CJK tokenization estimate. approx_tokens used max(words, chars/4),
   which severely under-counts WordPiece tokens for Japanese/Chinese
   where each ideograph or kana costs ~1 token. A 400-char Japanese
   input would estimate at ~100 tokens but produce ~400 real tokens
   past the 128-position-embedding window → SIGABRT regression for
   multilingual users. Added is_cjk(char) covering CJK Unified
   Ideographs (incl. Extensions A/B), Hiragana, Katakana, Hangul,
   and CJK Compatibility Ideographs. approx_tokens now takes the max
   of three estimators: words, chars/4, and cjk-chars.

2. Dead `use std::process::Command;` import in test fn top1_fires —
   the function delegates to best_match and doesn't use Command directly.

3. Defensive budget_tokens == 0 guard in reduce_by_tokens. Unreachable
   from current call sites, but a 0 budget would silently produce
   empty output and break the "non-empty input never returns empty"
   contract.

4. Comment on the deliberate `continue` (not `break`) in the greedy
   salience-sorted fitting loop — a later sentence may still fit
   even after a longer earlier one didn't.

Three new tests cover the changes:
- cjk_input_estimates_one_token_per_char (JP, ZH, EN)
- cjk_long_input_triggers_reduction (would have been a SIGABRT pre-fix)
- budget_zero_does_not_return_empty

91 unit + 11 simulator tests pass. Recall validation still 8/8 (100%)
against the live corpus.

Refs PR #98 review (comment 4512998689).
@aaronsb aaronsb merged commit dc96b75 into main May 21, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant