feat(scan): sentence-salience reducer for embed inputs (ADR-130) by aaronsb · Pull Request #98 · aaronsb/agent-ways

aaronsb · 2026-05-21T21:35:07Z

Summary

Implements ADR-130. Replaces the interim character-truncation caps (PRs #94, #95, #96) with a structural fix: a sentence-salience reducer that runs ahead of every batch_embed_score call, preserving high-signal prose instead of clamping at an arbitrary char index.

Inputs that fit the embedder's working window pass through unchanged. Long inputs collapse to top-salience sentences within a per-hook token budget. Non-prose inputs (bash commands, JSON blobs, heredoc bodies — anything split_sentences returns ≤1 sentence on) degrade to bag-of-words top-K token selection rather than emitting empty.

Algorithm

tokenize input → frequency count → softmax (numerically stable)
                                       ↓
                             per-sentence salience =
                             avg(softmax weights of its tokens)
                                       ↓
                       sort sentences by salience desc
                       take top-N within token budget
                       resort by document position
                                       ↓
                            join → embedder input

Bag-of-words fallback for non-prose, hard char-budget fallback when even tokenization yields nothing usable. Panic-free on input-derived values.

Per-hook budgets

Hook	Budget (approx tokens)
`scan prompt`	110
`scan task`	110
`scan command`	75
`scan file`	30

Sized for ~85% of MiniLM's 128-token position-embedding window. The approximate tokenizer (max(word_count, char_count/4)) over-counts vs WordPiece, so real tokens land safely under 128 at all four budgets.

What stays from the interim PRs

PR fix(scan): skip embed for custom-agent dispatches #94's custom-agent discriminator (subagent_type check) — stays. Custom agents have their own constitution; they shouldn't reach the embed path at all.
PRs fix(scan): cap bash embed query at 256 chars #95 / fix(scan): cap prompt-hook embed query at 1024 chars #96 char truncation caps in the hook scripts — stay as belt-and-suspenders fallback. The reducer should never emit anything they'd clamp. Removable in a follow-up after observation window.

Tests

18 unit tests in reduce.rs — pass/fail behavior:

passthrough under budget, empty input, prose reduction with topical term preservation
document order preservation, single-long-sentence fallback, budget enforcement
stopword filtering, punctuation handling, softmax correctness, sentence splitter
non-prose fallback (heredoc bodies, repeated-term blobs)
behavioral tests against the three actual crash payloads from today (vue-expert dispatch, gh pr create heredoc, task-notification blob)

Existing 84 unit + 11 simulator tests still pass.

Recall validation (the merge gate from the briefing)

Production-faithful gate: queries embedded both ways, threshold gating applied (EN ≥ 0.40, multi ≥ 0.55, per ADR-125), top-1 firing way compared. None == None counts as agreement.

Case	Full fires	Reduced fires	Agree
vue agent dispatch	(none)	(none)	✓
gh pr create heredoc	(none)	(none)	✓
task notification	meta/subagents	meta/subagents	✓
debugging question	(none)	(none)	✓
schema migration question	softwaredev/delivery/migrations	softwaredev/delivery/migrations	✓
test structure question	(none)	(none)	✓
kubectl long args	(none)	(none)	✓
code review request	(none)	(none)	✓

Firing-set agreement: 8/8 (100%) — Gate: ≥ 90%.

Reproducible via cargo test -p ways --release -- --ignored validate_recall_against_live_corpus --nocapture (requires the live corpus + way-embed binary).

Of note: 6 of 8 queries correctly produce "no match" with both inputs — the corpus doesn't have a strong-enough match for those topics under production thresholds. The reducer preserves this null result. The two queries that do fire produce identical top-1 ways before and after reduction.

Manual crash repro

Three payloads identical in shape to today's actual SIGABRTs:

11,040-char vue-expert agent dispatch → 0 crashes
3,387-char gh pr create heredoc → 0 crashes, scan path emits hook context as designed
2,864-char task-notification blob → 0 crashes

Commits

feat(scan): sentence-salience reducer module — reduce.rs + 14 unit tests
feat(scan): wire reducer into prompt/task/command/file paths — integration
test(scan): behavioral tests against actual crash repros — 4 behavioral tests + slice-bounds bug fix in split_sentences
test(scan): production-faithful recall gate + budget tuning — threshold-aware recall validation, extended stopwords, bumped budgets to ~85% window

Refs

Closes the implementation side of ADR-130
Builds on the interim truncation safety net from PRs fix(scan): skip embed for custom-agent dispatches #94, fix(scan): cap bash embed query at 256 chars #95, fix(scan): cap prompt-hook embed query at 1024 chars #96
Respects ADR-125 (embedding-only matcher, no lexical tier reintroduced)
Respects ADR-127 (input reduction not enrichment — text-density not the axis)

Implements reduce_for_embed(input, budget_tokens) — the structural replacement for the interim truncation caps from PRs #94/#95/#96. Pipeline: tokenize → frequency-count over whole input → softmax → score each sentence by average token weight → keep top-N within token budget → reassemble in document order. Non-prose fallback: when split_sentences finds ≤1 sentence (bash commands, JSON, heredoc bodies, code blobs), degrade to bag-of-words token selection rather than emitting empty or panicking. Panic-free on input-derived values; pathological cases (e.g. all stopwords) fall through to a hard char-budget slice so a non-empty input never produces an empty output. Unit tests cover: passthrough under budget, empty/whitespace input, prose reduction with topical term preservation, document-order preservation, single-long-sentence fallback, budget enforcement, stopword filtering, punctuation handling, softmax correctness, sentence-splitter behavior on prose/paragraph/non-prose inputs, non-prose fallback contract. 14/14 tests pass. Module is not yet wired into scan paths — next commit integrates the four entrypoints. Refs ADR-130.

Each of the four scan entrypoints now passes its embed query through reduce::reduce_for_embed before invoking batch_embed_score. Pattern/ keyword/regex matching downstream still operates on the full input — only the embed signal sees the reduced form. Per-hook budgets (approximate tokens, sized for ~75% of MiniLM's 128-token position-embedding window): scan::prompt → 95 tokens (room for paragraph-scale user prompts) scan::task → 95 tokens (agent delegation is the largest input class) scan::command → 60 tokens (command shape + description suffix) scan::file → 30 tokens (filepaths are short by nature) Inputs under budget pass through unchanged; long inputs collapse to top-salience sentences (or top-K tokens for non-prose) within budget. 84 unit + 11 simulator tests pass. Behavioral test against real crash repros lands in the next commit. Refs ADR-130.

Adds four tests using the exact payload shapes that triggered today's SIGABRTs: - behavior_vue_expert_dispatch_reduces_safely (PID 128657/156817 shape) - behavior_gh_pr_heredoc_does_not_crash (PID 88563/197729 shape) - behavior_task_notification_reduces_safely (PID 197729 shape) - high_frequency_terms_survive_reduction (recall proxy) Each asserts the contract: non-empty output, within ~budget+headroom, high-frequency content survives. The reducer being deterministic and panic-free is verified by the existing unit tests; these add the "runs cleanly on real-world payloads" gate. Also fixes a slice-bounds panic in split_sentences when a sentence end (`.`+whitespace) is immediately followed by a paragraph break (`\n\n`). Found by the gh-pr-heredoc behavioral test. 18/18 reduce tests pass; full 84-test suite + 11-scenario simulator suite still pass.

Two improvements to the recall validation: 1. Threshold-aware top-1. The previous test grabbed the highest-scoring row from way-embed match --threshold 0.0 regardless of whether it would actually fire in production. Per-way embed_threshold gating (EN ≥ 0.40, multi ≥ 0.55) is what determines firing in scan/mod.rs:: match_prompt; the test now mirrors that. Queries with no good corpus match produce (none) for both full and reduced — the correct outcome. 2. Budget tuning. Bumped per-hook budgets from ~75% to ~85% of the 128-token MiniLM window: BUDGET_PROMPT/TASK 95→110, BUDGET_COMMAND 60→75, BUDGET_FILE unchanged. Approximate tokenizer over-counts vs WordPiece so real tokens stay safely under 128 even at the higher numbers. Also extends scaffold stopwords with PR/git workflow terms (pr, branch, diff, review, body, scope, summary, etc.) that dominate agent-dispatch prompts and crowd out topical signal under pure frequency salience. Result on 8-query validation set against the live corpus: 8/8 (100%) firing-set agreement between full input and reducer output. Above the ≥ 90% gate. 6 queries correctly produce no-match for both runs (weak-signal queries the corpus has no good match for); 2 queries fire identical ways for both runs. Refs ADR-130.

aaronsb · 2026-05-21T21:40:04Z

Review

Took the four commits in order; the story holds together — module first, integration second, behavioral tests with the slice-bounds fix third, threshold-aware recall gate plus budget tuning fourth. The merge gate logic in commit 4 is the right calibration to mirror production semantics.

Algorithm correctness

Salience scoring matches ADR-130 as described. Walked the math:

Softmax stability (reduce.rs:152-167): max-shift correct, the if sum == 0.0 guard handles the (effectively unreachable) all--inf case without producing NaN — good defensive code.
Empty-token denominator (reduce.rs:187): tokens.len().max(1) as f64 — empty-token sentences score 0/1 = 0.0 not NaN. No divide-by-zero.
Tie-breaking under equal salience: Rust's sort_by is stable; ties resolve by enumerate-index (= document order). Then the explicit sort_by_key(|(i, _)| *i) after greedy selection guarantees document order in output. Verified intent.
UTF-8 safety in split_sentences (reduce.rs:101-150): The splitter inspects bytes for ASCII .!? and \n (all single-byte, can never share a byte with a multi-byte UTF-8 codepoint), then slices s[start..i+1], s[start..i], s[start..] — every boundary lands on a single-byte ASCII char or is 0/len. Safe across UTF-8 inputs.
Trailing-punctuation-at-EOF: is_sentence_end requires i + 1 < bytes.len() AND whitespace at i+1. Trailing `.` without whitespace falls through to the tail block which trims and pushes whatever's left. No infinite loop, no panic.
Slice-bounds fix in commit 3 (reduce.rs:115-140): The fix is correct. Pre-fix code did `start = end + 1; i += 1` for sentence-end, leaving the trailing whitespace at `i` to potentially get re-processed as part of a paragraph break — which set `start > end` on the next iteration's slice. Post-fix advances `start` and `i` together past punctuation+whitespace and gates the slice with `start <= i+1`/`start <= i`. Tight.

Hot-path safety

Panic-free contract holds end-to-end. Audited:

No `unwrap()` on input-derived data. The two `unwrap_or(...)` calls (lines 194, 242) are on `partial_cmp` Option (NaN handling) — Equal is correct fallback.
No raw-index slicing without bounds (verified above).
The fallback chain `sentences → tokens → char-budget` (`reduce.rs:209-212, 222-230, 228-229`) guarantees non-empty output from non-empty input. The single edge to flag is below.

Suggestion (non-blocking) — reduce.rs:222-229: `reduce_by_tokens` with `budget_tokens == 0` would produce empty output (`take(0)` and `max_chars = 0`). In current code this is unreachable (smallest budget constant is `BUDGET_FILE = 30`), but if a future caller wired in `0` for any reason the non-empty contract would break silently. Defensive option: `let max_chars = (budget_tokens * CHARS_PER_TOKEN).max(1);` and `take(budget_tokens.max(1))`. Cheap insurance.

Budget calibration

`approx_tokens` heuristic (`reduce.rs:79-87`) using `max(word_count, char_count/4)`:

English prose / code-heavy bash: overcounts vs WordPiece (claim holds — WordPiece averages ~4 chars/token on English, and the `max` with word count rounds up). Headroom comment in mod.rs is accurate.
Multilingual concern (CJK): CJK text has no spaces → word_count = 1 for an entire paragraph. `char_count/4` then drives the estimate, but Japanese/Chinese WordPiece often tokenizes ~1 token per character (or per 2 chars). So for CJK, the heuristic under-estimates real tokens. With `BUDGET_PROMPT = 110` (= ~440 char-budget for CJK), real WordPiece tokens could land at 220-440 — well past the 128 window. This is a genuine gap, but not a blocker for this PR because (a) the observed crashes were English, (b) the multi model takes over for non-English and the system has the belt-and-suspenders char caps from PRs fix(scan): cap bash embed query at 256 chars #95/fix(scan): cap prompt-hook embed query at 1024 chars #96 still in place. Recommend filing a follow-up issue to add a CJK-aware char-budget multiplier (e.g., divide char_count by 2 instead of 4 when non-ASCII ratio crosses a threshold) before removing the interim caps.

Stopword list

The 80-ish word list is small and reasoned. The PR/git scaffold additions (`pr branch diff merge review body title summary scope...`) are well-motivated by the dispatch-prompt structure but they DO bleed into legitimate PR/git topic queries. Quick check:

"should I rebase or merge this branch?" → after stopwords: `[rebase, should]`. Salience scoring degrades but the embed input is still full sentences (only sentence selection uses salience), so the embed sees `should I rebase or merge this branch?` verbatim and can match a "git workflow" way semantically.
A way like `softwaredev/git/pr-review` won't get stronger signal from a PR query because the topical terms are stopworded out of salience — but won't get weaker signal either, because PR queries are short and pass through unchanged under budget. The risk surfaces only when a long input has PR/git as its actual topic and we need to choose which sentences to keep. Acceptable trade-off given the observed crash payloads.

Minor: `description` (line 47) and `summary` (line 50) are both common topical English words. Worth monitoring for regressions if a future recall query targets ways about "writing descriptions" or "writing summaries" — these would score near 0 salience.

Integration

The contract "regex sees full input, embed sees reduced" is honored at all four call sites and the inline comments make it explicit:

`mod.rs:80-81` (prompt): reduced fed to `batch_embed_score`, `query` (unreduced) fed to `match_prompt` at line 95. Clean.
`mod.rs:135-136` (task): same shape. Clean.
`mod.rs:231-244` (command): regex against unreduced `cmd`. Then `mod.rs:266-267` reduces `query_for_checks` for embed. Clean.
`mod.rs:333` (file): regex against unreduced `filepath`. Then `mod.rs:345-346` reduces for embed. Clean.

Tests

18 unit + 4 behavioral + 1 ignored integration. Coverage is good and the behavioral tests built from real SIGABRT payloads are the right anchor — they're the artifact of "the bug that started this work." `split_sentences` has explicit tests for the paragraph-break case that triggered the panic fix.

Nit (non-blocking) — `reduce.rs:543`: `use std::process::Command;` inside `top1_fires` is unused (only `best_match` actually uses it at line 568). Under `#[cfg(test)]` this won't break a build but it'll produce a dead-import warning under `cargo clippy --tests --all-targets`. Trivial cleanup.

Observation (non-blocking) — `reduce.rs:198-205` (greedy fitting): the loop uses `continue` not `break` when a sentence exceeds remaining budget. That means a high-salience long sentence can be skipped while a lower-salience short sentence later still gets included. This is a divergence from strict top-N-by-salience but it maximizes budget utilization. Not a bug — worth a one-line comment noting the deliberate behavior so the next reader doesn't "fix" it to `break`.

ADR alignment

ADR-125 (no lexical tier reintroduced): the reducer never touches the corpus side — it only conditions the query before embedding. The matcher (`match_prompt` in mod.rs) is unchanged; per-way `embed_threshold` gating, parent-boost, dual EN/multi paths all preserved. Contract honored.
ADR-127 (input reduction not embedded-representation enrichment): the reducer outputs fewer sentences from the source. No vocabulary augmentation, no semantic expansion, no synonym injection — it strictly removes. Contract honored.
ADR-130 (this PR): the merge gate (≥ 90% firing-set agreement on the 8-query validation set, run under production threshold semantics) is the gate the ADR specifies and it lands at 8/8 (100%). The ignored test is reproducible and the assertion (`pct >= 90`) is the right gate.

Verdict

Approve. No blockers. Two non-blocking items worth a follow-up issue:

CJK token estimation gap in `approx_tokens` — file before removing the interim char-caps from PRs fix(scan): cap bash embed query at 256 chars #95/fix(scan): cap prompt-hook embed query at 1024 chars #96.
Dead import cleanup at `reduce.rs:543`.

Optional cleanups (your call): the `budget_tokens == 0` defensive guard in `reduce_by_tokens`, and a one-line comment at `reduce.rs:200` documenting the `continue` (not `break`) choice.

Nice work taking the time to write the production-faithful recall validation. The threshold-aware mirroring of `match_prompt` semantics in `top1_fires` is the difference between "recall test" and "recall gate."

AI-assisted review via Claude

Four cleanups from code-reviewer: 1. CJK tokenization estimate. approx_tokens used max(words, chars/4), which severely under-counts WordPiece tokens for Japanese/Chinese where each ideograph or kana costs ~1 token. A 400-char Japanese input would estimate at ~100 tokens but produce ~400 real tokens past the 128-position-embedding window → SIGABRT regression for multilingual users. Added is_cjk(char) covering CJK Unified Ideographs (incl. Extensions A/B), Hiragana, Katakana, Hangul, and CJK Compatibility Ideographs. approx_tokens now takes the max of three estimators: words, chars/4, and cjk-chars. 2. Dead `use std::process::Command;` import in test fn top1_fires — the function delegates to best_match and doesn't use Command directly. 3. Defensive budget_tokens == 0 guard in reduce_by_tokens. Unreachable from current call sites, but a 0 budget would silently produce empty output and break the "non-empty input never returns empty" contract. 4. Comment on the deliberate `continue` (not `break`) in the greedy salience-sorted fitting loop — a later sentence may still fit even after a longer earlier one didn't. Three new tests cover the changes: - cjk_input_estimates_one_token_per_char (JP, ZH, EN) - cjk_long_input_triggers_reduction (would have been a SIGABRT pre-fix) - budget_zero_does_not_return_empty 91 unit + 11 simulator tests pass. Recall validation still 8/8 (100%) against the live corpus. Refs PR #98 review (comment 4512998689).

aaronsb added 4 commits May 21, 2026 13:56

aaronsb merged commit dc96b75 into main May 21, 2026
6 checks passed

aaronsb mentioned this pull request May 21, 2026

fix(hooks): remove interim truncation caps — reducer is the size gate #99

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(scan): sentence-salience reducer for embed inputs (ADR-130)#98

feat(scan): sentence-salience reducer for embed inputs (ADR-130)#98
aaronsb merged 5 commits into
mainfrom
feat/adr-130-sentence-salience-reducer

aaronsb commented May 21, 2026

Uh oh!

aaronsb commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aaronsb commented May 21, 2026

Summary

Algorithm

Per-hook budgets

What stays from the interim PRs

Tests

Recall validation (the merge gate from the briefing)

Manual crash repro

Commits

Refs

Uh oh!

aaronsb commented May 21, 2026

Review

Algorithm correctness

Hot-path safety

Budget calibration

Stopword list

Integration

Tests

ADR alignment

Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant