From a8111a6d7fa7e2da95e00ef8c5d56411d47c0789 Mon Sep 17 00:00:00 2001
From: Aaron Bockelie <aaronsb@gmail.com>
Date: Thu, 21 May 2026 15:26:18 -0700
Subject: [PATCH] =?UTF-8?q?fix(hooks):=20remove=20interim=20truncation=20c?=
 =?UTF-8?q?aps=20=E2=80=94=20reducer=20is=20the=20size=20gate?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PRs #95 (bash 256-char cap) and #96 (prompt 1024-char cap) added
character-based pre-clamps that ran strictly upstream of the ADR-130
sentence-salience reducer. The caps weren't redundant; they were
*starving* the reducer by chopping off the back of the input before
the reducer could score it.

Concrete shape of the bug, on a 4000-char `gh pr create --body
"$(cat <<EOF…)"`:

  hook receives 4000-char CMD
    ↓
  check-bash-pre.sh truncates to 256 chars    ← cap fires here
    ↓
  ways scan command --command "<256 chars>"
    ↓
  reduce_for_embed("<256 chars>", 75)         ← input fits, passthrough
    ↓
  batch_embed_score("<256 chars>")
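
The clamp step in that chain can be reproduced in isolation. This is a
minimal sketch, not the hook itself: the command body and its closing
sentence are made-up stand-ins, and `cut` substitutes for the removed
`${CMD:0:$CMD_QUERY_MAX}` expansion (same result for single-line ASCII
input) so the snippet also runs under plain sh.

```shell
# Hypothetical stand-in for a long `gh pr create` command (~4 KB).
BODY=""
i=0
while [ "$i" -lt 200 ]; do
  BODY="${BODY}filler sentence $i. "
  i=$((i + 1))
done
CMD="gh pr create --body \"${BODY}key point: caps removed\""

# Character-based pre-clamp, same width as the removed CMD_QUERY_MAX.
CMD_QUERY_MAX=256
CMD_QUERY=$(printf '%s' "$CMD" | cut -c1-"$CMD_QUERY_MAX")

# Check whether the closing sentence survived the clamp.
case "$CMD_QUERY" in
  *"caps removed"*) TAIL_STATUS="kept" ;;
  *)                TAIL_STATUS="lost" ;;
esac
echo "full=${#CMD} clamped=${#CMD_QUERY} tail=${TAIL_STATUS}"
```

Anything past the first 256 characters, including the closing
sentence, never reaches the reducer, so it has nothing left to rank.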

The reducer's whole pitch — preserve prose distributed across the
whole document — was defeated as long as the caps ran first. The
caps also masked any reducer-bounds bug by clamping inputs to a
safe range before the reducer could exercise its full path.

ADR-130's reducer provides the size guarantee these caps were meant
to provide. With the caps gone, the reducer sees the full input,
scores sentence salience across it, and hands the embedder a
bounded query.

The custom-agent discriminator in check-task-pre.sh (PR #94) is
*not* a cap; it's a gate that skips `ways scan task` entirely for
dispatches to custom agents. Unchanged.

Tested live: a 6.4KB bash command and a 6KB task-notification
prompt both run cleanly through the production hook chain, 0 SIGABRTs.
---
 hooks/ways/check-bash-pre.sh | 28 ++++++++--------------------
 hooks/ways/check-prompt.sh   | 18 +++++-------------
 2 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/hooks/ways/check-bash-pre.sh b/hooks/ways/check-bash-pre.sh
index 8c75c3b..2fc4b50 100755
--- a/hooks/ways/check-bash-pre.sh
+++ b/hooks/ways/check-bash-pre.sh
@@ -4,26 +4,16 @@
 # The ways binary handles: command pattern matching, semantic scoring,
 # check curve scoring, session state, and content output.
 #
-# The command is truncated to its semantic prefix before being passed to
-# `ways scan command`. Heredoc bodies (`gh pr create --body "$(cat <<EOF…)"`),
-# JSON payloads (`curl -d '{…}'`), and other large argument bodies carry
-# no signal for "what kind of command is this" — the program name and
-# first few args do. The MiniLM embedding models cap at ~128 tokens of
-# position embeddings; queries past that abort the embedder (ggml
-# get_rows out-of-range). 256 chars ≈ 60 tokens, safely under the limit
-# with headroom for the description that gets appended downstream.
-#
-# This truncation feeds both the embed query *and* the regex matcher in
-# the ways binary. Every existing `commands:` pattern under hooks/ways/
-# matches on the program name + first arg (≤106 chars), so cropping at
-# 256 changes no current behavior. Future patterns that need to look
-# past char 256 of a bash command would be misusing this trigger
-# anyway — that signal belongs in `pattern:` against the description.
+# Size bounding for the embed query is the ways binary's responsibility
+# (ADR-130 sentence-salience reducer in scan/reduce.rs). This script
+# passes the full command through so the reducer can score the prose
+# distribution itself — pre-truncating here would starve the reducer
+# of the back half of any long input. The regex `commands:` matcher
+# also gets the full command, which is what ways with patterns like
+# `^(npm|cargo|gh) ` expect.
 
 source "$(dirname "$0")/require-ways.sh"
 
-readonly CMD_QUERY_MAX=256
-
 INPUT=$(cat)
 CMD=$(echo "$INPUT" | jq -r '.tool_input.command // empty')
 DESC=$(echo "$INPUT" | jq -r '.tool_input.description // empty' | tr '[:upper:]' '[:lower:]')
@@ -32,11 +22,9 @@ AGENT_ID=$(echo "$INPUT" | jq -r '.agent_id // empty')
 [[ -n "$AGENT_ID" ]] && export CLAUDE_AGENT_ID="$AGENT_ID"
 PROJECT_DIR="${CLAUDE_PROJECT_DIR:-$(echo "$INPUT" | jq -r '.cwd // empty')}"
 
-CMD_QUERY="${CMD:0:$CMD_QUERY_MAX}"
-
 export CLAUDE_PROJECT_DIR="${PROJECT_DIR}"
 "${HOME}/.claude/bin/ways" scan command \
-  --command "$CMD_QUERY" \
+  --command "$CMD" \
   --description "$DESC" \
   --session "$SESSION_ID" \
   --project "$PROJECT_DIR"
diff --git a/hooks/ways/check-prompt.sh b/hooks/ways/check-prompt.sh
index 660b689..c5f7a47 100755
--- a/hooks/ways/check-prompt.sh
+++ b/hooks/ways/check-prompt.sh
@@ -9,18 +9,13 @@
 # harness also injects structured content here: <task-notification>
 # blobs from completed background agents, <persisted-output> pointers
 # for tool results that exceed inline budget, and other system-reminder
-# envelopes. Any of those can run multiple KB, and embedding that
-# overruns the MiniLM model's position-embedding table (SIGABRT in
-# ggml_compute_forward_get_rows). Cap the embed query at 1024 chars
-# (~240 tokens — generous because real user prompts can legitimately
-# be paragraphs of context, unlike bash commands). Anything past 1024
-# in a prompt is system-injected envelope content that carries no
-# additional signal for matching the user's *intent* against ways.
+# envelopes. Size bounding for the embed query is the ways binary's
+# responsibility (ADR-130 sentence-salience reducer in scan/reduce.rs).
+# This script passes the full combined prompt+topics through so the
+# reducer can score sentence salience across the whole input.
 
 source "$(dirname "$0")/require-ways.sh"
 
-readonly PROMPT_QUERY_MAX=1024
-
 INPUT=$(cat)
 PROMPT=$(echo "$INPUT" | jq -r '.prompt // empty' | tr '[:upper:]' '[:lower:]')
 SESSION_ID=$(echo "$INPUT" | jq -r '.session_id // empty')
@@ -37,11 +32,8 @@ if [[ -f "$RESPONSE_STATE" ]]; then
   RESPONSE_TOPICS=$(jq -r '.topics // empty' "$RESPONSE_STATE" 2>/dev/null)
 fi
 
-# Combined context: user prompt + Claude's recent topics, capped.
-# RESPONSE_TOPICS is bounded by check-response.sh's extraction (~50 chars
-# of keywords) so the cap is effectively a guard on PROMPT itself.
+# Combined context: user prompt + Claude's recent topics.
 COMBINED="${PROMPT} ${RESPONSE_TOPICS}"
-COMBINED="${COMBINED:0:$PROMPT_QUERY_MAX}"
 
 export CLAUDE_PROJECT_DIR="${PROJECT_DIR}"
 "${HOME}/.claude/bin/ways" scan prompt \