Add finding diversity gate and pair-eval timeout diagnostic (Phase 2C+.5) by R00T-Kim · Pull Request #9 · R00T-Kim/SCOUT

R00T-Kim · 2026-04-19T10:07:57Z

Summary

Phase 2C+ Track A first PR. Adds measurement scaffolding for the two reviewer-eval-lane blockers identified in the 2026-04-19 Direction Pivot:

finding diversity gate — local-7 lane currently maps every pair-side row to a single finding_id (aiedge.findings.web.exec_sink_overlap), giving finding_diversity_index = 1.0 (degenerate). Gate fails when max_share(finding_id) >= AIEDGE_PAIR_DIVERSITY_MAX (default 0.5).
pair-eval timeout diagnostic — dedicated reviewer reruns (pair-eval-dedicated-local7-claude-6h, codex-6h) terminated at run_index rows = 0 with no actionable signal. _dump_timeout_diagnostic() now captures last 200 stderr / 50 stdout lines + best-effort run_dir guess + last stage status into <side>/timeout_diagnostic.json.
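
The max-share rule behind the diversity gate can be sketched as below. This is a minimal illustration, not the code in quality_policy.py: the function names `max_share` and `diversity_gate_passes` are hypothetical, and only the `AIEDGE_PAIR_DIVERSITY_MAX` variable, its 0.5 default, and the "fail when index >= threshold" rule come from this PR.

```python
import os
from collections import Counter

# Threshold from the PR description; 0.5 is the stated default.
DEFAULT_MAX = float(os.environ.get("AIEDGE_PAIR_DIVERSITY_MAX", "0.5"))


def max_share(finding_ids):
    """Return the largest fraction of rows that share a single finding_id."""
    if not finding_ids:
        # Assumption: an empty input is treated as invalid rather than as a pass.
        raise ValueError("no finding_ids to score")
    counts = Counter(finding_ids)
    return max(counts.values()) / len(finding_ids)


def diversity_gate_passes(finding_ids, threshold=DEFAULT_MAX):
    """Gate fails when max_share >= threshold, so passing requires a strict '<'."""
    return max_share(finding_ids) < threshold


# The local-7 baseline: all 14 rows map to one finding_id.
degenerate = ["aiedge.findings.web.exec_sink_overlap"] * 14
mixed = ["a", "b", "c"]
```

With the degenerate baseline the index is 1.0, so the gate fails. Three evenly spread finding_ids give 1/3 and pass under the default threshold.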

Phase 2D entry exit-gate

This PR delivers gate 3 of 5 for the Phase 2D entry threshold:

recall ≥ 0.40 (current 0.142857)
tier ≥ 2 nonzero TP (current 1)
finding diversity < 0.5 (current 1.0) ← this PR enables enforcement
dedicated rerun ≥ 1 driver success (current 0)
corpus size ≥ 10 (current 7)
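
As a rough illustration of the scorecard above, the five thresholds and their current values can be checked mechanically. The gate names and the table layout here are hypothetical; only the comparisons, thresholds, and current values are taken from the list.

```python
import operator

# (gate name, comparison, threshold, current value) as listed in this PR.
# The short names are illustrative labels, not identifiers from the repo.
GATES = [
    ("recall", operator.ge, 0.40, 0.142857),
    ("tier_nonzero_tp", operator.ge, 2, 1),
    ("finding_diversity", operator.lt, 0.5, 1.0),
    ("dedicated_rerun_success", operator.ge, 1, 0),
    ("corpus_size", operator.ge, 10, 7),
]


def evaluate_gates(gates):
    """Map each gate name to True when its current value meets the threshold."""
    return {name: bool(op(current, threshold)) for name, op, threshold, current in gates}


results = evaluate_gates(GATES)
passed = sum(results.values())
print(f"{passed}/{len(GATES)} gates pass")
```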

Test plan

pytest -q tests/test_finding_diversity_gate.py — 12 passed
pytest -q — full suite green
ruff check src/ tests/ scripts/ — clean
pyright src/aiedge/quality_policy.py scripts/run_pair_eval.py tests/test_finding_diversity_gate.py — 0 errors
python3 scripts/check_doc_consistency.py — OK
bash -n scripts/release_gate.sh — clean
Run release_gate.sh --pair-eval-findings benchmark-results/pair-eval/pair_eval_findings.csv against the local-7 baseline → expect FAIL with QUALITY_GATE_DIVERSITY_MISS, actual=1.0

Files

src/aiedge/quality_policy.py (+115)
scripts/run_pair_eval.py (+82, helpers + TimeoutExpired branch)
scripts/release_gate.sh (+57, opt-in PAIR_EVAL_DIVERSITY sub-gate)
tests/test_finding_diversity_gate.py (new, 12 tests)
docs/finding_diversity_gate.md (new)
CHANGELOG.md (Unreleased ### Added)

Plan reference

See ~/.claude/plans/twinkly-hugging-leaf.md for the broader Track A / Track B parallel plan.

🤖 Generated with Claude Code

…+.5)

Reviewer eval lane analysis (2026-04-19) surfaced two blockers on the path into Phase 2D':

- the local-7 lane mapped every pair-side row to a single finding_id (degenerate ROC: all 14 rows on aiedge.findings.web.exec_sink_overlap)
- the dedicated reviewer reruns (claude-6h, codex-6h) terminated at `run_index rows = 0` with no actionable diagnostic

This adds measurement scaffolding for both:

- quality_policy.py: compute_pair_eval_diversity_index() (max-share over finding_id), load_pair_eval_finding_ids() (CSV reader with optional ground_truth filter), evaluate_pair_eval_diversity_gate() (fails when index >= AIEDGE_PAIR_DIVERSITY_MAX, default 0.5). New violation tokens QUALITY_GATE_DIVERSITY_MISS / QUALITY_GATE_INVALID_PAIR_EVAL.
- run_pair_eval.py: TimeoutExpired now writes <side>/timeout_diagnostic.json with last 200 stderr / 50 stdout lines, best-effort run_dir guess, and the most recent stage's name/status.
- release_gate.sh: opt-in PAIR_EVAL_DIVERSITY sub-gate via --pair-eval-findings; absent flag emits an INFO skip line.
- docs/finding_diversity_gate.md: threshold rationale, output schema, Phase 2D entry exit-gate hook (recall >= 0.40 / tier >= 2 nonzero TP / diversity < 0.5 / dedicated rerun success / corpus >= 10).

Verification:

    pytest -q tests/test_finding_diversity_gate.py    # 12 passed
    pytest -q                                         # full suite green
    ruff check, pyright, check_doc_consistency        # clean
    bash -n scripts/release_gate.sh                   # clean

Phase 2C+ Track A first PR (Pivot 2026-04-19).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ector (Phase 2C+.3)

Nearly doubles the dangerous-call catalogue covered by taint_propagation so the diversity gate added in 2C+.5 has more candidates to discriminate against. The pre-Pivot _SINK_SYMBOLS only covered the cmd-injection / strcpy / printf families (29 symbols); the firmware corpus routinely surfaces sinks across at least nine CWE families that were silently missed.

- _SINK_SYMBOLS 29 -> 51, with explicit CWE comments per group:
  * CWE-78 + wordexp / posix_spawn / posix_spawnp
  * CWE-22 + fopen / open / openat / freopen / chdir
  * CWE-426 + dlsym / dlmopen
  * CWE-732 + chmod / fchmod / chown / fchown / lchown
  * CWE-377 + mktemp / tmpnam / tempnam / tmpfile
  * CWE-250/269 + chroot / setuid / seteuid / setgid / setegid
  * CWE-454 + putenv / setenv / unsetenv
  * CWE-134 + vsnprintf / dprintf / vdprintf
- _FORMAT_STRING_SINKS 6 -> 15 with size-bounded, fd-based, and wide-character format-string variants.
- _is_format_string_variable() is widened to flag any first argument whose first non-whitespace character is *not* the start of a string literal: bare identifiers, function calls, struct field access (`obj->field`), array subscripts, C-style casts, parenthesised ternaries, and pointer dereferences (`*p_fmt`). Previously only bare identifiers matched, so `printf(obj->field)` was silently considered safe.

Verification:

    pytest -q tests/test_taint_propagation.py    # 20 passed
    pytest -q                                    # full suite green
    ruff check src/ tests/                       # clean
    pyright (changed files)                      # 0 errors
    python3 scripts/check_doc_consistency.py     # OK

Phase 2C+ Track A second commit on PR #9 (Pivot 2026-04-19, Plan ~/.claude/plans/twinkly-hugging-leaf.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
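
The widened format-string check from this commit reduces to a one-character test on the first argument. The sketch below is an assumption about its shape, not the actual `_is_format_string_variable()`: the function name is hypothetical and it takes the already-extracted first argument as a string.

```python
def first_arg_is_non_literal(first_arg: str) -> bool:
    """Flag a format argument unless it starts with a string literal.

    Sketch only: a wide literal such as L"..." would also be flagged here,
    which the real check may or may not do.
    """
    stripped = first_arg.lstrip()
    if not stripped:
        # Assumption: an empty argument is not treated as a finding.
        return False
    return not stripped.startswith('"')


# Argument shapes named in the commit message.
samples = {
    '"%s\\n"': False,        # literal format string: considered safe
    "fmt": True,             # bare identifier (the only case matched before)
    "obj->field": True,      # struct field access
    "*p_fmt": True,          # pointer dereference
    "get_fmt()": True,       # function call
    "(ok ? a : b)": True,    # parenthesised ternary
}
```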

…+.2)

EnhancedSourceStage now widens source identification beyond the INPUT_APIS dynstr scan to cover the three attacker-influenced string families that LARA (USENIX Sec 2024) showed are missing from traditional source pools:

  * URI prefixes (20 patterns)
  * CGI environment variables (17 patterns)
  * NVRAM / sysconf keys (24 patterns)

This is the source-side counterpart to PR #9's sink expansion (2C+.3): together they grow both ends of the source->sink graph so the diversity gate (2C+.5) has a meaningfully larger candidate pool.

- _URI_SOURCE_PATTERNS / _CGI_VAR_PATTERNS / _CONFIG_KEY_PATTERNS frozensets with CWE / RFC / OEM provenance comments.
- _extract_uri_key_sources(bin_path, symbols, ascii_strings=None) returns deduplicated (pattern, kind) tuples. Matching policy:
  * URI: substring vs bin_path AND ascii_strings (symbols intentionally excluded -- '/' is not a valid identifier char)
  * CGI var: exact lower-case match against symbols OR ascii_strings
  * config key: substring vs bin_path, symbols, AND ascii_strings
- EnhancedSourceStage.run() loop wraps each match into a source dict with confidence=0.40 (SYMBOL_COOCCURRENCE cap), method="lara_pattern", and source_type set to the match kind. ascii_strings wiring is intentionally deferred -- a follow-up will plumb inventory's string_hits / sbom _extract_ascii_runs through into this call site.

Verification:

    pytest -q tests/test_uri_source_extraction.py    # 13 passed
    pytest -q                                        # full suite green
    ruff check src/ tests/                           # clean
    pyright (changed files)                          # 0 errors
    python3 scripts/check_doc_consistency.py         # OK

Phase 2C+ Track A third commit on PR #9 (Pivot 2026-04-19).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
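
The three-way matching policy in this commit can be sketched as follows. The pattern sets are tiny illustrative samples, not the real 20 / 17 / 24 entries. The function name, the kind labels, and the assumption that CGI patterns are stored in lower case are all hypothetical.

```python
# Illustrative samples only; the real frozensets live in the source stage.
URI_PATTERNS = frozenset({"/cgi-bin/", "/goform/"})
CGI_VARS = frozenset({"query_string", "http_cookie"})
CONFIG_KEYS = frozenset({"lan_ipaddr", "http_passwd"})


def extract_sources(bin_path, symbols, ascii_strings=None):
    """Return deduplicated (pattern, kind) tuples following the stated policy."""
    syms = list(symbols)
    strings = list(ascii_strings or [])
    found, seen = [], set()

    def add(pattern, kind):
        if (pattern, kind) not in seen:
            seen.add((pattern, kind))
            found.append((pattern, kind))

    # URI: substring match against bin_path and ascii_strings; symbols excluded.
    uri_haystacks = [bin_path, *strings]
    for pattern in sorted(URI_PATTERNS):
        if any(pattern in hay for hay in uri_haystacks):
            add(pattern, "uri")

    # CGI var: exact match after lower-casing symbols and ascii_strings.
    exact = {s.lower() for s in syms} | {s.lower() for s in strings}
    for pattern in sorted(CGI_VARS):
        if pattern in exact:
            add(pattern, "cgi_var")

    # Config key: substring match against bin_path, symbols, and ascii_strings.
    key_haystacks = [bin_path, *syms, *strings]
    for pattern in sorted(CONFIG_KEYS):
        if any(pattern in hay for hay in key_haystacks):
            add(pattern, "config_key")

    return found


hits = extract_sources(
    "/usr/sbin/httpd",
    symbols=["nvram_get_lan_ipaddr"],
    ascii_strings=["QUERY_STRING", "/cgi-bin/login.cgi"],
)
```

Each hit would then be wrapped into a source dict with confidence=0.40, method="lara_pattern", and source_type set to the kind, as the commit describes.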

First-cut implementation of the LATTE (Liu et al., TOSEM 2025) prompt-slicing idea: when AIEDGE_LATTE_SLICING=1 is set, _build_taint_prompt() replaces the full function body with a sink-rooted backward slice so the LLM spends its token budget on the data-dependency chain instead of the entire function.

- src/aiedge/code_slicing.py (new, 190 lines):
  * latte_slicing_enabled() -- env-gate helper
  * find_sink_line(body, sink_sym) -- first sink call location
  * extract_backward_slice(body, sink_line_idx, max_lines=30) -- bottom-up walker: start from the sink line, track identifiers, keep earlier lines whose identifier set intersects the tracked set. Blank/comment lines are kept for structural context. Source order preserved; the sink line and defining lines of its arguments always land in the slice.
  * extract_slice_around_sink() -- convenience wrapper
  * maybe_slice() -- env-gated entry point (recommended for taint_propagation call site; default-off returns body unchanged so existing prompts are byte-identical)
  * slice_compression_ratio() -- telemetry helper
- src/aiedge/taint_propagation.py (+5 lines): _build_taint_prompt() pipes each function body through maybe_slice(body, sink_symbol) before the _truncate_text() cap.
- tests/test_code_slicing.py (new, 32 tests): sink location + word boundary / slice invariants (subset, source order, sink kept, defining lines pulled in) / max_lines cap / degenerate inputs / env-gate parsing (truthy/falsy) / byte-identical default-off / compression-ratio telemetry.
- docs/code_slicing_contract.md (new): algorithm description, over-approximation caveats, env gate, call site, Phase 2D entry interaction guidance.

Verification:

    pytest -q tests/test_code_slicing.py         # 32 passed
    pytest -q                                    # full suite green
    ruff check, pyright (changed files)          # clean / 0 errors
    python3 scripts/check_doc_consistency.py     # OK

Phase 2C+ Track A fourth commit on PR #9 (Pivot 2026-04-19).

This closes 2C+.1, leaving 2C+.4 (vendor extraction chain -- requires five external firmware binaries) as the only remaining Track A step before the Phase 2D entry exit-gate evaluation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
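
The bottom-up walker described in this commit can be approximated in a few lines. This is a hedged re-implementation for illustration, not the contents of code_slicing.py: `locate_sink` and `backward_slice` are hypothetical names, the keyword list is a small assumed subset of C, and identifiers inside string literals are tracked too (an over-approximation).

```python
import re

_IDENT = re.compile(r"[A-Za-z_]\w*")
# Assumed subset of C keywords/types that should not drive the slice.
_KEYWORDS = frozenset({"if", "else", "for", "while", "return", "sizeof",
                       "int", "char", "void", "long", "unsigned", "const"})


def identifiers(line):
    """Identifier-like tokens on a line, minus the keyword subset."""
    return {tok for tok in _IDENT.findall(line) if tok not in _KEYWORDS}


def locate_sink(body, sink_sym):
    """Index of the first line calling sink_sym (word-boundary match), or None."""
    pattern = re.compile(r"\b" + re.escape(sink_sym) + r"\s*\(")
    for idx, line in enumerate(body.splitlines()):
        if pattern.search(line):
            return idx
    return None


def backward_slice(body, sink_line_idx, max_lines=30):
    """Walk upward from the sink, keeping lines that touch tracked identifiers."""
    lines = body.splitlines()
    tracked = identifiers(lines[sink_line_idx])
    kept = [sink_line_idx]
    for idx in range(sink_line_idx - 1, -1, -1):
        if len(kept) >= max_lines:
            break
        stripped = lines[idx].strip()
        if not stripped or stripped.startswith(("//", "/*", "*")):
            kept.append(idx)  # blank/comment lines kept for context
            continue
        ids = identifiers(lines[idx])
        if ids & tracked:
            kept.append(idx)
            tracked |= ids    # follow the dependency chain further up
    return "\n".join(lines[i] for i in sorted(kept))  # source order preserved


body = """\
int handler(char *req) {
    char cmd[64];
    int unused = 0;
    char *name = get_param(req);
    snprintf(cmd, sizeof(cmd), "ping %s", name);
    system(cmd);
}"""
sliced = backward_slice(body, locate_sink(body, "system"))
```

In this example the `unused` declaration drops out of the slice, while the declaration of `cmd`, the `get_param` call, and the `snprintf` that feeds the sink are retained.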

# Conflicts: # CHANGELOG.md

…corecard

- Mark Phase 1.5 (LATTE) / 1.6 (LARA) as landed in v2.7.0 PR #9 in the phase-mapping table (previously only labeled 'moved to Phase 2C+')
- Add 'v2.7.0 / v2.7.1 landed status' subsection under Phase 2C+ Insert that captures: all 2C+ items shipped, Phase 2D' Entry Gate FINAL 2/5 PASS scorecard (12-pair, WRT ok measurement), partial-extraction artifact back-slide lesson, and Pivot Option D unchanged (v2.7.1 is a quantitative refinement of scenario C, not a re-pivot)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

R00T-Kim and others added 3 commits April 19, 2026 19:07

R00T-Kim mentioned this pull request Apr 19, 2026

Relocate CRA mapping into docs/compliance_mapping/ (Phase 3'.1 step B-1) #10

Merged

4 tasks

R00T-Kim and others added 2 commits April 19, 2026 20:14

Merge remote-tracking branch 'origin/main' into phase-2c-plus

2fc5c96

# Conflicts: # CHANGELOG.md

R00T-Kim merged commit 30595e6 into main Apr 19, 2026
6 checks passed

R00T-Kim deleted the phase-2c-plus branch April 19, 2026 15:20

R00T-Kim merged 5 commits into main from phase-2c-plus
