Add finding diversity gate and pair-eval timeout diagnostic (Phase 2C+.5)#9

Merged
R00T-Kim merged 5 commits into main from phase-2c-plus on Apr 19, 2026

@R00T-Kim
Owner

Summary

First PR of Phase 2C+ Track A. It adds measurement scaffolding for the two reviewer-eval-lane blockers identified in the 2026-04-19 Direction Pivot:

  • finding diversity gate — local-7 lane currently maps every pair-side row to a single finding_id (aiedge.findings.web.exec_sink_overlap), giving finding_diversity_index = 1.0 (degenerate). Gate fails when max_share(finding_id) >= AIEDGE_PAIR_DIVERSITY_MAX (default 0.5).
  • pair-eval timeout diagnostic — dedicated reviewer reruns (pair-eval-dedicated-local7-claude-6h, codex-6h) terminated at run_index rows = 0 with no actionable signal. _dump_timeout_diagnostic() now captures last 200 stderr / 50 stdout lines + best-effort run_dir guess + last stage status into <side>/timeout_diagnostic.json.
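
The max-share rule described above can be sketched as follows. This is a minimal illustration, not the code in quality_policy.py: the function names `max_share_diversity_index` and `diversity_gate_passes` are hypothetical, and only the AIEDGE_PAIR_DIVERSITY_MAX variable, its 0.5 default, and the `>=` failure rule come from this PR description.

```python
import os
from collections import Counter


def max_share_diversity_index(finding_ids):
    """Return the largest share held by a single finding_id.

    1.0 means every row maps to the same finding_id (the degenerate case).
    """
    counts = Counter(finding_ids)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no finding_id rows to evaluate")
    return max(counts.values()) / total


def diversity_gate_passes(finding_ids, threshold=None):
    """Hypothetical gate: pass only when the index is strictly below the threshold."""
    if threshold is None:
        # Env var name and default are taken from the PR text.
        threshold = float(os.environ.get("AIEDGE_PAIR_DIVERSITY_MAX", "0.5"))
    return max_share_diversity_index(finding_ids) < threshold


# The local-7 baseline: all 14 rows share one finding_id.
baseline = ["aiedge.findings.web.exec_sink_overlap"] * 14
print(max_share_diversity_index(baseline))        # 1.0
print(diversity_gate_passes(baseline, 0.5))       # False
```

Because the comparison is `>=`, a lane whose most common finding_id covers exactly half of the rows still fails at the default threshold.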

Phase 2D entry exit-gate

This PR delivers gate 3 of 5 for the Phase 2D entry threshold:

  • recall ≥ 0.40 (current 0.142857)
  • tier ≥ 2 nonzero TP (current 1)
  • finding diversity < 0.5 (current 1.0) ← this PR enables enforcement
  • dedicated rerun ≥ 1 driver success (current 0)
  • corpus size ≥ 10 (current 7)

Test plan

  • pytest -q tests/test_finding_diversity_gate.py — 12 passed
  • pytest -q — full suite green
  • ruff check src/ tests/ scripts/ — clean
  • pyright src/aiedge/quality_policy.py scripts/run_pair_eval.py tests/test_finding_diversity_gate.py — 0 errors
  • python3 scripts/check_doc_consistency.py — OK
  • bash -n scripts/release_gate.sh — clean
  • Run release_gate.sh --pair-eval-findings benchmark-results/pair-eval/pair_eval_findings.csv against the local-7 baseline → expect FAIL with QUALITY_GATE_DIVERSITY_MISS, actual=1.0

Files

  • src/aiedge/quality_policy.py (+115)
  • scripts/run_pair_eval.py (+82, helpers + TimeoutExpired branch)
  • scripts/release_gate.sh (+57, opt-in PAIR_EVAL_DIVERSITY sub-gate)
  • tests/test_finding_diversity_gate.py (new, 12 tests)
  • docs/finding_diversity_gate.md (new)
  • CHANGELOG.md (Unreleased ### Added)

Plan reference

See ~/.claude/plans/twinkly-hugging-leaf.md for the broader Track A / Track B parallel plan.

🤖 Generated with Claude Code

R00T-Kim and others added 3 commits April 19, 2026 19:07
…+.5)

Reviewer eval lane analysis (2026-04-19) surfaced two blockers on the path
into Phase 2D':
  - the local-7 lane mapped every pair-side row to a single finding_id
    (degenerate ROC: all 14 rows on aiedge.findings.web.exec_sink_overlap)
  - the dedicated reviewer reruns (claude-6h, codex-6h) terminated at
    `run_index rows = 0` with no actionable diagnostic

This adds measurement scaffolding for both:

- quality_policy.py: compute_pair_eval_diversity_index() (max-share over
  finding_id), load_pair_eval_finding_ids() (CSV reader with optional
  ground_truth filter), evaluate_pair_eval_diversity_gate() (fails when
  index >= AIEDGE_PAIR_DIVERSITY_MAX, default 0.5). New violation tokens
  QUALITY_GATE_DIVERSITY_MISS / QUALITY_GATE_INVALID_PAIR_EVAL.
- run_pair_eval.py: TimeoutExpired now writes <side>/timeout_diagnostic.json
  with last 200 stderr / 50 stdout lines, best-effort run_dir guess, and
  the most recent stage's name/status.
- release_gate.sh: opt-in PAIR_EVAL_DIVERSITY sub-gate via
  --pair-eval-findings; absent flag emits an INFO skip line.
- docs/finding_diversity_gate.md: threshold rationale, output schema,
  Phase 2D entry exit-gate hook (recall >= 0.40 / tier >= 2 nonzero TP /
  diversity < 0.5 / dedicated rerun success / corpus >= 10).
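
A rough sketch of the timeout capture described for run_pair_eval.py, assuming the reviewer side is launched with subprocess.run. The helper names `tail_lines` and `run_with_timeout_diagnostic` and the JSON key names are hypothetical; the run_dir guess and last-stage status are left out because their inputs are not described here.

```python
import json
import subprocess
import sys
from pathlib import Path


def tail_lines(text, n):
    """Return the last n lines of captured output (accepts str, bytes, or None)."""
    if not text:
        return []
    if isinstance(text, bytes):
        # TimeoutExpired can carry bytes even when text=True was requested.
        text = text.decode("utf-8", errors="replace")
    return text.splitlines()[-n:]


def run_with_timeout_diagnostic(cmd, side_dir, timeout_s):
    """Run cmd; on timeout, write <side_dir>/timeout_diagnostic.json and return None."""
    side_dir = Path(side_dir)
    side_dir.mkdir(parents=True, exist_ok=True)
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        diagnostic = {
            "cmd": [str(part) for part in cmd],
            "timeout_s": timeout_s,
            "stderr_tail": tail_lines(exc.stderr, 200),  # last 200 stderr lines
            "stdout_tail": tail_lines(exc.stdout, 50),   # last 50 stdout lines
        }
        out_path = side_dir / "timeout_diagnostic.json"
        out_path.write_text(json.dumps(diagnostic, indent=2), encoding="utf-8")
        return None
```

Keeping only the tails bounds the file size while still showing what the driver was doing when the timeout fired.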

Verification:
  pytest -q tests/test_finding_diversity_gate.py   # 12 passed
  pytest -q                                         # full suite green
  ruff check, pyright, check_doc_consistency       # clean
  bash -n scripts/release_gate.sh                  # clean

Phase 2C+ Track A first PR (Pivot 2026-04-19).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ector (Phase 2C+.3)

Doubles the dangerous-call catalogue covered by taint_propagation so the
diversity gate added in 2C+.5 has more candidates to discriminate against.
The pre-Pivot _SINK_SYMBOLS only covered the cmd-injection / strcpy /
printf families (29 symbols); the firmware corpus routinely surfaces
sinks across at least nine CWE families that were silently missed.

- _SINK_SYMBOLS 29 -> 51, with explicit CWE comments per group:
    * CWE-78  + wordexp / posix_spawn / posix_spawnp
    * CWE-22  + fopen / open / openat / freopen / chdir
    * CWE-426 + dlsym / dlmopen
    * CWE-732 + chmod / fchmod / chown / fchown / lchown
    * CWE-377 + mktemp / tmpnam / tempnam / tmpfile
    * CWE-250/269 + chroot / setuid / seteuid / setgid / setegid
    * CWE-454 + putenv / setenv / unsetenv
    * CWE-134 + vsnprintf / dprintf / vdprintf
- _FORMAT_STRING_SINKS 6 -> 15 with size-bounded, fd-based, and
  wide-character format-string variants.
- _is_format_string_variable() is widened to flag any first argument
  whose first non-whitespace character does *not* open a string
  literal: bare identifiers, function calls, struct field access
  (`obj->field`), array subscripts, C-style casts, parenthesised
  ternaries, and pointer dereferences (`*p_fmt`). Previously only bare
  identifiers matched, so `printf(obj->field)` was silently considered
  safe.
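
The widened check can be approximated as below. This is a sketch of the stated rule only, not the body of _is_format_string_variable(): the function name `first_arg_is_non_literal` is hypothetical, and treating prefixed literals (`L"..."`, `u8"..."`, `u"..."`, `U"..."`) as literals is an assumption made here for the wide-character variants.

```python
# Prefixes that open a C string literal. Only the plain '"' case is stated in
# the commit message; the encoding-prefixed forms are an assumption.
_LITERAL_PREFIXES = ('"', 'L"', 'u8"', 'u"', 'U"')


def first_arg_is_non_literal(first_arg: str) -> bool:
    """Return True when the format argument does not start with a string literal."""
    stripped = first_arg.lstrip()
    if not stripped:
        return False  # nothing to classify
    return not stripped.startswith(_LITERAL_PREFIXES)


print(first_arg_is_non_literal('"%s\\n"'))      # False: literal format string
print(first_arg_is_non_literal("obj->field"))   # True: struct field access
print(first_arg_is_non_literal("*p_fmt"))       # True: pointer dereference
```

Checking only the leading token is deliberately broad: anything that is not visibly a literal is flagged, at the cost of some false positives such as constant macros.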

Verification:
  pytest -q tests/test_taint_propagation.py   # 20 passed
  pytest -q                                    # full suite green
  ruff check src/ tests/                       # clean
  pyright (changed files)                     # 0 errors
  python3 scripts/check_doc_consistency.py    # OK

Phase 2C+ Track A second commit on PR #9 (Pivot 2026-04-19, Plan
~/.claude/plans/twinkly-hugging-leaf.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+.2)

EnhancedSourceStage now widens source identification beyond the
INPUT_APIS dynstr scan to cover the three attacker-influenced string
families that LARA (USENIX Sec 2024) showed are missing from
traditional source pools:
  * URI prefixes               (20 patterns)
  * CGI environment variables  (17 patterns)
  * NVRAM / sysconf keys       (24 patterns)

This is the source-side counterpart to PR #9's sink expansion (2C+.3):
together they grow both ends of the source->sink graph so the
diversity gate (2C+.5) has a meaningfully larger candidate pool.

- _URI_SOURCE_PATTERNS / _CGI_VAR_PATTERNS / _CONFIG_KEY_PATTERNS
  frozensets with CWE / RFC / OEM provenance comments.
- _extract_uri_key_sources(bin_path, symbols, ascii_strings=None)
  returns deduplicated (pattern, kind) tuples. Matching policy:
    URI: substring vs bin_path AND ascii_strings (symbols intentionally
         excluded -- '/' is not a valid identifier char)
    CGI var: exact lower-case match against symbols OR ascii_strings
    config key: substring vs bin_path, symbols, AND ascii_strings
- EnhancedSourceStage.run() loop wraps each match into a source dict
  with confidence=0.40 (SYMBOL_COOCCURRENCE cap), method="lara_pattern",
  and source_type set to the match kind. ascii_strings wiring is
  intentionally deferred -- a follow-up will plumb inventory's
  string_hits / sbom _extract_ascii_runs through into this call site.
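
The matching policy above can be sketched as follows. The function name `extract_pattern_sources`, the kind labels ("uri", "cgi_var", "config_key"), and the sample patterns are all hypothetical; "AND" in the policy is read here as "both haystacks are searched", which is an interpretation rather than something the commit message spells out.

```python
def extract_pattern_sources(bin_path, symbols, ascii_strings=None, *,
                            uri_patterns=frozenset(),
                            cgi_vars=frozenset(),
                            config_keys=frozenset()):
    """Return deduplicated (pattern, kind) tuples under the stated matching policy."""
    strings = list(ascii_strings or [])
    symbols = list(symbols)
    found, seen = [], set()

    def add(pattern, kind):
        if (pattern, kind) not in seen:
            seen.add((pattern, kind))
            found.append((pattern, kind))

    # URI: substring match against bin_path and ascii_strings; symbols are
    # skipped because '/' cannot appear in an identifier.
    uri_haystacks = [bin_path, *strings]
    for pattern in sorted(uri_patterns):
        if any(pattern in hay for hay in uri_haystacks):
            add(pattern, "uri")

    # CGI variable: exact, case-folded match against symbols or ascii_strings.
    exact = {s.lower() for s in symbols} | {s.lower() for s in strings}
    for pattern in sorted(cgi_vars):
        if pattern.lower() in exact:
            add(pattern, "cgi_var")

    # Config key: substring match against bin_path, symbols, and ascii_strings.
    key_haystacks = [bin_path, *symbols, *strings]
    for pattern in sorted(config_keys):
        if any(pattern in hay for hay in key_haystacks):
            add(pattern, "config_key")

    return found
```

With `ascii_strings=None`, as in the current call site, URI patterns can only match against `bin_path`, which is why the follow-up wiring matters for that family.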

Verification:
  pytest -q tests/test_uri_source_extraction.py    # 13 passed
  pytest -q                                         # full suite green
  ruff check src/ tests/                            # clean
  pyright (changed files)                          # 0 errors
  python3 scripts/check_doc_consistency.py         # OK

Phase 2C+ Track A third commit on PR #9 (Pivot 2026-04-19).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R00T-Kim and others added 2 commits April 19, 2026 20:14
First-cut implementation of the LATTE (Liu et al., TOSEM 2025)
prompt-slicing idea: when AIEDGE_LATTE_SLICING=1 is set,
_build_taint_prompt() replaces the full function body with a
sink-rooted backward slice so the LLM spends its token budget on
the data-dependency chain instead of the entire function.

- src/aiedge/code_slicing.py (new, 190 lines):
    * latte_slicing_enabled() -- env-gate helper
    * find_sink_line(body, sink_sym) -- first sink call location
    * extract_backward_slice(body, sink_line_idx, max_lines=30)
      bottom-up walker: start from the sink line, track identifiers,
      keep earlier lines whose identifier set intersects the tracked
      set. Blank/comment lines are kept for structural context.
      Source order preserved; the sink line and defining lines of
      its arguments always land in the slice.
    * extract_slice_around_sink() -- convenience wrapper
    * maybe_slice() -- env-gated entry point (recommended for
      taint_propagation call site; default-off returns body
      unchanged so existing prompts are byte-identical)
    * slice_compression_ratio() -- telemetry helper
- src/aiedge/taint_propagation.py (+5 lines):
    _build_taint_prompt() pipes each function body through
    maybe_slice(body, sink_symbol) before the _truncate_text() cap.
- tests/test_code_slicing.py (new, 32 tests):
    sink location + word boundary / slice invariants (subset,
    source order, sink kept, defining lines pulled in) / max_lines
    cap / degenerate inputs / env-gate parsing (truthy/falsy) /
    byte-identical default-off / compression-ratio telemetry.
- docs/code_slicing_contract.md (new): algorithm description,
    over-approximation caveats, env gate, call site, Phase 2D entry
    interaction guidance.
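
The bottom-up walker can be illustrated with the sketch below. It is a simplified reading of the description, not the contents of code_slicing.py: `sketch_find_sink_line` and `sketch_backward_slice` are hypothetical names, the keyword list is a small assumed subset, and words inside string literals are tracked too, so the result over-approximates.

```python
import re

_IDENT = re.compile(r"[A-Za-z_]\w*")
# Assumed subset of C keywords/types to ignore when tracking identifiers.
_C_WORDS = frozenset({
    "int", "char", "const", "void", "unsigned", "long", "short", "static",
    "struct", "return", "if", "else", "for", "while", "sizeof", "NULL",
})


def _identifiers(line):
    return {tok for tok in _IDENT.findall(line) if tok not in _C_WORDS}


def sketch_find_sink_line(body, sink_sym):
    """Index of the first line calling sink_sym (word-boundary match), or -1."""
    pattern = re.compile(rf"\b{re.escape(sink_sym)}\s*\(")
    for idx, line in enumerate(body.splitlines()):
        if pattern.search(line):
            return idx
    return -1


def sketch_backward_slice(body, sink_line_idx, max_lines=30):
    """Keep the sink line plus earlier lines that share a tracked identifier."""
    lines = body.splitlines()
    if not 0 <= sink_line_idx < len(lines):
        return body  # degenerate input: leave the body unchanged
    tracked = _identifiers(lines[sink_line_idx])
    keep = {sink_line_idx}
    for idx in range(sink_line_idx - 1, -1, -1):
        if len(keep) >= max_lines:
            break
        stripped = lines[idx].strip()
        if not stripped or stripped.startswith(("//", "/*", "*")):
            keep.add(idx)  # blank and comment lines stay for context
            continue
        ids = _identifiers(lines[idx])
        if ids & tracked:
            keep.add(idx)
            tracked |= ids  # follow the dependency chain upward
    return "\n".join(lines[i] for i in sorted(keep))  # source order preserved
```

For a body where `system(cmd)` is the sink, the slice keeps the `snprintf` that builds `cmd` and the lines defining its inputs, while unrelated statements drop out.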

Verification:
  pytest -q tests/test_code_slicing.py   # 32 passed
  pytest -q                               # full suite green
  ruff check, pyright (changed files)    # clean / 0 errors
  python3 scripts/check_doc_consistency.py  # OK

Phase 2C+ Track A fourth commit on PR #9 (Pivot 2026-04-19). This
closes 2C+.1, leaving 2C+.4 (vendor extraction chain -- requires
five external firmware binaries) as the only remaining Track A
step before the Phase 2D entry exit-gate evaluation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@R00T-Kim R00T-Kim merged commit 30595e6 into main Apr 19, 2026
6 checks passed
@R00T-Kim R00T-Kim deleted the phase-2c-plus branch April 19, 2026 15:20
R00T-Kim added a commit that referenced this pull request Apr 22, 2026
…corecard

- Mark Phase 1.5 (LATTE) / 1.6 (LARA) as landed in v2.7.0 PR #9 in the
  phase-mapping table (previously only labeled 'Phase 2C+로 이관',
  i.e. 'moved to Phase 2C+')
- Add 'v2.7.0 / v2.7.1 landed status' subsection under Phase 2C+ Insert
  that captures: all 2C+ items shipped, Phase 2D' Entry Gate FINAL 2/5 PASS
  scorecard (12-pair, WRT ok measurement), partial-extraction artifact
  back-slide lesson, and Pivot Option D unchanged (v2.7.1 is a quantitative
  refinement of scenario C, not a re-pivot)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>