5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,11 @@ Format based on [Keep a Changelog](https://keepachangelog.com/).

### Added

- **LATTE-inspired text-based backward slicing (Phase 2C+.1)** (`src/aiedge/code_slicing.py`, `src/aiedge/taint_propagation.py`, `tests/test_code_slicing.py`, `docs/code_slicing_contract.md`). First-cut implementation of the LATTE (Liu et al., TOSEM 2025) prompt-slicing idea: when `AIEDGE_LATTE_SLICING=1` is set, `_build_taint_prompt()` replaces the full function body with a sink-rooted backward slice. The slice walks bottom-up from the sink call, keeping earlier lines whose identifiers overlap the tracked variables-of-interest (minus a conservative noise set of C keywords / literals / common macros). The slice is a subset of the original body's lines with source order preserved; the sink line and the defining lines of its arguments are always retained. Public API: `find_sink_line`, `extract_backward_slice`, `extract_slice_around_sink`, `maybe_slice`, `slice_compression_ratio`, `latte_slicing_enabled`. Default-off keeps existing LLM prompts byte-identical. _(32 new tests in `tests/test_code_slicing.py`.)_
- **LARA-style URI / CGI / config-key source identification (Phase 2C+.2)** (`enhanced_source.py`, `tests/test_uri_source_extraction.py`). `EnhancedSourceStage` now widens source identification beyond C-level input APIs by recognising attacker-influenced strings, taking inspiration from the LARA paper (USENIX Sec 2024). Three new pattern sets totalling 50 entries cover URI prefixes (`/cgi-bin/`, `/api/`, `/upnp/`, `/admin/`, `/goform/`, ...), CGI environment variables (`QUERY_STRING`, `REQUEST_METHOD`, `HTTP_*`, ...), and NVRAM / sysconf config keys (`http_passwd`, `wpa_psk`, `cloud_token`, `firmware_url`, ...). New helper `_extract_uri_key_sources(bin_path, symbols, ascii_strings=None)` produces `(pattern, kind)` tuples that are wrapped per-binary into source dicts with `confidence=0.40` (SYMBOL_COOCCURRENCE cap, since string presence alone does not prove reachability) and `method="lara_pattern"`. Symbol-based URI matching is intentionally skipped to avoid noise; the optional `ascii_strings` parameter is the path for string-literal evidence (to be wired through inventory data in a follow-up). _(13 new tests in `tests/test_uri_source_extraction.py`.)_
- **Sink coverage expansion (Phase 2C+.3)** (`taint_propagation.py`, `tests/test_taint_propagation.py`). `_SINK_SYMBOLS` grows from 29 to 51 symbols, mapping the full CWE taxonomy that the firmware corpus actually exercises: CWE-78 cmd injection (now incl. `wordexp`, `posix_spawn`, `posix_spawnp`), CWE-22 path traversal (`fopen`, `open`, `openat`, `freopen`, `chdir`), CWE-426 search path (`dlsym`, `dlmopen`), CWE-732 perms (`chmod`/`fchmod`/`chown`/`fchown`/`lchown`), CWE-377 insecure tmp (`mktemp`, `tmpnam`, `tempnam`, `tmpfile`), CWE-250/269 privilege (`chroot`, `setuid`, `seteuid`, `setgid`, `setegid`), and CWE-454 env injection (`putenv`, `setenv`, `unsetenv`). `_FORMAT_STRING_SINKS` grows from 6 to 15 with size-bounded (`vsnprintf`), file-descriptor (`dprintf`/`vdprintf`), and wide-char (`swprintf`, `vswprintf`, `wprintf`, `vwprintf`, `fwprintf`, `vfwprintf`) variants. `_is_format_string_variable()` is strengthened to flag struct field access, array subscripts, function-call results, C-style casts, parenthesised ternaries, and pointer dereferences as variable first-arguments, not just bare identifiers. _(20 new tests in `tests/test_taint_propagation.py`.)_
- **Finding diversity gate (Phase 2C+.5)** (`quality_policy.py`, `release_gate.sh`, `tests/test_finding_diversity_gate.py`, `docs/finding_diversity_gate.md`). Detects degenerate pair-eval coverage where every pair-side row maps to the same `finding_id` — the structural failure surfaced by the 2026-04-19 reviewer eval lane analysis (local-7 baseline `finding_diversity_index = 1.0`, all 14 rows on `aiedge.findings.web.exec_sink_overlap`). New helpers `compute_pair_eval_diversity_index()`, `load_pair_eval_finding_ids()`, `evaluate_pair_eval_diversity_gate()` produce a `QUALITY_GATE_DIVERSITY_MISS` violation when `max_share(finding_id) >= AIEDGE_PAIR_DIVERSITY_MAX` (default 0.5). `release_gate.sh` wires this in as the opt-in `PAIR_EVAL_DIVERSITY` sub-gate via `--pair-eval-findings`. _(12 new tests in `tests/test_finding_diversity_gate.py`.)_
- **Pair-eval timeout diagnostic** (`scripts/run_pair_eval.py`). When a pair-side run hits the wall-clock timeout, `_dump_timeout_diagnostic()` writes `<side>/timeout_diagnostic.json` capturing the last 200 stderr / 50 stdout lines, a best-effort run_dir guess, and the most recent stage's name/status. Closes the visibility gap that left the dedicated reviewer rerun lanes (`pair-eval-dedicated-local7-claude-6h`, `codex-6h`) stuck at `run_index rows = 0` without actionable signal.
- **FDA Section 524B compatibility mapping (Phase 3'.1 step B-2)** (`docs/compliance_mapping/fda_section_524b.md`). Maps SCOUT outputs to the four §524B(b) statutory obligations (postmarket vulnerability monitoring plan, secure design/develop/maintain processes, postmarket updates/patches, SBOM) and to the September 2023 FDA premarket cybersecurity guidance content elements (security objectives, threat modelling, security risk management, cybersecurity testing, architecture views, SBOM, vulnerability management, labelling, postmarket plan). Coverage is documented per element with explicit "out of scope" callouts for sponsor-side QMS deliverables. Disclaimer reuses the directory-wide "compatible with" wording rule.
- **ISO/SAE 21434 compatibility mapping (Phase 3'.1 step B-3)** (`docs/compliance_mapping/iso_21434.md`). Maps SCOUT outputs to ISO/SAE 21434:2021 work products across clauses 8 (continual cybersecurity activities), 9 (concept), 10 (product development), 11 (cybersecurity validation), 13 (operations and maintenance), and 15 (TARA methods). Identifies which work products are tool-friendly (WP-08-01..04, WP-10-04, WP-10-05, WP-13-02) versus manufacturer-side narratives (WP-09-02, WP-10-01, WP-10-02, etc.).
- **UN R155 compatibility mapping (Phase 3'.1 step B-3)** (`docs/compliance_mapping/un_r155.md`). Maps SCOUT outputs to UN R155 §7.2 (CSMS) and §7.3 (vehicle-type approval) requirements, plus per-threat guidance for the 15 most-relevant Annex 5 threat categories (manipulation, replay, malware insertion, network-design vulnerabilities, etc.). Co-published with the ISO/SAE 21434 mapping per the standard / regulation pairing.
111 changes: 111 additions & 0 deletions docs/code_slicing_contract.md
@@ -0,0 +1,111 @@
# LATTE Code Slicing Contract

> Phase 2C+.1 (Pivot 2026-04-19) — text-based backward slicing that the taint
> propagation stage uses to compress LLM prompts when
> `AIEDGE_LATTE_SLICING=1` is set.

## Why this exists

LATTE (Liu et al., "LATTE: LLM-Powered Static Binary Taint Analysis",
TOSEM 2025) reported that feeding the LLM the **sink-rooted backward
slice** instead of the full decompiled function body improved new-bug
discovery and reduced token usage. SCOUT's first-cut implementation
takes the same idea but stays conservative: it operates on plain text,
does not require a Ghidra-grade SSA backend, and is opt-in so the
existing prompt behaviour stays byte-identical when the env var is
unset.

The slicing is **over-approximate**: it keeps every earlier line whose
identifier set overlaps the already-tracked variables-of-interest. That
means the slice contains only lines of the original body (ordering
preserved), but it may retain irrelevant lines that happen to mention a
tracked variable name in passing. In exchange, within the `max_lines`
cap it keeps every earlier line that names a tracked variable along the
sink path, so the LLM should not have to reason about a variable whose
definition disappeared.

## Public API

Source: `src/aiedge/code_slicing.py`.

| Function | Purpose |
|---|---|
| `latte_slicing_enabled()` | Returns `True` when `AIEDGE_LATTE_SLICING` is set to `1`/`true`/`yes`/`on` (case-insensitive). |
| `find_sink_line(body, sink_sym)` | 0-based line index of the first `sink_sym(` call, or `None`. |
| `extract_backward_slice(body, sink_line_idx, max_lines=30)` | Backward-walks from `sink_line_idx`, keeps lines whose identifiers overlap the tracked set. Returns a string of the retained lines in source order. |
| `extract_slice_around_sink(body, sink_sym, max_lines=30)` | Convenience: `find_sink_line` then `extract_backward_slice`. Returns `None` when the sink is absent. |
| `maybe_slice(body, sink_sym, max_lines=30)` | Recommended entry point for call sites: when the env gate is off it returns the body unchanged; when on it returns the slice (falling back to the full body if the sink is not found). Never returns `None`. |
| `slice_compression_ratio(original, sliced)` | Telemetry helper — ratio of kept lines to original lines. |
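
The two smallest helpers in the table can be illustrated with a minimal sketch. This is not the code in `src/aiedge/code_slicing.py`: the `_sketch` names are hypothetical, the word-boundary regex is only one plausible reading of "first `sink_sym(` call", and tolerating whitespace before `(` as well as returning `1.0` for an empty body are assumptions.

```python
import re
from typing import Optional


def find_sink_line_sketch(body: str, sink_sym: str) -> Optional[int]:
    """Return the 0-based index of the first line calling sink_sym, else None."""
    # The negative lookbehind rejects longer identifiers such as `my_system(`.
    pattern = re.compile(r"(?<![A-Za-z0-9_])" + re.escape(sink_sym) + r"\s*\(")
    for idx, line in enumerate(body.splitlines()):
        if pattern.search(line):
            return idx
    return None


def slice_compression_ratio_sketch(original: str, sliced: str) -> float:
    """Ratio of kept lines to original lines (assumed 1.0 for an empty body)."""
    total = len(original.splitlines())
    if total == 0:
        return 1.0
    return len(sliced.splitlines()) / total
```

For a body whose first line is `my_system(x);` and whose second line is `system(cmd);`, the sketch reports index `1` for the sink `system`, because the first line only contains the symbol as a suffix of a longer identifier.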

## Env gate

```
AIEDGE_LATTE_SLICING=1 # enable slicing (any of 1/true/yes/on)
```

Default (unset) means `maybe_slice` returns the input body verbatim, so
leaving the env var unset keeps every LLM prompt byte-identical to the
pre-slicing behaviour.
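
A case-insensitive gate of this kind could be parsed roughly as below. This is a sketch only: the function name and the `environ` parameter are hypothetical (the parameter exists so the check can be exercised without touching the real environment), and stripping surrounding whitespace is an assumption.

```python
import os
from typing import Mapping, Optional

_TRUTHY = frozenset({"1", "true", "yes", "on"})


def latte_slicing_enabled_sketch(environ: Optional[Mapping[str, str]] = None) -> bool:
    """True when AIEDGE_LATTE_SLICING holds one of the accepted truthy values."""
    env = os.environ if environ is None else environ
    raw = env.get("AIEDGE_LATTE_SLICING", "")
    # Whitespace stripping is an assumption; the contract only states
    # that the comparison is case-insensitive.
    return raw.strip().lower() in _TRUTHY
```

Any other value, including `0` or an empty string, leaves the gate off in this sketch.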

## Algorithm (first-cut)

```
1. Locate the sink line (first occurrence of `<sink_sym>(`).
2. Initial variables-of-interest = identifiers on the sink line
   (minus the noise set: C keywords, literals, common macros).
3. For each earlier line (bottom-up):
   a. If its identifier set intersects the variables-of-interest,
      include it and union its identifiers into the interest set.
   b. If the line has no usable identifier (blank, comment-only),
      include it so the LLM keeps structural context.
   c. Stop at `max_lines` or the function start.
4. Emit retained lines in source order.
```
```
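
The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the shipped `extract_backward_slice`: the function name, the identifier regex, and the tiny noise set are hypothetical stand-ins for `_NOISE_IDENTIFIERS`, and treating "no usable identifier" as "no identifier left after noise filtering" is one possible reading of step 3b.

```python
import re

_IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Hypothetical, deliberately small noise set for illustration only.
_NOISE = frozenset({"if", "else", "for", "while", "return", "int", "char",
                    "void", "NULL", "true", "false", "sizeof"})


def _idents(line: str) -> set:
    """Identifiers on a line, minus the noise set."""
    return {tok for tok in _IDENT_RE.findall(line) if tok not in _NOISE}


def backward_slice_sketch(body: str, sink_line_idx: int, max_lines: int = 30) -> str:
    lines = body.splitlines()
    if not 0 <= sink_line_idx < len(lines):
        return body  # degenerate input: nothing to slice around
    interest = _idents(lines[sink_line_idx])  # step 2
    kept = [sink_line_idx]
    for idx in range(sink_line_idx - 1, -1, -1):  # step 3, bottom-up
        if len(kept) >= max_lines:  # step 3c
            break
        ids = _idents(lines[idx])
        if not ids or ids & interest:  # steps 3b and 3a
            kept.append(idx)
            interest |= ids
    return "\n".join(lines[i] for i in sorted(kept))  # step 4


body = "\n".join([
    "char cmd[64];",
    "int unrelated = 7;",
    "char *ip = get_param(req);",
    "snprintf(cmd, 64, fmt, ip);",
    "log_count(unrelated);",
    "system(cmd);",
])
print(backward_slice_sketch(body, sink_line_idx=5))
```

In this example the two lines that only mention `unrelated` are dropped, while the declaration of `cmd`, the assignment to `ip`, and the `snprintf` call are retained because each shares an identifier with the growing interest set.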

Noise identifiers (`_NOISE_IDENTIFIERS`) are kept minimal on purpose: we
filter only what is guaranteed not to carry data (`if`, `int`, `NULL`,
`true`, ...). Vendor-specific tokens are *not* filtered because they
often *are* the relevant variables in router firmware decompilation.

## Over-approximation behaviour

Because the algorithm tracks identifiers and not their scopes, a slice
may include lines that merely reference a same-named variable elsewhere
in the function. This is acceptable for prompt compression but analysts
who need an exact data-flow trace should still consult the Ghidra
P-code SSA path (`pcode_taint.py`).

## Call site

The only caller today is `_build_taint_prompt()` in
`src/aiedge/taint_propagation.py`:

```python
body_raw = fb.get("body", "")
body_sliced = maybe_slice(body_raw, sink_symbol)
body = _truncate_text(body_sliced, max_chars=2000)
```

When `AIEDGE_LATTE_SLICING` is unset the call returns `body_raw`
unchanged and the subsequent `_truncate_text` path is byte-identical to
pre-2C+.1 behaviour.
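
The "never returns `None`" contract that the call site relies on can be sketched as follows. The function name and the `enabled` / `slicer` parameters are hypothetical; they are injected here only so the fallback logic is visible without the real env gate or slicer.

```python
from typing import Callable, Optional


def maybe_slice_sketch(
    body: str,
    sink_sym: str,
    *,
    enabled: bool,
    slicer: Callable[[str, str], Optional[str]],
) -> str:
    """Return the body unchanged when the gate is off; otherwise the slice,
    falling back to the full body when the slicer finds no sink."""
    if not enabled:
        return body
    sliced = slicer(body, sink_sym)
    return body if sliced is None else sliced
```

Because every branch returns a string, a caller such as `_truncate_text` would never need a `None` check under this shape.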

## Phase 2D entry interaction

Phase 2D.1 (reasoning_trail + MCP loop validation) depends on the LLM
actually producing useful verdicts across diverse findings. Slicing is
the main lever we have today to let the LLM see *more* findings within
the same token budget — so even if Phase 2D.1 does not require slicing,
leaving it disabled in production runs means the analyst cycles through
a smaller effective corpus. Operators planning a Phase 2D.1 walkthrough
should enable `AIEDGE_LATTE_SLICING=1` for the run.

## Related artifacts

- `src/aiedge/code_slicing.py` — implementation
- `src/aiedge/taint_propagation.py` — call site in `_build_taint_prompt`
- `tests/test_code_slicing.py` — unit tests (32 cases) that pin:
  - sink-line location and word-boundary behaviour
  - slice invariants (subset, source order, sink kept, defining lines
    pulled in)
  - `max_lines` cap and degenerate inputs
  - env-gate parsing and byte-identical default-off
  - compression-ratio telemetry
137 changes: 137 additions & 0 deletions docs/finding_diversity_gate.md
@@ -0,0 +1,137 @@
# Finding Diversity Gate

> Phase 2C+.5 (Pivot 2026-04-19) — pair-eval lane gate that detects degenerate
> evidence-tier coverage by measuring finding-id share concentration.

## Why this gate exists

The 2026-04-19 reviewer eval lane analysis surfaced a structural failure that
neither precision/recall nor confidence caps caught: **every pair-side row in the
local-7 lane mapped to the same `finding_id`** (`aiedge.findings.web.exec_sink_overlap`,
`evidence_tier=symbol_only`). The pair-level recall and FP rate looked plausible
(0.142857 each) yet the underlying tier-ROC was *degenerate* — there was nothing
to discriminate between vulnerable and patched runs because the detection layer
collapsed onto a single finding.

The diversity gate quantifies this collapse and blocks releases that ship it.

## Definition

```
finding_diversity_index = max_count(finding_id) / total_rows
```

- `1.0` — degenerate (every row mapped to a single `finding_id`)
- `1/N` — fully diverse (every row a distinct `finding_id`)
- `0.0` — empty input (callers decide whether to treat as violation)

The index is a **maximum-share** metric, not entropy. It is robust to long-tail
distributions and surfaces the dominant finding bucket directly.
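
As a worked illustration of the definition, the maximum-share index can be computed with a `Counter`. The function name is hypothetical and this is a sketch of the formula, not the body of `compute_pair_eval_diversity_index`; skipping empty ids mirrors the input rule stated for the CSV loader.

```python
from collections import Counter
from typing import Iterable


def diversity_index_sketch(finding_ids: Iterable[str]) -> float:
    """max_count(finding_id) / total_rows, with 0.0 for empty input."""
    ids = [fid for fid in finding_ids if fid]
    if not ids:
        return 0.0
    return max(Counter(ids).values()) / len(ids)


# The degenerate baseline: 14 rows on a single finding id.
print(diversity_index_sketch(["aiedge.findings.web.exec_sink_overlap"] * 14))
# Four rows, four distinct ids: 1/N with N = 4.
print(diversity_index_sketch(["a", "b", "c", "d"]))
```

The first call yields `1.0` and the second `0.25`, matching the degenerate and fully diverse ends of the scale described above.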

## Threshold

| Env variable | Default | Direction |
|---|---|---|
| `AIEDGE_PAIR_DIVERSITY_MAX` | `0.5` | gate fails when index `>=` threshold |

The default `0.5` was chosen as a first cut: any single `finding_id` accounting
for 50%+ of pair rows is treated as a degenerate signal. Once the corpus grows
past 10 pairs the threshold should be re-evaluated against representative runs
(see Phase 2C+.4 vendor-extraction expansion).
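
The threshold comparison can be sketched as below. The function name and `environ` parameter are hypothetical, and parsing the override with `float()` is an assumption about how `AIEDGE_PAIR_DIVERSITY_MAX` is read; only the default and the `>=` failure direction come from the table.

```python
import os
from typing import Mapping, Optional


def diversity_gate_passes_sketch(
    index: float, environ: Optional[Mapping[str, str]] = None
) -> bool:
    """Pass only when the index is strictly below the configured maximum."""
    env = os.environ if environ is None else environ
    threshold = float(env.get("AIEDGE_PAIR_DIVERSITY_MAX", "0.5"))
    # The gate fails when index >= threshold, so passing requires index < threshold.
    return index < threshold
```

Note the boundary case: an index of exactly `0.5` fails under the default, because the failure condition is inclusive.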

## Inputs

The gate consumes the pair-eval findings CSV produced by
`scripts/run_pair_eval.py`. Schema (relevant columns):

| Column | Use |
|---|---|
| `finding_id` | counted into the share distribution |
| `ground_truth` | optional filter via `load_pair_eval_finding_ids(only_ground_truth=...)` |

Rows with an empty `finding_id` are skipped silently. A missing CSV raises a
`QualityGateError` with the token `QUALITY_GATE_INVALID_PAIR_EVAL`.
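
A loader with the column handling described above might look like this sketch. It reads from an open text handle rather than a path, the function name is hypothetical, and the `ground_truth` values `vulnerable` / `patched` in the sample are assumptions chosen for illustration; the real `load_pair_eval_finding_ids` may differ on all three points.

```python
import csv
import io
from typing import List, Optional, TextIO


def load_finding_ids_sketch(
    handle: TextIO, only_ground_truth: Optional[str] = None
) -> List[str]:
    """Collect non-empty finding_id values, optionally filtered by ground_truth."""
    ids: List[str] = []
    for row in csv.DictReader(handle):
        fid = (row.get("finding_id") or "").strip()
        if not fid:
            continue  # empty finding_id rows are skipped silently
        truth = (row.get("ground_truth") or "").strip()
        if only_ground_truth is not None and truth != only_ground_truth:
            continue
        ids.append(fid)
    return ids


sample = io.StringIO(
    "finding_id,ground_truth\n"
    "f.a,vulnerable\n"
    ",patched\n"
    "f.b,patched\n"
)
print(load_finding_ids_sketch(sample))
```

The row with an empty `finding_id` does not contribute to the share distribution, so it also does not count toward `sample_size` in this sketch.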

## Output schema

```json
{
  "schema_version": 1,
  "verdict": "pass" | "fail",
  "passed": true | false,
  "findings_source": "<path string>",
  "policy": {
    "finding_diversity_max": 0.5,
    "finding_diversity_max_env": "AIEDGE_PAIR_DIVERSITY_MAX"
  },
  "measured": {
    "finding_diversity_index": 0.0..1.0,
    "sample_size": <int>
  },
  "errors": [
    {
      "error_token": "QUALITY_GATE_DIVERSITY_MISS",
      "metric": "finding_diversity_index",
      "source_field": "pair_eval_findings.finding_id",
      "actual": 1.0,
      "threshold": 0.5,
      "operator": "<",
      "sample_size": 14,
      "message": "..."
    }
  ]
}
```

## Wiring into `release_gate.sh`

The unified release gate wires this in as the `PAIR_EVAL_DIVERSITY` sub-gate. It
is **opt-in** via `--pair-eval-findings`:

```bash
scripts/release_gate.sh \
--run-dir aiedge-runs/<id> \
--pair-eval-findings benchmark-results/pair-eval/pair_eval_findings.csv
```

When the flag is omitted the gate is skipped with an `INFO` line so existing
release flows continue working unchanged.

## Current baseline (2026-04-19)

Running the gate against the trusted summary-reuse local-7 lane:

```
sample_size = 14 (7 pairs × 2 sides)
finding_diversity_index = 1.0 (degenerate — single finding for all rows)
verdict = fail
```

This matches the Pivot 2026-04-19 [diagnosis](../docs/status.md): Phase 2D entry
is gated until detection coverage produces at least two distinct findings across
the pair lane. The gate makes that requirement enforceable instead of advisory.

## Phase 2D entry exit-gate hook

The diversity gate is one of the five Phase 2D entry exit-gate thresholds
defined in [`docs/status.md`](status.md):

| Gate | Threshold | Tooling |
|---|---|---|
| Detection recall | `≥ 0.40` | `pair_eval_summary.json` |
| Tier discriminability | `≥ 2 nonzero TP tiers` | `pair_eval_findings.csv` |
| **Finding diversity** | **`< 0.5`** | **this gate** |
| Dedicated rerun | `≥ 1 driver success` | `pair-eval-dedicated-*` lanes |
| Corpus size | `≥ 10 pairs` | `benchmarks/pair-eval/pairs.json` |

The other four are tracked in their own places; this gate only owns the
diversity threshold.

## Related artifacts

- `src/aiedge/quality_policy.py` — `compute_pair_eval_diversity_index`,
`load_pair_eval_finding_ids`, `evaluate_pair_eval_diversity_gate`
- `scripts/run_pair_eval.py` — adds `timeout_diagnostic.json` for dedicated
rerun timeout investigations (companion 2C+.5 work)
- `scripts/release_gate.sh` — `PAIR_EVAL_DIVERSITY` sub-gate
- `tests/test_finding_diversity_gate.py` — unit + baseline tests
68 changes: 67 additions & 1 deletion scripts/release_gate.sh
@@ -10,12 +10,13 @@ CORPUS_MANIFEST="benchmarks/corpus/manifest.json"
METRICS_OUT=""
QUALITY_OUT=""
LLM_FIXTURE=""
PAIR_EVAL_FINDINGS=""

FAILED=0

usage() {
cat <<'EOF'
Usage: scripts/release_gate.sh --run-dir <PATH> [--manifest <PATH>] [--metrics-out <PATH>] [--quality-out <PATH>] [--llm-fixture <PATH>]
Usage: scripts/release_gate.sh --run-dir <PATH> [--manifest <PATH>] [--metrics-out <PATH>] [--quality-out <PATH>] [--llm-fixture <PATH>] [--pair-eval-findings <PATH>]

Unified release governance gate (single entrypoint).

@@ -25,6 +26,7 @@ Sub-gates:
- QUALITY_METRICS: aiedge quality-metrics
- QUALITY_POLICY: aiedge release-quality-gate
- EXPLOIT_TIER_POLICY: schema tier checks plus exploit_policy artifact checks when present
- PAIR_EVAL_DIVERSITY: finding-diversity gate over pair_eval_findings.csv (skipped when --pair-eval-findings absent)
- TAMPER_SUITE: pytest tests/test_tamper_suite.py
EOF
}
@@ -97,6 +99,10 @@ while [[ $# -gt 0 ]]; do
      LLM_FIXTURE="$2"
      shift 2
      ;;
    --pair-eval-findings)
      PAIR_EVAL_FINDINGS="$2"
      shift 2
      ;;
    -h|--help)
      usage
      exit 0
@@ -203,6 +209,66 @@ else
fi
rm -f "$EXPLOIT_CHECK_OUTPUT"

if [[ -n "$PAIR_EVAL_FINDINGS" ]]; then
  PAIR_EVAL_OUTPUT="$(mktemp)"
  set +e
  PYTHONPATH="$PYTHONPATH" python3 - <<'PY' "$PAIR_EVAL_FINDINGS" "$RUN_DIR" >"$PAIR_EVAL_OUTPUT" 2>&1
import json
import sys
from pathlib import Path

from aiedge.quality_policy import (
    QualityGateError,
    evaluate_pair_eval_diversity_gate,
    load_pair_eval_finding_ids,
)

csv_path = Path(sys.argv[1]).resolve()
run_dir = Path(sys.argv[2]).resolve()
out_path = run_dir / "pair_eval_diversity_gate.json"
try:
    finding_ids = load_pair_eval_finding_ids(csv_path)
except QualityGateError as exc:
    print(f"{exc.token}: {exc}")
    raise SystemExit(1) from exc

result = evaluate_pair_eval_diversity_gate(
    finding_ids=finding_ids,
    findings_source=str(csv_path),
)
out_path.write_text(
    json.dumps(result, indent=2, sort_keys=True) + "\n", encoding="utf-8"
)
if not result["passed"]:
    for err in result["errors"]:
        print(err.get("message") or err.get("error_token"))
    raise SystemExit(1)
measured = result["measured"]
print(
    "diversity_index="
    + str(measured["finding_diversity_index"])
    + " sample_size="
    + str(measured["sample_size"])
)
PY
  PAIR_EVAL_RC=$?
  set -e
  if [[ "$PAIR_EVAL_RC" -ne 0 ]]; then
    gate_fail "PAIR_EVAL_DIVERSITY" "diversity gate violated"
    while IFS= read -r line; do
      [[ -n "$line" ]] && echo "[GATE][LOG][PAIR_EVAL_DIVERSITY] $line"
    done <"$PAIR_EVAL_OUTPUT"
  else
    gate_pass "PAIR_EVAL_DIVERSITY" "diversity gate passed"
    while IFS= read -r line; do
      [[ -n "$line" ]] && gate_info "PAIR_EVAL_DIVERSITY" "$line"
    done <"$PAIR_EVAL_OUTPUT"
  fi
  rm -f "$PAIR_EVAL_OUTPUT"
else
  gate_info "PAIR_EVAL_DIVERSITY" "skipped (no --pair-eval-findings)"
fi

if [[ "${AIEDGE_SKIP_TAMPER_TESTS:-0}" == "1" ]]; then
gate_info "TAMPER_SUITE" "skipped by AIEDGE_SKIP_TAMPER_TESTS=1"
else