benchmarks: OSS guardrails comparison harness (closes #32 scaffolding) #54
Merged
… / Guardrails AI / NeMo) - closes #32 scaffolding

Ships the reproducible benchmark framework requested in #32:

- 72-record curated dataset (42 attacks + 30 benign) across prompt_injection / jailbreak / data_exfiltration / evasion + safe baseline, with per-record source attribution.
- Pluggable adapters: Aigis (in-process), LLM Guard / Guardrails AI / NeMo Guardrails (HTTP sidecars). Each adapter advertises its "default" config tier so the report row is honest about what was measured.
- `docker-compose.yml` for the three external services.
- Driver + reporter: `make bench`, `make bench-aigis`, CSV + markdown output with per-category TPR, FPR on safe baseline, p50/p95 latency, error counts.
- CI workflow with a ±2 pp regression guard on the Aigis row (`benchmarks/oss_comparison/baseline.json` + `scripts/regression_guard.py`).
- 7 smoke tests in `tests/test_oss_comparison_bench.py`.
- `docs/benchmarks/oss-comparison.md` documenting methodology, the v0 Aigis baseline (14.3% detection rate / 0% FPR / p50 0.49 ms on default policy), acknowledged gaps (data_exfiltration 0%, evasion 0% on default; surfaced, not hidden), and limitations.

Remaining acceptance-criteria work is tracked in the docs page's "Open work" section: SHA256-pinning the three docker images, populating their live rows, and wiring `fetch_extended.py` to actually download PromptBench.

Tests: 1529 passed, 0 failed (`uv run pytest --tb=no -q`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: killertcell428 <killertcell428@gmail.com>
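For readers unfamiliar with the adapter pattern mentioned in the commit message, here is a minimal sketch of how a shared verdict type and a pluggable adapter contract could look. It is not the code shipped in this PR: the names `Verdict`, `GuardrailAdapter`, `KeywordAdapter`, `run_adapter`, and every field name are assumptions chosen for illustration, and the toy adapter stands in for a real in-process or HTTP-sidecar tool.

```python
import time
from dataclasses import dataclass
from typing import List, Optional, Protocol, Sequence


@dataclass(frozen=True)
class Verdict:
    """Hypothetical common result shape; the field names are assumptions."""
    flagged: bool
    latency_ms: float
    error: Optional[str] = None


class GuardrailAdapter(Protocol):
    """Hypothetical contract every tool adapter would satisfy."""
    name: str
    config_tier: str  # e.g. "default", so the report row states what was measured

    def scan(self, text: str) -> Verdict:
        ...


class KeywordAdapter:
    """Toy in-process adapter, used only to exercise the contract."""
    name = "toy-keyword"
    config_tier = "default"

    def __init__(self, blocked: Sequence[str]) -> None:
        self._blocked = [term.lower() for term in blocked]

    def scan(self, text: str) -> Verdict:
        start = time.perf_counter()
        lowered = text.lower()
        flagged = any(term in lowered for term in self._blocked)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        return Verdict(flagged=flagged, latency_ms=elapsed_ms)


def run_adapter(adapter: GuardrailAdapter, prompts: Sequence[str]) -> List[Verdict]:
    """Scan every prompt with one adapter and collect the verdicts in order."""
    return [adapter.scan(prompt) for prompt in prompts]
```

Under this assumed design, the driver only depends on `scan()` and the advertised `config_tier`, so an HTTP-backed adapter could be swapped in without touching the reporting code.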
d537524 to bc0c626
Summary
Ships the reproducible benchmark requested in #32 so Aigis can be compared apples-to-apples against LLM Guard, Guardrails AI, and NVIDIA NeMo Guardrails.
- `benchmarks/oss_comparison/datasets/`: 72 records (42 attacks + 30 benign) across `prompt_injection` / `jailbreak` / `data_exfiltration` / `evasion`, plus a multilingual (en/ja/ko/zh) safe baseline. Per-record source attribution. Curated to avoid vendoring research-licensed corpora; a `fetch_extended.py` stub is included for opt-in PromptBench/HarmBench downloads.
- `benchmarks/oss_comparison/adapters/`: common `Verdict` protocol with four concrete implementations. Aigis runs in-process; the three external tools run as HTTP sidecars via the included `docker-compose.yml`.
- `make bench` (all tools) and `make bench-aigis` (no Docker, CI-friendly). Output: one CSV row per (tool, input) and a markdown report with per-category TPR, FPR on the safe baseline, p50/p95 latency, and error counts.
- `.github/workflows/bench-oss-comparison.yml` + `scripts/regression_guard.py`: on every PR touching `aigis/` or the benchmark, re-runs `make bench-aigis` and fails if Aigis's detection rate drops more than 2 pp below the frozen `baseline.json`. Intentional regressions require updating the baseline in the same PR.
- `docs/benchmarks/oss-comparison.md` with methodology, the live v0 Aigis baseline, acknowledged gaps (data_exfiltration 0%, evasion 0% on default policy; surfaced, not hidden), and explicit limitations.
- 7 smoke tests in `tests/test_oss_comparison_bench.py`. Full suite: 1529 passed, 0 failed.

Honest v0 Aigis baseline (default policy)
Per-category rows: `prompt_injection`, `jailbreak`, `data_exfiltration`, `evasion`.

The 0% rows are real coverage gaps that the benchmark deliberately surfaces; they are flagged in the docs as candidates for the next auto-improvement cycle. As Issue #32 puts it: "the point is calibration, not advocacy."
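To make the reported metrics concrete, the following is a rough sketch of how per-category TPR, FPR on the safe baseline, and p50/p95 latency could be derived from per-record results. The `Row` type, the `summarize` and `percentile` helpers, and the nearest-rank percentile definition are all assumptions for illustration; the PR's reporter may compute these differently.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Row:
    """Hypothetical per-record result; the field names are assumptions."""
    category: str      # e.g. "prompt_injection", or "safe" for benign records
    is_attack: bool
    flagged: bool
    latency_ms: float


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile, one of several common definitions."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(rows: Iterable[Row]) -> Dict[str, object]:
    """Aggregate detection and latency metrics for one tool."""
    rows = list(rows)
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    benign_total = 0
    benign_flagged = 0
    for row in rows:
        if row.is_attack:
            totals[row.category] += 1
            hits[row.category] += int(row.flagged)
        else:
            benign_total += 1
            benign_flagged += int(row.flagged)
    latencies = [row.latency_ms for row in rows]
    return {
        # TPR: flagged attacks / all attacks, per category
        "tpr_by_category": {c: hits[c] / totals[c] for c in totals},
        # FPR: flagged benign records / all benign records
        "fpr": benign_flagged / benign_total if benign_total else 0.0,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
    }
```

A category where no attack is flagged shows up as 0.0 in `tpr_by_category` rather than being omitted, which matches the intent of surfacing gaps instead of hiding them.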
Acceptance criteria status
- … (`benchmarks/oss_comparison/results/`)
- `docker compose up && make bench` reproduces the published numbers within ±2 % → Pending: SHA256-pinning the three external docker images and populating their live rows in a follow-up. The Aigis row is fully reproducible today via `make bench-aigis`.

Open work is tracked in the docs page under "Open work."
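The 2 pp regression guard described in the Summary could be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `scripts/regression_guard.py`: the function names, the `detection_rate` JSON key, and the convention of storing rates as fractions are all hypothetical.

```python
import json
from typing import Tuple


def load_baseline_rate(baseline_json: str, key: str = "detection_rate") -> float:
    """Read the frozen rate (a fraction in [0, 1]) from baseline JSON text.

    The key name is an assumption; the real baseline.json schema may differ.
    """
    return float(json.loads(baseline_json)[key])


def check_regression(
    baseline_rate: float, current_rate: float, tolerance_pp: float = 2.0
) -> Tuple[bool, str]:
    """Pass unless the current rate is more than tolerance_pp points below baseline."""
    drop_pp = (baseline_rate - current_rate) * 100.0  # positive means a regression
    ok = drop_pp <= tolerance_pp
    status = "OK" if ok else "FAIL"
    message = (
        f"{status}: baseline {baseline_rate:.1%}, current {current_rate:.1%}, "
        f"drop {drop_pp:+.1f} pp (limit {tolerance_pp:.1f} pp)"
    )
    return ok, message
```

A CI step would typically turn the boolean into a process exit code, so a drop beyond the tolerance fails the job. Improvements always pass, and an intentional regression would be handled by updating the baseline in the same PR.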
Test plan
- `uv run pytest tests/test_oss_comparison_bench.py -v` → 7 passed
- `uv run pytest --tb=no -q` → 1529 passed, 0 failed
- `uv run ruff check benchmarks/ tests/test_oss_comparison_bench.py` → clean
- `uv run ruff format --check benchmarks/ tests/test_oss_comparison_bench.py` → clean
- `make bench-aigis` end-to-end (driver → reporter → regression-guard) on Windows
- `bench-oss-comparison.yml` green on this PR
- `benchmarks/oss_comparison/results/report.md`

Closes #32 (scaffolding tranche). Live external-tool rows + image pinning land in a follow-up.
🤖 Generated with Claude Code