benchmarks: OSS guardrails comparison harness (closes #32 scaffolding) by killertcell428 · Pull Request #54 · killertcell428/aigis

killertcell428 · 2026-05-17T06:32:26Z

Summary

Ships the reproducible benchmark requested in #32 so Aigis can be compared apples-to-apples against LLM Guard, Guardrails AI, and NVIDIA NeMo Guardrails.

- Dataset (benchmarks/oss_comparison/datasets/): 72 records (42 attacks + 30 benign) across prompt_injection / jailbreak / data_exfiltration / evasion, plus a multilingual (en/ja/ko/zh) safe baseline. Per-record source attribution. Curated to avoid vendoring research-licensed corpora; a fetch_extended.py stub is included for opt-in PromptBench/HarmBench downloads.
- Adapters (benchmarks/oss_comparison/adapters/): Common Verdict protocol with four concrete implementations. Aigis runs in-process; the three external tools run as HTTP sidecars via the included docker-compose.yml.
- Driver + reporter: make bench (all tools) and make bench-aigis (no Docker, CI-friendly). Output: one CSV row per (tool, input) and a markdown report with per-category TPR, FPR on the safe baseline, p50/p95 latency, and error counts.
- CI regression guard (.github/workflows/bench-oss-comparison.yml + scripts/regression_guard.py): On every PR touching aigis/ or the benchmark, re-runs make bench-aigis and fails if Aigis's detection rate drops more than 2 pp below the frozen baseline.json. Intentional regressions require updating the baseline in the same PR.
- Docs: docs/benchmarks/oss-comparison.md with methodology, the live v0 Aigis baseline, acknowledged gaps (data_exfiltration 0%, evasion 0% on default policy; surfaced, not hidden), and explicit limitations.
- Tests: 7 smoke tests in tests/test_oss_comparison_bench.py. Full suite: 1529 passed, 0 failed.
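
For readers who have not opened the adapters directory, here is a minimal sketch of what a shared verdict type and adapter interface could look like. Everything in it is an assumption made for illustration: the field names on `Verdict`, the `Adapter` protocol, the `scan` method, and the toy `KeywordAdapter` are hypothetical and are not taken from benchmarks/oss_comparison/adapters/base.py.

```python
import time
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Verdict:
    """Hypothetical normalized result that every adapter returns."""
    blocked: bool
    latency_ms: float
    error: Optional[str] = None


class Adapter(Protocol):
    """Hypothetical interface; the real one may differ."""
    name: str

    def scan(self, text: str) -> Verdict: ...


class KeywordAdapter:
    """Toy in-process adapter used only to exercise the interface."""
    name = "keyword-demo"

    def scan(self, text: str) -> Verdict:
        start = time.perf_counter()
        blocked = "ignore previous instructions" in text.lower()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        return Verdict(blocked=blocked, latency_ms=elapsed_ms)


def run(adapter: Adapter, inputs: list) -> list:
    """Produce one (tool, input, verdict) row per input, like the CSV output."""
    return [(adapter.name, text, adapter.scan(text)) for text in inputs]
```

An HTTP-sidecar adapter would satisfy the same interface by posting the text to its service and mapping the response onto `Verdict`, which is what lets the driver treat all four tools uniformly.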

Honest v0 Aigis baseline (default policy)

| Metric | Value |
| --- | --- |
| Overall detection rate | 14.3 % |
| FPR on safe baseline | 0.0 % |
| p50 latency | 0.49 ms |
| `prompt_injection` | 16.7 % |
| `jailbreak` | 33.3 % |
| `data_exfiltration` | 0.0 % |
| `evasion` | 0.0 % |
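
To make the table's columns concrete, the sketch below shows one way such metrics could be derived from per-input rows. It is an illustration under assumptions, not the PR's reporter: the `Row` fields, the `summarize` function, and the nearest-rank percentile method are all hypothetical choices, and the demo rows used to exercise it are made up.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Row:
    """Hypothetical shape of one (tool, input) result row."""
    category: str
    is_attack: bool
    blocked: bool
    latency_ms: float


def rate(flags: list) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile (an assumed method, one of several in use)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(rows: list) -> dict:
    by_category = defaultdict(list)
    for r in rows:
        if r.is_attack:
            by_category[r.category].append(r.blocked)
    latencies = [r.latency_ms for r in rows]
    return {
        # Detection rate (TPR): share of attack inputs that were blocked.
        "detection_rate": rate([r.blocked for r in rows if r.is_attack]),
        # FPR: share of benign inputs that were blocked.
        "fpr": rate([r.blocked for r in rows if not r.is_attack]),
        "per_category": {c: rate(v) for c, v in by_category.items()},
        "p50_ms": median(latencies),
        "p95_ms": percentile(latencies, 95),
    }
```

As a sanity check on the reported figure, an overall detection rate of 14.3 % over 42 attack records is consistent with 6 detections (6 / 42 ≈ 0.143).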

The 0% rows are real coverage gaps the benchmark deliberately surfaces — they're flagged in the docs as candidates for the next auto-improvement cycle. As Issue #32 puts it: "the point is calibration, not advocacy."

Acceptance criteria status

- CSV + markdown table both checked in (benchmarks/oss_comparison/results/)
- Per-category detection rate AND false-positive rate columns
- Limitations / "what this doesn't measure" section in the doc
- docker compose up && make bench reproduces the published numbers within ±2 %
  → Pending: SHA256-pinning the three external docker images and populating their live rows in a follow-up. The Aigis row is fully reproducible today via make bench-aigis.

Open work tracked in the docs page under "Open work."

Test plan

- uv run pytest tests/test_oss_comparison_bench.py -v → 7 passed
- uv run pytest --tb=no -q → 1529 passed, 0 failed
- uv run ruff check benchmarks/ tests/test_oss_comparison_bench.py → clean
- uv run ruff format --check benchmarks/ tests/test_oss_comparison_bench.py → clean
- make bench-aigis end-to-end (driver → reporter → regression-guard) on Windows
- CI workflow bench-oss-comparison.yml green on this PR
- Spot-check the rendered markdown report at benchmarks/oss_comparison/results/report.md
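
The regression-guard step in the end-to-end run compares a fresh Aigis result against the frozen baseline and fails on a drop of more than 2 pp. A minimal sketch of that comparison follows. It assumes a flat JSON schema mapping metric names to percentages, which is a guess: the actual layout of baseline.json and the interface of scripts/regression_guard.py are not shown in this PR description, and the function names here are hypothetical.

```python
import json
import sys

TOLERANCE_PP = 2.0  # allowed drop in percentage points, per the PR summary


def check(baseline: dict, current: dict, tolerance_pp: float = TOLERANCE_PP) -> list:
    """Return one message per metric that fell more than tolerance_pp below baseline."""
    failures = []
    for metric, base in baseline.items():
        now = current.get(metric)
        if now is None:
            failures.append(f"{metric}: missing from current run")
        elif base - now > tolerance_pp:
            failures.append(
                f"{metric}: {now:.1f} % is more than {tolerance_pp:.1f} pp "
                f"below baseline {base:.1f} %"
            )
    return failures


def guard(baseline_path: str, current_path: str) -> int:
    """Load both files (assumed schema: {"metric": percent}) and return an exit code."""
    with open(baseline_path, encoding="utf-8") as f:
        baseline = json.load(f)
    with open(current_path, encoding="utf-8") as f:
        current = json.load(f)
    failures = check(baseline, current)
    for message in failures:
        print(message, file=sys.stderr)
    return 1 if failures else 0
```

Only drops are flagged in this sketch; an improvement beyond the tolerance passes, and an intentional regression would be handled as the summary describes, by updating the baseline in the same PR.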

Closes #32 (scaffolding tranche). Live external-tool rows + image pinning land in a follow-up.

🤖 Generated with Claude Code

… / Guardrails AI / NeMo) - closes #32 scaffolding

Ships the reproducible benchmark framework requested in #32:

- 72-record curated dataset (42 attacks + 30 benign) across prompt_injection / jailbreak / data_exfiltration / evasion + safe baseline, with per-record source attribution.
- Pluggable adapters: Aigis (in-process), LLM Guard / Guardrails AI / NeMo Guardrails (HTTP sidecars). Each adapter advertises its "default" config tier so the report row is honest about what was measured.
- `docker-compose.yml` for the three external services.
- Driver + reporter: `make bench`, `make bench-aigis`, CSV + markdown output with per-category TPR, FPR on safe baseline, p50/p95 latency, error counts.
- CI workflow with a ±2 pp regression guard on the Aigis row (`benchmarks/oss_comparison/baseline.json` + `scripts/regression_guard.py`).
- 7 smoke tests in tests/test_oss_comparison_bench.py.
- docs/benchmarks/oss-comparison.md documenting methodology, the v0 Aigis baseline (14.3% detection rate / 0% FPR / p50 0.49 ms on default policy), acknowledged gaps (data_exfiltration 0%, evasion 0% on default; surfaced, not hidden), and limitations.

Remaining acceptance-criteria work tracked in the docs page's "Open work" section: SHA256-pinning the three docker images, populating their live rows, wiring `fetch_extended.py` to actually download PromptBench.

Tests: 1529 passed, 0 failed (uv run pytest --tb=no -q).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: killertcell428 <killertcell428@gmail.com>

github-advanced-security AI found potential problems May 17, 2026


Comment thread benchmarks/oss_comparison/adapters/base.py Fixed

killertcell428 force-pushed the claude/wizardly-brahmagupta-5d9697 branch from d537524 to bc0c626 Compare May 17, 2026 06:48

killertcell428 added 2 commits May 18, 2026 03:38

Merge branch 'master' into claude/wizardly-brahmagupta-5d9697

79e6099

Merge branch 'master' into claude/wizardly-brahmagupta-5d9697

083bc8c

killertcell428 merged commit 638984f into master May 18, 2026
12 checks passed

killertcell428 deleted the claude/wizardly-brahmagupta-5d9697 branch May 18, 2026 02:30

killertcell428 merged 3 commits into master from claude/wizardly-brahmagupta-5d9697
