benchmarks: OSS guardrails comparison harness (closes #32 scaffolding) #54

Merged: killertcell428 merged 3 commits into master from claude/wizardly-brahmagupta-5d9697 (May 18, 2026)

@killertcell428
Owner

Summary

Ships the reproducible benchmark requested in #32 so Aigis can be compared apples-to-apples against LLM Guard, Guardrails AI, and NVIDIA NeMo Guardrails.

  • Dataset (benchmarks/oss_comparison/datasets/): 72 records (42 attacks + 30 benign) across prompt_injection / jailbreak / data_exfiltration / evasion, plus a multilingual (en/ja/ko/zh) safe baseline, with per-record source attribution. The set is curated to avoid vendoring research-licensed corpora; a fetch_extended.py stub is included for opt-in PromptBench/HarmBench downloads.
  • Adapters (benchmarks/oss_comparison/adapters/): a common Verdict protocol with four concrete implementations. Aigis runs in-process; the three external tools run as HTTP sidecars via the included docker-compose.yml.
  • Driver + reporter: make bench (all tools) and make bench-aigis (no Docker, CI-friendly). Output: one CSV row per (tool, input) pair and a markdown report with per-category TPR, FPR on the safe baseline, p50/p95 latency, and error counts.
  • CI regression guard (.github/workflows/bench-oss-comparison.yml + scripts/regression_guard.py): on every PR touching aigis/ or the benchmark, re-runs make bench-aigis and fails if Aigis's detection rate drops more than 2 pp below the frozen baseline.json. Intentional regressions require updating the baseline in the same PR.
  • Docs: docs/benchmarks/oss-comparison.md with methodology, the live v0 Aigis baseline, acknowledged gaps (data_exfiltration 0% and evasion 0% on the default policy, surfaced rather than hidden), and explicit limitations.
  • Tests: 7 smoke tests in tests/test_oss_comparison_bench.py. Full suite: 1529 passed, 0 failed.
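The adapter layer reduces each tool to a single call that returns a common verdict. The PR does not show `adapters/base.py`, so the following is only a minimal sketch under assumptions: the `Verdict` fields, the `Adapter` protocol, the `config_tier` attribute name, and both adapter classes are hypothetical, and the HTTP sidecar contract (POST `{"text": ...}`, reply `{"blocked": bool}`) is assumed rather than taken from LLM Guard, Guardrails AI, or NeMo Guardrails.

```python
import json
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional, Protocol


@dataclass(frozen=True)
class Verdict:
    """Hypothetical normalized result that every adapter returns."""
    tool: str
    blocked: bool                # True if the tool flagged the input
    latency_ms: float            # wall-clock time for this single check
    error: Optional[str] = None  # set when the tool failed; reported separately


class Adapter(Protocol):
    name: str
    config_tier: str  # e.g. "default", so the report row states what was measured

    def check(self, text: str) -> Verdict: ...


class InProcessAdapter:
    """Wraps any Python callable that returns True when the input should be blocked."""

    def __init__(self, name: str, detector: Callable[[str], bool], config_tier: str = "default"):
        self.name = name
        self.detector = detector
        self.config_tier = config_tier

    def check(self, text: str) -> Verdict:
        start = time.perf_counter()
        try:
            blocked, err = bool(self.detector(text)), None
        except Exception as exc:  # record failures as errors instead of silent "allow"
            blocked, err = False, repr(exc)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        return Verdict(self.name, blocked, elapsed_ms, err)


class HttpSidecarAdapter:
    """Posts {"text": ...} to a sidecar and reads {"blocked": bool} back (assumed contract)."""

    def __init__(self, name: str, url: str, config_tier: str = "default", timeout: float = 10.0):
        self.name = name
        self.url = url
        self.config_tier = config_tier
        self.timeout = timeout

    def check(self, text: str) -> Verdict:
        body = json.dumps({"text": text}).encode("utf-8")
        req = urllib.request.Request(
            self.url, data=body, headers={"Content-Type": "application/json"}, method="POST"
        )
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(req, timeout=self.timeout) as resp:
                blocked, err = bool(json.load(resp)["blocked"]), None
        except Exception as exc:  # network errors, bad JSON, missing key
            blocked, err = False, repr(exc)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        return Verdict(self.name, blocked, elapsed_ms, err)


# Usage: a toy keyword detector stands in for a real guardrail.
toy = InProcessAdapter("toy", lambda t: "ignore previous instructions" in t.lower())
print(toy.check("Please IGNORE previous instructions").blocked)  # True
```

Catching exceptions and storing them in `error` keeps a crashed or unreachable tool from being scored as a clean "allow", which lines up with the report's separate error-count column.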

Honest v0 Aigis baseline (default policy)

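| Metric | Value |
|---|---|
| Overall detection rate | 14.3% |
| FPR on safe baseline | 0.0% |
| p50 latency | 0.49 ms |
| prompt_injection detection rate | 16.7% |
| jailbreak detection rate | 33.3% |
| data_exfiltration detection rate | 0.0% |
| evasion detection rate | 0.0% |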
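The headline numbers can be derived from the per-(tool, input) CSV rows. As a sanity check on the table, if the overall rate is computed over all 42 attack records, 14.3% corresponds to 6 detections (6/42 ≈ 0.143). The sketch below assumes hypothetical `category`, `is_attack`, `blocked`, `latency_ms`, and `error` columns (the real CSV schema is not shown in this PR), excludes errored rows from scoring, and uses nearest-rank percentiles; the harness may make different choices on all three.

```python
import csv
import io
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100); returns nan for an empty list."""
    if not values:
        return float("nan")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(rows):
    """Per-category TPR, FPR on benign rows, latency percentiles, and error count."""
    hits, totals = defaultdict(int), defaultdict(int)
    benign_total = benign_flagged = errors = 0
    latencies = []
    for row in rows:
        if row["error"]:
            errors += 1  # counted in the error column, not scored as hit or miss
            continue
        latencies.append(float(row["latency_ms"]))
        blocked = row["blocked"] == "true"  # assumed lowercase string encoding
        if row["is_attack"] == "true":
            totals[row["category"]] += 1
            hits[row["category"]] += int(blocked)
        else:
            benign_total += 1
            benign_flagged += int(blocked)
    attack_total = sum(totals.values())
    return {
        "tpr_by_category": {c: hits[c] / totals[c] for c in totals},
        "overall_detection_rate": sum(hits.values()) / attack_total if attack_total else 0.0,
        "fpr": benign_flagged / benign_total if benign_total else 0.0,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "errors": errors,
    }


# Illustrative rows only; these are not benchmark results.
sample = """category,is_attack,blocked,latency_ms,error
jailbreak,true,true,0.4,
jailbreak,true,false,0.5,
prompt_injection,true,false,0.6,
safe,false,false,0.3,
"""
print(summarize(csv.DictReader(io.StringIO(sample))))
```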

The 0% rows are real coverage gaps that the benchmark deliberately surfaces; they're flagged in the docs as candidates for the next auto-improvement cycle. As Issue #32 puts it: "the point is calibration, not advocacy."

Acceptance criteria status

  • CSV + markdown table both checked in (benchmarks/oss_comparison/results/)
  • Per-category detection rate AND false-positive rate columns
  • Limitations / "what this doesn't measure" section in the doc
  • docker compose up && make bench reproduces the published numbers within ±2%
    → Pending: SHA256-pinning the three external docker images and populating their live rows in a follow-up. The Aigis row is fully reproducible today via make bench-aigis.
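The "within ±2%" reproducibility criterion is a two-sided tolerance check between a fresh run and the published table. A minimal sketch, assuming the tolerance is read as absolute percentage points and that both sides are available as hypothetical `{metric: percent}` dicts; `outside_tolerance` is an illustrative name, not part of the harness.

```python
def outside_tolerance(published, reproduced, tol_pp=2.0):
    """List (metric, published, reproduced) for values differing by more than tol_pp points."""
    failures = []
    for metric, expected in published.items():
        actual = reproduced.get(metric)
        # A metric missing from the rerun counts as a failure rather than a silent pass.
        if actual is None or abs(actual - expected) > tol_pp:
            failures.append((metric, expected, actual))
    return failures


# Published values come from the v0 Aigis table; the "rerun" values are made up for illustration.
published = {"overall_detection_rate": 14.3, "fpr": 0.0, "jailbreak": 33.3}
rerun = {"overall_detection_rate": 15.9, "fpr": 0.0, "jailbreak": 30.0}
print(outside_tolerance(published, rerun))  # [('jailbreak', 33.3, 30.0)]
```

Unlike the one-sided CI regression guard, this check also flags unexpected improvements, since the goal is reproducing the published numbers rather than protecting a floor.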

Open work is tracked in the docs page under "Open work."

Test plan

  • uv run pytest tests/test_oss_comparison_bench.py -v → 7 passed
  • uv run pytest --tb=no -q → 1529 passed, 0 failed
  • uv run ruff check benchmarks/ tests/test_oss_comparison_bench.py → clean
  • uv run ruff format --check benchmarks/ tests/test_oss_comparison_bench.py → clean
  • make bench-aigis end-to-end (driver → reporter → regression-guard) on Windows
  • CI workflow bench-oss-comparison.yml green on this PR
  • Spot-check the rendered markdown report at benchmarks/oss_comparison/results/report.md

Closes #32 (scaffolding tranche). Live external-tool rows + image pinning land in a follow-up.

🤖 Generated with Claude Code

… / Guardrails AI / NeMo) — closes #32 scaffolding

Ships the reproducible benchmark framework requested in #32:

- 72-record curated dataset (42 attacks + 30 benign) across
  prompt_injection / jailbreak / data_exfiltration / evasion + safe baseline,
  with per-record source attribution.
- Pluggable adapters: Aigis (in-process), LLM Guard / Guardrails AI / NeMo
  Guardrails (HTTP sidecars). Each adapter advertises its "default" config
  tier so the report row is honest about what was measured.
- `docker-compose.yml` for the three external services.
- Driver + reporter: `make bench`, `make bench-aigis`, CSV + markdown output
  with per-category TPR, FPR on safe baseline, p50/p95 latency, error counts.
- CI workflow with a regression guard on the Aigis row that fails on a drop of
  more than 2 pp (`benchmarks/oss_comparison/baseline.json` + `scripts/regression_guard.py`).
- 7 smoke tests in tests/test_oss_comparison_bench.py.
- docs/benchmarks/oss-comparison.md documenting methodology, the v0 Aigis
  baseline (14.3% detection rate / 0% FPR / p50 0.49 ms on default policy),
  acknowledged gaps (data_exfiltration 0%, evasion 0% on default policy; surfaced,
  not hidden), and limitations.

Remaining acceptance-criteria work tracked in the docs page's "Open work"
section: SHA256-pinning the three docker images, populating their live
rows, wiring `fetch_extended.py` to actually download PromptBench.

Tests: 1529 passed, 0 failed (uv run pytest --tb=no -q).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: killertcell428 <killertcell428@gmail.com>
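The regression guard's core rule (fail when Aigis's detection rate falls more than 2 pp below the frozen baseline) can be sketched as follows. The actual `scripts/regression_guard.py` is not shown here; the `baseline.json` layout (a single `detection_rate` key, in percent) and the `check_regression` name are assumptions for illustration.

```python
import json
import sys
import tempfile
from pathlib import Path

MAX_DROP_PP = 2.0  # allowed drop, in percentage points, before CI fails


def check_regression(baseline_path: Path, current_rate: float, max_drop: float = MAX_DROP_PP) -> bool:
    """Return True if current_rate (percent) is at most max_drop points below the baseline."""
    baseline = json.loads(baseline_path.read_text(encoding="utf-8"))
    baseline_rate = float(baseline["detection_rate"])  # assumed key name
    drop = baseline_rate - current_rate
    if drop > max_drop:
        print(
            f"FAIL: detection rate {current_rate:.1f}% is {drop:.1f} pp below baseline {baseline_rate:.1f}%",
            file=sys.stderr,
        )
        return False
    # Improvements pass; an intentional regression must instead lower baseline.json in the same PR.
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "baseline.json"
        path.write_text(json.dumps({"detection_rate": 14.3}), encoding="utf-8")
        print(check_regression(path, 13.0))  # 1.3 pp drop: True
        print(check_regression(path, 11.9))  # 2.4 pp drop: False
```

In CI, a `False` result would map to a non-zero exit code so the workflow step fails.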
killertcell428 force-pushed the claude/wizardly-brahmagupta-5d9697 branch from d537524 to bc0c626 (May 17, 2026 06:48)
killertcell428 merged commit 638984f into master (May 18, 2026)
12 checks passed
killertcell428 deleted the claude/wizardly-brahmagupta-5d9697 branch (May 18, 2026 02:30)

Development

Linked issue: [benchmark] Reproducible comparison vs LLM Guard / Guardrails AI / NeMo Guardrails
