Showcase: strengthen export-screening CI gating (multi-sample runs + stability-aware thresholds) #118

@christso

Goal

Improve the existing export-screening showcase to demonstrate more robust CI gating without requiring AgentV core changes.

This keeps the current PR scope contained while providing an actionable pattern users can adopt immediately.

Proposal

Extend examples/showcase/export-screening/evals/ci_check.ts with optional multi-sample evaluation and stability-aware gating.

New wrapper options (suggested)

  • --samples N: run the eval N times, each as a fresh eval invocation, and aggregate metrics across runs.
  • --gate min|mean|p05|p10 (or similar): choose how conservatively the per-run scores are reduced to the value compared against the threshold.
  • --min-run-f1 X: require every run (or the p05 run) to reach an F1 of at least X.
  • Optional: --max-stddev X or --max-variance X to cap the run-to-run spread for the checked class.
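
One way the suggested options could be parsed is sketched below. The flag names come from the list above; the `GateOptions` shape, the `parseGateOptions` helper, and the defaults (`samples: 1`, `gate: "mean"`) are assumptions for illustration, not the existing ci_check.ts code. Flags the sketch does not know about (such as --eval or --threshold) are skipped so that existing parsing can keep handling them.

```typescript
// Hypothetical option shape; names and defaults are assumptions.
type GateKind = "min" | "mean" | "p05" | "p10";

interface GateOptions {
  samples: number;     // --samples N (1 keeps the single-run behavior)
  gate: GateKind;      // --gate min|mean|p05|p10
  minRunF1?: number;   // --min-run-f1 X
  maxStddev?: number;  // --max-stddev X
}

const GATES: readonly GateKind[] = ["min", "mean", "p05", "p10"];
const VALUE_FLAGS = new Set(["--samples", "--gate", "--min-run-f1", "--max-stddev"]);

// Parse a non-negative finite number or throw with the offending flag name.
function parseNumber(flag: string, raw: string): number {
  const n = Number(raw);
  if (raw.trim() === "" || !Number.isFinite(n) || n < 0) {
    throw new Error(`${flag} expects a non-negative number, got "${raw}"`);
  }
  return n;
}

// In a real wrapper argv would typically be process.argv.slice(2).
function parseGateOptions(argv: string[]): GateOptions {
  const opts: GateOptions = { samples: 1, gate: "mean" };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (!VALUE_FLAGS.has(flag)) continue; // leave other flags to existing parsing
    const raw = argv[++i];
    if (raw === undefined) throw new Error(`${flag} requires a value`);
    if (flag === "--samples") {
      const n = parseNumber(flag, raw);
      if (!Number.isInteger(n) || n < 1) {
        throw new Error("--samples expects an integer >= 1");
      }
      opts.samples = n;
    } else if (flag === "--gate") {
      const gate = GATES.find((g) => g === raw);
      if (gate === undefined) {
        throw new Error(`--gate expects one of ${GATES.join("|")}`);
      }
      opts.gate = gate;
    } else if (flag === "--min-run-f1") {
      opts.minRunF1 = parseNumber(flag, raw);
    } else {
      opts.maxStddev = parseNumber(flag, raw);
    }
  }
  return opts;
}
```

Defaulting `samples` to 1 is what would keep the single-run path untouched when none of the new flags are passed.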

Behavior

  • Default behavior remains unchanged (single-run threshold gate), so existing docs continue to work.
  • When --samples is provided:
    • the wrapper runs bun agentv eval repeatedly (or expects multiple results files)
    • it aggregates confusion matrices / per-class metrics across the runs
    • it emits a stability-aware CI result JSON and exits non-zero on failure.
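
The aggregation step could look roughly like the sketch below. It assumes each run yields true-positive, false-positive, and false-negative counts for the checked class; the `ClassCounts` type and all function names are hypothetical. Percentiles use the nearest-rank method and the spread is a population standard deviation, both of which are choices made for this sketch rather than anything the proposal fixes.

```typescript
type GateKind = "min" | "mean" | "p05" | "p10";

// Hypothetical per-class counts taken from one run's confusion matrix.
interface ClassCounts { tp: number; fp: number; fn: number; }

// F1 = 2TP / (2TP + FP + FN); returns 0 when the class never appears (a convention).
function f1(c: ClassCounts): number {
  const denom = 2 * c.tp + c.fp + c.fn;
  return denom === 0 ? 0 : (2 * c.tp) / denom;
}

// Sum counts across runs so a pooled F1 can be reported next to per-run values.
function poolCounts(runs: ClassCounts[]): ClassCounts {
  return runs.reduce(
    (acc, r) => ({ tp: acc.tp + r.tp, fp: acc.fp + r.fp, fn: acc.fn + r.fn }),
    { tp: 0, fp: 0, fn: 0 },
  );
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Population standard deviation of the per-run scores.
function stddev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
}

// Nearest-rank percentile: the value at rank ceil(p * n / 100) in sorted order.
function percentile(xs: number[], p: number): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Reduce per-run F1 scores to the single value compared against the threshold.
function gateValue(runF1: number[], gate: GateKind): number {
  if (runF1.length === 0) throw new Error("at least one run is required");
  switch (gate) {
    case "min": return Math.min(...runF1);
    case "mean": return mean(runF1);
    case "p05": return percentile(runF1, 5);
    case "p10": return percentile(runF1, 10);
  }
}
```

With nearest-rank percentiles, p05 and p10 both collapse to the minimum until there are more than 10 runs, so for small --samples values the min gate and the percentile gates behave identically.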

Why this belongs in a wrapper/example

Acceptance Criteria

  • bun run ./evals/ci_check.ts --eval ./evals/dataset.yaml --samples 5 --threshold 0.95 --check-class High works.
  • Output JSON includes per-run metrics + aggregate metrics + selected gating rule.
  • Wrapper exits 0 on pass and 1 on failure, determined by the selected gate applied to the collected results.
  • README updated with the new options and guidance on choosing --samples and gate types.
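
To make the output and exit-code criteria concrete, here is a minimal sketch of a result object that carries per-run metrics, aggregate metrics, and the selected gating rule. The `CiResult` field names and the `buildCiResult` helper are hypothetical; the issue does not specify the JSON schema. The gate value is taken as an input so the sketch stays independent of how it is computed.

```typescript
// Hypothetical result shape; field names are assumptions, not a defined schema.
interface CiResult {
  checkClass: string;
  threshold: number;
  gate: string;                             // selected gating rule
  runs: { run: number; f1: number }[];      // per-run metrics
  aggregate: { min: number; mean: number; stddev: number; gateValue: number };
  passed: boolean;
  failures: string[];                       // human-readable reasons, empty on pass
}

interface CiInput {
  checkClass: string;
  threshold: number;
  gate: string;
  runF1: number[];      // F1 of the checked class, one entry per run
  gateValue: number;    // per-run scores already reduced by the selected gate
  minRunF1?: number;
  maxStddev?: number;
}

function buildCiResult(input: CiInput): CiResult {
  const { runF1 } = input;
  if (runF1.length === 0) throw new Error("at least one run is required");
  const mean = runF1.reduce((a, b) => a + b, 0) / runF1.length;
  const variance = runF1.reduce((a, x) => a + (x - mean) ** 2, 0) / runF1.length;
  const stddev = Math.sqrt(variance); // population standard deviation

  const failures: string[] = [];
  if (input.gateValue < input.threshold) {
    failures.push(`${input.gate} F1 ${input.gateValue} is below threshold ${input.threshold}`);
  }
  const minRunF1 = input.minRunF1;
  if (minRunF1 !== undefined) {
    runF1.forEach((score, i) => {
      if (score < minRunF1) failures.push(`run ${i + 1} F1 ${score} is below ${minRunF1}`);
    });
  }
  if (input.maxStddev !== undefined && stddev > input.maxStddev) {
    failures.push(`stddev ${stddev} exceeds ${input.maxStddev}`);
  }

  return {
    checkClass: input.checkClass,
    threshold: input.threshold,
    gate: input.gate,
    runs: runF1.map((score, i) => ({ run: i + 1, f1: score })),
    aggregate: { min: Math.min(...runF1), mean, stddev, gateValue: input.gateValue },
    passed: failures.length === 0,
    failures,
  };
}

// Same result in, same exit code out; a wrapper would pass this to process.exit.
function exitCodeFor(result: CiResult): 0 | 1 {
  return result.passed ? 0 : 1;
}
```

A wrapper following this sketch would print `JSON.stringify(result, null, 2)` and then exit with `exitCodeFor(result)`, so the exit status depends only on the recorded metrics and the chosen gate.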

Labels: enhancement
