Feat: Support for Agent-as-a-Judge with Filesystem Primitives #140

@christso


Goal: Enable the evaluator to act as an autonomous auditor that investigates the filesystem to verify implementation success, rather than relying solely on the task's final output or conversation history.

Problem:
Standard LLM-as-a-Judge evaluations suffer from "context rot" and compaction when codebases are large: as the context grows, details are recalled less reliably or are summarized away. A judge limited to its context window cannot verify deep implementation details, file persistence, or side effects that were not explicitly included in the final evaluation prompt.

Proposal:
Introduce an Agentic Evaluator type to AgentV. Instead of a single LLM call, the evaluation step initiates a tool-capable agent loop provided with:

  1. Auditing Primitives: Access to ls, grep, and read_file.
  2. Investigation Logic: The ability to navigate the workspace to find "exhibits" (actual code on disk) that prove a requirement from the spec has been satisfied.
  3. Cross-Reference: The ability to verify the "Persistence" of changes by comparing the ground truth of the filesystem against the agent's reported actions.
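
To make the three items above concrete, here is a minimal sketch of what the auditing primitives and the persistence cross-check could look like. It is not AgentV's API: the `Workspace` class, its method signatures, the `max_bytes` cap, and `check_persistence` are all hypothetical names chosen for illustration. The sketch assumes the primitives are read-only and confined to a single workspace root.

```python
import re
from pathlib import Path


class Workspace:
    """Hypothetical read-only auditing primitives confined to one workspace root."""

    def __init__(self, root):
        self.root = Path(root).resolve()

    def _resolve(self, rel_path):
        # Reject paths that escape the workspace, e.g. "../../etc/passwd".
        target = (self.root / rel_path).resolve()
        if target != self.root and self.root not in target.parents:
            raise PermissionError(f"path escapes workspace: {rel_path}")
        return target

    def ls(self, rel_path="."):
        """List entry names in a directory, sorted for stable output."""
        return sorted(entry.name for entry in self._resolve(rel_path).iterdir())

    def read_file(self, rel_path, max_bytes=20_000):
        """Read a file, capped so one large file cannot flood the judge's context."""
        with open(self._resolve(rel_path), "rb") as fh:
            return fh.read(max_bytes).decode("utf-8", errors="replace")

    def grep(self, pattern, rel_path="."):
        """Return (relative path, line number, line) for every regex match."""
        regex = re.compile(pattern)
        base = self._resolve(rel_path)
        hits = []
        for path in sorted(p for p in base.rglob("*") if p.is_file()):
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for lineno, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    rel = path.relative_to(self.root).as_posix()
                    hits.append((rel, lineno, line.strip()))
        return hits


def check_persistence(workspace, reported_files):
    """Compare the agent's reported writes with what is actually on disk."""
    result = {}
    for rel_path in reported_files:
        try:
            result[rel_path] = workspace._resolve(rel_path).is_file()
        except PermissionError:
            result[rel_path] = False  # outside the workspace counts as not persisted
    return result
```

In an agent loop, each method would be exposed as a tool, and the `grep` hits would serve as the "exhibits" the judge cites. A call such as `check_persistence(ws, ["src/app.py"])` returns a mapping from each reported path to whether it exists on disk.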

Benefits:

  • Scale: Evaluates multi-file changes without loading the entire codebase into a single context window.
  • Fidelity: Grounds verdicts in the files on disk, so the judge does not have to guess whether a file was actually saved or whether it compiles in its final state.
  • Deterministic Validation: Integrates bash-based checks (exit codes) as the primary verification layer before performing the semantic audit.
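
The "deterministic first, semantic second" ordering could be sketched as follows. This is an illustration under stated assumptions, not a proposed implementation: `run_check`, `evaluate`, the result dictionary shape, and the `semantic_audit` callback are hypothetical, and commands are passed as argument lists to `subprocess.run` rather than through a bash shell to keep the example portable.

```python
import subprocess
import sys


def run_check(command, cwd=".", timeout=60):
    """Run one deterministic check; exit code 0 means the check passed."""
    try:
        proc = subprocess.run(
            command, cwd=cwd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return {"command": command, "passed": False, "exit_code": None,
                "output": "timed out"}
    except OSError as exc:
        # e.g. the executable does not exist
        return {"command": command, "passed": False, "exit_code": None,
                "output": str(exc)}
    return {
        "command": command,
        "passed": proc.returncode == 0,
        "exit_code": proc.returncode,
        # Keep only the tail of the output so the report stays small.
        "output": (proc.stdout + proc.stderr)[-2000:],
    }


def evaluate(checks, semantic_audit, cwd="."):
    """Gate the (expensive) semantic audit behind the deterministic checks."""
    results = [run_check(command, cwd=cwd) for command in checks]
    if not all(r["passed"] for r in results):
        # Fail fast: the agentic judge is never invoked.
        return {"verdict": "fail", "stage": "deterministic", "checks": results}
    # semantic_audit is a hypothetical callback that runs the agent loop
    # and returns a verdict string such as "pass" or "fail".
    return {"verdict": semantic_audit(), "stage": "semantic", "checks": results}


if __name__ == "__main__":
    ok = [sys.executable, "-c", "raise SystemExit(0)"]
    print(evaluate([ok], semantic_audit=lambda: "pass")["verdict"])
```

Running the checks first means a failing build or test suite short-circuits the evaluation, so the agent loop only spends tokens on workspaces that already pass the objective criteria.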

Labels: enhancement
