Open
Labels
enhancement: New feature or request
Description
Goal: Enable the evaluator to act as an autonomous auditor that investigates the filesystem to verify implementation success, rather than relying solely on the task's final output or conversation history.
Problem:
Standard LLM-as-a-Judge evaluations suffer from "context rot" and compaction when codebases are large. A judge limited to context alone cannot verify deep implementation details, file persistence, or side effects that weren't explicitly piped into the final evaluation prompt.
Proposal:
Introduce an Agentic Evaluator type to AgentV. Instead of a single LLM call, the evaluation step initiates a tool-capable agent loop provided with:
- Auditing Primitives: Access to ls, grep, and read_file.
- Investigation Logic: The ability to navigate the workspace to find "exhibits" (actual code on disk) that prove a requirement from the spec has been satisfied.
- Cross-Reference: The ability to verify the "Persistence" of changes by reading the ground truth of the filesystem vs. the agent's reported actions.
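The three primitives and the cross-reference step could be sketched roughly as follows. This is a minimal illustration, not AgentV's actual API: the `WorkspaceAuditor` class, its method signatures, and the `max_bytes` cap are all assumptions made for the example.

```python
import re
from pathlib import Path


class WorkspaceAuditor:
    """Hypothetical read-only auditing primitives confined to one workspace root."""

    def __init__(self, root: str) -> None:
        self.root = Path(root).resolve()

    def _resolve(self, rel_path: str) -> Path:
        # Reject paths that escape the workspace (e.g. "../secrets").
        target = (self.root / rel_path).resolve()
        if target != self.root and self.root not in target.parents:
            raise ValueError(f"path escapes workspace: {rel_path}")
        return target

    def ls(self, rel_path: str = ".") -> list[str]:
        # Sorted entry names so the agent sees a stable listing.
        return sorted(p.name for p in self._resolve(rel_path).iterdir())

    def read_file(self, rel_path: str, max_bytes: int = 20_000) -> str:
        # Cap the read so one large file cannot flood the judge's context.
        data = self._resolve(rel_path).read_bytes()[:max_bytes]
        return data.decode("utf-8", errors="replace")

    def grep(self, pattern: str, rel_path: str = ".") -> list[tuple[str, int, str]]:
        # Return (relative path, line number, stripped line) for each matching line.
        regex = re.compile(pattern)
        base = self._resolve(rel_path)
        if base.is_file():
            files = [base]
        else:
            files = sorted(p for p in base.rglob("*") if p.is_file())
        hits = []
        for f in files:
            try:
                text = f.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for lineno, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    rel = f.relative_to(self.root).as_posix()
                    hits.append((rel, lineno, line.strip()))
        return hits

    def verify_persistence(self, reported_paths: list[str]) -> dict[str, bool]:
        # Cross-reference: does each file the agent claims to have written exist on disk?
        return {p: self._resolve(p).is_file() for p in reported_paths}
```

An evaluator loop would expose `ls`, `grep`, and `read_file` as tools and call `verify_persistence` with the paths extracted from the agent's reported actions. Keeping every primitive read-only and rooted at the workspace is one way to make sure the audit itself cannot alter the evidence it inspects.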
Benefits:
- Scale: Evaluates multi-file changes without loading the entire codebase into a single context window.
- Fidelity: Eliminates hallucinations about whether a file was actually saved and whether it compiles in its final state.
- Deterministic Validation: Integrates bash-based checks (exit codes) as the primary verification layer before performing the semantic audit.
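The "deterministic first, semantic second" ordering might look like the sketch below. All names here (`CheckResult`, `run_checks`, `evaluate`, the `semantic_audit` callback, and the 0.5 pass threshold) are hypothetical and chosen only for illustration. The sketch assumes each check is a shell command whose exit code 0 means success.

```python
import shutil
import subprocess
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    command: str
    exit_code: int
    passed: bool


def run_checks(commands: list[str], cwd: str, timeout: int = 120) -> list[CheckResult]:
    # Prefer bash; fall back to sh if bash is not installed.
    shell = shutil.which("bash") or "sh"
    results = []
    for cmd in commands:
        try:
            proc = subprocess.run(
                [shell, "-c", cmd], cwd=cwd, capture_output=True, timeout=timeout
            )
            code = proc.returncode
        except subprocess.TimeoutExpired:
            code = 124  # conventional "timed out" exit code
        results.append(CheckResult(cmd, code, code == 0))
    return results


def evaluate(
    commands: list[str],
    cwd: str,
    semantic_audit: Callable[[str], float],
    threshold: float = 0.5,
) -> dict:
    # Stage 1: deterministic checks. Any non-zero exit code fails the run
    # and the (expensive) semantic audit is never invoked.
    checks = run_checks(commands, cwd)
    if not all(c.passed for c in checks):
        return {"verdict": "fail", "stage": "deterministic", "checks": checks, "score": 0.0}
    # Stage 2: semantic audit, e.g. the agentic evaluator loop over the workspace.
    score = semantic_audit(cwd)
    verdict = "pass" if score >= threshold else "fail"
    return {"verdict": verdict, "stage": "semantic", "checks": checks, "score": score}
```

For example, `evaluate(["python -m compileall -q ."], workspace, audit_fn)` would only hand the workspace to `audit_fn` once the code compiles. Short-circuiting on exit codes keeps the cheap, reproducible signal authoritative and reserves the LLM-driven audit for cases where it can add information.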