Feat: Support for Agent-as-a-Judge with Filesystem Primitives

**Goal:** Enable the evaluator to act as an autonomous auditor that investigates the filesystem to verify implementation success, rather than relying solely on the task's final output or conversation history.

**Problem:**
Standard LLM-as-a-Judge evaluations suffer from "context rot" and context compaction when codebases are large. A judge that sees only its prompt context cannot verify deep implementation details, file persistence, or side effects that were not explicitly piped into the final evaluation prompt.

**Proposal:**
Introduce an **Agentic Evaluator** type to AgentV. Instead of making a single LLM call, the evaluation step starts a tool-capable agent loop that is given:
1. **Auditing Primitives:** Access to `ls`, `grep`, and `read_file`.
2. **Investigation Logic:** The ability to navigate the workspace to find "exhibits" (actual code on disk) that prove a requirement from the spec has been satisfied.
3. **Cross-Reference:** The ability to verify that changes actually persisted by comparing the ground truth of the filesystem against the agent's reported actions.
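
The three capabilities above could be sketched roughly as follows. This is a minimal illustration, not AgentV's API: the `AuditWorkspace` class, its method signatures, and the `unverified_claims` helper are all hypothetical names chosen for this example. It assumes the primitives are read-only and confined to a single workspace root.

```python
import re
from pathlib import Path


class AuditWorkspace:
    """Hypothetical read-only primitives scoped to one workspace root."""

    def __init__(self, root: str) -> None:
        self.root = Path(root).resolve()

    def _resolve(self, rel_path: str) -> Path:
        # Refuse paths that escape the workspace (e.g. "../secrets").
        target = (self.root / rel_path).resolve()
        if target != self.root and self.root not in target.parents:
            raise ValueError(f"path escapes workspace: {rel_path}")
        return target

    def ls(self, rel_path: str = ".") -> list[str]:
        """List a directory, sorted; directories get a trailing slash."""
        directory = self._resolve(rel_path)
        return sorted(p.name + ("/" if p.is_dir() else "") for p in directory.iterdir())

    def read_file(self, rel_path: str, max_bytes: int = 20_000) -> str:
        """Return at most max_bytes so one read cannot flood the judge's context."""
        data = self._resolve(rel_path).read_bytes()[:max_bytes]
        return data.decode("utf-8", errors="replace")

    def grep(self, pattern: str, rel_path: str = ".") -> list[tuple[str, int, str]]:
        """Return (relative path, line number, line) for each match under a directory."""
        regex = re.compile(pattern)
        hits = []
        for path in sorted(self._resolve(rel_path).rglob("*")):
            if not path.is_file():
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for number, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    hits.append((path.relative_to(self.root).as_posix(), number, line))
        return hits


def unverified_claims(ws: AuditWorkspace, reported_files: list[str]) -> list[str]:
    """Cross-reference: return reported paths that are not files on disk."""
    return [p for p in reported_files if not ws._resolve(p).is_file()]
```

In an agent loop, each method would be exposed to the judge as a tool. Capping `read_file` output and returning `grep` hits as small tuples keeps individual tool results short, which fits the goal of not loading the whole codebase into one context window.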

**Benefits:**
* **Scale:** Evaluates multi-file changes without loading the entire codebase into a single context window.
* **Fidelity:** Replaces the judge's assumptions about whether a file was actually saved, or whether it compiles in its final state, with direct checks against what is on disk.
* **Deterministic Validation:** Integrates bash-based checks (exit codes) as the primary verification layer, run before the semantic audit.
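
The "deterministic checks first, semantic audit second" ordering might look like the sketch below. The `Verdict` type, the `evaluate` function, and the `semantic_audit` callback are hypothetical names for illustration. The sketch assumes the check is an arbitrary command whose exit code signals success, and that the agentic audit is supplied as a callable.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    stage: str   # "deterministic" or "semantic": the layer that decided
    detail: str


def evaluate(
    check_cmd: list[str],
    workspace: str,
    semantic_audit: Callable[[str], tuple[bool, str]],
    timeout_s: float = 60.0,
) -> Verdict:
    """Run the exit-code check first; call the semantic audit only if it passes."""
    try:
        result = subprocess.run(
            check_cmd, cwd=workspace, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return Verdict(False, "deterministic", "check timed out")
    if result.returncode != 0:
        # A failing build or test run is conclusive, so no judge tokens are spent.
        detail = f"exit code {result.returncode}: {result.stderr.strip()}"
        return Verdict(False, "deterministic", detail)
    ok, reason = semantic_audit(workspace)
    return Verdict(ok, "semantic", reason)
```

A bash-based check would be passed as an argument list such as `["bash", "-c", "<test command>"]`. Gating on the exit code means the more expensive and less reproducible agent loop runs only on workspaces that already pass the mechanical checks.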

Feat: Support for Agent-as-a-Judge with Filesystem Primitives #140
