Summary
Research and design a production-grade benchmark framework for evaluating Agent Skills in Claude Code. The current pytest harness works, but meaningful evaluation needs LLM-as-judge scoring.
Problem
- No off-the-shelf eval framework supports CLI-based AI agents (Inspect AI, DeepEval, Promptfoo all assume direct API access)
- Current string-matching assertions are brittle and don't capture semantic correctness (example after this list)
- Need reproducible, trackable benchmarks for skill quality over time
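To illustrate the brittleness: a current-style check reduces semantic correctness to substring matching. The helper name and question below are hypothetical stand-ins, but the failure mode is the point; a correct paraphrase fails, while a shallow keyword echo passes.

```python
# Hypothetical current-style test; run_claude is a stand-in for the harness helper.
def test_skill_explains_rate_limit_handling():
    response = run_claude("How should this skill handle API rate limits?")
    # Brittle: "retries with exponential back-off" fails this assertion even
    # though it is semantically correct, while a keyword echo with no real
    # skill knowledge passes.
    assert "exponential backoff" in response.lower()
```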
Research Areas
1. LLM-as-Judge Implementation
2. Benchmark Infrastructure
3. Evaluation Dimensions
4. Framework Options to Evaluate
| Framework | Approach | Notes |
|---|---|---|
| Custom pytest + LLM judge | Wrap CLI, add semantic scoring | Current direction |
| Inspect AI custom solver | Write solver that shells out to CLI | May be overkill |
| Promptfoo custom provider | YAML config, CLI wrapper | Good for CI (sketch below) |
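For the Promptfoo row, the CLI wrapper would live in a Python provider referenced from the YAML config. A minimal sketch follows; it assumes Promptfoo's Python provider contract (a `call_api` function returning a dict with an `output` key) and the `claude -p` / `--model` flags, both of which should be confirmed against current docs before committing to this route.

```python
# provider_claude_cli.py - hypothetical Promptfoo Python provider wrapping the
# Claude Code CLI. The call_api contract and CLI flags are assumptions to verify.
import subprocess

def call_api(prompt, options, context):
    config = (options or {}).get("config", {})
    model = config.get("model", "haiku")
    try:
        result = subprocess.run(
            ["claude", "-p", prompt, "--model", model],
            capture_output=True,
            text=True,
            check=True,
            timeout=config.get("timeout", 120),
        )
        return {"output": result.stdout.strip()}
    except subprocess.CalledProcessError as err:
        return {"error": err.stderr.strip()}
    except subprocess.TimeoutExpired:
        return {"error": "claude CLI timed out"}
```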
Proposed Design
```python
@scorer
def skill_relevance_judge():
    """LLM-as-judge for skill evaluation."""

    async def score(state, target):
        prompt = f"""
        Evaluate if this response demonstrates knowledge from the skill.
        Question: {state.input}
        Response: {state.output}
        Expected concepts: {target}

        Score 1-5:
        1 = No skill knowledge used, hallucinated
        2 = Partial, mostly generic knowledge
        3 = Adequate skill usage
        4 = Good skill usage with specifics
        5 = Excellent, comprehensive skill application

        Respond with just the score and one sentence reason.
        """
        result = await run_claude_async(prompt, model="haiku")
        value = extract_score(result)
        return Score(value=value, explanation=result)

    return score
```
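The `run_claude_async` and `extract_score` helpers referenced above are not pinned down in this issue. Below is a minimal sketch, assuming the Claude Code CLI's non-interactive mode (`claude -p`, `--model`) and a judge reply of the form "N - reason"; the flag names and the parsing are assumptions to validate during the research.

```python
import asyncio
import re

async def run_claude_async(prompt: str, model: str = "haiku") -> str:
    """Shell out to the Claude Code CLI in non-interactive (print) mode.

    Assumes `claude -p <prompt> --model <model>` writes the response to stdout;
    confirm flag names against the installed CLI version.
    """
    proc = await asyncio.create_subprocess_exec(
        "claude", "-p", prompt, "--model", model,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"claude CLI failed: {stderr.decode()}")
    return stdout.decode()

def extract_score(judge_reply: str) -> int:
    """Pull the leading 1-5 score out of a 'score + one sentence reason' reply."""
    match = re.search(r"\b([1-5])\b", judge_reply)
    if match is None:
        raise ValueError(f"No score found in judge reply: {judge_reply!r}")
    return int(match.group(1))
```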
Success Criteria
References
Labels
enhancement, research, testing