A tool for evaluating LLM outputs against reference answers using configurable graders.
cd backend
uv sync
uv run python seed.py # seed 5 sample jobs
uv run uvicorn main:app --reload --port 8001

API runs at http://127.0.0.1:8001. Docs at http://127.0.0.1:8001/docs.
Score model graders require OPENAI_API_KEY in the backend environment.
cd frontend
npm install
npm run dev

App runs at http://localhost:3001.
Build a grader system on top of this app. A grader takes a model's output and a reference answer and returns a score between 0 and 1. Read the OpenAI graders guide — your implementation should support the two types described there: text_similarity and score_model.
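As a rough illustration of the grader contract described above (two strings in, a score in [0, 1] out), here is a minimal sketch of a fuzzy text-similarity scorer. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; the function name `fuzzy_score` and the normalisation step are assumptions for illustration, not part of the starter code or the OpenAI guide.

```python
from difflib import SequenceMatcher


def fuzzy_score(output: str, reference: str) -> float:
    """Return a similarity score in [0, 1]; 1.0 means the strings match."""
    # Assumption: ignore case and surrounding whitespace so trivial
    # differences do not lower the score. The task does not require this.
    a = output.strip().lower()
    b = reference.strip().lower()
    if not a and not b:
        return 1.0  # two empty answers are treated as identical
    # ratio() is 2 * matches / total length, which always lies in [0, 1].
    return SequenceMatcher(None, a, b).ratio()
```

A different metric (for example one from a dedicated fuzzy-matching library) could replace the `SequenceMatcher` call without changing the signature.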
The starter includes both free-text outputs and JSON outputs so your graders can handle whole-answer comparisons as well as field-level comparisons.
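One possible shape for a field-level comparison is sketched below. It assumes both the model output and the reference are JSON objects, scores each top-level reference field separately, and averages the results. The helper names `field_scores` and `overall_score`, the string-based comparison, and the unweighted mean are all illustrative choices, not something the starter prescribes.

```python
import json
from difflib import SequenceMatcher


def field_scores(output_json: str, reference_json: str) -> dict[str, float]:
    """Score each top-level reference field against the model output."""
    # Assumption: both payloads parse to JSON objects (dicts).
    output = json.loads(output_json)
    reference = json.loads(reference_json)
    scores = {}
    for key, expected in reference.items():
        if key not in output:
            scores[key] = 0.0  # a missing field counts as a complete miss
            continue
        # Compare values as strings so numbers and text share one metric.
        scores[key] = SequenceMatcher(None, str(output[key]), str(expected)).ratio()
    return scores


def overall_score(scores: dict[str, float]) -> float:
    """Unweighted mean of the per-field scores; 0.0 when there are no fields."""
    return sum(scores.values()) / len(scores) if scores else 0.0
```

Keeping the per-field scores around, rather than only the mean, makes it possible to show which fields dragged a run below its threshold.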
- Reference answers: Users can add or update the reference answer for a job.
- Grader configuration: Users can create and save grader configs. A config captures the grader type and its parameters (e.g. which similarity metric to use, or which model and prompt to use as a judge). A config can be run against any job.
- Run a grader: Users can run a saved grader config against a job. The app computes a score in [0, 1] and records whether it passed a configurable threshold. The result is visible on the job detail page.
- Text similarity grader: Implement scoring based on string similarity between the model output and reference answer. Support at least fuzzy matching.
- Score model grader: Implement an LLM-as-judge grader that sends the model output and reference answer to an OpenAI model and asks it to return a numeric score. The grader prompt should be configurable per grader config.
- Grader run history: The job detail page shows all past grader runs for that job: which grader was used, the score, and whether it passed.
- Run all jobs: Users can run a grader config across all jobs at once and see a summary of results.
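The requirements above (a saved config with a configurable prompt, a threshold check, and a run across all jobs) could fit together roughly as in the sketch below. Everything here is hypothetical: `GraderConfig`, `GraderRun`, `parse_score`, `run_grader`, and `run_all` are illustrative names, and the judge is passed in as a plain callable so the example runs offline. In the real app that callable would wrap a call to an OpenAI model.

```python
import re
from dataclasses import dataclass
from typing import Callable

# A judge takes the rendered prompt and returns the model's raw text reply.
# Assumption: injected here so the sketch needs no network or API key.
Judge = Callable[[str], str]


@dataclass
class GraderConfig:
    name: str
    prompt_template: str  # uses {output} and {reference} placeholders
    threshold: float = 0.5


@dataclass
class GraderRun:
    job_id: int
    score: float
    passed: bool


def parse_score(reply: str) -> float:
    """Pull the first number out of the judge's reply and clamp it to [0, 1]."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no numeric score in judge reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))


def run_grader(config: GraderConfig, job_id: int, output: str,
               reference: str, judge: Judge) -> GraderRun:
    """Render the prompt, ask the judge, and record pass/fail."""
    prompt = config.prompt_template.format(output=output, reference=reference)
    score = parse_score(judge(prompt))
    # Assumption: a score equal to the threshold counts as a pass.
    return GraderRun(job_id, score, score >= config.threshold)


def run_all(config: GraderConfig, jobs: list[tuple[int, str, str]],
            judge: Judge) -> dict:
    """Run one config over (job_id, output, reference) tuples and summarise."""
    runs = [run_grader(config, job_id, out, ref, judge) for job_id, out, ref in jobs]
    passed = sum(r.passed for r in runs)
    mean = sum(r.score for r in runs) / len(runs) if runs else 0.0
    return {"total": len(runs), "passed": passed,
            "failed": len(runs) - passed, "mean_score": mean, "runs": runs}
```

Clamping in `parse_score` guards against a judge that replies outside the expected range. Note that `str.format` would fail on a prompt template containing literal braces, so a template with embedded JSON would need a different rendering approach.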
- Build on the existing codebase — don't rewrite from scratch
- Use the existing patterns you see in the code
We're evaluating how you plan, design, interact with AI, and build — not just the final output.