Compile natural-language eval criteria into deterministic Python evaluators. No LLM judges, ever.
You describe what a good output looks like. evalc writes a real Python evaluator that checks it, runs the evaluator against your dataset, and registers it as a Phoenix custom evaluator so future versions of your prompt get the same regression check forever.
LLM-as-judge is itself probabilistic. Scoring an LLM's output with another LLM gives you noise dressed as truth. evalc replaces the judge with compiled code: when "valid JSON with these fields" is the criterion, the evaluator is a real parser, not a vibes-based score from a sibling model.
- You paste a criterion in English: "Output must be valid JSON with fields intent and confidence; confidence between 0 and 1."
- Gemini synthesizes a Python evaluator that implements that predicate, plus 3 pass and 2 fail synthetic test cases.
- The evaluator runs in a sandbox against those self-tests. If any fail, evalc retries once with the error as context.
- On success, the evaluator is stored and shown to you for review.
- You upload a labeled dataset (CSV or JSON). evalc runs the evaluator row-by-row in the sandbox, streams progress, and logs every run as a Phoenix experiment with CODE evaluations.
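The synthesize, self-test, retry-once flow in the steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `compile_with_retry`, `Synthesizer`, and `SelfTest` are hypothetical names, the Gemini call is replaced by a stub, and the real service is written in Go rather than Python.

```python
from typing import Callable, Optional

# Hypothetical types: a synthesizer stands in for the Gemini call and returns
# evaluator source; a self-test runs that source against the synthetic cases
# and returns an error message, or None if every case passed.
Synthesizer = Callable[[str, Optional[str]], str]
SelfTest = Callable[[str], Optional[str]]


def compile_with_retry(criterion: str, synthesize: Synthesizer,
                       self_test: SelfTest, max_attempts: int = 2) -> str:
    """Return evaluator source that passes its self-tests, or raise."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        # On the second attempt the previous error is passed back as context.
        source = synthesize(criterion, error)
        error = self_test(source)
        if error is None:
            return source
    raise RuntimeError(
        f"self-tests still failing after {max_attempts} attempts: {error}"
    )


# Stubs for demonstration only: the first synthesis is "bad", the retry "good".
demo_calls = []


def demo_synthesize(criterion: str, error: Optional[str]) -> str:
    demo_calls.append(error)
    return "good" if error else "bad"


def demo_self_test(source: str) -> Optional[str]:
    return None if source == "good" else "self-test case 2 failed"
```

With the stubs, `compile_with_retry("any criterion", demo_synthesize, demo_self_test)` succeeds on the second attempt, and the second synthesis call receives the first attempt's error message.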
The compiled evaluator is deterministic and reusable: re-run it against any future version of your prompt or model and the scores stay directly comparable.
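For the example criterion above ("valid JSON with fields intent and confidence; confidence between 0 and 1"), a compiled evaluator might look something like this. It is a hand-written sketch, not actual evalc output: the `evaluate(output) -> bool` signature and the `SELF_TESTS` layout are assumptions made for illustration.

```python
import json


def evaluate(output: str) -> bool:
    """Pass iff output is a JSON object with 'intent' and a 'confidence' in [0, 1]."""
    try:
        data = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    if not isinstance(data, dict):
        return False
    if "intent" not in data or "confidence" not in data:
        return False
    confidence = data["confidence"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return False
    return 0 <= confidence <= 1


# Synthetic self-tests in the 3-pass / 2-fail shape described above.
SELF_TESTS = [
    ('{"intent": "refund", "confidence": 0.9}', True),
    ('{"intent": "greeting", "confidence": 0}', True),
    ('{"intent": "cancel", "confidence": 1, "extra": "ok"}', True),
    ('{"intent": "refund", "confidence": 1.5}', False),
    ("not json at all", False),
]
```

Because the check is an ordinary parser plus comparisons, the same output string always gets the same verdict.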
Prerequisites:
- Go 1.26+
- Python 3 on `PATH` (the sandbox shells out to it)
- A Gemini credential: either `GOOGLE_CLOUD_PROJECT` (Vertex AI) or `GEMINI_API_KEY` (Google AI)
- Optional: a Phoenix instance. Default is `http://localhost:6006`. For Arize-hosted Phoenix, set `PHOENIX_HOST=https://app.phoenix.arize.com`, `PHOENIX_API_KEY=...`, `PHOENIX_SPACE_ID=...`.
go build -o evalc ./cmd/evalc
# Verify Phoenix
./evalc check-phoenix
# Compile a criterion to Python
./evalc compile "output must be valid JSON with a field 'intent'"
# Run a compiled evaluator on a dataset
./evalc run --evaluator <id> --dataset path.csv
# Web UI
./evalc serve --port 8080

Dataset format: CSV or JSON with at minimum an output column. Optional: input, expected. Capped at 50 rows per run.
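To make the dataset shape concrete, here is a minimal sketch of a CSV loader that requires the output column, carries the optional input and expected columns, and enforces the 50-row cap. The function name `load_csv_dataset` is hypothetical, and rejecting oversized files (rather than truncating them) is an assumption; the text only says runs are capped.

```python
import csv
import io

MAX_ROWS = 50  # per-run cap stated above

# A one-row sample; JSON inside a CSV cell needs doubled quotes.
SAMPLE = (
    "input,output,expected\n"
    'hello,"{""intent"": ""greeting"", ""confidence"": 0.8}",pass\n'
)


def load_csv_dataset(text: str) -> list[dict]:
    """Parse CSV text, require an 'output' column, and enforce the row cap."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None or "output" not in reader.fieldnames:
        raise ValueError("dataset must have an 'output' column")
    rows = []
    for row in reader:
        rows.append({
            "output": row["output"],
            # input and expected are optional columns; absent ones become None
            "input": row.get("input"),
            "expected": row.get("expected"),
        })
    if len(rows) > MAX_ROWS:
        # Assumption: reject rather than silently truncate.
        raise ValueError(f"dataset has {len(rows)} rows; the cap is {MAX_ROWS}")
    return rows
```

Calling `load_csv_dataset(SAMPLE)` yields one row whose `output` is the unescaped JSON string, ready to hand to an evaluator.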
| Command | Purpose |
|---|---|
| `evalc check-phoenix` | Probe Phoenix connectivity |
| `evalc compile <criteria>` | Compile a criterion to a Python evaluator |
| `evalc run --evaluator <id> --dataset <path>` | Run an evaluator on a labeled dataset |
| `evalc serve --port <port>` | Start the web UI |
| Var | Purpose |
|---|---|
| `GOOGLE_CLOUD_PROJECT` / `GCP_LOCATION` | Vertex AI credentials |
| `GEMINI_API_KEY` | Google AI API key (fallback) |
| `PHOENIX_HOST` | Phoenix base URL (default `http://localhost:6006`) |
| `PHOENIX_API_KEY` | Bearer token for Arize-hosted Phoenix |
| `PHOENIX_SPACE_ID` | Numeric Arize space ID (for correct UI links) |
| `EVALC_DB` | SQLite path (default `evalc.db` in CWD, `/tmp/evalc.db` in container) |
| `EVALC_LEDGER` | Append-only JSONL path for local Gemini call records. Unset = no logging. |
| `SPENDLINT_URL` | spendlint base URL. Each Gemini call is POSTed to `<url>/record` for cost tracking. |
| `SPENDLINT_TOKEN` | Shared secret for the `X-Spendlint-Token` header. Both this and `SPENDLINT_URL` must be set to enable. |
| `PORT` | HTTP port for `serve` (Cloud Run picks this up) |
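The way these variables interact can be summarized in a short sketch. It assumes Vertex AI takes precedence when `GOOGLE_CLOUD_PROJECT` is set (inferred from `GEMINI_API_KEY` being described as the fallback); `resolve_config` and the returned keys are hypothetical names, not evalc's actual code.

```python
from typing import Mapping


def resolve_config(env: Mapping[str, str]) -> dict:
    """Pick a Gemini backend and related settings from environment-style values."""
    # Assumption: Vertex AI wins when both credentials are present.
    if env.get("GOOGLE_CLOUD_PROJECT"):
        backend = "vertex"
    elif env.get("GEMINI_API_KEY"):
        backend = "google-ai"
    else:
        raise ValueError("set GOOGLE_CLOUD_PROJECT or GEMINI_API_KEY")
    return {
        "gemini_backend": backend,
        "phoenix_host": env.get("PHOENIX_HOST") or "http://localhost:6006",
        # spendlint is enabled only when both variables are set
        "spendlint_enabled": bool(
            env.get("SPENDLINT_URL") and env.get("SPENDLINT_TOKEN")
        ),
        # None means no local ledger logging
        "ledger_path": env.get("EVALC_LEDGER"),
    }
```

In a real process you would pass `os.environ`; taking a mapping keeps the sketch easy to test.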
gcloud builds submit --tag gcr.io/$PROJECT/evalc
gcloud run deploy evalc \
--image gcr.io/$PROJECT/evalc \
--region us-central1 \
--allow-unauthenticated \
--set-env-vars GOOGLE_CLOUD_PROJECT=$PROJECT,PHOENIX_HOST=$PHOENIX_HOST,PHOENIX_API_KEY=$PHOENIX_API_KEY,PHOENIX_SPACE_ID=$PHOENIX_SPACE_ID

The container has Python 3.12, runs as non-root, listens on $PORT, and stores its evaluator metadata in /tmp/evalc.db (ephemeral, which fits the hackathon scope; persist via Cloud SQL or GCS if you need durability).
Generated Python runs in a subprocess with:
- An import allowlist: stdlib essentials only (`re`, `json`, `sys`, etc.) plus a small curated set (`jsonschema`, `pydantic`, `email_validator`).
- 10-second per-call timeout.
- No network access is expected, but none is blocked by the sandbox itself: egress is whatever Cloud Run grants the container. Restrict it with VPC egress rules in production.
This is hackathon-grade, not pentest-grade isolation. For production, run each evaluator in a Cloud Run job or a bwrap/firejail jail.
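As a rough illustration of the two controls listed above, the sketch below checks imports statically with `ast` and runs the code in a child interpreter with a 10-second timeout. It is an assumption about how such a sandbox could work, not evalc's implementation (which lives in the Go package internal/sandbox); `ALLOWED_IMPORTS`, `disallowed_imports`, and `run_sandboxed` are hypothetical names, and the allowlist contains only the modules named in the text.

```python
import ast
import subprocess
import sys

# Hypothetical allowlist: only the modules the text names explicitly.
ALLOWED_IMPORTS = {"re", "json", "sys", "jsonschema", "pydantic", "email_validator"}
TIMEOUT_SECONDS = 10


def disallowed_imports(source: str) -> set[str]:
    """Return top-level module names imported by source that are not allowlisted."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            if node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
            else:
                # Relative imports are never allowlisted; report them as ".".
                found.add(".")
    return found - ALLOWED_IMPORTS


def run_sandboxed(source: str, stdin_text: str) -> subprocess.CompletedProcess:
    """Run source in a fresh interpreter with a timeout, after the import check."""
    bad = disallowed_imports(source)
    if bad:
        raise PermissionError(f"imports not allowed: {sorted(bad)}")
    # -I is Python's isolated mode: ignores PYTHON* env vars and the user site dir.
    # subprocess.run raises subprocess.TimeoutExpired if the timeout is exceeded.
    return subprocess.run(
        [sys.executable, "-I", "-c", source],
        input=stdin_text, capture_output=True, text=True, timeout=TIMEOUT_SECONDS,
    )
```

A static import check like this does not catch dynamic imports such as `__import__("os")`, which is one reason a process-level jail is still the right call for production.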
Browser (single-page UI)
|
v
Cloud Run (Go service)
|
+---> Gemini (code synthesis)
+---> Phoenix REST API (datasets, experiments, evaluations)
+---> SQLite (evaluator + run history)
+---> Python sandbox (deterministic execution)
Package layout:
cmd/evalc/ CLI entry (cobra)
internal/gemini Vertex / Google AI client
internal/compile NL criteria -> Python evaluator + self-tests
internal/sandbox subprocess execution with import allowlist
internal/store SQLite, dataset loader, schema inference
internal/phoenix Phoenix REST client
internal/web HTTP server, SSE progress, embedded static UI
MIT. See LICENSE.