evalc

Compile natural-language eval criteria into deterministic Python evaluators. No LLM judges, ever.

You describe what a good output looks like. evalc writes a real Python evaluator that checks it, runs the evaluator against your dataset, and registers it as a Phoenix custom evaluator so future versions of your prompt get the same regression check forever.

Why

LLM-as-judge is itself probabilistic. Scoring an LLM's output with another LLM gives you noise dressed as truth. evalc replaces the judge with compiled code: when "valid JSON with these fields" is the criterion, the evaluator is a real parser, not a vibes-based score from a sibling model.

What it does

You paste a criterion in English: "Output must be valid JSON with fields intent and confidence; confidence between 0 and 1."
Gemini synthesizes a Python evaluator that implements that predicate, plus 3 pass and 2 fail synthetic test cases.
The evaluator runs in a sandbox against those self-tests. If any fail, evalc retries once with the error as context.
On success, the evaluator is stored and shown to you for review.
You upload a labeled dataset (CSV or JSON). evalc runs the evaluator row-by-row in the sandbox, streams progress, and logs every run as a Phoenix experiment with CODE evaluations.

The compiled evaluator is deterministic and reusable: you can re-run it against any future version of your prompt or model and get a directly comparable score.
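To make the flow above concrete, here is a minimal sketch of what a compiled evaluator for the example criterion might look like. The function name `evaluate`, its boolean return, and the `PASS_CASES` / `FAIL_CASES` / `self_test` names are assumptions made for illustration; the code evalc actually generates may be shaped differently.

```python
import json


def evaluate(output: str) -> bool:
    """Hypothetical evaluator: JSON object with 'intent' and a 'confidence' in [0, 1]."""
    try:
        data = json.loads(output)
    except (TypeError, ValueError):
        # Not parseable as JSON (json.JSONDecodeError is a ValueError).
        return False
    if not isinstance(data, dict):
        return False
    if "intent" not in data or "confidence" not in data:
        return False
    confidence = data["confidence"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return False
    return 0 <= confidence <= 1


# Illustrative self-tests in the 3-pass / 2-fail shape described above.
PASS_CASES = [
    '{"intent": "refund", "confidence": 0.9}',
    '{"intent": "greeting", "confidence": 0}',
    '{"intent": "cancel", "confidence": 1, "extra": true}',
]
FAIL_CASES = [
    "not json at all",
    '{"intent": "refund", "confidence": 1.5}',
]


def self_test() -> bool:
    """True when every pass case passes and every fail case fails."""
    return all(evaluate(c) for c in PASS_CASES) and not any(
        evaluate(c) for c in FAIL_CASES
    )
```

Because the check is an ordinary parser plus comparisons, the same input always produces the same verdict, which is the property the regression check relies on.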

Quick start

Prerequisites:

Go 1.26+
Python 3 on PATH (the sandbox shells out to it)
A Gemini credential: either GOOGLE_CLOUD_PROJECT (Vertex AI) or GEMINI_API_KEY (Google AI)
Optional: a Phoenix instance. Default is http://localhost:6006. For Arize-hosted Phoenix, set PHOENIX_HOST=https://app.phoenix.arize.com, PHOENIX_API_KEY=..., PHOENIX_SPACE_ID=....

go build -o evalc ./cmd/evalc

# Verify Phoenix
./evalc check-phoenix

# Compile a criterion to Python
./evalc compile "output must be valid JSON with a field 'intent'"

# Run a compiled evaluator on a dataset
./evalc run --evaluator <id> --dataset path.csv

# Web UI
./evalc serve --port 8080

Dataset format: CSV or JSON with at minimum an output column. Optional: input, expected. Capped at 50 rows per run.
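As a rough illustration of the dataset rules, the sketch below parses CSV or JSON text and checks for the required output field. The helper name `load_rows`, the list-of-objects JSON shape, and the choice to reject (rather than truncate) datasets over 50 rows are all assumptions; evalc's own loader may behave differently.

```python
import csv
import io
import json

MAX_ROWS = 50  # per-run cap stated in the dataset format note


def load_rows(text: str, fmt: str) -> list[dict]:
    """Hypothetical loader: parse dataset text and require an 'output' field per row."""
    if fmt == "csv":
        rows = list(csv.DictReader(io.StringIO(text)))
    elif fmt == "json":
        rows = json.loads(text)
        # Assumption: a JSON dataset is a list of objects, one per row.
        if not isinstance(rows, list):
            raise ValueError("JSON dataset must be a list of objects")
    else:
        raise ValueError(f"unsupported format: {fmt}")
    for i, row in enumerate(rows):
        if not isinstance(row, dict) or "output" not in row:
            raise ValueError(f"row {i} has no 'output' field")
    # Assumption: over-long datasets are rejected here rather than truncated.
    if len(rows) > MAX_ROWS:
        raise ValueError(f"dataset has {len(rows)} rows; the cap is {MAX_ROWS}")
    return rows
```

A three-column CSV with the header `input,output,expected` would load as dictionaries keyed by those names, with `input` and `expected` free to be absent.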

CLI

Command	Purpose
`evalc check-phoenix`	Probe Phoenix connectivity
`evalc compile <criteria>`	Compile a criterion to a Python evaluator
`evalc run --evaluator <id> --dataset <path>`	Run an evaluator on a labeled dataset
`evalc serve --port <port>`	Start the web UI

Environment variables

Var	Purpose
`GOOGLE_CLOUD_PROJECT` / `GCP_LOCATION`	Vertex AI credentials
`GEMINI_API_KEY`	Google AI API key (fallback)
`PHOENIX_HOST`	Phoenix base URL (default `http://localhost:6006`)
`PHOENIX_API_KEY`	Bearer token for Arize-hosted Phoenix
`PHOENIX_SPACE_ID`	Numeric Arize space ID (for correct UI links)
`EVALC_DB`	SQLite path (default `evalc.db` in CWD, `/tmp/evalc.db` in container)
`EVALC_LEDGER`	Append-only JSONL path for local Gemini call records. Unset = no logging.
`SPENDLINT_URL`	spendlint base URL. Each Gemini call is POSTed to `<url>/record` for cost tracking.
`SPENDLINT_TOKEN`	Shared secret for the `X-Spendlint-Token` header. Both this and `SPENDLINT_URL` must be set to enable.
`PORT`	HTTP port for `serve` (Cloud Run picks this up)
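The precedence rules implied by the table can be sketched as follows. This is an illustration only: the function name `resolve_config`, the returned keys, and the treatment of empty strings as unset are assumptions, and the real service is written in Go.

```python
def resolve_config(env: dict[str, str]) -> dict:
    """Hypothetical helper: derive settings from an environment mapping.

    Pass os.environ in real use; a plain dict keeps the sketch testable.
    """
    # Vertex AI is preferred; GEMINI_API_KEY is described as the fallback.
    if env.get("GOOGLE_CLOUD_PROJECT"):
        backend = "vertex"
    elif env.get("GEMINI_API_KEY"):
        backend = "google-ai"
    else:
        raise ValueError("set GOOGLE_CLOUD_PROJECT or GEMINI_API_KEY")
    return {
        "gemini_backend": backend,
        # Default Phoenix base URL from the table.
        "phoenix_host": env.get("PHOENIX_HOST") or "http://localhost:6006",
        # Cost tracking needs both the URL and the token.
        "spendlint_enabled": bool(
            env.get("SPENDLINT_URL") and env.get("SPENDLINT_TOKEN")
        ),
        # Unset ledger path means no local call logging.
        "ledger_enabled": bool(env.get("EVALC_LEDGER")),
    }
```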

Deploy to Cloud Run

gcloud builds submit --tag gcr.io/$PROJECT/evalc
gcloud run deploy evalc \
  --image gcr.io/$PROJECT/evalc \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars GOOGLE_CLOUD_PROJECT=$PROJECT,PHOENIX_HOST=$PHOENIX_HOST,PHOENIX_API_KEY=$PHOENIX_API_KEY,PHOENIX_SPACE_ID=$PHOENIX_SPACE_ID

The container has Python 3.12, runs as non-root, listens on $PORT, and stores its evaluator metadata in /tmp/evalc.db. That file is ephemeral, which is intended for the hackathon scope; persist via Cloud SQL or GCS if you need durability.

Sandbox model

Generated Python runs in a subprocess with:

An import allowlist: stdlib essentials only (re, json, sys, etc.) plus a small curated set (jsonschema, pydantic, email_validator).
A 10-second per-call timeout.
No network access is expected, but none is blocked by the sandbox itself: egress is whatever Cloud Run grants the container. Restrict it with VPC egress rules in production.

This is hackathon-grade, not pentest-grade isolation. For production, run each evaluator in a Cloud Run job or a bwrap/firejail jail.
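One way an import allowlist and a timeout can be combined is sketched below. It is not evalc's implementation (the real sandbox lives in the Go service and shells out to Python); the names `ALLOWED`, `disallowed_imports`, and `run_sandboxed` are hypothetical, and `ALLOWED` holds only the modules named above, not the full list.

```python
import ast
import subprocess
import sys

# Assumption: only the modules named in this README; the real list is longer.
ALLOWED = {"re", "json", "sys", "jsonschema", "pydantic", "email_validator"}
TIMEOUT_SECONDS = 10


def disallowed_imports(source: str) -> set[str]:
    """Return imported top-level module names that are not in ALLOWED."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            if node.level:
                # Relative imports are never on the allowlist.
                found.add("." * node.level + (node.module or ""))
            elif node.module:
                found.add(node.module.split(".")[0])
    return found - ALLOWED


def run_sandboxed(source: str, stdin_text: str) -> subprocess.CompletedProcess:
    """Reject disallowed imports, then run source in a child interpreter."""
    bad = disallowed_imports(source)
    if bad:
        raise PermissionError(f"disallowed imports: {sorted(bad)}")
    # -I runs Python in isolated mode; timeout raises subprocess.TimeoutExpired.
    return subprocess.run(
        [sys.executable, "-I", "-c", source],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=TIMEOUT_SECONDS,
    )
```

A static check like this does not catch dynamic imports such as `__import__("os")` or `importlib.import_module`, which is one reason stronger process-level isolation is recommended for production.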

Architecture

Browser (single-page UI)
   |
   v
Cloud Run (Go service)
   |
   +---> Gemini (code synthesis)
   +---> Phoenix REST API (datasets, experiments, evaluations)
   +---> SQLite (evaluator + run history)
   +---> Python sandbox (deterministic execution)

Package layout:

cmd/evalc/        CLI entry (cobra)
internal/gemini   Vertex / Google AI client
internal/compile  NL criteria -> Python evaluator + self-tests
internal/sandbox  subprocess execution with import allowlist
internal/store    SQLite, dataset loader, schema inference
internal/phoenix  Phoenix REST client
internal/web      HTTP server, SSE progress, embedded static UI

License

MIT. See LICENSE.
