Yatsuiii/evalc

evalc

Compile natural-language eval criteria into deterministic Python evaluators. No LLM judges, ever.

You describe what a good output looks like. evalc writes a real Python evaluator that checks it, runs the evaluator against your dataset, and registers it as a Phoenix custom evaluator so future versions of your prompt get the same regression check forever.

Why

LLM-as-judge is itself probabilistic. Scoring an LLM's output with another LLM gives you noise dressed as truth. evalc replaces the judge with compiled code: when "valid JSON with these fields" is the criterion, the evaluator is a real parser, not a vibes-based score from a sibling model.

What it does

  1. You paste a criterion in English: "Output must be valid JSON with fields intent and confidence; confidence between 0 and 1."
  2. Gemini synthesizes a Python evaluator that implements that predicate, plus 3 pass and 2 fail synthetic test cases.
  3. The evaluator runs in a sandbox against those self-tests. If any fail, evalc retries once with the error as context.
  4. On success, the evaluator is stored and shown to you for review.
  5. You upload a labeled dataset (CSV or JSON). evalc runs the evaluator row-by-row in the sandbox, streams progress, and logs every run as a Phoenix experiment with CODE evaluations.
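The self-test-and-retry loop in steps 2 to 4 can be sketched roughly as follows. This is an illustration, not evalc's actual code: `synthesize` is a hypothetical stand-in for the Gemini call, and the `(output, expected_pass)` shape of a self-test is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# Assumed shape of a self-test: (candidate_output, should_pass).
SelfTest = Tuple[str, bool]


def run_self_tests(evaluate: Callable[[str], bool], tests: List[SelfTest]) -> Optional[str]:
    """Return None if every self-test passes, else a short error description."""
    for output, expected in tests:
        try:
            got = evaluate(output)
        except Exception as exc:  # a crash in generated code counts as a failure
            return f"evaluator raised {exc!r} on {output!r}"
        if got != expected:
            return f"expected {expected} but got {got} on {output!r}"
    return None


def compile_with_retry(criterion: str, synthesize, max_attempts: int = 2):
    """Synthesize an evaluator, retrying once with the error as context.

    `synthesize(criterion, error)` is hypothetical: it returns
    (evaluate, self_tests) and receives the previous error, or None
    on the first attempt.
    """
    error = None
    for _ in range(max_attempts):
        evaluate, tests = synthesize(criterion, error)
        error = run_self_tests(evaluate, tests)
        if error is None:
            return evaluate
    raise RuntimeError(f"self-tests still failing after {max_attempts} attempts: {error}")
```

With `max_attempts=2`, the loop makes one initial attempt plus the single retry described in step 3.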

The compiled evaluator is deterministic and reusable: you can re-run it against any future version of your prompt or model and get directly comparable scores.
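For the example criterion in step 1, a compiled evaluator might look something like the sketch below. The function name `evaluate`, the inclusive reading of "between 0 and 1", and the specific test cases are assumptions made for illustration; the code evalc actually generates may differ.

```python
import json


def evaluate(output: str) -> bool:
    """Pass iff output is a JSON object with 'intent' and a numeric 'confidence' in [0, 1]."""
    try:
        data = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    if not isinstance(data, dict) or "intent" not in data or "confidence" not in data:
        return False
    conf = data["confidence"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        return False
    # Assumption: "between 0 and 1" is read as inclusive on both ends.
    return 0 <= conf <= 1


# Synthetic self-tests in the 3-pass / 2-fail shape described above.
PASS_CASES = [
    '{"intent": "refund", "confidence": 0.9}',
    '{"intent": "greeting", "confidence": 0}',
    '{"intent": "cancel", "confidence": 1, "extra": "ok"}',
]
FAIL_CASES = [
    "not json",
    '{"intent": "refund", "confidence": 1.5}',
]
```

Because the check is plain parsing and comparison, the same output always produces the same verdict.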

Quick start

Prerequisites:

  • Go 1.26+
  • Python 3 on PATH (the sandbox shells out to it)
  • A Gemini credential: either GOOGLE_CLOUD_PROJECT (Vertex AI) or GEMINI_API_KEY (Google AI)
  • Optional: a Phoenix instance. Default is http://localhost:6006. For Arize-hosted Phoenix, set PHOENIX_HOST=https://app.phoenix.arize.com, PHOENIX_API_KEY=..., PHOENIX_SPACE_ID=....

# Build
go build -o evalc ./cmd/evalc

# Verify Phoenix
./evalc check-phoenix

# Compile a criterion to Python
./evalc compile "output must be valid JSON with a field 'intent'"

# Run a compiled evaluator on a dataset
./evalc run --evaluator <id> --dataset path.csv

# Web UI
./evalc serve --port 8080

Dataset format: CSV or JSON with at minimum an output column. Optional: input, expected. Capped at 50 rows per run.
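As a rough illustration of these rules, the sketch below loads a CSV dataset, requires the `output` column, and treats `input` and `expected` as optional. The function name `load_csv_dataset` is hypothetical. Rejecting files over the cap is also an assumption: evalc may truncate instead.

```python
import csv
import io

MAX_ROWS = 50  # per-run cap stated above


def load_csv_dataset(text: str) -> list:
    """Parse CSV text into row dicts; 'output' is required, the rest optional."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None or "output" not in reader.fieldnames:
        raise ValueError("dataset needs an 'output' column")
    rows = []
    for row in reader:
        rows.append({
            "output": row["output"],
            "input": row.get("input"),        # None when the column is absent
            "expected": row.get("expected"),  # None when the column is absent
        })
    # Assumption: this sketch rejects oversized datasets rather than truncating.
    if len(rows) > MAX_ROWS:
        raise ValueError(f"dataset has {len(rows)} rows; the cap is {MAX_ROWS}")
    return rows


# A minimal two-column dataset; JSON inside a CSV cell needs doubled quotes.
SAMPLE = 'input,output\nhi,"{""intent"": ""greeting"", ""confidence"": 0.8}"\n'
rows = load_csv_dataset(SAMPLE)
```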

CLI

Command                                      Purpose
evalc check-phoenix                          Probe Phoenix connectivity
evalc compile <criteria>                     Compile a criterion to a Python evaluator
evalc run --evaluator <id> --dataset <path>  Run an evaluator on a labeled dataset
evalc serve --port <port>                    Start the web UI

Environment variables

Var                                  Purpose
GOOGLE_CLOUD_PROJECT / GCP_LOCATION  Vertex AI credentials
GEMINI_API_KEY                       Google AI API key (fallback)
PHOENIX_HOST                         Phoenix base URL (default http://localhost:6006)
PHOENIX_API_KEY                      Bearer token for Arize-hosted Phoenix
PHOENIX_SPACE_ID                     Numeric Arize space ID (for correct UI links)
EVALC_DB                             SQLite path (default evalc.db in CWD, /tmp/evalc.db in container)
EVALC_LEDGER                         Append-only JSONL path for local Gemini call records. Unset = no logging.
SPENDLINT_URL                        spendlint base URL. Each Gemini call is POSTed to <url>/record for cost tracking.
SPENDLINT_TOKEN                      Shared secret for the X-Spendlint-Token header. Both this and SPENDLINT_URL must be set to enable.
PORT                                 HTTP port for serve (Cloud Run picks this up)
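
The precedence rules implied by this table can be summarized in a short sketch. It is written in Python purely for illustration (the service itself is Go), and the function name `resolve_config` and the returned keys are hypothetical.

```python
import os


def resolve_config(env=os.environ) -> dict:
    """Illustrative resolution of the variables above; not evalc's actual code."""
    # Vertex AI is preferred; GEMINI_API_KEY is documented as the fallback.
    if env.get("GOOGLE_CLOUD_PROJECT"):
        backend = "vertex"
    elif env.get("GEMINI_API_KEY"):
        backend = "google-ai"
    else:
        raise RuntimeError("set GOOGLE_CLOUD_PROJECT or GEMINI_API_KEY")
    return {
        "gemini_backend": backend,
        "phoenix_host": env.get("PHOENIX_HOST", "http://localhost:6006"),
        "db_path": env.get("EVALC_DB", "evalc.db"),
        # Ledger logging is off unless EVALC_LEDGER is set.
        "ledger_enabled": bool(env.get("EVALC_LEDGER")),
        # spendlint needs both the URL and the token.
        "spendlint_enabled": bool(env.get("SPENDLINT_URL") and env.get("SPENDLINT_TOKEN")),
    }
```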

Deploy to Cloud Run

gcloud builds submit --tag gcr.io/$PROJECT/evalc
gcloud run deploy evalc \
  --image gcr.io/$PROJECT/evalc \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars GOOGLE_CLOUD_PROJECT=$PROJECT,PHOENIX_HOST=$PHOENIX_HOST,PHOENIX_API_KEY=$PHOENIX_API_KEY,PHOENIX_SPACE_ID=$PHOENIX_SPACE_ID

The container has Python 3.12, runs as non-root, listens on $PORT, and stores its evaluator metadata in /tmp/evalc.db. That file is ephemeral, which suits the hackathon scope; persist via Cloud SQL or GCS if you need durability.

Sandbox model

Generated Python runs in a subprocess with:

  • An import allowlist: stdlib essentials only (re, json, sys, etc.) plus a small curated set (jsonschema, pydantic, email_validator).
  • 10-second per-call timeout.
  • No network access is expected, but none is enforced: egress is whatever Cloud Run grants the container. Restrict it with VPC egress rules in production.

This is hackathon-grade, not pentest-grade isolation. For production, run each evaluator in a Cloud Run job or a bwrap/firejail jail.
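
One way to combine an import allowlist with a per-call timeout is sketched below. It is an assumption about the general approach, not evalc's implementation: the allowlist shown covers only the modules named above, the static `ast` scan is just one possible enforcement mechanism, and the function names are hypothetical.

```python
import ast
import subprocess
import sys

# Subset of the allowlist named above; the real list is longer ("etc.").
ALLOWED_IMPORTS = {"re", "json", "sys", "jsonschema", "pydantic", "email_validator"}
TIMEOUT_SECONDS = 10


def disallowed_imports(source: str) -> set:
    """Return top-level module names imported by `source` that are not allowed."""
    bad = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have no module name
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root not in ALLOWED_IMPORTS:
                bad.add(root)
    return bad


def run_sandboxed(source: str, stdin_text: str = "") -> subprocess.CompletedProcess:
    """Run `source` in a child interpreter after the static import check."""
    bad = disallowed_imports(source)
    if bad:
        raise PermissionError(f"disallowed imports: {sorted(bad)}")
    # subprocess.run raises subprocess.TimeoutExpired if the child exceeds the limit.
    return subprocess.run(
        [sys.executable, "-c", source],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=TIMEOUT_SECONDS,
    )
```

A static scan like this does not catch dynamic imports such as `__import__("os")`, which is one reason OS-level isolation is still advisable for production.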

Architecture

Browser (single-page UI)
   |
   v
Cloud Run (Go service)
   |
   +---> Gemini (code synthesis)
   +---> Phoenix REST API (datasets, experiments, evaluations)
   +---> SQLite (evaluator + run history)
   +---> Python sandbox (deterministic execution)

Package layout:

cmd/evalc/        CLI entry (cobra)
internal/gemini   Vertex / Google AI client
internal/compile  NL criteria -> Python evaluator + self-tests
internal/sandbox  subprocess execution with import allowlist
internal/store    SQLite, dataset loader, schema inference
internal/phoenix  Phoenix REST client
internal/web      HTTP server, SSE progress, embedded static UI

License

MIT. See LICENSE.
