Seer

Agent evaluation framework for Glean agents using LLM-as-judge methodology

Seer evaluates AI agents built in Glean's Agent Builder. It runs agents, scores their responses across multiple dimensions using a research-backed judge architecture, and tracks results over time.

Prerequisites

Node.js v21.7+
pnpm
Docker (for PostgreSQL)

Setup

pnpm install

cp .env.example .env
# Add your GLEAN_API_KEY (needs chat + search + agents + documents scopes)

# Start PostgreSQL
pnpm run db:up

# Push the schema
pnpm run db:push

# Optional: seed local demo eval data
pnpm run db:seed

# Verify it works
pnpm dev list sets

Web UI

pnpm --filter web dev

Quick Start

1. Generate an eval set

pnpm dev generate <agent-id> --count 5

This uses Glean's ADVANCED agent with company search to find real input values in your CRM and documents, then generates grounded evaluation guidance from them.

2. Run evaluation

# Quick mode (coverage + faithfulness, 2 judge calls/case)
pnpm dev run <set-id>

# Deep mode (+ factuality verification via company search)
pnpm dev run <set-id> --deep

# Multi-judge (Opus 4.6 + GPT-5)
pnpm dev run <set-id> --multi-judge

3. View results

pnpm dev results <run-id>

Or use the Web UI for formatted results with markdown rendering and research-backed tooltips.

How Scoring Works

Three judge calls, each measuring something different:

Call	Dimensions	What it checks against	Needs expected answer?
Coverage	Topical Coverage, Response Quality	Eval guidance (themes to cover)	Yes
Faithfulness	Groundedness, Hallucination Risk	Agent's own retrieved documents	No
Factuality	Factual Accuracy	Live company data (judge searches independently)	No
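The relationship between run mode and judge calls can be sketched as follows. This is an illustration only: `JudgeCall`, `RunOptions`, `selectJudgeCalls`, and `needsExpectedAnswer` are hypothetical names, not Seer's actual API. It assumes only what the table and the Quick Start state: quick mode runs coverage and faithfulness, and `--deep` adds factuality.

```typescript
// Hypothetical names throughout; Seer's real types may differ.
type JudgeCall = "coverage" | "faithfulness" | "factuality";

interface RunOptions {
  deep?: boolean; // mirrors the --deep flag
}

// Quick mode runs coverage + faithfulness; deep mode adds factuality.
function selectJudgeCalls(options: RunOptions = {}): JudgeCall[] {
  const calls: JudgeCall[] = ["coverage", "faithfulness"];
  if (options.deep) {
    calls.push("factuality");
  }
  return calls;
}

// Per the table, only the coverage call compares against eval guidance.
function needsExpectedAnswer(call: JudgeCall): boolean {
  return call === "coverage";
}
```

Under this sketch, a test case without eval guidance could still be scored by the faithfulness and factuality calls.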

Categorical scale (not 1-10): full → substantial → partial → minimal → failure

Categories are 15% more reliable than continuous scales (SJT research). The judge commits to a defined bucket instead of picking an arbitrary number.
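A minimal sketch of how the categorical scale might be represented in TypeScript. The names (`CATEGORIES`, `parseCategory`, `rank`, `atLeast`) are hypothetical, and the ordinal ranks are an assumption used only for comparing buckets, not Seer's actual scoring weights.

```typescript
// The five buckets from the scale above, ordered worst to best.
const CATEGORIES = ["failure", "minimal", "partial", "substantial", "full"] as const;
type Category = (typeof CATEGORIES)[number];

// Validate raw judge output so only a defined bucket is accepted.
function parseCategory(raw: string): Category {
  const normalized = raw.trim().toLowerCase();
  const match = CATEGORIES.find((c) => c === normalized);
  if (match === undefined) {
    throw new Error(`Unknown category: ${raw}`);
  }
  return match;
}

// Ordinal position: 0 (failure) through 4 (full). Assumed for comparison only.
function rank(category: Category): number {
  return CATEGORIES.indexOf(category);
}

// True when a score meets or exceeds a threshold bucket.
function atLeast(score: Category, threshold: Category): boolean {
  return rank(score) >= rank(threshold);
}
```

Rejecting anything outside the five labels is what keeps the judge committed to a bucket: a reply such as "7" or "mostly good" fails parsing instead of being silently coerced.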

Configuration

Option A: Settings UI

Open /settings in the web UI. Changes are saved to data/settings.json.

Option B: .env file

GLEAN_API_KEY=your_key_here
GLEAN_BACKEND=https://your-instance-be.glean.com
GLEAN_INSTANCE=your-instance
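As an illustration of how these variables could be validated at startup, here is a small sketch. `GleanConfig` and `loadGleanConfig` are hypothetical names, and treating all three variables as required is an assumption; Seer's own loader may behave differently.

```typescript
// Hypothetical config shape built from the three variables above.
interface GleanConfig {
  apiKey: string;
  backend: string;
  instance: string;
}

// Takes the environment as a plain object so it is easy to test;
// in a Node.js process you would pass process.env.
function loadGleanConfig(env: Record<string, string | undefined>): GleanConfig {
  const required = ["GLEAN_API_KEY", "GLEAN_BACKEND", "GLEAN_INSTANCE"] as const;

  // Collect every missing name so the error reports them all at once.
  const missing = required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
  if (missing.length > 0) {
    throw new Error(`Missing required settings: ${missing.join(", ")}`);
  }

  return {
    apiKey: (env.GLEAN_API_KEY as string).trim(),
    // Drop trailing slashes so paths can be appended consistently.
    backend: (env.GLEAN_BACKEND as string).trim().replace(/\/+$/, ""),
    instance: (env.GLEAN_INSTANCE as string).trim(),
  };
}
```

Failing fast with the full list of missing names tends to be friendlier than discovering a missing key on the first API call.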

Commands

# Eval sets
pnpm dev set create --name <name> --agent-id <id>
pnpm dev set add-case <set-id> --query <query>
pnpm dev set view <set-id>
pnpm dev list sets

# Generate
pnpm dev generate <agent-id> --count <n>

# Run & results
pnpm dev run <set-id> [--deep] [--multi-judge] [--multi-turn] [--max-turns 5]
pnpm dev results <run-id>
pnpm dev list runs

# Local demo data
pnpm run db:seed         # idempotently create demo sets, runs, results, scores
pnpm run db:seed:reset   # delete and recreate only demo rows

Architecture

CLI ←→ Shared PostgreSQL ←→ Web UI
              ↓
        Eval Engine
      ├── Agent Runner    (Agents Runs API for workflow, Chat API for autonomous)
      ├── Simulator       (LLM-based simulated user for multi-turn conversations)
      ├── Smart Generator (ADVANCED agent + company tools)
      ├── Judge           (4-call architecture, Opus 4.6)
      └── Metrics         (latency, tool calls)

See docs/evaluation-framework.md for the full evaluation philosophy and docs/architecture.md for system design.
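To make the Simulator's role in multi-turn runs concrete, here is a rough sketch of a conversation loop. Everything in it is hypothetical: the `Turn`, `AgentRunner`, and `Simulator` types, the `runConversation` function, and the convention that the simulator returns `null` to end the conversation are assumptions for illustration, not Seer's actual interfaces. Only the default of 5 turns is taken from the `--max-turns 5` flag shown in Commands.

```typescript
// Hypothetical interfaces; Seer's real Agent Runner and Simulator may differ.
interface Turn {
  role: "user" | "agent";
  text: string;
}

// Sends the conversation so far to the agent and returns its reply.
type AgentRunner = (history: Turn[]) => Promise<string>;

// Returns the next simulated user message, or null when the user is done.
type Simulator = (history: Turn[]) => Promise<string | null>;

async function runConversation(
  firstQuery: string,
  runAgent: AgentRunner,
  simulateUser: Simulator,
  maxTurns = 5, // mirrors --max-turns 5
): Promise<Turn[]> {
  const history: Turn[] = [];
  let userMessage: string | null = firstQuery;

  for (let turn = 0; turn < maxTurns && userMessage !== null; turn++) {
    history.push({ role: "user", text: userMessage });
    const reply = await runAgent(history);
    history.push({ role: "agent", text: reply });

    // Skip the simulator call when no further turn will be taken.
    userMessage = turn + 1 < maxTurns ? await simulateUser(history) : null;
  }
  return history;
}
```

The resulting transcript could then be handed to the judge as a single case; how Seer actually scores multi-turn conversations is described in the docs referenced above.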

License

MIT
