Seer

Agent evaluation framework for Glean agents using LLM-as-judge methodology

Seer evaluates AI agents built in Glean's Agent Builder. It runs agents, scores their responses across multiple dimensions using a research-backed judge architecture, and tracks results over time.

Prerequisites

Setup

pnpm install

cp .env.example .env
# Add your GLEAN_API_KEY (needs chat + search + agents + documents scopes)

# Start PostgreSQL
pnpm run db:up

# Push the schema
pnpm run db:push

# Optional: seed local demo eval data
pnpm run db:seed

# Verify it works
pnpm dev list sets

Web UI

pnpm --filter web dev

Quick Start

1. Generate an eval set

pnpm dev generate <agent-id> --count 5

Uses Glean's ADVANCED agent with company search to find real input values from your CRM/documents and generate grounded evaluation guidance.

2. Run evaluation

# Quick mode (coverage + faithfulness, 2 judge calls/case)
pnpm dev run <set-id>

# Deep mode (+ factuality verification via company search)
pnpm dev run <set-id> --deep

# Multi-judge (Opus 4.6 + GPT-5)
pnpm dev run <set-id> --multi-judge

3. View results

pnpm dev results <run-id>

Or use the Web UI for formatted results with markdown rendering and research-backed tooltips.

How Scoring Works

Three judge calls, each measuring something different:

| Call | Dimensions | What it checks against | Needs expected answer? |
|---|---|---|---|
| Coverage | Topical Coverage, Response Quality | Eval guidance (themes to cover) | Yes |
| Faithfulness | Groundedness, Hallucination Risk | Agent's own retrieved documents | No |
| Factuality | Factual Accuracy | Live company data (judge searches independently) | No |
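The split between quick and deep mode can be pictured as a lookup over the three calls. This is a minimal sketch, not Seer's actual code: the type and function names (`JudgeCall`, `CallSpec`, `callsForMode`) are hypothetical, and only the call names, dimensions, and the quick/deep split are taken from this README.

```typescript
// Hypothetical names; only the call names, dimensions, and the
// quick/deep split come from the README.
type JudgeCall = "coverage" | "faithfulness" | "factuality";

interface CallSpec {
  dimensions: string[];
  checksAgainst: string;
  needsExpectedAnswer: boolean;
}

const CALLS: Record<JudgeCall, CallSpec> = {
  coverage: {
    dimensions: ["Topical Coverage", "Response Quality"],
    checksAgainst: "eval guidance (themes to cover)",
    needsExpectedAnswer: true,
  },
  faithfulness: {
    dimensions: ["Groundedness", "Hallucination Risk"],
    checksAgainst: "the agent's own retrieved documents",
    needsExpectedAnswer: false,
  },
  factuality: {
    dimensions: ["Factual Accuracy"],
    checksAgainst: "live company data",
    needsExpectedAnswer: false,
  },
};

// Quick mode runs coverage + faithfulness; --deep adds factuality.
function callsForMode(deep: boolean): JudgeCall[] {
  const quick: JudgeCall[] = ["coverage", "faithfulness"];
  return deep ? [...quick, "factuality"] : quick;
}

// List the dimensions a run would score in a given mode.
function dimensionsForMode(deep: boolean): string[] {
  return callsForMode(deep).flatMap((call) => CALLS[call].dimensions);
}
```

Under these assumptions, `dimensionsForMode(false)` yields the four quick-mode dimensions and `dimensionsForMode(true)` adds Factual Accuracy.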

Categorical scale (not 1-10): full, substantial, partial, minimal, failure

Categories are 15% more reliable than continuous scales (SJT research). The judge commits to a defined bucket instead of picking an arbitrary number.
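One way to work with an ordered categorical scale is to treat the buckets as ranks and aggregate by median rather than by averaging numbers. The sketch below only illustrates that idea. The five labels come from this README, but the median aggregation and the names `SCALE`, `rank`, and `medianCategory` are assumptions, not Seer's implementation.

```typescript
// Labels from the README, ordered worst to best.
const SCALE = ["failure", "minimal", "partial", "substantial", "full"] as const;
type Category = (typeof SCALE)[number];

// Position of a category on the ordinal scale (0 = failure, 4 = full).
function rank(category: Category): number {
  return SCALE.indexOf(category);
}

// Assumption: aggregate several verdicts (e.g. from multiple judges) by
// taking the lower median, so ties resolve to the more conservative bucket.
function medianCategory(verdicts: Category[]): Category {
  const sorted = [...verdicts].sort((a, b) => rank(a) - rank(b));
  const picked = sorted[Math.floor((sorted.length - 1) / 2)];
  if (picked === undefined) {
    throw new Error("no verdicts to aggregate");
  }
  return picked;
}
```

The result is always one of the defined buckets, so an aggregate never lands on a value the judge could not have chosen.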

Configuration

Option A: Settings UI

Open /settings in the web UI. Saves to data/settings.json.

Option B: .env file

GLEAN_API_KEY=your_key_here
GLEAN_BACKEND=https://your-instance-be.glean.com
GLEAN_INSTANCE=your-instance
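A small validation step can catch a missing variable before any API call is made. This is a hedged sketch: `GleanConfig` and `loadConfig` are hypothetical names, and it assumes all three variables are required, which this README does not state.

```typescript
interface GleanConfig {
  apiKey: string;
  backend: string;
  instance: string;
}

// Takes the environment as a plain object so it can be tested without Node
// globals. Assumption: all three variables are mandatory.
function loadConfig(env: Record<string, string | undefined>): GleanConfig {
  const missing: string[] = [];
  const need = (key: string): string => {
    const value = env[key];
    if (!value) missing.push(key);
    return value ?? "";
  };

  const config: GleanConfig = {
    apiKey: need("GLEAN_API_KEY"),
    backend: need("GLEAN_BACKEND"),
    instance: need("GLEAN_INSTANCE"),
  };

  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return config;
}
```

In a Node process this would be called as `loadConfig(process.env)` after the `.env` file has been loaded.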

Commands

# Eval sets
pnpm dev set create --name <name> --agent-id <id>
pnpm dev set add-case <set-id> --query <query>
pnpm dev set view <set-id>
pnpm dev list sets

# Generate
pnpm dev generate <agent-id> --count <n>

# Run & results
pnpm dev run <set-id> [--deep] [--multi-judge] [--multi-turn] [--max-turns 5]
pnpm dev results <run-id>
pnpm dev list runs

# Local demo data
pnpm run db:seed         # idempotently create demo sets, runs, results, scores
pnpm run db:seed:reset   # delete and recreate only demo rows

Architecture

CLI ←→ Shared PostgreSQL ←→ Web UI
              ↓
        Eval Engine
      ├── Agent Runner    (Agents Runs API for workflow, Chat API for autonomous)
      ├── Simulator       (LLM-based simulated user for multi-turn conversations)
      ├── Smart Generator (ADVANCED agent + company tools)
      ├── Judge           (4-call architecture, Opus 4.6)
      └── Metrics         (latency, tool calls)

See docs/evaluation-framework.md for the full evaluation philosophy and docs/architecture.md for system design.
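The Agent Runner and Simulator rows above can be sketched as a dispatch function plus a bounded conversation loop. Everything here is illustrative: the names (`backendFor`, `runConversation`, `Turn`) are hypothetical, the agent and simulated user are plain synchronous stubs rather than API or LLM calls, and it assumes one "turn" means one user message plus one agent reply.

```typescript
type AgentKind = "workflow" | "autonomous";

// Workflow agents go through the Agents Runs API, autonomous ones through
// the Chat API (per the diagram). The string tags are placeholders.
function backendFor(kind: AgentKind): "agents-runs-api" | "chat-api" {
  return kind === "workflow" ? "agents-runs-api" : "chat-api";
}

interface Turn {
  role: "user" | "agent";
  text: string;
}

// Stubs standing in for the real agent call and the LLM-based simulated user.
type AgentFn = (history: Turn[]) => string;
type SimulatedUserFn = (history: Turn[]) => string | null; // null ends the chat

// Alternate user and agent messages until the simulated user stops or the
// turn budget (cf. --max-turns 5) is exhausted.
function runConversation(
  firstQuery: string,
  agent: AgentFn,
  simulatedUser: SimulatedUserFn,
  maxTurns = 5,
): Turn[] {
  const history: Turn[] = [];
  let next: string | null = firstQuery;
  for (let turn = 0; turn < maxTurns && next !== null; turn++) {
    history.push({ role: "user", text: next });
    history.push({ role: "agent", text: agent(history) });
    next = simulatedUser(history);
  }
  return history;
}
```

Keeping the agent and simulator behind function types means the loop can be exercised with stubs before any live service is involved.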

License

MIT