feat: harness engineering — tests, linting, CI, architecture enforcement by ken-cavanagh-glean · Pull Request #7 · askscio/seer

ken-cavanagh-glean · 2026-05-17T22:41:43Z

Summary

Adds mechanical enforcement to Seer following HES v1 principles. The eval framework that evaluates agent quality now has quality gates of its own.

- `bun run check` — canonical check command (typecheck + lint + test)
- Biome v2.4 for linting and formatting (2-space indent, single quotes, no semicolons)
- 77 tests across 7 files in <100ms:
  - Scoring logic (weighted average, edge cases, custom criteria)
  - CSV parsing (quoting, escaping, field handling)
  - Retry logic (5xx/429/408 retry, 4xx pass-through, network errors)
  - Criteria definitions (snapshot of all 10 defaults, lookups, scale mappings)
  - Judge prompt snapshots — locks the exact text of all 7 judge prompts
  - Architecture boundaries — enforces 5-layer import constraints mechanically
  - E2E pipeline — full judge pipeline with mocked Glean API (guidance + golden + safety + metrics + skip logic)
- GitHub Actions CI — 3-gate matrix (typecheck, lint, test) with fail-fast: false
- AGENTS.md — agent-optimized entry point (~60 lines)
- Development ledger — cross-session continuity
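
For readers unfamiliar with the retry rule the tests cover (retry on 5xx/429/408, pass other 4xx through, retry network errors), here is a minimal sketch. The names `isRetryableStatus`, `withRetry`, `maxAttempts`, and `baseDelayMs` are hypothetical, and the exponential backoff is an assumption; the PR states only which outcomes are retried, not how Seer implements it.

```typescript
// Hypothetical helper: true for the status codes the PR lists as retried.
function isRetryableStatus(status: number): boolean {
  const isServerError = status >= 500 && status <= 599
  return isServerError || status === 429 || status === 408
}

interface HttpResult {
  status: number
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms))
}

// Assumed behavior: exponential backoff, and the last failure is thrown
// once attempts run out. Seer's actual policy may differ.
async function withRetry<T extends HttpResult>(
  attempt: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown
  for (let i = 0; i < maxAttempts; i++) {
    try {
      const result = await attempt()
      // Successes and non-retryable 4xx responses go straight back to the caller.
      if (!isRetryableStatus(result.status)) return result
      lastError = new Error(`retryable status ${result.status}`)
    } catch (err) {
      // A thrown error is treated as a network failure and retried.
      lastError = err
    }
    if (i < maxAttempts - 1) await sleep(baseDelayMs * 2 ** i)
  }
  throw lastError
}
```

Keeping the status classification in a pure function means the retry rule can be unit-tested without any network mocking.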

Key refactors (zero behavior change)

- Extracted 7 prompt builders from judge.ts → judge-prompts.ts (pure functions, snapshot-tested)
- Extracted parseCSVLine from cli.ts → lib/csv.ts
- any → unknown in API response interfaces
- Removed unused variables (DEFAULT_CONFIG in simulator, agentId in generate-agent)
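
To illustrate why a standalone `parseCSVLine` is easy to test for quoting and escaping, here is a rough sketch of such a function. It is not the code in lib/csv.ts: the signature and the doubled-quote escaping rule are assumptions based on common CSV conventions.

```typescript
// Hypothetical sketch; the real parseCSVLine in lib/csv.ts may differ.
function parseCSVLine(line: string): string[] {
  const fields: string[] = []
  let current = ''
  let inQuotes = false

  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i)
    if (inQuotes) {
      if (ch === '"') {
        if (line.charAt(i + 1) === '"') {
          // A doubled quote inside a quoted field is an escaped literal quote.
          current += '"'
          i++
        } else {
          inQuotes = false
        }
      } else {
        current += ch
      }
    } else if (ch === '"') {
      inQuotes = true
    } else if (ch === ',') {
      // An unquoted comma ends the current field.
      fields.push(current)
      current = ''
    } else {
      current += ch
    }
  }
  fields.push(current)
  return fields
}
```

Because the function takes a string and returns an array with no I/O, cases such as embedded commas and empty fields can be asserted directly.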

Test plan

- `bun run check` passes (typecheck + lint + 77 tests)
- Browser walkthrough via Playwright: dashboard, golden set detail, run results, expanded rows with judge reasoning, guidance set detail, settings, new eval set
- Introduce deliberate type error → `tsc --noEmit` fails
- Introduce wrong-layer import → architecture test fails
- Change a judge prompt → snapshot test fails
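
The wrong-layer import check can be pictured roughly as follows. This is a sketch under stated assumptions: the five layer names are invented for illustration (the PR does not list them), and the rule that a layer may import only from itself or lower layers is an assumed direction, not something confirmed by the PR.

```typescript
// Hypothetical layer names, ordered lowest to highest.
// The PR does not name Seer's actual five layers.
const LAYERS = ['types', 'lib', 'data', 'services', 'cli'] as const
type Layer = (typeof LAYERS)[number]

interface ImportEdge {
  from: Layer // layer of the importing file
  to: Layer // layer of the imported file
}

// Assumed rule: importing from a higher layer is a violation.
function findViolations(edges: ImportEdge[]): ImportEdge[] {
  return edges.filter((e) => LAYERS.indexOf(e.to) > LAYERS.indexOf(e.from))
}
```

A test built on this idea would collect import edges from the source tree and assert that `findViolations` returns an empty array, so a wrong-layer import fails the suite.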

🤖 Generated with Claude Code

… enforcement

Add mechanical enforcement to Seer following HES v1 principles:
- `bun run check` — canonical check command (typecheck + lint + test)
- biome v2.4.15 for linting and formatting
- 77 tests across 7 files covering scoring, CSV parsing, retry logic, criteria definitions, judge prompt snapshots, architecture boundaries, and full e2e pipeline (guidance + golden + safety + metrics)
- Extract prompt builders from judge.ts → judge-prompts.ts (pure functions)
- Extract parseCSVLine from cli.ts → lib/csv.ts
- Architecture boundary test enforcing 5-layer import constraints
- GitHub Actions CI with 3-gate matrix (fail-fast: false)
- AGENTS.md as agent-optimized entry point
- Development ledger for cross-session continuity

Type fixes: any → unknown in API response interfaces, removed unused variables, biome auto-formatting across all src/ files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Config was evaluated eagerly at module load via `export const config = loadConfig()`, which throws when GLEAN_API_KEY is absent. Changed to lazy singleton `getConfig()` so E2E tests can set dummy env vars before first config access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
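
The eager-to-lazy change described in this commit can be sketched as below. Only `loadConfig`, `getConfig`, and `GLEAN_API_KEY` come from the commit message; the `Config` shape, the `gleanApiKey` field, the error text, and the local `env` object (standing in for `process.env` so the sketch runs anywhere) are assumptions.

```typescript
interface Config {
  gleanApiKey: string // hypothetical field name
}

// Stand-in for process.env so this sketch has no runtime dependencies.
const env: Record<string, string | undefined> = {}

// Throws when the key is absent, as the commit message describes.
function loadConfig(): Config {
  const gleanApiKey = env['GLEAN_API_KEY']
  if (!gleanApiKey) {
    throw new Error('GLEAN_API_KEY is required')
  }
  return { gleanApiKey }
}

let cached: Config | undefined

// Lazy singleton: nothing is read until the first call, and a failed
// load is not cached, so a later call can succeed once the env is set.
function getConfig(): Config {
  if (cached === undefined) {
    cached = loadConfig()
  }
  return cached
}
```

With this shape, importing the module has no side effects, so a test can set a dummy key first and call `getConfig()` afterwards.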

kenneth cavanagh and others added 2 commits May 15, 2026 19:19

ken-cavanagh-glean merged commit 79a5d2c into main May 17, 2026
3 checks passed

ken-cavanagh-glean deleted the harness-engineering branch May 21, 2026 21:54

ken-cavanagh-glean merged 2 commits into main from harness-engineering
