feat: harness engineering — tests, linting, CI, architecture enforcement #7

Merged: ken-cavanagh-glean merged 2 commits into main from harness-engineering on May 17, 2026

@ken-cavanagh-glean
Collaborator

Summary

Adds mechanical enforcement to Seer following HES v1 principles. The eval framework that evaluates agent quality now has quality gates of its own.

  • bun run check — canonical check command (typecheck + lint + test)
  • Biome v2.4 for linting and formatting (2-space indent, single quotes, no semicolons)
  • 77 tests across 7 files in <100ms:
    • Scoring logic (weighted average, edge cases, custom criteria)
    • CSV parsing (quoting, escaping, field handling)
    • Retry logic (5xx/429/408 retry, 4xx pass-through, network errors)
    • Criteria definitions (snapshot of all 10 defaults, lookups, scale mappings)
    • Judge prompt snapshots — locks the exact text of all 7 judge prompts
    • Architecture boundaries — enforces 5-layer import constraints mechanically
    • E2E pipeline — full judge pipeline with mocked Glean API (guidance + golden + safety + metrics + skip logic)
  • GitHub Actions CI — 3-gate matrix (typecheck, lint, test) with fail-fast: false
  • AGENTS.md — agent-optimized entry point (~60 lines)
  • Development ledger — cross-session continuity
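The retry rules listed above (retry on 5xx/429/408, pass other 4xx through, retry network errors) can be sketched roughly as follows. This is an illustration, not Seer's actual retry module: `HttpResult`, `isRetryableStatus`, `withRetry`, and the `sleep` parameter are hypothetical names, and the attempt limit of 3 is an assumption.

```typescript
// Hypothetical names throughout; Seer's real retry helper may look different.
interface HttpResult {
  status: number
}

// Retry on 5xx, 429 (rate limited), and 408 (request timeout).
// Any other 4xx is handed back to the caller without retrying.
function isRetryableStatus(status: number): boolean {
  return (status >= 500 && status <= 599) || status === 429 || status === 408
}

async function withRetry(
  attempt: () => Promise<HttpResult>,
  maxAttempts = 3, // assumed limit, not taken from the PR
  sleep: (attemptIndex: number) => Promise<void> = async () => {},
): Promise<HttpResult> {
  for (let i = 0; ; i++) {
    const isLast = i >= maxAttempts - 1
    try {
      const result = await attempt()
      // Return on success, on a non-retryable status, or when attempts run out.
      if (isLast || !isRetryableStatus(result.status)) return result
    } catch (err) {
      // A rejected promise stands in for a network error; rethrow on the last attempt.
      if (isLast) throw err
    }
    await sleep(i)
  }
}
```

Injecting `sleep` rather than calling a timer directly is one way to keep such tests fast, which would be consistent with the sub-100ms suite time reported above.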

Key refactors (zero behavior change)

  • Extracted 7 prompt builders from judge.ts → judge-prompts.ts (pure functions, snapshot-tested)
  • Extracted parseCSVLine from cli.ts → lib/csv.ts
  • any → unknown in API response interfaces
  • Removed unused variables (DEFAULT_CONFIG in simulator, agentId in generate-agent)
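For readers unfamiliar with what a line-level CSV parser needs to handle (quoting, escaped quotes, empty fields), here is a minimal sketch. It reuses the name `parseCSVLine` from the refactor above, but the body is an assumption; the real implementation in lib/csv.ts may differ in signature and edge-case behavior.

```typescript
// Sketch only: the actual parseCSVLine in lib/csv.ts may handle more cases.
function parseCSVLine(line: string): string[] {
  const fields: string[] = []
  let current = ''
  let inQuotes = false
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i)
    if (inQuotes) {
      if (ch === '"' && line.charAt(i + 1) === '"') {
        current += '"' // doubled quote inside a quoted field is a literal quote
        i++
      } else if (ch === '"') {
        inQuotes = false // closing quote
      } else {
        current += ch // commas are literal while inside quotes
      }
    } else if (ch === '"') {
      inQuotes = true // opening quote
    } else if (ch === ',') {
      fields.push(current) // field separator
      current = ''
    } else {
      current += ch
    }
  }
  fields.push(current) // last field has no trailing comma
  return fields
}
```

Because the function is pure (string in, array out), it can be unit-tested without touching the CLI, which is presumably the point of moving it out of cli.ts.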

Test plan

  • bun run check passes (typecheck + lint + 77 tests)
  • Browser walkthrough via Playwright: dashboard, golden set detail, run results, expanded rows with judge reasoning, guidance set detail, settings, new eval set
  • Introduce deliberate type error → tsc --noEmit fails
  • Introduce wrong-layer import → architecture test fails
  • Change a judge prompt → snapshot test fails
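The "wrong-layer import" check in the test plan could work along these lines. The PR does not name the five layers or say how imports are collected, so the layer names, the `src/<layer>/` path convention, and the `findViolations` helper below are all hypothetical.

```typescript
// Hypothetical layer names, ordered from lowest (index 0) to highest (index 4).
const LAYERS = ['types', 'lib', 'core', 'server', 'cli'] as const
type Layer = (typeof LAYERS)[number]

interface ImportEdge {
  from: string // importing file, e.g. 'src/lib/csv.ts'
  to: string // imported file
}

// Assumes each layer lives in its own directory under src/.
function layerOf(path: string): Layer | undefined {
  return LAYERS.find((layer) => path.startsWith(`src/${layer}/`))
}

// A file may import from its own layer or a lower one, never from a higher one.
function findViolations(edges: ImportEdge[]): ImportEdge[] {
  return edges.filter(({ from, to }) => {
    const fromLayer = layerOf(from)
    const toLayer = layerOf(to)
    // Files outside the known layers are ignored in this sketch.
    if (fromLayer === undefined || toLayer === undefined) return false
    return LAYERS.indexOf(toLayer) > LAYERS.indexOf(fromLayer)
  })
}
```

An architecture test would then gather import edges from the source tree and assert that `findViolations` returns an empty array, so a deliberately wrong import fails the test.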

🤖 Generated with Claude Code

kenneth cavanagh and others added 2 commits May 15, 2026 19:19
… enforcement

Add mechanical enforcement to Seer following HES v1 principles:

- `bun run check` — canonical check command (typecheck + lint + test)
- biome v2.4.15 for linting and formatting
- 77 tests across 7 files covering scoring, CSV parsing, retry logic,
  criteria definitions, judge prompt snapshots, architecture boundaries,
  and full e2e pipeline (guidance + golden + safety + metrics)
- Extract prompt builders from judge.ts → judge-prompts.ts (pure functions)
- Extract parseCSVLine from cli.ts → lib/csv.ts
- Architecture boundary test enforcing 5-layer import constraints
- GitHub Actions CI with 3-gate matrix (fail-fast: false)
- AGENTS.md as agent-optimized entry point
- Development ledger for cross-session continuity

Type fixes: any → unknown in API response interfaces, removed unused
variables, biome auto-formatting across all src/ files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Config was evaluated eagerly at module load via `export const config = loadConfig()`,
which throws when GLEAN_API_KEY is absent. Changed to lazy singleton `getConfig()`
so E2E tests can set dummy env vars before first config access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
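
The eager-to-lazy config change described in the second commit can be sketched as below. Only `loadConfig`, `getConfig`, and `GLEAN_API_KEY` come from the commit message; the `Config` shape, the `Env` type, and the `createConfigGetter` factory are assumptions made so the example does not depend on `process.env`.

```typescript
interface Config {
  apiKey: string
}

type Env = Record<string, string | undefined>

// Throws when the key is absent, mirroring the behavior the commit describes.
function loadConfig(env: Env): Config {
  const apiKey = env['GLEAN_API_KEY']
  if (!apiKey) throw new Error('GLEAN_API_KEY is required')
  return { apiKey }
}

// Lazy singleton: nothing is read from env until the getter is first called,
// and the result is cached for every later call.
function createConfigGetter(env: Env): () => Config {
  let cached: Config | undefined
  return () => {
    if (cached === undefined) cached = loadConfig(env)
    return cached
  }
}
```

With this shape, creating the getter never throws on its own, so a test can populate the environment with a dummy key before the first `getConfig()` call. An eager `export const config = loadConfig()` would instead throw at module load.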
@ken-cavanagh-glean ken-cavanagh-glean merged commit 79a5d2c into main May 17, 2026
3 checks passed
@ken-cavanagh-glean ken-cavanagh-glean deleted the harness-engineering branch May 21, 2026 21:54
