feat: team platform — eval infra wiring, costs, cache, trends (v0.3.10)#78
Open
feat: team platform — eval infra wiring, costs, cache, trends (v0.3.10)#78
Conversation
Design doc for Supabase-backed team data store and universal eval infrastructure. Covers architecture, credential storage, eval formats, YAML test case spec, Supabase schema, phased rollout, and security model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…k-config - Replace project-specific references with generic language - Add missing fields to eval result format: prompt_sha, by_category, timestamp, response_preview - Enrich failure format with details array, scores dict, expectation_type - Add EVAL_JUDGE_CACHE, EVAL_VERBOSE, multiprocess worker support, dedup on push, run scopes, model aliases, judge profiles - Restructure credential storage to 4 layers with gstack-config (v0.3.9) for user preferences (sync_enabled, sync_transcripts) - Update integration points, observability, and reuse map Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DRY up atomicWriteSync, readJSON, getGitInfo, getVersion, getRemoteSlug, and sanitizeForFilename from eval-store.ts, session-runner.ts, and eval-watch.ts into a shared module. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lib/sync-config.ts: reads .gstack-sync.json + ~/.gstack/auth.json - lib/auth.ts: device auth flow (browser OAuth, local HTTP callback) - lib/sync.ts: Supabase push/pull via raw fetch(), offline queue, cache - lib/cli-sync.ts: CLI handler for gstack-sync commands - bin/gstack-sync: bash wrapper (setup, status, push-*, pull, drain) - .gstack-sync.json.example: template for team setup Zero new dependencies — uses raw fetch() against PostgREST API. All sync is non-fatal with 5s timeout and offline queue fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 001_teams.sql: teams + team_members + RLS - 002_eval_runs.sql: eval results with universal format, indexes, upsert key - 003_data_tables.sql: retro, QA, ship, greptile, transcripts + RLS All tables use RLS: team members read/insert, admins delete. Transcript table has tighter policy (admin-only read). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- eval-store.ts: import shared getGitInfo/getVersion, add pushEvalRun() hook in finalize() (non-blocking, non-fatal) - session-runner.ts: import shared atomicWriteSync/sanitizeForFilename - eval-store.test.ts: fix pre-existing bug in double-finalize test (was counting _partial file) - 30 new tests for lib/util, lib/sync-config, lib/sync Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DRY up eval I/O duplicated across scripts/eval-list.ts, eval-compare.ts, and eval-summary.ts. Adds EVAL_DIR constant, formatTimestamp(), listEvalFiles(), loadEvalResults() with --limit support. 13 new tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lib/eval-format.ts: StandardEvalResult interfaces, validateEvalResult(), normalizeFromLegacy/normalizeToLegacy round-trip converters - lib/eval-tier.ts: EvalTier type, resolveTier/resolveJudgeTier from env, tierToModel mapping, TIER_ALIASES (haiku→fast, sonnet→standard, opus→full) - lib/eval-cost.ts: MODEL_PRICING (last verified 2025-05-01), computeCosts(), formatCostDashboard(), aggregateCosts(), fallback for unknown models - 42 tests across 3 test files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache at ~/.gstack/eval-cache/{suite}/{sha}.json. Compute cache keys
from source file contents + test input via Bun.CryptoHasher SHA256.
Supports read/write/stats/clear/verify operations. EVAL_CACHE=0
skips reads for force-rerun. 16 tests including corrupt JSON handling.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lib/cli-eval.ts: routes to list/compare/summary/push/cost/cache/watch subcommands. Ports logic from 4 separate scripts into unified entry. Adds ANSI color for TTY (respects NO_COLOR), --limit flag for list. - bin/gstack-eval: bash wrapper matching bin/gstack-sync pattern - package.json: eval:* scripts now point to lib/cli-eval.ts - supabase/migrations/004_eval_costs.sql: per-model cost tracking + RLS - docs/eval-result-format.md: public format spec for any language - test/lib-eval-cli.test.ts: integration tests (spawn CLI subprocess) including 3 push failure modes (file-not-found, invalid schema, sync unavailable) 215 tests passing across 13 files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract per-model token usage from resultLine.modelUsage (including cache tokens and exact API cost), flow CostEntry[] through EvalCollector, aggregate in finalize(). Extend CostEntry with cache_read_input_tokens, cache_creation_input_tokens, cost_usd. computeCosts() prefers exact cost_usd over MODEL_PRICING when available (~4x more accurate with prompt caching). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
callJudge/judge now return {result, meta} with SHA-based caching
(~$0.18/run savings when SKILL.md unchanged) and dynamic model
selection via EVAL_JUDGE_TIER env var. E2E tests pass --model from
EVAL_TIER to claude -p. outcomeJudge retains simple return type.
All 8 LLM eval test sites updated with real costs and costs[].
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
computeTrends() classifies tests as stable-pass/stable-fail/flaky/ improving/degrading based on pass rate, flip count, and recent streak. gstack eval trend shows sparkline table with --limit, --tier, --test filters. Guard CLI main block with import.meta.main to prevent execution on import. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Explain what team sync gives you, that it's optional, and how to set it up. Points to TEAM_COORDINATION_STORE.md for full guide. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove _comment hacks from JSON example file. Add short team sync section to README explaining what it is, that it's optional, and how to set it up. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract pushWithSync() helper to eliminate boilerplate across 6 push functions. Add pushHeartbeat() for connectivity testing. Add push-greptile to CLI. New commands: gstack-sync test (validates full push/pull flow via sync_heartbeats table), gstack-sync show (terminal team data dashboard with summary/evals/ships/retros views). Guard main block with import.meta.main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add non-fatal sync steps to all 4 skill templates: - /ship Step 8.5: write ship log JSON + push after PR creation - /retro Step 13: push snapshot after JSON save - /qa Phase 6.7: write qa-sync.json + push after health score - greptile-triage: push each triage entry after history file writes All calls use || true for zero disruption. Silent when sync not configured. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 005_sync_heartbeats.sql migration for connectivity testing - eval:trend --team flag pulls team eval data (graceful fallback) - docs/TEAM_SYNC_SETUP.md step-by-step setup guide - Design doc status updated to Phase 2 complete - 10 new tests for sync show formatting functions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…richment Add session intelligence pipeline for team transcript sync: - lib/transcript-sync.ts: parse history.jsonl, enrich with Claude session file data (tools_used, full turn count), sync marker management, 10-concurrent push with 5-concurrent Haiku summarization - lib/llm-summarize.ts: raw fetch() to Anthropic Messages API (no SDK dep), retry-after on 429, exponential backoff on 5xx, SHA-based eval-cache - lib/sync.ts: pushTranscript() and pullTranscripts() following existing patterns - 006_transcript_sync.sql: unique index on (team_id, session_id) for idempotent upsert, RLS changed from admin-only to team-wide read Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ests - cli-sync.ts: push-transcript command, show sessions with formatSessionTable(), upgrade cmdSetup() to interactively create .gstack-sync.json if missing - bin/gstack-sync: add push-transcript case and help text - test/lib-llm-summarize.test.ts: 10 tests with mocked fetch (429 retry, 5xx backoff, malformed response, no API key, cache) - test/lib-transcript-sync.test.ts: 22 tests for parsing, grouping, session file extraction, marker management, slug resolution - test/lib-sync-show.test.ts: 4 tests for formatSessionTable Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- setup-team-sync/SKILL.md.tmpl: idempotent guided setup (create config, OAuth, verify connectivity, configure settings, summary) - ship/retro/qa SKILL.md.tmpl: add push-transcript hook after existing push-ship/push-retro/push-qa hooks (silent, non-fatal) - scripts/gen-skill-docs.ts: add setup-team-sync to template list - Regenerated all SKILL.md files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…weekly digest Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6 functions: detectRegressions, computeVelocity, computeCostTrend, computeLeaderboard, computeQATrend, computeEvalTrend. All pure, no I/O, with division-by-zero guards. 28 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d-queries New `gstack eval leaderboard` subcommand pulls team data and renders weekly stats per contributor. Refactored formatTeamSummary to use computeVelocity from dashboard-queries (DRY). 4 new tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_team RPC) New `gstack team` CLI with create, members, set subcommands. Migration adds team_settings (admin-only), alert_cooldowns (edge-fn dedup), and create_team() SECURITY DEFINER RPC for atomic team + first member creation. 9 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e functions Dashboard: Supabase edge function serving self-contained HTML with PKCE OAuth, 6 parallel client-side REST queries, SVG charts, dark theme, auto-refresh, who's-online from heartbeats. Public URL. Regression alert: webhook on eval_runs INSERT, 5-min cooldown dedup via alert_cooldowns, Slack notification on >5% pass rate drop. Weekly digest: pg_cron Monday 9am UTC, aggregates 7-day team data, Slack message with evals/ships/sessions/costs. 15 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Interactive 8-step setup skill for deploying dashboard + edge functions. Post-ship callout shows team leaderboard after successful sync. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves conflicts in CHANGELOG.md (ordering), CONTRIBUTING.md (eval tools list merge), VERSION (take main's 0.4.1), qa/SKILL.md.tmpl (keep full methodology + baseline line), eval-store.test.ts (drop redundant comment). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ion, dashboard UX From CEO plan review: - Edge functions: early guard on missing env vars instead of non-null assert crash - cli-team: wire isTokenExpired check (was imported but unused) - Migration 007: CHECK constraint on team slug (a-z0-9 hyphens, 2-50 chars) - Dashboard: streak badges on leaderboard, repo slug in who's-online, contextual empty states that teach, 60s refresh (was 30s) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…shipped New TODOs: regression alert links, projected monthly cost, ship-to-Slack notifications, dynamic favicon, server-side aggregation, SSE streaming, GitHub Check Runs, ship_logs index. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
costs[]with exact API-reported per-model token usage (including cache tokens).computeCosts()prefers exact cost over MODEL_PRICING estimates (~4x more accurate with prompt caching).eval-cache.ts. Unchanged SKILL.md = $0 API cost on repeat runs.EVAL_CACHE=0to bypass.EVAL_JUDGE_TIERfor judge model,EVAL_TIERfor E2E model pinning via--modeltoclaude -p.eval:trendCLI — per-test pass rate over N runs with flaky/improving/degrading classification and sparkline table.StandardEvalResultwith validation, normalization, and bidirectional legacy conversion.gstack eval list|compare|summary|trend|push|cost|cache|watchTest plan
bun run eval:trendshows real data from existing eval runsbun run eval:cost <file>shows per-model breakdownEVALS=1 bun run test:evals— verify costs[] populated and cache works(cached)and $0 cost for judge testsEVAL_JUDGE_TIER=haiku EVALS=1 bun run test:evals— haiku for judge🤖 Generated with Claude Code