garrytan · Mar 15, 2026
diff --git a/.gitignore b/.gitignore
@@ -11,3 +11,4 @@ bun.lock
 .env.local
 .env.*
 !.env.example
+.gstack-sync.json
diff --git a/.gstack-sync.json.example b/.gstack-sync.json.example
@@ -0,0 +1,5 @@
+{
+  "supabase_url": "https://YOUR_PROJECT.supabase.co",
+  "supabase_anon_key": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.YOUR_ANON_KEY_HERE",
+  "team_slug": "your-team-name"
+}
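A minimal sketch of how a loader might validate this file, assuming all three keys shown in the example are required and that the URL must use HTTPS. `SyncConfig` and `parseSyncConfig` are hypothetical names for illustration, not the project's actual API. Returning `null` instead of throwing follows the changelog's stated goal that sync stays non-fatal when misconfigured.

```typescript
// Hypothetical shape mirroring .gstack-sync.json.example (names are illustrative).
interface SyncConfig {
  supabase_url: string;
  supabase_anon_key: string;
  team_slug: string;
}

// Returns a validated config, or null when the JSON is malformed or incomplete.
// Assumption: every key is a required, non-empty string and the URL is https.
function parseSyncConfig(raw: string): SyncConfig | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // malformed JSON is treated as "sync not configured"
  }
  if (typeof data !== "object" || data === null) return null;

  const rec = data as Record<string, unknown>;
  const keys = ["supabase_url", "supabase_anon_key", "team_slug"] as const;
  for (const key of keys) {
    const value = rec[key];
    if (typeof value !== "string" || value.length === 0) return null;
  }

  const url = rec.supabase_url as string;
  if (!url.startsWith("https://")) return null;

  return {
    supabase_url: url,
    supabase_anon_key: rec.supabase_anon_key as string,
    team_slug: rec.team_slug as string,
  };
}
```

A caller would read the file contents itself and treat a `null` result the same as a missing file, so local-only behavior is unchanged.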
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -58,6 +58,29 @@
 ### Fixed
 - Browser ref staleness — refs collected before page mutation (e.g. SPA navigation) are now detected and re-collected. Eliminates a class of flaky QA failures on dynamic sites.
 
+## 0.3.10 — 2026-03-15
+
+### Added
+- **Team sync via Supabase (optional)** — shared data store for eval results, retro snapshots, QA reports, ship logs, and Greptile triage across team members. All sync operations are non-fatal and non-blocking — skills never wait on network. Offline queue with automatic retry (up to 5 attempts). Zero impact when not configured: without `.gstack-sync.json`, everything works locally as before. See `docs/designs/TEAM_COORDINATION_STORE.md` for architecture and setup.
+- **Supabase migration SQL** — 4 migration files in `supabase/migrations/` for teams, eval_runs, data tables (retros, QA, ships, Greptile), and eval costs. Row-level security policies ensure team members can only access their own team's data.
+- **Sync config + auth** — `.gstack-sync.json` for project-level config (Supabase URL, anon key, team slug). `~/.gstack/auth.json` for user-level tokens (keyed by Supabase URL for multi-team support). `GSTACK_SUPABASE_ACCESS_TOKEN` env var for CI/automation. Token refresh built in.
+- **`gstack-sync` CLI** — `setup`, `status`, `test`, `show`, `push-*`, `pull`, `drain`, `logout` subcommands for managing team sync.
+- **Universal eval format** — `StandardEvalResult` schema with validation, normalization, and bidirectional legacy conversion. Any language can produce JSON matching this format and push via `gstack eval push`.
+- **Unified eval CLI** — `gstack eval list|compare|summary|trend|push|cost|cache|watch` consolidating all eval tools into one entry point.
+- **Per-model cost tracking** — eval results now include `costs[]` with exact per-model token usage (input, output, cache read, cache creation) and API-reported cost. Extracted from `resultLine.modelUsage` in the `claude -p` NDJSON stream. `computeCosts()` prefers exact `cost_usd` over MODEL_PRICING estimates (~4x more accurate with prompt caching).
+- **LLM judge caching** — SHA-based caching for LLM-as-judge eval calls via `eval-cache.ts`. Cache keyed by `model:prompt`, so unchanged SKILL.md content skips API calls entirely. ~$0.18/run savings. Set `EVAL_CACHE=0` to force re-run.
+- **Dynamic model selection** — `EVAL_JUDGE_TIER` env var controls which Claude model runs judge evals (haiku/sonnet/opus, default: sonnet). `EVAL_TIER` pins the E2E test model via `--model` flag to `claude -p`.
+- **`bun run eval:trend`** — per-test pass rate tracking over last N runs. Classifies tests as stable-pass, stable-fail, flaky, improving, or degrading. Sparkline table with `--limit`, `--tier`, `--test` filters.
+- **Shared utilities** — `lib/util.ts` extracted with `atomicWriteJSON`, `readJSON`, `getGitInfo`, `getRemoteSlug`, `listEvalFiles`, `loadEvalResults`, `formatTimestamp`, and path constants.
+- 52+ new tests across eval cache, cost, format, tier, trend, sync config, sync client, and LLM judge integration.
+
+### Changed
+- `callJudge()` and `judge()` now return `{ result, meta }` with `JudgeMeta` (model, tokens, cached flag). `outcomeJudge()` retains simple return type for E2E callers.
+- `EvalCollector.finalize()` aggregates per-test `costs[]` into result-level cost breakdown and attempts team sync (non-blocking).
+- `cli-eval.ts` main block guarded with `import.meta.main` to prevent execution on import.
+- `eval:summary` now hints to run `eval:trend` when flaky tests are detected.
+- All 8 LLM eval test sites updated from hard-coded `cost_usd: 0.02` to real API-reported costs.
+
 ## 0.3.9 — 2026-03-15
 
 ### Added

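The 0.3.10 changelog entry for `eval:trend` names five trend classes but not the rules behind them. The sketch below shows one way such a classifier could work, assuming a per-test pass/fail history ordered oldest first. The flip-counting heuristic and the name `classifyTrend` are assumptions for illustration; the actual `computeTrends` logic in `lib/cli-eval.ts` may differ.

```typescript
type Trend = "stable-pass" | "stable-fail" | "flaky" | "improving" | "degrading";

// Classify one test's history (oldest first). Heuristic, assumed for this sketch:
//   all passes        -> stable-pass
//   all failures      -> stable-fail
//   exactly one flip  -> improving (ends on pass) or degrading (ends on fail)
//   more than one flip -> flaky
function classifyTrend(results: boolean[]): Trend {
  if (results.length === 0) throw new Error("no results to classify");
  if (results.every((passed) => passed)) return "stable-pass";
  if (results.every((passed) => !passed)) return "stable-fail";

  // Count pass/fail transitions between consecutive runs.
  let flips = 0;
  for (let i = 1; i < results.length; i++) {
    if (results[i] !== results[i - 1]) flips++;
  }

  if (flips === 1) {
    return results[results.length - 1] ? "improving" : "degrading";
  }
  return "flaky";
}
```

For example, a history of fail, fail, pass, pass has a single transition and ends on a pass, so this heuristic would label it improving, while an alternating history would be labeled flaky.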
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -15,6 +15,7 @@ bun run dev:skill    # watch mode: auto-regen + validate on change
 bun run eval:list    # list all eval runs from ~/.gstack-dev/evals/
 bun run eval:compare # compare two eval runs (auto-picks most recent)
 bun run eval:summary # aggregate stats across all eval runs
+bun run eval:trend   # per-test pass rate trends (flaky detection)
 ```
 
 `test:evals` requires `ANTHROPIC_API_KEY`. E2E tests stream progress in real-time

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -171,6 +171,8 @@ When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
 bun run eval:list            # list all eval runs (turns, duration, cost per run)
 bun run eval:compare         # compare two runs — shows per-test deltas + Takeaway commentary
 bun run eval:summary         # aggregate stats + per-test efficiency averages across runs
+bun run eval:trend           # per-test pass rate over last N runs (flaky detection)
+bun run eval:cache stats     # check LLM judge cache hit rate
 ```
 
 **Eval comparison commentary:** `eval:compare` generates natural-language Takeaway sections interpreting what changed between runs — flagging regressions, noting improvements, calling out efficiency gains (fewer turns, faster, cheaper), and producing an overall summary. This is driven by `generateCommentary()` in `eval-store.ts`.
@@ -191,7 +193,8 @@ Each dimension is scored 1-5. Threshold: every dimension must score **≥ 4**. T
 # Needs ANTHROPIC_API_KEY in .env — included in bun run test:evals
 ```
 
-- Uses `claude-sonnet-4-6` for scoring stability
+- Model defaults to `claude-sonnet-4-6`; override with `EVAL_JUDGE_TIER=haiku|opus`
+- Results are SHA-cached — unchanged SKILL.md content skips API calls ($0 on repeat runs). Set `EVAL_CACHE=0` to force re-run.
 - Tests live in `test/skill-llm-eval.test.ts`
 - Calls the Anthropic API directly (not `claude -p`), so it works from anywhere including inside Claude Code
 

diff --git a/README.md b/README.md
@@ -630,6 +630,12 @@ bun run eval:watch            # live dashboard during E2E runs
 
 E2E tests stream real-time progress, write machine-readable diagnostics, and persist partial results that survive kills. See CONTRIBUTING.md for the full eval infrastructure.
 
+### Team sync (optional)
+
+For teams, gstack can sync eval results, retro snapshots, QA reports, and ship logs to a shared Supabase store. Without this, everything works locally as before — sync is purely additive.
+
+To set up: copy `.gstack-sync.json.example` to `.gstack-sync.json`, create a Supabase project, run the migrations in `supabase/migrations/`, and fill in your credentials. See `docs/designs/TEAM_COORDINATION_STORE.md` for the full guide.
+
 ## License
 
 MIT
diff --git a/TODOS.md b/TODOS.md
@@ -231,7 +231,7 @@
 
 **Why:** Spot quality trends — is the app getting better or worse?
 
-**Context:** QA already writes structured reports. This adds cross-run comparison.
+**Context:** `eval:trend` now tracks test-level pass rates (eval infrastructure). QA-run-level trending (health scores over time across QA report files) is a separate feature that could reuse the `computeTrends` pattern from `lib/cli-eval.ts`.
 
 **Effort:** S
 **Priority:** P2
@@ -277,6 +277,130 @@
 **Priority:** P3
 **Depends on:** Browse sessions
 
+## Team Sync
+
+### Streaming parser for large session files
+
+**What:** Replace readFileSync with readline/createReadStream for session files >10MB.
+
+**Why:** Session files >10MB are currently skipped. Long sessions (1000+ turns, 35MB) lose enrichment data (tools_used, full turn count).
+
+**Context:** Current 10MB cap is defensive. Session files at `~/.claude/projects/{hash}/{sid}.jsonl` can be 35MB for marathon sessions. Streaming parser removes the cap while keeping memory usage constant.
+
+**Effort:** S
+**Priority:** P3
+**Depends on:** Transcript sync (Phase 3)
+
+### Session effectiveness scoring
+
+**What:** Compute a 1-5 effectiveness score per session based on turns to achieve goal, tool diversity, whether code was shipped, and session duration.
+
+**Why:** Enables `show sessions --best` and team-level AI effectiveness metrics. Raw data (tools_used, turns, duration, summary) already in Supabase after transcript sync.
+
+**Context:** Year 2 roadmap item. Scoring heuristics need iteration. Could start with: fewer turns = more efficient, more tool diversity = better problem decomposition, shipped code (detected via git) = successful outcome.
+
+**Effort:** M
+**Priority:** P2
+**Depends on:** Transcript sync (Phase 3)
+
+### ~~Weekly AI usage digest~~ ✓ Shipped in Phase 4
+
+Implemented as `supabase/functions/weekly-digest/index.ts`. Runs via pg_cron every Monday at 9am UTC, aggregates 7-day team data, and sends a Slack summary.
+
+## Team Dashboard
+
+### Regression alert: include failing test names + dashboard link
+
+**What:** Slack alert message should list the specific tests that regressed and include a direct URL to the dashboard Evals tab.
+
+**Why:** Current alert says "pass rate dropped 89% → 82%" but doesn't say which tests. The person paged has to open the dashboard and hunt. Including test names and a direct link saves 2 minutes of triage.
+
+**Context:** `all_results` array in eval_runs has per-test data. `formatSlackMessage()` in regression-alert/index.ts is the change point. Dashboard URL can be derived from SUPABASE_URL.
+
+**Effort:** S
+**Priority:** P2
+**Depends on:** Phase 4 (shipped)
+
+### Projected monthly cost annotation on dashboard
+
+**What:** Add "Projected monthly: ~$X" annotation to the cost chart on the dashboard.
+
+**Why:** Everyone wants the monthly number for budgeting. One line of math (last 4 weeks average × 4.33), huge value for finance conversations.
+
+**Context:** `renderVBarChart` or `renderCosts` in dashboard/ui.ts. Data is already fetched.
+
+**Effort:** XS
+**Priority:** P3
+
+### Ship notification to Slack
+
+**What:** Post a Slack message when someone ships: "alice shipped v0.4.2 → repo-slug (PR #45)". Reuses existing Slack webhook from team_settings.
+
+**Why:** Real-time team shipping awareness. Currently only regression alerts go to Slack — positive events (ships) should too.
+
+**Context:** Either add to the sync push path in ship/SKILL.md.tmpl or create a new edge function triggered on ship_logs INSERT (same pattern as regression-alert).
+
+**Effort:** S
+**Priority:** P2
+**Depends on:** Phase 4 (shipped)
+
+### Dynamic favicon based on team pass rate
+
+**What:** Dashboard favicon changes color (green/yellow/red dot) based on current overall eval pass rate. Visible from the browser tab bar without switching to the dashboard tab.
+
+**Why:** Zero-click observability. At a glance from your tab bar, you know if the team is healthy.
+
+**Context:** Canvas → data URL favicon, update on each fetchAll() refresh in dashboard/ui.ts. Green >80%, yellow 50-80%, red <50%.
+
+**Effort:** XS
+**Priority:** P3
+
+### Server-side aggregation / materialized views
+
+**What:** Replace client-side data fetching (6 parallel REST calls per refresh) with server-side pre-aggregated views or Supabase materialized views.
+
+**Why:** Current approach pulls up to 100 rows per table per refresh. With 5+ users and 60s refresh, this puts pressure on Supabase request limits. Materialized views would return pre-computed summaries in a single call.
+
+**Context:** Could use Supabase pg_cron to refresh materialized views every 5 minutes. Dashboard would fetch one view instead of 6 tables.
+
+**Effort:** L
+**Priority:** P3
+**Depends on:** Phase 4 (shipped)
+
+### Real-time SSE streaming on dashboard
+
+**What:** Server-Sent Events stream from a Supabase edge function that pushes updates when new data arrives (eval_runs INSERT, ship_logs INSERT, heartbeats).
+
+**Why:** Dashboard currently polls every 60s. SSE would make it truly real-time — see an eval complete the moment it finishes.
+
+**Context:** Supabase Realtime can be used client-side, or a custom SSE edge function can listen to Postgres NOTIFY. Year 2 roadmap item.
+
+**Effort:** L
+**Priority:** P3
+
+### GitHub Check Run integration
+
+**What:** When an eval run is pushed, create a GitHub Check Run on the corresponding commit/PR showing pass rate, regressions, and cost.
+
+**Why:** Eval results become visible directly in the PR review workflow. Regressions can block merge.
+
+**Context:** Requires GitHub App installation or personal access token. Uses GitHub REST API `POST /repos/{owner}/{repo}/check-runs`. Year 2 roadmap item.
+
+**Effort:** L
+**Priority:** P3
+**Depends on:** Phase 4 (shipped)
+
+### ship_logs index on (team_id, created_at)
+
+**What:** Add composite index `idx_ship_logs_team_date ON ship_logs(team_id, created_at DESC)`.
+
+**Why:** Weekly digest queries `ship_logs WHERE team_id = ? AND created_at >= ?`. Without this index, it table-scans. Low priority because ship_logs volume is small in Year 1, but needed before scale.
+
+**Context:** Add to a new migration 008 or append to 007.
+
+**Effort:** XS
+**Priority:** P3
+
 ## Infrastructure
 
 ### /setup-gstack-upload skill (S3 bucket)
@@ -335,6 +459,8 @@
 
 **Why:** Reduce E2E test cost and flakiness.
 
+**Status:** Model pinning shipped (session-runner.ts passes `--model` from `EVAL_TIER` env). Retry:2 still TODO.
+
 **Effort:** XS
 **Priority:** P2
 

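Two of the Team Dashboard TODOs above reduce to a line or two of arithmetic: the projected monthly cost (last 4 weeks average × 4.33) and the favicon color thresholds (green >80%, yellow 50-80%, red <50%). A minimal sketch under those stated rules follows; the function names are hypothetical and nothing here is taken from `dashboard/ui.ts`.

```typescript
// Average number of weeks per month, as given in the TODO.
const WEEKS_PER_MONTH = 4.33;

// weeklyCostsUsd: one total per week, oldest first. Uses at most the last 4 weeks.
function projectedMonthlyCost(weeklyCostsUsd: number[]): number {
  const recent = weeklyCostsUsd.slice(-4);
  if (recent.length === 0) return 0; // no data yet, so nothing to project
  const average = recent.reduce((sum, cost) => sum + cost, 0) / recent.length;
  return average * WEEKS_PER_MONTH;
}

// Thresholds from the TODO: green above 80, yellow from 50 to 80, red below 50.
function faviconColor(passRatePct: number): "green" | "yellow" | "red" {
  if (passRatePct > 80) return "green";
  if (passRatePct >= 50) return "yellow";
  return "red";
}
```

With fewer than four weeks of history the sketch averages whatever is available, which is a design choice of this example rather than something the TODO specifies.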
diff --git a/bin/gstack-eval b/bin/gstack-eval
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# gstack eval — unified eval CLI
+# Delegates to lib/cli-eval.ts via bun
+
+GSTACK_DIR="${GSTACK_DIR:-$(cd "$(dirname "$0")/.." && pwd)}"
+exec bun run "$GSTACK_DIR/lib/cli-eval.ts" "$@"
diff --git a/bin/gstack-sync b/bin/gstack-sync
@@ -0,0 +1,86 @@
+#!/usr/bin/env bash
+# gstack-sync — team data sync CLI.
+#
+# Usage:
+#   gstack-sync setup                    — interactive auth flow
+#   gstack-sync status                   — show sync status
+#   gstack-sync test                     — validate full sync flow
+#   gstack-sync show [evals|ships|retros|sessions] — view team data
+#   gstack-sync push-{eval,retro,qa,ship,greptile} <file> — push data
+#   gstack-sync push-transcript            — sync Claude session transcripts
+#   gstack-sync pull                     — pull team data to local cache
+#   gstack-sync drain                    — drain the offline queue
+#   gstack-sync logout                   — clear auth tokens
+#
+# Env overrides (for testing):
+#   GSTACK_DIR          — override auto-detected gstack root
+#   GSTACK_STATE_DIR    — override ~/.gstack state directory
+set -euo pipefail
+
+GSTACK_DIR="${GSTACK_DIR:-$(cd "$(dirname "$0")/.." && pwd)}"
+
+case "${1:-}" in
+  setup)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" setup
+    ;;
+  status)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" status
+    ;;
+  push-eval)
+    FILE="${2:?Usage: gstack-sync push-eval <file.json>}"
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-eval "$FILE"
+    ;;
+  push-retro)
+    FILE="${2:?Usage: gstack-sync push-retro <file.json>}"
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-retro "$FILE"
+    ;;
+  push-qa)
+    FILE="${2:?Usage: gstack-sync push-qa <file.json>}"
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-qa "$FILE"
+    ;;
+  push-ship)
+    FILE="${2:?Usage: gstack-sync push-ship <file.json>}"
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-ship "$FILE"
+    ;;
+  push-greptile)
+    FILE="${2:?Usage: gstack-sync push-greptile <file.json>}"
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-greptile "$FILE"
+    ;;
+  push-transcript)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-transcript
+    ;;
+  test)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" test
+    ;;
+  show)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" show "${@:2}"
+    ;;
+  pull)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" pull
+    ;;
+  drain)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" drain
+    ;;
+  logout)
+    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" logout
+    ;;
+  *)
+    echo "Usage: gstack-sync <command> [args]"
+    echo ""
+    echo "Commands:"
+    echo "  setup                 Interactive auth flow (opens browser)"
+    echo "  status                Show sync status (queue, cache, connection)"
+    echo "  test                  Validate full sync flow (push + pull)"
+    echo "  show [evals|ships|retros|sessions]  View team data in terminal"
+    echo "  push-eval <file>      Push eval result JSON to team store"
+    echo "  push-retro <file>     Push retro snapshot JSON"
+    echo "  push-qa <file>        Push QA report JSON"
+    echo "  push-ship <file>      Push ship log JSON"
+    echo "  push-greptile <file>  Push Greptile triage entry JSON"
+    echo "  push-transcript       Sync Claude session transcripts"
+    echo "  pull                  Pull team data to local cache"
+    echo "  drain                 Drain the offline sync queue"
+    echo "  logout                Clear auth tokens"
+    exit 1
+    ;;
+esac
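The `drain` subcommand above hands off to `lib/cli-sync.ts`, which is not part of this excerpt. As a rough illustration of the changelog's "offline queue with automatic retry (up to 5 attempts)", the sketch below drains an in-memory queue and drops items after their fifth failed attempt. `QueuedItem`, `drainQueue`, and the synchronous `send` callback are all assumptions made for this example; a real implementation would presumably be asynchronous and persist the queue to disk.

```typescript
// Hypothetical queue entry: a serialized payload plus how many sends have failed so far.
interface QueuedItem {
  payload: string;
  attempts: number;
}

const MAX_ATTEMPTS = 5; // from the changelog: "up to 5 attempts"

// Tries to send every queued item once. Returns the items that should stay queued.
// A failed or throwing send counts as one attempt; items reaching MAX_ATTEMPTS are dropped.
function drainQueue(
  queue: QueuedItem[],
  send: (payload: string) => boolean,
): QueuedItem[] {
  const remaining: QueuedItem[] = [];
  for (const item of queue) {
    let delivered = false;
    try {
      delivered = send(item.payload);
    } catch {
      delivered = false; // network errors are non-fatal by design
    }
    if (delivered) continue;

    const attempts = item.attempts + 1;
    if (attempts < MAX_ATTEMPTS) {
      remaining.push({ payload: item.payload, attempts });
    }
    // Otherwise the item has failed MAX_ATTEMPTS times and is discarded.
  }
  return remaining;
}
```

Because failures are swallowed and only reflected in the returned queue, a caller never blocks or crashes on a network problem, which matches the non-fatal behavior the changelog describes.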
diff --git a/bin/gstack-team b/bin/gstack-team
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# gstack team — team admin CLI
+# Delegates to lib/cli-team.ts via bun
+
+GSTACK_DIR="${GSTACK_DIR:-$(cd "$(dirname "$0")/.." && pwd)}"
+exec bun run "$GSTACK_DIR/lib/cli-team.ts" "$@"