35 commits
f87bc21
docs: add team coordination store design doc
garrytan Mar 15, 2026
8931165
Merge remote-tracking branch 'origin/main' into garrytan/team-supabas…
garrytan Mar 15, 2026
5c1ea08
docs: scrub proprietary refs, close eval format gaps, integrate gstac…
garrytan Mar 15, 2026
caed287
feat: extract shared utilities into lib/util.ts
garrytan Mar 15, 2026
3713c3b
feat: add team sync infrastructure (config, auth, push/pull, CLI)
garrytan Mar 15, 2026
f7ae465
feat: add Supabase migration SQL for team data store
garrytan Mar 15, 2026
82e2041
feat: hook eval-store sync, use shared utils, add 30 lib tests
garrytan Mar 15, 2026
7f7035f
feat: add listEvalFiles, loadEvalResults, formatTimestamp to lib/util.ts
garrytan Mar 15, 2026
9bc6c94
feat: add eval format validation, tier selection, cost tracking
garrytan Mar 15, 2026
1f5b788
feat: add SHA-based eval caching with EVAL_CACHE=0 bypass
garrytan Mar 15, 2026
4ad73f7
feat: unified gstack eval CLI with list, compare, push, cache, cost
garrytan Mar 15, 2026
02925cf
feat: wire costs[] from modelUsage into eval results
garrytan Mar 15, 2026
59752fc
feat: wire eval-cache + eval-tier into LLM judge, pin E2E model
garrytan Mar 15, 2026
daea165
feat: add eval:trend CLI for per-test pass rate tracking
garrytan Mar 15, 2026
33c9552
chore: update gitignore
garrytan Mar 15, 2026
e280333
chore: bump v0.3.10, update CHANGELOG and docs
garrytan Mar 15, 2026
eb7ef21
docs: add setup comments to .gstack-sync.json.example
garrytan Mar 15, 2026
1432046
docs: CHANGELOG covers full branch scope including team sync
garrytan Mar 15, 2026
704fe34
docs: clean up sync example, add team sync section to README
garrytan Mar 15, 2026
dc3fcc8
feat: DRY push functions, add push-greptile + sync test/show commands
garrytan Mar 16, 2026
06f2da2
feat: wire team sync push into ship, retro, qa, and greptile skills
garrytan Mar 16, 2026
87cb769
feat: sync heartbeats, eval:trend --team, setup guide, 10 new tests
garrytan Mar 16, 2026
0e29d7d
feat: add enriched transcript sync — Haiku summaries, session file en…
garrytan Mar 16, 2026
a104471
feat: add push-transcript CLI, show sessions, interactive setup, 36 t…
garrytan Mar 16, 2026
3a57a3f
feat: add /setup-team-sync skill, auto-push transcript hooks in skills
garrytan Mar 16, 2026
6e14689
docs: add team sync TODOs — streaming parser, effectiveness scoring, …
garrytan Mar 16, 2026
e969c6d
feat: add dashboard query functions — pure transforms for team analytics
garrytan Mar 16, 2026
4985c8e
feat: add CLI leaderboard, refactor formatTeamSummary to use dashboar…
garrytan Mar 16, 2026
46c82ce
feat: add team admin CLI + migration 007 (settings, cooldowns, create…
garrytan Mar 16, 2026
78840c6
feat: add shared team dashboard, regression alerts, weekly digest edg…
garrytan Mar 16, 2026
83bfc7f
feat: add /setup-team-dashboard skill, post-ship leaderboard callout
garrytan Mar 16, 2026
2357f13
merge: integrate origin/main (v0.4.0, v0.4.1) into team-supabase-store
garrytan Mar 16, 2026
721abce
fix: review-driven hardening — env guards, token expiry, slug validat…
garrytan Mar 16, 2026
9e67d71
docs: add 8 team dashboard TODOs from CEO review, mark weekly digest …
garrytan Mar 16, 2026
8ef73a7
Merge remote-tracking branch 'origin/main' into garrytan/team-supabas…
garrytan Mar 16, 2026
1 change: 1 addition & 0 deletions .gitignore
@@ -11,3 +11,4 @@ bun.lock
.env.local
.env.*
!.env.example
.gstack-sync.json
5 changes: 5 additions & 0 deletions .gstack-sync.json.example
@@ -0,0 +1,5 @@
{
"supabase_url": "https://YOUR_PROJECT.supabase.co",
"supabase_anon_key": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.YOUR_ANON_KEY_HERE",
"team_slug": "your-team-name"
}
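A minimal sketch of how a loader might validate this file, assuming the three keys shown in the example are all required strings. The name `parseSyncConfig`, the `SyncConfig` type, and the `https://` check are hypothetical illustrations, not the implementation in this PR. Returning `null` on any problem mirrors the stated goal that a missing or unusable config should leave local behavior untouched.

```typescript
// Hypothetical shape: field names are taken from .gstack-sync.json.example.
interface SyncConfig {
  supabase_url: string;
  supabase_anon_key: string;
  team_slug: string;
}

// Hypothetical helper: returns null (sync disabled) instead of throwing,
// so callers can fall back to local-only behavior.
function parseSyncConfig(raw: string): SyncConfig | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_err) {
    return null; // malformed JSON
  }
  if (typeof data !== "object" || data === null) return null;

  const obj = data as Record<string, unknown>;
  const keys = ["supabase_url", "supabase_anon_key", "team_slug"] as const;
  for (const key of keys) {
    const value = obj[key];
    if (typeof value !== "string" || value.trim() === "") return null;
  }

  const url = obj.supabase_url as string;
  // Assumption: only HTTPS project URLs are accepted.
  if (!url.startsWith("https://")) return null;

  return {
    supabase_url: url,
    supabase_anon_key: obj.supabase_anon_key as string,
    team_slug: obj.team_slug as string,
  };
}
```

A caller would read the file if it exists, pass its contents to `parseSyncConfig`, and skip all sync work when the result is `null`.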
23 changes: 23 additions & 0 deletions CHANGELOG.md
@@ -58,6 +58,29 @@
### Fixed
- Browser ref staleness — refs collected before page mutation (e.g. SPA navigation) are now detected and re-collected. Eliminates a class of flaky QA failures on dynamic sites.

## 0.3.10 — 2026-03-15

### Added
- **Team sync via Supabase (optional)** — shared data store for eval results, retro snapshots, QA reports, ship logs, and Greptile triage across team members. All sync operations are non-fatal and non-blocking — skills never wait on network. Offline queue with automatic retry (up to 5 attempts). Zero impact when not configured: without `.gstack-sync.json`, everything works locally as before. See `docs/designs/TEAM_COORDINATION_STORE.md` for architecture and setup.
- **Supabase migration SQL** — 4 migration files in `supabase/migrations/` for teams, eval_runs, data tables (retros, QA, ships, Greptile), and eval costs. Row-level security policies ensure team members can only access their own team's data.
- **Sync config + auth** — `.gstack-sync.json` for project-level config (Supabase URL, anon key, team slug). `~/.gstack/auth.json` for user-level tokens (keyed by Supabase URL for multi-team support). `GSTACK_SUPABASE_ACCESS_TOKEN` env var for CI/automation. Token refresh built in.
- **`gstack sync` CLI** — `status`, `push`, `pull`, `drain`, `login`, `logout` subcommands for managing team sync.
- **Universal eval format** — `StandardEvalResult` schema with validation, normalization, and bidirectional legacy conversion. Any language can produce JSON matching this format and push via `gstack eval push`.
- **Unified eval CLI** — `gstack eval list|compare|summary|trend|push|cost|cache|watch` consolidating all eval tools into one entry point.
- **Per-model cost tracking** — eval results now include `costs[]` with exact per-model token usage (input, output, cache read, cache creation) and API-reported cost. Extracted from `resultLine.modelUsage` in the `claude -p` NDJSON stream. `computeCosts()` prefers exact `cost_usd` over MODEL_PRICING estimates (~4x more accurate with prompt caching).
- **LLM judge caching** — SHA-based caching for LLM-as-judge eval calls via `eval-cache.ts`. Cache keyed by `model:prompt`, so unchanged SKILL.md content skips API calls entirely. ~$0.18/run savings. Set `EVAL_CACHE=0` to force re-run.
- **Dynamic model selection** — `EVAL_JUDGE_TIER` env var controls which Claude model runs judge evals (haiku/sonnet/opus, default: sonnet). `EVAL_TIER` pins the E2E test model via `--model` flag to `claude -p`.
- **`bun run eval:trend`** — per-test pass rate tracking over last N runs. Classifies tests as stable-pass, stable-fail, flaky, improving, or degrading. Sparkline table with `--limit`, `--tier`, `--test` filters.
- **Shared utilities** — `lib/util.ts` extracted with `atomicWriteJSON`, `readJSON`, `getGitInfo`, `getRemoteSlug`, `listEvalFiles`, `loadEvalResults`, `formatTimestamp`, and path constants.
- 52+ new tests across eval cache, cost, format, tier, trend, sync config, sync client, and LLM judge integration.
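
To illustrate the judge-caching entry above, here is a rough sketch of a cache keyed by `model:prompt` with an `EVAL_CACHE=0` bypass. It is not the code in `eval-cache.ts`: the changelog only says the cache is "SHA-based", so SHA-256, the in-memory `Map`, and the names `judgeCacheKey` and `cachedJudge` are all assumptions.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: hash "model:prompt" so any change to either part
// produces a different key. SHA-256 is an assumption.
function judgeCacheKey(model: string, prompt: string): string {
  return createHash("sha256").update(`${model}:${prompt}`).digest("hex");
}

// In-memory stand-in for whatever store the real cache uses.
const memoryCache = new Map<string, string>();

// Hypothetical wrapper: returns the cached result when present, unless
// EVAL_CACHE is "0", in which case the model is always called again.
function cachedJudge(
  model: string,
  prompt: string,
  callModel: () => string,
  env: Record<string, string | undefined> = {},
): { result: string; cached: boolean } {
  const bypass = env.EVAL_CACHE === "0";
  const key = judgeCacheKey(model, prompt);
  if (!bypass) {
    const hit = memoryCache.get(key);
    if (hit !== undefined) return { result: hit, cached: true };
  }
  const result = callModel();
  memoryCache.set(key, result); // refresh the entry even on a forced re-run
  return { result, cached: false };
}
```

Because the prompt is part of the key, editing a SKILL.md changes the key and forces a fresh judge call, while unchanged content is served from the cache.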

### Changed
- `callJudge()` and `judge()` now return `{ result, meta }` with `JudgeMeta` (model, tokens, cached flag). `outcomeJudge()` retains simple return type for E2E callers.
- `EvalCollector.finalize()` aggregates per-test `costs[]` into result-level cost breakdown and attempts team sync (non-blocking).
- `cli-eval.ts` main block guarded with `import.meta.main` to prevent execution on import.
- `eval:summary` now hints to run `eval:trend` when flaky tests are detected.
- All 8 LLM eval test sites updated from hard-coded `cost_usd: 0.02` to real API-reported costs.

## 0.3.9 — 2026-03-15

### Added
1 change: 1 addition & 0 deletions CLAUDE.md
@@ -15,6 +15,7 @@ bun run dev:skill # watch mode: auto-regen + validate on change
bun run eval:list # list all eval runs from ~/.gstack-dev/evals/
bun run eval:compare # compare two eval runs (auto-picks most recent)
bun run eval:summary # aggregate stats across all eval runs
bun run eval:trend # per-test pass rate trends (flaky detection)
```

`test:evals` requires `ANTHROPIC_API_KEY`. E2E tests stream progress in real-time
5 changes: 4 additions & 1 deletion CONTRIBUTING.md
@@ -171,6 +171,8 @@ When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
bun run eval:list # list all eval runs (turns, duration, cost per run)
bun run eval:compare # compare two runs — shows per-test deltas + Takeaway commentary
bun run eval:summary # aggregate stats + per-test efficiency averages across runs
bun run eval:trend # per-test pass rate over last N runs (flaky detection)
bun run eval:cache stats # check LLM judge cache hit rate
```
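
As a rough illustration of the kind of classification `eval:trend` reports (stable-pass, stable-fail, flaky, improving, degrading), the sketch below splits a test's pass/fail history into an older and a newer half and compares pass rates. The 0.5 threshold, the half-split, and the name `classifyTrend` are assumptions made for this example; the actual rules in `lib/cli-eval.ts` are not shown in this diff and may differ.

```typescript
type Trend = "stable-pass" | "stable-fail" | "flaky" | "improving" | "degrading";

// Fraction of passing runs; 0 for an empty slice.
function passRate(runs: boolean[]): number {
  return runs.length === 0 ? 0 : runs.filter(Boolean).length / runs.length;
}

// Hypothetical classifier. `history` is ordered oldest first.
function classifyTrend(history: boolean[]): Trend {
  if (history.length === 0) throw new Error("history must not be empty");
  if (history.every((passed) => passed)) return "stable-pass";
  if (history.every((passed) => !passed)) return "stable-fail";

  // Compare the older half against the newer half.
  const mid = Math.floor(history.length / 2);
  const older = passRate(history.slice(0, mid));
  const newer = passRate(history.slice(mid));

  // Assumed threshold: a swing of at least 50 points counts as a trend.
  if (newer - older >= 0.5) return "improving";
  if (older - newer >= 0.5) return "degrading";
  return "flaky";
}
```

Under these assumed rules, a test that failed twice and then passed twice is "improving", while one that alternates pass and fail is "flaky".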

**Eval comparison commentary:** `eval:compare` generates natural-language Takeaway sections interpreting what changed between runs — flagging regressions, noting improvements, calling out efficiency gains (fewer turns, faster, cheaper), and producing an overall summary. This is driven by `generateCommentary()` in `eval-store.ts`.
@@ -191,7 +193,8 @@ Each dimension is scored 1-5. Threshold: every dimension must score **≥ 4**. T
# Needs ANTHROPIC_API_KEY in .env — included in bun run test:evals
```

- Uses `claude-sonnet-4-6` for scoring stability
- Model defaults to `claude-sonnet-4-6`; override with `EVAL_JUDGE_TIER=haiku|opus`
- Results are SHA-cached — unchanged SKILL.md content skips API calls ($0 on repeat runs). Set `EVAL_CACHE=0` to force re-run.
- Tests live in `test/skill-llm-eval.test.ts`
- Calls the Anthropic API directly (not `claude -p`), so it works from anywhere including inside Claude Code

6 changes: 6 additions & 0 deletions README.md
@@ -630,6 +630,12 @@ bun run eval:watch # live dashboard during E2E runs

E2E tests stream real-time progress, write machine-readable diagnostics, and persist partial results that survive kills. See CONTRIBUTING.md for the full eval infrastructure.

### Team sync (optional)

For teams, gstack can sync eval results, retro snapshots, QA reports, and ship logs to a shared Supabase store. Without this, everything works locally as before — sync is purely additive.

To set up: copy `.gstack-sync.json.example` to `.gstack-sync.json`, create a Supabase project, run the migrations in `supabase/migrations/`, and fill in your credentials. See `docs/designs/TEAM_COORDINATION_STORE.md` for the full guide.

## License

MIT
128 changes: 127 additions & 1 deletion TODOS.md
@@ -231,7 +231,7 @@

**Why:** Spot quality trends — is the app getting better or worse?

**Context:** QA already writes structured reports. This adds cross-run comparison.
**Context:** `eval:trend` now tracks test-level pass rates (eval infrastructure). QA-run-level trending (health scores over time across QA report files) is a separate feature that could reuse `computeTrends` pattern from `lib/cli-eval.ts`.

**Effort:** S
**Priority:** P2
@@ -277,6 +277,130 @@
**Priority:** P3
**Depends on:** Browse sessions

## Team Sync

### Streaming parser for large session files

**What:** Replace readFileSync with readline/createReadStream for session files >10MB.

**Why:** Currently skip files >10MB. Long sessions (1000+ turns, 35MB) lose enrichment data (tools_used, full turn count).

**Context:** Current 10MB cap is defensive. Session files at `~/.claude/projects/{hash}/{sid}.jsonl` can be 35MB for marathon sessions. Streaming parser removes the cap while keeping memory usage constant.

**Effort:** S
**Priority:** P3
**Depends on:** Transcript sync (Phase 3)
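
A possible shape for the streaming parser described in this item, using Node's `readline` over `createReadStream` so that only one line is held in memory at a time. The session JSONL schema is not shown in this diff, so the `tool` field, the rule that every parseable line counts as one turn, and the names `summarizeLines` and `summarizeSessionFile` are hypothetical.

```typescript
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

interface SessionSummary {
  turns: number;
  toolsUsed: string[];
}

// Core reducer over JSONL lines. Assumptions: each parseable line is one
// turn, and a string `tool` field (hypothetical name) marks a tool call.
async function summarizeLines(lines: AsyncIterable<string>): Promise<SessionSummary> {
  let turns = 0;
  const tools = new Set<string>();
  for await (const line of lines) {
    if (line.trim() === "") continue;
    let record: unknown;
    try {
      record = JSON.parse(line);
    } catch (_err) {
      continue; // skip malformed lines rather than aborting the whole file
    }
    turns += 1;
    if (typeof record === "object" && record !== null) {
      const tool = (record as { tool?: unknown }).tool;
      if (typeof tool === "string") tools.add(tool);
    }
  }
  return { turns, toolsUsed: Array.from(tools).sort() };
}

// Streams the file line by line, so memory use does not grow with file size.
function summarizeSessionFile(path: string): Promise<SessionSummary> {
  const rl = createInterface({
    input: createReadStream(path, { encoding: "utf8" }),
    crlfDelay: Infinity,
  });
  return summarizeLines(rl);
}
```

Keeping the reducer separate from the file wrapper means it can be exercised with an in-memory async generator, without writing a 35MB fixture.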

### Session effectiveness scoring

**What:** Compute a 1-5 effectiveness score per session based on turns to achieve goal, tool diversity, whether code was shipped, and session duration.

**Why:** Enables `show sessions --best` and team-level AI effectiveness metrics. Raw data (tools_used, turns, duration, summary) already in Supabase after transcript sync.

**Context:** Year 2 roadmap item. Scoring heuristics need iteration. Could start with: fewer turns = more efficient, more tool diversity = better problem decomposition, shipped code (detected via git) = successful outcome.

**Effort:** M
**Priority:** P2
**Depends on:** Transcript sync (Phase 3)

### ~~Weekly AI usage digest~~ ✓ Shipped in Phase 4

Implemented as `supabase/functions/weekly-digest/index.ts`. pg_cron Monday 9am UTC, aggregates 7-day team data, sends Slack summary.

## Team Dashboard

### Regression alert: include failing test names + dashboard link

**What:** Slack alert message should list the specific tests that regressed and include a direct URL to the dashboard Evals tab.

**Why:** Current alert says "pass rate dropped 89% → 82%" but doesn't say which tests. The person paged has to open the dashboard and hunt. Including test names and a direct link saves 2 minutes of triage.

**Context:** `all_results` array in eval_runs has per-test data. `formatSlackMessage()` in regression-alert/index.ts is the change point. Dashboard URL can be derived from SUPABASE_URL.

**Effort:** S
**Priority:** P2
**Depends on:** Phase 4 (shipped)

### Projected monthly cost annotation on dashboard

**What:** Add "Projected monthly: ~$X" annotation to the cost chart on the dashboard.

**Why:** Everyone wants the monthly number for budgeting. One line of math (last 4 weeks average × 4.33), huge value for finance conversations.

**Context:** `renderVBarChart` or `renderCosts` in dashboard/ui.ts. Data is already fetched.

**Effort:** XS
**Priority:** P3
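
The "one line of math" in this item (average of the last 4 weeks × 4.33) could look like the sketch below. The function name `projectedMonthlyCost` and the weekly-totals input are assumptions; how the dashboard buckets costs by week is not shown in this diff.

```typescript
// Average number of weeks per month, as used in the TODO above.
const WEEKS_PER_MONTH = 4.33;

// Hypothetical helper. `weeklyCosts` holds weekly totals in USD, oldest first.
// Uses the last 4 entries, or fewer if less history is available.
function projectedMonthlyCost(weeklyCosts: number[]): number {
  const recent = weeklyCosts.slice(-4);
  if (recent.length === 0) return 0;
  const total = recent.reduce((sum, cost) => sum + cost, 0);
  return (total / recent.length) * WEEKS_PER_MONTH;
}
```

For example, four consecutive weeks at $10 each project to roughly $43.30 per month.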

### Ship notification to Slack

**What:** Post a Slack message when someone ships: "alice shipped v0.4.2 → repo-slug (PR #45)". Reuses existing Slack webhook from team_settings.

**Why:** Real-time team shipping awareness. Currently only regression alerts go to Slack — positive events (ships) should too.

**Context:** Either add to the sync push path in ship/SKILL.md.tmpl or create a new edge function triggered on ship_logs INSERT (same pattern as regression-alert).

**Effort:** S
**Priority:** P2
**Depends on:** Phase 4 (shipped)

### Dynamic favicon based on team pass rate

**What:** Dashboard favicon changes color (green/yellow/red dot) based on current overall eval pass rate. Visible from the browser tab bar without switching to the dashboard tab.

**Why:** Zero-click observability. At a glance from your tab bar, you know if the team is healthy.

**Context:** Canvas → data URL favicon, update on each fetchAll() refresh in dashboard/ui.ts. Green >80%, yellow 50-80%, red <50%.

**Effort:** XS
**Priority:** P3
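
The thresholds in this item (green >80%, yellow 50-80%, red <50%) map onto a small pure function, sketched here. The name `faviconColor` is hypothetical, and the canvas-to-data-URL step that would actually draw the dot is browser-only and left out.

```typescript
type FaviconColor = "green" | "yellow" | "red";

// Hypothetical helper. `passRatePercent` is the overall eval pass rate, 0-100.
// Boundary values 80 and 50 fall into the yellow band, per the TODO's ranges.
function faviconColor(passRatePercent: number): FaviconColor {
  if (passRatePercent > 80) return "green";
  if (passRatePercent >= 50) return "yellow";
  return "red";
}
```

Keeping the mapping separate from the drawing code means the thresholds can be unit-tested without a DOM.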

### Server-side aggregation / materialized views

**What:** Replace client-side data fetching (6 parallel REST calls per refresh) with server-side pre-aggregated views or Supabase materialized views.

**Why:** Current approach pulls up to 100 rows per table per refresh. With 5+ users and 60s refresh, this puts pressure on Supabase request limits. Materialized views would return pre-computed summaries in a single call.

**Context:** Could use Supabase pg_cron to refresh materialized views every 5 minutes. Dashboard would fetch one view instead of 6 tables.

**Effort:** L
**Priority:** P3
**Depends on:** Phase 4 (shipped)

### Real-time SSE streaming on dashboard

**What:** Server-Sent Events stream from a Supabase edge function that pushes updates when new data arrives (eval_runs INSERT, ship_logs INSERT, heartbeats).

**Why:** Dashboard currently polls every 60s. SSE would make it truly real-time — see an eval complete the moment it finishes.

**Context:** Supabase Realtime can be used client-side, or a custom SSE edge function can listen to Postgres NOTIFY. Year 2 roadmap item.

**Effort:** L
**Priority:** P3

### GitHub Check Run integration

**What:** When an eval run is pushed, create a GitHub Check Run on the corresponding commit/PR showing pass rate, regressions, and cost.

**Why:** Eval results become visible directly in the PR review workflow. Regressions can block merge.

**Context:** Requires GitHub App installation or personal access token. Uses GitHub REST API `POST /repos/{owner}/{repo}/check-runs`. Year 2 roadmap item.

**Effort:** L
**Priority:** P3
**Depends on:** Phase 4 (shipped)

### ship_logs index on (team_id, created_at)

**What:** Add composite index `idx_ship_logs_team_date ON ship_logs(team_id, created_at DESC)`.

**Why:** Weekly digest queries `ship_logs WHERE team_id = ? AND created_at >= ?`. Without this index, it table-scans. Low priority because ship_logs volume is small in Year 1, but needed before scale.

**Context:** Add to a new migration 008 or append to 007.

**Effort:** XS
**Priority:** P3

## Infrastructure

### /setup-gstack-upload skill (S3 bucket)
@@ -335,6 +459,8 @@

**Why:** Reduce E2E test cost and flakiness.

**Status:** Model pinning shipped (session-runner.ts passes `--model` from `EVAL_TIER` env). Retry:2 still TODO.

**Effort:** XS
**Priority:** P2

8 changes: 8 additions & 0 deletions bin/gstack-eval
@@ -0,0 +1,8 @@
#!/usr/bin/env bash
set -euo pipefail

# gstack eval — unified eval CLI
# Delegates to lib/cli-eval.ts via bun

GSTACK_DIR="${GSTACK_DIR:-$(cd "$(dirname "$0")/.." && pwd)}"
exec bun run "$GSTACK_DIR/lib/cli-eval.ts" "$@"
86 changes: 86 additions & 0 deletions bin/gstack-sync
@@ -0,0 +1,86 @@
#!/usr/bin/env bash
# gstack-sync — team data sync CLI.
#
# Usage:
# gstack-sync setup — interactive auth flow
# gstack-sync status — show sync status
# gstack-sync test — validate full sync flow
# gstack-sync show [evals|ships|retros|sessions] — view team data
# gstack-sync push-{eval,retro,qa,ship,greptile} <file> — push data
# gstack-sync push-transcript — sync Claude session transcripts
# gstack-sync pull — pull team data to local cache
# gstack-sync drain — drain the offline queue
# gstack-sync logout — clear auth tokens
#
# Env overrides (for testing):
# GSTACK_DIR — override auto-detected gstack root
# GSTACK_STATE_DIR — override ~/.gstack state directory
set -euo pipefail

GSTACK_DIR="${GSTACK_DIR:-$(cd "$(dirname "$0")/.." && pwd)}"

case "${1:-}" in
  setup)
    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" setup
    ;;
  status)
    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" status
    ;;
  push-eval)
    FILE="${2:?Usage: gstack-sync push-eval <file.json>}"
    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-eval "$FILE"
    ;;
  push-retro)
    FILE="${2:?Usage: gstack-sync push-retro <file.json>}"
    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-retro "$FILE"
    ;;
  push-qa)
    FILE="${2:?Usage: gstack-sync push-qa <file.json>}"
    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-qa "$FILE"
    ;;
  push-ship)
    FILE="${2:?Usage: gstack-sync push-ship <file.json>}"
    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-ship "$FILE"
    ;;
  push-greptile)
    FILE="${2:?Usage: gstack-sync push-greptile <file.json>}"
    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-greptile "$FILE"
    ;;
  push-transcript)
    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" push-transcript
    ;;
  test)
    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" test
    ;;
  show)
    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" show "${@:2}"
    ;;
  pull)
    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" pull
    ;;
  drain)
    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" drain
    ;;
  logout)
    exec bun run "$GSTACK_DIR/lib/cli-sync.ts" logout
    ;;
  *)
    echo "Usage: gstack-sync <command> [args]"
    echo ""
    echo "Commands:"
    echo " setup Interactive auth flow (opens browser)"
    echo " status Show sync status (queue, cache, connection)"
    echo " test Validate full sync flow (push + pull)"
    echo " show [evals|ships|retros|sessions] View team data in terminal"
    echo " push-eval <file> Push eval result JSON to team store"
    echo " push-retro <file> Push retro snapshot JSON"
    echo " push-qa <file> Push QA report JSON"
    echo " push-ship <file> Push ship log JSON"
    echo " push-greptile <file> Push Greptile triage entry JSON"
    echo " push-transcript Sync Claude session transcripts"
    echo " pull Pull team data to local cache"
    echo " drain Drain the offline sync queue"
    echo " logout Clear auth tokens"
    exit 1
    ;;
esac
8 changes: 8 additions & 0 deletions bin/gstack-team
@@ -0,0 +1,8 @@
#!/usr/bin/env bash
set -euo pipefail

# gstack team — team admin CLI
# Delegates to lib/cli-team.ts via bun

GSTACK_DIR="${GSTACK_DIR:-$(cd "$(dirname "$0")/.." && pwd)}"
exec bun run "$GSTACK_DIR/lib/cli-team.ts" "$@"