EPIC: Task-queue orchestrator runner (experimental `orchestrator` variant)

# EPIC: Task-queue orchestrator runner (experimental `orchestrator` variant)

## Summary

The wizard today is a single-agent, linear runner. `runAgent` → `runProgram`
(`src/lib/agent/agent-runner.ts:154`) ends in one Claude Agent SDK `query()` call
driven by one large assembled prompt. All integration work happens inside that
single agent context.

This epic prototypes a task-queue-driven orchestrator that runs many small,
fresh-context micro-agents. An orchestrator agent inspects the repo and seeds an
in-memory task queue. An executor drains it, running one fresh agent per task,
each with its own model, goal, success criteria, permissions, and the mini-skills
that tell it HOW. Tasks can dynamically enqueue more tasks as they learn, water
running downhill. Each task leaves a structured handoff for the next agent: what
it did, what its goal was, and what the next agent should know.

A core requirement is separation of concerns:
- **Agent prompts (the WHAT).** They say what to do, and carry the artifacts in
  frontmatter: model, goal, success criteria, permissions, which mini-skills.
- **Mini-skills (the HOW).** The procedural integration knowledge.

Both are markdown files with frontmatter served from context-mill. Skills already
exist there. Agent prompts are a new `agents` content type, a flavor parallel to
skills that carries intention rather than procedure.

It branches from the linear runner, gated by the boolean `wizard-orchestrator` feature flag, so we launch in the dark and A/B it against the linear
baseline.

## Why, the primary goal is responsiveness

The linear runner packs the whole integration into one agent, which produces long
silences. The user waits through minutes of planning, then minutes of editing,
with little sign of progress. The win condition for this experiment is that the
user sees real, incremental progress the whole way through. Decomposing into
discrete micro-tasks delivers that. Each task is small and quick, it finishes
visibly, and it streams a steady drumbeat of one thing done and the next thing
starting.

Two levers keep tasks small and quick:
- **Per-task model fit.** Each task picks the cheapest model that can do the job,
  a small model for install and boilerplate, a stronger one only where reasoning
  is needed. This is core.
- **Granularity.** Prefer many small tasks over a few large ones, scoped tightly
  enough to finish fast and show progress. The orchestrator itself seeds fast, a
  quick glance at the repo rather than a long plan.

Decomposition also helps reliability, debuggability, and per-step permissions.

## Locked decisions

- **Concurrency.** Parallel-capable, tasks declare `dependsOn`, default cap 1 for
  the prototype. The real graph has genuine parallel branches: install alongside
  init, identify alongside capture-planning. At cap 1 they run in any order;
  raising the cap runs the independent branches together.
- **Persistence.** The queue lives in memory and is reflected to disk
  asynchronously: queue, audit log, handoffs. Resume across runs and crashes is
  deferred to #10, after #9.
- **Seeding.** Dynamic enqueue, with loop and termination guards.
- **Gating.** The boolean `wizard-orchestrator` feature flag (`constants.ts`). The
  `integrate` command reads the flag and routes to the new runner when it is
  `true`. To test in the dark, enable the flag for your own user in PostHog
  (`isOrchestratorEnabled`, `agent-interface.ts`).
- **Artifacts.** The WHAT and the HOW are both content in context-mill. Agent
  prompts are a new `agents` content type, a flavor parallel to skills that carries
  intention: model, goal, success criteria, permissions, which skills to load.
  Mini-skills stay as skills, the procedure. Both are markdown with frontmatter,
  authored on a clearly-named experiment branch, served from context-mill. Local
  dev runs against `localhost:8765`. Context-mill grows the `agents` type to match
  how it already builds and serves skills (#2).
- **Tools.** The orchestrator tools live in the existing `wizard-tools` server,
  alongside the env tools and the audit tools.
- **Handoff.** Each task reports a structured handoff through `complete_task`:
  goal, what it did, what the next agent should know.
- **Success.** Success criteria are plain text in the agent prompt. The agent
  self-reports the outcome through `complete_task`.
- **UI.** The TUI renders the queue.
- **Sequencing.** Prove the full machinery with a walking skeleton, stub tasks
  that write a temp file, then author the real bodies: install, init,
  instrument-events.

## Architecture at a glance

- **Fork point.** Inside `runProgram` after OAuth, at `agent-runner.ts:302`.
  Extract steps 1 to 4 into `bootstrapProgram()`, which the orchestrator arm
  reuses for the health check, settings conflicts, OAuth and credentials, MCP url,
  and variant metadata.
- **Per-task agent.** Reuse the existing `runAgent` (`agent-interface.ts:773`),
  initialize the agent once, and override model, tools, and permissions per task.
- **Queue.** In memory as the source of truth, reflected asynchronously to
  `<installDir>/.posthog-wizard/` (`queue.json`, `audit.jsonl`, structured
  `handoffs/<id>.json`), reusing the audit-ledger atomic write and mutex helpers.
- **Tools.** `enqueue_task`, `complete_task`, and `read_handoffs` in
  `wizard-tools`, registered when a `QueueStore` is present, with termination
  guards.
- **Telemetry.** `VARIANT=orchestrator` flows into existing events, plus
  orchestrator-specific events for the A/B comparison.

## What good looks like

- Time to first visible progress is short. The first task is running and showing
  in the UI soon after launch.
- Progress is steady. No single step dominates wall-clock, and the user always
  sees what just finished and what is next.
- Cheap models carry the cheap work. Tasks default to the lowest-cost model that
  succeeds, and the strong models are the exception.
- Per-task latency and model are measured from day one (telemetry, #9).

## Cost to watch

Each fresh task-agent re-pays the full system-prompt, the `claude_code` preset,
and the MCP connect, roughly 113k tokens that the codebase worked hard to defer.
N agents means N times that startup cost. We watch per-task tokens, but the
experiment is judged on responsiveness, not on beating the linear baseline's total
token count. Small tasks, cheap models, and fast seeding keep the tax bearable.

## Design discipline

`CLAUDE.md` keeps product knowledge out of infra. Frameworks live in
`FrameworkConfig`, integration knowledge lives in context-mill skills, programs
are step arrays. This design fits that. The agent prompts and mini-skills are
markdown content, the same shape as today's skills, so they live in context-mill.
The new wizard-side code is machinery: the queue, the executor, the loader, and
the tools. The runner stays product-ignorant.

## Child issues

| # | Issue | Title | Deps |
|---|---|---|---|
| 1 | #621 | Shared bootstrap extraction + variant gating | none |
| 2 | #631 | Context-mill: `agents` content type (parallel to skills) | none |
| 3 | #622 | Queue + persistence layer | none |
| 4 | #623 | Orchestrator MCP tools (in `wizard-tools`) | 3 |
| 5 | #624 | Executor framework + fresh per-task agent | 1, 3, 4 |
| 6 | #625 | Agent-prompt and mini-skill format + seed prompt | 2, 5 |
| 7 | #626 | **Walking skeleton, end-to-end with stub tasks** | 5, 6 |
| 8 | #627 | Real thin-slice task bodies (install, init, instrument-events) | 7 |
| 9 | #628 | Telemetry + experiment instrumentation | 5, finalize after 7/8 |
| 10 | #629 | Resume across runs/crashes *(deferred, low priority)* | 9 |
| 11 | #632 | Separate transient task instructions from keepable skills | 5, 8 |

**Ordering:** 1 ∥ 2 ∥ 3, then 4, then 5 with 6 in parallel, then
**7 (demo-able)**, then 8, then 9, then 10 later. #2 is in the context-mill repo
and gates #6 onward, so it runs early. The strongest standalone breakpoints are 1,
3, 7, 8.

## Definition of done

With the `wizard-orchestrator` flag enabled, a clean Next.js app
integrates PostHog end-to-end through the orchestrator (SDK installed, env set, at
least one event instrumented) at parity with the linear baseline, and the whole
run is segmentable from baseline in PostHog telemetry by `VARIANT`. With the flag
off, the linear path is byte-for-byte unchanged. Resume after a forced kill is #10.

## Implementation status (current)

Built and running end-to-end against real apps (Next.js, Express, Android).
Done: **#621, #631, #622, #623, #624, #626**, and **#625 + #627** (now the full
1:1 flow). Remaining: **#628** telemetry, **#632** transient task instructions.
**#629** resume is dropped, not worth the complexity.

### PRs

Wizard, a stacked train of drafts:

- #606 flag gating + shared bootstrap extraction
- #607 in-memory queue, every transition reflected to queue.json
- #608 enqueue_task, complete_task, read_handoffs tools
- #609 executor: graph-driven drain, parallel by dependency
- #619 markdown-backed agent loader + full integration flow (runner included; #610's walking skeleton was folded in)
- #620 identify the user so feature flags can target by email (sibling, separate concern)

Context-mill:

- PostHog/context-mill#181 `agents` content type, built and served alongside skills, plus the orchestrator agent prompts and mini-skills

### Decisions made during the build, beyond the original issues

- **Manifest-only install + a `build` phase.** `install` only declares the SDK;
  the new `build` task runs the real install + build/typecheck and surfaces any
  unresolved conflict (one line in the outro, full detail in the report). See #627.
- **Full graph is nine tasks:** install, init, identify, error-tracking,
  plan-capture, capture, build, dashboard, report.
- **Client injects the basics; agent prompts are lean.** The wizard owns the I/O
  contract, project context, and the reference-example pointer; `/agents` files
  carry intent only. See #625.
- **Grounding like the linear flow.** Mini-skills carry real PostHog docs; each
  task agent is pointed at the detected framework's reference `EXAMPLE.md` to
  reference its patterns (not copy).
- **UI:** agent-set per-task `label`s drive the queue panel; the agent task tools
  and per-task spinner lines are suppressed so the queue is the sole progress
  surface.
- **Flag targeting (separate concern, own branch).** The wizard now identifies the
  user (email) before evaluating flags, so `wizard-orchestrator` can target by
  email — previously only `$app_name` was sent.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EPIC: Task-queue orchestrator runner (experimental `orchestrator` variant) #633

EPIC: Task-queue orchestrator runner (experimental `orchestrator` variant)

Summary

Why, the primary goal is responsiveness

Locked decisions

Architecture at a glance

What good looks like

Cost to watch

Design discipline

Child issues

Definition of done

Implementation status (current)

PRs

Decisions made during the build, beyond the original issues

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

#	Issue	Title	Deps
1	#621	Shared bootstrap extraction + variant gating	none
2	#631	Context-mill: `agents` content type (parallel to skills)	none
3	#622	Queue + persistence layer	none
4	#623	Orchestrator MCP tools (in `wizard-tools`)	3
5	#624	Executor framework + fresh per-task agent	1, 3, 4
6	#625	Agent-prompt and mini-skill format + seed prompt	2, 5
7	#626	Walking skeleton, end-to-end with stub tasks	5, 6
8	#627	Real thin-slice task bodies (install, init, instrument-events)	7
9	#628	Telemetry + experiment instrumentation	5, finalize after 7/8
10	#629	Resume across runs/crashes (deferred, low priority)	9
11	#632	Separate transient task instructions from keepable skills	5, 8

EPIC: Task-queue orchestrator runner (experimental orchestrator variant) #633

Description

EPIC: Task-queue orchestrator runner (experimental orchestrator variant)

Summary

Why, the primary goal is responsiveness

Locked decisions

Architecture at a glance

What good looks like

Cost to watch

Design discipline

Child issues

Definition of done

Implementation status (current)

PRs

Decisions made during the build, beyond the original issues

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

EPIC: Task-queue orchestrator runner (experimental `orchestrator` variant) #633

EPIC: Task-queue orchestrator runner (experimental `orchestrator` variant)