EPIC: Task-queue orchestrator runner (experimental orchestrator variant) #633

@gewenyu99

Summary

The wizard today is a single-agent, linear runner. runProgram
(src/lib/agent/agent-runner.ts:154) ends in one Claude Agent SDK query() call
driven by one large assembled prompt. All integration work happens inside that
single agent context.

This epic prototypes a task-queue-driven orchestrator that runs many small,
fresh-context micro-agents. An orchestrator agent inspects the repo and seeds an
in-memory task queue. An executor drains it, running one fresh agent per task,
each with its own model, goal, success criteria, permissions, and the mini-skills
that tell it HOW. Tasks can dynamically enqueue more tasks as they learn, like
water running downhill. Each task leaves a structured handoff for the next agent:
what it did, what its goal was, and what the next agent should know.
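
A minimal sketch of the queue, drain loop, and handoff described above. Every name here (Task, Handoff, TaskQueue, drain, the field names) is an assumption for illustration, not the wizard's actual types, and the agent is replaced by a stub function so the flow can be read end to end.

```typescript
// Hypothetical shapes for illustration; not the wizard's actual types.
interface Handoff {
  goal: string; // what the task set out to do
  did: string; // what it actually did
  nextShouldKnow: string; // notes for the next agent
}

interface Task {
  id: string;
  agent: string; // name of the agent prompt (the WHAT) to run
  status: "pending" | "running" | "done";
  handoff?: Handoff;
}

class TaskQueue {
  private tasks: Task[] = [];

  enqueue(id: string, agent: string): void {
    this.tasks.push({ id, agent, status: "pending" });
  }

  // Oldest pending task, or undefined once the queue is drained.
  next(): Task | undefined {
    return this.tasks.find((t) => t.status === "pending");
  }

  complete(id: string, handoff: Handoff): void {
    const task = this.tasks.find((t) => t.id === id);
    if (!task) throw new Error(`unknown task: ${id}`);
    task.status = "done";
    task.handoff = handoff;
  }

  // Handoffs recorded so far, in queue order.
  handoffs(): Handoff[] {
    return this.tasks
      .filter((t) => t.handoff !== undefined)
      .map((t) => t.handoff as Handoff);
  }
}

// Stand-in for "run one fresh agent"; it may enqueue follow-up tasks.
type RunTask = (task: Task, queue: TaskQueue, prior: Handoff[]) => Handoff;

function drain(queue: TaskQueue, run: RunTask): string[] {
  const order: string[] = [];
  for (let task = queue.next(); task; task = queue.next()) {
    task.status = "running";
    queue.complete(task.id, run(task, queue, queue.handoffs()));
    order.push(task.id);
  }
  return order;
}

// Usage: the seed task enqueues a follow-up while it runs.
const queue = new TaskQueue();
queue.enqueue("install", "install");
const order = drain(queue, (task, q) => {
  if (task.id === "install") q.enqueue("init", "init"); // dynamic enqueue
  return { goal: task.agent, did: "stub", nextShouldKnow: "" };
});
```

The loop re-reads the queue after every task, which is what lets a task add work that the executor picks up without any re-planning step.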

A core requirement is separation of concerns:

  • Agent prompts (the WHAT). They say what to do, and carry the artifacts in
    frontmatter: model, goal, success criteria, permissions, which mini-skills.
  • Mini-skills (the HOW). The procedural integration knowledge.

Both are markdown files with frontmatter served from context-mill. Skills already
exist there. Agent prompts are a new agents content type, a flavor parallel to
skills that carries intention rather than procedure.
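
To make the file shape concrete, here is a rough sketch of reading an agent prompt's frontmatter. It assumes flat `key: value` lines between `---` fences; the real files may use full YAML, and the key names (model, goal, skills) and the parseAgentPrompt helper are hypothetical.

```typescript
// Hypothetical parsed form of an agent prompt file.
interface AgentPrompt {
  meta: Record<string, string>; // frontmatter: the artifacts
  body: string; // markdown body: the WHAT
}

// Simplified frontmatter reader: flat "key: value" pairs only, not real YAML.
function parseAgentPrompt(markdown: string): AgentPrompt {
  const lines = markdown.split("\n");
  if (lines[0].trim() !== "---") return { meta: {}, body: markdown };

  const end = lines.indexOf("---", 1);
  if (end === -1) throw new Error("unterminated frontmatter");

  const meta: Record<string, string> = {};
  for (const line of lines.slice(1, end)) {
    const colon = line.indexOf(":");
    if (colon === -1) continue; // skip lines that are not key: value
    meta[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
  }
  return { meta, body: lines.slice(end + 1).join("\n").trim() };
}

// Usage with an invented agent prompt.
const example = [
  "---",
  "model: small",
  "goal: Add the SDK to the manifest",
  "skills: install-sdk",
  "---",
  "Declare the PostHog SDK dependency.",
].join("\n");

const parsed = parseAgentPrompt(example);
```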

The orchestrator branches from the linear runner, gated by the boolean
wizard-orchestrator feature flag, so we launch in the dark and A/B it against the
linear baseline.

Why: the primary goal is responsiveness

The linear runner packs the whole integration into one agent, which produces long
silences. The user waits through minutes of planning, then minutes of editing,
with little sign of progress. The win condition for this experiment is that the
user sees real, incremental progress the whole way through. Decomposing into
discrete micro-tasks delivers that. Each task is small and quick, it finishes
visibly, and it streams a steady drumbeat of one thing done and the next thing
starting.

Two levers keep tasks small and quick:

  • Per-task model fit. Each task picks the cheapest model that can do the job,
    a small model for install and boilerplate, a stronger one only where reasoning
    is needed. This is core.
  • Granularity. Prefer many small tasks over a few large ones, scoped tightly
    enough to finish fast and show progress. The orchestrator itself seeds fast, a
    quick glance at the repo rather than a long plan.

Decomposition also helps reliability, debuggability, and per-step permissions.

Locked decisions

  • Concurrency. Parallel-capable: tasks declare dependsOn, and the default cap
    is 1 for the prototype. The real graph has genuine parallel branches: install
    alongside init, identify alongside capture-planning. At cap 1 they run in any
    order; raising the cap runs the independent branches together.
  • Persistence. The queue lives in memory and is reflected to disk
    asynchronously: queue, audit log, handoffs. Resume across runs and crashes is
    deferred to #10, after #9.
  • Seeding. Dynamic enqueue, with loop and termination guards.
  • Gating. The boolean wizard-orchestrator feature flag (constants.ts). The
    integrate command reads the flag and routes to the new runner when it is
    true. To test in the dark, enable the flag for your own user in PostHog
    (isOrchestratorEnabled, agent-interface.ts).
  • Artifacts. The WHAT and the HOW are both content in context-mill. Agent
    prompts are a new agents content type, a flavor parallel to skills that carries
    intention: model, goal, success criteria, permissions, which skills to load.
    Mini-skills stay as skills, the procedure. Both are markdown with frontmatter,
    authored on a clearly-named experiment branch, served from context-mill. Local
    dev runs against localhost:8765. Context-mill grows the agents type to match
    how it already builds and serves skills (#2).
  • Tools. The orchestrator tools live in the existing wizard-tools server,
    alongside the env tools and the audit tools.
  • Handoff. Each task reports a structured handoff through complete_task:
    goal, what it did, what the next agent should know.
  • Success. Success criteria are plain text in the agent prompt. The agent
    self-reports the outcome through complete_task.
  • UI. The TUI renders the queue.
  • Sequencing. Prove the full machinery with a walking skeleton, stub tasks
    that write a temp file, then author the real bodies: install, init,
    instrument-events.
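
The concurrency decision can be sketched as a pure scheduling function: a task may start when it is pending, all of its dependsOn entries are done, and the cap has room. The names (QueuedTask, startable, simulate) and the edges in the example graph are assumptions; the issue only says which branches are independent, not the exact edges.

```typescript
// Hypothetical task record; only dependsOn comes from the issue text.
interface QueuedTask {
  id: string;
  dependsOn: string[];
  status: "pending" | "running" | "done";
}

// Tasks that may start now, limited by the free slots under the cap.
function startable(tasks: QueuedTask[], cap: number): QueuedTask[] {
  const done = new Set(tasks.filter((t) => t.status === "done").map((t) => t.id));
  const running = tasks.filter((t) => t.status === "running").length;
  const free = Math.max(0, cap - running);
  return tasks
    .filter((t) => t.status === "pending" && t.dependsOn.every((d) => done.has(d)))
    .slice(0, free);
}

// Simulate draining in rounds: start what fits, then finish the whole batch.
// Stops when nothing can start (queue drained, or dependencies unsatisfiable).
function simulate(tasks: QueuedTask[], cap: number): string[][] {
  const rounds: string[][] = [];
  for (;;) {
    const batch = startable(tasks, cap);
    if (batch.length === 0) break;
    batch.forEach((t) => (t.status = "done"));
    rounds.push(batch.map((t) => t.id));
  }
  return rounds;
}

// Invented edges: identify and plan-capture both wait on init.
function graph(): QueuedTask[] {
  return [
    { id: "install", dependsOn: [], status: "pending" },
    { id: "init", dependsOn: [], status: "pending" },
    { id: "identify", dependsOn: ["init"], status: "pending" },
    { id: "plan-capture", dependsOn: ["init"], status: "pending" },
  ];
}

const atCapOne = simulate(graph(), 1); // one task per round
const atCapTwo = simulate(graph(), 2); // independent branches share a round
```

With the same graph, changing only the cap turns four sequential rounds into two, which is the behavior the prototype defers by defaulting the cap to 1.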

Architecture at a glance

  • Fork point. Inside runProgram after OAuth, at agent-runner.ts:302.
    Extract steps 1 to 4 into bootstrapProgram(), which the orchestrator arm
    reuses for the health check, settings conflicts, OAuth and credentials, MCP url,
    and variant metadata.
  • Per-task agent. Reuse the existing runAgent (agent-interface.ts:773),
    initialize the agent once, and override model, tools, and permissions per task.
  • Queue. In memory as the source of truth, reflected asynchronously to
    <installDir>/.posthog-wizard/ (queue.json, audit.jsonl, structured
    handoffs/<id>.json), reusing the audit-ledger atomic write and mutex helpers.
  • Tools. enqueue_task, complete_task, and read_handoffs in
    wizard-tools, registered when a QueueStore is present, with termination
    guards.
  • Telemetry. VARIANT=orchestrator flows into existing events, plus
    orchestrator-specific events for the A/B comparison.
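
One way the termination guards on enqueue_task could look, as a sketch only. The specific guards (duplicate ids, a total task budget, a maximum enqueue depth), the limit values, and all names below are assumptions; the issue states that guards exist, not what they check.

```typescript
// Hypothetical guard configuration.
interface GuardLimits {
  maxTasks: number; // total tasks ever enqueued in one run
  maxDepth: number; // longest allowed chain of task-enqueues-task
}

interface EnqueueRequest {
  id: string;
  parentId?: string; // the task that asked for this one, if any
}

type EnqueueResult =
  | { ok: true; depth: number }
  | { ok: false; reason: string };

// depths maps every enqueued task id to its distance from a seed task.
function guardedEnqueue(
  depths: Map<string, number>,
  req: EnqueueRequest,
  limits: GuardLimits
): EnqueueResult {
  if (depths.has(req.id)) return { ok: false, reason: "duplicate task id" };
  if (depths.size >= limits.maxTasks) {
    return { ok: false, reason: "task budget exhausted" };
  }

  let depth = 0;
  if (req.parentId !== undefined) {
    const parentDepth = depths.get(req.parentId);
    if (parentDepth === undefined) {
      return { ok: false, reason: "unknown parent task" };
    }
    depth = parentDepth + 1;
  }
  if (depth > limits.maxDepth) {
    return { ok: false, reason: "enqueue chain too deep" };
  }

  depths.set(req.id, depth);
  return { ok: true, depth };
}
```

Returning a reason instead of throwing lets the tool hand the rejection back to the agent as an ordinary tool result, so a task that hits a guard can finish with a handoff rather than crash the run.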

What good looks like

  • Time to first visible progress is short. The first task is running and showing
    in the UI soon after launch.
  • Progress is steady. No single step dominates wall-clock, and the user always
    sees what just finished and what is next.
  • Cheap models carry the cheap work. Tasks default to the lowest-cost model that
    succeeds, and the strong models are the exception.
  • Per-task latency and model are measured from day one (telemetry, #9).

Cost to watch

Each fresh task-agent re-pays the full system-prompt, the claude_code preset,
and the MCP connect, roughly 113k tokens that the codebase worked hard to defer.
N agents means N times that startup cost. We watch per-task tokens, but the
experiment is judged on responsiveness, not on beating the linear baseline's total
token count. Small tasks, cheap models, and fast seeding keep the tax bearable.
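
A back-of-the-envelope version of that tax, assuming every task-agent pays the full startup cost with no caching or reuse. The 113k figure is the rough number quoted above, nine is the task count of the full graph listed under the build decisions, and startupTax is a hypothetical helper.

```typescript
// Rough per-agent startup cost quoted in this issue (system prompt, preset, MCP connect).
const STARTUP_TOKENS_PER_AGENT = 113000;

// Startup tokens paid by a run with the given number of fresh agents.
function startupTax(agentCount: number): number {
  return agentCount * STARTUP_TOKENS_PER_AGENT;
}

const linearRun = startupTax(1); // 113000
const nineTaskRun = startupTax(9); // 1017000
```

Under these assumptions a nine-task run pays about a million startup tokens before any task-specific work, so the per-task token count is the number to watch as the graph grows.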

Design discipline

CLAUDE.md keeps product knowledge out of infra. Frameworks live in
FrameworkConfig, integration knowledge lives in context-mill skills, programs
are step arrays. This design fits that. The agent prompts and mini-skills are
markdown content, the same shape as today's skills, so they live in context-mill.
The new wizard-side code is machinery: the queue, the executor, the loader, and
the tools. The runner stays product-ignorant.

Child issues

#    Issue   Title                                                            Deps
1    #621    Shared bootstrap extraction + variant gating                     none
2    #631    Context-mill: agents content type (parallel to skills)           none
3    #622    Queue + persistence layer                                        none
4    #623    Orchestrator MCP tools (in wizard-tools)                         3
5    #624    Executor framework + fresh per-task agent                        1, 3, 4
6    #625    Agent-prompt and mini-skill format + seed prompt                 2, 5
7    #626    Walking skeleton, end-to-end with stub tasks                     5, 6
8    #627    Real thin-slice task bodies (install, init, instrument-events)   7
9    #628    Telemetry + experiment instrumentation                           5, finalize after 7/8
10   #629    Resume across runs/crashes (deferred, low priority)              9
11   #632    Separate transient task instructions from keepable skills        5, 8

Ordering: 1 ∥ 2 ∥ 3, then 4, then 5 with 6 in parallel, then
7 (demo-able), then 8, then 9, then 10 later. #2 is in the context-mill repo
and gates #6 onward, so it runs early. The strongest standalone breakpoints are 1,
3, 7, 8.

Definition of done

With the wizard-orchestrator flag enabled, a clean Next.js app
integrates PostHog end-to-end through the orchestrator (SDK installed, env set, at
least one event instrumented) at parity with the linear baseline, and the whole
run is segmentable from baseline in PostHog telemetry by VARIANT. With the flag
off, the linear path is byte-for-byte unchanged. Resume after a forced kill is #10.

Implementation status (current)

Built and running end-to-end against real apps (Next.js, Express, Android).
Done: #621, #631, #622, #623, #624, #626, and #625 + #627 (now the full
1:1 flow). Remaining: #628 telemetry, #632 transient task instructions.
#629 resume is dropped, not worth the complexity.

PRs

Wizard, a stacked train of drafts:

Context-mill:

Decisions made during the build, beyond the original issues

  • Manifest-only install + a build phase. install only declares the SDK;
    the new build task runs the real install + build/typecheck and surfaces any
    unresolved conflict (one line in the outro, full detail in the report). See Issue 8: Real task bodies + full 1:1 integration flow #627.
  • Full graph is nine tasks: install, init, identify, error-tracking,
    plan-capture, capture, build, dashboard, report.
  • Client injects the basics; agent prompts are lean. The wizard owns the I/O
    contract, project context, and the reference-example pointer; /agents files
    carry intent only. See Issue 6: Agent-prompt and mini-skill format + seed prompt #625.
  • Grounding like the linear flow. Mini-skills carry real PostHog docs; each
    task agent is pointed at the detected framework's reference EXAMPLE.md to
    reference its patterns (not copy).
  • UI: agent-set per-task labels drive the queue panel; the agent task tools
    and per-task spinner lines are suppressed so the queue is the sole progress
    surface.
  • Flag targeting (separate concern, own branch). The wizard now identifies the
    user (email) before evaluating flags, so wizard-orchestrator can target by
    email. Previously only $app_name was sent.
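
As an illustration of the queue being the sole progress surface, a sketch of turning queue state into panel lines. The row shape, the status names, the ASCII markers, and renderQueue itself are all assumptions; the actual TUI rendering is not described in this issue.

```typescript
type RowStatus = "pending" | "running" | "done" | "failed";

// Hypothetical row: the label is the text the agent set for its task.
interface QueueRow {
  label: string;
  status: RowStatus;
}

const MARKER: Record<RowStatus, string> = {
  pending: "[ ]",
  running: "[>]",
  done: "[x]",
  failed: "[!]",
};

// One line per task, plus a summary line with the done count.
function renderQueue(rows: QueueRow[]): string[] {
  const done = rows.filter((r) => r.status === "done").length;
  const lines = rows.map((r) => `${MARKER[r.status]} ${r.label}`);
  lines.push(`${done}/${rows.length} done`);
  return lines;
}

const panel = renderQueue([
  { label: "Declare the SDK", status: "done" },
  { label: "Initialize the client", status: "running" },
  { label: "Identify users", status: "pending" },
]);
```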
