feat(cosim): instrument batch utilisation + record backend-seam findings (#105) #117

Merged: robtaylor merged 4 commits into main from feat/cosim-backend-seam on Jun 7, 2026
@robtaylor

Why

Before extracting the CosimBackend seam (#105 Phase 0), measure how the
cosim loop actually dispatches — specifically how often the GPU-only
batched fast path is used vs. forced single-edge (per-edge CPU↔GPU
handover). The answer reshapes the seam design.

What

  • Telemetry in the cosim run summary: single-edge commits, mean/max
    batch, % edges batched. Two u64/usize counters in the dispatch loop
    plus one summary line. Behaviour-preserving: all 7 Metal cosim fixtures
    stay byte-identical (verified against a pre-change golden).
  • Doc findings in ADR 0017 (amendment), the cosim-backend-portability
    plan, and the multi-clock plan.
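
The telemetry described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `BatchStats`, its fields, and `record_commit` are hypothetical names, and it keeps four counters for readability where the PR says two suffice.

```rust
/// Hypothetical batch-utilisation counters (names are illustrative only).
#[derive(Debug, Default)]
struct BatchStats {
    edges: u64,               // total edges dispatched
    commits: u64,             // total command-buffer commits
    single_edge_commits: u64, // commits that carried exactly one edge
    max_batch: usize,         // largest batch seen so far
}

impl BatchStats {
    /// Record one commit that covered `batch_len` edges.
    fn record_commit(&mut self, batch_len: usize) {
        self.edges += batch_len as u64;
        self.commits += 1;
        if batch_len == 1 {
            self.single_edge_commits += 1;
        }
        self.max_batch = self.max_batch.max(batch_len);
    }

    /// Mean number of edges per commit (0.0 when nothing was committed).
    fn mean_batch(&self) -> f64 {
        if self.commits == 0 {
            0.0
        } else {
            self.edges as f64 / self.commits as f64
        }
    }

    /// Percentage of edges that were NOT dispatched as a single-edge commit.
    fn pct_edges_batched(&self) -> f64 {
        if self.edges == 0 {
            return 0.0;
        }
        100.0 * (self.edges - self.single_edge_commits) as f64 / self.edges as f64
    }

    /// One summary line, in the spirit of the run-summary telemetry.
    fn summary(&self) -> String {
        format!(
            "single-edge commits: {}, mean batch: {:.1}, max batch: {}, edges batched: {:.1}%",
            self.single_edge_commits,
            self.mean_batch(),
            self.max_batch,
            self.pct_edges_batched()
        )
    }
}

fn main() {
    let mut stats = BatchStats::default();
    // Two batched commits interleaved with two forced single-edge commits.
    for len in [4usize, 1, 4, 1] {
        stats.record_commit(len);
    }
    println!("{}", stats.summary());
}
```

Under this definition, "% edges batched" is the share of edges that did not go out as a single-edge commit, which is consistent with the jtag_minimal figures reported in this PR (102,310 single-edge commits out of 4,000,000 edges is about 2.6% of edges).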

Measurements (this machine)

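| Fixture      | Edges     | Batched (edges) | Single-edge commits | Total commits |
|--------------|-----------|-----------------|---------------------|---------------|
| dual_uart    | 10,000    | 100%            | 0                   | 11            |
| apb_trace    | 200       | 100%            | 0                   | 2             |
| xprop_cosim  | 40        | 100%            | 0                   | 2             |
| jtag_minimal | 4,000,000 | 97.4%           | 102,310             | 106,117       |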

Findings recorded

  1. The seam must be batch-capable, not naive per-edge. Designs with
    GPU-side peripherals run 100% batched; Metal encodes up to BATCH_SIZE
    edges + GPU peripheral kernels + ring drain into one command buffer. A
    literal simulate_edge-per-edge trait would regress Metal ~1000×.
    Refines ADR 0017's "per-edge on every backend" framing; the batch model
    itself is unchanged (still a non-goal).
  2. JTAG's single-edge tail is the measured MC.3 trigger. 102k
    single-edge commits (96% of submits, 2.6% of edges) dominate JTAG's
    wall-clock — the "CPU↔GPU round-trip measured as the bottleneck"
    trigger, and the motivation for MC.4 (per-island multi-rate batching).
    Both are orthogonal to, and larger than, the #105 seam.
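
To make finding 1 concrete, here is a rough sketch of why a batch-granular entry point matters. The trait shape is an assumption: `run_edges` is named in the plan, but its signature here, the `CountingBackend` mock, the `drive` helper, and the batch size of 1,000 are all hypothetical.

```rust
/// Hypothetical batch-granular seam: one call corresponds to one submission.
trait CosimBackend {
    /// Run `count` consecutive edges starting at `first_edge`.
    fn run_edges(&mut self, first_edge: usize, count: usize);
}

/// Mock backend that only counts submissions and edges (no simulation).
#[derive(Default)]
struct CountingBackend {
    submits: u64,
    edges: u64,
}

impl CosimBackend for CountingBackend {
    fn run_edges(&mut self, _first_edge: usize, count: usize) {
        self.submits += 1;
        self.edges += count as u64;
    }
}

/// Drive `total` edges in chunks of at most `batch_size` edges.
fn drive<B: CosimBackend>(backend: &mut B, total: usize, batch_size: usize) {
    assert!(batch_size > 0, "batch_size must be non-zero");
    let mut next = 0;
    while next < total {
        let count = batch_size.min(total - next);
        backend.run_edges(next, count);
        next += count;
    }
}

fn main() {
    // Illustrative numbers only; the real BATCH_SIZE is not stated in this PR.
    let (total, batch_size) = (10_000, 1_000);

    let mut batched = CountingBackend::default();
    drive(&mut batched, total, batch_size);

    // A naive per-edge seam is equivalent to a batch size of 1.
    let mut per_edge = CountingBackend::default();
    drive(&mut per_edge, total, 1);

    println!(
        "batched: {} submits for {} edges; per-edge: {} submits for {} edges",
        batched.submits, batched.edges, per_edge.submits, per_edge.edges
    );
}
```

With these illustrative numbers the per-edge driver issues 1,000 times as many submissions as the batched one, which is the kind of ratio behind the regression estimate in finding 1.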

Not in this PR

The seam refactor itself (Phase 0b/0c) is intentionally deferred — this
PR records the design corrections first. Telemetry is kept as useful
ongoing signal (it's the MC.3 bottleneck metric).

robtaylor added a commit that referenced this pull request Jun 7, 2026
Second session update: #115 merged; debug-mask fix (#116) and
batch-utilisation telemetry + findings (#117) open. Captures the
batch-capable-trait refinement and the paused-for-review status of
Phase 0b/0c.

Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
robtaylor added a commit that referenced this pull request Jun 7, 2026
…tion

#117 expanded to the ADR 0017 target-architecture rewrite + rename;
JTAG cosim pinned to self-hosted (xlarge too slow) — main green again.
Phase 0 implementation not started; design review concluded.

Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
robtaylor added a commit that referenced this pull request Jun 7, 2026
… (review)

Review feedback on #117:

- Schedule buffers are owned by the backend (opaque to orchestration),
  built once via init_schedule and mutated via edge_ops_mut — NOT an
  orchestration-owned Vec the backend re-materialises each dispatch. This
  keeps Metal's zero-copy unified-memory path (edge_ops_mut returns a slice
  over the shared MTLBuffer; the write IS the upload) and lets CUDA/HIP
  upload only dirty edges, avoiding needless CPU↔GPU traffic. Also resolves
  the closure-borrow friction. ADR Layer 2 + Consequences and the plan's
  trait sketch / Phase 0 updated.

- Merge old Phase 2 (CUDA per-edge correctness) and Phase 3 (GPU
  peripherals) into one Phase 2 that lands the CUDA/HIP backend WITH its
  Tier-2 GPU peripherals. Per-edge-only CUDA is an unusably-slow,
  non-shippable intermediate, and Phase 1 already proves the per-edge
  orchestration — so a separate per-edge milestone re-validates known-good
  code then reworks dispatch for batching. Internal 2a/2b checkpoint keeps
  it verifiable. Single-source peripherals renumbered Phase 3.

Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
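
The backend-owned schedule in the review feedback above might look something like the sketch below. `init_schedule` and `edge_ops_mut` are named in the commit message, but their signatures, the `EdgeOp` type, the `ScheduleOwner` trait name, and the dirty-range bookkeeping are assumptions for illustration. The mock stands in for a discrete-GPU backend and only counts what it would upload.

```rust
/// Hypothetical per-edge schedule entry.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct EdgeOp {
    flags: u32,
}

/// Assumed shape of the schedule-owning part of the backend trait.
trait ScheduleOwner {
    /// Build the schedule once; the backend keeps ownership of the storage.
    fn init_schedule(&mut self, ops: &[EdgeOp]);
    /// Hand out a mutable window so orchestration can edit entries in place.
    fn edge_ops_mut(&mut self, first: usize, count: usize) -> &mut [EdgeOp];
}

/// Stand-in for a CUDA/HIP-style backend: host mirror plus a dirty range.
#[derive(Default)]
struct DirtyTrackingBackend {
    host: Vec<EdgeOp>,
    dirty: Option<(usize, usize)>, // half-open range [start, end)
    uploaded_ops: usize,           // simulated host-to-device traffic
}

impl ScheduleOwner for DirtyTrackingBackend {
    fn init_schedule(&mut self, ops: &[EdgeOp]) {
        self.host = ops.to_vec();
        self.uploaded_ops += ops.len(); // one full upload at init
        self.dirty = None;
    }

    fn edge_ops_mut(&mut self, first: usize, count: usize) -> &mut [EdgeOp] {
        let end = first + count;
        // Conservatively mark the whole window as dirty.
        self.dirty = Some(match self.dirty {
            Some((s, e)) => (s.min(first), e.max(end)),
            None => (first, end),
        });
        &mut self.host[first..end]
    }
}

impl DirtyTrackingBackend {
    /// Upload only the dirty range before the next dispatch (simulated).
    fn flush_dirty(&mut self) -> usize {
        match self.dirty.take() {
            Some((s, e)) => {
                self.uploaded_ops += e - s;
                e - s
            }
            None => 0,
        }
    }
}

fn main() {
    let mut backend = DirtyTrackingBackend::default();
    backend.init_schedule(&[EdgeOp::default(); 8]);
    backend.edge_ops_mut(3, 2)[0].flags = 7;
    let flushed = backend.flush_dirty();
    println!("re-uploaded {flushed} of {} ops", backend.host.len());
}
```

On a unified-memory backend such as Metal, the same `edge_ops_mut` call could return a slice over the shared buffer directly, so no flush step would be needed; that is the zero-copy property the review comment aims to preserve.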
robtaylor added 4 commits June 7, 2026 12:19
Add batch-utilisation telemetry to the cosim run summary
(single-edge commits, mean/max batch) to measure how often the
GPU-only batched fast path is exercised vs. forced single-edge
dispatch. Behaviour-preserving: all 7 Metal cosim fixtures stay
byte-identical.

Measurements (this machine):
- dual_uart / apb_trace / xprop_cosim: 100% batched, 0 single-edge.
- jtag_minimal (CPU-side replay): 97.4% of edges batched, but
  102,310 single-edge commits (96% of all submits) dominate wall-clock.

Record two design findings:
- The CosimBackend seam (#105) must be batch-capable, not naive
  per-edge: Metal's production path encodes up to BATCH_SIZE edges +
  GPU peripherals + ring drain per command buffer, so a literal
  simulate_edge-per-edge trait would regress it ~1000x. Refines ADR
  0017's "per-edge on every backend" framing without changing the
  batch model (still a non-goal).
- The JTAG single-edge tail is the measured MC.3 trigger ("CPU<->GPU
  round-trip measured as the bottleneck") and the motivation for MC.4
  (per-island multi-rate batching) in the multi-clock plan — both
  orthogonal to and larger than the #105 seam.

Docs: ADR 0017 amendment (Measured batch utilisation), the
cosim-backend-portability plan (batch-granularity refinement), and the
multi-clock plan (MC.3/MC.4 trigger now measured).

Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
The variable (renamed edges_per_sys_clk_cycle → sched_ticks_per_sys_clk_cycle)
computes `clock_period_ps / gcd_ps`, i.e. dense scheduler ticks per sys_clk
period, NOT the number of sys_clk transitions (always 2). It
equals 2 only when gcd_ps == the half-period (single-clock or harmonic
multi-clock); with non-commensurate periods or phase offsets gcd_ps
shrinks below the half-period and it is >2. The old name + comment claimed
it was "still 2 for multi-clock", conflating sys_clk edges with scheduler
ticks. Pure rename + comment fix; no behaviour change.

Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
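
A small worked example of the tick count from the rename commit above. It assumes, for illustration only, that `gcd_ps` is the GCD of every clock's half-period and any phase offsets; the helper names and the picosecond values are hypothetical, and the harmonic case uses a second clock slower than sys_clk.

```rust
/// Euclid's algorithm; gcd(a, 0) == a.
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 {
        a
    } else {
        gcd(b, a % b)
    }
}

/// Assumption: the scheduler tick is the GCD of all half-periods and offsets.
fn gcd_ps(half_periods_ps: &[u64], phase_offsets_ps: &[u64]) -> u64 {
    half_periods_ps
        .iter()
        .chain(phase_offsets_ps)
        .copied()
        .fold(0, gcd)
}

/// Dense scheduler ticks per sys_clk period, as described in the commit.
fn sched_ticks_per_sys_clk_cycle(clock_period_ps: u64, gcd_ps: u64) -> u64 {
    clock_period_ps / gcd_ps
}

fn main() {
    let sys_period = 10_000; // hypothetical 10 ns sys_clk, half-period 5,000 ps

    // Single clock: the tick equals the half-period.
    let single = gcd_ps(&[5_000], &[]);
    // Harmonic second clock at half the frequency (half-period 10,000 ps).
    let harmonic = gcd_ps(&[5_000, 10_000], &[]);
    // Non-commensurate second clock (half-period 3,000 ps).
    let mixed = gcd_ps(&[5_000, 3_000], &[]);
    // Same-frequency second clock with a 1,250 ps phase offset.
    let offset = gcd_ps(&[5_000, 5_000], &[1_250]);

    for (name, g) in [
        ("single", single),
        ("harmonic", harmonic),
        ("non-commensurate", mixed),
        ("phase offset", offset),
    ] {
        println!(
            "{name}: gcd_ps = {g}, ticks per sys_clk cycle = {}",
            sched_ticks_per_sys_clk_cycle(sys_period, g)
        );
    }
}
```

Under these assumptions the first two cases give 2 ticks per cycle, while the non-commensurate and phase-offset cases give 10 and 8, which is the ">2" behaviour the old name obscured.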
…(plan)

Rewrite ADR 0017's portability amendment into the steady-state target
architecture (supersedes the incremental 2026-06-05 note):
- 3-layer design: backend-agnostic orchestration / batch-granular
  CosimBackend trait / GpuPeripheral abstraction.
- GpuPeripheral is a 3-tier model with GPU peripherals as the PRIMARY
  path to CUDA/HIP batching (Tier 1 CPU reference/oracle/fallback;
  Tier 2 hand-written kernels, CUDA+HIP sharing *_impl.cuh = 2 impls;
  Tier 3 single-source/user-extensible, later). PCIe per-edge cost makes
  GPU peripherals architecturally required, not optional.
- Speculative batching (MC.4/MC.5) recorded as considered-not-adopted.
- Cross-backend equivalence (CPU model = ground truth) as the contract.
- Update edges_per_sys_clk_cycle → sched_ticks_per_sys_clk_cycle refs.

Re-sequence the plan staging to match: P0 seam → P1 CpuBackend+Linux CI
→ P2 CUDA/HIP correctness (per-edge, CI gate, NOT perf) → P3 GPU
peripherals CUDA/HIP (promoted from optional → the perf path) → P4
single-source peripherals. Trait sketch updated to batch-granular
(run_edges) + GpuPeripheral.

Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
robtaylor force-pushed the feat/cosim-backend-seam branch from 9c3ebda to a7e4bfd June 7, 2026 11:20
robtaylor merged commit 87ee498 into main Jun 7, 2026
16 checks passed
robtaylor deleted the feat/cosim-backend-seam branch June 7, 2026 11:54
robtaylor added a commit that referenced this pull request Jun 7, 2026
Both PRs merged. Sync handoff to final state: backend-owned schedule
(edge_ops_mut), P2/P3 merged in the plan, JTAG on xlarge + 40/45min
timeout (validated 32.0min), amendment date refs.

Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)