feat(cosim): instrument batch utilisation + record backend-seam findings (#105) by robtaylor · Pull Request #117 · gpu-eda/Jacquard

robtaylor · 2026-06-07T09:28:26Z

Why

Before extracting the CosimBackend seam (#105 Phase 0), measure how the
cosim loop actually dispatches — specifically how often the GPU-only
batched fast path is used vs. forced single-edge (per-edge CPU↔GPU
handover). The answer reshapes the seam design.

What

- Telemetry in the cosim run summary: single-edge commits, mean/max
  batch, % edges batched. Two u64/usize counters in the dispatch loop
  plus one summary line. Behaviour-preserving: all 7 Metal cosim fixtures
  stay byte-identical (verified against a pre-change golden).
- Doc findings in ADR 0017 (amendment), the cosim-backend-portability
  plan, and the multi-clock plan.
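
As a rough illustration of the telemetry described above, the following sketch shows how per-commit batch lengths could be folded into the reported figures. It is not the code from this PR: the `BatchStats` type, its field names, and the wording of the summary line are all hypothetical, and it keeps more fields than the two counters the PR adds, purely for readability.

```rust
/// Hypothetical telemetry counters for a cosim dispatch loop
/// (names are illustrative, not taken from the Jacquard source).
#[derive(Debug, Default)]
pub struct BatchStats {
    /// Total edges simulated across all commits.
    pub edges: u64,
    /// Total commits (submissions) made by the dispatch loop.
    pub commits: u64,
    /// Commits that carried exactly one edge.
    pub single_edge_commits: u64,
    /// Largest number of edges seen in a single commit.
    pub max_batch: usize,
}

impl BatchStats {
    /// Record one commit that carried `batch_len` edges.
    pub fn record_commit(&mut self, batch_len: usize) {
        if batch_len == 0 {
            return; // nothing was submitted
        }
        self.edges += batch_len as u64;
        self.commits += 1;
        if batch_len == 1 {
            self.single_edge_commits += 1;
        }
        self.max_batch = self.max_batch.max(batch_len);
    }

    /// Mean edges per commit (0.0 before any commit).
    pub fn mean_batch(&self) -> f64 {
        if self.commits == 0 {
            0.0
        } else {
            self.edges as f64 / self.commits as f64
        }
    }

    /// Percentage of edges that went through multi-edge commits.
    pub fn pct_edges_batched(&self) -> f64 {
        if self.edges == 0 {
            return 0.0;
        }
        // Each single-edge commit accounts for exactly one edge.
        let batched = self.edges - self.single_edge_commits;
        100.0 * batched as f64 / self.edges as f64
    }

    /// One-line summary in an assumed, illustrative format.
    pub fn summary_line(&self) -> String {
        format!(
            "cosim batch: {} commits ({} single-edge), mean batch {:.1}, max batch {}, {:.1}% edges batched",
            self.commits,
            self.single_edge_commits,
            self.mean_batch(),
            self.max_batch,
            self.pct_edges_batched()
        )
    }
}
```

Calling `record_commit` once per submission keeps the instrumentation out of the simulation path itself, which fits the behaviour-preserving claim above.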

Measurements (this machine)

| Fixture | Edges | Batched (edges) | Single-edge commits | Total commits |
| --- | ---: | ---: | ---: | ---: |
| dual_uart | 10,000 | 100% | 0 | 11 |
| apb_trace | 200 | 100% | 0 | 2 |
| xprop_cosim | 40 | 100% | 0 | 2 |
| jtag_minimal | 4,000,000 | 97.4% | 102,310 | 106,117 |
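
The derived percentages quoted for jtag_minimal can be re-checked from the table with a few lines of arithmetic. The sketch below assumes that every single-edge commit carries exactly one edge; the `Row` type and its method names are hypothetical helpers, not part of the project.

```rust
/// One row of the measurements table (hypothetical helper type).
pub struct Row {
    pub edges: u64,
    pub single_edge_commits: u64,
    pub total_commits: u64,
}

/// The jtag_minimal figures, copied from the table.
pub const JTAG_MINIMAL: Row = Row {
    edges: 4_000_000,
    single_edge_commits: 102_310,
    total_commits: 106_117,
};

impl Row {
    /// Share of all submits that were single-edge, in percent.
    pub fn pct_single_submits(&self) -> f64 {
        100.0 * self.single_edge_commits as f64 / self.total_commits as f64
    }

    /// Share of all edges that were dispatched one at a time, in percent.
    /// Assumes one edge per single-edge commit.
    pub fn pct_single_edges(&self) -> f64 {
        100.0 * self.single_edge_commits as f64 / self.edges as f64
    }

    /// Mean edges per multi-edge commit, or None if there were none.
    pub fn mean_batched_commit(&self) -> Option<f64> {
        let batched_commits = self.total_commits - self.single_edge_commits;
        if batched_commits == 0 {
            return None;
        }
        let batched_edges = self.edges - self.single_edge_commits;
        Some(batched_edges as f64 / batched_commits as f64)
    }
}
```

Under that assumption the row gives roughly 96.4% of submits and 2.6% of edges as single-edge, and about 1,024 edges per batched commit, which is in line with the "~1000×" regression estimate in the findings.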

Findings recorded

- The seam must be batch-capable, not naive per-edge. Designs with
  GPU-side peripherals run 100% batched; Metal encodes up to BATCH_SIZE
  edges + GPU peripheral kernels + ring drain into one command buffer. A
  literal simulate_edge-per-edge trait would regress Metal ~1000×.
  This refines ADR 0017's "per-edge on every backend" framing; the batch
  model itself is unchanged (changing it remains a non-goal).
- JTAG's single-edge tail is the measured MC.3 trigger. 102k
  single-edge commits (96% of submits, 2.6% of edges) dominate JTAG's
  wall-clock. This is the "CPU↔GPU round-trip measured as the bottleneck"
  trigger, and the motivation for MC.4 (per-island multi-rate batching).
  Both are orthogonal to and larger than the #105 seam.
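
To make the first finding concrete, here is a minimal sketch of what a batch-granular seam could look like, compared with per-edge dispatch. The commit notes mention a `run_edges` entry point, but the signature used here, the `submits` accessor, the `CountingBackend` stand-in, and the `drive` helper are all assumptions made for illustration. The toy backend only counts submissions and does no simulation.

```rust
/// Hypothetical batch-granular backend seam (signature is assumed).
pub trait CosimBackend {
    /// Simulate `n_edges` consecutive edges and commit them as one submission.
    fn run_edges(&mut self, n_edges: usize);
    /// Number of submissions made so far.
    fn submits(&self) -> u64;
}

/// Toy backend that only counts work, standing in for a GPU command queue.
#[derive(Debug, Default)]
pub struct CountingBackend {
    pub submits: u64,
    pub edges: u64,
}

impl CosimBackend for CountingBackend {
    fn run_edges(&mut self, n_edges: usize) {
        if n_edges == 0 {
            return; // empty batches are not submitted
        }
        self.edges += n_edges as u64;
        self.submits += 1;
    }

    fn submits(&self) -> u64 {
        self.submits
    }
}

/// Drive `total_edges` through the backend in chunks of at most `batch_size`
/// and return the total number of submissions the backend has made.
pub fn drive<B: CosimBackend>(backend: &mut B, total_edges: usize, batch_size: usize) -> u64 {
    assert!(batch_size > 0, "batch_size must be non-zero");
    let mut remaining = total_edges;
    while remaining > 0 {
        let n = remaining.min(batch_size);
        backend.run_edges(n);
        remaining -= n;
    }
    backend.submits()
}
```

With a batch size of 1 this degenerates into the per-edge trait the finding warns against: 10,000 edges cost 10,000 submissions instead of 10 at a batch size of 1,000.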

Not in this PR

The seam refactor itself (Phase 0b/0c) is intentionally deferred — this
PR records the design corrections first. Telemetry is kept as useful
ongoing signal (it's the MC.3 bottleneck metric).

Second session update: #115 merged; debug-mask fix (#116) and batch-utilisation telemetry + findings (#117) open. Captures the batch-capable-trait refinement and the paused-for-review status of Phase 0b/0c. Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)

…tion #117 expanded to the ADR 0017 target-architecture rewrite + rename; JTAG cosim pinned to self-hosted (xlarge too slow) — main green again. Phase 0 implementation not started; design review concluded. Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)

… (review) Review feedback on #117:

- Schedule buffers are owned by the backend (opaque to orchestration), built once via init_schedule and mutated via edge_ops_mut, NOT an orchestration-owned Vec the backend re-materialises each dispatch. This keeps Metal's zero-copy unified-memory path (edge_ops_mut returns a slice over the shared MTLBuffer; the write IS the upload) and lets CUDA/HIP upload only dirty edges, avoiding needless CPU↔GPU traffic. It also resolves the closure-borrow friction. ADR Layer 2 + Consequences and the plan's trait sketch / Phase 0 updated.
- Merge old Phase 2 (CUDA per-edge correctness) and Phase 3 (GPU peripherals) into one Phase 2 that lands the CUDA/HIP backend WITH its Tier-2 GPU peripherals. Per-edge-only CUDA is an unusably slow, non-shippable intermediate, and Phase 1 already proves the per-edge orchestration, so a separate per-edge milestone would re-validate known-good code and then rework dispatch for batching. An internal 2a/2b checkpoint keeps it verifiable. Single-source peripherals renumbered Phase 3.

Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
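
The backend-owned schedule idea can be sketched for a discrete-GPU-style backend as follows. Only the names `init_schedule` and `edge_ops_mut` come from the commit message; their signatures, the `EdgeOp` payload, the `flush_dirty` step, and the `Vec` standing in for device memory are assumptions made for illustration.

```rust
use std::ops::Range;

/// Hypothetical per-edge schedule entry.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct EdgeOp {
    pub flags: u32,
}

/// Toy backend that owns the schedule and tracks which edges were touched.
/// `device` is a plain Vec standing in for device memory.
pub struct DirtyTrackingBackend {
    host: Vec<EdgeOp>,
    device: Vec<EdgeOp>,
    dirty: Vec<bool>,
    /// Total entries copied host-to-device, including the initial upload.
    pub uploaded: usize,
}

impl DirtyTrackingBackend {
    /// Build the schedule once and perform the single full upload.
    pub fn init_schedule(n_edges: usize) -> Self {
        Self {
            host: vec![EdgeOp::default(); n_edges],
            device: vec![EdgeOp::default(); n_edges],
            dirty: vec![false; n_edges],
            uploaded: n_edges,
        }
    }

    /// Hand out a mutable view of a range of edges, conservatively
    /// marking every edge in the range as dirty.
    pub fn edge_ops_mut(&mut self, edges: Range<usize>) -> &mut [EdgeOp] {
        for d in &mut self.dirty[edges.clone()] {
            *d = true;
        }
        &mut self.host[edges]
    }

    /// Copy only the dirty entries to the device; returns how many were copied.
    pub fn flush_dirty(&mut self) -> usize {
        let mut copied = 0;
        for i in 0..self.host.len() {
            if self.dirty[i] {
                self.device[i] = self.host[i];
                self.dirty[i] = false;
                copied += 1;
            }
        }
        self.uploaded += copied;
        copied
    }

    /// Read back what the device currently holds for one edge.
    pub fn device_op(&self, edge: usize) -> EdgeOp {
        self.device[edge]
    }
}
```

On a unified-memory backend the same `edge_ops_mut` could return a slice over the shared buffer directly and make `flush_dirty` a no-op, which is the zero-copy property the review feedback wants to preserve.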

Add batch-utilisation telemetry to the cosim run summary (single-edge commits, mean/max batch) to measure how often the GPU-only batched fast path is exercised vs. forced single-edge dispatch. Behaviour-preserving: all 7 Metal cosim fixtures stay byte-identical.

Measurements (this machine):

- dual_uart / apb_trace / xprop_cosim: 100% batched, 0 single-edge.
- jtag_minimal (CPU-side replay): 97.4% of edges batched, but 102,310 single-edge commits (96% of all submits) dominate wall-clock.

Record two design findings:

- The CosimBackend seam (#105) must be batch-capable, not naive per-edge: Metal's production path encodes up to BATCH_SIZE edges + GPU peripherals + ring drain per command buffer, so a literal simulate_edge-per-edge trait would regress it ~1000x. Refines ADR 0017's "per-edge on every backend" framing without changing the batch model (still a non-goal).
- The JTAG single-edge tail is the measured MC.3 trigger ("CPU<->GPU round-trip measured as the bottleneck") and the motivation for MC.4 (per-island multi-rate batching) in the multi-clock plan; both are orthogonal to and larger than the #105 seam.

Docs: ADR 0017 amendment (Measured batch utilisation), the cosim-backend-portability plan (batch-granularity refinement), and the multi-clock plan (MC.3/MC.4 trigger now measured).

Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)

The variable computes `clock_period_ps / gcd_ps` = dense scheduler ticks per sys_clk period, NOT the number of sys_clk transitions (always 2). It equals 2 only when gcd_ps == the half-period (single-clock or harmonic multi-clock); with non-commensurate periods or phase offsets gcd_ps shrinks below the half-period and it is >2. The old name + comment claimed it was "still 2 for multi-clock", conflating sys_clk edges with scheduler ticks. Pure rename + comment fix; no behaviour change. Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
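
A small worked example of the `clock_period_ps / gcd_ps` computation may help. The sketch assumes that `gcd_ps` is the greatest common divisor of every clock's half-period and phase offset, and that periods are even numbers of picoseconds; the commit message does not spell out either point, and the `Clock` type and function signatures are hypothetical.

```rust
/// Greatest common divisor by Euclid's algorithm; gcd(0, x) == x.
pub fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 {
        a
    } else {
        gcd(b, a % b)
    }
}

/// Hypothetical clock description (periods assumed to be even).
pub struct Clock {
    pub period_ps: u64,
    pub phase_ps: u64,
}

/// Assumed model: the scheduler tick is the gcd of all half-periods
/// and all phase offsets.
pub fn gcd_ps(clocks: &[Clock]) -> u64 {
    clocks
        .iter()
        .fold(0, |g, c| gcd(gcd(g, c.period_ps / 2), c.phase_ps))
}

/// Dense scheduler ticks per sys_clk period.
pub fn sched_ticks_per_sys_clk_cycle(sys_clk_period_ps: u64, clocks: &[Clock]) -> u64 {
    let tick = gcd_ps(clocks);
    assert!(tick > 0, "at least one clock with a non-zero period is required");
    sys_clk_period_ps / tick
}
```

Under these assumptions a single 10,000 ps sys_clk gives 2 ticks, and adding a harmonic 20,000 ps clock still gives 2. Adding a 14,000 ps clock shrinks the tick to 1,000 ps and gives 10, and a lone sys_clk with a 1,250 ps phase offset gives 8.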

…(plan) Rewrite ADR 0017's portability amendment into the steady-state target architecture (supersedes the incremental 2026-06-05 note):

- 3-layer design: backend-agnostic orchestration / batch-granular CosimBackend trait / GpuPeripheral abstraction.
- GpuPeripheral is a 3-tier model with GPU peripherals as the PRIMARY path to CUDA/HIP batching (Tier 1 CPU reference/oracle/fallback; Tier 2 hand-written kernels, CUDA+HIP sharing *_impl.cuh = 2 impls; Tier 3 single-source/user-extensible, later). PCIe per-edge cost makes GPU peripherals architecturally required, not optional.
- Speculative batching (MC.4/MC.5) recorded as considered-not-adopted.
- Cross-backend equivalence (CPU model = ground truth) as the contract.
- Update edges_per_sys_clk_cycle → sched_ticks_per_sys_clk_cycle refs.

Re-sequence the plan staging to match: P0 seam → P1 CpuBackend+Linux CI → P2 CUDA/HIP correctness (per-edge, CI gate, NOT perf) → P3 GPU peripherals CUDA/HIP (promoted from optional → the perf path) → P4 single-source peripherals. Trait sketch updated to batch-granular (run_edges) + GpuPeripheral.

Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
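
The cross-backend equivalence contract (CPU model as ground truth) could be exercised along the lines of the sketch below. Everything in it is hypothetical: a toy accumulator stands in for a real peripheral, `CpuAccumulator` plays the Tier 1 reference stepped one edge at a time, and `BatchedAccumulator` plays a batched implementation whose output must match edge for edge.

```rust
/// Toy Tier-1-style reference: one input sample per edge, stepped on the CPU.
#[derive(Default)]
pub struct CpuAccumulator {
    acc: u8,
}

impl CpuAccumulator {
    /// Add one sample (wrapping) and return the running sum.
    pub fn step(&mut self, x: u8) -> u8 {
        self.acc = self.acc.wrapping_add(x);
        self.acc
    }
}

/// Toy stand-in for a batched implementation of the same peripheral.
#[derive(Default)]
pub struct BatchedAccumulator {
    acc: u8,
}

impl BatchedAccumulator {
    /// Process a whole batch of samples and return one output per edge.
    pub fn run_batch(&mut self, xs: &[u8]) -> Vec<u8> {
        let mut out = Vec::with_capacity(xs.len());
        for &x in xs {
            self.acc = self.acc.wrapping_add(x);
            out.push(self.acc);
        }
        out
    }
}

/// Run the same stimulus through both models and return the index of the
/// first edge where they disagree, or None if they are equivalent.
pub fn first_divergence(stimulus: &[u8], batch_size: usize) -> Option<usize> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    let mut reference = CpuAccumulator::default();
    let mut candidate = BatchedAccumulator::default();
    let expected: Vec<u8> = stimulus.iter().map(|&x| reference.step(x)).collect();
    let mut actual = Vec::with_capacity(stimulus.len());
    for chunk in stimulus.chunks(batch_size) {
        actual.extend(candidate.run_batch(chunk));
    }
    expected.iter().zip(&actual).position(|(e, a)| e != a)
}
```

Reporting the first diverging edge rather than a plain pass/fail makes a mismatch easier to localise, and varying `batch_size` checks that state carries correctly across batch boundaries.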

Both PRs merged. Sync handoff to final state: backend-owned schedule (edge_ops_mut), P2/P3 merged in the plan, JTAG on xlarge + 40/45min timeout (validated 32.0min), amendment date refs. Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)

robtaylor added 4 commits June 7, 2026 12:19

robtaylor force-pushed the feat/cosim-backend-seam branch from 9c3ebda to a7e4bfd Compare June 7, 2026 11:20

robtaylor merged commit 87ee498 into main Jun 7, 2026
16 checks passed

robtaylor deleted the feat/cosim-backend-seam branch June 7, 2026 11:54
