feat(cosim): instrument batch utilisation + record backend-seam findings (#105) #117
Merged
robtaylor added a commit that referenced this pull request on Jun 7, 2026
…tion #117 expanded to the ADR 0017 target-architecture rewrite + rename; JTAG cosim pinned to self-hosted (xlarge too slow) — main green again. Phase 0 implementation not started; design review concluded. Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
robtaylor added a commit that referenced this pull request on Jun 7, 2026
… (review) Review feedback on #117:
- Schedule buffers are owned by the backend (opaque to orchestration), built once via init_schedule and mutated via edge_ops_mut, NOT an orchestration-owned Vec the backend re-materialises each dispatch. This keeps Metal's zero-copy unified-memory path (edge_ops_mut returns a slice over the shared MTLBuffer; the write IS the upload) and lets CUDA/HIP upload only dirty edges, avoiding needless CPU↔GPU traffic. Also resolves the closure-borrow friction. ADR Layer 2 + Consequences and the plan's trait sketch / Phase 0 updated.
- Merge old Phase 2 (CUDA per-edge correctness) and Phase 3 (GPU peripherals) into one Phase 2 that lands the CUDA/HIP backend WITH its Tier-2 GPU peripherals. Per-edge-only CUDA is an unusably slow, non-shippable intermediate, and Phase 1 already proves the per-edge orchestration, so a separate per-edge milestone would re-validate known-good code and then rework dispatch for batching. Internal 2a/2b checkpoint keeps it verifiable. Single-source peripherals renumbered Phase 3.

Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
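The ownership rule in the first bullet can be illustrated with a minimal sketch. Only the names init_schedule and edge_ops_mut come from the commit message; the signatures, the EdgeOp fields, the ScheduleOwner trait name, and the DirtyTrackingBackend type are assumptions made for illustration, not the project's actual API. The stand-in backend mimics a discrete-GPU (CUDA/HIP-style) implementation that records which edges were handed out for mutation, so only that range would need uploading.

```rust
/// One scheduled edge operation. Fields are hypothetical.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct EdgeOp {
    pub tick: u64,
    pub clock_mask: u32,
}

/// Hypothetical slice of the backend trait: the backend, not the
/// orchestration layer, owns the schedule storage.
pub trait ScheduleOwner {
    /// Build the schedule storage once, at start-up.
    fn init_schedule(&mut self, ops: &[EdgeOp]);
    /// Mutable view of edges `start..start + len`.
    /// Panics if the range is out of bounds.
    fn edge_ops_mut(&mut self, start: usize, len: usize) -> &mut [EdgeOp];
}

/// Stand-in for a discrete-GPU backend: keeps a host mirror and tracks
/// the half-open range of edges that may have been modified.
pub struct DirtyTrackingBackend {
    host: Vec<EdgeOp>,
    dirty: Option<(usize, usize)>,
}

impl DirtyTrackingBackend {
    pub fn new() -> Self {
        Self { host: Vec::new(), dirty: None }
    }

    /// Returns the range a real backend would upload, and clears it.
    pub fn take_dirty(&mut self) -> Option<(usize, usize)> {
        self.dirty.take()
    }
}

impl ScheduleOwner for DirtyTrackingBackend {
    fn init_schedule(&mut self, ops: &[EdgeOp]) {
        self.host = ops.to_vec();
        // A freshly built schedule must be uploaded in full.
        self.dirty = if ops.is_empty() { None } else { Some((0, ops.len())) };
    }

    fn edge_ops_mut(&mut self, start: usize, len: usize) -> &mut [EdgeOp] {
        let end = start + len;
        if len > 0 {
            // Conservatively widen the dirty range to cover this view.
            self.dirty = Some(match self.dirty {
                Some((lo, hi)) => (lo.min(start), hi.max(end)),
                None => (start, end),
            });
        }
        &mut self.host[start..end]
    }
}
```

On a unified-memory backend the same trait method could return a slice directly over the shared buffer, in which case no dirty tracking or upload step would be needed.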
Add batch-utilisation telemetry to the cosim run summary (single-edge commits, mean/max batch) to measure how often the GPU-only batched fast path is exercised vs. forced single-edge dispatch. Behaviour-preserving: all 7 Metal cosim fixtures stay byte-identical.

Measurements (this machine):
- dual_uart / apb_trace / xprop_cosim: 100% batched, 0 single-edge.
- jtag_minimal (CPU-side replay): 97.4% of edges batched, but 102,310 single-edge commits (96% of all submits) dominate wall-clock.

Record two design findings:
- The CosimBackend seam (#105) must be batch-capable, not naive per-edge: Metal's production path encodes up to BATCH_SIZE edges + GPU peripherals + ring drain per command buffer, so a literal simulate_edge-per-edge trait would regress it ~1000x. Refines ADR 0017's "per-edge on every backend" framing without changing the batch model (still a non-goal).
- The JTAG single-edge tail is the measured MC.3 trigger ("CPU<->GPU round-trip measured as the bottleneck") and the motivation for MC.4 (per-island multi-rate batching) in the multi-clock plan; both are orthogonal to and larger than the #105 seam.

Docs: ADR 0017 amendment (Measured batch utilisation), the cosim-backend-portability plan (batch-granularity refinement), and the multi-clock plan (MC.3/MC.4 trigger now measured).

Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
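As a rough illustration of the kind of counters this commit describes, the sketch below accumulates per-submit statistics and derives the summary figures (single-edge commits, mean/max batch, % edges batched). The BatchStats type, its field names, and the rule that a "single-edge commit" is any submit carrying exactly one edge are assumptions for this example; the actual counters in the dispatch loop may be defined differently.

```rust
/// Hypothetical batch-utilisation counters for a cosim dispatch loop.
#[derive(Debug, Default, Clone, Copy)]
pub struct BatchStats {
    pub submits: u64,
    pub single_edge_commits: u64,
    pub edges_total: u64,
    pub max_batch: usize,
}

impl BatchStats {
    /// Record one submit (command buffer) that carried `edges` edges.
    pub fn record_submit(&mut self, edges: usize) {
        if edges == 0 {
            return; // nothing was dispatched
        }
        self.submits += 1;
        self.edges_total += edges as u64;
        if edges == 1 {
            // Assumption: a one-edge submit is a forced single-edge commit.
            self.single_edge_commits += 1;
        }
        self.max_batch = self.max_batch.max(edges);
    }

    /// Mean number of edges per submit.
    pub fn mean_batch(&self) -> f64 {
        if self.submits == 0 {
            0.0
        } else {
            self.edges_total as f64 / self.submits as f64
        }
    }

    /// Percentage of edges that went through a multi-edge batch.
    pub fn pct_edges_batched(&self) -> f64 {
        if self.edges_total == 0 {
            0.0
        } else {
            let batched = self.edges_total - self.single_edge_commits;
            100.0 * batched as f64 / self.edges_total as f64
        }
    }

    /// Percentage of submits that carried only one edge.
    pub fn pct_single_edge_submits(&self) -> f64 {
        if self.submits == 0 {
            0.0
        } else {
            100.0 * self.single_edge_commits as f64 / self.submits as f64
        }
    }
}
```

The two percentages can diverge sharply, which is the pattern reported for jtag_minimal: a small share of edges can account for most submits, because each single-edge commit pays a full round-trip.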
The variable computes `clock_period_ps / gcd_ps` = dense scheduler ticks per sys_clk period, NOT the number of sys_clk transitions (always 2). It equals 2 only when gcd_ps == the half-period (single-clock or harmonic multi-clock); with non-commensurate periods or phase offsets gcd_ps shrinks below the half-period and it is >2. The old name + comment claimed it was "still 2 for multi-clock", conflating sys_clk edges with scheduler ticks. Pure rename + comment fix; no behaviour change. Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
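A small worked sketch of the arithmetic, under a stated assumption: the commit message does not say how gcd_ps is derived, so here it is taken to be the GCD of every clock's half-period and phase offset in picoseconds. The function names are hypothetical; only the quotient `clock_period_ps / gcd_ps` is taken from the text above.

```rust
/// Euclid's algorithm; gcd(0, x) == x.
pub fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    a
}

/// Assumed derivation of the scheduler tick: the GCD of all half-periods
/// and phase offsets. Each clock is `(period_ps, phase_ps)`.
pub fn scheduler_gcd_ps(clocks: &[(u64, u64)]) -> u64 {
    clocks
        .iter()
        .fold(0, |g, &(period, phase)| gcd(gcd(g, period / 2), phase))
}

/// Dense scheduler ticks per sys_clk period (not sys_clk transitions).
/// Returns None when `gcd_ps` is zero, e.g. for an empty clock list.
pub fn sched_ticks_per_sys_clk_cycle(sys_clk_period_ps: u64, gcd_ps: u64) -> Option<u64> {
    sys_clk_period_ps.checked_div(gcd_ps)
}
```

With a 10,000 ps sys_clk alone, or alongside a 20,000 ps clock, gcd_ps is the 5,000 ps half-period and the result is 2. Adding a 7,000 ps clock shrinks gcd_ps to 500 ps and the result becomes 20; a second 10,000 ps clock offset by 1,000 ps gives 10.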
…(plan) Rewrite ADR 0017's portability amendment into the steady-state target architecture (supersedes the incremental 2026-06-05 note):
- 3-layer design: backend-agnostic orchestration / batch-granular CosimBackend trait / GpuPeripheral abstraction.
- GpuPeripheral is a 3-tier model with GPU peripherals as the PRIMARY path to CUDA/HIP batching (Tier 1 CPU reference/oracle/fallback; Tier 2 hand-written kernels, CUDA+HIP sharing *_impl.cuh = 2 impls; Tier 3 single-source/user-extensible, later). PCIe per-edge cost makes GPU peripherals architecturally required, not optional.
- Speculative batching (MC.4/MC.5) recorded as considered-not-adopted.
- Cross-backend equivalence (CPU model = ground truth) as the contract.
- Update edges_per_sys_clk_cycle → sched_ticks_per_sys_clk_cycle refs.

Re-sequence the plan staging to match: P0 seam → P1 CpuBackend+Linux CI → P2 CUDA/HIP correctness (per-edge, CI gate, NOT perf) → P3 GPU peripherals CUDA/HIP (promoted from optional → the perf path) → P4 single-source peripherals. Trait sketch updated to batch-granular (run_edges) + GpuPeripheral.

Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
9c3ebda to a7e4bfd
robtaylor added a commit that referenced this pull request on Jun 7, 2026
Both PRs merged. Sync handoff to final state: backend-owned schedule (edge_ops_mut), P2/P3 merged in the plan, JTAG on xlarge + 40/45min timeout (validated 32.0min), amendment date refs. Co-developed-by: Claude Code v2.1.168 (claude-opus-4-8)
Why
Before extracting the CosimBackend seam (#105 Phase 0), measure how the cosim loop actually dispatches: specifically, how often the GPU-only batched fast path is used vs. forced single-edge (per-edge CPU↔GPU handover). The answer reshapes the seam design.
What
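- Telemetry in the cosim run summary: single-edge commits, mean/max batch, % edges batched. Two u64/usize counters in the dispatch loop.
- Behaviour-preserving: all 7 Metal cosim fixtures stay byte-identical (verified against a pre-change golden).
- Docs: ADR 0017 amendment, the cosim-backend-portability plan, and the multi-clock plan.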
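Measurements (this machine)
- dual_uart / apb_trace / xprop_cosim: 100% batched, 0 single-edge.
- jtag_minimal (CPU-side replay): 97.4% of edges batched, but 102,310 single-edge commits (96% of all submits) dominate wall-clock.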
Findings recorded
- The CosimBackend seam must be batch-capable, not naive per-edge. GPU-side peripherals run 100% batched; Metal encodes up to BATCH_SIZE edges + GPU peripheral kernels + ring drain into one command buffer. A literal simulate_edge-per-edge trait would regress Metal ~1000×. Refines ADR 0017's "per-edge on every backend" framing; the batch model itself is unchanged (still a non-goal).
- The JTAG single-edge tail is the measured MC.3 trigger. 102,310 single-edge commits (96% of submits, 2.6% of edges) dominate JTAG's wall-clock: this is the "CPU↔GPU round-trip measured as the bottleneck" trigger, and the motivation for MC.4 (per-island multi-rate batching). Both are orthogonal to and larger than the #105 seam.
Not in this PR
The seam refactor itself (Phase 0b/0c) is intentionally deferred — this
PR records the design corrections first. Telemetry is kept as useful
ongoing signal (it's the MC.3 bottleneck metric).