Add host-side in-process-RPC path for real-time QEC decoding by cketcham2333 · Pull Request #609 · NVIDIA/cudaqx

cketcham2333 · 2026-06-12T21:11:53Z

Description

What this adds. A public, host-side path for in-process RPC ("inproc_rpc")
real-time QEC decoding. In this mode the host produces syndrome-decode requests into a
shared host/device ring buffer; they are serviced through cuda-quantum's on-GPU
DEVICE_LOOP dispatcher by a GPU-resident decoder (e.g. nv-qldpc-decoder), keeping
decode on the device with no CPU round-trip on the hot path. Selected at runtime via
CUDAQ_QEC_REALTIME_MODE=inproc_rpc; the existing host-call path is unchanged when it
is unset.
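
The runtime selection described above can be pictured with a small sketch. The environment variable name and the `inproc_rpc` value come from this description; the `RealtimeMode` enum and both helper functions are hypothetical names used only for illustration, and treating any other value like the unset case is an assumption, not documented behavior.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical enum: the PR does not show how the mode is represented.
enum class RealtimeMode { HostCall, InprocRpc };

// Pure parsing helper so the decision can be exercised without touching
// the process environment.
inline RealtimeMode parse_realtime_mode(const char *value) {
  if (value != nullptr && std::string(value) == "inproc_rpc")
    return RealtimeMode::InprocRpc;
  // Unset keeps the existing host-call path (per the description).
  // Assumption: unrecognized values are treated the same way.
  return RealtimeMode::HostCall;
}

// Reads the variable named in the PR description.
inline RealtimeMode realtime_mode_from_env() {
  return parse_realtime_mode(std::getenv("CUDAQ_QEC_REALTIME_MODE"));
}
```

Keeping the parse step separate from `std::getenv` makes the selection easy to unit-test.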
No proprietary device code lives in this repo. The GPU device-dispatch archive is
consumed optionally as an IMPORTED target via
CUDAQ_QEC_REALTIME_CUDEVICE_PROPRIETARY_ARCHIVE (whole-archive link); the realtime
graph/surface tests register only when it is supplied. (That archive is built
separately; see the companion GitLab MR.)
Host components (new):

- qec_realtime_session.{h,cpp}: shared host/device ring-buffer session.
- rpc_producer.{h,cpp}: host-side RPC request/response producer.
- realtime_decoding.{h,cpp}: routes through the session/producer when inproc_rpc is selected; config.cpp / decoding_config.h plumbing.

Build / linking:

- IMPORTED proprietary-archive wiring in lib/realtime/CMakeLists.txt + unittests/CMakeLists.txt (whole-archive, test gating on target existence).
- Scope --exclude-libs,ALL to PRIVATE on the simulation/quantinuum realtime libs and add --export-dynamic for app examples, so the dispatch-API symbols stay dlsym-resolvable in test/app executables.

Kernels: postprocess_* XOR-accumulate corrections (^=) so multi-window shots (num_rounds > decoder_window) reduce correctly; test_gpu_kernels zeroes the accumulator and skips ROUND_START markers.

Tests: test_realtime_qldpc_graph_decoding (+ qldpc_config_loader) and a surface_code-1 inproc-rpc variant (--use-relay-bp); mock/hololink syndrome loaders skip ROUND_START; refreshed *_nv_qldpc_relay fixtures.

Release/build infra: all_libs_release.yaml fails fast if the nv-qldpc-decoder plugin or the proprietary archive is missing from the release asset; build_all.sh / build_qec.sh build the realtime path.
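
To illustrate why the Kernels item switches to `^=`: when a shot spans several decoder windows, each window contributes its own correction bits, and overwriting would keep only the last window. The following is a host-side sketch of that reduction, not the actual postprocess_* kernel code; `Corrections`, `accumulate_window`, and `reduce_shot` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for the per-shot correction bits.
using Corrections = std::vector<std::uint8_t>;

// Fold one window's correction bits into the running accumulator.
// XOR is used because applying the same correction twice cancels it.
inline void accumulate_window(Corrections &acc, const Corrections &window) {
  for (std::size_t i = 0; i < acc.size() && i < window.size(); ++i)
    acc[i] ^= window[i];
}

// Reduce all windows of one shot. The accumulator must start at zero,
// which is why the test kernels zero it before each shot.
inline Corrections reduce_shot(const std::vector<Corrections> &windows,
                               std::size_t num_observables) {
  Corrections acc(num_observables, 0);
  for (const auto &w : windows)
    accumulate_window(acc, w);
  return acc;
}
```

With two windows `{1, 0, 1}` and `{1, 1, 0}`, the reduced result is `{0, 1, 1}`; plain assignment would have returned only the second window.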

Adds the public, host-side support for "inproc_rpc" real-time decoding: syndrome-decode requests are dispatched through cuda-quantum's on-GPU DEVICE_LOOP and serviced by a GPU-resident decoder. No proprietary code lives in this repo; the GPU device-dispatch archive is supplied at build time and linked optionally (built privately).

Host components:
- qec_realtime_session.{h,cpp}: shared host/device ring-buffer session.
- rpc_producer.{h,cpp}: host-side RPC request/response producer.
- realtime_decoding.{h,cpp}: route through the session/producer when CUDAQ_QEC_REALTIME_MODE=inproc_rpc; config.cpp / decoding_config.h plumbing.

Build / linking:
- Consume the optional proprietary device-dispatch archive as an IMPORTED target via CUDAQ_QEC_REALTIME_CUDEVICE_PROPRIETARY_ARCHIVE (whole-archive link); realtime graph/surface tests register only when it is present.
- Scope --exclude-libs,ALL to PRIVATE on the simulation/quantinuum realtime libs and export-dynamic for app examples, so the dispatch-API symbols stay dlsym-resolvable in test/app executables.

Kernels:
- postprocess_* XOR-accumulate corrections (^=) for multi-window shots; test_gpu_kernels zeros the accumulator and skips ROUND_START markers.

Tests:
- test_realtime_qldpc_graph_decoding (+ qldpc_config_loader) and a surface_code-1 inproc-rpc variant (--use-relay-bp); mock/hololink syndrome loaders skip ROUND_START; refreshed nv_qldpc_relay fixtures.

Release/build infra:
- all_libs_release.yaml fails fast if the nv-qldpc-decoder plugin or the proprietary archive is missing from the release asset; build_all.sh / build_qec.sh build the realtime path.

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

build_cudaq.sh patches CMakeLists.txt (Python find_package Development -> Development.Module), but upstream cuda-quantum #4698 restructured the python/nanobind block, shifting the surrounding context so `git apply` can no longer locate the hunk (hit when .cudaq_version points at a ref that includes Non-interactive on purpose: `patch` can hang on a File-to-patch prompt in CI. Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

After merging main (which added NVIDIA#600's HOST_CALL PyMatching session) into the decode_server1 device-dispatch session, reconcile the two into a single qec_realtime_session that selects its dispatch mode per decoder at initialize() via decoder::supports_graph_dispatch():

- DEVICE mode (e.g. the Relay BP GPU decoder): per-round GRAPH_LAUNCH enqueue + shared DEVICE_CALL get_corrections/reset, driven by a persistent DEVICE_LOOP dispatcher plus a CPU HOST_LOOP graph-worker monitor over a pinned-mapped, shared_ring_mode ring. Unchanged behavior from decode_server1.
- HOST mode (e.g. PyMatching, a CPU decoder): all three RPCs run as inline CUDAQ_DISPATCH_HOST_CALL handlers on the CPU host loop over a host-memory ring (no GPU required at runtime; the device-visible pointers alias the host backings so rpc_producer stays mode-agnostic).

A session must be homogeneous (all graph-dispatch, or all host); a mixed set throws, because the host loop resolves a slot by function_id alone and a GRAPH_LAUNCH + HOST_CALL enqueue would collide on kEnqueueSyndromesFunctionId.

Key changes:

- Port the HOST_CALL handlers to the two-ring host_fn ABI (const void *rx, void *tx, size_t): read the request from the RX slot and write the RPCResponse into the distinct TX slot, echoing request_id / ptp_timestamp; the host loop publishes tx_flags.
- Run the host loop with shared_ring_mode=1 so it scans the ring for work. The mode-agnostic producer reuses the first free slot, which a single-cursor non-shared loop would advance past and deadlock on.
- CMake: locate and link cudaq-realtime-host-dispatch (the host dispatch loop that upstream split out of libcudaq-realtime), keeping the realtime decoding library's link options PRIVATE.
- Re-enable and adapt the PyMatching realtime test to the production (5-arg) enqueue_syndromes signature.
- Pin .cudaq_version to canonical NVIDIA/cuda-quantum d201ab96 (#4712: shared_ring_mode + routing_key sub-routing + the two-ring host_fn ABI).
Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
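
As a rough illustration of the two-ring handler shape described in the commit message above (read from the RX slot, write a response into a distinct TX slot, echo request_id and ptp_timestamp), here is a minimal sketch. Only the `(const void *rx, void *tx, size_t)` parameter list is taken from the message; the struct layouts, the `void` return type, the meaning of the size argument (assumed to be the slot size in bytes), and the name `demo_host_handler` are all assumptions. The real RPC header layouts are defined by cuda-quantum and are not shown in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical request/response layouts, for illustration only.
struct DemoRequest {
  std::uint32_t function_id;
  std::uint32_t request_id;
  std::uint64_t ptp_timestamp;
  std::uint32_t payload;
};

struct DemoResponse {
  std::uint32_t request_id;
  std::uint64_t ptp_timestamp;
  std::int32_t status;
  std::uint32_t result;
};

// Handler in the two-ring style: rx is read-only, tx is a separate slot.
inline void demo_host_handler(const void *rx, void *tx, std::size_t slot_size) {
  // Refuse slots too small to hold either structure.
  if (slot_size < sizeof(DemoRequest) || slot_size < sizeof(DemoResponse))
    return;

  DemoRequest req;
  std::memcpy(&req, rx, sizeof(req)); // copy out of the RX slot

  DemoResponse resp{};
  resp.request_id = req.request_id;       // echoed so the producer can match
  resp.ptp_timestamp = req.ptp_timestamp; // echoed for latency accounting
  resp.status = 0;
  resp.result = req.payload ^ 1u; // placeholder work standing in for a decode

  std::memcpy(tx, &resp, sizeof(resp)); // write into the distinct TX slot
}
```

In this sketch the handler never touches any ready flag; per the commit message, publishing tx_flags is the host loop's job.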

- realtime_decoding: let CPU (non-graph) decoders under CUDAQ_QEC_REALTIME_MODE=inproc_rpc build a HOST-dispatch session instead of throwing. maybe_init_realtime_session() now selects DEVICE vs HOST the same way qec_realtime_session::initialize() does (any graph-capable decoder => DEVICE), and only resolves the device launch fn / enables the device shared-ring mode in DEVICE mode. This makes the session's HOST path reachable from the production entry point (previously only the unit test could reach it). Full app/E2E pymatching-over-inproc_rpc coverage is a separate change.
- realtime_decoding: set cudaSetDeviceFlags(cudaDeviceMapHost) before the per-decoder dry-run on the inproc_rpc path, since the dry-run can create the CUDA context (after which the flag is a no-op). Best-effort: mapped host allocation works via UVA regardless, so a pre-existing context (cudaErrorSetOnActiveProcess) is tolerated, and HOST-mode sessions don't use mapped memory at all.
- qec_realtime_session: the DEVICE-mode host loop now receives the host view of the function table (function_table_host_) rather than the device pointer, since the loop dereferences entries on the CPU. Equal under UVA, but the host pointer is the correct/portable choice (the device dispatcher still gets function_table_dev_).
- rpc_producer: enforce the single-producer contract at runtime with an always-on guard (atomic CAS + throw -- not assert(), which release builds compile out) that fails loudly if a second concurrent producer is detected, instead of silently racing on slot selection. Doc updated from "assumption" to enforced contract. Full multi-producer support (CAS slot-claim / per-producer arena) remains a deliberate follow-up.

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
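
The always-on single-producer guard mentioned in the rpc_producer item can be sketched as an RAII object around an atomic flag. This is a minimal illustration of the "atomic CAS + throw" idea under assumed names (`SingleProducerGuard`, the error text); the actual guard in rpc_producer is not shown in this PR and may differ.

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical RAII guard: claims a flag on entry, releases it on exit.
class SingleProducerGuard {
public:
  explicit SingleProducerGuard(std::atomic<bool> &busy) : busy_(busy) {
    bool expected = false;
    // CAS false -> true. Failure means another producer is already inside.
    if (!busy_.compare_exchange_strong(expected, true,
                                       std::memory_order_acq_rel))
      throw std::runtime_error("concurrent producer detected");
  }

  // Only runs if the constructor succeeded, so a rejected second producer
  // never clears the flag held by the first.
  ~SingleProducerGuard() { busy_.store(false, std::memory_order_release); }

  SingleProducerGuard(const SingleProducerGuard &) = delete;
  SingleProducerGuard &operator=(const SingleProducerGuard &) = delete;

private:
  std::atomic<bool> &busy_;
};
```

Unlike `assert()`, this check is still present when `NDEBUG` is defined, which is the property the commit message calls out.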

vedika-saravanan · 2026-06-17T00:50:59Z

Thanks @cketcham2333 for addressing the review comments. I’m approving the PR since all of my comments have been addressed.

cketcham2333 added 5 commits June 12, 2026 20:19

Merge remote-tracking branch 'upstream/main' into decode_server1

1790f34

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

clang format

a7c1a4a

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

ci: retrigger pipeline

0d0f1bd

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

cketcham2333 force-pushed the decode_server1 branch from 2ba3ffd to 0d0f1bd Compare June 12, 2026 22:41

cketcham2333 added 3 commits June 15, 2026 22:57

Merge remote-tracking branch 'upstream/main' into decode_server1

9835c75

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

clang format

3eb12ec

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

cketcham2333 marked this pull request as ready for review June 16, 2026 01:54

cketcham2333 requested review from melody-ren and vedika-saravanan June 16, 2026 01:54

melody-ren reviewed Jun 16, 2026

Comment thread libs/qec/lib/realtime/realtime_decoding.cpp Outdated

vedika-saravanan reviewed Jun 16, 2026

Comment thread libs/qec/lib/realtime/realtime_decoding.cpp

Comment thread libs/qec/lib/realtime/qec_realtime_session.cpp Outdated

Comment thread libs/qec/lib/realtime/rpc_producer.cpp

Comment thread libs/qec/lib/realtime/gpu_kernels.cu

cketcham2333 requested a review from boschmitt June 16, 2026 18:56

cketcham2333 and others added 2 commits June 16, 2026 21:29

Merge branch 'main' into decode_server1

a298fc9

vedika-saravanan approved these changes Jun 17, 2026

cketcham2333 merged commit 927ec7f into NVIDIA:main Jun 17, 2026
111 of 121 checks passed

cketcham2333 merged 10 commits into NVIDIA:main from cketcham2333:decode_server1
