Add host-side in-process-RPC path for real-time QEC decoding #609
Merged
Adds the public, host-side support for "inproc_rpc" real-time decoding:
syndrome-decode requests are dispatched through cuda-quantum's on-GPU
DEVICE_LOOP and serviced by a GPU-resident decoder. No proprietary code
lives in this repo; the GPU device-dispatch archive is supplied at build
time and linked optionally (built privately).
Host components:
- qec_realtime_session.{h,cpp}: shared host/device ring-buffer session.
- rpc_producer.{h,cpp}: host-side RPC request/response producer.
- realtime_decoding.{h,cpp}: route through the session/producer when
CUDAQ_QEC_REALTIME_MODE=inproc_rpc; config.cpp / decoding_config.h plumbing.
Build / linking:
- Consume the optional proprietary device-dispatch archive as an IMPORTED
target via CUDAQ_QEC_REALTIME_CUDEVICE_PROPRIETARY_ARCHIVE (whole-archive
link); realtime graph/surface tests register only when it is present.
- Scope --exclude-libs,ALL to PRIVATE on the simulation/quantinuum realtime
libs and export-dynamic for app examples, so the dispatch-API symbols stay
dlsym-resolvable in test/app executables.
Kernels:
- postprocess_* XOR-accumulate corrections (^=) for multi-window shots;
test_gpu_kernels zeros the accumulator and skips ROUND_START markers.
Tests:
- test_realtime_qldpc_graph_decoding (+ qldpc_config_loader) and a
surface_code-1 inproc-rpc variant (--use-relay-bp); mock/hololink syndrome
loaders skip ROUND_START; refreshed nv_qldpc_relay fixtures.
Release/build infra:
- all_libs_release.yaml fails fast if the nv-qldpc-decoder plugin or the
proprietary archive is missing from the release asset; build_all.sh /
build_qec.sh build the realtime path.
Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
build_cudaq.sh patches CMakeLists.txt (Python find_package Development ->
Development.Module), but upstream cuda-quantum #4698 restructured the
python/nanobind block, shifting the surrounding context so `git apply` can no
longer locate the hunk (hit when .cudaq_version points at a ref that includes

Non-interactive on purpose: `patch` can hang on a File-to-patch prompt in CI.
Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
2ba3ffd to 0d0f1bd
Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
After merging main (which added NVIDIA#600's HOST_CALL PyMatching session) into
the decode_server1 device-dispatch session, reconcile the two into a single
qec_realtime_session that selects its dispatch mode per decoder at initialize()
via decoder::supports_graph_dispatch():
- DEVICE mode (e.g. the Relay BP GPU decoder): per-round GRAPH_LAUNCH enqueue +
shared DEVICE_CALL get_corrections/reset, driven by a persistent DEVICE_LOOP
dispatcher plus a CPU HOST_LOOP graph-worker monitor over a pinned-mapped,
shared_ring_mode ring. Unchanged behavior from decode_server1.
- HOST mode (e.g. PyMatching, a CPU decoder): all three RPCs run as inline
CUDAQ_DISPATCH_HOST_CALL handlers on the CPU host loop over a host-memory ring
(no GPU required at runtime; the device-visible pointers alias the host backings
so rpc_producer stays mode-agnostic).
A session must be homogeneous (all graph-dispatch, or all host); a mixed set
throws, because the host loop resolves a slot by function_id alone and a
GRAPH_LAUNCH + HOST_CALL enqueue would collide on kEnqueueSyndromesFunctionId.
Key changes:
- Port the HOST_CALL handlers to the two-ring host_fn ABI (const void *rx,
void *tx, size_t): read the request from the RX slot and write the RPCResponse
into the distinct TX slot, echoing request_id / ptp_timestamp; the host loop
publishes tx_flags.
- Run the host loop with shared_ring_mode=1 so it scans the ring for work. The
mode-agnostic producer reuses the first free slot, which a single-cursor
non-shared loop would advance past and deadlock on.
- CMake: locate and link cudaq-realtime-host-dispatch (the host dispatch loop
that upstream split out of libcudaq-realtime), keeping the realtime decoding
library's link options PRIVATE.
- Re-enable and adapt the PyMatching realtime test to the production (5-arg)
enqueue_syndromes signature.
- Pin .cudaq_version to canonical NVIDIA/cuda-quantum d201ab96 (#4712:
shared_ring_mode + routing_key sub-routing + the two-ring host_fn ABI).
Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
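The homogeneity rule in the commit above can be illustrated with a small host-only sketch. Only decoder::supports_graph_dispatch() is named in the commit message; the `decoder` stand-in, the `gpu_decoder` / `cpu_decoder` types, the `dispatch_mode` enum, and `select_dispatch_mode()` are hypothetical names introduced here for illustration and are not the session's actual code.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the decoder interface. Only the method name
// supports_graph_dispatch() comes from the commit message.
struct decoder {
  virtual ~decoder() = default;
  virtual bool supports_graph_dispatch() const { return false; }
};

// Illustrative decoders: one graph-capable (GPU), one CPU-only.
struct gpu_decoder : decoder {
  bool supports_graph_dispatch() const override { return true; }
};
struct cpu_decoder : decoder {};

enum class dispatch_mode { DEVICE, HOST };

// Pick one dispatch mode for the whole session. All decoders must agree;
// a mixed set is rejected, mirroring the "must be homogeneous" rule.
dispatch_mode
select_dispatch_mode(const std::vector<std::unique_ptr<decoder>> &decoders) {
  if (decoders.empty())
    throw std::invalid_argument("session needs at least one decoder");

  std::size_t graph_capable = 0;
  for (const auto &d : decoders)
    if (d->supports_graph_dispatch())
      ++graph_capable;

  if (graph_capable == 0)
    return dispatch_mode::HOST;
  if (graph_capable == decoders.size())
    return dispatch_mode::DEVICE;
  throw std::runtime_error(
      "mixed graph-dispatch and host decoders in one session");
}
```

Rejecting the mixed case up front, rather than at enqueue time, matches the stated reason for the rule: a collision on a shared function id would otherwise only surface once requests are in flight.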
melody-ren reviewed Jun 16, 2026
- realtime_decoding: let CPU (non-graph) decoders under
CUDAQ_QEC_REALTIME_MODE=inproc_rpc build a HOST-dispatch session instead of
throwing. maybe_init_realtime_session() now selects DEVICE vs HOST the same way
qec_realtime_session::initialize() does (any graph-capable decoder => DEVICE),
and only resolves the device launch fn / enables the device shared-ring mode in
DEVICE mode. This makes the session's HOST path reachable from the production
entry point (previously only the unit test could reach it). Full app/E2E
pymatching-over-inproc_rpc coverage is a separate change.
- realtime_decoding: set cudaSetDeviceFlags(cudaDeviceMapHost) before the
per-decoder dry-run on the inproc_rpc path, since the dry-run can create the
CUDA context (after which the flag is a no-op). Best-effort: mapped host
allocation works via UVA regardless, so a pre-existing context
(cudaErrorSetOnActiveProcess) is tolerated, and HOST-mode sessions don't use
mapped memory at all.
- qec_realtime_session: the DEVICE-mode host loop now receives the host view of
the function table (function_table_host_) rather than the device pointer, since
the loop dereferences entries on the CPU. Equal under UVA, but the host pointer
is the correct/portable choice (the device dispatcher still gets
function_table_dev_).
- rpc_producer: enforce the single-producer contract at runtime with an
always-on guard (atomic CAS + throw -- not assert(), which release builds
compile out) that fails loudly if a second concurrent producer is detected,
instead of silently racing on slot selection. Doc updated from "assumption" to
enforced contract. Full multi-producer support (CAS slot-claim / per-producer
arena) remains a deliberate follow-up.
Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
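One way to realize the "atomic CAS + throw" guard described in the rpc_producer item is an RAII scope guard around the produce path. This is a minimal sketch under assumptions: the class name `single_producer_guard`, the `busy` flag, and the error text are hypothetical, and the real rpc_producer may structure its guard differently.

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical RAII guard (not the PR's actual code). It claims a shared
// "busy" flag on entry and releases it on exit. Unlike assert(), the check
// stays active in release builds.
class single_producer_guard {
public:
  explicit single_producer_guard(std::atomic<bool> &busy) : busy_(busy) {
    bool expected = false;
    // CAS false -> true. If the flag was already true, another producer is
    // inside the guarded region, so fail loudly instead of racing.
    if (!busy_.compare_exchange_strong(expected, true,
                                       std::memory_order_acquire))
      throw std::runtime_error("rpc_producer: concurrent producer detected");
  }

  // Only runs if the constructor succeeded, so a rejected second producer
  // never clears the flag held by the first.
  ~single_producer_guard() { busy_.store(false, std::memory_order_release); }

  single_producer_guard(const single_producer_guard &) = delete;
  single_producer_guard &operator=(const single_producer_guard &) = delete;

private:
  std::atomic<bool> &busy_;
};
```

A producer would construct the guard at the top of each request and let it go out of scope once the slot is published; the flag itself would live alongside the ring state so every caller sees the same one.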
Collaborator
Thanks @cketcham2333 for addressing the review comments. I'm approving the PR since all of my comments have been addressed.
vedika-saravanan approved these changes Jun 17, 2026
Description
What this adds. A public, host-side path for in-process RPC ("inproc_rpc")
real-time QEC decoding. In this mode the host produces syndrome-decode requests into a
shared host/device ring buffer; they are serviced through cuda-quantum's on-GPU
`DEVICE_LOOP` dispatcher by a GPU-resident decoder (e.g. `nv-qldpc-decoder`), keeping
decode on the device with no CPU round-trip on the hot path. Selected at runtime via
`CUDAQ_QEC_REALTIME_MODE=inproc_rpc`; the existing host-call path is unchanged when it
is unset.
No proprietary device code lives in this repo. The GPU device-dispatch archive is
consumed optionally as an `IMPORTED` target via
`CUDAQ_QEC_REALTIME_CUDEVICE_PROPRIETARY_ARCHIVE` (whole-archive link); the realtime
graph/surface tests register only when it is supplied. (That archive is built
separately; see the companion GitLab MR.)
Host components (new):
- `qec_realtime_session.{h,cpp}`: shared host/device ring-buffer session.
- `rpc_producer.{h,cpp}`: host-side RPC request/response producer.
- `realtime_decoding.{h,cpp}` routes through the session/producer when `inproc_rpc`
is selected; `config.cpp` / `decoding_config.h` plumbing.
Build / linking:
- `IMPORTED` proprietary-archive wiring in `lib/realtime/CMakeLists.txt` +
`unittests/CMakeLists.txt` (whole-archive, test gating on target existence).
- Scope `--exclude-libs,ALL` to `PRIVATE` on the simulation/quantinuum realtime libs
and add `--export-dynamic` for app examples, so the dispatch-API symbols stay
`dlsym`-resolvable in test/app executables.
Kernels:
- `postprocess_*` XOR-accumulate corrections (`^=`) so multi-window shots
(`num_rounds > decoder_window`) reduce correctly; `test_gpu_kernels` zeroes the
accumulator and skips `ROUND_START` markers.
Tests:
- `test_realtime_qldpc_graph_decoding` (+ `qldpc_config_loader`) and a
`surface_code-1` inproc-rpc variant (`--use-relay-bp`); mock/hololink syndrome
loaders skip `ROUND_START`; refreshed `*_nv_qldpc_relay` fixtures.
Release/build infra:
- `all_libs_release.yaml` fails fast if the `nv-qldpc-decoder` plugin or the
proprietary archive is missing from the release asset;
`build_all.sh` / `build_qec.sh` build the realtime path.
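To make the Kernels item above concrete, here is a host-side sketch of XOR accumulation across decoder windows. It is not the `postprocess_*` kernel code: the function name `accumulate_corrections` and the one-byte-per-bit layout are assumptions chosen for readability.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Fold one window's corrections into the per-shot accumulator.
// Each element holds a single correction bit (0 or 1). XOR is used instead
// of plain assignment so a later window adds to, rather than overwrites,
// what earlier windows of the same shot contributed.
void accumulate_corrections(std::vector<std::uint8_t> &accumulator,
                            const std::vector<std::uint8_t> &window) {
  if (accumulator.size() != window.size())
    throw std::invalid_argument("correction vectors differ in length");
  for (std::size_t i = 0; i < accumulator.size(); ++i)
    accumulator[i] ^= window[i];
}
```

For a shot split into two windows with corrections {1, 0, 1} and {1, 1, 0}, starting from a zeroed accumulator, the result is {0, 1, 1}: the first bit is flipped twice and cancels. This is also why the accumulator has to be zeroed before each shot, as `test_gpu_kernels` now does.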