Add host-side in-process-RPC path for real-time QEC decoding#609

Merged
cketcham2333 merged 10 commits into
NVIDIA:mainfrom
cketcham2333:decode_server1
Jun 17, 2026

Conversation

@cketcham2333

@cketcham2333 cketcham2333 commented Jun 12, 2026

Collaborator

Description

What this adds. A public, host-side path for in-process RPC ("inproc_rpc")
real-time QEC decoding. In this mode the host produces syndrome-decode requests into a
shared host/device ring buffer; they are serviced through cuda-quantum's on-GPU
DEVICE_LOOP dispatcher by a GPU-resident decoder (e.g. nv-qldpc-decoder), keeping
decode on the device with no CPU round-trip on the hot path. Selected at runtime via
CUDAQ_QEC_REALTIME_MODE=inproc_rpc; the existing host-call path is unchanged when it
is unset.

No proprietary device code lives in this repo. The GPU device-dispatch archive is
consumed optionally as an IMPORTED target via
CUDAQ_QEC_REALTIME_CUDEVICE_PROPRIETARY_ARCHIVE (whole-archive link); the realtime
graph/surface tests register only when it is supplied. (That archive is built
separately; see the companion GitLab MR.)
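
As a rough illustration of the runtime selection described above, the sketch below reads CUDAQ_QEC_REALTIME_MODE and falls back to the existing host-call path when it is unset. The enum and function names are hypothetical and are not taken from realtime_decoding.cpp; treating any value other than inproc_rpc as the default path is also an assumption.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical enum: the PR does not show the type the library actually uses.
enum class RealtimeMode { HostCall, InprocRpc };

// Pure helper so the parsing can be exercised without touching the environment.
inline RealtimeMode parse_realtime_mode(const char *value) {
  if (value != nullptr && std::strcmp(value, "inproc_rpc") == 0)
    return RealtimeMode::InprocRpc;
  // Unset keeps the existing host-call path (per the description).
  // Assumption: unrecognized values are treated the same way.
  return RealtimeMode::HostCall;
}

// Reads the environment variable named in the PR description.
inline RealtimeMode realtime_mode_from_env() {
  return parse_realtime_mode(std::getenv("CUDAQ_QEC_REALTIME_MODE"));
}
```

Keeping the parse step separate from the getenv call makes the default-path behavior easy to unit test.
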
Host components (new):

  • qec_realtime_session.{h,cpp}: shared host/device ring-buffer session.
  • rpc_producer.{h,cpp}: host-side RPC request/response producer.
  • realtime_decoding.{h,cpp} routes through the session/producer when
    inproc_rpc is selected; config.cpp / decoding_config.h plumbing.

Build / linking:

  • IMPORTED proprietary-archive wiring in lib/realtime/CMakeLists.txt +
    unittests/CMakeLists.txt (whole-archive, test gating on target existence).
  • Scope --exclude-libs,ALL to PRIVATE on the simulation/quantinuum realtime libs and
    add --export-dynamic for app examples, so the dispatch-API symbols stay
    dlsym-resolvable in test/app executables.

Kernels: postprocess_* XOR-accumulate corrections (^=) so multi-window shots
(num_rounds > decoder_window) reduce correctly; test_gpu_kernels zeroes the
accumulator and skips ROUND_START markers.

Tests: test_realtime_qldpc_graph_decoding (+ qldpc_config_loader) and a
surface_code-1 inproc-rpc variant (--use-relay-bp); mock/hololink syndrome loaders
skip ROUND_START; refreshed *_nv_qldpc_relay fixtures.

Release/build infra: all_libs_release.yaml fails fast if the nv-qldpc-decoder
plugin or the proprietary archive is missing from the release asset; build_all.sh /
build_qec.sh build the realtime path.
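
The XOR-accumulate behavior noted under Kernels can be sketched on the host as follows. This is not the postprocess_* kernel code; the function names and the one-byte-per-observable layout are assumptions made for illustration. The point is that each decode window's corrections are folded in with ^= rather than overwriting, so a shot spanning several windows ends up with the parity of all of them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for one postprocess step: fold a single
// decode window's corrections into the per-shot accumulator.
inline void accumulate_window(std::vector<std::uint8_t> &accumulator,
                              const std::vector<std::uint8_t> &window) {
  assert(accumulator.size() == window.size());
  for (std::size_t i = 0; i < accumulator.size(); ++i)
    accumulator[i] ^= window[i]; // XOR, not overwrite: flips from all windows compose
}

// Reduce every window of one shot. The accumulator starts zeroed, matching
// the note that test_gpu_kernels zeroes it before use.
inline std::vector<std::uint8_t>
reduce_shot(const std::vector<std::vector<std::uint8_t>> &windows,
            std::size_t num_observables) {
  std::vector<std::uint8_t> accumulator(num_observables, 0);
  for (const auto &window : windows)
    accumulate_window(accumulator, window);
  return accumulator;
}
```

With plain assignment instead of ^=, only the last window's corrections would survive, which is the multi-window failure the description refers to.
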

Adds the public, host-side support for "inproc_rpc" real-time decoding:
syndrome-decode requests are dispatched through cuda-quantum's on-GPU
DEVICE_LOOP and serviced by a GPU-resident decoder. No proprietary code
lives in this repo; the GPU device-dispatch archive is supplied at build
time and linked optionally (built privately).
Host components:
- qec_realtime_session.{h,cpp}: shared host/device ring-buffer session.
- rpc_producer.{h,cpp}: host-side RPC request/response producer.
- realtime_decoding.{h,cpp}: route through the session/producer when
  CUDAQ_QEC_REALTIME_MODE=inproc_rpc; config.cpp / decoding_config.h plumbing.
Build / linking:
- Consume the optional proprietary device-dispatch archive as an IMPORTED
  target via CUDAQ_QEC_REALTIME_CUDEVICE_PROPRIETARY_ARCHIVE (whole-archive
  link); realtime graph/surface tests register only when it is present.
- Scope --exclude-libs,ALL to PRIVATE on the simulation/quantinuum realtime
  libs and export-dynamic for app examples, so the dispatch-API symbols stay
  dlsym-resolvable in test/app executables.
Kernels:
- postprocess_* XOR-accumulate corrections (^=) for multi-window shots;
  test_gpu_kernels zeros the accumulator and skips ROUND_START markers.
Tests:
- test_realtime_qldpc_graph_decoding (+ qldpc_config_loader) and a
  surface_code-1 inproc-rpc variant (--use-relay-bp); mock/hololink syndrome
  loaders skip ROUND_START; refreshed nv_qldpc_relay fixtures.
Release/build infra:
- all_libs_release.yaml fails fast if the nv-qldpc-decoder plugin or the
  proprietary archive is missing from the release asset; build_all.sh /
  build_qec.sh build the realtime path.

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
build_cudaq.sh patches CMakeLists.txt (Python find_package Development ->
Development.Module), but upstream cuda-quantum #4698 restructured the
python/nanobind block, shifting the surrounding context so `git apply` can no
longer locate the hunk (hit when .cudaq_version points at a ref that includes
Non-interactive on purpose: `patch` can hang on a File-to-patch prompt in CI.

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
After merging main (which added NVIDIA#600's HOST_CALL PyMatching session) into
the decode_server1 device-dispatch session, reconcile the two into a single
qec_realtime_session that selects its dispatch mode per decoder at
initialize() via decoder::supports_graph_dispatch():
- DEVICE mode (e.g. the Relay BP GPU decoder): per-round GRAPH_LAUNCH
  enqueue + shared DEVICE_CALL get_corrections/reset, driven by a
  persistent DEVICE_LOOP dispatcher plus a CPU HOST_LOOP graph-worker
  monitor over a pinned-mapped, shared_ring_mode ring. Unchanged behavior
  from decode_server1.
- HOST mode (e.g. PyMatching, a CPU decoder): all three RPCs run as inline
  CUDAQ_DISPATCH_HOST_CALL handlers on the CPU host loop over a host-memory
  ring (no GPU required at runtime; the device-visible pointers alias the
  host backings so rpc_producer stays mode-agnostic).
A session must be homogeneous (all graph-dispatch, or all host); a mixed
set throws, because the host loop resolves a slot by function_id alone and
a GRAPH_LAUNCH + HOST_CALL enqueue would collide on
kEnqueueSyndromesFunctionId.
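
A minimal sketch of the homogeneity rule above, assuming a stand-in decoder type that exposes only supports_graph_dispatch(). The DecoderInfo struct, the DispatchMode enum, and the handling of an empty decoder set are hypothetical; the real check lives in qec_realtime_session::initialize() and may be structured differently.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

enum class DispatchMode { Device, Host };

// Hypothetical stand-in for a decoder; only the capability query matters here.
struct DecoderInfo {
  bool graph_capable;
  bool supports_graph_dispatch() const { return graph_capable; }
};

// Pick one dispatch mode for the whole session, rejecting mixed sets.
inline DispatchMode
select_dispatch_mode(const std::vector<DecoderInfo> &decoders) {
  if (decoders.empty())
    throw std::invalid_argument("no decoders configured"); // assumption
  const bool first = decoders.front().supports_graph_dispatch();
  for (const auto &decoder : decoders) {
    if (decoder.supports_graph_dispatch() != first)
      throw std::runtime_error(
          "session mixes graph-dispatch and host decoders");
  }
  return first ? DispatchMode::Device : DispatchMode::Host;
}
```
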
Key changes:
- Port the HOST_CALL handlers to the two-ring host_fn ABI
  (const void *rx, void *tx, size_t): read the request from the RX slot and
  write the RPCResponse into the distinct TX slot, echoing request_id /
  ptp_timestamp; the host loop publishes tx_flags.
- Run the host loop with shared_ring_mode=1 so it scans the ring for work.
  The mode-agnostic producer reuses the first free slot, which a
  single-cursor non-shared loop would advance past and deadlock on.
- CMake: locate and link cudaq-realtime-host-dispatch (the host dispatch
  loop that upstream split out of libcudaq-realtime), keeping the realtime
  decoding library's link options PRIVATE.
- Re-enable and adapt the PyMatching realtime test to the production
  (5-arg) enqueue_syndromes signature.
- Pin .cudaq_version to canonical NVIDIA/cuda-quantum d201ab96 (#4712:
  shared_ring_mode + routing_key sub-routing + the two-ring host_fn ABI).
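
To make the two-ring handler shape concrete, here is a sketch of a handler that reads a request from the RX slot and writes a response into a separate TX slot, echoing request_id and ptp_timestamp. The struct layouts, the void return type, and the reading of the size_t argument as the slot size in bytes are all assumptions; the actual request header and RPCResponse definitions come from cuda-quantum's realtime headers and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical wire layouts, for illustration only.
struct SketchRequest {
  std::uint32_t function_id;
  std::uint32_t request_id;
  std::uint64_t ptp_timestamp;
};

struct SketchResponse {
  std::uint32_t request_id;
  std::int32_t status;
  std::uint64_t ptp_timestamp;
};

// Handler in the (const void *rx, void *tx, size_t) shape named above.
// Assumption: slot_size is the byte size of each slot.
inline void sketch_host_handler(const void *rx, void *tx,
                                std::size_t slot_size) {
  assert(slot_size >= sizeof(SketchRequest));
  assert(slot_size >= sizeof(SketchResponse));

  SketchRequest request;
  std::memcpy(&request, rx, sizeof request); // RX slot is read-only

  SketchResponse response{};
  response.request_id = request.request_id;       // echo back to the caller
  response.ptp_timestamp = request.ptp_timestamp; // echo back to the caller
  response.status = 0;                            // 0 == success (assumed)

  // The response goes to the distinct TX slot. Publishing tx_flags is left
  // to the host loop, as the commit message describes.
  std::memcpy(tx, &response, sizeof response);
}
```
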

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
@cketcham2333 cketcham2333 marked this pull request as ready for review June 16, 2026 01:54
Comment thread libs/qec/lib/realtime/realtime_decoding.cpp Outdated
Comment thread libs/qec/lib/realtime/realtime_decoding.cpp
Comment thread libs/qec/lib/realtime/qec_realtime_session.cpp Outdated
Comment thread libs/qec/lib/realtime/rpc_producer.cpp
Comment thread libs/qec/lib/realtime/gpu_kernels.cu
@cketcham2333 cketcham2333 requested a review from boschmitt June 16, 2026 18:56
cketcham2333 and others added 2 commits June 16, 2026 21:29
- realtime_decoding: let CPU (non-graph) decoders under
  CUDAQ_QEC_REALTIME_MODE=inproc_rpc build a HOST-dispatch session instead
  of throwing. maybe_init_realtime_session() now selects DEVICE vs HOST the
  same way qec_realtime_session::initialize() does (any graph-capable
  decoder => DEVICE), and only resolves the device launch fn / enables the
  device shared-ring mode in DEVICE mode. This makes the session's HOST path
  reachable from the production entry point (previously only the unit test
  could reach it). Full app/E2E pymatching-over-inproc_rpc coverage is a
  separate change.
- realtime_decoding: set cudaSetDeviceFlags(cudaDeviceMapHost) before the
  per-decoder dry-run on the inproc_rpc path, since the dry-run can create
  the CUDA context (after which the flag is a no-op). Best-effort: mapped
  host allocation works via UVA regardless, so a pre-existing context
  (cudaErrorSetOnActiveProcess) is tolerated, and HOST-mode sessions don't
  use mapped memory at all.
- qec_realtime_session: the DEVICE-mode host loop now receives the host view
  of the function table (function_table_host_) rather than the device
  pointer, since the loop dereferences entries on the CPU. Equal under UVA,
  but the host pointer is the correct/portable choice (the device dispatcher
  still gets function_table_dev_).
- rpc_producer: enforce the single-producer contract at runtime with an
  always-on guard (atomic CAS + throw -- not assert(), which release builds
  compile out) that fails loudly if a second concurrent producer is
  detected, instead of silently racing on slot selection. Doc updated from
  "assumption" to enforced contract. Full multi-producer support (CAS
  slot-claim / per-producer arena) remains a deliberate follow-up.
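
The always-on single-producer guard can be sketched with an atomic flag claimed by compare-and-swap, throwing when the flag is already held. The class name and exception type are hypothetical and not taken from rpc_producer.cpp; this only illustrates the CAS-plus-throw pattern the commit message describes.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

// Hypothetical RAII guard: claims the flag on entry, releases it on exit.
class SingleProducerGuard {
public:
  explicit SingleProducerGuard(std::atomic<bool> &busy) : busy_(busy) {
    bool expected = false;
    // CAS false -> true. If another producer already holds the flag, fail
    // loudly. Unlike assert(), this check survives release builds.
    if (!busy_.compare_exchange_strong(expected, true,
                                       std::memory_order_acquire))
      throw std::logic_error("rpc_producer: concurrent producer detected");
  }

  // Only runs when the constructor succeeded, so a rejected second producer
  // never clears the flag held by the first.
  ~SingleProducerGuard() { busy_.store(false, std::memory_order_release); }

  SingleProducerGuard(const SingleProducerGuard &) = delete;
  SingleProducerGuard &operator=(const SingleProducerGuard &) = delete;

private:
  std::atomic<bool> &busy_;
};
```

A producer would construct the guard at the top of each enqueue call; a second caller that overlaps in time gets an exception instead of racing on slot selection.
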

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
@vedika-saravanan

Collaborator

Thanks @cketcham2333 for addressing the review comments. I’m approving the PR since all of my comments have been addressed.

@cketcham2333 cketcham2333 merged commit 927ec7f into NVIDIA:main Jun 17, 2026
111 of 121 checks passed
3 participants