RL-Align · CyberSecurityErial · Jun 22, 2026
@@ -20,6 +20,8 @@ nav:
     - General:
       - contributing/*
       - contributing/operator-doc-template.md
+    - Distributed:
+      - distributed/*
     - Design Documents:
       - design/*
   - Benchmarking:

@@ -0,0 +1,100 @@
+# Deterministic All-Reduce
+
+RL-Kernel provides a small all-reduce helper for distributed smoke tests and
+future WS2 integration work. It has two modes:
+
+- `torch_all_reduce`: calls `torch.distributed.all_reduce`.
+- `ordered_rank_reference`: gathers all rank tensors, accumulates them on process-group rank 0 in process-group rank order, then broadcasts the result.
+
+The helper reduces the input tensor in place and returns it.
+
+## Contract
+
+Results are expected to be stable only when the world size, process-group rank
+order, inputs, dtype, operation, backend, and environment are unchanged.
+
+`op="mean"` performs a sum and then divides by the world size exactly once. It
+requires a floating-point or complex tensor; integer and boolean tensors raise `TypeError`.
+
+## Ordered-Rank Reference
+
+`ordered_rank_reference` is a reference path, not a high-performance transport.
+It uses `all_gather` and `broadcast`, so the active backend must support those
+collectives for the tensor device. The operation order is:
+
+1. make each rank's input contiguous;
+2. gather tensors in process-group rank order;
+3. accumulate on process-group rank 0 in that order;
+4. when `force_fp32_accumulation` is set (the default), run that accumulation in FP32 for floating-point inputs (FP64 inputs stay in FP64);
+5. divide once for `op="mean"`;
+6. cast back to the input dtype and broadcast from process-group rank 0.
+
+This mode is meant for small tensors in tests, debug runs, and reference
+comparisons.
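+
+The ordering idea can be sketched in a single process, assuming plain Python
+floats stand in for per-rank tensors. `ordered_reduce` is a hypothetical name
+used only for illustration; it is not part of `rl_engine.distributed` and no
+collectives are involved:
+
+```python
+from typing import Sequence
+
+
+def ordered_reduce(per_rank: Sequence[float], op: str = "sum") -> float:
+    """Accumulate values in rank order (index 0 first); divide once for mean."""
+    if op not in {"sum", "mean"}:
+        raise ValueError(f"unsupported reduction op: {op!r}")
+    if not per_rank:
+        raise ValueError("per_rank must not be empty")
+    total = per_rank[0]
+    for value in per_rank[1:]:  # fixed left-to-right order
+        total += value
+    if op == "mean":
+        total /= len(per_rank)  # single division after the sum
+    return total
+
+
+# Floating-point addition is not associative, so the order is part of the result.
+rank_order = ordered_reduce([0.1, 0.2, 0.3])  # (0.1 + 0.2) + 0.3
+other_order = 0.1 + (0.2 + 0.3)
+print(rank_order == other_order)  # False
+```
+
+Pinning the order to the process-group rank order is what lets two runs with the
+same inputs be compared value for value.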
+
+## Torch All-Reduce
+
+`torch_all_reduce` is a thin wrapper around `torch.distributed.all_reduce`. For
+NCCL runs, callers may set best-effort ring settings before process-group
+initialization:
+
+```python
+from rl_engine.distributed import configure_deterministic_nccl_env
+
+configure_deterministic_nccl_env(overwrite=True)
+```
+
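+With `overwrite=True` the helper sets the variables below. With the default `overwrite=False` it sets only those that are unset and emits a `RuntimeWarning` for any existing value that differs: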
+
+```bash
+NCCL_ALGO=Ring
+NCCL_PROTO=Simple
+NCCL_MIN_NCHANNELS=1
+NCCL_MAX_NCHANNELS=1
+```
+
+These settings do not prove bitwise determinism. Validate on the target machine
+before making a hardware-specific claim.
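+
+`configure_deterministic_nccl_env` returns the value each variable had before
+the call (`None` if it was unset). A minimal sketch of undoing the change from
+that mapping, for example in test teardown, might look like this. `restore_env`
+is a hypothetical helper, not something the module provides:
+
+```python
+import os
+from typing import Mapping, Optional
+
+
+def restore_env(previous: Mapping[str, Optional[str]]) -> None:
+    """Put back the values captured before the NCCL settings were written."""
+    for key, value in previous.items():
+        if value is None:
+            os.environ.pop(key, None)  # the variable was unset before
+        else:
+            os.environ[key] = value
+```
+
+Assumed usage: keep `previous = configure_deterministic_nccl_env(overwrite=True)`
+and call `restore_env(previous)` once the run is finished.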
+
+## Behavior
+
+- `async_op=True`: raises `NotImplementedError`, even when the world size is 1.
+- `torch.distributed` unavailable, or no initialized process group and `WORLD_SIZE > 1`: raises `RuntimeError`.
+- no initialized process group and `WORLD_SIZE <= 1` (or unset): returns the input tensor unchanged.
+- `world_size == 1`: returns the input tensor unchanged.
+
+## Smoke Tests
+
+Unit and CPU/Gloo smoke checks:
+
+```bash
+PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 python -m pytest tests/distributed/test_deterministic_allreduce.py -q
+```
+
+Manual NCCL all-reduce smoke:
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1 \
+NCCL_ALGO=Ring \
+NCCL_PROTO=Simple \
+NCCL_MIN_NCHANNELS=1 \
+NCCL_MAX_NCHANNELS=1 \
+torchrun --standalone --nproc_per_node=2 \
+  tests/distributed/test_deterministic_allreduce.py \
+  --backend nccl --mode torch_all_reduce --dtype fp32 --device cuda
+```
+
+The DP gradient smoke test compares a fixed DP=1 full-batch gradient with DP=N
+local gradients reduced by this helper:
+
+```bash
+torchrun --standalone --nproc_per_node=2 \
+  tests/distributed/test_dp_gradient_determinism.py \
+  --backend gloo --mode ordered_rank_reference --dtype fp32 --device cpu
+```
+
+## Limitations
+
+- NVLS (NVLink SHARP) is not implemented or claimed here.
+- Multi-node and RDMA behavior are not validated here.
+- DeepSpeed gradient synchronization is not controlled by this helper yet.
@@ -0,0 +1,46 @@
+# Deterministic All-Reduce Audit
+
+This audit records the distributed communication points relevant to
+[RL-Align/RL-Kernel#112](https://github.com/RL-Align/RL-Kernel/issues/112).
+
+## Search
+
+```bash
+rg -n "all_reduce|allreduce|reduce_scatter|all_gather|DistributedDataParallel|FSDP|deepspeed|gradient" \
+  rl_engine csrc tests examples benchmarks scripts docs .github
+rg -n "torch\.distributed|distributed|dist\.|process_group|ProcessGroup|nccl|NCCL|reduce|all_reduce|reduce_scatter|all_gather|gradient" \
+  rl_engine csrc tests examples benchmarks scripts docs .github
+```
+
+## Summary
+
+No direct `torch.distributed` all-reduce, reduce-scatter, all-gather, DDP, or
+FSDP call sites were found in RL-Kernel source code. The current DP-gradient
+communication risk is indirect: `DeepSpeedTrainingWorker` delegates backward and
+optimizer behavior to the optional DeepSpeed engine.
+
+CUDA IPC uses of `torch.multiprocessing.reductions.reduce_tensor` are not
+collective reductions. They serialize CUDA IPC handles for same-node weight
+handoff.
+
+## Inventory
+
+| Location | Kind | In scope for #112 | Handling |
+| --- | --- | --- | --- |
+| `rl_engine/executors/deepspeed_trainer.py` `DeepSpeedTrainingWorker.train` | Backward / optimizer delegation to DeepSpeed | Yes, indirectly | Do not claim control over DeepSpeed communication order until a tested integration point exists. |
+| `rl_engine/executors/deepspeed_trainer.py` `deepspeed.initialize(...)` | Optional distributed runtime setup | Yes, indirectly | Keep missing-DeepSpeed behavior explicit. Any future integration must document the DeepSpeed hook used for gradient reduction. |
+| `tests/test_deepspeed_training_worker.py` fake engine tests | Unit tests for worker delegation | Adjacent | These tests prove delegation only; they do not validate distributed gradient ordering. |
+| `rl_engine/executors/bridge.py` CUDA IPC `reduce_tensor` use | CUDA IPC handle serialization | No | Keep out of all-reduce scope. |
+| `rl_engine/executors/bridge.py` multi-node/RDMA/NCCL transport blockers | Unsupported weight transport guards | Adjacent | Preserve explicit blockers until a tested transport exists. |
+| `rl_engine/utils/logger.py` `info_on_rank` | Rank-filtered logging | No | No numeric reduction behavior. |
+
+## Entry Point
+
+New distributed code should route through `rl_engine.distributed` so the
+all-reduce contract and fallback/reference behavior stay testable in one place.
+
+## Not Covered
+
+- NVLS (NVLink SHARP).
+- Multi-node or RDMA collectives.
+- DeepSpeed internal gradient synchronization order.
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: Apache-2.0
+# Copyright (c) 2026 RL-Kernel Contributors
+
+from rl_engine.distributed.deterministic_allreduce import (
+    DETERMINISTIC_NCCL_ENV,
+    DeterministicAllReduceConfig,
+    configure_deterministic_nccl_env,
+    deterministic_all_reduce,
+)
+
+__all__ = [
+    "DETERMINISTIC_NCCL_ENV",
+    "DeterministicAllReduceConfig",
+    "configure_deterministic_nccl_env",
+    "deterministic_all_reduce",
+]
@@ -0,0 +1,147 @@
+# SPDX-License-Identifier: Apache-2.0
+# Copyright (c) 2026 RL-Kernel Contributors
+
+from __future__ import annotations
+
+import os
+import warnings
+from dataclasses import dataclass
+from typing import Literal, Optional
+
+import torch
+import torch.distributed as dist
+
+DETERMINISTIC_NCCL_ENV = {
+    "NCCL_ALGO": "Ring",
+    "NCCL_PROTO": "Simple",
+    "NCCL_MIN_NCHANNELS": "1",
+    "NCCL_MAX_NCHANNELS": "1",
+}
+
+
+@dataclass(frozen=True)
+class DeterministicAllReduceConfig:
+    """Options for :func:`deterministic_all_reduce`."""
+
+    mode: Literal["torch_all_reduce", "ordered_rank_reference"] = "torch_all_reduce"
+    op: Literal["sum", "mean"] = "sum"
+    force_fp32_accumulation: bool = True
+    async_op: bool = False
+    group: Optional[dist.ProcessGroup] = None
+
+
+def configure_deterministic_nccl_env(*, overwrite: bool = False) -> dict[str, Optional[str]]:
+    """Set best-effort NCCL ring settings before process-group init; return the prior values."""
+
+    if dist.is_available() and dist.is_initialized():
+        warnings.warn(
+            "NCCL environment was configured after torch.distributed initialization",
+            RuntimeWarning,
+            stacklevel=2,
+        )
+
+    previous: dict[str, Optional[str]] = {}
+    for key, value in DETERMINISTIC_NCCL_ENV.items():
+        previous[key] = os.environ.get(key)
+        if overwrite or key not in os.environ:
+            os.environ[key] = value
+            continue
+        if os.environ[key] != value:
+            warnings.warn(
+                f"{key} is {os.environ[key]!r}; expected {value!r}",
+                RuntimeWarning,
+                stacklevel=2,
+            )
+    return previous
+
+
+def deterministic_all_reduce(
+    tensor: torch.Tensor,
+    config: Optional[DeterministicAllReduceConfig] = None,
+) -> torch.Tensor:
+    """Reduce ``tensor`` in place and return it."""
+
+    cfg = config or DeterministicAllReduceConfig()
+    _validate(tensor, cfg)
+
+    if cfg.async_op:
+        raise NotImplementedError("async deterministic all-reduce is not implemented")
+    if not dist.is_available():
+        raise RuntimeError("torch.distributed is unavailable")
+    if not dist.is_initialized():
+        if int(os.environ.get("WORLD_SIZE", "1")) > 1:
+            raise RuntimeError("torch.distributed is not initialized")
+        return tensor
+
+    world_size = dist.get_world_size(group=cfg.group)
+    if world_size == 1:
+        return tensor
+
+    if cfg.mode == "torch_all_reduce":
+        return _torch_all_reduce(tensor, cfg, world_size)
+    return _ordered_rank_reference(tensor, cfg, world_size)
+
+
+def _validate(tensor: torch.Tensor, cfg: DeterministicAllReduceConfig) -> None:
+    if not isinstance(tensor, torch.Tensor):
+        raise TypeError(f"tensor must be a torch.Tensor, got {type(tensor)!r}")
+    if cfg.mode not in {"torch_all_reduce", "ordered_rank_reference"}:
+        raise ValueError(f"unsupported all-reduce mode: {cfg.mode!r}")
+    if cfg.op not in {"sum", "mean"}:
+        raise ValueError(f"unsupported reduction op: {cfg.op!r}")
+    if cfg.op == "mean" and not (tensor.is_floating_point() or tensor.is_complex()):
+        raise TypeError("op='mean' requires a floating-point or complex tensor")
+
+
+def _torch_all_reduce(
+    tensor: torch.Tensor,
+    cfg: DeterministicAllReduceConfig,
+    world_size: int,
+) -> torch.Tensor:
+    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=cfg.group, async_op=False)
+    if cfg.op == "mean":
+        tensor.div_(world_size)
+    return tensor
+
+
+def _ordered_rank_reference(
+    tensor: torch.Tensor,
+    cfg: DeterministicAllReduceConfig,
+    world_size: int,
+) -> torch.Tensor:
+    # Detach and make contiguous so the collective sees a dense buffer.
+    send = tensor.detach().contiguous()
+    # all_gather fills the list by process-group rank, which fixes the summation order.
+    gathered = [torch.empty_like(send) for _ in range(world_size)]
+    dist.all_gather(gathered, send, group=cfg.group)
+
+    result = torch.empty_like(send)
+    if dist.get_rank(group=cfg.group) == 0:
+        # Only group rank 0 accumulates; other ranks receive the broadcast below.
+        dtype = _accumulation_dtype(send, cfg.force_fp32_accumulation)
+        reduced = gathered[0].to(dtype=dtype)
+        for item in gathered[1:]:
+            reduced.add_(item.to(dtype=dtype))
+        if cfg.op == "mean":
+            reduced.div_(world_size)
+        result.copy_(reduced.to(dtype=send.dtype))
+
+    # broadcast takes a global rank as src, so translate group rank 0 first.
+    dist.broadcast(result, src=_group_root_global_rank(cfg.group), group=cfg.group)
+    tensor.copy_(result.view_as(tensor))
+    return tensor
+
+
+def _group_root_global_rank(group: Optional[dist.ProcessGroup]) -> int:
+    if group is None:
+        return 0
+    try:
+        return int(dist.get_global_rank(group, 0))
+    except AttributeError as exc:
+        raise RuntimeError(
+            "custom process groups require torch.distributed.get_global_rank"
+        ) from exc
+
+
+def _accumulation_dtype(tensor: torch.Tensor, force_fp32: bool) -> torch.dtype:
+    if not force_fp32 or not tensor.is_floating_point():
+        return tensor.dtype
+    if tensor.dtype == torch.float64:
+        return torch.float64
+    return torch.float32