Commit 925f428

hgt312 and claude authored
add explicit CC parameters to DeviceKernel.compile_and_load (#45)
* feat: add explicit cc_enabled/rank_id/world_size to DeviceKernel.compile_and_load

  Support MPMD workloads and non-torch-distributed runtimes by allowing callers to
  pass CC parameters explicitly. When cc_enabled is set, every rank traces and
  compiles independently (no rank-0 broadcast or barrier). Build directories are
  namespaced by rank to avoid concurrent write collisions.

  - cc_enabled=None (default): auto-detect from torch.distributed (SPMD)
  - cc_enabled=True: explicit CC with per-rank compilation (MPMD)
  - cc_enabled=False: disable CC even in distributed settings

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: separate compilation strategy (is_spmd) from CC parameters

  Address review feedback by introducing an is_spmd flag to control the
  compilation strategy (rank-0 broadcast vs. every rank), keeping
  cc_enabled/rank_id/world_size for load-time CC only. Simplifies the
  resolution logic with resolved_* locals.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address review — MPMD build dir, SPMD barrier, CC validation

  - MPMD build dir now auto-namespaces by dist.get_rank() when rank_id is not
    explicitly provided (fixes concurrent write conflict)
  - SPMD barrier fires unconditionally when distributed, not only when
    cc_enabled is None (fixes filesystem visibility race)
  - Validate rank_id/world_size are provided when cc_enabled=True without
    torch.distributed (raises ValueError instead of passing None)
  - Add tests: non-rank-0 SPMD worker, MPMD auto-namespace, validation
  - Update docs for build dir auto-detection

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a0a2c8d commit 925f428

4 files changed

Lines changed: 507 additions & 10 deletions

File tree

docs/index.md

Lines changed: 1 addition & 0 deletions

@@ -35,6 +35,7 @@ tutorials/index
 
 user_guide/indexing_slicing_reference
 user_guide/tracing_architecture
+user_guide/distributed_execution
 ```
 
 ```{toctree}

docs/user_guide/distributed_execution.md

Lines changed: 94 additions & 0 deletions

@@ -0,0 +1,94 @@
# Distributed Execution

NKIPy supports multi-device execution with collective communication (CC)
through `DeviceKernel.compile_and_load`. This guide covers the three
execution patterns and when to use each.

## Execution Patterns

### 1. SPMD (default)

When `torch.distributed` is initialized and `is_spmd=True` (the default),
rank 0 traces and compiles the kernel, then broadcasts the NEFF path to all
workers. All ranks load the same NEFF with CC enabled.

```python
import torch.distributed as dist

dist.init_process_group(...)

kernel = DeviceKernel.compile_and_load(my_kernel, input_a, input_b)
```

Use this when every rank runs the **same kernel** with the **same input shapes**.

### 2. MPMD (`is_spmd=False`)

Set `is_spmd=False` so every rank traces and compiles independently. This is
required when different ranks run different kernels or different input shapes.

```python
# With torch.distributed (CC auto-detected)
kernel = DeviceKernel.compile_and_load(
    my_kernel, input_a, input_b,
    is_spmd=False,
)

# Without torch.distributed (explicit CC)
kernel = DeviceKernel.compile_and_load(
    my_kernel, input_a, input_b,
    is_spmd=False,
    cc_enabled=True,
    rank_id=my_rank,
    world_size=total_workers,
)
```

### 3. No CC (single device or explicit opt-out)

Without `torch.distributed` and without explicit CC parameters, the kernel
loads for single-device execution. You can also pass `cc_enabled=False` to
explicitly disable CC even when `torch.distributed` is active.

```python
# Single device (no torch.distributed)
kernel = DeviceKernel.compile_and_load(my_kernel, input_a)

# Opt out of CC in a distributed setting
kernel = DeviceKernel.compile_and_load(my_kernel, input_a, cc_enabled=False)
```

## Parameter Reference

| Parameter    | Controls           | Values                                         |
|--------------|--------------------|------------------------------------------------|
| `is_spmd`    | Compilation        | `True` = rank-0 broadcast, `False` = all ranks |
| `cc_enabled` | CC at load time    | `None` = auto, `True` = on, `False` = off      |
| `rank_id`    | Rank for CC load   | `None` = auto from dist, or explicit `int`     |
| `world_size` | World size for CC  | `None` = auto from dist, or explicit `int`     |

## Comparison

| Setting                 | SPMD (default)          | MPMD                 | No CC         |
|-------------------------|-------------------------|----------------------|---------------|
| `is_spmd`               | `True`                  | `False`              | Either        |
| `cc_enabled`            | `None` (auto)           | `None`/`True`        | `False`/`None`|
| `torch.distributed`     | Required                | Optional             | N/A           |
| Compilation             | Rank 0 only + broadcast | Every rank           | Every rank    |
| Barrier                 | Yes                     | No                   | No            |
| Use case                | Same kernel, all ranks  | Per-rank kernels     | Single device |

## Build Directory Isolation

In MPMD mode (`is_spmd=False`), the build directory is automatically
namespaced by rank (e.g. `build_dir/rank_0/`, `build_dir/rank_1/`) to
prevent concurrent writes when different ranks produce the same content hash.
The rank is taken from the explicit `rank_id` parameter, or auto-detected
from `torch.distributed` when available.
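
As an illustration (not part of the committed doc above), passing an explicit `build_dir` in MPMD mode keeps each rank's artifacts under its own `rank_<id>` subdirectory; the path and the `my_kernel`/`input_a` names below are placeholders:

```python
# Hypothetical MPMD call with an explicit build directory.
# Per the namespacing above, rank 1's output lands under /tmp/nkipy_build/rank_1/.
kernel = DeviceKernel.compile_and_load(
    my_kernel, input_a,
    is_spmd=False,
    cc_enabled=True,
    rank_id=1,
    world_size=4,
    build_dir="/tmp/nkipy_build",
)
```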

## Caching

Compiled NEFFs are cached in memory by a content hash of the HLO and compiler
arguments. The cache key is the same regardless of CC mode, so a kernel
compiled once can be reused across calls. Pass `use_cached_if_exists=False` to
force recompilation.
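
For example (again illustrative, not part of the committed doc), forcing a rebuild of an already-cached kernel looks like:

```python
# Recompile even if a cached NEFF exists for this content hash.
kernel = DeviceKernel.compile_and_load(
    my_kernel, input_a,
    use_cached_if_exists=False,
)
```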

nkipy/src/nkipy/runtime/device_kernel.py

Lines changed: 64 additions & 10 deletions

@@ -78,13 +78,25 @@ def compile_and_load(
         use_cached_if_exists=True,
         build_dir=None,
         target=CompilationTarget.DEFAULT,
+        is_spmd=True,
+        cc_enabled=None,
+        rank_id=None,
+        world_size=None,
         **kwargs,
     ):
         """Compile and load a kernel, returning a DeviceKernel instance.
 
-        In distributed mode, only the lead worker (rank 0) traces and compiles.
-        The resulting paths are broadcast to all workers, which then load the
-        NEFF collectively.
+        Compilation strategy is controlled by ``is_spmd``:
+
+        * **True (default)** – rank 0 traces/compiles and broadcasts the NEFF
+          path to all workers. Requires ``torch.distributed``.
+        * **False** – every rank traces and compiles independently (MPMD).
+          Required when each rank runs a different kernel or uses different
+          input shapes. Works with or without ``torch.distributed``.
+
+        Collective-communication at load time is controlled separately by
+        ``cc_enabled``, ``rank_id``, and ``world_size``. When left as
+        ``None`` these are auto-detected from ``torch.distributed``.
 
         Args:
             kernel: The kernel function to compile
@@ -93,6 +105,12 @@ def compile_and_load(
             use_cached_if_exists: If True, use cached neff if it exists.
             build_dir: Overriding the build directory for the kernel
             target: Compilation target for the kernel
+            is_spmd: If True, rank 0 compiles and broadcasts (SPMD).
+                If False, every rank compiles independently (MPMD).
+            cc_enabled: Enable collective communication for this kernel.
+                Auto-detected from torch.distributed when None.
+            rank_id: Worker rank for CC. Auto-detected when None.
+            world_size: Total workers for CC. Auto-detected when None.
             *args, **kwargs: Arguments for specialization (numpy array or DeviceTensor)
 
         Returns:
@@ -103,7 +121,23 @@ def compile_and_load(
 
         distributed = _is_distributed()
 
-        if distributed:
+        # In MPMD mode, namespace build dir by rank to avoid concurrent writes
+        # when different ranks produce the same content hash.
+        if not is_spmd:
+            effective_rank = rank_id if rank_id is not None else (
+                dist.get_rank() if distributed else None
+            )
+            if effective_rank is not None:
+                compile_build_dir = os.path.join(
+                    build_dir or _get_build_dir(), f"rank_{effective_rank}"
+                )
+            else:
+                compile_build_dir = build_dir
+        else:
+            compile_build_dir = build_dir
+
+        # --- 1. Compilation ---
+        if is_spmd and distributed:
             if dist.get_rank() == 0:
                 neff_path, cache_key = cls._trace_and_compile(
                     kernel,
@@ -112,7 +146,7 @@ def compile_and_load(
                     kwargs,
                     additional_compiler_args=additional_compiler_args,
                     use_cached_if_exists=use_cached_if_exists,
-                    build_dir=build_dir,
+                    build_dir=compile_build_dir,
                     target=target,
                 )
                 dist.broadcast_object_list([neff_path, cache_key], src=0)
@@ -128,7 +162,7 @@ def compile_and_load(
                 kwargs,
                 additional_compiler_args=additional_compiler_args,
                 use_cached_if_exists=use_cached_if_exists,
-                build_dir=build_dir,
+                build_dir=compile_build_dir,
                 target=target,
             )
 
@@ -137,15 +171,35 @@ def compile_and_load(
             logger.info(f"Using loaded kernel: {name} (cache_key={cache_key})")
             return _LOADED_KERNELS[cache_key]
 
-        # Load the compiled NEFF
-        if distributed:
+        # --- 2. Resolve CC parameters for loading ---
+        resolved_cc = cc_enabled if cc_enabled is not None else distributed
+        resolved_rank = (
+            rank_id if rank_id is not None
+            else (dist.get_rank() if distributed else None)
+        )
+        resolved_world = (
+            world_size if world_size is not None
+            else (dist.get_world_size() if distributed else None)
+        )
+
+        if resolved_cc and (resolved_rank is None or resolved_world is None):
+            raise ValueError(
+                "rank_id and world_size are required when cc_enabled=True "
+                "and torch.distributed is not available for auto-detection"
+            )
+
+        # Barrier only needed in SPMD mode (rank 0 compiled for everyone)
+        if is_spmd and distributed:
             dist.barrier()
+
+        # --- 3. Load the compiled NEFF ---
+        if resolved_cc:
             device_kernel = cls.load_from_neff(
                 neff_path,
                 name=name,
                 cc_enabled=True,
-                rank_id=dist.get_rank(),
-                world_size=dist.get_world_size(),
+                rank_id=resolved_rank,
+                world_size=resolved_world,
             )
         else:
             device_kernel = cls.load_from_neff(neff_path, name=name)
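
For context on how the new validation behaves, here is a minimal usage sketch of the MPMD-without-`torch.distributed` path; the environment variable names and the `my_kernel`/`x` names are placeholders, and rank/world-size plumbing is whatever the surrounding launcher provides:

```python
import os

import numpy as np

# Placeholder rank/world-size plumbing supplied by an external launcher.
rank = int(os.environ["MY_RANK"])
world_size = int(os.environ["MY_WORLD_SIZE"])

x = np.ones((128, 128), dtype=np.float32)

# Every process traces and compiles its own kernel (is_spmd=False) and loads
# it with CC enabled. Omitting rank_id/world_size here would raise ValueError,
# because there is no torch.distributed process group to auto-detect them from.
kernel = DeviceKernel.compile_and_load(
    my_kernel, x,
    is_spmd=False,
    cc_enabled=True,
    rank_id=rank,
    world_size=world_size,
)
```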
