gpudirect-tcpx: Update NCCL config manifest for GKE 1.34+ recommendat…#610

Merged
jkru3 merged 4 commits into
GoogleCloudPlatform:master from
jkru3:nccl-config-deprecation
Jun 2, 2026

@jkru3 jkru3 commented May 19, 2026

Update NCCL config manifest for GKE 1.34+ recommendations

Comment thread gpudirect-tcpx/nccl-config-latest.yaml Outdated
-x NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0 \
-x NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=500000 \
-x NCCL_CROSS_NIC=0 \
-x NCCL_ALGO=Ring,Tree \

For nccl-net 3.1.11+

@jkru3 jkru3 force-pushed the nccl-config-deprecation branch from 133cfa5 to 919680e Compare June 1, 2026 18:05
jkru3 added 4 commits June 2, 2026 01:13
…ions

This change updates the `nccl-config.yaml` ConfigMap manifest to remove deprecated environment variables and obsolete channel restrictions, aligning it with the official recommendations for the GKE 1.34+ TCPX stack.

Rationale for changes:

1. Removed `NCCL_GPUDIRECTTCPX_FORCE_ACK=0` & `NCCL_GPUDIRECTTCPX_TX_COMPLETION_NANOSLEEP=1000`
   - Reason: These manual packet tuning variables are deprecated and completely ignored by the updated TCPX daemon (v2.0.15+) used in GKE 1.34. With the migration to COS 125 (Linux kernel 6.12+), the stack natively utilizes upstream Device Memory TCP (devmem TCP) for zero-copy transfers, making these custom daemon-level workarounds obsolete.
   - Proof: These variables have been removed from the recommended configuration in the official Google Cloud GPUDirect-TCPX documentation:
     https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx#add-gpudirect-tcpx-manifests

2. Removed `NCCL_MAX_NCHANNELS=8` & `NCCL_MIN_NCHANNELS=8`
   - Reason: Forcing the system to use exactly 8 channels is no longer recommended for H100 workloads running NCCL core 3.1.12+ (standard in GKE 1.34). Restricting the channel count prevents NCCL from dynamically selecting the optimal number of channels based on topology, which can artificially limit GPU network bandwidth.
   - Proof: The official configuration guide no longer lists channel count limits, allowing NCCL to dynamically optimize itself:
     https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx#add-gpudirect-tcpx-manifests

These updates resolve the discrepancy where the manifest did not reflect the GKE 1.34 user guide recommendations.
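
The removals listed above can be spot-checked with a small shell helper. This is only a sketch: the function name `find_deprecated_nccl_vars` is hypothetical and not part of the repository, and it does a plain substring match, so it would also flag a variable that appears in a comment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): print each variable removed by
# this change that still appears anywhere in the given manifest file.
find_deprecated_nccl_vars() {
  local manifest="$1" var
  for var in \
    NCCL_GPUDIRECTTCPX_FORCE_ACK \
    NCCL_GPUDIRECTTCPX_TX_COMPLETION_NANOSLEEP \
    NCCL_MAX_NCHANNELS \
    NCCL_MIN_NCHANNELS
  do
    # -q: no output, -F: treat the name as a fixed string, not a regex.
    if grep -qF -- "$var" "$manifest"; then
      echo "$var"
    fi
  done
}

# Example (path taken from this PR):
#   find_deprecated_nccl_vars gpudirect-tcpx/nccl-config.yaml
```

An empty result means none of the four names were found in the file; any printed name points at a line that still needs cleanup.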
…endations

This change creates the `nccl-config-latest.yaml` ConfigMap manifest, which removes deprecated environment variables and obsolete channel restrictions to align with the official recommendations for the GKE 1.34+ TCPX stack.

Rationale for changes:

1. Removed `NCCL_GPUDIRECTTCPX_FORCE_ACK=0` & `NCCL_GPUDIRECTTCPX_TX_COMPLETION_NANOSLEEP=1000`
   - Rationale: These manual tuning parameters were workarounds for older, custom out-of-tree TCPX drivers. GKE 1.34 (COS 125) migrates to Linux Kernel 6.12+, which natively supports **Device Memory TCP (devmem TCP)**. The kernel's TCP stack now handles packet acknowledgment and zero-copy transfers natively, making these CPU-timing and socket-level workarounds obsolete. The new tcpx-daemon (v2.0.15) ignores these variables.
   - Proof (Linux Kernel v6.12 Merge): https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
   - Proof (Linux Kernel Documentation): https://www.kernel.org/doc/html/v6.12/networking/devmem.html

2. Removed `NCCL_MAX_NCHANNELS=8` & `NCCL_MIN_NCHANNELS=8`
   - Rationale: Setting these variables forces NCCL to bypass its internal, automatic topology-detection and channel-tuning algorithm. In newer NCCL versions (3.1.12+), this tuner is highly optimized to dynamically allocate the optimal number of channels (often up to 24 channels on A3/H100 nodes) to fully saturate the network bandwidth. Manually capping channels at 8 disables this optimization and acts as a performance bottleneck, which is recognized as a primary cause of communication regressions in distributed GPU training (and is actively asserted against in standard ML validation suites like Megatron-LM).
   - Proof (NVIDIA NCCL Tuning Documentation): Bypassing automatic channel selection is documented by NVIDIA as a manual override that should be avoided in production to allow topology-aware tuning:
     https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

These updates resolve the discrepancy where the manifest did not reflect the GKE 1.34 user guide recommendations.
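
Because the rationale above depends on the node running Linux kernel 6.12 or newer, a version check on the node can serve as a quick sanity test. The sketch below is illustrative only: `kernel_at_least` is a hypothetical helper, and a passing version check does not by itself prove that devmem TCP is enabled in the kernel build.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a kernel release string (as printed by
# `uname -r`) is at least <major>.<minor>.
kernel_at_least() {
  local release="$1" want_major="$2" want_minor="$3"
  local major rest minor
  major="${release%%.*}"       # text before the first dot, e.g. "6"
  rest="${release#*.}"         # text after the first dot, e.g. "12.55+"
  minor="${rest%%[!0-9]*}"     # leading digits of the remainder, e.g. "12"
  [ "$major" -gt "$want_major" ] ||
    { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# Example, run on a node:
#   kernel_at_least "$(uname -r)" 6 12 && echo "kernel is 6.12 or newer"
```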
…endations

This change creates the `nccl-config-latest.yaml` ConfigMap manifest, which removes deprecated environment variables and obsolete channel restrictions to align with the official recommendations for the GKE 1.34+ TCPX stack.

Rationale for changes:

1. Removed `NCCL_GPUDIRECTTCPX_FORCE_ACK=0` & `NCCL_GPUDIRECTTCPX_TX_COMPLETION_NANOSLEEP=1000`
   - Rationale: These manual tuning parameters were workarounds for older, custom out-of-tree TCPX drivers. GKE 1.34 (COS 125) migrates to Linux Kernel 6.12+, which natively supports **Device Memory TCP (devmem TCP)**. The kernel's TCP stack now handles packet acknowledgment and zero-copy transfers natively, making these CPU-timing and socket-level workarounds obsolete. The new tcpx-daemon (v2.0.15) ignores these variables.
   - Proof (Linux Kernel v6.12 Merge): https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
   - Proof (Linux Kernel Documentation): https://www.kernel.org/doc/html/v6.12/networking/devmem.html

2. Removed `NCCL_MAX_NCHANNELS=8` & `NCCL_MIN_NCHANNELS=8`
   - Rationale: Setting these variables forces NCCL to bypass its internal, automatic topology-detection and channel-tuning algorithm. In newer NCCL versions (3.1.12+), this tuner is highly optimized to dynamically allocate the optimal number of channels (often up to 24 channels on A3/H100 nodes) to fully saturate the network bandwidth. Manually capping channels at 8 disables this optimization and acts as a performance bottleneck, which is recognized as a primary cause of communication regressions in distributed GPU training (and is actively asserted against in standard ML validation suites like Megatron-LM).
   - Proof (NVIDIA NCCL Tuning Documentation): Bypassing automatic channel selection is documented by NVIDIA as a manual override that should be avoided in production to allow topology-aware tuning:
     https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

Dry-ran the manifest with `kubectl apply --dry-run=client -f gpudirect-tcpx/nccl-config-latest.yaml`.

These updates resolve the discrepancy where the manifest did not reflect the GKE 1.34 user guide recommendations.
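
The dry run mentioned above can be repeated over several manifests with a small wrapper. This is a sketch under assumptions: `dry_run_manifests` and the `KUBECTL` override variable are hypothetical names introduced here, and a client-side dry run only checks what `kubectl` can validate without persisting anything to the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: client-side dry run for each manifest passed in.
# KUBECTL can be overridden, e.g. to point at a specific kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

dry_run_manifests() {
  local failed=0 manifest
  for manifest in "$@"; do
    if ! "$KUBECTL" apply --dry-run=client -f "$manifest" >/dev/null; then
      echo "dry run failed: $manifest" >&2
      failed=1
    fi
  done
  # Non-zero if any manifest failed the dry run.
  return "$failed"
}

# Example (path taken from this PR):
#   dry_run_manifests gpudirect-tcpx/nccl-config-latest.yaml
```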
…files

- Overwrites nccl-config.yaml with the verified GKE 1.34+ spec (removing deprecated variables such as FORCE_ACK and MAX_NCHANNELS, and enabling `NCCL_ALGO=Ring,Tree`).
- Deletes the temporary nccl-config-latest.yaml which is now redundant.
- Updates nccl-test-latest.yaml to point back to the standard nccl-configmap name.
@jkru3 jkru3 force-pushed the nccl-config-deprecation branch from feaf51a to 0e5609b Compare June 2, 2026 01:13
@jkru3 jkru3 merged commit 11c8319 into GoogleCloudPlatform:master Jun 2, 2026
6 checks passed