gpudirect-tcpx: Update NCCL config manifest for GKE 1.34+ recommendat… by jkru3 · Pull Request #610 · GoogleCloudPlatform/container-engine-accelerators

jkru3 · 2026-05-19T23:12:30Z

Update NCCL config manifest for GKE 1.34+ recommendations

jkru3 · 2026-05-28T23:43:17Z

+    -x NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0 \
+    -x NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=500000 \
+    -x NCCL_CROSS_NIC=0 \
+    -x NCCL_ALGO=Ring,Tree \


For nccl-net 3.1.11+
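
As a rough illustration of the four added lines, assuming they are arguments to Open MPI's `mpirun` (where `-x NAME=value` exports an environment variable to the launched processes), the sketch below builds the same flags from a list of settings. The helper name `nccl_mpirun_flags` is hypothetical and is not part of this repository.

```shell
# Hypothetical helper (not from the PR): print one "-x NAME=value" flag
# per argument, in the order given.
nccl_mpirun_flags() {
  for setting in "$@"; do
    printf '%s %s\n' -x "$setting"
  done
}

# The four settings added in the reviewed diff.
nccl_mpirun_flags \
  NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0 \
  NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=500000 \
  NCCL_CROSS_NIC=0 \
  NCCL_ALGO=Ring,Tree
```

Keeping the settings as a plain list makes it easy to compare them against the recommended configuration before they are spliced into a launch command.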

…ions

This change updates the `nccl-config.yaml` ConfigMap manifest to remove deprecated environment variables and obsolete channel restrictions, aligning it with the official recommendations for the GKE 1.34+ TCPX stack.

Rationale for changes:

1. Removed `NCCL_GPUDIRECTTCPX_FORCE_ACK=0` & `NCCL_GPUDIRECTTCPX_TX_COMPLETION_NANOSLEEP=1000`
   - Reason: These manual packet tuning variables are deprecated and completely ignored by the updated TCPX daemon (v2.0.15+) used in GKE 1.34. With the migration to COS 125 (Linux kernel 6.12+), the stack natively utilizes upstream Device Memory TCP (devmem TCP) for zero-copy transfers, making these custom daemon-level workarounds obsolete.
   - Proof: These variables have been removed from the recommended configuration in the official Google Cloud GPUDirect-TCPX documentation: https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx#add-gpudirect-tcpx-manifests
2. Removed `NCCL_MAX_NCHANNELS=8` & `NCCL_MIN_NCHANNELS=8`
   - Reason: Forcing the system to use exactly 8 channels is no longer recommended for H100 workloads running NCCL core 3.1.12+ (standard in GKE 1.34). Restricting the channel count prevents NCCL from dynamically selecting the optimal number of channels based on topology, which can artificially limit GPU network bandwidth.
   - Proof: The official configuration guide no longer lists channel count limits, allowing NCCL to dynamically optimize itself: https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx#add-gpudirect-tcpx-manifests

These updates resolve the discrepancy where the manifest did not reflect the GKE 1.34 user guide recommendations.

…endations

This change creates `nccl-config-latest.yaml` ConfigMap manifest to remove deprecated environment variables and obsolete channel restrictions, aligning it with the official recommendations for the GKE 1.34+ TCPX stack.

Rationale for changes:

1. Removed `NCCL_GPUDIRECTTCPX_FORCE_ACK=0` & `NCCL_GPUDIRECTTCPX_TX_COMPLETION_NANOSLEEP=1000`
   - Rationale: These manual tuning parameters were workarounds for older, custom out-of-tree TCPX drivers. GKE 1.34 (COS 125) migrates to Linux Kernel 6.12+, which natively supports **Device Memory TCP (devmem TCP)**. The kernel's TCP stack now handles packet acknowledgment and zero-copy transfers natively, making these CPU-timing and socket-level workarounds obsolete. The new tcpx-daemon (v2.0.15) ignores these variables.
   - Proof (Linux Kernel v6.12 Merge): https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
   - Proof (Linux Kernel Documentation): https://www.kernel.org/doc/html/v6.12/networking/devmem.html
2. Removed `NCCL_MAX_NCHANNELS=8` & `NCCL_MIN_NCHANNELS=8`
   - Rationale: Setting these variables forces NCCL to bypass its internal, automatic topology-detection and channel-tuning algorithm. In newer NCCL versions (3.1.12+), this tuner is highly optimized to dynamically allocate the optimal number of channels (often up to 24 channels on A3/H100 nodes) to fully saturate the network bandwidth. Manually capping channels at 8 disables this optimization and acts as a performance bottleneck, which is recognized as a primary cause of communication regressions in distributed GPU training (and is actively asserted against in standard ML validation suites like Megatron-LM).
   - Proof (NVIDIA NCCL Tuning Documentation): Bypassing automatic channel selection is documented by NVIDIA as a manual override that should be avoided in production to allow topology-aware tuning: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

These updates resolve the discrepancy where the manifest did not reflect the GKE 1.34 user guide recommendations.
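
To make the second rationale concrete: the channel pins are ordinary environment variables, so leaving channel selection to NCCL amounts to making sure neither is set when the job starts. The sketch below is a minimal illustration for a launch script; the function name `clear_channel_pins` is hypothetical and does not appear in the manifest or in this PR.

```shell
# Hypothetical helper (not from the PR): drop the two channel pins so that
# NCCL is free to choose the channel count itself.
clear_channel_pins() {
  unset NCCL_MIN_NCHANNELS NCCL_MAX_NCHANNELS
}

# Example: an environment inherited from the old manifest.
export NCCL_MIN_NCHANNELS=8 NCCL_MAX_NCHANNELS=8
clear_channel_pins

# ${VAR+set} expands to "set" only while VAR is defined.
echo "min=${NCCL_MIN_NCHANNELS+set} max=${NCCL_MAX_NCHANNELS+set}"
```

Unsetting, rather than assigning an empty value, is the deliberate choice here: it leaves no trace of the variable in the process environment.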

…endations

This change creates `nccl-config-latest.yaml` ConfigMap manifest to remove deprecated environment variables and obsolete channel restrictions, aligning it with the official recommendations for the GKE 1.34+ TCPX stack.

Rationale for changes:

1. Removed `NCCL_GPUDIRECTTCPX_FORCE_ACK=0` & `NCCL_GPUDIRECTTCPX_TX_COMPLETION_NANOSLEEP=1000`
   - Rationale: These manual tuning parameters were workarounds for older, custom out-of-tree TCPX drivers. GKE 1.34 (COS 125) migrates to Linux Kernel 6.12+, which natively supports **Device Memory TCP (devmem TCP)**. The kernel's TCP stack now handles packet acknowledgment and zero-copy transfers natively, making these CPU-timing and socket-level workarounds obsolete. The new tcpx-daemon (v2.0.15) ignores these variables.
   - Proof (Linux Kernel v6.12 Merge): https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
   - Proof (Linux Kernel Documentation): https://www.kernel.org/doc/html/v6.12/networking/devmem.html
2. Removed `NCCL_MAX_NCHANNELS=8` & `NCCL_MIN_NCHANNELS=8`
   - Rationale: Setting these variables forces NCCL to bypass its internal, automatic topology-detection and channel-tuning algorithm. In newer NCCL versions (3.1.12+), this tuner is highly optimized to dynamically allocate the optimal number of channels (often up to 24 channels on A3/H100 nodes) to fully saturate the network bandwidth. Manually capping channels at 8 disables this optimization and acts as a performance bottleneck, which is recognized as a primary cause of communication regressions in distributed GPU training (and is actively asserted against in standard ML validation suites like Megatron-LM).
   - Proof (NVIDIA NCCL Tuning Documentation): Bypassing automatic channel selection is documented by NVIDIA as a manual override that should be avoided in production to allow topology-aware tuning: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

Dry-ran the manifest with `kubectl apply --dry-run=client -f gpudirect-tcpx/nccl-config-latest.yaml`.

These updates resolve the discrepancy where the manifest did not reflect the GKE 1.34 user guide recommendations.
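
Alongside the dry run, a manifest can be checked for leftovers of the four removed variables. The sketch below is one possible way to do that with `grep`; the function name `removed_nccl_vars` is hypothetical, and the variable list is taken from the commit message above rather than from any official deprecation list.

```shell
# Hypothetical helper (not from the PR): print each removed variable that
# still appears in the file given as $1, one name per line.
removed_nccl_vars() {
  for var in \
    NCCL_GPUDIRECTTCPX_FORCE_ACK \
    NCCL_GPUDIRECTTCPX_TX_COMPLETION_NANOSLEEP \
    NCCL_MAX_NCHANNELS \
    NCCL_MIN_NCHANNELS
  do
    # -q: no output, -F: match the name as a fixed string
    if grep -qF -- "$var" "$1"; then
      printf '%s\n' "$var"
    fi
  done
}
```

For example, `removed_nccl_vars gpudirect-tcpx/nccl-config-latest.yaml` prints nothing if none of the four names appear in the file. The match is textual, so a name that only appears in a comment is reported as well.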

…files

- Overwrites `nccl-config.yaml` with the verified GKE 1.34+ spec (removing deprecated vars like `FORCE_ACK`, `MAX_NCHANNELS`, etc., and enabling `Ring,Tree`).
- Deletes the temporary `nccl-config-latest.yaml`, which is now redundant.
- Updates `nccl-test-latest.yaml` to point back to the standard `nccl-configmap` name.

jkru3 commented May 28, 2026

jkru3 force-pushed the nccl-config-deprecation branch from 133cfa5 to 919680e Compare June 1, 2026 18:05

Jiaqicao257 approved these changes Jun 1, 2026

jkru3 added 4 commits June 2, 2026 01:13

jkru3 force-pushed the nccl-config-deprecation branch from feaf51a to 0e5609b Compare June 2, 2026 01:13

jkru3 merged commit 11c8319 into GoogleCloudPlatform:master Jun 2, 2026
6 checks passed

jkru3 merged 4 commits into GoogleCloudPlatform:master from jkru3:nccl-config-deprecation (#610)
