Skip to content

Reduce blockDim in fake_quantize_kernel for improved SM occupancy#8115

Open
flutist wants to merge 1 commit into
deepspeedai:masterfrom
flutist:optimize_per_sm
Open

Reduce blockDim in fake_quantize_kernel for improved SM occupancy#8115
flutist wants to merge 1 commit into
deepspeedai:masterfrom
flutist:optimize_per_sm

Conversation

@flutist

@flutist flutist commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Reduce blockDim in fake_quantize_kernel for improved SM occupancy

Summary

Reduce blockDim from 1024 to 256 in launch_fake_quantize_kernel (symmetric variant) to improve per-SM warp occupancy. In production workloads (ZeRO gradient compression, parameter partitioning), group_num is typically 87K–1.4M, providing enough blocks per SM to benefit from the increased warp count (32→48 warps/SM), yielding 28–77% speedup.

Background

fake_quantize_kernel performs symmetric quantize-dequantize over group_num groups, with one CUDA block per group (grid_dim = group_num). Each block uses a grid-stride loop to process all elements in its group.

With blockDim=1024:

  • Max 1 block per SM (register/shared memory constraints)
  • 32 warps per SM for latency hiding

With blockDim=256:

  • Up to 6 blocks per SM
  • 48 warps per SM (6 × 8 warps) for latency hiding

The benefit is only realized when group_num is large enough to fill the GPU. For group_num < num_SMs (~80), reducing blockDim simply reduces parallelism per block with no occupancy gain.

Change

void launch_fake_quantize_kernel(T* vals, int total_count, int group_num, int num_bits, cudaStream_t stream)
 {
+    // Reduced from 1024 to 256: improves per-SM warp occupancy (32→48 warps/SM)
+    // for large group_num (ZeRO gradient compression on production models where
+    // group_num >> num_SMs * blocks_per_SM). Measured 28-77% speedup at group_num
+    // = 87K-1.4M. For small group_num (< num_SMs) this is slower, but that is not
+    // the production use case.
     dim3 grid_dim(group_num);
-    dim3 block_dim(1024);
+    dim3 block_dim(256);

Only the symmetric variant (launch_fake_quantize_kernel) is modified. The asymmetric and stochastic-rounding variants (launch_fake_quantize_kernel_asym, launch_sr_fake_quantize_kernel, launch_sr_fake_quantize_kernel_asym) are left unchanged because they use register-cached arrays (float2 data[MAX_REG], __half2 data_low[128]) whose size depends on blockDim, and reducing blockDim could cause register spill issues.

Performance

Benchmarked on A100 with ds_quantize_fp16, 8-bit symmetric quantization:

Workload group_num 1024 thr 256 thr Speedup
BERT-large 340M, group_size=4096 87,040 6.02 ms 4.32 ms -28%
Llama-7B (half), group_size=4096 917,504 57.27 ms 30.58 ms -47%
BERT-large 340M, group_size=256 1,392,640 60.23 ms 13.74 ms -77%

The third case shows the largest speedup because group_size=256 means 75% of threads are idle with blockDim=1024 (only 256 of 1024 threads have work), while blockDim=256 achieves 100% thread utilization.

Accuracy

Existing test suite passes with no tolerance changes:

pytest tests/unit/ops/quantizer/test_fake_quantization.py -v

Covers fp16, 8-bit/4-bit, groups=1/16, compared against PyTorch reference implementation with rtol=2e-2, atol=5e-3.

Scope

  • Modified: csrc/quantization/fake_quantizer.cu — single line change in launch_fake_quantize_kernel
  • Not modified: Asymmetric and SR variants (register-cached, would require restructuring)
  • Risk: Low — grid-stride loop pattern is blockDim-agnostic; only affects launch configuration
  • Backward compatible: No API or behavior change

@flutist flutist requested a review from tjruwase as a code owner July 2, 2026 10:51
@flutist flutist changed the title improves per-SM warp occupancy Reduce blockDim in fake_quantize_kernel for improved SM occupancy Jul 2, 2026
@flutist

flutist commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

@tjruwase ptal, thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant