Reduce blockDim in fake_quantize_kernel for improved SM occupancy by flutist · Pull Request #8115 · deepspeedai/DeepSpeed

flutist · 2026-07-02T10:51:58Z

Reduce blockDim in fake_quantize_kernel for improved SM occupancy

Summary

Reduce blockDim from 1024 to 256 in launch_fake_quantize_kernel (symmetric variant) to improve per-SM warp occupancy. In production workloads (ZeRO gradient compression, parameter partitioning), group_num is typically 87K–1.4M, providing enough blocks per SM to benefit from the increased warp count (32→48 warps/SM), yielding 28–77% speedup.

Background

fake_quantize_kernel performs symmetric quantize-dequantize over group_num groups, with one CUDA block per group (grid_dim = group_num). Each block uses a grid-stride loop to process all elements in its group.

With blockDim=1024:

Max 1 block per SM (register/shared memory constraints)
32 warps per SM for latency hiding

With blockDim=256:

Up to 6 blocks per SM
48 warps per SM (6 × 8 warps) for latency hiding

The benefit is only realized when group_num is large enough to fill the GPU. For group_num < num_SMs (~80), reducing blockDim simply reduces parallelism per block with no occupancy gain.

Change

void launch_fake_quantize_kernel(T* vals, int total_count, int group_num, int num_bits, cudaStream_t stream)
 {
+    // Reduced from 1024 to 256: improves per-SM warp occupancy (32→48 warps/SM)
+    // for large group_num (ZeRO gradient compression on production models where
+    // group_num >> num_SMs * blocks_per_SM). Measured 28-77% speedup at group_num
+    // = 87K-1.4M. For small group_num (< num_SMs) this is slower, but that is not
+    // the production use case.
     dim3 grid_dim(group_num);
-    dim3 block_dim(1024);
+    dim3 block_dim(256);

Only the symmetric variant (launch_fake_quantize_kernel) is modified. The asymmetric and stochastic-rounding variants (launch_fake_quantize_kernel_asym, launch_sr_fake_quantize_kernel, launch_sr_fake_quantize_kernel_asym) are left unchanged because they use register-cached arrays (float2 data[MAX_REG], __half2 data_low[128]) whose size depends on blockDim, and reducing blockDim could cause register spill issues.

Performance

Benchmarked on A100 with ds_quantize_fp16, 8-bit symmetric quantization:

Workload	group_num	1024 thr	256 thr	Speedup
BERT-large 340M, group_size=4096	87,040	6.02 ms	4.32 ms	-28%
Llama-7B (half), group_size=4096	917,504	57.27 ms	30.58 ms	-47%
BERT-large 340M, group_size=256	1,392,640	60.23 ms	13.74 ms	-77%

The third case shows the largest speedup because group_size=256 means 75% of threads are idle with blockDim=1024 (only 256 of 1024 threads have work), while blockDim=256 achieves 100% thread utilization.

Accuracy

Existing test suite passes with no tolerance changes:

pytest tests/unit/ops/quantizer/test_fake_quantization.py -v

Covers fp16, 8-bit/4-bit, groups=1/16, compared against PyTorch reference implementation with rtol=2e-2, atol=5e-3.

Scope

Modified: csrc/quantization/fake_quantizer.cu — single line change in launch_fake_quantize_kernel
Not modified: Asymmetric and SR variants (register-cached, would require restructuring)
Risk: Low — grid-stride loop pattern is blockDim-agnostic; only affects launch configuration
Backward compatible: No API or behavior change

flutist · 2026-07-02T11:08:30Z

@tjruwase ptal, thanks

improves per-SM warp occupancy

3142591

flutist requested a review from tjruwase as a code owner July 2, 2026 10:51

flutist changed the title ~~improves per-SM warp occupancy~~ Reduce blockDim in fake_quantize_kernel for improved SM occupancy Jul 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reduce blockDim in fake_quantize_kernel for improved SM occupancy#8115

Reduce blockDim in fake_quantize_kernel for improved SM occupancy#8115
flutist wants to merge 1 commit into
deepspeedai:masterfrom
flutist:optimize_per_sm

flutist commented Jul 2, 2026 •

edited

Loading

Uh oh!

flutist commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

flutist commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!