Reduce blockDim in fake_quantize_kernel for improved SM occupancy#8115
Open
flutist wants to merge 1 commit into
Open
Reduce blockDim in fake_quantize_kernel for improved SM occupancy#8115flutist wants to merge 1 commit into
flutist wants to merge 1 commit into
Conversation
Contributor
Author
|
@tjruwase ptal, thanks |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reduce blockDim in fake_quantize_kernel for improved SM occupancy
Summary
Reduce
blockDimfrom 1024 to 256 inlaunch_fake_quantize_kernel(symmetric variant) to improve per-SM warp occupancy. In production workloads (ZeRO gradient compression, parameter partitioning),group_numis typically 87K–1.4M, providing enough blocks per SM to benefit from the increased warp count (32→48 warps/SM), yielding 28–77% speedup.Background
fake_quantize_kernelperforms symmetric quantize-dequantize overgroup_numgroups, with one CUDA block per group (grid_dim = group_num). Each block uses a grid-stride loop to process all elements in its group.With
blockDim=1024:With
blockDim=256:The benefit is only realized when
group_numis large enough to fill the GPU. Forgroup_num < num_SMs (~80), reducingblockDimsimply reduces parallelism per block with no occupancy gain.Change
void launch_fake_quantize_kernel(T* vals, int total_count, int group_num, int num_bits, cudaStream_t stream) { + // Reduced from 1024 to 256: improves per-SM warp occupancy (32→48 warps/SM) + // for large group_num (ZeRO gradient compression on production models where + // group_num >> num_SMs * blocks_per_SM). Measured 28-77% speedup at group_num + // = 87K-1.4M. For small group_num (< num_SMs) this is slower, but that is not + // the production use case. dim3 grid_dim(group_num); - dim3 block_dim(1024); + dim3 block_dim(256);Only the symmetric variant (
launch_fake_quantize_kernel) is modified. The asymmetric and stochastic-rounding variants (launch_fake_quantize_kernel_asym,launch_sr_fake_quantize_kernel,launch_sr_fake_quantize_kernel_asym) are left unchanged because they use register-cached arrays (float2 data[MAX_REG],__half2 data_low[128]) whose size depends onblockDim, and reducingblockDimcould cause register spill issues.Performance
Benchmarked on A100 with
ds_quantize_fp16, 8-bit symmetric quantization:The third case shows the largest speedup because
group_size=256means 75% of threads are idle withblockDim=1024(only 256 of 1024 threads have work), whileblockDim=256achieves 100% thread utilization.Accuracy
Existing test suite passes with no tolerance changes:
Covers fp16, 8-bit/4-bit, groups=1/16, compared against PyTorch reference implementation with
rtol=2e-2, atol=5e-3.Scope
csrc/quantization/fake_quantizer.cu— single line change inlaunch_fake_quantize_kernel