perf: revert tile-count pass to single-splat-per-thread #8603

Merged: mvaligursky merged 1 commit into main from mv-tile-count-single-splat on Apr 15, 2026


@mvaligursky
Contributor

Reverts the multi-splat-per-thread (N=8) tile-count strategy introduced in #8594, which caused significant register pressure and performance regression on NVIDIA GPUs.
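To make the two strategies concrete, here is a minimal CPU-side sketch in Python. It is not the shader from this PR: the `(x, y, radius)` splat layout, the `tiles_touched` helper, and the contents of the per-thread array are assumptions made purely for illustration. The only thing taken from the description is the shape of the work split: with N=8 each thread handles eight splats and keeps eight intermediate values live at once, while with N=1 each thread handles one splat and keeps one.

```python
def tiles_touched(splat, tile_size):
    """Count screen tiles overlapped by a splat's bounding square.

    `splat` is a hypothetical (x, y, radius) tuple in pixels; no clamping
    to the screen is done in this sketch.
    """
    x, y, r = splat
    x0, x1 = int((x - r) // tile_size), int((x + r) // tile_size)
    y0, y1 = int((y - r) // tile_size), int((y + r) // tile_size)
    return (x1 - x0 + 1) * (y1 - y0 + 1)


def count_single(splats, tile_size):
    """N=1: simulated thread i handles splat i and holds one live value."""
    return [tiles_touched(s, tile_size) for s in splats]


def count_multi(splats, tile_size, n=8):
    """N=n: each simulated thread handles n consecutive splats.

    `local` stands in for the per-thread register array: all n values
    stay live until the write-back loop, which is what raises register
    pressure on a real GPU.
    """
    counts = [0] * len(splats)
    num_threads = (len(splats) + n - 1) // n  # ceiling division
    for t in range(num_threads):
        base = t * n
        local = [0] * n
        for k in range(n):
            i = base + k
            if i < len(splats):
                local[k] = tiles_touched(splats[i], tile_size)
        for k in range(n):
            i = base + k
            if i < len(splats):
                counts[i] = local[k]
    return counts


# Illustrative input only: three made-up splats on a 16-pixel tile grid.
demo_splats = [(8, 8, 4), (16, 16, 4), (40, 40, 10)]
print(count_single(demo_splats, 16))
print(count_multi(demo_splats, 16))
```

Both functions return the same counts. The difference is only in how much per-thread state is alive at once, which a CPU simulation cannot measure but a GPU register allocator has to pay for.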

Changes:

Performance:
Benchmarked at 17M splats:

| GPU | tileCount (N=8) | tileCount (N=1) | Total (N=8) | Total (N=1) |
| --- | --- | --- | --- | --- |
| NVIDIA RTX 2070 | 11.3 ms | 5.4 ms | 26.6 ms | 20.8 ms |
| Apple M4 | 4.3 ms | 2.5 ms | 10.2 ms | 8.3 ms |

The tile-count pass is 2.1x faster on the NVIDIA RTX 2070 and 1.7x faster on the Apple M4 with N=1. The register-array approach increased register pressure, which reduced occupancy on both architectures; NVIDIA was affected more severely.
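The speedup figures can be re-derived from the benchmark numbers with a few lines of Python. The timings below are copied from the table. The `resident_threads` function is a deliberately crude occupancy model added for intuition only: its register-file size, thread cap, and the per-thread register counts passed to it are placeholder assumptions, not values measured for this shader.

```python
# Timings in milliseconds, copied from the benchmark table (17M splats).
timings_ms = {
    "NVIDIA RTX 2070": {"tile_n8": 11.3, "tile_n1": 5.4,
                        "total_n8": 26.6, "total_n1": 20.8},
    "Apple M4": {"tile_n8": 4.3, "tile_n1": 2.5,
                 "total_n8": 10.2, "total_n1": 8.3},
}


def speedup(before_ms, after_ms):
    """Ratio of old time to new time; values above 1 mean faster."""
    return before_ms / after_ms


for gpu, t in timings_ms.items():
    tile = speedup(t["tile_n8"], t["tile_n1"])
    total = speedup(t["total_n8"], t["total_n1"])
    print(f"{gpu}: tile-count {tile:.1f}x, total {total:.2f}x")


def resident_threads(regs_per_thread, regfile_size=65536, max_threads=1024):
    """Toy occupancy model: threads that fit in one register file.

    All defaults are illustrative placeholders. Real hardware allocates
    registers at coarser granularity and has other limits as well.
    """
    return min(max_threads, regfile_size // regs_per_thread)


# Hypothetical register counts, chosen only to show the trend:
# more registers per thread means fewer threads resident at once.
for regs in (32, 64, 128):
    print(f"{regs} regs/thread -> {resident_threads(regs)} resident threads")
```

The tile-count ratios round to the 2.1x and 1.7x quoted above. In the toy model, once per-thread register use passes the point where the register file, rather than the thread cap, is the limit, every further doubling halves the number of resident threads, which is the mechanism the description attributes the regression to.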

The multi-splat-per-thread (N=8) strategy from #8594 caused significant
register pressure, degrading occupancy on both NVIDIA and Apple GPUs.

Benchmarks at 17M splats show tile-count is 2.1x faster on RTX 2070 and
1.7x faster on M4 with N=1.
@mvaligursky mvaligursky self-assigned this Apr 15, 2026
@mvaligursky mvaligursky merged commit f5659c3 into main Apr 15, 2026
8 checks passed
@mvaligursky mvaligursky deleted the mv-tile-count-single-splat branch April 15, 2026 12:04