Benchmark the forward pass and optimize CPU+GPU; portable build by default #2
Merged
Add a CPU+GPU benchmark harness against the upstream PyTorch reference, act on
what it showed (~2.3x faster GPU, ~1.6x faster CPU), and make the portable, fast
CPU build the default for every entry point -- including a bare
`cmake -B build`.

Benchmark harness
- bench/free_splatter-bench.cpp: time free_splatter_run (warmup/iters, synthetic
  LCG or --input .f32 data), emit a machine-parseable RESULT line.
- scripts/bench_torch.py: PyTorch perf reference (SDPA-backed attention, proper
  CUDA sync). scripts/bench.sh: orchestrate engine cpu/vulkan + torch cpu/cuda
  into one table.

GPU (Vulkan): 438 -> ~193 ms per 2-view scene
- Cast K/V to f16 on GPU so the coopmat2 (tensor-core) flash-attn path is taken;
  softmax still accumulates in f32 (GGML_PREC_F32). CPU keeps f32 K/V -- its
  tiled FA converts to f32 regardless -- so the CPU-f32 strict gate is unchanged.
- Parallelize the host-side unshuffle + activation and drop a 48 MB copy.

CPU: ~19 -> ~14 s per 2-view scene
- Enable ggml's tinyBLAS (llamafile) GEMM, which ggml ships OFF: a cache-blocked
  F16/F32 mul_mat instead of the per-row vec_dot fallback (53% -> 32% of CPU
  time). Numerically a no-op (same f32 accumulation), parity-verified layer by
  layer against the f64 reference at the strict CPU-f32 gate (ALL TAPS OK).

Portable by default
- A preset-less `cmake -B build` now builds every CPU ISA variant (runtime
  dispatch) + tinyBLAS instead of the ~13x scalar trap (Nix strips -march=native).
  Scoped to plain CPU Release/bare builds; GPU/fuzz/debug/native opt out; every
  knob stays overridable by -D.
- Co-locate executables with the dynamically-loaded backend .so (gated on
  GGML_BACKEND_DL) so a bare DL build finds its backends at runtime.
- Presets: `release` is now the portable default, release-portable an alias; add
  release-native for a single-target native binary off-Nix.

README: add a Speed table (measured via scripts/bench.sh) and note the portable
default. A FREE_SPLATTER_PROFILE=1 host-phase timer is kept as a profiling aid.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
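
The harness shape described under "Benchmark harness" (untimed warmup, timed iterations, deterministic LCG input, one machine-parseable RESULT line) could look roughly like the sketch below. It is not the code in bench/free_splatter-bench.cpp: the LCG constants, the `BenchStats` struct, the helper names, and the exact RESULT field layout are all assumptions made for illustration, and the timed callable stands in for a call to free_splatter_run.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Deterministic synthetic input from a 32-bit LCG, mapped to floats in [0, 1).
// The multiplier/increment are the common Numerical Recipes constants, chosen
// here as an assumption; the PR does not state which LCG it uses.
static void fill_lcg(std::vector<float>& buf, uint32_t seed) {
    uint32_t s = seed;
    for (float& v : buf) {
        s = s * 1664525u + 1013904223u;
        v = static_cast<float>(s >> 8) * (1.0f / 16777216.0f);  // top 24 bits
    }
}

// Hypothetical summary of one benchmark run.
struct BenchStats {
    int iters;
    double mean_ms;
    double min_ms;
};

// Call fn `warmup` times untimed, then `iters` times under a monotonic clock.
template <typename Fn>
BenchStats bench(Fn&& fn, int warmup, int iters) {
    assert(iters > 0);
    using clock = std::chrono::steady_clock;
    for (int i = 0; i < warmup; ++i) fn();
    double total = 0.0, best = 0.0;
    for (int i = 0; i < iters; ++i) {
        const auto t0 = clock::now();
        fn();
        const double ms =
            std::chrono::duration<double, std::milli>(clock::now() - t0).count();
        total += ms;
        if (i == 0 || ms < best) best = ms;
    }
    return {iters, total / iters, best};
}

// One line a wrapper script can grep for. The field layout is hypothetical.
static std::string format_result(const char* backend, const BenchStats& s) {
    char line[160];
    std::snprintf(line, sizeof line,
                  "RESULT backend=%s iters=%d mean_ms=%.3f min_ms=%.3f",
                  backend, s.iters, s.mean_ms, s.min_ms);
    return line;
}
```

A driver would fill an input buffer with `fill_lcg`, pass a lambda wrapping the forward pass to `bench`, and print `format_result(...)` so that an orchestrating script can collect one row per backend. On a GPU backend the timed callable has to block until the device work has finished, which is the same concern the PR notes for the PyTorch side ("proper CUDA sync").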
richiejp added a commit that referenced this pull request on Jul 1, 2026
Prevent the class of regression that dropped first rotation/scale, then opacity,
from the accumulated-cloud splat writer (a second copy of the encoder that
drifted from the proven single-run one).

#1 Unify: new src/splat.h::encode_splat_record is the ONE definition of the
OpenCV->OpenGL convention, quaternion remap, opacity->alpha and byte packing.
Both write_splat (single-run) and write_cloud_splat (cloud) now build a
(pos,scale,quat,rgb,opacity) tuple and call it, so they cannot diverge again.
Verified byte-identical to the previous write_splat output (pure refactor).

#2 Pin: two asset-free tests in test_pose.cpp --
- test_splat_record: pins the encoder bytes, incl. the exact regressed field
  (opacity 0.5 -> alpha 127, NOT a forced 255) and the rotation remap.
- test_accumulate_channels: a one-pair (T=identity) accumulation must preserve
  every gaussian channel (xyz, SH->rgb, opacity, scale, rotation, frame); a
  dropped channel fails immediately with zero fixtures.

Both guards fail on either historical bug. ctest -LE model green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
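
The "one encoder, two writers" idea from this commit can be sketched as below. This is not the contents of src/splat.h: the 32-byte record layout (pos, scale, RGBA bytes, quaternion bytes), the `Gaussian` struct, and the names `encode_record`, `write_single` and `write_cloud` are assumptions for illustration, and the OpenCV->OpenGL flip and quaternion remap are deliberately left out because the commit message does not spell them out. The only behaviour taken from the source is that opacity 0.5 must encode to alpha 127 rather than a forced 255.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-gaussian tuple handed to the encoder.
struct Gaussian {
    float pos[3];
    float scale[3];
    float quat[4];   // (w, x, y, z), each component in [-1, 1]
    float rgb[3];    // each in [0, 1]
    float opacity;   // in [0, 1]
};

// Assumed layout: 12 B pos + 12 B scale + 4 B RGBA + 4 B quaternion.
constexpr std::size_t kRecordBytes = 32;

// [0, 1] -> byte with truncation, so 0.5 maps to 127 (127.5 truncated).
static std::uint8_t unit_to_byte(float v) {
    const float c = std::min(1.0f, std::max(0.0f, v));
    return static_cast<std::uint8_t>(c * 255.0f);
}

// [-1, 1] -> byte centred on 128, clamped to the byte range.
static std::uint8_t snorm_to_byte(float v) {
    const float c = std::min(255.0f, std::max(0.0f, v * 128.0f + 128.0f));
    return static_cast<std::uint8_t>(c);
}

// The single place that knows the byte packing.
static std::array<std::uint8_t, kRecordBytes> encode_record(const Gaussian& g) {
    std::array<std::uint8_t, kRecordBytes> out{};
    std::memcpy(out.data(), g.pos, sizeof g.pos);
    std::memcpy(out.data() + 12, g.scale, sizeof g.scale);
    for (int i = 0; i < 3; ++i) out[24 + i] = unit_to_byte(g.rgb[i]);
    out[27] = unit_to_byte(g.opacity);  // alpha comes from opacity, never forced
    for (int i = 0; i < 4; ++i) out[28 + i] = snorm_to_byte(g.quat[i]);
    return out;
}

// Single-run writer: one record per gaussian.
static void write_single(const std::vector<Gaussian>& gs,
                         std::vector<std::uint8_t>& out) {
    for (const Gaussian& g : gs) {
        const auto rec = encode_record(g);
        out.insert(out.end(), rec.begin(), rec.end());
    }
}

// Cloud writer: iterates frames, but packs bytes through the same encoder.
static void write_cloud(const std::vector<std::vector<Gaussian>>& frames,
                        std::vector<std::uint8_t>& out) {
    for (const auto& frame : frames) write_single(frame, out);
}
```

Because both writers reach the byte layout only through `encode_record`, a test that pins the encoder's output for one known tuple covers both paths, which is the role the commit assigns to test_splat_record.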