Benchmark the forward pass and optimize CPU+GPU; portable build by default by richiejp · Pull Request #2 · localai-org/free-splatter.cpp

richiejp · 2026-06-26T11:04:03Z

Add a CPU+GPU benchmark harness against the upstream PyTorch reference, act on
what it showed (~2.3x faster GPU, ~1.6x faster CPU), and make the portable, fast
CPU build the default for every entry point -- including a bare cmake -B build.

Benchmark harness

- bench/free_splatter-bench.cpp: time free_splatter_run (warmup/iters, synthetic
  LCG or --input .f32 data), emit a machine-parseable RESULT line.
- scripts/bench_torch.py: PyTorch perf reference (SDPA-backed attention, proper
  CUDA sync). scripts/bench.sh: orchestrate engine cpu/vulkan + torch cpu/cuda
  into one table.
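
A minimal sketch of the warmup/iters timing pattern described above, not the code in bench/free_splatter-bench.cpp. The helper names (`lcg_fill`, `bench`, `result_line`), the LCG constants, and the fields of the RESULT line are all assumptions made for illustration; the real harness times free_splatter_run and may report different fields.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: deterministic synthetic input in [0, 1) from a 32-bit
// linear congruential generator (the constants are an assumption).
inline std::vector<float> lcg_fill(std::size_t n, std::uint32_t seed) {
    std::vector<float> out(n);
    std::uint32_t s = seed;
    for (std::size_t i = 0; i < n; ++i) {
        s = s * 1664525u + 1013904223u;           // wraps modulo 2^32
        out[i] = (s >> 8) * (1.0f / 16777216.0f); // top 24 bits -> [0, 1)
    }
    return out;
}

struct BenchStats { int iters; double mean_ms; double min_ms; };

// Run fn `warmup` times untimed, then `iters` times timed with a monotonic clock.
inline BenchStats bench(const std::function<void()>& fn, int warmup, int iters) {
    assert(warmup >= 0 && iters > 0);
    using clock = std::chrono::steady_clock;
    for (int i = 0; i < warmup; ++i) fn();
    double total = 0.0, best = 0.0;
    for (int i = 0; i < iters; ++i) {
        const auto t0 = clock::now();
        fn();
        const double ms =
            std::chrono::duration<double, std::milli>(clock::now() - t0).count();
        total += ms;
        if (i == 0 || ms < best) best = ms;
    }
    return {iters, total / iters, best};
}

// Hypothetical RESULT line layout: one line of key=value pairs that a script
// can pick out of the output by its prefix.
inline std::string result_line(const std::string& backend, const BenchStats& s) {
    char buf[160];
    std::snprintf(buf, sizeof buf,
                  "RESULT backend=%s iters=%d mean_ms=%.3f min_ms=%.3f",
                  backend.c_str(), s.iters, s.mean_ms, s.min_ms);
    return buf;
}
```

Warmup runs are excluded so that one-time costs (allocation, shader or kernel compilation, cold caches) do not skew the mean, and a fixed-prefix line keeps the orchestration script's parsing trivial.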

GPU (Vulkan): 438 -> ~193 ms per 2-view scene

- Cast K/V to f16 on GPU so the coopmat2 (tensor-core) flash-attn path is taken;
  softmax still accumulates in f32 (GGML_PREC_F32). CPU keeps f32 K/V -- its
  tiled FA converts to f32 regardless -- so the CPU-f32 strict gate is unchanged.
- Parallelize the host-side unshuffle + activation and drop a 48 MB copy.

CPU: ~19 -> ~14 s per 2-view scene

- Enable ggml's tinyBLAS (llamafile) GEMM, which ggml ships OFF: a cache-blocked
  F16/F32 mul_mat instead of the per-row vec_dot fallback (53% -> 32% of CPU
  time). Numerically a no-op (same f32 accumulation), parity-verified layer by
  layer against the f64 reference at the strict CPU-f32 gate (ALL TAPS OK).
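
To illustrate the difference between a per-row dot-product fallback and a blocked GEMM, here is a small sketch. It is not tinyBLAS or ggml code: the function names, the 4x4 tile size, and the layout (K contiguous in both operands) are assumptions. The blocked version tiles only over the output rows and columns and walks K in the same order as the baseline, so each output element sees the same sequence of f32 additions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C[m*N + n] = sum over k of A[m*K + k] * B[n*K + k], accumulated in f32.

// Baseline: one independent dot product per output element.
inline void matmul_rowdot(const float* A, const float* B, float* C,
                          std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) acc += A[m * K + k] * B[n * K + k];
            C[m * N + n] = acc;
        }
}

// Blocked: compute a TM x TN tile of C per pass over K, so every value loaded
// from A or B is reused across the tile instead of being fetched once per dot.
inline void matmul_tiled(const float* A, const float* B, float* C,
                         std::size_t M, std::size_t N, std::size_t K) {
    constexpr std::size_t TM = 4, TN = 4; // illustrative tile size
    for (std::size_t m0 = 0; m0 < M; m0 += TM) {
        const std::size_t mh = std::min(TM, M - m0);
        for (std::size_t n0 = 0; n0 < N; n0 += TN) {
            const std::size_t nh = std::min(TN, N - n0);
            float acc[TM][TN] = {};
            for (std::size_t k = 0; k < K; ++k)        // same k order as baseline
                for (std::size_t i = 0; i < mh; ++i)
                    for (std::size_t j = 0; j < nh; ++j)
                        acc[i][j] += A[(m0 + i) * K + k] * B[(n0 + j) * K + k];
            for (std::size_t i = 0; i < mh; ++i)
                for (std::size_t j = 0; j < nh; ++j)
                    C[(m0 + i) * N + (n0 + j)] = acc[i][j];
        }
    }
}
```

A production kernel would additionally keep the tile in SIMD registers and block for cache; the sketch only shows why reordering the outer loops need not change the accumulation per element.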

Portable by default

- A preset-less cmake -B build now builds every CPU ISA variant (runtime
  dispatch) + tinyBLAS instead of the ~13x scalar trap (Nix strips -march=native).
  Scoped to plain CPU Release/bare builds; GPU/fuzz/debug/native opt out; every
  knob stays overridable by -D.
- Co-locate executables with the dynamically-loaded backend .so (gated on
  GGML_BACKEND_DL) so a bare DL build finds its backends at runtime.
- Presets: release is now the portable default, release-portable an alias; add
  release-native for a single-target native binary off-Nix.
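
As a rough illustration of what runtime dispatch between CPU ISA variants means, the sketch below picks the best variant the running CPU supports, falling back to a generic build. This is not ggml's backend loader: the variant names and the selection order are hypothetical, and the feature probe relies on the GCC/Clang x86 builtin `__builtin_cpu_supports`, with every other platform reporting "generic".

```cpp
#include <cassert>
#include <string>

// Hypothetical variant names, ordered from most to least capable.
inline std::string pick_cpu_variant(bool has_avx512f, bool has_avx2, bool has_sse42) {
    if (has_avx512f) return "avx512";
    if (has_avx2)    return "avx2";
    if (has_sse42)   return "sse42";
    return "generic";
}

// Probe the running CPU once at startup; a portable binary ships all variants
// and loads only the one selected here.
inline std::string detect_cpu_variant() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return pick_cpu_variant(__builtin_cpu_supports("avx512f") != 0,
                            __builtin_cpu_supports("avx2") != 0,
                            __builtin_cpu_supports("sse4.2") != 0);
#else
    return "generic"; // no probe in this sketch for non-x86 targets
#endif
}
```

The point of this design is that the binary no longer depends on -march=native at build time: a build whose compiler flags are stripped still reaches the fast kernels on capable hardware.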

README: add a Speed table (measured via scripts/bench.sh) and note the portable
default. A FREE_SPLATTER_PROFILE=1 host-phase timer is kept as a profiling aid.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


Prevent the class of regression that dropped first rotation/scale, then opacity,
from the accumulated-cloud splat writer (a second copy of the encoder that
drifted from the proven single-run one).

#1 Unify: new src/splat.h::encode_splat_record is the ONE definition of the
OpenCV->OpenGL convention, quaternion remap, opacity->alpha and byte packing.
Both write_splat (single-run) and write_cloud_splat (cloud) now build a
(pos,scale,quat,rgb,opacity) tuple and call it, so they cannot diverge again.
Verified byte-identical to the previous write_splat output (pure refactor).

#2 Pin: two asset-free tests in test_pose.cpp --
- test_splat_record: pins the encoder bytes, incl. the exact regressed field
  (opacity 0.5 -> alpha 127, NOT a forced 255) and the rotation remap.
- test_accumulate_channels: a one-pair (T=identity) accumulation must preserve
  every gaussian channel (xyz, SH->rgb, opacity, scale, rotation, frame); a
  dropped channel fails immediately with zero fixtures.

Both guards fail on either historical bug. ctest -LE model green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
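
The single-encoder idea can be sketched as follows. This is not the contents of src/splat.h: the struct, the function names, and the output layout are hypothetical, and only the colour/alpha packing is shown because the text does not spell out the position, scale, or quaternion conversions. The float-to-byte mapping truncates, an assumption chosen because it reproduces the pinned value (opacity 0.5 -> alpha 127).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical input tuple mirroring (pos, scale, quat, rgb, opacity).
struct SplatIn {
    float pos[3];
    float scale[3];
    float quat[4];
    float rgb[3];
    float opacity;
};

// Clamp to [0, 1] and truncate to a byte: 0.5 * 255 = 127.5 -> 127.
inline std::uint8_t unit_to_byte(float v) {
    const float c = std::min(1.0f, std::max(0.0f, v));
    return static_cast<std::uint8_t>(c * 255.0f);
}

// The one place where colour and opacity become bytes. Alpha comes from the
// record's own opacity and is never forced to 255.
inline std::array<std::uint8_t, 4> encode_rgba(const SplatIn& s) {
    return {unit_to_byte(s.rgb[0]), unit_to_byte(s.rgb[1]),
            unit_to_byte(s.rgb[2]), unit_to_byte(s.opacity)};
}

// Both writers are thin loops around the shared encoder, so a field cannot be
// dropped from one path while surviving in the other.
inline void write_single_sketch(const SplatIn& s, std::vector<std::uint8_t>& out) {
    const auto px = encode_rgba(s);
    out.insert(out.end(), px.begin(), px.end());
}

inline void write_cloud_sketch(const std::vector<SplatIn>& cloud,
                               std::vector<std::uint8_t>& out) {
    for (const SplatIn& s : cloud) write_single_sketch(s, out);
}
```

A test in the spirit of test_splat_record would then assert the exact bytes for a known record, so that a writer which silently substitutes a constant alpha fails without needing any model assets.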

richiejp merged commit cfe0a8b into master Jun 26, 2026
2 checks passed

richiejp mentioned this pull request Jun 29, 2026

feat: Fuse multiple splats from video #3

Merged

richiejp merged 1 commit into master from perf/bench-and-optimize-forward
