Benchmark the forward pass and optimize CPU+GPU; portable build by default #2

Merged
richiejp merged 1 commit into master from perf/bench-and-optimize-forward on Jun 26, 2026

@richiejp

Contributor

Add a CPU+GPU benchmark harness against the upstream PyTorch reference, act on
what it showed (~2.3x faster GPU, ~1.6x faster CPU), and make the portable, fast
CPU build the default for every entry point -- including a bare `cmake -B build`.

Benchmark harness

  • bench/free_splatter-bench.cpp: time free_splatter_run (warmup/iters, synthetic
    LCG or --input .f32 data), emit a machine-parseable RESULT line.
  • scripts/bench_torch.py: PyTorch perf reference (SDPA-backed attention, proper
    CUDA sync). scripts/bench.sh: orchestrate engine cpu/vulkan + torch cpu/cuda
    into one table.
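
The harness shape described above (synthetic LCG input, untimed warmup, timed iterations, one parseable RESULT line) can be sketched roughly as follows. This is an illustration, not the code in bench/free_splatter-bench.cpp: the LCG constants, the `BenchStats` struct, the helper names and the exact RESULT fields are all assumptions.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Deterministic synthetic input from a 32-bit LCG. The multiplier/increment
// pair is a commonly used one, picked here for illustration only.
std::vector<float> lcg_fill(std::size_t n, std::uint32_t seed) {
    std::vector<float> out(n);
    std::uint32_t s = seed;
    for (std::size_t i = 0; i < n; ++i) {
        s = s * 1664525u + 1013904223u;
        out[i] = static_cast<float>(s >> 8) / 16777216.0f;  // top 24 bits -> [0, 1)
    }
    return out;
}

// Hypothetical result record; the real harness may report different fields.
struct BenchStats {
    int iters;
    double mean_ms;
    double min_ms;
};

// Call fn `warmup` times untimed, then `iters` times with wall-clock timing.
template <typename Fn>
BenchStats bench(Fn&& fn, int warmup, int iters) {
    using clock = std::chrono::steady_clock;
    for (int i = 0; i < warmup; ++i) fn();
    double total = 0.0, best = 0.0;
    for (int i = 0; i < iters; ++i) {
        const auto t0 = clock::now();
        fn();
        const double ms =
            std::chrono::duration<double, std::milli>(clock::now() - t0).count();
        total += ms;
        if (i == 0 || ms < best) best = ms;
    }
    return {iters, iters > 0 ? total / iters : 0.0, best};
}

// One space-separated key=value line that a wrapper script can grep for.
std::string format_result(const std::string& backend, const BenchStats& s) {
    char buf[160];
    std::snprintf(buf, sizeof buf,
                  "RESULT backend=%s iters=%d mean_ms=%.3f min_ms=%.3f",
                  backend.c_str(), s.iters, s.mean_ms, s.min_ms);
    return buf;
}
```

In a real harness the lambda passed to `bench` would wrap the call under test (here, presumably `free_splatter_run`), and an orchestration script would collect the RESULT lines from each backend into one table.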

GPU (Vulkan): 438 -> ~193 ms per 2-view scene

  • Cast K/V to f16 on GPU so the coopmat2 (tensor-core) flash-attn path is taken;
    softmax still accumulates in f32 (GGML_PREC_F32). CPU keeps f32 K/V -- its
    tiled FA converts to f32 regardless -- so the CPU-f32 strict gate is unchanged.
  • Parallelize the host-side unshuffle + activation and drop a 48 MB copy.
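
To make the precision trade-off concrete, the sketch below emulates what "K/V stored in f16, accumulation in f32" means numerically. It does not use ggml or Vulkan; `round_to_f16_precision` and `dot_f16_keys` are hypothetical helpers, and the rounding only models f16's 11 significant bits inside its normal range (no overflow or subnormal handling).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Round a float to the 11 significant bits that IEEE binary16 keeps.
// Simplification: valid only for magnitudes inside the f16 normal range
// (roughly 6.1e-5 to 65504); overflow and subnormals are not modelled.
float round_to_f16_precision(float x) {
    if (x == 0.0f || !std::isfinite(x)) return x;
    int exp = 0;
    const float m = std::frexp(x, &exp);                      // x = m * 2^exp, 0.5 <= |m| < 1
    const float kept = std::nearbyint(std::ldexp(m, 11));     // keep 11 bits
    return std::ldexp(kept, exp - 11);
}

// Dot product of an f32 query row with a key row held at f16 precision.
// Each product is added into an f32 accumulator, so only the stored keys
// lose precision, not the running sum.
float dot_f16_keys(const std::vector<float>& q, const std::vector<float>& k) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < q.size(); ++i)
        acc += q[i] * round_to_f16_precision(k[i]);
    return acc;
}
```

Under this model each stored K/V element carries a relative error of at most 2^-11, while the accumulation itself stays at f32, which is the split the bullet above describes.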

CPU: ~19 -> ~14 s per 2-view scene

  • Enable ggml's tinyBLAS (llamafile) GEMM, which ggml ships OFF: a cache-blocked
    F16/F32 mul_mat instead of the per-row vec_dot fallback (53% -> 32% of CPU
    time). Numerically a no-op (same f32 accumulation), parity-verified layer by
    layer against the f64 reference at the strict CPU-f32 gate (ALL TAPS OK).
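
The difference between a per-row dot-product fallback and a blocked GEMM can be illustrated with a scalar sketch. This is not tinyBLAS: the 4x4 tile size, the function names and the row-major layout (A is M x K, B is N x K, so each output is a dot product of two rows) are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reference: one independent dot product per output element.
std::vector<float> matmul_rowdot(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) acc += a[i * K + k] * b[j * K + k];
            c[i * N + j] = acc;
        }
    return c;
}

// Tiled: compute a TM x TN block of outputs per pass over K, so every loaded
// element of A and B is reused across the tile instead of being re-read for
// each output. Each output still sums its products in the same k order.
std::vector<float> matmul_tiled(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t M, std::size_t N, std::size_t K) {
    constexpr std::size_t TM = 4, TN = 4;
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += TM)
        for (std::size_t j0 = 0; j0 < N; j0 += TN) {
            const std::size_t im = std::min(TM, M - i0);
            const std::size_t jn = std::min(TN, N - j0);
            float acc[TM][TN] = {};
            for (std::size_t k = 0; k < K; ++k)
                for (std::size_t i = 0; i < im; ++i)
                    for (std::size_t j = 0; j < jn; ++j)
                        acc[i][j] += a[(i0 + i) * K + k] * b[(j0 + j) * K + k];
            for (std::size_t i = 0; i < im; ++i)
                for (std::size_t j = 0; j < jn; ++j)
                    c[(i0 + i) * N + (j0 + j)] = acc[i][j];
        }
    return c;
}
```

Because tiling here only regroups which outputs are computed together, the accumulation per output is unchanged; that is the sense in which a blocked kernel with the same f32 accumulation can be numerically equivalent to the fallback.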

Portable by default

  • A preset-less `cmake -B build` now builds every CPU ISA variant (runtime
    dispatch) + tinyBLAS instead of the ~13x scalar trap (Nix strips -march=native).
    Scoped to plain CPU Release/bare builds; GPU/fuzz/debug/native opt out; every
    knob stays overridable by -D.
  • Co-locate executables with the dynamically-loaded backend .so (gated on
    GGML_BACKEND_DL) so a bare DL build finds its backends at runtime.
  • Presets: `release` is now the portable default, release-portable an alias; add
    release-native for a single-target native binary off-Nix.
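
The idea behind building every CPU ISA variant and dispatching at runtime can be sketched as "probe the CPU, then pick the most specialised variant it supports". The code below is a simplified in-process model, not ggml's backend loader: the `Variant` struct, its scores and the variant names are hypothetical, and the AVX2 probe relies on the GCC/Clang `__builtin_cpu_supports` builtin on x86 (it reports false elsewhere).

```cpp
#include <string>
#include <vector>

// Hypothetical description of one compiled CPU variant.
struct Variant {
    std::string name;  // illustrative label, e.g. "baseline" or "avx2"
    int score;         // higher means more specialised / faster
    bool supported;    // outcome of a CPU feature probe on this machine
};

// Example feature probe. Only meaningful on x86 with GCC or Clang; on any
// other target this sketch conservatively reports "not supported".
bool probe_avx2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

// Pick the highest-scoring variant the running CPU supports, or nullptr
// when none qualifies.
const Variant* pick_variant(const std::vector<Variant>& variants) {
    const Variant* best = nullptr;
    for (const Variant& v : variants)
        if (v.supported && (best == nullptr || v.score > best->score)) best = &v;
    return best;
}
```

A binary built this way stays portable (the baseline variant always qualifies) without falling back to scalar code on machines that do have wider vector units.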

README: add a Speed table (measured via scripts/bench.sh) and note the portable
default. A FREE_SPLATTER_PROFILE=1 host-phase timer is kept as a profiling aid.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

richiejp merged commit cfe0a8b into master on Jun 26, 2026
2 checks passed
richiejp added a commit that referenced this pull request on Jul 1, 2026:
Prevent the class of regression that dropped first rotation/scale, then opacity,
from the accumulated-cloud splat writer (a second copy of the encoder that drifted
from the proven single-run one).

#1 Unify: new src/splat.h::encode_splat_record is the ONE definition of the
   OpenCV->OpenGL convention, quaternion remap, opacity->alpha and byte packing.
   Both write_splat (single-run) and write_cloud_splat (cloud) now build a
   (pos,scale,quat,rgb,opacity) tuple and call it, so they cannot diverge again.
   Verified byte-identical to the previous write_splat output (pure refactor).

#2 Pin: two asset-free tests in test_pose.cpp --
   - test_splat_record: pins the encoder bytes, incl. the exact regressed field
     (opacity 0.5 -> alpha 127, NOT a forced 255) and the rotation remap.
   - test_accumulate_channels: a one-pair (T=identity) accumulation must preserve
     every gaussian channel (xyz, SH->rgb, opacity, scale, rotation, frame); a
     dropped channel fails immediately with zero fixtures.

Both guards fail on either historical bug. ctest -LE model green.
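
A single shared encoder of the kind described above might look roughly like the sketch below. It is not the code in src/splat.h: `Splat`, `unit_to_byte` and `encode_record` are hypothetical names, the 32-byte layout (3 x f32 position, 3 x f32 scale, RGBA bytes, 4 quaternion bytes) is assumed from the commonly used .splat format, and the quaternion axis remap that accompanies the convention change is deliberately left out.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical per-gaussian tuple handed to the shared encoder.
struct Splat {
    float pos[3];
    float scale[3];
    float quat[4];   // unit quaternion, component order as supplied by the caller
    float rgb[3];    // each in 0..1
    float opacity;   // 0..1
};

// Clamp to 0..1 and truncate to a byte, so 0.5 maps to 127 and 1.0 to 255.
std::uint8_t unit_to_byte(float v) {
    return static_cast<std::uint8_t>(std::min(1.0f, std::max(0.0f, v)) * 255.0f);
}

// Pack one record. Assumed layout: bytes 0-11 position, 12-23 scale,
// 24-27 RGBA, 28-31 quaternion.
std::array<std::uint8_t, 32> encode_record(const Splat& s) {
    std::array<std::uint8_t, 32> rec{};
    // OpenCV (y down, z forward) to OpenGL (y up, z backward): negate y and z.
    const float pos[3] = {s.pos[0], -s.pos[1], -s.pos[2]};
    std::memcpy(rec.data(), pos, sizeof pos);
    std::memcpy(rec.data() + 12, s.scale, sizeof s.scale);
    for (int i = 0; i < 3; ++i) rec[24 + i] = unit_to_byte(s.rgb[i]);
    rec[27] = unit_to_byte(s.opacity);  // alpha comes from opacity, never a forced 255
    // Quaternion components in [-1, 1] map to bytes centred on 128.
    // The rotation remap for the axis flip is omitted in this sketch.
    for (int i = 0; i < 4; ++i) {
        const float q = std::min(1.0f, std::max(-1.0f, s.quat[i]));
        rec[28 + i] = static_cast<std::uint8_t>(std::min(255.0f, q * 128.0f + 128.0f));
    }
    return rec;
}
```

With one function owning the byte layout, a test can pin exact bytes (for example alpha 127 for opacity 0.5) and both writers inherit the guarantee.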

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>