Skip to content

Add bench-proxy: Rust supervisor + reverse proxy with /meta snapshot#1

Open
wokalski wants to merge 88 commits into
masterfrom
bench-proxy
Open

Add bench-proxy: Rust supervisor + reverse proxy with /meta snapshot#1
wokalski wants to merge 88 commits into
masterfrom
bench-proxy

Conversation

@wokalski
Copy link
Copy Markdown
Contributor

@wokalski wokalski commented Apr 30, 2026

Summary

  • Replaces bin/smoke-* with a single Rust binary bench-proxy --listen <addr> --config <yaml>. The proxy spawns the engine on a kernel-assigned random loopback port that only it knows, so by construction every client request to /v1/* and /meta describes the same process — no "different config than what we benchmarked" drift.
  • /meta is a frozen JSON snapshot collected post-warmup. It captures: command argv/env/cwd read from /proc/<pid>/{cmdline,environ,cwd} (observed, not synthesized); the full descendant process tree with per-PID env (catches per-rank CUDA_VISIBLE_DEVICES divergence on TP workers); host/CPU/memory/THP/governor/NUMA tunables; nvidia-smi summary + raw -q -x XML; lshw; engine version; an exhaustive sha256 walk of every file under <venv>/{bin,lib,lib64} with a single merkle_sha256 rollup; and a deduped loaded_libs union of /proc/*/maps across the process tree, sha256'd, with versions parsed from filenames.
  • Child env is fully sanitized — env_clear() + only what the YAML's env: lists, no implicit inheritance. \$VAR/\${VAR} interpolate from the proxy's runtime env (and \$REPO_ROOT is auto-injected) so configs are portable across nix store hashes. Secrets tagged as { secret: <value> } appear in /meta as { redacted: true, sha256: ... } so two runs can be diffed without exposing the value.
  • SSE passthrough is hyper-direct (no collect, no buffering) for low-jitter streaming.

Schema (excerpt)

description: ...                # required
engine: vllm | sglang | trt-llm
cmd: vllm
args: [serve, openai/gpt-oss-120b, --tensor-parallel-size, \"2\", ...]
cwd: vllm                       # optional, repo-root-relative
venv: vllm/.venv                # optional; drives the venv snapshot
warmup: 3                       # optional; default 3, 0 to skip
env:
  PATH: \$REPO_ROOT/vllm/.venv/bin:\$PATH
  CUDA_HOME: \$CUDA_HOME
  HF_TOKEN: { secret: \$HF_TOKEN }

args may not contain --host or --port; the proxy injects them.

Test plan

  • cargo build --release --manifest-path proxy/Cargo.toml clean, no warnings
  • Validation rejects --port in args, empty description, unset \$VAR
  • Smoke against a Python mock OpenAI server (single + multi-PID): /healthz, /meta 503-then-200, command.env exactly matches YAML (no inherited LANG/USER/TERM), SSE chunks visible progressively
  • Secret redaction: HF_TOKEN: { secret: ... } surfaces as { redacted, sha256 }; plaintext absent from logs
  • Real vllm 0.19.1 + Qwen2.5-0.5B on H100 NVL (configs/vllm-smoke.yaml):
    • 3-PID process_treevllm (APIServer), python (mp helper), VLLM::EngineCore — comm-renaming captured
    • loaded_libs = 164 entries including libcudnn.so.9, libnccl.so.2, libcublas.so.12, libtorch_cuda.so from venv site-packages, plus libcuda.so.570.172.08 and libnvidia-ml.so.570.172.08 from /usr/lib/x86_64-linux-gnu (vast-mounted host driver) — versions parsed from filenames
    • venv_snapshot = 84,131 files, 21.0 GB, single merkle hash
    • warmup.elapsed_ms = [350, 193, 192] — clear cold→warm shape, exactly the lazy-load capture we want pre-snapshot
    • 5 concurrent chat completions served correctly
    • SIGINT → child exits 0, no orphans
  • Real run against vllm-gpt-oss-120b.yaml / sglang-gpt-oss-120b.yaml / trt-llm-gpt-oss-120b.yaml once weights are cached

Known follow-ups (not in this PR)

  • If the proxy's listen port is already bound (e.g. by a stale instance), proxy::serve logs the EADDRINUSE and the spawned task ends, but main keeps running and "publishes" the snapshot to no one. Should propagate the bind failure and exit non-zero.
  • TTFT-against-real-engine is too noisy as a proxy-overhead check (engine variance dominates: 3 keep-alive runs of 500 samples gave +4647 / +34 / −264 µs deltas). For a real overhead measurement, hit a no-compute upstream or use 10k+ samples.

🤖 Generated with Claude Code

wokalski and others added 30 commits May 1, 2026 18:34
Replaces bin/smoke-* with `bench-proxy --listen X --config <yaml>`. The
proxy spawns the engine on a random loopback port (only it knows), so
all client traffic to /v1/* and /meta is forced through one process. The
child env is fully sanitized (env_clear() + only the YAML's `env:`); $VAR
interpolation pulls from the proxy's runtime env so configs stay
portable across nix store hashes. Secrets are tagged `{secret: ...}` and
appear in /meta as `{redacted, sha256}`.

After warmup (default 3 chat completions, forces lazy CUDA/cuDNN/NCCL
dlopens), the proxy walks the descendant process tree from /proc, unions
loaded_libs across all PIDs, sha256s every file in the venv, and freezes
/meta with: command read from /proc/<pid>/{cmdline,environ,cwd}, gpus
(curated nvidia-smi summary + raw -q -x), lshw, host tunables (THP,
governor, NUMA), per-PID env, and a venv merkle hash.

Verified end-to-end against vllm 0.19.1 + Qwen2.5-0.5B on H100: clean
spawn, /meta is 18MB with all sections populated (3-PID tree, 164
loaded libs incl. libnccl/libcudnn/libcuda, 84k venv files, 21GB
merkle-hashed), SSE streams chunk-by-chunk through the proxy, SIGINT
shutdown is clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y_bench.sh

Engines (vllm 0.19, sglang 0.5.10, trt-llm 1.2) ship torch wheels that
do not bundle libcudart.so.12 / libnvrtc.so.12 — they expect to dlopen
them at runtime. Without ${cudaToolkit}/lib on NIX_LD_LIBRARY_PATH,
`vllm --help` fails immediately with `ImportError: libcudart.so.12`.
The CUDA driver lib (driverLib) provides libcuda.so but NOT libcudart;
those are toolkit-side. Single-line fix in the ldPath helper covers
all three devshells.

bin/verify_bench.sh wraps proxy spawn + ready-poll + bench_latency.py
+ toks.py + teardown into a single shell command that prints one
median tok/s number on stdout. Used as the autoresearch Verify cmd.

trt-llm/uv.lock created by `uv sync --extra driver-r595`; the install
itself errors on mpi4py 4.0.1 + Python 3.12 (separate issue), but the
lockfile is reproducible and worth checking in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… on B200

First runnable YAML for this single-B200 host (the bookmark configs/
target a different multi-GPU cluster — see feedback memory). Conservative:
enforce-eager (no CUDA graphs), no spec decode, no FP8 KV cache, no MXFP4
explicit override. max-num-seqs=1 hard-pins single-stream operation in
the engine. Subsequent iterations will deviate one knob at a time.

bench_results.tsv: tracked at repo root.
.bench_runs/: gitignored, holds per-iteration logs + generated bench.json.
configs/.current_experiment: gitignored loop pointer to active YAML.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
flashinfer JIT-compiles fp4_quantization.cu the first time vllm boots
on a new SM (B200 = sm_100a) and shells out to nvcc — fails with
`fatal error: cublasLt.h: No such file or directory`. Same path for
cudnn.h, cusparse.h, etc. nixpkgs cuda packages are multi-output: the
default `out` is near-empty, real content is in `lib`/`include`/`static`/
`stubs`. The previous symlinkJoin only pulled `out`.

Switch to a generated paths list that pulls every declared output of
every cuda package we need (cuda_nvcc, cudart, cccl, nvrtc, nvjitlink,
cublas, cusparse, cusolver, curand, cufft, cudnn). Verified after
rebuild: cublasLt.h, cudnn.h, cusparse.h, etc. resolve under
\$CUDA_HOME/include; libcublasLt.so, libcudnn.so, etc. under
\$CUDA_HOME/lib.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…torch wheels)

vllm 0.19 / sglang 0.5.10 / trt-llm 1.2 all install torch from the cu130
PyTorch wheel index. The wheels bundle a complete cu13 stack under
<venv>/lib/python3.12/site-packages/nvidia/cu13/{lib,include} (libcudart.so.13,
libcublasLt.so.13, headers) plus cudnn.so.9 separately. Pulling cu12.9
toolkit libs into NIX_LD_LIBRARY_PATH (previous "fix") meant flashinfer's
runtime JIT linked against libcudart.so.12 while torch loaded
libcudart.so.13 — ABI mismatch.

Revert cudaToolkit to nvcc-only (cu12.9 nvcc compiles cu13 headers fine,
forward-compatible). Add cu13EnvHook that prepends the venv's cu13/lib +
cudnn/lib + nccl/lib to NIX_LD_LIBRARY_PATH (runtime), LIBRARY_PATH (link),
and CPATH (compile). Each devshell calls cu13EnvHook with its own engine
dir name so vllm/sglang/trt-llm shells each get their own venv's paths.

YAMLs gain LIBRARY_PATH + CPATH passthroughs in env so the proxy can
forward shellHook-set values into the engine subprocess (proxy uses
env_clear() and only forwards what the YAML names).

Verified inside `nix develop .#vllm`: libcudart.so.13, libcublasLt.so.13,
libcudnn.so.9 all resolve from venv first; cublasLt.h, cudnn.h from
cu13/include + cudnn/include. Host /usr/lib/x86_64-linux-gnu has only
driver libs (libcuda.so + libnvidia-ml.so), no toolkit libs — confirmed
that approach is unavailable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…is cu13

The vllm 0.19.1 wheel ships vllm/_C.abi3.so with DT_NEEDED libcudart.so.12,
while torch 2.10.0+cu130 (pulled by the b200 extra) ships libcudart.so.13
under nvidia/cu13/lib. Both must resolve at runtime — different SONAMEs,
no conflict — but the previous "remove cu12 from LD path" change broke
vllm import.

Fix:
- Keep cu12 nix toolkit (libcudart.so.12) on NIX_LD_LIBRARY_PATH for
  vllm._C dlopen.
- Put venv cu13/lib FIRST on NIX_LD_LIBRARY_PATH and LIBRARY_PATH so
  flashinfer's JIT links -lcudart against cu13 (matches torch ABI).
- CPATH is cu13 only — JIT preprocessor never sees cu12 headers.

Per-shell version declaration via cudaEnvHook engineDir cuMajor:
- cudaMajorRuntime = "12" (vllm._C)
- cudaMajorJit     = "13" (torch + flashinfer JIT)
The hook hard-fails if either libcudart.so.{12,13} or cublasLt.h missing,
and prints a "✓ <engine>: cu13 + cu12 verified" line on success — both
per user's request to make versions explicit.

shellHook walks up from PWD to find flake.nix, so `nix develop` from
the repo root or any subdir resolves the same venv path.

Also includes the small bin/verify_bench.sh tweak adding $MODEL env knob
so smoke runs against a different served model can reuse the harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… cu13

flashinfer/jit/cpp_ext.py hard-codes the link command:
  c++ ... -L${CUDA_HOME}/lib64 -L${CUDA_HOME}/lib64/stubs -lcudart -lcuda
Linker searches -L paths in order, first match wins. Previous attempt
included cuda_cudart in cudaToolkit, so libcudart.so.12 lived under
${CUDA_HOME}/lib — ld picked cu12 and produced a JIT .so DT_NEEDED
libcudart.so.12, which then conflicted with torch's loaded
libcudart.so.13 at runtime.

Split:
- cudaToolkit (= ${CUDA_HOME}) carries cuda_nvcc + cuda_cccl ONLY.
  No libs at all under lib/. flashinfer's -L${CUDA_HOME}/lib64 finds
  nothing, so -lcudart falls through to LIBRARY_PATH where cu13 wins.
- cu12RuntimeLibs = lib.getOutput "lib" cuda_cudart — separate path
  with libcudart.so.12 only. Mounted on NIX_LD_LIBRARY_PATH (after the
  cu13 segments) so vllm._C.abi3.so's runtime dlopen of libcudart.so.12
  still resolves. Never on LIBRARY_PATH or -L flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tokio Command::new used execvp-style lookup against the launcher's PATH,
making the YAML's env.PATH advisory only. The devshells deliberately don't
put .venv/bin on PATH (uv-managed pythons are nix-ld-loaded), so `cmd: vllm`
was failing ENOENT on a fresh checkout.

Walk the YAML's child env.PATH for cmds without a slash; pass through
absolute or relative paths unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…2 shims

Three flake-level fixes to make iter-0 actually produce a tok/s number on
this host (the prior baseline at 939b749 hit a flashinfer JIT failure):

1. Repo-local cache: LLM_CACHE_ROOT defaults to <repo>/.cache (autodetected
   by walking up to flake.nix), instead of $HOME/.cache/llm-benchmark.
   Override by exporting before `nix develop`. .cache/ added to .gitignore.

2. Unversioned libcudart.so: cu13 venv ships only libcudart.so.13 (and
   libcublas.so.13 etc.). ld -lcudart needs the unversioned symlink, which
   the flake comment claimed would "fall through" to LIBRARY_PATH but
   actually didn't because cu13 has no unversioned link to fall through to.
   cudaEnvHook now creates the missing symlinks idempotently for the
   cudart/cublas/cublasLt/cusparse/cusolver/cufft/curand/cufile/cupti/
   nvJitLink/nvrtc family.

3. cu13TypedefShim: cu13's cudaTypedefs.h dropped the unversioned `PFN_X`
   macro aliases that cu12 carried (e.g. PFN_cuTensorMapEncodeTiled). The
   bundled cutlass in flashinfer 0.6.6 references those unversioned forms,
   so cu13 headers alone produce ~94 nvcc errors. Build a shim include dir
   whose cudaTypedefs.h #include_next's cu13's real header then re-defines
   each missing alias (extracted verbatim from cu12 cudart). Put FIRST on
   CPATH so it shadows cu13's cudaTypedefs.h only for that one header;
   everything else still falls through to cu13.

bench_results.tsv: iter-0 baseline = 97.3754 tok/s @ batch=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…baseline)

Drop --enforce-eager so vllm captures CUDA graphs at BS=1. Per-step
kernel-launch + Python scheduler overhead disappears, which is the
dominant cost at BS=1 for small generation steps.

Result: 223.67 tok/s vs 97.38 baseline (+126.29 tok/s, +129.7%).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…proxy

Local commits got new SHAs after rebasing onto remote 1d2bbdd + 723c747.
Iter-0 result moves to 1ad4da0 (the flake-fix commit it was measured at);
iter-1 to 8e55e5e (the cuda-graphs experiment commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ok/s (-40%)

Added --speculative-config method=ngram k=5 window=[2,4] on top of CUDA
graphs. Hypothesis: structured JSON outputs from queries.csv (intent
classifier on a Polish bank's call center) would yield high prompt-lookup
hit rates. In practice the result fell to 133.61 tok/s vs 223.67 iter-1.

Almost certainly because vllm's spec-decode path doesn't share the BS=1
CUDA graph captured for plain decode — each verify pass takes the eager
path, paying back per-step launch overhead. Acceptance rate × overhead
was net-negative.

DISCARD. Keep YAML as the discard bookmark; revert pointer to iter-1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switched --kv-cache-dtype fp8 on top of CUDA graphs. BS=1 decode is
memory-bound on KV reads, so halving KV byte width should be a clear
win — but result is 221.57 vs iter-1's 223.67 (Δ -0.94%, well inside
the N=50 median noise floor).

Two plausible reasons for the null result:
  1. With --max-num-seqs 1 and --max-model-len 4096 the absolute KV
     bandwidth at BS=1 is small enough that the dequant overhead in
     the attention kernel cancels the bandwidth saving.
  2. gpt-oss-120b is MoE — only ~2 of N experts active per token, so
     weight bandwidth dominates over KV bandwidth at BS=1. Halving KV
     bandwidth doesn't move the needle.

Either way: equal results + extra config = DISCARD per the autoresearch
"simpler wins" rule. YAML kept as a recorded discard bookmark.

Also fixes verify_bench.sh: find for result.json now retries 5x500ms
because in practice find returned empty immediately after python exited
even though result.json appeared on disk a moment later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(1.66x iter-1)

Set VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1, switching the MoE backend from
FLASHINFER_TRTLLM_MXFP4_BF16 (default for sm_100a) to FLASHINFER_TRTLLM_MXFP4_MXFP8.
Intermediate activations now compute in FP8 instead of BF16, hitting B200's
native FP8 tensor cores. At BS=1 the MoE matmul is the dominant cost (only
~4 of N experts active per token, weight bandwidth dwarfs everything else),
so this is exactly where to spend a knob.

Result: 371.21 tok/s vs 223.67 iter-1 (+147.54, +66.0%). Cumulative speedup
over iter-0 baseline = 3.81x.

KEEP. iter-1 cuda-graphs YAML stays as the secondary best.

Also pre-stages iter-5 (vllm-b200-bs1-fusions.yaml: MXFP8 MoE +
compilation pass fusions) and fixes verify_bench.sh result.json discovery
to parse `run dir:` from bench.log instead of `find` — find was returning
empty after python exited even with 5x500ms retry, an overlayfs dirent
visibility quirk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… -2.17%)

Enabled fuse_norm_quant + fuse_act_quant + fuse_attn_quant on top of iter-4
(mxfp8-moe). Hypothesis: fewer kernel launches per decode step. Result is
363.16 vs iter-4's 371.21 — Δ -2.17%, within noise but slightly negative.

Possible reasons it didn't help:
- The fused quant ops weren't the dominant cost; MoE matmul still owns
  the step time.
- Inductor's fused kernels for these specific quant patterns may be less
  optimized than the un-fused vendor kernels.

DISCARD per "simpler wins on equal results."

Verify_bench.sh fix landed — first iteration where the script printed the
tok/s number to its own stdout (run dir parsing replaced flaky find).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KV-fp8 still doesn't help even with mxfp8-moe shifting the bottleneck.
At BS=1 with max-num-seqs=1 the absolute KV bandwidth is small enough
that dequant overhead in the attention kernel cancels the byte savings.
DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VLLM_USE_FLASHINFER_SAMPLER=1 swaps the per-token sampler for flashinfer's
kernel. Result is 371.13 vs iter-4 371.21 — Δ -0.022%, statistically
identical. Sampling is well off the critical path at BS=1 / 50-query
median; the MoE matmul dwarfs everything else.

DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…22%)

Limited cudagraph_capture_sizes to [1] only. Hypothesis: only the BS=1
graph is ever used at --max-num-seqs 1, so dropping the other 83 captures
saves a couple GiB and skips the per-step graph-size dispatch.

Result Δ -3.22% — slightly worse. vLLM has both PIECEWISE and FULL
cudagraph slots; restricting capture sizes apparently doesn't simplify
the dispatch path enough to win, and may force some graphs into the
piecewise path. DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Disabled chunked-prefill + prefix-caching + log-stats. Hypothesis: each
adds per-step Python work and we don't benefit from any of them at
single-stream BS=1. Result Δ -2.20% — slightly worse.

Likely the chunked-prefill code path is the better-optimized one even
when prefill is small, and the per-step accounting/stats overhead is
already negligible vs the MoE matmul. DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… -32.6%)

Swapped TRTLLM-gen for CUTLASS variant of the same FP8-activation MoE
backend. Result is dramatically worse: 250.07 vs 371.21 (Δ -32.6%).

CUTLASS likely picks tile shapes tuned for larger BS or different sm
generations; TRTLLM-gen's tile selection is the right one for
sm_100a + BS=1 here. DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…1%, errors)

Hypothesized smaller KV pool / smaller cuda graph workspace might give a
small win. Two independent reasons to discard:

1. Δ -2.01% vs iter-4 — slight regression, possibly because the cuda
   graph FULL-decode capture for max=4096 vs max=2048 doesn't actually
   change at BS=1 (only the per-stream KV pool size changes), so we lose
   the speed/quality tradeoff with no benefit.

2. 5 of 50 prompts overflow at 2K tokens (HTTP 400). The median is now
   computed over fewer queries — not directly comparable to iter-4. AND
   the system can no longer serve the full distribution.

DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ncompat)

Set VLLM_USE_OINK_OPS=1 to use Blackwell-tuned RMSNorm. Engine init dies
with TypeError: unhashable type: 'list' inside vllm/logger.py warning_once
— Oink emits a warning whose args aren't hashable for the dedup cache.

Not a meaningful comparison; logging as a crash with no tok/s number.
Would need a vllm 0.19.x bump or an Oink-side fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, noise)

VLLM_USE_DEEP_GEMM_E8M0=0 forces DeepGEMM's alternate FP32-scale path
instead of the default Blackwell E8M0 FP8 scale exponent format. Result
is 373.20 vs iter-4's 371.21 — Δ +0.54%, well within the ~1-3% N=50
median CV. Borderline.

Could KEEP for the 0.54% but per autoresearch's simpler-wins discipline
on equal results, DISCARD: iter-4 is one fewer env knob carrying the
same workload to within measurement noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… compat)

Two unblocking fixes for getting sglang and trt-llm to start at all on
this host:

1. flake: cu12RuntimeLibsExtra symlinks together cuda_nvrtc + libcublas +
   libcusparse + libcusolver + libcurand + libcufft + cuda_cupti from
   nixpkgs cudaPackages, all cu12. Joined onto NIX_LD_LIBRARY_PATH so
   sglang's bundled triton kernels (which were compiled against cu12)
   can dlopen them at runtime. Kept *outside* CUDA_HOME so JIT linker
   `-lcublas` etc still falls through to cu13 (LIBRARY_PATH).

2. trt-llm/pyproject.toml: bump mpi4py 4.0.1 → >=4.1.1. mpi4py 4.0.1's
   build_ext calls `distutils.log.warning` which is missing under
   setuptools >=80 / Py 3.12; 4.1.1 dropped the call. Without this `uv
   sync --extra driver-r595` fails inside the trt-llm devshell.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ine reference)

First sglang result on this host. TP=1, --attention-backend trtllm_mha
(sglang's auto-pick for GptOssForCausalLM on sm100), --cuda-graph-max-bs 1,
--max-running-requests 1, --mem-fraction-static 0.85.

Result: 343.41 tok/s vs vLLM iter-4's 371.21 — vLLM wins by 8.1%. SGLang's
baseline already enables the MXFP4 trtllm-gen MoE backend + cuda graphs by
default, so this is roughly the apples-to-apples comparison vs the vLLM
best, not vs the vLLM iter-0 baseline (97.38).

KEEP as the sglang reference. Spec decoding (EAGLE3) next.

Three flake fixes were needed to get sglang to start:
- cu12 runtime libs bundle (cuda_nvrtc + libcublas/cusparse/cusolver/curand/
  cufft/cupti) joined onto NIX_LD_LIBRARY_PATH for sgl_kernel's bundled
  triton kernels (prior commit a5364b5)
- CCCL include via CUDA_HOME on CPATH so sglang's tvm-ffi JIT compile
  finds <nv/target> when cu13's cuda_fp16.h pulls it in
- attention-backend changed from fa3 (sm80-90 only) to trtllm_mha

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
….32 (noise)

Per the scout report, FLASHINFER_MOE_BACKEND chooses among throughput/
latency/masked_gemm cubins for FlashInfer MoE. Set to 'latency' on top
of iter-4. Result Δ +0.57% — within noise.

Logs confirm: 'Using FLASHINFER_TRTLLM_MXFP4_MXFP8 Mxfp4 MoE backend' —
the MXFP4_MXFP8 selector path takes precedence over the BACKEND env var
when the model is MXFP4-quantized, so the latency env knob is effectively
a no-op here. DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added --speculative-config method=eagle3 + RedHatAI/gpt-oss-120b-speculator.eagle3
draft, k=3, on top of iter-4. EAGLE3 is the model-based draft (much
higher acceptance than ngram) but result is the same flavor of
regression as iter-2 ngram: -33%. The spec verify path in vLLM 0.19.1
appears not to share the BS=1 FULL cuda graph captured for plain decode,
so each verify step pays back per-step launch overhead.

DISCARD. Try sglang's EAGLE3 next (scout reports it's more graph-friendly
via SGLANG_ENABLE_SPEC_V2 + chain-verify with eagle_topk=1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…3.93%)

vLLM 0.19.1 → 0.20.1rc1.dev129+ga3ec4a35f, flashinfer 0.6.6 →
0.6.8.post1, torch 2.10.0+cu130 → 2.11.0+cu130. pyproject.toml now:
  - depends on vllm>=0.20.1rc1.dev0 routed via the
    https://wheels.vllm.ai/nightly index in tool.uv.sources
  - drops the explicit flashinfer-python / flashinfer-cubin pins so
    they flow from vllm's own pin (its wheel pins specific versions
    that change per dev build)
  - drops the explicit torch / torchvision / torchaudio version pins
    for the same reason; torch is still routed through cu130 index

Re-bench of iter-4's mxfp8-moe YAML on the new stack: 385.80 tok/s vs
371.21 on 0.19.1 — Δ +3.93%, well above the ~1-2% N=50 median noise
floor. Cumulative speedup over iter-0 baseline now 3.96x.

Where the gain comes from is hard to attribute to any one knob — the
flashinfer 0.6.8 trtllm-gen MoE kernels likely got tuned cubins, torch
2.11 has a slightly faster cuda graph capture path, and vllm 0.20 dev
includes async-scheduling work. They all moved together; we get one
number.

YAML unchanged from iter-4. KEEP. Try sglang/trtllm nightlies next.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… BEST (+0.71%)

trt-llm 1.2.1 → 1.3.0rc13 (uv resolved up from `>=1.3.0rc9`). cu130 torch
2.10.0+cu130 (trt-llm 1.3 still pins torch 2.10, vllm went to 2.11). New
baseline YAML configs/trt-llm-b200-bs1-baseline.yaml runs trtllm-serve
under mpirun -np 1 — without an MPI universe trt-llm 1.3 errors at engine
init in MPI_Comm_spawn even at TP=1.

Result: 388.55 tok/s vs vLLM nightly's 385.80 — Δ +0.71%, within the N=50
noise floor but TRT-LLM lands as the cross-engine new best out of the box
(no per-engine knob tuning yet). Cumulative 3.99x over iter-0 baseline.

Three small infra fixes were needed to unblock:
- trt-llm/pyproject.toml: bump tensorrt-llm pin to >=1.3.0rc9, drop torch*
  pins (1.3 pins them transitively), keep nvidia-ml-py per-driver
- YAML: wrap trtllm-serve with `mpirun -np 1 --allow-run-as-root` so MPI_
  Comm_spawn has a real universe
- bench_latency.py: strip the non-OpenAI `provider` field from queries.csv
  payloads. vLLM/sglang silently accept unknown fields; trt-llm uses
  strict pydantic and 400s any extras (caught 50/50 errors first run).
- YAML: point at the local HF snapshot path instead of model name. trt-llm
  1.3 was requesting a different revision than what's cached and trying
  to download all 14 safetensors fresh (60+GB) — instant ENOSPC.

KEEP. Try EAGLE3 + PDL (per scout: NVIDIA's published BS=1 recipe) next.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NVIDIA's published BS=1 EAGLE3 recipe: nvidia/gpt-oss-120b-Eagle3-short-context
draft, max_draft_len=3, disable_overlap_scheduler=true, cuda_graph
max_batch_size=4 (verify shape with k=3), TRTLLM_ENABLE_PDL=1 for
kernel-launch overlap. Loaded extras via --extra_llm_api_options.

Result: 225.37 tok/s vs trt-llm baseline 388.55 — Δ -42.0%. The third
engine to show the same pattern (vLLM iter-16: -33%, sglang iter-17:
-25.5%). Speculative decoding at BS=1 for gpt-oss-120b on B200 is
fundamentally not a win across any tested engine — the MoE matmul cost
dominates so much that the extra verify-pass overhead exceeds the
savings from accepting drafted tokens, even with EAGLE3's high
acceptance length.

DISCARD. trt-llm baseline (iter-18) remains the cross-engine best.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
wokalski and others added 30 commits May 1, 2026 18:24
… -38.9% vs iter-47)

VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 forces the oracle to
FLASHINFER_CUTLASS_MXFP4_BF16 — a monolithic CUTLASS MoE with
persistent scheduling. The other Hopper-reachable MXFP4 backend
besides Marlin and Triton.

196.09 tok/s — basically tied with iter-46 baseline (197.84) and
crushed by Marlin (320.78). The path JIT-compiles 181 sm90 CUTLASS
kernels via ninja on first run (~15 min cold, then cached). Even
after warmup the BS=1 single-token decode doesn't fill those big
TmaWarpSpecializedPingpong kernels. Marlin's smaller W4A16 tile is
just better at this regime. DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (Δ -0.6% vs iter-47)

The Recipe's fuse_allreduce_rms pass routes through FlashInfer's
trtllm_fused_allreduce_norm kernel — Blackwell-tuned and the
documented win on B200. Hypothesis: at H100 TP=2 NVLink with
single-token allreduces, the symm-mem setup overhead may not amortize.

318.94 tok/s — Δ -0.6% vs iter-47. Pure noise. The pass is a no-op
either way at this regime; keep it on so we match the published
recipe without measurable cost. DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…acceptance

Targeted the official vllm-project/recipes GPT-OSS_EAGLE3_Hopper.yaml
config: nvidia/gpt-oss-120b-Eagle3-v2 + k=3. v2's HF config correctly
sets eagle_config.use_aux_hidden_state=False (head uses last layer
output, not aux states like v1).

Two vllm 0.20.1rc1.dev129 bugs blocked this iter:

1. AOT COMPILE CACHE KEY BUG. After Eagle3-v1 ran in iter-47 (which
   sets aux_hidden_state_layers to 3 default layers), the compiled
   GptOssModel forward graph baked in `return (x, [aux,aux,aux])`.
   The compile cache key didn't hash aux_hidden_state_layers, so when
   v2 (use_aux=False) was loaded next, the same cache hit returned a
   graph that emitted the tuple — but the v1 model_runner code path
   for use_aux_hidden_state_outputs=False expected outputs to be a
   tensor and crashed with `hidden_states[logit_indices_device]` →
   IndexError: tuple index out of range.

   Confirmed by editing gpu_model_runner.py source (changes file hash
   → invalidates AOT cache) — fresh compile then correctly returned
   a tensor. Cache key needs to include aux config / use_aux flag.

2. Eagle3-v2 ACCEPTANCE IS BROKEN even after cache fix:
   Mean acceptance length: 1.01
   Per-position acceptance rate: 0.007, 0.000, 0.000
   Avg Draft acceptance rate: 0.2%
   Drafted 1248 tokens, target accepted 3.

   v2 must need a different hidden state than vllm 0.20's
   pass_hidden_states_to_model wires up. Unusable here despite
   being the recipe's recommended draft.

Final tok/s with the cache-fixed v2 path: 153.80 — worse than no
spec at all. Document and move on; v1 + k=2 (iter-47) is the H100
champion. Both bugs worth filing upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
….3% vs iter-47)

Isolation test: same Eagle3 head as iter-47 (which works), just k=2 → k=3.
Tells us whether the iter-50 crash was about k=3 alone or v2 alone.

310.25 tok/s — runs cleanly with k=3 (so the iter-50 bug really was the
v2 model interacting with a stale AOT cache). But k=3 underperforms k=2:

  Mean acceptance length: 2.08 (k=2: ~2.0)
  Per-position acceptance: 0.591, 0.314, 0.170
  Third position only 17% accept rate

Verify cost on three drafted tokens grows faster than the marginal
acceptance gain at position 3. Same shape as B200 iter-26/27 for
trt-llm Eagle3 (k=1: 438, k=2: 478, k=3: 462, k=4: 435). k=2 is the
peak across both engines and both platforms for this workload.

DISCARD; k=2 stays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tried to push Marlin onto Hopper FP8/INT8 tensor cores via
VLLM_MARLIN_INPUT_DTYPE. Both crash:

  fp8: ValueError "Marlin W4A8-FP8 only support SM89 or SM120 device
       (It is slower than Marlin W4A16 on other devices). You can
       consider using W4A8-INT8 instead(set VLLM_MARLIN_INPUT_DTYPE=int8)."
       H100 is SM90, not in the allowlist.

  int8: RuntimeError "MXFP4 weight + INT8 activation is not supported."
        The error message in the fp8 case is misleading — INT8 doesn't
        compose with MXFP4 weights either.

The W4A8 fast paths exist (SM89 Ada, SM120 Blackwell-RTX), just not on
H100. Default activation dtype (bf16) is what we get and it's fine.
DISCARD; iter-47 config remains the H100 best.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…0.9% vs iter-47)

VLLM_USE_FLASHINFER_SAMPLER=1 — FlashInfer's fused sampler kernel.
Hypothesis: spec decode does multi-step sampling (1 verify + k draft
tokens), so a faster sampler should compound.

317.94 tok/s — Δ -0.9%. Pure noise. Same outcome as B200 iter-7
(371.13 vs 371.21 baseline). Sampler is off the critical path
regardless of platform; the MoE GEMM dominates wall time at BS=1.

DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…vs iter-47)

VLLM_USE_NCCL_SYMM_MEM=1 — defaults False. NCCL symmetric-memory
allreduce can be faster than ring on NVLink. Hypothesis: TP=2
H100x2 NVSwitch path benefits.

318.74 tok/s — noise. The custom_all_reduce path vllm uses already
saturates the NVLink for single-token TP=2 allreduces; symm-mem
adds setup overhead with no payload to amortize. DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…vs iter-47)

Calibration test: how much of the H100x2 throughput is Marlin (the
kernel) vs torch.compile + cudagraphs (the launch-overhead reduction).

70.79 tok/s with --enforce-eager (no compile, no cudagraphs) on top
of Marlin. The compile/cudagraph stack is a **4.5x multiplier** on
top of whichever MoE kernel sits underneath:

  iter-47 (Marlin + compile + cudagraphs): 320.78
  iter-55 (Marlin + eager):                 70.79

At BS=1 the per-kernel-launch overhead is enormous relative to the
work each kernel does; cudagraphs fold all of that out. Plus
custom_ops:[+rms_norm] and the inductor fusions handle the small
ops between MoE blocks where Marlin isn't running.

So our +62% iter-46 → iter-47 win was Marlin replacing Triton at
the BIG kernel, on top of an already-4.5x-amplified base. Both
layers are load-bearing; iter-47 is the right ship target.

DISCARD; calibration record only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iters 46-55 (the H100x2 sweep) were committed with TBD-N placeholders
in the commit column. Backfilling the actual short SHAs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-GPU calibration of the iter-47 winner. Same Marlin MoE path,
EAGLE3 nvidia k=2, full Recipe compile block, just --tp 1 with
CUDA_VISIBLE_DEVICES=0.

  iter-47 (TP=2): 320.78 tok/s   (160.4 per GPU)
  iter-56 (TP=1): 246.13 tok/s   (246.1 per GPU)
  TP=2 / TP=1   : 1.30x          (vs ideal 2x)

Per-GPU efficiency is 53% better at TP=1. Classic memory-bandwidth-
bound MoE behavior: at BS=1 the bottleneck is reading expert weights,
and TP-splitting halves each rank's read but adds NVLink allreduce
overhead per layer. The crossover only pays when the allreduce cost
< the bandwidth savings.

Spec acceptance ~48% on both — speculative decoding behaves
identically across TP factors.

KEEP — useful single-GPU baseline for cross-engine comparisons
(see iters 57-59 for trt-llm at the same platform).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… 246)

Direct port of B200 iter-26 winner (TRT-LLM EAGLE3 default draft k=2,
TP=1, PDL on, cuda_graph_config batch_sizes=[1,3], disable_overlap_
scheduler) to H100. No code changes besides the model snapshot path.

141.73 tok/s — 57% of vllm iter-56's 246.13 on the exact same
hardware/workload. Root cause: trt-llm 1.3.0's default MoE on
Hopper MXFP4 falls back to TritonFusedMoE (the TRTLLM kernel
backend is Blackwell-only — see release notes for 1.3.x). vllm has
a hand-written Marlin W4A16 kernel for SM90 that ships in the wheel
(VLLM_MXFP4_USE_MARLIN); trt-llm has no equivalent in 1.3.0.

  Hopper MXFP4 MoE (BS=1):
    vllm Marlin       : 246.13 tok/s   (iter-56)
    trt-llm Triton    : 141.73 tok/s   (this iter)
    trt-llm CUTLASS   : 105.36 tok/s   (iter-58 — regression)

The B200 → H100 reversal is real: at B200 trt-llm and vllm tied at
~478 tok/s (iter-26 / iter-38) because both used FlashInfer
TRTLLM-gen MXFP4_MXFP8 kernels with native FP4 hardware. On Hopper
without FP4 hardware, the kernel landscape differs and vllm Marlin
has no trt-llm counterpart.

KEEP — cross-engine baseline at H100 TP=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… → 105.36 (Δ -25.7%)

Tried to force CUTLASS MoE on the assumption (per research) that
trt-llm 1.3 has a Hopper-native CUTLASS MXFP4 grouped-GEMM that
beats Triton. It doesn't — at least not with autotune cold-cache.

105.36 tok/s, 25.7% worse than iter-57 default Triton. The proxy log
shows the smoking gun on every step:

  [AutoTuner] trtllm::fused_moe::gemm1 using the fallback tactic,
              due to cache miss on input shapes=...
  [AutoTuner] trtllm::fused_moe::gemm2 using the fallback tactic,
              due to cache miss on input shapes=...

The CUTLASS path needs a populated autotune cache to hit good
kernels; with cache misses it uses a generic fallback that's slower
than Triton's already-tuned kernel. The default Triton path doesn't
have this problem because its kernels are autotuned at JIT time.

There may be a way to pre-populate the cache (offline tuning run, or
extending warmup), but iter-57 default is already the right baseline
for H100 trt-llm — DISCARD.

Also adds configs/trt-llm-eagle3-extra-h100.yaml with the CUTLASS
+ stream_interval=10 + batch_sizes=[1,2,3,4] + enable_block_reuse=
false additions per the research agent's recipe — kept for record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lm 320.78)

Tested whether TRT-LLM's ONESHOT/MIN_LATENCY allreduce on H100x2
NVSwitch closes the gap with vllm at TP=2. It scales better than
vllm in ratio but loses on absolute throughput.

  Scaling factor TP=1 → TP=2:
    vllm + Marlin    : 246.13 → 320.78  (1.30x)
    trt-llm + Triton : 141.73 → 203.50  (1.44x)

TRT-LLM's allreduce IS more efficient at small payloads (1.44 vs
1.30), but the win is dwarfed by the 73% TP=1 deficit caused by
the missing Marlin-equivalent kernel. Final cross-engine score
on this hardware/workload:

  H100x2 BS=1 gpt-oss-120b EAGLE3:
    vllm + Marlin    : 320.78 tok/s   (iter-47, NEW H100 BEST)
    trt-llm + Triton : 203.50 tok/s   (this iter, -36.6%)

Reverses the B200 picture (where iter-26 trt-llm 477.67 and
iter-38 vllm 477.84 tied within noise). The crossover comes from
different MXFP4 kernel availability per platform: B200 has
hardware FP4 + FlashInfer TRTLLM-gen MXFP4_MXFP8 that both engines
plug into; H100 has no FP4 hardware and only vllm shipped a
hand-written Marlin path that maps well to bf16 dequant.

KEEP — final cross-engine number on H100x2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-engine H100/H100x2 sweep (vllm + trt-llm). TBD-N placeholders
in iters 56-59 → actual short SHAs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…TON → 200.51 (Δ noise)

Set moe_config.backend=TRITON explicitly to see if it differs from
the implicit default-fallback in iter-59. It doesn't — same legacy
TritonFusedMoE class, same OAI Triton matmul_ogs kernel, same
warning, same number within noise (200.51 vs 203.50).

Source-level confirmation in
tensorrt_llm/_torch/modules/fused_moe/create_moe.py:82-83 —
"TRITON" maps directly to TritonFusedMoE (legacy class). The
ConfigurableMoE wrapper only accepts TRTLLMGenFusedMoE and
CuteDslFusedMoE per line 418, neither of which works for MXFP4.
Same warning "TritonFusedMoE is not supported by ConfigurableMoE"
fires on both iter-59 (no setting) and iter-60 (explicit TRITON).

DISCARD; iter-59 is the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… vs iter-59)

Per examples/models/core/gpt_oss/README.md: "the TRITON backend
doesn't support tensor parallelism (TP), only expert parallelism
(EP) and AlltoAll operations". Tried tp_size=2 + moe_expert_
parallel_size=2 to unlock the (alleged) ConfigurableMoE TRITON
path that should activate when EP is used.

Result: 188.95 tok/s, 7.1% worse than iter-59 default. Same
"TritonFusedMoE is not supported by ConfigurableMoE" warning fires.
The setting falls through to legacy TritonFusedMoE AND the EP
AllToAll routing adds overhead vs straight TP.

CODE READ confirms there's no path forward in trt-llm 1.3.0rc9 for
faster MXFP4 MoE on Hopper:

  fused_moe/create_moe.py — only viable backend selectors:
    "CUTLASS" → CutlassFusedMoE   (autotune cache miss → fallback,
                                    iter-58 = 105 tok/s)
    "TRITON"  → TritonFusedMoE    (OAI matmul_ogs, used here = 200)
    "TRTLLM"  → TRTLLMGenFusedMoE (Blackwell-only, fails on SM90)
    "DEEPGEMM"/"CUTEDSL"/"DENSEGEMM"/"WIDEEP"  → all gated to
                                                   fp8/nvfp4 quants
                                                   (won't bind to
                                                    MXFP4 weights)

  fused_moe_triton.py — TritonFusedMoE imports
    `from triton_kernels.matmul_ogs import matmul_ogs`
    so the OAI kernels ARE active (no separate install needed —
    triton_kernels package ships with the trt-llm wheel).

  ConfigurableMoE in create_moe.py:411-425 only wraps
    TRTLLMGenFusedMoE and CuteDslFusedMoE; everything else falls
    back to legacy. So all MXFP4 paths on Hopper end at the same
    OAI Triton kernel that vllm uses by default — and which vllm
    Marlin is the alternative to.

There is no Marlin-equivalent W4A16 fast path in trt-llm 1.3.0rc9
for Hopper MXFP4. iter-59 (203.50) is the trt-llm ceiling on this
hardware; vllm + Marlin (320.78, iter-47) is +57.7% faster
because vllm shipped a hand-written kernel that trt-llm doesn't
have.

DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…9.36 (noise)

The official vllm-project/recipes GPT-OSS_EAGLE3_Hopper.yaml ships
--max-num-batched-tokens 8192 (vs vllm default 4096 = max-model-len).
With EAGLE3 k=2 each step's max_num_scheduled_tokens is bounded by
this; iter-46 logs warned the default was undersized.

319.36 tok/s vs iter-47's 320.78 — Δ -0.4%, pure noise. The warning
about being undersized was benign for our BS=1 + max-num-seqs=1
workload. iter-47 stays the H100x2 best.

DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-warm) → 154.69

iter-58 (TP=1 CUTLASS) regressed to 105.36 tok/s with the smoking-gun
warning "[AutoTuner] using fallback tactic, due to cache miss". Tested
whether bumping warmup 3 → 20 lets the AutoTuner populate the cache
and recover. Also bumped to TP=2 to compare apples-to-apples against
iter-59 (Triton TP=2, 203.50).

154.69 tok/s — meaningful recovery vs iter-58 (+46.8%) but still
24% behind Triton TP=2:

  iter-58 (TP=1, CUTLASS, warmup=3) :  105.36   (cache miss → fallback)
  iter-63 (TP=2, CUTLASS, warmup=20):  154.69   (cache populated, 56 entries)
  iter-59 (TP=2, Triton, warmup=3)  :  203.50   (OAI matmul_ogs, no autotune cache issues)
  iter-47 (vllm Marlin TP=2)        :  320.78   ← H100x2 ceiling

AutoTuner populated 56 cache entries successfully:
  [Autotuner] Cache size after warmup is 56

So the cache miss issue IS resolvable with longer warmup. But CUTLASS
on H100 MXFP4, even fully warm, is fundamentally slower than the OAI
Triton matmul_ogs path. CUTLASS-on-Hopper for MXFP4 in trt-llm 1.3.0
is just not as well tuned as the Triton implementation.

This closes out the "can we compile/tune our way to faster trt-llm
on H100" investigation. The honest answer: no path in 1.3.0rc13.
trt-llm Hopper ceiling = iter-59 (203.50). vllm's +57.7% lead is a
kernel availability gap (Marlin), not a tuning gap.

DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes cross-engine loop on H100x2 BS=1 gpt-oss-120b EAGLE3:

  vllm + Marlin (iter-47)          : 320.78 tok/s   ⭐ winner
  sglang + fa3 + EAGLE3 (this iter): 243.10 tok/s   2nd
  trt-llm + Triton MoE (iter-59)   : 203.50 tok/s   3rd

sglang slots between vllm and trt-llm. Reuses the same OAI-Triton
matmul_ogs MoE kernel as trt-llm but appears to have a tighter
runtime around it.

Setup pain: the sglang devshell in this flake assumes the cu13 wheel
layout (nvidia/cu13/{include,lib}/) like vllm does, but sglang[all]
==0.5.10.post1 pins torch==2.9.1+cu129 which uses the per-component
layout (nvidia/{cublas,cuda_runtime,curand,...}/{include,lib}/). To
get the engine through cudagraph capture I had to:

  1. uv sync --extra h100 (sets up cu129 torch)
  2. Override CPATH in YAML to enumerate all 15 nvidia/*/include dirs
     so tvm_ffi/flashinfer JIT compiles find cuda_runtime.h, curand.h,
     cublas_v2.h, etc.
  3. Override LIBRARY_PATH similarly so the linker sees libcudart,
     libcurand, libcublas, etc.
  4. Bulk-symlink .so.X -> .so in each nvidia/*/lib so ld.bfd resolves
     -l<name> directly (the wheels ship versioned .so.X only).

Other notable settings:
  --attention-backend fa3 (FlashAttention 3; trtllm_mha is sm_100+ only)
  --speculative-num-steps 2 + --speculative-num-draft-tokens 3 + topk 1
  --cuda-graph-max-bs 4

KEEP — third engine on the H100x2 leaderboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…caling matrix

Single-GPU sglang baseline. mem_fraction_static had to be raised
0.85 → 0.93; default OOMs at TP=1 because gpt-oss-120b MXFP4 +
Eagle3 draft + KV barely fits in one 80GB H100.

Now all three engines have TP=1 and TP=2 numbers, full matrix:

  engine             TP=1     TP=2    scaling
  vllm + Marlin      246.13   320.78  1.30x
  sglang + fa3       232.88   243.10  1.04x  ⚠ flat
  trt-llm + Triton   141.73   203.50  1.44x

Surprise finding: sglang's TP scaling is effectively flat at this
workload. Going TP=1 → TP=2 gains 4% — the FlashInfer allreduce
fusion overhead at single-token decode payloads nearly cancels the
bandwidth-halving gain from sharding the model across 2 GPUs.
vllm and trt-llm both scale much better.

Per-GPU efficiency picture:
  vllm TP=1 :  246.13 per GPU  (best per-GPU)
  sglang TP=1: 232.88 per GPU
  vllm TP=2 :  160.4 per GPU
  trt-llm TP=1: 141.73 per GPU
  sglang TP=2: 121.5 per GPU
  trt-llm TP=2: 101.8 per GPU

vllm wins at every parallelism factor and per-GPU. Ship iter-47.

KEEP — completes the TP scaling matrix for all 3 engines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop cudaEnvHook/mkShell/ldPath helpers; inline each shell so vllm /
sglang / trt-llm can be read top-to-bottom. Also prune what each shell
doesn't use: vllm + trt-llm no longer pull cu12Extras (sglang-only per
comment), trt-llm drops cu13TypedefShim (no flashinfer JIT). Shared
data derivations (cudaToolkit, cu12Cudart, cu12Extras, cu13TypedefShim,
cacheHook, findRoot, cu13SymlinkLoop) stay in let.

nix flake check passes for all three shells.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pin pkgs.cudaPackages_12_9 (was pkgs.cudaPackages, an alias that
could float to a different minor on nixpkgs upgrade). 12.9 matches
the cu12 versions h100/h200 venvs bundle (cudart 12.9.79, cublas
12.9.1.4, nvrtc 12.9.86 — see uv.lock).

Drop two dead abstractions:
- lib.concatMap (p: map (o: p.${o}) p.outputs) over [cuda_nvcc cuda_cccl]:
  both packages are single-output, so this loop expanded to just
  [cuda_nvcc cuda_cccl].
- lib.getOutput "lib" cuda_cudart: cuda_cudart is also single-output,
  so getOutput falls back to out. Use cu12Cudart = cuda12.cuda_cudart
  directly. cu12Extras switches from getOutput to the .lib accessor.

Tighten cu13TypedefShim's bash: the tempfile + while-read + per-line
echo loop becomes one grep | awk pipe. Verified output identical
(1349 lines, same #ifndef-guarded macro list).

Trim verbose prose comments — keep load-bearing WHYs (cu12 must not
beat cu13 for `-lcudart`, CUDA_HOME has no libs deliberately, why the
PFN_X aliases need restoring), drop historical narrative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sthrough

Verified end-to-end on B200 + r595 driver: bench-proxy spawns each
engine, /meta returns 200, curl /v1/chat/completions returns PONG.

- vllm-smoke.yaml: add CPATH=$CPATH to env (was missing). Without it
  flashinfer JIT couldn't find cuda_runtime.h from cu13 venv, killing
  EngineCore at kernel_warmup. b200-baseline.yaml already had this;
  the smoke yaml was just stale.
- sglang-smoke.yaml: new — Qwen2.5-0.5B-Instruct, mem-fraction-static
  0.50, full env passthrough including MPICC.
- trt-llm-smoke.yaml: new — uses `trtllm-serve serve <model>` (1.3.0rc13
  added the `serve` subcommand; baseline yaml predates it and elides it,
  presumably relying on a default-subcommand shim that no longer works).
  mpirun -np 1 wrapper required even at TP=1 for mpi4py executor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…NEEDs

Checked every engine venv with objdump -p to enumerate cu12 DT_NEEDs.
Three pieces of the flake turned out to be no-ops:

1. cu13SymlinkLoop: tried to symlink lib<x>.so → lib<x>.so.13 for 13
   libs (cudart, cublas, cusparse, …). Audited against current vllm /
   sglang / trt-llm venvs: every venv either ships the unversioned .so
   already (cudart, cublas, cublasLt, cupti, nvJitLink, nvrtc) or has
   no .so.13 to link to (the cu12-version libs cusparse/cusolver/cufft
   /curand/cufile, plus cudart_static which is a .a). Result: zero
   symlinks ever created. Drop the helper entirely.

2. cu12Cudart for vllm: vllm 0.20.1rc1's _C.abi3.so DT_NEEDs
   libcudart.so.13 (was libcudart.so.12 in 0.19.x — comment was stale).
   Grep shows zero .so anywhere in the vllm venv DT_NEEDs libcudart.so.12.
   Drop the verification line + the NIX_LD entry. sglang and trt-llm
   keep cu12Cudart (sgl_kernel and torchao still DT_NEED libcudart.so.12).

3. cu12Extras prune: the bundle had nvrtc, cublas, cusparse, cusolver,
   curand, cufft, cupti. Of those, sgl_kernel only DT_NEEDs nvrtc + cublas
   /cublasLt. The other five live inside the venv's own nvidia/cu13/lib/
   (torch's cu130 wheel still ships libcufft.so.12, libcusparse.so.12 etc
   despite the cu13 prefix), and our shellHook puts the venv path before
   ${cu12Extras}/lib on NIX_LD_LIBRARY_PATH so the venv copies always win.
   Pulling them from nixpkgs was wasted store space. Trim to nvrtc + cublas.

Re-ran all three engines end-to-end (bench-proxy → /v1/chat/completions
→ "PONG!") on B200 + r595 driver after the prune. nix flake check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surveyed DT_NEEDs across all three venvs' .so files. Hits:
  libstdc++ / libgcc_s — torchaudio, outlines_core, others
  libz — cudnn, opencv
  libssl / libcrypto — opencv's ffmpeg pieces
  liblzma — pillow

Zero hits across all venvs:
  libffi  — uv-managed python-build-standalone statically links it
  libglib — no wheel here links glib
  libncurses — no wheel here links ncurses

Drop glib / libffi / ncurses; keep stdenv.cc.cc.lib / zlib / openssl / xz.

Verified vllm smoke (PONG via bench-proxy) on B200 + r595 with the
trimmed list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ened

trt-llm's torchao/_C.abi3.so DT_NEEDs libcudart.so.12, but checking
/proc/self/maps after a successful trtllm-serve chat completion shows
neither libcudart.so.12 nor any torchao .so module is mapped — the
torchao Python modules tensorrt_llm imports don't actually load the
compiled extension during a serve loop.

Re-ran trt-llm smoke (bench-proxy → /v1/chat/completions → PONG) on
B200 + r595 with cu12Cudart removed from NIX_LD_LIBRARY_PATH and the
verification line gone. Engine starts and serves cleanly.

Sweeps that did NOT pan out:
- Drop cu13TypedefShim from sglang: still fails with the same cutlass
  cuda_host_adapter.hpp PFN_cuTensorMapEncodeTiled error vllm hits
  (sglang's flashinfer 0.6.7.post3 has the same bundled cutlass).
- libffi/glib/ncurses: already trimmed in dc60242.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant