Add bench-proxy: Rust supervisor + reverse proxy with /meta snapshot by wokalski · Pull Request #1 · dialohq/llm-benchmark

wokalski · 2026-04-30T17:13:23Z

Summary

Replaces bin/smoke-* with a single Rust binary bench-proxy --listen <addr> --config <yaml>. The proxy spawns the engine on a kernel-assigned random loopback port that only it knows, so by construction every client request to /v1/* and /meta describes the same process — no "different config than what we benchmarked" drift.
/meta is a frozen JSON snapshot collected post-warmup. It captures: command argv/env/cwd read from /proc/<pid>/{cmdline,environ,cwd} (observed, not synthesized); the full descendant process tree with per-PID env (catches per-rank CUDA_VISIBLE_DEVICES divergence on TP workers); host/CPU/memory/THP/governor/NUMA tunables; nvidia-smi summary + raw -q -x XML; lshw; engine version; an exhaustive sha256 walk of every file under <venv>/{bin,lib,lib64} with a single merkle_sha256 rollup; and a deduped loaded_libs union of /proc/*/maps across the process tree, sha256'd, with versions parsed from filenames.
Child env is fully sanitized: env_clear() + only what the YAML's env: lists, no implicit inheritance. $VAR/${VAR} interpolate from the proxy's runtime env (and $REPO_ROOT is auto-injected) so configs are portable across nix store hashes. Secrets tagged as { secret: <value> } appear in /meta as { redacted: true, sha256: ... } so two runs can be diffed without exposing the value.
SSE passthrough is hyper-direct (no collect, no buffering) for low-jitter streaming.
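
The PR does not spell out how merkle_sha256 is derived from the per-file hashes. A minimal Python sketch of one plausible scheme, assuming the rollup is a sha256 over the sorted (relative path, file sha256) pairs; the function names and the rollup encoding are illustrative, not taken from the Rust source:

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Stream a file through sha256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def rollup_sha256(root: Path) -> tuple[dict[str, str], str]:
    """Hash every regular file under root, then fold the results into one digest.

    Assumption: the rollup is sha256 over "relpath\\0digest\\n" records in
    sorted path order, so it does not depend on directory walk order.
    """
    entries: dict[str, str] = {}
    for p in root.rglob("*"):
        if p.is_file() and not p.is_symlink():
            entries[p.relative_to(root).as_posix()] = file_sha256(p)
    h = hashlib.sha256()
    for rel, digest in sorted(entries.items()):
        h.update(f"{rel}\0{digest}\n".encode())
    return entries, h.hexdigest()
```

Two snapshots can then be compared by the single rollup first, falling back to a per-file diff of `entries` only when the rollups differ.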

Schema (excerpt)

description: ...                # required
engine: vllm | sglang | trt-llm
cmd: vllm
args: [serve, openai/gpt-oss-120b, --tensor-parallel-size, "2", ...]
cwd: vllm                       # optional, repo-root-relative
venv: vllm/.venv                # optional; drives the venv snapshot
warmup: 3                       # optional; default 3, 0 to skip
env:
  PATH: $REPO_ROOT/vllm/.venv/bin:$PATH
  CUDA_HOME: $CUDA_HOME
  HF_TOKEN: { secret: $HF_TOKEN }

args may not contain --host or --port; the proxy injects them.
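
The config rules above (no --host/--port in args, $VAR interpolation that rejects unset variables, and secret redaction in /meta) can be sketched in Python as follows. This is an illustration under assumptions, not the proxy's Rust code: `interpolate`, `check_args`, and `build_child_env` are hypothetical names, and the sha256 is assumed to be taken over the interpolated secret value.

```python
import hashlib
import re

# Matches $VAR and ${VAR}.
_VAR = re.compile(r"\$(?:\{(\w+)\}|(\w+))")


def interpolate(value: str, runtime_env: dict[str, str]) -> str:
    """Expand $VAR / ${VAR} from runtime_env; an unset variable is an error."""
    def sub(m: re.Match) -> str:
        name = m.group(1) or m.group(2)
        if name not in runtime_env:
            raise ValueError(f"unset variable: ${name}")
        return runtime_env[name]
    return _VAR.sub(sub, value)


def check_args(args: list[str]) -> None:
    """Reject --host/--port (bare or --flag=value form); the proxy injects them."""
    for a in args:
        if a.split("=", 1)[0] in ("--host", "--port"):
            raise ValueError(f"args may not contain {a!r}")


def build_child_env(env_spec: dict, runtime_env: dict[str, str]):
    """Return (child_env, meta_env): the child sees plaintext, /meta sees a hash."""
    child, meta = {}, {}
    for key, raw in env_spec.items():
        if isinstance(raw, dict) and "secret" in raw:
            value = interpolate(str(raw["secret"]), runtime_env)
            digest = hashlib.sha256(value.encode()).hexdigest()
            meta[key] = {"redacted": True, "sha256": digest}
        else:
            value = interpolate(str(raw), runtime_env)
            meta[key] = value
        child[key] = value
    return child, meta
```

Because the child env is built only from `env_spec`, nothing from the launcher's environment leaks in unless the YAML names it.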

Test plan

cargo build --release --manifest-path proxy/Cargo.toml clean, no warnings
Validation rejects --port in args, empty description, unset $VAR
Smoke against a Python mock OpenAI server (single + multi-PID): /healthz, /meta 503-then-200, command.env exactly matches YAML (no inherited LANG/USER/TERM), SSE chunks visible progressively
Secret redaction: HF_TOKEN: { secret: ... } surfaces as { redacted, sha256 }; plaintext absent from logs
Real vllm 0.19.1 + Qwen2.5-0.5B on H100 NVL (configs/vllm-smoke.yaml):
- 3-PID process_tree — vllm (APIServer), python (mp helper), VLLM::EngineCore — comm-renaming captured
- loaded_libs = 164 entries including libcudnn.so.9, libnccl.so.2, libcublas.so.12, libtorch_cuda.so from venv site-packages, plus libcuda.so.570.172.08 and libnvidia-ml.so.570.172.08 from /usr/lib/x86_64-linux-gnu (vast-mounted host driver) — versions parsed from filenames
- venv_snapshot = 84,131 files, 21.0 GB, single merkle hash
- warmup.elapsed_ms = [350, 193, 192] — clear cold→warm shape, exactly the lazy-load capture we want pre-snapshot
- 5 concurrent chat completions served correctly
- SIGINT → child exits 0, no orphans
Real run against vllm-gpt-oss-120b.yaml / sglang-gpt-oss-120b.yaml / trt-llm-gpt-oss-120b.yaml once weights are cached

Known follow-ups (not in this PR)

If the proxy's listen port is already bound (e.g. by a stale instance), proxy::serve logs the EADDRINUSE and the spawned task ends, but main keeps running and "publishes" the snapshot to no one. Should propagate the bind failure and exit non-zero.
TTFT-against-real-engine is too noisy as a proxy-overhead check (engine variance dominates: 3 keep-alive runs of 500 samples gave +4647 / +34 / −264 µs deltas). For a real overhead measurement, hit a no-compute upstream or use 10k+ samples.

🤖 Generated with Claude Code

Replaces bin/smoke-* with `bench-proxy --listen X --config <yaml>`. The proxy spawns the engine on a random loopback port (only it knows), so all client traffic to /v1/* and /meta is forced through one process. The child env is fully sanitized (env_clear() + only the YAML's `env:`); $VAR interpolation pulls from the proxy's runtime env so configs stay portable across nix store hashes. Secrets are tagged `{secret: ...}` and appear in /meta as `{redacted, sha256}`. After warmup (default 3 chat completions, forces lazy CUDA/cuDNN/NCCL dlopens), the proxy walks the descendant process tree from /proc, unions loaded_libs across all PIDs, sha256s every file in the venv, and freezes /meta with: command read from /proc/<pid>/{cmdline,environ,cwd}, gpus (curated nvidia-smi summary + raw -q -x), lshw, host tunables (THP, governor, NUMA), per-PID env, and a venv merkle hash. Verified end-to-end against vllm 0.19.1 + Qwen2.5-0.5B on H100: clean spawn, /meta is 18MB with all sections populated (3-PID tree, 164 loaded libs incl. libnccl/libcudnn/libcuda, 84k venv files, 21GB merkle-hashed), SSE streams chunk-by-chunk through the proxy, SIGINT shutdown is clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…y_bench.sh Engines (vllm 0.19, sglang 0.5.10, trt-llm 1.2) ship torch wheels that do not bundle libcudart.so.12 / libnvrtc.so.12 — they expect to dlopen them at runtime. Without ${cudaToolkit}/lib on NIX_LD_LIBRARY_PATH, `vllm --help` fails immediately with `ImportError: libcudart.so.12`. The CUDA driver lib (driverLib) provides libcuda.so but NOT libcudart; those are toolkit-side. Single-line fix in the ldPath helper covers all three devshells. bin/verify_bench.sh wraps proxy spawn + ready-poll + bench_latency.py + toks.py + teardown into a single shell command that prints one median tok/s number on stdout. Used as the autoresearch Verify cmd. trt-llm/uv.lock created by `uv sync --extra driver-r595`; the install itself errors on mpi4py 4.0.1 + Python 3.12 (separate issue), but the lockfile is reproducible and worth checking in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… on B200 First runnable YAML for this single-B200 host (the bookmark configs/ target a different multi-GPU cluster — see feedback memory). Conservative: enforce-eager (no CUDA graphs), no spec decode, no FP8 KV cache, no MXFP4 explicit override. max-num-seqs=1 hard-pins single-stream operation in the engine. Subsequent iterations will deviate one knob at a time. bench_results.tsv: tracked at repo root. .bench_runs/: gitignored, holds per-iteration logs + generated bench.json. configs/.current_experiment: gitignored loop pointer to active YAML. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

flashinfer JIT-compiles fp4_quantization.cu the first time vllm boots on a new SM (B200 = sm_100a) and shells out to nvcc, which fails with `fatal error: cublasLt.h: No such file or directory`. Same path for cudnn.h, cusparse.h, etc. nixpkgs cuda packages are multi-output: the default `out` is near-empty, real content is in `lib`/`include`/`static`/`stubs`. The previous symlinkJoin only pulled `out`. Switch to a generated paths list that pulls every declared output of every cuda package we need (cuda_nvcc, cudart, cccl, nvrtc, nvjitlink, cublas, cusparse, cusolver, curand, cufft, cudnn). Verified after rebuild: cublasLt.h, cudnn.h, cusparse.h, etc. resolve under $CUDA_HOME/include; libcublasLt.so, libcudnn.so, etc. under $CUDA_HOME/lib. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…torch wheels) vllm 0.19 / sglang 0.5.10 / trt-llm 1.2 all install torch from the cu130 PyTorch wheel index. The wheels bundle a complete cu13 stack under <venv>/lib/python3.12/site-packages/nvidia/cu13/{lib,include} (libcudart.so.13, libcublasLt.so.13, headers) plus cudnn.so.9 separately. Pulling cu12.9 toolkit libs into NIX_LD_LIBRARY_PATH (previous "fix") meant flashinfer's runtime JIT linked against libcudart.so.12 while torch loaded libcudart.so.13 — ABI mismatch. Revert cudaToolkit to nvcc-only (cu12.9 nvcc compiles cu13 headers fine, forward-compatible). Add cu13EnvHook that prepends the venv's cu13/lib + cudnn/lib + nccl/lib to NIX_LD_LIBRARY_PATH (runtime), LIBRARY_PATH (link), and CPATH (compile). Each devshell calls cu13EnvHook with its own engine dir name so vllm/sglang/trt-llm shells each get their own venv's paths. YAMLs gain LIBRARY_PATH + CPATH passthroughs in env so the proxy can forward shellHook-set values into the engine subprocess (proxy uses env_clear() and only forwards what the YAML names). Verified inside `nix develop .#vllm`: libcudart.so.13, libcublasLt.so.13, libcudnn.so.9 all resolve from venv first; cublasLt.h, cudnn.h from cu13/include + cudnn/include. Host /usr/lib/x86_64-linux-gnu has only driver libs (libcuda.so + libnvidia-ml.so), no toolkit libs — confirmed that approach is unavailable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…is cu13 The vllm 0.19.1 wheel ships vllm/_C.abi3.so with DT_NEEDED libcudart.so.12, while torch 2.10.0+cu130 (pulled by the b200 extra) ships libcudart.so.13 under nvidia/cu13/lib. Both must resolve at runtime (different SONAMEs, no conflict), but the previous "remove cu12 from LD path" change broke vllm import.
Fix:
- Keep cu12 nix toolkit (libcudart.so.12) on NIX_LD_LIBRARY_PATH for vllm._C dlopen.
- Put venv cu13/lib FIRST on NIX_LD_LIBRARY_PATH and LIBRARY_PATH so flashinfer's JIT links -lcudart against cu13 (matches torch ABI).
- CPATH is cu13 only, so the JIT preprocessor never sees cu12 headers.
Per-shell version declaration via cudaEnvHook engineDir cuMajor:
- cudaMajorRuntime = "12" (vllm._C)
- cudaMajorJit = "13" (torch + flashinfer JIT)
The hook hard-fails if either libcudart.so.{12,13} or cublasLt.h is missing, and prints a "✓ <engine>: cu13 + cu12 verified" line on success; both per user's request to make versions explicit. shellHook walks up from PWD to find flake.nix, so `nix develop` from the repo root or any subdir resolves the same venv path. Also includes the small bin/verify_bench.sh tweak adding a $MODEL env knob so smoke runs against a different served model can reuse the harness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… cu13 flashinfer/jit/cpp_ext.py hard-codes the link command:
c++ ... -L${CUDA_HOME}/lib64 -L${CUDA_HOME}/lib64/stubs -lcudart -lcuda
Linker searches -L paths in order, first match wins. Previous attempt included cuda_cudart in cudaToolkit, so libcudart.so.12 lived under ${CUDA_HOME}/lib; ld picked cu12 and produced a JIT .so with DT_NEEDED libcudart.so.12, which then conflicted with torch's loaded libcudart.so.13 at runtime.
Split:
- cudaToolkit (= ${CUDA_HOME}) carries cuda_nvcc + cuda_cccl ONLY. No libs at all under lib/. flashinfer's -L${CUDA_HOME}/lib64 finds nothing, so -lcudart falls through to LIBRARY_PATH where cu13 wins.
- cu12RuntimeLibs = lib.getOutput "lib" cuda_cudart: a separate path with libcudart.so.12 only. Mounted on NIX_LD_LIBRARY_PATH (after the cu13 segments) so vllm._C.abi3.so's runtime dlopen of libcudart.so.12 still resolves. Never on LIBRARY_PATH or -L flags.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tokio Command::new used execvp-style lookup against the launcher's PATH, making the YAML's env.PATH advisory only. The devshells deliberately don't put .venv/bin on PATH (uv-managed pythons are nix-ld-loaded), so `cmd: vllm` was failing ENOENT on a fresh checkout. Walk the YAML's child env.PATH for cmds without a slash; pass through absolute or relative paths unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
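
The lookup rule described in this commit can be sketched in Python as below. The real change is in the Rust proxy; `resolve_cmd` is a hypothetical name, and treating "file exists and is executable" as the match criterion is an assumption.

```python
import os


def resolve_cmd(cmd: str, child_env: dict[str, str]) -> str:
    """Resolve cmd against the CHILD's PATH, not the launcher's.

    A cmd containing a slash (absolute or relative) is passed through unchanged.
    """
    if "/" in cmd:
        return cmd
    for directory in child_env.get("PATH", "").split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, cmd)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    raise FileNotFoundError(f"{cmd!r} not found on child PATH")
```

Resolving to an absolute path before spawning means the launcher's own PATH never participates, which is what makes the YAML's env.PATH authoritative.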

…2 shims Three flake-level fixes to make iter-0 actually produce a tok/s number on this host (the prior baseline at 939b749 hit a flashinfer JIT failure):
1. Repo-local cache: LLM_CACHE_ROOT defaults to <repo>/.cache (autodetected by walking up to flake.nix), instead of $HOME/.cache/llm-benchmark. Override by exporting before `nix develop`. .cache/ added to .gitignore.
2. Unversioned libcudart.so: cu13 venv ships only libcudart.so.13 (and libcublas.so.13 etc.). ld -lcudart needs the unversioned symlink; the flake comment claimed the lookup would "fall through" to LIBRARY_PATH, but it didn't because cu13 has no unversioned link to fall through to. cudaEnvHook now creates the missing symlinks idempotently for the cudart/cublas/cublasLt/cusparse/cusolver/cufft/curand/cufile/cupti/nvJitLink/nvrtc family.
3. cu13TypedefShim: cu13's cudaTypedefs.h dropped the unversioned `PFN_X` macro aliases that cu12 carried (e.g. PFN_cuTensorMapEncodeTiled). The bundled cutlass in flashinfer 0.6.6 references those unversioned forms, so cu13 headers alone produce ~94 nvcc errors. Build a shim include dir whose cudaTypedefs.h #include_next's cu13's real header then re-defines each missing alias (extracted verbatim from cu12 cudart). Put FIRST on CPATH so it shadows cu13's cudaTypedefs.h only for that one header; everything else still falls through to cu13.
bench_results.tsv: iter-0 baseline = 97.3754 tok/s @ batch=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
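
The idempotent symlink step from fix 2 might look roughly like this in Python. The actual hook is shell inside the flake; `ensure_unversioned_links` is a hypothetical helper, and creating the links next to the versioned libraries is an assumption.

```python
from pathlib import Path


def ensure_unversioned_links(lib_dir: Path, major: str, names) -> list[str]:
    """Create lib<name>.so -> lib<name>.so.<major> where the link is missing.

    Safe to call repeatedly: existing files or links are left untouched,
    and libraries that are not present in lib_dir are skipped.
    """
    created = []
    for name in names:
        versioned = lib_dir / f"lib{name}.so.{major}"
        link = lib_dir / f"lib{name}.so"
        if not versioned.exists():
            continue
        if link.exists() or link.is_symlink():
            continue
        # Relative target keeps the link valid if the venv directory moves.
        link.symlink_to(versioned.name)
        created.append(link.name)
    return created
```

With the unversioned name in place, `ld -lcudart` can match in the cu13 directory on LIBRARY_PATH instead of failing the lookup.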

…baseline) Drop --enforce-eager so vllm captures CUDA graphs at BS=1. Per-step kernel-launch + Python scheduler overhead disappears, which is the dominant cost at BS=1 for small generation steps. Result: 223.67 tok/s vs 97.38 baseline (+126.29 tok/s, +129.7%). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
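
The delta figures quoted in this and later commits follow from simple arithmetic on the two medians. A small Python check, with `delta` as an illustrative helper name:

```python
def delta(new: float, base: float) -> tuple[float, float]:
    """Return (absolute delta, percent delta) of new relative to base."""
    abs_delta = new - base
    return abs_delta, abs_delta / base * 100.0


abs_d, pct = delta(223.67, 97.38)  # iter-1 vs iter-0, tok/s
print(f"{abs_d:+.2f} tok/s, {pct:+.1f}%")  # prints "+126.29 tok/s, +129.7%"
```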

…proxy Local commits got new SHAs after rebasing onto remote 1d2bbdd + 723c747. Iter-0 result moves to 1ad4da0 (the flake-fix commit it was measured at); iter-1 to 8e55e5e (the cuda-graphs experiment commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ok/s (-40%) Added --speculative-config method=ngram k=5 window=[2,4] on top of CUDA graphs. Hypothesis: structured JSON outputs from queries.csv (intent classifier on a Polish bank's call center) would yield high prompt-lookup hit rates. In practice the result fell to 133.61 tok/s vs 223.67 iter-1. Almost certainly because vllm's spec-decode path doesn't share the BS=1 CUDA graph captured for plain decode — each verify pass takes the eager path, paying back per-step launch overhead. Acceptance rate × overhead was net-negative. DISCARD. Keep YAML as the discard bookmark; revert pointer to iter-1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Switched --kv-cache-dtype fp8 on top of CUDA graphs. BS=1 decode is memory-bound on KV reads, so halving KV byte width should be a clear win, but the result is 221.57 vs iter-1's 223.67 (Δ -0.94%, well inside the N=50 median noise floor). Two plausible reasons for the null result:
1. With --max-num-seqs 1 and --max-model-len 4096 the absolute KV bandwidth at BS=1 is small enough that the dequant overhead in the attention kernel cancels the bandwidth saving.
2. gpt-oss-120b is MoE with only ~4 of N experts active per token, so weight bandwidth dominates over KV bandwidth at BS=1. Halving KV bandwidth doesn't move the needle.
Either way: equal results + extra config = DISCARD per the autoresearch "simpler wins" rule. YAML kept as a recorded discard bookmark. Also fixes verify_bench.sh: find for result.json now retries 5x500ms because in practice find returned empty immediately after python exited even though result.json appeared on disk a moment later. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…(1.66x iter-1) Set VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1, switching the MoE backend from FLASHINFER_TRTLLM_MXFP4_BF16 (default for sm_100a) to FLASHINFER_TRTLLM_MXFP4_MXFP8. Intermediate activations now compute in FP8 instead of BF16, hitting B200's native FP8 tensor cores. At BS=1 the MoE matmul is the dominant cost (only ~4 of N experts active per token, weight bandwidth dwarfs everything else), so this is exactly where to spend a knob. Result: 371.21 tok/s vs 223.67 iter-1 (+147.54, +66.0%). Cumulative speedup over iter-0 baseline = 3.81x. KEEP. iter-1 cuda-graphs YAML stays as the secondary best. Also pre-stages iter-5 (vllm-b200-bs1-fusions.yaml: MXFP8 MoE + compilation pass fusions) and fixes verify_bench.sh result.json discovery to parse `run dir:` from bench.log instead of `find` — find was returning empty after python exited even with 5x500ms retry, an overlayfs dirent visibility quirk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
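
The result.json discovery fix replaces a directory scan with parsing the run directory out of the log. A Python sketch of that idea, shown in Python rather than the script's shell; only the `run dir:` prefix comes from the commit message, while the rest of the log line format and the `find_run_dir` name are assumptions:

```python
import re
from pathlib import Path
from typing import Optional


def find_run_dir(log_text: str) -> Optional[Path]:
    """Return the path from the last 'run dir: <path>' line, or None if absent."""
    last = None
    for line in log_text.splitlines():
        m = re.search(r"run dir:\s*(\S+)", line)
        if m:
            last = Path(m.group(1))
    return last
```

Reading the path the producer itself logged avoids depending on when a freshly written directory entry becomes visible to a separate `find` process.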

… -2.17%) Enabled fuse_norm_quant + fuse_act_quant + fuse_attn_quant on top of iter-4 (mxfp8-moe). Hypothesis: fewer kernel launches per decode step. Result is 363.16 vs iter-4's 371.21 — Δ -2.17%, within noise but slightly negative. Possible reasons it didn't help: - The fused quant ops weren't the dominant cost; MoE matmul still owns the step time. - Inductor's fused kernels for these specific quant patterns may be less optimized than the un-fused vendor kernels. DISCARD per "simpler wins on equal results." Verify_bench.sh fix landed — first iteration where the script printed the tok/s number to its own stdout (run dir parsing replaced flaky find). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

KV-fp8 still doesn't help even with mxfp8-moe shifting the bottleneck. At BS=1 with max-num-seqs=1 the absolute KV bandwidth is small enough that dequant overhead in the attention kernel cancels the byte savings. DISCARD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

VLLM_USE_FLASHINFER_SAMPLER=1 swaps the per-token sampler for flashinfer's kernel. Result is 371.13 vs iter-4 371.21 — Δ -0.022%, statistically identical. Sampling is well off the critical path at BS=1 / 50-query median; the MoE matmul dwarfs everything else. DISCARD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…22%) Limited cudagraph_capture_sizes to [1] only. Hypothesis: only the BS=1 graph is ever used at --max-num-seqs 1, so dropping the other 83 captures saves a couple GiB and skips the per-step graph-size dispatch. Result Δ -3.22% — slightly worse. vLLM has both PIECEWISE and FULL cudagraph slots; restricting capture sizes apparently doesn't simplify the dispatch path enough to win, and may force some graphs into the piecewise path. DISCARD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Disabled chunked-prefill + prefix-caching + log-stats. Hypothesis: each adds per-step Python work and we don't benefit from any of them at single-stream BS=1. Result Δ -2.20% — slightly worse. Likely the chunked-prefill code path is the better-optimized one even when prefill is small, and the per-step accounting/stats overhead is already negligible vs the MoE matmul. DISCARD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… -32.6%) Swapped TRTLLM-gen for CUTLASS variant of the same FP8-activation MoE backend. Result is dramatically worse: 250.07 vs 371.21 (Δ -32.6%). CUTLASS likely picks tile shapes tuned for larger BS or different sm generations; TRTLLM-gen's tile selection is the right one for sm_100a + BS=1 here. DISCARD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…1%, errors) Hypothesized smaller KV pool / smaller cuda graph workspace might give a small win. Two independent reasons to discard: 1. Δ -2.01% vs iter-4 — slight regression, possibly because the cuda graph FULL-decode capture for max=4096 vs max=2048 doesn't actually change at BS=1 (only the per-stream KV pool size changes), so we lose the speed/quality tradeoff with no benefit. 2. 5 of 50 prompts overflow at 2K tokens (HTTP 400). The median is now computed over fewer queries — not directly comparable to iter-4. AND the system can no longer serve the full distribution. DISCARD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ncompat) Set VLLM_USE_OINK_OPS=1 to use Blackwell-tuned RMSNorm. Engine init dies with TypeError: unhashable type: 'list' inside vllm/logger.py warning_once — Oink emits a warning whose args aren't hashable for the dedup cache. Not a meaningful comparison; logging as a crash with no tok/s number. Would need a vllm 0.19.x bump or an Oink-side fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
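
The failure mode is easy to reproduce in isolation. A minimal Python sketch, assuming the dedup cache behaves like `functools.lru_cache` keyed on the call arguments; `warn_once` and `safe_warn_once` are stand-ins, not vllm's functions:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def warn_once(msg: str, *args) -> str:
    """Stand-in for a deduplicated logger call: formats once per distinct args."""
    return msg % args


def safe_warn_once(msg: str, *args) -> str:
    """Convert list args to tuples so the cache key is hashable."""
    hashable = tuple(tuple(a) if isinstance(a, list) else a for a in args)
    return warn_once(msg, *hashable)


try:
    warn_once("unsupported shapes: %s", [1, 2])
except TypeError as exc:
    error = str(exc)  # lru_cache cannot hash the list argument

ok = safe_warn_once("unsupported shapes: %s", [1, 2])
```

Either side can fix it: the caller can pass hashable arguments, or the dedup wrapper can normalize them before building the cache key.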

…, noise) VLLM_USE_DEEP_GEMM_E8M0=0 forces DeepGEMM's alternate FP32-scale path instead of the default Blackwell E8M0 FP8 scale exponent format. Result is 373.20 vs iter-4's 371.21 — Δ +0.54%, well within the ~1-3% N=50 median CV. Borderline. Could KEEP for the 0.54% but per autoresearch's simpler-wins discipline on equal results, DISCARD: iter-4 is one fewer env knob carrying the same workload to within measurement noise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… compat) Two unblocking fixes for getting sglang and trt-llm to start at all on this host: 1. flake: cu12RuntimeLibsExtra symlinks together cuda_nvrtc + libcublas + libcusparse + libcusolver + libcurand + libcufft + cuda_cupti from nixpkgs cudaPackages, all cu12. Joined onto NIX_LD_LIBRARY_PATH so sglang's bundled triton kernels (which were compiled against cu12) can dlopen them at runtime. Kept *outside* CUDA_HOME so JIT linker `-lcublas` etc still falls through to cu13 (LIBRARY_PATH). 2. trt-llm/pyproject.toml: bump mpi4py 4.0.1 → >=4.1.1. mpi4py 4.0.1's build_ext calls `distutils.log.warning` which is missing under setuptools >=80 / Py 3.12; 4.1.1 dropped the call. Without this `uv sync --extra driver-r595` fails inside the trt-llm devshell. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ine reference) First sglang result on this host. TP=1, --attention-backend trtllm_mha (sglang's auto-pick for GptOssForCausalLM on sm100), --cuda-graph-max-bs 1, --max-running-requests 1, --mem-fraction-static 0.85. Result: 343.41 tok/s vs vLLM iter-4's 371.21; vLLM wins by 8.1%. SGLang's baseline already enables the MXFP4 trtllm-gen MoE backend + cuda graphs by default, so this is roughly the apples-to-apples comparison vs the vLLM best, not vs the vLLM iter-0 baseline (97.38). KEEP as the sglang reference. Spec decoding (EAGLE3) next.
Three flake fixes were needed to get sglang to start:
- cu12 runtime libs bundle (cuda_nvrtc + libcublas/cusparse/cusolver/curand/cufft/cupti) joined onto NIX_LD_LIBRARY_PATH for sgl_kernel's bundled triton kernels (prior commit a5364b5)
- CCCL include via CUDA_HOME on CPATH so sglang's tvm-ffi JIT compile finds <nv/target> when cu13's cuda_fp16.h pulls it in
- attention-backend changed from fa3 (sm80-90 only) to trtllm_mha
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

….32 (noise) Per the scout report, FLASHINFER_MOE_BACKEND chooses among throughput/latency/masked_gemm cubins for FlashInfer MoE. Set to 'latency' on top of iter-4. Result Δ +0.57%, within noise. Logs confirm: 'Using FLASHINFER_TRTLLM_MXFP4_MXFP8 Mxfp4 MoE backend'. The MXFP4_MXFP8 selector path takes precedence over the BACKEND env var when the model is MXFP4-quantized, so the latency env knob is effectively a no-op here. DISCARD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Added --speculative-config method=eagle3 + RedHatAI/gpt-oss-120b-speculator.eagle3 draft, k=3, on top of iter-4. EAGLE3 is the model-based draft (much higher acceptance than ngram) but result is the same flavor of regression as iter-2 ngram: -33%. The spec verify path in vLLM 0.19.1 appears not to share the BS=1 FULL cuda graph captured for plain decode, so each verify step pays back per-step launch overhead. DISCARD. Try sglang's EAGLE3 next (scout reports it's more graph-friendly via SGLANG_ENABLE_SPEC_V2 + chain-verify with eagle_topk=1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…3.93%) vLLM 0.19.1 → 0.20.1rc1.dev129+ga3ec4a35f, flashinfer 0.6.6 → 0.6.8.post1, torch 2.10.0+cu130 → 2.11.0+cu130. pyproject.toml now: - depends on vllm>=0.20.1rc1.dev0 routed via the https://wheels.vllm.ai/nightly index in tool.uv.sources - drops the explicit flashinfer-python / flashinfer-cubin pins so they flow from vllm's own pin (its wheel pins specific versions that change per dev build) - drops the explicit torch / torchvision / torchaudio version pins for the same reason; torch is still routed through cu130 index Re-bench of iter-4's mxfp8-moe YAML on the new stack: 385.80 tok/s vs 371.21 on 0.19.1 — Δ +3.93%, well above the ~1-2% N=50 median noise floor. Cumulative speedup over iter-0 baseline now 3.96x. Where the gain comes from is hard to attribute to any one knob — the flashinfer 0.6.8 trtllm-gen MoE kernels likely got tuned cubins, torch 2.11 has a slightly faster cuda graph capture path, and vllm 0.20 dev includes async-scheduling work. They all moved together; we get one number. YAML unchanged from iter-4. KEEP. Try sglang/trtllm nightlies next. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… BEST (+0.71%) trt-llm 1.2.1 → 1.3.0rc13 (uv resolved up from `>=1.3.0rc9`). cu130 torch 2.10.0+cu130 (trt-llm 1.3 still pins torch 2.10, vllm went to 2.11). New baseline YAML configs/trt-llm-b200-bs1-baseline.yaml runs trtllm-serve under mpirun -np 1; without an MPI universe trt-llm 1.3 errors at engine init in MPI_Comm_spawn even at TP=1. Result: 388.55 tok/s vs vLLM nightly's 385.80 (Δ +0.71%), within the N=50 noise floor, but TRT-LLM lands as the cross-engine new best out of the box (no per-engine knob tuning yet). Cumulative 3.99x over iter-0 baseline.
Four small infra fixes were needed to unblock:
- trt-llm/pyproject.toml: bump tensorrt-llm pin to >=1.3.0rc9, drop torch* pins (1.3 pins them transitively), keep nvidia-ml-py per-driver
- YAML: wrap trtllm-serve with `mpirun -np 1 --allow-run-as-root` so MPI_Comm_spawn has a real universe
- bench_latency.py: strip the non-OpenAI `provider` field from queries.csv payloads. vLLM/sglang silently accept unknown fields; trt-llm uses strict pydantic and 400s any extras (caught 50/50 errors first run).
- YAML: point at the local HF snapshot path instead of model name. trt-llm 1.3 was requesting a different revision than what's cached and trying to download all 14 safetensors fresh (60+GB), which was an instant ENOSPC.
KEEP. Try EAGLE3 + PDL (per scout: NVIDIA's published BS=1 recipe) next. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
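
The payload cleanup can be as small as a filtered copy. A sketch assuming each queries.csv row has already been parsed into a dict; `strip_extra_fields` is a hypothetical name, not the helper in bench_latency.py:

```python
def strip_extra_fields(payload: dict, drop=("provider",)) -> dict:
    """Return a copy of payload without fields a strict server would reject."""
    return {k: v for k, v in payload.items() if k not in drop}


request = strip_extra_fields(
    {"model": "openai/gpt-oss-120b", "messages": [], "provider": "example"}
)
```

Returning a copy leaves the original row intact, so the same query set can still be replayed against engines that tolerate the extra field.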

NVIDIA's published BS=1 EAGLE3 recipe: nvidia/gpt-oss-120b-Eagle3-short-context draft, max_draft_len=3, disable_overlap_scheduler=true, cuda_graph max_batch_size=4 (verify shape with k=3), TRTLLM_ENABLE_PDL=1 for kernel-launch overlap. Loaded extras via --extra_llm_api_options. Result: 225.37 tok/s vs trt-llm baseline 388.55 — Δ -42.0%. The third engine to show the same pattern (vLLM iter-16: -33%, sglang iter-17: -25.5%). Speculative decoding at BS=1 for gpt-oss-120b on B200 is fundamentally not a win across any tested engine — the MoE matmul cost dominates so much that the extra verify-pass overhead exceeds the savings from accepting drafted tokens, even with EAGLE3's high acceptance length. DISCARD. trt-llm baseline (iter-18) remains the cross-engine best. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… -38.9% vs iter-47) VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 forces the oracle to FLASHINFER_CUTLASS_MXFP4_BF16 — a monolithic CUTLASS MoE with persistent scheduling. The other Hopper-reachable MXFP4 backend besides Marlin and Triton. 196.09 tok/s — basically tied with iter-46 baseline (197.84) and crushed by Marlin (320.78). The path JIT-compiles 181 sm90 CUTLASS kernels via ninja on first run (~15 min cold, then cached). Even after warmup the BS=1 single-token decode doesn't fill those big TmaWarpSpecializedPingpong kernels. Marlin's smaller W4A16 tile is just better at this regime. DISCARD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… (Δ -0.6% vs iter-47) The Recipe's fuse_allreduce_rms pass routes through FlashInfer's trtllm_fused_allreduce_norm kernel — Blackwell-tuned and the documented win on B200. Hypothesis: at H100 TP=2 NVLink with single-token allreduces, the symm-mem setup overhead may not amortize. 318.94 tok/s — Δ -0.6% vs iter-47. Pure noise. The pass is a no-op either way at this regime; keep it on so we match the published recipe without measurable cost. DISCARD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…acceptance Targeted the official vllm-project/recipes GPT-OSS_EAGLE3_Hopper.yaml config: nvidia/gpt-oss-120b-Eagle3-v2 + k=3. v2's HF config correctly sets eagle_config.use_aux_hidden_state=False (head uses last layer output, not aux states like v1). Two vllm 0.20.1rc1.dev129 bugs blocked this iter:
1. AOT COMPILE CACHE KEY BUG. After Eagle3-v1 ran in iter-47 (which sets aux_hidden_state_layers to 3 default layers), the compiled GptOssModel forward graph baked in `return (x, [aux,aux,aux])`. The compile cache key didn't hash aux_hidden_state_layers, so when v2 (use_aux=False) was loaded next, the same cache hit returned a graph that emitted the tuple, but the v1 model_runner code path for use_aux_hidden_state_outputs=False expected outputs to be a tensor and crashed with `hidden_states[logit_indices_device]` → IndexError: tuple index out of range. Confirmed by editing gpu_model_runner.py source (changes file hash → invalidates AOT cache); a fresh compile then correctly returned a tensor. Cache key needs to include aux config / use_aux flag.
2. Eagle3-v2 ACCEPTANCE IS BROKEN even after cache fix:
   Mean acceptance length: 1.01
   Per-position acceptance rate: 0.007, 0.000, 0.000
   Avg Draft acceptance rate: 0.2%
   Drafted 1248 tokens, target accepted 3.
   v2 must need a different hidden state than vllm 0.20's pass_hidden_states_to_model wires up. Unusable here despite being the recipe's recommended draft.
Final tok/s with the cache-fixed v2 path: 153.80, worse than no spec at all. Document and move on; v1 + k=2 (iter-47) is the H100 champion. Both bugs worth filing upstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
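
The cache-key bug in item 1 is a general hazard: any config field that changes the compiled graph but is left out of the key causes stale hits. A toy Python illustration, not vllm's cache code; `cache_key`, the field names in the tuples, and the config dicts are all hypothetical:

```python
import hashlib
import json


def cache_key(source_hash: str, config: dict, fields: tuple) -> str:
    """Hash the source fingerprint plus only the config fields named in `fields`."""
    picked = {name: config.get(name) for name in fields}
    blob = json.dumps({"src": source_hash, "cfg": picked}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()


# Two draft-head configs that differ only in whether aux hidden states are used.
v1 = {"model": "gpt-oss-120b", "use_aux_hidden_state": True}
v2 = {"model": "gpt-oss-120b", "use_aux_hidden_state": False}

narrow = ("model",)                         # key ignores the aux flag
wide = ("model", "use_aux_hidden_state")    # key includes it

stale_hit = cache_key("abc", v1, narrow) == cache_key("abc", v2, narrow)
distinct = cache_key("abc", v1, wide) != cache_key("abc", v2, wide)
```

It also shows why editing the source file "fixed" the crash: changing `source_hash` changes every key, which hides the missing field rather than addressing it.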

….3% vs iter-47) Isolation test: same Eagle3 head as iter-47 (which works), just k=2 → k=3. Tells us whether the iter-50 crash was about k=3 alone or v2 alone. 310.25 tok/s: runs cleanly with k=3 (so the iter-50 bug really was the v2 model interacting with a stale AOT cache). But k=3 underperforms k=2:
Mean acceptance length: 2.08 (k=2: ~2.0)
Per-position acceptance: 0.591, 0.314, 0.170
Third position only 17% accept rate
Verify cost on three drafted tokens grows faster than the marginal acceptance gain at position 3. Same shape as B200 iter-26/27 for trt-llm Eagle3 (k=1: 438, k=2: 478, k=3: 462, k=4: 435). k=2 is the peak across both engines and both platforms for this workload. DISCARD; k=2 stays. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
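
The reported mean acceptance length is consistent with the per-position rates, assuming each rate is the fraction of verify steps in which that draft position was accepted and that every step also emits one token from the target model. A quick Python check under that assumption:

```python
def mean_acceptance_length(per_position_rates) -> float:
    """Expected tokens per verify step: 1 target token plus accepted drafts."""
    return 1.0 + sum(per_position_rates)


k3 = mean_acceptance_length([0.591, 0.314, 0.170])  # about 2.075, reported as 2.08
third_position_gain = 0.170  # extra tokens per step contributed by position 3
```

Under the same assumption, the third draft position adds only about 0.17 tokens per step, which is the marginal gain the extra verify cost has to beat.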

Tried to push Marlin onto Hopper FP8/INT8 tensor cores via VLLM_MARLIN_INPUT_DTYPE. Both crash: fp8: ValueError "Marlin W4A8-FP8 only support SM89 or SM120 device (It is slower than Marlin W4A16 on other devices). You can consider using W4A8-INT8 instead(set VLLM_MARLIN_INPUT_DTYPE=int8)." H100 is SM90, not in the allowlist. int8: RuntimeError "MXFP4 weight + INT8 activation is not supported." The error message in the fp8 case is misleading — INT8 doesn't compose with MXFP4 weights either. The W4A8 fast paths exist (SM89 Ada, SM120 Blackwell-RTX), just not on H100. Default activation dtype (bf16) is what we get and it's fine. DISCARD; iter-47 config remains the H100 best. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…0.9% vs iter-47) VLLM_USE_FLASHINFER_SAMPLER=1 — FlashInfer's fused sampler kernel. Hypothesis: spec decode does multi-step sampling (1 verify + k draft tokens), so a faster sampler should compound. 317.94 tok/s — Δ -0.9%. Pure noise. Same outcome as B200 iter-7 (371.13 vs 371.21 baseline). Sampler is off the critical path regardless of platform; the MoE GEMM dominates wall time at BS=1. DISCARD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…vs iter-47) VLLM_USE_NCCL_SYMM_MEM=1 — defaults False. NCCL symmetric-memory allreduce can be faster than ring on NVLink. Hypothesis: TP=2 H100x2 NVSwitch path benefits. 318.74 tok/s — noise. The custom_all_reduce path vllm uses already saturates the NVLink for single-token TP=2 allreduces; symm-mem adds setup overhead with no payload to amortize. DISCARD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…vs iter-47)

Calibration test: how much of the H100x2 throughput is Marlin (the kernel) vs torch.compile + cudagraphs (the launch-overhead reduction).

70.79 tok/s with --enforce-eager (no compile, no cudagraphs) on top of Marlin. The compile/cudagraph stack is a **4.5x multiplier** on top of whichever MoE kernel sits underneath:

  iter-47 (Marlin + compile + cudagraphs): 320.78
  iter-55 (Marlin + eager):                 70.79

At BS=1 the per-kernel-launch overhead is enormous relative to the work each kernel does; cudagraphs fold all of that out. Plus custom_ops:[+rms_norm] and the inductor fusions handle the small ops between MoE blocks where Marlin isn't running.

So our +62% iter-46 → iter-47 win was Marlin replacing Triton at the BIG kernel, on top of an already-4.5x-amplified base. Both layers are load-bearing; iter-47 is the right ship target.

DISCARD; calibration record only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Iters 46-55 (the H100x2 sweep) were committed with TBD-N placeholders in the commit column. Backfilling the actual short SHAs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Single-GPU calibration of the iter-47 winner. Same Marlin MoE path, EAGLE3 nvidia k=2, full Recipe compile block, just --tp 1 with CUDA_VISIBLE_DEVICES=0.

  iter-47 (TP=2): 320.78 tok/s (160.4 per GPU)
  iter-56 (TP=1): 246.13 tok/s (246.1 per GPU)
  TP=2 / TP=1   : 1.30x (vs ideal 2x)

Per-GPU efficiency is 53% better at TP=1. Classic memory-bandwidth-bound MoE behavior: at BS=1 the bottleneck is reading expert weights, and TP-splitting halves each rank's read but adds NVLink allreduce overhead per layer. The crossover only pays when the allreduce cost < the bandwidth savings.

Spec acceptance ~48% on both: speculative decoding behaves identically across TP factors.

KEEP: useful single-GPU baseline for cross-engine comparisons (see iters 57-59 for trt-llm at the same platform).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… 246)

Direct port of B200 iter-26 winner (TRT-LLM EAGLE3 default draft k=2, TP=1, PDL on, cuda_graph_config batch_sizes=[1,3], disable_overlap_scheduler) to H100. No code changes besides the model snapshot path.

141.73 tok/s: 57% of vllm iter-56's 246.13 on the exact same hardware/workload.

Root cause: trt-llm 1.3.0's default MoE on Hopper MXFP4 falls back to TritonFusedMoE (the TRTLLM kernel backend is Blackwell-only; see release notes for 1.3.x). vllm has a hand-written Marlin W4A16 kernel for SM90 that ships in the wheel (VLLM_MXFP4_USE_MARLIN); trt-llm has no equivalent in 1.3.0.

  Hopper MXFP4 MoE (BS=1):
    vllm Marlin     : 246.13 tok/s (iter-56)
    trt-llm Triton  : 141.73 tok/s (this iter)
    trt-llm CUTLASS : 105.36 tok/s (iter-58, regression)

The B200 → H100 reversal is real: at B200 trt-llm and vllm tied at ~478 tok/s (iter-26 / iter-38) because both used FlashInfer TRTLLM-gen MXFP4_MXFP8 kernels with native FP4 hardware. On Hopper without FP4 hardware, the kernel landscape differs and vllm Marlin has no trt-llm counterpart.

KEEP: cross-engine baseline at H100 TP=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… → 105.36 (Δ -25.7%)

Tried to force CUTLASS MoE on the assumption (per research) that trt-llm 1.3 has a Hopper-native CUTLASS MXFP4 grouped-GEMM that beats Triton. It doesn't, at least not with autotune cold-cache.

105.36 tok/s, 25.7% worse than iter-57 default Triton. The proxy log shows the smoking gun on every step:

  [AutoTuner] trtllm::fused_moe::gemm1 using the fallback tactic, due to cache miss on input shapes=...
  [AutoTuner] trtllm::fused_moe::gemm2 using the fallback tactic, due to cache miss on input shapes=...

The CUTLASS path needs a populated autotune cache to hit good kernels; with cache misses it uses a generic fallback that's slower than Triton's already-tuned kernel. The default Triton path doesn't have this problem because its kernels are autotuned at JIT time.

There may be a way to pre-populate the cache (offline tuning run, or extending warmup), but iter-57 default is already the right baseline for H100 trt-llm. DISCARD.

Also adds configs/trt-llm-eagle3-extra-h100.yaml with the CUTLASS + stream_interval=10 + batch_sizes=[1,2,3,4] + enable_block_reuse=false additions per the research agent's recipe; kept for record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lm 320.78)

Tested whether TRT-LLM's ONESHOT/MIN_LATENCY allreduce on H100x2 NVSwitch closes the gap with vllm at TP=2. It scales better than vllm in ratio but loses on absolute throughput.

  Scaling factor TP=1 → TP=2:
    vllm + Marlin    : 246.13 → 320.78 (1.30x)
    trt-llm + Triton : 141.73 → 203.50 (1.44x)

TRT-LLM's allreduce IS more efficient at small payloads (1.44 vs 1.30), but the win is dwarfed by the TP=1 gap (vllm is 73% faster there, 246.13 vs 141.73) caused by the missing Marlin-equivalent kernel.

Final cross-engine score on this hardware/workload:

  H100x2 BS=1 gpt-oss-120b EAGLE3:
    vllm + Marlin    : 320.78 tok/s (iter-47, NEW H100 BEST)
    trt-llm + Triton : 203.50 tok/s (this iter, -36.6%)

Reverses the B200 picture (where iter-26 trt-llm 477.67 and iter-38 vllm 477.84 tied within noise). The crossover comes from different MXFP4 kernel availability per platform: B200 has hardware FP4 + FlashInfer TRTLLM-gen MXFP4_MXFP8 that both engines plug into; H100 has no FP4 hardware and only vllm shipped a hand-written Marlin path that maps well to bf16 dequant.

KEEP: final cross-engine number on H100x2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cross-engine H100/H100x2 sweep (vllm + trt-llm). TBD-N placeholders in iters 56-59 → actual short SHAs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…TON → 200.51 (Δ noise)

Set moe_config.backend=TRITON explicitly to see if it differs from the implicit default-fallback in iter-59. It doesn't: same legacy TritonFusedMoE class, same OAI Triton matmul_ogs kernel, same warning, same number within noise (200.51 vs 203.50).

Source-level confirmation in tensorrt_llm/_torch/modules/fused_moe/create_moe.py:82-83: "TRITON" maps directly to TritonFusedMoE (legacy class). The ConfigurableMoE wrapper only accepts TRTLLMGenFusedMoE and CuteDslFusedMoE per line 418, neither of which works for MXFP4. Same warning "TritonFusedMoE is not supported by ConfigurableMoE" fires on both iter-59 (no setting) and iter-60 (explicit TRITON).

DISCARD; iter-59 is the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… vs iter-59)

Per examples/models/core/gpt_oss/README.md: "the TRITON backend doesn't support tensor parallelism (TP), only expert parallelism (EP) and AlltoAll operations". Tried tp_size=2 + moe_expert_parallel_size=2 to unlock the (alleged) ConfigurableMoE TRITON path that should activate when EP is used.

Result: 188.95 tok/s, 7.1% worse than iter-59 default. Same "TritonFusedMoE is not supported by ConfigurableMoE" warning fires. The setting falls through to legacy TritonFusedMoE AND the EP AllToAll routing adds overhead vs straight TP.

CODE READ confirms there's no path forward in trt-llm 1.3.0rc9 for faster MXFP4 MoE on Hopper:

  fused_moe/create_moe.py, only viable backend selectors:
    "CUTLASS" → CutlassFusedMoE (autotune cache miss → fallback, iter-58 = 105 tok/s)
    "TRITON" → TritonFusedMoE (OAI matmul_ogs, used here = 200)
    "TRTLLM" → TRTLLMGenFusedMoE (Blackwell-only, fails on SM90)
    "DEEPGEMM"/"CUTEDSL"/"DENSEGEMM"/"WIDEEP" → all gated to fp8/nvfp4 quants (won't bind to MXFP4 weights)

  fused_moe_triton.py: TritonFusedMoE imports `from triton_kernels.matmul_ogs import matmul_ogs` so the OAI kernels ARE active (no separate install needed; the triton_kernels package ships with the trt-llm wheel).

  ConfigurableMoE in create_moe.py:411-425 only wraps TRTLLMGenFusedMoE and CuteDslFusedMoE; everything else falls back to legacy.

So all MXFP4 paths on Hopper end at the same OAI Triton kernel that vllm uses by default, and which vllm Marlin is the alternative to. There is no Marlin-equivalent W4A16 fast path in trt-llm 1.3.0rc9 for Hopper MXFP4.

iter-59 (203.50) is the trt-llm ceiling on this hardware; vllm + Marlin (320.78, iter-47) is +57.7% faster because vllm shipped a hand-written kernel that trt-llm doesn't have.

DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…9.36 (noise) The official vllm-project/recipes GPT-OSS_EAGLE3_Hopper.yaml ships --max-num-batched-tokens 8192 (vs vllm default 4096 = max-model-len). With EAGLE3 k=2 each step's max_num_scheduled_tokens is bounded by this; iter-46 logs warned the default was undersized. 319.36 tok/s vs iter-47's 320.78 — Δ -0.4%, pure noise. The warning about being undersized was benign for our BS=1 + max-num-seqs=1 workload. iter-47 stays the H100x2 best. DISCARD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e-warm) → 154.69

iter-58 (TP=1 CUTLASS) regressed to 105.36 tok/s with the smoking-gun warning "[AutoTuner] using fallback tactic, due to cache miss". Tested whether bumping warmup 3 → 20 lets the AutoTuner populate the cache and recover. Also bumped to TP=2 to compare apples-to-apples against iter-59 (Triton TP=2, 203.50).

154.69 tok/s: meaningful recovery vs iter-58 (+46.8%) but still 24% behind Triton TP=2:

  iter-58 (TP=1, CUTLASS, warmup=3)  : 105.36 (cache miss → fallback)
  iter-63 (TP=2, CUTLASS, warmup=20) : 154.69 (cache populated, 56 entries)
  iter-59 (TP=2, Triton, warmup=3)   : 203.50 (OAI matmul_ogs, no autotune cache issues)
  iter-47 (vllm Marlin TP=2)         : 320.78 ← H100x2 ceiling

AutoTuner populated 56 cache entries successfully:

  [Autotuner] Cache size after warmup is 56

So the cache miss issue IS resolvable with longer warmup. But CUTLASS on H100 MXFP4, even fully warm, is fundamentally slower than the OAI Triton matmul_ogs path. CUTLASS-on-Hopper for MXFP4 in trt-llm 1.3.0 is just not as well tuned as the Triton implementation.

This closes out the "can we compile/tune our way to faster trt-llm on H100" investigation. The honest answer: no path in 1.3.0rc13. trt-llm Hopper ceiling = iter-59 (203.50). vllm's +57.7% lead is a kernel availability gap (Marlin), not a tuning gap.

DISCARD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes cross-engine loop on H100x2 BS=1 gpt-oss-120b EAGLE3:

  vllm + Marlin (iter-47)           : 320.78 tok/s ⭐ winner
  sglang + fa3 + EAGLE3 (this iter) : 243.10 tok/s 2nd
  trt-llm + Triton MoE (iter-59)    : 203.50 tok/s 3rd

sglang slots between vllm and trt-llm. Reuses the same OAI-Triton matmul_ogs MoE kernel as trt-llm but appears to have a tighter runtime around it.

Setup pain: the sglang devshell in this flake assumes the cu13 wheel layout (nvidia/cu13/{include,lib}/) like vllm does, but sglang[all]==0.5.10.post1 pins torch==2.9.1+cu129 which uses the per-component layout (nvidia/{cublas,cuda_runtime,curand,...}/{include,lib}/). To get the engine through cudagraph capture I had to:

  1. uv sync --extra h100 (sets up cu129 torch)
  2. Override CPATH in YAML to enumerate all 15 nvidia/*/include dirs so tvm_ffi/flashinfer JIT compiles find cuda_runtime.h, curand.h, cublas_v2.h, etc.
  3. Override LIBRARY_PATH similarly so the linker sees libcudart, libcurand, libcublas, etc.
  4. Bulk-symlink .so.X -> .so in each nvidia/*/lib so ld.bfd resolves -l<name> directly (the wheels ship versioned .so.X only).

Other notable settings:

  --attention-backend fa3 (FlashAttention 3; trtllm_mha is sm_100+ only)
  --speculative-num-steps 2 + --speculative-num-draft-tokens 3 + topk 1
  --cuda-graph-max-bs 4

KEEP: third engine on the H100x2 leaderboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
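Steps 2 to 4 of the sglang setup above could be scripted roughly as follows. This is a sketch only, not the script used in the PR: `link_unversioned` and `nvidia_dirs` are hypothetical helper names, and the site-packages location is an assumption about where the per-component `nvidia/*/{include,lib}` wheels land.

```python
import re
from pathlib import Path

# Matches versioned shared objects such as libcudart.so.12 or libcublas.so.12.9.1.
VERSIONED_SO = re.compile(r"^(lib.+\.so)\.\d+(\.\d+)*$")


def link_unversioned(lib_dir: Path) -> list[Path]:
    """Create lib<x>.so -> lib<x>.so.N symlinks so `-l<x>` resolves at link time."""
    created = []
    for path in sorted(lib_dir.iterdir()):
        match = VERSIONED_SO.match(path.name)
        if not match:
            continue
        link = lib_dir / match.group(1)
        if link.exists() or link.is_symlink():
            continue  # wheel already ships the unversioned name
        link.symlink_to(path.name)  # relative link, stays valid if the venv moves
        created.append(link)
    return created


def nvidia_dirs(site_packages: Path, leaf: str) -> str:
    """Colon-joined nvidia/*/<leaf> directories, usable as CPATH or LIBRARY_PATH."""
    dirs = sorted((site_packages / "nvidia").glob(f"*/{leaf}"))
    return ":".join(str(d) for d in dirs if d.is_dir())
```

A possible use is to call `link_unversioned` on every `nvidia/*/lib` directory once after `uv sync`, then paste the output of `nvidia_dirs(site_packages, "include")` and `nvidia_dirs(site_packages, "lib")` into the YAML `env:` block.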

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…caling matrix

Single-GPU sglang baseline. mem_fraction_static had to be raised 0.85 → 0.93; default OOMs at TP=1 because gpt-oss-120b MXFP4 + Eagle3 draft + KV barely fits in one 80GB H100.

Now all three engines have TP=1 and TP=2 numbers, full matrix:

  engine            TP=1    TP=2    scaling
  vllm + Marlin     246.13  320.78  1.30x
  sglang + fa3      232.88  243.10  1.04x ⚠ flat
  trt-llm + Triton  141.73  203.50  1.44x

Surprise finding: sglang's TP scaling is effectively flat at this workload. Going TP=1 → TP=2 gains 4%: the FlashInfer allreduce fusion overhead at single-token decode payloads nearly cancels the bandwidth-halving gain from sharding the model across 2 GPUs. vllm and trt-llm both scale much better.

Per-GPU efficiency picture:

  vllm TP=1    : 246.13 per GPU (best per-GPU)
  sglang TP=1  : 232.88 per GPU
  vllm TP=2    : 160.4 per GPU
  trt-llm TP=1 : 141.73 per GPU
  sglang TP=2  : 121.5 per GPU
  trt-llm TP=2 : 101.8 per GPU

vllm wins at every parallelism factor and per-GPU. Ship iter-47.

KEEP: completes the TP scaling matrix for all 3 engines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
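The scaling and per-GPU columns above follow directly from the TP=1 and TP=2 throughputs. A small sketch that recomputes them from the numbers in this message (the `scaling_row` helper and the `parallel_efficiency` column are illustrative additions, not part of the benchmark tooling):

```python
# Throughput in tok/s at (TP=1, TP=2), copied from the matrix above.
results = {
    "vllm + Marlin": (246.13, 320.78),
    "sglang + fa3": (232.88, 243.10),
    "trt-llm + Triton": (141.73, 203.50),
}


def scaling_row(tp1: float, tp2: float, gpus: int = 2) -> dict:
    """Derive scaling factor, per-GPU throughput and parallel efficiency."""
    return {
        "scaling": tp2 / tp1,                       # ideal would equal `gpus`
        "per_gpu_tp2": tp2 / gpus,                  # tok/s each GPU contributes
        "parallel_efficiency": tp2 / (gpus * tp1),  # 1.0 means perfect scaling
    }


for engine, (tp1, tp2) in results.items():
    row = scaling_row(tp1, tp2)
    print(
        f"{engine:<18} {tp1:7.2f} {tp2:7.2f} "
        f"{row['scaling']:.2f}x {row['per_gpu_tp2']:6.1f} "
        f"{row['parallel_efficiency']:.0%}"
    )
```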

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop cudaEnvHook/mkShell/ldPath helpers; inline each shell so vllm / sglang / trt-llm can be read top-to-bottom. Also prune what each shell doesn't use: vllm + trt-llm no longer pull cu12Extras (sglang-only per comment), trt-llm drops cu13TypedefShim (no flashinfer JIT). Shared data derivations (cudaToolkit, cu12Cudart, cu12Extras, cu13TypedefShim, cacheHook, findRoot, cu13SymlinkLoop) stay in let. nix flake check passes for all three shells. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pin pkgs.cudaPackages_12_9 (was pkgs.cudaPackages, an alias that could float to a different minor on nixpkgs upgrade). 12.9 matches the cu12 versions h100/h200 venvs bundle (cudart 12.9.79, cublas 12.9.1.4, nvrtc 12.9.86; see uv.lock).

Drop two dead abstractions:
- lib.concatMap (p: map (o: p.${o}) p.outputs) over [cuda_nvcc cuda_cccl]: both packages are single-output, so this loop expanded to just [cuda_nvcc cuda_cccl].
- lib.getOutput "lib" cuda_cudart: cuda_cudart is also single-output, so getOutput falls back to out. Use cu12Cudart = cuda12.cuda_cudart directly.

cu12Extras switches from getOutput to the .lib accessor.

Tighten cu13TypedefShim's bash: the tempfile + while-read + per-line echo loop becomes one grep | awk pipe. Verified output identical (1349 lines, same #ifndef-guarded macro list).

Trim verbose prose comments: keep load-bearing WHYs (cu12 must not beat cu13 for `-lcudart`, CUDA_HOME has no libs deliberately, why the PFN_X aliases need restoring), drop historical narrative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sthrough

Verified end-to-end on B200 + r595 driver: bench-proxy spawns each engine, /meta returns 200, curl /v1/chat/completions returns PONG.

- vllm-smoke.yaml: add CPATH=$CPATH to env (was missing). Without it flashinfer JIT couldn't find cuda_runtime.h from cu13 venv, killing EngineCore at kernel_warmup. b200-baseline.yaml already had this; the smoke yaml was just stale.
- sglang-smoke.yaml: new. Qwen2.5-0.5B-Instruct, mem-fraction-static 0.50, full env passthrough including MPICC.
- trt-llm-smoke.yaml: new. Uses `trtllm-serve serve <model>` (1.3.0rc13 added the `serve` subcommand; baseline yaml predates it and elides it, presumably relying on a default-subcommand shim that no longer works). mpirun -np 1 wrapper required even at TP=1 for mpi4py executor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…NEEDs

Checked every engine venv with objdump -p to enumerate cu12 DT_NEEDs. Three pieces of the flake turned out to be no-ops:

1. cu13SymlinkLoop: tried to symlink lib<x>.so → lib<x>.so.13 for 13 libs (cudart, cublas, cusparse, …). Audited against current vllm / sglang / trt-llm venvs: every venv either ships the unversioned .so already (cudart, cublas, cublasLt, cupti, nvJitLink, nvrtc) or has no .so.13 to link to (the cu12-version libs cusparse/cusolver/cufft/curand/cufile, plus cudart_static which is a .a). Result: zero symlinks ever created. Drop the helper entirely.

2. cu12Cudart for vllm: vllm 0.20.1rc1's _C.abi3.so DT_NEEDs libcudart.so.13 (was libcudart.so.12 in 0.19.x; comment was stale). Grep shows zero .so anywhere in the vllm venv DT_NEEDs libcudart.so.12. Drop the verification line + the NIX_LD entry. sglang and trt-llm keep cu12Cudart (sgl_kernel and torchao still DT_NEED libcudart.so.12).

3. cu12Extras prune: the bundle had nvrtc, cublas, cusparse, cusolver, curand, cufft, cupti. Of those, sgl_kernel only DT_NEEDs nvrtc + cublas/cublasLt. The other five live inside the venv's own nvidia/cu13/lib/ (torch's cu130 wheel still ships libcufft.so.12, libcusparse.so.12 etc despite the cu13 prefix), and our shellHook puts the venv path before ${cu12Extras}/lib on NIX_LD_LIBRARY_PATH so the venv copies always win. Pulling them from nixpkgs was wasted store space. Trim to nvrtc + cublas.

Re-ran all three engines end-to-end (bench-proxy → /v1/chat/completions → "PONG!") on B200 + r595 driver after the prune. nix flake check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
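The DT_NEEDED audit described above can be approximated with a short script around `objdump -p`. This is a sketch under assumptions: it relies on `objdump` being on PATH and on its dynamic-section output listing one `NEEDED <soname>` entry per line; the helper names are hypothetical and this is not the exact procedure used for the commit.

```python
import re
import subprocess
from pathlib import Path

# objdump -p prints dynamic-section entries such as "  NEEDED   libcudart.so.12".
NEEDED = re.compile(r"^\s*NEEDED\s+(\S+)\s*$")


def parse_needed(objdump_output: str) -> list[str]:
    """Extract DT_NEEDED sonames from `objdump -p` text output."""
    return [m.group(1) for line in objdump_output.splitlines() if (m := NEEDED.match(line))]


def needed_libs(so_path: Path) -> list[str]:
    """Run objdump on one shared object and return its DT_NEEDED list."""
    result = subprocess.run(
        ["objdump", "-p", str(so_path)], capture_output=True, text=True, check=True
    )
    return parse_needed(result.stdout)


def who_needs(venv: Path, soname: str) -> list[Path]:
    """List every shared object under `venv` that DT_NEEDs `soname`."""
    hits = []
    for so in venv.rglob("*.so*"):
        if not so.is_file() or so.is_symlink():
            continue
        try:
            if soname in needed_libs(so):
                hits.append(so)
        except subprocess.CalledProcessError:
            continue  # objdump could not read the file as an object
    return hits
```

For example, `who_needs(Path("vllm/.venv"), "libcudart.so.12")` returning an empty list would correspond to the "zero .so anywhere in the vllm venv" finding.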

Surveyed DT_NEEDs across all three venvs' .so files.

Hits:
  libstdc++ / libgcc_s : torchaudio, outlines_core, others
  libz                 : cudnn, opencv
  libssl / libcrypto   : opencv's ffmpeg pieces
  liblzma              : pillow

Zero hits across all venvs:
  libffi     : uv-managed python-build-standalone statically links it
  libglib    : no wheel here links glib
  libncurses : no wheel here links ncurses

Drop glib / libffi / ncurses; keep stdenv.cc.cc.lib / zlib / openssl / xz.

Verified vllm smoke (PONG via bench-proxy) on B200 + r595 with the trimmed list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ened

trt-llm's torchao/_C.abi3.so DT_NEEDs libcudart.so.12, but checking /proc/self/maps after a successful trtllm-serve chat completion shows neither libcudart.so.12 nor any torchao .so module is mapped: the torchao Python modules tensorrt_llm imports don't actually load the compiled extension during a serve loop.

Re-ran trt-llm smoke (bench-proxy → /v1/chat/completions → PONG) on B200 + r595 with cu12Cudart removed from NIX_LD_LIBRARY_PATH and the verification line gone. Engine starts and serves cleanly.

Sweeps that did NOT pan out:
- Drop cu13TypedefShim from sglang: still fails with the same cutlass cuda_host_adapter.hpp PFN_cuTensorMapEncodeTiled error vllm hits (sglang's flashinfer 0.6.7.post3 has the same bundled cutlass).
- libffi/glib/ncurses: already trimmed in dc60242.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
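The "is this library actually mapped" check above can be reproduced by reading the maps file of the running engine. A minimal sketch, assuming the standard Linux `/proc/<pid>/maps` layout (address, perms, offset, dev, inode, pathname); the function names and the `pid` argument are illustrative:

```python
from pathlib import Path


def mapped_files(maps_text: str) -> set[str]:
    """Return the set of file paths that appear in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) == 6 and fields[5].startswith("/"):
            paths.add(fields[5])
    return paths


def is_mapped(pid: int, needle: str) -> bool:
    """True if any file mapped into `pid` has `needle` in its file name."""
    text = Path(f"/proc/{pid}/maps").read_text()
    return any(needle in Path(p).name for p in mapped_files(text))
```

Calling `is_mapped(engine_pid, "libcudart.so.12")` after a completed request is one way to confirm that a DT_NEEDED entry on an unused extension never results in the library being loaded.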

wokalski and others added 30 commits May 1, 2026 18:34

wokalski and others added 30 commits May 1, 2026 18:24

chore: backfill bench_results.tsv SHAs for iters 46-55

5d213ca

Iters 46-55 (the H100x2 sweep) were committed with TBD-N placeholders in the commit column. Backfilling the actual short SHAs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: backfill bench_results.tsv SHAs for iters 56-59

6d98ca9

Cross-engine H100/H100x2 sweep (vllm + trt-llm). TBD-N placeholders in iters 56-59 → actual short SHAs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: backfill bench_results.tsv SHAs for iters 60-61

9e470dd

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: backfill bench_results.tsv SHAs for iters 62-63

529cfef

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: backfill bench_results.tsv SHA for iter 64

42d0f58

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: backfill bench_results.tsv SHA for iter 65

75f0191

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

wokalski wants to merge 88 commits into master from bench-proxy
