[WIP][NV] add dsv4-fp4-gb300-dynamo-sglang-mtp-1k1k by hshrivastava-droid · Pull Request #1697 · SemiAnalysisAI/InferenceX

hshrivastava-droid · 2026-06-09T21:28:52Z

Note

Low Risk
Additive benchmark and CI config only; no application runtime or auth paths change, though miswired recipes could waste cluster GPU time.

Overview
Adds DeepSeek-V4-Pro FP4 disaggregated SGLang + MTP benchmark coverage for 1k/1k on GB300, as a new top-level entry dsv4-fp4-gb300-dynamo-sglang-mtp-1k1k (separate from the existing 8k/1k MTP config because it pins a newer lmsysorg/sglang image).

nvidia-master.yaml wires 11 fixed-seq-len scenarios (isl/osl 1024) with MTP spec-decoding: three conc=8192 high-throughput disagg layouts (1p1d dep8/dep16, 2p1d dep16) and eight low-latency variants (dep4 vs tp4-tp4 prefill/decode splits with 1/2/4/6 decode workers).
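The scenario count above can be cross-checked with a short enumeration. This is only a sketch of the arithmetic: the layout names come from the overview, while the label format for the low-latency variants is illustrative and not the naming used in nvidia-master.yaml.

```python
from itertools import product

# High-throughput disagg layouts named in the overview (all at conc=8192).
high_throughput = ["1p1d-dep8", "1p1d-dep16", "2p1d-dep16"]

# Low-latency variants: two prefill/decode styles crossed with four
# decode-worker counts. The "<style>-<n>d" label is an illustrative format.
low_latency = [
    f"{style}-{workers}d"
    for style, workers in product(["dep4", "tp4-tp4"], [1, 2, 4, 6])
]

scenarios = high_throughput + low_latency
print(len(high_throughput), len(low_latency), len(scenarios))  # 3 8 11
```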

New Slurm recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/1k1k/ implement those runs: Dynamo + multi-frontend for the high-concurrency layouts and the SGLang cache_aware frontend for the low-latency ones, plus Mooncake disagg and EAGLE draft settings on decode. The low-latency recipes are one YAML per decode-worker count instead of the upstream zip-override templates, so srtctl emits a single job per launcher invocation (this avoids launch_gb300-cw.sh corrupting multiple job IDs in one variable).

perf-changelog.yaml documents the new config key and rationale.

Reviewed by Cursor Bugbot for commit 37100c4. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-06-09T21:29:03Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

cursor · 2026-06-09T21:31:35Z

+
+  model:
+    path: "dsv4-pro"
+    container: "lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e"


Low-latency recipe container missing

High Severity

The two low-latency recipes still pin model.container to lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e, while dsv4-fp4-gb300-dynamo-sglang-mtp-1k1k imports squash only for lmsysorg/sglang:nightly-dev-cu13-20260603-83bc7766. Workers resolve the recipe tag, which is not mapped in srtslurm.yaml and is documented as absent from Docker Hub, so those matrix points can fail at enroot import.

Additional Locations (1)

benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/1k1k/disagg-low-latency-tp4-mtp.yaml#L11-L12

Reviewed by Cursor Bugbot for commit 2aeafb4.
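The mismatch described in this comment could be caught before submission with a small set-membership check. The sketch below is an assumption about how such a check might look, not part of the repo: the image tags and recipe file names are taken from this PR, but the dict layout and the helper `unmapped_containers` are hypothetical.

```python
# Tags the top-level config is said to import (from the review comment).
imported_images = {"lmsysorg/sglang:nightly-dev-cu13-20260603-83bc7766"}

# model.container as pinned by the two low-latency recipes (per the comment).
# The mapping itself is a hand-written stand-in for parsed recipe YAML.
recipe_containers = {
    "disagg-low-latency-dep4-mtp.yaml":
        "lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e",
    "disagg-low-latency-tp4-mtp.yaml":
        "lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e",
}


def unmapped_containers(recipes, imported):
    """Return the recipes whose container tag is not among the imported images."""
    return {name: tag for name, tag in recipes.items() if tag not in imported}


stale = unmapped_containers(recipe_containers, imported_images)
for name, tag in sorted(stale.items()):
    print(f"{name}: {tag} is not imported")
```

With the values quoted in the comment, both low-latency recipes are flagged, which matches the reported failure mode at enroot import.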

github-actions · 2026-06-09T22:27:57Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27237009377
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27237009377

github-actions · 2026-06-10T00:37:56Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27242563991
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27242563991

github-actions · 2026-06-10T04:37:55Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27242563991
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27242563991

cursor · 2026-06-10T04:38:33Z

+      speculative-num-draft-tokens: 4
+
+      mem-fraction-static: 0.94
+      max-running-requests: 1536


Decode cap below benchmark concurrency

Medium Severity

The 1p1d-dep8 high-concurrency recipe drives the benchmark at concurrency 8192, but decode max-running-requests is only 1536 (prefill is capped at 256). Sister recipes in the same PR for dep16 allow far higher in-flight limits, so this point is labeled conc 8192 while the SGLang decode server admits a much smaller batch.

Additional Locations (1)

benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/1k1k/disagg-1p1d-dep8-conc8192-mtp.yaml#L156-L157

Reviewed by Cursor Bugbot for commit 7c91283.
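To put a number on the gap, the ratio of the decode cap to the benchmark concurrency can be computed directly. This is a rough sketch that assumes max-running-requests applies to the decode worker as a whole; the thread does not say whether the cap is per worker or per rank, so the result is only indicative. The helper `admitted_fraction` is hypothetical.

```python
def admitted_fraction(benchmark_concurrency: int, max_running_requests: int) -> float:
    """Share of client concurrency that can be in flight on decode at once,
    assuming the cap covers the whole decode worker (an assumption)."""
    return min(1.0, max_running_requests / benchmark_concurrency)


# Values from the 1p1d-dep8 recipe as quoted in the review comment.
frac = admitted_fraction(8192, 1536)
print(f"{frac:.4f}")  # 0.1875
```

Under that assumption, fewer than one in five of the 8192 concurrent requests could be running on decode at any moment.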

github-actions · 2026-06-10T16:19:54Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27253510982
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27253510982

github-actions · 2026-06-10T21:49:21Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27253510982
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27253510982

github-actions · 2026-06-11T06:18:07Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27253510982
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27253510982

cursor · 2026-06-11T17:13:17Z

+  image: lmsysorg/sglang:nightly-dev-cu13-20260603-83bc7766
+  model: deepseek-ai/DeepSeek-V4-Pro
+  model-prefix: dsv4
+  runner: gb300-nv


Wrong runner for SGLang recipes

High Severity

The new dsv4-fp4-gb300-dynamo-sglang-mtp-1k1k entry uses runner: gb300-nv, while sibling DeepSeek-V4 GB300 dynamo-sglang configs use gb300-cw. launch_gb300-nv.sh never copies staged recipes/sglang/deepseek-v4 into srt-slurm (only glm5 gets that path), and its srtslurm.yaml omits the dsv4-pro alias many new recipes use—so srtctl apply is likely to fail on missing recipes or model preflight.

Reviewed by Cursor Bugbot for commit 47460ef.
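One way to surface this kind of drift is to compare each entry's runner with the runner most of its siblings use. The sketch below is illustrative only: the new entry's keys come from the diff above, but the two sibling entries are hypothetical placeholders (the thread does not name them), and `runner_outliers` is not an existing helper.

```python
from collections import Counter

# Stand-in for parsed master-config entries. Only the first entry's values are
# from this PR; the sibling names are placeholders, not real config keys.
configs = {
    "dsv4-fp4-gb300-dynamo-sglang-mtp-1k1k": {"model-prefix": "dsv4", "runner": "gb300-nv"},
    "sibling-a (hypothetical)": {"model-prefix": "dsv4", "runner": "gb300-cw"},
    "sibling-b (hypothetical)": {"model-prefix": "dsv4", "runner": "gb300-cw"},
}


def runner_outliers(entries):
    """List entries whose runner differs from the most common runner
    among entries sharing the same model-prefix."""
    groups = {}
    for name, cfg in entries.items():
        groups.setdefault(cfg["model-prefix"], []).append((name, cfg["runner"]))
    outliers = []
    for group in groups.values():
        majority, _count = Counter(runner for _, runner in group).most_common(1)[0]
        outliers.extend(name for name, runner in group if runner != majority)
    return outliers


print(runner_outliers(configs))
```

A majority vote is a heuristic, not a rule: a deliberate runner change would also be flagged and would need a human decision.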

github-actions · 2026-06-11T17:48:29Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27366225297
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27366225297

github-actions · 2026-06-11T19:38:14Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27366364441
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27366364441

github-actions · 2026-06-11T19:46:20Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27372623348
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27372623348

github-actions · 2026-06-11T19:48:52Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27372623348
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27372623348

Oseltamivir · 2026-06-11T19:49:01Z

@hshrivastava-droid The GB300 CW are back

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).


Reviewed by Cursor Bugbot for commit 37100c4.

cursor · 2026-06-11T20:11:15Z

+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000'
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000'
+    SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: '1'
+    SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0"  # CAR_V2 is single-node only.


Missing distributed timeout dep16

Medium Severity

disagg-1p1d-dep8-conc8192-mtp.yaml sets TORCH_DISTRIBUTED_DEFAULT_TIMEOUT to 1800 in decode_environment for multi-node decode, but the new dep16 recipes (four decode nodes, TP16) omit it. Longer multi-node decode init can hit the default shorter timeout.

Additional Locations (1)

benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/1k1k/disagg-2p1d-dep16-conc8192-mtp.yaml#L65-L93

Reviewed by Cursor Bugbot for commit 37100c4.
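The omission flagged here is easy to check mechanically once the recipes are parsed. The following is a minimal sketch, assuming each recipe exposes a decode_environment mapping as the comment describes. The dicts are hand-written stand-ins for the parsed YAML, the placement of the two SGLANG_* keys in the dep16 recipe and the string form of "1800" are assumptions, and `missing_timeout` is a hypothetical helper.

```python
REQUIRED_KEY = "TORCH_DISTRIBUTED_DEFAULT_TIMEOUT"

# Hand-written stand-ins for two parsed recipes; only the keys relevant
# to the review comment are shown.
recipes = {
    "disagg-1p1d-dep8-conc8192-mtp.yaml": {
        "decode_environment": {REQUIRED_KEY: "1800"},
    },
    "disagg-2p1d-dep16-conc8192-mtp.yaml": {
        "decode_environment": {
            "SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT": "100000",
            "SGLANG_DISAGGREGATION_WAITING_TIMEOUT": "100000",
        },
    },
}


def missing_timeout(parsed):
    """Return, sorted, the recipes whose decode_environment lacks the timeout key."""
    return sorted(
        name
        for name, recipe in parsed.items()
        if REQUIRED_KEY not in recipe.get("decode_environment", {})
    )


missing = missing_timeout(recipes)
print(missing)
```

Whether the default timeout is actually too short for dep16 decode init is an empirical question; the check only reports the inconsistency between sibling recipes.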

github-actions · 2026-06-11T21:32:03Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27373746281
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27373746281

github-actions · 2026-06-11T21:52:48Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27373746281
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27373746281

github-actions · 2026-06-12T16:15:43Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27373746281
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27373746281

add dsv4 1k1k

2aeafb4

hshrivastava-droid requested a review from a team June 9, 2026 21:28

hshrivastava-droid requested review from jgangani and kedarpotdar-nv as code owners June 9, 2026 21:28

hshrivastava-droid added the full-sweep-enabled label Jun 9, 2026

github-project-automation Bot added this to InferenceMAX Board Jun 9, 2026

cursor Bot reviewed Jun 9, 2026


update configs

984a91e

cursor Bot reviewed Jun 9, 2026


Comment thread ...hmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/1k1k/disagg-low-latency-dep4-mtp.yaml Outdated

Comment thread ...ulti_node/srt-slurm-recipes/sglang/deepseek-v4/1k1k/disagg-low-latency-1p1d-tp4-tp4-mtp.yaml

split configs

7c91283

cursor Bot reviewed Jun 10, 2026


update runner

47460ef

cursor Bot reviewed Jun 11, 2026


hshrivastava-droid added 2 commits June 11, 2026 10:45

Merge branch 'main' into nv/dsv4-gb300-v2

71ac4b8

Update perf-changelog.yaml

fd84ba5

hshrivastava-droid added full-sweep-enabled and removed full-sweep-enabled labels Jun 11, 2026

Update nvidia-master.yaml

37100c4

cursor Bot reviewed Jun 11, 2026

