
Sync dsv4-fp4-b300-trt recipes with B300 agg frontier config #1703

Merged: Oseltamivir merged 9 commits into main from sync-dsv4-fp4-b300-trt-0608-config on Jun 12, 2026

@Oseltamivir (Collaborator) commented Jun 10, 2026
What

B300 analog of #1699 (which did this for B200). Sync the DeepSeek-V4-Pro aggregated frontier configs into the single-node TensorRT-LLM B300 recipes and bump the feature image. The non-MTP recipe carries the MTP0 settings; the MTP recipe carries the MTP settings.

Changes

Image (.github/configs/nvidia-master.yaml)

  • dsv4-fp4-b300-trt and dsv4-fp4-b300-trt-mtp image bumped feat-deepseek_v4-9aa3715 → feat-deepseek_v4-c185066.

benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt.sh (MTP0)

  • Worker envs (all overridable): TRTLLM_SERVER_DISABLE_GC=1, TRTLLM_WORKER_DISABLE_GC=1, NCCL_GRAPH_MIXING_SUPPORT=0, MIMALLOC_PURGE_DELAY=0, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
  • kv_cache_config.free_gpu_memory_fraction: 0.9 (TP / no DP-attn) / 0.7 (DP-attn), was 0.50.
  • attention_dp_config: batching_wait_iters 0 → 30, drop timeout_iters.
  • stream_interval 10 → 100; moe_config.use_low_precision_moe_combine: true.
  • max_num_tokens drops the OSL term: ISL + 256.
  • MOE_BACKEND made overridable (default TRTLLM; MEGAMOE_DEEPGEMM at high conc on 1k ISL).
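
A minimal sketch of how the overridable defaults and the derived values in the list above could be written in bash. The helper names (`kv_fraction`, `max_num_tokens`) and the `true`/`false` DP-attn argument are assumptions for illustration, not the recipe's actual code; only the env names and values come from this description.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and argument conventions are hypothetical.

# Worker envs: keep a caller-supplied value, otherwise fall back to the PR's default.
export TRTLLM_SERVER_DISABLE_GC="${TRTLLM_SERVER_DISABLE_GC:-1}"
export TRTLLM_WORKER_DISABLE_GC="${TRTLLM_WORKER_DISABLE_GC:-1}"
export NCCL_GRAPH_MIXING_SUPPORT="${NCCL_GRAPH_MIXING_SUPPORT:-0}"
export MIMALLOC_PURGE_DELAY="${MIMALLOC_PURGE_DELAY:-0}"
export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"

# free_gpu_memory_fraction for the non-MTP recipe: 0.7 with DP-attn, 0.9 without.
kv_fraction() {
  local dp_attention="$1"
  if [ "$dp_attention" = "true" ]; then
    echo "0.7"
  else
    echo "0.9"
  fi
}

# max_num_tokens without the OSL term: ISL + 256.
max_num_tokens() {
  local isl="$1"
  echo $(( isl + 256 ))
}
```

Because every export uses the `${VAR:-default}` form, a value set by the caller before the recipe runs is preserved, which is what "overridable" means here.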

benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt_mtp.sh (MTP)

Same as above, plus:

  • DP-attn free_gpu_memory_fraction = 0.6.
  • enable_lm_head_tp_in_adp: true on the DP-attn path.
  • speculative_config uses max_draft_len; default level 2 → 3 (overridable via TRTLLM_DSV4_MTP_NUM_NEXTN_LAYERS), stepping back to 2 at high conc on 8k ISL.
  • max_num_tokens = ISL + (draft+1)*batch + 256 (drops OSL; keeps the speculative-verification headroom).
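
The MTP token budget above can be checked with a small arithmetic sketch. The function name and its positional arguments are hypothetical, and treating `batch` as the maximum batch size is an assumption; the formula and the `TRTLLM_DSV4_MTP_NUM_NEXTN_LAYERS` override are taken from the list above. The concurrency at which the default steps back to 2 is not stated here, so it is not modeled.

```shell
#!/usr/bin/env bash
# Sketch only: helper name and argument order are hypothetical.

# max_num_tokens for MTP: ISL + (draft + 1) * batch + 256.
# Each sequence may need draft + 1 tokens verified per step, hence the (draft + 1) * batch term.
mtp_max_num_tokens() {
  local isl="$1" draft="$2" batch="$3"
  echo $(( isl + (draft + 1) * batch + 256 ))
}

# Draft length stays overridable through the env var named in this PR; 3 is the new default.
MTP_DRAFT_LEN="${TRTLLM_DSV4_MTP_NUM_NEXTN_LAYERS:-3}"

mtp_max_num_tokens 8192 "$MTP_DRAFT_LEN" 256
```

With ISL 8192, draft length 3, and batch 256, the formula gives 8192 + 4 * 256 + 256 = 9472.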

Deliberate non-changes

  • B300-specific bits preserved: the MODEL_PATH download block, TRTLLM_MHC_ENABLE_FUSED_HC=1, and trtllm-serve "$MODEL_PATH".
  • Search space left as-is. Unlike #1699 (which raised the B200 MTP conc-end), the B300 fixed-seq-len sweeps already cover the high-concurrency regime the recipe changes target (1k up to 2048, 8k up to 1024), so no conc-end edit is needed.
  • cuda_graph_config / max_batch_size left CONC-derived.
  • max_seq_len kept floored at ≥ 8192.

Validation

🤖 Generated with Claude Code


Note

Low Risk
Benchmark orchestration and TRT-LLM serve YAML/env defaults only; no application auth or production serving paths.

Overview
Aligns B300 single-node DeepSeek-V4-Pro TensorRT-LLM benchmarks with the aggregated frontier settings (B200 analog in #1699): both dsv4-fp4-b300-trt and dsv4-fp4-b300-trt-mtp use image feat-deepseek_v4-c185066, and the 1k1k tp8/ep8 DP-attn sweep conc-end is lowered from 2048 → 1024 so runs avoid the MLA-overlap CUDA-graph crash regime.

The dsv4_fp4_b300_trt.sh and dsv4_fp4_b300_trt_mtp.sh recipes now default worker/runtime env (GC off, NCCL graph mixing off, alloc tweaks), tune KV cache fractions by DP path, DP batching_wait_iters: 30, stream_interval: 100, use_low_precision_moe_combine, and max_num_tokens without the OSL term. CUDA-graph max_batch_size is capped at 1024 while runtime batch stays at CONC; MoE backend switches to MEGAMOE_DEEPGEMM at high concurrency on short ISL. MTP adds max_draft_len, variable default draft length, and enable_lm_head_tp_in_adp on DP-attn.

perf-changelog.yaml documents the above for both config keys.

Reviewed by Cursor Bugbot for commit 9f02c5d. Bugbot is set up for automated code reviews on this repo.

B300 analog of PR #1699 (B200). Apply the same TensorRT-LLM recipe sync
to dsv4_fp4_b300_trt.sh (MTP0) and dsv4_fp4_b300_trt_mtp.sh (MTP), and
bump the dsv4-fp4-b300-trt / -mtp images to feat-deepseek_v4-c185066.

Recipe changes (both):
- Worker envs (overridable): TRTLLM_SERVER_DISABLE_GC, TRTLLM_WORKER_DISABLE_GC,
  NCCL_GRAPH_MIXING_SUPPORT=0, MIMALLOC_PURGE_DELAY=0,
  PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
- kv_cache free_gpu_memory_fraction: 0.9 (no DP-attn) / 0.7 non-MTP, 0.6 MTP
  (DP-attn), was 0.50.
- attention_dp_config batching_wait_iters 0 -> 30, drop timeout_iters.
- stream_interval 10 -> 100; moe_config.use_low_precision_moe_combine: true.
- MOE_BACKEND overridable, switches to MEGAMOE_DEEPGEMM at high conc on 1k ISL.
- max_num_tokens drops the OSL term.

MTP additionally: max_draft_len (was num_nextn_predict_layers), default draft
3 stepping to 2 at high conc on 8k ISL, enable_lm_head_tp_in_adp on DP-attn.

B300-specific bits preserved: MODEL_PATH download block, TRTLLM_MHC_ENABLE_FUSED_HC=1,
trtllm-serve "$MODEL_PATH". B300 search space left as-is (already covers the
high-concurrency frontier the recipe changes target).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions (Contributor) commented

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Covers the dsv4-fp4-b300-trt / -mtp image bump to feat-deepseek_v4-c185066
and the B300 agg frontier recipe sync (PR #1703, B300 analog of #1699).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Oseltamivir and others added 2 commits June 10, 2026 22:00
Cap the 8k1k tp8/ep8 DP-attn sweep at conc 256 (was 256-1024) for
dsv4-fp4-b300-trt. trt-mtp and the 1k1k sweep are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Comment thread perf-changelog.yaml Outdated

Oseltamivir and others added 2 commits June 11, 2026 09:23
Revert the dsv4-fp4-b300-trt 8k1k conc-end trim (back to 1024) and instead
cap cuda_graph_config.max_batch_size at 1024 on both b300-trt and
b300-trt-mtp.

TRTLLM_MLA_EXTRA_OVERLAP hands MLA prologue tensors across CUDA streams
without record_stream(), so CUDA-graph warmup at decode batch >1024
(repros at 1088, e.g. tp8/ep8 dp-attn conc-2048 on B300) use-after-frees
into CUDA_ERROR_ILLEGAL_ADDRESS. Capping graph capture at 1024 avoids
warming up the >1024 graph; runtime --max_batch_size stays = CONC, so
batches >1024 run eager. Workaround until NVIDIA/TensorRT-LLM#15265 ships
in the image.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
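
A minimal sketch of the cap described in the commit message above: CUDA-graph capture is limited to 1024 while the runtime batch size still follows CONC. The function and variable names are hypothetical; only the 1024 cap and the "runtime stays = CONC" behavior come from the message.

```shell
#!/usr/bin/env bash
# Sketch only: helper and variable names are hypothetical.

CUDA_GRAPH_CAP=1024

# cuda_graph_config.max_batch_size = min(CONC, 1024), so no graph above 1024 is warmed up.
cuda_graph_max_batch_size() {
  local conc="$1"
  if [ "$conc" -gt "$CUDA_GRAPH_CAP" ]; then
    echo "$CUDA_GRAPH_CAP"
  else
    echo "$conc"
  fi
}

# The runtime --max_batch_size is left equal to CONC; batches above the cap run eager.
runtime_max_batch_size() {
  local conc="$1"
  echo "$conc"
}
```

For CONC 2048 this yields a graph cap of 1024 and a runtime batch size of 2048, which matches the eager fallback for batches above 1024 that the message describes.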

Remove the conc=2048 point on the 1k1k tp8/ep8 DP-attn row for both
dsv4-fp4-b300-trt and dsv4-fp4-b300-trt-mtp (now 512-1024). This is the
batch regime that triggers the MLA-overlap warmup crash (NVIDIA/TensorRT-LLM#15265);
the cudagraph cap at 1024 stays as a safety net. 8k1k unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 9f02c5d.

MOE_BACKEND="${MOE_BACKEND:-MEGAMOE_DEEPGEMM}"
else
MOE_BACKEND="${MOE_BACKEND:-TRTLLM}"
fi


MoE threshold never matches sweep

Medium Severity

The non-MTP recipe only selects MEGAMOE_DEEPGEMM when CONC is at least 2048, but this PR caps the 1k tp8/ep8 DP-attn sweep at 1024. Scheduled runs never hit that branch, so high-concurrency points keep TRTLLM despite the comment and MTP sibling using MEGAMOE_DEEPGEMM from 512 upward.
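
The mismatch can be illustrated with a small sketch. The condition in the quoted snippet is elided, so the `if` below is an assumption reconstructed from this finding (1k ISL taken as 1024, threshold passed as a parameter); the function name and the listed sweep points are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the real condition is not shown in the snippet; this form is assumed.

select_moe_backend() {
  local isl="$1" conc="$2" threshold="$3"
  if [ "$isl" -eq 1024 ] && [ "$conc" -ge "$threshold" ]; then
    echo "${MOE_BACKEND:-MEGAMOE_DEEPGEMM}"
  else
    echo "${MOE_BACKEND:-TRTLLM}"
  fi
}

# Illustrative points from a 1k sweep capped at 1024, checked against both thresholds
# mentioned in the finding (2048 for non-MTP, 512 for the MTP sibling).
for conc in 512 1024; do
  echo "conc=$conc threshold=2048 -> $(select_moe_backend 1024 "$conc" 2048)"
  echo "conc=$conc threshold=512  -> $(select_moe_backend 1024 "$conc" 512)"
done
```

With `MOE_BACKEND` unset, a threshold of 2048 never selects MEGAMOE_DEEPGEMM for these points, while a threshold of 512 selects it for both.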

Additional Locations (1)


@Oseltamivir (Collaborator, Author) commented

/reuse-sweep-run

@Oseltamivir Oseltamivir merged commit d78e5ea into main Jun 12, 2026
73 checks passed
@Oseltamivir Oseltamivir deleted the sync-dsv4-fp4-b300-trt-0608-config branch June 12, 2026 00:15