[WIP] Add B300 config: qwen3.5-fp8-sglang-mtp #1034
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')

salloc --partition=$SLURM_PARTITION --account=$SLURM_ACCOUNT --gres=gpu:$TP --exclusive --time=180 --no-shell --job-name="$RUNNER_NAME"
JOB_ID=$(squeue --name="$RUNNER_NAME" -u "$USER" -h -o %A | head -n1)
🟡 The new B300 single-node else-branch unconditionally re-imports the Docker image via enroot import on every benchmark run, even if the squash file already exists and is valid. The equivalent B200 single-node path checks for squash file validity with unsquashfs -l before importing, skipping the 5-10+ minute re-import when already present; the B300 path should do the same.
Extended reasoning...
What the bug is and how it manifests
In the new single-node else branch of runners/launch_b300-nv.sh (line 228 per the diff), the code unconditionally executes:
srun --jobid=$JOB_ID bash -c "enroot import -o $SQUASH_FILE docker://$IMAGE"
There is no check for whether $SQUASH_FILE already exists and is valid before attempting the import. This means every single-node B300 benchmark run re-imports the Docker image from scratch.
The specific code path that triggers it
The else branch is entered when $IS_MULTINODE != "true". The new qwen3.5-fp8-b300-sglang-mtp config sets multinode: false, so every CI run for this config triggers the single-node path. The squash file path /data/squash/... resides on shared NFS storage that persists across runs (as evidenced by the multinode path also writing there), meaning the file will often already exist on subsequent runs.
Why existing code doesn't prevent it
The B300 single-node branch was newly introduced by this PR and simply omits the existence check. Compare to the analogous B200 single-node path in launch_b200-dgxc-slurm.sh, which wraps the import in a conditional using flock and unsquashfs -l to validate and skip re-imports. This PR missed applying the same guard.
Addressing the refutation
The refutation correctly notes that several other runner scripts (e.g., the B300 multinode path at line 64) also skip the existence check. However, those are pre-existing patterns — this PR introduces a new single-node path that is directly analogous to the B200 single-node path, which does have the check. The inconsistency already exists in the codebase, but this PR has the opportunity to follow the better pattern from the B200 reference implementation rather than the less careful multinode pattern.
Additionally, one verifier raised the concern that if enroot import -o does not silently overwrite existing files (the B200 path explicitly does rm -f $SQUASH_FILE before re-importing, suggesting overwrite is not guaranteed), successive runs with a stale squash file could fail outright rather than just waste time.
Impact
Each single-node B300 benchmark run wastes 5-10+ minutes re-importing a large model container image. With multiple concurrency configs per sequence length (1k1k and 8k1k), this compounds across the search space. At worst, if enroot does not overwrite existing files atomically, jobs may fail on subsequent runs.
How to fix it
Mirror the B200 single-node pattern: check with unsquashfs -l $SQUASH_FILE before importing, and only call enroot import (preceded by rm -f $SQUASH_FILE) if the file is missing or invalid. Optionally wrap with flock for safe concurrent access.
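A minimal sketch of that guard, assuming unsquashfs, enroot, and flock are available on the node. The helper name ensure_squash, the .lock file location, and the 600-second lock timeout are illustrative choices, not taken from launch_b200-dgxc-slurm.sh:

```shell
# Sketch: import the container image only when the squash file is missing
# or invalid. ensure_squash is a hypothetical helper name.
ensure_squash() {
    local squash_file="$1" image="$2"
    local lock_file="${squash_file}.lock"   # assumed lock location

    (
        # Serialize concurrent jobs that share the same squash file.
        flock -w 600 9 || exit 1

        # unsquashfs -l lists the archive contents; a non-zero exit status
        # means the file is missing or not a readable squashfs image.
        if unsquashfs -l "$squash_file" >/dev/null 2>&1; then
            echo "reusing existing squash file: $squash_file"
        else
            # Remove any stale file first, since this sketch does not
            # assume that enroot import overwrites an existing output.
            rm -f "$squash_file"
            enroot import -o "$squash_file" "docker://$image"
        fi
    ) 9>"$lock_file"
}
```

In the runner, one way to call it would be from inside the existing srun invocation, for example srun --jobid=$JOB_ID bash -c "$(declare -f ensure_squash); ensure_squash '$SQUASH_FILE' '$IMAGE'", so that the check runs on the allocated node rather than on the login host.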
Step-by-step proof
- CI triggers qwen3.5-fp8-b300-sglang-mtp for the first time. IS_MULTINODE is not set (config has multinode: false), so the else branch runs.
- SQUASH_FILE is set to e.g. /data/squash/lmsysorg_sglang_v0.5.10.post1-cu130.sqsh.
- The script unconditionally runs srun ... enroot import -o $SQUASH_FILE docker://lmsysorg/sglang:v0.5.10.post1-cu130, which takes 5-10 minutes.
- CI triggers the same config a second time (e.g., for a follow-up commit). The squash file at /data/squash/ still exists from run 1.
- The script again unconditionally runs enroot import: another 5-10 minutes wasted, or potentially a failure if enroot refuses to overwrite.
Summary
qwen3.5-fp8-b300-sglang-mtp config