[WIP] Add B300 config: qwen3.5-fp8-sglang-mtp #1035

Open
Ankur-singh wants to merge 1 commit into main from b300/qwen3.5-fp8-sglang-mtp

Conversation

@Ankur-singh
Collaborator

Summary

  • Add qwen3.5-fp8-b300-sglang-mtp config

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

1 similar comment

Comment thread: runners/launch_b300-nv.sh
Comment on lines +220 to +228
export MODEL="/scratch/models/${MODEL#*/}"
SQUASH_FILE="/data/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')

salloc --partition=$SLURM_PARTITION --account=$SLURM_ACCOUNT --gres=gpu:$TP --exclusive --time=180 --no-shell --job-name="$RUNNER_NAME"
JOB_ID=$(squeue --name="$RUNNER_NAME" -u "$USER" -h -o %A | head -n1)

srun --jobid=$JOB_ID bash -c "enroot import -o $SQUASH_FILE docker://$IMAGE"
Contributor


🟡 The newly added single-node else branch in runners/launch_b300-nv.sh allocates a SLURM job with salloc and then queries squeue to obtain JOB_ID, but never validates that JOB_ID is non-empty before passing it to srun --jobid=$JOB_ID. If salloc fails (quota exceeded, no available nodes), squeue returns nothing and the subsequent srun calls fail with a cryptic SLURM error rather than a clear diagnostic message. Consider adding the same guard that the multinode branch of this same script already uses: if [ -z "$JOB_ID" ]; then echo "Error: Failed to extract JOB_ID"; exit 1; fi.

Extended reasoning...

What the bug is and how it manifests

The single-node else branch added in this PR (runners/launch_b300-nv.sh lines ~220-228) follows a two-step pattern to obtain a SLURM job ID: (1) salloc --no-shell to allocate the job, and (2) JOB_ID=$(squeue --name="$RUNNER_NAME" -u "$USER" -h -o %A | head -n1) to retrieve the job ID. If salloc fails for any reason, squeue will return an empty string and JOB_ID will be empty. The script then proceeds to call srun --jobid=$JOB_ID ... with an empty value, causing SLURM to emit a cryptic error like srun: error: Invalid job id specified or similar, which is harder to diagnose than a clear failure message.

The specific code path that triggers it

The script has no set -e, so a failed salloc does not automatically abort execution. The failing scenario is: quota exceeded on the partition → salloc exits non-zero → squeue finds no job with the given name → JOB_ID is empty string → srun --jobid= fails with a SLURM error that does not explain the root cause.
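An alternative to guarding after the fact is to check salloc's exit status directly, which surfaces the root cause at the allocation step. A minimal sketch, assuming a hypothetical `allocate_job` helper (the real script calls salloc inline with more flags):

```shell
# Hypothetical helper, not in the script: allocate a job and resolve its ID,
# failing fast when salloc itself fails instead of returning an empty ID.
allocate_job() {
    runner_name=$1
    # Abort immediately if the allocation is rejected (quota, no nodes, ...).
    if ! salloc --no-shell --job-name="$runner_name"; then
        echo "Error: salloc failed; no job was allocated" >&2
        return 1
    fi
    # Only query squeue once we know a job actually exists.
    squeue --name="$runner_name" -u "$USER" -h -o %A | head -n1
}
```

Either approach works; the key point is that the allocation failure is reported where it happens, not several commands later.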

Why existing code doesn't prevent it

The multinode branch (earlier in the same file, around lines 115-118) handles a similar situation—extracting a job ID from srtctl text output—and does include an explicit guard: if [ -z "$JOB_ID" ]; then echo "Error: Failed to extract JOB_ID from srtctl output"; exit 1; fi. The single-node branch lacks this guard entirely.

Addressing the refutation

One verifier correctly notes that this exact pattern (no empty-JOB_ID check after squeue) exists across other scripts: launch_b200-dgxc-slurm.sh:222-223, launch_h200-dgxc-slurm.sh:237-238, launch_mi355x-amds.sh:162-163, launch_h100-dgxc-slurm.sh:234-235. This is a valid observation and weakens the severity. However, (a) the fact that a pattern is widespread does not make it correct, (b) the code in question is new code added by this PR (not pre-existing in this file), and (c) the multinode branch of this same script demonstrates that the codebase does apply explicit guards in some contexts. The refuter's claim that the multinode comparison is a "false equivalence" is partially correct—the extraction methods differ—but the underlying issue (empty JOB_ID proceeding to srun) applies equally.

What the impact would be

A failed salloc results in srun failing with a SLURM error that does not explain the root cause. The benchmark CI job will eventually fail (the result file won't be produced), but the error will appear to be an srun problem rather than an allocation problem, making it harder and slower to diagnose. No data is corrupted; the impact is purely diagnostic and operational.

How to fix it

After the squeue line, add:

if [ -z "$JOB_ID" ]; then
    echo "Error: Failed to extract JOB_ID. salloc may have failed."
    exit 1
fi

Step-by-step proof

  1. CI triggers the B300 single-node benchmark job.
  2. The runner is at quota: salloc --partition=batch_1 ... --no-shell fails with exit code 1; no job is created.
  3. Since there is no set -e, execution continues.
  4. JOB_ID=$(squeue --name="$RUNNER_NAME" -u "$USER" -h -o %A | head -n1) — squeue finds no matching job; JOB_ID is now "".
  5. srun --jobid= bash -c "enroot import ..." — SLURM rejects the empty jobid with a non-obvious error.
  6. Again no set -e, so srun failure may continue to the second srun call.
  7. The CI step eventually fails at the result-file check, not at the allocation step, making root-cause analysis harder.
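The failure chain above can be reproduced with stubbed SLURM commands. This is a toy sketch: `salloc`, `squeue`, and `srun` are shell-function fakes, and `launch` is a stand-in for the single-node branch with the proposed guard added:

```shell
# Fake SLURM commands (shell functions, not real SLURM binaries).
salloc() { return 1; }   # simulate: quota exceeded, no job is created
squeue() { :; }          # no matching job, so squeue prints nothing
srun()   { echo "srun: error: Invalid job id specified" >&2; return 1; }

# Stand-in for the single-node branch, with the proposed guard added.
launch() {
    salloc --no-shell --job-name="$RUNNER_NAME"   # fails, but there is no set -e
    JOB_ID=$(squeue --name="$RUNNER_NAME" -u "$USER" -h -o %A | head -n1)
    if [ -z "$JOB_ID" ]; then
        echo "Error: Failed to extract JOB_ID. salloc may have failed." >&2
        return 1
    fi
    srun --jobid="$JOB_ID" bash -c "enroot import ..."
}
```

With the guard in place, `launch` stops with the clear diagnostic and the cryptic srun error is never emitted.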

