
[WIP] [AMD/ROCM] atom glm5 fp4 on mi355x #1043

Open
seungrokj wants to merge 3 commits into main from srok/atom_glm5_fp4

Conversation

@seungrokj
Collaborator

@seungrokj seungrokj commented Apr 16, 2026

Hi,

This is a work in progress: internally tested, shipping soon.

Note: this uses amd/GLM-5.1-MXFP4 instead of amd/GLM-5-MXFP4.

cc. @ChangLiu0709 @andyluo7 @chunfangamd @ajith-sirra-amd

Regards,
Seungrok

Signed-off-by: seungrokj <seungrok.jung@amd.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first so we can merge your PR into the master branch. Let's ensure the documentation is first class so the entire ML community can benefit from your hard work. Thank you!

Signed-off-by: seungrokj <seungrok.jung@amd.com>
Signed-off-by: seungrokj <seungrok.jung@amd.com>
@seungrokj seungrokj changed the title [WIP] [AMD/ROCM] atom glm5 fp4 on mi355x [WIP] [AMD/ROCM] atom glm5.1 fp4 on mi355x Apr 16, 2026
Comment on lines +28 to +32
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
CALCULATED_MAX_MODEL_LEN=""
else
CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
fi
Contributor


🔴 The new glm5_fp4_mi355x_atom.sh is missing the EVAL_ONLY handling block present in other atom scripts (e.g., dsr1_fp4_mi355x_atom.sh, gptoss_fp4_mi355x_atom.sh lines 34-37). When EVAL_ONLY=true is set, the server starts with --max-model-len 10240 instead of EVAL_MAX_MODEL_LEN (typically 16384+), causing a context-length mismatch between the server and lm-eval that leads to OOM errors or silent evaluation failures. Add the missing block after line 32: if [ "${EVAL_ONLY}" = "true" ]; then setup_eval_context; CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "; fi.

Extended reasoning...

What the bug is and how it manifests

The glm5_fp4_mi355x_atom.sh script is missing the EVAL_ONLY context-setup block that peer atom scripts use to reconfigure the server's context length when running in evaluation-only mode. Without it, when a user sets EVAL_ONLY=true, the script skips the throughput benchmark (handled by run_benchmark_serving's internal check) but the server was already started with an undersized context window (--max-model-len 10240 for non-1024/1024 workloads). The lm-eval framework then requests completions up to EVAL_MAX_MODEL_LEN (default 16384 or computed from ISL+OSL+200), which exceeds the server's configured limit.

The specific code path that triggers it

In other atom scripts (e.g., dsr1_fp4_mi355x_atom.sh lines 34-37, gptoss_fp4_mi355x_atom.sh lines 34-37), after computing CALCULATED_MAX_MODEL_LEN from ISL/OSL, there is:

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "
fi

setup_eval_context (defined in benchmark_lib.sh) sets EVAL_MAX_MODEL_LEN based on the model's native context length (or ISL+OSL+200), then CALCULATED_MAX_MODEL_LEN is overridden to that larger value before the server starts. The new script omits this block entirely — the server always starts with whichever context limit was computed from the ISL/OSL check (10240 for most workloads).

Why existing code doesn't prevent it

EVAL_ONLY is not listed in check_env_vars, so there is no runtime guard. The script's RUN_EVAL check on line 69 only gates the lm-eval call, it does not fix the server's context size. Once the server is started with --max-model-len 10240, it cannot be reconfigured without restarting.
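The absence of a runtime guard can be illustrated with a small bash sketch; `guard_eval_context` is a hypothetical helper for illustration, not a function that exists in benchmark_lib.sh:

```shell
#!/bin/bash
# Hedged sketch of a runtime guard the script could add. check_env_vars lives
# in benchmark_lib.sh and is not shown in this PR, so this is a stand-in that
# only demonstrates the missing check: EVAL_ONLY=true with no eval context set.
guard_eval_context() {
    if [ "${EVAL_ONLY:-false}" = "true" ] && [ -z "${EVAL_MAX_MODEL_LEN:-}" ]; then
        echo "EVAL_ONLY=true but EVAL_MAX_MODEL_LEN is unset; call setup_eval_context first" >&2
        return 1
    fi
    return 0
}

# Benchmark mode passes the guard; unprepared eval mode would fail it.
EVAL_ONLY=false guard_eval_context && echo "benchmark mode: ok"
```

With such a guard in place, a misconfigured EVAL_ONLY run would fail fast at startup instead of starting the server with an undersized context window.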

Impact

Any CI or ad-hoc run with EVAL_ONLY=true will either: (a) receive context-exceeded errors from the server when lm-eval submits prompts longer than 10240 tokens, producing incorrect/truncated scores; (b) trigger OOM on the GPU as the server tries to accommodate oversized KV-cache requests; or (c) silently truncate outputs, yielding misleadingly low evaluation numbers. The benchmark mode is unaffected, so this bug is invisible during normal benchmarking.

Addressing the refutation

The refutation notes that glm5_fp8_mi355x_atom.sh and kimik2.5_fp4_mi355x_atom.sh also lack this block — true, but that means those scripts share the same latent bug. The fact that the fp4 script was modeled after fp8 means it inherited the deficiency rather than being deliberately designed without eval support. The dsr1 and gptoss scripts demonstrate the correct pattern. Two scripts being broken does not make a third script correct.

How to fix it and step-by-step proof

Add after line 32 (after the ISL/OSL CALCULATED_MAX_MODEL_LEN block):

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "
fi

Concrete example: ISL=4096, OSL=1024 (a common eval workload).

  1. ISL≠1024 or OSL≠1024, so CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 ".
  2. User sets EVAL_ONLY=true.
  3. Without the fix: server starts with --max-model-len 10240. setup_eval_context is never called, so EVAL_MAX_MODEL_LEN falls back to 0 or the model default (e.g., 131072 for GLM-4). lm-eval submits a prompt of 4096+1024+200=5320 tokens — just within 10240, so no immediate crash, but the eval context is wrong and larger GLM-5 native context means lm-eval may use sequences far exceeding 10240, causing rejections or OOM.
  4. With the fix: setup_eval_context computes EVAL_MAX_MODEL_LEN=5320, server starts with --max-model-len 5320, lm-eval uses the same limit — consistent and correct.
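The selection logic with the fix applied can be sketched in isolation. This assumes the `ISL + OSL + 200` formula described above; the real setup_eval_context in benchmark_lib.sh may compute EVAL_MAX_MODEL_LEN differently:

```shell
#!/bin/bash
# Hedged sketch of the fixed context-length selection. setup_eval_context is a
# stand-in for the helper in benchmark_lib.sh, assumed to use ISL + OSL + 200.
ISL="${ISL:-4096}"
OSL="${OSL:-1024}"
EVAL_ONLY="${EVAL_ONLY:-false}"

setup_eval_context() {
    EVAL_MAX_MODEL_LEN=$(( ISL + OSL + 200 ))
}

# Existing logic in the PR: benchmark context from the ISL/OSL check.
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
    CALCULATED_MAX_MODEL_LEN=""
else
    CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
fi

# The missing block: override the benchmark context when running eval only.
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "
fi

echo "server args:${CALCULATED_MAX_MODEL_LEN}"
```

Running this with `EVAL_ONLY=true ISL=4096 OSL=1024` yields `--max-model-len 5320`, matching step 4 above; without the override block it would stay at 10240 regardless of the eval workload.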

start_gpu_monitor

set -x
pip install -U transformers
Contributor


🟣 Line 44 runs pip install -U transformers with no version pin, making benchmark results non-reproducible across runs and vulnerable to silent API breakage if a new transformers release is incompatible with atom or GLM5 FP4. This same unpinned install exists in the fp8 atom counterpart script (pre-existing), but other GLM5 scripts in the repo (glm5_fp8_b200.sh, glm5_fp8_mi355x.sh) use pinned versions — consider pinning to a specific version or git SHA before publishing as an official benchmark.

Extended reasoning...

What the bug is and how it manifests

Line 44 of glm5_fp4_mi355x_atom.sh runs pip install -U transformers immediately before launching the atom server, with no version constraint. This means each benchmark run installs whatever the latest transformers release is at that moment. Two runs separated by days or weeks could use different transformers versions, producing incomparable results.

The specific code path

The install runs inside set -x after GPU monitoring has started but before the server process is launched. If the install fetches an incompatible transformers version, the atom server may fail to start or behave incorrectly — and since the script uses background launch + wait_for_server_ready, a subtle startup failure might not be immediately obvious.

Why existing code doesn't prevent it

The -U (upgrade) flag explicitly requests the latest available version. There is no ==, >=, or git+...@<sha> constraint anywhere. The check_env_vars block at the top checks for required environment variables but does not validate the software environment.

Why the refutation deserves a response

One verifier argues this is intentional, since the fp8 counterpart (glm5_fp8_mi355x_atom.sh) uses the same unversioned install. Consistency with an existing pattern does not eliminate the risk — it only means the issue is pre-existing. The fp8 atom script has the same reproducibility problem, and the fp4 script inherits it. The comparison scripts for other backends (glm5_fp8_b200.sh: transformers==5.2.0; glm5_fp8_mi355x.sh: pinned git SHA) demonstrate the repo does value version pinning elsewhere.

Step-by-step proof of the problem

  1. Run A (today): pip install -U transformers installs transformers 4.51.0.
  2. Benchmark completes successfully; results recorded.
  3. transformers 4.52.0 is released with a breaking change to a tokenizer API used by GLM5.
  4. Run B (next week): pip install -U transformers installs 4.52.0.
  5. The atom server starts but produces different output, or fails silently — Run B results are incomparable with Run A.

How to fix it

Pin to the same version or commit SHA used by sister scripts. For example:
pip install --no-deps "transformers==5.2.0"
or
pip install -U --no-cache-dir "git+https://github.com/huggingface/transformers.git@<specific-sha>"
Ideally, also fix glm5_fp8_mi355x_atom.sh at the same time since it has the same issue.
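A lightweight way to enforce this going forward is a pin check in the script itself. This is a sketch under the assumption that pins look like `pkg==version` or `git+...@<sha>`; `require_pin` is a hypothetical helper, and the spec strings are illustrative, not the versions atom is actually validated against:

```shell
#!/bin/bash
# Hedged sketch: reject unpinned pip specs before they reach a benchmark
# script, so "pip install -U transformers" can't slip in unnoticed.
require_pin() {
    case "$1" in
        *==*)    return 0 ;;  # exact version pin, e.g. pkg==1.2.3
        git+*@*) return 0 ;;  # git URL pinned to a SHA or tag
        *)       echo "refusing unpinned spec: $1" >&2; return 1 ;;
    esac
}

require_pin "transformers==5.2.0" && echo "pinned: ok"
require_pin "transformers" 2>/dev/null || echo "unpinned: rejected"
```

A guard like this would have flagged the current `pip install -U transformers` line at review time in both the fp4 and fp8 atom scripts.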

@seungrokj seungrokj added the AMD label Apr 16, 2026
@seungrokj seungrokj changed the title [WIP] [AMD/ROCM] atom glm5.1 fp4 on mi355x [WIP] [AMD/ROCM] atom glm5 fp4 on mi355x Apr 16, 2026