[WIP] [AMD/ROCM] atom minimaxm2.5 fp4 on mi355x #1042
Signed-off-by: seungrokj <seungrok.jung@amd.com>
Thanks for the contribution! For vLLM and SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
minimaxm2.5-fp4-mi355x-atom:
  # TODO:
  image: TBD
🔴 The minimaxm2.5-fp4-mi355x-atom config entry has image: TBD instead of a valid Docker image tag, which will cause any CI pipeline targeting this config to fail immediately when attempting to pull the image. Before merging, replace TBD with a real atom image (the fp8 equivalent uses rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2 as a reference point).
Extended reasoning...
What the bug is and how it manifests
The minimaxm2.5-fp4-mi355x-atom config entry in .github/configs/amd-master.yaml (line 420) sets image: TBD. This is not a valid Docker image reference. When any CI job targets this config entry, the container runtime will attempt docker pull TBD (or equivalent), which will fail with an image-not-found error and abort the pipeline.
The specific code path that triggers it
The CI benchmark workflow reads the image field from the YAML config and passes it directly as the Docker image to pull and run (e.g., IMAGE: ${{ inputs.image }}). There is no guard or filtering logic that would skip entries with placeholder values — the config is consumed as-is. Any automated benchmark job that selects the minimaxm2.5-fp4-mi355x-atom key will immediately fail at the Docker pull step.
Why existing code doesn't prevent it
The YAML config file has no schema validation or image-existence checks at parse time. The # TODO: comments in the diff confirm the author is aware the image value is incomplete, but these are source comments only — they have no runtime effect and do not prevent the broken entry from being picked up by automation.
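The missing guard described above could be sketched as a small pre-flight check. This is a minimal sketch, not part of the repository: the function name `find_placeholder_images` and the placeholder list are hypothetical, and the config is assumed to be already parsed from YAML into a dict of entries.

```python
from typing import Any

# Hypothetical placeholder values; this list is an assumption, not project policy.
PLACEHOLDER_IMAGES = {"", "tbd", "todo", "none", "null"}


def find_placeholder_images(config: dict[str, Any]) -> list[str]:
    """Return the config keys whose `image` field is missing or a placeholder."""
    bad_keys = []
    for key, entry in config.items():
        # Entries that are not mappings have no usable image field.
        image = entry.get("image") if isinstance(entry, dict) else None
        if not isinstance(image, str) or image.strip().lower() in PLACEHOLDER_IMAGES:
            bad_keys.append(key)
    return bad_keys


# Config shaped like the two entries discussed in this review (already parsed).
config = {
    "minimaxm2.5-fp8-mi355x-atom": {
        "image": "rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2"
    },
    "minimaxm2.5-fp4-mi355x-atom": {"image": "TBD"},
}

print(find_placeholder_images(config))  # ['minimaxm2.5-fp4-mi355x-atom']
```

Running a check like this before any job is dispatched would turn the failure into an explicit validation error instead of a failed image pull.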
What the impact would be
Any CI benchmark run that triggers the minimaxm2.5-fp4-mi355x-atom config entry will fail with a Docker image pull error, producing a broken pipeline and potentially blocking other benchmark results from the same run.
How to fix it
Replace image: TBD with the correct atom Docker image tag for the MiniMax M2.5 FP4 workload on MI355X. Based on the fp8 equivalent (minimaxm2.5-fp8-mi355x-atom), a likely candidate is rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2 or newer, but the exact image should be verified with the atom team before merging.
Step-by-step proof
- CI automation reads .github/configs/amd-master.yaml and finds the minimaxm2.5-fp4-mi355x-atom key.
- It extracts image: TBD and sets IMAGE=TBD in the job environment.
- The benchmark template executes docker pull TBD (or equivalent container invocation).
- Docker returns Error: No such image: TBD / registry lookup fails.
- The CI job exits with a non-zero status code and the pipeline is marked as failed.
hi,
WIP.
internally tested. shipping soon.
cc. @ChangLiu0709 @andyluo7 @chunfangamd @ajith-sirra-amd
Regards,
Seungrok