[WIP] [AMD/ROCM] atom minimaxm2.5 fp4 on mi355x #1042

Open
seungrokj wants to merge 1 commit into main from srok/atom_minimaxm2.5_fp4
@seungrokj
Collaborator

@seungrokj seungrokj commented Apr 16, 2026

hi,

WIP.
internally tested. shipping soon.

cc. @ChangLiu0709 @andyluo7 @chunfangamd @ajith-sirra-amd

Regards,
Seungrok

Signed-off-by: seungrokj <seungrok.jung@amd.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

Comment on lines +418 to +420
minimaxm2.5-fp4-mi355x-atom:
# TODO:
image: TBD
Contributor

🔴 The minimaxm2.5-fp4-mi355x-atom config entry has image: TBD instead of a valid Docker image tag, which will cause any CI pipeline targeting this config to fail immediately when attempting to pull the image. Before merging, replace TBD with a real atom image (the fp8 equivalent uses rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2 as a reference point).

Extended reasoning...

What the bug is and how it manifests

The minimaxm2.5-fp4-mi355x-atom config entry in .github/configs/amd-master.yaml (line 420) sets image: TBD. This is not a valid Docker image reference. When any CI job targets this config entry, the container runtime will attempt docker pull TBD (or equivalent), which will fail with an image-not-found error and abort the pipeline.

The specific code path that triggers it

The CI benchmark workflow reads the image field from the YAML config and passes it directly as the Docker image to pull and run (e.g., IMAGE: ${{ inputs.image }}). There is no guard or filtering logic that would skip entries with placeholder values — the config is consumed as-is. Any automated benchmark job that selects the minimaxm2.5-fp4-mi355x-atom key will immediately fail at the Docker pull step.

Why existing code doesn't prevent it

The YAML config file has no schema validation or image-existence checks at parse time. The # TODO: comments in the diff confirm the author is aware the image value is incomplete, but these are source comments only — they have no runtime effect and do not prevent the broken entry from being picked up by automation.
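One way to close this gap is a small pre-flight check that runs before any job is dispatched. The sketch below is illustrative only: the function name, the placeholder set, and the assumption that the config has already been parsed into a mapping of key to entry are all hypothetical and are not part of this repository's tooling.

```python
# Hypothetical pre-flight check, not existing repository code.
# The placeholder set below is an assumption chosen for illustration.
PLACEHOLDER_IMAGES = {"", "tbd", "todo", "none"}


def find_placeholder_images(configs):
    """Return the config keys whose `image` field is missing or a placeholder."""
    bad_keys = []
    for key, entry in configs.items():
        # Treat an empty (None) entry the same as an entry without an image.
        image = (entry or {}).get("image")
        if not isinstance(image, str) or image.strip().lower() in PLACEHOLDER_IMAGES:
            bad_keys.append(key)
    return bad_keys


# Parsed form of the two entries discussed in this review (other fields omitted).
configs = {
    "minimaxm2.5-fp8-mi355x-atom": {
        "image": "rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2"
    },
    "minimaxm2.5-fp4-mi355x-atom": {"image": "TBD"},
}

print(find_placeholder_images(configs))  # ['minimaxm2.5-fp4-mi355x-atom']
```

If a check along these lines ran at config-load time, the workflow could fail fast with a clear message, or skip the flagged entries, instead of failing later at the image pull step.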

What the impact would be

Any CI benchmark run that triggers the minimaxm2.5-fp4-mi355x-atom config entry will fail with a Docker image pull error, producing a broken pipeline and potentially blocking other benchmark results from the same run.

How to fix it

Replace image: TBD with the correct atom Docker image tag for the MiniMax M2.5 FP4 workload on MI355X. Based on the fp8 equivalent (minimaxm2.5-fp8-mi355x-atom), a likely candidate is rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2 or newer, but the exact image should be verified with the atom team before merging.

Step-by-step proof

  1. CI automation reads .github/configs/amd-master.yaml and finds the minimaxm2.5-fp4-mi355x-atom key.
  2. It extracts image: TBD and sets IMAGE=TBD in the job environment.
  3. The benchmark template executes docker pull TBD (or equivalent container invocation).
  4. Docker rejects the pull because TBD is not a valid image reference (repository names must be lowercase), so nothing is fetched.
  5. The CI job exits with a non-zero status code and the pipeline is marked as failed.

@seungrokj seungrokj added the AMD label Apr 16, 2026