Conversation

@yansun1996
Member

  • [Feature] Add test runner into referenced cluster validation framework

Motivation

  • Add a test runner to the reference architecture of the cluster validation framework

Technical Details

  • The test runner is responsible for executing GPU validation tests via the ROCm Validation Suite (RVS) or AMD GPU Field Health Check (AGFHC).
  • Nodes that fail GPU validation are excluded from the RCCL test.
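
The exclusion rule above can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not the framework's actual implementation: the node names, the shape of the results mapping, and the function `select_rccl_candidates` are all hypothetical.

```python
def select_rccl_candidates(gpu_validation_passed: dict[str, bool]) -> list[str]:
    """Return the nodes eligible for the RCCL test, sorted by name.

    `gpu_validation_passed` maps a node name to True when its GPU validation
    run (RVS or AGFHC) passed. This mapping is an assumed shape for
    illustration only; the real framework may record results differently.
    """
    return sorted(node for node, ok in gpu_validation_passed.items() if ok)


# Hypothetical validation results for three nodes.
results = {"node-a": True, "node-b": False, "node-c": True}

# Only nodes that passed GPU validation are kept for the RCCL test.
candidates = select_rccl_candidates(results)
print(candidates)
```

Here `node-b` is dropped because its GPU validation failed, so a subsequent RCCL run would only be scheduled across the remaining nodes.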

Test Plan

Test Result

Submission Checklist

…k (#185)

* [Feature] Add test runner into referenced cluster validation framework

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>

* Update docs/cluster_validation_framework/cluster-validation-config.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/_static/cluster-validation-job.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/_static/cluster-validation-job.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/_static/cluster-validation-job.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/_static/cluster-validation-job.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/_static/cluster-validation-job.yaml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@sajmera-pensando sajmera-pensando merged commit 2395a09 into ROCm:main Nov 14, 2025
0 of 2 checks passed
@yansun1996 yansun1996 deleted the testrunner_cluster_validation_framework branch November 14, 2025 03:04
yansun1996 added a commit to yansun1996/rocm-network-operator that referenced this pull request Nov 17, 2025
…k (#185) (ROCm#26)

* [Feature] Add test runner into referenced cluster validation framework



* Update docs/cluster_validation_framework/cluster-validation-config.yaml



* Update docs/_static/cluster-validation-job.yaml



* Update docs/_static/cluster-validation-job.yaml



* Update docs/_static/cluster-validation-job.yaml



* Update docs/_static/cluster-validation-job.yaml



* Update docs/_static/cluster-validation-job.yaml



---------

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
sajmera-pensando pushed a commit that referenced this pull request Nov 17, 2025
* Cluster Validation Framework using RCCL, CronJob and MPIJob (#24)

* Cluster Validation Framework RCCL MPIJobs

* Lint fixes

* Lint fix

* Lint

* adding sphinx doc and moving config and job yamls to _static dir

* [Doc] Add MPI Operator requirement to cluster validation Job. (#25)

* Add MPI oper prereq and update AINICs

* cleanup

* [Feature] Add test runner into referenced cluster validation framework (#185) (#26)

* [Feature] Add test runner into referenced cluster validation framework



* Update docs/cluster_validation_framework/cluster-validation-config.yaml



* Update docs/_static/cluster-validation-job.yaml



* Update docs/_static/cluster-validation-job.yaml



* Update docs/_static/cluster-validation-job.yaml



* Update docs/_static/cluster-validation-job.yaml



* Update docs/_static/cluster-validation-job.yaml



---------

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Co-authored-by: Sundara Gurunathan <105081231+sundar-pds@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>