Add MIG validations by rlandy · Pull Request #15 · rhos-vaf/gpu-validation

rlandy · 2026-03-10T20:51:08Z

flavor addition for MIG testing
Check for MIG mode

csibbitt

Nice work, this is pretty much what I expected to see. I recommend we add the loop and device counting based on gpu_validation_pci_devices.

csibbitt · 2026-03-11T14:57:29Z

gpu-validation/defaults/main.yaml

 gpu_validation_image_url: https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-latest.x86_64.qcow2
 # [string] Image name to use when creating the VM
 gpu_validation_image_name: gpu-validation
+# [string] Type of GPU validation for flavor creation


It would be good to list the allowable values in the comment here.

csibbitt · 2026-03-11T14:58:09Z

gpu-validation/defaults/main.yaml

+# [string] Flavor to use when creating the VM
+mig_validation_flavor_name: m1.migvgpu
+# [string] RAM value for the flavor
+mig_validation_flavor_ram: 4096


I don't know of any reason we'd want different ram/cpu/disk for MIG testing vs passthrough vs vgpu. It's all the same test at the end anyways.

csibbitt · 2026-03-11T15:03:48Z

gpu-validation/tasks/gpus_assertions.yaml

  failed_when: pci_count.stdout | int != expected_count | int
  changed_when: false
+
+- name: "TEST[gpus] Check if MIG mode is used"


This new code looks for any MIG device returned by nvidia-smi, whereas the existing pci_passthrough code actually loops through gpu_validation_pci_devices to ensure that we are seeing the correct number of entries for each listed device. Not a big deal when we're just passing through a single device, but this was meant to support multiple instances of multiple device types, so I think the new code should work the same way.

csibbitt · 2026-03-12T16:42:09Z

gpu-validation/tasks/nvidia_assertions.yaml

+  loop: "{{ gpu_validation_pci_devices | dict2items }}"
+  vars:
+    expected_count: "{{ item.value }}"
+  failed_when: pci_count.stdout | int != expected_count | int


I imagine this passed (or would pass) your testing with a single device, but I don't think it's correct.

Let's say you have 3 devices, 1x A30 and 2x L4. You'd have the following config:

gpu_validation_pci_devices:
  10de:20b7: 1
  20fe:20f1: 2

Given the code above, you're going to attempt to run nvidia-smi twice, and each time the wc -l is going to return 3. It would fail in both cases because expected_count will be 1 on the first item, and 2 on the second.

The problem is that you're looping per-device, but have no per-device filtering in your check. Contrast this to the lspci check, which is grep'ing on the item key: https://github.com/rhos-vaf/gpu-validation/blob/main/gpu-validation/tasks/gpus_assertions.yaml#L5
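To make the difference concrete, here is a minimal sketch of the two counting styles run against the example config above. The device lines are made up for illustration (they are not real lspci or nvidia-smi output), and count_all and count_for are hypothetical helper names, not functions from the repository.

```shell
#!/bin/sh
# Hypothetical device listing: one line per device, tagged with its PCI ID.
# These lines are invented for the example; they are not real tool output.
devices='dev0 [10de:20b7]
dev1 [20fe:20f1]
dev2 [20fe:20f1]'

# Unfiltered count: the same total no matter which PCI ID the loop is on.
count_all() { printf '%s\n' "$devices" | wc -l | tr -d ' '; }

# Filtered count: only the lines that mention the given PCI ID.
count_for() { printf '%s\n' "$devices" | grep -c -F -- "$1"; }

# Walk the example config: 1x 10de:20b7 and 2x 20fe:20f1.
for entry in "10de:20b7=1" "20fe:20f1=2"; do
  id=${entry%%=*}
  expected=${entry##*=}
  echo "$id expected=$expected unfiltered=$(count_all) filtered=$(count_for "$id")"
done
```

The unfiltered count is 3 on both iterations, so it never matches an expected_count of 1 or 2, while the filtered count matches on each iteration.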

I'm seeing that the output from nvidia-smi -L doesn't include the PCI-ID string (10de:20b7) for you to grep on:

[cloud-user@vgpu-server-02 ~]$ nvidia-smi -L
GPU 0: NVIDIA A30-1-6C (UUID: GPU-634472ba-1898-11f1-8083-8150c0edca31)
  MIG 1g.6gb Device 0: (UUID: MIG-801139f5-6a74-5e1a-8b55-c82a2d703935)

So when using this command, it's impossible to tell that there are the expected number of instances of a particular PCI-ID.

I've been looking for a command that has both the PCI-ID string and the UUID (that contains the "MIG-" you're looking for), but I haven't found it. The closest I can get is this:

[cloud-user@vgpu-server-02 ~]$ nvidia-smi -i 0 --query-gpu index,uuid,pci.device_id,mig.mode.current,count --format=noheader
0, GPU-634472ba-1898-11f1-8083-8150c0edca31, 0x20B710DE, Enabled, 1

This gets us the PCI-ID (though the format needs adjustment) and shows that MIG is enabled, but it reports the hardware GPU device (UUID: GPU-634472ba), not the MIG device (UUID: MIG-801139f5). I don't have an example handy with multiple MIG slices passed in, but I'm hoping that the "count" field on the end would be the expected_count we're looking for; this would need testing. If it's not, then it becomes far more complicated to check if we have the correct number of MIG slices.
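As a rough illustration of the format adjustment, the sketch below rewrites the pci.device_id value from the row above into the vendor:device form used as keys in gpu_validation_pci_devices. It assumes the value is always 0x followed by four device digits and then four vendor digits, which is what the single sample suggests but has not been checked on other hardware. normalize_pci_id is a hypothetical helper name.

```shell
#!/bin/sh
# Sample row copied from the query output quoted above; the field order is
# index, uuid, pci.device_id, mig.mode.current, count.
sample='0, GPU-634472ba-1898-11f1-8083-8150c0edca31, 0x20B710DE, Enabled, 1'

# Hypothetical helper: turn 0xDDDDVVVV (device digits, then vendor digits, as
# in the sample) into the lowercase vvvv:dddd form.
normalize_pci_id() {
  raw=$(printf '%s' "${1#0x}" | tr 'A-F' 'a-f')   # strip 0x, lowercase hex
  vendor=$(printf '%s' "$raw" | cut -c5-8)        # last four digits
  device=$(printf '%s' "$raw" | cut -c1-4)        # first four digits
  printf '%s:%s\n' "$vendor" "$device"
}

# Pull the third comma-separated field out of the row and normalize it.
raw_id=$(printf '%s\n' "$sample" | awk -F', *' '{ print $3 }')
normalize_pci_id "$raw_id"
```

For the sample row this prints 10de:20b7, the same string the lspci check greps for.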

I'll tell you what. This whole conversation started because you had written this task next to the lspci task, with a conditional that ran either one or the other, and I pointed out that they do not do equivalent things. We talked about it and it seemed reasonable that they should do equivalent things (after all, we do want to know that each expected MIG device is present in the VM). You tried to make them do equivalent things, and it looks like that is going to be quite a bit more complex than we thought when we spoke yesterday.

Ignoring your PR for a moment, I'm looking just at the existing code:

The lspci check looks for the expected number of devices of the specified types: https://github.com/rhos-vaf/gpu-validation/blob/main/gpu-validation/tasks/gpus_assertions.yaml#L2

In my example from the beginning, it would look for 1 device with ID 10de:20b7 and 2 devices with ID 20fe:20f1.

Unfortunately, AFAICT (still needs testing!) lspci is likely to only show the single hardware device even if we were passing in multiple MIG devices.

The nvidia-smi check just looks for ANY gpus visible to the driver: https://github.com/rhos-vaf/gpu-validation/blob/main/gpu-validation/tasks/nvidia_assertions.yaml#L4

Given all of this, perhaps it's fair to just confirm that mig.mode.current is showing as "Enabled" for every device, and call that good enough for now?
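A minimal sketch of what that reduced check could look like, assuming the rows come from the query shown earlier with mig.mode.current as the fourth field. all_mig_enabled and report are hypothetical helper names, and the "Disabled" row is invented for the example; only the first row is taken from the real output.

```shell
#!/bin/sh
# Hypothetical helper: reads query rows on stdin (index, uuid, pci.device_id,
# mig.mode.current, count) and succeeds only if there is at least one row and
# every row reports "Enabled" in the fourth field.
all_mig_enabled() {
  awk -F', *' '$4 != "Enabled" { bad = 1 } END { if (bad || NR == 0) exit 1 }'
}

# Print a one-line verdict for whatever rows arrive on stdin.
report() {
  if all_mig_enabled; then
    echo "MIG enabled on every device"
  else
    echo "MIG not enabled on every device"
  fi
}

# Row taken from the real query output.
real='0, GPU-634472ba-1898-11f1-8083-8150c0edca31, 0x20B710DE, Enabled, 1'
# Invented row for a second GPU with MIG turned off (illustration only).
made_up='1, GPU-00000000-0000-0000-0000-000000000000, 0x20B710DE, Disabled, 2'

printf '%s\n' "$real" | report
printf '%s\n%s\n' "$real" "$made_up" | report
```

The first call reports that MIG is enabled everywhere and the second reports that it is not. Treating empty input as a failure is a design choice in this sketch, so that a VM with no visible GPUs does not pass by default.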

csibbitt · 2026-03-12T16:44:26Z

gpu-validation/tasks/vm_image.yaml


+- name: Reset extra_specs based GPU mode
+  ansible.builtin.set_fact:
+    gpu_validation_extra_specs:



rlandy marked this pull request as draft March 10, 2026 20:51

rlandy force-pushed the mig-validation branch from a23a90e to 1c15f2c Compare March 10, 2026 20:55

rlandy requested a review from csibbitt March 10, 2026 20:56

rlandy force-pushed the mig-validation branch 2 times, most recently from 184be1f to c09e1d0 Compare March 11, 2026 11:22

csibbitt reviewed Mar 11, 2026

View reviewed changes

rlandy force-pushed the mig-validation branch 6 times, most recently from 2a8a0d1 to 6ae23f2 Compare March 11, 2026 20:05

csibbitt reviewed Mar 12, 2026

View reviewed changes

rlandy force-pushed the mig-validation branch 4 times, most recently from c40c713 to 57fcf78 Compare March 13, 2026 20:49

Add MIG validations

69e7bd1

- flavor addition for MIG testing - Check for MIG mode

rlandy force-pushed the mig-validation branch from 57fcf78 to 69e7bd1 Compare March 13, 2026 21:33

rlandy wants to merge 1 commit into rhos-vaf:main from rlandy:mig-validation
