rlandy
commented
Mar 10, 2026
- flavor addition for MIG testing
- Check for MIG mode
184be1f to c09e1d0
csibbitt
left a comment
Nice work, this is pretty much what I expected to see. I recommend we add the loop and device counting based on gpu_validation_pci_devices.
gpu-validation/defaults/main.yaml
Outdated
gpu_validation_image_url: https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-latest.x86_64.qcow2
# [string] Image name to use when creating the VM
gpu_validation_image_name: gpu-validation
# [string] Type of GPU validation for flavor creation
Would be good to list the allowable values in the comment here
gpu-validation/defaults/main.yaml
Outdated
# [string] Flavor to use when creating the VM
mig_validation_flavor_name: m1.migvgpu
# [string] RAM value for the flavor
mig_validation_flavor_ram: 4096
I don't know of any reason we'd want different RAM/CPU/disk for MIG testing vs passthrough vs vGPU. It's all the same test in the end anyway.
  failed_when: pci_count.stdout | int != expected_count | int
  changed_when: false

- name: "TEST[gpus] Check if MIG mode is used"
This new code looks for any MIG device returned by nvidia-smi, whereas the existing pci_passthrough code actually loops through gpu_validation_pci_devices to ensure that we are seeing the correct number of entries for each listed device. Not a big deal when we're just passing through a single device, but this was meant to support multiple instances of multiple device types, so I think the new code should work the same way.
2a8a0d1 to 6ae23f2
  loop: "{{ gpu_validation_pci_devices | dict2items }}"
  vars:
    expected_count: "{{ item.value }}"
  failed_when: pci_count.stdout | int != expected_count | int
I imagine this passed (or would pass) your testing with a single device, but I don't think it's correct.
Let's say you have 3 devices, 1x A30 and 2x L4. You'd have the following config:
gpu_validation_pci_devices:
  10de:20b7: 1
  20fe:20f1: 2
Given the code above, the task runs nvidia-smi twice, once per item, and each time the wc -l returns 3. Both iterations fail, because expected_count is 1 on the first item and 2 on the second.
The problem is that you're looping per-device, but have no per-device filtering in your check. Contrast this to the lspci check, which is grep'ing on the item key: https://github.com/rhos-vaf/gpu-validation/blob/main/gpu-validation/tasks/gpus_assertions.yaml#L5
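To make the contrast concrete, here is a minimal sketch of a per-device count. It assumes the input is `lspci -nn` output, where each device line carries a bracketed `[vendor:device]` ID. The helper name `count_pci_devices` is hypothetical and is not taken from the repository's task.

```shell
# Count devices matching one PCI ID in `lspci -nn` output read from stdin.
# Hypothetical helper; assumes each device line contains "[vendor:device]".
count_pci_devices() {
    # -F: treat the bracketed ID as a fixed string, not a regex.
    # -i: ignore hex-digit case. -c: print the number of matching lines.
    # grep exits 1 on zero matches, so "|| true" keeps the function
    # usable under "set -e" while still printing 0.
    grep -c -i -F "[$1]" || true
}
```

On the VM this would be fed as `lspci -nn | count_pci_devices 10de:20b7`, with the result compared against that item's expected count. The key point is that the loop item takes part in the filter, so each iteration counts only its own device type.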
I'm seeing that the output from nvidia-smi -L doesn't include the PCI-ID string (10de:20b7) for you to grep on:
[cloud-user@vgpu-server-02 ~]$ nvidia-smi -L
GPU 0: NVIDIA A30-1-6C (UUID: GPU-634472ba-1898-11f1-8083-8150c0edca31)
MIG 1g.6gb Device 0: (UUID: MIG-801139f5-6a74-5e1a-8b55-c82a2d703935)
So with this command, it's impossible to tell whether the expected number of instances of a particular PCI-ID is present.
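For illustration, a rough sketch of what counting on this output amounts to. It relies only on the "UUID: MIG-" marker visible in the listing above; the helper name `count_mig_devices` is hypothetical.

```shell
# Count MIG devices in `nvidia-smi -L` output read from stdin.
# Hypothetical helper; matches only the "UUID: MIG-" marker.
count_mig_devices() {
    # grep exits 1 when nothing matches; "|| true" still lets it print 0
    # without aborting a script that runs under "set -e".
    grep -c -F 'UUID: MIG-' || true
}
```

Fed with `nvidia-smi -L | count_mig_devices`, this yields one total for the whole VM. Nothing in the input ties a MIG line to a PCI-ID, so the same total would be compared against every item in the loop.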
I've been looking for a command that has both the PCI-ID string and the UUID (that contains the "MIG-" you're looking for), but I haven't found it. The closest I can get is this:
[cloud-user@vgpu-server-02 ~]$ nvidia-smi -i 0 --query-gpu index,uuid,pci.device_id,mig.mode.current,count --format=noheader
0, GPU-634472ba-1898-11f1-8083-8150c0edca31, 0x20B710DE, Enabled, 1
This gets us the PCI-ID (though the format needs adjustment) and shows that MIG is enabled, but it reports the hardware GPU device (UUID: GPU-634472ba), not the MIG device (UUID: MIG-801139f5). I don't have an example handy with multiple MIG slices passed in, but I'm hoping that the "count" field at the end is the expected_count we're looking for; this would need testing. If it's not, checking for the correct number of MIG slices becomes far more complicated.
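The format adjustment could look something like the sketch below. It assumes, based on the single sample above (0x20B710DE for the device known as 10de:20b7), that the value is the device ID followed by the vendor ID. The helper name `to_lspci_id` is hypothetical.

```shell
# Convert nvidia-smi's pci.device_id form (0xDDDDVVVV: device ID, then
# vendor ID) into the lowercase vendor:device form used as the dict key.
# Hypothetical helper; prints nothing if the input does not match.
to_lspci_id() {
    printf '%s\n' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | sed -n -E 's/^0x([0-9a-f]{4})([0-9a-f]{4})$/\2:\1/p'
}
```

With this, `to_lspci_id 0x20B710DE` gives a string that can be compared directly with the keys in gpu_validation_pci_devices. Any padding around the CSV field would need to be stripped before the call.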
I'll tell you what. This whole conversation started because you had written this task next to the lspci task, with a conditional that ran either one or the other, and I pointed out that they do not do equivalent things. We talked about it and it seemed reasonable that they should do equivalent things (after all, we do want to know that each expected MIG device is present in the VM). You tried to make them do equivalent things, and it looks like that is going to be quite a bit more complex than we thought when we spoke yesterday.
Ignoring your PR for a moment, I'm looking just at the existing code:
- The lspci check looks for the expected number of devices of the specified types: https://github.com/rhos-vaf/gpu-validation/blob/main/gpu-validation/tasks/gpus_assertions.yaml#L2
- In my example from the beginning, it would look for 1 device with ID 10de:20b7 and 2 devices with ID 20fe:20f1
- Unfortunately, AFAICT (still needs testing!) lspci is likely to only show the single hardware device even if we were passing in multiple MIG devices.
- The nvidia-smi check just looks for ANY GPUs visible to the driver: https://github.com/rhos-vaf/gpu-validation/blob/main/gpu-validation/tasks/nvidia_assertions.yaml#L4
Given all of this, perhaps it's fair to just confirm that mig.mode.current is showing as "Enabled" for every device, and call that good enough for now?
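That fallback could be sketched roughly as follows. It assumes comma-separated rows like the query output shown earlier, with mig.mode.current in a known column. The helper name `all_mig_enabled` is hypothetical, and treating empty input as a failure is a design choice of this sketch, not something decided in the thread.

```shell
# Succeed only if every CSV row on stdin reports mig.mode.current as
# "Enabled". Hypothetical helper; $1 is the 1-based column holding that
# field (4 for the query shown earlier). Empty input counts as failure.
all_mig_enabled() {
    awk -F',' -v col="$1" '
        NF == 0 { next }                            # skip blank lines
        {
            field = $col
            gsub(/^[ \t]+|[ \t\r]+$/, "", field)    # trim padding
            rows++
            if (field != "Enabled") bad++
        }
        END { exit ((rows == 0 || bad > 0) ? 1 : 0) }
    '
}
```

On the VM this might be driven by `nvidia-smi --query-gpu=index,uuid,pci.device_id,mig.mode.current,count --format=csv,noheader | all_mig_enabled 4`. It confirms the mode per hardware GPU only; it says nothing about how many MIG slices are present.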

- name: Reset extra_specs based GPU mode
  ansible.builtin.set_fact:
    gpu_validation_extra_specs:
c40c713 to 57fcf78