Add MIG validations #15

Draft
rlandy wants to merge 1 commit into rhos-vaf:main from rlandy:mig-validation

@rlandy rlandy commented Mar 10, 2026

  • flavor addition for MIG testing
  • Check for MIG mode

@rlandy rlandy marked this pull request as draft March 10, 2026 20:51
@rlandy rlandy requested a review from csibbitt March 10, 2026 20:56
@rlandy rlandy force-pushed the mig-validation branch 2 times, most recently from 184be1f to c09e1d0 Compare March 11, 2026 11:22
Contributor

@csibbitt csibbitt left a comment

Nice work, this is pretty much what I expected to see. I recommend we add the loop and device counting based on gpu_validation_pci_devices.

gpu_validation_image_url: https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-latest.x86_64.qcow2
# [string] Image name to use when creating the VM
gpu_validation_image_name: gpu-validation
# [string] Type of GPU validation for flavor creation

Would be good to list the allowable values in the comment here

# [string] Flavor to use when creating the VM
mig_validation_flavor_name: m1.migvgpu
# [string] RAM value for the flavor
mig_validation_flavor_ram: 4096

I don't know of any reason we'd want different ram/cpu/disk for MIG testing vs passthrough vs vgpu. It's all the same test at the end anyways.

  failed_when: pci_count.stdout | int != expected_count | int
  changed_when: false

- name: "TEST[gpus] Check if MIG mode is used"

This new code looks for any MIG device returned by nvidia-smi, whereas the existing pci_passthrough code actually loops through gpu_validation_pci_devices to ensure that we are seeing the correct number of entries for each listed device. Not a big deal when we're just passing through a single device, but this was meant to support multiple instances of multiple device types, so I think the new code should work the same way.

@rlandy rlandy force-pushed the mig-validation branch 6 times, most recently from 2a8a0d1 to 6ae23f2 Compare March 11, 2026 20:05
  loop: "{{ gpu_validation_pci_devices | dict2items }}"
  vars:
    expected_count: "{{ item.value }}"
  failed_when: pci_count.stdout | int != expected_count | int

@csibbitt csibbitt Mar 12, 2026

I imagine this passed (or would pass) your testing with a single device, but I don't think it's correct.

Let's say you have 3 devices, 1x A30 and 2x L4. You'd have the following config:

gpu_validation_pci_devices:
  10de:20b7: 1
  20fe:20f1: 2

Given the code above, you're going to attempt to run nvidia-smi twice, and each time the wc -l is going to return 3. It would fail in both cases because expected_count will be 1 on the first item, and 2 on the second.

The problem is that you're looping per-device, but have no per-device filtering in your check. Contrast this to the lspci check, which is grep'ing on the item key: https://github.com/rhos-vaf/gpu-validation/blob/main/gpu-validation/tasks/gpus_assertions.yaml#L5
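
To make the mismatch concrete, here is a minimal Python sketch of the comparison each loop iteration performs. It is only a model of the logic, not the role's code: `failing_devices`, `unfiltered_total`, and `sample_lines` are hypothetical names, and the sample lines are bare IDs rather than real lspci output.

```python
# Expected counts from the example config above (PCI-ID -> number of devices).
expected = {"10de:20b7": 1, "20fe:20f1": 2}

# What an unfiltered `wc -l` would report on every iteration in this scenario.
unfiltered_total = 3


def failing_devices(expected, observed_for):
    """Return the PCI-IDs whose observed count differs from the expected count."""
    return [dev for dev, want in expected.items() if observed_for(dev) != want]


# Unfiltered check: every iteration sees the same total, so both items fail.
unfiltered_failures = failing_devices(expected, lambda dev: unfiltered_total)

# Filtered check (the effect of grep'ing on the item key): count matching lines only.
sample_lines = ["10de:20b7", "20fe:20f1", "20fe:20f1"]  # illustrative IDs only
filtered_failures = failing_devices(expected, lambda dev: sample_lines.count(dev))

print(unfiltered_failures)
print(filtered_failures)
```

With the unfiltered total both devices are reported as failing, while the per-device count reports none.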

I'm seeing that the output from nvidia-smi -L doesn't include the PCI-ID string (10de:20b7) for you to grep on:

[cloud-user@vgpu-server-02 ~]$ nvidia-smi -L
GPU 0: NVIDIA A30-1-6C (UUID: GPU-634472ba-1898-11f1-8083-8150c0edca31)
  MIG 1g.6gb      Device  0: (UUID: MIG-801139f5-6a74-5e1a-8b55-c82a2d703935)

So with this command, it's impossible to tell whether the expected number of instances of a particular PCI-ID is present.
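
A small Python sketch of what can and cannot be extracted from that listing, using the output pasted above as input. The helper names `count_mig_devices` and `contains_pci_id` are hypothetical, and the PCI-ID pattern assumes the `vvvv:dddd` hex form used in gpu_validation_pci_devices.

```python
import re

# Output copied from the `nvidia-smi -L` run above.
NVIDIA_SMI_L = """GPU 0: NVIDIA A30-1-6C (UUID: GPU-634472ba-1898-11f1-8083-8150c0edca31)
  MIG 1g.6gb      Device  0: (UUID: MIG-801139f5-6a74-5e1a-8b55-c82a2d703935)
"""


def count_mig_devices(listing):
    """Count lines that carry a MIG device UUID."""
    return sum(1 for line in listing.splitlines() if "UUID: MIG-" in line)


def contains_pci_id(listing):
    """Report whether any vendor:device pair (such as 10de:20b7) appears in the listing."""
    return re.search(r"\b[0-9a-f]{4}:[0-9a-f]{4}\b", listing, re.IGNORECASE) is not None


mig_count = count_mig_devices(NVIDIA_SMI_L)
has_pci_id = contains_pci_id(NVIDIA_SMI_L)
print(mig_count, has_pci_id)
```

The total number of MIG devices is recoverable, but there is nothing in the text to attribute them to a particular PCI-ID.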

I've been looking for a command that has both the PCI-ID string and the UUID (that contains the "MIG-" you're looking for), but I haven't found it. The closest I can get is this:

[cloud-user@vgpu-server-02 ~]$ nvidia-smi -i 0 --query-gpu index,uuid,pci.device_id,mig.mode.current,count --format=noheader
0, GPU-634472ba-1898-11f1-8083-8150c0edca31, 0x20B710DE, Enabled, 1

This gets us the PCI-ID (though the format needs adjustment), it shows that MIG is enabled, but it's showing us the hardware GPU device (UUID: GPU-634472ba) not the MIG device (UUID: MIG-801139f5). I don't have an example handy with multiple MIG slices passed in but I'm hoping that the "count" field on the end would be the expected_count we're looking for - this would need testing. If it's not, then it becomes far more complicated to check if we have the correct number of MIG slices.
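
As a sketch of the format adjustment, the Python below converts the `pci.device_id` value from the query output above into the `vendor:device` form used as keys in gpu_validation_pci_devices. It assumes the layout suggested by the sample (device ID in the upper 16 bits, vendor ID in the lower 16 bits); `to_vendor_device` is a hypothetical helper name.

```python
def to_vendor_device(pci_device_id):
    """Convert a value like '0x20B710DE' to the lspci-style 'vvvv:dddd' string."""
    value = int(pci_device_id, 16)  # base 16 accepts the leading '0x'
    device = value >> 16            # upper 16 bits: device ID (assumed layout)
    vendor = value & 0xFFFF         # lower 16 bits: vendor ID (assumed layout)
    return f"{vendor:04x}:{device:04x}"


# Line copied from the query output above.
line = "0, GPU-634472ba-1898-11f1-8083-8150c0edca31, 0x20B710DE, Enabled, 1"
fields = [field.strip() for field in line.split(",")]
pci_id = to_vendor_device(fields[2])
print(pci_id)
```

For the sample line this yields the same `10de:20b7` string that the config uses for the A30.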

I'll tell you what. This whole conversation started because you had written this task next to the lspci task, with a conditional that ran either one or the other, and I pointed out that they do not do equivalent things. We talked about it and it seemed reasonable that they should do equivalent things (after all, we do want to know that each expected MIG device is present in the VM). You tried to make them do equivalent things, and it looks like that is going to be quite a bit more complex than we thought when we spoke yesterday.

Ignoring your PR for a moment, I'm looking just at the existing code:

Given all of this, perhaps it's fair to just confirm that mig.mode.current is showing as "Enabled" for every device, and call that good enough for now?
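
A minimal Python sketch of that simpler check, assuming the same five-field query output shown earlier with `mig.mode.current` as the fourth field. `mig_enabled_everywhere` is a hypothetical helper, and the second sample line (including its UUID) is made up purely for illustration.

```python
def mig_enabled_everywhere(query_output):
    """Return True only if at least one GPU is listed and all report 'Enabled'."""
    lines = [line.strip() for line in query_output.splitlines() if line.strip()]
    modes = [line.split(",")[3].strip() for line in lines]  # 4th field: mig.mode.current
    return bool(modes) and all(mode == "Enabled" for mode in modes)


# First line is copied from the query output above; the second is hypothetical.
all_enabled = """0, GPU-634472ba-1898-11f1-8083-8150c0edca31, 0x20B710DE, Enabled, 1
1, GPU-00000000-0000-0000-0000-000000000000, 0x20B710DE, Enabled, 1
"""
one_disabled = all_enabled.replace("Enabled, 1\n1,", "Disabled, 1\n1,")

print(mig_enabled_everywhere(all_enabled))
print(mig_enabled_everywhere(one_disabled))
```

Treating empty output as a failure is a design choice in this sketch, so that a VM with no visible GPU does not pass the check.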


- name: Reset extra_specs based GPU mode
  ansible.builtin.set_fact:
    gpu_validation_extra_specs:

👍

@rlandy rlandy force-pushed the mig-validation branch 4 times, most recently from c40c713 to 57fcf78 Compare March 13, 2026 20:49