Skip to content

Feature/multi gpu#84

Closed
theodufort wants to merge 2 commits into
Andyyyy64:mainfrom
theodufort:feature/multi-gpu
Closed

Feature/multi gpu#84
theodufort wants to merge 2 commits into
Andyyyy64:mainfrom
theodufort:feature/multi-gpu

Conversation

@theodufort

Copy link
Copy Markdown

Allow simulating multiple GPUs with comma-separated names (RTX 5080,RTX 5060 Ti) or count shorthand (2x RTX 4090, 4x H100). VRAM fit uses conservative pooling — per-GPU framework overhead (~300MB each) and a utilization factor (95% homogeneous, 90% heterogeneous) are applied rather than naively summing all VRAM as one device. Speed estimation uses a conservative flat 30% overhead factor without claiming precision about the interconnect topology (PCIe vs NVLink); multi-GPU throughput is always marked low-confidence.

What

Add multi-GPU support to implement #65

  • Parse --gpu values: comma-separated heterogeneous GPUs and Nx count shorthand
  • Conservative VRAM fit that accounts for per-GPU overhead and heterogeneous split inefficiency
  • Low-confidence speed estimates that avoid faking PCIe/NVLink precision
  • Warnings surfaced for multi-GPU splits and heterogeneous configurations

Why

Users with multiple GPUs (e.g. 2x RTX 3090, mixed RTX 4090 + 3090) need to see which models fit across their combined VRAM. The fit simulation layer comes first; speed modeling is deliberately conservative until interconnect and tensor-split assumptions can be properly validated.

Testing

  • Tests pass (pytest)
  • New tests added (if applicable)
  • [] Tested on real hardware (if hardware-related)

Notes

  • NVLink detection from the initial commit was removed — it incorrectly assumed any NVIDIA GPU with compute capability >= 7.0 has NVLink (consumer GPUs like RTX 4090 do not)
  • vram_available_bytes still reports the raw physical total; the conservative budget is only used internally for fit decisions
  • Per-GPU --vram override is not yet supported for multi-GPU; the error message now says so clearly
  • Speed estimates for multi-GPU are always low-confidence with an explicit note about PCIe/NVLink dependence — this is intentional, not a gap

@Andyyyy64

Copy link
Copy Markdown
Owner

Thanks @theodufort, this is a lot of careful work, and the conservative approach to VRAM pooling and speed looks right to me. It conflicts with main now, and since it touches the ranker, compatibility, and performance paths, I'd like it rebased on the latest main before I review it properly. Could you update it? I want to give the fit logic a careful read once it's clean.

theodufort and others added 2 commits June 13, 2026 21:32
Allow simulating multiple GPUs with comma-separated names ("RTX 5080,RTX 5060 Ti")
or count shorthand ("2x RTX 4090", "4x H100"). VRAM is pooled across all GPUs for
fit determination. Speed estimation uses a tensor-parallel model where the slowest
GPU is the bottleneck, with inter-GPU communication overhead (PCIe/NVLink) factored in.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The initial multi-GPU commit naively pooled all GPU VRAM as one device
and assumed any NVIDIA GPU with compute capability >= 7.0 has NVLink
(wrong for consumer GPUs like RTX 4090). This commit:

- Applies per-GPU framework overhead (~300MB) and a utilization factor
  (95% homogeneous, 90% heterogeneous) to VRAM fit checks
- Replaces the NVLink/PCIe sync model with a flat 30% overhead factor,
  avoiding false precision about interconnect topology
- Adds warnings for multi-GPU and heterogeneous configurations
- Fixes broken --vram error message grammar

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@theodufort

Copy link
Copy Markdown
Author

Thanks @theodufort, this is a lot of careful work, and the conservative approach to VRAM pooling and speed looks right to me. It conflicts with main now, and since it touches the ranker, compatibility, and performance paths, I'd like it rebased on the latest main before I review it properly. Could you update it? I want to give the fit logic a careful read once it's clean.

Hey Andy, sorry for the delay, I just rebased it!

@Andyyyy64

Copy link
Copy Markdown
Owner

Thanks for putting this together and for coming back to rebase it.

I checked the branch again after the recent main changes. It still conflicts with current main, and the conflict is in compatibility.py around the multi-GPU fit path and the newer offload_ratio logic.

I am going to take this back into a maintainer-owned implementation instead of asking you to keep chasing main. Multi-GPU is now coming up in #65, #52, and #110, and it needs to line up with the RAM reserve work, the partial-offload ranking fix, and the display split.

Your PR is still useful. The conservative VRAM-fit direction and the test cases are good reference material, and I will credit this work if I reuse pieces of it.

@Andyyyy64

Copy link
Copy Markdown
Owner

Closing this in favor of #112.

This is not a rejection of the direction. The PR helped clarify the shape of the feature, especially the conservative VRAM-fit approach. The issue is that main has changed under it: RAM reserve behavior, partial-offload ranking, and output structure all moved, and the PR still conflicts in the compatibility path.

I will carry the useful parts forward in the new implementation and credit this PR where it informs the final code.

@theodufort

Copy link
Copy Markdown
Author

Closing this in favor of #112.

This is not a rejection of the direction. The PR helped clarify the shape of the feature, especially the conservative VRAM-fit approach. The issue is that main has changed under it: RAM reserve behavior, partial-offload ranking, and output structure all moved, and the PR still conflicts in the compatibility path.

I will carry the useful parts forward in the new implementation and credit this PR where it informs the final code.

I don't know how I feel about you closing the merge request and recommitting yourself the changes, this kind of behaviour will not get you many contributors...

@Andyyyy64

Copy link
Copy Markdown
Owner

Closing this in favor of #112.
This is not a rejection of the direction. The PR helped clarify the shape of the feature, especially the conservative VRAM-fit approach. The issue is that main has changed under it: RAM reserve behavior, partial-offload ranking, and output structure all moved, and the PR still conflicts in the compatibility path.
I will carry the useful parts forward in the new implementation and credit this PR where it informs the final code.

I don't know how I feel about you closing the merge request and recommitting yourself the changes, this kind of behaviour will not get you many contributors...

I'm sorry.

Looking back, I handled this poorly. You opened the PR, came back to rebase it, then I closed it and shipped a maintainer version soon after. I can see why that felt like your work was taken out of your hands.

The reason I took it back was that main had moved across the same code paths: RAM reserve behavior, partial-offload ranking, compatibility logic, and output structure. I thought landing a maintainer-owned version would avoid making you chase more conflicts. That was the reasoning, but I should have made that handoff clear before closing the PR.

Your PR did help shape the final implementation, especially the conservative VRAM fit approach and the tests. I should have credited that more visibly when I opened the maintainer PR.

Sorry again. Next time, I will either help get the contributor PR over the line, or say earlier that I need to take over the implementation and make the credit clear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants