Feature/multi gpu#84
Conversation
|
Thanks @theodufort, this is a lot of careful work, and the conservative approach to VRAM pooling and speed looks right to me. It conflicts with main now, and since it touches the ranker, compatibility, and performance paths, I'd like it rebased on the latest main before I review it properly. Could you update it? I want to give the fit logic a careful read once it's clean. |
Allow simulating multiple GPUs with comma-separated names ("RTX 5080,RTX 5060 Ti")
or count shorthand ("2x RTX 4090", "4x H100"). VRAM is pooled across all GPUs for
fit determination. Speed estimation uses a tensor-parallel model where the slowest
GPU is the bottleneck, with inter-GPU communication overhead (PCIe/NVLink) factored in.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The initial multi-GPU commit naively pooled all GPU VRAM as one device and assumed any NVIDIA GPU with compute capability >= 7.0 has NVLink (wrong for consumer GPUs like RTX 4090). This commit: - Applies per-GPU framework overhead (~300MB) and a utilization factor (95% homogeneous, 90% heterogeneous) to VRAM fit checks - Replaces the NVLink/PCIe sync model with a flat 30% overhead factor, avoiding false precision about interconnect topology - Adds warnings for multi-GPU and heterogeneous configurations - Fixes broken --vram error message grammar Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3df8c07 to
a5da49d
Compare
Hey Andy, sorry for the delay, I just rebased it! |
|
Thanks for putting this together and for coming back to rebase it. I checked the branch again after the recent main changes. It still conflicts with current I am going to take this back into a maintainer-owned implementation instead of asking you to keep chasing main. Multi-GPU is now coming up in #65, #52, and #110, and it needs to line up with the RAM reserve work, the partial-offload ranking fix, and the display split. Your PR is still useful. The conservative VRAM-fit direction and the test cases are good reference material, and I will credit this work if I reuse pieces of it. |
|
Closing this in favor of #112. This is not a rejection of the direction. The PR helped clarify the shape of the feature, especially the conservative VRAM-fit approach. The issue is that I will carry the useful parts forward in the new implementation and credit this PR where it informs the final code. |
I don't know how I feel about you closing the merge request and recommitting yourself the changes, this kind of behaviour will not get you many contributors... |
I'm sorry. Looking back, I handled this poorly. You opened the PR, came back to rebase it, then I closed it and shipped a maintainer version soon after. I can see why that felt like your work was taken out of your hands. The reason I took it back was that Your PR did help shape the final implementation, especially the conservative VRAM fit approach and the tests. I should have credited that more visibly when I opened the maintainer PR. Sorry again. Next time, I will either help get the contributor PR over the line, or say earlier that I need to take over the implementation and make the credit clear. |
Allow simulating multiple GPUs with comma-separated names (RTX 5080,RTX 5060 Ti) or count shorthand (2x RTX 4090, 4x H100). VRAM fit uses conservative pooling — per-GPU framework overhead (~300MB each) and a utilization factor (95% homogeneous, 90% heterogeneous) are applied rather than naively summing all VRAM as one device. Speed estimation uses a conservative flat 30% overhead factor without claiming precision about the interconnect topology (PCIe vs NVLink); multi-GPU throughput is always marked low-confidence.
What
Add multi-GPU support to implement #65
Why
Users with multiple GPUs (e.g. 2x RTX 3090, mixed RTX 4090 + 3090) need to see which models fit across their combined VRAM. The fit simulation layer comes first; speed modeling is deliberately conservative until interconnect and tensor-split assumptions can be properly validated.
Testing
pytest)Notes