feat: auto-detect Apple Silicon (MPS) and keep Triton CUDA-only #22
Conversation
The CLI --device flag defaulted to "cuda", which crashed with a raw
PyTorch traceback ("Found no NVIDIA driver on your system ...") on
machines without a GPU. Users had to discover --device cpu themselves.
Add an "auto" mode that picks the best available backend (cuda if
detected, otherwise cpu) and make it the default. Users who explicitly
pass --device cuda still get the original loud failure on non-CUDA
machines, which is the correct behavior when they ask for cuda by name.
- opf/_common/device.py (new): resolve_device("auto"|...) helper.
- opf/_cli/common.py: flip --device default to "auto", expand help text.
- opf/_core/runtime.py, opf/_train/runner.py: call resolve_device()
where device names turn into torch.device objects.
Stderr on auto-fallback:
info: no CUDA device detected; falling back to CPU
(pass --device cuda to override).
Fixes openai#12
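A minimal sketch of the helper's shape (`resolve_device` and the stderr text are from the diff above; the body here is illustrative):

```python
import sys

import torch


def resolve_device(name: str) -> torch.device:
    """Resolve a --device value; "auto" prefers CUDA, else CPU."""
    if name != "auto":
        # Explicit names pass through unchanged, so --device cuda
        # still fails loudly on machines without an NVIDIA driver.
        return torch.device(name)
    if torch.cuda.is_available():
        return torch.device("cuda")
    print(
        "info: no CUDA device detected; falling back to CPU "
        "(pass --device cuda to override).",
        file=sys.stderr,
    )
    return torch.device("cpu")
```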
Extends the --device auto resolution from openai#17 to include Apple Silicon (MPS) so Mac users get GPU acceleration by default instead of falling back to CPU. Four coordinated changes make this safe:

1. opf/_common/device.py — "auto" now picks cuda > mps > cpu. Each fallback emits an info line on stderr so the user always knows which backend was selected.
2. opf/_model/model.py — the Triton-backed MoE kernels are CUDA-only (Triton does not target Metal). Previously the default enabled Triton on any non-CPU device, so trying mps crashed once the MoE layer was hit. Narrow the auto-enable to device.type == "cuda"; mps and cpu both fall back to the torch-ops path unless the user explicitly sets OPF_MOE_TRITON=1.
3. opf/_train/runner.py — mirror the same CUDA-only gate when setting OPF_MOE_TRITON=1 on behalf of the user (previously set it for any non-CPU device, which would silently enable Triton on mps).
4. opf/_cli/common.py — expand --device help text to list the full backend order (cuda > mps > cpu).

Verified on macOS (Apple Silicon, Python 3.14, torch 2.11):

- resolve_device("auto") → mps (with stderr info line)
- resolve_device("mps") → mps
- resolve_device("cpu") → cpu
- resolve_device("cuda") → returns the cuda device (still fails loudly at tensor alloc when the user explicitly asks for it — unchanged)

Low-level MPS op sanity check passed for embedding, attention-like matmul/softmax, log_softmax, topk, argsort, bincount — all the ops the inference path relies on.

Fixes openai#21
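And the MPS extension in sketch form (the `cuda > mps > cpu` order and the per-fallback stderr lines are from the diff; the exact message wording is illustrative):

```python
import sys

import torch


def resolve_device(name: str) -> torch.device:
    """Resolve a --device value; "auto" now prefers cuda > mps > cpu."""
    if name != "auto":
        return torch.device(name)  # explicit names pass through unchanged
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        # First fallback: Apple Silicon GPU via the Metal backend.
        print("info: no CUDA device detected; using MPS.", file=sys.stderr)
        return torch.device("mps")
    # Last resort: nothing GPU-like found.
    print(
        "info: no CUDA or MPS device detected; falling back to CPU.",
        file=sys.stderr,
    )
    return torch.device("cpu")
```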
FTR, I currently do the following:
Thanks for the data point — that's exactly the workflow this PR automates. After merge, `--device auto` does it by default.
I couldn't see your messages, can you write them again?
Fixes #21
Builds on #17 (which added `--device auto` with `cuda > cpu` fallback). Extends the auto chain to include Apple Silicon (MPS) so Mac users get GPU acceleration by default instead of falling back to CPU.

## Problem
After #17, an Apple Silicon Mac with no CUDA device falls back to CPU. That's correct but leaves performance on the table — MPS is available on M1/M2/M3 Macs. Naively adding MPS to the `auto` chain is unsafe, though: the Triton-backed MoE kernels are CUDA-only (Triton does not target Metal), and the current code auto-enables Triton on any non-CPU device. Picking `mps` as the default would silently crash once the MoE layer is hit.

## Fix
Four coordinated changes:
- `opf/_common/device.py` — `auto` now picks `cuda > mps > cpu`. Each fallback emits an info line on stderr so the user always knows which backend was selected.
- `opf/_model/model.py` — narrow the Triton auto-enable to `device.type == "cuda"`. MPS and CPU both use the torch-ops path unless the user explicitly sets `OPF_MOE_TRITON=1` (see the sketch after this list).
- `opf/_train/runner.py` — mirror the same CUDA-only gate when the training runner sets `OPF_MOE_TRITON=1` on behalf of the user (previously set for any non-CPU device, which would silently enable Triton on mps).
- `opf/_cli/common.py` — expand `--device` help text to list the full backend order.
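A sketch of that gate (`OPF_MOE_TRITON` and the `device.type == "cuda"` check are from the diff; the helper name `moe_use_triton` is hypothetical):

```python
import os

import torch


def moe_use_triton(device: torch.device) -> bool:
    """Should the MoE layer take the Triton kernel path on this device?

    Triton targets CUDA only (there is no Metal backend), so the
    auto-enable is narrowed to device.type == "cuda"; mps and cpu use
    the torch-ops fallback unless the user opts in explicitly.
    """
    override = os.environ.get("OPF_MOE_TRITON")
    if override is not None:
        return override == "1"  # an explicit user setting wins either way
    return device.type == "cuda"
```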
## Verified on this machine

macOS 26.4 on Apple Silicon (M-series), Python 3.14, torch 2.11.0:
Resolver cases:
resolve_device("auto")on Apple Siliconmps+ stderr info line ✅resolve_device("mps")mps✅resolve_device("cpu")cpu✅resolve_device("cuda")on non-CUDA machinecudaobject, fails at tensor alloc — same as today when user asks for cuda explicitly ✅Low-level MPS op smoke test (ops the inference path depends on), all pass on MPS:
- `torch.nn.Embedding` forward (the op that crashed in #21 / #12)
- `matmul` + `softmax`
- `log_softmax` (inference logprob path)
- `topk` (MoE expert routing)
- `argsort` (MoE packing)
- `bincount` (MoE expert counts — historically flaky on MPS, works here)
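A stand-alone rendition of that smoke test (plain PyTorch, no opf imports; assumes a torch build with MPS support, and the shapes are arbitrary):

```python
import torch

assert torch.backends.mps.is_available(), "needs an MPS-enabled torch build"
dev = torch.device("mps")

# Embedding forward -- the op that crashed in #21/#12.
emb = torch.nn.Embedding(1000, 64).to(dev)
x = emb(torch.randint(0, 1000, (4, 16), device=dev))

# Attention-like matmul + softmax.
scores = torch.softmax(x @ x.transpose(-1, -2) / 8.0, dim=-1)

# Inference logprob path.
logprobs = torch.log_softmax(x, dim=-1)

# MoE expert routing (topk), packing (argsort), counts (bincount).
_, idx = torch.topk(logprobs, k=4, dim=-1)
order = torch.argsort(idx.flatten())
counts = torch.bincount(idx.flatten(), minlength=64)

for t in (scores, idx, order, counts):
    assert t.device.type == "mps"
print("all MPS smoke-test ops passed")
```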
End-to-end inference on MPS was not run here (no checkpoint locally); a maintainer on Apple Silicon with a local checkpoint can verify the full path.

## Backwards compatibility
- `--device cuda` / `--device cpu` / `--device mps`: unchanged (all pass through `resolve_device` as-is).
- `auto` still picks `cuda` first — no behavior change.

## Depends on
#17 — this PR builds directly on the `resolve_device` helper introduced there. If #17 is merged first, this rebases to a no-conflict diff; if the maintainer prefers to squash both into one, I'm happy to close this and post a combined patch.