feat: auto-detect Apple Silicon (MPS) and keep Triton CUDA-only#22

Open
Berkkirik wants to merge 2 commits into openai:main from Berkkirik:feat/mps-support-auto

Conversation

@Berkkirik

Fixes #21

Builds on #17 (which added --device auto with cuda > cpu fallback). Extends the auto chain to include Apple Silicon (MPS) so Mac users get GPU acceleration by default instead of falling back to CPU.

Problem

After #17, an Apple Silicon Mac with no CUDA device falls back to CPU. That's correct but leaves performance on the table — MPS is available on M1/M2/M3 Macs. Naively adding MPS to the auto chain is unsafe, though: the Triton-backed MoE kernels are CUDA-only (Triton does not target Metal), and the current code auto-enables Triton on any non-CPU device. Picking mps as default would silently crash once the MoE layer is hit.

Fix

Four coordinated changes:

  1. opf/_common/device.py — auto now picks cuda > mps > cpu. Each fallback emits an info line on stderr so the user always knows which backend was selected.
  2. opf/_model/model.py — narrow the Triton auto-enable to device.type == "cuda". MPS and CPU both use the torch-ops path unless the user explicitly sets OPF_MOE_TRITON=1.
  3. opf/_train/runner.py — mirror the same CUDA-only gate when the training runner sets OPF_MOE_TRITON=1 on behalf of the user (previously set for any non-CPU device, which would silently enable Triton on mps).
  4. opf/_cli/common.py — expand --device help text to list the full backend order.
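
The auto chain in change 1 can be sketched roughly as follows. This is a minimal sketch, not the actual opf/_common/device.py: backend availability is injected as booleans so the snippet runs without torch installed, whereas the real helper would query torch.cuda.is_available() and torch.backends.mps.is_available() and return a torch.device.

```python
import sys

def resolve_device(name: str, cuda_available: bool, mps_available: bool) -> str:
    """Sketch of the cuda > mps > cpu auto chain described above."""
    if name != "auto":
        return name  # explicit --device values pass through unchanged
    if cuda_available:
        return "cuda"
    if mps_available:
        # fallback is announced on stderr so the user knows which backend won
        print("info: no CUDA device detected; using Apple Metal (MPS).",
              file=sys.stderr)
        return "mps"
    print("info: no CUDA device detected; falling back to CPU "
          "(pass --device cuda to override).", file=sys.stderr)
    return "cpu"
```

The key property is that an explicit name never triggers a fallback: only the literal string "auto" walks the chain.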

Verified on this machine

macOS 26.4 on Apple Silicon (M-series), Python 3.14, torch 2.11.0:

$ python3 -c "from opf._common.device import resolve_device; print(resolve_device('auto'))"
info: no CUDA device detected; using Apple Metal (MPS).
mps

Resolver cases:

  • resolve_device("auto") on Apple Silicon → mps, plus stderr info line ✅
  • resolve_device("mps") → mps
  • resolve_device("cpu") → cpu
  • resolve_device("cuda") on a non-CUDA machine → returns cuda object, fails at tensor alloc — same as today when the user asks for cuda explicitly ✅

Low-level MPS op smoke test (embedding, attention-like matmul/softmax, log_softmax, topk, argsort, bincount — the ops the inference path depends on): all pass on MPS.

End-to-end inference on MPS was not run here (no checkpoint locally). A maintainer on Apple Silicon can verify with:

opf "Alice was born on 1990-01-02."
# should print an "info: ... using Apple Metal (MPS)" line and redact correctly

Backwards compatibility

  • --device cuda / --device cpu / --device mps: unchanged (all pass through resolve_device as-is).
  • On CUDA machines: auto still picks cuda first — no behavior change.
  • Triton path: was auto-enabled on any non-CPU device, now auto-enabled on CUDA only. Users who were running training on CUDA get the same behavior; CPU users get the same behavior; only an MPS user (who would have crashed before) now gets the torch-ops path.
  • No public Python API change.
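
The old-vs-new Triton gating above can be sketched as a single predicate. This is a hypothetical helper for illustration (the real opf/_model/model.py and opf/_train/runner.py code may be structured differently); OPF_MOE_TRITON is the env var named in the PR, and the device is simplified to its type string.

```python
import os

def moe_triton_enabled(device_type: str) -> bool:
    """Decide whether the Triton MoE kernels should be used.

    An explicit OPF_MOE_TRITON=0/1 always wins. Otherwise Triton is
    auto-enabled only on CUDA; before this change it was auto-enabled on
    any non-CPU device, which crashed on mps because Triton does not
    target Metal.
    """
    explicit = os.environ.get("OPF_MOE_TRITON")
    if explicit is not None:
        return explicit == "1"
    return device_type == "cuda"
```

Under this shape, a CUDA user sees no change, a CPU user sees no change, and an MPS user silently gets the torch-ops path unless they opt in with OPF_MOE_TRITON=1.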

Depends on

#17 — This PR builds directly on the resolve_device helper introduced there. If #17 is merged first, this rebases to a no-conflict diff; if the maintainer prefers to squash both into one, I'm happy to close this and post a combined patch.

The CLI --device flag defaulted to "cuda", which crashed with a raw
PyTorch traceback ("Found no NVIDIA driver on your system ...") on
machines without a GPU. Users had to discover --device cpu themselves.

Add an "auto" mode that picks the best available backend (cuda if
detected, otherwise cpu) and make it the default. Users who explicitly
pass --device cuda still get the original loud failure on non-CUDA
machines, which is the correct behavior when they ask for cuda by name.

- opf/_common/device.py (new): resolve_device("auto"|...) helper.
- opf/_cli/common.py: flip --device default to "auto", expand help text.
- opf/_core/runtime.py, opf/_train/runner.py: call resolve_device()
  where device names turn into torch.device objects.

Stderr on auto-fallback:
  info: no CUDA device detected; falling back to CPU
        (pass --device cuda to override).

Fixes openai#12
Extends the --device auto resolution from openai#17 to include Apple Silicon
(MPS) so Mac users get GPU acceleration by default instead of falling
back to CPU.

Four coordinated changes make this safe:

1. opf/_common/device.py — "auto" now picks cuda > mps > cpu. Each
   fallback emits an info line on stderr so the user always knows which
   backend was selected.

2. opf/_model/model.py — the Triton-backed MoE kernels are CUDA-only
   (Triton does not target Metal). Previously the default enabled Triton
   on any non-CPU device, so trying mps crashed once the MoE layer was
   hit. Narrow the auto-enable to device.type == "cuda"; mps and cpu
   both fall back to the torch-ops path unless the user explicitly sets
   OPF_MOE_TRITON=1.

3. opf/_train/runner.py — mirror the same CUDA-only gate when setting
   OPF_MOE_TRITON=1 on behalf of the user (previously set it for any
   non-CPU device, which would silently enable Triton on mps).

4. opf/_cli/common.py — expand --device help text to list the full
   backend order (cuda > mps > cpu).

Verified on macOS (Apple Silicon, Python 3.14, torch 2.11):

- resolve_device("auto") → mps (with stderr info line)
- resolve_device("mps")  → mps
- resolve_device("cpu")  → cpu
- resolve_device("cuda") → returns cuda device (still fails loudly at
  tensor alloc when the user explicitly asks for it — unchanged)

Low-level MPS op sanity check passed for embedding, attention-like
matmul/softmax, log_softmax, topk, argsort, bincount — all the ops
the inference path relies on.

Fixes openai#21
@clement-heliot

clement-heliot commented Apr 24, 2026

FTR, I currently do the following:
OPF_MOE_TRITON=0 opf --device mps "Alice was born on 1990-01-02."

@Berkkirik (Author)

> FTR, I currently do the following: OPF_MOE_TRITON=0 opf --device mps "Alice was born on 1990-01-02."

Thanks for the data point — that's exactly the workflow this PR automates. After merge, OPF_MOE_TRITON=0 and --device mps both become implicit on Apple Silicon; plain opf "..." should give you the same result.


@losoy88

losoy88 commented Apr 30, 2026

> FTR, I currently do the following: OPF_MOE_TRITON=0 opf --device mps "Alice was born on 1990-01-02."
>
> Thanks for the data point — that's exactly the workflow this PR automates. After merge, OPF_MOE_TRITON=0 and --device mps both become implicit on Apple Silicon; plain opf "..." should give you the same result.

I couldn't see your messages, can you write them again?


Development

Successfully merging this pull request may close these issues.

on MacBook: Torch not compiled with CUDA enabled
