MXNet 2.0 — CUDA 13 / Blackwell + Apple Silicon fork

Unofficial, maintained fork at smolix/mxnet. Run existing MXNet code on current hardware — NVIDIA Blackwell (and Ampere → Hopper) GPUs on the CUDA 13 stack, and native Apple Silicon CPU.

Apache MXNet was archived on 2023-11-17. The upstream tree is frozen at CUDA 11 / cuDNN 8 / oneDNN v2 and does not build on Blackwell GPUs or modern CUDA toolchains. This fork carries the patches needed to keep the residual MXNet community — legacy research notebooks, frozen production pipelines, niche operators like _contrib_quantize_*, and the d2l.ai book — running on today's hardware without a rewrite to PyTorch / JAX. It is not an official Apache release.

Current release line: 2.0.0+cu13.bw.<YYYYMMDD> (latest published wheel 2.0.0+cu13.bw.20260614) for Linux/CUDA, and 2.0.0+cpu.macos.<YYYYMMDD> for Apple Silicon. The authoritative current build is always the newest on the Releases page.
What changed vs upstream: see FIXED.md.
Known limitations / open work: see OPEN_ISSUES.md.
Build from source: see BUILDING.md and docs/cuda_wheel_build.md.

Quick install

Linux + NVIDIA GPU (CUDA 13, driver R590+):

# CPython 3.12 (a cp311 wheel is also published)
pip install "https://github.com/smolix/mxnet/releases/download/v2.0.0%2Bcu13.bw.20260614/mxnet-2.0.0%2Bcu13.bw.20260614-cp312-cp312-linux_x86_64.whl"

macOS, Apple Silicon (CPU-only):

# CPython 3.12
pip install "https://github.com/smolix/mxnet/releases/download/v2.0.0%2Bcu13.bw.20260614/mxnet-2.0.0%2Bcpu.macos.20260614-cp312-cp312-macosx_26_0_arm64.whl"

Smoke-test the install:

python -c "import mxnet as mx; from mxnet.runtime import Features; \
print(mx.__version__); print('OPENCV', Features().is_enabled('OPENCV')); \
print('GPUs', mx.device.num_gpus())"

See Installation for the runtime dependency model and troubleshooting, and check the Releases page for a newer dated wheel before pinning a URL.
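When pinning a newer dated wheel, note that the `+` in the version must be percent-encoded as `%2B` in the download URL. The sketch below rebuilds the two URLs shown above from their parts. The helper name `wheel_url` is hypothetical, and the URL pattern is only inferred from those two commands, so confirm the real asset name on the Releases page.

```python
from urllib.parse import quote

# Base path taken from the install commands above.
BASE = "https://github.com/smolix/mxnet/releases/download"

def wheel_url(release_tag, wheel_version, py_tag, platform_tag):
    """Build a release-asset URL (hypothetical helper, pattern inferred).

    '+' is not safe in a URL path segment here, so it is encoded as %2B.
    """
    filename = f"mxnet-{wheel_version}-{py_tag}-{py_tag}-{platform_tag}.whl"
    return f"{BASE}/{quote(release_tag, safe='')}/{quote(filename, safe='')}"

linux_url = wheel_url(
    "v2.0.0+cu13.bw.20260614", "2.0.0+cu13.bw.20260614", "cp312", "linux_x86_64"
)
print(linux_url)
```

Passing `"2.0.0+cpu.macos.20260614"` and `"macosx_26_0_arm64"` with the same release tag reproduces the macOS URL.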

What works

Blackwell sm_120 + full Ampere→Blackwell fatbin (CUDA 13.0): sm_80 (A100), sm_86 (RTX 30xx), sm_89 (Ada/RTX 40xx), sm_90 (Hopper), sm_100 (B100/B200), sm_120 (RTX 50xx) SASS + compute_120 PTX fallback.
cuDNN 9.x, including the rewritten v8-style RNN path (LSTM/GRU/RNN, forward and backward). TF32 is on by default for FP32 convolutions, matching the PyTorch/TensorFlow default (~2.87× speedup on sm_120).
NCCL 2.28 single-process / multi-GPU.
oneDNN v3.11 float backend everywhere; full INT8 path on x86 (per-OC weight scales, fused conv/FC, fused sum, dequant-to-fp32).
cuBLASLt GEMM (fp32/fp16/fp64), bitwise-parity-verified vs the legacy path.
CUDA Graphs — revived and default-on for hybridized cached-ops with static_alloc=True + static_shape=True (eager execution unchanged); measured 1.5–2.3× on transformer/RNN-ish nets, bitwise-identical results.
INT8 quantization (quantize_net, _sg_onednn_conv, _sg_onednn_fully_connected) on x86; fp16/fp32 forward + backward training.
AMP (automatic mixed precision) subgraph — on CPUs without AVX-512-BF16 the bf16 subgraph ops fall back to fp32 (all 6 AMP subgraph tests pass).
ONNX export/import (opset-13 default, ONNX 1.21 / ORT 1.24) — bundled in the wheel; pip install "mxnet[onnx]" pulls the onnx dependency (pure-Python, no native rebuild). onnxruntime is only needed to run exported models.
Self-contained wheels. Both the Linux CUDA wheel and the macOS CPU wheel bundle OpenCV and its full transitive closure into mxnet/lib/ (ELF $ORIGIN on Linux, Mach-O @loader_path on macOS), so import mxnet reports OPENCV=True and mx.image native decode/resize work on a clean host with no system OpenCV.
Native macOS arm64 CPU wheel — Accelerate BLAS/LAPACK + float oneDNN. ~14.9k unittest / operator / NumPy / Gluon / quantization-API tests pass (tools/run_macos_wheel_full_test.sh).
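To reason about whether a given GPU is covered by the fatbin listed above, the general CUDA compatibility rules can be sketched as follows. This is an illustration, not code from this fork: `kernel_source` is a hypothetical helper, and it assumes the usual rules that a cubin built for compute capability X.y runs on X.z with z ≥ y, and that PTX can be JIT-compiled for any capability at or above its target. It ignores architecture-specific (`a`-suffixed) targets.

```python
# SASS targets and PTX fallback as listed above: sm_XY -> (X, Y).
SASS_TARGETS = [(8, 0), (8, 6), (8, 9), (9, 0), (10, 0), (12, 0)]
PTX_TARGET = (12, 0)  # compute_120

def kernel_source(cc):
    """Classify a GPU compute capability (major, minor) against the fatbin."""
    major, minor = cc
    # A cubin for X.y runs on devices X.z with z >= y (same major only).
    if any(m == major and n <= minor for m, n in SASS_TARGETS):
        return "sass"
    # PTX can be JIT-compiled for any capability >= the one it targets.
    if cc >= PTX_TARGET:
        return "ptx-jit"
    return "unsupported"

for cc in [(12, 0), (8, 7), (7, 5), (13, 0)]:
    print(cc, kernel_source(cc))
```

An `unsupported` result corresponds to the `no kernel image available` symptom, where a source rebuild with the right architecture is needed.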

What is experimental or not covered

The four most likely to affect you (full list and details in OPEN_ISSUES.md):

CUDA 13.0 / driver R580 is unsupported — the wheel pins nvidia-cublas>=13.5, which needs driver R590+. On R580 large GEMMs fail with CUBLAS_STATUS_NOT_INITIALIZED.
Apple Silicon: oneDNN INT8 + subgraph fusion are gated off (the Xbyak_aarch64 JIT is unreliable on Apple Silicon — see SupportDNNLAArch64JITPrimitives in src/operator/nn/dnnl/dnnl_base-inl.h); those ops fall back to native kernels and the tests/python/dnnl fusion/quant lane does not apply.
bf16 on CPUs without AVX-512-BF16 is emulated in fp32 — correct, not faster.
Backward through quantized ops is unvalidated — forward INT8 inference is solid; quantized training is not verified.
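Since the driver floor is the limitation most likely to bite, a pre-flight check can be sketched as below. The function names are hypothetical and the sample version strings in the usage note are illustrative only. The sketch assumes `nvidia-smi` is on `PATH` and that the first dotted component of its reported driver version is the release branch.

```python
import subprocess

MIN_DRIVER_BRANCH = 590  # R590+, per the cuBLAS pin described above

def driver_branch(version):
    """Return the release branch (leading number) of a driver version string."""
    return int(version.strip().split(".")[0])

def driver_ok(version, minimum=MIN_DRIVER_BRANCH):
    """True if the driver branch meets the minimum."""
    return driver_branch(version) >= minimum

def installed_driver_version():
    """Query nvidia-smi for the driver version; None if it is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    lines = out.strip().splitlines()
    return lines[0].strip() if lines else None
```

For example, `driver_ok("580.65.06")` is `False` and `driver_ok("590.44.01")` is `True`; on a GPU host, `installed_driver_version()` supplies the real string.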

System requirements

Linux / CUDA:

Linux x86_64 (tested on Ubuntu 22.04 / 24.04).
NVIDIA driver R590 or newer (required by the nvidia-cublas>=13.5 pin; the older CUDA 13.0 / R580 line is not supported — see FIXED.md §1).
CUDA 13.0 toolkit at /usr/local/cuda/ — supplies the base runtime libs (libcudart, libcublas, libcufft, libcusolver, libcurand, libnvrtc) that NVIDIA does not yet ship as real cu13 PyPI wheels.
cuDNN and NCCL are pulled in automatically as pip deps (nvidia-cudnn-cu13>=9.22,<10, nvidia-nccl-cu13>=2.28,<3, nvidia-cublas>=13.5,<14).
CPython 3.11 or 3.12 (both published as wheels); 3.10–3.13 for source builds.
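The pip pins listed above can be checked against an existing environment with a small standard-library script. This is a sketch under stated assumptions: `PINS`, `numeric_prefix`, `satisfies`, and `check_pins` are hypothetical names, and the comparison only looks at the leading numeric components of each version rather than implementing full PEP 440 semantics.

```python
from importlib.metadata import PackageNotFoundError, version

# (inclusive lower bound, exclusive upper bound), copied from the pins above.
PINS = {
    "nvidia-cudnn-cu13": ((9, 22), (10,)),
    "nvidia-nccl-cu13": ((2, 28), (3,)),
    "nvidia-cublas": ((13, 5), (14,)),
}

def numeric_prefix(v):
    """Parse leading numeric components: '9.22.0.post1' -> (9, 22, 0)."""
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def satisfies(v, low, high):
    """True if low <= v < high, comparing numeric tuples."""
    return low <= numeric_prefix(v) < high

def check_pins():
    """Map each pinned package to 'ok', 'not installed', or an out-of-range note."""
    report = {}
    for name, (low, high) in PINS.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = satisfies(installed, low, high)
        report[name] = "ok" if ok else f"out of range: {installed}"
    return report

print(check_pins())
```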

macOS:

Apple Silicon (arm64), macOS 13+; CPython 3.12 (published wheel). CPU-only.

Installation

The wheels are not 2 GB monoliths — only OpenCV is bundled. Everything else is a declared pip dependency or comes from your system CUDA toolkit:

CUDA / cuDNN / NCCL — pip pulls nvidia-cudnn-cu13 (~1 GB) and nvidia-nccl-cu13 (~190 MB) from PyPI (the PyTorch/JAX layout). The base CUDA 13 runtime libs come from your system toolkit at /usr/local/cuda/ (apt install cuda-13).
OpenBLAS — from the scipy-openblas32 wheel.
OpenCV — bundled inside the wheel (mxnet/lib/), reached via $ORIGIN (Linux) / @loader_path (macOS). You do not need libopencv-dev or to pip install opencv-python for the native C++ image path; the macOS wheel does additionally depend on opencv-python for the Python cv2 helpers.

The Linux CUDA wheel is ~454 MB; a smaller BUNDLE_OPENCV=0 / USE_OPENCV=OFF build is possible if you do not need MXNet's native image path (reports OPENCV=False).
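Because the base CUDA runtime libraries come from the system toolkit rather than from pip, it can help to confirm they are present before importing MXNet. The following is a minimal sketch: `missing_cuda_libs` is a hypothetical helper, the library names are the ones listed under System requirements, and searching the whole toolkit tree for `<name>.so*` is an assumption about how the files are laid out.

```python
from pathlib import Path

# Base runtime libraries expected from the system CUDA 13 toolkit.
BASE_LIBS = [
    "libcudart", "libcublas", "libcufft",
    "libcusolver", "libcurand", "libnvrtc",
]

def missing_cuda_libs(cuda_root="/usr/local/cuda"):
    """Return the base libraries with no '<name>.so*' file under cuda_root."""
    root = Path(cuda_root)
    if not root.is_dir():
        return list(BASE_LIBS)
    missing = []
    for name in BASE_LIBS:
        # Matches libcudart.so, libcudart.so.13, and so on.
        if not any(root.rglob(f"{name}.so*")):
            missing.append(name)
    return missing

print(missing_cuda_libs())
```

An empty list means every base library was found; any names returned point to a missing or partial toolkit install.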

Troubleshooting:

| Symptom | Cause / fix |
| --- | --- |
| `CUBLAS_STATUS_NOT_INITIALIZED` on a non-trivial `dot`/`FullyConnected` | driver too old; upgrade to R590+ (`FIXED.md` §1) |
| `cuDNN major-version mismatch: …` printed on first GPU use | the linked cuDNN is a different major version than the build; install cuDNN 9.x. (Minor skews like 9.23↔9.22 are ABI-compatible and no longer warn; see `FIXED.md` §1.) |
| `cudaErrorNoKernelImageForDevice` / `no kernel image available` | wheel lacks SASS for your GPU; rebuild with your arch in `MXNET_CUDA_ARCH` |
| `Build with USE_OPENCV=1 for image io` | OpenCV-off wheel; install or build an `OPENCV=ON` wheel |
| slow batch-size-1 CPU inference (esp. AVX2-only, or an `IC=3` first conv) | oneDNN v3 picks a throughput-oriented `brg_conv` whose overhead dominates at bs=1, and the AVX2 weight format pads `IC=3`→16. The `IC<8` + bs≤1 first-conv cliff is now auto-mitigated on non-AVX-512 hosts (a dispatch gate steers off `brg_conv:avx2`); for the general bs=1 case set `OMP_NUM_THREADS=1` (multi-threading can scale negatively here) and optionally try `DNNL_DEFAULT_FPMATH_MODE`. The broader brgemm kernel perf fix is deferred (OI-14). |
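The error-message rows of the troubleshooting table can be turned into a quick log scan. This is only an illustrative sketch: `KNOWN_SYMPTOMS` and `triage` are hypothetical names, the hints paraphrase the table, and the matching is a plain case-insensitive substring search.

```python
# (substring to look for, hint paraphrased from the troubleshooting table)
KNOWN_SYMPTOMS = [
    ("CUBLAS_STATUS_NOT_INITIALIZED",
     "driver too old; upgrade to R590+"),
    ("cuDNN major-version mismatch",
     "linked cuDNN has a different major version; install cuDNN 9.x"),
    ("cudaErrorNoKernelImageForDevice",
     "wheel lacks SASS for this GPU; rebuild with your arch in MXNET_CUDA_ARCH"),
    ("no kernel image available",
     "wheel lacks SASS for this GPU; rebuild with your arch in MXNET_CUDA_ARCH"),
    ("Build with USE_OPENCV=1",
     "OpenCV-off wheel; install or build an OPENCV=ON wheel"),
]

def triage(log_text):
    """Return the distinct hints whose symptom text appears in log_text."""
    lowered = log_text.lower()
    hints = []
    for needle, hint in KNOWN_SYMPTOMS:
        if needle.lower() in lowered and hint not in hints:
            hints.append(hint)
    return hints

print(triage("Check failed: e == CUBLAS_STATUS_SUCCESS: CUBLAS_STATUS_NOT_INITIALIZED"))
```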

Building from source

See BUILDING.md for the from-scratch recipe (Linux/CUDA and the Apple Silicon CPU build) and docs/cuda_wheel_build.md for the authoritative, provenance-gated release-wheel pipeline. The short version for a CUDA build: clone with submodules, install libnccl-dev and libcudnn9-dev-cuda-13 before invoking cmake, then

cmake -S . -B build -G Ninja -DUSE_CUDA=ON \
  -DMXNET_CUDA_ARCH="8.0;8.6;8.9;9.0;10.0;12.0+PTX" ...
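If you rebuild for a different set of GPUs, the `MXNET_CUDA_ARCH` value can be generated from `sm_XY` names. The helper below is a hypothetical sketch: the `major.minor;...+PTX` format is inferred solely from the cmake line above, and suffixed targets such as `sm_90a` are deliberately rejected rather than guessed at.

```python
def cuda_arch_list(sm_targets, ptx_for_last=True):
    """Turn ['sm_80', 'sm_120'] into '8.0;12.0+PTX' (format inferred)."""
    entries = []
    for sm in sm_targets:
        digits = sm.removeprefix("sm_")
        if not digits.isdigit() or len(digits) < 2:
            raise ValueError(f"unsupported target name: {sm!r}")
        # The last digit is the minor version; the rest is the major version.
        entries.append(f"{int(digits[:-1])}.{digits[-1]}")
    if ptx_for_last and entries:
        # Embed PTX for the newest target as a forward-compatibility fallback.
        entries[-1] += "+PTX"
    return ";".join(entries)

default_archs = cuda_arch_list(
    ["sm_80", "sm_86", "sm_89", "sm_90", "sm_100", "sm_120"]
)
print(default_archs)  # 8.0;8.6;8.9;9.0;10.0;12.0+PTX
```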

Acknowledgements

This fork builds on the work of the Apache MXNet community and its contributors. All upstream code is Apache 2.0; the CUDA 13 / Blackwell / Apple Silicon patches in this fork are likewise Apache 2.0.

Upstream Apache MXNet (archived 2023-11-17) — historical

Apache MXNet is a deep learning framework designed for both efficiency and flexibility. It lets you mix symbolic and imperative programming, features a dynamic dependency scheduler that parallelizes operations on the fly, a NumPy-like interface integrated with the Gluon 2.0 API, and automatic hybridization. It scaled to multiple GPUs and machines via ps-lite, Horovod, and BytePS, with bindings for Python, Java, C++, R, Scala, Clojure, Go, JavaScript, Perl, and Julia.

The upstream project was archived on 2023-11-17. Its website (mxnet.apache.org), CI dashboards, mailing lists, Slack, and social channels are no longer actively monitored; links to them are kept only for historical reference.

MXNet emerged from a collaboration by the authors of cxxnet, minerva, and purine2.

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems, 2015.

Licensed under Apache-2.0.
