Skip to content

smolix/mxnet

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12,763 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MXNet 2.0 — CUDA 13 / Blackwell + Apple Silicon fork

Unofficial, maintained fork at smolix/mxnet. Run existing MXNet code on current hardware — NVIDIA Blackwell (and Ampere → Hopper) GPUs on the CUDA 13 stack, and native Apple Silicon CPU.

Apache MXNet was archived on 2023-11-17. The upstream tree is frozen at CUDA 11 / cuDNN 8 / oneDNN v2 and does not build on Blackwell GPUs or modern CUDA toolchains. This fork carries the patches needed to keep the residual MXNet community — legacy research notebooks, frozen production pipelines, niche operators like _contrib_quantize_*, and the d2l.ai book — running on today's hardware without a rewrite to PyTorch / JAX. It is not an official Apache release.

  • Current release line: 2.0.0+cu13.bw.<YYYYMMDD> (latest published wheel 2.0.0+cu13.bw.20260614) for Linux/CUDA, and 2.0.0+cpu.macos.<YYYYMMDD> for Apple Silicon. The authoritative current build is always the newest on the Releases page.
  • What changed vs upstream: see FIXED.md.
  • Known limitations / open work: see OPEN_ISSUES.md.
  • Build from source: see BUILDING.md and docs/cuda_wheel_build.md.

Quick install

Linux + NVIDIA GPU (CUDA 13, driver R590+):

# CPython 3.12 (a cp311 wheel is also published)
pip install "https://github.com/smolix/mxnet/releases/download/v2.0.0%2Bcu13.bw.20260614/mxnet-2.0.0%2Bcu13.bw.20260614-cp312-cp312-linux_x86_64.whl"

macOS, Apple Silicon (CPU-only):

# CPython 3.12
pip install "https://github.com/smolix/mxnet/releases/download/v2.0.0%2Bcu13.bw.20260614/mxnet-2.0.0%2Bcpu.macos.20260614-cp312-cp312-macosx_26_0_arm64.whl"

Smoke-test the install:

python -c "import mxnet as mx; from mxnet.runtime import Features; \
print(mx.__version__); print('OPENCV', Features().is_enabled('OPENCV')); \
print('GPUs', mx.device.num_gpus())"

See Installation for the runtime dependency model and troubleshooting, and check the Releases page for a newer dated wheel before pinning a URL.

What works

  • Blackwell sm_120 + full Ampere→Blackwell fatbin (CUDA 13.0): sm_80 (A100), sm_86 (RTX 30xx), sm_89 (Ada/RTX 40xx), sm_90 (Hopper), sm_100 (B100/B200), sm_120 (RTX 50xx) SASS + compute_120 PTX fallback.
  • cuDNN 9.x including the rewritten v8-style RNN path (LSTM/GRU/RNN, fwd+bwd). TF32 on by default for FP32 conv (PyTorch/TF default; ~2.87× on sm_120).
  • NCCL 2.28 single-process / multi-GPU.
  • oneDNN v3.11 float backend everywhere; full INT8 path on x86 (per-OC weight scales, fused conv/FC, fused sum, dequant-to-fp32).
  • cuBLASLt GEMM (fp32/fp16/fp64), bitwise-parity-verified vs the legacy path.
  • CUDA Graphs — revived and default-on for hybridized cached-ops with static_alloc=True + static_shape=True (eager execution unchanged); measured 1.5–2.3× on transformer/RNN-ish nets, bitwise-identical results.
  • INT8 quantization (quantize_net, _sg_onednn_conv, _sg_onednn_fully_connected) on x86; fp16/fp32 forward + backward training.
  • AMP (automatic mixed precision) subgraph — on CPUs without AVX-512-BF16 the bf16 subgraph ops fall back to fp32 (all 6 AMP subgraph tests pass).
  • ONNX export/import (opset-13 default, ONNX 1.21 / ORT 1.24) — bundled in the wheel; pip install "mxnet[onnx]" pulls the onnx dependency (pure-Python, no native rebuild). onnxruntime is only needed to run exported models.
  • Self-contained wheels. Both the Linux CUDA wheel and the macOS CPU wheel bundle OpenCV and its full transitive closure into mxnet/lib/ (ELF $ORIGIN on Linux, Mach-O @loader_path on macOS), so import mxnet reports OPENCV=True and mx.image native decode/resize work on a clean host with no system OpenCV.
  • Native macOS arm64 CPU wheel — Accelerate BLAS/LAPACK + float oneDNN. ~14.9k unittest / operator / NumPy / Gluon / quantization-API tests pass (tools/run_macos_wheel_full_test.sh).

What is experimental or not covered

The four most likely to affect you (full list and details in OPEN_ISSUES.md):

  1. CUDA 13.0 / driver R580 is unsupported — the wheel pins nvidia-cublas>=13.5, which needs driver R590+. On R580 large GEMMs fail with CUBLAS_STATUS_NOT_INITIALIZED.
  2. Apple Silicon: oneDNN INT8 + subgraph fusion are gated off (the Xbyak_aarch64 JIT is unreliable on Apple Silicon — see SupportDNNLAArch64JITPrimitives in src/operator/nn/dnnl/dnnl_base-inl.h); those ops fall back to native kernels and the tests/python/dnnl fusion/quant lane does not apply.
  3. bf16 on CPUs without AVX-512-BF16 is emulated in fp32 — correct, not faster.
  4. Backward through quantized ops is unvalidated — forward INT8 inference is solid; quantized training is not verified.

System requirements

Linux / CUDA:

  • Linux x86_64 (tested on Ubuntu 22.04 / 24.04).
  • NVIDIA driver R590 or newer (required by the nvidia-cublas>=13.5 pin; the older CUDA 13.0 / R580 line is not supported — see FIXED.md §1).
  • CUDA 13.0 toolkit at /usr/local/cuda/ — supplies the base runtime libs (libcudart, libcublas, libcufft, libcusolver, libcurand, libnvrtc) that NVIDIA does not yet ship as real cu13 PyPI wheels.
  • cuDNN and NCCL are pulled in automatically as pip deps (nvidia-cudnn-cu13>=9.22,<10, nvidia-nccl-cu13>=2.28,<3, nvidia-cublas>=13.5,<14).
  • CPython 3.11 or 3.12 (both published as wheels); 3.10–3.13 for source builds.

macOS:

  • Apple Silicon (arm64), macOS 13+; CPython 3.12 (published wheel). CPU-only.

Installation

The wheels are not 2 GB monoliths — only OpenCV is bundled. Everything else is a declared pip dependency or comes from your system CUDA toolkit:

  • CUDA / cuDNN / NCCLpip pulls nvidia-cudnn-cu13 (~1 GB) and nvidia-nccl-cu13 (~190 MB) from PyPI (the PyTorch/JAX layout). The base CUDA 13 runtime libs come from your system toolkit at /usr/local/cuda/ (apt install cuda-13).
  • OpenBLAS — from the scipy-openblas32 wheel.
  • OpenCV — bundled inside the wheel (mxnet/lib/), reached via $ORIGIN (Linux) / @loader_path (macOS). You do not need libopencv-dev or to pip install opencv-python for the native C++ image path; the macOS wheel does additionally depend on opencv-python for the Python cv2 helpers.

The Linux CUDA wheel is ~454 MB; a smaller BUNDLE_OPENCV=0 / USE_OPENCV=OFF build is possible if you do not need MXNet's native image path (reports OPENCV=False).

Troubleshooting:

Symptom Cause / fix
CUBLAS_STATUS_NOT_INITIALIZED on a non-trivial dot/FullyConnected driver too old; upgrade to R590+ (FIXED.md §1)
cuDNN major-version mismatch: … printed on first GPU use the linked cuDNN is a different major version than the build — install a cuDNN 9.x. (Minor skews like 9.23↔9.22 are ABI-compatible and no longer warn — FIXED.md §1.)
cudaErrorNoKernelImageForDevice / no kernel image available wheel lacks SASS for your GPU — rebuild with your arch in MXNET_CUDA_ARCH
Build with USE_OPENCV=1 for image io OpenCV-off wheel — install/build an OPENCV=ON wheel
slow batch-size-1 CPU inference (esp. AVX2-only, or an IC=3 first conv) oneDNN v3 picks a throughput-oriented brg_conv whose overhead dominates at bs=1, and the AVX2 weight format pads IC=3→16. The IC<8 + bs≤1 first-conv cliff is now auto-mitigated on non-AVX-512 hosts (a dispatch gate steers off brg_conv:avx2); for the general bs=1 case set OMP_NUM_THREADS=1 — multi-thread can scale negatively here — and optionally try DNNL_DEFAULT_FPMATH_MODE. The broader brgemm kernel perf fix is deferred (OI-14).

Building from source

See BUILDING.md for the from-scratch recipe (Linux/CUDA and the Apple Silicon CPU build) and docs/cuda_wheel_build.md for the authoritative, provenance-gated release-wheel pipeline. The short version for a CUDA build: clone with submodules, install libnccl-dev and libcudnn9-dev-cuda-13 before invoking cmake, then

cmake -S . -B build -G Ninja -DUSE_CUDA=ON \
  -DMXNET_CUDA_ARCH="8.0;8.6;8.9;9.0;10.0;12.0+PTX" ...

Acknowledgements

This fork builds on the work of the Apache MXNet community and its contributors. All upstream code is Apache 2.0; the CUDA 13 / Blackwell / Apple Silicon patches in this fork are likewise Apache 2.0.


Upstream Apache MXNet (archived 2023-11-17) — historical

Apache MXNet is a deep learning framework designed for both efficiency and flexibility. It lets you mix symbolic and imperative programming, features a dynamic dependency scheduler that parallelizes operations on the fly, a NumPy-like interface integrated with the Gluon 2.0 API, and automatic hybridization. It scaled to multiple GPUs and machines via ps-lite, Horovod, and BytePS, with bindings for Python, Java, C++, R, Scala, Clojure, Go, Javascript, Perl, and Julia.

The upstream project was archived on 2023-11-17. Its website (mxnet.apache.org), CI dashboards, mailing lists, Slack, and social channels are no longer actively monitored; links to them are kept only for historical reference.

MXNet emerged from a collaboration by the authors of cxxnet, minerva, and purine2.

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems, 2015.

Licensed under Apache-2.0.

About

Maintained fork of Apache MXNet 2.0 — runs legacy MXNet code on CUDA 13 / Blackwell (sm_120) GPUs and native Apple Silicon CPU.

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages

  • C++ 47.1%
  • Python 36.7%
  • Jupyter Notebook 6.9%
  • Cuda 6.0%
  • Shell 1.0%
  • CMake 0.9%
  • Other 1.4%