Add initial Windows support #395

Draft
0xDELUXA wants to merge 1 commit into ROCm:main from 0xDELUXA:initial-windows-support

@0xDELUXA 0xDELUXA commented Apr 13, 2026

Motivation

FlyDSL has been Linux-only so far. This PR adds experimental Windows support, building against the TheRock ROCm SDK (installed as a Python package into a venv) rather than a system ROCm install. The goal is to let Windows users with supported AMD GPUs author and run FlyDSL kernels without WSL.

Result: 301 / 310 unit tests pass on Windows (97%). Linux behavior is unchanged — every Windows-specific code path is gated on WIN32 / sys.platform == "win32".

Tested on Windows 11 + RDNA4 (gfx1200, Radeon RX 9060 XT) + Python 3.12 + MSVC 2022 + clang-cl + Ninja + TheRock ROCm SDK.

Technical Details

Problems solved

| Problem | Solution |
| --- | --- |
| Cross-DLL TypeID mismatch: MLIR's SelfOwningTypeID static data members can't be auto-imported across DLLs on Windows, causing "StorageUniquer isn't initialized" errors | Compile the whole stack with MLIR_USE_FALLBACK_TYPE_IDS=1 (string-based TypeIDs). The top-level CMakeLists.txt and scripts/build_llvm.ps1 both set this |
| Symbol export from FlyPythonCAPI.dll: .pyd extensions couldn't resolve mlir::fly::* C++ symbols | Enable WINDOWS_EXPORT_ALL_SYMBOLS, add obj.MLIRFlyDialect / obj.MLIRFlyROCDLDialect as direct sources to FlyPythonCAPI, and extract MLIRIR.lib / MLIRSupport.lib via llvm-ar x so the auto-generated .def scan sees them |
| Duplicate linkage of Fly dialect libs on Windows | lib/CAPI/Dialect/{Fly,FlyROCDL}/CMakeLists.txt drops the redundant LINK_LIBS entries on Windows |
| MLIR's ROCDL target hardcodes <toolkit>/llvm/bin/ld.lld AND <toolkit>/amdgcn/bitcode/. TheRock puts them at <ROCM>/lib/llvm/bin/ld.lld.exe and <ROCM>/lib/llvm/amdgcn/bitcode/, so no single toolkit path works | python/flydsl/utils/platform.py::rocm_toolkit_path() stages %LOCALAPPDATA%\flydsl\rocm_toolkit\ with two directory junctions unifying the layout (junctions don't need admin) |
| LLVM's LoadLibraryPermanently fails to resolve transitive deps: the Windows default search order doesn't include the DLL's own directory, so mlir_c_runner_utils.dll couldn't find mlir_float16_utils.dll at JIT engine init | jit_executor._resolve_runtime_libs() calls os.add_dll_directory and pre-loads sibling *.dll via ctypes with LOAD_WITH_ALTERED_SEARCH_PATH before creating the engine |
| Missing mlir_rocm_runtime.dll: GPU runtime symbols unresolvable at JIT time | Added to RocmBackend.jit_runtime_lib_basenames() and to the CopyFlyPythonSources copy-list in python/mlir_flydsl/CMakeLists.txt |
| GPU arch auto-detection: TheRock doesn't ship rocm_agent_enumerator → default fallback gfx942 compiled kernels for the wrong chip → hipErrorNoBinaryForGpu | runtime/device.py::get_rocm_arch() falls back to torch.cuda.get_device_properties(0).gcnArchName |
| MSVC _ITERATOR_DEBUG_LEVEL mismatch between Debug FlyDSL and Release MLIR | scripts/build.ps1 forces -DCMAKE_BUILD_TYPE=Release |
| CMake finding ROCm's bundled LLVMConfig.cmake (missing NVPTX targets) | scripts/build.ps1 passes -DLLVM_DIR=$MLIR_PATH\lib\cmake\llvm explicitly |
| .so / lib… hardcoded in native-lib fingerprinting | utils/platform.py::shared_lib_name() / shared_lib_glob() map Linux conventions to .dll / .pyd |
| Runtime wrapper exports (mgpuLaunchKernel etc.) | FLY_RUNTIME_EXPORT macro: __declspec(dllexport) on MSVC, __attribute__((visibility("default"))) elsewhere |
| Editable-install symlink requires admin / Developer Mode on Windows | setup.py falls back to mklink /J junction, then copytree |
| Hardcoded /tmp, /opt/rocm/lib/libamd_comgr.so.3, libamdhip64.so | Replaced with tempfile.gettempdir() / platform-gated DLL names in _compat.py, kernels/custom_all_reduce.py, scripts/generate_summary.py |
| BindingUtils.h include order: Interop.h uses PyObject without including Python.h | Move Nanobind.h (pulls in Python.h) before Interop.h |
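
The GPU arch auto-detection row describes a chain of fallbacks, which can be sketched roughly as below. This is an illustrative sketch, not the code in runtime/device.py: the function names detect_gpu_arch, first_gpu_arch and normalize_arch are hypothetical, and the sketch assumes that rocm_agent_enumerator prints one gfx target per line (with gfx000 standing for the CPU agent) and that gcnArchName may carry feature suffixes after a colon.

```python
import shutil
import subprocess


def normalize_arch(name):
    """Strip feature suffixes: 'gfx90a:sramecc+:xnack-' -> 'gfx90a'."""
    return name.strip().split(":")[0]


def first_gpu_arch(enumerator_output):
    """Return the first real GPU target in enumerator-style output, or None."""
    for token in enumerator_output.split():
        # Assumption: gfx000 denotes the CPU agent, so it is skipped.
        if token.startswith("gfx") and token != "gfx000":
            return normalize_arch(token)
    return None


def detect_gpu_arch(default="gfx942"):
    """Hypothetical helper: enumerator first, then PyTorch, then a default."""
    # 1. Use rocm_agent_enumerator when it is on PATH (TheRock does not ship it).
    tool = shutil.which("rocm_agent_enumerator")
    if tool:
        result = subprocess.run([tool], capture_output=True, text=True, check=False)
        arch = first_gpu_arch(result.stdout)
        if arch:
            return arch
    # 2. Fall back to what PyTorch reports for device 0.
    try:
        import torch

        if torch.cuda.is_available():
            return normalize_arch(torch.cuda.get_device_properties(0).gcnArchName)
    except (ImportError, AttributeError, RuntimeError):
        # torch missing, a non-ROCm build without gcnArchName, or no usable device.
        pass
    # 3. Last resort: the hardcoded default, which may not match the installed GPU.
    return default
```

Keeping the parsing in small pure functions makes the fallback order testable without a GPU; only detect_gpu_arch touches the system.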

New files

  • scripts/build_llvm.ps1 — PowerShell equivalent of build_llvm.sh, with -Arch parameter / FLYDSL_GPU_ARCH env var for ROCM_TEST_CHIPSET
  • scripts/build.ps1 — PowerShell equivalent of build.sh
  • python/flydsl/utils/platform.py — cross-platform DLL/.so naming helpers + the rocm_toolkit staging logic
  • docs/windows_build_guide.md — step-by-step setup guide for Windows users
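
For readers wondering what the naming helpers in platform.py might look like, here is a minimal sketch. The function names come from the PR description, but the signatures, the platform parameter, and the exact glob patterns are assumptions made for illustration; the real helpers may differ.

```python
import sys


def shared_lib_name(stem, platform=sys.platform):
    """Map a bare library stem to the platform's shared-library file name.

    Assumed convention: 'lib<stem>.so' on Linux, '<stem>.dll' on Windows.
    """
    if platform == "win32":
        return f"{stem}.dll"
    return f"lib{stem}.so"


def shared_lib_glob(stem, platform=sys.platform):
    """Return a glob pattern matching versioned variants of the library."""
    if platform == "win32":
        return f"{stem}*.dll"
    # The trailing '*' also matches versioned names such as 'libfoo.so.3'.
    return f"lib{stem}*.so*"


if __name__ == "__main__":
    print(shared_lib_name("mlir_c_runner_utils"))
    print(shared_lib_glob("mlir_c_runner_utils"))
```

Passing the platform as a parameter, rather than reading sys.platform inside the function body, lets both naming schemes be exercised from a single machine.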

Relevant upstream references

Test Plan

  1. Build LLVM/MLIR on Windows:
    .\scripts\build_llvm.ps1 -Arch gfx1200
  2. Build FlyDSL:
    $env:MLIR_PATH = "...\llvm-project\mlir_install"
    .\scripts\build.ps1
    pip install -e .
  3. Run the unit suite:
    $env:PYTHONPATH = "$PWD\build-fly\python_packages;$PWD"
    python -m pytest tests\unit\ -q
  4. Confirm Linux builds still succeed (all Windows paths are if(WIN32) / sys.platform == "win32" gated).

Full reproduction steps are in docs/windows_build_guide.md.
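
Before step 2, it may help to confirm that MLIR_PATH points at an install that contains the LLVM CMake package directory that build.ps1 passes as LLVM_DIR. The following pre-flight check is a hypothetical sketch and is not part of this PR; the function name check_build_env is invented, and the only layout it assumes is the lib\cmake\llvm subdirectory mentioned in the table above.

```python
import os
from pathlib import Path


def check_build_env(env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    mlir_path = env.get("MLIR_PATH")
    if not mlir_path:
        problems.append("MLIR_PATH is not set")
        return problems
    # build.ps1 is described as passing -DLLVM_DIR=$MLIR_PATH\lib\cmake\llvm.
    llvm_cmake = Path(mlir_path) / "lib" / "cmake" / "llvm"
    if not llvm_cmake.is_dir():
        problems.append(f"missing LLVM CMake directory: {llvm_cmake}")
    return problems


if __name__ == "__main__":
    for problem in check_build_env():
        print("problem:", problem)
```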

Test Result

Windows 11 + gfx1200 + Python 3.12: 301 passed / 4 failed / 5 skipped in tests/unit/ (97%).

The 4 remaining failures:

  • 2 multi-stream correctness tests (test_multi_stream_launch::test_two_streams_independent, test_diamond_pipeline_with_event_sync). Single-stream variants all pass, so the likely culprit is how torch.cuda.Stream.cuda_stream handles are passed through on Windows ROCm; this needs deeper investigation.
  • 1 test-robustness failure (not Windows-specific): test_fp_math_reaches_pipeline passes solo but fails in-suite because disk-cache hits bypass the monkey-patched pipeline_fragments. Workaround: FLYDSL_RUNTIME_ENABLE_CACHE=0.
  • 1 torch profiler compat failure (not Windows-specific): test_cache_disabled_run_perftest_does_not_crash. The DataFrame.host_time_sum attribute is missing in the current torch version.
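
The FLYDSL_RUNTIME_ENABLE_CACHE=0 workaround can be scoped to a single test run rather than set globally. Below is a minimal sketch of a scoped environment override; env_override is a hypothetical helper, and whether FlyDSL reads the variable at import time or at call time is not stated in this PR, so the override may need to be in place before flydsl is first imported.

```python
import os
from contextlib import contextmanager


@contextmanager
def env_override(name, value):
    """Temporarily set an environment variable, restoring the old state on exit."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            # The variable was unset before, so remove it again.
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


if __name__ == "__main__":
    with env_override("FLYDSL_RUNTIME_ENABLE_CACHE", "0"):
        # Run the cache-sensitive test here, e.g. through pytest.main([...]).
        print(os.environ["FLYDSL_RUNTIME_ENABLE_CACHE"])
```

Inside a pytest suite, the built-in monkeypatch.setenv fixture method achieves the same effect with automatic teardown.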

Other caveats:

  • No Windows CI job yet — each build is verified manually. Adding GH Actions with TheRock + RDNA-capable runners could be a next step.
  • Only gfx942 and gfx1200 have been exercised on Windows; other arches should work given a compatible TheRock SDK.

Submission Checklist

cc @coderfeli This PR introduces significant changes. While it’s unlikely to be merged, it is intended as a reference for those interested in experimenting with FlyDSL on Windows.

@coderfeli
Collaborator

Thanks @0xDELUXA. Good implementation and results. But as you mentioned in the issue, we don't have platforms like the 9070 on hand to tune perf and validate in CI, so we don't plan to support and maintain this right now. We may need it some day.
