Add initial Windows support by 0xDELUXA · Pull Request #395 · ROCm/FlyDSL

0xDELUXA · 2026-04-13T13:36:29Z

Motivation

FlyDSL has been Linux-only so far. This PR adds experimental Windows support, building against the TheRock ROCm SDK (installed as a Python package into a venv) rather than a system ROCm install. The goal is to let Windows users with supported AMD GPUs author and run FlyDSL kernels without WSL.

Result: 301 / 310 unit tests pass on Windows (97%). Linux behavior is unchanged — every Windows-specific code path is gated on WIN32 / sys.platform == "win32".

Tested on Windows 11 + RDNA4 (gfx1200, Radeon RX 9060 XT) + Python 3.12 + MSVC 2022 + clang-cl + Ninja + TheRock ROCm SDK.

Technical Details

Problems solved

| Problem | Solution |
| --- | --- |
| Cross-DLL TypeID mismatch: MLIR's `SelfOwningTypeID` static data members can't be auto-imported across DLLs on Windows, causing `StorageUniquer isn't initialized` errors | Compile the whole stack with `MLIR_USE_FALLBACK_TYPE_IDS=1` (string-based TypeIDs). Top-level `CMakeLists.txt` + `scripts/build_llvm.ps1` both set this |
| Symbol export from `FlyPythonCAPI.dll`: `.pyd` extensions couldn't resolve `mlir::fly::*` C++ symbols | Enable `WINDOWS_EXPORT_ALL_SYMBOLS`, add `obj.MLIRFlyDialect` / `obj.MLIRFlyROCDLDialect` as direct sources to `FlyPythonCAPI`, and extract `MLIRIR.lib` / `MLIRSupport.lib` via `llvm-ar x` so the auto-generated `.def` scan sees them |
| Duplicate linkage on Windows of Fly dialect libs | `lib/CAPI/Dialect/{Fly,FlyROCDL}/CMakeLists.txt` drops the redundant `LINK_LIBS` entries on Windows |
| MLIR's ROCDL target hardcodes `<toolkit>/llvm/bin/ld.lld` AND `<toolkit>/amdgcn/bitcode/`; TheRock puts them at `<ROCM>/lib/llvm/bin/ld.lld.exe` and `<ROCM>/lib/llvm/amdgcn/bitcode/`, so no single toolkit path works | `python/flydsl/utils/platform.py::rocm_toolkit_path()` stages `%LOCALAPPDATA%\flydsl\rocm_toolkit\` with two directory junctions unifying the layout (junctions don't need admin) |
| LLVM's `LoadLibraryPermanently` fails to resolve transitive deps: the Windows default search order doesn't include the DLL's own directory, so `mlir_c_runner_utils.dll` couldn't find `mlir_float16_utils.dll` at JIT engine init | `jit_executor._resolve_runtime_libs()` calls `os.add_dll_directory` and pre-loads sibling `*.dll` via ctypes with `LOAD_WITH_ALTERED_SEARCH_PATH` before creating the engine |
| Missing `mlir_rocm_runtime.dll`: GPU runtime symbols unresolvable at JIT time | Added to `RocmBackend.jit_runtime_lib_basenames()` and to `CopyFlyPythonSources` copy-list in `python/mlir_flydsl/CMakeLists.txt` |
| GPU arch auto-detection: TheRock doesn't ship `rocm_agent_enumerator` → default fallback `gfx942` compiled kernels for the wrong chip → `hipErrorNoBinaryForGpu` | `runtime/device.py::get_rocm_arch()` falls back to `torch.cuda.get_device_properties(0).gcnArchName` |
| MSVC `_ITERATOR_DEBUG_LEVEL` mismatch between Debug FlyDSL and Release MLIR | `scripts/build.ps1` forces `-DCMAKE_BUILD_TYPE=Release` |
| CMake finding ROCm's bundled `LLVMConfig.cmake` (missing NVPTX targets) | `scripts/build.ps1` passes `-DLLVM_DIR=$MLIR_PATH\lib\cmake\llvm` explicitly |
| `.so` / `lib…` hardcoded in native-lib fingerprinting | `utils/platform.py::shared_lib_name()` / `shared_lib_glob()` map Linux conventions to `.dll` / `.pyd` |
| Runtime wrapper exports (`mgpuLaunchKernel` etc.) | `FLY_RUNTIME_EXPORT` macro: `__declspec(dllexport)` on MSVC, `__attribute__((visibility("default")))` elsewhere |
| Editable-install symlink requires admin / Developer Mode on Windows | `setup.py` falls back to `mklink /J` junction, then `copytree` |
| Hardcoded `/tmp`, `/opt/rocm/lib/libamd_comgr.so.3`, `libamdhip64.so` | Replaced with `tempfile.gettempdir()` / platform-gated DLL names in `_compat.py`, `kernels/custom_all_reduce.py`, `scripts/generate_summary.py` |
| `BindingUtils.h` include order: `Interop.h` uses `PyObject` without including `Python.h` | Move `Nanobind.h` (pulls in `Python.h`) before `Interop.h` |
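
The arch auto-detection row can be read as a chain of probes: try `rocm_agent_enumerator`, then ask PyTorch, then fall back to the default. The following is only a sketch of that idea, not the code in `runtime/device.py`. The helper names (`detect_arch`, `_arch_from_enumerator`, `_arch_from_torch`) are hypothetical, and stripping feature suffixes from `gcnArchName` is an assumption about what the caller wants.

```python
import shutil
import subprocess

DEFAULT_ARCH = "gfx942"  # the default fallback named in the table above


def _arch_from_enumerator():
    """Return the first GPU arch printed by rocm_agent_enumerator, or None."""
    exe = shutil.which("rocm_agent_enumerator")
    if exe is None:  # e.g. a TheRock install that does not ship the tool
        return None
    try:
        out = subprocess.run(
            [exe], capture_output=True, text=True, timeout=10, check=True
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return None
    for token in out.split():
        # gfx000 is the CPU agent, so skip it.
        if token.startswith("gfx") and token != "gfx000":
            return token
    return None


def _arch_from_torch():
    """Return the arch PyTorch reports for device 0, or None."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    name = getattr(torch.cuda.get_device_properties(0), "gcnArchName", "")
    # Assumption: suffixes such as ":sramecc+:xnack-" are not wanted here.
    return name.split(":")[0] or None


def detect_arch(probes=(_arch_from_enumerator, _arch_from_torch),
                default=DEFAULT_ARCH):
    """Run each probe in order and return the first non-empty answer."""
    for probe in probes:
        arch = probe()
        if arch:
            return arch
    return default
```

Passing the probes as a parameter keeps the fallback order explicit and lets a test substitute fakes without a GPU present.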

New files

- `scripts/build_llvm.ps1`: PowerShell equivalent of `build_llvm.sh`, with `-Arch` parameter / `FLYDSL_GPU_ARCH` env var for `ROCM_TEST_CHIPSET`
- `scripts/build.ps1`: PowerShell equivalent of `build.sh`
- `python/flydsl/utils/platform.py`: cross-platform DLL/.so naming helpers + the `rocm_toolkit` staging logic
- `docs/windows_build_guide.md`: step-by-step setup guide for Windows users
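
To illustrate what the naming helpers in `platform.py` might do, here is a minimal sketch. The function names match those mentioned in this PR, but the signatures (including the `platform` parameter) and the bodies are assumptions made for illustration, not the actual implementation.

```python
import re
import sys


def shared_lib_name(linux_name, platform=sys.platform):
    """Map a Linux-style name such as 'libfoo.so.3' to a Windows DLL name.

    Hypothetical rule: drop the 'lib' prefix and any '.so[.N]' suffix.
    """
    if platform != "win32":
        return linux_name
    stem = re.sub(r"\.so(\.\d+)*$", "", linux_name)
    if stem.startswith("lib"):
        stem = stem[len("lib"):]
    return stem + ".dll"


def shared_lib_glob(linux_glob, platform=sys.platform):
    """Return glob patterns to try; on Windows cover both .dll and .pyd."""
    if platform != "win32" or not linux_glob.endswith(".so"):
        return [linux_glob]
    stem = linux_glob[: -len(".so")]
    if stem.startswith("lib"):
        stem = stem[len("lib"):]
    return [stem + ".dll", stem + ".pyd"]
```

For example, under these assumed rules `shared_lib_name("libmlir_c_runner_utils.so", platform="win32")` yields `mlir_c_runner_utils.dll`, while on Linux the name is returned unchanged.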

Relevant upstream references

- MLIR TypeID: llvm/llvm-project#mlir/include/mlir/Support/TypeID.h; see the comment on `MLIR_USE_FALLBACK_TYPE_IDS` recommending this mode for "complex shared library setups"
- MLIR ROCDL target (toolkit path handling): mlir/lib/Target/LLVM/ROCDL/Target.cpp, which hardcodes `llvm/bin/ld.lld` and `amdgcn/bitcode` under the toolkit root
- TheRock ROCm SDK (Windows distribution): https://github.com/ROCm/TheRock
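
The toolkit-path mismatch described above can be bridged with two directory links. The sketch below shows one way to stage such a directory, assuming the layout given in this PR (`<ROCM>/lib/llvm/...`). The function names `junction_plan`, `_link_dir`, and `stage_toolkit` are hypothetical, and the symlink branch for non-Windows platforms exists only so the example can run anywhere; it is not part of the PR.

```python
import os
import subprocess
import sys
from pathlib import Path


def junction_plan(rocm_root, stage_dir):
    """Map each link in the staged toolkit to its TheRock target."""
    rocm_llvm = Path(rocm_root) / "lib" / "llvm"
    stage = Path(stage_dir)
    return {
        stage / "llvm": rocm_llvm,               # gives <stage>/llvm/bin/ld.lld
        stage / "amdgcn": rocm_llvm / "amdgcn",  # gives <stage>/amdgcn/bitcode
    }


def _link_dir(link, target):
    if sys.platform == "win32":
        # mklink is a cmd.exe builtin; /J creates a directory junction,
        # which does not need admin rights or Developer Mode.
        subprocess.run(
            ["cmd", "/c", "mklink", "/J", str(link), str(target)],
            check=True, capture_output=True,
        )
    else:
        # Illustration only: stand-in so the sketch runs off Windows.
        os.symlink(target, link, target_is_directory=True)


def stage_toolkit(rocm_root, stage_dir):
    """Create the staged layout if needed and return its root."""
    stage = Path(stage_dir)
    stage.mkdir(parents=True, exist_ok=True)
    for link, target in junction_plan(rocm_root, stage).items():
        if not os.path.lexists(link):  # keep the call idempotent
            _link_dir(link, target)
    return stage
```

The returned directory can then be handed to whatever expects a single toolkit root, since both `llvm/bin` and `amdgcn/bitcode` resolve beneath it.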

Test Plan

Build LLVM/MLIR on Windows:
```
.\scripts\build_llvm.ps1 -Arch gfx1200
```

Build FlyDSL:

```
$env:MLIR_PATH = "...\llvm-project\mlir_install"
.\scripts\build.ps1
pip install -e .
```

Run the unit suite:

```
$env:PYTHONPATH = "$PWD\build-fly\python_packages;$PWD"
python -m pytest tests\unit\ -q
```

Confirm Linux builds still succeed (all Windows paths are if(WIN32) / sys.platform == "win32" gated).

Full reproduction steps are in docs/windows_build_guide.md.

Test Result

Windows 11 + gfx1200 + Python 3.12: 301 passed / 4 failed / 5 skipped in tests/unit/ (97%).

The 4 remaining failures:

- 2 multi-stream correctness tests (`test_multi_stream_launch::test_two_streams_independent`, `test_diamond_pipeline_with_event_sync`). The single-stream variants all pass, so the likely cause is how the `torch.cuda.Stream.cuda_stream` handle is treated on Windows ROCm; this needs deeper investigation.
- 1 test-robustness failure (not Windows-specific): `test_fp_math_reaches_pipeline` passes solo but fails in-suite because disk-cache hits bypass the monkey-patched `pipeline_fragments`. Workaround: `FLYDSL_RUNTIME_ENABLE_CACHE=0`.
- 1 torch profiler compatibility failure (not Windows-specific): `test_cache_disabled_run_perftest_does_not_crash` fails because the `DataFrame.host_time_sum` attribute is missing in the current torch version.
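
The cache workaround for the test-robustness failure could also be scoped to a single test rather than the whole run. This is a minimal sketch using only the environment variable named above; the context manager `runtime_cache_disabled` is hypothetical, and it assumes FlyDSL reads the variable when compiling rather than once at import time, which this PR does not confirm.

```python
import contextlib
import os

CACHE_ENV = "FLYDSL_RUNTIME_ENABLE_CACHE"  # name taken from the workaround above


@contextlib.contextmanager
def runtime_cache_disabled():
    """Set the cache switch to 0 for the duration of the block, then restore."""
    previous = os.environ.get(CACHE_ENV)
    os.environ[CACHE_ENV] = "0"
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(CACHE_ENV, None)
        else:
            os.environ[CACHE_ENV] = previous
```

Wrapping the body of a cache-sensitive test in `with runtime_cache_disabled():` would keep the setting from leaking into other tests in the suite.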

Other caveats:

- No Windows CI job yet; each build is verified manually. Adding GH Actions with TheRock + RDNA-capable runners could be a next step.
- Only gfx942 and gfx1200 have been exercised on Windows; other arches should work given a compatible TheRock SDK.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

cc @coderfeli This PR introduces significant changes. While it’s unlikely to be merged, it is intended as a reference for those interested in experimenting with FlyDSL on Windows.

coderfeli · 2026-04-13T14:29:19Z

Thanks @0xDELUXA. Good implementation and results. But as you mentioned in the issue, we don't have platforms like the 9070 on hand to tune performance and validate in CI, so we don't plan to support or maintain this right now. We may need it some day.

Add initial Windows support

db8e753

0xDELUXA wants to merge 1 commit into ROCm:main from 0xDELUXA:initial-windows-support
