Collaborator
Thanks @0xDELUXA. Good implementation and results. But as you mentioned in the issue, we don't have platforms like the 9070 on hand to tune performance and validate in CI, so we don't plan to support and maintain this right now. We may need it some day.
Motivation
FlyDSL has been Linux-only so far. This PR adds experimental Windows support, building against the TheRock ROCm SDK (installed as a Python package into a venv) rather than a system ROCm install. The goal is to let Windows users with supported AMD GPUs author and run FlyDSL kernels without WSL.
Result: 301 / 310 unit tests pass on Windows (97%). Linux behavior is unchanged: every Windows-specific code path is gated on `WIN32` / `sys.platform == "win32"`.
Tested on Windows 11 + RDNA4 (`gfx1200`, Radeon RX 9060 XT) + Python 3.12 + MSVC 2022 + `clang-cl` + Ninja + TheRock ROCm SDK.
Technical Details
Problems solved
| Problem | Fix |
| --- | --- |
| `SelfOwningTypeID` static data members can't be auto-imported across DLLs on Windows, causing `StorageUniquer isn't initialized` errors | `MLIR_USE_FALLBACK_TYPE_IDS=1` (string-based TypeIDs). Top-level `CMakeLists.txt` + `scripts/build_llvm.ps1` both set this |
| `FlyPythonCAPI.dll`: `.pyd` extensions couldn't resolve `mlir::fly::*` C++ symbols | `WINDOWS_EXPORT_ALL_SYMBOLS`, add `obj.MLIRFlyDialect` / `obj.MLIRFlyROCDLDialect` as direct sources to `FlyPythonCAPI`, and extract `MLIRIR.lib` / `MLIRSupport.lib` via `llvm-ar x` so the auto-generated `.def` scan sees them. `lib/CAPI/Dialect/{Fly,FlyROCDL}/CMakeLists.txt` drops the redundant `LINK_LIBS` entries on Windows |
| MLIR's ROCDL target hardcodes `<toolkit>/llvm/bin/ld.lld` AND `<toolkit>/amdgcn/bitcode/`: TheRock puts them at `<ROCM>/lib/llvm/bin/ld.lld.exe` and `<ROCM>/lib/llvm/amdgcn/bitcode/`, so no single toolkit path works | `python/flydsl/utils/platform.py::rocm_toolkit_path()` stages `%LOCALAPPDATA%\flydsl\rocm_toolkit\` with two directory junctions unifying the layout (junctions don't need admin) |
| `LoadLibraryPermanently` fails to resolve transitive deps: the Windows default search order doesn't include the DLL's own directory, so `mlir_c_runner_utils.dll` couldn't find `mlir_float16_utils.dll` at JIT engine init | `jit_executor._resolve_runtime_libs()` calls `os.add_dll_directory` and pre-loads sibling `*.dll` via ctypes with `LOAD_WITH_ALTERED_SEARCH_PATH` before creating the engine |
| `mlir_rocm_runtime.dll`: GPU runtime symbols unresolvable at JIT time | Added to `RocmBackend.jit_runtime_lib_basenames()` and to the `CopyFlyPythonSources` copy-list in `python/mlir_flydsl/CMakeLists.txt` |
| `rocm_agent_enumerator` → default fallback `gfx942` compiled kernels for the wrong chip → `hipErrorNoBinaryForGpu` | `runtime/device.py::get_rocm_arch()` falls back to `torch.cuda.get_device_properties(0).gcnArchName` |
| `_ITERATOR_DEBUG_LEVEL` mismatch between Debug FlyDSL and Release MLIR | `scripts/build.ps1` forces `-DCMAKE_BUILD_TYPE=Release` |
| Wrong `LLVMConfig.cmake` picked up (missing NVPTX targets) | `scripts/build.ps1` passes `-DLLVM_DIR=$MLIR_PATH\lib\cmake\llvm` explicitly |
| `.so` / `lib…` hardcoded in native-lib fingerprinting | `utils/platform.py::shared_lib_name()` / `shared_lib_glob()` map Linux conventions to `.dll` / `.pyd` |
| Unexported runtime symbols (`mgpuLaunchKernel` etc.) | `FLY_RUNTIME_EXPORT` macro: `__declspec(dllexport)` on MSVC, `__attribute__((visibility("default")))` elsewhere |
|  | `setup.py` falls back to a `mklink /J` junction, then `copytree` |
| Hardcoded `/tmp`, `/opt/rocm/lib/libamd_comgr.so.3`, `libamdhip64.so` | `tempfile.gettempdir()` / platform-gated DLL names in `_compat.py`, `kernels/custom_all_reduce.py`, `scripts/generate_summary.py` |
| `BindingUtils.h` include order: `Interop.h` uses `PyObject` without including `Python.h` | Include `Nanobind.h` (pulls in `Python.h`) before `Interop.h` |

New files
- `scripts/build_llvm.ps1`: PowerShell equivalent of `build_llvm.sh`, with `-Arch` parameter / `FLYDSL_GPU_ARCH` env var for `ROCM_TEST_CHIPSET`
- `scripts/build.ps1`: PowerShell equivalent of `build.sh`
- `python/flydsl/utils/platform.py`: cross-platform DLL/`.so` naming helpers + the `rocm_toolkit` staging logic
- `docs/windows_build_guide.md`: step-by-step setup guide for Windows users

Relevant upstream references
- llvm/llvm-project# `mlir/include/mlir/Support/TypeID.h`: see the comment on `MLIR_USE_FALLBACK_TYPE_IDS` recommending this mode for "complex shared library setups"
- `mlir/lib/Target/LLVM/ROCDL/Target.cpp`: hardcodes `llvm/bin/ld.lld` and `amdgcn/bitcode` under the toolkit root

Test Plan
- `.\scripts\build_llvm.ps1 -Arch gfx1200`
- Linux behavior unchanged (all Windows-specific paths are `if(WIN32)` / `sys.platform == "win32"` gated).

Full reproduction steps are in `docs/windows_build_guide.md`.

Test Result
Windows 11 + gfx1200 + Python 3.12: 301 passed / 4 failed / 5 skipped in `tests/unit/` (97%).
The 4 remaining failures:
- Two multi-stream tests (`test_multi_stream_launch::test_two_streams_independent`, `test_diamond_pipeline_with_event_sync`). Single-stream variants all pass, so the handling of the `torch.cuda.Stream.cuda_stream` handle on Windows ROCm likely needs deeper investigation.
- `test_fp_math_reaches_pipeline` passes solo but fails in-suite because disk-cache hits bypass the monkey-patched `pipeline_fragments`. Workaround: `FLYDSL_RUNTIME_ENABLE_CACHE=0`.
- `test_cache_disabled_run_perftest_does_not_crash`: `DataFrame.host_time_sum` attribute missing in the current torch version.

Other caveats:
- `gfx942` and `gfx1200` have been exercised on Windows; other arches should work given a compatible TheRock SDK.

Submission Checklist
cc @coderfeli This PR introduces significant changes. While it’s unlikely to be merged, it is intended as a reference for those interested in experimenting with FlyDSL on Windows.
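
For readers reproducing the DLL search-path fix listed under Problems solved, the following is a minimal sketch of the idea, not the PR's actual `jit_executor._resolve_runtime_libs()`. The helper name `preload_sibling_dlls` and the module-level handle list are assumptions made for illustration; only `os.add_dll_directory`, `ctypes.WinDLL(..., winmode=...)`, and the `LOAD_WITH_ALTERED_SEARCH_PATH` flag value come from the standard library and the Win32 API.

```python
import ctypes
import os
import sys
from pathlib import Path

# Win32 LoadLibraryEx flag: search the loaded DLL's own directory for its dependencies.
LOAD_WITH_ALTERED_SEARCH_PATH = 0x00000008

# Keep the handles returned by os.add_dll_directory referenced for the process lifetime.
_dll_dir_handles = []


def preload_sibling_dlls(lib_dir):
    """Hypothetical helper: register lib_dir and pre-load every *.dll inside it.

    Returns the file names that loaded successfully. Does nothing off Windows.
    """
    if sys.platform != "win32":
        return []
    # LOAD_WITH_ALTERED_SEARCH_PATH requires absolute paths.
    lib_dir = Path(lib_dir).resolve()
    _dll_dir_handles.append(os.add_dll_directory(str(lib_dir)))
    loaded = []
    for dll in sorted(lib_dir.glob("*.dll")):
        try:
            ctypes.WinDLL(str(dll), winmode=LOAD_WITH_ALTERED_SEARCH_PATH)
        except OSError:
            continue  # skip DLLs that cannot load; the caller decides if that is fatal
        loaded.append(dll.name)
    return loaded
```

Calling such a helper on the directory that holds `mlir_c_runner_utils.dll` before the JIT engine is created means its sibling dependencies are already mapped into the process when the engine loads the runtime libraries.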
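
The architecture-detection fallback from the Problems solved table can be sketched in a similar way. This is an illustration under stated assumptions, not the code in `runtime/device.py::get_rocm_arch()`: the names `normalize_gcn_arch`, `detect_rocm_arch`, and `torch_probe` are hypothetical, and the probe list is injected so the logic can be exercised without a GPU. The suffix stripping assumes arch strings may carry feature flags after a colon, as in `gfx942:sramecc+:xnack-`.

```python
def normalize_gcn_arch(name):
    """Drop feature suffixes, e.g. 'gfx942:sramecc+:xnack-' becomes 'gfx942'."""
    return name.split(":", 1)[0].strip()


def detect_rocm_arch(probes, default="gfx942"):
    """Return the arch from the first probe that yields a non-empty name.

    Each probe is a zero-argument callable; a probe that raises is skipped.
    """
    for probe in probes:
        try:
            name = probe()
        except Exception:
            continue  # e.g. a missing enumerator binary or no visible device
        if name:
            return normalize_gcn_arch(name)
    return default


def torch_probe():
    """Probe based on the attribute named in the PR description."""
    import torch  # imported lazily so the sketch runs without torch installed

    return torch.cuda.get_device_properties(0).gcnArchName
```

A caller might pass `[enumerator_probe, torch_probe]`, where `enumerator_probe` is a hypothetical wrapper around `rocm_agent_enumerator`, so that the hardcoded default is only used when every probe fails.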