Sync with Microsoft ONNX Runtime - 27032026 #1001
Open
ai-fw-intg wants to merge 7 commits into ovep-develop
…#27342)

### Description

Moves the `--build_wasm_static_lib → --build_wasm` implication from `build.py` into `build_args.py`'s post-processing, **before** the cmake generator selection. Previously, `build_args.py` chose the generator based on `args.build_wasm` (still `False`), and `build.py` only set it to `True` afterwards, which was too late.

- **`tools/ci_build/build_args.py`**: Set `args.build_wasm = True` when `args.build_wasm_static_lib` is set, prior to generator and cross-compilation logic.
- **`tools/ci_build/build.py`**: Remove the now-redundant identical check.

### Motivation and Context

Using `--build_wasm_static_lib` without `--build_wasm` caused cmake to use the wrong generator (e.g., Visual Studio instead of Ninja on Windows) and miss Emscripten-specific configuration, leading to build failures like missing `libiconv`.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: fs-eire <7679871+fs-eire@users.noreply.github.com>
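The ordering problem described above can be sketched in a few lines. This is a minimal illustration, not the actual `build_args.py` code: the `parse_args` function, the `--cmake_generator` default handling, and the generator names are assumptions made for the example. Only the two `--build_wasm*` flags and the "set the implication before choosing the generator" idea come from the description.

```python
import argparse


def parse_args(argv):
    """Hypothetical, simplified stand-in for the build argument post-processing."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--build_wasm", action="store_true")
    parser.add_argument("--build_wasm_static_lib", action="store_true")
    parser.add_argument("--cmake_generator", default=None)
    args = parser.parse_args(argv)

    # Apply the implication first, so every later decision sees the final value.
    if args.build_wasm_static_lib:
        args.build_wasm = True

    # Generator selection now runs after the implication. The generator names
    # here are illustrative defaults only.
    if args.cmake_generator is None:
        args.cmake_generator = "Ninja" if args.build_wasm else "Visual Studio 17 2022"

    return args


args = parse_args(["--build_wasm_static_lib"])
print(args.build_wasm, args.cmake_generator)
```

If the implication were applied after the generator was chosen, the same invocation would pick the non-WASM generator, which is the failure mode the commit describes.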
…n MatMulNBits (microsoft#27820)

### Description

Routes fp16 `HQNBIT_CompInt8` through the fp32 MLAS path (`SQNBIT_CompInt8`) at the operator level for both 4-bit and 8-bit MatMulNBits, then removes the ~370 lines of dead HQ CompInt8 wrapper code from MLAS.

**Operator changes (matmul_nbits.cc):**

- PrePack: Uses `SQNBIT_CompInt8` for sizing/packing, pre-converts fp16 scales and bias to fp32, computes BZpCorr for asymmetric KleidiAI on ARM64.
- ComputeBPacked: Bulk fp16→fp32 conversion of A, calls `MlasQNBitGemmBatch<float>` with `SQNBIT_CompInt8`, bulk fp32→fp16 conversion of C.

**MLAS cleanup (qnbitgemm.cpp, qnbitgemm_kernel_neon.cpp):**

- Removed `HQ4BitGemm_CompInt8`, `HQ8BitGemm_CompInt8`, `HQ8BitCompInt8PerGemmWorkspace`, associated enum values, dispatch branches, workspace entries, and `HQNBIT_CompInt8` NEON kernel conditions.
- Added `HQNBIT_CompInt8` → `SQNBIT_CompInt8` redirect in `MlasIsQNBitGemmAvailable` for `GetComputeType<MLFloat16>` compatibility.

### Motivation and Context

The HQ CompInt8 kernels are wrappers that convert fp16→fp32 per-tile before calling the same SQ fp32 kernels. This change:

1. **Eliminates per-tile overhead** via bulk conversion at the operator level.
2. **Enables KleidiAI for fp16 4-bit**, which was previously bypassed by the `HQNBIT_CompInt8` path.
3. **Removes ~370 lines of dead wrapper code** from MLAS.
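The shape of the new compute path (one bulk fp16→fp32 pass over A, one fp32 kernel call, one bulk fp32→fp16 pass over C) can be sketched as follows. This is an illustration only: every function name is hypothetical, a plain matmul stands in for the quantized `SQNBIT_CompInt8` kernel, and Python floats (which are double precision) stand in for fp32.

```python
import struct


def fp16_bytes_to_floats(buf):
    """Bulk-convert a packed little-endian fp16 buffer in a single pass."""
    count = len(buf) // 2
    return list(struct.unpack(f"<{count}e", buf))


def floats_to_fp16_bytes(values):
    """Bulk-convert floats back to a packed little-endian fp16 buffer."""
    return struct.pack(f"<{len(values)}e", *values)


def gemm(a, b, m, k, n):
    """Row-major matmul; a placeholder for the real fp32 quantized kernel."""
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            c[i * n + j] = sum(a[i * k + p] * b[p * n + j] for p in range(k))
    return c


def compute_fp16_via_fp32(a_fp16, b, m, k, n):
    a = fp16_bytes_to_floats(a_fp16)   # one conversion pass over all of A
    c = gemm(a, b, m, k, n)            # single call into the wider-precision kernel
    return floats_to_fp16_bytes(c)     # one conversion pass over all of C


a_fp16 = struct.pack("<4e", 1.0, 2.0, 3.0, 4.0)   # 2x2 activation matrix in fp16
identity = [1.0, 0.0, 0.0, 1.0]                   # 2x2 weight matrix
out = compute_fp16_via_fp32(a_fp16, identity, 2, 2, 2)
print(struct.unpack("<4e", out))
```

The per-tile wrappers that were removed performed the conversion inside the kernel loop instead; the sketch moves it entirely outside the kernel call.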
### Improvements

Measured on `Snapdragon X Elite - X1E78100 - Qualcomm Oryon CPU`

**Asymmetric:**

| Model | Seq Len | Acc1/Acc4 (before) | Acc1/Acc4 (after) | Acc4 speedup | Acc4 latency (after) |
|-------|---------|-------------------|------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 1.28× | 1.55× | **1.26×** | 1187.5ms |
| Qwen 1.5B | 512 | 1.14× | 1.63× | **1.55×** | 2257.2ms |
| Qwen 3B | 256 | 1.32× | 1.82× | **1.29×** | 2351.3ms |
| Qwen 3B | 512 | 1.38× | 1.70× | **1.28×** | 4777.2ms |
| Qwen 7B | 256 | 1.58× | 2.26× | **1.40×** | 4094.5ms |
| Qwen 7B | 512 | 1.49× | 2.23× | **1.52×** | 8002.6ms |

**Symmetric:**

| Model | Seq Len | Acc1/Acc4 (before) | Acc1/Acc4 (after) | Acc4 speedup | Acc4 latency (after) |
|-------|---------|-------------------|------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 0.95× | 1.45× | **1.67×** | 1255.5ms |
| Qwen 1.5B | 512 | 1.04× | 1.52× | **1.55×** | 2406.7ms |
| Qwen 3B | 256 | 1.39× | 1.88× | **1.32×** | 2215.0ms |
| Qwen 3B | 512 | 1.42× | 1.85× | **1.31×** | 4318.3ms |
| Qwen 7B | 256 | 1.66× | 2.58× | **1.55×** | 3564.4ms |
| Qwen 7B | 512 | 1.57× | 2.60× | **1.64×** | 7227.9ms |

**NOTE**: The 8-bit accuracy level 4 path shows some regression (5–25% on 1.5B/3B models, neutral on 7B) due to the bulk fp16↔fp32 conversion overhead replacing the old per-tile approach. The old HQ CompInt8 wrappers kept small tiles cache-hot, while the new unified path does full-matrix conversion passes. This trade-off is acceptable since 4-bit is the dominant quantization format (gaining 26–67%), 8-bit acc4 still outperforms acc1 by 1.7–2.2×, and the regression is most pronounced at smaller model sizes where absolute latencies are already low. A proper fix would be 8-bit KleidiAI-style kernels rather than restoring the wrapper code.
…rt. (microsoft#27825)

### Description

Support for AArch64 SME intrinsics was added in MSVC version 19.40. The Visual Studio 2022 versions that ONNX Runtime states as supported can predate 19.40. This patch modifies `cmake/CMakeLists.txt` to check the MSVC version when MSVC is the target compiler. For versions earlier than 19.40, KleidiAI is disabled in the build.

### Motivation and Context

This issue was raised when cross-compiling 1.24 for Windows on Arm: microsoft#27304

---------

Signed-off-by: Colm Donelan <coldon01@e135129.arm.com>
Co-authored-by: Colm Donelan <coldon01@e135129.arm.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
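The gating rule itself is small enough to sketch. The actual check lives in CMake; the Python below only illustrates the comparison logic, and the function names and the sample version strings are assumptions made for the example. Comparing versions as integer tuples rather than as strings or floats avoids treating `19.4` and `19.40` as the same value.

```python
MIN_MSVC_FOR_SME = (19, 40)  # threshold taken from the description above


def parse_major_minor(version_text):
    """Return (major, minor) from a dotted version string such as '19.40.33811'."""
    parts = [int(p) for p in version_text.split(".")[:2]]
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)


def kleidiai_allowed(compiler_id, compiler_version):
    """Hypothetical helper: only MSVC older than 19.40 disables KleidiAI."""
    if compiler_id != "MSVC":
        return True
    return parse_major_minor(compiler_version) >= MIN_MSVC_FOR_SME


print(kleidiai_allowed("MSVC", "19.39.33523"))
print(kleidiai_allowed("MSVC", "19.40.33811"))
```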
)

### Description

Enable ccache and vcpkg caching for Linux workflows that use `reusable_linux_build.yml`. This saves roughly 15-20 minutes on a 100% cache hit.

Tests are also parallelised, which saves about 6 minutes.

Additionally, enable vcpkg and ccache for other Linux workflows. No numbers are available for comparison.

### Motivation and Context

This change reduces wasted CO2 and time.

### Known Issues

Benign: the Android workflow doesn't seem to be populating its ccache.
### Description
See below
### Motivation and Context
Summary: The vulnerability lies in the ONNX Runtime's `validate_package.py` script, which uses unsanitized string concatenation with `os.system()` to construct shell commands. This allows attackers to inject arbitrary shell commands via the `--package_name` argument, leading to potential remote code execution. The issue affects the release validation pipeline, which operates with elevated privileges, exposing sensitive credentials and secrets. The root cause is the lack of input sanitization and the use of `os.system()` for command execution.

Affected code locations:

- `tools/nuget/validate_package.py` line 241: `os.system("tar zxvf " + package_name)`
- `tools/nuget/validate_package.py` line 339: `os.system("copy " + full_nuget_path + " " + nupkg_copy_name)`

Suggested fix: Replace `os.system()` with `subprocess.run()` using argument lists (no shell interpolation):
```python
import shutil
import subprocess

# Instead of: os.system("tar zxvf " + package_name)
subprocess.run(["tar", "zxvf", package_name], check=True)
# Instead of: os.system("copy " + full_nuget_path + " " + nupkg_copy_name)
shutil.copy2(full_nuget_path, nupkg_copy_name)
```
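To see why the argument-list form closes the injection path, the following sketch passes a hostile-looking value through `subprocess.run()` and has the child process report what it received. The helper name `argv_seen_by_child` and the sample package name are made up for illustration; the child is a Python one-liner, so the example does not depend on `tar` being installed.

```python
import subprocess
import sys


def argv_seen_by_child(package_name):
    """Run a child process with package_name as one list element and
    return [argument count, first argument] as reported by the child."""
    child_code = "import sys; print(len(sys.argv) - 1); print(sys.argv[1])"
    result = subprocess.run(
        [sys.executable, "-c", child_code, package_name],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


# With os.system(), the ';' would end the first command and start a second one.
# With an argument list there is no shell, so the whole string stays one argument.
print(argv_seen_by_child("pkg.tgz; echo INJECTED"))
```

The child receives exactly one argument containing the literal `;`, so nothing after it is ever interpreted as a command.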
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.