WebGPU: make WebGpuContextFactory::Cleanup exception-safe and idempotent #29207

Draft
samuel100 wants to merge 1 commit into microsoft:main from samuel100:fix/webgpu-cleanup-exception-safety


Description

Makes WebGpuContextFactory::Cleanup() exception-safe and idempotent.

It currently does:

void WebGpuContextFactory::Cleanup() {
  std::lock_guard<std::mutex> lock(mutex_);
  if (contexts_ != nullptr) { delete contexts_; contexts_ = nullptr; }
  if (default_instance_ != nullptr) { wgpuInstanceRelease(default_instance_); default_instance_ = nullptr; }
}

If delete contexts_ throws, contexts_ is left non-null and dangling, and default_instance_ is never released, so the instance leaks. On macOS, tearing down Dawn's Metal device/instance can raise an Objective-C NSException, which propagates through C++ frames on arm64 and is caught by the catch (...) in the WebGPU plugin EP's ReleaseEpFactory, surfacing as:

[error] ... [ep_library_plugin.cc:69 Unload] ReleaseEpFactory failed for: ".../libonnxruntime_providers_webgpu.dylib" with error: Unknown exception

This patch detaches the static state first (so the factory is left clean and reusable even if a destructor throws) and guards each release independently, so:

  • the misleading teardown ERROR goes away (the plugin's ReleaseEpFactory returns OK), and
  • the WGPUInstance / Dawn InstanceBase is no longer leaked on every register→unregister cycle.
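
For illustration, the detach-then-guard pattern described above can be sketched in a self-contained form. This is not the actual diff: FakeContexts, FakeInstance, ReleaseInstance, Factory, Init, and IsClean are hypothetical stand-ins for the contexts map, WGPUInstance, wgpuInstanceRelease, and WebGpuContextFactory, and std::cerr stands in for LOGS_DEFAULT. The throwing destructor only simulates a teardown failure.

```cpp
#include <cassert>
#include <iostream>
#include <mutex>
#include <stdexcept>
#include <utility>

// Hypothetical stand-ins so the sketch compiles on its own.
// They are NOT the real Dawn / ONNX Runtime types.
struct FakeInstance { int release_count = 0; };
using InstanceHandle = FakeInstance*;

// Stand-in for wgpuInstanceRelease.
void ReleaseInstance(InstanceHandle h) { ++h->release_count; }

// Stand-in for the contexts map. The destructor is declared noexcept(false)
// so it can throw and mimic a failing teardown.
struct FakeContexts {
  explicit FakeContexts(bool t) : throw_on_destroy(t) {}
  ~FakeContexts() noexcept(false) {
    if (throw_on_destroy) throw std::runtime_error("simulated teardown failure");
  }
  bool throw_on_destroy;
};

class Factory {
 public:
  // Test-only helper (assumption): populates the static state.
  static void Init(InstanceHandle instance, bool throw_on_destroy) {
    assert(instance != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    contexts_ = new FakeContexts(throw_on_destroy);
    default_instance_ = instance;
  }

  static void Cleanup() {
    std::lock_guard<std::mutex> lock(mutex_);

    // 1. Detach the statics first, so the factory is clean no matter what
    //    happens during the releases below.
    FakeContexts* contexts = std::exchange(contexts_, nullptr);
    InstanceHandle instance = std::exchange(default_instance_, nullptr);

    // 2. Guard each release independently, so a throw in one cannot skip
    //    the other.
    if (contexts != nullptr) {
      try {
        delete contexts;
      } catch (...) {
        std::cerr << "Cleanup: exception while destroying contexts (ignored)\n";
      }
    }
    if (instance != nullptr) {
      try {
        ReleaseInstance(instance);
      } catch (...) {
        std::cerr << "Cleanup: exception while releasing instance (ignored)\n";
      }
    }
  }

  // Test-only helper (assumption): reports whether both statics are null.
  static bool IsClean() {
    std::lock_guard<std::mutex> lock(mutex_);
    return contexts_ == nullptr && default_instance_ == nullptr;
  }

 private:
  static std::mutex mutex_;
  static FakeContexts* contexts_;
  static InstanceHandle default_instance_;
};

std::mutex Factory::mutex_;
FakeContexts* Factory::contexts_ = nullptr;
InstanceHandle Factory::default_instance_ = nullptr;
```

Because the statics are nulled before anything is released, a second Cleanup() call finds nothing to do, which is what makes the function idempotent; the instance is released once even when the contexts destructor throws.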

Motivation and Context

Fixes #29206.

The current behavior is benign for single-shot processes (the library still unloads and the OS reclaims resources) but is a real cumulative leak for long-lived hosts that register/unregister EPs repeatedly, and the ERROR log is misleading. The plugin EP README already lists WebGPU cleanup under "Missing parts".

A deeper follow-up could wrap the Metal teardown in @try/@catch so Dawn doesn't throw at all, but the exception-safety fix here is correct and minimal on its own and matches the best-effort teardown already used in ReleaseEpFactory.

Note for reviewers

I was unable to build ONNX Runtime + Dawn/WebGPU in my authoring environment, so this change is untested locally and relies on CI and reviewer validation. The change is intentionally minimal and uses only constructs already present in this translation unit (LOGS_DEFAULT, WGPUInstance, wgpuInstanceRelease). Happy to adjust the wording/logging or split out a @try/@catch Dawn-side variant if preferred.

Detach the static contexts map and default instance before releasing them so
the factory is left in a clean, reusable state even if Dawn's Metal teardown
throws (an Objective-C NSException on macOS). Guard each release independently
so a throw in one cannot leak the other or leave the statics dangling.

Fixes the 'ReleaseEpFactory failed ... Unknown exception' teardown error and the
per-register/unregister WGPUInstance leak. See microsoft#29206.
Development

Successfully merging this pull request may close these issues.

WebGPU plugin EP: ReleaseEpFactory throws "Unknown exception" on macOS during teardown (WebGpuContextFactory::Cleanup is not exception-safe)
