Prevent hip internal errors by luraess · Pull Request #908 · JuliaGPU/AMDGPU.jl

luraess · 2026-05-09T14:47:12Z

No description provided.

luraess · 2026-05-11T19:20:06Z

The goal of this draft PR is to fix spurious library errors caused by sticky HIP errors leaking across tests.

When a GPU kernel raises a device-side exception, AMDGPU.jl catches it via its custom exception buffer and re-throws it as a Julia error, but the underlying sticky HIP error (hipErrorLaunchFailure) remains on the HIP context. One place this shows up is in tests: ParallelTestRunner reuses worker processes across test files, so the stale error persists into the next test. The first library call in that test (rocSPARSE, rocBLAS, MIOpen, rocRAND) internally synchronizes the device, which surfaces the stale error as a spurious library failure (rocsparse_status_internal_error, ROCRAND_STATUS_LAUNCH_FAILURE, etc.). The failures look random because they depend on which test ran previously in the same worker.
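The leak can be modelled without a GPU. The following is a toy sketch in plain Julia, not AMDGPU.jl code: `STICKY_ERROR`, `failing_kernel!`, and `library_call` are hypothetical stand-ins for the error state on the HIP context, a kernel that raises a device-side exception, and a library call that synchronizes the device.

```julia
# Toy model of a sticky error leaking between tests.
# All names are hypothetical; nothing here calls HIP or AMDGPU.jl.

const STICKY_ERROR = Ref(:success)   # stand-in for the error kept on the HIP context

# Stand-in for a kernel that raises a device-side exception: the Julia error
# reaches the caller, but the context-level error is left behind.
function failing_kernel!()
    STICKY_ERROR[] = :launch_failure
    error("device-side exception")
end

# Stand-in for a library call that synchronizes internally and therefore
# reports whatever error is currently stored on the context.
library_call() = STICKY_ERROR[] == :success ? :ok : :internal_error

# "Test file 1": the exception is expected and handled, so the test passes.
first_test_passed = try
    failing_kernel!()
    false
catch
    true
end

# "Test file 2" on the same worker: no kernel failed here, yet the call fails.
second_test_status = library_call()
```

In this model the first test passes and the second one reports `:internal_error`, even though nothing went wrong in the second test itself.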

I came up with calling hipGetLastError() (without throwing) in handle() for all four libraries, before every library operation. As far as I can tell, this is the HIP way to drain a sticky error: it both returns and clears the error in one call, so any genuine failure from the subsequent library call is still caught normally. Placing it in handle() rather than create_handle() ensures it fires on every call, not just when a handle is first created.
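A minimal sketch of why the drain belongs in handle() rather than create_handle(), again as a plain-Julia model and not the code in this PR. `get_last_error!` imitates the return-and-reset behaviour described for hipGetLastError(); `HANDLE_CACHE`, `create_handle`, `handle`, and `library_call` are hypothetical simplifications of the per-library handle logic.

```julia
# Hypothetical names throughout; a model of the idea, not the PR's code.

const STICKY_ERROR = Ref(:launch_failure)   # stale error left by an earlier test

# Imitates hipGetLastError(): return the stored error and reset it to success.
function get_last_error!()
    err = STICKY_ERROR[]
    STICKY_ERROR[] = :success
    return err
end

const HANDLE_CACHE = Ref{Union{Nothing,Symbol}}(nothing)

# Runs only the first time a handle is needed.
create_handle() = :library_handle

# Runs before every library operation, so the drain happens on every call.
function handle()
    get_last_error!()                  # result deliberately ignored
    if HANDLE_CACHE[] === nothing
        HANDLE_CACHE[] = create_handle()
    end
    return HANDLE_CACHE[]
end

# Stand-in for a library call that reports the stored context error.
library_call(h) = STICKY_ERROR[] == :success ? :ok : :internal_error

status = library_call(handle())
```

Had the drain lived in `create_handle()`, a stale error set after the handle was cached would still reach the next library call in this model.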

N.B. Also temporarily fixed an unrelated flakiness in hostcall.jl: @test_logs (:error, "HostCall error") is failing when GPUCompiler deprecation warnings are emitted during kernel compilation in the same block. Added match_mode=:any to make the assertion robust to extra log output until the GPUCompiler update PR is merged.
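For reference, a small self-contained illustration of what `match_mode=:any` changes, using only the Test standard library. `noisy_hostcall` is a hypothetical stand-in for the kernel launch in hostcall.jl; the warning text is made up.

```julia
using Test

# Hypothetical stand-in: emits an unrelated warning (such as a deprecation)
# before the error record the test actually cares about.
function noisy_hostcall()
    @warn "unrelated deprecation warning"
    @error "HostCall error"
    return nothing
end

# With the default match_mode=:all, the captured log records must match the
# given patterns one-to-one, so the extra warning would fail the test.
# With match_mode=:any, the pattern only has to match somewhere in the sequence.
@test_logs (:error, "HostCall error") match_mode=:any noisy_hostcall()
```

The trade-off is that `:any` no longer notices unexpected extra log records, which is why it is only meant as a temporary measure here.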

gbaraldi · 2026-05-11T19:25:50Z

Only thing I can think about is if there's any kind of error we could be silently clearing here that wouldn't have been bubbled up

luraess · 2026-05-11T19:46:06Z

I don't know much more either, tbh. I have only seen these failures while testing so far, so yeah, maybe they would also show up in other settings. The thing is that the tests actually pass, and the internal failure just spoils the next test run on the same worker...

luraess · 2026-05-12T16:56:29Z

> Only thing I can think about is if there's any kind of error we could be silently clearing here that wouldn't have been bubbled up

Synchronous HIP errors should be caught at the call site via @check / @gcsafe_ccall. Asynchronous errors (kernel failures) should be tracked by the GLOBAL_EXCEPTION_INFO buffer and re-thrown as Julia errors before reaching handle(). The sticky error that remains uncleared is possibly from a kernel exception that AMDGPU already reported, and the approach here drains the HIP-level duplicate.

An alternative version could be to log a warning if the cleared error is non-success, rather than discarding it silently.
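A rough sketch of that alternative, assuming a return-and-reset query like the one described for hipGetLastError(). `LAST_ERROR`, `get_last_error!`, and `drain_with_warning!` are hypothetical names, and the log message is only an example.

```julia
using Test

const LAST_ERROR = Ref(:success)   # hypothetical stand-in for the HIP context error

# Imitates a return-and-reset error query.
function get_last_error!()
    err = LAST_ERROR[]
    LAST_ERROR[] = :success
    return err
end

# Drain the stored error, but leave a trace in the logs when something
# was actually cleared instead of discarding it silently.
function drain_with_warning!()
    err = get_last_error!()
    if err != :success
        @warn "Cleared stale HIP error before library call" err
    end
    return err
end

LAST_ERROR[] = :launch_failure
@test_logs (:warn, "Cleared stale HIP error before library call") drain_with_warning!()
@test_logs drain_with_warning!()   # nothing stored any more, so nothing is logged
```

This keeps the library call unaffected while making it visible in the logs whenever an error was swallowed.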

luraess · 2026-05-13T06:54:45Z

@gbaraldi would the addition of f22f91a be something you were thinking of?

Lemme know if you feel we could merge this PR. If yes, I would then clean up a bit and revert 4baadb2 (unless it may actually be fine to keep it?).

luraess added 2 commits May 9, 2026 20:52

Catch and clear spurious hip error

a30267a

Fix rand

40f6467

luraess force-pushed the lr/hip-err branch from 35a6c3a to 40f6467 Compare May 9, 2026 18:52

luraess changed the title ~~Rocsparse internal error~~ Prevent hip internal errors May 9, 2026

luraess added 2 commits May 11, 2026 08:52

tmp fix until #907 lands

4baadb2

Further tweaks

20d55a2

Merge remote-tracking branch 'origin/master' into lr/hip-err

758d261

track cleared err

f22f91a

luraess marked this pull request as ready for review May 13, 2026 06:54

luraess wants to merge 6 commits into master from lr/hip-err
