Prevent hip internal errors#908

Open
luraess wants to merge 6 commits into master from lr/hip-err

Conversation

@luraess
Member

@luraess luraess commented May 9, 2026

No description provided.

@luraess luraess changed the title Rocsparse internal error Prevent hip internal errors May 9, 2026
@luraess
Member Author

luraess commented May 11, 2026

The goal of this draft PR is to fix spurious library errors caused by sticky HIP errors leaking across tests.

When a GPU kernel raises a device-side exception, AMDGPU.jl catches it via its custom exception buffer and re-throws it as a Julia error, but the underlying sticky HIP error (hipErrorLaunchFailure) remains on the HIP context. One place this shows up is in tests: ParallelTestRunner reuses worker processes across test files, so the stale error persists into the next test. The first library call in that test (rocSPARSE, rocBLAS, MIOpen, rocRAND) internally synchronizes the device, which surfaces the stale error as a spurious library failure (rocsparse_status_internal_error, ROCRAND_STATUS_LAUNCH_FAILURE, etc.). The failures look random because they depend on which test ran previously in the same worker.

I came up with calling hipGetLastError() (without throwing) in handle() for all four libraries, so that it runs before every library operation. This looks like a possible HIP way to drain a sticky error: it both returns and clears the error in one call, so any genuine failure from the subsequent library call is still caught normally. Placing it in handle() rather than create_handle() ensures it fires on every call, not just when a handle is first created.
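To illustrate the idea, here is a toy model in plain Julia that needs no GPU. It is only a sketch: `MockStatus`, `LAST_ERROR`, `mock_get_last_error`, `mock_library_call`, `handle_plain` and `handle_draining` are hypothetical stand-ins, not AMDGPU.jl or HIP API, and the real runtime state is of course not a simple `Ref`.

```julia
# Toy model of a "last error" slot, standing in for HIP runtime state.
# All names here are hypothetical and exist only for illustration.
@enum MockStatus mock_success mock_launch_failure

const LAST_ERROR = Ref(mock_success)

# Mimics the read-and-reset behaviour described for hipGetLastError().
function mock_get_last_error()
    err = LAST_ERROR[]
    LAST_ERROR[] = mock_success
    return err
end

# A "library call" that reports an internal error if a stale error is pending.
mock_library_call() = LAST_ERROR[] == mock_success ? :ok : :internal_error

# handle() as before: returns the handle without touching the error state.
handle_plain() = :handle

# handle() with the drain: clear any stale error first, ignoring its value.
function handle_draining()
    mock_get_last_error()
    return :handle
end

# A previous test left a stale error behind on this worker.
LAST_ERROR[] = mock_launch_failure
handle_plain()
result_without_drain = mock_library_call()

# Same situation, but the handle is fetched through the draining variant.
LAST_ERROR[] = mock_launch_failure
handle_draining()
result_with_drain = mock_library_call()
```

In this model the drain has to live in the function that runs on every call; a version that drains only when the handle is created would miss errors left behind later by other tests.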

N.B. Also temporarily fixed an unrelated flaky test in hostcall.jl: @test_logs (:error, "HostCall error") fails when GPUCompiler deprecation warnings are emitted during kernel compilation in the same block. Added match_mode=:any to make the assertion robust to extra log output until the GPUCompiler update PR is merged.
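A minimal sketch of what match_mode=:any changes, using only the Test standard library. `noisy_block` is a hypothetical stand-in for the kernel block in hostcall.jl, and the warning text is made up:

```julia
using Test

# Hypothetical stand-in for the tested block: it emits an unrelated warning
# (think of a deprecation notice) before the error the test cares about.
function noisy_block()
    @warn "some deprecation warning"
    @error "HostCall error"
    return nothing
end

@testset "match_mode=:any tolerates extra log records" begin
    # The default match_mode=:all requires the patterns to match the emitted
    # records exactly, so the extra warning would fail this assertion.
    # With match_mode=:any it is enough that the pattern matches some record.
    @test_logs (:error, "HostCall error") match_mode=:any noisy_block()
end
```

The trade-off is that :any no longer notices unexpected extra log records, which is why it is meant as a temporary measure here.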

@gbaraldi
Member

Only thing I can think about is if there's any kind of error we could be silently clearing here that wouldn't have been bubbled up

@luraess
Member Author

luraess commented May 11, 2026

I don't know much more either, tbh. I've only seen these failures while testing so far, so yeah, maybe they would show up in other settings as well. The thing is that the tests actually pass, and the internal failure just spoils the next test run on the same worker...

@luraess
Member Author

luraess commented May 12, 2026

> Only thing I can think about is if there's any kind of error we could be silently clearing here that wouldn't have been bubbled up

Synchronous HIP errors should be caught at the call site via @check / @gcsafe_ccall. Asynchronous errors (kernel failures) should be tracked by the GLOBAL_EXCEPTION_INFO buffer and re-thrown as Julia errors before reaching handle(). The sticky error that remains uncleared is possibly from a kernel exception that AMDGPU already reported, and the approach here drains the HIP-level duplicate.

An alternative version could be to log a warning if the cleared error is non-success, rather than discarding it silently.
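Roughly, the warning variant could look like the sketch below. Again this is a plain-Julia toy model, not the actual change: `MockStatus`, `LAST_ERROR`, `mock_get_last_error` and `drain_with_warning` are hypothetical names, and the warning message is only an example.

```julia
# Toy "last error" slot standing in for HIP runtime state (hypothetical names).
@enum MockStatus mock_success mock_launch_failure

const LAST_ERROR = Ref(mock_success)

# Mimics the read-and-reset behaviour described for hipGetLastError().
function mock_get_last_error()
    err = LAST_ERROR[]
    LAST_ERROR[] = mock_success
    return err
end

# Drain the pending error, but leave a trace in the logs if there was one.
function drain_with_warning()
    err = mock_get_last_error()
    if err != mock_success
        @warn "Cleared stale HIP error before library call" err
    end
    return err
end
```

This keeps the library call from failing spuriously while still making a cleared error visible, which would help if something other than an already-reported kernel exception ever ends up being drained.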

@luraess
Member Author

luraess commented May 13, 2026

@gbaraldi would the addition of f22f91a be something you were thinking of?

Lemme know if you feel we could merge this PR. If yes, I would then clean up a bit and revert 4baadb2 (unless it may actually be fine to keep it?)

@luraess luraess marked this pull request as ready for review May 13, 2026 06:54

2 participants