Prevent hip internal errors #908
|
The goal of this draft PR is to fix spurious library errors caused by sticky HIP errors leaking across tests.
When a GPU kernel raises a device-side exception, AMDGPU.jl catches it via its custom exception buffer and re-throws it as a Julia error, but the underlying sticky HIP error (
I came up with
N.B. Also temporarily fixed an unrelated flakiness in hostcall.jl:
|
The only thing I can think of is whether there is any kind of error we could be silently clearing here that wouldn't otherwise have bubbled up.
|
I don't know much more either, tbh. So far I have only seen these failures while testing, so maybe they would show up in other settings as well. The thing is that the tests actually pass, and the internal failure just spoils the next test run on the same worker...
Synchronous HIP errors should be caught at the call site via
An alternative version could be to log a warning if the cleared error is non-success, rather than discarding it silently.
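For illustration only, the "warn instead of discarding" alternative could look roughly like the sketch below. It is a self-contained mock, not the code in this PR: `LAST_ERROR`, `mock_get_last_error!` and `clear_last_error_with_warning!` are hypothetical names, a plain `Ref` stands in for the runtime's last-error slot, and no AMDGPU.jl or HIP call is made.

```julia
# Hypothetical stand-in for the runtime's "last error" slot (NOT AMDGPU.jl API).
const HIP_SUCCESS = 0                      # hipSuccess has the value 0
const LAST_ERROR = Ref{Int}(HIP_SUCCESS)   # mock of the pending error code

# Mock of a read-and-reset query: return the pending code and reset the slot.
function mock_get_last_error!()
    err = LAST_ERROR[]
    LAST_ERROR[] = HIP_SUCCESS
    return err
end

# Clear the pending error, but log a warning when it was non-success
# instead of discarding it silently. Returns the code that was cleared.
function clear_last_error_with_warning!()
    err = mock_get_last_error!()
    if err != HIP_SUCCESS
        @warn "Cleared a pending HIP error" code = err
    end
    return err
end
```

With something along these lines, a leftover error from a previous test would still be cleared before the next test runs, but it would show up in the logs, which partly addresses the concern about silently clearing an error that would otherwise have bubbled up.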