-
Notifications
You must be signed in to change notification settings - Fork 9
Description
Hello,
I've recently discovered that if the key/value buffers used in the sort are not a multiple of PARALLELSORT_THREADGROUP_SIZE * 4, then the buffers are read out of bounds and undefined behavior can occur.
See these lines below:
FidelityFX-ParallelSort/ffx-parallelsort/FFX_ParallelSort.h
Lines 133 to 137 in 0c53994
| uint srcKeys[FFX_PARALLELSORT_ELEMENTS_PER_THREAD]; | |
| srcKeys[0] = SrcBuffer[DataIndex]; | |
| srcKeys[1] = SrcBuffer[DataIndex + FFX_PARALLELSORT_THREADGROUP_SIZE]; | |
| srcKeys[2] = SrcBuffer[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 2)]; | |
| srcKeys[3] = SrcBuffer[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 3)]; |
No bounds checks are done to SrcBuffer here, causing GPU instability when these are read out of bounds.
Also an issue here:
FidelityFX-ParallelSort/ffx-parallelsort/FFX_ParallelSort.h
Lines 348 to 360 in 0c53994
| // Pre-load the key values in order to hide some of the read latency | |
| uint srcKeys[FFX_PARALLELSORT_ELEMENTS_PER_THREAD]; | |
| srcKeys[0] = SrcBuffer[DataIndex]; | |
| srcKeys[1] = SrcBuffer[DataIndex + FFX_PARALLELSORT_THREADGROUP_SIZE]; | |
| srcKeys[2] = SrcBuffer[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 2)]; | |
| srcKeys[3] = SrcBuffer[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 3)]; | |
| #ifdef kRS_ValueCopy | |
| uint srcValues[FFX_PARALLELSORT_ELEMENTS_PER_THREAD]; | |
| srcValues[0] = SrcPayload[DataIndex]; | |
| srcValues[1] = SrcPayload[DataIndex + FFX_PARALLELSORT_THREADGROUP_SIZE]; | |
| srcValues[2] = SrcPayload[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 2)]; | |
| srcValues[3] = SrcPayload[DataIndex + (FFX_PARALLELSORT_THREADGROUP_SIZE * 3)]; |
Later on, the number of keys is checked, but by that point it's too late:
FidelityFX-ParallelSort/ffx-parallelsort/FFX_ParallelSort.h
Lines 369 to 371 in 0c53994
| uint localKey = (DataIndex < CBuffer.NumKeys ? srcKeys[i] : 0xffffffff); | |
| #ifdef kRS_ValueCopy | |
| uint localValue = (DataIndex < CBuffer.NumKeys ? srcValues[i] : 0); |
I suspect the fix would be to just check the number of keys before pre-loading the key/value pairs.
Reproducing is simple enough, just run the sort on data that is less than PARALLELSORT_THREADGROUP_SIZE * 4 with GPU-assisted validation that checks out of bounds descriptor reads.