I'm getting back into motion correction and am testing out some new options.
I ran a test on 10 minutes of NPX 1.0 data. With the locally_exclusive peak detection, the run completed in 212 seconds (timed with Python's perf_counter) using 12 workers on a 28-logical-processor ~3.7 GHz CPU (2024 i7). I then tested locally_exclusive_torch with the same settings on a 4000 Ada GPU, and the run completed in 216 seconds. My task manager shows essentially 100% utilization of the CPU or GPU, respectively, during these runs.
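
For reference, this is roughly the shape of my timing harness (a minimal sketch, not my exact script; the SpikeGLX path is a placeholder, and I'm assuming detect_peaks from spikeinterface.sortingcomponents.peak_detection with standard job_kwargs):

```python
from time import perf_counter

import spikeinterface.extractors as se
from spikeinterface.sortingcomponents.peak_detection import detect_peaks

# Load the 10-minute NPX 1.0 test recording (path is a placeholder).
recording = se.read_spikeglx("path/to/npx1_recording")

# Same parallelization settings for both runs.
job_kwargs = dict(n_jobs=12, chunk_duration="1s", progress_bar=True)

for method in ("locally_exclusive", "locally_exclusive_torch"):
    t0 = perf_counter()
    peaks = detect_peaks(recording, method=method, **job_kwargs)
    print(f"{method}: {perf_counter() - t0:.0f} s ({len(peaks)} peaks)")
```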
I noticed on the progress bar that the torch version also said it was using 12 workers.
What is the meaning of "workers" when we run something on the GPU? My understanding is that this is not typical terminology for parallel processing on the GPU, and that we usually just let CUDA work its magic under the hood.
Would you recommend changing the number of workers when running these steps on the GPU, or does that not have a meaningful impact? If a process is parallelizable, I would expect a larger performance improvement when moving from CPU to GPU. Do you have any ideas whether the performance I observed is expected?
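
If it helps frame the question, a sweep like this (same sketch-level assumptions as above, reusing the recording loaded there) is what I could run to check whether n_jobs matters at all for the torch path:

```python
from time import perf_counter

from spikeinterface.sortingcomponents.peak_detection import detect_peaks

# `recording` as loaded in the sketch above.
for n_jobs in (1, 4, 8, 12):
    t0 = perf_counter()
    detect_peaks(
        recording,
        method="locally_exclusive_torch",
        n_jobs=n_jobs,
        chunk_duration="1s",
        progress_bar=False,
    )
    print(f"n_jobs={n_jobs}: {perf_counter() - t0:.0f} s")
```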