Hello, I've briefly compared deepseek-ai/DeepEP and ROCm/DeepEP and have the following questions:
deepseek-ai/DeepEP implements its own low-latency IBGDA path, which avoids polling the CQ when issuing WQEs. The official response on this is in "about ibgda_reserve_wqe_slots" (deepseek-ai/DeepEP#180). ROCm/DeepEP, however, calls the API provided by rocSHMEM directly: each time a WQE is issued, it checks whether the queue has free space and, if not, polls the CQ. This is essentially the same as the IBGDA path implemented by NVSHMEM, and it could lead to higher latency in low-latency mode.
ROCm/DeepEP calls a warp-level interface similar to put_nbi_warp. In rocSHMEM, only one thread in the warp actually issues WQEs, whereas in deepseek-ai/DeepEP all threads in the warp participate. Wouldn't this affect performance?
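To make the first question concrete, here is a minimal host-side toy model of the two slot-reservation strategies being compared. It is only a sketch of the idea, not code from either repository: `ToySendQueue`, `reserve_checked`, `reserve_unchecked`, and `poll_cq` are hypothetical names, and the model assumes the NIC has already completed every posted WQE by the time the CQ is polled.

```python
from dataclasses import dataclass


@dataclass
class ToySendQueue:
    """Toy model of a send queue ring (hypothetical, for illustration only)."""
    depth: int           # number of WQE slots in the ring
    reserved: int = 0    # total slots handed out so far (producer index)
    completed: int = 0   # completions the sender currently knows about
    cq_polls: int = 0    # how many times the CQ was polled

    def poll_cq(self) -> None:
        # Toy assumption: the NIC has finished everything posted so far,
        # but the sender only learns about it by polling the CQ.
        self.cq_polls += 1
        self.completed = self.reserved

    def _take(self, n: int) -> int:
        first = self.reserved
        self.reserved += n
        return first  # ring slot would be first % depth

    def reserve_checked(self, n: int) -> int:
        """Check for free space on every reservation; poll the CQ if full."""
        if n > self.depth:
            raise ValueError("request larger than the ring")
        while self.reserved + n - self.completed > self.depth:
            self.poll_cq()
        return self._take(n)

    def reserve_unchecked(self, n: int) -> int:
        """Reserve without any check or CQ poll.

        Only safe if the caller bounds the number of in-flight WQEs
        by `depth` through some other mechanism.
        """
        return self._take(n)
```

With `depth=4` and ten single-slot reservations, the checked variant polls the CQ twice in this model, while the unchecked variant never polls. The question is whether that extra check-and-poll on the issue path is what costs latency in ROCm/DeepEP's low-latency mode.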
Operating System
No response
GPU
No response
ROCm Component
No response