[DP] Fix dp padding logic in dummyrun #4705
Conversation
Code Review
This pull request correctly applies padding logic to dummy_run to align with changes in CudagraphDispatcher. The use of num_tokens_padded for tensor slicing and subsequent function calls is consistent. However, I've found a few issues: a critical bug where num_reqs_padded is used instead of num_tokens_padded when updating num_tokens_across_dp, a potential issue with MoE communication method selection using unpadded token counts, and a leftover debug print statement. Addressing these will ensure correctness and clean up the code.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This bugfix is imperative for DP scenarios. Can you fix the UT and merge this ASAP? @MengqingCao
Signed-off-by: MengqingCao <cmq0113@163.com>
@GDzhu01 I've tested this PR on the deepseek-r1-w8a8 model with the latest code; could you help test it again? Thanks!
Yes, I think an approval from @GDzhu01 is needed after his test.
After offline discussion with @GDzhu01, I confirmed that there is no problem with this PR. Let's merge it!
### What this PR does / why we need it?
Fix the DP padding logic in `dummy_run`. After vllm-project/vllm#28579, `num_tokens` is padded in `CudagraphDispatcher`, so the same padding must also be applied in `dummy_run` (see the sketch below).
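A minimal sketch of the padding idea, using a hypothetical helper rather than vLLM's actual API: `dummy_run` pads the token count up to the nearest captured cudagraph size, matching what `CudagraphDispatcher` will dispatch at runtime.

```python
# Hypothetical helper (not vLLM's API) showing the padding rule dummy_run needs.

def pad_to_capture_size(num_tokens: int, capture_sizes: list[int]) -> int:
    """Return the smallest captured cudagraph size >= num_tokens,
    or num_tokens itself if it exceeds every capture size (eager fallback)."""
    for size in sorted(capture_sizes):
        if size >= num_tokens:
            return size
    return num_tokens

# Example: with cudagraph_capture_sizes=[96], a dummy run of 48 tokens
# is padded to 96 so tensors are sliced consistently with the dispatcher.
assert pad_to_capture_size(48, [96]) == 96
```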
### How was this patch tested?
Tested locally with the following scripts:
```bash
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server \
  --model wemaster/deepseek_mtp_main_random_bf16 \
  --trust-remote-code \
  --data-parallel-size 4 \
  --tensor-parallel-size 1 \
  --compilation-config '{"cudagraph_capture_sizes":[96],"cudagraph_mode":"FULL_DECODE_ONLY"}' \
  --enable-expert-parallel
```