vLLM-v0 Upgrade #122
Conversation
```python
# Run HTTP server
sock_addr = (args.host or "", args.port)
sock = create_server_socket(sock_addr)
```
Not sure when this happened, but we are running the HTTP server twice without dropping the previous one. See line 159. We need to remove this line and the one above.
Looks like it's intentional, see the comment in the first run:
# workaround to make sure that we bind the port before the engine is set up.
# This avoids race conditions with ray.
# see https://github.com/vllm-project/vllm/issues/8204
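For context, here is a minimal sketch of the pattern the workaround describes: bind the port before the slow engine setup, then hand the already-bound socket to the HTTP server instead of binding again. It assumes a FastAPI app served with uvicorn; the `create_server_socket` helper below is a simplified stand-in for the one imported in this repo, not its actual implementation.

```python
import asyncio
import socket

import uvicorn
from fastapi import FastAPI

app = FastAPI()


def create_server_socket(addr: tuple[str, int]) -> socket.socket:
    """Bind the HTTP port up front so parallel Ray workers cannot race for it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(addr)
    return sock


async def main() -> None:
    # Bind before the (slow) engine setup; see vllm-project/vllm#8204.
    sock = create_server_socket(("0.0.0.0", 8000))

    # ... set up the engine here ...

    # Reuse the already-bound socket rather than binding a second time,
    # which is why a later "run HTTP server" block must not call
    # create_server_socket() again.
    config = uvicorn.Config(app, log_level="info")
    server = uvicorn.Server(config)
    await server.serve(sockets=[sock])


if __name__ == "__main__":
    asyncio.run(main())
```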
```diff
  if args.load_as_bf16:
-     loading_args["torch_dtype"] = torch.bfloat16
+     loading_args["dtype"] = torch.bfloat16
```
why this change? has the transformers API changed here?
torch_dtype became deprecated and prints a warning
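A minimal sketch of the rename in a standard `AutoModel` load path, for reference; the model name is a placeholder and not from this repo:

```python
import torch
from transformers import AutoModelForCausalLM

# Recent transformers releases (the 4.5x line pinned in this PR) deprecate
# `torch_dtype` in favour of `dtype` and warn if the old keyword is passed.
loading_args = {"dtype": torch.bfloat16}  # previously {"torch_dtype": torch.bfloat16}

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",  # placeholder model name, not from this repo
    low_cpu_mem_usage=True,
    **loading_args,
)
```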
```diff
  logger.info(f"Merge lora checkpoint {lora_model_path}")
- model = lora_load_and_merge(lora_model_path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
+ model = lora_load_and_merge(lora_model_path, dtype=torch.bfloat16, low_cpu_mem_usage=True)
```
same as above
torch_dtype renamed to dtype
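For context, a hedged sketch of what a LoRA load-and-merge helper like this typically does with PEFT. Only the function name, the keyword rename, and the call signature come from the diff; the body below is illustrative and may differ from the repo's actual implementation.

```python
import torch
from peft import AutoPeftModelForCausalLM


def lora_load_and_merge(lora_model_path: str,
                        dtype: torch.dtype = torch.bfloat16,
                        low_cpu_mem_usage: bool = True):
    """Load a LoRA adapter with its base model and fold the adapter weights in."""
    peft_model = AutoPeftModelForCausalLM.from_pretrained(
        lora_model_path,
        dtype=dtype,  # was `torch_dtype` before the transformers rename
        low_cpu_mem_usage=low_cpu_mem_usage,
    )
    # merge_and_unload() bakes the LoRA deltas into the base weights and
    # returns a plain transformers model with no PEFT wrappers.
    return peft_model.merge_and_unload()
```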
rafapi left a comment:
LGTM!!
This PR upgrades vLLM to a recent version where the V0 engine is not removed, which is v0.10.0 (an intermediate step to fully migrate to V1 in #121).

Notable upgrades:
- python: 3.11 to 3.12
- vllm: 0.8.5.post1 to 0.10.0
- torch: 2.6.0 to 2.7.1
- transformers: 4.51.1 to 4.57.6
- flash-attention: 2.7.4.post1 to 2.8.3

GSPO (blue = v0.8.5.post1, pink/purple = v0.10.0) [comparison plot]
GRPO (orange = v0.8.5.post1, purple = v0.10.0) [comparison plot]

Potentially, the latest version to upgrade to is v0.10.2, but this error occurs from the bundled flash attention in vLLM: [error trace not included]

Also, some V0 features that we use in the code were removed in >v0.10.0, such as the multi-step scheduler (it's okay in this case, as we don't normally use it).
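For reference, a hedged sketch of the kind of V0-era configuration affected: in vLLM v0 (the 0.8.x line this PR moves away from), multi-step scheduling was exposed through the `num_scheduler_steps` engine argument, which later releases drop. The model name is a placeholder and this repo's actual engine setup may look different.

```python
from vllm import LLM, SamplingParams

# Multi-step scheduling was a V0-only knob; any config that still sets it
# has to remove the argument when upgrading past v0.10.0.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B",  # placeholder model name, not from this repo
    dtype="bfloat16",
    # num_scheduler_steps=8,  # V0-only; delete when moving beyond v0.10.0
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```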