
Conversation

@ehsk ehsk (Collaborator) commented Jan 19, 2026

This PR upgrades vLLM to v0.10.0, the most recent version in which the V0 engine has not yet been removed (an intermediate step toward fully migrating to V1 in #121).

Notable upgrades:
python: 3.11 to 3.12
vllm: 0.8.5.post1 to 0.10.0
torch: 2.6.0 to 2.7.1
transformers: 4.51.1 to 4.57.6
flash-attention: 2.7.4.post1 to 2.8.3
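
After rebuilding the environment, a quick version check can catch a stale install. A minimal sketch (not part of this PR; the flash-attn distribution name is an assumption):

# Hypothetical sanity check, not part of this PR: verify the rebuilt
# environment matches the versions listed above.
import sys
from importlib.metadata import version

expected = {
    "vllm": "0.10.0",
    "torch": "2.7.1",        # wheels may carry a local suffix such as "+cu126"
    "transformers": "4.57.6",
    "flash-attn": "2.8.3",   # distribution name assumed
}

assert sys.version_info[:2] == (3, 12), f"expected Python 3.12, got {sys.version}"
for pkg, want in expected.items():
    got = version(pkg)
    assert got.startswith(want), f"{pkg}: expected {want}, got {got}"
print("environment matches the upgraded versions")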

GSPO (blue = v0.8.5.post1, pink/purple = v0.10.0): logprobs and reward plots.

GRPO (orange = v0.8.5.post1, purple = v0.10.0): logprobs and reward plots.

The latest version we could potentially upgrade to is v0.10.2, but the flash attention bundled with vLLM raises this error:

[rank0]: torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain.
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
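
As the traceback hints, pinning down the failing kernel needs synchronous launches; a minimal sketch of that debugging step (the variable must be set before CUDA is initialized, i.e. before importing torch or vllm):

# Hypothetical debugging snippet: force synchronous CUDA launches so the
# stack trace points at the kernel that actually failed.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the env var on purpose

x = torch.randn(4, 4, device="cuda")
print(x.sum())  # errors now surface at the failing call site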

Also, some V0 features we use in the code, such as the multi-step scheduler, were removed in versions after v0.10.0 (this is fine in our case, as we don't normally use it).

@ehsk ehsk requested a review from rafapi January 20, 2026 01:36
@ehsk ehsk mentioned this pull request Jan 20, 2026
@ehsk ehsk self-assigned this Jan 20, 2026

# Run HTTP server
sock_addr = (args.host or "", args.port)
sock = create_server_socket(sock_addr)
@rafapi rafapi (Collaborator) commented Jan 21, 2026

Not sure when this happened, but we are running the HTTP server twice without dropping the previous one (see line 159). We need to remove this line and the one above.

@ehsk ehsk (Collaborator, Author) replied:

Looks like it's intentional; see the comment in the first run:

# workaround to make sure that we bind the port before the engine is set up.
# This avoids race conditions with ray.
# see https://github.com/vllm-project/vllm/issues/8204
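
For context, the pattern is roughly: claim the port up front, do the slow Ray-coordinated engine setup, then serve on the socket that is already bound. A minimal sketch of that idea, assuming a uvicorn-based server (names and ports here are illustrative, not the project's actual code):

import asyncio
import socket

import uvicorn
from fastapi import FastAPI

app = FastAPI()


def create_server_socket(addr: tuple[str, int]) -> socket.socket:
    # Bind the port right away so concurrent Ray workers cannot race for it
    # while the engine is still being set up.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(addr)
    return sock


# 1) Bind the port immediately, before any heavy initialization.
sock = create_server_socket(("0.0.0.0", 8000))

# 2) ... expensive engine setup would happen here (Ray workers, model load) ...

# 3) Serve on the socket we already own instead of binding a second time.
config = uvicorn.Config(app, host="0.0.0.0", port=8000)
asyncio.run(uvicorn.Server(config).serve(sockets=[sock]))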


if args.load_as_bf16:
-    loading_args["torch_dtype"] = torch.bfloat16
+    loading_args["dtype"] = torch.bfloat16
@rafapi rafapi (Collaborator) commented:

Why this change? Has the transformers API changed here?

@ehsk ehsk (Collaborator, Author) replied:

torch_dtype became deprecated and prints a warning; dtype is the replacement.
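
For illustration, the caller-side change looks roughly like this (a sketch; the model name is a placeholder, not the project's):

import torch
from transformers import AutoModelForCausalLM

# Before (deprecated, warns on recent transformers releases):
#   AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", torch_dtype=torch.bfloat16)

# After: the argument is now called `dtype`.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", dtype=torch.bfloat16)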


logger.info(f"Merge lora checkpoint {lora_model_path}")
model = lora_load_and_merge(lora_model_path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
model = lora_load_and_merge(lora_model_path, dtype=torch.bfloat16, low_cpu_mem_usage=True)
@rafapi rafapi (Collaborator) commented:

same as above

@ehsk ehsk (Collaborator, Author) replied:

torch_dtype was renamed to dtype.

@rafapi rafapi (Collaborator) left a comment:

LGTM!!

@ehsk ehsk merged commit 64073e3 into main Jan 21, 2026