Add default 'auto' MODEL_IMPL_TYPE that resolves based on architecture #1255
base: main
Conversation
@kyuyeunk Please review.
kyuyeunk
left a comment
Wouldn't it be possible to move 'auto' into the match/case as well?
- Add 'auto' as default value for MODEL_IMPL_TYPE env var
- For GptOssForCausalLM, 'auto' resolves to 'vllm' for better performance
- For all other architectures, 'auto' resolves to 'flax_nnx'
- Add _VLLM_REQUIRED_ARCHITECTURES frozenset in model_loader.py
- Use match/case pattern in get_model() for implementation selection
- Add tests for 'auto' resolution behavior

Signed-off-by: Xing Liu <xingliu14@gmail.com>
It is possible to move it into the match-case, but then it would have duplicated code, including get_vllm_model, get_flax_model, and the fallback check. I think resolving first and then using the same code path is cleaner.
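Roughly, the resolve-first shape described above might look like the sketch below. This is an illustration only, not the actual diff: get_flax_model, get_vllm_model, UnsupportedArchitectureError, and the _VLLM_REQUIRED_ARCHITECTURES constant are taken from the snippets in this conversation and assumed to exist, and the env var is read directly with os.getenv for simplicity.

```python
import os

class UnsupportedArchitectureError(Exception):
    """Stub for illustration; the real exception lives in the model loader."""

_VLLM_REQUIRED_ARCHITECTURES: frozenset[str] = frozenset({"GptOssForCausalLM"})

def get_model(vllm_config, rng, mesh, is_draft_model=False):
    # Reading the env var directly here purely for illustration.
    impl = os.getenv("MODEL_IMPL_TYPE", "auto").lower()

    if impl == "auto":
        # Resolve "auto" once, up front, based on the model architecture.
        architectures = getattr(vllm_config.model_config.hf_config,
                                "architectures", [])
        if any(arch in _VLLM_REQUIRED_ARCHITECTURES for arch in architectures):
            impl = "vllm"
        else:
            impl = "flax_nnx"

    # After resolution, explicit and auto-resolved values share one dispatch path,
    # so get_flax_model / get_vllm_model and the fallback check appear only once.
    match impl:
        case "flax_nnx":
            try:
                # Try to load the flax model first.
                return get_flax_model(vllm_config, rng, mesh, is_draft_model)
            except UnsupportedArchitectureError:
                # Fall back to the vLLM PyTorch implementation.
                return get_vllm_model(vllm_config, rng, mesh)
        case "vllm":
            return get_vllm_model(vllm_config, rng, mesh)
        case _:
            raise ValueError(f"Unknown MODEL_IMPL_TYPE: {impl}")
```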
return jit_model, compute_logits_fn, combine_hidden_states_fn, None, params, lora_manager, model
# Architectures that require "vllm" implementation type when MODEL_IMPL_TYPE is "auto".
"require" might be too strong word. replace it with "prefer"
# Architectures that require "vllm" implementation type when MODEL_IMPL_TYPE is "auto".
# These architectures are listed here because they have better performance with the
# vLLM PyTorch backend compared to the flax_nnx JAX backend for now.
_VLLM_REQUIRED_ARCHITECTURES: frozenset[str] = frozenset({"GptOssForCausalLM"})
In general, constants like this should be placed at the start of the file. Please move it.
                       vllm_config.model_config.dtype.dtype)
if impl == "auto":
    # Resolve "auto" based on architecture
    architectures = getattr(vllm_config.model_config.hf_config,
Dumb question: are there cases where there are multiple "architectures" for a single model?
| # Resolve "auto" based on architecture | ||
| architectures = getattr(vllm_config.model_config.hf_config, | ||
| "architectures", []) | ||
| for arch in architectures: |
Similar to the above comment: can we just assert that len(architectures) == 1 and do a simple hash-map lookup instead of iterating with a for loop?
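A hypothetical sketch of that suggestion follows; the _AUTO_IMPL_BY_ARCH mapping and resolve_auto_impl helper are illustrative names, not part of this PR.

```python
# Hypothetical alternative raised in review; not part of the actual diff.
_AUTO_IMPL_BY_ARCH: dict[str, str] = {"GptOssForCausalLM": "vllm"}

def resolve_auto_impl(architectures: list[str]) -> str:
    # Assert there is exactly one architecture, then do a single dict lookup
    # instead of looping over the list.
    assert len(architectures) == 1, (
        f"Expected exactly one architecture, got {architectures}")
    return _AUTO_IMPL_BY_ARCH.get(architectures[0], "flax_nnx")
```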
        try:
            # Try to load the flax model first
            return get_flax_model(vllm_config, rng, mesh, is_draft_model)
        except UnsupportedArchitectureError as e:
Probably a nit question: in C's switch statements, if we don't put break;, execution automatically falls through to the next case. Is that not the case for Python's match/case? I.e., if UnsupportedArchitectureError is thrown, do we skip the break; statement and automatically let the next case (which is case "vllm") be invoked?
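For context, a minimal standalone sketch (not from this PR) showing that Python's match/case never falls through; any fallback to "vllm" has to come from the explicit try/except inside the first case, not from reaching the next case.

```python
class UnsupportedArchitectureError(Exception):
    """Stub for illustration."""

def load(impl: str) -> str:
    match impl:
        case "flax_nnx":
            try:
                raise UnsupportedArchitectureError("not supported")
            except UnsupportedArchitectureError:
                # The fallback must be written explicitly; execution never
                # "falls through" into case "vllm" below.
                return "vllm fallback"
        case "vllm":
            return "vllm"
    return "unknown"

print(load("flax_nnx"))  # -> "vllm fallback"
```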
Description
- Add 'auto' as default value for MODEL_IMPL_TYPE env var
- For GptOssForCausalLM, 'auto' resolves to 'vllm' for better performance
- For all other architectures, 'auto' resolves to 'flax_nnx'

Tests
pytest
Checklist
Before submitting this PR, please make sure: