Fix BnB quantization in vLLM #72

Open
ItzikVa wants to merge 1 commit into main from issue-16

Conversation

@ItzikVa

ItzikVa commented May 25, 2026

Collaborator

BitsAndBytes 4-bit quantization packs weights as uint8 with shape [total_elements//2, 1], which breaks the existing weight.shape-based dimension detection in SwitchedLoRALinear.__init__().
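
To make the failure concrete, here is a small arithmetic sketch of the packing described above. The helper names are hypothetical and the 4096 x 4096 layer size is only an illustrative assumption; the point is that reading dimensions from the packed shape yields values unrelated to the real layer.

```python
def bnb_packed_shape(out_features, in_features):
    """Shape of a 4-bit packed weight: two 4-bit values per uint8 byte."""
    total_elements = out_features * in_features
    return (total_elements // 2, 1)


def naive_dims_from_shape(weight_shape):
    """Shape-based detection, assuming the usual [out_features, in_features] layout."""
    out_features, in_features = weight_shape
    return in_features, out_features


# Illustrative layer size (an assumption, not taken from the PR).
packed = bnb_packed_shape(4096, 4096)        # (8388608, 1)
wrong_dims = naive_dims_from_shape(packed)   # (1, 8388608) instead of (4096, 4096)
```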

Fix:

  • Prefer input_size_per_partition / output_size_per_partition attributes (always correct, regardless of weight packing format)
  • Fall back to weight.shape only for non-parallel layers
  • Add dtype guard: if weight dtype is non-floating-point (uint8 for BnB), default to bfloat16 for LoRA buffer allocation
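
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names are hypothetical, the fallback assumes an [out_features, in_features] weight layout, and `SimpleNamespace` objects stand in for real layers and dtypes so the snippet runs without vLLM or PyTorch. The dtype check relies on an `is_floating_point` attribute, which `torch.dtype` also exposes.

```python
from types import SimpleNamespace


def resolve_lora_dims(base_layer):
    """Return (in_features, out_features) for the wrapped layer."""
    in_size = getattr(base_layer, "input_size_per_partition", None)
    out_size = getattr(base_layer, "output_size_per_partition", None)
    if in_size is not None and out_size is not None:
        # Parallel layers: these attributes do not depend on weight packing.
        return in_size, out_size
    # Non-parallel layers: fall back to the weight shape,
    # assumed here to be [out_features, in_features].
    out_features, in_features = base_layer.weight.shape
    return in_features, out_features


def resolve_lora_dtype(weight_dtype, default_dtype):
    """Use the weight dtype unless it is non-floating-point (e.g. packed uint8)."""
    if getattr(weight_dtype, "is_floating_point", False):
        return weight_dtype
    return default_dtype


# Stand-in dtypes and layers (hypothetical, for illustration only).
uint8 = SimpleNamespace(name="uint8", is_floating_point=False)
bfloat16 = SimpleNamespace(name="bfloat16", is_floating_point=True)

bnb_layer = SimpleNamespace(
    weight=SimpleNamespace(shape=(4096 * 4096 // 2, 1), dtype=uint8),
    input_size_per_partition=4096,
    output_size_per_partition=4096,
)
plain_layer = SimpleNamespace(
    weight=SimpleNamespace(shape=(1024, 4096), dtype=bfloat16),
)

bnb_dims = resolve_lora_dims(bnb_layer)      # (4096, 4096)
plain_dims = resolve_lora_dims(plain_layer)  # (4096, 1024)
bnb_dtype = resolve_lora_dtype(bnb_layer.weight.dtype, bfloat16)
```

With real PyTorch tensors, `default_dtype` would be `torch.bfloat16`, matching the default named in the third bullet.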

Also adds vLLM quantization tests (BnB INT4 + FP8) that verify:

  • Base model weights are actually quantized
  • LoRA/aLoRA weights remain in full precision
  • Adapters activate correctly under quantization
  • LoRA dimensions are not corrupted by packed weight shapes
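
A rough sketch of how the dtype and shape checks in this list could be expressed is shown below. It is not the test code added by this PR. It assumes parameters are summarized as a mapping from name to (dtype name, shape), that adapter tensors are named with `lora_A` / `lora_B` (a PEFT-style convention, assumed here) with shapes [rank, in_features] and [out_features, rank], and that the mapping contains only the adapted linear layer. Adapter activation is not covered, since it needs a running model.

```python
QUANTIZED_DTYPES = {"uint8", "float8_e4m3fn", "float8_e5m2"}
FULL_PRECISION_DTYPES = {"bfloat16", "float16", "float32"}


def find_violations(params, rank, in_features, out_features):
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    for name, (dtype_name, shape) in params.items():
        is_a = "lora_A" in name
        is_b = "lora_B" in name
        if is_a or is_b:
            # Adapter weights must stay in full precision.
            if dtype_name not in FULL_PRECISION_DTYPES:
                problems.append(f"{name}: adapter dtype {dtype_name} is not full precision")
            # Adapter shapes must come from the real layer dims, not the packed shape.
            expected = (rank, in_features) if is_a else (out_features, rank)
            if tuple(shape) != expected:
                problems.append(f"{name}: shape {tuple(shape)} != expected {expected}")
        elif dtype_name not in QUANTIZED_DTYPES:
            # Everything else in this mapping is a base weight and should be quantized.
            problems.append(f"{name}: base dtype {dtype_name} is not quantized")
    return problems


# Hypothetical parameter summary for one BnB-quantized layer with a rank-8 adapter.
good = {
    "proj.base_layer.weight": ("uint8", (4096 * 4096 // 2, 1)),
    "proj.lora_A": ("bfloat16", (8, 4096)),
    "proj.lora_B": ("bfloat16", (4096, 8)),
}
# Same layer, but with adapter dims derived from the packed shape.
bad = dict(good, **{"proj.lora_A": ("bfloat16", (8, 1))})
```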

Closes #16
@antonpibm

Collaborator

Does this PR replace #53?

@antonpibm

Collaborator

Please report the test results for this PR, and verify whether it is compatible with the models currently on HF.

@ItzikVa

ItzikVa commented May 25, 2026

Collaborator Author

Does this PR replace #53?

Yes, the previous PR was intended for the dev branch.

@ItzikVa

ItzikVa commented May 25, 2026

Collaborator Author

Please report the test results for this PR, and verify whether it is compatible with the models currently on HF.

All vLLM and quantization tests passed, and the change works with the current model on HF.
