Fix BnB quantization in vLLM by ItzikVa · Pull Request #72 · generative-computing/granite-switch

ItzikVa · 2026-05-25T08:26:58Z

BitsAndBytes 4-bit quantization packs weights as uint8 with shape [total_elements//2, 1], which breaks the existing weight.shape-based dimension detection in SwitchedLoRALinear.__init__().
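
To make the failure concrete, the packing arithmetic can be sketched as below. This assumes only what the description states (two 4-bit values per uint8 byte, stored with shape [total_elements//2, 1]); the helper name `bnb_packed_shape` and the 4096 x 11008 layer size are hypothetical and chosen purely for illustration.

```python
def bnb_packed_shape(out_features: int, in_features: int) -> tuple[int, int]:
    """Shape of a 4-bit weight packed two values per uint8 byte (per the PR description)."""
    total_elements = out_features * in_features
    return (total_elements // 2, 1)


# Hypothetical logical layer size: [out_features, in_features].
logical_shape = (4096, 11008)
packed_shape = bnb_packed_shape(*logical_shape)

# Naive detection reads weight.shape as [out_features, in_features].
naive_out, naive_in = packed_shape

print("logical:", logical_shape)
print("packed: ", packed_shape)
print("naive (out, in):", (naive_out, naive_in))
```

Reading the packed shape as [out_features, in_features] would yield an input size of 1, so any LoRA buffer sized from it would be wrong.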

Fix:

- Prefer input_size_per_partition / output_size_per_partition attributes (always correct, regardless of weight packing format)
- Fall back to weight.shape only for non-parallel layers
- Add dtype guard: if weight dtype is non-floating-point (uint8 for BnB), default to bfloat16 for LoRA buffer allocation
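
The three steps above could look roughly like the following sketch. It is not the code from this PR: the function names `resolve_lora_dims` and `resolve_lora_dtype` are hypothetical, the layers are dependency-free stand-ins built with `SimpleNamespace`, and the fallback dtype is the string "bfloat16" where real code would presumably use `torch.bfloat16`. It also assumes an unpacked weight is laid out as [out_features, in_features].

```python
from types import SimpleNamespace


def resolve_lora_dims(base_layer):
    """Return (in_features, out_features) used to size LoRA buffers."""
    in_size = getattr(base_layer, "input_size_per_partition", None)
    out_size = getattr(base_layer, "output_size_per_partition", None)
    if in_size is not None and out_size is not None:
        # Parallel layers: these attributes do not depend on weight packing.
        return in_size, out_size
    # Non-parallel layers: assume weight is [out_features, in_features].
    out_features, in_features = base_layer.weight.shape
    return in_features, out_features


def resolve_lora_dtype(weight, fallback="bfloat16"):
    """Use the weight dtype unless it is non-floating-point (e.g. packed uint8)."""
    dtype = weight.dtype
    if getattr(dtype, "is_floating_point", False):
        return dtype
    return fallback


# Stand-in for a BnB 4-bit parallel layer with a packed uint8 weight.
uint8 = SimpleNamespace(is_floating_point=False)
bnb_layer = SimpleNamespace(
    input_size_per_partition=4096,
    output_size_per_partition=11008,
    weight=SimpleNamespace(shape=(4096 * 11008 // 2, 1), dtype=uint8),
)

# Stand-in for an unquantized, non-parallel layer.
float_dtype = SimpleNamespace(is_floating_point=True)
plain_layer = SimpleNamespace(
    weight=SimpleNamespace(shape=(11008, 4096), dtype=float_dtype),
)

print(resolve_lora_dims(bnb_layer), resolve_lora_dtype(bnb_layer.weight))
print(resolve_lora_dims(plain_layer))
```

Both stand-in layers resolve to the same logical dimensions, even though the packed weight's own shape no longer carries them.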

Also adds vLLM quantization tests (BnB INT4 + FP8) that verify:

- Base model weights are actually quantized
- LoRA/aLoRA weights remain in full precision
- Adapters activate correctly under quantization
- LoRA dimensions are not corrupted by packed weight shapes

Closes #16

antonpibm · 2026-05-25T10:51:34Z

Does this PR replace #53?

antonpibm · 2026-05-25T10:52:35Z

Please report the test results for this PR and also verify whether this PR is compatible with the models currently on HF

ItzikVa · 2026-05-25T10:56:25Z

> Does this PR replace #53?

Yes, the previous PR was intended for the dev branch.

ItzikVa · 2026-05-25T10:57:21Z

> Please report the test results for this PR and also verify whether this PR is compatible with the models currently on HF

All vLLM and quantization tests passed, and the changes worked with the current model on HF.

ItzikVa requested review from antonpibm, freunda and yairallouche as code owners May 25, 2026 08:26

ItzikVa wants to merge 1 commit into main from issue-16
