Feature/quantisation#1
Conversation
- Add get_linear_layer() factory in base.py that returns nn.Linear or bnb.nn.Linear4bit based on quantize flag (lazy import) - Add quantize: bool field to LlamaConfig - Swap all nn.Linear in Attention (q/k/v/o_proj) and MLP (gate/up/down_proj) to use the factory - Implement dual-path weight loading in weight_loader.py: - Standard path: batch load_state_dict(assign=True) for non-quantized - Quantized path: shard-by-shard per-parameter Params4bit loading - Preserve offline mode, FlashAttention, streaming, and stateless Sampler - Add --quantize/-q CLI flag to main.py - Add bitsandbytes>=0.49.2 to pyproject.toml - Fix .gitignore to catch __pycache__ at all directory depths - Remove tracked __pycache__ files
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (7)
📝 WalkthroughWalkthroughAdds optional 4-bit (NF4) quantization across model layers and weight loading, a new Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI (main.py)
participant Loader as Weight Loader
participant Model as Model (Llama)
participant Layers as Layer Factory
CLI->>Loader: load_hf_model(model_id, quantize=flag)
Loader->>Loader: discover shard index / local-first lookup
alt quantize == True
Loader->>Loader: for each shard -> load quantized shard (Params4bit) + register
else
Loader->>Loader: for each shard -> remap keys -> aggregate state -> load_state_dict(assign=True)
end
Loader->>Model: instantiate LlamaConfig(quantize=flag)
Model->>Layers: get_linear_layer(..., quantize=flag)
Layers->>Model: return linear modules (quantized or nn.Linear)
Loader->>CLI: return ready model
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.gitignore:
- Line 5: The .gitignore currently excludes docs/quantisation.md; verify whether
docs/quantisation.md is intended to be tracked and then update .gitignore
accordingly: if the file should be committed, remove the docs/quantisation.md
entry from .gitignore so it will be tracked; if it is generated, keep the ignore
but replace the specific file entry with a comment explaining the generation
source or broaden the ignore to only generated artifacts (and optionally add
instructions on how to regenerate) so maintainers aren’t surprised by the
ignored file.
In `@engine/generator.py`:
- Around line 31-37: The current loop decodes the entire suffix each iteration
causing quadratic behavior; change it to decode only the newly appended token
IDs instead of calling tokenizer.decode(input_ids[0, prompt_len:]) every time.
In generator.py, introduce and maintain an index/length tracker (e.g.,
last_decoded_index or prev_token_count) alongside prompt_len, slice only the
newly generated token IDs from input_ids (from prompt_len + last_decoded_index
to prompt_len + current_generated_count), decode that slice with
tokenizer.decode(..., skip_special_tokens=True), append the result to prev_text
and update the tracker, then yield the decoded new_text; update references to
prev_text/new_text accordingly to preserve SentencePiece spacing handling.
In `@main.py`:
- Around line 42-49: Remove the unconditional os.environ.pop("HF_HUB_OFFLINE",
None) and os.environ.pop("TRANSFORMERS_OFFLINE", None) calls so you don't force
network access; leave environment variables alone and rely on load_hf_model(...)
and AutoTokenizer.from_pretrained(...) to honor their own local-first/offline
behavior (i.e., delete or comment out the two pop lines near where model, config
= load_hf_model(...) and tokenizer = AutoTokenizer.from_pretrained(...) are
invoked).
In `@models/weight_loader.py`:
- Around line 198-217: Detect missing meta parameters and buffers before
materializing them: before the loops that assign param.data =
torch.empty_like(...) and buffer.data = torch.empty_like(...), collect missing
parameter names with [name for name, param in model.named_parameters() if
param.is_meta] and missing buffer names with [name for name, buf in
model.named_buffers() if buf.is_meta]; if either list is non-empty (excluding
allowed names like "lm_head" as currently done), raise or log an error and abort
rather than silently creating uninitialized tensors; remove or move the later
"missing_keys" check so it runs on the originally-collected lists (using
param.is_meta / buffer.is_meta) prior to any materialization.
In `@pyproject.toml`:
- Around line 8-14: Remove "bitsandbytes>=0.49.2" from the main dependency list
in pyproject.toml and add it under a new extras entry in the
[project.optional-dependencies] table (for example key "quantization") so users
can opt into it (pip install .[quantization]); ensure the same version
constraint "bitsandbytes>=0.49.2" is used and mention the existing --quantize
flag in the README or install docs if needed to indicate when to install the
extra.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: cddf91da-8fc3-4114-a25e-7edbe4078a8c
⛔ Files ignored due to path filters (14)
__pycache__/main.cpython-312.pycis excluded by!**/*.pycengine/__pycache__/__init__.cpython-312.pycis excluded by!**/*.pycengine/__pycache__/generator.cpython-312.pycis excluded by!**/*.pycengine/__pycache__/sampler.cpython-312.pycis excluded by!**/*.pycmodels/__pycache__/__init__.cpython-312.pycis excluded by!**/*.pycmodels/__pycache__/attention.cpython-312.pycis excluded by!**/*.pycmodels/__pycache__/base.cpython-312.pycis excluded by!**/*.pycmodels/__pycache__/llama.cpython-312.pycis excluded by!**/*.pycmodels/__pycache__/qwen3.cpython-312.pycis excluded by!**/*.pycmodels/__pycache__/weight_loader.cpython-312.pycis excluded by!**/*.pyctests/__pycache__/conftest.cpython-312-pytest-9.0.3.pycis excluded by!**/*.pyctests/__pycache__/test_main.cpython-312-pytest-9.0.3.pycis excluded by!**/*.pyctests/__pycache__/test_sampler.cpython-312-pytest-9.0.3.pycis excluded by!**/*.pycuv.lockis excluded by!**/*.lock
📒 Files selected for processing (8)
.gitignoreengine/generator.pymain.pymodels/attention.pymodels/base.pymodels/llama.pymodels/weight_loader.pypyproject.toml
| **/__pycache__/ | ||
| *.pyc | ||
|
|
||
| docs/quantisation.md No newline at end of file |
There was a problem hiding this comment.
Minor risk: verify docs/quantisation.md should be ignored.
If docs/quantisation.md is supposed to be part of the repo’s documentation, ignoring it can lead to accidental omission of doc updates (and if it’s not currently tracked, it won’t be committed going forward).
Suggested check / potential fix
- Verify whether the file is intended to be tracked:
- If it should be committed: remove the ignore line.
-docs/quantisation.md- If it’s generated: consider ignoring only the generated artifacts (or document the generation source) to avoid silencing unintended edits.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| docs/quantisation.md |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.gitignore at line 5, The .gitignore currently excludes
docs/quantisation.md; verify whether docs/quantisation.md is intended to be
tracked and then update .gitignore accordingly: if the file should be committed,
remove the docs/quantisation.md entry from .gitignore so it will be tracked; if
it is generated, keep the ignore but replace the specific file entry with a
comment explaining the generation source or broaden the ignore to only generated
artifacts (and optionally add instructions on how to regenerate) so maintainers
aren’t surprised by the ignored file.
| # Decode all generated tokens so far and yield only the new text. | ||
| # This correctly handles SentencePiece space prefixes and multi-byte chars. | ||
| full_text = self.tokenizer.decode(input_ids[0, prompt_len:], skip_special_tokens=True) | ||
| new_text = full_text[len(prev_text):] | ||
| prev_text = full_text | ||
| if new_text: | ||
| yield new_text No newline at end of file |
There was a problem hiding this comment.
Avoid decoding the entire continuation on every token.
decode(input_ids[0, prompt_len:]) runs over the whole generated suffix on each step, so streaming becomes quadratic in output length. With max_new_tokens in the thousands, this will noticeably slow long generations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@engine/generator.py` around lines 31 - 37, The current loop decodes the
entire suffix each iteration causing quadratic behavior; change it to decode
only the newly appended token IDs instead of calling
tokenizer.decode(input_ids[0, prompt_len:]) every time. In generator.py,
introduce and maintain an index/length tracker (e.g., last_decoded_index or
prev_token_count) alongside prompt_len, slice only the newly generated token IDs
from input_ids (from prompt_len + last_decoded_index to prompt_len +
current_generated_count), decode that slice with tokenizer.decode(...,
skip_special_tokens=True), append the result to prev_text and update the
tracker, then yield the decoded new_text; update references to
prev_text/new_text accordingly to preserve SentencePiece spacing handling.
| # Don't force offline mode for model loading — the weight_loader | ||
| # handles local-first-then-download fallback on its own. | ||
| os.environ.pop("HF_HUB_OFFLINE", None) | ||
| os.environ.pop("TRANSFORMERS_OFFLINE", None) | ||
|
|
||
| model, config = load_hf_model(args.model_id, device=args.device, quantize=args.quantize) | ||
|
|
||
| tokenizer = AutoTokenizer.from_pretrained(args.model_id) |
There was a problem hiding this comment.
Don’t clear offline mode here.
These pop() calls force network access again. That breaks cache-only usage for both load_hf_model() and AutoTokenizer.from_pretrained(...), even though the loader already has local-first/offline handling.
Suggested fix
- os.environ.pop("HF_HUB_OFFLINE", None)
- os.environ.pop("TRANSFORMERS_OFFLINE", None)
+ local_only = (
+ os.environ.get("HF_HUB_OFFLINE") == "1"
+ or os.environ.get("TRANSFORMERS_OFFLINE") == "1"
+ )
model, config = load_hf_model(args.model_id, device=args.device, quantize=args.quantize)
- tokenizer = AutoTokenizer.from_pretrained(args.model_id)
+ tokenizer = AutoTokenizer.from_pretrained(args.model_id, local_files_only=local_only)🧰 Tools
🪛 Ruff (0.15.12)
[warning] 47-47: Unpacked variable config is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@main.py` around lines 42 - 49, Remove the unconditional
os.environ.pop("HF_HUB_OFFLINE", None) and
os.environ.pop("TRANSFORMERS_OFFLINE", None) calls so you don't force network
access; leave environment variables alone and rely on load_hf_model(...) and
AutoTokenizer.from_pretrained(...) to honor their own local-first/offline
behavior (i.e., delete or comment out the two pop lines near where model, config
= load_hf_model(...) and tokenizer = AutoTokenizer.from_pretrained(...) are
invoked).
| # 6. Ensure all remaining meta tensors are materialized on the correct device | ||
| for param in model.parameters(): | ||
| if param.is_meta: | ||
| param.data = torch.empty_like(param, device=device) | ||
| for buffer in model.buffers(): | ||
| if buffer.is_meta: | ||
| buffer.data = torch.empty_like(buffer, device=device) | ||
|
|
||
|
|
||
| # 7. Move model to target device/dtype | ||
| # Note: for quantized models, bnb parameters handle their own dtype, | ||
| # model.to() will skip them automatically | ||
| model.to(device, dtype=dtype) | ||
|
|
||
| if missing: | ||
| # Filter out expected missing buffers (RoPE caches computed at runtime) | ||
| real_missing = [k for k in missing if "rotary_emb" not in k] | ||
| # 8. Check for missing parameters | ||
| missing_keys = [name for name, param in model.named_parameters() if param.is_meta] | ||
| if missing_keys: | ||
| real_missing = [k for k in missing_keys if "lm_head" not in k] | ||
| if real_missing: | ||
| print(f"Missing: {real_missing}") | ||
| if unexpected: | ||
| print(f"Unexpected: {unexpected}") | ||
|
|
||
| del mapped | ||
|
|
There was a problem hiding this comment.
Fail before materializing missing meta parameters.
Line 201 converts any still-missing meta parameter into empty_like(...), so the later is_meta check can never catch it. That means an unmapped or missing weight silently turns into uninitialized data and inference continues with corrupted parameters.
Suggested fix
- # 6. Ensure all remaining meta tensors are materialized on the correct device
- for param in model.parameters():
- if param.is_meta:
- param.data = torch.empty_like(param, device=device)
+ # 6. Fail fast if any real parameters are still missing.
+ missing_keys = [
+ name for name, param in model.named_parameters()
+ if param.is_meta and "lm_head" not in name
+ ]
+ if missing_keys:
+ raise ValueError(f"Missing weights after load: {missing_keys}")
+
for buffer in model.buffers():
if buffer.is_meta:
buffer.data = torch.empty_like(buffer, device=device)
@@
- # 8. Check for missing parameters
- missing_keys = [name for name, param in model.named_parameters() if param.is_meta]
- if missing_keys:
- real_missing = [k for k in missing_keys if "lm_head" not in k]
- if real_missing:
- print(f"Missing: {real_missing}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@models/weight_loader.py` around lines 198 - 217, Detect missing meta
parameters and buffers before materializing them: before the loops that assign
param.data = torch.empty_like(...) and buffer.data = torch.empty_like(...),
collect missing parameter names with [name for name, param in
model.named_parameters() if param.is_meta] and missing buffer names with [name
for name, buf in model.named_buffers() if buf.is_meta]; if either list is
non-empty (excluding allowed names like "lm_head" as currently done), raise or
log an error and abort rather than silently creating uninitialized tensors;
remove or move the later "missing_keys" check so it runs on the
originally-collected lists (using param.is_meta / buffer.is_meta) prior to any
materialization.
| "bitsandbytes>=0.49.2", | ||
| "huggingface-hub>=1.11.0", | ||
| "packaging>=26.1", | ||
| "protobuf>=7.34.1", | ||
| "pytest>=9.0.3", | ||
| "safetensors>=0.7.0", | ||
| "sentencepiece>=0.2.1", |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
What platforms and accelerators are supported by bitsandbytes 0.49.2, and does its 4-bit quantization path work on CPU-only installs?
💡 Result:
bitsandbytes 0.49.2 supports the following platforms and accelerators, with QLoRA 4-bit quantization marked as fully supported (✅) unless noted: Platforms: - Linux (glibc >= 2.24): - x86-64: CPU (AVX2 min, AVX512 optimized), NVIDIA GPU (cuda, SM60+), AMD GPU (cuda, specific gfx), Intel GPU (xpu), Intel Gaudi (hpu, 〰️ for 4-bit) - aarch64: CPU, NVIDIA GPU (cuda, SM75+) - Windows 11 / Server 2022+ (x86-64): CPU (AVX2), NVIDIA GPU (cuda, SM60+), Intel GPU (xpu) - macOS 14+ (arm64): CPU (Apple M1+), Metal (mps, 🐢 slow for 4-bit) Yes, the 4-bit quantization path (QLoRA 4-bit) works on CPU-only installs across supported platforms (Linux x86-64/aarch64, Windows x86-64, macOS arm64), as confirmed by the official support table in the 0.49.2 README and CPU improvements noted in release notes.
Citations:
- 1: https://github.com/bitsandbytes-foundation/bitsandbytes/blob/0.49.2/README.md
- 2: https://github.com/bitsandbytes-foundation/bitsandbytes
- 3: https://github.com/bitsandbytes-foundation/bitsandbytes/releases/tag/0.49.0
- 4: https://github.com/bitsandbytes-foundation/bitsandbytes/releases
- 5: https://huggingface.co/docs/transformers/quantization/bitsandbytes
- 6: https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/README.md
- 7: 4bit quantized model.dequantize() fails on CPU bitsandbytes-foundation/bitsandbytes#1311
🏁 Script executed:
# Search for bitsandbytes imports and usage
rg "bitsandbytes|quantize" --type py -B 2 -A 5Repository: Lothnic/vllmini
Length of output: 6948
🏁 Script executed:
# Check main.py specifically for the quantization flag handling
fd "main\.py" --type f --exec cat -n {} \;Repository: Lothnic/vllmini
Length of output: 30176
Make bitsandbytes an optional dependency instead of requiring it for all installs.
Quantization is opt-in via --quantize, but requiring bitsandbytes>=0.49.2 in pyproject.toml forces its installation on all environments. While bitsandbytes 0.49.2 does support CPU-only platforms (Linux x86-64/aarch64, Windows, macOS arm64), making it a hard dependency creates unnecessary installation burden and potential wheel-availability issues for users who never use quantization. Move it to an optional extras group (e.g., [project.optional-dependencies]) so installation is only triggered when explicitly requested.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 8 - 14, Remove "bitsandbytes>=0.49.2" from the
main dependency list in pyproject.toml and add it under a new extras entry in
the [project.optional-dependencies] table (for example key "quantization") so
users can opt into it (pip install .[quantization]); ensure the same version
constraint "bitsandbytes>=0.49.2" is used and mention the existing --quantize
flag in the README or install docs if needed to indicate when to install the
extra.
…on of missing weights with random memory
Updated the quantization for the updated main branch with streaming output and stateless sampler
Summary by CodeRabbit
New Features
--quantize/-qCLI flag to enable 4-bit model quantization.Improvements
Bug Fixes
Tests
Dependencies