Skip to content

Feature/quantisation#1

Closed
Lothnic wants to merge 3 commits into
mainfrom
feature/quantisation
Closed

Feature/quantisation#1
Lothnic wants to merge 3 commits into
mainfrom
feature/quantisation

Conversation

@Lothnic

@Lothnic Lothnic commented Apr 27, 2026

Copy link
Copy Markdown
Owner

Updated the quantization for the updated main branch with streaming output and stateless sampler

Summary by CodeRabbit

  • New Features

    • Added a --quantize/-q CLI flag to enable 4-bit model quantization.
    • Improved streaming fallback to flush thinking buffers for smoother output.
  • Improvements

    • More accurate incremental token decoding and generation.
    • More robust and memory-conscious model loading.
  • Bug Fixes

    • Broader .gitignore patterns; now ignores additional build/docs artifacts.
  • Tests

    • Added several small tokenizer/decoding test scripts.
  • Dependencies

    • Added bitsandbytes and sentencepiece.

Lothnic added 2 commits April 27, 2026 18:50
- Add get_linear_layer() factory in base.py that returns nn.Linear or
  bnb.nn.Linear4bit based on quantize flag (lazy import)
- Add quantize: bool field to LlamaConfig
- Swap all nn.Linear in Attention (q/k/v/o_proj) and MLP
  (gate/up/down_proj) to use the factory
- Implement dual-path weight loading in weight_loader.py:
  - Standard path: batch load_state_dict(assign=True) for non-quantized
  - Quantized path: shard-by-shard per-parameter Params4bit loading
- Preserve offline mode, FlashAttention, streaming, and stateless Sampler
- Add --quantize/-q CLI flag to main.py
- Add bitsandbytes>=0.49.2 to pyproject.toml
- Fix .gitignore to catch __pycache__ at all directory depths
- Remove tracked __pycache__ files
@coderabbitai

coderabbitai Bot commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 966e9c50-82aa-4036-81e5-073833e952ce

📥 Commits

Reviewing files that changed from the base of the PR and between 01a79f3 and 6b39989.

📒 Files selected for processing (7)
  • models/weight_loader.py
  • test_decode.py
  • test_decode2.py
  • test_decode3.py
  • test_decode4.py
  • test_decode5.py
  • test_decode6.py

📝 Walkthrough

Walkthrough

Adds optional 4-bit (NF4) quantization across model layers and weight loading, a new --quantize/-q CLI flag, modular shard-aware weight loading (quantized and standard paths), streaming generator tweaks, several tokenizer/decoding test scripts, and dependency updates.

Changes

Cohort / File(s) Summary
Weight loading & quantization core
models/weight_loader.py
Implements shard discovery and two loading strategies (standard and 4-bit quantized), per-shard memory cleanup, post-load validation, and threads a quantize flag into load_hf_model.
Quantized layer factory
models/base.py
Adds get_linear_layer(in_features, out_features, bias, quantize=False) to construct either a standard nn.Linear or a bitsandbytes 4-bit linear.
Model integration
models/llama.py, models/attention.py
Adds LlamaConfig.quantize: bool = False and replaces nn.Linear projections with get_linear_layer(..., quantize=config.quantize) in MLP and attention projections.
CLI & generation logic
main.py, engine/generator.py
Adds --quantize/-q CLI flag passed into model loader, removes prior cache-forcing, changes prompt construction, suppresses a specific FutureWarning, and generator now decodes full continuation each step to emit only newly decoded text; streaming --hide-thinking gains a 20-char flush heuristic.
Dependencies & ignores
pyproject.toml, .gitignore
Adds runtime deps bitsandbytes>=0.49.2 and sentencepiece>=0.2.1; expands __pycache__ ignore to recursive pattern and starts ignoring docs/quantisation.md.
Decode/test scripts
test_decode.py, test_decode2.py, test_decode3.py, test_decode4.py, test_decode5.py, test_decode6.py
Adds multiple small tokenizer/decoding scripts demonstrating incremental decoding and token inspection.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (main.py)
    participant Loader as Weight Loader
    participant Model as Model (Llama)
    participant Layers as Layer Factory

    CLI->>Loader: load_hf_model(model_id, quantize=flag)
    Loader->>Loader: discover shard index / local-first lookup
    alt quantize == True
        Loader->>Loader: for each shard -> load quantized shard (Params4bit) + register
    else
        Loader->>Loader: for each shard -> remap keys -> aggregate state -> load_state_dict(assign=True)
    end
    Loader->>Model: instantiate LlamaConfig(quantize=flag)
    Model->>Layers: get_linear_layer(..., quantize=flag)
    Layers->>Model: return linear modules (quantized or nn.Linear)
    Loader->>CLI: return ready model
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

I nibble bytes in twilight code,
Four-bit carrots down the road,
Shards assembled, layers light,
CLI hops on quantize night,
Streaming hums — the rabbit's mode 🐇

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 41.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Feature/quantisation' is vague and uses a generic pattern without clearly describing the main feature being added. Use a more descriptive title such as 'Add 4-bit NF4 quantization support via bitsandbytes' to clearly convey the primary feature.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/quantisation

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.gitignore:
- Line 5: The .gitignore currently excludes docs/quantisation.md; verify whether
docs/quantisation.md is intended to be tracked and then update .gitignore
accordingly: if the file should be committed, remove the docs/quantisation.md
entry from .gitignore so it will be tracked; if it is generated, keep the ignore
but replace the specific file entry with a comment explaining the generation
source or broaden the ignore to only generated artifacts (and optionally add
instructions on how to regenerate) so maintainers aren’t surprised by the
ignored file.

In `@engine/generator.py`:
- Around line 31-37: The current loop decodes the entire suffix each iteration
causing quadratic behavior; change it to decode only the newly appended token
IDs instead of calling tokenizer.decode(input_ids[0, prompt_len:]) every time.
In generator.py, introduce and maintain an index/length tracker (e.g.,
last_decoded_index or prev_token_count) alongside prompt_len, slice only the
newly generated token IDs from input_ids (from prompt_len + last_decoded_index
to prompt_len + current_generated_count), decode that slice with
tokenizer.decode(..., skip_special_tokens=True), append the result to prev_text
and update the tracker, then yield the decoded new_text; update references to
prev_text/new_text accordingly to preserve SentencePiece spacing handling.

In `@main.py`:
- Around line 42-49: Remove the unconditional os.environ.pop("HF_HUB_OFFLINE",
None) and os.environ.pop("TRANSFORMERS_OFFLINE", None) calls so you don't force
network access; leave environment variables alone and rely on load_hf_model(...)
and AutoTokenizer.from_pretrained(...) to honor their own local-first/offline
behavior (i.e., delete or comment out the two pop lines near where model, config
= load_hf_model(...) and tokenizer = AutoTokenizer.from_pretrained(...) are
invoked).

In `@models/weight_loader.py`:
- Around line 198-217: Detect missing meta parameters and buffers before
materializing them: before the loops that assign param.data =
torch.empty_like(...) and buffer.data = torch.empty_like(...), collect missing
parameter names with [name for name, param in model.named_parameters() if
param.is_meta] and missing buffer names with [name for name, buf in
model.named_buffers() if buf.is_meta]; if either list is non-empty (excluding
allowed names like "lm_head" as currently done), raise or log an error and abort
rather than silently creating uninitialized tensors; remove or move the later
"missing_keys" check so it runs on the originally-collected lists (using
param.is_meta / buffer.is_meta) prior to any materialization.

In `@pyproject.toml`:
- Around line 8-14: Remove "bitsandbytes>=0.49.2" from the main dependency list
in pyproject.toml and add it under a new extras entry in the
[project.optional-dependencies] table (for example key "quantization") so users
can opt into it (pip install .[quantization]); ensure the same version
constraint "bitsandbytes>=0.49.2" is used and mention the existing --quantize
flag in the README or install docs if needed to indicate when to install the
extra.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: cddf91da-8fc3-4114-a25e-7edbe4078a8c

📥 Commits

Reviewing files that changed from the base of the PR and between f3667d1 and 01a79f3.

⛔ Files ignored due to path filters (14)
  • __pycache__/main.cpython-312.pyc is excluded by !**/*.pyc
  • engine/__pycache__/__init__.cpython-312.pyc is excluded by !**/*.pyc
  • engine/__pycache__/generator.cpython-312.pyc is excluded by !**/*.pyc
  • engine/__pycache__/sampler.cpython-312.pyc is excluded by !**/*.pyc
  • models/__pycache__/__init__.cpython-312.pyc is excluded by !**/*.pyc
  • models/__pycache__/attention.cpython-312.pyc is excluded by !**/*.pyc
  • models/__pycache__/base.cpython-312.pyc is excluded by !**/*.pyc
  • models/__pycache__/llama.cpython-312.pyc is excluded by !**/*.pyc
  • models/__pycache__/qwen3.cpython-312.pyc is excluded by !**/*.pyc
  • models/__pycache__/weight_loader.cpython-312.pyc is excluded by !**/*.pyc
  • tests/__pycache__/conftest.cpython-312-pytest-9.0.3.pyc is excluded by !**/*.pyc
  • tests/__pycache__/test_main.cpython-312-pytest-9.0.3.pyc is excluded by !**/*.pyc
  • tests/__pycache__/test_sampler.cpython-312-pytest-9.0.3.pyc is excluded by !**/*.pyc
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • .gitignore
  • engine/generator.py
  • main.py
  • models/attention.py
  • models/base.py
  • models/llama.py
  • models/weight_loader.py
  • pyproject.toml

Comment thread .gitignore
**/__pycache__/
*.pyc

docs/quantisation.md No newline at end of file

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Minor risk: verify docs/quantisation.md should be ignored.

If docs/quantisation.md is supposed to be part of the repo’s documentation, ignoring it can lead to accidental omission of doc updates (and if it’s not currently tracked, it won’t be committed going forward).

Suggested check / potential fix
  1. Verify whether the file is intended to be tracked:
  • If it should be committed: remove the ignore line.
-docs/quantisation.md
  • If it’s generated: consider ignoring only the generated artifacts (or document the generation source) to avoid silencing unintended edits.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
docs/quantisation.md
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore at line 5, The .gitignore currently excludes
docs/quantisation.md; verify whether docs/quantisation.md is intended to be
tracked and then update .gitignore accordingly: if the file should be committed,
remove the docs/quantisation.md entry from .gitignore so it will be tracked; if
it is generated, keep the ignore but replace the specific file entry with a
comment explaining the generation source or broaden the ignore to only generated
artifacts (and optionally add instructions on how to regenerate) so maintainers
aren’t surprised by the ignored file.

Comment thread engine/generator.py
Comment on lines +31 to +37
# Decode all generated tokens so far and yield only the new text.
# This correctly handles SentencePiece space prefixes and multi-byte chars.
full_text = self.tokenizer.decode(input_ids[0, prompt_len:], skip_special_tokens=True)
new_text = full_text[len(prev_text):]
prev_text = full_text
if new_text:
yield new_text No newline at end of file

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid decoding the entire continuation on every token.

decode(input_ids[0, prompt_len:]) runs over the whole generated suffix on each step, so streaming becomes quadratic in output length. With max_new_tokens in the thousands, this will noticeably slow long generations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@engine/generator.py` around lines 31 - 37, The current loop decodes the
entire suffix each iteration causing quadratic behavior; change it to decode
only the newly appended token IDs instead of calling
tokenizer.decode(input_ids[0, prompt_len:]) every time. In generator.py,
introduce and maintain an index/length tracker (e.g., last_decoded_index or
prev_token_count) alongside prompt_len, slice only the newly generated token IDs
from input_ids (from prompt_len + last_decoded_index to prompt_len +
current_generated_count), decode that slice with tokenizer.decode(...,
skip_special_tokens=True), append the result to prev_text and update the
tracker, then yield the decoded new_text; update references to
prev_text/new_text accordingly to preserve SentencePiece spacing handling.

Comment thread main.py
Comment on lines +42 to +49
# Don't force offline mode for model loading — the weight_loader
# handles local-first-then-download fallback on its own.
os.environ.pop("HF_HUB_OFFLINE", None)
os.environ.pop("TRANSFORMERS_OFFLINE", None)

model, config = load_hf_model(args.model_id, device=args.device, quantize=args.quantize)

tokenizer = AutoTokenizer.from_pretrained(args.model_id)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t clear offline mode here.

These pop() calls force network access again. That breaks cache-only usage for both load_hf_model() and AutoTokenizer.from_pretrained(...), even though the loader already has local-first/offline handling.

Suggested fix
-    os.environ.pop("HF_HUB_OFFLINE", None)
-    os.environ.pop("TRANSFORMERS_OFFLINE", None)
+    local_only = (
+        os.environ.get("HF_HUB_OFFLINE") == "1"
+        or os.environ.get("TRANSFORMERS_OFFLINE") == "1"
+    )
 
     model, config = load_hf_model(args.model_id, device=args.device, quantize=args.quantize)
 
-    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
+    tokenizer = AutoTokenizer.from_pretrained(args.model_id, local_files_only=local_only)
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 47-47: Unpacked variable config is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.py` around lines 42 - 49, Remove the unconditional
os.environ.pop("HF_HUB_OFFLINE", None) and
os.environ.pop("TRANSFORMERS_OFFLINE", None) calls so you don't force network
access; leave environment variables alone and rely on load_hf_model(...) and
AutoTokenizer.from_pretrained(...) to honor their own local-first/offline
behavior (i.e., delete or comment out the two pop lines near where model, config
= load_hf_model(...) and tokenizer = AutoTokenizer.from_pretrained(...) are
invoked).

Comment thread models/weight_loader.py Outdated
Comment on lines 198 to 217
# 6. Ensure all remaining meta tensors are materialized on the correct device
for param in model.parameters():
if param.is_meta:
param.data = torch.empty_like(param, device=device)
for buffer in model.buffers():
if buffer.is_meta:
buffer.data = torch.empty_like(buffer, device=device)


# 7. Move model to target device/dtype
# Note: for quantized models, bnb parameters handle their own dtype,
# model.to() will skip them automatically
model.to(device, dtype=dtype)

if missing:
# Filter out expected missing buffers (RoPE caches computed at runtime)
real_missing = [k for k in missing if "rotary_emb" not in k]
# 8. Check for missing parameters
missing_keys = [name for name, param in model.named_parameters() if param.is_meta]
if missing_keys:
real_missing = [k for k in missing_keys if "lm_head" not in k]
if real_missing:
print(f"Missing: {real_missing}")
if unexpected:
print(f"Unexpected: {unexpected}")

del mapped

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fail before materializing missing meta parameters.

Line 201 converts any still-missing meta parameter into empty_like(...), so the later is_meta check can never catch it. That means an unmapped or missing weight silently turns into uninitialized data and inference continues with corrupted parameters.

Suggested fix
-    # 6. Ensure all remaining meta tensors are materialized on the correct device
-    for param in model.parameters():
-        if param.is_meta:
-            param.data = torch.empty_like(param, device=device)
+    # 6. Fail fast if any real parameters are still missing.
+    missing_keys = [
+        name for name, param in model.named_parameters()
+        if param.is_meta and "lm_head" not in name
+    ]
+    if missing_keys:
+        raise ValueError(f"Missing weights after load: {missing_keys}")
+
     for buffer in model.buffers():
         if buffer.is_meta:
             buffer.data = torch.empty_like(buffer, device=device)
@@
-    # 8. Check for missing parameters
-    missing_keys = [name for name, param in model.named_parameters() if param.is_meta]
-    if missing_keys:
-        real_missing = [k for k in missing_keys if "lm_head" not in k]
-        if real_missing:
-            print(f"Missing: {real_missing}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/weight_loader.py` around lines 198 - 217, Detect missing meta
parameters and buffers before materializing them: before the loops that assign
param.data = torch.empty_like(...) and buffer.data = torch.empty_like(...),
collect missing parameter names with [name for name, param in
model.named_parameters() if param.is_meta] and missing buffer names with [name
for name, buf in model.named_buffers() if buf.is_meta]; if either list is
non-empty (excluding allowed names like "lm_head" as currently done), raise or
log an error and abort rather than silently creating uninitialized tensors;
remove or move the later "missing_keys" check so it runs on the
originally-collected lists (using param.is_meta / buffer.is_meta) prior to any
materialization.

Comment thread pyproject.toml
Comment on lines +8 to +14
"bitsandbytes>=0.49.2",
"huggingface-hub>=1.11.0",
"packaging>=26.1",
"protobuf>=7.34.1",
"pytest>=9.0.3",
"safetensors>=0.7.0",
"sentencepiece>=0.2.1",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

What platforms and accelerators are supported by bitsandbytes 0.49.2, and does its 4-bit quantization path work on CPU-only installs?

💡 Result:

bitsandbytes 0.49.2 supports the following platforms and accelerators, with QLoRA 4-bit quantization marked as fully supported (✅) unless noted: Platforms: - Linux (glibc >= 2.24): - x86-64: CPU (AVX2 min, AVX512 optimized), NVIDIA GPU (cuda, SM60+), AMD GPU (cuda, specific gfx), Intel GPU (xpu), Intel Gaudi (hpu, 〰️ for 4-bit) - aarch64: CPU, NVIDIA GPU (cuda, SM75+) - Windows 11 / Server 2022+ (x86-64): CPU (AVX2), NVIDIA GPU (cuda, SM60+), Intel GPU (xpu) - macOS 14+ (arm64): CPU (Apple M1+), Metal (mps, 🐢 slow for 4-bit) Yes, the 4-bit quantization path (QLoRA 4-bit) works on CPU-only installs across supported platforms (Linux x86-64/aarch64, Windows x86-64, macOS arm64), as confirmed by the official support table in the 0.49.2 README and CPU improvements noted in release notes.

Citations:


🏁 Script executed:

# Search for bitsandbytes imports and usage
rg "bitsandbytes|quantize" --type py -B 2 -A 5

Repository: Lothnic/vllmini

Length of output: 6948


🏁 Script executed:

# Check main.py specifically for the quantization flag handling
fd "main\.py" --type f --exec cat -n {} \;

Repository: Lothnic/vllmini

Length of output: 30176


Make bitsandbytes an optional dependency instead of requiring it for all installs.

Quantization is opt-in via --quantize, but requiring bitsandbytes>=0.49.2 in pyproject.toml forces its installation on all environments. While bitsandbytes 0.49.2 does support CPU-only platforms (Linux x86-64/aarch64, Windows, macOS arm64), making it a hard dependency creates unnecessary installation burden and potential wheel-availability issues for users who never use quantization. Move it to an optional extras group (e.g., [project.optional-dependencies]) so installation is only triggered when explicitly requested.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 8 - 14, Remove "bitsandbytes>=0.49.2" from the
main dependency list in pyproject.toml and add it under a new extras entry in
the [project.optional-dependencies] table (for example key "quantization") so
users can opt into it (pip install .[quantization]); ensure the same version
constraint "bitsandbytes>=0.49.2" is used and mention the existing --quantize
flag in the README or install docs if needed to indicate when to install the
extra.

@Lothnic Lothnic closed this Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant