Skip to content

Feature/quantisation#2

Merged
Lothnic merged 6 commits into
mainfrom
feature/quantisation
Apr 27, 2026
Merged

Feature/quantisation#2
Lothnic merged 6 commits into
mainfrom
feature/quantisation

Conversation

@Lothnic

@Lothnic Lothnic commented Apr 27, 2026

Copy link
Copy Markdown
Owner

Summary by CodeRabbit

  • New Features

    • Added a --quantize/-q option to enable 4-bit model quantization for more efficient inference.
  • Improvements

    • Streaming now emits incremental text based on full decoded output for smoother, less choppy responses.
    • CLI suppresses specific noisy warnings and better handles models that omit special "thinking" tags by flushing buffered output.
  • Chores

    • Added runtime dependencies to support quantization and tokenization.
    • Broadened ignore rules to exclude additional cache files and a docs file.

Lothnic added 4 commits April 27, 2026 18:50
- Add get_linear_layer() factory in base.py that returns nn.Linear or
  bnb.nn.Linear4bit based on quantize flag (lazy import)
- Add quantize: bool field to LlamaConfig
- Swap all nn.Linear in Attention (q/k/v/o_proj) and MLP
  (gate/up/down_proj) to use the factory
- Implement dual-path weight loading in weight_loader.py:
  - Standard path: batch load_state_dict(assign=True) for non-quantized
  - Quantized path: shard-by-shard per-parameter Params4bit loading
- Preserve offline mode, FlashAttention, streaming, and stateless Sampler
- Add --quantize/-q CLI flag to main.py
- Add bitsandbytes>=0.49.2 to pyproject.toml
- Fix .gitignore to catch __pycache__ at all directory depths
- Remove tracked __pycache__ files
@coderabbitai

coderabbitai Bot commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f8a98862-688f-4835-b0cc-416bb0dd4dd0

📥 Commits

Reviewing files that changed from the base of the PR and between 47eee60 and 5560688.

📒 Files selected for processing (1)
  • models/weight_loader.py

📝 Walkthrough

Walkthrough

Adds optional 4‑bit NF4 quantization via bitsandbytes, a linear-layer factory to create quantized or standard layers, quantize-aware model loading with shard discovery and a dedicated quantized load path, a CLI flag to enable quantization, and generator output changed to emit incremental substrings from full-text decoding.

Changes

Cohort / File(s) Summary
Project config & ignores
\.gitignore, pyproject.toml
Broadened __pycache__/ pattern, ignore docs/quantisation.md, and added runtime deps bitsandbytes and sentencepiece.
Layer factory
models/base.py
New get_linear_layer(in_features, out_features, bias, quantize: bool=False) to choose between torch.nn.Linear and bitsandbytes Linear4bit.
Model modules
models/llama.py, models/attention.py
Added quantize: bool to LlamaConfig; replaced direct nn.Linear with get_linear_layer(..., quantize=config.quantize) for attention and MLP projections.
Weight loading & quantization
models/weight_loader.py
load_hf_model(..., quantize=False) arg; shard discovery, separate _load_standard and _load_quantized flows, per-parameter Params4bit / registration for quantized shards, stricter completeness checks, and explicit memory cleanup.
CLI & generation behavior
main.py, engine/generator.py
CLI: new --quantize/-q flag, remove offline env vars before loading, simplified tokenizer init, flush thinking buffer when <think> absent. Generator: track prompt_len/prev_text and emit only newly appended substrings.
Misc
docs/quantisation.md
File normalization (trailing newline) and now ignored by .gitignore.

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as main.py
    participant Loader as models/weight_loader.py
    participant Factory as models/base.py
    participant BNB as bitsandbytes
    participant Generator as engine/generator.py

    User->>CLI: run with --quantize
    CLI->>Loader: load_hf_model(model_id, quantize=True)
    Loader->>Loader: discover shard paths
    alt quantize=True
        Loader->>Loader: load shard to CPU & remap keys
        Loader->>Factory: request params/layers (quantize=True)
        Factory->>BNB: instantiate Linear4bit (nf4, bfloat16)
        BNB-->>Factory: Linear4bit
        Factory-->>Loader: return quantized params/layers (Params4bit)
        Loader->>Loader: register params, cleanup tensors
    else quantize=False
        Loader->>Loader: load shards, remap & cast tensors
        Loader->>Loader: load_state_dict(assign=True)
    end
    Loader-->>CLI: model ready
    CLI->>Generator: start generation
    Generator->>Generator: decode full token sequence, compute new_text = full - prev_text
    Generator-->>User: emit incremental substrings
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • Feature/quantisation #2 — Implements closely related quantization changes: get_linear_layer addition, threading quantize through configs and weight loading, and switching Linear usages to the factory.

Poem

🐰 I swap small bits in NF4 delight,

Layers hum softly through night,
Shards find homes, weights take their place,
Streams now hop with steady pace,
A tiny rabbit cheers the bright new byte!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Feature/quantisation' is partially related to the changeset, referring to a real aspect (quantization feature addition), but is too generic and doesn't highlight the primary change clearly. Use a more descriptive title that clearly indicates the main purpose, such as 'Add 4-bit NF4 quantization support via bitsandbytes' or 'Implement quantization mode with incremental token streaming improvements'.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/quantisation

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (4)
main.py (2)

47-47: Unused variable config from unpacking.

The config returned from load_hf_model is unpacked but never used. Use _ to indicate intentional discard.

♻️ Proposed fix
-    model, config = load_hf_model(args.model_id, device=args.device, quantize=args.quantize)
+    model, _ = load_hf_model(args.model_id, device=args.device, quantize=args.quantize)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.py` at line 47, The unpacked but unused variable `config` from the call
to load_hf_model should be intentionally discarded; change the assignment from
"model, config = load_hf_model(args.model_id, device=args.device,
quantize=args.quantize)" to use an underscore for the unused value (e.g.,
"model, _ = load_hf_model(...)") so the unused `config` no longer triggers a
linter warning while keeping the same call to load_hf_model with args.model_id,
args.device, and args.quantize.

117-121: Consider extracting the magic number to a named constant.

The threshold 20 is a magic number. Extracting it to a constant improves readability and makes future adjustments easier.

♻️ Proposed fix
+THINKING_TAG_DETECTION_THRESHOLD = 20  # chars to buffer before assuming no <think> tags
+
 # ... in the streaming loop:
-                elif not indicator_shown and len(buffer) > 20:
+                elif not indicator_shown and len(buffer) > THINKING_TAG_DETECTION_THRESHOLD:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.py` around lines 117 - 121, Replace the magic number 20 used in the
conditional "elif not indicator_shown and len(buffer) > 20:" with a named
constant (e.g. BUFFER_FLUSH_THRESHOLD or THINKING_BUFFER_THRESHOLD) defined near
other module constants or top of main.py; update the condition to use that
constant and ensure any related comments or tests reference the constant name
instead of the literal so the behavior tied to indicator_shown, buffer,
thinking_done and parts remains unchanged.
models/base.py (1)

7-16: Consider adding error handling for missing bitsandbytes and making compute_dtype configurable.

Two observations:

  1. compute_dtype=torch.bfloat16 is hardcoded, but the model could be loaded with a different dtype (e.g., float16). This may cause dtype mismatches during forward passes.

  2. If bitsandbytes is not installed, the import will raise a confusing ImportError. A clearer error message would improve UX.

♻️ Proposed improvement
-def get_linear_layer(in_features: int, out_features: int, bias: bool, quantize: bool = False):
+def get_linear_layer(in_features: int, out_features: int, bias: bool, quantize: bool = False, compute_dtype: torch.dtype = torch.bfloat16):
     """Factory that returns nn.Linear or bnb.nn.Linear4bit depending on quantize flag."""
     if quantize:
-        import bitsandbytes as bnb
+        try:
+            import bitsandbytes as bnb
+        except ImportError as e:
+            raise ImportError(
+                "bitsandbytes is required for quantization. "
+                "Install it with: pip install bitsandbytes"
+            ) from e
         return bnb.nn.Linear4bit(
             in_features, out_features, bias=bias,
-            compute_dtype=torch.bfloat16,
+            compute_dtype=compute_dtype,
             quant_type="nf4",
         )
     return nn.Linear(in_features, out_features, bias=bias)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/base.py` around lines 7 - 16, get_linear_layer currently imports
bitsandbytes and hardcodes compute_dtype=torch.bfloat16 for bnb.nn.Linear4bit
which can cause ImportError or dtype mismatches; update get_linear_layer to (1)
catch ImportError around "import bitsandbytes as bnb" and raise a clear
RuntimeError mentioning bitsandbytes is required for quantize=True, and (2)
accept an optional compute_dtype parameter (or infer from a passed dtype arg)
and pass that into bnb.nn.Linear4bit instead of the hardcoded torch.bfloat16 so
dtype matches the model's dtype; reference function get_linear_layer, symbol
compute_dtype, and Linear4bit when making changes.
models/weight_loader.py (1)

86-129: Asymmetry with _load_standard is notable but mitigated by validation strategy.

Unlike _load_standard (line 71) which returns (missing, unexpected) diagnostics, _load_quantized returns nothing. However, the return value from _load_standard is not actually used at call sites (lines 175–177). Instead, the code validates loading success by checking for meta tensors at lines 199–204, which works for both paths.

The per-parameter manual loading approach in _load_quantized using setattr is appropriate for direct parameter replacement. The Params4bit API usage is correct. The memory management (pop, del, gc.collect, empty_cache) is appropriate for large-model quantized loading.

If you want consistent diagnostics across both paths, consider returning tracking information from _load_quantized or making it explicit why the validation approach differs (e.g., a comment explaining the meta-tensor check suffices for both paths).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/weight_loader.py` around lines 86 - 129, The issue: _load_quantized is
asymmetrical with _load_standard because it returns no diagnostics while callers
expect (missing, unexpected) diagnostics from the loader path; to fix, either
make _load_quantized return the same (missing, unexpected) tuple or explicitly
document why the meta-tensor presence check covers validation. Concretely,
inside _load_quantized (function name) track missing keys and unexpected keys
the same way _load_standard does when you skip unmapped keys or successfully
assign parameters (use the same key bookkeeping logic used in _load_standard),
and return (missing, unexpected) at the end; alternatively, add a clear comment
in _load_quantized explaining that callers validate load success via meta-tensor
checks elsewhere (so no diagnostics are returned) and keep behavior as-is.
Include references to _load_standard, _load_quantized and Params4bit so the
change is easy to locate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.gitignore:
- Line 5: Remove the stale ignore entry "docs/quantisation.md" from .gitignore
(or replace it with a commented explanation) so documentation files are tracked
consistently; locate the "docs/quantisation.md" line in .gitignore and either
delete that line or change it to a commented note explaining why this specific
doc should be ignored/generated (and include the rationale in the commit
message).

In `@main.py`:
- Around line 41-45: Don't unconditionally remove HF_HUB_OFFLINE and
TRANSFORMERS_OFFLINE; instead, stop calling os.environ.pop("HF_HUB_OFFLINE",
None) and os.environ.pop("TRANSFORMERS_OFFLINE", None) and implement a
conditional path: check for those keys in os.environ and if present, write a
clear log message (e.g., via logger.warn/info) that the process will respect the
user's offline setting and avoid forcing online model downloads, otherwise
proceed with normal model-loading behavior (local-first-then-download). Update
the code around the existing os.environ.pop calls in main.py to perform the
presence check and logging rather than removing the variables.

In `@models/weight_loader.py`:
- Around line 206-212: The loop that materializes remaining meta tensors uses
torch.empty_like which leaves uninitialized memory and can produce garbage for
lm_head if it was allowed to remain meta but embeddings are not tied; update the
logic so before materializing any parameter named "lm_head" (or any lm_head
parameter) you check config.tie_word_embeddings is True; if tie_word_embeddings
is False and lm_head is meta, either raise a clear exception indicating the
missing checkpoint weight or allocate a deterministic tensor (e.g., zeros)
instead of torch.empty_like; apply the same validation for buffers and ensure
the checks reference the param names (e.g., "lm_head") and the
config.tie_word_embeddings flag so lm_head is never silently left uninitialized.
- Around line 58-61: The current broad try/except around the call to
_download("model.safetensors", must_be_local=local_only) swallows all errors;
change it to catch the specific huggingface_hub.errors.EntryNotFoundError (and
its LocalEntryNotFoundError/RemoteEntryNotFoundError if available) so
missing-file cases are handled, and for any other Exception log the error (use
the module logger or processLogger) and re-raise or return a clear sentinel
instead of silently passing; update the import to pull EntryNotFoundError from
huggingface_hub and modify the try/except in the same scope where _download(...)
is invoked to implement this behavior.

---

Nitpick comments:
In `@main.py`:
- Line 47: The unpacked but unused variable `config` from the call to
load_hf_model should be intentionally discarded; change the assignment from
"model, config = load_hf_model(args.model_id, device=args.device,
quantize=args.quantize)" to use an underscore for the unused value (e.g.,
"model, _ = load_hf_model(...)") so the unused `config` no longer triggers a
linter warning while keeping the same call to load_hf_model with args.model_id,
args.device, and args.quantize.
- Around line 117-121: Replace the magic number 20 used in the conditional "elif
not indicator_shown and len(buffer) > 20:" with a named constant (e.g.
BUFFER_FLUSH_THRESHOLD or THINKING_BUFFER_THRESHOLD) defined near other module
constants or top of main.py; update the condition to use that constant and
ensure any related comments or tests reference the constant name instead of the
literal so the behavior tied to indicator_shown, buffer, thinking_done and parts
remains unchanged.

In `@models/base.py`:
- Around line 7-16: get_linear_layer currently imports bitsandbytes and
hardcodes compute_dtype=torch.bfloat16 for bnb.nn.Linear4bit which can cause
ImportError or dtype mismatches; update get_linear_layer to (1) catch
ImportError around "import bitsandbytes as bnb" and raise a clear RuntimeError
mentioning bitsandbytes is required for quantize=True, and (2) accept an
optional compute_dtype parameter (or infer from a passed dtype arg) and pass
that into bnb.nn.Linear4bit instead of the hardcoded torch.bfloat16 so dtype
matches the model's dtype; reference function get_linear_layer, symbol
compute_dtype, and Linear4bit when making changes.

In `@models/weight_loader.py`:
- Around line 86-129: The issue: _load_quantized is asymmetrical with
_load_standard because it returns no diagnostics while callers expect (missing,
unexpected) diagnostics from the loader path; to fix, either make
_load_quantized return the same (missing, unexpected) tuple or explicitly
document why the meta-tensor presence check covers validation. Concretely,
inside _load_quantized (function name) track missing keys and unexpected keys
the same way _load_standard does when you skip unmapped keys or successfully
assign parameters (use the same key bookkeeping logic used in _load_standard),
and return (missing, unexpected) at the end; alternatively, add a clear comment
in _load_quantized explaining that callers validate load success via meta-tensor
checks elsewhere (so no diagnostics are returned) and keep behavior as-is.
Include references to _load_standard, _load_quantized and Params4bit so the
change is easy to locate.
🪄 Autofix (Beta)

✅ Autofix completed


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8f148ca5-d848-4c54-8ac9-00109a4d881f

📥 Commits

Reviewing files that changed from the base of the PR and between f3667d1 and 8b97690.

⛔ Files ignored due to path filters (14)
  • __pycache__/main.cpython-312.pyc is excluded by !**/*.pyc
  • engine/__pycache__/__init__.cpython-312.pyc is excluded by !**/*.pyc
  • engine/__pycache__/generator.cpython-312.pyc is excluded by !**/*.pyc
  • engine/__pycache__/sampler.cpython-312.pyc is excluded by !**/*.pyc
  • models/__pycache__/__init__.cpython-312.pyc is excluded by !**/*.pyc
  • models/__pycache__/attention.cpython-312.pyc is excluded by !**/*.pyc
  • models/__pycache__/base.cpython-312.pyc is excluded by !**/*.pyc
  • models/__pycache__/llama.cpython-312.pyc is excluded by !**/*.pyc
  • models/__pycache__/qwen3.cpython-312.pyc is excluded by !**/*.pyc
  • models/__pycache__/weight_loader.cpython-312.pyc is excluded by !**/*.pyc
  • tests/__pycache__/conftest.cpython-312-pytest-9.0.3.pyc is excluded by !**/*.pyc
  • tests/__pycache__/test_main.cpython-312-pytest-9.0.3.pyc is excluded by !**/*.pyc
  • tests/__pycache__/test_sampler.cpython-312-pytest-9.0.3.pyc is excluded by !**/*.pyc
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • .gitignore
  • engine/generator.py
  • main.py
  • models/attention.py
  • models/base.py
  • models/llama.py
  • models/weight_loader.py
  • pyproject.toml

Comment thread .gitignore
**/__pycache__/
*.pyc

docs/quantisation.md No newline at end of file

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check if quantisation.md exists and preview its content to determine if it should be tracked

# Check if the file exists
if [ -f "docs/quantisation.md" ]; then
  echo "File exists. First 20 lines:"
  head -n 20 docs/quantisation.md
  echo ""
  echo "File size:"
  wc -l docs/quantisation.md
else
  echo "File does not exist yet"
fi

Repository: Lothnic/vllmini

Length of output: 83


🏁 Script executed:

# Check git diff to see what's being changed
git diff --name-only

# Check what files are in the docs directory
ls -la docs/ 2>/dev/null || echo "docs directory does not exist or is empty"

# Search for any quantisation-related files
find . -name "*quantisation*" -o -name "*quantization*" 2>/dev/null | head -20

# Check the .gitignore file to see context
cat -n .gitignore

Repository: Lothnic/vllmini

Length of output: 322


🏁 Script executed:

# Search for quantisation/quantization references
rg -i "quantis" --type py --type md --type txt --type sh

# Check for documentation generation scripts
fd -e "py" -e "sh" -e "mk" | xargs rg -l "doc\|md\|markdown" 2>/dev/null | head -20

# Look at git log to see if quantisation.md is mentioned in any commits
git log --all --oneline --grep="quantisation\|quantization" 2>/dev/null | head -10

# Check if there are any build or doc generation scripts
fd "build\|gen\|make" -type f | head -20

Repository: Lothnic/vllmini

Length of output: 497


🏁 Script executed:

# Search for quantisation in Python source files
rg -i "quantis" --type py -A 3 -B 1

# Check what Python files contain quantisation-related code
rg -i "quantis" --type py -l

# Look for bitsandbytes imports or usage
rg "bitsandbytes\|bnb" --type py

Repository: Lothnic/vllmini

Length of output: 41


Remove docs/quantisation.md from .gitignore or provide clarification.

Documentation files should typically be committed to the repository. The docs/ directory already tracks sampler.md, and this project follows a pattern of version-controlling documentation. Preemptively ignoring a non-existent documentation file for a feature currently on a separate branch is inconsistent with the project's practices and may cause confusion when the quantisation documentation is later created.

If this file will be auto-generated or excluded for a specific reason, clarify that in the commit message.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore at line 5, Remove the stale ignore entry "docs/quantisation.md"
from .gitignore (or replace it with a commented explanation) so documentation
files are tracked consistently; locate the "docs/quantisation.md" line in
.gitignore and either delete that line or change it to a commented note
explaining why this specific doc should be ignored/generated (and include the
rationale in the commit message).

Comment thread main.py
Comment on lines +41 to +45

# Don't force offline mode for model loading — the weight_loader
# handles local-first-then-download fallback on its own.
os.environ.pop("HF_HUB_OFFLINE", None)
os.environ.pop("TRANSFORMERS_OFFLINE", None)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Removing user-set environment variables may cause unexpected behavior.

Explicitly removing HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE could surprise users who intentionally set these variables in their environment. The weight loader's local-first fallback doesn't fully replicate offline mode semantics (which prevents any network access).

Consider logging when these are removed, or respecting them if explicitly set.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.py` around lines 41 - 45, Don't unconditionally remove HF_HUB_OFFLINE
and TRANSFORMERS_OFFLINE; instead, stop calling os.environ.pop("HF_HUB_OFFLINE",
None) and os.environ.pop("TRANSFORMERS_OFFLINE", None) and implement a
conditional path: check for those keys in os.environ and if present, write a
clear log message (e.g., via logger.warn/info) that the process will respect the
user's offline setting and avoid forcing online model downloads, otherwise
proceed with normal model-loading behavior (local-first-then-download). Update
the code around the existing os.environ.pop calls in main.py to perform the
presence check and logging rather than removing the variables.

Comment thread models/weight_loader.py
Comment thread models/weight_loader.py Outdated
Comment on lines 206 to 212
# 7. Ensure all remaining allowed meta tensors (e.g. lm_head) are materialized
for param in model.parameters():
if param.is_meta:
param.data = torch.empty_like(param, device=device)
for buffer in model.buffers():
if buffer.is_meta:
buffer.data = torch.empty_like(buffer, device=device)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Risk of uninitialized tensors if lm_head isn't actually tied.

torch.empty_like creates tensors with uninitialized memory. The code allows lm_head meta params (line 202) assuming they'll be tied to embeddings (line 182). However, if tie_word_embeddings=False in config but lm_head weights are missing from checkpoint, this would silently produce garbage values.

Consider validating that lm_head is only allowed to be meta when config.tie_word_embeddings is True.

🔧 Proposed fix
-    real_missing = [k for k in missing_params if "lm_head" not in k]
+    # Only allow lm_head to be missing if weights are tied
+    real_missing = [
+        k for k in missing_params 
+        if "lm_head" not in k or not config.tie_word_embeddings
+    ]
     if real_missing or missing_buffers:
         raise ValueError(f"Missing weights/buffers in checkpoint! params={real_missing}, buffers={missing_buffers}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/weight_loader.py` around lines 206 - 212, The loop that materializes
remaining meta tensors uses torch.empty_like which leaves uninitialized memory
and can produce garbage for lm_head if it was allowed to remain meta but
embeddings are not tied; update the logic so before materializing any parameter
named "lm_head" (or any lm_head parameter) you check config.tie_word_embeddings
is True; if tie_word_embeddings is False and lm_head is meta, either raise a
clear exception indicating the missing checkpoint weight or allocate a
deterministic tensor (e.g., zeros) instead of torch.empty_like; apply the same
validation for buffers and ensure the checks reference the param names (e.g.,
"lm_head") and the config.tie_word_embeddings flag so lm_head is never silently
left uninitialized.

@coderabbitai

coderabbitai Bot commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 3 file(s) based on 3 unresolved review comments.

Files modified:

  • .gitignore
  • main.py
  • models/weight_loader.py

Commit: 91af63e7587c9fe0d06f6a1dbd25508597a035b0

The changes have been pushed to the feature/quantisation branch.

Time taken: 3m 0s

@Lothnic Lothnic merged commit af369b5 into main Apr 27, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant