Feature/quantisation#2
Conversation
- Add get_linear_layer() factory in base.py that returns nn.Linear or bnb.nn.Linear4bit based on quantize flag (lazy import) - Add quantize: bool field to LlamaConfig - Swap all nn.Linear in Attention (q/k/v/o_proj) and MLP (gate/up/down_proj) to use the factory - Implement dual-path weight loading in weight_loader.py: - Standard path: batch load_state_dict(assign=True) for non-quantized - Quantized path: shard-by-shard per-parameter Params4bit loading - Preserve offline mode, FlashAttention, streaming, and stateless Sampler - Add --quantize/-q CLI flag to main.py - Add bitsandbytes>=0.49.2 to pyproject.toml - Fix .gitignore to catch __pycache__ at all directory depths - Remove tracked __pycache__ files
…on of missing weights with random memory
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds optional 4‑bit NF4 quantization via bitsandbytes, a linear-layer factory to create quantized or standard layers, quantize-aware model loading with shard discovery and a dedicated quantized load path, a CLI flag to enable quantization, and generator output changed to emit incremental substrings from full-text decoding. Changes
Sequence DiagramsequenceDiagram
participant User
participant CLI as main.py
participant Loader as models/weight_loader.py
participant Factory as models/base.py
participant BNB as bitsandbytes
participant Generator as engine/generator.py
User->>CLI: run with --quantize
CLI->>Loader: load_hf_model(model_id, quantize=True)
Loader->>Loader: discover shard paths
alt quantize=True
Loader->>Loader: load shard to CPU & remap keys
Loader->>Factory: request params/layers (quantize=True)
Factory->>BNB: instantiate Linear4bit (nf4, bfloat16)
BNB-->>Factory: Linear4bit
Factory-->>Loader: return quantized params/layers (Params4bit)
Loader->>Loader: register params, cleanup tensors
else quantize=False
Loader->>Loader: load shards, remap & cast tensors
Loader->>Loader: load_state_dict(assign=True)
end
Loader-->>CLI: model ready
CLI->>Generator: start generation
Generator->>Generator: decode full token sequence, compute new_text = full - prev_text
Generator-->>User: emit incremental substrings
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (4)
main.py (2)
47-47: Unused variableconfigfrom unpacking.The
configreturned fromload_hf_modelis unpacked but never used. Use_to indicate intentional discard.♻️ Proposed fix
- model, config = load_hf_model(args.model_id, device=args.device, quantize=args.quantize) + model, _ = load_hf_model(args.model_id, device=args.device, quantize=args.quantize)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@main.py` at line 47, The unpacked but unused variable `config` from the call to load_hf_model should be intentionally discarded; change the assignment from "model, config = load_hf_model(args.model_id, device=args.device, quantize=args.quantize)" to use an underscore for the unused value (e.g., "model, _ = load_hf_model(...)") so the unused `config` no longer triggers a linter warning while keeping the same call to load_hf_model with args.model_id, args.device, and args.quantize.
117-121: Consider extracting the magic number to a named constant.The threshold
20is a magic number. Extracting it to a constant improves readability and makes future adjustments easier.♻️ Proposed fix
+THINKING_TAG_DETECTION_THRESHOLD = 20 # chars to buffer before assuming no <think> tags + # ... in the streaming loop: - elif not indicator_shown and len(buffer) > 20: + elif not indicator_shown and len(buffer) > THINKING_TAG_DETECTION_THRESHOLD:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@main.py` around lines 117 - 121, Replace the magic number 20 used in the conditional "elif not indicator_shown and len(buffer) > 20:" with a named constant (e.g. BUFFER_FLUSH_THRESHOLD or THINKING_BUFFER_THRESHOLD) defined near other module constants or top of main.py; update the condition to use that constant and ensure any related comments or tests reference the constant name instead of the literal so the behavior tied to indicator_shown, buffer, thinking_done and parts remains unchanged.models/base.py (1)
7-16: Consider adding error handling for missingbitsandbytesand makingcompute_dtypeconfigurable.Two observations:
compute_dtype=torch.bfloat16is hardcoded, but the model could be loaded with a different dtype (e.g.,float16). This may cause dtype mismatches during forward passes.If
bitsandbytesis not installed, the import will raise a confusingImportError. A clearer error message would improve UX.♻️ Proposed improvement
-def get_linear_layer(in_features: int, out_features: int, bias: bool, quantize: bool = False): +def get_linear_layer(in_features: int, out_features: int, bias: bool, quantize: bool = False, compute_dtype: torch.dtype = torch.bfloat16): """Factory that returns nn.Linear or bnb.nn.Linear4bit depending on quantize flag.""" if quantize: - import bitsandbytes as bnb + try: + import bitsandbytes as bnb + except ImportError as e: + raise ImportError( + "bitsandbytes is required for quantization. " + "Install it with: pip install bitsandbytes" + ) from e return bnb.nn.Linear4bit( in_features, out_features, bias=bias, - compute_dtype=torch.bfloat16, + compute_dtype=compute_dtype, quant_type="nf4", ) return nn.Linear(in_features, out_features, bias=bias)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@models/base.py` around lines 7 - 16, get_linear_layer currently imports bitsandbytes and hardcodes compute_dtype=torch.bfloat16 for bnb.nn.Linear4bit which can cause ImportError or dtype mismatches; update get_linear_layer to (1) catch ImportError around "import bitsandbytes as bnb" and raise a clear RuntimeError mentioning bitsandbytes is required for quantize=True, and (2) accept an optional compute_dtype parameter (or infer from a passed dtype arg) and pass that into bnb.nn.Linear4bit instead of the hardcoded torch.bfloat16 so dtype matches the model's dtype; reference function get_linear_layer, symbol compute_dtype, and Linear4bit when making changes.models/weight_loader.py (1)
86-129: Asymmetry with_load_standardis notable but mitigated by validation strategy.Unlike
_load_standard(line 71) which returns(missing, unexpected)diagnostics,_load_quantizedreturns nothing. However, the return value from_load_standardis not actually used at call sites (lines 175–177). Instead, the code validates loading success by checking for meta tensors at lines 199–204, which works for both paths.The per-parameter manual loading approach in
_load_quantizedusingsetattris appropriate for direct parameter replacement. The Params4bit API usage is correct. The memory management (pop, del, gc.collect, empty_cache) is appropriate for large-model quantized loading.If you want consistent diagnostics across both paths, consider returning tracking information from
_load_quantizedor making it explicit why the validation approach differs (e.g., a comment explaining the meta-tensor check suffices for both paths).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@models/weight_loader.py` around lines 86 - 129, The issue: _load_quantized is asymmetrical with _load_standard because it returns no diagnostics while callers expect (missing, unexpected) diagnostics from the loader path; to fix, either make _load_quantized return the same (missing, unexpected) tuple or explicitly document why the meta-tensor presence check covers validation. Concretely, inside _load_quantized (function name) track missing keys and unexpected keys the same way _load_standard does when you skip unmapped keys or successfully assign parameters (use the same key bookkeeping logic used in _load_standard), and return (missing, unexpected) at the end; alternatively, add a clear comment in _load_quantized explaining that callers validate load success via meta-tensor checks elsewhere (so no diagnostics are returned) and keep behavior as-is. Include references to _load_standard, _load_quantized and Params4bit so the change is easy to locate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.gitignore:
- Line 5: Remove the stale ignore entry "docs/quantisation.md" from .gitignore
(or replace it with a commented explanation) so documentation files are tracked
consistently; locate the "docs/quantisation.md" line in .gitignore and either
delete that line or change it to a commented note explaining why this specific
doc should be ignored/generated (and include the rationale in the commit
message).
In `@main.py`:
- Around line 41-45: Don't unconditionally remove HF_HUB_OFFLINE and
TRANSFORMERS_OFFLINE; instead, stop calling os.environ.pop("HF_HUB_OFFLINE",
None) and os.environ.pop("TRANSFORMERS_OFFLINE", None) and implement a
conditional path: check for those keys in os.environ and if present, write a
clear log message (e.g., via logger.warn/info) that the process will respect the
user's offline setting and avoid forcing online model downloads, otherwise
proceed with normal model-loading behavior (local-first-then-download). Update
the code around the existing os.environ.pop calls in main.py to perform the
presence check and logging rather than removing the variables.
In `@models/weight_loader.py`:
- Around line 206-212: The loop that materializes remaining meta tensors uses
torch.empty_like which leaves uninitialized memory and can produce garbage for
lm_head if it was allowed to remain meta but embeddings are not tied; update the
logic so before materializing any parameter named "lm_head" (or any lm_head
parameter) you check config.tie_word_embeddings is True; if tie_word_embeddings
is False and lm_head is meta, either raise a clear exception indicating the
missing checkpoint weight or allocate a deterministic tensor (e.g., zeros)
instead of torch.empty_like; apply the same validation for buffers and ensure
the checks reference the param names (e.g., "lm_head") and the
config.tie_word_embeddings flag so lm_head is never silently left uninitialized.
- Around line 58-61: The current broad try/except around the call to
_download("model.safetensors", must_be_local=local_only) swallows all errors;
change it to catch the specific huggingface_hub.errors.EntryNotFoundError (and
its LocalEntryNotFoundError/RemoteEntryNotFoundError if available) so
missing-file cases are handled, and for any other Exception log the error (use
the module logger or processLogger) and re-raise or return a clear sentinel
instead of silently passing; update the import to pull EntryNotFoundError from
huggingface_hub and modify the try/except in the same scope where _download(...)
is invoked to implement this behavior.
---
Nitpick comments:
In `@main.py`:
- Line 47: The unpacked but unused variable `config` from the call to
load_hf_model should be intentionally discarded; change the assignment from
"model, config = load_hf_model(args.model_id, device=args.device,
quantize=args.quantize)" to use an underscore for the unused value (e.g.,
"model, _ = load_hf_model(...)") so the unused `config` no longer triggers a
linter warning while keeping the same call to load_hf_model with args.model_id,
args.device, and args.quantize.
- Around line 117-121: Replace the magic number 20 used in the conditional "elif
not indicator_shown and len(buffer) > 20:" with a named constant (e.g.
BUFFER_FLUSH_THRESHOLD or THINKING_BUFFER_THRESHOLD) defined near other module
constants or top of main.py; update the condition to use that constant and
ensure any related comments or tests reference the constant name instead of the
literal so the behavior tied to indicator_shown, buffer, thinking_done and parts
remains unchanged.
In `@models/base.py`:
- Around line 7-16: get_linear_layer currently imports bitsandbytes and
hardcodes compute_dtype=torch.bfloat16 for bnb.nn.Linear4bit which can cause
ImportError or dtype mismatches; update get_linear_layer to (1) catch
ImportError around "import bitsandbytes as bnb" and raise a clear RuntimeError
mentioning bitsandbytes is required for quantize=True, and (2) accept an
optional compute_dtype parameter (or infer from a passed dtype arg) and pass
that into bnb.nn.Linear4bit instead of the hardcoded torch.bfloat16 so dtype
matches the model's dtype; reference function get_linear_layer, symbol
compute_dtype, and Linear4bit when making changes.
In `@models/weight_loader.py`:
- Around line 86-129: The issue: _load_quantized is asymmetrical with
_load_standard because it returns no diagnostics while callers expect (missing,
unexpected) diagnostics from the loader path; to fix, either make
_load_quantized return the same (missing, unexpected) tuple or explicitly
document why the meta-tensor presence check covers validation. Concretely,
inside _load_quantized (function name) track missing keys and unexpected keys
the same way _load_standard does when you skip unmapped keys or successfully
assign parameters (use the same key bookkeeping logic used in _load_standard),
and return (missing, unexpected) at the end; alternatively, add a clear comment
in _load_quantized explaining that callers validate load success via meta-tensor
checks elsewhere (so no diagnostics are returned) and keep behavior as-is.
Include references to _load_standard, _load_quantized and Params4bit so the
change is easy to locate.
🪄 Autofix (Beta)
✅ Autofix completed
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 8f148ca5-d848-4c54-8ac9-00109a4d881f
⛔ Files ignored due to path filters (14)
__pycache__/main.cpython-312.pycis excluded by!**/*.pycengine/__pycache__/__init__.cpython-312.pycis excluded by!**/*.pycengine/__pycache__/generator.cpython-312.pycis excluded by!**/*.pycengine/__pycache__/sampler.cpython-312.pycis excluded by!**/*.pycmodels/__pycache__/__init__.cpython-312.pycis excluded by!**/*.pycmodels/__pycache__/attention.cpython-312.pycis excluded by!**/*.pycmodels/__pycache__/base.cpython-312.pycis excluded by!**/*.pycmodels/__pycache__/llama.cpython-312.pycis excluded by!**/*.pycmodels/__pycache__/qwen3.cpython-312.pycis excluded by!**/*.pycmodels/__pycache__/weight_loader.cpython-312.pycis excluded by!**/*.pyctests/__pycache__/conftest.cpython-312-pytest-9.0.3.pycis excluded by!**/*.pyctests/__pycache__/test_main.cpython-312-pytest-9.0.3.pycis excluded by!**/*.pyctests/__pycache__/test_sampler.cpython-312-pytest-9.0.3.pycis excluded by!**/*.pycuv.lockis excluded by!**/*.lock
📒 Files selected for processing (8)
.gitignoreengine/generator.pymain.pymodels/attention.pymodels/base.pymodels/llama.pymodels/weight_loader.pypyproject.toml
| **/__pycache__/ | ||
| *.pyc | ||
|
|
||
| docs/quantisation.md No newline at end of file |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check if quantisation.md exists and preview its content to determine if it should be tracked
# Check if the file exists
if [ -f "docs/quantisation.md" ]; then
echo "File exists. First 20 lines:"
head -n 20 docs/quantisation.md
echo ""
echo "File size:"
wc -l docs/quantisation.md
else
echo "File does not exist yet"
fiRepository: Lothnic/vllmini
Length of output: 83
🏁 Script executed:
# Check git diff to see what's being changed
git diff --name-only
# Check what files are in the docs directory
ls -la docs/ 2>/dev/null || echo "docs directory does not exist or is empty"
# Search for any quantisation-related files
find . -name "*quantisation*" -o -name "*quantization*" 2>/dev/null | head -20
# Check the .gitignore file to see context
cat -n .gitignoreRepository: Lothnic/vllmini
Length of output: 322
🏁 Script executed:
# Search for quantisation/quantization references
rg -i "quantis" --type py --type md --type txt --type sh
# Check for documentation generation scripts
fd -e "py" -e "sh" -e "mk" | xargs rg -l "doc\|md\|markdown" 2>/dev/null | head -20
# Look at git log to see if quantisation.md is mentioned in any commits
git log --all --oneline --grep="quantisation\|quantization" 2>/dev/null | head -10
# Check if there are any build or doc generation scripts
fd "build\|gen\|make" -type f | head -20Repository: Lothnic/vllmini
Length of output: 497
🏁 Script executed:
# Search for quantisation in Python source files
rg -i "quantis" --type py -A 3 -B 1
# Check what Python files contain quantisation-related code
rg -i "quantis" --type py -l
# Look for bitsandbytes imports or usage
rg "bitsandbytes\|bnb" --type pyRepository: Lothnic/vllmini
Length of output: 41
Remove docs/quantisation.md from .gitignore or provide clarification.
Documentation files should typically be committed to the repository. The docs/ directory already tracks sampler.md, and this project follows a pattern of version-controlling documentation. Preemptively ignoring a non-existent documentation file for a feature currently on a separate branch is inconsistent with the project's practices and may cause confusion when the quantisation documentation is later created.
If this file will be auto-generated or excluded for a specific reason, clarify that in the commit message.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.gitignore at line 5, Remove the stale ignore entry "docs/quantisation.md"
from .gitignore (or replace it with a commented explanation) so documentation
files are tracked consistently; locate the "docs/quantisation.md" line in
.gitignore and either delete that line or change it to a commented note
explaining why this specific doc should be ignored/generated (and include the
rationale in the commit message).
|
|
||
| # Don't force offline mode for model loading — the weight_loader | ||
| # handles local-first-then-download fallback on its own. | ||
| os.environ.pop("HF_HUB_OFFLINE", None) | ||
| os.environ.pop("TRANSFORMERS_OFFLINE", None) |
There was a problem hiding this comment.
Removing user-set environment variables may cause unexpected behavior.
Explicitly removing HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE could surprise users who intentionally set these variables in their environment. The weight loader's local-first fallback doesn't fully replicate offline mode semantics (which prevents any network access).
Consider logging when these are removed, or respecting them if explicitly set.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@main.py` around lines 41 - 45, Don't unconditionally remove HF_HUB_OFFLINE
and TRANSFORMERS_OFFLINE; instead, stop calling os.environ.pop("HF_HUB_OFFLINE",
None) and os.environ.pop("TRANSFORMERS_OFFLINE", None) and implement a
conditional path: check for those keys in os.environ and if present, write a
clear log message (e.g., via logger.warn/info) that the process will respect the
user's offline setting and avoid forcing online model downloads, otherwise
proceed with normal model-loading behavior (local-first-then-download). Update
the code around the existing os.environ.pop calls in main.py to perform the
presence check and logging rather than removing the variables.
| # 7. Ensure all remaining allowed meta tensors (e.g. lm_head) are materialized | ||
| for param in model.parameters(): | ||
| if param.is_meta: | ||
| param.data = torch.empty_like(param, device=device) | ||
| for buffer in model.buffers(): | ||
| if buffer.is_meta: | ||
| buffer.data = torch.empty_like(buffer, device=device) |
There was a problem hiding this comment.
Risk of uninitialized tensors if lm_head isn't actually tied.
torch.empty_like creates tensors with uninitialized memory. The code allows lm_head meta params (line 202) assuming they'll be tied to embeddings (line 182). However, if tie_word_embeddings=False in config but lm_head weights are missing from checkpoint, this would silently produce garbage values.
Consider validating that lm_head is only allowed to be meta when config.tie_word_embeddings is True.
🔧 Proposed fix
- real_missing = [k for k in missing_params if "lm_head" not in k]
+ # Only allow lm_head to be missing if weights are tied
+ real_missing = [
+ k for k in missing_params
+ if "lm_head" not in k or not config.tie_word_embeddings
+ ]
if real_missing or missing_buffers:
raise ValueError(f"Missing weights/buffers in checkpoint! params={real_missing}, buffers={missing_buffers}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@models/weight_loader.py` around lines 206 - 212, The loop that materializes
remaining meta tensors uses torch.empty_like which leaves uninitialized memory
and can produce garbage for lm_head if it was allowed to remain meta but
embeddings are not tied; update the logic so before materializing any parameter
named "lm_head" (or any lm_head parameter) you check config.tie_word_embeddings
is True; if tie_word_embeddings is False and lm_head is meta, either raise a
clear exception indicating the missing checkpoint weight or allocate a
deterministic tensor (e.g., zeros) instead of torch.empty_like; apply the same
validation for buffers and ensure the checks reference the param names (e.g.,
"lm_head") and the config.tie_word_embeddings flag so lm_head is never silently
left uninitialized.
|
Note Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it. Fixes Applied SuccessfullyFixed 3 file(s) based on 3 unresolved review comments. Files modified:
Commit: The changes have been pushed to the Time taken: |
Summary by CodeRabbit
New Features
Improvements
Chores