Feature/quantisation by Lothnic · Pull Request #2 · Lothnic/vllmini

Lothnic · 2026-04-27T15:23:16Z

Summary by CodeRabbit

New Features
- Added a --quantize/-q option to enable 4-bit model quantization for more efficient inference.
Improvements
- Streaming now emits incremental text based on full decoded output for smoother, less choppy responses.
- CLI suppresses specific noisy warnings and better handles models that omit special "thinking" tags by flushing buffered output.
Chores
- Added runtime dependencies to support quantization and tokenization.
- Broadened ignore rules to exclude additional cache files and a docs file.

- Add get_linear_layer() factory in base.py that returns nn.Linear or bnb.nn.Linear4bit based on quantize flag (lazy import) - Add quantize: bool field to LlamaConfig - Swap all nn.Linear in Attention (q/k/v/o_proj) and MLP (gate/up/down_proj) to use the factory - Implement dual-path weight loading in weight_loader.py: - Standard path: batch load_state_dict(assign=True) for non-quantized - Quantized path: shard-by-shard per-parameter Params4bit loading - Preserve offline mode, FlashAttention, streaming, and stateless Sampler - Add --quantize/-q CLI flag to main.py - Add bitsandbytes>=0.49.2 to pyproject.toml - Fix .gitignore to catch __pycache__ at all directory depths - Remove tracked __pycache__ files

…on of missing weights with random memory

coderabbitai · 2026-04-27T15:23:27Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f8a98862-688f-4835-b0cc-416bb0dd4dd0

📥 Commits

Reviewing files that changed from the base of the PR and between 47eee60 and 5560688.

📒 Files selected for processing (1)

models/weight_loader.py

📝 Walkthrough

Walkthrough

Adds optional 4‑bit NF4 quantization via bitsandbytes, a linear-layer factory to create quantized or standard layers, quantize-aware model loading with shard discovery and a dedicated quantized load path, a CLI flag to enable quantization, and generator output changed to emit incremental substrings from full-text decoding.

Changes

Cohort / File(s)	Summary
Project config & ignores `\.gitignore`, `pyproject.toml`	Broadened `__pycache__/` pattern, ignore `docs/quantisation.md`, and added runtime deps `bitsandbytes` and `sentencepiece`.
Layer factory `models/base.py`	New `get_linear_layer(in_features, out_features, bias, quantize: bool=False)` to choose between `torch.nn.Linear` and `bitsandbytes` `Linear4bit`.
Model modules `models/llama.py`, `models/attention.py`	Added `quantize: bool` to `LlamaConfig`; replaced direct `nn.Linear` with `get_linear_layer(..., quantize=config.quantize)` for attention and MLP projections.
Weight loading & quantization `models/weight_loader.py`	`load_hf_model(..., quantize=False)` arg; shard discovery, separate `_load_standard` and `_load_quantized` flows, per-parameter `Params4bit` / registration for quantized shards, stricter completeness checks, and explicit memory cleanup.
CLI & generation behavior `main.py`, `engine/generator.py`	CLI: new `--quantize/-q` flag, remove offline env vars before loading, simplified tokenizer init, flush thinking buffer when `<think>` absent. Generator: track `prompt_len`/`prev_text` and emit only newly appended substrings.
Misc `docs/quantisation.md`	File normalization (trailing newline) and now ignored by `.gitignore`.

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as main.py
    participant Loader as models/weight_loader.py
    participant Factory as models/base.py
    participant BNB as bitsandbytes
    participant Generator as engine/generator.py

    User->>CLI: run with --quantize
    CLI->>Loader: load_hf_model(model_id, quantize=True)
    Loader->>Loader: discover shard paths
    alt quantize=True
        Loader->>Loader: load shard to CPU & remap keys
        Loader->>Factory: request params/layers (quantize=True)
        Factory->>BNB: instantiate Linear4bit (nf4, bfloat16)
        BNB-->>Factory: Linear4bit
        Factory-->>Loader: return quantized params/layers (Params4bit)
        Loader->>Loader: register params, cleanup tensors
    else quantize=False
        Loader->>Loader: load shards, remap & cast tensors
        Loader->>Loader: load_state_dict(assign=True)
    end
    Loader-->>CLI: model ready
    CLI->>Generator: start generation
    Generator->>Generator: decode full token sequence, compute new_text = full - prev_text
    Generator-->>User: emit incremental substrings

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Feature/quantisation #2 — Implements closely related quantization changes: get_linear_layer addition, threading quantize through configs and weight loading, and switching Linear usages to the factory.

Poem

🐰 I swap small bits in NF4 delight,

Layers hum softly through night,
Shards find homes, weights take their place,
Streams now hop with steady pace,
A tiny rabbit cheers the bright new byte!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title 'Feature/quantisation' is partially related to the changeset, referring to a real aspect (quantization feature addition), but is too generic and doesn't highlight the primary change clearly.	Use a more descriptive title that clearly indicates the main purpose, such as 'Add 4-bit NF4 quantization support via bitsandbytes' or 'Implement quantization mode with incremental token streaming improvements'.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/quantisation

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (4)

main.py (2)

47-47: Unused variable config from unpacking.

The config returned from load_hf_model is unpacked but never used. Use _ to indicate intentional discard.

♻️ Proposed fix

-    model, config = load_hf_model(args.model_id, device=args.device, quantize=args.quantize)
+    model, _ = load_hf_model(args.model_id, device=args.device, quantize=args.quantize)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@main.py` at line 47, The unpacked but unused variable `config` from the call
to load_hf_model should be intentionally discarded; change the assignment from
"model, config = load_hf_model(args.model_id, device=args.device,
quantize=args.quantize)" to use an underscore for the unused value (e.g.,
"model, _ = load_hf_model(...)") so the unused `config` no longer triggers a
linter warning while keeping the same call to load_hf_model with args.model_id,
args.device, and args.quantize.

117-121: Consider extracting the magic number to a named constant.

The threshold 20 is a magic number. Extracting it to a constant improves readability and makes future adjustments easier.

♻️ Proposed fix

+THINKING_TAG_DETECTION_THRESHOLD = 20  # chars to buffer before assuming no <think> tags
+
 # ... in the streaming loop:
-                elif not indicator_shown and len(buffer) > 20:
+                elif not indicator_shown and len(buffer) > THINKING_TAG_DETECTION_THRESHOLD:

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@main.py` around lines 117 - 121, Replace the magic number 20 used in the
conditional "elif not indicator_shown and len(buffer) > 20:" with a named
constant (e.g. BUFFER_FLUSH_THRESHOLD or THINKING_BUFFER_THRESHOLD) defined near
other module constants or top of main.py; update the condition to use that
constant and ensure any related comments or tests reference the constant name
instead of the literal so the behavior tied to indicator_shown, buffer,
thinking_done and parts remains unchanged.

models/base.py (1)

7-16: Consider adding error handling for missing bitsandbytes and making compute_dtype configurable.

Two observations:

compute_dtype=torch.bfloat16 is hardcoded, but the model could be loaded with a different dtype (e.g., float16). This may cause dtype mismatches during forward passes.
If bitsandbytes is not installed, the import will raise a confusing ImportError. A clearer error message would improve UX.

♻️ Proposed improvement

-def get_linear_layer(in_features: int, out_features: int, bias: bool, quantize: bool = False):
+def get_linear_layer(in_features: int, out_features: int, bias: bool, quantize: bool = False, compute_dtype: torch.dtype = torch.bfloat16):
     """Factory that returns nn.Linear or bnb.nn.Linear4bit depending on quantize flag."""
     if quantize:
-        import bitsandbytes as bnb
+        try:
+            import bitsandbytes as bnb
+        except ImportError as e:
+            raise ImportError(
+                "bitsandbytes is required for quantization. "
+                "Install it with: pip install bitsandbytes"
+            ) from e
         return bnb.nn.Linear4bit(
             in_features, out_features, bias=bias,
-            compute_dtype=torch.bfloat16,
+            compute_dtype=compute_dtype,
             quant_type="nf4",
         )
     return nn.Linear(in_features, out_features, bias=bias)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@models/base.py` around lines 7 - 16, get_linear_layer currently imports
bitsandbytes and hardcodes compute_dtype=torch.bfloat16 for bnb.nn.Linear4bit
which can cause ImportError or dtype mismatches; update get_linear_layer to (1)
catch ImportError around "import bitsandbytes as bnb" and raise a clear
RuntimeError mentioning bitsandbytes is required for quantize=True, and (2)
accept an optional compute_dtype parameter (or infer from a passed dtype arg)
and pass that into bnb.nn.Linear4bit instead of the hardcoded torch.bfloat16 so
dtype matches the model's dtype; reference function get_linear_layer, symbol
compute_dtype, and Linear4bit when making changes.

models/weight_loader.py (1)

86-129: Asymmetry with _load_standard is notable but mitigated by validation strategy.

Unlike _load_standard (line 71) which returns (missing, unexpected) diagnostics, _load_quantized returns nothing. However, the return value from _load_standard is not actually used at call sites (lines 175–177). Instead, the code validates loading success by checking for meta tensors at lines 199–204, which works for both paths.

The per-parameter manual loading approach in _load_quantized using setattr is appropriate for direct parameter replacement. The Params4bit API usage is correct. The memory management (pop, del, gc.collect, empty_cache) is appropriate for large-model quantized loading.

If you want consistent diagnostics across both paths, consider returning tracking information from _load_quantized or making it explicit why the validation approach differs (e.g., a comment explaining the meta-tensor check suffices for both paths).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/weight_loader.py` around lines 86 - 129, The issue: _load_quantized is
asymmetrical with _load_standard because it returns no diagnostics while callers
expect (missing, unexpected) diagnostics from the loader path; to fix, either
make _load_quantized return the same (missing, unexpected) tuple or explicitly
document why the meta-tensor presence check covers validation. Concretely,
inside _load_quantized (function name) track missing keys and unexpected keys
the same way _load_standard does when you skip unmapped keys or successfully
assign parameters (use the same key bookkeeping logic used in _load_standard),
and return (missing, unexpected) at the end; alternatively, add a clear comment
in _load_quantized explaining that callers validate load success via meta-tensor
checks elsewhere (so no diagnostics are returned) and keep behavior as-is.
Include references to _load_standard, _load_quantized and Params4bit so the
change is easy to locate.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.gitignore:
- Line 5: Remove the stale ignore entry "docs/quantisation.md" from .gitignore
(or replace it with a commented explanation) so documentation files are tracked
consistently; locate the "docs/quantisation.md" line in .gitignore and either
delete that line or change it to a commented note explaining why this specific
doc should be ignored/generated (and include the rationale in the commit
message).

In `@main.py`:
- Around line 41-45: Don't unconditionally remove HF_HUB_OFFLINE and
TRANSFORMERS_OFFLINE; instead, stop calling os.environ.pop("HF_HUB_OFFLINE",
None) and os.environ.pop("TRANSFORMERS_OFFLINE", None) and implement a
conditional path: check for those keys in os.environ and if present, write a
clear log message (e.g., via logger.warn/info) that the process will respect the
user's offline setting and avoid forcing online model downloads, otherwise
proceed with normal model-loading behavior (local-first-then-download). Update
the code around the existing os.environ.pop calls in main.py to perform the
presence check and logging rather than removing the variables.

In `@models/weight_loader.py`:
- Around line 206-212: The loop that materializes remaining meta tensors uses
torch.empty_like which leaves uninitialized memory and can produce garbage for
lm_head if it was allowed to remain meta but embeddings are not tied; update the
logic so before materializing any parameter named "lm_head" (or any lm_head
parameter) you check config.tie_word_embeddings is True; if tie_word_embeddings
is False and lm_head is meta, either raise a clear exception indicating the
missing checkpoint weight or allocate a deterministic tensor (e.g., zeros)
instead of torch.empty_like; apply the same validation for buffers and ensure
the checks reference the param names (e.g., "lm_head") and the
config.tie_word_embeddings flag so lm_head is never silently left uninitialized.
- Around line 58-61: The current broad try/except around the call to
_download("model.safetensors", must_be_local=local_only) swallows all errors;
change it to catch the specific huggingface_hub.errors.EntryNotFoundError (and
its LocalEntryNotFoundError/RemoteEntryNotFoundError if available) so
missing-file cases are handled, and for any other Exception log the error (use
the module logger or processLogger) and re-raise or return a clear sentinel
instead of silently passing; update the import to pull EntryNotFoundError from
huggingface_hub and modify the try/except in the same scope where _download(...)
is invoked to implement this behavior.

---

Nitpick comments:
In `@main.py`:
- Line 47: The unpacked but unused variable `config` from the call to
load_hf_model should be intentionally discarded; change the assignment from
"model, config = load_hf_model(args.model_id, device=args.device,
quantize=args.quantize)" to use an underscore for the unused value (e.g.,
"model, _ = load_hf_model(...)") so the unused `config` no longer triggers a
linter warning while keeping the same call to load_hf_model with args.model_id,
args.device, and args.quantize.
- Around line 117-121: Replace the magic number 20 used in the conditional "elif
not indicator_shown and len(buffer) > 20:" with a named constant (e.g.
BUFFER_FLUSH_THRESHOLD or THINKING_BUFFER_THRESHOLD) defined near other module
constants or top of main.py; update the condition to use that constant and
ensure any related comments or tests reference the constant name instead of the
literal so the behavior tied to indicator_shown, buffer, thinking_done and parts
remains unchanged.

In `@models/base.py`:
- Around line 7-16: get_linear_layer currently imports bitsandbytes and
hardcodes compute_dtype=torch.bfloat16 for bnb.nn.Linear4bit which can cause
ImportError or dtype mismatches; update get_linear_layer to (1) catch
ImportError around "import bitsandbytes as bnb" and raise a clear RuntimeError
mentioning bitsandbytes is required for quantize=True, and (2) accept an
optional compute_dtype parameter (or infer from a passed dtype arg) and pass
that into bnb.nn.Linear4bit instead of the hardcoded torch.bfloat16 so dtype
matches the model's dtype; reference function get_linear_layer, symbol
compute_dtype, and Linear4bit when making changes.

In `@models/weight_loader.py`:
- Around line 86-129: The issue: _load_quantized is asymmetrical with
_load_standard because it returns no diagnostics while callers expect (missing,
unexpected) diagnostics from the loader path; to fix, either make
_load_quantized return the same (missing, unexpected) tuple or explicitly
document why the meta-tensor presence check covers validation. Concretely,
inside _load_quantized (function name) track missing keys and unexpected keys
the same way _load_standard does when you skip unmapped keys or successfully
assign parameters (use the same key bookkeeping logic used in _load_standard),
and return (missing, unexpected) at the end; alternatively, add a clear comment
in _load_quantized explaining that callers validate load success via meta-tensor
checks elsewhere (so no diagnostics are returned) and keep behavior as-is.
Include references to _load_standard, _load_quantized and Params4bit so the
change is easy to locate.

🪄 Autofix (Beta)

✅ Autofix completed

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8f148ca5-d848-4c54-8ac9-00109a4d881f

📥 Commits

Reviewing files that changed from the base of the PR and between f3667d1 and 8b97690.

⛔ Files ignored due to path filters (14)

__pycache__/main.cpython-312.pyc is excluded by !**/*.pyc
engine/__pycache__/__init__.cpython-312.pyc is excluded by !**/*.pyc
engine/__pycache__/generator.cpython-312.pyc is excluded by !**/*.pyc
engine/__pycache__/sampler.cpython-312.pyc is excluded by !**/*.pyc
models/__pycache__/__init__.cpython-312.pyc is excluded by !**/*.pyc
models/__pycache__/attention.cpython-312.pyc is excluded by !**/*.pyc
models/__pycache__/base.cpython-312.pyc is excluded by !**/*.pyc
models/__pycache__/llama.cpython-312.pyc is excluded by !**/*.pyc
models/__pycache__/qwen3.cpython-312.pyc is excluded by !**/*.pyc
models/__pycache__/weight_loader.cpython-312.pyc is excluded by !**/*.pyc
tests/__pycache__/conftest.cpython-312-pytest-9.0.3.pyc is excluded by !**/*.pyc
tests/__pycache__/test_main.cpython-312-pytest-9.0.3.pyc is excluded by !**/*.pyc
tests/__pycache__/test_sampler.cpython-312-pytest-9.0.3.pyc is excluded by !**/*.pyc
uv.lock is excluded by !**/*.lock

📒 Files selected for processing (8)

.gitignore
engine/generator.py
main.py
models/attention.py
models/base.py
models/llama.py
models/weight_loader.py
pyproject.toml

coderabbitai · 2026-04-27T15:28:52Z

+**/__pycache__/
+*.pyc
+
+docs/quantisation.md


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Description: Check if quantisation.md exists and preview its content to determine if it should be tracked # Check if the file exists if [ -f "docs/quantisation.md" ]; then echo "File exists. First 20 lines:" head -n 20 docs/quantisation.md echo "" echo "File size:" wc -l docs/quantisation.md else echo "File does not exist yet" fi

Repository: Lothnic/vllmini

Length of output: 83

🏁 Script executed:

# Check git diff to see what's being changed git diff --name-only # Check what files are in the docs directory ls -la docs/ 2>/dev/null || echo "docs directory does not exist or is empty" # Search for any quantisation-related files find . -name "*quantisation*" -o -name "*quantization*" 2>/dev/null | head -20 # Check the .gitignore file to see context cat -n .gitignore

Repository: Lothnic/vllmini

Length of output: 322

🏁 Script executed:

# Search for quantisation/quantization references rg -i "quantis" --type py --type md --type txt --type sh # Check for documentation generation scripts fd -e "py" -e "sh" -e "mk" | xargs rg -l "doc\|md\|markdown" 2>/dev/null | head -20 # Look at git log to see if quantisation.md is mentioned in any commits git log --all --oneline --grep="quantisation\|quantization" 2>/dev/null | head -10 # Check if there are any build or doc generation scripts fd "build\|gen\|make" -type f | head -20

Repository: Lothnic/vllmini

Length of output: 497

🏁 Script executed:

# Search for quantisation in Python source files rg -i "quantis" --type py -A 3 -B 1 # Check what Python files contain quantisation-related code rg -i "quantis" --type py -l # Look for bitsandbytes imports or usage rg "bitsandbytes\|bnb" --type py

Repository: Lothnic/vllmini

Length of output: 41

Remove docs/quantisation.md from .gitignore or provide clarification.

Documentation files should typically be committed to the repository. The docs/ directory already tracks sampler.md, and this project follows a pattern of version-controlling documentation. Preemptively ignoring a non-existent documentation file for a feature currently on a separate branch is inconsistent with the project's practices and may cause confusion when the quantisation documentation is later created.

If this file will be auto-generated or excluded for a specific reason, clarify that in the commit message.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.gitignore at line 5, Remove the stale ignore entry "docs/quantisation.md" from .gitignore (or replace it with a commented explanation) so documentation files are tracked consistently; locate the "docs/quantisation.md" line in .gitignore and either delete that line or change it to a commented note explaining why this specific doc should be ignored/generated (and include the rationale in the commit message).

coderabbitai · 2026-04-27T15:28:52Z

+
+    # Don't force offline mode for model loading — the weight_loader
+    # handles local-first-then-download fallback on its own.
+    os.environ.pop("HF_HUB_OFFLINE", None)
+    os.environ.pop("TRANSFORMERS_OFFLINE", None)


⚠️ Potential issue | 🟡 Minor

Removing user-set environment variables may cause unexpected behavior.

Explicitly removing HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE could surprise users who intentionally set these variables in their environment. The weight loader's local-first fallback doesn't fully replicate offline mode semantics (which prevents any network access).

Consider logging when these are removed, or respecting them if explicitly set.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@main.py` around lines 41 - 45, Don't unconditionally remove HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE; instead, stop calling os.environ.pop("HF_HUB_OFFLINE", None) and os.environ.pop("TRANSFORMERS_OFFLINE", None) and implement a conditional path: check for those keys in os.environ and if present, write a clear log message (e.g., via logger.warn/info) that the process will respect the user's offline setting and avoid forcing online model downloads, otherwise proceed with normal model-loading behavior (local-first-then-download). Update the code around the existing os.environ.pop calls in main.py to perform the presence check and logging rather than removing the variables.

coderabbitai · 2026-04-27T15:28:52Z

+    # 7. Ensure all remaining allowed meta tensors (e.g. lm_head) are materialized
    for param in model.parameters():
        if param.is_meta:
            param.data = torch.empty_like(param, device=device)
    for buffer in model.buffers():
        if buffer.is_meta:
            buffer.data = torch.empty_like(buffer, device=device)


⚠️ Potential issue | 🟠 Major

Risk of uninitialized tensors if lm_head isn't actually tied.

torch.empty_like creates tensors with uninitialized memory. The code allows lm_head meta params (line 202) assuming they'll be tied to embeddings (line 182). However, if tie_word_embeddings=False in config but lm_head weights are missing from checkpoint, this would silently produce garbage values.

Consider validating that lm_head is only allowed to be meta when config.tie_word_embeddings is True.

🔧 Proposed fix

- real_missing = [k for k in missing_params if "lm_head" not in k] + # Only allow lm_head to be missing if weights are tied + real_missing = [ + k for k in missing_params + if "lm_head" not in k or not config.tie_word_embeddings + ] if real_missing or missing_buffers: raise ValueError(f"Missing weights/buffers in checkpoint! params={real_missing}, buffers={missing_buffers}")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@models/weight_loader.py` around lines 206 - 212, The loop that materializes remaining meta tensors uses torch.empty_like which leaves uninitialized memory and can produce garbage for lm_head if it was allowed to remain meta but embeddings are not tied; update the logic so before materializing any parameter named "lm_head" (or any lm_head parameter) you check config.tie_word_embeddings is True; if tie_word_embeddings is False and lm_head is meta, either raise a clear exception indicating the missing checkpoint weight or allocate a deterministic tensor (e.g., zeros) instead of torch.empty_like; apply the same validation for buffers and ensure the checks reference the param names (e.g., "lm_head") and the config.tie_word_embeddings flag so lm_head is never silently left uninitialized.

…m_head

coderabbitai · 2026-04-27T15:48:08Z

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 3 file(s) based on 3 unresolved review comments.

Files modified:

.gitignore
main.py
models/weight_loader.py

Commit: 91af63e7587c9fe0d06f6a1dbd25508597a035b0

The changes have been pushed to the feature/quantisation branch.

Time taken: 3m 0s

Lothnic added 4 commits April 27, 2026 18:50

fixed quantisation and ready to merge

01a79f3

fix: corrected weight validation logic to prevent silent initializati…

6b39989

…on of missing weights with random memory

removed temporary decoding files

8b97690

coderabbitai Bot reviewed Apr 27, 2026

View reviewed changes

Lothnic added 2 commits April 27, 2026 21:14

improved logging in weight_loader

47eee60

fix: removed lm_head exception so as to raise ValueError on missing l…

5560688

…m_head

Lothnic merged commit af369b5 into main Apr 27, 2026
2 checks passed

coderabbitai Bot mentioned this pull request Apr 27, 2026

feat: add quantization support to benchmarking script #3

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/quantisation#2

Feature/quantisation#2
Lothnic merged 6 commits into
mainfrom
feature/quantisation

Lothnic commented Apr 27, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment •

edited

Loading

Uh oh!

coderabbitai Bot Apr 27, 2026

Uh oh!

coderabbitai Bot Apr 27, 2026

Uh oh!

Uh oh!

coderabbitai Bot Apr 27, 2026

Uh oh!

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Lothnic commented Apr 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fixes Applied Successfully

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Lothnic commented Apr 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading

coderabbitai Bot left a comment •

edited

Loading

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading