Feat/kv cache fp8 support #26

luozixin2 · 2026-01-18T05:47:41Z

No description provided.

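Main changes:

- Add GPTQ Marlin (W4A16) and AWQ Marlin (W4A16) quantization strategies
- Fix loader.py to correctly load weights in the gptq_marlin format (supporting the Marlin-specific repacked qweight and permuted scales)
- Update quantize_model.py to support exporting the gptq_marlin format (symmetric quantization + Marlin repack/permute)
- Update linear.py:
  - Add an _offline_quant_bits buffer that stores the quantization bit width
  - Add GPTQ runtime shuffle support (gptq_shuffle)
  - Add lazy repack support for GPTQ/AWQ Marlin (_maybe_prepare_offline_gptq_marlin/_awq_marlin)
  - Standardize on the vLLM format (int32 packed, fp16 scales)
- Simplify the individual strategy files and remove duplicated code
- Remove the old AllSpark Marlin implementation files
- Add several benchmark configuration files (GPTQ/AWQ Marlin at each bit width)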
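For readers unfamiliar with the "int32 packed" layout named in the list above, the following is a minimal sketch of symmetric 4-bit quantization with eight codes packed into one 32-bit word. It is an illustration only: the function names are hypothetical, the scale is a plain Python float rather than fp16, the nibble order (lowest nibble first) is an assumption, and the Marlin-specific repack and scale permutation are deliberately not reproduced here.

```python
from typing import List, Tuple


def quantize_group_symmetric(values: List[float]) -> Tuple[List[int], float]:
    """Quantize one group to 4-bit codes with a single shared scale.

    Symmetric scheme (assumed): q = clamp(round(v / scale), -8, 7),
    stored as the unsigned code q + 8 in the range 0..15.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = []
    for v in values:
        q = max(-8, min(7, round(v / scale)))
        codes.append(q + 8)
    return codes, scale


def pack_int4(codes: List[int]) -> List[int]:
    """Pack eight 4-bit codes into each 32-bit word, lowest nibble first."""
    if len(codes) % 8 != 0:
        raise ValueError("number of codes must be a multiple of 8")
    packed = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, code in enumerate(codes[i:i + 8]):
            word |= (code & 0xF) << (4 * j)
        # Kept as an unsigned Python int; an int32 tensor would hold the
        # same bit pattern reinterpreted as a signed value.
        packed.append(word)
    return packed


def unpack_int4(packed: List[int]) -> List[int]:
    """Inverse of pack_int4."""
    return [(word >> (4 * j)) & 0xF for word in packed for j in range(8)]


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map unsigned codes back to floats by removing the +8 offset."""
    return [(code - 8) * scale for code in codes]
```

Packing and unpacking are exact inverses, so any loss comes only from the rounding step in `quantize_group_symmetric`.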

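benchmark_results contains locally generated evaluation artifacts and should not be in version control. This commit removes it as a regular deletion and relies on the benchmark_results/ rule in .gitignore to keep it from being committed again.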

coderabbitai · 2026-01-18T05:47:46Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


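- Add quant-method=auto support: use auto-gptq / awq to perform real calibration-based quantization
- Add calibration data arguments: --calib-text-file, --calib-num-samples, --calib-seq-len, and others
- Implement _export_autogptq_to_vllm_weights: export vLLM-format weights from a model quantized with auto-gptq
- Implement _export_awq_to_vllm_weights: export vLLM-format weights from a model quantized with awq
- Keep the old quant-method=simple implementation for backward compatibility
- Fix the shape inference and TP sharding logic for gptq_marlin scales in loader.py
- Fix linear_gptq_marlin_w4a16.py by removing an unnecessary bf16->fp16 conversion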
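The calibration options listed above could be wired up roughly as follows. This is a sketch under stated assumptions, not the code in quantize_model.py: the flag names come from the commit message, while the default values, the one-sample-per-line file format, and the helper names `build_parser` and `load_calib_texts` are hypothetical.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the flags named in the commit message."""
    parser = argparse.ArgumentParser(
        description="Sketch of calibration options (defaults are placeholders)"
    )
    parser.add_argument("--quant-method", choices=["auto", "simple"], default="auto")
    parser.add_argument("--calib-text-file", type=str, default=None)
    parser.add_argument("--calib-num-samples", type=int, default=128)
    parser.add_argument("--calib-seq-len", type=int, default=512)
    return parser


def load_calib_texts(path: str, num_samples: int) -> List[str]:
    """Read up to num_samples non-empty lines, one calibration sample per line.

    Assumption: the file is plain UTF-8 text. Truncating each sample to
    --calib-seq-len would happen after tokenization, which is not shown.
    """
    with open(path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle if line.strip()]
    return lines[:num_samples]
```

With these placeholder defaults, `build_parser().parse_args(["--quant-method", "simple"])` selects the legacy path and leaves the calibration arguments unused.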

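Main refactoring:

1. **diffulex/layer/linear.py** - greatly simplify the quantization logic (-197 lines):
   - Add `_forward_base()`: a unified forward dispatcher that replaces the duplicated quantization branches in the subclasses
   - Add `_build_offline_forward_kwargs()`: build the forward arguments for offline quantization (GPTQ/AWQ) in one place
   - Add helper methods such as `_get_linear_strategy()`, `_offline_meta()`, and `_infer_gptq_weight_bits()`
   - Fix the edge case in `LoRAMixin.merge_lora` where the base weight is None
   - Remove unused imports (marlin_zero_points, unpack_cols, marlin_make_empty_g_idx)
2. **diffulex/utils/loader.py** - improve performance and code structure:
   - Scan the safetensors files once to build a key_to_file index, avoiding repeated file I/O
   - Cache the result of `model.named_modules()` to avoid rebuilding the dictionary
   - Add `_find_offline_capable_module()`: unified module lookup logic
   - Add `_load_tensors_for_prefix()`: load tensors in one place, opening only the files that are needed
   - Replace print() with logger.warning()/logger.exception() to standardize logging
3. **diffulex/engine/model_runner.py** - eliminate duplicated loops:
   - Cache the list of attention modules once in `allocate_kv_cache`
   - Replace the repeated module traversal loops with `enumerate(attn_modules)`
4. **diffulex/utils/quantization/strategies/linear_int4_w4a16.py** - fix a missing implementation:
   - Add the `quantize_weight_for_kernel` method, fixing a runtime error in W4A16 online quantization
5. Delete the unused configuration file `gptq_marlin_w2_bf16kv_varlen.yml`

Testing: verified that W8A16 online quantization and GPTQ offline quantization work correctly.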
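The one-pass key_to_file index described for loader.py can be sketched as below. This is not the PR's implementation: it reads only the header of each file using the standard library, relying on the safetensors layout (an 8-byte little-endian header length followed by a JSON header whose keys are tensor names, plus an optional `__metadata__` entry). The function names are hypothetical, and the real loader may use the safetensors library instead.

```python
import json
import struct
from pathlib import Path
from typing import Dict, List


def read_safetensors_keys(path: Path) -> List[str]:
    """Return the tensor names stored in one file by parsing only its header."""
    with open(path, "rb") as handle:
        (header_len,) = struct.unpack("<Q", handle.read(8))
        header = json.loads(handle.read(header_len))
    return [key for key in header if key != "__metadata__"]


def build_key_to_file(model_dir: str) -> Dict[str, Path]:
    """Scan every *.safetensors file once and map tensor name -> file path."""
    key_to_file: Dict[str, Path] = {}
    for path in sorted(Path(model_dir).glob("*.safetensors")):
        for key in read_safetensors_keys(path):
            key_to_file[key] = path
    return key_to_file


def files_for_prefix(key_to_file: Dict[str, Path], prefix: str) -> List[Path]:
    """List only the files that hold tensors under a module prefix."""
    return sorted({path for key, path in key_to_file.items() if key.startswith(prefix)})
```

Once the index exists, a per-module load only needs `files_for_prefix(index, prefix)` and never reopens shards that contain nothing for that module.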

luozixin2 added 2 commits January 18, 2026 05:43

chore: remove benchmark_results from the repository

16d7892


luozixin2 marked this pull request as draft January 18, 2026 06:01

luozixin2 added 2 commits January 18, 2026 06:40
