feat: integrate Marlin/AllSpark INT8 W8A16 quantization strategy #25

luozixin2 · 2026-01-16T16:38:40Z

Main additions:

Marlin/AllSpark INT8 W8A16 quantization strategy integration:
- Added linear_marlin_int8_w8a16.py: implements a W8A16 quantization strategy built on vLLM's AllSpark kernel
- Added diffulex_kernel/csrc/marlin/: AllSpark CUDA kernels vendored from vLLM
  - allspark_qgemm_w8a16.cu: W8A16 fused GEMM kernel
  - allspark_repack.cu: N32K16 weight repacking kernel
  - allspark_utils.cuh: utility functions and data structures
  - torch_bindings_marlin.cpp: PyTorch C++ bindings
- Added diffulex_kernel/python/marlin_ops.py: Python interface for JIT-compiling and loading the Marlin/AllSpark kernels
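
For context, W8A16 means the weights are stored as 8-bit integers while the activations stay in 16-bit floating point. The sketch below illustrates that numeric scheme in plain Python. It is not the AllSpark kernel and not code from this PR: the function names are hypothetical, symmetric per-output-channel scales are an assumption (the PR does not state the scale granularity), and ordinary Python floats stand in for 16-bit activations.

```python
def quantize_per_channel(weight):
    """Quantize each output-channel row to int8 with its own scale.

    weight: list of rows (one row per output channel) of floats.
    Returns (int8 rows, per-row scales). Symmetric scheme: no zero point.
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # Map the largest magnitude in the row to 127; guard all-zero rows.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a16_linear(x, q_rows, scales):
    """Compute y[n] = scale[n] * sum_k x[k] * q[n][k] for one activation vector.

    A fused kernel would dequantize inside the GEMM; this reference does the
    same arithmetic one output channel at a time.
    """
    return [s * sum(a * q for a, q in zip(x, row)) for row, s in zip(q_rows, scales)]


weight = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
x = [1.0, 2.0, 3.0]

q_rows, scales = quantize_per_channel(weight)
y_quant = w8a16_linear(x, q_rows, scales)
y_ref = [sum(a * w for a, w in zip(x, row)) for row in weight]
print(y_quant, y_ref)
```

The quantized result differs from the full-precision reference only by rounding error, which is the trade-off the strategy accepts in exchange for halving weight memory traffic relative to 16-bit weights.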
Quantization strategy registry updates:
- Added support for the 'marlin' alias in registry.py (maps to marlin_int8)
- Imported the new strategy in strategies/__init__.py
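
The alias mechanism described above can be sketched as follows. This is a minimal illustration, not the contents of registry.py: the `register` and `resolve` helpers, the dictionaries, and the class name are all hypothetical. Only the 'marlin' to marlin_int8 mapping comes from the PR description.

```python
_STRATEGIES = {}                      # canonical name -> strategy class
_ALIASES = {"marlin": "marlin_int8"}  # alias -> canonical name (from the PR text)


def register(name):
    """Class decorator that records a strategy under its canonical name."""
    def decorator(cls):
        _STRATEGIES[name] = cls
        return cls
    return decorator


def resolve(name):
    """Look up a strategy by canonical name or alias, case-insensitively."""
    key = name.strip().lower()
    key = _ALIASES.get(key, key)
    if key not in _STRATEGIES:
        raise KeyError(f"unknown quantization strategy: {name!r}")
    return _STRATEGIES[key]


@register("marlin_int8")
class MarlinInt8W8A16Strategy:  # hypothetical class name
    pass


# Both spellings resolve to the same class.
print(resolve("marlin") is resolve("marlin_int8"))
```

Resolving aliases at lookup time keeps a single registry entry per strategy, so the alias cannot drift from the canonical implementation.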
Performance improvements:
- The Marlin W8A16 strategy roughly doubles prefill throughput (from 4518.92 tok/s to 9520.91 tok/s, about 2.1x)
- Decode throughput is close to the BF16 baseline (23.16 tok/s vs 23.36 tok/s)
- Can be combined with the FP8 KV cache
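
The ratios quoted above can be checked with a few lines of arithmetic. The figures are taken from the list; the variable names are only illustrative.

```python
prefill_before = 4518.92  # tok/s, before this change
prefill_after = 9520.91   # tok/s, Marlin W8A16
decode_w8a16 = 23.16      # tok/s, Marlin W8A16
decode_bf16 = 23.36       # tok/s, BF16 baseline

prefill_speedup = prefill_after / prefill_before
decode_ratio = decode_w8a16 / decode_bf16

print(f"prefill speedup: {prefill_speedup:.2f}x")
print(f"decode relative to BF16: {decode_ratio:.1%}")
```

The prefill ratio comes out slightly above 2.1, and decode lands within about one percent of the BF16 figure.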
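Other improvements:
- Optimized the implementations of several quantization strategies
- Improved KV cache management
- Enhanced the profiler
- Added several benchmark configuration files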


coderabbitai · 2026-01-16T16:38:48Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


luozixin2 merged commit 55b8b4d into SJTU-DENG-Lab:feat/kv-cache-fp8-support Jan 16, 2026
2 checks passed
