
[Plugin] [Feature] Support MLA q/k norm-quant fusion with SGLang + ATOM plugin for Deepseek#528

Open
qichu-yun wants to merge 1 commit into ROCm:main from qichu-yun:fuse_norm_quant_sgl

Conversation


@qichu-yun qichu-yun commented Apr 9, 2026

Motivation

DeepSeek MLA preprocessing in the SGLang + ATOM plugin was still doing q/k RMSNorm and q quantization in separate steps, leaving unnecessary kernel and memory overhead in a hot path. Since ATOM already provides a gated fused norm-quant implementation for DeepSeek, this PR integrates that path into the plugin so supported workloads can benefit from the fusion while unsupported cases continue to use the existing fallback path.
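For intuition, the fusion can be sketched as follows: the unfused path runs RMSNorm and quantization as two separate passes over the activations, while the fused path computes both in one step. Below is a minimal pure-Python sketch of the math only (scalar lists, simulated per-tensor scaling with an e4m3-style max of 448); the function names and the quantization granularity are illustrative assumptions, not the actual ATOM kernel API.

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    # Classic RMSNorm: scale by reciprocal RMS, then apply learned weight.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def quantize_sim(x, max_repr=448.0):
    # Simulated per-tensor quantization: scale so the max magnitude maps
    # to the largest representable value (448 for FP8 e4m3).
    amax = max(abs(v) for v in x) or 1e-8
    scale = amax / max_repr
    return [v / scale for v in x], scale

def norm_quant_unfused(x, weight):
    # Baseline: two separate passes (what the old plugin path did).
    return quantize_sim(rmsnorm(x, weight))

def norm_quant_fused(x, weight, eps=1e-6, max_repr=448.0):
    # Fused path: same math in one logical step. A real fused kernel does
    # both in a single pass over the data, saving a kernel launch and a
    # round trip through memory.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    y = [v / rms * w for v, w in zip(x, weight)]
    amax = max(abs(v) for v in y) or 1e-8
    scale = amax / max_repr
    return [v / scale for v in y], scale
```

The two functions are numerically equivalent; the win from the real fused kernel is in memory traffic and launch overhead, not in the arithmetic.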

before: (image)

after: (image)

Test Plan

launch server:

export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export SGLANG_AITER_FP8_PREFILL_ATTN=0
export SGLANG_USE_AITER=1
export ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1
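The last export is the opt-in switch for this PR's fused path. A hedged sketch of how a plugin can gate on such a flag is below; the helper and dispatch names are hypothetical, and only the environment variable itself comes from the PR.

```python
import os

def ds_qknorm_quant_fusion_enabled() -> bool:
    # Hypothetical helper: treat "1" as opt-in to the fused norm-quant
    # path; any other value (or the variable being unset) keeps the
    # existing unfused fallback.
    return os.environ.get("ATOM_ENABLE_DS_QKNORM_QUANT_FUSION", "0") == "1"

def qk_norm_quant(q, use_fused=None):
    # Hypothetical dispatch point: supported workloads take the fused
    # path, everything else falls back. Both branches are placeholders.
    if use_fused is None:
        use_fused = ds_qknorm_quant_fusion_enabled()
    return ("fused" if use_fused else "fallback"), q
```

Gating behind an environment flag keeps unsupported workloads on the proven fallback path by default.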

model_path=/shared/data/amd_int/models/DeepSeek-R1-0528

export SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models

TORCHINDUCTOR_COMPILE_THREADS=128 python3 -m sglang.launch_server \
    --model-path $model_path \
    --host localhost \
    --port 9000 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --kv-cache-dtype fp8_e4m3 \
    --mem-fraction-static 0.9 \
    --page-size 1 \
    --disable-radix-cache

client:

model_path=/shared/data/amd_int/models/DeepSeek-R1-0528-MXFP4

ISL=8000
OSL=1000
CON=4
NUM=$(( CON * 2 ))
RANGE_RATIO=1.0

PYTHONDONTWRITEBYTECODE=1 python "/home/qichu_qle/my_sgl/bench_serving/benchmark_serving.py" \
  --model=$model_path \
  --backend=sglang \
  --base-url=http://127.0.0.1:9000 \
  --dataset-name=random \
  --random-input-len="${ISL}" \
  --random-output-len="${OSL}" \
  --random-range-ratio "${RANGE_RATIO}" \
  --num-prompts="${NUM}" \
  --max-concurrency="${CON}" \
  --trust-remote-code \
  --request-rate=inf \
  --num-warmups="$(( 2 * CON ))" \
  --ignore-eos \
  --save-result \
  --percentile-metrics="ttft,tpot,itl,e2el" \
  --result-dir="./tmp/oot-benchmark-results" \
  --result-filename="${ISL}_${OSL}_${CON}.json" \
  --profile

Test Result

============ Serving Benchmark Result ============
Successful requests:                     8         
Benchmark duration (s):                  97.66     
Total input tokens:                      64000     
Total generated tokens:                  8000      
Request throughput (req/s):              0.08      
Output token throughput (tok/s):         81.92     
Total Token throughput (tok/s):          737.26    
---------------Time to First Token----------------
Mean TTFT (ms):                          1330.96   
Median TTFT (ms):                        1457.61   
P99 TTFT (ms):                           1891.41   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.20     
Median TPOT (ms):                        20.08     
P99 TPOT (ms):                           21.02     
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.20     
Median ITL (ms):                         19.68     
P99 ITL (ms):                            20.15     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          21514.63  
Median E2EL (ms):                        21514.56  
P99 E2EL (ms):                           21516.52  
==================================================

Submission Checklist

