
[ATOM-vLLM][Qwen3.5][Qwen3Next][GDN] add packed decode fast path for GDN#508

Closed
zejunchen-zejun wants to merge 2 commits into main from zejun/vLLM_19_qwen3.5_fast_path_0407


@zejunchen-zejun zejunchen-zejun commented Apr 7, 2026

Adds a packed recurrent attention Triton kernel as a fast path for GDN decode.
Qwen3-Next-80B-A3B-Instruct-FP8, TP4, 1K/8K, concurrency 4: TTPS +1.6%
Qwen3-Next-80B-A3B-Instruct-FP8, TP4, 1K/8K, concurrency 32: TTPS +1.13%
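
For readers unfamiliar with the recurrence such a kernel evaluates, here is a minimal NumPy reference sketch of one decode step of a gated delta rule. It is not the Triton kernel from this PR. It assumes "packed" means that q, k and v for each token arrive concatenated in one row, and that the log-decay gate `g` and the write strength `beta` are already computed. The function names, argument layout and the `(d_k, d_v)` state shape are illustrative assumptions.

```python
import numpy as np


def gated_delta_decode_step(state, q, k, v, g, beta):
    """One recurrent decode step for a single head (illustrative reference).

    state: (d_k, d_v) recurrent memory; q, k: (d_k,); v: (d_v,)
    g: scalar log-decay (<= 0); beta: scalar write strength in [0, 1]
    """
    state = state * np.exp(g)           # gate: decay the old memory
    v_pred = k @ state                  # what the memory currently returns for k
    delta = beta * (v - v_pred)         # delta rule: correct only the error
    state = state + np.outer(k, delta)  # rank-1 update of the memory
    out = q @ state                     # read with the query
    return out, state


def packed_decode(mixed_qkv, g, beta, states, state_indices, d_k):
    """Decode one token per request from a packed [q | k | v] row.

    mixed_qkv: (batch, 2 * d_k + d_v); states: (num_slots, d_k, d_v),
    updated in place; state_indices maps each request to its state slot.
    """
    outs = []
    for i, slot in enumerate(state_indices):
        # Assumed packing order: q first, then k, then v.
        q, k, v = np.split(mixed_qkv[i], [d_k, 2 * d_k])
        out, states[slot] = gated_delta_decode_step(
            states[slot], q, k, v, g[i], beta[i]
        )
        outs.append(out)
    return np.stack(outs)
```

A fused kernel would do the unpacking, state update and readout in a single launch instead of a Python loop. The sketch only pins down the arithmetic a fast path has to reproduce.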

Copilot AI review requested due to automatic review settings April 7, 2026 14:07
@zejunchen-zejun zejunchen-zejun marked this pull request as draft April 7, 2026 14:07

Copilot AI left a comment


Pull request overview

This PR adds a packed recurrent decode “fast path” for the GDN attention backend in the vLLM OOT plugin, and enables it for the pure non-speculative decode case (targeting Qwen3.5/Qwen3Next on vLLM 0.19.0).

Changes:

  • Introduces a non-spec decode helper (_forward_core_decode_non_spec) that uses fused_recurrent_gated_delta_rule_packed_decode.
  • Gates the packed-decode path behind vllm_envs.VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE and runtime metadata conditions (no spec masks, decode-only).
  • Updates KV cache access to use compilation_config[layer_name].kv_cache directly (removing virtual_engine indexing).
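
The gating described in the list above can be sketched as follows. This is a hedged illustration, not the PR's code: `DecodeMetadata` and its fields are hypothetical stand-ins for the backend's attention metadata, and the sketch assumes the flag is backed by an environment variable of the same name with `"1"` meaning enabled.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass
class DecodeMetadata:
    """Hypothetical stand-in for the backend's per-batch metadata."""
    num_prefills: int
    num_decodes: int
    spec_sequence_masks: Optional[Sequence[bool]] = None


def packed_decode_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    # Assumption: the flag is read from the process environment, "1" = on.
    env = os.environ if env is None else env
    return env.get("VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE", "0") == "1"


def use_packed_decode(meta: DecodeMetadata,
                      env: Optional[Mapping[str, str]] = None) -> bool:
    """Take the fast path only for pure non-speculative decode batches."""
    return (
        packed_decode_enabled(env)
        and meta.spec_sequence_masks is None  # no speculative-decode masks
        and meta.num_prefills == 0            # decode-only batch
        and meta.num_decodes > 0
    )
```

Any batch that fails one of these checks would fall back to the existing general path, so the fast path stays opt-in and cannot change results for prefill or speculative requests.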


@zejunchen-zejun zejunchen-zejun changed the title from "[plugin][OOT Qwen3.5][GDN][vLLM 0.19.0] add GDN packed decode fast path, enable the pure non-spec decode GDN fast path into the OOT plugin backend for Qwen3.5/Qwen3Next" to "[plugin][OOT Qwen3.5][GDN] add GDN packed decode fast path" Apr 7, 2026
fast path, enable the pure non-spec decode GDN fast path into
the OOT plugin backend for Qwen3.5/Qwen3Next

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
@zejunchen-zejun zejunchen-zejun force-pushed the zejun/vLLM_19_qwen3.5_fast_path_0407 branch from 799dc44 to 9583056 Compare April 13, 2026 06:24
@zejunchen-zejun zejunchen-zejun changed the title from "[plugin][OOT Qwen3.5][GDN] add GDN packed decode fast path" to "[ATOM-vLLM][Qwen3.5][GDN] add GDN packed decode fast path" Apr 13, 2026
@zejunchen-zejun zejunchen-zejun changed the title from "[ATOM-vLLM][Qwen3.5][GDN] add GDN packed decode fast path" to "[ATOM-vLLM][Qwen3.5][Qwen3Next][GDN] add packed decode fast path for GDN" Apr 13, 2026
@zejunchen-zejun
Contributor Author

No performance benefit
