【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx PR2 [cf]#7718
Conversation
Adds rank-2 `block_tables_headwise` plumbing for the c16 multi-query attention path. Updates `template_config.json` so the codegen produces explicit instantiations matching the new impl signature (adds the optional `block_table_headwise` param).
Thanks for your contribution!
CI report generated from the code below (updated every 30 minutes):

1 Task overview: no required tasks are failing, but 7 workflows are in
2 Task status summary
2.1 Required tasks: 0/0 passed
2.2 Optional tasks — 1/2 passed
3 Failure details (required only): no required tasks failed.
- gpu_model_runner: `_maybe_slice_block_tables_headwise` is now is_dummy_or_profile_run-aware so the captured CUDA graph records a non-null sidecar; identity-stride dummy seeding aligned with the kernel shape assert (dim0 == bsz * kv_num_heads).
- input_batch: `InputBatch.swap_states` + `ProposerInputBatch.swap_states` clone-then-copy swap `block_tables_headwise[i*kv_local:(i+1)*kv_local]` row groups so head-wise rows follow slot moves on both target and proposer paths.
- gpu_model_runner._process_reorder: clear `forward_batch_reqs_list` in place before repopulating from `share_inputs.index_to_batch_id`; prevents stale tail entries from leaking into logprob-settings consumers (Option A: post-hoc rebuild).
- gpu_model_runner: docstring corrected to match C16 kernel sentinel handling (multiquery_attention_c16_impl.cuh L215-223 / L605-613); the -1 sentinel reads block 0 as a harmless placeholder, and the SWA mask zeroes the contribution. No fallback to flat block_tables.
- benchmarks/yaml: add eb45-21b-a3b-32k-bf16-kv50-512s.yaml for PR2 bench geometry.

Refs: T53 PR2 PaddlePaddle#7718.
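The clone-then-copy row-group swap described in the input_batch bullet can be sketched as follows. This is an illustrative stand-in only: plain Python lists play the role of the paddle tensor, and the helper name is hypothetical, not the actual `InputBatch` code.

```python
def swap_headwise_rows(table, i1, i2, kv_local):
    """Swap the per-head row groups for batch slots i1 and i2 in a rank-2
    [bsz * kv_local, max_blocks_per_head] head-wise block table.

    Sketch of the clone-then-copy pattern: each batch slot owns a
    contiguous group of kv_local rows, and both groups are cloned before
    either target range is written.
    """
    a = slice(i1 * kv_local, (i1 + 1) * kv_local)
    b = slice(i2 * kv_local, (i2 + 1) * kv_local)
    # Clone first so the second copy does not read rows we just overwrote.
    tmp = [row[:] for row in table[a]]
    table[a], table[b] = [row[:] for row in table[b]], tmp
```

With two slots and kv_local=2, swapping slots 0 and 1 moves rows 0-1 and rows 2-3 as whole groups, which is the invariant the head-wise rows need when a request changes batch slot.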
self.input_batch is not constructed yet during _dummy_prefill_inputs and CUDA-graph capture, so reading self.input_batch.kv_num_heads_local crashed the worker before the bench server could start. Use self.model_config.kv_num_heads (set in init_share_inputs before warmup) which has the same TP-aware value.
The PR1 head-wise allocator (PaddlePaddle#7717) emits flat global block IDs in [0, num_gpu_blocks * kv_num_heads) from a single shared min-heap, but the PR2 discrete kernel (PaddlePaddle#7718) ABI L1 expects per-head local IDs in {-1} ∪ [0, num_gpu_blocks). This causes cudaIllegalAddress on any request whose allocated IDs cross the num_gpu_blocks boundary (i.e. immediately on head index ≥ ceil(num_gpu_blocks / num_blocks)).

This commit normalizes IDs at the backend boundary in append_attn_backend.py using `local = flat % num_gpu_blocks` (sentinel -1 preserved), with a fail-fast assert to catch any residual OOB. The hotfix is bench-only; the canonical fix (per-head independent allocator pools) is deferred to PR1 v5 (RFC-PR1-reanchored.md §3). Also adds an FD_T53_HEAD_WISE_SWA_RATIO ∈ [0.0, 1.0] validator.

Refs:
- .checkpoints/h10/task-53/design/PR2-HOTFIX-SPEC.md (Option B, OPUS-GATE PASS)
- .checkpoints/h10/task-53/design/CONTRACT-ORACLE.md (I2, I7)
- .checkpoints/h10/task-53/design/RFC-PR2-reanchored.md (ABI L1)

Files: 2 changed (1 backend hotfix, 1 envs validator)
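A minimal sketch of the Option B normalization described above, in plain Python over a list of IDs; the real hotfix operates on paddle tensors in append_attn_backend.py, and the function name here is illustrative:

```python
def normalize_headwise_ids(flat_ids, num_gpu_blocks):
    """Map PR1's flat global block IDs into the per-head local ID space
    ({-1} ∪ [0, num_gpu_blocks)) that the PR2 kernel ABI expects.

    Sketch of the `local = flat % num_gpu_blocks` hotfix: the -1 eviction
    sentinel passes through untouched, and a fail-fast assert catches any
    residual out-of-bounds ID instead of corrupting KV reads.
    """
    local = []
    for flat in flat_ids:
        if flat == -1:
            local.append(-1)  # evicted-SWA-slot sentinel is preserved
            continue
        lid = flat % num_gpu_blocks
        assert 0 <= lid < num_gpu_blocks, f"OOB local block id {lid}"
        local.append(lid)
    return local
```

For example, with num_gpu_blocks=8 the flat ID 10 (head 1, local block 2) maps back to 2, while -1 stays -1, which matches the {-1} ∪ [0, num_gpu_blocks) value space stated above.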
…mixed Boolean fancy indexing and .item() CPU sync inside forward_mixed crash CUDA graph capture (cudaError 900 cudaErrorStreamCaptureUnsupported). The paddle.where normalization is graph-safe (static-shape elementwise ops). Assert was debug-only; normalization alone is the actual OOB fix.
- prefix_cache_manager: replace the shared flat heap with kv_num_heads independent heaps; allocate/recycle are now per-head with the rank-2 [kv_num_heads][N] nested-list contract per RFC-PR2 §3
- gpu_model_runner: warmup base = idx * fill_blocks (not cross-head flat); rank-2 buffer shape preserved per kernel ABI
- append_attn_backend: revert the flat % num_gpu_blocks HOTFIX (silent aliasing); replace with an FD_T53_DEBUG_BLOCK_TABLES-gated assert
- tests: 4 per-head value-space invariants, no MagicMock
- .gitignore: ignore runs/ bench output dir

Closes T53-PR2-OOB-blocker (kernel ABI now matches producer).
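The per-head independent heap contract can be sketched like this; the class shape and names are illustrative, not the actual prefix_cache_manager API:

```python
import heapq


class PerHeadBlockPool:
    """Sketch of kv_num_heads independent min-heaps (the rank-2
    [kv_num_heads][N] contract per RFC-PR2 §3): each head allocates from
    its own [0, num_gpu_blocks) value space, so no cross-head flat IDs can
    leak into the kernel-visible table.
    """

    def __init__(self, kv_num_heads, num_gpu_blocks):
        self.num_gpu_blocks = num_gpu_blocks
        # One independent free-heap per KV head, not one shared flat heap.
        self.heaps = [list(range(num_gpu_blocks)) for _ in range(kv_num_heads)]
        for h in self.heaps:
            heapq.heapify(h)

    def allocate(self, head, n):
        """Pop n lowest-numbered free blocks from this head's own heap."""
        assert len(self.heaps[head]) >= n, "per-head pool exhausted"
        return [heapq.heappop(self.heaps[head]) for _ in range(n)]

    def recycle(self, head, block_ids):
        """Return blocks to the owning head's heap, enforcing the
        per-head value-space invariant the tests check."""
        for bid in block_ids:
            assert 0 <= bid < self.num_gpu_blocks
            heapq.heappush(self.heaps[head], bid)
```

Because each heap is private to its head, two heads can both hold local block 0 without aliasing, which is exactly what the reverted flat-modulo hotfix could not guarantee.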
…c [opus v2 D-revised]
….data<int>()

Adds dtype guards before `.data<int>()` reads of:

- `set_max_lengths`, `encoder_num_blocks`, `kv_num_blocks`, `decoder_num_blocks` (in AppendAttentionKernel, lines 100-105/186/187/285)
- `mask_offset.get()` (in AppendAttention L599 and AppendAttentionWithOutput L763)

Catches accidental INT64/FP dtype before UB. Matches the existing PD_CHECK style from set_flags.cu / set_mask_value.cu.
…p_size

Guards against silent under-allocation when kv_num_heads_global is not a multiple of tp_size (and >= tp_size). The kv < tp replication path is explicitly excluded from the assert, preserving existing GQA/MQA behavior.
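A sketch of the guard under the conditions stated above; the function name is hypothetical, and the real guard is an assert in resource_manager_v1.py:

```python
def kv_num_heads_local(kv_num_heads_global, tp_size):
    """Per-rank KV head count with the GQA divisibility guard described
    in the commit message. Sketch only.

    The kv < tp replication path is excluded from the check, per the
    commit; the return value on that path is an assumption here, chosen
    for illustration (heads replicated, one per rank).
    """
    if kv_num_heads_global < tp_size:
        return 1  # replication path: heads are replicated, not split
    assert kv_num_heads_global % tp_size == 0, (
        f"kv_num_heads_global={kv_num_heads_global} must divide evenly "
        f"by tp_size={tp_size} to avoid silent under-allocation"
    )
    return kv_num_heads_global // tp_size
```

Without the assert, 6 heads over 4 ranks would silently floor to 1 head per rank and under-allocate KV cache; with it, the bad configuration fails fast.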
…l-recycle accuracy

PaddlePaddle-bot review on PR PaddlePaddle#7718 noted that integer division in the available_gpu_resource property zeros out fractional values when fewer than kv_num_heads logical blocks are free, causing the metric to underreport partial recycle progress. The scheduler can then refuse admissible requests because it sees 0 capacity even though several heads' worth of blocks are actually available. Switch to float division so the metric matches the legacy [0, 1] continuous value domain and dashboards / the scheduler see true availability.

Refs: review on PR PaddlePaddle#7718 (PaddlePaddle-bot)
Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
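The integer-vs-float division difference is easy to see in isolation; this sketch uses illustrative names, not the real property:

```python
def available_gpu_resource(free_head_blocks, num_gpu_blocks, kv_num_heads):
    """Fraction of logical GPU blocks still free, in the legacy continuous
    [0, 1] value domain. Illustrative sketch of the fix described above.

    The reported bug: with integer division,
    (free_head_blocks // kv_num_heads) floors to 0 whenever fewer than
    kv_num_heads per-head blocks are free, so the scheduler sees zero
    capacity despite partial recycle progress.
    """
    if num_gpu_blocks == 0:
        return 0.0
    # Float division preserves fractional (partially recycled) logical blocks.
    return (free_head_blocks / kv_num_heads) / num_gpu_blocks
```

With 3 free head-blocks, 4 KV heads, and 10 logical blocks, float division reports 0.075 availability, whereas the integer-division form reports exactly 0 and starves admission.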
PaddlePaddle-bot review on PR PaddlePaddle#7718 asked why recycle_request_swa_head_cache short-circuits on total_tokens % block_size != 0. Document the rationale: the in-flight decode token is mid-write to the tail block, so releasing it now races with the next decode write. Recycle resumes on the next step that lands on a clean boundary. Comment-only change. No code semantics altered. Refs: review on PR PaddlePaddle#7718 (PaddlePaddle-bot) Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
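A minimal sketch of the documented boundary gate; the function name is hypothetical (the real check lives inside recycle_request_swa_head_cache):

```python
def try_recycle_swa_tail(total_tokens, block_size):
    """Return whether SWA tail-block recycle may proceed this step.

    Sketch of the short-circuit rationale above: if the token count is
    mid-block, the in-flight decode token is still being written to the
    tail block, so releasing it now would race that write.
    """
    if total_tokens % block_size != 0:
        return False  # tail block mid-write; retry on a clean boundary
    return True
```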
Acknowledged on the
Also addressed in this push:
Please let us know if either deferral blocks merge — happy to scope an inline minimal version if so.
@PaddlePaddle-bot — re: A3 (operator C++ unit tests for
Both are acknowledged and deliberately deferred out of this PR.
Both items are tracked in the task checkpoint and will be raised as separate PRs after this one merges. — bob-cloudforge
- prefix_cache_manager.available_gpu_resource: prefer the plural per-head heaps (gpu_free_head_wise_block_lists), which carry the real free blocks under FD_HEAD_WISE_KV_CACHE=1; the legacy singular list and gpu_free_block_list are kept as fallbacks for the startup window and non-head-wise callers.
- resource_manager_v1._num_swa_heads: assert -> raise ValueError for GQA divisibility (P9 validation; asserts strip under -O).

Root cause: launch_cache_manager populates the plural heaps and resets the singular to [] for compat; the property still read the singular and returned 0.0 in head-wise mode -> resource_manager_v1 throttled admissions -> queue backlog -> TTFT mean +6.2%, p95 +8.7-9.5% in SMOKE/SMOKE2. Throughput barely lifted (+3.1-6.1%) because kernel wins were gated by admission rate.
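The fallback chain the first bullet describes could look roughly like this; the plural attribute name follows the commit message, while the singular fallback name and the helper itself are hypothetical:

```python
def free_block_count(mgr):
    """Count free blocks, preferring the per-head heaps.

    Sketch of the fallback order described above: under
    FD_HEAD_WISE_KV_CACHE=1 the plural per-head heaps carry the real free
    blocks; the legacy singular list (reset to [] at launch for compat)
    and the non-head-wise gpu_free_block_list remain as fallbacks.
    `mgr` is a stand-in object; the singular attribute name is assumed.
    """
    head_wise = getattr(mgr, "gpu_free_head_wise_block_lists", None)
    if head_wise:  # preferred: sum free blocks across per-head heaps
        return sum(len(h) for h in head_wise)
    legacy = getattr(mgr, "gpu_free_head_wise_block_list", None)
    if legacy:  # startup-window fallback (assumed singular name)
        return len(legacy)
    return len(getattr(mgr, "gpu_free_block_list", []))
```

The bug described above is the inverse order: reading the singular list first returns 0 once launch_cache_manager has reset it, even while the plural heaps hold free blocks.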
Force-pushed 2d7b796 to 719f62c.
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-06 21:13:09
📋 Review Summary

PR overview: introduces head-wise discrete block_idx (block_tables_headwise) for AppendAttention; the C16 kernel supports mixing SWA and full-context heads, with matching per-head independent KV heap allocation/recycle logic and 9 cache/layer tests.

Change scope: custom_ops/gpu_ops/ (C16 CUDA kernel), fastdeploy/cache_manager/, fastdeploy/engine/sched/, fastdeploy/config.py, fastdeploy/worker/, tests/

Impact tags: [OP] [KVCache] [Scheduler] [FDConfig] [Feature]

📝 PR Convention Check

The title contains the unofficial tag [Kernel] (not in the §D1 official list) and the suffix [cf] (a submitter marker, not an official format), exceeding the single-tag format rule; all five required sections of the PR description are complete, but two completed Checklist items are left unchecked.

Suggested title (copy-paste ready):

[Feature] Optimize AppendAttention for discrete head-wise block_idx

Checklist items to update (leave the rest unchanged):

[x] Add at least a tag in the PR title (official [Feature] already present ✓)
[x] Add unit tests (9 new test files added ✓)
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh:912 | ENFORCE_GE requires the head-wise column width ≥ the uniform column width, which conflicts with the PR description's "shorter row" semantics; clarification suggested |
| 🟡 Suggestion | benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml:4 | YAML contains development/debug comments ("4 identical failures", "per opus v2 verdict", etc.) unsuitable for the production repository |
| 🟡 Suggestion | fastdeploy/config.py:2053 | T53 fields are injected into model_config via setattr; EngineArgs (CLI) is not synchronized, violating the A2 three-entry convention |
| 🟡 Suggestion | custom_ops/gpu_ops/append_attn/ | A3: the C16 kernel signature changed, but only tests/layers/ shape tests exist; tests/operators/ numerical-correctness unit tests are missing |
Overall Assessment

The PR2 structure is clear overall; the per-head independent heap allocation and the block_id=-1 sentinel + mask-zeroing safety design are sound and correct. Main suggestions: clean up the YAML debug comments, explain the EngineArgs three-entry synchronization, clarify the semantic consistency between the PR description's "shorter row" and the ENFORCE_GE constraint, and add tests/operators/ numerical-correctness tests. No blocking bugs; recommend merging after the above are addressed.
 0; got %d.", |">
```cpp
        "> 0; got %d.",
        max_blocks_per_head));
PADDLE_ENFORCE_GE(
    max_blocks_per_head,
```
❓ Question: PADDLE_ENFORCE_GE(max_blocks_per_head, max_block_num_per_seq) requires the head-wise table's physical column width to be ≥ the uniform table's column width.

This means an SWA head's head-wise row is not physically shorter, which conflicts with the PR description's "lets SWA-head CTAs walk a shorter / sparser row". PR2's actual savings come from the -1 sentinel → block 0 dummy path + mask zeroing (avoiding wrong output), not from shrinking the physical table width; the true "shorter row" optimization (SWA head column count = window_blocks + sink_blocks) appears to be PR3 scope.

Suggest adding a one-line comment here clarifying that in PR2 max_blocks_per_head must be ≥ max_block_num_per_seq as OOB protection (the kernel iterates indices by kv_len, not by swa_window), so PR3 implementers are not misled about what this constraint promises.
```yaml
# T53 bench workload — KV-bound (not slot-bound); gate: FD_HEAD_WISE_KV_CACHE=1
# max_num_seqs raised to 256 so the KV pool, not the slot count, is the bottleneck.
# kv_cache_ratio: 0.30 → ~24GB KV on A800-80GB (TINY envelope diagnostic per opus v2 verdict).
# (0.35 deterministic OOM at 78.99GB / index 3408/3689 weights load — 4 identical failures.
```
🟡 Suggestion: this line and the next contain development-process debug comments (TINY envelope diagnostic per opus v2 verdict, 4 identical failures, Revert to 0.35 before SMOKE/FULL only after opus comparability decision), which should not land in the production codebase.

Before merging, pin kv_cache_ratio to its final confirmed value, keep only the objective rationale for that value in the comment, and delete the debug-iteration history.
```python
# on a DIFFERENT FDConfig copy (worker process). This block mirrors that mutation
# in the engine-main process so the dispatcher gate is not dormant.
# Guards are identical to the worker side — idempotent if already set.
if envs.FD_T53_HEAD_WISE_SWA_FIXTURE:
```
🟡 Suggestion (A2 three-entry synchronization): the four T53 fields window_size, sink_size, window_attn_skip_freq, and head_wise_swa_ratio are injected into model_config via setattr and driven only by the env var FD_T53_HEAD_WISE_SWA_FIXTURE, but fastdeploy/engine/args_utils.py (CLI EngineArgs) has no corresponding parameters.

Per the FastDeploy A2 convention, new Config fields must be synchronized to the CLI (EngineArgs) and envs.py (the latter is done). If env-var-only is an intentional experimental design, add a comment here saying so, e.g. # Intentionally env-var-only in this experimental phase; CLI args deferred., so later reviewers do not raise the same issue again.
PR2 Body —
【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx [cf]

Motivation
Hackathon 10th Spring Task No.53 PR2 of 2. Spec: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_10th/【Hackathon_10th】开源贡献个人挑战赛春节特别季—任务合集.md#no53.
When SWA and full-attention heads coexist in one layer, the current AppendAttention path walks the same uniform `block_tables` row for every KV head. The discrete `block_tables_headwise` layout (rank-2 logical `[batch, kv_head, block]`, physical `[batch * local_kv_heads, max_blocks_per_head]`) lets SWA-head CTAs walk a shorter / sparser row while full heads preserve the existing full-context row. That reduces unnecessary block-id loads and K/V page reads under the required recycle-OFF benchmark.

The ABI is additive: callers without `block_tables_headwise` use the legacy path unchanged; callers with the head-wise table take the new kernel-visible fast path.

Modifications
Total stacked diff: 31 files, +2360/-49, grouped below. The PR2-only delta block lists what this PR adds on top of PR1.

Stacked surface (PR1 producer + PR2 kernel + shared tests)

- `custom_ops/gpu_ops/`: `append_attention.cu`, `append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_impl.cuh, multiquery_attention_c16_kernel.h, template_config.json}`, `cpp_extensions.cc`
- `fastdeploy/`: `cache_manager/prefix_cache_manager.py`, `engine/sched/resource_manager_v1.py`, `worker/{gpu_model_runner.py, input_batch.py, worker_process.py}`, `model_executor/{forward_meta.py, layers/attention/append_attn_backend.py, layers/attention/ops/append_attention.py, models/paddleformers/base.py}`, `engine/request.py`, `spec_decode/mtp.py`, `config.py`, `envs.py`
- `tests/`: `tests/cache_manager/test_{per_head_heaps, head_wise_freelist, head_wise_extend_validation, head_wise_abort_reset, head_wise_tp_consistency, swa_recycle, swa_recycle_legacy_relief, benchmark_head_wise_swa}.py`, `tests/layers/test_append_attention_head_wise_shapes.py`
- `benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml`, `.gitignore`

PR2-only delta (changes added on top of PR1 #7717)

- `custom_ops/gpu_ops/append_attention.cu`: thread `block_tables_headwise` through `AppendAttentionKernel`, `AppendAttention`, and `AppendAttentionWithOutput`; add `PD_CHECK(.dtype() == INT32)` dtype guards on every Python-supplied `.data<int>()` read (`set_max_lengths`, `encoder_num_blocks`, `kv_num_blocks`, `decoder_num_blocks`, `mask_offset`); make `block_tables_headwise` keyword-only on the Python op; add `sink_size` / `head_wise_full_hidden` parameters; thread `sink_size` into `append_attention_with_output_gpu()` (was hardcoded `0`).
- `custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh`: uniform `block_tables` row walk replaced with per-head row selection from `block_tables_headwise` when present; preserves the existing `block_id < 0 → 0` clamp at the load site (`-1` sentinel = evicted SWA slot, mask zeroes the contribution). c8/c4 variants deferred to PR3.
- `custom_ops/gpu_ops/append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_kernel.h, template_config.json}`, `custom_ops/gpu_ops/cpp_extensions.cc`: thread the `block_tables_headwise` tensor through kernel headers, template config, and the PHI op signature.
- `fastdeploy/model_executor/layers/attention/append_attn_backend.py`: add the `_get_block_tables_headwise(forward_meta)` helper (per-call read of `forward_meta`, then `forward_meta.cache_manager`, else `None`); thread the tensor as a kwarg into both `append_attention()` and `append_attention_with_output()` call sites; pass `sink_size` and `head_wise_full_hidden` to the with-output path.
- `fastdeploy/model_executor/layers/attention/ops/append_attention.py`: make `block_tables_headwise` keyword-only on both ops; guard `head_wise_full_hidden > 0` in the `use_output=True` path with `assert head_wise_full_hidden == 0` (dual-call merge stays in `append_attention()` only; with-output path deferred to PR3).
- `fastdeploy/engine/sched/resource_manager_v1.py`: add the `assert (kv_num_heads_global < tp_size) or (kv_num_heads_global % tp_size == 0)` GQA divisibility guard before `kv_num_heads_global // tp_size`.
- `tests/layers/test_append_attention_head_wise_shapes.py`

The c16 kernel is the only flavor consumed in PR2.
c8 / c4 / write-path mirrors and the graph-blacklist update are intentionally deferred to PR3. Safety in PR2 = legacy uniform `block_tables` walk + existing `block_id < 0` fallback + SWA mask zero-contribution.

Clean-room note: PR2 uses public PR #6702 only as behavior/reference context. No `Co-authored-by` trailer; prose acknowledgement only.

Usage or Command

No user-facing API change. The optimized path is active when PR1 provides `block_tables_headwise` and head-wise SWA is enabled:

Spec acceptance must be measured with timely SWA recycle OFF, comparing 1D uniform `block_idx` against 2D discrete `block_idx`.

Accuracy Tests
Spec PR2 acceptance — recycle OFF; H/B card; 1D uniform vs 2D discrete; both TTFT and TBT improve ≥5%:

`block_idx` mode

Benchmark: `FastDeploy/benchmarks/serving/benchmark_serving.py` with `benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml`.

Correctness gates before push:

- `block_tables_headwise=None` legacy path unchanged.
- `use_output=True` and `use_output=False` both consume the same head-wise table contract.
- `-1` sentinel rows skip before K/V pointer derivation.
- `tests/cache_manager/` and `tests/layers/` green locally.

CI run: https://github.com/PaddlePaddle/FastDeploy/pull/7718/checks
Depends on: #7717 (ResourceManagerV1 head-wise SWA recycle, producer for `block_tables_headwise`).

Checklist

- `pre-commit run --all-files` clean
- (`block_tables_headwise`)
- `Co-authored-by` trailer for PR2