【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx PR2 [cf]#7718
Conversation
Adds rank-2 `block_tables_headwise` plumbing for the c16 multi-query attention path. Updates `template_config.json` so the codegen produces explicit instantiations matching the new impl signature (adds the optional `block_table_headwise` param).
Thanks for your contribution!
CI report generated from the code below (updated every 30 minutes):

1 Task overview: no required tasks are failing, but 7 workflows are in
2 Task status summary
2.1 Required tasks: 0/0 passed
2.2 Optional tasks — 1/2 passed
3 Failure details (required only): no required tasks failed.
- gpu_model_runner: `_maybe_slice_block_tables_headwise` is now is_dummy_or_profile_run-aware so the captured CUDA graph records a non-null sidecar; identity-stride dummy seeding aligned with the kernel shape assert (dim0 == bsz * kv_num_heads).
- input_batch: `InputBatch.swap_states` + `ProposerInputBatch.swap_states` clone-then-copy swap `block_tables_headwise[i*kv_local:(i+1)*kv_local]` row groups so head-wise rows follow slot moves on both target and proposer paths.
- gpu_model_runner._process_reorder: clear `forward_batch_reqs_list` in place before repopulating from `share_inputs.index_to_batch_id`; prevents stale tail entries from leaking into logprob-settings consumers (Option A: post-hoc rebuild).
- gpu_model_runner: docstring corrected to match C16 kernel sentinel handling (multiquery_attention_c16_impl.cuh L215-223 / L605-613); the -1 sentinel reads block 0 as a harmless placeholder, and the SWA mask zeroes the contribution. No fallback to flat block_tables.
- benchmarks/yaml: add eb45-21b-a3b-32k-bf16-kv50-512s.yaml for PR2 bench geometry.

Refs: T53 PR2 PaddlePaddle#7718.
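The clone-then-copy row-group swap described in the input_batch bullet can be sketched as follows. This is an illustrative stand-in only: plain Python lists play the role of the paddle tensor, and the helper name is hypothetical, not the actual `InputBatch` code.

```python
def swap_headwise_rows(table, i1, i2, kv_local):
    """Swap the per-head row groups for batch slots i1 and i2 in a rank-2
    [bsz * kv_local, max_blocks_per_head] head-wise block table.

    Sketch of the clone-then-copy pattern: each batch slot owns a
    contiguous group of kv_local rows, and both groups are cloned before
    either target range is written.
    """
    a = slice(i1 * kv_local, (i1 + 1) * kv_local)
    b = slice(i2 * kv_local, (i2 + 1) * kv_local)
    # Clone first so the second copy does not read rows we just overwrote.
    tmp = [row[:] for row in table[a]]
    table[a], table[b] = [row[:] for row in table[b]], tmp
```

With two slots and kv_local=2, swapping slots 0 and 1 moves rows 0-1 and rows 2-3 as whole groups, which is the invariant the head-wise rows need when a request changes batch slot.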
self.input_batch is not constructed yet during _dummy_prefill_inputs and CUDA-graph capture, so reading self.input_batch.kv_num_heads_local crashed the worker before the bench server could start. Use self.model_config.kv_num_heads (set in init_share_inputs before warmup) which has the same TP-aware value.
The PR1 head-wise allocator (PaddlePaddle#7717) emits flat global block IDs in [0, num_gpu_blocks * kv_num_heads) from a single shared min-heap, but the PR2 discrete kernel (PaddlePaddle#7718) ABI L1 expects per-head local IDs in {-1} ∪ [0, num_gpu_blocks). This causes cudaIllegalAddress on any request whose allocated IDs cross the num_gpu_blocks boundary (i.e. immediately on head index ≥ ceil(num_gpu_blocks / num_blocks)).

This commit normalizes IDs at the backend boundary in append_attn_backend.py using `local = flat % num_gpu_blocks` (sentinel -1 preserved), with a fail-fast assert to catch any residual OOB. The hotfix is bench-only; the canonical fix (per-head independent allocator pools) is deferred to PR1 v5 (RFC-PR1-reanchored.md §3). Also adds an FD_T53_HEAD_WISE_SWA_RATIO ∈ [0.0, 1.0] validator.

Refs:
- .checkpoints/h10/task-53/design/PR2-HOTFIX-SPEC.md (Option B, OPUS-GATE PASS)
- .checkpoints/h10/task-53/design/CONTRACT-ORACLE.md (I2, I7)
- .checkpoints/h10/task-53/design/RFC-PR2-reanchored.md (ABI L1)

Files: 2 changed (1 backend hotfix, 1 envs validator)
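A minimal sketch of the Option B normalization described above, in plain Python over a list of IDs; the real hotfix operates on paddle tensors in append_attn_backend.py, and the function name here is illustrative:

```python
def normalize_headwise_ids(flat_ids, num_gpu_blocks):
    """Map PR1's flat global block IDs into the per-head local ID space
    ({-1} ∪ [0, num_gpu_blocks)) that the PR2 kernel ABI expects.

    Sketch of the `local = flat % num_gpu_blocks` hotfix: the -1 eviction
    sentinel passes through untouched, and a fail-fast assert catches any
    residual out-of-bounds ID instead of corrupting KV reads.
    """
    local = []
    for flat in flat_ids:
        if flat == -1:
            local.append(-1)  # evicted-SWA-slot sentinel is preserved
            continue
        lid = flat % num_gpu_blocks
        assert 0 <= lid < num_gpu_blocks, f"OOB local block id {lid}"
        local.append(lid)
    return local
```

For example, with num_gpu_blocks=8 the flat ID 10 (head 1, local block 2) maps back to 2, while -1 stays -1, which matches the {-1} ∪ [0, num_gpu_blocks) value space stated above.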
…mixed Boolean fancy indexing and .item() CPU sync inside forward_mixed crash CUDA graph capture (cudaError 900 cudaErrorStreamCaptureUnsupported). The paddle.where normalization is graph-safe (static-shape elementwise ops). Assert was debug-only; normalization alone is the actual OOB fix.
- prefix_cache_manager: replace the shared flat heap with kv_num_heads independent heaps; allocate/recycle are now per-head with the rank-2 [kv_num_heads][N] nested-list contract per RFC-PR2 §3
- gpu_model_runner: warmup base = idx * fill_blocks (not cross-head flat); rank-2 buffer shape preserved per kernel ABI
- append_attn_backend: revert the flat % num_gpu_blocks HOTFIX (silent aliasing); replace with an FD_T53_DEBUG_BLOCK_TABLES-gated assert
- tests: 4 per-head value-space invariants, no MagicMock
- .gitignore: ignore runs/ bench output dir

Closes T53-PR2-OOB-blocker (kernel ABI now matches producer).
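The per-head independent heap contract can be sketched like this; the class shape and names are illustrative, not the actual prefix_cache_manager API:

```python
import heapq


class PerHeadBlockPool:
    """Sketch of kv_num_heads independent min-heaps (the rank-2
    [kv_num_heads][N] contract per RFC-PR2 §3): each head allocates from
    its own [0, num_gpu_blocks) value space, so no cross-head flat IDs can
    leak into the kernel-visible table.
    """

    def __init__(self, kv_num_heads, num_gpu_blocks):
        self.num_gpu_blocks = num_gpu_blocks
        # One independent free-heap per KV head, not one shared flat heap.
        self.heaps = [list(range(num_gpu_blocks)) for _ in range(kv_num_heads)]
        for h in self.heaps:
            heapq.heapify(h)

    def allocate(self, head, n):
        """Pop n lowest-numbered free blocks from this head's own heap."""
        assert len(self.heaps[head]) >= n, "per-head pool exhausted"
        return [heapq.heappop(self.heaps[head]) for _ in range(n)]

    def recycle(self, head, block_ids):
        """Return blocks to the owning head's heap, enforcing the
        per-head value-space invariant the tests check."""
        for bid in block_ids:
            assert 0 <= bid < self.num_gpu_blocks
            heapq.heappush(self.heaps[head], bid)
```

Because each heap is private to its head, two heads can both hold local block 0 without aliasing, which is exactly what the reverted flat-modulo hotfix could not guarantee.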
…c [opus v2 D-revised]
….data<int>()

Adds dtype guards before `.data<int>()` reads of:

- `set_max_lengths`, `encoder_num_blocks`, `kv_num_blocks`, `decoder_num_blocks` (in AppendAttentionKernel, lines 100-105/186/187/285)
- `mask_offset.get()` (in AppendAttention L599 and AppendAttentionWithOutput L763)

Catches accidental INT64/FP dtype before UB. Matches the existing PD_CHECK style from set_flags.cu / set_mask_value.cu.
…p_size

Guards against silent under-allocation when kv_num_heads_global is not a multiple of tp_size (and >= tp_size). The kv < tp replication path is explicitly excluded from the assert, preserving existing GQA/MQA behavior.
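A sketch of the guard under the conditions stated above; the function name is hypothetical, and the real guard is an assert in resource_manager_v1.py:

```python
def kv_num_heads_local(kv_num_heads_global, tp_size):
    """Per-rank KV head count with the GQA divisibility guard described
    in the commit message. Sketch only.

    The kv < tp replication path is excluded from the check, per the
    commit; the return value on that path is an assumption here, chosen
    for illustration (heads replicated, one per rank).
    """
    if kv_num_heads_global < tp_size:
        return 1  # replication path: heads are replicated, not split
    assert kv_num_heads_global % tp_size == 0, (
        f"kv_num_heads_global={kv_num_heads_global} must divide evenly "
        f"by tp_size={tp_size} to avoid silent under-allocation"
    )
    return kv_num_heads_global // tp_size
```

Without the assert, 6 heads over 4 ranks would silently floor to 1 head per rank and under-allocate KV cache; with it, the bad configuration fails fast.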
…l-recycle accuracy

PaddlePaddle-bot review on PR PaddlePaddle#7718 noted that integer division in the available_gpu_resource property zeros out fractional values when fewer than kv_num_heads logical blocks are free, causing the metric to underreport partial recycle progress. The scheduler can then refuse admissible requests because it sees 0 capacity even though several heads' worth of blocks are actually available. Switch to float division so the metric matches the legacy [0, 1] continuous value domain and dashboards / the scheduler see true availability.

Refs: review on PR PaddlePaddle#7718 (PaddlePaddle-bot)
Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
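The integer-vs-float division difference is easy to see in isolation; this sketch uses illustrative names, not the real property:

```python
def available_gpu_resource(free_head_blocks, num_gpu_blocks, kv_num_heads):
    """Fraction of logical GPU blocks still free, in the legacy continuous
    [0, 1] value domain. Illustrative sketch of the fix described above.

    The reported bug: with integer division,
    (free_head_blocks // kv_num_heads) floors to 0 whenever fewer than
    kv_num_heads per-head blocks are free, so the scheduler sees zero
    capacity despite partial recycle progress.
    """
    if num_gpu_blocks == 0:
        return 0.0
    # Float division preserves fractional (partially recycled) logical blocks.
    return (free_head_blocks / kv_num_heads) / num_gpu_blocks
```

With 3 free head-blocks, 4 KV heads, and 10 logical blocks, float division reports 0.075 availability, whereas the integer-division form reports exactly 0 and starves admission.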
PaddlePaddle-bot review on PR PaddlePaddle#7718 asked why recycle_request_swa_head_cache short-circuits on total_tokens % block_size != 0. Document the rationale: the in-flight decode token is mid-write to the tail block, so releasing it now races with the next decode write. Recycle resumes on the next step that lands on a clean boundary. Comment-only change. No code semantics altered. Refs: review on PR PaddlePaddle#7718 (PaddlePaddle-bot) Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
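A minimal sketch of the documented boundary gate; the function name is hypothetical (the real check lives inside recycle_request_swa_head_cache):

```python
def try_recycle_swa_tail(total_tokens, block_size):
    """Return whether SWA tail-block recycle may proceed this step.

    Sketch of the short-circuit rationale above: if the token count is
    mid-block, the in-flight decode token is still being written to the
    tail block, so releasing it now would race that write.
    """
    if total_tokens % block_size != 0:
        return False  # tail block mid-write; retry on a clean boundary
    return True
```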
Acknowledged on the
Also addressed in this push:
Please let us know if either deferral blocks merge — happy to scope an inline minimal version if so.
@PaddlePaddle-bot — re: A3 (operator C++ unit tests for
Both are acknowledged and deliberately deferred out of this PR.
Both items are tracked in the task checkpoint and will be raised as separate PRs after this one merges. — bob-cloudforge
- prefix_cache_manager.available_gpu_resource: prefer the plural per-head heaps (gpu_free_head_wise_block_lists), which carry the real free blocks under FD_HEAD_WISE_KV_CACHE=1; the legacy singular list and gpu_free_block_list are kept as fallbacks for the startup window and non-head-wise callers.
- resource_manager_v1._num_swa_heads: assert -> raise ValueError for GQA divisibility (P9 validation; asserts strip under -O).

Root cause: launch_cache_manager populates the plural heaps and resets the singular to [] for compat; the property still read the singular and returned 0.0 in head-wise mode -> resource_manager_v1 throttled admissions -> queue backlog -> TTFT mean +6.2%, p95 +8.7-9.5% in SMOKE/SMOKE2. Throughput barely lifted (+3.1-6.1%) because kernel wins were gated by admission rate.
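The fallback chain the first bullet describes could look roughly like this; the plural attribute name follows the commit message, while the singular fallback name and the helper itself are hypothetical:

```python
def free_block_count(mgr):
    """Count free blocks, preferring the per-head heaps.

    Sketch of the fallback order described above: under
    FD_HEAD_WISE_KV_CACHE=1 the plural per-head heaps carry the real free
    blocks; the legacy singular list (reset to [] at launch for compat)
    and the non-head-wise gpu_free_block_list remain as fallbacks.
    `mgr` is a stand-in object; the singular attribute name is assumed.
    """
    head_wise = getattr(mgr, "gpu_free_head_wise_block_lists", None)
    if head_wise:  # preferred: sum free blocks across per-head heaps
        return sum(len(h) for h in head_wise)
    legacy = getattr(mgr, "gpu_free_head_wise_block_list", None)
    if legacy:  # startup-window fallback (assumed singular name)
        return len(legacy)
    return len(getattr(mgr, "gpu_free_block_list", []))
```

The bug described above is the inverse order: reading the singular list first returns 0 once launch_cache_manager has reset it, even while the plural heaps hold free blocks.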
Force-pushed 2d7b796 to 719f62c.
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-06 21:13:09
📋 Review Summary

PR overview: introduces head-wise discrete block_idx (block_tables_headwise) for AppendAttention; the C16 kernel supports mixing SWA and full-context heads, with matching per-head independent KV heap allocation/recycle logic and 9 cache/layer tests.

Change scope: custom_ops/gpu_ops/ (C16 CUDA kernel), fastdeploy/cache_manager/, fastdeploy/engine/sched/, fastdeploy/config.py, fastdeploy/worker/, tests/

Impact tags: [OP] [KVCache] [Scheduler] [FDConfig] [Feature]

📝 PR Convention Check

The title contains the unofficial tag [Kernel] (not in the §D1 official list) and the suffix [cf] (a submitter marker, not an official format), exceeding the single-tag format rule; all five required sections of the PR description are complete, but two completed Checklist items are left unchecked.

Suggested title (copy-paste ready):

[Feature] Optimize AppendAttention for discrete head-wise block_idx

Checklist items to update (leave the rest unchanged):

[x] Add at least a tag in the PR title (official [Feature] already present ✓)
[x] Add unit tests (9 new test files added ✓)
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh:912 | ENFORCE_GE requires the head-wise column width ≥ the uniform column width, which conflicts with the PR description's "shorter row" semantics; clarification suggested |
| 🟡 Suggestion | benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml:4 | YAML contains development/debug comments ("4 identical failures", "per opus v2 verdict", etc.) unsuitable for the production repository |
| 🟡 Suggestion | fastdeploy/config.py:2053 | T53 fields are injected into model_config via setattr; EngineArgs (CLI) is not synchronized, violating the A2 three-entry convention |
| 🟡 Suggestion | custom_ops/gpu_ops/append_attn/ | A3: the C16 kernel signature changed, but only tests/layers/ shape tests exist; tests/operators/ numerical-correctness unit tests are missing |
Overall Assessment

The PR2 structure is clear overall; the per-head independent heap allocation and the block_id=-1 sentinel + mask-zeroing safety design are sound and correct. Main suggestions: clean up the YAML debug comments, explain the EngineArgs three-entry synchronization, clarify the semantic consistency between the PR description's "shorter row" and the ENFORCE_GE constraint, and add tests/operators/ numerical-correctness tests. No blocking bugs; recommend merging after the above are addressed.
 0; got %d.", |">
```cpp
        "> 0; got %d.",
        max_blocks_per_head));
PADDLE_ENFORCE_GE(
    max_blocks_per_head,
```
❓ Question: PADDLE_ENFORCE_GE(max_blocks_per_head, max_block_num_per_seq) requires the head-wise table's physical column width to be ≥ the uniform table's column width.

This means an SWA head's head-wise row is not physically shorter, which conflicts with the PR description's "lets SWA-head CTAs walk a shorter / sparser row". PR2's actual savings come from the -1 sentinel → block 0 dummy path + mask zeroing (avoiding wrong output), not from shrinking the physical table width; the true "shorter row" optimization (SWA head column count = window_blocks + sink_blocks) appears to be PR3 scope.

Suggest adding a one-line comment here clarifying that in PR2 max_blocks_per_head must be ≥ max_block_num_per_seq as OOB protection (the kernel iterates indices by kv_len, not by swa_window), so PR3 implementers are not misled about what this constraint promises.
```yaml
# T53 bench workload — KV-bound (not slot-bound); gate: FD_HEAD_WISE_KV_CACHE=1
# max_num_seqs raised to 256 so the KV pool, not the slot count, is the bottleneck.
# kv_cache_ratio: 0.30 → ~24GB KV on A800-80GB (TINY envelope diagnostic per opus v2 verdict).
# (0.35 deterministic OOM at 78.99GB / index 3408/3689 weights load — 4 identical failures.
```
🟡 Suggestion: this line and the next contain development-process debug comments (TINY envelope diagnostic per opus v2 verdict, 4 identical failures, Revert to 0.35 before SMOKE/FULL only after opus comparability decision), which should not land in the production codebase.

Before merging, pin kv_cache_ratio to its final confirmed value, keep only the objective rationale for that value in the comment, and delete the debug-iteration history.
```python
# on a DIFFERENT FDConfig copy (worker process). This block mirrors that mutation
# in the engine-main process so the dispatcher gate is not dormant.
# Guards are identical to the worker side — idempotent if already set.
if envs.FD_T53_HEAD_WISE_SWA_FIXTURE:
```
🟡 Suggestion (A2 three-entry synchronization): the four T53 fields window_size, sink_size, window_attn_skip_freq, and head_wise_swa_ratio are injected into model_config via setattr and driven only by the env var FD_T53_HEAD_WISE_SWA_FIXTURE, but fastdeploy/engine/args_utils.py (CLI EngineArgs) has no corresponding parameters.

Per the FastDeploy A2 convention, new Config fields must be synchronized to the CLI (EngineArgs) and envs.py (the latter is done). If env-var-only is an intentional experimental design, add a comment here saying so, e.g. # Intentionally env-var-only in this experimental phase; CLI args deferred., so later reviewers do not raise the same issue again.
PR2 Body —
【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx [cf]

Motivation
Hackathon 10th Spring Task No.53 PR2 of 2. Spec: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_10th/【Hackathon_10th】开源贡献个人挑战赛春节特别季—任务合集.md#no53.
When SWA and full-attention heads coexist in one layer, the current AppendAttention path walks the same uniform `block_tables` row for every KV head. The discrete `block_tables_headwise` layout (rank-2 logical `[batch, kv_head, block]`, physical `[batch * local_kv_heads, max_blocks_per_head]`) lets SWA-head CTAs walk a shorter / sparser row while full heads preserve the existing full-context row. That reduces unnecessary block-id loads and K/V page reads under the required recycle-OFF benchmark.

The ABI is additive: callers without `block_tables_headwise` use the legacy path unchanged; callers with the head-wise table take the new kernel-visible fast path.

Modifications
Total stacked diff: 31 files, +2360/-49, grouped below. The PR2-only delta block lists what this PR adds on top of PR1.

Stacked surface (PR1 producer + PR2 kernel + shared tests)

- `custom_ops/gpu_ops/`: `append_attention.cu`, `append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_impl.cuh, multiquery_attention_c16_kernel.h, template_config.json}`, `cpp_extensions.cc`
- `fastdeploy/`: `cache_manager/prefix_cache_manager.py`, `engine/sched/resource_manager_v1.py`, `worker/{gpu_model_runner.py, input_batch.py, worker_process.py}`, `model_executor/{forward_meta.py, layers/attention/append_attn_backend.py, layers/attention/ops/append_attention.py, models/paddleformers/base.py}`, `engine/request.py`, `spec_decode/mtp.py`, `config.py`, `envs.py`
- `tests/`: `tests/cache_manager/test_{per_head_heaps, head_wise_freelist, head_wise_extend_validation, head_wise_abort_reset, head_wise_tp_consistency, swa_recycle, swa_recycle_legacy_relief, benchmark_head_wise_swa}.py`, `tests/layers/test_append_attention_head_wise_shapes.py`
- `benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml`, `.gitignore`

PR2-only delta (changes added on top of PR1 #7717)

- `custom_ops/gpu_ops/append_attention.cu`: thread `block_tables_headwise` through `AppendAttentionKernel`, `AppendAttention`, and `AppendAttentionWithOutput`; add `PD_CHECK(.dtype() == INT32)` dtype guards on every Python-supplied `.data<int>()` read (`set_max_lengths`, `encoder_num_blocks`, `kv_num_blocks`, `decoder_num_blocks`, `mask_offset`); make `block_tables_headwise` keyword-only on the Python op; add `sink_size` / `head_wise_full_hidden` parameters; thread `sink_size` into `append_attention_with_output_gpu()` (was hardcoded `0`).
- `custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh`: uniform `block_tables` row walk replaced with per-head row selection from `block_tables_headwise` when present; preserves the existing `block_id < 0 → 0` clamp at the load site (`-1` sentinel = evicted SWA slot, mask zeroes the contribution). c8/c4 variants deferred to PR3.
- `custom_ops/gpu_ops/append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_kernel.h, template_config.json}`, `custom_ops/gpu_ops/cpp_extensions.cc`: thread the `block_tables_headwise` tensor through kernel headers, template config, and the PHI op signature.
- `fastdeploy/model_executor/layers/attention/append_attn_backend.py`: add the `_get_block_tables_headwise(forward_meta)` helper (per-call read of `forward_meta`, then `forward_meta.cache_manager`, else `None`); thread the tensor as a kwarg into both `append_attention()` and `append_attention_with_output()` call sites; pass `sink_size` and `head_wise_full_hidden` to the with-output path.
- `fastdeploy/model_executor/layers/attention/ops/append_attention.py`: make `block_tables_headwise` keyword-only on both ops; guard `head_wise_full_hidden > 0` in the `use_output=True` path with `assert head_wise_full_hidden == 0` (dual-call merge stays in `append_attention()` only; with-output path deferred to PR3).
- `fastdeploy/engine/sched/resource_manager_v1.py`: add the `assert (kv_num_heads_global < tp_size) or (kv_num_heads_global % tp_size == 0)` GQA divisibility guard before `kv_num_heads_global // tp_size`.
- `tests/layers/test_append_attention_head_wise_shapes.py`

The c16 kernel is the only flavor consumed in PR2.
c8 / c4 / write-path mirrors and the graph-blacklist update are intentionally deferred to PR3. Safety in PR2 = legacy uniform `block_tables` walk + existing `block_id < 0` fallback + SWA mask zero-contribution.

Clean-room note: PR2 uses public PR #6702 only as behavior/reference context. No `Co-authored-by` trailer; prose acknowledgement only.

Usage or Command

No user-facing API change. The optimized path is active when PR1 provides `block_tables_headwise` and head-wise SWA is enabled:

Spec acceptance must be measured with timely SWA recycle OFF, comparing 1D uniform `block_idx` against 2D discrete `block_idx`.

Accuracy Tests
Spec PR2 acceptance — recycle OFF; H/B card; 1D uniform vs 2D discrete; both TTFT and TBT improve ≥5%:

`block_idx` mode

Benchmark: `FastDeploy/benchmarks/serving/benchmark_serving.py` with `benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml`.

Correctness gates before push:

- `block_tables_headwise=None` legacy path unchanged.
- `use_output=True` and `use_output=False` both consume the same head-wise table contract.
- `-1` sentinel rows skip before K/V pointer derivation.
- `tests/cache_manager/` and `tests/layers/` green locally.

CI run: https://github.com/PaddlePaddle/FastDeploy/pull/7718/checks
Depends on: #7717 (ResourceManagerV1 head-wise SWA recycle, producer for `block_tables_headwise`).

Checklist

- `pre-commit run --all-files` clean
- (`block_tables_headwise`)
- `Co-authored-by` trailer for PR2