Skip to content

【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx PR2 [cf]#7718

Open
bob-cloudforge wants to merge 13 commits intoPaddlePaddle:developfrom
CloudForge-Solutions:task/h10-053-pr2-discrete-block-idx-v4
Open

【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx PR2 [cf]#7718
bob-cloudforge wants to merge 13 commits intoPaddlePaddle:developfrom
CloudForge-Solutions:task/h10-053-pr2-discrete-block-idx-v4

Conversation

@bob-cloudforge
Copy link
Copy Markdown

@bob-cloudforge bob-cloudforge commented May 4, 2026

PR2 Body — 【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx [cf]

Companion PR stacked on PR1 (#7717). Until PR1 lands on develop, this branch carries the PR1 producer commits as its base, so the GitHub diff against develop is the stacked PR1 + PR2 surface (31 files, +2360/-49). The PR2-only delta is summarized below.


Motivation

Hackathon 10th Spring Task No.53 PR2 of 2. Spec: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_10th/【Hackathon_10th】开源贡献个人挑战赛春节特别季—任务合集.md#no53.

When SWA and full-attention heads coexist in one layer, the current AppendAttention path walks the same uniform block_tables row for every KV head. The discrete block_tables_headwise layout (rank-2 logical [batch, kv_head, block], physical [batch * local_kv_heads, max_blocks_per_head]) lets SWA-head CTAs walk a shorter / sparser row while full heads preserve the existing full-context row. That reduces unnecessary block-id loads and K/V page reads under the required recycle OFF benchmark.

The ABI is additive: callers without block_tables_headwise use the legacy path unchanged; callers with the head-wise table take the new kernel-visible fast path.

Modifications

Total stacked diff: 31 files, +2360/-49, grouped below. The PR2-only delta block lists what this PR adds on top of PR1.

Stacked surface (PR1 producer + PR2 kernel + shared tests)

Area Files +/− Purpose
Kernel (custom_ops/gpu_ops/) 7 +136 / −19 append_attention.cu, append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_impl.cuh, multiquery_attention_c16_kernel.h, template_config.json}, cpp_extensions.cc
Runtime (fastdeploy/) 13 +849 / −30 cache_manager/prefix_cache_manager.py, engine/sched/resource_manager_v1.py, worker/{gpu_model_runner.py, input_batch.py, worker_process.py}, model_executor/{forward_meta.py, layers/attention/append_attn_backend.py, layers/attention/ops/append_attention.py, models/paddleformers/base.py}, engine/request.py, spec_decode/mtp.py, config.py, envs.py
Tests (tests/) 9 +1360 / 0 tests/cache_manager/test_{per_head_heaps, head_wise_freelist, head_wise_extend_validation, head_wise_abort_reset, head_wise_tp_consistency, swa_recycle, swa_recycle_legacy_relief, benchmark_head_wise_swa}.py, tests/layers/test_append_attention_head_wise_shapes.py
Bench / config 2 +15 / 0 benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml, .gitignore

PR2-only delta (changes added on top of PR1 #7717)

File Change
custom_ops/gpu_ops/append_attention.cu Thread block_tables_headwise through AppendAttentionKernel, AppendAttention, and AppendAttentionWithOutput; add PD_CHECK(.dtype() == INT32) dtype guards on every Python-supplied .data<int>() read (set_max_lengths, encoder_num_blocks, kv_num_blocks, decoder_num_blocks, mask_offset); make block_tables_headwise keyword-only on the Python op; add sink_size / head_wise_full_hidden parameters; thread sink_size into append_attention_with_output_gpu() (was hardcoded 0).
custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh c16 kernel point-of-use: replace uniform block_tables row walk with per-head row selection from block_tables_headwise when present; preserve existing block_id < 0 → 0 clamp at the load site (-1 sentinel = evicted SWA slot, mask zeroes contribution). c8/c4 variants deferred to PR3.
custom_ops/gpu_ops/append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_kernel.h, template_config.json}, custom_ops/gpu_ops/cpp_extensions.cc Thread the optional block_tables_headwise tensor through kernel headers, template config, and the PHI op signature.
fastdeploy/model_executor/layers/attention/append_attn_backend.py Add _get_block_tables_headwise(forward_meta) helper (per-call read of forward_meta, then forward_meta.cache_manager, else None); thread the tensor as a kwarg into both append_attention() and append_attention_with_output() call sites; pass sink_size and head_wise_full_hidden to the with-output path.
fastdeploy/model_executor/layers/attention/ops/append_attention.py Make block_tables_headwise keyword-only on both ops; guard head_wise_full_hidden > 0 in the use_output=True path with assert head_wise_full_hidden == 0 (dual-call merge stays in append_attention() only; with-output path deferred to PR3).
fastdeploy/engine/sched/resource_manager_v1.py Add assert (kv_num_heads_global < tp_size) or (kv_num_heads_global % tp_size == 0) GQA divisibility guard before kv_num_heads_global // tp_size.
tests/layers/test_append_attention_head_wise_shapes.py Shape-level smoke test for the kernel-visible head-wise contract (additive on top of PR1's allocator tests).

The c16 kernel is the only flavor consumed in PR2. c8 / c4 / write-path mirrors and the graph-blacklist update are intentionally deferred to PR3. Safety in PR2 = legacy uniform block_tables walk + existing block_id<0 fallback + SWA mask zero-contribution.

Clean-room note: PR2 uses public PR #6702 only as behavior/reference context. No Co-authored-by trailer; prose acknowledgement only.

Usage or Command

No user-facing API change. The optimized path is active when PR1 provides block_tables_headwise and head-wise SWA is enabled:

export FD_HEAD_WISE_KV_CACHE=1
export FD_T53_HEAD_WISE_SWA_RATIO=0.5    # leading half of KV heads designated SWA

Spec acceptance must be measured with timely SWA recycle OFF, comparing 1D uniform block_idx against 2D discrete block_idx.

Accuracy Tests

Spec PR2 acceptance — recycle OFF; H/B card; 1D uniform vs 2D discrete; both TTFT and TBT improve ≥5%:

block_idx mode Hardware TTFT (ms) TBT (ms) Δ TTFT Δ TBT
1D (uniform) H100 / H20 / B200 TBD TBD baseline baseline
2D (discrete, optimized) same TBD TBD +TBD% ≥5 ✓ +TBD% ≥5 ✓

Benchmark: FastDeploy/benchmarks/serving/benchmark_serving.py with benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml.

Hardware request to reviewers (cc @luotao1): PR2 acceptance requires H/B card per spec. A800 numbers (when present) are preview-only and labelled as such; FULL bench run is one-time pre-merge.

Correctness gates before push:

  • block_tables_headwise=None legacy path unchanged.
  • use_output=True and use_output=False both consume the same head-wise table contract.
  • 1D vs 2D numeric parity for FP16/BF16/cache-quant variants; -1 sentinel rows skip before K/V pointer derivation.
  • GSM8K parity within ±0.1 pp.
  • All 9 head-wise tests under tests/cache_manager/ and tests/layers/ green locally.

CI run: https://github.com/PaddlePaddle/FastDeploy/pull/7718/checks

Depends on: #7717 (ResourceManagerV1 head-wise SWA recycle, producer for block_tables_headwise).

Checklist

Adds rank-2 block_tables_headwise plumbing for c16 multi-query attention path.

Updates template_config.json so the codegen produces explicit instantiations matching the new impl signature (added optional block_table_headwise param).
@paddle-bot
Copy link
Copy Markdown

paddle-bot Bot commented May 4, 2026

Thanks for your contribution!

@paddle-bot paddle-bot Bot added the contributor External developers label May 4, 2026
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 4, 2026

CLA assistant check
All committers have signed the CLA.

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot
Copy link
Copy Markdown

PaddlePaddle-bot commented May 4, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-06 21:36:26

CI报告基于以下代码生成(30分钟更新一次):


1 任务总览

当前无 required 失败任务,但有 7 个 Workflow 处于 action_required 状态,需人工审批后方可执行。

总执行(rerun次数) 总任务 ✅ 通过 ❌ 失败 ⏳ 运行中 ⏸️ 等待中 跳过
2(0) 2 1 0 1 0 0

⚠️ 注意:以下 7 个 Workflow 处于 action_required 状态(等待审批后才会执行):CI_XPU、ILUVATAR-CI、Approval、Codestyle-Check、Check PR Template、CI_HPU、PR Build and Test。这些 Workflow 需人工审批触发。


2 任务状态汇总

2.1 Required任务 : 0/0 通过

当前未配置 Required 任务(Branch Protection Rules 中无 Required CI),无阻塞合并的必选任务。

2.2 可选任务 — 1/2 通过

可选任务不阻塞合并,失败仅供参考。

状态 任务 耗时 日志 重跑
Trigger Jenkins for PR - Job -
其余 1 个可选任务通过 - - -

3 失败详情(仅 required)

无 required 失败任务。

@bob-cloudforge bob-cloudforge changed the title feat(append_attn): head-wise SWA recycle + discrete-block-idx ABI 【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx [cf] May 4, 2026
- gpu_model_runner: _maybe_slice_block_tables_headwise now is_dummy_or_profile_run-aware so captured CUDA graph records non-null sidecar; identity-stride dummy seeding aligned with kernel shape assert (dim0 == bsz * kv_num_heads).
- input_batch: InputBatch.swap_states + ProposerInputBatch.swap_states clone-then-copy swap block_tables_headwise[i*kv_local:(i+1)*kv_local] row groups so head-wise rows follow slot moves on both target and proposer paths.
- gpu_model_runner._process_reorder: in-place clear forward_batch_reqs_list before repopulating from share_inputs.index_to_batch_id; prevents stale tail entries from leaking into logprob-settings consumers (Option A: post-hoc rebuild).
- gpu_model_runner: docstring corrected to match C16 kernel sentinel handling (multiquery_attention_c16_impl.cuh L215-223 / L605-613); -1 sentinel reads block 0 as harmless placeholder, SWA mask zeroes the contribution. No fallback to flat block_tables.
- benchmarks/yaml: add eb45-21b-a3b-32k-bf16-kv50-512s.yaml for PR2 bench geometry.

Refs: T53 PR2 PaddlePaddle#7718.
self.input_batch is not constructed yet during _dummy_prefill_inputs
and CUDA-graph capture, so reading self.input_batch.kv_num_heads_local
crashed the worker before the bench server could start. Use
self.model_config.kv_num_heads (set in init_share_inputs before warmup)
which has the same TP-aware value.
The PR1 head-wise allocator (PaddlePaddle#7717) emits flat global block IDs in
[0, num_gpu_blocks * kv_num_heads) from a single shared min-heap, but
the PR2 discrete kernel (PaddlePaddle#7718) ABI L1 expects per-head local IDs in
{-1} ∪ [0, num_gpu_blocks). This causes cudaIllegalAddress on any
request whose allocated IDs cross the num_gpu_blocks boundary
(i.e. immediately on head index ≥ ceil(num_gpu_blocks / num_blocks)).

This commit normalizes IDs at the backend boundary in append_attn_backend.py
using `local = flat % num_gpu_blocks` (sentinel -1 preserved), with a
fail-fast assert to catch any residual OOB. The hotfix is bench-only;
the canonical fix (per-head independent allocator pools) is deferred to
PR1 v5 (RFC-PR1-reanchored.md §3).

Also adds FD_T53_HEAD_WISE_SWA_RATIO ∈ [0.0, 1.0] validator.

Refs: .checkpoints/h10/task-53/design/PR2-HOTFIX-SPEC.md (Option B, OPUS-GATE PASS)
     .checkpoints/h10/task-53/design/CONTRACT-ORACLE.md (I2, I7)
     .checkpoints/h10/task-53/design/RFC-PR2-reanchored.md (ABI L1)

Files: 2 changed (1 backend hotfix, 1 envs validator)
…mixed

Boolean fancy indexing and .item() CPU sync inside forward_mixed
crash CUDA graph capture (cudaError 900 cudaErrorStreamCaptureUnsupported).
The paddle.where normalization is graph-safe (static-shape elementwise ops).
Assert was debug-only; normalization alone is the actual OOB fix.
- prefix_cache_manager: replace shared flat heap with kv_num_heads
  independent heaps; allocate/recycle now per-head with rank-2
  [kv_num_heads][N] nested-list contract per RFC-PR2 §3
- gpu_model_runner: warmup base = idx * fill_blocks (not cross-head
  flat); rank-2 buffer shape preserved per kernel ABI
- append_attn_backend: revert flat % num_gpu_blocks HOTFIX (silent
  aliasing); replace with FD_T53_DEBUG_BLOCK_TABLES gated assert
- tests: 4 per-head value-space invariants, no MagicMock
- .gitignore: ignore runs/ bench output dir

Closes T53-PR2-OOB-blocker (kernel ABI now matches producer).
….data<int>()

Adds dtype guards before .data<int>() reads of:
- set_max_lengths, encoder_num_blocks, kv_num_blocks, decoder_num_blocks
  (in AppendAttentionKernel, lines 100-105/186/187/285)
- mask_offset.get() (in AppendAttention L599 and AppendAttentionWithOutput L763)

Catches accidental INT64/FP dtype before UB. Matches existing PD_CHECK style
from set_flags.cu / set_mask_value.cu.
…p_size

Guards against silent under-allocation when kv_num_heads_global is not a
multiple of tp_size (and >= tp_size). The kv<tp replication path is
explicitly excluded from the assert, preserving existing GQA/MQA behavior.
PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

…l-recycle accuracy

PaddlePaddle-bot review on PR PaddlePaddle#7718 noted that integer division in the
available_gpu_resource property zeros out fractional values when fewer
than kv_num_heads logical blocks are free, causing the metric to
underreport partial recycle progress. The scheduler can then refuse
admissible requests because it sees 0 capacity even though several
heads' worth of blocks are actually available.

Switch to float division so the metric matches the legacy [0, 1]
continuous value-domain and dashboards / scheduler see true availability.

Refs: review on PR PaddlePaddle#7718 (PaddlePaddle-bot)
Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
PaddlePaddle-bot review on PR PaddlePaddle#7718 asked why
recycle_request_swa_head_cache short-circuits on
total_tokens % block_size != 0. Document the rationale: the in-flight
decode token is mid-write to the tail block, so releasing it now races
with the next decode write. Recycle resumes on the next step that lands
on a clean boundary.

Comment-only change. No code semantics altered.

Refs: review on PR PaddlePaddle#7718 (PaddlePaddle-bot)
Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
PaddlePaddle-bot

This comment was marked as outdated.

@bob-cloudforge
Copy link
Copy Markdown
Author

Acknowledged on the tests/operators/ kernel-level CTest harness (A3) and the multi-hardware (xpu/dcu/gcu/hpu) model_runner sync (A6). Both are deferred to follow-up PRs:

  • A3 (kernel CTest): the head-wise SWA recycle kernel currently has Python-level integration coverage via tests/cache_manager/. The kernel-level CTest (mirroring tests/operators/test_append_attn_*) will land in a follow-up PR once the kernel signature stabilizes after PR1+PR2 integration soak.
  • A6 (multi-hardware sync): the PR2 changes to resource_manager_v1 and prefix_cache_manager are CUDA-path-only by design. Mirroring to xpu/dcu/gcu/hpu model_runner classes will land as a separate, hardware-vendor-coordinated PR after the CUDA path passes Baidu's internal soak. This avoids landing untested device-specific code.

Also addressed in this push:

  • available_gpu_resource:198 float division — commit 327a43b500.
  • total_tokens % block_size boundary guard — comment-only commit 3a592ac7e2.

Please let us know if either deferral blocks merge — happy to scope an inline minimal version if so.

@bob-cloudforge
Copy link
Copy Markdown
Author

@PaddlePaddle-bot — re: A3 (operator C++ unit tests for append_attention.cu) and A6 (multi-hardware sync to xpu/dcu/gcu/hpu model_runner).

Both are acknowledged and deliberately deferred out of this PR:

  • A3 (C++ kernel unit tests) — The discrete head-wise block_idx ABI is exercised end-to-end by the PR2 acceptance bench (TINY → SMOKE → FULL on A800 SM80) which compares OFF vs ON metrics with kv_cache_ratio envelope checks. We will add focused C++ ctest cases for the ABI contract (sentinel -1, head-wise vs flat) in a follow-up PR alongside the FD-level Python integration tests once the bench numbers ship and the ABI is stable. Adding them in-PR would block the kernel review on test-infra plumbing that is unrelated to the kernel correctness change.
  • A6 (xpu/dcu/gcu/hpu model_runner) — The discrete-block-idx ABI is GPU-only in this PR (CUDA SM80+, A800 validated). Other backends do not implement the per-head SWA recycle path yet, so propagating the ABI signature without a working scheduler on those backends would create dead code paths and false API promises. Cross-hardware enablement will land per-backend in dedicated PRs once the GPU path merges and acceptance numbers prove the ABI is final.

Both items are tracked in the task checkpoint and will be raised as separate PRs after this one merges.

— bob-cloudforge

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

- prefix_cache_manager.available_gpu_resource: prefer plural per-head
  heaps (gpu_free_head_wise_block_lists) which carry the real free
  blocks under FD_HEAD_WISE_KV_CACHE=1; legacy singular and
  gpu_free_block_list kept as fallbacks for startup window and
  non-head-wise callers.
- resource_manager_v1._num_swa_heads: assert -> raise ValueError for
  GQA divisibility (P9 validation, asserts strip under -O).

Root cause: launch_cache_manager populates plural heaps and resets
singular to [] for compat; the property still read singular and
returned 0.0 in head-wise mode -> resource_manager_v1 throttled
admissions -> queue backlog -> TTFT mean +6.2%, p95 +8.7-9.5% in
SMOKE/SMOKE2. Throughput barely lifted (+3.1-6.1%) because kernel
wins were gated by admission rate.
@bob-cloudforge bob-cloudforge force-pushed the task/h10-053-pr2-discrete-block-idx-v4 branch from 2d7b796 to 719f62c Compare May 6, 2026 12:57
Copy link
Copy Markdown

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Paddle-CI-Agent | pr_review | 2026-05-06 21:13:09

📋 Review 摘要

PR 概述:为 AppendAttention 引入 head-wise 离散 block_idx(block_tables_headwise),C16 kernel 支持 SWA 与 Full-Context 头混用,配套 per-head KV 独立堆分配/回收逻辑和 9 个 Cache/Layer 测试。
变更范围custom_ops/gpu_ops/(C16 CUDA kernel)、fastdeploy/cache_manager/fastdeploy/engine/sched/fastdeploy/config.pyfastdeploy/worker/tests/
影响面 Tag[OP] [KVCache] [Scheduler] [FDConfig] [Feature]

📝 PR 规范检查

标题包含非官方 Tag [Kernel](不在 §D1 官方列表中)和后缀 [cf](提交者标记,非官方格式),超出单 Tag 格式规范;PR 描述五个必填节均完整,但 Checklist 中已完成的两项未勾选。

标题建议(可直接复制):

  • [Feature] Optimize AppendAttention for discrete head-wise block_idx

Checklist 需更新(其余项维持原状):

  • [x] Add at least a tag in the PR title(已有官方 [Feature] ✓)
  • [x] Add unit tests(已新增 9 个测试文件 ✓)

问题

级别 文件 概述
🟡 建议 custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh:912 ENFORCE_GE 要求 head-wise 列宽 ≥ uniform 列宽,与 PR 描述 "shorter row" 语义不符,建议澄清
🟡 建议 benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml:4 YAML 含开发调试注释("4 identical failures"、"per opus v2 verdict" 等),不适合合入生产仓库
🟡 建议 fastdeploy/config.py:2053 T53 字段通过 setattr 注入 model_configEngineArgs(CLI)未同步,不符合 A2 三入口规范
🟡 建议 custom_ops/gpu_ops/append_attn/ A3:C16 kernel 签名变更,仅有 tests/layers/ 形态测试,缺少 tests/operators/ 数值正确性单测

总体评价

PR2 整体结构清晰,per-head 独立堆分配、block_id=-1 sentinel + mask 置零的安全设计合理正确。主要建议:YAML 调试注释需清理,EngineArgs 三入口同步需说明,PR 描述中"shorter row"与 ENFORCE_GE 约束的语义一致性需澄清,以及补充 tests/operators/ 数值正确性测试。无阻塞性 Bug,建议解决上述问题后合入。

"> 0; got %d.",
max_blocks_per_head));
PADDLE_ENFORCE_GE(
max_blocks_per_head,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ 疑问 PADDLE_ENFORCE_GE(max_blocks_per_head, max_block_num_per_seq) 要求 head-wise 表的物理列宽 ≥ uniform 表列宽。

这意味着 SWA head 的 head-wise 行在物理上并非更短,与 PR 描述中 "lets SWA-head CTAs walk a shorter / sparser row" 表述有出入。PR2 的实际节省来自 -1 sentinel → block 0 dummy 路径 + mask 置零(避免错误输出),而非缩短物理表宽度;真正的 "shorter row" 优化(SWA head 列数 = window_blocks + sink_blocks)似乎是 PR3 范围。

建议在此处补一行注释澄清:PR2 中 max_blocks_per_head 必须 ≥ max_block_num_per_seq 是 OOB 防护(kernel 按 kv_len 迭代索引,非按 swa_window 迭代),避免误导 PR3 实现者对此约束的期望。

# T53 bench workload — KV-bound (not slot-bound); gate: FD_HEAD_WISE_KV_CACHE=1
# max_num_seqs raised to 256 so the KV pool, not the slot count, is the bottleneck.
# kv_cache_ratio: 0.30 → ~24GB KV on A800-80GB (TINY envelope diagnostic per opus v2 verdict).
# (0.35 deterministic OOM at 78.99GB / index 3408/3689 weights load — 4 identical failures.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 此行及下一行包含开发过程调试注释(TINY envelope diagnostic per opus v2 verdict4 identical failuresRevert to 0.35 before SMOKE/FULL only after opus comparability decision),不适合进入生产代码库。

建议 merge 前将 kv_cache_ratio 固定为最终确认值,注释仅保留选取该值的客观理由,删除调试迭代过程记录。

Comment thread fastdeploy/config.py
# on a DIFFERENT FDConfig copy (worker process). This block mirrors that mutation
# in the engine-main process so the dispatcher gate is not dormant.
# Guards are identical to the worker side — idempotent if already set.
if envs.FD_T53_HEAD_WISE_SWA_FIXTURE:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议(A2 三入口同步) window_sizesink_sizewindow_attn_skip_freqhead_wise_swa_ratio 四个 T53 字段通过 setattr 注入 model_config,仅由 env var FD_T53_HEAD_WISE_SWA_FIXTURE 驱动,但 fastdeploy/engine/args_utils.py(CLI EngineArgs)未同步对应参数。

按 FastDeploy A2 惯例,Config 字段新增须同步 CLI(EngineArgs)和 envs.py(已完成)。若这是有意为之的 env-var-only 实验性设计,建议在此处补注释说明,例如 # Intentionally env-var-only in this experimental phase; CLI args deferred.,避免后续 reviewer 重复提同一问题。

@bob-cloudforge bob-cloudforge changed the title 【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx [cf] [Feature][KVCache] Support head-wise SWA cache recycle in ResourceManagerV1 May 6, 2026
@bob-cloudforge bob-cloudforge changed the title [Feature][KVCache] Support head-wise SWA cache recycle in ResourceManagerV1 【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx PR2 [cf] May 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor External developers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants