Qwen3.6-27B DFlash speculative decoding on Thor: 48.9 tok/s on structured prompts (+45% vs MTP)#130
Qwen3.6-27B DFlash speculative decoding on Thor: 48.9 tok/s on structured prompts (+45% vs MTP)#130LiangSu8899 wants to merge 10 commits into
Conversation
The shared frontend added an optional max_new_tokens parameter to _long_tq_effective_k and its call sites pass it positionally; the Thor override kept the two-argument form, so every long-context generate on Thor raised TypeError. Accept and forward the new parameter; the Thor K-cap behaviour is unchanged.
The DFlash generate loop hard-coded the BF16-staged verify forward and the per-position prompt walk, both of which are RTX-shaped: the Thor K-row layer path at S=16 is single-XQA over the persistent FP8 KV cache, so the verify needs the FP8-KV mode flag active and the prompt rows must be present in that cache before the first verify. Split both stages behind arch hooks on the shared frontend, defaulting to the existing behaviour: * _dflash_verify_forward_K — the S=K verify used for warmup and graph capture; default remains forward_own_decode_K_nvfp4. * _dflash_prefill_nvfp4 — prompt prefill returning the first greedy token; default remains the per-position captured-graph walk. Thor overrides route the verify through the FP8-KV wrapper (same as the long-ctx spec verify), prefill through the chunked FP8 prefill (also the fast-TTFT path on Thor), and guarantee the FP8 KV cache exists after drafter load for short-context constructions. Verified on Thor at ctx=128/64 new tokens: greedy tokens identical to the production MTP spec path.
The DFlash spec loop handled a partial accept by restoring a snapshot and re-advancing the committed rows through a second tapped verify — a second full main-model forward per cycle. On Thor decode is weight- read bound, so with partial accepts on nearly every cycle this doubled the per-cycle weight traffic (measured 152.8 ms/cycle at ctx=128). Split the snapshot and rollback stages behind arch hooks with the existing behaviour as the default, and use the per-step state checkpoint machinery on Thor instead: * _dflash_snap_state / _dflash_partial_rollback hooks on the shared loop; defaults preserve the restore + re-advance flow byte-for-byte. * Thor grows _K_save_max (and the per-step lin/conv checkpoints) to the DFlash verify q_seq at drafter load, dropping any K-row graphs captured against the old buffers. * Thor K-row dispatch delegates BOTH layer types to the parent K-row for K <= _K_save_max. The lin path gains per-step state saves; the full-attn path keeps the verify rows on the same kernel family as the K<=7 re-advance/spec verifies. The latter is required for correctness: committing rows produced by one kernel family while recovery recomputes them with another surfaces their occasional rounding disagreements as greedy divergence (observed once in ~40 cycles, cascading from a single full-attn row). * Thor rollback is then two gpu_copy calls from checkpoint slot N — the same pattern the long-context MTP spec loop uses — and the snapshot stage becomes a no-op. ctx=128 steady state: 152.8 -> 92.9 ms/cycle (+65% decode tok/s at unchanged AL); greedy parity with the production MTP path holds over 256 tokens (3 runs, bit-identical); MTP baseline unchanged.
The DFlash drafter window appended ONE fc-projected tap set per spec cycle, so its entries were ~AL committed tokens apart while the drafter attends to them as consecutive positions — starving the drafter of the context features it was trained on and capping the acceptance length well below the MTP chain's. Per-token window mode keeps a feature entry for EVERY committed token: * pertoken_window_append fc-projects the verify tap rows of all committed tokens (N+1 per cycle) and shift-writes them into a fixed-length window, outside the drafter graph. * dflash_drafter_forward_pertoken is a read-only forward over that window; graph capture needs no state snapshot/restore and there is exactly one graph per frontend. * The generate loop gains a _dflash_pertoken_window branch (default off; the shift-window path is untouched). Thor enables it at drafter load: FLASHRT_QWEN36_DFLASH_PERTOKEN (default on), FLASHRT_QWEN36_DFLASH_WINDOW (default 128). * The Thor prefill seeds the window from the prompt tail: the last min(window, prompt) tokens run as a tap-captured chunk, so the drafter starts with real context features instead of ramping from an empty window. FLASHRT_QWEN36_DFLASH_WINDOW_SEED=0 disables the seed (measured to help natural prompts and hurt only degenerate repeated-sentence prompts, where the tail features steer the drafter into repeating the prompt). Thor ctx<=128, steady state, against the FP8-KV MTP reference (greedy parity PASS on all four prompts): robot JSON plan MTP 33.7 tok/s -> DFlash 52.8 tok/s (AL 4.92) robot navigation MTP 30.5 -> DFlash 34.4 (AL 3.20) explain (prose) MTP 28.6 -> DFlash 30.8 (AL 2.87) repeated academic sentence: DFlash 22.9 (drafter is fed its own degenerate context; disable the seed to recover 4.27 AL)
Add docs/qwen36_dflash.md covering the drafter checkpoint, the generate entry point, the per-token context window and its env knobs, measured Thor numbers against the FP8-KV MTP reference, and the benchmark caveats (seed vs verbatim-repeated prompts, KV-format-matched parity references). Register the new env vars in qwen36_usage.md, correct the stale init_dflash_drafter reference, and add the README news entry.
|
TBD: Add a speculative decoding abstraction for multi-backend support. |
The per-token window append ran AFTER the end-of-cycle taps[:, 0] <- taps[:, N] shuffle, so the first committed row's feature entered the window as a duplicate of the last accepted row on every cycle — the window was not one-feature-per-committed-token as documented. Extract the append into _dflash_window_commit(N) and call it before the shuffle; the shuffle itself is hoisted out of the accept branches (N == K on a full accept, so a single taps[:, N] copy covers both, byte-identical for the non-per-token default path). Add structural tests (no checkpoint / no GPU): window-commit row order and copy semantics, a source-order guard on the generate loop, the missing-drafter fail-fast, the public drafter-init delegation, and Thor per-token env routing (default on, opt-out, window override). Re-measured on Thor with the corrected window (parity PASS on all prompts): robot JSON 48.9 tok/s (AL 4.57) vs MTP 33.7; robot navigation 34.8 (3.25); prose 31.7 (3.00). Docs and README updated to the corrected numbers.
Add serving/qwen36_dflash_agent: a request/response serving host for the Qwen3.6-27B DFlash spec-decode path, following the serving-layer contract — policy above the frontend, no session or KV verbs, no exec/ changes. Scope: stateless per request (full prefill each call), batch 1 with serialized requests, greedy decode, /v1/chat/completions + /v1/models + /health, frontend arch auto-detected (SM110 -> Thor, otherwise RTX). Responses carry a flashrt telemetry block with the speculation cycle count, realized accept length, and end-to-end latency. Long-running agent sessions (prefix reuse, tool calling, SSE streaming) remain the domain of serving/qwen36_agent on the MTP path; the README states the split.
Qwen3.6 reasons inside a <think> block before answering, and the thinking stream dominates the token budget of short-context requests. Mirroring the TensorRT-LLM MTP relaxed-acceptance policy: inside the think block a draft is accepted when it is in the verify logits' top-k AND within a logit margin of the argmax (a raw-logit margin equals a log-prob margin), and the accepted token is the draft itself — the verify rows and per-step state already condition on the drafts, so state and KV stay consistent. Rows from the first draft that closes the think block fall back to strict argmax matching, keeping the visible answer exact-verified. Opt-in via FLASHRT_QWEN36_DFLASH_RELAXED_THINKING (default off; the strict path is byte-identical with it disabled, re-verified by greedy parity against the MTP reference). Knobs: FLASHRT_QWEN36_DFLASH_RELAXED_TOPK (3), FLASHRT_QWEN36_DFLASH_RELAXED_DELTA (1.0). The acceptance math lives in a static helper with CPU unit tests (top-k membership, margin cutoff, strict-after-close). Measured on Thor (thinking-enabled robot JSON-plan prompt, steady state): AL 3.78 -> 5.42, 40.4 -> 57.7 tok/s (+43%); a math prompt whose drafts rarely reach the top-k measured neutral. The thinking transcript is no longer token-identical to the strict run.
Add per-step-checkpoint variants of the Qwen3.6 chunk kernels: causal_conv1d_qwen36_update_chunk_saves_bf16 and qwen36_gdn_chunk_from_conv_smem_strided_saves_bf16 dump the post-step state after every step (the conv window is bf16-exact in registers; the GDN kernel already rounds the carried state to bf16 each step, so slot s byte-matches the committed state of an S = s + 1 run). This serves the DFlash partial-accept rollback in one pass instead of the parent branch's per-step kernels with global state round-trips: verify 84.1 -> 79.5 ms, cycle 93.3 -> 88.8 ms (~+5% decode) at unchanged acceptance behavior. Default OFF (FLASHRT_QWEN36_THOR_LIN_CHUNK_SAVES=1 to enable): the route moves the S=8..16 verify onto the Thor kernel family while the greedy-parity reference (the MTP spec path) runs the parent family, and the families' occasional rounding disagreements surface as transcript-level divergence — measured on the repeated-sentence parity prompt. K <= 7 dispatch (the production MTP verify) is untouched either way.
Document the two opt-in DFlash performance modes: relaxed thinking-phase acceptance (env knobs, +43% measured on a thinking-enabled robot-plan prompt, transcript-exactness tradeoff stated) and the Thor chunk-saves verify kernels (~5% cycle, kernel family vs the parity reference stated). Both default off; the default configuration remains token-identical to the FP8-KV MTP reference.
|
Follow-up: decode-ceiling round (borrowed from the TRT-LLM/vLLM/SGLang playbook, measured on Thor). Cycle breakdown first (CUDA events, steady state): verify 84.1 ms / drafter 8.1 ms / everything else 1.2 ms — so scheduler/host-overlap techniques from the serving engines have nothing to buy here; the whole budget is the verify. Three additions, defaults chosen by the greedy-parity red line:
Default configuration is re-verified byte-identical to the FP8-KV MTP reference (parity PASS, numbers unchanged: 93.5 ms cycle / AL 2.13 on the parity prompt). 11 structural tests pass without checkpoints or GPU. |
Summary
Brings the DFlash block-diffusion drafter path up on Jetson AGX Thor (SM110) and closes the two structural gaps that kept it below the MTP chain. On structured robot-plan prompts the DFlash path now decodes at 48.9 tok/s vs 33.7 tok/s for the FP8-KV MTP reference (AL 4.57 vs 2.87), with byte-identical greedy output on every measured prompt.
New doc:
docs/qwen36_dflash.md(checkpoint, usage, window knobs, measured numbers, benchmark caveats).Commits
fix(qwen36_thor): match parent
_long_tq_effective_ksignature — the shared frontend grew an optionalmax_new_tokensparameter and passes it positionally; the Thor override kept the two-argument form, so every long-context generate on Thor raisedTypeError. All 11 Thor overrides were audited for signature drift; this was the only one.feat(qwen36): DFlash speculative decoding on Thor — the DFlash loop hard-coded the BF16-staged verify and per-position prompt walk (RTX-shaped). Split both behind arch hooks with defaults preserving the RTX flow byte-for-byte; Thor routes the verify through the FP8-KV wrapper (same as the long-ctx spec verify), prefills through the chunked FP8 path, and guarantees the FP8 KV cache exists after drafter load on short-context constructions.
perf(qwen36): constant-time partial-accept rollback — the loop re-advanced committed rows with a second full main-model forward per cycle; on weight-read-bound Thor this doubled per-cycle traffic (152.8 ms/cycle). Thor now grows the per-step state checkpoints to the DFlash verify q_seq and rolls back with two
gpu_copycalls (same pattern as the long-context MTP loop): 152.8 -> 92.9 ms/cycle. Both layer types delegate K <=_K_save_maxto the parent K-row so committed rows and any recovery recompute stay on one kernel family — committing rows from one family while recovering with another surfaces their occasional rounding disagreements as greedy divergence (observed and root-caused during bring-up).feat(qwen36): per-token drafter window with prompt-tail seeding — the drafter window held one fc-projected feature per spec cycle, i.e. entries ~AL tokens apart read as consecutive positions. The per-token window keeps one entry per committed token (appended in bulk outside the drafter graph; the graph itself is read-only over the window and needs no capture snapshot), and the Thor prefill seeds it from the prompt tail. Default on for Thor, opt-in on RTX; env-gated (
FLASHRT_QWEN36_DFLASH_PERTOKEN,..._WINDOW,..._WINDOW_SEED).docs(qwen36) — usage doc, env table, README news entry.
fix(qwen36): commit per-token window rows before the tap shuffle (review follow-up) — the window append ran after the end-of-cycle
taps[:, 0]shuffle, so each cycle's first committed feature entered the window as a duplicate of the last accepted row. The append is now_dflash_window_commit(N), called before the shuffle. All performance claims re-measured with the corrected window and updated everywhere (the buggy window happened to inflate the JSON-plan number; the honest figure is 48.9 tok/s).tests: structural coverage without checkpoints (review follow-up) —
tests/test_qwen36_dflash_structural.py: window-commit row order + copy semantics, a source-order guard on the generate loop, missing-drafter fail-fast, public drafter-init delegation, and Thor per-token env routing (default / opt-out / window override). 8 tests, CPU-only, no model files.feat(qwen36): public
init_dflash_drafter(review follow-up) — public wrapper for the drafter load; docs no longer instruct users to call an underscored method.feat(serving):
serving/qwen36_dflash_agent— stateless OpenAI-compatible host for the DFlash path per the serving-layer contract (policy above the frontend, no exec/ verbs): /v1/chat/completions, /v1/models, /health, batch-1 serialized, accept-length telemetry in the response. Long-running sessions stay onserving/qwen36_agent(MTP path); the README states the split.Measured (Thor SM110, steady state, 64/256-token delta, greedy)
Parity is against the FP8-KV MTP route (
FLASHRT_QWEN36_LONG_CTX_ROUTE_MIN_SEQ=0for short prompts) — the BF16 short route stores KV in a different format, so token-exact comparison across formats is not meaningful. The repeated-sentence row is a benchmark pathology: the seeded window is fed degenerate context;FLASHRT_QWEN36_DFLASH_WINDOW_SEED=0recovers AL 4.27 there.Regression evidence
Review follow-ups addressed
init_dflash_drafteradded and documented.docs/qwen36_dflash.md(oneargmin().item()per cycle, ~10 us against an ~86 ms verify, included in all measurements; device-side accept noted as follow-up).Risk
R3 (touches the shared DFlash loop and Thor dispatch). No kernel or build changes;
csrc/untouched. Thor-validated on hardware; RTX validated by default-path preservation (hooks default to prior behavior).