Skip to content

Qwen3.6-27B DFlash speculative decoding on Thor: 48.9 tok/s on structured prompts (+45% vs MTP)#130

Open
LiangSu8899 wants to merge 10 commits into
mainfrom
feat/qwen36-thor-dflash
Open

Qwen3.6-27B DFlash speculative decoding on Thor: 48.9 tok/s on structured prompts (+45% vs MTP)#130
LiangSu8899 wants to merge 10 commits into
mainfrom
feat/qwen36-thor-dflash

Conversation

@LiangSu8899

@LiangSu8899 LiangSu8899 commented Jul 3, 2026

Copy link
Copy Markdown
Member

Summary

Brings the DFlash block-diffusion drafter path up on Jetson AGX Thor (SM110) and closes the two structural gaps that kept it below the MTP chain. On structured robot-plan prompts the DFlash path now decodes at 48.9 tok/s vs 33.7 tok/s for the FP8-KV MTP reference (AL 4.57 vs 2.87), with byte-identical greedy output on every measured prompt.

New doc: docs/qwen36_dflash.md (checkpoint, usage, window knobs, measured numbers, benchmark caveats).

Commits

  1. fix(qwen36_thor): match parent _long_tq_effective_k signature — the shared frontend grew an optional max_new_tokens parameter and passes it positionally; the Thor override kept the two-argument form, so every long-context generate on Thor raised TypeError. All 11 Thor overrides were audited for signature drift; this was the only one.

  2. feat(qwen36): DFlash speculative decoding on Thor — the DFlash loop hard-coded the BF16-staged verify and per-position prompt walk (RTX-shaped). Split both behind arch hooks with defaults preserving the RTX flow byte-for-byte; Thor routes the verify through the FP8-KV wrapper (same as the long-ctx spec verify), prefills through the chunked FP8 path, and guarantees the FP8 KV cache exists after drafter load on short-context constructions.

  3. perf(qwen36): constant-time partial-accept rollback — the loop re-advanced committed rows with a second full main-model forward per cycle; on weight-read-bound Thor this doubled per-cycle traffic (152.8 ms/cycle). Thor now grows the per-step state checkpoints to the DFlash verify q_seq and rolls back with two gpu_copy calls (same pattern as the long-context MTP loop): 152.8 -> 92.9 ms/cycle. Both layer types delegate K <= _K_save_max to the parent K-row so committed rows and any recovery recompute stay on one kernel family — committing rows from one family while recovering with another surfaces their occasional rounding disagreements as greedy divergence (observed and root-caused during bring-up).

  4. feat(qwen36): per-token drafter window with prompt-tail seeding — the drafter window held one fc-projected feature per spec cycle, i.e. entries ~AL tokens apart read as consecutive positions. The per-token window keeps one entry per committed token (appended in bulk outside the drafter graph; the graph itself is read-only over the window and needs no capture snapshot), and the Thor prefill seeds it from the prompt tail. Default on for Thor, opt-in on RTX; env-gated (FLASHRT_QWEN36_DFLASH_PERTOKEN, ..._WINDOW, ..._WINDOW_SEED).

  5. docs(qwen36) — usage doc, env table, README news entry.

  6. fix(qwen36): commit per-token window rows before the tap shuffle (review follow-up) — the window append ran after the end-of-cycle taps[:, 0] shuffle, so each cycle's first committed feature entered the window as a duplicate of the last accepted row. The append is now _dflash_window_commit(N), called before the shuffle. All performance claims re-measured with the corrected window and updated everywhere (the buggy window happened to inflate the JSON-plan number; the honest figure is 48.9 tok/s).

  7. tests: structural coverage without checkpoints (review follow-up) — tests/test_qwen36_dflash_structural.py: window-commit row order + copy semantics, a source-order guard on the generate loop, missing-drafter fail-fast, public drafter-init delegation, and Thor per-token env routing (default / opt-out / window override). 8 tests, CPU-only, no model files.

  8. feat(qwen36): public init_dflash_drafter (review follow-up) — public wrapper for the drafter load; docs no longer instruct users to call an underscored method.

  9. feat(serving): serving/qwen36_dflash_agent — stateless OpenAI-compatible host for the DFlash path per the serving-layer contract (policy above the frontend, no exec/ verbs): /v1/chat/completions, /v1/models, /health, batch-1 serialized, accept-length telemetry in the response. Long-running sessions stay on serving/qwen36_agent (MTP path); the README states the split.

Measured (Thor SM110, steady state, 64/256-token delta, greedy)

prompt MTP AL / tok/s DFlash AL / tok/s parity
robot task -> JSON plan 2.87 / 33.7 4.57 / 48.9 PASS
robot navigation plan 2.59 / 30.5 3.25 / 34.8 PASS
prose explanation 2.43 / 28.5 3.00 / 31.7 PASS
repeated-sentence stress text 3.84 / 36.8 2.13 / 22.9 PASS

Parity is against the FP8-KV MTP route (FLASHRT_QWEN36_LONG_CTX_ROUTE_MIN_SEQ=0 for short prompts) — the BF16 short route stores KV in a different format, so token-exact comparison across formats is not meaningful. The repeated-sentence row is a benchmark pathology: the seeded window is fed degenerate context; FLASHRT_QWEN36_DFLASH_WINDOW_SEED=0 recovers AL 4.27 there.

Regression evidence

  • MTP spec path unchanged: 3.84 AL / 37.0 tok/s / 103.9 ms cycle at ctx=128 before and after (same-process A/B).
  • RTX default flow: all new behavior is behind arch hooks whose defaults reproduce the previous code path byte-for-byte, and the per-token window is default-off outside Thor.
  • Greedy parity: 3 consecutive 256-token DFlash runs bit-identical, all matching the MTP reference.
  • Per-step checkpoint growth reallocates the save buffers at drafter load and drops K-row graphs captured against the old allocations, so pre-existing MTP graphs cannot hold stale pointers.

Review follow-ups addressed

  • Blocking: per-token window tap-order bug — fixed, re-measured, numbers corrected in docs/README/PR.
  • Blocking: no test coverage — structural test suite added (runs without checkpoints or GPU).
  • Minor: underscored public entry — init_dflash_drafter added and documented.
  • Minor: host-sync in the decode loop — documented in docs/qwen36_dflash.md (one argmin().item() per cycle, ~10 us against an ~86 ms verify, included in all measurements; device-side accept noted as follow-up).

Risk

R3 (touches the shared DFlash loop and Thor dispatch). No kernel or build changes; csrc/ untouched. Thor-validated on hardware; RTX validated by default-path preservation (hooks default to prior behavior).

The shared frontend added an optional max_new_tokens parameter to
_long_tq_effective_k and its call sites pass it positionally; the Thor
override kept the two-argument form, so every long-context generate on
Thor raised TypeError. Accept and forward the new parameter; the Thor
K-cap behaviour is unchanged.
The DFlash generate loop hard-coded the BF16-staged verify forward and
the per-position prompt walk, both of which are RTX-shaped: the Thor
K-row layer path at S=16 is single-XQA over the persistent FP8 KV
cache, so the verify needs the FP8-KV mode flag active and the prompt
rows must be present in that cache before the first verify.

Split both stages behind arch hooks on the shared frontend, defaulting
to the existing behaviour:

* _dflash_verify_forward_K — the S=K verify used for warmup and graph
  capture; default remains forward_own_decode_K_nvfp4.
* _dflash_prefill_nvfp4 — prompt prefill returning the first greedy
  token; default remains the per-position captured-graph walk.

Thor overrides route the verify through the FP8-KV wrapper (same as
the long-ctx spec verify), prefill through the chunked FP8 prefill
(also the fast-TTFT path on Thor), and guarantee the FP8 KV cache
exists after drafter load for short-context constructions.

Verified on Thor at ctx=128/64 new tokens: greedy tokens identical to
the production MTP spec path.
The DFlash spec loop handled a partial accept by restoring a snapshot
and re-advancing the committed rows through a second tapped verify —
a second full main-model forward per cycle. On Thor decode is weight-
read bound, so with partial accepts on nearly every cycle this doubled
the per-cycle weight traffic (measured 152.8 ms/cycle at ctx=128).

Split the snapshot and rollback stages behind arch hooks with the
existing behaviour as the default, and use the per-step state
checkpoint machinery on Thor instead:

* _dflash_snap_state / _dflash_partial_rollback hooks on the shared
  loop; defaults preserve the restore + re-advance flow byte-for-byte.
* Thor grows _K_save_max (and the per-step lin/conv checkpoints) to
  the DFlash verify q_seq at drafter load, dropping any K-row graphs
  captured against the old buffers.
* Thor K-row dispatch delegates BOTH layer types to the parent K-row
  for K <= _K_save_max. The lin path gains per-step state saves; the
  full-attn path keeps the verify rows on the same kernel family as
  the K<=7 re-advance/spec verifies. The latter is required for
  correctness: committing rows produced by one kernel family while
  recovery recomputes them with another surfaces their occasional
  rounding disagreements as greedy divergence (observed once in ~40
  cycles, cascading from a single full-attn row).
* Thor rollback is then two gpu_copy calls from checkpoint slot N —
  the same pattern the long-context MTP spec loop uses — and the
  snapshot stage becomes a no-op.

ctx=128 steady state: 152.8 -> 92.9 ms/cycle (+65% decode tok/s at
unchanged AL); greedy parity with the production MTP path holds over
256 tokens (3 runs, bit-identical); MTP baseline unchanged.
The DFlash drafter window appended ONE fc-projected tap set per spec
cycle, so its entries were ~AL committed tokens apart while the
drafter attends to them as consecutive positions — starving the
drafter of the context features it was trained on and capping the
acceptance length well below the MTP chain's.

Per-token window mode keeps a feature entry for EVERY committed
token:

* pertoken_window_append fc-projects the verify tap rows of all
  committed tokens (N+1 per cycle) and shift-writes them into a
  fixed-length window, outside the drafter graph.
* dflash_drafter_forward_pertoken is a read-only forward over that
  window; graph capture needs no state snapshot/restore and there is
  exactly one graph per frontend.
* The generate loop gains a _dflash_pertoken_window branch (default
  off; the shift-window path is untouched). Thor enables it at
  drafter load: FLASHRT_QWEN36_DFLASH_PERTOKEN (default on),
  FLASHRT_QWEN36_DFLASH_WINDOW (default 128).
* The Thor prefill seeds the window from the prompt tail: the last
  min(window, prompt) tokens run as a tap-captured chunk, so the
  drafter starts with real context features instead of ramping from
  an empty window. FLASHRT_QWEN36_DFLASH_WINDOW_SEED=0 disables the
  seed (measured to help natural prompts and hurt only degenerate
  repeated-sentence prompts, where the tail features steer the
  drafter into repeating the prompt).

Thor ctx<=128, steady state, against the FP8-KV MTP reference
(greedy parity PASS on all four prompts):

  robot JSON plan   MTP 33.7 tok/s -> DFlash 52.8 tok/s (AL 4.92)
  robot navigation  MTP 30.5      -> DFlash 34.4      (AL 3.20)
  explain (prose)   MTP 28.6      -> DFlash 30.8      (AL 2.87)
  repeated academic sentence: DFlash 22.9 (drafter is fed its own
  degenerate context; disable the seed to recover 4.27 AL)
Add docs/qwen36_dflash.md covering the drafter checkpoint, the
generate entry point, the per-token context window and its env knobs,
measured Thor numbers against the FP8-KV MTP reference, and the
benchmark caveats (seed vs verbatim-repeated prompts, KV-format-matched
parity references). Register the new env vars in qwen36_usage.md,
correct the stale init_dflash_drafter reference, and add the README
news entry.
@LiangSu8899

Copy link
Copy Markdown
Member Author

TBD: Add a speculative decoding abstraction for multi-backend support.

The per-token window append ran AFTER the end-of-cycle
taps[:, 0] <- taps[:, N] shuffle, so the first committed row's feature
entered the window as a duplicate of the last accepted row on every
cycle — the window was not one-feature-per-committed-token as
documented.

Extract the append into _dflash_window_commit(N) and call it before
the shuffle; the shuffle itself is hoisted out of the accept branches
(N == K on a full accept, so a single taps[:, N] copy covers both,
byte-identical for the non-per-token default path).

Add structural tests (no checkpoint / no GPU): window-commit row order
and copy semantics, a source-order guard on the generate loop, the
missing-drafter fail-fast, the public drafter-init delegation, and
Thor per-token env routing (default on, opt-out, window override).

Re-measured on Thor with the corrected window (parity PASS on all
prompts): robot JSON 48.9 tok/s (AL 4.57) vs MTP 33.7; robot
navigation 34.8 (3.25); prose 31.7 (3.00). Docs and README updated to
the corrected numbers.
Add serving/qwen36_dflash_agent: a request/response serving host for
the Qwen3.6-27B DFlash spec-decode path, following the serving-layer
contract — policy above the frontend, no session or KV verbs, no
exec/ changes.

Scope: stateless per request (full prefill each call), batch 1 with
serialized requests, greedy decode, /v1/chat/completions +
/v1/models + /health, frontend arch auto-detected (SM110 -> Thor,
otherwise RTX). Responses carry a flashrt telemetry block with the
speculation cycle count, realized accept length, and end-to-end
latency. Long-running agent sessions (prefix reuse, tool calling,
SSE streaming) remain the domain of serving/qwen36_agent on the MTP
path; the README states the split.
@LiangSu8899 LiangSu8899 changed the title Qwen3.6-27B DFlash speculative decoding on Thor: 52.8 tok/s on structured prompts (+57% vs MTP) Qwen3.6-27B DFlash speculative decoding on Thor: 48.9 tok/s on structured prompts (+45% vs MTP) Jul 3, 2026
Qwen3.6 reasons inside a <think> block before answering, and the
thinking stream dominates the token budget of short-context requests.
Mirroring the TensorRT-LLM MTP relaxed-acceptance policy: inside the
think block a draft is accepted when it is in the verify logits'
top-k AND within a logit margin of the argmax (a raw-logit margin
equals a log-prob margin), and the accepted token is the draft itself
— the verify rows and per-step state already condition on the drafts,
so state and KV stay consistent. Rows from the first draft that closes
the think block fall back to strict argmax matching, keeping the
visible answer exact-verified.

Opt-in via FLASHRT_QWEN36_DFLASH_RELAXED_THINKING (default off; the
strict path is byte-identical with it disabled, re-verified by greedy
parity against the MTP reference). Knobs:
FLASHRT_QWEN36_DFLASH_RELAXED_TOPK (3),
FLASHRT_QWEN36_DFLASH_RELAXED_DELTA (1.0). The acceptance math lives
in a static helper with CPU unit tests (top-k membership, margin
cutoff, strict-after-close).

Measured on Thor (thinking-enabled robot JSON-plan prompt, steady
state): AL 3.78 -> 5.42, 40.4 -> 57.7 tok/s (+43%); a math prompt
whose drafts rarely reach the top-k measured neutral. The thinking
transcript is no longer token-identical to the strict run.
Add per-step-checkpoint variants of the Qwen3.6 chunk kernels:
causal_conv1d_qwen36_update_chunk_saves_bf16 and
qwen36_gdn_chunk_from_conv_smem_strided_saves_bf16 dump the post-step
state after every step (the conv window is bf16-exact in registers;
the GDN kernel already rounds the carried state to bf16 each step, so
slot s byte-matches the committed state of an S = s + 1 run). This
serves the DFlash partial-accept rollback in one pass instead of the
parent branch's per-step kernels with global state round-trips:
verify 84.1 -> 79.5 ms, cycle 93.3 -> 88.8 ms (~+5% decode) at
unchanged acceptance behavior.

Default OFF (FLASHRT_QWEN36_THOR_LIN_CHUNK_SAVES=1 to enable): the
route moves the S=8..16 verify onto the Thor kernel family while the
greedy-parity reference (the MTP spec path) runs the parent family,
and the families' occasional rounding disagreements surface as
transcript-level divergence — measured on the repeated-sentence
parity prompt. K <= 7 dispatch (the production MTP verify) is
untouched either way.
Document the two opt-in DFlash performance modes: relaxed
thinking-phase acceptance (env knobs, +43% measured on a
thinking-enabled robot-plan prompt, transcript-exactness tradeoff
stated) and the Thor chunk-saves verify kernels (~5% cycle, kernel
family vs the parity reference stated). Both default off; the default
configuration remains token-identical to the FP8-KV MTP reference.
@LiangSu8899

Copy link
Copy Markdown
Member Author

Follow-up: decode-ceiling round (borrowed from the TRT-LLM/vLLM/SGLang playbook, measured on Thor).

Cycle breakdown first (CUDA events, steady state): verify 84.1 ms / drafter 8.1 ms / everything else 1.2 ms — so scheduler/host-overlap techniques from the serving engines have nothing to buy here; the whole budget is the verify.

Three additions, defaults chosen by the greedy-parity red line:

  1. Relaxed thinking-phase acceptance (opt-in, FLASHRT_QWEN36_DFLASH_RELAXED_THINKING=1) — the TensorRT-LLM MTP policy: inside <think>, accept drafts in the verify top-k within a logit margin; strict matching resumes at </think> so the visible answer stays exact-verified. Thinking-enabled robot-plan prompt: AL 3.78 -> 5.42, 40.4 -> 57.7 tok/s (+43%). CPU unit tests cover the acceptance math (top-k membership, margin cutoff, strict-after-close).

  2. Chunk-saves verify kernels (opt-in, FLASHRT_QWEN36_THOR_LIN_CHUNK_SAVES=1) — chunk-scan variants that emit the per-step rollback checkpoints in one pass: verify 84.1 -> 79.5 ms (~+5% decode). Kept off by default: they move the S=8..16 verify onto the Thor kernel family while the parity reference runs the parent family, and the transcript drifts (measured). The K<=7 production MTP dispatch is untouched either way.

  3. Docs for both knobs, with the tradeoffs stated.

Default configuration is re-verified byte-identical to the FP8-KV MTP reference (parity PASS, numbers unchanged: 93.5 ms cycle / AL 2.13 on the parity prompt). 11 structural tests pass without checkpoints or GPU.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant