
[Track B] track ggml-org/llama.cpp#22105 (DFlash drafter) for castle adaptation#6

Draft
Leechael wants to merge 111 commits into master from track-b/dflash-on-22105

Conversation

Leechael (Owner) commented May 5, 2026

Tracking: #3 (Phase 2)

Purpose

Mirror of ggml-org/llama.cpp#22105 (DFlash drafter by ruixiang63), evaluated on castle hardware as an upstream-tracking alternative to the castle's own DFlash + DDTree implementation.

Current state

Branch starts identical to the upstream-pr/22105 head (67cb0d507). No castle-side adjustments were applied: stock ggml-org#22105 was sufficient to run end-to-end on castle hardware.

Castle smoke test (single prompt)

```
build   : build-track-b/ (cmake + CUDA, sm_89, RTX 4090)
target  : Qwen3.5-27B-Q4_K_M.gguf (16 GB)
draft   : dflash-22105.gguf (3.46 GB, bf16, converted from DFlash-draft-hf via #22105's convert_hf_to_gguf.py with --target-model-dir)
prompt  : "Write a quicksort algorithm in Python. Write code only."

decode_tps    = 99.1
accept_pct    = 67.8% (61/90)
n_predict     = 68
graphs reused = 3 (decoder graph rebuild is known #22105 future work)
GPU memory    = 24079 MiB = 3446 free + (16034 model + 213 ctx + 548 compute) + 4598 unaccounted
```

30-prompt benchmark (HumanEval / GSM8K / Math500, 10 each, seed=42)

Same dataset and sampling as the castle self-implementation benchmark in docs/ddtree-dataset-eval-plan.md. KV q8_0, n-batch=2048, n-ubatch=512 (ggml-org#22105 has no --prompt-chunk flag; the large batch was chosen so the prompt fits in a single ubatch).

| dataset   | n  | avg decode tok/s | avg accept % |
|-----------|----|------------------|--------------|
| HumanEval | 10 | 91.11            | 54.13        |
| GSM8K     | 10 | 89.47            | 56.05        |
| Math500   | 10 | 93.42            | 55.55        |
| avg       | 30 | 91.3             | 55.2         |
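
For intuition on why ~55% per-token acceptance lands near the observed 2.0x, a back-of-envelope sketch in Python. This is not from the PR: the per-step draft length k is not reported above, and the independent-acceptance assumption is a simplification, so the numbers are illustrative only.

```python
# Back-of-envelope model (not from the PR): expected tokens committed per
# target forward pass under speculative decoding, assuming each drafted
# token is accepted independently with probability alpha and the drafter
# proposes k tokens per step. Both alpha and k here are assumptions.

def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Accepted prefix plus the one token the target itself contributes:
    # E = sum_{i=0..k} alpha^i = (1 - alpha^(k+1)) / (1 - alpha)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

if __name__ == "__main__":
    for k in (4, 8, 16):
        e = expected_tokens_per_step(0.55, k)  # ~55% acceptance, per the table
        print(f"k={k:2d} -> ~{e:.2f} tokens per target forward")
```

The realized speedup sits below this expectation because each step also pays for the drafter's own forward passes.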

Comparison vs castle self-implementation (numbers from existing castle docs)

| stack                                   | avg tok/s | bit-equal                              | speedup vs AR |
|-----------------------------------------|-----------|----------------------------------------|---------------|
| AR baseline                             | 46.4      | 10/10 (by definition)                  | 1.0x          |
| castle self-impl exact-gated            | ~40       | 10/10                                  | 0.87x         |
| upstream ggml-org#22105 stock           | 91        | bit-eq via ggml-org#19493 checkpoint   | 2.0x          |
| castle self-impl + TARGET_TOP1          | 123       | 4.7/10                                 | 2.7x          |
| castle self-impl + unsafe-trust-batched | 137       | 4.7/10                                 | 3.0x          |

Interpretation

Upstream ggml-org#22105 stock sits between the two ends of the castle self-implementation:

- ~33% slower than castle's unsafe-trust-batched (which sacrifices bit-equal correctness)
- 2.3× faster than castle's correctness-gated path (castle's exact-gated path is validate_tree_with_chain, which runs N single-token chain decodes per step and effectively degenerates speculation into AR plus overhead)

Why ggml-org#22105 stays fast under a correctness gate: it leans on upstream ggml-org#19493 speculative checkpointing (merged 2026-04-19). Rejected suffixes are rolled back via KV checkpoints, so the accepted path runs forward exactly once. Castle's self-implementation predates ggml-org#19493 and replaced it with a per-token chain validation walk.
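
A toy, self-contained sketch of the difference (illustrative Python; target_next and the list-based KV model are stand-ins, not llama.cpp API):

```python
# Toy model of one speculative step, contrasting checkpoint rollback
# (ggml-org#19493 style) with per-token chain validation. All names here
# are illustrative stand-ins, not real llama.cpp API.

def target_next(prefix: tuple[int, ...]) -> int:
    """Stand-in deterministic target model: next token for a prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 100

def spec_step_checkpoint(prefix: list[int], draft: list[int]):
    """One batched forward over the whole draft; the rejected suffix is
    dropped by restoring the KV checkpoint (modeled as list truncation)."""
    n_forwards = 1                     # accepted path runs forward exactly once
    accepted = []
    for tok in draft:                  # scan the batched logits position by position
        if target_next(tuple(prefix + accepted)) != tok:
            break
        accepted.append(tok)
    return prefix + accepted, n_forwards

def spec_step_chain(prefix: list[int], draft: list[int]):
    """Castle exact-gated path: one single-token forward per draft token,
    so the step costs as much as plain AR decoding plus draft overhead."""
    kv = list(prefix)
    n_forwards = 0
    for tok in draft:
        n_forwards += 1
        if target_next(tuple(kv)) != tok:
            break
        kv.append(tok)
    return kv, n_forwards

if __name__ == "__main__":
    prefix = [1, 2, 3]
    draft = [89, 49, 69]               # happens to match the toy target's output
    print(spec_step_checkpoint(prefix, draft))  # ([1, 2, 3, 89, 49, 69], 1)
    print(spec_step_chain(prefix, draft))       # ([1, 2, 3, 89, 49, 69], 3)
```

Under an exact-match gate, the checkpoint variant still spends one target forward per step, while the chain variant spends one per draft token, which is why exact-gated castle lands at 0.87x.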

Trade-off

Stock ggml-org#22105 keeps bit-equal output at a 2.0x speedup; castle's faster modes reach 2.7-3.0x only by giving up bit-equality (4.7/10 in the table above).

Status

This branch will be kept in sync with upstream-pr/22105 as ruixiang63 evolves it (upstream is waiting on the ggml-org#18039 EAGLE3 merge and ggerganov's unified spec API refactor).

No active development planned on top of this branch. If we later decide to port castle DDTree on top of ggml-org#22105 (to combine bit-equal + faster speedup), it will become a new branch (track-b/ddtree-on-22105 or similar).

Castle artifacts produced during this evaluation

ruixiang63 and others added 30 commits December 14, 2025 18:12
EAGLE3 is an encoder-decoder based speculative decoding method:
- Extracts features from target model at specific layers
- Uses feature fusion layer to compress target features
- Generates draft tokens with single-layer decoder
- Maps draft vocabulary to target vocabulary via d2t tensor

Key changes:
- Add LLM_ARCH_EAGLE3 architecture
- Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
- Add feature extraction from target model layers
- Add g_embeddings handling for decoder input
- Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
- Add --eagle3 flag for speculative-simple example
- Add EAGLE3 model conversion in convert_hf_to_gguf.py
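
For orientation, a minimal sketch of the draft step those bullets describe, assuming shapes for illustration (Python pseudocode of the dataflow, not the actual src/models/eagle3.cpp graph):

```python
# Illustrative dataflow of an EAGLE3 draft step, following the bullets
# above: extract target features, fuse them, run one decoder layer, map
# draft-vocab ids to target-vocab ids via d2t. Names/shapes are assumed.

import numpy as np

def eagle3_draft_step(
    target_feats: list[np.ndarray],  # hidden states from specific target layers
    g_embeddings: np.ndarray,        # decoder input embeddings, shape (d_model,)
    fuse_proj: np.ndarray,           # feature-fusion weight, (sum of feat dims, d_model)
    decoder_layer,                   # single-layer draft decoder: (d_model,) -> (n_draft_vocab,)
    d2t: np.ndarray,                 # draft-vocab -> target-vocab index map
) -> int:
    fused = np.concatenate(target_feats) @ fuse_proj    # compress target features
    draft_logits = decoder_layer(fused + g_embeddings)  # one decoder layer only
    draft_tok = int(np.argmax(draft_logits))
    return int(d2t[draft_tok])       # translate to the target model's vocabulary
```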
* CUDA: use a ring-buffer for cuda graphs

* bump limit to 128

* use LRU eviction

* better naming

* do periodic clean-up
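
A small sketch of the bookkeeping those bullets converge on (capacity-bounded cache, LRU eviction, periodic clean-up); the real code lives in the CUDA backend, and this Python version only models the policy:

```python
# Models the caching policy from the bullets above: bounded capacity
# (128), LRU eviction, periodic clean-up. The key type and graph payload
# are placeholders; the real structure holds captured CUDA graphs.

from collections import OrderedDict

class GraphCache:
    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self.graphs: "OrderedDict[str, object]" = OrderedDict()

    def get(self, key: str):
        if key not in self.graphs:
            return None                      # miss: caller captures a new graph
        self.graphs.move_to_end(key)         # mark as most recently used
        return self.graphs[key]

    def put(self, key: str, graph: object) -> None:
        self.graphs[key] = graph
        self.graphs.move_to_end(key)
        while len(self.graphs) > self.capacity:
            self.graphs.popitem(last=False)  # evict least recently used

    def cleanup(self, keep: int) -> None:
        # periodic clean-up: keep only the hottest `keep` entries
        while len(self.graphs) > keep:
            self.graphs.popitem(last=False)
```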
…ng (ggml-org#21052)

* Update workflows to remove dependence on llvmpipe

* Try setting Dawn_DIR

* remove c++20 initializers

* Move to proper guid

* Try avoiding segfaults on vulkan backend process exit

* Remove compiler warnings on parameter casting

* Fix soft_max and update reg_tile accumulation to f32 for better precision

* Refactor flash_attn a bit

* remove c++20 initializers and format

* Increase div precision for NVIDIA

* revert div precision and comment out ggml-ci node for now

* Formatting

* Try debugging on a failing CI node

* Revert "Try debugging on a failing CI node"

This reverts commit 1971e33.
* refactor bias tensor variable names

* use create_tensor_qkv for jina-bert-v2
* rpc : refactor the RPC transport

Move all transport related code into a separate file and use the
socket_t interface to hide all transport implementation details.

* fix win32

* better socket_t construction
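
The shape of that refactor, sketched in Python (socket_t is the C++ interface; the names below are illustrative stand-ins, not the actual rpc sources): callers talk to a narrow transport interface, and all wire details stay behind one implementation.

```python
# Illustrative shape of the refactor: code against a small transport
# interface (the role socket_t plays in the C++ sources) so transport
# details live in one place. Python names here are stand-ins.

from abc import ABC, abstractmethod
import socket

class Transport(ABC):
    @abstractmethod
    def send(self, data: bytes) -> None: ...

    @abstractmethod
    def recv(self, n: int) -> bytes: ...

class TcpTransport(Transport):
    def __init__(self, host: str, port: int):
        self.sock = socket.create_connection((host, port))

    def send(self, data: bytes) -> None:
        self.sock.sendall(data)

    def recv(self, n: int) -> bytes:
        buf = b""
        while len(buf) < n:                  # recv may return short reads
            chunk = self.sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed the connection")
            buf += chunk
        return buf
```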
* server : speculative decoding using checkpoints

* server : fix draft check with checkpoints

* server : rename spec vars

* server : log levels

* server : refactored spec logic to speculative.cpp

* server : renamed spec checkpoints option

* server : fix spec checkpoints, logging

* speculative : checkpoints with draft model, logging

* server : n_tokens_cur and create_checkpoint in draft

* server : fix server_speculative_callback (slot.id)

* spec : fix ngram-map/begin idx_last_check

* spec : init ckpt (begin() wasn't called)

* chore: update webui build output

* server : restore sampler in spec checkpoint and clear mem

* cont : avoid --spec-use-checkpoints argument

* cont : remove server_prompt_checkpoint_with_size

* spec : rename (leave_draft_state)

* cont : clean-up

* cont : do not ignore partial drafts even if they are short

* cont : spec callback owned by session

* cont : simplify

* cont : avoid empty speculative session

* cont : simplify

* cont : simplify

* cont : enable mtmd speculative decoding

* cont : keep the spec sampler alive

* cont : simplify

* cont : fix nullptr deref + draft checkpoints

* cont : remove common_speculative_accept_response

* cont : remove callback

* cont : simplify

* cont : minor

* cont : simplify

* cont : fix accepted number

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ggml-org#21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds.

Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform.

Reported-by: oobabooga

Refs: ggml-org#21630

Co-authored-by: texasich <texasich@users.noreply.github.com>
* convert : support sentence-transformer 5.4 config files

* fix: embeddinggemma

* fix: mapping

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix: pooling_mode

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>