
[Track B] track ggml-org/llama.cpp#22105 (DFlash drafter) for castle adaptation#6

Draft
Leechael wants to merge 111 commits into master from track-b/dflash-on-22105

Conversation

Leechael (Owner) commented May 5, 2026

Tracking: #3 (Phase 2)

Purpose

Mirror of ggml-org/llama.cpp#22105 (DFlash drafter by ruixiang63), evaluated on castle hardware as an upstream-tracking alternative to the castle's own DFlash + DDTree implementation.

Current state

Branch starts identical to the upstream-pr/22105 head (67cb0d507). No castle-side adjustments were applied: stock ggml-org#22105 was sufficient to run end-to-end on castle hardware.

Castle smoke test (single prompt)

```
build   : build-track-b/ (cmake + CUDA, sm_89, RTX 4090)
target  : Qwen3.5-27B-Q4_K_M.gguf (16 GB)
draft   : dflash-22105.gguf (3.46 GB, bf16, converted from DFlash-draft-hf via #22105's convert_hf_to_gguf.py with --target-model-dir)
prompt  : "Write a quicksort algorithm in Python. Write code only."

decode_tps    = 99.1
accept_pct    = 67.8% (61/90)
n_predict     = 68
graphs reused = 3 (decoder graph rebuild is known #22105 future work)
GPU memory    = 24079 MiB = 3446 free + (16034 model + 213 ctx + 548 compute) + 4598 unaccounted
```

30-prompt benchmark (HumanEval / GSM8K / Math500, 10 each, seed=42)

Same dataset and sampling as the castle self-implementation benchmark in docs/ddtree-dataset-eval-plan.md. KV q8_0, n-batch=2048, n-ubatch=512 (ggml-org#22105 has no --prompt-chunk flag; the large batch was chosen so the prompt fits in a single ubatch).

| dataset   | n  | avg decode tok/s | avg accept % |
|-----------|----|------------------|--------------|
| HumanEval | 10 | 91.11            | 54.13        |
| GSM8K     | 10 | 89.47            | 56.05        |
| Math500   | 10 | 93.42            | 55.55        |
| avg       | 30 | 91.3             | 55.2         |
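
For intuition on why ~55% per-token acceptance lands near the observed 2.0x, a back-of-envelope sketch in Python. This is not from the PR: the per-step draft length k is not reported above, and the independent-acceptance assumption is a simplification, so the numbers are illustrative only.

```python
# Back-of-envelope model (not from the PR): expected tokens committed per
# target forward pass under speculative decoding, assuming each drafted
# token is accepted independently with probability alpha and the drafter
# proposes k tokens per step. Both alpha and k here are assumptions.

def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Accepted prefix plus the one token the target itself contributes:
    # E = sum_{i=0..k} alpha^i = (1 - alpha^(k+1)) / (1 - alpha)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

if __name__ == "__main__":
    for k in (4, 8, 16):
        e = expected_tokens_per_step(0.55, k)  # ~55% acceptance, per the table
        print(f"k={k:2d} -> ~{e:.2f} tokens per target forward")
```

The realized speedup sits below this expectation because each step also pays for the drafter's own forward passes.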

Comparison vs castle self-implementation (numbers from existing castle docs)

| stack                                   | avg tok/s | bit-equal                              | speedup vs AR |
|-----------------------------------------|-----------|----------------------------------------|---------------|
| AR baseline                             | 46.4      | 10/10 (by definition)                  | 1.0x          |
| castle self-impl exact-gated            | ~40       | 10/10                                  | 0.87x         |
| upstream ggml-org#22105 stock           | 91        | bit-eq via ggml-org#19493 checkpoint   | 2.0x          |
| castle self-impl + TARGET_TOP1          | 123       | 4.7/10                                 | 2.7x          |
| castle self-impl + unsafe-trust-batched | 137       | 4.7/10                                 | 3.0x          |

Interpretation

Upstream ggml-org#22105 stock sits between the two ends of the castle self-implementation:

- ~33% slower than castle's unsafe-trust-batched (which sacrifices bit-equal correctness)
- 2.3× faster than castle's correctness-gated path (castle's exact-gated path is validate_tree_with_chain, which runs N single-token chain decodes per step and effectively degenerates speculation into AR plus overhead)

Why ggml-org#22105 stays fast under a correctness gate: it leans on upstream ggml-org#19493 speculative checkpointing (merged 2026-04-19). Rejected suffixes are rolled back via KV checkpoints, so the accepted path runs forward exactly once. Castle's self-implementation predates ggml-org#19493 and replaced it with a per-token chain validation walk.
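
A toy, self-contained sketch of the difference (illustrative Python; target_next and the list-based KV model are stand-ins, not llama.cpp API):

```python
# Toy model of one speculative step, contrasting checkpoint rollback
# (ggml-org#19493 style) with per-token chain validation. All names here
# are illustrative stand-ins, not real llama.cpp API.

def target_next(prefix: tuple[int, ...]) -> int:
    """Stand-in deterministic target model: next token for a prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 100

def spec_step_checkpoint(prefix: list[int], draft: list[int]):
    """One batched forward over the whole draft; the rejected suffix is
    dropped by restoring the KV checkpoint (modeled as list truncation)."""
    n_forwards = 1                     # accepted path runs forward exactly once
    accepted = []
    for tok in draft:                  # scan the batched logits position by position
        if target_next(tuple(prefix + accepted)) != tok:
            break
        accepted.append(tok)
    return prefix + accepted, n_forwards

def spec_step_chain(prefix: list[int], draft: list[int]):
    """Castle exact-gated path: one single-token forward per draft token,
    so the step costs as much as plain AR decoding plus draft overhead."""
    kv = list(prefix)
    n_forwards = 0
    for tok in draft:
        n_forwards += 1
        if target_next(tuple(kv)) != tok:
            break
        kv.append(tok)
    return kv, n_forwards

if __name__ == "__main__":
    prefix = [1, 2, 3]
    draft = [89, 49, 69]               # happens to match the toy target's output
    print(spec_step_checkpoint(prefix, draft))  # ([1, 2, 3, 89, 49, 69], 1)
    print(spec_step_chain(prefix, draft))       # ([1, 2, 3, 89, 49, 69], 3)
```

Under an exact-match gate, the checkpoint variant still spends one target forward per step, while the chain variant spends one per draft token, which is why exact-gated castle lands at 0.87x.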

Trade-off

Stock ggml-org#22105 keeps bit-equal output at a 2.0x speedup; castle's faster modes reach 2.7-3.0x only by giving up bit-equality (4.7/10 in the table above).

Status

This branch will be kept in sync with upstream-pr/22105 as ruixiang63 evolves it (upstream is waiting on the ggml-org#18039 EAGLE3 merge and ggerganov's unified spec API refactor).

No active development planned on top of this branch. If we later decide to port castle DDTree on top of ggml-org#22105 (to combine bit-equal + faster speedup), it will become a new branch (track-b/ddtree-on-22105 or similar).

Castle artifacts produced during this evaluation

ruixiang63 and others added 30 commits December 14, 2025 18:12
EAGLE3 is an encoder-decoder based speculative decoding method:
- Extracts features from target model at specific layers
- Uses feature fusion layer to compress target features
- Generates draft tokens with single-layer decoder
- Maps draft vocabulary to target vocabulary via d2t tensor

Key changes:
- Add LLM_ARCH_EAGLE3 architecture
- Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
- Add feature extraction from target model layers
- Add g_embeddings handling for decoder input
- Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
- Add --eagle3 flag for speculative-simple example
- Add EAGLE3 model conversion in convert_hf_to_gguf.py
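
For orientation, a minimal sketch of the draft step those bullets describe, assuming shapes for illustration (Python pseudocode of the dataflow, not the actual src/models/eagle3.cpp graph):

```python
# Illustrative dataflow of an EAGLE3 draft step, following the bullets
# above: extract target features, fuse them, run one decoder layer, map
# draft-vocab ids to target-vocab ids via d2t. Names/shapes are assumed.

import numpy as np

def eagle3_draft_step(
    target_feats: list[np.ndarray],  # hidden states from specific target layers
    g_embeddings: np.ndarray,        # decoder input embeddings, shape (d_model,)
    fuse_proj: np.ndarray,           # feature-fusion weight, (sum of feat dims, d_model)
    decoder_layer,                   # single-layer draft decoder: (d_model,) -> (n_draft_vocab,)
    d2t: np.ndarray,                 # draft-vocab -> target-vocab index map
) -> int:
    fused = np.concatenate(target_feats) @ fuse_proj    # compress target features
    draft_logits = decoder_layer(fused + g_embeddings)  # one decoder layer only
    draft_tok = int(np.argmax(draft_logits))
    return int(d2t[draft_tok])       # translate to the target model's vocabulary
```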
* CUDA: use a ring-buffer for cuda graphs

* bump limit to 128

* use LRU eviction

* better naming

* do periodic clean-up
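
A small sketch of the bookkeeping those bullets converge on (capacity-bounded cache, LRU eviction, periodic clean-up); the real code lives in the CUDA backend, and this Python version only models the policy:

```python
# Models the caching policy from the bullets above: bounded capacity
# (128), LRU eviction, periodic clean-up. The key type and graph payload
# are placeholders; the real structure holds captured CUDA graphs.

from collections import OrderedDict

class GraphCache:
    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self.graphs: "OrderedDict[str, object]" = OrderedDict()

    def get(self, key: str):
        if key not in self.graphs:
            return None                      # miss: caller captures a new graph
        self.graphs.move_to_end(key)         # mark as most recently used
        return self.graphs[key]

    def put(self, key: str, graph: object) -> None:
        self.graphs[key] = graph
        self.graphs.move_to_end(key)
        while len(self.graphs) > self.capacity:
            self.graphs.popitem(last=False)  # evict least recently used

    def cleanup(self, keep: int) -> None:
        # periodic clean-up: keep only the hottest `keep` entries
        while len(self.graphs) > keep:
            self.graphs.popitem(last=False)
```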
…ng (ggml-org#21052)

* Update workflows to remove dependence on llvmpipe

* Try setting Dawn_DIR

* remove c++20 initializers

* Move to proper guid

* Try avoiding segfaults on vulkan backend process exit

* Remove compiler warnings on parameter casting

* Fix soft_max and update reg_tile accumulation to f32 for better precision

* Refactor flash_attn a bit

* remove c++20 initializers and format

* Increase div precision for NVIDIA

* revert div precision and comment out ggml-ci node for now

* Formatting

* Try debugging on a failing CI node

* Revert "Try debugging on a failing CI node"

This reverts commit 1971e33.
* refactor bias tensor variable names

* use create_tensor_qkv for jina-bert-v2
* rpc : refactor the RPC transport

Move all transport related code into a separate file and use the
socket_t interface to hide all transport implementation details.

* fix win32

* better socket_t construction
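
The shape of that refactor, sketched in Python (socket_t is the C++ interface; the names below are illustrative stand-ins, not the actual rpc sources): callers talk to a narrow transport interface, and all wire details stay behind one implementation.

```python
# Illustrative shape of the refactor: code against a small transport
# interface (the role socket_t plays in the C++ sources) so transport
# details live in one place. Python names here are stand-ins.

from abc import ABC, abstractmethod
import socket

class Transport(ABC):
    @abstractmethod
    def send(self, data: bytes) -> None: ...

    @abstractmethod
    def recv(self, n: int) -> bytes: ...

class TcpTransport(Transport):
    def __init__(self, host: str, port: int):
        self.sock = socket.create_connection((host, port))

    def send(self, data: bytes) -> None:
        self.sock.sendall(data)

    def recv(self, n: int) -> bytes:
        buf = b""
        while len(buf) < n:                  # recv may return short reads
            chunk = self.sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed the connection")
            buf += chunk
        return buf
```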
* server : speculative decoding using checkpoints

* server : fix draft check with checkpoints

* server : rename spec vars

* server : log levels

* server : refactored spec logic to speculative.cpp

* server : renamed spec checkpoints option

* server : fix spec checkpoints, logging

* speculative : checkpoints with draft model, logging

* server : n_tokens_cur and create_checkpoint in draft

* server : fix server_speculative_callback (slot.id)

* spec : fix ngram-map/begin idx_last_check

* spec : init ckpt (begin() wasn't called)

* chore: update webui build output

* server : restore sampler in spec checkpoint and clear mem

* cont : avoid --spec-use-checkpoints argument

* cont : remove server_prompt_checkpoint_with_size

* spec : rename (leave_draft_state)

* cont : clean-up

* cont : do not ignore partial drafts even if they are short

* cont : spec callback owned by session

* cont : simplify

* cont : avoid empty speculative session

* cont : simplify

* cont : simplify

* cont : enable mtmd speculative decoding

* cont : keep the spec sampler alive

* cont : simplify

* cont : fix nullptr deref + draft checkpoints

* cont : remove common_speculative_accept_response

* cont : remove callback

* cont : simplify

* cont : minor

* cont : simplify

* cont : fix accepted number

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ggml-org#21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds.

Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform.

Reported-by: oobabooga

Refs: ggml-org#21630

Co-authored-by: texasich <texasich@users.noreply.github.com>
* convert : support sentence-transformer 5.4 config files

* fix: embeddinggemma

* fix: mapping

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix: pooling_mode

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>