[Track B] track ggml-org/llama.cpp#22105 (DFlash drafter) for castle adaptation #6
Draft
Leechael wants to merge 111 commits into
Conversation
EAGLE3 is an encoder-decoder based speculative decoding method:
- Extracts features from target model at specific layers
- Uses feature fusion layer to compress target features
- Generates draft tokens with single-layer decoder
- Maps draft vocabulary to target vocabulary via d2t tensor

Key changes:
- Add LLM_ARCH_EAGLE3 architecture
- Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
- Add feature extraction from target model layers
- Add g_embeddings handling for decoder input
- Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
- Add --eagle3 flag for speculative-simple example
- Add EAGLE3 model conversion in convert_hf_to_gguf.py
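A rough picture of the d2t draft-to-target vocabulary mapping mentioned above, as a toy NumPy sketch. Only the tensor name comes from the commit message; the array contents are made up, and the real tensor's encoding (absolute target ids vs. offsets) may differ.

```python
import numpy as np

# Toy d2t table: entry i holds the target-vocab id for draft-vocab id i.
# (Illustrative values only; draft vocab size here is 5.)
d2t = np.array([0, 11, 42, 137, 4096], dtype=np.int64)

def draft_to_target(draft_ids: np.ndarray) -> np.ndarray:
    """Map token ids sampled by the draft decoder into the target model's
    vocabulary before they are handed to the target for verification."""
    return d2t[draft_ids]

print(draft_to_target(np.array([2, 4])))  # -> [  42 4096]
```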
* CUDA: use a ring-buffer for cuda graphs
* bump limit to 128
* use LRU eviction
* better naming
* do periodic clean-up
…ng (ggml-org#21052)

* Update workflows to remove dependence on llvmpipe
* Try setting Dawn_DIR
* remove c++20 initializers
* Move to proper guid
* Try avoiding segfaults on vulkan backend process exit
* Remove compiler warnings on parameter casting
* Fix soft_max and update reg_tile accumulation to f32 for better precision
* Refactor flash_attn a bit
* remove c++20 initializers and format
* Increase div precision for NVIDIA
* revert div precision and comment out ggml-ci node for now
* Formatting
* Try debugging on a failing CI node
* Revert "Try debugging on a failing CI node"

  This reverts commit 1971e33.
* refactor bias tensor variable names
* use create_tensor_qkv for jina-bert-v2
* rpc : refactor the RPC transport

  Move all transport related code into a separate file and use the socket_t interface to hide all transport implementation details.

* fix win32
* better socket_t construction
* server : speculative decoding using checkpoints
* server : fix draft check with checkpoints
* server : rename spec vars
* server : log levels
* server : refactored spec logic to speculative.cpp
* server : renamed spec checkpoints option
* server : fix spec checkpoints, logging
* speculative : checkpoints with draft model, logging
* server : n_tokens_cur and create_checkpoint in draft
* server : fix server_speculative_callback (slot.id)
* spec : fix ngram-map/begin idx_last_check
* spec : init ckpt (begin() wasn't called)
* chore: update webui build output
* server : restore sampler in spec checkpoint and clear mem
* cont : avoid --spec-use-checkpoints argument
* cont : remove server_prompt_checkpoint_with_size
* spec : rename (leave_draft_state)
* cont : clean-up
* cont : do not ignore partial drafts even if the are short
* cont : spec callback owned by session
* cont : simplify
* cont : avoid empty speculative session
* cont : simplify
* cont : simplify
* cont : enable mtmd speculative decoding
* cont : keep the spec sampler alive
* cont : simplify
* cont : fix nullptr deref + draft checkpoints
* cont : remove common_speculative_accept_response
* cont : remove callback
* cont : simplify
* cont : minor
* cont : simplify
* cont : fix accepted number

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ggml-org#21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds. Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform.

Reported-by: oobabooga
Refs: ggml-org#21630
Co-authored-by: texasich <texasich@users.noreply.github.com>
* convert : support sentence-transformer 5.4 config files
* fix: embeddinggemma
* fix: mapping

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix: pooling_mode

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Tracking: #3 (Phase 2)
Purpose
Mirror of ggml-org/llama.cpp#22105 (DFlash drafter by ruixiang63), evaluated on castle hardware as an upstream-tracking alternative to the castle's own DFlash + DDTree implementation.
Current state
Branch starts identical to upstream-pr/22105 head (67cb0d507):

- 67cb0d507 dflash: enable llama-cli & llama-server with np=1
- e344c4a71 dflash: remove rebundant logic & correct bias naming
- 85a0089e6 dflash: add support for qwen3.5/3.6 moe models

No castle-side adjustments applied. Stock ggml-org#22105 was sufficient to run end-to-end on castle hardware.
Castle smoke test (single prompt)
30-prompt benchmark (HumanEval / GSM8K / Math500, 10 each, seed=42)
Same dataset and sampling as the castle self-implementation benchmark in docs/ddtree-dataset-eval-plan.md. KV q8_0, n-batch=2048, n-ubatch=512 (ggml-org#22105 has no --prompt-chunk flag; a large batch was chosen so the prompt fits in a single ubatch).

Comparison vs castle self-implementation (numbers from existing castle docs)
Interpretation
Upstream ggml-org#22105 stock sits between the two ends of the castle self-implementation:
- validate_tree_with_chain — runs N single-token chain decodes per step, effectively degenerating spec into AR plus overhead

Why ggml-org#22105 is fast under the correctness gate: it leans on upstream ggml-org#19493 speculative checkpointing (merged 2026-04-19). Rejected suffixes are rolled back via KV checkpoints, so the accepted path runs forward exactly once. Castle's self-implementation predates ggml-org#19493 and replaced it with a per-token chain validation walk.
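To make the cost difference concrete, here is a toy cost model in Python (not llama.cpp code) that only counts decode calls. The function names and the accept rule are illustrative assumptions; the point is that the checkpoint path verifies the whole draft in one batched decode and discards the rejected suffix by restoring a KV checkpoint, while the chain walk pays one decode per drafted token.

```python
# Toy cost model: "passes" counts decode calls only; everything else is illustrative.

def validate_with_checkpoint(draft: list[int], truth: list[int]) -> tuple[int, int]:
    """Checkpoint style (upstream ggml-org#19493): verify the whole draft in one
    batched decode, then restore the KV checkpoint past the last accepted token.
    The accepted path is computed exactly once."""
    passes = 1                      # one batched verification decode
    accepted = 0
    for d, t in zip(draft, truth):
        if d != t:
            break                   # rejected suffix dropped by checkpoint rollback
        accepted += 1
    return accepted, passes

def validate_with_chain(draft: list[int], truth: list[int]) -> tuple[int, int]:
    """Per-token chain walk (the validate_tree_with_chain behaviour described
    above): one single-token decode per drafted token, so up to N decodes per step."""
    passes = 0
    accepted = 0
    for d, t in zip(draft, truth):
        passes += 1                 # one decode per drafted token
        if d != t:
            break
        accepted += 1
    return accepted, passes

draft = [5, 9, 9, 3]                # drafted tokens for one speculation step
truth = [5, 9, 2, 7]                # what the target model would actually emit
print(validate_with_checkpoint(draft, truth))   # (2, 1): 2 accepted, 1 decode
print(validate_with_chain(draft, truth))        # (2, 3): 2 accepted, 3 decodes
```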
Trade-off
Status
This branch will be kept in sync with upstream-pr/22105 as ruixiang63 evolves it (waits on ggml-org#18039 EAGLE3 merge + ggerganov's unified spec API refactor).

No active development planned on top of this branch. If we later decide to port castle DDTree on top of ggml-org#22105 (to combine bit-equal + faster speedup), it will become a new branch (track-b/ddtree-on-22105 or similar).

Castle artifacts produced during this evaluation
- ~/workshop/lucebox-hub/dflash/deps/llama.cpp/build-track-b/ — CUDA build
- ~/workshop/lucebox-hub/dflash/models/draft/dflash-22105.gguf — DFlash GGUF re-converted with ggml-org/llama.cpp#22105's converter (architecture string is dflash, not the castle-specific dflash-draft)
- ~/workshop/lucebox-hub/dflash/deps/llama.cpp/.venv-track-b/ — Python venv for the convert script
- ~/workshop/lucebox-hub/dflash/deps/llama.cpp/scripts/bench_track_b.py — bench harness adapted for ggml-org/llama.cpp#22105 (parses llama-speculative-simple --dflash stdout; a hedged invocation/parsing sketch follows below)
- /tmp/bench-track-b-v2/ on castle — bench logs and results.json
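For reference, a minimal sketch of what a bench_track_b.py-style run might look like, assuming llama-speculative-simple's common flags (-md, -b, -ub, -ctk/-ctv) plus the --dflash flag from ggml-org#22105. The model paths and the summary-line format matched by the regex are assumptions, not taken from this PR; adjust them to whatever the binary actually prints.

```python
#!/usr/bin/env python3
"""Hedged sketch of a bench harness for the settings used in this evaluation."""
import re
import subprocess

def run_one(prompt: str) -> dict[str, float]:
    cmd = [
        "./build-track-b/bin/llama-speculative-simple",
        "-m",  "models/target.gguf",               # target model path (illustrative)
        "-md", "models/draft/dflash-22105.gguf",   # DFlash draft GGUF from this branch
        "--dflash",                                # drafter flag added by ggml-org#22105
        "-b", "2048", "-ub", "512",                # prompt fits in a single ubatch
        "-ctk", "q8_0", "-ctv", "q8_0",            # KV cache q8_0, as in the benchmark
        "-p", prompt, "-n", "512",
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    out = proc.stdout + proc.stderr                # stats may go to either stream
    stats: dict[str, float] = {}
    # Assumed summary lines of the form "n_drafted = 103", "accept = 88.3%".
    for key, val in re.findall(r"^\s*(n_drafted|n_accept|accept)\s*=\s*([\d.]+)", out, re.M):
        stats[key] = float(val)
    return stats

if __name__ == "__main__":
    print(run_one("Write a function that reverses a string."))
```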