Thank you for your work, @antirez. I've been following this closely, since I'll soon be receiving a 128GB Strix Halo that I want to port this to.
I've also been following the recent developments around MTP: ggml-org/llama.cpp#22673
Looking around, it seems the DeepSeek v4 architecture also brings MTP to the table: https://docs.vllm.ai/projects/ascend/en/v0.13.0/tutorials/DeepSeek-V4.html
I think this could bring another nice boost to decoding speed. One caveat from the llama.cpp PR above is an apparent regression in prompt processing speed; it's not yet clear whether that's an inherent side effect of MTP or just a regression in the PR's implementation. Another user doesn't see the same regression.
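In case it's useful context for others following along, here's a toy, self-contained Python sketch of why MTP-style drafting tends to help decode throughput. Everything in it (the stand-in "models", the acceptance rate, the function names) is illustrative and made up by me; it's not the llama.cpp API or the actual scheme in the PR, just the general draft-then-verify idea:

```python
# Toy illustration of MTP-style speculative decoding (an assumption about
# the general scheme, not the PR's implementation): a cheap draft head
# proposes k tokens, the expensive main model verifies them in what would
# be ONE batched forward pass, and the longest correct prefix is kept.

def main_model(prefix):
    # Stand-in for the expensive main model: next token is a hash of the prefix.
    return sum(prefix) * 31 % 100

def draft_head(prefix):
    # Stand-in for the cheap MTP head: agrees with the main model most of the time.
    t = main_model(prefix)
    return t if t % 4 else (t + 1) % 100

def decode_step(prefix, k=4):
    # 1. Draft k tokens autoregressively with the cheap head.
    draft, p = [], list(prefix)
    for _ in range(k):
        draft.append(draft_head(p))
        p.append(draft[-1])
    # 2. Verify the drafts with the main model. Done sequentially here for
    #    clarity; in a real implementation this is one batched forward pass.
    accepted, p = [], list(prefix)
    for t in draft:
        true_t = main_model(p)
        accepted.append(true_t)   # always emit the main model's token
        p.append(true_t)
        if true_t != t:           # first mismatch ends this step
            break
    return accepted               # 1..k tokens per main-model step

tokens, steps = [1, 2, 3], 0
while len(tokens) < 40:
    tokens += decode_step(tokens)
    steps += 1
print(f"{len(tokens) - 3} tokens generated in {steps} main-model steps")
```

Whenever several drafts are accepted, one main-model step emits several tokens, which is where the decode speedup comes from. Prefill, by contrast, already evaluates the whole prompt in parallel, so drafting has nothing to win there; my guess is that any prompt-processing slowdown would show up as pure overhead rather than an inherent cost of MTP, but the PR discussion should settle that.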
I have seen your videos on YouTube, and yes, I agree that this DeepSeek v4 Flash is truly a game changer for local AI. I can't wait to run it on my machine.