Potential 2x decoding speedup with MTP #5

@MirkoCovizzi

Description

Thank you for your work, @antirez. I've been following this project closely, since I'm receiving a 128GB Strix Halo soon and want to port this to it.

I've also been following the recent developments regarding MTP: ggml-org/llama.cpp#22673

From what I can tell, the DeepSeek v4 architecture also brings MTP to the table: https://docs.vllm.ai/projects/ascend/en/v0.13.0/tutorials/DeepSeek-V4.html

I think this could bring another nice boost to decoding speed. One caveat I see in the llama.cpp PR above: there appears to be a regression in prompt processing speed. I'm not sure yet whether that is an inherent side effect of MTP or just a regression in that PR's implementation; another user doesn't see the same regression.
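For context on why MTP can speed up decoding, here is a toy sketch of the draft-and-verify idea behind multi-token prediction. Everything here is a stand-in (the "models" are trivial functions, not DeepSeek or the llama.cpp implementation): a cheap draft head proposes several tokens per step, the main model verifies them, and the longest agreeing prefix is accepted, so one verify step can emit multiple tokens.

```python
# Toy illustration of MTP-style draft-and-verify decoding.
# Both "models" below are stand-in functions, NOT real model code.

def main_model(tokens):
    # Stand-in for the full model's next-token prediction.
    return (tokens[-1] + 1) % 50

def draft_head(tokens, k):
    # Stand-in MTP draft head: usually agrees with the main
    # model, but occasionally guesses wrong.
    ctx = list(tokens)
    out = []
    for _ in range(k):
        nxt = (ctx[-1] + 1) % 50
        if len(ctx) % 7 == 0:          # inject an occasional mismatch
            nxt = (nxt + 3) % 50
        out.append(nxt)
        ctx.append(nxt)
    return out

def mtp_decode_step(tokens, k=4):
    """Propose k draft tokens, verify them against the main model,
    and accept the matching prefix plus one corrected token.
    Returns (new_token_list, number_of_tokens_emitted)."""
    draft = draft_head(tokens, k)
    ctx = list(tokens)
    emitted = 0
    for t in draft:
        expected = main_model(ctx)     # in practice: one batched verify pass
        ctx.append(expected)           # verified token is always correct
        emitted += 1
        if t != expected:              # draft diverged: stop accepting
            break
    return ctx, emitted

tokens = [0]
steps = 0
while len(tokens) < 20:
    tokens, n = mtp_decode_step(tokens)
    steps += 1

# Output is identical to plain greedy decoding of main_model, but
# produced in fewer verify steps than tokens generated.
print(len(tokens) - 1, steps)
```

The key property is that verification makes the output exactly match what the main model alone would produce; the draft head only changes how many sequential steps that takes, which is why a good draft head speeds up decoding without changing results.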

I have seen your videos on YouTube, and yes, I agree that DeepSeek v4 Flash is truly a game changer for local AI. I can't wait to run it on my machine.
