Thank you for your work, @antirez. I've been following this closely, since I'll soon be receiving a 128GB Strix Halo that I want to port this to.
I've also been following the recent developments around MTP: ggml-org/llama.cpp#22673
Looking around, it seems the DeepSeek v4 architecture also brings MTP to the table: https://docs.vllm.ai/projects/ascend/en/v0.13.0/tutorials/DeepSeek-V4.html
I think this could bring another nice boost to decoding speed. One caveat from the llama.cpp PR above is an apparent regression in prompt processing speed; it's not yet clear whether that's an inherent side effect of MTP or just a regression in the PR's implementation. Another user doesn't see the same regression.
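In case it's useful context for others following along, here's a toy, self-contained Python sketch of why MTP-style drafting tends to help decode throughput. Everything in it (the stand-in "models", the acceptance rate, the function names) is illustrative and made up by me; it's not the llama.cpp API or the actual scheme in the PR, just the general draft-then-verify idea:

```python
# Toy illustration of MTP-style speculative decoding (an assumption about
# the general scheme, not the PR's implementation): a cheap draft head
# proposes k tokens, the expensive main model verifies them in what would
# be ONE batched forward pass, and the longest correct prefix is kept.

def main_model(prefix):
    # Stand-in for the expensive main model: next token is a hash of the prefix.
    return sum(prefix) * 31 % 100

def draft_head(prefix):
    # Stand-in for the cheap MTP head: agrees with the main model most of the time.
    t = main_model(prefix)
    return t if t % 4 else (t + 1) % 100

def decode_step(prefix, k=4):
    # 1. Draft k tokens autoregressively with the cheap head.
    draft, p = [], list(prefix)
    for _ in range(k):
        draft.append(draft_head(p))
        p.append(draft[-1])
    # 2. Verify the drafts with the main model. Done sequentially here for
    #    clarity; in a real implementation this is one batched forward pass.
    accepted, p = [], list(prefix)
    for t in draft:
        true_t = main_model(p)
        accepted.append(true_t)   # always emit the main model's token
        p.append(true_t)
        if true_t != t:           # first mismatch ends this step
            break
    return accepted               # 1..k tokens per main-model step

tokens, steps = [1, 2, 3], 0
while len(tokens) < 40:
    tokens += decode_step(tokens)
    steps += 1
print(f"{len(tokens) - 3} tokens generated in {steps} main-model steps")
```

Whenever several drafts are accepted, one main-model step emits several tokens, which is where the decode speedup comes from. Prefill, by contrast, already evaluates the whole prompt in parallel, so drafting has nothing to win there; my guess is that any prompt-processing slowdown would show up as pure overhead rather than an inherent cost of MTP, but the PR discussion should settle that.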
I have seen your videos on YouTube, and yes, I agree that this DeepSeek v4 Flash is truly a game changer for local AI. I can't wait to run it on my machine.