feat: one-sided target probability acceptance for MTP drafts increases acceptance rate and throughput compared to argmax alone #8
Open
sujitvasanth wants to merge 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from
…s acceptance rate and throughput compared to argmax alone

MTP drafters use greedy argmax internally; by design, for speed, they do not expose a full logit distribution. This change adds a further tok/s improvement by allowing users to tune the acceptance threshold, achieving ~20% throughput gains by accepting more draft tokens while retaining the ability to manually verify the threshold at which semantic breakdown occurs for their specific model/task combination.

When the drafter and target model disagree on a token, rather than immediately rejecting (standard argmax behaviour), --draft-p-accept triggers a one-sided softmax check over the target model's logits for the draft token. If the target assigns p >= draft-p-accept to that token, it is accepted in place of the target's own argmax prediction and decoding continues.

No drafter logits are required, keeping the drafter inference path unchanged and preserving the speed advantage of argmax-only drafting. This is intentionally lighter than the full ratio test in the MTP paper.

Changes:
- common/sampling.cpp: add p_accept parameter to sample_and_accept_n; on drafter/target disagreement compute softmax over target logits and accept draft token if p_target(draft_token) >= p_accept
- common/sampling.h: update both overloads of sample_and_accept_n signature
- common/arg.cpp: register --draft-p-accept CLI argument
- common/common.h: add p_accept field to common_params_speculative struct
- tools/server/server-context.cpp: wire p_accept into speculative config

Usage:
--draft-p-accept 0.005 # accept draft token if p_target >= 0.005
--draft-p-accept 0.0 # standard argmax-only behaviour (default)
Overview
MTP drafters use greedy argmax internally; by design, for speed, they do not expose a full logit distribution. This change adds a further tok/s improvement by letting users tune the acceptance threshold, achieving ~20% throughput gains by accepting more draft tokens. Users can manually verify the threshold at which semantic breakdown occurs for their specific model/task combination.
When the drafter and target model disagree on a token, rather than immediately rejecting (standard argmax behaviour), --draft-p-accept triggers a one-sided softmax check over the target model's logits for the draft token. If the target assigns p >= draft-p-accept to that token, it is accepted in place of the target's own argmax prediction and decoding continues.
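The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in common/sampling.cpp: the names `target_prob` and `accept_draft` are hypothetical, and treating `p_accept <= 0` as "check disabled" is an assumption made so that 0.0 reproduces argmax-only behaviour.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: softmax probability that the target assigns to one token.
// The max logit is subtracted first for numerical stability.
static float target_prob(const std::vector<float> & logits, std::size_t token) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (float l : logits) {
        sum += std::exp(static_cast<double>(l - max_logit));
    }
    const double num = std::exp(static_cast<double>(logits[token] - max_logit));
    return static_cast<float>(num / sum);
}

// Hypothetical decision rule for a single draft token.
static bool accept_draft(const std::vector<float> & logits,
                         std::size_t draft_token,
                         float p_accept) {
    const std::size_t target_token = static_cast<std::size_t>(
        std::distance(logits.begin(),
                      std::max_element(logits.begin(), logits.end())));
    if (draft_token == target_token) {
        return true;   // drafter and target agree: always accept
    }
    if (p_accept <= 0.0f) {
        return false;  // assumption: 0.0 disables the check (argmax-only)
    }
    // One-sided check: only the target's probability is consulted.
    return target_prob(logits, draft_token) >= p_accept;
}
```

Only the target's logits are read here, which is why no change to the drafter is needed for this kind of check.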
No drafter logits are required, keeping the drafter inference path unchanged and preserving the speed advantage of argmax-only drafting. This is intentionally lighter than the full ratio test in the MTP paper.
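For comparison, the two acceptance rules can be written out. The notation is introduced here for illustration only: $z$ are the target logits, $d$ is the draft token, $\tau$ is the value of --draft-p-accept, and $q(d)$ is the drafter's probability for $d$. The second line assumes that "full ratio test" refers to the standard speculative-sampling acceptance rule; the PR does not spell it out.

```latex
% One-sided check (this PR): needs only the target distribution
p(d) = \frac{\exp(z_d)}{\sum_j \exp(z_j)}, \qquad
\text{accept } d \iff p(d) \ge \tau

% Standard speculative-sampling ratio test (assumed reference point):
% needs the drafter probability q(d) as well
\Pr[\text{accept } d] = \min\!\left(1, \frac{p(d)}{q(d)}\right)
```

Because the one-sided rule never evaluates $q(d)$, it can be applied to a drafter that emits only argmax tokens. The trade-off is that its output can differ from the target's own greedy decoding, which is why the threshold is left to the user to tune.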
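Changes:
- common/sampling.cpp: add p_accept parameter to sample_and_accept_n; on drafter/target disagreement compute softmax over target logits and accept draft token if p_target(draft_token) >= p_accept
- common/sampling.h: update both overloads of sample_and_accept_n signature
- common/arg.cpp: register --draft-p-accept CLI argument
- common/common.h: add p_accept field to common_params_speculative struct
- tools/server/server-context.cpp: wire p_accept into speculative config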
Usage:
--draft-p-accept 0.005 # accept draft token if p_target >= 0.005
--draft-p-accept 0.0 # standard argmax-only behaviour (default)
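To illustrate how the two settings above might play out over a whole draft, here is a hedged sketch of an accept-n loop. It is not the actual `sample_and_accept_n` implementation: `accept_n_sketch`, `argmax_token`, and `softmax_prob` are hypothetical names, and the sketch assumes the target supplies one row of logits per draft position plus one extra row for the token after the last draft.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

static int argmax_token(const std::vector<float> & z) {
    return static_cast<int>(std::distance(z.begin(), std::max_element(z.begin(), z.end())));
}

static float softmax_prob(const std::vector<float> & z, int tok) {
    const float m = *std::max_element(z.begin(), z.end());
    double sum = 0.0;
    for (float l : z) {
        sum += std::exp(static_cast<double>(l - m));
    }
    return static_cast<float>(std::exp(static_cast<double>(z[static_cast<std::size_t>(tok)] - m)) / sum);
}

// rows[i] holds the target logits at draft position i; rows must have
// draft.size() + 1 entries (assumption of this sketch).
static std::vector<int> accept_n_sketch(const std::vector<std::vector<float>> & rows,
                                        const std::vector<int> & draft,
                                        float p_accept) {
    std::vector<int> out;
    std::size_t i = 0;
    for (; i < draft.size(); ++i) {
        const int best = argmax_token(rows[i]);
        if (draft[i] == best) {
            out.push_back(best);      // agreement: keep going
            continue;
        }
        // Disagreement: keep the draft token only if the check is enabled
        // and the target gives it at least p_accept probability.
        if (p_accept > 0.0f && softmax_prob(rows[i], draft[i]) >= p_accept) {
            out.push_back(draft[i]);
            continue;
        }
        out.push_back(best);          // reject: emit the target's token and stop
        return out;
    }
    out.push_back(argmax_token(rows[i]));  // whole draft accepted: one bonus token
    return out;
}
```

With `p_accept = 0.0f` the second branch is never taken, so the loop reduces to plain argmax matching. The explicit `p_accept > 0.0f` guard is needed because a bare `p >= 0.0` comparison would accept every draft token.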
Requirements
Best test: 15.5 t/s with a 300,000-token context.