feat: one-sided target probability acceptance for MTP drafts increases acceptance rate and throughput compared to argmax alone #8

Open
sujitvasanth wants to merge 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from sujitvasanth:feat/draft-p-accept

@sujitvasanth sujitvasanth commented May 11, 2026

Overview

MTP drafters use greedy argmax internally; by design, and for speed, they do not expose a full logit distribution. This change adds a further tok/s improvement by letting users tune the acceptance threshold, achieving ~20% throughput gains by accepting more draft tokens. Users can manually verify the threshold at which semantic breakdown occurs for their specific model/task combination.

When the drafter and target model disagree on a token, rather than immediately rejecting (standard argmax behaviour), --draft-p-accept triggers a one-sided softmax check over the target model's logits for the draft token. If the target assigns p >= draft-p-accept to that token, it is accepted in place of the target's own argmax prediction and decoding continues.

No drafter logits are required, keeping the drafter inference path unchanged and preserving the speed advantage of argmax-only drafting. This is intentionally lighter than the full ratio test in the MTP paper.
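The check described above can be sketched in isolation as follows. This is an illustrative sketch only, not the code from this PR: the helper names `target_prob_of` and `accept_draft_token` are hypothetical, logits are passed as a plain `std::vector<float>`, and the sketch assumes that a threshold of 0.0 disables the relaxed path (applied literally, `p >= 0.0` would accept every draft token).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: probability that a softmax over `logits` assigns to
// `token`. Subtracting the max logit keeps the exponentials from overflowing.
static float target_prob_of(const std::vector<float> & logits, std::size_t token) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (const float l : logits) {
        sum += std::exp(static_cast<double>(l) - max_logit);
    }
    return static_cast<float>(std::exp(static_cast<double>(logits[token]) - max_logit) / sum);
}

// Hypothetical decision rule for a single draft token.
static bool accept_draft_token(const std::vector<float> & logits,
                               std::size_t draft_token,
                               float p_accept) {
    const std::size_t target_argmax = static_cast<std::size_t>(
        std::max_element(logits.begin(), logits.end()) - logits.begin());

    if (draft_token == target_argmax) {
        return true;  // drafter and target agree: standard acceptance
    }
    if (p_accept <= 0.0f) {
        return false; // assumption: 0.0 means argmax-only behaviour
    }
    // One-sided check: only the target's probability is consulted.
    return target_prob_of(logits, draft_token) >= p_accept;
}
```

Only target logits appear in the rule, which is why the drafter path does not need to change. The softmax is evaluated only on disagreement, so the extra cost is paid on tokens that would otherwise have been rejected.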

Changes:

  • common/sampling.cpp: add p_accept parameter to sample_and_accept_n; on drafter/target disagreement compute softmax over target logits and accept draft token if p_target(draft_token) >= p_accept
  • common/sampling.h: update the signatures of both overloads of sample_and_accept_n
  • common/arg.cpp: register --draft-p-accept CLI argument
  • common/common.h: add p_accept field to common_params_speculative struct
  • tools/server/server-context.cpp: wire p_accept into speculative config
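To show how a p_accept parameter might thread through an accept loop, here is a simplified standalone sketch. It does not reproduce the actual `sample_and_accept_n` implementation or its signature: `accept_drafts_sketch`, `greedy_token`, `softmax_prob` and `token_t` are hypothetical names, the target is assumed to pick tokens greedily, and `target_logits` is assumed to hold one row per draft position plus one extra row for the token that follows a fully accepted draft.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using token_t = int; // stand-in for the real token id type

static token_t greedy_token(const std::vector<float> & logits) {
    return static_cast<token_t>(
        std::max_element(logits.begin(), logits.end()) - logits.begin());
}

static double softmax_prob(const std::vector<float> & logits, token_t tok) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (const float l : logits) {
        sum += std::exp(static_cast<double>(l) - max_logit);
    }
    return std::exp(static_cast<double>(logits[tok]) - max_logit) / sum;
}

// Returns the tokens emitted for this step: the accepted draft prefix,
// followed by exactly one token chosen by the target.
static std::vector<token_t> accept_drafts_sketch(
        const std::vector<std::vector<float>> & target_logits, // draft.size() + 1 rows
        const std::vector<token_t> & draft,
        float p_accept) {
    std::vector<token_t> out;
    std::size_t i = 0;
    for (; i < draft.size(); ++i) {
        const token_t best  = greedy_token(target_logits[i]);
        const bool    agree = draft[i] == best;
        // Relaxed path: only evaluated on disagreement and when enabled.
        const bool relaxed = !agree && p_accept > 0.0f &&
                             softmax_prob(target_logits[i], draft[i]) >= p_accept;
        if (!agree && !relaxed) {
            out.push_back(best); // reject: emit the target's token and stop
            return out;
        }
        out.push_back(draft[i]); // accept the draft token
    }
    out.push_back(greedy_token(target_logits[i])); // whole draft accepted
    return out;
}
```

In speculative decoding the target evaluates position i + 1 with draft token i already in its context, so keeping a draft token that differs from the target's argmax leaves the remaining rows of logits usable. This is what allows decoding to continue after a relaxed acceptance.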

Usage:

--draft-p-accept 0.005 # accept draft token if p_target >= 0.005
--draft-p-accept 0.0 # standard argmax-only behaviour (default)

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, co-written with Claude. I have read, checked, compiled and tested the code on a combined RTX 3060 + GTX 1660 setup on Ubuntu 20.04. Throughput improves by about 20% with no breakdown of output coherence, and the acceptance rate rises as draft-p-accept is lowered, as expected.
    Best test result: 15.5 t/s with a 300,000-token context.
