@drewjin drewjin commented Dec 29, 2025

image

Description

This PR refactors and optimizes the D2F (Discrete Diffusion Forcing) native inference strategy. The core change replaces the current decoding logic with a fixed-size FIFO buffer that manages D2F computation blocks.

Key Optimizations

Keeping the buffer at a constant size (4 computation blocks by default) fixes the decoding sequence length. This design choice yields two performance benefits:

  1. Scheduling Efficiency: Simplifies the scheduler's logic by eliminating the overhead associated with managing dynamic, variable-length blocks.
  2. Computation & Compilation Gains: A fixed sequence length enables better static optimization for kernels (e.g., Triton), preventing frequent re-compilations and improving overall GPU hardware utilization.
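
To illustrate the compilation point, here is a toy sketch, not code from this PR. It uses `functools.lru_cache` as a stand-in for any JIT cache that specializes a kernel on sequence length; it does not model Triton's actual caching rules. `BLOCK_SIZE`, `compile_kernel`, and `count_compiles` are hypothetical names, and 32 tokens per block is an assumed value.

```python
from functools import lru_cache

BLOCK_SIZE = 32     # assumption: tokens per computation block
BUFFER_BLOCKS = 4   # default buffer capacity stated in this PR


@lru_cache(maxsize=None)
def compile_kernel(seq_len: int) -> str:
    """Hypothetical stand-in for a JIT that specializes on seq_len.

    Every cache miss represents one (expensive) compilation.
    """
    return f"kernel<seq_len={seq_len}>"


def count_compiles(seq_lens) -> int:
    """Run one decode loop and return how many compilations it triggered."""
    compile_kernel.cache_clear()  # also resets hit/miss statistics
    for n in seq_lens:
        compile_kernel(n)
    return compile_kernel.cache_info().misses


# Variable-length decoding: the number of active blocks changes per step.
dynamic_lens = [b * BLOCK_SIZE for b in (1, 2, 3, 4, 3, 2, 4)]
# Fixed window: every step sees BUFFER_BLOCKS * BLOCK_SIZE tokens.
fixed_lens = [BUFFER_BLOCKS * BLOCK_SIZE] * 7

dynamic_compiles = count_compiles(dynamic_lens)
fixed_compiles = count_compiles(fixed_lens)
print(dynamic_compiles, fixed_compiles)
```

In this toy model the variable-length loop compiles once per distinct length, while the fixed window compiles once in total. Whether a real kernel recompiles depends on which arguments the JIT actually specializes on.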

Technical Highlights

  • Fixed-size FIFO Mechanism: Implements a sliding window FIFO buffer with a default capacity of 4 computation blocks.
  • Logic Refactoring: Comprehensive overhaul of scheduling and computation logic to align with the fixed-size window (refer to the provided algorithm diagram for details).
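
A minimal sketch of a fixed-size FIFO window follows, assuming each computation block can be represented by an ID and that admitting a new block into a full buffer evicts the oldest one. `Block`, `FixedBlockBuffer`, and their methods are hypothetical names for illustration and are not taken from the Diffulex code base.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """Hypothetical stand-in for one D2F computation block."""
    block_id: int


class FixedBlockBuffer:
    """FIFO window that never holds more than `capacity` blocks."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        self._blocks: deque = deque()

    def push(self, block: Block) -> Optional[Block]:
        """Admit a block; return the evicted oldest block if the buffer was full."""
        evicted = None
        if len(self._blocks) == self.capacity:
            evicted = self._blocks.popleft()  # oldest block leaves first
        self._blocks.append(block)
        return evicted

    def window(self) -> List[int]:
        """IDs of the blocks currently in the window, oldest first."""
        return [b.block_id for b in self._blocks]


buf = FixedBlockBuffer()  # default capacity of 4, as in this PR
evicted = [buf.push(Block(i)) for i in range(6)]
evicted_ids = [b.block_id if b is not None else None for b in evicted]
print(evicted_ids)   # [None, None, None, None, 0, 1]
print(buf.window())  # [2, 3, 4, 5]
```

The explicit `popleft` is used instead of `deque(maxlen=...)` so that the evicted block is returned to the caller, which a scheduler would likely need in order to finalize or cache it.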

TODO List

  • Refactor d2f strategy engine.
  • Refactor d2f attention metadata.
  • Adapt d2f attention kernels to the fixed-window strategy.

@github-actions

👋 Hi! Thank you for contributing to the Diffulex project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@drewjin drewjin mentioned this pull request Jan 5, 2026
17 tasks