[Feat] Optimize D2F Decoding Strategy to Support CUDA Graph and More Efficient Inference #17

drewjin · 2025-12-29T14:02:30Z

Description

This PR refactors and optimizes the native D2F (Discrete Diffusion Forcing) inference strategy. The core change moves the current decoding logic to a fixed-size FIFO buffer that manages D2F computation blocks.

Key Optimizations

Keeping the buffer size constant (4 computation blocks by default) fixes the decoding sequence length across steps. This design choice yields two performance benefits:

Scheduling Efficiency: The scheduler no longer has to manage dynamic, variable-length blocks, which simplifies its logic and removes the associated overhead.
Computation & Compilation Gains: A fixed sequence length allows better static optimization of kernels (e.g., Triton), avoids frequent recompilation, and improves GPU utilization. Fixed shapes are also what CUDA Graph capture and replay require, which is the support named in the PR title.
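To make the compilation argument concrete, here is a toy sketch in plain Python. It is not Diffulex code: `ShapeKeyedCache`, `count_builds`, and `BLOCK_SIZE = 32` are hypothetical names and values chosen for illustration. It only models the general idea that a kernel or graph cache keyed by input shape builds one entry per distinct sequence length.

```python
class ShapeKeyedCache:
    """Toy stand-in for a kernel or graph cache keyed by input shape."""

    def __init__(self):
        self.compiled = {}

    def run(self, seq_len):
        # A new sequence length forces a new "build" (compile or capture).
        if seq_len not in self.compiled:
            self.compiled[seq_len] = f"kernel<seq_len={seq_len}>"
        return self.compiled[seq_len]


def count_builds(seq_lens):
    """Return how many distinct builds a stream of decode steps triggers."""
    cache = ShapeKeyedCache()
    for n in seq_lens:
        cache.run(n)
    return len(cache.compiled)


BLOCK_SIZE = 32  # hypothetical tokens per computation block

# Variable number of active blocks per step versus a fixed window of 4.
variable_steps = [BLOCK_SIZE * k for k in (1, 2, 3, 2, 4, 3, 1)]
fixed_steps = [BLOCK_SIZE * 4] * 7

variable_builds = count_builds(variable_steps)
fixed_builds = count_builds(fixed_steps)
```

With the fixed window, every decode step presents the same shape, so a single build is reused for all seven steps, while the variable-length stream needs one build per distinct length.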

Technical Highlights

Fixed-size FIFO Mechanism: Implements a sliding-window FIFO buffer with a default capacity of 4 computation blocks.
Logic Refactoring: Overhauls the scheduling and computation logic to align with the fixed-size window (refer to the provided algorithm diagram for details).
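A minimal sketch of how such a fixed-size FIFO could behave, assuming that a block is retired from the front once all of its tokens are unmasked and that a fresh, fully masked block is appended at the back. The names `Block`, `FixedBlockBuffer`, `MASK`, and the `block_size` default are hypothetical and do not come from the Diffulex codebase; the actual retirement rule is defined by the PR's algorithm diagram.

```python
from collections import deque
from dataclasses import dataclass

MASK = -1  # hypothetical id for a still-masked token


@dataclass
class Block:
    block_id: int
    tokens: list

    @property
    def done(self):
        # A block is finished when no token is still masked.
        return all(t != MASK for t in self.tokens)


class FixedBlockBuffer:
    """FIFO window that always holds exactly `capacity` blocks."""

    def __init__(self, capacity=4, block_size=4):
        if capacity <= 0 or block_size <= 0:
            raise ValueError("capacity and block_size must be positive")
        self.capacity = capacity
        self.block_size = block_size
        self._next_id = 0
        self.blocks = deque()
        self.committed = []  # tokens of retired blocks, in order
        for _ in range(capacity):
            self._push_fresh()

    def _push_fresh(self):
        self.blocks.append(Block(self._next_id, [MASK] * self.block_size))
        self._next_id += 1

    def seq_len(self):
        # Constant by construction: the window never grows or shrinks.
        return self.capacity * self.block_size

    def step(self):
        """Retire finished blocks from the front and refill at the back."""
        retired = 0
        while self.blocks[0].done:
            self.committed.extend(self.blocks.popleft().tokens)
            self._push_fresh()
            retired += 1
        return retired


buf = FixedBlockBuffer()
buf.blocks[0].tokens = [11, 12, 13, 14]  # pretend the oldest block was decoded
retired = buf.step()
```

Because every retirement is paired with an append, `len(buf.blocks)` and `buf.seq_len()` stay constant across steps, which is the property the fixed-window strategy relies on.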

TODO List

Refactor the D2F strategy engine.
Refactor the D2F attention metadata.
Adapt the D2F attention kernels to the fixed-window strategy.

github-actions · 2025-12-29T16:02:25Z

👋 Hi! Thank you for contributing to the Diffulex project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

chore: add Tilelang-failed_test_cases to .gitignore

e3e2102

drewjin mentioned this pull request Jan 5, 2026

[Release Plan] v0.0.1 #14

Open

17 tasks
