Convert open-weight autoregressive LLMs into diffusion language models. Locally, with one command.
Status: Phase 0 (walking skeleton) landed. The CLI, contracts codegen, worker protocol, and run directories exist and run end to end as a no-op pipeline. The actual conversion recipe lands in Phase 2 of the roadmap; nothing converts a real model yet. Contributions welcome - see CONTRIBUTING.md.
Let's be precise about the claim, because the field deserves it:
a2d is not a new method. Converting an AR model into a diffusion model by continued training is established research — AR2Diff formulated it, DiffuGPT/DiffuLLaMA demonstrated it from 127M to 7B, and Dream 7B shipped a strong open model built exactly this way (initialized from Qwen2.5). "Recipes for any AR model" exist too (Tiny-A2D, the dLLM library). If you want the science, start with those papers — and see Prior art, which a2d builds on directly.
a2d is the first tool that makes that method universal, safe, and one command. What exists today is research scripts, a training library, and one-off model releases. What doesn't exist — and what a2d is — is a product you point at an arbitrary local checkpoint and get:
- A verdict before you spend anything. `a2d detect` reads the model's config (no weights, no GPU) and tells you whether it can convert, how, and, when it can't, exactly why.
- Automated conversion with a safety gate. Attention surgery, an identity check that proves the surgery changed nothing before training starts, then the diffusion training recipe.
- Reproducible runs. Every conversion writes a manifest (source hash, config, capability set, test results) — same inputs, same run.
- An honest no. Architectures the recipe doesn't fit are rejected with reasons, never silently mis-converted.
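
To make the reproducibility point concrete, here is a minimal sketch of what writing such a manifest could look like. The schema (`source_sha256`, `config`, `capabilities`, `tests`) and the function names are assumptions for illustration; the source only says the manifest records the source hash, config, capability set, and test results.

```python
import hashlib
import json
from pathlib import Path


def hash_checkpoint(model_dir: Path) -> str:
    """Hash every file under model_dir in a stable (sorted) order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in model_dir.rglob("*") if p.is_file()):
        # Mix in the relative path so a renamed file changes the hash too.
        digest.update(path.relative_to(model_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        # Stream in 1 MiB chunks so large weight files are never fully in memory.
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_dir: Path, config: dict, capabilities: list[str], tests: dict) -> str:
    """Return a canonical JSON manifest (hypothetical schema)."""
    manifest = {
        "source_sha256": hash_checkpoint(model_dir),
        "config": config,
        "capabilities": sorted(capabilities),
        "tests": tests,
    }
    # sort_keys makes the serialization independent of dict insertion order.
    return json.dumps(manifest, sort_keys=True, indent=2)
```

Sorting the file list, the capability list, and the JSON keys is what lets two runs over the same inputs produce byte-identical manifests in this sketch.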
Architecture-universal within the AR-transformer family — dense and Mixture-of-Experts. That covers the Llama / Qwen / GLM / Gemma / OLMoE lineage, i.e. most open-weight releases. It does not cover other paradigms: Mamba/SSM, encoder-decoder, or models that are already non-autoregressive. Those get a clean unsupported at the gate, with reasons.
New model releases that fit the family should work on day one with no a2d code changes: detection is generic over the standard HF config vocabulary, and conversion runs on the HF ecosystem as soon as transformers supports the model. Support is defined by capabilities (attention variant, FFN family, weight format), not by a hardcoded model list, so gaps are visible, named, and additively fixable.
| Converts | Rejected (with reasons) |
|---|---|
| Dense AR transformers (GPT-2, Pythia, Llama, Qwen…) | SSM/Mamba (paradigm) |
| MoE AR transformers (OLMoE, Qwen-MoE…) | MLA-attention models (attn.mla) — until a handler lands |
| Sliding-window / attention-sink models (Mistral, Gemma, GPT-OSS) — planned, capability-gated | ONNX-only exports (format) |
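
A rough sketch of capability-based, config-only detection follows. It is not a2d's actual detector: the `Verdict` type, the `detect` function, and every capability name except `attn.mla` are hypothetical. The config keys it reads (`model_type`, `is_encoder_decoder`, `kv_lora_rank`, `num_experts`, `num_local_experts`) are ones that appear in Hugging Face `config.json` files. Sliding-window and weight-format checks are left out for brevity.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path

SSM_MODEL_TYPES = {"mamba", "mamba2"}              # illustrative, not exhaustive
SUPPORTED = {"attn.full", "ffn.dense", "ffn.moe"}  # hypothetical capability names


@dataclass
class Verdict:
    convertible: bool
    capabilities: list[str] = field(default_factory=list)
    reasons: list[str] = field(default_factory=list)


def detect(config: dict) -> Verdict:
    """Map a parsed config.json to a verdict without touching weights."""
    reasons = []
    # Paradigm checks come first: these models are out of scope entirely.
    if config.get("model_type") in SSM_MODEL_TYPES:
        reasons.append("paradigm: SSM/Mamba is not an AR transformer")
    if config.get("is_encoder_decoder"):
        reasons.append("paradigm: encoder-decoder models are out of scope")
    if reasons:
        return Verdict(False, [], reasons)

    # Derive capabilities from config vocabulary, not from a model list.
    caps = ["attn.mla" if "kv_lora_rank" in config else "attn.full"]
    is_moe = any(key in config for key in ("num_experts", "num_local_experts"))
    caps.append("ffn.moe" if is_moe else "ffn.dense")

    # Any capability without a handler becomes a named, visible gap.
    reasons = [f"{cap}: no handler yet" for cap in caps if cap not in SUPPORTED]
    return Verdict(not reasons, caps, reasons)


def detect_path(model_dir: str) -> Verdict:
    """Read only config.json from a local checkpoint directory."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return detect(config)
```

Under these assumptions, `detect({"model_type": "llama"})` is convertible with capabilities `["attn.full", "ffn.dense"]`, while a config carrying `kv_lora_rank` is rejected with the single reason `attn.mla: no handler yet`. Adding a handler would then amount to extending `SUPPORTED`.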
The model already knows language; that knowledge lives in weights a2d never touches. The conversion changes how it reads and what it practices:
- Detect & gate: parse `config.json` → normalized spec → capability check.
- Patch: open causal attention to bidirectional (gradually, via annealing); drop the next-token shift.
- Identity gate: at anneal=0 the patched model must match the original's logits exactly; a mismatch aborts the run before any training compute is spent.
- Train: masked-diffusion objective (MDLM; block diffusion later), i.e. fill-in-the-blank at varying mask ratios over a few billion tokens.
- Output: a standard HF-layout checkpoint (plus provenance manifest and eval report) that loads with normal tooling.
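
The patch and identity-gate steps can be illustrated on a toy single-head attention layer. This is a sketch under stated assumptions, not a2d's implementation: the annealing schedule (each query sees a growing window of future keys) is one possible choice, and the real gate compares full-model logits rather than a single attention output. All function names are hypothetical.

```python
import math


def annealed_mask(n: int, anneal: float) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Assumed schedule: query i sees every past key plus the next
    floor(anneal * (n - 1)) future keys, so anneal=0 is purely causal
    and anneal=1 is fully bidirectional.
    """
    if not 0.0 <= anneal <= 1.0:
        raise ValueError("anneal must be in [0, 1]")
    lookahead = math.floor(anneal * (n - 1))
    return [[j <= i + lookahead for j in range(n)] for i in range(n)]


def attention(q, k, v, mask):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        scores = [
            scale * sum(a * b for a, b in zip(qi, kj)) if mask[i][j] else -math.inf
            for j, kj in enumerate(k)
        ]
        top = max(scores)  # finite: the diagonal is always visible
        weights = [math.exp(s - top) for s in scores]  # masked keys get 0.0
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def identity_gate(q, k, v) -> bool:
    """True when the patched layer at anneal=0 reproduces the causal one."""
    n = len(q)
    causal = [[j <= i for j in range(n)] for i in range(n)]
    reference = attention(q, k, v, causal)
    patched = attention(q, k, v, annealed_mask(n, 0.0))
    return reference == patched  # exact match, as the gate requires
```

Because the anneal=0 mask is exactly the causal mask, both calls run the same arithmetic and the comparison can be exact. A conversion run would abort if `identity_gate` returned `False`, before any training step is taken.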
```
# planned CLI
a2d detect ./models/qwen2.5-1.5b    # verdict + plan, config-only
a2d convert ./models/qwen2.5-1.5b --out runs/qwen-diff --data ./corpus
a2d eval runs/qwen-diff
a2d sample runs/qwen-diff -p "The cat"
```
Weights must already be downloaded — a2d never fetches models. Third-party model code (trust_remote_code) never runs without an explicit flag.
Phased, no dates: walking skeleton → detection & gate (no GPU) → dense conversion (GPT-2) → eval harness → MoE (OLMoE) → block diffusion & fast sampling → hard architectures (SWA/sinks/quantized: Mistral, Gemma, GPT-OSS) → polish. Details and exit criteria: docs/SPEC-HANDOFF.md.
a2d packages other people's science. Read and cite them:
| Work | What it contributed |
|---|---|
| AR2Diff — Transfer Learning for Text Diffusion | The pretrain-AR → continue-as-diffusion formulation |
| DiffuGPT / DiffuLLaMA | Demonstrated adaptation 127M–7B, <200B tokens; the recipe a2d's core follows |
| Dream 7B | AR-init diffusion at scale (from Qwen2.5); context-adaptive noise rescheduling |
| MDLM | The masked-diffusion objective a2d uses first |
| BD3LM / block diffusion | Block-parallel objective/sampling (planned) |
| LLaDA | From-scratch proof that diffusion LMs scale competitively |
| dLLM library / Tiny-A2D | Open training/eval infra and any-model recipes; a candidate dependency of a2d's worker |
| Full landscape | docs/LANDSCAPE.md |
If a2d's framing ever drifts toward claiming the method — file an issue. The honest claim is the product.
- `docs/ARCHITECTURE.md`: the ML conversion recipe (objectives, annealing, identity test)
- `docs/SPEC-HANDOFF.md`: the tool (lifecycle, extensibility model, contracts, roadmap)
- `docs/LANDSCAPE.md`: prior art & positioning
MIT. See the LICENSE file for the full terms.