DiffiT: Diffusion Vision Transformers for Image Generation

Official PyTorch implementation of DiffiT: Diffusion Vision Transformers for Image Generation.

For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing

DiffiT (Diffusion Vision Transformers) is a generative model that combines the expressive power of diffusion models with Vision Transformers (ViTs). It introduces Time-dependent Multihead Self Attention (TMSA) for fine-grained control over the denoising process at each timestep. DiffiT achieves state-of-the-art performance on class-conditional ImageNet generation at multiple resolutions, notably an FID score of 1.73 on ImageNet-256.
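
As a rough illustration of the TMSA idea, the sketch below assumes the formulation described in the paper, where queries, keys, and values are each the sum of a projection of the spatial token and a projection of the time embedding. It is a single-head, pure-Python toy that omits the relative position bias and the multi-head split; the function and weight names (`tmsa_single_head`, `Wqs`, `Wqt`, and so on) are illustrative and are not taken from this repository.

```python
import math

def matvec(W, x):
    """Multiply matrix W (nested lists, rows x cols) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tmsa_single_head(tokens, time_emb, Wqs, Wqt, Wks, Wkt, Wvs, Wvt):
    """Toy single-head attention where q, k, v each mix a spatial and a time term."""
    # The time contribution is shared by every token at a given timestep.
    t_q, t_k, t_v = matvec(Wqt, time_emb), matvec(Wkt, time_emb), matvec(Wvt, time_emb)
    q = [add(matvec(Wqs, x), t_q) for x in tokens]
    k = [add(matvec(Wks, x), t_k) for x in tokens]
    v = [add(matvec(Wvs, x), t_v) for x in tokens]
    d = len(q[0])
    out = []
    for qi in q:
        # Scaled dot-product scores of this query against all keys.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)])
    return out

# Same tokens, two different (made-up) time embeddings.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
out_t0 = tmsa_single_head(tokens, [0.0, 0.0], *([identity] * 6))
out_t1 = tmsa_single_head(tokens, [1.0, 0.0], *([identity] * 6))
```

Because the time embedding enters the queries, keys, and values, the same tokens yield different attention outputs at different timesteps.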

💥 News 💥

[03.08.2026] 🔥🔥 DiffiT code and pretrained models are released!
[07.01.2024] 🔥🔥 DiffiT has been accepted to ECCV 2024!
[04.02.2024] Updated manuscript now available on arXiv!
[12.04.2023] 🔥 Paper is published on arXiv!

Models

ImageNet-256

Model	Dataset	Resolution	FID-50K	Inception Score	Download
DiffiT	ImageNet	256x256	1.73	276.49	model

ImageNet-512

Model	Dataset	Resolution	FID-50K	Inception Score	Download
DiffiT	ImageNet	512x512	2.67	252.12	model

Getting Started: Sampling & Evaluation

This repository provides the code for the DiffiT model, pretrained model checkpoints, and everything needed to sample images and compute FID scores to reproduce the results reported in our paper.

Sampling Images

Image sampling is performed using sample.py. To reproduce the reported numbers, use the commands below.

ImageNet-256:

python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 4.4 \
    --model_path $MODEL \
    --image_size 256 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True

ImageNet-512:

python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 1.49 \
    --model_path $MODEL \
    --image_size 512 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True
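
The `--cfg_scale` flag sets the classifier-free guidance strength. As a hedged illustration, assuming `sample.py` follows the convention used in the DiT and Guided-Diffusion codebases (not verified here), the guided noise prediction blends a conditional and an unconditional prediction as sketched below. The helper name `apply_cfg` and the toy inputs are hypothetical.

```python
def apply_cfg(eps_cond, eps_uncond, cfg_scale):
    """Blend conditional and unconditional noise predictions (flat lists of floats).

    Assumed convention: eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond).
    """
    return [u + cfg_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

# Toy predictions; real ones would be model outputs with image-shaped tensors.
guided = apply_cfg([1.0, 2.0], [0.0, 1.0], cfg_scale=4.4)
unguided = apply_cfg([1.0, 2.0], [0.0, 1.0], cfg_scale=1.0)
```

Under this convention a scale of 1.0 recovers the purely conditional prediction, and larger values push samples further toward the class condition.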

We also provide ready-to-use Slurm scripts for convenience:

slurm_sample_256.sh — samples 50K images at 256×256 resolution
slurm_sample_512.sh — samples 50K images at 512×512 resolution
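
For runs outside Slurm, the two command lines above can also be assembled programmatically. The sketch below only reuses the flags and values shown in those commands; the helper name `build_sample_command` and the example paths are hypothetical placeholders.

```python
import shlex

# Guidance scales copied from the ImageNet-256 and ImageNet-512 commands above.
CFG_SCALE = {256: 4.4, 512: 1.49}

def build_sample_command(image_size, log_dir, model_path, num_samples=50000):
    """Return the sample.py argument list for a supported resolution."""
    if image_size not in CFG_SCALE:
        raise ValueError(f"unsupported image_size: {image_size}")
    return [
        "python", "sample.py",
        "--log_dir", log_dir,
        "--cfg_scale", str(CFG_SCALE[image_size]),
        "--model_path", model_path,
        "--image_size", str(image_size),
        "--model", "Diffit",
        "--num_sampling_steps", "250",
        "--num_samples", str(num_samples),
        "--cfg_cond", "True",
    ]

# Placeholder paths; substitute your own log directory and checkpoint.
cmd = build_sample_command(256, "logs/run256", "checkpoints/diffit_256.pt")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.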

Computing FID

Once images have been sampled, you can compute the FID and other metrics using the provided eval_run.sh script. Our evaluation pipeline exactly follows the protocol from openai/guided-diffusion/evaluations.

bash eval_run.sh
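
For reference, FID is the Fréchet distance between two Gaussians fitted to Inception features of the reference and generated images. With feature means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ (symbols introduced here for illustration only), the standard definition is:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the reference distribution.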

Expected Results

Running the above sampling and evaluation commands should yield the following metrics:

ImageNet-256:

Inception Score	FID	sFID	Precision	Recall
276.49	1.73	4.54	0.8024	0.6205

ImageNet-512:

Inception Score	FID	sFID	Precision	Recall
252.13	2.67	4.99	0.8277	0.5500

Note: Small deviations from these numbers are expected, depending on the hardware used for sampling and on numerical precision.
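
One way to check a reproduction against the tables above is to compare each metric within a tolerance. The sketch below is illustrative only: the helper name `metrics_out_of_tolerance` and the 2% relative tolerance are assumptions rather than values from the authors, and the measured numbers are expected to be copied by hand from the evaluation output.

```python
import math

# Expected metrics copied from the tables above, keyed by resolution.
EXPECTED = {
    256: {"inception_score": 276.49, "fid": 1.73, "sfid": 4.54,
          "precision": 0.8024, "recall": 0.6205},
    512: {"inception_score": 252.13, "fid": 2.67, "sfid": 4.99,
          "precision": 0.8277, "recall": 0.5500},
}

def metrics_out_of_tolerance(measured, resolution, rel_tol=0.02):
    """Return names of metrics that differ from the expected value by more than rel_tol."""
    expected = EXPECTED[resolution]
    return [name for name, value in expected.items()
            if not math.isclose(measured[name], value, rel_tol=rel_tol)]

# Hypothetical measurement: identical to the table except for a slightly higher FID.
measured = dict(EXPECTED[256], fid=1.74)
print(metrics_out_of_tolerance(measured, 256))
```

An empty list means every metric falls within the chosen tolerance; tighten or loosen `rel_tol` to suit your setup.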

Citation

@inproceedings{hatamizadeh2025diffit,
  title={Diffit: Diffusion vision transformers for image generation},
  author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
  booktitle={European Conference on Computer Vision},
  pages={37--55},
  year={2025},
  organization={Springer}
}

Licenses

This work is made available under the NVIDIA Source Code License-NC. See the LICENSE file for a copy of this license.

The pre-trained models are shared under CC-BY-NC-SA-4.0. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Acknowledgement

We gratefully acknowledge the authors of Guided-Diffusion, DiT and MDT for making their excellent codebases publicly available.