This project demonstrates how to train a text-conditioned Latent Diffusion Model (LDM) on the CelebA-Dialog dataset. It uses pretrained encoders and a custom diffusion model to generate face images conditioned on natural language descriptions.
For a detailed explanation of the methodology, training setup, and results, check out the full report:
📄 Read the Report (PDF)
| Latent Representation | Reconstructed Image |
| --- | --- |
| ![]() | ![]() |
The model generates images by denoising in the latent space based on a text prompt and decoding through a pretrained VAE.
- Latent Encoder (VAE): Pretrained `CompVis/stable-diffusion-v1-4`
- Text Encoder: Pretrained CLIP (`openai/clip-vit-large-patch14`)
- Diffusion Model: Trained from scratch on CelebA-Dialog latents and captions
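
As a rough illustration of how the two pretrained components above could be loaded, here is a minimal sketch using the `diffusers` and `transformers` libraries. The function name `load_frozen_encoders`, the helper `latent_shape`, and the decision to freeze both encoders are assumptions made for this example and are not taken from the project code. The constants reflect the public checkpoints: the Stable Diffusion v1 VAE downsamples by a factor of 8 into 4 latent channels, and the CLIP ViT-L/14 text encoder works on 77 tokens with 768-dimensional hidden states.

```python
VAE_REPO = "CompVis/stable-diffusion-v1-4"
CLIP_REPO = "openai/clip-vit-large-patch14"

# Properties of the public checkpoints named above.
VAE_DOWNSCALE = 8        # spatial downsampling factor of the SD v1 VAE
LATENT_CHANNELS = 4      # channels in the VAE latent space
CLIP_MAX_TOKENS = 77     # context length of the CLIP text encoder
CLIP_HIDDEN_SIZE = 768   # width of the CLIP ViT-L/14 text embeddings


def latent_shape(height, width):
    """Return the (C, H, W) latent shape for an image of the given size."""
    if height % VAE_DOWNSCALE or width % VAE_DOWNSCALE:
        raise ValueError("image sides must be multiples of 8")
    return (LATENT_CHANNELS, height // VAE_DOWNSCALE, width // VAE_DOWNSCALE)


def load_frozen_encoders(device="cpu"):
    """Load the VAE, CLIP tokenizer, and CLIP text encoder (hypothetical helper).

    Heavy libraries are imported lazily so the shape helper above stays usable
    without them. Freezing the weights is an assumption of this sketch.
    """
    from diffusers import AutoencoderKL
    from transformers import CLIPTextModel, CLIPTokenizer

    vae = AutoencoderKL.from_pretrained(VAE_REPO, subfolder="vae").to(device).eval()
    tokenizer = CLIPTokenizer.from_pretrained(CLIP_REPO)
    text_encoder = CLIPTextModel.from_pretrained(CLIP_REPO).to(device).eval()

    # Only the diffusion model is trained, so gradients are disabled here.
    for param in list(vae.parameters()) + list(text_encoder.parameters()):
        param.requires_grad_(False)
    return vae, tokenizer, text_encoder
```

For example, `latent_shape(256, 256)` gives the latent size the diffusion model would operate on for 256×256 face crops; the image resolution actually used in this project is not stated here.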
To speed up training and evaluation, you can use our preprocessed datasets:
- Aligned and captioned images from CelebA-Dialog
  👉 Kaggle Dataset
- Precomputed latents and tokenized captions
  👉 Hugging Face Dataset
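
To give a sense of what "precomputed latents and tokenized captions" might involve, the sketch below encodes one image and one caption. It assumes `vae` is a `diffusers` `AutoencoderKL` and `tokenizer` is a `transformers` `CLIPTokenizer`; the function names and the use of the Stable Diffusion v1 latent scaling factor 0.18215 are assumptions of this example, and the published datasets may have been produced differently.

```python
SD_LATENT_SCALE = 0.18215  # scaling factor commonly used with SD v1 VAE latents
CLIP_MAX_TOKENS = 77       # context length of the CLIP text encoder


def normalize_pixel(value):
    """Map 8-bit pixel values in [0, 255] to the [-1, 1] range the VAE expects.

    Works on plain numbers as well as on torch tensors.
    """
    return value / 127.5 - 1.0


def precompute_example(vae, tokenizer, image_uint8, caption):
    """Encode one image/caption pair (hypothetical helper).

    image_uint8: torch.uint8 tensor of shape (3, H, W).
    Returns (latents, input_ids) with shapes (4, H/8, W/8) and (77,).
    """
    import torch  # imported lazily so normalize_pixel works without torch

    pixels = normalize_pixel(image_uint8.float()).unsqueeze(0).to(vae.device)
    with torch.no_grad():
        # Sample from the VAE posterior and apply the latent scaling factor.
        latents = vae.encode(pixels).latent_dist.sample() * SD_LATENT_SCALE

    tokens = tokenizer(
        caption,
        padding="max_length",
        max_length=CLIP_MAX_TOKENS,
        truncation=True,
        return_tensors="pt",
    )
    return latents.squeeze(0).cpu(), tokens.input_ids.squeeze(0)
```

Storing these outputs once avoids running the VAE encoder and tokenizer on every training step, which is the speed-up the preprocessed datasets are meant to provide.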


