This project was developed by me and Soham.
This repository contains a re-implementation of the Variational Autoencoder (VAE) architecture, inspired by the paper "Autoencoders" by Dor Bank, Noam Koenigstein, and Raja Giryes. Our approach builds upon the architecture used in Stable Diffusion's VAE encoder-decoder, with novel training strategies that significantly enhance model performance.
You can find this notebook on [Kaggle](https://www.kaggle.com/code/sohamumbare/vaencoder).
- Near State-of-the-Art Performance: Achieved an MAE of 0.1031 and an MSE of 0.03385 on the COCO dataset at 128x128 image resolution.
- Performance comparison with state-of-the-art (SOTA) models:
| Model | MAE | MSE | Image Resolution |
|---|---|---|---|
| NVAE | 0.08-0.12 | 0.01-0.02 | 128x128 to 512x512 |
| Beta-VAE | 0.1-0.15 | 0.02 | 128x128 |
| Our Model | 0.1031 | 0.03385 | 128x128 |
[Figure 1]
While our model's MAE and MSE are slightly higher than NVAE's, it performs competitively with Beta-VAE, offering a trade-off between flexibility and fidelity. The incorporation of perceptual loss adds robustness to reconstructions, ensuring semantic and structural fidelity.
- Epochs Trained: 600
- Final Loss Value: 22
- MAE: 0.1031
- MSE: 0.03385
The loss decreased uniformly across epochs with minimal fluctuations, showcasing stable and consistent training. The trends in the loss curve are shown in Figure 1.
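For reference, the MAE and MSE reported above are standard pixel-wise reconstruction metrics. A minimal NumPy sketch of how they can be computed on a batch of reconstructions (the array names and dummy data here are illustrative, not the repository's actual evaluation code):

```python
import numpy as np

def mae(original, reconstruction):
    """Mean absolute error, averaged over all pixels in the batch."""
    return np.abs(original - reconstruction).mean()

def mse(original, reconstruction):
    """Mean squared error, averaged over all pixels in the batch."""
    return ((original - reconstruction) ** 2).mean()

# Dummy batch of four 128x128 RGB images in [0, 1], with noisy "reconstructions"
rng = np.random.default_rng(0)
images = rng.uniform(0.0, 1.0, size=(4, 3, 128, 128))
recons = np.clip(images + rng.normal(0.0, 0.05, size=images.shape), 0.0, 1.0)

print("MAE:", mae(images, recons))
print("MSE:", mse(images, recons))
```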
- Hardware: NVIDIA Tesla P100 GPU (16 GiB RAM)
- Training Time: Approximately 7 hours of GPU time
The architecture of this VAE is derived from the Stable Diffusion VAE encoder-decoder, which we acknowledge. However, the key innovation lies in the novel loss logic we developed, located in the src/Loss/ directory of this repository. This logic was instrumental in achieving near-SOTA performance.
Our loss function includes:
- Perceptual Loss: To enhance semantic and structural fidelity in reconstructions.
- KL Divergence Penalty: To ensure proper regularization in the latent space.
- Reconstruction Loss: A combination of MAE and MSE for pixel-level accuracy.
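A rough sketch of how these three terms could combine into a single objective. This is NumPy pseudocode under stated assumptions: the weights `beta` and `lambda_perc` and the feature extractor `features` are hypothetical placeholders, not the actual logic in `src/Loss/`:

```python
import numpy as np

def kl_divergence(mu, logvar):
    """Closed-form KL between N(mu, sigma^2) and N(0, 1),
    averaged over batch and latent dimensions (a common simplification)."""
    return -0.5 * np.mean(1 + logvar - mu ** 2 - np.exp(logvar))

def reconstruction_loss(x, x_hat):
    """Pixel-level term: MAE + MSE, as described above."""
    return np.abs(x - x_hat).mean() + ((x - x_hat) ** 2).mean()

def total_loss(x, x_hat, mu, logvar, features, beta=1.0, lambda_perc=1.0):
    """Combined objective: reconstruction + beta * KL + weighted perceptual term.
    `features` is a placeholder for a pretrained feature extractor."""
    perceptual = ((features(x) - features(x_hat)) ** 2).mean()
    return (reconstruction_loss(x, x_hat)
            + beta * kl_divergence(mu, logvar)
            + lambda_perc * perceptual)
```

In practice the perceptual term is usually a distance between activations of a frozen network such as VGG; the identity placeholder here only illustrates where that term enters the sum.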
- Architecture: The encoder-decoder architecture is adapted from [Stable Diffusion's VAE](https://github.com/Stability-AI/stablediffusion).
- Dataset: The flickr8k dataset was used for training and evaluation.
- Clone this repository:

  ```shell
  git clone https://github.com/theSohamTUmbare/Variational-AutoEncoder.git
  cd Variational-AutoEncoder
  ```

- Install the required dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Run the evaluation script:

  ```shell
  python evaluate_vae.py
  ```

- To evaluate the model on a custom image, change `image_path` in `evaluate_vae.py`:

  ```python
  image_path = '/path/to/your/image.jpg'
  ```
The reconstructions produced by our model exhibit:
- Strong semantic and structural fidelity.
- Competitive performance compared to SOTA approaches.
With further training and architectural optimizations, this model holds potential to align more closely with benchmarks like NVAE.
We express our gratitude to the creators of Stable Diffusion and to the authors of "An Introduction to Variational Autoencoders" for their foundational contributions to this work.
🧑‍💻 Happy Experimenting! 🔬


