This repository implements EfficientQAT (Efficient Quantization-Aware Training) for SmolVLM vision-language models. EfficientQAT is a two-phase training approach:
- Block-AP (Block-wise All Parameters): Sequential block-wise training with reconstruction loss
- E2E-QP (End-to-End Quantization Parameters): End-to-end training of step sizes only
.
├── quantization_primitives.py # Core quantization implementations
├── block_ap_trainer.py # Phase 1: Block-AP training
├── e2e_qp_trainer.py # Phase 2: E2E-QP training
├── data_utils.py # Dataset utilities for VQA
├── train_efficientqat.py # Main training script
├── evaluate_quantized.py # Evaluation script
└── README.md # This file
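Conceptually, both phases rely on group-wise uniform quantization with learnable step sizes (scales). The sketch below only illustrates that idea; the class and attribute names are assumptions and do not reproduce the actual contents of quantization_primitives.py:

```python
import torch
import torch.nn as nn

class GroupWiseFakeQuant(nn.Module):
    """Illustrative group-wise uniform "fake" quantizer with learnable step sizes.

    Weights are split into groups of `group_size`; each group gets its own scale
    (step size) and zero-point. The forward pass quantizes and immediately
    dequantizes, with a straight-through estimator so gradients reach both the
    weights and the quantization parameters. Assumes weight.numel() is divisible
    by group_size.
    """

    def __init__(self, weight: torch.Tensor, n_bits: int = 2, group_size: int = 64):
        super().__init__()
        self.n_bits, self.group_size = n_bits, group_size
        w = weight.detach().reshape(-1, group_size)
        w_min = w.min(dim=1, keepdim=True).values
        w_max = w.max(dim=1, keepdim=True).values
        scale = (w_max - w_min).clamp(min=1e-8) / (2 ** n_bits - 1)
        self.scale = nn.Parameter(scale)                 # learnable step size
        self.zero_point = nn.Parameter(-w_min / scale)   # learnable zero-point

    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        w = weight.reshape(-1, self.group_size)
        x = w / self.scale + self.zero_point
        q = torch.clamp(torch.round(x), 0, 2 ** self.n_bits - 1)
        q = (q - x).detach() + x                         # straight-through estimator
        return ((q - self.zero_point) * self.scale).reshape_as(weight)
```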
- Minimum: 1x GPU with 24GB VRAM (e.g., RTX 3090/4090)
- Recommended: 1x A100 40GB/80GB GPU
- For 2B model: Training takes ~5-8 hours on A100
# Create conda environment
conda create -n efficientqat python=3.10
conda activate efficientqat
# Install PyTorch (adjust CUDA version as needed)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install transformers and dependencies
pip install "transformers>=4.37.0"
pip install accelerate
pip install datasets
pip install Pillow
pip install tqdm
# Install evaluation metrics
pip install bert-score
pip install nltk
pip install rouge-score
pip install sentence-transformers
# Download NLTK data
python -c "import nltk; nltk.download('punkt')"- Download AOKVQA:
- Download AOKVQA:

# Create directory
mkdir -p datasets/aokvqa
# Download AOKVQA annotations
cd datasets/aokvqa
wget https://prior-datasets.s3.us-east-2.amazonaws.com/aokvqa/aokvqa_v1p0_train.json
wget https://prior-datasets.s3.us-east-2.amazonaws.com/aokvqa/aokvqa_v1p0_val.json
cd ../..

- Download COCO Images:
# Create COCO directory
mkdir -p datasets/coco
# Download COCO train2017 (18GB)
cd datasets/coco
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip
# Download COCO val2017 (1GB)
wget http://images.cocodataset.org/zips/val2017.zip
unzip val2017.zip
cd ../..

Your data structure should look like:
datasets/
├── aokvqa/
│ ├── aokvqa_v1p0_train.json
│ └── aokvqa_v1p0_val.json
└── coco/
├── train2017/
│ ├── 000000000009.jpg
│ └── ...
└── val2017/
├── 000000000139.jpg
└── ...
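For reference, AOKVQA annotations point to COCO images by numeric image_id, and COCO files are named with that ID zero-padded to 12 digits (as in the tree above). A minimal loading sketch; the function name and exact field handling are illustrative and not necessarily what data_utils.py does:

```python
import json
import os

def load_aokvqa_split(json_path: str, image_dir: str):
    """Illustrative AOKVQA loader (field names follow the public AOKVQA schema)."""
    with open(json_path) as f:
        entries = json.load(f)

    questions, image_paths, answers = [], [], []
    for e in entries:
        # COCO files are named by image_id, zero-padded to 12 digits.
        image_paths.append(os.path.join(image_dir, f"{e['image_id']:012d}.jpg"))
        questions.append(e["question"])
        # AOKVQA is multiple-choice; use the correct choice as the target answer.
        answers.append(e["choices"][e["correct_choice_idx"]])
    return {"questions": questions, "image_paths": image_paths, "answers": answers}
```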
Modify data_utils.py to load your custom dataset format.
python train_efficientqat.py \
--model_name HuggingFaceTB/SmolVLM-Instruct \
--output_dir ./outputs \
--n_bits 2 \
--group_size 64 \
--train_json datasets/aokvqa/aokvqa_v1p0_train.json \
--train_image_dir datasets/coco/train2017 \
--val_json datasets/aokvqa/aokvqa_v1p0_val.json \
--val_image_dir datasets/coco/val2017 \
--block_ap_samples 512 \
--block_ap_batch_size 2 \
--block_ap_epochs 2 \
--e2e_samples 4096 \
--e2e_batch_size 16 \
--e2e_epochs 1 \
--device cuda:0

Quantization Settings:
- --n_bits: Number of bits (2, 3, or 4)
- --group_size: Group size for quantization (default: 64)
Block-AP Phase:
- --block_ap_samples: Calibration samples (default: 4096)
- --block_ap_batch_size: Batch size (default: 2, limited by memory)
- --block_ap_epochs: Training epochs per block (default: 2)
- --block_ap_lr_weights: Learning rate for weights (default: 2e-5 for 2-bit)
- --block_ap_lr_quant: Learning rate for quantization params (default: 1e-4)
E2E-QP Phase:
- --e2e_samples: Training samples (default: 4096)
- --e2e_batch_size: Batch size (default: 16)
- --e2e_epochs: Training epochs (default: 1)
- --e2e_lr: Learning rate (default: 2e-5 for 2-bit)
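For intuition, here is a heavily simplified sketch of what each phase optimizes. All function and attribute names are illustrative assumptions and do not correspond to the actual code in block_ap_trainer.py or e2e_qp_trainer.py:

```python
import torch

def block_ap_phase(fp_blocks, quant_blocks, calib_inputs, epochs=2):
    """Phase 1 (Block-AP): train one quantized block at a time to reproduce the
    full-precision block's outputs (MSE reconstruction loss). Both the weights
    and the quantization parameters of that block are updated."""
    for fp_block, q_block in zip(fp_blocks, quant_blocks):
        opt = torch.optim.AdamW(q_block.parameters(), lr=2e-5)
        for _ in range(epochs):
            for x in calib_inputs:                   # cached inputs to this block
                with torch.no_grad():
                    target = fp_block(x)             # full-precision reference
                loss = torch.nn.functional.mse_loss(q_block(x), target)
                opt.zero_grad()
                loss.backward()
                opt.step()
        # The next block's calibration inputs are the outputs of this one.
        calib_inputs = [q_block(x).detach() for x in calib_inputs]

def e2e_qp_phase(model, loader, epochs=1):
    """Phase 2 (E2E-QP): freeze the already-quantized weights and train only the
    quantization step sizes end-to-end with the usual language-model loss."""
    for name, p in model.named_parameters():
        p.requires_grad = "scale" in name            # assumed naming of step sizes
    step_sizes = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(step_sizes, lr=2e-5)
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss
            opt.zero_grad()
            loss.backward()
            opt.step()
```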
3-bit Quantization (Better accuracy, less compression):
python train_efficientqat.py \
--n_bits 3 \
--block_ap_lr_weights 1e-5 \
--e2e_lr 1e-5 \
[... other args ...]

4-bit Quantization (Best accuracy, moderate compression):
python train_efficientqat.py \
--n_bits 4 \
--block_ap_lr_weights 1e-5 \
--e2e_lr 1e-5 \
[... other args ...]

Skip Block-AP (if already completed):
python train_efficientqat.py \
--skip_block_ap \
--block_ap_checkpoint outputs/run_TIMESTAMP/block_ap_checkpoint.pt \
[... other args ...]

Run only Block-AP:
python train_efficientqat.py \
--skip_e2e_qp \
[... other args ...]

If you run out of memory, try:
- Reduce batch sizes:
--block_ap_batch_size 1 \
--e2e_batch_size 8

- Reduce samples:
--block_ap_samples 2048 \
--e2e_samples 2048

- Use gradient accumulation (modify e2e_qp_trainer.py; see the sketch below):

# Add gradient accumulation steps
accumulation_steps = 4
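A minimal sketch of how the gradient-accumulation change could look inside the E2E-QP training loop; the loop structure here is an assumption about e2e_qp_trainer.py, not its actual code:

```python
import torch

def train_e2e_qp_with_accumulation(model, optimizer, train_loader, accumulation_steps=4):
    """Gradient accumulation: take an optimizer step every `accumulation_steps`
    micro-batches, so the effective batch size is batch_size * accumulation_steps."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(train_loader):
        loss = model(**batch).loss / accumulation_steps  # scale so gradients average
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```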
To evaluate the quantized model on the AOKVQA validation set:

python evaluate_quantized.py \
--model_path outputs/run_TIMESTAMP/final_quantized_model \
--test_json datasets/aokvqa/aokvqa_v1p0_val.json \
--test_image_dir datasets/coco/val2017 \
--output_file results.json \
--batch_size 8 \
--max_new_tokens 20 \
--device cuda:0

The script computes:
- Exact Match (EM): Percentage of exact string matches
- BLEU-1: Unigram overlap score
- BERTScore: Semantic similarity (Precision, Recall, F1)
- ROUGE: n-gram overlap (ROUGE-1, ROUGE-2, ROUGE-L)
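For reference, these metrics can be computed with the packages installed above; the sketch below is illustrative and may differ from what evaluate_quantized.py actually does:

```python
from bert_score import score as bert_score
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

def compute_metrics(predictions, references):
    """Illustrative VQA answer metrics: EM, BLEU-1, BERTScore, ROUGE."""
    normalize = lambda s: s.strip().lower()
    em = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references)) / len(references)

    # BLEU-1: unigram overlap only (all weight on 1-grams).
    bleu1 = sum(
        sentence_bleu([r.split()], p.split(), weights=(1, 0, 0, 0))
        for p, r in zip(predictions, references)
    ) / len(references)

    # BERTScore: semantic similarity from contextual embeddings.
    P, R, F1 = bert_score(predictions, references, lang="en")

    # ROUGE F-measures, averaged over the evaluation set.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = [scorer.score(r, p) for p, r in zip(predictions, references)]
    return {
        "exact_match": em,
        "bleu1": bleu1,
        "bertscore_f1": F1.mean().item(),
        "rouge1": sum(s["rouge1"].fmeasure for s in rouge) / len(rouge),
        "rougeL": sum(s["rougeL"].fmeasure for s in rouge) / len(rouge),
    }
```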
python evaluate_quantized.py \
--model_path outputs/run_TIMESTAMP/final_quantized_model \
--test_json datasets/aokvqa/aokvqa_v1p0_val.json \
--test_image_dir datasets/coco/val2017 \
--n_samples 100 \
--output_file quick_test.json

Based on the EfficientQAT paper (Table 14) for LLaVA-1.5:
| Quantization | MMBench | MME | MM-Vet | ScienceQA | Avg |
|---|---|---|---|---|---|
| FP16 | ~62 | ~70 | ~28 | ~65 | ~56 |
| 4-bit | ~61 | ~69 | ~27 | ~64 | ~55 |
| 3-bit | ~60 | ~68 | ~27 | ~63 | ~54 |
| 2-bit | ~58 | ~65 | ~25 | ~60 | ~52 |
Note: These are rough estimates. Actual results depend on:
- Calibration data quality
- Hyperparameter tuning
- Number of training samples
For SmolVLM-2B on single A100-80GB:
| Phase | Time | Memory |
|---|---|---|
| Block-AP | ~4-6 hrs | ~8 GB |
| E2E-QP | ~1-2 hrs | ~10 GB |
| Total | ~5-8 hrs | ~10 GB |
CUDA out of memory. Solution: Reduce the batch sizes or the number of samples:
--block_ap_batch_size 1 --e2e_batch_size 4

Poor accuracy after quantization. Possible causes:
- Insufficient calibration samples
- Learning rate too high/low
- Wrong group size
Solution: Try:
--block_ap_samples 8192 \
--block_ap_lr_weights 1e-5 \
--group_size 32

Model loading or import errors. Solution: Check the PyTorch and Transformers versions:
pip install "transformers>=4.37.0" --upgrade

Slow training. Solutions:
- Enable Flash Attention 2 (if supported)
- Reduce number of workers:
--num_workers 2

- Use smaller image resolution (modify the processor; see the sketch below)
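For the image-resolution tip, SmolVLM's Idefics3 processor accepts a size override at load time. The keyword and the 384-pixel base below are assumptions; check the SmolVLM model card and the Idefics3 image-processor docs for your transformers version:

```python
from transformers import AutoProcessor

# Load the processor with a smaller maximum image resolution to cut the number
# of image tokens fed to the model (values here are illustrative assumptions).
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    size={"longest_edge": 2 * 384},
)
```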
Quantize vision and text at different bit-widths:
Modify block_ap_trainer.py:
def replace_linear_with_quantized(self, block, block_type: str):
    # Use different n_bits based on block_type
    if block_type == 'vision':
        n_bits = 4  # Higher bits for vision
    else:
        n_bits = 2  # Lower bits for text

Create custom dataset loader in data_utils.py:
def load_custom_dataset(json_path, image_dir):
    # Implement your loading logic
    return {
        'questions': [...],
        'image_paths': [...],
        'answers': [...]
    }

The code supports any Idefics3-based model. To use different models:
--model_name HuggingFaceTB/SmolVLM-256M # Smaller version

After training, check model size:
import os

# Original FP16 model (SmolVLM-2B weighs roughly 4.5 GB in FP16)
original_size = 4.5 * 1024  # MB

# Quantized model
quantized_dir = "outputs/run_TIMESTAMP/final_quantized_model"
quantized_size = sum(
    os.path.getsize(os.path.join(quantized_dir, f))
    for f in os.listdir(quantized_dir)
) / (1024**2)  # Convert to MB

compression_ratio = original_size / quantized_size
print(f"Compression: {compression_ratio:.2f}x")

Expected compression:
- 2-bit: ~6-7x smaller
- 3-bit: ~4-5x smaller
- 4-bit: ~3-4x smaller
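These ratios follow from the effective bits per weight: with group_size 64 and (assuming) a 16-bit scale plus a 16-bit zero-point stored per group, each quantized weight costs roughly n_bits + 32/64 bits, giving 16 / (n_bits + 0.5) compression for the quantized layers; the whole-model ratio is somewhat lower because embeddings and other unquantized tensors stay in FP16. A back-of-the-envelope helper:

```python
def per_layer_compression(n_bits: int, group_size: int = 64) -> float:
    """Compression of a quantized linear layer vs. FP16.

    Assumes each group of `group_size` weights stores one 16-bit scale and one
    16-bit zero-point alongside the n-bit weights (an assumed storage format),
    so each weight costs n_bits + 32/group_size bits on average.
    """
    effective_bits = n_bits + 32 / group_size
    return 16 / effective_bits

for b in (2, 3, 4):
    print(f"{b}-bit, group_size 64: ~{per_layer_compression(b):.1f}x per quantized layer")
# 2-bit -> ~6.4x, 3-bit -> ~4.6x, 4-bit -> ~3.6x; whole-model compression is a bit
# lower because unquantized tensors (e.g. embeddings) remain in FP16.
```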
If you use this code, please cite:
@inproceedings{chen2025efficientqat,
title={EfficientQAT: Efficient Quantization-Aware Training for Large Language Models},
author={Chen, Mengzhao and Shao, Wenqi and Xu, Peng and Wang, Jiahao and Gao, Peng and Zhang, Kaipeng and Luo, Ping},
booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
year={2025}
}

- Baseline Evaluation: Run the FP16 model first to establish a baseline
- Train 4-bit: Start with 4-bit for best accuracy
- Train 2-bit: Push to extreme compression
- Ablation Studies: Vary group size, samples, learning rates
- Write Paper: Document your findings!
For issues or questions:
- Check this README thoroughly
- Review the code comments
- Try different hyperparameters
- Open an issue with:
- Error message
- Command used
- System specifications
This code is provided for research purposes. Please check the licenses of:
- SmolVLM model
- EfficientQAT paper
- Dependencies (PyTorch, Transformers, etc.)
Good luck with your research! 🚀