EfficientQAT for SmolVLM: Complete Running Instructions

Overview

This repository implements EfficientQAT (Efficient Quantization-Aware Training) for SmolVLM vision-language models. EfficientQAT is a two-phase training approach:

  1. Block-AP (Block-wise All Parameters): Sequential block-wise training with reconstruction loss
  2. E2E-QP (End-to-End Quantization Parameters): End-to-end training of step sizes only
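
In short, Block-AP replaces the Linear layers in one transformer block with quantized versions and trains that block to reproduce the cached full-precision outputs before moving to the next block; E2E-QP then freezes the quantized weights and fine-tunes only the quantization step sizes on the task loss. A minimal sketch of a Phase-1 step with hypothetical names (the actual logic lives in block_ap_trainer.py):

import torch
import torch.nn.functional as F

def block_ap_step(quant_block, fp_inputs, fp_outputs, optimizer):
    # quant_block: one transformer block whose Linear layers were replaced by quantized versions
    # fp_inputs / fp_outputs: cached activations from the full-precision model
    loss = F.mse_loss(quant_block(fp_inputs), fp_outputs)  # block-wise reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()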

Repository Structure

.
├── quantization_primitives.py   # Core quantization implementations
├── block_ap_trainer.py          # Phase 1: Block-AP training
├── e2e_qp_trainer.py            # Phase 2: E2E-QP training
├── data_utils.py                # Dataset utilities for VQA
├── train_efficientqat.py        # Main training script
├── evaluate_quantized.py        # Evaluation script
└── README.md                    # This file

Prerequisites

Hardware Requirements

  • Minimum: 1x GPU with 24GB VRAM (e.g., RTX 3090/4090)
  • Recommended: 1x A100 40GB/80GB GPU
  • For the 2B model: training takes roughly 5-8 hours on an A100 (see Training Time Estimates below)

Software Requirements

# Create conda environment
conda create -n efficientqat python=3.10
conda activate efficientqat

# Install PyTorch (adjust CUDA version as needed)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install transformers and dependencies
pip install "transformers>=4.37.0"
pip install accelerate
pip install datasets
pip install Pillow
pip install tqdm

# Install evaluation metrics
pip install bert-score
pip install nltk
pip install rouge-score
pip install sentence-transformers

# Download NLTK data
python -c "import nltk; nltk.download('punkt')"

Data Preparation

Option 1: AOKVQA Dataset (Recommended)

  1. Download AOKVQA:
# Create directory
mkdir -p datasets/aokvqa

# Download AOKVQA annotations
cd datasets/aokvqa
wget https://prior-datasets.s3.us-east-2.amazonaws.com/aokvqa/aokvqa_v1p0_train.json
wget https://prior-datasets.s3.us-east-2.amazonaws.com/aokvqa/aokvqa_v1p0_val.json

cd ../..
  2. Download COCO Images:
# Create COCO directory
mkdir -p datasets/coco

# Download COCO train2017 (18GB)
cd datasets/coco
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip

# Download COCO val2017 (1GB)
wget http://images.cocodataset.org/zips/val2017.zip
unzip val2017.zip

cd ../..

Your data structure should look like:

datasets/
├── aokvqa/
│   ├── aokvqa_v1p0_train.json
│   └── aokvqa_v1p0_val.json
└── coco/
    ├── train2017/
    │   ├── 000000000009.jpg
    │   └── ...
    └── val2017/
        ├── 000000000139.jpg
        └── ...
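
AOKVQA annotations reference COCO images by numeric image_id, and the COCO file name is simply that id zero-padded to 12 digits (as in the tree above). A small illustrative helper, assuming the annotation file is a JSON list of records with image_id and question fields as in the official release (the actual loading lives in data_utils.py):

import json
import os

def coco_image_path(image_id: int, image_dir: str) -> str:
    # COCO file names are the numeric image id, zero-padded to 12 digits
    return os.path.join(image_dir, f"{image_id:012d}.jpg")

with open("datasets/aokvqa/aokvqa_v1p0_train.json") as f:
    annotations = json.load(f)

first = annotations[0]
print(first["question"], coco_image_path(first["image_id"], "datasets/coco/train2017"))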

Option 2: Use Your Own VQA Dataset

Modify data_utils.py to load your custom dataset format (see "2. Custom Datasets" under Advanced Usage below).

Training

Quick Start: 2-bit Quantization

python train_efficientqat.py \
    --model_name HuggingFaceTB/SmolVLM-Instruct \
    --output_dir ./outputs \
    --n_bits 2 \
    --group_size 64 \
    --train_json datasets/aokvqa/aokvqa_v1p0_train.json \
    --train_image_dir datasets/coco/train2017 \
    --val_json datasets/aokvqa/aokvqa_v1p0_val.json \
    --val_image_dir datasets/coco/val2017 \
    --block_ap_samples 512 \
    --block_ap_batch_size 2 \
    --block_ap_epochs 2 \
    --e2e_samples 4096 \
    --e2e_batch_size 16 \
    --e2e_epochs 1 \
    --device cuda:0

Training Parameters Explained

Quantization Settings:

  • --n_bits: Number of bits (2, 3, or 4)
  • --group_size: Group size for quantization (default: 64)
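
Conceptually, --n_bits fixes the number of quantization levels per weight (2^n_bits) and --group_size sets how many consecutive weights share one scale and zero-point; smaller groups track the weight distribution more closely at the cost of storing more quantization parameters. A minimal sketch of group-wise uniform fake quantization, assuming in_features is divisible by group_size (not the exact code in quantization_primitives.py):

import torch

def fake_quantize(weight: torch.Tensor, n_bits: int = 2, group_size: int = 64) -> torch.Tensor:
    out_features, in_features = weight.shape
    w = weight.reshape(-1, group_size)                 # each row shares one scale/zero-point
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    qmax = 2 ** n_bits - 1
    scale = (w_max - w_min).clamp(min=1e-8) / qmax     # step size per group
    zero = torch.round(-w_min / scale)                 # zero-point per group
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)          # quantize
    return ((q - zero) * scale).reshape(out_features, in_features)   # dequantize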

Block-AP Phase:

  • --block_ap_samples: Calibration samples (default: 4096)
  • --block_ap_batch_size: Batch size (default: 2, limited by memory)
  • --block_ap_epochs: Training epochs per block (default: 2)
  • --block_ap_lr_weights: Learning rate for weights (default: 2e-5 for 2-bit)
  • --block_ap_lr_quant: Learning rate for quantization params (default: 1e-4)
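
Block-AP keeps two learning rates because it updates two kinds of parameters at once: the (still full-precision) weights and the quantization parameters. A sketch of how the optimizer could be split into parameter groups, assuming quantization parameters carry "scale" or "zero" in their names:

import torch

def make_block_ap_optimizer(block, lr_weights: float = 2e-5, lr_quant: float = 1e-4):
    weight_params, quant_params = [], []
    for name, param in block.named_parameters():
        # assumption: quantization params are registered with "scale"/"zero" in their names
        (quant_params if ("scale" in name or "zero" in name) else weight_params).append(param)
    return torch.optim.AdamW([
        {"params": weight_params, "lr": lr_weights},
        {"params": quant_params, "lr": lr_quant},
    ])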

E2E-QP Phase:

  • --e2e_samples: Training samples (default: 4096)
  • --e2e_batch_size: Batch size (default: 16)
  • --e2e_epochs: Training epochs (default: 1)
  • --e2e_lr: Learning rate (default: 2e-5 for 2-bit)
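
Since E2E-QP updates only the quantization step sizes, the trainable parameter count is a small fraction of the model, which is why the larger batch size still fits in memory. A sketch of the freezing logic, again assuming step sizes are registered with "scale" in their parameter names (the real selection lives in e2e_qp_trainer.py):

def freeze_all_but_step_sizes(model):
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = "scale" in name  # assumption: step sizes are named "*scale*"
        if param.requires_grad:
            trainable.append(param)
    return trainable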

Training Different Bit-widths

3-bit Quantization (Better accuracy, less compression):

python train_efficientqat.py \
    --n_bits 3 \
    --block_ap_lr_weights 1e-5 \
    --e2e_lr 1e-5 \
    [... other args ...]

4-bit Quantization (Best accuracy, moderate compression):

python train_efficientqat.py \
    --n_bits 4 \
    --block_ap_lr_weights 1e-5 \
    --e2e_lr 1e-5 \
    [... other args ...]

Resume from Checkpoint

Skip Block-AP (if already completed):

python train_efficientqat.py \
    --skip_block_ap \
    --block_ap_checkpoint outputs/run_TIMESTAMP/block_ap_checkpoint.pt \
    [... other args ...]

Run only Block-AP:

python train_efficientqat.py \
    --skip_e2e_qp \
    [... other args ...]

Memory-Constrained Training

If you run out of memory, try:

  1. Reduce batch sizes:
--block_ap_batch_size 1 \
--e2e_batch_size 8
  2. Reduce samples:
--block_ap_samples 2048 \
--e2e_samples 2048
  3. Use gradient accumulation (modify e2e_qp_trainer.py; see the sketch below):
# Add gradient accumulation steps
accumulation_steps = 4
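
A rough illustration of how the accumulation loop could look inside the trainer; model, dataloader, optimizer, and compute_loss stand in for whatever e2e_qp_trainer.py already defines:

accumulation_steps = 4
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    # scale the loss so the accumulated gradient averages over the effective batch
    loss = compute_loss(model, batch) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()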

Evaluation

Evaluate on AOKVQA Validation Set

python evaluate_quantized.py \
    --model_path outputs/run_TIMESTAMP/final_quantized_model \
    --test_json datasets/aokvqa/aokvqa_v1p0_val.json \
    --test_image_dir datasets/coco/val2017 \
    --output_file results.json \
    --batch_size 8 \
    --max_new_tokens 20 \
    --device cuda:0

Evaluation Metrics

The script computes:

  • Exact Match (EM): Percentage of exact string matches
  • BLEU-1: Unigram overlap score
  • BERTScore: Semantic similarity (Precision, Recall, F1)
  • ROUGE: n-gram overlap (ROUGE-1, ROUGE-2, ROUGE-L)
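
For reference, these metrics map onto the installed packages roughly as follows; this is a minimal per-pair sketch, whereas evaluate_quantized.py aggregates over the whole validation set:

from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score

pred, ref = "a red bus", "red bus"

exact_match = float(pred.strip().lower() == ref.strip().lower())
bleu1 = sentence_bleu([ref.split()], pred.split(), weights=(1, 0, 0, 0))
rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"]).score(ref, pred)
P, R, F1 = bert_score([pred], [ref], lang="en")  # downloads a scoring model on first use

print(exact_match, bleu1, rouge["rougeL"].fmeasure, F1.item())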

Quick Test on Subset

python evaluate_quantized.py \
    --model_path outputs/run_TIMESTAMP/final_quantized_model \
    --test_json datasets/aokvqa/aokvqa_v1p0_val.json \
    --test_image_dir datasets/coco/val2017 \
    --n_samples 100 \
    --output_file quick_test.json

Expected Results

Based on the EfficientQAT paper (Table 14) for LLaVA-1.5:

SmolVLM-2B Expected Performance (Estimated)

Quantization   MMBench   MME   MM-Vet   ScienceQA   Avg
FP16           ~62       ~70   ~28      ~65         ~56
4-bit          ~61       ~69   ~27      ~64         ~55
3-bit          ~60       ~68   ~27      ~63         ~54
2-bit          ~58       ~65   ~25      ~60         ~52

Note: These are rough estimates. Actual results depend on:

  • Calibration data quality
  • Hyperparameter tuning
  • Number of training samples

Training Time Estimates

For SmolVLM-2B on single A100-80GB:

Phase      Time       Memory
Block-AP   ~4-6 hrs   ~8 GB
E2E-QP     ~1-2 hrs   ~10 GB
Total      ~5-8 hrs   ~10 GB

Troubleshooting

Issue 1: CUDA Out of Memory

Solution: Reduce batch sizes or use smaller samples

--block_ap_batch_size 1 --e2e_batch_size 4

Issue 2: Poor Accuracy After Block-AP

Possible causes:

  • Insufficient calibration samples
  • Learning rate too high/low
  • Wrong group size

Solution: Try:

--block_ap_samples 8192 \
--block_ap_lr_weights 1e-5 \
--group_size 32

Issue 3: Model Not Loading

Solution: Check PyTorch and Transformers versions

pip install "transformers>=4.37.0" --upgrade

Issue 4: Slow Training

Solutions:

  • Enable Flash Attention 2 (if supported)
  • Reduce number of workers: --num_workers 2
  • Use a smaller image resolution (modify the processor; see the sketch below)
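
For the image-resolution option, the SmolVLM model card describes passing a size argument with a longest_edge value when creating the processor; treat the exact kwarg and value below as assumptions to verify against your transformers version:

from transformers import AutoProcessor

# smaller longest_edge -> fewer image tokens per sample -> faster steps (assumed kwarg)
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    size={"longest_edge": 2 * 384},
)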

Advanced Usage

1. Mixed-Precision Quantization

Quantize vision and text at different bit-widths:

Modify block_ap_trainer.py:

def replace_linear_with_quantized(self, block, block_type: str):
    # Use different n_bits depending on the block type
    if block_type == 'vision':
        n_bits = 4  # higher precision for the vision encoder
    else:
        n_bits = 2  # lower precision for the text decoder
    # ... then pass n_bits when constructing the quantized replacement layers

2. Custom Datasets

Create custom dataset loader in data_utils.py:

def load_custom_dataset(json_path, image_dir):
    # Implement your loading logic
    return {
        'questions': [...],
        'image_paths': [...],
        'answers': [...]
    }

3. Different Vision Encoders

The code supports any Idefics3-based model. To use different models:

--model_name HuggingFaceTB/SmolVLM-256M-Instruct  # Smaller version

Model Compression Results

After training, check model size:

import os

# Original FP16 model (rough estimate: ~2.25B params x 2 bytes/param ≈ 4.5 GB)
original_size = 4.5 * 1024  # MB

# Quantized model
quantized_dir = "outputs/run_TIMESTAMP/final_quantized_model"
quantized_size = sum(
    os.path.getsize(os.path.join(quantized_dir, f))
    for f in os.listdir(quantized_dir)
    if os.path.isfile(os.path.join(quantized_dir, f))
) / (1024**2)  # Convert to MB

compression_ratio = original_size / quantized_size
print(f"Compression: {compression_ratio:.2f}x")

Expected compression:

  • 2-bit: ~6-7x smaller
  • 3-bit: ~4-5x smaller
  • 4-bit: ~3-4x smaller
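
These ratios sit below the naive 16 / n_bits because each group also stores its own quantization parameters, and some layers (embeddings, norms) stay in higher precision. A quick back-of-envelope check, assuming one 16-bit scale and one 16-bit zero-point per group (actual packing formats vary):

def effective_bits(n_bits: int, group_size: int, scale_bits: int = 16, zero_bits: int = 16) -> float:
    # bits per weight once the per-group scale/zero-point overhead is amortized
    return n_bits + (scale_bits + zero_bits) / group_size

for bits in (2, 3, 4):
    print(bits, 16 / effective_bits(bits, 64))  # ~6.4x, ~4.6x, ~3.6x for the quantized layers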

Citation

If you use this code, please cite:

@inproceedings{chen2025efficientqat,
  title={EfficientQAT: Efficient Quantization-Aware Training for Large Language Models},
  author={Chen, Mengzhao and Shao, Wenqi and Xu, Peng and Wang, Jiahao and Gao, Peng and Zhang, Kaipeng and Luo, Ping},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}

Next Steps

  1. Baseline Evaluation: Run FP16 model first to get baseline
  2. Train 4-bit: Start with 4-bit for best accuracy
  3. Train 2-bit: Push to extreme compression
  4. Ablation Studies: Vary group size, samples, learning rates
  5. Write Paper: Document your findings!

Contact & Support

For issues or questions:

  1. Check this README thoroughly
  2. Review the code comments
  3. Try different hyperparameters
  4. Open an issue with:
    • Error message
    • Command used
    • System specifications

License

This code is provided for research purposes. Please check the licenses of:

  • SmolVLM model
  • EfficientQAT paper
  • Dependencies (PyTorch, Transformers, etc.)

Good luck with your research! 🚀
