Name: Pranav Tapdiya
Batch: C1_08
Automatic image captioning is a challenging task that requires understanding the visual content of an image and generating a natural language description that accurately captures the key elements and relationships within the scene. The goal of this project is to develop a deep learning model that can automatically generate descriptive captions for images by combining computer vision and natural language processing techniques.
This task involves:
- Extracting meaningful visual features from images
- Understanding the semantic context of the image content
- Generating grammatically correct and contextually relevant descriptions
- Mapping visual information to natural language sequences
This project implements an Image Captioning System using a combination of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). The architecture follows an encoder-decoder framework:
**Image Feature Extraction (Encoder)**
- Uses pre-trained VGG16 model as the feature extractor
- VGG16 is trained on ImageNet and provides robust visual feature representations
- The final classification layer is removed, so the model outputs the 4096-dimensional activations of the last fully connected (fc2) layer
- Images are preprocessed to 224x224 pixels to match VGG16 input requirements
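A minimal sketch of this extraction step, assuming TensorFlow/Keras is installed; the helper names (`build_feature_extractor`, `extract_features`) are illustrative, not taken from the notebook:

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

def build_feature_extractor(weights="imagenet"):
    """VGG16 with the final classification layer removed, so the
    output is the 4096-d fc2 activation."""
    base = VGG16(weights=weights)
    return Model(inputs=base.inputs, outputs=base.layers[-2].output)

def extract_features(extractor, image_path):
    """Load an image, resize to 224x224, and return a (4096,) feature vector."""
    img = load_img(image_path, target_size=(224, 224))
    x = img_to_array(img)                # (224, 224, 3)
    x = preprocess_input(x[np.newaxis])  # add batch dim, apply VGG16 preprocessing
    return extractor.predict(x, verbose=0)[0]
```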
**Caption Generation (Decoder)**
- Implements an LSTM (Long Short-Term Memory) network for sequence generation
- Processes the image features along with word embeddings
- Generates captions word-by-word in an autoregressive manner
- Uses embedding layers to convert words into dense vector representations
**Training Process**
- The model is trained on image-caption pairs from the Flickr8k dataset
- Each image has multiple reference captions (typically 5 captions per image)
- Captions are tokenized and preprocessed with start and end tokens
- The model learns to predict the next word given the image features and previous words
- Transfer Learning: Leverages pre-trained VGG16 for efficient feature extraction
- Attention Mechanism: the architecture can optionally be extended with attention to focus on relevant image regions
- BLEU Score Evaluation: Uses BLEU (Bilingual Evaluation Understudy) metrics to evaluate caption quality
- Vocabulary Management: Tokenizes captions and builds a vocabulary for word-to-index mapping
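The word-to-index mapping can be sketched in plain Python (the function name and the reservation of index 0 for padding are illustrative choices, not taken from the notebook):

```python
def build_vocab(captions, min_count=1):
    """Map each word to an integer index; index 0 is reserved for padding."""
    counts = {}
    for caption in captions:
        for word in caption.split():
            counts[word] = counts.get(word, 0) + 1
    vocab = [w for w, c in sorted(counts.items()) if c >= min_count]
    word_to_index = {w: i + 1 for i, w in enumerate(vocab)}  # 0 = padding
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word

# Example: two preprocessed captions with start/end tokens
caps = ["startseq a dog runs endseq", "startseq a child smiles endseq"]
w2i, i2w = build_vocab(caps)
```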
- TensorFlow/Keras: Deep learning framework for model building and training
- VGG16: Pre-trained CNN model for image feature extraction
- LSTM: Recurrent neural network for sequence generation
- NLTK: Natural Language Toolkit for BLEU score calculation
- NumPy & Matplotlib: Data manipulation and visualization
- PIL (Python Imaging Library): Image processing
The project uses the Flickr8k dataset, which is a benchmark dataset for image captioning tasks.
Dataset Link: https://www.kaggle.com/datasets/adityajn105/flickr8k/data
- Total Images: 8,000 images
- Captions per Image: 5 human-annotated captions for each image
- Total Captions: 40,000 captions
- Image Source: Collected from Flickr
- Caption Format: Natural language descriptions of image content
- `images/` - Directory containing 8,000 JPG images
- `captions.txt` - Text file with image filenames and corresponding captions
Sample rows from `captions.txt`:

```
image,caption
1000268201_693b08cb0e.jpg,A child in a pink dress is climbing up a set of stairs in an entry way.
1000268201_693b08cb0e.jpg,A girl going into a wooden building.
```
```
DL Final/
│
├── images/                                # Directory containing Flickr8k images
├── captions.txt                           # Image-caption mapping file
├── pranav_tapdiya_image_captioner.ipynb   # Main Jupyter notebook with implementation
└── README.md                              # Project documentation (this file)
```
**1. Data Loading & Preprocessing**
- Load images and captions from the Flickr8k dataset
- Clean and preprocess caption text (lowercase, remove punctuation, add start/end tokens)
- Create vocabulary and tokenize captions
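The caption cleaning described above could look like the following (dropping one-character tokens is a common extra step and is an assumption here, as are the `startseq`/`endseq` token names):

```python
import string

def clean_caption(caption):
    """Lowercase, strip punctuation, drop one-character tokens,
    and wrap the caption with start/end tokens."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if len(w) > 1]
    return "startseq " + " ".join(words) + " endseq"
```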
**2. Feature Extraction**
- Load pre-trained VGG16 model
- Extract 4096-dimensional feature vectors from all images
- Save features for efficient training
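Persisting the extracted features means the slow VGG16 pass runs only once; a simple pickle-based sketch (file name and helper names are illustrative):

```python
import pickle

def save_features(features, path="features.pkl"):
    """Persist a {image_id: feature_vector} dict so extraction runs only once."""
    with open(path, "wb") as f:
        pickle.dump(features, f)

def load_features(path="features.pkl"):
    """Reload the saved feature dict at training time."""
    with open(path, "rb") as f:
        return pickle.load(f)
```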
**3. Model Building**
- Design encoder-decoder architecture
- Configure embedding layers, LSTM layers, and dense layers
- Compile model with appropriate loss function and optimizer
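One common way to realize this encoder-decoder design is a "merge" model, in which the image features and the partial caption are encoded separately and combined by element-wise addition before predicting the next word. The layer sizes below are illustrative assumptions, not the notebook's actual hyperparameters:

```python
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from tensorflow.keras.models import Model

def build_caption_model(vocab_size, max_length, embed_dim=256, units=256):
    """Merge-style encoder-decoder for next-word prediction."""
    # Image branch: 4096-d VGG16 feature vector -> dense projection
    img_in = Input(shape=(4096,))
    img = Dropout(0.5)(img_in)
    img = Dense(units, activation="relu")(img)

    # Text branch: partial caption -> embedding -> LSTM
    seq_in = Input(shape=(max_length,))
    seq = Embedding(vocab_size, embed_dim, mask_zero=True)(seq_in)
    seq = Dropout(0.5)(seq)
    seq = LSTM(units)(seq)

    # Merge both branches and predict a softmax over the vocabulary
    merged = add([img, seq])
    merged = Dense(units, activation="relu")(merged)
    out = Dense(vocab_size, activation="softmax")(merged)

    model = Model(inputs=[img_in, seq_in], outputs=out)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```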
**4. Training**
- Train the model on image-caption pairs
- Use data generators for efficient batch processing
- Monitor training and validation loss
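Because expanding every caption into (partial sequence, next word) pairs is too large to hold in memory, batches are built on the fly. A sketch of such a generator, assuming captions are already converted to token-index lists and left-padded to a fixed length (all names are illustrative):

```python
import numpy as np

def data_generator(pairs, features, max_length, vocab_size, batch_size=32):
    """Yield ([image_features, padded_input_sequence], next_word_one_hot) batches.
    `pairs` is a list of (image_id, token_index_list) tuples;
    `features` maps image_id -> feature vector."""
    X1, X2, y = [], [], []
    while True:
        for image_id, seq in pairs:
            for i in range(1, len(seq)):
                in_seq = seq[:i]
                in_seq = [0] * (max_length - len(in_seq)) + in_seq  # left-pad
                target = np.zeros(vocab_size)
                target[seq[i]] = 1.0  # one-hot next word
                X1.append(features[image_id]); X2.append(in_seq); y.append(target)
                if len(X1) == batch_size:
                    yield [np.array(X1), np.array(X2)], np.array(y)
                    X1, X2, y = [], [], []
```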
**5. Evaluation**
- Generate captions for test images
- Calculate BLEU scores (BLEU-1, BLEU-2, BLEU-3, BLEU-4)
- Visualize results with sample predictions
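The BLEU evaluation maps directly onto NLTK's `corpus_bleu`: one list of tokenized reference captions per image, one tokenized hypothesis per image, and a weight tuple selecting BLEU-1 through BLEU-4. The toy captions below are made up for illustration:

```python
from nltk.translate.bleu_score import corpus_bleu

# One list of reference captions per image (tokenized)
references = [[["a", "dog", "runs", "on", "grass"],
               ["the", "dog", "is", "running"]]]
# One generated caption per image (tokenized)
hypotheses = [["the", "dog", "is", "running", "fast"]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0))
```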
**6. Inference**
- Load trained model
- Generate captions for new images
- Display images with predicted captions
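Greedy autoregressive decoding can be sketched model-agnostically. Here `predict_next` stands in for a wrapper around `model.predict` on the image features plus the padded partial sequence; the function and token names are assumptions for illustration:

```python
def generate_caption(predict_next, word_to_index, index_to_word, max_length):
    """Greedy decoding: start from 'startseq', repeatedly append the most
    likely next word until 'endseq' or max_length is reached.
    `predict_next(token_indices)` returns the predicted next word index."""
    seq = [word_to_index["startseq"]]
    words = []
    for _ in range(max_length):
        nxt = predict_next(seq)
        word = index_to_word.get(nxt)
        if word is None or word == "endseq":
            break
        words.append(word)
        seq.append(nxt)
    return " ".join(words)
```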
```
tensorflow>=2.x
keras
numpy
matplotlib
Pillow
nltk
tqdm
```
```
git clone <repository-url>
cd "DL Final"
```

**Important:** The images folder is not included in this repository due to its large size.
- Download the Flickr8k dataset from: https://www.kaggle.com/datasets/adityajn105/flickr8k/data
- Extract the downloaded files
- Place the `images/` folder in the project root directory
- The `captions.txt` file is already included in this repository
Your directory structure should look like:
```
DL Final/
├── images/                                # Download from Kaggle (8,000 images)
├── captions.txt                           # Already included
├── pranav_tapdiya_image_captioner.ipynb
└── README.md
```
```
pip install tensorflow keras numpy matplotlib Pillow nltk tqdm
```

- Open the Jupyter notebook: `pranav_tapdiya_image_captioner.ipynb`
- Run all cells to train the model or load pre-trained weights
- Generate captions for new images using the trained model
The model generates natural language descriptions for images by learning from the Flickr8k dataset. Performance is evaluated using BLEU scores, which measure the similarity between generated captions and human-annotated reference captions.
- VGG16 Paper: Very Deep Convolutional Networks for Large-Scale Image Recognition
- LSTM Paper: Long Short-Term Memory
- Flickr8k Dataset: https://www.kaggle.com/datasets/adityajn105/flickr8k/data
- Show and Tell Paper: Show and Tell: A Neural Image Caption Generator
This project is for educational purposes as part of a Deep Learning course assignment.
- Dataset provided by Flickr and made available on Kaggle
- Pre-trained VGG16 weights from ImageNet
- TensorFlow and Keras communities for excellent documentation and resources