Name: Pranav Tapdiya
Batch: C1_08
Automatic image captioning is a challenging task that requires understanding the visual content of an image and generating a natural language description that accurately captures the key elements and relationships within the scene. The goal of this project is to develop a deep learning model that can automatically generate descriptive captions for images by combining computer vision and natural language processing techniques.
This task involves:
- Extracting meaningful visual features from images
- Understanding the semantic context of the image content
- Generating grammatically correct and contextually relevant descriptions
- Mapping visual information to natural language sequences
This project implements an Image Captioning System using a combination of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). The architecture follows an encoder-decoder framework:
**Image Feature Extraction (Encoder)**
- Uses pre-trained VGG16 model as the feature extractor
- VGG16 is trained on ImageNet and provides robust visual feature representations
- The final classification layer is removed, so the model outputs the 4096-dimensional activations of the last fully connected (fc2) layer
- Images are preprocessed to 224x224 pixels to match VGG16 input requirements
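A minimal sketch of this extraction step, assuming TensorFlow/Keras is installed; the helper names (`build_feature_extractor`, `extract_features`) are illustrative, not taken from the notebook:

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

def build_feature_extractor(weights="imagenet"):
    """VGG16 with the final classification layer removed, so the
    output is the 4096-d fc2 activation."""
    base = VGG16(weights=weights)
    return Model(inputs=base.inputs, outputs=base.layers[-2].output)

def extract_features(extractor, image_path):
    """Load an image, resize to 224x224, and return a (4096,) feature vector."""
    img = load_img(image_path, target_size=(224, 224))
    x = img_to_array(img)                # (224, 224, 3)
    x = preprocess_input(x[np.newaxis])  # add batch dim, apply VGG16 preprocessing
    return extractor.predict(x, verbose=0)[0]
```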
**Caption Generation (Decoder)**
- Implements an LSTM (Long Short-Term Memory) network for sequence generation
- Processes the image features along with word embeddings
- Generates captions word-by-word in an autoregressive manner
- Uses embedding layers to convert words into dense vector representations
**Training Process**
- The model is trained on image-caption pairs from the Flickr8k dataset
- Each image has multiple reference captions (typically 5 captions per image)
- Captions are tokenized and preprocessed with start and end tokens
- The model learns to predict the next word given the image features and previous words
- Transfer Learning: Leverages pre-trained VGG16 for efficient feature extraction
- Attention Mechanism: the architecture can optionally be extended with attention to focus on relevant image regions
- BLEU Score Evaluation: Uses BLEU (Bilingual Evaluation Understudy) metrics to evaluate caption quality
- Vocabulary Management: Tokenizes captions and builds a vocabulary for word-to-index mapping
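The word-to-index mapping can be sketched in plain Python (the function name and the reservation of index 0 for padding are illustrative choices, not taken from the notebook):

```python
def build_vocab(captions, min_count=1):
    """Map each word to an integer index; index 0 is reserved for padding."""
    counts = {}
    for caption in captions:
        for word in caption.split():
            counts[word] = counts.get(word, 0) + 1
    vocab = [w for w, c in sorted(counts.items()) if c >= min_count]
    word_to_index = {w: i + 1 for i, w in enumerate(vocab)}  # 0 = padding
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word

# Example: two preprocessed captions with start/end tokens
caps = ["startseq a dog runs endseq", "startseq a child smiles endseq"]
w2i, i2w = build_vocab(caps)
```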
- TensorFlow/Keras: Deep learning framework for model building and training
- VGG16: Pre-trained CNN model for image feature extraction
- LSTM: Recurrent neural network for sequence generation
- NLTK: Natural Language Toolkit for BLEU score calculation
- NumPy & Matplotlib: Data manipulation and visualization
- PIL (Python Imaging Library): Image processing
The project uses the Flickr8k dataset, which is a benchmark dataset for image captioning tasks.
Dataset Link: https://www.kaggle.com/datasets/adityajn105/flickr8k/data
- Total Images: 8,000 images
- Captions per Image: 5 human-annotated captions for each image
- Total Captions: 40,000 captions
- Image Source: Collected from Flickr
- Caption Format: Natural language descriptions of image content
- `images/` - Directory containing 8,000 JPG images
- `captions.txt` - Text file with image filenames and corresponding captions
Sample rows from `captions.txt`:

```
image,caption
1000268201_693b08cb0e.jpg,A child in a pink dress is climbing up a set of stairs in an entry way.
1000268201_693b08cb0e.jpg,A girl going into a wooden building.
```
```
DL Final/
│
├── images/                                # Directory containing Flickr8k images
├── captions.txt                           # Image-caption mapping file
├── pranav_tapdiya_image_captioner.ipynb   # Main Jupyter notebook with implementation
└── README.md                              # Project documentation (this file)
```
**1. Data Loading & Preprocessing**
- Load images and captions from the Flickr8k dataset
- Clean and preprocess caption text (lowercase, remove punctuation, add start/end tokens)
- Create vocabulary and tokenize captions
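The caption cleaning described above could look like the following (dropping one-character tokens is a common extra step and is an assumption here, as are the `startseq`/`endseq` token names):

```python
import string

def clean_caption(caption):
    """Lowercase, strip punctuation, drop one-character tokens,
    and wrap the caption with start/end tokens."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if len(w) > 1]
    return "startseq " + " ".join(words) + " endseq"
```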
**2. Feature Extraction**
- Load pre-trained VGG16 model
- Extract 4096-dimensional feature vectors from all images
- Save features for efficient training
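Persisting the extracted features means the slow VGG16 pass runs only once; a simple pickle-based sketch (file name and helper names are illustrative):

```python
import pickle

def save_features(features, path="features.pkl"):
    """Persist a {image_id: feature_vector} dict so extraction runs only once."""
    with open(path, "wb") as f:
        pickle.dump(features, f)

def load_features(path="features.pkl"):
    """Reload the saved feature dict at training time."""
    with open(path, "rb") as f:
        return pickle.load(f)
```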
**3. Model Building**
- Design encoder-decoder architecture
- Configure embedding layers, LSTM layers, and dense layers
- Compile model with appropriate loss function and optimizer
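One common way to realize this encoder-decoder design is a "merge" model, in which the image features and the partial caption are encoded separately and combined by element-wise addition before predicting the next word. The layer sizes below are illustrative assumptions, not the notebook's actual hyperparameters:

```python
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from tensorflow.keras.models import Model

def build_caption_model(vocab_size, max_length, embed_dim=256, units=256):
    """Merge-style encoder-decoder for next-word prediction."""
    # Image branch: 4096-d VGG16 feature vector -> dense projection
    img_in = Input(shape=(4096,))
    img = Dropout(0.5)(img_in)
    img = Dense(units, activation="relu")(img)

    # Text branch: partial caption -> embedding -> LSTM
    seq_in = Input(shape=(max_length,))
    seq = Embedding(vocab_size, embed_dim, mask_zero=True)(seq_in)
    seq = Dropout(0.5)(seq)
    seq = LSTM(units)(seq)

    # Merge both branches and predict a softmax over the vocabulary
    merged = add([img, seq])
    merged = Dense(units, activation="relu")(merged)
    out = Dense(vocab_size, activation="softmax")(merged)

    model = Model(inputs=[img_in, seq_in], outputs=out)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```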
**4. Training**
- Train the model on image-caption pairs
- Use data generators for efficient batch processing
- Monitor training and validation loss
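Because expanding every caption into (partial sequence, next word) pairs is too large to hold in memory, batches are built on the fly. A sketch of such a generator, assuming captions are already converted to token-index lists and left-padded to a fixed length (all names are illustrative):

```python
import numpy as np

def data_generator(pairs, features, max_length, vocab_size, batch_size=32):
    """Yield ([image_features, padded_input_sequence], next_word_one_hot) batches.
    `pairs` is a list of (image_id, token_index_list) tuples;
    `features` maps image_id -> feature vector."""
    X1, X2, y = [], [], []
    while True:
        for image_id, seq in pairs:
            for i in range(1, len(seq)):
                in_seq = seq[:i]
                in_seq = [0] * (max_length - len(in_seq)) + in_seq  # left-pad
                target = np.zeros(vocab_size)
                target[seq[i]] = 1.0  # one-hot next word
                X1.append(features[image_id]); X2.append(in_seq); y.append(target)
                if len(X1) == batch_size:
                    yield [np.array(X1), np.array(X2)], np.array(y)
                    X1, X2, y = [], [], []
```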
**5. Evaluation**
- Generate captions for test images
- Calculate BLEU scores (BLEU-1, BLEU-2, BLEU-3, BLEU-4)
- Visualize results with sample predictions
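The BLEU evaluation maps directly onto NLTK's `corpus_bleu`: one list of tokenized reference captions per image, one tokenized hypothesis per image, and a weight tuple selecting BLEU-1 through BLEU-4. The toy captions below are made up for illustration:

```python
from nltk.translate.bleu_score import corpus_bleu

# One list of reference captions per image (tokenized)
references = [[["a", "dog", "runs", "on", "grass"],
               ["the", "dog", "is", "running"]]]
# One generated caption per image (tokenized)
hypotheses = [["the", "dog", "is", "running", "fast"]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0))
```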
**6. Inference**
- Load trained model
- Generate captions for new images
- Display images with predicted captions
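Greedy autoregressive decoding can be sketched model-agnostically. Here `predict_next` stands in for a wrapper around `model.predict` on the image features plus the padded partial sequence; the function and token names are assumptions for illustration:

```python
def generate_caption(predict_next, word_to_index, index_to_word, max_length):
    """Greedy decoding: start from 'startseq', repeatedly append the most
    likely next word until 'endseq' or max_length is reached.
    `predict_next(token_indices)` returns the predicted next word index."""
    seq = [word_to_index["startseq"]]
    words = []
    for _ in range(max_length):
        nxt = predict_next(seq)
        word = index_to_word.get(nxt)
        if word is None or word == "endseq":
            break
        words.append(word)
        seq.append(nxt)
    return " ".join(words)
```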
```
tensorflow>=2.x
keras
numpy
matplotlib
Pillow
nltk
tqdm
```
```
git clone <repository-url>
cd "DL Final"
```

**Important:** The images folder is not included in this repository due to its large size.
- Download the Flickr8k dataset from: https://www.kaggle.com/datasets/adityajn105/flickr8k/data
- Extract the downloaded files
- Place the `images/` folder in the project root directory
- The `captions.txt` file is already included in this repository
Your directory structure should look like:
```
DL Final/
├── images/                                # Download from Kaggle (8,000 images)
├── captions.txt                           # Already included
├── pranav_tapdiya_image_captioner.ipynb
└── README.md
```
```
pip install tensorflow keras numpy matplotlib Pillow nltk tqdm
```

- Open the Jupyter notebook: `pranav_tapdiya_image_captioner.ipynb`
- Run all cells to train the model or load pre-trained weights
- Generate captions for new images using the trained model
The model generates natural language descriptions for images by learning from the Flickr8k dataset. Performance is evaluated using BLEU scores, which measure the similarity between generated captions and human-annotated reference captions.
- VGG16 Paper: Very Deep Convolutional Networks for Large-Scale Image Recognition
- LSTM Paper: Long Short-Term Memory
- Flickr8k Dataset: https://www.kaggle.com/datasets/adityajn105/flickr8k/data
- Show and Tell Paper: Show and Tell: A Neural Image Caption Generator
This project is for educational purposes as part of a Deep Learning course assignment.
- Dataset provided by Flickr and made available on Kaggle
- Pre-trained VGG16 weights from ImageNet
- TensorFlow and Keras communities for excellent documentation and resources