
Image Captioning with Deep Learning

Author

Name: Pranav Tapdiya
Batch: C1_08


Problem Statement

Automatic image captioning is a challenging task that requires understanding the visual content of an image and generating a natural language description that accurately captures the key elements and relationships within the scene. The goal of this project is to develop a deep learning model that can automatically generate descriptive captions for images by combining computer vision and natural language processing techniques.

This task involves:

  • Extracting meaningful visual features from images
  • Understanding the semantic context of the image content
  • Generating grammatically correct and contextually relevant descriptions
  • Mapping visual information to natural language sequences

Explanation

This project implements an Image Captioning System using a combination of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). The architecture follows an encoder-decoder framework:

Model Architecture

  1. Image Feature Extraction (Encoder)

    • Uses pre-trained VGG16 model as the feature extractor
    • VGG16 is trained on ImageNet and provides robust visual feature representations
    • The final classification layer is removed, and the model outputs 4096-dimensional feature vectors from the second fully connected layer (fc2)
    • Images are preprocessed to 224x224 pixels to match VGG16 input requirements
  2. Caption Generation (Decoder)

    • Implements an LSTM (Long Short-Term Memory) network for sequence generation
    • Processes the image features along with word embeddings
    • Generates captions word-by-word in an autoregressive manner
    • Uses embedding layers to convert words into dense vector representations
  3. Training Process

    • The model is trained on image-caption pairs from the Flickr8k dataset
    • Each image has multiple reference captions (typically 5 captions per image)
    • Captions are tokenized and preprocessed with start and end tokens
    • The model learns to predict the next word given the image features and previous words
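The training objective in step 3 can be sketched in plain Python: each caption is expanded into (prefix, next-word) pairs, and the model learns to predict the target word from the image features plus the prefix. The function and token names below are illustrative, not taken from the notebook.

```python
def make_training_pairs(caption_tokens):
    """Expand one tokenized caption into (input prefix, target word) pairs.

    <start> and <end> mark the sequence boundaries; at each step the
    decoder sees the image features plus the prefix and must predict
    the next word.
    """
    tokens = ["<start>"] + caption_tokens + ["<end>"]
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = make_training_pairs(["a", "dog", "runs"])
# First pair: (["<start>"], "a"); the last pair has "<end>" as its target.
```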

Key Features

  • Transfer Learning: Leverages pre-trained VGG16 for efficient feature extraction
  • Extensibility: the base encoder-decoder can be extended with an attention mechanism that focuses on relevant image regions during generation
  • BLEU Score Evaluation: Uses BLEU (Bilingual Evaluation Understudy) metrics to evaluate caption quality
  • Vocabulary Management: Tokenizes captions and builds a vocabulary for word-to-index mapping
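The vocabulary-management step can be illustrated with a minimal sketch (an illustration of the general approach, not the notebook's exact tokenizer): count the words across all captions and assign each an integer index, reserving index 0 for padding.

```python
from collections import Counter

def build_vocab(captions, min_count=1):
    """Map each word to an integer index; index 0 is reserved for padding."""
    counts = Counter(w for cap in captions for w in cap.split())
    words = ["<pad>", "<start>", "<end>", "<unk>"]
    words += sorted(w for w, c in counts.items()
                    if c >= min_count and w not in words)
    return {w: i for i, w in enumerate(words)}

vocab = build_vocab(["a child in a pink dress",
                     "a girl going into a wooden building"])
```

Raising `min_count` shrinks the vocabulary by mapping rare words to `<unk>`, which keeps the final softmax layer manageable.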

Technologies Used

  • TensorFlow/Keras: Deep learning framework for model building and training
  • VGG16: Pre-trained CNN model for image feature extraction
  • LSTM: Recurrent neural network for sequence generation
  • NLTK: Natural Language Toolkit for BLEU score calculation
  • NumPy & Matplotlib: Data manipulation and visualization
  • PIL (Python Imaging Library): Image processing

Dataset

Flickr8k Dataset

The project uses the Flickr8k dataset, which is a benchmark dataset for image captioning tasks.

Dataset Link: https://www.kaggle.com/datasets/adityajn105/flickr8k/data

Dataset Details:

  • Total Images: 8,000 images
  • Captions per Image: 5 human-annotated captions for each image
  • Total Captions: 40,000 captions
  • Image Source: Collected from Flickr
  • Caption Format: Natural language descriptions of image content

Dataset Structure:

  • images/ - Directory containing 8,000 JPG images
  • captions.txt - Text file with image filenames and corresponding captions

Sample Caption Format:

image,caption
1000268201_693b08cb0e.jpg,A child in a pink dress is climbing up a set of stairs in an entry way.
1000268201_693b08cb0e.jpg,A girl going into a wooden building.
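A minimal way to load this file is to group captions by filename, splitting each row only on the first comma because the captions themselves may contain commas. This helper is a sketch, not the notebook's code.

```python
from collections import defaultdict

def load_captions(lines):
    """Group captions by image filename; split only on the first comma
    because captions themselves may contain commas."""
    captions = defaultdict(list)
    for line in lines[1:]:  # skip the "image,caption" header row
        filename, caption = line.split(",", 1)
        captions[filename].append(caption.strip())
    return captions

sample = [
    "image,caption",
    "1000268201_693b08cb0e.jpg,A child in a pink dress is climbing up a set of stairs in an entry way.",
    "1000268201_693b08cb0e.jpg,A girl going into a wooden building.",
]
caps = load_captions(sample)
# caps maps one filename to a list of its two reference captions.
```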

Project Structure

DL Final/
│
├── images/                              # Directory containing Flickr8k images
├── captions.txt                         # Image-caption mapping file
├── pranav_tapdiya_image_captioner.ipynb # Main Jupyter notebook with implementation
└── README.md                            # Project documentation (this file)

Implementation Workflow

  1. Data Loading & Preprocessing

    • Load images and captions from the Flickr8k dataset
    • Clean and preprocess caption text (lowercase, remove punctuation, add start/end tokens)
    • Create vocabulary and tokenize captions
  2. Feature Extraction

    • Load pre-trained VGG16 model
    • Extract 4096-dimensional feature vectors from all images
    • Save features for efficient training
  3. Model Building

    • Design encoder-decoder architecture
    • Configure embedding layers, LSTM layers, and dense layers
    • Compile model with appropriate loss function and optimizer
  4. Training

    • Train the model on image-caption pairs
    • Use data generators for efficient batch processing
    • Monitor training and validation loss
  5. Evaluation

    • Generate captions for test images
    • Calculate BLEU scores (BLEU-1, BLEU-2, BLEU-3, BLEU-4)
    • Visualize results with sample predictions
  6. Inference

    • Load trained model
    • Generate captions for new images
    • Display images with predicted captions
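The word-by-word inference in steps 5 and 6 reduces to a greedy decoding loop. In the sketch below, `predict_next` stands in for the trained model's forward pass (an assumption for illustration; the real notebook would call `model.predict` with the encoded image features and a padded token sequence):

```python
def greedy_decode(predict_next, image_features, max_len=20):
    """Generate a caption word-by-word until <end> or the length cap."""
    caption = ["<start>"]
    for _ in range(max_len):
        word = predict_next(image_features, caption)
        if word == "<end>":
            break
        caption.append(word)
    return " ".join(caption[1:])  # drop the <start> token

# Stand-in for the trained model: replays a fixed caption for demonstration.
SCRIPT = ["a", "dog", "runs", "<end>"]
def fake_model(features, prefix):
    return SCRIPT[len(prefix) - 1]

caption = greedy_decode(fake_model, image_features=None)  # → "a dog runs"
```

Greedy decoding picks the single most likely word at each step; beam search, which tracks several candidate prefixes in parallel, is a common refinement.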

Requirements

tensorflow>=2.0
keras
numpy
matplotlib
Pillow
nltk
tqdm

Setup Instructions

1. Clone the Repository

git clone <repository-url>
cd "DL Final"

2. Download the Dataset

Important: The images folder is not included in this repository due to its large size.

Your directory structure should look like:

DL Final/
├── images/                  # Download from Kaggle (8,000 images)
├── captions.txt            # Already included
├── pranav_tapdiya_image_captioner.ipynb
└── README.md

3. Install Dependencies

pip install tensorflow keras numpy matplotlib Pillow nltk tqdm

4. Run the Notebook

  • Open the Jupyter notebook: pranav_tapdiya_image_captioner.ipynb
  • Run all cells to train the model or load pre-trained weights
  • Generate captions for new images using the trained model

Results

The model generates natural language descriptions for images by learning from the Flickr8k dataset. Performance is evaluated using BLEU scores, which measure the similarity between generated captions and human-annotated reference captions.
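To make the evaluation concrete, here is a simplified BLEU-1 (modified unigram precision, without the brevity penalty or higher-order n-grams that full BLEU adds). The project itself would use NLTK's `sentence_bleu`/`corpus_bleu` rather than this hand-rolled sketch.

```python
from collections import Counter

def bleu1(candidate, references):
    """Modified unigram precision: each candidate word is credited only up
    to the maximum number of times it appears in any single reference.
    (Full BLEU also applies a brevity penalty and n-grams up to 4.)"""
    cand = Counter(candidate.split())
    max_ref = Counter()
    for ref in references:
        for w, c in Counter(ref.split()).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in cand.items())
    return clipped / sum(cand.values())

score = bleu1("a dog runs fast in the grass",
              ["a dog is running in the grass",
               "a brown dog runs outside"])  # → 6/7 ≈ 0.857 ("fast" unmatched)
```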


License

This project is for educational purposes as part of a Deep Learning course assignment.


Acknowledgments

  • Dataset provided by Flickr and made available on Kaggle
  • Pre-trained VGG16 weights from ImageNet
  • TensorFlow and Keras communities for excellent documentation and resources
