Sentiment Analysis Using Word2Vec and PyTorch

🧠 Overview

This project implements a binary sentiment analysis system that classifies textual input as positive or negative using a hybrid neural network. The model leverages pre-trained Word2Vec embeddings and is built with PyTorch. Trained on a dataset of user reviews, it reaches roughly 83% accuracy on validation data (see Implementation Details below).


🚀 Setup Instructions

To get started with the project, follow these steps:

  1. Create and Activate a Virtual Environment

    • Use Python 3.12 or a compatible version.
    • Set up a virtual environment using venv or conda.
  2. Install Dependencies

    • Install all required Python packages using:
      pip install -r requirements.txt
      
  3. Download NLTK Resources

    • Required: punkt, stopwords, wordnet
    • Download them programmatically or manually through the NLTK Downloader.
    • In this project, the NLTK resources are downloaded into a local project directory rather than the home directory (see the setup sketch after this list).
  4. Download Word2Vec Embeddings

    • Get the pre-trained GoogleNews-vectors-negative300.bin.gz model (approx. 1.5 GB) from Google's Word2Vec archive.
    • Place it in the project root or at the path your scripts expect (loading is shown in the setup sketch after this list).
  5. Prepare the Dataset

    • Ensure your dataset is a CSV with a column (e.g., sentence) containing raw text.
    • Clean and tokenize the text before feeding it into the model.
  6. Train the Model

    • Run the training script to preprocess the data, prepare the embedding matrix, and train the sentiment classifier.
  7. Save & Export

    • The trained model is saved as a .pkl file.
    • Use this file later for inference or deployment.
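
A minimal sketch of steps 3 and 4, assuming gensim is among the installed dependencies; the local nltk_data directory and the embedding path are illustrative choices, not fixed by the project:

```python
import nltk
from gensim.models import KeyedVectors

# Step 3: download NLTK resources into a local project directory
# (instead of the home directory) and register that path with NLTK.
NLTK_DIR = "./nltk_data"  # illustrative local path
for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, download_dir=NLTK_DIR)
nltk.data.path.append(NLTK_DIR)

# Step 4: load the pre-trained GoogleNews Word2Vec binary (~1.5 GB).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz",  # adjust if stored elsewhere
    binary=True,
)
print(w2v.vector_size)  # 300
```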

🧠 Model Architecture

The model used is a hybrid neural network with the following architecture:

  • Embedding Layer: Uses pre-trained Word2Vec weights. This layer is frozen to preserve the semantic richness of the embeddings.
  • Adaptive Average Pooling Layer: Reduces the sequence dimension to a fixed-size representation.
  • Fully Connected Layers:
    • Linear(300 → 64) + ReLU
    • Linear(64 → 1) + Sigmoid
  • Output: A scalar probability representing sentiment. If probability > 0.5, it's classified as positive, otherwise negative.

This design is lightweight and efficient, offering both speed and strong performance on small to medium-sized datasets.
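
A sketch of this architecture in PyTorch; the class name is illustrative, and the embedding matrix is assumed to be a [vocab_size, 300] float tensor built from Word2Vec (see Implementation Details below):

```python
import torch
import torch.nn as nn

class HybridSentimentClassifier(nn.Module):
    def __init__(self, embedding_matrix: torch.Tensor):
        super().__init__()
        # Frozen embedding layer initialized from pre-trained Word2Vec weights.
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        # Collapse the variable-length sequence dimension to a fixed size.
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Sequential(
            nn.Linear(300, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids)      # [batch, seq_len, 300]
        x = self.pool(x.transpose(1, 2))   # pool over seq_len -> [batch, 300, 1]
        return self.fc(x.squeeze(-1)).squeeze(-1)  # [batch] probabilities
```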


🤖 Why This is a Hybrid Model

This project implements a hybrid sentiment analysis model, combining multiple architectural ideas into a streamlined and efficient pipeline. Here's a breakdown of what makes it hybrid and how it compares to traditional approaches:


⚙️ Components of the Hybrid Model

| Component | Function | Type |
| --- | --- | --- |
| Pretrained Word2Vec | Encodes semantic relationships between words from large corpora | Static |
| Embedding Layer | Loads pretrained vectors and maps input tokens to dense representations | Non-trainable |
| AdaptiveAvgPool1d | Compresses variable-length sequences into fixed-size vectors | Statistical |
| Dense Neural Layers | Learns sentiment patterns from the pooled vector | Trainable |
| Sigmoid Output | Outputs binary sentiment prediction (positive/negative) | Trainable |

🧠 Comparison with Traditional Models

| Model Type | Description | Limitation |
| --- | --- | --- |
| RNN / LSTM | Sequentially processes word embeddings to capture contextual dependencies | Computationally expensive; sensitive to input length |
| CNN for Text | Uses convolutional filters to learn local patterns across n-grams | Requires tuning kernel size; may miss global context |
| Transformer | Self-attention captures all pairwise relationships across the sequence | High memory and compute cost; overkill for small datasets |
| Bag-of-Words (BoW) | Simple count-based features with no semantic understanding | Ignores order and semantics |

➡️ This Hybrid Approach:

  • Leverages Word2Vec for semantic understanding without retraining.
  • Uses adaptive pooling instead of sequential modeling (faster, simpler).
  • Applies dense layers to learn patterns, allowing expressiveness while avoiding RNN/Transformer complexity.

Not Hybrid If…:

  • Only uses pretrained embeddings without further learning. Example: averaging Word2Vec vectors and classifying with a fixed threshold.

    • No learnable layers → a purely feature-based/statistical approach.

    • A "classic NLP pipeline": static, not hybrid.

  • Only uses a trainable deep learning model without external knowledge. Example: a trainable Embedding layer + LSTM + Dense layers with randomly initialized weights.

    • Entirely learned from scratch → a pure neural model, not hybrid.

  • Only uses one architectural style: pure RNN, pure Transformer, pure CNN, etc.

    • These are end-to-end deep learning models with no "hybrid" characteristics; even if efficient or well designed, they do not mix paradigms.


✅ Why It's Effective

  • Efficient: No recurrent or attention-based overhead.
  • Compact: Only a few trainable layers, reducing overfitting on small data.
  • Flexible: Can handle variable-length input thanks to pooling.
  • Transferable: Embeddings trained on general corpora boost performance on limited domain-specific data.

🛠️ Implementation Details

  • Text Preprocessing:

    • Lowercasing, punctuation removal
    • Stopword removal using NLTK
    • Lemmatization with WordNet
  • Tokenization:

    • Done using NLTK’s word_tokenize function.
  • Vocabulary Construction:

    • A Counter tallies token frequencies, and the most frequent tokens are mapped to unique integer IDs.
    • A fixed vocabulary size is maintained for efficiency.
  • Embedding Matrix:

    • A matrix of size [vocab_size, 300] is created using pre-trained vectors.
    • Words not found in Word2Vec are initialized as zeros.
  • Training:

    • Binary Cross-Entropy Loss
    • Optimizer: Adam
    • Epochs: Typically trained for 30–50 epochs
    • Observed accuracy of roughly 83% on a ~3,000-row dataset
  • Inference:

    • Input sentence is preprocessed and converted into indices.
    • Passed through the trained model to get a probability score (an end-to-end sketch follows this list).
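
A sketch of this pipeline end to end, from raw text to a trained model. It reuses w2v from the setup sketch and HybridSentimentClassifier from the architecture sketch; the vocabulary size, padding length, and two-sentence toy corpus are illustrative:

```python
import re
from collections import Counter

import numpy as np
import torch
import torch.nn as nn
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    # Lowercase, strip punctuation and other non-letter characters,
    # tokenize, drop stopwords, lemmatize.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [lemmatizer.lemmatize(t) for t in word_tokenize(text)
            if t not in STOPWORDS]

# Toy corpus standing in for the CSV's `sentence` column.
sentences = ["This movie was absolutely great!", "Terrible plot and bad acting."]
labels = torch.tensor([1.0, 0.0])
corpus_tokens = [preprocess(s) for s in sentences]

# Vocabulary: count tokens, keep the most frequent, reserve ID 0 for padding.
MAX_VOCAB = 10_000  # illustrative fixed vocabulary size
counter = Counter(tok for sent in corpus_tokens for tok in sent)
vocab = {tok: i + 1 for i, (tok, _) in enumerate(counter.most_common(MAX_VOCAB))}

# Embedding matrix [vocab_size, 300]; words missing from Word2Vec stay zero.
embedding_matrix = np.zeros((len(vocab) + 1, 300), dtype=np.float32)
for tok, idx in vocab.items():
    if tok in w2v:
        embedding_matrix[idx] = w2v[tok]

def encode(tokens: list[str], max_len: int = 20) -> list[int]:
    ids = [vocab.get(t, 0) for t in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))  # pad with 0 to a fixed length

batch = torch.tensor([encode(t) for t in corpus_tokens])

# Training: binary cross-entropy with Adam, as described above.
model = HybridSentimentClassifier(torch.from_numpy(embedding_matrix))
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters())
for epoch in range(30):  # the README reports 30-50 epochs
    optimizer.zero_grad()
    loss = criterion(model(batch), labels)
    loss.backward()
    optimizer.step()
```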

📦 Output

  • Model File: Serialized as /models/hybrid_model_full.pkl
  • Requirements File: Frozen via pip freeze > requirements.txt
  • Readiness for Deployment: The model can be loaded for predictions in any Python environment (a loading sketch follows); the API deployment script lives in the /api folder.
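
A minimal loading-and-inference sketch, assuming the .pkl was written with Python's pickle module and reusing the preprocess and encode helpers from the implementation sketch above:

```python
import pickle

import torch

# Load the serialized model (path from the Output section; adjust as needed).
with open("models/hybrid_model_full.pkl", "rb") as f:
    model = pickle.load(f)
model.eval()

sentence = "Absolutely loved every minute of it."
ids = torch.tensor([encode(preprocess(sentence))])  # preprocess/encode as above
with torch.no_grad():
    prob = model(ids).item()
print("positive" if prob > 0.5 else "negative", round(prob, 3))
```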

💡 Notes

  • Ensure the GoogleNews embedding file path is correctly set.
  • Restart the Jupyter kernel after downloading NLTK data if you are working inside notebooks.

📫 Contact

For feedback or issues, feel free to reach out or raise a GitHub issue in the project repository.

Thank you for visiting😃!!
