This project implements a binary sentiment analysis system that classifies textual input as positive or negative using a hybrid neural network. The model leverages pre-trained Word2Vec embeddings and is built using PyTorch. It is trained on a dataset of user reviews and achieves high accuracy on validation data.
To get started with the project, follow these steps:
- Create and Activate a Virtual Environment
  - Use Python 3.12 or compatible.
  - Set up a virtual environment using `venv` or `conda`.
- Install Dependencies
  - Install all required Python packages using: `pip install -r requirements.txt`
- Download NLTK Resources
  - Required: `punkt`, `stopwords`, `wordnet`
  - Download them programmatically or manually through the NLTK Downloader.
  - Here, I downloaded the NLTK resources to a local directory rather than the home directory.
- Download Word2Vec Embeddings
  - Get the pre-trained `GoogleNews-vectors-negative300.bin.gz` model (approx. 1.5 GB) from Google's Word2Vec archive.
  - Place it in the project root or at the specified path.
- Prepare the Dataset
  - Ensure your dataset is a CSV with a column (e.g., `sentence`) containing raw text.
  - Clean and tokenize the text before feeding it into the model.
- Train the Model
  - Run the training script to preprocess the data, prepare the embedding matrix, and train the sentiment classifier.
- Save & Export
  - The trained model is saved as a `.pkl` file.
  - Use this file later for inference or deployment.
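The NLTK step can be scripted. The sketch below assumes a project-local folder named `nltk_data` (a hypothetical name; the steps above only say the resources live in a local directory). It uses `nltk.download` with its `download_dir` argument and then appends that folder to `nltk.data.path` so later lookups find it. Recent NLTK releases may additionally ask for `punkt_tab` when `word_tokenize` is called; that resource is not part of the list above.

```python
from pathlib import Path

# Resource names listed in the setup steps.
NLTK_RESOURCES = ("punkt", "stopwords", "wordnet")


def download_nltk_resources(target_dir="nltk_data"):
    """Download the required NLTK resources into a project-local directory.

    `target_dir` is an assumed folder name, not one taken from the project.
    """
    import nltk  # imported here so the helper can be defined before NLTK is installed

    target = Path(target_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)
    for name in NLTK_RESOURCES:
        nltk.download(name, download_dir=str(target), quiet=True)
    # Make NLTK search the local directory in addition to its default locations.
    if str(target) not in nltk.data.path:
        nltk.data.path.append(str(target))
    return target


# Usage (needs network access the first time):
# download_nltk_resources("nltk_data")
```

If the resources are used inside a notebook, the same `nltk.data.path.append(...)` line has to run in every new kernel session, because the path list is not persisted.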
The model used is a hybrid neural network with the following architecture:
- Embedding Layer: Uses pre-trained Word2Vec weights. This layer is frozen to preserve the semantic richness of the embeddings.
- Adaptive Average Pooling Layer: Reduces the sequence dimension to a fixed-size representation.
- Fully Connected Layers:
  - `Linear(300 → 64)` + ReLU
  - `Linear(64 → 1)` + Sigmoid
- Output: A scalar probability representing sentiment. If probability > 0.5, it's classified as positive, otherwise negative.
This design is lightweight and efficient, offering both speed and strong performance on small to medium-sized datasets.
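The architecture described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the project's actual source: the class name `HybridSentimentModel` and the random stand-in for the Word2Vec matrix are assumptions, while the layer sizes (300 → 64 → 1) and the 0.5 threshold come from the description.

```python
import torch
import torch.nn as nn


class HybridSentimentModel(nn.Module):
    """Frozen embeddings -> adaptive average pooling -> two dense layers."""

    def __init__(self, embedding_matrix: torch.Tensor, hidden_dim: int = 64):
        super().__init__()
        # freeze=True keeps the pre-trained vectors fixed during training.
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        # Output size 1 collapses the sequence axis to a single vector.
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Sequential(
            nn.Linear(embedding_matrix.size(1), hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids)          # (batch, seq_len, 300)
        x = x.transpose(1, 2)                  # (batch, 300, seq_len)
        x = self.pool(x).squeeze(-1)           # (batch, 300)
        return self.classifier(x).squeeze(-1)  # (batch,) probabilities


torch.manual_seed(0)
embedding_matrix = torch.randn(10, 300)   # random stand-in for Word2Vec rows
model = HybridSentimentModel(embedding_matrix)
batch = torch.randint(0, 10, (2, 5))      # 2 sentences, 5 token IDs each
probs = model(batch)
labels = (probs > 0.5).long()             # 1 = positive, 0 = negative
print(probs.shape, labels.tolist())
```

With the embedding layer frozen, only the two linear layers are updated during training, which is what keeps the trainable parameter count small.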
This project implements a hybrid sentiment analysis model, combining multiple architectural ideas into a streamlined and efficient pipeline. Here's a breakdown of what makes it hybrid and how it compares to traditional approaches:
| Component | Function | Type |
|---|---|---|
| Pretrained Word2Vec | Encodes semantic relationships between words from large corpora | Static |
| Embedding Layer | Loads pretrained vectors and maps input tokens to dense representations | Non-trainable |
| AdaptiveAvgPool1d | Compresses variable-length sequences into fixed-size vectors | Statistical |
| Dense Neural Layers | Learns sentiment patterns from the pooled vector | Trainable |
| Sigmoid Output | Converts the final layer's output into a probability for the binary prediction (positive/negative) | Non-trainable (activation) |

Traditional approaches, for comparison:

| Model Type | Description | Limitation |
|---|---|---|
| RNN / LSTM | Sequentially processes word embeddings to capture contextual dependencies | Computationally expensive; sensitive to input length |
| CNN for Text | Uses convolutional filters to learn local patterns across n-grams | Requires tuning kernel size; may miss global context |
| Transformer | Self-attention captures all pairwise relationships across the sequence | High memory and compute cost; overkill for small datasets |
| Bag-of-Words (BoW) | Simple count-based features with no semantic understanding | Ignores order and semantics |
➡️ This Hybrid Approach:
- Leverages Word2Vec for semantic understanding without retraining.
- Uses adaptive pooling instead of sequential modeling (faster, simpler).
- Applies dense layers to learn patterns, allowing expressiveness while avoiding RNN/Transformer complexity.
❌ Not Hybrid If…:
- Only Uses Pretrained Embeddings Without Further Learning
  Example: Use Word2Vec to average word vectors and directly classify with a threshold.
  - No learnable layers → it's just a feature-based/statistical approach.
  - "Classic NLP pipeline": not hybrid, purely static.
- Only Uses a Trainable Deep Learning Model Without External Knowledge
  Example: A trainable Embedding layer + LSTM + Dense layers (randomly initialized weights).
  - Entirely learned from scratch → it's a neural model, not hybrid.
- Only Uses One Architectural Style
  Pure RNN, pure Transformer, pure CNN, etc.
  - These are end-to-end deep learning models with no "hybrid" characteristics.
  - Even if efficient or well-designed, they don't mix paradigms.
- Efficient: No recurrent or attention-based overhead.
- Compact: Only a few trainable layers, reducing overfitting on small data.
- Flexible: Can handle variable-length input thanks to pooling.
- Transferable: Embeddings trained on general corpora boost performance on limited domain-specific data.
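To make the "Flexible" point concrete, the short check below feeds two sequences of different lengths through `nn.AdaptiveAvgPool1d(1)` and confirms that both come out as a single 300-dimensional vector. The tensors are random stand-ins for embedded sentences, and the helper name `pooled` is only illustrative.

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool1d(1)  # output length 1, whatever the input length

torch.manual_seed(0)
short = torch.randn(1, 4, 300)    # (batch, seq_len=4, embedding_dim)
long = torch.randn(1, 25, 300)    # (batch, seq_len=25, embedding_dim)


def pooled(x):
    # AdaptiveAvgPool1d pools over the last axis, so move seq_len there first.
    return pool(x.transpose(1, 2)).squeeze(-1)


short_vec, long_vec = pooled(short), pooled(long)
print(short_vec.shape, long_vec.shape)

# Pooling to length 1 is the same as averaging the word vectors.
same_as_mean = torch.allclose(long_vec, long.mean(dim=1), atol=1e-6)
```

One thing to keep in mind: if sentences are padded to a common length for batching, the padding positions are included in the average, so heavily padded short sentences end up with smaller pooled vectors than they would on their own.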
- Text Preprocessing:
  - Lowercasing, punctuation removal
  - Stopword removal using NLTK
  - Lemmatization with WordNet
- Tokenization:
  - Done using NLTK's `word_tokenize` function.
- Vocabulary Construction:
  - A `Counter` of token frequencies is used to map tokens to unique integer IDs.
  - A fixed vocabulary size is maintained for efficiency.
- Embedding Matrix:
  - A matrix of size `[vocab_size, 300]` is created using pre-trained vectors.
  - Words not found in Word2Vec are initialized as zeros.
- Training:
  - Loss: Binary Cross-Entropy
  - Optimizer: Adam
  - Epochs: typically 30–50
  - Observed accuracy: roughly 83% or higher on a dataset of 3,000 rows
- Inference:
  - The input sentence is preprocessed and converted into indices.
  - The indices are passed through the trained model to get a probability score.
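The vocabulary, embedding-matrix, and training steps above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the project's training script: the `<pad>`/`<unk>` tokens, the function names, the `PooledClassifier` class, the learning rate, and the toy data are all hypothetical, and `toy_vectors` is a small dictionary standing in for the real Word2Vec lookup. Only the `Counter`-based vocabulary, the zero rows for missing words, Binary Cross-Entropy, and Adam come from the description.

```python
from collections import Counter

import torch
import torch.nn as nn

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens, not named in the README


def build_vocab(tokenized_texts, max_size=10_000):
    """Give the most frequent tokens integer IDs (0 = padding, 1 = unknown)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    vocab = {PAD: 0, UNK: 1}
    for token, _ in counts.most_common(max_size - len(vocab)):
        vocab[token] = len(vocab)
    return vocab


def build_embedding_matrix(vocab, vectors, dim=300):
    """Copy pre-trained vectors row by row; tokens without a vector stay zero."""
    matrix = torch.zeros(len(vocab), dim)
    for token, idx in vocab.items():
        if token in vectors:
            matrix[idx] = torch.as_tensor(vectors[token], dtype=torch.float32)
    return matrix


def encode(tokens, vocab, max_len):
    """Turn tokens into IDs, truncated or right-padded to max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


class PooledClassifier(nn.Module):
    """Frozen embeddings -> average pooling -> Linear/ReLU -> Linear/Sigmoid."""

    def __init__(self, embedding_matrix, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.head = nn.Sequential(
            nn.Linear(embedding_matrix.size(1), hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),
        )

    def forward(self, ids):
        pooled = self.pool(self.embedding(ids).transpose(1, 2)).squeeze(-1)
        return self.head(pooled).squeeze(-1)


def train(model, inputs, labels, epochs=30, lr=1e-3):
    """Full-batch training with binary cross-entropy and Adam."""
    criterion = nn.BCELoss()
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


# Toy data standing in for the preprocessed reviews and the Word2Vec lookup.
torch.manual_seed(0)
texts = [["great", "movie"], ["terrible", "plot"],
         ["great", "acting"], ["terrible", "movie", "overall"]]
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
toy_vectors = {tok: torch.randn(300) for tok in ("great", "terrible", "movie", "plot")}

vocab = build_vocab(texts)
matrix = build_embedding_matrix(vocab, toy_vectors)
inputs = torch.tensor([encode(t, vocab, max_len=4) for t in texts])
model = PooledClassifier(matrix)
losses = train(model, inputs, labels)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

A real run would replace `toy_vectors` with the loaded GoogleNews vectors and iterate over mini-batches instead of a single full batch; both lookups only need `token in vectors` and `vectors[token]` to work.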
- Model File: Serialized as `/models/hybrid_model_full.pkl`
- Requirements File: Frozen via `pip freeze > requirements.txt`
- Readiness for Deployment: The model can be loaded for predictions in any Python environment that has the project's dependencies installed. The API deployment script is in the `/api` folder.
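Loading the saved file and classifying one sentence might look like the sketch below. The README does not say how the `.pkl` file is written, so this assumes the whole module was stored with `torch.save`. The model here is a small stand-in built from stock PyTorch modules (`nn.EmbeddingBag` in `mean` mode replaces the embedding-plus-pooling pair), and `predict_sentiment` is a hypothetical helper name. The file is written to a temporary directory rather than `/models`.

```python
import tempfile
from pathlib import Path

import torch
import torch.nn as nn


def predict_sentiment(model, token_ids, threshold=0.5):
    """Return (label, probability) for one already-encoded sentence."""
    model.eval()
    with torch.no_grad():
        batch = torch.tensor([token_ids], dtype=torch.long)
        prob = model(batch).reshape(-1)[0].item()
    return ("positive" if prob > threshold else "negative"), prob


# Stand-in for the trained network, built only from stock modules so that it
# can be unpickled without a custom class definition being importable.
torch.manual_seed(0)
stand_in = nn.Sequential(
    nn.EmbeddingBag.from_pretrained(torch.randn(10, 300), freeze=True, mode="mean"),
    nn.Linear(300, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "hybrid_model_full.pkl"
    torch.save(stand_in, path)
    # weights_only=False is needed to unpickle a whole module on recent
    # PyTorch releases; only use it for files from a trusted source.
    loaded = torch.load(path, weights_only=False)

label, prob = predict_sentiment(loaded, [2, 5, 7])
print(label, round(prob, 3))
```

When a whole module is pickled from a custom class, that class must be importable under the same module path at load time, so the deployment code needs access to the model definition as well as the `.pkl` file. The token IDs passed in must come from the same vocabulary that was used during training.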
- Ensure the `GoogleNews` embedding file path is correctly set.
- Restart the Jupyter kernel after downloading NLTK data, if used inside notebooks.
For feedback or issues, feel free to reach out or raise a GitHub issue in the project repository.
Thank you for visiting😃!!