Sentiment Analysis Using Word2Vec and PyTorch

🧠 Overview

This project implements a binary sentiment analysis system that classifies textual input as positive or negative using a hybrid neural network. The model leverages pre-trained Word2Vec embeddings and is built using PyTorch. It is trained on a dataset of user reviews; in the runs reported under Implementation Details, it reaches an accuracy of about 83% or higher on a 3,000-row dataset.

🚀 Setup Instructions

To get started with the project, follow these steps:

Create and Activate a Virtual Environment
- Use Python 3.12 or compatible.
- Set up a virtual environment using venv or conda.
Install Dependencies
- Install all required Python packages using:
```
pip install -r requirements.txt
```
Download NLTK Resources
- Required: punkt, stopwords, wordnet
- Download them programmatically or manually through the NLTK Downloader.
- In this project, the NLTK resources are downloaded to a local directory rather than the home directory.
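A minimal sketch of the programmatic download into a local directory. The folder name `nltk_data` and the helper name `download_nltk_resources` are assumptions for illustration; the project may use a different location. Recent NLTK releases may also ask for the `punkt_tab` resource when tokenizing.

```python
from pathlib import Path

# Resource names listed in the setup step above.
NLTK_RESOURCES = ("punkt", "stopwords", "wordnet")


def download_nltk_resources(target_dir="nltk_data"):
    """Download the resources into target_dir and register it with NLTK."""
    import nltk  # imported lazily so this module loads before NLTK is installed

    target = Path(target_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)
    for name in NLTK_RESOURCES:
        nltk.download(name, download_dir=str(target), quiet=True)
    # NLTK only searches the directories listed in nltk.data.path.
    if str(target) not in nltk.data.path:
        nltk.data.path.append(str(target))
    return target
```

Calling `download_nltk_resources()` once before preprocessing should be enough; the path registration has to be repeated in every new Python session that uses a non-default directory.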
Download Word2Vec Embeddings
- Get the pre-trained GoogleNews-vectors-negative300.bin.gz model (approx. 1.5 GB) from Google's Word2Vec archive.
- Place it in the project root or specified path.
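One common way to load this file is gensim's `KeyedVectors.load_word2vec_format`. The sketch below assumes gensim is among the project's dependencies, which the README does not state; `load_word2vec` is a hypothetical helper name.

```python
from pathlib import Path


def load_word2vec(path="GoogleNews-vectors-negative300.bin.gz", limit=None):
    """Load binary Word2Vec vectors; `limit` caps how many words are read."""
    model_path = Path(path)
    if not model_path.is_file():
        raise FileNotFoundError(f"Word2Vec file not found: {model_path}")

    # Lazy import: gensim is an assumed dependency, not confirmed by the README.
    from gensim.models import KeyedVectors

    # binary=True matches the .bin format; gensim reads the .gz file directly.
    return KeyedVectors.load_word2vec_format(str(model_path), binary=True, limit=limit)
```

Passing a `limit` such as `500_000` loads only the most frequent words, which can reduce memory use considerably while experimenting.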
Prepare the Dataset
- Ensure your dataset is a CSV with a column (e.g., sentence) containing raw text.
- Clean and tokenize the text before feeding it into the model.
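As an illustration of the expected input, the sketch below reads the text column with the standard `csv` module. The `label` column in the sample and the helper name `read_sentences` are assumptions; only the `sentence` column is named above.

```python
import csv
import io


def read_sentences(csv_file, column="sentence"):
    """Return the non-empty values of `column` from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in CSV header")
    return [row[column].strip() for row in reader if row[column] and row[column].strip()]


# In-memory sample; with a real dataset, pass open(path, newline="", encoding="utf-8").
sample = io.StringIO("sentence,label\nGreat phone!,1\n ,0\nBattery died fast.,0\n")
sentences = read_sentences(sample)
print(sentences)  # ['Great phone!', 'Battery died fast.']
```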
Train the Model
- Run the training script to preprocess the data, prepare the embedding matrix, and train the sentiment classifier.
Save & Export
- The trained model is saved as a .pkl file.
- Use this file later for inference or deployment.

🧠 Model Architecture

The model used is a hybrid neural network with the following architecture:

Embedding Layer: Uses pre-trained Word2Vec weights. This layer is frozen to preserve the semantic richness of the embeddings.
Adaptive Average Pooling Layer: Reduces the sequence dimension to a fixed-size representation.
Fully Connected Layers:
- Linear(300 → 64) + ReLU
- Linear(64 → 1) + Sigmoid
Output: A scalar probability representing sentiment. If probability > 0.5, it's classified as positive, otherwise negative.

This design is lightweight and efficient, offering both speed and strong performance on small to medium-sized datasets.
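The architecture described above could be written in PyTorch roughly as follows. This is a sketch under stated assumptions, not the project's actual code: the class name `HybridSentimentModel`, the vocabulary size of 1000, and the random stand-in weights are all hypothetical.

```python
import torch
import torch.nn as nn


class HybridSentimentModel(nn.Module):
    """Frozen embeddings -> average pooling -> two dense layers -> sigmoid."""

    def __init__(self, embedding_matrix: torch.Tensor, hidden_dim: int = 64):
        super().__init__()
        embed_dim = embedding_matrix.shape[1]  # 300 for the GoogleNews vectors
        # freeze=True keeps the pre-trained weights fixed during training.
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        pooled = self.pool(embedded.transpose(1, 2))  # (batch, embed_dim, 1)
        return self.classifier(pooled.squeeze(-1)).squeeze(-1)  # (batch,)


# Random weights stand in for the real [vocab_size, 300] Word2Vec matrix.
model = HybridSentimentModel(torch.randn(1000, 300))
probs = model(torch.randint(0, 1000, (4, 12)))  # 4 sentences, 12 token ids each
labels = (probs > 0.5).long()                   # 1 = positive, 0 = negative
```

Because the embedding is frozen, only the two linear layers are trained, which is 300·64 + 64 + 64 + 1 = 19,329 parameters in this sketch.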

🤖 Why This is a Hybrid Model

This project implements a hybrid sentiment analysis model, combining multiple architectural ideas into a streamlined and efficient pipeline. Here's a breakdown of what makes it hybrid and how it compares to traditional approaches:

⚙️ Components of the Hybrid Model

| Component | Function | Type |
| --- | --- | --- |
| Pretrained Word2Vec | Encodes semantic relationships between words from large corpora | Static |
| Embedding Layer | Loads pretrained vectors and maps input tokens to dense representations | Non-trainable |
| AdaptiveAvgPool1d | Compresses variable-length sequences into fixed-size vectors | Statistical |
| Dense Neural Layers | Learns sentiment patterns from the pooled vector | Trainable |
| Sigmoid Output | Outputs binary sentiment prediction (positive/negative) | Trainable |

🧠 Comparison with Traditional Models

| Model Type | Description | Limitation |
| --- | --- | --- |
| RNN / LSTM | Sequentially processes word embeddings to capture contextual dependencies | Computationally expensive; sensitive to input length |
| CNN for Text | Uses convolutional filters to learn local patterns across n-grams | Requires tuning kernel size; may miss global context |
| Transformer | Self-attention captures all pairwise relationships across the sequence | High memory and compute cost; overkill for small datasets |
| Bag-of-Words (BoW) | Simple count-based features with no semantic understanding | Ignores order and semantics |

➡️ This Hybrid Approach:

Leverages Word2Vec for semantic understanding without retraining.
Uses adaptive pooling instead of sequential modeling (faster, simpler).
Applies dense layers to learn patterns, allowing expressiveness while avoiding RNN/Transformer complexity.
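Written as equations, the three points above amount to the following forward pass. The symbols are notation introduced here for illustration: $T$ is the number of tokens, $E_{w_t}$ the frozen Word2Vec vector of token $w_t$, and $\sigma$ the sigmoid function; the layer sizes follow the Model Architecture section.

```latex
\begin{aligned}
\bar{e}  &= \frac{1}{T} \sum_{t=1}^{T} E_{w_t}
          && \text{(average pooling over frozen embeddings)} \\
h        &= \mathrm{ReLU}(W_1 \bar{e} + b_1),
          && W_1 \in \mathbb{R}^{64 \times 300} \\
\hat{y}  &= \sigma(W_2 h + b_2),
          && W_2 \in \mathbb{R}^{1 \times 64}
\end{aligned}
```

Only $W_1, b_1, W_2, b_2$ are learned; the input is classified as positive when $\hat{y} > 0.5$.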

❌ Not Hybrid If…:

Only Uses Pretrained Embeddings Without Further Learning
- Example: averaging Word2Vec vectors and classifying directly with a threshold.
- No learnable layers, so it is just a feature-based, statistical approach.
- This is a classic NLP pipeline: purely static, not hybrid.
Only Uses a Trainable Deep Learning Model Without External Knowledge
- Example: a trainable Embedding layer + LSTM + Dense layers with randomly initialized weights.
- Everything is learned from scratch, so it is a neural model, not a hybrid.
Only Uses One Architectural Style
- Example: a pure RNN, pure Transformer, or pure CNN.
- These are end-to-end deep learning models with no "hybrid" characteristics.
- Even if efficient or well designed, they don't mix paradigms.

✅ Why It's Effective

Efficient: No recurrent or attention-based overhead.
Compact: Only a few trainable layers, reducing overfitting on small data.
Flexible: Can handle variable-length input thanks to pooling.
Transferable: Embeddings trained on general corpora boost performance on limited domain-specific data.

🛠️ Implementation Details

Text Preprocessing:
- Lowercasing, punctuation removal
- Stopword removal using NLTK
- Lemmatization with WordNet
Tokenization:
- Done using NLTK’s word_tokenize function.
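The preprocessing and tokenization steps above can be sketched as one function. The exact order of operations in the project is not stated, so the order below is an assumption, as are the function names. The core pipeline takes its tokenizer, stopword set, and lemmatizer as arguments so it can be tried without NLTK data; `nltk_preprocess` wires in the NLTK components named above.

```python
import string


def preprocess(text, tokenize, stop_words, lemmatize):
    """Lowercase, strip punctuation, tokenize, drop stopwords, lemmatize."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = tokenize(no_punct)
    return [lemmatize(tok) for tok in tokens if tok not in stop_words]


def nltk_preprocess(text):
    """Same pipeline wired to NLTK; needs punkt, stopwords and wordnet downloaded."""
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()
    return preprocess(
        text,
        tokenize=word_tokenize,
        stop_words=set(stopwords.words("english")),
        lemmatize=lemmatizer.lemmatize,
    )


# Stand-ins so the pipeline can be tried without NLTK data installed.
demo = preprocess(
    "The movie was GREAT, really!",
    tokenize=str.split,
    stop_words={"the", "was"},
    lemmatize=lambda tok: tok,
)
print(demo)  # ['movie', 'great', 'really']
```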
Vocabulary Construction:
- A Counter tallies token frequencies, and each token is then mapped to a unique integer ID.
- A fixed vocabulary size is maintained for efficiency.
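A possible shape for the vocabulary step is sketched below. The special tokens `<pad>` and `<unk>`, the default `max_size=10000`, and the padding length `max_len=50` are assumptions for illustration; the README does not give these values.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens, not named in the README


def build_vocab(tokenized_sentences, max_size=10000):
    """Map the most frequent tokens to integer ids; ids 0 and 1 are reserved."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {PAD: 0, UNK: 1}
    for token, _ in counts.most_common(max_size - len(vocab)):
        vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, max_len=50):
    """Convert tokens to ids, truncating or right-padding to max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


vocab = build_vocab([["good", "movie"], ["bad", "movie"]], max_size=4)
print(vocab)                                  # {'<pad>': 0, '<unk>': 1, 'movie': 2, 'good': 3}
print(encode(["good", "film"], vocab, max_len=4))  # [3, 1, 0, 0]
```

Capping the vocabulary at `max_size` drops the rarest tokens ("bad" in the toy example), which are then encoded as `<unk>`.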
Embedding Matrix:
- A matrix of size [vocab_size, 300] is created using pre-trained vectors.
- Words not found in Word2Vec are initialized as zeros.
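The embedding matrix can be assembled along these lines. This sketch assumes the vocabulary ids run from 0 to `vocab_size - 1` and that `vectors` supports `in` and `[]` lookups, which holds for a plain dict and for gensim's `KeyedVectors`; the toy data is hypothetical.

```python
import numpy as np


def build_embedding_matrix(vocab, vectors, dim=300):
    """Return a [vocab_size, dim] float32 matrix aligned with the ids in vocab."""
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for token, idx in vocab.items():
        if token in vectors:  # out-of-vocabulary rows stay all zeros
            matrix[idx] = vectors[token]
    return matrix


# A toy dict stands in for the loaded Word2Vec model.
toy_vectors = {"good": np.ones(300, dtype=np.float32)}
toy_vocab = {"<pad>": 0, "good": 1, "zzzunknown": 2}
matrix = build_embedding_matrix(toy_vocab, toy_vectors)
print(matrix.shape)  # (3, 300)
```

The result can be converted with `torch.from_numpy(matrix)` before it is handed to the embedding layer.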
Training:
- Binary Cross-Entropy Loss
- Optimizer: Adam
- Epochs: typically 30–50
- Observed accuracy: around 83% or higher on a dataset of 3,000 rows
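A bare-bones training loop with these settings might look like the sketch below. The learning rate, the full-batch updates, and the toy classifier head with synthetic data are assumptions made so the example runs on its own; the README specifies only the loss, the optimizer, and the epoch range.

```python
import torch
import torch.nn as nn


def train(model, inputs, targets, epochs=30, lr=1e-3):
    """Full-batch training with BCE loss and Adam; returns the loss per epoch."""
    criterion = nn.BCELoss()  # expects probabilities, i.e. a sigmoid output
    # Only parameters that require gradients are optimized (frozen ones are skipped).
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    history = []
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        history.append(loss.item())
    return history


# Toy stand-in for the classifier head: pooled 300-d vectors in, probability out.
torch.manual_seed(0)
head = nn.Sequential(
    nn.Linear(300, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid(), nn.Flatten(0)
)
features = torch.randn(32, 300)
labels = (features[:, 0] > 0).float()  # synthetic labels, for illustration only
losses = train(head, features, labels, epochs=30)
```

With a real dataset, the loop would typically iterate over mini-batches from a `DataLoader` and track accuracy on a held-out split.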
Inference:
- Input sentence is preprocessed and converted into indices.
- Passed through the trained model to get a probability score.
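The inference step can be sketched as a small helper that applies the 0.5 threshold from the Model Architecture section. `predict_sentiment` and the `ConstantModel` stub are hypothetical names; the stub only stands in for the trained model so the example runs without a model file, and the token ids are assumed to come from the project's vocabulary.

```python
import torch


def predict_sentiment(model, token_ids, threshold=0.5):
    """Return (label, probability) for one sentence given as a list of token ids."""
    model.eval()
    batch = torch.tensor([token_ids], dtype=torch.long)  # shape (1, seq_len)
    with torch.no_grad():
        prob = float(model(batch).reshape(-1)[0])
    return ("positive" if prob > threshold else "negative"), prob


class ConstantModel(torch.nn.Module):
    """Hypothetical stub that always returns 0.9, standing in for the trained model."""

    def forward(self, batch):
        return torch.full((batch.shape[0],), 0.9)


label, prob = predict_sentiment(ConstantModel(), [4, 8, 15])
print(label, round(prob, 2))  # positive 0.9
```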

📦 Output

Model File: Serialized as /models/hybrid_model_full.pkl
Requirements File: Frozen via pip freeze > requirements.txt
Readiness for Deployment: The model can be loaded for predictions in any Python environment that has the project's dependencies installed. An API deployment script is available in the /api folder.
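The README does not say how the .pkl file is written. One plausible approach, shown here as an assumption, is saving the whole module with `torch.save` and restoring it with `torch.load`. The small `nn.Sequential` model and the temporary directory are stand-ins for the trained classifier and the models folder.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Small stand-in; in the project this would be the trained classifier.
model = nn.Sequential(nn.Linear(300, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

# A temporary directory stands in for the project's models folder.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hybrid_model_full.pkl")
    torch.save(model, path)  # pickles the whole module, not just the weights
    # weights_only=False is needed to unpickle a full module on recent PyTorch;
    # only use it with files you trust.
    restored = torch.load(path, weights_only=False)

restored.eval()  # switch to inference mode before predicting
same = torch.equal(model[0].weight, restored[0].weight)
print(same)  # True
```

When a full module of a custom class is saved this way, the class definition has to be importable in the environment that loads it, which matters for the API deployment.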

💡 Notes

Ensure the GoogleNews embedding file path is correctly set.
If you work inside a notebook, restart the Jupyter kernel after downloading the NLTK data.

📫 Contact

For feedback or issues, feel free to reach out or raise a GitHub issue in the project repository.

Thank you for visiting😃!!

