A high-fidelity American Sign Language (ASL) translation system that uses MediaPipe Holistic for landmark detection and LSTM neural networks for real-time sign recognition.
- Overview
- Technologies Used
- System Architecture
- Features
- Installation
- Usage
- Project Structure
- How It Works
- Model Details
This project implements a real-time sign language detection system capable of recognizing ASL signs from a webcam feed. The system extracts body, hand, and facial landmarks using Google's MediaPipe, processes them through an LSTM neural network, and provides immediate predictions with a clean Streamlit interface.
- ✅ Real-time detection from webcam at ~30 FPS
- ✅ High accuracy using deep LSTM networks
- ✅ Scalable architecture - easily add new signs
- ✅ Motion-based recognition - captures dynamic signing
- ✅ Prediction stabilization - reduces flickering
- ✅ Distance & position invariant - works at any distance/position
- Python 3.9 - Core programming language
- TensorFlow 2.20 / Keras - Deep learning framework for LSTM models
- MediaPipe 0.10.9 - Google's ML solution for landmark detection
  - Holistic model (pose + hands + face)
  - 1,662 features per frame
- NumPy 1.26 - Numerical computations and array operations
- OpenCV 4.10 - Camera capture and image processing
- Pillow - Image manipulation
- Streamlit 1.50 - Interactive web UI for real-time predictions
- streamlit-webrtc - WebRTC support for camera streaming
- scikit-learn 1.6 - Train/test splitting and metrics
- Matplotlib 3.9 - Training visualizations
- SciPy 1.13 - Scientific computations
```
USER WORKFLOW

1. DATA COLLECTION (collect_data.py)
   ├─ Webcam capture
   ├─ MediaPipe landmark extraction
   └─ Save 30-frame sequences (.npy files)

2. MODEL TRAINING (train_model.py)
   ├─ Load collected sequences
   ├─ Build LSTM model
   ├─ Train with validation split
   └─ Save trained model (.h5)

3. REAL-TIME INFERENCE (app.py)
   ├─ Live webcam feed
   ├─ Continuous landmark extraction
   ├─ 30-frame rolling buffer
   ├─ LSTM prediction
   ├─ Prediction stabilization
   └─ Streamlit UI display
```
```
Webcam Frame → MediaPipe → Landmarks (1,662 features) → Normalization
             → 30-Frame Sequence → LSTM Model → Softmax Predictions
             → Stabilization → Display
```
The landmark extractor (`LandmarkExtractor.py`) produces 1,662 features per frame (assembled as sketched after this list):
- Left hand: 21 landmarks × 3 coords = 63 features
- Right hand: 21 landmarks × 3 coords = 63 features
- Pose: 33 landmarks × 3 coords = 99 features
- Face: 468 landmarks × 3 coords = 1,404 features
- Pose visibility: 33 landmarks = 33 features
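A rough sketch of how such a vector can be assembled from a MediaPipe Holistic result (the exact ordering and helper name in `LandmarkExtractor.py` may differ):

```python
import numpy as np

def extract_keypoints(results):
    """Flatten a MediaPipe Holistic result into a single (1662,) feature vector.
    Missing detections are zero-filled so the vector length stays constant."""
    pose = (np.array([[lm.x, lm.y, lm.z] for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 3))
    pose_vis = (np.array([lm.visibility for lm in results.pose_landmarks.landmark])
                if results.pose_landmarks else np.zeros(33))
    face = (np.array([[lm.x, lm.y, lm.z] for lm in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    left = (np.array([[lm.x, lm.y, lm.z] for lm in results.left_hand_landmarks.landmark]).flatten()
            if results.left_hand_landmarks else np.zeros(21 * 3))
    right = (np.array([[lm.x, lm.y, lm.z] for lm in results.right_hand_landmarks.landmark]).flatten()
             if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, pose_vis, face, left, right])   # 99 + 33 + 1404 + 63 + 63 = 1662
```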
Normalization Strategy (sketched after this list):
- Position-invariant: All coordinates relative to nose position
- Scale-invariant: Normalized by shoulder width
- Enables recognition regardless of distance from camera
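A minimal sketch of this idea, assuming MediaPipe's pose landmark indices (0 = nose, 11/12 = shoulders); the project's actual normalization may differ in detail:

```python
import numpy as np

# MediaPipe pose indices: 0 = nose, 11 = left shoulder, 12 = right shoulder
NOSE, L_SHOULDER, R_SHOULDER = 0, 11, 12

def normalize_landmarks(landmarks, pose):
    """Make a landmark group position- and scale-invariant.

    landmarks: (N, 3) array of any landmark group (hands, face, pose).
    pose:      (33, 3) array of pose landmarks used as the reference frame.
    """
    origin = pose[NOSE]                                           # translate so the nose is the origin
    scale = np.linalg.norm(pose[L_SHOULDER] - pose[R_SHOULDER])   # shoulder width as unit length
    scale = max(scale, 1e-6)                                      # guard against degenerate detections
    return (landmarks - origin) / scale
```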
Architecture:

```
Input (30, 1662) → LSTM(64) + Dropout(0.2) → LSTM(128) + Dropout(0.3)
                 → Dense(64, ReLU) + Dropout(0.2) → Dense(num_classes, Softmax)
```
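A Keras definition consistent with these layer sizes might look like the following sketch (the exact code lives in `SignModel.py`):

```python
from tensorflow.keras import layers, models

def build_model(num_classes, seq_len=30, num_features=1662):
    """LSTM classifier matching the architecture above."""
    return models.Sequential([
        layers.Input(shape=(seq_len, num_features)),
        layers.LSTM(64, return_sequences=True),   # (None, 30, 64)
        layers.Dropout(0.2),
        layers.LSTM(128),                         # (None, 128)
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(num_classes, activation="softmax"),
    ])
```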
Training Features (callbacks sketched after this list):
- Adam optimizer with learning rate 0.001
- Categorical crossentropy loss
- Early stopping (patience: 15 epochs)
- Learning rate reduction on plateau
- Model checkpointing (saves best model)
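The callbacks listed above can be wired up roughly as follows (the patience value for early stopping comes from the list; the factor and patience for the learning-rate reduction are assumed values):

```python
import tensorflow as tf

callbacks = [
    # stop when validation loss stops improving (patience: 15 epochs)
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=15, restore_best_weights=True),
    # reduce the learning rate on a plateau (factor/patience here are assumptions)
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
    # checkpoint the best model seen so far
    tf.keras.callbacks.ModelCheckpoint("models/asl_model.h5", save_best_only=True),
]
```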
Prediction Stabilization:
- Requires 10 consecutive frames with the same prediction
- Minimum 90% confidence threshold
- Prevents flickering and false positives
- Builds sentence by adding stable predictions
Data Collection Tools:
- Single sign collection (`collect_data.py`)
- Batch collection (`batch_collect_data.py`)
- Visual countdown and progress bars
- Customizable vocabulary
- Python 3.9 or higher
- Webcam
- macOS / Linux / Windows
```bash
# 1. Clone the repository
git clone <your-repo-url>
cd sign-language-detector

# 2. Create virtual environment
python3.9 -m venv .venv39
source .venv39/bin/activate  # On Windows: .venv39\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt
```

```bash
# Simply run the script
./run_app.sh
```

```bash
# For a single sign
python collect_data.py

# For multiple signs (batch)
# First, edit batch_collect_data.py to add your vocabulary
python batch_collect_data.py
```

Tips for data collection:
- Use good lighting
- Perform each sign 30 times
- Vary speed and style slightly
- Keep hands visible in frame
```bash
python train_model.py
```

This will:
- Load all collected data from `training_data/`
- Build and train the LSTM model
- Save the model to `models/asl_model.h5`
- Generate the `labels.json` mapping
- Display training metrics
```bash
# Simple camera test
streamlit run app_simple.py

# Full app with predictions
streamlit run app.py
```

Then open your browser at http://localhost:8501.
```
sign-language-detector/
├── LandmarkExtractor.py     # MediaPipe landmark extraction
├── SignModel.py             # LSTM model architecture & training
├── collect_data.py          # Single sign data collection
├── batch_collect_data.py    # Batch data collection
├── train_model.py           # Model training script
├── app.py                   # Full Streamlit app with predictions
├── app_simple.py            # Simple camera test
├── run_app.sh               # Convenience runner script
├── requirements.txt         # Python dependencies
├── README.md                # This file
├── QUICKSTART.md            # Quick start guide
├── README_SCALING.md        # Guide for scaling to 100s of signs
├── training_data/           # Collected sign sequences
│   ├── 0000_hello/
│   │   ├── sequence_0000.npy
│   │   └── ...
│   └── 0001_goodbye/
│       └── ...
├── models/
│   └── asl_model.h5         # Trained model
└── labels.json              # Sign label mappings
```
MediaPipe Holistic processes each frame to detect (minimal usage sketch after this list):
- Pose landmarks: Body position (shoulders, elbows, wrists, etc.)
- Hand landmarks: Detailed finger positions for both hands
- Face landmarks: Facial expressions and head orientation
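A minimal sketch of running one webcam frame through MediaPipe Holistic (the project wraps this in `LandmarkExtractor.py`; the confidence thresholds shown are assumed defaults):

```python
import cv2
import mediapipe as mp

holistic = mp.solutions.holistic.Holistic(
    min_detection_confidence=0.5,   # assumed values; the project may tune these
    min_tracking_confidence=0.5,
)

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if ok:
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)      # MediaPipe expects RGB input
    results = holistic.process(rgb)
    print(results.pose_landmarks is not None,          # body landmarks detected?
          results.left_hand_landmarks is not None,     # left hand detected?
          results.right_hand_landmarks is not None,    # right hand detected?
          results.face_landmarks is not None)          # face detected?
cap.release()
```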
All landmarks are normalized to be invariant to:
- Distance: Works whether you're close or far from camera
- Position: Works regardless of where you stand in frame
- Scale: Normalized by shoulder width
Signs are recognized as 30-frame sequences (~1 second at 30 FPS):
- Captures the motion dynamics of signing
- Uses a rolling buffer for continuous detection
- Each sequence is a `(30, 1662)` array: 30 frames × 1,662 features (see the buffer sketch below)
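The rolling buffer can be as simple as a fixed-length deque; a sketch (the names here are illustrative, not necessarily those used in `app.py`):

```python
from collections import deque
import numpy as np

SEQUENCE_LENGTH = 30
NUM_FEATURES = 1662
buffer = deque(maxlen=SEQUENCE_LENGTH)          # rolling window; the oldest frame drops out

def push_frame(keypoints):
    """Add one frame's (1662,) feature vector; return the (30, 1662) window once full."""
    buffer.append(keypoints)
    if len(buffer) == SEQUENCE_LENGTH:
        return np.array(buffer)                  # shape (30, 1662), ready for the model
    return None

# example: feed 30 dummy frames
for _ in range(SEQUENCE_LENGTH):
    window = push_frame(np.zeros(NUM_FEATURES, dtype=np.float32))
print(window.shape)   # (30, 1662)
```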
The LSTM network (prediction snippet after this list):
- Learns temporal patterns in sign movements
- Processes entire sequences, not individual frames
- Outputs probability distribution over all known signs
- Uses softmax activation for multi-class prediction
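At inference time, one buffered sequence is passed to the trained model; a minimal sketch (paths follow the project structure above, and the assumed `labels.json` structure of index-to-name may differ from the actual file):

```python
import json
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("models/asl_model.h5")
with open("labels.json") as f:
    labels = json.load(f)                                 # assumed: {"0": "hello", ...}

sequence = np.zeros((1, 30, 1662), dtype=np.float32)      # placeholder for one buffered sequence
probs = model.predict(sequence, verbose=0)[0]             # softmax over all known signs
best = int(np.argmax(probs))
print(labels[str(best)], float(probs[best]))              # top sign and its confidence
```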
To prevent false positives (logic sketched after this list):
- Only accept predictions with β₯90% confidence
- Require same prediction for 10 consecutive frames
- Add to sentence when stable
- Reset buffer after accepting a sign
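A sketch of this logic, using the `PredictionStabilizer` parameters shown in the customization section below (the implementation in `app.py` may differ in detail):

```python
class PredictionStabilizer:
    """Accept a sign only after it has been the top prediction, above a confidence
    threshold, for several consecutive frames."""

    def __init__(self, min_confidence=0.9, stability_frames=10):
        self.min_confidence = min_confidence
        self.stability_frames = stability_frames
        self.current = None   # sign currently being tracked
        self.count = 0        # consecutive frames it has been on top

    def update(self, label, confidence):
        if confidence < self.min_confidence:
            self.current, self.count = None, 0       # low confidence: start over
            return None
        if label != self.current:
            self.current, self.count = label, 1      # new candidate sign
            return None
        self.count += 1
        if self.count >= self.stability_frames:
            self.current, self.count = None, 0       # reset so the sign isn't re-added
            return label                              # stable: append to the sentence
        return None
```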
Input shape:
- `(batch_size, 30, 1662)`
- 30 frames per sequence
- 1,662 features per frame
```
Layer (type)           Output Shape          Param #
=====================================================
lstm_1 (LSTM)          (None, 30, 64)        442,112
dropout (Dropout)      (None, 30, 64)        0
lstm_2 (LSTM)          (None, 128)           98,816
dropout_1 (Dropout)    (None, 128)           0
dense_1 (Dense)        (None, 64)            8,256
dropout_2 (Dropout)    (None, 64)            0
output (Dense)         (None, num_classes)   varies
=====================================================
Total params: ~549,000+ (varies with num_classes)
```
Training configuration (compile/fit sketched after this list):
- Optimizer: Adam (lr=0.001)
- Loss: Categorical Crossentropy
- Metrics: Accuracy, Top-5 Accuracy
- Epochs: 100 (with early stopping)
- Batch Size: 32
- Validation Split: 20%
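Putting this configuration together, the training call might look like the following sketch (it reuses the `model` and `callbacks` from the earlier snippets; `X` and `y` stand in for the loaded sequences and one-hot labels):

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy", tf.keras.metrics.TopKCategoricalAccuracy(k=5, name="top5_accuracy")],
)

history = model.fit(
    X, y,                       # (num_sequences, 30, 1662) inputs and one-hot labels
    epochs=100,                 # early stopping usually ends training sooner
    batch_size=32,
    validation_split=0.2,       # 20% held out for validation
    callbacks=callbacks,        # early stopping, LR reduction, checkpointing (see above)
)
```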
Performance:
- Inference Time: ~10-30ms per prediction
- FPS: ~30 frames per second
- Memory: ~500MB (model + MediaPipe)
- MediaPipe Holistic: Pre-trained ML models for landmark detection
- Normalization: Position and scale invariance through geometric transformations
- LSTM (Long Short-Term Memory): Captures temporal dependencies in sign sequences
- Dropout Regularization: Prevents overfitting (20-30% dropout)
- Early Stopping: Automatic training termination when validation loss plateaus
- Rolling Buffer: Continuous 30-frame window for real-time processing
- Prediction Smoothing: Temporal consistency through multi-frame voting
Edit `SignModel.py`:

```python
# Increase for larger vocabularies
layers.LSTM(128, ...)  # Increase from 64
layers.LSTM(256, ...)  # Increase from 128
```

Edit `app.py`:

```python
stabilizer = PredictionStabilizer(
    min_confidence=0.85,  # Lower for easier acceptance
    stability_frames=8    # Lower for faster response
)
```

See `README_SCALING.md` for a detailed guide on:
- Collecting data for 100+ signs
- Using pre-trained datasets (WLASL, MS-ASL)
- Optimizing model for large vocabularies
- Performance tuning
Camera not working:
- Check camera permissions in System Preferences
- Try a different camera index (0, 1, 2)

Low accuracy:
- Collect more training sequences (50+ per sign)
- Ensure good lighting during data collection
- Increase model capacity
- Train for more epochs

Slow performance:
- Reduce MediaPipe model complexity
- Use GPU acceleration
- Lower camera resolution
MIT License - feel free to use for educational and commercial purposes.
- Google MediaPipe - Landmark detection
- TensorFlow/Keras - Deep learning framework
- Streamlit - Web interface
For questions or contributions, please contact hello.sameerbusiness@gmail.com
Built by Sameer Abrar (Flexcrit Inc)