American Sign Language (ASL) recognition system using MediaPipe landmarks and deep learning.
```
asl_model/
├── configs/                      # Configuration files
│   ├── config.yaml               # Main configuration
│   └── label_map.json            # Label normalization rules
├── data/                         # Raw datasets
│   ├── kaggle_asl1/              # Kaggle ASL Alphabet dataset
│   ├── kaggle_asl2/              # Kaggle ASL Dataset
│   ├── kaggle_asl_combined/      # Combined Kaggle images (A-Z)
│   ├── microsoft_asl/            # MS-ASL videos and metadata
│   └── personal/                 # Personal recordings (optional)
├── artifacts/                    # Generated artifacts
│   ├── landmarks/                # Raw MediaPipe landmarks [T, 543, 4]
│   ├── features/                 # Preprocessed features [T, 75, 4]
│   ├── manifests/                # Dataset manifests (CSV)
│   ├── models/                   # Trained model checkpoints
│   │   └── lstm_attention_20251020_151032/  # Best model (86.3%)
│   └── logs/                     # TensorBoard training logs
├── src/                          # Core library code
│   ├── data/                     # Data processing modules
│   │   ├── __init__.py
│   │   └── dataloader.py         # PyTorch dataloader with windowing
│   ├── models/                   # Model architectures
│   │   ├── __init__.py
│   │   └── lstm_model.py         # BiLSTM + Attention
│   └── utils/                    # Utility functions
│       └── __init__.py
├── scripts/                      # Executable scripts
│   ├── 1_data_preparation/       # Download and organize data
│   │   ├── combine_kaggle_asl.py
│   │   ├── build_manifest.py
│   │   ├── assign_splits.py
│   │   └── msasl_*.py            # MS-ASL pipeline scripts
│   ├── 2_preprocessing/          # Extract and preprocess features
│   │   ├── extract_landmarks.py
│   │   ├── preprocess_features.py
│   │   └── filter_valid_features.py
│   ├── 3_training/               # Train models
│   │   └── train_baseline.py
│   └── 4_evaluation/             # Evaluate and visualize
│       ├── analyze_errors.py
│       ├── quick_stats.py
│       └── quick_viz.py
├── tests/                        # Test scripts
│   ├── test_dataloader_with_splits.py
│   ├── test_model.py
│   └── README.md
├── bash_scripts/                 # Bash shell scripts
│   ├── download_kaggle_datasets.sh
│   ├── check_feature_validity.sh
│   ├── check_feature_validity_unfiltered.sh
│   ├── train_ensemble.sh
│   └── README.md
├── plans/                        # Project planning documents
├── requirements.txt              # Python dependencies
├── README.md                     # This file
├── TRAINING_GUIDE.md             # Training documentation
├── KAGGLE_DATASETS_INFO.md       # Kaggle dataset details
├── MSASL_PIPELINE.md             # MS-ASL pipeline guide
└── PROJECT_STRUCTURE.md          # Detailed structure docs
```
```bash
git clone https://github.com/davidgit3000/asl_recognition.git
cd asl_recognition
```

**Mac/Linux:**

```bash
# Create virtual environment
python3.11 -m venv .venv311
source .venv311/bin/activate

# Install dependencies
pip install -r requirements.txt
```

**Windows:**

```bash
# Create virtual environment
python -m venv .venv311
.venv311\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

**Kaggle Datasets (required):**
**Option A - Using the script (Mac/Linux):**

```bash
# Set up Kaggle API credentials first (~/.kaggle/kaggle.json)
bash bash_scripts/download_kaggle_datasets.sh
```

**Option B - Manual download:**

- Download ASL Alphabet → extract to `data/kaggle_asl1/`
- Download ASL Dataset → extract to `data/kaggle_asl2/`

**MS-ASL Videos (optional):**

- Download from the MS-ASL Dataset
- Then use the automated scripts (see `scripts/1_data_preparation/README.md`)
```bash
# Combine Kaggle datasets
python scripts/1_data_preparation/combine_kaggle_asl.py

# Build manifest and assign splits
python scripts/1_data_preparation/build_manifest.py
python scripts/1_data_preparation/assign_splits.py

# Extract MediaPipe landmarks
python scripts/2_preprocessing/extract_landmarks.py

# Preprocess features for training
python scripts/2_preprocessing/preprocess_features.py

# Verify the dataloader
python tests/test_dataloader_with_splits.py
```

- Kaggle ASL Alphabet: ~78,000 images (A-Z letters)
- Kaggle ASL Dataset: ~9,000 images (A-Z letters)
- MS-ASL: ~190 videos (20 common words)
- Total Raw: 87,200 samples
- Valid Features: 68,671 samples (86.1% extraction success rate)
- Classes: 26 (A-Z letters only)
- Splits: 70% train (48,074), 15% val (10,278), 15% test (10,319)
- Removed: 11,144 zero features + 188 MS-ASL samples (class imbalance)
- Feature extraction success: 86.1% (dual-detection pipeline)
- Class balance: 1,847-3,008 samples per class (well-balanced)
- Most challenging classes: M, N (39.8% detection failure; their closed-fist handshapes hide most finger landmarks)
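The stratified 70/15/15 split above can be sketched as follows. This is an illustrative outline only (the real logic lives in `scripts/1_data_preparation/assign_splits.py`; the `stratified_split` helper here is hypothetical):

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Assign sample indices to train/val/test, stratified by class label.

    Shuffling and slicing within each class keeps every class represented
    at roughly the same ratio in all three splits.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = {"train": [], "val": [], "test": []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        n_train = int(n * fractions[0])
        n_val = int(n * fractions[1])
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]  # remainder goes to test
    return splits
```

Per-class rounding is why the reported counts (48,074 / 10,278 / 10,319) deviate slightly from exact 70/15/15 fractions of 68,671.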
- Data Collection → Download Kaggle + MS-ASL datasets ✅
- Manifest Building → Create unified CSV with metadata ✅
- Landmark Extraction → Dual-detection (MediaPipe Holistic + Hands fallback) ✅
- Feature Preprocessing → Normalize, smooth, reduce to 75 landmarks ✅
- Quality Filtering → Remove zero features and low-count classes ✅
- Train/Val/Test Split → Stratified 70/15/15 split ✅
- Dataloader → PyTorch DataLoader with windowing ✅
- Model Training → BiLSTM + Attention (86.3% test accuracy) ✅
- Evaluation → Error analysis and confusion matrix 🔄
- Inference Pipeline → Real-time webcam demo 📋
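The feature-preprocessing step normalizes landmarks so the model is insensitive to where the signer stands and how far they are from the camera. A minimal 2-D sketch, assuming MediaPipe Pose shoulder indices 11/12 (the repo's `preprocess_features.py` operates on full `[T, 75, 4]` arrays, so this is illustrative only):

```python
import math

def normalize_frame(landmarks, left_shoulder=11, right_shoulder=12):
    """Center landmarks on the shoulder midpoint, scale by shoulder width.

    `landmarks` is a list of (x, y) pairs; indices follow MediaPipe Pose
    (11 = left shoulder, 12 = right shoulder).
    """
    lx, ly = landmarks[left_shoulder]
    rx, ry = landmarks[right_shoulder]
    cx, cy = (lx + rx) / 2, (ly + ry) / 2      # torso center
    width = math.hypot(lx - rx, ly - ry) or 1.0  # guard against zero width
    return [((x - cx) / width, (y - cy) / width) for x, y in landmarks]
```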
- ✅ Dual-detection pipeline: MediaPipe Holistic + Hands fallback (86.1% success)
- ✅ Temporal smoothing: Savitzky-Golay filter (window=5, polynomial=2)
- ✅ Normalization: Centered on torso, scaled by shoulder width
- ✅ Quality filtering: Remove zero features and low-count classes
- ✅ Windowed sequences: 32 frames with configurable stride
- ✅ Data augmentation: Rotation, scale, translation (training only)
- ✅ Class balancing: Weighted loss for imbalanced data
- ✅ Stratified splits: 70/15/15 train/val/test
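The windowing scheme above (32-frame windows with a configurable stride) can be sketched as follows; `window_starts` is a hypothetical helper for illustration, not the repo's actual dataloader code:

```python
def window_starts(num_frames: int, window: int = 32, stride: int = 16):
    """Start indices of fixed-length windows over a T-frame sequence.

    Clips shorter than the window yield a single window at index 0
    (which a dataloader would pad to full length). A final window is
    appended if needed so the tail of the clip is always covered.
    """
    if num_frames <= window:
        return [0]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] != num_frames - window:
        starts.append(num_frames - window)
    return starts
```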
- ✅ BiLSTM + Attention: 3 layers, 512 hidden units, 17M parameters
- ✅ Regularization: Dropout (0.25), weight decay (1e-5), label smoothing (0.1)
- ✅ Optimization: Adam optimizer with ReduceLROnPlateau scheduler
- ✅ Early stopping: Patience=10 epochs
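At the heart of the BiLSTM + Attention model is attention pooling: the per-timestep BiLSTM outputs are collapsed into one vector using softmax-normalized scores, so informative frames dominate the classification. A dependency-free sketch of just that pooling step (the actual implementation is in `src/models/lstm_model.py`; this is not the checkpoint's code):

```python
import math

def attention_pool(hidden, scores):
    """Attention pooling over a sequence of hidden vectors.

    hidden: list of T vectors (lists of floats), e.g. BiLSTM outputs.
    scores: list of T unnormalized attention scores (one per timestep).
    Returns the softmax(scores)-weighted sum of the hidden states.
    """
    m = max(scores)                           # stabilize the softmax
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    weights = [e / z for e in exp]
    dim = len(hidden[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden))
            for d in range(dim)]
```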
- ✅ Test Accuracy: 86.29% (≈22.4× better than the 3.85% random baseline)
- ✅ Validation Accuracy: 86.79%
- ✅ Generalization: Val ≈ Test (no overfitting)
- ✅ Training Time: ~6 hours (100 epochs on Apple M4)
- `KAGGLE_DATASETS_INFO.md` - Kaggle dataset details
- `MSASL_PIPELINE.md` - MS-ASL download pipeline
- `scripts/README.md` - Scripts documentation
- `scripts/*/README.md` - Detailed docs for each stage
- ✅ Data collection and preprocessing (68,671 samples, 26 classes)
- ✅ Dual-detection landmark extraction (86.1% success rate)
- ✅ Feature engineering and quality filtering
- ✅ BiLSTM + Attention model (86.29% test accuracy)
- ✅ Training pipeline with early stopping and LR scheduling
- 🔄 Error analysis and confusion matrix visualization
- 🔄 Per-class performance evaluation
- **CNN-based models** (for spatial feature extraction)
  - 2D CNN on landmark heatmaps
  - 3D CNN for spatiotemporal features
  - ResNet/EfficientNet backbones
- **Hybrid models** (combining spatial + temporal)
  - CNN + LSTM (extract spatial features, then temporal modeling)
  - CNN + Transformer (attention over CNN features)
  - Two-stream networks (appearance + motion)
- **Transformer-based models**
  - Vision Transformer (ViT) for landmark sequences
  - Temporal Transformer with positional encoding
  - BERT-style pre-training on ASL data
- **Ensemble methods**
  - Train 5 diverse models (different architectures/hyperparameters)
  - Probability averaging or voting
  - Expected: +2-4% accuracy boost
- **Real-time inference pipeline**
  - Webcam integration with MediaPipe
  - Sliding-window prediction (30 FPS)
  - Temporal smoothing for stable predictions
- **Demo application**
  - GUI with live video feed
  - Top-3 predictions with confidence scores
  - Recording capability for new samples
- **Model optimization**
  - ONNX export for cross-platform deployment
  - Quantization for faster inference
  - Mobile deployment (TensorFlow Lite)
- **Word-level recognition**
  - Collect more MS-ASL data (500+ samples per word)
  - Sequence-to-sequence models
  - Sentence-level ASL translation
- **Transfer learning**
  - Fine-tune on personal signing style
  - Few-shot learning for new signs
- **Multi-modal learning**
  - Combine landmarks + raw video
  - Audio integration (for signed songs)
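For the planned real-time pipeline, the temporal smoothing could be as simple as a majority vote over the last few window predictions, so one noisy frame does not flip the displayed letter. A stdlib sketch (the `PredictionSmoother` class is hypothetical, not existing repo code):

```python
from collections import Counter, deque

class PredictionSmoother:
    """Majority-vote smoothing over the last k window predictions.

    A label is only emitted once it wins the vote within the recent
    history, which stabilizes live webcam output.
    """
    def __init__(self, history: int = 5):
        self.buffer = deque(maxlen=history)

    def update(self, label: str) -> str:
        self.buffer.append(label)
        return Counter(self.buffer).most_common(1)[0][0]
```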
| Model | Architecture | Params | Expected Acc | Training Time | Notes |
|---|---|---|---|---|---|
| BiLSTM + Attention | 3-layer BiLSTM | 17M | 86.3% ✅ | 6h | Current best |
| CNN + LSTM | ResNet18 + 2-layer LSTM | ~15M | 87-89% | 8h | Spatial + temporal |
| 3D CNN | 3D ResNet | ~30M | 85-88% | 10h | End-to-end spatiotemporal |
| Transformer | 6-layer encoder | ~20M | 88-91% | 12h | Pure attention |
| Ensemble (5 models) | Mixed | ~80M | 89-92% | 30h | Best performance |
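The ensemble row assumes simple probability averaging (soft voting): each model outputs a per-class probability vector, the vectors are averaged, and the argmax is taken. A minimal sketch (function name is illustrative):

```python
def ensemble_average(prob_lists):
    """Soft voting: average per-class probabilities from several models,
    then return the index of the highest-scoring class."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])
```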
- ✅ Baseline (Random): 3.8% (1/26 classes)
- ✅ Current (BiLSTM+Attention): 86.3%
- 🎯 Target (Ensemble): 90%+
- 🏆 SOTA (Published research): 92-95%
Educational project for CS 4620.