SubAI - Realtime Audio Translation for OBS


A professional solution for realtime audio translation in streaming/webinar environments. Stream in any language with live captions in another language, powered by GPU-accelerated Whisper ASR.

Perfect for:

  • 🎥 International streaming and webinars
  • 🎬 Video production with multilingual audiences
  • 📺 Live events with real-time translation
  • 🎓 Educational content with language barriers

Platforms: Windows (CUDA), macOS (Apple Silicon), Linux


🚀 Quick Start

Windows (Production Ready)

INSTALLER.bat              # One-click setup
SubAI_Start.bat            # Double-click to run

Mac (Apple Silicon)

# Install with MLX support for 3-5x faster performance
pip install -e './backend[mlx]'
chmod +x start.sh && ./start.sh
python3 set_model.py mlx large-v3-turbo  # Use MLX acceleration

Then:

  1. Open Control Panel at http://127.0.0.1:3000
  2. Go to ⚙️ Settings → Set pipeline to whisper
  3. Select your model, configure languages
  4. 🎙️ Stream tab → Connect microphone
  5. Add OBS overlay from 📊 Monitor tab

📖 Full Guides: Windows Setup | Mac Setup | MLX Setup (Apple Silicon) | Production Deployment


Features

Core Translation

  • GPU-accelerated Whisper ASR (CUDA on Windows/Linux, MLX on Apple Silicon, CPU fallback)
  • MLX optimization - 3-5x faster on M1/M2/M3/M4 Macs
  • 99 languages supported - Real-time translation between any Whisper-supported languages
  • 6+ model sizes - From tiny (40MB) to large-v3 (1.5GB) with live model switching
  • OBS Browser Source overlay - Transparent captions for streaming

User Experience

  • Modern React control UI - Visual model/language switching, real-time monitoring
  • Production launchers - Double-click .bat (Windows) or .sh (Mac) to start everything
  • System tray application - Professional background operation with menu
  • One-click installer - Automated setup for Windows

Professional Features

  • Session logging - Record all transcriptions to JSON with timestamps
  • Live audio monitoring - Visual audio levels and caption preview
  • WebSocket streaming - Low-latency audio and caption delivery
  • Model switching - Change models on-the-fly via UI, script, or API
  • Cross-platform - Windows (CUDA), macOS (Apple Silicon), Linux

Developer Friendly

  • Pluggable pipeline architecture - Easy to add new ASR/translation backends
  • REST API - Full programmatic control
  • MLX support - 3-5x faster on Apple Silicon (M1/M2/M3/M4)

Architecture

┌─────────────────┐                  ┌──────────────────────┐
│  Computer 1     │                  │   Computer 2 (GPU)   │
│  (OBS PC)       │                  │                      │
│                 │                  │                      │
│  ┌───────────┐  │   Audio (WS)     │  ┌────────────────┐  │
│  │ Browser   │──┼──────────────────┼─→│ FastAPI Server │  │
│  │ Sender    │  │                  │  │                │  │
│  └───────────┘  │                  │  │  ┌──────────┐  │  │
│                 │                  │  │  │ Whisper  │  │  │
│  ┌───────────┐  │   Captions (WS)  │  │  │ Pipeline │  │  │
│  │   OBS     │←─┼──────────────────┼──│  └──────────┘  │  │
│  │  Overlay  │  │                  │  │                │  │
│  └───────────┘  │                  │  └────────────────┘  │
└─────────────────┘                  └──────────────────────┘

System Requirements

Computer 2 (GPU PC)

  • GPU: NVIDIA RTX 20xx+ with 8GB+ VRAM (tested on RTX 5090)
  • OS: Windows 10/11, Linux, or macOS with Apple Silicon (M1/M2/M3/M4)
  • Python: 3.10 or 3.11
  • CUDA: 12.1+ drivers (Windows/Linux with NVIDIA GPU)
  • RAM: 16GB+ recommended
  • Storage: 5GB for models (can use smaller models on Mac)

Computer 1 (OBS PC)

  • OBS Studio 28+
  • Modern browser (Chrome/Edge/Firefox)

Setup

Quick Start for Mac Users

Running on macOS with Apple Silicon? See MAC_SETUP.md for Mac-specific instructions.

# Quick start for Mac:
chmod +x start.sh
./start.sh

# Switch to a smaller model for better performance:
python3 set_model.py small

On Computer 2 (GPU PC)

  1. Clone and install dependencies

    git clone https://github.com/arvindjuneja/subai.git SubAI
    cd SubAI
    python -m venv .venv
    .venv\Scripts\Activate.ps1  # Windows PowerShell
    # or: source .venv/bin/activate  # Linux
    
    pip install -e ./backend
    pip install torch --index-url https://download.pytorch.org/whl/cu121
  2. Pre-download Whisper models (optional but recommended, ~3GB)

    python download_models.py
  3. Start the backend server

    uvicorn app.main:app --app-dir backend --host 0.0.0.0 --port 8080

    Server will be accessible at http://<GPU_PC_IP>:8080

  4. Start the control UI (optional, for config management)

    cd frontend
    npm install
    npm run dev

    UI accessible at http://localhost:3000

  5. Allow firewall (Windows)

    New-NetFirewallRule -DisplayName "SubAI 8080" -Direction Inbound -Protocol TCP -LocalPort 8080 -Action Allow

On Computer 1 (OBS PC)

Option A: Use Browser Flags (Quick Test)

Close your browser and launch with flags:

# Chrome
"C:\Program Files\Google\Chrome\Application\chrome.exe" --unsafely-treat-insecure-origin-as-secure=http://<GPU_PC_IP>:8080 --user-data-dir="C:\temp\chrome-subai"

# Edge
"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe" --unsafely-treat-insecure-origin-as-secure=http://<GPU_PC_IP>:8080 --user-data-dir="C:\temp\edge-subai"

Then open: http://<GPU_PC_IP>:8080/static/sender.html

Option B: Serve Locally (Secure Context)

  1. Download sender.html from http://<GPU_PC_IP>:8080/static/sender.html
  2. Serve locally:
    # Python
    python -m http.server 9000
    
    # Node
    npx serve -p 9000 .
  3. Open: http://localhost:9000/sender.html?ws=ws://<GPU_PC_IP>:8080/ws/audio

OBS Overlay Setup

  1. In OBS: Sources → Add → Browser Source
  2. URL: http://<GPU_PC_IP>:8080/overlay
  3. Width: 1920, Height: 1080
  4. ✅ Enable Shutdown source when not visible
  5. ✅ Enable Refresh browser when scene becomes active

Production Deployment (Non-Technical Users)

For video production environments with non-technical users, SubAI includes easy-to-use launchers:

Option 1: One-Click Installer (Recommended)

  1. Run the installer (one-time setup):

    INSTALLER.bat

    This automatically:

    • Creates Python environment
    • Installs all dependencies
    • Downloads AI models
    • Creates desktop shortcut
  2. Users just double-click:

    SubAI_Start.bat
    • Starts backend + frontend automatically
    • Opens browser windows
    • System tray icon appears
    • Ready to use!

Option 2: System Tray Application

Professional launcher with tray icon and menu:

python launcher_app.py

Features:

  • Runs in background (no console window)
  • Right-click tray menu for quick access
  • Auto-opens browser windows
  • Clean shutdown

Option 3: Standalone Executable

Build a single .exe file for distribution:

python build_launcher.py

Creates dist/SubAI.exe - no Python knowledge required for end users!

See DEPLOYMENT.md for complete production deployment guide.

Usage

Quick Start

  1. Configure translation (via Control UI at http://localhost:3000 or REST API):

    curl -X POST http://<GPU_PC_IP>:8080/api/config \
      -H "Content-Type: application/json" \
      -d '{"source_language":"pl","target_language":"en","pipeline_name":"whisper"}'
  2. On Computer 1: Open the audio sender, select your microphone, click Connect

  3. On Computer 1: Start OBS with the overlay configured

  4. Speak Polish → captions appear in English after ~1-2 seconds

Session Logging

Record your transcriptions for later review. Use the Monitor tab in the Control UI to start/stop recording sessions, or use the API directly:

# Start a logging session
curl -X POST http://<GPU_PC_IP>:8080/api/session/start

# Check session status
curl http://<GPU_PC_IP>:8080/api/session/status

# Stop the session
curl -X POST http://<GPU_PC_IP>:8080/api/session/stop

Control UI: Open the React control panel at http://localhost:3000, go to the Monitor tab, and use the "Start Recording Session" button.

Session logs are saved as JSON files in the sessions/ directory in the following format:

{
  "session_id": "abc123",
  "start_time": "2025-10-22T14:30:15Z",
  "end_time": "2025-10-22T15:45:20Z",
  "config": {
    "source_language": "pl",
    "target_language": "en",
    "pipeline": "whisper",
    "model": "large-v3"
  },
  "entries": [
    {
      "timestamp": "2025-10-22T14:30:20Z",
      "text": "Hello everyone",
      "processing_time_ms": 234
    }
  ]
}
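
Because each session is a plain JSON file with the schema above, post-processing is straightforward. A minimal sketch, assuming logs land in a local sessions/ directory:

import json
from pathlib import Path

for path in sorted(Path("sessions").glob("*.json")):
    session = json.loads(path.read_text(encoding="utf-8"))
    entries = session.get("entries", [])
    avg_ms = sum(e["processing_time_ms"] for e in entries) / max(len(entries), 1)
    cfg = session["config"]
    print(f"{session['session_id']}: {len(entries)} captions, "
          f"avg {avg_ms:.0f} ms, {cfg['source_language']} -> {cfg['target_language']}")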

API Endpoints

  • GET /health - Health check
  • GET /api/config - Get current config
  • POST /api/config - Update config (JSON body: {source_language, target_language, pipeline_name})
  • GET /api/pipelines - List available pipelines
  • POST /api/session/start - Start logging session
  • POST /api/session/stop - Stop logging session
  • GET /api/session/status - Get session status
  • WS /ws/audio - Audio ingest (PCM16LE mono 48kHz)
  • WS /ws/captions - Caption broadcast (JSON: {caption: string}); see the client sketch below
  • GET /overlay - OBS overlay HTML page
  • GET /static/sender.html - Browser-based audio sender
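
A quick way to smoke-test both WebSocket endpoints without a browser is a small Python client. This is a minimal sketch, assuming the third-party websockets package (pip install websockets); the host below is an example, and a real sender would stream microphone audio continuously rather than one chunk of silence.

import asyncio
import json

import websockets  # third-party: pip install websockets

HOST = "127.0.0.1:8080"  # example; use <GPU_PC_IP>:8080

async def main():
    async with websockets.connect(f"ws://{HOST}/ws/captions") as captions:
        async with websockets.connect(f"ws://{HOST}/ws/audio") as audio:
            # One second of 48 kHz mono PCM16LE audio = 48000 samples * 2 bytes.
            # Silence here just exercises the plumbing; real audio goes the same way.
            await audio.send(bytes(48000 * 2))
            message = await asyncio.wait_for(captions.recv(), timeout=30)
            print(json.loads(message)["caption"])  # {"caption": "..."} per the endpoint list

asyncio.run(main())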

Switching Models

SubAI supports multiple Whisper models with different speed/quality tradeoffs:

Using the Helper Script (Easiest):

# List available models
python set_model.py list

# Switch to a specific model
python set_model.py small    # Fast, good for Mac/testing
python set_model.py medium   # Balanced
python set_model.py large-v3 # Best quality (default)

Via API:

curl -X POST http://<GPU_PC_IP>:8080/api/config \
  -H "Content-Type: application/json" \
  -d '{"model_name":"small"}'

Available Models:

  • tiny - Fastest, 40MB (testing only)
  • small - Fast, 245MB (recommended for Mac)
  • medium - Balanced, 770MB (production ready)
  • large-v3 - Best quality, 1.5GB (default for GPU PC)

See QUICK_REFERENCE.md for detailed model comparison.

Supported Languages

Whisper supports 99 languages. Common codes:

  • pl - Polish
  • en - English
  • de - German
  • fr - French
  • es - Spanish
  • it - Italian
  • ja - Japanese
  • zh - Chinese

For translation, set source_language and target_language to different codes.

Pipeline Comparison: Choose the Right One for You

SubAI offers three pipelines optimized for different platforms and use cases:

MLX (Apple Silicon)

  • Platform: macOS (M1/M2/M3/M4)
  • Speed: ⚡⚡⚡ Very fast (2-4s)
  • Quality: ⭐⭐⭐ Good
  • Decoding: Greedy (no beam search)
  • Power usage: 🔋 Low (30-40% less than CPU)
  • Setup: pip install -e './backend[mlx]'
  • Best for: Development and testing on Mac
  • Quality note: May produce occasional odd translations

Faster-Whisper (CUDA)

  • Platform: Windows/Linux (NVIDIA GPU)
  • Speed: ⚡⚡ Fast (3-5s)
  • Quality: ⭐⭐⭐⭐⭐ Best
  • Decoding: Beam search (size=5)
  • Power usage: 🔥 High
  • Setup: Standard install + CUDA
  • Best for: Production (best quality)
  • Quality note: Most coherent and accurate

Faster-Whisper (CPU)

  • Platform: All platforms
  • Speed: 🐢 Slow (8-10s)
  • Quality: ⭐⭐⭐⭐⭐ Best
  • Decoding: Beam search (size=5)
  • Power usage: 🔥 Medium-high
  • Setup: Standard install
  • Best for: Intel Macs, systems without a GPU
  • Quality note: Most coherent and accurate

🎯 Recommendation by Use Case

Production / Important Translations:

  • Windows/Linux with NVIDIA GPU: Use whisper (CUDA) → Best quality
  • Mac with Apple Silicon: Use whisper (CPU) → Best quality, slower but worth it
  • Intel Mac / No GPU: Use whisper (CPU) → Only option for quality

Development / Testing / Quick Iteration:

  • Mac with Apple Silicon: Use mlx → 3-5x faster, good enough for testing

Why Quality Differs:

  • Beam Search (Faster-Whisper): Explores 5 hypotheses simultaneously → better translations
  • Greedy Decoding (MLX): Picks the single best choice at each step → faster but less coherent (see the toy sketch below)
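
To make the difference concrete, here is a toy sketch (illustrative only, not Whisper's actual decoder): greedy decoding commits to one token per step, while beam search carries the five best-scoring hypotheses forward and returns the best complete one. step_probs stands in for the model's next-token distribution.

import heapq
import math

def decode_greedy(step_probs, steps):
    prefix = []
    for _ in range(steps):
        probs = step_probs(prefix)                # dict: token -> probability
        prefix.append(max(probs, key=probs.get))  # commit to the single best token
    return prefix

def decode_beam(step_probs, steps, beam_size=5):
    beams = [(0.0, [])]  # (cumulative negative log-prob, prefix)
    for _ in range(steps):
        candidates = [
            (score - math.log(p), prefix + [token])
            for score, prefix in beams
            for token, p in step_probs(prefix).items()
        ]
        beams = heapq.nsmallest(beam_size, candidates)  # keep the 5 best hypotheses
    return beams[0][1]  # lowest cost = most probable hypothesis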

Pipeline Details

null (Debug)

Placeholder pipeline that emits debug messages every ~1s of audio. Use for testing connectivity.

whisper (Production - Recommended)

GPU-accelerated Faster-Whisper pipeline with beam search:

  • Quality: Best (beam search with size=5)
  • Model: large-v3 (best quality, ~3GB VRAM/RAM)
  • Alternative models: medium, small, large-v2
  • Acceleration: CTranslate2 with CUDA (NVIDIA GPU) or CPU
  • Best for: Production use, Windows with NVIDIA GPU, all platforms when quality matters

mlx (Development - Apple Silicon Only)

MLX-accelerated Whisper pipeline with greedy decoding:

  • Speed: 3-5x faster than CPU on M1/M2/M3/M4 Macs
  • Quality: Good (greedy decoding, no beam search yet)
  • Power: 30-40% lower consumption vs CPU
  • Recommended models: large-v3-mlx, medium-mlx, small-mlx
  • Best for: Development and testing on Mac, quick iterations
  • Limitation: No beam search support yet (lower quality than faster-whisper)
  • See: MLX_SETUP.md for detailed setup and comparison

Quick Setup

For MLX (Mac only):

# Install MLX support
pip install -e './backend[mlx]'

# Download model
python3 download_mlx_models.py --model large-v3-mlx

# Use it
python3 set_model.py mlx large-v3-mlx

For Faster-Whisper (All platforms):

# Already included in standard install
python3 set_model.py whisper large-v3

Performance Tips

  1. GPU not detected?

    • Verify NVIDIA drivers: nvidia-smi
    • Check CUDA version matches PyTorch
    • The pipeline will fall back to CPU (slower but functional)
  2. High latency?

    • Use medium model instead of large-v3 for 2x speed
    • Reduce chunk size in backend/app/pipeline/whisper_pipeline.py
  3. Model downloads slow?

    • Pre-download with python download_models.py
    • Models cached in ~/.cache/huggingface/
  4. OBS overlay not updating?

    • Check firewall on GPU PC
    • Verify overlay URL in browser first
    • Restart OBS browser source (right-click → Refresh)
  5. Microphone access blocked?

    • Use HTTPS or localhost for sender
    • Or launch browser with --unsafely-treat-insecure-origin-as-secure flag

Troubleshooting

"Failed to load config"

  • Backend not running or firewall blocking port 8080
  • Check: curl http://<GPU_PC_IP>:8080/health

"No captions appearing"

  • Check sender is connected (green status)
  • Verify pipeline is whisper not null
  • Check backend logs for errors

"Out of memory" on GPU

  • Switch to medium model
  • Close other GPU applications
  • Fall back to CPU: in backend/app/pipeline/whisper_pipeline.py, set the device to "cpu"

Development

File Structure

SubAI/
├── backend/
│   ├── app/
│   │   ├── main.py           # FastAPI app
│   │   ├── config.py          # Runtime config
│   │   ├── state.py           # WebSocket clients manager
│   │   ├── routers/
│   │   │   ├── ingest.py      # Audio WS endpoint
│   │   │   ├── overlay.py     # Overlay page
│   │   │   └── control.py     # Config API
│   │   ├── pipeline/
│   │   │   ├── base.py        # Pipeline interface
│   │   │   ├── null_pipeline.py
│   │   │   ├── whisper_pipeline.py
│   │   │   └── registry.py    # Pipeline registry
│   │   └── static/
│   │       └── sender.html    # Browser audio sender
│   └── pyproject.toml
├── frontend/
│   ├── src/
│   │   ├── App.tsx            # Control UI
│   │   └── App.css
│   ├── package.json
│   └── vite.config.ts
├── test_gpu.py                # GPU verification script
├── download_models.py         # Model pre-download
└── README.md

Adding a New Pipeline

  1. Create backend/app/pipeline/my_pipeline.py:

    from .base import TranslationPipeline
    
    class MyPipeline(TranslationPipeline):
        def name(self) -> str:
            return "my_pipeline"
        
        def available_models(self):
            return ["model1", "model2"]
        
        def select_model(self, model_name: str):
            # Load or switch to the requested model here
            pass
        
        def reset(self):
            # Clear any buffered audio/state between sessions
            pass
        
        def process_chunk(self, pcm16le: bytes, sample_rate_hz: int):
            # Process audio; return a caption string, or None if no
            # caption is ready for this chunk yet
            return "translated text"
  2. Register in backend/app/pipeline/registry.py:

    from .my_pipeline import MyPipeline
    
    _pipelines["my_pipeline"] = MyPipeline()
  3. Restart backend
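
After the restart, you can confirm the new pipeline is registered via the GET /api/pipelines endpoint listed above. A stdlib-only sketch (host is an example; the exact response shape depends on the backend):

import json
from urllib.request import urlopen

with urlopen("http://127.0.0.1:8080/api/pipelines") as resp:
    pipelines = json.load(resp)
print(pipelines)  # should now include "my_pipeline"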

Known Limitations

  • Requires two computers (or VM setup)
  • GPU highly recommended for real-time performance
  • WebSocket connections require secure context or browser flags for microphone access
  • Translation quality depends on Whisper model size (larger = better but slower)

Roadmap

  • MLX support for Apple Silicon - 3-5x faster on M-series Macs ✅
  • Single-PC mode with virtual audio routing
  • Custom model fine-tuning support
  • Real-time voice cloning/dubbing
  • Multi-speaker detection
  • Cloud deployment option
  • Whisper Turbo integration for even faster processing

📚 Documentation

SubAI includes comprehensive documentation for all users:

  • MAC_SETUP.md - Mac-specific setup instructions (Apple Silicon)
  • MLX_SETUP.md - MLX acceleration setup and pipeline comparison
  • DEPLOYMENT.md - complete production deployment guide
  • QUICK_REFERENCE.md - detailed model comparison
  • Helper scripts: set_model.py, download_models.py, download_mlx_models.py

Support

If you encounter any issues or have questions, please open an issue on the GitHub repository (arvindjuneja/subai).

License

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

This project is licensed under CC BY-NC-SA 4.0 - see the LICENSE file for details.

You are free to:

  • ✅ Use for personal, educational, and non-profit purposes
  • ✅ Modify and adapt the code
  • ✅ Share and redistribute

Under these conditions:

  • 📝 Must give appropriate credit
  • 🚫 No commercial use without permission
  • 🔄 Must share modifications under the same license
  • 🔓 Must keep derivatives open source

Commercial Use: If you're interested in using SubAI commercially, please contact the maintainers to discuss licensing options.

Credits

Built with OpenAI Whisper, Faster-Whisper (CTranslate2), MLX, FastAPI, and React.

Made with ❤️ for the streaming community
