Scaling Bilingual Speech-to-Text on Serverless GPUs
Deploying real-time machine learning models for transcription and translation is a hard engineering problem, especially when the goal is low latency and high accuracy across languages such as Vietnamese and English.
In this post, I detail how we built the Bilingual Real-Time Speech-to-Text System for academic lecture halls, combining off-the-shelf speech and translation models with serverless GPU infrastructure.
The Processing Pipeline
The core pipeline consists of three major stages:
- Voice Activity Detection (VAD): We used Silero VAD to segment the continuous audio stream into speech chunks. Skipping silent segments reduces the computational load, because only speech reaches the heavier models downstream.
- Transcription: Segmented audio chunks are passed to WhisperX, which transcribes them with Whisper and adds word-level timestamps through forced alignment.
- Translation: The transcribed text is then passed to Meta's NLLB (No Language Left Behind) model, which translates between Vietnamese and English.
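To make the data flow concrete, here is a minimal sketch of how the three stages could be composed. It is not our production code: `run_pipeline`, `Caption`, and the three stage callables (`detect_speech`, `transcribe`, `translate`) are hypothetical names chosen for illustration, and the callables stand in for Silero VAD, WhisperX, and NLLB without reproducing any of their real APIs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple

# A speech span as (start_seconds, end_seconds) within the audio buffer.
Span = Tuple[float, float]


@dataclass
class Caption:
    start: float
    end: float
    source_text: str
    translated_text: str


def run_pipeline(
    audio: List[float],
    sample_rate: int,
    detect_speech: Callable[[List[float], int], Iterable[Span]],
    transcribe: Callable[[List[float], int], str],
    translate: Callable[[str], str],
) -> Iterator[Caption]:
    """Run VAD -> transcription -> translation over one audio buffer.

    The three callables are placeholders (assumptions), injected so the
    flow can be exercised without loading any real model.
    """
    for start, end in detect_speech(audio, sample_rate):
        # Only speech spans reach the expensive models; silence is skipped.
        chunk = audio[int(start * sample_rate): int(end * sample_rate)]
        text = transcribe(chunk, sample_rate).strip()
        if not text:
            continue  # nothing recognizable in this span
        yield Caption(start, end, text, translate(text))
```

Passing the stages in as plain callables keeps the control flow testable with stubs, and it means a stage can be swapped out without touching the loop. Because the function is a generator, each caption can be forwarded to the client as soon as its chunk is done rather than after the whole buffer.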
Overcoming Serverless Cold Starts
One of the biggest hurdles was managing cold starts on serverless platforms. By packaging our models in Docker container images and running them on Modal's A100 GPUs, we achieved sub-second cold starts and inference fast enough to make real-time streaming viable.
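One general pattern that helps here, independent of any platform, is to pay the model-loading cost once per container and reuse the loaded object for every later request. The sketch below illustrates that idea in plain Python; it does not use Modal's API, and `get_model` and `handle_request` are hypothetical names with a dictionary standing in for real model weights.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    """Load a model once per process and reuse it on later calls.

    The body is a placeholder (assumption): a real loader would read
    weights from the container image and move them onto the GPU.
    """
    return {"name": name, "loaded_at": time.monotonic()}


def handle_request(text: str) -> str:
    # The first call in a fresh container is "cold" and triggers the load;
    # every later call is "warm" and hits the cache.
    model = get_model("translator")
    return f"[{model['name']}] {text}"
```

With this structure, only the first request after a container starts pays for loading. Keeping the weights inside the image, rather than downloading them at startup, shortens that first load as well.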
Integration via WebSockets
To connect the React frontend to our Python backend, we used a WebSocket connection. It provides bidirectional, low-latency communication, so transcribed text appears on the user's screen almost immediately as the professor speaks.
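As an illustration of what could travel over that connection, here is a sketch of a JSON caption message that the backend might send as a WebSocket text frame. The schema is an assumption made for this example, not the format our system uses: `CaptionMessage`, its field names, and the `"type": "caption"` tag are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CaptionMessage:
    """One caption update sent from backend to frontend (hypothetical schema)."""
    seq: int               # increasing counter so the client can order updates
    start: float           # segment start time in seconds
    end: float             # segment end time in seconds
    source_text: str       # transcript in the spoken language
    translated_text: str   # translation shown alongside it
    is_final: bool = True  # False would mark a partial result that may change


def encode(msg: CaptionMessage) -> str:
    # ensure_ascii=False keeps Vietnamese diacritics readable in the payload.
    return json.dumps({"type": "caption", **asdict(msg)}, ensure_ascii=False)


def decode(raw: str) -> CaptionMessage:
    data = json.loads(raw)
    if data.pop("type", None) != "caption":
        raise ValueError("unexpected message type")
    return CaptionMessage(**data)
```

A sequence number and a final flag let the frontend replace a partial caption in place instead of appending duplicates. The same shape maps directly onto a TypeScript interface on the React side.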