Article
speech-to-text, whisper, transcription, audio-processing, python, diarization, whisperx, voice-activity-detection
Transcribe Long Audio with a VAD-Whisper Pipeline
Use whisperX to build a speech-to-text pipeline for long audio files. It uses voice activity detection (VAD) to split the audio into speech segments, transcribes them with Whisper, and adds speaker labels and word-level timestamps.
Intermediate, 30 min, 5 steps
The play
- Install WhisperX and Dependencies: This pipeline is built on the whisperX library, which extends OpenAI's Whisper with VAD-based segmentation, word-level alignment, and speaker diarization. First, make sure PyTorch (>=2.0) and ffmpeg are installed. Then install whisperX from its GitHub repository to get the latest version.
- Run Transcription & Alignment: Use the whisperX CLI to transcribe an audio file. On first use it downloads the model; it then transcribes the audio and applies phoneme-based alignment to produce word-level timestamps.
- Add Speaker Diarization: To identify different speakers, add the `--diarize` flag. This requires a free Hugging Face access token for the underlying pyannote/speaker-diarization model. Create the token in your Hugging Face account settings and accept the model's user conditions on its Hugging Face page.
- Generate Subtitle Files: After processing, whisperX writes several files, including `.json`, `.tsv`, and the subtitle formats `.srt` and `.vtt`. They are written to the output directory, which is the current working directory unless you set `--output_dir`, and are ready for use in video players.
- Use the Python API for Integration: For more control and integration into your own applications, use the Python API. The starter code below runs the full transcription, alignment, and diarization pipeline.
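The CLI steps above can be sketched from Python as an argument list for `subprocess`. This is a minimal sketch, not whisperX's own tooling: `build_whisperx_command` is a hypothetical helper, `meeting.wav` and `hf_xxx` are placeholders, and the repository URL and flag names (`--model`, `--output_dir`, `--diarize`, `--hf_token`) are assumptions based on the whisperX README, so check them against `whisperx --help` for your installed version.

```python
import shlex
from typing import List, Optional

# Install step, kept as a string only (assumed repository URL); run it in your shell.
INSTALL_COMMAND = "pip install git+https://github.com/m-bain/whisperx.git"


def build_whisperx_command(
    audio_path: str,
    model: str = "large-v2",
    output_dir: str = ".",
    hf_token: Optional[str] = None,
) -> List[str]:
    """Assemble a whisperx CLI call as an argument list. Nothing is executed here."""
    cmd = ["whisperx", audio_path, "--model", model, "--output_dir", output_dir]
    if hf_token:
        # Diarization is only requested when a Hugging Face token is supplied.
        cmd += ["--diarize", "--hf_token", hf_token]
    return cmd


# Placeholder file name and token, for illustration only.
cmd = build_whisperx_command("meeting.wav", hf_token="hf_xxx")
print(shlex.join(cmd))
```

To actually run the transcription, pass the list to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems with file names that contain spaces.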
Starter code
import whisperx
import torch
import os
import requests
# --- 1. Setup Environment & Download Sample Audio ---
HF_TOKEN = "YOUR_HUGGINGFACE_TOKEN" # Replace with your Hugging Face token
if HF_TOKEN == "YOUR_HUGGINGFACE_TOKEN":
    print("Please replace 'YOUR_HUGGINGFACE_TOKEN' with your actual Hugging Face token.")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
COMPUTE_TYPE = "float16" if torch.cuda.is_available() else "int8"
# Download a sample audio file
audio_url = "https://upload.wikimedia.org/wikipedia/commons/d/dd/A_sample_of_my_voice.ogg"
audio_file = "sample_voice.ogg"
if not os.path.exists(audio_file):
    print(f"Downloading sample audio from {audio_url}...")
    with requests.get(audio_url) as r:
        r.raise_for_status()
        with open(audio_file, 'wb') as f:
            f.write(r.content)
    print("Download complete.")
# --- 2. Initialize Models ---
print("Loading Whisper model...")
# Use a smaller model for faster processing on CPU
model_size = "large-v2" if DEVICE == "cuda" else "base.en"
model = whisperx.load_model(model_size, DEVICE, compute_type=COMPUTE_TYPE)
print("Loading alignment model...")
model_a, metadata = whisperx.load_align_model(language_code="en", device=DEVICE)
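print("Loading diarization model...")
# Depending on the installed whisperX version, this class may need to be
# referenced as whisperx.diarize.DiarizationPipeline instead.
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=DEVICE)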
# --- 3. Run the Full Pipeline ---
print(f"Processing audio file: {audio_file}")
# Transcribe
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)
# Align
result = whisperx.align(result["segments"], model_a, metadata, audio, DEVICE, return_char_alignments=False)
# Diarize
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
# --- 4. Print Results ---
print("\n--- Transcription with Speaker Labels ---")
for segment in result['segments']:
    speaker = segment.get('speaker', 'UNKNOWN')
    start_time = segment['start']
    end_time = segment['end']
    text = segment['text']
    print(f"[{start_time:.2f}s - {end_time:.2f}s] {speaker}: {text}")
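The CLI writes subtitle files for you, but the Python API returns only the segment dictionaries. The sketch below shows one way to render them as SRT text. It is a minimal illustration, not whisperX's own subtitle writer: `format_srt_timestamp` and `segments_to_srt` are hypothetical helpers, and the code assumes each segment has `start`, `end`, and `text` keys plus an optional `speaker` key, as used in the print loop above.

```python
from typing import Dict, List


def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render segments as numbered SRT cues, prefixing the speaker label if present."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        text = segment["text"].strip()
        speaker = segment.get("speaker")
        if speaker:
            text = f"[{speaker}] {text}"
        start = format_srt_timestamp(segment["start"])
        end = format_srt_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line, as the SRT format requires.
    return "\n".join(blocks)
```

With the pipeline above, `segments_to_srt(result["segments"])` produces the subtitle text, which you can write to a `.srt` file with UTF-8 encoding.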