speech-to-text, whisper, transcription, audio-processing, python, diarization, whisperx, voice-activity-detection

Transcribe Long Audio with a VAD-Whisper Pipeline

Use whisperX to build a production-ready speech-to-text pipeline. It combines VAD (Voice Activity Detection) and Whisper for fast, accurate transcription of long audio files, complete with speaker labels and precise timestamps.

intermediate | 30 min | 5 steps
The play
  1. Install WhisperX and Dependencies
    Build the pipeline with the whisperX library, which extends OpenAI's Whisper with VAD-based batching, word-level alignment, and speaker diarization. First, make sure PyTorch (>=2.0) and ffmpeg are installed. Then install whisperX from its GitHub repository to get the latest version, for example with `pip install git+https://github.com/m-bain/whisperX.git`.
  2. Run Transcription & Alignment
    Use the whisperX CLI to transcribe an audio file, for example `whisperx audio.wav --model large-v2`. On first run the CLI downloads the model; it then transcribes the audio and applies phoneme-based forced alignment to produce accurate word-level timestamps.
  3. Add Speaker Diarization
    To label speakers, add the `--diarize` flag and pass a Hugging Face access token with `--hf_token`. The token is free and is needed because the underlying pyannote speaker-diarization models are gated: create one in your Hugging Face account settings and accept the models' user conditions on their Hugging Face pages.
  4. Generate Subtitle Files
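    After processing, whisperX writes several output files, including `.json`, `.tsv`, `.txt`, and the `.srt` and `.vtt` subtitle formats. By default they go to the current working directory, not necessarily the folder that holds your audio file; pass `--output_dir` to choose another location. The subtitle files are ready for use in video players.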
  5. Use the Python API for Integration
    For more control and integration into your own applications, use the Python API. The starter code below provides a complete example of a full transcription, alignment, and diarization pipeline.
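Steps 2 to 4 all run through the whisperX CLI. As a rough sketch, the command line can be assembled from Python so that it is easy to script over many files. The helper name `build_whisperx_command` and the file names are hypothetical; the flags (`--model`, `--diarize`, `--hf_token`, `--output_dir`, `--output_format`) are the ones the CLI is understood to accept, so confirm them against `whisperx --help` for your installed version.

```python
import shlex
from typing import List, Optional


def build_whisperx_command(
    audio_path: str,
    model: str = "large-v2",
    diarize: bool = False,
    hf_token: Optional[str] = None,
    output_dir: str = ".",
    output_format: str = "all",
) -> List[str]:
    """Assemble the argument list for one whisperX CLI run (illustrative helper)."""
    cmd = [
        "whisperx", audio_path,
        "--model", model,
        "--output_dir", output_dir,        # where the .json/.srt/.vtt files are written
        "--output_format", output_format,  # e.g. "all", "srt", "vtt"
    ]
    if diarize:
        # Diarization needs a Hugging Face token for the gated pyannote models.
        if not hf_token:
            raise ValueError("Diarization requires a Hugging Face token")
        cmd += ["--diarize", "--hf_token", hf_token]
    return cmd


if __name__ == "__main__":
    # Placeholder file name and token, for illustration only.
    cmd = build_whisperx_command(
        "meeting.wav", diarize=True, hf_token="hf_xxx", output_dir="out"
    )
    print(shlex.join(cmd))
```

To actually run the command, pass the returned list to `subprocess.run(cmd, check=True)`. Building a list rather than a single string avoids shell-quoting problems with file names that contain spaces.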
Starter code
import whisperx
import torch
import os
import requests

# --- 1. Setup Environment & Download Sample Audio ---
HF_TOKEN = "YOUR_HUGGINGFACE_TOKEN" # Replace with your Hugging Face token
if HF_TOKEN == "YOUR_HUGGINGFACE_TOKEN":
    # Stop early: the diarization model cannot be loaded without a valid token.
    raise SystemExit("Please replace 'YOUR_HUGGINGFACE_TOKEN' with your actual Hugging Face token.")

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
COMPUTE_TYPE = "float16" if torch.cuda.is_available() else "int8"

# Download a sample audio file
audio_url = "https://upload.wikimedia.org/wikipedia/commons/d/dd/A_sample_of_my_voice.ogg"
audio_file = "sample_voice.ogg"
if not os.path.exists(audio_file):
    print(f"Downloading sample audio from {audio_url}...")
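    # Wikimedia asks clients to send a descriptive User-Agent; the value below is a placeholder.
    headers = {"User-Agent": "whisperx-starter-example/0.1 (replace with your contact info)"}
    with requests.get(audio_url, headers=headers, timeout=60) as r: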
        r.raise_for_status()
        with open(audio_file, 'wb') as f:
            f.write(r.content)
    print("Download complete.")

# --- 2. Initialize Models ---
print("Loading Whisper model...")
# Use a smaller model for faster processing on CPU
model_size = "large-v2" if DEVICE == "cuda" else "base.en"
model = whisperx.load_model(model_size, DEVICE, compute_type=COMPUTE_TYPE)

print("Loading alignment model...")
model_a, metadata = whisperx.load_align_model(language_code="en", device=DEVICE)

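print("Loading diarization model...")
# Import from the submodule: some recent whisperX releases no longer expose
# DiarizationPipeline at the package top level.
from whisperx.diarize import DiarizationPipeline
diarize_model = DiarizationPipeline(use_auth_token=HF_TOKEN, device=DEVICE)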

# --- 3. Run the Full Pipeline ---
print(f"Processing audio file: {audio_file}")

# Transcribe
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# Align
result = whisperx.align(result["segments"], model_a, metadata, audio, DEVICE, return_char_alignments=False)

# Diarize
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

# --- 4. Print Results ---
print("\n--- Transcription with Speaker Labels ---")
for segment in result['segments']:
    speaker = segment.get('speaker', 'UNKNOWN')
    start_time = segment['start']
    end_time = segment['end']
    text = segment['text']
    print(f"[{start_time:.2f}s - {end_time:.2f}s] {speaker}: {text}")
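The CLI writes subtitle files for you, but the Python API returns only the `result` dictionary. The sketch below shows one way to turn segments into SRT text by hand. It assumes each segment carries `start`, `end`, `text`, and an optional `speaker` key, as in the print loop above. The function names `srt_timestamp` and `segments_to_srt` are illustrative and not part of whisperX, which has its own output writers that this example does not use.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render segments as SRT cues, prefixing the speaker label when present."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        speaker = seg.get("speaker")
        if speaker:
            text = f"[{speaker}]: {text}"
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Made-up segments in the same shape as result["segments"].
    demo = [
        {"start": 0.0, "end": 1.25, "text": " Hello there.", "speaker": "SPEAKER_00"},
        {"start": 1.5, "end": 2.0, "text": " Hi."},
    ]
    print(segments_to_srt(demo))
```

With the starter code, `segments_to_srt(result["segments"])` would give text that can be written to a `.srt` file with UTF-8 encoding.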