Article
speaker-diarization, pyannote, whisper, audio-processing, python, transcription, multi-speaker, speech-to-text
Transcribe Multi-Speaker Audio with pyannote.audio
Use the pyannote.audio speaker diarization pipeline to identify who spoke when in an audio file. This guide shows how to combine its speaker timeline with a Whisper transcript to create a turn-by-turn conversation log.
intermediate | 30 min | 4 steps
The play
- Install Dependencies: Set up your environment by installing pyannote.audio, its dependencies (PyTorch), and openai-whisper. The starter code relies on these libraries to perform diarization and transcription.
- Get Hugging Face Access: The pyannote models are gated. Visit huggingface.co/pyannote/speaker-diarization-3.1 and huggingface.co/pyannote/segmentation-3.0, accept the user agreements, and create an access token at huggingface.co/settings/tokens. This token is required to download and use the models.
- Run Speaker Diarization: Instantiate the pyannote.audio pipeline with your Hugging Face token. Run it on your audio file to generate a diarization object containing speaker labels and their corresponding timestamps.
- Merge Diarization with Whisper Transcript: Transcribe the audio with Whisper, requesting word-level timestamps. Then, for each speaker turn in the pyannote.audio diarization result, collect the words whose timestamps fall inside that turn, so every piece of text is attributed to a speaker.
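Before running the starter code, it can help to confirm that the first two steps actually took effect. The sketch below is one possible pre-flight check, not part of pyannote.audio or Whisper: the helper name `missing_modules` is hypothetical, and it assumes the token is stored in an `HF_TOKEN` environment variable as in the starter code. Note that it checks import names, which differ from pip package names (openai-whisper is imported as `whisper`).

```python
import importlib.util
import os


def missing_modules(names):
    """Return the entries of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised, for example, when the parent package of a dotted
            # name (such as "pyannote" for "pyannote.audio") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# Import names used by the starter code (assumed list for this sketch).
required = ["torch", "whisper", "pyannote.audio"]
not_installed = missing_modules(required)

if not_installed:
    print("Missing modules:", ", ".join(not_installed))
else:
    print("All required modules can be located.")

if os.environ.get("HF_TOKEN"):
    print("HF_TOKEN is set.")
else:
    print("HF_TOKEN is not set; the gated models cannot be downloaded without it.")
```

The check only locates the modules. It does not verify versions, and it cannot tell whether the user agreements for the gated models have been accepted.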
Starter code
import os
import whisper
from pyannote.audio import Pipeline
from pyannote.core import Segment
import torch
# --- Configuration ---
HUGGING_FACE_TOKEN = os.environ.get("HF_TOKEN") # Get token from environment variable or replace with your token string
AUDIO_FILE_PATH = "audio.wav" # Replace with the path to your audio file
# --- Pre-flight checks ---
if HUGGING_FACE_TOKEN is None:
    raise ValueError("Hugging Face token not found. Please set the HF_TOKEN environment variable or replace the placeholder.")
if not os.path.exists(AUDIO_FILE_PATH):
    raise FileNotFoundError(f"Audio file not found at {AUDIO_FILE_PATH}. Please provide a valid path.")
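# --- 1. Speaker Diarization ---
print("Step 1: Performing speaker diarization...")
# The pipeline's .to() expects a torch.device rather than a plain string
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HUGGING_FACE_TOKEN
)
# from_pretrained can return None when the token lacks access to the gated models
if pipeline is None:
    raise RuntimeError("Could not load the pipeline. Check your token and the model user agreements.")
pipeline.to(device)
diarization = pipeline(AUDIO_FILE_PATH)
print("Diarization complete.")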
# --- 2. Transcription ---
print("\nStep 2: Transcribing audio with Whisper...")
whisper_model = whisper.load_model("base")
# word_timestamps=True adds a 'words' list to each segment; the merge step needs it
transcription = whisper_model.transcribe(AUDIO_FILE_PATH, word_timestamps=True)
print("Transcription complete.")
# --- 3. Merging Diarization and Transcription ---
print("\nStep 3: Merging results and generating final transcript...")
# Flatten Whisper's word-level output once so each turn scans a single list
all_words = [
    word
    for segment in transcription["segments"]
    for word in segment.get("words", [])
]
final_transcript = []
# Iterate through diarization turns
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Collect the words whose midpoint falls within the current speaker's turn
    turn_words = []
    for word in all_words:
        midpoint = (word["start"] + word["end"]) / 2
        if turn.start <= midpoint < turn.end:
            # Whisper stores the token under 'word', usually with a leading space
            turn_words.append(word["word"].strip())
    if turn_words:
        final_transcript.append({
            "start": f"{turn.start:.2f}",
            "end": f"{turn.end:.2f}",
            "speaker": speaker,
            "text": " ".join(turn_words)
        })
# --- 4. Print Final Transcript ---
print("\n--- Final Transcript ---")
for entry in final_transcript:
    print(f"[{entry['start']}s - {entry['end']}s] {entry['speaker']}: {entry['text']}")
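An alternative to the word-level merge is to work at the level of whole Whisper segments and give each one the speaker whose turns overlap it the most. The sketch below illustrates that idea on plain Python data, so it runs without either library. The function names `overlap` and `assign_speakers`, the `"UNKNOWN"` fallback label, and the example timings are all assumptions made for illustration. In practice the turns could be built with `[(turn.start, turn.end, speaker) for turn, _, speaker in diarization.itertracks(yield_label=True)]` and the segments taken from `transcription["segments"]`.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each segment with the speaker sharing the most time with it.

    segments: list of dicts with 'start', 'end', and 'text' keys.
    turns: list of (start, end, speaker) tuples.
    """
    labelled = []
    for seg in segments:
        # Total overlap per speaker, since one speaker may have several turns.
        totals = {}
        for t_start, t_end, speaker in turns:
            shared = overlap(seg["start"], seg["end"], t_start, t_end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else unknown
        labelled.append({
            "start": seg["start"],
            "end": seg["end"],
            "speaker": speaker,
            "text": seg["text"].strip(),
        })
    return labelled


# Made-up example data standing in for real diarization and Whisper output.
example_turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]
example_segments = [
    {"start": 0.5, "end": 3.5, "text": " Hello there."},
    {"start": 3.0, "end": 8.0, "text": " Hi, how are you?"},
    {"start": 10.0, "end": 11.0, "text": " Bye."},
]

for entry in assign_speakers(example_segments, example_turns):
    print(f"[{entry['start']:.2f}s - {entry['end']:.2f}s] {entry['speaker']}: {entry['text']}")
```

Segment-level assignment keeps Whisper's sentence boundaries intact and does not need word timestamps, but a segment that spans a speaker change is attributed entirely to one speaker. The word-level merge in the starter code handles that case better at the cost of choppier text.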