
speaker-diarization, pyannote, whisper, audio-processing, python, transcription, multi-speaker, speech-to-text

Transcribe Multi-Speaker Audio with pyannote.audio

Use the pyannote.audio speaker diarization pipeline to identify who spoke when in an audio file. This guide shows how to combine its speaker timeline with a Whisper transcript to produce a turn-by-turn conversation log.

intermediate · 30 min · 4 steps

The play

Install Dependencies
Install pyannote.audio (which depends on PyTorch) and openai-whisper. The starter code uses pyannote.audio for diarization and Whisper for transcription; Whisper also needs the ffmpeg command-line tool to decode audio files.
Get Hugging Face Access
The pyannote models are gated. Visit huggingface.co/pyannote/speaker-diarization-3.1 and huggingface.co/pyannote/segmentation-3.0, accept the user agreements, and create an access token at huggingface.co/settings/tokens. This token is required to download and use the models.
Run Speaker Diarization
Instantiate the pyannote.audio pipeline with your Hugging Face token and move it to a GPU if one is available. Run it on your audio file to get a diarization result that pairs speaker labels with their start and end timestamps.
Merge Diarization with Whisper Transcript
Transcribe the audio with Whisper, requesting word-level timestamps. Then, for each speaker turn in the diarization result, collect the words whose midpoint falls inside that turn to build the text spoken by that speaker.
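
A coarser variant of the merge step labels whole Whisper segments instead of individual words, giving each segment the speaker whose turns overlap it the most. The sketch below illustrates that idea on plain tuples and dicts, so it runs without either library. The function name `assign_speaker` and the sample data are hypothetical; in practice the `turns` list could be built from `diarization.itertracks(yield_label=True)` as `(turn.start, turn.end, speaker)` tuples, and `segments` would come from the Whisper result.

```python
from collections import defaultdict


def assign_speaker(seg_start, seg_end, turns):
    """Return the speaker whose turns overlap [seg_start, seg_end] the most.

    `turns` is a list of (start, end, speaker) tuples in seconds.
    Returns None when no turn overlaps the segment.
    """
    overlap_by_speaker = defaultdict(float)
    for turn_start, turn_end, speaker in turns:
        # Length of the intersection of the two intervals (negative if disjoint)
        overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
        if overlap > 0:
            overlap_by_speaker[speaker] += overlap
    if not overlap_by_speaker:
        return None
    return max(overlap_by_speaker, key=overlap_by_speaker.get)


# Hypothetical sample data standing in for real diarization and Whisper output
turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]
segments = [
    {"start": 0.5, "end": 3.5, "text": "Hello there."},
    {"start": 3.8, "end": 8.0, "text": "Hi, how are you?"},
]

labeled = [
    {**seg, "speaker": assign_speaker(seg["start"], seg["end"], turns)}
    for seg in segments
]
for seg in labeled:
    print(f"{seg['speaker']}: {seg['text']}")
```

Segment-level assignment needs no word timestamps, but a segment that spans a speaker change is attributed entirely to one speaker. Word-level matching avoids that at the cost of a slower transcription pass.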

Starter code

import os

import torch
import whisper
from pyannote.audio import Pipeline

# --- Configuration ---
HUGGING_FACE_TOKEN = os.environ.get("HF_TOKEN") # Get token from environment variable or replace with your token string
AUDIO_FILE_PATH = "audio.wav" # Replace with the path to your audio file

# --- Pre-flight checks ---
if HUGGING_FACE_TOKEN is None:
    raise ValueError("Hugging Face token not found. Please set the HF_TOKEN environment variable or replace the placeholder.")

if not os.path.exists(AUDIO_FILE_PATH):
    raise FileNotFoundError(f"Audio file not found at {AUDIO_FILE_PATH}. Please provide a valid path.")

# --- 1. Speaker Diarization ---
print("Step 1: Performing speaker diarization...")
# Pipeline.to() expects a torch.device instance, not a plain string
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HUGGING_FACE_TOKEN
)
pipeline.to(device)

diarization = pipeline(AUDIO_FILE_PATH)
print("Diarization complete.")

# --- 2. Transcription ---
print("\nStep 2: Transcribing audio with Whisper...")
whisper_model = whisper.load_model("base")
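# word_timestamps=True adds a 'words' list to each segment, which the merge step relies on
transcription = whisper_model.transcribe(AUDIO_FILE_PATH, word_timestamps=True)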
print("Transcription complete.")

# --- 3. Merging Diarization and Transcription ---
print("\nStep 3: Merging results and generating final transcript...")

final_transcript = []
# Iterate through diarization segments
for turn, _, speaker in diarization.itertracks(yield_label=True):
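    # Collect every transcribed word whose midpoint falls within the current speaker's turn
    turn_words = []
    for segment in transcription['segments']:
        # 'words' is only present when transcribe() was called with word_timestamps=True
        for word in segment.get('words', []):
            word_start, word_end = word['start'], word['end']
            # Check if the word's midpoint is within the speaker's turn
            if turn.start <= (word_start + word_end) / 2 <= turn.end:
                # Whisper stores the token text under 'word', usually with a leading space
                turn_words.append(word['word'].strip())
    turn_text = " ".join(turn_words)
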
    if turn_text.strip():
        final_transcript.append({
            "start": f"{turn.start:.2f}",
            "end": f"{turn.end:.2f}",
            "speaker": speaker,
            "text": turn_text.strip()
        })

# --- 4. Print Final Transcript ---
print("\n--- Final Transcript ---")
for entry in final_transcript:
    print(f"[{entry['start']}s - {entry['end']}s] {entry['speaker']}: {entry['text']}")
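
Diarization often splits one person's speech into several short turns separated by pauses, so the final transcript can show the same speaker on consecutive lines. A minimal sketch of one way to tidy this up, assuming entries shaped like those in `final_transcript` (dicts with `start`, `end`, `speaker`, and `text` keys): the helper name `merge_consecutive` and the sample entries are hypothetical.

```python
def merge_consecutive(entries):
    """Merge neighbouring entries that share a speaker into a single entry.

    Each entry is a dict with 'start', 'end', 'speaker', and 'text' keys.
    The input list and its dicts are left unmodified.
    """
    merged = []
    for entry in entries:
        if merged and merged[-1]["speaker"] == entry["speaker"]:
            # Same speaker as the previous entry: extend it
            merged[-1]["end"] = entry["end"]
            merged[-1]["text"] += " " + entry["text"]
        else:
            # New speaker: start a fresh entry (copied so the input is untouched)
            merged.append(dict(entry))
    return merged


# Hypothetical entries in the same shape as final_transcript
sample = [
    {"start": "0.00", "end": "2.10", "speaker": "SPEAKER_00", "text": "Good morning."},
    {"start": "2.60", "end": "5.00", "speaker": "SPEAKER_00", "text": "Shall we begin?"},
    {"start": "5.20", "end": "7.40", "speaker": "SPEAKER_01", "text": "Yes, go ahead."},
]

for entry in merge_consecutive(sample):
    print(f"[{entry['start']}s - {entry['end']}s] {entry['speaker']}: {entry['text']}")
```

Merging only by speaker label ignores how long the pause between turns was; a gap threshold could be added if long silences should start a new entry.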