Article

speech-recognition, asr, whisper, python, audio-processing, voice-agents, ffmpeg, transcription

Implement Local Speech Recognition with OpenAI's Whisper

Learn to run accurate speech recognition locally with the open-source Whisper model. This guide covers setup, basic transcription, model selection, language identification, and word-level timestamps for building voice-enabled applications without relying on cloud APIs.

intermediate · 30 min · 5 steps

The play

Set Up Your Environment
Whisper requires the `ffmpeg` command-line tool to decode audio. Install it first with your system's package manager (e.g., `brew install ffmpeg` on macOS, `sudo apt install ffmpeg` on Debian/Ubuntu). Then install the Python library with `pip install -U openai-whisper`; the package is published as `openai-whisper` but imported as `whisper`.
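A quick way to confirm both prerequisites before going further is a small check like the one below. It is a minimal sketch: `check_environment` is a hypothetical helper name, and it only verifies that an `ffmpeg` executable is on the PATH and that some module named `whisper` is importable, not that either one actually works.

```python
import importlib.util
import shutil


def check_environment() -> dict:
    """Report whether the two prerequisites from this step are visible to Python."""
    return {
        # Whisper calls the ffmpeg executable to decode audio files.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # The openai-whisper package is imported under the name "whisper".
        "whisper": importlib.util.find_spec("whisper") is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If either entry is reported as missing, repeat the matching install command before moving on to transcription.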
Perform Basic Transcription
Create a Python script to load a Whisper model and transcribe an audio file. The `base` model is a good starting point, offering a balance between speed and accuracy for English. This is the core of the Speech Recognition skill.
Improve Accuracy with Larger Models
Whisper ships in several sizes (`tiny`, `base`, `small`, `medium`, `large`). Larger models are more accurate, especially on noisy audio and non-English speech, but they run slower and need more memory. To switch, change the name passed to `load_model`, for example `whisper.load_model("small")`.
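One way to make the size trade-off explicit is to pick the model from a memory budget. The sketch below assumes the approximate VRAM figures listed in the Whisper README (roughly 1, 1, 2, 5, and 10 GB); treat them as rough guides rather than guarantees. `pick_model` and `load_for_budget` are hypothetical helper names.

```python
# Approximate VRAM needs in GB, taken from the Whisper README; rough guides only.
APPROX_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large": 10,
}


def pick_model(available_vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits the budget."""
    choice = "tiny"  # fall back to the smallest model if nothing fits
    # Dicts preserve insertion order, so this walks from smallest to largest.
    for name, needed in APPROX_VRAM_GB.items():
        if needed <= available_vram_gb:
            choice = name
    return choice


def load_for_budget(available_vram_gb: float):
    """Load the chosen model; whisper is imported lazily so pick_model works alone."""
    import whisper

    return whisper.load_model(pick_model(available_vram_gb))
```

For example, `pick_model(4)` selects `small`, since `medium` is listed at about 5 GB.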
Enable Language Identification
A key capability of Speech Recognition is identifying the language being spoken, and Whisper can do this automatically. If you omit the `language` argument to `transcribe`, Whisper detects the language from the first 30 seconds of audio and reports it under the `language` key of the result.
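To inspect the detector's confidence rather than only the winning language, the lower-level calls shown in the Whisper README can be wrapped as below. This is a sketch that assumes a multilingual model (the English-only `.en` variants cannot detect languages); `top_languages` and `detect_language` are hypothetical helper names.

```python
def top_languages(probs: dict, k: int = 3) -> list:
    """Return the k most likely (language_code, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


def detect_language(path: str, model_name: str = "base") -> list:
    """Run Whisper's language detector on the first 30 seconds of an audio file."""
    import whisper  # imported lazily so top_languages works without it

    model = whisper.load_model(model_name)
    # Load the audio and pad or cut it to the 30-second window the model expects.
    audio = whisper.pad_or_trim(whisper.load_audio(path))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    _, probs = model.detect_language(mel)  # maps language code -> probability
    return top_languages(probs)
```

When the language is already known, passing it explicitly, as in `model.transcribe(path, language="en")`, skips detection entirely.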
Extract Word-Level Timestamps
For agentic voice pipelines, knowing *when* a word was said is crucial. Set `word_timestamps=True` in the `transcribe` call and each entry in the result's `segments` list gains a `words` list with the start and end time of every word, enabling more responsive and interactive applications.
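The per-word timings arrive nested inside each segment, so a small flattening step is often convenient. The sketch below assumes the result layout produced by `transcribe(..., word_timestamps=True)`, where each word entry carries `word`, `start`, and `end` keys; `flatten_words` and `transcribe_with_words` are hypothetical helper names.

```python
def flatten_words(result: dict) -> list:
    """Collect (word, start, end) tuples from a transcribe() result."""
    words = []
    for segment in result.get("segments", []):
        # "words" is only present when transcribe() ran with word_timestamps=True.
        for entry in segment.get("words", []):
            words.append((entry["word"].strip(), entry["start"], entry["end"]))
    return words


def transcribe_with_words(path: str, model_name: str = "base") -> list:
    """Transcribe a file and return its words with start/end times in seconds."""
    import whisper  # imported lazily so flatten_words works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(path, word_timestamps=True)
    return flatten_words(result)
```

Each tuple can then drive downstream logic, such as highlighting the current word or cutting the audio at word boundaries.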

Starter code

import whisper
import os

# This starter script requires `pydub` to generate a test file.
# Install it with: pip install pydub

# --- 1. Generate a dummy audio file for testing ---
# This avoids needing a separate audio file. Requires ffmpeg to be installed.
try:
    from pydub import AudioSegment
    from pydub.generators import Sine

    # Generate a 3-second sine wave at 440 Hz (note 'A')
    sine_wave = Sine(440).to_audio_segment(duration=3000) # duration in ms
    # Add 1 second of silence at the beginning
    silence = AudioSegment.silent(duration=1000)
    audio_segment = silence + sine_wave

    # Export as WAV
    audio_file = "generated_speech_test.wav"
    audio_segment.export(audio_file, format="wav")
    print(f"Successfully created test audio file: {audio_file}")

except ImportError:
    print("\nPlease install pydub (`pip install pydub`) to auto-generate a test audio file.")
    print("Then run this script again.")
    raise SystemExit(1)  # stop with a non-zero status; nothing to transcribe yet
except Exception as e:
    print(f"Error creating audio file: {e}")
    print("Please ensure ffmpeg is installed and accessible in your system's PATH.")
    raise SystemExit(1)

# --- 2. Perform Speech Recognition with Whisper ---
if os.path.exists(audio_file):
    print("\nLoading Whisper model ('base'). This may take a moment on first run...")
    # Using the 'base' model for a good balance of speed and accuracy
    model = whisper.load_model("base")

    print(f"Transcribing {audio_file}...")
    # Note: The generated audio is a sine wave, so Whisper will likely transcribe silence or nonsensical words.
    # Replace 'generated_speech_test.wav' with a real speech file for an actual transcription.
    result = model.transcribe(audio_file)

    print("\n--- Transcription Result ---")
    # For the generated sine wave, the text will be empty or nonsensical.
    # This demonstrates the process is working.
    print(f"Detected Language: {result['language']}")
    print(f"Transcription: {result['text']}")
    print("\nStarter script finished. Pass your own speech recording to model.transcribe() to see a real transcription.")

    # Clean up the generated test file (drop this step if audio_file points at your own recording)
    os.remove(audio_file)