speech-recognition, asr, whisper, python, audio-processing, voice-agents, ffmpeg, transcription
Implement Local Speech Recognition with OpenAI's Whisper
Learn to perform accurate, local Speech Recognition using the open-source Whisper model. This guide covers setup, basic transcription, and model selection for building voice-enabled applications without relying on cloud APIs. Get started transcribing audio in minutes.
Intermediate, 30 min, 5 steps
The play
- Set Up Your Environment: Whisper relies on the `ffmpeg` command-line tool to decode audio. Install it first with your system's package manager (e.g., `brew install ffmpeg` on macOS, `sudo apt install ffmpeg` on Debian/Ubuntu). Then install the Python library with `pip install -U openai-whisper`.
- Perform Basic Transcription: Create a Python script that loads a Whisper model and transcribes an audio file. The `base` model is a good starting point, balancing speed and accuracy for English. This is the core of the Speech Recognition skill.
- Improve Accuracy with Larger Models: Whisper comes in several model sizes (`tiny`, `base`, `small`, `medium`, `large`). Larger models are more accurate, especially on noisy audio or diverse languages, at the cost of speed and computational resources. To switch, change the model name passed to `load_model`.
- Enable Language Identification: A key capability of Speech Recognition is identifying the language being spoken. Whisper can do this automatically: if you do not pass a `language` argument to `transcribe`, it detects the language and returns it under the `language` key of the result.
- Extract Word-Level Timestamps: For agentic voice pipelines, knowing *when* a word was said is crucial. Set `word_timestamps=True` in the `transcribe` call to get start and end times for each word, enabling more responsive and interactive applications.
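The last three steps can be combined in one short sketch. It assumes the `openai-whisper` package is installed and that each segment of the result carries a `words` list with `word`, `start`, and `end` keys when `word_timestamps=True` is requested. The helper names `transcribe_with_words` and `flatten_words` are hypothetical, chosen only for illustration.

```python
from typing import Any


def transcribe_with_words(audio_path: str, model_name: str = "base") -> dict[str, Any]:
    """Transcribe a file and request per-word timing.

    `model_name` is one of the sizes listed above ("tiny", "base", "small",
    "medium", "large"). No `language` is passed, so Whisper auto-detects it.
    """
    import whisper  # imported lazily so flatten_words works without Whisper

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, word_timestamps=True)


def flatten_words(result: dict[str, Any]) -> list[tuple[str, float, float]]:
    """Collect (word, start, end) tuples from every segment of a result."""
    words = []
    for segment in result.get("segments", []):
        # Assumption: "words" is only present when word_timestamps=True was used.
        for w in segment.get("words", []):
            words.append((w["word"].strip(), w["start"], w["end"]))
    return words
```

With a real recording, something like `result = transcribe_with_words("speech.wav", model_name="small")` followed by `flatten_words(result)` would give one tuple per spoken word, and `result["language"]` would hold the detected language code. The file name `speech.wav` is a placeholder.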
Starter code
import os
import sys

import whisper

# This starter script requires `pydub` to generate a test file.
# Install it with: pip install pydub

# --- 1. Generate a dummy audio file for testing ---
# This avoids needing a separate audio file. Requires ffmpeg to be installed.
try:
    from pydub import AudioSegment
    from pydub.generators import Sine

    # Generate a 3-second sine wave at 440 Hz (note 'A')
    sine_wave = Sine(440).to_audio_segment(duration=3000)  # duration in ms
    # Add 1 second of silence at the beginning
    silence = AudioSegment.silent(duration=1000)
    audio_segment = silence + sine_wave
    # Export as WAV
    audio_file = "generated_speech_test.wav"
    audio_segment.export(audio_file, format="wav")
    print(f"Successfully created test audio file: {audio_file}")
except ImportError:
    print("\nPlease install pydub (`pip install pydub`) to auto-generate a test audio file.")
    print("Alternatively, create a file named 'generated_speech_test.wav' manually.")
    sys.exit(1)
except Exception as e:
    print(f"Error creating audio file: {e}")
    print("Please ensure ffmpeg is installed and accessible in your system's PATH.")
    sys.exit(1)

# --- 2. Perform Speech Recognition with Whisper ---
if os.path.exists(audio_file):
    print("\nLoading Whisper model ('base'). This may take a moment on first run...")
    # Using the 'base' model for a good balance of speed and accuracy
    model = whisper.load_model("base")

    print(f"Transcribing {audio_file}...")
    # Note: The generated audio is a sine wave, so Whisper will likely transcribe silence or nonsensical words.
    # Replace 'generated_speech_test.wav' with a real speech file for an actual transcription.
    result = model.transcribe(audio_file)

    print("\n--- Transcription Result ---")
    # For the generated sine wave, the text will be empty or nonsensical.
    # This demonstrates the process is working.
    print(f"Detected Language: {result['language']}")
    print(f"Transcription: {result['text']}")
    print("\nStarter script finished. Replace 'generated_speech_test.wav' with your own audio file to see a real transcription.")

    # Clean up the generated file
    os.remove(audio_file)
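
Before running the starter script, it can help to confirm that the prerequisites from the first step are actually in place. The following is a minimal sketch using only the standard library. The function name `missing_prerequisites` is hypothetical, and the defaults assume that `openai-whisper` is imported as `whisper` and that `pydub` is only needed for generating the test file.

```python
import importlib.util
import shutil


def missing_prerequisites(commands=("ffmpeg",), modules=("whisper", "pydub")):
    """Return labels for every command or module that cannot be found."""
    missing = []
    for command in commands:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(command) is None:
            missing.append(f"command: {command}")
    for module in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(f"module: {module}")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing prerequisites:")
        for item in problems:
            print(f"  {item}")
    else:
        print("ffmpeg, whisper, and pydub all appear to be available.")
```

Checking up front gives a clearer message than the generic exception handler in the starter script, which can only guess that `ffmpeg` is the cause.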