Set Up Zero-Shot Voice Cloning with Coqui TTS

Launch a local voice-cloning server using Coqui XTTS-v2. The server synthesizes speech in a target voice from a short reference clip (a few seconds of clean audio) and exposes it through a simple HTTP API.

intermediate | 30 min | 5 steps

The play

Install Coqui TTS
Create a Python virtual environment, then install the Coqui TTS library with pip (`pip install TTS`). The install pulls in the packages needed to run the text-to-speech models and the inference server.
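For a scripted setup, the same two actions can be driven from Python's standard library. This is a minimal sketch, assuming the PyPI package name `TTS` and a hypothetical environment directory `.venv`; the helper names are illustrative and not part of Coqui TTS.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(env_dir):
    """Return the path of the interpreter inside a virtual environment."""
    if sys.platform == "win32":
        return Path(env_dir) / "Scripts" / "python.exe"
    return Path(env_dir) / "bin" / "python"


def pip_install_command(env_dir, package="TTS"):
    """Build the pip command that installs `package` into the environment."""
    return [str(venv_python(env_dir)), "-m", "pip", "install", package]


def create_env_and_install(env_dir=".venv"):
    """Create the environment (with pip) and install the package into it."""
    venv.EnvBuilder(with_pip=True).create(env_dir)
    subprocess.run(pip_install_command(env_dir), check=True)
```

Calling `create_env_and_install()` creates the environment and runs the install, which downloads a large dependency set, so expect it to take a few minutes.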
Prepare Reference Audio
Find or record a short audio clip of the voice you want to clone. For best results, use a high-quality, 3-10 second clip with a clear voice and minimal background noise. Save it as a WAV file (e.g., `reference.wav`).
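Before using a clip, it can help to confirm that it falls inside the recommended 3-10 second range. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM WAV files; the function names and the default thresholds are assumptions made for illustration.

```python
import wave


def describe_wav(path):
    """Read basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": wav.getnframes() / rate,
        }


def reference_problems(path, min_s=3.0, max_s=10.0):
    """Return a list of human-readable problems; empty means the clip looks usable."""
    info = describe_wav(path)
    problems = []
    if info["duration_s"] < min_s:
        problems.append(f"clip is {info['duration_s']:.1f}s, shorter than {min_s}s")
    elif info["duration_s"] > max_s:
        problems.append(f"clip is {info['duration_s']:.1f}s, longer than {max_s}s")
    return problems
```

This checks length only. Background noise and recording quality still need a listen.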
Launch the Inference Server
Start the bundled `tts-server` (a Flask application) with the XTTS-v2 model: `tts-server --model_name tts_models/multilingual/multi-dataset/xtts_v2`. The first time you run this, it will download the model files, which can be several gigabytes. The server will then be available at `http://localhost:5002`.
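Because the first launch has to download the model, the server may take a while before it accepts connections. The following sketch starts the server as a child process and polls the port until it opens. The command and port come from this guide; the helper names and the timeout values are assumptions.

```python
import socket
import subprocess
import time

SERVER_CMD = [
    "tts-server",
    "--model_name",
    "tts_models/multilingual/multi-dataset/xtts_v2",
]


def wait_for_port(host="localhost", port=5002, timeout_s=600.0, interval_s=2.0):
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False


def launch_server():
    """Start the server and return the process once the port is reachable."""
    proc = subprocess.Popen(SERVER_CMD)
    if not wait_for_port():
        proc.terminate()
        raise RuntimeError("TTS server did not start listening in time")
    return proc
```

An open port only shows that the process is listening. A real synthesis request is still the most reliable readiness check.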
Synthesize Speech via API
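Use a tool like `curl` to send a POST request to the running server's `/api/tts` endpoint. Provide the text to synthesize, the language code, and your reference audio file. The server responds with the synthesized audio as WAV data, which you can write to a file. The field names the endpoint accepts can differ between server versions, so check the server you installed if a request is rejected.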
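Whichever client you use, it is worth confirming that the response body really is audio before saving it, since an error page written to a `.wav` file fails silently until playback. A minimal sketch that inspects the RIFF/WAVE header bytes follows; the function names are illustrative.

```python
def looks_like_wav(payload: bytes) -> bool:
    """Check for the RIFF container header with a WAVE form type."""
    return (
        len(payload) >= 12
        and payload[:4] == b"RIFF"
        and payload[8:12] == b"WAVE"
    )


def save_if_wav(payload: bytes, out_path: str) -> bool:
    """Write the payload to disk only when it looks like WAV data."""
    if not looks_like_wav(payload):
        return False
    with open(out_path, "wb") as out_f:
        out_f.write(payload)
    return True
```

With the `requests` library, `response.content` is the byte string to pass in.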
Reuse Voices with Speaker Embeddings
To avoid uploading the reference audio for every request, you can compute and save the voice's speaker embedding. First, generate the embedding from the audio file, then use that embedding in subsequent TTS requests for faster inference.
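The caching idea can be sketched with the Python library directly instead of the HTTP server, which is a swap from the request-based flow described above. The sketch assumes the `Xtts` model class, whose `get_conditioning_latents` method returns a GPT conditioning latent and a speaker embedding. The cache helper, the file name `voice.pkl`, and the paths in the usage comments are hypothetical.

```python
import pickle
from pathlib import Path


def cached_latents(cache_path, compute):
    """Load latents from cache_path, or call compute() once and store the result."""
    path = Path(cache_path)
    if path.exists():
        # Only unpickle files you created yourself.
        with path.open("rb") as f:
            return pickle.load(f)
    latents = compute()
    with path.open("wb") as f:
        pickle.dump(latents, f)
    return latents


def xtts_latents(model, reference_wav):
    """Compute (gpt_cond_latent, speaker_embedding) from one reference clip."""
    return model.get_conditioning_latents(audio_path=[reference_wav])


# Usage sketch (requires the TTS package and downloaded XTTS-v2 files;
# the paths below are placeholders):
#
# from TTS.tts.configs.xtts_config import XttsConfig
# from TTS.tts.models.xtts import Xtts
#
# config = XttsConfig()
# config.load_json("/path/to/xtts/config.json")
# model = Xtts.init_from_config(config)
# model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", eval=True)
#
# gpt_cond_latent, speaker_embedding = cached_latents(
#     "voice.pkl", lambda: xtts_latents(model, "reference.wav")
# )
# out = model.inference("Hello again.", "en", gpt_cond_latent, speaker_embedding)
```

The second and later calls to `cached_latents` skip the reference-audio processing and read the stored values instead.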

Starter code

import requests
import os

# --- Configuration ---
SERVER_URL = "http://localhost:5002/api/tts"
REFERENCE_AUDIO_PATH = "/path/to/your/reference.wav" # IMPORTANT: Change this path
TEXT_TO_SYNTHESIZE = "This is a demonstration of zero-shot voice cloning using a Python script."
OUTPUT_WAV_PATH = "starter_output.wav"
LANGUAGE = "en"

# --- Validation ---
if not os.path.exists(REFERENCE_AUDIO_PATH):
    print(f"Error: Reference audio file not found at '{REFERENCE_AUDIO_PATH}'")
    print("Please update the REFERENCE_AUDIO_PATH variable in the script.")
    raise SystemExit(1)  # Exit with a non-zero status so callers can detect the failure

# --- API Request ---
print(f"Sending request to Coqui TTS server for text: '{TEXT_TO_SYNTHESIZE}'")

with open(REFERENCE_AUDIO_PATH, 'rb') as f:
    files = {'speaker_wav': (os.path.basename(REFERENCE_AUDIO_PATH), f, 'audio/wav')}
    data = {
        'text': TEXT_TO_SYNTHESIZE,
        'language': LANGUAGE
    }
    
    try:
        # Synthesis can be slow, so allow a generous timeout instead of waiting forever
        response = requests.post(SERVER_URL, files=files, data=data, timeout=300)
        response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

        # --- Save Output ---
        with open(OUTPUT_WAV_PATH, 'wb') as out_f:
            out_f.write(response.content)
        
        print(f"Successfully synthesized audio and saved to '{OUTPUT_WAV_PATH}'")

    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        print("Is the TTS server running? Run: tts-server --model_name tts_models/multilingual/multi-dataset/xtts_v2")