Build a Low-Latency Conversational Voice Agent

Leverage streaming ASR and TTS to build a voice interface with sub-500ms latency, moving beyond clunky commands to natural, fluid conversation.

Intermediate | 30 minutes | 5 steps

The play

Benchmark Your Existing Voice Pipeline
Identify the 'user-stops-speaking to agent-starts-speaking' round-trip time in your current voice application. Use browser developer tools or logging to measure the time from the final audio packet sent to the first audio packet received. If it's over 800ms, it's creating an unnatural 'walkie-talkie' experience.
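As a minimal sketch of this measurement, assuming your client can log two timestamps per turn (when the final user audio packet is sent and when the first agent audio packet arrives), the round trip can be computed and checked against the 800ms mark. The names TurnTiming and flag_slow_turns are hypothetical, and the sample timestamps are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TurnTiming:
    """Timestamps for one turn, in seconds (e.g. taken from time.monotonic())."""
    last_user_audio_sent: float        # final user audio packet leaves the client
    first_agent_audio_received: float  # first agent audio packet arrives

    @property
    def round_trip_ms(self) -> float:
        return (self.first_agent_audio_received - self.last_user_audio_sent) * 1000.0


def flag_slow_turns(turns, threshold_ms=800.0):
    """Return the round-trip times (ms) of turns slower than the threshold."""
    return [t.round_trip_ms for t in turns if t.round_trip_ms > threshold_ms]


# Illustrative values only, not real measurements.
turns = [TurnTiming(10.00, 10.65), TurnTiming(20.00, 21.10)]
slow = flag_slow_turns(turns)
print(f"{len(slow)} of {len(turns)} turns exceeded 800 ms")
```

Use a monotonic clock on a single machine for both timestamps so that clock adjustments do not distort the result.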
Switch to a Streaming ASR Provider
Replace your batch or VAD-gated ASR API with a streaming-first provider (e.g., Deepgram, AssemblyAI). Configure your client to send audio in real-time chunks over a WebSocket rather than uploading a finished recording. This can cut 'Time-to-First-Transcript' to under 300ms, the first step toward a real-time feel.
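The chunking side of this step can be sketched without tying it to any one provider. The example below assumes 16 kHz, 16-bit mono PCM and 20ms frames; both are assumptions, so check the format your provider expects. The send argument stands in for whatever async send method your WebSocket client exposes, and pcm_frames, stream_audio, and fake_send are hypothetical names.

```python
import asyncio


def pcm_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 20,
               bytes_per_sample: int = 2):
    """Yield fixed-duration slices of raw PCM audio (the last one may be shorter)."""
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


async def stream_audio(pcm: bytes, send, frame_ms: int = 20, realtime: bool = True):
    """Push frames through `send`, optionally paced like a live microphone."""
    for frame in pcm_frames(pcm, frame_ms=frame_ms):
        await send(frame)  # e.g. a WebSocket client's send coroutine
        if realtime:
            await asyncio.sleep(frame_ms / 1000)


# Stand-in for a real socket: collect what would have been sent.
sent = []


async def fake_send(frame: bytes) -> None:
    sent.append(frame)


asyncio.run(stream_audio(b"\x00" * 1600, fake_send, realtime=False))
print([len(f) for f in sent])
```

Smaller frames lower the delay before the recognizer sees audio but increase the message count, so the frame size is a tuning choice rather than a fixed rule.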
Chain Streaming LLM and TTS
Don't wait for the full ASR transcript. Use endpoint detection or stable partial transcripts to trigger your LLM call. As the LLM generates response tokens, immediately pipe them to a streaming TTS API (e.g., ElevenLabs, PlayHT). This parallelizes text generation and speech synthesis, allowing audio to start playing while the rest of the sentence is still being formulated.
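One way to pipe LLM tokens into TTS is to buffer them just long enough to form a speakable chunk, then hand each chunk to the synthesizer while generation continues. The sketch below is an illustration only: speakable_chunks is a hypothetical helper, the token list stands in for a streaming LLM response, and the boundary rule (sentence punctuation followed by whitespace) is a simplifying assumption that will also split after abbreviations such as "e.g.".

```python
import re
from typing import Iterable, Iterator

# Sentence-ending punctuation followed by whitespace marks a chunk boundary.
_BOUNDARY = re.compile(r"[.!?]\s")


def speakable_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed text tokens into sentence-sized chunks for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _BOUNDARY.search(buffer)
            if not match:
                break
            # Keep the punctuation mark, drop the whitespace after it.
            chunk, buffer = buffer[:match.start() + 1], buffer[match.end():]
            if chunk.strip():
                yield chunk.strip()
    # Flush whatever is left once the token stream ends.
    if buffer.strip():
        yield buffer.strip()


# Stand-in for tokens arriving from a streaming LLM response.
tokens = ["Sure", ",", " I", " can", " help", ".", " What", " city", "?", " Thanks"]
chunks = list(speakable_chunks(tokens))
print(chunks)
```

Each yielded chunk would be sent to the streaming TTS API as soon as it is available, so the first sentence can start playing while later ones are still being generated.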
Enable Natural Turn-Taking and Interruption
A key benefit of low latency is natural conversation flow. Implement logic to detect user speech while your agent is speaking (barge-in). When an interruption is detected, immediately stop the TTS audio playback and begin processing the new user input. This responsiveness is critical for a fluid user experience.
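The barge-in logic can be sketched as a small state machine. This assumes you already have a per-frame speech/no-speech signal (from a VAD or from the ASR provider) and a way to stop playback; BargeInController, its method names, and the three-frame threshold are all hypothetical choices for illustration.

```python
from typing import Callable


class BargeInController:
    """Stops agent playback once the user has spoken for several frames in a row."""

    def __init__(self, stop_playback: Callable[[], None], min_speech_frames: int = 3):
        self._stop_playback = stop_playback
        self._min_speech_frames = min_speech_frames
        self._speech_run = 0          # consecutive frames classified as speech
        self.agent_speaking = False

    def on_agent_audio_start(self) -> None:
        self.agent_speaking = True
        self._speech_run = 0

    def on_agent_audio_end(self) -> None:
        self.agent_speaking = False

    def on_user_frame(self, is_speech: bool) -> bool:
        """Feed one VAD decision; returns True when an interruption is triggered."""
        self._speech_run = self._speech_run + 1 if is_speech else 0
        if self.agent_speaking and self._speech_run >= self._min_speech_frames:
            self._stop_playback()
            self.agent_speaking = False
            return True
        return False


stopped = []
controller = BargeInController(stop_playback=lambda: stopped.append(True))
controller.on_agent_audio_start()
# A single noisy frame is ignored; three consecutive speech frames trigger barge-in.
results = [controller.on_user_frame(f) for f in [True, False, True, True, True, True]]
print(results)
```

Requiring a short run of speech frames trades a few tens of milliseconds of reaction time for fewer false interruptions from coughs or background noise.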
Verify and Ship Your Conversational Agent
Re-run your latency benchmark from step one. Round-trip latency should now come in under 700ms; the target to aim for is a P95 below 500ms. You've built the core of a next-gen voice interface. To go deeper on product design, error handling, and deploying a complete agent, follow our step-by-step guide in the Flagship Voice Interfaces DIY package.
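To report a P95 figure from the re-run benchmark, one simple option is the nearest-rank percentile over your logged round-trip times. The helper name percentile is hypothetical and the sample values below are invented for illustration, not measurements.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented round-trip times in milliseconds, for illustration only.
samples_ms = [310, 420, 450, 380, 690, 400, 360, 470, 430, 390]
p50 = percentile(samples_ms, 50)
p95 = percentile(samples_ms, 95)
print(f"P50={p50} ms, P95={p95} ms, max={max(samples_ms)} ms")
```

With only a handful of samples the P95 is dominated by the single slowest turn, so collect a few hundred turns before comparing against the target.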

Starter code

The 'walkie-talkie' pause in voice assistants is dead. Production-ready streaming models now allow for truly conversational turn-taking. This pack shows you how to audit your current stack and implement a modern, low-latency voice pipeline.