How to Build a Voice AI Agent with a TTS API: STT → LLM → TTS Pipeline

Posted on May 1, 2026
By Speeko Team
voice-agent, tts-api, llm, speech-to-text, tutorial

A voice agent hears speech, thinks, and speaks back. The basic pipeline is three components: speech-to-text (STT) converts audio to text, an LLM generates a response, and TTS converts that response back to audio. Simple in theory. The latency is where it gets hard.

This guide covers the architecture, the latency budget you need to hit, and a Python implementation using Deepgram for STT, Claude for the LLM layer, and Speeko for TTS.

Latency Budget

Human conversation feels natural when the gap between turns stays at roughly one second. The budget below targets under 1.2 seconds end-to-end. Here's where the time goes:

Component               Target latency
STT transcription       200–400ms
LLM first token         200–500ms
TTS first audio chunk   150–300ms
Network round-trips     100–200ms
Total target            < 1,200ms

Miss that 1.2-second mark consistently and the conversation feels robotic. Users pause, wonder if the agent heard them, speak again.
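As a quick sanity check on the budget above, the per-stage ranges can be summed. This is a minimal sketch: the dictionary keys and the helper names `budget_range` and `over_budget` are illustrative and not part of any SDK.

```python
# Per-stage latency ranges in milliseconds, copied from the budget above.
BUDGET_MS = {
    "stt_transcription": (200, 400),
    "llm_first_token": (200, 500),
    "tts_first_audio_chunk": (150, 300),
    "network_round_trips": (100, 200),
}
TOTAL_TARGET_MS = 1200


def budget_range(budget):
    """Return (best_case, worst_case) totals across all stages."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def over_budget(measured_ms, target_ms=TOTAL_TARGET_MS):
    """Return how many ms the measured stage latencies exceed the target (0 if within)."""
    return max(0, sum(measured_ms.values()) - target_ms)


best, worst = budget_range(BUDGET_MS)
print(f"best case {best} ms, worst case {worst} ms, target {TOTAL_TARGET_MS} ms")
```

The worst case of the four ranges adds up to 1,400ms, which is over the 1,200ms target, so not every stage can sit at the top of its range at the same time.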

The biggest lever: streaming at every stage. Don't wait for the full LLM response before starting TTS. Don't wait for the full TTS audio before playing it. Start each stage as data arrives from the previous one.

Architecture

Microphone input
      ↓
  STT (streaming) → partial transcripts
      ↓
  LLM (streaming) → partial response text
      ↓
  TTS (streaming) → audio chunks
      ↓
  Speaker output

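Each arrow is a streaming pipe. In the fully streamed design, the LLM can start on a partial transcript before STT is done, and TTS starts on the first complete sentence before the LLM is done, so the user hears audio before any component has finished. The basic implementation below is simpler: it waits for a final transcript and the full LLM response. The sentence-level chunking section then streams the LLM output into TTS.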
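One way to picture the streaming pipes is as chained Python generators, where each stage pulls from the previous one only when it needs more data. The sketch below uses stand-in functions (`fake_llm`, `fake_tts`, `play` are hypothetical names, and no real API is called) and records the order in which the stages do work.

```python
from typing import Iterable, Iterator

events = []  # records the order in which stages do work


def fake_llm(prompt: str) -> Iterator[str]:
    """Stand-in for a streaming LLM that emits one sentence at a time."""
    for sentence in ["Sure.", "Your order shipped today.", "Anything else?"]:
        events.append(f"llm:{sentence}")
        yield sentence


def fake_tts(sentences: Iterable[str]) -> Iterator[bytes]:
    """Stand-in for streaming TTS: one placeholder 'audio' chunk per sentence."""
    for sentence in sentences:
        events.append(f"tts:{sentence}")
        yield sentence.encode("utf-8")  # placeholder bytes, not real audio


def play(chunks: Iterable[bytes]) -> None:
    """Stand-in for speaker output."""
    for chunk in chunks:
        events.append(f"play:{len(chunk)}")


play(fake_tts(fake_llm("Where is my order?")))
for event in events:
    print(event)
```

Because generators are lazy, the first sentence is synthesized and played before the stand-in LLM has produced its second sentence. A real pipeline adds network I/O and threads, but the ordering it aims for is the same.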

Python Implementation

Install dependencies:

pip install deepgram-sdk anthropic requests pyaudio

STT → LLM → TTS pipeline:

import asyncio
import anthropic
import requests
import pyaudio
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

SPEEKO_KEY = "your-speeko-key"
ANTHROPIC_KEY = "your-anthropic-key"
DEEPGRAM_KEY = "your-deepgram-key"

anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_KEY)
p = pyaudio.PyAudio()

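def generate_and_play_tts(text: str):
    """Generate TTS audio and stream to speaker."""
    # PyAudio plays raw PCM samples, so MP3 bytes cannot be written to it
    # directly. ASSUMPTION: the endpoint can return headerless 16-bit mono PCM
    # at 22,050 Hz via format="pcm". Check the Speeko API reference for the
    # real value; if only MP3 is available, decode it before playback.
    response = requests.post(
        "https://api.speekoapp.com/v1/tts/stream",
        headers={
            "X-API-Key": SPEEKO_KEY,
            "Content-Type": "application/json"
        },
        json={"text": text, "voice": "en-US-neural-1", "format": "pcm"},
        stream=True,
        timeout=10
    )
    response.raise_for_status()  # fail loudly on auth or quota errors

    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=22050,
        output=True
    )

    try:
        # Write each chunk to the speaker as soon as it arrives
        for chunk in response.iter_content(chunk_size=4096):
            if chunk:
                stream.write(chunk)
    finally:
        stream.stop_stream()
        stream.close()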

def get_llm_response(user_text: str) -> str:
    """Get response from Claude — stream and collect."""
    full_response = ""
    with anthropic_client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=300,
        messages=[{"role": "user", "content": user_text}],
        system="You are a helpful voice assistant. Respond in 1-3 sentences. No bullet points, no markdown — spoken word only."
    ) as stream:
        for text in stream.text_stream:
            full_response += text
    return full_response

async def run_voice_agent():
    deepgram = DeepgramClient(DEEPGRAM_KEY)

    options = LiveOptions(
        model="nova-2",
        language="en-US",
        smart_format=True,
        encoding="linear16",  # raw 16-bit PCM, as read from the microphone
        sample_rate=16000,    # must match the PyAudio input rate below
        channels=1,
        endpointing=500  # 500ms silence = end of utterance
    )

    connection = deepgram.listen.live.v("1")

    def on_transcript(client, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if result.is_final and transcript.strip():
            print(f"User: {transcript}")
            response = get_llm_response(transcript)
            print(f"Agent: {response}")
            generate_and_play_tts(response)

    # In deepgram-sdk v3, on() takes the event and the handler; it is not a decorator
    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    connection.start(options)

    mic_stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,
        input=True,
        frames_per_buffer=1024
    )

    print("Voice agent ready. Speak...")

    try:
        while True:
            audio_data = mic_stream.read(1024, exception_on_overflow=False)
            connection.send(audio_data)
    except KeyboardInterrupt:
        pass
    finally:
        mic_stream.stop_stream()
        mic_stream.close()
        connection.finish()

asyncio.run(run_voice_agent())

Sentence-Level TTS Chunking

Waiting for the full LLM response adds 500–800ms for a 3-sentence answer. Instead, split on sentence boundaries and start TTS on the first complete sentence:

import re

def stream_llm_with_tts(user_text: str):
    buffer = ""
    sentence_end = re.compile(r'(?<=[.!?])\s')

    with anthropic_client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=300,
        messages=[{"role": "user", "content": user_text}],
        system="Respond in spoken English. Short sentences. No markdown."
    ) as stream:
        for text in stream.text_stream:
            buffer += text
            sentences = sentence_end.split(buffer)

            # All but the last fragment are complete sentences
            for sentence in sentences[:-1]:
                if sentence.strip():
                    generate_and_play_tts(sentence.strip())

            buffer = sentences[-1]  # Keep incomplete fragment

    # Flush remaining text
    if buffer.strip():
        generate_and_play_tts(buffer.strip())

This gets first audio to the user's ears 400–600ms faster than waiting for the full LLM response. Noticeable in conversation.
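The loop above calls generate_and_play_tts inline, so reading from the LLM stream pauses while each sentence plays. One possible refinement, sketched here under assumptions, is to separate the sentence splitting from playback and hand sentences to a worker thread. The names `iter_sentences` and `speak_in_background` are hypothetical helpers, and `speak` is any callable that takes a sentence (in this article it would be generate_and_play_tts).

```python
import queue
import re
import threading
from typing import Callable, Iterable, Iterator

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of text fragments."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = SENTENCE_END.split(buffer)
        # Every part except the last ended with . ! or ? followed by whitespace
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]  # keep the incomplete fragment
    if buffer.strip():
        yield buffer.strip()  # flush trailing text once the stream ends


def speak_in_background(sentences: Iterable[str], speak: Callable[[str], None]) -> None:
    """Run `speak` on a worker thread so consuming `sentences` is not blocked by playback."""
    pending: queue.Queue = queue.Queue()

    def worker() -> None:
        while True:
            sentence = pending.get()
            if sentence is None:  # sentinel: no more sentences
                break
            speak(sentence)

    thread = threading.Thread(target=worker)
    thread.start()
    try:
        for sentence in sentences:
            pending.put(sentence)
    finally:
        pending.put(None)
        thread.join()  # wait until everything queued has been spoken


# Usage with simulated LLM fragments and a list standing in for the speaker
spoken = []
fragments = ["Hel", "lo there. How", " are you? I am", " fine"]
speak_in_background(iter_sentences(fragments), spoken.append)
print(spoken)
```

The regex treats any period followed by whitespace as a sentence end, so abbreviations such as "Dr. Smith" will be split early. For a prototype that is usually acceptable, since the pieces are still spoken in order.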

Voice Selection for Agents

Pick a voice that stays comfortable to listen to over a long conversation. Voices tuned for presentations often clip syllables at the default speaking rate. Test at 1.0× and 1.1× speed before committing.

For English-language customer service agents, en-US-neural-1 in Speeko's voice library performs well in extended conversation benchmarks. For European markets, test en-GB-neural-2 — British voices tend to rate higher for trust in financial and healthcare contexts.

What to Skip (for Now)

Voice activity detection (VAD), barge-in handling, conversation memory — these matter for production. They don't matter for a first prototype. Get the basic pipeline working and measure latency first. You'll know which components need work before you complicate the architecture.

Next Steps

Get your free Speeko credit and test TTS latency against your LLM's first-token time. The ratio matters: if your LLM is taking 800ms and TTS is 200ms, optimize the LLM first. If TTS is the bottleneck, look at streaming and chunking.
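To compare the two numbers, it helps to measure time to first chunk the same way for both stages. The helper below is a rough sketch: `first_chunk_latency_ms` and `bottleneck` are illustrative names, and the two simulated streams use `time.sleep` in place of real API calls.

```python
import time
from typing import Callable, Iterable


def first_chunk_latency_ms(start_stream: Callable[[], Iterable]) -> float:
    """Return milliseconds from calling start_stream() until its first item arrives."""
    started = time.perf_counter()
    for _ in start_stream():
        break  # only the first item matters
    return (time.perf_counter() - started) * 1000.0


def bottleneck(llm_ms: float, tts_ms: float) -> str:
    """Name the slower stage, which is the one to optimize first."""
    return "llm" if llm_ms >= tts_ms else "tts"


# Simulated streams standing in for real API calls
def slow_llm():
    time.sleep(0.08)
    yield "Hello"


def fast_tts():
    time.sleep(0.02)
    yield b"\x00\x00"


llm_ms = first_chunk_latency_ms(slow_llm)
tts_ms = first_chunk_latency_ms(fast_tts)
print(f"LLM first token: {llm_ms:.0f} ms, TTS first chunk: {tts_ms:.0f} ms")
print("Optimize first:", bottleneck(llm_ms, tts_ms))
```

In the pipeline from this guide, `start_stream` could wrap the streaming TTS request and iterate over `response.iter_content(...)`, or wrap the Claude stream and iterate over its `text_stream`. Run each measurement several times, since a single sample is dominated by network jitter.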

For pronunciation control — brand names, technical terms, acronyms — see the SSML pronunciation guide.