How to Build a Customer Service Voicebot with a TTS API
The architecture is STT → LLM → TTS. Three steps. The TTS part is where most voicebots fall apart — either the voice sounds robotic, the latency kills the conversation, or the monthly bill doesn't survive contact with real traffic.
Here's how to build the speech output layer correctly.
The Latency Problem
Natural conversation has gaps of 200–300ms between speakers. For a voicebot to feel human, the entire STT → LLM → TTS pipeline needs to fit inside something close to that window. TTS alone should contribute no more than 150ms.
Most providers advertise "inference latency", which is just model compute time. Production latency, which adds network round trips, API gateway overhead, and audio encoding, is often 3–5x higher. A TTS model that benchmarks at 100ms in a lab can take 300–500ms in production, and more during peak traffic.
This is why streaming matters.
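The gap between advertised and production latency is easiest to see by timing it from the client. Here is a minimal sketch that measures the time from issuing a request to receiving the first non-empty audio chunk. The helper name first_audio_latency_ms and the fake_stream stand-in are illustrative assumptions, not part of any provider's SDK.

```python
import time
from typing import Callable, Iterable


def first_audio_latency_ms(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Milliseconds from issuing the request to the first non-empty chunk."""
    start = time.perf_counter()
    # The request is sent inside open_stream(), so connection setup and
    # gateway overhead are included in the measurement.
    for chunk in open_stream():
        if chunk:
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended without producing audio")


def fake_stream() -> Iterable[bytes]:
    """Stand-in for a real TTS call: 50 ms of simulated network and inference delay."""
    time.sleep(0.05)
    yield b"\x00" * 4096
    yield b"\x00" * 4096


print(f"first audio after {first_audio_latency_ms(fake_stream):.0f} ms")
```

To measure a real endpoint, pass a function that issues the streaming HTTP request and returns its chunk iterator. Run it from the region where your voicebot is deployed and at peak hours, since that is the number your callers experience.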
Streaming vs Batch: Pick the Right One
Streaming TTS: The API sends audio chunks as they're generated, before the full response is ready. Your client starts playing audio while the rest generates. Latency to first audio: 80–200ms. Use this for conversational applications where the user is waiting.
Batch TTS: The API generates the full audio file, then returns it. Latency: 300ms–2s depending on length. Use this for pre-generated content — hold music, menu prompts, FAQ answers that never change.
For a voicebot handling live calls or chat, you need streaming. There's no workaround.
Architecture
User speaks
↓
STT (Whisper / Deepgram) → transcript
↓
LLM (GPT-4o / Claude) → response text
↓
TTS API (streaming) → audio chunks
↓
Play audio to user

The LLM and TTS steps can be pipelined: start sending the LLM's output to TTS before the LLM finishes generating. This cuts perceived latency by another 100–300ms.
Python Implementation
import queue
import threading

import pyaudio
import requests

SPEEKO_KEY = "your-key"
TTS_URL = "https://api.speekoapp.com/v1/tts/stream"


def stream_tts(text: str, audio_queue: queue.Queue):
    """Stream TTS audio chunks into a queue for playback."""
    try:
        with requests.post(
            TTS_URL,
            headers={"X-API-Key": SPEEKO_KEY, "Content-Type": "application/json"},
            json={"text": text, "voice": "en-US-neural-1", "format": "pcm"},
            stream=True,
            timeout=10,
        ) as response:
            response.raise_for_status()  # fail fast on auth or quota errors
            for chunk in response.iter_content(chunk_size=4096):
                if chunk:
                    audio_queue.put(chunk)
    finally:
        # Sentinel: always sent, so the player never blocks forever on an error
        audio_queue.put(None)


def play_audio(audio_queue: queue.Queue):
    """Pull chunks from queue and play via PyAudio."""
    pa = pyaudio.PyAudio()
    # 16-bit mono at 22,050 Hz; this must match the PCM format the API returns
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=22050, output=True)
    try:
        while True:
            chunk = audio_queue.get()
            if chunk is None:
                break
            stream.write(chunk)
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()


def speak(text: str):
    """Synthesize and play text, blocking until playback finishes."""
    q = queue.Queue()
    producer = threading.Thread(target=stream_tts, args=(text, q))
    consumer = threading.Thread(target=play_audio, args=(q,))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()


# Usage
speak("Your order has shipped. Expected delivery is Thursday, May 5th.")

PCM format skips the MP3 decode step, which cuts another 20–50ms. For telephony (SIP/WebRTC), use format: "pcm" and stream directly into your audio pipeline.
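Chunk size is a latency knob too: each chunk holds a fixed amount of audio, and playback of a chunk can't start until it has arrived. A quick sketch of the arithmetic, assuming the 16-bit mono 22,050 Hz PCM used in the playback code above (the format your API actually returns may differ, so check it). The helper name pcm_chunk_ms is made up for this example.

```python
def pcm_chunk_ms(num_bytes: int, sample_rate: int = 22050,
                 sample_width: int = 2, channels: int = 1) -> float:
    """Playback duration in milliseconds of a raw PCM chunk."""
    frames = num_bytes / (sample_width * channels)  # one frame = one sample per channel
    return frames / sample_rate * 1000.0


print(round(pcm_chunk_ms(4096), 1))  # 92.9
print(round(pcm_chunk_ms(1024), 1))  # 23.2
```

So a 4096-byte chunk carries about 93ms of audio under these assumptions. Smaller chunks may get the first audio out sooner at the cost of more per-chunk overhead; measure before changing it.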
LLM + TTS Pipelining
Don't wait for the full LLM response before calling TTS. Send each sentence as it arrives:
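import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def respond_and_speak(user_message: str):
    buffer = ""
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": user_message}],
    ) as stream:
        for text in stream.text_stream:
            buffer += text
            # Speak on sentence boundaries; rstrip() tolerates trailing whitespace or newlines
            if buffer.rstrip().endswith((".", "?", "!")) and len(buffer) > 20:
                speak(buffer.strip())
                buffer = ""
    # Flush whatever is left once the stream ends
    if buffer.strip():
        speak(buffer.strip())

This cuts first-audio latency from "time to full LLM response" to "time to first sentence", typically a saving of 600–800ms on multi-sentence replies. Note that speak() blocks until playback finishes, so each sentence is synthesized only after the previous one has played.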
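The sentence check above only fires when a streamed fragment happens to end on punctuation. Here is a sketch of a slightly sturdier splitter, plus a worker thread so reading the LLM stream doesn't stall while audio plays. The names sentences and speak_in_background are illustrative, and speak_fn stands for any blocking function that speaks one string, such as the speak() helper from earlier. The splitter requires whitespace after the punctuation, which is an assumption that keeps it from cutting "$12.50" in half.

```python
import queue
import re
import threading
from typing import Callable, Iterable, Iterator

# Sentence end: '.', '?' or '!' followed by whitespace
_SENTENCE_END = re.compile(r"([.?!])\s+")


def sentences(fragments: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield speakable chunks of at least min_chars from streamed text fragments."""
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        while True:
            # Only accept a boundary once the chunk would reach min_chars
            match = _SENTENCE_END.search(buffer, max(min_chars - 1, 0))
            if match is None:
                break
            yield buffer[:match.end(1)].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush the tail when the stream ends


def speak_in_background(chunks: Iterable[str], speak_fn: Callable[[str], None]) -> None:
    """Speak chunks in order on a worker thread while the caller keeps consuming chunks."""
    pending: queue.Queue = queue.Queue()

    def worker() -> None:
        while (text := pending.get()) is not None:
            speak_fn(text)

    thread = threading.Thread(target=worker)
    thread.start()
    for chunk in chunks:
        pending.put(chunk)
    pending.put(None)  # sentinel: no more text
    thread.join()


fragments = ["Your order has ", "shipped. It cost $12", ".50 and arrives Thurs", "day. Anything else?"]
speak_in_background(sentences(fragments), print)
```

In the LLM loop you would pass stream.text_stream as the fragments and your own speak function as speak_fn. Swapping print for a real TTS call is the only change needed to try it.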
Pre-Generate Static Prompts
Most voicebot traffic is the same 20 prompts repeated thousands of times: "I'm connecting you now," "Please hold," "Your account number is..." — the dynamic part is the variable, not the full sentence.
Pre-generate the static parts. Store MP3s on your CDN. Combine static audio + dynamic TTS on the fly:
# Static: pre-generated
HOLD_MUSIC_URL = "https://cdn.example.com/audio/please-hold.mp3"


# Dynamic: generated at runtime
def order_status_message(order_id: str, eta: str) -> str:
    return f"Order {order_id} is on its way. Expected by {eta}."

Static prompts: zero API cost and no synthesis latency. Dynamic parts: ~$0.03/1K chars via Speeko. A voicebot handling 10,000 calls/day at about 1,000 characters of speech per call, with 70% of that audio static, spends roughly $90/day on TTS for the dynamic segments.
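One way to combine the two is a small lookup layer: check whether a segment was pre-generated, and only call the TTS API for the segments that weren't. The sketch below is an assumption about how you might structure it, not a Speeko feature. PromptAudio and prompt_key are hypothetical names, synthesize is whatever function turns text into audio bytes, and joining bytes like this only makes sense for raw PCM in one format (with MP3 files you would queue them for playback in sequence instead).

```python
import hashlib
from typing import Callable, Dict, List


def prompt_key(text: str) -> str:
    """Stable cache key for a prompt, ignoring case and surrounding whitespace."""
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()


class PromptAudio:
    """Serve pre-generated audio when available, synthesize the rest on demand."""

    def __init__(self, synthesize: Callable[[str], bytes]):
        self._synthesize = synthesize
        self._static: Dict[str, bytes] = {}
        self.dynamic_chars = 0  # characters sent to TTS at runtime (what you pay for)

    def pregenerate(self, text: str) -> None:
        """Synthesize once, ahead of time, and keep the result."""
        self._static[prompt_key(text)] = self._synthesize(text)

    def audio_for(self, segments: List[str]) -> bytes:
        """Return audio for the segments in order, using the cache where possible."""
        parts = []
        for segment in segments:
            audio = self._static.get(prompt_key(segment))
            if audio is None:
                self.dynamic_chars += len(segment)
                audio = self._synthesize(segment)
            parts.append(audio)
        return b"".join(parts)


# Fake synthesizer so the sketch runs without an API key
prompts = PromptAudio(lambda text: text.encode("utf-8"))
prompts.pregenerate("Your order is on its way.")
prompts.audio_for(["Your order is on its way.", "Expected by Thursday."])
print(prompts.dynamic_chars)  # 21
```

Tracking dynamic_chars gives you a running count of billable characters, which is handy for checking your real static/dynamic split against the estimate.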
Voice Selection for Phone Audio
Phone audio is 8kHz (G.711) or 16kHz (wideband). Neural TTS voices at 22–44kHz get downsampled. Some voices degrade more than others under downsampling.
Test your chosen voice at phone codec quality before deploying. Generate a sample, run it through:
ffmpeg -i sample.mp3 -ar 8000 -ac 1 -acodec pcm_mulaw sample_phone.wav

Listen. If it sounds tinny or the fricatives blur together, try a different voice. en-US-neural-1 and en-GB-neural-2 on Speeko hold up well at 8kHz.
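If you're comparing several voices, it helps to script the conversion. A small sketch that wraps the same ffmpeg arguments as the command above, assuming ffmpeg is installed and on your PATH. The function names and the phone output directory are placeholders, and the only flag added is -y, which overwrites existing output files.

```python
import subprocess
from pathlib import Path
from typing import List


def phone_codec_cmd(src: str, dst: str) -> List[str]:
    """ffmpeg arguments to resample to 8 kHz mono G.711 mu-law."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "8000", "-ac", "1",
            "-acodec", "pcm_mulaw", dst]


def convert_samples(sources: List[str], out_dir: str = "phone") -> List[str]:
    """Convert each sample and return the paths of the phone-quality files."""
    Path(out_dir).mkdir(exist_ok=True)
    outputs = []
    for src in sources:
        dst = str(Path(out_dir) / (Path(src).stem + "_phone.wav"))
        subprocess.run(phone_codec_cmd(src, dst), check=True)  # raises if ffmpeg fails
        outputs.append(dst)
    return outputs
```

Call convert_samples with one sample file per candidate voice, then listen to the results back to back.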
Cost at Scale
Speeko: $0.03/1K characters. A typical voicebot response is 20–60 words, or 100–350 characters.
At 10,000 calls/day with 5 TTS interactions per call, each averaging 200 characters:
- 10,000 × 5 × 200 = 10,000,000 characters/day
- Cost: $300/day
That sounds like a lot until you compare it to ElevenLabs ($3,000/day for the same volume, which works out to $0.30/1K characters). And if 70% of the audio is pre-generated static prompts, actual spend drops to $90/day.
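The arithmetic above is easy to rerun with your own traffic numbers. A minimal sketch, where the function name and parameters are just for illustration and the price is the $0.03/1K figure quoted in this section:

```python
def daily_tts_cost(calls_per_day: int, turns_per_call: int, chars_per_turn: int,
                   price_per_1k_chars: float, static_fraction: float = 0.0) -> float:
    """Estimated daily TTS spend in dollars, rounded to cents."""
    total_chars = calls_per_day * turns_per_call * chars_per_turn
    billed_chars = total_chars * (1.0 - static_fraction)  # static audio is not billed
    return round(billed_chars / 1000.0 * price_per_1k_chars, 2)


print(daily_tts_cost(10_000, 5, 200, 0.03))                       # 300.0
print(daily_tts_cost(10_000, 5, 200, 0.03, static_fraction=0.7))  # 90.0
```

Plug in your measured average response length rather than a guess; the estimate scales linearly with characters per turn, so that input matters most.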
Getting Started
Speeko's TTS API includes a free $5 credit — enough to prototype and stress-test your pipeline before committing. The streaming endpoint works with the same API key as batch requests; just add /stream to the path.
See the SSML advanced guide for controlling pacing, pauses, and emphasis in voicebot responses.