Real-time Streaming TTS: Tutorial and Use Cases
Batch TTS is fine for pre-recorded content. Voice agents, interactive narration, and live translation need streaming.
Why Streaming Matters
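With batch TTS, generating 30 seconds of audio takes 3 seconds, and nothing plays until the whole file is ready. For conversational UX, that delay is unacceptable. Streaming TTS returns the first audio chunk in 100-300ms, so playback can start while the rest is still being generated.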
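To check the first-chunk figure in your own setup, you can time the gap between sending a request and receiving the first audio message. The sketch below is a minimal, hypothetical helper (createFirstChunkTimer and its method names are not part of any Speeko SDK); the clock is injectable so it can be tested without a network.

```javascript
// Hypothetical helper: records how long the first audio chunk takes to arrive.
// `now` is injectable so the timer can be exercised without a real clock.
function createFirstChunkTimer(now = () => Date.now()) {
  let sentAt = null;
  let firstChunkMs = null;
  return {
    // Call right after sending the synthesis request.
    markSent() {
      sentAt = now();
      firstChunkMs = null;
    },
    // Call for every incoming chunk; only the first one after markSent() is timed.
    markChunk() {
      if (sentAt !== null && firstChunkMs === null) {
        firstChunkMs = now() - sentAt;
      }
      return firstChunkMs;
    },
    // Latency of the first chunk in milliseconds, or null if none has arrived yet.
    get firstChunkMs() {
      return firstChunkMs;
    },
  };
}
```

In a WebSocket client, markSent() would go right after ws.send(...) and markChunk() inside ws.onmessage. The measured value includes network round-trip time, so it will vary with the client's location.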
Streaming Architecture
Client -> WebSocket -> Speeko
|
+-> First chunk @ 150ms
+-> Subsequent chunks every 40ms
Sample Code
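const ws = new WebSocket('wss://api.speekoapp.com/v1/tts/stream');
ws.binaryType = 'arraybuffer'; // deliver audio as ArrayBuffer instead of the default Blob

const chunks = []; // received audio chunks, in arrival order

ws.onopen = () => {
  // Send the synthesis request as soon as the socket is open.
  ws.send(JSON.stringify({
    text: "This streams in real-time.",
    voice: "af_heart",
    format: "mp3"
  }));
};

ws.onmessage = (event) => {
  // Assumes every message from the server is a binary audio chunk.
  const chunk = new Uint8Array(event.data);
  chunks.push(chunk); // feeding chunks to the audio element is covered under Chunked Audio Playback
};
Chunked Audio Playback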
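Use MediaSource Extensions (MSE) in browsers. A SourceBuffer accepts only one append at a time, so queue incoming chunks and append the next one when updateend fires. Support for MP3 in MSE varies by browser, so check MediaSource.isTypeSupported('audio/mpeg') before relying on it.
const mediaSource = new MediaSource();
audioElement.src = URL.createObjectURL(mediaSource);
mediaSource.addEventListener('sourceopen', () => {
  const buffer = mediaSource.addSourceBuffer('audio/mpeg');
  const queue = []; // chunks waiting while the SourceBuffer is busy
  const appendNext = () => {
    if (!buffer.updating && queue.length > 0) buffer.appendBuffer(queue.shift());
  };
  buffer.addEventListener('updateend', appendNext);
  ws.binaryType = 'arraybuffer'; // appendBuffer needs an ArrayBuffer, not a Blob
  ws.onmessage = (e) => { queue.push(e.data); appendNext(); };
});
Use Cases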
Voice agents: Pair streaming TTS with streaming LLM output. As the LLM generates tokens, stream them to TTS. Total latency stays under 500ms.
Live translation: Conference interpreting with AI. The speaker's words are captured, translated, synthesized, and delivered, all in under a second.
Interactive fiction: Branching audio narratives where listener choices drive generation.
Accessibility tools: Screen readers that feel responsive, not robotic.
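For the voice-agent case, one common way to pair a streaming LLM with streaming TTS is to buffer tokens until a sentence is complete and send each sentence on for synthesis. The sketch below illustrates that idea only. createSentenceChunker is a hypothetical helper, and whether the Speeko endpoint accepts several text messages on one connection is an assumption to verify against the API documentation.

```javascript
// Hypothetical helper: buffers LLM tokens and emits complete sentences,
// so the TTS engine receives text it can speak with natural phrasing.
function createSentenceChunker(onSentence) {
  let pending = '';
  return {
    // Feed each token as the LLM produces it.
    push(token) {
      pending += token;
      // A sentence is text ending in ., ! or ? followed by whitespace.
      let match;
      while ((match = pending.match(/^[\s\S]*?[.!?]+\s+/)) !== null) {
        const sentence = match[0].trim();
        if (sentence) onSentence(sentence);
        pending = pending.slice(match[0].length);
      }
    },
    // Call when the LLM stream ends to emit whatever text is left.
    flush() {
      const rest = pending.trim();
      pending = '';
      if (rest) onSentence(rest);
    },
  };
}

// Demo with an array standing in for the TTS connection.
const sentences = [];
const chunker = createSentenceChunker((s) => sentences.push(s));
for (const token of ['Hello', ' there.', ' How', ' are', ' you?']) {
  chunker.push(token);
}
chunker.flush();
console.log(sentences); // [ 'Hello there.', 'How are you?' ]
```

In a real agent, onSentence would forward the text to the TTS socket instead of pushing to an array. The punctuation rule is deliberately simple: it leaves decimals such as "3.5" intact but will split after abbreviations like "Dr.", so production code may want a smarter boundary check.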
Caveats
Streaming costs 20% more than batch at Speeko. The infrastructure to serve low-latency chunks is expensive. For pre-recorded content, stick with batch.
Start Building
Get streaming API access and build the next generation of voice products.