Real-time Streaming TTS: Tutorial and Use Cases

Posted on March 31, 2026
By Speeko Team
Tags: streaming, real-time, voice-agents, websocket

Batch TTS is fine for pre-recorded content. Voice agents, interactive narration, and live translation need streaming.

Why Streaming Matters

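With batch TTS, generating 30 seconds of audio takes 3 seconds, and the listener hears nothing until the whole clip is ready. For conversational UX, a 3-second pause is unacceptable. Streaming TTS returns the first audio chunk in 100-300ms, so playback can start while the rest is still being generated.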
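To make the comparison concrete, here is the arithmetic as a small sketch. The function name is illustrative, and the 150 ms streaming figure is simply one point inside the 100-300ms range quoted above.

```javascript
// Real-time factor: seconds of compute per second of audio produced.
function realTimeFactor(generationMs, audioMs) {
  return generationMs / audioMs;
}

const batchWaitMs = 3000;                      // batch: silence until the clip is done
const rtf = realTimeFactor(3000, 30000);       // 0.1
const streamingWaitMs = 150;                   // assumed, within the 100-300ms range
const speedup = batchWaitMs / streamingWaitMs; // first audio arrives 20x sooner
console.log({ batchWaitMs, rtf, streamingWaitMs, speedup });
```

Generation speed is the same in both modes under these numbers. What changes is how long the listener waits before hearing anything.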

Streaming Architecture

Client -> WebSocket -> Speeko
            |
            +-> First chunk @ 150ms
            +-> Subsequent chunks every 40ms
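The timings in the diagram are worth verifying against your own network. One way, sketched below, is to record when each chunk arrives and summarize the result. `chunkTimingStats` is a hypothetical helper, not part of any Speeko SDK.

```javascript
// Summarize chunk arrival times (ms elapsed since the request was sent).
function chunkTimingStats(arrivalsMs) {
  if (arrivalsMs.length === 0) {
    return { firstChunkMs: null, maxGapMs: null, meanGapMs: null };
  }
  const gaps = [];
  for (let i = 1; i < arrivalsMs.length; i++) {
    gaps.push(arrivalsMs[i] - arrivalsMs[i - 1]);
  }
  const meanGapMs = gaps.length
    ? gaps.reduce((sum, g) => sum + g, 0) / gaps.length
    : null;
  const maxGapMs = gaps.length ? Math.max(...gaps) : null;
  return { firstChunkMs: arrivalsMs[0], maxGapMs, meanGapMs };
}

// Arrival times matching the diagram: first chunk at 150 ms, then every 40 ms.
const stats = chunkTimingStats([150, 190, 230, 270]);
console.log(stats);
```

In a browser you could collect the input by storing `performance.now()` when the request is sent and subtracting it from `performance.now()` in each message handler call. A large `maxGapMs` is a sign that playback may stall.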

Sample Code

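const ws = new WebSocket('wss://api.speekoapp.com/v1/tts/stream');
// Receive binary frames as ArrayBuffer; the browser default is Blob,
// which cannot be passed to the Uint8Array constructor.
ws.binaryType = 'arraybuffer';

const chunks = []; // audio chunks in arrival order

ws.onopen = () => {
  ws.send(JSON.stringify({
    text: "This streams in real-time.",
    voice: "af_heart",
    format: "mp3"
  }));
};

ws.onmessage = (event) => {
  // Each binary message carries one chunk of MP3 data.
  // Playing chunks as they arrive is covered under Chunked Audio Playback.
  chunks.push(new Uint8Array(event.data));
};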
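If you also want the complete clip once the stream ends (for replay or caching), the received chunks can be joined into one buffer. This is a minimal sketch; `concatChunks` is a hypothetical helper name.

```javascript
// Join Uint8Array chunks into one contiguous Uint8Array, preserving order.
function concatChunks(chunks) {
  const total = chunks.reduce((n, c) => n + c.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.byteLength;
  }
  return out;
}

const joined = concatChunks([new Uint8Array([1, 2]), new Uint8Array([3])]);
console.log(joined);
```

In a browser, `new Blob([joined], { type: 'audio/mpeg' })` turns the result into something you can hand to `URL.createObjectURL` for replay.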

Chunked Audio Playback

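In browsers, MediaSource Extensions (MSE) let you append chunks to an audio element as they arrive. MP3 support in MSE varies by browser, so check MediaSource.isTypeSupported('audio/mpeg') first: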

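const mediaSource = new MediaSource();
audioElement.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener('sourceopen', () => {
  const buffer = mediaSource.addSourceBuffer('audio/mpeg');
  const queue = []; // chunks waiting while the SourceBuffer is busy
  // appendBuffer() throws if called while a previous append is still running.
  const flush = () => {
    if (queue.length && !buffer.updating) buffer.appendBuffer(queue.shift());
  };
  buffer.addEventListener('updateend', flush);
  ws.binaryType = 'arraybuffer'; // appendBuffer() needs an ArrayBuffer, not a Blob
  ws.onmessage = (e) => { queue.push(e.data); flush(); };
});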
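MSE also needs to be told when the stream is complete, otherwise the audio element never reports that playback has ended. `endOfStream()` throws if the SourceBuffer is still updating, so the call has to wait for an idle moment. The sketch below shows one way to do that. `endStreamWhenIdle` and its optional `hasPending` callback are hypothetical names, and the sketch assumes the server closes the WebSocket when synthesis is finished.

```javascript
// Call endOfStream() once the SourceBuffer is idle and nothing is left to append.
// `mediaSource` and `sourceBuffer` are the MSE objects from the snippet above.
// `hasPending` is optional: return true while your own chunk queue is non-empty.
function endStreamWhenIdle(mediaSource, sourceBuffer, hasPending = () => false) {
  const tryFinish = () => {
    // Not ready yet: check again on the next 'updateend' event.
    if (sourceBuffer.updating || hasPending()) return;
    sourceBuffer.removeEventListener('updateend', tryFinish);
    // endOfStream() is only valid while the MediaSource is open.
    if (mediaSource.readyState === 'open') mediaSource.endOfStream();
  };
  sourceBuffer.addEventListener('updateend', tryFinish);
  tryFinish();
}
```

A typical call site would be `ws.onclose = () => endStreamWhenIdle(mediaSource, buffer);`, passing a `hasPending` callback if you keep a queue of chunks that have not been appended yet.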

Use Cases

Voice agents: Pair streaming TTS with streaming LLM output. As the LLM generates tokens, forward them to TTS instead of waiting for the full response. Time to first audio can stay under 500ms.
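In practice, speech engines generally sound better when given whole sentences rather than single tokens, so a common pattern is to buffer tokens until a sentence boundary and then send the sentence. The following is a minimal sketch of that idea. `createSentenceChunker` is a hypothetical helper, and the boundary rule is deliberately naive (it will split after abbreviations such as "Dr.").

```javascript
// Buffers streamed LLM tokens and emits complete sentences for TTS.
function createSentenceChunker(onSentence) {
  let pending = '';
  return {
    push(token) {
      pending += token;
      // Naive rule: a sentence ends at ., ! or ? followed by whitespace.
      const boundary = /[.!?]+\s+/g;
      let start = 0;
      let match;
      while ((match = boundary.exec(pending)) !== null) {
        const end = match.index + match[0].length;
        onSentence(pending.slice(start, end).trim());
        start = end;
      }
      pending = pending.slice(start);
    },
    // Call when the LLM stream ends to emit whatever text is left.
    flush() {
      const rest = pending.trim();
      pending = '';
      if (rest) onSentence(rest);
    },
  };
}

const sentences = [];
const chunker = createSentenceChunker((s) => sentences.push(s));
for (const token of ['Hello', ' there', '.', ' How', ' are', ' you', '?']) {
  chunker.push(token);
}
chunker.flush();
console.log(sentences);
```

In a real agent, the `onSentence` callback would send each sentence to the TTS stream instead of pushing it into an array.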

Live translation: AI-assisted conference interpreting. The speaker's words are captured, translated, synthesized, and delivered, all in under a second.

Interactive fiction: Branching audio narratives where listener choices drive generation.

Accessibility tools: Screen readers that feel responsive, not robotic.

Caveats

Streaming costs 20% more than batch at Speeko. The infrastructure to serve low-latency chunks is expensive. For pre-recorded content, stick with batch.
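For budgeting, the surcharge is easy to estimate from an existing batch bill. A small sketch follows; the $50.00 workload is a made-up example, not a Speeko price.

```javascript
// Streaming is priced 20% above batch (the figure quoted above).
const STREAMING_SURCHARGE_PERCENT = 20;

// Work in integer cents to avoid floating-point rounding surprises.
function streamingCostCents(batchCostCents) {
  return Math.round(batchCostCents * (100 + STREAMING_SURCHARGE_PERCENT) / 100);
}

// Hypothetical workload that would cost $50.00 in batch mode.
const exampleCents = streamingCostCents(5000);
console.log(exampleCents); // 6000, i.e. $60.00
```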

Start Building

Get streaming API access and build the next generation of voice products.