Kokoro-82M vs ElevenLabs, Google WaveNet, and OpenAI TTS: A Real Comparison
The TTS market has a pricing problem. ElevenLabs, the best-known provider, charges 10x what Kokoro-82M costs through Speeko ($0.30 vs $0.03 per 1K characters), and the quality gap doesn't justify it. Kokoro-82M narrowed that gap to 0.2 MOS points in 2025. Here's how it actually stacks up.
The Models
Kokoro-82M is an open-weight neural TTS model. 82 million parameters, first released in late 2024. Runs in production at Speeko at $0.03/1K characters. The weights are public, so you can self-host if you want, though most developers don't bother.
ElevenLabs v3 is the quality benchmark everyone compares against. Excellent prosody, natural breathing patterns, expressive delivery. Costs $0.30/1K characters via API.
Google WaveNet is the enterprise default. Reliable, consistent, well-supported. Available via Google Cloud TTS at around $0.016/1K characters. No streaming support on standard tier.
OpenAI TTS (tts-1, tts-1-hd) runs at $0.015–$0.030/1K characters. Clean, neutral voice. Limited voice options. No streaming in the standard REST API.
Quality Comparison
MOS (Mean Opinion Score) results, on the standard 1–5 scale, from the Kokoro-82M paper and third-party listening tests:
| Model | MOS | Notes |
|---|---|---|
| ElevenLabs v3 | 4.4 | Best expressiveness, natural prosody |
| Kokoro-82M | 4.2 | Near-ElevenLabs quality, noticeably better than WaveNet |
| OpenAI tts-1-hd | 4.0 | Clean, neutral, limited character |
| Google WaveNet | 3.9 | Consistent but slightly robotic on long text |
| OpenAI tts-1 | 3.7 | Faster, lower quality |
The gap between Kokoro-82M and ElevenLabs v3 is 0.2 MOS points. Perceptible to trained listeners in a side-by-side test. Not perceptible to most end users listening to a single audio file. That's the key finding.
The gap between Kokoro-82M and Google WaveNet (0.3 points) is more consistently noticeable — WaveNet struggles with punctuation-heavy text and longer paragraphs.
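To make the gaps easy to recheck, here is a minimal sketch that recomputes them from the MOS values in the table. The `mos` dictionary and the `mos_gap` helper are illustrative names, not part of any provider's tooling.

```python
# MOS values copied from the table above (1-5 scale).
mos = {
    "ElevenLabs v3": 4.4,
    "Kokoro-82M": 4.2,
    "OpenAI tts-1-hd": 4.0,
    "Google WaveNet": 3.9,
    "OpenAI tts-1": 3.7,
}


def mos_gap(a: str, b: str) -> float:
    """Return how far model `a` scores above model `b`, rounded to one decimal."""
    return round(mos[a] - mos[b], 1)


if __name__ == "__main__":
    baseline = "Kokoro-82M"
    for name in mos:
        if name != baseline:
            # Positive means the model scores above Kokoro-82M, negative below.
            print(f"{name:16s} {mos_gap(name, baseline):+.1f} vs {baseline}")
```

A positive number means the model rates above Kokoro-82M; only ElevenLabs v3 does.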
Latency
For batch generation (the common case for content pipelines):
- All four providers return audio in 0.5–2 seconds for texts under 500 characters
- For 5,000-character inputs, ElevenLabs averages ~3s and Kokoro-82M via Speeko averages ~2.5s
- Streaming latency to first audio chunk: Speeko ~280ms, ElevenLabs ~350ms, WaveNet N/A (no streaming on standard tier), OpenAI N/A (no streaming in the standard REST API)
If you're building a voice agent or real-time application, streaming matters. WaveNet and OpenAI standard tier don't offer it. ElevenLabs and Speeko do.
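If you want to check time-to-first-chunk numbers like these yourself, the measurement is simple: start a timer, iterate over the streamed response, and record when the first chunk arrives. The sketch below assumes your client exposes the audio as an iterable of byte chunks; `fake_stream` is a hypothetical stand-in, not any provider's real API.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_chunk(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Consume a stream of audio chunks.

    Returns (seconds until the first chunk, total seconds, total bytes).
    """
    start = clock()
    first = None
    total_bytes = 0
    for chunk in chunks:
        if first is None:
            first = clock() - start  # latency to first audio
        total_bytes += len(chunk)
    total = clock() - start
    if first is None:
        raise ValueError("stream produced no audio chunks")
    return first, total, total_bytes


def fake_stream(first_delay: float, chunk_delay: float, n_chunks: int) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(first_delay)  # simulated wait before the first chunk
    for i in range(n_chunks):
        if i:
            time.sleep(chunk_delay)  # simulated gap between later chunks
        yield b"\x00" * 1024


if __name__ == "__main__":
    first, total, size = time_to_first_chunk(fake_stream(0.28, 0.02, 10))
    print(f"first chunk after {first * 1000:.0f} ms, {size} bytes in {total:.2f} s")
```

Swap `fake_stream` for the real streamed response and run it a few dozen times; a single request tells you little about typical latency.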
Cost Reality
Let's use a concrete project: a 50-chapter audiobook, each chapter 3,000 words (~18,000 characters).
- Total: 50 × 18,000 = 900,000 characters
- ElevenLabs: 900,000 × $0.30/1K = $270
- Speeko (Kokoro-82M): 900,000 × $0.03/1K = $27
- Google WaveNet: 900,000 × $0.016/1K = $14.40
- OpenAI tts-1-hd: 900,000 × $0.030/1K = $27
WaveNet is cheapest. But if you've heard both, the quality difference between WaveNet and Kokoro-82M is noticeable over an audiobook of this length (150,000 words, roughly 16 to 17 hours of narration at a typical 150 words per minute). Kokoro costs $27 against WaveNet's $14.40; for most content-quality use cases, that $12.60 delta is easy to justify.
ElevenLabs at 10x the price of Kokoro for a 0.2 MOS improvement is harder to justify for anything except high-stakes voice content (brand narrators, premium audiobooks).
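The arithmetic above is easy to rerun for your own project size. A minimal sketch, using the per-1K prices quoted in this article (check current pricing pages before budgeting); `project_cost` is an illustrative helper name:

```python
# USD per 1,000 characters, as quoted in this article.
PRICE_PER_1K = {
    "ElevenLabs v3": 0.30,
    "Speeko (Kokoro-82M)": 0.03,
    "Google WaveNet": 0.016,
    "OpenAI tts-1-hd": 0.030,
}


def project_cost(chapters: int, chars_per_chapter: int, price_per_1k: float) -> float:
    """Total cost in USD for a project, rounded to cents."""
    total_chars = chapters * chars_per_chapter
    return round(total_chars / 1000 * price_per_1k, 2)


# The 50-chapter audiobook from the example: 18,000 characters per chapter.
costs = {name: project_cost(50, 18_000, price) for name, price in PRICE_PER_1K.items()}

if __name__ == "__main__":
    for name, usd in sorted(costs.items(), key=lambda kv: kv[1]):
        print(f"{name:22s} ${usd:>7.2f}")
```

Change the chapter count or characters per chapter to match your own content; the ratios between providers stay the same because pricing here is linear in characters.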
Language Support
- Kokoro-82M / Speeko: 50+ languages
- ElevenLabs v3: 32 languages
- Google WaveNet: 40+ languages, strong regional variant support
- OpenAI TTS: ~57 languages (but voice quality varies significantly by language)
For multilingual products, WaveNet and Speeko cover more ground than ElevenLabs. OpenAI lists the most languages on paper, but the uneven quality makes that number less useful.
When to Use Each
Kokoro-82M (Speeko): Most use cases. High quality, streaming support, reasonable price, good language coverage. Default choice.
ElevenLabs: When you need maximum expressiveness — brand narrators, character voices, premium audiobooks where 0.2 MOS matters. Not for high-volume automation.
Google WaveNet: High-volume enterprise workloads where cost dominates quality requirements. IVR systems, notification audio, low-engagement content.
OpenAI TTS: If you're already in the OpenAI ecosystem and want a simple integration. Limited flexibility, no streaming — but dead simple to call.
The Actual Decision
Pick your model based on volume and quality bar. Under 1M chars/month with quality requirements? Kokoro-82M. Over 10M chars/month with minimal quality requirements? WaveNet. Need maximum expressiveness regardless of cost? ElevenLabs.
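That rule of thumb fits in a few lines of code. This is a sketch of the heuristic as written, not a recommendation engine: `pick_provider` and its parameters are illustrative names, and the treatment of volumes between 1M and 10M characters is an assumption (the article names Kokoro-82M as the default choice, so the sketch falls back to it).

```python
def pick_provider(
    chars_per_month: int,
    needs_max_expressiveness: bool = False,
    quality_matters: bool = True,
) -> str:
    """Apply the article's decision rule to a monthly volume and quality bar."""
    if needs_max_expressiveness:
        # Maximum expressiveness regardless of cost.
        return "ElevenLabs"
    if chars_per_month > 10_000_000 and not quality_matters:
        # High volume, minimal quality requirements: cost dominates.
        return "Google WaveNet"
    # Assumption: everything else, including the 1M-10M range the rule
    # does not spell out, goes to the article's stated default.
    return "Kokoro-82M (Speeko)"


if __name__ == "__main__":
    print(pick_provider(500_000))                             # Kokoro-82M (Speeko)
    print(pick_provider(25_000_000, quality_matters=False))   # Google WaveNet
    print(pick_provider(200_000, needs_max_expressiveness=True))  # ElevenLabs
```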
Start with Speeko's free $5 credit. At $0.03/1K, that is about 167,000 characters, enough to evaluate Kokoro-82M quality yourself before committing.
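For a sense of how far that credit goes, here is a quick sketch using the $0.03/1K price and the 18,000-character chapters from the audiobook example. `chars_for_credit` is an illustrative helper name.

```python
def chars_for_credit(credit_usd: float, price_per_1k: float) -> int:
    """Whole characters a credit buys at a given per-1K-character price."""
    return int(credit_usd / price_per_1k * 1000)


budget_chars = chars_for_credit(5, 0.03)     # the $5 credit at Speeko's quoted rate
sample_chapters = budget_chars // 18_000     # full chapters from the audiobook example

if __name__ == "__main__":
    print(f"{budget_chars:,} characters, or {sample_chapters} full chapters")
```

That works out to nine full chapters of the example audiobook, which is plenty for a side-by-side listening test.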