Kokoro-82M Explained: The Open-Weight TTS Revolution
Kokoro-82M showed that bigger isn't always better. At just 82 million parameters, its output quality is competitive with models roughly ten times its size.
The Technical Breakthrough
Traditional neural TTS models trend toward billions of parameters, which brings massive compute bills and slow inference. Kokoro-82M took a different path:
- Efficient architecture — StyleTTS2-inspired design with aggressive pruning
- High-quality training data — Curated, not crawled
- Smart tokenization — Phoneme-level input reduces the learning burden
Result: studio-quality voices with 50ms per-character inference on commodity GPUs.
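To illustrate the phoneme-level input mentioned above, here is a minimal sketch of turning text into phoneme symbols and then into integer token IDs. The tiny lexicon, the symbol set, and the function names are hypothetical and chosen for illustration; they are not Kokoro-82M's actual vocabulary or front end, and a real system would use a dedicated grapheme-to-phoneme step instead of a lookup table.

```python
# Toy pronunciation lexicon (hypothetical, for illustration only).
LEXICON = {
    "hello": ["h", "ə", "l", "oʊ"],
    "world": ["w", "ɜː", "l", "d"],
}

# Build a phoneme vocabulary; ID 0 is reserved for padding, ID 1 for word breaks.
PAD, SPACE = "<pad>", " "
SYMBOLS = [PAD, SPACE] + sorted({p for phones in LEXICON.values() for p in phones})
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}


def text_to_phonemes(text: str) -> list[str]:
    """Look up each word in the toy lexicon, separating words with a break symbol."""
    phonemes: list[str] = []
    for i, word in enumerate(text.lower().split()):
        if i > 0:
            phonemes.append(SPACE)
        phonemes.extend(LEXICON[word])  # raises KeyError for unknown words
    return phonemes


def phonemes_to_ids(phonemes: list[str]) -> list[int]:
    """Map phoneme symbols to the integer IDs a model would consume."""
    return [SYMBOL_TO_ID[p] for p in phonemes]


phonemes = text_to_phonemes("Hello world")
token_ids = phonemes_to_ids(phonemes)
print(phonemes)
print(token_ids)
```

The point of the sketch is that the model sees a small, closed set of sound symbols instead of raw spelling, so it does not have to learn pronunciation rules on top of learning to speak.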
What 82M Parameters Means
- Runs on a single consumer GPU (8GB VRAM)
- Generation at about 10x real-time speed (roughly 10 seconds of audio per second of compute)
- Lower inference cost → lower API prices
- Deployable on edge devices with quantization
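The hardware claims above follow from simple arithmetic on the parameter count. As a rough sketch, the weight footprint is the number of parameters times the bytes used per parameter. The function name and the precision table are illustrative, and the figures cover weights only, so activations, audio buffers, and framework overhead come on top.

```python
PARAMS = 82_000_000  # parameter count stated for Kokoro-82M

# Bytes per parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_megabytes(num_params: int, precision: str) -> float:
    """Return the approximate size of the weights in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_megabytes(PARAMS, precision):.0f} MB")
```

At 32-bit precision the weights come to about 328 MB, and 8-bit quantization brings that down to about 82 MB, which is why the model leaves plenty of headroom on an 8GB card and is a plausible candidate for edge devices.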
Voice Quality
Kokoro-82M produces audio at 24kHz with natural prosody. The model understands:
- Punctuation-driven pacing (periods = pauses, commas = brief breaths)
- Emphasis from italic/bold markup
- Contextual intonation (questions rise, statements fall)
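Because the output is sampled at 24kHz, a clip's length and the synthesis speed relative to real time can both be derived from the sample count. The following is a minimal sketch; the function names and the example numbers are assumptions for illustration, not measurements of Kokoro-82M.

```python
SAMPLE_RATE_HZ = 24_000  # output sample rate stated above


def audio_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Duration of a mono clip in seconds."""
    return num_samples / sample_rate


def realtime_speedup(num_samples: int, synthesis_seconds: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return audio_seconds(num_samples) / synthesis_seconds


# Hypothetical example: 240,000 samples synthesized in 1.0 s of wall-clock time.
samples = 240_000
print(audio_seconds(samples))          # 10.0
print(realtime_speedup(samples, 1.0))  # 10.0
```

A speedup of 10 corresponds to the 10x real-time figure quoted earlier. In practice, synthesis_seconds would come from timing the model call, for example with time.perf_counter().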
Supported Voices
The base model ships with 9 voices spanning American and British English, with community contributions adding Spanish, French, Japanese, Chinese, and Hindi voices.
Speeko extends Kokoro-82M with fine-tuned voices for 50+ languages.
Try It Yourself
Every TTS request on Speeko runs through Kokoro-82M. Start with $5 free and hear the difference.