Building a Meditation App with TTS API: Voice Selection, Pacing, and Cost
Calm launched in 2012 with a human narrator and a recording studio. You don't need either. A TTS API with the right voice and pacing configuration produces guided meditation audio that's indistinguishable from human recording for most listeners.
Here's the technical side.
What Makes a Good Meditation Voice
Three things matter: pitch, pace, and breathiness. High-pitched voices create alertness — the opposite of what meditation needs. Fast delivery breaks the calming effect. Dry, sharp consonants read as clinical.
When evaluating voices for meditation content, test with this sentence:
"There's nowhere you need to be right now. Just breathe."
A good meditation voice delivers this slowly, with the pitch staying flat or dropping slightly at the end. If the voice sounds cheerful or newscaster-neutral, it's wrong for this use case.
Among neural TTS voices, low-register female voices tend to perform best for meditation in user testing. The Kokoro-82M model voices used in Speeko's API include several that score well on "calming" perception tests. Start with en-US-neural-3 and en-GB-neural-2; both have a lower fundamental frequency than the default options.
SSML Configuration for Meditation
Standard TTS output is too fast and too flat for meditation. You need three adjustments:
Rate: 80–85% of normal. This is the most important setting. rate="80%" stretches each sentence to 1/0.8 = 1.25 times its normal duration, about 25% more time, without making the voice sound unnatural.
Pitch: 1 to 2 semitones below baseline (pitch="-1st" to pitch="-2st"). Subtle. The goal is "calm presence," not "robot dungeon narrator."
Breaks: Long and deliberate. Meditation timing lives in the silence between instructions.
A full template:
<speak>
  <prosody rate="82%" pitch="-1st" volume="soft">
    Find a comfortable position.
    <break time="3s"/>
    Close your eyes if that feels right.
    <break time="2s"/>
    Take a slow breath in.
    <break time="4s"/>
    And let it go.
    <break time="5s"/>
    Your only job right now is to breathe.
    <break time="6s"/>
  </prosody>
</speak>
The pauses feel long when you read the script. They feel right when you're lying down with eyes closed. Always evaluate meditation audio with eyes closed, in a quiet room, not by reading the transcript.
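For longer scripts, the same template can be assembled from a list of (sentence, pause) pairs instead of being written by hand. The sketch below is one way to do that. `build_ssml` and `estimate_seconds` are hypothetical helper names, and the 150 words-per-minute baseline is an assumed speaking rate, not a measured property of any particular voice.

```python
from xml.sax.saxutils import escape


def build_ssml(cues, rate="82%", pitch="-1st", volume="soft"):
    """Wrap (sentence, pause_seconds) pairs in the speak/prosody template."""
    lines = [
        "<speak>",
        f'<prosody rate="{rate}" pitch="{pitch}" volume="{volume}">',
    ]
    for text, pause_s in cues:
        lines.append(escape(text))  # escape &, <, > so the XML stays valid
        lines.append(f'<break time="{pause_s}s"/>')
    lines += ["</prosody>", "</speak>"]
    return "\n".join(lines)


def estimate_seconds(cues, words_per_minute=150, rate=0.82):
    """Rough length: speech time stretched by 1/rate, plus all pauses.

    words_per_minute is an assumed baseline, so treat the result as a
    planning estimate and check it against the generated audio.
    """
    words = sum(len(text.split()) for text, _ in cues)
    spoken = words / words_per_minute * 60 / rate
    silence = sum(pause_s for _, pause_s in cues)
    return spoken + silence


cues = [
    ("Find a comfortable position.", 3),
    ("Close your eyes if that feels right.", 2),
    ("Take a slow breath in.", 4),
    ("And let it go.", 5),
    ("Your only job right now is to breathe.", 6),
]
ssml = build_ssml(cues)
total_seconds = estimate_seconds(cues)
```

Under these assumptions the five cues come to roughly 34 seconds, 20 of which are silence. That ratio is typical: most of a session's length is set by the breaks, not the words.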
Session Types and Their Audio Needs
Most meditation apps offer four session types, each with slightly different audio requirements:
Breathing exercises (4-8 minutes): Precise timing matters. SSML <break> tags need to match the inhale/exhale cycle exactly. For box breathing (4-4-4-4): 4-second breaks between each cue. Test this — the numbers in the script and the actual pause duration need to sync.
Body scan (15-30 minutes): Slower pace, longer pauses between body regions. 6–8 second breaks between major sections. Voice should be consistently low-energy throughout.
Sleep meditations (20-45 minutes): Pace slows progressively. Hard to do with static SSML; consider generating in 5-minute segments with incrementally slower rate settings, then concatenating with ffmpeg.
Visualization (10-20 minutes): Normal meditation pace, but voice needs more warmth than a body scan. Some slight rate variation keeps it from feeling robotic.
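Two of these session types lend themselves to generated scripts: box breathing, where every cue is followed by the same fixed break, and sleep sessions, where each segment gets a slightly slower rate. The following is a minimal sketch under stated assumptions. The cue wording, the 85% to 73% rate range, and the segment filenames are all illustrative and not taken from any real library.

```python
def box_breathing_cues(cycles, count_s=4):
    """Return (cue, pause_seconds) pairs for box breathing.

    The break starts after the cue has been spoken, so each phase lasts
    slightly longer than count_s. Shorten the break if that drift matters.
    """
    phases = ["Breathe in.", "Hold.", "Breathe out.", "Hold."]
    return [(phase, count_s) for _ in range(cycles) for phase in phases]


def sleep_segment_rates(n_segments, start_pct=85, end_pct=73):
    """Step the prosody rate linearly from start_pct down to end_pct."""
    if n_segments == 1:
        return [f"{start_pct}%"]
    step = (end_pct - start_pct) / (n_segments - 1)
    return [f"{round(start_pct + i * step)}%" for i in range(n_segments)]


def concat_list(paths):
    """Build the text of a list file for ffmpeg's concat demuxer."""
    return "".join(f"file '{p}'\n" for p in paths)


rates = sleep_segment_rates(4)
segment_files = [f"sleep-part{i + 1}.mp3" for i in range(len(rates))]
list_text = concat_list(segment_files)
# Save list_text as segments.txt, then join the segments without re-encoding:
#   ffmpeg -f concat -safe 0 -i segments.txt -c copy sleep-full.mp3
```

Each rate in `rates` would go into the `rate` attribute of that segment's `<prosody>` tag. Stream copy (`-c copy`) only works cleanly when every segment was generated with the same format and encoding settings.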
Library Architecture
A typical meditation app ships 50–200 sessions at launch. Generate all of them upfront and store on CDN — don't call the TTS API at playback time.
Directory structure:
/audio/
  /en/
    /breathing/
      box-breathing-5min.mp3
      478-breathing-8min.mp3
    /body-scan/
      beginner-15min.mp3
      deep-30min.mp3
    /sleep/
      ...
Generation script in Python:
import requests
from pathlib import Path

API_KEY = "your-speeko-key"

# Each session pairs an SSML script with the path its MP3 is written to.
sessions = [
    {
        "id": "box-breathing-5min",
        "ssml": """<speak>
<prosody rate="82%" pitch="-1st" volume="soft">
...your full script...
</prosody>
</speak>""",
        "output": "audio/en/breathing/box-breathing-5min.mp3"
    },
    # ... more sessions
]

for session in sessions:
    response = requests.post(
        "https://api.speekoapp.com/v1/tts",
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        json={
            "text": session["ssml"],
            "input_type": "ssml",
            "voice": "en-GB-neural-2",
            "format": "mp3"
        },
        timeout=120,
    )
    # Stop on an HTTP error instead of saving an error body as an .mp3
    response.raise_for_status()
    out_path = Path(session["output"])
    # Create the target directory on first use, then write the audio bytes
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(response.content)
    print(f"Generated: {session['id']}")
Run once at content creation time. Serve the static MP3s from your CDN.
Audio Post-Processing
Two optional processing steps that meaningfully improve meditation audio:
Normalization: Meditation audio should be quieter than typical media. Target -16 LUFS for mobile playback (Spotify uses -14 LUFS; meditation should be calmer). Use ffmpeg:
ffmpeg -i input.mp3 -af "loudnorm=I=-16:TP=-1.5:LRA=11" output.mp3
Subtle ambient bed: A light pink noise or gentle nature sound mixed under the voice at -20 dB relative to speech adds warmth. It also masks the "too clean" quality that sometimes makes TTS feel synthetic. Keep it subtle: it should be imperceptible as a separate sound.
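One way to mix the bed under the voice is ffmpeg's `volume` and `amix` filters. The sketch below only builds the command, so it can be inspected before running. The function name and file paths are hypothetical, and `normalize=0` assumes FFmpeg 4.4 or newer. The gain is applied to the bed's own level, so it only approximates "-20 dB relative to speech" if the bed and voice start at similar loudness.

```python
def ambient_mix_command(voice_path, bed_path, out_path, bed_gain_db=-20):
    """Build an ffmpeg argv that loops a bed track under the voice."""
    filter_graph = (
        # Attenuate the bed (second input) by the requested gain
        f"[1:a]volume={bed_gain_db}dB[bed];"
        # Sum voice and bed; stop when the voice ends; keep levels unscaled
        "[0:a][bed]amix=inputs=2:duration=first:normalize=0[out]"
    )
    return [
        "ffmpeg",
        "-i", voice_path,
        "-stream_loop", "-1",  # loop the bed so it outlasts the voice
        "-i", bed_path,
        "-filter_complex", filter_graph,
        "-map", "[out]",
        out_path,
    ]


cmd = ambient_mix_command("voice.mp3", "pink-noise.mp3", "mixed.mp3")
# To run it: subprocess.run(cmd, check=True)
```

Run the loudness normalization after mixing, not before, so the -16 LUFS target applies to what the listener actually hears.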
Cost for a Typical App
A meditation app launch library: 100 sessions, average 20 minutes each.
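At 150 words per minute and 5.5 characters per word, 20 minutes = ~3,000 words = ~16,500 characters per session. This assumes continuous speech; a script with long pauses contains fewer words, so treat the figure as an upper bound.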
100 sessions × 16,500 characters = 1,650,000 characters.
At $0.03 per 1,000 characters, that's $49.50 total to generate the entire launch library.
Add 20 new sessions per month: ~330,000 characters = $9.90/month.
For a subscription app generating $10–20k MRR, that's a rounding error in infrastructure cost.
Speeko's free $5 credit covers 167,000 characters — enough to generate 10 full-length meditation sessions and evaluate voice quality before committing to a library.
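The arithmetic above is easy to rerun for a different library size or price. A small sketch, using the article's own figures as defaults; the function names are hypothetical, and the price should be checked against the provider's current rate.

```python
PRICE_PER_1K_CHARS = 0.03  # USD per 1,000 characters, the rate quoted above
WORDS_PER_MINUTE = 150     # assumes continuous speech, so an upper bound
CHARS_PER_WORD = 5.5


def session_chars(minutes):
    """Estimated characters in a session of the given length."""
    return round(minutes * WORDS_PER_MINUTE * CHARS_PER_WORD)


def cost_usd(chars):
    """Generation cost in USD, rounded to cents."""
    return round(chars / 1000 * PRICE_PER_1K_CHARS, 2)


launch_chars = 100 * session_chars(20)   # 100 sessions of 20 minutes
monthly_chars = 20 * session_chars(20)   # 20 new sessions per month
credit_chars = int(5 / PRICE_PER_1K_CHARS * 1000)  # characters in $5 credit
sessions_on_credit = credit_chars // session_chars(20)

print(f"Launch library: {launch_chars:,} chars, ${cost_usd(launch_chars):.2f}")
print(f"Monthly additions: {monthly_chars:,} chars, ${cost_usd(monthly_chars):.2f}")
print(f"Free credit covers {sessions_on_credit} full-length sessions")
```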