Building a Meditation App with TTS API: Voice Selection, Pacing, and Cost

Calm launched in 2012 with a human narrator and a recording studio. You don't need either. A TTS API with the right voice and pacing configuration can produce guided meditation audio that most listeners will not distinguish from a human recording.

Here's the technical side.

What Makes a Good Meditation Voice

Three things matter: pitch, pace, and breathiness. High-pitched voices create alertness — the opposite of what meditation needs. Fast delivery breaks the calming effect. Dry, sharp consonants read as clinical.

When evaluating voices for meditation content, test with this sentence:

"There's nowhere you need to be right now. Just breathe."

A good meditation voice delivers this slowly, with the pitch staying flat or dropping slightly at the end. If the voice sounds cheerful or newscaster-neutral, it's wrong for this use case.

Among neural TTS voices, low-register female voices tend to perform best for meditation in user testing. The Kokoro-82M model voices used in Speeko's API include several that score well on "calming" perception tests. Start with en-US-neural-3 and en-GB-neural-2; both have a lower fundamental frequency than the default options.
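
One way to audition candidates is to render the test sentence once per voice and compare the files by ear. The sketch below only builds the request bodies. It assumes the same payload fields that the generation script in Library Architecture sends (text, input_type, voice, format); the helper name audition_payloads is made up for illustration.

```python
from xml.sax.saxutils import escape

TEST_SENTENCE = "There's nowhere you need to be right now. Just breathe."
CANDIDATE_VOICES = ["en-US-neural-3", "en-GB-neural-2"]


def audition_payloads(voices=CANDIDATE_VOICES, sentence=TEST_SENTENCE):
    """Build one request body per voice for the same test sentence.

    No prosody is applied, so each voice is heard at its default
    pitch and pace.
    """
    ssml = f"<speak>{escape(sentence)}</speak>"
    return {
        voice: {
            "text": ssml,
            "input_type": "ssml",  # field names assumed from the generation script
            "voice": voice,
            "format": "mp3",
        }
        for voice in voices
    }


payloads = audition_payloads()
for voice in payloads:
    print(voice, "->", f"audition-{voice}.mp3")
```

Each payload can then be posted to the TTS endpoint and saved as one file per voice for a side-by-side listen.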

SSML Configuration for Meditation

Standard TTS output is too fast and too flat for meditation. You need three adjustments:

Rate: 80–85% of normal. This is the most important setting. rate="80%" stretches each sentence to 1/0.8 = 1.25 times its normal duration, about 25% longer, without making the voice sound unnatural.

Pitch: 1 to 2 semitones below baseline (pitch="-1st" to pitch="-2st"). Subtle. The goal is "calm presence," not "robot dungeon narrator."

Breaks: Long and deliberate. Meditation timing lives in the silence between instructions.
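
How the rate setting affects session length is simple arithmetic. If the engine treats rate as a linear speed multiplier (an assumption; check your provider's behavior), speech at rate r% takes 100/r times as long. A small sketch with a hypothetical helper name:

```python
def stretched_seconds(normal_seconds, rate_percent):
    """Duration of speech after slowing it to rate_percent of normal speed.

    Assumes the TTS engine applies rate as a linear speed multiplier.
    """
    return normal_seconds * 100 / rate_percent


# A sentence that takes 10 s at normal speed, slowed to rate="80%"
print(stretched_seconds(10, 80))  # 12.5
```

Breaks are not scaled by the rate setting in this model, so total session length is the stretched speech time plus the sum of the pauses.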

A full template:

<speak>
  <prosody rate="82%" pitch="-1st" volume="soft">
    Find a comfortable position.
    <break time="3s"/>
    Close your eyes if that feels right.
    <break time="2s"/>
    Take a slow breath in.
    <break time="4s"/>
    And let it go.
    <break time="5s"/>
    Your only job right now is to breathe.
    <break time="6s"/>
  </prosody>
</speak>

The pauses feel long when you read the script. They feel right when you're lying down with eyes closed. Always evaluate meditation audio with eyes closed, in a quiet room — not by reading the transcript.
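
Hand-writing SSML for dozens of sessions gets error-prone, so it can help to keep scripts as plain (text, pause) pairs and generate the markup. The following is a minimal sketch, not a required approach; build_meditation_ssml and its parameters are illustrative names, and the defaults mirror the template above.

```python
from xml.sax.saxutils import escape


def build_meditation_ssml(cues, rate="82%", pitch="-1st", volume="soft"):
    """Turn (text, pause_seconds) pairs into one SSML document.

    Each cue is followed by a <break> of the given length, and all
    cues share a single <prosody> wrapper.
    """
    lines = [
        "<speak>",
        f'  <prosody rate="{rate}" pitch="{pitch}" volume="{volume}">',
    ]
    for text, pause_seconds in cues:
        lines.append(f"    {escape(text)}")  # escape &, < and > in script text
        lines.append(f'    <break time="{pause_seconds}s"/>')
    lines += ["  </prosody>", "</speak>"]
    return "\n".join(lines)


cues = [
    ("Find a comfortable position.", 3),
    ("Close your eyes if that feels right.", 2),
    ("Take a slow breath in.", 4),
    ("And let it go.", 5),
    ("Your only job right now is to breathe.", 6),
]
print(build_meditation_ssml(cues))
```

Keeping the pauses as data also makes it easy to lengthen or shorten every break in a session after a listening test.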

Session Types and Their Audio Needs

Most meditation apps offer four session types, each with slightly different audio requirements:

Breathing exercises (4–8 minutes): Precise timing matters. SSML <break> tags need to match the inhale/exhale cycle exactly. For box breathing (4-4-4-4), start with 4-second breaks between cues. The spoken cue itself also takes time, so measure the rendered audio and shorten the breaks if a phase runs longer than the count in the script.

Body scan (15-30 minutes): Slower pace, longer pauses between body regions. 6–8 second breaks between major sections. Voice should be consistently low-energy throughout.

Sleep meditations (20-45 minutes): Pace slows progressively. Hard to do with static SSML; consider generating in 5-minute segments with incrementally slower rate settings, then concatenating with ffmpeg.

Visualization (10-20 minutes): Normal meditation pace, but voice needs more warmth than a body scan. Some slight rate variation keeps it from feeling robotic.
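
Two of these session types lend themselves to generated SSML and settings. The sketch below shows one possible shape: a box-breathing generator that repeats four cues with equal breaks, and a linear rate schedule for sleep segments plus the list file that ffmpeg's concat demuxer expects. The function names, the cue wording, and the 84% to 72% range are assumptions for illustration, not values from any provider.

```python
def box_breathing_ssml(cycles, phase_seconds=4, rate="82%"):
    """SSML for box breathing: four cues per cycle, equal breaks after each."""
    cues = ["Breathe in.", "Hold.", "Breathe out.", "Hold."]
    lines = ["<speak>", f'  <prosody rate="{rate}">']
    for _ in range(cycles):
        for cue in cues:
            lines.append(f'    {cue} <break time="{phase_seconds}s"/>')
    lines += ["  </prosody>", "</speak>"]
    return "\n".join(lines)


def segment_rates(n_segments, start=84, end=72):
    """Linearly decreasing rate percentages, one per sleep segment."""
    if n_segments == 1:
        return [start]
    step = (end - start) / (n_segments - 1)
    return [round(start + step * i) for i in range(n_segments)]


def concat_list(paths):
    """Contents of the list file read by ffmpeg's concat demuxer."""
    return "".join(f"file '{path}'\n" for path in paths)


# A 20-minute sleep session as four 5-minute segments
rates = segment_rates(4)
files = [f"sleep-seg-{i:02d}.mp3" for i in range(len(rates))]
print(rates)
print(concat_list(files))
```

After rendering each segment with its own rate, the list can be saved as segments.txt and joined with ffmpeg -f concat -safe 0 -i segments.txt -c copy sleep-20min.mp3. Stream copy should work when every segment comes from the same voice and output format.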

Library Architecture

A typical meditation app ships 50–200 sessions at launch. Generate all of them upfront and store on CDN — don't call the TTS API at playback time.

Directory structure:

/audio/
  /en/
    /breathing/
      box-breathing-5min.mp3
      478-breathing-8min.mp3
    /body-scan/
      beginner-15min.mp3
      deep-30min.mp3
    /sleep/
      ...

Generation script in Python:

import requests
from pathlib import Path

API_KEY = "your-speeko-key"
sessions = [
    {
        "id": "box-breathing-5min",
        "ssml": """<speak>
          <prosody rate="82%" pitch="-1st" volume="soft">
            ...your full script...
          </prosody>
        </speak>""",
        "output": "audio/en/breathing/box-breathing-5min.mp3"
    },
    # ... more sessions
]

for session in sessions:
    response = requests.post(
        "https://api.speekoapp.com/v1/tts",
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        json={
            "text": session["ssml"],
            "input_type": "ssml",
            "voice": "en-GB-neural-2",
            "format": "mp3"
        },
        timeout=300,  # long scripts can take a while to render
    )
    # Stop on HTTP errors so an error body is never saved as an .mp3
    response.raise_for_status()
    output = Path(session["output"])
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_bytes(response.content)
    print(f"Generated: {session['id']}")

Run once at content creation time. Serve the static MP3s from your CDN.
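
If the script is rerun after new sessions are added, it may be worth skipping files that already exist so earlier sessions are not regenerated and billed again. A minimal sketch: pending_sessions is a hypothetical helper, and treating any non-empty file as complete is an assumption that a stricter pipeline might replace with a checksum or manifest.

```python
from pathlib import Path


def pending_sessions(sessions, root="."):
    """Return the sessions whose output file is missing or empty."""
    todo = []
    for session in sessions:
        output = Path(root) / session["output"]
        if not output.exists() or output.stat().st_size == 0:
            todo.append(session)
    return todo


sessions = [
    {"id": "box-breathing-5min", "output": "audio/en/breathing/box-breathing-5min.mp3"},
    {"id": "beginner-15min", "output": "audio/en/body-scan/beginner-15min.mp3"},
]
for session in pending_sessions(sessions):
    print("Needs generating:", session["id"])
```

Looping over pending_sessions(sessions) instead of sessions makes the generation script safe to run repeatedly.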

Audio Post-Processing

Two optional processing steps that meaningfully improve meditation audio:

Normalization: Meditation audio should be quieter than typical media. Target -16 LUFS for mobile playback (Spotify uses -14 LUFS; meditation should be calmer). Use ffmpeg:

ffmpeg -i input.mp3 -af "loudnorm=I=-16:TP=-1.5:LRA=11" output.mp3

Subtle ambient bed: A light pink noise or gentle nature sound mixed under the voice at -20 dB relative to speech adds warmth. It also masks the "too clean" quality that sometimes makes TTS feel synthetic. Keep it subtle: it should be imperceptible as a separate sound.
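
Both steps can be scripted across a whole library. The sketch below builds the ffmpeg argument lists in Python; the function names and file names are placeholders. It assumes ffmpeg 4.4 or later (for the amix normalize option) and that the bed and the voice start at similar loudness, so attenuating the bed by 20 dB puts it roughly 20 dB under the speech.

```python
import subprocess


def ambient_mix_cmd(voice, bed, dst, bed_db=-20):
    """ffmpeg command that loops a bed track under the voice.

    The bed is attenuated by bed_db and the output ends with the voice.
    """
    graph = (
        f"[1:a]volume={bed_db}dB[bed];"
        "[0:a][bed]amix=inputs=2:duration=first:"
        "dropout_transition=0:normalize=0[out]"
    )
    return [
        "ffmpeg", "-y", "-i", voice,
        "-stream_loop", "-1", "-i", bed,  # loop the bed indefinitely
        "-filter_complex", graph, "-map", "[out]", dst,
    ]


def loudnorm_cmd(src, dst, lufs=-16, true_peak=-1.5, lra=11):
    """ffmpeg command for loudness normalization with the loudnorm filter."""
    audio_filter = f"loudnorm=I={lufs}:TP={true_peak}:LRA={lra}"
    return ["ffmpeg", "-y", "-i", src, "-af", audio_filter, dst]


def run(cmd):
    """Run one ffmpeg command and raise if it fails."""
    subprocess.run(cmd, check=True)


print(" ".join(ambient_mix_cmd("voice.mp3", "bed.wav", "mixed.mp3")))
print(" ".join(loudnorm_cmd("mixed.mp3", "final.mp3")))
```

Mixing first and normalizing second means the loudness target applies to the finished file rather than to the dry voice.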

Cost for a Typical App

A meditation app launch library: 100 sessions, average 20 minutes each.

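At 150 words per minute and 5.5 characters per word, 20 minutes = ~3,000 words = ~16,500 characters per session. This is an upper bound: a meditation script is mostly pauses, so the real character count for a 20-minute session will be lower.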

100 sessions × 16,500 characters = 1,650,000 characters.

At $0.03 per 1,000 characters, that comes to $49.50 to generate the entire launch library.

Add 20 new sessions per month: ~330,000 characters = $9.90/month.

For a subscription app generating $10–20k MRR, that's a rounding error in infrastructure cost.

Speeko's free $5 credit covers 167,000 characters — enough to generate 10 full-length meditation sessions and evaluate voice quality before committing to a library.
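
The same arithmetic is easy to rerun for a different library size or price. A small sketch, using the figures above as defaults; the function names are illustrative, and the rate of $0.03 per 1,000 characters should be checked against current pricing.

```python
def library_cost(sessions, minutes_per_session, words_per_minute=150,
                 chars_per_word=5.5, usd_per_1k_chars=0.03):
    """Return (total_characters, cost_in_usd) for a batch of sessions.

    Treats the whole runtime as speech, so it overestimates scripts
    that are mostly pauses.
    """
    chars = round(sessions * minutes_per_session * words_per_minute * chars_per_word)
    return chars, round(chars / 1000 * usd_per_1k_chars, 2)


def chars_for_credit(credit_usd, usd_per_1k_chars=0.03):
    """How many characters a given credit buys."""
    return round(credit_usd / usd_per_1k_chars * 1000)


print(library_cost(100, 20))  # launch library: 1,650,000 characters, about $49.50
print(library_cost(20, 20))   # monthly additions: 330,000 characters, about $9.90
print(chars_for_credit(5))    # roughly 167,000 characters
```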
