Controlling Emotion and Style in TTS API Output: A Practical Guide

Posted on May 1, 2026
By Speeko Team
tts-api, ssml, voice-style, prosody, emotion

Default TTS voices are neutral. That's the right default — neutral works for most things. But "neutral" isn't right for a meditation guide, a hype video, or a customer service bot that needs to sound apologetic when something goes wrong.

Here's how to move past default.

Two Control Layers

You have two levers: SSML (which most major TTS APIs support, though often only a subset of the tags) and style parameters (which some APIs expose natively). They work at different levels.

SSML controls the acoustics — speed, pitch, volume, pauses. Style parameters tell the model which emotional mode to render in. SSML is more portable; style parameters are more powerful where available.

SSML Prosody: The Baseline Tool

The <prosody> tag adjusts rate, pitch, and volume.

<speak>
  <prosody rate="slow" pitch="low" volume="soft">
    Breathe in slowly. Let your shoulders drop.
  </prosody>
</speak>

The three attributes:

  • rate: x-slow, slow, medium, fast, x-fast, or a percentage like 80%
  • pitch: x-low, low, medium, high, x-high, or semitone offsets like +2st
  • volume: silent, x-soft, soft, medium, loud, x-loud, or dB adjustments like +3dB

For meditation audio, rate="slow" and pitch="low" combined with volume="soft" gets you most of the way to a calming delivery without touching style parameters.

For energetic marketing narration:

<speak>
  <prosody rate="110%" pitch="+1st" volume="loud">
    Three days only. The deal that doesn't come back.
  </prosody>
</speak>
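If you generate SSML from application code rather than writing it by hand, a small helper keeps the attributes consistent and escapes characters like & that would otherwise break the markup. The sketch below is one way to do it; the function name prosody is invented for this example and is not part of any TTS SDK.

```python
from xml.sax.saxutils import escape, quoteattr

def prosody(text, rate=None, pitch=None, volume=None):
    """Wrap text in <speak><prosody ...>, escaping XML special characters.

    Hypothetical helper for illustration; attributes left as None are omitted.
    """
    attrs = {"rate": rate, "pitch": pitch, "volume": volume}
    attr_str = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in attrs.items()
        if value is not None
    )
    return f"<speak><prosody{attr_str}>{escape(text)}</prosody></speak>"

# The two examples from this section, built programmatically.
calm = prosody("Breathe in slowly. Let your shoulders drop.",
               rate="slow", pitch="low", volume="soft")
hype = prosody("Three days only. The deal that doesn't come back.",
               rate="110%", pitch="+1st", volume="loud")
```

Escaping matters most when the script text comes from users or a CMS, where a stray ampersand is easy to miss.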

Pauses Are Underrated

<break> is the most useful SSML tag for emotional pacing. Silence creates weight. Rushed speech creates urgency.

<speak>
  We have a problem with your order.
  <break time="600ms"/>
  But we've already fixed it.
  <break time="300ms"/>
  Your replacement ships today.
</speak>

That pause before "But we've already fixed it" turns a stressful sentence into a reassuring one. Without it, the whole thing reads as bad news.

For meditation specifically, pauses do more work than any prosody adjustment:

<speak>
  Close your eyes.
  <break time="2s"/>
  Feel the weight of your body against the floor.
  <break time="3s"/>
  There's nowhere you need to be right now.
  <break time="4s"/>
</speak>
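When you are tuning pause lengths, it can help to keep the script as data and generate the <break> tags from it, so changing a pause is a one-number edit. A minimal sketch, assuming pauses are given in milliseconds; with_pauses is a name made up for this example.

```python
from xml.sax.saxutils import escape

def with_pauses(lines):
    """Build SSML from (text, pause_ms) pairs.

    Hypothetical helper: a pause_ms of 0 means no break after that line.
    """
    parts = []
    for text, pause_ms in lines:
        parts.append(escape(text))
        if pause_ms > 0:
            parts.append(f'<break time="{pause_ms}ms"/>')
    return "<speak>" + " ".join(parts) + "</speak>"

# The order-problem example from above, expressed as data.
apology = with_pauses([
    ("We have a problem with your order.", 600),
    ("But we've already fixed it.", 300),
    ("Your replacement ships today.", 0),
])
```

Trying a longer first pause then means changing 600 to 800 and regenerating, without touching the markup.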

Emphasis

<emphasis> increases the prominence of a word — useful for marketing copy where a specific word needs to land.

<speak>
  This is the <emphasis level="strong">only</emphasis> plan that includes unlimited voices.
</speak>

Use it sparingly. One or two per paragraph maximum. More than that and everything loses its relative weight.
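If scripts are written by several people, a quick check can flag paragraphs that overuse emphasis before you spend characters generating them. This is a rough sketch: the limit of two comes from the rule of thumb above, and the function names are invented for the example.

```python
import re

def emphasis_count(ssml):
    """Count opening <emphasis> tags in an SSML fragment."""
    return len(re.findall(r"<emphasis\b", ssml))

def overused_paragraphs(paragraphs, limit=2):
    """Return indexes of paragraphs with more than `limit` emphasis tags."""
    return [i for i, p in enumerate(paragraphs) if emphasis_count(p) > limit]

paragraphs = [
    'This is the <emphasis level="strong">only</emphasis> plan that includes unlimited voices.',
    '<emphasis>Every</emphasis> <emphasis>single</emphasis> <emphasis>word</emphasis> matters.',
]
flagged = overused_paragraphs(paragraphs)  # indexes of paragraphs worth revisiting
```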

Style Parameters (Where Available)

Some TTS APIs let you pass a style or speaking_style parameter alongside the text. Common styles:

  • cheerful — upbeat, faster, higher pitch
  • sad — slower, lower pitch, softer
  • angry — clipped consonants, louder, faster
  • calm — slower, lower, less pitch variation
  • newscast — neutral, formal, clear articulation
  • customerservice — warm, measured, polite

With Speeko's API:

{
  "text": "I'm sorry you had that experience. Let me fix this right away.",
  "voice": "en-US-neural-2",
  "style": "customerservice"
}

The difference between customerservice and neutral on an apology script is significant. Neutral reads as robotic indifference. customerservice adds the slight warmth and measured pace that reads as genuine concern.
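In code, the request body above can be assembled with a small function that rejects misspelled style names before anything is sent. The sketch below only builds the JSON; it deliberately makes no network call because the endpoint and authentication are not covered here. The field names come from the example above, while build_request and the KNOWN_STYLES set are assumptions made for illustration.

```python
import json

# Style names listed in this article. Whether a given voice supports
# each one is an assumption; check the API's voice list.
KNOWN_STYLES = {"cheerful", "sad", "angry", "calm", "newscast", "customerservice"}

def build_request(text, voice, style=None):
    """Return a JSON request body; omit "style" when none is requested."""
    payload = {"text": text, "voice": voice}
    if style is not None:
        if style not in KNOWN_STYLES:
            raise ValueError(f"unknown style: {style!r}")
        payload["style"] = style
    return json.dumps(payload, ensure_ascii=False)

body = build_request(
    "I'm sorry you had that experience. Let me fix this right away.",
    voice="en-US-neural-2",
    style="customerservice",
)
```

Failing early on an unknown style is a design choice: it surfaces a typo in your own code rather than leaving you to wonder why the output sounds neutral.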

Use Case Recipes

Guided Meditation

<speak>
  <prosody rate="85%" pitch="-1st" volume="soft">
    Find a comfortable position.
    <break time="2s"/>
    Take a breath in through your nose.
    <break time="3s"/>
    And release slowly through your mouth.
    <break time="4s"/>
  </prosody>
</speak>

Pass "style": "calm" as well if the API supports it. Generate at a slightly lower sample rate if file size matters: 22.05 kHz is enough for voice content.

Energetic Marketing Voiceover

<speak>
  <prosody rate="105%" pitch="+0.5st">
    Summer sale ends Sunday.
    <break time="200ms"/>
    Forty percent off everything.
    <break time="200ms"/>
    No code required.
  </prosody>
</speak>

Short sentences. Tight pauses. Don't slow down. Marketing copy needs forward momentum.

IVR Neutrality

For phone menus, stay neutral. Excitement reads as fake. Sadness reads as broken. You want rate="medium", pitch="medium", and clear articulation.

<speak>
  Thank you for calling.
  <break time="300ms"/>
  For account balance, press one.
  <break time="200ms"/>
  For billing questions, press two.
  <break time="200ms"/>
  To speak with an agent, press zero.
</speak>

IVR tip: generate at an 8 kHz sample rate if the output goes through a phone network, or 16 kHz if it's a VoIP/SIP setup. 44.1 kHz for phone IVR is wasted bandwidth.
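To put a number on "wasted bandwidth", here is the arithmetic for one minute of audio. It assumes uncompressed 16-bit mono PCM, which is an assumption about the output format rather than something every API returns; the channel labels in the dictionary are also invented for this sketch.

```python
# Sample rates suggested in this guide, in Hz. 22,050 Hz is the standard
# rate usually meant by "22kHz". The keys are labels made up for this sketch.
SAMPLE_RATE_HZ = {
    "phone": 8_000,
    "voip": 16_000,
    "voice_content": 22_050,
}

def pcm_bytes(sample_rate_hz, seconds, bytes_per_sample=2, channels=1):
    """Size of uncompressed PCM audio (16-bit mono by default)."""
    return sample_rate_hz * bytes_per_sample * channels * seconds

one_minute_phone = pcm_bytes(SAMPLE_RATE_HZ["phone"], 60)  # 960,000 bytes
one_minute_cd = pcm_bytes(44_100, 60)                      # 5,292,000 bytes
ratio = one_minute_cd / one_minute_phone                   # about 5.5x
```

Under those assumptions, a CD-rate prompt is roughly five and a half times the size of the 8 kHz version, and the phone network discards the extra detail anyway.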

Testing Your Output

Don't adjust SSML blind. The iteration loop is:

  1. Write the script
  2. Generate with default settings
  3. Listen and note specific moments that feel wrong
  4. Target those moments with SSML adjustments
  5. Regenerate and compare

Many TTS APIs charge per character, and some count SSML tags toward that total, so check your provider's billing rules. Either way, the tags themselves are short. An extra 200 characters of SSML on a 2,000-character script adds less than $0.01 at Speeko's rates.
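You can measure the tag overhead of a script directly instead of estimating it. The sketch below counts the characters inside SSML tags and prices them at a per-million-character rate. The rate of $16 per million characters is a placeholder chosen for the example, not Speeko's published price; substitute your provider's actual rate.

```python
import re

def ssml_overhead(ssml):
    """Return (tag_chars, total_chars) for an SSML string."""
    tag_chars = sum(len(tag) for tag in re.findall(r"<[^>]+>", ssml))
    return tag_chars, len(ssml)

def cost_usd(chars, usd_per_million_chars):
    """Cost of `chars` billed characters at a per-million rate."""
    return chars * usd_per_million_chars / 1_000_000

ASSUMED_RATE = 16.0  # USD per 1M characters; hypothetical placeholder

tags, total = ssml_overhead('<speak>Hi.<break time="300ms"/>Bye.</speak>')
extra = cost_usd(200, ASSUMED_RATE)  # 200 tag characters at the assumed rate
```

At the assumed rate, 200 extra characters cost about a third of a cent. More generally, 200 characters stay under $0.01 at any rate below $50 per million characters.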

Start experimenting with Speeko's free $5 credit — enough to iterate on a full script dozens of times before committing to a voice and style.