How to Test TTS API Integration Quality: Automated and Manual Strategies
Most teams test their TTS integration once — at setup — and call it done. Then they switch providers six months later and discover half their edge cases break silently. No error. Just wrong audio.
Testing TTS isn't like testing a JSON API where you assert on a field value. The output is audio. You need different tools.
What Can Actually Go Wrong
Before writing tests, know what you're testing against:
- Empty audio: The API returns 200 but the file is 0 bytes or sub-threshold duration
- Truncated output: Long inputs (3,000+ characters) get cut at character limits you didn't expect
- Voice drift: The same request returns perceptibly different prosody after a model update
- Encoding corruption: MP3 headers get mangled, audio plays in some players but not others
- Latency spikes: Average response time is 400ms but p99 is 8 seconds — your UI times out
Each of these needs a different test.
Automated Checks
1. Duration Validation
Generate audio for a known-length text, then validate the duration falls within an expected range:
import io

import requests
from pydub import AudioSegment

def test_audio_duration(text: str, expected_seconds: float, tolerance: float = 0.5):
    response = requests.post(
        "https://api.speekoapp.com/v1/tts",
        headers={"X-API-Key": "YOUR_KEY", "Content-Type": "application/json"},
        json={"text": text, "voice": "en-US-neural-1", "format": "mp3"}
    )
    assert response.status_code == 200
    assert len(response.content) > 1000  # non-empty

    # from_mp3 needs ffmpeg available and raises if the bytes cannot be decoded
    audio = AudioSegment.from_mp3(io.BytesIO(response.content))
    duration = len(audio) / 1000  # pydub reports length in milliseconds
    assert abs(duration - expected_seconds) < tolerance, (
        f"Expected ~{expected_seconds}s, got {duration:.1f}s"
    )

# This sentence is 31 words; at roughly 150 words per minute that is about 12 seconds,
# so 14 ± 3 seconds leaves headroom for slower voices
test_audio_duration(
    "This is a test sentence with roughly fifty words designed to validate that our TTS integration "
    "returns audio of the correct length when given normal prose input with no special characters.",
    expected_seconds=14.0,
    tolerance=3.0
)

Run this in CI on every deploy. It catches silent failures fast.
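Picking `expected_seconds` by hand for every fixture gets tedious. A minimal sketch of deriving it from the word count follows, assuming a speaking rate of about 150 words per minute. That rate is an assumption to tune per voice, and the function names are illustrative, not part of any provider SDK.

```python
from typing import Tuple


def estimate_duration_seconds(text: str, words_per_minute: float = 150.0) -> float:
    """Rough speech duration for `text` at an assumed speaking rate."""
    word_count = len(text.split())
    return word_count / words_per_minute * 60.0


def duration_bounds(text: str, slack: float = 0.3) -> Tuple[float, float]:
    """Return (low, high) in seconds, allowing +/- `slack` as a fraction of the estimate."""
    estimate = estimate_duration_seconds(text)
    return estimate * (1 - slack), estimate * (1 + slack)
```

The bounds can replace the fixed `expected_seconds` and `tolerance` pair: assert that the measured duration falls between `low` and `high`, and widen `slack` if a legitimately slow voice trips the check.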
2. Character Limit Regression
Different providers handle over-limit inputs differently. Some truncate. Some error. Some split automatically.
def test_long_input_handling():
    # 6,400 characters (32 x 200), above common single-request limits
    long_text = "This sentence will be repeated. " * 200
    response = requests.post(
        "https://api.speekoapp.com/v1/tts",
        headers={"X-API-Key": "YOUR_KEY"},
        json={"text": long_text, "voice": "en-US-neural-1", "format": "mp3"}
    )
    if response.status_code == 200:
        audio = AudioSegment.from_mp3(io.BytesIO(response.content))
        duration = len(audio) / 1000
        # 1,000 words should run for several minutes, so under a minute suggests silent truncation
        assert duration > 60, "Long input may have been silently truncated"
    elif response.status_code == 400:
        # Explicit error is acceptable: at least it's not silent
        assert "character" in response.json().get("error", "").lower()
    else:
        raise AssertionError(f"Unexpected status: {response.status_code}")

3. Encoding Integrity
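import tempfile
from pydub.utils import mediainfo

def test_mp3_valid(audio_bytes: bytes):
    AudioSegment.from_mp3(io.BytesIO(audio_bytes))  # raises if the MP3 cannot be decoded
    # mediainfo() runs ffprobe on a file path, so write the bytes to a temp file first
    with tempfile.NamedTemporaryFile(suffix=".mp3") as tmp:
        tmp.write(audio_bytes)
        tmp.flush()
        info = mediainfo(tmp.name)
    assert info.get("codec_name") == "mp3"
    assert int(info.get("bit_rate", 0)) >= 64000  # at least 64 kbps

A corrupt MP3 will raise on from_mp3. Catching that in CI beats discovering it in production when a user reports silence.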
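A cheaper pre-check can reject obviously wrong payloads, such as a JSON or HTML error body returned with a 200, before any decoding happens. The sketch below needs no dependencies and only inspects the first bytes: an MP3 stream normally starts with either an ID3v2 tag or an MPEG frame sync. The function name is illustrative, and a `False` result is a hint to investigate, not proof of corruption, since some files carry leading padding.

```python
def looks_like_mp3(audio_bytes: bytes) -> bool:
    """Cheap sanity check: ID3v2 tag or MPEG frame sync at the start of the data."""
    if len(audio_bytes) < 4:
        return False
    if audio_bytes[:3] == b"ID3":
        return True
    # MPEG audio frames begin with 11 set bits: 0xFF, then the top three bits of the next byte
    return audio_bytes[0] == 0xFF and (audio_bytes[1] & 0xE0) == 0xE0
```

Running this first keeps the failure message specific: "response is not MP3 data" is easier to act on than a decoder stack trace.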
Latency Testing
Add a latency benchmark separate from your functional tests:
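import math
import statistics
import time

def benchmark_latency(n: int = 20):
    text = "The quick brown fox jumps over the lazy dog."
    times = []
    for _ in range(n):
        start = time.perf_counter()  # monotonic clock, better suited to intervals than time.time()
        response = requests.post(
            "https://api.speekoapp.com/v1/tts",
            headers={"X-API-Key": "YOUR_KEY"},
            json={"text": text, "voice": "en-US-neural-1", "format": "mp3"}
        )
        elapsed = time.perf_counter() - start
        assert response.status_code == 200  # don't record timings for failed requests
        times.append(elapsed)

    ordered = sorted(times)
    # Nearest-rank percentiles; with n=20 the p99 is simply the slowest sample, so raise n for a meaningful tail
    print(f"p50: {statistics.median(times):.2f}s")
    print(f"p95: {ordered[math.ceil(n * 0.95) - 1]:.2f}s")
    print(f"p99: {ordered[math.ceil(n * 0.99) - 1]:.2f}s")

Run this before switching providers, not after.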
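Printed percentiles only help if someone reads them. One way to turn benchmark timings into a pass/fail signal is sketched below, assuming the per-request timings are collected into a list (for example by returning `times` from the benchmark). The helper names and the budget values in the usage note are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def nearest_rank(samples: Sequence[float], percent: int) -> float:
    """Nearest-rank percentile: smallest sample with at least `percent`% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(percent * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def over_budget(samples: Sequence[float], budgets: Dict[int, float]) -> List[str]:
    """Return one message per percentile whose latency exceeds its budget (in seconds)."""
    failures = []
    for percent, limit in sorted(budgets.items()):
        value = nearest_rank(samples, percent)
        if value > limit:
            failures.append(f"p{percent} {value:.2f}s exceeds {limit:.2f}s")
    return failures
```

A call such as `over_budget(times, {50: 0.6, 95: 2.0})` returns an empty list when every budget holds. Reporting the messages instead of raising keeps the decision about whether to block a release with the caller.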
Manual Listening Tests
Automated checks validate structure. They don't catch a voice that sounds like it's reading a list of ingredients when it should sound like a narrator.
For manual listening tests, build a fixture set:
- Neutral prose — a paragraph of plain explanation
- Technical content — code variable names, acronyms, domain terms
- Proper nouns — names, company names, cities
- Numbers and dates — "On April 24, 2026, the regulation took effect"
- Punctuation-heavy text — semicolons, colons, parentheticals
- Emotional content — a sentence that should convey urgency or warmth
Rate each on a 1–5 scale for naturalness and intelligibility. Keep scores. When you update providers or model versions, re-run the fixture set and compare against your baseline.
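Keeping scores pays off only if the comparison is mechanical. A small sketch of that comparison follows, assuming scores are stored as a mapping from fixture name to the averaged 1–5 rating. The fixture names and the 0.5-point threshold are illustrative choices.

```python
from typing import Dict, List


def score_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.5,
) -> List[str]:
    """List fixtures whose score fell by more than `max_drop`, or that were not re-rated."""
    problems = []
    for fixture, old_score in baseline.items():
        if fixture not in current:
            problems.append(f"{fixture}: missing from current run")
        elif old_score - current[fixture] > max_drop:
            problems.append(f"{fixture}: {old_score:.1f} -> {current[fixture]:.1f}")
    return problems
```

Small fluctuations between listeners are expected, which is why the sketch flags only drops larger than the threshold instead of any change.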
Regression Testing When Switching Providers
Generate your fixture set on both the old and new provider before cutting over. Listen to both. If you're migrating from OpenAI TTS to Speeko, the voice ID will change — but the content should be comparably natural.
Automated regression: hash the audio file and store it. On re-run, if the hash changes (expected on model update), flag it for manual review rather than failing the build outright. Voice model updates are legitimate — what matters is that the new audio still passes your quality bar.
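The hash-and-flag approach could be sketched as follows. It assumes the provider returns byte-identical audio for identical requests, which is worth verifying first, because a non-deterministic model would flag every fixture on every run. The JSON baseline file and function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def audio_fingerprint(audio_bytes: bytes) -> str:
    """SHA-256 of the raw audio bytes."""
    return hashlib.sha256(audio_bytes).hexdigest()


def changed_fixtures(audio_by_name: Dict[str, bytes], baseline_path: Path) -> List[str]:
    """Return fixture names whose audio differs from the stored hash, then update the store."""
    baseline = json.loads(baseline_path.read_text()) if baseline_path.exists() else {}
    changed = []
    for name, audio_bytes in audio_by_name.items():
        digest = audio_fingerprint(audio_bytes)
        if baseline.get(name) != digest:
            changed.append(name)  # new or modified: queue for manual listening
        baseline[name] = digest
    baseline_path.write_text(json.dumps(baseline, indent=2, sort_keys=True))
    return changed
```

The returned names feed the manual review queue. Note that the first run flags every fixture, since nothing has a stored hash yet.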
CI/CD Integration
Keep the fast automated checks in your main pipeline:
# .github/workflows/tts-quality.yml
# --timeout is provided by the pytest-timeout plugin, which must be installed
- name: TTS integration tests
  run: python -m pytest tests/tts/ -v --timeout=30
  env:
    SPEEKO_API_KEY: ${{ secrets.SPEEKO_API_KEY }}

Run the latency benchmark and manual fixture generation as a separate scheduled job (daily or weekly), not on every PR. Latency varies by time of day; don't fail a deployment because a Monday 9am benchmark ran hot.
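The workflow exposes the key as the `SPEEKO_API_KEY` environment variable, while the test snippets in this article use a literal `"YOUR_KEY"` placeholder. One way to bridge the two is a small helper that reads the variable and fails loudly when it is absent. This is a sketch, and `load_api_key` is a hypothetical name.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None, name: str = "SPEEKO_API_KEY") -> str:
    """Read the API key from the environment; raise a clear error if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it locally or add it as a CI secret")
    return key
```

Passing `headers={"X-API-Key": load_api_key()}` in each request keeps the secret out of the repository and makes a missing secret fail with an obvious message instead of a 401 buried in test output.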
Where to Start
Pick one test: the duration validation on your most common input length. Add it to CI this week. That single check covers the most common silent failures in one pass: empty responses, truncated audio, and files that fail to decode.
See also: TTS voice quality benchmarks for how to evaluate providers before you choose one.