TTS API for Accessibility: Meeting WCAG 2.1 Requirements in 2026
April 24, 2026 was the WCAG 2.1 Level AA compliance deadline under the U.S. ADA Title II web accessibility rule for state and local government entities serving populations of 50,000 or more. If that deadline just passed and your product isn't there yet, or if you're building something new and don't want to be in the same situation in two years, here's what text-to-speech APIs actually do for accessibility, and where they fall short.
What TTS Solves (and What It Doesn't)
TTS helps users who can't read text visually — low vision, blindness, dyslexia, cognitive disabilities. A TTS layer reads page content aloud, which directly supports WCAG success criteria around perceivable content.
But screen readers (NVDA, VoiceOver, JAWS) already do this. They run at the operating-system level rather than in the browser; VoiceOver ships with Apple's platforms, Windows includes Narrator, and NVDA is free, so most users who depend on speech output already have it. Adding a custom TTS API to your site doesn't replace the need for proper semantic HTML: screen readers rely on ARIA labels, heading structure, alt text, and focus management. No TTS API patches those gaps.
Where a TTS API adds genuine value:
- Pre-generated audio content — articles, product descriptions, course transcripts. High quality, consistent voice, no reliance on the user's device.
- Voice interfaces — chatbots, IVR, interactive tools that speak responses aloud.
- Compliance documentation — generating accessible audio versions of PDFs, legal documents, onboarding flows.
WCAG 2.1 Success Criteria That TTS Addresses
1.1.1 Non-text Content (Level A): Images need alt text. TTS doesn't supply it, but if you have image captions you want narrated, TTS can read them.
1.2.x Time-based Media (Level A & AA): Prerecorded audio-only content needs a transcript (1.2.1). Prerecorded video needs captions (1.2.2) and audio description (1.2.3, 1.2.5). If you're generating audio with a TTS API, you already have the source text, and that text is the transcript. Store it.
1.3.1 Info and Relationships (Level A): Structure must be conveyed programmatically. This is semantic HTML territory, not TTS.
1.4.2 Audio Control (Level A): Audio that plays automatically for more than 3 seconds must come with a way to pause or stop it, or to control its volume independently of the system volume. If you autoplay TTS, wire up a stop control. Non-negotiable.
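The Audio Control requirement comes down to giving the user a reliable way to halt speech. A minimal sketch, assuming an `HTMLAudioElement`-like object (anything with `pause()` and a writable `currentTime`); `createStopControl` is a hypothetical helper name, not part of any library:

```javascript
// Hypothetical helper: wraps an HTMLAudioElement-like object in a stop control.
function createStopControl(audio) {
  let stopped = false;
  return {
    // Halt playback and rewind so a later play() starts from the beginning.
    stop() {
      audio.pause();
      audio.currentTime = 0;
      stopped = true;
    },
    isStopped() {
      return stopped;
    }
  };
}

// Browser wiring (assumes a <button id="stop-audio"> exists on the page):
// const control = createStopControl(audio);
// document.getElementById('stop-audio').addEventListener('click', control.stop);
```

Because `stop` closes over `audio` instead of relying on `this`, it can be passed directly as an event listener.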
The direct TTS win is 1.2.x: if you're publishing audio content generated via API, you have the transcripts automatically. Publish them.
Implementation: On-Page TTS Player
A simple read-aloud button that generates and plays audio for article content:
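    // Runs in the browser. '/api/tts' is a hypothetical same-origin proxy that
    // forwards the request to https://api.speekoapp.com/v1/tts and adds the
    // 'X-API-Key' header server-side. process.env does not exist in the
    // browser, and the API key should never ship to the client.
    async function readAloud(text) {
      const response = await fetch('/api/tts', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          text: text,
          voice: 'en-US-neural-1',
          format: 'mp3'
        })
      });
      if (!response.ok) {
        throw new Error(`TTS request failed: ${response.status}`);
      }

      const audioBlob = await response.blob();
      const audioUrl = URL.createObjectURL(audioBlob);
      const audio = new Audio(audioUrl);
      // Release the blob URL once playback finishes.
      audio.addEventListener('ended', () => URL.revokeObjectURL(audioUrl));
      await audio.play();
      return audio; // return for pause/stop control
    }

Add the button with proper ARIA attributes: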
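    <button
      type="button"
      aria-label="Listen to this article"
      onclick="readAloud(articleText)"
    >
      🔊 Listen
    </button>

The accessible name begins with the visible "Listen" label so that speech-input users can activate the button by saying what they see (WCAG 2.5.3, Label in Name). `articleText` is assumed to be a variable holding the article's plain text.

Cache the generated audio. There's no reason to re-generate the same article content on every request. Store the MP3 on your CDN with the content ID as the cache key. When the content updates, invalidate and regenerate.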
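One way to decide when to invalidate is to keep a fingerprint of the text and voice next to the cached file and compare it whenever the content is saved. The sketch below is illustrative only: `fingerprint`, `audioCachePath`, and `needsRegeneration` are hypothetical names, and the hash is a simple 32-bit FNV-1a-style function over UTF-16 code units, which is adequate for change detection but not for anything security-related.

```javascript
// 32-bit FNV-1a-style hash over UTF-16 code units. Not cryptographic.
function fingerprint(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, '0');
}

// CDN object path keyed by content ID (the path layout is an assumption).
function audioCachePath(contentId) {
  return `audio/${contentId}.mp3`;
}

// True when there is no cached record, or the text or voice changed
// since the audio was generated.
function needsRegeneration(cachedRecord, text, voice) {
  if (!cachedRecord) return true;
  return cachedRecord.fingerprint !== fingerprint(`${voice}\n${text}`);
}
```

Including the voice in the fingerprint means a voice change also triggers regeneration, not just a text edit.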
Pre-Generated vs. On-Demand
Two architectures, different tradeoffs:
Pre-generated: Run TTS when content publishes. Store MP3 on CDN. Serve statically. Zero latency for users, zero API calls at read time. Best for articles, product pages, documentation.
On-demand: Call the API when the user clicks listen. Adds 300–800ms before audio starts. Best for dynamic content — search results, user-generated content, personalized responses.
For a blog or knowledge base, pre-generate. The cost is low — a 2,000-word article is roughly 12,000 characters, or $0.36 at Speeko's rate. Generate once, serve thousands of times.
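A publish hook for the pre-generated path might look like the following sketch. The endpoint, header name, and body fields are the ones used in the player example earlier; everything else (`buildTtsRequest`, `pregenerateAudio`, and the `store.put` interface) is hypothetical. It is meant to run server-side, where the API key stays out of the browser.

```javascript
const TTS_ENDPOINT = 'https://api.speekoapp.com/v1/tts';

// Build the fetch arguments for one piece of content. Pure and easy to test.
function buildTtsRequest(text, apiKey, voice = 'en-US-neural-1') {
  return {
    url: TTS_ENDPOINT,
    init: {
      method: 'POST',
      headers: { 'X-API-Key': apiKey, 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, voice, format: 'mp3' })
    }
  };
}

// Generate audio once at publish time and hand the bytes to storage.
// `store.put(path, bytes)` is a hypothetical interface for a CDN or bucket.
async function pregenerateAudio(article, apiKey, store, fetchImpl = fetch) {
  const { url, init } = buildTtsRequest(article.text, apiKey);
  const response = await fetchImpl(url, init);
  if (!response.ok) {
    throw new Error(`TTS request failed: ${response.status}`);
  }
  const bytes = new Uint8Array(await response.arrayBuffer());
  const path = `audio/${article.id}.mp3`;
  await store.put(path, bytes);
  return path;
}
```

Passing `fetchImpl` and `store` as arguments keeps the hook testable without network access or a real bucket.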
Transcript Storage Pattern
Every generated audio file should have a paired transcript. Here's a simple metadata pattern:
    {
      "audio_url": "https://cdn.example.com/audio/article-123.mp3",
      "transcript": "Full article text here...",
      "generated_at": "2026-05-01T10:00:00Z",
      "character_count": 11840,
      "voice": "en-US-neural-1",
      "content_id": "article-123"
    }

Store this in your CMS or database alongside the content. Publish the transcript as a <details> element below the audio player:
    <details>
      <summary>Transcript</summary>
      <p>Full article text...</p>
    </details>

Screen readers can access both the audio and the text, and you satisfy the transcript requirement of 1.2.1 for prerecorded audio-only content. Two birds.
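The metadata record and the transcript markup can both be produced by small helpers. This is a sketch under assumptions: `buildAudioRecord`, `escapeHtml`, and `renderTranscript` are hypothetical names, and `character_count` here counts UTF-16 code units, which may not match how an API counts characters for billing. Escaping matters because transcript text can contain characters such as `<` or `&`.

```javascript
// Build the metadata record, deriving character_count from the transcript.
function buildAudioRecord({ contentId, transcript, voice, audioUrl, generatedAt }) {
  return {
    audio_url: audioUrl,
    transcript,
    generated_at: generatedAt,
    character_count: transcript.length,
    voice,
    content_id: contentId
  };
}

// Escape the characters that are significant in HTML text content.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Render the transcript as a collapsible <details> block, one <p> per paragraph.
function renderTranscript(transcript) {
  const paragraphs = transcript
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
    .map((p) => `<p>${escapeHtml(p)}</p>`)
    .join('\n');
  return `<details>\n<summary>Transcript</summary>\n${paragraphs}\n</details>`;
}
```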
Cost at Scale
A site with 500 articles averaging 1,500 words each:
- ~500 × 9,000 characters = 4,500,000 characters
- At $0.03/1K chars = $135 one-time generation cost
Monthly additions (10 articles/month): ~$2.70/month in TTS costs. Negligible for any commercial product.
Regeneration is only needed when content changes. For a knowledge base with quarterly updates, this is a minor line item.
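The same arithmetic as a small calculator. It assumes the figures used in this article: $0.03 per 1,000 characters and roughly 6 characters per word (which is what 1,500 words ≈ 9,000 characters implies). Working in cents keeps the sums exact; the function names are illustrative.

```javascript
const CENTS_PER_1K_CHARS = 3; // $0.03 per 1,000 characters (rate quoted in this article)
const CHARS_PER_WORD = 6;     // rough average implied by the article's examples

// Cost in cents for a given number of characters.
function costCents(characters) {
  return (characters * CENTS_PER_1K_CHARS) / 1000;
}

// Cost in cents for `articleCount` articles of `wordsPerArticle` words each.
function libraryCostCents(articleCount, wordsPerArticle) {
  return costCents(articleCount * wordsPerArticle * CHARS_PER_WORD);
}

// Format a cent amount as a dollar string with two decimals.
function formatDollars(cents) {
  return `$${(cents / 100).toFixed(2)}`;
}

console.log(formatDollars(libraryCostCents(500, 1500))); // $135.00 one-time backlog
console.log(formatDollars(libraryCostCents(10, 1500)));  // $2.70 per month of additions
console.log(formatDollars(libraryCostCents(1, 2000)));   // $0.36 for one 2,000-word article
```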
What TTS Doesn't Fix
This is worth stating plainly: a custom TTS integration does not make your site accessible. It's one layer.
The things that actually move the needle on WCAG 2.1 compliance:
- Proper semantic HTML (headings, landmarks, lists)
- ARIA labels on interactive elements
- Keyboard navigation for all functionality
- Sufficient color contrast (4.5:1 for normal text)
- Focus indicators on all focusable elements
Run an automated scan (Axe, Lighthouse) to find the obvious issues. Then bring in a user with a screen reader to find the ones automated tools miss. TTS is a complement to that work, not a substitute.
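The 4.5:1 figure in the checklist is a ratio of relative luminances as defined in WCAG 2.x. A sketch of that calculation for six-digit hex colors follows; the function names are illustrative, and it is a spot-check aid, not a replacement for the scanners and manual testing described above.

```javascript
// Convert one 8-bit sRGB channel to its linear value (WCAG 2.x definition).
function linearize(channel8bit) {
  const c = channel8bit / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a '#rrggbb' color.
function relativeLuminance(hex) {
  const n = parseInt(hex.slice(1), 16);
  const r = (n >> 16) & 0xff;
  const g = (n >> 8) & 0xff;
  const b = n & 0xff;
  return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b);
}

// Contrast ratio between two colors; the order of arguments does not matter.
function contrastRatio(hexA, hexB) {
  const a = relativeLuminance(hexA);
  const b = relativeLuminance(hexB);
  const lighter = Math.max(a, b);
  const darker = Math.min(a, b);
  return (lighter + 0.05) / (darker + 0.05);
}

// Normal-size text passes Level AA at 4.5:1 or higher.
function passesAANormalText(foreground, background) {
  return contrastRatio(foreground, background) >= 4.5;
}
```

For example, `passesAANormalText('#767676', '#ffffff')` is true while `passesAANormalText('#777777', '#ffffff')` is false, which shows how close to the threshold mid-grey text on white sits.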
Getting Started
Speeko's TTS API includes a free $5 credit, enough to generate audio for roughly 167,000 characters of content. At the 1,500-word average used above, that is about 18 articles: enough to pilot the workflow or cover a small backlog in one pass.
See also: async TTS job queue guide if you're batch-generating audio for a large content library.