Building a Read-Aloud Chrome Extension with a TTS API (Manifest V3)

The browser's built-in speechSynthesis API is free and requires zero setup. It's also inconsistent across browsers, limited to the voices installed on the user's OS, and sounds noticeably robotic on Windows. For an extension where voice quality matters, you swap it out for a TTS API. Here's how.

Architecture Overview

Manifest V3 ends persistent background pages. Your background context is now a service worker — it wakes up to handle events, then sleeps. This matters for TTS because you can't hold audio state in the service worker long-term.

The pattern that works:

Content Script (page context)
  → extracts selected text or full article
  → sends message to Service Worker

Service Worker (background)
  → receives text
  → checks chrome.storage.local cache
  → calls TTS API if cache miss
  → returns base64-encoded audio to Content Script

Content Script
  → creates <audio> element
  → plays audio in page context

In this design, audio playback happens in the content script. Service workers can't play audio because they have no DOM access.

File Structure

my-tts-extension/
  manifest.json
  background/service-worker.js
  content/content-script.js
  popup/popup.html
  popup/popup.js

manifest.json

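{
  "manifest_version": 3,
  "name": "Read Aloud",
  "version": "1.0.0",
  "permissions": ["storage", "activeTab", "scripting"],
  "host_permissions": ["https://api.speekoapp.com/*"],
  "background": {
    "service_worker": "background/service-worker.js"
  },
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["content/content-script.js"]
    }
  ],
  "action": {
    "default_popup": "popup/popup.html"
  }
}

The host_permissions entry for the TTS API domain is what lets the service worker call the API cross-origin. An extension service worker is still subject to CORS for origins the extension holds no host permission for, so without this entry the request succeeds only if the API happens to send permissive CORS headers.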

Service Worker: TTS API Call with Caching

// background/service-worker.js

const API_KEY = 'your-speeko-api-key';
const API_URL = 'https://api.speekoapp.com/v1/tts';
const CACHE_NAME = 'tts-audio-v1';

chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
  if (message.type === 'SPEAK') {
    handleSpeak(message.text)
      .then(sendResponse)
      // Always respond, so the content script is not left waiting on a failed request
      .catch(err => sendResponse({ error: String(err.message || err) }));
    return true; // keep message channel open for async response
  }
});

async function handleSpeak(text) {
  // Prefix the key so cached audio can't collide with other stored settings
  const cacheKey = `${CACHE_NAME}:${hashText(text)}`;

  // Check cache first
  const cached = await getCached(cacheKey);
  if (cached) return { audioData: cached, cached: true };

  // Call TTS API
  const response = await fetch(API_URL, {
    method: 'POST',
    headers: {
      'X-API-Key': API_KEY,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ text, voice: 'en-US-neural-1', format: 'mp3' })
  });

  // Don't cache or play an error body as if it were audio
  if (!response.ok) {
    throw new Error(`TTS request failed with status ${response.status}`);
  }

  const arrayBuffer = await response.arrayBuffer();
  const base64 = bufferToBase64(arrayBuffer);

  await setCached(cacheKey, base64);
  return { audioData: base64, cached: false };
}

function hashText(text) {
  // Simple djb2 hash — good enough for cache keys
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = (hash * 33) ^ text.charCodeAt(i);
  }
  return (hash >>> 0).toString(36);
}

async function getCached(key) {
  return new Promise(resolve => {
    chrome.storage.local.get(key, result => resolve(result[key] || null));
  });
}

async function setCached(key, data) {
  return new Promise(resolve => {
    chrome.storage.local.set({ [key]: data }, resolve);
  });
}

function bufferToBase64(buffer) {
  const bytes = new Uint8Array(buffer);
  let binary = '';
  bytes.forEach(b => binary += String.fromCharCode(b));
  return btoa(binary);
}

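chrome.storage.local has a 10MB limit (5MB before Chrome 114), and base64 encoding inflates the audio by about a third. Nothing is evicted automatically: once the quota is full, writes fail, so you have to delete old entries yourself. How many articles fit depends on the MP3 bitrate, so measure with real responses before relying on the cache. If you need more room, request the unlimitedStorage permission or use IndexedDB.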
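Here is a minimal sketch of manual eviction, since the storage API will not do it for you. It assumes a storage object that behaves like chrome.storage.local with promise-returning get, set and remove. The function name, the tts: prefix, the { data, savedAt } entry shape and the 8MB budget are all hypothetical choices for illustration, not part of the code above.

```javascript
// Sketch: keep cached audio under a byte budget by deleting the oldest entries first.
// Assumption: `storage` behaves like chrome.storage.local (promise-based get/set/remove).
const CACHE_PREFIX = 'tts:';              // hypothetical key prefix
const MAX_CACHE_BYTES = 8 * 1024 * 1024;  // hypothetical budget, below the 10MB quota

async function setCachedWithEviction(storage, key, base64, now = Date.now(), maxBytes = MAX_CACHE_BYTES) {
  const fullKey = CACHE_PREFIX + key;
  const all = await storage.get(null); // null returns every stored item

  // Collect existing cache entries, oldest first. Base64 length approximates stored bytes.
  const entries = Object.entries(all)
    .filter(([k]) => k.startsWith(CACHE_PREFIX) && k !== fullKey)
    .map(([k, v]) => ({ key: k, size: v.data.length, savedAt: v.savedAt }))
    .sort((a, b) => a.savedAt - b.savedAt);

  let used = entries.reduce((sum, e) => sum + e.size, 0);
  const removed = [];
  while (entries.length > 0 && used + base64.length > maxBytes) {
    const oldest = entries.shift();
    used -= oldest.size;
    removed.push(oldest.key);
  }

  if (removed.length > 0) await storage.remove(removed);
  await storage.set({ [fullKey]: { data: base64, savedAt: now } });
  return removed; // keys that were evicted to make room
}
```

In the service worker you would pass chrome.storage.local as the first argument. Reading every item on each write is simple but slow for large caches; a separate index of keys and sizes would avoid that.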

Content Script: Sending Text and Playing Audio

// content/content-script.js

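let currentAudio = null;
let playbackId = 0; // bumped on every start/stop so a stale read loop can bail out

chrome.runtime.onMessage.addListener((message) => {
  if (message.type === 'READ_SELECTION') {
    const text = window.getSelection().toString() || extractArticleText();
    speakText(text).catch(err => console.error('Read aloud failed:', err));
  } else if (message.type === 'STOP') {
    stopPlayback();
  }
});

async function speakText(text) {
  stopPlayback(); // don't overlap with a read that is already playing
  const myId = playbackId;

  // Chunk text if longer than 5000 characters
  const chunks = chunkText(text, 5000);

  for (const chunk of chunks) {
    const result = await chrome.runtime.sendMessage({
      type: 'SPEAK',
      text: chunk
    });
    if (myId !== playbackId) return; // stopped or superseded while waiting

    if (!result || !result.audioData) {
      throw new Error(result?.error || 'No audio returned');
    }

    await playBase64Audio(result.audioData);
    if (myId !== playbackId) return;
  }
}

function stopPlayback() {
  playbackId++;
  if (currentAudio) {
    currentAudio.pause();
    currentAudio = null;
  }
}

function playBase64Audio(base64) {
  return new Promise((resolve, reject) => {
    const audio = new Audio(`data:audio/mpeg;base64,${base64}`);
    currentAudio = audio;
    audio.onended = resolve;
    audio.onpause = resolve; // settle when stopped, so the read loop doesn't hang
    audio.onerror = () => reject(new Error('Audio playback failed'));
    audio.play().catch(reject); // play() rejects if the browser blocks autoplay
  });
}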

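function chunkText(text, maxLength) {
  // [.!?]* (not +) keeps a final sentence that has no closing punctuation
  const sentences = text.match(/[^.!?]+[.!?]*/g) || [text];
  const chunks = [];
  let current = '';

  // Push the pending chunk, skipping whitespace-only leftovers
  const flush = () => {
    if (current.trim()) chunks.push(current.trim());
    current = '';
  };

  for (let sentence of sentences) {
    // Hard-split a single sentence that exceeds the limit on its own
    while (sentence.length > maxLength) {
      flush();
      chunks.push(sentence.slice(0, maxLength));
      sentence = sentence.slice(maxLength);
    }
    if ((current + sentence).length > maxLength) flush();
    current += sentence;
  }
  flush();
  return chunks;
}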

function extractArticleText() {
  // Try semantic article element first
  const article = document.querySelector('article, [role="main"], main');
  return article ? article.innerText : document.body.innerText;
}

The chunking step matters. Most TTS APIs have a character limit per request (Speeko's is 5,000 characters). A news article runs 6,000–8,000 characters. Without chunking, you get an error.

Popup: Simple Play/Stop Controls

<!-- popup/popup.html -->
<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Read Aloud</title></head>
<body style="width:200px;padding:12px">
  <button id="read">▶ Read Page</button>
  <button id="stop">■ Stop</button>
  <script src="popup.js"></script>
</body>
</html>

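// popup/popup.js
async function sendToActiveTab(type) {
  const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
  chrome.tabs.sendMessage(tab.id, { type });
  window.close();
}

document.getElementById('read').addEventListener('click', () => sendToActiveTab('READ_SELECTION'));
// STOP is a message type the content script is expected to handle by pausing playback
document.getElementById('stop').addEventListener('click', () => sendToActiveTab('STOP'));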

Cost Per User

A typical article read: 6,000 characters = $0.18 at Speeko's rate. With caching, the second read of the same article costs $0. For a daily-reader user who hits 10 unique articles per day, that's $1.80/day in API costs if you're footing the bill, or $0 to you if they provide their own key and pay for their own usage.
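The arithmetic above is easy to script if you want to plug in your own usage numbers. This is a rough sketch: the per-character rate is derived from the figures in this article ($0.18 per 6,000 characters), not taken from a published price list, and the function name is made up for illustration.

```javascript
// Assumption: rate derived from this article's example, $0.18 per 6,000 characters.
const USD_PER_CHAR = 0.18 / 6000;

// Estimated API spend per day for one user, ignoring cache hits.
function estimateDailyCostUsd(charsPerArticle, uniqueArticlesPerDay) {
  return charsPerArticle * uniqueArticlesPerDay * USD_PER_CHAR;
}

const daily = estimateDailyCostUsd(6000, 10);
console.log(daily.toFixed(2));        // 1.80
console.log((daily * 30).toFixed(2)); // 54.00
```

At roughly $54 a month for a heavy reader, a $3–8 subscription would not cover the API bill on its own, which is part of why bring-your-own-key is attractive.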

Most read-aloud extensions either charge a subscription ($3–8/month) or let users bring their own API key. The bring-your-own-key model is simpler to ship — no billing infrastructure.

One Gotcha with MV3 Service Workers

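Service workers have no persistent lifecycle. Chrome shuts down an idle extension service worker after about 30 seconds, and it also terminates one whose fetch() response takes more than 30 seconds to arrive, so a slow generation request can be cut off midway. chrome.runtime.getBackgroundPage will not help here: it targets background pages and does not work with MV3 service workers. Keep each request small (the 5,000-character chunks above help), or use the chrome.offscreen API (available in MV3) to move long audio operations into a context that stays alive.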

The chrome.offscreen approach is the right one for anything that plays audio from the background — it creates a hidden document that can play audio without a visible page.
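A minimal sketch of creating that hidden document from the service worker, assuming Chrome 116 or later for chrome.runtime.getContexts. The offscreen/offscreen.html path and the function name are hypothetical, and the manifest would also need the "offscreen" permission, which the manifest above does not include.

```javascript
// Hypothetical page bundled with the extension; it would hold the <audio> element.
const OFFSCREEN_URL = 'offscreen/offscreen.html';

// `api` defaults to the global chrome object; it is a parameter only so the
// function can be exercised with a stub outside the browser.
async function ensureOffscreenDocument(api = chrome) {
  // An extension may have only one offscreen document, so check before creating.
  const existing = await api.runtime.getContexts({
    contextTypes: ['OFFSCREEN_DOCUMENT']
  });
  if (existing.length > 0) return false;

  await api.offscreen.createDocument({
    url: OFFSCREEN_URL,
    reasons: ['AUDIO_PLAYBACK'],
    justification: 'Play generated speech while no extension page is visible'
  });
  return true; // a new document was created
}
```

The service worker would call ensureOffscreenDocument() and then pass the audio data to the offscreen page with chrome.runtime.sendMessage, where a listener plays it much like the content script does.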

Getting Started

Get a Speeko API key — the free $5 credit handles about 167,000 characters, enough to test your extension thoroughly before shipping.

See also: the async TTS job queue guide if you're pre-generating audio server-side instead of on-demand in the extension.
