What real users actually recommend for TTS as of March 2026 — compiled from Reddit discussions, r/LocalLLaMA threads, TTS Arena leaderboards, and independent benchmarks
The TTS landscape has undergone a fundamental transformation since mid-2025. Here are the six paradigm shifts reshaping the field.
The watershed moment: Chatterbox beat ElevenLabs in blind listener tests with 63.75% preference. Multiple open-source models now rank alongside commercial offerings on TTS Arena. Kokoro, Chatterbox, Fish Speech, and others with Apache 2.0 or MIT licenses enable zero-marginal-cost deployment in production.
TTS is no longer just about single-speaker narration. Dia (Nari Labs) generates entire multi-speaker dialogues in one pass. MOSS-TTSD supports 1-5 speakers with natural turn-taking and overlapping speech for up to 60 minutes. Sesame CSM generates speech with "umms", "uhhs", and natural conversational flow.
NeuTTS Air brings realistic voice cloning to phones and Raspberry Pis with no GPU or cloud needed. Kokoro (82M) runs on plain CPUs and achieves 96x real-time on a basic GPU. VibeVoice-Realtime (0.5B) streams in ~300ms on consumer hardware. GGUF/GGML formats enable embedded deployment.
FishAudio S1 introduced open-domain fine-grained emotion control. Hume Octave understands what words mean, not just how to pronounce them. IndexTTS-2 disentangles emotion from speaker identity. Chatterbox Turbo offers emotion exaggeration control. This was science fiction two years ago.
Real-time conversational AI became practical. Cartesia Sonic 3: sub-100ms. Qwen3-TTS: 97ms streaming. CosyVoice2: 150ms. Inworld Mini: <130ms. Multiple providers now support the latency requirements needed for natural voice agent interactions.
A new generation of TTS models that understand meaning, not just pronunciation. FishAudio S1 brings LLM reasoning into the speech pipeline. Hume Octave is the first speech-language model. OpenAI's gpt-4o-mini-tts offers prompt-based steerability. These models grasp context, sarcasm, and emotional weight.
The commercial TTS landscape has been reshaped by aggressive new entrants and open-source competition. Here are the leaders as of March 2026.
Inworld: The breakout commercial TTS of 2025-2026. Top-ranked on both major TTS arenas with the best price-to-quality ratio.
ElevenLabs: Still the most-discussed TTS on Reddit and the broadest ecosystem. Recently cut conversational AI pricing by ~50%.
"The voices sound incredibly human, and their free tier gives you access to this incredible quality."— Reddit user discussion
Hume AI: The first speech-language model for TTS. Understands meaning, not just pronunciation. Launched Octave 2 in October 2025.
Cartesia: Industry-leading latency with cinema-grade output. Now available on Amazon SageMaker JumpStart (Feb 2026).
OpenAI: Three model tiers with unique "steerability": prompt the model not just on what to say but on how to say it.
The open-source TTS landscape exploded in 2025-2026. An unprecedented number of high-quality models were released, several rivaling or exceeding commercial offerings. These are the models that Reddit's technical communities (especially r/LocalLLaMA) recommend most.
The open-source model that beat ElevenLabs in blind tests. Turbo variant (Dec 2025) cut diffusion from 10 steps to 1.
"Definitely better than F5-TTS" — used in production podcast pipeline— Developer blog (duarteocarmo.com)
Kokoro: A tiny model that punches absurdly above its weight class. The undisputed speed and efficiency champion.
FishAudio S1 (October 2025) is the first TTS model with open-domain fine-grained emotion control. Brings LLM reasoning directly into the speech pipeline.
"Sounds insanely real: Fish Audio voices show emotion, pause, breathe."— nerdynav.com review
Qwen3-TTS: Released January 2026. A breakthrough in open-source voice generation, offering capabilities previously found only in closed commercial systems.
Built on Llama 3B backbone. The breakthrough model of late 2025 for emotional speech. Rivals premium cloud services.
Dia (Nari Labs): Purpose-built for dialogue. Made by two Korean undergrads with no funding. Generates entire conversations in a single pass.
CosyVoice 2: Ultra-low latency streaming champion. 30-50% fewer pronunciation errors than v1.
NeuTTS Air: World's first super-realistic on-device TTS with instant voice cloning. Runs on phones, laptops, and Raspberry Pi.
F5-TTS: Fully non-autoregressive flow matching model. V1 released March 2025 with improved training and inference.
IndexTTS-2: Outperforms SOTA zero-shot TTS in WER, speaker similarity, and emotional fidelity.
Optimized for conversational contexts. Generates speech with remarkably human-like qualities.
The paradigm shift from "text-to-speech" to "script-to-conversation." Released February 2026.
Two major independent arenas track TTS quality via blind crowdsourced comparisons: the Hugging Face TTS Arena and the Artificial Analysis TTS leaderboard. Here are the current rankings.
| # | Model | Win Rate | ELO | Type |
|---|---|---|---|---|
| 1 | Vocu V3.0 | 57% | 1609 | Commercial (China) |
| 2 | Inworld TTS | 59% | 1576 | Commercial |
| 3 | CastleFlow v1.0 | 60% | 1575 | Commercial |
| 4 | Inworld TTS MAX | 61% | 1571 | Commercial |
| 5 | Papla P1 | 57% | 1565 | Commercial |
| 6 | Hume Octave | 64% | 1561 | Commercial |
| 7 | Eleven Flash v2.5 | 56% | 1547 | Commercial |
| 15 | Chatterbox | 47% | 1503 | Open-Source (MIT) |
| 17 | Kokoro v1.0 | 45% | 1498 | Open-Source (Apache) |
| 24 | StyleTTS 2 | 26% | 1369 | Open-Source (MIT) |
Rankings via blind crowdsourced comparisons. ELO calculated like chess ratings. 61 models tracked. Open-source models are closing the gap rapidly.
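For intuition on those numbers: each blind vote is treated like a chess game, and ratings move toward the observed result. A minimal sketch of the standard Elo update rule (the arenas' exact K-factor and tie handling are not published, so treat the constants as illustrative):

```python
def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """One Elo update after a blind A/B vote (standard chess formula)."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # win probability implied by current ratings
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Example: a 1500-rated model beats a 1600-rated one in a blind vote.
print(elo_update(1500, 1600, a_won=True))  # the underdog gains ~20.5 points
```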
| # | Model | ELO | Win Rate |
|---|---|---|---|
| 1 | Inworld TTS 1 Max | 1,162 | 67% |
| 2 | Inworld TTS 1.5 Max | 1,115 | 62% |
| 3 | OpenAI TTS-1 | 1,111 | 65% |
| 4 | MiniMax Speech-02-Turbo | 1,107 | 63% |
| 5 | ElevenLabs Multilingual v2 | 1,105 | 64% |
| 9 | Kokoro 82M v1.0 | 1,060 | ~50% |
Kokoro at $0.65/1M chars is the highest-ranked open-weight model on this arena. 61 models compared on Quality, Price, and Speed.
Direct comparisons from Reddit users, independent reviewers, and arena data.
In blind tests conducted by Resemble AI, 63.8% of listeners preferred Chatterbox's output. However, ElevenLabs still leads in overall polish, feature ecosystem, ease of use, and multilingual breadth (32 languages vs. Chatterbox's English focus). The gap is real but narrowing fast.
GPU benchmarks on NVIDIA L4, 24GB VRAM. Kokoro also hits 36x real-time on a free Colab T4 and runs on plain CPUs. Source: Inferless comparison (2025).
First-token latency (P90). Lower is better for real-time voice agents.
ElevenLabs pricing varies heavily by plan ($5-$1,320/mo for character credits). Open-source models: $0 marginal cost once deployed.
Hard numbers from independent testing, TTS Arena rankings, and community benchmarks as of March 2026.
| Model | Params | License | Speed | Voice Cloning | Languages | GPU Required |
|---|---|---|---|---|---|---|
| Kokoro | 82M | Apache 2.0 | <0.3s / 96x RT | No (14 presets) | 6+ | No (runs on CPU) |
| Chatterbox Turbo | 350M | MIT | Sub-200ms / 6x RT | Yes (5-10s audio) | EN (multi variant) | Yes (4-8GB) |
| Fish Speech / Fish Audio S1 | 4B (S1); 0.5B (S1-mini, v1.5) | Apache 2.0 | Good | Yes (10s sample) | 13 | Yes |
| Qwen3-TTS | 600M | Apache 2.0 | 97ms streaming | Yes (3s audio) | 10 | Yes |
| CosyVoice 2.0 | 500M | Apache 2.0 | 150ms streaming | Yes (5-15s) | 4+ (cross-lingual) | Yes |
| NeuTTS Air | 748M | Apache 2.0 | Real-time | Yes (3-15s) | English | No (on-device) |
| F5-TTS v1 | 335M | MIT | Sub-7s / 3x RT | Yes | EN, ZH | Yes (~8GB) |
| Orpheus | 150M-3B | Apache 2.0 | Varies by size | Yes (zero-shot) | 5 (EN, ZH, HI, KO, ES) | Yes |
| Dia / Dia2 | 1.6B / 1-2B | Apache 2.0 | Dia: batch; Dia2: streaming | Yes (Dia2) | English only | Yes |
| IndexTTS-2 | ~1B | Open | Good | Yes (zero-shot) | ZH, EN, JA | Yes |
| Sesame CSM-1B | 1B | Open | Moderate | Context-based | English | Yes (high) |
| XTTS-v2 | Medium | Coqui License | <150ms stream | Yes (6s audio) | 17 | Yes |
| Piper | Small | Apache 2.0 | Near-instant | No | 25 | No (CPU/RPi) |
| StyleTTS 2 | ~200M | MIT | 95x RT (4090) | Fine-tuning | English | Yes (12GB+) |
| MeloTTS | Small | MIT | Real-time on CPU | No | Multi | No |
| GPT-SoVITS | Medium | MIT | Moderate | Yes (1 min data) | Multi | Yes |
From independent testing on identical hardware (NVIDIA L4, 24GB VRAM), several of the models above stood out for synthesized speech quality.
Four dominant approaches power modern TTS: autoregressive generation on LLM backbones (Orpheus, Qwen3-TTS), diffusion and flow-matching models (F5-TTS, CosyVoice 2), lightweight non-autoregressive synthesis (Piper, Kokoro, StyleTTS 2), and hybrid AR + NAR pipelines (IndexTTS-2).
ElevenLabs remains the most well-known name in TTS, but as of March 2026 it no longer leads the arena rankings, and many alternatives offer better value. Here is what Reddit users recommend, categorized by budget.
ElevenLabs is no longer the default recommendation in technical communities. Inworld TTS offers better quality at a fraction of the cost. Chatterbox Turbo (free, MIT) beats ElevenLabs in blind tests. Kokoro runs on a Raspberry Pi with quality comparable to models 50x its size. For voice cloning, Chatterbox and Fish Audio are legitimate production alternatives. ElevenLabs retains an edge in ecosystem breadth, ease of use, and multilingual coverage (32 languages), but the quality and pricing moats have been breached.
Reddit users are refreshingly honest about the limitations. Here are the gotchas that come up repeatedly in March 2026.
In mid-2025, a developer concluded that "open-source TTS remains significantly behind proprietary solutions." By March 2026, the picture has changed:
"Chatterbox outperforming ElevenLabs in blind tests was unthinkable a year ago. Open-source models now rank alongside commercial offerings on TTS Arena. The gap is no longer about quality — it is about ecosystem polish, reliability at scale, and ease of integration."
For many use cases — podcasts, audiobooks, voice agents, accessibility tools — free open-source models are now genuinely production-ready. The remaining commercial advantages are in multilingual breadth, enterprise support, and turnkey ease-of-use.
Practical advice collected from Reddit discussions, r/LocalLLaMA threads, and community blogs as of March 2026.
Most open-source TTS models still degrade after ~1,000 characters. Process one sentence at a time and concatenate the audio. This dramatically improves reliability and consistency. Exception: MOSS-TTSD and VibeVoice are designed for long-form.
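A minimal, engine-agnostic sketch of that pattern. The synthesize() stub is a placeholder for whatever model you run, assumed to return 16-bit mono PCM at a known sample rate; swap in a real call:

```python
import re
import wave

def synthesize(sentence: str) -> bytes:
    """Placeholder: call your TTS engine here; return raw 16-bit mono PCM bytes."""
    raise NotImplementedError

def text_to_wav(text: str, path: str, sample_rate: int = 24000) -> None:
    # Naive sentence splitter; swap in nltk or pysbd for messier text.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)          # 16-bit samples
        out.setframerate(sample_rate)
        for sentence in sentences:   # one short generation per sentence
            out.writeframes(synthesize(sentence))
```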
Start with Kokoro: free, runs on CPU, processes in under 0.3 seconds. If the quality is sufficient for your use case, you are done. If you need voice cloning, graduate to Chatterbox; if you need emotion control, try Fish Audio or Orpheus.
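A quickstart sketch, assuming the kokoro package on PyPI and its KPipeline API as shown in the project README (voice names like af_heart may change between releases):

```python
# pip install kokoro soundfile
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")  # "a" = American English
text = "Kokoro is an 82M-parameter TTS model that runs comfortably on CPU."

# The pipeline yields audio chunk by chunk as (graphemes, phonemes, audio).
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"chunk_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio
```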
Sub-200ms first-token latency is the threshold for natural conversation. Cartesia Sonic 3 (sub-100ms), Qwen3-TTS (97ms), Inworld Mini (<130ms), and CosyVoice2 (150ms) all meet this bar. ElevenLabs Flash v2.5 and Hume Octave 2 are also in range.
Apache 2.0 and MIT are safe for commercial use. The Coqui Public Model License (XTTS-v2) is non-commercial. ChatTTS is CC BY-NC 4.0 (non-commercial). VibeVoice's TTS code was pulled entirely. Do not assume "open-source" means "use however you want."
Dia2 generates multi-speaker dialogue in a single pass with nonverbal cues. MOSS-TTSD supports up to 5 speakers for 60 minutes with natural turn-taking. Both are dramatically better than stitching single-speaker clips together.
The open-source "Ultimate TTS Studio" project on GitHub bundles Kokoro, Chatterbox, Fish Speech, F5-TTS, and more in one unified interface. Great for comparing models side-by-side on your own content. NVIDIA GPU required.
NeuTTS Air (Apache 2.0) runs on phones, laptops, and Raspberry Pis with voice cloning from just 3 seconds of audio and no GPU required. For simpler needs, Piper remains excellent for Home Assistant and offline use cases (25 languages, near-instant).
Microsoft's Edge TTS via the edge-tts Python package gives you Azure Neural voices for free without an API key or GPU. Requires internet access and is not suitable for high-volume production, but ideal for quick prototyping and personal projects.
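A minimal sketch with the edge-tts package (voice names follow Azure's catalog; `edge-tts --list-voices` prints the full list):

```python
# pip install edge-tts
import asyncio
import edge_tts

async def main() -> None:
    tts = edge_tts.Communicate(
        text="Free neural TTS for prototyping, with no API key required.",
        voice="en-US-AriaNeural",
    )
    await tts.save("prototype.mp3")

asyncio.run(main())
```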
TTS Arena rankings are crowdsourced on short clips and may not reflect your use case. A model that tops the arena on demo sentences might struggle with technical jargon, code, or unusual proper nouns. Always A/B test with your own text.
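One way to keep such tests honest is a blinding harness: synthesize the same lines with two engines into two folders, then shuffle each pair under neutral names so you vote before looking up which engine made which clip. A stdlib-only sketch:

```python
import json
import random
import shutil
from pathlib import Path

def blind_pairs(dir_a: str, dir_b: str, out: str = "blind") -> None:
    """Copy matching clips from two engines under randomized, blinded labels."""
    out_dir = Path(out)
    out_dir.mkdir(exist_ok=True)
    key = {}
    for wav_a in sorted(Path(dir_a).glob("*.wav")):
        wav_b = Path(dir_b) / wav_a.name   # same sentence, other engine
        order = ["a", "b"]
        random.shuffle(order)              # hide which engine is clip 1 vs clip 2
        for slot, engine in enumerate(order, start=1):
            src = wav_a if engine == "a" else wav_b
            shutil.copy(src, out_dir / f"{wav_a.stem}_clip{slot}.wav")
        key[wav_a.stem] = order
    (out_dir / "key.json").write_text(json.dumps(key, indent=2))  # peek only after voting

blind_pairs("engine_a", "engine_b")
```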
Open-source models have zero per-character costs once deployed. One developer cut costs 80% switching from Google TTS to Coqui. At scale, Inworld Mini ($5/1M chars) is 33x cheaper than ElevenLabs per-character on standard plans. Self-hosted Kokoro or Chatterbox costs only GPU compute time.
Released January 2026, Qwen3-TTS supports 10 languages with 97ms streaming, voice design via text prompts ("a warm female voice with a gentle accent"), and 3-second voice cloning. Apache 2.0 license. If you need multilingual open-source TTS, this is the current best choice.
Cloud GPU rental from $0.20/hour lets you run heavier models (Chatterbox, Orpheus 3B, Fish Speech) without buying expensive hardware. Google Colab's free tier works for Kokoro and smaller models.
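A back-of-envelope comparison using the numbers above ($0.20/hr rental, a $5/1M-char API) plus two stated assumptions, speech density of ~900 characters per audio minute and Kokoro-class 96x real-time throughput; replace both with your own measurements:

```python
def monthly_cost(chars: int) -> tuple[float, float]:
    api_price_per_m = 5.00      # e.g., a $5/1M-char hosted API
    gpu_rate = 0.20             # rented GPU, $/hour
    chars_per_audio_min = 900   # assumption: ~150 wpm at ~6 chars/word
    realtime_factor = 96        # assumption: Kokoro-class GPU throughput

    api = chars / 1e6 * api_price_per_m
    audio_hours = chars / chars_per_audio_min / 60
    gpu = audio_hours / realtime_factor * gpu_rate
    return api, gpu

for volume in (1e6, 10e6, 100e6):
    api, gpu = monthly_cost(int(volume))
    print(f"{volume / 1e6:>5.0f}M chars/mo: API ${api:,.2f} vs rented GPU ${gpu:,.2f}")
```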
Voice cloning is one of the most requested TTS features, and the open-source options have improved dramatically in 2025-2026. Here is what Reddit users and experts recommend.
| Model | Audio Needed | Quality | License / Cost |
|---|---|---|---|
| Chatterbox Turbo | 5-10 seconds | High (beat ElevenLabs) | MIT (free) |
| Qwen3-TTS | 3 seconds | High (10 languages) | Apache 2.0 (free) |
| NeuTTS Air | 3-15 seconds | Good (on-device) | Apache 2.0 (free) |
| Orpheus | Short sample | High (zero-shot) | Apache 2.0 (free) |
| Fish Audio S1 | 10 seconds | Excellent | $9.99/mo |
| Inworld TTS | 5-15 seconds | Excellent (free clone) | $5-10/1M chars |
| ElevenLabs | Short sample | Excellent | From $5/mo |
| XTTS-v2 | 6 seconds | Good (17 languages) | Non-commercial |
| GPT-SoVITS | 1 minute | Good (fast training) | MIT (free) |
| OpenVoice | Short sample | Good (style control) | Open |
| MARS5-TTS | 2-3 seconds | Good (140+ languages) | Open (commercial OK) |
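For the top row, a zero-shot cloning sketch assuming the chatterbox-tts package's documented API (from_pretrained plus an audio_prompt_path reference clip):

```python
# pip install chatterbox-tts
import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

wav = model.generate(
    "This voice was cloned from a ten-second reference clip.",
    audio_prompt_path="reference_10s.wav",  # 5-10s of the target speaker
    exaggeration=0.5,                       # emotion intensity control
)
torchaudio.save("cloned.wav", wav, model.sr)
```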
A comprehensive pricing and quality breakdown of 35+ TTS providers as of March 2026. All prices in USD. Sorted by effective cost per 1M characters.
| Provider | Model / Tier | $/1M Chars | Free Tier | Voice Clone | Languages | Quality Tier |
|---|---|---|---|---|---|---|
| Kokoro 82M (hosted) | v1.0 | $0.65 | Self-host free | No | 5 | Open-source leader |
| Neets.ai | Standard | ~$1.00 | Yes | Unknown | 80+ | Budget |
| StyleTTS 2 (hosted) | Standard | $2.82 | Self-host free | No | English | Open-source |
| Google Cloud | Standard | $4.00 | 4M chars/mo | No | 40+ | Basic |
| Amazon Polly | Standard | $4.00 | 5M chars/mo (12mo) | No | 30+ | Basic |
| Fish Audio | speech-1.5/1.6/S1 | $4-15 | 8K credits | Yes | 15+ | High |
| Smallest.ai | Lightning V2 | ~$5.00 | Unknown | Yes | 16 | Good |
| Inworld | TTS-1 (standard) | $5.00 | Unknown | Yes | Multi | Very High |
| Unreal Speech | Enterprise rate | $8.00 | 250K chars/mo | No | English | Good |
| Speechify | API | $10.00 | Yes (basic) | No | 60+ | Good |
| Inworld | TTS-1 Max | $10.00 | Unknown | Yes | Multi | #1 AA ELO |
| Speechmatics | Neural | $11.00 | 1M chars/mo | No | English | Very Good |
| Cartesia | Sonic 3 | ~$11.00 | 20K chars | Yes | Multi | High |
| OpenAI | TTS-1 (legacy) | $15.00 | No | No | 50+ | Good |
| Deepgram | Aura-1 | $15.00 | $200 credit | No | English | Good |
| OpenAI | gpt-4o-mini-tts | ~$15.90 | No | No | 50+ | Good+ |
| Google Cloud | WaveNet / Neural2 | $16.00 | 1M WaveNet/mo | No | 40+ | Good |
| Amazon Polly | Neural | $16.00 | 1M chars/mo (12mo) | No | 30+ | Good |
| Microsoft Azure | Neural (standard) | $16.00 | 5M chars/mo | No | 100+ | Good |
| Deepgram | Aura-2 | $30.00 | $200 credit | No | English+ | Very Good |
| Google Cloud | Studio / Chirp 3 HD | $30.00 | No | No | 40+ | Very Good |
| Amazon Polly | Generative | $30.00 | 100K chars/mo (12mo) | No | Limited | Very Good |
| Murf AI | API | $30.00 | Limited free | Yes | 20+ | Good |
| OpenAI | TTS-1-HD | $30.00 | No | No | 50+ | Good+ |
| Microsoft Azure | Neural HD V2 | $30.00 | No | No | 100+ | Very Good |
| LMNT | Standard (overage) | $35-50 | 15K chars | Yes (unlimited) | English+ | Good |
| MiniMax | Speech-02-HD | $50-100 | Unknown | Yes | Multi | Very High |
| WellSaid Labs | Maker plan | ~$60-80 | 7-day trial | No | English | Very Good |
| Hume AI | Octave 2 | ~$43 est. (per-min pricing) | 10K chars/mo | No | Multi | Very High (emotional) |
| Amazon Polly | Long-Form | $100.00 | 500K chars/mo (12mo) | No | Limited | Very Good |
| Microsoft Azure | Long Audio | $100.00 | No | No | 100+ | Very Good |
| Resemble AI | Standard | ~$99 | Limited | Yes (advanced) | 20+ | High |
| ElevenLabs | Scale plan eff. | ~$120 | 10K chars/mo | Yes | 32 | Premium |
| Play.ht | Starter (annual) | ~$125 | Limited | Yes | 140+ | Good |
| ElevenLabs | Pro plan eff. | ~$198 | 10K chars/mo | Yes | 32 | Premium |
| ElevenLabs | Overage (Creator) | $300 | 10K chars/mo | Yes | 32 | Premium |
Blind A/B comparison voting by users. Higher ELO = more natural/preferred.
| # | Model | ELO | Win Rate | Votes | Notes |
|---|---|---|---|---|---|
| 1 | Vocu V3.0 | 1609 | 57% | 1,175 | New entrant; limited availability |
| 2 | Inworld TTS | 1576 | 59% | 1,800 | $5/1M chars |
| 3 | CastleFlow v1.0 | 1575 | 60% | 1,641 | New entrant |
| 4 | Inworld TTS MAX | 1571 | 61% | 1,285 | $10/1M chars |
| 5 | Papla P1 | 1565 | 57% | 3,134 | New entrant |
| 6 | Hume Octave | 1561 | 64% | 3,265 | Emotional expression |
| 7 | Eleven Flash v2.5 | 1547 | 56% | 3,256 | ElevenLabs fast model |
| 8 | MiniMax Speech-02-HD | 1545 | 57% | 2,667 | High quality |
| 9 | Eleven Turbo v2.5 | 1544 | 58% | 3,253 | ElevenLabs turbo |
| 10 | MiniMax Speech-02-Turbo | 1540 | 52% | 2,734 | Fast variant |
| 15 | Chatterbox | 1503 | 47% | 1,630 | Open-source (MIT) |
| 17 | Kokoro v1.0 | 1498 | 45% | 3,265 | Best open-source value |
| 24 | StyleTTS 2 | 1369 | 26% | 1,246 | Open-source (MIT) |
| 25 | CosyVoice 2.0 | 1358 | 28% | 2,218 | Open-source (Alibaba) |
| 26 | Spark TTS | 1342 | 25% | 1,134 | Open-source |
| # | Model | ELO | Appearances | $/1M Chars |
|---|---|---|---|---|
| 1 | Inworld TTS 1 Max | 1,162 | 2,164 | $10 |
| 2 | Inworld TTS 1.5 Max | 1,115 | 1,302 | $10 |
| 3 | OpenAI TTS-1 | 1,111 | 6,913 | $15 |
| 4 | MiniMax Speech-02-Turbo | 1,107 | 3,592 | ~$50 |
| 5 | ElevenLabs Multi. v2 | 1,105 | 10,206 | ~$200 |
| 6 | MiniMax Speech-02-HD | 1,105 | 3,731 | ~$100 |
| 7 | MiniMax Speech 2.6 HD | 1,105 | 3,307 | ~$100 |
| 8 | MiniMax Speech 2.6 Turbo | 1,103 | 3,447 | ~$50 |
| 9 | ElevenLabs Turbo v2.5 | 1,096 | 9,195 | ~$200 |
| 10 | ElevenLabs v3 Alpha | 1,095 | 3,847 | ~$200 |
| 9 | Kokoro 82M v1.0 | 1,060 | - | $0.65 |
| Provider | Model / Tier | Cost (500K chars) | Notes |
|---|---|---|---|
| Kokoro 82M (hosted) | v1.0 | $0.33 | Cheapest hosted option |
| Kokoro (self-hosted) | v1.0 | $0.00 + GPU | Free weights, ~$0.03/hr GPU |
| Neets.ai | Standard | ~$0.50 | Budget quality |
| StyleTTS 2 | Standard | $1.41 | Open-source |
| Google Cloud | Standard | $2.00 | Robotic; or free within 4M/mo tier |
| Amazon Polly | Standard | $2.00 | Robotic; or free within 5M/mo tier |
| Smallest.ai | Lightning V2 | $2.50 | Ultra-fast generation |
| Inworld | TTS-1 | $2.50 | Great quality at low cost |
| Speechify | API | $5.00 | Simple pricing |
| Inworld | TTS-1 Max | $5.00 | Best quality under $10 |
| Speechmatics | Neural | $5.50 | English-only |
| Cartesia | Sonic 3 | $5.50 | Low latency bonus |
| OpenAI | TTS-1 | $7.50 | Reliable standard |
| Fish Audio | S1 | $7.50 | Good quality, cloned voices |
| Google Cloud | WaveNet | $8.00 | Good; or free within 1M/mo |
| Microsoft Azure | Neural | $8.00 | Good; free within 5M/mo |
| Deepgram | Aura-2 | $15.00 | Very Good quality |
| OpenAI | TTS-1-HD | $15.00 | Higher quality tier |
| Murf AI | API | $15.00 | Content creation focus |
| LMNT | Premium tier | $17.46 | Cloned voice included |
| MiniMax | Speech-02-HD | $25-50 | Platform-dependent |
| Resemble AI | Standard | ~$49.50 | Cloned voice specialist |
| ElevenLabs | Scale plan | $60-82 | Premium quality |
| Play.ht | Starter | ~$62.50 | Creator-focused |
| ElevenLabs | Pro plan | $99+ | All features included |
Inworld TTS-1 Max at $5.00 offers the best combination of quality (#1 ranked on Artificial Analysis) and price. For absolute minimum cost with decent quality, hosted Kokoro at $0.33 is unbeatable. For zero cost, self-hosted Kokoro or using Azure/Google free tiers.
| Provider / Model | ELO (HF Arena) | $/1M Chars | ELO per Dollar | Value Rating |
|---|---|---|---|---|
| Kokoro v1.0 | 1498 | $0.65 | 2,305 | EXCEPTIONAL |
| Inworld TTS | 1576 | $5.00 | 315 | EXCELLENT |
| Inworld TTS MAX | 1571 | $10.00 | 157 | EXCELLENT |
| Cartesia Sonic 2 | 1513 | $11.00 | 138 | VERY GOOD |
| Hume Octave | 1561 | ~$43* | 36 | GOOD (niche) |
| MiniMax Speech-02-HD | 1545 | $50.00 | 31 | GOOD |
| Eleven Flash v2.5 | 1547 | ~$150 | 10 | POOR value |
| Eleven Multilingual v2 | 1522 | ~$200 | 8 | POOR value |
*Hume is priced per minute; ~$43/1M chars is a rough estimate assuming speech at ~150 words (roughly 750 characters) per minute. Higher ELO per dollar = better value.
| Provider | $/1M | Best For |
|---|---|---|
| Kokoro 82M | $0.65 | Bulk processing, EN/JP/FR/KR/ZH |
| Neets.ai | ~$1.00 | Budget multilingual |
| StyleTTS 2 | $2.82 | Research, English |
| Google Standard | $4.00 | Enterprise reliability |
| Amazon Polly Std | $4.00 | AWS ecosystem |
| Provider | $/1M | Best For |
|---|---|---|
| Smallest.ai | $5.00 | Ultra-fast generation |
| Inworld TTS-1 | $5.00 | High quality on a budget |
| Inworld TTS Max | $10.00 | Best quality for price |
| Speechmatics | $11.00 | English neural quality |
| Cartesia Sonic | $11.00 | Voice agents, low latency |
| OpenAI TTS-1 | $15.00 | Reliability, simplicity |
| Fish Audio | $15.00 | Voice cloning, community |
| Azure Neural | $16.00 | Microsoft ecosystem |
| Provider | $/1M | Best For |
|---|---|---|
| Deepgram Aura-2 | $30 | Real-time voice agents |
| Google Chirp 3 | $30 | Premium Google voices |
| Murf AI | $30 | Content creation |
| MiniMax | $50-100 | Top-tier quality |
| Resemble AI | ~$99 | Advanced voice cloning |
| Provider | $/1M | Best For |
|---|---|---|
| ElevenLabs | $120-300 | Max quality + features |
| Play.ht | ~$125 | Creator tools + cloning |
| Hume AI | ~$43 est. (per-min pricing) | Emotional AI voices |
Comprehensive survey of tools and workflows for converting documents (PDF, EPUB, RSS feeds, web pages) into audio. Covers commercial services, open-source pipelines, and browser extensions as of March 2026.
| Tool | Formats | TTS Engine | Price | Platform |
|---|---|---|---|---|
| Speechify | PDF, EPUB, DOCX, web, images (OCR) | 200+ AI voices, 60+ languages | Free / ~$139/yr | All platforms |
| NaturalReader | PDF (OCR), TXT, DOC, EPUB | AI + neural voices | Free / ~$10/mo | All platforms |
| Voice Dream | PDF, EPUB, DAISY, DOC, HTML | 200+ voices, 30+ languages | ~$15 one-time | iOS/Mac only |
| Paper2Audio | PDF, EPUB, web, text | AI voices | Free (56 hrs/wk) | Web, mobile, Firefox |
| Narakeet | PDF, TXT, DOCX, PPT, MD | 800+ voices, 100+ languages | ~$6/30 min audio | Web, API, CLI |
| Wondercraft AI | PDF, URLs, text, docs | 500-1000+ voices; cloning | Free tier + paid | Web |
| Adobe Read Aloud | PDF only | System SAPI voices | Free (built-in) | Win/Mac/Web |
| Narration Box | PDF, text | Context-aware AI | Freemium | Web |
| ReadLoudly | — | 50+ AI voices | Free / Premium | Web |
DIY pipeline: pdftotext + Piper · Engine: Piper (local neural TTS, ONNX) · Price: Fully free · Platform: Linux/Mac/Win CLI
Fully offline, no API costs. Runs on CPU including Raspberry Pi. Example: `pdftotext input.pdf - | piper --model en_US-lessac-medium --output_file output.wav` (pdftotext writes extracted text to stdout when given `-` as the output file).
Limitation: No OCR (needs text-based PDFs), no smart layout handling.
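The same pipeline scripted for a folder of PDFs; a sketch that assumes pdftotext and piper are on PATH and the voice model is already downloaded:

```python
import subprocess
from pathlib import Path

def pdf_to_wav(pdf: Path, voice: str = "en_US-lessac-medium") -> None:
    text = subprocess.run(
        ["pdftotext", str(pdf), "-"],       # "-" sends extracted text to stdout
        capture_output=True, text=True, check=True,
    ).stdout
    subprocess.run(
        ["piper", "--model", voice, "--output_file", str(pdf.with_suffix(".wav"))],
        input=text, text=True, check=True,  # piper reads the text on stdin
    )

for pdf in Path("papers").glob("*.pdf"):
    pdf_to_wav(pdf)
```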
DIY pipeline: OpenAI TTS API · Engine: OpenAI TTS (tts-1, tts-1-hd, gpt-4o-mini-tts) · Price: $15-30/1M chars · Platform: Any (API)
13 voices, multiple output formats (MP3, Opus, AAC, FLAC, WAV). Typical novel (~90K words): ~$6.35 standard, ~$12.69 HD.
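A sketch of the synthesis half of such a pipeline with the official openai package (the speech endpoint caps input at 4,096 characters, hence the chunking; the fixed-size split and raw MP3 concatenation here are simplifications, and a real pipeline would split on sentence boundaries and merge with ffmpeg):

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chapter_to_mp3(text: str, path: str, chunk_size: int = 4000) -> None:
    # Naive fixed-size chunks; prefer sentence-boundary splits in production.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    with open(path, "wb") as out:
        for chunk in chunks:
            resp = client.audio.speech.create(
                model="tts-1",        # or tts-1-hd / gpt-4o-mini-tts
                voice="alloy",
                input=chunk,
            )
            out.write(resp.read())   # append this chunk's MP3 bytes
```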
Balabolka · Engine: Windows SAPI5 voices · Price: Free (freeware) · Platform: Windows only
Supports PDF, EPUB, DOC, MOBI, ODT, RTF, HTML, DjVu, FB2. Batch conversion. Export to WAV, MP3, OGG, WMA, MP4. Quality depends on installed SAPI5 voices.
Standard pdftotext produces jumbled output from multi-column layouts, so academic papers generally need layout-aware extraction (or OCR-backed tools) before synthesis.
| Tool | TTS Engines | Voice Clone | Chapters | Output | Audiobookshelf | Key Strength |
|---|---|---|---|---|---|---|
| ebook2audiobook | XTTS, Piper, StyleTTS2, F5-TTS | Yes | Yes | MP3, M4B, WAV | Yes | Most feature-rich; 1158 languages |
| epub_to_audiobook | OpenAI, Azure, Edge TTS | No | Yes | MP3 | Yes (optimized) | Clean, focused; WebUI available |
| epub2tts | Coqui, OpenAI, Edge, Kokoro | No | Yes | M4B | Yes | Kokoro variant "especially good and fast" |
| audiblez | Kokoro-82M | No | Yes | M4B | Yes | Orwell's Animal Farm in ~5 min on T4 GPU |
| abogen | Kokoro-82M (8 languages) | No | Yes | Audio + subtitles | Yes | Synced captions; markdown input support |
| App | TTS Engine | Price | Platform | Key Feature |
|---|---|---|---|---|
| ElevenReader | ElevenLabs AI voices | Free 10 hrs/mo; Ultra $11/mo | iOS, Android, Chrome | Top-tier voice quality |
| Speechify | 200+ AI voices | Free / ~$139/yr | All platforms | Universal "read anything" |
| Voice Dream | 200+ voices | ~$15 one-time | iOS/Mac | Best format compatibility (iOS) |
| Apple Books | System voices (60+ lang) | Free (built-in) | iOS, macOS | Two-finger swipe to read |
| Calibre TTS Plugin | SAPI5 voices | Free | Windows | MP3 audiobook creation |
| @Voice Aloud | System TTS (Android) | Free / Pro | Android | Broadest Android format support |
| Speech Central | System + AI voices | Free (for blind users) / Paid | iOS, Mac, Android | Best PDF cleanup, RSS integration |
| Readwise Reader | Unreal Speech AI | $8.99/mo | Web, iOS, Android | Best for annotators/power readers |
| Tool | Self-Hosted | Summarizes | Multi-Speaker | TTS Engine | Setup Effort |
|---|---|---|---|---|---|
| rss2podcast | Yes | Yes (AI) | No | Kokoro/Coqui/MLX | Medium |
| n8n workflow | Yes | Yes (Gemini) | Yes (2-person) | Kokoro | Medium |
| Podcastfy | Yes | Yes | Yes (conversational) | Configurable | Medium |
| Mozilla Blueprint | Yes | No | Yes (multi-speaker) | Kokoro-82M | Medium |
| TTSReader | No (web) | No | No | Web Speech API + cloud | Low |
rss2podcast reads the RSS feed → extracts articles → scrapes content → summarizes with AI → converts to podcast audio. The n8n workflow uses Google Gemini to write a two-person dialogue → Kokoro generates audio → FFmpeg merges into MP3. Both are fully self-hostable with zero TTS API costs.
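A stripped-down version of that flow: feedparser pulls the feed and edge-tts (from the prototyping tip above) handles free synthesis; the summarization step is omitted, so swap in your LLM of choice. The feed URL is a placeholder:

```python
# pip install feedparser edge-tts
import asyncio
import re

import edge_tts
import feedparser

async def feed_to_audio(url: str, limit: int = 3) -> None:
    feed = feedparser.parse(url)
    for entry in feed.entries[:limit]:
        summary = re.sub(r"<[^>]+>", "", entry.summary)        # strip HTML tags
        script = f"{entry.title}. {summary}"
        slug = re.sub(r"\W+", "-", entry.title.lower())[:40]   # filename-safe title
        await edge_tts.Communicate(script, "en-US-GuyNeural").save(f"{slug}.mp3")

asyncio.run(feed_to_audio("https://example.com/feed.xml"))
```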
| Service | TTS Engine | Price | Key Feature |
|---|---|---|---|
| BeyondWords | Google/AWS/Azure (500+ voices) | Freemium | WordPress/Ghost plugin, RSS ingestion |
| Speech Central | System + AI voices | Free (for blind users) / Paid | Built-in RSS reader + TTS |
| Google NotebookLM | Gemini-based AI | Free | Podcast-style AI discussion of docs |
| Wondercraft AI | 500+ voices; cloning | Free tier + paid | Multi-speaker podcast from URLs |
| Podcastle | AI + recording studio | Free tier + paid | All-in-one podcast platform |
| Extension | TTS Engine | Price | Key Feature |
|---|---|---|---|
| Read Aloud | Browser + Google WaveNet, AWS Polly, IBM Watson, Azure, OpenAI | Free | Power-user TTS; connects to premium cloud engines; open source |
| Speechify | Speechify AI voices | Free / Premium | Reads any webpage, Google Doc, PDF in browser |
| NaturalReader | NaturalReader AI voices | Free / Premium | Emails, websites, PDFs, Google Docs, Kindle |
| ElevenReader | ElevenLabs AI voices | Free / Ultra ($11/mo) | One-click save, sync to mobile, offline listening |
| Voice Out | 60+ languages, 100+ voices | Free / Premium | Google Docs, PDFs, webpages, books |
| Talkie | Browser Web Speech API | Free | Privacy-focused: all processing local, no cloud |
| Listening.com | Unknown | Free / Premium | Dedicated web page reading extension |
audiblez/epub2tts with Kokoro (local GPU) — $0
epub2tts-edge with Microsoft Edge TTS — $0 (free cloud API)
epub_to_audiobook + OpenAI Standard — ~$6.35
epub_to_audiobook + OpenAI HD — ~$12.69
ElevenLabs API — $20-50+ (plan dependent)
ElevenReader subscription — $11/mo (unlimited listening)
Best universal reader: Speechify (all platforms, all formats, Apple Design Award winner). Best voice quality: ElevenReader (ElevenLabs voices). Best free open-source pipeline: audiblez or epub2tts with Kokoro engine (82M params, near-commercial quality, Apache 2.0). Best for research papers: Paper2Audio or Google NotebookLM. Best self-hosted RSS-to-podcast: rss2podcast with Kokoro TTS or the n8n + Gemini + Kokoro workflow.
Technical deep-dives on 20+ open-source and commercial TTS models. Each entry covers architecture, parameter count, training data, license, languages, voice cloning method, latency, VRAM requirements, sample rate, and known limitations.
| **Chatterbox / Chatterbox Turbo (Resemble AI)** | |
|---|---|
| Architecture | Transformer-based with speech-token-to-mel decoder. Turbo: distilled one-step decoder (reduced from 10 diffusion steps to 1). Paralinguistic tags ([cough], [laugh], [chuckle]) native to Turbo. |
| Parameters | Original: 500M · Multilingual: 500M · Turbo: 350M |
| Training Data | Not publicly disclosed |
| Languages | Original/Turbo: English only. Multilingual: 23 languages with emotion control. |
| Voice Cloning | Zero-shot from 5-10s reference. Emotion exaggeration control slider (first among open-source). |
| Latency | Sub-200ms on GPU. Up to 6x faster than real-time. |
| VRAM | Entry: ~8 GB (RTX 3060Ti) · Mid: 16-24 GB · Turbo: lower than original |
| Sample Rate | ~24 kHz (estimated) |
| Safety | PerTh neural watermarking embedded; survives MP3 compression |
| Limitations | Turbo is English-only. Multilingual variant is heavier (500M). Can hallucinate on long text. |
| **Kokoro 82M** | |
|---|---|
| Architecture | Decoder-only (StyleTTS 2 + ISTFTNet). No diffusion, no encoder. Minimal overhead. |
| Parameters | 82M (one of the smallest high-quality TTS models) |
| Training Data | <100 hours curated, permissive audio with IPA labels. ~500 GPU hours on single A100 80GB. |
| Languages | v1.0: 8 languages with 54 voice packs. Trained on 13 core languages. |
| Voice Cloning | Limited; relies on voice packs. Not a zero-shot cloner. |
| Latency | 96x RT on basic cloud GPU. 210x on RTX 4090, 36x on free Colab T4. |
| VRAM | 2-3 GB (one of the most efficient) |
| Sample Rate | 24 kHz |
| Limitations | No voice cloning. Fewer voice packs than commercial. Less expressiveness range. |
| **Fish Speech / Fish Audio S1** | |
|---|---|
| Architecture | Dual Autoregressive (Dual-AR): Slow Transformer (hidden states + token logits) + Fast Transformer (codebook refinement). Online RLHF (GRPO). |
| Parameters | S1 flagship: 4B · S1-mini (distilled): 500M · Fish Speech v1.5: ~500M |
| Training Data | 2M+ hours. 300K+ hrs EN/ZH, 100K+ hrs JA. |
| Languages | 70+ via platform; strongest EN, ZH, JA. 13 core languages. |
| Voice Cloning | Zero-shot from ~10s reference. Fine-grained emotion tags: (angry), (furious), (frustrated), (whisper), (sob). |
| VRAM | S1-mini: ~4-6 GB. Full S1 (4B): API-only. |
| Sample Rate | 44.1 kHz (FishAudio platform) |
| Limitations | Full 4B S1 is API-only (not open-weight). EN/ZH much stronger than other languages. |
| **Dia / Dia2 (Nari Labs)** | |
|---|---|
| Architecture | Transformer-based, inspired by SoundStorm/Parakeet. Dia2: streaming architecture synthesizing from the first few tokens. |
| Parameters | Dia: 1.6B · Dia2: 1B and 2B variants |
| Languages | English only (both versions) |
| Voice Cloning | Zero-shot supported. Speaker tags [S1]/[S2] for multi-character dialogue. |
| Latency | Dia: ~40 tokens/s on A4000. Dia2: streaming, low-latency conversational use. |
| VRAM | Dia: ~10 GB. Dia2: varies by variant. |
| Sample Rate | ~24 kHz (estimated) |
| Key Feature | Natural nonverbal cues (laughter, coughing, sighs) in dialogue. Multi-speaker conversation in one pass. |
| Limitations | English only. Dia1 non-streaming. Dia2 still in active development. ~2 min max per generation. |
| **Qwen3-TTS** | |
|---|---|
| Architecture | Discrete multi-codebook LM for end-to-end speech. Qwen3-TTS-Tokenizer-12Hz (16-layer multi-codebook, 12.5 Hz). Non-DiT lightweight decoder. Dual-Track hybrid streaming: a single model supports both streaming and non-streaming. |
| Parameters | Large: 1.7B · Small: 600M |
| Languages | 10: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian. |
| Voice Cloning | Clone from 3 seconds. Also create voices from text descriptions ("what you imagine is what you hear"). |
| Latency | 97ms end-to-end (streaming). First audio from single character input. |
| VRAM | 1.7B: 6-8 GB (~3.89 GB practical). 0.6B: 4-6 GB. FlashAttention 2: 30-40% speedup. |
| Sample Rate | 12.5 Hz token rate; final audio likely 16-24 kHz |
| Benchmarks | SOTA on Seed-TTS benchmark. Lowest WER across all 10 languages vs commercial baselines. |
| Limitations | New (Jan 2026); ecosystem still maturing. 16 GB VRAM limiting for 1.7B multi-user. |
| **F5-TTS** | |
|---|---|
| Architecture | Diffusion Transformer (DiT) with ConvNeXt V2 backbone. Flow Matching for generation. |
| Parameters | 335M |
| Training Data | ~95-100K hours multilingual. 8x A100 GPUs, ~1 week. |
| Languages | English and Chinese |
| Voice Cloning | Zero-shot from 10-15s of clear reference audio. Flow Matching + DiT. |
| VRAM | ~6-8 GB inference. 12-16 GB recommended for comfort. |
| Sample Rate | 24 kHz |
| Limitations | EN/ZH only. Reference limited to ~15s. No fine-tuning. Slower than non-diffusion models. |
| **CosyVoice 2.0 (Alibaba)** | |
|---|---|
| Architecture | v2: Simplified LM (removed text encoder + speaker embedding). Pre-trained textual LLMs as backbone. Finite Scalar Quantization (FSQ) replacing VQ. Chunk-aware causal flow matching for unified streaming/non-streaming. |
| Parameters | v1: 300M · v2: 500M |
| Training Data | 1,500+ hours instructional dataset for accent/emotion/style control. |
| Languages | Chinese (with dialects), English, Japanese, Korean. Cross-lingual. |
| Voice Cloning | Zero-shot. Improved in v2 with FSQ-based tokens. |
| Latency | 150ms response time for streaming. 30-50% fewer pronunciation errors vs v1. |
| VRAM | ~4-6 GB estimated |
| Limitations | v1 superseded. Docs primarily in Chinese. Streaming quality depends on chunk size config. |
| **Sesame CSM-1B** | |
|---|---|
| Architecture | 1B transformer backbone + 100M transformer decoder (both Llama variants). Interleaved audio/text tokens. Mimi audio codec (split-RVQ, 1.1 kbps). Produces RVQ audio codes. |
| Parameters | ~1.1B total (1B backbone + 100M decoder) |
| Languages | English (primary) |
| Voice Cloning | Supported but "decent, not perfect." Captures some characteristics. |
| VRAM | CUDA GPU: ~4.5 GB · MLX (Mac): ~8.1 GB · CPU: ~8.5 GB RAM. Recommended: 8 GB+ GPU. |
| Key Feature | Multi-speaker dialogue with contextual awareness across turns. Natural pauses, "umms", "uhhs". |
| Limitations | License requires HuggingFace acceptance. English-focused. High latency vs streaming models. |
| **IndexTTS / IndexTTS-2** | |
|---|---|
| Architecture | Three modules: (1) Text-to-Semantic (AR framework), (2) Semantic-to-Mel (NAR), (3) Vocoder. IndexTTS-2 adds GPT latent representations, three-stage training, and a soft instruction mechanism (Qwen3-based). |
| Parameters | ~1B total |
| Training Data | 55,000 hours multilingual (Chinese, English, Japanese) |
| Languages | Chinese, English, Japanese |
| Voice Cloning | Zero-shot. Outperforms SOTA in speaker similarity. Disentangles emotion from speaker identity. |
| VRAM | ~8 GB. FP16 recommended. |
| Key Feature | First AR TTS with precise duration control (ms-level). Two modes: explicit token count or free AR. Emotion-controllable via multiple modalities. |
| Limitations | Commercial license separate. ZH/EN/JA only. Slow on RTX 3060 12GB. |
| **StyleTTS 2** | |
|---|---|
| Architecture | 8 original StyleTTS modules + style diffusion denoiser + prosodic encoders. PL-BERT text encoder. HiFi-GAN/iSTFTNet decoder with AdaIN. Adversarial training with large speech LMs. |
| Languages | English (primary) |
| Voice Cloning | Style-based transfer; not true zero-shot arbitrary cloning. |
| VRAM | ~2 GB (extremely efficient) |
| Sample Rate | 24 kHz |
| Limitations | English only. Style transfer less flexible than zero-shot. Older architecture (2023). No streaming. |
| **Orpheus TTS** | |
|---|---|
| Architecture | Built on Llama-3 backbone. Autoregressive, predicting SNAC audio tokens. |
| Parameters | 3B (best) · 1B · 400M · 150M (most efficient) |
| Training Data | 100,000+ hours English. Multilingual research preview (April 2025). |
| Languages | English (primary). Multilingual in research preview. |
| Voice Cloning | Supported via reference audio. |
| Latency | ~200ms default streaming. 100ms with input streaming. 25-50ms with KV caching. |
| VRAM | 3B FP16: ~15 GB · 3B GGUF quantized: <4 GB · 3B FP8: ~24 GB |
| Sample Rate | 24 kHz |
| Limitations | Full 3B needs 15GB+ VRAM. English-primary. Smaller variants trade quality for efficiency. |
| **Piper** | |
|---|---|
| Architecture | VITS (Variational Inference TTS). Transformer posterior encoder + normalizing flow decoder + HiFi-GAN vocoder. Non-autoregressive. ONNX-exported. eSpeak phonemizer. |
| Languages | Dozens of languages and accents via community voice packs |
| Voice Cloning | Not supported. Pre-trained voice packs only. |
| Latency | Sub-0.2 RTF on CPUs. 5x faster than cloud TTS latency for edge AI. |
| VRAM | CPU only. Raspberry Pi 4 compatible. INT8 on Android reduces memory 60%. |
| Sample Rate | 16-22 kHz (varies by model) |
| Limitations | No voice cloning. Quality below newer models. Limited voice variety. |
| **OpenVoice V2** | |
|---|---|
| Architecture | Decoupled TTS: separates tone color from content/style. Granular control over emotion, accent, rhythm, and pauses independently of tone color. |
| Languages | V2: English, Spanish, French, Chinese, Japanese, Korean |
| Voice Cloning | Instant from short reference. Accurate tone color. Cross-lingual supported. |
| VRAM | ~4-6 GB estimated |
| Limitations | Older architecture (2024). Quality surpassed by Chatterbox, Fish Speech, Qwen3-TTS. Primarily a cloning framework. |
| **XTTS-v2 (Coqui)** | |
|---|---|
| Architecture | Autoregressive transformer with speaker conditioning. Multiple speaker references + interpolation. |
| Parameters | 467M |
| Training Data | English: 541.7 hrs (LibriTTS-R) + 1812.7 hrs (LibriLight) + internal. 4x A100 80GB. |
| Languages | 17: en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko, hi |
| Voice Cloning | Zero-shot from 6 seconds. Cross-lingual cloning supported. |
| VRAM | ~6-8 GB inference |
| Sample Rate | 24 kHz |
| Limitations | Coqui shut down (2024); community-maintained. CPML license is restrictive (non-commercial). Quality varies by language. |
| **Bark (Suno)** | |
|---|---|
| Architecture | 3 transformer models called sequentially (similar to AudioLM). 4 sub-models: text-to-semantic, semantic-to-coarse, coarse-to-fine, audio generation. |
| Languages | Multilingual (English strongest) |
| Voice Cloning | No individualized cloning. Speaker/accent variation only. |
| VRAM | Full: ~12 GB · Small: ~8 GB · Half-precision: 50% reduction |
| Sample Rate | 24 kHz |
| Key Feature | Can generate non-speech audio: music, background noise, sound effects, laughter. |
| Limitations | ~13-14 sec max per generation. No voice cloning. Slow. English-centric. Suno shifted to music. |
| **Spark-TTS** | |
|---|---|
| Architecture | BiCodec (single-stream codec with semantic + global tokens for speaker) + Qwen2.5 LLM + chain-of-thought generation. No separate flow matching. |
| Parameters | 500M |
| Training Data | 100K hours |
| Languages | English, Chinese |
| Voice Cloning | Zero-shot. Controllable via gender, pitch, speaking rate. |
| Limitations | EN/ZH only. Newer model, smaller community. |
| **ChatTTS** | |
|---|---|
| Architecture | Transformer-based with autoregressive + non-autoregressive components. |
| Training Data | ~100,000 hours Chinese and English |
| Languages | Chinese, English |
| Voice Cloning | Not supported |
| VRAM | 4 GB+. RTF ~0.65 on 4090D. |
| Key Feature | Conversational optimization. Token-level prosodic control: [laugh], [uv_break], [lbreak]. |
| Limitations | CC BY-NC 4.0 (non-commercial only). No voice cloning. |
| **NeuTTS Air** | |
|---|---|
| Architecture | Qwen2-based LM backbone + NeuCodec audio codec (50Hz, single codebook, 0.8 kbps). End-to-end speech LM optimized for on-device. |
| Parameters | 748M |
| Languages | English (primary) |
| Voice Cloning | Instant zero-shot from 3 seconds of mono WAV audio. |
| Latency | RTF <0.5 on CPU (Intel i5, ARM RPi 5). ~50 tokens/s on CPU. |
| VRAM | Q4 GGUF: 400-600 MB · Q8 GGUF: ~800 MB. Deployable on 2GB+ devices. |
| Sample Rate | 24 kHz |
| Safety | PerTh watermarking built-in |
| Limitations | English only. Newer model, smaller community. |
| **MOSS-TTSD** | |
|---|---|
| Architecture | Built on Qwen3-1.7B-base. 8-layer RVQ codebook. AR modeling with Delay Pattern. MOSS-Audio-Tokenizer: 1.6B CNN-free tokenizer (Causal Transformer layers). Multi-head parallel prediction with delay scheduling. |
| Parameters | 1.6B to 8B. Audio tokenizer alone: 1.6B. |
| Languages | Multilingual (Chinese, English, Japanese, European languages) |
| Voice Cloning | Zero-shot from short references |
| VRAM | Recommended: 24 GB+ (RTX 3090/4090) |
| Key Feature | 60-minute single-session context. 1-5 speakers with flexible control and persona maintenance. Natural turn-taking and overlapping speech. |
| Limitations | High VRAM. Complex setup. New ecosystem. |
| **VibeVoice (Microsoft)** | |
|---|---|
| Architecture | Qwen2.5-1.5B backbone + sigma-VAE Acoustic Tokenizer (~340M encoder/decoder) + Diffusion Head (~123M). Ultra-low 7.5 Hz frame rate (vs typical 50-75 Hz). |
| Parameters | 1.5B · 7B · Realtime-0.5B |
| Languages | Multilingual. ASR variant: 50+ languages. |
| Voice Cloning | Zero-shot supported |
| VRAM | 1.5B: ~7 GB · 7B: ~24 GB · Realtime 0.5B: <2 GB |
| Key Feature | 90 minutes of speech, up to 4 speakers. Designed for podcasts, narration, multi-speaker content. |
| Limitations | CUDA 12.x required. Code pulled after misuse; community fork exists. Legal status ambiguous. |
| Use Case | Best Model | Why |
|---|---|---|
| Audiobooks (long-form) | VibeVoice-1.5B | 90 min, 4 speakers, MIT |
| Audiobooks (quality) | Chatterbox | Beats ElevenLabs in blind tests |
| Voice Assistants | Qwen3-TTS | 97ms streaming, 10 languages |
| Game Characters | Fish Audio S1 | Fine-grained emotion tags |
| Dialogue / Podcasts | Dia2 / MOSS-TTSD | Multi-speaker, nonverbal cues |
| On-Device / Edge | NeuTTS Air / Piper | CPU-only, no cloud needed |
| Duration Control | IndexTTS-2 | ms-precise timing + emotion |
| Emotional Expression | Hume Octave 2 | Understands meaning, not just sound |
| Multilingual (open) | Qwen3-TTS | 10 languages, SOTA benchmarks |
| Multilingual (comm.) | Cartesia Sonic 3 | 42 languages, sub-100ms |
| Budget / Speed | Kokoro-82M | 2-3 GB, 96x RT, Apache 2.0 |
| Runs on CPU | Pocket TTS (100M) | 47ms TTFA, no GPU |
| Voice Cloning (quality) | Chatterbox | 5-10s, 63.8% pref vs ElevenLabs |
| Voice Cloning (speed) | Qwen3-TTS | 3-second cloning |
| Hardware | Best Model | Notes |
|---|---|---|
| Raspberry Pi / ARM | Piper TTS | ONNX, near-instant |
| CPU only (quality) | Pocket TTS | 100M, 47ms TTFA |
| CPU + cloning | NeuTTS Air | 400 MB Q4, 3s clone |
| 2-4 GB VRAM | Kokoro-82M | 2-3 GB, 96x RT |
| 4-8 GB VRAM | Qwen3-TTS 0.6B | 97ms streaming |
| 8-16 GB VRAM | Chatterbox Turbo | 350M, beat ElevenLabs |
| 24 GB+ VRAM | MOSS-TTSD | 60-min sessions, 5 speakers |
Models that understand meaning, not just pronunciation. Built on LLM backbones (Llama, Qwen): Orpheus TTS, LLaSA, Qwen3-TTS, MOSS-TTSD, Spark-TTS, OuteTTS, NeuTTS Air. Enables semantic understanding — sarcasm sounds sarcastic, questions have natural rising intonation.
Purpose-built for multi-speaker dialogue: Dia/Dia2 (dialogue-first), Sesame CSM-1B (cross-turn context), MOSS-TTSD (60-min multi-party), ChatTTS (LLM assistant conversations), VibeVoice (90-min, 4 speakers).
Emotion control has moved from coarse categories to fine-grained: Fish Audio S1 (explicit tags), Chatterbox (continuous slider), Qwen3-TTS (text description), IndexTTS-2 (duration + emotion disentangled), GLM-TTS (RL-optimized).
Reference audio needed has dropped from 30-60s (2023) to 3s (2026). Leaders: Qwen3-TTS (3s), NeuTTS Air (3s on-device), Pocket TTS (5s, CPU), Fish Speech (10s capturing delivery style), GLM-TTS (RL-optimized).
Breakthrough in CPU-capable TTS. Pocket TTS: 100M params, 47ms, CPU-only. Sopro: 135M, 250ms. NeuTTS Air: 748M, Q4 fits in 400 MB. Piper: ONNX on RPi. Key enablers: CALM architecture, consistency models, GGUF quantization.
State Space Models: Cartesia Sonic (40-90ms). CALM: Pocket TTS (continuous, no discrete tokens). FSQ: CosyVoice2. BiCodec: Spark-TTS (decoupled streams). Delay Pattern: MOSS-TTSD. Ultra-low frame rate: VibeVoice (7.5 Hz vs 50-75 Hz typical).