Voice synthesis built for real-time use. Sonic 3 is the current flagship, delivering first audio in roughly 90ms across 40+ languages. The Line agent platform (launched 2026) bundles TTS, Ink-Whisper streaming STT, and LLM orchestration for voice agents in one stack.
Founded in 2023 by Stanford and Carnegie Mellon researchers. Integrates natively with LiveKit, Daily.co, and Twilio Voice for voice-agent deployments. SOC 2 Type II, HIPAA, and PCI Level 1 compliant.
System Verdict
Pick Cartesia if you are building a voice agent, phone system, or any product where sub-100ms latency determines user trust. Sonic 3 leads the real-time TTS category in 2026 benchmarks, with native WebSocket streaming and the Line platform now bundling STT, TTS, and LLM orchestration in one developer surface.
Skip it for long-form narration, podcasts, or audiobooks. Fish Audio S2 Pro and ElevenLabs both rank higher on expressiveness and emotional range. Cartesia optimizes for speed, not nuance.
Who pays which tier: Free tier for prototyping. Pro $4/mo (annual) for solo devs piloting voice agents. Startup $39/mo (annual) for teams shipping production agents at modest volume. Scale $239/mo (annual) for sustained high-volume workloads. Enterprise for on-prem, BAA-eligible, and custom models.
Key Facts
| Attribute | Detail |
|---|---|
| Flagship model | Sonic 3 (~90ms time-to-first-audio) |
| Speech-to-text | Ink-Whisper streaming, $0.13/hr |
| Agent platform | Line (TTS + STT + LLM orchestration), launched 2026 |
| Languages | 40+ with native prosody, ~95% world population coverage |
| Indian-language coverage | 9 Indian languages including Hindi at native-speaker quality |
| Voice cloning | Instant clone in ~10 seconds (no clone fee) + Professional fine-tuned voices |
| Streaming | WebSocket, bidirectional audio |
| Integrations | LiveKit, Daily.co, Twilio Voice |
| SDKs | Python, Node.js; cURL for direct HTTP calls |
| Pricing model | Bundled credits + prepaid Agent dollars; TTS bills at 15 credits per second of audio |
| Compliance | SOC 2 Type II, HIPAA, PCI Level 1 |
Every data point above was verified against vendor sources on 2026-05-13. See Sources.
What it actually is
Cartesia provides a developer API, streaming reliability, and end-to-end agent infrastructure to teams shipping voice agents.
Sonic 3 handles the default case in roughly 90ms time-to-first-audio, with global P50-to-P99 latency benchmarks that competing TTS APIs do not match. The 2026 product expansion added Ink-Whisper (streaming STT at $0.13/hr), and the Line platform now wraps STT, TTS, and LLM orchestration into a single agent stack billed via prepaid Agent dollars.
The moat is the combination of architecture and integration depth. Competing TTS APIs ship streaming, but few maintain sub-100ms time-to-first-audio at scale, and none have the same native hooks into LiveKit and Twilio. Instant voice cloning from a 10-second sample covers most production scenarios.
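Time-to-first-audio figures like the ~90ms quoted above are easiest to trust when you can reproduce them on your own network. The following is a minimal sketch of such a measurement, not Cartesia's benchmark method: it times how long a streaming call takes to yield its first non-empty audio chunk and summarizes P50 and P99 over repeated runs. `fake_tts_stream` is a hypothetical stand-in; a real test would replace it with a generator that opens the vendor's streaming connection inside the timed span.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio(stream: Iterable[bytes]) -> float:
    """Return seconds elapsed until the stream yields its first non-empty chunk."""
    start = time.perf_counter()
    for chunk in stream:
        if chunk:  # ignore empty frames that carry no audio
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


def latency_summary(make_stream: Callable[[], Iterable[bytes]], runs: int = 50) -> dict:
    """Measure TTFA over several runs and report P50 and P99 in milliseconds."""
    samples_ms = [time_to_first_audio(make_stream()) * 1000 for _ in range(runs)]
    # quantiles(n=100) returns 99 cut points: index 49 is P50, index 98 is P99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50_ms": cuts[49], "p99_ms": cuts[98], "runs": runs}


def fake_tts_stream(delay_s: float = 0.005) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call (assumption, not a real API)."""
    time.sleep(delay_s)  # simulated network and model latency
    yield b"\x00" * 320
    yield b"\x00" * 320
```

Calling `latency_summary(fake_tts_stream, runs=20)` returns a dictionary with `p50_ms`, `p99_ms`, and `runs`. Because the stand-in is a generator, its body (including the simulated delay) only starts on the first iteration, so the delay falls inside the timed span; a real client wrapper should be written the same way.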
When to pick Cartesia
- Building voice agents or conversational AI. Latency gaps beyond roughly 100ms erode user trust, and Sonic 3's ~90ms time-to-first-audio stays under that threshold.
- Phone and IVR systems. Native Twilio Voice integration plus sub-100ms TTFA makes it the default real-time voice stack.
- Game NPC dialogue at runtime. Dynamic voice generation during gameplay stays under perceptible-delay thresholds.
- Already on LiveKit or Daily.co. First-class integrations shorten deployment time significantly.
- Indian-language or multilingual products. 40+ languages including 9 Indian languages at native-speaker quality is rare in real-time TTS.
- Regulated voice workloads. SOC 2 Type II, HIPAA, and PCI Level 1 cover healthcare, finance, and payments use cases out of the box.
When to pick something else
- Long-form narration or podcasts: Fish Audio S2 Pro tops 2026 blind preference tests. ElevenLabs remains the creator default.
- Open-weight self-hosting: Fish Audio ships MIT weights. Voxtral ships CC BY-NC weights for non-commercial use.
- Cheapest multilingual commercial API: Voxtral at $0.016 per 1K chars undercuts Cartesia’s credit pricing at most volumes.
- Enterprise dubbing with lip-sync: Resemble AI ships Localize across 149 languages, plus deepfake detection.
- Personal document reading: Speechify solves consumption, not production.
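The Voxtral price comparison in the list above mixes units: Voxtral is quoted per 1K characters, while Cartesia TTS is billed at 15 credits per second of audio (see Key Facts). A rough conversion needs a speaking rate, which this review does not state. The sketch below assumes 15 characters per second and assumes a plan's credits are fully used, so treat the output as an estimate only.

```python
CREDITS_PER_SECOND = 15            # TTS billing rate quoted in this review
ASSUMED_CHARS_PER_SECOND = 15.0    # assumption: typical conversational speaking rate
VOXTRAL_USD_PER_1K_CHARS = 0.016   # Voxtral price quoted in this review

# (annual-billing price in USD per month, model credits) from this review's tiers
PLANS = {
    "Pro": (4, 100_000),
    "Startup": (39, 1_250_000),
    "Scale": (239, 8_000_000),
}


def usd_per_1k_chars(usd: float, credits: int,
                     chars_per_second: float = ASSUMED_CHARS_PER_SECOND) -> float:
    """Estimate USD per 1,000 synthesized characters if all plan credits are used."""
    usd_per_credit = usd / credits
    seconds_of_audio = 1000 / chars_per_second
    return usd_per_credit * CREDITS_PER_SECOND * seconds_of_audio


for name, (usd, credits) in PLANS.items():
    estimate = usd_per_1k_chars(usd, credits)
    print(f"{name}: ${estimate:.4f} per 1K chars vs Voxtral ${VOXTRAL_USD_PER_1K_CHARS}")
```

Under these assumptions the estimates come out near $0.040 (Pro), $0.031 (Startup), and $0.030 (Scale) per 1K characters, all above the quoted Voxtral rate. A slower or faster speaking rate shifts the numbers proportionally.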
Pricing
| Plan | Price (annual) | Model Credits | Agent (Line) Prepaid | Notes |
|---|---|---|---|---|
| Free | $0 | 20K | $1 | Prototyping, Sonic 3 access |
| Pro | $4/mo | 100K | $5 | Solo devs piloting agents |
| Startup | $39/mo | 1.25M | $49 | Production voice agents at modest volume |
| Scale | $239/mo | 8M | $299 | Sustained high-volume workloads |
| Enterprise | Custom | Custom | Custom | On-prem, BAA, custom models |
TTS is billed at 15 credits per second of generated audio. Instant voice cloning costs nothing to clone (1 credit per character at synthesis). Professional voice cloning costs 1M credits to train plus 1.5 credits per character. Ink-Whisper STT runs $0.13/hr. For a limited time, LLM usage during text-to-agent calls on Line is free.
Prices verified 2026-05-13 via cartesia.ai/pricing and Cartesia docs.
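To turn the table into a tier decision, it helps to convert expected audio minutes into credits. The sketch below uses only the credit allowances and the 15-credits-per-second TTS rate listed above. It assumes the allowance is a monthly budget with no overage purchases and ignores voice-clone, STT, and Agent-dollar charges, so it is a lower bound on the tier you need.

```python
import math
from dataclasses import dataclass

TTS_CREDITS_PER_SECOND = 15  # TTS billing rate from the pricing notes


@dataclass(frozen=True)
class Plan:
    name: str
    usd_per_month: float  # annual-billing price
    credits: int          # model credits included


PLANS = [
    Plan("Free", 0, 20_000),
    Plan("Pro", 4, 100_000),
    Plan("Startup", 39, 1_250_000),
    Plan("Scale", 239, 8_000_000),
]


def tts_credits(audio_seconds: float) -> int:
    """Credits consumed by generating the given number of seconds of audio."""
    return math.ceil(audio_seconds * TTS_CREDITS_PER_SECOND)


def smallest_plan(audio_minutes_per_month: float) -> str:
    """Name of the cheapest listed plan whose credits cover the TTS volume."""
    needed = tts_credits(audio_minutes_per_month * 60)
    for plan in PLANS:  # ordered from cheapest to most expensive
        if plan.credits >= needed:
            return plan.name
    return "Enterprise"  # volume exceeds every self-serve allowance
```

For example, `smallest_plan(60)` returns `"Pro"` (54,000 credits), and `smallest_plan(500)` returns `"Startup"` (450,000 credits).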
Against the alternatives
| | Cartesia Sonic 3 | ElevenLabs v3 | Fish Audio S2 Pro | Voxtral |
|---|---|---|---|---|
| Time-to-first-audio | ~90ms | 200-400ms streaming | Low, not sub-100ms | ~70ms |
| Voice cloning reference | 10+ sec instant | 1-5 min for best quality | Short samples | 3 sec |
| Languages | 40+ | 30+ | 80+ | 9 |
| Open weights | None | None | MIT | CC BY-NC 4.0 |
| Agent stack | Line (TTS + STT + LLM orchestration) | Conversational AI add-on | None native | None native |
| Voice agent integrations | LiveKit, Daily, Twilio | Some | None native | None native |
| Compliance | SOC 2 Type II, HIPAA, PCI L1 | SOC 2 | Limited | Limited |
| Best viewed as | Real-time agent specialist | Creator platform default | Quality + open-weight leader | Mistral-stack voice |
Failure modes
- Not tuned for long-form narration. Expressiveness and emotional range trail ElevenLabs and Fish Audio at equivalent speeds. Use it for agents, not audiobooks.
- Credit math is non-obvious. TTS at 15 credits per second of audio means a typical 30-second IVR turn burns 450 credits. Free tier 20K credits covers roughly 22 minutes of audio before the Pro tier becomes mandatory. Model your traffic before committing to Startup or Scale.
- Professional voice cloning has real upfront cost. Training a Professional voice clone takes 1M credits, which exceeds the Free and Pro allowances and consumes 80% of a Startup plan's 1.25M credits, before per-character billing. Instant cloning is the right starting point for most teams.
- Limited-time Line LLM pricing. Free LLM usage during text-to-agent calls is explicitly time-limited. Production buyers should plan for that line item to appear later.
- No consumer UI. API-only. Creators without engineering resources should pick ElevenLabs or Fish Audio.
- On-prem is Enterprise-only. Teams with data-residency requirements need the custom tier. Scale at $239 still uses the hosted API, even with HIPAA available.
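The credit arithmetic in the list above can be reproduced in a few lines. This is a back-of-the-envelope sketch using only figures from this review: 15 credits per second of TTS, the 20K free-tier allowance, and the 1M-credit Professional clone training fee. The function names are illustrative.

```python
TTS_CREDITS_PER_SECOND = 15
FREE_TIER_CREDITS = 20_000
PRO_CLONE_TRAINING_CREDITS = 1_000_000


def turn_credits(turn_seconds: float) -> float:
    """Credits burned by one spoken turn of the given length."""
    return turn_seconds * TTS_CREDITS_PER_SECOND


def audio_minutes(credits: float) -> float:
    """Minutes of TTS audio that a credit balance buys."""
    return credits / TTS_CREDITS_PER_SECOND / 60


def turns_covered(credits: float, turn_seconds: float) -> int:
    """Whole turns of a fixed length that fit in a credit balance."""
    return int(credits // turn_credits(turn_seconds))


print(turn_credits(30))                           # credits for a 30-second IVR turn
print(round(audio_minutes(FREE_TIER_CREDITS), 1)) # free-tier minutes of audio
print(turns_covered(FREE_TIER_CREDITS, 30))       # 30-second turns on the free tier
print(round(audio_minutes(PRO_CLONE_TRAINING_CREDITS) / 60, 1))  # clone fee in audio hours
```

The first two lines confirm the 450 credits per 30-second turn and the roughly 22 minutes of free-tier audio. The free tier fits 44 such turns, and the Professional clone training fee is equivalent to about 18.5 hours of generated audio.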
Methodology
This page was produced by the aipedia.wiki editorial pipeline, an automated system that ingests vendor documentation, verifies pricing and model details against primary sources, and generates the editorial analysis you are reading. No individual human wrote this review. Scoring follows the four-dimension rubric at /about/scoring/ (Utility, Value, Moat, Longevity). Last verified 2026-05-13 against Cartesia pricing, Sonic 3 page, and Cartesia docs.
FAQ
How does Cartesia latency compare to ElevenLabs? Sonic 3 delivers first audio in roughly 90ms, backed by global P50-to-P99 latency benchmarks. ElevenLabs streaming typically lands at 200-400ms. In a voice agent that gap is perceptible: Cartesia feels live, while ElevenLabs feels laggy.
What audio length is needed for voice cloning? Instant cloning works from ~10 seconds of clean reference audio. Professional fine-tuned voice clones use longer datasets and a 1M-credit training fee for production-grade quality.
Does Cartesia support long conversations? Yes. The model maintains prosody context across multiple turns, which keeps voice consistency stable across long voice-agent sessions. The Line platform layers turn-taking and interruption handling on top.
Can Cartesia handle non-English languages? Yes. It supports 40+ languages with native prosody, covering approximately 95% of the world population. 9 Indian languages, including Hindi, ship at native-speaker quality. Coverage is now broader than the late-2025 Sonic 2 stack and competitive with ElevenLabs in Western markets.
Is there a free tier? Yes. The free plan provides 20K model credits and $1 in prepaid Agent dollars for prototyping on Sonic 3. Paid plans start with Pro at $4/mo (annual) for pilots; production workloads typically start on Startup at $39/mo (annual).
Sources
- Cartesia pricing: current tier structure, credit allowances, Agent prepaid amounts
- Sonic 3 page: latency, language coverage, voice cloning, compliance posture
- Cartesia docs: API spec, SDKs, Line agent platform, Ink-Whisper STT pricing
- Inworld: Best TTS APIs for real-time voice agents 2026: latency benchmarks
Related
- Category: AI Voice / TTS
- Comparisons: Cartesia vs ElevenLabs, Cartesia vs Fish Audio, Cartesia vs Voxtral, Cartesia vs Resemble AI