SWEN Audio Registry

Best AI Audio Models 2026: Speech, Voice and Music

This page now uses a canonical audio registry instead of static lists. The ranking covers TTS, STT, voice cloning and music generation with a modality-specific score.

1 · Top Score

Eleven v3

ElevenLabs

94.2

Latest general-availability ElevenLabs speech model focused on expressive text-to-speech, multi-speaker control and production-grade voice output.

TTS · Realtime · Voice Clone · API

2 · Top Score

GPT-4o Transcribe

OpenAI

91.4

OpenAI speech-to-text model optimized for more accurate transcription than older Whisper-based production defaults.

STT · API

3 · Top Score

Speech 2.8

MiniMax

90.6

MiniMax speech stack with HD and turbo variants, native sound-tag support and high-fidelity cloning for developer-facing audio generation.

TTS · Realtime · Voice Clone · API

Full Ranking

Composite score built from quality, latency, control, value and API readiness.

0 open source

#	Model	Description	Company	Score	Quality	Latency	Control	Price	Highlights	Release
1	Eleven v3	Latest general-availability ElevenLabs speech model focused on expressive text-to-speech, multi-speaker control and production-grade voice output.	ElevenLabs	94.2	97	91	96	Paid plans from US$5/month	Primary use: Expressive TTS • Realtime: Yes • Languages: 70+	Feb 2026
2	GPT-4o Transcribe	OpenAI speech-to-text model optimized for more accurate transcription than older Whisper-based production defaults.	OpenAI	91.4	95	84	82	Pay-as-you-go via Transcription API	Primary use: High-accuracy STT • Context: 16k • Realtime: No	n/a
3	Speech 2.8	MiniMax speech stack with HD and turbo variants, native sound-tag support and high-fidelity cloning for developer-facing audio generation.	MiniMax	90.6	93	90	92	US$60/M chars (turbo) or US$100/M chars (HD)	Primary use: Realtime + HD TTS • Voice cloning: Yes • Sound tags: Native	n/a
4	GPT-Realtime-Whisper	Streaming speech-to-text model from OpenAI for low-latency transcript deltas and live audio applications.	OpenAI	89.7	90	96	80	Priced by audio duration	Primary use: Realtime STT • Context: 16k • Realtime: Native	n/a
5	Suno v4.5	Latest major Suno music-generation model tier focused on richer vocals, more accurate style following and consumer-grade music creation speed.	Suno	88.9	92	84	87	Paid plans from creator tiers	Primary use: Music generation • Commercial use: Plan-dependent • API: No public API	May 2025
6	Music 2.6	MiniMax music generation model in the same flagship family as Speech 2.8, aimed at API-driven music workflows rather than UI-only creation.	MiniMax	87.3	89	82	88	API-accessible via MiniMax platform	Primary use: Programmable music gen • API: Yes • Suite: Speech & Music	n/a
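The table lists only three of the five factors named for the composite (Quality, Latency, Control), so the published Score cannot be recomputed from the visible columns alone. The following is a small sketch, using numbers copied from the rows above, that compares each published Score with the plain mean of its three listed sub-scores. The function name `gap_to_subscore_mean` is illustrative and not part of the registry.

```python
# (model, published composite, quality, latency, control),
# copied from the Full Ranking table.
ROWS = [
    ("Eleven v3", 94.2, 97, 91, 96),
    ("GPT-4o Transcribe", 91.4, 95, 84, 82),
    ("Speech 2.8", 90.6, 93, 90, 92),
    ("GPT-Realtime-Whisper", 89.7, 90, 96, 80),
    ("Suno v4.5", 88.9, 92, 84, 87),
    ("Music 2.6", 87.3, 89, 82, 88),
]


def gap_to_subscore_mean(rows=ROWS):
    """Return {model: composite minus the plain mean of the three listed sub-scores}."""
    gaps = {}
    for name, composite, quality, latency, control in rows:
        mean = (quality + latency + control) / 3
        gaps[name] = round(composite - mean, 2)
    return gaps


if __name__ == "__main__":
    for name, gap in gap_to_subscore_mean().items():
        print(f"{name}: {gap:+.2f}")
```

For GPT-4o Transcribe, for instance, the published 91.4 sits 4.4 points above the mean of 95, 84 and 82 (87.0), which is consistent with the unlisted value and API-readiness factors, or unequal weights, contributing to the composite.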

Score Breakdown

Quality: top model reference

Latency: top model reference

Control: top model reference

FAQ

Why does this page no longer use fixed editorial lists?

Because the audio page now reads from the canonical SWEN Audio Registry. New speech, transcription and music entries can be refreshed automatically by the benchmark sync stack instead of requiring manual page edits.

What does the composite score optimize for?

The score balances output quality, latency, controllability, value and API readiness. That prevents a consumer-only product from outranking a production-ready API stack on hype alone.
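One way to picture this balance is a weighted average over the five factors. The sketch below is a minimal illustration only: the page names the factors but does not publish their weights, so every weight in `WEIGHTS`, the function name `composite_score` and both example products are assumptions, not the registry's actual formula or data.

```python
# Hypothetical weights: the page names these five factors
# but does not state how much each one counts.
WEIGHTS = {
    "quality": 0.35,
    "latency": 0.20,
    "control": 0.20,
    "value": 0.10,
    "api_readiness": 0.15,
}


def composite_score(subscores, weights=WEIGHTS):
    """Weighted average of 0-100 sub-scores, rounded to one decimal place."""
    missing = set(weights) - set(subscores)
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(subscores[name] * w for name, w in weights.items())
    return round(weighted_sum / total_weight, 1)


# Two made-up products: one with higher raw quality but almost no API,
# one slightly lower in quality but fully API-ready.
consumer_only = {"quality": 95, "latency": 85, "control": 85,
                 "value": 80, "api_readiness": 20}
api_stack = {"quality": 90, "latency": 88, "control": 88,
             "value": 80, "api_readiness": 95}

if __name__ == "__main__":
    print("consumer-only:", composite_score(consumer_only))
    print("API stack:", composite_score(api_stack))
```

Under these assumed weights the API-ready product ranks above the consumer-only one despite its lower quality sub-score, which is the behavior the answer above describes.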

Can music generators appear together with TTS and STT models?

Yes. This page is modality-first, not task-first. Audio includes speech synthesis, speech recognition and music generation, while each model keeps its own subcategory and descriptor set.
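A registry entry of this shape could be modeled roughly as follows. This is a sketch under stated assumptions: the class `AudioEntry`, its field names and the `"music"` subcategory label are hypothetical, while the names, scores and descriptor values are copied from the Full Ranking table.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AudioEntry:
    """Hypothetical shape of one registry entry (field names are assumptions)."""
    name: str
    company: str
    subcategory: str  # e.g. "tts", "stt", "music"
    score: float
    descriptors: dict = field(default_factory=dict)


ENTRIES = [
    AudioEntry("Suno v4.5", "Suno", "music", 88.9,
               {"Primary use": "Music generation", "API": "No public API"}),
    AudioEntry("Eleven v3", "ElevenLabs", "tts", 94.2,
               {"Primary use": "Expressive TTS", "Languages": "70+"}),
    AudioEntry("GPT-4o Transcribe", "OpenAI", "stt", 91.4,
               {"Primary use": "High-accuracy STT", "Context": "16k"}),
]


def rank(entries):
    """One ranking across the whole audio modality, highest score first."""
    return sorted(entries, key=lambda e: e.score, reverse=True)


def by_subcategory(entries):
    """Group entries by subcategory while keeping each entry's own descriptors."""
    groups = {}
    for entry in rank(entries):
        groups.setdefault(entry.subcategory, []).append(entry)
    return groups


if __name__ == "__main__":
    for position, entry in enumerate(rank(ENTRIES), start=1):
        print(position, entry.name, entry.subcategory, entry.score)
```

Keeping descriptors as a free-form mapping lets a music entry carry keys such as commercial-use terms that would make no sense for an STT entry, while the shared `score` field still allows a single cross-subcategory ranking.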
