making human voices
We achieved state-of-the-art (SoTA) voice cloning TTS with human-like prosody.
Meet Soul TTS, leading voice cloning with human-like prosody.
Soul TTS
Clone any voice with full customization & human-like prosody.
The reference speaker sets the voice identity. All models generate the target text shown below.
For each reference speaker and target text pair, we select the first result from every model. No cherry-picking. All models use default recommended settings.
Soul TTS captures reference speaker qualities and generates correct semantics with high fidelity. On an A100, generating 30 seconds of audio takes about 1.2 s for Soul TTS, 14 s for Higgs Audio, and 50 s for VibeVoice.
These prompts are stylistically in-distribution for Soul TTS. Better samples for the other models could likely be obtained via prompt or hyperparameter optimization. More optimized implementations may also yield speed-ups.
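To put the timing figures above on a common scale, the short sketch below converts them into a real-time factor (generation time divided by audio duration). The dictionary and function names are illustrative, and the inputs are the approximate numbers quoted above, not new measurements.

```python
# Real-time factor (RTF) = generation time / audio duration; lower is faster.
AUDIO_SECONDS = 30.0

# Approximate generation times quoted above (seconds per 30 s of audio, A100).
generation_seconds = {
    "Soul TTS": 1.2,
    "Higgs Audio": 14.0,
    "VibeVoice": 50.0,
}


def real_time_factor(gen_seconds: float, audio_seconds: float = AUDIO_SECONDS) -> float:
    """Seconds of compute needed per second of generated audio."""
    return gen_seconds / audio_seconds


rtf = {name: real_time_factor(t) for name, t in generation_seconds.items()}

for name, value in rtf.items():
    # 1 / RTF says how many seconds of audio are produced per second of compute.
    print(f"{name}: RTF ~{value:.2f}, ~{1 / value:.1f}x real time")
```

On these figures Soul TTS works out to roughly 25x real time, while a 50 s generation time for 30 s of audio is slower than real time.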
Compare against leading open-source TTS models:
“Dobby has to tell you something, sir. Something terrible. You must not trust this place. Dobby was bound once, bound forever to serve. But no longer. Now, Dobby has to use his freedom to keep you safe.”
“Messi picks it up just inside the half, he turns away from one, he glides past another and the whole stadium can feel it coming! He's driving at the heart of the defense now, nobody can get near him. Oh! He's curled it into the top corner and this place has lost its mind! Absolutely magical from the little genius!”
“So I'm standing there trying to act normal, right, and I reach for a drink and somehow knock the entire tray out of the waiter's hands and everyone just turns and stares at me. And I'm thinking, you know, please just let the floor open up and swallow me whole, but no of course not, the night was just getting started.”
Reference audio conditions the decoder through cross-attention. The model learns to denoise latents into speech that matches both the speaker's voice and the text content. Architecture inspired by Jordan Darefsky.
Rectified flow learns straight paths from noise to data. Fewer steps, faster inference, better quality.
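A minimal sketch of the straight-path idea, assuming the usual rectified flow setup: training interpolates linearly between a noise sample and a data sample, and sampling integrates the learned velocity with a few Euler steps. The function names and the toy velocity field are illustrative stand-ins, not the Soul TTS implementation.

```python
import numpy as np


def interpolate(noise, data, t):
    """Straight-line training path: x_t = (1 - t) * noise + t * data."""
    return (1.0 - t) * noise + t * data


def velocity_target(noise, data):
    """Regression target for the velocity network; constant along the path."""
    return data - noise


def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


# Toy stand-in for a trained network (assumption): the exact velocity field
# when the data distribution is a single point `target`.
rng = np.random.default_rng(0)
target = np.array([0.5, -1.0, 2.0])
noise = rng.standard_normal(3)


def toy_velocity(x, t):
    return (target - x) / (1.0 - t)


few = euler_sample(toy_velocity, noise, num_steps=2)
many = euler_sample(toy_velocity, noise, num_steps=50)
print(np.allclose(few, many))
```

Because the path in this toy case is exactly straight, two Euler steps land on the same point as fifty. A real model only approximates straight paths, so the step count remains a quality and speed trade-off.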
Generate and play in parallel. First audio arrives before generation finishes.
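One way to picture generating and playing in parallel is a producer and consumer sharing a small buffer. The sketch below is an assumption about the general pattern, not the actual Soul TTS pipeline: `generate_chunks` is a hypothetical stand-in for the model, and "playing" just appends samples to a list.

```python
import queue
import threading


def generate_chunks(num_chunks, chunk_size=4):
    """Hypothetical stand-in for the model: yields audio chunks one at a time."""
    for i in range(num_chunks):
        yield [float(i)] * chunk_size


def producer(chunks, buffer, events):
    for chunk in chunks:
        buffer.put(chunk)  # blocks while the bounded buffer is full
        events.append("generated")
    buffer.put(None)  # sentinel: no more audio is coming
    events.append("generation_done")


def consumer(buffer, events, played):
    while True:
        chunk = buffer.get()
        if chunk is None:
            break
        played.extend(chunk)  # a real player would write to the audio device here
        events.append("played")


def stream(num_chunks=5):
    buffer = queue.Queue(maxsize=1)  # small buffer keeps generation and playback in step
    events, played = [], []
    worker = threading.Thread(
        target=producer, args=(generate_chunks(num_chunks), buffer, events)
    )
    worker.start()
    consumer(buffer, events, played)
    worker.join()
    return events, played


events, played = stream()
print(events.index("played") < events.index("generation_done"))
```

The bounded queue is a design choice made for this sketch: it applies backpressure, so playback of the first chunk always starts before the generator has finished.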
Single attention operation over all context. The denoiser sees itself, the speaker, and the text simultaneously.
attn = softmax(Q @ K.T / √d) @ V
K, V = cat(self_kv, spk_kv, txt_kv)
Scale matters. Diverse data teaches natural variation.
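The single attention operation over self, speaker, and text context can be sketched in NumPy as follows. This is a simplified single-head version with no batching or masking; the sequence lengths, the width `d`, and the helper names are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(q, self_kv, spk_kv, txt_kv):
    """One softmax over keys/values concatenated from all three streams."""
    k = np.concatenate([self_kv[0], spk_kv[0], txt_kv[0]], axis=0)
    v = np.concatenate([self_kv[1], spk_kv[1], txt_kv[1]], axis=0)
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8
n_latent, n_spk, n_txt = 6, 4, 5  # illustrative sequence lengths


def make_kv(n):
    """Random (keys, values) pair standing in for projected activations."""
    return rng.standard_normal((n, d)), rng.standard_normal((n, d))


q = rng.standard_normal((n_latent, d))
out, weights = joint_attention(q, make_kv(n_latent), make_kv(n_spk), make_kv(n_txt))
print(out.shape, weights.shape)
```

Each query row's weights sum to one across all three sources, so the latent, speaker, and text tokens compete for attention in one normalization instead of being mixed in separate layers.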
Amplify the signal. CFG steers generation toward the conditioning, trading diversity for fidelity.
ε = ε_uncond + s(ε_cond - ε_uncond)
separate spk/txt scales, temporal score rescaling
DAC-VAE gives us native continuous latents. No discrete bottleneck, no quantization artifacts.
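The guidance formula above, together with one possible reading of "separate spk/txt scales", can be sketched as below. How the two scales are actually combined in Soul TTS is not stated here, so `cfg_two_scales` is an assumption (independent guidance directions for speaker and text), and temporal score rescaling is left out.

```python
import numpy as np


def cfg(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps = eps_uncond + s * (eps_cond - eps_uncond)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def cfg_two_scales(eps_uncond, eps_spk, eps_txt, spk_scale, txt_scale):
    """Assumed composition: each condition gets its own direction and scale."""
    return (
        eps_uncond
        + spk_scale * (eps_spk - eps_uncond)
        + txt_scale * (eps_txt - eps_uncond)
    )


# Toy predictions standing in for three forward passes of the denoiser.
eps_uncond = np.array([0.0, 1.0])
eps_spk = np.array([1.0, 3.0])   # conditioned on the speaker only
eps_txt = np.array([-1.0, 2.0])  # conditioned on the text only

single = cfg(eps_uncond, eps_spk, scale=2.0)
combined = cfg_two_scales(eps_uncond, eps_spk, eps_txt, spk_scale=2.0, txt_scale=1.5)
print(single, combined)
```

With a scale of 1 the guided prediction equals the conditional one. Larger scales push further along the conditioning direction, which is the fidelity versus diversity trade-off described above.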