making human voices

We achieved state-of-the-art (SoTA) voice cloning TTS with human-like prosody.

Meet Soul TTS, leading voice cloning with human-like prosody.

Soul TTS

Clone any voice with full customization & human-like prosody.

The reference speaker sets the voice identity. All models generate the target text shown below.

For each reference speaker and target text pair, we select the first result from every model. No cherry-picking. All models use default recommended settings.

Soul TTS captures reference speaker qualities and generates correct semantics with high fidelity. Per 30 seconds of audio, generation takes ~1.2s, ~14s, and ~50s respectively for Soul TTS, Higgs Audio, and VibeVoice on an A100.
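The timings above can be turned into real-time factors (RTF, generation time divided by audio duration). A minimal sketch using only the approximate numbers quoted above; the function name is illustrative, not part of any Soul TTS API.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of audio produced (lower is faster)."""
    return generation_seconds / audio_seconds


# Approximate A100 timings quoted above, per 30 s of generated audio.
timings = {"Soul TTS": 1.2, "Higgs Audio": 14.0, "VibeVoice": 50.0}
rtf = {name: real_time_factor(seconds, 30.0) for name, seconds in timings.items()}

for name, value in rtf.items():
    print(f"{name}: RTF ~ {value:.2f}")
```

An RTF below 1 means audio is produced faster than it plays back. The Soul TTS value here agrees with the RTF <0.05 figure listed in the sampling settings.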

These prompts are stylistically in-distribution for Soul TTS. Better samples for the other models could likely be obtained via prompt or hyperparameter optimization. More optimized implementations may also yield speed-ups.

Compare against leading open-source TTS models:

sample 1
reference speaker
target text

“Dobby has to tell you something, sir. Something terrible. You must not trust this place. Dobby was bound once, bound forever to serve. But no longer. Now, Dobby has to use his freedom to keep you safe.”

VibeVoice (latest)
Soul TTS
Higgs Audio (latest)
sample 2
reference speaker
target text

“Messi picks it up just inside the half, he turns away from one, he glides past another and the whole stadium can feel it coming! He's driving at the heart of the defense now, nobody can get near him. Oh! He's curled it into the top corner and this place has lost its mind! Absolutely magical from the little genius!”

VibeVoice (latest)
Soul TTS
Higgs Audio (latest)
sample 3
reference speaker
target text

“So I'm standing there trying to act normal, right, and I reach for a drink and somehow knock the entire tray out of the waiter's hands and everyone just turns and stares at me. And I'm thinking, you know, please just let the floor open up and swallow me whole, but no of course not, the night was just getting started.”

VibeVoice (latest)
Soul TTS
Higgs Audio (latest)

Reference audio conditions the decoder through cross-attention. The model learns to denoise latents into speech that matches both the speaker's voice and the text content. Architecture inspired by Jordan Darefsky.

Audio Input: ≤120s @ 44.1kHz → DAC-VAE encode → continuous latents (44.1kHz → ~86Hz)
Speaker Encoder: Causal Transformer · patch_size=4 (2560 → 640 seq len) · speaker embedding · K, V projection
Text Encoder: Bidirectional Transformer · UTF-8 bytes, max 768
Diffusion Decoder: Joint Self-Cross Attention, K,V = concat(self, speaker, text) · QK-norm · Gated attention · SwiGLU MLP · AdaLN · LoRA adapters
Rectified Flow: noise → latent audio, 30 steps
DAC-VAE Decode: continuous latents → DAC-VAE decoder → 44.1kHz output (soul's audio)
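The sequence lengths in the diagram follow from simple arithmetic. A rough sketch, assuming a hop of 512 samples per latent frame (inferred from 44.1kHz → ~86Hz, not stated on this page); the constant and helper names are hypothetical.

```python
SAMPLE_RATE_HZ = 44_100
HOP_SAMPLES = 512  # ASSUMPTION: inferred from 44.1 kHz -> ~86 Hz, not given in the source
PATCH_SIZE = 4     # from the speaker encoder in the diagram


def latent_frames(seconds: float) -> int:
    """Number of whole latent frames covering `seconds` of audio."""
    return int(seconds * SAMPLE_RATE_HZ // HOP_SAMPLES)


def patched_length(frames: int) -> int:
    """Sequence length after grouping frames into patches (ceiling division)."""
    return -(-frames // PATCH_SIZE)


latent_rate_hz = SAMPLE_RATE_HZ / HOP_SAMPLES
print(f"latent rate: {latent_rate_hz:.1f} Hz")
print(f"2560 frames -> {patched_length(2560)} patched positions")
print(f"2560 frames cover about {2560 / latent_rate_hz:.1f} s of audio")
```

Under this assumption, patching by 4 reproduces the 2560 → 640 reduction shown in the diagram.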

Rectified flow learns straight paths from noise to data. Fewer steps, faster inference, better quality.

path: t=1.0 (noise) → t=0.75 → t=0.5 → t=0.25 → t=0 (audio)
sampling: 30 steps
sampler: Euler ODE
CFG scale: 3.0 (joint)
RTF: <0.05
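A fixed-step Euler sampler for this kind of flow can be sketched as follows. This is an illustration under assumptions, not the Soul TTS implementation: the velocity is taken to point from data toward noise (so sampling subtracts it while t goes from 1 to 0), and `toy_velocity` with its `TARGET` is a hypothetical stand-in for the trained denoiser.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    noise: Vector,
    num_steps: int = 30,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=1 (noise) down to t=0 (data)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt               # current time, always > 0 here
        v = velocity(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]  # step toward t=0
    return x


# Hypothetical "clean latent" used only to build a toy velocity field.
TARGET = [0.5, -1.0, 2.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    # On the straight path x_t = target + t * (noise - target),
    # the velocity is (noise - target) = (x_t - target) / t.
    return [(xi - ti) / t for xi, ti in zip(x, TARGET)]


sample = euler_sample(toy_velocity, noise=[1.3, 0.2, -0.7], num_steps=30)
print(sample)
```

Because the toy path is exactly straight, Euler steps land on the target up to floating-point error; a learned velocity field is only approximately straight, which is why the step count still matters in practice.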

Generate and play in parallel. First audio arrives before generation finishes.

generate
playback
latent prefix encoder variable block size
TTFB optimization shorter initial blocks
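One way to picture "shorter initial blocks" is a schedule that starts small and grows, so the first audio block is ready quickly. A hypothetical sketch: the block sizes, growth factor, and function name are all assumptions for illustration, not values from Soul TTS.

```python
from typing import List


def block_schedule(
    total_frames: int,
    first_block: int = 32,   # ASSUMPTION: illustrative size of the first block
    growth: int = 2,         # ASSUMPTION: each block is this many times larger
    max_block: int = 512,    # ASSUMPTION: cap on block size
) -> List[int]:
    """Split `total_frames` into blocks that start short and grow up to a cap."""
    blocks: List[int] = []
    size = first_block
    remaining = total_frames
    while remaining > 0:
        take = min(size, remaining)
        blocks.append(take)
        remaining -= take
        size = min(size * growth, max_block)
    return blocks


blocks = block_schedule(2560)
print(blocks)
```

Playback can begin once the first short block is decoded, while later and larger blocks are generated in the background.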

Single attention operation over all context. The denoiser sees itself, the speaker, and the text simultaneously.

Q: denoiser
K,V sources: self, speaker, text
attn = softmax(Q @ K.T / √d) @ V
K,V = cat(self_kv, spk_kv, txt_kv)
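The two formulas above can be written out as a small single-head example. A minimal pure-Python sketch; it omits the QK-norm, gating, and multi-head details, and the function names are illustrative.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
KV = Tuple[Matrix, Matrix]  # (keys, values), one row per position


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(q: Matrix, self_kv: KV, spk_kv: KV, txt_kv: KV) -> Matrix:
    """attn = softmax(Q @ K.T / sqrt(d)) @ V with K,V = cat(self, speaker, text)."""
    keys = self_kv[0] + spk_kv[0] + txt_kv[0]      # concatenate along sequence
    values = self_kv[1] + spk_kv[1] + txt_kv[1]
    d = len(q[0])
    out: Matrix = []
    for query in q:
        scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out


# One query position attending over one self, one speaker, and one text position.
q = [[1.0, 0.0]]
zero_key = [[0.0, 0.0]]
result = joint_attention(
    q,
    self_kv=(zero_key, [[1.0, 1.0]]),
    spk_kv=(zero_key, [[2.0, 2.0]]),
    txt_kv=(zero_key, [[3.0, 3.0]]),
)
print(result)
```

With all keys equal, the softmax weights are uniform, so the output is the average of the three value rows; this makes it easy to check that every source contributes to a single attention operation.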

Scale matters. Diverse data teaches natural variation.

audio: 160K hours
steps: 800K
batch size: 768
compute: TPU v4-64
optimizer: Muon
precision: BF16
dropout: 10% (spk/txt independent)
transcription: WhisperD (diarization + events)
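The 10% dropout with independent speaker and text handling can be sketched as below. This assumes "independent" means each condition is dropped by its own coin flip during training; the function name and the use of `None` for a dropped condition are hypothetical.

```python
import random
from typing import Optional, Tuple


def drop_conditioning(
    speaker: str,
    text: str,
    p_drop: float = 0.10,
    rng: Optional[random.Random] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Drop speaker and text conditioning independently, each with prob. p_drop."""
    rng = rng or random.Random()
    spk = None if rng.random() < p_drop else speaker   # independent draw 1
    txt = None if rng.random() < p_drop else text      # independent draw 2
    return spk, txt


rng = random.Random(0)
trials = [drop_conditioning("ref_audio", "hello", rng=rng) for _ in range(20_000)]
spk_rate = sum(spk is None for spk, _ in trials) / len(trials)
txt_rate = sum(txt is None for _, txt in trials) / len(trials)
print(f"speaker dropped: {spk_rate:.3f}, text dropped: {txt_rate:.3f}")
```

Training with some conditions removed is what lets the same model produce the unconditional predictions that classifier-free guidance needs at sampling time.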

Amplify the signal. CFG steers generation toward the conditioning, trading diversity for fidelity.

joint 2× NFE
ε = ε_uncond + s(ε_cond - ε_uncond)
independent 3× NFE
separate spk/txt scales
alternating 2× NFE
temporal score rescaling
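The joint formula above, ε = ε_uncond + s(ε_cond - ε_uncond), costs two network evaluations per step (one conditional, one unconditional). A minimal sketch of the combination step only; the vectors are toy values and the function name is illustrative. The independent and alternating modes are not shown.

```python
from typing import List


def joint_cfg(eps_cond: List[float], eps_uncond: List[float], scale: float) -> List[float]:
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy predictions standing in for the two network evaluations of one step.
eps_cond = [1.0, -0.5]
eps_uncond = [0.0, 0.5]

guided = joint_cfg(eps_cond, eps_uncond, scale=3.0)
print(guided)
```

With scale 1 the result equals the conditional prediction; larger scales push further along the direction from unconditional to conditional, which is the diversity-for-fidelity trade described above.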

DAC-VAE gives us native continuous latents. No discrete bottleneck, no quantization artifacts.

input 44.1kHz PCM
DAC-VAE encoder continuous latents
model latents diffusion target