making human voices
We achieved state-of-the-art voice cloning and human-like prosody.
Most AI voices sound unnatural. We built our models a different way.
Reference audio conditions the decoder through cross-attention. The model learns to denoise latents into speech that matches both the speaker's voice and the text content.
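A minimal sketch of that conditioning path, assuming a hypothetical PyTorch block in which noisy speech latents (queries) attend to encoded reference audio (keys/values); names and shapes are illustrative, not the production model:

    import torch
    import torch.nn as nn

    class ReferenceCrossAttention(nn.Module):
        # Noisy speech latents attend to reference-audio embeddings, so each
        # denoising step can pull the speaker's timbre and prosody from the clip.
        def __init__(self, dim: int, n_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, latents: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
            # latents: (batch, T_latent, dim)  noisy speech latents being denoised
            # ref:     (batch, T_ref, dim)     encoded reference audio
            out, _ = self.attn(query=self.norm(latents), key=ref, value=ref)
            return latents + out  # residual update nudged by the reference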
Rectified flow learns straight paths from noise to data. Fewer steps, faster inference, better quality.
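A minimal sketch of a rectified-flow objective and Euler sampler under standard assumptions; the model(x_t, t) signature is hypothetical:

    import torch

    def rectified_flow_loss(model, x1):
        # x1: clean speech latents, x0: Gaussian noise.
        # The straight path x_t = (1 - t) * x0 + t * x1 has constant velocity
        # x1 - x0, which the model learns to predict at a random t.
        x0 = torch.randn_like(x1)
        t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1)
        xt = (1 - t) * x0 + t * x1
        v_pred = model(xt, t)
        return ((v_pred - (x1 - x0)) ** 2).mean()

    def sample(model, shape, steps=8, device="cpu"):
        # Straight paths tolerate coarse Euler steps: few steps, fast inference.
        x = torch.randn(shape, device=device)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((shape[0], 1, 1), i * dt, device=device)
            x = x + model(x, t) * dt
        return x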
Generate and play in parallel. First audio arrives before generation finishes.
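One way to overlap generation and playback, sketched as a producer/consumer queue; generate_chunks and play_chunk are hypothetical stand-ins for the streaming generator and the audio device:

    import queue
    import threading

    def stream(generate_chunks, play_chunk):
        # generate_chunks: yields audio chunks as they finish denoising
        # play_chunk: blocking call that plays one chunk on the audio device
        q = queue.Queue(maxsize=4)

        def producer():
            for chunk in generate_chunks():
                q.put(chunk)      # backpressure if playback falls behind
            q.put(None)           # sentinel: generation finished

        threading.Thread(target=producer, daemon=True).start()
        while (chunk := q.get()) is not None:
            play_chunk(chunk)     # first chunk plays while later ones generate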
Single attention operation over all context. The denoiser sees itself, the speaker, and the text simultaneously.
    K, V = cat(self_kv, spk_kv, txt_kv)
    attn = softmax(Q @ K.T / √d) @ V

Scale matters. Diverse data teaches natural variation.
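The single softmax over concatenated context above, written out as a runnable sketch; tensor names and shapes are illustrative:

    import torch
    import torch.nn.functional as F

    def joint_attention(q, self_kv, spk_kv, txt_kv):
        # q: (batch, T, d) queries from the denoiser stream.
        # Each *_kv is a (K, V) pair with shape (batch, T_*, d).
        k = torch.cat([self_kv[0], spk_kv[0], txt_kv[0]], dim=1)
        v = torch.cat([self_kv[1], spk_kv[1], txt_kv[1]], dim=1)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ v  # one softmax over all context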
Amplify the signal. CFG steers generation toward the conditioning, trading diversity for fidelity.
    ε = ε_uncond + s(ε_cond - ε_uncond)
    separate spk/txt scales
    temporal score rescaling

DAC-VAE gives us native continuous latents. No discrete bottleneck, no quantization artifacts.
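A minimal sketch of the guidance rule above with separate speaker and text scales; the exact decomposition and scale values are illustrative, and temporal score rescaling is omitted:

    import torch

    def guided_eps(eps_uncond, eps_spk, eps_txt, s_spk=2.0, s_txt=3.0):
        # eps_uncond: prediction with all conditioning dropped
        # eps_spk:    prediction with only the speaker condition
        # eps_txt:    prediction with speaker + text conditions
        # Separate scales trade diversity for fidelity on each axis independently.
        return (eps_uncond
                + s_spk * (eps_spk - eps_uncond)
                + s_txt * (eps_txt - eps_spk))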