making human voices

We achieved state-of-the-art voice cloning and human-like prosody.

Most AI voices sound unnatural. We built ours differently.

Reference audio conditions the decoder through cross-attention. The model learns to denoise latents into speech that matches both the speaker's voice and the text content.

Reference audio (≤120s @ 44.1kHz) → DAC-VAE encoder → continuous latents at ~86Hz
Speaker encoder: causal transformer, patch_size=4, seq len 2560 → 640; the speaker embedding feeds the K, V projection
Text encoder: bidirectional transformer over UTF-8 bytes, max length 768
Diffusion decoder: joint self-cross attention, K,V = concat(self, speaker, text); QK-norm, gated attention, SwiGLU MLP, AdaLN, LoRA adapters
Rectified flow: noise → latent audio in 30 steps
DAC-VAE decoder: continuous latents → 44.1kHz output, the soul's audio

Rectified flow learns straight paths from noise to data. Fewer steps, faster inference, better quality.
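The straight-path objective can be sketched in a few lines; a minimal NumPy illustration, assuming a velocity-prediction loss and the t=1 noise / t=0 audio convention used here (function and variable names are ours, not the model's):

```python
import numpy as np

def rectified_flow_pair(x0, rng, t):
    """Linear interpolation between data x0 (t=0) and noise (t=1).

    The model is trained to predict the constant velocity (eps - x0)
    along the straight path x_t = (1 - t) * x0 + t * eps.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0 + t * eps
    v_target = eps - x0              # straight-line velocity, independent of t
    return x_t, v_target

# toy usage: the training loss would be MSE(model(x_t, t, cond), v_target)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))     # stand-in for audio latents
x_t, v = rectified_flow_pair(x0, rng, t=0.5)
```

Because the path is a straight line, x_t always equals x0 + t·v, which is what lets a coarse ODE solver traverse it in few steps.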

Sampling trajectory: t=1.0 (noise) → t=0.75 → t=0.5 → t=0.25 → t=0 (audio)
sampling 30 steps
sampler Euler ODE
CFG scale 3.0 (joint)
RTF <0.05
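The sampler above can be sketched as a plain Euler loop over the velocity field; `model` is a hypothetical predictor standing in for the diffusion decoder, and the toy field below simply transports noise toward zeros:

```python
import numpy as np

def sample_euler(model, shape, steps=30, cfg_scale=3.0, seed=0):
    """Euler ODE sampling for rectified flow, t=1 (noise) -> t=0 (audio).

    `model(x, t, cond)` is a hypothetical velocity predictor; joint CFG
    combines one conditional and one unconditional call per step (2x NFE).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)       # start from pure noise at t=1
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v_cond = model(x, t, cond=True)
        v_uncond = model(x, t, cond=False)
        v = v_uncond + cfg_scale * (v_cond - v_uncond)   # joint guidance
        x = x - dt * v                   # Euler step toward t=0
    return x

def toy(x, t, cond):
    # exact velocity field when the data distribution is all-zeros latents
    return x / max(t, 1e-6)

out = sample_euler(toy, (2, 3))          # noise is transported to ~zeros
```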

Generate and play in parallel. First audio arrives before generation finishes.

latent prefix encoder variable block size
TTFB optimization shorter initial blocks
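The overlap can be sketched with a producer thread and a consuming player; block sizes and timings below are illustrative, not the model's actual schedule:

```python
import queue
import threading
import time

def generate(out_q, block_sizes=(5, 10, 20, 20)):
    """Produce audio blocks; shorter initial blocks cut time-to-first-byte."""
    for n in block_sizes:
        time.sleep(0.001 * n)        # stand-in for per-block decode time
        out_q.put([0.0] * n)         # stand-in for n audio frames
    out_q.put(None)                  # end-of-stream sentinel

def play(out_q, played):
    """Consume blocks as they arrive, overlapping with generation."""
    while (block := out_q.get()) is not None:
        played.extend(block)

q = queue.Queue()
played = []
producer = threading.Thread(target=generate, args=(q,))
producer.start()
play(q, played)                      # first block plays while later blocks are still being generated
producer.join()
```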

Single attention operation over all context. The denoiser sees itself, the speaker, and the text simultaneously.

Q comes from the denoiser; K, V are drawn from three sources: self, speaker, and text.

attn = softmax(Q @ K.T / √d) @ V
K, V = cat(self_kv, spk_kv, txt_kv)
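A minimal NumPy sketch of this joint attention (single head, no gating or QK-norm; shapes are illustrative):

```python
import numpy as np

def joint_attention(q, self_kv, spk_kv, txt_kv):
    """One softmax over all context: K,V = cat(self, speaker, text).

    q is (T, d); each kv argument is a (keys, values) pair of (len, d) arrays.
    """
    k = np.concatenate([self_kv[0], spk_kv[0], txt_kv[0]], axis=0)
    v = np.concatenate([self_kv[1], spk_kv[1], txt_kv[1]], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)      # one softmax over the joint context
    return w @ v

rng = np.random.default_rng(0)
d = 16
q = rng.standard_normal((8, d))                # denoiser queries
kv = lambda n: (rng.standard_normal((n, d)), rng.standard_normal((n, d)))
out = joint_attention(q, kv(8), kv(4), kv(12)) # self, speaker, text
```

Because the softmax spans the concatenated context, attention weight is allocated across self, speaker, and text in a single competition rather than in separate cross-attention layers.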

Scale matters. Diverse data teaches natural variation.

data 160K hours audio
steps 800K
batch size 768
compute TPU v4-64
optimizer Muon
precision BF16
dropout 10% (spk/txt independent)
transcription WhisperD (diarization + events)

Amplify the signal. CFG steers generation toward the conditioning, trading diversity for fidelity.

joint (2× NFE)  ε = ε_uncond + s(ε_cond − ε_uncond)
independent (3× NFE)  separate spk/txt scales
alternating (2× NFE)  temporal score rescaling
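The joint and independent variants can be sketched directly from the formula above; the independent scales below are illustrative placeholders, not production values:

```python
import numpy as np

def cfg_joint(e_uncond, e_cond, s=3.0):
    """Joint CFG: one conditional pass with speaker and text together (2x NFE)."""
    return e_uncond + s * (e_cond - e_uncond)

def cfg_independent(e_uncond, e_spk, e_txt, s_spk=2.0, s_txt=3.0):
    """Independent CFG: separate speaker and text scales (3x NFE).

    The scales here are illustrative, not the model's actual settings.
    """
    return e_uncond + s_spk * (e_spk - e_uncond) + s_txt * (e_txt - e_uncond)

guided = cfg_joint(np.zeros(4), np.ones(4))    # each entry: 0 + 3.0 * (1 - 0) = 3.0
```

Independent guidance costs one extra network evaluation per step but lets speaker fidelity and text adherence be traded off separately.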

DAC-VAE gives us native continuous latents. No discrete bottleneck, no quantization artifacts.

input 44.1kHz PCM
DAC-VAE encoder continuous latents
model latents diffusion target