making human voices

We achieved state-of-the-art voice cloning and human-like prosody.

Most AI voices sound unnatural. We built our models a different way.

architecture

Reference audio conditions the decoder through cross-attention. The model learns to denoise latents into speech that matches both the speaker's voice and the text content.

diffusion process

Rectified flow learns straight paths from noise to data. Fewer steps, faster inference, better quality.

t=1.0 (noise) → t=0.75 → t=0.5 → t=0.25 → t=0 (audio)

sampling: 30 steps
sampler: Euler ODE
CFG scale: 3.0 (joint)
RTF: <0.05
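As a rough illustration of Euler ODE sampling along a rectified-flow path, the sketch below integrates from t=1 (noise) to t=0 (data) in 30 steps. It is a toy under stated assumptions, not the production sampler: `toy_velocity` and `TARGET` are hypothetical stand-ins for the trained denoiser and a real latent, and guidance is left out.

```python
import random

def euler_sample(velocity, x_noise, num_steps=30):
    """Integrate dx/dt = velocity(x, t) from t=1 (noise) down to t=0 with Euler steps."""
    ts = [1.0 - i / num_steps for i in range(num_steps + 1)]  # 1.0, ..., 0.0
    x = list(x_noise)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_cur - t_next              # positive step size
        v = velocity(x, t_cur)           # one function evaluation per step
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical stand-in for the trained network: a single known target latent.
TARGET = [0.5, -1.0, 2.0]

def toy_velocity(x, t):
    # On the straight path x_t = (1 - t) * target + t * noise,
    # the velocity is noise - target, which equals (x_t - target) / t.
    return [(xi - ti) / t for xi, ti in zip(x, TARGET)]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]
sample = euler_sample(toy_velocity, noise, num_steps=30)
```

Because the toy path is exactly straight, the Euler steps land on `TARGET` up to floating-point error. A learned velocity field is only approximately straight, which is why the step count still matters. If RTF is read in its usual sense (generation time divided by audio duration), RTF <0.05 means one second of audio takes under 50 ms to generate.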

block-wise streaming

Generate and play in parallel. First audio arrives before generation finishes.

[diagram: generate and playback timelines, overlapping]

latent prefix encoder: variable block size
TTFB (time to first byte) optimization: shorter initial blocks
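One way to picture block-wise streaming with shorter initial blocks is sketched below. Everything here is an assumption for illustration: the block sizes, the growth factor, and the `generate_block` and `play` callables are hypothetical, and the loop runs sequentially where a real system would hand blocks to an audio thread.

```python
def block_schedule(total_frames, first_block=8, growth=2, max_block=64):
    """Yield (start, end) latent-frame ranges, starting small and growing."""
    start, size = 0, first_block
    while start < total_frames:
        end = min(start + size, total_frames)
        yield start, end
        start = end
        size = min(size * growth, max_block)

def stream(total_frames, generate_block, play):
    """Generate block by block, passing each block to playback as soon as it exists."""
    prefix = []                              # latents generated so far, reused as context
    for start, end in block_schedule(total_frames):
        block = generate_block(prefix, end - start)
        play(block)                          # playback can begin after the first block
        prefix.extend(block)
    return prefix

# Toy usage: the "generator" emits silence and "playback" records block lengths.
blocks = list(block_schedule(100))
played = []
latents = stream(100, lambda prefix, n: [0.0] * n, lambda b: played.append(len(b)))
```

With these hypothetical settings, the first block covers only 8 of 100 frames, so audio can start after a small fraction of the total work. Later blocks grow to amortize per-block overhead.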

attention mechanism

Single attention operation over all context. The denoiser sees itself, the speaker, and the text simultaneously.

Q: denoiser
K,V sources: self, speaker, text

attn = softmax(Q @ K.T / √d) @ V
K,V = cat(self_kv, spk_kv, txt_kv)
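The formula above can be sketched in plain Python. This is a single-head toy with no learned projections, masking, or batching, and the function and argument names are illustrative rather than taken from the model code.

```python
import math

def softmax(xs):
    m = max(xs)                                   # subtract the max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def joint_attention(q, self_kv, spk_kv, txt_kv):
    """q: list of query vectors. Each *_kv is a (keys, values) pair of vector lists."""
    keys = self_kv[0] + spk_kv[0] + txt_kv[0]     # K = cat(self, spk, txt)
    values = self_kv[1] + spk_kv[1] + txt_kv[1]   # V = cat(self, spk, txt)
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)                       # one softmax over all three sources
        out.append([sum(wj * v[c] for wj, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out

# Toy usage: one query, one key/value pair per source.
q = [[1.0, 0.0]]
self_kv = ([[1.0, 0.0]], [[1.0, 0.0]])
spk_kv = ([[0.0, 1.0]], [[0.0, 1.0]])
txt_kv = ([[0.0, 0.0]], [[0.5, 0.5]])
out = joint_attention(q, self_kv, spk_kv, txt_kv)
```

Because the softmax runs over the concatenated keys, the three sources compete for the same attention mass instead of being mixed in separate layers.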

training

Scale matters. Diverse data teaches natural variation.

audio: 160K hours
steps: 800K
batch size: 768
compute: TPU v4-64
optimizer: Muon
precision: BF16
dropout: 10% (spk/txt independent)
transcription: WhisperD (diarization + events)
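Assuming the 10% dropout refers to conditioning dropout (the usual way to train a model for classifier-free guidance), independent dropping of the speaker and text inputs might look like the sketch below. Using `None` as the dropped-condition placeholder and the function name `drop_conditioning` are illustrative choices, not details from the training code.

```python
import random

def drop_conditioning(spk, txt, rng, p_drop=0.10):
    """Independently replace each conditioning input with None with probability p_drop."""
    spk_out = None if rng.random() < p_drop else spk
    txt_out = None if rng.random() < p_drop else txt
    return spk_out, txt_out

# Empirical check of the drop rates over many simulated training examples.
rng = random.Random(0)
n = 100_000
results = [drop_conditioning("spk", "txt", rng) for _ in range(n)]
spk_rate = sum(s is None for s, _ in results) / n
txt_rate = sum(t is None for _, t in results) / n
both_rate = sum(s is None and t is None for s, t in results) / n
```

With independent draws, each condition is dropped about 10% of the time and both are dropped together about 1% of the time, so the model sees speaker-only, text-only, and fully unconditional examples.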

classifier-free guidance

Amplify the signal. CFG steers generation toward the conditioning, trading diversity for fidelity.

joint (2× NFE): ε = ε_uncond + s(ε_cond - ε_uncond)
independent (3× NFE): separate spk/txt scales
alternating (2× NFE)
temporal score rescaling
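The joint formula can be written out directly. NFE counts network evaluations per sampling step: joint guidance needs two (conditional and unconditional). `cfg_independent` is only one plausible way to combine separate speaker and text scales with three evaluations. It is an assumption for illustration, and the page does not give the exact form.

```python
def cfg_joint(eps_uncond, eps_cond, s):
    """eps = eps_uncond + s * (eps_cond - eps_uncond), elementwise. Costs 2 NFE."""
    return [u + s * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def cfg_independent(eps_uncond, eps_txt, eps_spk, s_txt, s_spk):
    """ASSUMED form: one guidance term per condition, each with its own scale. Costs 3 NFE."""
    return [u + s_txt * (t - u) + s_spk * (k - u)
            for u, t, k in zip(eps_uncond, eps_txt, eps_spk)]

# Toy usage with 2-dimensional predictions.
eps_uncond = [0.0, 0.0]
eps_cond = [1.0, 2.0]
guided = cfg_joint(eps_uncond, eps_cond, 3.0)   # [3.0, 6.0]
```

At s = 1 the joint formula returns the conditional prediction unchanged. Larger scales extrapolate past it, which is where the fidelity-for-diversity trade comes from.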

audio codec

DAC-VAE gives us native continuous latents. No discrete bottleneck, no quantization artifacts.

input (44.1 kHz PCM) → DAC-VAE encoder (continuous latents) → model latents (diffusion target)
