Abstract: Mainstream zero-shot TTS production systems like Voicebox and Seed-TTS achieve human-parity speech by leveraging flow-matching and diffusion models, respectively. Unfortunately, human-level ...