
Zero-shot voice cloning from a 3-second audio sample
Unofficial PyTorch implementation of VALL-E for zero-shot text-to-speech and voice cloning (Jan 2023 – Mar 2023). Treats speech synthesis as language modeling over discrete audio tokens produced by the EnCodec tokenizer, generating high-quality speech that matches an unseen speaker's voice from a single 3-second reference clip. Combines autoregressive (AR) and non-autoregressive (NAR) transformer models, trained with DeepSpeed so that training scales to larger models.
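To make the token view of a 3-second prompt concrete, here is a minimal sketch of the arithmetic involved. It assumes the settings reported in the VALL-E paper (24 kHz EnCodec at 75 frames per second, 8 residual codebooks of 1024 entries each, with the AR model predicting the first codebook and the NAR model the remaining seven); this implementation's actual configuration may differ. The names `CodecConfig`, `prompt_token_shape`, `split_ar_nar`, and `bitrate_bps` are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CodecConfig:
    """Assumed EnCodec settings (from the VALL-E paper, not from this project)."""
    frame_rate_hz: int = 75     # codec frames per second of audio
    num_codebooks: int = 8      # residual vector quantizer levels
    codebook_size: int = 1024   # entries per codebook


def prompt_token_shape(seconds: float, cfg: CodecConfig = CodecConfig()) -> tuple[int, int]:
    """Return (codebooks, frames) for an audio prompt of the given length."""
    frames = round(seconds * cfg.frame_rate_hz)
    return (cfg.num_codebooks, frames)


def split_ar_nar(cfg: CodecConfig = CodecConfig()) -> tuple[list[int], list[int]]:
    """Codebook indices handled by the AR model and by the NAR model."""
    return [0], list(range(1, cfg.num_codebooks))


def bitrate_bps(cfg: CodecConfig = CodecConfig()) -> int:
    """Bits per second implied by the codec settings."""
    bits_per_code = (cfg.codebook_size - 1).bit_length()  # 10 bits for 1024 entries
    return cfg.frame_rate_hz * cfg.num_codebooks * bits_per_code


if __name__ == "__main__":
    print(prompt_token_shape(3.0))  # (8, 225)
    print(split_ar_nar())           # ([0], [1, 2, 3, 4, 5, 6, 7])
    print(bitrate_bps())            # 6000
```

Under these assumptions a 3-second reference clip becomes an 8 x 225 grid of codec tokens (1,800 in total), which is the acoustic prompt the two transformers condition on.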
