AI/ML, LLM Integration

VALL-E — Zero-Shot Text-to-Speech with Neural Codec Language Models

Zero-shot voice cloning from a 3-second audio sample

Unofficial PyTorch implementation of VALL-E for zero-shot text-to-speech and voice cloning (Jan 2023 – Mar 2023). It treats speech synthesis as language modeling over discrete audio tokens produced by the EnCodec tokenizer, so a single 3-second reference clip is enough to condition generation on a new speaker's voice. The model pairs an autoregressive (AR) transformer with a non-autoregressive (NAR) transformer and is trained with DeepSpeed for scalable distributed training.
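
To make the "3-second reference" concrete, the sketch below works out how many discrete tokens such a clip becomes. It assumes the 24 kHz EnCodec model at its 6 kbps setting (8 codebooks of 1024 entries, 75 frames per second), which is the configuration described in the VALL-E paper; this page does not state which setting the project used, and the function names are illustrative only.

```python
# Token arithmetic for an EnCodec-style codec. All constants are assumptions
# (24 kHz model, 6 kbps setting); they are not taken from this project.
SAMPLE_RATE_HZ = 24_000   # audio sample rate
HOP_SAMPLES = 320         # encoder stride, giving 75 frames per second
NUM_QUANTIZERS = 8        # residual codebooks at 6 kbps
CODEBOOK_SIZE = 1024      # entries per codebook


def prompt_token_shape(seconds):
    """Return (quantizers, frames) for a clip of the given length."""
    frames = round(seconds * SAMPLE_RATE_HZ / HOP_SAMPLES)
    return NUM_QUANTIZERS, frames


def bitrate_kbps():
    """Bitrate implied by the constants above."""
    bits_per_code = (CODEBOOK_SIZE - 1).bit_length()   # 10 bits for 1024 entries
    frames_per_second = SAMPLE_RATE_HZ / HOP_SAMPLES   # 75.0
    return NUM_QUANTIZERS * bits_per_code * frames_per_second / 1000


print(prompt_token_shape(3.0))  # (8, 225)
print(bitrate_kbps())           # 6.0
```

Under these assumptions a 3-second prompt is a grid of 8 x 225 integer codes, which is what the language model sees in place of raw audio.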

Tech Stack

Python, PyTorch, EnCodec, Transformers, DeepSpeed, CUDA, Git/GitHub

Technical Highlights

Zero-shot TTS: generates speech in a target voice from a single 3-second reference clip, without fine-tuning
Neural codec language modeling: the EnCodec tokenizer quantizes audio into discrete codes and decodes generated codes back to a waveform
Dual architecture: an autoregressive (AR) model predicts the first quantizer's codes, and a non-autoregressive (NAR) model predicts the remaining quantizers
DeepSpeed training integration for scalable distributed training with CUDA/ROCm support
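
The AR/NAR split in the highlights above can be sketched as a two-stage decoding schedule. This is a minimal illustration rather than the project's code: the two `toy_*` functions are hypothetical stand-ins that return random codes where the real transformers would run, and the constants assume 8 quantizers of 1024 entries each.

```python
import random

NUM_QUANTIZERS = 8      # assumption: EnCodec at 6 kbps
CODEBOOK_SIZE = 1024    # assumption: entries per codebook
EOS = CODEBOOK_SIZE     # hypothetical end-of-sequence id for the AR stage


def toy_ar_step(text, prompt_codes, generated, rng):
    """Stand-in for the AR transformer: next first-quantizer token."""
    # A real model would attend over text, prompt and generated tokens.
    # Here we simply stop after two frames per input character.
    if len(generated) >= 2 * len(text):
        return EOS
    return rng.randrange(CODEBOOK_SIZE)


def toy_nar_pass(text, prompt_codes, levels_so_far, rng):
    """Stand-in for the NAR transformer: one whole quantizer level at once."""
    num_frames = len(levels_so_far[0])
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(num_frames)]


def synthesize_codes(text, prompt_codes, max_frames=1000, seed=0):
    rng = random.Random(seed)
    # Stage 1: autoregressive, one token per step, first quantizer only.
    first_level = []
    while len(first_level) < max_frames:
        token = toy_ar_step(text, prompt_codes, first_level, rng)
        if token == EOS:
            break
        first_level.append(token)
    # Stage 2: non-autoregressive, one parallel pass per remaining quantizer,
    # each conditioned on all levels predicted so far.
    levels = [first_level]
    for _ in range(1, NUM_QUANTIZERS):
        levels.append(toy_nar_pass(text, prompt_codes, levels, rng))
    return levels  # NUM_QUANTIZERS lists, each with one code per frame


prompt = [[0] * 225 for _ in range(NUM_QUANTIZERS)]  # dummy 3-second prompt
codes = synthesize_codes("hello", prompt)
print(len(codes), len(codes[0]))  # 8 10
```

The point of the split is cost: only the first quantizer pays for step-by-step generation, while the other seven levels each need a single forward pass. The resulting grid of codes would then be handed to the codec decoder to produce a waveform.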

Challenges & Solutions

Voice cloning without fine-tuning → Implemented neural codec language model approach for zero-shot synthesis
Scalable training → Integrated DeepSpeed for distributed training with efficient memory management
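
As an illustration of what "efficient memory management" typically involves with DeepSpeed, the sketch below shows a configuration dictionary using mixed precision and ZeRO stage 2. The keys are standard DeepSpeed config fields, but every value is an assumption chosen for the example, not this project's actual settings, and `effective_batch_size` is a hypothetical helper.

```python
# Illustrative DeepSpeed configuration. Values are assumptions, not the
# settings used in this project.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True},            # half precision to cut activation memory
    "zero_optimization": {"stage": 2},    # shard optimizer state and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}


def effective_batch_size(config, world_size):
    """Samples consumed per optimizer step across all GPUs."""
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * world_size
    )


print(effective_batch_size(ds_config, world_size=4))  # 128

# In a training script, a dict like this would usually be passed as
# deepspeed.initialize(model=model, model_parameters=model.parameters(),
#                      config=ds_config)
```

Gradient accumulation lets a small per-GPU micro-batch stand in for a larger global batch, while ZeRO reduces how much optimizer state each GPU has to hold.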

Links

GitHub, Paper
