
~200 ms end-to-end latency with full-duplex voice AI
Open-source speech-text foundation model for real-time full-duplex voice dialogue (Sep 2024 – Dec 2024). Uses the Mimi neural audio codec, which compresses 24 kHz audio to a 12.5 Hz token stream (1.1 kbps, 80 ms frames). Ships PyTorch, MLX (Apple Silicon), and Rust backends. Features Moshika (female) and Moshiko (male) voices with multiple quantizations (bf16, int8, int4). The 7B-parameter Temporal Transformer has a theoretical latency of 160 ms (two 80 ms frames), ~200 ms in practice on an L4 GPU.
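The codec and latency figures above follow from simple frame arithmetic. A minimal sanity-check sketch (all constants are taken from the description; nothing here is measured, and the two-frame decomposition of the 160 ms figure is an interpretation, not a quoted spec):

```python
# Sanity-check the Mimi codec and latency numbers quoted above.
# Constants come from the project description; nothing is measured here.

SAMPLE_RATE_HZ = 24_000   # Mimi input sample rate
FRAME_RATE_HZ = 12.5      # Mimi token frame rate after compression
BITRATE_BPS = 1_100       # 1.1 kbps

# Each Mimi frame covers 1 / 12.5 s = 80 ms of audio.
frame_ms = 1_000 / FRAME_RATE_HZ
assert frame_ms == 80.0

# One frame spans 24_000 / 12.5 = 1920 raw audio samples.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ
assert samples_per_frame == 1920

# At 1.1 kbps, each 80 ms frame carries 1100 / 12.5 = 88 bits.
bits_per_frame = BITRATE_BPS / FRAME_RATE_HZ
assert bits_per_frame == 88

# The 160 ms theoretical latency equals exactly two frame durations
# (an assumed decomposition: one frame buffered + one frame of delay).
theoretical_latency_ms = 2 * frame_ms
assert theoretical_latency_ms == 160.0

print(frame_ms, samples_per_frame, bits_per_frame, theoretical_latency_ms)
```

The ~200 ms practical figure on an L4 GPU then reflects roughly 40 ms of model and I/O overhead on top of the two-frame minimum.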
