
~200 ms end-to-end latency with full-duplex voice AI
Open-source speech-text foundation model for real-time full-duplex voice dialogue (Sep 2024 – Dec 2024). Uses the Mimi neural audio codec, which compresses 24 kHz audio to a 12.5 Hz token stream (1.1 kbps, 80 ms frames). Ships PyTorch, MLX (Apple Silicon), and Rust backends. Features Moshika (female) and Moshiko (male) voices with multiple quantizations (bf16, int8, int4). The 7B-parameter Temporal Transformer has a theoretical latency of 160 ms (two 80 ms frames), ~200 ms in practice on an L4 GPU.
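The codec and latency figures above follow from simple frame arithmetic. A minimal sanity-check sketch (all constants are taken from the description; nothing here is measured, and the two-frame decomposition of the 160 ms figure is an interpretation, not a quoted spec):

```python
# Sanity-check the Mimi codec and latency numbers quoted above.
# Constants come from the project description; nothing is measured here.

SAMPLE_RATE_HZ = 24_000   # Mimi input sample rate
FRAME_RATE_HZ = 12.5      # Mimi token frame rate after compression
BITRATE_BPS = 1_100       # 1.1 kbps

# Each Mimi frame covers 1 / 12.5 s = 80 ms of audio.
frame_ms = 1_000 / FRAME_RATE_HZ
assert frame_ms == 80.0

# One frame spans 24_000 / 12.5 = 1920 raw audio samples.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ
assert samples_per_frame == 1920

# At 1.1 kbps, each 80 ms frame carries 1100 / 12.5 = 88 bits.
bits_per_frame = BITRATE_BPS / FRAME_RATE_HZ
assert bits_per_frame == 88

# The 160 ms theoretical latency equals exactly two frame durations
# (an assumed decomposition: one frame buffered + one frame of delay).
theoretical_latency_ms = 2 * frame_ms
assert theoretical_latency_ms == 160.0

print(frame_ms, samples_per_frame, bits_per_frame, theoretical_latency_ms)
```

The ~200 ms practical figure on an L4 GPU then reflects roughly 40 ms of model and I/O overhead on top of the two-frame minimum.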
