I like to train large deep neural nets too 🧠🤖💥 | First Paper (AutoAgents: A Framework for Automatic Agent Generation) Accepted @ IJCAI 2024 | Role Model Karpathy
Thrilled to share our latest work: Voila, a family of fully open-source voice models for real-time autonomous conversations and role-play. Our major contributions 🧵:
1) End-to-End Full-Duplex Architecture: directly processes simultaneous audio token streams in both directions, user to model and model to user.
2) Voila-Tokenizer: a tokenizer trained on 100K hours of audio, with interleaved audio-text alignment, that distills semantic and acoustic tokens via residual vector quantization (RVQ).
3) Text-Audio Interleaved Alignment: a fine-grained alignment of text and audio tokens that keeps the two synchronized and preserves expressiveness for tasks like ASR (WER 2.7%) and TTS (WER 2.8%).
4) Voice Customization: supports 1M+ pre-built voices and one-shot voice cloning from 10s audio clips using Wespeaker embeddings.
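For anyone wondering what RVQ means in the tokenizer point: here is a toy sketch of residual vector quantization in plain Python. This is not the Voila-Tokenizer code. The function names (`nearest`, `rvq_encode`, `rvq_decode`) and the tiny hand-written codebooks are assumptions made up for illustration; a real tokenizer learns its codebooks and runs on encoder features rather than 2-D points.

```python
# Toy residual vector quantization (RVQ).
# Assumption: codebooks are hand-written here; real systems learn them.

def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(code):
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize stage by stage: each stage encodes the residual left by the previous one."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen codeword so the next stage only sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the chosen codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical two-stage setup: a coarse codebook, then a finer one.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.0, 0.25), (-0.25, 0.0), (-0.25, 0.25)],
]

codes = rvq_encode((0.9, 0.3), CODEBOOKS)
approx = rvq_decode(codes, CODEBOOKS)
print(codes, approx)  # [1, 1] [1.0, 0.25]
```

The first index captures the coarse shape and the second refines what the first stage missed, so one frame becomes a short stack of discrete tokens instead of a single one.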