---
license: cc-by-nc-sa-4.0
pipeline_tag: text-to-speech
library_name: transformers
tags:
- tts
- voice
- wip
---
|
|
|
This repository contains the model as described in [LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM](https://hf.co/papers/2503.04724). |
|
|
|
For more information, check out the project page at https://mbzuai-oryx.github.io/LLMVoX/ and the code at https://github.com/mbzuai-oryx/LLMVoX. |
|
|
|
# LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM |
|
|
|
<div> |
|
<a href="https://mbzuai-oryx.github.io/LLMVoX/"><img src="https://img.shields.io/badge/Project-Page-blue" alt="Project Page"></a> |
|
<a href="https://arxiv.org/abs/2503.04724"><img src="https://img.shields.io/badge/arXiv-2503.04724-b31b1b.svg" alt="arXiv"></a> |
|
<a href="https://github.com/mbzuai-oryx/LLMVoX/"><img src="https://img.shields.io/badge/GitHub-LLMVoX-black?logo=github" alt="GitHub Repository"></a> |
|
<a href="https://github.com/mbzuai-oryx/LLMVoX/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a> |
|
</div> |
|
|
|
**Authors:** |
|
**[Sambal Shikhar](https://github.com/mbzuai-oryx/LLMVoX?tab=readme-ov-file)**, **[Mohammed Irfan K](https://scholar.google.com/citations?user=GJp0keYAAAAJ&hl=en)**, **[Sahal Shaji Mullappilly](https://scholar.google.com/citations?user=LJWxVpUAAAAJ&hl=en)**, **[Fahad Khan](https://sites.google.com/view/fahadkhans/home)**, **[Jean Lahoud](https://scholar.google.com/citations?user=LsivLPoAAAAJ&hl=en)**, **[Rao Muhammad Anwer](https://scholar.google.com/citations?hl=en&authuser=1&user=_KlvMVoAAAAJ)**, **[Salman Khan](https://salman-h-khan.github.io/)**, **[Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)**
|
|
|
**Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE** |
|
|
|
<p align="center"> |
|
<img src="assets/arch_diagram.svg" alt="LLMVoX Architecture" width="800px"> |
|
</p> |
|
|
|
<video src="https://github.com/user-attachments/assets/6d305563-3c62-4f14-a8aa-acedf2143f76" width="500" controls></video> |
|
|
|
## Overview |
|
|
|
LLMVoX is a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming Text-to-Speech (TTS) system designed to convert text outputs from Large Language Models into high-fidelity streaming speech with low latency. |
|
|
|
Key features: |
|
- **Lightweight & Fast**: Only 30M parameters, with end-to-end latency as low as 300 ms
- **LLM-Agnostic**: Works with any LLM and Vision-Language Model without fine-tuning
- **Multi-Queue Streaming**: Enables continuous, low-latency speech generation (see the illustrative sketch below)
- **Multilingual Support**: Adaptable to new languages with only dataset adaptation
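The multi-queue idea can be pictured as a producer/consumer pipeline: LLM text chunks are queued as they arrive, a TTS worker drains them into audio chunks, and playback starts before generation finishes. The Python sketch below is purely illustrative and is **not** the LLMVoX implementation (the `synthesize` and `play` callables are placeholders); see `streaming_server.py` in the repository for the actual pipeline.

```python
# Illustrative producer/consumer sketch of multi-queue streaming.
# NOT the LLMVoX implementation; synthesize/play are placeholder callables.
import queue
import threading

text_q: "queue.Queue" = queue.Queue()   # chunks of LLM text
audio_q: "queue.Queue" = queue.Queue()  # synthesized audio chunks

def tts_worker(synthesize):
    """Drain text chunks and push synthesized audio as soon as it is ready."""
    while True:
        chunk = text_q.get()
        if chunk is None:          # sentinel: LLM finished generating
            audio_q.put(None)
            break
        audio_q.put(synthesize(chunk))

def playback_worker(play):
    """Play audio chunks while the LLM and TTS workers are still producing."""
    while True:
        audio = audio_q.get()
        if audio is None:
            break
        play(audio)

# threading.Thread(target=tts_worker, args=(synthesize,)).start()
# threading.Thread(target=playback_worker, args=(play,)).start()
```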
|
|
|
## Quick Start |
|
|
|
### Installation |
|
|
|
```bash |
|
# Requirements: CUDA 11.7+, Flash Attention 2.0+ compatible GPU |
|
|
|
git clone https://github.com/mbzuai-oryx/LLMVoX.git |
|
cd LLMVoX |
|
|
|
conda create -n llmvox python=3.9 |
|
conda activate llmvox |
|
|
|
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 |
|
pip install flash-attn --no-build-isolation |
|
pip install -r requirements.txt |
|
|
|
# Download checkpoints from Hugging Face |
|
# https://huggingface.co/MBZUAI/LLMVoX/tree/main |
|
mkdir -p CHECKPOINTS |
|
# Download wavtokenizer_large_speech_320_24k.ckpt and ckpt_english_tiny.pt |
|
``` |
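If you prefer a scripted download, the sketch below uses `huggingface_hub` to fetch the two checkpoints into `CHECKPOINTS/`. The repo id and filenames are taken from the links in the comments above; verify them against the Hugging Face listing before running.

```python
# Optional helper: fetch the checkpoints listed above with huggingface_hub.
# Repo id and filenames follow https://huggingface.co/MBZUAI/LLMVoX/tree/main;
# verify them there before running.
from huggingface_hub import hf_hub_download

for filename in ["wavtokenizer_large_speech_320_24k.ckpt", "ckpt_english_tiny.pt"]:
    hf_hub_download(repo_id="MBZUAI/LLMVoX", filename=filename, local_dir="CHECKPOINTS")
```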
|
|
|
### Voice Chat |
|
|
|
```bash |
|
# Basic usage |
|
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" |
|
|
|
# With multiple GPUs |
|
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \ |
|
--llm_device "cuda:0" --tts_device_1 1 --tts_device_2 2 |
|
|
|
# Balance latency/quality |
|
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \ |
|
--initial_dump_size_1 10 --initial_dump_size_2 160 --max_dump_size 1280 |
|
``` |
|
|
|
### Text Chat & Visual Speech |
|
|
|
```bash |
|
# Text-to-Speech |
|
python streaming_server.py --chat_type text --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" |
|
|
|
# Visual Speech (Speech + Image → Speech)
|
python streaming_server.py --chat_type visual_speech --llm_checkpoint "Qwen/Qwen2.5-VL-7B-Instruct" \ |
|
--eos_token "<|im_end|>" |
|
|
|
# Multimodal (support for models like Phi-4) |
|
python streaming_server.py --chat_type multimodal --llm_checkpoint "microsoft/Phi-4-multimodal-instruct" \ |
|
--eos_token "<|end|>" |
|
``` |
|
|
|
## API Reference |
|
|
|
| Endpoint | Purpose | Required Parameters |
|----------|---------|---------------------|
| `/tts` | Text-to-speech | `text`: String to convert |
| `/voicechat` | Voice conversations | `audio_base64`, `source_language`, `target_language` |
| `/multimodalchat` | Voice + multiple images | `audio_base64`, `image_list` |
| `/vlmschat` | Voice + single image | `audio_base64`, `image_base64`, `source_language`, `target_language` |
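As a concrete illustration, the Python sketch below posts to two of these endpoints with `requests`. The JSON field names come from the table above, but the server address and the assumption that audio is returned as raw bytes are guesses; check `streaming_server.py` for the exact request and response format.

```python
# Hypothetical client sketch. Field names follow the API table; the server
# address and raw-bytes response handling are assumptions — see
# streaming_server.py for the actual schema.
import base64
import requests

server = "http://<streaming-server-ip>:<api-port>"

# /tts: convert a text string to speech
resp = requests.post(f"{server}/tts", json={"text": "Hello from LLMVoX!"})
resp.raise_for_status()
with open("tts_output.wav", "wb") as f:
    f.write(resp.content)

# /voicechat: send base64-encoded input audio plus language hints
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")
resp = requests.post(
    f"{server}/voicechat",
    json={
        "audio_base64": audio_b64,
        "source_language": "English",
        "target_language": "English",
    },
)
resp.raise_for_status()
with open("voicechat_answer.wav", "wb") as f:
    f.write(resp.content)
```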
|
|
|
## Local UI Demo |
|
|
|
<p align="center"> |
|
<img src="assets/ui.png" alt="Demo UI" width="800px"> |
|
</p> |
|
|
|
```bash |
|
# Start server |
|
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --api_port PORT |
|
|
|
# Launch UI |
|
python run_ui.py --ip STREAMING_SERVER_IP --port PORT |
|
``` |
|
|
|
## Citation |
|
|
|
```bibtex |
|
@article{shikhar2025llmvox, |
|
title={LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM}, |
|
author={Shikhar, Sambal and Kurpath, Mohammed Irfan and Mullappilly, Sahal Shaji and Lahoud, Jean and Khan, Fahad and Anwer, Rao Muhammad and Khan, Salman and Cholakkal, Hisham}, |
|
journal={arXiv preprint arXiv:2503.04724}, |
|
year={2025} |
|
} |
|
``` |
|
|
|
## Acknowledgments |
|
|
|
- [Andrej Karpathy's NanoGPT](https://github.com/karpathy/nanoGPT) |
|
- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) |
|
- [Whisper](https://github.com/openai/whisper) |
|
- [Neural G2P](https://github.com/lingjzhu/CharsiuG2P) |
|
|
|
## License |
|
|
|
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. |