# Mini-Omni
<p align="center"><strong style="font-size: 18px;">
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
</strong>
</p>
<p align="center">
🤗 <a href="https://huggingface.co/gpt-omni/mini-omni">Hugging Face</a> | 📖 <a href="https://github.com/gpt-omni/mini-omni">Github</a>
| 📑 <a href="https://arxiv.org/abs/2408.16725">Technical report</a>
</p>
Mini-Omni is an open-source multimodal large language model that can **hear and talk while thinking**. It features real-time, end-to-end speech input and **streaming audio output** conversational capabilities.
<p align="center">
<img src="data/figures/frameworkv3.jpg" width="100%"/>
</p>
## Features
✅ **Real-time speech-to-speech** conversational capabilities. No extra ASR or TTS models required.

✅ **Talking while thinking**, with the ability to generate text and audio at the same time.

✅ **Streaming audio output** capabilities.

✅ "Audio-to-Text" and "Audio-to-Audio" **batch inference** to further boost performance.
## Demo
NOTE: you need to unmute the player first.
https://github.com/user-attachments/assets/03bdde05-9514-4748-b527-003bea57f118
## Install
Create a new conda environment and install the required packages:
```sh
conda create -n omni python=3.10
conda activate omni
git clone https://github.com/gpt-omni/mini-omni.git
cd mini-omni
pip install -r requirements.txt
```
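Optionally, you can sanity-check the environment afterwards. This is only a quick check, and it assumes `requirements.txt` pulls in PyTorch and that you plan to run the model on a CUDA GPU:
```sh
# optional sanity check: confirm dependencies resolve and PyTorch can see a GPU
pip check
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
```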
## Quick start
**Interactive demo**
- start server
NOTE: you need to start the server before running the Streamlit or Gradio demo, with `API_URL` set to the server address.
```sh
sudo apt-get install ffmpeg
conda activate omni
cd mini-omni
python3 server.py --ip '0.0.0.0' --port 60808
```
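To confirm the server is reachable before launching a demo, you can probe the port. The `/chat` endpoint is meant to be called by the web demos, so any HTTP status code here (even an error) simply means the server is up:
```sh
# optional: check that the server is listening (any HTTP status code means it is reachable)
curl -s -o /dev/null -w "%{http_code}\n" http://0.0.0.0:60808/chat
```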
- run streamlit demo
NOTE: you need to run Streamlit locally with PyAudio installed. If you hit `ModuleNotFoundError: No module named 'utils.vad'`, run `export PYTHONPATH=./` first.
```sh
pip install PyAudio==0.2.14
API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
```
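If `pip install PyAudio` fails to build, it usually needs the PortAudio development headers first. A minimal sketch for Debian/Ubuntu (the package name assumes an apt-based system):
```sh
# PyAudio wraps PortAudio; install the headers before building the wheel (Debian/Ubuntu)
sudo apt-get install portaudio19-dev
pip install PyAudio==0.2.14
```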
- run gradio demo
```sh
API_URL=http://0.0.0.0:60808/chat python3 webui/omni_gradio.py
```
example:
NOTE: you need to unmute first. Gradio does not seem to play the audio stream instantly, so the latency feels a bit longer.
https://github.com/user-attachments/assets/29187680-4c42-47ff-b352-f0ea333496d9
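Both web demos talk to the server through `API_URL`. If the server runs on a different machine, replace `0.0.0.0` with that machine's address; the IP below is only a placeholder:
```sh
# point a demo at a remote server; 192.168.1.100 is a placeholder for your server's address
API_URL=http://192.168.1.100:60808/chat python3 webui/omni_gradio.py
```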
**Local test**
```sh
conda activate omni
cd mini-omni
# run a quick test with the preset audio samples and questions
python inference.py
```
## Common issues
- Error: `ModuleNotFoundError: No module named 'utils.xxxx'`
Answer: run `export PYTHONPATH=./` first.
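For example, when starting the server from the repository root:
```sh
cd mini-omni
export PYTHONPATH=./
python3 server.py --ip '0.0.0.0' --port 60808
```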
## Acknowledgements
- [Qwen2](https://github.com/QwenLM/Qwen2/) as the LLM backbone.
- [litGPT](https://github.com/Lightning-AI/litgpt/) for training and inference.
- [whisper](https://github.com/openai/whisper/) for audio encoding.
- [snac](https://github.com/hubertsiuzdak/snac/) for audio decoding.
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for generating synthetic speech.
- [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [MOSS](https://github.com/OpenMOSS/MOSS/tree/main) for alignment.
## Star History
[](https://star-history.com/#gpt-omni/mini-omni&Date)