# Mini-Omni

<p align="center"><strong style="font-size: 18px;">
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
</strong>
</p>

<p align="center">
πŸ€— <a href="https://huggingface.co/gpt-omni/mini-omni">Hugging Face</a> | πŸ“– <a href="https://github.com/gpt-omni/mini-omni">Github</a> | πŸ“‘ <a href="https://arxiv.org/abs/2408.16725">Technical report</a>
</p>

Mini-Omni is an open-source multimodal large language model that can **hear and talk while thinking**. It offers real-time, end-to-end speech input and **streaming audio output** for conversational use.

<p align="center">
    <img src="data/figures/frameworkv3.jpg" width="100%"/>

</p>



## Features

βœ… **Real-time speech-to-speech** conversational capabilities. No extra ASR or TTS models required.

βœ… **Talking while thinking**, with the ability to generate text and audio at the same time (see the conceptual sketch after this list).

βœ… **Streaming audio output** capabilities.

βœ… **Batch inference** for "audio-to-text" and "audio-to-audio" tasks to further boost performance.
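
The "talking while thinking" behavior comes from parallel decoding: at each step the model emits a text token together with a set of audio tokens, so audio can be streamed out before the textual answer is complete. Below is a minimal conceptual sketch of that loop; the function names, codebook count, and vocabulary size are placeholders rather than the repo's actual implementation (see the technical report for the real scheme).

```python
import random

# Illustrative token layout: one text stream plus several parallel audio
# codebook streams. The codebook count and vocabulary size below are
# placeholders, not the model's real configuration.
AUDIO_CODEBOOKS = 7
AUDIO_VOCAB = 4096

def decode_step(step):
    """Stand-in for one forward pass that emits text and audio together."""
    text_token = f"<text_{step}>"
    audio_tokens = [random.randrange(AUDIO_VOCAB) for _ in range(AUDIO_CODEBOOKS)]
    return text_token, audio_tokens

def generate(max_steps=5):
    for step in range(max_steps):
        text_token, audio_tokens = decode_step(step)
        # Hand the audio tokens to the audio decoder immediately, so playback
        # can start while the text answer is still being written.
        yield text_token, audio_tokens

for text_token, audio_tokens in generate():
    print(text_token, audio_tokens)
```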

## Demo

NOTE: you need to unmute the player first.

https://github.com/user-attachments/assets/03bdde05-9514-4748-b527-003bea57f118


## Install

Create a new conda environment and install the required packages:

```sh
conda create -n omni python=3.10
conda activate omni

git clone https://github.com/gpt-omni/mini-omni.git
cd mini-omni
pip install -r requirements.txt
```
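
As a quick sanity check of the new environment (this assumes PyTorch is pulled in through `requirements.txt`, which the model code needs), you can confirm that torch imports and whether a GPU is visible:

```python
# Confirm that PyTorch imported correctly and report whether CUDA sees a GPU.
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```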

## Quick start

**Interactive demo**

- start server

NOTE: you need to start the server before running the streamlit or gradio demo, with `API_URL` set to the server address.



```sh
sudo apt-get install ffmpeg
conda activate omni
cd mini-omni
python3 server.py --ip '0.0.0.0' --port 60808
```
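
If you want to script against the server directly rather than going through the demos, a minimal client along the lines of the sketch below may work. The payload layout here is an assumption (posting the recorded WAV bytes to the `/chat` endpoint and streaming raw audio back); `server.py` and `webui/omni_streamlit.py` define the actual request/response contract, so check them before relying on this.

```python
# Hypothetical client for the mini-omni server; the payload layout is an
# assumption -- consult server.py for the real request/response format.
import requests

API_URL = "http://0.0.0.0:60808/chat"

# "question.wav" is a placeholder for your own recorded question.
with open("question.wav", "rb") as f:
    audio_bytes = f.read()

# Stream the response so audio chunks can be consumed as they arrive.
with requests.post(API_URL, data=audio_bytes, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("answer_audio.raw", "wb") as out:
        for chunk in resp.iter_content(chunk_size=4096):
            out.write(chunk)
```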





- run streamlit demo



NOTE: you need to run streamlit locally with PyAudio installed. If you hit `ModuleNotFoundError: No module named 'utils.vad'`, run `export PYTHONPATH=./` first.



```sh
pip install PyAudio==0.2.14
API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
```



- run gradio demo

```sh
API_URL=http://0.0.0.0:60808/chat python3 webui/omni_gradio.py
```



Example:

NOTE: you need to unmute first. Gradio does not seem to play the audio stream instantly, so the latency feels a bit longer.



https://github.com/user-attachments/assets/29187680-4c42-47ff-b352-f0ea333496d9





**Local test**



```sh
conda activate omni
cd mini-omni
# test with the preset audio samples and questions
python inference.py
```



## Common issues



- Error: `ModuleNotFoundError: No module named 'utils.xxxx'`



    Answer: run `export PYTHONPATH=./` first.



## Acknowledgements 



- [Qwen2](https://github.com/QwenLM/Qwen2/) as the LLM backbone.

- [litGPT](https://github.com/Lightning-AI/litgpt/) for training and inference.

- [whisper](https://github.com/openai/whisper/) for audio encoding.

- [snac](https://github.com/hubertsiuzdak/snac/) for audio decoding.

- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for generating synthetic speech.

- [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [MOSS](https://github.com/OpenMOSS/MOSS/tree/main) for alignment.



## Star History



[![Star History Chart](https://api.star-history.com/svg?repos=gpt-omni/mini-omni&type=Date)](https://star-history.com/#gpt-omni/mini-omni&Date)