Upload 240 files
- F5-TTS/src/f5_tts.egg-info/PKG-INFO +220 -0
- F5-TTS/src/f5_tts.egg-info/SOURCES.txt +46 -0
- F5-TTS/src/f5_tts.egg-info/dependency_links.txt +1 -0
- F5-TTS/src/f5_tts.egg-info/entry_points.txt +5 -0
- F5-TTS/src/f5_tts.egg-info/requires.txt +36 -0
- F5-TTS/src/f5_tts.egg-info/top_level.txt +1 -0
- MMAudio/demo.py +21 -17
- README.md +74 -77
- app.py +27 -10
- v2s.sh +1 -1
F5-TTS/src/f5_tts.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,220 @@
+Metadata-Version: 2.4
+Name: f5-tts
+Version: 0.5.2
+Summary: F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
+License: MIT License
+Project-URL: Homepage, https://github.com/SWivid/F5-TTS
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: accelerate>=0.33.0
+Requires-Dist: bitsandbytes>0.37.0; platform_machine != "arm64" and platform_system != "Darwin"
+Requires-Dist: cached_path
+Requires-Dist: click
+Requires-Dist: datasets
+Requires-Dist: ema_pytorch>=0.5.2
+Requires-Dist: gradio>=3.45.2
+Requires-Dist: hydra-core>=1.3.0
+Requires-Dist: jieba
+Requires-Dist: librosa
+Requires-Dist: matplotlib
+Requires-Dist: numpy<=1.26.4
+Requires-Dist: pydub
+Requires-Dist: pypinyin
+Requires-Dist: safetensors
+Requires-Dist: soundfile
+Requires-Dist: tomli
+Requires-Dist: torch>=2.0.0
+Requires-Dist: torchaudio>=2.0.0
+Requires-Dist: torchdiffeq
+Requires-Dist: tqdm>=4.65.0
+Requires-Dist: transformers
+Requires-Dist: transformers_stream_generator
+Requires-Dist: vocos
+Requires-Dist: wandb
+Requires-Dist: x_transformers>=1.31.14
+Provides-Extra: eval
+Requires-Dist: faster_whisper==0.10.1; extra == "eval"
+Requires-Dist: funasr; extra == "eval"
+Requires-Dist: jiwer; extra == "eval"
+Requires-Dist: modelscope; extra == "eval"
+Requires-Dist: zhconv; extra == "eval"
+Requires-Dist: zhon; extra == "eval"
+Dynamic: license-file
+
+# F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
+
+[](https://github.com/SWivid/F5-TTS)
+[](https://arxiv.org/abs/2410.06885)
+[](https://swivid.github.io/F5-TTS/)
+[](https://huggingface.co/spaces/mrfakename/E2-F5-TTS)
+[](https://modelscope.cn/studios/modelscope/E2-F5-TTS)
+[](https://x-lance.sjtu.edu.cn/)
+[](https://www.pcl.ac.cn)
+<!-- <img src="https://github.com/user-attachments/assets/12d7749c-071a-427c-81bf-b87b91def670" alt="Watermark" style="width: 40px; height: auto"> -->
+
+**F5-TTS**: Diffusion Transformer with ConvNeXt V2, faster trained and inference.
+
+**E2 TTS**: Flat-UNet Transformer, closest reproduction from [paper](https://arxiv.org/abs/2406.18009).
+
+**Sway Sampling**: Inference-time flow step sampling strategy, greatly improves performance
+
+### Thanks to all the contributors !
+
+## News
+- **2024/10/08**: F5-TTS & E2 TTS base models on [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS), [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), [🟣 Wisemodel](https://wisemodel.cn/models/SJTU_X-LANCE/F5-TTS_Emilia-ZH-EN).
+
+## Installation
+
+```bash
+# Create a python 3.10 conda env (you could also use virtualenv)
+conda create -n f5-tts python=3.10
+conda activate f5-tts
+
+# NVIDIA GPU: install pytorch with your CUDA version, e.g.
+pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
+
+# AMD GPU: install pytorch with your ROCm version, e.g. (Linux only)
+pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2
+
+# Intel GPU: install pytorch with your XPU version, e.g.
+# Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit must be installed
+pip install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/xpu
+```
+
+Then you can choose from a few options below:
+
+### 1. As a pip package (if just for inference)
+
+```bash
+pip install git+https://github.com/SWivid/F5-TTS.git
+```
+
+### 2. Local editable (if also do training, finetuning)
+
+```bash
+git clone https://github.com/SWivid/F5-TTS.git
+cd F5-TTS
+# git submodule update --init --recursive # (optional, if need bigvgan)
+pip install -e .
+```
+
+### 3. Docker usage
+```bash
+# Build from Dockerfile
+docker build -t f5tts:v1 .
+
+# Or pull from GitHub Container Registry
+docker pull ghcr.io/swivid/f5-tts:main
+```
+
+
+## Inference
+
+### 1. Gradio App
+
+Currently supported features:
+
+- Basic TTS with Chunk Inference
+- Multi-Style / Multi-Speaker Generation
+- Voice Chat powered by Qwen2.5-3B-Instruct
+- [Custom inference with more language support](src/f5_tts/infer/SHARED.md)
+
+```bash
+# Launch a Gradio app (web interface)
+f5-tts_infer-gradio
+
+# Specify the port/host
+f5-tts_infer-gradio --port 7860 --host 0.0.0.0
+
+# Launch a share link
+f5-tts_infer-gradio --share
+```
+
+### 2. CLI Inference
+
+```bash
+# Run with flags
+# Leave --ref_text "" will have ASR model transcribe (extra GPU memory usage)
+f5-tts_infer-cli \
+--model "F5-TTS" \
+--ref_audio "ref_audio.wav" \
+--ref_text "The content, subtitle or transcription of reference audio." \
+--gen_text "Some text you want TTS model generate for you."
+
+# Run with default setting. src/f5_tts/infer/examples/basic/basic.toml
+f5-tts_infer-cli
+# Or with your own .toml file
+f5-tts_infer-cli -c custom.toml
+
+# Multi voice. See src/f5_tts/infer/README.md
+f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
+```
+
+### 3. More instructions
+
+- In order to have better generation results, take a moment to read [detailed guidance](src/f5_tts/infer).
+- The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) are very useful, please try to find the solution by properly searching the keywords of problem encountered. If no answer found, then feel free to open an issue.
+
+
+## Training
+
+### 1. Gradio App
+
+Read [training & finetuning guidance](src/f5_tts/train) for more instructions.
+
+```bash
+# Quick start with Gradio web interface
+f5-tts_finetune-gradio
+```
+
+
+## [Evaluation](src/f5_tts/eval)
+
+
+## Development
+
+Use pre-commit to ensure code quality (will run linters and formatters automatically)
+
+```bash
+pip install pre-commit
+pre-commit install
+```
+
+When making a pull request, before each commit, run:
+
+```bash
+pre-commit run --all-files
+```
+
+Note: Some model components have linting exceptions for E722 to accommodate tensor notation
+
+
+## Acknowledgements
+
+- [E2-TTS](https://arxiv.org/abs/2406.18009) brilliant work, simple and effective
+- [Emilia](https://arxiv.org/abs/2407.05361), [WenetSpeech4TTS](https://arxiv.org/abs/2406.05763), [LibriTTS](https://arxiv.org/abs/1904.02882), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) valuable datasets
+- [lucidrains](https://github.com/lucidrains) initial CFM structure with also [bfs18](https://github.com/bfs18) for discussion
+- [SD3](https://arxiv.org/abs/2403.03206) & [Hugging Face diffusers](https://github.com/huggingface/diffusers) DiT and MMDiT code structure
+- [torchdiffeq](https://github.com/rtqichen/torchdiffeq) as ODE solver, [Vocos](https://huggingface.co/charactr/vocos-mel-24khz) and [BigVGAN](https://github.com/NVIDIA/BigVGAN) as vocoder
+- [FunASR](https://github.com/modelscope/FunASR), [faster-whisper](https://github.com/SYSTRAN/faster-whisper), [UniSpeech](https://github.com/microsoft/UniSpeech), [SpeechMOS](https://github.com/tarepan/SpeechMOS) for evaluation tools
+- [ctc-forced-aligner](https://github.com/MahmoudAshraf97/ctc-forced-aligner) for speech edit test
+- [mrfakename](https://x.com/realmrfakename) huggingface space demo ~
+- [f5-tts-mlx](https://github.com/lucasnewman/f5-tts-mlx/tree/main) Implementation with MLX framework by [Lucas Newman](https://github.com/lucasnewman)
+- [F5-TTS-ONNX](https://github.com/DakeQQ/F5-TTS-ONNX) ONNX Runtime version by [DakeQQ](https://github.com/DakeQQ)
+
+## Citation
+If our work and codebase is useful for you, please cite as:
+```
+@article{chen-etal-2024-f5tts,
+  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
+  author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
+  journal={arXiv preprint arXiv:2410.06885},
+  year={2024},
+}
+```
+## License
+
+Our code is released under MIT License. The pre-trained models are licensed under the CC-BY-NC license due to the training data Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause.
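The PKG-INFO above is the metadata that setuptools generated from the package's pyproject.toml; once f5-tts is installed, the same headers and Requires-Dist entries can be read back at runtime. A minimal standard-library sketch (it assumes the distribution is installed under the name `f5-tts`, as in the Name field above):

```python
# Sketch: read back the installed f5-tts metadata described by PKG-INFO.
from importlib.metadata import metadata, requires

meta = metadata("f5-tts")             # parsed PKG-INFO headers
print(meta["Name"], meta["Version"])  # e.g. "f5-tts 0.5.2"

for req in requires("f5-tts") or []:  # the Requires-Dist entries
    print(req)                        # e.g. 'torch>=2.0.0', 'funasr; extra == "eval"'
```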
F5-TTS/src/f5_tts.egg-info/SOURCES.txt
ADDED
@@ -0,0 +1,46 @@
+LICENSE
+README.md
+pyproject.toml
+src/f5_tts/api.py
+src/f5_tts/socket_server.py
+src/f5_tts.egg-info/PKG-INFO
+src/f5_tts.egg-info/SOURCES.txt
+src/f5_tts.egg-info/dependency_links.txt
+src/f5_tts.egg-info/entry_points.txt
+src/f5_tts.egg-info/requires.txt
+src/f5_tts.egg-info/top_level.txt
+src/f5_tts/eval/ecapa_tdnn.py
+src/f5_tts/eval/eval_infer_batch.py
+src/f5_tts/eval/eval_librispeech_test_clean.py
+src/f5_tts/eval/eval_seedtts_testset.py
+src/f5_tts/eval/eval_utmos.py
+src/f5_tts/eval/eval_v2c_test.py
+src/f5_tts/eval/utils_eval.py
+src/f5_tts/infer/infer_cli.py
+src/f5_tts/infer/infer_cli_libritts.py
+src/f5_tts/infer/infer_cli_s3.py
+src/f5_tts/infer/infer_cli_test.py
+src/f5_tts/infer/infer_cli_tts_test.py
+src/f5_tts/infer/infer_gradio.py
+src/f5_tts/infer/speech_edit.py
+src/f5_tts/infer/utils_infer.py
+src/f5_tts/model/__init__.py
+src/f5_tts/model/cfm.py
+src/f5_tts/model/dataset.py
+src/f5_tts/model/modules.py
+src/f5_tts/model/trainer.py
+src/f5_tts/model/utils.py
+src/f5_tts/model/backbones/dit.py
+src/f5_tts/model/backbones/mmdit.py
+src/f5_tts/model/backbones/unett.py
+src/f5_tts/scripts/count_max_epoch.py
+src/f5_tts/scripts/count_params_gflops.py
+src/f5_tts/train/finetune_cli.py
+src/f5_tts/train/finetune_gradio.py
+src/f5_tts/train/train.py
+src/f5_tts/train/datasets/prepare_csv_wavs.py
+src/f5_tts/train/datasets/prepare_emilia.py
+src/f5_tts/train/datasets/prepare_libritts.py
+src/f5_tts/train/datasets/prepare_ljspeech.py
+src/f5_tts/train/datasets/prepare_v2c.py
+src/f5_tts/train/datasets/prepare_wenetspeech4tts.py
F5-TTS/src/f5_tts.egg-info/dependency_links.txt
ADDED
@@ -0,0 +1 @@
+
F5-TTS/src/f5_tts.egg-info/entry_points.txt
ADDED
@@ -0,0 +1,5 @@
+[console_scripts]
+f5-tts_finetune-cli = f5_tts.train.finetune_cli:main
+f5-tts_finetune-gradio = f5_tts.train.finetune_gradio:main
+f5-tts_infer-cli = f5_tts.infer.infer_cli:main
+f5-tts_infer-gradio = f5_tts.infer.infer_gradio:main
+
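Each console_scripts entry maps a shell command to a `module:function` target, so the `f5-tts_infer-cli` command used in the README is a thin generated launcher. A rough Python equivalent of what that launcher does (target path taken from the entry above; the wrapper itself is a sketch, not the generated file):

```python
# Approximate behavior of the generated "f5-tts_infer-cli" console script:
# import the registered target and call it, passing its return code to the shell.
import sys
from f5_tts.infer.infer_cli import main  # target of "f5-tts_infer-cli" above

if __name__ == "__main__":
    sys.exit(main())
```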
F5-TTS/src/f5_tts.egg-info/requires.txt
ADDED
@@ -0,0 +1,36 @@
+accelerate>=0.33.0
+cached_path
+click
+datasets
+ema_pytorch>=0.5.2
+gradio>=3.45.2
+hydra-core>=1.3.0
+jieba
+librosa
+matplotlib
+numpy<=1.26.4
+pydub
+pypinyin
+safetensors
+soundfile
+tomli
+torch>=2.0.0
+torchaudio>=2.0.0
+torchdiffeq
+tqdm>=4.65.0
+transformers
+transformers_stream_generator
+vocos
+wandb
+x_transformers>=1.31.14
+
+[:platform_machine != "arm64" and platform_system != "Darwin"]
+bitsandbytes>0.37.0
+
+[eval]
+faster_whisper==0.10.1
+```
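The bracketed section in requires.txt is a PEP 508 environment marker: bitsandbytes is only required when the machine is not arm64 and the OS is not macOS. A hedged sketch of how such a marker is evaluated, using the `packaging` library (commonly present alongside pip, but treat the import as an assumption of this example):

```python
# Sketch: evaluate the conditional bitsandbytes requirement from requires.txt.
from packaging.markers import Marker

marker = Marker('platform_machine != "arm64" and platform_system != "Darwin"')
if marker.evaluate():  # True on e.g. x86_64 Linux, False on Apple Silicon macOS
    print("bitsandbytes>0.37.0 applies on this platform")
else:
    print("bitsandbytes is skipped on this platform")
```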
F5-TTS/src/f5_tts.egg-info/top_level.txt
ADDED
@@ -0,0 +1 @@
+f5_tts
MMAudio/demo.py
CHANGED
@@ -68,7 +68,8 @@ def main():
     seq_cfg = model.seq_cfg
 
     if args.video:
-        video_path: Path = Path(args.video).expanduser()
+        #video_path: Path = Path(args.video).expanduser()
+        video_path = args.video
     else:
         video_path = None
     prompt: str = args.prompt
@@ -117,22 +118,25 @@ def main():
     #test_scp = "/ailab-train/speech/zhanghaomin/datas/v2cdata/test.scp"
     test_scp = args.scp
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+    if video_path is None:
+        lines = []
+        with open(test_scp, "r") as fr:
+            lines += fr.readlines()
+        #with open(test_scp2, "r") as fr:
+        #    lines += fr.readlines()
+        tests = []
+        for line in lines[args.start: args.end]:
+            ####video_path, prompt = line.strip().split("\t")
+            ####prompt = "the sound of " + prompt
+            ####negative_prompt = ""
+            video_path, _, audio_path = line.strip().split("\t")
+            ####video_path = "/ailab-train/speech/zhanghaomin/datas/v2cdata/DragonII/DragonII_videos/Gobber/0725.mp4"
+            prompt = ""
+            #negative_prompt = "speech, voice, talking, speaking"
+            negative_prompt = ""
+            tests.append([video_path, prompt, negative_prompt, audio_path])
+    else:
+        tests = [[video_path, prompt, negative_prompt, ""]]
 
     print(datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3], "start")
     for video_path, prompt, negative_prompt, audio_path in tests:
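The new branch in demo.py builds its test list from a tab-separated .scp file, keeping the first and third fields as the video and reference-audio paths and ignoring the middle field. A small sketch of the line format this parsing expects (the file contents and paths below are illustrative, not taken from the repo):

```python
# Sketch of the .scp layout parsed by the new demo.py loop:
# each line is "<video_path>\t<ignored>\t<audio_path>".
example_line = "/data/v2c/clip_0001.mp4\tsome-id-or-text\t/data/v2c/clip_0001.wav\n"  # hypothetical

video_path, _, audio_path = example_line.strip().split("\t")
tests = [[video_path, "", "", audio_path]]  # prompt and negative_prompt left empty, as in the diff
print(tests)
```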
README.md
CHANGED
@@ -1,77 +1,74 @@
-- [WavLM-SV](https://huggingface.co/microsoft/wavlm-base-sv) for speech recognition in SPK-SIM evaluation.
-- [Whisper](https://huggingface.co/Systran/faster-whisper-large-v3) for speech recognition in WER evaluation.
+<div align="center">
+<p align="center">
+  <h2>DeepAudio-V1</h2>
+  <a href="https://arxiv.org/">Paper</a> | <a href="https://pages.github.com/">Webpage</a> | <a href="https://huggingface.co/">Models</a>
+</p>
+</div>
+
+
+## [DeepAudio-V1:Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation](https://pages.github.com/)
+
+
+## Installation
+
+**1. Create a conda environment**
+
+```bash
+conda create -n v2as python=3.10
+conda activate v2as
+```
+
+**2. F5-TTS base install**
+
+```bash
+cd ./F5-TTS
+pip install -e .
+```
+
+**3. Additional requirements**
+
+```bash
+pip install -r requirements.txt
+conda install cudnn
+```
+
+**Pretrained models**
+
+The models are available at https://huggingface.co/. See [MODELS.md](./MODELS.md) for more details.
+
+## Inference
+
+**1. V2A inference**
+
+```bash
+bash v2a.sh
+```
+
+**2. V2S inference**
+
+```bash
+bash v2s.sh
+```
+
+**3. TTS inference**
+
+```bash
+bash tts.sh
+```
+
+## Evaluation
+
+```bash
+bash eval_v2c.sh
+```
+
+
+## Acknowledgement
+
+- [MMAudio](https://github.com/hkchengrex/MMAudio) for video-to-audio backbone and pretrained models
+- [F5-TTS](https://github.com/SWivid/F5-TTS) for text-to-speech and video-to-speech backbone
+- [V2C](https://github.com/chenqi008/V2C) for animated movie benchmark
+- [Wav2Vec2-Emotion](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) for emotion recognition in EMO-SIM evaluation.
+- [WavLM-SV](https://huggingface.co/microsoft/wavlm-base-sv) for speech recognition in SPK-SIM evaluation.
+- [Whisper](https://huggingface.co/Systran/faster-whisper-large-v3) for speech recognition in WER evaluation.
+
app.py
CHANGED
@@ -16,23 +16,37 @@ import torchaudio
 
 import tempfile
 
+import requests
+
 log = logging.getLogger()
 
 
 #@spaces.GPU(duration=120)
-
-
-
+def video_to_audio(video: gr.Video, prompt: str):
+
+    video_path = tempfile.NamedTemporaryFile(delete=False, suffix='.mp4').name
+
+    output_dir = video_path.rsplit("/", 1)[0]
+    video_save_path = str(output_dir) + "/" + str(video_path).replace("/", "__").strip(".") + ".mp4"
+
+    print("paths", video, video_path, output_dir, video_save_path)
 
-
+    if video.startswith("http"):
+        data = requests.get(video, timeout=60).content
+        with open(video_path, "wb") as fw:
+            fw.write(data)
+    else:
+        os.system("cp %s %s" % (video, video_path))
 
-
+    os.system("cd ./MMAudio; python ./demo.py --output %s --video_path %s --prompt %s --calc_energy 1" % (output_dir, video_path, prompt))
+
+    return video_save_path
 
 
 video_to_audio_tab = gr.Interface(
     fn=video_to_audio,
     description="""
-    Project page: <a href="https://
+    Project page: <a href="https://acappemin.github.io/DeepAudio-V1.github.io">https://acappemin.github.io/DeepAudio-V1.github.io</a><br>
     Code: <a href="https://github.com/acappemin/DeepAudio-V1">https://github.com/acappemin/DeepAudio-V1</a><br>
     """,
     inputs=[
@@ -41,16 +55,19 @@ video_to_audio_tab = gr.Interface(
     ],
     outputs='playable_video',
     cache_examples=False,
-    title='
+    title='Video-to-Audio',
     examples=[
         [
-            '
+            './tests/0235.mp4',
+            '',
+        ],
+        [
+            './tests/0778.mp4',
             '',
         ],
     ])
 
 
 if __name__ == "__main__":
-    gr.TabbedInterface([video_to_audio_tab],
-        ['Video-to-Audio']).launch()
+    gr.TabbedInterface([video_to_audio_tab], ['Video-to-Audio']).launch()
 
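The rewritten app.py callback accepts either a URL or a local file: remote inputs are downloaded with requests into a temp file, local ones are copied, and MMAudio/demo.py is then run on that path. A condensed sketch of the same flow outside Gradio, assuming the URL, prompt, and working directory are placeholders and mirroring the diff's os.system call rather than recommending it:

```python
# Sketch of the video_to_audio flow added in app.py (placeholders, not repo values).
import os
import tempfile
import requests

def fetch_video(source: str) -> str:
    """Return a local .mp4 path for either a URL or an existing file."""
    video_path = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4").name
    if source.startswith("http"):
        with open(video_path, "wb") as fw:
            fw.write(requests.get(source, timeout=60).content)
    else:
        os.system("cp %s %s" % (source, video_path))
    return video_path

local_mp4 = fetch_video("https://example.com/clip.mp4")  # placeholder URL
output_dir = local_mp4.rsplit("/", 1)[0]
prompt = "ambient"                                       # placeholder prompt text
os.system("cd ./MMAudio; python ./demo.py --output %s --video_path %s --prompt %s --calc_energy 1"
          % (output_dir, local_mp4, prompt))
```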
v2s.sh
CHANGED
@@ -1 +1 @@
-python ./F5-TTS/src/f5_tts/infer/infer_cli_test.py --output_dir ./tests/outputs_v2c_l44_test/ --start 0 --end 10 --ckpt_file ./F5-TTS/ckpts/v2c/
+python ./F5-TTS/src/f5_tts/infer/infer_cli_test.py --output_dir ./tests/outputs_v2c_l44_test/ --start 0 --end 10 --ckpt_file ./F5-TTS/ckpts/v2c/v2c_s16.pt --v2a_path ./tests/outputs_v2a_l44_test/ --infer_list ./tests/v2c_test.lst
|