lshzhm committed (verified)
Commit 1488c83 · Parent: cf7e36e

Upload 240 files
F5-TTS/src/f5_tts.egg-info/PKG-INFO ADDED
@@ -0,0 +1,220 @@
+ Metadata-Version: 2.4
+ Name: f5-tts
+ Version: 0.5.2
+ Summary: F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
+ License: MIT License
+ Project-URL: Homepage, https://github.com/SWivid/F5-TTS
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: accelerate>=0.33.0
+ Requires-Dist: bitsandbytes>0.37.0; platform_machine != "arm64" and platform_system != "Darwin"
+ Requires-Dist: cached_path
+ Requires-Dist: click
+ Requires-Dist: datasets
+ Requires-Dist: ema_pytorch>=0.5.2
+ Requires-Dist: gradio>=3.45.2
+ Requires-Dist: hydra-core>=1.3.0
+ Requires-Dist: jieba
+ Requires-Dist: librosa
+ Requires-Dist: matplotlib
+ Requires-Dist: numpy<=1.26.4
+ Requires-Dist: pydub
+ Requires-Dist: pypinyin
+ Requires-Dist: safetensors
+ Requires-Dist: soundfile
+ Requires-Dist: tomli
+ Requires-Dist: torch>=2.0.0
+ Requires-Dist: torchaudio>=2.0.0
+ Requires-Dist: torchdiffeq
+ Requires-Dist: tqdm>=4.65.0
+ Requires-Dist: transformers
+ Requires-Dist: transformers_stream_generator
+ Requires-Dist: vocos
+ Requires-Dist: wandb
+ Requires-Dist: x_transformers>=1.31.14
+ Provides-Extra: eval
+ Requires-Dist: faster_whisper==0.10.1; extra == "eval"
+ Requires-Dist: funasr; extra == "eval"
+ Requires-Dist: jiwer; extra == "eval"
+ Requires-Dist: modelscope; extra == "eval"
+ Requires-Dist: zhconv; extra == "eval"
+ Requires-Dist: zhon; extra == "eval"
+ Dynamic: license-file
+
+ # F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
+
+ [![python](https://img.shields.io/badge/Python-3.10-brightgreen)](https://github.com/SWivid/F5-TTS)
+ [![arXiv](https://img.shields.io/badge/arXiv-2410.06885-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2410.06885)
+ [![demo](https://img.shields.io/badge/GitHub-Demo%20page-orange.svg)](https://swivid.github.io/F5-TTS/)
+ [![hfspace](https://img.shields.io/badge/🤗-Space%20demo-yellow)](https://huggingface.co/spaces/mrfakename/E2-F5-TTS)
+ [![msspace](https://img.shields.io/badge/🤖-Space%20demo-blue)](https://modelscope.cn/studios/modelscope/E2-F5-TTS)
+ [![lab](https://img.shields.io/badge/X--LANCE-Lab-grey?labelColor=lightgrey)](https://x-lance.sjtu.edu.cn/)
+ [![lab](https://img.shields.io/badge/Peng%20Cheng-Lab-grey?labelColor=lightgrey)](https://www.pcl.ac.cn)
+ <!-- <img src="https://github.com/user-attachments/assets/12d7749c-071a-427c-81bf-b87b91def670" alt="Watermark" style="width: 40px; height: auto"> -->
+
+ **F5-TTS**: Diffusion Transformer with ConvNeXt V2, with faster training and inference.
+
+ **E2 TTS**: Flat-UNet Transformer, the closest reproduction of the [paper](https://arxiv.org/abs/2406.18009).
+
+ **Sway Sampling**: Inference-time flow-step sampling strategy that greatly improves performance.
+
+ ### Thanks to all the contributors!
+
+ ## News
+ - **2024/10/08**: F5-TTS & E2 TTS base models on [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS), [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), [🟣 Wisemodel](https://wisemodel.cn/models/SJTU_X-LANCE/F5-TTS_Emilia-ZH-EN).
+
+ ## Installation
+
+ ```bash
+ # Create a python 3.10 conda env (you could also use virtualenv)
+ conda create -n f5-tts python=3.10
+ conda activate f5-tts
+
+ # NVIDIA GPU: install pytorch with your CUDA version, e.g.
+ pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
+
+ # AMD GPU: install pytorch with your ROCm version, e.g. (Linux only)
+ pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2
+
+ # Intel GPU: install pytorch with your XPU version, e.g.
+ # Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit must be installed
+ pip install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/xpu
+ ```
+
+ Then you can choose from a few options below:
+
+ ### 1. As a pip package (if just for inference)
+
+ ```bash
+ pip install git+https://github.com/SWivid/F5-TTS.git
+ ```
+
+ ### 2. Local editable (if also doing training or finetuning)
+
+ ```bash
+ git clone https://github.com/SWivid/F5-TTS.git
+ cd F5-TTS
+ # git submodule update --init --recursive # (optional, if you need bigvgan)
+ pip install -e .
+ ```
+
+ ### 3. Docker usage
+ ```bash
+ # Build from Dockerfile
+ docker build -t f5tts:v1 .
+
+ # Or pull from GitHub Container Registry
+ docker pull ghcr.io/swivid/f5-tts:main
+ ```
+
+
+ ## Inference
+
+ ### 1. Gradio App
+
+ Currently supported features:
+
+ - Basic TTS with Chunk Inference
+ - Multi-Style / Multi-Speaker Generation
+ - Voice Chat powered by Qwen2.5-3B-Instruct
+ - [Custom inference with more language support](src/f5_tts/infer/SHARED.md)
+
+ ```bash
+ # Launch a Gradio app (web interface)
+ f5-tts_infer-gradio
+
+ # Specify the port/host
+ f5-tts_infer-gradio --port 7860 --host 0.0.0.0
+
+ # Launch a share link
+ f5-tts_infer-gradio --share
+ ```
+
+ ### 2. CLI Inference
+
+ ```bash
+ # Run with flags
+ # Leaving --ref_text "" will have the ASR model transcribe the reference audio (extra GPU memory usage)
+ f5-tts_infer-cli \
+ --model "F5-TTS" \
+ --ref_audio "ref_audio.wav" \
+ --ref_text "The content, subtitle or transcription of reference audio." \
+ --gen_text "Some text you want the TTS model to generate for you."
+
+ # Run with the default settings in src/f5_tts/infer/examples/basic/basic.toml
+ f5-tts_infer-cli
+ # Or with your own .toml file
+ f5-tts_infer-cli -c custom.toml
+
+ # Multi voice. See src/f5_tts/infer/README.md
+ f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
+ ```
+
+ ### 3. More instructions
+
+ - To get better generation results, take a moment to read the [detailed guidance](src/f5_tts/infer).
+ - The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) page is very useful; please search for keywords related to the problem you encountered first. If no answer is found, feel free to open an issue.
+
+
+ ## Training
+
+ ### 1. Gradio App
+
+ Read the [training & finetuning guidance](src/f5_tts/train) for more instructions.
+
+ ```bash
+ # Quick start with Gradio web interface
+ f5-tts_finetune-gradio
+ ```
+
+
+ ## [Evaluation](src/f5_tts/eval)
+
+
+ ## Development
+
+ Use pre-commit to ensure code quality (it will run linters and formatters automatically).
+
+ ```bash
+ pip install pre-commit
+ pre-commit install
+ ```
+
+ When making a pull request, before each commit, run:
+
+ ```bash
+ pre-commit run --all-files
+ ```
+
+ Note: Some model components have linting exceptions for E722 to accommodate tensor notation.
+
+
+ ## Acknowledgements
+
+ - [E2-TTS](https://arxiv.org/abs/2406.18009) brilliant work, simple and effective
+ - [Emilia](https://arxiv.org/abs/2407.05361), [WenetSpeech4TTS](https://arxiv.org/abs/2406.05763), [LibriTTS](https://arxiv.org/abs/1904.02882), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) valuable datasets
+ - [lucidrains](https://github.com/lucidrains) initial CFM structure, with [bfs18](https://github.com/bfs18) for discussion
+ - [SD3](https://arxiv.org/abs/2403.03206) & [Hugging Face diffusers](https://github.com/huggingface/diffusers) DiT and MMDiT code structure
+ - [torchdiffeq](https://github.com/rtqichen/torchdiffeq) as ODE solver, [Vocos](https://huggingface.co/charactr/vocos-mel-24khz) and [BigVGAN](https://github.com/NVIDIA/BigVGAN) as vocoders
+ - [FunASR](https://github.com/modelscope/FunASR), [faster-whisper](https://github.com/SYSTRAN/faster-whisper), [UniSpeech](https://github.com/microsoft/UniSpeech), [SpeechMOS](https://github.com/tarepan/SpeechMOS) for evaluation tools
+ - [ctc-forced-aligner](https://github.com/MahmoudAshraf97/ctc-forced-aligner) for the speech edit test
+ - [mrfakename](https://x.com/realmrfakename) for the Hugging Face Space demo
+ - [f5-tts-mlx](https://github.com/lucasnewman/f5-tts-mlx/tree/main) implementation with the MLX framework by [Lucas Newman](https://github.com/lucasnewman)
+ - [F5-TTS-ONNX](https://github.com/DakeQQ/F5-TTS-ONNX) ONNX Runtime version by [DakeQQ](https://github.com/DakeQQ)
+
+ ## Citation
+ If our work and codebase are useful for you, please cite:
+ ```
+ @article{chen-etal-2024-f5tts,
+ title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
+ author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
+ journal={arXiv preprint arXiv:2410.06885},
+ year={2024},
+ }
+ ```
+ ## License
+
+ Our code is released under the MIT License. The pre-trained models are licensed under the CC-BY-NC license due to the training data Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause.
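The CLI usage block in the PKG-INFO above notes that leaving `--ref_text ""` makes the bundled ASR model transcribe the reference clip at the cost of extra GPU memory. A minimal sketch of that mode, using a placeholder audio path rather than any file shipped in this commit:

```bash
# Minimal sketch: an empty --ref_text triggers automatic transcription of the
# reference audio (extra GPU memory usage); my_reference.wav is a placeholder.
f5-tts_infer-cli \
    --model "F5-TTS" \
    --ref_audio "my_reference.wav" \
    --ref_text "" \
    --gen_text "Hello from F5-TTS."
```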
F5-TTS/src/f5_tts.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,46 @@
+ LICENSE
+ README.md
+ pyproject.toml
+ src/f5_tts/api.py
+ src/f5_tts/socket_server.py
+ src/f5_tts.egg-info/PKG-INFO
+ src/f5_tts.egg-info/SOURCES.txt
+ src/f5_tts.egg-info/dependency_links.txt
+ src/f5_tts.egg-info/entry_points.txt
+ src/f5_tts.egg-info/requires.txt
+ src/f5_tts.egg-info/top_level.txt
+ src/f5_tts/eval/ecapa_tdnn.py
+ src/f5_tts/eval/eval_infer_batch.py
+ src/f5_tts/eval/eval_librispeech_test_clean.py
+ src/f5_tts/eval/eval_seedtts_testset.py
+ src/f5_tts/eval/eval_utmos.py
+ src/f5_tts/eval/eval_v2c_test.py
+ src/f5_tts/eval/utils_eval.py
+ src/f5_tts/infer/infer_cli.py
+ src/f5_tts/infer/infer_cli_libritts.py
+ src/f5_tts/infer/infer_cli_s3.py
+ src/f5_tts/infer/infer_cli_test.py
+ src/f5_tts/infer/infer_cli_tts_test.py
+ src/f5_tts/infer/infer_gradio.py
+ src/f5_tts/infer/speech_edit.py
+ src/f5_tts/infer/utils_infer.py
+ src/f5_tts/model/__init__.py
+ src/f5_tts/model/cfm.py
+ src/f5_tts/model/dataset.py
+ src/f5_tts/model/modules.py
+ src/f5_tts/model/trainer.py
+ src/f5_tts/model/utils.py
+ src/f5_tts/model/backbones/dit.py
+ src/f5_tts/model/backbones/mmdit.py
+ src/f5_tts/model/backbones/unett.py
+ src/f5_tts/scripts/count_max_epoch.py
+ src/f5_tts/scripts/count_params_gflops.py
+ src/f5_tts/train/finetune_cli.py
+ src/f5_tts/train/finetune_gradio.py
+ src/f5_tts/train/train.py
+ src/f5_tts/train/datasets/prepare_csv_wavs.py
+ src/f5_tts/train/datasets/prepare_emilia.py
+ src/f5_tts/train/datasets/prepare_libritts.py
+ src/f5_tts/train/datasets/prepare_ljspeech.py
+ src/f5_tts/train/datasets/prepare_v2c.py
+ src/f5_tts/train/datasets/prepare_wenetspeech4tts.py
F5-TTS/src/f5_tts.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
F5-TTS/src/f5_tts.egg-info/entry_points.txt ADDED
@@ -0,0 +1,5 @@
+ [console_scripts]
+ f5-tts_finetune-cli = f5_tts.train.finetune_cli:main
+ f5-tts_finetune-gradio = f5_tts.train.finetune_gradio:main
+ f5-tts_infer-cli = f5_tts.infer.infer_cli:main
+ f5-tts_infer-gradio = f5_tts.infer.infer_gradio:main
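Each entry above maps one console command to a `main()` function, so all four commands land on PATH after the editable install shown in PKG-INFO. A quick sanity check, assuming the CLIs expose the usual `--help` flag through argparse/click:

```bash
# Sketch: confirm the console scripts defined above resolve and import cleanly.
f5-tts_infer-cli --help
f5-tts_infer-gradio --help
f5-tts_finetune-cli --help
f5-tts_finetune-gradio --help
```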
F5-TTS/src/f5_tts.egg-info/requires.txt ADDED
@@ -0,0 +1,36 @@
+ accelerate>=0.33.0
+ cached_path
+ click
+ datasets
+ ema_pytorch>=0.5.2
+ gradio>=3.45.2
+ hydra-core>=1.3.0
+ jieba
+ librosa
+ matplotlib
+ numpy<=1.26.4
+ pydub
+ pypinyin
+ safetensors
+ soundfile
+ tomli
+ torch>=2.0.0
+ torchaudio>=2.0.0
+ torchdiffeq
+ tqdm>=4.65.0
+ transformers
+ transformers_stream_generator
+ vocos
+ wandb
+ x_transformers>=1.31.14
+
+ [:platform_machine != "arm64" and platform_system != "Darwin"]
+ bitsandbytes>0.37.0
+
+ [eval]
+ faster_whisper==0.10.1
+ funasr
+ jiwer
+ modelscope
+ zhconv
+ zhon
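The bracketed `[eval]` group mirrors `Provides-Extra: eval` from PKG-INFO and is only needed for the evaluation tooling. An install sketch from the F5-TTS checkout used elsewhere in this commit:

```bash
# Base install (inference/training dependencies only)
pip install -e ./F5-TTS

# Also pull in the evaluation extras (faster_whisper, funasr, jiwer, modelscope, zhconv, zhon)
pip install -e "./F5-TTS[eval]"
```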
F5-TTS/src/f5_tts.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ f5_tts
MMAudio/demo.py CHANGED
@@ -68,7 +68,8 @@ def main():
      seq_cfg = model.seq_cfg

      if args.video:
-         video_path: Path = Path(args.video).expanduser()
+         #video_path: Path = Path(args.video).expanduser()
+         video_path = args.video
      else:
          video_path = None
      prompt: str = args.prompt
@@ -117,22 +118,25 @@
      #test_scp = "/ailab-train/speech/zhanghaomin/datas/v2cdata/test.scp"
      test_scp = args.scp

-     lines = []
-     with open(test_scp, "r") as fr:
-         lines += fr.readlines()
-     #with open(test_scp2, "r") as fr:
-     #    lines += fr.readlines()
-     tests = []
-     for line in lines[args.start: args.end]:
-         ####video_path, prompt = line.strip().split("\t")
-         ####prompt = "the sound of " + prompt
-         ####negative_prompt = ""
-         video_path, _, audio_path = line.strip().split("\t")
-         ####video_path = "/ailab-train/speech/zhanghaomin/datas/v2cdata/DragonII/DragonII_videos/Gobber/0725.mp4"
-         prompt = ""
-         #negative_prompt = "speech, voice, talking, speaking"
-         negative_prompt = ""
-         tests.append([video_path, prompt, negative_prompt, audio_path])
+     if video_path is None:
+         lines = []
+         with open(test_scp, "r") as fr:
+             lines += fr.readlines()
+         #with open(test_scp2, "r") as fr:
+         #    lines += fr.readlines()
+         tests = []
+         for line in lines[args.start: args.end]:
+             ####video_path, prompt = line.strip().split("\t")
+             ####prompt = "the sound of " + prompt
+             ####negative_prompt = ""
+             video_path, _, audio_path = line.strip().split("\t")
+             ####video_path = "/ailab-train/speech/zhanghaomin/datas/v2cdata/DragonII/DragonII_videos/Gobber/0725.mp4"
+             prompt = ""
+             #negative_prompt = "speech, voice, talking, speaking"
+             negative_prompt = ""
+             tests.append([video_path, prompt, negative_prompt, audio_path])
+     else:
+         tests = [[video_path, prompt, negative_prompt, ""]]

      print(datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3], "start")
      for video_path, prompt, negative_prompt, audio_path in tests:
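The new branch keeps the original batch path (tab-separated lines read from an `.scp` list and split into `video_path`, a middle field, and `audio_path`) and adds a single-video mode via `args.video`. A sketch of both modes follows; the flag names are inferred from the `args.*` attributes read above, and the `.scp` file and video path are placeholders:

```bash
# Sketch only: --scp/--start/--end/--video/--prompt are inferred from args.scp,
# args.start, args.end, args.video and args.prompt in demo.py above.
cd ./MMAudio
python ./demo.py --scp ../tests/test.scp --start 0 --end 10

# Single-video mode added by this change (bypasses the .scp list)
python ./demo.py --video ../tests/0235.mp4 --prompt ""
```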
README.md CHANGED
@@ -1,77 +1,74 @@
- ---
- title: DeepAudio-V1 — multi-modal speech and audio generation
- emoji: 🔊
- colorFrom: blue
- colorTo: indigo
- sdk: gradio
- app_file: app.py
- pinned: false
- ---
-
-
- ## [DeepAudio-V1:Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation](https://pages.github.com/)
-
-
- ## Installation
-
- **1. Create a conda environment**
-
- ```bash
- conda create -n v2as python=3.10
- conda activate v2as
- ```
-
- **2. F5-TTS base install**
-
- ```bash
- cd ./F5-TTS
- pip install -e .
- ```
-
- **3. Additional requirements**
-
- ```bash
- pip install -r requirements.txt
- conda install cudnn
- ```
-
- **Pretrained models**
-
- The models are available at https://huggingface.co/. See [MODELS.md](./MODELS.md) for more details.
-
- ## Inference
-
- **1. V2A inference**
-
- ```bash
- bash v2a.sh
- ```
-
- **2. V2S inference**
-
- ```bash
- bash v2s.sh
- ```
-
- **3. TTS inference**
-
- ```bash
- bash tts.sh
- ```
-
- ## Evaluation
-
- ```bash
- bash eval_v2c.sh
- ```
-
-
- ## Acknowledgement
-
- - [MMAudio](https://github.com/hkchengrex/MMAudio) for video-to-audio backbone and pretrained models
- - [F5-TTS](https://github.com/SWivid/F5-TTS) for text-to-speech and video-to-speech backbone
- - [V2C](https://github.com/chenqi008/V2C) for animated movie benchmark
- - [Wav2Vec2-Emotion](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) for emotion recognition in EMO-SIM evaluation.
- - [WavLM-SV](https://huggingface.co/microsoft/wavlm-base-sv) for speech recognition in SPK-SIM evaluation.
- - [Whisper](https://huggingface.co/Systran/faster-whisper-large-v3) for speech recognition in WER evaluation.
-
+ <div align="center">
+ <p align="center">
+ <h2>DeepAudio-V1</h2>
+ <a href="https://arxiv.org/">Paper</a> | <a href="https://pages.github.com/">Webpage</a> | <a href="https://huggingface.co/">Models</a>
+ </p>
+ </div>
+
+
+ ## [DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation](https://pages.github.com/)
+
+
+ ## Installation
+
+ **1. Create a conda environment**
+
+ ```bash
+ conda create -n v2as python=3.10
+ conda activate v2as
+ ```
+
+ **2. F5-TTS base install**
+
+ ```bash
+ cd ./F5-TTS
+ pip install -e .
+ ```
+
+ **3. Additional requirements**
+
+ ```bash
+ pip install -r requirements.txt
+ conda install cudnn
+ ```
+
+ **Pretrained models**
+
+ The models are available at https://huggingface.co/. See [MODELS.md](./MODELS.md) for more details.
+
+ ## Inference
+
+ **1. V2A inference**
+
+ ```bash
+ bash v2a.sh
+ ```
+
+ **2. V2S inference**
+
+ ```bash
+ bash v2s.sh
+ ```
+
+ **3. TTS inference**
+
+ ```bash
+ bash tts.sh
+ ```
+
+ ## Evaluation
+
+ ```bash
+ bash eval_v2c.sh
+ ```
+
+
+ ## Acknowledgement
+
+ - [MMAudio](https://github.com/hkchengrex/MMAudio) for video-to-audio backbone and pretrained models
+ - [F5-TTS](https://github.com/SWivid/F5-TTS) for text-to-speech and video-to-speech backbone
+ - [V2C](https://github.com/chenqi008/V2C) for animated movie benchmark
+ - [Wav2Vec2-Emotion](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) for emotion recognition in EMO-SIM evaluation.
+ - [WavLM-SV](https://huggingface.co/microsoft/wavlm-base-sv) for speaker verification in SPK-SIM evaluation.
+ - [Whisper](https://huggingface.co/Systran/faster-whisper-large-v3) for speech recognition in WER evaluation.
+
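The inference scripts in the new README are meant to run in sequence: `v2s.sh` (changed at the bottom of this commit) reads V2A outputs through `--v2a_path ./tests/outputs_v2a_l44_test/`. A hedged end-to-end run, assuming `v2a.sh` writes into that directory:

```bash
# Sketch of the V2A -> V2S chain; assumes v2a.sh populates
# ./tests/outputs_v2a_l44_test/, which v2s.sh then consumes via --v2a_path.
conda activate v2as
bash v2a.sh
bash v2s.sh
# Optional: score the V2C benchmark outputs
bash eval_v2c.sh
```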
app.py CHANGED
@@ -16,23 +16,37 @@ import torchaudio

  import tempfile

+ import requests
+
  log = logging.getLogger()


  #@spaces.GPU(duration=120)
- @torch.inference_mode()
- def video_to_audio(video: gr.Video, prompt: str, negative_prompt: str, seed: int, num_steps: int,
-                    cfg_strength: float, duration: float):
+ def video_to_audio(video: gr.Video, prompt: str):
+
+     video_path = tempfile.NamedTemporaryFile(delete=False, suffix='.mp4').name
+
+     output_dir = video_path.rsplit("/", 1)[0]
+     video_save_path = str(output_dir) + "/" + str(video_path).replace("/", "__").strip(".") + ".mp4"
+
+     print("paths", video, video_path, output_dir, video_save_path)

-     os.system("bash v2a.sh")
+     if video.startswith("http"):
+         data = requests.get(video, timeout=60).content
+         with open(video_path, "wb") as fw:
+             fw.write(data)
+     else:
+         os.system("cp %s %s" % (video, video_path))

-     return "v2a"
+     os.system("cd ./MMAudio; python ./demo.py --output %s --video_path %s --prompt %s --calc_energy 1" % (output_dir, video_path, prompt))
+
+     return video_save_path


  video_to_audio_tab = gr.Interface(
      fn=video_to_audio,
      description="""
-     Project page: <a href="https://hkchengrex.com/MMAudio/">https://hkchengrex.com/MMAudio/</a><br>
+     Project page: <a href="https://acappemin.github.io/DeepAudio-V1.github.io">https://acappemin.github.io/DeepAudio-V1.github.io</a><br>
      Code: <a href="https://github.com/acappemin/DeepAudio-V1">https://github.com/acappemin/DeepAudio-V1</a><br>
      """,
      inputs=[
@@ -41,16 +55,19 @@ video_to_audio_tab = gr.Interface(
      ],
      outputs='playable_video',
      cache_examples=False,
-     title='MMAudio — Video-to-Audio Synthesis',
+     title='Video-to-Audio',
      examples=[
          [
-             'https://huggingface.co/hkchengrex/MMAudio/resolve/main/examples/sora_beach.mp4',
+             './tests/0235.mp4',
+             '',
+         ],
+         [
+             './tests/0778.mp4',
              '',
          ],
      ])


  if __name__ == "__main__":
-     gr.TabbedInterface([video_to_audio_tab],
-                        ['Video-to-Audio']).launch()
+     gr.TabbedInterface([video_to_audio_tab], ['Video-to-Audio']).launch()

v2s.sh CHANGED
@@ -1 +1 @@
- python ./F5-TTS/src/f5_tts/infer/infer_cli_test.py --output_dir ./tests/outputs_v2c_l44_test/ --start 0 --end 10 --ckpt_file ./F5-TTS/ckpts/v2c/v2c_l44.pt --v2a_path ./tests/outputs_v2a_l44_test/ --infer_list ./tests/v2c_test.lst
+ python ./F5-TTS/src/f5_tts/infer/infer_cli_test.py --output_dir ./tests/outputs_v2c_l44_test/ --start 0 --end 10 --ckpt_file ./F5-TTS/ckpts/v2c/v2c_s16.pt --v2a_path ./tests/outputs_v2a_l44_test/ --infer_list ./tests/v2c_test.lst
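This one-line change only swaps the checkpoint (`v2c_l44.pt` to `v2c_s16.pt`). A hypothetical variant that takes the checkpoint as an argument, reusing exactly the flags already in `v2s.sh`, would avoid editing the script per model:

```bash
#!/bin/bash
# Hypothetical variant of v2s.sh: take the checkpoint path as $1, defaulting to v2c_s16.pt
CKPT="${1:-./F5-TTS/ckpts/v2c/v2c_s16.pt}"
python ./F5-TTS/src/f5_tts/infer/infer_cli_test.py \
    --output_dir ./tests/outputs_v2c_l44_test/ \
    --start 0 --end 10 \
    --ckpt_file "$CKPT" \
    --v2a_path ./tests/outputs_v2a_l44_test/ \
    --infer_list ./tests/v2c_test.lst
```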