Upload 240 files
- F5-TTS/src/f5_tts.egg-info/PKG-INFO +220 -0
- F5-TTS/src/f5_tts.egg-info/SOURCES.txt +46 -0
- F5-TTS/src/f5_tts.egg-info/dependency_links.txt +1 -0
- F5-TTS/src/f5_tts.egg-info/entry_points.txt +5 -0
- F5-TTS/src/f5_tts.egg-info/requires.txt +36 -0
- F5-TTS/src/f5_tts.egg-info/top_level.txt +1 -0
- MMAudio/demo.py +21 -17
- README.md +74 -77
- app.py +27 -10
- v2s.sh +1 -1
F5-TTS/src/f5_tts.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,220 @@
+Metadata-Version: 2.4
+Name: f5-tts
+Version: 0.5.2
+Summary: F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
+License: MIT License
+Project-URL: Homepage, https://github.com/SWivid/F5-TTS
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: accelerate>=0.33.0
+Requires-Dist: bitsandbytes>0.37.0; platform_machine != "arm64" and platform_system != "Darwin"
+Requires-Dist: cached_path
+Requires-Dist: click
+Requires-Dist: datasets
+Requires-Dist: ema_pytorch>=0.5.2
+Requires-Dist: gradio>=3.45.2
+Requires-Dist: hydra-core>=1.3.0
+Requires-Dist: jieba
+Requires-Dist: librosa
+Requires-Dist: matplotlib
+Requires-Dist: numpy<=1.26.4
+Requires-Dist: pydub
+Requires-Dist: pypinyin
+Requires-Dist: safetensors
+Requires-Dist: soundfile
+Requires-Dist: tomli
+Requires-Dist: torch>=2.0.0
+Requires-Dist: torchaudio>=2.0.0
+Requires-Dist: torchdiffeq
+Requires-Dist: tqdm>=4.65.0
+Requires-Dist: transformers
+Requires-Dist: transformers_stream_generator
+Requires-Dist: vocos
+Requires-Dist: wandb
+Requires-Dist: x_transformers>=1.31.14
+Provides-Extra: eval
+Requires-Dist: faster_whisper==0.10.1; extra == "eval"
+Requires-Dist: funasr; extra == "eval"
+Requires-Dist: jiwer; extra == "eval"
+Requires-Dist: modelscope; extra == "eval"
+Requires-Dist: zhconv; extra == "eval"
+Requires-Dist: zhon; extra == "eval"
+Dynamic: license-file
+
+# F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
+
+[](https://github.com/SWivid/F5-TTS)
+[](https://arxiv.org/abs/2410.06885)
+[](https://swivid.github.io/F5-TTS/)
+[](https://huggingface.co/spaces/mrfakename/E2-F5-TTS)
+[](https://modelscope.cn/studios/modelscope/E2-F5-TTS)
+[](https://x-lance.sjtu.edu.cn/)
+[](https://www.pcl.ac.cn)
+<!-- <img src="https://github.com/user-attachments/assets/12d7749c-071a-427c-81bf-b87b91def670" alt="Watermark" style="width: 40px; height: auto"> -->
+
+**F5-TTS**: Diffusion Transformer with ConvNeXt V2, faster trained and inference.
+
+**E2 TTS**: Flat-UNet Transformer, closest reproduction from [paper](https://arxiv.org/abs/2406.18009).
+
+**Sway Sampling**: Inference-time flow step sampling strategy, greatly improves performance
+
+### Thanks to all the contributors !
+
+## News
+- **2024/10/08**: F5-TTS & E2 TTS base models on [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS), [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), [🟣 Wisemodel](https://wisemodel.cn/models/SJTU_X-LANCE/F5-TTS_Emilia-ZH-EN).
+
+## Installation
+
+```bash
+# Create a python 3.10 conda env (you could also use virtualenv)
+conda create -n f5-tts python=3.10
+conda activate f5-tts
+
+# NVIDIA GPU: install pytorch with your CUDA version, e.g.
+pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
+
+# AMD GPU: install pytorch with your ROCm version, e.g. (Linux only)
+pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2
+
+# Intel GPU: install pytorch with your XPU version, e.g.
+# Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit must be installed
+pip install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/xpu
+```
+
+Then you can choose from a few options below:
+
+### 1. As a pip package (if just for inference)
+
+```bash
+pip install git+https://github.com/SWivid/F5-TTS.git
+```
+
+### 2. Local editable (if also do training, finetuning)
+
+```bash
+git clone https://github.com/SWivid/F5-TTS.git
+cd F5-TTS
+# git submodule update --init --recursive # (optional, if need bigvgan)
+pip install -e .
+```
+
+### 3. Docker usage
+```bash
+# Build from Dockerfile
+docker build -t f5tts:v1 .
+
+# Or pull from GitHub Container Registry
+docker pull ghcr.io/swivid/f5-tts:main
+```
+
+
+## Inference
+
+### 1. Gradio App
+
+Currently supported features:
+
+- Basic TTS with Chunk Inference
+- Multi-Style / Multi-Speaker Generation
+- Voice Chat powered by Qwen2.5-3B-Instruct
+- [Custom inference with more language support](src/f5_tts/infer/SHARED.md)
+
+```bash
+# Launch a Gradio app (web interface)
+f5-tts_infer-gradio
+
+# Specify the port/host
+f5-tts_infer-gradio --port 7860 --host 0.0.0.0
+
+# Launch a share link
+f5-tts_infer-gradio --share
+```
+
+### 2. CLI Inference
+
+```bash
+# Run with flags
+# Leave --ref_text "" will have ASR model transcribe (extra GPU memory usage)
+f5-tts_infer-cli \
+--model "F5-TTS" \
+--ref_audio "ref_audio.wav" \
+--ref_text "The content, subtitle or transcription of reference audio." \
+--gen_text "Some text you want TTS model generate for you."
+
+# Run with default setting. src/f5_tts/infer/examples/basic/basic.toml
+f5-tts_infer-cli
+# Or with your own .toml file
+f5-tts_infer-cli -c custom.toml
+
+# Multi voice. See src/f5_tts/infer/README.md
+f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
+```
+
+### 3. More instructions
+
+- In order to have better generation results, take a moment to read [detailed guidance](src/f5_tts/infer).
+- The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) are very useful, please try to find the solution by properly searching the keywords of problem encountered. If no answer found, then feel free to open an issue.
+
+
+## Training
+
+### 1. Gradio App
+
+Read [training & finetuning guidance](src/f5_tts/train) for more instructions.
+
+```bash
+# Quick start with Gradio web interface
+f5-tts_finetune-gradio
+```
+
+
+## [Evaluation](src/f5_tts/eval)
+
+
+## Development
+
+Use pre-commit to ensure code quality (will run linters and formatters automatically)
+
+```bash
+pip install pre-commit
+pre-commit install
+```
+
+When making a pull request, before each commit, run:
+
+```bash
+pre-commit run --all-files
+```
+
+Note: Some model components have linting exceptions for E722 to accommodate tensor notation
+
+
+## Acknowledgements
+
+- [E2-TTS](https://arxiv.org/abs/2406.18009) brilliant work, simple and effective
+- [Emilia](https://arxiv.org/abs/2407.05361), [WenetSpeech4TTS](https://arxiv.org/abs/2406.05763), [LibriTTS](https://arxiv.org/abs/1904.02882), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) valuable datasets
+- [lucidrains](https://github.com/lucidrains) initial CFM structure with also [bfs18](https://github.com/bfs18) for discussion
+- [SD3](https://arxiv.org/abs/2403.03206) & [Hugging Face diffusers](https://github.com/huggingface/diffusers) DiT and MMDiT code structure
+- [torchdiffeq](https://github.com/rtqichen/torchdiffeq) as ODE solver, [Vocos](https://huggingface.co/charactr/vocos-mel-24khz) and [BigVGAN](https://github.com/NVIDIA/BigVGAN) as vocoder
+- [FunASR](https://github.com/modelscope/FunASR), [faster-whisper](https://github.com/SYSTRAN/faster-whisper), [UniSpeech](https://github.com/microsoft/UniSpeech), [SpeechMOS](https://github.com/tarepan/SpeechMOS) for evaluation tools
+- [ctc-forced-aligner](https://github.com/MahmoudAshraf97/ctc-forced-aligner) for speech edit test
+- [mrfakename](https://x.com/realmrfakename) huggingface space demo ~
+- [f5-tts-mlx](https://github.com/lucasnewman/f5-tts-mlx/tree/main) Implementation with MLX framework by [Lucas Newman](https://github.com/lucasnewman)
+- [F5-TTS-ONNX](https://github.com/DakeQQ/F5-TTS-ONNX) ONNX Runtime version by [DakeQQ](https://github.com/DakeQQ)
+
+## Citation
+If our work and codebase is useful for you, please cite as:
+```
+@article{chen-etal-2024-f5tts,
+  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
+  author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
+  journal={arXiv preprint arXiv:2410.06885},
+  year={2024},
+}
+```
+## License
+
+Our code is released under MIT License. The pre-trained models are licensed under the CC-BY-NC license due to the training data Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause.
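The PKG-INFO above is the metadata that setuptools generated from the package's pyproject.toml; once f5-tts is installed, the same headers and Requires-Dist entries can be read back at runtime. A minimal standard-library sketch (it assumes the distribution is installed under the name `f5-tts`, as in the Name field above):

```python
# Sketch: read back the installed f5-tts metadata described by PKG-INFO.
from importlib.metadata import metadata, requires

meta = metadata("f5-tts")             # parsed PKG-INFO headers
print(meta["Name"], meta["Version"])  # e.g. "f5-tts 0.5.2"

for req in requires("f5-tts") or []:  # the Requires-Dist entries
    print(req)                        # e.g. 'torch>=2.0.0', 'funasr; extra == "eval"'
```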
F5-TTS/src/f5_tts.egg-info/SOURCES.txt
ADDED
@@ -0,0 +1,46 @@
+LICENSE
+README.md
+pyproject.toml
+src/f5_tts/api.py
+src/f5_tts/socket_server.py
+src/f5_tts.egg-info/PKG-INFO
+src/f5_tts.egg-info/SOURCES.txt
+src/f5_tts.egg-info/dependency_links.txt
+src/f5_tts.egg-info/entry_points.txt
+src/f5_tts.egg-info/requires.txt
+src/f5_tts.egg-info/top_level.txt
+src/f5_tts/eval/ecapa_tdnn.py
+src/f5_tts/eval/eval_infer_batch.py
+src/f5_tts/eval/eval_librispeech_test_clean.py
+src/f5_tts/eval/eval_seedtts_testset.py
+src/f5_tts/eval/eval_utmos.py
+src/f5_tts/eval/eval_v2c_test.py
+src/f5_tts/eval/utils_eval.py
+src/f5_tts/infer/infer_cli.py
+src/f5_tts/infer/infer_cli_libritts.py
+src/f5_tts/infer/infer_cli_s3.py
+src/f5_tts/infer/infer_cli_test.py
+src/f5_tts/infer/infer_cli_tts_test.py
+src/f5_tts/infer/infer_gradio.py
+src/f5_tts/infer/speech_edit.py
+src/f5_tts/infer/utils_infer.py
+src/f5_tts/model/__init__.py
+src/f5_tts/model/cfm.py
+src/f5_tts/model/dataset.py
+src/f5_tts/model/modules.py
+src/f5_tts/model/trainer.py
+src/f5_tts/model/utils.py
+src/f5_tts/model/backbones/dit.py
+src/f5_tts/model/backbones/mmdit.py
+src/f5_tts/model/backbones/unett.py
+src/f5_tts/scripts/count_max_epoch.py
+src/f5_tts/scripts/count_params_gflops.py
+src/f5_tts/train/finetune_cli.py
+src/f5_tts/train/finetune_gradio.py
+src/f5_tts/train/train.py
+src/f5_tts/train/datasets/prepare_csv_wavs.py
+src/f5_tts/train/datasets/prepare_emilia.py
+src/f5_tts/train/datasets/prepare_libritts.py
+src/f5_tts/train/datasets/prepare_ljspeech.py
+src/f5_tts/train/datasets/prepare_v2c.py
+src/f5_tts/train/datasets/prepare_wenetspeech4tts.py
F5-TTS/src/f5_tts.egg-info/dependency_links.txt
ADDED
@@ -0,0 +1 @@
+
F5-TTS/src/f5_tts.egg-info/entry_points.txt
ADDED
@@ -0,0 +1,5 @@
+[console_scripts]
+f5-tts_finetune-cli = f5_tts.train.finetune_cli:main
+f5-tts_finetune-gradio = f5_tts.train.finetune_gradio:main
+f5-tts_infer-cli = f5_tts.infer.infer_cli:main
+f5-tts_infer-gradio = f5_tts.infer.infer_gradio:main
+
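Each console_scripts entry maps a shell command to a `module:function` target, so the `f5-tts_infer-cli` command used in the README is a thin generated launcher. A rough Python equivalent of what that launcher does (target path taken from the entry above; the wrapper itself is a sketch, not the generated file):

```python
# Approximate behavior of the generated "f5-tts_infer-cli" console script:
# import the registered target and call it, passing its return code to the shell.
import sys
from f5_tts.infer.infer_cli import main  # target of "f5-tts_infer-cli" above

if __name__ == "__main__":
    sys.exit(main())
```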
F5-TTS/src/f5_tts.egg-info/requires.txt
ADDED
@@ -0,0 +1,36 @@
+accelerate>=0.33.0
+cached_path
+click
+datasets
+ema_pytorch>=0.5.2
+gradio>=3.45.2
+hydra-core>=1.3.0
+jieba
+librosa
+matplotlib
+numpy<=1.26.4
+pydub
+pypinyin
+safetensors
+soundfile
+tomli
+torch>=2.0.0
+torchaudio>=2.0.0
+torchdiffeq
+tqdm>=4.65.0
+transformers
+transformers_stream_generator
+vocos
+wandb
+x_transformers>=1.31.14
+
+[:platform_machine != "arm64" and platform_system != "Darwin"]
+bitsandbytes>0.37.0
+
+[eval]
+faster_whisper==0.10.1
+```
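The bracketed section in requires.txt is a PEP 508 environment marker: bitsandbytes is only required when the machine is not arm64 and the OS is not macOS. A hedged sketch of how such a marker is evaluated, using the `packaging` library (commonly present alongside pip, but treat the import as an assumption of this example):

```python
# Sketch: evaluate the conditional bitsandbytes requirement from requires.txt.
from packaging.markers import Marker

marker = Marker('platform_machine != "arm64" and platform_system != "Darwin"')
if marker.evaluate():  # True on e.g. x86_64 Linux, False on Apple Silicon macOS
    print("bitsandbytes>0.37.0 applies on this platform")
else:
    print("bitsandbytes is skipped on this platform")
```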
F5-TTS/src/f5_tts.egg-info/top_level.txt
ADDED
@@ -0,0 +1 @@
+f5_tts
MMAudio/demo.py
CHANGED
@@ -68,7 +68,8 @@ def main():
     seq_cfg = model.seq_cfg
 
     if args.video:
-        video_path: Path = Path(args.video).expanduser()
+        #video_path: Path = Path(args.video).expanduser()
+        video_path = args.video
     else:
         video_path = None
     prompt: str = args.prompt
@@ -117,22 +118,25 @@ def main():
     #test_scp = "/ailab-train/speech/zhanghaomin/datas/v2cdata/test.scp"
     test_scp = args.scp
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+    if video_path is None:
+        lines = []
+        with open(test_scp, "r") as fr:
+            lines += fr.readlines()
+        #with open(test_scp2, "r") as fr:
+        #    lines += fr.readlines()
+        tests = []
+        for line in lines[args.start: args.end]:
+            ####video_path, prompt = line.strip().split("\t")
+            ####prompt = "the sound of " + prompt
+            ####negative_prompt = ""
+            video_path, _, audio_path = line.strip().split("\t")
+            ####video_path = "/ailab-train/speech/zhanghaomin/datas/v2cdata/DragonII/DragonII_videos/Gobber/0725.mp4"
+            prompt = ""
+            #negative_prompt = "speech, voice, talking, speaking"
+            negative_prompt = ""
+            tests.append([video_path, prompt, negative_prompt, audio_path])
+    else:
+        tests = [[video_path, prompt, negative_prompt, ""]]
 
     print(datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3], "start")
     for video_path, prompt, negative_prompt, audio_path in tests:
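The new branch in demo.py builds its test list from a tab-separated .scp file, keeping the first and third fields as the video and reference-audio paths and ignoring the middle field. A small sketch of the line format this parsing expects (the file contents and paths below are illustrative, not taken from the repo):

```python
# Sketch of the .scp layout parsed by the new demo.py loop:
# each line is "<video_path>\t<ignored>\t<audio_path>".
example_line = "/data/v2c/clip_0001.mp4\tsome-id-or-text\t/data/v2c/clip_0001.wav\n"  # hypothetical

video_path, _, audio_path = example_line.strip().split("\t")
tests = [[video_path, "", "", audio_path]]  # prompt and negative_prompt left empty, as in the diff
print(tests)
```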
README.md
CHANGED
@@ -1,77 +1,74 @@
-- [WavLM-SV](https://huggingface.co/microsoft/wavlm-base-sv) for speech recognition in SPK-SIM evaluation.
-- [Whisper](https://huggingface.co/Systran/faster-whisper-large-v3) for speech recognition in WER evaluation.
+<div align="center">
+<p align="center">
+  <h2>DeepAudio-V1</h2>
+  <a href="https://arxiv.org/">Paper</a> | <a href="https://pages.github.com/">Webpage</a> | <a href="https://huggingface.co/">Models</a>
+</p>
+</div>
+
+
+## [DeepAudio-V1:Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation](https://pages.github.com/)
+
+
+## Installation
+
+**1. Create a conda environment**
+
+```bash
+conda create -n v2as python=3.10
+conda activate v2as
+```
+
+**2. F5-TTS base install**
+
+```bash
+cd ./F5-TTS
+pip install -e .
+```
+
+**3. Additional requirements**
+
+```bash
+pip install -r requirements.txt
+conda install cudnn
+```
+
+**Pretrained models**
+
+The models are available at https://huggingface.co/. See [MODELS.md](./MODELS.md) for more details.
+
+## Inference
+
+**1. V2A inference**
+
+```bash
+bash v2a.sh
+```
+
+**2. V2S inference**
+
+```bash
+bash v2s.sh
+```
+
+**3. TTS inference**
+
+```bash
+bash tts.sh
+```
+
+## Evaluation
+
+```bash
+bash eval_v2c.sh
+```
+
+
+## Acknowledgement
+
+- [MMAudio](https://github.com/hkchengrex/MMAudio) for video-to-audio backbone and pretrained models
+- [F5-TTS](https://github.com/SWivid/F5-TTS) for text-to-speech and video-to-speech backbone
+- [V2C](https://github.com/chenqi008/V2C) for animated movie benchmark
+- [Wav2Vec2-Emotion](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) for emotion recognition in EMO-SIM evaluation.
+- [WavLM-SV](https://huggingface.co/microsoft/wavlm-base-sv) for speech recognition in SPK-SIM evaluation.
+- [Whisper](https://huggingface.co/Systran/faster-whisper-large-v3) for speech recognition in WER evaluation.
+
app.py
CHANGED
@@ -16,23 +16,37 @@ import torchaudio
 
 import tempfile
 
+import requests
+
 log = logging.getLogger()
 
 
 #@spaces.GPU(duration=120)
-
-
-
+def video_to_audio(video: gr.Video, prompt: str):
+
+    video_path = tempfile.NamedTemporaryFile(delete=False, suffix='.mp4').name
+
+    output_dir = video_path.rsplit("/", 1)[0]
+    video_save_path = str(output_dir) + "/" + str(video_path).replace("/", "__").strip(".") + ".mp4"
+
+    print("paths", video, video_path, output_dir, video_save_path)
 
-
+    if video.startswith("http"):
+        data = requests.get(video, timeout=60).content
+        with open(video_path, "wb") as fw:
+            fw.write(data)
+    else:
+        os.system("cp %s %s" % (video, video_path))
 
-
+    os.system("cd ./MMAudio; python ./demo.py --output %s --video_path %s --prompt %s --calc_energy 1" % (output_dir, video_path, prompt))
+
+    return video_save_path
 
 
 video_to_audio_tab = gr.Interface(
     fn=video_to_audio,
     description="""
-    Project page: <a href="https://
+    Project page: <a href="https://acappemin.github.io/DeepAudio-V1.github.io">https://acappemin.github.io/DeepAudio-V1.github.io</a><br>
     Code: <a href="https://github.com/acappemin/DeepAudio-V1">https://github.com/acappemin/DeepAudio-V1</a><br>
     """,
     inputs=[
@@ -41,16 +55,19 @@ video_to_audio_tab = gr.Interface(
     ],
     outputs='playable_video',
     cache_examples=False,
-    title='
+    title='Video-to-Audio',
     examples=[
         [
-            '
+            './tests/0235.mp4',
+            '',
+        ],
+        [
+            './tests/0778.mp4',
             '',
         ],
     ])
 
 
 if __name__ == "__main__":
-    gr.TabbedInterface([video_to_audio_tab],
-        ['Video-to-Audio']).launch()
+    gr.TabbedInterface([video_to_audio_tab], ['Video-to-Audio']).launch()
 
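The rewritten app.py callback accepts either a URL or a local file: remote inputs are downloaded with requests into a temp file, local ones are copied, and MMAudio/demo.py is then run on that path. A condensed sketch of the same flow outside Gradio, assuming the URL, prompt, and working directory are placeholders and mirroring the diff's os.system call rather than recommending it:

```python
# Sketch of the video_to_audio flow added in app.py (placeholders, not repo values).
import os
import tempfile
import requests

def fetch_video(source: str) -> str:
    """Return a local .mp4 path for either a URL or an existing file."""
    video_path = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4").name
    if source.startswith("http"):
        with open(video_path, "wb") as fw:
            fw.write(requests.get(source, timeout=60).content)
    else:
        os.system("cp %s %s" % (source, video_path))
    return video_path

local_mp4 = fetch_video("https://example.com/clip.mp4")  # placeholder URL
output_dir = local_mp4.rsplit("/", 1)[0]
prompt = "ambient"                                       # placeholder prompt text
os.system("cd ./MMAudio; python ./demo.py --output %s --video_path %s --prompt %s --calc_energy 1"
          % (output_dir, local_mp4, prompt))
```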
v2s.sh
CHANGED
@@ -1 +1 @@
-python ./F5-TTS/src/f5_tts/infer/infer_cli_test.py --output_dir ./tests/outputs_v2c_l44_test/ --start 0 --end 10 --ckpt_file ./F5-TTS/ckpts/v2c/
+python ./F5-TTS/src/f5_tts/infer/infer_cli_test.py --output_dir ./tests/outputs_v2c_l44_test/ --start 0 --end 10 --ckpt_file ./F5-TTS/ckpts/v2c/v2c_s16.pt --v2a_path ./tests/outputs_v2a_l44_test/ --infer_list ./tests/v2c_test.lst
|