---
title: DeepSound-V1
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---

# DeepSound-V1

Paper | Webpage | Huggingface Demo

## [DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos](https://github.com/lym0302/DeepSound-V1)

## Highlight

DeepSound-V1 is a framework for generating audio from videos with an initial step-by-step thinking stage, requiring no extra annotations, built on the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM).

## Installation

```bash
conda create -n deepsound-v1 python=3.10.16 -y
conda activate deepsound-v1
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu120
pip install flash-attn==2.5.8 --no-build-isolation
pip install -e .
pip install -r requirements.txt
```

## Demo

### Pretrained models

See [MODELS.md](docs/MODELS.md).

### Command-line interface

With `demo.py`:

```bash
python demo.py -i
```

All training parameters are [here]().

A hedged example invocation is sketched at the end of this README.

## Evaluation

Refer to [av-benchmark](https://github.com/hkchengrex/av-benchmark) for benchmarking results.

See [EVAL.md](docs/EVAL.md).

## Citation

## Relevant Repositories

- [av-benchmark](https://github.com/hkchengrex/av-benchmark) for benchmarking results.

## Acknowledgement

Many thanks to:

- [VideoLLaMA2](https://github.com/DAMO-NLP-SG/VideoLLaMA2)
- [MMAudio](https://github.com/hkchengrex/MMAudio)
- [FoleyCrafter](https://github.com/open-mmlab/FoleyCrafter)
- [BS-RoFormer](https://github.com/ZFTurbo/Music-Source-Separation-Training)
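
The command-line snippet in the Demo section above omits the argument to `-i`. Below is a minimal, hypothetical usage sketch; treating `-i` as the path of an input video is an assumption, not something confirmed by this README, so verify the actual flags against the repository (or `python demo.py --help`, if the script uses argparse).

```bash
# Hypothetical sketch: assumes `-i` takes the path of an input video.
# The real flags of demo.py may differ; check the repository before relying on this.
conda activate deepsound-v1
python demo.py -i path/to/input_video.mp4
```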