# VATLM
<!--**Pre-trained models for speech related tasks**-->
[**VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning**](https://arxiv.org/abs/2211.11275)
- (Done) Nov. 2022: release the code and models
- Nov. 2022: release preprint on [arXiv](https://arxiv.org/abs/2211.11275)
## Pre-Trained and Fine-tuned Models
| Model | Pre-training Dataset | Fine-tuning Dataset | Download Link |
| :---------: | :----------------------------------------: | :-------------------: | :----------------------------------------------------------: |
| VatLM Base | LRS3 + paired audio+text+audio | - | [Google drive](https://drive.google.com/file/d/121ITJc22prpbd4sCy9bPWpdkKgGikkgm/view?usp=share_link) |
| VatLM Base | LRS3 + paired audio+text+audio | LRS-30h audio-visual | [Google drive](https://drive.google.com/file/d/1Bfbq0G-tASw3YrI3rzdpYgTE-UV-YaN0/view?usp=share_link) |
| VatLM Base | LRS3 + paired audio+text+audio | LRS-30h visual | [Google drive](https://drive.google.com/file/d/1qALD9obym0zCDoszVn2CzW0U3EUl-4v7/view?usp=share_link) |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | - | [Google drive](https://drive.google.com/file/d/1piae9Row25OEfAekVz5Bxb9YnIVyEP0A/view?usp=share_link) |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h audio-visual | [Google drive](https://drive.google.com/file/d/13JVuUi9gIIoUM888XcAOzvN7ioazn-cv/view?usp=share_link) |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h visual | [Google drive](https://drive.google.com/file/d/1pAQHf60HgqDORGzyqEjdGTIywLKO3Ko5/view?usp=share_link) |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h audio-visual | [Google drive](https://drive.google.com/file/d/1u9oMnivBelxznQcMDoM_u5EOfJuxnSuL/view?usp=share_link) |
| VatLM Base | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h visual | [Google drive](https://drive.google.com/file/d/1g107k5tL3XyvevSe0BzMqYOQFyFQG7jf/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | - | [Google drive](https://drive.google.com/file/d/1_vbVFpKcaaPcCx2FtI-GyzVvxAhppg_b/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h audio-visual | [Google drive](https://drive.google.com/file/d/1LyTCxceTZIqjVdMY6hlJjWolaIAZ0Mhs/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h visual | [Google drive](https://drive.google.com/file/d/1CuyGg5O14F9Y_WCwpCVoKYbDKVtjBRQU/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h audio-visual | [Google drive](https://drive.google.com/file/d/12orvO3xBuzdUDrBOqjW0mdGhV2Kmsy0Q/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h visual | [Google drive](https://drive.google.com/file/d/17DDTUPs0BkaJtSUTiJHLBbymt2LCGo6e/view?usp=share_link) |
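The checkpoints above are hosted on Google Drive. Below is a minimal download sketch using the third-party `gdown` tool (not part of this repo); the file ID is taken from the first row of the table, and the output filename is arbitrary.
```bash
# Download the VatLM Base (LRS3) pre-trained checkpoint from the table above.
pip install gdown
gdown "https://drive.google.com/uc?id=121ITJc22prpbd4sCy9bPWpdkKgGikkgm" -O vatlm_base_lrs3.pt
```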
## Setup
To fine-tune or pre-train more models, please follow the instructions below.
```bash
git clone https://github.com/microsoft/SpeechT5.git
cd SpeechT5/VATLM
git submodule init && git submodule update
cd fairseq && pip install --editable ./
cd ../vat_hubert && pip install -r requirements.txt
```
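A quick sanity check that the editable fairseq install is importable (illustrative only):
```bash
# Confirm that the editable fairseq installation is visible to Python.
python -c "import fairseq; print(fairseq.__version__)"
```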
## Data preparation
1. For audio or visual data, please follow AV-HuBERT's data preparation [script](https://github.com/facebookresearch/av_hubert/tree/main/avhubert/preparation) to pre-process the data and obtain the corresponding `train.tsv` and `train.km` files.
2. For audio-only data, the visual modality is replaced with a zero vector; features are extracted following the same [script](https://github.com/facebookresearch/av_hubert/tree/main/avhubert/preparation), and k-means [clustering](https://github.com/facebookresearch/av_hubert/tree/main/avhubert/clustering) is then applied to obtain the corresponding labels (see the labeling sketch after this list).
3. For text-only data, we use a small amount of paired text-audio data to obtain paired phone-unit data: the phoneme sequences are obtained by looking up the [lexicon](https://drive.google.com/file/d/1dh9NEx_cCF9_Aa0UcKyl9j00GXs6LmLQ/view?usp=sharing), and the unit data are obtained by extracting features and running k-means [clustering](https://github.com/facebookresearch/av_hubert/tree/main/avhubert/clustering). Then follow this [script](https://github.com/microsoft/SpeechT5/tree/main/SpeechLM#hidden-unit-tokenizer-for-text) to train the phone2unit model.
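The following is a minimal sketch of the feature-extraction and k-means labeling flow referenced in steps 2 and 3. It follows the fairseq HuBERT `simple_kmeans` recipe that the linked AV-HuBERT clustering scripts are modeled on; the script names, paths, and cluster count are illustrative and may differ slightly in your setup.
```bash
# Illustrative labeling flow (simple_kmeans-style); all paths are placeholders.
tsv_dir=/path/to/manifests        # directory containing ${split}.tsv
split=train
ckpt_path=/path/to/teacher_checkpoint.pt
layer=12                          # transformer layer to extract features from
nshard=1
rank=0
feat_dir=/path/to/features
km_path=/path/to/kmeans_model
lab_dir=/path/to/labels

# 1) Dump features from a pre-trained teacher model
python dump_hubert_feature.py ${tsv_dir} ${split} ${ckpt_path} ${layer} ${nshard} ${rank} ${feat_dir}
# 2) Fit a k-means model (500 clusters here is illustrative)
python learn_kmeans.py ${feat_dir} ${split} ${nshard} ${km_path} 500 --percent 0.1
# 3) Assign cluster labels to produce the ${split}.km file
python dump_km_label.py ${feat_dir} ${split} ${km_path} ${nshard} ${rank} ${lab_dir}
```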
## Pre-train
- VatLM Base model (LRS3 + paired audio+text+audio)
```shell
cd VATLM/vat_hubert/vathubert/scripts/pretrain
ngpu=32
updatefreq=1
save_path=/path/to/save_path
bash base_lsr3_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
```
- VatLM Base model (VoxCeleb2 + LRS3 + paired audio+text+audio)
```shell
cd VATLM/vat_hubert/vathubert/scripts/pretrain
ngpu=32
updatefreq=1
save_path=/path/to/save_path
bash base_vox_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
```
- VatLM Large model (VoxCeleb2 + LRS3 + paired audio+text+audio)
```shell
cd VATLM/vat_hubert/vathubert/scripts/pretrain
ngpu=32
updatefreq=2
save_path=/path/to/save_path
bash large_vox_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
```
## Fine-tune AVSR/VSR
For example, the AVSR model can be obtained by fine-tuning the VatLM model using 30 hours of labeled data.
```shell
cd VATLM/vat_hubert/vathubert/scripts/finetune_avsr
ngpu=8
updatefreq=1
save_path=/path/to/save_path
bash base_lrs3_finetune30_av.sh ${ngpu} ${updatefreq} ${save_path}
```
## Decode
For example, to decode the fine-tuned AVSR model:
```sh
cd VATLM/vat_hubert/vathubert/
data="test"
bash decode_avhubert_lrs3.sh ${data}
```
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [FAIRSEQ](https://github.com/pytorch/fairseq) and [av_hubert](https://github.com/facebookresearch/av_hubert) projects.
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
## Reference
If you find our work useful in your research, please cite the following paper:
```bibtex
@article{zhu2022vatlm,
title={VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning},
author={Qiushi Zhu and Long Zhou and Ziqiang Zhang and Shujie Liu and Binxing Jiao and Jie Zhang and Lirong Dai and Daxin Jiang and Jinyu Li and Furu Wei},
year={2022},
eprint={2211.11275},
archivePrefix={arXiv},
}
```
### Contact Information
For help or issues using VatLM models, please submit a GitHub issue.
For other communications related to VatLM, please contact Long Zhou (`[email protected]`).