This repository hosts the AV_MossFormer2_TSE_16K model weights for 16 kHz audio-visual target speaker extraction in the ClearerVoice-Studio repo.
The model is trained on large-scale open-source datasets.
It extracts each speaker's voice from a multi-speaker video using facial recognition.
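The extraction step can be sketched as a small wrapper. This is a minimal sketch, not a definitive recipe: it assumes the `clearvoice` Python package from ClearerVoice-Studio is installed, and that its `ClearVoice` class accepts the task name, model name, and call arguments shown, as in that repo's demo script. The function name `extract_target_speakers` and the directory arguments are hypothetical.

```python
from pathlib import Path


def extract_target_speakers(input_dir: str, output_dir: str) -> None:
    """Run audio-visual target speaker extraction on the videos in input_dir."""
    src = Path(input_dir)
    # Fail early, before loading any model weights, if the input is missing.
    if not src.is_dir():
        raise FileNotFoundError(f"input directory not found: {src}")

    # Imported lazily so the path check above works without the package.
    # Assumption: the `clearvoice` package from ClearerVoice-Studio is installed.
    from clearvoice import ClearVoice

    # Assumption: task and model names follow the ClearerVoice-Studio demo script.
    model = ClearVoice(
        task="target_speaker_extraction",
        model_names=["AV_MossFormer2_TSE_16K"],
    )
    # online_write=True asks the library to write its results to output_path.
    model(input_path=str(src), online_write=True, output_path=output_dir)
```

Under those assumptions, calling `extract_target_speakers("videos_in", "videos_out")` would process the multi-speaker videos in `videos_in` and write the extracted per-speaker results to `videos_out`.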