Most vision LMs focus on the image as a whole: their captions lack localized references, and they can't take visual prompts (points, boxes, or drawings around objects).
DAM addresses this on two levels: a new vision backbone that takes in both focal crops and the image itself, and a large-scale dataset.
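To make the focal idea concrete, here is a toy sketch of a backbone that encodes both inputs and lets the crop's tokens attend to global context. It is purely illustrative: the shared encoder, token shapes, and cross-attention fusion are assumptions, not DAM's actual design.

```python
# Toy "global + focal" backbone in PyTorch. Illustrative only: the shared
# encoder and cross-attention fusion are assumptions, not DAM's design.
import torch
import torch.nn as nn

class FocalBackbone(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Shared patch encoder applied to both the full image and the focal crop.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # patchify
            nn.Flatten(2),                                 # (B, dim, num_patches)
        )
        # Focal-crop tokens attend to global-image tokens for context.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, full_image, focal_crop):
        global_tokens = self.encoder(full_image).transpose(1, 2)  # (B, N, dim)
        focal_tokens = self.encoder(focal_crop).transpose(1, 2)   # (B, M, dim)
        fused, _ = self.cross_attn(focal_tokens, global_tokens, global_tokens)
        return fused  # region tokens conditioned on whole-image context

tokens = FocalBackbone()(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```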
They generate the dataset by extending existing segmentation and referring-expression datasets like RefCOCO: they pass the images and class labels to VLMs, which write the detailed captions.
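A hedged sketch of what that generation loop could look like (the sample format and `vlm.describe` call are hypothetical placeholders, not the authors' actual pipeline):

```python
# Hedged sketch of the caption-generation loop described above.
# `samples` and `vlm.describe` are hypothetical placeholders.
from PIL import Image

def build_region_captions(samples, vlm):
    records = []
    for sample in samples:
        # each sample: {"image_path": ..., "box": (x1, y1, x2, y2), "class_name": ...}
        image = Image.open(sample["image_path"])
        crop = image.crop(sample["box"])  # focal crop of the annotated region
        prompt = f"Describe the {sample['class_name']} in this region in detail."
        caption = vlm.describe(images=[image, crop], prompt=prompt)  # hypothetical API
        records.append({**sample, "caption": caption})
    return records
```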
Lastly, they also release a new benchmark, again built in a self-supervised fashion: an LLM evaluates the detailed captions, focusing on localization.
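An LLM-as-judge setup along those lines might look like this; the prompt wording and the `llm_complete` chat-completion client are assumptions for illustration, not the benchmark's exact protocol:

```python
# Illustrative LLM-as-judge scoring for region captions; prompt wording and
# the `llm_complete` client are assumptions, not the benchmark's protocol.
JUDGE_PROMPT = """You are grading a region-level caption.
Ground-truth region description: {reference}
Predicted caption: {prediction}
Score 1-10 for how accurately the caption describes this specific region,
penalizing details that belong elsewhere in the image. Reply with the number only."""

def judge_caption(reference: str, prediction: str, llm_complete) -> int:
    reply = llm_complete(JUDGE_PROMPT.format(reference=reference, prediction=prediction))
    return int(reply.strip())
```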
ClearerVoice-Studio New Feature: Speech Super-Resolution with MossFormer2! We're excited to announce that ClearerVoice-Studio now supports speech super-resolution, powered by our latest MossFormer2-based model! What's New?
Convert Low-Resolution to High-Resolution Audio: Transform low-resolution audio (effective sampling rate ≥ 16 kHz) into crystal-clear, high-resolution audio at 48 kHz.
Cutting-Edge Technology: Leverages the MossFormer2 model plus HiFi-GAN, optimised for generating high-quality audio with enhanced perceptual clarity.
Enhanced Listening Experience: Perfect for speech enhancement, content restoration, and high-fidelity audio applications.
Try It Out! Upgrade to the latest version of ClearerVoice-Studio (https://github.com/modelscope/ClearerVoice-Studio) to experience this powerful feature. Check out the updated documentation and examples in our repository.
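For reference, here is a minimal usage sketch following the pattern of the other tasks in the repo README; the exact task and model names below are our assumption, so check the repository docs for the precise identifiers:

```python
# Minimal usage sketch, assuming the same ClearVoice interface used by the
# repo's other tasks. The identifiers 'speech_super_resolution' and
# 'MossFormer2_SR_48K' are assumptions; verify them against the docs.
from clearvoice import ClearVoice

cv = ClearVoice(task='speech_super_resolution', model_names=['MossFormer2_SR_48K'])
output_wav = cv(input_path='samples/input_16k.wav', online_write=False)
cv.write(output_wav, output_path='samples/output_48k.wav')
```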
Let us know your thoughts, feedback, or feature requests in the Issues section.