Merve Noyan

merve

AI & ML interests

VLMs, vision & co

Recent Activity

posted an update 3 days ago
upvoted a collection 4 days ago
Perception Encoder

Organizations

Hugging Face, Google, SODA, Notebooks-explorers, Deprem Yapay Zeka, Deprem Private, PyTorch Image Models, Turkish NLP Dataset Creators, Templates, Demo Crafters 🤗, Keras, tensorflow, Mukayese, HugGAN Community, EPFL VILAB, Hugging Face Fellows, Huggingface.js, Tools, HuggingFaceM4, scikit-learn, JAX ♥️ Diffusers 🧨, 2023 Jan Offsite hackathon, HF Canonical Model Maintainers, scikit-learn, fastai X Hugging Face Group 2022, Huggingface Projects, boun-tabi-LMG, Kornia AI, skops-tests, Hugging Face H4, Keras Dreambooth Event, Turkish T5 - BERT - GPT-2, Blog-explorers, Hugging Face for Computer Vision, Hacktoberfest 2023, Hugging Face Smol Models Research, adept-hf-collab, Qwen, ZeroGPU Explorers, kotol, Magic Leap Community, Llava Hugging Face, MLX Community, Social Post Explorers, Top Contributors: Profile Followers, Dev Mode Explorers, Paris AI Running Club, yorg, CVPR2024, Les papiers de Merve, nltpt, s0409, Hugging Face FineVideo, mv, Cookbook Authors, open/ acc, Agents, University of Sydney, smolagents, s0225, Orr and associates org, gg-hf-g, VLMs

Posts 102

Don't sleep on Meta's new vision-language release! 🔥

facebook/perception-encoder-67f977c9a65ca5895a7f6ba1
facebook/perception-lm-67f9783f171948c383ee7498

Meta dropped Swiss Army knives for vision, under the Apache 2.0 license 👏
> image/video encoders for vision-language modelling and spatial understanding (object detection, etc.) 👏
> The vision LM outperforms InternVL3 and Qwen2.5VL 👏
> They also release gigantic video and image datasets

The authors set out to build a single, versatile vision encoder that can be aligned to a diverse set of tasks.

They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. On zero-shot image tasks, it outperforms the latest state of the art, SigLIP 2 👏
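
To make "aligned for vision-language" concrete: a CLIP-style dual encoder like PE-Core scores an image against free-form text prompts, so zero-shot classification is just a similarity ranking. Here is a minimal sketch; the checkpoint ID and the assumption that the model loads through transformers' AutoModel/AutoProcessor with CLIP-style outputs are mine, not from the release.

```python
# Hedged sketch of zero-shot classification with a CLIP-style encoder such as
# PE-Core. The model ID below is hypothetical; check the
# facebook/perception-encoder collection for the real checkpoint names.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "facebook/PE-Core-L-448"  # hypothetical ID
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("street.jpg")
labels = ["a photo of a bus", "a photo of a bicycle", "a photo of a dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# CLIP-style models expose image-text similarity logits; a softmax over the
# candidate labels turns them into a zero-shot prediction. The attribute name
# follows transformers' CLIP convention and is an assumption here.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```
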



> Among the fine-tuned variants, the first is PE-Spatial: a model for bounding-box detection, segmentation, and depth estimation, and it outperforms all other models 😮



> The second is PLM (Perception Language Model), which combines PE-Core with the Qwen2.5 7B LM. It outperforms all other models (including InternVL3, which was also trained with a Qwen2.5 LM!)
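
Since PLM is an image-text-to-text model (a PE-Core vision encoder feeding a Qwen2.5 LM), prompting it should look like any chat-style VLM in transformers. A rough sketch, assuming the checkpoint works with the `image-text-to-text` pipeline; the model ID is a guess based on the names in this post.

```python
# Hedged sketch: prompting a Perception Language Model checkpoint as a
# chat-style VLM. The model ID is hypothetical; see the facebook/perception-lm
# collection for the released 1B/3B/8B checkpoints.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="facebook/Perception-LM-8B")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/kitchen.jpg"},
            {"type": "text", "text": "What is the person in the image doing?"},
        ],
    }
]

# The pipeline applies the model's chat template, runs the vision encoder on
# the image, and decodes the LM's answer as the final chat turn.
out = pipe(text=messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])
```
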

The authors release the following checkpoints in base, large, and giant sizes (a sketch for enumerating the release collections follows the list):

> 3 PE-Core checkpoints (224, 336, 448)
> 2 PE-Lang checkpoints (L, G)
> One PE-Spatial (G, 448)
> 3 PLM (1B, 3B, 8B)
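
The release is organized as Hub collections (linked above), so one way to enumerate everything is the huggingface_hub collections API. A small sketch, using the perception-encoder collection slug from this post:

```python
# List every item in the Perception Encoder collection via its slug.
# get_collection is a real huggingface_hub API; what gets printed depends on
# how Meta organized the collection.
from huggingface_hub import get_collection

collection = get_collection("facebook/perception-encoder-67f977c9a65ca5895a7f6ba1")
for item in collection.items:
    print(item.item_type, item.item_id)
```
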
The authors release the following datasets 📑 (a loading sketch follows the list):
> PE Video: a gigantic video dataset of 1M videos with 120k expert annotations ⏯️
> PLM-Video and PLM-Image: human- and auto-annotated image and video datasets for region-based tasks
> PLM-VideoBench: a new video benchmark for multiple-choice question answering (MCQA)
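
All three should be loadable with the datasets library once the repo IDs are known. A hedged sketch, with the dataset path guessed from the names above:

```python
# Hedged sketch for streaming PE Video without downloading the full 1M-video
# corpus. The dataset ID is an assumption based on the name in the post.
from datasets import load_dataset

pe_video = load_dataset("facebook/PE-Video", split="train", streaming=True)
sample = next(iter(pe_video))
print(sample.keys())  # expect a video reference plus its expert annotation
```
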

Articles 25


Cohere on Hugging Face Inference Providers 🔥