Merve Noyan
merve
AI & ML interests
VLMs, vision & co
Recent Activity
posted an update 3 days ago
Don't sleep on the new AI at Meta vision-language release! 🔥
https://huggingface.co/collections/facebook/perception-encoder-67f977c9a65ca5895a7f6ba1
https://huggingface.co/collections/facebook/perception-lm-67f9783f171948c383ee7498
Meta dropped Swiss Army knives for vision with an Apache 2.0 license
> image/video encoders for vision-language modeling and spatial understanding (object detection, etc.)
> The vision LM outperforms InternVL3 and Qwen2.5VL
> They also release gigantic video and image datasets
The authors set out to build a single versatile vision encoder that can be aligned to a diverse set of tasks.
They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. On zero-shot image tasks it outperforms the latest state of the art, SigLIP 2
> Among the fine-tuned ones, the first is PE-Spatial: a model for bounding-box detection, segmentation, and depth estimation, and it outperforms all other models 😮
> The second is PLM, Perception Language Model, where they combine PE-Core with the Qwen2.5 7B LM. It outperforms all other models (including InternVL3, which was also trained with a Qwen2.5 LM!)
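For context, "aligned for vision-language" in CLIP-style encoders like PE-Core means image and text embeddings live in a shared space, so zero-shot classification reduces to cosine similarity between one image embedding and one text embedding per candidate label. Here is a minimal sketch of that scoring step with dummy embeddings (NumPy only; this is not the actual PE API — model loading and preprocessing are omitted):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """CLIP-style zero-shot classification: cosine similarity between
    an image embedding and per-class text embeddings, softmaxed into
    class probabilities."""
    # L2-normalize so the dot product equals cosine similarity
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature
    # numerically stable softmax over classes
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Dummy 4-d embeddings standing in for real encoder outputs
image_emb = np.array([0.9, 0.1, 0.0, 0.2])
text_embs = np.array([
    [1.0, 0.0, 0.0, 0.1],  # e.g. "a photo of a cat"
    [0.0, 1.0, 0.2, 0.0],  # e.g. "a photo of a dog"
])
probs = zero_shot_classify(image_emb, text_embs)
print(probs.argmax())  # class 0 is most similar to the image
```

With a real checkpoint, the image and text embeddings would come from the encoder's image and text towers; the scoring step stays the same.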
The authors release the following checkpoints in sizes base, large and giant:
> 3 PE-Core checkpoints (224, 336, 448)
> 2 PE-Lang checkpoints (L, G)
> One PE-Spatial (G, 448)
> 3 PLM (1B, 3B, 8B)
> Datasets
The authors release the following datasets:
> PE Video: a gigantic video dataset of 1M videos with 120k expert annotations ⏯️
> PLM-Video and PLM-Image: Human and auto-annotated image and video datasets on region-based tasks
> PLM-VideoBench: New video benchmark on MCQA
merve's activity
Cohere on Hugging Face Inference Providers 🔥
published an article about 2 months ago
Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM
SigLIP 2: A better multilingual vision language encoder
SmolVLM2: Bringing Video Understanding to Every Device
Open-source DeepResearch – Freeing our search agents
We now support VLMs in smolagents!
SmolVLM Grows Smaller – Introducing the 250M & 500M Models!
Introducing smolagents: simple agents that write actions in code.
Welcome PaliGemma 2 – New vision language models by Google
SmolVLM - small yet mighty Vision Language Model
Llama can now see and run on your device - welcome Llama 3.2
Preference Optimization for Vision Language Models
Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models
PaliGemma – Google's Cutting-Edge Open Vision Language Model
published an article about 1 year ago
published an article about 1 year ago
PaliGemma 2 Mix - New Instruction Vision Language Models by Google
published an article over 1 year ago
Introduction to Quantization cooked in 🤗 with 🧑‍🍳
published an article over 1 year ago
Deploy MusicGen in no time with Inference Endpoints
published an article almost 2 years ago
Open-Source Text Generation & LLM Ecosystem at Hugging Face
published an article about 2 years ago