Describe Anything: Detailed Localized Image and Video Captioning • arXiv:2504.16072 • Published Apr 2025 • 60 upvotes
HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation • arXiv:2504.13072 • Published Apr 2025 • 12 upvotes
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning • arXiv:2504.07128 • Published Apr 2, 2025 • 83 upvotes
SmolVLM: Redefining small and efficient multimodal models • arXiv:2504.05299 • Published Apr 2025 • 180 upvotes
One-Minute Video Generation with Test-Time Training • arXiv:2504.05298 • Published Apr 2025 • 102 upvotes
Vision-Speech Models: Teaching Speech Models to Converse about Images • arXiv:2503.15633 • Published Mar 19, 2025 • 1 upvote
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs • arXiv:2503.01743 • Published Mar 3, 2025 • 87 upvotes
S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning • arXiv:2502.12853 • Published Feb 18, 2025 • 29 upvotes
One-step Diffusion Models with f-Divergence Distribution Matching • arXiv:2502.15681 • Published Feb 21, 2025 • 7 upvotes
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features • arXiv:2502.14786 • Published Feb 20, 2025 • 144 upvotes
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment • arXiv:2502.04328 • Published Feb 6, 2025 • 30 upvotes