Softpick: No Attention Sink, No Massive Activations with Rectified Softmax Paper • 2504.20966 • Published 9 days ago • 25
CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning Paper • 2504.13820 • Published 20 days ago • 17
It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization Paper • 2504.13173 • Published 21 days ago • 18
Describe Anything: Detailed Localized Image and Video Captioning Paper • 2504.16072 • Published 16 days ago • 60
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation Paper • 2504.13055 • Published 21 days ago • 19
Perception Encoder: The best visual embeddings are not at the output of the network Paper • 2504.13181 • Published 21 days ago • 34
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling Paper • 2504.13169 • Published 21 days ago • 39
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding Paper • 2504.09925 • Published 24 days ago • 38
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models Paper • 2504.10479 • Published 24 days ago • 255
Towards Visual Text Grounding of Multimodal Large Language Model Paper • 2504.04974 • Published Apr 7 • 16
Scaling Laws for Native Multimodal Models Paper • 2504.07951 • Published 28 days ago • 27
Clinical ModernBERT: An efficient and long context encoder for biomedical text Paper • 2504.03964 • Published Apr 4 • 5
Concept Lancet: Image Editing with Compositional Representation Transplant Paper • 2504.02828 • Published Apr 3 • 17
SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published about 1 month ago • 180
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models Paper • 2504.02821 • Published Apr 3 • 10
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers Paper • 2504.00502 • Published Apr 1 • 22
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks Paper • 2504.01308 • Published Apr 2 • 13