SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers Paper • 2407.09413 • Published Jul 12, 2024 • 11
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct Paper • 2409.05840 • Published Sep 9, 2024 • 49
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning Paper • 2409.12568 • Published Sep 19, 2024 • 51
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions Paper • 2410.10816 • Published Oct 14, 2024 • 21
Harnessing Webpage UIs for Text-Rich Visual Understanding Paper • 2410.13824 • Published Oct 17, 2024 • 32
EMMA: End-to-End Multimodal Model for Autonomous Driving Paper • 2410.23262 • Published Oct 30, 2024 • 2
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions Paper • 2411.07461 • Published Nov 12, 2024 • 23
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation Paper • 2411.08380 • Published Nov 13, 2024 • 27
LLaVA-o1: Let Vision Language Models Reason Step-by-Step Paper • 2411.10440 • Published Nov 15, 2024 • 125
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection Paper • 2411.14794 • Published Nov 22, 2024 • 13
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format Paper • 2411.17991 • Published Nov 27, 2024 • 5
GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation Paper • 2411.18499 • Published Nov 27, 2024 • 18
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation Paper • 2412.00927 • Published Dec 1, 2024 • 28
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Paper • 2412.05237 • Published Dec 6, 2024 • 48
CompCap: Improving Multimodal Large Language Models with Composite Captions Paper • 2412.05243 • Published Dec 6, 2024 • 19
Maya: An Instruction Finetuned Multilingual Multimodal Model Paper • 2412.07112 • Published Dec 10, 2024 • 29
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation Paper • 2412.07147 • Published Dec 10, 2024 • 5
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption Paper • 2412.09283 • Published Dec 12, 2024 • 19
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding Paper • 2412.17295 • Published Dec 23, 2024 • 9
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search Paper • 2412.18319 • Published Dec 24, 2024 • 40
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published Jan 1, 2025 • 107
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics Paper • 2501.04686 • Published Jan 8, 2025 • 54
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature Paper • 2501.07171 • Published Jan 13, 2025 • 56
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks Paper • 2501.08326 • Published Jan 14, 2025 • 35
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation Paper • 2502.08168 • Published Feb 12, 2025 • 12
RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm Paper • 2502.12513 • Published Feb 18, 2025 • 16
Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions Paper • 2503.00501 • Published Mar 1, 2025 • 12
Unified Reward Model for Multimodal Understanding and Generation Paper • 2503.05236 • Published Mar 7, 2025 • 123
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning Paper • 2503.07002 • Published Mar 10, 2025 • 40
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia Paper • 2503.07920 • Published Mar 10, 2025 • 98
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing Paper • 2503.10639 • Published Mar 13, 2025 • 50
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning Paper • 2503.10291 • Published Mar 13, 2025 • 36
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search Paper • 2503.10582 • Published Mar 13, 2025 • 23
Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions Paper • 2503.13369 • Published Mar 17, 2025 • 7
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization Paper • 2503.10615 • Published Mar 13, 2025 • 17
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models Paper • 2502.16033 • Published Feb 22, 2025 • 18
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL Paper • 2503.07536 • Published Mar 10, 2025 • 86
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems Paper • 2503.16549 • Published Mar 19, 2025 • 14
ViLBench: A Suite for Vision-Language Process Reward Modeling Paper • 2503.20271 • Published Mar 26, 2025 • 7
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks Paper • 2503.21696 • Published Mar 27, 2025 • 22
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding Paper • 2504.09925 • Published Apr 2025 • 38
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning Paper • 2504.09641 • Published Apr 2025 • 16
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning Paper • 2504.09081 • Published Apr 2025 • 17
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling Paper • 2504.13169 • Published Apr 2025 • 39
MM-IFEngine: Towards Multimodal Instruction Following Paper • 2504.07957 • Published Apr 2025 • 34
TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving Paper • 2504.15780 • Published Apr 2025 • 5
RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning Paper • 2504.18904 • Published Apr 2025 • 9
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks Paper • 2412.04626 • Published Dec 5, 2024 • 14