|
--- |
|
base_model: |
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
language: |
|
- en |
|
library_name: transformers |
|
license: apache-2.0 |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- multi-modal |
|
- large-language-model |
|
--- |
|
|
|
<p align="center"> |
|
<img src="https://github.com/LengSicong/MMR1/blob/main/assets/logo.png?raw=true" width="150" style="margin-bottom: 0.2;"/> |
|
</p>
|
|
|
<h3 align="center"> |
|
MMR1: Advancing the Frontiers of Multimodal Reasoning</h3>
|
<h5 align="center"> If you like our project, please give us a star ⭐ on <a href="https://github.com/LengSicong/MMR1">GitHub</a> to support us. 🙏🙏 </h5>
|
|
|
## 📰 News |
|
* **[2025.03.11]** 🔥🔥 Release of MMR1-Math-v0, achieving SOTA performance with only 6k training samples!
|
|
|
## Links |
|
Code: https://github.com/LengSicong/MMR1 |
|
|
|
Related paper: [LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL](https://arxiv.org/abs/2503.07536), an R1-style baseline compared against in the evaluation below.
|
|
|
## Model Description |
|
MMR1-Math-v0-7B is a Large Multimodal Model specialized in mathematical reasoning. Remarkably, it achieves state-of-the-art performance among open-source 7B multimodal models and competes effectively even against proprietary models with significantly larger parameter counts, despite being trained on only 6k carefully curated data instances.
|
|
|
### Key Highlights: |
|
|
|
- **SOTA Performance**: Sets a new **state-of-the-art** benchmark on math-related multimodal tasks among open-source 7B models. |
|
|
|
- **Minimal Training Data**: Remarkably achieves top-tier performance with just **6k** high-quality samples from **public training datasets**. |
|
|
|
- **Efficient Training with GRPO**: only 6 hours of RL training on 64 H100 GPUs for 15 epochs (a rough sketch of the GRPO objective follows this list).
|
|
|
- **Public and High-Quality Data**: Publicly sourced datasets, rigorously filtered and balanced across both difficulty and mathematical problem types. |
|
|
|
- **Balanced Data Strategy**: Uniform sampling of data based on both task difficulty (filtering out overly simple problems) and mathematical reasoning diversity. |
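
For readers unfamiliar with GRPO, the snippet below is a minimal, generic sketch of its group-relative advantage normalization: for each prompt, a group of rollouts is sampled and each rollout's reward is standardized against the group mean and standard deviation. This is an illustration only, not the project's training code; the `group_relative_advantages` helper, the rule-based 0/1 reward, and the group of four rollouts are assumptions made for the example.

```python
# Minimal sketch of GRPO-style group-relative advantages (illustrative only;
# not the MMR1 training code). Rewards and group size are placeholders.
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize per-rollout rewards within the group sampled for one prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Example: 4 rollouts for one math problem, scored by a rule-based reward
# (1.0 if the final answer matches, else 0.0 -- an assumed reward for illustration).
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # positive for correct rollouts, negative otherwise
```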
|
|
|
|
|
## Evaluation Results |
|
|
|
We evaluated our model using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit/tree/main) on four mathematical reasoning benchmarks: MathVista_MINI, MathVision, LogicVista, and MathVerse_MINI. |
|
|
|
We also include results on the MathVerse_MINI_Vision_Only_cot (MathVerse_V) subset to maintain consistency with the VLMEvalKit leaderboard. The table below compares our model with various open-source and proprietary models.
|
|
|
| Model | Size | MathVista | MathVision | LogicVista | MathVerse | MathVerse_V |
|-------|:----:|:---------:|:----------:|:----------:|:---------:|:-----------:|
| **Closed-source** | | | | | | |
| [GPT-4o 1120](https://openai.com/index/gpt-4o-system-card/) | - | 60.0 | 31.2 | 52.8 | 40.6 | - |
| [Gemini-2.0-flash](https://deepmind.google/technologies/gemini/flash/) | - | 70.4 | 43.6 | 52.3 | 47.8 | - |
| [Claude3.7-Sonnet](https://www.anthropic.com/news/claude-3-7-sonnet) | - | 66.8 | 41.9 | 58.2 | 46.7 | - |
| **R1-related** | | | | | | |
| [LLaVA-CoT](https://github.com/PKU-YuanGroup/LLaVA-CoT) | 11B | 52.5 | 19.9 | 39.6 | 22.6 | - |
| [Open-R1-Multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal) | 7B | 60.6 | - | - | - | - |
| [Mulberry](https://github.com/HJYao00/Mulberry) | 7B | 63.1 | - | - | - | - |
| [LMM-R1](https://arxiv.org/abs/2503.07536) | 3B | 63.2 | 26.4 | - | - | 41.6 |
| [R1-Onevision](https://github.com/Fancy-MLLM/R1-Onevision?tab=readme-ov-file) | 7B | - | 26.2 | - | - | 44.1 |
| [MM-Eureka](https://github.com/ModalMinds/MM-EUREKA) | 8B | 67.1 | 22.2 | - | - | 40.4 |
| [MM-Eureka](https://github.com/ModalMinds/MM-EUREKA) | 38B | 64.2 | 26.6 | - | - | 48.9 |
| **Open-source** | | | | | | |
| [Ovis2-8b](https://github.com/AIDC-AI/Ovis) | 8B | 71.8 | 25.9 | 39.4 | 42.3 | - |
| [MiniCPM-o-2.6](https://github.com/OpenBMB/MiniCPM-o) | 8B | **71.9** | 21.7 | 36.0 | 35.0 | - |
| [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) (official) | 7B | 68.2 | 25.4 | 47.9 | 41.1 | - |
| [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) (reproduced) | 7B | 67.5 | 25.6 | 46.8 | 42.5 | 46.9 |
| **Ours** | | | | | | |
| **MMR1-Math-v0** | 7B | 71.0 | **30.2** | **50.8** | **45.1** | **49.8** |
|
|
|
|
|
|
|
## Quick Start
|
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "MMR1/MMR1-Math-v0-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn; remove to use the default attention
    device_map="auto",
)

# Default processor
processor = AutoProcessor.from_pretrained("MMR1/MMR1-Math-v0-7B")

# Example input
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/image.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
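
Because MMR1-Math-v0-7B is built on Qwen/Qwen2.5-VL-7B-Instruct, the Qwen2.5-VL processor options for bounding the visual-token budget should carry over; the sketch below assumes that, and the specific `min_pixels`/`max_pixels` values are arbitrary examples rather than recommended settings.

```python
from transformers import AutoProcessor

# Optional: trade off cost vs. detail by bounding the number of visual tokens per image
# (a Qwen2.5-VL processor feature; the pixel limits below are arbitrary examples).
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "MMR1/MMR1-Math-v0-7B", min_pixels=min_pixels, max_pixels=max_pixels
)
```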
|
<details> |
|
<summary>Batch inference</summary> |
|
|
|
```python
# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```
|
</details> |
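
When scoring math benchmarks, the final answer usually has to be parsed out of the generated reasoning. If the model ends its response with a LaTeX-style `\boxed{...}` answer (an assumption for illustration; this card does not specify the output format), a simple extraction could look like the sketch below.

```python
import re


def extract_boxed_answer(text: str):
    """Return the content of the last \\boxed{...} in the response, or None.

    Assumes a boxed final answer and handles only non-nested braces;
    this is an illustrative helper, not part of the official pipeline.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None


print(extract_boxed_answer("The area is therefore \\boxed{42} square units."))  # -> 42
```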
|
|
|
|
|
## Citation |
|
|
|
If you find MMR1 useful for your research and applications, please cite using this BibTeX: |
|
|
|
```bibtex
@misc{MMR1-Math2025,
  title={MMR1: Advancing the Frontiers of Multimodal Reasoning},
  author={Sicong Leng and Jing Wang and Jiaxi Li and Hao Zhang and Zhiqiang Hu and Boqiang Zhang and Hang Zhang and Yuming Jiang and Xin Li and Fan Wang and Yu Rong and Aixin Sun and Shijian Lu},
  year={2025},
  howpublished={\url{https://github.com/LengSicong/MMR1}},
}
```