lixinhao committed 41bf44a (verified) · 1 parent: 6d175a4

Update README.md

Files changed (1): README.md (+102 −3)
---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
base_model: Qwen/Qwen2.5-VL-7B-Instruct
---

# 💡 VideoChat-R1_7B

[\[📂 GitHub\]](https://github.com/OpenGVLab/VideoChat-R1)
[\[📜 Tech Report\]](https://arxiv.org/pdf/2504.06958)

## 🚀 How to use the model

We provide a simple installation example below:
```
pip install transformers
pip install qwen_vl_utils
```
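Note that the example below loads the model with `attn_implementation="flash_attention_2"`, which requires the optional `flash-attn` package and a compatible GPU. If FlashAttention-2 is not available in your environment, you can drop that argument to fall back to the default attention implementation.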
Then you can use our model:
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "OpenGVLab/VideoChat-R1_7B"
# Default: load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto",
    attn_implementation="flash_attention_2"
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path)

video_path = "your_video.mp4"
question = "Where is the final cup containing the object?"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": f"""{question}
Provide your final answer within the <answer> </answer> tags.
"""},
        ],
    }
]

# In Qwen2.5-VL, frame-rate information is also passed to the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
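Since the prompt asks the model to wrap its final answer in `<answer> </answer>` tags, you may want to pull that answer out of the decoded text. Below is a minimal, hypothetical sketch (not part of the official usage example) that uses the standard-library `re` module on the `output_text` list produced above:
```python
import re

def extract_answer(text: str) -> str:
    """Return the content of the last <answer>...</answer> block, or the raw text if no tags are found."""
    matches = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else text.strip()

# `output_text` is the list returned by processor.batch_decode above.
print(extract_answer(output_text[0]))
```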

## ✍️ Citation

```bibtex
@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}
```