# Slow-Fast Architecture for Video Multi-Modal Large Language Models (Qwen2-7B, 64 Frames)
This repository contains the **Slow-Fast Video MLLM (Qwen2-7B, ConvNeXt-576, 64 frames, stride 1/4)** model, presented in the paper [Slow-Fast Architecture for Video Multi-Modal Large Language Models](https://huggingface.co/papers/2504.01328).
[Code Repository](https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM) | [HuggingFace Collection](https://huggingface.co/collections/shi-labs/slow-fast-video-mllm-67ef347a28772734c15a78b5)
## Model Description
This model introduces a novel slow-fast architecture to address the challenge of balancing temporal resolution and spatial detail in video-based multi-modal large language models (MLLMs) under limited compute budgets. Existing methods often compress video representations irreversibly, losing detail.
Inspired by how humans first skim a video before focusing on relevant parts, the slow-fast design employs a dual-token strategy:
1. **"Fast" visual tokens:** A compact set of compressed video features fed into the LLM (Qwen2-7B-Instruct) alongside text embeddings for a quick overview.
2. **"Slow" visual tokens:** Uncompressed video features cross-attended by text embeddings via specially designed hybrid decoder layers, enabling instruction-aware extraction of relevant visual details with linear complexity.
This approach allows processing more input frames (e.g., 64 frames for this checkpoint) while preserving spatial details, leading to significant performance improvements on video understanding benchmarks compared to self-attention-only baselines. This checkpoint uses a Qwen2-7B-Instruct base LLM and a ConvNeXt-576 vision tower.
<div align="center">
<img src="https://huggingface.co/shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4/resolve/main/assets/images/fig-teaser.png" width="45%">
</div>
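As a rough illustration of the dual-token design described above, here is a minimal, conceptual PyTorch sketch of a hybrid decoder layer. It is not the repository's implementation; the dimensions, gating, token counts, and the omission of causal masking are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridDecoderLayerSketch(nn.Module):
    """Conceptual sketch only: self-attention over [fast video + text] tokens,
    plus gated cross-attention from those tokens into uncompressed "slow" video tokens."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        # Zero-initialized gate so the cross-attention branch starts as a no-op
        # (a common stabilization trick; an assumption here, not taken from the paper).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, fast_and_text: torch.Tensor, slow_video: torch.Tensor) -> torch.Tensor:
        # 1) Quick overview: ordinary self-attention over the short sequence of
        #    compressed "fast" video tokens concatenated with text embeddings
        #    (causal masking omitted for brevity).
        h = fast_and_text
        attn_out, _ = self.self_attn(self.norm1(h), self.norm1(h), self.norm1(h))
        h = h + attn_out
        # 2) Detail extraction: the same sequence queries the large set of
        #    uncompressed "slow" video tokens; cost grows linearly with video length.
        cross_out, _ = self.cross_attn(self.norm2(h), slow_video, slow_video)
        h = h + torch.tanh(self.gate) * cross_out
        # 3) Feed-forward block.
        return h + self.mlp(self.norm3(h))

# Toy shapes: a short fast+text sequence vs. 64 frames' worth of uncompressed tokens.
layer = HybridDecoderLayerSketch()
fast_and_text = torch.randn(1, 256, 1024)      # compressed video tokens + text embeddings
slow_video = torch.randn(1, 64 * 196, 1024)    # per-frame features for 64 frames (toy count)
print(layer(fast_and_text, slow_video).shape)  # torch.Size([1, 256, 1024])
```

See the [code repository](https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM) for the actual hybrid decoder layer implementation.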
## Usage
**Note:** This model relies on custom code integrated within the `transformers` library (`LlavaQwenSlowFastForCausalLM`). Ensure you have the necessary packages installed from the [official repository](https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM) or use `trust_remote_code=True` when loading the model.
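If you only want to load the weights without cloning the repository, the standard `transformers` auto classes may work, assuming the checkpoint's config registers the custom architecture via `auto_map`. A minimal, unverified sketch:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4"

# trust_remote_code=True lets transformers import the custom
# LlavaQwenSlowFastForCausalLM class shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```

For full functionality (video preprocessing helpers and conversation templates), installing the repository as shown below is recommended.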
First, clone the repository and install requirements if running locally:
```bash
git clone https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM.git
cd Slow-Fast-Video-Multimodal-LLM
pip install --upgrade pip
pip install -r requirements.txt
# Add the cloned repo path to your PYTHONPATH or install it
```

Then, use the following Python script:

```python
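# If you did not install the cloned repo as a package, make sure its `llava`
# package is importable, e.g. (the path below is an example placeholder):
#   import sys; sys.path.insert(0, "/path/to/Slow-Fast-Video-Multimodal-LLM")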
import torch
import os
import numpy as np
from decord import VideoReader, cpu
import requests # Required to download video
# These imports require the `llava` package from the cloned repository
# (see the PYTHONPATH note above)
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.utils import disable_torch_init
def load_video(video_path, max_frames_num):
"""Helper function to load video frames."""
vr = VideoReader(video_path, num_threads=4)
total_frames = len(vr)
# Ensure sparse sampling doesn't lead to fewer frames than requested
if total_frames >= max_frames_num:
# Uniformly sample frames across the video
uniform_sampled_frames = np.linspace(0, total_frames - 1, max_frames_num, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
else:
# If video is shorter than max_frames_num, sample all frames and repeat the last
frame_idx = list(range(total_frames))
frame_idx.extend([total_frames - 1] * (max_frames_num - total_frames))
    try:
        frames = vr.get_batch(frame_idx).asnumpy()
    except Exception as e:
        # Surface decoding errors rather than returning partial or empty data
        print(f"Error loading video frames: {e}")
        raise
    return frames
# Model configuration
model_path = "shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4"
video_url = "https://huggingface.co/shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4/resolve/main/assets/catinterrupt.mp4"
video_local_path = "catinterrupt.mp4"
question = "Please describe this video in detail."
max_frames = 64 # This checkpoint was trained with 64 frames
# Download the video if it doesn't exist
if not os.path.exists(video_local_path):
print(f"Downloading video from {video_url}...")
response = requests.get(video_url, stream=True)
response.raise_for_status() # Raise an exception for bad status codes
with open(video_local_path, "wb") as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)
print("Download complete.")
# Load the model and processor
disable_torch_init()
model_name = get_model_name_from_path(model_path)
# Use trust_remote_code=True to load the custom architecture
tokenizer, model, image_processor, context_len = load_pretrained_model(
model_path,
None,
model_name,
use_flash_attn=True, # Use Flash Attention if available
device_map="auto", # Automatically distribute model across GPUs/CPU
torch_dtype=torch.bfloat16, # Use bfloat16 for efficiency
trust_remote_code=True
)
# Prepare the prompt
if model.config.mm_use_im_start_end:
    prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + question
else:
    prompt = DEFAULT_IMAGE_TOKEN + "\n" + question
conv = conv_templates["qwen_1_5"].copy() # Use the appropriate conversation template
conv.append_message(conv.roles[0], prompt)
conv.append_message(conv.roles[1], None)
prompt_final = conv.get_prompt()
# Load and process video frames
print("Loading video...")
video_frames = load_video(video_local_path, max_frames_num=max_frames)
print(f"Video loaded, shape: {video_frames.shape}")
# Preprocess video frames
print("Preprocessing video...")
# Ensure video has shape (T, H, W, C) before preprocessing
video_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"]
video_tensor = video_tensor.to(model.device, dtype=torch.bfloat16)
videos = [video_tensor] # The model expects a list of video tensors
print(f"Video tensor processed, shape: {videos[0].shape}")
# Tokenize the prompt
input_ids = tokenizer_image_token(prompt_final, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
input_ids = input_ids.to(device=model.device, non_blocking=True)
# Add batch dimension if necessary (tokenizer_image_token might already return batched)
if input_ids.ndim == 1:
input_ids = input_ids.unsqueeze(0)
print(f"Input IDs processed, shape: {input_ids.shape}")
# Generate response
print("Generating response...")
with torch.inference_mode():
output_ids = model.generate(
input_ids,
images=videos, # Pass the processed video tensor list
do_sample=True,
temperature=0.2,
top_p=1.0,
num_beams=1,
max_new_tokens=1024,
use_cache=True
)
# Decode and print the output
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"
User input: {question}
")
print(f"Model output:
{outputs}")
## License

The model weights are released under the CC-BY-NC-4.0 license. The code is released under the Apache 2.0 license. Users must comply with all terms and conditions of the original licenses, including the license of the base language model (Qwen2 License).
## Citation

If you find this work useful, please consider citing the paper:

```bibtex
@misc{zhou2025slowfast,
      title={Slow-Fast Architecture for Video Multi-Modal Large Language Models},
      author={Yifei Zhou and Jiaming Zuo and Chen Change Loy and Chongyang Zhong and Xin Wang and Qi Wu and Weidong Cai and Xiaodong He and Qingzhong Wang and Lei Zhang and Marcelo H. Ang Jr and Boyang Li and Yanfeng Wang and Qinghai He and Fengbei Liu and Liangchen Luo and Jingdong Wang and Conghui He and Wenhai Wang},
      year={2025},
      eprint={2504.01328},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

(Note: the author list above may not match the final arXiv listing for arXiv:2504.01328; please verify against the published version.)