Video-LLaVA-Seg
This is the official baseline implementation for the ViCaS dataset, presented in the paper ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation.
For details about setting up the model, refer to the Video-LLaVA-Seg GitHub repo.
For details about downloading and evaluating the dataset benchmark, refer to the ViCaS GitHub repo.
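As a quick-start illustration, the sketch below shows one way to fetch the model checkpoint from the Hugging Face Hub before following the setup steps in the Video-LLaVA-Seg repo. The `repo_id` shown is a placeholder assumption, not a confirmed identifier; substitute the actual model id from this page.

```python
# Minimal sketch: download the checkpoint files from the Hugging Face Hub.
# NOTE: "your-org/Video-LLaVA-Seg" is a placeholder repo id (an assumption) --
# replace it with the real model id. Environment setup (dependencies, CUDA,
# etc.) is covered in the Video-LLaVA-Seg GitHub repo.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="your-org/Video-LLaVA-Seg",  # assumption: use the real model id here
)
print(f"Checkpoint downloaded to: {local_dir}")
```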