Collections

1

Video Creation by Demonstration

Paper • 2412.09551 • Published Dec 12, 2024 • 9
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

Paper • 2412.07589 • Published Dec 10, 2024 • 49
Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation

Paper • 2412.06531 • Published Dec 9, 2024 • 73
APOLLO: SGD-like Memory, AdamW-level Performance

Paper • 2412.05270 • Published Dec 6, 2024 • 39

92

Running

460

460

InternVL

⚡

Chat with an AI that understands text and images
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 157
OpenGVLab/InternVL2_5-78B

Image-Text-to-Text • Updated Mar 25 • 8.53k • 189
OpenGVLab/InternVL2_5-78B-AWQ

Image-Text-to-Text • Updated Mar 25 • 272 • 14

Video Creation by Demonstration

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation

APOLLO: SGD-like Memory, AdamW-level Performance

InternVL

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

OpenGVLab/InternVL2_5-78B

OpenGVLab/InternVL2_5-78B-AWQ

Rethinking Data Selection at Scale: Random Selection is Almost All You Need

From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

Emergent properties with repeated examples

Personalized Visual Instruction Tuning

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

AutoTrain: No-code training for state-of-the-art models

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

LEOPARD : A Vision Language Model For Text-Rich Multi-Image Tasks

LinFusion: 1 GPU, 1 Minute, 16K Image

Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Diffusion Models Are Real-Time Game Engines

Segment Anything with Multiple Modalities

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Building and better understanding vision-language models: insights and future directions

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

LLM Pruning and Distillation in Practice: The Minitron Approach

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

To Code, or Not To Code? Exploring Impact of Code in Pre-training

Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

LLaVA-OneVision: Easy Visual Task Transfer

VILA^2: VILA Augmented VILA

PaliGemma: A versatile 3B VLM for transfer

openbmb/MiniCPM-V-2_6

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

Chameleon: Mixed-Modal Early-Fusion Foundation Models

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

databricks/dbrx-instruct

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

The Ultra-Scale Playbook