back_rag_huggingface / model_data_json /BAAI_Emu3-VisionTokenizer.json

shayan5422

Upload 3710 files

21cad66 verified 24 days ago

	{
	"model_id": "BAAI/Emu3-VisionTokenizer",
	"downloads": 26276,
	"tags": [
	"transformers",
	"safetensors",
	"Emu3VisionVQ",
	"feature-extraction",
	"custom_code",
	"arxiv:2409.18869",
	"license:apache-2.0",
	"region:us"
	],
	"description": "--- license: apache-2.0 library_name: transformers --- <div align='center'> <h1>Emu3: Next-Token Prediction is All You Need</h1h1> <h3></h3> Emu3 Team, BAAI \| Project Page \| Paper \| 🤗HF Models \| github \| Demo \| </div> <div align='center'> <img src=\" class=\"interpolation-image\" alt=\"arch.\" height=\"80%\" width=\"70%\" /> </div> We introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with <i>next-token prediction</i>! By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. ### Emu3 excels in both generation and perception Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship open models such as SDXL, LLaVA-1.6 and OpenSora-1.2, while eliminating the need for diffusion or compositional architectures. <div align='center'> <img src=\" class=\"interpolation-image\" alt=\"comparison.\" height=\"80%\" width=\"80%\" /> </div> ### Highlights - Emu3 is capable of generating high-quality images following the text input, by simply predicting the next vision token. The model naturally supports flexible resolutions and styles. - Emu3 shows strong vision-language understanding capabilities to see the physical world and provides coherent text responses. Notably, this capability is achieved without depending on a CLIP and a pretrained LLM. - Emu3 simply generates a video causally by predicting the next token in a video sequence, unlike the video diffusion model as in Sora. With a video in context, Emu3 can also naturally extend the video and predict what will happen next. ### Quickstart for Autoencoding",
	"model_explanation_gemini": "\"Emu3 is a multimodal model trained with next-token prediction to generate images, videos, and text while excelling in vision-language understanding, outperforming models like SDXL, LLaVA-1.6, and OpenSora-1.2 without relying on diffusion or compositional architectures.\"\n\n### Features: \n1. Multimodal generation (images, videos, text) via next-token prediction. \n2. Flexible resolutions/styles for image generation.",
	"release_year": "2024",
	"parameter_count": null,
	"is_fine_tuned": false,
	"category": "Embedding",
	"api_enhanced": true
	}