---
license: mit
license_name: deepseek
license_link: LICENSE
pipeline_tag: any-to-any
library_name: transformers
tags:
- multimodal
- text-to-image
- unified-model
---

## 1. Introduction

Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It addresses the limitations of previous approaches by decoupling visual encoding into separate pathways, while still using a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder’s roles in understanding and generation, but also enhances the framework’s flexibility. Janus-Pro surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus-Pro make it a strong candidate for next-generation unified multimodal models.

[**Github Repository**](https://github.com/deepseek-ai/Janus)
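To make the decoupling concrete, the design can be pictured as three independent components: one visual pathway for understanding, one for generation, and a single shared autoregressive transformer. The sketch below is purely illustrative; every name in it (`DecoupledMultimodalModel`, `understanding_encoder`, `generation_decoder`) is hypothetical and it is not the actual Janus-Pro implementation. The real components are named in the model summary below.

```python
from typing import Callable, List, Sequence

# Purely illustrative sketch of the decoupled design; all names here are
# hypothetical and this is NOT the actual Janus-Pro code.

class DecoupledMultimodalModel:
    def __init__(
        self,
        understanding_encoder: Callable,  # image -> continuous features
        generation_decoder: Callable,     # discrete image codes -> pixels
        transformer: Callable,            # single shared autoregressive backbone
    ):
        self.understanding_encoder = understanding_encoder
        self.generation_decoder = generation_decoder
        self.transformer = transformer

    def understand(self, image, question_tokens: List[int]) -> List[int]:
        # Understanding pathway: continuous image features plus the question
        # go through the shared transformer, which answers in text tokens.
        features = self.understanding_encoder(image)
        return self.transformer([features, question_tokens])

    def generate_image(self, prompt_tokens: List[int]):
        # Generation pathway: the same transformer samples discrete image
        # codes, and a separate decoder maps them back to pixels.
        codes = self.transformer([prompt_tokens])
        return self.generation_decoder(codes)
```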
## 2. Model Summary

Janus-Pro is a unified understanding and generation MLLM that decouples visual encoding for multimodal understanding and generation. Janus-Pro is built on DeepSeek-LLM-1.5b-base/DeepSeek-LLM-7b-base.

For multimodal understanding, it uses [SigLIP-L](https://huggingface.co/timm/ViT-L-16-SigLIP-384) as the vision encoder, which supports 384 x 384 image input. For image generation, Janus-Pro uses the image tokenizer from [LlamaGen](https://github.com/FoundationVision/LlamaGen) with a downsample rate of 16.

## 3. Usage Examples

### Single Image Inference

Here is an example of visual understanding with a single image.

```python
import torch
from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"

# Prepare input for generation
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    },
]

processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Set generation mode to "text" to perform text generation
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    generation_mode="text",
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=40, generation_mode="text", do_sample=True)
text = processor.decode(output[0], skip_special_tokens=True)
print(text)
```

### Text-to-Image Generation

Janus-Pro can also generate images from a text prompt by simply setting the generation mode to `image`, as shown below.

```python
import torch
from transformers import JanusForConditionalGeneration, JanusProcessor

model_id = "deepseek-community/Janus-Pro-1B"

# Load processor and model
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "A dog running under the rain."},
        ],
    },
]

# Apply chat template
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=prompt, generation_mode="image", return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

# Set number of images to generate
model.generation_config.num_return_sequences = 2

outputs = model.generate(
    **inputs, generation_mode="image", do_sample=True, use_cache=True
)

# Decode the generated image tokens and save the images
decoded_image = model.decode_image_tokens(outputs)
images = processor.postprocess(
    list(decoded_image.float()), return_tensors="PIL.Image.Image"
)
for i, image in enumerate(images["pixel_values"]):
    image.save(f"image{i}.png")
```
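A detail worth knowing when budgeting generation time: the model summary implies how many tokens each generated image costs. With a downsample rate of 16, and assuming the generated resolution matches the 384 x 384 quoted elsewhere in this card, each image corresponds to a 24 x 24 grid of discrete codes, i.e. 576 image tokens sampled autoregressively. The snippet below is a minimal sketch of that arithmetic; the seeding call is standard PyTorch and simply makes the sampled generations above reproducible.

```python
import torch

# Figures taken from the model summary above; the 384 x 384 generated
# resolution is an assumption based on the resolutions quoted in this card.
image_size = 384
downsample_rate = 16

tokens_per_side = image_size // downsample_rate  # 24
tokens_per_image = tokens_per_side ** 2          # 576 discrete image codes

print(f"{tokens_per_side} x {tokens_per_side} grid -> {tokens_per_image} image tokens per image")

# Image generation above uses do_sample=True, so fixing the global seed
# is one way to make repeated runs produce the same images.
torch.manual_seed(0)
```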
## 4. License

This code repository is licensed under [the MIT License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-CODE). The use of Janus-Pro models is subject to the [DeepSeek Model License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-MODEL).

## 5. Citation

```
@article{chen2025janus,
  title={Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling},
  author={Chen, Xiaokang and Wu, Zhiyu and Liu, Xingchao and Pan, Zizheng and Liu, Wen and Xie, Zhenda and Yu, Xingkai and Ruan, Chong},
  journal={arXiv preprint arXiv:2501.17811},
  year={2025}
}
```

## 6. Contact

If you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).