---
pipeline_tag: image-to-video
library_name: diffusers
license: mit
---

# VIRES model card

**Model Page**: [VIRES](https://hjzheng.net/projects/VIRES/)

## Model Information

A summary of the model and its inputs and outputs.

### Description

VIRES is a video instance repainting method guided by sketch and text. It supports repainting, replacing, generating, and removing instances in a video. It leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. Key features include:

- A Sequential ControlNet for structure layout extraction and detail capture.
- Sketch attention for injecting fine-grained semantics.
- A sketch-aware encoder for alignment.

### Inputs and outputs

- **Input:**
  - Text string describing the desired changes.
  - Mask sequence (51 x 512 x 512 resolution).
  - Sketch sequence (51 x 512 x 512 resolution).
- **Output:**
  - A repainted video.

### Usage

A basic example using the `diffusers` library (requires appropriate model weights and dependencies):

```python
import torch
from diffusers import DiffusionPipeline

# Load the model (replace with your actual paths)
pipe = DiffusionPipeline.from_pretrained("suimu/VIRES", torch_dtype=torch.float16).to("cuda")

# Prepare inputs: text prompt, mask, and sketch
prompt = "A cat replaces the dog in this video"
mask = ...    # Load your mask sequence
sketch = ...  # Load your sketch sequence

# Generate the video
video = pipe(prompt, mask, sketch).videos[0]

# Save or display the video
...
```

For complete usage instructions and advanced options, refer to our GitHub page: https://github.com/suimuc/VIRES/

## Citation

```BibTeX
@article{vires,
  title={VIRES: Video Instance Repainting via Sketch and Text Guided Generation},
  author={Weng, Shuchen and Zheng, Haojie and Zhang, Peixuan and Hong, Yuchen and Jiang, Han and Li, Si and Shi, Boxin},
  journal={arXiv preprint arXiv:2411.16199},
  year={2024}
}
```
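
As a rough illustration of the 51 x 512 x 512 mask and sketch sequences listed under "Inputs and outputs", the NumPy sketch below builds placeholder arrays of that size and checks their shape before they are handed to a pipeline. The axis order (frames, height, width), the `float32` dtype, the 0 to 1 value range, and the helper names `make_box_mask` and `check_sequence` are assumptions made for this example only. The loaders in the VIRES repository may expect a different layout, so treat this as a shape check rather than the project's preprocessing.

```python
import numpy as np

# Resolution stated in the model card; the (frames, height, width) order is an assumption.
NUM_FRAMES, HEIGHT, WIDTH = 51, 512, 512


def make_box_mask(top, left, bottom, right):
    """Hypothetical helper: binary mask marking one static rectangle in every frame."""
    mask = np.zeros((NUM_FRAMES, HEIGHT, WIDTH), dtype=np.float32)
    mask[:, top:bottom, left:right] = 1.0
    return mask


def check_sequence(name, seq):
    """Hypothetical helper: raise ValueError if a sequence does not have the stated shape."""
    expected = (NUM_FRAMES, HEIGHT, WIDTH)
    if seq.shape != expected:
        raise ValueError(f"{name} has shape {seq.shape}, expected {expected}")
    return seq


# Placeholder inputs: a centered box mask and an empty (all-zero) sketch sequence.
mask = check_sequence("mask", make_box_mask(128, 128, 384, 384))
sketch = check_sequence("sketch", np.zeros((NUM_FRAMES, HEIGHT, WIDTH), dtype=np.float32))

print(mask.shape, sketch.shape)
```

A check like this fails early with a readable message when a sequence has the wrong number of frames or the wrong spatial size, instead of failing somewhere inside the model.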