---
pipeline_tag: image-to-video
library_name: diffusers
license: mit
---

# VIRES model card

**Model Page**: [VIRES](https://hjzheng.net/projects/VIRES/)

## Model Information

A summary of the model and a brief definition of its inputs and outputs.

### Description

VIRES is a video instance repainting method guided by sketch and text. It supports video instance repainting, replacement, generation, and removal. It leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. Key features include a Sequential ControlNet for structure layout extraction and detail capture, sketch attention for injecting fine-grained semantics, and a sketch-aware encoder for alignment.


### Inputs and outputs

-   **Input:**
    -   A text string describing the desired changes.
    -   A mask sequence (51 x 512 x 512 resolution).
    -   A sketch sequence (51 x 512 x 512 resolution).

-   **Output:**
    -   A repainted video.
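
The size requirement on the mask and sketch sequences can be checked before running the model. The snippet below is a minimal sketch, not part of the VIRES codebase: it assumes the `51 x 512 x 512` specification means a `(frames, height, width)` array layout, and the helper `check_sequence` and the placeholder data are hypothetical names introduced only for illustration.

```python
import numpy as np

# Assumed layout: (frames, height, width), read off the "51 x 512 x 512" spec.
NUM_FRAMES, HEIGHT, WIDTH = 51, 512, 512


def check_sequence(name: str, seq: np.ndarray) -> np.ndarray:
    """Return `seq` unchanged, or raise ValueError if its shape is not (51, 512, 512)."""
    expected = (NUM_FRAMES, HEIGHT, WIDTH)
    if seq.shape != expected:
        raise ValueError(f"{name} has shape {seq.shape}, expected {expected}")
    return seq


# Placeholder data (hypothetical): a square region marked in every frame,
# and an empty sketch. Replace these with your real sequences.
mask = np.zeros((NUM_FRAMES, HEIGHT, WIDTH), dtype=np.uint8)
mask[:, 128:384, 128:384] = 1
sketch = np.zeros((NUM_FRAMES, HEIGHT, WIDTH), dtype=np.uint8)

check_sequence("mask", mask)
check_sequence("sketch", sketch)
```

The value range and data type the model expects are not stated here, so the `uint8` arrays above are placeholders only; consult the GitHub repository for the actual preprocessing.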

### Usage

A basic example using the `diffusers` library (requires appropriate model weights and dependencies):

```python
import torch
from diffusers import DiffusionPipeline

# Load the model (replace with your actual path)
pipe = DiffusionPipeline.from_pretrained("suimu/VIRES", torch_dtype=torch.float16).to("cuda")

# Prepare inputs: text prompt, mask sequence, and sketch sequence
prompt = "A cat replaces the dog in this video"
mask = ...  # Load your mask sequence
sketch = ...  # Load your sketch sequence

# Generate the video
video = pipe(prompt, mask, sketch).videos[0]

# Save or display the video
...
```

For complete usage instructions and advanced options, see the [GitHub repository](https://github.com/suimuc/VIRES/).


## Citation

```BibTeX
@article{vires,
      title={VIRES: Video Instance Repainting via Sketch and Text Guided Generation},
      author={Weng, Shuchen and Zheng, Haojie and Zhang, Peixuan and Hong, Yuchen and Jiang, Han and Li, Si and Shi, Boxin},
      journal={arXiv preprint arXiv:2411.16199},
      year={2024}
}
```