maffia committed
Commit 690f890 · verified · 1 Parent(s): 8519254

Upload 94 files

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitattributes +7 -0
  2. UserGuide.md +160 -0
  3. app.py +278 -0
  4. assets/images/test.jpg +3 -0
  5. assets/images/test2.jpg +0 -0
  6. assets/images/test3.jpg +3 -0
  7. assets/masks/test.png +0 -0
  8. assets/masks/test2.png +0 -0
  9. assets/materials/gr_infer_demo.jpg +3 -0
  10. assets/materials/gr_pre_demo.jpg +3 -0
  11. assets/materials/tasks.png +3 -0
  12. assets/materials/teaser.jpg +3 -0
  13. assets/videos/test.mp4 +3 -0
  14. assets/videos/test2.mp4 +0 -0
  15. benchmarks/.gitkeep +0 -0
  16. models/.gitkeep +0 -0
  17. pyproject.toml +75 -0
  18. requirements.txt +1 -0
  19. requirements/annotator.txt +6 -0
  20. requirements/framework.txt +26 -0
  21. tests/test_annotators.py +568 -0
  22. vace/__init__.py +6 -0
  23. vace/annotators/__init__.py +24 -0
  24. vace/annotators/canvas.py +60 -0
  25. vace/annotators/common.py +62 -0
  26. vace/annotators/composition.py +155 -0
  27. vace/annotators/depth.py +51 -0
  28. vace/annotators/dwpose/__init__.py +2 -0
  29. vace/annotators/dwpose/onnxdet.py +127 -0
  30. vace/annotators/dwpose/onnxpose.py +362 -0
  31. vace/annotators/dwpose/util.py +299 -0
  32. vace/annotators/dwpose/wholebody.py +80 -0
  33. vace/annotators/face.py +55 -0
  34. vace/annotators/flow.py +53 -0
  35. vace/annotators/frameref.py +118 -0
  36. vace/annotators/gdino.py +88 -0
  37. vace/annotators/gray.py +24 -0
  38. vace/annotators/inpainting.py +283 -0
  39. vace/annotators/layout.py +161 -0
  40. vace/annotators/mask.py +79 -0
  41. vace/annotators/maskaug.py +181 -0
  42. vace/annotators/midas/__init__.py +2 -0
  43. vace/annotators/midas/api.py +166 -0
  44. vace/annotators/midas/base_model.py +18 -0
  45. vace/annotators/midas/blocks.py +391 -0
  46. vace/annotators/midas/dpt_depth.py +107 -0
  47. vace/annotators/midas/midas_net.py +80 -0
  48. vace/annotators/midas/midas_net_custom.py +167 -0
  49. vace/annotators/midas/transforms.py +231 -0
  50. vace/annotators/midas/utils.py +193 -0
.gitattributes CHANGED
@@ -33,3 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ assets/images/test.jpg filter=lfs diff=lfs merge=lfs -text
37
+ assets/images/test3.jpg filter=lfs diff=lfs merge=lfs -text
38
+ assets/materials/gr_infer_demo.jpg filter=lfs diff=lfs merge=lfs -text
39
+ assets/materials/gr_pre_demo.jpg filter=lfs diff=lfs merge=lfs -text
40
+ assets/materials/tasks.png filter=lfs diff=lfs merge=lfs -text
41
+ assets/materials/teaser.jpg filter=lfs diff=lfs merge=lfs -text
42
+ assets/videos/test.mp4 filter=lfs diff=lfs merge=lfs -text
UserGuide.md ADDED
@@ -0,0 +1,160 @@
1
+ # VACE User Guide
2
+
3
+ ## 1. Overall Steps
4
+
5
+ - Preparation: Identify the task type ([single task](#33-single-tasks) or [multi-task composition](#34-composition-task)) of your creative idea, and prepare all the required materials (images, videos, prompts, etc.).
6
+ - Preprocessing: Select the appropriate preprocessing method based on the task name, then preprocess your materials to meet the model's input requirements.
7
+ - Inference: Based on the preprocessed materials, perform VACE inference to obtain results.
8
+
9
+ ## 2. Preparations
10
+
11
+ ### 2.1 Task Definition
12
+
13
+ VACE, as a unified video generation solution, simultaneously supports Video Generation, Video Editing, and complex composition tasks. Specifically:
14
+
15
+ - Video Generation: No video input. Concepts are injected into the model through semantic understanding of text and reference materials, covering the **T2V** (Text-to-Video Generation) and **R2V** (Reference-to-Video Generation) tasks.
16
+ - Video Editing: Takes a video as input and modifies it at the pixel level, globally or locally, including the **V2V** (Video-to-Video Editing) and **MV2V** (Masked Video-to-Video Editing) tasks.
17
+ - Composition Task: Combines two or more of the single tasks above into a more complex task, such as **Reference Anything** (Face R2V + Object R2V), **Move Anything** (Frame R2V + Layout V2V), **Animate Anything** (R2V + Pose V2V), **Swap Anything** (R2V + Inpainting MV2V), and **Expand Anything** (Object R2V + Frame R2V + Outpainting MV2V).
18
+
19
+ Single tasks and compositional tasks are illustrated in the diagram below:
20
+
21
+ ![vace_task](assets/materials/tasks.png)
22
+
23
+
24
+ ### 2.2 Limitations
25
+
26
+ - Very high-resolution videos will be resized to an appropriate spatial size.
27
+ - Very long videos will be trimmed or uniformly sampled down to around 5 seconds.
28
+ - If you need longer videos, we recommend generating 5-second clips one by one while using the `firstclip` video extension task to keep temporal consistency (see the sketch below).
29
+
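+ A minimal sketch of this extension workflow, mirroring the `FrameRefExpandAnnotator` usage in `tests/test_annotators.py` (the clip path is a placeholder):
+
+ ```python
+ # Extend a previously generated 5s clip with the `firstclip` mode.
+ from vace.annotators.frameref import FrameRefExpandAnnotator
+ from vace.annotators.utils import read_video_frames, save_one_video
+
+ frames = read_video_frames('previous_clip.mp4')   # the clip you just generated (placeholder path)
+ anno = FrameRefExpandAnnotator({})                # no special config needed
+ src_frames, src_masks = anno.forward(frames=frames, mode='firstclip', expand_num=80)
+ save_one_video('src_video.mp4', src_frames, fps=16)  # original clip kept, new frames left gray
+ save_one_video('src_mask.mp4', src_masks, fps=16)    # white where new content should be generated
+ ```
+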
30
+ ## 3. Preprocessing
31
+ ### 3.1 VACE-Recognizable Inputs
32
+
33
+ User-collected materials need to be preprocessed into VACE-recognizable inputs, namely **`src_video`**, **`src_mask`**, **`src_ref_images`**, and **`prompt`**.
34
+ Specific descriptions are as follows:
35
+
36
+ - `src_video`: The video fed into the model for editing, such as a condition video (Depth, Pose, etc.) or the input video for in/outpainting. **Gray areas** (pixel value 127) mark missing video content. In the first-frame R2V task, the first frame is the reference frame and the subsequent frames are left gray; the missing parts of an in/outpainting `src_video` are likewise set to gray.
37
+ - `src_mask`: A 3D mask with the same shape as `src_video`. **White areas** represent the parts to be generated, while **black areas** represent the parts to be retained.
38
+ - `src_ref_images`: Reference images for R2V. Salient object segmentation can be applied so that the background is kept white.
39
+ - `prompt`: Text describing the content of the output video. Prompt expansion can be used to achieve better generation results for LTX-Video and for English users of Wan2.1. Use descriptive prompts rather than instructions.
40
+
41
+ Among them, `prompt` is required, while `src_video`, `src_mask`, and `src_ref_images` are optional. For instance, the MV2V task requires `src_video`, `src_mask`, and `prompt`, whereas the R2V task only requires `src_ref_images` and `prompt`.
42
+
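+ As a purely illustrative example of these conventions (in practice the preprocessing tools in Section 3.2 produce these files for you), the sketch below hand-builds a first-frame reference `src_video`/`src_mask` pair with NumPy and imageio:
+
+ ```python
+ # Illustration only: the first frame is the reference, the rest is missing (gray, 127).
+ import imageio
+ import numpy as np
+
+ first = imageio.imread('assets/images/test.jpg')  # reference first frame, shape (H, W, 3)
+ h, w = first.shape[:2]
+ num_frames = 81
+
+ gray = np.full((h, w, 3), 127, dtype=np.uint8)    # 127 marks missing video content
+ src_video = [first] + [gray.copy() for _ in range(num_frames - 1)]
+
+ black = np.zeros((h, w, 3), dtype=np.uint8)       # black: keep the first frame
+ white = np.full((h, w, 3), 255, dtype=np.uint8)   # white: generate the rest
+ src_mask = [black] + [white.copy() for _ in range(num_frames - 1)]
+
+ imageio.mimwrite('src_video.mp4', src_video, fps=16)
+ imageio.mimwrite('src_mask.mp4', src_mask, fps=16)
+ ```
+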
43
+ ### 3.2 Preprocessing Tools
44
+ Both the command line and a Gradio demo are supported.
45
+
46
+ 1) Command Line: You can refer to the `run_vace_preproccess.sh` script and invoke it based on the different task types. An example command is as follows:
47
+ ```bash
48
+ python vace/vace_preproccess.py --task depth --video assets/videos/test.mp4
49
+ ```
50
+
51
+ 2) Gradio Interactive: Launch the graphical interface for data preprocessing and perform preprocessing on the interface. The specific command is as follows:
52
+ ```bash
53
+ python vace/gradios/preprocess_demo.py
54
+ ```
55
+
56
+ ![gr_pre_demo](assets/materials/gr_pre_demo.jpg)
57
+
58
+
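+ The annotators listed in the next section can also be invoked directly from Python, following the pattern used in `tests/test_annotators.py`; for example, depth preprocessing:
+
+ ```python
+ # Sketch: run the depth annotator from Python instead of via the CLI.
+ from vace.annotators.depth import DepthVideoAnnotator
+ from vace.annotators.utils import read_video_frames, save_one_video
+
+ frames = read_video_frames('assets/videos/test.mp4')
+ anno = DepthVideoAnnotator({'PRETRAINED_MODEL': 'models/VACE-Annotators/depth/dpt_hybrid-midas-501f0c75.pt'})
+ depth_frames = anno.forward(frames)
+ save_one_video('depth_src_video.mp4', depth_frames, fps=16)
+ ```
+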
59
+ ### 3.3 Single Tasks
60
+
61
+ VACE is an all-in-one model supporting various task types, but each task type requires different preprocessing. The supported task types are described below:
62
+
63
+ | Task | Subtask | Annotator | Input modality | Params | Note |
64
+ |------------|----------------------|----------------------------|------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|
65
+ | txt2vid | txt2vid | / | / | / | |
66
+ | control | depth | DepthVideoAnnotator | video | / | |
67
+ | control | flow | FlowVisAnnotator | video | / | |
68
+ | control | gray | GrayVideoAnnotator | video | / | |
69
+ | control | pose | PoseBodyFaceVideoAnnotator | video | / | |
70
+ | control | scribble | ScribbleVideoAnnotator | video | / | |
71
+ | control | layout_bbox | LayoutBboxAnnotator | two bboxes <br>'x1,y1,x2,y2 x1,y1,x2,y2' | / | Move linearly from the first box to the second box |
72
+ | control | layout_track | LayoutTrackAnnotator | video | mode='masktrack/bboxtrack/label/caption'<br>maskaug_mode(optional)='original/original_expand/hull/hull_expand/bbox/bbox_expand'<br>maskaug_ratio(optional)=0~1.0 | Mode represents different methods of subject tracking. |
73
+ | extension | frameref | FrameRefExpandAnnotator | image | mode='firstframe'<br>expand_num=80 (default) | |
74
+ | extension | frameref | FrameRefExpandAnnotator | image | mode='lastframe'<br>expand_num=80 (default) | |
75
+ | extension | frameref | FrameRefExpandAnnotator | two images<br>a.jpg,b.jpg | mode='firstlastframe'<br>expand_num=80 (default) | Images are separated by commas. |
76
+ | extension | clipref | FrameRefExpandAnnotator | video | mode='firstclip'<br>expand_num=80 (default) | |
77
+ | extension | clipref | FrameRefExpandAnnotator | video | mode='lastclip'<br>expand_num=80 (default) | |
78
+ | extension | clipref | FrameRefExpandAnnotator | two videos<br>a.mp4,b.mp4 | mode='firstlastclip'<br>expand_num=80 (default) | Videos are separated by commas. |
79
+ | repainting | inpainting_mask | InpaintingAnnotator | video | mode='salient' | Use salient as a fixed mask. |
80
+ | repainting | inpainting_mask | InpaintingAnnotator | video + mask | mode='mask' | Use mask as a fixed mask. |
81
+ | repainting | inpainting_bbox | InpaintingAnnotator | video + bbox<br>'x1, y1, x2, y2' | mode='bbox' | Use bbox as a fixed mask. |
82
+ | repainting | inpainting_masktrack | InpaintingAnnotator | video | mode='salientmasktrack' | Use salient mask for dynamic tracking. |
83
+ | repainting | inpainting_masktrack | InpaintingAnnotator | video | mode='salientbboxtrack' | Use salient bbox for dynamic tracking. |
84
+ | repainting | inpainting_masktrack | InpaintingAnnotator | video + mask | mode='masktrack' | Use mask for dynamic tracking. |
85
+ | repainting | inpainting_bboxtrack | InpaintingAnnotator | video + bbox<br>'x1, y1, x2, y2' | mode='bboxtrack' | Use bbox for dynamic tracking. |
86
+ | repainting | inpainting_label | InpaintingAnnotator | video + label | mode='label' | Use label for dynamic tracking. |
87
+ | repainting | inpainting_caption | InpaintingAnnotator | video + caption | mode='caption' | Use caption for dynamic tracking. |
88
+ | repainting | outpainting | OutpaintingVideoAnnotator | video | direction=left/right/up/down<br>expand_ratio=0~1.0 | Combine outpainting directions arbitrarily. |
89
+ | reference | image_reference | SubjectAnnotator | image | mode='salient/mask/bbox/salientmasktrack/salientbboxtrack/masktrack/bboxtrack/label/caption'<br>maskaug_mode(optional)='original/original_expand/hull/hull_expand/bbox/bbox_expand'<br>maskaug_ratio(optional)=0~1.0 | Use different methods to obtain the subject region. |
90
+
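+ For instance, the `outpainting` row above maps to `OutpaintingVideoAnnotator`; a minimal sketch following `tests/test_annotators.py`:
+
+ ```python
+ # Sketch: produce outpainting src_video / src_mask from an input video.
+ from vace.annotators.outpainting import OutpaintingVideoAnnotator
+ from vace.annotators.utils import read_video_frames, save_one_video
+
+ frames = read_video_frames('assets/videos/test.mp4')
+ anno = OutpaintingVideoAnnotator({'RETURN_MASK': True, 'KEEP_PADDING_RATIO': 1, 'MASK_COLOR': 'gray'})
+ ret = anno.forward(frames=frames, direction=['left', 'right'], expand_ratio=0.5)
+ save_one_video('src_video.mp4', ret['frames'], fps=16)  # padded frames, new regions filled gray
+ save_one_video('src_mask.mp4', ret['masks'], fps=16)    # white where content will be generated
+ ```
+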
91
+ ### 3.4 Composition Task
92
+
93
+ Moreover, VACE supports combining tasks to accomplish more complex objectives. The following examples illustrate how tasks can be combined, but these combinations are not limited to the examples provided:
94
+
95
+ | Task | Subtask | Annotator | Input modality | Params | Note |
96
+ |-------------|--------------------|----------------------------|--------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|
97
+ | composition | reference_anything | ReferenceAnythingAnnotator | image_list | mode='salientmasktrack/salientbboxtrack/masktrack/bboxtrack/label/caption' | Input no more than three images. |
98
+ | composition | animate_anything | AnimateAnythingAnnotator | image + video | mode='salientmasktrack/salientbboxtrack/masktrack/bboxtrack/label/caption' | Video for conditional redrawing; images for reference generation. |
99
+ | composition | swap_anything | SwapAnythingAnnotator | image + video | mode='masktrack/bboxtrack/label/caption'<br>maskaug_mode(optional)='original/original_expand/hull/hull_expand/bbox/bbox_expand'<br>maskaug_ratio(optional)=0~1.0 | Video for conditional redrawing; images for reference generation.<br>Comma-separated mode: first for video, second for images. |
100
+ | composition | expand_anything | ExpandAnythingAnnotator | image + image_list | mode='masktrack/bboxtrack/label/caption'<br>direction=left/right/up/down<br>expand_ratio=0~1.0<br>expand_num=80 (default) | First image for extension edit; others for reference.<br>Comma-separated mode: first for video, second for images. |
101
+ | composition | move_anything | MoveAnythingAnnotator | image + two bboxes | expand_num=80 (default) | First image for initial frame reference; others represented by linear bbox changes. |
102
+ | composition | more_anything | ... | ... | ... | ... |
103
+
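+ The composition annotators are expected to follow the same construct-then-`forward` pattern as the single-task annotators; the sketch below is hypothetical in its exact configuration and `forward()` signature:
+
+ ```python
+ # Hypothetical sketch: exact config keys and forward() arguments are assumptions.
+ from PIL import Image
+ from vace.annotators.composition import ReferenceAnythingAnnotator  # module location assumed
+
+ refs = [Image.open(p).convert('RGB')
+         for p in ('assets/images/test.jpg', 'assets/images/test2.jpg')]  # at most three images
+ anno = ReferenceAnythingAnnotator({})
+ ret = anno.forward(refs, mode='salientbboxtrack')
+ ```
+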
104
+
105
+ ## 4. Model Inference
106
+
107
+ ### 4.1 Execution Methods
108
+
109
+ Both the command line and a Gradio demo are supported.
110
+
111
+ 1) Command Line: Refer to the `run_vace_ltx.sh` and `run_vace_wan.sh` scripts and invoke them based on the different task types. The input data needs to be preprocessed to obtain parameters such as `src_video`, `src_mask`, `src_ref_images` and `prompt`. An example command is as follows:
112
+ ```bash
113
+ python vace/vace_wan_inference.py --src_video <path-to-src-video> --src_mask <path-to-src-mask> --src_ref_images <paths-to-src-ref-images> --prompt <prompt> # wan
114
+ python vace/vace_ltx_inference.py --src_video <path-to-src-video> --src_mask <path-to-src-mask> --src_ref_images <paths-to-src-ref-images> --prompt <prompt> # ltx
115
+ ```
116
+
117
+ 2) Gradio Interactive: Launch the graphical interface for model inference and perform inference through interactions on the interface. The specific command is as follows:
118
+ ```bash
119
+ python vace/gradios/vace_wan_demo.py # wan
120
+ python vace/gradios/vace_ltx_demo.py # ltx
121
+ ```
122
+
123
+ ![gr_infer_demo](assets/materials/gr_infer_demo.jpg)
124
+
125
+ 3) End-to-End Inference: Refer to the `run_vace_pipeline.sh` script and invoke it based on different task types and input data. This pipeline includes both preprocessing and model inference, thereby requiring only user-provided materials. However, it offers relatively less flexibility. An example command is as follows:
126
+ ```bash
127
+ python vace/vace_pipeline.py --base wan --task depth --video <path-to-video> --prompt <prompt> # wan
128
+ python vace/vace_pipeline.py --base ltx --task depth --video <path-to-video> --prompt <prompt> # ltx
129
+ ```
130
+
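+ Besides the scripts above, the Wan pipeline can also be driven programmatically, as `app.py` does; a condensed sketch of that flow (placeholder paths, demo defaults for sizes and sampling):
+
+ ```python
+ # Condensed from app.py: load VACE-Wan, prepare the sources, and generate.
+ from vace.models.wan.wan_vace import WanVace
+ from vace.models.wan.configs import WAN_CONFIGS, SIZE_CONFIGS
+
+ pipe = WanVace(config=WAN_CONFIGS['vace-1.3B'],
+                checkpoint_dir='models/VACE-Wan2.1-1.3B-Preview',
+                device_id=0, rank=0, t5_fsdp=False, dit_fsdp=False, use_usp=False)
+
+ src_video, src_mask, src_ref_images = pipe.prepare_source(
+     ['src_video.mp4'], ['src_mask.mp4'], [['src_ref_image_1.png']],
+     num_frames=81, image_size=SIZE_CONFIGS['480*832'], device=pipe.device)
+
+ video = pipe.generate('<prompt>', src_video, src_mask, src_ref_images,
+                       size=(832, 480), sampling_steps=25, guide_scale=6.0, shift=8.0,
+                       context_scale=1.0, n_prompt=pipe.config.sample_neg_prompt,
+                       seed=2025, offload_model=True)
+ ```
+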
131
+ ### 4.2 Inference Examples
132
+
133
+ We provide test examples for the different tasks so that users can validate them according to their needs. Each example includes the **task**, **subtask**, **original inputs** (ori_video and ori_images), **model inputs** (src_video, src_mask, src_ref_images, prompt), and **model outputs**.
134
+
135
+ | task | subtask | src_video | src_mask | src_ref_images | out_video | prompt | ori_video | ori_images |
136
+ |-------------|--------------------|----------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
137
+ | txt2vid | txt2vid | | | | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/txt2vid/out_video.mp4"></video> | 狂风巨浪的大海,镜头缓缓推进,一艘渺小的帆船在汹涌的波涛中挣扎漂荡。海面上白沫翻滚,帆船时隐时现,仿佛随时可能被巨浪吞噬。天空乌云密布,雷声轰鸣,海鸥在空中盘旋尖叫。帆船上的人们紧紧抓住缆绳,努力保持平衡。画面风格写实,充满紧张和动感。近景特写,强调风浪的冲击力和帆船的摇晃 | | |
138
+ | extension | firstframe | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/firstframe/src_video.mp4"></video> | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/firstframe/src_mask.mp4"></video> | | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/firstframe/out_video.mp4"></video> | 纪实摄影风格,前景是一位中国越野爱好者坐在越野车上,手持车载电台正在进行通联。他五官清晰,表情专注,眼神坚定地望向前方。越野车停在户外,车身略显脏污,显示出经历过的艰难路况。镜头从车外缓缓拉近,最后定格在人物的面部特写上,展现出他的坚定与热情。中景到近景,动态镜头运镜。 | | <img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/firstframe/ori_image_1.png"> |
139
+ | repainting | inpainting | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/inpainting/src_video.mp4"></video> | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/inpainting/src_mask.mp4"></video> | | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/inpainting/out_video.mp4"></video> | 一只巨大的金色凤凰从繁华的城市上空展翅飞过,羽毛如火焰般璀璨,闪烁着温暖的光辉,翅膀雄伟地展开。凤凰高昂着头,目光炯炯,轻轻扇动翅膀,散发出淡淡的光芒。下方是熙熙攘攘的市中心,人群惊叹,车水马龙,红蓝两色的霓虹灯在夜空下闪烁。镜头俯视城市街道,捕捉这一壮丽的景象,营造出既神秘又辉煌的氛围。 | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/inpainting/ori_video.mp4"></video> | |
140
+ | repainting | outpainting | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/outpainting/src_video.mp4"></video> | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/outpainting/src_mask.mp4"></video> | | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/outpainting/out_video.mp4"></video> | 赛博朋克风格,无人机俯瞰视角下的现代西安城墙,镜头穿过永宁门时泛起金色涟漪,城墙砖块化作数据流重组为唐代长安城。周围的街道上流动的人群和飞驰的机械交通工具交织在一起,现代与古代的交融,城墙上的灯光闪烁,形成时空隧道的效果。全息投影技术展现历史变迁,粒子重组特效细腻逼真。大远景逐渐过渡到特写,聚焦于城门特效。 | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/outpainting/ori_video.mp4"></video> | |
141
+ | control | depth | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/depth/src_video.mp4"></video> | | | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/depth/out_video.mp4"></video> | 一群年轻人在天空之城拍摄集体照。画面中,一对年轻情侣手牵手,轻声细语,相视而笑,周围是飞翔的彩色热气球和闪烁的星星,营造出浪漫的氛围。天空中,暖阳透过飘浮的云朵,洒下斑驳的光影。镜头以近景特写开始,随着情侣间的亲密互动,缓缓拉远。 | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/depth/ori_video.mp4"></video> | |
142
+ | control | flow | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/flow/src_video.mp4"></video> | | | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/flow/out_video.mp4"></video> | 纪实摄影风格,一颗鲜红的小番茄缓缓落入盛着牛奶的玻璃杯中,溅起晶莹的水花。画面以慢镜头捕捉这一瞬间,水花在空中绽放,形成美丽的弧线。玻璃杯中的牛奶纯白,番茄的鲜红与之形成鲜明对比。背景简洁,突出主体。近景特写,垂直俯视视角,展现细节之美。 | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/flow/ori_video.mp4"></video> | |
143
+ | control | gray | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/gray/src_video.mp4"></video> | | | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/gray/out_video.mp4"></video> | 镜头缓缓向右平移,身穿淡黄色坎肩长裙的长发女孩面对镜头露出灿烂的漏齿微笑。她的长发随风轻扬,眼神明亮而充满活力。背景是秋天红色和黄色的树叶,阳光透过树叶的缝隙洒下斑驳光影,营造出温馨自然的氛围。画面风格清新自然,仿佛夏日午后的一抹清凉。中景人像,强调自然光效和细腻的皮肤质感。 | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/gray/ori_video.mp4"></video> | |
144
+ | control | pose | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/pose/src_video.mp4"></video> | | | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/pose/out_video.mp4"></video> | 在一个热带的庆祝派对上,一家人围坐在椰子树下的长桌旁。桌上摆满了异国风味的美食。长辈们愉悦地交谈,年轻人兴奋地举杯碰撞,孩子们在沙滩上欢乐奔跑。背景中是湛蓝的海洋和明亮的阳光,营造出轻松的气氛。镜头以动态中景捕捉每个开心的瞬间,温暖的阳光映照着他们幸福的面庞。 | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/pose/ori_video.mp4"></video> | |
145
+ | control | scribble | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/scribble/src_video.mp4"></video> | | | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/scribble/out_video.mp4"></video> | 画面中荧光色彩的无人机从极低空高速掠过超现实主义风格的西安古城墙,尘埃反射着阳光。镜头快速切换至城墙上的砖石特写,阳光温暖地洒落,勾勒出每一块砖块的细腻纹理。整体画质清晰华丽,运镜流畅如水。 | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/scribble/ori_video.mp4"></video> | |
146
+ | control | layout | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/layout/src_video.mp4"></video> | | | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/layout/out_video.mp4"></video> | 视频展示了一只成鸟在树枝上的巢中喂养它的幼鸟。成鸟在喂食的过程中,幼鸟张开嘴巴等待食物。随后,成鸟飞走,幼鸟继续等待。成鸟再次飞回,带回食物喂养幼鸟。整个视频的拍摄角度固定,聚焦于巢穴和鸟类的互动,背景是模糊的绿色植被,强调了鸟类的自然行为和生态环境。 | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/layout/ori_video.mp4"></video> | |
147
+ | reference | face | | | <img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/face/src_ref_image_1.png"> | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/face/out_video.mp4"></video> | 视频展示了一位长着尖耳朵的老人,他有一头银白色的长发和小胡子,穿着一件色彩斑斓的长袍,内搭金色衬衫,散发出神秘与智慧的气息。背景为一个华丽宫殿的内部,金碧辉煌。灯光明亮,照亮他脸上的神采奕奕。摄像机旋转动态拍摄,捕捉老人轻松挥手的动作。 | | <img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/face/ori_image_1.png"> |
148
+ | reference | object | | | <img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/object/src_ref_image_1.png"> | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/object/out_video.mp4"></video> | 经典游戏角色马里奥在绿松石色水下世界中,四周环绕着珊瑚和各种各样的热带鱼。马里奥兴奋地向上跳起,摆出经典的欢快姿势,身穿鲜明的蓝色潜水服,红色的潜水面罩上印有“M”标志,脚上是一双潜水靴。背景中,水泡随波逐流,浮现出一个巨大而友好的海星。摄像机从水底向上快速移动,捕捉他跃出水面的瞬间,灯光明亮而流动。该场景融合了动画与幻想元素,令人惊叹。 | | <img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/object/ori_image_1.png"> |
149
+ | composition | reference_anything | | | <img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/reference_anything/src_ref_image_1.png">,<img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/reference_anything/src_ref_image_2.png"> | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/reference_anything/out_video.mp4"></video> | 一名打扮成超人的男子自信地站着,面对镜头,肩头有一只充满活力的毛绒黄色鸭子。他留着整齐的短发和浅色胡须,鸭子有橙色的喙和脚,它的翅膀稍微展开,脚分开以保持稳定。他的表情严肃而坚定。他穿着标志性的蓝红超人服装,胸前有黄色“S”标志。斗篷在他身后飘逸。背景有行人。相机位于视线水平,捕捉角色的整个上半身。灯光均匀明亮。 | | <img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/reference_anything/ori_image_1.png">,<img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/reference_anything/ori_image_2.png"> |
150
+ | composition | swap_anything | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/swap_anything/src_video.mp4"></video> | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/swap_anything/src_mask.mp4"></video> | <img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/swap_anything/src_ref_image_1.png"> | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/swap_anything/out_video.mp4"></video> | 视频展示了一个人在宽阔的草原上骑马。他有淡紫色长发,穿着传统服饰白上衣黑裤子,动画建模画风,看起来像是在进行某种户外活动或者是在进行某种表演。背景是壮观的山脉和多云的天空,给人一种宁静而广阔的感觉。整个视频的拍摄角度是固定的,重点展示了骑手和他的马。 | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/swap_anything/ori_video.mp4"></video> | <img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/swap_anything/ori_image_1.jpg"> |
151
+ | composition | expand_anything | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/expand_anything/src_video.mp4"></video> | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/expand_anything/src_mask.mp4"></video> | <img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/expand_anything/src_ref_image_1.png"> | <video controls height="200" src="benchmarks/VACE-Benchmark/assets/examples/expand_anything/out_video.mp4"></video> | 古典油画风格,背景是一条河边,画面中央一位成熟优雅的女人,穿着长裙坐在椅子上。她双手从怀里取出打开的红色心形墨镜戴上。固定机位。 | | <img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/expand_anything/ori_image_1.jpeg">,<img style="width: auto; height: 200px; object-fit: contain;" src="benchmarks/VACE-Benchmark/assets/examples/expand_anything/ori_image_2.png"> |
152
+
153
+ ## 5. Limitations
154
+
155
+ - VACE-LTX-Video-0.9
156
+ - The prompt significantly impacts video generation quality on LTX-Video. It must be extended in accordance with the method described in this [system prompt](https://huggingface.co/spaces/Lightricks/LTX-Video-Playground/blob/main/assets/system_prompt_i2v.txt). We also provide an input parameter (`--use_prompt_extend`) to enable prompt extension.
157
+ - This model is intended for experimental research validation within the VACE paper and may not guarantee performance in real-world scenarios. However, its inference speed is very fast, capable of creating a video in 25 seconds with 40 steps on an A100 GPU, making it suitable for preliminary data and creative validation.
158
+ - VACE-Wan2.1-1.3B-Preview
159
+ - This model mainly keeps the original Wan2.1-T2V-1.3B's video quality while supporting various tasks.
160
+ - When you encounter failure cases with specific tasks, we recommend trying again with a different seed and adjusting the prompt.
app.py ADDED
@@ -0,0 +1,278 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+
4
+ import argparse
5
+ import os
6
+ import sys
7
+ import datetime
8
+ import imageio
9
+ import numpy as np
10
+ import torch
11
+ import gradio as gr
12
+
13
+ sys.path.insert(0, os.path.sep.join(os.path.realpath(__file__).split(os.path.sep)[:-3]))
14
+ import wan
15
+ from vace.models.wan.wan_vace import WanVace
16
+ from vace.models.wan.configs import WAN_CONFIGS, SIZE_CONFIGS
17
+
18
+
19
+ class FixedSizeQueue:
20
+ def __init__(self, max_size):
21
+ self.max_size = max_size
22
+ self.queue = []
23
+ def add(self, item):
24
+ self.queue.insert(0, item)
25
+ if len(self.queue) > self.max_size:
26
+ self.queue.pop()
27
+ def get(self):
28
+ return self.queue
29
+ def __repr__(self):
30
+ return str(self.queue)
31
+
32
+
33
+ class VACEInference:
34
+ def __init__(self, cfg, skip_load=False, gallery_share=True, gallery_share_limit=5):
35
+ self.cfg = cfg
36
+ self.save_dir = cfg.save_dir
37
+ self.gallery_share = gallery_share
38
+ self.gallery_share_data = FixedSizeQueue(max_size=gallery_share_limit)
39
+ if not skip_load:
40
+ self.pipe = WanVace(
41
+ config=WAN_CONFIGS['vace-1.3B'],
42
+ checkpoint_dir=cfg.ckpt_dir,
43
+ device_id=0,
44
+ rank=0,
45
+ t5_fsdp=False,
46
+ dit_fsdp=False,
47
+ use_usp=False,
48
+ )
49
+
50
+ def create_ui(self, *args, **kwargs):
51
+ gr.Markdown("""
52
+ <div style="text-align: center; font-size: 24px; font-weight: bold; margin-bottom: 15px;">
53
+ <a href="https://ali-vilab.github.io/VACE-Page/" style="text-decoration: none; color: inherit;">VACE-WAN Demo</a>
54
+ </div>
55
+ """)
56
+ with gr.Row(variant='panel', equal_height=True):
57
+ with gr.Column(scale=1, min_width=0):
58
+ self.src_video = gr.Video(
59
+ label="src_video",
60
+ sources=['upload'],
61
+ value=None,
62
+ interactive=True)
63
+ with gr.Column(scale=1, min_width=0):
64
+ self.src_mask = gr.Video(
65
+ label="src_mask",
66
+ sources=['upload'],
67
+ value=None,
68
+ interactive=True)
69
+ #
70
+ with gr.Row(variant='panel', equal_height=True):
71
+ with gr.Column(scale=1, min_width=0):
72
+ with gr.Row(equal_height=True):
73
+ self.src_ref_image_1 = gr.Image(label='src_ref_image_1',
74
+ height=200,
75
+ interactive=True,
76
+ type='filepath',
77
+ image_mode='RGB',
78
+ sources=['upload'],
79
+ elem_id="src_ref_image_1",
80
+ format='png')
81
+ self.src_ref_image_2 = gr.Image(label='src_ref_image_2',
82
+ height=200,
83
+ interactive=True,
84
+ type='filepath',
85
+ image_mode='RGB',
86
+ sources=['upload'],
87
+ elem_id="src_ref_image_2",
88
+ format='png')
89
+ self.src_ref_image_3 = gr.Image(label='src_ref_image_3',
90
+ height=200,
91
+ interactive=True,
92
+ type='filepath',
93
+ image_mode='RGB',
94
+ sources=['upload'],
95
+ elem_id="src_ref_image_3",
96
+ format='png')
97
+ with gr.Row(variant='panel', equal_height=True):
98
+ with gr.Column(scale=1):
99
+ self.prompt = gr.Textbox(
100
+ show_label=False,
101
+ placeholder="positive_prompt_input",
102
+ elem_id='positive_prompt',
103
+ container=True,
104
+ autofocus=True,
105
+ elem_classes='type_row',
106
+ visible=True,
107
+ lines=2)
108
+ self.negative_prompt = gr.Textbox(
109
+ show_label=False,
110
+ value=self.pipe.config.sample_neg_prompt,
111
+ placeholder="negative_prompt_input",
112
+ elem_id='negative_prompt',
113
+ container=True,
114
+ autofocus=False,
115
+ elem_classes='type_row',
116
+ visible=True,
117
+ interactive=True,
118
+ lines=1)
119
+ #
120
+ with gr.Row(variant='panel', equal_height=True):
121
+ with gr.Column(scale=1, min_width=0):
122
+ with gr.Row(equal_height=True):
123
+ self.shift_scale = gr.Slider(
124
+ label='shift_scale',
125
+ minimum=0.0,
126
+ maximum=10.0,
127
+ step=1.0,
128
+ value=8.0,
129
+ interactive=True)
130
+ self.sample_steps = gr.Slider(
131
+ label='sample_steps',
132
+ minimum=1,
133
+ maximum=100,
134
+ step=1,
135
+ value=25,
136
+ interactive=True)
137
+ self.context_scale = gr.Slider(
138
+ label='context_scale',
139
+ minimum=0.0,
140
+ maximum=2.0,
141
+ step=0.1,
142
+ value=1.0,
143
+ interactive=True)
144
+ self.guide_scale = gr.Slider(
145
+ label='guide_scale',
146
+ minimum=1,
147
+ maximum=10,
148
+ step=0.5,
149
+ value=6.0,
150
+ interactive=True)
151
+ self.infer_seed = gr.Slider(minimum=-1,
152
+ maximum=10000000,
153
+ value=2025,
154
+ label="Seed")
155
+ #
156
+ with gr.Accordion(label="Usable without source video", open=False):
157
+ with gr.Row(equal_height=True):
158
+ self.output_height = gr.Textbox(
159
+ label='resolutions_height',
160
+ value=480,
161
+ interactive=True)
162
+ self.output_width = gr.Textbox(
163
+ label='resolutions_width',
164
+ value=832,
165
+ interactive=True)
166
+ self.frame_rate = gr.Textbox(
167
+ label='frame_rate',
168
+ value=16,
169
+ interactive=True)
170
+ self.num_frames = gr.Textbox(
171
+ label='num_frames',
172
+ value=81,
173
+ interactive=True)
174
+ #
175
+ with gr.Row(equal_height=True):
176
+ with gr.Column(scale=5):
177
+ self.generate_button = gr.Button(
178
+ value='Run',
179
+ elem_classes='type_row',
180
+ elem_id='generate_button',
181
+ visible=True)
182
+ with gr.Column(scale=1):
183
+ self.refresh_button = gr.Button(value='\U0001f504') # 🔄
184
+ #
185
+ self.output_gallery = gr.Gallery(
186
+ label="output_gallery",
187
+ value=[],
188
+ interactive=False,
189
+ allow_preview=True,
190
+ preview=True)
191
+
192
+
193
+ def generate(self, output_gallery, src_video, src_mask, src_ref_image_1, src_ref_image_2, src_ref_image_3, prompt, negative_prompt, shift_scale, sample_steps, context_scale, guide_scale, infer_seed, output_height, output_width, frame_rate, num_frames):
194
+ output_height, output_width, frame_rate, num_frames = int(output_height), int(output_width), int(frame_rate), int(num_frames)
195
+ src_ref_images = [x for x in [src_ref_image_1, src_ref_image_2, src_ref_image_3] if
196
+ x is not None]
197
+ src_video, src_mask, src_ref_images = self.pipe.prepare_source([src_video],
198
+ [src_mask],
199
+ [src_ref_images],
200
+ num_frames=num_frames,
201
+ image_size=SIZE_CONFIGS[f"{output_height}*{output_width}"],
202
+ device=self.pipe.device)
203
+ video = self.pipe.generate(
204
+ prompt,
205
+ src_video,
206
+ src_mask,
207
+ src_ref_images,
208
+ size=(output_width, output_height),
209
+ context_scale=context_scale,
210
+ shift=shift_scale,
211
+ sampling_steps=sample_steps,
212
+ guide_scale=guide_scale,
213
+ n_prompt=negative_prompt,
214
+ seed=infer_seed,
215
+ offload_model=True)
216
+
217
+ name = '{0:%Y%m%d%-H%M%S}'.format(datetime.datetime.now())
218
+ video_path = os.path.join(self.save_dir, f'cur_gallery_{name}.mp4')
219
+ video_frames = (torch.clamp(video / 2 + 0.5, min=0.0, max=1.0).permute(1, 2, 3, 0) * 255).cpu().numpy().astype(np.uint8)
220
+
221
+ try:
222
+ writer = imageio.get_writer(video_path, fps=frame_rate, codec='libx264', quality=8, macro_block_size=1)
223
+ for frame in video_frames:
224
+ writer.append_data(frame)
225
+ writer.close()
226
+ print(video_path)
227
+ except Exception as e:
228
+ raise gr.Error(f"Video save error: {e}")
229
+
230
+ if self.gallery_share:
231
+ self.gallery_share_data.add(video_path)
232
+ return self.gallery_share_data.get()
233
+ else:
234
+ return [video_path]
235
+
236
+ def set_callbacks(self, **kwargs):
237
+ self.gen_inputs = [self.output_gallery, self.src_video, self.src_mask, self.src_ref_image_1, self.src_ref_image_2, self.src_ref_image_3, self.prompt, self.negative_prompt, self.shift_scale, self.sample_steps, self.context_scale, self.guide_scale, self.infer_seed, self.output_height, self.output_width, self.frame_rate, self.num_frames]
238
+ self.gen_outputs = [self.output_gallery]
239
+ self.generate_button.click(self.generate,
240
+ inputs=self.gen_inputs,
241
+ outputs=self.gen_outputs,
242
+ queue=True)
243
+ self.refresh_button.click(lambda x: self.gallery_share_data.get() if self.gallery_share else x, inputs=[self.output_gallery], outputs=[self.output_gallery])
244
+
245
+
246
+ if __name__ == '__main__':
247
+ parser = argparse.ArgumentParser(description='Argparser for VACE-LTXV Demo:\n')
248
+ parser.add_argument('--server_port', dest='server_port', help='', type=int, default=7860)
249
+ parser.add_argument('--server_name', dest='server_name', help='', default='0.0.0.0')
250
+ parser.add_argument('--root_path', dest='root_path', help='', default=None)
251
+ parser.add_argument('--save_dir', dest='save_dir', help='', default='cache')
252
+ parser.add_argument(
253
+ "--ckpt_dir",
254
+ type=str,
255
+ default='models/VACE-Wan2.1-1.3B-Preview',
256
+ help="The path to the checkpoint directory.",
257
+ )
258
+ parser.add_argument(
259
+ "--offload_to_cpu",
260
+ action="store_true",
261
+ help="Offloading unnecessary computations to CPU.",
262
+ )
263
+
264
+ args = parser.parse_args()
265
+
266
+ if not os.path.exists(args.save_dir):
267
+ os.makedirs(args.save_dir, exist_ok=True)
268
+
269
+ with gr.Blocks() as demo:
270
+ infer_gr = VACEInference(args, skip_load=False, gallery_share=True, gallery_share_limit=5)
271
+ infer_gr.create_ui()
272
+ infer_gr.set_callbacks()
273
+ allowed_paths = [args.save_dir]
274
+ demo.queue(status_update_rate=1).launch(server_name=args.server_name,
275
+ server_port=args.server_port,
276
+ root_path=args.root_path,
277
+ allowed_paths=allowed_paths,
278
+ show_error=True, debug=True)
assets/images/test.jpg ADDED

Git LFS Details

  • SHA256: 71549d76843c4ee220f37f45e87f0dfc22079d1bc5fbe3f52fe2ded2b9454a3b
  • Pointer size: 131 Bytes
  • Size of remote file: 143 kB
assets/images/test2.jpg ADDED
assets/images/test3.jpg ADDED

Git LFS Details

  • SHA256: bee71955dac07594b21937c2354ab5b7bd3f3321447202476178dab5ceead497
  • Pointer size: 131 Bytes
  • Size of remote file: 214 kB
assets/masks/test.png ADDED
assets/masks/test2.png ADDED
assets/materials/gr_infer_demo.jpg ADDED

Git LFS Details

  • SHA256: 9b4f0df3c602da88e707262029d78284b3b5857e2bac413edef6f117e3ddb8be
  • Pointer size: 131 Bytes
  • Size of remote file: 320 kB
assets/materials/gr_pre_demo.jpg ADDED

Git LFS Details

  • SHA256: 6939180a97bd5abfc8d90bef6b31e949c591e2d75f5719e0eac150871d4aaae2
  • Pointer size: 131 Bytes
  • Size of remote file: 267 kB
assets/materials/tasks.png ADDED

Git LFS Details

  • SHA256: 1f1c4b3f3e6ae927880fbe2f9a46939cc98824bb56c2753c975a2e3c4820830b
  • Pointer size: 131 Bytes
  • Size of remote file: 709 kB
assets/materials/teaser.jpg ADDED

Git LFS Details

  • SHA256: 87ce75e8dcbf1536674d3a951326727e0aff80192f52cf7388b34c03f13f711f
  • Pointer size: 131 Bytes
  • Size of remote file: 892 kB
assets/videos/test.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2195efbd92773f1ee262154577c700e9c3b7a4d7d04b1a2ac421db0879c696b0
3
+ size 737090
assets/videos/test2.mp4 ADDED
Binary file (79.6 kB).
 
benchmarks/.gitkeep ADDED
File without changes
models/.gitkeep ADDED
File without changes
pyproject.toml ADDED
@@ -0,0 +1,75 @@
1
+ [build-system]
2
+ requires = ["setuptools>=42", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "vace"
7
+ version = "1.0.0"
8
+ description = "VACE: All-in-One Video Creation and Editing"
9
+ authors = [
10
+ { name = "VACE Team", email = "[email protected]" }
11
+ ]
12
+ requires-python = ">=3.10,<4.0"
13
+ readme = "README.md"
14
+ dependencies = [
15
+ "torch>=2.5.1",
16
+ "torchvision>=0.20.1",
17
+ "opencv-python>=4.9.0.80",
18
+ "diffusers>=0.31.0",
19
+ "transformers>=4.49.0",
20
+ "tokenizers>=0.20.3",
21
+ "accelerate>=1.1.1",
22
+ "gradio>=5.0.0",
23
+ "numpy>=1.23.5,<2",
24
+ "tqdm",
25
+ "imageio",
26
+ "easydict",
27
+ "ftfy",
28
+ "dashscope",
29
+ "imageio-ffmpeg",
30
+ "flash_attn",
31
+ "decord",
32
+ "einops",
33
+ "scikit-image",
34
+ "scikit-learn",
35
+ "pycocotools",
36
+ "timm",
37
+ "onnxruntime-gpu",
38
+ "BeautifulSoup4"
39
+ ]
40
+
41
+ [project.optional-dependencies]
42
+ ltx = [
43
+ "ltx-video@git+https://github.com/Lightricks/[email protected]"
44
+ ]
45
+ wan = [
46
+ "wan@git+https://github.com/Wan-Video/Wan2.1"
47
+ ]
48
+ annotator = [
49
+ "insightface",
50
+ "sam-2@git+https://github.com/facebookresearch/sam2.git",
51
+ "segment-anything@git+https://github.com/facebookresearch/segment-anything.git",
52
+ "groundingdino@git+https://github.com/IDEA-Research/GroundingDINO.git",
53
+ "ram@git+https://github.com/xinyu1205/recognize-anything.git",
54
+ "raft@git+https://github.com/martin-chobanyan-sdc/RAFT.git"
55
+ ]
56
+
57
+ [project.urls]
58
+ homepage = "https://ali-vilab.github.io/VACE-Page/"
59
+ documentation = "https://ali-vilab.github.io/VACE-Page/"
60
+ repository = "https://github.com/ali-vilab/VACE"
61
+ hfmodel = "https://huggingface.co/collections/ali-vilab/vace-67eca186ff3e3564726aff38"
62
+ msmodel = "https://modelscope.cn/collections/VACE-8fa5fcfd386e43"
63
+ paper = "https://arxiv.org/abs/2503.07598"
64
+
65
+ [tool.setuptools]
66
+ packages = { find = {} }
67
+
68
+ [tool.black]
69
+ line-length = 88
70
+
71
+ [tool.isort]
72
+ profile = "black"
73
+
74
+ [tool.mypy]
75
+ strict = true
requirements.txt ADDED
@@ -0,0 +1 @@
1
+ -r requirements/framework.txt
requirements/annotator.txt ADDED
@@ -0,0 +1,6 @@
1
+ insightface
2
+ git+https://github.com/facebookresearch/sam2.git
3
+ git+https://github.com/facebookresearch/segment-anything.git
4
+ git+https://github.com/IDEA-Research/GroundingDINO.git
5
+ git+https://github.com/xinyu1205/recognize-anything.git
6
+ git+https://github.com/martin-chobanyan-sdc/RAFT.git
requirements/framework.txt ADDED
@@ -0,0 +1,26 @@
1
+ torch>=2.5.1
2
+ torchvision>=0.20.1
3
+ opencv-python>=4.9.0.80
4
+ diffusers>=0.31.0
5
+ transformers>=4.49.0
6
+ tokenizers>=0.20.3
7
+ accelerate>=1.1.1
8
+ gradio>=5.0.0
9
+ numpy>=1.23.5,<2
10
+ tqdm
11
+ imageio
12
+ easydict
13
+ ftfy
14
+ dashscope
15
+ imageio-ffmpeg
16
+ flash_attn
17
+ decord
18
+ einops
19
+ scikit-image
20
+ scikit-learn
21
+ pycocotools
22
+ timm
23
+ onnxruntime-gpu
24
+ BeautifulSoup4
25
+ #ltx-video@git+https://github.com/Lightricks/[email protected]
26
+ #wan@git+https://github.com/Wan-Video/Wan2.1
tests/test_annotators.py ADDED
@@ -0,0 +1,568 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+
4
+ import os
5
+ import unittest
6
+ import numpy as np
7
+ from PIL import Image
8
+
9
+ from vace.annotators.utils import read_video_frames
10
+ from vace.annotators.utils import save_one_video
11
+
12
+ class AnnotatorTest(unittest.TestCase):
13
+ def setUp(self):
14
+ print(('Testing %s.%s' % (type(self).__name__, self._testMethodName)))
15
+ self.save_dir = './cache/test_annotator'
16
+ if not os.path.exists(self.save_dir):
17
+ os.makedirs(self.save_dir)
18
+ # load test image
19
+ self.image_path = './assets/images/test.jpg'
20
+ self.image = Image.open(self.image_path).convert('RGB')
21
+ # load test video
22
+ self.video_path = './assets/videos/test.mp4'
23
+ self.frames = read_video_frames(self.video_path)
24
+
25
+ def tearDown(self):
26
+ super().tearDown()
27
+
28
+ @unittest.skip('')
29
+ def test_annotator_gray_image(self):
30
+ from vace.annotators.gray import GrayAnnotator
31
+ cfg_dict = {}
32
+ anno_ins = GrayAnnotator(cfg_dict)
33
+ anno_image = anno_ins.forward(np.array(self.image))
34
+ save_path = os.path.join(self.save_dir, 'test_gray_image.png')
35
+ Image.fromarray(anno_image).save(save_path)
36
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
37
+
38
+ @unittest.skip('')
39
+ def test_annotator_gray_video(self):
40
+ from vace.annotators.gray import GrayAnnotator
41
+ cfg_dict = {}
42
+ anno_ins = GrayAnnotator(cfg_dict)
43
+ ret_frames = []
44
+ for frame in self.frames:
45
+ anno_frame = anno_ins.forward(np.array(frame))
46
+ ret_frames.append(anno_frame)
47
+ save_path = os.path.join(self.save_dir, 'test_gray_video.mp4')
48
+ save_one_video(save_path, ret_frames, fps=16)
49
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
50
+
51
+ @unittest.skip('')
52
+ def test_annotator_gray_video_2(self):
53
+ from vace.annotators.gray import GrayVideoAnnotator
54
+ cfg_dict = {}
55
+ anno_ins = GrayVideoAnnotator(cfg_dict)
56
+ ret_frames = anno_ins.forward(self.frames)
57
+ save_path = os.path.join(self.save_dir, 'test_gray_video_2.mp4')
58
+ save_one_video(save_path, ret_frames, fps=16)
59
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
60
+
61
+
62
+ @unittest.skip('')
63
+ def test_annotator_pose_image(self):
64
+ from vace.annotators.pose import PoseBodyFaceAnnotator
65
+ cfg_dict = {
66
+ "DETECTION_MODEL": "models/VACE-Annotators/pose/yolox_l.onnx",
67
+ "POSE_MODEL": "models/VACE-Annotators/pose/dw-ll_ucoco_384.onnx",
68
+ "RESIZE_SIZE": 1024
69
+ }
70
+ anno_ins = PoseBodyFaceAnnotator(cfg_dict)
71
+ anno_image = anno_ins.forward(np.array(self.image))
72
+ save_path = os.path.join(self.save_dir, 'test_pose_image.png')
73
+ Image.fromarray(anno_image).save(save_path)
74
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
75
+
76
+ @unittest.skip('')
77
+ def test_annotator_pose_video(self):
78
+ from vace.annotators.pose import PoseBodyFaceAnnotator
79
+ cfg_dict = {
80
+ "DETECTION_MODEL": "models/VACE-Annotators/pose/yolox_l.onnx",
81
+ "POSE_MODEL": "models/VACE-Annotators/pose/dw-ll_ucoco_384.onnx",
82
+ "RESIZE_SIZE": 1024
83
+ }
84
+ anno_ins = PoseBodyFaceAnnotator(cfg_dict)
85
+ ret_frames = []
86
+ for frame in self.frames:
87
+ anno_frame = anno_ins.forward(np.array(frame))
88
+ ret_frames.append(anno_frame)
89
+ save_path = os.path.join(self.save_dir, 'test_pose_video.mp4')
90
+ save_one_video(save_path, ret_frames, fps=16)
91
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
92
+
93
+ @unittest.skip('')
94
+ def test_annotator_pose_video_2(self):
95
+ from vace.annotators.pose import PoseBodyFaceVideoAnnotator
96
+ cfg_dict = {
97
+ "DETECTION_MODEL": "models/VACE-Annotators/pose/yolox_l.onnx",
98
+ "POSE_MODEL": "models/VACE-Annotators/pose/dw-ll_ucoco_384.onnx",
99
+ "RESIZE_SIZE": 1024
100
+ }
101
+ anno_ins = PoseBodyFaceVideoAnnotator(cfg_dict)
102
+ ret_frames = anno_ins.forward(self.frames)
103
+ save_path = os.path.join(self.save_dir, 'test_pose_video_2.mp4')
104
+ save_one_video(save_path, ret_frames, fps=16)
105
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
106
+
107
+ @unittest.skip('')
108
+ def test_annotator_depth_image(self):
109
+ from vace.annotators.depth import DepthAnnotator
110
+ cfg_dict = {
111
+ "PRETRAINED_MODEL": "models/VACE-Annotators/depth/dpt_hybrid-midas-501f0c75.pt"
112
+ }
113
+ anno_ins = DepthAnnotator(cfg_dict)
114
+ anno_image = anno_ins.forward(np.array(self.image))
115
+ save_path = os.path.join(self.save_dir, 'test_depth_image.png')
116
+ Image.fromarray(anno_image).save(save_path)
117
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
118
+
119
+ @unittest.skip('')
120
+ def test_annotator_depth_video(self):
121
+ from vace.annotators.depth import DepthAnnotator
122
+ cfg_dict = {
123
+ "PRETRAINED_MODEL": "models/VACE-Annotators/depth/dpt_hybrid-midas-501f0c75.pt"
124
+ }
125
+ anno_ins = DepthAnnotator(cfg_dict)
126
+ ret_frames = []
127
+ for frame in self.frames:
128
+ anno_frame = anno_ins.forward(np.array(frame))
129
+ ret_frames.append(anno_frame)
130
+ save_path = os.path.join(self.save_dir, 'test_depth_video.mp4')
131
+ save_one_video(save_path, ret_frames, fps=16)
132
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
133
+
134
+ @unittest.skip('')
135
+ def test_annotator_depth_video_2(self):
136
+ from vace.annotators.depth import DepthVideoAnnotator
137
+ cfg_dict = {
138
+ "PRETRAINED_MODEL": "models/VACE-Annotators/depth/dpt_hybrid-midas-501f0c75.pt"
139
+ }
140
+ anno_ins = DepthVideoAnnotator(cfg_dict)
141
+ ret_frames = anno_ins.forward(self.frames)
142
+ save_path = os.path.join(self.save_dir, 'test_depth_video_2.mp4')
143
+ save_one_video(save_path, ret_frames, fps=16)
144
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
145
+
146
+ @unittest.skip('')
147
+ def test_annotator_scribble_image(self):
148
+ from vace.annotators.scribble import ScribbleAnnotator
149
+ cfg_dict = {
150
+ "PRETRAINED_MODEL": "models/VACE-Annotators/scribble/anime_style/netG_A_latest.pth"
151
+ }
152
+ anno_ins = ScribbleAnnotator(cfg_dict)
153
+ anno_image = anno_ins.forward(np.array(self.image))
154
+ save_path = os.path.join(self.save_dir, 'test_scribble_image.png')
155
+ Image.fromarray(anno_image).save(save_path)
156
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
157
+
158
+ @unittest.skip('')
159
+ def test_annotator_scribble_video(self):
160
+ from vace.annotators.scribble import ScribbleAnnotator
161
+ cfg_dict = {
162
+ "PRETRAINED_MODEL": "models/VACE-Annotators/scribble/anime_style/netG_A_latest.pth"
163
+ }
164
+ anno_ins = ScribbleAnnotator(cfg_dict)
165
+ ret_frames = []
166
+ for frame in self.frames:
167
+ anno_frame = anno_ins.forward(np.array(frame))
168
+ ret_frames.append(anno_frame)
169
+ save_path = os.path.join(self.save_dir, 'test_scribble_video.mp4')
170
+ save_one_video(save_path, ret_frames, fps=16)
171
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
172
+
173
+ @unittest.skip('')
174
+ def test_annotator_scribble_video_2(self):
175
+ from vace.annotators.scribble import ScribbleVideoAnnotator
176
+ cfg_dict = {
177
+ "PRETRAINED_MODEL": "models/VACE-Annotators/scribble/anime_style/netG_A_latest.pth"
178
+ }
179
+ anno_ins = ScribbleVideoAnnotator(cfg_dict)
180
+ ret_frames = anno_ins.forward(self.frames)
181
+ save_path = os.path.join(self.save_dir, 'test_scribble_video_2.mp4')
182
+ save_one_video(save_path, ret_frames, fps=16)
183
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
184
+
185
+ @unittest.skip('')
186
+ def test_annotator_flow_video(self):
187
+ from vace.annotators.flow import FlowVisAnnotator
188
+ cfg_dict = {
189
+ "PRETRAINED_MODEL": "models/VACE-Annotators/flow/raft-things.pth"
190
+ }
191
+ anno_ins = FlowVisAnnotator(cfg_dict)
192
+ ret_frames = anno_ins.forward(self.frames)
193
+ save_path = os.path.join(self.save_dir, 'test_flow_video.mp4')
194
+ save_one_video(save_path, ret_frames, fps=16)
195
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
196
+
197
+ @unittest.skip('')
198
+ def test_annotator_frameref_video_1(self):
199
+ from vace.annotators.frameref import FrameRefExtractAnnotator
200
+ cfg_dict = {
201
+ "REF_CFG": [{"mode": "first", "proba": 0.1},
202
+ {"mode": "last", "proba": 0.1},
203
+ {"mode": "firstlast", "proba": 0.1},
204
+ {"mode": "random", "proba": 0.1}],
205
+ }
206
+ anno_ins = FrameRefExtractAnnotator(cfg_dict)
207
+ ret_frames, ret_masks = anno_ins.forward(self.frames, ref_num=10)
208
+ save_path = os.path.join(self.save_dir, 'test_frameref_video_1.mp4')
209
+ save_one_video(save_path, ret_frames, fps=16)
210
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
211
+ save_path = os.path.join(self.save_dir, 'test_frameref_mask_1.mp4')
212
+ save_one_video(save_path, ret_masks, fps=16)
213
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
214
+
215
+ @unittest.skip('')
216
+ def test_annotator_frameref_video_2(self):
217
+ from vace.annotators.frameref import FrameRefExpandAnnotator
218
+ cfg_dict = {}
219
+ anno_ins = FrameRefExpandAnnotator(cfg_dict)
220
+ ret_frames, ret_masks = anno_ins.forward(frames=self.frames, mode='lastclip', expand_num=50)
221
+ save_path = os.path.join(self.save_dir, 'test_frameref_video_2.mp4')
222
+ save_one_video(save_path, ret_frames, fps=16)
223
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
224
+ save_path = os.path.join(self.save_dir, 'test_frameref_mask_2.mp4')
225
+ save_one_video(save_path, ret_masks, fps=16)
226
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
227
+
228
+
229
+ @unittest.skip('')
230
+ def test_annotator_outpainting_1(self):
231
+ from vace.annotators.outpainting import OutpaintingAnnotator
232
+ cfg_dict = {
233
+ "RETURN_MASK": True,
234
+ "KEEP_PADDING_RATIO": 1,
235
+ "MASK_COLOR": "gray"
236
+ }
237
+ anno_ins = OutpaintingAnnotator(cfg_dict)
238
+ ret_data = anno_ins.forward(self.image, direction=['right', 'up', 'down'], expand_ratio=0.5)
239
+ save_path = os.path.join(self.save_dir, 'test_outpainting_image.png')
240
+ Image.fromarray(ret_data['image']).save(save_path)
241
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
242
+ save_path = os.path.join(self.save_dir, 'test_outpainting_mask.png')
243
+ Image.fromarray(ret_data['mask']).save(save_path)
244
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
245
+
246
+ @unittest.skip('')
247
+ def test_annotator_outpainting_video_1(self):
248
+ from vace.annotators.outpainting import OutpaintingVideoAnnotator
249
+ cfg_dict = {
250
+ "RETURN_MASK": True,
251
+ "KEEP_PADDING_RATIO": 1,
252
+ "MASK_COLOR": "gray"
253
+ }
254
+ anno_ins = OutpaintingVideoAnnotator(cfg_dict)
255
+ ret_data = anno_ins.forward(frames=self.frames, direction=['right', 'up', 'down'], expand_ratio=0.5)
256
+ save_path = os.path.join(self.save_dir, 'test_outpainting_video_1.mp4')
257
+ save_one_video(save_path, ret_data['frames'], fps=16)
258
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
259
+ save_path = os.path.join(self.save_dir, 'test_outpainting_mask_1.mp4')
260
+ save_one_video(save_path, ret_data['masks'], fps=16)
261
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
262
+
263
+ @unittest.skip('')
264
+ def test_annotator_outpainting_inner_1(self):
265
+ from vace.annotators.outpainting import OutpaintingInnerAnnotator
266
+ cfg_dict = {
267
+ "RETURN_MASK": True,
268
+ "KEEP_PADDING_RATIO": 1,
269
+ "MASK_COLOR": "gray"
270
+ }
271
+ anno_ins = OutpaintingInnerAnnotator(cfg_dict)
272
+ ret_data = anno_ins.forward(self.image, direction=['right', 'up', 'down'], expand_ratio=0.15)
273
+ save_path = os.path.join(self.save_dir, 'test_outpainting_inner_image.png')
274
+ Image.fromarray(ret_data['image']).save(save_path)
275
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
276
+ save_path = os.path.join(self.save_dir, 'test_outpainting_inner_mask.png')
277
+ Image.fromarray(ret_data['mask']).save(save_path)
278
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
279
+
280
+ @unittest.skip('')
281
+ def test_annotator_outpainting_inner_video_1(self):
282
+ from vace.annotators.outpainting import OutpaintingInnerVideoAnnotator
283
+ cfg_dict = {
284
+ "RETURN_MASK": True,
285
+ "KEEP_PADDING_RATIO": 1,
286
+ "MASK_COLOR": "gray"
287
+ }
288
+ anno_ins = OutpaintingInnerVideoAnnotator(cfg_dict)
289
+ ret_data = anno_ins.forward(self.frames, direction=['right', 'up', 'down'], expand_ratio=0.15)
290
+ save_path = os.path.join(self.save_dir, 'test_outpainting_inner_video_1.mp4')
291
+ save_one_video(save_path, ret_data['frames'], fps=16)
292
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
293
+ save_path = os.path.join(self.save_dir, 'test_outpainting_inner_mask_1.mp4')
294
+ save_one_video(save_path, ret_data['masks'], fps=16)
295
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
296
+
297
+ @unittest.skip('')
298
+ def test_annotator_salient(self):
299
+ from vace.annotators.salient import SalientAnnotator
300
+ cfg_dict = {
301
+ "PRETRAINED_MODEL": "models/VACE-Annotators/salient/u2net.pt",
302
+ }
303
+ anno_ins = SalientAnnotator(cfg_dict)
304
+ ret_data = anno_ins.forward(self.image)
305
+ save_path = os.path.join(self.save_dir, 'test_salient_image.png')
306
+ Image.fromarray(ret_data).save(save_path)
307
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
308
+
309
+ @unittest.skip('')
310
+ def test_annotator_salient_video(self):
311
+ from vace.annotators.salient import SalientVideoAnnotator
312
+ cfg_dict = {
313
+ "PRETRAINED_MODEL": "models/VACE-Annotators/salient/u2net.pt",
314
+ }
315
+ anno_ins = SalientVideoAnnotator(cfg_dict)
316
+ ret_frames = anno_ins.forward(self.frames)
317
+ save_path = os.path.join(self.save_dir, 'test_salient_video.mp4')
318
+ save_one_video(save_path, ret_frames, fps=16)
319
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
320
+
321
+ @unittest.skip('')
322
+ def test_annotator_layout_video(self):
323
+ from vace.annotators.layout import LayoutBboxAnnotator
324
+ cfg_dict = {
325
+ "RAM_TAG_COLOR_PATH": "models/VACE-Annotators/layout/ram_tag_color_list.txt",
326
+ }
327
+ anno_ins = LayoutBboxAnnotator(cfg_dict)
328
+ ret_frames = anno_ins.forward(bbox=[(544, 288, 744, 680), (1112, 240, 1280, 712)], frame_size=(720, 1280), num_frames=49, label='person')
329
+ save_path = os.path.join(self.save_dir, 'test_layout_video.mp4')
330
+ save_one_video(save_path, ret_frames, fps=16)
331
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
332
+
333
+ @unittest.skip('')
334
+ def test_annotator_layout_mask_video(self):
335
+ # salient
336
+ from vace.annotators.salient import SalientVideoAnnotator
337
+ cfg_dict = {
338
+ "PRETRAINED_MODEL": "models/VACE-Annotators/salient/u2net.pt",
339
+ }
340
+ anno_ins = SalientVideoAnnotator(cfg_dict)
341
+ salient_frames = anno_ins.forward(self.frames)
342
+
343
+ # mask layout
344
+ from vace.annotators.layout import LayoutMaskAnnotator
345
+ cfg_dict = {
346
+ "RAM_TAG_COLOR_PATH": "models/VACE-Annotators/layout/ram_tag_color_list.txt",
347
+ }
348
+ anno_ins = LayoutMaskAnnotator(cfg_dict)
349
+ ret_frames = anno_ins.forward(salient_frames, label='cat')
350
+ save_path = os.path.join(self.save_dir, 'test_mask_layout_video.mp4')
351
+ save_one_video(save_path, ret_frames, fps=16)
352
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
353
+
354
+ @unittest.skip('')
355
+ def test_annotator_layout_mask_video_2(self):
356
+ # salient
357
+ from vace.annotators.salient import SalientVideoAnnotator
358
+ cfg_dict = {
359
+ "PRETRAINED_MODEL": "models/VACE-Annotators/salient/u2net.pt",
360
+ }
361
+ anno_ins = SalientVideoAnnotator(cfg_dict)
362
+ salient_frames = anno_ins.forward(self.frames)
363
+
364
+ # mask layout
365
+ from vace.annotators.layout import LayoutMaskAnnotator
366
+ cfg_dict = {
367
+ "RAM_TAG_COLOR_PATH": "models/VACE-Annotators/layout/ram_tag_color_list.txt",
368
+ "USE_AUG": True
369
+ }
370
+ anno_ins = LayoutMaskAnnotator(cfg_dict)
371
+ ret_frames = anno_ins.forward(salient_frames, label='cat', mask_cfg={'mode': 'bbox_expand'})
372
+ save_path = os.path.join(self.save_dir, 'test_mask_layout_video_2.mp4')
373
+ save_one_video(save_path, ret_frames, fps=16)
374
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
375
+
376
+
377
+ @unittest.skip('')
378
+ def test_annotator_maskaug_video(self):
379
+ # salient
380
+ from vace.annotators.salient import SalientVideoAnnotator
381
+ cfg_dict = {
382
+ "PRETRAINED_MODEL": "models/VACE-Annotators/salient/u2net.pt",
383
+ }
384
+ anno_ins = SalientVideoAnnotator(cfg_dict)
385
+ salient_frames = anno_ins.forward(self.frames)
386
+
387
+ # mask aug
388
+ from vace.annotators.maskaug import MaskAugAnnotator
389
+ cfg_dict = {}
390
+ anno_ins = MaskAugAnnotator(cfg_dict)
391
+ ret_frames = anno_ins.forward(salient_frames, mask_cfg={'mode': 'hull_expand'})
392
+ save_path = os.path.join(self.save_dir, 'test_maskaug_video.mp4')
393
+ save_one_video(save_path, ret_frames, fps=16)
394
+ print(('Testing %s: %s' % (type(self).__name__, save_path)))
395
+
396
+
397
+ @unittest.skip('')
398
+ def test_annotator_ram(self):
399
+ from vace.annotators.ram import RAMAnnotator
400
+ cfg_dict = {
401
+ "TOKENIZER_PATH": "models/VACE-Annotators/ram/bert-base-uncased",
402
+ "PRETRAINED_MODEL": "models/VACE-Annotators/ram/ram_plus_swin_large_14m.pth",
403
+ }
404
+ anno_ins = RAMAnnotator(cfg_dict)
405
+ ret_data = anno_ins.forward(self.image)
406
+ print(ret_data)
407
+
408
+ @unittest.skip('')
409
+ def test_annotator_gdino_v1(self):
410
+ from vace.annotators.gdino import GDINOAnnotator
411
+ cfg_dict = {
412
+ "TOKENIZER_PATH": "models/VACE-Annotators/gdino/bert-base-uncased",
413
+ "CONFIG_PATH": "models/VACE-Annotators/gdino/GroundingDINO_SwinT_OGC_mod.py",
414
+ "PRETRAINED_MODEL": "models/VACE-Annotators/gdino/groundingdino_swint_ogc.pth",
415
+ }
416
+ anno_ins = GDINOAnnotator(cfg_dict)
417
+ ret_data = anno_ins.forward(self.image, caption="a cat and a vase")
418
+ print(ret_data)
419
+
420
+ @unittest.skip('')
421
+ def test_annotator_gdino_v2(self):
422
+ from vace.annotators.gdino import GDINOAnnotator
423
+ cfg_dict = {
424
+ "TOKENIZER_PATH": "models/VACE-Annotators/gdino/bert-base-uncased",
425
+ "CONFIG_PATH": "models/VACE-Annotators/gdino/GroundingDINO_SwinT_OGC_mod.py",
426
+ "PRETRAINED_MODEL": "models/VACE-Annotators/gdino/groundingdino_swint_ogc.pth",
427
+ }
428
+ anno_ins = GDINOAnnotator(cfg_dict)
429
+ ret_data = anno_ins.forward(self.image, classes=["cat", "vase"])
430
+ print(ret_data)
431
+
432
+ @unittest.skip('')
433
+ def test_annotator_gdino_with_ram(self):
434
+ from vace.annotators.gdino import GDINORAMAnnotator
435
+ cfg_dict = {
436
+ "RAM": {
437
+ "TOKENIZER_PATH": "models/VACE-Annotators/ram/bert-base-uncased",
438
+ "PRETRAINED_MODEL": "models/VACE-Annotators/ram/ram_plus_swin_large_14m.pth",
439
+ },
440
+ "GDINO": {
441
+ "TOKENIZER_PATH": "models/VACE-Annotators/gdino/bert-base-uncased",
442
+ "CONFIG_PATH": "models/VACE-Annotators/gdino/GroundingDINO_SwinT_OGC_mod.py",
443
+ "PRETRAINED_MODEL": "models/VACE-Annotators/gdino/groundingdino_swint_ogc.pth",
444
+ }
445
+
446
+ }
447
+ anno_ins = GDINORAMAnnotator(cfg_dict)
448
+ ret_data = anno_ins.forward(self.image)
449
+ print(ret_data)
450
+
451
+ @unittest.skip('')
452
+ def test_annotator_sam2(self):
453
+ from vace.annotators.sam2 import SAM2VideoAnnotator
454
+ from vace.annotators.utils import save_sam2_video
455
+ cfg_dict = {
456
+ "CONFIG_PATH": 'models/VACE-Annotators/sam2/configs/sam2.1/sam2.1_hiera_l.yaml',
457
+ "PRETRAINED_MODEL": 'models/VACE-Annotators/sam2/sam2.1_hiera_large.pt'
458
+ }
459
+ anno_ins = SAM2VideoAnnotator(cfg_dict)
460
+ ret_data = anno_ins.forward(video=self.video_path, input_box=[0, 0, 640, 480])
461
+ video_segments = ret_data['annotations']
462
+ save_path = os.path.join(self.save_dir, 'test_sam2_video')
463
+ if not os.path.exists(save_path):
464
+ os.makedirs(save_path)
465
+ save_sam2_video(video_path=self.video_path, video_segments=video_segments, output_video_path=save_path)
466
+ print(save_path)
467
+
468
+
469
+ @unittest.skip('')
470
+ def test_annotator_sam2salient(self):
471
+ from vace.annotators.sam2 import SAM2SalientVideoAnnotator
472
+ from vace.annotators.utils import save_sam2_video
473
+ cfg_dict = {
474
+ "SALIENT": {
475
+ "PRETRAINED_MODEL": "models/VACE-Annotators/salient/u2net.pt",
476
+ },
477
+ "SAM2": {
478
+ "CONFIG_PATH": 'models/VACE-Annotators/sam2/configs/sam2.1/sam2.1_hiera_l.yaml',
479
+ "PRETRAINED_MODEL": 'models/VACE-Annotators/sam2/sam2.1_hiera_large.pt'
480
+ }
481
+
482
+ }
483
+ anno_ins = SAM2SalientVideoAnnotator(cfg_dict)
484
+ ret_data = anno_ins.forward(video=self.video_path)
485
+ video_segments = ret_data['annotations']
486
+ save_path = os.path.join(self.save_dir, 'test_sam2salient_video')
487
+ if not os.path.exists(save_path):
488
+ os.makedirs(save_path)
489
+ save_sam2_video(video_path=self.video_path, video_segments=video_segments, output_video_path=save_path)
490
+ print(save_path)
491
+
492
+
493
+ @unittest.skip('')
494
+ def test_annotator_sam2gdinoram_video(self):
495
+ from vace.annotators.sam2 import SAM2GDINOVideoAnnotator
496
+ from vace.annotators.utils import save_sam2_video
497
+ cfg_dict = {
498
+ "GDINO": {
499
+ "TOKENIZER_PATH": "models/VACE-Annotators/gdino/bert-base-uncased",
500
+ "CONFIG_PATH": "models/VACE-Annotators/gdino/GroundingDINO_SwinT_OGC_mod.py",
501
+ "PRETRAINED_MODEL": "models/VACE-Annotators/gdino/groundingdino_swint_ogc.pth",
502
+ },
503
+ "SAM2": {
504
+ "CONFIG_PATH": 'models/VACE-Annotators/sam2/configs/sam2.1/sam2.1_hiera_l.yaml',
505
+ "PRETRAINED_MODEL": 'models/VACE-Annotators/sam2/sam2.1_hiera_large.pt'
506
+ }
507
+ }
508
+ anno_ins = SAM2GDINOVideoAnnotator(cfg_dict)
509
+ ret_data = anno_ins.forward(video=self.video_path, classes='cat')
510
+ video_segments = ret_data['annotations']
511
+ save_path = os.path.join(self.save_dir, 'test_sam2gdino_video')
512
+ if not os.path.exists(save_path):
513
+ os.makedirs(save_path)
514
+ save_sam2_video(video_path=self.video_path, video_segments=video_segments, output_video_path=save_path)
515
+ print(save_path)
516
+
517
+ @unittest.skip('')
518
+ def test_annotator_sam2_image(self):
519
+ from vace.annotators.sam2 import SAM2ImageAnnotator
520
+ cfg_dict = {
521
+ "CONFIG_PATH": 'models/VACE-Annotators/sam2/configs/sam2.1/sam2.1_hiera_l.yaml',
522
+ "PRETRAINED_MODEL": 'models/VACE-Annotators/sam2/sam2.1_hiera_large.pt'
523
+ }
524
+ anno_ins = SAM2ImageAnnotator(cfg_dict)
525
+ ret_data = anno_ins.forward(image=self.image, input_box=[0, 0, 640, 480])
526
+ print(ret_data)
527
+
528
+ # @unittest.skip('')
529
+ def test_annotator_prompt_extend(self):
530
+ from vace.annotators.prompt_extend import PromptExtendAnnotator
531
+ from vace.configs.prompt_preprocess import WAN_LM_ZH_SYS_PROMPT, WAN_LM_EN_SYS_PROMPT, LTX_LM_EN_SYS_PROMPT
532
+ cfg_dict = {
533
+ "MODEL_NAME": "models/VACE-Annotators/llm/Qwen2.5-3B-Instruct" # "Qwen2.5_3B"
534
+ }
535
+ anno_ins = PromptExtendAnnotator(cfg_dict)
536
+ ret_data = anno_ins.forward('一位男孩', system_prompt=WAN_LM_ZH_SYS_PROMPT)
537
+ print('wan_zh:', ret_data)
538
+ ret_data = anno_ins.forward('a boy', system_prompt=WAN_LM_EN_SYS_PROMPT)
539
+ print('wan_en:', ret_data)
540
+ ret_data = anno_ins.forward('a boy', system_prompt=WAN_LM_ZH_SYS_PROMPT)
541
+ print('wan_zh en:', ret_data)
542
+ ret_data = anno_ins.forward('a boy', system_prompt=LTX_LM_EN_SYS_PROMPT)
543
+ print('ltx_en:', ret_data)
544
+
545
+ from vace.annotators.utils import get_annotator
546
+ anno_ins = get_annotator(config_type='prompt', config_task='ltx_en', return_dict=False)
547
+ ret_data = anno_ins.forward('a boy', seed=2025)
548
+ print('ltx_en:', ret_data)
549
+ ret_data = anno_ins.forward('a boy')
550
+ print('ltx_en:', ret_data)
551
+ ret_data = anno_ins.forward('a boy', seed=2025)
552
+ print('ltx_en:', ret_data)
553
+
554
+ @unittest.skip('')
555
+ def test_annotator_prompt_extend_ds(self):
556
+ from vace.annotators.utils import get_annotator
557
+ # export DASH_API_KEY=''
558
+ anno_ins = get_annotator(config_type='prompt', config_task='wan_zh_ds', return_dict=False)
559
+ ret_data = anno_ins.forward('一位男孩', seed=2025)
560
+ print('wan_zh_ds:', ret_data)
561
+ ret_data = anno_ins.forward('a boy', seed=2025)
562
+ print('wan_zh_ds:', ret_data)
563
+
564
+
565
+ # ln -s your/path/annotator_models annotator_models
566
+ # PYTHONPATH=. python tests/test_annotators.py
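+ # To run a single case (hedged example; assumes Python >= 3.7 and that `tests` is importable as a package):
+ # PYTHONPATH=. python -m unittest tests.test_annotators -k test_annotator_prompt_extend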
567
+ if __name__ == '__main__':
568
+ unittest.main()
vace/__init__.py ADDED
@@ -0,0 +1,6 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ from . import annotators
4
+ from . import configs
5
+ from . import models
6
+ from . import gradios
vace/annotators/__init__.py ADDED
@@ -0,0 +1,24 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ from .depth import DepthAnnotator, DepthVideoAnnotator
4
+ from .flow import FlowAnnotator, FlowVisAnnotator
5
+ from .frameref import FrameRefExtractAnnotator, FrameRefExpandAnnotator
6
+ from .gdino import GDINOAnnotator, GDINORAMAnnotator
7
+ from .gray import GrayAnnotator, GrayVideoAnnotator
8
+ from .inpainting import InpaintingAnnotator, InpaintingVideoAnnotator
9
+ from .layout import LayoutBboxAnnotator, LayoutMaskAnnotator, LayoutTrackAnnotator
10
+ from .maskaug import MaskAugAnnotator
11
+ from .outpainting import OutpaintingAnnotator, OutpaintingInnerAnnotator, OutpaintingVideoAnnotator, OutpaintingInnerVideoAnnotator
12
+ from .pose import PoseBodyFaceAnnotator, PoseBodyFaceVideoAnnotator, PoseAnnotator
13
+ from .ram import RAMAnnotator
14
+ from .salient import SalientAnnotator, SalientVideoAnnotator
15
+ from .sam import SAMImageAnnotator
16
+ from .sam2 import SAM2ImageAnnotator, SAM2VideoAnnotator, SAM2SalientVideoAnnotator, SAM2GDINOVideoAnnotator
17
+ from .scribble import ScribbleAnnotator, ScribbleVideoAnnotator
18
+ from .face import FaceAnnotator
19
+ from .subject import SubjectAnnotator
20
+ from .common import PlainImageAnnotator, PlainMaskAnnotator, PlainMaskAugAnnotator, PlainMaskVideoAnnotator, PlainVideoAnnotator, PlainMaskAugVideoAnnotator, PlainMaskAugInvertAnnotator, PlainMaskAugInvertVideoAnnotator, ExpandMaskVideoAnnotator
21
+ from .prompt_extend import PromptExtendAnnotator
22
+ from .composition import CompositionAnnotator, ReferenceAnythingAnnotator, AnimateAnythingAnnotator, SwapAnythingAnnotator, ExpandAnythingAnnotator, MoveAnythingAnnotator
23
+ from .mask import MaskDrawAnnotator
24
+ from .canvas import RegionCanvasAnnotator
vace/annotators/canvas.py ADDED
@@ -0,0 +1,60 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ import random
4
+
5
+ import cv2
6
+ import numpy as np
7
+
8
+ from .utils import convert_to_numpy
9
+
10
+
11
+ class RegionCanvasAnnotator:
12
+ def __init__(self, cfg, device=None):
13
+ self.scale_range = cfg.get('SCALE_RANGE', [0.75, 1.0])
14
+ self.canvas_value = cfg.get('CANVAS_VALUE', 255)
15
+ self.use_resize = cfg.get('USE_RESIZE', True)
16
+ self.use_canvas = cfg.get('USE_CANVAS', True)
17
+ self.use_aug = cfg.get('USE_AUG', False)
18
+ if self.use_aug:
19
+ from .maskaug import MaskAugAnnotator
20
+ self.maskaug_anno = MaskAugAnnotator(cfg={})
21
+
22
+ def forward(self, image, mask, mask_cfg=None):
23
+
24
+ image = convert_to_numpy(image)
25
+ mask = convert_to_numpy(mask)
26
+ image_h, image_w = image.shape[:2]
27
+
28
+ if self.use_aug:
29
+ mask = self.maskaug_anno.forward(mask, mask_cfg)
30
+
31
+ # get region with white bg
32
+ image[np.array(mask) == 0] = self.canvas_value
33
+ x, y, w, h = cv2.boundingRect(mask)
34
+ region_crop = image[y:y + h, x:x + w]
35
+
36
+ if self.use_resize:
37
+ # resize region
38
+ scale_min, scale_max = self.scale_range
39
+ scale_factor = random.uniform(scale_min, scale_max)
40
+ new_w, new_h = int(image_w * scale_factor), int(image_h * scale_factor)
41
+ obj_scale_factor = min(new_w/w, new_h/h)
42
+
43
+ new_w = int(w * obj_scale_factor)
44
+ new_h = int(h * obj_scale_factor)
45
+ region_crop_resized = cv2.resize(region_crop, (new_w, new_h), interpolation=cv2.INTER_AREA)
46
+ else:
47
+ region_crop_resized = region_crop
48
+
49
+ if self.use_canvas:
50
+ # plot region into canvas
51
+ new_canvas = np.ones_like(image) * self.canvas_value
52
+ max_x = max(0, image_w - new_w)
53
+ max_y = max(0, image_h - new_h)
54
+ new_x = random.randint(0, max_x)
55
+ new_y = random.randint(0, max_y)
56
+
57
+ new_canvas[new_y:new_y + new_h, new_x:new_x + new_w] = region_crop_resized
58
+ else:
59
+ new_canvas = region_crop_resized
60
+ return new_canvas
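+
+ # Usage sketch (illustrative only; `image` is an HxWx3 uint8 array, `mask` an HxW uint8 array with 0/255 values):
+ # anno = RegionCanvasAnnotator({'SCALE_RANGE': [0.75, 1.0], 'CANVAS_VALUE': 255, 'USE_RESIZE': True, 'USE_CANVAS': True})
+ # canvas = anno.forward(image, mask)  # masked region re-pasted at a random position on a plain canvas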
vace/annotators/common.py ADDED
@@ -0,0 +1,62 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+
4
+ class PlainImageAnnotator:
5
+ def __init__(self, cfg):
6
+ pass
7
+ def forward(self, image):
8
+ return image
9
+
10
+ class PlainVideoAnnotator:
11
+ def __init__(self, cfg):
12
+ pass
13
+ def forward(self, frames):
14
+ return frames
15
+
16
+ class PlainMaskAnnotator:
17
+ def __init__(self, cfg):
18
+ pass
19
+ def forward(self, mask):
20
+ return mask
21
+
22
+ class PlainMaskAugInvertAnnotator:
23
+ def __init__(self, cfg):
24
+ pass
25
+ def forward(self, mask):
26
+ return 255 - mask
27
+
28
+ class PlainMaskAugAnnotator:
29
+ def __init__(self, cfg):
30
+ pass
31
+ def forward(self, mask):
32
+ return mask
33
+
34
+ class PlainMaskVideoAnnotator:
35
+ def __init__(self, cfg):
36
+ pass
37
+ def forward(self, mask):
38
+ return mask
39
+
40
+ class PlainMaskAugVideoAnnotator:
41
+ def __init__(self, cfg):
42
+ pass
43
+ def forward(self, masks):
44
+ return masks
45
+
46
+ class PlainMaskAugInvertVideoAnnotator:
47
+ def __init__(self, cfg):
48
+ pass
49
+ def forward(self, masks):
50
+ return [255 - mask for mask in masks]
51
+
52
+ class ExpandMaskVideoAnnotator:
53
+ def __init__(self, cfg):
54
+ pass
55
+ def forward(self, mask, expand_num):
56
+ return [mask] * expand_num
57
+
58
+ class PlainPromptAnnotator:
59
+ def __init__(self, cfg):
60
+ pass
61
+ def forward(self, prompt):
62
+ return prompt
vace/annotators/composition.py ADDED
@@ -0,0 +1,155 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ import numpy as np
4
+
5
+ class CompositionAnnotator:
6
+ def __init__(self, cfg):
7
+ self.process_types = ["repaint", "extension", "control"]
8
+ self.process_map = {
9
+ "repaint": "repaint",
10
+ "extension": "extension",
11
+ "control": "control",
12
+ "inpainting": "repaint",
13
+ "outpainting": "repaint",
14
+ "frameref": "extension",
15
+ "clipref": "extension",
16
+ "depth": "control",
17
+ "flow": "control",
18
+ "gray": "control",
19
+ "pose": "control",
20
+ "scribble": "control",
21
+ "layout": "control"
22
+ }
23
+
24
+ def forward(self, process_type_1, process_type_2, frames_1, frames_2, masks_1, masks_2):
25
+ total_frames = min(len(frames_1), len(frames_2), len(masks_1), len(masks_2))
26
+ combine_type = (self.process_map[process_type_1], self.process_map[process_type_2])
27
+ if combine_type in [("extension", "repaint"), ("extension", "control"), ("extension", "extension")]:
28
+ output_video = [frames_2[i] * masks_1[i] + frames_1[i] * (1 - masks_1[i]) for i in range(total_frames)]
29
+ output_mask = [masks_1[i] * masks_2[i] * 255 for i in range(total_frames)]
30
+ elif combine_type in [("repaint", "extension"), ("control", "extension"), ("repaint", "repaint")]:
31
+ output_video = [frames_1[i] * (1 - masks_2[i]) + frames_2[i] * masks_2[i] for i in range(total_frames)]
32
+ output_mask = [(masks_1[i] * (1 - masks_2[i]) + masks_2[i] * masks_2[i]) * 255 for i in range(total_frames)]
33
+ elif combine_type in [("repaint", "control"), ("control", "repaint")]:
34
+ if combine_type in [("control", "repaint")]:
35
+ frames_1, frames_2, masks_1, masks_2 = frames_2, frames_1, masks_2, masks_1
36
+ output_video = [frames_1[i] * (1 - masks_1[i]) + frames_2[i] * masks_1[i] for i in range(total_frames)]
37
+ output_mask = [masks_1[i] * 255 for i in range(total_frames)]
38
+ elif combine_type in [("control", "control")]: # apply masks_2
39
+ output_video = [frames_1[i] * (1 - masks_2[i]) + frames_2[i] * masks_2[i] for i in range(total_frames)]
40
+ output_mask = [(masks_1[i] * (1 - masks_2[i]) + masks_2[i] * masks_2[i]) * 255 for i in range(total_frames)]
41
+ else:
42
+ raise Exception("Unknown combine type")
43
+ return output_video, output_mask
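+
+ # Usage sketch (illustrative; frames_*/masks_* are equal-length lists of float arrays, with mask values in {0, 1}):
+ # anno = CompositionAnnotator(cfg={})
+ # out_video, out_mask = anno.forward('depth', 'inpainting', frames_1, frames_2, masks_1, masks_2)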
44
+
45
+
46
+ class ReferenceAnythingAnnotator:
47
+ def __init__(self, cfg):
48
+ from .subject import SubjectAnnotator
49
+ self.sbjref_ins = SubjectAnnotator(cfg['SUBJECT'] if 'SUBJECT' in cfg else cfg)
50
+ self.key_map = {
51
+ "image": "images",
52
+ "mask": "masks"
53
+ }
54
+ def forward(self, images, mode=None, return_mask=None, mask_cfg=None):
55
+ ret_data = {}
56
+ for image in images:
57
+ ret_one_data = self.sbjref_ins.forward(image=image, mode=mode, return_mask=return_mask, mask_cfg=mask_cfg)
58
+ if isinstance(ret_one_data, dict):
59
+ for key, val in ret_one_data.items():
60
+ if key in self.key_map:
61
+ new_key = self.key_map[key]
62
+ else:
63
+ continue
64
+ if new_key in ret_data:
65
+ ret_data[new_key].append(val)
66
+ else:
67
+ ret_data[new_key] = [val]
68
+ else:
69
+ if 'images' in ret_data:
70
+ ret_data['images'].append(ret_one_data)
71
+ else:
72
+ ret_data['images'] = [ret_one_data]
73
+ return ret_data
74
+
75
+
76
+ class AnimateAnythingAnnotator:
77
+ def __init__(self, cfg):
78
+ from .pose import PoseBodyFaceVideoAnnotator
79
+ self.pose_ins = PoseBodyFaceVideoAnnotator(cfg['POSE'])
80
+ self.ref_ins = ReferenceAnythingAnnotator(cfg['REFERENCE'])
81
+
82
+ def forward(self, frames=None, images=None, mode=None, return_mask=None, mask_cfg=None):
83
+ ret_data = {}
84
+ ret_pose_data = self.pose_ins.forward(frames=frames)
85
+ ret_data.update({"frames": ret_pose_data})
86
+
87
+ ret_ref_data = self.ref_ins.forward(images=images, mode=mode, return_mask=return_mask, mask_cfg=mask_cfg)
88
+ ret_data.update({"images": ret_ref_data['images']})
89
+
90
+ return ret_data
91
+
92
+
93
+ class SwapAnythingAnnotator:
94
+ def __init__(self, cfg):
95
+ from .inpainting import InpaintingVideoAnnotator
96
+ self.inp_ins = InpaintingVideoAnnotator(cfg['INPAINTING'])
97
+ self.ref_ins = ReferenceAnythingAnnotator(cfg['REFERENCE'])
98
+
99
+ def forward(self, video=None, frames=None, images=None, mode=None, mask=None, bbox=None, label=None, caption=None, return_mask=None, mask_cfg=None):
100
+ ret_data = {}
101
+ mode = mode.split(',') if ',' in mode else [mode, mode]
102
+
103
+ ret_inp_data = self.inp_ins.forward(video=video, frames=frames, mode=mode[0], mask=mask, bbox=bbox, label=label, caption=caption, mask_cfg=mask_cfg)
104
+ ret_data.update(ret_inp_data)
105
+
106
+ ret_ref_data = self.ref_ins.forward(images=images, mode=mode[1], return_mask=return_mask, mask_cfg=mask_cfg)
107
+ ret_data.update({"images": ret_ref_data['images']})
108
+
109
+ return ret_data
110
+
111
+
112
+ class ExpandAnythingAnnotator:
113
+ def __init__(self, cfg):
114
+ from .outpainting import OutpaintingAnnotator
115
+ from .frameref import FrameRefExpandAnnotator
116
+ self.ref_ins = ReferenceAnythingAnnotator(cfg['REFERENCE'])
117
+ self.frameref_ins = FrameRefExpandAnnotator(cfg['FRAMEREF'])
118
+ self.outpainting_ins = OutpaintingAnnotator(cfg['OUTPAINTING'])
119
+
120
+ def forward(self, images=None, mode=None, return_mask=None, mask_cfg=None, direction=None, expand_ratio=None, expand_num=None):
121
+ ret_data = {}
122
+ expand_image, reference_image = images[0], images[1:]
123
+ mode = mode.split(',') if ',' in mode else ['firstframe', mode]
124
+
125
+ outpainting_data = self.outpainting_ins.forward(expand_image, expand_ratio=expand_ratio, direction=direction)
126
+ outpainting_image, outpainting_mask = outpainting_data['image'], outpainting_data['mask']
127
+
128
+ frameref_data = self.frameref_ins.forward(outpainting_image, mode=mode[0], expand_num=expand_num)
129
+ frames, masks = frameref_data['frames'], frameref_data['masks']
130
+ masks[0] = outpainting_mask
131
+ ret_data.update({"frames": frames, "masks": masks})
132
+
133
+ ret_ref_data = self.ref_ins.forward(images=reference_image, mode=mode[1], return_mask=return_mask, mask_cfg=mask_cfg)
134
+ ret_data.update({"images": ret_ref_data['images']})
135
+
136
+ return ret_data
137
+
138
+
139
+ class MoveAnythingAnnotator:
140
+ def __init__(self, cfg):
141
+ from .layout import LayoutBboxAnnotator
142
+ self.layout_bbox_ins = LayoutBboxAnnotator(cfg['LAYOUTBBOX'])
143
+
144
+ def forward(self, image=None, bbox=None, label=None, expand_num=None):
145
+ frame_size = image.shape[:2] # [H, W]
146
+ ret_layout_data = self.layout_bbox_ins.forward(bbox, frame_size=frame_size, num_frames=expand_num, label=label)
147
+
148
+ out_frames = [image] + ret_layout_data
149
+ out_mask = [np.zeros(frame_size, dtype=np.uint8)] + [np.ones(frame_size, dtype=np.uint8) * 255] * len(ret_layout_data)
150
+
151
+ ret_data = {
152
+ "frames": out_frames,
153
+ "masks": out_mask
154
+ }
155
+ return ret_data
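+
+ # Usage sketch (illustrative; the path and boxes are assumptions, bbox gives the start and end positions of the object):
+ # anno = MoveAnythingAnnotator({'LAYOUTBBOX': {'RAM_TAG_COLOR_PATH': 'models/VACE-Annotators/layout/ram_tag_color_list.txt'}})
+ # out = anno.forward(image=first_frame, bbox=[[120, 80, 360, 400], [600, 80, 840, 400]], label='person', expand_num=80)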
vace/annotators/depth.py ADDED
@@ -0,0 +1,51 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+
4
+ import numpy as np
5
+ import torch
6
+ from einops import rearrange
7
+
8
+ from .utils import convert_to_numpy, resize_image, resize_image_ori
9
+
10
+
11
+ class DepthAnnotator:
12
+ def __init__(self, cfg, device=None):
13
+ from .midas.api import MiDaSInference
14
+ pretrained_model = cfg['PRETRAINED_MODEL']
15
+ self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") if device is None else device
16
+ self.model = MiDaSInference(model_type='dpt_hybrid', model_path=pretrained_model).to(self.device)
17
+ self.a = cfg.get('A', np.pi * 2.0)
18
+ self.bg_th = cfg.get('BG_TH', 0.1)
19
+
20
+ @torch.no_grad()
21
+ @torch.inference_mode()
22
+ @torch.autocast('cuda', enabled=False)
23
+ def forward(self, image):
24
+ image = convert_to_numpy(image)
25
+ image_depth = image
26
+ h, w, c = image.shape
27
+ image_depth, k = resize_image(image_depth,
28
+ 1024 if min(h, w) > 1024 else min(h, w))
29
+ image_depth = torch.from_numpy(image_depth).float().to(self.device)
30
+ image_depth = image_depth / 127.5 - 1.0
31
+ image_depth = rearrange(image_depth, 'h w c -> 1 c h w')
32
+ depth = self.model(image_depth)[0]
33
+
34
+ depth_pt = depth.clone()
35
+ depth_pt -= torch.min(depth_pt)
36
+ depth_pt /= torch.max(depth_pt)
37
+ depth_pt = depth_pt.cpu().numpy()
38
+ depth_image = (depth_pt * 255.0).clip(0, 255).astype(np.uint8)
39
+ depth_image = depth_image[..., None].repeat(3, 2)
40
+
41
+ depth_image = resize_image_ori(h, w, depth_image, k)
42
+ return depth_image
43
+
44
+
45
+ class DepthVideoAnnotator(DepthAnnotator):
46
+ def forward(self, frames):
47
+ ret_frames = []
48
+ for frame in frames:
49
+ anno_frame = super().forward(np.array(frame))
50
+ ret_frames.append(anno_frame)
51
+ return ret_frames
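+
+ # Usage sketch (the checkpoint path is an assumption; PRETRAINED_MODEL should point at a MiDaS dpt_hybrid checkpoint):
+ # anno = DepthAnnotator({'PRETRAINED_MODEL': 'models/VACE-Annotators/depth/dpt_hybrid-midas-501f0c75.pt'})
+ # depth_vis = anno.forward(image)  # HxWx3 uint8 depth visualization at the input resolution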
vace/annotators/dwpose/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
vace/annotators/dwpose/onnxdet.py ADDED
@@ -0,0 +1,127 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ import cv2
4
+ import numpy as np
5
+
6
+ import onnxruntime
7
+
8
+ def nms(boxes, scores, nms_thr):
9
+ """Single class NMS implemented in Numpy."""
10
+ x1 = boxes[:, 0]
11
+ y1 = boxes[:, 1]
12
+ x2 = boxes[:, 2]
13
+ y2 = boxes[:, 3]
14
+
15
+ areas = (x2 - x1 + 1) * (y2 - y1 + 1)
16
+ order = scores.argsort()[::-1]
17
+
18
+ keep = []
19
+ while order.size > 0:
20
+ i = order[0]
21
+ keep.append(i)
22
+ xx1 = np.maximum(x1[i], x1[order[1:]])
23
+ yy1 = np.maximum(y1[i], y1[order[1:]])
24
+ xx2 = np.minimum(x2[i], x2[order[1:]])
25
+ yy2 = np.minimum(y2[i], y2[order[1:]])
26
+
27
+ w = np.maximum(0.0, xx2 - xx1 + 1)
28
+ h = np.maximum(0.0, yy2 - yy1 + 1)
29
+ inter = w * h
30
+ ovr = inter / (areas[i] + areas[order[1:]] - inter)
31
+
32
+ inds = np.where(ovr <= nms_thr)[0]
33
+ order = order[inds + 1]
34
+
35
+ return keep
36
+
37
+ def multiclass_nms(boxes, scores, nms_thr, score_thr):
38
+ """Multiclass NMS implemented in Numpy. Class-aware version."""
39
+ final_dets = []
40
+ num_classes = scores.shape[1]
41
+ for cls_ind in range(num_classes):
42
+ cls_scores = scores[:, cls_ind]
43
+ valid_score_mask = cls_scores > score_thr
44
+ if valid_score_mask.sum() == 0:
45
+ continue
46
+ else:
47
+ valid_scores = cls_scores[valid_score_mask]
48
+ valid_boxes = boxes[valid_score_mask]
49
+ keep = nms(valid_boxes, valid_scores, nms_thr)
50
+ if len(keep) > 0:
51
+ cls_inds = np.ones((len(keep), 1)) * cls_ind
52
+ dets = np.concatenate(
53
+ [valid_boxes[keep], valid_scores[keep, None], cls_inds], 1
54
+ )
55
+ final_dets.append(dets)
56
+ if len(final_dets) == 0:
57
+ return None
58
+ return np.concatenate(final_dets, 0)
59
+
60
+ def demo_postprocess(outputs, img_size, p6=False):
61
+ grids = []
62
+ expanded_strides = []
63
+ strides = [8, 16, 32] if not p6 else [8, 16, 32, 64]
64
+
65
+ hsizes = [img_size[0] // stride for stride in strides]
66
+ wsizes = [img_size[1] // stride for stride in strides]
67
+
68
+ for hsize, wsize, stride in zip(hsizes, wsizes, strides):
69
+ xv, yv = np.meshgrid(np.arange(wsize), np.arange(hsize))
70
+ grid = np.stack((xv, yv), 2).reshape(1, -1, 2)
71
+ grids.append(grid)
72
+ shape = grid.shape[:2]
73
+ expanded_strides.append(np.full((*shape, 1), stride))
74
+
75
+ grids = np.concatenate(grids, 1)
76
+ expanded_strides = np.concatenate(expanded_strides, 1)
77
+ outputs[..., :2] = (outputs[..., :2] + grids) * expanded_strides
78
+ outputs[..., 2:4] = np.exp(outputs[..., 2:4]) * expanded_strides
79
+
80
+ return outputs
81
+
82
+ def preprocess(img, input_size, swap=(2, 0, 1)):
83
+ if len(img.shape) == 3:
84
+ padded_img = np.ones((input_size[0], input_size[1], 3), dtype=np.uint8) * 114
85
+ else:
86
+ padded_img = np.ones(input_size, dtype=np.uint8) * 114
87
+
88
+ r = min(input_size[0] / img.shape[0], input_size[1] / img.shape[1])
89
+ resized_img = cv2.resize(
90
+ img,
91
+ (int(img.shape[1] * r), int(img.shape[0] * r)),
92
+ interpolation=cv2.INTER_LINEAR,
93
+ ).astype(np.uint8)
94
+ padded_img[: int(img.shape[0] * r), : int(img.shape[1] * r)] = resized_img
95
+
96
+ padded_img = padded_img.transpose(swap)
97
+ padded_img = np.ascontiguousarray(padded_img, dtype=np.float32)
98
+ return padded_img, r
99
+
100
+ def inference_detector(session, oriImg):
101
+ input_shape = (640,640)
102
+ img, ratio = preprocess(oriImg, input_shape)
103
+
104
+ ort_inputs = {session.get_inputs()[0].name: img[None, :, :, :]}
105
+ output = session.run(None, ort_inputs)
106
+ predictions = demo_postprocess(output[0], input_shape)[0]
107
+
108
+ boxes = predictions[:, :4]
109
+ scores = predictions[:, 4:5] * predictions[:, 5:]
110
+
111
+ boxes_xyxy = np.ones_like(boxes)
112
+ boxes_xyxy[:, 0] = boxes[:, 0] - boxes[:, 2]/2.
113
+ boxes_xyxy[:, 1] = boxes[:, 1] - boxes[:, 3]/2.
114
+ boxes_xyxy[:, 2] = boxes[:, 0] + boxes[:, 2]/2.
115
+ boxes_xyxy[:, 3] = boxes[:, 1] + boxes[:, 3]/2.
116
+ boxes_xyxy /= ratio
117
+ dets = multiclass_nms(boxes_xyxy, scores, nms_thr=0.45, score_thr=0.1)
118
+ if dets is not None:
119
+ final_boxes, final_scores, final_cls_inds = dets[:, :4], dets[:, 4], dets[:, 5]
120
+ isscore = final_scores>0.3
121
+ iscat = final_cls_inds == 0
122
+ isbbox = [ i and j for (i, j) in zip(isscore, iscat)]
123
+ final_boxes = final_boxes[isbbox]
124
+ else:
125
+ final_boxes = np.array([])
126
+
127
+ return final_boxes
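+
+ # Usage sketch (the ONNX path is an assumption; any YOLOX detection export with this output layout should work):
+ # session = onnxruntime.InferenceSession('models/VACE-Annotators/pose/yolox_l.onnx', providers=['CPUExecutionProvider'])
+ # person_boxes = inference_detector(session, bgr_image)  # (N, 4) xyxy boxes, class 0 (person) only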
vace/annotators/dwpose/onnxpose.py ADDED
@@ -0,0 +1,362 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ from typing import List, Tuple
4
+
5
+ import cv2
6
+ import numpy as np
7
+ import onnxruntime as ort
8
+
9
+ def preprocess(
10
+ img: np.ndarray, out_bbox, input_size: Tuple[int, int] = (192, 256)
11
+ ) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
12
+ """Do preprocessing for RTMPose model inference.
13
+
14
+ Args:
15
+ img (np.ndarray): Input image in shape.
16
+ input_size (tuple): Input image size in shape (w, h).
17
+
18
+ Returns:
19
+ tuple:
20
+ - resized_img (np.ndarray): Preprocessed image.
21
+ - center (np.ndarray): Center of image.
22
+ - scale (np.ndarray): Scale of image.
23
+ """
24
+ # get shape of image
25
+ img_shape = img.shape[:2]
26
+ out_img, out_center, out_scale = [], [], []
27
+ if len(out_bbox) == 0:
28
+ out_bbox = [[0, 0, img_shape[1], img_shape[0]]]
29
+ for i in range(len(out_bbox)):
30
+ x0 = out_bbox[i][0]
31
+ y0 = out_bbox[i][1]
32
+ x1 = out_bbox[i][2]
33
+ y1 = out_bbox[i][3]
34
+ bbox = np.array([x0, y0, x1, y1])
35
+
36
+ # get center and scale
37
+ center, scale = bbox_xyxy2cs(bbox, padding=1.25)
38
+
39
+ # do affine transformation
40
+ resized_img, scale = top_down_affine(input_size, scale, center, img)
41
+
42
+ # normalize image
43
+ mean = np.array([123.675, 116.28, 103.53])
44
+ std = np.array([58.395, 57.12, 57.375])
45
+ resized_img = (resized_img - mean) / std
46
+
47
+ out_img.append(resized_img)
48
+ out_center.append(center)
49
+ out_scale.append(scale)
50
+
51
+ return out_img, out_center, out_scale
52
+
53
+
54
+ def inference(sess: ort.InferenceSession, img: np.ndarray) -> np.ndarray:
55
+ """Inference RTMPose model.
56
+
57
+ Args:
58
+ sess (ort.InferenceSession): ONNXRuntime session.
59
+ img (np.ndarray): Input image in shape.
60
+
61
+ Returns:
62
+ outputs (np.ndarray): Output of RTMPose model.
63
+ """
64
+ all_out = []
65
+ # build input
66
+ for i in range(len(img)):
67
+ input = [img[i].transpose(2, 0, 1)]
68
+
69
+ # build output
70
+ sess_input = {sess.get_inputs()[0].name: input}
71
+ sess_output = []
72
+ for out in sess.get_outputs():
73
+ sess_output.append(out.name)
74
+
75
+ # run model
76
+ outputs = sess.run(sess_output, sess_input)
77
+ all_out.append(outputs)
78
+
79
+ return all_out
80
+
81
+
82
+ def postprocess(outputs: List[np.ndarray],
83
+ model_input_size: Tuple[int, int],
84
+ center: Tuple[int, int],
85
+ scale: Tuple[int, int],
86
+ simcc_split_ratio: float = 2.0
87
+ ) -> Tuple[np.ndarray, np.ndarray]:
88
+ """Postprocess for RTMPose model output.
89
+
90
+ Args:
91
+ outputs (np.ndarray): Output of RTMPose model.
92
+ model_input_size (tuple): RTMPose model Input image size.
93
+ center (tuple): Center of bbox in shape (x, y).
94
+ scale (tuple): Scale of bbox in shape (w, h).
95
+ simcc_split_ratio (float): Split ratio of simcc.
96
+
97
+ Returns:
98
+ tuple:
99
+ - keypoints (np.ndarray): Rescaled keypoints.
100
+ - scores (np.ndarray): Model predict scores.
101
+ """
102
+ all_key = []
103
+ all_score = []
104
+ for i in range(len(outputs)):
105
+ # use simcc to decode
106
+ simcc_x, simcc_y = outputs[i]
107
+ keypoints, scores = decode(simcc_x, simcc_y, simcc_split_ratio)
108
+
109
+ # rescale keypoints
110
+ keypoints = keypoints / model_input_size * scale[i] + center[i] - scale[i] / 2
111
+ all_key.append(keypoints[0])
112
+ all_score.append(scores[0])
113
+
114
+ return np.array(all_key), np.array(all_score)
115
+
116
+
117
+ def bbox_xyxy2cs(bbox: np.ndarray,
118
+ padding: float = 1.) -> Tuple[np.ndarray, np.ndarray]:
119
+ """Transform the bbox format from (x,y,w,h) into (center, scale)
120
+
121
+ Args:
122
+ bbox (ndarray): Bounding box(es) in shape (4,) or (n, 4), formatted
123
+ as (left, top, right, bottom)
124
+ padding (float): BBox padding factor that will be multiplied to scale.
125
+ Default: 1.0
126
+
127
+ Returns:
128
+ tuple: A tuple containing center and scale.
129
+ - np.ndarray[float32]: Center (x, y) of the bbox in shape (2,) or
130
+ (n, 2)
131
+ - np.ndarray[float32]: Scale (w, h) of the bbox in shape (2,) or
132
+ (n, 2)
133
+ """
134
+ # convert single bbox from (4, ) to (1, 4)
135
+ dim = bbox.ndim
136
+ if dim == 1:
137
+ bbox = bbox[None, :]
138
+
139
+ # get bbox center and scale
140
+ x1, y1, x2, y2 = np.hsplit(bbox, [1, 2, 3])
141
+ center = np.hstack([x1 + x2, y1 + y2]) * 0.5
142
+ scale = np.hstack([x2 - x1, y2 - y1]) * padding
143
+
144
+ if dim == 1:
145
+ center = center[0]
146
+ scale = scale[0]
147
+
148
+ return center, scale
149
+
150
+
151
+ def _fix_aspect_ratio(bbox_scale: np.ndarray,
152
+ aspect_ratio: float) -> np.ndarray:
153
+ """Extend the scale to match the given aspect ratio.
154
+
155
+ Args:
156
+ scale (np.ndarray): The image scale (w, h) in shape (2, )
157
+ aspect_ratio (float): The ratio of ``w/h``
158
+
159
+ Returns:
160
+ np.ndarray: The reshaped image scale in (2, )
161
+ """
162
+ w, h = np.hsplit(bbox_scale, [1])
163
+ bbox_scale = np.where(w > h * aspect_ratio,
164
+ np.hstack([w, w / aspect_ratio]),
165
+ np.hstack([h * aspect_ratio, h]))
166
+ return bbox_scale
167
+
168
+
169
+ def _rotate_point(pt: np.ndarray, angle_rad: float) -> np.ndarray:
170
+ """Rotate a point by an angle.
171
+
172
+ Args:
173
+ pt (np.ndarray): 2D point coordinates (x, y) in shape (2, )
174
+ angle_rad (float): rotation angle in radian
175
+
176
+ Returns:
177
+ np.ndarray: Rotated point in shape (2, )
178
+ """
179
+ sn, cs = np.sin(angle_rad), np.cos(angle_rad)
180
+ rot_mat = np.array([[cs, -sn], [sn, cs]])
181
+ return rot_mat @ pt
182
+
183
+
184
+ def _get_3rd_point(a: np.ndarray, b: np.ndarray) -> np.ndarray:
185
+ """To calculate the affine matrix, three pairs of points are required. This
186
+ function is used to get the 3rd point, given 2D points a & b.
187
+
188
+ The 3rd point is defined by rotating vector `a - b` by 90 degrees
189
+ anticlockwise, using b as the rotation center.
190
+
191
+ Args:
192
+ a (np.ndarray): The 1st point (x,y) in shape (2, )
193
+ b (np.ndarray): The 2nd point (x,y) in shape (2, )
194
+
195
+ Returns:
196
+ np.ndarray: The 3rd point.
197
+ """
198
+ direction = a - b
199
+ c = b + np.r_[-direction[1], direction[0]]
200
+ return c
201
+
202
+
203
+ def get_warp_matrix(center: np.ndarray,
204
+ scale: np.ndarray,
205
+ rot: float,
206
+ output_size: Tuple[int, int],
207
+ shift: Tuple[float, float] = (0., 0.),
208
+ inv: bool = False) -> np.ndarray:
209
+ """Calculate the affine transformation matrix that can warp the bbox area
210
+ in the input image to the output size.
211
+
212
+ Args:
213
+ center (np.ndarray[2, ]): Center of the bounding box (x, y).
214
+ scale (np.ndarray[2, ]): Scale of the bounding box
215
+ wrt [width, height].
216
+ rot (float): Rotation angle (degree).
217
+ output_size (np.ndarray[2, ] | list(2,)): Size of the
218
+ destination heatmaps.
219
+ shift (0-100%): Shift translation ratio wrt the width/height.
220
+ Default (0., 0.).
221
+ inv (bool): Option to inverse the affine transform direction.
222
+ (inv=False: src->dst or inv=True: dst->src)
223
+
224
+ Returns:
225
+ np.ndarray: A 2x3 transformation matrix
226
+ """
227
+ shift = np.array(shift)
228
+ src_w = scale[0]
229
+ dst_w = output_size[0]
230
+ dst_h = output_size[1]
231
+
232
+ # compute transformation matrix
233
+ rot_rad = np.deg2rad(rot)
234
+ src_dir = _rotate_point(np.array([0., src_w * -0.5]), rot_rad)
235
+ dst_dir = np.array([0., dst_w * -0.5])
236
+
237
+ # get four corners of the src rectangle in the original image
238
+ src = np.zeros((3, 2), dtype=np.float32)
239
+ src[0, :] = center + scale * shift
240
+ src[1, :] = center + src_dir + scale * shift
241
+ src[2, :] = _get_3rd_point(src[0, :], src[1, :])
242
+
243
+ # get four corners of the dst rectangle in the input image
244
+ dst = np.zeros((3, 2), dtype=np.float32)
245
+ dst[0, :] = [dst_w * 0.5, dst_h * 0.5]
246
+ dst[1, :] = np.array([dst_w * 0.5, dst_h * 0.5]) + dst_dir
247
+ dst[2, :] = _get_3rd_point(dst[0, :], dst[1, :])
248
+
249
+ if inv:
250
+ warp_mat = cv2.getAffineTransform(np.float32(dst), np.float32(src))
251
+ else:
252
+ warp_mat = cv2.getAffineTransform(np.float32(src), np.float32(dst))
253
+
254
+ return warp_mat
255
+
256
+
257
+ def top_down_affine(input_size: dict, bbox_scale: dict, bbox_center: dict,
258
+ img: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
259
+ """Get the bbox image as the model input by affine transform.
260
+
261
+ Args:
262
+ input_size (dict): The input size of the model.
263
+ bbox_scale (dict): The bbox scale of the img.
264
+ bbox_center (dict): The bbox center of the img.
265
+ img (np.ndarray): The original image.
266
+
267
+ Returns:
268
+ tuple: A tuple containing the transformed image and the bbox scale.
269
+ - np.ndarray[float32]: img after affine transform.
270
+ - np.ndarray[float32]: bbox scale after affine transform.
271
+ """
272
+ w, h = input_size
273
+ warp_size = (int(w), int(h))
274
+
275
+ # reshape bbox to fixed aspect ratio
276
+ bbox_scale = _fix_aspect_ratio(bbox_scale, aspect_ratio=w / h)
277
+
278
+ # get the affine matrix
279
+ center = bbox_center
280
+ scale = bbox_scale
281
+ rot = 0
282
+ warp_mat = get_warp_matrix(center, scale, rot, output_size=(w, h))
283
+
284
+ # do affine transform
285
+ img = cv2.warpAffine(img, warp_mat, warp_size, flags=cv2.INTER_LINEAR)
286
+
287
+ return img, bbox_scale
288
+
289
+
290
+ def get_simcc_maximum(simcc_x: np.ndarray,
291
+ simcc_y: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
292
+ """Get maximum response location and value from simcc representations.
293
+
294
+ Note:
295
+ instance number: N
296
+ num_keypoints: K
297
+ heatmap height: H
298
+ heatmap width: W
299
+
300
+ Args:
301
+ simcc_x (np.ndarray): x-axis SimCC in shape (K, Wx) or (N, K, Wx)
302
+ simcc_y (np.ndarray): y-axis SimCC in shape (K, Wy) or (N, K, Wy)
303
+
304
+ Returns:
305
+ tuple:
306
+ - locs (np.ndarray): locations of maximum heatmap responses in shape
307
+ (K, 2) or (N, K, 2)
308
+ - vals (np.ndarray): values of maximum heatmap responses in shape
309
+ (K,) or (N, K)
310
+ """
311
+ N, K, Wx = simcc_x.shape
312
+ simcc_x = simcc_x.reshape(N * K, -1)
313
+ simcc_y = simcc_y.reshape(N * K, -1)
314
+
315
+ # get maximum value locations
316
+ x_locs = np.argmax(simcc_x, axis=1)
317
+ y_locs = np.argmax(simcc_y, axis=1)
318
+ locs = np.stack((x_locs, y_locs), axis=-1).astype(np.float32)
319
+ max_val_x = np.amax(simcc_x, axis=1)
320
+ max_val_y = np.amax(simcc_y, axis=1)
321
+
322
+ # get maximum value across x and y axis
323
+ mask = max_val_x > max_val_y
324
+ max_val_x[mask] = max_val_y[mask]
325
+ vals = max_val_x
326
+ locs[vals <= 0.] = -1
327
+
328
+ # reshape
329
+ locs = locs.reshape(N, K, 2)
330
+ vals = vals.reshape(N, K)
331
+
332
+ return locs, vals
333
+
334
+
335
+ def decode(simcc_x: np.ndarray, simcc_y: np.ndarray,
336
+ simcc_split_ratio) -> Tuple[np.ndarray, np.ndarray]:
337
+ """Modulate simcc distribution with Gaussian.
338
+
339
+ Args:
340
+ simcc_x (np.ndarray[K, Wx]): model predicted simcc in x.
341
+ simcc_y (np.ndarray[K, Wy]): model predicted simcc in y.
342
+ simcc_split_ratio (int): The split ratio of simcc.
343
+
344
+ Returns:
345
+ tuple: A tuple containing keypoints and scores.
346
+ - np.ndarray[float32]: keypoints in shape (K, 2) or (n, K, 2)
347
+ - np.ndarray[float32]: scores in shape (K,) or (n, K)
348
+ """
349
+ keypoints, scores = get_simcc_maximum(simcc_x, simcc_y)
350
+ keypoints /= simcc_split_ratio
351
+
352
+ return keypoints, scores
353
+
354
+
355
+ def inference_pose(session, out_bbox, oriImg):
356
+ h, w = session.get_inputs()[0].shape[2:]
357
+ model_input_size = (w, h)
358
+ resized_img, center, scale = preprocess(oriImg, out_bbox, model_input_size)
359
+ outputs = inference(session, resized_img)
360
+ keypoints, scores = postprocess(outputs, model_input_size, center, scale)
361
+
362
+ return keypoints, scores
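+
+ # Usage sketch (the path is an assumption; expects an RTMPose/DWPose ONNX export such as dw-ll_ucoco_384.onnx):
+ # pose_sess = ort.InferenceSession('models/VACE-Annotators/pose/dw-ll_ucoco_384.onnx', providers=['CPUExecutionProvider'])
+ # keypoints, scores = inference_pose(pose_sess, person_boxes, bgr_image)  # one whole-body keypoint set per box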
vace/annotators/dwpose/util.py ADDED
@@ -0,0 +1,299 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ import math
4
+ import numpy as np
5
+ import matplotlib
6
+ import cv2
7
+
8
+
9
+ eps = 0.01
10
+
11
+
12
+ def smart_resize(x, s):
13
+ Ht, Wt = s
14
+ if x.ndim == 2:
15
+ Ho, Wo = x.shape
16
+ Co = 1
17
+ else:
18
+ Ho, Wo, Co = x.shape
19
+ if Co == 3 or Co == 1:
20
+ k = float(Ht + Wt) / float(Ho + Wo)
21
+ return cv2.resize(x, (int(Wt), int(Ht)), interpolation=cv2.INTER_AREA if k < 1 else cv2.INTER_LANCZOS4)
22
+ else:
23
+ return np.stack([smart_resize(x[:, :, i], s) for i in range(Co)], axis=2)
24
+
25
+
26
+ def smart_resize_k(x, fx, fy):
27
+ if x.ndim == 2:
28
+ Ho, Wo = x.shape
29
+ Co = 1
30
+ else:
31
+ Ho, Wo, Co = x.shape
32
+ Ht, Wt = Ho * fy, Wo * fx
33
+ if Co == 3 or Co == 1:
34
+ k = float(Ht + Wt) / float(Ho + Wo)
35
+ return cv2.resize(x, (int(Wt), int(Ht)), interpolation=cv2.INTER_AREA if k < 1 else cv2.INTER_LANCZOS4)
36
+ else:
37
+ return np.stack([smart_resize_k(x[:, :, i], fx, fy) for i in range(Co)], axis=2)
38
+
39
+
40
+ def padRightDownCorner(img, stride, padValue):
41
+ h = img.shape[0]
42
+ w = img.shape[1]
43
+
44
+ pad = 4 * [None]
45
+ pad[0] = 0 # up
46
+ pad[1] = 0 # left
47
+ pad[2] = 0 if (h % stride == 0) else stride - (h % stride) # down
48
+ pad[3] = 0 if (w % stride == 0) else stride - (w % stride) # right
49
+
50
+ img_padded = img
51
+ pad_up = np.tile(img_padded[0:1, :, :]*0 + padValue, (pad[0], 1, 1))
52
+ img_padded = np.concatenate((pad_up, img_padded), axis=0)
53
+ pad_left = np.tile(img_padded[:, 0:1, :]*0 + padValue, (1, pad[1], 1))
54
+ img_padded = np.concatenate((pad_left, img_padded), axis=1)
55
+ pad_down = np.tile(img_padded[-2:-1, :, :]*0 + padValue, (pad[2], 1, 1))
56
+ img_padded = np.concatenate((img_padded, pad_down), axis=0)
57
+ pad_right = np.tile(img_padded[:, -2:-1, :]*0 + padValue, (1, pad[3], 1))
58
+ img_padded = np.concatenate((img_padded, pad_right), axis=1)
59
+
60
+ return img_padded, pad
61
+
62
+
63
+ def transfer(model, model_weights):
64
+ transfered_model_weights = {}
65
+ for weights_name in model.state_dict().keys():
66
+ transfered_model_weights[weights_name] = model_weights['.'.join(weights_name.split('.')[1:])]
67
+ return transfered_model_weights
68
+
69
+
70
+ def draw_bodypose(canvas, candidate, subset):
71
+ H, W, C = canvas.shape
72
+ candidate = np.array(candidate)
73
+ subset = np.array(subset)
74
+
75
+ stickwidth = 4
76
+
77
+ limbSeq = [[2, 3], [2, 6], [3, 4], [4, 5], [6, 7], [7, 8], [2, 9], [9, 10], \
78
+ [10, 11], [2, 12], [12, 13], [13, 14], [2, 1], [1, 15], [15, 17], \
79
+ [1, 16], [16, 18], [3, 17], [6, 18]]
80
+
81
+ colors = [[255, 0, 0], [255, 85, 0], [255, 170, 0], [255, 255, 0], [170, 255, 0], [85, 255, 0], [0, 255, 0], \
82
+ [0, 255, 85], [0, 255, 170], [0, 255, 255], [0, 170, 255], [0, 85, 255], [0, 0, 255], [85, 0, 255], \
83
+ [170, 0, 255], [255, 0, 255], [255, 0, 170], [255, 0, 85]]
84
+
85
+ for i in range(17):
86
+ for n in range(len(subset)):
87
+ index = subset[n][np.array(limbSeq[i]) - 1]
88
+ if -1 in index:
89
+ continue
90
+ Y = candidate[index.astype(int), 0] * float(W)
91
+ X = candidate[index.astype(int), 1] * float(H)
92
+ mX = np.mean(X)
93
+ mY = np.mean(Y)
94
+ length = ((X[0] - X[1]) ** 2 + (Y[0] - Y[1]) ** 2) ** 0.5
95
+ angle = math.degrees(math.atan2(X[0] - X[1], Y[0] - Y[1]))
96
+ polygon = cv2.ellipse2Poly((int(mY), int(mX)), (int(length / 2), stickwidth), int(angle), 0, 360, 1)
97
+ cv2.fillConvexPoly(canvas, polygon, colors[i])
98
+
99
+ canvas = (canvas * 0.6).astype(np.uint8)
100
+
101
+ for i in range(18):
102
+ for n in range(len(subset)):
103
+ index = int(subset[n][i])
104
+ if index == -1:
105
+ continue
106
+ x, y = candidate[index][0:2]
107
+ x = int(x * W)
108
+ y = int(y * H)
109
+ cv2.circle(canvas, (int(x), int(y)), 4, colors[i], thickness=-1)
110
+
111
+ return canvas
112
+
113
+
114
+ def draw_handpose(canvas, all_hand_peaks):
115
+ H, W, C = canvas.shape
116
+
117
+ edges = [[0, 1], [1, 2], [2, 3], [3, 4], [0, 5], [5, 6], [6, 7], [7, 8], [0, 9], [9, 10], \
118
+ [10, 11], [11, 12], [0, 13], [13, 14], [14, 15], [15, 16], [0, 17], [17, 18], [18, 19], [19, 20]]
119
+
120
+ for peaks in all_hand_peaks:
121
+ peaks = np.array(peaks)
122
+
123
+ for ie, e in enumerate(edges):
124
+ x1, y1 = peaks[e[0]]
125
+ x2, y2 = peaks[e[1]]
126
+ x1 = int(x1 * W)
127
+ y1 = int(y1 * H)
128
+ x2 = int(x2 * W)
129
+ y2 = int(y2 * H)
130
+ if x1 > eps and y1 > eps and x2 > eps and y2 > eps:
131
+ cv2.line(canvas, (x1, y1), (x2, y2), matplotlib.colors.hsv_to_rgb([ie / float(len(edges)), 1.0, 1.0]) * 255, thickness=2)
132
+
133
+ for i, keypoint in enumerate(peaks):
134
+ x, y = keypoint
135
+ x = int(x * W)
136
+ y = int(y * H)
137
+ if x > eps and y > eps:
138
+ cv2.circle(canvas, (x, y), 4, (0, 0, 255), thickness=-1)
139
+ return canvas
140
+
141
+
142
+ def draw_facepose(canvas, all_lmks):
143
+ H, W, C = canvas.shape
144
+ for lmks in all_lmks:
145
+ lmks = np.array(lmks)
146
+ for lmk in lmks:
147
+ x, y = lmk
148
+ x = int(x * W)
149
+ y = int(y * H)
150
+ if x > eps and y > eps:
151
+ cv2.circle(canvas, (x, y), 3, (255, 255, 255), thickness=-1)
152
+ return canvas
153
+
154
+
155
+ # detect hand according to body pose keypoints
156
+ # please refer to https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/src/openpose/hand/handDetector.cpp
157
+ def handDetect(candidate, subset, oriImg):
158
+ # right hand: wrist 4, elbow 3, shoulder 2
159
+ # left hand: wrist 7, elbow 6, shoulder 5
160
+ ratioWristElbow = 0.33
161
+ detect_result = []
162
+ image_height, image_width = oriImg.shape[0:2]
163
+ for person in subset.astype(int):
164
+ # if any of three not detected
165
+ has_left = np.sum(person[[5, 6, 7]] == -1) == 0
166
+ has_right = np.sum(person[[2, 3, 4]] == -1) == 0
167
+ if not (has_left or has_right):
168
+ continue
169
+ hands = []
170
+ #left hand
171
+ if has_left:
172
+ left_shoulder_index, left_elbow_index, left_wrist_index = person[[5, 6, 7]]
173
+ x1, y1 = candidate[left_shoulder_index][:2]
174
+ x2, y2 = candidate[left_elbow_index][:2]
175
+ x3, y3 = candidate[left_wrist_index][:2]
176
+ hands.append([x1, y1, x2, y2, x3, y3, True])
177
+ # right hand
178
+ if has_right:
179
+ right_shoulder_index, right_elbow_index, right_wrist_index = person[[2, 3, 4]]
180
+ x1, y1 = candidate[right_shoulder_index][:2]
181
+ x2, y2 = candidate[right_elbow_index][:2]
182
+ x3, y3 = candidate[right_wrist_index][:2]
183
+ hands.append([x1, y1, x2, y2, x3, y3, False])
184
+
185
+ for x1, y1, x2, y2, x3, y3, is_left in hands:
186
+ # pos_hand = pos_wrist + ratio * (pos_wrist - pos_elbox) = (1 + ratio) * pos_wrist - ratio * pos_elbox
187
+ # handRectangle.x = posePtr[wrist*3] + ratioWristElbow * (posePtr[wrist*3] - posePtr[elbow*3]);
188
+ # handRectangle.y = posePtr[wrist*3+1] + ratioWristElbow * (posePtr[wrist*3+1] - posePtr[elbow*3+1]);
189
+ # const auto distanceWristElbow = getDistance(poseKeypoints, person, wrist, elbow);
190
+ # const auto distanceElbowShoulder = getDistance(poseKeypoints, person, elbow, shoulder);
191
+ # handRectangle.width = 1.5f * fastMax(distanceWristElbow, 0.9f * distanceElbowShoulder);
192
+ x = x3 + ratioWristElbow * (x3 - x2)
193
+ y = y3 + ratioWristElbow * (y3 - y2)
194
+ distanceWristElbow = math.sqrt((x3 - x2) ** 2 + (y3 - y2) ** 2)
195
+ distanceElbowShoulder = math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
196
+ width = 1.5 * max(distanceWristElbow, 0.9 * distanceElbowShoulder)
197
+ # x-y refers to the center --> offset to topLeft point
198
+ # handRectangle.x -= handRectangle.width / 2.f;
199
+ # handRectangle.y -= handRectangle.height / 2.f;
200
+ x -= width / 2
201
+ y -= width / 2 # width = height
202
+ # overflow the image
203
+ if x < 0: x = 0
204
+ if y < 0: y = 0
205
+ width1 = width
206
+ width2 = width
207
+ if x + width > image_width: width1 = image_width - x
208
+ if y + width > image_height: width2 = image_height - y
209
+ width = min(width1, width2)
210
+ # keep the hand box only if its width is at least 20 pixels
211
+ if width >= 20:
212
+ detect_result.append([int(x), int(y), int(width), is_left])
213
+
214
+ '''
215
+ return value: [[x, y, w, True if left hand else False]].
216
+ width=height since the network requires a square input.
217
+ x, y is the coordinate of top left
218
+ '''
219
+ return detect_result
220
+
221
+
222
+ # Written by Lvmin
223
+ def faceDetect(candidate, subset, oriImg):
224
+ # left right eye ear 14 15 16 17
225
+ detect_result = []
226
+ image_height, image_width = oriImg.shape[0:2]
227
+ for person in subset.astype(int):
228
+ has_head = person[0] > -1
229
+ if not has_head:
230
+ continue
231
+
232
+ has_left_eye = person[14] > -1
233
+ has_right_eye = person[15] > -1
234
+ has_left_ear = person[16] > -1
235
+ has_right_ear = person[17] > -1
236
+
237
+ if not (has_left_eye or has_right_eye or has_left_ear or has_right_ear):
238
+ continue
239
+
240
+ head, left_eye, right_eye, left_ear, right_ear = person[[0, 14, 15, 16, 17]]
241
+
242
+ width = 0.0
243
+ x0, y0 = candidate[head][:2]
244
+
245
+ if has_left_eye:
246
+ x1, y1 = candidate[left_eye][:2]
247
+ d = max(abs(x0 - x1), abs(y0 - y1))
248
+ width = max(width, d * 3.0)
249
+
250
+ if has_right_eye:
251
+ x1, y1 = candidate[right_eye][:2]
252
+ d = max(abs(x0 - x1), abs(y0 - y1))
253
+ width = max(width, d * 3.0)
254
+
255
+ if has_left_ear:
256
+ x1, y1 = candidate[left_ear][:2]
257
+ d = max(abs(x0 - x1), abs(y0 - y1))
258
+ width = max(width, d * 1.5)
259
+
260
+ if has_right_ear:
261
+ x1, y1 = candidate[right_ear][:2]
262
+ d = max(abs(x0 - x1), abs(y0 - y1))
263
+ width = max(width, d * 1.5)
264
+
265
+ x, y = x0, y0
266
+
267
+ x -= width
268
+ y -= width
269
+
270
+ if x < 0:
271
+ x = 0
272
+
273
+ if y < 0:
274
+ y = 0
275
+
276
+ width1 = width * 2
277
+ width2 = width * 2
278
+
279
+ if x + width > image_width:
280
+ width1 = image_width - x
281
+
282
+ if y + width > image_height:
283
+ width2 = image_height - y
284
+
285
+ width = min(width1, width2)
286
+
287
+ if width >= 20:
288
+ detect_result.append([int(x), int(y), int(width)])
289
+
290
+ return detect_result
291
+
292
+
293
+ # get max index of 2d array
294
+ def npmax(array):
295
+ arrayindex = array.argmax(1)
296
+ arrayvalue = array.max(1)
297
+ i = arrayvalue.argmax()
298
+ j = arrayindex[i]
299
+ return i, j
vace/annotators/dwpose/wholebody.py ADDED
@@ -0,0 +1,80 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ import cv2
4
+ import numpy as np
5
+ import onnxruntime as ort
6
+ from .onnxdet import inference_detector
7
+ from .onnxpose import inference_pose
8
+
9
+ def HWC3(x):
10
+ assert x.dtype == np.uint8
11
+ if x.ndim == 2:
12
+ x = x[:, :, None]
13
+ assert x.ndim == 3
14
+ H, W, C = x.shape
15
+ assert C == 1 or C == 3 or C == 4
16
+ if C == 3:
17
+ return x
18
+ if C == 1:
19
+ return np.concatenate([x, x, x], axis=2)
20
+ if C == 4:
21
+ color = x[:, :, 0:3].astype(np.float32)
22
+ alpha = x[:, :, 3:4].astype(np.float32) / 255.0
23
+ y = color * alpha + 255.0 * (1.0 - alpha)
24
+ y = y.clip(0, 255).astype(np.uint8)
25
+ return y
26
+
27
+
28
+ def resize_image(input_image, resolution):
29
+ H, W, C = input_image.shape
30
+ H = float(H)
31
+ W = float(W)
32
+ k = float(resolution) / min(H, W)
33
+ H *= k
34
+ W *= k
35
+ H = int(np.round(H / 64.0)) * 64
36
+ W = int(np.round(W / 64.0)) * 64
37
+ img = cv2.resize(input_image, (W, H), interpolation=cv2.INTER_LANCZOS4 if k > 1 else cv2.INTER_AREA)
38
+ return img
39
+
40
+ class Wholebody:
41
+ def __init__(self, onnx_det, onnx_pose, device = 'cuda:0'):
42
+
43
+ providers = ['CPUExecutionProvider'
44
+ ] if device == 'cpu' else ['CUDAExecutionProvider']
45
+ # onnx_det = 'annotator/ckpts/yolox_l.onnx'
46
+ # onnx_pose = 'annotator/ckpts/dw-ll_ucoco_384.onnx'
47
+
48
+ self.session_det = ort.InferenceSession(path_or_bytes=onnx_det, providers=providers)
49
+ self.session_pose = ort.InferenceSession(path_or_bytes=onnx_pose, providers=providers)
50
+
51
+ def __call__(self, ori_img):
52
+ det_result = inference_detector(self.session_det, ori_img)
53
+ keypoints, scores = inference_pose(self.session_pose, det_result, ori_img)
54
+
55
+ keypoints_info = np.concatenate(
56
+ (keypoints, scores[..., None]), axis=-1)
57
+ # compute neck joint
58
+ neck = np.mean(keypoints_info[:, [5, 6]], axis=1)
59
+ # neck score when visualizing pred
60
+ neck[:, 2:4] = np.logical_and(
61
+ keypoints_info[:, 5, 2:4] > 0.3,
62
+ keypoints_info[:, 6, 2:4] > 0.3).astype(int)
63
+ new_keypoints_info = np.insert(
64
+ keypoints_info, 17, neck, axis=1)
65
+ mmpose_idx = [
66
+ 17, 6, 8, 10, 7, 9, 12, 14, 16, 13, 15, 2, 1, 4, 3
67
+ ]
68
+ openpose_idx = [
69
+ 1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17
70
+ ]
71
+ new_keypoints_info[:, openpose_idx] = \
72
+ new_keypoints_info[:, mmpose_idx]
73
+ keypoints_info = new_keypoints_info
74
+
75
+ keypoints, scores = keypoints_info[
76
+ ..., :2], keypoints_info[..., 2]
77
+
78
+ return keypoints, scores, det_result
79
+
80
+
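For context, `Wholebody` above chains a YOLOX ONNX detector with a DWPose ONNX pose model. A minimal usage sketch, assuming the two checkpoints named in the commented-out defaults have been downloaded to a local `ckpts/` directory (placeholder path):

```python
# Hedged sketch; 'ckpts/...' is a placeholder location, the filenames follow
# the commented-out defaults inside the class.
import cv2
from vace.annotators.dwpose.wholebody import Wholebody

estimator = Wholebody(onnx_det='ckpts/yolox_l.onnx',
                      onnx_pose='ckpts/dw-ll_ucoco_384.onnx',
                      device='cpu')                 # or 'cuda:0'
img = cv2.imread('assets/images/test.jpg')          # HxWx3 uint8 array
keypoints, scores, det_result = estimator(img)      # per-person keypoints and confidences
print(keypoints.shape, scores.shape)
```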
vace/annotators/face.py ADDED
@@ -0,0 +1,55 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+
4
+ import numpy as np
5
+ import torch
6
+
7
+ from .utils import convert_to_numpy
8
+
9
+
10
+ class FaceAnnotator:
11
+ def __init__(self, cfg, device=None):
12
+ from insightface.app import FaceAnalysis
13
+ self.return_raw = cfg.get('RETURN_RAW', True)
14
+ self.return_mask = cfg.get('RETURN_MASK', False)
15
+ self.return_dict = cfg.get('RETURN_DICT', False)
16
+ self.multi_face = cfg.get('MULTI_FACE', True)
17
+ pretrained_model = cfg['PRETRAINED_MODEL']
18
+ self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") if device is None else device
19
+ self.device_id = self.device.index if self.device.type == 'cuda' else None
20
+ ctx_id = self.device_id if self.device_id is not None else 0
21
+ self.model = FaceAnalysis(name=cfg.MODEL_NAME, root=pretrained_model, providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
22
+ self.model.prepare(ctx_id=ctx_id, det_size=(640, 640))
23
+
24
+ def forward(self, image=None, return_mask=None, return_dict=None):
25
+ return_mask = return_mask if return_mask is not None else self.return_mask
26
+ return_dict = return_dict if return_dict is not None else self.return_dict
27
+ image = convert_to_numpy(image)
28
+ # [dict_keys(['bbox', 'kps', 'det_score', 'landmark_3d_68', 'pose', 'landmark_2d_106', 'gender', 'age', 'embedding'])]
29
+ faces = self.model.get(image)
30
+ if self.return_raw:
31
+ return faces
32
+ else:
33
+ crop_face_list, mask_list = [], []
34
+ if len(faces) > 0:
35
+ if not self.multi_face:
36
+ faces = faces[:1]
37
+ for face in faces:
38
+ x_min, y_min, x_max, y_max = face['bbox'].tolist()
39
+ crop_face = image[int(y_min): int(y_max) + 1, int(x_min): int(x_max) + 1]
40
+ crop_face_list.append(crop_face)
41
+ mask = np.zeros_like(image[:, :, 0])
42
+ mask[int(y_min): int(y_max) + 1, int(x_min): int(x_max) + 1] = 255
43
+ mask_list.append(mask)
44
+ if not self.multi_face:
45
+ crop_face_list = crop_face_list[0]
46
+ mask_list = mask_list[0]
47
+ if return_mask:
48
+ if return_dict:
49
+ return {'image': crop_face_list, 'mask': mask_list}
50
+ else:
51
+ return crop_face_list, mask_list
52
+ else:
53
+ return crop_face_list
54
+ else:
55
+ return None
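A hedged construction sketch for `FaceAnnotator`. Because the config is read with both key access (`cfg['PRETRAINED_MODEL']`) and attribute access (`cfg.MODEL_NAME`), an attribute-style dict such as `easydict.EasyDict` is assumed; the model name and checkpoint root below are placeholders:

```python
import numpy as np
from easydict import EasyDict  # assumption: cfg needs attribute access (cfg.MODEL_NAME)
from vace.annotators.face import FaceAnnotator

cfg = EasyDict({
    'MODEL_NAME': 'antelopev2',                         # placeholder insightface model pack
    'PRETRAINED_MODEL': 'models/VACE-Annotators/face',  # placeholder checkpoint root
    'RETURN_RAW': False,
    'RETURN_MASK': True,
    'RETURN_DICT': True,
    'MULTI_FACE': False,
})
anno = FaceAnnotator(cfg)
image = np.zeros((512, 512, 3), dtype=np.uint8)         # stand-in RGB frame
res = anno.forward(image=image)                         # None if no face is detected
if res is not None:
    print(res['image'].shape, res['mask'].shape)
```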
vace/annotators/flow.py ADDED
@@ -0,0 +1,53 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ import torch
4
+ import numpy as np
5
+ import argparse
6
+
7
+ from .utils import convert_to_numpy
8
+
9
+ class FlowAnnotator:
10
+ def __init__(self, cfg, device=None):
11
+ try:
12
+ from raft import RAFT
13
+ from raft.utils.utils import InputPadder
14
+ from raft.utils import flow_viz
15
+ except ImportError:
16
+ import warnings
17
+ warnings.warn(
18
+ "Failed to import RAFT; please install the raft package, e.g. the wheel at models/VACE-Annotators/flow/raft-1.0.0-py3-none-any.whl")
19
+
20
+ params = {
21
+ "small": False,
22
+ "mixed_precision": False,
23
+ "alternate_corr": False
24
+ }
25
+ params = argparse.Namespace(**params)
26
+ pretrained_model = cfg['PRETRAINED_MODEL']
27
+ self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") if device is None else device
28
+ self.model = RAFT(params)
29
+ self.model.load_state_dict({k.replace('module.', ''): v for k, v in torch.load(pretrained_model, map_location="cpu", weights_only=True).items()})
30
+ self.model = self.model.to(self.device).eval()
31
+ self.InputPadder = InputPadder
32
+ self.flow_viz = flow_viz
33
+
34
+ def forward(self, frames):
35
+ # frames / RGB
36
+ frames = [torch.from_numpy(convert_to_numpy(frame).astype(np.uint8)).permute(2, 0, 1).float()[None].to(self.device) for frame in frames]
37
+ flow_up_list, flow_up_vis_list = [], []
38
+ with torch.no_grad():
39
+ for i, (image1, image2) in enumerate(zip(frames[:-1], frames[1:])):
40
+ padder = self.InputPadder(image1.shape)
41
+ image1, image2 = padder.pad(image1, image2)
42
+ flow_low, flow_up = self.model(image1, image2, iters=20, test_mode=True)
43
+ flow_up = flow_up[0].permute(1, 2, 0).cpu().numpy()
44
+ flow_up_vis = self.flow_viz.flow_to_image(flow_up)
45
+ flow_up_list.append(flow_up)
46
+ flow_up_vis_list.append(flow_up_vis)
47
+ return flow_up_list, flow_up_vis_list # RGB
48
+
49
+
50
+ class FlowVisAnnotator(FlowAnnotator):
51
+ def forward(self, frames):
52
+ flow_up_list, flow_up_vis_list = super().forward(frames)
53
+ return flow_up_vis_list[:1] + flow_up_vis_list
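A minimal sketch of driving `FlowVisAnnotator` with a list of RGB frames; the RAFT weight path below is a placeholder:

```python
import numpy as np
from vace.annotators.flow import FlowVisAnnotator

# PRETRAINED_MODEL is a placeholder path to RAFT weights.
anno = FlowVisAnnotator({'PRETRAINED_MODEL': 'models/VACE-Annotators/flow/raft-things.pth'})
frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(8)]  # stand-in RGB frames
vis = anno.forward(frames)
print(len(vis), vis[0].shape)   # one flow visualization per input frame
```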
vace/annotators/frameref.py ADDED
@@ -0,0 +1,118 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ import random
4
+ import numpy as np
5
+ from .utils import align_frames
6
+
7
+
8
+ class FrameRefExtractAnnotator:
9
+ para_dict = {}
10
+
11
+ def __init__(self, cfg, device=None):
12
+ # first / last / firstlast / random
13
+ self.ref_cfg = cfg.get('REF_CFG', [{"mode": "first", "proba": 0.1},
14
+ {"mode": "last", "proba": 0.1},
15
+ {"mode": "firstlast", "proba": 0.1},
16
+ {"mode": "random", "proba": 0.1}])
17
+ self.ref_num = cfg.get('REF_NUM', 1)
18
+ self.ref_color = cfg.get('REF_COLOR', 127.5)
19
+ self.return_dict = cfg.get('RETURN_DICT', True)
20
+ self.return_mask = cfg.get('RETURN_MASK', True)
21
+
22
+
23
+ def forward(self, frames, ref_cfg=None, ref_num=None, return_mask=None, return_dict=None):
24
+ return_mask = return_mask if return_mask is not None else self.return_mask
25
+ return_dict = return_dict if return_dict is not None else self.return_dict
26
+ ref_cfg = ref_cfg if ref_cfg is not None else self.ref_cfg
27
+ ref_cfg = [ref_cfg] if not isinstance(ref_cfg, list) else ref_cfg
28
+ probas = [item['proba'] if 'proba' in item else 1.0 / len(ref_cfg) for item in ref_cfg]
29
+ sel_ref_cfg = random.choices(ref_cfg, weights=probas, k=1)[0]
30
+ mode = sel_ref_cfg['mode'] if 'mode' in sel_ref_cfg else 'original'
31
+ ref_num = int(ref_num) if ref_num is not None else self.ref_num
32
+
33
+ frame_num = len(frames)
34
+ frame_num_range = list(range(frame_num))
35
+ if mode == "first":
36
+ sel_idx = frame_num_range[:ref_num]
37
+ elif mode == "last":
38
+ sel_idx = frame_num_range[-ref_num:]
39
+ elif mode == "firstlast":
40
+ sel_idx = frame_num_range[:ref_num] + frame_num_range[-ref_num:]
41
+ elif mode == "random":
42
+ sel_idx = random.sample(frame_num_range, ref_num)
43
+ else:
44
+ raise NotImplementedError
45
+
46
+ out_frames, out_masks = [], []
47
+ for i in range(frame_num):
48
+ if i in sel_idx:
49
+ out_frame = frames[i]
50
+ out_mask = np.zeros_like(frames[i][:, :, 0])
51
+ else:
52
+ out_frame = np.ones_like(frames[i]) * self.ref_color
53
+ out_mask = np.ones_like(frames[i][:, :, 0]) * 255
54
+ out_frames.append(out_frame)
55
+ out_masks.append(out_mask)
56
+
57
+ if return_dict:
58
+ ret_data = {"frames": out_frames}
59
+ if return_mask:
60
+ ret_data['masks'] = out_masks
61
+ return ret_data
62
+ else:
63
+ if return_mask:
64
+ return out_frames, out_masks
65
+ else:
66
+ return out_frames
67
+
68
+
69
+
70
+ class FrameRefExpandAnnotator:
71
+ para_dict = {}
72
+
73
+ def __init__(self, cfg, device=None):
74
+ # first / last / firstlast
75
+ self.ref_color = cfg.get('REF_COLOR', 127.5)
76
+ self.return_mask = cfg.get('RETURN_MASK', True)
77
+ self.return_dict = cfg.get('RETURN_DICT', True)
78
+ self.mode = cfg.get('MODE', "firstframe")
79
+ assert self.mode in ["firstframe", "lastframe", "firstlastframe", "firstclip", "lastclip", "firstlastclip", "all"]
80
+
81
+ def forward(self, image=None, image_2=None, frames=None, frames_2=None, mode=None, expand_num=None, return_mask=None, return_dict=None):
82
+ mode = mode if mode is not None else self.mode
83
+ return_mask = return_mask if return_mask is not None else self.return_mask
84
+ return_dict = return_dict if return_dict is not None else self.return_dict
85
+
86
+ if 'frame' in mode:
87
+ frames = [image] if image is not None and not isinstance(image, list) else image
88
+ frames_2 = [image_2] if image_2 is not None and not isinstance(image_2, list) else image_2
89
+
90
+ expand_frames = [np.ones_like(frames[0]) * self.ref_color] * expand_num
91
+ expand_masks = [np.ones_like(frames[0][:, :, 0]) * 255] * expand_num
92
+ source_frames = frames
93
+ source_masks = [np.zeros_like(frames[0][:, :, 0])] * len(frames)
94
+
95
+ if mode in ["firstframe", "firstclip"]:
96
+ out_frames = source_frames + expand_frames
97
+ out_masks = source_masks + expand_masks
98
+ elif mode in ["lastframe", "lastclip"]:
99
+ out_frames = expand_frames + source_frames
100
+ out_masks = expand_masks + source_masks
101
+ elif mode in ["firstlastframe", "firstlastclip"]:
102
+ source_frames_2 = [align_frames(source_frames[0], f2) for f2 in frames_2]
103
+ source_masks_2 = [np.zeros_like(source_frames_2[0][:, :, 0])] * len(frames_2)
104
+ out_frames = source_frames + expand_frames + source_frames_2
105
+ out_masks = source_masks + expand_masks + source_masks_2
106
+ else:
107
+ raise NotImplementedError
108
+
109
+ if return_dict:
110
+ ret_data = {"frames": out_frames}
111
+ if return_mask:
112
+ ret_data['masks'] = out_masks
113
+ return ret_data
114
+ else:
115
+ if return_mask:
116
+ return out_frames, out_masks
117
+ else:
118
+ return out_frames
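`FrameRefExpandAnnotator` only manipulates arrays, so it can be sketched without any checkpoints; in `firstframe` mode it keeps the given frame and appends `expand_num` gray placeholder frames plus the matching masks:

```python
import numpy as np
from vace.annotators.frameref import FrameRefExpandAnnotator

anno = FrameRefExpandAnnotator({'MODE': 'firstframe'})
first = np.zeros((480, 832, 3), dtype=np.uint8)     # stand-in reference frame
out = anno.forward(image=first, expand_num=80)      # keep frame 0, pad 80 gray frames after it
print(len(out['frames']), len(out['masks']))        # 81 81
```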
vace/annotators/gdino.py ADDED
@@ -0,0 +1,88 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+
4
+ import cv2
5
+ import torch
6
+ import numpy as np
7
+ import torchvision
8
+ from .utils import convert_to_numpy
9
+
10
+
11
+ class GDINOAnnotator:
12
+ def __init__(self, cfg, device=None):
13
+ try:
14
+ from groundingdino.util.inference import Model, load_model, load_image, predict
15
+ except ImportError:
16
+ import warnings
17
+ warnings.warn("Failed to import groundingdino; please install the groundingdino package, e.g. the wheel at models/VACE-Annotators/gdino/groundingdino-0.1.0-cp310-cp310-linux_x86_64.whl")
18
+
19
+ grounding_dino_config_path = cfg['CONFIG_PATH']
20
+ grounding_dino_checkpoint_path = cfg['PRETRAINED_MODEL']
21
+ grounding_dino_tokenizer_path = cfg['TOKENIZER_PATH'] # TODO
22
+ self.box_threshold = cfg.get('BOX_THRESHOLD', 0.25)
23
+ self.text_threshold = cfg.get('TEXT_THRESHOLD', 0.2)
24
+ self.iou_threshold = cfg.get('IOU_THRESHOLD', 0.5)
25
+ self.use_nms = cfg.get('USE_NMS', True)
26
+ self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") if device is None else device
27
+ self.model = Model(model_config_path=grounding_dino_config_path,
28
+ model_checkpoint_path=grounding_dino_checkpoint_path,
29
+ device=self.device)
30
+
31
+ def forward(self, image, classes=None, caption=None):
32
+ image_bgr = convert_to_numpy(image)[..., ::-1] # bgr
33
+
34
+ if classes is not None:
35
+ classes = [classes] if isinstance(classes, str) else classes
36
+ detections = self.model.predict_with_classes(
37
+ image=image_bgr,
38
+ classes=classes,
39
+ box_threshold=self.box_threshold,
40
+ text_threshold=self.text_threshold
41
+ )
42
+ elif caption is not None:
43
+ detections, phrases = self.model.predict_with_caption(
44
+ image=image_bgr,
45
+ caption=caption,
46
+ box_threshold=self.box_threshold,
47
+ text_threshold=self.text_threshold
48
+ )
49
+ else:
50
+ raise NotImplementedError()
51
+
52
+ if self.use_nms:
53
+ nms_idx = torchvision.ops.nms(
54
+ torch.from_numpy(detections.xyxy),
55
+ torch.from_numpy(detections.confidence),
56
+ self.iou_threshold
57
+ ).numpy().tolist()
58
+ detections.xyxy = detections.xyxy[nms_idx]
59
+ detections.confidence = detections.confidence[nms_idx]
60
+ detections.class_id = detections.class_id[nms_idx] if detections.class_id is not None else None
61
+
62
+ boxes = detections.xyxy
63
+ confidences = detections.confidence
64
+ class_ids = detections.class_id
65
+ class_names = [classes[_id] for _id in class_ids] if classes is not None else phrases
66
+
67
+ ret_data = {
68
+ "boxes": boxes.tolist() if boxes is not None else None,
69
+ "confidences": confidences.tolist() if confidences is not None else None,
70
+ "class_ids": class_ids.tolist() if class_ids is not None else None,
71
+ "class_names": class_names if class_names is not None else None,
72
+ }
73
+ return ret_data
74
+
75
+
76
+ class GDINORAMAnnotator:
77
+ def __init__(self, cfg, device=None):
78
+ from .ram import RAMAnnotator
79
+ from .gdino import GDINOAnnotator
80
+ self.ram_model = RAMAnnotator(cfg['RAM'], device=device)
81
+ self.gdino_model = GDINOAnnotator(cfg['GDINO'], device=device)
82
+
83
+ def forward(self, image):
84
+ ram_res = self.ram_model.forward(image)
85
+ classes = ram_res['tag_e'] if isinstance(ram_res, dict) else ram_res
86
+ gdino_res = self.gdino_model.forward(image, classes=classes)
87
+ return gdino_res
88
+
vace/annotators/gray.py ADDED
@@ -0,0 +1,24 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+
4
+ import cv2
5
+ import numpy as np
6
+ from .utils import convert_to_numpy
7
+
8
+
9
+ class GrayAnnotator:
10
+ def __init__(self, cfg):
11
+ pass
12
+ def forward(self, image):
13
+ image = convert_to_numpy(image)
14
+ gray_map = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
15
+ return gray_map[..., None].repeat(3, axis=2)
16
+
17
+
18
+ class GrayVideoAnnotator(GrayAnnotator):
19
+ def forward(self, frames):
20
+ ret_frames = []
21
+ for frame in frames:
22
+ anno_frame = super().forward(np.array(frame))
23
+ ret_frames.append(anno_frame)
24
+ return ret_frames
vace/annotators/inpainting.py ADDED
@@ -0,0 +1,283 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ import cv2
4
+ import math
5
+ import random
6
+ from abc import ABCMeta
7
+
8
+ import numpy as np
9
+ import torch
10
+ from PIL import Image, ImageDraw
11
+ from .utils import convert_to_numpy, convert_to_pil, single_rle_to_mask, get_mask_box, read_video_one_frame
12
+
13
+ class InpaintingAnnotator:
14
+ def __init__(self, cfg, device=None):
15
+ self.use_aug = cfg.get('USE_AUG', True)
16
+ self.return_mask = cfg.get('RETURN_MASK', True)
17
+ self.return_source = cfg.get('RETURN_SOURCE', True)
18
+ self.mask_color = cfg.get('MASK_COLOR', 128)
19
+ self.mode = cfg.get('MODE', "mask")
20
+ assert self.mode in ["salient", "mask", "bbox", "salientmasktrack", "salientbboxtrack", "maskpointtrack", "maskbboxtrack", "masktrack", "bboxtrack", "label", "caption", "all"]
21
+ if self.mode in ["salient", "salienttrack"]:
22
+ from .salient import SalientAnnotator
23
+ self.salient_model = SalientAnnotator(cfg['SALIENT'], device=device)
24
+ if self.mode in ['masktrack', 'bboxtrack', 'salienttrack']:
25
+ from .sam2 import SAM2ImageAnnotator
26
+ self.sam2_model = SAM2ImageAnnotator(cfg['SAM2'], device=device)
27
+ if self.mode in ['label', 'caption']:
28
+ from .gdino import GDINOAnnotator
29
+ from .sam2 import SAM2ImageAnnotator
30
+ self.gdino_model = GDINOAnnotator(cfg['GDINO'], device=device)
31
+ self.sam2_model = SAM2ImageAnnotator(cfg['SAM2'], device=device)
32
+ if self.mode in ['all']:
33
+ from .salient import SalientAnnotator
34
+ from .gdino import GDINOAnnotator
35
+ from .sam2 import SAM2ImageAnnotator
36
+ self.salient_model = SalientAnnotator(cfg['SALIENT'], device=device)
37
+ self.gdino_model = GDINOAnnotator(cfg['GDINO'], device=device)
38
+ self.sam2_model = SAM2ImageAnnotator(cfg['SAM2'], device=device)
39
+ if self.use_aug:
40
+ from .maskaug import MaskAugAnnotator
41
+ self.maskaug_anno = MaskAugAnnotator(cfg={})
42
+
43
+ def apply_plain_mask(self, image, mask, mask_color):
44
+ bool_mask = mask > 0
45
+ out_image = image.copy()
46
+ out_image[bool_mask] = mask_color
47
+ out_mask = np.where(bool_mask, 255, 0).astype(np.uint8)
48
+ return out_image, out_mask
49
+
50
+ def apply_seg_mask(self, image, mask, mask_color, mask_cfg=None):
51
+ out_mask = (mask * 255).astype('uint8')
52
+ if self.use_aug and mask_cfg is not None:
53
+ out_mask = self.maskaug_anno.forward(out_mask, mask_cfg)
54
+ bool_mask = out_mask > 0
55
+ out_image = image.copy()
56
+ out_image[bool_mask] = mask_color
57
+ return out_image, out_mask
58
+
59
+ def forward(self, image=None, mask=None, bbox=None, label=None, caption=None, mode=None, return_mask=None, return_source=None, mask_color=None, mask_cfg=None):
60
+ mode = mode if mode is not None else self.mode
61
+ return_mask = return_mask if return_mask is not None else self.return_mask
62
+ return_source = return_source if return_source is not None else self.return_source
63
+ mask_color = mask_color if mask_color is not None else self.mask_color
64
+
65
+ image = convert_to_numpy(image)
66
+ out_image, out_mask = None, None
67
+ if mode in ['salient']:
68
+ mask = self.salient_model.forward(image)
69
+ out_image, out_mask = self.apply_plain_mask(image, mask, mask_color)
70
+ elif mode in ['mask']:
71
+ mask_h, mask_w = mask.shape[:2]
72
+ h, w = image.shape[:2]
73
+ if (mask_h != h) or (mask_w != w):
74
+ mask = cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)
75
+ out_image, out_mask = self.apply_plain_mask(image, mask, mask_color)
76
+ elif mode in ['bbox']:
77
+ x1, y1, x2, y2 = bbox
78
+ h, w = image.shape[:2]
79
+ x1, y1 = int(max(0, x1)), int(max(0, y1))
80
+ x2, y2 = int(min(w, x2)), int(min(h, y2))
81
+ out_image = image.copy()
82
+ out_image[y1:y2, x1:x2] = mask_color
83
+ out_mask = np.zeros((h, w), dtype=np.uint8)
84
+ out_mask[y1:y2, x1:x2] = 255
85
+ elif mode in ['salientmasktrack']:
86
+ mask = self.salient_model.forward(image)
87
+ resize_mask = cv2.resize(mask, (256, 256), interpolation=cv2.INTER_NEAREST)
88
+ out_mask = self.sam2_model.forward(image=image, mask=resize_mask, task_type='mask', return_mask=True)
89
+ out_image, out_mask = self.apply_seg_mask(image, out_mask, mask_color, mask_cfg)
90
+ elif mode in ['salientbboxtrack']:
91
+ mask = self.salient_model.forward(image)
92
+ bbox = get_mask_box(np.array(mask), threshold=1)
93
+ out_mask = self.sam2_model.forward(image=image, input_box=bbox, task_type='input_box', return_mask=True)
94
+ out_image, out_mask = self.apply_seg_mask(image, out_mask, mask_color, mask_cfg)
95
+ elif mode in ['maskpointtrack']:
96
+ out_mask = self.sam2_model.forward(image=image, mask=mask, task_type='mask_point', return_mask=True)
97
+ out_image, out_mask = self.apply_seg_mask(image, out_mask, mask_color, mask_cfg)
98
+ elif mode in ['maskbboxtrack']:
99
+ out_mask = self.sam2_model.forward(image=image, mask=mask, task_type='mask_box', return_mask=True)
100
+ out_image, out_mask = self.apply_seg_mask(image, out_mask, mask_color, mask_cfg)
101
+ elif mode in ['masktrack']:
102
+ resize_mask = cv2.resize(mask, (256, 256), interpolation=cv2.INTER_NEAREST)
103
+ out_mask = self.sam2_model.forward(image=image, mask=resize_mask, task_type='mask', return_mask=True)
104
+ out_image, out_mask = self.apply_seg_mask(image, out_mask, mask_color, mask_cfg)
105
+ elif mode in ['bboxtrack']:
106
+ out_mask = self.sam2_model.forward(image=image, input_box=bbox, task_type='input_box', return_mask=True)
107
+ out_image, out_mask = self.apply_seg_mask(image, out_mask, mask_color, mask_cfg)
108
+ elif mode in ['label']:
109
+ gdino_res = self.gdino_model.forward(image, classes=label)
110
+ if 'boxes' in gdino_res and len(gdino_res['boxes']) > 0:
111
+ bboxes = gdino_res['boxes'][0]
112
+ else:
113
+ raise ValueError(f"Unable to find the corresponding boxes of label: {label}")
114
+ out_mask = self.sam2_model.forward(image=image, input_box=bboxes, task_type='input_box', return_mask=True)
115
+ out_image, out_mask = self.apply_seg_mask(image, out_mask, mask_color, mask_cfg)
116
+ elif mode in ['caption']:
117
+ gdino_res = self.gdino_model.forward(image, caption=caption)
118
+ if 'boxes' in gdino_res and len(gdino_res['boxes']) > 0:
119
+ bboxes = gdino_res['boxes'][0]
120
+ else:
121
+ raise ValueError(f"Unable to find the corresponding boxes of caption: {caption}")
122
+ out_mask = self.sam2_model.forward(image=image, input_box=bboxes, task_type='input_box', return_mask=True)
123
+ out_image, out_mask = self.apply_seg_mask(image, out_mask, mask_color, mask_cfg)
124
+
125
+ ret_data = {"image": out_image}
126
+ if return_mask:
127
+ ret_data["mask"] = out_mask
128
+ if return_source:
129
+ ret_data["src_image"] = image
130
+ return ret_data
131
+
132
+
133
+
134
+
135
+ class InpaintingVideoAnnotator:
136
+ def __init__(self, cfg, device=None):
137
+ self.use_aug = cfg.get('USE_AUG', True)
138
+ self.return_frame = cfg.get('RETURN_FRAME', True)
139
+ self.return_mask = cfg.get('RETURN_MASK', True)
140
+ self.return_source = cfg.get('RETURN_SOURCE', True)
141
+ self.mask_color = cfg.get('MASK_COLOR', 128)
142
+ self.mode = cfg.get('MODE', "mask")
143
+ assert self.mode in ["salient", "mask", "bbox", "salientmasktrack", "salientbboxtrack", "maskpointtrack", "maskbboxtrack", "masktrack", "bboxtrack", "label", "caption", "all"]
144
+ if self.mode in ["salient", "salienttrack"]:
145
+ from .salient import SalientAnnotator
146
+ self.salient_model = SalientAnnotator(cfg['SALIENT'], device=device)
147
+ if self.mode in ['masktrack', 'bboxtrack', 'salienttrack']:
148
+ from .sam2 import SAM2VideoAnnotator
149
+ self.sam2_model = SAM2VideoAnnotator(cfg['SAM2'], device=device)
150
+ if self.mode in ['label', 'caption']:
151
+ from .gdino import GDINOAnnotator
152
+ from .sam2 import SAM2VideoAnnotator
153
+ self.gdino_model = GDINOAnnotator(cfg['GDINO'], device=device)
154
+ self.sam2_model = SAM2VideoAnnotator(cfg['SAM2'], device=device)
155
+ if self.mode in ['all']:
156
+ from .salient import SalientAnnotator
157
+ from .gdino import GDINOAnnotator
158
+ from .sam2 import SAM2VideoAnnotator
159
+ self.salient_model = SalientAnnotator(cfg['SALIENT'], device=device)
160
+ self.gdino_model = GDINOAnnotator(cfg['GDINO'], device=device)
161
+ self.sam2_model = SAM2VideoAnnotator(cfg['SAM2'], device=device)
162
+ if self.use_aug:
163
+ from .maskaug import MaskAugAnnotator
164
+ self.maskaug_anno = MaskAugAnnotator(cfg={})
165
+
166
+ def apply_plain_mask(self, frames, mask, mask_color, return_frame=True):
167
+ out_frames = []
168
+ num_frames = len(frames)
169
+ bool_mask = mask > 0
170
+ out_masks = [np.where(bool_mask, 255, 0).astype(np.uint8)] * num_frames
171
+ if not return_frame:
172
+ return None, out_masks
173
+ for i in range(num_frames):
174
+ masked_frame = frames[i].copy()
175
+ masked_frame[bool_mask] = mask_color
176
+ out_frames.append(masked_frame)
177
+ return out_frames, out_masks
178
+
179
+ def apply_seg_mask(self, mask_data, frames, mask_color, mask_cfg=None, return_frame=True):
180
+ out_frames = []
181
+ out_masks = [(single_rle_to_mask(val[0]["mask"]) * 255).astype('uint8') for key, val in mask_data['annotations'].items()]
182
+ if not return_frame:
183
+ return None, out_masks
184
+ num_frames = min(len(out_masks), len(frames))
185
+ for i in range(num_frames):
186
+ sub_mask = out_masks[i]
187
+ if self.use_aug and mask_cfg is not None:
188
+ sub_mask = self.maskaug_anno.forward(sub_mask, mask_cfg)
189
+ out_masks[i] = sub_mask
190
+ bool_mask = sub_mask > 0
191
+ masked_frame = frames[i].copy()
192
+ masked_frame[bool_mask] = mask_color
193
+ out_frames.append(masked_frame)
194
+ out_masks = out_masks[:num_frames]
195
+ return out_frames, out_masks
196
+
197
+ def forward(self, frames=None, video=None, mask=None, bbox=None, label=None, caption=None, mode=None, return_frame=None, return_mask=None, return_source=None, mask_color=None, mask_cfg=None):
198
+ mode = mode if mode is not None else self.mode
199
+ return_frame = return_frame if return_frame is not None else self.return_frame
200
+ return_mask = return_mask if return_mask is not None else self.return_mask
201
+ return_source = return_source if return_source is not None else self.return_source
202
+ mask_color = mask_color if mask_color is not None else self.mask_color
203
+
204
+ out_frames, out_masks = [], []
205
+ if mode in ['salient']:
206
+ first_frame = frames[0] if frames is not None else read_video_one_frame(video)
207
+ mask = self.salient_model.forward(first_frame)
208
+ out_frames, out_masks = self.apply_plain_mask(frames, mask, mask_color, return_frame)
209
+ elif mode in ['mask']:
210
+ first_frame = frames[0] if frames is not None else read_video_one_frame(video)
211
+ mask_h, mask_w = mask.shape[:2]
212
+ h, w = first_frame.shape[:2]
213
+ if (mask_h != h) or (mask_w != w):
214
+ mask = cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)
215
+ out_frames, out_masks = self.apply_plain_mask(frames, mask, mask_color, return_frame)
216
+ elif mode in ['bbox']:
217
+ first_frame = frames[0] if frames is not None else read_video_one_frame(video)
218
+ num_frames = len(frames)
219
+ x1, y1, x2, y2 = bbox
220
+ h, w = first_frame.shape[:2]
221
+ x1, y1 = int(max(0, x1)), int(max(0, y1))
222
+ x2, y2 = int(min(w, x2)), int(min(h, y2))
223
+ mask = np.zeros((h, w), dtype=np.uint8)
224
+ mask[y1:y2, x1:x2] = 255
225
+ out_masks = [mask] * num_frames
226
+ if not return_frame:
227
+ out_frames = None
228
+ else:
229
+ for i in range(num_frames):
230
+ masked_frame = frames[i].copy()
231
+ masked_frame[y1:y2, x1:x2] = mask_color
232
+ out_frames.append(masked_frame)
233
+ elif mode in ['salientmasktrack']:
234
+ first_frame = frames[0] if frames is not None else read_video_one_frame(video)
235
+ salient_mask = self.salient_model.forward(first_frame)
236
+ mask_data = self.sam2_model.forward(video=video, mask=salient_mask, task_type='mask')
237
+ out_frames, out_masks = self.apply_seg_mask(mask_data, frames, mask_color, mask_cfg, return_frame)
238
+ elif mode in ['salientbboxtrack']:
239
+ first_frame = frames[0] if frames is not None else read_video_one_frame(video)
240
+ salient_mask = self.salient_model.forward(first_frame)
241
+ bbox = get_mask_box(np.array(salient_mask), threshold=1)
242
+ mask_data = self.sam2_model.forward(video=video, input_box=bbox, task_type='input_box')
243
+ out_frames, out_masks = self.apply_seg_mask(mask_data, frames, mask_color, mask_cfg, return_frame)
244
+ elif mode in ['maskpointtrack']:
245
+ mask_data = self.sam2_model.forward(video=video, mask=mask, task_type='mask_point')
246
+ out_frames, out_masks = self.apply_seg_mask(mask_data, frames, mask_color, mask_cfg, return_frame)
247
+ elif mode in ['maskbboxtrack']:
248
+ mask_data = self.sam2_model.forward(video=video, mask=mask, task_type='mask_box')
249
+ out_frames, out_masks = self.apply_seg_mask(mask_data, frames, mask_color, mask_cfg, return_frame)
250
+ elif mode in ['masktrack']:
251
+ mask_data = self.sam2_model.forward(video=video, mask=mask, task_type='mask')
252
+ out_frames, out_masks = self.apply_seg_mask(mask_data, frames, mask_color, mask_cfg, return_frame)
253
+ elif mode in ['bboxtrack']:
254
+ mask_data = self.sam2_model.forward(video=video, input_box=bbox, task_type='input_box')
255
+ out_frames, out_masks = self.apply_seg_mask(mask_data, frames, mask_color, mask_cfg, return_frame)
256
+ elif mode in ['label']:
257
+ first_frame = frames[0] if frames is not None else read_video_one_frame(video)
258
+ gdino_res = self.gdino_model.forward(first_frame, classes=label)
259
+ if 'boxes' in gdino_res and len(gdino_res['boxes']) > 0:
260
+ bboxes = gdino_res['boxes'][0]
261
+ else:
262
+ raise ValueError(f"Unable to find the corresponding boxes of label: {label}")
263
+ mask_data = self.sam2_model.forward(video=video, input_box=bboxes, task_type='input_box')
264
+ out_frames, out_masks = self.apply_seg_mask(mask_data, frames, mask_color, mask_cfg, return_frame)
265
+ elif mode in ['caption']:
266
+ first_frame = frames[0] if frames is not None else read_video_one_frame(video)
267
+ gdino_res = self.gdino_model.forward(first_frame, caption=caption)
268
+ if 'boxes' in gdino_res and len(gdino_res['boxes']) > 0:
269
+ bboxes = gdino_res['boxes'][0]
270
+ else:
271
+ raise ValueError(f"Unable to find the corresponding boxes of caption: {caption}")
272
+ mask_data = self.sam2_model.forward(video=video, input_box=bboxes, task_type='input_box')
273
+ out_frames, out_masks = self.apply_seg_mask(mask_data, frames, mask_color, mask_cfg, return_frame)
274
+
275
+ ret_data = {}
276
+ if return_frame:
277
+ ret_data["frames"] = out_frames
278
+ if return_mask:
279
+ ret_data["masks"] = out_masks
280
+ return ret_data
281
+
282
+
283
+
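A minimal sketch of the image annotator in `bbox` mode, the one branch that needs no SAM2/GroundingDINO checkpoints; sizes and box coordinates below are illustrative only:

```python
import numpy as np
from vace.annotators.inpainting import InpaintingAnnotator

# 'bbox' mode works standalone: no segmentation or detection models are loaded.
anno = InpaintingAnnotator({'MODE': 'bbox'})
image = np.zeros((480, 832, 3), dtype=np.uint8)      # stand-in RGB frame
res = anno.forward(image=image, bbox=[100, 120, 300, 360])
print(res['image'].shape, int(res['mask'].max()))    # masked image, 255 inside the box
```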
vace/annotators/layout.py ADDED
@@ -0,0 +1,161 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+
4
+ import cv2
5
+ import numpy as np
6
+
7
+ from .utils import convert_to_numpy
8
+
9
+
10
+ class LayoutBboxAnnotator:
11
+ def __init__(self, cfg, device=None):
12
+ self.bg_color = cfg.get('BG_COLOR', [255, 255, 255])
13
+ self.box_color = cfg.get('BOX_COLOR', [0, 0, 0])
14
+ self.frame_size = cfg.get('FRAME_SIZE', [720, 1280]) # [H, W]
15
+ self.num_frames = cfg.get('NUM_FRAMES', 81)
16
+ ram_tag_color_path = cfg.get('RAM_TAG_COLOR_PATH', None)
17
+ self.color_dict = {'default': tuple(self.box_color)}
18
+ if ram_tag_color_path is not None:
19
+ lines = [id_name_color.strip().split('#;#') for id_name_color in open(ram_tag_color_path).readlines()]
20
+ self.color_dict.update({id_name_color[1]: tuple(eval(id_name_color[2])) for id_name_color in lines})
21
+
22
+ def forward(self, bbox, frame_size=None, num_frames=None, label=None, color=None):
23
+ frame_size = frame_size if frame_size is not None else self.frame_size
24
+ num_frames = num_frames if num_frames is not None else self.num_frames
25
+ assert len(bbox) == 2, 'bbox should be a list of two elements (start_bbox & end_bbox)'
26
+ # frame_size = [H, W]
27
+ # bbox = [x1, y1, x2, y2]
28
+ label = label[0] if label is not None and isinstance(label, list) else label
29
+ if label is not None and label in self.color_dict:
30
+ box_color = self.color_dict[label]
31
+ elif color is not None:
32
+ box_color = color
33
+ else:
34
+ box_color = self.color_dict['default']
35
+ start_bbox, end_bbox = bbox
36
+ start_bbox = [start_bbox[0], start_bbox[1], start_bbox[2] - start_bbox[0], start_bbox[3] - start_bbox[1]]
37
+ start_bbox = np.array(start_bbox, dtype=np.float32)
38
+ end_bbox = [end_bbox[0], end_bbox[1], end_bbox[2] - end_bbox[0], end_bbox[3] - end_bbox[1]]
39
+ end_bbox = np.array(end_bbox, dtype=np.float32)
40
+ bbox_increment = (end_bbox - start_bbox) / num_frames
41
+ ret_frames = []
42
+ for frame_idx in range(num_frames):
43
+ frame = np.zeros((frame_size[0], frame_size[1], 3), dtype=np.uint8)
44
+ frame[:] = self.bg_color
45
+ current_bbox = start_bbox + bbox_increment * frame_idx
46
+ current_bbox = current_bbox.astype(int)
47
+ x, y, w, h = current_bbox
48
+ cv2.rectangle(frame, (x, y), (x + w, y + h), box_color, 2)
49
+ ret_frames.append(frame[..., ::-1])
50
+ return ret_frames
51
+
52
+
53
+
54
+
55
+ class LayoutMaskAnnotator:
56
+ def __init__(self, cfg, device=None):
57
+ self.use_aug = cfg.get('USE_AUG', False)
58
+ self.bg_color = cfg.get('BG_COLOR', [255, 255, 255])
59
+ self.box_color = cfg.get('BOX_COLOR', [0, 0, 0])
60
+ ram_tag_color_path = cfg.get('RAM_TAG_COLOR_PATH', None)
61
+ self.color_dict = {'default': tuple(self.box_color)}
62
+ if ram_tag_color_path is not None:
63
+ lines = [id_name_color.strip().split('#;#') for id_name_color in open(ram_tag_color_path).readlines()]
64
+ self.color_dict.update({id_name_color[1]: tuple(eval(id_name_color[2])) for id_name_color in lines})
65
+ if self.use_aug:
66
+ from .maskaug import MaskAugAnnotator
67
+ self.maskaug_anno = MaskAugAnnotator(cfg={})
68
+
69
+
70
+ def find_contours(self, mask):
71
+ contours, hier = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
72
+ return contours
73
+
74
+ def draw_contours(self, canvas, contour, color):
75
+ canvas = np.ascontiguousarray(canvas, dtype=np.uint8)
76
+ canvas = cv2.drawContours(canvas, contour, -1, color, thickness=3)
77
+ return canvas
78
+
79
+ def forward(self, mask=None, color=None, label=None, mask_cfg=None):
80
+ if not isinstance(mask, list):
81
+ is_batch = False
82
+ mask = [mask]
83
+ else:
84
+ is_batch = True
85
+
86
+ if label is not None and label in self.color_dict:
87
+ color = self.color_dict[label]
88
+ elif color is not None:
89
+ color = color
90
+ else:
91
+ color = self.color_dict['default']
92
+
93
+ ret_data = []
94
+ for sub_mask in mask:
95
+ sub_mask = convert_to_numpy(sub_mask)
96
+ if self.use_aug:
97
+ sub_mask = self.maskaug_anno.forward(sub_mask, mask_cfg)
98
+ canvas = np.ones((sub_mask.shape[0], sub_mask.shape[1], 3)) * 255
99
+ contour = self.find_contours(sub_mask)
100
+ frame = self.draw_contours(canvas, contour, color)
101
+ ret_data.append(frame)
102
+
103
+ if is_batch:
104
+ return ret_data
105
+ else:
106
+ return ret_data[0]
107
+
108
+
109
+
110
+
111
+ class LayoutTrackAnnotator:
112
+ def __init__(self, cfg, device=None):
113
+ self.use_aug = cfg.get('USE_AUG', False)
114
+ self.bg_color = cfg.get('BG_COLOR', [255, 255, 255])
115
+ self.box_color = cfg.get('BOX_COLOR', [0, 0, 0])
116
+ ram_tag_color_path = cfg.get('RAM_TAG_COLOR_PATH', None)
117
+ self.color_dict = {'default': tuple(self.box_color)}
118
+ if ram_tag_color_path is not None:
119
+ lines = [id_name_color.strip().split('#;#') for id_name_color in open(ram_tag_color_path).readlines()]
120
+ self.color_dict.update({id_name_color[1]: tuple(eval(id_name_color[2])) for id_name_color in lines})
121
+ if self.use_aug:
122
+ from .maskaug import MaskAugAnnotator
123
+ self.maskaug_anno = MaskAugAnnotator(cfg={})
124
+ from .inpainting import InpaintingVideoAnnotator
125
+ self.inpainting_anno = InpaintingVideoAnnotator(cfg=cfg['INPAINTING'])
126
+
127
+ def find_contours(self, mask):
128
+ contours, hier = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
129
+ return contours
130
+
131
+ def draw_contours(self, canvas, contour, color):
132
+ canvas = np.ascontiguousarray(canvas, dtype=np.uint8)
133
+ canvas = cv2.drawContours(canvas, contour, -1, color, thickness=3)
134
+ return canvas
135
+
136
+ def forward(self, color=None, mask_cfg=None, frames=None, video=None, mask=None, bbox=None, label=None, caption=None, mode=None):
137
+ inp_data = self.inpainting_anno.forward(frames, video, mask, bbox, label, caption, mode)
138
+ inp_masks = inp_data['masks']
139
+
140
+ label = label[0] if label is not None and isinstance(label, list) else label
141
+ if label is not None and label in self.color_dict:
142
+ color = self.color_dict[label]
143
+ elif color is not None:
144
+ color = color
145
+ else:
146
+ color = self.color_dict['default']
147
+
148
+ num_frames = len(inp_masks)
149
+ ret_data = []
150
+ for i in range(num_frames):
151
+ sub_mask = inp_masks[i]
152
+ if self.use_aug and mask_cfg is not None:
153
+ sub_mask = self.maskaug_anno.forward(sub_mask, mask_cfg)
154
+ canvas = np.ones((sub_mask.shape[0], sub_mask.shape[1], 3)) * 255
155
+ contour = self.find_contours(sub_mask)
156
+ frame = self.draw_contours(canvas, contour, color)
157
+ ret_data.append(frame)
158
+
159
+ return ret_data
160
+
161
+
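`LayoutBboxAnnotator` linearly interpolates a box from a start to an end position and renders one frame per step; a checkpoint-free sketch:

```python
from vace.annotators.layout import LayoutBboxAnnotator

# Without RAM_TAG_COLOR_PATH every box falls back to BOX_COLOR (black on white).
anno = LayoutBboxAnnotator({'FRAME_SIZE': [480, 832], 'NUM_FRAMES': 49})
start_box = [100, 120, 300, 360]     # [x1, y1, x2, y2] at the first frame
end_box = [400, 120, 600, 360]       # [x1, y1, x2, y2] at the last frame
frames = anno.forward(bbox=[start_box, end_box])
print(len(frames), frames[0].shape)  # 49 (480, 832, 3)
```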
vace/annotators/mask.py ADDED
@@ -0,0 +1,79 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+
4
+ import numpy as np
5
+ from scipy.spatial import ConvexHull
6
+ from skimage.draw import polygon
7
+ from scipy import ndimage
8
+
9
+ from .utils import convert_to_numpy
10
+
11
+
12
+ class MaskDrawAnnotator:
13
+ def __init__(self, cfg, device=None):
14
+ self.mode = cfg.get('MODE', 'maskpoint')
15
+ self.return_dict = cfg.get('RETURN_DICT', True)
16
+ assert self.mode in ['maskpoint', 'maskbbox', 'mask', 'bbox']
17
+
18
+ def forward(self,
19
+ mask=None,
20
+ image=None,
21
+ bbox=None,
22
+ mode=None,
23
+ return_dict=None):
24
+ mode = mode if mode is not None else self.mode
25
+ return_dict = return_dict if return_dict is not None else self.return_dict
26
+
27
+ mask = convert_to_numpy(mask) if mask is not None else None
28
+ image = convert_to_numpy(image) if image is not None else None
29
+
30
+ mask_shape = mask.shape
31
+ if mode == 'maskpoint':
32
+ scribble = mask.transpose(1, 0)
33
+ labeled_array, num_features = ndimage.label(scribble >= 255)
34
+ centers = ndimage.center_of_mass(scribble, labeled_array,
35
+ range(1, num_features + 1))
36
+ centers = np.array(centers)
37
+ out_mask = np.zeros(mask_shape, dtype=np.uint8)
38
+ hull = ConvexHull(centers)
39
+ hull_vertices = centers[hull.vertices]
40
+ rr, cc = polygon(hull_vertices[:, 1], hull_vertices[:, 0], mask_shape)
41
+ out_mask[rr, cc] = 255
42
+ elif mode == 'maskbbox':
43
+ scribble = mask.transpose(1, 0)
44
+ labeled_array, num_features = ndimage.label(scribble >= 255)
45
+ centers = ndimage.center_of_mass(scribble, labeled_array,
46
+ range(1, num_features + 1))
47
+ centers = np.array(centers)
48
+ # (x1, y1, x2, y2)
49
+ x_min = centers[:, 0].min()
50
+ x_max = centers[:, 0].max()
51
+ y_min = centers[:, 1].min()
52
+ y_max = centers[:, 1].max()
53
+ out_mask = np.zeros(mask_shape, dtype=np.uint8)
54
+ out_mask[int(y_min) : int(y_max) + 1, int(x_min) : int(x_max) + 1] = 255
55
+ if image is not None:
56
+ out_image = image[int(y_min) : int(y_max) + 1, int(x_min) : int(x_max) + 1]
57
+ elif mode == 'bbox':
58
+ if isinstance(bbox, list):
59
+ bbox = np.array(bbox)
60
+ x_min, y_min, x_max, y_max = bbox
61
+ out_mask = np.zeros(mask_shape, dtype=np.uint8)
62
+ out_mask[int(y_min) : int(y_max) + 1, int(x_min) : int(x_max) + 1] = 255
63
+ if image is not None:
64
+ out_image = image[int(y_min) : int(y_max) + 1, int(x_min) : int(x_max) + 1]
65
+ elif mode == 'mask':
66
+ out_mask = mask
67
+ else:
68
+ raise NotImplementedError
69
+
70
+ if return_dict:
71
+ if image is not None:
72
+ return {"image": out_image, "mask": out_mask}
73
+ else:
74
+ return {"mask": out_mask}
75
+ else:
76
+ if image is not None:
77
+ return out_image, out_mask
78
+ else:
79
+ return out_mask
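A small sketch of `MaskDrawAnnotator` in `bbox` mode, where the mask argument is only used for its shape:

```python
import numpy as np
from vace.annotators.mask import MaskDrawAnnotator

anno = MaskDrawAnnotator({'MODE': 'bbox', 'RETURN_DICT': True})
mask = np.zeros((480, 832), dtype=np.uint8)        # only its shape is used in 'bbox' mode
res = anno.forward(mask=mask, bbox=[100, 120, 300, 360])
print(res['mask'].shape, int(res['mask'].max()))   # (480, 832) 255
```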
vace/annotators/maskaug.py ADDED
@@ -0,0 +1,181 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+
4
+
5
+ import random
6
+ from functools import partial
7
+
8
+ import cv2
9
+ import numpy as np
10
+ from PIL import Image, ImageDraw
11
+
12
+ from .utils import convert_to_numpy
13
+
14
+
15
+
16
+ class MaskAugAnnotator:
17
+ def __init__(self, cfg, device=None):
18
+ # original / original_expand / hull / hull_expand / bbox / bbox_expand
19
+ self.mask_cfg = cfg.get('MASK_CFG', [{"mode": "original", "proba": 0.1},
20
+ {"mode": "original_expand", "proba": 0.1},
21
+ {"mode": "hull", "proba": 0.1},
22
+ {"mode": "hull_expand", "proba":0.1, "kwargs": {"expand_ratio": 0.2}},
23
+ {"mode": "bbox", "proba": 0.1},
24
+ {"mode": "bbox_expand", "proba": 0.1, "kwargs": {"min_expand_ratio": 0.2, "max_expand_ratio": 0.5}}])
25
+
26
+ def forward(self, mask, mask_cfg=None):
27
+ mask_cfg = mask_cfg if mask_cfg is not None else self.mask_cfg
28
+ if not isinstance(mask, list):
29
+ is_batch = False
30
+ masks = [mask]
31
+ else:
32
+ is_batch = True
33
+ masks = mask
34
+
35
+ mask_func = self.get_mask_func(mask_cfg)
36
+ # print(mask_func)
37
+ aug_masks = []
38
+ for submask in masks:
39
+ mask = convert_to_numpy(submask)
40
+ valid, large, h, w, bbox = self.get_mask_info(mask)
41
+ # print(valid, large, h, w, bbox)
42
+ if valid:
43
+ mask = mask_func(mask, bbox, h, w)
44
+ else:
45
+ mask = mask.astype(np.uint8)
46
+ aug_masks.append(mask)
47
+ return aug_masks if is_batch else aug_masks[0]
48
+
49
+ def get_mask_info(self, mask):
50
+ h, w = mask.shape
51
+ locs = mask.nonzero()
52
+ valid = True
53
+ if len(locs) < 1 or locs[0].shape[0] < 1 or locs[1].shape[0] < 1:
54
+ valid = False
55
+ return valid, False, h, w, [0, 0, 0, 0]
56
+
57
+ left, right = np.min(locs[1]), np.max(locs[1])
58
+ top, bottom = np.min(locs[0]), np.max(locs[0])
59
+ bbox = [left, top, right, bottom]
60
+
61
+ large = False
62
+ if (right - left + 1) * (bottom - top + 1) > 0.9 * h * w:
63
+ large = True
64
+ return valid, large, h, w, bbox
65
+
66
+ def get_expand_params(self, mask_kwargs):
67
+ if 'expand_ratio' in mask_kwargs:
68
+ expand_ratio = mask_kwargs['expand_ratio']
69
+ elif 'min_expand_ratio' in mask_kwargs and 'max_expand_ratio' in mask_kwargs:
70
+ expand_ratio = random.uniform(mask_kwargs['min_expand_ratio'], mask_kwargs['max_expand_ratio'])
71
+ else:
72
+ expand_ratio = 0.3
73
+
74
+ if 'expand_iters' in mask_kwargs:
75
+ expand_iters = mask_kwargs['expand_iters']
76
+ else:
77
+ expand_iters = random.randint(1, 10)
78
+
79
+ if 'expand_lrtp' in mask_kwargs:
80
+ expand_lrtp = mask_kwargs['expand_lrtp']
81
+ else:
82
+ expand_lrtp = [random.random(), random.random(), random.random(), random.random()]
83
+
84
+ return expand_ratio, expand_iters, expand_lrtp
85
+
86
+ def get_mask_func(self, mask_cfg):
87
+ if not isinstance(mask_cfg, list):
88
+ mask_cfg = [mask_cfg]
89
+ probas = [item['proba'] if 'proba' in item else 1.0 / len(mask_cfg) for item in mask_cfg]
90
+ sel_mask_cfg = random.choices(mask_cfg, weights=probas, k=1)[0]
91
+ mode = sel_mask_cfg['mode'] if 'mode' in sel_mask_cfg else 'original'
92
+ mask_kwargs = sel_mask_cfg['kwargs'] if 'kwargs' in sel_mask_cfg else {}
93
+
94
+ if mode == 'random':
95
+ mode = random.choice(['original', 'original_expand', 'hull', 'hull_expand', 'bbox', 'bbox_expand'])
96
+ if mode == 'original':
97
+ mask_func = partial(self.generate_mask)
98
+ elif mode == 'original_expand':
99
+ expand_ratio, expand_iters, expand_lrtp = self.get_expand_params(mask_kwargs)
100
+ mask_func = partial(self.generate_mask, expand_ratio=expand_ratio, expand_iters=expand_iters, expand_lrtp=expand_lrtp)
101
+ elif mode == 'hull':
102
+ clockwise = random.choice([True, False]) if 'clockwise' not in mask_kwargs else mask_kwargs['clockwise']
103
+ mask_func = partial(self.generate_hull_mask, clockwise=clockwise)
104
+ elif mode == 'hull_expand':
105
+ expand_ratio, expand_iters, expand_lrtp = self.get_expand_params(mask_kwargs)
106
+ clockwise = random.choice([True, False]) if 'clockwise' not in mask_kwargs else mask_kwargs['clockwise']
107
+ mask_func = partial(self.generate_hull_mask, clockwise=clockwise, expand_ratio=expand_ratio, expand_iters=expand_iters, expand_lrtp=expand_lrtp)
108
+ elif mode == 'bbox':
109
+ mask_func = partial(self.generate_bbox_mask)
110
+ elif mode == 'bbox_expand':
111
+ expand_ratio, expand_iters, expand_lrtp = self.get_expand_params(mask_kwargs)
112
+ mask_func = partial(self.generate_bbox_mask, expand_ratio=expand_ratio, expand_iters=expand_iters, expand_lrtp=expand_lrtp)
113
+ else:
114
+ raise NotImplementedError
115
+ return mask_func
116
+
117
+
118
+ def generate_mask(self, mask, bbox, h, w, expand_ratio=None, expand_iters=None, expand_lrtp=None):
119
+ bin_mask = mask.astype(np.uint8)
120
+ if expand_ratio:
121
+ bin_mask = self.rand_expand_mask(bin_mask, bbox, h, w, expand_ratio, expand_iters, expand_lrtp)
122
+ return bin_mask
123
+
124
+
125
+ @staticmethod
126
+ def rand_expand_mask(mask, bbox, h, w, expand_ratio=None, expand_iters=None, expand_lrtp=None):
127
+ expand_ratio = 0.3 if expand_ratio is None else expand_ratio
128
+ expand_iters = random.randint(1, 10) if expand_iters is None else expand_iters
129
+ expand_lrtp = [random.random(), random.random(), random.random(), random.random()] if expand_lrtp is None else expand_lrtp
130
+ # print('iters', expand_iters, 'expand_ratio', expand_ratio, 'expand_lrtp', expand_lrtp)
131
+ # mask = np.squeeze(mask)
132
+ left, top, right, bottom = bbox
133
+ # mask expansion
134
+ box_w = (right - left + 1) * expand_ratio
135
+ box_h = (bottom - top + 1) * expand_ratio
136
+ left_, right_ = int(expand_lrtp[0] * min(box_w, left / 2) / expand_iters), int(
137
+ expand_lrtp[1] * min(box_w, (w - right) / 2) / expand_iters)
138
+ top_, bottom_ = int(expand_lrtp[2] * min(box_h, top / 2) / expand_iters), int(
139
+ expand_lrtp[3] * min(box_h, (h - bottom) / 2) / expand_iters)
140
+ kernel_size = max(left_, right_, top_, bottom_)
141
+ if kernel_size > 0:
142
+ kernel = np.zeros((kernel_size * 2, kernel_size * 2), dtype=np.uint8)
143
+ new_left, new_right = kernel_size - right_, kernel_size + left_
144
+ new_top, new_bottom = kernel_size - bottom_, kernel_size + top_
145
+ kernel[new_top:new_bottom + 1, new_left:new_right + 1] = 1
146
+ mask = mask.astype(np.uint8)
147
+ mask = cv2.dilate(mask, kernel, iterations=expand_iters).astype(np.uint8)
148
+ # mask = new_mask - (mask / 2).astype(np.uint8)
149
+ # mask = np.expand_dims(mask, axis=-1)
150
+ return mask
151
+
152
+
153
+ @staticmethod
154
+ def _convexhull(image, clockwise):
155
+ contours, hierarchy = cv2.findContours(image, 2, 1)
156
+ cnt = np.concatenate(contours) # merge all regions
157
+ hull = cv2.convexHull(cnt, clockwise=clockwise)
158
+ hull = np.squeeze(hull, axis=1).astype(np.float32).tolist()
159
+ hull = [tuple(x) for x in hull]
160
+ return hull # b, 1, 2
161
+
162
+ def generate_hull_mask(self, mask, bbox, h, w, clockwise=None, expand_ratio=None, expand_iters=None, expand_lrtp=None):
163
+ clockwise = random.choice([True, False]) if clockwise is None else clockwise
164
+ hull = self._convexhull(mask, clockwise)
165
+ mask_img = Image.new('L', (w, h), 0)
166
+ pt_list = hull
167
+ mask_img_draw = ImageDraw.Draw(mask_img)
168
+ mask_img_draw.polygon(pt_list, fill=255)
169
+ bin_mask = np.array(mask_img).astype(np.uint8)
170
+ if expand_ratio:
171
+ bin_mask = self.rand_expand_mask(bin_mask, bbox, h, w, expand_ratio, expand_iters, expand_lrtp)
172
+ return bin_mask
173
+
174
+
175
+ def generate_bbox_mask(self, mask, bbox, h, w, expand_ratio=None, expand_iters=None, expand_lrtp=None):
176
+ left, top, right, bottom = bbox
177
+ bin_mask = np.zeros((h, w), dtype=np.uint8)
178
+ bin_mask[top:bottom + 1, left:right + 1] = 255
179
+ if expand_ratio:
180
+ bin_mask = self.rand_expand_mask(bin_mask, bbox, h, w, expand_ratio, expand_iters, expand_lrtp)
181
+ return bin_mask
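A sketch of `MaskAugAnnotator` applying one of its augmentation modes to a plain rectangular mask; the `mask_cfg` values below are illustrative:

```python
import numpy as np
from vace.annotators.maskaug import MaskAugAnnotator

anno = MaskAugAnnotator(cfg={})
mask = np.zeros((480, 832), dtype=np.uint8)
mask[100:200, 150:400] = 255                        # a simple rectangular region
aug = anno.forward(mask, mask_cfg={'mode': 'hull_expand',
                                   'kwargs': {'expand_ratio': 0.2}})
print(aug.shape, aug.dtype)                         # (480, 832) uint8
```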
vace/annotators/midas/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
vace/annotators/midas/api.py ADDED
@@ -0,0 +1,166 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ # based on https://github.com/isl-org/MiDaS
4
+
5
+ import cv2
6
+ import torch
7
+ import torch.nn as nn
8
+ from torchvision.transforms import Compose
9
+
10
+ from .dpt_depth import DPTDepthModel
11
+ from .midas_net import MidasNet
12
+ from .midas_net_custom import MidasNet_small
13
+ from .transforms import NormalizeImage, PrepareForNet, Resize
14
+
15
+ # ISL_PATHS = {
16
+ # "dpt_large": "dpt_large-midas-2f21e586.pt",
17
+ # "dpt_hybrid": "dpt_hybrid-midas-501f0c75.pt",
18
+ # "midas_v21": "",
19
+ # "midas_v21_small": "",
20
+ # }
21
+
22
+ # remote_model_path =
23
+ # "https://huggingface.co/lllyasviel/ControlNet/resolve/main/annotator/ckpts/dpt_hybrid-midas-501f0c75.pt"
24
+
25
+
26
+ def disabled_train(self, mode=True):
27
+ """Overwrite model.train with this function to make sure train/eval mode
28
+ does not change anymore."""
29
+ return self
30
+
31
+
32
+ def load_midas_transform(model_type):
33
+ # https://github.com/isl-org/MiDaS/blob/master/run.py
34
+ # load transform only
35
+ if model_type == 'dpt_large': # DPT-Large
36
+ net_w, net_h = 384, 384
37
+ resize_mode = 'minimal'
38
+ normalization = NormalizeImage(mean=[0.5, 0.5, 0.5],
39
+ std=[0.5, 0.5, 0.5])
40
+
41
+ elif model_type == 'dpt_hybrid': # DPT-Hybrid
42
+ net_w, net_h = 384, 384
43
+ resize_mode = 'minimal'
44
+ normalization = NormalizeImage(mean=[0.5, 0.5, 0.5],
45
+ std=[0.5, 0.5, 0.5])
46
+
47
+ elif model_type == 'midas_v21':
48
+ net_w, net_h = 384, 384
49
+ resize_mode = 'upper_bound'
50
+ normalization = NormalizeImage(mean=[0.485, 0.456, 0.406],
51
+ std=[0.229, 0.224, 0.225])
52
+
53
+ elif model_type == 'midas_v21_small':
54
+ net_w, net_h = 256, 256
55
+ resize_mode = 'upper_bound'
56
+ normalization = NormalizeImage(mean=[0.485, 0.456, 0.406],
57
+ std=[0.229, 0.224, 0.225])
58
+
59
+ else:
60
+ assert False, f"model_type '{model_type}' not implemented, use: --model_type large"
61
+
62
+ transform = Compose([
63
+ Resize(
64
+ net_w,
65
+ net_h,
66
+ resize_target=None,
67
+ keep_aspect_ratio=True,
68
+ ensure_multiple_of=32,
69
+ resize_method=resize_mode,
70
+ image_interpolation_method=cv2.INTER_CUBIC,
71
+ ),
72
+ normalization,
73
+ PrepareForNet(),
74
+ ])
75
+
76
+ return transform
77
+
78
+
79
+ def load_model(model_type, model_path):
80
+ # https://github.com/isl-org/MiDaS/blob/master/run.py
81
+ # load network
82
+ # model_path = ISL_PATHS[model_type]
83
+ if model_type == 'dpt_large': # DPT-Large
84
+ model = DPTDepthModel(
85
+ path=model_path,
86
+ backbone='vitl16_384',
87
+ non_negative=True,
88
+ )
89
+ net_w, net_h = 384, 384
90
+ resize_mode = 'minimal'
91
+ normalization = NormalizeImage(mean=[0.5, 0.5, 0.5],
92
+ std=[0.5, 0.5, 0.5])
93
+
94
+ elif model_type == 'dpt_hybrid': # DPT-Hybrid
95
+ model = DPTDepthModel(
96
+ path=model_path,
97
+ backbone='vitb_rn50_384',
98
+ non_negative=True,
99
+ )
100
+ net_w, net_h = 384, 384
101
+ resize_mode = 'minimal'
102
+ normalization = NormalizeImage(mean=[0.5, 0.5, 0.5],
103
+ std=[0.5, 0.5, 0.5])
104
+
105
+ elif model_type == 'midas_v21':
106
+ model = MidasNet(model_path, non_negative=True)
107
+ net_w, net_h = 384, 384
108
+ resize_mode = 'upper_bound'
109
+ normalization = NormalizeImage(mean=[0.485, 0.456, 0.406],
110
+ std=[0.229, 0.224, 0.225])
111
+
112
+ elif model_type == 'midas_v21_small':
113
+ model = MidasNet_small(model_path,
114
+ features=64,
115
+ backbone='efficientnet_lite3',
116
+ exportable=True,
117
+ non_negative=True,
118
+ blocks={'expand': True})
119
+ net_w, net_h = 256, 256
120
+ resize_mode = 'upper_bound'
121
+ normalization = NormalizeImage(mean=[0.485, 0.456, 0.406],
122
+ std=[0.229, 0.224, 0.225])
123
+
124
+ else:
125
+ print(
126
+ f"model_type '{model_type}' not implemented, use: --model_type large"
127
+ )
128
+ assert False
129
+
130
+ transform = Compose([
131
+ Resize(
132
+ net_w,
133
+ net_h,
134
+ resize_target=None,
135
+ keep_aspect_ratio=True,
136
+ ensure_multiple_of=32,
137
+ resize_method=resize_mode,
138
+ image_interpolation_method=cv2.INTER_CUBIC,
139
+ ),
140
+ normalization,
141
+ PrepareForNet(),
142
+ ])
143
+
144
+ return model.eval(), transform
145
+
146
+
147
+ class MiDaSInference(nn.Module):
148
+ MODEL_TYPES_TORCH_HUB = ['DPT_Large', 'DPT_Hybrid', 'MiDaS_small']
149
+ MODEL_TYPES_ISL = [
150
+ 'dpt_large',
151
+ 'dpt_hybrid',
152
+ 'midas_v21',
153
+ 'midas_v21_small',
154
+ ]
155
+
156
+ def __init__(self, model_type, model_path):
157
+ super().__init__()
158
+ assert (model_type in self.MODEL_TYPES_ISL)
159
+ model, _ = load_model(model_type, model_path)
160
+ self.model = model
161
+ self.model.train = disabled_train
162
+
163
+ def forward(self, x):
164
+ with torch.no_grad():
165
+ prediction = self.model(x)
166
+ return prediction
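A hedged sketch of `MiDaSInference`; the checkpoint path is a placeholder (the file name mirrors the commented-out `remote_model_path` above), and the input is assumed to be an already-normalized 384x384 batch such as the one produced by `load_midas_transform('dpt_hybrid')`:

```python
import torch
from vace.annotators.midas.api import MiDaSInference

# Placeholder checkpoint location; download the dpt_hybrid weights separately.
model = MiDaSInference('dpt_hybrid', 'models/dpt_hybrid-midas-501f0c75.pt')
x = torch.randn(1, 3, 384, 384)      # stand-in for a normalized RGB batch
depth = model(x)                     # inverse-depth prediction
print(depth.shape)
```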
vace/annotators/midas/base_model.py ADDED
@@ -0,0 +1,18 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ import torch
4
+
5
+
6
+ class BaseModel(torch.nn.Module):
7
+ def load(self, path):
8
+ """Load model from file.
9
+
10
+ Args:
11
+ path (str): file path
12
+ """
13
+ parameters = torch.load(path, map_location=torch.device('cpu'), weights_only=True)
14
+
15
+ if 'optimizer' in parameters:
16
+ parameters = parameters['model']
17
+
18
+ self.load_state_dict(parameters)
vace/annotators/midas/blocks.py ADDED
@@ -0,0 +1,391 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ import torch
4
+ import torch.nn as nn
5
+
6
+ from .vit import (_make_pretrained_vitb16_384, _make_pretrained_vitb_rn50_384,
7
+ _make_pretrained_vitl16_384)
8
+
9
+
10
+ def _make_encoder(
11
+ backbone,
12
+ features,
13
+ use_pretrained,
14
+ groups=1,
15
+ expand=False,
16
+ exportable=True,
17
+ hooks=None,
18
+ use_vit_only=False,
19
+ use_readout='ignore',
20
+ ):
21
+ if backbone == 'vitl16_384':
22
+ pretrained = _make_pretrained_vitl16_384(use_pretrained,
23
+ hooks=hooks,
24
+ use_readout=use_readout)
25
+ scratch = _make_scratch(
26
+ [256, 512, 1024, 1024], features, groups=groups,
27
+ expand=expand) # ViT-L/16 - 85.0% Top1 (backbone)
28
+ elif backbone == 'vitb_rn50_384':
29
+ pretrained = _make_pretrained_vitb_rn50_384(
30
+ use_pretrained,
31
+ hooks=hooks,
32
+ use_vit_only=use_vit_only,
33
+ use_readout=use_readout,
34
+ )
35
+ scratch = _make_scratch(
36
+ [256, 512, 768, 768], features, groups=groups,
37
+ expand=expand) # ViT-H/16 - 85.0% Top1 (backbone)
38
+ elif backbone == 'vitb16_384':
39
+ pretrained = _make_pretrained_vitb16_384(use_pretrained,
40
+ hooks=hooks,
41
+ use_readout=use_readout)
42
+ scratch = _make_scratch(
43
+ [96, 192, 384, 768], features, groups=groups,
44
+ expand=expand) # ViT-B/16 - 84.6% Top1 (backbone)
45
+ elif backbone == 'resnext101_wsl':
46
+ pretrained = _make_pretrained_resnext101_wsl(use_pretrained)
47
+ scratch = _make_scratch([256, 512, 1024, 2048],
48
+ features,
49
+ groups=groups,
50
+ expand=expand) # efficientnet_lite3
51
+ elif backbone == 'efficientnet_lite3':
52
+ pretrained = _make_pretrained_efficientnet_lite3(use_pretrained,
53
+ exportable=exportable)
54
+ scratch = _make_scratch([32, 48, 136, 384],
55
+ features,
56
+ groups=groups,
57
+ expand=expand) # efficientnet_lite3
58
+ else:
59
+ raise NotImplementedError(f"Backbone '{backbone}' not implemented")
61
+
62
+ return pretrained, scratch
63
+
64
+
65
+ def _make_scratch(in_shape, out_shape, groups=1, expand=False):
66
+ scratch = nn.Module()
67
+
68
+ out_shape1 = out_shape
69
+ out_shape2 = out_shape
70
+ out_shape3 = out_shape
71
+ out_shape4 = out_shape
72
+ if expand is True:
73
+ out_shape1 = out_shape
74
+ out_shape2 = out_shape * 2
75
+ out_shape3 = out_shape * 4
76
+ out_shape4 = out_shape * 8
77
+
78
+ scratch.layer1_rn = nn.Conv2d(in_shape[0],
79
+ out_shape1,
80
+ kernel_size=3,
81
+ stride=1,
82
+ padding=1,
83
+ bias=False,
84
+ groups=groups)
85
+ scratch.layer2_rn = nn.Conv2d(in_shape[1],
86
+ out_shape2,
87
+ kernel_size=3,
88
+ stride=1,
89
+ padding=1,
90
+ bias=False,
91
+ groups=groups)
92
+ scratch.layer3_rn = nn.Conv2d(in_shape[2],
93
+ out_shape3,
94
+ kernel_size=3,
95
+ stride=1,
96
+ padding=1,
97
+ bias=False,
98
+ groups=groups)
99
+ scratch.layer4_rn = nn.Conv2d(in_shape[3],
100
+ out_shape4,
101
+ kernel_size=3,
102
+ stride=1,
103
+ padding=1,
104
+ bias=False,
105
+ groups=groups)
106
+
107
+ return scratch
108
+
109
+
110
+ def _make_pretrained_efficientnet_lite3(use_pretrained, exportable=False):
111
+ efficientnet = torch.hub.load('rwightman/gen-efficientnet-pytorch',
112
+ 'tf_efficientnet_lite3',
113
+ pretrained=use_pretrained,
114
+ exportable=exportable)
115
+ return _make_efficientnet_backbone(efficientnet)
116
+
117
+
118
+ def _make_efficientnet_backbone(effnet):
119
+ pretrained = nn.Module()
120
+
121
+ pretrained.layer1 = nn.Sequential(effnet.conv_stem, effnet.bn1,
122
+ effnet.act1, *effnet.blocks[0:2])
123
+ pretrained.layer2 = nn.Sequential(*effnet.blocks[2:3])
124
+ pretrained.layer3 = nn.Sequential(*effnet.blocks[3:5])
125
+ pretrained.layer4 = nn.Sequential(*effnet.blocks[5:9])
126
+
127
+ return pretrained
128
+
129
+
130
+ def _make_resnet_backbone(resnet):
131
+ pretrained = nn.Module()
132
+ pretrained.layer1 = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
133
+ resnet.maxpool, resnet.layer1)
134
+
135
+ pretrained.layer2 = resnet.layer2
136
+ pretrained.layer3 = resnet.layer3
137
+ pretrained.layer4 = resnet.layer4
138
+
139
+ return pretrained
140
+
141
+
142
+ def _make_pretrained_resnext101_wsl(use_pretrained):
143
+ resnet = torch.hub.load('facebookresearch/WSL-Images',
144
+ 'resnext101_32x8d_wsl')
145
+ return _make_resnet_backbone(resnet)
146
+
147
+
148
+ class Interpolate(nn.Module):
149
+ """Interpolation module.
150
+ """
151
+ def __init__(self, scale_factor, mode, align_corners=False):
152
+ """Init.
153
+
154
+ Args:
155
+ scale_factor (float): scaling
156
+ mode (str): interpolation mode
157
+ """
158
+ super(Interpolate, self).__init__()
159
+
160
+ self.interp = nn.functional.interpolate
161
+ self.scale_factor = scale_factor
162
+ self.mode = mode
163
+ self.align_corners = align_corners
164
+
165
+ def forward(self, x):
166
+ """Forward pass.
167
+
168
+ Args:
169
+ x (tensor): input
170
+
171
+ Returns:
172
+ tensor: interpolated data
173
+ """
174
+
175
+ x = self.interp(x,
176
+ scale_factor=self.scale_factor,
177
+ mode=self.mode,
178
+ align_corners=self.align_corners)
179
+
180
+ return x
181
+
182
+
183
+ class ResidualConvUnit(nn.Module):
184
+ """Residual convolution module.
185
+ """
186
+ def __init__(self, features):
187
+ """Init.
188
+
189
+ Args:
190
+ features (int): number of features
191
+ """
192
+ super().__init__()
193
+
194
+ self.conv1 = nn.Conv2d(features,
195
+ features,
196
+ kernel_size=3,
197
+ stride=1,
198
+ padding=1,
199
+ bias=True)
200
+
201
+ self.conv2 = nn.Conv2d(features,
202
+ features,
203
+ kernel_size=3,
204
+ stride=1,
205
+ padding=1,
206
+ bias=True)
207
+
208
+ self.relu = nn.ReLU(inplace=True)
209
+
210
+ def forward(self, x):
211
+ """Forward pass.
212
+
213
+ Args:
214
+ x (tensor): input
215
+
216
+ Returns:
217
+ tensor: output
218
+ """
219
+ out = self.relu(x)
220
+ out = self.conv1(out)
221
+ out = self.relu(out)
222
+ out = self.conv2(out)
223
+
224
+ return out + x
225
+
226
+
227
+ class FeatureFusionBlock(nn.Module):
228
+ """Feature fusion block.
229
+ """
230
+ def __init__(self, features):
231
+ """Init.
232
+
233
+ Args:
234
+ features (int): number of features
235
+ """
236
+ super(FeatureFusionBlock, self).__init__()
237
+
238
+ self.resConfUnit1 = ResidualConvUnit(features)
239
+ self.resConfUnit2 = ResidualConvUnit(features)
240
+
241
+ def forward(self, *xs):
242
+ """Forward pass.
243
+
244
+ Returns:
245
+ tensor: output
246
+ """
247
+ output = xs[0]
248
+
249
+ if len(xs) == 2:
250
+ output += self.resConfUnit1(xs[1])
251
+
252
+ output = self.resConfUnit2(output)
253
+
254
+ output = nn.functional.interpolate(output,
255
+ scale_factor=2,
256
+ mode='bilinear',
257
+ align_corners=True)
258
+
259
+ return output
260
+
261
+
262
+ class ResidualConvUnit_custom(nn.Module):
263
+ """Residual convolution module.
264
+ """
265
+ def __init__(self, features, activation, bn):
266
+ """Init.
267
+
268
+ Args:
269
+ features (int): number of features
270
+ """
271
+ super().__init__()
272
+
273
+ self.bn = bn
274
+
275
+ self.groups = 1
276
+
277
+ self.conv1 = nn.Conv2d(features,
278
+ features,
279
+ kernel_size=3,
280
+ stride=1,
281
+ padding=1,
282
+ bias=True,
283
+ groups=self.groups)
284
+
285
+ self.conv2 = nn.Conv2d(features,
286
+ features,
287
+ kernel_size=3,
288
+ stride=1,
289
+ padding=1,
290
+ bias=True,
291
+ groups=self.groups)
292
+
293
+ if self.bn is True:
294
+ self.bn1 = nn.BatchNorm2d(features)
295
+ self.bn2 = nn.BatchNorm2d(features)
296
+
297
+ self.activation = activation
298
+
299
+ self.skip_add = nn.quantized.FloatFunctional()
300
+
301
+ def forward(self, x):
302
+ """Forward pass.
303
+
304
+ Args:
305
+ x (tensor): input
306
+
307
+ Returns:
308
+ tensor: output
309
+ """
310
+
311
+ out = self.activation(x)
312
+ out = self.conv1(out)
313
+ if self.bn is True:
314
+ out = self.bn1(out)
315
+
316
+ out = self.activation(out)
317
+ out = self.conv2(out)
318
+ if self.bn is True:
319
+ out = self.bn2(out)
320
+
321
+ if self.groups > 1:
322
+ out = self.conv_merge(out)
323
+
324
+ return self.skip_add.add(out, x)
325
+
326
+ # return out + x
327
+
328
+
329
+ class FeatureFusionBlock_custom(nn.Module):
330
+ """Feature fusion block.
331
+ """
332
+ def __init__(self,
333
+ features,
334
+ activation,
335
+ deconv=False,
336
+ bn=False,
337
+ expand=False,
338
+ align_corners=True):
339
+ """Init.
340
+
341
+ Args:
342
+ features (int): number of features
343
+ """
344
+ super(FeatureFusionBlock_custom, self).__init__()
345
+
346
+ self.deconv = deconv
347
+ self.align_corners = align_corners
348
+
349
+ self.groups = 1
350
+
351
+ self.expand = expand
352
+ out_features = features
353
+ if self.expand is True:
354
+ out_features = features // 2
355
+
356
+ self.out_conv = nn.Conv2d(features,
357
+ out_features,
358
+ kernel_size=1,
359
+ stride=1,
360
+ padding=0,
361
+ bias=True,
362
+ groups=1)
363
+
364
+ self.resConfUnit1 = ResidualConvUnit_custom(features, activation, bn)
365
+ self.resConfUnit2 = ResidualConvUnit_custom(features, activation, bn)
366
+
367
+ self.skip_add = nn.quantized.FloatFunctional()
368
+
369
+ def forward(self, *xs):
370
+ """Forward pass.
371
+
372
+ Returns:
373
+ tensor: output
374
+ """
375
+ output = xs[0]
376
+
377
+ if len(xs) == 2:
378
+ res = self.resConfUnit1(xs[1])
379
+ output = self.skip_add.add(output, res)
380
+ # output += res
381
+
382
+ output = self.resConfUnit2(output)
383
+
384
+ output = nn.functional.interpolate(output,
385
+ scale_factor=2,
386
+ mode='bilinear',
387
+ align_corners=self.align_corners)
388
+
389
+ output = self.out_conv(output)
390
+
391
+ return output
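
To make the decoder contract in this file concrete, here is a small shape check for `FeatureFusionBlock_custom`. This is only a sketch; it assumes `timm` is installed, since importing `blocks` pulls in the ViT helpers from `.vit`.

```python
import torch
import torch.nn as nn
from vace.annotators.midas.blocks import FeatureFusionBlock_custom

block = FeatureFusionBlock_custom(features=64, activation=nn.ReLU(False), bn=False)
deep = torch.randn(1, 64, 12, 12)   # coarser decoder path
skip = torch.randn(1, 64, 12, 12)   # encoder skip feature at the same resolution
out = block(deep, skip)
print(out.shape)                    # torch.Size([1, 64, 24, 24]): fused, then upsampled 2x
```
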
vace/annotators/midas/dpt_depth.py ADDED
@@ -0,0 +1,107 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ import torch
4
+ import torch.nn as nn
5
+
6
+ from .base_model import BaseModel
7
+ from .blocks import FeatureFusionBlock_custom, Interpolate, _make_encoder
8
+ from .vit import forward_vit
9
+
10
+
11
+ def _make_fusion_block(features, use_bn):
12
+ return FeatureFusionBlock_custom(
13
+ features,
14
+ nn.ReLU(False),
15
+ deconv=False,
16
+ bn=use_bn,
17
+ expand=False,
18
+ align_corners=True,
19
+ )
20
+
21
+
22
+ class DPT(BaseModel):
23
+ def __init__(
24
+ self,
25
+ head,
26
+ features=256,
27
+ backbone='vitb_rn50_384',
28
+ readout='project',
29
+ channels_last=False,
30
+ use_bn=False,
31
+ ):
32
+
33
+ super(DPT, self).__init__()
34
+
35
+ self.channels_last = channels_last
36
+
37
+ hooks = {
38
+ 'vitb_rn50_384': [0, 1, 8, 11],
39
+ 'vitb16_384': [2, 5, 8, 11],
40
+ 'vitl16_384': [5, 11, 17, 23],
41
+ }
42
+
43
+ # Instantiate backbone and reassemble blocks
44
+ self.pretrained, self.scratch = _make_encoder(
45
+ backbone,
46
+ features,
47
+ False,  # use_pretrained: set to True to load ImageNet-pretrained backbone weights
48
+ groups=1,
49
+ expand=False,
50
+ exportable=False,
51
+ hooks=hooks[backbone],
52
+ use_readout=readout,
53
+ )
54
+
55
+ self.scratch.refinenet1 = _make_fusion_block(features, use_bn)
56
+ self.scratch.refinenet2 = _make_fusion_block(features, use_bn)
57
+ self.scratch.refinenet3 = _make_fusion_block(features, use_bn)
58
+ self.scratch.refinenet4 = _make_fusion_block(features, use_bn)
59
+
60
+ self.scratch.output_conv = head
61
+
62
+ def forward(self, x):
63
+ if self.channels_last is True:
64
+ x.contiguous(memory_format=torch.channels_last)
65
+
66
+ layer_1, layer_2, layer_3, layer_4 = forward_vit(self.pretrained, x)
67
+
68
+ layer_1_rn = self.scratch.layer1_rn(layer_1)
69
+ layer_2_rn = self.scratch.layer2_rn(layer_2)
70
+ layer_3_rn = self.scratch.layer3_rn(layer_3)
71
+ layer_4_rn = self.scratch.layer4_rn(layer_4)
72
+
73
+ path_4 = self.scratch.refinenet4(layer_4_rn)
74
+ path_3 = self.scratch.refinenet3(path_4, layer_3_rn)
75
+ path_2 = self.scratch.refinenet2(path_3, layer_2_rn)
76
+ path_1 = self.scratch.refinenet1(path_2, layer_1_rn)
77
+
78
+ out = self.scratch.output_conv(path_1)
79
+
80
+ return out
81
+
82
+
83
+ class DPTDepthModel(DPT):
84
+ def __init__(self, path=None, non_negative=True, **kwargs):
85
+ features = kwargs['features'] if 'features' in kwargs else 256
86
+
87
+ head = nn.Sequential(
88
+ nn.Conv2d(features,
89
+ features // 2,
90
+ kernel_size=3,
91
+ stride=1,
92
+ padding=1),
93
+ Interpolate(scale_factor=2, mode='bilinear', align_corners=True),
94
+ nn.Conv2d(features // 2, 32, kernel_size=3, stride=1, padding=1),
95
+ nn.ReLU(True),
96
+ nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0),
97
+ nn.ReLU(True) if non_negative else nn.Identity(),
98
+ nn.Identity(),
99
+ )
100
+
101
+ super().__init__(head, **kwargs)
102
+
103
+ if path is not None:
104
+ self.load(path)
105
+
106
+ def forward(self, x):
107
+ return super().forward(x).squeeze(dim=1)
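
A quick wiring check for `DPTDepthModel`, as a sketch only: it assumes `timm` is available to build the hybrid ViT backbone, and with `path=None` the weights are random, so only the shapes are meaningful.

```python
import torch
from vace.annotators.midas.dpt_depth import DPTDepthModel

model = DPTDepthModel(path=None, non_negative=True).eval()   # default vitb_rn50_384 backbone
with torch.no_grad():
    depth = model(torch.randn(1, 3, 384, 384))
print(depth.shape)   # torch.Size([1, 384, 384]): one inverse-depth map per input image
```
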
vace/annotators/midas/midas_net.py ADDED
@@ -0,0 +1,80 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ """MidasNet: Network for monocular depth estimation trained by mixing several datasets.
4
+ This file contains code that is adapted from
5
+ https://github.com/thomasjpfan/pytorch_refinenet/blob/master/pytorch_refinenet/refinenet/refinenet_4cascade.py
6
+ """
7
+ import torch
8
+ import torch.nn as nn
9
+
10
+ from .base_model import BaseModel
11
+ from .blocks import FeatureFusionBlock, Interpolate, _make_encoder
12
+
13
+
14
+ class MidasNet(BaseModel):
15
+ """Network for monocular depth estimation.
16
+ """
17
+ def __init__(self, path=None, features=256, non_negative=True):
18
+ """Init.
19
+
20
+ Args:
21
+ path (str, optional): Path to saved model. Defaults to None.
22
+ features (int, optional): Number of features. Defaults to 256.
23
+ non_negative (bool, optional): If True, clamp the output to be non-negative. Defaults to True.
24
+ """
25
+ print('Loading weights: ', path)
26
+
27
+ super(MidasNet, self).__init__()
28
+
29
+ use_pretrained = False if path is None else True
30
+
31
+ self.pretrained, self.scratch = _make_encoder(
32
+ backbone='resnext101_wsl',
33
+ features=features,
34
+ use_pretrained=use_pretrained)
35
+
36
+ self.scratch.refinenet4 = FeatureFusionBlock(features)
37
+ self.scratch.refinenet3 = FeatureFusionBlock(features)
38
+ self.scratch.refinenet2 = FeatureFusionBlock(features)
39
+ self.scratch.refinenet1 = FeatureFusionBlock(features)
40
+
41
+ self.scratch.output_conv = nn.Sequential(
42
+ nn.Conv2d(features, 128, kernel_size=3, stride=1, padding=1),
43
+ Interpolate(scale_factor=2, mode='bilinear'),
44
+ nn.Conv2d(128, 32, kernel_size=3, stride=1, padding=1),
45
+ nn.ReLU(True),
46
+ nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0),
47
+ nn.ReLU(True) if non_negative else nn.Identity(),
48
+ )
49
+
50
+ if path:
51
+ self.load(path)
52
+
53
+ def forward(self, x):
54
+ """Forward pass.
55
+
56
+ Args:
57
+ x (tensor): input data (image)
58
+
59
+ Returns:
60
+ tensor: depth
61
+ """
62
+
63
+ layer_1 = self.pretrained.layer1(x)
64
+ layer_2 = self.pretrained.layer2(layer_1)
65
+ layer_3 = self.pretrained.layer3(layer_2)
66
+ layer_4 = self.pretrained.layer4(layer_3)
67
+
68
+ layer_1_rn = self.scratch.layer1_rn(layer_1)
69
+ layer_2_rn = self.scratch.layer2_rn(layer_2)
70
+ layer_3_rn = self.scratch.layer3_rn(layer_3)
71
+ layer_4_rn = self.scratch.layer4_rn(layer_4)
72
+
73
+ path_4 = self.scratch.refinenet4(layer_4_rn)
74
+ path_3 = self.scratch.refinenet3(path_4, layer_3_rn)
75
+ path_2 = self.scratch.refinenet2(path_3, layer_2_rn)
76
+ path_1 = self.scratch.refinenet1(path_2, layer_1_rn)
77
+
78
+ out = self.scratch.output_conv(path_1)
79
+
80
+ return torch.squeeze(out, dim=1)
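
The decoder loop in `MidasNet.forward` relies on each `FeatureFusionBlock` doubling the spatial resolution; below is a small sketch of that contract (it assumes `timm` is installed, because importing `blocks` pulls in the ViT helpers).

```python
import torch
from vace.annotators.midas.blocks import FeatureFusionBlock

fuse4 = FeatureFusionBlock(256)
fuse3 = FeatureFusionBlock(256)
layer_4_rn = torch.randn(1, 256, 12, 12)   # deepest projected encoder feature
layer_3_rn = torch.randn(1, 256, 24, 24)   # next encoder feature, twice the resolution
path_4 = fuse4(layer_4_rn)                 # -> (1, 256, 24, 24)
path_3 = fuse3(path_4, layer_3_rn)         # -> (1, 256, 48, 48)
```
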
vace/annotators/midas/midas_net_custom.py ADDED
@@ -0,0 +1,167 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ """MidasNet: Network for monocular depth estimation trained by mixing several datasets.
4
+ This file contains code that is adapted from
5
+ https://github.com/thomasjpfan/pytorch_refinenet/blob/master/pytorch_refinenet/refinenet/refinenet_4cascade.py
6
+ """
7
+ import torch
8
+ import torch.nn as nn
9
+
10
+ from .base_model import BaseModel
11
+ from .blocks import FeatureFusionBlock_custom, Interpolate, _make_encoder
12
+
13
+
14
+ class MidasNet_small(BaseModel):
15
+ """Network for monocular depth estimation.
16
+ """
17
+ def __init__(self,
18
+ path=None,
19
+ features=64,
20
+ backbone='efficientnet_lite3',
21
+ non_negative=True,
22
+ exportable=True,
23
+ channels_last=False,
24
+ align_corners=True,
25
+ blocks={'expand': True}):
26
+ """Init.
27
+
28
+ Args:
29
+ path (str, optional): Path to saved model. Defaults to None.
30
+ features (int, optional): Number of features. Defaults to 64.
31
+ backbone (str, optional): Backbone network for encoder. Defaults to efficientnet_lite3.
32
+ """
33
+ print('Loading weights: ', path)
34
+
35
+ super(MidasNet_small, self).__init__()
36
+
37
+ use_pretrained = False if path else True
38
+
39
+ self.channels_last = channels_last
40
+ self.blocks = blocks
41
+ self.backbone = backbone
42
+
43
+ self.groups = 1
44
+
45
+ features1 = features
46
+ features2 = features
47
+ features3 = features
48
+ features4 = features
49
+ self.expand = False
50
+ if 'expand' in self.blocks and self.blocks['expand'] is True:
51
+ self.expand = True
52
+ features1 = features
53
+ features2 = features * 2
54
+ features3 = features * 4
55
+ features4 = features * 8
56
+
57
+ self.pretrained, self.scratch = _make_encoder(self.backbone,
58
+ features,
59
+ use_pretrained,
60
+ groups=self.groups,
61
+ expand=self.expand,
62
+ exportable=exportable)
63
+
64
+ self.scratch.activation = nn.ReLU(False)
65
+
66
+ self.scratch.refinenet4 = FeatureFusionBlock_custom(
67
+ features4,
68
+ self.scratch.activation,
69
+ deconv=False,
70
+ bn=False,
71
+ expand=self.expand,
72
+ align_corners=align_corners)
73
+ self.scratch.refinenet3 = FeatureFusionBlock_custom(
74
+ features3,
75
+ self.scratch.activation,
76
+ deconv=False,
77
+ bn=False,
78
+ expand=self.expand,
79
+ align_corners=align_corners)
80
+ self.scratch.refinenet2 = FeatureFusionBlock_custom(
81
+ features2,
82
+ self.scratch.activation,
83
+ deconv=False,
84
+ bn=False,
85
+ expand=self.expand,
86
+ align_corners=align_corners)
87
+ self.scratch.refinenet1 = FeatureFusionBlock_custom(
88
+ features1,
89
+ self.scratch.activation,
90
+ deconv=False,
91
+ bn=False,
92
+ align_corners=align_corners)
93
+
94
+ self.scratch.output_conv = nn.Sequential(
95
+ nn.Conv2d(features,
96
+ features // 2,
97
+ kernel_size=3,
98
+ stride=1,
99
+ padding=1,
100
+ groups=self.groups),
101
+ Interpolate(scale_factor=2, mode='bilinear'),
102
+ nn.Conv2d(features // 2, 32, kernel_size=3, stride=1, padding=1),
103
+ self.scratch.activation,
104
+ nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0),
105
+ nn.ReLU(True) if non_negative else nn.Identity(),
106
+ nn.Identity(),
107
+ )
108
+
109
+ if path:
110
+ self.load(path)
111
+
112
+ def forward(self, x):
113
+ """Forward pass.
114
+
115
+ Args:
116
+ x (tensor): input data (image)
117
+
118
+ Returns:
119
+ tensor: depth
120
+ """
121
+ if self.channels_last is True:
122
+ print('self.channels_last = ', self.channels_last)
123
+ x.contiguous(memory_format=torch.channels_last)
124
+
125
+ layer_1 = self.pretrained.layer1(x)
126
+ layer_2 = self.pretrained.layer2(layer_1)
127
+ layer_3 = self.pretrained.layer3(layer_2)
128
+ layer_4 = self.pretrained.layer4(layer_3)
129
+
130
+ layer_1_rn = self.scratch.layer1_rn(layer_1)
131
+ layer_2_rn = self.scratch.layer2_rn(layer_2)
132
+ layer_3_rn = self.scratch.layer3_rn(layer_3)
133
+ layer_4_rn = self.scratch.layer4_rn(layer_4)
134
+
135
+ path_4 = self.scratch.refinenet4(layer_4_rn)
136
+ path_3 = self.scratch.refinenet3(path_4, layer_3_rn)
137
+ path_2 = self.scratch.refinenet2(path_3, layer_2_rn)
138
+ path_1 = self.scratch.refinenet1(path_2, layer_1_rn)
139
+
140
+ out = self.scratch.output_conv(path_1)
141
+
142
+ return torch.squeeze(out, dim=1)
143
+
144
+
145
+ def fuse_model(m):
146
+ prev_previous_type = nn.Identity()
147
+ prev_previous_name = ''
148
+ previous_type = nn.Identity()
149
+ previous_name = ''
150
+ for name, module in m.named_modules():
151
+ if prev_previous_type == nn.Conv2d and previous_type == nn.BatchNorm2d and type(
152
+ module) == nn.ReLU:
153
+ # print("FUSED ", prev_previous_name, previous_name, name)
154
+ torch.quantization.fuse_modules(
155
+ m, [prev_previous_name, previous_name, name], inplace=True)
156
+ elif prev_previous_type == nn.Conv2d and previous_type == nn.BatchNorm2d:
157
+ # print("FUSED ", prev_previous_name, previous_name)
158
+ torch.quantization.fuse_modules(
159
+ m, [prev_previous_name, previous_name], inplace=True)
160
+ # elif previous_type == nn.Conv2d and type(module) == nn.ReLU:
161
+ # print("FUSED ", previous_name, name)
162
+ # torch.quantization.fuse_modules(m, [previous_name, name], inplace=True)
163
+
164
+ prev_previous_type = previous_type
165
+ prev_previous_name = previous_name
166
+ previous_type = type(module)
167
+ previous_name = name
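
`fuse_model` folds Conv+BN(+ReLU) triples in preparation for quantization. A self-contained sketch on a toy module follows (fusion is only valid in eval mode; importing this module also pulls in the ViT helpers, so `timm` is assumed to be installed).

```python
import torch.nn as nn
from vace.annotators.midas.midas_net_custom import fuse_model

m = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
m.eval()        # folding BN into the conv is only valid outside training
fuse_model(m)
print(m)        # BN is folded into the conv; the BN/ReLU slots become Identity
```
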
vace/annotators/midas/transforms.py ADDED
@@ -0,0 +1,231 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ import math
4
+
5
+ import cv2
6
+ import numpy as np
7
+
8
+
9
+ """Resize the sample to be at least as large as the given size. Keeps aspect ratio.
10
+ """Rezise the sample to ensure the given size. Keeps aspect ratio.
11
+
12
+ Args:
13
+ sample (dict): sample
14
+ size (tuple): image size
15
+
16
+ Returns:
17
+ tuple: new size
18
+ """
19
+ shape = list(sample['disparity'].shape)
20
+
21
+ if shape[0] >= size[0] and shape[1] >= size[1]:
22
+ return sample
23
+
24
+ scale = [0, 0]
25
+ scale[0] = size[0] / shape[0]
26
+ scale[1] = size[1] / shape[1]
27
+
28
+ scale = max(scale)
29
+
30
+ shape[0] = math.ceil(scale * shape[0])
31
+ shape[1] = math.ceil(scale * shape[1])
32
+
33
+ # resize
34
+ sample['image'] = cv2.resize(sample['image'],
35
+ tuple(shape[::-1]),
36
+ interpolation=image_interpolation_method)
37
+
38
+ sample['disparity'] = cv2.resize(sample['disparity'],
39
+ tuple(shape[::-1]),
40
+ interpolation=cv2.INTER_NEAREST)
41
+ sample['mask'] = cv2.resize(
42
+ sample['mask'].astype(np.float32),
43
+ tuple(shape[::-1]),
44
+ interpolation=cv2.INTER_NEAREST,
45
+ )
46
+ sample['mask'] = sample['mask'].astype(bool)
47
+
48
+ return tuple(shape)
49
+
50
+
51
+ class Resize(object):
52
+ """Resize sample to given size (width, height).
53
+ """
54
+ def __init__(
55
+ self,
56
+ width,
57
+ height,
58
+ resize_target=True,
59
+ keep_aspect_ratio=False,
60
+ ensure_multiple_of=1,
61
+ resize_method='lower_bound',
62
+ image_interpolation_method=cv2.INTER_AREA,
63
+ ):
64
+ """Init.
65
+
66
+ Args:
67
+ width (int): desired output width
68
+ height (int): desired output height
69
+ resize_target (bool, optional):
70
+ True: Resize the full sample (image, mask, target).
71
+ False: Resize image only.
72
+ Defaults to True.
73
+ keep_aspect_ratio (bool, optional):
74
+ True: Keep the aspect ratio of the input sample.
75
+ Output sample might not have the given width and height, and
76
+ resize behaviour depends on the parameter 'resize_method'.
77
+ Defaults to False.
78
+ ensure_multiple_of (int, optional):
79
+ Output width and height is constrained to be multiple of this parameter.
80
+ Defaults to 1.
81
+ resize_method (str, optional):
82
+ "lower_bound": Output will be at least as large as the given size.
83
+ "upper_bound": Output will be at most as large as the given size.
84
+ (Output size might be smaller than given size.)
85
+ "minimal": Scale as little as possible. (Output size might be smaller than given size.)
86
+ Defaults to "lower_bound".
87
+ """
88
+ self.__width = width
89
+ self.__height = height
90
+
91
+ self.__resize_target = resize_target
92
+ self.__keep_aspect_ratio = keep_aspect_ratio
93
+ self.__multiple_of = ensure_multiple_of
94
+ self.__resize_method = resize_method
95
+ self.__image_interpolation_method = image_interpolation_method
96
+
97
+ def constrain_to_multiple_of(self, x, min_val=0, max_val=None):
98
+ y = (np.round(x / self.__multiple_of) * self.__multiple_of).astype(int)
99
+
100
+ if max_val is not None and y > max_val:
101
+ y = (np.floor(x / self.__multiple_of) *
102
+ self.__multiple_of).astype(int)
103
+
104
+ if y < min_val:
105
+ y = (np.ceil(x / self.__multiple_of) *
106
+ self.__multiple_of).astype(int)
107
+
108
+ return y
109
+
110
+ def get_size(self, width, height):
111
+ # determine new height and width
112
+ scale_height = self.__height / height
113
+ scale_width = self.__width / width
114
+
115
+ if self.__keep_aspect_ratio:
116
+ if self.__resize_method == 'lower_bound':
117
+ # scale such that output size is lower bound
118
+ if scale_width > scale_height:
119
+ # fit width
120
+ scale_height = scale_width
121
+ else:
122
+ # fit height
123
+ scale_width = scale_height
124
+ elif self.__resize_method == 'upper_bound':
125
+ # scale such that output size is upper bound
126
+ if scale_width < scale_height:
127
+ # fit width
128
+ scale_height = scale_width
129
+ else:
130
+ # fit height
131
+ scale_width = scale_height
132
+ elif self.__resize_method == 'minimal':
133
+ # scale as least as possbile
134
+ if abs(1 - scale_width) < abs(1 - scale_height):
135
+ # fit width
136
+ scale_height = scale_width
137
+ else:
138
+ # fit height
139
+ scale_width = scale_height
140
+ else:
141
+ raise ValueError(
142
+ f'resize_method {self.__resize_method} not implemented')
143
+
144
+ if self.__resize_method == 'lower_bound':
145
+ new_height = self.constrain_to_multiple_of(scale_height * height,
146
+ min_val=self.__height)
147
+ new_width = self.constrain_to_multiple_of(scale_width * width,
148
+ min_val=self.__width)
149
+ elif self.__resize_method == 'upper_bound':
150
+ new_height = self.constrain_to_multiple_of(scale_height * height,
151
+ max_val=self.__height)
152
+ new_width = self.constrain_to_multiple_of(scale_width * width,
153
+ max_val=self.__width)
154
+ elif self.__resize_method == 'minimal':
155
+ new_height = self.constrain_to_multiple_of(scale_height * height)
156
+ new_width = self.constrain_to_multiple_of(scale_width * width)
157
+ else:
158
+ raise ValueError(
159
+ f'resize_method {self.__resize_method} not implemented')
160
+
161
+ return (new_width, new_height)
162
+
163
+ def __call__(self, sample):
164
+ width, height = self.get_size(sample['image'].shape[1],
165
+ sample['image'].shape[0])
166
+
167
+ # resize sample
168
+ sample['image'] = cv2.resize(
169
+ sample['image'],
170
+ (width, height),
171
+ interpolation=self.__image_interpolation_method,
172
+ )
173
+
174
+ if self.__resize_target:
175
+ if 'disparity' in sample:
176
+ sample['disparity'] = cv2.resize(
177
+ sample['disparity'],
178
+ (width, height),
179
+ interpolation=cv2.INTER_NEAREST,
180
+ )
181
+
182
+ if 'depth' in sample:
183
+ sample['depth'] = cv2.resize(sample['depth'], (width, height),
184
+ interpolation=cv2.INTER_NEAREST)
185
+
186
+ sample['mask'] = cv2.resize(
187
+ sample['mask'].astype(np.float32),
188
+ (width, height),
189
+ interpolation=cv2.INTER_NEAREST,
190
+ )
191
+ sample['mask'] = sample['mask'].astype(bool)
192
+
193
+ return sample
194
+
195
+
196
+ class NormalizeImage(object):
197
+ """Normalize image by given mean and std.
198
+ """
199
+ def __init__(self, mean, std):
200
+ self.__mean = mean
201
+ self.__std = std
202
+
203
+ def __call__(self, sample):
204
+ sample['image'] = (sample['image'] - self.__mean) / self.__std
205
+
206
+ return sample
207
+
208
+
209
+ class PrepareForNet(object):
210
+ """Prepare sample for usage as network input.
211
+ """
212
+ def __init__(self):
213
+ pass
214
+
215
+ def __call__(self, sample):
216
+ image = np.transpose(sample['image'], (2, 0, 1))
217
+ sample['image'] = np.ascontiguousarray(image).astype(np.float32)
218
+
219
+ if 'mask' in sample:
220
+ sample['mask'] = sample['mask'].astype(np.float32)
221
+ sample['mask'] = np.ascontiguousarray(sample['mask'])
222
+
223
+ if 'disparity' in sample:
224
+ disparity = sample['disparity'].astype(np.float32)
225
+ sample['disparity'] = np.ascontiguousarray(disparity)
226
+
227
+ if 'depth' in sample:
228
+ depth = sample['depth'].astype(np.float32)
229
+ sample['depth'] = np.ascontiguousarray(depth)
230
+
231
+ return sample
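
The three transforms are meant to be chained; here is a runnable sketch on a dummy sample. The ImageNet mean/std values are an example choice, not taken from this file.

```python
import numpy as np
from vace.annotators.midas.transforms import Resize, NormalizeImage, PrepareForNet

resize = Resize(384, 384, resize_target=False, keep_aspect_ratio=True,
                ensure_multiple_of=32, resize_method='lower_bound')
normalize = NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
prepare = PrepareForNet()

sample = {'image': np.random.rand(480, 640, 3).astype(np.float32)}
sample = prepare(normalize(resize(sample)))
print(sample['image'].shape, sample['image'].dtype)   # (3, 384, 512) float32: CHW, sides multiples of 32
```
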
vace/annotators/midas/utils.py ADDED
@@ -0,0 +1,193 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) Alibaba, Inc. and its affiliates.
3
+ """Utils for monoDepth."""
4
+ import re
5
+ import sys
6
+
7
+ import cv2
8
+ import numpy as np
9
+ import torch
10
+
11
+
12
+ def read_pfm(path):
13
+ """Read pfm file.
14
+
15
+ Args:
16
+ path (str): path to file
17
+
18
+ Returns:
19
+ tuple: (data, scale)
20
+ """
21
+ with open(path, 'rb') as file:
22
+
23
+ color = None
24
+ width = None
25
+ height = None
26
+ scale = None
27
+ endian = None
28
+
29
+ header = file.readline().rstrip()
30
+ if header.decode('ascii') == 'PF':
31
+ color = True
32
+ elif header.decode('ascii') == 'Pf':
33
+ color = False
34
+ else:
35
+ raise Exception('Not a PFM file: ' + path)
36
+
37
+ dim_match = re.match(r'^(\d+)\s(\d+)\s$',
38
+ file.readline().decode('ascii'))
39
+ if dim_match:
40
+ width, height = list(map(int, dim_match.groups()))
41
+ else:
42
+ raise Exception('Malformed PFM header.')
43
+
44
+ scale = float(file.readline().decode('ascii').rstrip())
45
+ if scale < 0:
46
+ # little-endian
47
+ endian = '<'
48
+ scale = -scale
49
+ else:
50
+ # big-endian
51
+ endian = '>'
52
+
53
+ data = np.fromfile(file, endian + 'f')
54
+ shape = (height, width, 3) if color else (height, width)
55
+
56
+ data = np.reshape(data, shape)
57
+ data = np.flipud(data)
58
+
59
+ return data, scale
60
+
61
+
62
+ def write_pfm(path, image, scale=1):
63
+ """Write pfm file.
64
+
65
+ Args:
66
+ path (str): path to file
67
+ image (array): data
68
+ scale (int, optional): Scale. Defaults to 1.
69
+ """
70
+
71
+ with open(path, 'wb') as file:
72
+ color = None
73
+
74
+ if image.dtype.name != 'float32':
75
+ raise Exception('Image dtype must be float32.')
76
+
77
+ image = np.flipud(image)
78
+
79
+ if len(image.shape) == 3 and image.shape[2] == 3: # color image
80
+ color = True
81
+ elif (len(image.shape) == 2
82
+ or len(image.shape) == 3 and image.shape[2] == 1): # greyscale
83
+ color = False
84
+ else:
85
+ raise Exception(
86
+ 'Image must have H x W x 3, H x W x 1 or H x W dimensions.')
87
+
88
+ file.write(('PF\n' if color else 'Pf\n').encode())
89
+ file.write('%d %d\n'.encode() % (image.shape[1], image.shape[0]))
90
+
91
+ endian = image.dtype.byteorder
92
+
93
+ if endian == '<' or endian == '=' and sys.byteorder == 'little':
94
+ scale = -scale
95
+
96
+ file.write('%f\n'.encode() % scale)
97
+
98
+ image.tofile(file)
99
+
100
+
101
+ def read_image(path):
102
+ """Read image and output RGB image (0-1).
103
+
104
+ Args:
105
+ path (str): path to file
106
+
107
+ Returns:
108
+ array: RGB image (0-1)
109
+ """
110
+ img = cv2.imread(path)
111
+
112
+ if img.ndim == 2:
113
+ img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
114
+
115
+ img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) / 255.0
116
+
117
+ return img
118
+
119
+
120
+ def resize_image(img):
121
+ """Resize image and make it fit for network.
122
+
123
+ Args:
124
+ img (array): image
125
+
126
+ Returns:
127
+ tensor: data ready for network
128
+ """
129
+ height_orig = img.shape[0]
130
+ width_orig = img.shape[1]
131
+
132
+ if width_orig > height_orig:
133
+ scale = width_orig / 384
134
+ else:
135
+ scale = height_orig / 384
136
+
137
+ height = (np.ceil(height_orig / scale / 32) * 32).astype(int)
138
+ width = (np.ceil(width_orig / scale / 32) * 32).astype(int)
139
+
140
+ img_resized = cv2.resize(img, (width, height),
141
+ interpolation=cv2.INTER_AREA)
142
+
143
+ img_resized = (torch.from_numpy(np.transpose(
144
+ img_resized, (2, 0, 1))).contiguous().float())
145
+ img_resized = img_resized.unsqueeze(0)
146
+
147
+ return img_resized
148
+
149
+
150
+ def resize_depth(depth, width, height):
151
+ """Resize depth map and bring to CPU (numpy).
152
+
153
+ Args:
154
+ depth (tensor): depth
155
+ width (int): image width
156
+ height (int): image height
157
+
158
+ Returns:
159
+ array: processed depth
160
+ """
161
+ depth = torch.squeeze(depth[0, :, :, :]).to('cpu')
162
+
163
+ depth_resized = cv2.resize(depth.numpy(), (width, height),
164
+ interpolation=cv2.INTER_CUBIC)
165
+
166
+ return depth_resized
167
+
168
+
169
+ def write_depth(path, depth, bits=1):
170
+ """Write depth map to pfm and png file.
171
+
172
+ Args:
173
+ path (str): filepath without extension
174
+ depth (array): depth
175
+ """
176
+ write_pfm(path + '.pfm', depth.astype(np.float32))
177
+
178
+ depth_min = depth.min()
179
+ depth_max = depth.max()
180
+
181
+ max_val = (2**(8 * bits)) - 1
182
+
183
+ if depth_max - depth_min > np.finfo('float').eps:
184
+ out = max_val * (depth - depth_min) / (depth_max - depth_min)
185
+ else:
186
+ out = np.zeros(depth.shape, dtype=depth.dtype)
187
+
188
+ if bits == 1:
189
+ cv2.imwrite(path + '.png', out.astype('uint8'))
190
+ elif bits == 2:
191
+ cv2.imwrite(path + '.png', out.astype('uint16'))
192
+
193
+ return
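
Finally, a small round-trip sketch for the PFM/PNG helpers above; the file paths are placeholders.

```python
import numpy as np
from vace.annotators.midas.utils import read_pfm, write_depth

depth = (np.random.rand(240, 320) * 10.0).astype(np.float32)
write_depth('/tmp/depth_demo', depth, bits=2)     # writes /tmp/depth_demo.pfm and a 16-bit .png
data, scale = read_pfm('/tmp/depth_demo.pfm')
print(data.shape, scale)                          # (240, 320) 1.0
```
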