siatheindochinese committed
Commit 2ac6d20 · verified · Parent: fad61ed

Update README.md

Files changed (1): README.md (+17 -11)
README.md CHANGED
@@ -1,3 +1,9 @@
+---
+license: agpl-3.0
+language:
+- en
+pipeline_tag: action-detection
+---
 # Official PyTorch Implementation of SiA
 
 Official PyTorch implementation of SiA, our model from **[Scaling Open-Vocabulary Action Detection](https://arxiv.org/abs/2504.03096)**. If you use this code for your research, please cite our paper.
@@ -10,17 +16,17 @@
 ## To-do
 - Code TBA
 - Weights TBA
-> **Scaling Open-Vocabulary Action Detection**<br>
-> Z.H Sia and Y.S Rawat<br>
-> <br>
-> <p align="center">
-> <img src="assets/virat.gif" height="200px" style="display:inline-block; margin-right: 1%;">
-> <img src="assets/cctv_smoking.gif" height="200px" style="display:inline-block; margin-right: 1%;">
-> <img src="assets/construction_fall.gif" height="200px" style="display:inline-block;">
-> <img src="assets/babycam.gif" height="200px" style="display:inline-block;">
-> </p>
->
-> **Abstract**:
+**Scaling Open-Vocabulary Action Detection**<br>
+Z.H Sia and Y.S Rawat<br>
+<br>
+<p align="center">
+<img src="assets/virat.gif" height="200px" style="display:inline-block; margin-right: 1%;">
+<img src="assets/cctv_smoking.gif" height="200px" style="display:inline-block; margin-right: 1%;">
+<img src="assets/construction_fall.gif" height="200px" style="display:inline-block;">
+<img src="assets/babycam.gif" height="200px" style="display:inline-block;">
+</p>
+
+**Abstract**:
 
 <p align="center">
 In this work, we focus on scaling open-vocabulary action detection. Existing approaches for action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) parameter-heavy adaptations to a pretrained vision-language contrastive model to convert it for detection, risking overfitting the additional non-pretrained parameters to base action classes. Firstly, we introduce an encoder-only multimodal model for video action detection, reducing the reliance on parameter-heavy additions for video action detection. Secondly, we introduce a simple weakly supervised training strategy to exploit an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used by prior works in open-vocabulary action detection and devise a new benchmark to evaluate on existing closed-set action detection datasets without ever using them for training, showing novel results to serve as baselines for future work.
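
Since the code and weights are still TBA, the following is only a minimal sketch of the open-vocabulary scoring step the abstract alludes to: embeddings of detected actors from the video side of a pretrained contrastive vision-language model are matched against free-form action prompts encoded by the text side. The function name, feature dimensions, and temperature are illustrative assumptions, not SiA's actual API.

```python
import torch
import torch.nn.functional as F

def open_vocab_action_scores(actor_feats: torch.Tensor,
                             text_feats: torch.Tensor,
                             temperature: float = 0.01) -> torch.Tensor:
    """Score each detected actor against an arbitrary list of action prompts.

    actor_feats: (N, D) embeddings of N detected actor boxes/tubelets from
                 the video side of a contrastive vision-language model.
    text_feats:  (C, D) embeddings of C free-form action class prompts from
                 the text side of the same model.
    Returns an (N, C) matrix of per-class probabilities.
    """
    # Cosine similarity requires unit-norm features on both sides.
    a = F.normalize(actor_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    # Scaled similarity logits, CLIP-style; softmax over the prompt list.
    logits = a @ t.T / temperature
    return logits.softmax(dim=-1)

# Usage with random placeholders: 4 detected actors, 3 novel action prompts.
probs = open_vocab_action_scores(torch.randn(4, 512), torch.randn(3, 512))
print(probs.shape)  # torch.Size([4, 3])
```

Because classification reduces to similarity against prompt embeddings, swapping in new action classes at test time only requires encoding new text, which is what makes a detector of this kind open-vocabulary.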