---
license: agpl-3.0
language:
- en
pipeline_tag: action-detection
---
# Official PyTorch Implementation of SiA

Official PyTorch implementation of SiA, our model from **[Scaling Open-Vocabulary Action Detection](https://arxiv.org/abs/2504.03096)**. If you use this code for your research, please cite our paper.
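Until an official citation is posted, a BibTeX entry along these lines should work (assembled from the arXiv link above; the entry key is a placeholder and the author initials follow this README, so please double-check the fields against the paper page):

```bibtex
@misc{sia2025scaling,
  title         = {Scaling Open-Vocabulary Action Detection},
  author        = {Sia, Z. H. and Rawat, Y. S.},
  year          = {2025},
  eprint        = {2504.03096},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2504.03096}
}
```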

## To-do
- Code TBA
- Weights TBA

**Scaling Open-Vocabulary Action Detection**<br>
Z.H Sia and Y.S Rawat<br>
<br>
<p align="center">
<img src="assets/virat.gif" height="200px" style="display:inline-block; margin-right: 1%;">
<img src="assets/cctv_smoking.gif" height="200px" style="display:inline-block; margin-right: 1%;">
<img src="assets/construction_fall.gif" height="200px" style="display:inline-block;">
<img src="assets/babycam.gif" height="200px" style="display:inline-block;">
</p>

**Abstract**:

<p align="center">
In this work, we focus on scaling open-vocabulary action detection. Existing approaches for action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) parameter-heavy adaptations to a pretrained vision-language contrastive model to convert it for detection, which risk overfitting the additional non-pretrained parameters to base action classes. First, we introduce an encoder-only multimodal model for video action detection, reducing the reliance on parameter-heavy additions. Second, we introduce a simple weakly supervised training strategy that exploits an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used by prior work in open-vocabulary action detection and devise a new benchmark that evaluates on existing closed-set action detection datasets without ever using them for training, reporting new results that can serve as baselines for future work.
</p>
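The code and weights are still TBA (see the to-do list above), so the snippet below is not the SiA implementation. It is only a minimal, hypothetical sketch of the open-vocabulary scoring idea described in the abstract: embeddings of detected actors are matched against text embeddings of free-form action names produced by a contrastive vision-language model. All function and variable names are made up for illustration.

```python
# Illustrative only: a toy open-vocabulary scoring step, NOT the SiA model.
# Assumes you already have (a) one embedding per detected actor box from some
# video encoder and (b) one embedding per action prompt from the matching text
# encoder of a contrastive vision-language (CLIP-style) backbone.
import torch
import torch.nn.functional as F


def score_open_vocabulary(actor_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.01) -> torch.Tensor:
    """Cosine-similarity scores of each actor box against each action prompt.

    actor_embeds: (num_boxes, dim) visual embeddings of detected actors.
    text_embeds:  (num_classes, dim) embeddings of free-form action names;
                  the class list can change at inference time, which is what
                  makes the classifier open-vocabulary.
    Returns:      (num_boxes, num_classes) probabilities per box.
    """
    actor_embeds = F.normalize(actor_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = actor_embeds @ text_embeds.T / temperature
    return logits.softmax(dim=-1)


if __name__ == "__main__":
    # Random tensors stand in for real encoder outputs in this toy example.
    actions = ["person riding a bike", "person smoking", "person falling"]
    boxes = torch.randn(4, 512)             # 4 detected actors
    prompts = torch.randn(len(actions), 512)
    probs = score_open_vocabulary(boxes, prompts)
    print(probs.shape)                       # torch.Size([4, 3])
```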