siatheindochinese committed
Commit 2ac6d20 · verified · Parent: fad61ed

Update README.md

Files changed (1): README.md (+17 -11)
README.md CHANGED
@@ -1,3 +1,9 @@
+---
+license: agpl-3.0
+language:
+- en
+pipeline_tag: action-detection
+---
 # Official PyTorch Implementation of SiA
 
 Official PyTorch implementation of SiA, our model from **[Scaling Open-Vocabulary Action Detection](https://arxiv.org/abs/2504.03096)**. If you use this code for your research, please cite our paper.
@@ -10,17 +16,17 @@
 ## To-do
 - Code TBA
 - Weights TBA
-> **Scaling Open-Vocabulary Action Detection**<br>
-> Z.H Sia and Y.S Rawat<br>
-> <br>
-> <p align="center">
-> <img src="assets/virat.gif" height="200px" style="display:inline-block; margin-right: 1%;">
-> <img src="assets/cctv_smoking.gif" height="200px" style="display:inline-block; margin-right: 1%;">
-> <img src="assets/construction_fall.gif" height="200px" style="display:inline-block;">
-> <img src="assets/babycam.gif" height="200px" style="display:inline-block;">
-> </p>
->
-> **Abstract**:
+**Scaling Open-Vocabulary Action Detection**<br>
+Z.H Sia and Y.S Rawat<br>
+<br>
+<p align="center">
+<img src="assets/virat.gif" height="200px" style="display:inline-block; margin-right: 1%;">
+<img src="assets/cctv_smoking.gif" height="200px" style="display:inline-block; margin-right: 1%;">
+<img src="assets/construction_fall.gif" height="200px" style="display:inline-block;">
+<img src="assets/babycam.gif" height="200px" style="display:inline-block;">
+</p>
+
+**Abstract**:
 
 <p align="center">
 In this work, we focus on scaling open-vocabulary action detection. Existing approaches for action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) parameter-heavy adaptations to a pretrained vision-language contrastive model to convert it for detection, risking overfitting the additional non-pretrained parameters to base action classes. Firstly, we introduce an encoder-only multimodal model for video action detection, reducing the reliance on parameter-heavy additions for video action detection. Secondly, we introduce a simple weakly supervised training strategy to exploit an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used by prior works in open-vocabulary action detection and devise a new benchmark to evaluate on existing closed-set action detection datasets without ever using them for training, showing novel results to serve as baselines for future work.
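
Since the code and weights are still TBA, the following is only a minimal sketch of the open-vocabulary scoring step the abstract alludes to: embeddings of detected actors from the video side of a pretrained contrastive vision-language model are matched against free-form action prompts encoded by the text side. The function name, feature dimensions, and temperature are illustrative assumptions, not SiA's actual API.

```python
import torch
import torch.nn.functional as F

def open_vocab_action_scores(actor_feats: torch.Tensor,
                             text_feats: torch.Tensor,
                             temperature: float = 0.01) -> torch.Tensor:
    """Score each detected actor against an arbitrary list of action prompts.

    actor_feats: (N, D) embeddings of N detected actor boxes/tubelets from
                 the video side of a contrastive vision-language model.
    text_feats:  (C, D) embeddings of C free-form action class prompts from
                 the text side of the same model.
    Returns an (N, C) matrix of per-class probabilities.
    """
    # Cosine similarity requires unit-norm features on both sides.
    a = F.normalize(actor_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    # Scaled similarity logits, CLIP-style; softmax over the prompt list.
    logits = a @ t.T / temperature
    return logits.softmax(dim=-1)

# Usage with random placeholders: 4 detected actors, 3 novel action prompts.
probs = open_vocab_action_scores(torch.randn(4, 512), torch.randn(3, 512))
print(probs.shape)  # torch.Size([4, 3])
```

Because classification reduces to similarity against prompt embeddings, swapping in new action classes at test time only requires encoding new text, which is what makes a detector of this kind open-vocabulary.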