DiT for Text Detection


Fine-tuned models on FUNSD
We summarize the validation results as follows. We also provide the fine-tuned weights.
| name | initialized checkpoint | detection algorithm | F1 | weight |
|---|---|---|---|---|
| DiT-base-syn | dit_base_patch16_224_syn | Mask R-CNN | 94.25 | link |
| DiT-large-syn | dit_large_patch16_224_syn | Mask R-CNN | 94.29 | link |
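The F1 column above is the harmonic mean of detection precision and recall. As a minimal illustration (not part of this repo), it can be computed as:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, `f1_score(0.94, 0.945)` yields roughly the values reported in the table.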
Usage
Data Preparation
Follow these steps to download and process the FUNSD dataset. The resulting directory structure should look like the following:
├── data
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
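Before training, it can be useful to verify that the layout above is complete. The following helper is a sketch (not part of the repo); the entry names match the structure shown above:

```python
import os

# Expected contents of the data root after FUNSD preprocessing.
EXPECTED = [
    "annotations",
    "imgs",
    "instances_test.json",
    "instances_training.json",
]

def missing_entries(data_root):
    """Return the expected files/directories absent under data_root."""
    return [name for name in EXPECTED
            if not os.path.exists(os.path.join(data_root, name))]
```

An empty list from `missing_entries("data")` means the dataset is laid out as expected.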
Training
The following command provides an example of training the Mask R-CNN with a DiT backbone on 8 32GB Nvidia V100 GPUs.
The config files can be found in configs.
python train_net.py --config-file configs/mask_rcnn_dit_base.yaml --num-gpus 8 --resume MODEL.WEIGHTS path/to/model OUTPUT_DIR path/to/output
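Training reads the COCO-format instance files prepared above. As a quick sanity check before launching a long run, a small script (a sketch, not part of the repo) can summarize an instances file:

```python
import json

def summarize_coco(path):
    """Return basic counts from a COCO-format instances JSON file."""
    with open(path) as f:
        coco = json.load(f)
    return {
        "images": len(coco.get("images", [])),
        "annotations": len(coco.get("annotations", [])),
        "categories": [c["name"] for c in coco.get("categories", [])],
    }
```

For example, `summarize_coco("data/instances_training.json")` reports how many images and annotation instances the training split contains.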
Evaluation
The following command provides an example of evaluating the fine-tuned checkpoint of DiT-Base with Mask R-CNN.
python train_net.py --config-file configs/mask_rcnn_dit_base.yaml --eval-only --num-gpus 8 --resume MODEL.WEIGHTS path/to/model OUTPUT_DIR path/to/output
Citation
If you find this repository useful, please consider citing our work:
@misc{li2022dit,
title={DiT: Self-supervised Pre-training for Document Image Transformer},
author={Junlong Li and Yiheng Xu and Tengchao Lv and Lei Cui and Cha Zhang and Furu Wei},
year={2022},
eprint={2203.02378},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Acknowledgment
Thanks to Detectron2 for the Mask R-CNN implementation and to MMOCR for the FUNSD data preprocessing implementation.