# 🫠 SMILE

**Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation**
## News 📢

- [2023.09.30] We now provide the code and our trained checkpoints (of BLIP) for quick deployment and easy reproduction. The previous demonstration code is now available at `demonstrative.md`.
- [2023.06.26] We provide demonstration code showing how to implement SMILE in your codebase, including pseudocode, a BLIP version, and a `transformers` version.
## Demo
We are building online demos. Please stay tuned.
## Usage

```bash
git clone https://github.com/yuezih/SMILE
cd SMILE/BLIP
```
### Installation

```bash
pip install -r requirements.txt
```

The code has been tested with PyTorch 2.0.0.
### Data Preparation

The data config is in `SMILE/BLIP/configs/caption_coco.yaml`.

- Set `image_root` to your MSCOCO image root.
- MSCOCO annotation files will be downloaded automatically.
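For reference, the two relevant fields in `caption_coco.yaml` look roughly like this (a sketch following the BLIP config convention; the exact key names and defaults may differ in your copy, and the path is a placeholder):

```yaml
# SMILE/BLIP/configs/caption_coco.yaml (illustrative excerpt)
image_root: '/path/to/coco/images/'  # point this at your MSCOCO image root
ann_root: 'annotation'               # annotation files are downloaded here automatically
```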
### Checkpoints

The pre-trained and MLE-finetuned checkpoints are available in the original BLIP repo.

We provide two checkpoints finetuned on MSCOCO with SMILE:

- `blip_smile_base.pth`: the vanilla SMILE-optimized BLIP.
- `blip_mle_smile_base.pth`: BLIP finetuned with MLE+SMILE (weighted 0.01:0.99), a compromise between descriptiveness and accuracy.
| Method | Download | Cap. Len. | Lex. Div. | R@1 | R@5 | CLIPScore | PPL |
|---|---|---|---|---|---|---|---|
| `blip_smile_base.pth` | OneDrive | 22.3 | 4.5 | 10.0 | 24.5 | 75.0 | 95.6 |
| `blip_mle_smile_base.pth` | OneDrive | 19.8 | 3.6 | 10.9 | 25.1 | 76.2 | 79.4 |
Set the checkpoint path in `SMILE/BLIP/configs/caption_coco.yaml`.
## Training & Inference

```bash
bash scripts/train.sh
bash scripts/eval.sh
```
Kind reminders:

- Please use `transformers==4.15.0` rather than a higher version.
- For `torch<=2.0.0`, replace `torchrun` with `python -m torch.distributed.run` in the training and inference scripts.
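The launcher substitution in the reminder above amounts to the following (a sketch; the training entry point and its flags here are illustrative, not copied from `scripts/train.sh`):

```bash
# torch > 2.0.0: launch distributed training with torchrun
torchrun --nproc_per_node=8 train_caption.py --config ./configs/caption_coco.yaml

# torch <= 2.0.0: invoke the same launcher via its module form instead
python -m torch.distributed.run --nproc_per_node=8 train_caption.py --config ./configs/caption_coco.yaml
```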
## Citation

If you find this repo helpful for your research, please consider citing our paper:

```bibtex
@misc{yue2023learning,
      title={Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation},
      author={Zihao Yue and Anwen Hu and Liang Zhang and Qin Jin},
      year={2023},
      eprint={2306.13460},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
## Acknowledgement
Our work relies on resources from BLIP and HuggingFace transformers. Many thanks to them for their amazing efforts.