# 🫠 SMILE

**Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation**
## News 📢

- [2023.09.30] We now provide the code and our trained checkpoints (of BLIP) for quick deployment and easy reproduction. The previous demonstration code is now available at `demonstrative.md`.
- [2023.06.26] We provide demonstration code showing how to implement SMILE in your codebase, including pseudocode, a BLIP version, and a `transformers` version.
## Demo
We are building online demos. Please stay tuned.
## Usage

```bash
git clone https://github.com/yuezih/SMILE
cd SMILE/BLIP
```
### Installation

```bash
pip install -r requirements.txt
```

The code has been tested with PyTorch 2.0.0.
### Data Preparation

The data config is in `SMILE/BLIP/configs/caption_coco.yaml`.

- Set `image_root` to your MSCOCO image root.
- MSCOCO annotation files will be downloaded automatically.
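For reference, the two relevant fields in `caption_coco.yaml` look roughly like this (a sketch following the BLIP config convention; the exact key names and defaults may differ in your copy, and the path is a placeholder):

```yaml
# SMILE/BLIP/configs/caption_coco.yaml (illustrative excerpt)
image_root: '/path/to/coco/images/'  # point this at your MSCOCO image root
ann_root: 'annotation'               # annotation files are downloaded here automatically
```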
### Checkpoints

The pre-trained and MLE-finetuned checkpoints are available in the original BLIP repo.

We provide two checkpoints finetuned on MSCOCO with SMILE:

- `blip_smile_base.pth`: the vanilla SMILE-optimized BLIP.
- `blip_mle_smile_base.pth`: BLIP finetuned with MLE+SMILE (weighted 0.01:0.99), a compromise between descriptiveness and accuracy.
| Method | Download | Cap. Len. | Lex. Div. | R@1 | R@5 | CLIPScore | PPL |
|---|---|---|---|---|---|---|---|
| `blip_smile_base.pth` | OneDrive | 22.3 | 4.5 | 10.0 | 24.5 | 75.0 | 95.6 |
| `blip_mle_smile_base.pth` | OneDrive | 19.8 | 3.6 | 10.9 | 25.1 | 76.2 | 79.4 |
Set the checkpoint path in `SMILE/BLIP/configs/caption_coco.yaml`.
## Training & Inference

```bash
bash scripts/train.sh
bash scripts/eval.sh
```
Kind reminders:

- Please use `transformers==4.15.0` rather than a higher version.
- For `torch<=2.0.0`, replace `torchrun` with `python -m torch.distributed.run` in the training and inference scripts.
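The launcher substitution in the reminder above amounts to the following (a sketch; the training entry point and its flags here are illustrative, not copied from `scripts/train.sh`):

```bash
# torch > 2.0.0: launch distributed training with torchrun
torchrun --nproc_per_node=8 train_caption.py --config ./configs/caption_coco.yaml

# torch <= 2.0.0: invoke the same launcher via its module form instead
python -m torch.distributed.run --nproc_per_node=8 train_caption.py --config ./configs/caption_coco.yaml
```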
## Citation

If you find this repo helpful for your research, please consider citing our paper:

```bibtex
@misc{yue2023learning,
      title={Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation},
      author={Zihao Yue and Anwen Hu and Liang Zhang and Qin Jin},
      year={2023},
      eprint={2306.13460},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
## Acknowledgement
Our work relies on resources from BLIP and HuggingFace transformers. Many thanks to them for their amazing efforts.