<div>
  <h2 align="center">
    🫠 SMILE
  </h2>
</div>

<p align="center">
    <a >
       <img alt="Issues" src="https://img.shields.io/github/issues/yuezih/SMILE?color=blueviolet" />
  	</a>
    <a >
       <img alt="Forks" src="https://img.shields.io/github/forks/yuezih/SMILE?color=orange" />
  	</a>
    <a >
       <img alt="Stars" src="https://img.shields.io/github/stars/yuezih/SMILE?color=ff69b4" />
  	</a>
    <br />
</p>

[Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation](https://arxiv.org/abs/2306.13460)

![case.png](./assets/case.png)

---

## News 📢

- [2023.09.30] We now provide the code and our trained BLIP checkpoints for quick deployment and easy reproduction. The earlier demonstration code is available at [demonstrative.md](./assets/demonstrative.md).
- [2023.06.26] We provide demonstration code showing how to implement SMILE in your own codebase, including pseudocode, a [BLIP](https://github.com/salesforce/BLIP) version, and a [transformers](https://github.com/huggingface/transformers) version.

## Demo

We are building online demos. Please stay tuned.

## Usage

```bash
git clone https://github.com/yuezih/SMILE
cd SMILE/BLIP
```

### Installation

```bash
pip install -r requirements.txt
```

The code has been tested on PyTorch 2.0.0.
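
Since the reminders further down also pin `transformers==4.15.0`, a quick check of both versions can save a debugging round later. The one-liner below is only an illustrative sanity check, not part of the repo's scripts:

```bash
# Print the installed PyTorch and transformers versions.
# This README reports testing with torch 2.0.0 and recommends transformers 4.15.0.
python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"
```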

### Data Preparation

The data configs are in `SMILE/BLIP/configs/caption_coco.yaml`.
- Set `image_root` to your MSCOCO image root (see the sketch below).
- The MSCOCO annotation files will be downloaded automatically.
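
As a rough sketch, the relevant part of `caption_coco.yaml` looks something like the following. Only `image_root` is documented above; the other key and the example paths are illustrative assumptions in the style of BLIP's configs:

```yaml
# Illustrative excerpt of configs/caption_coco.yaml (only image_root is
# documented in this README; the other key follows BLIP's config style).
image_root: '/path/to/coco/images/'   # directory containing the MSCOCO images
ann_root: 'annotation'                # annotation files are downloaded here automatically
```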

### Checkpoints

The pre-trained and MLE-finetuned checkpoints are available at the [original BLIP repo](https://github.com/salesforce/BLIP).

We provide our two checkpoints finetuned on MSCOCO with SMILE:
- `blip_smile_base.pth`: The vanilla SMILE-optimized BLIP.
- `blip_mle_smile_base.pth`: BLIP finetuned with a 0.01:0.99 mixture of MLE and SMILE, which strikes a compromise between descriptiveness and accuracy.

Method|Download|Caption Length|Lexical Diversity|R@1|R@5|CLIPScore|PPL
-|:-:|:-:|:-:|:-:|:-:|:-:|:-:
`blip_smile_base.pth`|[OneDrive](https://1drv.ms/u/s!AocXJ7uKxt6XcsGzBZ4XKoZWKJY?e=BW7fJK)|22.3|4.5|10.0|24.5|75.0|95.6
`blip_mle_smile_base.pth`|[OneDrive](https://1drv.ms/u/s!AocXJ7uKxt6Xc85rDJCdunDI0jU?e=eDpAGG)|19.8|3.6|**10.9**|**25.1**|76.2|79.4

After downloading a checkpoint, set its path in `SMILE/BLIP/configs/caption_coco.yaml` (see the sketch below).
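
A minimal sketch of that setting, assuming the config follows BLIP's convention of a `pretrained` key (the actual key name may differ):

```yaml
# Point the model at the downloaded SMILE checkpoint (illustrative;
# the key name follows BLIP's convention and may differ here).
pretrained: '/path/to/blip_smile_base.pth'
```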

### Training & Inference

Training:

```bash
bash scripts/train.sh
```

Inference:

```bash
bash scripts/eval.sh
```

Kind reminders:
- Please use `transformers==4.15.0` rather than a higher version.
- For `torch<=2.0.0`, replace `torchrun` with `python -m torch.distributed.run` in the training and inference scripts, e.g. as sketched below.
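
For example, a launch line of the following form inside `scripts/train.sh` would be rewritten as shown; the script name and flags here are illustrative assumptions in BLIP's style, not the repo's exact command:

```bash
# Before: torchrun-based launch (illustrative command, not the repo's exact line)
torchrun --nproc_per_node=8 train_caption.py --config configs/caption_coco.yaml

# After: equivalent module-based launcher for older torch
python -m torch.distributed.run --nproc_per_node=8 train_caption.py --config configs/caption_coco.yaml
```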

## Citation

If you find this repo helpful for your research, please consider citing our paper:

```bibtex
@misc{yue2023learning,
      title={Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation}, 
      author={Zihao Yue and Anwen Hu and Liang Zhang and Qin Jin},
      year={2023},
      eprint={2306.13460},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Acknowledgement

Our work relies on resources from [BLIP](https://github.com/salesforce/BLIP) and [HuggingFace transformers](https://github.com/huggingface/transformers). Many thanks to them for their amazing efforts.