# Language-driven Semantic Segmentation (LSeg)
This repo contains the official PyTorch implementation of the paper [Language-driven Semantic Segmentation](https://arxiv.org/abs/2201.03546).

ICLR 2022

#### Authors: 
* [Boyi Li](https://sites.google.com/site/boyilics/home)
* [Kilian Q. Weinberger](http://kilian.cs.cornell.edu/index.html)
* [Serge Belongie](https://scholar.google.com/citations?user=ORr4XJYAAAAJ&hl=zh-CN)
* [Vladlen Koltun](http://vladlen.info/)
* [Rene Ranftl](https://scholar.google.at/citations?user=cwKg158AAAAJ&hl=de)


### Overview


We present LSeg, a novel model for language-driven semantic image segmentation. LSeg uses a text encoder to compute embeddings of descriptive input labels (e.g., "grass" or "building") together with a transformer-based image encoder that computes dense per-pixel embeddings of the input image. The image encoder is trained with a contrastive objective to align pixel embeddings to the text embedding of the corresponding semantic class. The text embeddings provide a flexible label representation in which semantically similar labels map to similar regions in the embedding space (e.g., "cat" and "furry"). This allows LSeg to generalize to previously unseen categories at test time, without retraining or even requiring a single additional training sample. We demonstrate that our approach achieves highly competitive zero-shot performance compared to existing zero- and few-shot semantic segmentation methods, and even matches the accuracy of traditional segmentation algorithms when a fixed label set is provided.
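
The snippet below is a minimal, self-contained sketch (not the repo's implementation) of the core matching step described above: dense pixel embeddings are compared against the text embeddings of the label set via cosine similarity, and the resulting per-pixel logits are trained with cross-entropy. The shapes, the random tensors standing in for the encoders, and the temperature value are illustrative assumptions.

```
import torch
import torch.nn.functional as F

B, C, H, W = 2, 512, 32, 32            # batch, embedding dim, spatial size (stand-in values)
num_labels = 4                          # e.g. ["grass", "building", "cat", "other"]

pixel_emb = torch.randn(B, C, H, W)     # stand-in for the dense image-encoder output
text_emb = torch.randn(num_labels, C)   # stand-in for the CLIP text-encoder output

# Normalize both sides so the dot product is a cosine similarity.
pixel_emb = F.normalize(pixel_emb, dim=1)
text_emb = F.normalize(text_emb, dim=1)

# Per-pixel similarity logits against every label: (B, num_labels, H, W)
logits = torch.einsum("bchw,nc->bnhw", pixel_emb, text_emb)

# Training pushes each pixel toward the text embedding of its ground-truth class
# (illustrative temperature of 0.07; the repo's training setup may differ).
target = torch.randint(0, num_labels, (B, H, W))
loss = F.cross_entropy(logits / 0.07, target)
```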

Please check our [Video Demo (4k)](https://www.youtube.com/watch?v=bmU75rsmv6s) to further showcase the capabilities of LSeg.

## Usage
### Installation
Option 1: 

``` pip install -r requirements.txt ```

Option 2: 
```
conda install ipython
pip install torch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2
pip install git+https://github.com/zhanghang1989/PyTorch-Encoding/
pip install pytorch-lightning==1.3.5
pip install opencv-python
pip install imageio
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
pip install altair
pip install streamlit
pip install --upgrade protobuf
pip install timm
pip install tensorboardX
pip install matplotlib
pip install test-tube
pip install wandb
```

### Data Preparation
By default, for training, testing and demo, we use [ADE20k](https://groups.csail.mit.edu/vision/datasets/ADE20K/).

```
python prepare_ade20k.py
unzip ../datasets/ADEChallengeData2016.zip
```

Note: for the demo, if you want to use random inputs, you can skip data loading and comment out the code at [link](https://github.com/isl-org/lang-seg/blob/main/modules/lseg_module.py#L55).


### 🌻 Try demo now

#### Download Demo Model
<table>
  <thead>
    <tr style="text-align: right;">
      <th>name</th>
      <th>backbone</th>
      <th>text encoder</th>
      <th>url</th>
    </tr>
  </thead>
  <tbody>
    <tr>
       <td>Model for demo</td>
      <th>ViT-L/16</th>
      <th>CLIP ViT-B/32</th>
      <td><a href="https://drive.google.com/file/d/1FTuHY1xPUkM-5gaDtMfgCl3D0gR89WV7/view?usp=sharing">download</a></td>
    </tr>
  </tbody>
</table>

#### πŸ‘‰ Option 1: Running interactive app
Download the demo model and place it in the `checkpoints` folder as `checkpoints/demo_e200.ckpt`.

Then ``` streamlit run lseg_app.py ```

#### πŸ‘‰ Option 2: Jupyter Notebook
Download the demo model and place it in the `checkpoints` folder as `checkpoints/demo_e200.ckpt`.

Then follow [lseg_demo.ipynb](https://github.com/isl-org/lang-seg/blob/main/lseg_demo.ipynb) to play around with LSeg. Enjoy!



### Training and Testing Example
Training: Backbone = ViT-L/16, Text Encoder from CLIP ViT-B/32

``` bash train.sh ```

Testing: Backbone = ViT-L/16, Text Encoder from CLIP ViT-B/32

``` bash test.sh ```

### Zero-shot Experiments
#### Data Preparation
Please follow [HSNet](https://github.com/juhongm999/hsnet) and put all datasets in `data/Dataset_HSN`.

#### Pascal-5i
``` 
for fold in 0 1 2 3; do
python -u test_lseg_zs.py --backbone clip_resnet101 --module clipseg_DPT_test_v2 --dataset pascal \
--widehead --no-scaleinv --arch_option 0 --ignore_index 255 --fold ${fold} --nshot 0 \
--weights checkpoints/pascal_fold${fold}.ckpt 
done
```
#### COCO-20i
``` 
for fold in 0 1 2 3; do
python -u test_lseg_zs.py --backbone clip_resnet101 --module clipseg_DPT_test_v2 --dataset coco \
--widehead --no-scaleinv --arch_option 0 --ignore_index 255 --fold ${fold} --nshot 0 \
--weights checkpoints/coco_fold${fold}.ckpt 
done
```
#### FSS
``` 
python -u test_lseg_zs.py --backbone clip_vitl16_384 --module clipseg_DPT_test_v2 --dataset fss \
--widehead --no-scaleinv --arch_option 0 --ignore_index 255 --fold 0 --nshot 0 \
--weights checkpoints/fss_l16.ckpt 
```

``` 
python -u test_lseg_zs.py --backbone clip_resnet101 --module clipseg_DPT_test_v2 --dataset fss \
--widehead --no-scaleinv --arch_option 0 --ignore_index 255 --fold 0 --nshot 0 \
--weights checkpoints/fss_rn101.ckpt 
```

#### Model Zoo
<table>
  <thead>
    <tr style="text-align: right;">
       <th>dataset</th>
      <th>fold</th>
      <th>backbone</th>
      <th>text encoder</th>
      <th>performance (mIoU)</th>
      <th>url</th>
    </tr>
  </thead>
  <tbody>
    <tr>
       <th>pascal</th>
       <td>0</td>
      <th>ResNet101</th>
      <th>CLIP ViT-B/32</th>
      <th>52.8</th>
      <td><a href="https://drive.google.com/file/d/1y4z4_yNGlZtn6osaeN4ZjMs6c0vr0F3m/view?usp=sharing">download</a></td>
    </tr>
    <tr>
       <th>pascal</th>
       <td>1</td>
      <th>ResNet101</th>
      <th>CLIP ViT-B/32</th>
      <th>53.8</th>
      <td><a href="https://drive.google.com/file/d/1UZzN8kWkH-G8v6P8xcBXEHRZlQKPRxrX/view?usp=sharing">download</a></td>
    </tr>
    <tr>
       <th>pascal</th>
       <td>2</td>
      <th>ResNet101</th>
      <th>CLIP ViT-B/32</th>
      <th>44.4</th>
      <td><a href="https://drive.google.com/file/d/1KCq1JphSMvj8X78bkWbdNIFm5zYzLTMX/view?usp=sharing">download</a></td>
    </tr>
    <tr>
       <th>pascal</th>
       <td>3</td>
      <th>ResNet101</th>
      <th>CLIP ViT-B/32</th>
      <th>38.5</th>
      <td><a href="https://drive.google.com/file/d/1A_fllOJqyBg0ZTJcm85Cn0NAcwQbXnhl/view?usp=sharing">download</a></td>
    </tr>
    <tr>
       <th>coco</th>
       <td>0</td>
      <th>ResNet101</th>
      <th>CLIP ViT-B/32</th>
      <th>22.1</th>
      <td><a href="https://drive.google.com/file/d/1nSYO3XtAv4mzWi4x-MFUfk04cpBKJT38/view?usp=sharing">download</a></td>
    </tr>
    <tr>
       <th>coco</th>
       <td>1</td>
      <th>ResNet101</th>
      <th>CLIP ViT-B/32</th>
      <th>25.1</th>
      <td><a href="https://drive.google.com/file/d/1w0vz3yjEi_ZLgECRrgtoLxEHugyJkrs5/view?usp=sharing">download</a></td>
    </tr>
    <tr>
       <th>coco</th>
       <td>2</td>
      <th>ResNet101</th>
      <th>CLIP ViT-B/32</th>
      <th>24.9</th>
      <td><a href="https://drive.google.com/file/d/1wmHtmJLdta18XuWQv6oX9llidCll_HrD/view?usp=sharing">download</a></td>
    </tr>
    <tr>
       <th>coco</th>
       <td>3</td>
      <th>ResNet101</th>
      <th>CLIP ViT-B/32</th>
      <th>21.5</th>
      <td><a href="https://drive.google.com/file/d/1dliBUSOog7taJxMmb9cdefKH4XOKChVJ/view?usp=sharing">download</a></td>
    </tr>
    <tr>
       <th>fss</th>
       <td>-</td>
      <th>ResNet101</th>
      <th>CLIP ViT-B/32</th>
      <th>84.7</th>
      <td><a href="https://drive.google.com/file/d/1UIj49Wp1mAopPub5M6O4WW-Z79VB1bhw/view?usp=sharing">download</a></td>
    </tr>
    <tr>
       <th>fss</th>
       <td>-</td>
      <th>ViT-L/16</th>
      <th>CLIP ViT-B/32</th>
      <th>87.8</th>
      <td><a href="https://drive.google.com/file/d/1Nplkc_JsHIS55d--K2vonOOC3HrppzYy/view?usp=sharing">download</a></td>
    </tr>
  </tbody>
</table>

If you find this repo useful, please cite:
```
@inproceedings{
li2022languagedriven,
title={Language-driven Semantic Segmentation},
author={Boyi Li and Kilian Q Weinberger and Serge Belongie and Vladlen Koltun and Rene Ranftl},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=RriDjddCLN}
}
```

## Acknowledgement
Thanks to the code bases from [DPT](https://github.com/isl-org/DPT), [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning), [CLIP](https://github.com/openai/CLIP), [PyTorch Encoding](https://github.com/zhanghang1989/PyTorch-Encoding), [Streamlit](https://streamlit.io/), and [Wandb](https://wandb.ai/site).