# Panoptic-DeepLab

Panoptic-DeepLab is a state-of-the-art **box-free** system for panoptic
segmentation [1], where the goal is to assign a unique value, encoding both
semantic label (e.g., person, car) and instance ID (e.g., instance_1,
instance_2), to every pixel in an image.
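
For concreteness, one common way to realize such an encoding is
`panoptic = semantic * label_divisor + instance`. The sketch below uses an
illustrative divisor of 1000; the actual constant is dataset-specific and only
needs to exceed the maximum number of instances per class.

```python
import numpy as np

LABEL_DIVISOR = 1000  # illustrative; must exceed the per-class instance count

semantic = np.array([[11, 11], [13, 13]])  # e.g., 11 = person, 13 = car
instance = np.array([[1, 2], [0, 1]])      # per-class instance IDs; 0 = none

# Encode both labels into a single value per pixel.
panoptic = semantic * LABEL_DIVISOR + instance

# Decoding is the inverse pair of operations.
assert (panoptic // LABEL_DIVISOR == semantic).all()
assert (panoptic % LABEL_DIVISOR == instance).all()
```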

Panoptic-DeepLab improves over DeeperLab [6], one of the first box-free systems
for panoptic segmentation (which combines DeepLabv3+ [7] and PersonLab [8]), by
simplifying the class-agnostic instance detection to a single center keypoint.
As a result, Panoptic-DeepLab predicts three outputs: (1) semantic
segmentation, (2) an instance center heatmap, and (3) instance center
regression.

The class-agnostic instance segmentation is first obtained by grouping the
predicted foreground pixels (inferred from semantic segmentation) to their
closest predicted instance centers [2]. To generate the final panoptic
segmentation, we then fuse the class-agnostic instance segmentation with the
semantic segmentation by the efficient majority-vote scheme [6].
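
As a concrete illustration of these two steps, below is a minimal NumPy sketch
of the post-processing. The array layouts, the offset convention, and the names
(`group_pixels`, `thing_ids`) are assumptions made for illustration; the
repository's actual merging op is a tuned TensorFlow implementation that
differs in its details.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def group_pixels(semantic, center_heatmap, offsets, thing_ids,
                 center_threshold=0.1, nms_kernel=7):
    """Group foreground pixels to their nearest predicted instance center.

    semantic:       (H, W) int array of predicted semantic labels.
    center_heatmap: (H, W) float array of instance-center confidences.
    offsets:        (2, H, W) float array; each foreground pixel regresses a
                    (dy, dx) vector pointing toward its instance center.
    thing_ids:      set of semantic IDs belonging to countable "thing" classes.
    """
    semantic = semantic.copy()
    h, w = semantic.shape

    # 1. Keypoint-style NMS: keep heatmap values that survive a local max
    #    filter and exceed a confidence threshold.
    peaks = (center_heatmap == maximum_filter(center_heatmap, size=nms_kernel))
    peaks &= center_heatmap > center_threshold
    centers = np.stack(np.nonzero(peaks), axis=1)           # (K, 2) of (y, x)
    if len(centers) == 0:
        return semantic, np.zeros_like(semantic)

    # 2. Every pixel votes for a center location via its regressed offset.
    ys, xs = np.mgrid[0:h, 0:w]
    voted = np.stack([ys + offsets[0], xs + offsets[1]], axis=-1)  # (H, W, 2)

    # 3. Assign each pixel to its nearest detected center (class-agnostic).
    dists = np.linalg.norm(voted[:, :, None, :] - centers[None, None], axis=-1)
    instance = dists.argmin(axis=-1) + 1                     # IDs start at 1

    # 4. Only "thing" pixels carry instance IDs; "stuff" pixels get ID 0.
    instance = np.where(np.isin(semantic, list(thing_ids)), instance, 0)

    # 5. Majority vote: each instance adopts the most frequent semantic label
    #    among its pixels, resolving semantic/instance disagreements.
    for inst_id in np.unique(instance[instance > 0]):
        mask = instance == inst_id
        semantic[mask] = np.bincount(semantic[mask]).argmax()

    return semantic, instance
```

The final panoptic map is then obtained by combining the two returned arrays
with an encoding such as the one shown earlier.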


<p align="center">
   <img src="../img/panoptic_deeplab.png" width=800>
</p>


## Prerequisite

1. Make sure the software is properly [installed](../setup/installation.md).

2. Make sure the target dataset is correctly prepared (e.g.,
[Cityscapes](../setup/cityscapes.md), [COCO](../setup/coco.md)).

3. Download the ImageNet pretrained
[checkpoints](./imagenet_pretrained_checkpoints.md), and update the
`initial_checkpoint` path in the config files.

## Model Zoo

In the Model Zoo, we explore building Panoptic-DeepLab on top of several
backbones (e.g., ResNet model variants [3]).

Herein, we highlight some of the employed backbones:

1. **ResNet-50-Beta**: We replace the original stem in ResNet-50 [3] with the
Inception stem [9], i.e., the first 7x7 convolution is replaced by three 3x3
convolutions (see the sketch after this list).

2. **Wide-ResNet-41**: We modify Wide-ResNet-38 [5] by (1) removing the last
residual block, and (2) repeating the second-to-last residual block two more
times.

3. **SWideRNet-SAC-(1, 1, x)**, where x = $$\{1, 3, 4.5\}$$, which scales the
backbone layers (excluding the stem) of Wide-ResNet-41 by a factor of x. This
backbone employs only the Switchable Atrous Convolution (SAC), without the
Squeeze-and-Excitation modules [10].
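
To make the first backbone modification concrete, the sketch below contrasts
the original ResNet-50 stem with the Inception-style replacement, written with
Keras. The 64-64-128 channel arrangement is a common choice for this stem and
is an assumption here, not taken from the repository's code.

```python
import tensorflow as tf

def original_stem():
    # ResNet-50's original stem: a single 7x7, stride-2 convolution.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 7, strides=2, padding='same',
                               use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
    ])

def inception_stem():
    # "Beta" variant: the 7x7 convolution is replaced by three stacked 3x3
    # convolutions; only the first is strided, so the overall stride (and
    # hence the feature resolution) is unchanged.
    layers = []
    for channels, stride in [(64, 2), (64, 1), (128, 1)]:
        layers += [
            tf.keras.layers.Conv2D(channels, 3, strides=stride,
                                   padding='same', use_bias=False),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.ReLU(),
        ]
    return tf.keras.Sequential(layers)
```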

### Cityscapes Panoptic Segmentation

We provide checkpoints pretrained on the Cityscapes train-fine set below. If you
would like to train those models by yourself, please find the corresponding
config files under the directory
[configs/cityscapes/panoptic_deeplab](../../configs/cityscapes/panoptic_deeplab).

All reported results are obtained with *single-scale* inference and
*ImageNet-1K* pretrained checkpoints.

Backbone                                                                                                                                                                                                                                                             | Output stride | Input resolution | PQ [*] | mIoU [*] | PQ [**] | mIoU [**] | AP<sup>Mask</sup> [**]
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-----------: | :---------------: | :----: | :------: | :-----: | :-------: | :--------------------:
MobilenetV3-S ([config](../../configs/cityscapes/panoptic_deeplab/mobilenet_v3_small_os32.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/mobilenet_v3_small_os32_panoptic_deeplab_cityscapes_trainfine.tar.gz))                   | 32            | 1025 x 2049       | 46.7   | 69.5     | 46.92   | 69.8      | 16.53
MobilenetV3-L ([config](../../configs/cityscapes/panoptic_deeplab/mobilenet_v3_large_os32.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/mobilenet_v3_large_os32_panoptic_deeplab_cityscapes_trainfine.tar.gz))                   | 32            | 1025 x 2049       | 52.7   | 73.8     | 53.07   | 74.15     | 22.58
ResNet-50 ([config](../../configs/cityscapes/panoptic_deeplab/resnet50_os32_merge_with_pure_tf_func.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/resnet50_os32_panoptic_deeplab_cityscapes_trainfine.tar.gz))                   | 32            | 1025 x 2049       | 59.8   | 76.0     | 60.24   | 76.36     | 30.01
ResNet-50-Beta ([config](../../configs/cityscapes/panoptic_deeplab/resnet50_beta_os32.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/resnet50_beta_os32_panoptic_deeplab_cityscapes_trainfine.tar.gz))                            | 32            | 1025 x 2049       | 60.8   | 77.0     | 61.16   | 77.37     | 31.58
Wide-ResNet-41 ([config](../../configs/cityscapes/panoptic_deeplab/wide_resnet41_os16.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/wide_resnet41_os16_panoptic_deeplab_cityscapes_trainfine.tar.gz))                            | 16            | 1025 x 2049       | 64.4   | 81.5     | 64.83   | 81.92     | 36.07
SWideRNet-SAC-(1, 1, 1) ([config](../../configs/cityscapes/panoptic_deeplab/swidernet_sac_1_1_1_os16.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/swidernet_sac_1_1_1_os16_panoptic_deeplab_cityscapes_trainfine.tar.gz))       | 16            | 1025 x 2049       | 64.3   | 81.8     | 64.81   | 82.24     | 36.80
SWideRNet-SAC-(1, 1, 3) ([config](../../configs/cityscapes/panoptic_deeplab/swidernet_sac_1_1_3_os16.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/swidernet_sac_1_1_3_os16_panoptic_deeplab_cityscapes_trainfine.tar.gz))       | 16            | 1025 x 2049       | 66.6   | 82.1     | 67.05   | 82.67     | 38.59
SWideRNet-SAC-(1, 1, 4.5) ([config](../../configs/cityscapes/panoptic_deeplab/swidernet_sac_1_1_4.5_os16.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/swidernet_sac_1_1_4.5_os16_panoptic_deeplab_cityscapes_trainfine.tar.gz)) | 16            | 1025 x 2049       | 66.8   | 82.2     | 67.29   | 82.74     | 39.51

\[*]: Results evaluated by the official script. Instance segmentation
evaluation is not yet supported (our prediction format still needs to be
converted).

\[**]: Results evaluated by our pipeline. See Q4 in [FAQ](../faq.md).

### COCO Panoptic Segmentation

We provide checkpoints pretrained on the COCO train set below. If you would like to
train those models by yourself, please find the corresponding config files under
the directory
[configs/coco/panoptic_deeplab](../../configs/coco/panoptic_deeplab).

All reported results are obtained with *single-scale* inference and
*ImageNet-1K* pretrained checkpoints.

Backbone                                                                                                                                                                                                                 | Output stride | Input resolution | PQ [*] | PQ [**] | mIoU [**] | AP<sup>Mask</sup> [**]
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :-----------: | :---------------: | :----: | :-----: | :-------: | :--------------------:
ResNet-50 ([config](../../configs/coco/panoptic_deeplab/resnet50_os32.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/resnet50_os32_panoptic_deeplab_coco_train_2.tar.gz))             | 32            | 641 x 641         | 34.1   | 34.60   | 54.75     | 18.50
ResNet-50-Beta ([config](../../configs/coco/panoptic_deeplab/resnet50_beta_os32.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/resnet50beta_os32_panoptic_deeplab_coco_train.tar.gz)) | 32            | 641 x 641         | 34.6   | 35.10   | 54.98     | 19.24
ResNet-50 ([config](../../configs/coco/panoptic_deeplab/resnet50_os16.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/resnet50_os16_panoptic_deeplab_coco_train.tar.gz))               | 16            | 641 x 641         | 35.1   | 35.67   | 55.52     | 19.40
ResNet-50-Beta ([config](../../configs/coco/panoptic_deeplab/resnet50_beta_os16.textproto), [ckpt](https://storage.googleapis.com/gresearch/tf-deeplab/checkpoint/resnet50beta_os16_panoptic_deeplab_coco_train.tar.gz)) | 16            | 641 x 641         | 35.2   | 35.76   | 55.45     | 19.63

\[*]: Results evaluated by the official script.

\[**]: Results evaluated by our pipeline. See Q4 in [FAQ](../faq.md).

## Citing Panoptic-DeepLab

If you find this code helpful in your research or wish to refer to the baseline
results, please use the following BibTeX entry.

* Panoptic-DeepLab:

```
@inproceedings{panoptic_deeplab_2020,
  author={Bowen Cheng and Maxwell D Collins and Yukun Zhu and Ting Liu and Thomas S Huang and Hartwig Adam and Liang-Chieh Chen},
  title={{Panoptic-DeepLab}: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation},
  booktitle={CVPR},
  year={2020}
}
```

If you use the Wide-ResNet-41 backbone, please consider citing

* Naive-Student:

```
@inproceedings{naive_student_2020,
  title={{Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation}},
  author={Chen, Liang-Chieh and Lopes, Raphael Gontijo and Cheng, Bowen and Collins, Maxwell D and Cubuk, Ekin D and Zoph, Barret and Adam, Hartwig and Shlens, Jonathon},
  booktitle={ECCV},
  year={2020}
}
```

If you use the SWideRNet backbone with Switchable Atrous Convolution, please
consider citing

* SWideRNet:

```
@article{swidernet_2020,
  title={Scaling Wide Residual Networks for Panoptic Segmentation},
  author={Chen, Liang-Chieh and Wang, Huiyu and Qiao, Siyuan},
  journal={arXiv:2011.11675},
  year={2020}
}
```

* Switchable Atrous Convolution (SAC):

```
@inproceedings{detectors_2021,
  title={{DetectoRS}: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution},
  author={Qiao, Siyuan and Chen, Liang-Chieh and Yuille, Alan},
  booktitle={CVPR},
  year={2021}
}
```

If you use the MobileNetV3 backbone, please consider citing

* MobileNetV3:

```
@inproceedings{howard2019searching,
  title={Searching for {MobileNetV3}},
  author={Howard, Andrew and Sandler, Mark and Chu, Grace and Chen, Liang-Chieh and Chen, Bo and Tan, Mingxing and Wang, Weijun and Zhu, Yukun and Pang, Ruoming and Vasudevan, Vijay and others},
  booktitle={ICCV},
  year={2019}
}
```

### References

1. Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr
   Dollar. "Panoptic segmentation." In CVPR, 2019.

2. Alex Kendall, Yarin Gal, and Roberto Cipolla. "Multi-task learning using
   uncertainty to weigh losses for scene geometry and semantics." In CVPR, 2018.

3. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual
   learning for image recognition." In CVPR, 2016.

4. Sergey Zagoruyko and Nikos Komodakis. "Wide residual networks." In BMVC,
   2016.

5. Zifeng Wu, Chunhua Shen, and Anton van den Hengel. "Wider or deeper:
   Revisiting the ResNet model for visual recognition." Pattern Recognition,
   2019.

6. Tien-Ju Yang, Maxwell D Collins, Yukun Zhu, Jyh-Jing Hwang, Ting Liu,
   Xiao Zhang, Vivienne Sze, George Papandreou, and Liang-Chieh Chen.
   "DeeperLab: Single-shot image parser." arXiv:1902.05093, 2019.

7. Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and
   Hartwig Adam. "Encoder-decoder with atrous separable convolution for
   semantic image segmentation." In ECCV, 2018.

8. George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris,
   Jonathan Tompson, and Kevin Murphy. "PersonLab: Person pose estimation
   and instance segmentation with a bottom-up, part-based, geometric embedding
   model." In ECCV, 2018.

9. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and
   Zbigniew Wojna. "Rethinking the inception architecture for computer
   vision." In CVPR, 2016.

10. Jie Hu, Li Shen, and Gang Sun. "Squeeze-and-excitation networks."
    In CVPR, 2018.