
# Panoptic-DeepLab

Panoptic-DeepLab is a state-of-the-art box-free system for panoptic segmentation [1], where the goal is to assign a unique value, encoding both semantic label (e.g., person, car) and instance ID (e.g., instance_1, instance_2), to every pixel in an image.
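Concretely, the unique value is formed by packing the semantic label and the instance ID into a single integer with a dataset-specific label divisor. A minimal sketch of this encoding (the divisor value 256, the class ID, and the helper name are illustrative assumptions, not the codebase's API):

```python
# Illustrative panoptic-label encoding with a label divisor.
# The divisor (256) is an assumption for this sketch; the actual value
# is dataset-specific and set in the experiment config.
LABEL_DIVISOR = 256

def encode_panoptic(semantic_label: int, instance_id: int) -> int:
    """Packs a semantic label and an instance ID into one panoptic value."""
    return semantic_label * LABEL_DIVISOR + instance_id

# E.g., the 2nd instance of semantic class 11 ("person" in Cityscapes):
assert encode_panoptic(11, 2) == 2818
```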

Panoptic-DeepLab improves over DeeperLab [6], one of the first box-free systems for panoptic segmentation (which combines DeepLabv3+ [7] and PersonLab [8]), by simplifying the class-agnostic instance detection to use only a center keypoint. As a result, Panoptic-DeepLab predicts three outputs: (1) semantic segmentation, (2) instance center heatmap, and (3) instance center regression.

The class-agnostic instance segmentation is first obtained by grouping the predicted foreground pixels (inferred from semantic segmentation) to their closest predicted instance centers [2]. To generate the final panoptic segmentation, we then fuse the class-agnostic instance segmentation with the semantic segmentation via an efficient majority-vote scheme [6].
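Below is a minimal NumPy sketch of this grouping-and-fusion step. The array layouts, function name, and the simple majority vote are simplifications for illustration, not the deeplab2 implementation:

```python
import numpy as np

def group_and_fuse(semantic_pred, centers, offsets, thing_ids,
                   label_divisor=256):
    """Toy post-processing: assign each predicted foreground pixel to its
    nearest predicted instance center, then fuse instances with semantic
    labels by majority vote.

    semantic_pred: (H, W) int array of predicted semantic labels.
    centers:       (K, 2) float array of predicted centers as (y, x);
                   assumes at least one center was detected.
    offsets:       (H, W, 2) float array of center-regression offsets.
    thing_ids:     iterable of semantic IDs treated as countable "things".
    """
    h, w = semantic_pred.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Each pixel regresses the location of the center it belongs to.
    regressed = np.stack([ys + offsets[..., 0], xs + offsets[..., 1]],
                         axis=-1)                                 # (H, W, 2)
    # Group: the nearest predicted center wins; instance IDs start at 1.
    dists = np.linalg.norm(regressed[:, :, None, :]
                           - centers[None, None, :, :], axis=-1)  # (H, W, K)
    instance_id = dists.argmin(axis=-1) + 1
    # Only foreground ("thing") pixels, per semantic prediction, keep IDs.
    foreground = np.isin(semantic_pred, list(thing_ids))
    instance_id = np.where(foreground, instance_id, 0)
    # Fuse: each instance takes its most frequent semantic label.
    panoptic = semantic_pred * label_divisor         # stuff: instance ID 0
    for inst in np.unique(instance_id[instance_id > 0]):
        mask = instance_id == inst
        majority = np.bincount(semantic_pred[mask]).argmax()
        panoptic[mask] = majority * label_divisor + inst
    return panoptic
```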

## Prerequisites

  1. Make sure the software is properly installed.

  2. Make sure the target dataset is correctly prepared (e.g., Cityscapes, COCO).

  3. Download the ImageNet pretrained checkpoints, and update the `initial_checkpoint` path in the config files (see the illustrative excerpt after this list).
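For example, the checkpoint path is set via the `initial_checkpoint` field in the textproto configs. A hedged excerpt, assuming the field sits under `model_options` as in the released configs (fields abbreviated; the path is a placeholder to replace with your local path):

```protobuf
# Excerpt from a config under configs/cityscapes/panoptic_deeplab
# (structure abbreviated for illustration).
model_options {
  initial_checkpoint: "/path/to/imagenet/pretrained/checkpoint"
}
```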

## Model Zoo

In the Model Zoo, we explore building Panoptic-DeepLab on top of several backbones (e.g., ResNet model variants [3]).

Herein, we highlight some of the employed backbones:

  1. ResNet-50-Beta: We replace the original stem in ResNet-50 [3] with the Inception stem [9], i.e., the first original 7x7 convolution is replaced by three 3x3 convolutions (sketched after this list).

  2. Wide-ResNet-41: We modify Wide-ResNet-38 [5] by (1) removing the last residual block, and (2) repeating the second-to-last residual block two more times.

  3. SWideRNet-SAC-(1, 1, x), where x ∈ {1, 3, 4.5}: We scale the backbone layers (excluding the stem) of Wide-ResNet-41 by a factor of x. This backbone employs only the Switchable Atrous Convolution (SAC), without the Squeeze-and-Excitation modules [10].
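As referenced in the first item above, here is what the Inception-style "beta" stem looks like in Keras. The channel widths, strides, and normalization placement are illustrative assumptions, not the exact deeplab2 layers:

```python
import tensorflow as tf

def inception_style_stem(filters=64):
  """Sketch of the ResNet-50-Beta stem: the original 7x7 stride-2
  convolution is replaced by three 3x3 convolutions (the first with
  stride 2), following the Inception stem."""
  return tf.keras.Sequential([
      tf.keras.layers.Conv2D(filters // 2, 3, strides=2, padding='same',
                             use_bias=False),
      tf.keras.layers.BatchNormalization(),
      tf.keras.layers.ReLU(),
      tf.keras.layers.Conv2D(filters // 2, 3, padding='same',
                             use_bias=False),
      tf.keras.layers.BatchNormalization(),
      tf.keras.layers.ReLU(),
      tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False),
      tf.keras.layers.BatchNormalization(),
      tf.keras.layers.ReLU(),
  ])
```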

## Cityscapes Panoptic Segmentation

We provide checkpoints pretrained on the Cityscapes train-fine set below. If you would like to train these models yourself, please find the corresponding config files under the directory `configs/cityscapes/panoptic_deeplab`.

All reported results are obtained with single-scale inference and ImageNet-1K pretrained checkpoints.

| Backbone | Output stride | Input resolution | PQ [*] | mIoU [*] | PQ [**] | mIoU [**] | AP<sup>Mask</sup> [**] |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| MobilenetV3-S (config, ckpt) | 32 | 1025 x 2049 | 46.7 | 69.5 | 46.92 | 69.8 | 16.53 |
| MobilenetV3-L (config, ckpt) | 32 | 1025 x 2049 | 52.7 | 73.8 | 53.07 | 74.15 | 22.58 |
| ResNet-50 (config, ckpt) | 32 | 1025 x 2049 | 59.8 | 76.0 | 60.24 | 76.36 | 30.01 |
| ResNet-50-Beta (config, ckpt) | 32 | 1025 x 2049 | 60.8 | 77.0 | 61.16 | 77.37 | 31.58 |
| Wide-ResNet-41 (config, ckpt) | 16 | 1025 x 2049 | 64.4 | 81.5 | 64.83 | 81.92 | 36.07 |
| SWideRNet-SAC-(1, 1, 1) (config, ckpt) | 16 | 1025 x 2049 | 64.3 | 81.8 | 64.81 | 82.24 | 36.80 |
| SWideRNet-SAC-(1, 1, 3) (config, ckpt) | 16 | 1025 x 2049 | 66.6 | 82.1 | 67.05 | 82.67 | 38.59 |
| SWideRNet-SAC-(1, 1, 4.5) (config, ckpt) | 16 | 1025 x 2049 | 66.8 | 82.2 | 67.29 | 82.74 | 39.51 |

[*]: Results evaluated by the official script. Instance segmentation evaluation is not yet supported (our prediction format still needs to be converted).

[**]: Results evaluated by our pipeline. See Q4 in FAQ.

## COCO Panoptic Segmentation

We provide checkpoints pretrained on the COCO train set below. If you would like to train these models yourself, please find the corresponding config files under the directory `configs/coco/panoptic_deeplab`.

All reported results are obtained with single-scale inference and ImageNet-1K pretrained checkpoints.

| Backbone | Output stride | Input resolution | PQ [*] | PQ [**] | mIoU [**] | AP<sup>Mask</sup> [**] |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| ResNet-50 (config, ckpt) | 32 | 641 x 641 | 34.1 | 34.60 | 54.75 | 18.50 |
| ResNet-50-Beta (config, ckpt) | 32 | 641 x 641 | 34.6 | 35.10 | 54.98 | 19.24 |
| ResNet-50 (config, ckpt) | 16 | 641 x 641 | 35.1 | 35.67 | 55.52 | 19.40 |
| ResNet-50-Beta (config, ckpt) | 16 | 641 x 641 | 35.2 | 35.76 | 55.45 | 19.63 |

[*]: Results evaluated by the official script.

[**]: Results evaluated by our pipeline. See Q4 in FAQ.

## Citing Panoptic-DeepLab

If you find this code helpful in your research or wish to refer to the baseline results, please use the following BibTeX entry.

  • Panoptic-DeepLab:
@inproceedings{panoptic_deeplab_2020,
  author={Bowen Cheng and Maxwell D Collins and Yukun Zhu and Ting Liu and Thomas S Huang and Hartwig Adam and Liang-Chieh Chen},
  title={{Panoptic-DeepLab}: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation},
  booktitle={CVPR},
  year={2020}
}

If you use the Wide-ResNet-41 backbone, please consider citing

  • Naive-Student:
@inproceedings{naive_student_2020,
  title={{Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation}},
  author={Chen, Liang-Chieh and Lopes, Raphael Gontijo and Cheng, Bowen and Collins, Maxwell D and Cubuk, Ekin D and Zoph, Barret and Adam, Hartwig and Shlens, Jonathon},
  booktitle={ECCV},
  year={2020}
}

If you use the SWideRNet backbone with Switchable Atrous Convolution, please consider citing

  • SWideRNet:
@article{swidernet_2020,
  title={Scaling Wide Residual Networks for Panoptic Segmentation},
  author={Chen, Liang-Chieh and Wang, Huiyu and Qiao, Siyuan},
  journal={arXiv:2011.11675},
  year={2020}
}
  • Switchable Atrous Convolution (SAC):
@inproceedings{detectors_2021,
  title={{DetectoRS}: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution},
  author={Qiao, Siyuan and Chen, Liang-Chieh and Yuille, Alan},
  booktitle={CVPR},
  year={2021}
}

If you use the MobileNetV3 backbone, please consider citing

  • MobileNetV3:
@inproceedings{howard2019searching,
  title={Searching for {MobileNetV3}},
  author={Howard, Andrew and Sandler, Mark and Chu, Grace and Chen, Liang-Chieh and Chen, Bo and Tan, Mingxing and Wang, Weijun and Zhu, Yukun and Pang, Ruoming and Vasudevan, Vijay and others},
  booktitle={ICCV},
  year={2019}
}

## References

  1. Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollar. "Panoptic segmentation." In CVPR, 2019.

  2. Alex Kendall, Yarin Gal, and Roberto Cipolla. "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics." In CVPR, 2018.

  3. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In CVPR, 2016.

  4. Sergey Zagoruyko and Nikos Komodakis. "Wide residual networks." In BMVC, 2016.

  5. Zifeng Wu, Chunhua Shen, and Anton van den Hengel. "Wider or deeper: Revisiting the ResNet model for visual recognition." Pattern Recognition, 2019.

  6. Tien-Ju Yang, Maxwell D Collins, Yukun Zhu, Jyh-Jing Hwang, Ting Liu, Xiao Zhang, Vivienne Sze, George Papandreou, and Liang-Chieh Chen. "DeeperLab: Single-shot image parser." arXiv:1902.05093, 2019.

  7. Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. "Encoder-decoder with atrous separable convolution for semantic image segmentation." In ECCV, 2018.

  8. George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris, Jonathan Tompson, and Kevin Murphy. "PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model." In ECCV, 2018.

  9. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. "Rethinking the inception architecture for computer vision." In CVPR, 2016.

  10. Jie Hu, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." In CVPR, 2018.