
MaX-DeepLab

MaX-DeepLab is the first fully end-to-end method for panoptic segmentation [1], removing the need for hand-designed priors used in previous methods, such as object bounding boxes (used in DETR [2]), instance centers (used in Panoptic-DeepLab [3]), non-maximum suppression, and thing-stuff merging.

The goal of panoptic segmentation is to predict a set of non-overlapping masks along with their corresponding class labels (e.g., person, car, road, sky). MaX-DeepLab achieves this goal directly by predicting a set of class-labeled masks with a mask transformer.

The mask transformer is trained end-to-end with a panoptic quality (PQ) inspired loss function, which matches the predicted masks to the ground-truth masks and optimizes them with a PQ-style similarity metric. In addition, our proposed mask transformer introduces a global memory path alongside the CNN pixel path and employs all four types of attention between the two paths, allowing the CNN to read from and write to the global memory at any layer.
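As a rough illustration (not the repo's implementation), the PQ-style similarity between one predicted mask and one ground-truth mask can be sketched as the product of the predicted probability of the ground-truth class and a Dice-like mask overlap; the paper uses soft masks and optimizes this metric through bipartite matching:

```python
import numpy as np

def pq_style_similarity(pred_mask, gt_mask, class_prob, eps=1e-6):
    """PQ-style similarity: class score times a Dice-like mask overlap.

    pred_mask, gt_mask: binary arrays of the same shape.
    class_prob: predicted probability of the ground-truth class.
    Illustrative sketch only; the actual loss operates on soft masks
    and is applied after bipartite matching of predictions to targets.
    """
    intersection = np.sum(pred_mask * gt_mask)
    dice = (2.0 * intersection + eps) / (pred_mask.sum() + gt_mask.sum() + eps)
    return class_prob * dice

# A perfectly predicted mask with 90% class confidence scores 0.9.
mask = np.zeros((4, 4))
mask[:2, :2] = 1
print(pq_style_similarity(mask, mask, 0.9))  # → 0.9
```

A completely disjoint mask scores near zero regardless of class confidence, which is what pushes the model to get both the mask and the class right.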

Prerequisite

  1. Make sure the software is properly installed.

  2. Make sure the target dataset is correctly prepared (e.g., COCO).

  3. Download the ImageNet pretrained checkpoints, and update the initial_checkpoint path in the config files.
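For step 3, the checkpoint path is set via the `initial_checkpoint` field of the experiment config. The exact layout may differ across versions of the textproto configs, and the path below is a placeholder:

```
model_options {
  initial_checkpoint: "/path/to/imagenet_pretrained/ckpt"
}
```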

Model Zoo

We explore MaX-DeepLab model variants that are built on top of several backbones (e.g., ResNet model variants [4]).

  1. MaX-DeepLab-S replaces the last two stages of ResNet-50-beta with axial-attention blocks and applies a small dual-path transformer. (ResNet-50-beta replaces the ResNet-50 stem with the Inception stem [5].)
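Axial attention factorizes 2D self-attention into two 1D attentions, first along the height axis and then along the width axis, reducing the cost from quadratic in the number of pixels to quadratic in one spatial dimension. A minimal single-head sketch (omitting the learned positional terms and multi-head projections used in the paper) might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_1d(x, axis):
    """Single-head self-attention along one spatial axis of x with shape (H, W, C)."""
    x = np.moveaxis(x, axis, 0)  # bring the attended axis to the front
    # scores[..., i, j]: similarity between positions i and j along that axis
    scores = np.einsum('i...c,j...c->...ij', x, x) / np.sqrt(x.shape[-1])
    out = np.einsum('...ij,j...c->i...c', softmax(scores), x)
    return np.moveaxis(out, 0, axis)

def axial_attention(x):
    """Height-axis attention followed by width-axis attention."""
    return attend_1d(attend_1d(x, axis=0), axis=1)

x = np.random.randn(8, 8, 16)
print(axial_attention(x).shape)  # (8, 8, 16)
```

Each position still aggregates information from the whole image after the two passes, since the height pass propagates along columns and the width pass along rows.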

COCO Panoptic Segmentation

We provide checkpoints trained on the COCO 2017 panoptic train set and evaluated on the val set. If you would like to train these models yourself, please find the corresponding config files under the directory configs/coco/max_deeplab.

All reported results are obtained with single-scale inference and ImageNet-1K pretrained checkpoints.

| Model | Input Resolution | Training Steps | PQ [*] | PQ<sup>thing</sup> [*] | PQ<sup>stuff</sup> [*] | PQ [**] |
| :---- | :--------------: | :------------: | :----: | :--------------------: | :--------------------: | :-----: |
| MaX-DeepLab-S (config, ckpt) | 641 x 641 | 100k | 45.9 | 49.2 | 40.9 | 46.36 |
| MaX-DeepLab-S (config, ckpt) | 641 x 641 | 200k | 46.5 | 50.6 | 40.4 | 47.04 |
| MaX-DeepLab-S (config, ckpt) | 641 x 641 | 400k | 47.0 | 51.3 | 40.5 | 47.56 |
| MaX-DeepLab-S (config, ckpt) | 1025 x 1025 | 100k | 47.9 | 52.1 | 41.5 | 48.41 |
| MaX-DeepLab-S (config, ckpt) | 1025 x 1025 | 200k | 48.7 | 53.6 | 41.3 | 49.23 |

[*]: Results evaluated by the official script. [**]: Results evaluated by our pipeline. See Q4 in FAQ.

Note that the results differ slightly from the paper due to the following implementation differences:

  1. Stronger pretrained checkpoints are used in this repo.
  2. A linear drop path schedule is used, rather than a constant schedule.
  3. For simplicity, Adam [6] without weight decay is used, rather than RAdam [7] + Lookahead [8] with weight decay.
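For point 2, a linear drop path schedule scales the drop rate with block depth, so early blocks are rarely dropped and the last block uses the maximum rate. A generic sketch of the schedule (not the repo's exact code):

```python
def linear_drop_path_rates(num_blocks, max_rate=0.2):
    """Drop path rate increases linearly from 0 (first block) to max_rate (last)."""
    if num_blocks == 1:
        return [0.0]
    return [max_rate * i / (num_blocks - 1) for i in range(num_blocks)]

# Rates increase linearly from 0.0 to 0.2 across 5 blocks.
print(linear_drop_path_rates(5, max_rate=0.2))
```

A constant schedule would instead drop every block with the same probability, regularizing shallow and deep blocks equally.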

Citing MaX-DeepLab

If you find this code helpful in your research or wish to refer to the baseline results, please use the following BibTeX entry.

  • MaX-DeepLab:
@inproceedings{max_deeplab_2021,
  author={Huiyu Wang and Yukun Zhu and Hartwig Adam and Alan Yuille and Liang-Chieh Chen},
  title={{MaX-DeepLab}: End-to-End Panoptic Segmentation with Mask Transformers},
  booktitle={CVPR},
  year={2021}
}
  • Axial-DeepLab:
@inproceedings{axial_deeplab_2020,
  author={Huiyu Wang and Yukun Zhu and Bradley Green and Hartwig Adam and Alan Yuille and Liang-Chieh Chen},
  title={{Axial-DeepLab}: Stand-Alone Axial-Attention for Panoptic Segmentation},
  booktitle={ECCV},
  year={2020}
}

References

  1. Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollar. "Panoptic segmentation." In CVPR, 2019.

  2. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. "End-to-End Object Detection with Transformers." In ECCV, 2020.

  3. Bowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, and Liang-Chieh Chen. "Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation." In CVPR, 2020.

  4. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In CVPR, 2016.

  5. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. "Rethinking the inception architecture for computer vision." In CVPR, 2016.

  6. Diederik P. Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization." In ICLR, 2015.

  7. Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. "On the Variance of the Adaptive Learning Rate and Beyond." In ICLR, 2020.

  8. Michael R. Zhang, James Lucas, Geoffrey Hinton, and Jimmy Ba. "Lookahead Optimizer: k steps forward, 1 step back." In NeurIPS, 2019.