Box2Mask: Weakly Supervised 3D Semantic Instance Segmentation Using Bounding Boxes

Julian Chibane¹,², Francis Engelmann³, Tuan Anh Tran², Gerard Pons-Moll¹,²

¹University of Tübingen, Germany
²Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
³ETH Zurich AI Center, Switzerland

European Conference on Computer Vision (ECCV), 2022 - Oral

Are 3D bounding box annotations suitable to train dense 3D semantic instance segmentation?

Yes!

Key Finding

Box2Mask - Method Overview

1.) Input: The input to our method, Box2Mask, is a colored 3D point cloud of a scene.
2.) Bounding Box Voting: For each point in the input scene, our model predicts the point's instance, parameterized as a bounding box. Our key contribution is that this is trained with only coarse bounding box annotations and requires no per-point labels.
3.) Non-Maximum Clustering: Votes are clustered via a novel algorithm called Non-Maximum Clustering (NMC), which is specifically tailored to the bounding box representation.
4.) Back-Projection: Each point is associated with the cluster of the box it predicted. Doing this for every point yields the final instance segmentation.
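
The voting, clustering, and back-projection steps can be illustrated with a minimal sketch. This is not the NMC algorithm from the paper: it substitutes a simple greedy, NMS-style grouping by 3D IoU. The function names, the per-vote `scores` array, and the `iou_thresh` value are assumptions made only for this illustration.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one axis-aligned 3D box and an array of boxes.

    Boxes are stored as (xmin, ymin, zmin, xmax, ymax, zmax).
    """
    lo = np.maximum(box[:3], boxes[:, :3])
    hi = np.minimum(box[3:], boxes[:, 3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None), axis=1)
    vol = np.prod(box[3:] - box[:3])
    vols = np.prod(boxes[:, 3:] - boxes[:, :3], axis=1)
    return inter / (vol + vols - inter + 1e-12)

def cluster_box_votes(votes, scores, iou_thresh=0.5):
    """Greedy grouping of per-point box votes (simplified stand-in for NMC).

    votes:  (N, 6) array, one predicted box per input point.
    scores: (N,) array of hypothetical per-vote confidences.
    Returns an (N,) array of instance ids, one per point. Because every
    vote belongs to exactly one point, the cluster id of a vote is also
    the instance label of that point (the back-projection step).
    """
    labels = np.full(len(votes), -1, dtype=int)
    next_id = 0
    # Visit votes from most to least confident.
    for i in np.argsort(-scores):
        if labels[i] != -1:
            continue  # already absorbed by a stronger vote
        members = (labels == -1) & (box_iou(votes[i], votes) >= iou_thresh)
        members[i] = True  # the representative always joins its own cluster
        labels[members] = next_id
        next_id += 1
    return labels
```

Given one box vote per point, `cluster_box_votes(votes, scores)` returns one instance id per point, so points whose predicted boxes overlap strongly end up in the same instance.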

Results

Box2Mask, trained only with bounding boxes, attains the result quality of fully supervised state-of-the-art methods (96.9% of full supervision in mAP@50, 100.0% in mAP@25) on the ScanNet benchmark and largely outperforms prior weakly supervised methods (+15.2 mAP@50). See the paper for details.

Interactive Results

Object semantics: Cabinet Bed Chair Sofa Table Door Window Bookshelf Picture Counter Desk Curtain Refrigerator Bathtub Shower curtain Toilet Sink Other furniture

Object instances: Different colors (chosen at random) represent different instances.

Citation

@inproceedings{chibane2021box2mask,
    title = {Box2Mask: Weakly Supervised 3D Semantic Instance Segmentation Using Bounding Boxes},
    author = {Chibane, Julian and Engelmann, Francis and Tran, Tuan Anh and Pons-Moll, Gerard},
    booktitle = {European Conference on Computer Vision ({ECCV})},
    month = {October},
    organization = {{Springer}},
    year = {2022},
}

Acknowledgments

We thank Alexey Nekrasov and Jonas Schult for helpful discussions and feedback. This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans). Gerard Pons-Moll is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645. Julian Chibane is a fellow of the Meta Research PhD Fellowship Program - area: AR/VR Human Understanding. Francis Engelmann is a post-doctoral research fellow at the ETH AI Center. The project was made possible by funding from the Carl Zeiss Foundation.