Box2Mask: Weakly Supervised 3D Semantic Instance Segmentation Using Bounding Boxes

Julian Chibane1,2,   Francis Engelmann3,   Tuan Anh Tran2,   Gerard Pons-Moll1,2

1University of Tübingen, Germany
2Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
3ETH Zurich AI Center, Switzerland

European Conference on Computer Vision (ECCV), 2022 - Oral
Are 3D bounding box annotations suitable to train dense 3D semantic instance segmentation?


Key Finding

Annotation types. Our key finding is that bounding box annotations are a surprisingly valuable source of weak supervision for 3D instance segmentation. Prior works either require per-point annotations (left), with an instance ID and semantic class for millions of points, or, in the case of initial weakly supervised methods, use sparse point annotations (middle), where a subset of points is annotated with instance center and semantic class. We propose bounding box annotations (right), where each object is annotated with its tight-fitting box and a semantic label. We find that boxes combine desirable properties: they allow for results on par with full supervision, reduce annotation effort to the object level, and are readily available in several large-scale 3D datasets.
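To illustrate how much coarser a box annotation is than a per-point annotation, the following minimal sketch collapses per-point instance and semantic labels into one tight box plus one semantic label per object. It assumes axis-aligned boxes; the function name and the dictionary layout are hypothetical and are not taken from the Box2Mask code.

```python
import numpy as np

def boxes_from_instance_labels(points, instance_ids, semantic_ids):
    """Collapse per-point labels into one tight axis-aligned box per object.

    points:       (N, 3) array of xyz coordinates
    instance_ids: (N,) integer instance id per point
    semantic_ids: (N,) integer semantic class per point
    """
    boxes = []
    for inst in np.unique(instance_ids):
        mask = instance_ids == inst
        pts = points[mask]
        boxes.append({
            "min": pts.min(axis=0),                  # lower box corner
            "max": pts.max(axis=0),                  # upper box corner
            "semantic": int(semantic_ids[mask][0]),  # one label per object
        })
    return boxes

# Toy scene: four points belonging to two objects.
points = np.array([[0, 0, 0], [1, 2, 1], [5, 5, 5], [6, 5, 7]], dtype=float)
instance_ids = np.array([0, 0, 1, 1])
semantic_ids = np.array([3, 3, 5, 5])

boxes = boxes_from_instance_labels(points, instance_ids, semantic_ids)
```

Each object is now described by six numbers and a class label, regardless of how many points it contains.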

Box2Mask - Method Overview

1.) Input: The input to our method, Box2Mask, is a colored 3D point cloud of a scene.
2.) Bounding Box Voting: For each point in the input scene, our model predicts the point's instance, parameterized as a bounding box. Our key contribution is that this is trained with only coarse bounding box annotations and requires no per-point labels.
3.) Non-Maximum Clustering: Votes are clustered via a novel algorithm called Non-Maximum Clustering (NMC), which is specifically tailored to the bounding box representation.
4.) Back-Projection: Each point is associated with the cluster of the box it predicted. Doing this for every point yields the final result.
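Steps 2 to 4 can be sketched roughly as follows. This is not the NMC algorithm from the paper, whose details are not given here; it is a simplified greedy grouping in the spirit of non-maximum suppression, where overlapping votes are assigned to a cluster instead of being discarded. Axis-aligned boxes, the per-vote scores, the IoU threshold of 0.5, the mean-box representative, and all function names are assumptions made for illustration.

```python
import numpy as np

def iou_3d(box, boxes):
    """IoU between one axis-aligned box and an (M, 6) array of boxes.

    Each box is (xmin, ymin, zmin, xmax, ymax, zmax).
    """
    lo = np.maximum(box[:3], boxes[:, :3])
    hi = np.minimum(box[3:], boxes[:, 3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None), axis=1)
    vol = np.prod(box[3:] - box[:3])
    vols = np.prod(boxes[:, 3:] - boxes[:, :3], axis=1)
    return inter / (vol + vols - inter)

def cluster_box_votes(votes, scores, iou_thresh=0.5):
    """Greedily group per-point box votes (assumed stand-in for NMC).

    Returns one cluster label per point and one representative box per cluster.
    """
    labels = np.full(len(votes), -1)
    representatives = []
    for i in np.argsort(-scores):          # visit most confident votes first
        if labels[i] != -1:
            continue                        # already absorbed by a cluster
        members = (labels == -1) & (iou_3d(votes[i], votes) >= iou_thresh)
        members[i] = True                   # the seed always joins its cluster
        labels[members] = len(representatives)
        representatives.append(votes[members].mean(axis=0))
    return labels, np.stack(representatives)

# Toy example: five points, each voting for the box of its object.
votes = np.array([
    [0.00, 0.0, 0.0, 1.00, 1.0, 1.0],
    [0.05, 0.0, 0.0, 1.05, 1.0, 1.0],
    [0.00, 0.1, 0.0, 1.00, 1.1, 1.0],
    [3.00, 3.0, 3.0, 4.00, 4.0, 4.0],
    [3.00, 3.1, 3.0, 4.00, 4.1, 4.0],
])
scores = np.array([0.9, 0.8, 0.7, 0.95, 0.6])

# Back-projection: labels[k] is the instance assigned to point k.
labels, representatives = cluster_box_votes(votes, scores)
```

Because every vote originates from exactly one point, the cluster label of a vote directly serves as the instance label of that point, which is what makes the back-projection step trivial in this sketch.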


Box2Mask, trained with only bounding box annotations, attains the result quality of fully supervised state-of-the-art methods (96.9% of full supervision in mAP@50, 100.0% in mAP@25) on the ScanNet benchmark, and largely outperforms prior weakly supervised methods (+15.2 mAP@50). See the paper for details.

Interactive Results
Object semantics: Cabinet, Bed, Chair, Sofa, Table, Door, Window, Bookshelf, Picture, Counter, Desk, Curtain, Refrigerator, Bathtub, Shower curtain, Toilet, Sink, Other furniture
Object instances: Different colors (chosen at random) represent different instances.



    title = {Box2Mask: Weakly Supervised 3D Semantic Instance Segmentation Using Bounding Boxes},
    author = {Chibane, Julian and Engelmann, Francis and Tran, Tuan Anh and Pons-Moll, Gerard},
    booktitle = {European Conference on Computer Vision ({ECCV})},
    month = {October},
    organization = {{Springer}},
    year = {2022},


Carl-Zeiss-Stiftung Tübingen AI Center University of Tübingen MPII Saarbrücken

We thank Alexey Nekrasov and Jonas Schult for helpful discussions and feedback. This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans). Gerard Pons-Moll is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645. Julian Chibane is a fellow of the Meta Research PhD Fellowship Program - area: AR/VR Human Understanding. Francis Engelmann is a post-doctoral research fellow at the ETH AI Center. The project was made possible by funding from the Carl Zeiss Foundation.