Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation

Xianghui Xie1, 2, 3, Bharat Lal Bhatnagar4, Jan Eric Lenssen3, Gerard Pons-Moll1,2,3
1 University of Tübingen, Germany
2 Tübingen AI Center, Germany
3 Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
4 Meta Reality Labs

Computer Vision and Pattern Recognition Conference (CVPR), 2024 (Highlight)
Given a single RGB image, our method, trained only on our proposed synthetic interaction dataset, can reconstruct the human, the object, and their contacts, without any predefined template meshes.


Reconstructing human-object interaction in 3D from a single RGB image is a challenging task, and existing data-driven methods do not generalize beyond the objects present in carefully curated 3D interaction datasets. Capturing large-scale real data to learn strong interaction and 3D shape priors is very expensive due to the combinatorial nature of human-object interactions. In this paper, we propose ProciGen (Procedural interaction Generation), a method to procedurally generate datasets with both plausible interactions and diverse object variations. We generate 1M+ human-object interaction pairs in 3D and leverage this large-scale data to train our HDM (Hierarchical Diffusion Model), a novel method to reconstruct interacting humans and unseen objects without any templates. Our HDM is an image-conditioned diffusion model that learns both realistic interaction and highly accurate human and object shapes. Experiments show that our HDM trained with ProciGen significantly outperforms prior methods that require template meshes, and that our dataset allows training methods with strong generalization ability to unseen object instances.

Key idea 1: generate a large amount of interaction data with diverse shapes

We present the ProciGen dataset: Procedural interaction Generation. Capturing real interaction datasets is very expensive. We hence propose to procedurally combine interaction, human, and object shape datasets to generate a large amount of interaction data with diverse object shapes. This yields 1M+ interactions with 21k+ different objects, which allows training interaction reconstruction methods with strong generalization ability. This scale is possible because the procedure essentially multiplies the sizes of the individual datasets.
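The combinatorial scaling behind this idea can be sketched with a toy example: each captured interaction pose is crossed with every object shape variant of the same category, so a handful of captured interactions yields orders of magnitude more synthetic pairs. The dataset names and counts below are illustrative placeholders, not the actual ProciGen sources, and the real pipeline additionally transfers contact points to the new mesh and re-optimizes the human pose.

```python
from itertools import product

# Toy stand-ins for the real sources (illustrative, not the actual datasets):
# a few captured interaction poses per object category, and many shape
# variants of that category from a large shape dataset.
captured_interactions = {           # category -> captured interaction poses
    "chair": ["sit", "lift"],
    "box": ["carry", "push", "pull"],
}
shape_variants = {                  # category -> alternative object meshes
    "chair": [f"chair_{i:04d}" for i in range(300)],
    "box": [f"box_{i:04d}" for i in range(200)],
}

def generate_pairs():
    """Cross every captured interaction with every shape variant of the
    same category to procedurally multiply the datasets."""
    pairs = []
    for category, poses in captured_interactions.items():
        for pose, shape in product(poses, shape_variants[category]):
            pairs.append((category, pose, shape))
    return pairs

pairs = generate_pairs()
# 2*300 + 3*200 = 1200 synthetic pairs from only 5 captured interactions.
print(len(pairs))  # → 1200
```

Scaling the shape pools to tens of thousands of objects is what makes the 1M+ figure reachable from a comparatively small set of captured interactions.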

Key idea 2: learn joint interaction space and individual human-object shape spaces separately.

Our hierarchical diffusion model. Learning a joint shape space of human and object is difficult because the combinatorial space of all possible human + object shapes is huge. We hence propose to learn the joint interaction space and the individual shape spaces separately, using three networks. Given an RGB image of a human interacting with an object, we first jointly reconstruct the human and the object as one point cloud with segmentation labels (Stage 1). This prediction captures the interaction but lacks accurate shapes. We then use two diffusion models, one for the human and one for the object, with cross attention to refine the initial noisy prediction while preserving the interaction context (Stage 2). Our hierarchical design faithfully predicts both interaction and shapes.
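The two-stage flow can be illustrated with a minimal numpy sketch. Everything here is a schematic placeholder, not the actual HDM networks: the conditional diffusion models are replaced by random sampling and simple smoothing steps, and cross attention is stood in for by a shared interaction context.

```python
import numpy as np

rng = np.random.default_rng(0)

def stage1_joint(image_feat, n_points=1024):
    """Stage 1 (schematic): predict one combined point cloud for human +
    object with a per-point segmentation label (0 = human, 1 = object).
    The real model runs image-conditioned diffusion; here we just sample."""
    points = rng.standard_normal((n_points, 3))
    labels = (rng.random(n_points) > 0.5).astype(int)
    return points, labels

def stage2_refine(points, labels, steps=3):
    """Stage 2 (schematic): refine human and object point clouds with two
    separate models; the cross-attended interaction context is stood in
    for by the joint centroid of the Stage-1 prediction."""
    context = points.mean(axis=0)            # shared interaction context
    human, obj = points[labels == 0], points[labels == 1]
    for _ in range(steps):                   # placeholder denoising steps
        human = human - 0.1 * (human - context)
        obj = obj - 0.1 * (obj - context)
    return human, obj

image_feat = rng.standard_normal(256)        # dummy image conditioning
pts, lbl = stage1_joint(image_feat)
human, obj = stage2_refine(pts, lbl)
print(human.shape, obj.shape)
```

The key design choice this mirrors: Stage 1 reasons about the interaction as a whole, while Stage 2 specializes per entity without discarding the joint context.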

Long narrated video

Reconstruction results

Comparison with CHORE on the BEHAVE dataset.

CHORE requires predefined templates and cannot predict accurate object poses under challenging interactions. Our method is template-free and recovers human and object shapes accurately.

Comparison with PC2 on the BEHAVE dataset.

PC2 does not rely on template shapes, but its reconstructions are noisy and miss details because learning both human and object shapes jointly is difficult. Our hierarchical design allows us to model both the joint interaction and the separate shapes accurately.

Generalization: without seeing any images from these datasets.

Generalization to the NTU-RGBD dataset.

Our method can reconstruct different objects faithfully under various camera viewpoints and lighting conditions, without relying on any template shapes.

Generalization to the SYSU-action dataset.

Our method can faithfully reconstruct diverse real-life humans and objects under challenging interactions and occlusions.

Generalization to the in-the-wild COCO dataset.

Our method generalizes well to in-the-wild images, without any template shapes.

Application: texturing human and object separately using Text2txt.

Our accurate shape reconstruction allows us to obtain high-fidelity meshes with simple surface reconstruction methods, and our segmentation makes it possible to manipulate and texture the human and the object separately.


  • March 27, 2024: Training code for our HDM model is released. Check out the HDM repo!
  • March 09, 2024: ProciGen data and HDM hugging face demo released!


@inproceedings{xie2024template_free,
                title = {Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation},
                author = {Xie, Xianghui and Bhatnagar, Bharat Lal and Lenssen, Jan Eric and Pons-Moll, Gerard},
                booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
                year = {2024},
}



    We thank RVH group members for their helpful discussions. This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans), and German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A, and Amazon-MPI science hub. Gerard Pons-Moll is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645. The project was made possible by funding from the Carl Zeiss Foundation.