Modeling 3D human-object interaction (HOI) is a problem of great interest for computer vision and a key enabler for virtual and mixed-reality applications. Existing methods work in a single direction: some recover plausible human interactions conditioned on a 3D object; others recover the object pose conditioned on a human pose. Instead, we provide the first unified model, TriDi, which works in any direction. Concretely, we generate the Human, Object, and Interaction modalities simultaneously with a new three-way diffusion process, allowing us to model seven distributions with one network. We implement TriDi as a transformer attending to the various modalities' tokens, thereby discovering the conditional relations between them. The user can control the interaction either with a text description of the HOI or with a contact map. We embed these two representations into a shared latent space, combining the practicality of text descriptions with the expressiveness of contact maps. Using a single network, TriDi unifies all the special cases of prior work and extends to new ones, modeling a family of seven distributions. Remarkably, despite using a single model, samples generated by TriDi surpass one-way specialized baselines on GRAB and BEHAVE in terms of both qualitative and quantitative metrics, and demonstrate better diversity. We show the applicability of TriDi to scene population, generating objects for human-contact datasets, and generalization to unseen object geometry.
Our goal is to provide a joint model of static Human, Object, and Interaction.
TriDi is a Trilateral Diffusion model of Human (body pose, identity, and 6-DoF global pose), Object (6-DoF global pose), and Interaction (contact-text latent). In the example above, the model is first configured to sample from one conditional distribution and then reconfigured to sample from another. One of the seven operating modes is chosen by setting the diffusion timestamp to 0 for the given condition modalities and to T for the modalities to be predicted, and by supplying an object class condition ('Table' or 'Suitcase' above).
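For intuition, below is a minimal Python sketch of this mode-selection idea, assuming a standard DDPM noise schedule and a denoiser callable that returns per-modality noise predictions. All names (`sample_tridi`, `ddpm_step`, `denoiser`) are placeholders for illustration, not the released implementation.

```python
# Hypothetical sketch of per-modality timestamps selecting one of the seven modes.
# Timestamp 0 marks a clean condition; t > 0 marks a modality being denoised.
import torch

T = 1000                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # standard DDPM schedule (assumed)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def ddpm_step(x_t, eps_pred, t):
    """One standard reverse DDPM step for step index t (1-based)."""
    a, ab = alphas[t - 1], alpha_bar[t - 1]
    mean = (x_t - (1 - a) / torch.sqrt(1 - ab) * eps_pred) / torch.sqrt(a)
    if t > 1:
        mean = mean + torch.sqrt(betas[t - 1]) * torch.randn_like(x_t)
    return mean

def sample_tridi(denoiser, modalities, obj_class, predict=("O", "I")):
    """Sample the modalities listed in `predict`, conditioning on the remaining ones."""
    x = {m: v.clone() for m, v in modalities.items()}   # keys: "H", "O", "I"
    for m in predict:                                    # predicted modalities start from noise
        x[m] = torch.randn_like(x[m])
    for t in reversed(range(1, T + 1)):
        ts = {m: (t if m in predict else 0) for m in x}  # per-modality timestamps
        eps = denoiser(x, ts, obj_class)                 # transformer attending to all modality tokens
        for m in predict:                                # condition modalities are never updated
            x[m] = ddpm_step(x[m], eps[m], t)
    return x
```

Changing the `predict` tuple (any non-empty subset of the three modalities) switches between the seven operating modes without changing the network.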
We want to combine the intuitiveness of text descriptions with the expressiveness of contact maps. Our solution is to learn a compact representation that encodes both in a joint latent space. We train an encoder that maps the contact map and the CLIP embedding of the text description to a joint latent code, and jointly train a decoder that maps the latent code back to the contact map.
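Below is a minimal sketch of such an encoder-decoder pair, assuming a per-vertex contact map over SMPL vertices and a 512-dimensional CLIP text embedding. The layer sizes, latent dimension, and class names are illustrative assumptions, not the architecture used in the paper.

```python
# Hypothetical contact-text latent space: encoder fuses contact map + CLIP text
# embedding into one latent code; decoder reconstructs the contact map from it.
import torch
import torch.nn as nn

N_VERTS = 6890      # e.g. SMPL vertex count (assumed)
CLIP_DIM = 512      # CLIP text embedding size
LATENT_DIM = 64     # size of the joint latent (assumed)

class ContactTextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_VERTS + CLIP_DIM, 512), nn.ReLU(),
            nn.Linear(512, LATENT_DIM),
        )

    def forward(self, contact_map, clip_emb):
        # contact_map: (B, N_VERTS) contact probabilities, clip_emb: (B, CLIP_DIM)
        return self.net(torch.cat([contact_map, clip_emb], dim=-1))

class ContactDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 512), nn.ReLU(),
            nn.Linear(512, N_VERTS), nn.Sigmoid(),   # per-vertex contact probability
        )

    def forward(self, z):
        return self.net(z)

# Training pairs the two: z = encoder(contact, clip_text); the loss compares
# decoder(z) against the ground-truth contact map (e.g. binary cross-entropy).
```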
The method is qualitatively evaluated on GRAB, BEHAVE, InterCap, and OMOMO.
Additionally, we provide an example of keyframe-based animation: objects are generated by TriDi at a low framerate and then interpolated to 30 fps, using linear interpolation for the object center and spherical linear interpolation (slerp) for the object rotation.
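A small sketch of this upsampling step is shown below, using SciPy's `Rotation` and `Slerp`; the function name and argument layout are our own choices for illustration.

```python
# Upsample low-framerate object keyframes to a target fps: linear interpolation
# for the object center, slerp for the object rotation.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def upsample_object_track(key_times, key_centers, key_rotmats, fps=30):
    """key_times: (K,) seconds; key_centers: (K, 3); key_rotmats: (K, 3, 3)."""
    query_times = np.arange(key_times[0], key_times[-1], 1.0 / fps)
    # Linear interpolation of the object center, per axis.
    centers = np.stack(
        [np.interp(query_times, key_times, key_centers[:, i]) for i in range(3)],
        axis=-1,
    )
    # Spherical linear interpolation of the object rotation.
    slerp = Slerp(key_times, Rotation.from_matrix(key_rotmats))
    rotmats = slerp(query_times).as_matrix()
    return query_times, centers, rotmats
```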
@article{petrov2024tridi,
title={TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions},
author={Petrov, Ilya A and Marin, Riccardo and Chibane, Julian and Pons-Moll, Gerard},
journal={arXiv preprint arXiv:2412.06334},
year={2024}
}
Special thanks to Garvita Tiwari, Nikita Kister, and Xianghui Xie for the helpful discussions. We also thank the RVH team for their help with proofreading the manuscript. This work is funded by the Deutsche Forschungsgemeinschaft - 409792180 (Emmy Noether Programme, project: Real Virtual Humans) and the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. G. Pons-Moll is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting I. A. Petrov. R. Marin has been supported by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 101109330. J. Chibane is a fellow of the Meta Research PhD Fellowship Program - area: AR/VR Human Understanding. The project was made possible by funding from the Carl Zeiss Foundation. The website is based on the StyleGAN3 and Nerfies websites.