Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D
Agniv Sharma*1,4, Xianghui Xie*1,2,3, Tom Fischer4, Eddy Ilg4, Gerard Pons-Moll1,2,3
* Equal contribution.
1 University of Tübingen, Germany
2 Tübingen AI Center, Germany
3 Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
4 Technische Universität Nürnberg
CVPR 2026 findings
Given a text prompt describing a human, an object, and how they interact with each other, our method generates a high-quality textured mesh of the human-object interaction in 3D.
Abstract
TL;DR: We introduce Hoi3DGen, a method that generates diverse and high-quality human-object interactions in 3D from text prompts.
Modeling and generating 3D human-object interactions from text is crucial for applications in AR, XR, and gaming. Existing approaches often rely on score distillation from text-to-image models, but their results suffer from the Janus problem and do not follow text prompts faithfully due to the scarcity of high-quality interaction data. We introduce Hoi3DGen, a framework that generates high-quality textured meshes of human-object interactions that follow the input interaction descriptions precisely. We first curate realistic and high-quality interaction data by leveraging multimodal large language models, and then create a full text-to-3D pipeline, which achieves orders-of-magnitude improvements in interaction fidelity. Our method surpasses baselines by 4–15x in text consistency and 3–7x in 3D model quality, exhibiting strong generalization to diverse categories and interaction types while maintaining high-quality 3D generation.
Key idea: distill a rich interaction prior from foundation models via instructed fine-tuning
Generation Results
More qualitative examples
Out-of-Distribution Generalization
Our model can generate samples in which the human, the object, or the interaction is unseen in our interaction data.
Controllability
Our model allows flexible control of the object, contact type, and human while keeping the other components fixed.
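As a rough illustration of this kind of control, the sketch below builds a family of text prompts in which one component (human, contact type, or object) varies while the other two stay fixed. The `InteractionPrompt` class, the `vary` helper, and the prompt template are hypothetical and exist only for this example; the actual prompt wording and interface used by Hoi3DGen are not described on this page.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class InteractionPrompt:
    """Hypothetical container for the three controllable components."""
    human: str    # description of the person
    contact: str  # description of the contact / interaction type
    obj: str      # description of the object

    def to_text(self) -> str:
        # Assumed template: the real prompt format is not specified here.
        return f"{self.human} {self.contact} {self.obj}"


def vary(base: InteractionPrompt, field: str, values: list[str]) -> list[str]:
    """Return prompts that change only `field`, keeping the rest of `base` fixed."""
    return [replace(base, **{field: value}).to_text() for value in values]


base = InteractionPrompt(
    human="a man in a red jacket",
    contact="sitting on",
    obj="a wooden chair",
)

# Sweep the object while the human and the contact type stay the same.
object_sweep = vary(base, "obj", ["a wooden chair", "a skateboard"])

# Sweep the contact type while the human and the object stay the same.
contact_sweep = vary(base, "contact", ["sitting on", "lifting"])

for prompt in object_sweep + contact_sweep:
    print(prompt)
```

Each resulting string would then be passed to the text-to-3D model as a separate prompt, so that differences between the generated meshes can be attributed to the one component that changed.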
Unusual Interactions
Our model can generate highly dynamic and unusual interactions between humans and objects.
Citation
@inproceedings{Sharma_and_xie2026Hoi3DGen,
title={Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D},
author={Agniv Sharma and Xianghui Xie and Tom Fischer and Eddy Ilg and Gerard Pons-Moll},
booktitle = {CVPR findings},
month = {June},
year = {2026},
}
Acknowledgments
We thank the RVH group members for their helpful discussions. This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans), the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A, and the Amazon-MPI Science Hub. Gerard Pons-Moll is a Professor at the University of Tübingen endowed by the Carl Zeiss Foundation, at the Department of Computer Science, and a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 - Project number 390727645.