InterTrack: Tracking Human Object Interaction without Object Templates
Xianghui Xie1,2,3, Jan Eric Lenssen3, Gerard Pons-Moll1,2,3
1 University of Tübingen, Germany
2 Tübingen AI Center, Germany
3 Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
From a monocular RGB video, our method tracks the human and object under occlusion and dynamic motion, without using any object templates. Our method is trained only on synthetic data and generalizes well to real-world videos such as those captured by mobile phones.
Abstract
Tracking human object interaction from videos is important for understanding human behavior from the rapidly growing stream of video data. Previous video-based methods require predefined object templates, while single-image-based methods are template-free but lack temporal consistency. In this paper, we present a method to track human object interaction without any object shape templates. We decompose the 4D tracking problem into per-frame pose tracking and canonical shape optimization. We first apply a single-view reconstruction method to obtain temporally-inconsistent per-frame interaction reconstructions. Then, for the human, we propose an efficient autoencoder to predict SMPL vertices directly from the per-frame reconstructions, introducing temporally consistent correspondence. For the object, we introduce a pose estimator that leverages temporal information to predict smooth object rotations under occlusions. To train our model, we propose a method to generate synthetic interaction videos and synthesize in total 10 hours of video across 8.5k sequences with full 3D ground truth. Experiments on BEHAVE and InterCap show that our method significantly outperforms previous template-based video tracking and single-frame reconstruction methods. Our proposed synthetic video dataset also allows training video-based methods that generalize to real-world videos.
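To make the correspondence step concrete, below is a minimal sketch of a point-cloud autoencoder in the spirit of CorrAE: it maps an unordered per-frame human point cloud to SMPL vertices in a fixed order, which is what introduces temporally consistent correspondence across frames. The architecture (layer sizes, PointNet-style max pooling) is our illustrative assumption, not the paper's exact network.

```python
# Hedged sketch (assumptions, not the paper's CorrAE architecture) of an
# autoencoder that maps an unordered per-frame human point cloud to the
# 6890 SMPL vertices in a fixed order, giving temporal correspondence.
import torch
import torch.nn as nn

class PointCloudToSMPLVerts(nn.Module):
    def __init__(self, n_smpl_verts=6890, feat_dim=512):
        super().__init__()
        # per-point encoder followed by max pooling (PointNet-style global feature)
        self.encoder = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )
        # decoder regresses all SMPL vertices at once, in canonical vertex order
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_smpl_verts * 3),
        )
        self.n_smpl_verts = n_smpl_verts

    def forward(self, points):                          # points: (B, N, 3), unordered
        feat = self.encoder(points).max(dim=1).values   # (B, feat_dim) global feature
        verts = self.decoder(feat)                      # (B, 6890 * 3)
        return verts.view(-1, self.n_smpl_verts, 3)     # ordered SMPL vertices
```

Training such a network would supervise the predicted vertices with ground-truth SMPL vertices (e.g. an L2 loss); at test time, SMPL parameters can be fitted to the predicted, consistently-ordered vertices via the SMPL layer.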
Key idea: 4D tracking = one global shape + per-frame poses
Our key idea is to decompose the 4D tracking problem into global shape optimization and per-frame pose tracking, which greatly reduces the solution space. For the human, we use a simple yet efficient autoencoder, CorrAE, to obtain coherent human points and optimize the human via the SMPL layer. For the object, we use a temporal object pose estimator, TOPNet, to predict the object rotation, which allows us to optimize a common object shape in canonical space and fine-tune the pose predictions. We then jointly optimize human and object based on contacts to obtain consistent tracking.
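As a rough illustration of this decomposition, the sketch below jointly optimizes a single canonical object point cloud and per-frame rigid poses against temporally-inconsistent per-frame reconstructions, with a simple smoothness term on the rotations. It is a minimal stand-in under our own assumptions (Chamfer supervision, hypothetical variable names), not the full pipeline: there is no TOPNet initialization and no contact-based human-object refinement here.

```python
# Minimal sketch (not the official implementation) of the core idea:
# recover ONE canonical object point cloud plus per-frame rigid poses
# from temporally-inconsistent per-frame reconstructions.
# `per_frame_points` is a hypothetical list of (N_t, 3) tensors.
import torch

def axis_angle_to_matrix(aa):
    """Rodrigues' formula: (..., 3) axis-angle -> (..., 3, 3) rotation matrix."""
    theta = aa.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = aa / theta
    K = torch.zeros(*aa.shape[:-1], 3, 3, device=aa.device)
    K[..., 0, 1], K[..., 0, 2] = -k[..., 2], k[..., 1]
    K[..., 1, 0], K[..., 1, 2] = k[..., 2], -k[..., 0]
    K[..., 2, 0], K[..., 2, 1] = -k[..., 1], k[..., 0]
    eye = torch.eye(3, device=aa.device).expand_as(K)
    s, c = theta.sin()[..., None], theta.cos()[..., None]
    return eye + s * K + (1 - c) * (K @ K)

def chamfer(a, b):
    """Symmetric Chamfer distance between two (N, 3) / (M, 3) point sets."""
    d = torch.cdist(a, b)                      # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def track_object(per_frame_points, n_canonical=1024, iters=500, lr=1e-2):
    T = len(per_frame_points)
    shape = torch.randn(n_canonical, 3, requires_grad=True)   # shared canonical shape
    rot = torch.zeros(T, 3, requires_grad=True)               # per-frame axis-angle
    trans = torch.zeros(T, 3, requires_grad=True)              # per-frame translation
    opt = torch.optim.Adam([shape, rot, trans], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        R = axis_angle_to_matrix(rot)                          # (T, 3, 3)
        loss = sum(chamfer(shape @ R[t].T + trans[t], per_frame_points[t])
                   for t in range(T)) / T
        # temporal smoothness on rotations keeps tracking stable under occlusion
        loss = loss + 0.1 * (rot[1:] - rot[:-1]).pow(2).mean()
        loss.backward()
        opt.step()
    return shape.detach(), rot.detach(), trans.detach()
```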
ProciGen-Video dataset: video sequences for interaction
To train our video-based pose estimator, we propose a method to generate synthetic interaction videos. We generate the ProciGen-Video dataset, which contains 10 hours of video across 8.5k sequences of humans interacting with 4.5k different objects. Our dataset allows training various video-based methods.
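For readers who want to plug such data into their own video models, the following is a hypothetical loading sketch. The directory layout, file names, and annotation keys (rgb/*.jpg, gt/*.npz with smpl_pose, obj_rot, obj_trans) are assumptions for illustration, not the released format.

```python
# Hypothetical loader sketch for a ProciGen-Video-style dataset.
# Assumed layout (NOT the released format): one folder per sequence,
# RGB frames under rgb/, per-frame ground truth under gt/.
import os
from glob import glob

import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class InteractionVideoDataset(Dataset):
    """Returns a clip of `clip_len` consecutive frames with their annotations."""
    def __init__(self, root, clip_len=16):
        self.clip_len = clip_len
        self.clips = []
        for seq in sorted(glob(os.path.join(root, "*"))):       # one dir per sequence
            frames = sorted(glob(os.path.join(seq, "rgb", "*.jpg")))
            for start in range(0, len(frames) - clip_len + 1, clip_len):
                self.clips.append(frames[start:start + clip_len])

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        frames = self.clips[idx]
        rgb = np.stack([np.asarray(Image.open(f)) for f in frames])  # (T, H, W, 3)
        # assumed per-frame annotation file: SMPL pose + object rotation/translation
        ann = [np.load(f.replace("rgb", "gt").replace(".jpg", ".npz")) for f in frames]
        return {
            "rgb": rgb,
            "smpl_pose": np.stack([a["smpl_pose"] for a in ann]),    # (T, 72)
            "obj_rot":   np.stack([a["obj_rot"] for a in ann]),      # (T, 3, 3)
            "obj_trans": np.stack([a["obj_trans"] for a in ann]),    # (T, 3)
        }
```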
Long narrated video
HDM produces different shapes across frames while our method consistently tracks the shape and pose.
More results on videos captured by a mobile phone camera.
Updates
Citation
@inproceedings{xie2024InterTrack,
  title={InterTrack: Tracking Human Object Interaction without Object Templates},
  author={Xianghui Xie and Jan Eric Lenssen and Gerard Pons-Moll},
  year={2024},
  eprint={2408.13953},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2408.13953},
}
Acknowledgments
We thank RVH group members for their helpful discussions. This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans), the German Federal Ministry of Education and Research (BMBF): Tuebingen AI Center, FKZ: 01IS18039A, and the Amazon-MPI science hub. Gerard Pons-Moll is a Professor at the University of Tuebingen, endowed by the Carl Zeiss Foundation, at the Department of Computer Science, and a member of the Machine Learning Cluster of Excellence, EXC number 2064/1, Project number 390727645.