Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-localization in Large Scenes from Body-Mounted Sensors

Vladimir Guzov* 1,2, Aymen Mir* 1,2, Torsten Sattler 3, Gerard Pons-Moll 1,2
*Joint first authors with equal contribution

1 University of Tübingen, Germany
2 Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
3 CIIRC, Czech Technical University in Prague, Czech Republic

CVPR 2021 Virtual
Overview: HPS jointly estimates the full 3D human pose and the location of a subject within large 3D scenes, using only wearable sensors. The inputs are IMU data, RGB video from a head-mounted camera, and a pre-scanned scene. We obtain an approximate 3D body pose from the IMU data and localize the subject in the 3D scene through head-camera self-localization. We then integrate the approximate body pose, the camera position and orientation, and the 3D scene in a joint optimization to obtain the final location and pose estimates.
Example sequence recorded with our method


We introduce the Human POSEitioning System (HPS), a method to recover the full 3D pose of a human registered with a 3D scan of the surrounding environment using wearable sensors. Using IMUs attached to the body limbs and a head-mounted camera looking outwards, HPS fuses camera-based self-localization with IMU-based human body tracking. The former provides drift-free but noisy position and orientation estimates, while the latter is accurate in the short term but subject to drift over longer periods of time. We show that our optimization-based integration exploits the benefits of the two, resulting in pose accuracy free of drift. Furthermore, we integrate 3D scene constraints into our optimization, such as foot contact with the ground, resulting in physically plausible motion. HPS complements more common third-person-based 3D pose estimation methods. It allows capturing larger recording volumes and longer periods of motion, and could be used for VR/AR applications where humans interact with the scene without requiring direct line of sight with an external camera, or to train agents that navigate and interact with the environment based on first-person visual input, like real humans. With HPS, we recorded a dataset of 7 subjects interacting with large 3D scenes (300-1000 sq. m), comprising more than 3 hours of diverse motion.
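To illustrate why combining the two signals can help, here is a minimal toy sketch in Python. It is not the HPS optimization: it assumes a 1D position instead of a full body pose, a constant-bias drift model for the IMU, and a plain linear least-squares objective. The function names (`fuse_trajectory`, `simulate`), the weights `w_cam` and `w_imu`, and all noise levels are hypothetical choices made for this illustration only.

```python
import numpy as np


def fuse_trajectory(cam_pos, imu_delta, w_cam=1.0, w_imu=100.0):
    """Toy 1D fusion (illustrative assumption, not the HPS objective).

    Minimizes  w_cam * sum_t (x_t - cam_pos[t])^2
             + w_imu * sum_t (x_{t+1} - x_t - imu_delta[t])^2
    where cam_pos are absolute but noisy positions and imu_delta are
    frame-to-frame displacements that are locally accurate but biased.
    """
    cam_pos = np.asarray(cam_pos, dtype=float)
    imu_delta = np.asarray(imu_delta, dtype=float)
    n = cam_pos.shape[0]
    if imu_delta.shape[0] != n - 1:
        raise ValueError("imu_delta must have exactly len(cam_pos) - 1 entries")

    # Absolute (camera-like) residuals: sqrt(w_cam) * (x_t - cam_pos[t]).
    a_cam = np.sqrt(w_cam) * np.eye(n)
    b_cam = np.sqrt(w_cam) * cam_pos

    # Relative (IMU-like) residuals: sqrt(w_imu) * (x_{t+1} - x_t - imu_delta[t]).
    # Row t of the difference operator has -1 at column t and +1 at column t + 1.
    diff_op = np.eye(n, k=1)[:-1] - np.eye(n)[:-1]
    a_imu = np.sqrt(w_imu) * diff_op
    b_imu = np.sqrt(w_imu) * imu_delta

    # Stack both residual blocks and solve the linear least-squares problem.
    a = np.vstack([a_cam, a_imu])
    b = np.concatenate([b_cam, b_imu])
    x, *_ = np.linalg.lstsq(a, b, rcond=None)
    return x


def simulate(n=500, seed=0):
    """Synthetic data with made-up noise levels (assumptions for the demo)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 10.0, n)
    truth = np.sin(t) + 0.5 * t
    # Camera-like signal: no drift, but noisy in every frame.
    cam = truth + rng.normal(0.0, 0.3, n)
    # IMU-like signal: precise increments plus a small constant bias (drift).
    delta = np.diff(truth) + 0.002 + rng.normal(0.0, 0.001, n - 1)
    return truth, cam, delta


def rmse(estimate, truth):
    """Root-mean-square error between an estimate and the ground truth."""
    return float(np.sqrt(np.mean((np.asarray(estimate) - truth) ** 2)))


if __name__ == "__main__":
    truth, cam, delta = simulate()
    # Integrating the increments alone accumulates the bias over time.
    imu_only = truth[0] + np.concatenate([[0.0], np.cumsum(delta)])
    fused = fuse_trajectory(cam, delta)
    print("camera only RMSE:", rmse(cam, truth))
    print("IMU only RMSE:   ", rmse(imu_only, truth))
    print("fused RMSE:      ", rmse(fused, truth))
```

In this toy setting the relative term smooths the per-frame camera noise, while the absolute term keeps the integrated increments from drifting away. The ratio `w_imu / w_cam` controls how strongly the solution trusts the increments over the absolute measurements.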


Carl-Zeiss-Stiftung Tübingen AI Center University of Tübingen MPII Saarbrücken

We thank Bharat Bhatnagar, Verica Lazova, Anna Kukleva and Garvita Tiwari for their feedback. This work is partly funded by the DFG - 409792180 (Emmy Noether Programme, project: Real Virtual Humans), the EU Horizon 2020 project RICAIP (grant agreement No. 857306), and the European Regional Development Fund under project IMPACT (No. CZ.02.1.01/0.0/0.0/15_003/0000468). The project was made possible by funding from the Carl Zeiss Foundation.