Are Pose Estimators Ready for the Open World? STAGE: A GenAI Toolkit for Auditing 3D Human Pose Estimators

1University of Tübingen, 2Tübingen AI Center, 3Bosch Center of AI, 4Zalando SE, 5Max Planck Institute for Informatics, Saarland Informatics Campus

STAGE allows you to create custom benchmarks to stress test your pose estimator.

Abstract

For safety-critical applications, it is crucial to audit 3D human pose estimators before deployment. Will the system break down if the weather or the clothing changes? Is it robust regarding gender and age? To answer these questions and more, we need controlled studies with images that differ in a single attribute, but real benchmarks cannot provide such pairs. We thus present STAGE, a GenAI data toolkit for auditing 3D human pose estimators. For STAGE, we develop the first GenAI image creator with accurate 3D pose control and propose a novel evaluation strategy to isolate and quantify the effects of single factors such as gender, ethnicity, age, clothing, location, and weather. Enabled by STAGE, we generate a series of benchmarks to audit, for the first time, the sensitivity of popular pose estimators towards such factors. Our results show that natural variations can severely degrade pose estimator performance, raising doubts about their readiness for open-world deployment. We aim to highlight these robustness issues and establish STAGE as a benchmark to quantify them.

Diverse Generation with 3D Pose Control

Images generated via STAGE. We can generate images of people with different body shapes and appearances in different locations, well aligned with the given 3D ground-truth pose.

Training on STAGE rivals Training on BEDLAM

Method   Training Data    3DPW-test            EMDB-test
         CG     STAGE     MPJPE    PA-MPJPE    MPJPE    PA-MPJPE
PARE                      76.24    49.07       98.55    60.11
                          75.37    48.52       97.32    61.76
                          72.93    46.84       92.71    57.65
HMR                       75.07    48.08       96.61    59.08
                          76.23    48.72       96.66    62.62
                          72.75    46.80       91.56    58.02
Training on our GenAI images matches the performance of training on time-consuming CG data designed by human experts.
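The table reports MPJPE (mean per-joint position error) and PA-MPJPE (the same error after Procrustes alignment of the prediction to the ground truth). As a minimal sketch of how these two metrics are commonly computed, and not necessarily the exact evaluation code used for these numbers, the NumPy snippet below assumes a single pose given as a (J, 3) array of joint positions. The function names are illustrative, and the result is in whatever unit the joints are given in.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and true joints, shape (J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Align pred to gt with the best rotation, uniform scale, and translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    diag = np.array([1.0, 1.0, d])
    rot = u @ np.diag(diag) @ vt
    scale = (s * diag).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g


def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of pred to gt."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

Because PA-MPJPE removes global rotation, scale, and translation before measuring the error, it isolates the articulated pose, which is why it is always the smaller of the two numbers in each table row.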
Comparison of STAGE (bottom) and BEDLAM (top). STAGE images depict more realistic clothing and locations. Note especially the physically plausible wind effects and the complex scene composition, both of which would be costly to create in a simulation.

Evaluation Results

Examined Estimators

Sensitivity towards clothing

Sensitivity towards clothing texture

Sensitivity towards outdoor locations

Sensitivity towards indoor locations

Sensitivity towards protected attributes

Sensitivity towards weather and lighting

Acknowledgement

We thank Riccardo Marin for proofreading and the whole RVH team for the support. Nikita Kister was supported by Bosch Industry on Campus Lab at the University of Tübingen. Nikita Kister thanks the European Laboratory for Learning and Intelligent Systems (ELLIS) PhD program for support. István Sárándi and Gerard Pons-Moll were supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A, by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) -- 409792180 (Emmy Noether Programme, project: Real Virtual Humans). GPM is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 -- Project number 390727645 and is supported by the Carl Zeiss Foundation.

BibTeX

@inproceedings{stage,
  author    = {Kister, Nikita and Sárándi, István and Wang, Jiayi and Khoreva, Anna and Pons-Moll, Gerard},
  title     = {Are Pose Estimators Ready for the Open World? STAGE: A GenAI Toolkit for Auditing 3D Human Pose Estimators},
  booktitle = {International Conference on 3D Vision (3DV)},
  month = {March},
  year = {2026},
}