MVGBench: a Comprehensive Benchmark for Multi-view Generation Models
Xianghui Xie1,2,3, Chuhang Zou, Meher Gitika Karumuri, Jan Eric Lenssen3, Gerard Pons-Moll1,2,3
1 University of Tübingen, Germany
2 Tübingen AI Center, Germany
3 Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
Our benchmark can compare methods trained on different camera settings and evaluate each method using its best possible setup.
Abstract
TL;DR: We introduce MVGBench, a benchmark suite that can fairly compare different multi-view generation models for object reconstruction.
We propose MVGBench, a comprehensive benchmark for multi-view image generation models (MVGs) that evaluates 3D consistency in geometry and texture, image quality, and semantics (using vision language models). Recently, MVGs have been the main driving force behind 3D object creation. However, existing metrics compare generated images against ground-truth target views, which is not suitable for generative tasks where multiple plausible solutions exist that differ from the ground truth. Furthermore, different MVGs are trained on different view angles, synthetic data, and specific lighting conditions -- robustness to these factors and generalization to real data are rarely evaluated thoroughly. Without a rigorous evaluation protocol, it is also unclear which design choices contribute to the progress of MVGs. MVGBench evaluates three different aspects: best-setup performance, generalization to real data, and robustness. Instead of comparing against ground truth, we introduce a novel 3D self-consistency metric which compares 3D reconstructions from disjoint sets of generated multi-views. We systematically compare 12 existing MVGs on 4 curated real and synthetic datasets. With our analysis, we identify important limitations of existing methods, especially in terms of robustness and generalization, and we find the most critical design choices. Using the discovered best practices, we propose ViFiGen, a method that outperforms all evaluated MVGs on 3D consistency. Our benchmark suite and pretrained models will be publicly released.
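To make the 3D self-consistency idea concrete, here is a minimal Python sketch, not the exact MVGBench implementation: the generated views are split into two disjoint halves, each half is reconstructed in 3D, and the two reconstructions are compared with a symmetric Chamfer distance. The reconstruct_views backend and the even/odd split are assumptions for illustration.

import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pts_a, pts_b):
    # Symmetric Chamfer distance between two point clouds of shape (N, 3) and (M, 3).
    d_ab, _ = cKDTree(pts_b).query(pts_a)  # nearest neighbor in B for each point of A
    d_ba, _ = cKDTree(pts_a).query(pts_b)  # nearest neighbor in A for each point of B
    return d_ab.mean() + d_ba.mean()

def self_consistency(generated_views, cameras, reconstruct_views):
    # reconstruct_views(views, cams) is a placeholder for any multi-view reconstruction
    # backend that returns a sampled point cloud of shape (N, 3).
    views_a, views_b = generated_views[0::2], generated_views[1::2]  # disjoint splits
    cams_a, cams_b = cameras[0::2], cameras[1::2]
    pts_a = reconstruct_views(views_a, cams_a)
    pts_b = reconstruct_views(views_b, cams_b)
    return chamfer_distance(pts_a, pts_b)  # lower means more 3D-consistent generations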
MVGBench Overview

We present MVGBench, a comprehensive evaluation suite for multi-view image generation models (MVGs). We propose ten metrics to evaluate the 3D consistency in geometry and texture, image quality, and semantics of generated multi-view images. This suite allows us to fairly compare existing MVGs in three aspects: best-setup performance, generalization, and robustness to input perturbations. We use our benchmark to systematically analyze different models and identify critical design choices, leading to a new model that achieves the best 3D consistency and robustness, with otherwise on-par performance. All values in the radar plot are normalized; outermost is better.
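As a rough illustration of the normalization used for such radar plots (the exact scheme in our figures may differ), each metric can be min-max scaled across methods, with lower-is-better metrics flipped, so that the outermost ring always corresponds to the best method:

import numpy as np

def normalize_for_radar(scores, higher_is_better):
    # scores: raw values of one metric, one entry per method.
    # Returns values in [0, 1], where 1 (the outermost ring) marks the best method.
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    norm = (scores - lo) / (hi - lo + 1e-8)
    return norm if higher_is_better else 1.0 - norm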
Observations of existing MVG models
Most models are not robust to different lighting, elevation or azimuth angles.

Robustness w.r.t. different light intensities, azimuth, and elevation angles. Some methods (EscherNet) are sensitive to dark lighting, while others (SyncDreamer) are sensitive to strong lighting. Some methods (EscherNet, Vivid123) are also sensitive to the input azimuth angle, and none of the methods are robust to higher elevation angles.
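The sweep behind this figure can be pictured with the following hedged sketch; render_input(obj, light, azimuth, elevation) and evaluate(model, image) are hypothetical helpers standing in for the actual rendering and scoring pipeline, and the perturbation values are illustrative, not the exact grid used in the benchmark.

import itertools

LIGHT_INTENSITIES = [0.2, 0.5, 1.0, 2.0]   # dark to strong lighting
AZIMUTHS = range(0, 360, 45)               # input azimuth angles in degrees
ELEVATIONS = [0, 15, 30, 45]               # input elevation angles in degrees

def robustness_sweep(model, objects, render_input, evaluate):
    # Score a model under perturbed input conditions; large variance across
    # conditions indicates poor robustness to that perturbation.
    results = {}
    for light, azi, ele in itertools.product(LIGHT_INTENSITIES, AZIMUTHS, ELEVATIONS):
        scores = [evaluate(model, render_input(obj, light, azi, ele)) for obj in objects]
        results[(light, azi, ele)] = sum(scores) / len(scores)
    return results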
Trade-off between 3D consistency and image quality, and the synthetic-to-real gap.

Left: Trade-off between 3D consistency and image quality. No method achieves the best performance in both dimensions. Right: The performance gap between synthetic (GSO) and real (CO3D) data is large, especially on the image quality aspect (IQ-vlm).
What makes an MVG model 3D consistent?

Investigating different design choices of MVGs added on top of SV3D: camera embedding (†), input image encoder (‡), and multi-view feature synchronization (♮). Results on GSO. A better input image encoder brings the largest improvement.
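To make the ablation axes concrete, below is a rough PyTorch-style sketch of what the camera embedding (†) and the multi-view feature synchronization (♮) could look like; module names, layer sizes, and the pose parameterization are illustrative assumptions, not the exact architecture evaluated in the paper. The input image encoder (‡) is the network that embeds the conditioning image before it is injected into the diffusion backbone.

import torch
import torch.nn as nn

class CameraEmbedding(nn.Module):
    # Maps a relative camera pose (azimuth, elevation, radius) to a conditioning vector.
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, pose):  # pose: (B, 3)
        return self.mlp(pose)

class MultiViewSync(nn.Module):
    # Attention across the view axis so that features of all target views stay consistent.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):  # feats: (B, num_views, dim)
        synced, _ = self.attn(feats, feats, feats)
        return feats + synced  # residual cross-view synchronization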
ViFiGen: SoTA multi-view generation model
TL;DR: We introduce ViFiGen, a SoTA multi-view generation model built by leveraging the best practices identified in our benchmark.
Left to right: input, Zero123, SyncDreamer, our ViFiGen.




Updates
Citation
@inproceedings{xie2025MVGBench,
  title={MVGBench: a Comprehensive Benchmark for Multi-view Generation Models},
  author={Xianghui Xie and Chuhang Zou and Meher Gitika Karumuri and Jan Eric Lenssen and Gerard Pons-Moll},
  year={2025},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
Acknowledgments




We thank RVH group members for their helpful discussions. This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans), and German Federal Ministry of Education and Research (BMBF): Tuebingen AI Center, FKZ: 01IS18039A, and Amazon-MPI science hub. Gerard Pons-Moll is a Professor at the University of Tuebingen endowed by the Carl Zeiss Foundation, at the Department of Computer Science and a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 - Project number 390727645.