4D Object-Mover: Given a 4D scene and an edited first frame with a new object (left), we generate plausible motion for the object in subsequent frames (right).

Abstract

Recent advances in dynamic scene reconstruction using Neural Radiance Fields (NeRFs) and Gaussian Splatting (GS) have created a demand for effective 4D editing tools. While existing methods primarily focus on appearance alterations or object removal, the challenge of adding objects to 4D scenes, which requires understanding of objects' interactions with the original scene, remains largely unexplored. We present a novel approach to address this gap, focusing on generating plausible motion for newly added objects in 4D scenes. Our key finding is that 2D image-based diffusion models carry strong scene interaction priors that can be extracted from a static scene-object frame and propagated to novel frames of a dynamic 3D scene. Concretely, our method takes an object and its initial placement in a single frame as input, aiming to generate its position and orientation throughout the entire sequence. We first capture the object's appearance, shape, and interaction with the original scene from the static edited frame via fine-tuning a 2D diffusion-based editor. Building on this, we propose an iterative algorithm that leverages the fine-tuned diffusion model to generate frame-to-frame motion for the new object. We show that our method significantly improves 4D motion generation for the new objects compared to prior works on the diverse D-NeRF scene dataset.
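The fine-tuning objective described above can be illustrated with a minimal sketch. This is not the paper's implementation: `TinyEditor` is a hypothetical stand-in for the latent diffusion editor, and the linear noise schedule is a simplification. It only shows the shape of the idea, i.e. a denoiser conditioned on the "before" render learning to predict the noise added to the "after" (edited) render.

```python
import torch
import torch.nn as nn

class TinyEditor(nn.Module):
    """Toy conditional denoiser (stand-in for a fine-tuned LDM editor).
    Input: noisy edited render concatenated with the clean original render."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def forward(self, noisy_after, before):
        return self.net(torch.cat([noisy_after, before], dim=1))

def finetune_step(model, opt, before, after):
    """One epsilon-prediction step on a paired (before, after) render."""
    noise = torch.randn_like(after)
    t = torch.rand(after.shape[0], 1, 1, 1)   # per-sample noise level
    noisy = (1 - t) * after + t * noise       # simplified linear schedule
    pred = model(noisy, before)
    loss = torch.mean((pred - noise) ** 2)    # standard denoising loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In the method itself, the pairs come from multiview renders of the scene before and after the first-frame edit (Stage 1), so the editor learns the specific object-scene interaction rather than a generic edit.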


4D Object-Mover: Learn and Reconstruct Novel Object-Scene Interaction



Approach: Starting with the first frame (top-left image) in our 4D sequence, where a new object is added, we create a paired multiview dataset of rendered images before and after the edit (Stage 1, top box). We then fine-tune a latent diffusion model (LDM) using this dataset to capture the scene-object interaction, enabling the LDM to accurately place the object into the 2D rendered image of the original scene in any frame. In Stage 2, we generate the object's pose frame by frame. Using the fine-tuned editor, we create a pseudo ground truth dataset of 2D images showing the object interacting with the scene for each frame f (Stage 2, top box). We initialize each frame's pose with the previous frame's pose and then optimize the current pose by minimizing the photometric loss between the rendered image and the corresponding pseudo ground truth. Red arrows indicate the backward gradient direction in this process.
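The Stage-2 optimization loop can be sketched in miniature. This is a toy, not the authors' renderer: a 2D Gaussian blob stands in for the splatted object, and the pose is a 2D translation rather than a full rigid transform. What it does show faithfully is the loop structure: initialize the pose from the previous frame, render, and descend the photometric loss against the pseudo ground truth.

```python
import torch

def render(pose, grid, sigma=4.0):
    """Differentiable toy renderer: a Gaussian blob centered at `pose`
    (a stand-in for splatting the object at its current position)."""
    d2 = (grid[0] - pose[0]) ** 2 + (grid[1] - pose[1]) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

ys, xs = torch.meshgrid(torch.arange(32.0), torch.arange(32.0), indexing="ij")
grid = (xs, ys)

target_pose = torch.tensor([20.0, 12.0])
pseudo_gt = render(target_pose, grid)        # pseudo GT from the 2D editor

pose = torch.tensor([10.0, 10.0], requires_grad=True)  # init: previous frame
opt = torch.optim.Adam([pose], lr=0.3)
for _ in range(300):
    loss = torch.mean((render(pose, grid) - pseudo_gt) ** 2)  # photometric
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The red arrows in the figure correspond to `loss.backward()` here: gradients flow from the photometric loss through the renderer into the pose parameters.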



Adding new objects to D-NeRF scenes

We reconstruct new objects as Gaussian splats, add them to the first frame of a D-NeRF scene, and then animate them via 4D Object-Mover.
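Inserting a Gaussian reconstruction into the first frame amounts to rigidly transforming the object's Gaussians and concatenating them with the scene's. A minimal sketch, with an assumed simplified layout (real 3DGS also stores per-Gaussian rotations, scales, and SH colors, which would need the same treatment):

```python
import numpy as np

def place_object(scene, obj, R, t):
    """Merge an object's Gaussians into a scene's Gaussians after rigidly
    transforming the object by rotation R (3x3) and translation t (3,).
    `scene` and `obj` are dicts of per-Gaussian numpy arrays."""
    moved = {
        "means": obj["means"] @ R.T + t,       # rotate, then translate centers
        "opacities": obj["opacities"].copy(),  # appearance is unchanged
    }
    # Concatenate object Gaussians after the scene's, attribute by attribute.
    return {k: np.concatenate([scene[k], moved[k]]) for k in scene}
```

The merged set renders like any other Gaussian scene; 4D Object-Mover then only has to update the object's pose per frame, not its geometry.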


Video




Citation

@inproceedings{tuan2025_4d_object_mover,
  title     = {4D Object-Mover: Distilling Pretrained Diffusion Priors for Object Animation},
  author    = {Tran, Tuan Anh and Lenssen, Jan Eric and Pons-Moll, Gerard and Chibane, Julian},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR) Workshop on 4D Vision: Modeling the Dynamic World},
  month     = {June},
  year      = {2025},
}

Acknowledgment


Carl-Zeiss-Stiftung · Tübingen AI Center · University of Tübingen · MPII Saarbrücken

We thank Verica Lazova and István Sárándi for their helpful feedback.