We propose JOSH (Joint Optimization of Scene Geometry and Human Motion), a novel method for 4D human-scene reconstruction in the wild. Given a web video captured from a single camera, JOSH jointly optimizes the global human motion, the surrounding environment, and the camera poses so that the human-scene interaction remains coherent.
JOSH uses local scene reconstruction and human mesh recovery as initialization, then jointly optimizes the human motion and the scene under human-scene contact constraints. With this joint optimization, JOSH achieves state-of-the-art performance on both global human motion estimation and metric-scale scene reconstruction.
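To give a rough feel for how a contact constraint can tie scene scale to human motion, here is a minimal toy sketch. It is not the JOSH implementation: the function name `fit_scale_and_offset`, the quadratic objective, the `prior_weight` term, and the alternating closed-form updates (block coordinate descent) are all assumptions made for illustration. The sketch assumes we already have up-to-scale scene points at contact locations (standing in for the scene initialization) and metric contact-joint positions (standing in for the human mesh recovery initialization), and it solves for one global scene scale and one translation correction of the human.

```python
def fit_scale_and_offset(scene_pts, contact_pts, prior_weight=1.0, iters=100):
    """Toy joint fit (illustrative only, not the JOSH objective).

    Minimizes  sum_i ||s * p_i - (f_i + d)||^2 + prior_weight * ||d||^2
    over a scalar scene scale s and a 3D human offset d, where
    p_i are up-to-scale scene points and f_i are metric contact joints.
    """
    n = len(scene_pts)
    # Precompute sums that appear in the closed-form updates.
    pp = sum(c * c for p in scene_pts for c in p)
    sum_p = [sum(p[k] for p in scene_pts) for k in range(3)]
    sum_f = [sum(f[k] for f in contact_pts) for k in range(3)]

    s = 1.0
    d = [0.0, 0.0, 0.0]  # start from the (hypothetical) human initialization
    for _ in range(iters):
        # Update the scene scale with the human offset held fixed.
        s = sum(
            p[k] * (f[k] + d[k])
            for p, f in zip(scene_pts, contact_pts)
            for k in range(3)
        ) / pp
        # Update the human offset with the scene scale held fixed;
        # prior_weight keeps d close to the initialization.
        d = [(s * sum_p[k] - sum_f[k]) / (n + prior_weight) for k in range(3)]
    return s, d


# Usage: scene points that are exactly half the metric size of the contacts.
scene_pts = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
contact_pts = [tuple(2.0 * c for c in p) for p in scene_pts]
scale, offset = fit_scale_and_offset(scene_pts, contact_pts)
print(scale, offset)
```

A real system would optimize many more variables (per-frame camera poses, dense scene geometry, full body pose) with a nonlinear solver; the point of the sketch is only that contact residuals couple the scene and human unknowns, so solving for them together can resolve the scale ambiguity that neither initialization settles alone.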
Results on Datasets
JOSH surpasses existing methods on both global human motion estimation and metric-scale scene reconstruction by a large margin. It is also the only method that directly supports the task of 4D human-scene reconstruction.
Evaluation on Global Human Motion Estimation with the EMDB Dataset
Evaluation on Global Camera Trajectory Estimation with the SLOPER4D Dataset
Evaluation on 4D Human-Scene Reconstruction with the RICH Dataset
Interactive Demo on Web Video
Drag the reconstruction result on the left to change the viewing angle, use the bottom bar to control the time step, and hover over the scene components for interactive visualization.
BibTeX
@article{liu2025joint,
title={Joint Optimization for 4D Human-Scene Reconstruction in the Wild},
author={Liu, Zhizheng and Lin, Joe and Wu, Wayne and Zhou, Bolei},
journal={arXiv preprint arXiv:2501.02158},
year={2025}
}
Comment: This work proposes a dataset and a model for context-aware pedestrian movement generation from pseudo-labels of web videos. JOSH can be used to extract higher-quality human and scene labels for pedestrian movement generation.