A unified video world model interface
Holo-World jointly controls camera motion, object dynamics, and weather state from a single image.
Source-side controls define the scene scaffold to preserve
Camera trajectory, rendered geometry buffers, and object controls anchor the background structure and dynamic entities, keeping the generated world consistent with the observed source.
Target weather specifies the state to render
The target-weather prompt guides the preserved scene scaffold into a new weather state, allowing weather-dependent appearance and particle effects to emerge within the same controlled world.
Holo-World jointly controls camera motion, object dynamics, and scene weather state within the same observed world. The model must change weather from a single image while still following explicit camera and object controls, rather than relying on a complete source video as in video-to-video weather editing.
15000+ training samples across Real / Simulation / V2V subsets, carrying paired controls for camera, object, and weather supervision.
150 mutually exclusive evaluation samples for world preservation and weather transfer tracks.
Given the same input camera trajectory and rendered controls, Holo-World renders the controlled scene under different target weather states. Our model learns weather-state transfer within the same scene scaffold, rather than regenerating a different world.
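One way to picture this interface is as a fixed bundle of source-side controls paired with a swappable weather prompt. The sketch below is only an illustration under stated assumptions: `SceneScaffold`, `GenerationRequest`, `sweep_weather`, the file name, and the weather strings are hypothetical names chosen for this example, not the released Holo-World API, and no model is actually called.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical representation: a 4x4 camera pose stored row-major.
Matrix4 = Tuple[Tuple[float, ...], ...]


@dataclass(frozen=True)
class SceneScaffold:
    """Source-side controls that stay fixed across weather states (assumed layout)."""
    source_image_path: str
    camera_trajectory: Tuple[Matrix4, ...]  # one pose per output frame
    object_controls: Tuple[str, ...] = ()   # placeholder identifiers for object controls

    def __post_init__(self) -> None:
        if not self.camera_trajectory:
            raise ValueError("camera_trajectory needs at least one pose")
        for pose in self.camera_trajectory:
            if len(pose) != 4 or any(len(row) != 4 for row in pose):
                raise ValueError("each camera pose must be a 4x4 matrix")


@dataclass(frozen=True)
class GenerationRequest:
    """A scaffold plus the target-weather prompt to render it under."""
    scaffold: SceneScaffold
    target_weather: str


def sweep_weather(scaffold: SceneScaffold,
                  weathers: Sequence[str]) -> List[GenerationRequest]:
    """Build one request per weather state, reusing the same scaffold object."""
    return [GenerationRequest(scaffold, w) for w in weathers]


# Illustrative values only: an 8-frame static camera and three example prompts.
IDENTITY: Matrix4 = tuple(
    tuple(1.0 if i == j else 0.0 for j in range(4)) for i in range(4)
)
scaffold = SceneScaffold("frame0.png", (IDENTITY,) * 8)
requests = sweep_weather(scaffold, ["rain", "snow", "fog"])
```

Keeping the scaffold immutable and shared makes the intent explicit: only `target_weather` varies between requests, so any difference in the output should come from the weather state rather than from a changed camera path or object control.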
Source-side controls define the scene scaffold Holo-World should preserve. Camera trajectory, rendered world controls, and object controls anchor background structure and dynamic entities, enabling temporally consistent video synthesis.
Controllable video generation under camera trajectory and object manipulation. We compare Holo-World with VerseCrafter, Gen3C, and Uni3C. The top rows show the camera trajectory, the ground truth, and each method's rendered RGB control; the bottom row shows the corresponding generated results.
Holo-World performs image-to-video (I2V) generation, synthesizing temporal dynamics and weather effects from only a single input frame plus a camera trajectory. In contrast, the compared methods (Wan2.7-Edit, Cosmos-Transfer-2.5) are video-to-video (V2V) approaches that receive the full source video as input and perform video editing or transfer. This is a less challenging task, because they can draw on the temporal information already present in the source video.
If you find Holo-World useful in your research, please cite us:
@article{yin2026holoworld,
title={Holo-World: Unified Camera, Object and Weather Control for Video World Model},
author={Yin, Xiangchen},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2026}
}