Generating 3D worlds from text is a highly anticipated goal in computer vision. Existing works are limited in the degree of exploration they allow inside a scene, i.e., they produce stretched-out and noisy artifacts when moving beyond central or panoramic perspectives. To address this, we propose WorldExplorer, a novel method based on autoregressive video trajectory generation, which builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We initialize our scenes by creating multi-view consistent images corresponding to a 360-degree panorama. Then, we expand the scene by leveraging video diffusion models in an iterative scene-generation pipeline. Concretely, we generate multiple videos along short, pre-defined trajectories that explore the scene in depth, including motion around objects. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results, such as moving into objects. Finally, we fuse all generated views into a unified 3D representation via 3D Gaussian Splatting optimization. Compared to prior approaches, WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling realistic and unrestricted exploration for the first time. We believe this marks a significant step toward generating immersive and truly explorable virtual 3D environments.
We generate a fully navigable 3D scene from text input in three stages.
First, we create four initial images with a text-to-image (T2I) model, defined to look outwards from the scene center and distributed uniformly around it (without overlap). We leverage monocular depth estimation and inpainting to create the remaining four images of our scene scaffold.
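As an illustration, below is a minimal sketch of how the outward-looking scaffold cameras could be defined; the yaw-only, y-up convention and the function name are our own assumptions, not part of the method.

import numpy as np

def outward_yaw_rotations(num_views: int = 4) -> list[np.ndarray]:
    """Hypothetical sketch: rotations for views looking outwards from the
    scene center, spaced uniformly in yaw (90 degrees apart for 4 views, so
    cameras with a ~90-degree field of view cover 360 degrees without overlap)."""
    rotations = []
    for i in range(num_views):
        yaw = 2.0 * np.pi * i / num_views
        # Rotation about the up (y) axis; the axis convention is an assumption.
        R = np.array([
            [ np.cos(yaw), 0.0, np.sin(yaw)],
            [ 0.0,         1.0, 0.0        ],
            [-np.sin(yaw), 0.0, np.cos(yaw)],
        ])
        rotations.append(R)
    return rotations

# The remaining four scaffold views could then sit at intermediate yaw angles
# (45-degree offsets) and be filled in via depth-based warping and inpainting.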
Next, we propose a novel iterative video trajectory generation pipeline. It exploits camera-guided video diffusion models to generate additional images of the scene along pre-defined trajectories. For each video generation, we select optimal conditioning views from all previously generated frames, which encourages 3D-consistent content generation. We further discard frames (marked in grey) that are too close to objects, i.e., running into them, to avoid degenerate outputs. We repeat this trajectory generation multiple times to obtain a final set of posed images.
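To make the collision handling concrete, here is a minimal sketch of a per-frame filter, assuming a depth map is available for each generated frame; the threshold, the central-crop heuristic, and the function name are illustrative assumptions rather than the exact mechanism used in the method.

import numpy as np

def keep_non_colliding_frames(frames, poses, depth_maps, min_depth: float = 0.5):
    """Sketch of a collision check: discard a generated frame if the depth
    directly in front of the camera falls below a threshold, i.e., the
    trajectory has run into an object."""
    kept_frames, kept_poses = [], []
    for frame, pose, depth in zip(frames, poses, depth_maps):
        h, w = depth.shape
        # Look at the central image region as a proxy for "straight ahead".
        center = depth[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4]
        if np.median(center) >= min_depth:
            kept_frames.append(frame)
            kept_poses.append(pose)
    return kept_frames, kept_poses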
Finally, we optimize a 3D Gaussian Splatting (3DGS) scene from our generated images. To this end, we first obtain a point-cloud initialization using VGGT. Then we globally align the predicted points with our camera poses using rigid transformations. Afterwards, we can explore the scene, i.e., synthesize novel views from perspectives beyond the scene center, in real time using standard computer graphics rasterization.
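For the rigid alignment step, one standard option is the Kabsch/Procrustes solution below; the sketch assumes we align the predicted camera centers to our known trajectory poses and then apply the resulting transform to all predicted points, and the variable names are illustrative.

import numpy as np

def rigid_align(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid transform (rotation R, translation t) mapping
    src -> dst, both of shape (N, 3). Standard Kabsch/Procrustes solution."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that R is a proper rotation.
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = dst_c - R @ src_c
    return R, t

# E.g. (illustrative usage): align predicted camera centers to the known poses,
# then transform every predicted 3D point before initializing the 3DGS scene.
# R, t = rigid_align(pred_cam_centers, known_cam_centers)
# aligned_points = points @ R.T + t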
We generate videos in multiple batches of 21 frames along 4 trajectories (blue, purple, red, orange). These trajectories start from each of our initial panoramic images (bottom left). To ensure 3D-consistent outputs, we prefill the first 8 frames with our initial scene scaffold (marked in green). The remaining 5 conditioning frames are sampled, based on rotational similarity, from our scene memory, which stores the images and poses of all previously generated frames (top). After each trajectory is generated, we append all frames starting from the third image to our scene memory, so they can condition later video generations.
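As a rough illustration of the rotational-similarity sampling, the sketch below scores each stored view by how closely its viewing direction matches a target pose of the new trajectory and keeps the k most similar; the camera-to-world, +z-forward convention, the forward-axis dot-product score, and the function name are our own assumptions.

import numpy as np

def sample_memory_frames(memory_rotations, target_rotation, k: int = 5):
    """Return indices of the k stored views whose viewing directions are most
    similar (by forward-axis dot product) to the target camera rotation."""
    target_fwd = target_rotation[:, 2]               # camera forward axis (convention assumed)
    scores = [R[:, 2] @ target_fwd for R in memory_rotations]
    top_k = np.argsort(scores)[-k:][::-1]            # indices of the k best-aligned views
    return top_k.tolist()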
Explore our generated scenes interactively with the 3DGS viewer below (powered by Spark). Use the dropdown menu to select a scene and click "Load Scene" to visualize it. Navigate the scene with mouse, keyboard, or touch controls.
@article{schneider_hoellein_2025_worldexplorer,
title={WorldExplorer: Towards Generating Fully Navigable 3D Scenes},
author={Schneider, Manuel-Andreas and H{\"o}llein, Lukas and Nie{\ss}ner, Matthias},
journal={arXiv preprint arXiv:2506.01799},
year={2025}
}