WorldMesh: Generating Navigable Multi-Room 3D Scenes via Mesh-Conditioned Image Diffusion

Technical University of Munich
WorldMesh Teaser

WorldMesh generates arbitrarily large multi-room 3D scenes efficiently using mesh-guided image diffusion.

Abstract

Recent progress in image and video synthesis has inspired their use in advancing 3D scene generation. However, we observe that text-to-image and text-to-video approaches struggle to maintain scene- and object-level consistency beyond a limited environment scale due to the absence of explicit geometry. We thus present a geometry-first approach that decomposes the complex problem of large-scale 3D scene synthesis into its structural composition, represented as a mesh scaffold, and realistic appearance synthesis, which leverages powerful image synthesis models conditioned on the mesh scaffold. From an input text description, we first construct a mesh capturing the environment's geometry (walls, floors, etc.), and then use image synthesis, segmentation, and object reconstruction to populate the mesh structure with objects in realistic layouts. This mesh scaffold is then rendered to condition image synthesis, providing a structural backbone for consistent appearance generation. This enables scalable, arbitrarily sized 3D scenes of high object richness and diversity, combining robust 3D consistency with photorealistic detail. We believe this marks a significant step toward generating truly environment-scale, immersive 3D worlds.

Video

Overview of WorldMesh

To generate a complex, multi-room 3D scene from a text prompt, we decompose the problem into first constructing the global scene structure as a mesh scaffold (top), and then using the scaffold mesh as an anchor for realistic local appearance synthesis (bottom).

From the text prompt, we first generate a text-based floor plan and construct it in 3D; its rendered depth conditions an image synthesis model Φ, from whose outputs we reconstruct estimated 3D objects in each room. The structural elements and 3D objects together constitute the scaffold mesh M, for which initial wall textures are also generated with Φ.
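The depth conditioning above requires rendering depth maps from the scaffold geometry. As a minimal illustrative sketch (not the paper's implementation, which rasterizes a full mesh), the following renders pinhole-camera depth for a single wall plane via ray-plane intersection; `render_wall_depth` and its camera conventions are hypothetical:

```python
import numpy as np

def render_wall_depth(cam_pos, wall_x, width=64, height=64, fov=np.pi / 2):
    """Render a depth map of a wall plane x = wall_x seen from a camera at
    cam_pos looking down +x. Hypothetical helper: the actual scaffold
    rendering rasterizes the full mesh M, not a single plane."""
    # Pinhole rays in camera space: +x forward, y right, z up.
    f = 0.5 * width / np.tan(0.5 * fov)
    ys = (np.arange(width) - 0.5 * width) / f
    zs = (np.arange(height) - 0.5 * height) / f
    yy, zz = np.meshgrid(ys, zs)                      # shape (height, width)
    dirs = np.stack([np.ones_like(yy), yy, zz], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Ray-plane intersection: cam_pos.x + t * dir.x = wall_x
    t = (wall_x - cam_pos[0]) / dirs[..., 0]
    return t  # per-pixel depth along the ray

depth = render_wall_depth(cam_pos=np.array([0.0, 0.0, 1.5]), wall_x=4.0)
# The wall is closest at the image center and recedes toward the edges.
```

A depth map like this (normalized to an image) is the kind of geometric condition a depth-guided image synthesis model can consume.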

M then serves as a geometric anchor for iterative image synthesis with Φ, producing images {Ii}. Finally, the output scene S is optimized with geometry-regularized 3D Gaussian Splatting (3DGS), supervised by both the images {Ii} and depth rendered from M.
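The geometry-regularized 3DGS objective can be sketched as a photometric term against the synthesized images plus a depth term against the scaffold's rendered depth. This is a hypothetical form of the loss — the page does not state the exact objective or weighting — with `lam_depth` as an assumed balancing coefficient:

```python
import numpy as np

def geometry_regularized_loss(render_rgb, target_rgb, render_depth, mesh_depth,
                              lam_depth=0.1):
    """Sketch of a geometry-regularized 3DGS training loss (assumed form):
    L1 photometric error against a synthesized image I_i, plus an L1 depth
    term tying the splat rendering to depth rendered from the scaffold M."""
    photo = np.abs(render_rgb - target_rgb).mean()     # appearance supervision
    depth = np.abs(render_depth - mesh_depth).mean()   # geometry regularizer
    return photo + lam_depth * depth
```

In practice such a loss is evaluated per training view, with the depth term keeping the Gaussians from drifting away from the scaffold geometry while the photometric term fits fine appearance detail.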

WorldMesh Method Overview

BibTeX

@misc{TBD}