World Model for Controllable Generation

World models predict future frames from past observations and actions, making them powerful simulators for ego-vision tasks with complex dynamics, such as autonomous driving. However, existing ego-vision world models focus mainly on the driving domain and on the ego-vehicle's actions, which limits the complexity and diversity of the generated scenes. In this work, we propose GEM, a diffusion-based world model with a generalized control strategy. By leveraging ego-trajectories and general image features, GEM not only allows fine-grained control over ego-motion, but also enables control over the motion of other objects in the scene and supports scene composition by inserting new objects. GEM is multimodal, generating both videos and future depth sequences, which provide rich semantic and spatial context. Although our primary focus remains autonomous driving, we also explore the adaptability of GEM to other ego-vision domains such as human activity and drone navigation. Project page: https://vita-epfl.github.io/GEM.github.io/
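To make the control interface concrete, below is a minimal, hypothetical sketch (PyTorch) of how a diffusion-style world model could condition on past frames, an ego-trajectory, and general image features, and emit both RGB and depth sequences. All class, argument, and dimension names here are illustrative assumptions, not GEM's actual architecture or API.

```python
# Hypothetical sketch: conditioning a toy world model on past frames,
# ego-trajectory waypoints, and image features, with RGB + depth outputs.
# Names and shapes are illustrative, not the authors' implementation.
import torch
import torch.nn as nn

class ToyWorldModel(nn.Module):
    def __init__(self, frame_channels=3, feat_dim=16, hidden=32):
        super().__init__()
        # Encode past frames and controls into a shared conditioning signal.
        self.frame_enc = nn.Conv3d(frame_channels, hidden, kernel_size=1)
        self.traj_enc = nn.Linear(2, hidden)         # (x, y) ego waypoints
        self.feat_enc = nn.Linear(feat_dim, hidden)  # generic image features
        # Two heads: denoised RGB frames and a depth sequence.
        self.rgb_head = nn.Conv3d(hidden, frame_channels, kernel_size=1)
        self.depth_head = nn.Conv3d(hidden, 1, kernel_size=1)

    def forward(self, noisy_future, past_frames, ego_traj, obj_feats):
        # noisy_future / past_frames: (B, C, T, H, W); ego_traj: (B, T, 2);
        # obj_feats: (B, T, feat_dim). A real model would use a video diffusion
        # backbone; here a 1x1 conv stands in for the denoiser.
        cond = (self.frame_enc(past_frames).mean(dim=2, keepdim=True)
                + self.traj_enc(ego_traj).mean(dim=1)[:, :, None, None, None]
                + self.feat_enc(obj_feats).mean(dim=1)[:, :, None, None, None])
        h = self.frame_enc(noisy_future) + cond
        return self.rgb_head(h), self.depth_head(h)

model = ToyWorldModel()
rgb, depth = model(
    noisy_future=torch.randn(1, 3, 4, 32, 32),
    past_frames=torch.randn(1, 3, 4, 32, 32),
    ego_traj=torch.randn(1, 4, 2),
    obj_feats=torch.randn(1, 4, 16),
)
print(rgb.shape, depth.shape)  # (1, 3, 4, 32, 32) and (1, 1, 4, 32, 32)
```

The point of the sketch is only the interface: control signals (trajectory, image features) are fused with past observations into a conditioning signal, and the model produces aligned RGB and depth sequences.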

 

Visualizations

Sample Generations

   

 

Control Scene Object Motion

  

 

Insert Objects

   

 

Change Ego Motion

    

 

Control Human Pose

 

Multimodal Generation

   

Multi-Domain Generation:

Drones

 

Human Egocentric