World Model for Controllable Generation

World models predict future frames from past observations and actions, making them powerful simulators for ego-vision tasks with complex dynamics, such as autonomous driving. However, existing ego-vision world models focus mainly on the driving domain and on the ego-vehicle's actions, which limits the complexity and diversity of the generated scenes. In this work, we propose GEM, a diffusion-based world model with a generalized control strategy. By leveraging ego-trajectories and general image features, GEM not only allows fine-grained control over ego-motion, but also enables control over the motion of other objects in the scene and supports scene composition by inserting new objects. GEM is multimodal, generating both videos and future depth sequences, which provide rich semantic and spatial context. Although our primary focus remains autonomous driving, we also explore the adaptability of GEM to other ego-vision domains such as human activity and drone navigation. Project page: https://vita-epfl.github.io/GEM.github.io/
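To make the control interface concrete, below is a minimal, hypothetical sketch (PyTorch) of how a diffusion-style world model could condition on past frames, an ego-trajectory, and general image features, and emit both RGB and depth sequences. All class, argument, and dimension names here are illustrative assumptions, not GEM's actual architecture or API.

```python
# Hypothetical sketch: conditioning a toy world model on past frames,
# ego-trajectory waypoints, and image features, with RGB + depth outputs.
# Names and shapes are illustrative, not the authors' implementation.
import torch
import torch.nn as nn

class ToyWorldModel(nn.Module):
    def __init__(self, frame_channels=3, feat_dim=16, hidden=32):
        super().__init__()
        # Encode past frames and controls into a shared conditioning signal.
        self.frame_enc = nn.Conv3d(frame_channels, hidden, kernel_size=1)
        self.traj_enc = nn.Linear(2, hidden)         # (x, y) ego waypoints
        self.feat_enc = nn.Linear(feat_dim, hidden)  # generic image features
        # Two heads: denoised RGB frames and a depth sequence.
        self.rgb_head = nn.Conv3d(hidden, frame_channels, kernel_size=1)
        self.depth_head = nn.Conv3d(hidden, 1, kernel_size=1)

    def forward(self, noisy_future, past_frames, ego_traj, obj_feats):
        # noisy_future / past_frames: (B, C, T, H, W); ego_traj: (B, T, 2);
        # obj_feats: (B, T, feat_dim). A real model would use a video diffusion
        # backbone; here a 1x1 conv stands in for the denoiser.
        cond = (self.frame_enc(past_frames).mean(dim=2, keepdim=True)
                + self.traj_enc(ego_traj).mean(dim=1)[:, :, None, None, None]
                + self.feat_enc(obj_feats).mean(dim=1)[:, :, None, None, None])
        h = self.frame_enc(noisy_future) + cond
        return self.rgb_head(h), self.depth_head(h)

model = ToyWorldModel()
rgb, depth = model(
    noisy_future=torch.randn(1, 3, 4, 32, 32),
    past_frames=torch.randn(1, 3, 4, 32, 32),
    ego_traj=torch.randn(1, 4, 2),
    obj_feats=torch.randn(1, 4, 16),
)
print(rgb.shape, depth.shape)  # (1, 3, 4, 32, 32) and (1, 1, 4, 32, 32)
```

The point of the sketch is only the interface: control signals (trajectory, image features) are fused with past observations into a conditioning signal, and the model produces aligned RGB and depth sequences.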

 

Visualizations

Sample Generations

   

 

Control Scene Object Motion

  

 

Insert Objects

   

 

Change Ego Motion

    

 

Control Human Pose

 

Multimodal Generation

   

Multi-Domain Generation:

Drones

 

Human Egocentric