ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions

1 Department of Computer Science and Technology, Tsinghua University 2 Beijing National Research Center for Information Science and Technology 3 MoE Key Lab of High Confidence Software Technologies, Peking University

Abstract

With the development of diffusion models, text-to-video generation has recently received significant attention and achieved remarkable success. However, existing text-to-video approaches suffer from the following weaknesses: i) they fail to control the trajectory of the subject or the process of scene transformations; ii) they can only generate videos with a limited number of frames, failing to capture the whole transformation process. To address these issues, we propose ScenarioDiff, a model that is able to generate longer videos with scene transformations. Specifically, we employ a spatial layout fuser to control the positions of subjects and the scene of each frame. To effectively present the process of scene transformation, we introduce a Mixed Frequency ControlNet, which uses several frames of the generated video to extend it chunk by chunk into a long video in an auto-regressive manner. Additionally, to ensure consistency between different video chunks, we propose a cross-chunk scheduling mechanism for inference. Experimental results demonstrate the effectiveness of our approach in generating videos with dynamic scene transformations.
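The spatial layout fuser is only described at a high level here. As a rough illustration, below is a minimal sketch of one common way to inject per-frame layout (bounding boxes plus subject/scene text) into a denoising UNet via gated cross-attention; the module name, dimensions, and design are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a layout-conditioning module (not the paper's exact design):
# per-frame bounding boxes and subject/scene text embeddings become "layout tokens"
# that visual features attend to, steering each frame toward the specified layout.
import torch
import torch.nn as nn

class LayoutFuserSketch(nn.Module):
    def __init__(self, dim=320, box_dim=4, text_dim=768):
        super().__init__()
        self.box_proj = nn.Linear(box_dim, dim)    # embed (x1, y1, x2, y2) boxes
        self.text_proj = nn.Linear(text_dim, dim)  # embed subject/scene text features
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # zero-init gate: starts as identity

    def forward(self, visual_tokens, boxes, text_emb):
        # visual_tokens: (B, N, dim); boxes: (B, K, 4); text_emb: (B, K, text_dim)
        layout_tokens = self.box_proj(boxes) + self.text_proj(text_emb)
        fused, _ = self.attn(visual_tokens, layout_tokens, layout_tokens)
        return visual_tokens + torch.tanh(self.gate) * fused
```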

Framework


The framework of ScenarioDiff. Our proposed method contains three main stages. The first stage tunes the UNet on simulated videos so that the model learns how to change scenes. The second stage trains the Mixed Frequency ControlNet, which equips our method with the ability to generate long videos. At inference time, we apply a cross-chunk scheduling mechanism that uses information from the previous chunk to make the video more consistent.
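As a rough illustration of the auto-regressive inference described above, a minimal sketch of the chunk-by-chunk loop follows. Here `sample_chunk`, its arguments, and `num_cond` are hypothetical placeholders standing in for one denoising pass of the base model with the Mixed Frequency ControlNet; they are not taken from the paper's code.

```python
def generate_long_video(sample_chunk, prompt, layouts, num_cond=4):
    """Auto-regressively extend the video chunk by chunk.

    `sample_chunk(prompt, layout, cond_frames)` is a hypothetical stand-in for
    one denoising pass with the Mixed Frequency ControlNet and cross-chunk
    scheduling; `layouts` holds the per-chunk subject positions and scene.
    """
    video, prev_frames = [], None
    for layout in layouts:
        chunk = sample_chunk(prompt, layout, prev_frames)  # list of frames for this chunk
        video.extend(chunk)
        prev_frames = chunk[-num_cond:]  # tail frames condition the next chunk
    return video
```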

Some Cases

A person is riding a horse under the mountain from right to left.
A bird flies around many branches.
Some moose are walking in a meadow under the aurora borealis.
A duck is swimming in the pool and then walks ashore.
A tornado swept through, sending two cars into the sky.
A boy kicks a ball into a pool of water and a dog runs over to it.
A dog is running from grass to beach.
A cloud drifted in from the left and was blown to pieces by the wind after a while.
A cat is running from a rock to street.
A dog is running from the wheat field to a lake.
A tree in the wind.
A clownfish goes up and down in the sea.

Ablations

Qualitative results of ScenarioDiff with and without tuning on the simulated dataset.

(a) without tuning

(b) with tuning

Qualitative results of ScenarioDiff with and without the components of the Mixed Frequency ControlNet (a hedged sketch of the frequency mixing appears after the panels).

(a) without both

(b) without mixed init

(c) without refiner

(d) with all
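The name "Mixed Frequency ControlNet" is not unpacked on this page; one plausible reading is that low-frequency (coarse layout) content is carried over from the conditioning frames of the previous chunk while high-frequency detail comes from the newly generated frames. The sketch below illustrates that reading with an FFT-based band split; the cutoff, the split itself, and the function name are assumptions, not the paper's design.

```python
# Hedged sketch of mixing frequency bands from conditioning frames, loosely
# motivated by the name "Mixed Frequency ControlNet"; an assumption, not the
# paper's implementation.
import torch

def mix_frequency(cond_frames, new_frames, cutoff=0.25):
    # cond_frames, new_frames: (B, C, T, H, W) tensors in pixel or latent space
    cond_f = torch.fft.fftshift(torch.fft.fft2(cond_frames), dim=(-2, -1))
    new_f = torch.fft.fftshift(torch.fft.fft2(new_frames), dim=(-2, -1))
    H, W = cond_frames.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, H, device=cond_frames.device),
        torch.linspace(-1, 1, W, device=cond_frames.device),
        indexing="ij",
    )
    low_pass = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(cond_frames.dtype)
    # low frequencies (coarse layout) from the previous chunk's frames,
    # high frequencies (fine detail) from the freshly generated frames
    mixed_f = cond_f * low_pass + new_f * (1 - low_pass)
    return torch.fft.ifft2(torch.fft.ifftshift(mixed_f, dim=(-2, -1))).real
```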

Qualitative results for different values of \( \hat{\lambda} \) (a hedged sketch of this blending appears after the panels).

(a) \( \hat{\lambda} \) = 0.00

(b) \( \hat{\lambda} \) = 0.25

(c) \( \hat{\lambda} \) = 0.50

(d) \( \hat{\lambda} \) = 0.75

(e) \( \hat{\lambda} \) = 1.00
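\( \hat{\lambda} \) appears to control how strongly the new chunk is tied to the previous one. Below is a minimal sketch of one way such a blend weight could work, mixing latents carried over from the previous chunk with fresh Gaussian noise ("mixed init"); the formula and names are assumptions for illustration, not the paper's exact schedule.

```python
# Hedged sketch: lam_hat stands in for the blend weight written as lambda-hat above.
import torch

def mixed_init(prev_latents, lam_hat):
    # prev_latents: (B, C, T, H, W) latents carried over from the previous chunk
    noise = torch.randn_like(prev_latents)
    # lam_hat = 0.0 -> pure fresh noise (more change, less consistency)
    # lam_hat = 1.0 -> fully reuse the previous chunk (more consistency, less change)
    return lam_hat * prev_latents + (1.0 - lam_hat) * noise
```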

Qualitative results of ScenarioDiff with and without the components of cross-chunk scheduling (a hedged sketch of the noise-reuse component appears after the panels).

(a) without both

(b) without mixed init

(c) without reuse

(d) with all
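For the "reuse" component, one hedged guess is that the noise used for the last frames of the previous chunk is copied into the first frames of the next chunk, so shared content is denoised from the same starting point. The sketch below illustrates that guess; the function name, tensor layout, and `num_reuse` are assumptions, not the paper's code.

```python
# Hedged sketch of carrying noise across chunks for the overlapping frames.
import torch

def reuse_noise(prev_noise, shape, num_reuse=4):
    # prev_noise: (B, C, T, H, W) noise used for the previous chunk
    noise = torch.randn(shape, dtype=prev_noise.dtype, device=prev_noise.device)
    # copy the tail of the previous chunk's noise into the head of the new chunk
    noise[:, :, :num_reuse] = prev_noise[:, :, -num_reuse:]
    return noise
```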