With the development of diffusion models, text-to-video generation has recently received significant attention and achieved remarkable success. However, existing text-to-video approaches suffer from the following weaknesses: i) they cannot control the trajectory of the subject or the process of scene transformations; ii) they can only generate videos with a limited number of frames and thus fail to capture the whole transformation process. To address these issues, we propose ScenarioDiff, a model capable of generating longer videos with scene transformations. Specifically, we employ a spatial layout fuser to control the positions of subjects and the scene of each frame. To effectively present the process of scene transformation, we introduce a Mixed Frequency ControlNet, which utilizes several frames of a generated video to extend it to a long video chunk by chunk in an auto-regressive manner. Additionally, to ensure consistency between different video chunks, we propose a cross-chunk scheduling mechanism for inference. Experimental results demonstrate the effectiveness of our approach in generating videos with dynamic scene transformations.
The framework of ScenarioDiff. Our proposed method contains three main stages. In the first stage, we tune the UNet on simulated videos, which teaches the model how to change scenes. In the second stage, we train the Mixed Frequency ControlNet, which gives our method the ability to generate long videos. At inference time, we apply a cross-chunk scheduling mechanism that utilizes information from the previous chunk to make the video more consistent.
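The auto-regressive, chunk-by-chunk extension described above can be sketched as follows. This is a minimal illustration of the control flow only: the function names (`generate_chunk`, `generate_long_video`) and the frame counts are hypothetical placeholders, not the paper's actual API, and the real method conditions a denoising pass on the previous chunk's frames rather than copying frame indices.

```python
CHUNK_FRAMES = 16      # frames produced per chunk (illustrative value)
CONTEXT_FRAMES = 4     # trailing frames of the previous chunk reused as conditioning

def generate_chunk(context, n_frames):
    """Stand-in for one conditioned denoising pass.

    Returns placeholder frame indices that continue from the context,
    mimicking how each new chunk extends the previous one.
    """
    start = context[-1] + 1 if context else 0
    return [start + i for i in range(n_frames)]

def generate_long_video(num_chunks):
    """Auto-regressively extend the video chunk by chunk."""
    video = []
    for _ in range(num_chunks):
        # Condition the next chunk on the last few frames already generated.
        context = video[-CONTEXT_FRAMES:]
        video.extend(generate_chunk(context, CHUNK_FRAMES))
    return video
```

The key point the sketch captures is that each chunk sees only the tail of the video so far, so generation cost per chunk stays constant while the total length grows linearly with the number of chunks.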
Qualitative results of ScenarioDiff with and without tuning on the simulated dataset.
(a) without tuning
(b) with tuning
Qualitative results of ScenarioDiff with and without the components of the Mixed Frequency ControlNet.
(a) without both
(b) without mixed init
(c) without refiner
(d) with all
Qualitative results for different values of \( \hat{\lambda} \).
(a) \( \hat{\lambda} \) = 0.00
(b) \( \hat{\lambda} \) = 0.25
(c) \( \hat{\lambda} \) = 0.50
(d) \( \hat{\lambda} \) = 0.75
(e) \( \hat{\lambda} \) = 1.00
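The panels above sweep \( \hat{\lambda} \) from 0 to 1, which suggests it is a blending coefficient between information reused from the previous chunk and fresh noise in the mixed initialization. A minimal sketch of such a blend, assuming a simple linear interpolation rule (the paper's exact formulation may differ, and `mixed_init` is a hypothetical name):

```python
def mixed_init(prev_latent, noise, lambda_hat):
    """Blend the previous chunk's latent with fresh noise.

    lambda_hat = 0.0 -> pure fresh noise (no cross-chunk reuse);
    lambda_hat = 1.0 -> fully reuse the previous chunk's latent.
    Assumed linear blend for illustration only.
    """
    return [lambda_hat * p + (1.0 - lambda_hat) * n
            for p, n in zip(prev_latent, noise)]
```

Under this reading, the extremes in panels (a) and (e) trade off diversity against cross-chunk consistency, with intermediate values balancing the two.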
Qualitative results of ScenarioDiff with and without the components of cross-chunk scheduling.
(a) without both
(b) without mixed init
(c) without reuse
(d) with all