DeepLandscape: Adversarial Modeling of Landscape Videos

ECCV 2020 paper
E. Logacheva, R. Suvorov, O. Khomenko, A. Mashikhin, V. Lempitsky

Abstract

We build a new model of landscape videos that can be trained on a mixture of static landscape images and landscape animations. Our architecture extends the StyleGAN model with components that allow it to model dynamic changes in a scene. Once trained, our model can be used to generate realistic time-lapse landscape videos with moving objects and time-of-day changes. Furthermore, by fitting the learned model to a static landscape image, the latter can be reenacted in a realistic way. We propose simple but necessary modifications to the StyleGAN inversion procedure, which lead to in-domain latent codes and make it possible to manipulate real images. Quantitative comparisons and user studies suggest that our model produces more compelling animations of given photographs than previously proposed methods. The results of our approach, including comparisons with prior art, can be seen in the paper, the supplementary materials, and on the project page.

Two overview videos, "DeepLandscape in 1 min." and "DeepLandscape explained", are available on the project page.

Model Overview

Model architecture

The architecture of our DeepLandscape model is based on StyleGAN and has four sets of latent variables: $z^{st}$ (encodes colors and the general scene layout), $z^{dyn}$ (encodes global lighting, e.g. time of day), $S^{st}$ (a set of square matrices which encode shapes and details of static objects), and $S^{dyn}$ (a set of square matrices which encode shapes and details of dynamic objects).
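For intuition, here is a minimal PyTorch sketch of how these four latent sets could be sampled and fed to the generator; all names and dimensions (Z_ST_DIM, S_RES, the G call) are illustrative placeholders, not the actual configuration from the paper.

    import torch

    # Illustrative dimensions; the real values are defined by the model config.
    Z_ST_DIM, Z_DYN_DIM = 512, 512   # vector codes: scene layout/colors, global lighting
    S_RES = [4, 8, 16, 32]           # side lengths of the square spatial maps

    def sample_latents(batch):
        z_st = torch.randn(batch, Z_ST_DIM)    # z^{st}: colors and general layout
        z_dyn = torch.randn(batch, Z_DYN_DIM)  # z^{dyn}: global lighting / time of day
        # Square matrices encoding shapes and details, one per resolution:
        S_st = [torch.randn(batch, 1, r, r) for r in S_RES]    # static objects
        S_dyn = [torch.randn(batch, 1, r, r) for r in S_RES]   # dynamic objects
        return z_st, z_dyn, S_st, S_dyn

    # frame = G(z_st, z_dyn, S_st, S_dyn)  # StyleGAN-like generator extended with S inputs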
The model is trained from two sources of data: a dataset of static scenery images $\mathcal{I}$ and a dataset of timelapse scenery videos $\mathcal{V}$. It is relatively easy to collect a large static dataset, while with our best efforts we were able to collect only a few hundred videos, which do not cover all the diversity of landscapes. Thus, both sources of data have to be utilized in order to build a good model. To do that, we train our generative model in an adversarial way with two different discriminators. To create a plausible landscape animation (video), the model should preserve static details (buildings, fields, mountains) from the original image while moving dynamic objects such as clouds and waves.
We train with two discriminators: $D_{st}$ (static) and $D_{dyn}$ (pairwise). In this work we propose to use a simplified temporal discriminator ($D_{dyn}$), which only looks at unordered pairs of frames. We augment the fake set of frames with pairs of crops taken from the same video frame but at different locations. Since these crops have the same visual quality as the images in real pairs, and since they come from the same videos as images within real pairs, the pairwise discriminator effectively stops paying attention to image quality, cannot simply overfit to the statistics of scenes in the video dataset, and has to focus on finding pairwise inconsistencies within fake pairs.
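A minimal sketch of how the real and fake pairs for $D_{dyn}$ could be assembled, assuming video clips stored as (T, C, H, W) tensors; the crop size and helper names are our own, not the paper's.

    import random
    import torch

    def rand_crop(img, size):
        """Random square crop from a (C, H, W) image (assumes H, W >= size)."""
        _, h, w = img.shape
        t, l = random.randrange(h - size + 1), random.randrange(w - size + 1)
        return img[:, t:t + size, l:l + size]

    def pairs_for_d_dyn(real_video, fake_frames, crop=128):
        """real_video, fake_frames: (T, C, H, W). Returns one real and one fake pair."""
        # Real pair: two different frames of the same real video.
        i, j = random.sample(range(real_video.shape[0]), 2)
        real_pair = (rand_crop(real_video[i], crop), rand_crop(real_video[j], crop))

        if random.random() < 0.5:
            # Fake pair: two generated frames of the same scene.
            fake_pair = (rand_crop(fake_frames[0], crop), rand_crop(fake_frames[1], crop))
        else:
            # Augmentation: two crops of the SAME real frame at different locations.
            # These match real frames in quality and scene statistics, so D_dyn
            # is forced to judge pairwise consistency rather than image quality.
            k = random.randrange(real_video.shape[0])
            fake_pair = (rand_crop(real_video[k], crop), rand_crop(real_video[k], crop))
        return real_pair, fake_pair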

Animating Real Scenery Images

To animate a given (real) scenery image $I$, we find (infer) a set of latent variables that produces this image within the generator. We perform inference using the following three-step procedure:
Step 1: Predict a set of style vectors $W'$ (the notation is described in the paper) using a feedforward encoder network.
Step 2: Starting from $W'$ and zero spatial maps $S$, optimize all latents to reduce the reconstruction error.
Step 3: Freeze the latents and fine-tune the weights of $G$ to further drive down the reconstruction error.
Lighting manipulation is performed using a separate neural network $A$ that approximates the inverse of the mapping network $M$, i.e. $z = A(w) \approx M^{-1}(w)$.
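The three steps could look roughly as follows in PyTorch; the encoder E, the generator G and its call signature, the spatial-map resolutions, and the plain MSE objective are placeholders (the paper uses its own losses and schedules).

    import torch
    import torch.nn.functional as F

    def fit_image(I, E, G, steps_latent=500, steps_G=300):
        """Fit latents, then generator weights, to reproduce a real image I."""
        # Step 1: predict initial style vectors W' with the feedforward encoder.
        W = E(I).detach().requires_grad_()
        # Spatial maps S start at zero and are optimized jointly with W.
        S = [torch.zeros(1, 1, r, r, requires_grad=True) for r in (4, 8, 16, 32)]

        # Step 2: optimize all latents to reduce the reconstruction error.
        opt = torch.optim.Adam([W, *S], lr=0.01)
        for _ in range(steps_latent):
            opt.zero_grad()
            loss = F.mse_loss(G(W, S), I)  # the actual loss is richer than MSE
            loss.backward()
            opt.step()

        # Step 3: freeze the latents and fine-tune G's weights.
        W.requires_grad_(False)
        for s in S:
            s.requires_grad_(False)
        opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
        for _ in range(steps_G):
            opt_G.zero_grad()
            loss = F.mse_loss(G(W, S), I)
            loss.backward()
            opt_G.step()
        return W, S, G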


Results

Landscapes

To animate scenery images we animate the dynamic spatial maps $S^{dyn}$: we sample $S^{st}$ and $S^{dyn,1}$ from a unit normal distribution and then warp the $S^{dyn}$ tensor continuously using a homography to obtain $S^{dyn,2}, S^{dyn,3}, \ldots, S^{dyn,n}$. To change the time of day in the video sequence, the model linearly interpolates between two randomly sampled vectors $z^{dyn} \in \mathbb{R}^{D^{dyn}}$.
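A minimal sketch of this animation procedure, implemented here with backward warping via torch.nn.functional.grid_sample; the map size, homography values, and helper name are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def warp_homography(x, H):
        """Backward-warp a (N, C, h, w) tensor with a 3x3 homography H given in
        normalized [-1, 1] coordinates (H maps output coords into the source)."""
        n, _, h, w = x.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)  # (h, w, 3)
        warped = grid @ H.T
        grid2 = warped[..., :2] / warped[..., 2:3]   # perspective divide
        return F.grid_sample(x, grid2.expand(n, h, w, 2), align_corners=True)

    # Example motion: a gentle horizontal drift, as for clouds.
    H = torch.tensor([[1.0, 0.0, 0.02],
                      [0.0, 1.0, 0.00],
                      [0.0, 0.0, 1.00]])

    S1 = torch.randn(1, 1, 64, 64)             # S^{dyn,1} ~ N(0, I)
    S_dyn, Hk = [S1], torch.eye(3)
    for _ in range(9):                         # S^{dyn,2..10}
        Hk = H @ Hk                            # compose the motion up to frame k
        S_dyn.append(warp_homography(S1, Hk))  # warp the original map, avoiding blur buildup

    # Time-of-day change: interpolate between two sampled lighting codes.
    z0, z1 = torch.randn(512), torch.randn(512)
    z_dyn = [torch.lerp(z0, z1, float(t)) for t in torch.linspace(0, 1, 10)]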

The original (real) content image is encoded into $G$'s latent space and animated (with different daytime styles).
Note: the webm format used for the videos leads to some image quality loss.

High Resolution

Figure: the original image (960px), the encoded images (256px), and the hi-res encoded images (960px).

The proposed approach also supports increasing the resolution up to 4 times per side using the super-resolution technique described in the paper.

Artworks

It is possible to use the trained model to animate out-of-domain images such as artworks.

Citation

E. Logacheva, R. Suvorov, O. Khomenko, A. Mashikhin, and V. Lempitsky. "DeepLandscape: Adversarial Modeling of Landscape Videos." In European Conference on Computer Vision (ECCV), 2020.


or use BibTeX:
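The following entry is assembled from the citation above (the entry key is our own, not necessarily the official one):

    @inproceedings{logacheva2020deeplandscape,
      author    = {Logacheva, E. and Suvorov, R. and Khomenko, O. and Mashikhin, A. and Lempitsky, V.},
      title     = {DeepLandscape: Adversarial Modeling of Landscape Videos},
      booktitle = {European Conference on Computer Vision (ECCV)},
      year      = {2020}
    }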