We build a new model of landscape videos that can be trained on a mixture of static landscape images and landscape animations. Our architecture extends the StyleGAN model with components that model dynamic changes in a scene. Once trained, our model can be used to generate realistic time-lapse landscape videos with moving objects and time-of-day changes. Furthermore, by fitting the learned model to a static landscape image, the latter can be reenacted in a realistic way. We propose simple but necessary modifications to the StyleGAN inversion procedure, which lead to in-domain latent codes and allow manipulation of real images. Quantitative comparisons and user studies suggest that our model produces more compelling animations of given photographs than previously proposed methods. The results of our approach, including comparisons with prior art, can be seen in the paper, the supplementary materials, and on the project page.
The architecture of our DeepLandscape model is based on StyleGAN and has four sets of latent variables: $z^{st}$ (encodes colors and the general scene layout), $z^{dyn}$ (encodes global lighting, e.g. time of day), $S^{st}$ (a set of square matrices which encode shapes and details of static objects), and $S^{dyn}$ (a set of square matrices which encode shapes and details of dynamic objects).
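For concreteness, here is a minimal sketch of sampling these four sets of latents (PyTorch; all dimensions and the call signature of the generator are illustrative assumptions, not the exact ones used in the paper):

```python
import torch

batch, z_dim, s_channels, s_size = 4, 512, 1, 16        # illustrative dimensions

z_st  = torch.randn(batch, z_dim)                        # colors and general scene layout
z_dyn = torch.randn(batch, z_dim)                        # global lighting, e.g. time of day
# S^st and S^dyn are sets of square matrices; a single one is shown here for brevity.
S_st  = torch.randn(batch, s_channels, s_size, s_size)   # shapes/details of static objects
S_dyn = torch.randn(batch, s_channels, s_size, s_size)   # shapes/details of dynamic objects

# frame = G(z_st, z_dyn, S_st, S_dyn)                    # G: the StyleGAN-based generator
```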
The model is trained from two sources of data: a dataset of static scenery images $\mathcal{I}$ and a dataset of time-lapse scenery videos $\mathcal{V}$.
It is relatively easy to collect a large static dataset, whereas, despite our best efforts, we were only able to collect a few hundred videos, which do not cover the full diversity of landscapes. Both sources of data therefore have to be utilized in order to build a good model. To do so, we train our generative model adversarially with two different discriminators.
To create a plausible landscape animation (video), the model should preserve static details (buildings, fields, mountains) from the original image and move dynamic objects such as clouds and waves.
The training scheme therefore contains two discriminators: $D_{st}$ (static discriminator) and $D_{dyn}$ (pairwise discriminator).
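A minimal sketch of the resulting generator objective, assuming a StyleGAN-style non-saturating loss (the call signatures and loss weighting are simplifications, not the paper's exact formulation):

```python
import torch.nn.functional as F

def generator_loss(G, D_st, D_dyn, latents_frame1, latents_frame2):
    # The two sets of latents share the static codes (z_st, S_st) and differ
    # in the dynamic ones, so the frames depict two moments of the same scene.
    frame1, frame2 = G(*latents_frame1), G(*latents_frame2)
    loss_static   = F.softplus(-D_st(frame1)).mean()           # single-frame realism (dataset I)
    loss_pairwise = F.softplus(-D_dyn(frame1, frame2)).mean()  # pair consistency (dataset V)
    return loss_static + loss_pairwise
```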
In this work we propose to use a simplified temporal discriminator ($D_{dyn}$) which only looks at unordered pairs of frames.
We augment the fake set of pairs with pairs of crops taken from the same video frame but at different locations (a sketch of this pair construction is given below). Since these crops have the same visual quality as the frames in real pairs, and since they come from the same videos as the images within real pairs, the pairwise discriminator effectively stops paying attention to image quality, cannot simply overfit to the statistics of scenes in the video dataset, and has to focus on finding pairwise inconsistencies within fake pairs.
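A sketch of how such pairs could be assembled (the helper names and the aligned cropping of real pairs are illustrative assumptions):

```python
import random

def crop_at(frame, top, left, size):
    """Crop a (C, H, W) frame at the given location."""
    return frame[:, top:top + size, left:left + size]

def random_location(frame, size):
    _, h, w = frame.shape
    return random.randint(0, h - size), random.randint(0, w - size)

def build_pairs(real_frame_a, real_frame_b, fake_frame_a, fake_frame_b, crop_size):
    # Real pair: two different frames of the same real time-lapse video,
    # cropped at the same location so that the pair stays spatially aligned.
    top, left = random_location(real_frame_a, crop_size)
    real_pair = (crop_at(real_frame_a, top, left, crop_size),
                 crop_at(real_frame_b, top, left, crop_size))
    # Fake pair 1: two generated frames of the same synthetic video.
    fake_pair = (fake_frame_a, fake_frame_b)
    # Fake pair 2 (augmentation): two crops of the SAME real frame taken at
    # DIFFERENT locations -- real image quality and real scene statistics,
    # but inconsistent as a pair of video frames.
    aug_pair = (crop_at(real_frame_a, *random_location(real_frame_a, crop_size), crop_size),
                crop_at(real_frame_a, *random_location(real_frame_a, crop_size), crop_size))
    return [real_pair], [fake_pair, aug_pair]
```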
To animate a given (real) scenery image $I$, we find (infer) a set of latent variables that produces this image within the generator.
We perform inference using the following three-step procedure:
Step 1: Predict a set of style vectors $W^{\prime}$ (the notation is described in the paper) using a feed-forward encoder network.
Step 2: Starting from $W^{\prime}$ and zero spatial maps $S$, optimize all latents to reduce the reconstruction error.
Step 3: Freeze the latents and fine-tune the weights of $\mathbf{G}$ to further drive down the reconstruction error.
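A condensed sketch of this procedure (PyTorch-style; `E`, `reconstruction_loss`, `spatial_map_shape`, step counts, and learning rates are illustrative placeholders):

```python
import torch

def invert(G, E, image, latent_steps=500, weight_steps=300):
    # Step 1: a feed-forward encoder E predicts the style vectors W'.
    W = E(image).detach().requires_grad_(True)
    # Step 2: starting from W' and zero spatial maps S, optimize all latents.
    S = torch.zeros(spatial_map_shape(G), requires_grad=True)    # hypothetical helper
    opt = torch.optim.Adam([W, S], lr=1e-2)
    for _ in range(latent_steps):
        opt.zero_grad()
        reconstruction_loss(G(W, S), image).backward()           # hypothetical loss, e.g. L1 + perceptual
        opt.step()
    # Step 3: freeze the latents and fine-tune the generator weights.
    W, S = W.detach(), S.detach()
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    for _ in range(weight_steps):
        opt_g.zero_grad()
        reconstruction_loss(G(W, S), image).backward()
        opt_g.step()
    return W, S, G
```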
Lighting manipulation is performed using a separate neural network $\mathbf{A}$ that approximates the inverse of the mapping network $\mathbf{M}$, i.e. $\mathbf{z} = \mathbf{M}^{-1}(\mathbf{w}) \approx \mathbf{A}(\mathbf{w})$.
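One simple way to obtain such a network is to regress latent codes from the styles produced by the frozen mapping network; a sketch follows (the architecture, hyper-parameters, and the `load_pretrained_mapping_network` placeholder are assumptions):

```python
import torch
import torch.nn as nn

z_dim = 512                                              # illustrative latent dimensionality
M = load_pretrained_mapping_network()                    # placeholder: frozen StyleGAN mapping z -> w
A = nn.Sequential(nn.Linear(z_dim, z_dim), nn.LeakyReLU(0.2),
                  nn.Linear(z_dim, z_dim))
opt = torch.optim.Adam(A.parameters(), lr=1e-3)

for _ in range(10_000):
    z = torch.randn(256, z_dim)                          # sample latent codes
    w = M(z).detach()                                    # corresponding styles
    loss = ((A(w) - z) ** 2).mean()                      # train A so that A(w) ~ M^{-1}(w) = z
    opt.zero_grad()
    loss.backward()
    opt.step()
```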
To animate scenery images we use the spatial maps $S^{dyn}$: we sample $S^{st}$ and $S^{dyn,1}$ from a unit normal distribution and then warp the $S^{dyn}$ tensor continuously using a homography to obtain $S^{dyn,2}, S^{dyn,3}, \dots, S^{dyn,n}$. To change the time of day in the video sequence, the model linearly interpolates between two randomly sampled vectors $z^{dyn} \in \mathbb{R}^{D^{dyn}}$.
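A sketch of this animation loop, using kornia's `warp_perspective` as one possible way to apply the homography (the shapes, the specific drift homography, and the use of kornia are assumptions, not the paper's exact implementation):

```python
import torch
from kornia.geometry.transform import warp_perspective

n_frames, channels, size, z_dim = 32, 1, 16, 512          # illustrative shapes
S_dyn_1 = torch.randn(1, channels, size, size)            # S^{dyn,1} ~ N(0, I)

H = torch.eye(3).unsqueeze(0)                             # per-frame homography
H[:, 0, 2] = 0.5                                          # e.g. a small horizontal drift per frame

z_a, z_b = torch.randn(1, z_dim), torch.randn(1, z_dim)   # two random daytime codes z^dyn

per_frame_latents = []
for t in range(n_frames):
    alpha = t / (n_frames - 1)
    z_dyn_t = (1 - alpha) * z_a + alpha * z_b              # daytime change: linear interpolation
    H_t = torch.linalg.matrix_power(H, t)                  # compose t steps of the warp
    S_dyn_t = warp_perspective(S_dyn_1, H_t, dsize=(size, size))
    per_frame_latents.append((S_dyn_t, z_dyn_t))
    # each (S_dyn_t, z_dyn_t), together with fixed z_st and S_st, yields one video frame
```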
The original (real) content image is encoded into $G$'s latent space and animated (with different
daytime styles).
Note: the webm format leads to some loss of image quality.
The proposed approach also supports increasing the resolution by up to 4 times per side using the super-resolution technique described in the paper.
It is also possible to use the trained model to animate out-of-domain images, such as artworks.
E. Logacheva, R. Suvorov, O. Khomenko, A. Mashikhin, and V. Lempitsky. "DeepLandscape: Adversarial Modeling of Landscape Videos." In Proceedings of the European Conference on Computer Vision (ECCV), 2020.