Recent Developments in High-Resolution Image Synthesis

tech talk @ ZHAW

16/06/2022
by

Philipp Denzel contact_qr.png

Slides on my website

https://phdenzel.github.io/

talk_qr.png

My previous work


  • PhD in Physics from UZH @ ICS
  • gravitational lens modelling
  • "relativistic/astrophysical" raytracing
  • Bayesian technique based on image synthesis & reconstruction (see Denzel et al. 2021a)

input data
my-work_composite_SW05.png

latent representation
my-work_kappa_SW05.png

reconstruction
my-work_composite_SW05_synth.png

Denzel et al. 2021b

Generative deep learning

Motivation

  • goal of AI: automate intelligent behaviour on silicon-based machines
  • in contrast to discriminative deep learning:
    • pattern recognition, i.e. modelling \(P(y\vert x)\) for labels \(y\)
  • generative deep learning:
    • approximate the true data density with parameters \(\theta\), optionally conditioned on some information \(c\): \[ P_\theta(x|c) \approx P_\text{data}(x|c) \]
    • (inspired) creativity ➝ much more ambitious

Approaches and objectives

  • VAEs: \(\quad \log{p(x)} \ge \mathbb{E}_{z\sim q_{\theta}(z\vert x)}[\log{p_\theta(x\vert z)}] - D_{KL}\left(q_\theta(z\vert x) \vert\vert p(z)\right)\)
    • fast, regularized latent space, lower bound to LL, trade-offs: reconstruction ⇿ regularization
  • GANs: \(\quad \mathbb{E}_{x\sim p_\text{data}}[\log{D_\theta(x)}] + \mathbb{E}_{z\sim q(z)}[\log(1-D_\theta(G_\theta(z)))]\) (see the sketch after this list)
    • fast, high quality, implicit density, mode collapse
  • Autoregressive models: \(\quad p(x) = \prod_i p_\theta(x_i\vert x_{\lt i})\)
    • exact likelihood, good results, no latent representation, slow inference
  • Diffusion Models: \(\quad -\log{p(x)} \le \mathbb{E}_{q}[\log{\frac{q(x_{1:T}\vert x_0)}{p_\theta(x_{0:T})}}]\)
    • flexible, high fidelity, lower bound to LL
  • etc.
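
To make the GAN objective above concrete, here is a minimal PyTorch-style sketch (assuming \(D\) and \(G\) are arbitrary discriminator/generator modules, with \(D\) returning raw logits; the non-saturating generator loss is the variant typically used in practice):

  import torch
  import torch.nn.functional as F

  def gan_losses(D, G, x_real, z):
      """Minimal sketch of the GAN objective above (D and G are assumed modules)."""
      x_fake = G(z)
      logits_real = D(x_real)
      logits_fake = D(x_fake.detach())
      # discriminator: maximize E[log D(x)] + E[log(1 - D(G(z)))]
      d_loss = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
                + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))
      # generator: non-saturating variant, maximize E[log D(G(z))]
      logits_gen = D(x_fake)
      g_loss = F.binary_cross_entropy_with_logits(logits_gen, torch.ones_like(logits_gen))
      return d_loss, g_loss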

Diffusion models

Latent diffusion models (LDMs) by Rombach & Blattmann et al. (2022), Google's Imagen, and OpenAI's DALLE-2:

dalle-2_arch.png
from Ramesh et al. (2022)

Text to image

  • DALLE-2 - a new champion in semantic understanding
  • generates images up to 1 Megapixel!


"A corgi's head depicted as
an explosion of a nebula"
dalle-2_A_corgis_head_depicted_as_an_explosion_of_a_nebula.jpg
from Ramesh et al. (2022)

"A dolphin in an astronaut suit
on saturn, artstation"
dalle-2_a_dolphin_in_an_astronaut_suit_on_saturn,_artstation.jpg
from Ramesh et al. (2022)

"Panda mad scientist mixing
sparkling chemicals, artstation"
dalle-2_panda_mad_scientist_mixing_sparkling_chemicals,_artstation.jpg
from Ramesh et al. (2022)

Taming Transformers

paper_thumb.png

Transformers - Attention mechanism

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \]

  • where \(Q\) (query), \(K\) (key), and \(V\) (value) are matrices
    • (packed sets of vectors)
    • dot product: quadratic complexity in the sequence length!
    • captures long-range interactions (see the sketch below)

transformers_scheme.png
Scaled Dot-Product Attention
(Vaswani et al. 2017)
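
A minimal PyTorch sketch of the scaled dot-product attention above; the optional mask argument is an assumption for the autoregressive (causal) setting used later:

  import math
  import torch

  def scaled_dot_product_attention(Q, K, V, mask=None):
      """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al. 2017).
      Q, K: (..., N, d_k), V: (..., N, d_v); the N x N score matrix is the O(N^2) part."""
      d_k = Q.size(-1)
      scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., N, N)
      if mask is not None:
          scores = scores.masked_fill(mask == 0, float("-inf"))
      weights = torch.softmax(scores, dim=-1)
      return weights @ V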

Transformers - Architecture

transformers_scheme2.png
Multi-headed attention (Vaswani et al. 2017)

Combine approaches

  • VQVAEs + GANs + autoregressive model = VQGAN
    • VQVAEs: latent variables
    • CNNs: local interactions
    • Transformers: global interactions
    • adversarial idea: efficient learning

VQGAN

vqgan_arch.png

\(\quad \mathcal{Q}^{*}_{\text{VQGAN}} = \arg\min_{E, G, \mathcal{Z}} \max_{D} \mathbb{E}_{x\sim p(x)} \left[\mathcal{L}_{\text{VQ}}(E, G, \mathcal{Z}) + \lambda\mathcal{L}_{\text{GAN}}(\{E, G, \mathcal{Z}\}, D)\right]\)


Esser & Rombach et al. (2020)
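
For illustration, a minimal sketch of the vector-quantization bottleneck behind the \(\mathcal{L}_{\text{VQ}}\) term (VQVAE-style codebook lookup with straight-through gradients; codebook size, embedding dimension, and \(\beta\) are placeholder values, not the paper's settings):

  import torch
  import torch.nn as nn

  class VectorQuantizer(nn.Module):
      """Nearest-neighbour codebook lookup with a straight-through estimator."""
      def __init__(self, n_codes=1024, dim=256, beta=0.25):
          super().__init__()
          self.codebook = nn.Embedding(n_codes, dim)
          self.beta = beta

      def forward(self, z_e):
          # z_e: encoder output, shape (B, N, dim) with N = H*W latent positions
          flat = z_e.reshape(-1, z_e.size(-1))
          dist = (flat.pow(2).sum(1, keepdim=True)
                  - 2 * flat @ self.codebook.weight.t()
                  + self.codebook.weight.pow(2).sum(1))           # squared distances to codes
          idx = dist.argmin(dim=-1).view(z_e.shape[:-1])          # discrete latent codes
          z_q = self.codebook(idx)                                # quantized latents
          # codebook + commitment losses (part of L_VQ above)
          vq_loss = ((z_q - z_e.detach())**2).mean() + self.beta * ((z_q.detach() - z_e)**2).mean()
          # straight-through: copy gradients from decoder input z_q to encoder output z_e
          z_q = z_e + (z_q - z_e).detach()
          return z_q, idx, vq_loss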

First-stage reconstructions

vqgan_first_stage_squirrels_x4.png


Esser & Rombach et al. (2020)

First-stage reconstructions

vqgan_first_stage_squirrels_annx4.png


Esser & Rombach et al. (2020)

Semantic conditioning (\(f=16\))

semantic map
vqgan_semantic_map1.jpg

sample
vqgan_semantic_gen1a.jpg

another sample
vqgan_semantic_gen1b.jpg

vqgan_semantic_map2.jpg

vqgan_semantic_gen2a.jpg

vqgan_semantic_gen2b.jpg


Esser & Rombach et al. (2020)

A variety of image synthesis tasks

vqgan_tasks.jpg


Esser & Rombach et al. (2020)

A variety of image synthesis tasks

vqgan_tasks2.jpg


Esser & Rombach et al. (2020)

High-resolution?

  • long sequences are computationally expensive since attention scales as \(\mathcal{O}(N^2)\)
  • high resolutions become possible with a sliding attention window (sketched below)

vqgan_attention_slide.png


Esser & Rombach et al. (2020)
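
A hypothetical sketch of the idea (not the original implementation): latent codes are sampled autoregressively, but each position only attends to a local crop, so the cost no longer grows quadratically with the full image; the `transformer` callable, the pre-initialized `codes` grid, and the handling of the very first token are assumptions.

  import torch

  @torch.no_grad()
  def sample_sliding_window(transformer, codes, window=16):
      """Sample a (H, W) grid of discrete latent codes, conditioning each position
      on a local crop only (a start/conditioning token is assumed for position 0)."""
      H, W = codes.shape
      for i in range(H):
          for j in range(W):
              i0, j0 = max(0, i - window + 1), max(0, j - window + 1)
              # previously generated codes inside the crop, flattened in raster order
              context = codes[i0:i + 1, j0:j + 1].reshape(1, -1)[:, :-1]
              logits = transformer(context)[:, -1]        # next-token logits for (i, j)
              probs = torch.softmax(logits, dim=-1)
              codes[i, j] = torch.multinomial(probs, 1)
      return codes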

High-resolution

vqgan_HR1.jpg


Esser & Rombach et al. (2020)

High-resolution

vqgan_HR1.jpg

vqgan_HR2.jpg


Esser & Rombach et al. (2020)

High-resolution

vqgan_HR1.jpg

vqgan_HR2.jpg

vqgan_HR3.jpg


Esser & Rombach et al. (2020)

Summary: High-resolution image synthesis

  • > 1 Megapixel images are possible w/ two-stage approaches
  • diffusion models show excellent results in semantic understanding
  • autoregressive models can be optimized for high-resolution image generation


https://compvis.github.io/taming-transformers/

Demo

Original repo: https://github.com/CompVis/taming-transformers

My fork: https://github.com/phdenzel/taming-transformers

Created by phdenzel.