input data ➝ latent representation ➝ reconstruction
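A minimal sketch of this encode-compress-decode pipeline, assuming PyTorch; the layer sizes and the `TinyAutoencoder` name are illustrative, not the actual LDM autoencoder architecture:

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Minimal autoencoder: input data -> latent representation -> reconstruction."""
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),        # latent representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, in_dim),            # reconstruction
        )

    def forward(self, x):
        z = self.encoder(x)                    # compress to the latent space
        return self.decoder(z)                 # reconstruct from the latent

x = torch.randn(8, 784)                        # dummy batch
model = TinyAutoencoder()
loss = nn.functional.mse_loss(model(x), x)     # reconstruction objective
```

The point of the bottleneck is that downstream models (like the diffusion prior in LDMs) can operate on the low-dimensional latent instead of raw pixels.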
LDMs by Rombach & Blattmann et al. (2022), Google's Imagen, or OpenAI's DALL·E 2:
Sample generations from Ramesh et al. (2022):
"A corgi's head depicted as an explosion of a nebula"
"A dolphin in an astronaut suit on saturn, artstation"
"Panda mad scientist mixing sparkling chemicals, artstation"
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \]
Scaled Dot-Product Attention
(Vaswani et al. 2017)
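The formula above translates almost line for line into code. A sketch in PyTorch, with illustrative tensor shapes:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for tensors of shape (..., seq, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., seq_q, seq_k)
    weights = torch.softmax(scores, dim=-1)            # rows sum to 1
    return weights @ v

q = k = v = torch.randn(2, 5, 64)                  # batch of 2, 5 tokens, d_k = 64
out = scaled_dot_product_attention(q, k, v)        # shape (2, 5, 64)
```

The 1/sqrt(d_k) scaling keeps the dot products from growing with the key dimension, which would otherwise push the softmax into near-one-hot saturation.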
Multi-headed attention (Vaswani et al. 2017)
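Multi-headed attention runs several scaled dot-product attentions in parallel over learned projections of Q, K, and V, then concatenates and re-projects the results. PyTorch ships this as a built-in module; the dimensions below are illustrative:

```python
import torch
import torch.nn as nn

# embed_dim is split across num_heads parallel attention heads (64 / 8 = 8 per head)
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 5, 64)            # (batch, tokens, embed_dim)
out, attn_weights = mha(x, x, x)     # self-attention: Q = K = V = x
print(out.shape)                     # torch.Size([2, 5, 64])
```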
\[ \mathcal{Q}^{*}_{\text{VQGAN}} = \arg\min_{E, G, \mathcal{Z}} \max_{D} \; \mathbb{E}_{x\sim p(x)} \left[\mathcal{L}_{\text{VQ}}(E, G, \mathcal{Z}) + \lambda\,\mathcal{L}_{\text{GAN}}(\{E, G, \mathcal{Z}\}, D)\right] \]
VQGAN objective (Esser et al. 2021)
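A hedged sketch of the terms in this objective, assuming PyTorch; the function name, arguments, and the fixed `beta`/`lam` weights are illustrative stand-ins, not the reference implementation:

```python
import torch
import torch.nn.functional as F

def vqgan_generator_loss(x, x_rec, z_e, z_q, disc_fake, beta=0.25, lam=0.8):
    """Sketch of L_VQ + lambda * L_GAN from the generator's side.

    x, x_rec : input image and reconstruction G(z_q)
    z_e, z_q : encoder output E(x) and its quantized codebook entry
    disc_fake: discriminator logits D(x_rec) on the reconstruction
    """
    # L_VQ: reconstruction + codebook + commitment terms
    l_rec = F.mse_loss(x_rec, x)
    l_codebook = F.mse_loss(z_q, z_e.detach())   # pulls codes toward encodings
    l_commit = F.mse_loss(z_e, z_q.detach())     # keeps the encoder committed
    l_vq = l_rec + l_codebook + beta * l_commit
    # L_GAN (generator side): push D's logits on the reconstruction up
    l_gan = -disc_fake.mean()
    return l_vq + lam * l_gan
```

In the paper, λ is set adaptively from the ratio of gradient norms of the reconstruction and GAN losses; a fixed value is used here only to keep the sketch short.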
Semantic image synthesis with VQGAN: semantic map ➝ sample, another sample
Original repo: https://github.com/CompVis/taming-transformers
Created by phdenzel.