Autoencoders


Original Source: https://youtu.be/o_peo6U7IRM, https://youtu.be/rNh2CrTFpm4, https://youtu.be/LeVkjCuUdRs

Slides From: https://www.slideshare.net/NaverEngineering/ss-96581209

Gradient Descent

We want to update $\theta$ with $\theta+\Delta\theta$. How do we find $\Delta\theta$?

Approximation by Taylor Expansion

\[L(\theta + \Delta\theta) \approx L(\theta)+\Delta\theta^T \nabla_\theta L(\theta)\]

We also know that

\[\Delta\theta^T \nabla_\theta L(\theta) = |\Delta\theta||\nabla_\theta L(\theta)|\cos\beta\]

where $\beta$ is the angle between $\Delta\theta$ and $\nabla_\theta L(\theta)$.

We want to minimize $L(\theta + \Delta\theta)$, so we want $\Delta\theta^T \nabla_\theta L(\theta)$ to be as negative as possible. $\cos\beta$ is minimized when $\beta$ is 180 degrees, so

\[\Delta\theta = -\alpha \nabla_\theta L(\theta)\]

We need to set the constant (learning rate) $\alpha$ to be sufficiently small, since the first-order Taylor approximation is valid only locally.
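As a minimal sketch (not from the lecture), this is gradient descent on a toy quadratic loss in NumPy; the loss, its gradient, and the learning rate are illustrative choices:

```python
import numpy as np

def loss(theta):
    # toy quadratic loss L(theta) = ||theta - 3||^2
    return np.sum((theta - 3.0) ** 2)

def loss_grad(theta):
    # gradient of the toy loss: 2 * (theta - 3)
    return 2.0 * (theta - 3.0)

theta = np.zeros(2)
alpha = 0.1                                    # learning rate, kept small so the local approximation holds
for step in range(100):
    theta = theta - alpha * loss_grad(theta)   # delta_theta = -alpha * grad

print(theta, loss(theta))                      # theta approaches [3, 3], loss approaches 0
```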

BackPropagation Algorithm

  • $C$: cost function
  • $L$: last layer
  • $l$: layer index
  • $\delta^l$: error term at $l$th layer
  • $\sigma$: activation function
  • $z^l$: weighted input (pre-activation) of layer $l$
  • $a^l = \sigma(z^l)$: activation of layer $l$
  • $W^l$: weight of layer $l$
  • $b^l$: bias of layer $l$
\[\delta ^L = \nabla_aC \odot \sigma'(z^L)\] \[\delta^l = \sigma'(z^l) \odot ((W^{l+1})^T \delta^{l+1})\] \[\nabla_{b^l}C = \delta^l\] \[\nabla_{W^l}C = \delta^l(a^{l-1})^T\]
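A minimal NumPy sketch of these four equations for a small fully connected network with sigmoid activations and squared-error cost (the layer sizes and data are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
sizes = [4, 5, 3]                              # input, hidden, output dimensions (illustrative)
W = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [rng.standard_normal((m, 1)) for m in sizes[1:]]

x = rng.standard_normal((sizes[0], 1))
y = rng.standard_normal((sizes[-1], 1))

# forward pass: store z^l and a^l for every layer
a, zs, activations = x, [], [x]
for Wl, bl in zip(W, b):
    z = Wl @ a + bl
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)

# backward pass, following the four equations above, with C = 0.5 * ||a^L - y||^2
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])      # delta^L
grad_W, grad_b = [None] * len(W), [None] * len(b)
grad_b[-1] = delta
grad_W[-1] = delta @ activations[-2].T
for l in range(2, len(sizes)):
    delta = sigmoid_prime(zs[-l]) * (W[-l + 1].T @ delta)  # delta^l
    grad_b[-l] = delta
    grad_W[-l] = delta @ activations[-l - 1].T
```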

Sigmoid vs ReLU

The sigmoid function is defined as:

\[sigmoid(x) = \frac{1}{1+e^{-x}}\]

The maximum of $sigmoid'(x)$ is 0.25, and as $x$ moves away from 0, $sigmoid'(x)$ approaches 0. Therefore, if we use sigmoid as the activation function, the error signal gets smaller and smaller as we backpropagate, so the parameters in the front part of the network are not updated well. This is called the vanishing gradient problem.

On the other hand, the derivative of ReLU is 1 for $x>0$ and 0 for $x<0$, so it does not cause vanishing gradients.
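A quick numerical illustration of the two derivatives (the inputs are arbitrary):

```python
import numpy as np

def sigmoid_prime(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_prime(x):
    return (x > 0).astype(float)

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid_prime(xs))   # peaks at 0.25 at x = 0 and decays toward 0 away from it
print(relu_prime(xs))      # 0 for x < 0, 1 for x > 0

# after backpropagating through 10 sigmoid layers, the error signal is scaled by
# at most 0.25**10 (ignoring the weights), which is already about 1e-6
print(0.25 ** 10)
```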

Also, for classification, if we use cross entropy as the loss function, $\delta^L$ becomes $\hat{y} - y$, so the problem with the sigmoid derivative disappears in the last layer.

Maximum Likelihood Estimation in Cost Function

We assume that the target $Y$, given the network output, follows a known distribution parameterized by that output:

\[p(Y|f_\theta(X))\]

When we observe an input $x_i$ and target $y_i$, its likelihood is:

\[p(y_i|f_\theta(x_i))\]

We observe multiple instances, and under the i.i.d. assumption, the likelihood of observing the input vector $x$ and target vector $y$ is:

\[\prod_i p(y_i|f_\theta(x_i))\]

and we can do maximum likelihood estimation by minimizing negative log likelihood:

\[-\sum_i\log(p(y_i|f_\theta(x_i)))\]

We’ll see what this negative log likelihood becomes for two distributions.

1. Gaussian Distribution -> MSE

There are two parameters in a Gaussian distribution: the mean and the standard deviation. We assume $\sigma = 1$ and set the network output to be the mean of the Gaussian distribution. In other words, we interpret the network output as the mean of the target distribution given $x$.

\[f_\theta(x_i) = \mu_i\]

We want to minimize the negative log likelihood, which, if we do some calculations, is:

\[-\sum_i\log(p(y_i|f_\theta(x_i))) = -n\log\frac{1}{\sqrt{2\pi}} + \sum_i\frac{(y_i-f_\theta(x_i))^2}{2}\]

Minimizing the above term with respect to $\theta$ is the same as minimizing the following term:

\[\frac{1}{n}\sum_i(y_i-f_\theta(x_i))^2\]

which is Mean Squared Error.
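As a quick numerical check (the values below are arbitrary), the negative log likelihood differs from the MSE only by an additive constant and a positive scale factor:

```python
import numpy as np

y = np.array([0.2, 1.5, -0.3])          # targets (arbitrary)
f = np.array([0.0, 1.0, 0.1])           # network outputs, interpreted as means

n = len(y)
nll = -n * np.log(1.0 / np.sqrt(2 * np.pi)) + np.sum((y - f) ** 2) / 2.0
mse = np.mean((y - f) ** 2)

# nll = n*log(sqrt(2*pi)) + (n/2) * mse, so minimizing nll over f is minimizing mse
print(nll, n * np.log(np.sqrt(2 * np.pi)) + n * mse / 2.0)
```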

2. Bernoulli Distribution -> BCE

We interpret the network output as the $p$ parameter of the Bernoulli distribution.

\[f_\theta(x_i) = p_i\] \[-\sum_i\log(p(y_i|f_\theta(x_i))) = -\sum_i [y_i\log f_\theta(x_i) + (1-y_i)\log(1-f_\theta(x_i))]\]

which is the Binary Cross Entropy.

Manifold Learning

Manifold learning is an approach to non-linear dimensionality reduction. It is based on the manifold hypothesis, which assumes that natural data in high dimensional spaces concentrates close to lower dimensional manifolds, and probability density decreases very rapidly when moving away from the supporting manifold.

Objectives

  1. Data Compression

  2. Data Visualization

  3. Curse of Dimensionality

    In a 10x10 plane, 10 samples occupy 10/100 positions. However, in a 10x10x10 cube, 10 samples occupy only 10/1000 positions. So as dimensionality increases, the samples become sparser, which makes the data harder to model.

  4. Discovering Most Important Features

    The axes of the manifold space represent important features of the data.

    Also, distances are more meaningful on the manifold than in the original space.

AutoEncoders

AutoEncoder
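A basic autoencoder compresses the input through a low-dimensional bottleneck with an encoder $h$ and reconstructs it with a decoder $g$, trained on a reconstruction loss. The lecture illustrates this with a diagram; as a stand-in, here is a minimal PyTorch sketch (the layer sizes are arbitrary):

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=20):
        super().__init__()
        # encoder h(.): maps the input to a low-dimensional (manifold) representation
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, hidden_dim))
        # decoder g(.): reconstructs the input from the representation
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(32, 784)                                    # a dummy batch in [0, 1]
loss = nn.functional.binary_cross_entropy(model(x), x)     # reconstruction loss
```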

Denoising AutoEncoder

We add random noise to the autoencoder input. The model will learn a representation that is robust to small corruptions of the input, and it will learn to project noisy inputs back onto the manifold.
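A sketch of one denoising training step, reusing the AutoEncoder sketch above; the noise level of 0.3 is an arbitrary choice:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(32, 784)                                       # clean batch
x_noisy = (x + 0.3 * torch.randn_like(x)).clamp(0.0, 1.0)     # corrupt the input

recon = model(x_noisy)                                        # reconstruct from the noisy input...
loss = torch.nn.functional.binary_cross_entropy(recon, x)     # ...but compare against the clean input
loss.backward()
optimizer.step()
optimizer.zero_grad()
```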

Contractive AutoEncoder

The contractive autoencoder tries to achieve a similar goal by adding a regularization term to the loss.

The stochastic contractive autoencoder uses

\[L_{SCAE} = \sum_x (L(x, g(h(x))) + \lambda E_{q(\tilde{x}|x)}[||h(x)-h(\tilde{x})||^2])\]

The regularization term (after $\lambda$) constrains the model to produce the same encoded vector for the original input and its corresponding noisy inputs.

The contractive autoencoder uses

\[\lambda ||\frac{\partial h}{\partial x}(x)||^2_F\]

as an analytic regularization term, the squared Frobenius norm of the encoder's Jacobian.
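A sketch of this penalty for a single input, reusing the encoder from the AutoEncoder sketch above and computing the Jacobian with PyTorch's autograd (the value of $\lambda$ is arbitrary):

```python
import torch

x = torch.rand(784)                            # a single input
lam = 1e-4                                     # regularization weight (arbitrary)

# Jacobian of the code w.r.t. the input, dh/dx, with shape (hidden_dim, input_dim)
J = torch.autograd.functional.jacobian(model.encoder, x, create_graph=True)
penalty = lam * (J ** 2).sum()                 # squared Frobenius norm

loss = torch.nn.functional.binary_cross_entropy(model(x), x) + penalty
```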

Variational Autoencoder

1. How do we seed the generator?

We want to train a generator $g_\theta$ with input $z$ and target $x$. We want to maximize the probability of $x$ in the training set:

\[p(x) = \int p(x|g_\theta(z))p(z)dz\]

If we assume $X|Z$ is Gaussian with mean $g_\theta(z)$, images with a smaller MSE to the training data are more likely to be generated. However, a smaller MSE in the original (pixel) space does not mean the images are meaningfully similar.

So we want to sample $z$ from $p(z|x)$ so that it can be used as a seed to generate an image that is meaningfully similar to $x$. But since we don’t know $p(z|x)$, we choose a well-known distribution $q_\phi (z|x)$ and train an encoder to estimate $\phi$.

2. How do we find $\phi$?

First, let’s look at $\log p(x)$:

\[\log p(x) \\ = \int \log(p(x))q_\phi(z|x)dz \\ = \int \log\left(\frac{p(x,z)q_\phi(z|x)}{q_\phi(z|x)p(z|x)}\right) q_\phi(z|x)dz \\ = \int \log\left(\frac{p(x,z)}{q_\phi(z|x)}\right)q_\phi(z|x)dz + \int \log\left(\frac{q_\phi(z|x)}{p(z|x)}\right)q_\phi(z|x)dz \\ = ELBO(\phi) + KL(q_\phi(z|x)||p(z|x))\]

Our goal is to approximate $p(z|x)$ with $q_\phi(z|x)$, so we want to minimize the KL term above. But since we don’t know $p(z|x)$, we cannot do that directly. Instead we maximize the ELBO with respect to $\phi$, which, because $\log p(x)$ is fixed, in turn minimizes the KL term.

Now let’s look at $ELBO(\phi)$:

\[ELBO(\phi) \\ = \int \log\left(\frac{p(x,z)}{q_\phi(z|x)}\right)q_\phi(z|x)dz \\ = \int \log\left(\frac{p(x|z)p(z)}{q_\phi(z|x)}\right)q_\phi(z|x)dz \\ = \int \log(p(x|z))q_\phi(z|x)dz - \int \log\left(\frac{q_\phi(z|x)}{p(z)}\right)q_\phi(z|x)dz \\ = E_{q_\phi(z|x)}[\log(p(x|z))] - KL(q_\phi(z|x)||p(z)) \\ = E_{q_\phi(z|x)}[\log(p(x|g_\theta(z)))] - KL(q_\phi(z|x)||p(z))\]

Maximizing the expectation term above is equivalent to maximizing the probability that the generator produces the training data $x$. In short, when we maximize the ELBO, we simultaneously optimize $\phi$ for the encoder and $\theta$ for the generator.

3. Training Objective

So our objective is to minimize the negative ELBO with respect to $\phi$ and $\theta$:

\[\arg\min_{\phi, \theta} \sum_i \left(-E_{q_\phi(z|x_i)}[\log(p(x_i|g_\theta(z)))] + KL(q_\phi(z|x_i)||p(z))\right)\]

We call the expectation term the ‘reconstruction error’ and the KL term the ‘regularization’.

4. How do we compute the above loss?

1. Regularization term

If we assume two things:

\[q_\phi(z|x_i) \sim N(\mu_i, \sigma^2_iI) \\ p(z) \sim N(0, I)\]

then it is shown that the KL term can be computed in closed form using only the $\mu_i$ and $\sigma_i$ that the encoder outputs.
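Concretely, with the assumptions above and $J$ the dimensionality of $z$, the closed form is:

\[KL(q_\phi(z|x_i)||p(z)) = \frac{1}{2}\sum_{j=1}^{J}\left(\mu_{i,j}^2 + \sigma_{i,j}^2 - \log\sigma_{i,j}^2 - 1\right)\]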

2. Reconstruction Error term

We estimate the expectation term with Monte Carlo sampling: we draw $L$ samples $z_i^l$ from $q_\phi(z|x_i)$ and take the mean:

\[E_{q_\phi(z|x_i)}[\log(p(x_i|g_\theta(z)))] \approx \frac{1}{L}\sum_{l=1}^{L} \log(p_\theta(x_i|z_i^l))\]

In practice, we just set $L=1$ for simplicity.

If we assume $X|Z$ follows a multivariate Bernoulli distribution, we use cross entropy, as we’ve seen above.

\[\log(p_\theta(x_i|z_i)) = \sum_{j=1}^D (x_{i,j} \log p_{i,j}+(1-x_{i,j})\log(1-p_{i,j}))\]

Likewise, if we assume $X|Z$ is Gaussian, we use MSE.

5. How do we backpropagate in the sampling part of the network?

We use reparameterization trick.

The original sampling process is:

\[z_i^l \sim N(\mu_i, \sigma^2_iI)\]

But since we cannot backpropagate through a random node, we instead use:

\[z_i^l = \mu_i + \sigma_i \odot \epsilon \\ \epsilon \sim N(0, I)\]

(Both sampling processes produce the same distribution.)

Now we sample $\epsilon$, and since this random node has no preceding nodes that gradients need to pass through, we can apply the backpropagation algorithm to the rest of the network.
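A minimal PyTorch sketch of the trick, assuming the encoder outputs $\mu_i$ and $\log\sigma_i^2$ (the function name is illustrative):

```python
import torch

def reparameterize(mu, logvar):
    # the encoder outputs mu and log(sigma^2); sample z = mu + sigma * eps with eps ~ N(0, I)
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)    # the randomness is isolated in eps, which needs no gradient
    return mu + std * eps
```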

To summarize, the architecture of VAE is:
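The slides show this as a diagram; as a stand-in, here is a rough PyTorch sketch of the whole model, reusing `reparameterize` from above (the layer sizes are arbitrary, and the Bernoulli/BCE assumption is used for the reconstruction term):

```python
import torch
from torch import nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, latent_dim)       # encoder outputs mu_i ...
        self.logvar_head = nn.Linear(256, latent_dim)   # ... and log(sigma_i^2)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = reparameterize(mu, logvar)                  # one Monte Carlo sample (L = 1)
        return self.decoder(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # reconstruction error (Bernoulli assumption -> BCE) + closed-form KL regularization
    recon_err = nn.functional.binary_cross_entropy(recon, x, reduction='sum')
    kl = 0.5 * torch.sum(mu ** 2 + logvar.exp() - logvar - 1)
    return recon_err + kl
```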

Learned Manifold (MNIST)

The manifold learned by the VAE is constrained to resemble a normal distribution, since we have the KL regularization term in the VAE objective.

VAE succeeded in learning meaningful features of MNIST.

Conditional VAE

If we have label information, how do we use it to improve our model? We simply concatenate the label information to the inputs of the encoder and the generator. Everything else is the same as in the VAE.

Since we give the label (the digit) to the model, the generator seed ($z$) determines features other than the digit identity.
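A minimal sketch of the conditioning step, with a made-up one-hot label encoding:

```python
import torch

num_classes = 10
x = torch.rand(32, 784)                               # batch of images
y = torch.randint(0, num_classes, (32,))              # their labels
y_onehot = torch.nn.functional.one_hot(y, num_classes).float()

encoder_input = torch.cat([x, y_onehot], dim=1)       # label concatenated to the encoder input
z = torch.randn(32, 20)                               # generator seed
decoder_input = torch.cat([z, y_onehot], dim=1)       # ...and to the generator input
```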

Adversarial AE

In the original VAE, we needed to set $Z|X_i$ and $Z$ to be Gaussian so that the KL divergence term in the ELBO could be computed. In AAE, the KL term is replaced by a GAN-style discriminator, so there are no constraints when choosing the distributions $q$ and $p$.

Architecture of AAE

We set $p(z)$ to any distribution we want. Then we train a discriminator that tries to predict 1 for samples from $p(z)$ and 0 for samples from $q(z)$, and train the encoder, acting as the generator, to make $q(z)$ indistinguishable from $p(z)$. As a result, we can make $q(z)$ follow any distribution we want.

Training AAE
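The slides illustrate the training phases with a diagram; below is a rough PyTorch sketch of the adversarial (regularization) phase, with made-up encoder and discriminator networks and a standard Gaussian chosen as $p(z)$:

```python
import torch
from torch import nn

latent_dim = 20
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, latent_dim))
discriminator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                              nn.Linear(128, 1), nn.Sigmoid())
bce = nn.functional.binary_cross_entropy

x = torch.rand(32, 784)
z_fake = encoder(x)                          # q(z): codes produced by the encoder
z_real = torch.randn(32, latent_dim)         # p(z): here a standard Gaussian, but any prior works

# discriminator phase: predict 1 for p(z) samples and 0 for q(z) samples
d_loss = bce(discriminator(z_real), torch.ones(32, 1)) + \
         bce(discriminator(z_fake.detach()), torch.zeros(32, 1))

# generator phase: the encoder tries to make q(z) indistinguishable from p(z)
g_loss = bce(discriminator(z_fake), torch.ones(32, 1))
```

The reconstruction phase (encoder plus decoder with a reconstruction loss) is the same as in a plain autoencoder.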

Conditional AAE

We can assign label information to regions of $z$ sampled from $p(z)$. We also have the label of each $z$ sampled from $q(z|x_i)$, namely the label of $x_i$. We simply concatenate this label information to the discriminator's input when training the network. Then we can assign labels to different parts of the manifold, as visualized below.

VAE vs GAN

Why are VAE generated images blurry?

VAE uses a pixel-wise MSE for reconstruction (when we assume $p(X|Z)$ is Gaussian). A GAN, on the other hand, compares images at the image level through its discriminator. So a GAN generates images that look real at the image level, while a VAE generates images that minimize the mean pixel difference. Therefore, VAE-generated images are blurrier than GAN-generated images.

VAE + GAN

The two methods below are widely used.

  1. Use VAE learned manifold as condition to GAN generator.
  2. Use GAN discriminator for VAE generated data, to make sure it is real.

StackGAN

3DGAN

SEGAN

Age Progression/Regression by Conditional Adversarial Autoencoder

PaletteNet

Hallucinating Very Low-Resolution Unaligned and Noisy Face Images by Transformative Discriminative Autoencoders

A Generative Model of People in Clothing
