How does it work?

Encoder & Latent (Hidden) Feature

For every input image $x$, we try to guess "which latent feature $z$ will generate this image $x$?" We make this guess by using an encoder to encode the input image $x$ into a latent feature $z$.

However, since we are just guessing the latent feature, we don't know the exact $z$. Hence, we express $z$ as a probability distribution $q_\phi(z \mid x)$.

In summary, the encoder in a variational autoencoder outputs a mean $\mu$ and a variance $\sigma^2$, which describe the distribution of $z$ (normally a Gaussian distribution).
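As a minimal sketch, the encoder can be pictured as a network that maps an image to the two vectors $\mu$ and $\log\sigma^2$. The linear layer and weight names below are hypothetical stand-ins; a real encoder would be a deep (often convolutional) network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoder": a single linear layer mapping a flattened image x
# to the parameters of a diagonal Gaussian over the latent z.
# W_mu and W_logvar are hypothetical weights for illustration only.
x_dim, z_dim = 784, 16
W_mu = rng.normal(scale=0.01, size=(z_dim, x_dim))
W_logvar = rng.normal(scale=0.01, size=(z_dim, x_dim))

def encode(x):
    """Return the mean and log-variance of q(z|x)."""
    mu = W_mu @ x
    logvar = W_logvar @ x  # predict log sigma^2 so the variance stays positive
    return mu, logvar

x = rng.random(x_dim)          # a fake flattened 28x28 image
mu, logvar = encode(x)
print(mu.shape, logvar.shape)  # (16,) (16,)
```

Predicting $\log\sigma^2$ instead of $\sigma^2$ is a common trick: it lets the network output any real number while keeping the variance positive.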

Decoder

Step 1: Sample

The decoder takes $z$ as input and outputs $\hat{x}$. In the last section, we mentioned that $z$ is described by a probability distribution, so in order to pass it into the decoder, we sample a specific $z$ from that distribution.

Step 2: Decode

Now, we pass our sampled $z$ into the network. The output of the decoder is also a distribution, which means the decoder also outputs a mean and a variance.

Step 3: Get generated image

To get the final output image, we have two choices

  1. sample $\hat{x}$ from the output distribution
  2. use the mean of the output distribution

$\theta$ denotes the learnable parameters of the network.

Hence, $p_\theta(x)$ means "the probability of $x$, given learnable parameters $\theta$".


How do we train the model?

Basic Idea

If we input $x$, we want to maximize the probability that the model's output distribution generates exactly the input $x$, i.e. we want to maximize $p_\theta(x)$.

Idea 1: Integration

Mathematical Expression

We want to find the $\theta$ that maximizes $p_\theta(x)$, the probability of generating $x$ given learnable parameters $\theta$.

Since $z$ is also a distribution, we need to marginalize over it:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$

Problem

However, we can't integrate over $z$, since $z$ has an infinite number of possible values, so this integral is intractable.

Idea 2: Bayes’ Formula

Mathematical Expression

By Bayes' formula,

$$p_\theta(x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(z \mid x)}$$

Problem: Unable to Compute

$p_\theta(z \mid x)$ means "given input image $x$, what is the probability of latent feature $z$?"

This is what we want the encoder to learn!!! The encoder is trying to learn the best probability distribution for $z$, so we can't know the exact $p_\theta(z \mid x)$ for sure.

Solution: Approximate $p_\theta(z \mid x)$ by $q_\phi(z \mid x)$

We use a new network (the encoder, with its own parameters $\phi$) to approximate $p_\theta(z \mid x)$ with $q_\phi(z \mid x)$.

The prior $p(z)$ is fixed. It is our belief about what the distribution of $z$ should look like before we look at the input data $x$. Normally, we choose a Gaussian distribution with mean 0 and variance 1.


Mathematical Detail: Finding a Lower Bound for $\log p_\theta(x)$

What is $q_\phi(z \mid x)$?

It is the distribution produced by the encoder. We train the encoder and the decoder together in the training process.

Steps

Step 1: Change Bayes' Formula Representation

Taking the logarithm of Bayes' formula:

$$\log p_\theta(x) = \log p_\theta(x \mid z) + \log p(z) - \log p_\theta(z \mid x)$$

Step 2: Wrap with an Expectation

Since $\log p_\theta(x)$ doesn't depend on $z$, we can wrap it with an expectation over $q_\phi(z \mid x)$:

$$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x)\right]$$

Thus, the Bayes' formula can be reorganized into

$$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) + D_{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)$$

Step 3: Observe Lower Bound

Since $D_{KL} \ge 0$, dropping the last term gives a lower bound on $\log p_\theta(x)$:

$$\log p_\theta(x) \ge \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

This gives us the lower bound of $\log p_\theta(x)$, known as the evidence lower bound (ELBO).


Training Process

Encoder

The corresponding term for the encoder in the lower bound formula is:

$$- D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

This term tells us that, in order to make the lower bound higher, we want the distributions $q_\phi(z \mid x)$ and $p(z)$ to be close, which makes the KL divergence nearly zero.

The KL divergence goes to zero when the two distributions are about the same, and grows large when they are different.

$p(z)$ is predefined, normally $\mathcal{N}(0, I)$.
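For a diagonal Gaussian $q_\phi(z \mid x) = \mathcal{N}(\mu, \sigma^2 I)$ and the prior $\mathcal{N}(0, I)$, this KL term has a closed form, which is what VAE implementations typically compute. A small numpy sketch:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, sigma^2 I) || N(0, I) ), summed over dims:
    KL = -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

# When q already equals the prior, the penalty is zero...
print(kl_to_standard_normal(np.zeros(16), np.zeros(16)))  # 0.0
# ...and it grows as q drifts away from N(0, I).
print(kl_to_standard_normal(np.ones(16), np.zeros(16)))   # 8.0
```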

Decoder

The corresponding term for the decoder in the lower bound formula is:

$$\mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]$$

This term means:

  1. Sample $z$ from $q_\phi(z \mid x)$
  2. Compute the weighted average of $\log p_\theta(x \mid z)$ across all possible $z$ values, where the weights are given by $q_\phi(z \mid x)$
  3. This measures: "For the given distribution, how likely is it to reconstruct the input image $x$?" We want to maximize this expectation
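The steps above can be sketched as a single training loss. In practice the expectation is approximated with one sample of $z$, and for a Gaussian decoder with fixed variance, $-\log p_\theta(x \mid z)$ reduces to a squared reconstruction error up to constants. The linear decoder and its weights below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim = 784, 16

# Hypothetical toy decoder: one linear layer from z back to image space.
W_dec = rng.normal(scale=0.01, size=(x_dim, z_dim))

def decode(z):
    return W_dec @ z  # mean of p(x|z); variance assumed fixed

def negative_elbo(x, mu, logvar):
    """One-sample Monte Carlo estimate of -ELBO:
    reconstruction term plus KL term."""
    sigma = np.exp(0.5 * logvar)
    z = mu + sigma * rng.standard_normal(z_dim)      # reparameterized sample
    x_hat = decode(z)
    recon = np.sum((x - x_hat) ** 2)                 # -log p(x|z) up to constants
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
    return recon + kl

x = rng.random(x_dim)
loss = negative_elbo(x, mu=np.zeros(z_dim), logvar=np.zeros(z_dim))
print(loss > 0)  # True
```

Minimizing this loss is equivalent to maximizing the lower bound: the reconstruction term trains the decoder, while the KL term keeps the encoder's distribution close to the prior.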


Generating Data

  1. Sample $z$ from the prior $p(z)$
  2. Pass $z$ into the decoder
  3. We should get an output $\hat{x}$ which resembles the training input images
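These three steps are only a few lines of code once a decoder is trained. The decoder weights below are random placeholders, so the "image" is noise; the structure of the generation loop is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim = 784, 16

# Hypothetical trained decoder weights (random here, for illustration only).
W_dec = rng.normal(scale=0.01, size=(x_dim, z_dim))

# 1. Sample z from the prior p(z) = N(0, I)
z = rng.standard_normal(z_dim)
# 2. Pass z through the decoder
x_hat = W_dec @ z            # mean of p(x|z), used directly as the output
# 3. Reshape back to image dimensions
image = x_hat.reshape(28, 28)
print(image.shape)  # (28, 28)
```

Note that generation uses only the decoder; the encoder is needed during training but not for sampling new data.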


Pros & Cons

Pros

  • The mathematical foundation of the model makes VAEs interpretable and theoretically well understood
  • The encoder learns meaningful latent features, which can be used for downstream tasks

Cons

  • VAEs optimize a lower bound, not the exact likelihood
  • Samples are blurrier and lower quality compared to GANs