“Generative Adversarial Nets” Summarized
https://arxiv.org/abs/1406.2661 (2014-6-10)
1. Generative Model vs. Discriminative Model
Generative model G: tries to simulate the data distribution
Discriminative model D: estimates the probability that a sample came from the real data rather than G
The generative model and the discriminative model try to beat each other and improve in the process, so ideally the trained generative model perfectly replicates the true data distribution and the discriminative model outputs 0.5 everywhere, unable to tell real samples from generated ones.
2. Training Objective
\[\min_G \max_D V(D,G) = E_{DATA}[\log D(data)] + E_Z[\log(1-D(G(z)))]\]
- $z$: noise (vector), ~$p_{Z}(z)$
- $data$: real data point (vector) ~$p_{DATA}(data)$
- $x$: data point (vector)
- $G(z; \theta_g)$: differentiable function represented by a multilayer perceptron with parameters $\theta_g$; it maps noise to data space
- $D(x; \theta_d)$: function that returns probability that the input $x$ came from the real data
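To make the objective above concrete, here is a minimal Monte Carlo sketch that estimates $V(D, G)$ from one minibatch. `D` and `G` are assumed to be already-defined callables (e.g. small networks) returning probabilities and generated samples respectively; these names and the `eps` smoothing term are illustrative choices, not code from the paper.

```python
import numpy as np

def estimate_V(D, G, real_batch, noise_batch, eps=1e-12):
    """Monte Carlo estimate of V(D, G) from one minibatch:
    E_DATA[log D(data)] + E_Z[log(1 - D(G(z)))], with each expectation
    replaced by a sample mean."""
    real_term = np.mean(np.log(D(real_batch) + eps))
    fake_term = np.mean(np.log(1.0 - D(G(noise_batch)) + eps))
    return real_term + fake_term
```

The discriminator wants this value to be large (it classifies real vs. generated samples well), while the generator wants it to be small.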
First, we want to find the $D$ that maximizes $V(D, G)$ for any given generator $G$. Since
\[\int_z p_Z(z)\log(1-D(G(z)))\,dz = \int_g p_G(g)\log(1-D(g))\,dg\]
\[V(D,G) = \int_x \left[p_{DATA}(x)\log(D(x)) + p_G(x)\log(1-D(x))\right]dx\]
where the support of $x$ is $Supp(p_{DATA}) \cup Supp(p_G)$.
Maximizing this value is equivalent to maximizing the integrand at every $x$. So we get
\[D^*(x) = \frac{p_{DATA}(x)}{p_{DATA}(x) + p_G(x)}\]
Now we can define
\[C(G) = \max_D V(G,D) = E_{DATA}[\log D^*(x)] + E_G[\log(1-D^*(x))]\]
and find the $G$ that minimizes this value.
The equation can be rewritten as follows
\[C(G) = -\log(4) + KL\left(p_{DATA}\,\Big\|\,\frac{p_{DATA}+p_G}{2}\right) + KL\left(p_G\,\Big\|\,\frac{p_{DATA}+p_G}{2}\right) = -\log(4) + 2\cdot JSD(p_{DATA}\|p_G)\]
Since the Jensen-Shannon divergence between two distributions is always non-negative, and zero only when they are equal, $C^* = -\log(4)$ is the global minimum of $C(G)$ and it is attained only when $p_G = p_{DATA}$. So our training objective, when optimized perfectly, succeeds in making the generative model perfectly replicate the true data-generating process.
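As a quick numerical sanity check of $D^*$ and of this fixed point, the sketch below evaluates the optimal discriminator for two 1-D Gaussians standing in for $p_{DATA}$ and $p_G$ (the distributions and grid are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

# Two toy 1-D densities standing in for p_DATA and p_G (illustrative only).
xs = np.linspace(-5, 5, 21)
p_data = norm.pdf(xs, loc=0.0, scale=1.0)
p_g = norm.pdf(xs, loc=1.0, scale=1.0)

# Optimal discriminator D*(x) = p_DATA(x) / (p_DATA(x) + p_G(x)).
d_star = p_data / (p_data + p_g)

# Where the two densities are equal (x = 0.5 for these Gaussians), D* is exactly 0.5;
# if p_G matched p_DATA everywhere, D* would be 0.5 everywhere.
idx = int(np.argmin(np.abs(xs - 0.5)))
print(round(float(d_star[idx]), 3))  # -> 0.5
```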
3. Algorithm
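The paper's Algorithm 1 is minibatch SGD that alternates $k$ gradient steps on $D$ (ascending $\frac{1}{m}\sum[\log D(x) + \log(1-D(G(z)))]$) with one gradient step on $G$ (descending $\frac{1}{m}\sum\log(1-D(G(z)))$). Below is a minimal, self-contained sketch of that loop on toy 1-D data; the PyTorch framing, network sizes, learning rates, and the $N(4, 1)$ "real" distribution are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
noise_dim, data_dim, m, k = 8, 1, 64, 1   # m: minibatch size, k: D steps per G step

# Small MLPs for G (noise -> data space) and D (data -> probability of "real").
G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_d = torch.optim.SGD(D.parameters(), lr=0.05)
opt_g = torch.optim.SGD(G.parameters(), lr=0.05)
eps = 1e-8  # keeps the logs finite

for step in range(5000):
    for _ in range(k):
        # Update D by ascending E[log D(data)] + E[log(1 - D(G(z)))]
        x = 4.0 + torch.randn(m, data_dim)   # minibatch from the toy "real" distribution
        z = torch.randn(m, noise_dim)        # minibatch from the noise prior p_Z
        d_loss = -(torch.log(D(x) + eps).mean()
                   + torch.log(1.0 - D(G(z).detach()) + eps).mean())
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

    # Update G by descending E[log(1 - D(G(z)))]
    z = torch.randn(m, noise_dim)
    g_loss = torch.log(1.0 - D(G(z)) + eps).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# Mean of generated samples; it should move toward 4.0 as training progresses.
print(G(torch.randn(1000, noise_dim)).mean().item())
```

The paper also notes that minimizing $\log(1-D(G(z)))$ can saturate early in training, and suggests training $G$ to maximize $\log D(G(z))$ instead, which has the same fixed point but gives stronger gradients early on.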
4. Experiment
The authors estimated the probability of the test-set data under $p_G$ by fitting a Gaussian Parzen window to the samples generated with $G$ and reporting the log-likelihood under this distribution. So the higher the number, the better $G$ simulates real-world samples.
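A Gaussian Parzen-window estimate of this kind can be sketched as below; the function name and bandwidth handling are my own (in the paper, $\sigma$ is chosen by cross-validation on a validation set), so treat it as an illustration of the metric rather than the authors' evaluation code.

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(test_points, generated_samples, sigma):
    """Mean log-likelihood of test points under a Gaussian Parzen window
    (kernel density estimate) fit to samples drawn from G."""
    test = np.asarray(test_points, dtype=float)         # shape (n, d)
    gen = np.asarray(generated_samples, dtype=float)    # shape (m, d)
    d = test.shape[1]
    diffs = test[:, None, :] - gen[None, :, :]           # (n, m, d)
    sq_dist = np.sum(diffs ** 2, axis=-1)                # (n, m)
    log_kernel = -sq_dist / (2.0 * sigma ** 2)
    # log of m * (2*pi*sigma^2)^(d/2), the normalizer of the averaged Gaussians
    log_norm = np.log(gen.shape[0]) + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return float(np.mean(logsumexp(log_kernel, axis=1) - log_norm))
```

The paper itself cautions that this estimator has high variance and does not perform well in high-dimensional spaces, but it was the best method available at the time.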
5. Advantages & Disadvantages
1. Disadvantages
- There is no explicit representation of $p_G(x)$
- D must be kept well synchronized with G during training; otherwise ‘the Helvetica scenario’ happens, in which G collapses too many values of $z$ to the same value of $x$
2. Advantages
- Markov chains are never needed
- Only backprop is used to obtain gradients
- No inference is needed during learning
- Wide variety of functions can be incorporated into the model
- It can represent very sharp distributions, while methods based on Markov chains require that the distribution be somewhat blurry