“Image Style Transfer Using Convolutional Neural Networks” Summarized
https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gatys_Image_Style_Transfer_CVPR_2016_paper.pdf (2016)
1. Texture Transfer
The goal of texture transfer is to synthesize a texture from a source image while constraining the texture synthesis in order to preserve the semantic content of a target image.
Previous non-parametric algorithms use only low-level image features of the target image and therefore cannot extract its semantic content.
To separate content from style in natural images in a general way, the authors use deep convolutional neural networks. CNNs trained with sufficient labeled data on specific tasks such as object recognition learn to extract high-level image content features that generalize across tasks. The authors introduce A Neural Algorithm of Artistic Style, a texture transfer algorithm that constrains a texture synthesis method with feature representations from state-of-the-art CNNs.
2. Deep Image Representations
1. Content Representation
- $x$: the generated image, initialized with noise; we optimize it to carry the content of one image and the style of another (vector)
- $N_l$: number of distinct filters / number of response channels
- $M_l$: height x width of the response feature map
- $F^l \in R^{N_l \times M_l}$: response feature map in layer $l$ (matrix)
- $F^l_{ij}$: activation of the $i$th filter at position $j$ in layer $l$
- $p$: original image (vector)
- $P^l$: response feature map in layer $l$ of $p$ (matrix)
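To make these quantities concrete, here is a minimal PyTorch sketch of extracting $F^l$ from a pretrained VGG-19; the reshaping into an $N_l \times M_l$ matrix follows the definitions above, and the helper name `feature_map` is my own:

```python
import torch
from torchvision.models import vgg19

# Pretrained VGG-19 as a frozen feature extractor (the argument name differs
# across torchvision versions; newer ones take weights="IMAGENET1K_V1").
cnn = vgg19(pretrained=True).features.eval()
for param in cnn.parameters():
    param.requires_grad_(False)

def feature_map(image, layer_index):
    """Return F^l as an (N_l, M_l) matrix for a (1, 3, H, W) image tensor."""
    x = image
    for i, module in enumerate(cnn):
        x = module(x)
        if i == layer_index:
            n_l = x.shape[1]           # N_l: number of filters
            return x.reshape(n_l, -1)  # M_l = height * width positions
    raise IndexError("layer_index out of range")
```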
We want to optimize $x$ so that it has the same content as the original image $p$. That is, we minimize the difference between the feature representations of the two images at a certain layer of a pretrained VGG-19. So we define an MSE loss and optimize $x$ by back-propagation until its content is similar to that of $p$.
As a formula, the content loss is
\[L_{content}(p, x, l) = \frac{1}{2}\sum_{i,j}(F^l_{ij}-P^l_{ij})^2\]
When $x$ is optimized against lower layers, it simply reproduces the exact pixel values of the original image.
In contrast, higher layers in the network capture the high-level content without constraining the exact pixel values of the reconstruction very much. We refer to the feature responses in higher layers of the network as the content representation.
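A direct translation of the content loss above, using the hypothetical `feature_map` helper from the previous sketch:

```python
def content_loss(F_l, P_l):
    """L_content(p, x, l) = 1/2 * sum_ij (F^l_ij - P^l_ij)^2."""
    return 0.5 * ((F_l - P_l) ** 2).sum()

# Usage sketch: compare generated image x and original image p at one layer.
# loss = content_loss(feature_map(x, layer), feature_map(p, layer))
```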
2. Style Representation
Gram Matrix
- $G^l \in R^{N_l \times N_l}$: correlations between the different filter responses
To calculate the entry of the Gram matrix for feature maps $i$ and $j$ in layer $l$:
\[G^l_{ij} = \sum_k F^l_{ik}F^l_{jk}\]
We use the Gram matrix as a style representation. Why? Because the sum over all positions $k$ discards where features occur and keeps only how strongly pairs of filters activate together, so the Gram matrix captures texture statistics rather than spatial layout.
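In code, the Gram matrix is just the $N_l \times N_l$ product of the flattened feature map with its own transpose:

```python
def gram_matrix(F_l):
    """G^l = F^l (F^l)^T, i.e. G^l_ij = sum_k F^l_ik F^l_jk."""
    return F_l @ F_l.t()
```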
We want to find an $x$ that minimizes the mean-squared distance between the Gram matrices of the original style image and those of $x$.
The contribution of layer $l$ to the total loss is
\[E_l = \frac{1}{4N_l^2M_l^2}\sum_{i,j}(G^l_{ij}-A^l_{ij})^2\]
- $a$: original style image (vector)
- $A^l$: style representation (Gram matrix) of the original style image
- $G^l$: style representation (Gram matrix) of the generated image
The total style loss is
\[L_{style}(a, x) = \sum_{l=0}^L w_lE_l\]
where $w_l$ weights the contribution of layer $l$. We do gradient descent on this loss with respect to $x$ to update $x$.
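A sketch of the total style loss over several layers, reusing `feature_map` and `gram_matrix` from the earlier snippets; the per-layer feature lists and weights are passed in by the caller:

```python
def style_loss(gen_feats, style_feats, weights):
    """L_style = sum_l w_l * E_l, with E_l as defined above."""
    loss = 0.0
    for F_l, S_l, w_l in zip(gen_feats, style_feats, weights):
        n_l, m_l = F_l.shape
        G = gram_matrix(F_l)  # style representation of the generated image
        A = gram_matrix(S_l)  # style representation of the style image
        loss = loss + w_l * ((G - A) ** 2).sum() / (4 * n_l**2 * m_l**2)
    return loss
```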
3. Style Transfer
We want to generate an image $x$ that has the content of $p$ and the style of $a$. So we jointly minimize the two losses above.
\[L_{total}(p, a, x) = \alpha L_{content}(p, x) + \beta L_{style}(a, x)\]
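Putting everything together, a minimal optimization loop: the paper matches content at conv4_2, style at conv1_1 through conv5_1, and optimizes with L-BFGS, but the torchvision layer indices and hyperparameters below are my own assumptions:

```python
CONTENT_LAYER = 21                 # conv4_2 in torchvision's vgg19.features
STYLE_LAYERS = [0, 5, 10, 19, 28]  # conv1_1, conv2_1, conv3_1, conv4_1, conv5_1

def style_transfer(p, a, alpha=1.0, beta=1e3, steps=100):
    # Targets are computed once and stay fixed during optimization.
    P = feature_map(p, CONTENT_LAYER)
    A_feats = [feature_map(a, l) for l in STYLE_LAYERS]
    w = [1.0 / len(STYLE_LAYERS)] * len(STYLE_LAYERS)

    x = torch.randn_like(p).requires_grad_(True)  # white-noise initialization
    optimizer = torch.optim.LBFGS([x])

    for _ in range(steps):
        def closure():
            optimizer.zero_grad()
            loss = (alpha * content_loss(feature_map(x, CONTENT_LAYER), P)
                    + beta * style_loss([feature_map(x, l) for l in STYLE_LAYERS],
                                        A_feats, w))
            loss.backward()
            return loss
        optimizer.step(closure)
    return x.detach()
```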
3. Visualizing the Effect
1. Trade-off between content and style
2. Effect of different layer of the CNN
3. Initialization of gradient descent
4. Discussions
1. Limitations
- Synthesizing high-resolution images takes a long time.
- Synthesized images sometimes contain low-level noise; when the method is applied to photographs, this affects the photorealism of the result.
2. Contribution
The separation of image content from style is not necessarily a well-defined problem. Nevertheless, the authors succeeded in showing that a neural system automatically learns image representations that allow separating image content from style, at least to some extent.