“Accurate Image Super-Resolution Using Very Deep Convolutional Networks” Summarized

https://arxiv.org/abs/1511.04587 (2016-11-11)

1. Introduction

Dong et al. demonstrated with SRCNN that a CNN can learn a mapping from low-resolution (LR) to high-resolution (HR) images in an end-to-end manner, but SRCNN has limitations in three aspects. The authors propose VDSR, which resolves these problems.

  1. Context

    SRCNN uses small patches to train the CNN. For a large scale factor, the information contained in a small patch is not sufficient for detail recovery. VDSR uses a large receptive field and takes a large image context into account.

  2. Convergence

    SRCNN training converges too slowly. To achieve fast convergence, VDSR uses two techniques. First, VDSR explicitly models the residual image, i.e. the difference between the HR and LR images. Second, it uses extremely high learning rates, which are made possible by residual learning and gradient clipping.

  3. Scale

    SRCNN only works for a single scale factor. The authors found that a single convolutional network is sufficient for multi-scale-factor super-resolution.

2. Network

  • $d$ layers
  • first layer: 64 filters of size 3x3x1, applied to the (single-channel) input image
  • last layer: 1 filter of size 3x3x64, reconstructing the residual image
  • other layers: 64 filters of size 3x3x64

  • Pad zeros before convolutions to keep the sizes of all feature maps the same.

    The center-surround relation of convolution is useful for recovering fine details, but for pixels near the image boundary this relation cannot be exploited well. Many SR methods crop the result image; instead, it turned out that zero-padding before each convolution works well.

  • Input: low-resolution image interpolated to the desired size

  • Target: the residual image, i.e. the original HR image minus the interpolated input

  • Finally, the predicted residual is added to the input image to yield the final HR image (a code sketch of the whole network follows this list).
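
A minimal PyTorch sketch of the network described above (my own reconstruction, not the authors' code; `depth`, `channels`, and `features` are parameters I introduced to mirror the bullet list, with `channels=1` matching a luminance-only model):

```python
import torch
import torch.nn as nn

class VDSR(nn.Module):
    """d-layer VDSR sketch: 3x3 convolutions with zero-padding of 1 so every
    feature map keeps the input's spatial size, plus a residual connection
    from the interpolated LR input to the output."""

    def __init__(self, depth: int = 20, channels: int = 1, features: int = 64):
        super().__init__()
        layers = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):  # intermediate layers: 64 filters of size 3x3x64
            layers += [nn.Conv2d(features, features, 3, padding=1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(features, channels, 3, padding=1))  # reconstruction layer
        self.body = nn.Sequential(*layers)

    def forward(self, ilr: torch.Tensor) -> torch.Tensor:
        residual = self.body(ilr)  # predicted residual (HR minus interpolated LR)
        return ilr + residual      # add the input back to get the HR estimate
```

At inference time the model is simply applied to the bicubically upscaled LR image, e.g. `sr = VDSR()(ilr)`.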

3. Training

  • Loss: mean squared error between the predicted residual and the ground-truth residual

  • Residual learning

    If the target were the HR image itself, the model would need to carry much of the input information through to the last layer. With many weight layers, the vanishing/exploding gradients problem can become critical. Residual learning avoids this problem.

  • High Learning Rates

    When training deep neural networks, small learning rates make training too slow, so the authors use a large initial learning rate.

  • Adjustable Gradient Clipping

    To avoid exploding gradients, gradient clipping is applied. The authors clip the gradients to $[-\frac{\theta}{\gamma}, \frac{\theta}{\gamma}]$, where $\gamma$ denotes the current learning rate, so that the magnitude of an update (gradient times learning rate) never exceeds $\theta$. This technique makes convergence extremely fast (see the training sketch after this list).

  • Data preparation

    Similar to SRCNN, with the following differences:

    • The input patch size is equal to the size of the receptive field
    • Images are divided into sub-images with no overlap
    • A mini-batch consists of 64 sub-images, and sub-images from different scale factors can appear in the same batch
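
Putting the loss, residual learning, and adjustable gradient clipping together, a single training step might look like the following sketch (the optimizer choice and the concrete values of `theta` and the learning rate are placeholders, not numbers taken from the paper):

```python
import torch
import torch.nn.functional as F

model = VDSR()                    # network sketch from Section 2
theta = 0.01                      # clipping constant (placeholder value)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def train_step(ilr_batch: torch.Tensor, hr_batch: torch.Tensor) -> float:
    """One optimization step with residual learning and adjustable clipping."""
    optimizer.zero_grad()
    pred_hr = model(ilr_batch)    # = ilr + predicted residual
    # MSE on the reconstructed image equals MSE on the residual, because
    # pred_hr - hr == predicted_residual - (hr - ilr).
    loss = F.mse_loss(pred_hr, hr_batch)
    loss.backward()
    # Adjustable gradient clipping: each gradient entry is restricted to
    # [-theta/lr, theta/lr], so the update magnitude (gradient * lr) never
    # exceeds theta even with a very high learning rate.
    lr = optimizer.param_groups[0]["lr"]
    torch.nn.utils.clip_grad_value_(model.parameters(), theta / lr)
    optimizer.step()
    return loss.item()
```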

4. Understanding Properties

1. The Deeper, the Better

A deeper network has a larger receptive field, which means that the network can use more context to predict image details.

Also, very deep networks can exploit many non-linearities, allowing them to model more complex mappings.
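
To make the receptive-field argument concrete: with 3x3 filters in every layer, each additional layer extends the receptive field by one pixel on every side, so a network of depth $D$ uses

$$(2D + 1) \times (2D + 1)$$

pixels of context for each output pixel. The 20-layer network used in the paper therefore sees a 41x41 region, compared to 13x13 for SRCNN.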

2. Residual-Learning

First, the residual network converges much faster.

Second, at convergence, the residual network shows superior performance.

3. High Learning Rates

When a small initial learning rate is used, the network never reaches the performance level that a high learning rate achieves.

(Figure: residual-learning and learning-rate experiment results.)

4. Single Model for Multiple Scales

According to the authors’ tests, a network trained on single-scale data is not capable of handling other scale factors.

However, when trained on a multi-scale dataset, its PSNR for each scale is comparable to that of the corresponding single-scale network. Moreover, for large scale factors, the multi-scale network outperforms the single-scale networks.
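
A sketch of how such mixed-scale training pairs could be assembled (my own illustration, assuming bicubic degradation via `torch.nn.functional.interpolate`; the function names are not from the authors' pipeline):

```python
import random
import torch
import torch.nn.functional as F

def make_pair(hr_patch: torch.Tensor, scale: int):
    """Build an (interpolated-LR, HR) training pair for one scale factor.

    hr_patch: tensor of shape (1, C, H, W).
    """
    h, w = hr_patch.shape[-2:]
    lr = F.interpolate(hr_patch, size=(h // scale, w // scale),
                       mode="bicubic", align_corners=False)
    ilr = F.interpolate(lr, size=(h, w), mode="bicubic", align_corners=False)
    return ilr, hr_patch

def make_multiscale_batch(hr_patches, scales=(2, 3, 4)):
    """Mix sub-images degraded with different scale factors in one mini-batch.

    All patches are assumed to share the same spatial size.
    """
    pairs = [make_pair(p, random.choice(scales)) for p in hr_patches]
    ilr_batch = torch.cat([ilr for ilr, _ in pairs])
    hr_batch = torch.cat([hr for _, hr in pairs])
    return ilr_batch, hr_batch
```

The same single network is then trained on these mixed batches and applied to whichever scale factor is needed at test time.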

5. Comparisons With Other Methods
