“Deformable Convolutional Networks” Summarized
https://arxiv.org/abs/1703.06211 (2017-06-05)
1. Geometric Transformations
How do we model geometric transformations in object scale, pose, viewpoint, and part deformation?
Existing methods
- build the training datasets with sufficient desired variations
- use transformation-invariant features and algorithms
Drawbacks
- geometric transformations are assumed fixed and known
- hand-crafted design of invariant features and algorithms could be difficult or infeasible for complex transformations
CNNs are limited in modeling large, unknown transformations, and this limitation comes from the fixed geometric structure of CNN modules. For example, a convolution unit samples the input feature map at fixed locations.
Proposal
Learn offsets for the regular convolutional filter locations.
- Deformable convolution: It adds 2D offsets to the regular grid sampling locations. The offsets are learned from the preceding feature maps via additional convolutional layers.
- Deformable RoI pooling: It adds an offset to each bin position in the regular bin partition of the previous RoI pooling. The offsets are learned from the preceding feature maps and the RoIs via fc layers.
2. Deformable Convolution
Regular Convolution
\[y(p_0) = \sum_{p_n\in R}w(p_n)*x(p_0+p_n)\]
- $p$: location (x, y)
- $y(p_0)$: value at location $p_0$ of output feature map $y$
- $R$: sampling grid; for example, a 3x3 kernel with dilation 1 is defined by $R=\{(-1, -1), (-1, 0), \ldots, (0, 1), (1, 1)\}$
- $w(p_n)$: learnable weight at location $p_n$ of the kernel
- $x(p_0+p_n)$: value at location $p_0+p_n$ of input feature map $x$
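As a sanity check, here is a minimal single-channel NumPy sketch of this formula (function name is mine, not from the paper): each kernel weight $w(p_n)$ multiplies the input at the fixed grid location $p_0+p_n$.

```python
import numpy as np

def regular_conv2d(x, w):
    """Naive single-channel 2D convolution (cross-correlation, as in CNNs).

    x: (H, W) input feature map; w: (kH, kW) kernel.
    Samples x at the fixed grid locations p_0 + p_n; no padding,
    so the output is valid only where the whole kernel fits.
    """
    kH, kW = w.shape
    H, W = x.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):        # (i, j) plays the role of p_0
        for j in range(out.shape[1]):
            for di in range(kH):         # (di, dj) ranges over the grid R
                for dj in range(kW):
                    out[i, j] += w[di, dj] * x[i + di, j + dj]
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))
print(regular_conv2d(x, w))  # 2x2 output of 3x3 window sums
```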
Deformable Convolution
\[y(p_0) = \sum_{p_n\in R}w(p_n)*x(p_0+p_n+\Delta p_n)\]
- $\Delta p_n$: offset
Offsets are typically fractional, so $x(p)$ is implemented via bilinear interpolation.
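A minimal sketch of that bilinear interpolation for a single-channel NumPy feature map (the helper name is mine): $x(p)$ becomes a weighted sum of the four surrounding integer locations, which is what makes the output differentiable with respect to the offsets.

```python
import numpy as np

def bilinear_sample(x, p):
    """Sample feature map x at a fractional location p = (py, px).

    Weighted sum of the 4 surrounding integer locations, weighted by
    the fractional parts. Assumes p lies inside the feature map.
    """
    py, px = p
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1 = min(y0 + 1, x.shape[0] - 1)     # clamp at the border
    x1 = min(x0 + 1, x.shape[1] - 1)
    wy, wx = py - y0, px - x0            # fractional parts
    return ((1 - wy) * (1 - wx) * x[y0, x0] +
            (1 - wy) * wx       * x[y0, x1] +
            wy       * (1 - wx) * x[y1, x0] +
            wy       * wx       * x[y1, x1])

x = np.array([[0., 1.], [2., 3.]])
print(bilinear_sample(x, (0.5, 0.5)))  # midpoint of the four values -> 1.5
```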
How To Obtain Offsets
Offset $\Delta p_n$ is obtained by applying a convolutional layer over the same input feature map. To avoid confusion, I'll call this conv layer the 'offset conv layer', and the original conv layer the 'main conv layer'. Also, let's assume the main conv kernel has size 3x3, so it has $N = |R| = 9$ sampling locations.
The offset conv layer should output $2N$ channels, since we need an (x, y) offset for each of the $N$ sampling locations of the main conv kernel.
The offset conv kernels also have size 3x3, and the output offset field has the same spatial resolution as the input feature map. So each location has its corresponding set of offsets. When we apply the main conv kernel on a certain 3x3 region of the input feature map, we first look up the corresponding $2N$ offset values in the offset field, and then multiply each weight with the input at the original+offset position ($x(p_0+p_n+\Delta p_n)$).
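Putting the pieces together, here is a minimal single-channel sketch of the deformable convolution forward pass. All names are mine, and the offset field is passed in directly; in the real layer it would be predicted by the offset conv layer. With all-zero offsets it reduces to regular convolution.

```python
import numpy as np

def bilinear(x, py, px):
    # Bilinear interpolation of x at fractional (py, px), clamped to bounds.
    py = min(max(py, 0.0), x.shape[0] - 1.0)
    px = min(max(px, 0.0), x.shape[1] - 1.0)
    y0, x0 = int(py), int(px)
    y1, x1 = min(y0 + 1, x.shape[0] - 1), min(x0 + 1, x.shape[1] - 1)
    wy, wx = py - y0, px - x0
    return ((1 - wy) * (1 - wx) * x[y0, x0] + (1 - wy) * wx * x[y0, x1]
            + wy * (1 - wx) * x[y1, x0] + wy * wx * x[y1, x1])

def deformable_conv2d(x, w, offsets):
    """x: (H, W); w: (kH, kW); offsets: (H_out, W_out, N, 2) with N = kH*kW.

    offsets[i, j, n] is the (dy, dx) offset for sampling location n at
    output position (i, j) -- the 2N values the offset conv layer would
    predict at that position.
    """
    kH, kW = w.shape
    out = np.zeros((x.shape[0] - kH + 1, x.shape[1] - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            n = 0
            for di in range(kH):
                for dj in range(kW):
                    dy, dx = offsets[i, j, n]
                    # sample at p_0 + p_n + delta_p_n via bilinear interp
                    out[i, j] += w[di, dj] * bilinear(x, i + di + dy,
                                                      j + dj + dx)
                    n += 1
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0                  # 3x3 averaging kernel
zero = np.zeros((3, 3, 9, 2))              # all-zero offsets
print(deformable_conv2d(x, w, zero))       # equals the regular 3x3 average
```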
3. Deformable RoI Pooling
Concept of RoI Pooling is well explained at https://towardsdatascience.com/understanding-region-of-interest-part-1-roi-pooling-e4f5dd65bb44.
When we do average pooling, the deformable RoI pooling is formulated as follows:
\[y(i, j) = \sum_{p \in bin(i,j)} x(p_0+p+\Delta p_{ij})/n_{ij}\]
- $p_0$: top-left corner of the RoI
- $n_{ij}$: number of pixels in bin $(i, j)$
- $\Delta p_{ij}$: offset for bin $(i, j)$
How To Obtain Offsets
- apply regular RoI pooling -> get pooled feature maps
- apply an fc layer -> get normalized offsets $\Delta \hat{p}_{ij}$
- transform $\Delta \hat{p}_{ij}$ to offsets $\Delta p_{ij}$:
\[\Delta p_{ij} = \gamma \, \Delta \hat{p}_{ij} \circ (w, h)\]
where $\circ$ is the elementwise product and $(w, h)$ is the RoI's width and height. The offset normalization is necessary to make the offset learning invariant to RoI size.
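A minimal NumPy sketch of this de-normalization step (the function name is mine; the paper sets $\gamma = 0.1$):

```python
import numpy as np

def denormalize_offsets(norm_offsets, roi_w, roi_h, gamma=0.1):
    """Scale normalized per-bin offsets by the RoI size.

    norm_offsets: (k, k, 2) array of (dx, dy) predicted by the fc layer.
    Multiplying elementwise by (w, h) makes the learned offsets invariant
    to the RoI size; gamma (0.1 in the paper) keeps their magnitude modest.
    """
    return gamma * norm_offsets * np.array([roi_w, roi_h], dtype=float)

norm = np.array([[[0.5, -1.0]]])    # one bin's normalized (dx, dy)
print(denormalize_offsets(norm, roi_w=40, roi_h=20))  # [[[ 2. -2.]]]
```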
4. Effects
When deformable convolutions are stacked, the receptive field and the sampling locations are adaptively adjusted according to the objects' scale and shape.
The effect of deformable RoI pooling is similar. Parts deviate from the RoI bins and move onto the nearby object foreground regions.
As objects get bigger, the effective dilation values also get larger.