“Deformable Convolutional Networks” Summarized
https://arxiv.org/abs/1703.06211 (2017-06-05)
1. Geometric Transformations
How do we model geometric transformations in object scale, pose, viewpoint, and part deformation?
Existing methods
- build the training datasets with sufficient desired variations
- use transformation-invariant features and algorithms
Drawbacks
- geometric transformations are assumed fixed and known
- hand-crafted design of invariant features and algorithms could be difficult or infeasible for complex transformations
CNNs are limited in modeling large, unknown transformations, and this limitation comes from the fixed geometric structure of CNN modules. For example, a convolution unit samples the input feature map at fixed locations.
Proposal
Learn offsets for the regular convolutional filter locations.
- Deformable convolution: It adds 2D offsets to the regular grid sampling locations. The offsets are learned from the preceding feature maps via additional convolutional layers.
- Deformable RoI pooling: It adds an offset to each bin position in the regular bin partition of the previous RoI pooling. The offsets are learned from the preceding feature maps and the RoIs via fc layers.
2. Deformable Convolution
Regular Convolution
\[y(p_0) = \sum_{p_n\in R}w(p_n)*x(p_0+p_n)\]
- $p$: location (x, y)
- $y(p_0)$: value at location $p_0$ of output feature map $y$
- $R$: sampling grid; for example, a 3x3 kernel with dilation 1 is defined by $R=\{(-1, -1), (-1, 0), \ldots, (0, 1), (1, 1)\}$
- $w(p_n)$: learnable weight at location $p_n$ of the kernel
- $x(p_0+p_n)$: value at location $p_0+p_n$ of input feature map $x$
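As a sanity check, here is a minimal single-channel NumPy sketch of this formula (function name is mine, not from the paper): each kernel weight $w(p_n)$ multiplies the input at the fixed grid location $p_0+p_n$.

```python
import numpy as np

def regular_conv2d(x, w):
    """Naive single-channel 2D convolution (cross-correlation, as in CNNs).

    x: (H, W) input feature map; w: (kH, kW) kernel.
    Samples x at the fixed grid locations p_0 + p_n; no padding,
    so the output is valid only where the whole kernel fits.
    """
    kH, kW = w.shape
    H, W = x.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):        # (i, j) plays the role of p_0
        for j in range(out.shape[1]):
            for di in range(kH):         # (di, dj) ranges over the grid R
                for dj in range(kW):
                    out[i, j] += w[di, dj] * x[i + di, j + dj]
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))
print(regular_conv2d(x, w))  # 2x2 output of 3x3 window sums
```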
Deformable Convolution
\[y(p_0) = \sum_{p_n\in R}w(p_n)*x(p_0+p_n+\Delta p_n)\]
- $\Delta p_n$: offset
Offsets are typically fractional, so $x(p)$ is implemented via bilinear interpolation.
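A minimal sketch of that bilinear interpolation for a single-channel NumPy feature map (the helper name is mine): $x(p)$ becomes a weighted sum of the four surrounding integer locations, which is what makes the output differentiable with respect to the offsets.

```python
import numpy as np

def bilinear_sample(x, p):
    """Sample feature map x at a fractional location p = (py, px).

    Weighted sum of the 4 surrounding integer locations, weighted by
    the fractional parts. Assumes p lies inside the feature map.
    """
    py, px = p
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1 = min(y0 + 1, x.shape[0] - 1)     # clamp at the border
    x1 = min(x0 + 1, x.shape[1] - 1)
    wy, wx = py - y0, px - x0            # fractional parts
    return ((1 - wy) * (1 - wx) * x[y0, x0] +
            (1 - wy) * wx       * x[y0, x1] +
            wy       * (1 - wx) * x[y1, x0] +
            wy       * wx       * x[y1, x1])

x = np.array([[0., 1.], [2., 3.]])
print(bilinear_sample(x, (0.5, 0.5)))  # midpoint of the four values -> 1.5
```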
How To Obtain Offsets
Offset $\Delta p_n$ is obtained by applying a convolutional layer over the same input feature map. To avoid confusion, I'll call this conv layer the 'offset conv layer', and the original conv layer the 'main conv layer'. Also, let's assume the main conv kernel has size 3x3, so it has $N = |R| = 9$ sampling locations.
The offset conv layer should output $2N$ channels, since we need an (x, y) offset for each of the $N$ sampling locations of the main conv kernel.
The offset conv kernels also have size 3x3, and the output offset field has the same spatial resolution as the input feature map. So each location has its corresponding set of offsets. When we apply the main conv kernel on a certain 3x3 region of the input feature map, we first look up the corresponding $2N$ offset values in the offset field, and then multiply each weight with the input at the original+offset position ($x(p_0+p_n+\Delta p_n)$).
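Putting the pieces together, here is a minimal single-channel sketch of the deformable convolution forward pass. All names are mine, and the offset field is passed in directly; in the real layer it would be predicted by the offset conv layer. With all-zero offsets it reduces to regular convolution.

```python
import numpy as np

def bilinear(x, py, px):
    # Bilinear interpolation of x at fractional (py, px), clamped to bounds.
    py = min(max(py, 0.0), x.shape[0] - 1.0)
    px = min(max(px, 0.0), x.shape[1] - 1.0)
    y0, x0 = int(py), int(px)
    y1, x1 = min(y0 + 1, x.shape[0] - 1), min(x0 + 1, x.shape[1] - 1)
    wy, wx = py - y0, px - x0
    return ((1 - wy) * (1 - wx) * x[y0, x0] + (1 - wy) * wx * x[y0, x1]
            + wy * (1 - wx) * x[y1, x0] + wy * wx * x[y1, x1])

def deformable_conv2d(x, w, offsets):
    """x: (H, W); w: (kH, kW); offsets: (H_out, W_out, N, 2) with N = kH*kW.

    offsets[i, j, n] is the (dy, dx) offset for sampling location n at
    output position (i, j) -- the 2N values the offset conv layer would
    predict at that position.
    """
    kH, kW = w.shape
    out = np.zeros((x.shape[0] - kH + 1, x.shape[1] - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            n = 0
            for di in range(kH):
                for dj in range(kW):
                    dy, dx = offsets[i, j, n]
                    # sample at p_0 + p_n + delta_p_n via bilinear interp
                    out[i, j] += w[di, dj] * bilinear(x, i + di + dy,
                                                      j + dj + dx)
                    n += 1
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0                  # 3x3 averaging kernel
zero = np.zeros((3, 3, 9, 2))              # all-zero offsets
print(deformable_conv2d(x, w, zero))       # equals the regular 3x3 average
```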
3. Deformable RoI Pooling
Concept of RoI Pooling is well explained at https://towardsdatascience.com/understanding-region-of-interest-part-1-roi-pooling-e4f5dd65bb44.
When we do average pooling, the deformable RoI pooling is formulated as follows:
\[y(i, j) = \sum_{p \in bin(i,j)} x(p_0+p+\Delta p_{ij})/n_{ij}\]
- $p_0$: top-left corner of the RoI
- $n_{ij}$: number of pixels in bin $(i, j)$
- $\Delta p_{ij}$: offset for bin $(i, j)$
How To Obtain Offsets
- apply regular RoI pooling -> get pooled feature maps
- apply an fc layer -> get normalized offsets $\Delta \hat{p}_{ij}$
- transform $\Delta \hat{p}_{ij}$ to offsets $\Delta p_{ij}$:
\[\Delta p_{ij} = \gamma \, \Delta \hat{p}_{ij} \circ (w, h)\]
where $\circ$ is the elementwise product and $(w, h)$ is the RoI's width and height. The offset normalization is necessary to make the offset learning invariant to RoI size.
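A minimal NumPy sketch of this de-normalization step (the function name is mine; the paper sets $\gamma = 0.1$):

```python
import numpy as np

def denormalize_offsets(norm_offsets, roi_w, roi_h, gamma=0.1):
    """Scale normalized per-bin offsets by the RoI size.

    norm_offsets: (k, k, 2) array of (dx, dy) predicted by the fc layer.
    Multiplying elementwise by (w, h) makes the learned offsets invariant
    to the RoI size; gamma (0.1 in the paper) keeps their magnitude modest.
    """
    return gamma * norm_offsets * np.array([roi_w, roi_h], dtype=float)

norm = np.array([[[0.5, -1.0]]])    # one bin's normalized (dx, dy)
print(denormalize_offsets(norm, roi_w=40, roi_h=20))  # [[[ 2. -2.]]]
```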
4. Effects
When deformable convolutions are stacked, the receptive field and the sampling locations are adaptively adjusted according to the objects' scale and shape.
The effect of deformable RoI pooling is similar. Parts deviate from the RoI bins and move onto the nearby object foreground regions.
As objects get bigger, the effective dilation values also get larger.