“Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization” Summarized

https://arxiv.org/abs/1610.02391 (2016-10-7)

1. Accuracy vs Interpretability

There typically exists a trade-off between accuracy and interpretability (simplicity).

  • Classical rule-based systems: interpretable but not accurate
  • Deep models: accurate but not interpretable

The authors make deep models interpretable with Grad-CAM, without requiring architectural changes or retraining.

2. Grad-CAM Formulation

We first compute $\alpha^c_k$ which

  • represents partial linearization of the deep network downstream from $A$

  • captures the ‘importance’ of feature map $k$ for a target class $c$

\[\alpha^c_k = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A^k_{i,j}}\]
  • $c$: class index
  • $k$: channel index
  • $i$: height index, $j$: width index
  • $y^c$: score of the $c$th class (before softmax). In general, it can be any differentiable output for any task.
  • $A$: activation from the last convolution
  • $A^k_{i,j}$: number at $(i,j)$ position of $k$th channel of $A$
  • $Z$: width × height of $A^k$ (the number of spatial positions)
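
In code, this step is just a global average of the gradients over the spatial dimensions. A minimal NumPy sketch (all shapes and names here are made up for illustration):

```python
import numpy as np

K, H, W = 4, 7, 7                      # made-up feature-map shape: channels, height, width
dyc_dA = np.random.randn(K, H, W)      # stands in for the gradients dy^c / dA^k_{i,j}

Z = H * W                              # number of spatial positions
alpha = dyc_dA.sum(axis=(1, 2)) / Z    # alpha^c_k for each channel k, shape (K,)
# Equivalently: alpha = dyc_dA.mean(axis=(1, 2))
```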

Then we compute the Grad-CAM map $L^c_{Grad-CAM}$:

\[L^c_{Grad-CAM} = ReLU(\sum_k \alpha^c_k A^k)\]

We apply a ReLU to the linear combination of maps because we are only interested in the features that have a positive influence on the class of interest.
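
Putting both equations together, here is a minimal PyTorch sketch. The model (`resnet18`), the target layer (`layer4`), and the random input are assumptions for illustration, not the paper's exact setup:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()    # any CNN works; resnet18 is an assumption

feats = {}
def save_activation(module, inputs, output):
    feats["A"] = output                  # A: (1, K, H, W) maps of the target layer

model.layer4.register_forward_hook(save_activation)

x = torch.randn(1, 3, 224, 224)          # dummy input image
scores = model(x)                        # (1, 1000) pre-softmax class scores
c = scores.argmax().item()               # target class (here: the top prediction)
y_c = scores[0, c]                       # scalar score y^c

grads = torch.autograd.grad(y_c, feats["A"])[0]   # dy^c / dA, shape (1, K, H, W)
alpha = grads.mean(dim=(2, 3), keepdim=True)      # alpha^c_k, shape (1, K, 1, 1)
cam = F.relu((alpha * feats["A"]).sum(dim=1))     # L^c_{Grad-CAM}, shape (1, H, W)
cam = cam / cam.max().clamp(min=1e-8)             # normalize to [0, 1] for display
```

The coarse `cam` map is typically upsampled to the input resolution for visualization (see Guided Grad-CAM below).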

3. Grad-CAM generalizes CAM

The previous work, CAM, is formulated as follows:

\[Y^c = \sum_k w^c_k \frac{1}{Z}\sum_i\sum_jA^k_{i,j}\]

CAM can only be applied to a specific kind of architecture, one where global-average-pooled convolutional feature maps are fed directly into the softmax layer. The paper shows that under this condition Grad-CAM is equivalent to CAM (see the derivation below), so Grad-CAM is a strict generalization of CAM.
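
The equivalence follows in one line from the two definitions above: in the CAM architecture, the gradient of $Y^c$ with respect to each activation is the constant $w^c_k / Z$, so

\[\alpha^c_k = \frac{1}{Z}\sum_i\sum_j \frac{\partial Y^c}{\partial A^k_{i,j}} = \frac{1}{Z}\sum_i\sum_j \frac{w^c_k}{Z} = \frac{w^c_k}{Z}\]

That is, the Grad-CAM weights recover the CAM weights $w^c_k$ up to the constant factor $1/Z$, which does not change the normalized map.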

4. Guided Grad-CAM

A good visual explanation from the model for justifying any target category should be ‘class discriminative’ and ‘high-resolution’.

  • Guided Backpropagation, Deconvolution: high-resolution but not class-discriminative
  • Grad-CAM, CAM: class-discriminative but not high-resolution

The authors upsampled Grad-CAM to the input resolution and multiplied it pixel-wise with Guided Backpropagation to create Guided Grad-CAM visualizations that are both high-resolution and class-discriminative, as sketched below.
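
A minimal sketch of the fusion step, assuming the low-resolution `cam` from the Grad-CAM sketch above and a Guided Backpropagation saliency map `guided_bp` (its computation is omitted and stubbed out here):

```python
import torch
import torch.nn.functional as F

cam = torch.rand(1, 7, 7)                # stand-in for the coarse Grad-CAM map
guided_bp = torch.rand(1, 3, 224, 224)   # stand-in for a Guided Backprop map

# Upsample Grad-CAM to the input resolution, then fuse by pixel-wise product.
cam_up = F.interpolate(cam.unsqueeze(1), size=guided_bp.shape[-2:],
                       mode="bilinear", align_corners=False)   # (1, 1, 224, 224)
guided_grad_cam = guided_bp * cam_up     # broadcasts over the RGB channels
```

The product keeps only the high-resolution detail that falls inside the class-discriminative Grad-CAM region.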

5. Counterfactual Explanations

A counterfactual explanation highlights the regions that would make the network change its prediction; removing the concepts occurring in those regions would make the model more confident in its prediction.

It is computed the same way as Grad-CAM but with negated gradients, as shown below.

\[\alpha^c_k = \frac{1}{Z}\sum_i\sum_j - \frac{\partial y^c}{\partial A^k_{i,j}}\]
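
In code, the only change to the Grad-CAM sketch above is a sign flip before pooling (stand-in tensors are used here to keep the snippet self-contained):

```python
import torch
import torch.nn.functional as F

A = torch.rand(1, 512, 7, 7)        # stand-in for the last-conv feature maps
grads = torch.randn(1, 512, 7, 7)   # stand-in for dy^c / dA

alpha_neg = -grads.mean(dim=(2, 3), keepdim=True)     # negated gradients
counterfactual = F.relu((alpha_neg * A).sum(dim=1))   # evidence against class c
```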

6. Localization Ability

1. Weakly-supervised Localization

2. Weakly-supervised Segmentation

7. Pros of Grad-CAM

1. Class Discrimination

When shown Guided Grad-CAM visualizations, human subjects identify the category being visualized more accurately than with Guided Backpropagation.

2. Trust

Human subjects can identify the more accurate of two classifiers from Guided Grad-CAM visualizations alone, even when both models make identical predictions.

3. Faithful to the model

Patches whose occlusion changes the CNN score are also the patches to which Grad-CAM assigns high intensity (sketched below).
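
A sketch of that occlusion test, assuming a classifier `model`, an input `x` of shape (1, 3, H, W), and a target class index `c`; the patch size and stride are made-up values:

```python
import torch

@torch.no_grad()
def occlusion_score_drops(model, x, c, patch=45, stride=45):
    base = model(x)[0, c].item()                 # unoccluded class score
    drops = []
    for top in range(0, x.shape[-2] - patch + 1, stride):
        for left in range(0, x.shape[-1] - patch + 1, stride):
            occ = x.clone()
            occ[..., top:top + patch, left:left + patch] = 0.0  # blank out a patch
            drops.append(base - model(occ)[0, c].item())        # drop in y^c
    return torch.tensor(drops)                   # one score drop per patch location
```

Faithfulness then means these score drops rank-correlate with the mean Grad-CAM intensity inside the corresponding patches.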

4. Analyzing failure modes

Applying Grad-CAM to misclassified images shows that seemingly unreasonable predictions often have reasonable explanations.

5. Robust to adversarial noise

6. Identifies bias in dataset

8. Grad-CAM with Words

1. Image Captioning

2. VQA (Visual Question Answering)
