Abstract
We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable. Our approach, Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network or a sequence of words in a captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model families: (1) CNNs with fully-connected layers (e.g. VGG), (2) CNNs used for structured outputs (e.g. captioning), and (3) CNNs used in tasks with multi-modal inputs (e.g. visual question answering) or reinforcement learning, all without architectural changes or re-training. We combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are robust to adversarial perturbations, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias. For image captioning and VQA, our visualizations show that even non-attention-based models learn to localize discriminative regions of the input image. We devise a way to identify important neurons through Grad-CAM and combine it with neuron names (Bau et al. 2017) to provide textual explanations for model decisions. Finally, we design and conduct human studies to measure whether Grad-CAM explanations help users establish appropriate trust in predictions from deep networks, and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions. Our code is available at https://github.com/ramprs/grad-cam/, along with a demo on CloudCV (Agrawal et al. 2015) (http://gradcam.cloudcv.org) and a video at http://youtu.be/COjUB9Izk6E.
Notes
Empirically we found global-average-pooling to work better than global-max-pooling as can be found in the “Appendix”.
We find that Grad-CAM maps become progressively worse as we move to earlier convolutional layers as they have smaller receptive fields and only focus on less semantic local features.
We use GoogLeNet finetuned on COCO, as provided by Zhang et al. (2016).
c-MWP (Zhang et al. 2016) highlights arbitrary regions for predicted but non-existent categories, whereas Grad-CAM maps typically do not.
The green and red boxes are drawn manually to highlight correct and incorrect focus of the model.
Area of overlap between ground truth concept annotation and neuron activation over area of their union. More details of this metric can be found in Bau et al. (2017).
References
Agrawal, A., Batra, D., & Parikh, D. (2016). Analyzing the behavior of visual question answering models. In EMNLP.
Agrawal, H., Mathialagan, C. S., Goyal, Y., Chavali, N., Banik, P., Mohapatra, A., Osman, A., & Batra, D. (2015). CloudCV: Large scale distributed computer vision as a cloud service. In Mobile cloud visual media computing (pp. 265–290). Springer.
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In ICCV.
Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network dissection: Quantifying interpretability of deep visual representations. In Computer vision and pattern recognition.
Bazzani, L., Bergamo, A., Anguelov, D., & Torresani, L. (2016). Self-taught object localization with deep networks. In WACV.
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828.
Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015). Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
Cinbis, R. G., Verbeek, J., & Schmid, C. (2016). Weakly supervised object localization with multi-fold multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Das, A., Agrawal, H., Zitnick, C. L., Parikh, D., & Batra, D. (2016). Human attention in visual question answering: Do humans and deep networks look at the same regions? In EMNLP.
Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., & Batra, D. (2018). Embodied question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M., Parikh, D., & Batra, D. (2017a). Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Das, A., Kottur, S., Moura, J. M., Lee, S., & Batra, D. (2017b). Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the IEEE international conference on computer vision (ICCV).
de Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., & Courville, A. C. (2017). Guesswhat?! visual object discovery through multi-modal dialogue. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.
Dosovitskiy, A., & Brox, T. (2015). Inverting convolutional networks with convolutional networks. In CVPR.
Erhan, D., Bengio, Y., Courville, A., & Vincent, P. (2009). Visualizing higher-layer features of a deep network. Technical Report 1341, University of Montreal.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2009). The PASCAL visual object classes challenge 2007 (VOC2007) results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J. C., et al. (2015). From captions to visual concepts and back. In CVPR.
Gan, C., Wang, N., Yang, Y., Yeung, D.-Y., & Hauptmann, A. G. (2015). Devnet: A deep event network for multimedia event detection and evidence recounting. In CVPR.
Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., & Xu, W. (2015). Are you talking to a machine? Dataset and methods for multilingual image question answering. In NIPS.
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. In ICLR.
Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., & Farhadi, A. (2017). IQA: Visual question answering in interactive environments. arXiv preprint arXiv:1712.03316.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.
Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in object detectors. In ECCV.
Jackson, P. (1998). Introduction to expert systems (3rd ed.). Boston, MA: Addison-Wesley Longman Publishing Co., Inc.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM MM.
Johns, E., Mac Aodha, O., & Brostow, G. J. (2015). Becoming the expert—interactive multi-class machine teaching. In CVPR.
Johnson, J., Karpathy, A., & Fei-Fei, L. (2016). DenseCap: Fully convolutional localization networks for dense captioning. In CVPR.
Karpathy, A. (2014). What I learned from competing against a ConvNet on ImageNet. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/.
Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In CVPR.
Kolesnikov, A., & Lampert, C. H. (2016). Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.
Lin, M., Chen, Q., & Yan, S. (2014a). Network in network. In ICLR.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014b). Microsoft COCO: Common objects in context. In ECCV.
Lipton, Z. C. (2016). The mythos of model interpretability. arXiv preprint arXiv:1606.03490v3.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR.
Lu, J., Lin, X., Batra, D., & Parikh, D. (2015). Deeper LSTM and normalized CNN visual question answering model. https://github.com/VT-vision-lab/VQA_LSTM_CNN.
Lu, J., Yang, J., Batra, D., & Parikh, D. (2016). Hierarchical question-image co-attention for visual question answering. In NIPS.
Mahendran, A., & Vedaldi, A. (2016a). Salient deconvolutional networks. In European conference on computer vision.
Mahendran, A., & Vedaldi, A. (2016b). Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 1–23.
Malinowski, M., Rohrbach, M., & Fritz, M. (2015). Ask your neurons: A neural-based approach to answering questions about images. In ICCV.
Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2014). Learning and transferring mid-level image representations using convolutional neural networks. In CVPR.
Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2015). Is object localization for free?—weakly-supervised learning with convolutional neural networks. In CVPR.
Pinheiro, P. O., & Collobert, R. (2015). From image-level to pixel-level labeling with convolutional networks. In CVPR.
Ren, M., Kiros, R., & Zemel, R. (2015). Exploring models and data for image question answering. In NIPS.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any classifier. In SIGKDD.
Selvaraju, R. R., Chattopadhyay, P., Elhoseiny, M., Sharma, T., Batra, D., Parikh, D., & Lee, S. (2018). Choose your neuron: Incorporating domain knowledge through neuron-importance. In Proceedings of the European conference on computer vision (ECCV) (pp. 526–541).
Selvaraju, R. R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., & Batra, D. (2016). Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization. CoRR. arXiv:1610.02391
Selvaraju, R. R., Lee, S., Shen, Y., Jin, H., Ghosh, S., Heck, L., Batra, D., & Parikh, D. (2019). Taking a hint: Leveraging explanations to make vision and language models more grounded. In Proceedings of the international conference on computer vision (ICCV).
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484–489.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR.
Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR. arXiv:1312.6034
Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. A. (2014). Striving for simplicity: The all convolutional net. CoRR. arXiv:1412.6806
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818–2826).
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In CVPR.
Vondrick, C., Khosla, A., Malisiewicz, T., & Torralba, A. (2013). HOGgles: Visualizing object detection features. In ICCV.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In ECCV.
Zhang, J., Lin, Z., Brandt, J., Shen, X., & Sclaroff, S. (2016). Top-down neural attention by excitation backprop. In ECCV.
Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., & Torralba, A. (2014). Object detectors emerge in deep scene CNNs. CoRR. arXiv:1412.6856
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In CVPR.
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Acknowledgements
This work was funded in part by NSF CAREER awards to DB and DP, DARPA XAI Grant to DB and DP, ONR YIP awards to DP and DB, ONR Grant N00014-14-1-0679 to DB, a Sloan Fellowship to DP, ARO YIP awards to DB and DP, an Allen Distinguished Investigator award to DP from the Paul G. Allen Family Foundation, ICTAS Junior Faculty awards to DB and DP, Google Faculty Research Awards to DP and DB, Amazon Academic Research Awards to DP and DB, AWS in Education Research Grant to DB, and NVIDIA GPU donations to DB. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor. Funding was provided by Virginia Polytechnic Institute and State University.
Communicated by Antonio Torralba.
Appendix
A Appendix Overview
In the Appendix, we provide:

I. Ablation studies evaluating our design choices

II. More qualitative examples for image classification, captioning and VQA

III. More details of the Pointing Game evaluation technique

IV. Qualitative comparison to existing visualization techniques

V. More qualitative examples of textual explanations
B Ablation Studies
We perform several ablation studies to explore and validate our design choices for computing Grad-CAM visualizations: visualizing different layers in the network, understanding the importance of the ReLU in (3), analyzing different types of gradients (for the ReLU backward pass), and comparing gradient pooling strategies.
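For reference, every ablation below operates on the Grad-CAM map defined in the main text: the neuron-importance weights are obtained by global-average-pooling the gradients of the class score \(y^c\) with respect to the feature maps \(A^k\), \(\alpha_k^c = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A^k_{ij}}\), and the localization map is \(L^c_{\mathrm{Grad\text{-}CAM}} = \mathrm{ReLU}\big(\sum_k \alpha_k^c A^k\big)\), where \(Z\) is the number of spatial locations in the feature map.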
1. Grad-CAM for Different Layers
We show Grad-CAM visualizations for the “tiger-cat” class at different convolutional layers in AlexNet and VGG-16. As expected, the results in Fig. 13 show that localization becomes progressively worse as we move to earlier convolutional layers. This is because later convolutional layers capture high-level semantic information better than earlier layers while still retaining spatial information, whereas earlier layers have smaller receptive fields and focus only on local features (Fig. 14).
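For concreteness, the following is a minimal PyTorch sketch of this per-layer experiment; it is a hedged re-implementation rather than the authors' released code, and the torchvision VGG-16 wrapper and the layer indices are assumptions. It captures the feature maps at a chosen layer, global-average-pools their gradients, and forms the ReLU-weighted sum:

```python
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam_at_layer(model, layer_idx, image, class_idx):
    """Grad-CAM for `class_idx` using the activations at model.features[layer_idx].

    `layer_idx` should point at a ReLU so the captured maps are the rectified
    activations of the preceding convolution (an assumption of this sketch).
    """
    feats = model.features[:layer_idx + 1](image)          # A^k at the chosen layer
    feats.retain_grad()
    rest = model.features[layer_idx + 1:](feats)
    scores = model.classifier(torch.flatten(model.avgpool(rest), 1))
    model.zero_grad()
    scores[0, class_idx].backward()                         # d y^c / d A^k

    weights = feats.grad.mean(dim=(2, 3), keepdim=True)     # alpha_k^c: GAP of gradients
    cam = F.relu((weights * feats).sum(dim=1))              # ReLU(sum_k alpha_k^c A^k)
    cam = F.interpolate(cam[None], size=image.shape[-2:],
                        mode='bilinear', align_corners=False)[0, 0]
    return (cam / (cam.max() + 1e-8)).detach()

# Compare a late vs. an early VGG-16 layer (indices are assumptions):
# vgg = models.vgg16(weights='IMAGENET1K_V1').eval()
# cam_late  = grad_cam_at_layer(vgg, 29, img, cls)   # rectified conv5_3
# cam_early = grad_cam_at_layer(vgg, 3,  img, cls)   # rectified conv1_2
```

The only quantity that changes across the rows of Fig. 13 is the layer whose activations and gradients are used; everything else in the computation is identical.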
2. Design Choices
We evaluate different design choices via top-1 localization errors on the ILSVRC-15 val set (Deng et al. 2009). See Table 3.
2.1 Importance of ReLU in (3)
Removing the ReLU in (3) increases localization error by 15.3%. Negative values in Grad-CAM indicate confusion between the multiple classes occurring in the image.
2.2 Global Average Pooling Versus Global Max Pooling
Instead of global-average-pooling (GAP) the gradients flowing into the convolutional layer, we tried global-max-pooling (GMP) them. We observe that GMP lowers the localization ability of Grad-CAM; an example can be found in Fig. 15. This may be because the max is statistically less robust to noise than the averaged gradient.
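The two variants differ only in how the gradient tensor is reduced over its spatial dimensions before weighting the activation maps. A minimal sketch, assuming `acts` and `grads` of shape (1, K, H, W) have already been captured as in the classification sketch above:

```python
import torch
import torch.nn.functional as F

def cam_from_grads(acts, grads, pooling='avg'):
    """acts, grads: (1, K, H, W) feature maps A^k and gradients dy^c/dA^k."""
    if pooling == 'avg':                                    # Grad-CAM default (GAP)
        weights = grads.mean(dim=(2, 3), keepdim=True)
    else:                                                   # GMP variant from this ablation
        weights = grads.amax(dim=(2, 3), keepdim=True)
    return F.relu((weights * acts).sum(dim=1))              # (1, H, W) localization map
```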
2.3 Effect of Different ReLU on Grad-CAM
We experiment with Guided-ReLU (Springenberg et al. 2014) and Deconv-ReLU (Zeiler and Fergus 2014) as modifications to the backward pass of ReLU.
Guided-ReLU Springenberg et al. (2014) introduced Guided Backprop, where the backward pass of ReLU is modified to pass only positive gradients, and only through regions with positive activations. Applying this change to the computation of Grad-CAM reduces its class-discriminative ability (Fig. 16), but marginally improves localization performance (Table 3).
Deconv-ReLU In Deconvolution (Zeiler and Fergus 2014), the backward pass of ReLU is modified to pass only positive gradients. Applying this modification to the computation of Grad-CAM leads to worse results (Fig. 16), indicating that negative gradients also carry important information for class discriminativeness.
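For concreteness, the three ReLU backward rules compared in this ablation can be written as custom PyTorch autograd functions; this is a hedged re-implementation of the standard Guided Backprop and Deconvolution rules, not the original code:

```python
import torch

class GuidedReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        # pass positive gradients only through positions with positive activations
        return grad_out * (x > 0) * (grad_out > 0)

class DeconvReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        # pass positive gradients regardless of the forward activations
        return grad_out * (grad_out > 0)

# The standard ReLU backward, for comparison, is grad_out * (x > 0):
# it keeps both positive and negative gradients at positive activations,
# which is the information the Deconv-ReLU result above suggests matters.
```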
C Qualitative Results for Vision and Language Tasks
In this section we provide more qualitative results for Grad-CAM and Guided Grad-CAM applied to the tasks of image classification, image captioning and VQA.
1. Image Classification
We use Grad-CAM and Guided Grad-CAM to visualize the regions of the image that provide support for a particular prediction. The results reported in Fig. 17 correspond to the VGG-16 (Simonyan and Zisserman 2015) network trained on ImageNet.
Figure 17 shows randomly sampled examples from the COCO (Lin et al. 2014b) validation set. COCO images typically contain multiple objects per image, and the Grad-CAM visualizations show precise localization to support the model’s prediction.
Guided Grad-CAM can even localize tiny objects. For example, our approach correctly localizes the predicted class “torch” (Fig. 17a) despite its size and odd location in the image. Our method is also class-discriminative: it places attention only on the “toilet seat” even though a popular ImageNet category, “dog”, is also present in the image (Fig. 17e).
We also visualized Grad-CAM, Guided Backpropagation (GB), Deconvolution (DC), GB + Grad-CAM (Guided Grad-CAM), and DC + Grad-CAM (Deconvolution Grad-CAM) for images from the ILSVRC13 detection val set that contain at least two unique object categories each. The visualizations for the classes listed below can be found at the following links.
“computer keyboard, keypad” class: http://i.imgur.com/QMhsRzf.jpg
“sunglasses, dark glasses, shades” class: http://i.imgur.com/a1C7DGh.jpg
2. Image Captioning
We use the publicly available Neuraltalk2 code and model for our image captioning experiments. The model uses VGG-16 to encode the image. The image representation is passed as input at the first time step to an LSTM that generates a caption for the image. The model is trained end-to-end, along with CNN finetuning, on the COCO (Lin et al. 2014b) Captioning dataset. We feed the image forward through the captioning model to obtain a caption. We then use Grad-CAM to get a coarse localization and combine it with Guided Backpropagation to get a high-resolution visualization that highlights regions in the image that provide support for the generated caption (Fig. 18).
3. Visual Question Answering (VQA)
We use Grad-CAM and Guided Grad-CAM to explain why a publicly available VQA model (Lu et al. 2015) answered what it answered.
The VQA model by Lu et al. uses a standard CNN followed by a fully connected layer to transform the image to a 1024-dim representation that matches the LSTM embedding of the question. The transformed image and LSTM embeddings are then pointwise multiplied to obtain a combined representation of the image and question, and a multi-layer perceptron is trained on top to predict one of 1000 answers. We show visualizations for the VQA model trained with three different CNNs: AlexNet (Krizhevsky et al. 2012), VGG-16 and VGG-19 (Simonyan and Zisserman 2015). Even though the CNNs were not finetuned for the task of VQA, it is interesting to see how our approach can serve as a tool to understand these networks better by providing a localized high-resolution visualization of the regions the model is looking at. Note that these networks were trained with no explicit attention mechanism enforced.
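A minimal sketch of this architecture is given below; the 1024-dimensional fusion and the 1000-way answer classifier follow the description above, while the embedding size, MLP depth, and all names are illustrative assumptions rather than the exact Lu et al. (2015) implementation:

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """CNN image feature and LSTM question embedding, fused by pointwise product."""
    def __init__(self, img_feat_dim=4096, vocab_size=10000, emb_dim=300,
                 hidden_dim=1024, num_answers=1000):
        super().__init__()
        self.img_proj = nn.Sequential(nn.Linear(img_feat_dim, hidden_dim), nn.Tanh())
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                                        nn.Linear(hidden_dim, num_answers))

    def forward(self, img_feat, question_tokens):
        v = self.img_proj(img_feat)                    # (B, 1024) image embedding
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                      # (B, 1024) question embedding
        return self.classifier(v * q)                  # pointwise fusion, then MLP

# logits = SimpleVQA()(torch.randn(2, 4096), torch.randint(0, 10000, (2, 12)))
```

To visualize an answer, the logit of the predicted answer is backpropagated to the last convolutional layer of the (frozen) CNN, exactly as a class score is in image classification.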
Notice in the first row of Fig. 19 that, for the question “Is the person riding the waves?”, the VQA models with AlexNet and VGG-16 answered “No”, as they concentrated mainly on the person and not on the waves. On the other hand, VGG-19 correctly answered “Yes”, and it looked at the regions around the man in order to answer the question. In the second row, for the question “What is the person hitting?”, the VQA model trained with AlexNet answered “Tennis ball” based purely on context, without looking at the ball. Such a model might be risky when employed in real-life scenarios. It is difficult to determine the trustworthiness of a model based only on the predicted answer. Our visualizations provide an accurate way to explain the model’s predictions and help in determining which model to trust, without making any architectural changes or sacrificing accuracy. Notice in the last row of Fig. 19 that, for the question “Is this a whole orange?”, the model looks at regions around the orange to answer “No”.
D More Details of Pointing Game
In Zhang et al. (2016), the pointing game was set up to evaluate how discriminative different attention maps are at localizing ground-truth categories. In a sense, this evaluates the precision of a visualization, i.e., how often the attention map intersects the segmentation of the ground-truth category. It does not evaluate how often the visualization technique produces maps that do not correspond to the category of interest.
Hence we propose a modification to the pointing game to evaluate visualizations of the top-5 predicted categories. In this setting, the visualization technique is additionally allowed to reject any of the top-5 predictions from the CNN classifier. For each of the two visualizations, Grad-CAM and c-MWP, we choose a threshold on the maximum value of the map that is used to decide whether the visualized category is present in the image.
We compute the maps for the top-5 categories and, based on the maximum value in each map, classify whether the map corresponds to a ground-truth label or to a category absent from the image. As mentioned in Sect. 4.2 of the main paper, we find that Grad-CAM outperforms c-MWP by a significant margin (70.58% vs. 60.30% on VGG-16).
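A sketch of this modified protocol for a single map and category is given below; the hit test with a 15-pixel tolerance follows the original pointing game, while the function signature and the way the tolerance window is formed are illustrative assumptions:

```python
import numpy as np

def pointing_with_rejection(cam, gt_mask, present, threshold, tol=15):
    """Return True if the visualization is judged correct for this (map, category) pair.

    cam:       2-D localization map for the category being visualized
    gt_mask:   boolean ground-truth segmentation (all False if the category is absent)
    present:   whether the category actually appears in the image
    threshold: per-technique cutoff on the map's maximum value
    tol:       pixel tolerance around the maximum location
    """
    if cam.max() < threshold:                 # the technique rejects this category
        return not present                    # correct iff the category is truly absent
    if not present:
        return False                          # fired on a category not in the image
    y, x = np.unravel_index(cam.argmax(), cam.shape)
    window = gt_mask[max(0, y - tol):y + tol + 1, max(0, x - tol):x + tol + 1]
    return bool(window.any())                 # hit if the max point lands near the GT region
```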
E Qualitative Comparison to Excitation Backprop (c-MWP) and CAM
In this section we provide more qualitative results comparing Grad-CAM with CAM (Zhou et al. 2016) and c-MWP (Zhang et al. 2016) on PASCAL VOC (Everingham et al. 2009).
We compare Grad-CAM, CAM and c-MWP visualizations from an ImageNet-trained VGG-16 model finetuned on the PASCAL VOC 2012 dataset. While Grad-CAM and c-MWP visualizations can be obtained directly from existing models, CAM requires an architectural change and re-training, which leads to a loss in accuracy. Also, unlike Grad-CAM, c-MWP and CAM can only be applied to image classification networks. Visualizations for the ground-truth categories can be found in Fig. 20. Further qualitative examples comparing Grad-CAM with existing approaches can be found in Selvaraju et al. (2016).
F Visual and Textual Explanations for Places Dataset
Figure 21 shows more examples of visual and textual explanations (Sect. 7) for the image classification model (VGG-16) trained on the Places365 dataset (Zhou et al. 2017).