Abstract
Normalization methods are essential for training the parameters of convolutional neural networks (CNNs), which typically contain many convolution layers. Although the layers of a CNN are not homogeneous in the roles they play in representing the prediction function, existing works often employ the same normalizer in every layer, leaving performance short of its potential. To tackle this problem and further boost performance, the recently proposed switchable normalization (SN) offers a new perspective for deep learning: it learns to select different normalizers for different convolution layers of a ConvNet. However, SN uses the softmax function to learn importance ratios that combine normalizers, which not only leads to redundant computations compared with a single normalizer but also makes the model less interpretable. This work addresses this issue by presenting sparse switchable normalization (SSN), in which the importance ratios are constrained to be sparse. Unlike \(\ell _1\) and \(\ell _0\) regularizations, which make it difficult to tune layer-wise regularization coefficients, we turn this sparse-constrained optimization problem into feed-forward computation by proposing SparsestMax, a sparse version of softmax. SSN has several appealing properties. (1) It inherits all benefits of SN, such as applicability to various tasks and robustness to a wide range of batch sizes. (2) It is guaranteed to select only one normalizer for each normalization layer, avoiding redundant computations and improving the interpretability of normalizer selection. (3) SSN can be transferred to various tasks in an end-to-end manner. Extensive experiments show that SSN outperforms its counterparts on challenging benchmarks such as ImageNet, COCO, Cityscapes, ADE20K, Kinetics and MegaFace. Models and code are available at https://github.com/switchablenorms/Sparse_SwitchNorm.
Notes
The softmax function is defined by \(p_k=\mathrm {softmax}_k(\mathbf {z})=\exp (z_k)/\sum _{j=1}^{|\varOmega |}\exp (z_j)\).
Unless otherwise stated, all \((1),(2),\ldots ,(K)\) in this paper represent subscripts in the descending order of all elements in \(\mathbf {z}\).
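To make the two notes above concrete, here is a minimal NumPy sketch (an illustrative assumption, not code from the released repository) contrasting the softmax of the first note with a sparsemax-style Euclidean projection onto the probability simplex. The projection is computed from the descending-order subscripts \(z_{(1)} \ge z_{(2)} \ge \cdots \ge z_{(K)}\) introduced in the second note, and it is the kind of sparse alternative to softmax on which SparsestMax, as described in the abstract, is built.

```python
import numpy as np

def softmax(z):
    """Dense softmax: p_k = exp(z_k) / sum_j exp(z_j)."""
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def simplex_projection(z):
    """Sparsemax-style Euclidean projection of z onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]                  # z_(1) >= z_(2) >= ... >= z_(K)
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    in_support = 1 + k * z_sorted > cumsum       # which sorted entries remain positive
    k_z = k[in_support][-1]                      # size of the support
    tau = (cumsum[k_z - 1] - 1.0) / k_z          # soft threshold
    return np.maximum(z - tau, 0.0)              # zeros out small entries, sums to 1
```

For example, with importance scores z = [3.0, 1.0, 0.0], softmax returns a dense distribution (approximately [0.84, 0.11, 0.04]), while the projection returns exactly [1.0, 0.0, 0.0], which for this input is the one-hot selection behaviour referred to in the abstract.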
Acknowledgements
We thank Xinjiang Wang and Tianjian Meng for their helpful discussions. Ping Luo is partially supported by the HKU Seed Funding for Basic Research and SenseTime’s Donation for Basic Research. This work was supported in part by SenseTime Group Limited and in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118 and CUHK14207319.
Additional information
Communicated by Li Liu, Matti Pietikäinen, Jie Qin, Jie Chen, Wanli Ouyang, Luc Van Gool.
Appendix A: Derivations of Back-Propagated Gradients of SyncSSN
To compute the gradients wrt. \(\mathbf {z}\) and \(\mathbf {z}^\prime \), we first derive the Jacobian of \(\mathbf {p}\) wrt. \(\mathbf {z}\) for the three cases in Theorem 1. Let \(\mathcal {I}_S\) be an indicator vector whose \(i\)-th entry is 1 if \(i \in S(\mathbf {z})\) and 0 otherwise; then Eq. (17) can be rewritten in the following vector form:
In addition, the Jacobian of \(\mathbf {p}_1\) wrt. \(\mathbf {z}\) can be derived as:
Moreover, we have \(\frac{\partial \mathbf {p}_3}{\partial \mathbf {z}}=0\) almost everywhere by Eq. (16). Altogether, the Jacobian of \(\mathbf {p}\) wrt. \(\mathbf {z}\) takes the form:
Therefore, the gradient of the loss \(\mathcal {L}\) wrt. \(\mathbf {z}\) can be obtained as:
In the same way, we have
where the Jacobian \(\partial \mathbf {p}^\prime /\partial \mathbf {z}^\prime \) is computed the same as \(\partial \mathbf {p}/\partial \mathbf {z}\).
The remaining terms that need to be back-propagated are the gradients wrt. the input, which can be derived as follows:
and,
where \(\delta \) denotes the Kronecker delta, i.e., \(\delta _{t\ddot{t}} = 1\) if \(t=\ddot{t}\) and \(\delta _{t\ddot{t}}=0\) if \(t \ne \ddot{t}\).
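The case-wise Jacobian and the Eqs. (16)–(17) referenced above are given in the main text and are not reproduced here. As a reference point only, the backward pass of the standard sparsemax operator, whose Jacobian has the well-known form \(\mathrm {Diag}(\mathcal {I}_S)-\mathcal {I}_S\mathcal {I}_S^{\top }/\Vert \mathcal {I}_S\Vert _1\) built from an indicator vector like the \(\mathcal {I}_S\) defined above, can be sketched as follows; this is a hedged illustration of a Jacobian–vector product, not the exact SyncSSN backward pass.

```python
import numpy as np

def sparse_projection_backward(p, grad_out):
    """Jacobian-vector product for a sparsemax-style projection.

    Implements J^T v = I_S * (v - mean_{j in S} v_j), where I_S is the
    indicator vector of the support S = {i : p_i > 0} and p is the
    output of the forward projection."""
    support = (p > 0).astype(grad_out.dtype)              # the indicator vector I_S
    v_mean = (grad_out * support).sum() / support.sum()   # average of grad_out over the support
    return support * (grad_out - v_mean)
```

Because this product only involves the support indicator and a mean over the support, the backward computation is as cheap as the forward projection, and the gradient is zero outside the support, consistent with the sparsity of \(\mathbf {p}\).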
Cite this article
Shao, W., Li, J., Ren, J. et al. SSN: Learning Sparse Switchable Normalization via SparsestMax. Int J Comput Vis 128, 2107–2125 (2020). https://doi.org/10.1007/s11263-019-01269-y