Abstract
Normalization methods are essential for training the parameters of convolutional neural networks (CNNs), which typically contain many convolution layers. Although the layers of a CNN are not homogeneous in the roles they play in representing the prediction function, existing works often employ the same normalizer in every layer, leaving performance short of its potential. To tackle this problem and further boost performance, the recently proposed switchable normalization (SN) offers a new perspective for deep learning: it learns to select different normalizers for different convolution layers of a ConvNet. However, SN uses the softmax function to learn importance ratios that combine normalizers, which not only leads to redundant computations compared with a single normalizer but also makes the model less interpretable. This work addresses this issue by presenting sparse switchable normalization (SSN), in which the importance ratios are constrained to be sparse. Unlike \(\ell _1\) and \(\ell _0\) regularizations, which make it difficult to tune layer-wise regularization coefficients, we turn this sparse-constrained optimization problem into feed-forward computation by proposing SparsestMax, a sparse version of softmax. SSN has several appealing properties. (1) It inherits all benefits of SN, such as applicability to various tasks and robustness to a wide range of batch sizes. (2) It is guaranteed to select only one normalizer for each normalization layer, avoiding redundant computations and improving the interpretability of normalizer selection. (3) SSN can be transferred to various tasks in an end-to-end manner. Extensive experiments show that SSN outperforms its counterparts on challenging benchmarks such as ImageNet, COCO, Cityscapes, ADE20K, Kinetics and MegaFace. Models and code are available at https://github.com/switchablenorms/Sparse_SwitchNorm.
Notes
The softmax function is defined by \(p_k=\mathrm {softmax}_k(\mathbf {z})=\exp (z_k)/\sum _{j=1}^{|\varOmega |}\exp (z_j)\).
Unless otherwise stated, all \((1),(2),\ldots ,(K)\) in this paper represent subscripts in the descending order of all elements in \(\mathbf {z}\).
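To make the two notes above concrete, here is a minimal NumPy sketch (an illustrative assumption, not code from the released repository) contrasting the softmax of the first note with a sparsemax-style Euclidean projection onto the probability simplex. The projection is computed from the descending-order subscripts \(z_{(1)} \ge z_{(2)} \ge \cdots \ge z_{(K)}\) introduced in the second note, and it is the kind of sparse alternative to softmax on which SparsestMax, as described in the abstract, is built.

```python
import numpy as np

def softmax(z):
    """Dense softmax: p_k = exp(z_k) / sum_j exp(z_j)."""
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def simplex_projection(z):
    """Sparsemax-style Euclidean projection of z onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]                  # z_(1) >= z_(2) >= ... >= z_(K)
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    in_support = 1 + k * z_sorted > cumsum       # which sorted entries remain positive
    k_z = k[in_support][-1]                      # size of the support
    tau = (cumsum[k_z - 1] - 1.0) / k_z          # soft threshold
    return np.maximum(z - tau, 0.0)              # zeros out small entries, sums to 1
```

For example, with importance scores z = [3.0, 1.0, 0.0], softmax returns a dense distribution (approximately [0.84, 0.11, 0.04]), while the projection returns exactly [1.0, 0.0, 0.0], which for this input is the one-hot selection behaviour referred to in the abstract.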
Acknowledgements
We thank Xinjiang Wang and Tianjian Meng for their helpful discussions. Ping Luo is partially supported by the HKU Seed Funding for Basic Research and SenseTime’s Donation for Basic Research. This work was supported in part by SenseTime Group Limited and in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118 and CUHK14207319.
Additional information
Communicated by Li Liu, Matti Pietikäinen, Jie Qin, Jie Chen, Wanli Ouyang, Luc Van Gool.
Appendix A: Derivations of Back-Propagated Gradients of SyncSSN
To compute the gradients wrt. \(\mathbf {z}\) and \(\mathbf {z}^\prime \), we first derive the Jacobian of \(\mathbf {p}\) wrt. \(\mathbf {z}\) for the three cases in Theorem 1. Let \(\mathcal {I}_S\) be an indicator vector whose \(i\)-th entry is 1 if \(i \in S(\mathbf {z})\) and 0 otherwise; then Eq. (17) can be rewritten in the following vector form:
In addition, the Jacobian of \(\mathbf {p}_1\) wrt. \(\mathbf {z}\) can be derived as:
Moreover, we have \(\frac{\partial \mathbf {p}_3}{\partial \mathbf {z}}=0\) almost everywhere by Eq. (16). Altogether, the Jacobian of \(\mathbf {p}\) wrt. \(\mathbf {z}\) takes the form:
Therefore, the gradient of the loss \(\mathcal {L}\) wrt. \(\mathbf {z}\) can be obtained as:
In the same way, we have
where the Jacobian \(\partial \mathbf {p}^\prime /\partial \mathbf {z}^\prime \) is computed the same as \(\partial \mathbf {p}/\partial \mathbf {z}\).
The remaining terms that need to be back-propagated are the gradients wrt. the input, which can be derived as follows:
and,
where \(\delta \) denotes the Kronecker delta, i.e., \(\delta _{t\ddot{t}} = 1\) if \(t=\ddot{t}\) and \(\delta _{t\ddot{t}}=0\) if \(t \ne \ddot{t}\).
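The case-wise Jacobian and the Eqs. (16)–(17) referenced above are given in the main text and are not reproduced here. As a reference point only, the backward pass of the standard sparsemax operator, whose Jacobian has the well-known form \(\mathrm {Diag}(\mathcal {I}_S)-\mathcal {I}_S\mathcal {I}_S^{\top }/\Vert \mathcal {I}_S\Vert _1\) built from an indicator vector like the \(\mathcal {I}_S\) defined above, can be sketched as follows; this is a hedged illustration of a Jacobian–vector product, not the exact SyncSSN backward pass.

```python
import numpy as np

def sparse_projection_backward(p, grad_out):
    """Jacobian-vector product for a sparsemax-style projection.

    Implements J^T v = I_S * (v - mean_{j in S} v_j), where I_S is the
    indicator vector of the support S = {i : p_i > 0} and p is the
    output of the forward projection."""
    support = (p > 0).astype(grad_out.dtype)              # the indicator vector I_S
    v_mean = (grad_out * support).sum() / support.sum()   # average of grad_out over the support
    return support * (grad_out - v_mean)
```

Because this product only involves the support indicator and a mean over the support, the backward computation is as cheap as the forward projection, and the gradient is zero outside the support, consistent with the sparsity of \(\mathbf {p}\).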
Cite this article
Shao, W., Li, J., Ren, J. et al. SSN: Learning Sparse Switchable Normalization via SparsestMax. Int J Comput Vis 128, 2107–2125 (2020). https://doi.org/10.1007/s11263-019-01269-y