SSN: Learning Sparse Switchable Normalization via SparsestMax

International Journal of Computer Vision

Abstract

Normalization methods deal with the training of the parameters of convolutional neural networks (CNNs), which often contain many convolution layers. Although the layers of a CNN are not homogeneous in the roles they play in representing the prediction function, existing works often employ the same normalizer in every layer, leaving performance short of ideal. To tackle this problem and further boost performance, the recently proposed switchable normalization (SN) offers a new perspective for deep learning: it learns to select different normalizers for different convolution layers of a ConvNet. However, SN uses a softmax function to learn the importance ratios that combine the normalizers, which not only leads to redundant computation compared with a single normalizer but also makes the model less interpretable. This work addresses this issue by presenting sparse switchable normalization (SSN), in which the importance ratios are constrained to be sparse. Unlike \(\ell _1\) and \(\ell _0\) regularizations, which make it difficult to tune layer-wise regularization coefficients, we turn this sparsity-constrained optimization problem into feed-forward computation by proposing SparsestMax, a sparse version of softmax. SSN has several appealing properties. (1) It inherits all benefits of SN, such as applicability to various tasks and robustness to a wide range of batch sizes. (2) It is guaranteed to select only one normalizer for each normalization layer, avoiding redundant computation and improving the interpretability of normalizer selection. (3) SSN can be transferred to various tasks in an end-to-end manner. Extensive experiments show that SSN outperforms its counterparts on challenging benchmarks such as ImageNet, COCO, Cityscapes, ADE20K, Kinetics and MegaFace. Models and code are available at https://github.com/switchablenorms/Sparse_SwitchNorm.
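To make the setting concrete, the following is a minimal NumPy sketch (our own illustration, not the released implementation) of a switchable-normalization-style forward pass over the candidate set {IN, LN, BN}; the function and argument names are illustrative assumptions. With SSN, the softmax below would be replaced by SparsestMax, so the ratios become one-hot and a single normalizer is selected per layer.

```python
import numpy as np

def switchable_norm_forward(x, z, z_prime, gamma, beta, eps=1e-5):
    """Forward pass of a switchable-normalization-style layer.

    x: input of shape (N, C, H, W).
    z, z_prime: length-3 logits for the mean/variance importance ratios,
                ordered as (IN, LN, BN).
    gamma, beta: per-channel affine parameters of shape (C,).
    """
    softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()
    # SN learns p, p_prime with softmax; SSN replaces softmax with SparsestMax
    # so that these ratios become one-hot and a single normalizer is selected.
    p, p_prime = softmax(z), softmax(z_prime)

    # Statistics of the three candidate normalizers.
    mu_in, var_in = x.mean(axis=(2, 3), keepdims=True), x.var(axis=(2, 3), keepdims=True)
    mu_ln, var_ln = x.mean(axis=(1, 2, 3), keepdims=True), x.var(axis=(1, 2, 3), keepdims=True)
    mu_bn, var_bn = x.mean(axis=(0, 2, 3), keepdims=True), x.var(axis=(0, 2, 3), keepdims=True)

    # Weighted combination of the statistics, then normalization and affine transform.
    mu = p[0] * mu_in + p[1] * mu_ln + p[2] * mu_bn
    var = p_prime[0] * var_in + p_prime[1] * var_ln + p_prime[2] * var_bn
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```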


Notes

  1. The softmax function is defined by \(p_k=\mathrm {softmax}_k(\mathbf {z})=\exp (z_k)/\sum _{i=1}^{|\varOmega |}\exp (z_i)\).

  2. Unless otherwise stated, all \((1),(2),\ldots ,(K)\) in this paper represent subscripts in the descending order of all elements in \(\mathbf {z}\).

  3. https://github.com/deepinsight/insightface/tree/master/src/megaface


Acknowledgements

We thank Xinjiang Wang and Tianjian Meng for their helpful discussions. Ping Luo is partially supported by the HKU Seed Funding for Basic Research and SenseTime’s Donation for Basic Research. This work was supported in part by SenseTime Group Limited and in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, and CUHK14207319.

Author information


Corresponding author

Correspondence to Ping Luo.

Additional information

Communicated by Li Liu, Matti Pietikäinen, Jie Qin, Jie Chen, Wanli Ouyang, Luc Van Gool.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Derivations of Back-Propagated Gradients of SyncSSN

To compute the gradients w.r.t. \(\mathbf {z}\) and \(\mathbf {z}^\prime \), we first derive the Jacobian of \(\mathbf {p}\) w.r.t. \(\mathbf {z}\) for the three cases in Theorem 1. Let \(\mathcal {I}_S\) be an indicator vector whose \(i\)-th entry is 1 if \(i \in S(\mathbf {z})\) and 0 otherwise. Then Eq. (17) can be rewritten in the following vector form:

$$\begin{aligned} \frac{\partial \mathbf {p}_0}{\partial \mathbf {z}}=diag(\mathcal {I}_S)-\mathcal {I}_S{^{\textsf {T}}}\mathcal {I}_S/|S(\mathbf {z})| \end{aligned}$$
(24)
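For concreteness, here is a small NumPy sketch of \(\mathbf {p}_0=\mathrm {sparsemax}(\mathbf {z})\) and of the Jacobian in Eq. (24). It is our own illustration using the standard simplex-projection form of sparsemax, not the authors' code, and the function names are assumptions.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (sparsemax)."""
    z_sorted = np.sort(z)[::-1]                  # descending order
    cssv = np.cumsum(z_sorted) - 1.0
    k = np.arange(1, len(z) + 1)
    support = k * z_sorted > cssv                # sorted entries that remain positive
    tau = cssv[support][-1] / k[support][-1]     # threshold tau(z)
    return np.maximum(z - tau, 0.0)

def sparsemax_jacobian(p0):
    """Eq. (24): diag(I_S) - I_S I_S^T / |S|, with S the support of p0."""
    s = (p0 > 0).astype(float)                   # indicator vector of the support S(z)
    return np.diag(s) - np.outer(s, s) / s.sum()
```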

In addition, the Jacobian of \(\mathbf {p}_1\) w.r.t. \(\mathbf {z}\) can be derived as:

$$\begin{aligned} \frac{\partial \mathbf {p}_1}{\partial \mathbf {z}}= & {} \frac{\partial \mathbf {p}_1}{\partial \mathbf {p}_0}\frac{\partial \mathbf {p}_0}{\partial \mathbf {z}}\nonumber \\= & {} \frac{1}{\left\| \mathbf {p}_0-\mathbf {u}\right\| _2}\left( I-\frac{(\mathbf {p}_0-\mathbf {u}){^{\textsf {T}}}(\mathbf {p}_0-\mathbf {u})}{\left\| \mathbf {p}_0-\mathbf {u}\right\| ^2}\right) \frac{\partial \mathbf {p}_0}{\partial \mathbf {z}} \end{aligned}$$
(25)

Moreover, \(\frac{\partial \mathbf {p}_3}{\partial \mathbf {z}}=0\) holds almost everywhere by Eq. (16). In summary, the Jacobian of \(\mathbf {p}\) w.r.t. \(\mathbf {z}\) takes the form:

$$\begin{aligned} \frac{\partial \mathbf {p}}{\partial \mathbf {z}}=\left\{ \begin{array}{lll} diag(\mathcal {I}_S)-\frac{\mathcal {I}_S{^{\textsf {T}}}\mathcal {I}_S}{|S(\mathbf {z})|} &{}\quad if \, \left\| \mathbf {p}_0-\mathbf {u}\right\| _2 \ge r\\ \frac{\partial \mathbf {p}_1}{\partial \mathbf {z}} &{}\quad if \, \left\| \mathbf {p}_0-\mathbf {u}\right\| _2<r,\, \mathbf {p}_1>0\\ 0 &{}\quad else \end{array} \right. \nonumber \\ \end{aligned}$$
(26)
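The three cases of Eq. (26) can be assembled as in the sketch below, which continues the `sparsemax` and `sparsemax_jacobian` helpers above; `u` is the simplex centre and `r` the radius from the main text, both assumed given, and the radius-projection branch follows Eq. (25) as written.

```python
def sparsestmax_jacobian(p0, p1, u, r):
    """Piecewise Jacobian of Eq. (26), given p0 = sparsemax(z), the
    radius-projected point p1, the simplex centre u and the radius r."""
    d = p0 - u
    nrm = np.linalg.norm(d)
    if nrm >= r:                                   # sparsemax branch, Eq. (24)
        return sparsemax_jacobian(p0)
    if np.all(p1 > 0):                             # radius-projection branch, Eq. (25)
        dp1_dp0 = (np.eye(len(p0)) - np.outer(d, d) / nrm ** 2) / nrm
        return dp1_dp0 @ sparsemax_jacobian(p0)
    return np.zeros((len(p0), len(p0)))            # remaining case: zero almost everywhere
```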

Therefore, the gradient of the loss \(\mathcal {L}\) w.r.t. \(\mathbf {z}\) is obtained as:

$$\begin{aligned} \frac{\partial \mathcal {L}}{\partial \mathbf {z}}= & {} \frac{\partial \mathbf {p}}{\partial \mathbf {z}}^\mathsf {T}\frac{\partial \mathcal {L}}{\partial \mathbf {p}}\nonumber \\= & {} \frac{\partial \mathbf {p}}{\partial \mathbf {z}}^\mathsf {T}\left[ \begin{matrix} \sum _{g,n,c}^{G,N,C}\frac{\partial \mathcal {L}}{\partial \mu _{g,nc}} \mu _{\mathrm {in},g,nc}\\ \sum _{g,n,c}^{G,N,C}\frac{\partial \mathcal {L}}{\partial \mu _{g,nc}} \mu _{\mathrm {syncbn},c}\\ \sum _{g,n,c}^{G,N,C}\frac{\partial \mathcal {L}}{\partial \mu _{g,nc}} \mu _{\mathrm {ln},g,n} \end{matrix}\right] \end{aligned}$$
(27)

In the same way, we have

$$\begin{aligned} \frac{\partial \mathcal {L}}{\partial \mathbf {z}^\prime }= & {} \frac{\partial \mathbf {p}^\prime }{\partial \mathbf {z}^\prime }^\mathsf {T}\frac{\partial \mathcal {L}}{\partial \mathbf {p}^\prime }\nonumber \\= & {} \frac{\partial \mathbf {p}^\prime }{\partial \mathbf {z}^\prime }^\mathsf {T}\left[ \begin{matrix} \sum _{g,n,c}^{G,N,C}\frac{\partial \mathcal {L}}{\partial \sigma ^2_{g,nc}} \sigma ^2_{\mathrm {in},g,nc}\\ \sum _{g,n,c}^{G,N,C}\frac{\partial \mathcal {L}}{\partial \sigma ^2_{g,nc}} \sigma ^2_{\mathrm {syncbn},c}\\ \sum _{g,n,c}^{G,N,C}\frac{\partial \mathcal {L}}{\partial \sigma ^2_{g,nc}} \sigma ^2_{\mathrm {ln},g,n} \end{matrix}\right] , \end{aligned}$$
(28)

where the Jacobian \(\partial \mathbf {p}^\prime /\partial \mathbf {z}^\prime \) is computed in the same way as \(\partial \mathbf {p}/\partial \mathbf {z}\).
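A compact sketch of how Eqs. (27) and (28) can be evaluated is given below, assuming NumPy arrays `dL_dmu` of shape (G, N, C), per-normalizer means broadcastable to that shape, and `J` the Jacobian from Eq. (26); the names are our own illustration, not the released API.

```python
def grad_wrt_z(J, dL_dmu, mu_in, mu_syncbn, mu_ln):
    """Eq. (27): dL/dz = (dp/dz)^T v, where v stacks the inner products of
    dL/dmu with the per-normalizer means, ordered as (in, syncbn, ln).
    Eq. (28) is evaluated identically with dL/dvar and the variances."""
    v = np.array([np.sum(dL_dmu * mu_in),
                  np.sum(dL_dmu * mu_syncbn),
                  np.sum(dL_dmu * mu_ln)])
    return J.T @ v
```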

The remaining term to be back-propagated is the gradient w.r.t. the input, which can be derived as follows:

$$\begin{aligned} \frac{\partial \mathcal {L}}{\partial h_{g,ncij}}= & {} \underbrace{\frac{\partial \mathcal {L}}{\partial \bar{h}_{g,ncij}} \cdot \frac{\partial \bar{h}_{g,ncij}}{\partial h_{g,ncij}} }_{\text {Term1}} + \underbrace{ \sum _{\ddot{g},\ddot{n},\ddot{c}}^{G,N,C} \frac{\partial \mathcal {L}}{\partial \sigma ^2_{\ddot{g},\ddot{n}\ddot{c}} } \cdot \frac{\partial \sigma ^2_{\ddot{g},\ddot{n}\ddot{c}} }{\partial h_{g,ncij}} }_{\text {Term2}} \nonumber \\&+ \underbrace{ \sum _{\ddot{g},\ddot{n},\ddot{c}}^{G,N,C} \frac{\partial \mathcal {L}}{\partial \mu _{\ddot{g},\ddot{n}\ddot{c}} } \cdot \frac{\partial \mu _{\ddot{g},\ddot{n}\ddot{c}} }{\partial h_{g,ncij}} }_{\text {Term3}}, \end{aligned}$$
(29)

and,

$$\begin{aligned} \text {Term1}= & {} \frac{\partial \mathcal {L}}{\partial \hat{h}_{g,ncij}} \cdot \frac{\gamma }{\sqrt{ \sigma ^2_{g,nc} +\epsilon }}, \end{aligned}$$
(30)
$$\begin{aligned} \text {Term2}= & {} \sum _{\ddot{g},\ddot{n},\ddot{c}}^{G,N,C} \frac{\partial \mathcal {L}}{\partial \sigma ^2_{\ddot{g},\ddot{n}\ddot{c}}} \cdot \big ( p^\prime _{\mathrm {in}} \frac{\partial \sigma ^2_{\mathrm {in},g,nc}}{\partial h_{g,ncij}} \delta _{g\ddot{g},n\ddot{n},c\ddot{c}} \nonumber \\&+ p^\prime _{\mathrm {ln}} \frac{\partial \sigma ^2_{\mathrm {ln},g,n}}{\partial h_{g,ncij}} \delta _{g\ddot{g},n\ddot{n}} + p^\prime _{\mathrm {bn}} \frac{\partial \sigma ^2_{\mathrm {syncbn},c}}{\partial h_{g,ncij}} \delta _{c\ddot{c}}\big ) \nonumber \\= & {} p^\prime _{\mathrm {in}} \frac{\partial \sigma ^2_{\mathrm {in},g,nc}}{\partial h_{g,ncij}} \frac{\partial \mathcal {L}}{\partial \sigma ^2_{g,nc}} + p^\prime _{\mathrm {ln}} \frac{\partial \sigma ^2_{\mathrm {ln},g,n}}{\partial h_{g,ncij}} \sum _{\ddot{c}}^C\frac{\partial \mathcal {L}}{\partial \sigma ^2_{g,n\ddot{c}}} \nonumber \\&+ p^\prime _{\mathrm {bn}} \frac{\partial \sigma ^2_{\mathrm {syncbn},c}}{\partial h_{g,ncij}} \sum _{\ddot{g},\ddot{n}}^{G,N} \frac{\partial \mathcal {L}}{\partial \sigma ^2_{\ddot{g},\ddot{n}c}} \nonumber \\= & {} p^\prime _{\mathrm {in}} \frac{ 2(h_{g,ncij} - \mu _{\mathrm {in},g,nc}) }{HW} \frac{\partial \mathcal {L}}{\partial \sigma ^2_{g,nc}} \nonumber \\&+ p^\prime _{\mathrm {ln}} \frac{ 2(h_{g,ncij} - \mu _{\mathrm {ln},g,n}) }{CHW} \sum _{\ddot{c}}^C \frac{\partial \mathcal {L}}{\partial \sigma ^2_{g,n\ddot{c}}} \nonumber \\&+ p^\prime _{\mathrm {bn}} \frac{ 2(h_{g,ncij} - \mu _{\mathrm {syncbn},c}) }{GNHW} \sum _{\ddot{g},\ddot{n}}^{G,N} \frac{\partial \mathcal {L}}{\partial \sigma ^2_{\ddot{g},\ddot{n}c}}\end{aligned}$$
(31)
$$\begin{aligned} \text {Term3}= & {} \sum _{\ddot{g},\ddot{n},\ddot{c}}^{G,N,C} \frac{\partial \mathcal {L}}{\partial \mu _{\ddot{g},\ddot{n}\ddot{c}}} \cdot \big ( p_{\mathrm {in}} \frac{\partial \mu _{\mathrm {in},g,nc}}{\partial h_{g,ncij}} \delta _{g\ddot{g},n\ddot{n},c\ddot{c}} \nonumber \\&+ p_{\mathrm {ln}} \frac{\partial \mu _{\mathrm {ln},g,n}}{\partial h_{g,ncij}} \delta _{g\ddot{g},n\ddot{n}} + p_{\mathrm {bn}} \frac{\partial \mu _{\mathrm {syncbn},c}}{\partial h_{g,ncij}} \delta _{c\ddot{c}}\big ) \nonumber \\= & {} p_{\mathrm {in}} \frac{\partial \mu _{\mathrm {in},g,nc}}{\partial h_{g,ncij}} \frac{\partial \mathcal {L}}{\partial \mu _{g,nc}} + p_{\mathrm {ln}} \frac{\partial \mu _{\mathrm {ln},g,n}}{\partial h_{g,ncij}} \sum _{\ddot{c}}^C\frac{\partial \mathcal {L}}{\partial \mu _{g,n\ddot{c}}} \nonumber \\&+ p_{\mathrm {bn}} \frac{\partial \mu _{\mathrm {syncbn},c}}{\partial h_{g,ncij}} \sum _{\ddot{g},\ddot{n}}^{G,N} \frac{\partial \mathcal {L}}{\partial \mu _{\ddot{g},\ddot{n}c}} \nonumber \\= & {} p_{\mathrm {in}} \frac{ 1 }{HW} \frac{\partial \mathcal {L}}{\partial \mu _{g,nc}} + p_{\mathrm {ln}} \frac{ 1 }{CHW} \sum _{\ddot{c}}^C \frac{\partial \mathcal {L}}{\partial \mu _{g,n\ddot{c}}} \nonumber \\&+ p_{\mathrm {bn}} \frac{ 1 }{GNHW} \sum _{\ddot{g},\ddot{n}}^{G,N} \frac{\partial \mathcal {L}}{\partial \mu _{\ddot{g},\ddot{n}c}} \end{aligned}$$
(32)

where \(\delta \) denotes the Kronecker delta, i.e., \(\delta _{t\ddot{t}} = 1\) if \(t=\ddot{t}\) and \(\delta _{t\ddot{t}}=0\) otherwise.
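The NumPy sketch below assembles Eqs. (29)-(32) for an input tensor of shape (G, N, C, H, W); it is an illustration under the stated shape conventions, with argument names of our own choosing, rather than the authors' implementation.

```python
def grad_wrt_input(dL_dhhat, dL_dmu, dL_dvar, h, stats, p, p_prime, gamma, eps=1e-5):
    """Eqs. (29)-(32): back-propagate to the input h of shape (G, N, C, H, W).

    dL_dhhat: gradient w.r.t. the normalized output, shape (G, N, C, H, W).
    dL_dmu, dL_dvar: gradients w.r.t. the mixed mean/variance, shape (G, N, C).
    stats: dict with 'mu_in' (G, N, C), 'mu_ln' (G, N), 'mu_syncbn' (C,),
           and 'var', the mixed variance, of shape (G, N, C).
    p, p_prime: importance ratios ordered as (in, ln, syncbn); gamma: shape (C,)."""
    G, N, C, H, W = h.shape
    mu_in = stats['mu_in'][..., None, None]                  # (G, N, C, 1, 1)
    mu_ln = stats['mu_ln'][..., None, None, None]            # (G, N, 1, 1, 1)
    mu_bn = stats['mu_syncbn'][None, None, :, None, None]    # (1, 1, C, 1, 1)
    var = stats['var'][..., None, None]

    # Term 1, Eq. (30): through the normalized activation.
    term1 = dL_dhhat * gamma[None, None, :, None, None] / np.sqrt(var + eps)

    # Term 2, Eq. (31): through the three candidate variances.
    g_var = dL_dvar[..., None, None]
    term2 = (p_prime[0] * 2 * (h - mu_in) / (H * W) * g_var
             + p_prime[1] * 2 * (h - mu_ln) / (C * H * W)
               * dL_dvar.sum(axis=2)[..., None, None, None]
             + p_prime[2] * 2 * (h - mu_bn) / (G * N * H * W)
               * dL_dvar.sum(axis=(0, 1))[None, None, :, None, None])

    # Term 3, Eq. (32): through the three candidate means.
    g_mu = dL_dmu[..., None, None]
    term3 = (p[0] / (H * W) * g_mu
             + p[1] / (C * H * W) * dL_dmu.sum(axis=2)[..., None, None, None]
             + p[2] / (G * N * H * W) * dL_dmu.sum(axis=(0, 1))[None, None, :, None, None])

    return term1 + term2 + term3
```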


About this article


Cite this article

Shao, W., Li, J., Ren, J. et al. SSN: Learning Sparse Switchable Normalization via SparsestMax. Int J Comput Vis 128, 2107–2125 (2020). https://doi.org/10.1007/s11263-019-01269-y
