Normalization methods deal with the training of parameters in convolutional neural networks (CNNs), which often contain many convolution layers. Although the layers of a CNN do not play homogeneous roles in representing a prediction function, existing works often employ the same normalizer in every layer, which leaves performance short of what is achievable. To tackle this problem and further boost performance, the recently proposed switchable normalization (SN) provides a new perspective for deep learning: it learns to select different normalizers for different convolution layers of a ConvNet. However, SN uses the softmax function to learn importance ratios for combining normalizers, which not only leads to redundant computation compared with a single normalizer but also makes the model less interpretable. This work addresses the issue by presenting sparse switchable normalization (SSN), in which the importance ratios are constrained to be sparse. Unlike \(\ell _1\) and \(\ell _0\) regularization, which make layer-wise regularization coefficients difficult to tune, we turn this sparse-constrained optimization problem into a feed-forward computation by proposing SparsestMax, a sparse version of softmax. SSN has several appealing properties. (1) It inherits all the benefits of SN, such as applicability to various tasks and robustness to a wide range of batch sizes. (2) It is guaranteed to select only one normalizer for each normalization layer, avoiding redundant computation and improving the interpretability of normalizer selection. (3) SSN can be transferred to various tasks in an end-to-end manner. Extensive experiments show that SSN outperforms its counterparts on various challenging benchmarks such as ImageNet, COCO, Cityscapes, ADE20K, Kinetics and MegaFace. Models and code are available at https://github.com/switchablenorms/Sparse_SwitchNorm.
The softmax function is defined by \(p_k=\mathrm {softmax}_k(\mathbf {z})=\exp (z_k)/\sum _{j=1}^{|\varOmega |}\exp (z_j)\).
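As an illustration of the softmax definition, the following is a minimal sketch in plain Python. The function name `softmax`, the subtraction of the maximum score (a standard numerical-stability step that does not change the result) and the example scores are assumptions made for this sketch and are not taken from the released code.

```python
import math

def softmax(z):
    """Return softmax probabilities for a list of real-valued scores z."""
    # Subtracting max(z) leaves the output unchanged but avoids overflow in exp.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate normalizers (illustrative values only).
ratios = softmax([2.0, 1.0, 0.1])
```

Every entry of `ratios` is strictly positive, so a softmax-weighted combination has to evaluate every candidate normalizer. This is the redundancy that a sparse alternative is meant to remove.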
Unless otherwise stated, the subscripts \((1),(2),\ldots ,(K)\) in this paper index the elements of \(\mathbf {z}\) in descending order, i.e. \(z_{(1)}\ge z_{(2)}\ge \cdots \ge z_{(K)}\).
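The ordering convention can be made concrete with a short sketch. The helper name `descending_order` and the example vector are hypothetical, and the indices are 0-based as usual in Python, whereas the paper's subscripts start at 1.

```python
def descending_order(z):
    """Return the indices of z arranged so that z[(1)] >= z[(2)] >= ... >= z[(K)]."""
    return sorted(range(len(z)), key=lambda i: z[i], reverse=True)

z = [0.3, 1.2, -0.5]
order = descending_order(z)        # positions of z_(1), z_(2), z_(3) in z
z_sorted = [z[i] for i in order]   # the values z_(1), z_(2), z_(3)
```

Here `z_sorted[0]` plays the role of \(z_{(1)}\), the largest element of \(\mathbf {z}\).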
We thank Xinjiang Wang and Tianjian Meng for their helpful discussions.
Appendix A: Derivations of Back-Propagated Gradients of SyncSSN
To compute the gradients wrt. \(\mathbf {z}\) and \(\mathbf {z}^\prime \), we first derive the Jacobian of \(\mathbf {p}\) wrt. \(\mathbf {z}\) for each of the three cases in Theorem 1. Let \(\mathcal {I}_S\) be an indicator vector whose i-th entry is 1 if \(i \in S(\mathbf {z})\) and 0 otherwise. Then Eq. (17) can be rewritten in the following vector form:
Besides, the Jacobian of \(\mathbf {p}_1\) wrt. \(\mathbf {z}\) can be derived as:
Moreover, we have \(\frac{\partial \mathbf {p}_3}{\partial \mathbf {z}}=0\) almost everywhere by Eq. (16). In all, the Jacobian of \(\mathbf {p}\) wrt. \(\mathbf {z}\) is of the form:
Therefore, the gradient of loss \(\mathcal {L}\) wrt. \(\mathbf {z}\) can be obtained as:
In the same way, we have
where the Jacobian \(\partial \mathbf {p}^\prime /\partial \mathbf {z}^\prime \) is computed the same as \(\partial \mathbf {p}/\partial \mathbf {z}\).
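The chain-rule step used above, \(\partial \mathcal {L}/\partial z_j=\sum _i (\partial p_i/\partial z_j)\,\partial \mathcal {L}/\partial p_i\), can be checked numerically. The sketch below is only an illustration: it substitutes the well-known softmax Jacobian \(p_i(\delta _{ij}-p_j)\) for the SparsestMax Jacobian derived in this appendix, and the toy linear loss, the coefficients `c` and all function names are assumptions introduced for the example.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(p):
    """J[i][j] = d p_i / d z_j = p_i * (delta_ij - p_j). Stand-in Jacobian."""
    K = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(K)]
            for i in range(K)]

def grad_wrt_z(jac, grad_p):
    """Chain rule: dL/dz_j = sum_i (d p_i / d z_j) * dL/dp_i."""
    K = len(grad_p)
    return [sum(jac[i][j] * grad_p[i] for i in range(K)) for j in range(K)]

# Toy loss L(p) = sum_i c_i * p_i, so dL/dp = c (illustrative values only).
c = [1.0, -2.0, 0.5]
z = [0.2, -0.1, 0.4]

def loss(scores):
    return sum(ci * pi for ci, pi in zip(c, softmax(scores)))

analytic = grad_wrt_z(softmax_jacobian(softmax(z)), c)

# Central finite differences as an independent check of the analytic gradient.
eps = 1e-6
numeric = []
for j in range(len(z)):
    z_plus, z_minus = list(z), list(z)
    z_plus[j] += eps
    z_minus[j] -= eps
    numeric.append((loss(z_plus) - loss(z_minus)) / (2 * eps))
```

The same vector-Jacobian product applies to any mapping onto the simplex; only `softmax_jacobian` would need to be replaced by the case-wise Jacobian of the mapping in question.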
The remaining term that needs to be back-propagated is the gradient wrt. the input, which can be derived as follows:
where \(\delta \) denotes the Kronecker delta, i.e. \(\delta _{t\ddot{t}} = 1\) if \(t=\ddot{t}\) and \(\delta _{t\ddot{t}}=0\) if \(t \ne \ddot{t}\).
Shao, W., Li, J., Ren, J. et al. SSN: Learning Sparse Switchable Normalization via SparsestMax. Int J Comput Vis 128, 2107–2125 (2020). https://doi.org/10.1007/s11263-019-01269-y