
1 Introduction

In recent years, deep learning has been dominating the field of computer vision. As one of the most important families of models in deep learning, convolutional neural networks (CNNs) have been applied to various vision tasks, including image classification [19], object detection [7], semantic segmentation [23], boundary detection [41], etc. The fundamental idea is to stack a number of linear operations (e.g., convolution) and non-linear activations (e.g., ReLU [24]), so that a deep network can fit very complicated distributions. There are two prerequisites for training a deep network, namely, the availability of large-scale image data and the support of powerful computational resources.

Convolution is the most important operation in a deep network. A window is slid across the image lattice, and a number of small convolutional kernels are applied to capture local visual patterns. This operation suffers from the weakness of being spatially symmetric: it assumes that visual features are independent of their spatial position. This limits the network's ability to learn from contextual cues (e.g., one object is located upon another), which are often important in visual recognition. Conventional networks capture such spatial information by stacking a number of convolutions and gradually enlarging the receptive field; we propose an alternative solution which equips each neuron with the ability to refer to its contexts at multiple scales efficiently.

Our approach is named multi-scale spatially-asymmetric recalibration (MS-SAR). It quantifies the importance of each neuron by a score and multiplies the original neural response by this score. This process is named recalibration [13]. Two features are proposed to enhance the effect of recalibration. First, the importance score of each neuron is computed from a local region (named a coordinate set) covering that neuron. This introduces the factor of spatial position into recalibration, leading to the desired spatially-asymmetric property. Second, we relate each neuron to multiple coordinate sets of different sizes, so that the importance of that neuron is evaluated by incorporating multi-scale information. The conceptual flowchart of our approach is illustrated in Fig. 1.

In practice, the recalibration function (taking inputs from the coordinate sets and outputting the importance score) is the combination of two linear operations and two non-linear activations, and its parameters are learned from training data. To avoid heavy computational costs and a large number of extra parameters, we first perform a regional pooling over the coordinate set to reduce the spatial resolution, and then use a smaller number of outputs in the first linear layer to reduce the channel resolution. Consequently, our approach requires only small fractions of extra parameters and computations beyond the baseline building blocks.

We integrate MS-SAR into two popular building blocks, namely, the residual block [11] and the densely-connected block [15], and empirically evaluate its performance in two image classification tasks. On the CIFAR datasets [18], our approach outperforms the baseline networks, the ResNets [11] and the DenseNets [15]. On the ILSVRC2012 dataset [29], we also compare with SENet [13], a special case of our approach with single-scale spatially-symmetric recalibration, and demonstrate the superior performance of MS-SAR. In all cases, the extra computational overhead brought by MS-SAR does not exceed \(1\%\).

The remainder of this paper is organized as follows. Section 2 briefly reviews the previous literature on deep-learning-based image classification, and Sect. 3 illustrates the MS-SAR approach and describes how we apply it to different building blocks. After extensive experimental results are presented in Sect. 4, we conclude this work in Sect. 5.

2 Related Work

2.1 Convolutional Neural Networks for Visual Recognition

Deep convolutional neural networks (CNNs) have been widely applied to computer vision tasks. These models are based on the same motivation: to learn and organize visual features in a hierarchical manner. In the early years, CNNs proved successful in simple classification problems, in which the input images are small and simple (e.g., MNIST [20] and CIFAR [18]) and the networks are shallow (i.e., with 3–5 layers). With the emergence of large-scale image datasets [4, 22] and powerful computational resources such as GPUs, it became possible to design and train deep networks for recognizing high-resolution natural images [19]. Important technical advances include using the piecewise-linear ReLU activation [24] to prevent under-fitting, and applying Dropout [32] to regularize the training process and avoid over-fitting.

Modern deep networks are built upon a handful of building blocks, including convolution, pooling, normalization, activation, element-wise operations (sum [11] or product [36]), etc. Among them, convolution is considered the most important module for capturing visual patterns by template matching (computing the inner-product between the input data and the learned templates), and most often, we refer to the depth of a network as the maximal number of convolutional layers along any path connecting the input to the output. It is believed that increasing the depth leads to better recognition performance [3, 11, 15, 31, 34]. In order to train these very deep networks efficiently, researchers proposed batch normalization [17] to improve numerical stability, and highway connections [11, 33] to help visual information propagate faster. The idea of automatically learning network architectures has also been explored [38, 47].

Image classification lays the foundation for other vision tasks. Pre-trained networks can be used to extract high-quality visual features for image classification [5], instance retrieval [27], fine-grained object recognition [39, 45] or object detection [8], surpassing the performance of conventional handcrafted features. Another way of transferring knowledge learned in these networks is to fine-tune them for other tasks, including object detection [7, 28], semantic segmentation [1, 23], boundary detection [41], pose estimation [25, 35], etc. A network with stronger classification performance often works better in other tasks.

2.2 Spatial Enhancement for Deep Networks

One of the most important factors in deep networks lies in the spatial domain. Although the convolution operation is naturally invariant to spatial translation, various approaches have been designed to enhance visual recognition by introducing different spatial priors into deep networks.

In an image, the relationship between two features is often tighter when their spatial locations are closer to each other. An efficient way of modeling such distance-sensitive information is to perform spatial pooling [10], which explicitly splits the image lattice into several groups, and ignores the diversity of features in the same group. This idea is also widely used in object detection to summarize visual features given a set of regional proposals [7, 28].

On the other hand, researchers also noticed that spatial importance (saliency) is not uniformly distributed in the spatial domain. Thus, various approaches were designed to discriminate the important (salient) features from others. Typical examples include using gradient back-propagation to find the neurons that contribute most to the classification result [39, 43], introducing saliency [26, 30] or attention [2] into the network, and investigating local properties (e.g., smoothness [37]). We note that a regular convolutional layer also captures local patterns in the spatial domain, but (i) it performs linear template matching and so cannot capture non-linear properties (e.g., smoothness), meanwhile (ii) it often needs a larger number of parameters and heavier computational overheads.

In this work, we consider a recalibration approach [13], which aims at revising the response of each neuron by a spatial weight. Unlike [13], the proposed approach utilizes multi-scale visual information and allows different weights to be added at different spatial positions. This brings significant accuracy gains.

3 Our Approach

3.1 Motivation: Why Is Spatial Asymmetry Required?

Let \(\mathbf {X}\) be the output of a convolutional layer. This is a 3D cube with \(W\times H\times D\) entries, where W and H are the width and height, indicating the spatial resolution, and D is the depth, indicating the number of convolutional kernels. According to the definition of convolution, each element in \(\mathbf {X}\), denoted by \(x_{w,h,d}\), represents the intensity of the d-th visual pattern at the coordinate \(\left( w,h\right) \), which is obtained from the inner-product of the d-th convolutional kernel and the input region corresponding to the coordinate \(\left( w,h\right) \).

Here we notice that convolution performs spatially-symmetric template matching, in which the same template is applied regardless of the spatial position \(\left( w,h\right) \). We argue that this is not the optimal choice. In visual recognition, we often hope to learn contextual information (e.g., feature \(d_1\) often appears upon feature \(d_2\)), and so the spatially-asymmetric property is desired. To this end, we define \(\mathcal {S}_{w,h}\) to be the coordinate set containing the neighboring coordinates of \(\left( w,h\right) \) (detailed in the next subsection). We aim at computing a new response \(\tilde{x}_{w,h,d}\) by taking into consideration all neural responses in \(\mathcal {S}_{w,h}\times \left\{ 1,2,\ldots ,D\right\} \), where \(\times \) denotes the Cartesian product. Our approach is related to, but different from, several existing approaches.

  • First, we note that a standard convolution can learn contexts in a small local region, e.g., \(\mathcal {S}_{w,h}\) is a \(3\times 3\) square centered at \(\left( w,h\right) \). Our approach can refer to multiple \(\mathcal {S}_{w,h}\)’s at different scales, capturing richer information and being more computationally efficient than convolution.

  • The second type works in the spatial domain, using the responses in the set \(\mathcal {S}_{w,h}\times \left\{ d\right\} \) to compute \(\tilde{x}_{w,h,d}\). Examples include the Spatial Pyramid Pooling (SPP) [10] layer, which sets regular pooling regions and ignores feature diversity within each region, and the Geometric Neural Phrase Pooling (GNPP) [37] layer, which takes advantage of the spatial relationship of neighboring neurons (it also assumes that spatially closer neurons have tighter connections) to capture feature co-occurrence. However, both of them are non-parameterized and operate on each channel individually, which limits their ability to adjust feature weights.

  • Another related approach is feature recalibration [13], which computes \(\tilde{x}_{w,h,d}\) by referring to visual cues from the entire image lattice, i.e., the set \(\left\{ \left( w,h\right) \right\} _{w=1,h=1}^{W,H}\times \left\{ 1,2,\ldots ,D\right\} \) is used. This is still a spatially-symmetric operation. As we shall see later, our approach is a generalized version and produces better visual recognition performance.

3.2 Formulation: Spatially-Asymmetric Recalibration

Given the neural response cube \(\mathbf {X}\) and the coordinate set \(\mathcal {S}_{w,h}\) at \(\left( w,h\right) \), the goal is to compute a revised intensity \(\tilde{x}_{w,h,d}\) with spatial information taken into consideration. We formulate it as a weighting scheme \({\tilde{x}_{w,h,d}}={x_{w,h,d}\times z_{w,h,d}}\), in which \({z_{w,h,d}}={f_d\!\left( \mathbf {X},\mathcal {S}_{w,h}\right) }\) and \(f_d\!\left( \cdot \right) \) is named the recalibration function [13]. This creates a weighting cube \(\mathbf {Z}\) with the same size as \(\mathbf {X}\), and \({\tilde{\mathbf {X}}}={\mathbf {X}\odot \mathbf {Z}}\) is propagated to the next network layer. We denote the D-dimensional feature vector of \(\mathbf {X}\) at \(\left( w,h\right) \) by \(\mathbf {x}_{w,h}={\left[ x_{w,h,1};\ldots ;x_{w,h,D}\right] ^\top }\), and similarly for \(\tilde{\mathbf {x}}_{w,h}\) and \(\mathbf {z}_{w,h}\).

Let the set of all spatial positions be \({\mathcal {P}}={\left\{ \left( w,h\right) \right\} _{w=1,h=1}^{W,H}}\). The coordinate set of each position is a subset of \(\mathcal {P}\), i.e., \({\mathcal {S}_{w,h}}\in {2^\mathcal {P}}\) where \(2^\mathcal {P}\) is the power set of \(\mathcal {P}\). Each coordinate set \(\mathcal {S}_{w,h}\) defines a corresponding feature set \({\mathbf {X}_{\mathcal {S}_{w,h}}}={\left[ \mathbf {x}_{w',h'}\right] _{\left( w',h'\right) \in \mathcal {S}_{w,h}}}\), and we abbreviate \(\mathbf {X}_{\mathcal {S}_{w,h}}\) as \(\mathfrak {X}_{w,h}\). Thus, \({z_{w,h,d}}={f_d\!\left( \mathbf {X},\mathcal {S}_{w,h}\right) }\) can be rewritten as \({z_{w,h,d}}={f_d\!\left( \mathfrak {X}_{w,h}\right) }\). This means that, for two spatial positions \(\left( w_1,h_1\right) \) and \(\left( w_2,h_2\right) \), \(\mathbf {z}_{w_1,h_1}\) can be impacted by \(\mathbf {x}_{w_2,h_2}\) if and only if \({\left( w_2,h_2\right) }\in {\mathcal {S}_{w_1,h_1}}\), and vice versa. It is common knowledge that if two positions \(\left( w_1,h_1\right) \) and \(\left( w_2,h_2\right) \) are close in the image lattice, i.e., \(\left\| \left( w_1,h_1\right) -\left( w_2,h_2\right) \right\| _1\) is small, the relationship of their feature vectors is more likely to be tight. Therefore, we define each \(\mathcal {S}_{w,h}\) to be a continuous region that covers \(\left( w,h\right) \) itself.

We provide two ways of defining \(\mathcal {S}_{w,h}\), both of which are based on a scale parameter K. The first one is named the sliding strategy, in which \({\mathcal {S}_{w,h}}={\left\{ \left( w',h'\right) \mid \left\| \left( w,h\right) -\left( w',h'\right) \right\| _1\leqslant T\right\} }\), where \({T}={\sqrt{WH}/K}\) is the distance threshold. The second one is named the regional strategy, which partitions the image lattice into \(K\times K\) equally-sized regions, and \(\mathcal {S}_{w,h}\) is composed of all positions falling into the same region as \(\left( w,h\right) \). The former is more flexible, i.e., each position has a unique coordinate set, so there are \(W\times H\) different sets, while the latter reduces this number to \(K^2\), which slightly reduces the computational costs (see Sect. 3.5). A minimal sketch of both strategies is given below.
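To make the two strategies concrete, the following pure-Python sketch enumerates \(\mathcal {S}_{w,h}\) under either definition. It is an illustration under our own naming (the function coordinate_set is hypothetical and not part of any released implementation).

```python
import math

def coordinate_set(w, h, W, H, K, strategy="regional"):
    """Hypothetical helper returning the positions in S_{w,h} at scale K."""
    if strategy == "regional":
        # Partition the W x H lattice into K x K equally-sized regions;
        # S_{w,h} contains every position in the same region as (w, h).
        rw, rh = w * K // W, h * K // H
        return [(u, v) for u in range(W) for v in range(H)
                if (u * K // W, v * K // H) == (rw, rh)]
    else:
        # Sliding strategy: all positions within L1 distance T = sqrt(W*H) / K.
        T = math.sqrt(W * H) / K
        return [(u, v) for u in range(W) for v in range(H)
                if abs(u - w) + abs(v - h) <= T]
```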

It remains to determine the form of the recalibration function \(f_d\!\left( \mathfrak {X}_{w,h}\right) \). The major consideration is to reduce the number of parameters to alleviate the risk of over-fitting, and to reduce the computational costs (FLOPs) so that the network does not become much slower. We borrow the idea of adding both spatial and channel bottlenecks for this purpose [13]. \(\mathfrak {X}_{w,h}\) is first down-sampled into a single vector using average pooling, i.e., \({\mathbf {y}_{w,h}}={\left| \mathcal {S}_{w,h}\right| ^{-1}\sum _{\left( w',h'\right) \in \mathcal {S}_{w,h}}\mathbf {x}_{w',h'}}\), and then passed through two fully-connected layers: \({z_{w,h,d}}={\sigma _2\!\left[ \varvec{\varOmega }_{2,d}\cdot \sigma _1\!\left[ \varvec{\varOmega }_1\cdot \mathbf {y}_{w,h}\right] \right] }\). Here, both \(\varvec{\varOmega }_1\) and \(\varvec{\varOmega }_{2,d}\) are learnable weight matrices, and \(\sigma _1\!\left[ \cdot \right] \) and \(\sigma _2\!\left[ \cdot \right] \) are activation functions which add non-linearity to the recalibration function. The dimension of \(\varvec{\varOmega }_1\) is \(D'\times D\) (\({D'}<{D}\)), and that of \(\varvec{\varOmega }_{2,d}\) is \(1\times D'\). This idea is similar to using a channel bottleneck to reduce computations [11]. \(\sigma _1\!\left[ \cdot \right] \) is a composite function of batch normalization [17] followed by ReLU activation [24], and \(\sigma _2\!\left[ \cdot \right] \) replaces ReLU with a sigmoid so as to output a floating-point number in \(\left( 0,1\right) \).

We share \(\varvec{\varOmega }_1\) over all \(f_d\!\left( \mathfrak {X}_{w,h}\right) \)’s, but use an individual \(\varvec{\varOmega }_{2,d}\) for each output channel. Let \({\varvec{\varOmega }_2}={\left[ \varvec{\varOmega }_{2,1}^\top ;\ldots ;\varvec{\varOmega }_{2,D}^\top \right] ^\top }\), and thus the recalibration function is:

$$\begin{aligned} {\mathbf {z}_{w,h}}={\mathbf {f}\!\left( \mathfrak {X}_{w,h}\right) }={\sigma _2\!\left[ \varvec{\varOmega }_2\cdot \sigma _1\!\left[ \varvec{\varOmega }_1\cdot \frac{1}{\left| \mathcal {S}_{w,h}\right| }\cdot \sum _{\left( w',h'\right) \in \mathcal {S}_{w,h}}\mathbf {x}_{w',h'}\right] \right] }. \end{aligned}$$
(1)
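Under the regional strategy, Eq. (1) can be evaluated for all positions at once: the response map is average-pooled to \(K\times K\), passed through the bottleneck, and the resulting weights are duplicated back to the full resolution. The PyTorch sketch below reflects our reading of this formulation; the module and variable names are ours and may differ from the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleScaleSAR(nn.Module):
    """Spatially-asymmetric recalibration weights at one scale K (regional strategy, Eq. 1)."""
    def __init__(self, channels, reduced, K):
        super().__init__()
        self.K = K                                  # scale factor K
        self.fc1 = nn.Conv2d(channels, reduced, 1)  # Omega_1, a local fully-connected layer
        self.bn = nn.BatchNorm2d(reduced)           # sigma_1 = batch normalization + ReLU
        self.fc2 = nn.Conv2d(reduced, channels, 1)  # Omega_2

    def forward(self, x):                           # x: (N, D, H, W)
        y = F.adaptive_avg_pool2d(x, self.K)        # average pooling over each K x K region
        z = torch.sigmoid(self.fc2(F.relu(self.bn(self.fc1(y)))))   # sigma_2 = sigmoid
        return F.interpolate(z, size=x.shape[-2:], mode='nearest')  # duplicate weights back
```

The recalibrated map is then obtained as \(\mathbf {X}\odot \mathbf {Z}\); the combination of several such weight maps at different scales is described next.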
Fig. 1.

Illustration of multi-scale spatially-asymmetric recalibration (MS-SAR). The feature vector to be recalibrated is marked in red, the spatial coordinate sets at different scales are marked in yellow, and the weighting vectors are marked in green. For the first and second scales, the neural responses used for recalibration are duplicated for better visualization. This figure is best viewed in color. (Color figure online)

3.3 Multi-Scale Spatially-Asymmetric Recalibration

In Eq. (1), the coordinate set \(\mathcal {S}_{w,h}\) determines the region-of-interest (ROI) that can impact \(\mathbf {z}_{w,h}\). There is a need to use different scales for evaluating the importance of each feature. We achieve this goal by defining multiple coordinate sets for each spatial position.

Let the total number of scales be L. For each \({l}={1,2,\ldots ,L}\), we define the scale factor \(K^{\left( l\right) }\), construct the coordinate set \(\mathcal {S}_{w,h}^{\left( l\right) }\) and the feature set \(\mathfrak {X}_{w,h}^{\left( l\right) }\), and compute \(\mathbf {z}_{w,h}^{\left( l\right) }\) using Eq. (1). The weights from different scales are averaged: \({\mathbf {z}_{w,h}}={\frac{1}{L}{\sum _{l=1}^L}\mathbf {z}_{w,h}^{\left( l\right) }}\). Using the matrix notation, we write multi-scale spatially-asymmetric recalibration (MS-SAR) as:

$$\begin{aligned} {\tilde{\mathbf {X}}}={\mathbf {X}\odot \mathbf {Z}}={\mathbf {X}\odot \frac{1}{L}{\sum _{l=1}^L}\mathbf {Z}^{\left( l\right) }}. \end{aligned}$$
(2)
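Continuing the sketch above (and assuming the hypothetical SingleScaleSAR module defined there), Eq. (2) amounts to averaging the weight maps of the L scales before the element-wise product. With \({D'}={D/L}\), as chosen in Sect. 3.5, the parameter count does not grow with the number of scales.

```python
import torch
import torch.nn as nn

class MSSAR(nn.Module):
    """Multi-scale spatially-asymmetric recalibration (Eq. 2), e.g., scales = {1, 2, 4}."""
    def __init__(self, channels, scales=(1, 2, 4)):
        super().__init__()
        reduced = max(channels // len(scales), 1)   # D' = D / L keeps parameters constant
        self.branches = nn.ModuleList(
            [SingleScaleSAR(channels, reduced, K) for K in scales])

    def forward(self, x):                           # x: (N, D, H, W)
        z = torch.stack([b(x) for b in self.branches], dim=0).mean(dim=0)  # average Z^(l)
        return x * z                                # X~ = X ⊙ Z
```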

The configuration of an MS-SAR module is denoted by \({\mathcal {L}}={\left\{ K^{\left( l\right) }\right\} _{l=1}^L}\). When \({\mathcal {L}}={\left\{ 1\right\} }\), MS-SAR degenerates to the recalibration approach used in the Squeeze-and-Excitation Network (SENet) [13], which is single-scale and spatially-symmetric, i.e., each pair of spatial positions can impact each other, and \(\mathbf {z}_{w,h}\) is the same at all positions. We will show in experiments that MS-SAR produces superior performance to this degenerated version.

3.4 Applications to Existing Building Blocks

MS-SAR can be applied to each convolutional layer individually. Here we consider two examples, which integrate MS-SAR into a residual block [11] and a densely-connected block [15], respectively. The modified blocks are shown in Fig. 2. In a residual block, we only recalibrate the second convolutional layer, while in a densely-connected block, this operation is performed before each convolved feature vector is concatenated to the main feature vector.
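As an illustration of the left part of Fig. 2, a basic residual block with MS-SAR attached to its second convolution could be sketched as follows; this again assumes the hypothetical MSSAR module above, and the exact layer layout in the authors' implementation may differ.

```python
import torch.nn as nn

class RecalibratedResidualBlock(nn.Module):
    """Basic residual block with MS-SAR applied to the second convolution only."""
    def __init__(self, channels, scales=(1, 2, 4)):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.sar = MSSAR(channels, scales)          # recalibration of the second conv

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.sar(self.bn2(self.conv2(out)))   # X~ = X ⊙ Z before the identity addition
        return self.relu(out + x)
```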

Fig. 2.

Applying MS-SAR (green parts) to a residual block (left) or one single step in a densely-connected block (right). In both examples we set \({\mathcal {L}}={\left\{ 1,2,4\right\} }\). Here, pool indicates a \(\frac{W}{K}\times \frac{H}{K}\) regional pooling, lfc is a local fully-connected layer (a \(1\times 1\) convolution), and ups performs up-sampling by duplicating each element \(\frac{W}{K}\times \frac{H}{K}\) times. The feature map size is labeled for each cube. This figure is best viewed in color.

Another difference lies in the input of the recalibration function. In the residual block, we simply use the convolved response map for “self recalibration”, but in the densely-connected block, especially in the late stages, the main vector has a much higher dimensionality and thus contains multi-stage visual information. Therefore, we compute the recalibration function from the main vector. We name this option multi-stage recalibration. Compared with single-stage recalibration (which feeds the convolved vector into the recalibration function), it requires more parameters as well as computations, but also leads to better classification performance (see Sect. 4.2).

3.5 Computational Costs

Let \(\mathbf {X}\) be a \(W\times H\times D\) cube and let the input of the convolution also have D channels; then the number of parameters of the convolution is \(9D^2\) (assuming a \(3\times 3\) kernel). Given that MS-SAR is configured by \({\mathcal {L}}={\left\{ K^{\left( l\right) }\right\} _{l=1}^L}\), the learnable parameters come from the two weight matrices \(\varvec{\varOmega }_1\) (\(D'\times D\)) and \(\varvec{\varOmega }_2\) (\(D\times D'\)), so there are \(2DD'\) extra parameters for each scale, and \(2LDD'\) for all L scales. We set \({D'}={D/L}\) so that using multiple scales does not increase the total number of parameters.

The extra computations (FLOPs) brought by MS-SAR are related to the strategy of defining the coordinate sets. We first consider the sliding strategy, in which each position \(\left( w,h\right) \) has a different feature set \(\mathfrak {X}_{w,h}\). The spatial average pooling over the feature sets of all positions takes around WHD FLOPs. Then, each D-dimensional vector \(\mathbf {y}_{w,h}\) is passed through two matrix-vector multiplications, and the total FLOPs is \(2WHDD'\). For the regional strategy, the difference lies in that the number of unique feature sets is \(K^{\left( l\right) 2}\) at the l-th scale. By sharing computations, the total FLOPs of the fully-connected layers is decreased to \(2K^{\left( l\right) 2}DD'\). For all L scales, the extra FLOPs are \(2LWHDD'\) for the sliding strategy and \(2DD'{\sum _{l=1}^L}K^{\left( l\right) 2}\) for the regional strategy, respectively.
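As a back-of-the-envelope check of these counts, the snippet below instantiates them for an arbitrary example layer (\(W=H=32\), \(D=64\), \({\mathcal {L}}={\left\{ 1,2,4\right\} }\)); it is our own calculation, and network-level fractions additionally depend on which layers are recalibrated.

```python
W, H, D = 32, 32, 64                     # an example response map
scales = [1, 2, 4]                       # L = {1, 2, 4}
Dp = D // len(scales)                    # D' = D / L

conv_params = 9 * D * D                  # 3x3 convolution with D input/output channels
sar_params = 2 * len(scales) * D * Dp    # Omega_1 and Omega_2 at every scale

conv_flops = 9 * D * D * W * H           # multiply-accumulates of the convolution
sliding_flops = 2 * len(scales) * W * H * D * Dp           # fully-connected layers, sliding
regional_flops = 2 * D * Dp * sum(K * K for K in scales)   # fully-connected layers, regional

print(sar_params, conv_params)           # 8064 extra parameters vs. 36864 for the convolution
print(regional_flops, sliding_flops, conv_flops)
```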

Note that in both ResNets and DenseNets, MS-SAR is applied to half of the convolutional layers, and so the fractions of extra parameters and FLOPs are relatively small. We will report the detailed numbers in experiments.

Table 1. Comparison of classification error rates (\(\%\)) on the CIFAR10 and CIFAR100 datasets. The left three columns list several recent works, and the right part compares our approach with the baselines. “RN” and “DN” denote “ResNet” and “DenseNet”, respectively. An asterisk (*) indicates that MS-SAR is added. For all ResNets, the error rates are averaged over 3 individual runs. All FLOPs and numbers of parameters are computed for the CIFAR10 experiments; the differences between the CIFAR10 and CIFAR100 experiments are negligible.

4 Experiments

4.1 The CIFAR Datasets

We first evaluate MS-SAR on the CIFAR datasets [18] which contain tiny RGB images with a fixed spatial resolution of \(32\times 32\). There are two subsets with 10 and 100 object classes, referred to as CIFAR10 and CIFAR100, respectively. Each set has 50,000 training samples and 10,000 testing samples, both of which are evenly distributed over all (10 or 100) classes.

We choose different baseline network architectures, including the deep residual networks (ResNets) [11] with 20, 32 and 56 layers and the densely-connected networks (DenseNets) [15] with 100 and 190 layers. MS-SAR is applied to each residual block and densely-connected block, as illustrated in Fig. 2. We choose the regional strategy to construct coordinate sets, use \({\mathcal {L}}={\left\{ 1,2,4\right\} }\) and set \({D'}={D/3}\). For other options, see ablation studies in the next subsection.

We follow the conventions to train these networks from scratch. Standard SGD with a weight decay of 0.0001 and a Nesterov momentum of 0.9 is used. For the ResNets, we train the network for 160 epochs with a mini-batch size of 128. The base learning rate is 0.1, and is divided by 10 after 80 and 120 epochs. For the DenseNets, we train the network for 300 epochs with a mini-batch size of 64. The base learning rate is 0.1, and is divided by 10 after 150 and 225 epochs. Adding MS-SAR does not require any of these settings to be modified. During training, standard data augmentation is used, i.e., each image is padded with a 4-pixel margin on each of the four sides; from the enlarged \(40\times 40\) image, a \(32\times 32\) subregion is randomly cropped and flipped with a probability of 0.5. No augmentation is used at the testing stage.
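For reference, the ResNet schedule described above corresponds roughly to the following PyTorch configuration; this is a hedged sketch in which model, the data loaders and train_one_epoch are hypothetical placeholders rather than released code.

```python
import torch
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),     # pad 4 pixels per side, crop back to 32 x 32
    T.RandomHorizontalFlip(),        # flip with probability 0.5
    T.ToTensor(),
])

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=1e-4, nesterov=True)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)

for epoch in range(160):             # mini-batch size 128 in the data loader
    train_one_epoch(model, optimizer)   # hypothetical training-loop helper
    scheduler.step()
```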

Classification results are summarized in Table 1. One can observe that MS-SAR improves the baseline classification accuracy consistently and significantly. In particular, in terms of the relative drop in error rates, almost all numbers are higher than \(10\%\) on CIFAR10 (except for DenseNet-190), and higher than \(4\%\) on CIFAR100 (except for ResNet-20 and DenseNet-190). The highest drop is over \(10\%\) on CIFAR10 and over \(5\%\) on CIFAR100. We note that these improvements are produced at the price of a higher model complexity. The additional computational costs are very small for both the ResNets (e.g., \({\sim }0.3\%\) extra FLOPs) and the DenseNets (e.g., \({\sim }0.3\%\) and \({\sim }0.4\%\) extra FLOPs for DenseNet-100 and DenseNet-190, respectively), and the fractions of extra parameters are moderate (\({\sim }5\%\) for the ResNets and \({\sim }25\%\) for the DenseNets, respectively).

We also compare our results with the state of the art (listed in the left part of Table 1). Although some recent approaches reported much higher accuracies on the CIFAR datasets, we point out that they often used larger spatial resolutions [9], complicated network modules [46] or complicated regularization methods [6, 44], and thus their results are not directly comparable to ours. In addition, we believe that MS-SAR can be applied to these networks towards better classification performance.

Fig. 3.

Training/testing curves of different networks on the CIFAR datasets, with and without MS-SAR. All curves for ResNet-32 and ResNet-56 are averaged over 3 individual runs.

In Fig. 3, we plot the training/testing curves of different networks on the CIFAR datasets. We find that MS-SAR effectively decreases the testing losses (and consequently, error rates) in all cases. On CIFAR10, due to the simplicity of the recognition task (10 classes), the training losses of both approaches, with and without MS-SAR, are very close to 0, but MS-SAR produces lower testing losses, giving evidence for its ability to alleviate over-fitting.

Table 2. Comparison of classification error rates (\(\%\)) on the CIFAR10 and CIFAR100 datasets with different scale combinations. Other specifications remain the same as in Table 1. All results of ResNet-56 are averaged over 3 individual runs. See Sect. 3.5 for the reason that different scale configurations have the same number of parameters.

4.2 Ablation Study and Analysis

We first investigate the impact of incorporating multi-scale visual information. To this end, we set \(\mathcal {L}\) to be a non-empty subset of \(\left\{ 1,2,4\right\} \) (7 possibilities), and summarize the results in Table 2. Compared with using a single scale, incorporating multi-scale information often leads to better classification performance (the only exception is that on DenseNet-100, \({\mathcal {L}}={\left\{ 2,4\right\} }\) works worse than \({\mathcal {L}}={\left\{ 2\right\} }\), which may be caused by random noise, as the DenseNet-100 experiments are performed only once). Combining all three scales always produces the best recognition performance. Provided that the extra computational costs brought by multi-scale recalibration are almost negligible, we use \({\mathcal {L}}={\left\{ 1,2,4\right\} }\) in all the remaining experiments.

Next, we compare the two ways of defining coordinate sets (sliding vs. regional, see Sect. 3.2). In the experiments on CIFAR100, on both ResNets and DenseNets, the regional strategy outperforms the sliding strategy by \({\sim }0.2\%\). The training accuracy using the sliding strategy is also lower, giving evidence that it is less capable of fitting the training data. This reveals that, although spatial asymmetry is a nice property, its degree of freedom should be controlled, so that MS-SAR, containing a limited number of parameters, does not need to fit an over-complicated distribution. Considering that the regional strategy also requires fewer computations (see Sect. 3.5), we set it to be the default option.

Finally, we compare the single-stage and multi-stage recalibration methods on DenseNet-100 (detailed descriptions are in Sect. 3.4). Note that this comparison is independent of the comparison between multi-scale and single-scale recalibration: the scale option works in the spatial domain while the stage option works in the channel domain, and the two are complementary to each other. In the 100-layer DenseNet, multi-stage recalibration produces \(4.06\%\) and \(21.13\%\) error rates on CIFAR10 and CIFAR100, and these numbers are \(4.45\%\) and \(21.83\%\) for single-stage recalibration, respectively. Multi-stage recalibration reduces the relative errors by \(7.77\%\) and \(5.12\%\), at the price of \(23.75\%\) extra parameters and \(0.3\%\) additional FLOPs.

Table 3. Comparison of top-1 and top-5 classification error rates (%) produced by different recalibration approaches (none, SE and MS-SAR) on the ILSVRC2012 dataset. All these numbers are based on our own implementation. See Sect. 3.5 for the reason that different scale configurations have the same number of parameters.

4.3 The ILSVRC2012 Dataset

The ILSVRC2012 dataset [29] is a subset of the ImageNet database [4], created for a large-scale visual recognition competition. It contains 1,000 categories located at different levels of the WordNet hierarchy. The training and testing sets have \(\sim 1.3\mathrm {M}\) and \(50\mathrm {K}\) images, respectively, roughly uniformly distributed over all classes.

The baseline network architectures include two ResNets [11] with 18 and 34 layers, and a ResNeXt [40] with 50 layers. We also compare with the Squeeze-and-Excitation (SE) module [13], which is a special case of our approach (\({\mathcal {L}}={\left\{ 1\right\} }\): single-scale and spatially-symmetric). As illustrated in Fig. 2, both SE and MS-SAR modules are appended after each residual block.

All these networks are trained from scratch. We follow [13] in configuring the following parameters. SGD with a weight decay of 0.0001 and a Nesterov momentum of 0.9 is used. There are a total of 100 epochs in the training process, and the mini-batch size is 1024. The learning rate starts at 0.6, and is divided by 10 after 30, 60 and 90 epochs. Again, adding MS-SAR does not require any of these settings to be modified. In the training process, we apply a series of data-augmentation techniques, including rescaling and cropping the image, randomly mirroring and (slightly) rotating it, changing its aspect ratio and performing pixel jittering, which is the same as in SENet [13]. At the testing stage, we use the standard single center crop on each image.
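In the same sketch style as the CIFAR experiments (with model and train_one_epoch again being hypothetical placeholders), the ILSVRC2012 schedule reads:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.6, momentum=0.9,
                            weight_decay=1e-4, nesterov=True)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[30, 60, 90], gamma=0.1)

for epoch in range(100):             # mini-batch size 1024
    train_one_epoch(model, optimizer)   # hypothetical training-loop helper
    scheduler.step()
```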

Fig. 4.

Training/testing curves of different networks with and without MS-SAR on the ILSVRC2012 dataset. We zoom in on a small part of each curve for better visualization.

Results are summarized in Table 3. In all cases, MS-SAR works better than the baseline (no recalibration) and SE (single-scale spatially-symmetric recalibration). For example, based on ResNeXt-50, MS-SAR reduces the top-5 error of the baseline by an absolute value of \(0.34\%\) or a relative value of \(5.56\%\), using \({\sim }1\%\) extra FLOPs and \({\sim }10\%\) extra parameters. Compared with SE, the error rate drops are \(0.15\%\) (absolute) and \(2.53\%\) (relative), and the extra FLOPs and parameters are merely \({\sim }0.5\%\) and \({\sim }0.4\%\), respectively. The training/testing curves in Fig. 4 show similar phenomena as in the CIFAR experiments.

Fig. 5.

The relationship between classification accuracy and computation (in FLOPs) on three datasets. RN, DN and RNeXt denote ResNet, DenseNet and ResNeXt, respectively. An asterisk sign (*) indicates that MS-SAR is added.

We also investigate the relationship between classification accuracy and computation on these three datasets. In Fig. 5, we plot the testing error as a function of FLOPs, which reveals the trend that MS-SAR achieves higher recognition accuracy under the same computational complexity.

Fig. 6.

Visualization of the weights added by MS-SAR (best viewed in color, adjusted to the spatial resolution of each layer) to an 18-layer ResNet. The response/weight is higher if the color is closer to yellow. Each number in parentheses indicates the filter index. (Color figure online)

Last but not least, we visualize the spatial weights added by the MS-SAR layers in Fig. 6. We present two input images containing an object (a bird) and a scene (a mountain), respectively. One can observe that, in comparison with the \(1\times 1\) weight, both the \(2\times 2\) and \(4\times 4\) weights are more flexible in capturing semantically meaningful regions and assigning them higher weights. In each layer, some filters focus on the foreground, e.g., the characteristic patterns of the bird and the mountain, while others focus on the background, e.g., the tree branch or the sky. High-level layers have low-resolution feature maps, but this property is preserved. We argue that it is the spatial asymmetry that allows the recalibration module to capture different visual information (foreground vs. background), which allows the weighted neural response \(\tilde{x}_{w,h,d}\) to depend on its spatial location \(\left( w,h\right) \).

5 Conclusions

In this paper, we present a module named MS-SAR for image classification. It aims at equipping each convolutional layer with the ability to incorporate spatial contexts to “recalibrate” neural responses, i.e., summarizing regional information into an importance factor and multiplying the original response by it. We implement each recalibration function as the combination of a multi-scale pooling operation in the spatial domain and a linear model in the channel domain. Experiments on CIFAR and ILSVRC2012 demonstrate the superior performance of MS-SAR over several baseline network architectures.

Our work delivers two messages. First, relying on a gradually increasing receptive field (via local convolution, pooling or down-sampling) is not the best choice for capturing spatial information; MS-SAR is a lightweight yet specifically designed module which deals with this issue more efficiently. Second, there exists a tradeoff between diversity and simplicity, which is why regional pooling works better than sliding pooling. In its current form, MS-SAR is able to add a weight factor to each neural response (unary or linear terms), but unable to explicitly model the co-occurrence of multiple features (binary or higher-order terms). We leave this topic for future research.