CondenseNet: An Efficient DenseNet using Learned Group Convolutions

Kilian Q. Weinberger
Cornell University
kqw4@cornell.edu

arXiv:1711.09224v2 [cs.CV] 7 Jun 2018
2. Related Work and Background

We first review related work on weight pruning, weight quantization, and efficient network architectures, which inspire our work. Next, we review the DenseNets and group convolutions that form the basis for CondenseNet.

[Figure 1. The transformations within a layer in DenseNets (left), and CondenseNets at training time (middle) and at test time (right). The Index and Permute operations are explained in Sections 3.1 and 4.1, respectively. (L-Conv: learned group convolution; G-Conv: group convolution)]

2.1. Related Work

Weight pruning and quantization. CondenseNets are closely related to approaches that improve the inference efficiency of (convolutional) networks via weight pruning [11, 14, 27, 29, 32] and/or weight quantization [21, 36]. These approaches are effective because deep networks often have a substantial number of redundant weights that can be pruned or quantized without sacrificing (and sometimes even improving) accuracy. For convolutional networks, different pruning techniques may lead to different levels of granularity [34]. Fine-grained pruning, e.g., independent weight pruning [10, 27], generally achieves a high degree of sparsity. However, it requires storing a large number of indices, and relies on special hardware/software accelerators. In contrast, coarse-grained pruning methods such as filter-level pruning [1, 14, 29, 32] achieve a lower degree of sparsity, but the resulting networks are much more regular, which facilitates efficient implementations.

[Figure 2. Standard convolution (left) and group convolution (right). The latter enforces a sparsity pattern by partitioning the inputs (and outputs) into disjoint groups.]

CondenseNets also rely on a pruning technique, but differ from prior approaches in two main ways: First, the weight pruning is initiated in the early stages of training, which is substantially more effective and efficient than using L1 regularization throughout. Second, CondenseNets have a higher degree of sparsity than filter-level pruning, yet generate highly efficient group convolutions, reaching a sweet spot between sparsity and regularity.

Efficient network architectures. A range of recent studies has explored efficient convolutional networks that can be trained end-to-end [16, 19, 22, 46, 47, 48, 49]. Three prominent examples of networks that are sufficiently efficient to be deployed on mobile devices are MobileNet [16], ShuffleNet [47], and Neural Architecture Search (NAS) networks [49]. All of these networks use depth-wise separable convolutions, which greatly reduce computational requirements without significantly reducing accuracy. A practical downside of these networks is that depth-wise separable convolutions are not (yet) efficiently implemented in most deep-learning platforms. By contrast, CondenseNet uses the well-supported group convolution operation [25], leading to better computational efficiency in practice.

Architecture-agnostic efficient inference has also been explored by several prior studies. For example, knowledge distillation [3, 15] trains small "student" networks to reproduce the output of large "teacher" networks to reduce test-time costs. Dynamic inference methods [2, 7, 8, 17] adapt the inference to each specific test example, skipping units or even entire layers to reduce computation. We do not explore such approaches here, but believe they can be used in conjunction with CondenseNets.

2.2. DenseNet

Densely connected networks (DenseNets; [19]) consist of multiple dense blocks, each of which consists of multiple layers. Each layer produces k features, where k is referred to as the growth rate of the network. The distinguishing property of DenseNets is that the input of each layer is a concatenation of all feature maps generated by all preceding layers within the same dense block. Each layer performs a sequence of consecutive transformations, as shown in the left part of Figure 1. The first transformation (BN-ReLU, blue) is a composition of batch normalization [23] and rectified linear units [35]. The first convolutional layer in the sequence reduces the number of channels to save computational cost by using 1×1 filters. The output is followed by another BN-ReLU transformation and is then reduced to the final k output features through a 3×3 convolution.
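To make the layer structure concrete, here is a minimal PyTorch sketch of the transformation sequence described above (BN-ReLU, a channel-reducing 1×1 convolution, BN-ReLU, a 3×3 convolution producing k features, and concatenation with the layer input). The module name, the bias-free convolutions, and the 4·k bottleneck width are illustrative assumptions on our part, not the reference implementation.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Sketch of a DenseNet layer: BN-ReLU-Conv(1x1) -> BN-ReLU-Conv(3x3),
    with the k new features concatenated to the layer input.
    The 4*k bottleneck width is an assumption (borrowed from DenseNet-BC)."""

    def __init__(self, in_channels, k):
        super().__init__()
        bottleneck = 4 * k  # assumed width of the channel-reducing 1x1 convolution
        self.conv1 = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, bottleneck, kernel_size=1, bias=False))
        self.conv2 = nn.Sequential(
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, k, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        out = self.conv2(self.conv1(x))
        return torch.cat([x, out], dim=1)  # dense connectivity within the block
```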
[Figure 3. Illustration of learned group convolutions with G = 3 groups and a condensation factor of C = 3. During training a fraction of (C−1)/C connections are removed after each of the C−1 condensing stages. Filters from the same group use the same set of features, and during test-time the index layer rearranges the features to allow the resulting model to be implemented as standard group convolutions.]
2.3. Group Convolution

Group convolution is a special case of a sparsely connected convolution, as illustrated in Figure 2. It was first used in the AlexNet architecture [25], and has more recently been popularized by its successful application in ResNeXt [43]. Standard convolutional layers (left illustration in Figure 2) generate O output features by applying a convolutional filter (one per output) over all R input features, leading to a computational cost of R×O. In comparison, group convolution (right illustration) reduces this computational cost by partitioning the input features into G mutually exclusive groups, each producing its own outputs, thereby reducing the computational cost by a factor of G, to (R×O)/G.
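This reduction is easy to verify with the groups argument of a standard 1×1 convolution; the values of R, O, and G below are arbitrary illustrative choices, not taken from the paper.

```python
import torch.nn as nn

R, O, G = 12, 12, 3  # input features, output features, groups (illustrative values)

standard = nn.Conv2d(R, O, kernel_size=1, bias=False)            # cost ~ R x O
grouped = nn.Conv2d(R, O, kernel_size=1, groups=G, bias=False)   # cost ~ (R x O) / G

n_std = sum(p.numel() for p in standard.parameters())  # 12 * 12 = 144 weights
n_grp = sum(p.numel() for p in grouped.parameters())   # 144 / 3  = 48 weights
print(n_std, n_grp)
```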
3. CondenseNets

Group convolution works well with many deep neural network architectures [43, 46, 47] that are connected in a layer-by-layer fashion. For dense architectures, group convolution can be used in the 3×3 convolutional layer (see Figure 1, left). However, preliminary experiments show that a naïve adaptation of group convolution in the 1×1 convolutional layer leads to drastic reductions in accuracy. We surmise that this is caused by the fact that the inputs to the 1×1 convolutional layer are concatenations of feature maps generated by preceding layers. Therefore, they differ in two ways from typical inputs to convolutional layers: 1. they have an intrinsic order; and 2. they are far more diverse. The hard assignment of these features to disjoint groups hinders effective re-use of features in the network. Experiments in which we randomly permute the input feature maps in each layer before performing the group convolution show that this reduces the negative impact on accuracy, but even with the random permutation, group convolution in the 1×1 convolutional layer makes DenseNets less accurate than, for example, smaller DenseNets with equivalent computational cost.

It is shown in [19] that making early features available as inputs to later layers is important for efficient feature re-use. Although not all prior features are needed at every subsequent layer, it is hard to predict which features should be utilized at what point. To address this problem, we develop an approach that learns the input feature groupings automatically during training. Learning the group structure allows each filter group to select its own set of most relevant inputs. Further, we allow multiple groups to share input features, and we allow features to be ignored by all groups. Note that in a DenseNet, even if an input feature is ignored by all groups in a specific layer, it can still be utilized by some groups at different layers. To differentiate it from regular group convolution, we refer to our approach as learned group convolution.

3.1. Learned Group Convolution

We learn group convolutions through a multi-stage process, illustrated in Figures 3 and 4. The first half of the training iterations comprises the condensing stages. Here, we repeatedly train the network with sparsity-inducing regularization for a fixed number of iterations and subsequently prune away unimportant filters with low-magnitude weights. The second half of the training consists of the optimization stage, in which we learn the filters after the groupings are fixed. When performing the pruning, we ensure that filters from the same group share the same sparsity pattern. As a result, the sparsified layer can be implemented using a standard group convolution once training is completed (testing stage). Because group convolutions are efficiently implemented by many deep-learning libraries, this leads to high computational savings both in theory and in practice. We present the details of our approach below.

Filter Groups. We start with a standard convolution whose filter weights form a 4D tensor of size O×R×W×H, where O, R, W, and H denote the number of output channels, the number of input channels, and the width and height of the filter kernels, respectively. As we focus on the 1×1 convolutional layer in DenseNets, the 4D tensor reduces to an O×R matrix F. We consider this simplified case in this paper, but our procedure can readily be used with larger convolutional kernels. Before training, we first split the filters (or, equivalently, the output features) into G groups of equal size. We denote the filter weights for these groups by F^1, ..., F^G; each F^g has size (O/G)×R, and F^g_{i,j} corresponds to the weight of the jth input for the ith output within group g.
Because the output features do not have an implicit ordering, this random grouping does not negatively affect the quality of the layer.

[Figure 4. The cosine shape learning rate and a typical training loss curve with a condensation factor of C = 4.]

Condensation Criterion. During the training process we gradually screen out subsets of less important input features for each group. The importance of the jth incoming feature map for the filter group g is evaluated by the averaged absolute value of the weights between them across all outputs within the group, i.e., by \sum_{i=1}^{O/G} |F^g_{i,j}|. In other words, we remove columns of F^g (by zeroing them out) if their L1-norm is small compared to the L1-norm of the other columns. This results in a convolutional layer that is structurally sparse: filters from the same group always receive the same set of features as input.
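A minimal sketch of this criterion for a single 1×1 layer, assuming the masking implementation described in the footnote to the condensation procedure below; the function name, tensor layout, and drop_fraction argument are our own choices rather than the paper's code.

```python
import torch

def condense_step(F, mask, num_groups, drop_fraction):
    """Zero out the least important input columns for each filter group.

    F, mask: (O, R) tensors for a 1x1 convolution; mask is binary and pruned
    entries stay zero across stages. Illustrative sketch, not the authors' code.
    With drop_fraction = 1/C, each call prunes R/C columns per group.
    """
    O, R = F.shape
    rows_per_group = O // num_groups
    num_drop = int(drop_fraction * R)
    for g in range(num_groups):
        rows = slice(g * rows_per_group, (g + 1) * rows_per_group)
        # importance of input feature j for group g: sum_i |F^g_{i,j}|
        importance = (F[rows] * mask[rows]).abs().sum(dim=0)
        alive = torch.where(mask[rows].sum(dim=0) > 0)[0]  # columns not yet pruned
        drop = alive[importance[alive].argsort()[:num_drop]]
        mask[rows, drop] = 0  # whole columns: same sparsity pattern for the group
    return mask
```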
Group Lasso. To reduce the negative effects on accuracy introduced by weight pruning, L1 regularization is commonly used to induce sparsity [29, 32]. In CondenseNets, we encourage convolutional filters from the same group to use the same subset of incoming features, i.e., we induce group-level sparsity instead. To this end, we use the following group-lasso regularizer [44] during training:

    \sum_{g=1}^{G} \sum_{j=1}^{R} \sqrt{\sum_{i=1}^{O/G} \left(F^g_{i,j}\right)^2}.

The group-lasso regularizer simultaneously pushes all the elements of a column of F^g to zero, because the term inside the square root is dominated by the largest elements in that column. This induces the group-level sparsity we aim for.
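As a sketch, the regularizer can be computed directly from the O×R weight matrix of the 1×1 layer. The small clamp is our own addition for numerical stability at exactly-zero columns, and the weighting of this term in the training loss is not shown.

```python
import torch

def group_lasso_penalty(F, num_groups):
    """Group lasso: sum_g sum_j sqrt( sum_i (F^g_{i,j})^2 ) for an (O, R) matrix F.

    Illustrative sketch only; add the returned scalar (times a regularization
    coefficient) to the training loss.
    """
    O, R = F.shape
    # one (O/G, R) block of consecutive output rows per filter group
    blocks = F.view(num_groups, O // num_groups, R)
    # column-wise L2 norm within each group, summed over columns and groups
    return blocks.pow(2).sum(dim=1).clamp_min(1e-12).sqrt().sum()
```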
Condensation Factor. In addition to the fact that learned group convolutions are able to automatically discover good connectivity patterns, they are also more flexible than standard group convolutions. In particular, the proportion of feature maps used by a group does not necessarily need to be 1/G. We define a condensation factor C, which may differ from G, and allow each group to select R/C of the inputs.
Condensation Procedure. In contrast to approaches that prune weights in pre-trained networks, our weight pruning process is integrated into the training procedure. As illustrated in Figure 3 (which uses C = 3), at the end of each of the C−1 condensing stages we prune 1/C of the filter weights. By the end of training, only 1/C of the weights remain in each filter group. In all our experiments we set the number of training epochs of the condensing stages to M/(2(C−1)), where M denotes the total number of training epochs, such that the first half of the training epochs is used for condensing. In the second half of the training process, the optimization stage, we train the sparsified model.²
² In our implementation of the training procedure we do not actually remove the pruned weights, but instead mask the filter F by a binary tensor M of the same size using an element-wise product. The mask is initialized with only ones, and elements corresponding to pruned weights are set to zero. This implementation via masking is more efficient on GPUs, as it does not require sparse matrix operations. In practice, the pruning hardly increases the wall time needed to perform a forward-backward pass during training.
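The epoch schedule implied by the condensation procedure can be made explicit with a small helper; this is a hypothetical illustration (function name and rounding are ours), not code from the paper.

```python
def condensation_schedule(total_epochs, C):
    """Return (epoch, fraction_of_weights_remaining) for each condensing stage.

    Each of the C-1 condensing stages lasts M / (2(C-1)) epochs, and 1/C of the
    weights are pruned at the end of each stage. Illustrative sketch only.
    """
    stage_len = total_epochs / (2 * (C - 1))
    return [(round(stage * stage_len), 1.0 - stage / C) for stage in range(1, C)]

# Example: condensation_schedule(300, 4) -> [(50, 0.75), (100, 0.5), (150, 0.25)],
# consistent with the final condensation at epoch 150 visible in Figure 4.
```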
Learning rate. We adopt the cosine shape learning rate schedule of Loshchilov et al. [33], which smoothly anneals the learning rate and usually leads to improved accuracy [18, 49]. Figure 4 visualizes the learning rate as a function of the training epoch (in magenta), and the corresponding training loss (blue curve) of a CondenseNet trained on the CIFAR-10 dataset [24]. The abrupt increase in the loss at epoch 150 is caused by the final condensation operation, which removes half of the remaining weights. However, the plot shows that the model gradually recovers from this pruning step in the optimization stage.
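For reference, a single-cycle cosine schedule of this shape (without the warm restarts of [33]) can be written as follows; the initial value of 0.1 matches the training details reported in Section 4, and stepping once per epoch is a simplification we assume here.

```python
import math

def cosine_lr(epoch, total_epochs, lr_max=0.1):
    """Cosine-shaped learning rate annealing smoothly from lr_max to 0.
    Sketch of the schedule plotted in Figure 4 (single cycle, no restarts)."""
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * epoch / total_epochs))
```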
Index Layer. After training we remove the pruned weights and convert the sparsified model into a network with a regular connectivity pattern that can be efficiently deployed on devices with limited computational power. For this reason we introduce an index layer that implements the feature selection and rearrangement operation (see Figure 3, right). The convolutional filters in the output of the index layer are rearranged to be amenable to existing (and highly optimized) implementations of regular group convolution. Figure 1 shows the transformations of the CondenseNet layers during training (middle) and during testing (right). During training the 1×1 convolution is a learned group convolution (L-Conv), but during testing, with the help of the index layer, it becomes a standard group convolution (G-Conv).
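A sketch of how such an index layer can be paired with a standard group convolution at test time; the module name and the way the index buffer would be derived from the learned sparsity pattern are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class IndexedGroupConv1x1(nn.Module):
    """Index layer (channel selection and rearrangement) followed by a standard
    1x1 group convolution. Sketch only: `index` lists, group by group, the input
    feature ids each group learned to keep during training."""

    def __init__(self, index, out_channels, groups):
        super().__init__()
        self.register_buffer("index", index)               # shape: (G * R/C,)
        self.conv = nn.Conv2d(index.numel(), out_channels,
                              kernel_size=1, groups=groups, bias=False)

    def forward(self, x):
        x = torch.index_select(x, dim=1, index=self.index)  # select + rearrange
        return self.conv(x)                                  # regular group conv
```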
[Figure 5 (referenced in Section 3.2; caption not recovered). Recovered panel labels: Global Pooling, 4×4 Pooling, 2×2 Pooling, Identity.]

[Figure 6. Ablation study on CIFAR-10 to investigate the efficiency gains obtained by the various components of CondenseNet. Legend: LGC(×)-IGR(×)-FDC(×): DenseNets; LGC(✓)-IGR(×)-FDC(×): CondenseNets^light; LGC(✓)-IGR(✓)-FDC(×); LGC(✓)-IGR(✓)-FDC(✓): CondenseNets.]
3.2. Architecture Design

In addition to the use of learned group convolutions introduced above, we make two changes to the regular DenseNet architecture. These changes are designed to further simplify the architecture and improve its computational efficiency. Figure 5 illustrates the two changes that we made to the DenseNet architecture.

Exponentially increasing growth rate. The original DenseNet design adds k new feature maps at each layer, where k is a constant referred to as the growth rate. As shown in [19], deeper layers in a DenseNet tend to rely on high-level features more than on low-level features. This motivates us to improve the network by strengthening short-range connections. We found that this can be achieved by gradually increasing the growth rate as the depth grows. This increases the proportion of features coming from later layers relative to those from earlier layers. For simplicity, we set the growth rate to k = 2^{m−1} k0, where m is the index of the dense block, and k0 is a constant. This way of setting the growth rate does not introduce any additional hyper-parameters. The "increasing growth rate" (IGR) strategy places a larger proportion of parameters in the later layers of the model. This increases the computational efficiency substantially but may decrease the parameter efficiency in some cases. Depending on the specific hardware limitations, it may be advantageous to trade off one for the other [22].
Fully dense connectivity. To encourage feature re-use even more than the original DenseNet architecture does already, we connect input layers to all subsequent layers in the network, even if these layers are located in different dense blocks (see Figure 5). As dense blocks have different feature resolutions, we downsample feature maps with higher resolutions when we use them as inputs into lower-resolution layers, using average pooling.
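A sketch of this connectivity pattern, assuming the spatial sizes of earlier feature maps are multiples of the current resolution (the function name and tensor layout are our own choices):

```python
import torch
import torch.nn.functional as F

def gather_inputs(prior_features, target_size):
    """Concatenate feature maps from all preceding layers, including those in
    earlier dense blocks, average-pooling higher-resolution maps down to the
    current resolution. Illustrative sketch of fully dense connectivity."""
    pooled = []
    for feat in prior_features:                     # each feat: (N, C_i, H_i, W_i)
        if feat.shape[-1] > target_size:
            stride = feat.shape[-1] // target_size  # e.g. 2x2 or 4x4 average pooling
            feat = F.avg_pool2d(feat, kernel_size=stride, stride=stride)
        pooled.append(feat)
    return torch.cat(pooled, dim=1)
```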
4. Experiments

We evaluate CondenseNets on the CIFAR-10, CIFAR-100 [24], and ImageNet (ILSVRC 2012; [6]) image-classification datasets. The models and code reproducing our experiments are publicly available at https://github.com/ShichenLiu/CondenseNet.

Datasets. The CIFAR-10 and CIFAR-100 datasets consist of RGB images of size 32×32 pixels, corresponding to 10 and 100 classes, respectively. Both datasets contain 50,000 training images and 10,000 test images. We use a standard data-augmentation scheme [20, 26, 28, 30, 37, 39, 41], in which the images are zero-padded with 4 pixels on each side, randomly cropped to produce 32×32 images, and horizontally mirrored with probability 0.5.

The ImageNet dataset comprises 1000 visual classes, and contains a total of 1.2 million training images and 50,000 validation images. We adopt the data-augmentation scheme of [12] at training time, and perform a rescaling to 256×256 followed by a 224×224 center crop at test time before feeding the input image into the networks.

4.1. Results on CIFAR

We first perform a set of experiments on CIFAR-10 and CIFAR-100 to validate the effectiveness of learned group convolutions and the proposed CondenseNet architecture.

Model configurations. Unless otherwise specified, we use the following network configurations in all experiments on the CIFAR datasets. The standard DenseNet has a constant growth rate of k = 12, following [19]; our proposed architecture uses growth rates k0 ∈ {8, 16, 32} to ensure that the growth rate is divisible by the number of groups. The learned group convolution is only applied to the first convolutional layer (with filter size 1×1, see Figure 1) of each basic layer, with a condensation factor of C = 4, i.e., 75% of the filter weights are gradually pruned during training in steps of 25%. The 3×3 convolutional layers are replaced by standard group convolutions (without applying learned group convolution) with four groups. Following [46, 47], we permute the output channels of the first 1×1 learned group convolutional layer, such that the features generated by each of its groups are evenly used by all the groups of the subsequent 3×3 group convolutional layer.
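This permutation can be implemented as a ShuffleNet-style channel shuffle between the 1×1 learned group convolution and the 3×3 group convolution; the sketch below assumes the channel count is divisible by the number of groups and is not taken from the reference implementation.

```python
import torch

def permute_channels(x, groups):
    """Interleave the channels produced by the G groups of the preceding 1x1
    learned group convolution, so that every group of the following 3x3 group
    convolution receives features from all G groups. Sketch only."""
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2)
             .contiguous()
             .view(n, c, h, w))
```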
Training details. We train all models with stochastic gradient descent (SGD) using similar optimization hyper-parameters as in [12, 19]. Specifically, we adopt Nesterov momentum with a momentum weight of 0.9 without dampening, and use a weight decay of 10^−4. All models are trained with mini-batch size 64 for 300 epochs, unless otherwise specified. We use a cosine shape learning rate which starts from 0.1 and gradually reduces to 0. Dropout [40] with a drop rate of 0.1 is applied to train CondenseNets with more than 3 million parameters (shown in Table 1).
Model                        Params   FLOPs     C-10   C-100
ResNet-1001 [13]             16.1M    2,357M    4.62   22.71
Stochastic-Depth-1202 [20]   19.4M    2,840M    4.91   -
Wide-ResNet-28 [45]          36.5M    5,248M    4.00   19.25
ResNeXt-29 [43]              68.1M    10,704M   3.58   17.31
DenseNet-190 [19]            25.6M    9,388M    3.46   17.18
NASNet-A* [49]               3.3M     -         3.41   -
CondenseNet^light-160*       3.1M     1,084M    3.46   17.55
CondenseNet-182*             4.2M     513M      3.76   18.47

Table 1. Comparison of classification error rate (%) with other convolutional networks on the CIFAR-10 (C-10) and CIFAR-100 (C-100) datasets. * indicates models that are trained with cosine shape learning rate for 600 epochs.

Model                      FLOPs   Params   C-10   C-100
VGG-16-pruned [29]         206M    5.40M    6.60   25.28
VGG-19-pruned [32]         195M    2.30M    6.20   -
VGG-19-pruned [32]         250M    5.00M    -      26.52
ResNet-56-pruned [14]      62M     -        8.20   -
ResNet-56-pruned [29]      90M     0.73M    6.94   -
ResNet-110-pruned [29]     213M    1.68M    6.45   -
ResNet-164-B-pruned [32]   124M    1.21M    5.27   23.91
DenseNet-40-pruned [32]    190M    0.66M    5.19   25.28
CondenseNet^light-94       122M    0.33M    5.00   24.08
CondenseNet-86             65M     0.52M    5.00   23.64

Table 2. Comparison of classification error rate (%) on CIFAR-10 (C-10) and CIFAR-100 (C-100) with state-of-the-art filter-level weight pruning methods.

Component analysis. Figure 6 compares the computational efficiency gains obtained by each component of CondenseNet: learned group convolution (LGC), exponentially increasing growth rate (IGR), and full dense connectivity (FDC). Specifically, the figure plots the test error as a function of the number of FLOPs (i.e., multiply-addition operations). The large gap between the two red curves with dot markers shows that learned group convolution significantly improves the efficiency of our models. Compared to DenseNets, CondenseNet^light only requires half the number of FLOPs to achieve comparable accuracy. Further, we observe that the exponentially increasing growth rate yields even further efficiency gains. Full dense connectivity does not boost the efficiency significantly on CIFAR-10, but there does appear to be a trend that as models get larger, full connectivity starts to help. We opt to include this architecture change in the CondenseNet model, as it does lead to substantial improvements on ImageNet (see Section 4.2).

Comparison with state-of-the-art efficient CNNs. In Table 1, we show the results of experiments comparing a 160-layer CondenseNet^light and a 182-layer CondenseNet with alternative state-of-the-art CNN architectures. Following [49], our models were trained for 600 epochs. From the results, we observe that CondenseNet requires approximately 8× fewer parameters and FLOPs to achieve accuracy comparable to DenseNet-190. CondenseNet seems to be less parameter-efficient than CondenseNet^light, but is more compute-efficient. Somewhat surprisingly, our CondenseNet^light model performs on par with NASNet-A, an architecture that was obtained using an automated search procedure over 20,000 candidate architectures composed of a rich set of components, and that is thus carefully tuned to the CIFAR-10 dataset [49]. Moreover, CondenseNet (and CondenseNet^light) does not use depth-wise separable convolutions, and only uses simple convolutional filters of size 1×1 and 3×3. It may be possible to include CondenseNet as a meta-architecture in the procedure of [49] to obtain even more efficient networks.

Comparison with existing pruning techniques. In Table 2, we compare our CondenseNets and CondenseNets^light with models obtained by state-of-the-art filter-level weight pruning techniques [14, 29, 32]. The results show that, in general, CondenseNet is about 3× more efficient in terms of FLOPs than ResNets or DenseNets pruned by the method introduced in [32]. The advantage over the other pruning techniques is even more pronounced. We also report the results for CondenseNet^light in the second-to-last row of Table 2. It uses only half the number of parameters to achieve performance comparable to the most competitive baseline, the 40-layer DenseNet described by [32].

CondenseNet                              Feature map size
3×3 Conv (stride 2)                      112×112
[1×1 L-Conv; 3×3 G-Conv] ×4 (k = 8)      112×112
2×2 average pool, stride 2               56×56
[1×1 L-Conv; 3×3 G-Conv] ×6 (k = 16)     56×56
2×2 average pool, stride 2               28×28
[1×1 L-Conv; 3×3 G-Conv] ×8 (k = 32)     28×28
2×2 average pool, stride 2               14×14
[1×1 L-Conv; 3×3 G-Conv] ×10 (k = 64)    14×14
2×2 average pool, stride 2               7×7
[1×1 L-Conv; 3×3 G-Conv] ×8 (k = 128)    7×7
7×7 global average pool                  1×1
1000-dim fully-connected, softmax

Table 3. CondenseNet architectures for ImageNet.

4.2. Results on ImageNet

In a second set of experiments, we test CondenseNet on the ImageNet dataset.

Model configurations. Detailed network configurations are shown in Table 3. To reduce the number of parameters, we prune 50% of the weights from the fully connected (FC) layer at epoch 60, in a way similar to the learned group convolution, but with G = 1 (as the FC layer cannot be split into multiple groups) and C = 2. Similar to prior studies on MobileNets and ShuffleNets, we focus on training relatively small models that require less than 600 million FLOPs to perform inference on a single image.
Model                      FLOPs    Params   Top-1   Top-5
Inception V1 [42]          1,448M   6.6M     30.2    10.1
1.0 MobileNet-224 [16]     569M     4.2M     29.4    10.5
ShuffleNet 2x [47]         524M     5.3M     29.1    10.2
NASNet-A (N=4) [49]        564M     5.3M     26.0    8.4
NASNet-B (N=4) [49]        488M     5.3M     27.2    8.7
NASNet-C (N=3) [49]        558M     4.9M     27.5    9.0
CondenseNet (G = C = 8)    274M     2.9M     29.0    10.0
CondenseNet (G = C = 4)    529M     4.8M     26.2    8.3

Table 4. Comparison of Top-1 and Top-5 classification error rate (%) with other state-of-the-art compact models on ImageNet.

Model                      FLOPs     Top-1   Time(s)
VGG-16                     15,300M   28.5    354
ResNet-18                  1,818M    30.2    8.14
1.0 MobileNet-224 [16]     569M      29.4    1.96
CondenseNet (G = C = 4)    529M      26.2    1.89
CondenseNet (G = C = 8)    274M      29.0    0.99

Table 5. Actual inference time of different models on an ARM processor. All models are trained on ImageNet, and accept input with resolution 224×224.
Training details. We train all models using stochastic gradient descent (SGD) with a batch size of 256. As before, we adopt Nesterov momentum with a momentum weight of 0.9 without dampening, and a weight decay of 10^−4. All models are trained for 120 epochs, with a cosine shape learning rate which starts from 0.1 and gradually reduces to 0. We use group-lasso regularization in all experiments on ImageNet; the regularization parameter is set to 10^−5.
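A sketch of the corresponding optimizer configuration in PyTorch; the model variable and the per-epoch stepping of the scheduler are assumptions on our part, and the data loading with batch size 256 is not shown.

```python
from torch import optim

# Assumes `model` is a CondenseNet-style nn.Module defined elsewhere.
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                      dampening=0, weight_decay=1e-4, nesterov=True)
# Cosine annealing from 0.1 to 0 over 120 epochs (scheduler.step() called per epoch).
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=120, eta_min=0)
```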
Comparison with state-of-the-art efficient CNNs. Table 4 shows the results of CondenseNets and several state-of-the-art efficient models on the ImageNet dataset. We observe that a CondenseNet with 274 million FLOPs obtains a 29.0% Top-1 error, which is comparable to the accuracy achieved by MobileNets and ShuffleNets that require twice as much compute. A CondenseNet with 529 million FLOPs produces a 3% absolute reduction in Top-1 error compared to a MobileNet and a ShuffleNet of comparable size. Our CondenseNet even achieves the same accuracy with slightly fewer FLOPs and parameters than the most competitive NASNet-A, despite the fact that we only trained a very small number of models (as opposed to the study that led to the NASNet-A model).

Actual inference time. Table 5 shows the actual inference time on an ARM processor for different models. The wall-time needed to perform inference on a 224×224 image is highly correlated with the number of FLOPs of the model. Compared to the recently proposed MobileNet, our CondenseNet (G = C = 8) with 274 million FLOPs performs inference on an image 2× faster, without sacrificing accuracy.

4.3. Ablation Study

We perform an ablation study on CIFAR-10 in which we investigate the effect of (1) the pruning strategy, (2) the number of groups, and (3) the condensation factor. We also investigate the stability of our weight pruning procedure.

Pruning strategy. The left panel of Figure 7 compares our on-the-fly pruning method with the more common approach of pruning the weights of fully converged models. We use a DenseNet with 50 layers as the basis for this experiment. We implement a "traditional" pruning method in which the weights are pruned in the same way as in CondenseNets, but the pruning is only done once, after training has completed (for 300 epochs). Following [32], we fine-tune the resulting sparsely connected network for another 300 epochs with the same cosine shape learning rate that we use for training CondenseNets. We compare the traditional pruning approach with the CondenseNet approach, setting the number of groups G to 4 in both cases. In both settings, we vary the condensation factor C between 2 and 8.

The results in Figure 7 show that pruning weights gradually during training outperforms pruning weights of fully trained models. Moreover, gradual weight pruning reduces the training time: the "traditional pruning" models were trained for 600 epochs, whereas the CondenseNets were trained for 300 epochs. The results also show that removing 50% of the weights (by setting C = 2) from the 1×1 convolutional layers in a DenseNet incurs hardly any loss in accuracy.

Number of groups. In the middle panel of Figure 7, we compare four CondenseNets with exactly the same network architecture, but with a number of groups, G, that varies between 1 and 8. We fix the condensation factor, C, to 8 for all models, which implies that all models have the same number of parameters after training has completed. In CondenseNets with a single group, we discard entire filters in the same way that is common in filter-pruning techniques [29, 32]. The results presented in the figure demonstrate that the test error tends to decrease as the number of groups increases. This result is in line with our analysis in Section 3; in particular, it suggests that grouping filters gives the training algorithm more flexibility to remove redundant weights.

Effect of the condensation factor. In the right panel of Figure 7, we compare CondenseNets with varying condensation factors. Specifically, we set the condensation factor C to 1, 2, 4, or 8; this corresponds to removing 0%, 50%, 75%, or 87.5% of the weights from each of the 1×1 convolutional layers, respectively. A condensation factor C = 1 corresponds to a baseline model without weight pruning. The number of groups, G, is set to 4 for all networks. The results show that a condensation factor C larger than 1 con-
[Figure 7. Classification error rate (%) on CIFAR-10. Left: Comparison between our condensing method and the traditional pruning approach, under varying condensation factors. Middle: CondenseNets with different numbers of groups for the 1×1 learned group convolution. All models have the same number of parameters. Right: CondenseNets with different condensation factors.]
[Figure: learned connectivity visualization (plot residue removed). Recovered caption fragment: "... on top of the CondenseNet. The gray vertical dotted lines correspond to pooling layers that decrease the feature resolution." Panels: Model (1), Model (2), Model (3); axes: Source layer (s), Depth, Classification layer.]

The results in the figure suggest that while there are differences in learned connectivity at the filter-group level (top row), the overall information flow between layers (bottom