Semantic Image Segmentation with Task-Specific Edge Detection Using CNNs

and a Discriminatively Trained Domain Transform

Liang-Chieh Chen∗, Jonathan T. Barron, George Papandreou, Kevin Murphy, Alan L. Yuille
lcchen@cs.ucla.edu, {barron, gpapan, kpmurphy}@google.com, yuille@stat.ucla.edu, alan.yuille@jhu.edu

Abstract

Deep convolutional neural networks (CNNs) are the backbone of state-of-the-art semantic image segmentation systems. Recent work has shown that complementing CNNs with fully-connected conditional random fields (CRFs) can significantly enhance their object localization accuracy, yet dense CRF inference is computationally expensive. We propose replacing the fully-connected CRF with the domain transform (DT), a modern edge-preserving filtering method in which the amount of smoothing is controlled by a reference edge map. Domain transform filtering is several times faster than dense CRF inference and we show that it yields comparable semantic segmentation results, accurately capturing object boundaries. Importantly, our formulation allows learning the reference edge map from intermediate CNN features instead of using the image gradient magnitude as in standard DT filtering. This produces task-specific edges in an end-to-end trainable system optimizing the target semantic segmentation quality.

Figure 1. A single unified CNN produces both coarse semantic segmentation scores and an edge map, which respectively serve as input multi-channel image and reference edge to a domain transform edge-preserving filter. The resulting filtered semantic segmentation scores are well-aligned with the object boundaries. The full architecture is discriminatively trained by backpropagation (red dashed arrows) to optimize the target semantic segmentation.

1. Introduction

Deep convolutional neural networks (CNNs) are very effective in semantic image segmentation, the task of assigning a semantic label to every pixel in an image. Recently, it has been demonstrated that post-processing the output of a CNN with a fully-connected CRF can significantly increase segmentation accuracy near object boundaries [5].

As explained in [26], mean-field inference in the fully-connected CRF model amounts to iterated application of the bilateral filter, a popular technique for edge-aware filtering. This encourages pixels which are nearby in position and in color to be assigned the same semantic label. In practice, this produces semantic segmentation results which are well aligned with object boundaries in the image.

One key impediment to adopting the fully-connected CRF is the rather high computational cost of the underlying bilateral filtering step. Bilateral filtering amounts to high-dimensional Gaussian filtering in the 5-D bilateral (2-D position, 3-D color) space and is expensive in terms of both memory and CPU time, even when advanced algorithmic techniques are used.

In this paper, we propose replacing the fully-connected CRF and its associated bilateral filtering with the domain transform (DT) [16], an alternative edge-aware filter. The recursive formulation of the domain transform amounts to adaptive recursive filtering of a signal, where information is not allowed to propagate across edges in some reference signal. This results in an extremely efficient scheme which is an order of magnitude faster than the fastest algorithms for a bilateral filter of equivalent quality.

The domain transform can equivalently be seen as a recurrent neural network (RNN). In particular, we show that the domain transform is a special case of the recently proposed RNN with gated recurrent units. This connection allows us to share insights, better understanding two seemingly different methods, as we explain in Section 3.4.

The amount of smoothing in a DT is spatially modulated by a reference edge map, which in the standard DT corresponds to the image gradient magnitude. Instead, we will learn the reference edge map from intermediate layer features of the same CNN that produces the semantic segmentation scores, as illustrated in Fig. 1. Crucially, this allows us to learn a task-specific edge detector tuned for semantic image segmentation in an end-to-end trainable system.

∗ Work done in part during an internship at Google Inc.
We evaluate the performance of the proposed method on the challenging PASCAL VOC 2012 semantic segmentation task. In this task, domain transform filtering is several times faster than dense CRF inference, while performing almost as well in terms of the mean intersection-over-union (mIOU) metric. In addition, although we only trained for semantic segmentation, the learned edge map performs competitively on the BSDS500 edge detection benchmark.

2. Related Work

Semantic image segmentation. Deep Convolutional Neural Networks (CNNs) [27] have demonstrated excellent performance on the task of semantic image segmentation [10, 28, 30]. However, due to the employment of max-pooling layers and downsampling, the output of these networks tends to have poorly localized object boundaries. Several approaches have been adopted to handle this problem. [31, 19, 5] proposed to extract features from the intermediate layers of a deep network to better estimate the object boundaries. Networks employing deconvolutional layers and unpooling layers to recover the "spatial invariance" effect of max-pooling layers have been proposed by [45, 33]. [14, 32] used a super-pixel representation, which essentially appeals to low-level segmentation methods for the task of localization. The fully connected Conditional Random Field (CRF) [26] has been applied to capture long range dependencies between pixels in [5, 28, 30, 34]. Further improvements have been shown in [46, 38] when backpropagating through the CRF to refine the segmentation CNN. In contrast, we adopt another approach based on the domain transform [16] and show that beyond refining the segmentation CNN, we can also jointly learn to detect object boundaries, embedding task-specific edge detection into the proposed model.

Edge detection. The edge/contour detection task has a long history [25, 1, 11], which we will only briefly review. Recently, several works have achieved outstanding performance on the edge detection task by employing CNNs [2, 3, 15, 21, 39, 44]. Our work is most related to the ones by [44, 3, 24]. While Xie and Tu [44] also exploited features from the intermediate layers of a deep network [40] for edge detection, they did not apply the learned edges to high-level tasks, such as semantic image segmentation. On the other hand, Bertasius et al. [3] and Kokkinos [24] made use of their learned boundaries to improve the performance of semantic image segmentation. However, boundary detection and semantic image segmentation are treated there as two separate tasks: they optimized the performance of boundary detection instead of the performance of the high-level task. In contrast, we learn object boundaries in order to directly optimize the performance of semantic image segmentation.

Long range dependency. Recurrent neural networks (RNNs) [12] with long short-term memory (LSTM) units [20] or gated recurrent units (GRUs) [8, 9] have proven successful in modeling long-term dependencies in sequential data (e.g., text and speech). Sainath et al. [37] have combined CNNs and RNNs into one unified architecture for speech recognition. Some recent work has attempted to model spatial long range dependency with recurrent networks for computer vision tasks [17, 41, 35, 4, 43]. Our work, integrating CNNs and the Domain Transform (DT) with recursive filtering [16], bears a similarity to ReNet [43], which also performs recursive operations both horizontally and vertically to capture long range dependency within the whole image. In this work, we show the relationship between DT and GRU, and we also demonstrate the effectiveness of exploiting long range dependency by DT for semantic image segmentation. While [42] has previously employed the DT (for joint object-stereo labeling), we propose to backpropagate through both of the DT inputs to jointly learn segmentation scores and edge maps in an end-to-end trainable system. We show that these learned edge maps bring significant improvements compared to the standard image gradient magnitude used by [42] or the earlier DT literature [16].

3. Proposed Model

3.1. Model overview

Our proposed model consists of three components, illustrated in Fig. 2. They are jointly trained end-to-end to optimize the output semantic segmentation quality.

The first component, which produces coarse semantic segmentation score predictions, is based on the publicly available DeepLab model [5], which modifies the VGG-16 net [40] to be FCN [31]. The model is initialized from the VGG-16 ImageNet [36] pretrained model. We employ the DeepLab-LargeFOV variant of [5], which introduces zeros into the filters to enlarge its Field-Of-View, and which we will simply denote by DeepLab in the sequel.

We add a second component, which we refer to as EdgeNet. The EdgeNet predicts edges by exploiting features from intermediate layers of DeepLab. The features are resized to have the same spatial resolution by bilinear interpolation before concatenation. A convolutional layer with kernel size 1×1 and one output channel is applied to yield the edge prediction. A ReLU is used so that the edge prediction lies in the range of zero to infinity.

The third component in our system is the domain transform (DT), an edge-preserving filter that lends itself to a very efficient implementation by separable 1-D recursive filtering across rows and columns. Though the DT is traditionally used for graphics applications [16], we use it to filter the raw CNN semantic segmentation scores so that they are better aligned with object boundaries, guided by the EdgeNet-produced edge map.
Figure 2. Our proposed model has three components: (1) DeepLab for semantic segmentation prediction, (2) EdgeNet for edge prediction,
and (3) Domain Transform to accurately align segmentation scores with object boundaries. EdgeNet reuses features from intermediate
DeepLab layers, resized and concatenated before edge prediction. Domain transform takes as input the raw segmentation scores and edge
map, and recursively filters across rows and columns to produce the final filtered segmentation scores.
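To make the EdgeNet head concrete, the following is a minimal sketch of the resize-concatenate-1x1-convolve-ReLU recipe described in Sec. 3.1, written in plain NumPy/SciPy rather than the Caffe implementation actually used; the feature shapes, the random stand-in inputs, and the near-zero weight initialization (cf. Sec. 4.1) are illustrative assumptions only:

import numpy as np
from scipy.ndimage import zoom

def edgenet_head(feature_maps, target_hw, weights, bias=0.0):
    # Bilinearly resize each intermediate feature map (C_i, h_i, w_i) to a
    # common resolution, concatenate along channels, apply a 1x1 convolution
    # with a single output channel, and pass through a ReLU so that the
    # predicted edge strength lies in [0, +inf).
    H, W = target_hw
    resized = []
    for f in feature_maps:
        c, h, w = f.shape
        resized.append(zoom(f, (1, H / h, W / w), order=1))   # order=1: bilinear
    stacked = np.concatenate(resized, axis=0)                 # (sum C_i, H, W)
    edge = np.tensordot(weights, stacked, axes=([0], [0])) + bias  # 1x1 conv
    return np.maximum(edge, 0.0)                              # ReLU

# Hypothetical usage with random stand-ins for conv2_2 / conv3_3 / conv4_3 features.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((128, 161, 161)),
         rng.standard_normal((256, 81, 81)),
         rng.standard_normal((512, 41, 41))]
w = rng.normal(0.0, 1e-5, size=128 + 256 + 512)   # near-zero init, as in Sec. 4.1
edge_map = edgenet_head(feats, target_hw=(321, 321), weights=w)

Inside the full model this 1x1 convolution is of course a trained layer, and the resize-then-convolve order can be swapped (convolve first, then upsample) for speed, as noted in Table 5.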

We review the standard DT in Sec. 3.2, extend it to a fully trainable system with learned edge detection in Sec. 3.3, and discuss connections with the recently proposed gated recurrent unit networks in Sec. 3.4.

3.2. Domain transform with recursive filtering

The domain transform takes two inputs: (1) the raw input signal x to be filtered, which in our case corresponds to the coarse DCNN semantic segmentation scores, and (2) a positive "domain transform density" signal d, whose choice we discuss in detail in the following section. The output of the DT is a filtered signal y. We will use the recursive formulation of the DT due to its speed and efficiency, though the filter can be applied via other techniques [16].

For 1-D signals of length N, the output is computed by setting y1 = x1 and then recursively for i = 2, . . . , N

    yi = (1 − wi) xi + wi yi−1 .   (1)

The weight wi depends on the domain transform density di:

    wi = exp(−√2 di / σs) ,   (2)

where σs is the standard deviation of the filter kernel over the input's spatial domain.

Intuitively, the strength of the domain transform density di ≥ 0 determines the amount of diffusion/smoothing by controlling the relative contribution of the raw input signal xi versus the filtered signal value at the previous position yi−1 when computing the filtered signal at the current position yi. The value of wi ∈ (0, 1) acts like a gate, which controls how much information is propagated from pixel i − 1 to i. We have full diffusion when di is very small, resulting in wi = 1 and yi = yi−1. At the other extreme, if di is very large, then wi = 0 and diffusion stops, resulting in yi = xi.

Filtering by Eq. (1) is asymmetric, since the current output only depends on previous outputs. To overcome this asymmetry, we filter 1-D signals twice, first left-to-right, then right-to-left on the output of the left-to-right pass.

Domain transform filtering for 2-D signals works in a separable fashion, employing 1-D filtering sequentially along each signal dimension. That is, a horizontal pass (left-to-right and right-to-left) is performed along each row, followed by a vertical pass (top-to-bottom and bottom-to-top) along each column. In practice, K > 1 iterations of the two-pass 1-D filtering process can suppress "striping" artifacts resulting from 1-D filtering on 2-D signals [16, Fig. 4]. We reduce the standard deviation of the DT filtering kernel at each iteration, requiring that the sum of the total variances equals the desired variance σs², following [16, Eq. 14]:

    σk = σs √3 · 2^(K−k) / √(4^K − 1) ,   k = 1, . . . , K ,   (3)

plugging σk in place of σs to compute the weights wi by Eq. (2) at the k-th iteration.

The domain transform density values di are defined as

    di = 1 + (σs / σr) gi ,   (4)

where gi ≥ 0 is the "reference edge" and σr is the standard deviation of the filter kernel over the reference edge map's range. Note that the larger the value of gi, the more confident the model is that there is a strong edge at pixel i, thus inhibiting diffusion (i.e., di → ∞ and wi = 0). The standard DT [16] usually employs the color image gradient magnitude

    gi = Σ_{c=1}^{3} ‖∇Ii^(c)‖ ,   (5)

but we show next that better results can be obtained by computing the reference edge map with a learned DCNN.
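The recursions in Eqs. (1)-(4) translate directly into code. The following is a small NumPy sketch of the forward filter (scores of shape (C, H, W), reference edge map of shape (H, W)); it mirrors the equations rather than the paper's optimized Caffe/GPU implementation, and the example inputs at the bottom are random stand-ins:

import numpy as np

def dt_filter(scores, edges, sigma_s, sigma_r, K=3):
    # Domain transform filtering of segmentation scores guided by a reference
    # edge map, following Eqs. (1)-(4).
    d = 1.0 + (sigma_s / sigma_r) * edges                     # Eq. (4): DT density
    y = scores.astype(np.float64)
    for k in range(1, K + 1):
        # Eq. (3): shrink the kernel std per iteration so variances sum to sigma_s^2
        sigma_k = sigma_s * np.sqrt(3.0) * 2.0 ** (K - k) / np.sqrt(4.0 ** K - 1.0)
        w = np.exp(-np.sqrt(2.0) * d / sigma_k)               # Eq. (2): per-pixel gate
        y = _two_pass(y, w, axis=2)                           # horizontal pass (rows)
        y = _two_pass(y, w, axis=1)                           # vertical pass (columns)
    return y

def _two_pass(y, w, axis):
    # Left-to-right then right-to-left recursive filtering (Eq. (1)) along `axis`,
    # sharing the same gate w in both sweep directions as described in Sec. 3.3.
    w = np.broadcast_to(w, y.shape)
    y = np.moveaxis(y, axis, -1).copy()
    w = np.moveaxis(w, axis, -1)
    for i in range(1, y.shape[-1]):                           # forward sweep
        y[..., i] = (1 - w[..., i]) * y[..., i] + w[..., i] * y[..., i - 1]
    for i in range(y.shape[-1] - 2, -1, -1):                  # backward sweep
        y[..., i] = (1 - w[..., i]) * y[..., i] + w[..., i] * y[..., i + 1]
    return np.moveaxis(y, -1, axis)

# Hypothetical usage; the standard DT would derive `edges` from the color
# gradient magnitude (Eq. (5)), whereas in our model it is the EdgeNet output.
rng = np.random.default_rng(0)
scores = rng.standard_normal((21, 321, 321))                  # 21 PASCAL VOC classes
edges = np.abs(rng.standard_normal((321, 321)))
filtered = dt_filter(scores, edges, sigma_s=100.0, sigma_r=1.0, K=3)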
Figure 3. Computation tree for domain transform recursive filtering: (a) Forward pass. Upward arrows from yi nodes denote feeds to subsequent layers. (b) Backward pass, including contributions ∂L/∂yi from subsequent layers.

3.3. Trainable domain transform filtering

One novel aspect of our proposed approach is to backpropagate the segmentation errors at the DT output y through the DT onto its two inputs. This allows us to use the DT as a layer in a CNN, thereby allowing us to jointly learn DCNNs that compute the coarse segmentation score maps in x and the reference edge map in g.

We demonstrate how DT backpropagation works for the 1-D filtering process of Eq. (1), whose forward pass is illustrated as a computation tree in Fig. 3(a). We assume that each node yi not only influences the following node yi+1 but also feeds a subsequent layer, thus also receiving gradient contributions ∂L/∂yi from that layer during back-propagation. Similar to standard back-propagation in time, we unroll the recursion of Eq. (1) in reverse for i = N, . . . , 2 as illustrated in Fig. 3(b) to update the derivatives with respect to y, and to also compute derivatives with respect to x and w:

    ∂L/∂xi ← (1 − wi) ∂L/∂yi   (6)
    ∂L/∂wi ← ∂L/∂wi + (yi−1 − xi) ∂L/∂yi   (7)
    ∂L/∂yi−1 ← ∂L/∂yi−1 + wi ∂L/∂yi ,   (8)

where ∂L/∂xi and ∂L/∂wi are initialized to 0 and ∂L/∂yi is initially set to the value sent by the subsequent layer. Note that the weight wi is shared across all filtering stages (i.e., left-to-right/right-to-left within the horizontal pass and top-to-bottom/bottom-to-top within the vertical pass) and across the K iterations, with each pass contributing to the partial derivative.

With these partial derivatives we can produce derivatives with respect to the reference edge gi. Plugging Eq. (4) into Eq. (2) yields

    wi = exp(−(√2/σk)(1 + (σs/σr) gi)) .   (9)

Then, by the chain rule, the derivative with respect to gi is

    ∂L/∂gi = −(√2/σk)(σs/σr) wi ∂L/∂wi .   (10)

This gradient is then further propagated onto the deep convolutional neural network that generated the edge predictions that were used as input to the DT.
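As a concrete illustration of Eqs. (6)-(8), the sketch below backpropagates through a single left-to-right 1-D pass of Eq. (1). It is a simplified NumPy rendering (one pass, one 1-D signal); in the full model these contributions are accumulated over both sweep directions, both axes, and all K iterations, and Eq. (10) then maps dL/dw onto the reference edge map:

import numpy as np

def dt_pass_forward(x, w):
    # Single left-to-right pass of Eq. (1); returns the filtered 1-D signal y.
    y = x.copy()
    for i in range(1, len(x)):
        y[i] = (1 - w[i]) * x[i] + w[i] * y[i - 1]
    return y

def dt_pass_backward(x, y, w, dL_dy):
    # Backpropagation through the pass above, following Eqs. (6)-(8).
    # dL_dy holds the gradients sent by subsequent layers; it is updated while
    # gradients w.r.t. x and w are accumulated, unrolling the recursion in reverse.
    dL_dx = np.zeros_like(x)
    dL_dw = np.zeros_like(w)
    dL_dy = dL_dy.copy()
    for i in range(len(x) - 1, 0, -1):
        dL_dx[i] += (1 - w[i]) * dL_dy[i]           # Eq. (6)
        dL_dw[i] += (y[i - 1] - x[i]) * dL_dy[i]    # Eq. (7)
        dL_dy[i - 1] += w[i] * dL_dy[i]             # Eq. (8)
    dL_dx[0] += dL_dy[0]                            # boundary: y_1 = x_1
    return dL_dx, dL_dw, dL_dy

# Eq. (10) would then convert dL/dw_i into the gradient w.r.t. the reference
# edge, dL/dg_i = -(sqrt(2)/sigma_k) * (sigma_s/sigma_r) * w_i * dL/dw_i,
# which is what flows back into the EdgeNet.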
3.4. Relation to gated recurrent unit networks

Equation (1) defines DT filtering as a recursive operation. It is interesting to draw connections with other recent RNN formulations. Here we establish a precise connection with the gated recurrent unit (GRU) RNN architecture [8] recently proposed for modeling sequential text data. The GRU employs the update rule

    yi = zi ỹi + (1 − zi) yi−1 .   (11)

Comparing with Eq. (1), we can relate the GRU's "update gate" zi and "candidate activation" ỹi with the DT's weight and raw input signal as follows: zi = 1 − wi and ỹi = xi.

The GRU update gate zi is defined as zi = σ(fi), where fi is an activation signal and σ(t) = 1/(1 + e^−t). Comparing with Eq. (9) yields a direct correspondence between the DT reference edge map gi and the GRU activation fi:

    gi = (σr/σs) ((σk/√2) log(1 + e^fi) − 1) .   (12)
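The correspondence in Eq. (12) is stated without intermediate steps; the short algebra behind it (reconstructed here for clarity, using only Eqs. (9) and (11)) is:

    zi = σ(fi) = 1 − wi   ⟹   wi = 1 − σ(fi) = 1 / (1 + e^fi) ;
    taking logarithms of Eq. (9):   (√2/σk)(1 + (σs/σr) gi) = log(1 + e^fi) ;
    solving for gi:   gi = (σr/σs) ((σk/√2) log(1 + e^fi) − 1) ,

which is Eq. (12). In particular, the reference edge gi is a monotonically increasing, softplus-like function of the GRU activation fi.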
4. Experimental Evaluation

4.1. Experimental Protocol

Dataset. We evaluate the proposed method on the PASCAL VOC 2012 segmentation benchmark [13], consisting of 20 foreground object classes and one background class. We augment the training set with the extra annotations of [18]. Performance is measured in terms of pixel intersection-over-union (IOU) averaged across the 21 classes.

Training. A two-step training process is employed. We first train the DeepLab component and then jointly fine-tune the whole model. Specifically, we employ exactly the same setting as [5] to train DeepLab in the first stage. In the second stage, we employ a small learning rate of 10^−8 for fine-tuning. The added convolutional layer of EdgeNet is initialized with Gaussian weights with zero mean and standard deviation of 10^−5, so that at the beginning EdgeNet predicts no edges and gradually starts to learn edges for semantic segmentation. Total training time is 11.5 hours (10.5 hours and 1 hour for the two stages, respectively).
Method                                          mIOU (%)
Baseline: DeepLab                               62.25
conv3_3                                         65.64
conv2_2 + conv3_3                               65.75
conv2_2 + conv3_3 + conv4_3                     66.03
conv2_2 + conv3_3 + conv4_3 + conv5_3           65.94
conv1_2 + conv2_2 + conv3_3 + conv4_3           65.89

Table 1. VOC 2012 val set. Effect of using features from different convolutional layers for EdgeNet (σs = 100 and σr = 1 for DT).

Figure 4. VOC 2012 val set. Effect of varying the number of iterations for the domain transform (mIOU (%) vs. DT iteration): (a) Fix σs and vary both σr and the K iterations. (b) Fix σr and vary both σs and the K iterations.

Reproducibility. The proposed methods are implemented by extending the Caffe framework [22]. The code and models are available at http://liangchiehchen.com/projects/DeepLab.html.

4.2. Experimental Results

We first explore on the validation set the hyper-parameters of the proposed model, including (1) the features used by EdgeNet and (2) the hyper-parameters of the domain transform (i.e., the number of iterations, σs, and σr). We also experiment with different methods to generate the edge prediction. After that, we analyze our models and evaluate on the official test set.

Features for EdgeNet. The EdgeNet we employ exploits intermediate features from DeepLab. We first investigate which VGG-16 [40] layers give better performance, with the DT hyper-parameters fixed. As shown in Tab. 1, baseline DeepLab attains 62.25% mIOU on the PASCAL VOC 2012 validation set. We start by exploiting the features from conv3_3, which has a receptive field size of 40. The size is similar to the patch size typically used for edge detection [11]. The resulting model achieves 65.64%, 3.4% better than the baseline. When using features from conv2_2, conv3_3, and conv4_3, the performance can be further improved to 66.03%. However, we do not observe any significant improvement if we also exploit the features from conv1_2 or conv5_3. We use features from conv2_2, conv3_3, and conv4_3 in the remaining experiments involving EdgeNet.

Number of domain transform iterations. The domain transform requires multiple iterations of the two-pass 1-D filtering process to avoid the "striping" effect [16, Fig. 4]. We train the proposed model with K iterations for the domain transform, and perform the same K iterations during testing. Since there are two more hyper-parameters σs and σr (see Eq. (9)), we also vary their values to investigate the effect of varying the K iterations for the domain transform. As shown in Fig. 4, employing K = 3 iterations for the domain transform in our proposed model is sufficient to reap most of the gains for several different values of σs and σr.

Varying domain transform σs, σr and comparison with other edge detectors. We investigate the effect of varying σs and σr for the domain transform. We also compare alternative methods to generate the edge prediction for the domain transform: (1) DT-Oracle, where groundtruth object boundaries are used, which serves as an upper bound on our method. (2) The proposed DT-EdgeNet, where the edges are produced by EdgeNet. (3) DT-SE, where the edges are found by Structured Edges (SE) [11]. (4) DT-Gradient, where the image (color) gradient magnitude of Eq. (5) is used as in the standard domain transform [16]. We search for optimal σs and σr for these methods. First, we fix σs = 100 and vary σr in Fig. 5(a). We found that the performance of DT-Oracle, DT-SE, and DT-Gradient is affected a lot by different values of σr, since their edges are generated by other "plugged-in" modules (i.e., not jointly fine-tuned). We also show the performance of baseline DeepLab and of DeepLab-CRF, which employs the dense CRF. We then fix the found optimal value of σr and vary σs in Fig. 5(b). We found that as long as σs ≥ 90, the performance of DT-EdgeNet, DT-SE, and DT-Gradient does not vary significantly. After finding the optimal values of σr and σs for each setting, we use them for the remaining experiments.

We further visualize the edges learned by our DT-EdgeNet in Fig. 6. As shown in the first row, when σr increases, the learned edges start to include not only object boundaries but also background textures, which degrades the performance for semantic segmentation in our method (i.e., noisy edges make it hard to propagate information between neighboring pixels). As shown in the second row, varying σs does not change the learned edges much, as long as its value is large enough (i.e., ≥ 90).

We show val set performance (with the best values of σs and σr) for each method in Tab. 2. The method DT-Gradient improves over the baseline DeepLab by 1.7%. While DT-SE is 0.9% better than DT-Gradient, DT-EdgeNet further enhances performance (4.1% over the baseline). Even though DT-EdgeNet is 1.2% lower than DeepLab-CRF, it is several times faster, as we discuss later. Moreover, we have found that combining DT-EdgeNet and the dense CRF yields the best performance (0.8% better than DeepLab-CRF). In this hybrid DT-EdgeNet+DenseCRF scheme we post-process the DT filtered score maps in an extra fully-connected CRF step.
(a) Image (b) σs = 100, σr = 0.1 (c) σs = 100, σr = 0.5 (d) σs = 100, σr = 2 (e) σs = 100, σr = 10

(f) Groundtruth (g) σs = 50, σr = 0.1 (h) σs = 90, σr = 0.1 (i) σs = 130, σr = 0.1 (j) σs = 170, σr = 0.1
Figure 6. Effect of varying domain transform’s σs and σr . First row: when σs is fixed and σr increases, the EdgeNet starts to include more
background edges. Second row: when σr is fixed, varying σs has little effect on learned edges.
Figure 5. VOC 2012 val set. Effect of varying σs and σr (mIOU (%) for DT-Oracle, DeepLab-CRF, DT-EdgeNet, DT-SE, DT-Gradient, and DeepLab). (a) Fix σs = 100 and vary σr. (b) Use the best σr from (a) and vary σs.

Figure 7. (a) Some trimap examples (top-left: image; top-right: ground-truth; bottom-left: trimap of 2 pixels; bottom-right: trimap of 10 pixels). (b) Segmentation result within a band around the object boundaries for the proposed methods (mean IOU vs. trimap width in pixels).
Method                      mIOU (%)
DeepLab                     62.25
DeepLab-CRF                 67.64
DT-Gradient                 63.96
DT-SE                       64.89
DT-EdgeNet                  66.35
DT-EdgeNet + DenseCRF       68.44
DT-Oracle                   70.88

Table 2. Performance on PASCAL VOC 2012 val set.

Trimap. Similar to [23, 26, 5], we quantify the accuracy of the proposed model near object boundaries. We use the "void" label annotated on the PASCAL VOC 2012 validation set. The annotations usually correspond to object boundaries. We compute the mean IOU for the pixels that lie within a narrow band (called a trimap) of "void" labels, and vary the width of the band, as shown in Fig. 7.
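A minimal sketch of this band-restricted evaluation for a single image is given below; it assumes the band is taken as all pixels within a given distance of the "void" annotations, and it is an illustration of the protocol rather than the exact evaluation code (in particular, the accumulation over the whole val set is omitted):

import numpy as np
from scipy.ndimage import distance_transform_edt

def trimap_miou(pred, gt, width, num_classes=21, void=255):
    # Mean IOU restricted to pixels lying within `width` pixels of the "void"
    # (boundary) annotations, cf. Fig. 7; pred and gt are integer label maps.
    band = distance_transform_edt(gt != void) <= width   # distance to nearest void pixel
    valid = band & (gt != void)                          # evaluate inside the band only
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c) & valid)
        union = np.sum(((pred == c) | (gt == c)) & valid)
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))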
Qualitative results. We show some semantic segmentation results on the PASCAL VOC 2012 val set in Fig. 9. DT-EdgeNet visually improves over the baseline DeepLab and over DT-SE. Besides, when comparing the edges learned by Structured Edges and by our EdgeNet, we found that EdgeNet better captures the object exterior boundaries and responds less than SE to interior edges. We also show failure cases in the bottom two rows of Fig. 9. The first is due to wrong predictions from DeepLab, and the second is due to the difficulty of localizing object boundaries against a cluttered background.

Test set results. After finding the best hyper-parameters, we evaluate our models on the test set. As shown in the top of Tab. 4, DT-SE improves 2.7% over the baseline DeepLab, and DT-EdgeNet can further enhance the performance to 69.0% (3.9% better than the baseline), which is 1.3% behind employing a fully-connected CRF as post-processing (i.e., DeepLab-CRF) to smooth the results. However, if we also incorporate a fully-connected CRF as post-processing to our model, we can further increase performance to 71.2%.

Models pretrained with MS-COCO. We perform another experiment with the stronger baseline of [34], where DeepLab is pretrained on the MS-COCO 2014 dataset [29]. Our goal is to test whether we can still obtain improvements with the proposed methods over that stronger baseline. We use the same optimal values of the hyper-parameters as before, and report the results on the validation set in Tab. 3. We still observe 1.6% and 2.7% improvement over the baseline by DT-SE and DT-EdgeNet, respectively. Besides, adding a fully-connected CRF to DT-EdgeNet brings another 1.8% improvement. We then evaluate the models on the test set in the bottom of Tab. 4. Our best model, DT-EdgeNet, improves the baseline DeepLab by 2.8%, while it is 1.0% lower than DeepLab-CRF. When combining DT-EdgeNet and a fully-connected CRF, we achieve 73.6% on the test set. Note that the gap between DT-EdgeNet and DeepLab-CRF becomes smaller when the stronger baseline is used.

Incorporating multi-scale inputs. State-of-the-art models on the PASCAL VOC 2012 leaderboard usually employ multi-scale features (either multi-scale inputs [10, 28, 7] or features from intermediate layers of the DCNN [31, 19, 5]). Motivated by this, we further combine our proposed discriminatively trained domain transform and the model of [7], yielding 76.3% performance on the test set, 1.5% behind the current best models [28] which jointly train the CRF and the DCNN [6].
Method                      mIOU (%)
DeepLab                     67.31
DeepLab-CRF                 71.01
DT-SE                       68.94
DT-EdgeNet                  69.96
DT-EdgeNet + DenseCRF       71.77

Table 3. Performance on PASCAL VOC 2012 val set. The models have been pretrained on the MS-COCO 2014 dataset.

Method                              ImageNet    COCO
DeepLab [5, 34]                     65.1        68.9
DeepLab-CRF [5, 34]                 70.3        72.7
DT-SE                               67.8        70.7
DT-EdgeNet                          69.0        71.7
DT-EdgeNet + DenseCRF               71.2        73.6
DeepLab-CRF-Attention [7]           -           75.7
DeepLab-CRF-Attention-DT            -           76.3
CRF-RNN [46]                        72.0        74.7
BoxSup [10]                         -           75.2
CentraleSuperBoundaries++ [24]      -           76.0
DPN [30]                            74.1        77.5
Adelaide Context [28]               75.3        77.8

Table 4. mIOU (%) on PASCAL VOC 2012 test set. We evaluate our models under two settings: the models are (1) pretrained with ImageNet, and (2) further pretrained with MS-COCO.

EdgeNet on BSDS500. We further evaluate the edge detection performance of our learned EdgeNet on the test set of BSDS500 [1]. We employ the standard metrics to evaluate edge detection accuracy: fixed contour threshold (ODS F-score), per-image best threshold (OIS F-score), and average precision (AP). We also apply a standard non-maximal suppression technique to the edge maps produced by EdgeNet for evaluation. Our method attains ODS=0.718, OIS=0.731, and AP=0.685. As shown in Fig. 8, interestingly, our EdgeNet yields reasonably good performance (only 3% worse than Structured Edges [11] in terms of ODS F-score), even though our EdgeNet is not trained on BSDS500 and there is no edge supervision during training on PASCAL VOC 2012.

Figure 8. Evaluation of our learned EdgeNet on the test set of BSDS500 (precision-recall; legend: [F=.80] Human, [F=.79] HED, [F=.75] SE, [F=.72] EdgeNet). Note that our EdgeNet is only trained on the PASCAL VOC 2012 semantic segmentation task without edge supervision.

Comparison with dense CRF. Employing a fully-connected CRF is an effective method to improve segmentation performance. Our best model (DT-EdgeNet) is 1.3% and 1.0% lower than DeepLab-CRF on the PASCAL VOC 2012 test set when the models are pretrained with ImageNet or MS-COCO, respectively. However, our method is many times faster in terms of computation time. To quantify this, we time the inference computation on 50 PASCAL VOC 2012 validation images. As shown in Tab. 5, for CPU timing, on a machine with an Intel i7-4790K CPU, the well-optimized dense CRF implementation [26] with 10 mean-field iterations takes 830 ms/image, while our implementation of the domain transform with K = 3 iterations (each iteration consists of separable two-pass filterings across rows and columns) takes 180 ms/image (4.6 times faster). On an NVIDIA Tesla K40 GPU, our GPU implementation of the domain transform further reduces the average computation time to 25 ms/image. In our GPU implementation, the total computational cost of the proposed method (EdgeNet+DT) is 26.2 ms/image, which amounts to a modest overhead (about 18%) compared to the 145 ms/image required by DeepLab. Note that there is no publicly available GPU implementation of dense CRF inference yet.

Method                      CPU time        GPU time
DeepLab                     5240            145
EdgeNet                     20 (0.4%)       1.2 (0.8%)
Dense CRF (10 iterations)   830 (15.8%)     -
DT (3 iterations)           180 (3.4%)      25 (17.2%)
CRF-RNN (CRF part) [46]     1482            -

Table 5. Average inference time (ms/image). The number in parentheses is the percentage w.r.t. the DeepLab computation. Note that the EdgeNet computation time is improved by performing convolution first and then upsampling.
5. Conclusions We show that backpropagating through the domain transform
allows us to learn an task-specific edge map optimized for
We have presented an approach to learn edge maps useful semantic segmentation. Filtering the raw semantic segmen-
for semantic image segmentation in a unified system that tation maps produced by deep fully convolutional networks
is trained discriminatively in an end-to-end fashion. The with our learned domain transform leads to improved lo-
proposed method builds on the domain transform, an edge- calization accuracy near object boundaries. The resulting

4551
(a) Image (b) Baseline (c) SE (d) DT-SE (e) EdgeNet (f) DT-EdgeNet
Figure 9. Visualizing results on VOC 2012 val set. For each row, we show (a) Image, (b) Baseline DeepLab segmentation result, (c) edges
produced by Structured Edges, (d) segmentation result with Structured Edges, (e) edges generated by EdgeNet, and (f) segmentation result
with EdgeNet. Note that our EdgeNet better captures the object boundaries and responds less to the background or object interior edges. For
example, see the legs of the second person from the left in the first image or the dog shapes in the second image. Two failure examples are shown in the bottom two rows.

Acknowledgments. This work was partly supported by ARO 62250-CS and NIH Grant 5R01EY022247-03.
References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. PAMI, 33(5):898–916, May 2011.
[2] G. Bertasius, J. Shi, and L. Torresani. DeepEdge: A multi-scale bifurcated deep network for top-down contour detection. In CVPR, 2015.
[3] G. Bertasius, J. Shi, and L. Torresani. High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision. In ICCV, 2015.
[4] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene labeling with LSTM recurrent neural networks. In CVPR, 2015.
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[6] L.-C. Chen, A. Schwing, A. Yuille, and R. Urtasun. Learning deep structured models. In ICML, 2015.
[7] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. arXiv:1511.03339, 2015.
[8] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259, 2014.
[9] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.
[10] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.
[11] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In ICCV, 2013.
[12] J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[13] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 2014.
[14] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 2013.
[15] Y. Ganin and V. Lempitsky. N^4-fields: Neural network nearest neighbor fields for image transforms. In ACCV, 2014.
[16] E. S. L. Gastal and M. M. Oliveira. Domain transform for edge-aware image and video processing. In SIGGRAPH, 2011.
[17] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS, 2009.
[18] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[19] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[21] J.-J. Hwang and T.-L. Liu. Pixel-wise deep learning for contour detection. In ICLR, 2015.
[22] Y. Jia et al. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
[23] P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. IJCV, 82(3):302–324, 2009.
[24] I. Kokkinos. Pushing the boundaries of boundary detection using deep learning. In ICLR, 2016.
[25] S. Konishi, A. L. Yuille, J. M. Coughlan, and S. C. Zhu. Statistical edge detection: Learning and evaluating edge cues. PAMI, 25(1):57–74, 2003.
[26] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
[27] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[28] G. Lin, C. Shen, I. Reid, et al. Efficient piecewise training of deep structured models for semantic segmentation. arXiv:1504.01013, 2015.
[29] T.-Y. Lin et al. Microsoft COCO: Common objects in context. In ECCV, 2014.
[30] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
[31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[32] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015.
[33] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
[34] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV, 2015.
[35] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
[36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
[37] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, long short-term memory, fully connected deep neural networks. In ICASSP, 2015.
[38] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv:1503.02351, 2015.
[39] W. Shen, X. Wang, Y. Wang, X. Bai, and Z. Zhang. DeepContour: A deep convolutional feature learned by positive-sharing loss for contour detection. In CVPR, 2015.
[40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[41] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng. Convolutional-recursive deep learning for 3D object classification. In NIPS, 2012.
[42] V. Vineet, J. Warrell, and P. H. Torr. Filter-based mean-field inference for random fields with higher-order terms and product label-spaces. IJCV, 110(3):290–307, 2014.
[43] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. Courville, and Y. Bengio. ReNet: A recurrent neural network based alternative to convolutional networks. arXiv:1505.00393, 2015.
[44] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
[45] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, 2011.
[46] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.
