Semantic Image Segmentation with Task-Specific Edge Detection Using CNNs and a Discriminatively Trained Domain Transform
Liang-Chieh Chen∗, Jonathan T. Barron, George Papandreou, Kevin Murphy, Alan L. Yuille
lcchen@cs.ucla.edu {barron, gpapan, kpmurphy}@google.com yuille@stat.ucla.edu
alan.yuille@jhu.edu
∗ Work done in part during an internship at Google Inc.

Abstract

Deep convolutional neural networks (CNNs) are the backbone of state-of-art semantic image segmentation systems. Recent work has shown that complementing CNNs with fully-connected conditional random fields (CRFs) can significantly enhance their object localization accuracy, yet dense CRF inference is computationally expensive. We propose replacing the fully-connected CRF with domain transform (DT), a modern edge-preserving filtering method in which the amount of smoothing is controlled by a reference edge map. Domain transform filtering is several times faster than dense CRF inference and we show that it yields comparable semantic segmentation results, accurately capturing object boundaries. Importantly, our formulation allows learning the reference edge map from intermediate CNN features instead of using the image gradient magnitude as in standard DT filtering. This produces task-specific edges in an end-to-end trainable system optimizing the target semantic segmentation quality.

[Figure 1 diagram: Image → Convolutional Neural Network → Edge Prediction and Segmentation Score → Domain Transform → Filtered Score.]
Figure 1. A single unified CNN produces both coarse semantic segmentation scores and an edge map, which respectively serve as input multi-channel image and reference edge to a domain transform edge-preserving filter. The resulting filtered semantic segmentation scores are well-aligned with the object boundaries. The full architecture is discriminatively trained by backpropagation (red dashed arrows) to optimize the target semantic segmentation.

1. Introduction

Deep convolutional neural networks (CNNs) are very effective in semantic image segmentation, the task of assigning a semantic label to every pixel in an image. Recently, it has been demonstrated that post-processing the output of a CNN with a fully-connected CRF can significantly increase segmentation accuracy near object boundaries [5].

As explained in [26], mean-field inference in the fully-connected CRF model amounts to iterated application of the bilateral filter, a popular technique for edge-aware filtering. This encourages pixels which are nearby in position and in color to be assigned the same semantic label. In practice, this produces semantic segmentation results which are well aligned with object boundaries in the image.

One key impediment in adopting the fully-connected CRF is the rather high computational cost of the underlying bilateral filtering step. Bilateral filtering amounts to high-dimensional Gaussian filtering in the 5-D bilateral (2-D position, 3-D color) space and is expensive in terms of both memory and CPU time, even when advanced algorithmic techniques are used.

In this paper, we propose replacing the fully-connected CRF and its associated bilateral filtering with the domain transform (DT) [16], an alternative edge-aware filter. The recursive formulation of the domain transform amounts to adaptive recursive filtering of a signal, where information is not allowed to propagate across edges in some reference signal. This results in an extremely efficient scheme which is an order of magnitude faster than the fastest algorithms for a bilateral filter of equivalent quality.

The domain transform can equivalently be seen as a recurrent neural network (RNN). In particular, we show that the domain transform is a special case of the recently proposed RNN with gated recurrent units. This connection allows us to share insights, better understanding two seemingly different methods, as we explain in Section 3.4.

The amount of smoothing in a DT is spatially modulated by a reference edge map, which in the standard DT corresponds to image gradient magnitude. Instead, we will learn the reference edge map from intermediate layer features of the same CNN that produces the semantic segmentation scores, as illustrated in Fig. 1. Crucially, this allows us to learn a task-specific edge detector tuned for semantic image
segmentation in an end-to-end trainable system.

We evaluate the performance of the proposed method on the challenging PASCAL VOC 2012 semantic segmentation task. In this task, domain transform filtering is several times faster than dense CRF inference, while performing almost as well in terms of the mean intersection-over-union (mIOU) metric. In addition, although we only trained for semantic segmentation, the learned edge map performs competitively on the BSDS500 edge detection benchmark.

2. Related Work

Semantic image segmentation Deep Convolutional Neural Networks (CNNs) [27] have demonstrated excellent performance on the task of semantic image segmentation [10, 28, 30]. However, due to the employment of max-pooling layers and downsampling, the output of these networks tends to have poorly localized object boundaries. Several approaches have been adopted to handle this problem. [31, 19, 5] proposed to extract features from the intermediate layers of a deep network to better estimate the object boundaries. Networks employing deconvolutional layers and unpooling layers to recover the “spatial invariance” effect of max-pooling layers have been proposed by [45, 33]. [14, 32] used super-pixel representation, which essentially appeals to low-level segmentation methods for the task of localization. The fully connected Conditional Random Field (CRF) [26] has been applied to capture long range dependencies between pixels in [5, 28, 30, 34]. Further improvements have been shown in [46, 38] when backpropagating through the CRF to refine the segmentation CNN. In contrast, we adopt another approach based on the domain transform [16] and show that beyond refining the segmentation CNN, we can also jointly learn to detect object boundaries, embedding task-specific edge detection into the proposed model.

Edge detection The edge/contour detection task has a long history [25, 1, 11], which we will only briefly review. Recently, several works have achieved outstanding performance on the edge detection task by employing CNNs [2, 3, 15, 21, 39, 44]. Our work is most related to the ones by [44, 3, 24]. While Xie and Tu [44] also exploited features from the intermediate layers of a deep network [40] for edge detection, they did not apply the learned edges for high-level tasks, such as semantic image segmentation. On the other hand, Bertasius et al. [3] and Kokkinos [24] made use of their learned boundaries to improve the performance of semantic image segmentation. However, the boundary detection and semantic image segmentation are considered as two separate tasks. They optimized the performance of boundary detection instead of the performance of high level tasks. On the contrary, we learn object boundaries in order to directly optimize the performance of semantic image segmentation.

Long range dependency Recurrent neural networks (RNNs) [12] with long short-term memory (LSTM) units [20] or gated recurrent units (GRUs) [8, 9] have proven successful in modeling the long term dependencies in sequential data (e.g., text and speech). Sainath et al. [37] have combined CNNs and RNNs into one unified architecture for speech recognition. Some recent work has attempted to model spatial long range dependency with recurrent networks for computer vision tasks [17, 41, 35, 4, 43]. Our work, integrating CNNs and Domain Transform (DT) with recursive filtering [16], bears a similarity to ReNet [43], which also performs recursive operations both horizontally and vertically to capture long range dependency within the whole image. In this work, we show the relationship between DT and GRU, and we also demonstrate the effectiveness of exploiting long range dependency by DT for semantic image segmentation. While [42] has previously employed the DT (for joint object-stereo labeling), we propose to backpropagate through both of the DT inputs to jointly learn segmentation scores and edge maps in an end-to-end trainable system. We show that these learned edge maps bring significant improvements compared to standard image gradient magnitude used by [42] or earlier DT literature [16].

3. Proposed Model

3.1. Model overview

Our proposed model consists of three components, illustrated in Fig. 2. They are jointly trained end-to-end to optimize the output semantic segmentation quality.

The first component, which produces coarse semantic segmentation score predictions, is based on the publicly available DeepLab model [5], which modifies the VGG-16 net [40] to be an FCN [31]. The model is initialized from the VGG-16 ImageNet [36] pretrained model. We employ the DeepLab-LargeFOV variant of [5], which introduces zeros into the filters to enlarge its Field-Of-View, and which we will simply denote by DeepLab in the sequel.

We add a second component, which we refer to as EdgeNet. The EdgeNet predicts edges by exploiting features from intermediate layers of DeepLab. The features are resized to have the same spatial resolution by bilinear interpolation before concatenation. A convolutional layer with kernel size 1×1 and one output channel is applied to yield the edge prediction. ReLU is used so that the edge prediction is in the range of zero to infinity.

The third component in our system is the domain transform (DT), which is an edge-preserving filter that lends itself to very efficient implementation by separable 1-D recursive filtering across rows and columns. Though DT is traditionally used for graphics applications [16], we use it to filter the raw CNN semantic segmentation scores to be better aligned with object boundaries, guided by the EdgeNet-produced edge map.
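As a rough illustration of the EdgeNet prediction step (a 1×1 convolution with one output channel followed by ReLU), the following is a minimal pure-Python sketch. It assumes the intermediate feature maps have already been resized to a common resolution and concatenated; the function name `edge_prediction`, the toy feature values, and the weights are hypothetical and are not taken from the trained model.

```python
def edge_prediction(features, weights, bias=0.0):
    """Apply a 1x1 convolution with one output channel, then ReLU.

    features: list of C feature maps, each an H x W list of lists, assumed
              to be already resized to the same resolution and concatenated.
    weights:  list of C scalars (hypothetical 1x1 kernel, one per channel).
    """
    height, width = len(features[0]), len(features[0][0])
    edges = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # A 1x1 convolution is a per-pixel weighted sum over channels.
            s = bias + sum(w * f[r][c] for w, f in zip(weights, features))
            # ReLU keeps the predicted edge strength in [0, inf).
            edges[r][c] = max(0.0, s)
    return edges


# Toy input: two 2x2 feature channels (illustrative values only).
feats = [
    [[1.0, 2.0], [0.0, 1.0]],  # channel 0
    [[0.5, 3.0], [2.0, 0.0]],  # channel 1
]
edge_map = edge_prediction(feats, weights=[1.0, -1.0])
print(edge_map)
```

Negative responses are clipped to zero, so the resulting map can be used directly as a non-negative reference edge signal.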
[Figure 2 diagram: Image → DeepLab semantic segmentation prediction, upsampled (x8); intermediate features (128 + 256 + 512 channels) upsampled (x2, x4, x8) and concatenated → Edge Prediction; Semantic Segmentation Prediction and Edge Prediction → Domain Transform (one iteration) → Filtered Score Map.]
Figure 2. Our proposed model has three components: (1) DeepLab for semantic segmentation prediction, (2) EdgeNet for edge prediction,
and (3) Domain Transform to accurately align segmentation scores with object boundaries. EdgeNet reuses features from intermediate
DeepLab layers, resized and concatenated before edge prediction. Domain transform takes as input the raw segmentation scores and edge
map, and recursively filters across rows and columns to produce the final filtered segmentation scores.
We review the standard DT in Sec. 3.2, we extend it to a fully trainable system with learned edge detection in Sec. 3.3, and we discuss connections with the recently proposed gated recurrent unit networks in Sec. 3.4.

3.2. Domain transform with recursive filtering

The domain transform takes two inputs: (1) the raw input signal $x$ to be filtered, which in our case corresponds to the coarse DCNN semantic segmentation scores, and (2) a positive “domain transform density” signal $d$, whose choice we discuss in detail in the following section. The output of the DT is a filtered signal $y$. We will use the recursive formulation of the DT due to its speed and efficiency, though the filter can be applied via other techniques [16].

For 1-D signals of length $N$, the output is computed by setting $y_1 = x_1$ and then recursively for $i = 2, \ldots, N$

$$y_i = (1 - w_i)\, x_i + w_i\, y_{i-1}. \tag{1}$$

The weight $w_i$ depends on the domain transform density $d_i$

$$w_i = \exp\left(-\sqrt{2}\, d_i / \sigma_s\right), \tag{2}$$

where $\sigma_s$ is the standard deviation of the filter kernel over the input’s spatial domain.

Intuitively, the strength of the domain transform density $d_i \geq 0$ determines the amount of diffusion/smoothing by controlling the relative contribution of the raw input signal $x_i$ to the filtered signal value at the previous position $y_{i-1}$ when computing the filtered signal at the current position $y_i$. The value of $w_i \in (0, 1)$ acts like a gate, which controls how much information is propagated from pixel $i - 1$ to $i$. We have full diffusion when $d_i$ is very small, resulting in $w_i = 1$ and $y_i = y_{i-1}$. On the other extreme, if $d_i$ is very large, then $w_i = 0$ and diffusion stops, resulting in $y_i = x_i$.

Filtering by Eq. (1) is asymmetric, since the current output only depends on previous outputs. To overcome this asymmetry, we filter 1-D signals twice, first left-to-right, then right-to-left on the output of the left-to-right pass.

Domain transform filtering for 2-D signals works in a separable fashion, employing 1-D filtering sequentially along each signal dimension. That is, a horizontal pass (left-to-right and right-to-left) is performed along each row, followed by a vertical pass (top-to-bottom and bottom-to-top) along each column. In practice, $K > 1$ iterations of the two-pass 1-D filtering process can suppress “striping” artifacts resulting from 1-D filtering on 2-D signals [16, Fig. 4]. We reduce the standard deviation of the DT filtering kernel at each iteration, requiring that the sum of total variances equals the desired variance $\sigma_s^2$, following [16, Eq. 14]

$$\sigma_k = \sigma_s \sqrt{3}\, \frac{2^{K-k}}{\sqrt{4^K - 1}}, \quad k = 1, \ldots, K, \tag{3}$$

plugging $\sigma_k$ in place of $\sigma_s$ to compute the weights $w_i$ by Eq. (2) at the $k$-th iteration.

The domain transform density values $d_i$ are defined as

$$d_i = 1 + g_i \frac{\sigma_s}{\sigma_r}, \tag{4}$$

where $g_i \geq 0$ is the “reference edge”, and $\sigma_r$ is the standard deviation of the filter kernel over the reference edge map’s range. Note that the larger the value of $g_i$ is, the more confident the model is that there is a strong edge at pixel $i$, thus inhibiting diffusion (i.e., $d_i \to \infty$ and $w_i = 0$). The standard DT [16] usually employs the color image gradient

$$g_i = \sum_{c=1}^{3} \left\| \nabla I_i^{(c)} \right\| \tag{5}$$
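The 1-D recursion of Eq. (1), with gates computed from Eq. (2) and Eq. (4), can be sketched in a few lines of pure Python. This is only an illustrative sketch for a single iteration with a fixed $\sigma_s$: the function names, the toy step signal, and the edge values are hypothetical. It also assumes that the right-to-left pass uses the gate `w[i + 1]` associated with the gap between pixels $i$ and $i+1$, a detail the text does not spell out.

```python
import math


def dt_gates(g, sigma_s, sigma_r):
    """Gate values w_i from reference edges g_i, via Eq. (4) and Eq. (2)."""
    gates = []
    for g_i in g:
        d_i = 1.0 + g_i * sigma_s / sigma_r                       # Eq. (4)
        gates.append(math.exp(-math.sqrt(2.0) * d_i / sigma_s))   # Eq. (2)
    return gates


def dt_filter_1d(x, g, sigma_s, sigma_r):
    """Two-pass recursive filtering of a 1-D signal x guided by edges g."""
    w = dt_gates(g, sigma_s, sigma_r)
    y = list(x)
    # Left-to-right pass: y_1 = x_1, then Eq. (1) for i = 2, ..., N.
    for i in range(1, len(y)):
        y[i] = (1.0 - w[i]) * y[i] + w[i] * y[i - 1]
    # Right-to-left pass on the output of the first pass.
    # Assumption: the gate between pixels i and i+1 is w[i + 1].
    for i in range(len(y) - 2, -1, -1):
        y[i] = (1.0 - w[i + 1]) * y[i] + w[i + 1] * y[i + 1]
    return y


# Toy step signal, filtered without and with a strong reference edge.
signal = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
no_edges = [0.0] * 6
edge_at_3 = [0.0, 0.0, 0.0, 50.0, 0.0, 0.0]

smoothed = dt_filter_1d(signal, no_edges, sigma_s=10.0, sigma_r=0.5)
preserved = dt_filter_1d(signal, edge_at_3, sigma_s=10.0, sigma_r=0.5)
print(smoothed)
print(preserved)
```

Without a reference edge the step is blurred in both directions, whereas a large $g_i$ at the step drives the gate towards zero and the discontinuity survives filtering. A 2-D version would apply the same routine along each row and then along each column.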
Eq. (2) yields

$$w_i = \exp\left(-\frac{\sqrt{2}}{\sigma_k}\left(1 + g_i \frac{\sigma_s}{\sigma_r}\right)\right). \tag{9}$$
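To make the interplay between the per-iteration kernel widths of Eq. (3) and the gate of Eq. (9) concrete, here is a small pure-Python sketch. The helper names are hypothetical; the values $\sigma_s = 100$, $\sigma_r = 0.5$, and $K = 3$ are simply settings that appear elsewhere in the paper and are used here for illustration only. The sketch checks numerically that the variances $\sigma_k^2$ sum to $\sigma_s^2$.

```python
import math


def sigma_schedule(sigma_s, num_iters):
    """Per-iteration kernel standard deviations sigma_k, following Eq. (3)."""
    denom = math.sqrt(4.0 ** num_iters - 1.0)
    return [
        sigma_s * math.sqrt(3.0) * 2.0 ** (num_iters - k) / denom
        for k in range(1, num_iters + 1)
    ]


def gate(g_i, sigma_s, sigma_r, sigma_k):
    """Gate value w_i at an iteration with kernel width sigma_k, Eq. (9)."""
    return math.exp(-math.sqrt(2.0) / sigma_k * (1.0 + g_i * sigma_s / sigma_r))


sigmas = sigma_schedule(sigma_s=100.0, num_iters=3)
total_variance = sum(s * s for s in sigmas)
print(sigmas, total_variance)

# Gates for a flat pixel (g = 0) and an edge pixel (g = 1) at each iteration.
for k, sigma_k in enumerate(sigmas, start=1):
    flat = gate(0.0, 100.0, 0.5, sigma_k)
    edge = gate(1.0, 100.0, 0.5, sigma_k)
    print(k, flat, edge)
```

The schedule is decreasing in $k$, so later iterations smooth over shorter distances, and at every iteration an edge pixel receives a smaller gate value than a flat pixel.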
Method              mIOU (%)
Baseline: DeepLab   62.25
conv3_3             65.64
[Plot residue: mIOU (%) curves with legend entries σs=130, σr=0.1; σs=130, σr=0.5; σs=130, σr=1; σs=170, σr=1.]
(a) Image (b) σs = 100, σr = 0.1 (c) σs = 100, σr = 0.5 (d) σs = 100, σr = 2 (e) σs = 100, σr = 10
(f) Groundtruth (g) σs = 50, σr = 0.1 (h) σs = 90, σr = 0.1 (i) σs = 130, σr = 0.1 (j) σs = 170, σr = 0.1
Figure 6. Effect of varying domain transform’s σs and σr . First row: when σs is fixed and σr increases, the EdgeNet starts to include more
background edges. Second row: when σr is fixed, varying σs has little effect on learned edges.
[Figure 5 plots: mIOU (%) versus σr (panel a) and versus σs (panel b), with curves for DT-Oracle, DeepLab-CRF, DT-EdgeNet, and DeepLab.]
Figure 5. VOC 2012 val set. Effect of varying σs and σr. (a) Fix σs = 100 and vary σr. (b) Use the best σr from (a) and vary σs.
[Figure 7(b) plot: mIOU (%) versus trimap width (pixels), with curves for DT-Oracle, DeepLab-CRF, DT-EdgeNet, DT-SE, DT-Gradient, and DeepLab.]
Figure 7. (a) Some trimap examples (top-left: image. top-right: ground-truth. bottom-left: trimap of 2 pixels. bottom-right: trimap of 10 pixels). (b) Segmentation result within a band around the object boundaries for the proposed methods (mean IOU).
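The boundary-band evaluation summarized in Figure 7 can be sketched as follows. This is a simplified, per-image illustration in pure Python and not the benchmark's evaluation code: the function names are hypothetical, the band is built with a square (Chebyshev) neighbourhood around “void” pixels, void pixels themselves are excluded, and the mean is taken over the classes that occur inside the band. All of these are assumptions made for the example.

```python
def trimap_band(void, width):
    """Mark non-void pixels within `width` pixels of any void pixel."""
    h, w = len(void), len(void[0])
    band = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if void[r][c]:
                for rr in range(max(0, r - width), min(h, r + width + 1)):
                    for cc in range(max(0, c - width), min(w, c + width + 1)):
                        band[rr][cc] = True
    # Assumption: void pixels themselves are not scored.
    return [[band[r][c] and not void[r][c] for c in range(w)] for r in range(h)]


def mean_iou(pred, truth, num_classes, mask=None):
    """Mean intersection-over-union over classes seen in the scored pixels."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for r, row in enumerate(truth):
        for c, t in enumerate(row):
            if mask is not None and not mask[r][c]:
                continue
            p = pred[r][c]
            if p == t:
                inter[t] += 1
                union[t] += 1
            else:
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy 1 x 6 example with a single void pixel on the object boundary.
truth = [[0, 0, 0, 1, 1, 1]]
pred = [[0, 0, 1, 1, 1, 1]]
void = [[False, False, False, True, False, False]]

band = trimap_band(void, width=1)
band_miou = mean_iou(pred, truth, num_classes=2, mask=band)
full_miou = mean_iou(pred, truth, num_classes=2)
print(band, band_miou, full_miou)
```

Restricting the score to the band makes boundary errors count far more heavily than in the whole-image score, which is why the band width is varied in the plot.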
Method                   mIOU (%)
DeepLab                  62.25
DeepLab-CRF              67.64
DT-Gradient              63.96
DT-SE                    64.89
DT-EdgeNet               66.35
DT-EdgeNet + DenseCRF    68.44
DT-Oracle                70.88
Table 2. Performance on PASCAL VOC 2012 val set.

[…] set. The annotations usually correspond to object boundaries. We compute the mean IOU for the pixels that lie within a narrow band (called trimap) of “void” labels, and vary the width of the band, as shown in Fig. 7.

Qualitative results We show some semantic segmentation results on PASCAL VOC 2012 val set in Fig. 9. DT-EdgeNet visually improves over the baseline DeepLab and DT-SE. Besides, when comparing the edges learned by Structured Edges and our EdgeNet, we found that EdgeNet better captures the object exterior boundaries and responds less than SE to interior edges. We also show failure cases in the bottom two rows of Fig. 9. The first is due to the wrong predictions from DeepLab, and the second due to the difficulty in localizing object boundaries with cluttered background.

Test set results After finding the best hyper-parameters, we evaluate our models on the test set. As shown in the top of Tab. 4, DT-SE improves 2.7% over the baseline DeepLab, and DT-EdgeNet can further enhance the performance to 69.0% (3.9% better than baseline), which is 1.3% behind employing a fully-connected CRF as post-processing (i.e., DeepLab-CRF) to smooth the results. However, if we also incorporate a fully-connected CRF as post-processing to our model, we can further increase performance to 71.2%.

Models pretrained with MS-COCO We perform another experiment with the stronger baseline of [34], where DeepLab is pretrained with the MS-COCO 2014 dataset [29]. Our goal is to test if we can still obtain improvements with the proposed methods over that stronger baseline. We use the same optimal values of hyper-parameters as before, and report the results on validation set in Tab. 3. We still observe 1.6% and 2.7% improvement over the baseline by DT-SE and DT-EdgeNet, respectively. Besides, adding a fully-connected CRF to DT-EdgeNet can bring another 1.8% improvement. We then evaluate the models on test set in the bottom of Tab. 4. Our best model, DT-EdgeNet, improves the baseline DeepLab by 2.8%, while it is 1.0% lower than DeepLab-CRF. When combining DT-EdgeNet and a fully-connected CRF, we achieve 73.6% on the test set. Note the gap between DT-EdgeNet and DeepLab-CRF becomes smaller when stronger baseline is used.

Incorporating multi-scale inputs State-of-art models on the PASCAL VOC 2012 leaderboard usually employ multi-scale features (either multi-scale inputs [10, 28, 7] or features from intermediate layers of DCNN [31, 19, 5]). Motivated by this, we further combine our proposed discriminatively trained domain transform and the model of [7], yielding 76.3% performance on test set, 1.5% behind current best models [28] which jointly train CRF and DCNN [6].
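The test-set improvements quoted in the text can be re-derived from the corresponding Tab. 4 entries. The short check below copies those mIOU values into a dictionary (the variable and function names are illustrative) and recomputes the differences for the ImageNet and MS-COCO pretraining settings.

```python
# mIOU (%) on the PASCAL VOC 2012 test set, copied from Tab. 4.
test_miou = {
    "ImageNet": {"DeepLab": 65.1, "DeepLab-CRF": 70.3,
                 "DT-SE": 67.8, "DT-EdgeNet": 69.0},
    "COCO": {"DeepLab": 68.9, "DeepLab-CRF": 72.7,
             "DT-SE": 70.7, "DT-EdgeNet": 71.7},
}


def gain(setting, method, reference="DeepLab"):
    """Difference in mIOU (percentage points) between method and reference."""
    scores = test_miou[setting]
    return round(scores[method] - scores[reference], 1)


for setting in ("ImageNet", "COCO"):
    print(
        setting,
        "DT-SE vs DeepLab:", gain(setting, "DT-SE"),
        "DT-EdgeNet vs DeepLab:", gain(setting, "DT-EdgeNet"),
        "DeepLab-CRF vs DT-EdgeNet:", gain(setting, "DeepLab-CRF", "DT-EdgeNet"),
    )
```

The remaining gap to DeepLab-CRF shrinks from the ImageNet setting to the MS-COCO setting, in line with the observation about the stronger baseline.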
Method                   mIOU (%)
DeepLab                  67.31
DeepLab-CRF              71.01
DT-SE                    68.94
DT-EdgeNet               69.96
DT-EdgeNet + DenseCRF    71.77
Table 3. Performance on PASCAL VOC 2012 val set. The models have been pretrained on MS-COCO 2014 dataset.

Method                             ImageNet   COCO
DeepLab [5, 34]                    65.1       68.9
DeepLab-CRF [5, 34]                70.3       72.7
DT-SE                              67.8       70.7
DT-EdgeNet                         69.0       71.7
DT-EdgeNet + DenseCRF              71.2       73.6
DeepLab-CRF-Attention [7]          -          75.7
DeepLab-CRF-Attention-DT           -          76.3
CRF-RNN [46]                       72.0       74.7
BoxSup [10]                        -          75.2
CentraleSuperBoundaries++ [24]     -          76.0
DPN [30]                           74.1       77.5
Adelaide Context [28]              75.3       77.8
Table 4. mIOU (%) on PASCAL VOC 2012 test set. We evaluate our models with two settings: the models are (1) pretrained with ImageNet, and (2) further pretrained with MS-COCO.

EdgeNet on BSDS500 We further evaluate the edge detection performance of our learned EdgeNet on the test set of BSDS500 [1]. We employ the standard metrics to evaluate edge detection accuracy: fixed contour threshold (ODS F-score), per-image best threshold (OIS F-score), and average precision (AP). We also apply a standard non-maximal suppression technique to the edge maps produced by EdgeNet for evaluation. Our method attains ODS=0.718, OIS=0.731, and AP=0.685. As shown in Fig. 8, interestingly, our EdgeNet […] our EdgeNet is not trained on BSDS500 and there is no edge supervision during training on PASCAL VOC 2012.

[Figure 8 plot: precision versus recall, with curves for SE (F=.75) and EdgeNet (F=.72).]
Figure 8. Evaluation of our learned EdgeNet on the test set of BSDS500. Note that our EdgeNet is only trained on PASCAL VOC 2012 semantic segmentation task without edge supervision.

[…] 1.3% and 1.0% lower than DeepLab-CRF on PASCAL VOC 2012 test set when the models are pretrained with ImageNet or MS-COCO, respectively. However, our method is many times faster in terms of computation time. To quantify this, we time the inference computation on 50 PASCAL VOC 2012 validation images. As shown in Tab. 5, for CPU timing, on a machine with Intel i7-4790K CPU, the well-optimized dense CRF implementation [26] with 10 mean-field iterations takes 830 ms/image, while our implementation of domain transform with K = 3 iterations (each iteration consists of separable two-pass filterings across rows and columns) takes 180 ms/image (4.6 times faster). On an NVIDIA Tesla K40 GPU, our GPU implementation of domain transform further reduces the average computation time to 25 ms/image. In our GPU implementation, the total computational cost of the proposed method (EdgeNet+DT) is 26.2 ms/image, which amounts to a modest overhead (about 18%) compared to the 145 ms/image required by DeepLab. Note there is no publicly available GPU implementation of dense CRF inference yet.

Method                      CPU time       GPU time
DeepLab                     5240           145
EdgeNet                     20 (0.4%)      1.2 (0.8%)
Dense CRF (10 iterations)   830 (15.8%)    -
DT (3 iterations)           180 (3.4%)     25 (17.2%)
CRF-RNN (CRF part) [46]     1482           -
Table 5. Average inference time (ms/image). Number in parentheses is the percentage w.r.t. the DeepLab computation. Note that EdgeNet computation time is improved by performing convolution first and then upsampling.
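The speed-up and overhead figures quoted above follow directly from the timings in Tab. 5. The snippet below copies those timings (variable names are illustrative) and recomputes the ratios as a sanity check.

```python
# Average inference times in ms/image, copied from Tab. 5.
cpu_ms = {"DeepLab": 5240.0, "EdgeNet": 20.0, "Dense CRF": 830.0, "DT": 180.0}
gpu_ms = {"DeepLab": 145.0, "EdgeNet": 1.2, "DT": 25.0}

# CPU speed-up of the domain transform over dense CRF inference.
cpu_speedup = cpu_ms["Dense CRF"] / cpu_ms["DT"]

# GPU cost of EdgeNet + DT relative to the DeepLab forward pass.
gpu_extra_ms = gpu_ms["EdgeNet"] + gpu_ms["DT"]
gpu_overhead_pct = 100.0 * gpu_extra_ms / gpu_ms["DeepLab"]

# Percentages with respect to the DeepLab computation, as in the table.
cpu_pct = {name: 100.0 * t / cpu_ms["DeepLab"]
           for name, t in cpu_ms.items() if name != "DeepLab"}
gpu_pct = {name: 100.0 * t / gpu_ms["DeepLab"]
           for name, t in gpu_ms.items() if name != "DeepLab"}

print(round(cpu_speedup, 1), round(gpu_extra_ms, 1), round(gpu_overhead_pct))
print({k: round(v, 1) for k, v in cpu_pct.items()})
print({k: round(v, 1) for k, v in gpu_pct.items()})
```

On the CPU the dense CRF adds roughly a sixth of the DeepLab cost while the domain transform adds only a few percent; on the GPU the DeepLab forward pass is much faster, so the same filter accounts for a larger share of the total.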
5. Conclusions

We have presented an approach to learn edge maps useful for semantic image segmentation in a unified system that is trained discriminatively in an end-to-end fashion. The proposed method builds on the domain transform, an edge-preserving filter traditionally used for graphics applications. We show that backpropagating through the domain transform allows us to learn a task-specific edge map optimized for semantic segmentation. Filtering the raw semantic segmentation maps produced by deep fully convolutional networks with our learned domain transform leads to improved localization accuracy near object boundaries. The resulting scheme is several times faster than fully-connected CRFs that have been previously used for this purpose.

(a) Image (b) Baseline (c) SE (d) DT-SE (e) EdgeNet (f) DT-EdgeNet
Figure 9. Visualizing results on VOC 2012 val set. For each row, we show (a) Image, (b) Baseline DeepLab segmentation result, (c) edges produced by Structured Edges, (d) segmentation result with Structured Edges, (e) edges generated by EdgeNet, and (f) segmentation result with EdgeNet. Note that our EdgeNet better captures the object boundaries and responds less to the background or object interior edges. For example, see the legs of the second person from the left in the first image or the dog shapes in the second image. Two failure examples are shown at the bottom.

Acknowledgments This work was partly supported by ARO 62250-CS and NIH Grant 5R01EY022247-03.
References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. PAMI, 33(5):898–916, May 2011.
[2] G. Bertasius, J. Shi, and L. Torresani. Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In CVPR, 2015.
[3] G. Bertasius, J. Shi, and L. Torresani. High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision. In ICCV, 2015.
[4] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene labeling with lstm recurrent neural networks. In CVPR, 2015.
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
[6] L.-C. Chen, A. Schwing, A. Yuille, and R. Urtasun. Learning deep structured models. In ICML, 2015.
[7] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. arXiv:1511.03339, 2015.
[8] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259, 2014.
[9] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.
[10] J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.
[11] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In ICCV, 2013.
[12] J. L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
[13] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2014.
[14] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 2013.
[15] Y. Ganin and V. Lempitsky. N^4-fields: Neural network nearest neighbor fields for image transforms. In ACCV, 2014.
[16] E. S. L. Gastal and M. M. Oliveira. Domain transform for edge-aware image and video processing. In SIGGRAPH, 2011.
[17] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS, 2009.
[18] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[19] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[21] J.-J. Hwang and T.-L. Liu. Pixel-wise deep learning for contour detection. In ICLR, 2015.
[22] Y. Jia et al. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
[23] P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. IJCV, 82(3):302–324, 2009.
[24] I. Kokkinos. Pushing the boundaries of boundary detection using deep learning. In ICLR, 2016.
[25] S. Konishi, A. L. Yuille, J. M. Coughlan, and S. C. Zhu. Statistical edge detection: Learning and evaluating edge cues. PAMI, 25(1):57–74, 2003.
[26] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
[27] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
[28] G. Lin, C. Shen, I. Reid, et al. Efficient piecewise training of deep structured models for semantic segmentation. arXiv:1504.01013, 2015.
[29] T.-Y. Lin et al. Microsoft COCO: Common objects in context. In ECCV, 2014.
[30] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
[31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[32] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015.
[33] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
[34] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a dcnn for semantic image segmentation. In ICCV, 2015.
[35] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
[36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[37] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, long short-term memory, fully connected deep neural networks. In ICASSP, 2015.
[38] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv:1503.02351, 2015.
[39] W. Shen, X. Wang, Y. Wang, X. Bai, and Z. Zhang. Deepcontour: A deep convolutional feature learned by positive-sharing loss for contour detection. In CVPR, 2015.
[40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[41] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng. Convolutional-recursive deep learning for 3d object classification. In NIPS, 2012.
[42] V. Vineet, J. Warrell, and P. H. Torr. Filter-based mean-field inference for random fields with higher-order terms and product label-spaces. IJCV, 110(3):290–307, 2014.
[43] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. Courville, and Y. Bengio. Renet: A recurrent neural network based alternative to convolutional networks. arXiv:1505.00393, 2015.
[44] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
[45] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, 2011.
[46] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.