Abstract
This paper presents a fast and robust architecture for scene understanding in aerial images recorded from an Unmanned Aerial Vehicle (UAV). The architecture uses a Deep Wavelet Scattering Network to extract translation- and rotation-invariant features that are then used by a Conditional Random Field (CRF) to perform scene segmentation. Experiments are conducted with the proposed framework on two annotated datasets, of 1277 images and 300 aerial images respectively, introduced in the paper. An overall pixel accuracy of 81% and 78% is achieved on the two datasets. A comparison with another similar framework is also presented.
1 Introduction
Unmanned Aerial Vehicles (UAVs) have recently become a useful information-gathering medium for numerous applications such as surveillance [15], vegetation management [23], disaster (flood) management [18], atmospheric pollution monitoring [22] and coastline management [2]. UAVs have particularly gained popularity for data collection in the aftermath of natural catastrophes such as floods [18] and earthquakes due to their ease of deployment, their ability to fly at low altitudes and their ability to capture images at high resolution. These systems have been used in the past to segment aerial images into regions in order to develop emergency route plans that can help rescue trapped victims and estimate the incurred damage.
Numerous attempts have been made in the past to segment regions from aerial imagery. Initial methods in this area focused only on segmenting roads from aerial images. Some achieved this task using traditional methods such as scale space and snakes [9], while others used higher-order active contours [17] or intensity features [3]. These methods were later extended to segment aerial images into other natural and man-made landmarks (in addition to roads), which can help construct detailed maps of the terrain of interest and further benefit route planning. Ghiasi et al. [7], Dubuisson-Jolly et al. [6] and Rezaeian et al. [16] used a fusion of color and texture features to achieve semantic segmentation of aerial images. Lathuiliere et al. [10] combined a Markov model with an SVM for aerial scene segmentation, while Montoya-Zegarra et al. [13] used class-specific priors with a Conditional Random Field (CRF) to achieve pixel-wise labeling. Marmanis et al. [12] used an ensemble of Convolutional Neural Networks (CNNs) to segment vegetation regions from aerial images.
Hand-engineered color and texture features achieve only nominal scene segmentation accuracy, while, despite the success of CNNs, the design and optimal configuration of these networks is not well understood, which makes them difficult to develop. In addition, it is difficult to train CNNs for aerial scene segmentation because only limited training data is available. Bruna et al. [1] and Sifre et al. [21] have shown that wavelet-based ScatterNets, built on accumulated knowledge of geometrical image properties, can give performance competitive with that of trained networks. Hence, we use the Deep Wavelet Scattering architecture proposed by Sifre and Mallat [21] as the front-end of our proposed pipeline to extract translation- and rotation-invariant scattering features. The Conditional Random Field (CRF) is the obvious choice for the back-end, as it gives superior performance over the Markov Random Field (MRF) [5]. Hence, a CRF is used as the back-end of the proposed network; it uses the translation- and rotation-invariant features extracted by the scattering network to perform the desired scene segmentation.
This paper presents a framework for scene understanding in aerial images recorded from an Unmanned Aerial Vehicle. The main contributions of the paper are stated below:
- Scene Understanding Architecture: The proposed architecture extracts translation- and rotation-invariant features using a handcrafted, computationally efficient Deep Wavelet Scattering network (front-end), which are then used by a Conditional Random Field (CRF) (back-end) to achieve the necessary scene segmentation.
- Datasets: Since the CRF is a supervised learning algorithm, a dataset of 1277 annotated images, carefully collected from the Stanford Background dataset [8] and the CMU Urban Image dataset [14], that contains the selected natural and man-made landmarks that appear in aerial images is introduced (note that these images are not recorded from the UAV). This dataset is used to pre-train the CRF. Next, a UAV aerial image dataset of 300 annotated images recorded from a UAV, containing the same landmarks, is used to fine-tune the pre-trained CRF.
The proposed framework is used to perform scene understanding on the introduced datasets. The average segmentation accuracy for each class for both datasets is presented. In addition, an extensive comparison of the proposed pipeline with other scene segmentation methods is presented.
The paper is divided into the following sections. Section 2 presents the datasets introduced in the paper, while Sect. 3 presents the proposed scene segmentation framework. Section 4 presents the experimental results and Sect. 5 draws conclusions.
2 Introduced Annotated Datasets
The paper presents two annotated datasets which contain natural and man-made landmarks that appear in aerial images. The landmarks most commonly seen in aerial images are included in the datasets, namely: "Sky", "Tree", "Road", "Grass", "Water", "Building", "Mountain" and "Foreground objects". The first dataset (D1) is a collection of 1277 annotated images carefully chosen from the Stanford Background dataset [8] and the CMU Urban Image dataset [14]. All images are resized to a fixed resolution of \(200 \times 300\). The second, UAV aerial image dataset (D2) introduced in the paper includes 300 annotated images recorded from the UAV. These images contain the landmarks mentioned above. The UAV and two example images from the UAV aerial image dataset are shown in Fig. 1.
3 Scene Understanding Framework
This section introduces the proposed scene understanding framework that is used to segment the image into regions which can then be utilized to interpret the scene. The framework is composed of a front-end that extracts discriminatory features and a back-end that uses these features to segment the image into different regions. We use the Deep Wavelet Scattering architecture proposed by Sifre and Mallat [21] as the front-end of the proposed pipeline to extract translation- and rotation-invariant scattering features, while a Conditional Random Field (CRF) is used as the back-end of the proposed network, which utilizes the extracted features to perform the desired scene segmentation. The pipeline is shown in Fig. 2.
3.1 Deep Wavelet Scattering Network
Deep wavelet scattering networks are multilayer networks that incorporate geometric knowledge to produce high-dimensional image representations that are discriminative and approximately invariant to translation and rotation [1, 21]. The invariants at the first layer of the network are obtained by filtering the image with multi-scale and multi-directional complex Morlet wavelet decompositions followed by a point-wise nonlinearity and local averaging. The high frequencies lost due to averaging are recovered at the later layers using cascaded wavelet transformations with non-linearities, justifying the need for a multilayer network. Bruna et al. [1] and Sifre et al. [19–21] have proposed numerous convolutional scattering architectures that produce invariant and discriminative feature descriptors. We present the generic idea behind all the models below.
ScatterNets decompose an input image x using multi-scale and multi-directional complex Morlet wavelets that are obtained by dilating and rotating a single band-pass filter \(\psi \) for any scale j and direction \(\theta \). The wavelet transform filters the signal x with a complex wavelet \(\psi _{\theta ,j_{1}}\).
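Following [1, 21], one plausible form of this filtering is

\(x \star \psi _{\theta ,j_{1}}(u) = x \star \psi ^{a}_{\theta ,j_{1}}(u) + i\, x \star \psi ^{b}_{\theta ,j_{1}}(u)\)    (1)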
where \({\psi }^{a}\) is the real and \({\psi }^{b}\) is the imaginary part of the wavelet. The wavelet transform response commutes with translations and is therefore not translation invariant. To build a translation-invariant representation, an \(L_{2}\) point-wise non-linearity is first applied to the wavelet coefficients.
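In the form used in [1], this non-linearity is the complex modulus

\(|x \star \psi _{\theta ,j_{1}}(u)| = \sqrt{|x \star \psi ^{a}_{\theta ,j_{1}}(u)|^{2} + |x \star \psi ^{b}_{\theta ,j_{1}}(u)|^{2}}\)    (2)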
Separable scattering architecture. First spatial scattering layers in grey, second scattering layers in black. Spatial wavelet-modulus operators (grey arrows) are averaged (dotted grey arrows), as in [1]. Outputs of the first scattering are reorganized in different orbits (large black circles) of the action of the rotation on the representation. A second cascade of wavelet-modulus operators along the orbits (black arrows) splits the angular information in several paths that are averaged (dotted black arrows) along the rotation to achieve rotation invariance. Output nodes are colored with respect to the order m, \(\overset{^{\circ }}{m}\) of their corresponding paths. Modified from [19].
\(L_{2}\) is a good non-linearity as it is stable to deformations and non-expansive, which makes it stable to additive noise [1]. It results in the regular envelope of the filtered signal, which still commutes with translations.
The resulting wavelet-modulus operator applied to the signal x is denoted \(\widetilde{W_{1}}\).
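Following [1, 21], it can be written as

\(\widetilde{W_{1}}x = \left( x \star \phi _{J},\; |x \star \psi _{\theta ,j}| \right)_{\theta ,j} = \left( x \star \phi _{J},\; U_{1}x \right)\)    (3)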
where \(x\star \phi _{J}\) is the low-pass coefficient and \(|x \star \psi _{\theta ,j}|_{\theta ,j}\) are the high-pass coefficients. The invariant part of \(U_{1}\) is computed with an averaging over the spatial and angle variables. It is implemented, for each fixed scale \(j_{1}\), with a roto-translation convolution of \(Y(h) = U_{1}x(h,j_{1})\) along the \(h = ({u}',{\theta }')\) variable, with an averaging kernel \(\varPhi _{J}(h)\). For \(p_{1} = (g_{1}, j_{1})\) and \(g_{1} = (u, \theta _{1})\), this defines the first-order scattering coefficients \(S_{1}x(p_{1})\).
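Following the roto-translation convolution of [21], these plausibly take the form

\(S_{1}x(p_{1}) = Y \circledast \varPhi _{J}(g_{1}) = \int Y(g)\, \varPhi _{J}(g^{-1}g_{1})\, dg\)    (4)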
We choose \(\varPhi _{J}({u}',{\theta }') = (2\pi )^{-1}\varPhi _{J}({u}')\) to perform an averaging over all angles \(\theta \) and over a spatial domain proportional to \(2^J\).
The high frequencies lost by this averaging are recovered through roto-translation convolutions with separable wavelets. Roto-translation wavelets are computed with three separable products: complex quadrature-phase spatial wavelets \(\psi _{\theta _{2},j_{2}}(u)\) or averaging filters \(\phi _{J}(u)\) are multiplied by a complex \(2\pi \)-periodic wavelet \(\bar{\psi }_{k}(\theta ) \) or by \(\bar{\phi }(\theta ) = (2 \pi )^{-1}\).
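Following [19, 21], the resulting three separable roto-translation wavelets can plausibly be written as \(\varPsi _{\theta _{2},j_{2},k_{2}}(u,\theta ) = \psi _{\theta _{2},j_{2}}(u)\, \bar{\psi }_{k_{2}}(\theta )\), \(\varPsi _{\theta _{2},j_{2}}(u,\theta ) = \psi _{\theta _{2},j_{2}}(u)\, \bar{\phi }(\theta )\) and \(\varPsi _{J,k_{2}}(u,\theta ) = \phi _{J}(u)\, \bar{\psi }_{k_{2}}(\theta )\).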
Roto-translation wavelets of the second layer are then computed as \(\widetilde{W_{2}}U_{1}x = (S_{1}x,U_{2}x)\), where \(S_{1}x\) is defined in (4) and \(U_{2}x\) is a second layer of wavelet-modulus coefficients.
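Following [21], \(U_{2}x\) can be written as

\(U_{2}x(p_{2}) = \left| U_{1}x(\cdot , j_{1}) \circledast \varPsi _{\theta _{2},j_{2},k_{2}} \right| (g_{1})\)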
with \(g_{1} = (u, \theta _{1})\), \(p_{2} = (g_{1}, \bar{p_{2}})\), and \(\bar{p_{2}} = (j_{1}, \theta _{2}-\theta _{1}, j_{2}, k_{2})\). Since \(U_{2}x(p_{2})\) is computed with a roto-translation convolution, it remains covariant to the action of the roto-translation group. Fast computations of roto-translation convolutions with separable wavelet filters \(\varPsi _{\theta _{2},j_{2},k_{2}}(u,\theta ) = \psi _{\theta _{2},j_{2}} (u) \bar{\psi }_{k_{2}}(\theta )\) are performed by factorizing
It is thus computed with a two-dimensional convolution of \(Y(u,\theta ')\) with \(\psi _{\theta _{2},j_{2}}(r_{\theta }u)\) along \(u = (u_{1}, u_{2})\), followed by a one-dimensional circular convolution of the result with \(\bar{\psi }_{k_{2}}\) along \(\theta \). This convolution rotates the spatial support of \(\psi _{\theta _{2},j_{2}}(u)\) by \(\theta \) while multiplying its amplitude by \( \bar{\psi }_{k_{2}}(\theta )\).
Applying \(\widetilde{W_{3}} = \widetilde{W_{2}}\) to \(U_{2}x\) computes second order scattering coefficients as a convolution of \(Y(g) = U_{2} x(g,\bar{p}_{2})\) with \(\varPhi _{J}(g)\), for fixed \(\bar{p}_{2}\).
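Following [21], these coefficients plausibly take the form

\(S_{2}x(p_{2}) = U_{2}x(\cdot ,\bar{p}_{2}) \circledast \varPhi _{J}(g_{1}) = \int Y(g)\, \varPhi _{J}(g^{-1}g_{1})\, dg\)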
The output of the second order roto-translation scattering representation is a vector of coefficients.
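Following [1, 21], this vector plausibly concatenates the zeroth, first and second order coefficients,

\(Sx = \left( x \star \phi _{J}(u),\; S_{1}x(p_{1}),\; S_{2}x(p_{2}) \right)\)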
with \(p_{1} = (u,\theta _{1},j_{1})\) and \(p_{2} = (u,\theta _{1},j_{1},\theta _{2},j_{2},k_{2})\).
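To make the front-end concrete, the following minimal NumPy sketch computes zeroth and first order spatial scattering coefficients with a Morlet filter bank at 4 scales and 8 orientations, i.e. the translation-invariant part \(|x \star \psi _{\theta ,j}| \star \phi _{J}\) of the representation. It omits the roto-translation second layer of [19], and all function names and filter parameters are our own illustrative choices rather than the authors' implementation.

```python
import numpy as np

def morlet_filter(shape, j, theta, xi=3 * np.pi / 4, sigma=0.8):
    """Complex Morlet filter dilated by 2**j and rotated by theta
    (an illustrative variant of the filters used in [1])."""
    h, w = shape
    y, x = np.mgrid[-h // 2:h // 2, -w // 2:w // 2].astype(float)
    s = 2.0 ** j
    xr = (np.cos(theta) * x + np.sin(theta) * y) / s       # rotate, then dilate
    yr = (-np.sin(theta) * x + np.cos(theta) * y) / s
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
    wave = np.exp(1j * xi * xr)
    k = (envelope * wave).sum() / envelope.sum()            # zero-mean correction
    return envelope * (wave - k) / s ** 2                   # band-pass filter

def gaussian_lowpass(shape, J, sigma=0.8):
    """Averaging filter phi_J with spatial support proportional to 2**J."""
    h, w = shape
    y, x = np.mgrid[-h // 2:h // 2, -w // 2:w // 2].astype(float)
    g = np.exp(-(x ** 2 + y ** 2) / (2 * (sigma * 2.0 ** J) ** 2))
    return g / g.sum()

def conv2d_fft(image, kernel):
    """Circular 2-D convolution via the FFT (sufficient for this sketch)."""
    return np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(np.fft.ifftshift(kernel)))

def scattering_features(x, J=4, L=8):
    """Zeroth order x*phi_J and first order |x*psi_{theta,j}|*phi_J coefficients."""
    phi = gaussian_lowpass(x.shape, J)
    feats = [np.real(conv2d_fft(x, phi))]                   # S0
    for j in range(J):
        for l in range(L):
            psi = morlet_filter(x.shape, j, np.pi * l / L)
            u1 = np.abs(conv2d_fft(x, psi))                 # wavelet modulus U1
            feats.append(np.real(conv2d_fft(u1, phi)))      # local averaging S1
    return np.stack(feats)                                  # (1 + J*L, H, W)

# 4 scales and 8 orientations, as in Sect. 4, on a 200 x 300 image
x = np.random.rand(200, 300)
print(scattering_features(x, J=4, L=8).shape)               # (33, 200, 300)
```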
3.2 Conditional Random Field
The Conditional Random Field (CRF) is a probabilistic framework that allows us to describe the relationship between related output variables, such as the labels of the pixels in an image, as a function of observed features such as pixel colors [5]. This framework is thus ideal for combining multiple visual cues for scene understanding.
The CRF undirected graphical model used in this paper is a pairwise 4-connected grid consisting of a finite number of vertices, or nodes, and edges connecting these nodes. Each node corresponds to a random variable denoted by X. Edges define the neighbourhood relation between these unobserved random variables.
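As a minimal illustration (with our own variable names, not the authors' implementation), the node and edge structure of such a pairwise 4-connected grid over an H x W image can be built as follows:

```python
import numpy as np

def grid_crf_structure(H, W):
    """Nodes are pixel indices; edges connect each pixel to its right and
    bottom neighbour, giving the pairwise 4-connected grid used by the CRF."""
    nodes = np.arange(H * W).reshape(H, W)
    edges = []
    for r in range(H):
        for c in range(W):
            if c + 1 < W:                       # horizontal edge
                edges.append((nodes[r, c], nodes[r, c + 1]))
            if r + 1 < H:                       # vertical edge
                edges.append((nodes[r, c], nodes[r + 1, c]))
    return nodes, np.array(edges)

nodes, edges = grid_crf_structure(200, 300)
print(len(edges))   # 2*H*W - H - W = 119500 edges for a 200 x 300 image
```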
A loss function, which must be minimized to obtain the optimal labelling, is defined by fitting two matrices F and G to the unary and edge features. Let \(\gamma _{i}\) represent the set of parameter values \(\gamma (x_i)\) for all values of \(x_{i}\), and let \(k(\mathbf {y},i)\) represent the unary features for variable i given an input image \(\mathbf {y}\).
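Following the linear parametrization of [5], the unary fit plausibly takes the form \(\gamma _{i} = F\, k(\mathbf {y},i)\).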
In a similar fashion, the parameter values for each pair \(x_{i}, x_{j}\) are denoted by \(\gamma _{ij}\), and \(v(\mathbf {y},i,j)\) represents the edge features for the pair (i, j).
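Again following [5], the pairwise fit is plausibly \(\gamma _{ij} = G\, v(\mathbf {y},i,j)\).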
The gradients of the loss with respect to F and G, needed for optimization, can then be obtained with the chain rule.
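Given the linear fits assumed above, these gradients take the form

\(\frac{\partial L}{\partial F} = \sum _{i} \frac{\partial L}{\partial \gamma _{i}}\, k(\mathbf {y},i)^{T}, \qquad \frac{\partial L}{\partial G} = \sum _{(i,j)} \frac{\partial L}{\partial \gamma _{ij}}\, v(\mathbf {y},i,j)^{T}\)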
This is under the assumption that \(\frac{\partial L}{\partial \gamma }\) has been calculated.
A clique loss function is used in this paper to achieve the scene segmentation, with Tree-Reweighted inference [5] and the L-BFGS optimization algorithm. In this process, the number of images used for training is repeatedly doubled while the number of learning iterations is halved. A marginal-based clique loss function is used to calculate the loss at each iteration. After every iteration the loss value is checked and, if a bad search direction is encountered, L-BFGS is reinitialized [5].
4 Results
The proposed scene segmentation pipeline was evaluated and compared with other similar frameworks on both datasets introduced in Sect. 2.
The front-end of the pipeline uses the deep wavelet scattering network to extract features using Morlet filters at 4 scales (j) and 8 pre-defined orientations (\(\theta \)), as explained in Sect. 3.1. The features are extracted from each image of the 1277-image annotated dataset (D1) constructed by combining images from the Stanford Background dataset [8] and the CMU Urban Image dataset [14]. Each image of the dataset has a fixed resolution of \(200 \times 300\). The Conditional Random Field is trained on the image features using a 5-fold cross-validation split of the dataset. The average accuracies for "Sky", "Tree", "Road", "Grass", "Water", "Building", "Mountain" and "Foreground objects" are presented in Table 1. Two images selected from the D1 dataset, along with the ground truth and the segmentation produced by the trained CRF, are shown in Fig. 3. The trained CRF is able to recognize the above-mentioned landmarks in images contained in the D1 dataset.
Next, the trained CRF is fine-tuned to detect the same landmarks in aerial images recorded from the UAV. The trained CRF model is fine-tuned on features extracted from the annotated UAV aerial image dataset (D2) presented in Sect. 2. The features are extracted with the deep wavelet scattering network using Morlet filters with the above-mentioned parameters, from images of resolution \(200 \times 300\), using a 5-fold cross-validation split of the UAV image dataset. The average accuracies for the above-mentioned labels on the UAV image dataset are presented in Table 1. Two images selected from the D2 dataset, along with the ground truth and the segmentation produced by the fine-tuned CRF, are shown in Fig. 3.
The proposed scene segmentation pipeline is compared with segmentation pipelines that use (i) hand-crafted or (ii) learned features to achieve this task. The scene segmentation results of the proposed method are first compared with the segmentation obtained from hand-crafted features, formed by combining RGB intensities, HOG [4] and pixel locations, which are then used in a CRF framework on both datasets. The proposed method is then compared with the scene segmentation results obtained by training a Fully Convolutional Network (FCN) with an 8-pixel stride [11] on the D1 dataset and then fine-tuning the learned network on the D2 dataset. The results are presented in Table 1. It is evident from Table 1 that the proposed method outperforms the segmentation pipeline that uses hand-crafted features on both datasets. The proposed method also outperforms the Fully Convolutional Network [11] on both datasets; the reason appears to be the small size of the D1 and D2 datasets, which prevents the FCN from being learned effectively.
5 Conclusions
The paper introduces a novel application of scene understanding to aerial images, which can be vital in surveillance and disaster management applications. The proposed architecture has also shown the value of the ScatterNet, which extracts invariant features that can replace popular hand-crafted features owing to their superior performance. The proposed framework can also be used in applications with little training data, as only the back-end of the framework requires learning. The proposed framework achieves good overall pixel accuracy for scene segmentation on both annotated datasets introduced in the paper. We hope to extend the framework to make use of large corpora of partially labeled data, or perhaps to use motion cues in videos to obtain segmentation labels. An important and natural extension of the method is to incorporate object-based reasoning directly into the model, which can lead to better understanding of images.
References
Bruna, J., Mallat, S.: Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1872–1886 (2013)
Casella, E., Rovere, A., Pedroncini, A., Mucerino, L., Casella, M., Cusati, L.A., Vacchi, M., Ferrari, M., Firpo, M.: Study of wave runup using numerical models and low-altitude aerial photogrammetry: a tool for coastal management. Estuar. Coast. Shelf Sci. 149, 160–167 (2014)
Christophe, E., Inglada, J.: Robust road extraction for high resolution satellite images. In: 2007 IEEE International Conference on Image Processing, pp. 437–440. IEEE (2007)
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2005)
Domke, J.: Learning graphical model parameters with approximate marginal inference. IEEE Trans. Pattern Anal. Mach. Intell. 35, 2454–2467 (2013)
Dubuisson-Jolly, M., Gupta, A.: Color and texture fusion: application to aerial image segmentation and GIS updating. Image Vis. Comput. 18, 823–832 (2010)
Ghiasi, M., Amirfattahi, R.: Fast semantic segmentation of aerial images based on color and texture. In: 8th Iranian Conference on Machine Vision and Image Processing (MVIP) (2013)
Gould, S., Fulton, R., Koller, D.: Decomposing a scene into geometric and semantically consistent regions. In: International Conference on Computer Vision (ICCV) (2009)
Laptev, I., Mayer, H., Lindeberg, T., Eckstein, W., Steger, C., Baumgartner, A.: Automatic extraction of roads from aerial images based on scale space and snakes. Mach. Vis. Appl. 12(1), 23–31 (2000)
Lathuiliere, S., Vu, H., Le, T., Tran, T., Hung, D.: Semantic regions recognition in UAV images sequence. Knowl. Syst. Eng. 326, 313–324 (2015)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
Marmanis, D., Wegner, J.D., Galliani, S., Schindler, K., Datcu, M., Stilla, U.: Semantic segmentation of aerial images with an ensemble of CNNs. ISPRS Ann. Photogrammetry Remote Sens. Spatial Inf. Sci. 3, 473–480 (2016)
Montoya-Zegarra, J., Wegner, J., Ladicky, L., Schindler, K.: Semantic segmentation of aerial images in urban areas with class-specific higher-order cliques. ISPRS Ann. Photogrammetry Remote Sens. Spatial Inf. Sci. 2, 127–133 (2015)
Munoz, D., Bagnell, J.A., Hebert, M.: Co-inference for multi-modal scene analysis. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7577, pp. 668–681. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33783-3_48
Penmetsa, S., Minhuj, F., Singh, A., Omkar, S.: Autonomous UAV for suspicious action detection using pictorial human pose estimation and classification. Electron. Lett. Comput. Vis. Image Anal. 3(1), 18–32 (2014)
Rezaeian, M., Amirfattahi, R., Sadri, S.: Semantic segmentation of aerial images using fusion of color and texture features. J. Comput. Secur. 1, 225–238 (2013)
Rochery, M., Jermyn, I.H., Zerubia, J.: Higher order active contours. Int. J. Comput. Vis. 69(1), 27–42 (2006)
Șerban, G., Rus, I., Vele, D., Brețcan, P., Alexe, M., Petrea, D.: Flood-prone area delimitation using UAV technology, in the areas hard-to-reach for classic aircrafts: case study in the north-east of Apuseni Mountains, Transylvania. Nat. Hazards 82, 1–16 (2016)
Sifre, L.: Rigid-motion scattering for image classification. Ph.D. thesis (2014)
Sifre, L., Mallat, S.: Combined scattering for rotation invariant texture analysis. In: European Symposium on Artificial Neural Networks (ESANN) (2012)
Sifre, L., Mallat, S.: Rotation, scaling and deformation invariant scattering for texture discrimination. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1233–1240 (2013)
Šmídl, V., Hofman, R.: Tracking of atmospheric release of pollution using unmanned aerial vehicles. Atmos. Environ. 67, 425–436 (2013)
Su, Y., Guo, Q., Fry, D.L., Collins, B.M., Kelly, M., Flanagan, J.P., Battles, J.J.: A vegetation mapping strategy for conifer forests by combining airborne lidar data and aerial imagery. Can. J. Remote Sens. 42(1), 1–15 (2016)