Cost estimation of spatial join in spatialhadoop

357 Accesses
8 Citations
Explore all metrics

Abstract

Spatial join is an important operation in geo-spatial applications, since it is frequently used for performing data analysis involving geographical information. Many efforts have been done in the past decades in order to provide efficient algorithms for spatial join and this becomes particularly important as the amount of spatial data to be processed increases. In recent years, the MapReduce approach has become a de-facto standard for processing large amount of data (big-data) and some attempts have been made for extending existing frameworks for the processing of spatial data. In this context, several different MapReduce implementations of spatial join have been defined which mainly differ in the use of a spatial index and in the way this index is built and used. In general, none of these algorithms can be considered better than the others, but the choice might depend on the characteristics of the involved datasets. The aim of this work is to deeply analyse them and define a cost model for ranking them based on the characteristics of the dataset at hand (i.e., selectivity or spatial properties). This cost model has been extensively tested w.r.t. a set of synthetic datasets in order to prove its effectiveness.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Efficient large-scale distance-based join queries in spatialhadoop

Article 20 September 2017

New plane-sweep algorithms for distance-based join queries in spatial databases

Article 27 February 2016

Voronoi-Diagram Based Partitioning for Distance Join Query Processing in SpatialHadoop

Notes

From the official Hadoop documentation, the maximum number of parallel reducers could be set equal to the number of available containers multiplied by a factor of 0.95.

References

An N, Yang Z, Sivasubramaniam A (2001) Selectivity estimation for spatial joins. In: Proceedings of the 17th International Conference on Data Engineering, pp 368–375. https://doi.org/10.1109/ICDE.2001.914849
Aref W, Samet H (1994) A cost model for query optimization using R-Trees. In: Proceedings of the Second ACM Workshop on Advances in Geographic Information Systems, pp 60–67
Belussi A, Carra D, Migliorini S, Negri M, Pelagatti G (2018) What makes spatial data big? A discussion on how to partition spatial data. In: Proceedings of 10th International Conference on Geographic Information Science, pp 1–15. https://doi.org/10.1109/ICDE.2015.7113382
Belussi A, Migliorini S, Eldawy A (2018) A Cost Model for Spatial Join Operations in SpatialHadoop. Tech. Rep. RR108/2018, Dept. of Computer Science, University of Verona. https://iris.univr.it/handle/11562/981957
Belussi A, Migliorini S, Eldawy A (2018) Detecting Skewness of Big Spatial Data in SpatialHadoop. In: Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 432–435. https://doi.org/10.1145/3274895.3274923
van den Bercken J, Seeger B, Widmayer P (1999) The bulk index join: a generic approach to processing non-equijoins. In: Proceedings of the 15th International Conference on Data Engineering, pp 257–. https://doi.org/10.1109/ICDE.1999.754937
Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A Comparison of Join Algorithms for Log Processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp 975–986. https://doi.org/10.1145/1807167.1807273
Chasparis H, Eldawy A (2017) Experimental Evaluation of Selectivity Estimation on Big Spatial Data. In: Proceedings of the Fourth International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data (GeoRich 2017), collocated with ACM SIGMOD 2017, pp 8:1–8:6. Chicago. https://doi.org/10.1145/3080546.3080553
Dittrich J, Seeger B (2000) Data redundancy and duplicate detection in spatial join processing. In: Lomet DB, Weikum G (eds) Proceedings of the 16th International Conference on Data Engineering. https://doi.org/10.1109/ICDE.2000.839452. IEEE Computer Society, San Diego, pp 535–546
Eldawy A, Alarabi L, Mokbel MF (2015) Spatial Partitioning Techniques in SpatialHadoop. Proc VLDB Endow 8(12):1602–1605. https://doi.org/10.14778/2824032.2824057
Article Google Scholar
Eldawy A, Mokbel MF (2015) SpatialHadoop: A MapReduce framework for spatial data. In: Proceedings of the 31st IEEE International Conference on Data Engineering, pp 1352–1363. https://doi.org/10.1109/ICDE.2015.7113382
Eldawy A, Mokbel MF (2017) Spatial Join with Hadoop. In: Encyclopedia of GIS. Springer, pp 2032–2036. https://doi.org/10.1007/978-3-319-17885-1_1570
Eldawy A, Mokbel MF (2017) The Era of Big Spatial Data. Proc VLDB Endow 10(12):1992–1995. https://doi.org/10.14778/3137765.3137828
Article Google Scholar
Eldawy A, Sabek I, Elganainy M, Bakeer A, Abdelmotaleb A, Mokbel MF (2017) Sphinx: Empowering Impala for Efficient Execution of SQL Queries on Big Spatial Data. In: Proceedings of the 15th International Symposium on Advances in Spatial and Temporal Databases, pp 65–83. https://doi.org/10.1007/978-3-319-64367-0_4
Gu J, Peng S, Wang XS, Rao W, Yang M, Cao Y (2014) Cost-Based Join Algorithm Selection in Hadoop. In: Proceedings of the 15th International Conference on Web Information Systems Engineering, pp 246–261. https://doi.org/10.1007/978-3-319-11746-1_18
Han S, Choi W, Muwafiq R, Nah Y (2017) Impact of Memory Size on Bigdata Processing based on Hadoop and Spark. In: PRoceedings of the International Conference on Research in Adaptive and Convergent Systems (RACS), pp 275–280. https://doi.org/10.1145/3129676.3129688
Harada L, Nakano M, Kitsuregawa M, Takagi M (1990) Query Processing for Multi-Attribute Clustered Records. In: Proceedings of 16th International Conferece on Very Large Data Bases, pp 59–70
Hoel EG, Samet H (1995) Benchmarking Spatial Join Operations with Spatial Output. In: Proceedings of the 21th International Conference on Very Large Data Bases, pp 606–618
Jacox EH, Samet H (2007) Spatial Join Techniques. ACM Transactions on Database Systems 32(1). https://doi.org/10.1145/1206049.1206056
Lin X, Meng Z, Xu C, Wang M (2012) A Practical Performance Model for Hadoop MapReduce. In: Proceeding of the 2012 IEEE international conference on cluster computing workshops, pp 231–239. https://doi.org/10.1109/ClusterW.2012.24
Mamoulis N, Papadias D (2001) Multiway Spatial Joins. ACM Trans Database Syst 26(4), 424–475. https://doi.org/10.1145/503099.503101
Mavridis I, Karatza H (2017) Performance Evaluation of Cloud-based Log File Analysis with Apache Hadoop and Apache Spark. J Syst Softw 125 (C):133–151. https://doi.org/10.1016/j.jss.2016.11.037
Article Google Scholar
Nievergelt J, Hinterberger H, Sevcik KC (1981) The grid file: An adaptable, symmetric multi-key file structure. In: Proceedings of 3rd Conference of the European Cooperation in Informatics – Trends in Information Processing Systems, pp 236–251. https://doi.org/10.1007/3-540-10885-8_45
Papadopoulos A, Rigaux P, Scholl M (1999) A Performance Evaluation of Spatial Join Processing Strategies. In: Proceedings of 6th International Symposium on Advances in Spatial Databases, pp 286–307. https://doi.org/10.1007/3-540-48482-5_18
Patel JM, DeWitt DJ (1996) Partition based spatial-merge join. SIGMOD Rec 25(2):259–270. https://doi.org/10.1145/235968.233338
Article Google Scholar
Rigaux P, Scholl M, Voisard A (2002) Spatial Databases with Application to GIS. Morgan Kaufmann Publishers Inc., San Francisco
Sabek I, Mokbel MF (2017) On Spatial Joins in MapReduce. In: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 21:1–21:10. https://doi.org/10.1145/3139958.3139967
Samadi Y, Zbakh M, Tadonki C (2018) Performance comparison between Hadoop and Spark frameworks using HiBenchbenchmarks Concurrency and Computation: Practice and Experience 30(12). https://doi.org/10.1002/cpe.4367
Siddique AB, Eldawy A, Hristidis V (2019) Comparing Synopsis Techniques for Approximate Spatial Data Analysis. PVLDB 12(11):1583–1596. https://doi.org/10.14778/3342263.3342635, http://www.vldb.org/pvldb/vol12/p1583-siddique.pdf
Sowell B, Salles MV, Cao T, Demers A, Gehrke J (2013) An Experimental Analysis of Iterated Spatial Joins in Main Memory. Proc VLDB Endow 6 (14):1882–1893. https://doi.org/10.14778/2556549.2556570
Article Google Scholar
Šidlauskas D, Jensen CS (2014) Spatial Joins in Main Memory: Implementation Matters!. Proc VLDB Endow 8(1):97–100. https://doi.org/10.14778/2735461.2735470
Article Google Scholar
White T (2015) HAdoop: The Definitive Guide, 4th. O’Reilly Media, Inc
Xie D, Li F, Yao B, Li G, Chen Z, Zhou L, Guo M (2016) Simba: spatial in-memory big data analysis. In: Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 86:1–86:4. https://doi.org/10.1145/2996913.2996935
Yu J, Wu J, Sarwat M (2015) Geospark: a cluster computing framework for processing large-scale spatial data. In: Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 70:1–70:4. https://doi.org/10.1145/2820783.2820860

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of Verona, Verona, Italy
A. Belussi & S. Migliorini
Department of Computer Science and Engineering, University of California Riverside, Riverside, CA, USA
A. Eldawy

Authors

A. Belussi
View author publications
You can also search for this author in PubMed Google Scholar
S. Migliorini
View author publications
You can also search for this author in PubMed Google Scholar
A. Eldawy
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to S. Migliorini.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Belussi, A., Migliorini, S. & Eldawy, A. Cost estimation of spatial join in spatialhadoop. Geoinformatica 24, 1021–1059 (2020). https://doi.org/10.1007/s10707-020-00414-x

Download citation

Received: 11 June 2019
Revised: 28 February 2020
Accepted: 18 May 2020
Published: 05 July 2020
Issue Date: October 2020
DOI: https://doi.org/10.1007/s10707-020-00414-x

Cost estimation of spatial join in spatialhadoop

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Efficient large-scale distance-based join queries in spatialhadoop

New plane-sweep algorithms for distance-based join queries in spatial databases

Voronoi-Diagram Based Partitioning for Distance Join Query Processing in SpatialHadoop

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Cost estimation of spatial join in spatialhadoop

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Efficient large-scale distance-based join queries in spatialhadoop

New plane-sweep algorithms for distance-based join queries in spatial databases

Voronoi-Diagram Based Partitioning for Distance Join Query Processing in SpatialHadoop

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now