Abstract
Spatial join is an important operation in geo-spatial applications, since it is frequently used for performing data analysis involving geographical information. Many efforts have been done in the past decades in order to provide efficient algorithms for spatial join and this becomes particularly important as the amount of spatial data to be processed increases. In recent years, the MapReduce approach has become a de-facto standard for processing large amount of data (big-data) and some attempts have been made for extending existing frameworks for the processing of spatial data. In this context, several different MapReduce implementations of spatial join have been defined which mainly differ in the use of a spatial index and in the way this index is built and used. In general, none of these algorithms can be considered better than the others, but the choice might depend on the characteristics of the involved datasets. The aim of this work is to deeply analyse them and define a cost model for ranking them based on the characteristics of the dataset at hand (i.e., selectivity or spatial properties). This cost model has been extensively tested w.r.t. a set of synthetic datasets in order to prove its effectiveness.
Similar content being viewed by others
Notes
From the official Hadoop documentation, the maximum number of parallel reducers could be set equal to the number of available containers multiplied by a factor of 0.95.
References
An N, Yang Z, Sivasubramaniam A (2001) Selectivity estimation for spatial joins. In: Proceedings of the 17th International Conference on Data Engineering, pp 368–375. https://doi.org/10.1109/ICDE.2001.914849
Aref W, Samet H (1994) A cost model for query optimization using R-Trees. In: Proceedings of the Second ACM Workshop on Advances in Geographic Information Systems, pp 60–67
Belussi A, Carra D, Migliorini S, Negri M, Pelagatti G (2018) What makes spatial data big? A discussion on how to partition spatial data. In: Proceedings of 10th International Conference on Geographic Information Science, pp 1–15. https://doi.org/10.1109/ICDE.2015.7113382
Belussi A, Migliorini S, Eldawy A (2018) A Cost Model for Spatial Join Operations in SpatialHadoop. Tech. Rep. RR108/2018, Dept. of Computer Science, University of Verona. https://iris.univr.it/handle/11562/981957
Belussi A, Migliorini S, Eldawy A (2018) Detecting Skewness of Big Spatial Data in SpatialHadoop. In: Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 432–435. https://doi.org/10.1145/3274895.3274923
van den Bercken J, Seeger B, Widmayer P (1999) The bulk index join: a generic approach to processing non-equijoins. In: Proceedings of the 15th International Conference on Data Engineering, pp 257–. https://doi.org/10.1109/ICDE.1999.754937
Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A Comparison of Join Algorithms for Log Processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp 975–986. https://doi.org/10.1145/1807167.1807273
Chasparis H, Eldawy A (2017) Experimental Evaluation of Selectivity Estimation on Big Spatial Data. In: Proceedings of the Fourth International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data (GeoRich 2017), collocated with ACM SIGMOD 2017, pp 8:1–8:6. Chicago. https://doi.org/10.1145/3080546.3080553
Dittrich J, Seeger B (2000) Data redundancy and duplicate detection in spatial join processing. In: Lomet DB, Weikum G (eds) Proceedings of the 16th International Conference on Data Engineering. https://doi.org/10.1109/ICDE.2000.839452. IEEE Computer Society, San Diego, pp 535–546
Eldawy A, Alarabi L, Mokbel MF (2015) Spatial Partitioning Techniques in SpatialHadoop. Proc VLDB Endow 8(12):1602–1605. https://doi.org/10.14778/2824032.2824057
Eldawy A, Mokbel MF (2015) SpatialHadoop: A MapReduce framework for spatial data. In: Proceedings of the 31st IEEE International Conference on Data Engineering, pp 1352–1363. https://doi.org/10.1109/ICDE.2015.7113382
Eldawy A, Mokbel MF (2017) Spatial Join with Hadoop. In: Encyclopedia of GIS. Springer, pp 2032–2036. https://doi.org/10.1007/978-3-319-17885-1_1570
Eldawy A, Mokbel MF (2017) The Era of Big Spatial Data. Proc VLDB Endow 10(12):1992–1995. https://doi.org/10.14778/3137765.3137828
Eldawy A, Sabek I, Elganainy M, Bakeer A, Abdelmotaleb A, Mokbel MF (2017) Sphinx: Empowering Impala for Efficient Execution of SQL Queries on Big Spatial Data. In: Proceedings of the 15th International Symposium on Advances in Spatial and Temporal Databases, pp 65–83. https://doi.org/10.1007/978-3-319-64367-0_4
Gu J, Peng S, Wang XS, Rao W, Yang M, Cao Y (2014) Cost-Based Join Algorithm Selection in Hadoop. In: Proceedings of the 15th International Conference on Web Information Systems Engineering, pp 246–261. https://doi.org/10.1007/978-3-319-11746-1_18
Han S, Choi W, Muwafiq R, Nah Y (2017) Impact of Memory Size on Bigdata Processing based on Hadoop and Spark. In: PRoceedings of the International Conference on Research in Adaptive and Convergent Systems (RACS), pp 275–280. https://doi.org/10.1145/3129676.3129688
Harada L, Nakano M, Kitsuregawa M, Takagi M (1990) Query Processing for Multi-Attribute Clustered Records. In: Proceedings of 16th International Conferece on Very Large Data Bases, pp 59–70
Hoel EG, Samet H (1995) Benchmarking Spatial Join Operations with Spatial Output. In: Proceedings of the 21th International Conference on Very Large Data Bases, pp 606–618
Jacox EH, Samet H (2007) Spatial Join Techniques. ACM Transactions on Database Systems 32(1). https://doi.org/10.1145/1206049.1206056
Lin X, Meng Z, Xu C, Wang M (2012) A Practical Performance Model for Hadoop MapReduce. In: Proceeding of the 2012 IEEE international conference on cluster computing workshops, pp 231–239. https://doi.org/10.1109/ClusterW.2012.24
Mamoulis N, Papadias D (2001) Multiway Spatial Joins. ACM Trans Database Syst 26(4), 424–475. https://doi.org/10.1145/503099.503101
Mavridis I, Karatza H (2017) Performance Evaluation of Cloud-based Log File Analysis with Apache Hadoop and Apache Spark. J Syst Softw 125 (C):133–151. https://doi.org/10.1016/j.jss.2016.11.037
Nievergelt J, Hinterberger H, Sevcik KC (1981) The grid file: An adaptable, symmetric multi-key file structure. In: Proceedings of 3rd Conference of the European Cooperation in Informatics – Trends in Information Processing Systems, pp 236–251. https://doi.org/10.1007/3-540-10885-8_45
Papadopoulos A, Rigaux P, Scholl M (1999) A Performance Evaluation of Spatial Join Processing Strategies. In: Proceedings of 6th International Symposium on Advances in Spatial Databases, pp 286–307. https://doi.org/10.1007/3-540-48482-5_18
Patel JM, DeWitt DJ (1996) Partition based spatial-merge join. SIGMOD Rec 25(2):259–270. https://doi.org/10.1145/235968.233338
Rigaux P, Scholl M, Voisard A (2002) Spatial Databases with Application to GIS. Morgan Kaufmann Publishers Inc., San Francisco
Sabek I, Mokbel MF (2017) On Spatial Joins in MapReduce. In: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 21:1–21:10. https://doi.org/10.1145/3139958.3139967
Samadi Y, Zbakh M, Tadonki C (2018) Performance comparison between Hadoop and Spark frameworks using HiBenchbenchmarks Concurrency and Computation: Practice and Experience 30(12). https://doi.org/10.1002/cpe.4367
Siddique AB, Eldawy A, Hristidis V (2019) Comparing Synopsis Techniques for Approximate Spatial Data Analysis. PVLDB 12(11):1583–1596. https://doi.org/10.14778/3342263.3342635, http://www.vldb.org/pvldb/vol12/p1583-siddique.pdf
Sowell B, Salles MV, Cao T, Demers A, Gehrke J (2013) An Experimental Analysis of Iterated Spatial Joins in Main Memory. Proc VLDB Endow 6 (14):1882–1893. https://doi.org/10.14778/2556549.2556570
Šidlauskas D, Jensen CS (2014) Spatial Joins in Main Memory: Implementation Matters!. Proc VLDB Endow 8(1):97–100. https://doi.org/10.14778/2735461.2735470
White T (2015) HAdoop: The Definitive Guide, 4th. O’Reilly Media, Inc
Xie D, Li F, Yao B, Li G, Chen Z, Zhou L, Guo M (2016) Simba: spatial in-memory big data analysis. In: Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 86:1–86:4. https://doi.org/10.1145/2996913.2996935
Yu J, Wu J, Sarwat M (2015) Geospark: a cluster computing framework for processing large-scale spatial data. In: Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 70:1–70:4. https://doi.org/10.1145/2820783.2820860
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Belussi, A., Migliorini, S. & Eldawy, A. Cost estimation of spatial join in spatialhadoop. Geoinformatica 24, 1021–1059 (2020). https://doi.org/10.1007/s10707-020-00414-x
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10707-020-00414-x