Hadoop-GIS: a high performance spatial data warehousing system over MapReduce

Published: 01 August 2013

Abstract

Support for high performance queries on large volumes of spatial data is becoming increasingly important in many application domains, including geospatial problems in numerous fields, location-based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost-effective and ubiquitous positioning technologies, the development of high-resolution imaging technologies, and contributions from a large number of community users. There are two major challenges in managing and querying massive spatial data: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS, a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, a customizable spatial query engine (RESQUE), implicit parallel spatial query execution on MapReduce, and effective methods for amending query results by handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on-demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS in query response and its high scalability on commodity clusters. Our comparative experiments have shown that the performance of Hadoop-GIS is on par with a parallel spatial DBMS (SDBMS) and that it outperforms the SDBMS for compute-intensive queries. Hadoop-GIS is available as a set of libraries for processing spatial queries, and as an integrated software package in Hive.
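The abstract names spatial partitioning and the handling of boundary objects without giving details. As a rough illustration only, the Python sketch below shows one common way these two ideas can fit together for an intersection join: each object is assigned to every grid tile it touches, each tile is joined independently (the part that could run in parallel), and pairs reported by more than one tile are collapsed at the end. The fixed grid, the rectangle representation, and all function names are assumptions made for this example; this is not the RESQUE engine or the Hadoop-GIS API.

```python
from collections import defaultdict
from itertools import product

# A rectangle is (xmin, ymin, xmax, ymax). All names are hypothetical.

def intersects(a, b):
    """True if two axis-aligned rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def tiles_for(rect, tile_size):
    """Return every grid tile (col, row) that the rectangle touches."""
    x0, y0 = int(rect[0] // tile_size), int(rect[1] // tile_size)
    x1, y1 = int(rect[2] // tile_size), int(rect[3] // tile_size)
    return product(range(x0, x1 + 1), range(y0, y1 + 1))

def partition(objects, tile_size):
    """Assign each object id to all tiles it touches.

    Objects that cross a tile boundary are replicated into several tiles.
    """
    tiles = defaultdict(list)
    for oid, rect in objects.items():
        for tile in tiles_for(rect, tile_size):
            tiles[tile].append(oid)
    return tiles

def partitioned_join(left, right, tile_size):
    """Join tile by tile, then drop duplicates caused by replication."""
    left_tiles = partition(left, tile_size)
    right_tiles = partition(right, tile_size)
    results = set()  # the set collapses pairs found in more than one tile
    for tile, left_ids in left_tiles.items():
        for l, r in product(left_ids, right_tiles.get(tile, [])):
            if intersects(left[l], right[r]):
                results.add((l, r))
    return results

def brute_force_join(left, right):
    """Reference answer: compare every pair without partitioning."""
    return {(l, r) for l in left for r in right if intersects(left[l], right[r])}

left = {"a": (0, 0, 3, 3), "b": (9, 9, 12, 12)}
right = {"x": (2, 2, 11, 11), "y": (20, 20, 21, 21)}
pairs = partitioned_join(left, right, tile_size=10)
print(sorted(pairs))
```

In this toy input, objects "b" and "x" both straddle the tile boundary at coordinate 10, so the pair ("b", "x") is found in several tiles and must be reported only once. Comparing the result against `brute_force_join` is a simple way to check that replication plus deduplication neither loses nor repeats results.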

Published In

Proceedings of the VLDB Endowment, Volume 6, Issue 11
August 2013
237 pages

Publisher

VLDB Endowment
