Hadoop GIS: a high performance spatial data warehousing system over mapreduce

Published: 01 August 2013

Abstract

Support for high performance queries on large volumes of spatial data is becoming increasingly important in many application domains, including geospatial problems in numerous fields, location-based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive-scale spatial data is due to the proliferation of cost-effective and ubiquitous positioning technologies, the development of high-resolution imaging technologies, and contributions from a large number of community users. There are two major challenges in managing and querying massive spatial data: the explosion of spatial data, and the high computational complexity of spatial queries. In this paper, we present Hadoop-GIS, a scalable and high performance spatial data warehousing system for running large-scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, a customizable spatial query engine (RESQUE), implicit parallel spatial query execution on MapReduce, and effective methods for amending query results by handling boundary objects. Hadoop-GIS uses global partition indexing and customizable on-demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments demonstrate the high efficiency of Hadoop-GIS in query response and its high scalability on commodity clusters. Our comparative experiments show that the performance of Hadoop-GIS is on par with parallel SDBMS and that it outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a set of libraries for processing spatial queries, and as an integrated software package in Hive.
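
The partition-then-join idea described in the abstract can be illustrated with a small single-process sketch. This is not the Hadoop-GIS or RESQUE implementation: the fixed square grid, the bounding-box-only intersection test, and every name below (`tiles_for`, `partition`, `spatial_join`) are assumptions made for illustration. Objects that cross tile boundaries are replicated into each tile they overlap, and the duplicate result pairs this produces are removed with a set, which stands in for the result-amending step the abstract mentions.

```python
from collections import defaultdict
from itertools import product

# A box is (xmin, ymin, xmax, ymax); an object is (object_id, box).
# All names here are hypothetical and chosen only for this sketch.

def intersects(a, b):
    """Return True if two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def tiles_for(box, tile_size):
    """Yield every grid tile (col, row) that the box overlaps."""
    x0, y0, x1, y1 = box
    for col in range(int(x0 // tile_size), int(x1 // tile_size) + 1):
        for row in range(int(y0 // tile_size), int(y1 // tile_size) + 1):
            yield (col, row)

def partition(objects, tile_size):
    """Assign each object to all tiles it overlaps.

    Boundary objects (those spanning several tiles) are replicated.
    """
    tiles = defaultdict(list)
    for obj_id, box in objects:
        for tile in tiles_for(box, tile_size):
            tiles[tile].append((obj_id, box))
    return tiles

def spatial_join(left, right, tile_size):
    """Join two datasets tile by tile and drop duplicate result pairs."""
    left_tiles = partition(left, tile_size)
    right_tiles = partition(right, tile_size)
    results = set()  # the set removes pairs reported by more than one tile
    for tile, left_objs in left_tiles.items():
        # Each tile is independent, so tiles could be processed in parallel.
        for (lid, lbox), (rid, rbox) in product(left_objs, right_tiles.get(tile, [])):
            if intersects(lbox, rbox):
                results.add((lid, rid))
    return sorted(results)

left = [("a", (0, 0, 3, 3)), ("b", (9, 9, 12, 12))]
right = [("x", (2, 2, 11, 11)), ("y", (20, 20, 21, 21))]
pairs = spatial_join(left, right, tile_size=10)
print(pairs)
```

In this toy input, objects "b" and "x" both span four tiles, so the pair ("b", "x") is found once per shared tile and reported only once after deduplication. A real system would run the per-tile loop as independent tasks and use exact geometry tests rather than bounding boxes.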

Published In

Proceedings of the VLDB Endowment, Volume 6, Issue 11
August 2013
237 pages

Publisher

VLDB Endowment
