Abstract
Big data analytics systems such as Apache Spark offer built-in support for nested data, which abounds, for instance, as JSON data available online. However, these systems typically have to transform the data before (deeply) nested attributes become accessible for further processing. This adds complexity to big data analytics pipelines and may result in unnecessary runtime overhead. This paper therefore introduces tree-pattern matching as a first-class operator in big data analytics systems. The operator reduces pipeline complexity and accelerates pipeline processing by up to a factor of four compared to state-of-the-art pipelines for nested data. Its novelty lies in the distributed and data-parallel processing supported by its underlying tree-pattern matching algorithm. Experiments validate that our operator, implemented in Spark, can reduce both pipeline complexity and runtime.
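To illustrate what matching a tree pattern against nested records means, the following is a minimal single-machine sketch in plain Python. It is not the distributed algorithm or the Spark operator described in this paper. The pattern encoding (`name`, `axis`, `equals`, `children`), the helper functions, and the sample records are all hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Iterator, Tuple


def children(node: Any) -> Iterator[Tuple[str, Any]]:
    """Yield (key, value) for direct attributes; lists are treated as transparent."""
    if isinstance(node, dict):
        yield from node.items()
    elif isinstance(node, list):
        for item in node:
            yield from children(item)


def descendants(node: Any) -> Iterator[Tuple[str, Any]]:
    """Yield (key, value) for every attribute nested anywhere below node."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield key, value
            yield from descendants(value)
    elif isinstance(node, list):
        for item in node:
            yield from descendants(item)


def matches(node: Any, pattern: Dict[str, Any]) -> bool:
    """Return True if every branch of the (hypothetical) pattern occurs below node."""
    for branch in pattern.get("children", []):
        # Each branch is reached via a child edge or a descendant edge.
        if branch.get("axis") == "descendant":
            candidates = descendants(node)
        else:
            candidates = children(node)
        found = any(
            key == branch["name"]
            and ("equals" not in branch or value == branch["equals"])
            and matches(value, branch)
            for key, value in candidates
        )
        if not found:
            return False
    return True


# Hypothetical nested records, loosely shaped like JSON tweets.
records = [
    {"id": 1, "user": {"name": "ann"},
     "entities": {"hashtags": [{"text": "spark"}]}},
    {"id": 2, "user": {"name": "bob"},
     "entities": {"hashtags": []}},
    {"id": 3, "user": {"name": "cy"},
     "retweeted_status": {"entities": {"hashtags": [{"text": "spark"}]}}},
]

# Pattern: a direct "user" attribute with a "name" child, and, at any depth,
# a "hashtags" attribute that has a "text" child equal to "spark".
pattern = {"children": [
    {"name": "user", "axis": "child",
     "children": [{"name": "name", "axis": "child"}]},
    {"name": "hashtags", "axis": "descendant",
     "children": [{"name": "text", "axis": "child", "equals": "spark"}]},
]}

matching_ids = [r["id"] for r in records if matches(r, pattern)]
print(matching_ids)  # [1, 3]
```

In this sketch the pattern is evaluated directly on the nested structure, so records 1 and 3 match even though their hashtags sit at different depths, and no flattening step is needed beforehand.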
Acknowledgements
Partially funded by Deutsche Forschungsgemeinschaft (DFG) under Germany’s Excellence Strategy - EXC 2075 - 390740016.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Diestelkämper, R., Herschel, M. (2020). Distributed Tree-Pattern Matching in Big Data Analytics Systems. In: Darmont, J., Novikov, B., Wrembel, R. (eds) Advances in Databases and Information Systems. ADBIS 2020. Lecture Notes in Computer Science(), vol 12245. Springer, Cham. https://doi.org/10.1007/978-3-030-54832-2_14
DOI: https://doi.org/10.1007/978-3-030-54832-2_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-54831-5
Online ISBN: 978-3-030-54832-2
eBook Packages: Computer Science, Computer Science (R0)