Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Free access

Reasoning on data partitioning for single-round multi-join evaluation in massively parallel systems

Published: 21 February 2017 Publication History

Abstract

Evaluating queries over massive amounts of data is a major challenge in the big data era. Modern massively parallel systems, such as, Spark, organize query answering as a sequence of rounds each consisting of a distinct communication phase followed by a computation phase. The communication phase redistributes data over the available servers, while in the subsequent computation phase each server performs the actual computation on its local data. There is a growing interest in single-round algorithms for evaluating multiway joins where data is first reshuffled over the servers and then evaluated in a parallel but communication-free way. As the amount of communication induced by a reshuffling of the data is a dominating cost in such systems, we introduce a framework for reasoning about data partitioning to detect when we can avoid the data reshuffling step. Specifically, we formalize the decision problems parallel-correctness and transfer of parallel-correctness, provide semantical characterizations, and obtain tight complexity bounds.

References

[1]
Afrati, F.N., Ullman, J.D. Optimizing multiway joins in a map-reduce environment. IEEE Trans. Knowl. Data Eng. 23, 9 (2011), 1282--1298.
[2]
Ameloot, T.J., Geck, G., Ketsman, B., Neven, F., Schwentick, T. Parallel-correctness and transferability for conjunctive queries, submitted for journal publication (2015).
[3]
Arora, S., Barak, B. Computational Complexity -- A Modern Approach. Cambridge University Press, 2009.
[4]
Beame, P., Koutris, P., Suciu, D. Communication steps for parallel query processing. In Proceedings of the 32nd Symposium on Principles of Database Systems, PODS'13 (2013), 273--284.
[5]
Beame, P., Koutris, P., Suciu, D. Skew in parallel query processing. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'14 (2014), 212--223.
[6]
Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y. A comparison of join algorithms for log processing in mapreduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, A.K. Elmagarmid and D. Agrawal, eds. (Indianapolis, Indiana, USA, June 6--10, 2010). ACM 975--986.
[7]
Chu, S., Balazinska, M., Suciu, D. From theory to practice: Efficient join query evaluation in a parallel database system. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (2015), 63--78.
[8]
Dean, J., Ghemawat, S. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113.
[9]
Ganguly, S., Silberschatz, A., Tsur, S. Parallel bottom-up processing of datalog queries. J. Log. Program. 14, 1&2 (1992), 101--126.
[10]
Geck, G., Ketsman, B., Neven, F., Schwentick, T. Parallel-correctness and containment for conjunctive queries with union and negation. In International Conference on Database Theory (2016), 9:1--9:17.
[11]
Halperin, D., Teixeira de Almeida, V., Choo, L.L., Chu, S., Koutris, P., Moritz, D., Ortiz, J., Ruamviboonsuk, V., Wang, J., Whitaker, A., Xu, S., Balazinska, M., Howe, B., Suciu, D. Demonstration of the Myria big data management service. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD'14 (2014), 881--884.
[12]
Koutris, P., Suciu, D. Parallel evaluation of conjunctive queries. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, M. Lenzerini and T. Schwentick, eds. (Athens, Greece, June 12--16, 2011). ACM, 223--234.
[13]
Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T. Dremel: Interactive analysis of web-scale datasets. Proc. VLDB Endow. 3, 1--2 (Sept. 2010), 330--339.
[14]
Mugnier, M., Simonet, G., Thomazo, M. On the complexity of entailment in existential conjunctive first-order logic with atomic negation. Inf. Comput. 215 (2012), 8--31.
[15]
Nehme, R., Bruno, N. Automated partitioning design in parallel database systems. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD'11 (2011), 1137--1148.
[16]
Ngo, H.Q., Porat, E., Ré, C., Rudra, A. Worst-case optimal join algorithms. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012 (2012), 37--48.
[17]
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A. Pig latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, J. Tsong and L. Wang, eds.(Vancouver, BC, Canada, June 10--12, 2008). ACM 1099--1110.
[18]
Rao, J., Zhang, C., Megiddo, N., Lohman, G. Automating physical database design in a parallel database. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD'02 (2002), 558--569.
[19]
Shute, J., Vingralek, R., Samwel, B., Handy, B., Whipkey, C., Rollins, E., Oancea, M., Littlefield, K., Menestrina, D., Ellner, S., Cieslewicz, J., Rae, I., Stancescu, T., Apte, H. F1: A distributed sql database that scales. Proc. VLDB Endow. 6, 11 (Aug. 2013), 1068--1079.
[20]
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R. Hive: A warehousing solution over a map-reduce framework. PVLDB 2, 2 (2009), 1626--1629.
[21]
Ullman, J.D. Information integration using logical views. Theor. Comput. Sci. 239, 2 (2000), 189--210.
[22]
Veldhuizen, T.L. Triejoin: A simple, worst-case optimal join algorithm. In Proceedings of the 17th International Conference on Database Theory (ICDT) (2014), 96--106.
[23]
Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., Stoica, I. Shark: Sql and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD'13 (2013), 13--24.

Cited By

View all
  • (2023)Combined application of ideological and political education and big data Internet technology in the context of education reformApplied Mathematics and Nonlinear Sciences10.2478/amns.2023.1.002209:1Online publication date: 3-Jun-2023
  • (2019)Split-Correctness in Information ExtractionProceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3294052.3319684(149-163)Online publication date: 25-Jun-2019
  • (2018)Bank Big Data Architecture Based on Massive Parallel Processing Database2018 15th International Symposium on Pervasive Systems, Algorithms and Networks (I-SPAN)10.1109/I-SPAN.2018.00024(93-99)Online publication date: Oct-2018
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Communications of the ACM
Communications of the ACM  Volume 60, Issue 3
March 2017
89 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/3055102
  • Editor:
  • Moshe Y. Vardi
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 February 2017
Published in CACM Volume 60, Issue 3

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Research Foundation Flanders

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)255
  • Downloads (Last 6 weeks)30
Reflects downloads up to 21 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Combined application of ideological and political education and big data Internet technology in the context of education reformApplied Mathematics and Nonlinear Sciences10.2478/amns.2023.1.002209:1Online publication date: 3-Jun-2023
  • (2019)Split-Correctness in Information ExtractionProceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3294052.3319684(149-163)Online publication date: 25-Jun-2019
  • (2018)Bank Big Data Architecture Based on Massive Parallel Processing Database2018 15th International Symposium on Pervasive Systems, Algorithms and Networks (I-SPAN)10.1109/I-SPAN.2018.00024(93-99)Online publication date: Oct-2018
  • (2017)Parallel-Correctness and Transferability for Conjunctive QueriesJournal of the ACM10.1145/310641264:5(1-38)Online publication date: 4-Sep-2017

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Digital Edition

View this article in digital edition.

Digital Edition

Magazine Site

View this article on the magazine site (external)

Magazine Site

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media