Abstract
MapReduce has been widely recognized as an efficient tool for large-scale data analysis. It achieves high performance by exploiting parallelism among processing nodes while providing a simple interface for upper-layer applications. However, there are many existing applications maintaining their data in a distributed database. It is costly to export those data into the storage system of MapReduce (normally a distributed file system). Moreover, compared to MapReduce, database is equipped with many state-of-the-art techniques, such as index and optimizer. Therefore, a hybrid Database-MapReduce system inheriting the advantages of both systems is preferred. In this paper, we propose AQUA+, a query optimizer tailored for the hybrid system. AQUA+ is an extension work of our previous system AQUA. It generates a plan that adaptively assigns the operators to the database engine and MapReduce engine to optimize the performance. The intuition is to exploit the index, co-partition and other features provided by the database as much as possible and reduce the data volume processed by the MapReduce. Due to the complexity of query optimization, in AQUA+, we introduce a novel tuning technique, learning to optimize. In particular, two neural networks are trained to predict cost and refine query plan, respectively. We train them based on our log of real query processing. Experiments carried out on our in-house cluster confirm the effectiveness of our query optimizer.
Similar content being viewed by others
References
Abouzied A, Abadi DJ, Bajda-Pawlikowski K, Silberschatz A (2019) ‘Integration of large-scale data processing systems and traditional parallel database technology’, Proc. VLDB Endow. f12(12), 2290–2299. http://www.vldb.org/pvldb/vol12/p2290-abouzied.pdf
Abouzied A, Bajda-Pawlikowski K, Huang J, Abadi DJ, Silberschatz A (2010) Hadoopdb in action: building real world applications, in ‘SIGMOD Conference’, pp. 1111–1114
Afrati FN, Ullman JD (2010) Optimizing joins in a map-reduce environment, In Proceedings of the 13th International Conference on Extending Database Technology (pp. 99-110)
Bernstein PA, Goodman N, Wong E, Reeve CL, Rothnie JB Jr (1981) Query processing in a system for distributed databases (sdd-1). ACM Trans. Database Syst. 6(4):602–625
Bittorf M, Bobrovytsky T, Erickson C C A C J, Hecht M GD, Kuff M J I JL, Leblang D KA, Robinson N L I PH, Rus D RS, Wanderman J R D TS, Yoder MM (2015) Impala: A modern, open-source sql engine for hadoop, in Proceedings of the 7th Biennial Conference on Innovative Data Systems Research
Camacho-Rodríguez J, Chauhan A, Gates A, Koifman E, O’Malley O, Garg V, Haindrich Z, Shelukhin S, Jayachandran P, Seth S, Jaiswal D, Bouguerra S, Bangarwa N, Hariappan S, Agarwal A, Dere J, Dai D, Nair T, Dembla N, Vijayaraghavan G, Hagleitner G (2019) Apache hive: From mapreduce to enterprise-grade big data warehousing, in P. A. Boncz, S. Manegold, A. Ailamaki, A. Deshpande and T. Kraska, eds, ‘Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019’, ACM, pp. 1773–1786. https://doi.org/10.1145/3299869.3314045
Chaudhuri S (1998) An overview of query optimization in relational systems, in PODS, pp. 34–43
Chen M-S, Yu PS, Wu K-L (1996) Optimization of parallel execution for multi-join queries. IEEE Trans on Knowl and Data Eng 8(3):416–428
Chen S (2010) Cheetah: a high performance, custom data warehouse on top of mapreduce. Proc. VLDB Endow. 3(1–2):1459–1468
Chen Z, Gehrke J, Korn F (2001) Query optimization in compressed database systems, in Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, May 21-24, 2001, pp. 271–282. https://doi.org/10.1145/375663.375692
Condie T, Conway N, Alvaro P, Hellerstein JM, Elmeleegy K, Sears R (2010) Mapreduce online. In NSDI 10:21–29
Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters, in OSDI, 137–150
Franklin MJ, Jónsson BT, Kossmann D (1996) Performance tradeoffs for client-server query processing. SIGMOD Rec. 25(2):149–160
Friedman E, Pawlowski PM, Cieslewicz J (2009) Sql/mapreduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB 2(2):1402–1413
Ganguly S, Hasan W, Krishnamurthy R (1992) Query optimization for parallel execution. SIGMOD Rec. 21(2):
Hausenblas M, Nadeau J (2013) Apache drill: interactive ad-hoc analysis at scale. Big Data 1(2):100–104
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Jarke M, Koch J (1984) Query optimization in database systems. ACM Comput. Surv. 16(2):111–152
Jia Y (2009) Running tpc-h queries on hive, in http://issues.apache.org/jira/browse/HIVE-600
Lin Y, Agrawal D, Chen C, Ooi BC, Wu S (2011) Llama: leveraging columnar storage for scalable join processing in the mapreduce framework, in SIGMOD Conference, 961–972
Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig latin: a not-so-foreign language for data processing, in SIGMOD Conference, 1099–1110
Ozsu MT (2007) Principles of distributed database systems, 3rd edn. Prentice Hall Press, NJ, USA
Poosala V, Haas PJ, Ioannidis YE, Shekita EJ (1996) Improved histograms for selectivity estimation of range predicates, in SIGMOD Conference, 294–305
Stewart RJ, Trinder PW, Loidl H-W (2011) Comparing high level mapreduce query languages, in APPT, 58–72
Tai KS, Socher R, Manning CD (2015) Improved semantic representations from tree-structured long short-term memory networks, in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, 1556–1566. https://www.aclweb.org/anthology/P15-1150/
Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive - a warehousing solution over a map-reduce framework. PVLDB 2(2):1626–1629
bibitemhiveicde Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Zhang N, Anthony S, Liu H, Murthy R (2010) Hive - a petabyte scale data warehouse using hadoop, In ICDE, 996–1005
Traverso M (2013) Presto: Interacting with petabytes of data at facebook. Retrieved February 4, 2014
Wu S, Li F, Mehrotra S, Ooi BC (2011) Query optimization for massively parallel data processing, In SoCC , 12
Acknowledgements
This work was supported by the Key Research Program of Zhejiang Province [grant numbers 2021C01109]; the Zhejiang Provincial Natural Science Foundation [grant numbers LZ21F020007]; and the Grid State Foundation [grant number 5211XT190033].
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Pang, Z., Wu, S., Huang, H. et al. AQUA+: Query Optimization for Hybrid Database-MapReduce System. Knowl Inf Syst 63, 905–938 (2021). https://doi.org/10.1007/s10115-020-01542-4
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-020-01542-4