Abstract
MapReduce is a new parallel programming model initially developed for large-scale web content processing. Multidimensional data analysis applications must also cope with large-scale datasets, and the arrival of MapReduce offers an opportunity to use commodity hardware to massively parallelize them. The translation and optimization of relational algebra operators into MapReduce programs is still an open and active research field. In this paper, we focus on a special type of data analysis query, namely the Multiple Group-by query. We first discuss the communication cost of the MapReduce model, then give an initial implementation of the Multiple Group-by query. After that, we propose an optimized version that addresses and reduces the communication cost. In our experimental measurements, the optimized version shows better speed-up and better scalability than the initial implementation. We also formally evaluate our results and give a set of execution time estimates for both the initial implementation and the optimized one.
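To make the idea concrete, the following is a minimal single-process sketch of a Multiple Group-by query (several group-by aggregates over the same fact table) expressed in map and reduce steps. It is an illustration under stated assumptions, not the implementation described in the paper: the table, the column names, the `SUM` aggregate, and the helper functions are all hypothetical, and the pre-aggregation step shown here is the generic combiner-style technique for shrinking the shuffled data, which may differ from the optimization the authors propose.

```python
from collections import defaultdict

# Hypothetical fact table: (region, product, year, sales).
ROWS = [
    ("EU", "A", 2010, 10),
    ("EU", "B", 2010, 5),
    ("US", "A", 2011, 7),
    ("EU", "A", 2011, 3),
]
COLUMNS = {"region": 0, "product": 1, "year": 2}
MEASURE = 3
# One SUM(sales) aggregate per grouping dimension (assumed query shape).
GROUP_BYS = ["region", "product", "year"]

# Two input splits, standing in for the blocks handed to two mappers.
SPLITS = [ROWS[:2], ROWS[2:]]


def map_naive(row):
    """Emit one (key, value) pair per group-by dimension for each row."""
    for dim in GROUP_BYS:
        yield (dim, row[COLUMNS[dim]]), row[MEASURE]


def map_preaggregated(split):
    """Sum values per key inside the split before emitting anything."""
    partial = defaultdict(int)
    for row in split:
        for key, value in map_naive(row):
            partial[key] += value
    yield from partial.items()


def reduce_sum(pairs):
    """Sum all values that share the same (dimension, value) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(splits, preaggregate):
    """Return the aggregates and the number of pairs sent to the reducer."""
    shuffled = []
    for split in splits:
        if preaggregate:
            shuffled.extend(map_preaggregated(split))
        else:
            for row in split:
                shuffled.extend(map_naive(row))
    return reduce_sum(shuffled), len(shuffled)


if __name__ == "__main__":
    naive_result, naive_pairs = run(SPLITS, preaggregate=False)
    opt_result, opt_pairs = run(SPLITS, preaggregate=True)
    print(naive_result == opt_result, naive_pairs, opt_pairs)
```

Both variants produce the same aggregates; the only difference is the number of intermediate pairs that would cross the network between the map and reduce phases, which is the quantity a communication-cost analysis is concerned with.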
Cite this article
Pan, J., Magoulès, F., Le Biannic, Y. et al. Parallelizing Multiple Group-by queries using MapReduce: optimization and cost estimation. Telecommun Syst 52, 635–645 (2013). https://doi.org/10.1007/s11235-011-9508-2
DOI: https://doi.org/10.1007/s11235-011-9508-2