Parallelizing Multiple Group-by queries using MapReduce: optimization and cost estimation

Jie Pan¹,
Frédéric Magoulès¹,
Yann Le Biannic² &
Christophe Favart²

Abstract

MapReduce is a new parallel programming model initially developed for large-scale web content processing. Multidimensional data analysis applications must cope with very large datasets, and MapReduce offers a way to use commodity hardware to parallelize them massively. Translating relational algebra operators into MapReduce programs, and optimizing the result, remains an open and active research field. In this paper, we focus on a special type of data analysis query, namely the Multiple Group-by query. We first discuss the communication cost of the MapReduce model, then give an initial implementation of the Multiple Group-by query. We then propose an optimized version that reduces the communication cost. According to our experimental measurements, the optimized version achieves better speed-up and better scalability than the initial implementation. We also formally evaluate our results and give a set of execution time estimations for both the initial implementation and the optimized one.
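
To make the notion of communication cost concrete, the following is a minimal single-process sketch of a Multiple Group-by query written in MapReduce style. It is an illustration only and is not taken from the paper: the function names (`map_record`, `combine`, `reduce_pairs`, `multiple_group_by`), the use of SUM as the aggregate, and the choice of local pre-aggregation (a combiner) as the way to reduce traffic are all assumptions. The number of intermediate pairs handed to the reduce step stands in for communication cost.

```python
from collections import defaultdict


def map_record(record, group_by_columns, measure):
    """Emit one ((column, value), measure) pair per Group-by dimension."""
    for column in group_by_columns:
        yield (column, record[column]), record[measure]


def combine(pairs):
    """Pre-aggregate pairs locally so that fewer pairs leave the mapper."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_pairs(pairs):
    """Sum all values that share the same (column, value) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def multiple_group_by(partitions, group_by_columns, measure, use_combiner):
    """Run every Group-by in one pass over the partitions.

    Returns the aggregates and the number of intermediate pairs that
    would be sent from mappers to reducers (a proxy for communication).
    """
    shuffled = []
    for partition in partitions:  # each partition plays the role of one mapper
        emitted = [
            pair
            for record in partition
            for pair in map_record(record, group_by_columns, measure)
        ]
        shuffled.extend(combine(emitted) if use_combiner else emitted)
    return reduce_pairs(shuffled), len(shuffled)
```

Calling `multiple_group_by` twice on the same partitions, once with `use_combiner=False` and once with `use_combiner=True`, yields identical aggregates while the second call hands fewer pairs to the reduce step whenever a partition contains repeated dimension values.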

Author information

Authors and Affiliations

École Centrale Paris, Grande voie des vignes, 92295, Châtenay-Malabry, France
Jie Pan & Frédéric Magoulès
SAP BusinessObjects, 157-159, rue Anatole France, 92309, Levallois-Perret, France
Yann Le Biannic & Christophe Favart

Corresponding author

Correspondence to Jie Pan.

About this article

Cite this article

Pan, J., Magoulès, F., Le Biannic, Y. et al. Parallelizing Multiple Group-by queries using MapReduce: optimization and cost estimation. Telecommun Syst 52, 635–645 (2013). https://doi.org/10.1007/s11235-011-9508-2

Published: 27 July 2011
Issue Date: February 2013
DOI: https://doi.org/10.1007/s11235-011-9508-2
