
Parallelizing Multiple Group-by queries using MapReduce: optimization and cost estimation

Telecommunication Systems

Abstract

MapReduce is a new parallel programming model initially developed for large-scale web content processing. Multidimensional data analysis applications face the challenges of large-scale datasets, and the arrival of MapReduce provides an opportunity to use commodity hardware to massively parallelize them. Translating relational algebra operators into MapReduce programs, and optimizing the result, is still an open and active research field. In this paper, we focus on a special type of data analysis query, namely the Multiple Group-by query. We first discuss the communication cost of the MapReduce model, then give an initial implementation of the Multiple Group-by query. After that, we propose an optimized version that addresses and reduces the communication cost. According to the experimental measurements, the optimized version shows better speed-up and better scalability than the initial implementation. We also formally evaluate our results and give a set of execution time estimations for both the initial implementation and the optimized one.
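
The full text is not part of this preview, so the following is only a generic, minimal sketch of the idea the abstract describes, not the authors' implementation. It assumes a Multiple Group-by query means several independent group-by aggregations (here, sums) computed over the same records in one pass. The record layout, the `GROUP_BY` mapping, and all function names are hypothetical. The sketch simulates MapReduce in plain Python and uses the number of key-value pairs sent to the reducer as a rough stand-in for communication cost, comparing a naive mapper with one that aggregates locally inside each input split (a combiner-style technique, which may or may not match the paper's optimization).

```python
from collections import defaultdict

# Hypothetical records: (region, product, year, sales).
RECORDS = [
    ("EU", "A", 2010, 10.0),
    ("EU", "B", 2010, 5.0),
    ("US", "A", 2011, 7.0),
    ("US", "A", 2010, 3.0),
]
# Hypothetical group-by dimensions: name -> column index.
GROUP_BY = {"region": 0, "product": 1, "year": 2}
MEASURE = 3  # column index of the aggregated measure


def map_naive(record):
    """Emit one ((dimension, value), measure) pair per group-by dimension."""
    for name, idx in GROUP_BY.items():
        yield (name, record[idx]), record[MEASURE]


def map_with_local_aggregation(split):
    """Pre-aggregate inside one split so fewer pairs leave the mapper."""
    partial = defaultdict(float)
    for record in split:
        for key, value in map_naive(record):
            partial[key] += value
    return list(partial.items())


def reduce_sum(pairs):
    """Sum all values that share the same (dimension, value) key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(splits, local_aggregation):
    """Return (aggregates, number of pairs shuffled to the reducer)."""
    shuffled = []
    for split in splits:
        if local_aggregation:
            shuffled.extend(map_with_local_aggregation(split))
        else:
            for record in split:
                shuffled.extend(map_naive(record))
    return reduce_sum(shuffled), len(shuffled)


if __name__ == "__main__":
    splits = [RECORDS[:2], RECORDS[2:]]  # two simulated input splits
    for flag in (False, True):
        totals, shuffled_pairs = run(splits, local_aggregation=flag)
        print(flag, shuffled_pairs, totals)
```

Both variants produce the same aggregates; only the volume of intermediate pairs differs, which is the quantity a communication-cost analysis would focus on.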

Author information

Corresponding author

Correspondence to Jie Pan.

About this article

Cite this article

Pan, J., Magoulès, F., Le Biannic, Y. et al. Parallelizing Multiple Group-by queries using MapReduce: optimization and cost estimation. Telecommun Syst 52, 635–645 (2013). https://doi.org/10.1007/s11235-011-9508-2

  • DOI: https://doi.org/10.1007/s11235-011-9508-2
