DOI: 10.1109/CCGrid.2016.85
research-article

Towards memory-optimized data shuffling patterns for big data analytics

Published: 16 May 2016

Abstract

Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance and ultimately business itself. With the explosion of data sizes and the need for shorter time-to-solution, in-memory platforms such as Apache Spark are gaining popularity. However, this introduces important challenges, among which data shuffling is particularly difficult: on the one hand, it is a key part of the computation with a major impact on overall performance and scalability, so its efficiency is paramount; on the other hand, it needs to operate with scarce memory in order to leave as much memory as possible available for data caching. In this context, scheduling data transfers efficiently so that both dimensions of the problem are addressed simultaneously is non-trivial. State-of-the-art solutions often rely on simple approaches that yield sub-optimal performance and resource usage. This paper contributes a novel shuffle data transfer strategy that dynamically adapts to the computation with minimal memory utilization, which we briefly outline as a series of design principles.

Published In

CCGRID '16: Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing
May 2016
784 pages
ISBN: 9781509024520

Publisher

IEEE Press

Author Tags

  1. big data analytics
  2. data shuffling
  3. elastic buffering
  4. memory-efficient I/O

Qualifiers

  • Research-article

Conference

CCGrid '16
