Towards memory-optimized data shuffling patterns for big data analytics

Authors:

Bogdan Nicolae,

Claudia Misale,

Kostas Katrinis,

Yoonho Park

CCGRID '16: Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing

Pages 409 - 412

https://doi.org/10.1109/CCGrid.2016.85

Published: 16 May 2016

Abstract

Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance and ultimately business itself. With the explosion of data sizes and the need for shorter time-to-solution, in-memory platforms such as Apache Spark are gaining popularity. However, this introduces important challenges, among which data shuffling is particularly difficult: on the one hand, it is a key part of the computation with a major impact on overall performance and scalability, so its efficiency is paramount; on the other hand, it needs to operate with scarce memory in order to leave as much memory as possible available for data caching. In this context, scheduling data transfers efficiently so that both dimensions of the problem are addressed simultaneously is non-trivial. State-of-the-art solutions often rely on simple approaches that yield sub-optimal performance and resource usage. This paper contributes a novel shuffle data transfer strategy that dynamically adapts to the computation with minimal memory utilization, which we briefly outline as a series of design principles.

Information & Contributors

Information

Published In

CCGRID '16: Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing

May 2016

784 pages

ISBN:9781509024520

General Chairs:
Carlos Varela, Rensselaer Polytechnic Institute (RPI)
Harold Castro, Universidad de los Andes, Colombia
Carlos Jaime Barrios, Universidad Industrial de Santander (UIS), Colombia

Publisher

IEEE Press

Qualifiers

Research-article

Conference

CCGrid '16

CCGrid '16: 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

May 16 - 19, 2016

Cartagena, Colombia
