research-article

Exploration of memory hybridization for RDD caching in Spark

Authors:

Muhammad Ahad Ul Alam,

Amit Kumar Nath,

Weikuan Yu

ISMM 2019: Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management

Pages 41 - 52

https://doi.org/10.1145/3315573.3329988

Published: 23 June 2019

Abstract

Apache Spark is a popular cluster computing framework for iterative analytics workloads due to its use of Resilient Distributed Datasets (RDDs) to cache data for in-memory processing. We have revealed that the performance of the Spark RDD cache can be severely limited if its capacity falls short of the needs of the workloads. In this paper, we have explored different memory hybridization strategies to leverage emergent Non-Volatile Memory (NVM) devices for Spark's RDD cache. We have found that a simple layered hybridization approach does not offer an effective solution. Therefore, we have designed a flat hybridization scheme to leverage NVM for caching RDD blocks, along with several architectural optimizations such as dynamic memory allocation for block unrolling, asynchronous migration with preemption, and opportunistic eviction to disk. We have performed an extensive set of experiments to evaluate the performance of our proposed flat hybridization strategy and found it to be robust in handling different system and NVM characteristics. Our proposed approach uses DRAM for only a fraction of the hybrid memory system and yet keeps the increase in execution time within 10% on average. Moreover, our opportunistic eviction of blocks to disk improves performance by up to 7.5% when utilized alongside the current mechanism.
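
To make the idea of a flat hybrid cache more concrete, the following toy model may help. It is only an illustrative sketch, not the authors' implementation: the class name `FlatHybridCache`, the DRAM-first placement rule, and the LRU choice of eviction victim are all assumptions made for this example, and the abstract's other optimizations (dynamic allocation for block unrolling, asynchronous migration with preemption) are deliberately left out. The point it shows is that DRAM and NVM act as peers, so a block lives in exactly one of them, and blocks spill to disk only when neither has room.

```python
from collections import OrderedDict


class FlatHybridCache:
    """Toy model (hypothetical): DRAM and NVM form one flat block cache."""

    def __init__(self, dram_capacity, nvm_capacity):
        self.capacity = {"dram": dram_capacity, "nvm": nvm_capacity}
        # One LRU-ordered map per tier: block_id -> block size.
        self.tiers = {"dram": OrderedDict(), "nvm": OrderedDict()}
        self.disk = {}  # blocks evicted from the hybrid cache

    def free(self, tier):
        return self.capacity[tier] - sum(self.tiers[tier].values())

    def put(self, block_id, size):
        """Place a block and return the name of the tier that received it."""
        # Assumed policy: try DRAM first, then NVM. The tiers are peers,
        # so the block is stored in only one of them.
        for tier in ("dram", "nvm"):
            if size <= self.free(tier):
                self.tiers[tier][block_id] = size
                return tier
        # A block larger than the whole NVM tier goes straight to disk.
        if size > self.capacity["nvm"]:
            self.disk[block_id] = size
            return "disk"
        # Otherwise evict least-recently-used NVM blocks to disk until it fits.
        while self.free("nvm") < size:
            victim, victim_size = self.tiers["nvm"].popitem(last=False)
            self.disk[victim] = victim_size
        self.tiers["nvm"][block_id] = size
        return "nvm"

    def get(self, block_id):
        """Return where the block currently lives, or None if unknown."""
        for tier, blocks in self.tiers.items():
            if block_id in blocks:
                blocks.move_to_end(block_id)  # mark as most recently used
                return tier
        return "disk" if block_id in self.disk else None


if __name__ == "__main__":
    cache = FlatHybridCache(dram_capacity=2, nvm_capacity=4)
    for i in range(4):
        print(f"rdd_0_{i}", "->", cache.put(f"rdd_0_{i}", 2))
    print("rdd_0_1 now in:", cache.get("rdd_0_1"))
```

In this sketch the DRAM tier is intentionally the smaller one, mirroring the abstract's statement that DRAM makes up only a fraction of the hybrid memory; a layered design would instead treat DRAM as a cache in front of NVM and could hold the same block in both.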

Cited By

Ding W, Sun Y, Li M, Liu J, Ju H, Huang J, Lin C (2024). A Novel Spark-Based Attribute Reduction and Neighborhood Classification for Rough Evidence. IEEE Transactions on Cybernetics 54:3 (1470-1483). Online publication date: Mar-2024.
https://doi.org/10.1109/TCYB.2022.3208130
Xu G, Song M, Leng Z, Jia Z (2023). Simulation Research on Fast Matching of Big Data Based on Spark. IEEE Access 11 (32628-32635). Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3262989

Index Terms

Exploration of memory hybridization for RDD caching in Spark
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs
        MapReduce-based systems
  2. Information storage systems
    1. Information storage technologies
      1. Storage class memory

Information & Contributors

Information

Published In

ISMM 2019: Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management

June 2019

135 pages

ISBN: 9781450367226

DOI: 10.1145/3315573

General Chair: Jeremy Singer

Program Chair: Harry Xu

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Science Foundation

Conference

ISMM '19

Sponsor:

SIGPLAN

ISMM '19: 2019 ACM SIGPLAN International Symposium on Memory Management

June 23, 2019

Phoenix, AZ, USA

Acceptance Rates

Overall Acceptance Rate 72 of 156 submissions, 46%
