Effect of garbage collection in iterative algorithms on Spark: an experimental analysis

Minseo Kang¹ &
Jae-Gil Lee¹

Abstract

Spark is one of the most widely used systems for the distributed processing of big data. Its main performance bottlenecks are network I/O, disk I/O, and garbage collection. Previous studies quantitatively analyzed the performance impact of these bottlenecks but did not focus on iterative algorithms. Garbage collection affects an iterative algorithm more than it affects other workloads because the algorithm repeatedly loads data into main memory and deletes it again over multiple iterations. To keep data that does not change across iterations, Spark provides three caching mechanisms: “disk cache,” “memory cache,” and “no cache.” In this paper, we provide an in-depth experimental analysis of how garbage collection affects overall performance under each of Spark's caching mechanisms, using various combinations of algorithms and datasets. The experimental results show that garbage collection accounts for 16–47% of the total elapsed time of running iterative algorithms on Spark and that the memory cache is no less advantageous in terms of garbage collection than the disk cache. We expect the results of this paper to serve as a guide for tuning garbage collection when running iterative algorithms on Spark.
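
As a rough illustration of the kind of figure reported above, the following minimal sketch computes the share of run time spent in garbage collection from per-task metrics. It assumes the metrics arrive as dictionaries carrying the `executorRunTime` and `jvmGcTime` fields (both in milliseconds) that Spark's monitoring REST API exposes for tasks. The helper name `gc_share` and the sample values are hypothetical, and the abstract does not state how the authors measured garbage collection time, so this should not be read as their method. Note also that run time summed over tasks is only a proxy for wall-clock elapsed time.

```python
from typing import Iterable, Mapping


def gc_share(task_metrics: Iterable[Mapping[str, float]]) -> float:
    """Return total JVM GC time divided by total executor run time.

    Each mapping is assumed to hold 'executorRunTime' and 'jvmGcTime'
    in milliseconds. Returns 0.0 when no run time was recorded.
    """
    gc_ms = 0.0
    run_ms = 0.0
    for metrics in task_metrics:
        gc_ms += metrics["jvmGcTime"]
        run_ms += metrics["executorRunTime"]
    if run_ms == 0:
        return 0.0
    return gc_ms / run_ms


# Hypothetical per-task metrics, not measurements from the paper.
sample = [
    {"executorRunTime": 1000, "jvmGcTime": 200},
    {"executorRunTime": 3000, "jvmGcTime": 600},
]
print(f"GC share: {gc_share(sample):.0%}")
```

Comparing this ratio across runs that use the disk cache, the memory cache, and no cache would give a first impression of how each mechanism changes the garbage collection burden.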

Acknowledgements

This research, “Geospatial Big Data Management, Analysis and Service Platform Technology Development,” was supported by the MOLIT (The Ministry of Land, Infrastructure and Transport), Korea, under the national spatial information research program supervised by the KAIA (Korea Agency for Infrastructure Technology Advancement) (19NSIP-B081011-06).

Author information

Authors and Affiliations

Graduate School of Knowledge Service Engineering, KAIST, Daejeon, Korea
Minseo Kang & Jae-Gil Lee

Corresponding author

Correspondence to Jae-Gil Lee.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kang, M., Lee, JG. Effect of garbage collection in iterative algorithms on Spark: an experimental analysis. J Supercomput 76, 7204–7218 (2020). https://doi.org/10.1007/s11227-020-03150-z

Published: 16 January 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s11227-020-03150-z
