DOI: 10.1145/2857546.2857595

Dynamic File Placing Control for Improving the I/O Performance in the Reduce Phase of Hadoop

Published: 04 January 2016

Abstract

Hadoop is a popular open-source MapReduce implementation. In jobs where the output files of all relevant Map tasks are transmitted to and consolidated in a single Reduce task, as in TeraSort, that Reduce task is the bottleneck and is I/O-bound because it must process many large output files. In most cases, including TeraSort, the intermediate data, which include the output files of the Map tasks, are large and accessed sequentially. Improving the performance of these jobs therefore requires improving sequential access performance. In this paper, we focus on the Hadoop sample job TeraSort, which is a single-Reduce-task job, and discuss a method for improving its performance. First, we run TeraSort and demonstrate that the single Reduce task is the bottleneck and is I/O-bound. Second, we show the sequential I/O speed of each zone of an HDD. Third, we introduce a static method for improving the performance of such single-Reduce-task jobs. The method statically controls the block bitmaps of the filesystem and places the intermediate files in a faster zone, i.e., the outer range, of the HDD. Fourth, we propose improving this static method by controlling the block bitmaps dynamically. Lastly, we present a performance evaluation of the proposed method and demonstrate that it improves performance.
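
The idea of steering file placement through the block bitmap can be illustrated with a toy model. The sketch below is not the authors' implementation and does not touch a real filesystem: the class `ToyBitmapAllocator`, the function `place_file`, the constants, and the goal-based search are all hypothetical, and it assumes that low block numbers correspond to the faster outer zone. It only shows that marking the inner-zone blocks as used leaves the allocator no choice but to place a new file in the outer zone.

```python
from typing import List, Optional

TOTAL_BLOCKS = 100    # hypothetical disk size in blocks
OUTER_ZONE_END = 30   # hypothetical: blocks [0, 30) model the fast outer zone


class ToyBitmapAllocator:
    """Toy allocator over a block bitmap (True means the block is in use)."""

    def __init__(self, total_blocks: int) -> None:
        self.bitmap: List[bool] = [False] * total_blocks

    def mark_used(self, start: int, end: int) -> None:
        """Mark blocks [start, end) as used so the allocator skips them."""
        for i in range(start, end):
            self.bitmap[i] = True

    def mark_free(self, start: int, end: int) -> None:
        """Mark blocks [start, end) as free again, e.g. to lift a mask."""
        for i in range(start, end):
            self.bitmap[i] = False

    def allocate(self, length: int, goal: int = 0) -> Optional[int]:
        """Find `length` contiguous free blocks, searching from `goal`
        to the end of the disk and then wrapping around to block 0.
        Returns the first block of the extent, or None if nothing fits."""
        last_start = len(self.bitmap) - length
        candidates = list(range(goal, last_start + 1))
        candidates += list(range(0, min(goal, last_start + 1)))
        for start in candidates:
            if not any(self.bitmap[start:start + length]):
                self.mark_used(start, start + length)
                return start
        return None


def place_file(mask_inner_zone: bool, length: int = 10,
               goal: int = 60) -> Optional[int]:
    """Place one file on a fresh toy disk, optionally masking the inner zone."""
    alloc = ToyBitmapAllocator(TOTAL_BLOCKS)
    if mask_inner_zone:
        # Pretend every inner-zone block is already in use.
        alloc.mark_used(OUTER_ZONE_END, TOTAL_BLOCKS)
    return alloc.allocate(length, goal)
```

In this toy model, `place_file(False)` places the file at the goal block in the inner zone, while `place_file(True)` places it inside the outer zone. Calling `mark_free` on the masked range restores normal allocation; how and when the paper's dynamic method adjusts the bitmap is not specified in the abstract.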



Published In

IMCOM '16: Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication
January 2016
658 pages
ISBN:9781450341424
DOI:10.1145/2857546
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Hadoop
  2. MapReduce
  3. filesystem

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IMCOM '16
Acceptance Rates

Overall Acceptance Rate 213 of 621 submissions, 34%

Cited By

  • (2020) Job-Aware File-Storage Optimization for Improved Hadoop I/O Performance. IEICE Transactions on Information and Systems, E103.D:10, pp. 2083-2093. DOI: 10.1587/transinf.2019EDP7337. Online publication date: 1-Oct-2020
  • (2020) A Study on I/O Performance in Highly Consolidated Container-Based Virtualized Environment on OverlayFS with Optimized Synchronization. 2020 14th International Conference on Ubiquitous Information Management and Communication (IMCOM), pp. 1-4. DOI: 10.1109/IMCOM48794.2020.9001733. Online publication date: Jan-2020
  • (2020) Cache Management with Fadvise Based on LFU. 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1145-1150. DOI: 10.1109/COMPSAC48688.2020.0-102. Online publication date: Jul-2020
  • (2020) Performance Improvement of Hadoop ext4-based Disk I/O. 2020 Eighth International Symposium on Computing and Networking (CANDAR), pp. 181-187. DOI: 10.1109/CANDAR51075.2020.00032. Online publication date: Nov-2020
  • (2019) A Novel Approach for Maintaining Consistency in Distributed File System. 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), pp. 1-6. DOI: 10.1109/ICACCP.2019.8882935. Online publication date: Feb-2019
  • (2019) Job-Aware Optimization of File Placement in Hadoop. 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), pp. 664-669. DOI: 10.1109/COMPSAC.2019.10284. Online publication date: Jul-2019
  • (2018) Hadoop I/O Performance Improvement by File Layout Optimization. IEICE Transactions on Information and Systems, E101.D:2, pp. 415-427. DOI: 10.1587/transinf.2017EDP7114. Online publication date: 2018
  • (2018) I/O Performance Improvement of Secure Big Data Analyses with Application Support on SSD Cache. Proceedings of the 12th International Conference on Ubiquitous Information Management and Communication, pp. 1-7. DOI: 10.1145/3164541.3164560. Online publication date: 5-Jan-2018
  • (2018) A Caching Filesystem for Increasing Locality in the Second Cache in a Virtualized Environment. Proceedings of the 12th International Conference on Ubiquitous Information Management and Communication, pp. 1-7. DOI: 10.1145/3164541.3164557. Online publication date: 5-Jan-2018
  • (2018) A Kernel-Based Method for Resolving Performance Inefficiencies in Mining Frequent-Patterns in Encrypted Data. 2018 Sixth International Symposium on Computing and Networking Workshops (CANDARW), pp. 506-510. DOI: 10.1109/CANDARW.2018.00098. Online publication date: Nov-2018
