Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

CIRCUMFLEX: a scheduling optimizer for MapReduce workloads with shared scans

Published: 16 February 2012 Publication History

Abstract

We consider MapReduce clusters designed to support multiple concurrent jobs, concentrating on environments in which the number of distinct datasets is modest relative to the number of jobs. Many datasets in such scenarios wind up being scanned by multiple concurrent Map phase jobs. As has been noticed previously, this scenario provides an opportunity for Map phase jobs to cooperate, sharing the scans of these datasets, and thus reducing the costs of such scans. Our paper has two main contributions. First, we present a novel and highly general method for sharing scans and thus amortizing their costs. This concept, which we call cyclic piggybacking, has a number of advantages over the more traditional batching scheme described in the literature. Second, we describe a significant but natural generalization of the recently introduced flex scheduler, for optimizing schedules within the context of this cyclic piggybacking paradigm. The overall approach, including both cyclic piggybacking and the flex generalization, is called circumflex. We demonstrate the excellent performance of circumflex via a variety of simulation experiments.

References

[1]
P. Agrawal, D. Kifer, and C. Olston. Scheduling shared scans of large data files. In Proceedings of VLDB, 2008.
[2]
J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. ACM Transactions on Computer Systems, 51(1):107--113, 2008.
[3]
Hadoop. http://hadoop.apache.org.
[4]
D. Knuth. The Art of Computer Programming. Addison-Wesley, 1998.
[5]
T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. Mrshare: sharing across multiple queries in mapreduce. Proc. VLDB Endow., 3(1-2):494--505, 2010.
[6]
M. Pinedo. Scheduling: Theory, Algorithms and Systems. Prentice Hall, 1995.
[7]
J. Wolf, A. Balmin, D. Rajan, K. Hildrum, R. Khandekar, S. Parekh, K.-L. Wu, and R. Vernica. Circumflex: A scheduling optimizer for mapreduce workloads involving shared scans. Technical Report RC25118, IBM Technical Report, 2011.
[8]
J. L. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K.-L. Wu, and A. Balmin. Flex: A slot allocation scheduling optimizer for mapreduce workloads. In Proceedings of Middleware, 2010.
[9]
J. L. Wolf, M. S. Squillante, J. J. Turek, P. S. Yu, and J. Sethuraman. Scheduling algorithms for the broadcast delivery of digital products. IEEE Trans. on Knowl. and Data Eng., 13(5):721--741, 2001.
[10]
M. Zaharia. Hadoop fair scheduler design document, http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk/src/contrib/fairscheduler/designdoc/fairschedulerdesigndoc.pdf.
[11]
M. Zaharia, D. Borthakur, J. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Job scheduling for multi-user mapreduce clusters. Technical Report EECS-2009-55, UC Berkeley Technical Report, 2009.
[12]
M. Zaharia, D. Borthakur, J. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of EuroSys, 2010.
[13]
M. Zukowski, S. Héman, N. Nes, and P. Boncz. Cooperative scans: dynamic bandwidth sharing in a dbms. In Proceedings of VLDB, 2007.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGOPS Operating Systems Review
ACM SIGOPS Operating Systems Review  Volume 46, Issue 1
January 2012
99 pages
ISSN:0163-5980
DOI:10.1145/2146382
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 February 2012
Published in SIGOPS Volume 46, Issue 1

Check for updates

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)2
  • Downloads (Last 6 weeks)1
Reflects downloads up to 23 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2019)A Dynamic Convergent Replica Selection Strategy Based on Cloud Storage2019 International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM)10.1109/AIAM48774.2019.00100(473-478)Online publication date: Oct-2019
  • (2016)A Survey on Job Scheduling in Big DataCybernetics and Information Technologies10.1515/cait-2016-003316:3(35-51)Online publication date: 1-Sep-2016
  • (2015)MUSInternational Journal of Computational Science and Engineering10.1504/IJCSE.2015.07349511:4(360-367)Online publication date: 1-Dec-2015
  • (2015)Classification Framework of MapReduce Scheduling AlgorithmsACM Computing Surveys10.1145/269331547:3(1-38)Online publication date: 16-Apr-2015
  • (2015)SAMESFrontiers of Computer Science: Selected Publications from Chinese Universities10.1007/s11704-014-4138-y9:1(128-141)Online publication date: 1-Feb-2015
  • (2014)Sharing across Multiple MapReduce JobsACM Transactions on Database Systems10.1145/256079639:2(1-46)Online publication date: 26-May-2014
  • (2013)An Efficiency-Aware Scheduling for Data-Intensive Computations on MapReduce ClustersIEICE Transactions on Information and Systems10.1587/transinf.E96.D.2654E96.D:12(2654-2662)Online publication date: 2013
  • (2013)Scheduling real-time workflow on MapReduce-based cloudThird International Conference on Innovative Computing Technology (INTECH 2013)10.1109/INTECH.2013.6653690(117-122)Online publication date: Aug-2013

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media