Abstract
In the cloud age, heterogeneous application modes running on large-scale infrastructures pose challenges for resource utilization and manageability in data centers. Many resource and runtime management systems have been developed or have evolved to address these challenges and related problems from different perspectives. This paper identifies the main motivations, key concerns, common features, and representative solutions of such systems through a survey and analysis. A typical class of these systems is generalized as the consolidated cluster system, whose design goal is identified as reducing overall costs on the premise that quality of service is maintained. A survey of this class of systems is given, and the critical issues they address are summarized as resource consolidation and runtime coordination. These two issues are analyzed and classified according to the design styles and external characteristics abstracted from the surveyed work. Five representative consolidated cluster systems from both academia and industry are illustrated and compared in detail based on the analysis and classifications. We hope this survey and analysis will be helpful for both the design and implementation and the technology selection of such systems, in response to the constantly emerging challenges of infrastructure and application management in data centers.
Author information
Jian Lin is a PhD candidate in computer architecture at the Institute of Computing Technology, Chinese Academy of Sciences. His current research interests include distributed software architecture, large-scale resource management, and security technologies in grid and cloud computing systems.
Li Zha obtained his PhD in 2003 and is an associate professor at the Institute of Computing Technology, Chinese Academy of Sciences. He has been the project leader of several national-level research programs. His research focuses on large-scale distributed resource management; data storage, processing, and retrieval; and system-level optimization. His interests also include other classic issues in the fields of distributed computing and grid computing.
Zhiwei Xu received his PhD from the University of Southern California in 1987. He is currently a professor at the Institute of Computing Technology, Chinese Academy of Sciences. His research interests include network computing, distributed operating systems, and high-performance computer architecture. He has served on the editorial boards of IEEE Transactions on Services Computing, Journal of Grid Computing, Journal of Computer Science and Technology, and Journal of Computer Research and Development. He is a senior member of the IEEE.
Cite this article
Lin, J., Zha, L. & Xu, Z. Consolidated cluster systems for data centers in the cloud age: a survey and analysis. Front. Comput. Sci. 7, 1–19 (2013). https://doi.org/10.1007/s11704-012-2086-y