Abstract
In this paper, we investigate Cloud computing resource provisioning to extend the computing capacity of local clusters in the presence of failures. We consider three steps in resource provisioning: resource brokering, dispatch sequences, and scheduling. The proposed brokering strategy is based on the stochastic analysis of routing in distributed parallel queues; it takes into account the response times of the Cloud provider and the local cluster as well as the computing cost on both sides. Moreover, we propose dispatching with probabilistic and deterministic sequences to redirect requests to the resource providers. We also incorporate checkpointing into some well-known scheduling algorithms to provide a fault-tolerant environment. We propose two cost-aware and failure-aware provisioning policies that can be used by an organization that operates a cluster managed by virtual machine technology and seeks to use resources from a public Cloud provider. Simulation results demonstrate that the proposed policies improve the response time of users’ requests by a factor of 4.10 under a moderate load, at a limited cost on a public Cloud.
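The difference between probabilistic and deterministic dispatch sequences can be illustrated with a minimal sketch. It assumes a routing probability `p_cloud` (the fraction of requests sent to the public Cloud) has already been obtained from the brokering step; how that value is computed is not shown here. The function names are hypothetical, and the deterministic variant uses a simple balanced sequence as a stand-in, which may differ from the construction used in the paper.

```python
import math
import random


def probabilistic_dispatch(p_cloud, n, seed=0):
    """Send each of n requests to the Cloud independently with probability p_cloud.

    The seed is an assumption added only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    return ["cloud" if rng.random() < p_cloud else "local" for _ in range(n)]


def deterministic_dispatch(p_cloud, n):
    """Spread Cloud-bound requests as evenly as possible over n requests.

    Request k goes to the Cloud exactly when floor((k + 1) * p) exceeds
    floor(k * p), so the long-run fraction sent to the Cloud matches p_cloud
    while avoiding the bursts that random draws can produce.
    """
    sequence = []
    for k in range(n):
        to_cloud = math.floor((k + 1) * p_cloud) - math.floor(k * p_cloud) == 1
        sequence.append("cloud" if to_cloud else "local")
    return sequence


if __name__ == "__main__":
    # Illustrative value only: route one request in four to the Cloud.
    print(deterministic_dispatch(0.25, 8))
    print(probabilistic_dispatch(0.25, 8))
```

Both variants aim for the same long-run split between the two providers. The deterministic sequence fixes the spacing between Cloud-bound requests, whereas the probabilistic one only matches the split on average.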
Notes
There are several approximations for this queue in the literature; we chose one that gives a good estimate for heavily loaded systems.
This assumption is made solely to focus on performance degradation due to failures.
This is the maximum amount of data for a real scientific workflow application [40].
The network latency is negligible as it is less than a second for public Cloud environments [41].
All prices were obtained at the time of writing this paper, during May–June 2011.
References
Sotomayor B, Montero RS, Llorente IM, Foster I (2009) Virtual infrastructure management in private and hybrid clouds. IEEE Internet Comput 13(5):14–22
Kondo D, Javadi B, Malecot P, Cappello F, Anderson DP (2009) Cost-benefit analysis of Cloud computing versus desktop grids. In: Proceedings of the 23rd IEEE international parallel and distributed processing symposium (IPDPS 2009), Rome, Italy. IEEE Computer Society, Washington, pp 1–12
Deelman E, Singh G, Livny M, Berriman B, Good J (2008) The cost of doing science on the Cloud: the montage example. In: Proceedings of the 19th ACM/IEEE international conference on supercomputing (SC 2008), Austin, Texas. IEEE Press, Piscataway, pp 1–12
Palankar MR, Iamnitchi A, Ripeanu M, Garfinkel S (2008) Amazon S3 for science Grids: a viable solution? In: Proceedings of the 1st international workshop on data-aware distributed computing (DADC’08) in conjunction with HPDC 2008, Boston, MA. ACM, New York, pp 55–64
de Assunção MD, di Costanzo A, Buyya R (2009) Evaluating the cost-benefit of using cloud computing to extend the capacity of clusters. In: Proceedings of the 18th international symposium on high performance parallel and distributed computing (HPDC 2009), Garching, Germany. ACM, New York, pp 141–150
Kondo D, Javadi B, Iosup A, Epema DHJ (2010) The failure trace archive: enabling comparative analysis of failures in diverse distributed systems. In: Proceedings of the 10th IEEE/ACM international conference on cluster, cloud and grid computing (CCGrid 2010), Melbourne, Australia. IEEE Computer Society, Washington, pp 398–407
Anselmi J, Gaujal B (2010) Optimal routing in parallel, non-observable queues and the price of anarchy revisited. In: 22nd international teletraffic congress (ITC), Amsterdam, The Netherlands
di Costanzo A, de Assunção MD, Buyya R (2009) Harnessing cloud technologies for a virtualized distributed computing infrastructure. IEEE Internet Comput 13(5):24–33
Fontán J, Vázquez T, Gonzalez L, Montero RS, Llorente IM (2008) OpenNEbula: the open source virtual machine manager for cluster computing. In: Open source grid and cluster software conference, book of abstracts, San Francisco, CA
Nurmi D, Wolski R, Grzegorczyk C, Obertelli G, Soman S, Youseff L, Zagorodnov D (2009) The Eucalyptus open-source cloud-computing system. In: Proceedings of the 9th IEEE/ACM international symposium on cluster computing and the grid (CCGrid 2009), Shanghai, China. IEEE Computer Society, Washington, pp 124–131
Vecchiola C, Chu X, Buyya R (2009) Aneka: a software platform for .NET-based cloud computing. IOS Press, Amsterdam, pp 267–295
Amazon Inc., Amazon elastic compute cloud (Amazon EC2). http://aws.amazon.com/ec2
Tatezono M, Maruyama N, Matsuoka S (2006) Making wide-area, multi-site MPI feasible using Xen VM. In: Proceedings of the 4th workshop on frontiers of high performance computing and networking in conjunction with ISPA 2006, Sorrento, Italy. Springer, Berlin, pp 387–396
Iosup A, Epema DHJ, Tannenbaum T, Farrellee M, Livny M (2007) Inter-operating grids through delegated matchmaking. In: Proceedings of the 18th ACM/IEEE conference on supercomputing (SC 2007), Reno, Nevada. ACM, New York, pp 1–12
Balazinska M, Balakrishnan H, Stonebraker M (2004) Contract-based load management in federated distributed systems. In: Proceedings of the 1st symposium on networked systems design and implementation (NSDI 2004), San Francisco, CA. USENIX Association, Berkeley, pp 197–210
Irwin D, Chase J, Grit L, Yumerefendi A, Becker D, Yocum KG (2006) Sharing networked resources with brokered leases. In: Proceedings of the USENIX annual technical conference, Boston, MA. USENIX Association, Berkeley, pp 199–212
Grit L, Irwin D, Yumerefendi A, Chase J (2006) Virtual machine hosting for networked clusters: building the foundations for ‘autonomic’ orchestration. In: Proceedings of the 1st international workshop on virtualization technology in distributed computing (VTDC 2006), Tampa, Florida. IEEE Computer Society, Washington, pp 7–15
Ruth P, McGachey P, Xu D (2005) VioCluster: virtualization for dynamic computational domain. In: Proceedings of the 7th IEEE international conference on cluster computing (Cluster 2005), Burlington, MA. IEEE Press, Piscataway, pp 1–10
Rubio-Montero AJ, Huedo E, Montero RS, Llorente IM (2007) Management of virtual machines on globus grids using GridWay. In: Proceedings of the 21st IEEE international parallel and distributed processing symposium (IPDPS 2007), Long Beach, USA. IEEE Press, Piscataway, pp 1–7
Huedo E, Montero RS, Llorente IM (2010) Grid architecture from a metascheduling perspective. Computer 43(7):51–56
Garfinkel S (2007) Commodity grid computing with Amazon’s S3 and EC2. Usenix Login 32(1):7–13
Marshall P, Keahey K, Freeman T (2010) Elastic site: using clouds to elastically extend site resources. In: Proceedings of the 10th IEEE/ACM international conference on cluster, cloud and grid computing (CCGrid 2010), Melbourne, Australia. IEEE Computer Society, Washington, pp 43–52
Moschakis I, Karatza H (2010) Evaluation of gang scheduling performance and cost in a cloud computing system. J Supercomput 1:1–18
Guo X, Lu Y, Squillante MS (2004) Optimal probabilistic routing in distributed parallel queues. ACM SIGMETRICS Perform Eval Rev 32(2):53–54
Ross SM (1997) Stochastic processes, 2nd edn. Wiley, New York
Hordijk A, van der Laan D (2004) Periodic routing to parallel queues and billiard sequences. Math Methods Oper Res 59:173–192
Mu’alem AW, Feitelson DG (2001) Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans Parallel Distrib Syst 12(6):529–543
Lifka DA (1995) The ANL/IBM SP scheduling system. In: Proceedings of the 1st workshop on job scheduling strategies for parallel processing (JSSPP’95), Santa Barbara, CA. Springer, London, pp 295–303
Srinivasan S, Kettimuthu R, Subramani V, Sadayappan P (2002) Selective reservation strategies for backfill job scheduling. In: Proceedings of the 8th international workshop on job scheduling strategies for parallel processing (JSSPP’02), Edinburgh, Scotland, UK. Springer, London, pp 55–71
Bouguerra M, Gautier T, Trystram D, Vincent J-M (2010) A flexible checkpoint/restart model in distributed systems. In: Proceedings of the 9th international conference on parallel processing and applied mathematics (PPAM 2010), Torun, Poland. Springer, Berlin, pp 206–215
Kleinrock L, Korfhage W (1993) Collecting unused processing capacity: an analysis of transient distributed systems. IEEE Trans Parallel Distrib Syst 4(5):535–546
Varia J (2011) Best practices in architecting cloud applications in the AWS cloud. Wiley, Hoboken, pp 459–490
Hoelzle U, Barroso LA (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Morgan and Claypool Publishers, San Rafael
Ostermann S, Iosup A, Yigitbasi N, Prodan R, Fahringer T, Epema D (2009) A performance analysis of EC2 Cloud computing services for scientific computing. In: Proceedings of the 1st international conference on cloud computing (CloudComp 2009), Beijing, China. Springer, Berlin, pp 115–131
Calheiros RN, Ranjan R, Beloglazov A, De Rose CAF, Buyya R (2011) CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Softw Pract Exp 41(1):23–50
Grimme C, Lepping J, Papaspyrou A (2008) Prospects of collaboration between compute providers by means of job interchange. In: 13th job scheduling strategies for parallel processing. Lecture notes in computer science, vol 4942. Springer, Berlin, pp 132–151
Feitelson DG, Rudolph L, Schwiegelshohn U, Sevcik KC, Wong P (1997) Theory and practice in parallel job scheduling. In: Proceedings of the 3rd international workshop on job scheduling strategies for parallel processing (JSSPP’97), Seattle, WA. Springer, London, pp 1–34
Iosup A, Li H, Jan M, Anoep S, Dumitrescu C, Wolters L, Epema DHJ (2008) The grid workloads archive. Future Gener Comput Syst 24(7):672–686
Li H, Groep D, Wolters L (2004) Workload characteristics of a multi-cluster supercomputer. In: Proceedings of the 10th international workshop on job scheduling strategies for parallel processing (JSSPP’04), New York, USA. Springer, Berlin, pp 176–193
Pandey S, Voorsluys W, Rahman M, Buyya R, Dobson JE, Chiu K (2009) A grid workflow environment for brain imaging analysis on distributed systems. Concurr Comput 21(16):2118–2139
CloudHarmony, http://cloudharmony.com/
Acknowledgements
The authors would like to thank Jonatha Anselmi, Rodrigo N. Calheiros, Mohsen Amini, and Amir Vahid for useful discussions.
Cite this article
Javadi, B., Thulasiraman, P. & Buyya, R. Enhancing performance of failure-prone clusters by adaptive provisioning of cloud resources. J Supercomput 63, 467–489 (2013). https://doi.org/10.1007/s11227-012-0826-2
DOI: https://doi.org/10.1007/s11227-012-0826-2