Abstract
Computational science depends on complex, data-intensive applications that operate on datasets produced by a variety of scientific instruments. A major challenge is integrating these data into the scientist's workflow. Recent advances in dynamic, networked cloud resources provide the building blocks for reconfigurable, end-to-end infrastructure that can increase scientific productivity, but applications are not taking advantage of them. In our previous work, we introduced DyNamo, which enabled CASA scientists to improve the efficiency of their operations and effortlessly leverage cloud resources that had previously remained underutilized. However, the workflow automation DyNamo provided did not satisfy all of CASA's operational requirements: custom scripts were still in production to manage workflow triggering, and multiple layer-2 connections had to be allocated to maintain network QoS requirements. To address these issues, we enhance the DyNamo system with advanced network manipulation mechanisms, end-to-end infrastructure monitoring, and ensemble workflow management capabilities. DyNamo's Virtual Software Defined Exchange (vSDX) capabilities have been extended with link adaptation, flow prioritization, and traffic control between endpoints. These new features allow us to enforce network QoS requirements for each workflow ensemble and can lead to fairer network sharing. Additionally, to accommodate CASA's operational needs, we have extended the newly integrated Pegasus Ensemble Manager with event-based triggering functionality, which improves the management of CASA's workflow ensembles. Beyond managing workflow ensembles, the Pegasus Ensemble Manager can also promote fairer resource usage by employing throttling techniques that reduce compute and network resource contention.
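The event-based triggering described above can be sketched as a simple polling loop that watches a data directory and submits a workflow whenever new files arrive. This is an illustrative sketch only, not the Pegasus Ensemble Manager's actual implementation; the `.netcdf` suffix, the directory layout, and the `submit` callback are assumptions introduced for the example.

```python
import os
import time


def poll_for_new_files(directory, seen, suffixes=(".netcdf",)):
    """Return paths of data files that appeared since the last poll.

    `seen` is a set of file names already observed; it is updated in place.
    """
    new = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(suffixes) and name not in seen:
            seen.add(name)
            new.append(os.path.join(directory, name))
    return new


def trigger_loop(directory, submit, interval=1.0, max_polls=None):
    """Poll `directory` and call `submit(batch)` for each batch of new files.

    In a real deployment `submit` would hand the batch to a workflow
    manager (e.g. start one nowcast workflow per batch of radar scans);
    here it is an arbitrary, hypothetical callback.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        batch = poll_for_new_files(directory, seen)
        if batch:
            submit(batch)
        polls += 1
        time.sleep(interval)
```

In practice an event-based trigger would also handle partial writes and retries; the loop above only captures the core idea of turning data arrival into workflow submissions.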
We evaluate the effects of DyNamo's vSDX policies using two CASA workflow ensembles competing for network resources, and we show that traffic shaping of the ensembles can lead to fairer sharing of the network links. Finally, we study how changing the Pegasus Ensemble Manager's throttling for each of the two workflow ensembles affects their performance while they compete for the same network resources, and we assess whether this approach is a viable alternative to the vSDX policies.
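To make precise what fair sharing of a link means in this setting, the following sketch computes a max-min fair allocation of link capacity among competing flows. It is a didactic model only, not the vSDX's actual traffic-shaping mechanism (which operates at the SDN layer); the demand and capacity figures are hypothetical.

```python
def max_min_fair(demands, capacity):
    """Max-min fair allocation of link `capacity` among flows with `demands`.

    Flows whose demand is below the current equal share are fully
    satisfied; the leftover capacity is split among the remaining
    (bottlenecked) flows.
    """
    alloc = {i: 0.0 for i in range(len(demands))}
    active = set(alloc)
    remaining = float(capacity)
    while active:
        fair_share = remaining / len(active)
        satisfied = {i for i in active if demands[i] <= fair_share}
        if not satisfied:
            # Every remaining flow is bottlenecked: split capacity evenly.
            for i in active:
                alloc[i] = fair_share
            break
        for i in satisfied:
            alloc[i] = demands[i]
            remaining -= demands[i]
        active -= satisfied
    return [alloc[i] for i in range(len(demands))]
```

For example, with two ensembles demanding 800 Mb/s each and a background flow demanding 200 Mb/s on a 1000 Mb/s link, the allocation is [200, 400, 400]: the bottlenecked ensembles split the residual capacity evenly, which is the kind of outcome the vSDX traffic-shaping policies aim to enforce.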
Acknowledgements
This work is funded by NSF award #1826997. We thank Mert Cevik (RENCI) and engineers from UNT and LEARN for the UNT stitchport setup. Results in this paper were obtained using the Chameleon and ExoGENI testbeds, which are supported by NSF.
Cite this article
Papadimitriou, G., Lyons, E., Wang, C. et al. Fair sharing of network resources among workflow ensembles. Cluster Comput 25, 2873–2891 (2022). https://doi.org/10.1007/s10586-021-03457-3