research-article

Open access

Skadi: Building a Distributed Runtime for Data Systems in Disaggregated Data Centers

Authors: Sanidhya Kashyap, Yizhou Shan

HOTOS '23: Proceedings of the 19th Workshop on Hot Topics in Operating Systems

Pages 94 - 102

https://doi.org/10.1145/3593856.3595897

Published: 22 June 2023

Abstract

Data-intensive systems are the backbone of today's computing and are responsible for shaping data centers. Over the years, cloud providers have relied on three principles to maintain cost-effective data systems: use disaggregation to decouple scaling, use domain-specific computing to counter the waning of hardware scaling laws such as Moore's law, and use serverless to lower costs. Although these principles work well individually, they fail to work in harmony, an issue amplified by emerging data system workloads.

In this paper, we envision a distributed runtime to mitigate these shortcomings. The runtime has a tiered access layer exposing declarative APIs, underpinned by a stateful serverless runtime with a distributed task execution model. It will be the narrow waist between data systems and hardware. Users remain oblivious to data location, concurrency, disaggregation style, and even the hardware that performs the computation. The underlying stateful serverless runtime transparently evolves with novel data-center architectures, such as disaggregation and tightly-coupled clusters. We prototype Skadi to showcase that the distributed runtime is practical.

Information & Contributors

Information

Published In

HOTOS '23: Proceedings of the 19th Workshop on Hot Topics in Operating Systems

June 2023

247 pages

ISBN:9798400701955

DOI:10.1145/3593856

General Chair: Malte Schwarzkopf
Program Chairs: Andrew Baumann, Natacha Crooks

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

HOTOS '23

Sponsor:

SIGOPS

HOTOS '23: 19th Workshop on Hot Topics in Operating Systems

June 22 - 24, 2023

Providence, RI, USA
