DOI: 10.1145/3593856.3595897
research-article
Open access

Skadi: Building a Distributed Runtime for Data Systems in Disaggregated Data Centers

Published: 22 June 2023

Abstract

Data-intensive systems are the backbone of today's computing and are responsible for shaping data centers. Over the years, cloud providers have relied on three principles to maintain cost-effective data systems: use disaggregation to decouple scaling, use domain-specific computing to battle waning laws, and use serverless to lower costs. Although they work well individually, they fail to work in harmony: an issue amplified by emerging data system workloads.
In this paper, we envision a distributed runtime to mitigate current shortcomings. The distributed runtime has a tiered access layer exposing declarative APIs, underpinned by a stateful serverless runtime with a distributed task execution model. It will be the narrow waist between data systems and hardware. Users are oblivious to data location, concurrency, disaggregation style, or even the hardware to do the computing. The underlying stateful serverless runtime transparently evolves with novel data-center architectures, such as disaggregation and tightly-coupled clusters. We prototype Skadi to showcase that the distributed runtime is practical.
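
As an illustration of the distributed task execution model the abstract refers to, here is a minimal sketch in Python. It is not Skadi's API: the `ToyRuntime` class, its `submit` method, and the `load`, `square_all`, and `total` functions are all hypothetical names chosen for this example, and a local thread pool stands in for whatever hardware a real runtime would schedule onto. The point is only that the caller chains tasks through futures and never states where the data lives or which worker runs each step.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable


class ToyRuntime:
    """Hypothetical stand-in for a task-based runtime; not Skadi's API."""

    def __init__(self, workers: int = 4) -> None:
        # A local thread pool plays the role of the remote workers.
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, fn: Callable[..., Any], *args: Any) -> Future:
        """Schedule fn; any Future argument is resolved before fn runs."""
        def run() -> Any:
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return fn(*resolved)
        return self._pool.submit(run)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def load(n: int) -> list:
    return list(range(n))


def square_all(xs: list) -> list:
    return [x * x for x in xs]


def total(xs: list) -> int:
    return sum(xs)


rt = ToyRuntime()
data = rt.submit(load, 5)              # returns a future, not the data itself
squared = rt.submit(square_all, data)  # depends on `data` without fetching it
result = rt.submit(total, squared).result()
rt.shutdown()
print(result)  # 30
```

The caller only ever holds futures, so placement and concurrency decisions stay inside the runtime, which is the property the abstract describes as users being oblivious to data location and hardware.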

Published In

HOTOS '23: Proceedings of the 19th Workshop on Hot Topics in Operating Systems
June 2023
247 pages
ISBN: 9798400701955
DOI: 10.1145/3593856

Publisher

Association for Computing Machinery

New York, NY, United States
