Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3448016.3457553acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

ArkDB: A Key-Value Engine for Scalable Cloud Storage Services

Published: 18 June 2021 Publication History

Abstract

Persistent key-value stores play a crucial role in enabling internet-scale services. At Alibaba Cloud, scale-out cloud storage services including Object Storage Service, File Storage Service and Tablestore are built on distributed key-value stores. Key challenges in the design of the underlying key-value engine for these services lie in utilization of disaggregated storage, supporting write and range query-heavy workloads, and balancing of scalability, availability and resource usage. This paper presents ArkDB, a key-value engine designed to address these challenges by combining advantages of both LSM tree and Bw-tree, and leveraging advances in hardware technologies. Built on top of Pangu, an append-only distributed file system, ArkDB's innovations include shrinkable page mapping table, clear separation of system and user states for fast recovery, write amplification reduction, efficient garbage collection and lightweight partition split and merge. Experimental results demonstrate ArkDB's improvements over existing designs. Compared with Bw-tree, ArkDB efficiently stabilizes the mapping table size despite continuous write working set growth. Compared with RocksDB, an LSM tree-based key-value engine, ArkDB increases ingestion throughput by 2.16x, while reducing write amplification by 3.1x. It outperforms RocksDB by 52% and 37% respectively on a write-heavy workload and a range query-intensive workload of the Yahoo! Cloud Serving Benchmark. Experiments running in Tablestore in a cluster environment further demonstrate ArkDB's performance on Pangu and its efficient partition split/merge support.

Supplementary Material

MP4 File (3448016.3457553.mp4)
Persistent key-value stores play a crucial role in enabling internet-scale services.At Alibaba Cloud, scale-out cloud storage services including Object Storage Service, File Storage Service and Tablestore are built on distributed key-value stores. Key challenges in the design of the underlying key-value engine for these services lie in utilization of disaggregated storage, supporting write and range query-heavy workloads, and balancing of scalability, availability and resource usage. This paper presents ArkDB, a key-value engine designed to address these challenges by combining advantages of both LSM tree and Bw-tree, and leveraging advances in hardware technologies. Built on top of Pangu, an append-only distributed file system, ArkDB's innovations include shrinkable page mapping table, clear separation of system and user states for fast recovery, write amplification reduction, efficient garbage collection and lightweight partitionsplit and merge. Experimental results demonstrate ArkDB's improvements over existing designs. Compared with Bw-tree, ArkDB efficiently stabilizes the mapping table size despite continuous write working set growth. Compared with RocksDB, a popular LSM tree-based key-value engine, \dmt{} increases ingestion throughput by $2.16\times$, while reducing write amplification by $3.1\times$. It also outperforms RocksDB by 52\% and 37\% respectively on a write-heavy workload and a range query-intensive workload. Experiments running in Tablestore in a cluster environment further demonstrates ArkDB's performance on Pangu and its efficient partition split/merge support.

References

[1]
Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. 2019. Socrates: The New SQL Server in the Cloud. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, Peter A. Boncz, Stefan Manegold, Anastasia Ailamaki, Amol Deshpande, and Tim Kraska (Eds.). ACM, 1743--1756. https://doi.org/10.1145/3299869.3314047
[2]
Oana Balmau, Diego Didona, Rachid Guerraoui, Willy Zwaenepoel, Huapeng Yuan, Aashray Arora, Karan Gupta, and Pavan Konka. 2017. TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores. In 2017 USENIX Annual Technical Conference, USENIX ATC 2017, Santa Clara, CA, USA, July 12--14, 2017, Dilma Da Silva and Bryan Ford (Eds.). USENIX Association, 363--375. https://www.usenix.org/conference/atc17/technical-sessions/presentation/balmau
[3]
Hal Berenson, Philip A. Bernstein, Jim Gray, Jim Melton, Elizabeth J. O'Neil, and Patrick E. O'Neil. 1995. A Critique of ANSI SQL Isolation Levels. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, USA, May 22--25, 1995, Michael J. Carey and Donovan A. Schneider (Eds.). ACM Press, 1--10. https://doi.org/10.1145/223784.223785
[4]
Gerth Stølting Brodal and Rolf Fagerberg. 2003. Lower bounds for external memory dictionaries. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 12--14, 2003, Baltimore, Maryland, USA. ACM/SIAM, 546--554. http://dl.acm.org/citation.cfm?id=644108.644201
[5]
Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. 2011. Windows Azure Storage: a highly available cloud storage service with strong consistency. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles 2011, SOSP 2011, Cascais, Portugal, October 23--26, 2011, Ted Wobber and Peter Druschel (Eds.). ACM, 143--157. https://doi.org/10.1145/2043556.2043571
[6]
Wei Cao, Zhenjun Liu, Peng Wang, Sen Chen, Caifeng Zhu, Song Zheng, Yuhui Wang, and Guoqing Ma. 2018. PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database. Proc. VLDB Endow., Vol. 11, 12 (2018), 1849--1862. https://doi.org/10.14778/3229863.3229872
[7]
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. 2006. Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!). In 7th Symposium on Operating Systems Design and Implementation (OSDI '06), November 6--8, Seattle, WA, USA, Brian N. Bershad and Jeffrey C. Mogul (Eds.). USENIX Association, 205--218. http://www.usenix.org/events/osdi06/tech/chang.html
[8]
Alibaba Cloud. 2018. Pangu -- The High Performance Distributed File System by Alibaba Cloud. https://www.alibabacloud.com/blog/pangu-the-high-performance-distributed-file-system-by-alibaba-cloud_594059 .
[9]
Alibaba Cloud. 2020 a. File Storage NAS. https://www.alibabacloud.com/product/nas .
[10]
Alibaba Cloud. 2020 b. Object Storage Service. https://www.alibabacloud.com/product/oss .
[11]
Alibaba Cloud. 2020 c. Tablestore. https://www.alibabacloud.com/product/table-store .
[12]
Alexander Conway, Abhishek Gupta, Vijay Chidambaram, Martin Farach-Colton, Richard P. Spillane, Amy Tai, and Rob Johnson. 2020. SplinterDB: Closing the Bandwidth Gap for NVMe Key-Value Stores. In 2020 USENIX Annual Technical Conference, USENIX ATC 2020, July 15--17, 2020, Ada Gavrilovska and Erez Zadok (Eds.). USENIX Association, 49--63. https://www.usenix.org/conference/atc20/presentation/conway
[13]
Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In SoCC '10: Proceedings of the 1st ACM symposium on Cloud computing. 143--254.
[14]
James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson C. Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google's Globally-Distributed Database. In 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, October 8--10, 2012, Chandu Thekkath and Amin Vahdat (Eds.). USENIX Association, 251--264. https://www.usenix.org/conference/osdi12/technical-sessions/presentation/corbett
[15]
Niv Dayan, Manos Athanassoulis, and Stratos Idreos. 2017. Monkey: Optimal Navigable Key-Value Store. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14--19, 2017. ACM, 79--94.
[16]
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: amazon's highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles 2007, SOSP 2007, Stevenson, Washington, USA, October 14--17, 2007, Thomas C. Bressoud and M. Frans Kaashoek (Eds.). ACM, 205--220. https://doi.org/10.1145/1294261.1294281
[17]
Alex Depoutovitch, Chong Chen, Jin Chen, Paul Larson, Shu Lin, Jack Ng, Wenlin Cui, Qiang Liu, Wei Huang, Yong Xiao, and Yongjun He. 2020. Taurus Database: How to be Fast, Available, and Frugal in the Cloud. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14--19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang-Chiew Tan, Abdussalam Alawini, and Hung Q. Ngo (Eds.). ACM, 1463--1478. https://doi.org/10.1145/3318464.3386129
[18]
Siying Dong, Mark Callaghan, Leonidas Galanis, Dhruba Borthakur, Tony Savor, and Michael Strum. 2017. Optimizing Space Amplification in RocksDB. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8--11, 2017, Online Proceedings. www.cidrdb.org.
[19]
Jiaqing Du, Sameh Elnikety, and Willy Zwaenepoel. 2013. Clock-SI: Snapshot Isolation for Partitioned Data Stores Using Loosely Synchronized Clocks. In IEEE 32nd Symposium on Reliable Distributed Systems, SRDS 2013, Braga, Portugal, 1--3 October 2013 . IEEE Computer Society, 173--184. https://doi.org/10.1109/SRDS.2013.26
[20]
Facebook. 2020. RocksDB . https://rocksdb.org/.
[21]
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles 2003, SOSP 2003, Bolton Landing, NY, USA, October 19--22, 2003, Michael L. Scott and Larry L. Peterson (Eds.). ACM, 29--43. https://doi.org/10.1145/945445.945450
[22]
Google. 2020. LevelDB . https://dbdb.io/db/leveldb .
[23]
Goetz Graefe. 2012. A survey of B-tree logging and recovery techniques. ACM Trans. Database Syst., Vol. 37, 1 (2012), 1:1--1:35. https://doi.org/10.1145/2109196.2109197
[24]
Dave Hitz, James Lau, and Michael A. Malcolm. 1994. File System Design for an NFS File Server Appliance. In USENIX Winter 1994 Technical Conference, San Francisco, California, USA, January 17--21, 1994, Conference Proceedings. USENIX Association, 235--246. https://www.usenix.org/conference/usenix-winter-1994-technical-conference/file-system-design-nfs-file-server-appliance
[25]
Richard Jones, Antony Hosking, and Eliot Moss. 2011. The Garbage Collection Handbook: The Art of Automatic Memory Management 1st ed.). Chapman & Hall/CRC.
[26]
Bo-Kyeong Kim and Dong-Ho Lee. 2015. LSB-Tree: a log-structured B-Tree index structure for NAND ?ash SSDs . Design Automation for Embedded Systems, Vol. 19 (March 2015), 77--100.
[27]
Sandeep S. Kulkarni, Murat Demirbas, Deepak Madappa, Bharadwaj Avva, and Marcelo Leone. 2014. Logical Physical Clocks. In Principles of Distributed Systems - 18th International Conference, OPODIS 2014, Cortina d'Ampezzo, Italy, December 16--19, 2014. Proceedings (Lecture Notes in Computer Science, Vol. 8878), Marcos K. Aguilera, Leonardo Querzoni, and Marc Shapiro (Eds.). Springer, 17--32. https://doi.org/10.1007/978--3--319--14472--6_2
[28]
Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. 2013a. The Bw-Tree: A B-tree for new hardware platforms. In 2013 IEEE 29th International Conference on Data Engineering (ICDE). 302--313.
[29]
Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. 2013b. LLAMA: A Cache/Storage Subsystem for Modern Hardware. PVLDB, Vol. 6, 10 (2013), 877--888.
[30]
Justin J. Levandoski, David B. Lomet, Sudipta Sengupta, Ryan Stutsman, and Rui Wang. 2015. High Performance Transactions in Deuteronomy. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4--7, 2015, Online Proceedings. www.cidrdb.org. http://cidrdb.org/cidr2015/Papers/CIDR15_Paper15.pdf
[31]
Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky. 2014. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2014, Seattle, WA, USA, April 2--4, 2014, Ratul Mahajan and Ion Stoica (Eds.). USENIX Association, 429--444. https://www.usenix.org/conference/nsdi14/technical-sessions/presentation/lim
[32]
David B. Lomet and Mark R. Tuttle. 2003. A Theory of Redo Recovery. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9--12, 2003, Alon Y. Halevy, Zachary G. Ives, and AnHai Doan (Eds.). ACM, 397--406. https://doi.org/10.1145/872757.872806
[33]
Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2016. WiscKey: Separating Keys from Values in SSD-conscious Storage. In 14th USENIX Conference on File and Storage Technologies, FAST 2016, Santa Clara, CA, USA, February 22--25, 2016. USENIX Association, 133--148.
[34]
Changwoo Mina, Kangnyeon Kimb, Hyunjin Choc, Sang-Won Leed, and Young Ik Eome. 2012. SFS: Random Write Considered Harmful in Solid State Drives. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST). 139--154.
[35]
C. Mohan and Frank E. Levine. 1992. ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 2--5, 1992, Michael Stonebraker (Ed.). ACM Press, 371--380. https://doi.org/10.1145/130283.130338
[36]
Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Informatica, Vol. 33 (1996), 351--385.
[37]
Anastasios Papagiannis, Giorgos Saloustros, Pilar Gonzá lez-Fé rez, and Angelos Bilas. 2016. Tucana: Design and Implementation of a Fast and Efficient Scale-up Key-value Store. In 2016 USENIX Annual Technical Conference, USENIX ATC 2016, Denver, CO, USA, June 22--24, 2016, Ajay Gulati and Hakim Weatherspoon (Eds.). USENIX Association, 537--550. https://www.usenix.org/conference/atc16/technical-sessions/presentation/papagiannis
[38]
Pandian Raju, Rohan Kadekodi, Vijay Chidambaram, and Ittai Abraham. 2017. PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, October 28--31, 2017 . ACM, 497--514.
[39]
Ohad Rodeh, Josef Bacik, and Chris Mason. 2013. BTRFS: The Linux B-Tree Filesystem. ACM Trans. Storage, Vol. 9, 3 (2013), 9:1--9:32. https://doi.org/10.1145/2501620.2501623
[40]
Russell Sears and Raghu Ramakrishnan. 2012. bLSM: a general purpose log structured merge tree. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20--24, 2012. ACM, 217--228.
[41]
Dharma Shukla, Shireesh Thota, Karthik Raman, Madhan Gajendran, Ankur Shah, Sergii Ziuzin, Krishnan Sundaram, Miguel Gonzalez Guajardo, Anna Wawrzyniak, Samer Boshra, Renato Ferreira, Mohamed Nassar, Michael Koltachev, Ji Huang, Sudipta Sengupta, Justin J. Levandoski, and David B. Lomet. 2015. Schema-Agnostic Indexing with Azure DocumentDB. PVLDB, Vol. 8, 12 (2015), 1668--1679.
[42]
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (MSST '10). IEEE Computer Society, USA, 1--10. https://doi.org/10.1109/MSST.2010.5496972
[43]
Andrew S. Tanenbaum and Herbert Bos. 2015. Modern Operating Systems (4th Edition) .Pearson Education, Inc.
[44]
Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14--19, 2017, Semih Salihoglu, Wenchao Zhou, Rada Chirkova, Jun Yang, and Dan Suciu (Eds.). ACM, 1041--1052. https://doi.org/10.1145/3035918.3056101

Cited By

View all
  • (2024)LETUS: A Log-Structured Efficient Trusted Universal BlockChain StorageCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3653390(161-174)Online publication date: 9-Jun-2024
  • (2024)BG3: A Cost Effective and I/O Efficient Graph Database in BytedanceCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3653373(360-372)Online publication date: 9-Jun-2024
  • (2024) Bw e -tree: An Evolution of Bw-tree on Fast Storage 2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00396(5266-5279)Online publication date: 13-May-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2021

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. cloud storage architecture
  2. garbage collection
  3. key-value store
  4. recovery
  5. transaction logging

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '21
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)187
  • Downloads (Last 6 weeks)26
Reflects downloads up to 17 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)LETUS: A Log-Structured Efficient Trusted Universal BlockChain StorageCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3653390(161-174)Online publication date: 9-Jun-2024
  • (2024)BG3: A Cost Effective and I/O Efficient Graph Database in BytedanceCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3653373(360-372)Online publication date: 9-Jun-2024
  • (2024) Bw e -tree: An Evolution of Bw-tree on Fast Storage 2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00396(5266-5279)Online publication date: 13-May-2024
  • (2023)CDSBen: Benchmarking the Performance of Storage Services in Cloud-Native Database System at ByteDanceProceedings of the VLDB Endowment10.14778/3611540.361154916:12(3584-3596)Online publication date: 1-Aug-2023
  • (2022)A maturity model for AI-empowered cloud-native databases: from the perspective of resource managementJournal of Cloud Computing: Advances, Systems and Applications10.1186/s13677-022-00318-111:1Online publication date: 7-Sep-2022

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media