Scalable and adaptive log manager in distributed systems

Huan Zhou¹,
Weining Qian²,
Xuan Zhou²,
Qiwen Dong²,
Aoying Zhou² &
…
Wenrong Tan¹

153 Accesses
1 Altmetric
Explore all metrics

Abstract

On-line transaction processing (OLTP) systems rely on transaction logging and quorum-based consensus protocol to guarantee durability, high availability and strong consistency. This makes the log manager a key component of distributed database management systems (DDBMSs). The leader of DDBMSs commonly adopts a centralized logging method to writing log entries into a stable storage device and uses a constant log replication strategy to periodically synchronize its state to followers. With the advent of new hardware and high parallelism of transaction processing, the traditional centralized design of logging limits scalability, and the constant trigger condition of replication can not always maintain optimal performance under dynamic workloads.

In this paper, we propose a new log manager named Salmo with scalable logging and adaptive replication for distributed database systems. The scalable logging eliminates centralized contention by utilizing a highly concurrent data structure and speedy log hole tracking. The kernel of adaptive replication is an adaptive log shipping method, which dynamically adjusts the number of log entries transmitted between leader and followers based on the real-time workload. We implemented and evaluated Salmo in the open-sourced transaction processing systems Cedar and DBx1000. Experimental results show that Salmo scales well by increasing the number of working threads, improves peak throughput by 1.56× and reduces latency by more than 4× over log replication of Raft, and maintains efficient and stable performance under dynamic workloads all the time.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fast Quorum-Based Log Replication and Replay for Fast Databases

Efficient and stable quorum-based log replication and replay for modern cluster-databases

Article 09 January 2022

Plover: parallel logging for replication systems

Article 03 January 2020

References

Stonebraker M. Concurrency control and consistency of multiple copies of data in distributed INGRES. IEEE Transactions on Software Engineering, 1979, SE-5(3): 188–194
Article Google Scholar
Corbett J C, Dean J, Epstein M, Fikes A, Frost C, et al. Spanner: Google’s globally distributed database. ACM Transactions on Computer Systems, 2013, 31(3): 8
Article Google Scholar
Lamport L. The part-time parliament. ACM Transactions on Computer Systems, 1998, 16(2): 133–169
Article Google Scholar
Ongaro D, Ousterhout J. In search of an understandable consensus algorithm. In: Proceedings of 2014 USENIX Conference on USENIX Annual Technical Conference. 2014, 305–320
Gray J, McJones P, Blasgen M, Lindsay B, Lorie R, Price T, Putzolu F, Traiger I. The recovery manager of the system R database manager. ACM Computing Surveys, 1981, 13(2): 223–243
Article Google Scholar
Mohan C, Haderle D, Lindsay B, Pirahesh H, Schwarz P. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems, 1992, 17(1): 94–162
Article Google Scholar
Diaconu C, Freedman C, Ismert E, Larson P A, Mittal P, Stonecipher R, Verma N, Zwilling M. Hekaton: SQL server’s memory-optimized OLTP engine. In: Proceedings of 2013 ACM SIGMOD International Conference on Management of Data. 2013, 1243–1254
Levandoski J J, Lomet D B, Sengupta S, Stutsman R, Wang R. High performance transactions in deuteronomy. In: Proceedings of the 7th Biennial Conference on Innovative Data Systems Research. 2015
Lim H, Kaminsky M, Andersen D G. Cicada: dependably fast multicore in-memory transactions. In: Proceedings of 2017 ACM International Conference on Management of Data. 2017, 21–35
Johnson R, Pandis I, Stoica R, Athanassoulis M, Ailamaki A. Aether: a scalable approach to logging. Proceedings of the VLDB Endowment, 2010, 3(1–2): 681–692
Article Google Scholar
Johnson R, Pandis I, Stoica R, Athanassoulis M, Ailamaki A. Scalability of write-ahead logging on multicore and multisocket hardware. The VLDB Journal, 2012, 21(2): 239–263
Article Google Scholar
Kim K, Wang T, Johnson R, Pandis I. ERMIA: fast memory-optimized database system for heterogeneous workloads. In: Proceedings of 2016 International Conference on Management of Data. 2016, 1675–1687
Jung H, Han H, Kang S. Scalable database logging for multicores. Proceedings of the VLDB Endowment, 2017, 11(2): 135–148
Article Google Scholar
Tu S, Zheng W, Kohler E, Liskov B, Madden S. Speedy transactions in multicore in-memory databases. In: Proceedings of the 24th ACM Symposium on Operating Systems Principles. 2013, 18–32
Zheng W, Tu S, Kohler E, Liskov B. Fast databases with fast durability and recovery through multicore parallelism. In: Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation. 2014, 465–477
Kimura H. FOEDUS: OLTP engine for a thousand cores and NVRAM. In: Proceedings of 2015 ACM SIGMOD International Conference on Management of Data. 2015, 691–706
Kim J, Jang H, Son S, Han H, Kang S, Jung H. Border-collie: a wait-free, read-optimal algorithm for database logging on multicore hardware. In: Proceedings of 2019 International Conference on Management of Data. 2019, 723–740
Xia Y, Yu X, Pavlo A, Devadas S. Taurus: lightweight parallel logging for in-memory database management systems. Proceedings of the VLDB Endowment, 2020, 14(2): 189–201
Article Google Scholar
Haubenschild M, Sauer C, Neumann T, Leis V. Rethinking logging, checkpoints, and recovery for high-performance storage engines. In: Proceedings of 2020 ACM SIGMOD International Conference on Management of Data. 2020, 877–892
Zhou H, Guo J, Hu H, Qian W, Zhou X, Zhou A. Plover: parallel logging for replication systems. Frontiers of Computer Science, 2020, 14(4): 144606
Article Google Scholar
Hong C, Zhou D, Yang M, Kuo C, Zhang L, Zhou L. KuaFu: closing the parallelism gap in database replication. In: Proceedings of the 29th International Conference on Data Engineering. 2013, 1186–1195
Poke M, Hoefler T. DARE: high-performance state machine replication on RDMA networks. In: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing. 2015, 107–118
Qin D, Brown A D, Goel A. Scalable replay-based replication for fast databases. Proceedings of the VLDB Endowment, 2017, 10(13): 2025–2036
Article Google Scholar
Wang T, Johnson R, Pandis I. Query fresh: log shipping on steroids. Proceedings of the VLDB Endowment, 2017, 11(4): 406–419
Article Google Scholar
Guo J, Chu J, Cai P, Zhou M, Zhou A. Low-overhead Paxos replication. Data Science and Engineering, 2017, 2(2): 169–177
Article Google Scholar
Arora V, Mittal T, Agrawal D, El-Abbadi A, Xue X, Zhi Y, Zhu J. Leader or majority: why have one when you can have both? Improving read scalability in raft-like consensus protocols. In: Proceedings of the 9th USENIX Conference on Hot Topics in Cloud Computing. 2017
Cao W, Liu Z, Wang P, Chen S, Zhu C, Zheng S, Wang Y, Ma G. PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database. Proceedings of the VLDB Endowment, 2018, 11(12): 1849–1862
Article Google Scholar
Zhang Z, Hu H, Yu Y, Qian W, Shu K. Dependency preserved raft for transactions. In: Proceedings of the 25th International Conference of Database Systems for Advanced Applications. 2020, 228–245
Pan Y, Li Y, Zhang C, Zhang R, Hong D. The design and implementation of an efficient order management system based on CEDAR. Journal of East China Normal University (Natural Science), 2018, (3): 88–96
Helland P, Sammer H, Lyon J, Carr R, Garrett P, Reuter A. Group commit timers and high volume transaction systems. In: Proceedings of the 2nd International Workshop on High Performance Transaction Systems. 1987, 301–329
Spiro P M, Joshi A M, Rengarajan T K. Designing an optimized transaction commit protocol. Digital Technical Journal, 1991, 3(1): 1–32
Google Scholar
Howard H. ARC: analysis of raft consensus. Technical Report UCAM-CL-TR-857. Cambridge: University of Cambridge, 2014
Google Scholar
Huang D, Liu Q, Cui Q, Fang Z, Ma X, Xu F, Shen L, Tang L, Zhou Y, Huang M, Wei W, Liu C, Zhang J, Li J, Wu X, Song L, Sun R, Yu S, Zhao L, Cameron N, Pei L, Tang X. TiDB: a raft-based HTAP database. Proceedings of the VLDB Endowment, 2020, 13(12): 3072–3084
Article Google Scholar
Wang T, Johnson R. Scalable logging through emerging non-volatile memory. Proceedings of the VLDB Endowment, 2014, 7(10): 865–876
Article Google Scholar
Li K, Han F. Memory transaction engine of OceanBase. Journal of East China Normal University (Natural Science), 2014, 5: 149–163
Google Scholar
Cooper B F, Silberstein A, Tam E, Ramakrishnan R, Sears R. Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM Symposium on Cloud Computing. 2010, 143–154
Zhu T, Zhao Z, Li F, Qian W, Zhou A, Xie D, Stutsman R, Li H, Hu H. SolarDB: toward a shared-everything database on distributed log-structured storage. ACM Transaction on Storage, 2019, 15(2): 11
Google Scholar
Gray J. Notes on data base operating systems. In: Proceedings of Operating Systems, an Advanced Course. 1978, 393–481
Harizopoulos S, Abadi D J, Madden S, Stonebraker M. OLTP through the looking glass, and what we found there. In: Proceedings of 2008 ACM SIGMOD International Conference on Management of Data. 2008, 981–992
Huang J, Schwan K, Qureshi M K. NVRAM-aware logging in transaction systems. Proceedings of the VLDB Endowment, 2014, 8(4): 389–400
Article Google Scholar
Arulraj J, Perron M, Pavlo A. Write-behind logging. Proceedings of the VLDB Endowment, 2016, 10(4): 337–348
Article Google Scholar
Hagmann R. Reimplementing the cedar file system using logging and group commit. In: Proceedings of the Eleventh ACM Symposium on Operating Systems Principles. 1987, 155–162
Lamport L. Paxos made simple, fast, and byzantine. In: Proceedings of the 6th International Conference on Principles of Distributed Systems. 2002, 7–9
Lamport L. Fast Paxos. Distributed Computing, 2006, 19(2): 79–103
Article MathSciNet Google Scholar
Burrows M. The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th Symposium on Operating Systems Design and Implementation. 2006, 335–350
Baker J, Bond C, Corbett J C, Furman J J, Khorlin A, Larson J, Leon J M, Li Y, Lloyd A, Yushprakh V. Megastore: providing scalable, highly available storage for interactive services. In: Proceedings of the 5th Biennial Conference on Innovative Data Systems Research. 2011, 223–234
Shute J, Vingralek R, Samwel B, Handy B, Whipkey C, Rollins E, Oancea M, Littlefield K, Menestrina D, Ellner S, Cieslewicz J, Rae I, Stancescu T, Apte H. F1: a distributed SQL database that scales. Proceedings of the VLDB Endowment, 2013, 6(11): 1068–1079
Article Google Scholar
Zheng J, Lin Q, Xu J, Wei C, Zeng C, Yang P, Zhang Y. PaxosStore: high-availability storage made practical in WeChat. Proceedings of the VLDB Endowment, 2017, 10(12): 1730–1741
Article Google Scholar
Rao J, Shekita E J, Tata S. Using Paxos to build a scalable, consistent, and highly available datastore. Proceedings of the VLDB Endowment, 2011, 4(4): 243–254
Article Google Scholar
Ousterhout J, Agrawal P, Erickson D, Kozyrakis C, Leverich J, Mazières D, Mitra S, Narayanan A, Parulkar G, Rosenblum M, Rumble S M, Stratmann E, Stutsman R. The case for RAMClouds: scalable high-performance storage entirely in DRAM. ACM SIGOPS Operating Systems Review, 2010, 43(4): 92–105
Article Google Scholar

Download references

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62002119, 61977026, 62072180, and 61772202) supported by the Fundamental Research Funds for the Central Universities, Southwest Minzu University (2021PTJS23), and supported by the Open Fund of Shanghai Engineering Research Center on Big Data Management System. We also thank FCS reviewers for valuable comments which greatly helped to improve this paper.

Author information

Authors and Affiliations

The Key Laboratory for Computer Systems of State Ethnic Affairs Commission, Southwest Minzu University, Chengdu, 610041, China
Huan Zhou & Wenrong Tan
School of Data Science and Engineering, East China Normal University, Shanghai, 200062, China
Weining Qian, Xuan Zhou, Qiwen Dong & Aoying Zhou

Authors

Huan Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Weining Qian
View author publications
You can also search for this author in PubMed Google Scholar
Xuan Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Qiwen Dong
View author publications
You can also search for this author in PubMed Google Scholar
Aoying Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Wenrong Tan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Huan Zhou.

Additional information

Huan Zhou is a lecturer of the School of Computer Science and Engineering, Southwest Minzu University (SWU), China. She received her BS in computer science and technology from Sichuan Normal University (SICNU), China in 2013 and her PhD in software engineering from East China Normal University (ECNU), China in 2019. Before she joined SWU in 2021, she had worked as a postdoctoral researcher in the School of Data Science and Engineering, ECNU, China. Her research interests include database management system and data mining.

Weining Qian is a professor and dean of the School of Data Science and Engineering, East China Normal University (ECNU), China. He received his BS, MS and PhD in computer science from Fudan University, China in 1998, 2001 and 2004, respectively. He is now serving as a standing committee member of Technical Committee on Databases of China Computer Federation (CCF), a member of Expert Group on Artificial Intelligence Science and Technology Innovation of MoE and a vice president of Data Science and Knowledge Systems Engineering Committee of Systems Engineering Society of China (SESC). His research interests include scalable transaction processing, benchmarking on big data systems, big data analysis and processing, big data application and computational education.

Xuan Zhou is a professor and a vice dean of the School of Data Science and Engineering, East China Normal University (ECNU), China. He obtained his BS from Fudan University, China in 2001 and his PhD from the National University of Singapore, Singapore in 2005, both in computer science. Since his graduation, he had worked as a scientist at the L3S Research Centre (Germany) and the CSIRO ICT Centre (Australia) until the end of 2010. Before he joined ECNU in 2017, he spent six years in Renmin University of China, as an associate professor. He is the winner of the Program for New Century Excellent Talents in University of Ministry of Education (MoE). His research interests include database system and information retrieval.

Qiwen Dong is a professor of the School of Data Science and Engineering, East China Normal University (ECNU), China. He received his BS and MS in aerospace engineering and mechanics from Harbin Institute of Technology, China in 2000 and 2002, respectively, and his PhD in computer science from Harbin Institute of Technology, China in 2008. He was a postdoctoral researcher in the School of Computer Science of Fudan University, China from 2008 to 2010, and subsequently served as an associate professor there until 2016. In the meantime, he visited the University of Michigan, USA. His research interests include bioinformatics, data mining, big-data. etc.

Aoying Zhou, Vice President of East China Normal University (ECNU), Vice President of Guizhou University (GZU), Founding Dean of School of Data Science and Engineering (DaSE) ECNU, Professor. He got his BS and MS in computer science from Sichuan University, China in 1985 and 1988, respectively, and he won his PhD from Fudan University, China in 1993. He is the winner of the National Science Fund for Distinguished Young Scholars supported by the National Natural Science Foundation of China (NSFC). He is CCF (China Computer Federation) Fellow and Associate Editor-in-Chief of Chinese Journal of Computer. He served General Chair of ER’2004, Vice PC Chair of ICDE’2009 and ICDE’2012, PC Co-chair of VLDB’2014. His research interests include Web data management, data management for data-intensive computing, inmemory cluster computing, distributed transaction processing, benchmarking for big data and performance.

Wenrong Tan is a professor and dean of the School of Computer Science and Engineering, Southwest Minzu University (SWU), China. She received her BS from Sichuan University, China in 1988 and her MS from Chongqing University, China in 1991, both in computer science. She is now serving as a vice president of Sichuan Province Computer Federation and a vice president of Sichuan Institute of Artificial Intelligence. Her research interests include internet of things and natural language processing.

Electronic Supplementary Material