Abstract
Providing ACID transactions under conflicts across globally distributed data is the Everest of transaction processing protocols. Transaction processing in this scenario is particularly costly due to the high latency of cross-continent network links, which inflates concurrency control and data replication overheads. To mitigate the problem, we introduce Ocean Vista—a novel distributed protocol that guarantees strict serializability. We observe that concurrency control and replication address different aspects of resolving the visibility of transactions, and we address both concerns using a multi-version protocol that tracks visibility using version watermarks and arrives at correct visibility decisions using efficient gossip. Gossiping the watermarks enables asynchronous transaction processing and acknowledging transaction visibility in batches in the concurrency control and replication protocols, which improves efficiency under high cross-data center network delays. In particular, Ocean Vista can access conflicting transactions in parallel and supports efficient write-quorum/read-one access using one round trip in the common case. We demonstrate experimentally in a multi-data center cloud environment that our design outperforms a leading distributed transaction processing engine (TAPIR) more than tenfold in terms of peak throughput, albeit at the cost of additional latency for gossip and a more restricted transaction model. The latency penalty is generally bounded by one wide area network (WAN) round trip time (RTT), and in the best case (i.e., under light load) our system nearly breaks even with TAPIR by committing transactions in around one WAN RTT.
Similar content being viewed by others
Notes
Super-quorums are defined as in FastPaxos [22].
An alternative expression is using a super-quorum size \(2f + 1\) out of \(3f + 1\) replicas.
References
Balakrishnan, M., Malkhi, D., Davis, J.D., Prabhakaran, V., Wei, M., Wobber, T.: CORFU: a distributed shared log. ACM Trans. Comput. Syst. 31(4), 10:1–10:24 (2013)
Bernstein, P., Reid, C., Das, S.: Hyder—a transactional record manager for shared flash. In: Proceedings of the 5th Biennial Conference on Innovative Data Systems Research, CIDR’11, pp. 9–20 (2011)
Bernstein, P.A., Das, S., Ding, B., Pilman, M.: Optimizing optimistic concurrency control for tree-structured, log-structured databases. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD’15, pp. 1295–1309 (2015)
Bernstein, P.A., Hadzilacos, V., Goodman, N.: Concurrency Control and Recovery in Database Systems. Addison-Wesley Longman Publishing Co. Inc, Boston, MA (1987)
Bernstein, P.A., Shipman, D.W., Wong, W.S.: Formal aspects of serializability in database concurrency control. IEEE Trans. Softw. Eng. 5(3), 203–216 (1979)
Chockler, G.V., Keidar, I., Vitenberg, R.: Group communication specifications: a comprehensive study. ACM Comput. Surv. 33(4), 427–469 (2001)
Corbett, J.C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J.J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., Hsieh, W., Kanthak, S., Kogan, E., Li, H., Lloyd, A., Melnik, S., Mwaura, D., Nagle, D., Quinlan, S., Rao, R., Rolig, L., Saito, Y., Szymaniak, M., Taylor, C., Wang, R., Woodford, D.: Spanner: Google’s globally distributed database. ACM Trans. Comput. Syst. 31(3), 8:1–8:22 (2013)
Dey, A., Fekete, A., Nambiar, R., Röhm, U.: YCSB+T: Benchmarking web-scale transactional databases. In: IEEE 30th International Conference on Data Engineering Workshops, pp. 223–230 (2014)
Didona, D., Guerraoui, R., Wang, J., Zwaenepoel, W.: Causal consistency and latency optimality: friend or foe? PVLDB 11(11), 1618–1632 (2018)
Facebook: fbthrift. https://github.com/facebook/fbthrift
Faleiro, J.M., Abadi, D.J.: Rethinking serializable multiversion concurrency control. PVLDB 8(11), 1190–1201 (2015)
Faleiro, J.M., Abadi, D.J., Hellerstein, J.M.: High performance transactions via early write visibility. PVLDB 10(5), 613–624 (2017)
Fan, H., Golab, W.: Scalable transaction processing using functors. In: Proceedings of the 38th IEEE International Conference on Distributed Computing Systems, ICDCS’18, pp. 1004–1016 (2018)
Fan, H., Golab, W.: Ocean vista: gossip-based visibility control for speedy geo-distributed transactions. PVLDB 12(11), 1471–1484 (2019)
Goel, A.K., Pound, J., Auch, N., Bumbulis, P., MacLean, S., Färber, F., Gropengiesser, F., Mathis, C., Bodner, T., Lehner, W.: Towards scalable real-time analytics: an architecture for scale-out of OLxP workloads. PVLDB 8(12), 1716–1727 (2015)
Herlihy, M.P., Wing, J.M.: Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12(3), 463–492 (1990)
Hunt, P., Konar, M., Junqueira, F.P., Reed, B.: ZooKeeper: Wait-free coordination for internet-scale systems. In: Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIXATC’10, pp. 11–11 (2010)
Kallman, R., Kimura, H., Natkins, J., Pavlo, A., Rasin, A., Zdonik, S., Jones, E.P.C., Madden, S., Stonebraker, M., Zhang, Y., Hugg, J., Abadi, D.J.: H-store: a high-performance, distributed main memory transaction processing system. PVLDB 1(2), 1496–1499 (2008)
Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., Lewin, D.: Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the 29th Annual ACM Symposium on Theory of Computing, STOC’97, pp. 654–663 (1997)
Kraska, T., Pang, G., Franklin, M.J., Madden, S., Fekete, A.: MDCC: Multi-data center consistency. In: Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys’13, pp. 113–126 (2013)
Lamport, L.: The part-time parliament. ACM Trans. Comput. Syst. 16(2), 133–169 (1998)
Lamport, L.: Fast Paxos. Distrib. Comput. 19, 79–103 (2006)
Leau, C.: Spring Data Redis—Retwis-J. https://docs.spring.io/springdata/ data-keyvalue/examples/retwisj/current/ (2013)
Levandoski, J.J., Lomet, D.B., Sengupta, S.: The Bw-Tree: A B-tree for new hardware platforms. In: Proceedings of the 2013 IEEE International Conference on Data Engineering, ICDE’13, pp. 302–313 (2013)
Lloyd, W., Freedman, M.J., Kaminsky, M., Andersen, D.G.: Stronger semantics for low-latency geo-replicated storage. In: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, NSDI’13, pp. 313–328 (2013)
Lockerman, J., Faleiro, J.M., Kim, J., Sankaran, S., Abadi, D.J., Aspnes, J., Sen, S., Balakrishnan, M.: The fuzzylog: A partially ordered shared log. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’18, pp. 357–372 (2018)
Mu, S., Nelson, L., Lloyd, W., Li, J.: Consolidating concurrency control and consensus for commits under conflicts. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pp. 517–532 (2016)
Ongaro, D., Ousterhout, J.: In search of an understandable consensus algorithm. In: Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIXATC’14, pp. 305–320 (2014)
Papadimitriou, C.H.: The serializability of concurrent database updates. J. ACM 26(4), 631–653 (1979)
Ren, K., Li, D., Abadi, D.J.: SLOG: serializable, low-latency, geo-replicated transactions. PVLDB 12(11), 1747–1761 (2019)
Ren, K., Thomson, A., Abadi, D.J.: An evaluation of the advantages and disadvantages of deterministic database systems. PVLDB 7(10), 821–832 (2014)
Sovran, Y., Power, R., Aguilera, M.K., Li, J.: Transactional storage for geo-replicated systems. In: Proceedings of the 23rd ACM Symposium on Operating Systems Principles, SOSP’11, pp. 385–400 (2011)
Thomson, A., Diamond, T., Weng, S.C., Ren, K., Shao, P., Abadi, D.J.: Calvin: Fast distributed transactions for partitioned database systems. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD’12, pp. 1–12 (2012)
UWSysLab: TAPIR implementation. https://github.com/UWSysLab/tapir (2018)
Verbitski, A., Gupta, A., Saha, D., Brahmadesam, M., Gupta, K., Mittal, R., Krishnamurthy, S., Maurice, S., Kharatishvili, T., Bao, X.: Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In: Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD’17, pp. 1041–1052 (2017)
Verbitski, A., Gupta, A., Saha, D., Corey, J., Gupta, K., Brahmadesam, M., Mittal, R., Krishnamurthy, S., Maurice, S., Kharatishvilli, T., Bao, X.: Amazon Aurora: On avoiding distributed consensus for I/Os, commits, and membership changes. In: Proceedings of the 2018 International Conference on Management of Data, SIGMOD’18, pp. 789–796 (2018)
Yan, X., Yang, L., Zhang, H., Lin, X.C., Wong, B., Salem, K., Brecht, T.: Carousel: Low-latency transaction processing for globally-distributed data. In: Proceedings of the 2018 International Conference on Management of Data, SIGMOD’18, pp. 231–243 (2018)
Zhang, I., Sharma, N.K., Szekeres, A., Krishnamurthy, A., Ports, D.R.K.: Building consistent transactions with inconsistent replication. ACM Trans. Comput. Syst. 35(4), 12:1–12:37 (2018)
Zhang, Y., Power, R., Zhou, S., Sovran, Y., Aguilera, M.K., Li, J.: Transaction chains: achieving serializability with low latency in geo-distributed storage systems. In: Proceedings of the 24th ACM symposium on operating systems principles, SOSP’13, pp. 276–291 (2013)
Acknowledgements
We are grateful to the anonymous reviewers from both PVLDB and VLDBJ and to Ken Salem, for their helpful feedback on this research. Wojciech Golab is supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada and by Ripple.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Fan, H., Golab, W. Gossip-based visibility control for high-performance geo-distributed transactions. The VLDB Journal 30, 93–114 (2021). https://doi.org/10.1007/s00778-020-00626-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-020-00626-5