Database Performance at Scale: A Practical Guide
Felipe Cardeneti Mendes · Piotr Sarna · Pavel Emelyanov · Cynthia Dunlop
Copyright © 2023 by Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, Cynthia Dunlop
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on
microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes
were made.
The images or other third party material in this book are included in the book’s Creative Commons license, unless
indicated otherwise in a credit line to the material. If material is not included in the book’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every
occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to
the benefit of the trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as
such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the
authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made.
The publisher makes no warranty, express or implied, with respect to the material contained herein.
Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Jonathan Gennick
Development Editor: Laura Berendson
Editorial Project Manager: Shaul Elson
Copy Editor: Kezia Endsley
Cover designed by eStudioCalamar
Distributed to the book trade worldwide by Springer Science+Business Media LLC, 1 New York Plaza, Suite 4600,
New York, NY 10004. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit
www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business
Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail booktranslations@springernature.com; for reprint, paperback, or audio
rights, please e-mail bookpermissions@springernature.com.
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also
available for most titles. For more information, reference our Print and eBook Bulk Sales web page at
http://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub
(https://github.com/Apress). For more detailed information, please visit https://www.apress.com/gp/services/source-code.
Paper in this product is recyclable
To Cristina and Snow
—Felipe
To Wiktoria
—Piotr
To Svetlana and Mykhailo
—Pavel
To David
—Cynthia
Table of Contents

About the Authors
Acknowledgments
Introduction
Mixed Workloads
Delete-Heavy Workloads
Competing Workloads (Real-Time vs Batch)
Item Size
Item Type
Dataset Size
Throughput Expectations
Latency Expectations
Concurrency
Connected Technologies
Demand Fluctuations
ACID Transactions
Consistency Expectations
Geographic Distribution
High-Availability Expectations
Summary
Paging
Concurrency
Modern Hardware
Modern Software
What to Look for When Selecting a Driver
Summary
Index
About the Authors
Felipe Cardeneti Mendes is an IT specialist with years of
experience using distributed systems and open-source
technologies. He has co-authored three Linux books and
is a frequent speaker at public events and conferences
to promote open-source technologies. Felipe works as a
solution architect at ScyllaDB.
About the Technical Reviewers
Botond Dénes has been a principal software engineer at
ScyllaDB since 2017. Botond has mostly worked on making
queries perform better and making sure their concurrency
and resource consumption (especially memory) are kept in
check. In addition, he has worked extensively on disaster
recovery and diagnostics tools.
Acknowledgments
The process of creating this book has been a wild ride across many countries, cultures,
and time zones, as well as around many obstacles. There are many people to thank for
their assistance, inspiration, and support along this journey.
To begin, ScyllaDB co-founders Dor Laor and Avi Kivity—for starting the company
that brought us all together, for pushing the boundaries of database performance at scale
in ways that inspired this book, and for trusting us to share the collective sea monster
wisdom in this format. Thank you for this amazing opportunity.
We thank our respective teams, and especially our managers, for supporting this side
project. We hope we kept the core workload disruption to a minimum and did not inflict
any “stop the world” project pauses.
Our technical reviewers—Botond Dénes, Ľuboš Koščo, and Raphael S. Carvalho—
painstakingly reviewed the first draft of every page in this book and offered insightful
suggestions throughout. Thank you for your thoughtful comments and for being so
generous with your time.
Additionally, our unofficial technical reviewer and toughest critic, Kostja Osipov,
provided early and (brutally) honest feedback that led us to substantially alter the book’s
focus for the better.
The Brazilian Ninja team (Guilherme Nogueira, Lucas Martins Guimarães, and
Noelly Medina) rescued us in our darkest hour, allowing us to scale out and get the first
draft across the finish line. Muito Obrigado!
Ben Gaisne is the graphic design mastermind behind the images in this book. Merci
for transforming our scribbles into beautiful diagrams and putting up with about ten
rounds of “just one more round of book images.”
We are also indebted to many for their unintentional contributions on the content
front. Glauber Costa left us with a treasure trove of materials we consulted when
composing chapters, especially Chapter 9 on benchmarking. He also inspired the addition
of Chapter 6 on getting data closer. We also looked back to ScyllaDB blogs as
we were writing—specifically, blogs by Avi Kivity (for Chapter 3), Eyal Gutkind (for Chapter
7), Vlad Zolotarov and Moreno Garcia (also for Chapter 7), Dor Laor (for Chapter 8), Eliran
Sinvani (also for Chapter 8), and Ivan Prisyazhynyy (for Chapter 9).
Last, but certainly not least, we thank Jonathan Gennick for bringing us to Apress. We
thank Shaul Elson and Susan McDermott for guiding us through the publishing process.
It has been a pleasure working with you. And we thank everyone involved in editing and
production; having previously tried this on our own, we know it’s an excruciating task
and we are truly grateful to you for relieving us of this burden!
Introduction
Sisyphean challenge. Gordian knot. Rabbit hole. Many metaphors have been used to
describe the daunting challenge of achieving database performance at scale. That isn’t
surprising. Consider just a handful of the many factors that contribute to satisfying
database latency and throughput expectations for a single application:
• How well you know your workload access patterns and whether they
are a good fit for your current or target database.
As with any engineering challenge, there’s no one-size-fits-all solution. But there are
a lot of commonly overlooked considerations and opportunities with the potential to
help teams meet their database performance objectives faster, and with fewer headaches.
As a group of people with experience across a variety of performance-oriented
database projects, we (the authors) have a unique perspective into what works well for
different performance-sensitive use cases—from low-level engineering optimizations,
to infrastructure components, to topology considerations and the KPIs to focus on for
monitoring. Frequently, we engage with teams when they’re facing a performance
challenge so excruciating that they’re considering changing their production database
(which can seem like the application development equivalent of open heart surgery).
And in many cases, we develop a long-term relationship with a team, watching their
projects and objectives evolve over time and helping them maintain or improve
performance across the shifting sands.
Based on our experience with performance-focused database engineering as well as
performance-focused database users, this book represents what we think teams striving
for extreme database performance—low latency, high throughput, or both—should be
thinking about. We have experience working with multi-petabyte distributed systems
requiring millions of interactions per second. We’ve engineered systems supporting
business critical real-time applications with sustained latencies below one millisecond.
Finally, we’re well aware of commonly-experienced “gotchas” that no one has dared to
tell you about, until now.
We specifically tagged on the “at scale” modifier to emphasize that we’re catering to
teams who are outside of the honeymoon zone, where everything is just blissfully fast
no matter what you do with respect to setup, usage, and management. Different teams
will reach that inflection point for different reasons, and at different thresholds. But one
thing is always the same: It’s better to anticipate and prepare than to wait and scramble
to react.
You might also be looking to reduce costs without compromising performance, but
be unsure of all the considerations involved in doing so.
We assume that you want to get your database performance challenges resolved,
fast. That’s why we focus on providing very direct and opinionated recommendations
based on what we have seen work (and fail) in real-world situations. There are, of
course, exceptions to every rule and ways to debate the finer points of almost any tip
in excruciating detail. We’ll focus on presenting the battle-tested “best practices” and
anti-patterns here, and encourage additional discussion in whatever public or private
channels you prefer.
There are already many outstanding references that cover the topics we’re
deliberately not addressing, so we’re not going to attempt to re-create or replace them.
See Appendix A for a list of recommended resources.
Also, this is not a book about ScyllaDB, even though the authors and technical
reviewers have experience with ScyllaDB. Our goal is to present strategies that are useful
across the broader class of performance-oriented databases. We reference ScyllaDB, as
well as other databases, as appropriate to provide concrete examples.
database performance challenges and tradeoffs that you’re likely to face depending on
your project’s specific workload characteristics and technical/business requirements.
The next set of chapters provides a window into many often-overlooked engineering
details that could be constraining—or helping—your database performance. First, we
look at ways databases can extract more performance from your CPU, memory, storage,
and networking. Next, we shift the focus from hardware interactions to algorithmic
optimizations—deep diving into the intricacies of a sample performance optimization
from the perspective of the engineer behind it. Following that, we share everything a
performance-obsessed developer really should know about database drivers but never
thought to ask. Driver-level optimizations—both how they’re engineered and how you
work with them—are absolutely critical for performance, so we spend a good amount
of time on topics like the interaction between clients and servers, contextual awareness,
maximizing concurrency while keeping latencies under control, correct usage of
paging, timeout control, retry strategies, and so on. Finally, we look at the performance
possibilities in moving more logic into the database (via user-defined functions and
user-defined aggregates) as well as moving the database servers closer to users.
Then, the final set of chapters shifts into field-tested recommendations for
getting better performance out of your database deployment. It starts by looking at
infrastructure and deployment model considerations that are important to understand,
whether you’re managing your own deployment or opting for a database-as-a-service
(maybe serverless) deployment model. Then, we share our top strategies related to
topology, benchmarking, monitoring, and admin—all through the not-always-rosy lens
of performance.
After all that, we hope you end up with a new appreciation of the countless
considerations that impact database performance at scale, discover some previously
overlooked opportunities to optimize your database performance, and avoid the
common traps and pitfalls that inflict unnecessary pain and distractions on all too many
dev and database teams.
Tip Check out our GitHub repo for easy access to the sources we reference in
footnotes, plus additional resources on database performance at scale:
https://github.com/Apress/db-performance-at-scale.
Summary
Optimizing database performance at the scale required for today’s data-intensive
applications often requires more than performance tuning and scaling out. This
book shares commonly overlooked considerations, pitfalls, and opportunities that
have helped many teams break through database performance plateaus. It’s neither
a definitive guide to distributed databases nor a beginner’s resource. Rather, it’s a
look at the many different factors that impact performance, and our top field-tested
recommendations for navigating them. Chapter 1 provides two (fun and fanciful) tales
that surface some of the many roadblocks you might face and highlight the range of
strategies for navigating around them.
CHAPTER 1
A Taste of What You’re Up Against: Two Tales

1 Atomicity, consistency, isolation, and durability
On a very high level, the company’s main product consisted of only two layers:
• The frontend, the entry point for users, which actually runs in their
own browsers and communicates with the rest of the system to
exchange and persist information.
Joan’s first task was to implement a very simple service for gathering and summing
up various statistics from the database and integrate that service with the whole
ecosystem, so that it fetched data from the database in real-time and allowed the
DevOps teams to inspect the statistics live.
To impress management and reassure them that hiring Joan was their absolute
best decision this quarter, Joan decided to deliver a proof-of-concept implementation
on her first day! The company’s unspoken policy was to write software in Rust, so she
grabbed the first driver for their database from a brief crates.io search and sat down to
her self-organized hackathon.
The day went by really smoothly, with Rust’s ergonomics-focused ecosystem
providing a superior developer experience. But then Joan ran her first smoke tests on a
real system. Disbelief turned to disappointment and helplessness when she realized that
every third request (on average) ended in an error, even though the whole database
cluster reported itself to be in a healthy, operable state. That meant a debugging session was
in order!
Unfortunately, the driver Joan hastily picked as the foundation of her work, though
itself open-source, was just a thin wrapper over precompiled legacy C code, with no
source to be found. Fueled by a strong desire to solve the mystery and a
healthy dose of fury, Joan spent a few hours inspecting the network communication with
Wireshark,2 and she made an educated guess that the bug must be in the hashing key
implementation.3 In the database used by the company, keys are hashed to later route
requests to appropriate nodes. If a hash value is computed incorrectly, a request may be
forwarded to the wrong node, which can refuse it and return an error instead.
2 Wireshark is a great tool for inspecting network packets and more (www.wireshark.org).
3 Loosely based on a legit hashing quirk in Apache Cassandra (https://github.com/apache/cassandra/blob/56ea39ec704a94b5d23cbe530548745ab2420cee/src/java/org/apache/cassandra/utils/MurmurHash.java#L31-L32).
Unable to verify the claim due to the missing source code, Joan decided on a simpler
path—ditching the originally chosen driver and reimplementing the solution on one of
the officially supported, open-source drivers backed by the database vendor, with a solid
user base and regularly updated release schedule.
2. Drivers have bugs too, and it’s impossible to avoid them. Still,
there are good practices to follow:
a. Unless there’s a good reason, choose the officially supported driver (if it
exists).
The introductory task was eventually completed successfully, which made Joan
ready to receive her first real assignment.
The Tuning
Armed with the experience gained working on the introductory task, Joan started planning
how to approach her new assignment: a misbehaving app. One of the applications
notoriously caused stability issues for the whole system, disrupting other workloads
each time it experienced any problems. The rogue app was already based on an officially
supported driver, so Joan could cross that one off the list of potential root causes.
This particular service was responsible for injecting data backed up from the
legacy system into the new database. Because the company was not in a great hurry,
the application was written with low concurrency in mind, so that it would run at low
priority and not interfere with user workloads. Unfortunately, once every few days something
kept triggering an anomaly. The normally peaceful application seemed to be trying to
perform a denial-of-service attack on its own database, flooding it with requests until the
backend got overloaded enough to cause issues for other parts of the ecosystem.
As Joan watched the metrics in a Grafana dashboard, which clearly suggested that
the rate of requests generated by this application spiked around the time of the
anomaly, she wondered how on Earth this workload could behave like that. It was, after
all, explicitly implemented to send new requests only when fewer than 100 of them were
currently in progress.
Since collaboration was heavily advertised as one of the company’s “spirit and
cultural foundations” during the onboarding sessions with an onsite coach, she decided
it was best to discuss the matter with her colleague, Tony.
The observation that led to discovering the root cause was rather simple: The request
didn’t actually return a timeout error because the database server never sent such a
response. The request was simply qualified as timed out by the driver, and discarded. But
the sole fact that the driver no longer waits for a response for a particular request does
not mean that the database is done processing it! It’s entirely possible that the request
was instead just stalled, taking longer than expected, and the driver gave up waiting for
its response.
With that knowledge, it’s easy to imagine that once 100 requests time out on the
client side, the app might erroneously think that they are not in progress anymore, and
happily submit 100 more requests to the database, increasing the total number of
in-flight requests (i.e., concurrency) to 200. Rinse, repeat, and you can achieve extreme
levels of concurrency on your database cluster—even though the application was
supposed to keep it limited to a small number!

4 For an overview of the “rubber duck debugging” concept, see https://rubberduckdebugging.com/.
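To make the concurrency-accounting trap concrete, here is a minimal sketch (our illustration, not the story’s actual code) of bounding in-flight requests with a Tokio semaphore. The helper and job names are hypothetical; the point is that a permit is released only when the request future truly finishes.

```rust
use std::future::Future;
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Run `jobs` with at most `max_in_flight` of them executing at once.
async fn run_bounded<F, Fut>(max_in_flight: usize, jobs: Vec<F>)
where
    F: FnOnce() -> Fut + Send + 'static,
    Fut: Future<Output = ()> + Send + 'static,
{
    let limiter = Arc::new(Semaphore::new(max_in_flight));
    let mut handles = Vec::new();
    for job in jobs {
        // Acquire a permit *before* sending the request; this is what
        // actually bounds concurrency.
        let permit = limiter.clone().acquire_owned().await.expect("semaphore closed");
        handles.push(tokio::spawn(async move {
            job().await;
            // Released only once the request future completes. Note the trap
            // from Joan's story: if `job` gives up on a client-side timeout
            // shorter than the server's, dropping the permit here no longer
            // means the server is actually done with the request.
            drop(permit);
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}
```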
With the client-side timeouts properly amended, the application choked much less
frequently and to a smaller extent, but it still wasn’t a perfect citizen in the distributed
system. It occasionally picked a victim database node and kept bothering it with too
many requests, while ignoring the fact that seven other nodes were considerably less
loaded and could help handle the workload too. At other times, its concurrency was
reported to be exactly 200 percent larger than expected by the configuration. Whenever
the two anomalies converged in time, the poor node was unable to handle all the
requests it was bombarded with, and it had to give up on a fair portion of them. A long
study of the driver’s documentation, which was fortunately available in mdBook6 format
and kept reasonably up-to-date, helped Joan alleviate those pains too.

5 OpenTelemetry “is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.” For details, see https://opentelemetry.io/.
The first issue was simply a misconfiguration of the non-default load balancing
policy, which tried too hard to pick “the least loaded” database node out of all the
available ones, based on heuristics and statistics occasionally updated by the database
itself. Unfortunately, this policy was also “best effort,” and relied on the fact that statistics
arriving from the database were always legit. But a stressed database node could become
so overloaded that it wasn’t sending updated statistics in time! That led the driver to
falsely believe that this particular server was not actually busy at all. Joan decided that
this setup was a premature optimization that turned out to be a footgun, so she just
restored the original default policy, which worked as expected.
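As an illustration only (the story’s driver is unnamed), this is roughly what “just use the default” looks like with the open-source scylla Rust driver: if you don’t override the load balancing policy when building the session, the driver’s default token-aware policy applies. Exact builder methods vary across drivers and versions, so treat this as a sketch.

```rust
use scylla::{Session, SessionBuilder};

// A sketch: build a session without overriding the load balancing policy,
// letting the driver fall back to its default instead of a custom
// "least loaded" heuristic. The address is a placeholder.
async fn connect() -> Result<Session, Box<dyn std::error::Error>> {
    let session = SessionBuilder::new()
        .known_node("127.0.0.1:9042")
        .build()
        .await?;
    Ok(session)
}
```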
The second issue (temporary doubling of the concurrency) was caused by another
misconfiguration: an overeager speculative retry policy. After waiting for a preconfigured
period of time without getting an acknowledgement from the database, drivers would
speculatively resend a request to maximize its chances to succeed. This mechanism
is very useful to increase requests’ success rate. However, if the original request also
succeeds, it means that the speculative one was sent in vain. In order to balance the
pros and cons, speculative retry should be configured to resend requests only when
it’s very likely that the original one failed. Otherwise, as in Joan’s case, the speculative
retry may act too soon, doubling the number of requests sent (and thus also doubling
concurrency) without improving the success rate.
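In practice, speculative retry is driver configuration, but the mechanism is simple enough to sketch by hand. The following Tokio snippet (our construction, not any particular driver’s API) sends a second copy of a request only after a delay and takes whichever response arrives first; set the delay too low and you simply double the load without improving the success rate.

```rust
use std::future::Future;
use std::time::Duration;
use tokio::time::sleep;

/// Issue `make_request` and, if it hasn't resolved within `delay`,
/// race it against a speculative second attempt.
async fn with_speculative_retry<F, Fut, T>(make_request: F, delay: Duration) -> T
where
    F: Fn() -> Fut,
    Fut: Future<Output = T>,
{
    let first = make_request();
    tokio::pin!(first);
    tokio::select! {
        res = &mut first => res, // the common, happy path
        _ = sleep(delay) => {
            // The original is slow; fire a speculative copy and take
            // whichever response lands first.
            let second = make_request();
            tokio::pin!(second);
            tokio::select! {
                res = &mut first => res,
                res = &mut second => res,
            }
        }
    }
}
```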
Whew, nothing gives a simultaneous endorphin rush and dopamine hit like a quality
debugging session that ends in an astounding success (except writing a cheesy story in a
deeply technical book, naturally). Great job, Joan!
The end.
6 mdBook “is a command line tool to create books with Markdown.” For details, see https://rust-lang.github.io/mdBook/.
After some experimentation with the offering’s free tier, Patrick decided to sign a
one-year contract with a major cloud provider to get a significant discount on its NoSQL
database-as-a-service offering. With provisioned throughput capable of serving up to
1,000 customers every second, the technology stack was ready and the store opened its
virtual doors to the customers. To Patrick’s disappointment, fewer than ten customers
visited the site daily. At the same time, the shiny new database cluster kept running,
fueled by a steady influx of money from his credit card and waiting for its potential to be
harnessed.
b. Is the variance high and hard to predict, with the system being idle for
potentially long periods of time, with occasional bumps of activity?
7 A very strong consistency guarantee; see the Jepsen page on Linearizability for details (https://jepsen.io/consistency/models/linearizable).
Tip For those having trouble remembering the formula, think units.
Concurrency is just a number, latency can be measured in seconds, while
throughput is usually expressed in 1/s. Then, it stands to reason that in order for
units to match, concurrency should be obtained by multiplying latency (seconds) by
throughput (1/s). You’re welcome!
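As a quick worked example of that unit math (numbers invented for illustration):

```rust
// Little's Law: concurrency = throughput (1/s) × latency (s).
fn required_concurrency(throughput_ops_per_s: f64, mean_latency_s: f64) -> f64 {
    throughput_ops_per_s * mean_latency_s
}

fn main() {
    // Sustaining 50,000 ops/s at a 2 ms mean latency implies roughly
    // 50_000 × 0.002 = 100 requests in flight, on average.
    println!("{}", required_concurrency(50_000.0, 0.002)); // ~100
}
```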
Throughput depends on the hardware and naturally has its limits (e.g., you can’t
expect an NVMe drive purchased in 2023 to serve the data for you in terabytes per second,
although we are crossing our fingers for this assumption to be invalidated in the near
future!) Once the limit is hit, you can treat it as constant in the formula. It’s then clear
that as concurrency rises, so does latency. For the end-users—Malaysian teenagers in
this scenario—it means that the latency is eventually going to cross the magic barrier
for average human perception of a few seconds. Once that happens, users get too
frustrated and simply give up on trying altogether, assuming that the system is broken
beyond repair. It’s easy to find online articles quoting that “Amazon found that 100ms
of latency costs them 1 percent in sales”; although it sounds overly simplified, it is also
true enough.
Unfortunately, the next March 17th didn’t go as smoothly as expected either. Patrick
spent most of the day enjoying steady Grafana dashboards, which kept assuring him
that the traffic was under control and the system was handling the customer load with a
healthy safety margin. But then the dashboards stopped, kindly mentioning that the disks
had become severely overutilized. This seemed completely out of place given the observed
concurrency. While looking for the possible source of this anomaly, Patrick noticed, to
his horror, that the scheduled backup procedure coincided with the annual peak load…
The end.
Summary
Meeting database performance expectations can sometimes seem like a never-ending
pain. As soon as you diagnose and address one problem, another is likely lurking right
behind it. The next chapter helps you anticipate the challenges and opportunities you
are most likely to face given your technical requirements and business expectations.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 2
Your Project, Through the Lens of Database Performance
Note Since this chapter covers a broad range of scenarios, not everything will be
applicable to your specific project and workload. Feel free to skim this chapter and
focus on the sections that seem most relevant.
Some databases are optimized for read-heavy workloads, others are optimized for
write-heavy situations, and some are built to
accommodate both. Selecting, or sticking with, one that’s a poor fit for your current and
future situation will be a significant burden that will be difficult to overcome, no matter
how strategically you optimize everything else.
There’s also a significant impact to cost. That might not seem directly related to
performance, but if you can’t afford (or get approval for) the infrastructure that you truly
need to support your workload, this will clearly limit your performance.1
Tip Not sure what your workload looks like? This is one of many situations where
observability is your friend. If your existing database doesn’t help you profile your
workload, consider if it’s feasible to try your workloads on a compatible database
that enables deeper visibility.
Write-Heavy Workloads
If you have a write-heavy workload, we strongly recommend a database that stores data
in immutable files (e.g., Cassandra, ScyllaDB, and others that use LSM trees).2 These
databases optimize write speed because: 1) writes are sequential, which is faster in
terms of disk I/O and 2) writes are performed immediately, without first worrying about
reading or updating existing values (like databases that rely on B-trees do). As a result,
you can typically write a lot of data with very low latencies.
However, if you opt for a write-optimized database, be prepared for higher storage
requirements and the potential for slower reads. When you work with immutable
files, you’ll need sufficient storage to keep all the immutable files that build up until
compaction runs.3 You can mitigate the storage needs to some extent by choosing
compaction strategies carefully. Plus, storage is relatively inexpensive these days.
1 With write-heavy workloads, you can easily spend millions per month with Bigtable or DynamoDB. Read-heavy workloads are typically less costly in these pricing models.
2 If you want a quick introduction to LSM trees and B-trees, see Appendix A. Chapter 4 also discusses B-trees in more detail.
3 Compaction is a background process that databases with an LSM tree storage backend use to merge and optimize the shape of the data. Since files are immutable, the process essentially involves picking up two or more pre-existing files, merging their contents, and producing a sorted output file.
The potential for read amplification is generally a more significant concern with
write-optimized databases (given all the files to search through, more disk reads are
required per read request).
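To make that write-path/read-path tradeoff tangible, here is a deliberately toy LSM-style sketch (our illustration; real engines add commit logs, bloom filters, compaction, and much more). Writes go straight into an in-memory memtable and are flushed sequentially into immutable sorted runs; reads may have to consult every run, which is exactly the read amplification just described.

```rust
use std::collections::BTreeMap;

struct ToyLsm {
    memtable: BTreeMap<String, String>,
    // Each flushed "sstable" is an immutable, sorted run.
    sstables: Vec<Vec<(String, String)>>,
    memtable_limit: usize,
}

impl ToyLsm {
    fn write(&mut self, key: String, value: String) {
        // No read-before-write: the new value is simply recorded.
        self.memtable.insert(key, value);
        if self.memtable.len() >= self.memtable_limit {
            // Flush: one sequential write of already-sorted data.
            let run: Vec<_> = std::mem::take(&mut self.memtable).into_iter().collect();
            self.sstables.push(run);
        }
    }

    fn read(&self, key: &str) -> Option<&String> {
        // Reads must check the memtable plus (worst case) every run,
        // newest first; this is the read amplification discussed above.
        if let Some(v) = self.memtable.get(key) {
            return Some(v);
        }
        for run in self.sstables.iter().rev() {
            if let Ok(i) = run.binary_search_by(|(k, _)| k.as_str().cmp(key)) {
                return Some(&run[i].1);
            }
        }
        None
    }
}
```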
But read performance doesn’t necessarily need to suffer. You can often minimize this
tradeoff with a write-optimized database that implements its own caching subsystem
(as opposed to those that rely on the operating system’s built-in cache), enabling fast
reads to coexist alongside extremely fast writes. Bypassing the underlying OS with a
performance-focused built-in cache should speed up your reads nicely, to the point
where the latencies are nearly comparable to read-optimized databases.
With a write-heavy workload, it’s also essential to have extremely fast storage, such
as NVMe drives, if your peak throughput is high. Having a database that can theoretically
store values rapidly ultimately won’t help if the disk itself can’t keep pace.
Another consideration: beware that write-heavy workloads can result in
surprisingly high costs as you scale. Writes cost around five times more than reads
under some vendors’ pricing models. Before you invest too much effort in performance
optimizations, it’s a good idea to price your solution at scale and make sure
it’s a good long-term fit.
Read-Heavy Workloads
With read-heavy workloads, things change a bit. B-tree databases (such as DynamoDB)
are optimized for reads (that’s the payoff for the extra time required to update values on
the write path). However, the advantage that read-optimized databases offer for reads
is generally not as significant as the advantage that write-optimized databases offer for
writes, especially if the write-optimized database uses internal caching to make up the
difference (as noted in the previous section).
Careful data modeling will pay off in spades for optimizing your reads. So will careful
selection of read consistency (are eventually consistent reads acceptable as opposed to
strongly consistent ones?), locating your database near your application, and performing
a thorough analysis of your query access patterns. Thinking about your access patterns is
especially crucial for success with a read-heavy workload. Consider aspects such as the
following:
• What is the nature of the data that the application will be querying
most frequently? Does it tolerate potentially stale reads or does it
require immediate consistency?
For example, assume that your use case requires dynamic querying capabilities (such
as type-ahead use cases, report-building solutions, etc.) where you frequently need to query
data from columns other than your primary/hash key component. In this case, you might
find yourself performing full table scans all too frequently, or relying on too many indexes.
Both of these, in one way or another, may eventually undermine your read performance.
On the infrastructure side, selecting servers with high memory footprints is key for
enabling low read latencies if you will mostly serve data that is frequently accessed. On
the other hand, if your reads mostly hit cold data, you will want a nice balance between
your storage speeds and memory. In fact, many distributed databases typically reserve
some memory space specifically for caching indexes; this way, reads that inevitably
require going to disk won’t waste I/O by scanning through irrelevant data.
What if the use case requires reading from both hot and cold data at the same time?
And what if you have different latency requirements for each set of data? Or what if you
want to mix a real-time workload on top of your analytics workload for the very same
dataset? Situations like this are quite common. There’s no one-size-fits-all answer, but
here are a few important tips:
• Some databases will allow you to read data without polluting your
cache (e.g., filling it up with data that is unlikely to be requested again).
Using such a mechanism is especially important when you’re running
large scans while simultaneously serving real-time data. If the large
scans were allowed to override the previously cached entries that the
real-time workload required, those reads would have to go through
disk and get repopulated into the cache again. This would effectively
waste precious processing time and result in elevated latencies (see the sketch below).
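ScyllaDB, for example, exposes this through the CQL BYPASS CACHE clause. A minimal sketch with the scylla Rust driver follows; the keyspace, table, and column names are invented, and the exact query method name varies across driver versions.

```rust
use scylla::Session;

// A sketch: a large analytical scan that asks ScyllaDB not to populate
// the cache with its results, leaving the hot, real-time entries intact.
// Keyspace/table names here are hypothetical.
async fn full_scan(session: &Session) -> Result<(), Box<dyn std::error::Error>> {
    session
        .query("SELECT id, payload FROM ks.events BYPASS CACHE", &[])
        .await?;
    Ok(())
}
```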
Note You might not need all your reads. At ScyllaDB, we’ve come across a
number of cases where teams are performing reads that they don’t really need. For
example, some use a read-before-write approach to avoid race conditions where
multiple clients are trying to update the same value at the same time. The details
of the solution aren’t relevant here, but it is important to
note that, by rethinking their approach, they were able to shave latencies off their
writes as well as speed up the overall response by eliminating the unnecessary
read. The moral here: Getting new eyes on your existing approaches might surface
a way to unlock unexpected performance optimizations.
Mixed Workloads
More evenly mixed access patterns are generally even more complex to analyze and
accommodate. Mixed workloads are complex largely because, from the database’s
perspective, they present two competing workloads.
Databases are essentially made for just two things: reading and writing. The way that
different databases handle a variety of competing workloads is what truly differentiates
one solution from another. As you test and compare databases, experiment with different
read/write ratios so you can adequately prepare yourself for scenarios when your access
patterns may change.
Be sure to consider nuances like whether your reads are from cold data (data not
often accessed) or hot data (data that’s accessed often and likely cached). Analytics use
cases tend to read cold data frequently because they need to process large amounts of
data. In this case, disk speeds are very important for overall performance. Plus, you’ll
want a comfortably large amount of memory so that the database’s cache can hold the
4 The “Competing Workloads” section later in this chapter, as well as the “Workload Isolation” section in Chapter 8, cover a few options for prioritizing and separating workloads.
data that you need to process. On the other hand, if you frequently access hot data, most
of your data will be served from the cache, in such a way that the disk speeds become
less important (although not negligible).
Tip Not sure if your reads are from cold or hot data? Take a look at the ratio
of cache misses in your monitoring dashboards. For more on monitoring, see
Chapter 10.
If your ratio of cache misses is higher than hits, this means that reads need to
frequently hit the disks in order to look up your data. This may happen because your
database is underprovisioned in memory space, or simply because the application
access patterns often read infrequently accessed data. It is important to understand the
performance implications here. If you’re frequently reading from cold data, there’s a risk
that I/O will become the bottleneck—for writes as well as reads. In that case, if you need
to improve performance, adding more nodes or switching your storage medium to a
faster solution could be helpful.
As noted earlier, write-optimized databases can improve read latency via internal
caching, so it’s not uncommon for a team with, say, 60 percent reads and 40 percent
writes to opt for a write-optimized database. Another option is to boost the latency
of reads with a write-optimized database: If your database supports it, dedicate extra
“shares” of resources to the reads so that your read workload is prioritized when there is
resource contention.
Delete-Heavy Workloads
What about delete-heavy workloads, such as using your database as a durable queue
(saving data from a producer until the consumer accesses it, deleting it, then starting the
cycle over and over again)? Here, you generally want to avoid databases that store data
in immutable files and use tombstones to mark rows and columns that are slated for
deletion. The most notable examples are Cassandra and other Cassandra-compatible
databases.
Tombstones consume cache space and disk resources, and the database needs to
search through all these tombstones to reach the live data. For many workloads, this
is not a problem. But for delete-heavy workloads, generating an excessive amount of
tombstones will, over time, significantly degrade your read latencies. There are ways and
mechanisms to mitigate the impact of tombstones.5 However, in general, if you have a
delete-heavy workload, it may be best to use a different database.
It is important to note that occasional deletes are generally fine on Cassandra and
Cassandra-compatible databases. Just be aware of the fact that deletes on append-only
databases result in tombstone writes. As a result, these may incur read amplification,
elevating your read latencies. Tombstones and data eviction in these types of databases
are potentially long and complex subjects that perhaps could have their own dedicated
chapter. However, the high-level recommendation is to exercise caution if you have a
potentially delete-heavy pattern that you might later read from, and be sure to combine
it with a compaction strategy tailored for efficient data eviction.
All that being said, it is interesting to note that some teams have successfully
implemented delete-heavy workloads on top of Cassandra and Cassandra-like
databases. The performance overhead carried by tombstones is generally circumvented
by a combination of data modeling, a careful study of how deletes are performed,
avoiding reads that potentially scan through a large set of deleted data, and careful
tuning over the underlying table’s compaction strategy to ensure that tombstones
get evicted in a timely manner. For example, Tencent Games used the Time Window
Compaction Strategy to aggressively expire tombstones and use it as the foundation for a
time series distributed queue.6
5 For some specific recommendations, see the DataStax blog, “Cassandra Anti-Patterns: Queues and Queue-like Datasets” (www.datastax.com/blog/cassandra-anti-patterns-queues-and-queue-datasets).
6 See the article, “Tencent Games’ Real-Time Event-Driven Analytics System Built with ScyllaDB + Pulsar” (https://www.scylladb.com/2023/05/15/tencent-games-real-time-event-driven-analytics-system-built-with-scylladb-pulsar/).
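As a concrete illustration of that combination, here is roughly what a queue-like, TTL-expired table with Time Window Compaction Strategy might look like. The schema is invented for illustration; the CQL is embedded via the scylla Rust driver, whose query method name varies by version.

```rust
use scylla::Session;

// A sketch: rows expire via a table-level TTL, and TWCS groups data into
// one-hour windows so whole windows (including their tombstones and
// expired rows) can be dropped efficiently.
async fn create_queue_table(session: &Session) -> Result<(), Box<dyn std::error::Error>> {
    session
        .query(
            "CREATE TABLE IF NOT EXISTS ks.queue (
                 shard int,
                 ts timestamp,
                 payload blob,
                 PRIMARY KEY (shard, ts)
             ) WITH default_time_to_live = 86400
               AND compaction = {
                 'class': 'TimeWindowCompactionStrategy',
                 'compaction_window_unit': 'HOURS',
                 'compaction_window_size': 1
               }",
            &[],
        )
        .await?;
    Ok(())
}
```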
OLAP (analytical) workloads, which can be run in batch mode and are more focused
on throughput (see Figure 2-1). Or, you can prioritize analytics. Both are technically
feasible; it just boils down to what’s most important for your use case.
For example, assume you have a web server database with analytics. It must support
two workloads:
Running on the same cluster, such workloads would be competing for resources.
As system utilization rises, the database must strictly prioritize which activities get
what specific share of resources under contention. There are a few different ways you
can handle this. Physical isolation, logical isolation, and scheduled isolation can all be
acceptable choices under the right circumstances. Chapter 8 covers these options.
Item Size
The size of the items you are storing in the database (average payload size) will dictate
whether your workload is CPU bound or storage bound. For example, running 100K
OPS with an average payload size of 90KB is much different than achieving the same
throughput with a 1KB payload. Larger payloads require more processing, I/O, and
network traffic than smaller ones.
Without getting too deep into database internals here, one notable impact is on the
page cache. Assuming a default page cache size of 4KB, the database would have to serve
several pages for the largest payload—that’s much more I/O to issue, process, merge,
and serve back to the application clients. With the 1KB example, you could serve it from
a single-page cache entry, which is less taxing from a compute resource perspective.
Conversely, having a large number of smaller-sized items may introduce CPU overhead
compared to having a smaller number of larger items because the database must process
each arriving item individually.
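A back-of-the-envelope sketch of that page math, using the numbers from the example above:

```rust
// With 4KB pages, a 1KB item fits in a single page, while a 90KB item
// spans ~23 pages, each one more I/O to issue, merge, and serve.
fn pages_needed(payload_bytes: u64, page_bytes: u64) -> u64 {
    (payload_bytes + page_bytes - 1) / page_bytes // ceiling division
}

fn main() {
    assert_eq!(pages_needed(1 * 1024, 4096), 1);
    assert_eq!(pages_needed(90 * 1024, 4096), 23);
}
```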
In general, the larger the payload gets, the more cache activity you will have. Most
write-optimized databases will store your writes in memory before persisting that
information to the disk (in fact, that’s one of the reasons why they are write-optimized).
Larger payloads deplete the available cache space more frequently, and this incurs a
higher flushing activity to persist the information on disk in order to release space for
more incoming writes. Therefore, more disk I/O is needed to persist that information.
If you don’t size this properly, it can become a bottleneck throughout this repetitive
process.
When you’re working with extremely large payloads, it’s important to set realistic
latency and throughput expectations. If you need to serve 200KB payloads, it’s unlikely
that any database will enable you to achieve single-digit millisecond latencies. Even if
the entire dataset is served from cache, there’s a physical barrier between your client
and the database: networking. The network between them will eventually throttle
your transfer speeds, even with an insanely fast client and database. Eventually, this
will impact throughput as well as latency. As your latency increases, your client will
eventually throttle down and you won’t be able to achieve the same throughput that
you could with smaller payload sizes. The requests would be stalled, queuing in the
network.7
Generally speaking, databases should not be used to store large blobs. We’ve seen
people trying to store gigabytes of data within a single key in a database—and this
isn’t a great idea. If your item size is reaching this scale, consider alternative solutions.
One solution is to use CDNs. Another is to store the largest chunk of your payload
in cold storage like Amazon S3 buckets, Google Cloud Storage, or Azure Blob Storage.
Then, use the database as a metadata lookup: A read fetches an identifier
that will help find the data in that cold storage. For example, this is the strategy used by
a game developer converting extremely large (often in the gigabyte range) content to
popular gaming platforms. They store structured objects with blobs that are referenced
by a content hash. The largest payload is stored within a cloud vendor Object Storage
solution, whereas the content hash is stored in a distributed NoSQL database.8
Note that some databases impose hard limits on item size. For example, DynamoDB
currently has a maximum item size of 400KB. This might not suit your needs. On top of
that, if you’re using an in-memory solution such as Redis, larger keys will quickly deplete
your memory. In this case, it might make sense to hash/compress such large objects
prior to storing them.
No matter which database you choose, the smaller your payload, the greater your
chances of introducing memory fragmentation. This might reduce your memory
efficiency, which might in turn elevate costs because the database won’t be able to fully
utilize its available memory.
Item Type
The item type has a large impact on compression, which in turn impacts your
storage utilization. If you’re frequently storing text, expect to take advantage of a high
compression ratio. But, that’s not the case for random and uncommon blob sequences.
7 There are alternatives to this; for example, RDMA, DPDK, and other solutions. However, most use cases do not require such solutions, so they are not covered in detail here.
8 For details, see the Epic Games talk, “Using ScyllaDB for Distribution of Game Assets in Unreal Engine” (www.youtube.com/watch?v=aEgP9YhAb08).
you a nice performance boost.9 Just like collections, however, UDTs should not be
misused—and misusing UDTs can lead to the same severe impacts that are incurred by
collections.
Dataset Size
Knowing your dataset size is important for selecting appropriate infrastructure options.
For example, AWS cloud instances have a broad array of NVMe storage offerings. Having
a good grasp of how much storage you need can help you avoid selecting an instance
that causes performance to suffer (if you end up with insufficient storage) or that’s
wasteful from a cost perspective (if you overprovision).
It’s important to note that your selected storage size should not be equal to your total
dataset size. You also need to factor in replication and growth—plus steer clear of 100
percent storage utilization.
For example, let’s assume you have 3TB of already compressed data. The bare
minimum to support a workload is your current dataset size multiplied by your
anticipated replication. If you have 3TB of data with the common replication factor of
three, that gives you 9TB. If you naively deployed this on three nodes supporting 3TB of
data each, you’d hit near 100 percent disk utilization which, of course, is not optimal.
Instead, if you factor in some free space and minimal room for growth, you’d want
to start with at least six nodes of that size—each storing only 1.5TB of data. This gives
you around 50 percent utilization. On the other hand, if your database cannot support
that much data per node (every database has a limit) or if you do not foresee much
future data growth, you could have six nodes supporting 2TB each, which would store
approximately 1.5TB per node at 75 percent utilization. Remember: Factoring
in your growth is critical for avoiding unpleasant surprises in production, from an
operational as well as a budget perspective.
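The arithmetic above generalizes into a quick sizing estimate. A rough planning sketch, not a substitute for real capacity planning:

```rust
// Estimate node count from compressed dataset size, replication factor,
// per-node storage capacity, and a target utilization ceiling.
fn nodes_needed(dataset_tb: f64, replication_factor: f64,
                node_storage_tb: f64, max_utilization: f64) -> u32 {
    let replicated_tb = dataset_tb * replication_factor;
    (replicated_tb / (node_storage_tb * max_utilization)).ceil() as u32
}

fn main() {
    // 3TB × RF 3 = 9TB of replicated data; 3TB nodes kept at most
    // 50 percent full means ceil(9 / 1.5) = 6 nodes.
    assert_eq!(nodes_needed(3.0, 3.0, 3.0, 0.5), 6);
}
```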
9 For some specific examples of how UDTs impact performance, see the performance benchmark that ScyllaDB performed with different UDT sizes against individual columns: “If You Care About Performance, Employ User Defined Types” (https://www.scylladb.com/2017/12/07/performance-udt/).
Note We very intentionally discussed the dataset size from a compressed data
standpoint. Be aware that some database vendors measure your storage utilization
with respect to uncompressed data. This often leads to confusion. If you’re moving
data from one database solution to another and your data is uncompressed (or
you’re not certain it’s compressed), consider loading a small fraction of your
total dataset beforehand in order to determine its compression ratio. Effective
compression can dramatically reduce your storage footprint.
If you’re working on a very fluid project and can’t define or predict your dataset
size, a serverless database deployment model might be a good option to provide easy
flexibility and scaling. But, be aware that rapid increases in overall dataset size and/or
IOPS (depending on the pricing model) could cause the price to skyrocket exponentially.
Even if you don’t explicitly pay a penalty for storing a large dataset, you might be charged
a premium for the many operations that are likely associated with that large dataset.
Serverless is discussed more in Chapter 7.
Throughput Expectations
Your expected throughput and latency should be your “north star” from database and
infrastructure selection all the way to monitoring. Let’s start with throughput.
If you’re serious about database performance, it’s essential to know what throughput
you’re trying to achieve—and “high throughput” is not an acceptable answer.
Specifically, try to get all relevant stakeholders’ agreement on your target number of peak
read operations per second and peak write operations per second for each workload.
Let’s unravel that a little. First, be sure to separate read throughput vs write
throughput. A database’s read path is usually quite distinct from its write path. It stresses
different parts of the infrastructure and taps different database internals. And the client/
user experience of reads is often quite different than that of writes. Lumping them
together into a meaningless number won’t help you much with respect to performance
measurement or optimization. The main use for average throughput is in applying
Little’s Law (more on that in the “Concurrency” section a little later in this chapter).
Another caveat: The same database’s past or current throughput with one use case
is no guarantee of future results with another—even if it’s the same database hosted on
identical infrastructure. There are too many different factors at play (item size, access
patterns, concurrency… all the things in this chapter, really). What’s a great fit for one use
case could be quite inappropriate for another.
Also, note the emphasis on peak operations per second. If you build and optimize
with an average in mind, you likely won’t be able to service beyond the upper ranges of
that average. Focus on the peak throughput that you need to sustain to cover your core
needs and business patterns—including surges. Realize that databases can often “boost”
to sustain short bursts of exceptionally high load. However, to be safe, it’s best to plan for
your likely peaks and reserve boosting for atypical situations.
Also, be sure not to confuse concurrency with throughput. Throughput is the speed
at which the database can perform read or write operations; it’s measured in the number
of read or write operations per second. Concurrency is the number of requests that the
client sends to the database at the same time (which, in turn, will eventually translate
to a given number of concurrent requests queuing at the database for execution).
Concurrency is expressed as a hard number, not a rate over a period of time. Not every
request sent at the same time can be processed by the database at the same time.
Your client could send 150K requests to the database, all at once. The
database might blaze through all these concurrent requests if it’s running at 500K
OPS. Or, it might take a while to process them if the database throughput tops out at
50K OPS.
It is generally possible to increase throughput by increasing your cluster size (and/
or power). But, you also want to pay special attention to concurrency, which will be
discussed in more depth later in this chapter as well as in Chapter 5. For the most part,
high concurrency is essential for achieving impressive performance. But if the clients
end up overwhelming the database with a concurrency that it can’t handle, throughput
will suffer, then latency will rise as a side effect. A friendly reminder that transcends the
database world: No system, distributed or not, supports unlimited concurrency. Period.
Latency Expectations
Latency is a more complex challenge than throughput: You can increase throughput
by adding more nodes, but there’s no simple solution for reducing latency. The lower
the latency you need to achieve, the more important it becomes to understand and
explore database tradeoffs and internal database optimizations that can help you shave
milliseconds or microseconds off latencies. Database internals, driver optimizations,
efficient CPU utilization, sufficient RAM, efficient data modeling… everything matters.
As with throughput, aim for all relevant stakeholders’ agreement on the acceptable
latencies. This is usually expressed as latency for a certain percentile of requests. For
performance-sensitive workloads, tracking at the 99th percentile (P99) is common. Some
teams go even higher, such as the P9999, which refers to the 99.99th percentile.
As with throughput, avoid focusing on average (mean) or median (P50) latency
measurements. Average latency is a theoretical measurement that is not directly
correlated to anything systems or users experience in reality. Averages conceal outliers:
Extreme deviations from the norm that may have a large and unexpected impact on
overall system performance, and hence on user experience.
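To make percentile reporting concrete, here is a minimal sketch that computes percentiles from a batch of collected latency samples using the nearest-rank method; production systems typically use a streaming histogram (e.g., HdrHistogram) rather than sorting raw samples.

// Compute latency percentiles from collected samples.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

double percentile(std::vector<double>& samples_ms, double p) {
    // Nearest-rank method: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    std::sort(samples_ms.begin(), samples_ms.end());
    size_t rank = static_cast<size_t>(
        std::ceil(p / 100.0 * samples_ms.size()));
    return samples_ms[rank == 0 ? 0 : rank - 1];
}

int main() {
    std::vector<double> latencies_ms = {1.2, 0.9, 1.1, 35.0, 1.3, 1.0, 1.2, 0.8};
    // The 35ms outlier barely moves the median, but it dominates P99.
    std::printf("P50 = %.1fms  P99 = %.1fms\n",
                percentile(latencies_ms, 50), percentile(latencies_ms, 99));
}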
For example, look at the discrepancy between average latencies and P99 latencies in
Figure 2-2 (different colors represent different database nodes). P99 latencies were often
double the average for reads, and even worse for writes.
10. A hot partition is a data access imbalance problem that causes specific partitions to receive more traffic compared to others, thus introducing higher load on a specific set of replica servers.
Note that monitoring systems are sometimes configured in ways that omit outliers.
For example, if a monitoring system is calibrated to measure latency on a scale of 0
to 1000ms, it is going to overlook any larger measurements—thus failing to detect the
serious issues of query timeouts and retries.
P99 and above percentiles are not perfect.11 But for latency-sensitive use cases,
they’re the number you’ll want to keep in mind as you are selecting your infrastructure,
benchmarking, monitoring, and so on.
Also, be clear about what exactly is involved in the P99 you are looking to achieve.
Database latency is the time that elapses between when the database receives a request,
processes it, and sends back an appropriate response. Client-side latency is broader:
Here, the measurement starts with the client sending the request and ends with the
client receiving the database’s response. It includes the network time and client-side
processing. There can be quite a discrepancy between database latency and client-side latency; a ten times higher client-side latency isn't all that uncommon (although clearly not desirable). There could be many culprits to blame for a significantly higher client-side latency than database latency: excessive concurrency, inefficient application architecture, coding issues, and so on. But that's beyond the scope of this discussion—beyond the scope of this book, even.
11. For a detailed critique, see Gil Tene's famous "Oh Sh*t" talk (www.youtube.com/watch?v=lJ8ydIuPFeU) as well as his recent P99 CONF talk on Misery Metrics and Consequences (https://www.p99conf.io/session/misery-metrics-consequences/).
The key point here is that your team and all the stakeholders need to be on the same
page regarding what you’re measuring. For example, say you’re given a read latency
requirement of 15ms. You work hard to get your database to achieve that and report that
you met the expectation—then you learn that stakeholders actually expect 15ms for the
full client-side latency. Back to the drawing board.
Ultimately, it’s important to track both database latency and client-side latency.
You can optimize the database all you want, but if the application is introducing latency
issues from the client side, a fast database won’t have much impact. Without visibility
into both the database and the client-side latencies, you’re essentially flying half blind.
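As a rough sketch of the client side of that picture, you can wrap driver calls with a monotonic clock, as shown below; execute_query() is a hypothetical stand-in for whatever your driver exposes, and the database-side number still has to come from server metrics.

// Measure client-side latency around a database call.
#include <chrono>
#include <cstdio>

void execute_query() { /* placeholder: replace with a real driver call */ }

int main() {
    auto start = std::chrono::steady_clock::now();
    execute_query();
    auto end = std::chrono::steady_clock::now();

    // This includes network time, driver overhead, and client-side
    // queueing, not just the database's processing time.
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::printf("client-side latency: %lld us\n",
                static_cast<long long>(us.count()));
}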
Concurrency
What level of concurrency should your database be prepared to handle? Depending
on the desired qualities of service from the database cluster, concurrency must be
judiciously balanced to reach appropriate throughput and latency values. Otherwise,
requests will pile up waiting to be processed—causing latencies to spike, timeouts to
rise, and the overall user experience to degrade.
Little’s Law establishes that:
L=λW
where λ is the average throughput, W is the average latency, and L represents the total
number of requests either being processed or on queue at any given moment when the
cluster reaches steady state. Given that your throughput and latency targets are usually
fixed, you can use Little’s Law to estimate a realistic concurrency.
For example, if you want a system to serve 500,000 requests per second at 2.5ms average latency, the ideal concurrency is around 1,250 in-flight requests. As you approach the saturation limit of the system—around 600,000 requests per second for read requests—adding concurrency no longer increases throughput, since this is the physical limit of the database. Every new in-flight request only adds latency.
In fact, if you take 600,000 requests per second as the physical capacity of this database, you can calculate the expected average latency at a particular concurrency point. For example, at 6,120 in-flight requests, the average latency is expected to be 6,120 / 600,000 ≈ 10ms.
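The same arithmetic is easy to script; a minimal sketch using the numbers from the text:

// Little's Law: L = lambda * W. Given any two of concurrency (L),
// throughput (lambda), and mean latency (W), you can estimate the third.
#include <cstdio>

int main() {
    double throughput_ops = 500000.0;  // lambda: requests per second
    double latency_s      = 0.0025;    // W: 2.5ms average latency
    std::printf("ideal concurrency: %.0f in-flight requests\n",
                throughput_ops * latency_s);         // L = 1,250

    // Past saturation, throughput is pinned at the system's capacity,
    // so added concurrency simply shows up as added latency.
    double capacity_ops = 600000.0;
    double in_flight    = 6120.0;
    std::printf("expected average latency: %.1fms\n",
                in_flight / capacity_ops * 1000.0);  // ~10ms
}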
Past the maximum throughput, increasing concurrency will increase latency.
Conversely, reducing concurrency will reduce latency, provided that this reduction does
not result in a decrease in throughput.
In some use cases, it’s fine for queries to pile up on the client side. But many times
it’s not. In those cases, you can scale out your cluster or increase the concurrency on
the application side—at least to the point where the latency doesn’t suffer. It’s a delicate
balancing act.12
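One common way to strike that balance is to cap in-flight requests on the client side so that excess work queues in the application rather than overwhelming the database. Here is a minimal sketch using C++20's std::counting_semaphore; the limit of 1,250 reuses the Little's Law estimate above, and real drivers often expose an equivalent knob.

// Cap client-side concurrency at a fixed number of in-flight requests.
#include <semaphore>
#include <thread>
#include <vector>

std::counting_semaphore<4096> permits(1250);  // max in-flight requests

void send_request_to_db() { /* placeholder for a real driver call */ }

void issue_request() {
    permits.acquire();  // blocks once 1,250 requests are in flight
    send_request_to_db();
    permits.release();  // with async drivers, release in the completion callback
}

int main() {
    std::vector<std::jthread> workers;
    for (int i = 0; i < 64; ++i)
        workers.emplace_back([] {
            for (int j = 0; j < 1000; ++j) issue_request();
        });
}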
Connected Technologies
A database can’t rise above the slowest-performing link in your distributed data system.
Even if your database is processing reads and writes at blazing speeds, it won’t ultimately
matter much if it interacts with an event-streaming platform that’s not optimized for
performance or involves transformations from a poorly-configured Apache Spark
instance, for example.
This is just one of many reasons that taking a comprehensive and proactive approach
to monitoring (more on this in Chapter 10) is so important. Given the complexity of
databases and distributed data systems, it’s hard to guess what component is to blame
for a problem. Without a window into the state of the broader system, you could naively
waste amazing amounts of time and resources trying to optimize something that won’t
make any difference.
If you’re looking to optimize an existing data system, don’t overlook the performance
gains you can achieve by reviewing and tuning its connected components. Or, if your
monitoring efforts indicate that a certain component is to blame for your client-side
performance problems but you feel you’ve hit your limit with it, explore what’s required
to replace it with a more performant alternative. Use benchmarking to determine the
severity of the impact from a performance perspective.
12. For additional reading on concurrency, the Netflix blog "Performance Under Load" is a great resource (https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581).
Also, note that some database offerings may have ecosystem limitations. For
example, if you’re considering a serverless deployment model, be aware that some
Change Data Capture (CDC) connectors, drivers, and so on, might not be supported.
Demand Fluctuations
Databases might experience a variety of different demand fluctuations, ranging from
predictable moderate fluctuations to unpredictable and dramatic spikes. For instance,
the world’s most watched sporting event experiences different fluctuations than a food
delivery service, which experiences different fluctuations than an ambulance-tracking
service—and all require different strategies and infrastructure.
First, let’s look at the predictable fluctuations. With predictability, it’s much easier to
get ahead of the issue. If you’re expected to support periodic big events that are known
in advance (Black Friday, sporting championships, ticket on sales, etc.), you should have
adequate time to scale up your cluster for each anticipated spike. That means you can
tailor your normal topology for the typical day-in, day-out demands without having to
constantly incur the costs and admin burden of having that larger scale topology.
On the other side of the spikiness spectrum, there are applications whose traffic has dramatic peaks and valleys over the course of each day. For example, consider food delivery businesses, which face a sudden increase around lunch, followed by a few hours of minimal traffic, then a second spike at dinner time (and sometimes breakfast the following morning). Expanding the cluster for each spike—even with "autoscaling" (more on autoscaling later in this chapter)—is unlikely to deliver the necessary performance gain fast enough. In these cases, you should provision an infrastructure that supports the peak traffic.
But not all spikes are predictable. Certain industries—such as emergency services,
news, and social media—are susceptible to sudden massive spikes. In this case, a good
preventative strategy is to control your concurrency on the client side, so it doesn’t
overwhelm your database. However, controlling concurrency might not be an option for
use cases with strict end-to-end latency requirements. You can also scramble to scale
out your clusters as fast as feasible when the spike occurs. This is going to be markedly
simpler if you’re on the cloud than if you’re on-prem. If you can start adding nodes
immediately, increase capacity incrementally—with a close eye on your monitoring
results—and keep going until you’re satisfied with the results, or until the peak has
subsided. Unfortunately, there is a real risk that you won’t be able to sufficiently scale out
before the spike ends. Even if the ramp-up begins immediately, you need to account for the time it takes to add new nodes, stream data to them, and rebalance the cluster.
If you’re selecting a new database and anticipate frequent and sharp spikes, be sure
to rigorously test how your top contenders respond under realistic conditions. Also,
consider the costs of maintaining acceptable performance throughout these peaks.
Note The word "autoscaling" insinuates that your database cluster automagically expands based on the traffic it is receiving. Not so. It's simply a robot enabling/disabling capacity that's pre-provisioned for you based on your target table settings. Even if you're not using this capacity, you might be paying for the convenience of having it set aside and ready to go. Also, it's important to realize that it's not instantaneous. It can take upwards of 2.5 hours to go from 0 to 40K requests per second.13 This is not ideal for unexpected or extreme spikes.
Autoscaling is best when:
• Load changes have high amplitude
• The rate of change is on the order of hours
• The load peak is narrow relative to the baseline14
ACID Transactions
Does your use case require you to process a logical unit of work with ACID (atomic,
consistent, isolated, and durable) properties? These transactions, which are historically
the domain of RDBMS, bring a severe performance hit.
13. See The Burning Monk blog, "Understanding the Scaling Behaviour of DynamoDB OnDemand Tables" (https://theburningmonk.com/2019/03/understanding-the-scaling-behaviour-of-dynamodb-ondemand-tables/).
14. For more on the best and worst uses of autoscaling, see Avishai Ish Shalom's blog, "DynamoDB Autoscaling Dissected: When a Calculator Beats a Robot" (www.scylladb.com/2021/07/08/dynamodb-autoscaling-dissected-when-a-calculator-beats-a-robot/).
It is true that distributed ACID-compliant databases do exist—and that the past few years have brought some distinct progress in the effort to minimize the performance impact (e.g., through row-level or column-level locking and better conflict-resolution algorithms). However, some level of penalty will still exist.
As a general guidance, if you have an ACID-compliant use case, pay special attention
to your master nodes; these can easily become your bottlenecks since they will often
be your primary query coordinators (more on this in Appendix A). In addition, if at
all possible, try to ensure that the majority of your transactions are isolated to the
minimum amount of resources. For example, a transaction spanning a single row may
involve a specific set of replicas, whereas a transaction involving several keys may span
your cluster as a whole—inevitably increasing your latency. It is therefore important to
understand which types of transactions your target database supports. Some vendors
may support a mix of approaches, while others excel at specific ones. For instance,
MongoDB introduced multi-document transactions on sharded clusters in its version
4.2; prior to that, it supported only multi-document transactions on replica sets.
If it’s critical to support transactions in a more performant manner, sometimes it’s
possible to rethink your data model and reimplement a use case in a way that makes it
suitable for a database that's not ACID compliant. For example, one team that started
out with Postgres for all their use cases faced skyrocketing business growth. This is a very
common situation with startups that begin small and then suddenly find themselves in a
spot where they are unable to handle a spike in growth in a cost-effective way. They were
able to move their use cases to NoSQL by conducting a careful data-modeling analysis
and rethinking their use cases, access patterns, and the real business need of what truly
required ACID and what did not. This certainly isn’t a quick fix, but in the right situation,
it can pay off nicely.
Another option to consider: Performance-focused NoSQL databases like Cassandra
aim to support isolated conditional updates with capabilities such as lightweight
transactions that allow “atomic compare and set” operations. That is, the database
checks if a condition is true, and if so, it conducts the transaction. If the condition is not
met, the transaction is not completed. They are named “lightweight” since they do not
truly lock the database for the transaction. Instead, they use a consensus protocol to
ensure there is agreement between the nodes to commit the change. This capability was
introduced by Cassandra and it’s supported in several ways across different Cassandra-
compatible databases. If this is something you expect to use, it’s worth exploring the
documentation to understand the differences.15
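To illustrate the "atomic compare and set" flavor of these operations, here is a minimal sketch assuming the DataStax C/C++ driver (cassandra.h) and a hypothetical users table; the IF NOT EXISTS clause is what turns a plain insert into a lightweight transaction. Session setup and error handling are omitted for brevity.

// Conditional (compare-and-set) insert using a lightweight transaction.
#include <cassandra.h>

bool try_claim_username(CassSession* session, const char* name) {
    // The row is written only if no row with this primary key exists yet.
    CassStatement* stmt = cass_statement_new(
        "INSERT INTO users (username) VALUES (?) IF NOT EXISTS", 1);
    cass_statement_bind_string(stmt, 0, name);

    CassFuture* future = cass_session_execute(session, stmt);
    const CassResult* result = cass_future_get_result(future);

    cass_bool_t applied = cass_false;
    if (result) {
        // LWT results carry an "[applied]" column reporting the outcome.
        const CassRow* row = cass_result_first_row(result);
        cass_value_get_bool(cass_row_get_column_by_name(row, "[applied]"),
                            &applied);
        cass_result_free(result);
    }
    cass_future_free(future);
    cass_statement_free(stmt);
    return applied == cass_true;
}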
However, it’s important to note that lightweight transactions have their limits. They
can’t support complex use cases like a retail transaction that updates the inventory
only after a sale is completed with a successful payment. And just like ACID-compliant
databases, lightweight transactions have their own performance implications. As a
result, the choice of whether to use them will greatly depend on the amount of ACID
compliance that your use case requires.
DynamoDB is a prime example of how the need for transactions will require more
compute resources (read: money). As a result, use cases relying heavily on ACID will
fairly often require much more infrastructure power to satisfy heavy usage requirements.
In the DynamoDB documentation, AWS recommends that you ensure the database is
configured for auto-scaling or that it has enough read/write capacity to account for the
additional overhead of transactions.16
Consistency Expectations
Most NoSQL databases opt for eventual consistency to gain performance. This is in
stark contrast to the RDBMS model, where ACID compliance is achieved in the form
of transactions, and, because everything is in a single node, the effort on locking and
avoiding concurrency clashes is often minimized. When deciding between a database
with strong or eventual consistency, you have to make a hard choice. Do you want to
sacrifice scalability and performance or can you accept the risk of sometimes serving
stale data?
Can your use case tolerate eventual consistency, or is strong consistency truly
required? Your choice really boils down to how much risk your application—and your
business—can tolerate with respect to inconsistency. For example, a retailer who
(understandably) requires consistent pricing might want to pay the price for consistent writes upfront during a weekly catalog update so that they can later serve millions of low-latency read requests under more relaxed consistency levels. In other cases, it's more important to ingest data quickly and pay the price for consistency later (for example, in the playback tracking use case that's common in streaming platforms—where the database needs to record the last viewing position for many users concurrently). Or maybe both are equally important. For example, consider a social media platform that offers live chat. Here, you want consistency on both writes and reads, but you likely don't need the highest consistency (the impact of an inconsistency here is likely much less than with a financial report).
15. See Kostja Osipov's blog, "Getting the Most Out of Lightweight Transactions in ScyllaDB" (www.scylladb.com/2020/07/15/getting-the-most-out-of-lightweight-transactions-in-scylla/) for an example of how financial transactions can be implemented using lightweight transactions.
16. See "Amazon DynamoDB Transactions: How it Works" (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/transaction-apis.html).
In some cases, “tunable consistency” will help you achieve a balance between strong
consistency and performance. This gives you the ability to tune the consistency at the
query level to suit what you’re trying to achieve. You can have some queries relying on a
quorum of replicas, then have other queries that are much more relaxed.
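With drivers for Cassandra-compatible databases, for example, the consistency level can typically be set per statement. A minimal sketch assuming the DataStax C/C++ driver:

// A strongly consistent write paired with cheaper reads.
#include <cassandra.h>

void tune(CassStatement* catalog_update, CassStatement* product_read) {
    // Pay for consistency up front on the weekly catalog update...
    cass_statement_set_consistency(catalog_update, CASS_CONSISTENCY_QUORUM);
    // ...then serve the high-volume read path with a relaxed level.
    cass_statement_set_consistency(product_read, CASS_CONSISTENCY_ONE);
}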
Regardless of your consistency requirements, you need to be aware of the
implications involved when selecting a given consistency level. Databases that offer
tunable consistency may be a blessing or a curse if you don’t know what you are doing.
Consider a NoSQL deployment spanning three different regions, with three nodes
each (nine nodes in total). A QUORUM read would essentially have to traverse two
different regions in order to be acknowledged back to the client. In that sense, if your
Network Round Trip Time (RTT)17 is 50ms, then it will take at least this amount of time
for the query to be considered successful by the database. Similarly, if you were to run
operations with the highest possible consistency (involving all replicas), then the failure
of a single node may bring your entire application down.
Note NoSQL databases fairly often provide you with ways to confine your queries to a specific region to prevent costly network round trips from impacting your latency. But again, it all boils down to what your use case requires.
17. RTT is the duration, typically measured in milliseconds, that a network request takes to reach a destination, plus the time it takes for the packet to be received back at the origin.
Geographic Distribution
Does your business need to support a regional or global customer base in the near-term
future? Where are your users and your application located? The greater the distance
between your users, your application, and your database, the more they’re going to face
high latencies that stem from the physical time it takes to move data across the network.
Knowing this will influence where you locate your database and how you design your
topology—more on this in Chapters 6 and 8.
The geographic distribution of your cluster might also be a requirement from a
disaster recovery perspective. In that sense, the cluster would typically serve data
primarily from a specific region, but failover to another in the event of a disaster (such as
a full region outage). These kinds of setups are costly, as they require doubling your infrastructure spend. However, depending on the nature of your use case, that cost is sometimes unavoidable.
Some organizations that invest in a multi-region deployment for the primary
purpose of disaster recovery end up using them to host isolated use cases. As explained
in the “Competing Workloads” section of this chapter, companies often prefer to
physically isolate OLTP from OLAP workloads. Moving some isolated (less critical)
workloads to remote regions prevents these servers from being “idle” most of the time.
Regardless of how compelling the reasons driving you toward a geographically dispersed deployment are, here's some important high-level advice from a performance perspective (you'll learn some more technical tips in Chapter 8):
1. Consider the increased load that your target region or regions will receive in the event of a full region outage. For example, assume that you operate globally across three regions, and all three regions serve your end-users. Are the two remaining regions able
to sustain the load for a long period of time?
2. Recognize that simply having a geographically-dispersed database
does not fully cover you in a disaster recovery situation. You also
need to have your application, web servers, messaging queue
systems, and so on, geographically replicated. If the only thing
that’s geo-replicated is your database, you won’t be in a great
position when your primary application goes down.
High-Availability Expectations
Inevitably, s#*& happens. To prepare for the worst, start by understanding what your use
case and business can tolerate if a node goes down. Can you accept the data loss that
could occur if a node storing unreplicated data goes down? Do you need to continue
buzzing along without a noticeable performance impact even if an entire datacenter or
availability zone goes down? Or is it okay if things slow down a bit from time to time?
This will all impact how you architect your topology and configure things like replication
factor and consistency levels (you’ll learn about this more in Chapter 8).
It’s important to note that replication and consistency both come at a cost to
performance. Get a good feel for your business’s risk tolerance and don’t opt for more
than your business really needs.
When considering your cluster topology, remember that quite a lot is at risk if you
get it wrong (and you don’t want to be caught off-guard in the middle of the night).
For example, the failure of a single node in a three-node cluster could make you
momentarily lose 33 percent of your processing power. Quite often, that’s a significant
blow, with discernable business impact. Similarly, the loss of a node in a six-node
cluster would reduce the blast radius to only 16 percent. But there’s always a tradeoff.
A sprawling deployment spanning hundreds of nodes is not ideal either. The more nodes
you have, the more likely you are to experience a node failure. Balance is key.
Summary
The specific database challenges you encounter, as well as your options for addressing
them, are highly dependent on your situation. For example, an AdTech use case that
demands single-digit millisecond P99 latencies for a large dataset with small item
sizes requires a different treatment than a fraud detection use case that prioritizes the
ingestion of massive amounts of data as rapidly as possible. One of the primary factors
influencing how these workloads are handled is how your database is architected. That’s
the focus for the next two chapters, which dive into database internals.
Open Access This chapter is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.
org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
CHAPTER 3
Database Internals:
Hardware and Operating
System Interactions
A database’s internal architecture makes a tremendous impact on the latency it can
achieve and the throughput it can handle. Being an extremely complex piece of software,
a database doesn’t exist in a vacuum, but rather interacts with the environment, which
includes the operating system and the hardware.
While it’s one thing to get massive terabyte-to-petabyte scale systems up and
running, it’s a whole other thing to make sure they are operating at peak efficiency. In
fact, it’s usually more than just “one other thing.” Performance optimization of large
distributed systems is usually a multivariate problem—combining aspects of the
underlying hardware, networking, tuning operating systems, and finagling with layers of
virtualization and application architectures.
Such a complex problem warrants exploration from multiple perspectives. This
chapter begins the discussion of database internals by looking at ways that databases
can optimize performance by taking advantage of modern hardware and operating
systems. It covers how the database interacts with the operating system plus CPUs,
memory, storage, and networking. Then, the next chapter shifts focus to algorithmic
optimizations.1
1. This chapter draws from material originally published on the Seastar site (https://seastar.io/) and the ScyllaDB blog (https://www.scylladb.com/blog/). It is used here with permission of ScyllaDB.
CPU
Programming books tell programmers that they have a CPU that can run processes or threads, and that "running" means some simple sequential instruction execution. Then there's a footnote explaining that with multiple threads you might need to consider doing some synchronization. In fact, the way things are actually executed inside CPU cores is completely different and much more complicated. It would be very difficult to program these machines without those abstractions from books, but they are a lie to some degree. Knowing how to efficiently take advantage of CPU capabilities is still very important.
The cores are all connected by what is essentially a network—a dual-ring interconnect architecture (there are two such rings, and they are bidirectional). Why, then, should developers use a synchronous API on top of what is fundamentally message passing? Since sharing information across cores requires costly locking, a shared-nothing model is well worth considering. In such a model, all requests are sharded onto individual cores, one application thread is run per core, and communication depends on explicit message passing, not shared memory between threads. This design avoids slow, unscalable lock primitives and cache bounces.
Any sharing of resources across cores in modern processors must be handled
explicitly. For example, when two requests are part of the same session and two CPUs
each get a request that depends on the same session state, one CPU must explicitly
forward the request to the other. Either CPU may handle either response. Ideally,
your database provides facilities that limit the need for cross-core communication—
but when communication is inevitable, it provides high-performance non-blocking
communication primitives to ensure performance is not degraded.
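To make the shared-nothing idea concrete, here is a minimal sketch that shards requests onto per-core owners and communicates via message passing. It uses a mutex-guarded queue for brevity; a real engine (Seastar, for example) uses lock-free single-producer/single-consumer queues and smarter idle handling.

// Shard requests by key so each session's state is owned by one core.
#include <atomic>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct Shard {
    std::mutex mu;  // stand-in for a lock-free SPSC ring
    std::deque<std::function<void()>> inbox;
};

std::vector<Shard> shards(4);
std::atomic<bool> stop{false};

// Requests for the same session always land on the same shard, so the
// session state never needs cross-core synchronization.
void submit(size_t session_id, std::function<void()> work) {
    Shard& s = shards[session_id % shards.size()];
    std::lock_guard<std::mutex> g(s.mu);
    s.inbox.push_back(std::move(work));
}

void run_shard(Shard& s) {
    for (;;) {
        std::function<void()> task;
        {
            std::lock_guard<std::mutex> g(s.mu);
            if (!s.inbox.empty()) {
                task = std::move(s.inbox.front());
                s.inbox.pop_front();
            } else if (stop.load()) {
                return;  // drained and told to stop
            }
        }
        if (task) task();  // busy-waits when idle; real engines poll/park
    }
}

int main() {
    std::vector<std::thread> workers;
    for (auto& s : shards) workers.emplace_back(run_shard, std::ref(s));
    for (size_t session = 0; session < 16; ++session)
        submit(session, [session] { std::printf("session %zu handled\n", session); });
    stop = true;
    for (auto& w : workers) w.join();
}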
Futures-Promises
There are many solutions for coordinating work across multiple cores. Some are highly
programmer-friendly and enable the development of software that works exactly as if it
were running on a single core. For example, the classic UNIX process model is designed
to keep each process in total isolation and relies on kernel code to maintain a separate
virtual memory space per process. Unfortunately, this increases the overhead at the
OS level.
There’s a model known as “futures and promises.” A future is a data structure that
represents some yet-undetermined result. A promise is the provider of this result. It
can be helpful to think of a promise/future pair as a first-in first-out (FIFO) queue
with a maximum length of one item, which may be used only once. The promise is the
producing end of the queue, while the future is the consuming end. Like FIFOs, futures
and promises decouple the data producer and the data consumer.
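A minimal sketch of the pattern using the standard library; frameworks like Seastar implement the same idea without locks and with continuation support.

// A promise/future pair as a one-shot, single-item FIFO: the promise is
// the producing end, the future the consuming end.
#include <cstdio>
#include <future>
#include <thread>

int main() {
    std::promise<int> p;                   // producing end
    std::future<int> f = p.get_future();   // consuming end

    std::thread producer([&p] {
        p.set_value(42);                   // fulfill the promise exactly once
    });

    std::printf("result: %d\n", f.get());  // consume the value exactly once
    producer.join();
}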
However, the optimized implementations of futures and promises need to take
several considerations into account. While the standard implementation targets coarse-
grained tasks that may block and take a long time to complete, optimized futures and
promises are used to manage fine-grained, non-blocking tasks. In order to meet this
requirement efficiently, they should:
• Require no locking
• Support continuations
2. Watch the Linux Foundation video, "Exploring Phantom Traffic Jams in Your Data Flows," on YouTube (www.youtube.com/watch?v=IXS_Afb6Y4o) and/or read the corresponding article on the ScyllaDB blog (www.scylladb.com/2022/04/19/exploring-phantom-jams-in-your-data-flow/).
and low-level instructions. Maintaining them in an optimal manner requires a good low-level programming paradigm, and futures-promises is one of the best choices. However, large instruction sets need even more care; this leads to "execution stages."
Execution Stages
Let’s dive deeper into CPU microarchitecture, because (as discussed previously)
database engine CPUs typically need to deal with millions and billions of instructions,
and it’s essential to help the poor thing with that. In a very simplified way, the
microarchitecture of a modern x86 CPU—from the point of view of top-down analysis—
consists of four major components: frontend, backend, branch speculation, and retiring.
Frontend
The processor’s frontend is responsible for fetching and decoding instructions that are
going to be executed. It may become a bottleneck when there is either a latency problem
or insufficient bandwidth. The former can be caused, for example, by instruction cache
misses. The latter happens when the instruction decoders cannot keep up. In the latter
case, the solution may be to attempt to make the hot path (or at least significant portions
of it) fit in the decoded μop cache (DSB) or be recognizable by the loop detector (LSD).
Branch Speculation
Pipeline slots that the top-down analysis classifies as bad speculation are not stalled, but
wasted. This happens when a branch is incorrectly predicted and the rest of the CPU
executes a μop that eventually cannot be committed. The branch predictor is generally
considered to be a part of the frontend. However, its problems can affect the whole pipeline
in ways beyond just causing the backend to be undersupplied by the instruction fetch and
decode. (Note: Branch mispredictions are covered in more detail a bit later in this chapter.)
Backend
The backend receives decoded μops and executes them. A stall may happen either
because of an execution port being busy or a cache miss. At the lower level, a pipeline
slot may be core bound either due to data dependency or an insufficient number of
available execution units. Memory-bound stalls can be caused by cache misses at different levels of the data cache, external memory latency, or insufficient memory bandwidth.
Retiring
Finally, there are pipeline slots that get classified as retiring. They are the lucky ones that
were able to execute and commit their μop without any problems. When 100 percent
of the pipeline slots are able to retire without a stall, the program has achieved the
maximum number of instructions per cycle for that model of the CPU. Although this is
very desirable, it doesn’t mean that there’s no opportunity for improvement. Rather, it
means that the CPU is fully utilized and the only way to improve the performance is to
reduce the number of instructions.
Memory
Memory management is the central design point in all aspects of programming. Even
comparing programming languages to one another always involves discussions about
the way programmers are supposed to handle memory allocation and freeing. No
wonder memory management design affects the performance of a database so much.
Applied to database engineering, memory management typically falls into two
related but independent subsystems: memory allocation and cache control. The former
is in fact a very generic software engineering issue, so considerations about it are not
extremely specific to databases (though they are crucial and worth studying). The latter topic, by contrast, is very broad in itself, shaped by usage details and corner cases. Accordingly, in the database world, cache control has its own flavor.
Allocation
The manner in which programs or subsystems allocate and free memory lies at the core
of memory management. There are several approaches worth considering.
As illustrated by Figure 3-2, so-called "log-structured allocation" is known from filesystems, where sequential writes are appended to a circular log on persistent storage and updates are handled the very same way. At some point, the filesystem must reclaim blocks that have become obsolete entries in the log area to make more space available for future writes. In a naive implementation, unused entries are reclaimed by rereading and rewriting the log from scratch; obsolete blocks are skipped in the process.
A memory allocator for naive code can do something similar. In its simplest form,
it would allocate the next block of memory by simply advancing a next-free pointer.
Deallocation would just need to mark the allocated area as freed. One advantage of
this approach is the speed of allocation. Another is the simplicity and efficiency of
deallocation if it happens in FIFO order or affects the whole allocation space. Stack memory allocations are released in the reverse order of allocation, making the stack the most prominent and most efficient example of this approach.
Using linear allocators as general-purpose allocators can be more problematic
because of the difficulty of space reclamation. To reclaim space, it’s not enough to just
mark entries as free. This leads to memory fragmentation, which in turn outweighs
the advantages of linear allocation. So, as with the filesystem, the memory must be
reclaimed so that it only contains allocated entries and the free space can be used again.
Reclamation requires moving allocated entries around—a process that changes and
invalidates their previously known addresses. In naive code, the locations of references
to allocated entries (addresses stored as pointers) are unknown to the allocator. Existing
references would have to be patched to make the allocator action transparent to the
caller; that’s not feasible for a general-purpose allocator like malloc. Logging allocator
48
Chapter 3 Database Internals: Hardware and Operating System Interactions
use is tied to the programming language selection. Some RTTIs, like C++, can greatly
facilitate this by providing move-constructors. However, passing pointers to libraries that
are outside of your control (e.g., glibc) would still be an issue.
Another alternative is adopting a strategy of pool allocators, which provide allocation
spaces for allocation of entries of a fixed size (see Figure 3-3). By limiting the allocation
space that way, fragmentation can be reduced. A number of general-purpose allocators
use pool allocators for small allocations. In some cases, those allocation spaces exist on a per-thread basis to eliminate the need for locking and improve CPU cache utilization.
Figure 3-3. Pool allocators provide allocation spaces for allocation of entries of a
fixed size. Fragmentation is reduced by limiting the allocation space
This pool allocation strategy provides two core benefits. First, it saves you
from having to search for available memory space. Second, it alleviates memory
fragmentation because it pre-allocates in memory a cache for use with a collection of
object sizes. Here’s how it works to achieve that:
1. The region for each of the sizes has fixed-size memory chunks that
are suitable for the contained objects, and those chunks are all
tracked by the allocator.
2. When it’s time for the allocator to allocate memory for a certain
type of data object, it’s typically possible to use a free slot (chunk)
in one of the existing memory slabs.3
3. When it’s time for the allocator to free the object’s memory, it can
simply move that slot over to the containing slab’s list of unused/
free memory slots.
4. That memory slot (or some other free slot) will be removed from
the list of free slots whenever there’s a call to create an object of
the same type (or a call to allocate memory of the same size).
The best allocation approach to pick heavily depends on the usage scenario. One
great benefit of a log-structured approach is that it handles fragmentation of small
sub-pools in a more efficient way. Pool allocators, on the other hand, generate less
background load on the CPU because of the lack of compacting activity.
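To show those mechanics in code, here is a minimal sketch of a fixed-size pool with a free list threaded through its unused slots; production allocators maintain many such pools, one per size class and often per thread or core.

// A fixed-size pool allocator: one slab of equally sized chunks with a
// free list threaded through the unused slots.
#include <cassert>
#include <cstddef>
#include <cstdlib>

class Pool {
    struct FreeSlot { FreeSlot* next; };
    char*     slab_;
    FreeSlot* free_list_ = nullptr;

public:
    Pool(size_t chunk_size, size_t chunk_count) {
        assert(chunk_size >= sizeof(FreeSlot));
        slab_ = static_cast<char*>(std::malloc(chunk_size * chunk_count));
        // Thread the free list through the (currently unused) chunks.
        for (size_t i = 0; i < chunk_count; ++i) {
            auto* slot = reinterpret_cast<FreeSlot*>(slab_ + i * chunk_size);
            slot->next = free_list_;
            free_list_ = slot;
        }
    }
    ~Pool() { std::free(slab_); }

    void* allocate() {          // O(1): pop a slot off the free list
        if (!free_list_) return nullptr;
        FreeSlot* slot = free_list_;
        free_list_ = slot->next;
        return slot;
    }
    void deallocate(void* p) {  // O(1): push the slot back
        auto* slot = static_cast<FreeSlot*>(p);
        slot->next = free_list_;
        free_list_ = slot;
    }
};

int main() {
    Pool pool(64, 1024);        // 1,024 chunks of 64 bytes each
    void* a = pool.allocate();
    pool.deallocate(a);         // the slot goes straight back on the free list
}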
Cache Control
When it comes to memory management in a software application that stores lots of data
on disk, you cannot overlook the topic of cache control. Caching is always a must in data
processing, and it’s crucial to decide what and where to cache.
If caching is done at the I/O level, for both read/write and mmap, caching can
become the responsibility of the kernel. The majority of the system’s memory is given
over to the page cache. The kernel decides which pages should be evicted when memory
runs low, decides when pages need to be written back to disk, and controls read-ahead.
The application can provide some guidance to the kernel using the madvise(2) and
fadvise(2) system calls.
The main advantage of letting the kernel control caching is that great effort has been
invested by the kernel developers over many decades into tuning the algorithms used
by the cache. Those algorithms are used by thousands of different applications and are
3. We are using the term "slab" to mean one or more contiguous memory pages that contain pre-allocated chunks of memory.
generally effective. The disadvantage, however, is that these algorithms are general-
purpose and not tuned to the application. The kernel must guess how the application
will behave next. Even if the application knows differently, it usually has no way to help
the kernel guess correctly. This results in the wrong pages being evicted, I/O scheduled
in the wrong order, or read-ahead scheduled for data that will not be consumed in the
near future.
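For completeness, here is what that guidance looks like in practice: a minimal sketch of the advisory calls mentioned above, against a hypothetical data.bin file. The kernel is free to ignore these hints.

// Advising the kernel about expected access patterns.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("data.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // We'll scan the file sequentially: prefetch aggressively and feel
    // free to drop pages behind the read position.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    // This mapped region will be accessed randomly: read-ahead would
    // only pollute the page cache.
    void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    madvise(mapped, st.st_size, MADV_RANDOM);

    munmap(mapped, st.st_size);
    close(fd);
}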
Next, caching at the I/O level interacts with the topic often referred to as IMR—in-memory representation. Unsurprisingly, the format in which data is stored on disk differs from the form in which the same data is laid out in memory as objects. The simplest reason they differ is byte ordering. With that in mind, if the data is cached
once it’s read from the disk, it needs to be further converted or parsed into the object
used in memory. This can be a waste of CPU cycles, so applications may choose to cache
at the object level.
Choosing to cache at the object level affects a lot of other design points. With
that, the cache management is all on the application side including cross-core
synchronization, data coherence, invalidation, and so on. Next, since objects can be
(and typically are) much smaller than the average I/O size, caching millions and billions of those objects requires a careful choice of collection that can handle it (you'll learn about this quite soon). Finally, caching on the object level greatly affects the way I/O is done.
I/O
Unless the database engine is an in-memory one, it will have to keep the data on external
storage. There can be many options to do that, including local disks, network-attached
storage, distributed file- and object- storage systems, and so on. The term “I/O” typically
refers to accessing data on local storage—disks or filesystems (that, in turn, are located
on disks as well). And in general, there are four choices for accessing files on a Linux
server: read/write, mmap, Direct I/O (DIO) read/write, and Asynchronous I/O (AIO/
DIO, because this I/O is rarely used in cached mode).
Traditional Read/Write
The traditional method is to use the read(2) and write(2) system calls. In a modern
implementation, the read system call (or one of its many variants—pread, readv, preadv,
etc.) asks the kernel to read a section of a file and copy the data into the calling process
address space. If all of the requested data is in the page cache, the kernel will copy it
and return immediately; otherwise, it will arrange for the disk to read the requested
data into the page cache, block the calling thread, and when the data is available, it will
resume the thread and copy the data. A write, on the other hand, will usually just copy the data into the page cache; the kernel writes the page cache back to disk some time afterward.
mmap
An alternative and more modern method is to memory-map the file into the application
address space using the mmap(2) system call. This causes a section of the address space
to refer directly to the page cache pages that contain the file’s data. After this preparatory
step, the application can access file data using the processor’s memory read and
memory write instructions. If the requested data happens to be in cache, the kernel is
completely bypassed and the read (or write) is performed at memory speed. If a cache
miss occurs, then a page-fault happens and the kernel puts the active thread to sleep
while it goes off to read the data for that page. When the data is finally available, the
memory-management unit is programmed so the newly read data is accessible to the
thread, which is then awoken.
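A minimal sketch of that access pattern, without error handling, against a hypothetical data.bin file:

// Reading a file through mmap: after the mapping is set up, file data
// is accessed with plain memory loads instead of read() calls.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("data.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Map the whole file; pages are faulted in from the page cache
    // (or, transparently, from disk) as they are touched.
    auto* data = static_cast<const unsigned char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0));

    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        sum += data[i];  // ordinary memory reads
    std::printf("checksum: %lu\n", sum);

    munmap(const_cast<unsigned char*>(data), st.st_size);
    close(fd);
}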
Figure 3-4. Direct I/O involves opening the file with the O_DIRECT flag; further
activity will use the normal read and write family of system calls, but their
behavior is now altered
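A minimal sketch of what the figure describes, again against a hypothetical data.bin file. With O_DIRECT, the buffer address, file offset, and transfer size must all be aligned to the device block size; 4096 bytes is a common safe default.

// Direct I/O: open with O_DIRECT and honor the alignment contract.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
    int fd = open("data.bin", O_RDONLY | O_DIRECT);

    // The buffer itself must be block-aligned, so plain malloc won't do.
    void* buf = nullptr;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

    // Offset and length are also multiples of the block size; the read
    // bypasses the page cache entirely.
    ssize_t n = pread(fd, buf, 4096, 0);
    std::printf("read %zd bytes directly from disk\n", n);

    free(buf);
    close(fd);
}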
Figure 3-5. A refinement of Direct I/O, Asynchronous Direct I/O behaves similarly
but prevents the calling thread from blocking
Note: io_uring The API to perform asynchronous I/O appeared in Linux long ago, and it was warmly met by the community. However, as often happens, real-world usage quickly revealed many inefficiencies, such as blocking under some
circumstances (despite the name), the need to call the kernel too often, and poor
support for canceling the submitted requests. Eventually, it became clear that the
updated requirements were not compatible with the existing API and the need for a
new one arose.
This is how the io_uring API appeared. It provides the same facilities as AIO
does, but in a much more convenient and performant way (it also has notably
better documentation). Without diving into implementation details, let’s just say that
it exists and is preferred over the legacy AIO.
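For a taste of the API, here is a minimal sketch of a single asynchronous read using the liburing helper library (link with -luring): one submission, one completion. Real applications keep many requests in flight and harvest completions in batches.

// One asynchronous read via io_uring.
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);  // queue depth of 8

    int fd = open("data.bin", O_RDONLY);
    char buf[4096];

    // Fill a submission queue entry and hand it to the kernel.
    struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    // Wait for, then retire, the completion entry.
    struct io_uring_cqe* cqe;
    io_uring_wait_cqe(&ring, &cqe);
    std::printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
}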
I/O Scheduling
One of the problems with letting the kernel control caching (with the mmap and read/
write access methods) is that the application loses control of I/O scheduling. The kernel
picks whichever block of data it deems appropriate and schedules it for write or read.
This can result in a number of problems.
By bypassing the kernel page cache, the application takes on the burden of
scheduling I/O. This doesn’t mean that the problems are solved, but it does mean that
the problems can be solved—with sufficient attention and effort.
When using Direct I/O, each thread controls when to issue I/O. However, the kernel
controls when the thread runs, so responsibility for issuing I/O is shared between the
kernel and the application. With AIO/DIO, the application is in full control of when I/O
is issued.
Thread Scheduling
An I/O intensive application using mmap or read/write cannot guess what its cache hit
rate will be. Therefore, it has to run a large number of threads (significantly more than the core count of the machine it is running on). With too few threads, they may all end up waiting on the disk, leaving the processor underutilized. Since each thread usually has
at most one disk I/O outstanding, the number of running threads must be around the
concurrency of the storage subsystem multiplied by some small factor in order to keep
the disk fully occupied. However, if the cache hit rate is sufficiently high, then these large
numbers of threads will contend with each other for the limited number of cores.
When using Direct I/O, this problem is somewhat mitigated. The application knows
exactly when a thread is blocked on I/O and when it can run, so the application can
adjust the number of running threads according to runtime conditions.
With AIO/DIO, the application has full control over both running threads and
waiting I/O (the two are completely divorced), so it can easily adjust to in-memory or
disk-bound conditions or anything in between.
I/O Alignment
Storage devices have a block size; all I/O must be performed in multiples of this block
size which is typically 512 or 4096 bytes. Using read/write or mmap, the kernel performs
the alignment automatically; a small read or write is expanded to the correct block
boundary by the kernel before it is issued.
With DIO, it is up to the application to perform block alignment. This incurs some
complexity, but also provides an advantage: The kernel will usually over-align to a 4096
byte boundary even when a 512-byte boundary suffices. However, a user application
using DIO can issue 512-byte aligned reads, which results in saving bandwidth on
small items.
Application Complexity
While the previous discussions favored AIO/DIO for I/O intensive applications, that method
comes with a significant cost: complexity. Placing the responsibility of cache management
on the application means it can make better choices than the kernel and make those
choices with less overhead. However, those algorithms need to be written and tested. Using
asynchronous I/O requires that the application is written using callbacks, coroutines, or a
similar method, and often reduces the reusability of many available libraries.
From the performance point of view, the difference is not that drastic. On one hand,
writing data to a file is always accompanied by associated metadata updates. This
consumes both disk space and I/O bandwidth. However, some modern filesystems
provide a very good balance of performance and efficiency, almost eliminating the I/O
latency. (One of the most prominent examples is XFS. Another really good and mature
piece of software is Ext4). The great ally in this camp is the fallocate(2) system call
that makes the filesystem preallocate space on disk. When used, filesystems also have a
chance to make full use of the extents mechanisms, thus bringing the QoS of using files
to the same performance level as when using raw block devices.
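A minimal sketch of that preallocation step; the file name and size here are arbitrary.

// Preallocate file space so later writes need no block allocation or
// file-size metadata updates.
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("commitlog.seg", O_CREAT | O_RDWR, 0644);
    fallocate(fd, 0, 0, 1LL << 30);  // reserve 1 GiB on disk now
    close(fd);
}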
Appending Writes
The database may have a heavy reliance on appends to files or require in-place updates
of individual file blocks. Both approaches need special attention from the system
architect because they call for different properties from the underlying system.
On one hand, appending writes requires careful interaction with the filesystem so
that metadata updates (file size, in particular) do not dominate the regular I/O. On the
other hand, appending writes (being sort of cache-oblivious algorithms) handle the disk
overwriting difficulties in a natural manner. Contrary to this, in-place updates cannot
happen at random offsets and sizes because disks may not tolerate this kind of workload,
even if they’re used in a raw block device manner (not via a filesystem).
That being said, let’s dive even deeper into the stack and descend into the
hardware level.
Overwhelming the disk should be avoided because when the disk is full of requests
it cannot distinguish between the criticality of certain requests over others. Of course,
all requests are important, but it makes sense to prioritize latency-sensitive requests.
For example, ScyllaDB serves real-time queries that need to be completed in single-
digit milliseconds or less and, in parallel, it processes terabytes of data for compaction,
streaming, decommission, and so forth. The former have strong latency sensitivity; the
latter are less so. Good I/O maintenance that tries to maximize the I/O bandwidth while
keeping latency as low as possible for latency-sensitive tasks is complicated enough to
become a standalone component called the I/O Scheduler.
When evaluating a disk, you would most likely be looking at its four parameters—
read/write IOPS and read/write throughput (such as in MB/s). Comparing these
numbers to one another is a popular way of claiming one disk is better than the other
and estimating the aforementioned “bandwidth capacity” of the drive by applying Little’s
Law. With that, the I/O Scheduler’s job is to provide a certain level of concurrency inside
the disk to get maximum bandwidth from it, but not to make this concurrency too high
in order to prevent the disk from queueing requests internally for longer than needed.
For instance, Figure 3-6 illustrates how read request latency depends on the
intensity of small reads (challenging disk IOPS capacity) vs the intensity of large writes
(pursuing the disk bandwidth). The latency value is color-coded, and the “interesting
area” is painted in cyan—this is where the latency stays below 1 millisecond. The drive
measured is the NVMe disk that comes with the AWS EC2 i3en.3xlarge instance.
Figure 3-6. Bandwidth/latency graphs showing how read request latency depends
on the intensity of small reads (challenging disk IOPS capacity) vs the intensity of
large writes (pursuing the disk bandwidth)
Tip: How to Measure Your Own Disk Behavior Under Load The better you
understand how your own disks perform under load, the better you can tune them
to capitalize on their “sweet spot.” One way to do this is with Diskplorer,4 an open-
source disk latency/bandwidth exploring toolset. By using Linux fio under the hood
4. You can access Diskplorer at https://github.com/scylladb/diskplorer. This project contains instructions on how to generate a graph of your own.
Networking
The conventional networking functionality available in Linux is remarkably full-featured,
mature, and performant. Since the database rarely imposes severe per-ping latency
requirements, there are very few surprises that come from it when properly configured
and used. Nonetheless, some considerations still need to be made.
As explained by David Ahern, “Linux will process a fair amount of packets in the
context of whatever is running on the CPU at the moment the IRQ is handled. System
accounting will attribute those CPU cycles to any process running at that moment even
though that process is not doing any work on its behalf. For example, ‘top’ can show a
process that appears to be using 99+% CPU, but in reality, 60 percent of that time is spent
processing packets—meaning the process is really only getting 40 percent of the CPU to
make progress on its workload.”6
However, for truly networking-intensive applications, the Linux stack is constrained in several ways.
5. Watch the video on YouTube (www.youtube.com/watch?v=Am-nXO6KK58).
6. For the source and additional detail, see David Ahern's "The CPU Cost of Networking on a Host" (https://people.kernel.org/dsahern/the-cpu-cost-of-networking-on-a-host).
As before, the way to overcome this limitation is to move the packet processing to the
userspace. There are plenty of out-of-kernel implementations of the TCP algorithm that
are worth considering.
DPDK
One of the generic distinctions often referred to in the networking area is poll mode vs the interrupt model. When a packet arrives, the system has two options for how to get informed: set up an interrupt from the hardware (or, in the case of a userspace implementation, from the kernel file descriptor using the poll family of system calls), or keep polling the network card on its own from time to time until a packet is noticed.
The famous userspace network toolkit, called DPDK, is designed specifically for
fast packet processing, usually in fewer than 80 CPU cycles per packet.7 It integrates
seamlessly with Linux in order to take advantage of high-performance hardware.
IRQ Binding
As stated earlier, packet processing may take up to 60 percent of the CPU time, which
is way too much. This leaves too few CPU cycles for the database work itself. Even though backpressure would most likely keep further external activity at bay and the system would eventually find its balance, the resulting system throughput would likely be unacceptable.
System architects may consider the non-symmetrical CPU approach to mitigate
this. If you’re letting the Linux kernel process network packets, there are several ways to
localize this processing on separate CPUs.
7
For details, see the Linux Foundation’s page on DPDK (Data Plane Developers Kit) at
www.dpdk.org.
The simplest way is to bind the IRQ processing from the NIC to specific cores or
hyper-threads. Linux uses two-step processing of incoming packets called IRQ and soft-
IRQ. If the IRQs are properly bound to cores, the soft-IRQ also happens on those cores—
thus completely localizing the processing.
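To make this concrete, here is a minimal sketch of such binding in Python, assuming a NIC whose queues show up under the interface name in /proc/interrupts (the interface name and CPU list are illustrative; production setups often rely on vendor scripts or irqbalance configuration instead):

# Sketch: pin every IRQ of a given NIC to a chosen set of CPUs.
# Assumes the NIC's queues appear under the interface name in
# /proc/interrupts (e.g., "eth0-Tx-Rx-0"); must run as root.
def bind_nic_irqs(interface: str, cpu_list: str) -> None:
    with open("/proc/interrupts") as interrupts:
        for line in interrupts:
            if interface not in line:
                continue
            irq = line.split(":", 1)[0].strip()
            if irq.isdigit():
                # smp_affinity_list accepts CPU lists such as "0-1" or "2,4"
                with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as aff:
                    aff.write(cpu_list)

bind_nic_irqs("eth0", "0-1")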
For huge-scale nodes running tens to hundreds of cores, more than one core may need to be dedicated to networking alone. In this case, it might make sense to localize processing even further by assigning cores from different NUMA nodes and teaching the NIC to balance the traffic between them using the receive packet steering facility of the Linux kernel.
Summary
This chapter introduced a number of ways that database engineering decisions enable
database users to squeeze more power out of modern infrastructure. For CPUs, the
chapter talked about taking advantage of multicore servers by limiting resource sharing
across cores and using future-promise design to coordinate work across cores. The
chapter also provided a specific example of how low-level CPU architecture has direct
implications on the database.
Moving on to memory, you read about two related but independent subsystems:
memory allocation and cache control. For I/O, the chapter discussed Linux options
such as traditional read/write, mmap, Direct I/O (DIO) read/write, and Asynchronous
I/O—including the various tradeoffs of each. This was followed by a deep dive into
how modern SSDs work and how a database can take advantage of a drive’s unique
characteristics. Finally, you looked at constraints associated with the Linux networking
stack and explored alternatives such as DPDK and IRQ binding. The next chapter shifts
the focus from hardware interactions to algorithmic optimizations: pure software
challenges.
CHAPTER 4
Database Internals:
Algorithmic Optimizations
In the performance world, hardware is always the unbreakable limiting factor: You cannot squeeze more processing units out of a system than the underlying chips provide. By contrast, the software part of the system is often considered the most flexible thing in programming, in the sense that it can be changed at any time given enough developers' brains and hands (and investors' cash).
However, that’s not always the case. Sometimes selecting an algorithm should be
done as early as the architecting stage in the most careful manner possible because the
chosen approach becomes so extremely fundamental that changing it would effectively
mean rewriting the whole engine from scratch or requiring users to migrate exabytes of
data from one instance to another.
This chapter shares one detailed example of algorithmic optimization—from the
perspective of the engineer who led this optimization. Specifically, this chapter looks
at how the B-trees family can be used to store data in cache implementations and
other accessory and in-memory structures. This look into a representative engineering
challenge should help you better understand what tradeoffs or optimizations various databases might be making under the hood—ideally, so you can take better advantage of your database's very deliberate design decisions.1
Note  The goal of this chapter is not to convince database users that they need a database with any particular algorithmic optimization—or to educate infrastructure engineers on designing B-trees or the finer points of algorithmic optimization. Rather, it's to help anyone selecting or working with a database understand the level of design and experimentation that can go into its internals.
1. This chapter draws from material originally published on the ScyllaDB blog (www.scylladb.com/blog). It is used here with permission of ScyllaDB.
Optimizing Collections
Maintaining large sets of objects in memory deserves the same level of attention as maintaining objects in external memory—say, spinning disks or network-attached storage. For a task as simple as looking up an object by a plain key, the accepted solution is often a plain hash table (albeit one requiring great attention to hash function selection) or a balanced binary tree (usually a red-black one, due to its implementation simplicity). However, branchy trees like the B-tree family can significantly boost performance. They also have a lot of non-obvious pitfalls.
The second reason that B-trees are often a good choice for in-memory collections also comes from the dispersed nature of binary trees and from how modern CPUs are designed. It's well known that when executing a stream of instructions, CPU cores split the processing of each instruction into stages (loading instructions, decoding them, preparing arguments, and doing the execution itself), and the stages of consecutive instructions run in parallel, in a pipelined fashion.
2. See Marek Majkowski's blog post, "Branch predictor: How many 'if's are too many? Including x86 and M1 benchmarks!" (https://blog.cloudflare.com/branch-predictor/).
3. See the tutorial "Eytzinger Binary Search" (https://algorithmica.org/en/eytzinger).
4. Both are available as open-source software; see https://github.com/scylladb/scylladb and https://github.com/tarantool/tarantool.
The way to reduce this number is to compare more than one key at a time when doing intra-node searches. If the keys are small enough, SIMD instructions can compare up to 64 keys in one go. Although a SIMD compare instruction may be slower than a classic cmp and requires additional instructions to process the comparison mask, a linear SIMD-powered search wins on short enough arrays (and B-tree nodes can be short enough). For example, Figure 4-3 shows the times of looking up an integer in a sorted array using three techniques: linear search, binary search, and SIMD-optimized linear search using the x86 Advanced Vector Extensions (AVX).
Figure 4-3. The test used a large number of randomly generated arrays of values dispersed in memory to eliminate differences in cache usage, and a large number of random search keys to blur branch prediction. These are the average times of finding a key in an array, normalized by the array length. Smaller results are faster (better)
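As a rough illustration of why the branchless approach wins on short arrays, consider this sketch, which uses NumPy's vectorized comparison as a stand-in for a SIMD compare instruction (array contents are illustrative):

# Sketch: branchless "compare all keys at once" search vs. binary search.
import numpy as np

node_keys = np.array([3, 9, 14, 27, 45, 61, 78, 90])  # one sorted B-tree node

def vectorized_lower_bound(keys, needle):
    # One vectorized comparison against every key, then count the smaller
    # ones: no data-dependent branches for the CPU to mispredict.
    return int((keys < needle).sum())

def binary_lower_bound(keys, needle):
    # Classic binary search; each step takes a hard-to-predict branch.
    return int(np.searchsorted(keys, needle))

assert vectorized_lower_bound(node_keys, 30) == binary_lower_bound(node_keys, 30) == 4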
A great implicit feature of a tree is the ability to iterate over elements in a sorted manner (called a scan). To scan a classical B-tree, there are both recursive and state-machine algorithms that process the keys in a very non-uniform manner: The algorithm walks up and down the tree as it moves. Despite B-trees being described as cache-friendly, scanning them requires visiting every single node, and inner nodes are visited in a cache-unfriendly manner. Figure 4-4 illustrates this phenomenon.
Figure 4-4. Scanning a classical B-tree involves walking up and down the tree;
every node and inner node is visited
As opposed to this, a B+-tree scan only needs to loop through its leaf nodes, which, with some additional effort, can be implemented as a linear scan over a linked list of arrays, as demonstrated in Figure 4-5.
Consider the overhead needed to store a single key. For a binary tree, the overhead is three pointers: to the left and right children as well as to the parent node. For a B-tree, it differs between inner and leaf nodes. For both types, the overhead is one parent pointer and k pointers to keys, even if they are not inserted in the tree. For inner nodes, there are additionally k+1 pointers to child nodes.
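A quick back-of-the-envelope helper, following the pointer counts above (the k and occupancy values are illustrative):

# Sketch: per-key pointer overhead for the layouts described above.
def binary_tree_overhead_per_key():
    return 3            # parent + left child + right child

def btree_leaf_overhead_per_key(k, occupancy):
    # One parent pointer and k key pointers, shared by the keys present.
    return (1 + k) / occupancy

def btree_inner_overhead_per_key(k, occupancy):
    # As above, plus k+1 child pointers.
    return (1 + k + (k + 1)) / occupancy

print(btree_leaf_overhead_per_key(k=8, occupancy=8))    # 1.125 pointers per key
print(btree_inner_overhead_per_key(k=8, occupancy=8))   # 2.25 pointers per key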
The number of nodes in a B-tree is easy to estimate for a large number of keys. As the number of nodes grows, the per-key overhead blurs as keys "share" parent and child pointers. However, there's a very interesting point at the beginning of a tree's growth. When the number of keys becomes k+1 (i.e., the tree overgrows its first leaf node), the number of nodes suddenly triples because the tree needs to allocate one more leaf node plus one inner node to link the two.
There is a good and pretty cheap optimization to mitigate this spike, called "linear root." The leaf root node grows on demand, doubling its capacity at each step like a std::vector in C++, and it is allowed to overgrow the capacity of k to some extent. Figure 4-6 shows the per-key overhead for a 4-ary B-tree with 50 percent initial overgrowth. Note the first split spike of the classical algorithm at five keys.
Figure 4-6. The per-key overhead for a 4-ary B-tree with 50 percent initial
overgrowth
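A sketch of the idea (the capacities and overgrowth factor are illustrative):

# Sketch: a "linear root" leaf that doubles its capacity on demand and may
# exceed the nominal fan-out k before the first split is forced.
class LinearRoot:
    def __init__(self, k, overgrowth=0.5):
        self.limit = int(k * (1 + overgrowth))  # e.g., 50% initial overgrowth
        self.capacity = 1
        self.keys = []

    def insert(self, key):
        """Returns False once the node must split into a real tree."""
        if len(self.keys) == self.capacity:
            if self.capacity >= self.limit:
                return False   # allocate one inner and two leaf nodes instead
            self.capacity = min(self.capacity * 2, self.limit)
        self.keys.append(key)
        self.keys.sort()
        return True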
When discussing how B-trees work with small amounts of keys, it’s worth
mentioning the corner case of one key. In ScyllaDB, a B-tree is used to store sorted rows
inside a block of rows called a partition. Since it’s possible to have a schema where each
partition always has a single row, this corner case is not that “corner” for us. In the case
of a binary tree, the single-element tree is equivalent to having a direct pointer from the
tree owner to this element (plus the cost of two nil pointers to the left and right children).
In the case of a B-tree, keeping a single key always costs a root node, which implies an extra pointer fetch to access this key. Even the linear root optimization is helpless here. Fixing this corner case was possible by reusing the pointer to the root node to point directly at the single key.
It’s possible to define a compaction operation for a B-tree that will pick several
adjacent nodes and squash them together, but this operation has its limitations. First,
a certain amount of underoccupied nodes makes it possible to insert a new element
into a tree without the need to rebalance, thus saving CPU cycles. Second, since each
node cannot contain less than a half of its capacity, squashing two adjacent nodes
is impossible. Even if considering three adjacent nodes, then the amount of really
squashable nodes would be less than 5 percent of the leaves and less than 1 percent of
the inners.
Summary
As extensive as these optimizations might seem, they are really just the tip of the iceberg
for this one particular example. Many finer points that matter from an engineering
perspective were skipped for brevity (for example, the subtle difference in odd vs
even number of keys on a node). For a database user, the key takeaway here is that
an excruciating level of design and experimentation often goes into the software for
determining how your database stores and retrieves data. You certainly don’t need to
be this familiar with every aspect of how your database was engineered. But knowing
what algorithmic optimizations your database has focused on will help you understand
why it performs in certain ways under different contexts. And you might discover some
impressively engineered capabilities that could help you handle more user requests or
shave a few precious milliseconds off your P99 latencies. The next chapter takes you into
the inner workings of database drivers and shares tips for getting the most out of a driver,
particularly from a performance perspective.
CHAPTER 5
Database Drivers
Databases usually expose a specific communication protocol for their users. This
protocol is the foundation of communication between clients and servers, so it’s often
well-documented and has a formal specification. Some databases, like PostgreSQL,
implement their own binary format on top of the TCP/IP stack.1 Others, like Amazon
DynamoDB,2 build theirs on top of HTTP, which is a little more verbose, but also more
versatile and compatible with web browsers. It’s also not uncommon to see a database
exposing a protocol based on gRPC3 or any other well-established framework.
Regardless of the implementation details, users seldom use the bare protocol
themselves because it’s usually a fairly low-level API. What’s used instead is a driver—a
programming interface written in a particular language, implementing a higher-level
abstraction for communicating with the database. Drivers hide all the nitty-gritty details
behind a convenient interface, which saves users from having to manually handle
connection management, parsing, validation, handshakes, authentication, timeouts,
retries, and so on.
In a distributed environment (which a scalable database cluster usually is), clients,
and therefore drivers, are an extremely important part of the ecosystem. The clients
are usually the most numerous group of actors in the system, and they are also very
heterogeneous in nature, as visualized in Figure 5-1. Some clients are connected via
local network interfaces, other ones connect via a questionable Wi-Fi hotspot on another
continent and thus have vastly different latency characteristics and error rates. Some
might run on microcontrollers with 1MiB of random access memory, while others
utilize 128-core bare metal machines from a cloud provider. Due to this diversity, it’s
1. See the PostgreSQL documentation (https://www.postgresql.org/docs/7.3/protocol-protocol.html).
2. See the DynamoDB Developer Guide on the DynamoDB API (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.API.html).
3. gRPC is "a high performance, open-source universal RPC framework;" see https://grpc.io for details.
very important to take drivers into consideration when thinking about performance,
scalability, and resilience to failures. Ultimately it’s the drivers that generate traffic and
its concurrency, so cooperation between them and database nodes is crucial for the
whole system to be healthy and efficient.
This chapter takes a look at how drivers impact performance—through the eyes of
someone who has engineered drivers for performance. It provides insight into various
ways that drivers can support efficient client-server interactions and shares tips for
getting the most out of a driver, particularly from the performance perspective. Finally,
the chapter wraps up with several considerations to keep in mind as you’re selecting
a driver.
queries operating on huge volumes of historical data). Other ones strive to be universal,
handling all kinds of clients and balancing the load so that everyone is happy (or, more
precisely, “happy enough”).
Workload Types
There are multiple ways of classifying database clients. One particularly interesting way is to distinguish between clients processing interactive and batch (e.g., analytical) workloads, also known as OLTP (online transaction processing) vs. OLAP (online analytical processing); see Figure 5-2.
Interactive Workloads
A client processing an interactive workload typically wants certain latency guarantees.
Receiving a response fast is more important than ensuring that the query succeeded.
In other words, it’s better to return an error in a timely manner than make the client
indefinitely wait for the correct response. Such workloads are often characterized by
unbounded concurrency, which means that the number of in-progress operations is
hard to predict.
A prime example of an interactive workload is a server handling requests from web
browsers. Imagine an online game, where players interact with the system straight from
their favorite browsers. High latency for such a player means a poor user experience
because people tend to despise waiting for online content for more than a few hundred
milliseconds; with multi-second delays, most will just ditch the game as unusable and
try something else. It’s therefore particularly important to be as interactive as possible
and return the results quickly—even if the result happens to be a temporary error. In
such a scenario, the concurrency of clients varies and is beyond the database's control. Sometimes there might be a large influx of players, and the database might need to refuse some of them to avoid overload.
Mixed Workloads
Certain workloads cannot be easily qualified as fully interactive or fully batch. The
clients are free to intermix their requirements, concurrency, and load however they
please—so the databases should also be ready for surprises. For example, a batch
workload might suddenly experience a giant temporary spike in concurrency. Databases
should, on the one hand, maintain a level of trust in the workload’s typical patterns, but
on the other hand anticipate that workloads can simply change over time—due to bugs,
hardware changes, or simply because the use case has diverged from its original goal.
Throughput vs Goodput
A healthy distributed database cluster is characterized by stable goodput, not
throughput. Goodput is an interesting portmanteau of good + throughput, and it’s a
measure of useful data being transferred between clients and servers over the network,
as opposed to just any data. Goodput disregards errors and other churn-like redundant
retries, and is used to judge how effective the communication actually is.
This distinction is important.
4. Apache Spark is a "multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters." For details, see https://spark.apache.org/.
Imagine an extreme case of an overloaded node that keeps returning errors for each
incoming request. Even though stable and sustainable throughput can be observed,
this database brings no value to the end-user. Thus, it’s essential to track how much
useful data can be delivered in an acceptable time. For example, this can be achieved
by tracking both the total throughput and throughput spent on sending back error
messages and then subtracting one from another to see how much valid data was
transferred (see Figure 5-3).
Figure 5-3. Note how a fraction of the throughput times out, effectively requiring
more work from clients to achieve goodput
Timeouts
In a distributed system, there are two fundamental types of timeouts that influence one
another: client-side timeouts and server-side timeouts. While both are conceptually
similar, they have different characteristics. It’s vital to properly configure both of them to
prevent problems like data races and consistency issues.
Client-Side Timeouts
This type of timeout is generally configured in the database driver. It signifies how long it
takes for a driver to decide that a response from a server is not likely to arrive. In a perfect
world built on top of a perfect network, all parties always respond to their requests.
However, in practice, there are numerous causes for a response to either be late or lost:
• The recipient died
• And so on
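To give a feel for where this knob usually lives, here's a sketch using the Python driver for Cassandra/ScyllaDB (addresses and keyspace are illustrative; check your own driver's documentation for the exact attribute names):

# Sketch: client-side timeouts in the Python CQL driver. When the timeout
# fires, the driver gives up waiting; the server may still be working.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

session.default_timeout = 2.0      # seconds, applied to every request
rows = session.execute("SELECT * FROM users WHERE id = 42", timeout=0.5)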
Server-Side Timeouts
A server-side timeout determines when a database node should start considering a
particular request as expired. Once this point in time has passed, there is no reason
to continue processing the query. (Doing so would waste resources which could have
otherwise been used for serving other queries that still have a chance to succeed.)
When the specified time has elapsed, databases often return an error indicating that the
request took too long.
Using reasonable values for server-side timeouts helps the database manage its
priorities in a more precise way, allocating CPU, memory, and other scarce resources to queries likely to succeed in a timely manner. Drivers that receive an error indicating that
a server-side timeout has occurred should also act accordingly—perhaps by reducing
the pressure on a particular node or retrying on another node that hasn’t experienced
timeouts lately.
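Reusing the session from the previous sketch, a per-request server-side timeout might look like this in ScyllaDB, which extends CQL with a USING TIMEOUT clause (the query itself is illustrative; other databases expose similar knobs through configuration):

# Sketch: a server-side timeout attached to a single CQL request.
# The server abandons the request once 500ms have elapsed.
rows = session.execute("SELECT * FROM users WHERE id = 42 USING TIMEOUT 500ms")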
A Cautionary Tale
The CQL protocol, which specifies the communication layer in Apache Cassandra and ScyllaDB, comes with built-in support for concurrency. Namely, each request is assigned a stream ID, unique for each connection, which is later used to match an incoming response with its original request. This stream ID is encoded as a 16-bit integer with the first bit reserved by the protocol, which leaves the drivers 32768 unique values for handling in-flight requests per single connection. That's not a particularly large number, given that modern systems are known to handle millions of requests per second. Thus, drivers eventually need to reuse previously assigned stream IDs.
But the CQL driver for Python had a bug.5 In the event of a client-side timeout, it
assumed that the stream ID of an expired request was immediately free to reuse. While
the assumption holds true if the server dies, it is incorrect if processing simply takes
longer than expected. It was therefore possible that once a response with a given stream
ID arrived, another request had already reused the stream ID, and the driver would
mistakenly match the response with the new request. If the user was lucky, they would
simply receive garbage data that did not pass validation. Unfortunately, data from the
mismatched response might appear correct, even though it originates from a totally
different request. This is the kind of bug that looks innocent at first glance, but may cause
people to log in to other people’s bank accounts and wreak havoc on their lives.
A rule of thumb for client-side timeouts is to make sure that a server-side timeout
also exists and is strictly shorter than the client-side one. It should take into account
clock synchronization between clients and servers (or lack thereof ), as well as estimated
network latency. Such a procedure minimizes the chances for a late response to arrive at
all, and thus removes the root cause of many issues and vulnerabilities.
5. The bug report and applied fixes can be found at https://datastax-oss.atlassian.net/browse/PYTHON-1286, https://github.com/scylladb/python-driver/pull/106, and https://github.com/datastax/python-driver/pull/1114.
Contextual Awareness
At this point it should be clear that both servers and clients can make better, more
educated, and mutually beneficial decisions if they know more about each other.
Exchanging timeout information is important, but drivers and servers can do even more
to keep each other up to date.
6. See the documentation on Gossip in ScyllaDB (https://docs.scylladb.com/stable/kb/gossip.html).
Current Load
Overload protection and request latency optimization are tedious tasks, but they can be
substantially facilitated by exchanging as much context as possible between interested
parties.
Various methods can be applied to distribute the load evenly across the distributed system and prevent unwanted spikes.
Based on such load indicators, drivers should try to adjust the amount of data they send, their concurrency, and the rate of retries and speculative executions, which can keep the whole distributed system in a healthy, balanced state. It's ultimately in the driver's interest to ease the pressure on nodes that start showing symptoms of getting overloaded, be it by reducing the concurrency of operations, limiting the frequency and number of retries, or temporarily giving up on speculatively sent requests. Otherwise, if the database servers get overloaded, all clients may experience symptoms like failed requests, timeouts, and increased latency.
Request Caching
Many database management systems, ranging from SQLite, MySQL, and Postgres to
NoSQL databases, implement an optimization technique called prepared statements.
While the language used to communicate with the database is usually human-readable
(or at least developer-readable), it is not the most efficient way of transferring data from
one computer to another.
Let’s take a look at the (simplified) lifecycle of an unprepared statement once it’s sent
from a ScyllaDB driver to the database and back. This is illustrated in Figure 5-4.
2. The string is packed into a CQL frame by the driver. Each CQL frame consists of a header, which describes the purpose of a particular frame. Following the header, a specific payload may be sent as well. The full protocol specification is available at https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v4.spec.
6. The database parses the string in order to validate its contents and
interpret what kind of an operation is requested: is it an insertion,
an update, a deletion, a selection?
Now, imagine that a user wants to perform a hundred million operations on the
database in quick succession because the data is migrated from another system. Even
if parsing the query strings is a relatively fast operation and takes 50 microseconds, the
total time spent on parsing strings will take over an hour of CPU time. Sounds like an
obvious target for optimization.
The key observation is that operations performed on a database are usually similar to one another and follow a certain pattern. For instance, migrating a table from one system to another may mean sending lots of requests with a schema similar to the following (the table and column names are, of course, use-case specific):
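INSERT INTO ks.users (id, name, email) VALUES (?, ?, ?);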
where each ? denotes a part of the string that varies between requests.
This query string with question marks instead of real values is actually also valid
CQL! While it can’t be executed as is (because some of the values are not known), it can
be prepared.
Preparing such a statement means that the database will meticulously analyze the
string, parse it, and create an internal representation of the statement in its own memory.
Once done, a unique identifier is generated and sent back to the driver. The client can now
execute the statement by providing only its identifier (which is a 128-bit UUID7 in ScyllaDB)
and all the values missing from the prepared query string. The process of replacing
question marks with actual values is called binding and it’s the only thing that the database
needs to do instead of launching a CQL parser, which offers a significant speedup.
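In the Python driver, the whole flow might look like this sketch (the session and the source of migrated rows are assumed to exist):

# Sketch: prepare once, execute many times. The expensive parsing happens
# once, server-side, in prepare(); each execute() only binds values.
prepared = session.prepare(
    "INSERT INTO ks.users (id, name, email) VALUES (?, ?, ?)"
)
for user_id, name, email in rows_to_migrate:   # any iterable of tuples
    session.execute(prepared, (user_id, name, email))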
Preparing statements without care can also be detrimental to overall cluster
performance though. When a statement gets prepared, the database needs to keep a
certain amount of information about it in memory, which is hardly a limitless resource.
Caches for prepared statements are usually relatively small, under the assumption that
the driver’s users (app developers) are kind and only prepare queries that are used
frequently. If, on the other hand, a user were to prepare lots of unique statements that
aren’t going to be reused any time soon, the database cache might invalidate existing
entries for frequently used queries. The exact heuristics of how entries are invalidated
depends on the algorithm used in the cache, but a naive LRU (least recently used)
eviction policy is susceptible to this problem. Therefore, other cache algorithms resilient
to such edge cases should be considered when designing a cache without full information
about expected usage patterns. Some notable examples include the following:
7. See the memo "A Universally Unique IDentifier (UUID) URN Namespace" at https://www.ietf.org/rfc/rfc4122.txt.
Finally, regardless of the algorithm used for cache eviction implemented server-side,
drivers should take care not to prepare queries too aggressively, especially if it happens
automatically, which is often the case in ORMs (object-relational mappings). Making
an interface convenient for the user may sound tempting, and developer experience is
indeed an important factor when designing a driver, but being too eager with reserving
precious database resources may be disadvantageous in the long term.
Query Locality
In distributed systems, any kind of locality is welcome because it reduces the chances of
failure, keeps the latency low, and generally prevents many undesirable events. While
database clients, and thus also drivers, do not usually share the same machines with
the database cluster, it is possible to keep the distance between them short. “Distance”
might mean either a physical measure or the number of intermediary devices in the
network topology. Either way, for latency’s sake, it’s good to minimize it between parties
that need to communicate with each other frequently.
Many database management systems allow their clients to announce their
“location,” for example, by declaring which datacenter is their local, default one. Drivers
should take that information into account when communicating with the database
nodes. As long as all consistency requirements are fulfilled, it’s usually better to send
data directly to a nearby node, under the assumption that it will spend less time in
transit. Short routes also usually imply fewer middlemen, and that in turn translates to
fewer potential points of failure.
Drivers can make much more educated choices though. Quite a few NoSQL
databases can be described as “distributed hash tables” because they partition their
data and spread it across multiple nodes which own a particular set of hashes. If the
hashing algorithm is well known and deterministic, drivers can leverage that fact to try
to optimize the queries even further—sending data directly to the appropriate node, or
even the appropriate CPU core.
1. A request arrives.
However, in certain cases, the driver can compute the token locally on its own, and
then use the cluster topology information to route the request straight to the owning
node. This local node-level routing saves at least one network round-trip as well as the
CPU time of some of the nodes.
8. A token is how a hash value is named in Cassandra nomenclature.
In the Cassandra/ScyllaDB case, this is possible because each table has a well-defined "partitioner," which simply means a hash function implementation. The default choice, used in Cassandra, is murmur3,9 which returns a 64-bit hash value, offers a satisfactory distribution, and is relatively cheap to compute. ScyllaDB takes it one step further and allows the drivers to calculate which CPU core of which database node owns a particular datum. When a driver is cooperative and proactively establishes a separate connection per each core of each machine, it can send the data not only to the right node, but straight to the single CPU core responsible for handling it. This not only saves network bandwidth, but is also very friendly to CPU caches.
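In the Python driver, for example, such routing is enabled with a token-aware load-balancing policy (addresses and the datacenter name are illustrative):

# Sketch: token-aware routing; the driver hashes the partition key locally
# and sends each request straight to a replica that owns the data.
from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

cluster = Cluster(
    ["10.0.0.1", "10.0.0.2"],
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc1")
    ),
)
session = cluster.connect()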
9. See the DataStax documentation on Murmur3Partitioner (https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/architecture/archPartitionerM3P.html).
Figure 5-7. Shard-aware clients route queries to the correct node(s) + core
Retries
In a perfect system, no request ever fails and logic implemented in the drivers can
be kept clean and minimal. In the real world, failures happen disturbingly often, so
the drivers should also be ready to deal with them. One such mechanism for failure
tolerance is a driver’s retry policy. A retry policy’s job is to decide whether a request
should be sent again because it failed (or at least the driver strongly suspects that it did).
Error Categories
Before diving into techniques for retrying requests in a smart way, there’s a more
fundamental question to consider: does a retry even make sense? The answer is not that
obvious and it depends on many internal and external factors. When a request fails, the
error can fall into the following categories, presented with a few examples:
1. Timeouts
a. Read timeouts
b. Write timeouts
2. Temporary errors
3. Permanent errors
a. Incorrect query syntax
b. Authentication errors
c. Insufficient permissions
Depending on the category, the retry decision may be vastly different. For instance,
it makes absolutely no sense to retry a request that has incorrect syntax. It will not
magically start being correct, and such a retry attempt would only waste bandwidth and
database resources.
Idempotence
Error categories aside, retry policy must also consider one important trait of the request
itself: its idempotence. An idempotent request can be safely applied multiple times, and
the result will be indistinguishable from applying it just once.
Why does this need to be taken into account at all? For certain classes of errors, the
driver cannot be sure whether the request actually succeeded. A prime example of such
error is a timeout. The fact that the driver did not manage to get a response in time does
not mean that the server did not successfully process the request. It’s a similar situation
if the network connection goes down: The driver won’t know if the database server
actually managed to apply the request.
When in doubt, the driver should make an educated guess in order to ensure
consistency. Imagine a request that withdraws $100 from somebody’s bank account. You
certainly don’t want to retry the same request again if you’re not absolutely sure that
it failed; otherwise, the bank customer might become a bit resentful. This is a perfect
example of a non-idempotent request: Applying it multiple times changes the ultimate
outcome.
Fortunately, there’s a large subset of idempotent queries that can be safely retried,
even when it’s unclear whether they already succeeded:
1. Read-only requests
Since they do not modify any data, they won’t have any side
effects, no matter how often they’re retried.
In general, it's a good idea for drivers to give users an opportunity to declare their requests' idempotence explicitly. Some queries can be trivially deduced to be idempotent by the driver (e.g., a read-only SELECT statement in the database world), but others may be less obvious. For example, a conditional update is idempotent if the value it guards is never decremented, but not in the general case: If another request may decrement the value in the meantime, a retried conditional update can observe a different state and produce a different outcome.
Since it’s often impossible to guess if a request is idempotent just by analyzing its
contents, it’s best for drivers to have a set_idempotent() function exposed in their
API. It allows the users to explicitly mark some queries as idempotent, and then the logic
implemented in the driver can assume that it’s safe to retry such a request when the
need arises.
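For instance, the Python driver exposes this as a statement attribute (the query is illustrative):

# Sketch: declaring idempotence explicitly so retry and speculative
# execution policies know this statement is safe to resend.
from cassandra.query import SimpleStatement

stmt = SimpleStatement(
    "UPDATE ks.users SET email = 'john@example.com' WHERE id = 42",
    is_idempotent=True,
)
session.execute(stmt)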
Retry Policies
Finally, there’s enough context to discuss actual retry policies that a database driver
could implement. The sole job of a retry policy is to analyze a failed query and return a
decision. This decision depends on the database system and its intrinsics, but it’s often
one of the following (see Figure 5-8):
• Do not retry
• Retry on the same node
• Retry on a different node
• Retry later (on the same or a different node)
Deciding not to retry is often a decent choice; it's the only correct one when the driver isn't certain whether a non-idempotent query really failed or just timed out. It's also the obvious choice for permanent errors; there's no point in retrying a request that was previously refused due to incorrect syntax. And whenever the system is overloaded, the "do not retry" approach might help the entire cluster. Although the immediate effect (preventing a user's request from being driven to completion) is not desirable, it provides a level of overload protection that might pay off in the future, because it prevents the overload condition from continuing to escalate. Once a node gets too much traffic, it refuses more requests; if drivers blindly respond with retries, the rate of incoming requests only grows, and the whole system ends up in a vicious circle.
Retrying on the same database node is generally a good option for timeouts.
Assuming that the request is idempotent, the same node can probably resolve potential
conflicts faster. Retrying on a different node is a good idea if the previous node showed
symptoms of overload, or had an input/output error that indicated a temporary issue.
Finally, in certain cases, it’s a good idea to delay the retry instead of firing it off
immediately (see Figure 5-9).
When the whole cluster shows the symptoms of overload—be it high reported CPU
usage or perceived increased latency—retrying immediately after a request failed may
only exacerbate the problem. What a driver can do instead is apply a gentle backoff
algorithm, giving the database cluster time to recover. Remember that even a failed retry
costs resources: networking, CPU, and memory. Therefore, it’s better to balance the costs
and chances for success in a reasonable manner.
The three most common backoff strategies are constant, linear, and exponential
backoff, as visualized in Figure 5-10.
The first type (constant) simply waits a certain predefined amount of time before
retrying. Linear backoff increases the time between attempts in a linear fashion; it
could wait one second before the first attempt, two seconds before the second one,
and so forth. Finally, exponential backoff, arguably the most commonly used method,
increases the delay by multiplying it by a constant each time. Usually it just doubles
it—because both processors and developers love multiplying and dividing by two (the
latter ones mostly just to show off their intricate knowledge of the bitwise shift operator).
Exponential backoff has especially nice characteristics for overload prevention. The retry
rate drops exponentially, and so does the pressure that the driver places on the database
cluster.
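A sketch of the three strategies as delay generators (base delays and the cap are illustrative):

# Sketch: constant, linear, and exponential backoff as delay generators.
def constant_backoff(base=1.0):
    while True:
        yield base                        # 1, 1, 1, ...

def linear_backoff(base=1.0):
    delay = base
    while True:
        yield delay                       # 1, 2, 3, ...
        delay += base

def exponential_backoff(base=1.0, cap=60.0):
    delay = base
    while True:
        yield delay                       # 1, 2, 4, 8, ..., capped
        delay = min(delay * 2, cap)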
Paging
Databases usually store amounts of data that are orders of magnitude larger than a single
client machine could handle. If you fetch all available records, the result is unlikely to fit
into your local disks, not to mention your available RAM. Nonetheless, there are many
valid cases for processing large amounts of data, such as analyzing logs or searching for
specific documents. It is quite acceptable to ask the database to serve up all the data it
has—but you probably want it to deliver that data in smaller bits.
That technique is customarily called paging, and it is ubiquitous. It’s exactly what
you’ve experienced when browsing through page 17 of Google search results in futile
search for an answer to a question that was asked only on an inactive forum seven years
ago—or getting all the way to page 24 of eBay listings, hunting for that single perfect offer.
Databases and their drivers also implement paging as a mechanism beneficial for both
parties. Drivers get their data in smaller chunks, which can be done with lower latency.
And databases receive smaller queries, which helps with cache management, workload
prioritization, memory usage, and so on.
Different database models may have a different view of exactly what paging involves
and how you interface with it. Some systems may offer fine-grained control, which
allows you to ask for “page 16” of your data. Others are “forward-only”: They reduce the
user-facing interface to “here’s the current page—you can ask for the next page if you
want.” Your ability to control the page size also varies. Sometimes it’s possible to specify
the size in terms of a number of database records or bytes. In other cases, the page size
is fixed.
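In the Python driver, for example, forward-only paging boils down to a sketch like this (the table name, page size, and handle() callback are illustrative):

# Sketch: paging in the Python CQL driver. fetch_size sets the page size
# in rows; iterating the result transparently fetches further pages.
from cassandra.query import SimpleStatement

stmt = SimpleStatement("SELECT * FROM ks.events", fetch_size=1000)
for row in session.execute(stmt):
    handle(row)                  # your own per-row processing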
Concurrency
In many cases, the only way to utilize a database to the fullest—and achieve optimal
performance—is to also achieve high concurrency. That often requires the drivers to
perform many I/O operations at the same time, and that’s in turn customarily achieved
by issuing asynchronous tasks. That being said, let’s take quite a few steps back to
explain what that really means and what’s involved in achieving that from both a
hardware and software perspective.
Note High concurrency is not a silver bullet. When it’s too high, it’s easy
to overload the system and ruin the quality of service for other users—see
Figure 5-11 for its effect on latency. Chapter 1 includes a cautionary tale on what
can happen when concurrency gets out of bounds and Chapter 2 also touches on
the dangers of unbounded concurrency.
Modern Hardware
Back in the old days, making decisions around I/O concurrency was easy because magnetic storage drives (HDDs) had an effective concurrency of 1. There was (usually) only a single actuator arm to navigate the platters, so only a single sector of data could be read at a time. Then the SSD revolution happened. Suddenly, disks could read from multiple offsets concurrently. Moreover, it became next to impossible to fully utilize a disk (i.e., to read and write at the speeds advertised in shiny numbers printed on its label) without actually asking for multiple operations to be performed concurrently. Now, with enterprise-grade NVMe drives and inventions like Intel Optane,11 concurrency is a major factor when benchmarking input/output devices. See Figure 5-11.
11. High-speed persistent memory, sadly discontinued in 2021.
12. RSS (receive-side scaling) allows directing traffic from specific queues directly into chosen CPUs.
13. Terabits per second.
14. See the "Efficient IO with io_uring" article (https://kernel.dk/io_uring.pdf).
Modern Software
How could modern software adapt to the new, highly concurrent era? Historically,
a popular model of ensuring that multiple operations can be performed at the same
time was to keep a pool of operating system threads, with each thread having its own
queue of tasks. That only scales in a limited way though, so now the industry leans
toward so-called “green threads,” which are conceptually similar to their operating
system namesakes, but are instead implemented in userspace, in a much more
lightweight manner.
For example, in Seastar (a high-performance asynchronous framework implemented
in C++ and based on a future-promise model15), there are quite a few ways of expressing
a single flow of execution, which could be called a green thread. A fiber of execution can
be created by chaining futures, and you can also use the C++ coroutines mechanism to
build asynchronous programs in a clean way, with the compiler assisting in making the
code async-friendly.
In the Rust language, the asynchronous model is quite unique. There, a future
represents the computation, and it’s the programmer’s responsibility to advance the
state of this asynchronous state machine. Other languages, like JavaScript, Go, and Java,
also come with well-defined and standardized support for asynchronous programming.
This async programming support is good, because database drivers are prime
examples of software that should support asynchronous operations from day one.
Drivers are generally responsible for communicating over the network with highly
specialized database clusters, capable of performing lots of I/O operations at the same
time. We can’t emphasize enough that high concurrency is the only way to utilize the
database to the fullest. Asynchronous code makes that substantially easier because it
allows high levels of concurrency to be achieved without straining the local resources.
Green threads are lightweight and there can be thousands of them even on a consumer-
grade laptop. Asynchronous I/O is a perfect fit for this use case as well because it allows
efficiently sending thousands of requests over the network in parallel, without blocking
the CPU and forcing it to wait for any of the operations to complete, which was a known
bottleneck in the legacy threadpool model.
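As a small sketch of the idea in Python's asyncio (the fake fetch() stands in for a real asynchronous driver call):

# Sketch: high but bounded concurrency with lightweight asynchronous tasks.
import asyncio

async def fetch(query):
    await asyncio.sleep(0.01)             # placeholder for a driver call
    return f"result of {query}"

async def run_all(queries, max_in_flight=128):
    sem = asyncio.Semaphore(max_in_flight)
    async def one(query):
        async with sem:                   # cap the number of in-flight tasks
            return await fetch(query)
    return await asyncio.gather(*(one(q) for q in queries))

results = asyncio.run(run_all([f"q{i}" for i in range(1000)]))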
15. See the Seastar documentation on futures and promises (https://seastar.io/futures-promises/).
1. Clear documentation
3. Asynchronous API
5. Database-specific optimizations
Summary
This chapter provided insights into how the choice of a database driver impacts
performance and highlighted considerations to keep in mind when selecting a driver.
Drivers are often an overlooked part of a distributed system. That’s a shame because
drivers are so close to database users, both physically and figuratively! Proximity is an
extremely important factor in all networked systems because it directly translates to
latency. The next chapter ponders proximity from a subtly different point of view: How to
get the data itself closer to the application users.
CHAPTER 6
Getting Data Closer
the results to the users, or some middleware, for further processing. It makes even more
sense when you consider that database nodes generally run on powerful hardware,
with lots of RAM and fast I/O devices. This usually translates to formidable CPU power.
Dedicated large data processing frameworks aside (e.g., Apache Spark, which is out of
scope for this book), regular database engines almost always support some level of user-
defined computations. These can be classified into two major categories: user-defined functions/procedures and user-defined aggregates.
Note that the definitions vary. Some database vendors use the general name
“functions” to mean both aggregate and scalar functions. Others actually mean
“scalar functions” when they reference “functions,” and use the name “aggregates” for
“aggregate functions.” That’s the convention applied to this chapter.
The first on the list is the least flexible, offers the worst developer experience, and
has the lowest security risk. The last has the most flexibility, offers the best developer
experience, and also harbors the most potential for being a security risk worthy of its
own CVE number.
Scalar functions are usually invoked per each row, at least for row-oriented
databases, which is usually the case for SQL. You might wonder if the computations can’t
simply be performed by end users on their machines. That’s a valid point. The main
advantage of that approach is fantastic scalability regardless of how many users perform
data transformations (if they do it locally on their own machines, then the database
cluster does not get overloaded).
There are nonetheless several great reasons to push the computations closer to where the data is stored.
1. It's also a great example of the CVE risk; see https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44521 and https://jfrog.com/blog/cve-2021-44521-exploiting-apache-cassandra-user-defined-functions-for-remote-code-execution/.
Determinism
In distributed environments, idempotence (discussed in Chapter 5) is an important
attribute that makes it possible to send requests in a speculative manner, potentially
increasing performance. Thus, it is better to make sure that user-defined functions are
deterministic. In other words, a user-defined function’s value should only depend on
the value of its arguments, and not on the value of any external factors like time, date,
pseudo-random seed, and so on.
A perfect example of a non-deterministic function is now(). Calling it twice might
yield the same value if you’re fast enough, but it’s generally not guaranteed since its
result is time-dependent. If possible, it’s a good idea to program the user-defined
functions in a deterministic way and mark them as such. For time/date, this might
involve computing the results based on a timestamp passed as a parameter rather than
using built-in time utilities. For pseudo-random sampling, the seed could also be passed
as a parameter, as opposed to relying on sources of entropy provided by the user-defined
function runtime.
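A sketch of the difference (function names and the bucketing logic are illustrative):

# Sketch: a non-deterministic function vs. a deterministic rewrite that
# takes the timestamp and seed as explicit arguments.
import random, time

def sample_nondeterministic(user_id):
    return (user_id + int(time.time())) % 10    # result depends on "now"

def sample_deterministic(user_id, ts, seed):
    rng = random.Random(hash((user_id, int(ts), seed)))  # argument-driven only
    return rng.randrange(10)                    # same inputs, same output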
Latency
Running user-provided code on your database clusters is potentially dangerous in
aspects other than security. Most embedded languages are Turing-complete, and
customarily allow the developers to use loops, recursion, and other similar techniques
in their code. That's risky. An undetected infinite loop may serve as a denial-of-service attack, forcing the database servers to endlessly process a function and starve other tasks of resources. And even if the user-defined function author did not have malicious intentions, some computations simply consume a lot of CPU time and memory.
In a way, a user-defined function should be thought of as a potential “noisy
neighbor”2 and its resources should be as limited as possible. For some use cases,
a simple hard limit on memory and CPU time used is enough to ensure that the
performance of other database tasks does not suffer from a “noisy” user-defined
function. However, sometimes a more specific solution is required—for example, splitting a user-defined function's execution into smaller time slices, assigning priorities to user-defined functions, and so on.
One interesting metering mechanism was applied by Wasmtime,3 a WebAssembly
runtime. Code running in a WebAssembly instance consumes fuel,4 a synthetic unit used
for tracking how fast an instance exhausts system resources. When an instance runs out
of fuel, the runtime does one of the preconfigured actions—either “refills” and lets the
code execution continue or decides that the task reached its quota and terminates it.
2. See the Microsoft Azure documentation on the Noisy Neighbor antipattern (https://learn.microsoft.com/en-us/azure/architecture/antipatterns/noisy-neighbor/noisy-neighbor).
3. See the Bytecode Alliance documentation at https://wasmtime.dev.
4. See the Wasmtime docs (https://docs.wasmtime.dev/api/wasmtime/struct.Store.html#method.fuel_consumed).
JIT is a very powerful tool, but it’s not a silver bullet—compilation and additional
optimization can be an expensive process in terms of resources. A small user-defined
function may take less than a millisecond to run, but recompiling it can cause a
sudden spike in CPU and memory usage, as well as a multi-millisecond delay in the
processing—resulting in high tail latency. It should therefore be a conscious decision to
either enable just-in-time compilation for user-defined functions if the language allows
it, or disable it altogether.
Examples
Let’s take a look at a few examples of user-defined functions. The function serving as
the example operates on floating point numbers; given two parameters, it returns the
sum of them, inverted. Given 5 and 7, it should return 1 5 + 1 7 , which is approximately
0.34285714285.
Here’s how it could be defined in Apache Cassandra, which allows user-defined
function definitions to be provided in Java, its native language, as well as in other
languages:
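A definition along the following lines matches the walkthrough below (the function name inverted_sum is illustrative):

CREATE OR REPLACE FUNCTION inverted_sum(a double, b double)
    RETURNS NULL ON NULL INPUT
    RETURNS double
    LANGUAGE java
    AS '
        if (a == 0 || b == 0) return Double.NaN;
        return 1/a + 1/b;
    ';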
Let’s take a closer look at the definition. The first line is straightforward: it includes
the function’s name, parameters, and its types. It also specifies that if a function
definition with that name already exists, it should be replaced. Next, it explicitly declares
what happens if any of the parameters is null, which is a valid value for any type. The
function can either return null without calling the function at all or allow null and let
the source code handle it explicitly (the syntax for that is CALLED ON NULL INPUT). This
explicit declaration is required by Apache Cassandra.
That declaration is then followed by the return type and chosen language—from
which you can correctly deduce that multiple languages are supported. Then comes
the function body. The only non-obvious decision made by the programmer was how
to handle 0 as a parameter. Since the type system implemented in Apache Cassandra
already handles NaN,5 it’s a decent candidate (next to positive/negative infinity).
The newly created function can be easily tested by creating a table, filling it with a
few values, and inspecting the result:
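For instance, assuming the illustrative inverted_sum function from earlier:

CREATE TABLE test_table (id int PRIMARY KEY, a double, b double);
INSERT INTO test_table (id, a, b) VALUES (1, 5, 7);
INSERT INTO test_table (id, a, b) VALUES (2, 2, 0);  -- exercises the 0 -> NaN branch
SELECT a, b, inverted_sum(a, b) FROM test_table;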
5. Not-a-number.
Best Practices
Before you learn about user-defined aggregates, which unleash the true potential of user-defined functions, it's important to keep in mind a few best practices for setting up user-defined functions in your database management system.
User-Defined Aggregates
The greatest potential for user-defined functions lies in them being building blocks for
user-defined aggregates. Aggregate functions operate on multiple rows or columns,
sometimes on entire tables or databases.
Moving this kind of operation closer to where the data lies makes perfect sense.
Imagine 1TB worth of database rows that need to be aggregated into a single value: the
sum of their values. When a thousand users request all these rows in order to perform the aggregation client-side, each of them has to download a terabyte of data.
If the aggregation is performed by the database servers, it not only avoids a petabyte
of traffic; it also saves computing power for the users (which is a considerably greener
solution). If the computation is properly cached, it only needs to be performed once.
This is a major win in terms of performance, and many use cases can immediately
benefit from pushing the aggregate computations closer to the data. This is especially
important for analytic workloads that tend to process large volumes of data in order to
produce useful statistics and feedback—a process that is its own type of aggregation.
Built-In Aggregates
Databases that allow creating user-defined aggregates usually also provide a few
traditional built-in aggregation functions: the (in)famous COUNT(*), but also MAX, MIN,
SUM, AVG, and others. Such functions take into account multiple rows or values and return
an aggregated result. The result may be a single value. Or, it could also be a set of values
if the input is divided into smaller classes. One example of such an operation is SQL’s
GROUP BY statement, which applies the aggregation to multiple disjoint groups of values.
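For example, in SQL (table and column names are illustrative):

SELECT category, AVG(price) AS avg_price
FROM products
GROUP BY category;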
Built-in aggregates should be preferred over user-defined ones whenever possible—
they are likely written in the language native to the database server, already optimized,
and secure. Still, the set of predefined aggregate functions is often very basic and doesn’t
allow users to perform the complex computations that make user-defined aggregates
such a powerful tool.
Components
User-defined aggregates are customarily built on top of user-defined scalar functions.
The details heavily depend on the database system, but the following components are
definitely worth mentioning.
Initial Value
An aggregation needs to start somewhere, and it’s up to the user to provide an initial
value from which the final result will eventually be computed. In the case of the COUNT
function, which returns the number of rows or values in a table, a natural candidate
for the initial value is 0. In the case of AVG, which computes the arithmetic mean from
all column values, the initial state could consist of two variables: The total number of
values, initialized to 0, and the total sum of values, also initialized to 0.
Final Function
The final function is an optional feature for user-defined aggregates. Its sole purpose is
to transform the final state of the aggregation to something else. For COUNT, no further
transformations are required. The user is simply interested in the final state of the
aggregation (the number of values), so the final function doesn’t need to be present; it
can be assumed to be an identity function. However, in the case of AVG, the final function
is what makes the result useful to the user. It transforms the final state—the total number
of values and its total sum—and produces the arithmetic mean by simply dividing one
by the other, handling the special case of avoiding dividing by zero.
Reduce Function
The reduce function is an interesting optional addition to the user-defined aggregates
world, especially for distributed databases. It can be thought of as another state
transition function, but one that can combine two partial states into one.
With the help of a reduce function, computations of the user-defined aggregate
can be distributed to multiple database nodes, in a map-reduce6 fashion. This, in turn,
can bring massive performance gains, because the computations suddenly become
concurrent. Note that this optimization is not always possible—if the state transition
function is not commutative, distributing the partial computations may yield an
incorrect result.
In order to better imagine what a reduce function can look like, let’s go back to the
AVG example. A partial state for AVG can be represented as (n, s), where n is the number
of values, and s is the sum of them. Reducing two partial states into the new valid state
can be performed by simply adding the corresponding values: (n1, s1) + (n2, s2) → (n1+
n2, s1 + s2). An optional reduce function can be defined (e.g., in ScyllaDB’s user-defined
aggregate implementation7).
The user-defined aggregates support is not standardized among database vendors
and each database has its own quirks and implementation details. For instance, in
PostgreSQL, you can also implement a “moving” aggregate8 by providing yet another set
of functions and parameters: msfunc, minvfunc, mstype, and minitcond. Still, the general
idea remains unchanged: Let the users push aggregation logic as close to the data as
possible.
Examples
Let’s create a custom integer arithmetic mean implementation in PostgreSQL.
That's done by providing a state transition function (called sfunc in PostgreSQL nomenclature), a final function (finalfunc), an initial value (initcond), and the state type (stype). All of the functions will be implemented in SQL, PostgreSQL's native query language.
6. MapReduce is a framework for processing parallelizable problems across large datasets.
7. See the ScyllaDB documentation on ScyllaDB CQL Extensions (https://github.com/scylladb/scylladb/blob/master/docs/cql/cql-extensions.md#reducefunc-for-uda).
8. See the PostgreSQL documentation on User-Defined Aggregates (https://www.postgresql.org/docs/current/xaggr.html#XAGGR-MOVING-AGGREGATES).
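State Transition Function
The state transition function is called once per input value and folds that value into the accumulated state. A sketch, assuming the state is a two-element integer array holding the running sum and the running count (the function name is illustrative):

CREATE OR REPLACE FUNCTION avg_accumulate(acc int[], val int)
RETURNS int[]
LANGUAGE SQL
AS $$
    -- acc[1] holds the running sum, acc[2] the running count
    SELECT ARRAY[acc[1] + val, acc[2] + 1];
$$;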
Final Function
The final function divides the total sum of values by the total count of them, special-
casing an average of 0 values, which should be just 0. The final function returns a
floating point number because that’s how the aggregate function is going to represent an
arithmetic mean.
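A sketch of such a final function, matching the state layout above (the name avg_final is illustrative):

CREATE OR REPLACE FUNCTION avg_final(acc int[])
RETURNS float8
LANGUAGE SQL
AS $$
    -- An average of zero values is defined here as 0, avoiding division by zero
    SELECT CASE WHEN acc[2] = 0 THEN 0
                ELSE acc[1]::float8 / acc[2]
           END;
$$;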
Aggregate Definition
With all the building blocks in place, the user-defined aggregate can now be declared:
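A sketch, wiring together the functions defined above (the aggregate name alt_avg is illustrative):

CREATE AGGREGATE alt_avg(int) (
    sfunc     = avg_accumulate,
    stype     = int[],
    finalfunc = avg_final,
    initcond  = '{0, 0}'
);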
In addition to declaring the state transition function and the final function, the state
type is also declared to be an array of integers (which will always keep two values in the
implementation), as well as the initial condition that sets both counters, the total sum
and the total number of values, to 0.
That’s it! Since the AVG aggregate for integers happens to be built-in, that gives you
the perfect opportunity to validate if the implementation is correct:
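For example (with an illustrative table):

CREATE TABLE readings (v int);
INSERT INTO readings VALUES (2), (4), (9);
-- Both columns should report the same arithmetic mean: 5
SELECT avg(v), alt_avg(v) FROM readings;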
Voilà. Remember that while creating an alternative implementation for AVG is a great
academic example of user-defined aggregates, for production use it’s almost always
better to stick to the built-in aggregates whenever they’re available.
In ScyllaDB, an equivalent state transition function can be declared as follows (the signature shown here, including the tuple<bigint, bigint> state type, is an assumption):

CREATE OR REPLACE FUNCTION avg_accumulate(acc tuple<bigint, bigint>, val int)
RETURNS NULL ON NULL INPUT
RETURNS tuple<bigint, bigint>
LANGUAGE lua
AS $$
return { acc[1]+val, acc[2]+1 }
$$;
ScyllaDB's native query language, CQL, is extremely similar to SQL, even in its acronym. It's easy to see that most of the source code corresponds to the PostgreSQL implementation from the previous section. ScyllaDB does not allow defining user-defined functions in CQL, but it does support Lua, a popular lightweight embeddable language, as well as WebAssembly. Since this book is expected to be read mostly by human beings (and occasionally ChatGPT once it achieves full consciousness), Lua was chosen for this example because it's much more concise.
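A sketch of the corresponding aggregate declaration in ScyllaDB, including the reduce function (all names, the state type, and the INITCOND syntax are assumptions; a final function analogous to PostgreSQL's avg_final is presumed to be defined in Lua as well):

CREATE OR REPLACE FUNCTION avg_reduce(acc1 tuple<bigint, bigint>, acc2 tuple<bigint, bigint>)
RETURNS NULL ON NULL INPUT
RETURNS tuple<bigint, bigint>
LANGUAGE lua
AS $$
-- Combine two partial states by adding their sums and their counts
return { acc1[1] + acc2[1], acc1[2] + acc2[2] }
$$;

CREATE OR REPLACE AGGREGATE alt_avg(int)
SFUNC avg_accumulate
STYPE tuple<bigint, bigint>
REDUCEFUNC avg_reduce
FINALFUNC avg_final
INITCOND (0, 0);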
The most notable difference is the reduce function, declared in the aggregate under the REDUCEFUNC keyword. This function accepts two partial states and returns another (combined) state. When this function is present, ScyllaDB servers can do the following:
1. Divide the domain (e.g., all rows in the database) into multiple pieces, ask multiple servers to partially aggregate them, and then send back the result.
2. Merge the partial results into the final state by applying the reduce function.
Thus, by providing the reduce function, the user also allows ScyllaDB to compute
the aggregate concurrently on multiple machines. This can reduce the query execution
time by orders of magnitude compared to a large query that only gets executed on a
single server.
In this particular case, it might even be preferable to provide a user-defined alternative for a built-in function in order to increase its concurrency—unless the
built-in primitives also come with their reduce functions out of the box. That’s the case
in ScyllaDB, but not necessarily in other databases that offer similar capabilities.
Best Practices
1. If the computations can be efficiently represented with built-
in aggregates, do so—or at least benchmark whether a custom
implementation is any faster. User-defined aggregates are very
expressive, but usually come with a cost of overhead compared to
built-in implementations.
Runtime
WebAssembly is compiled to bytecode. This bytecode is designed to run on a virtual
machine, which is usually part of a larger development environment called a runtime.
There are multiple implementations of WebAssembly runtimes, most notably:
• Wasmtime (https://wasmtime.dev/): A fast and secure runtime for WebAssembly, implemented in Rust, backed by the Bytecode Alliance10 nonprofit organization.
• Wasmer (https://wasmer.io/)
• WasmEdge (https://wasmedge.org/)
• V8 (https://v8.dev/)
Also, since the WebAssembly specification is public, feel free to implement your own!
Beware though: The standard is still in heavy development, changing rapidly every day.
10. https://bytecodealliance.org/
Back to Latency
Each runtime is free to define its own performance characteristics and guarantees. One
interesting feature introduced in Wasmtime is the concept of fuel, already mentioned in
the earlier discussion of user-defined functions. Combined with the fact that Wasmtime
provides an optional asynchronous interface for running WebAssembly modules, it gives
users an opportunity to fine-tune the runtime to their latency requirements.
When Wasmtime starts executing a given WebAssembly function, this unit of
execution is assigned a certain amount of fuel. Each execution step exhausts a small
amount of fuel—at the time of writing this paragraph, it simply consumes one unit of fuel
on each WebAssembly bytecode instruction, excluding a few flow control instructions
like branching. Once the execution unit runs out of fuel, it yields. After that happens, one
of the preconfigured actions is taken: either the execution unit is terminated, or its tank
gets refilled and it’s allowed to get back to whatever it was computing. This mechanism
allows the developer to control not only the total amount of CPU time that a single
function execution can take, but also how often the execution should yield and hand
over the CPU for other tasks. Thus, configuring fuel management the right way prevents
function executions from taking over the CPU for too long. That helps maintain low,
predictable latency in the whole system.
Another interesting aspect of WebAssembly is its portability. The fact that the
code can be distributed to multiple places and it’s guaranteed to run properly in
multiple environments makes it a great candidate for moving not only data, but also
computations, closer to the user.
Pushing the database logic from enormous datacenters to smaller ones, located
closer to end users, got its own buzzy name: edge computing.
Edge Computing
Since the Internet of Things (IoT) became a thing, the term edge computing needs
disambiguation. This section is (unfortunately?) not about:
• Creating a data mesh from your local network of Bluetooth light bulbs
The edge described in this section is of a more boring kind. It still means
performing computations on servers, but on ones closer to the user (e.g., located in a
local Equinix datacenter in Warsaw, rather than Amazon’s eu-central-1 in Frankfurt).
Performance
What does edge computing have to do with database performance? It brings the data
closer to the user, and closer physical distance translates to lower latency. On the other
hand, having your database cluster distributed to multiple locations has its downsides
as well. Moving large amounts of data between those regions might be costly, as cloud
vendors tend to charge for cross-region traffic. If the latency between database nodes
reaches hundreds of milliseconds, which is the consumer-grade latency between North America and Europe (unless you can afford Hibernia Express12), the replicas can easily get out of sync. Even a few round-trips—and distributed consensus algorithms alone
require at least two—can cause delays that exceed the comfort zone of one second.
Failure detection mechanisms are also affected since packet loss occurs much more
often when the cluster spans multiple geographical locations.
Database drivers for edge-friendly databases need to be aware of all these limitations
mentioned. In particular, they need to be extra careful to pick the closest region
whenever possible, minimizing the latency and the chance of failure.
12. A submarine link between Canada, Ireland, and the UK, offering sub-60ms latency.
The concept of conflict-free replicated data types (CRDTs) gained traction along with edge computing because the two complement each other. The database is allowed to keep replicas in multiple places and
allows them to act without central coordination—but at the same time, users can assume
that eventually the database state is going to become consistent.
A few interesting data structures that fit the definition of CRDT are discussed next.
G-Counter
Grow-only counter. Usually implemented as an array of counters, keeping a local counter value for each database node. Two array states from different nodes can be merged by taking the maximum of each respective field. The actual value of the G-Counter is simply the sum of all local counters. For example, merging the states [3, 5, 2] and [4, 1, 2] yields [4, 5, 2], which represents a counter value of 4 + 5 + 2 = 11.
PN-Counter
Positive-Negative counter, brilliantly implemented by keeping two G-Counter
instances—one for accumulating positive values, the other for negative ones. The final
value is obtained by subtracting one from the other.
G-Set
Grow-only set, that is, one that forbids the removal of elements. Converging two G-Sets
is a simple set union since values are never removed from a G-Set. One flavor of G-Set is G-Map, where an entry (a key and the value associated with it) cannot be removed once added.
LWW-Set
Last-write-wins set (and map, accordingly). This is a combination of two G-Sets, one
gathering added elements and the other containing removed ones. Conflict resolution is
based on a set union of the “added” G-Set, minus the union of the “removed” G-Set, but
timestamps are also taken into account. A value exists if its timestamp in the “added” set
is larger than its timestamp in the “removed” set, or if it’s not present in the “removed”
set at all.
The list is obviously not exhaustive, and countless other CRDTs exist. You’re hereby
encouraged to do research on the topic if you find it interesting!
CRDTs are not just theoretical structures; they are very much used in practice.
Variants of conflict-free replicated data types are common among databases that offer
eventual consistency, like Apache Cassandra and ScyllaDB. Their writes have last-write-
wins semantics for conflict resolution, and their implementation of counters is based on
the idea of a PN-Counter.
Summary
At this point, it should be clear that there are a number of ways to improve
performance by using a database a bit unconventionally, as well as understanding
(and tapping) specialized capabilities built into the database and its drivers. Let’s
shift gears and look at the top “do’s and don’ts” that we recommend for ensuring that
your database is performing at its best. The next chapter begins this discussion by
focusing on infrastructure options (CPUs, memory, storage, and networking) and
deployment models.
CHAPTER 7
Infrastructure and Deployment Models
As noted in the previous chapter, many modern databases offer capabilities beyond
“just” storing and retrieving data. But all databases are ultimately built from the ground
up in order to serve I/O in the most efficient way possible. And it’s crucial to remember
this when selecting your infrastructure and deployment model of choice.
In theory, a database’s purpose is fairly simple: You submit a request and expect
to receive a response. But as you have seen in the previous chapters, an insane level of
engineering effort is spent on continuously enhancing and speeding up this process.
Very likely, years and years were dedicated to optimizing algorithms that may give
you a processing boost of a few CPU cycles, or minimizing the amount of memory
fragmentation, or reducing the amount of storage I/O needed to look up a specific set
of data. All these advancements, eventually, converge to create a database suitable for
performance at scale.
Regardless of your database selection, you may eventually hit a wall that no
engineering effort can break through: the database’s physical hardware. It makes very
little sense to have a solution engineered for performance when the hardware you throw
at it may be suboptimal. Similarly, a less performant database will likely be unable to
make efficient use of an abundance of available physical resources.
This chapter looks at critical considerations and tradeoffs when selecting CPUs,
memory, storage, and networking for your distributed database infrastructure. It
describes how different resources cooperate and how to configure the database to
deliver the best performance. Special attention is drawn to storage I/O as the most
difficult component to deal with. There’s also a close look at optimal cloud-based
deployments suitable for highly-performant distributed databases (given that these are
the deployment preference of most businesses).
Every database deployment ultimately depends on four major hardware resources:
• Storage
• CPU (cores)
• Memory (RAM)
• Network interfaces
Each could be a potential bottleneck for internal database latency: The delay from
when a request is received by the database (or a node in the database) and when the
database provides a response.
For example, writes to many databases will first hit a durable disk structure called the "commit log" and a memory structure that's named a "memtable." A write is considered successful only after both operations succeed.
On the other side of the spectrum, the database’s read path will typically also involve
several physical components. Assuming that you’re not using an in-memory database,
then the read path will start by checking whether the data you are looking for is present
within the database cache. But if it’s not, the database needs to look up and retrieve the
data from disk, de-serialize it, and then answer with the results.
Network also plays a crucial role throughout the entire process. When you write,
data needs to be rapidly replicated to other replicas. When you read, the database needs
to select the correct replicas (shards) containing the data that the application is after,
thus potentially having to communicate with other nodes in the cluster. Moreover,
strong consistency use cases always require the response of a majority of members for
an operation to be successful—so delayed responses from a replica can dramatically
increase the tail latency of a request routed to it.
Achieving Balance
Balance is key to any distributed system, including and beyond databases. It makes
very little sense to try to achieve 1 million operations per second (OPS) in a system that
has the fastest network link available but relies on very few CPUs. Similarly, it’s not very
efficient to purchase the most expensive and performant infrastructure for your solution
if your use case requires only 10K OPS.
Additionally, it’s important to recognize that a cluster imbalance can easily drag
down performance across your entire distributed system. This happens because a
distributed system cannot be faster than your slowest component—a fact that frequently
surprises people.
Here’s a real-life example. A customer reported elevated latencies affecting their
entire 18-node cluster. After collecting system information, we noticed that the majority
of their nodes were properly using locally-attached nonvolatile memory express (NVMe)
disks—except for one that had a software Redundant Array of Independent Disks (RAID)
with a mix of NVMes and network-attached disks. The customer clarified that they
were running out of storage space and decided to attach another disk in order to relieve
the problem. However, they weren’t aware that this introduced a ticking time bomb
into their entire cluster. Here’s a brief explanation of what happened from a technical
perspective:
1. The software RAID on that node could only be as fast as its slowest member: the network-attached disks.
2. Every I/O operation served by that node therefore took considerably longer to complete.
3. As more and more requests came in, all these delays eventually created a waiting queue on the replicas.
4. The queue kept growing, and so did the latencies of the requests waiting in it.
5. From that point on, the entire cluster speed was impeded by the speed of its slowest node: the one that had the slowest disk.
The next sections look at each of these resources in turn:
• Storage
• CPU (cores)
• Memory (RAM)
• Network interfaces
Storage
One of the fastest ways to undermine all your other performance optimizations is to send
every read and write operation through an unsuitable disk. Although recent technology
advancements greatly improved the performance of storage devices, disks are (by far)
still the slowest component in a computer system.
Disk performance is typically measured in two dimensions:
• Bandwidth: how many bytes per second the disk can transfer
• Latency: how long an individual I/O operation takes to complete
Database engineers obsess over optimizing disk access patterns with respect to those
two dimensions. People who are selecting, managing, or using a database should focus
on two additional disk considerations: the storage technology and the disk size.
Disk Types
Locally-attached NVMe Solid State Drives (SSDs) are the standard when latency is
critical. Compared with other bus interfaces, NVMe SSDs connected to a Peripheral
Component Interconnect Express (PCIe) interface will generally deliver lower latencies
than the Serial AT Attachment (SATA) interface. If your workload isn’t super latency
sensitive, you could also consider using disks via the SATA interface. But, definitely avoid
using network-attached disks if you expect single-digit millisecond latencies. Being
network attached, these disks require an additional hop to reach a storage server, and
that ends up increasing latency for every database request.
If your focus is on throughput and latency really doesn’t matter for your use case
(e.g., for moving data into a data warehouse), you might be able to get away with a
persistent disk—but it’s not recommended. By persistent disks, we mean durable
network storage devices that your VMs can access like physical disks, but are located
independently from your VMs. We’re not going to pick on any specific vendors, but a
little research should reveal issues like subpar performance and overall instability. If
you’re forced to work with persistent disks, be prepared to craft a creative solution.1
Hard disk drives (HDDs) can quickly become a bottleneck. Since SSDs are getting
progressively cheaper and cheaper, using HDDs is not recommended. Some workloads
may work with HDDs, especially if they play nice and minimize random seeks. An
example of an HDD-friendly workload is a write-mostly (98 percent writes) workload
with minimal random reads. If you decide to use HDDs, try to allocate a separate disk for
the commit log.
ScyllaDB published benchmarking results of several different storage devices—
demonstrating how they perform under extreme load simulating typical database access
patterns.2 For example, Figures 7-1 through 7-4 visualize the different performance
characteristics of two NVMes, a persistent disk, and an HDD.
1. For inspiration, consider Discord's approach—but recognize that this is certainly not a one-size-fits-all solution. It's described in their blog, "How Discord Supercharges Network Disks for Extreme Low Latency" (https://discord.com/blog/how-discord-supercharges-network-disks-for-extreme-low-latency).
2. You can find the results, as well as the tool to reproduce the results, at https://github.com/scylladb/diskplorer#sample-results.
Figure 7-1. NVMe bandwidth/latency graphs for an AWS i3.2xlarge instance type
Disk Setup
We hear a lot of questions about RAID setups. Hardware RAIDs are commonly used to
avoid outages introduced by disk failures. As a result, the RAID-5 (distributed parity)
setup is often used.
However, distributed databases typically have their own internal replication
mechanism to allow for business continuity and achieve high availability. Therefore,
RAID setups employing data mirroring or distributed parity have proven to be very
detrimental to disk I/O performance and are, fairly often, simply redundant. On top of that, we have found that some hardware RAID vendors deliver poor performance results.
Disk Size
When considering how much storage you need, be sure to account for your existing
data—replicated—plus your anticipated near-term data growth, and also leave sufficient
room for the overhead of internal operations (like compactions [for LSM-tree-based
databases], the commit log, backups, etc.).
As Chapter 8 discusses, the most common topology involves three replicas for each
dataset. Assume you have 5TB of raw data and use a replication factor of three:

5TB (raw data) × 3 (replication factor) = 15TB of storage across the cluster

But 15TB is just a starting point since there are other sizing criteria:
• What is your dataset’s growth rate? (How much do you ingest per
hour or day?)
• Will you store everything forever, or will you have an eviction process
(for example, based on Time To Live [TTL])?
You can model your data’s growth rate based on the number of users or endpoints
and how that number is expected to grow over time. Alternatively, data models are often
enriched over time, resulting in more data per source. Or your sampling rate may
increase. For example, your system may begin ingesting data every five seconds rather
than every minute. All of these considerations impact your data storage volume.
It’s strongly recommended that you select storage that’s suitable for where you
expect to end up after a certain time span. If you’re running your database on a public
cloud provider (self-managed or as a fully-managed Database-as-a-Service [DBaaS]),
you won’t need very much lead time to provision new hardware and expand your cluster.
However, for an on-premises hardware purchase, you may need to provision based on
your quarterly or annual budgeting process. You could also face delays due to the supply
chain disruptions that have become increasingly common.
Also, be sure to leave storage space for internal temporary operations such as
compaction, repairs, backups, and commit logs, as well as any other background process
that may temporarily introduce a space amplification. On the other hand, if you’re using
compression, be sure to factor in the amount of space that your selected compression
algorithm can save you.
Finally, recognize that every database has an ideal memory-to-storage ratio—for
example, a certain amount of TB or GB per node that it can support with optimal
performance. If this isn’t readily apparent in your database’s documentation, press your
vendor for their recommendation.
3. Lock-in: Once you are using a raw device, it’s extremely difficult
to move away from it. You can’t mount raw devices or query their
storage consumption via typical operating system mechanisms.
All of your disks need to be arranged in a certain way, and you
can’t easily go back to a regular filesystem.
Tip: If you have to choose one place to invest—whether CPU, storage, memory, or networking—we recommend splurging on storage. Everything else has evolved faster and better than storage. It still remains the slowest component in most systems.
Tiered Storage
Many use cases have different latency requirements for different sets of data. Similarly,
industries may see exponential storage utilization growth over time. It is not always
desirable, or even possible, to get rid of old data (for example, due to compliance
regulations, third-party contracts, or simply because it still carries relevance for the
business).
Teams with storage-heavy use cases often seek ways to minimize the costs of storage
consumption: by reducing the replication factor of their dataset, using less performant
(although cheaper) storage disks, or by employing a manual data rotation process from
faster to slower disks.
Tiered storage is a solution implemented by some databases in order to address most
of these concerns. It allows users to configure the database to use distinct storage tiers,
and to define which criteria the database should use to ensure that the data is correctly
replicated to its relevant tier. For example, MongoDB allows you to determine how data
is replicated to a specific storage tier by assigning different tier tags to shards, allowing
its balancer to migrate data between tiers automatically. On top of that, Atlas Online
Archive also allows the database to offload historical datasets to cloud storage.
CPUs (Cores)
Next is the CPU. As of this writing, you are probably looking at modern servers running
some reasonably modern Intel, AMD, or ARM chips, which are commonly found across
most cloud providers and enterprise hardware vendors. Along with storage, CPUs are
another compute resource that—if not correctly sized—may introduce contention to
your workload and impact your latencies. Clusters handling hundreds of thousands up
to millions of operations per second tend to get very high CPU loads.
More cores will generally mean better performance. This is important for achieving
optimal performance from databases that are architected to benefit from multithreading,
and it’s absolutely essential for databases that are architected with a shard-per-core
architecture—running a separate shard on each core in each server. In this case, the
more cores the CPU has, the more shards—and the better data distribution—the
database will have.
A combination of vendor recommendations and benchmarking (see Chapter 9)
can help you determine how much throughput each multicore chip can support. A
general recommendation is to avoid running production systems close to the CPU limits
and find the sweet spot between supporting your expected performance and leaving
room for throughput growth. On top of that, when doing benchmarking, remember
to also factor in background database operations that might be detrimental to your
performance. For example, Cassandra and Cassandra-compatible databases often
need to run repair: a weekly process to ensure data consistency across the cluster. This
process requires a lot of coordination and communication across the entire cluster. If
your workload is not properly sized to accommodate background database operations
and other events (such as node failures), your latency may increase to a level that
surprises you.
When using virtual machines, containers, or the public cloud, remember that each
virtual CPU is mapped to a single logical core, or thread. In many cloud deployments,
nodes are provided on a vCPU basis. The vCPU is typically a single hyperthread from
a dual hyperthread x86 physical core for Intel/AMD variants, or a single core for
ARM chips.
No matter what your deployment of choice involves, avoid overcommitting CPU
resources if performance is a priority. Doing so will prevent other guests from stealing
CPU time5 from your database.
Memory (RAM)
If you’re working with an in-memory database, having enough memory to hold your
entire dataset is an absolute must. But every database uses in-memory caching to some
extent. For example, some databases require enough memory space for indexes to avoid
expensive round-trips to storage disks. Others leverage an internal data cache to allow
for lower latencies when retrieving recently used data. Cassandra and Cassandra-like databases implement memtables, and some databases allow you to control which tables are served entirely from memory. The more memory the database has at its disposal, the better you can take advantage of those mechanisms. After all, even the fastest NVMe can't come close to the speed of RAM access.
5. For more on CPU steal time, see "Detecting CPU Steal Time in Guest Virtual Machines" by Jamie Fargen (https://opensource.com/article/20/1/cpu-steal-time).
In general, there is no blanket recommendation for “how much memory is enough”
for a database. Different vendors have different requirements and different use cases also
require different memory sizes. However, latency-sensitive use cases typically require
high memory footprints in order to achieve high cache hit rates and serve low-latency
read requests efficiently.
For example, a use case with a higher payload size requires a larger memory
footprint than one with a smaller payload size. Another interesting aspect to consider is
how frequently the use case in question reads data that may be present in memory (hot
data) as opposed to data that was never read (cold data). As mentioned in Chapter 2, the
latter can easily undermine your latencies.
Without a sufficient disk-to-memory ratio, you will be hitting your storage far more
than you probably want if you intend to keep your latencies low. The ideal ratio varies
from database to database since every caching implementation is different, so be sure
to ask your vendor for their specific recommendations. For example, ScyllaDB currently
recommends that for every 1GB of memory allocated to a node, you can store up to
100GB of data (so if you have 32GB of memory, you can handle around 3TB). The higher your storage-to-memory ratio gets, the less room you have for caching your total dataset.
Every database has some sort of hard physical limit. If you don’t have enough memory
and you have to run a workload on top of a very large dataset, it’s either going to be
rather slow or increase the risk of the database running out of memory.
Another ratio to keep in mind: memory per CPU core. At ScyllaDB, we recommend
at least 8GB of memory per CPU core for production purposes (because, given our
shared-nothing architecture, every shard works independently and has its own allocated
memory for caching). 8GB per vCPU is the same ratio used by most cloud providers for
NoSQL or Big Data-oriented instance types. Again, the recommended ratio will vary
across vendors, depending on the database’s specific internal cache implementation and
other implementation details. For example, in Cassandra and Cassandra-like databases,
part of the memory will be allocated for some of its SSTable-components in order to
speed up disk lookups when reading cold data. Aerospike will typically store all indexes
in RAM. And MongoDB, on average, requires 1GB of RAM per 100K assets.
Network
Lastly, you have to ensure that network I/O does not become a bottleneck. Networking
is often an overlooked component. As with any distributed system, a database involves
a lot of traffic between all the cluster members to check for liveness, replicate state
and topology changes, and so on. As a result, network delays not only deteriorate your
application’s latency, but also prevent internode communication from functioning
effectively.
At ScyllaDB, we recommend a minimum network bandwidth of 10Gbps because
internal database operations such as streaming, repairs, and gossip can become very
network intensive. On top of that, you also need to factor in the actual throughput
required for the use case in question; the number of operations per second will certainly
be the highest bandwidth consumer for your deployment.
As with memory, the required network bandwidth will vary. Be sure to check your
vendor recommendations and consider the nature of your use case. A low throughput
workload will obviously consume less traffic than a higher throughput one.
For cloud deployments, most IaaS vendors provide a modern network infrastructure
with ample bandwidth between your database servers and between the database
and the application clients. Be sure to check on your client’s network bandwidth
consumption if you suspect network problems. A common mistake we see in
deployments involves application clients deployed with suboptimal network capacity.
Also, be sure to place your application servers as close as possible to your database.
If you are deploying them in a single region, a shorter physical distance between the
servers will translate to better network performance (since it will require fewer network
hops for communication) and, as a result, lower latencies. If you need to go multi-region
and you require strong consistency or replication across these regions, then you need to
pay the latency penalty for traversing regions—plus, you also have to pay, quite literally,
with respect to cross-region networking transfer fees. For multi-region deployments with
cross-region replication, a slow network link may create replication delays that cause
the database to apply backpressure on your writes until it manages to replicate the data
piled up.
Some cloud vendors have different instance types for different distributed database
workloads. For example, some workloads may benefit more from compute-heavy
instance types, with more compute power than storage capacity. Conversely, storage-
dense instance types typically feature a higher storage to memory ratio and are often
used by storage-heavy workloads.
To complicate things even more, some cloud providers may offer different CPU
generations for the same instance type. If one CPU generation is considerably slower
than other nodes, the wrong choice could introduce performance bottlenecks into your
cluster.
We have seen some (although rare) scenarios where a noisy neighbor dragged down an entire node's performance with no reasonable explanation. The lack of visibility and
control in cloud instances makes it harder to diagnose such situations. Often, you need
to reach out to your cloud vendor directly to resolve the situation.
As you start configuring your instance, remember that a cloud environment isn’t
created exclusively for databases. You have access to a wide range of options, but it can
be confusing to determine where to start and which options to use. In general, it’s best
to check with your database vendor on which instance types are recommended for
deployment. Even better, go beyond that and compare the results of their benchmarks
against those same instance types running your workload.
After you have decided on your instance types and deployment options, it’s time to
think about instance placement. Most clouds will charge you for both inter-region traffic
and inter-zone traffic, which may quite surprisingly increase the overall networking
costs. Some companies try to mitigate this cost by placing all instances under a single
availability zone (AZ), which also carries the risk of potentially having to face a cluster-
wide outage if/when that AZ goes down. Others opt to ignore the cost aspect and
deploy their replicas in different AZs to ensure data is properly replicated to an isolated
environment. Regardless of your instance’s placement of choice, note that some
database drivers allow clients in specific AZs to route queries only against database
replicas living in the same availability zone in order to reduce costs. Similarly, you will
also want to ensure that your application clients are located under the same zones as
your database to minimize your networking costs.
Managed DBaaS solutions can easily speed up your go-to-market and allow you to
focus on priorities beyond your database. Most database vendors now provide some
sort of managed solution. There are even independent companies in the business of
providing this kind of service for a variety of different distributed databases.
We have seen many examples where a managed solution helped users succeed, as
well as numerous complaints over the fact that some managed solutions were rather
limited. It is not our intention to recommend nor criticize any specific service provider in
question. Here is some vendor-agnostic advice on things to consider before selecting a
managed solution:
• What are the options for observability and how do you export the
data in question to your monitoring platform of choice?
• What kind of flexibility do you have with your deployment? What are
the available tunable options and the support for those within your
managed solution?
If all goes well, the vendor will "automagically" ensure that you're
covered, with acceptable performance. You won’t need to procure any additional
servers, or even contact your cloud provider.
Serverless is also a great option to consider if you’re working on a new project and
are not sure what capacity you need to meet performance expectations. It gives you the
freedom to start fast and scale (or shrink) depending on real-world usage. Database
sizing is one less thing to worry about. And you don’t need to predict the future.
Finally, serverless also makes it simpler to justify the spend internally. With this
model, you can assure your organization that you are never overprovisioned—at least
not for long. You’re paying for exactly the amount of performance that the database
vendor determines you need at all times.
However, a serverless deployment also carries the risk of cost overruns and the
uncertainty of unpredictable costs. For example, DynamoDB pricing may not be very
attractive for write-heavy workloads. Similarly, serverless database services may charge
an arm and a leg (or an eye and a knee) depending on the number of operations per
second you plan to sustain over an extended period of time. In some cases, it could
become a double-edged sword from a cost perspective if your goal is to sustain a high-
throughput performant system at large scale.
Another aspect to consider when thinking about a serverless solution is whether
the solution in question is compatible with your existing infrastructure components.
For example, you’ll want to explore what amount of effort is required to connect your
message queueing or analytics tool with that specific serverless solution.
Remember that the overall concept behind serverless is to abstract away the
underlying infrastructure in such a way that not all database-configurable options are
available to you. As a result, troubleshooting potential performance problems is often
more challenging since you might need to rely on your vendor’s input and guidance to
understand which actions to take. Being serverless also means that you lack visibility
into whether the infrastructure you consume is shared with other tenants. Many
distributed database vendors may also offer you different pricing tiers for shared and
dedicated environments.
But be aware that there is a performance penalty for the operational convenience
of using containers. This is to be expected because of the extra layer of abstraction (the
container itself), relaxation of resource isolation, and increased context switches. The
good news is that it can certainly be overcome. In our testing using ScyllaDB, we found
it is possible to take what was originally a 69 percent reduction in peak throughput down
to a 3 percent performance penalty.6
Here’s the TL;DR on that specific experiment:
Of course, the potential penalty and strategies for mitigating it will vary from database
to database. But the key takeaway is that there is likely a significant performance
penalty—so be sure to hunt it down and research how to mitigate it. Some common
mitigation strategies include:
6. See "The Cost of Containerization for Your ScyllaDB" on the ScyllaDB blog (https://www.scylladb.com/2018/08/09/cost-containerization-scylla/).
Kubernetes adds yet another virtualization layer—and thus opens the door to yet
another layer of performance issues, as well as different strategies for mitigating them.
First off, if you have the choice of multiple options for deploying and managing database
clusters on Kubernetes, test them out with an eye on performance. Once you settle
on the best fit for your needs, dive into the configuration options that could impact
performance. Here are some performance tips that apply across databases:
Summary
This chapter kicked off the final part of this book, focused on sharing recommendations
for getting better performance out of your database deployment. It looked at
infrastructure and deployment model considerations that are important to understand
whether you’re managing your own deployment or opting for a database-as-a-service
(maybe serverless) deployment model. The next chapter looks at performance
considerations relevant to the topology itself: replication, geographic distribution,
scaling up and/or out, and intermediaries like external caches, load balancers, and
abstraction layers.
CHAPTER 8
Topology Considerations
As mentioned in Chapter 5, database servers are often combined into intricate
topologies where certain nodes are grouped in a single geographical location; others
are used only as a fast cache layer, and yet others store seldom-accessed cold data in a
cheap place, for emergency purposes only. That chapter covered how drivers work to
understand and interact with that topology to exchange information more efficiently.
This chapter focuses on the topology in and of itself. How is data replicated across
geographies and datacenters? What are the risks and alternatives to taking the common
NoSQL practice of scaling out to the extreme? And what about intermediaries to your
database servers—for example, external caches, load balancers, and abstraction layers?
The performance implications of all this and more are covered here.1
Replication Strategy
First, let’s look at replication, which is how your data will be spread to other replicas
across your cluster.
Having more replicas will slow your writes (since every write must be duplicated
to replicas), but it could accelerate your reads (since more replicas will be available for
serving the same dataset). It will also allow you to maintain operations and avoid data
loss in the event of node failures. Additionally, replicating data to get closer to your application and closer to your users will reduce latency, especially if your application has a highly geographically-distributed user base.
1. This chapter draws from material originally published on the ScyllaDB blog (www.scylladb.com/blog/), ScyllaDB Documentation (https://docs.scylladb.com/stable/), the ScyllaDB whitepaper "Why Scaling Up Beats Scaling Out for NoSQL" (https://lp.scylladb.com/whitepaper-scaling-up-vs-scaling-out-offer.html), and an article that ScyllaDB co-founder and CEO Dor Laor wrote for The New Stack. It is used here with permission of ScyllaDB.
A replication factor (RF) of 1 means there is only one copy of a row in a cluster, and
there is no way to recover the data if the node is compromised or goes down (other than
restoring from a backup). An RF of 2 means that there are two copies of a row in a cluster.
An RF of at least three is used in most systems. This allows you to write and read with
strong consistency, as a quorum of replicas will be achieved, even if one node is down.
Many databases also let you fine-tune replication settings at the regional level. For
example, you could have three replicas in a heavily used region, but only two in a less
popular region.
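In Cassandra-compatible databases, for instance, per-datacenter replication factors are declared when creating a keyspace. A sketch (the keyspace and datacenter names are illustrative):

CREATE KEYSPACE app WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc_us_east': 3,
    'dc_eu_west': 2
};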
Note that replicating data across multiple regions (as Bigtable recommends as a
safeguard against both availability zone failure and regional failure) can be expensive.
Before you set this up, understand the cost of replicating data between regions.
If you’re working with DynamoDB, you create tables (not clusters), and AWS
manages the replication for you as soon as you set a table to be Global. One notable
drawback of DynamoDB global tables is that transactions are not supported across
regions, which may be a limiting factor for some use cases.
Rack Configuration
If all your nodes are in the same datacenter, how do you configure their placement? The
rule of thumb here is to have as many racks as you have replicas. For example, if you
have a replication factor of three, run it in three racks. That way, even if an entire rack
goes down, you can still continue to satisfy read and write requests to a majority of your
replicas. Performance might degrade a bit since you have lost roughly 33 percent of your
infrastructure (considering a total zone/rack outage), but overall you’ll still be up and
running. Conversely, if you have three replicas distributed across two racks, then losing
a rack may potentially affect two out of the three natural endpoints for part of your data.
That’s a showstopper if your use case requires strongly consistent reads/writes.
Among the benefits of scaling up to larger nodes:
• Fewer failures: Since large and small nodes fail at roughly the
same rate, large nodes deliver a higher mean time between failures
(MTBF) than small nodes. Failures in the data layer require operator
intervention, and restoring a large node requires the same amount of
human effort as a small one. In a cluster of a thousand nodes, you’ll
likely see failures every day—and this magnifies administrative costs.
Some architects are concerned that putting more data on fewer nodes increases
the risks associated with outages and data loss. You can think of this as the “big basket”
problem. It may seem intuitive that storing all of your data on a few large nodes makes
them more vulnerable to outages, like putting all of your eggs in one basket. But this
doesn’t necessarily hold true. Modern databases use a number of techniques to ensure
availability while also accelerating recovery from failures, making big nodes both safer
and more economical. For example, consider capabilities that reduce the time required
to add and replace nodes and internal load balancing mechanisms to minimize the
throughput or latency impact across database restarts.3
Workload Isolation
Many teams find themselves in a position where they need to run multiple different
workloads against the database. It is often compelling to aggregate different workloads
under a single cluster, especially when they need to work on the exact same dataset.
Keeping several workloads together under a single cluster can also reduce costs. But, it’s
essential to avoid resource contention when implementing latency-critical workloads.
Failure to do so may introduce hard-to-diagnose performance situations, where one
misbehaving workload ends up dragging down the entire cluster’s performance.
There are many ways to accomplish workload isolation to minimize the resource
contention that could occur when running multiple workloads on a single cluster. Here
are a few that work well. Keep in mind that the best approach depends on your existing
database’s available options, as well as your use case’s requirements:
3. ScyllaDB Heat Weighted Load Balancing provides a smarter request redistribution algorithm based on the cache hit ratio of nodes in the cluster. Learn more at www.scylladb.com/2017/09/21/scylla-heat-weighted-load-balancing/.
Figure 8-1. Latency between OLTP and OLAP workloads on the same cluster
before enabling workload prioritization
These latencies are approximately six times greater than when OLTP ran by itself.
(OLAP latencies hover between 12–14ms, but, again, OLAP is not latency-sensitive).
Figure 8-2 shows that the throughput on OLTP sinks from around 60,000 OPS to
half that—30,000 OPS. You can see the reason why. OLAP, being throughput hungry, is
maintaining roughly 260,000 OPS.
Figure 8-2. Comparative throughput results for OLTP and OLAP on the same
cluster without workload prioritization enabled
Ultimately, OLTP suffers with respect to both latency and throughput, and users
experience slower response times. In many real-world conditions, such OLTP responses
would violate a customer’s SLA.
Figure 8-3 shows the latencies after workload prioritization is enabled. You can see
that the OLTP workload similarly starts out at sub-millisecond to 2ms P99 latencies.
Once an OLAP workload is added, OLTP processing performance degrades, but with
P99 latencies hovering between 4–7ms (about half of the 11–12ms P99 latencies when
workload prioritization was not enabled).
Figure 8-3. OLTP and OLAP latencies with workload prioritization enabled
It is important to note that once system contention kicks in, the OLTP latencies
are still somewhat impacted—just not to the same extent they were prior to workload
prioritization. If your real-time workload requires ultra-constant single-digit millisecond
or lower P99 latencies, then we strongly recommend that you avoid introducing any form
of contention.
The OLAP workload, not being as latency-sensitive, has P99 latencies that hover
between 25–65ms. These are much higher latencies than before—the tradeoff for
keeping the OLTP latencies lower.
Throughput-wise, Figure 8-4 shows that the OLTP traffic is a smooth 60,000 OPS until
the OLAP load is also enabled.
It does dip in performance at that point, but only slightly, hovering between 54,000 to
58,000 OPS. That is only a 3–10 percent drop in throughput. The OLAP workload, for its
part, hovers between 215,000–250,000 OPS. That is a drop of 4–18 percent, which means
an OLAP workload would take longer to complete. Both workloads suffer degradation, as
would be expected for an overloaded cluster, but neither to a crippling degree.
Abstraction Layers
It’s becoming fairly common for teams to write an abstraction layer on top of their
databases. Instead of calling the database’s APIs directly, the applications connect to this
database-agnostic abstraction layer, which then manages the logistics of connecting to
the database.
There are usually a few main motives behind this move:
But, there’s definitely a potential for a performance penalty that is highly dependent
on how efficiently the layer was implemented. An abstraction layer that was fastidiously
implemented by a team of masterful Rust engineers is likely to have a much smaller impact than a Java or Python one cobbled together as a quick side project. If
you decide to take this route, be sure that the layer is developed with performance in
mind, and that you carefully measure its impact via both benchmarking and ongoing
monitoring. Remember that every application <> database communication is going to
use this layer, so a small inefficiency can quickly snowball into a significant performance
problem.
For example, we once saw a customer report elevated latencies coming from their Golang abstraction layer. Once we realized that the latency on the database side was within bounds for the use case, the investigation shifted from the database over to the network and client side. Long story short, the application latency spikes were traced to the client's garbage collection process, which was dragging down client-side performance significantly. The problem was resolved by scaling out the number of clients and ensuring that they had enough compute resources to function properly.
And another example: When working with a customer through a PostgreSQL-to-NoSQL migration, we realized that their clients were opening far too many concurrent connections against the database. Although a high number of open sockets is typically a good idea for distributed systems, an extremely high number can easily overwhelm the client side (which needs to keep track of all open sockets)
as well as the database. After we reported our findings to the customer, they discovered that they were opening a new database session for every request they submitted against the cluster. After they corrected the malfunctioning code, overall application throughput increased significantly because the abstraction layer then reused already-open sockets when routing requests.4

4. Learn about abstraction layer usage at Discord in "How Discord Migrated Trillions of Messages from Cassandra to ScyllaDB" (www.youtube.com/watch?v=S2xmFOAUhsk) and ShareChat in "ShareChat's Path to High-Performance NoSQL with ScyllaDB" (www.youtube.com/watch?v=Y2yHv8iqigA).
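To make the anti-pattern concrete, here is a minimal Python sketch using the Cassandra/ScyllaDB Python driver; the contact points, keyspace, and table are hypothetical. The broken version pays connection setup costs on every request, while the fix creates one long-lived session at startup and reuses its connection pool.

```python
from cassandra.cluster import Cluster

# Anti-pattern: opening a brand-new cluster connection and session
# for every incoming request leaks sockets and adds setup latency.
def handle_request_bad(user_id):
    cluster = Cluster(["10.0.0.1", "10.0.0.2"])   # hypothetical contact points
    session = cluster.connect("app_keyspace")     # hypothetical keyspace
    row = session.execute(
        "SELECT name, email FROM users WHERE id = %s", (user_id,)
    ).one()
    cluster.shutdown()
    return row

# Fix: one cluster object and one session for the whole process;
# every request reuses the already-open sockets.
cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect("app_keyspace")

def handle_request_good(user_id):
    return session.execute(
        "SELECT name, email FROM users WHERE id = %s", (user_id,)
    ).one()
```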
Load Balancing
Should you put a dedicated load balancer in front of your database? In most cases, no. Databases typically have their own way to balance traffic across the cluster, so layering a load balancer on top of that won't help—and it could actually hurt. Consider 1) how many requests the load balancer can serve without becoming a bottleneck and 2) its balancing policy. Also, recognize that it introduces a single point of failure that reduces your database's resilience. It also complicates your overall infrastructure topology because you now need to worry about the load balancer's own high availability.
Of course, there are always exceptions. For example, say you were previously using
a database API that’s unaware of the layout of the cluster and its individual nodes
(e.g., DynamoDB, where a client is configured with a single “endpoint address” and
all requests are sent to it). Now you’re shifting to a distributed leaderless database like
ScyllaDB, where clients are node aware and even token aware (aware of which token
ranges are natural endpoints for every node in your topology). If you simply configure
an application with the IP address of a single ScyllaDB node as its single DynamoDB
API endpoint address, the application will work correctly. After all, any node can answer
any request by forwarding it to other nodes as necessary. However, this single node will
be more loaded than the other nodes because it will be the only node actively serving
requests. This node will also become a single point of failure from your application’s
perspective.
In this special edge case, load balancing is critical—but load balancers are
not. Server-side load balancing is fairly complex from an admin perspective. More
importantly with respect to performance, server-side solutions add latency. Solutions
that involve a TCP or HTTP load balancer require another hop for each
request—increasing not just the cost of each request, but also its latency. We recommend
client-side load balancing: Modifying the application to send requests to the available
nodes versus a single one.
The key takeaway here is that load balancing generally isn't needed—and when it is, server-side load balancers carry a severe performance penalty. If load balancing is absolutely necessary, client-side load balancing is likely the better option.5
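As a sketch of what that looks like in practice (assuming the Cassandra/ScyllaDB Python driver; the datacenter name and contact points are placeholders), client-side load balancing is often just a matter of configuring the driver's built-in policies:

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Token awareness sends each request straight to a replica that owns
# the partition; round-robin spreads the rest across the local DC.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc1")   # placeholder DC name
    )
)

cluster = Cluster(
    contact_points=["10.0.0.1", "10.0.0.2", "10.0.0.3"],  # placeholders
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()
```

With a setup like this, no single node acts as the sole entry point, and there is no extra network hop on the request path.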
External Caches
Teams often consider external caches when the existing database cluster cannot meet
the required SLA. This is a clear performance-oriented decision. Putting an external
cache in front of the database is commonly used to compensate for subpar latency
stemming from the various factors discussed throughout this book: inefficient database
internals, driver usage, infrastructure choices, traffic spikes, and so on.
Caching may seem like a fast and easy solution because the deployment can be
implemented without tremendous hassle and without incurring the significant cost of
database scaling, database schema redesign, or even a deeper technology transformation.
However, external caches are not as simple as they are often made out to be. In fact, they
can be one of the more problematic components of a distributed application architecture.
In some cases, it’s a necessary evil—particularly if you have an ultra-latency-sensitive
use case such as real-time ad bidding or streaming media, and you’ve tried all the other
means of reducing latency. But in many cases, the performance boost isn’t worth it. The
following sections outline some key risks and you can decide what makes sense for your
use case and SLAs.
5. For an example of how to implement client-side load balancing, see www.scylladb.com/2021/04/13/load-balancing-in-scylla-alternator/.
The database should also have various eviction policies (the least recently used [LRU] policy is a straightforward example) that determine when new data should replace existing (older) cached objects.
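For intuition, an LRU policy can be sketched in a few lines of Python; this toy version is in no way how a database implements its cache internally, but it shows the eviction rule itself:

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU policy: when full, evict the entry touched least recently."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None                        # cache miss
        self.entries.move_to_end(key)          # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
```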
Another example is scan-resistant caching. When scanning a large dataset, say a
large range or a full-table scan, a lot of objects are read from the disk. The database can
realize this is a scan (not a regular query) and choose to leave these objects outside its
internal cache. However, an external cache would treat the result set just like any other
and attempt to cache the results. The database automatically synchronizes the content
of the cache with the disk according to the incoming request rate, and thus the user and
the developer do not need to do anything to make sure that lookups to recently written
data are performant. Therefore, if, for some reason, your database doesn’t respond fast
enough, it means that:
• The working set size and request pattern don’t fit the cache
Summary
This chapter shared strong opinions on how to navigate topology decisions, from workload prioritization to abstraction layers, load balancing, and external caches.
The next chapter looks at best practices for testing your topology: Benchmarking it to
see what it’s capable of and how it compares to alternative configurations and solutions.
CHAPTER 9
Benchmarking
We won’t sugarcoat it: database benchmarking is hard. There are many moving parts
and nuances to consider and manage—and a bit of homework is required to really see
what a database is capable of and measure it properly. It’s not easy to properly generate
system load to reflect your real-life scenarios.1 It’s often not obvious how to correctly
measure and analyze the end results. And after extracting benchmarking results, you
need to be able to read them, understand potential performance bottlenecks, analyze
potential performance improvements, and possibly dive into other issues. You need to
make your benchmarking results meaningful, ensure they are easily reproducible, and
also be able to clearly explain these results to your team and other interested parties in a
way that reflects your business needs. There's also hard mathematics involved: statistics and queueing theory to help with black boxes and measurements, not to mention domain-specific knowledge of the system internals of the servers, platforms, operating systems, and the software running on them.
But when performance is a top priority, careful—and sometimes frequent—
benchmarking is essential. And in the long run, it will pay off. An effective benchmark
can save you from even worse pains, like the high-pressure database migration project
that ensues after you realize—too late—that your existing solution can’t support the
latest phase of company growth with acceptable latencies and/or throughput.
The goal of this chapter is to share strategies that ease the pain slightly and, more
importantly, increase the chances that the pain pays off by helping you select options
that meet your performance needs. The chapter begins by looking at the two key types of
benchmarks and highlighting critical considerations for each objective. Then, it presents
a phased approach that should help you expose problems faster and with lower costs.
Next, it dives into the do’s and don’ts of benchmark planning, execution, and reporting,
with a focus on lessons learned from the best and worst benchmarks we've witnessed over the past several years. Finally, the chapter closes with a look at some less common benchmarking approaches you might want to consider for specialized needs.

1. For an example of realistic benchmarking executed with impressive mastery, see Brian Taylor's talk, "How Optimizely (Safely) Maximizes Database Concurrency," at www.youtube.com/watch?v=cSiVoX_nq1s.
• Latency focus: You assess how many IOPS the database can handle
without compromising latency. This is usually the focus for most
user-facing and real-time applications.
Throughput tests are quite common, but latency tests are a better choice if you
already know the desired throughput (e.g., 1M OPS). This is especially true if your
production system must meet a specific latency goal (for example, the 99.99 percentile
should have a read latency of less than 10ms).
If you’re focused solely on latency, you need to measure and compare latency at
the same throughput rates. If you know only that database A can handle 30K OPS with
a certain P99 latency and database B can handle 50K OPS with a slightly higher P99
latency, you can’t really say which one is “more efficient.” For a fair comparison, you
would need to measure each database’s latencies at either 30K OPS or 50K OPS—or both.
Even better, you would track latency across a broader span of intervals (e.g., measuring at 10K OPS increments until neither database could achieve the required P99 latency, as demonstrated in Figure 9-1).
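To illustrate the mechanics, here is a rough single-threaded Python sketch of measuring latency at a fixed request rate; do_request() is a stand-in for your actual query, and real load generators parallelize this across many connections:

```python
import time

def p99_at_fixed_rate(do_request, target_ops: int, duration_s: float) -> float:
    """Fire requests on a fixed schedule and return the observed P99 (seconds).

    Latency is measured from each request's *intended* start time, which
    also sidesteps the coordinated omission trap discussed later on.
    """
    interval = 1.0 / target_ops
    latencies = []
    start = time.perf_counter()
    n = 0
    while time.perf_counter() - start < duration_s:
        intended = start + n * interval       # when this request should fire
        delay = intended - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
        do_request()
        latencies.append(time.perf_counter() - intended)
        n += 1
    latencies.sort()
    return latencies[int(0.99 * (len(latencies) - 1))]
```

Running this (or its multi-threaded equivalent) at 10K, 20K, 30K OPS, and so on, for each candidate database yields latency numbers that can be compared fairly.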
Not all latency benchmarks need to take that form, however. Consider the example of an AdTech company with a real-time bidding use case. For them, a request that takes longer than 30ms is absolutely useless because it falls outside of the bidding window; it's considered a timeout. And any request that takes 30ms or less is fine; a 2ms response is not any more valuable to them than a 20ms response. They care only about which requests time out and which don't.
Their benchmarking needs are best served by a latency benchmark measuring how
many OPS were generating timeouts over time. For example, Figure 9-2 shows that the
first database in their benchmark (the top line) resulted in over 100K timeouts a second
around 11:30; the other (the horizontal line near the bottom) experienced only around
200 timeouts at that same point in time, and throughout the duration of that test.
With a throughput benchmark, you want to see one of the resources (e.g., the
CPU or disk) maxing out in order to understand how much the database can deliver
under extreme load conditions. If you don’t reach this level, it’s a sign that you’re not
really effectively benchmarking the database’s throughput. For example, Figure 9-4
demonstrates the load of two clusters during a benchmark run. Note how one cluster is
fully utilized whereas the other is very close to reaching its limits.
Figure 9-4. Two clusters’ load comparison: one fully maxed out and another very
close to reaching its limit
If you start off with too much complexity, it will be a nightmare to discover what’s
going wrong and pinpoint the source of the problem. For example, assume you want
to test if a database can handle 1M OPS of traffic from your client with a P99 latency of
1ms or less. However, you notice the latencies are exceeding the expected threshold.
You might spend days adjusting database configurations to no avail, then eventually
figure out that the problem stemmed from a bug in client-side concurrency. This would
have been much more readily apparent had you started out with just a fraction of that throughput. In addition to avoiding frustration and lost time, you would have saved your team a lot of unnecessary infrastructure costs.
As a general rule of thumb, consider at least two phases of benchmarking: one with a
specialized stress tool and one with your real workload (or at least a sampling of it—e.g.,
sending 30 percent of your queries to a cluster for benchmarking). For each phase,
start super small (at around 10 percent of the throughput you ultimately want to test),
troubleshoot as needed, then gradually increase the scope until you reach your target
loads. Keep optimization in mind throughout. Do you need to add more servers or more
clients to achieve a certain throughput? Or are you limited (by budget or infrastructure)
to a fixed hardware configuration? Can you achieve your performance goals with less?
The key is to move incrementally. Of course, the exact approach will vary from
situation to situation. Consider a leading travel company’s approach. Having recently
moved from PostgreSQL to Cassandra, they were quite experienced benchmarkers when
they decided to evaluate Cassandra alternatives. The goal was to test the new database
candidate’s raw speed and performance, along with its support for their specific
workloads.
First, they stood up a five-node cluster and ran database comparisons with synthetic
traffic from cassandra-stress. This gave them confidence that the new database could
meet their performance needs with some workloads. However, their real workloads
are nothing like even customized cassandra-stress workloads. They experience highly
variable and unpredictable traffic (for example, massive surges and disruptions
stemming from a volcanic eruption). For a more realistic assessment, they started
shadowing production traffic. This second phase of benchmarking provided the added
confidence they needed to move forward with the migration.
Finally, they used the same shadowed traffic to determine the best deployment
option. Moving to a larger 21-node cluster, they tested across cloud provider A and cloud
provider B on bare metal. They also experimented with many different options on cloud
provider B: various storage options, CPUs, and so on.
The bottom line here: Start simple, confirm, then scale incrementally. It’s safer and
ultimately faster. Plus, you’ll save on costs. As you move through the process, check if you
need to tweak your setup during your testing. Once you are eventually satisfied with the
results, scale your infrastructure accordingly to meet your defined criteria.
Tip If you haven’t done so yet, be sure to review the chapters on drivers,
infrastructure, and topology considerations before you begin benchmarking.
2. https://github.com/brianfrankcooper/YCSB
3. http://tpc.org/tpcc/default5.asp
4. https://github.com/Netflix/ndbench
5. https://github.com/nosqlbench/nosqlbench
6. www.postgresql.org/docs/current/pgbench.html
7. https://github.com/thelastpickle/tlp-stress
8. https://github.com/scylladb/scylla-tools-java/tree/master/tools/stress
These tools are all relatively similar and provide comparable configuration parameters. Your task is to understand which one best reflects the workload you are interested in and how to run it properly. When in doubt, consult your vendor about the specific tooling compatible with your database of choice.
Of course, these options won’t cover everything. It makes sense to develop your own
tools if:
• Your workloads look nothing like the ones offered by standard tools
(for example, you rely on multiple operations that are not natively
supported by the tools)
• It helps you test against real (or more realistic) workloads in the later
phases of your benchmarking strategy
Ideally, the final stages of your benchmarking would involve connecting your
application to the database and seeing how it responds to your real workload. But what
if, for example, you are comparing two databases that require you to implement the
application logic in two totally different ways? In this case, the different application logic
implementations could influence your results as much as the difference in databases.
Again, we recommend starting small: Testing just the basic functionality of the
application against both targets (following each one’s best practices) and seeing what the
initial results look like.
Data Models
Tools such as cassandra-stress use a default data model that does not completely
reflect what most teams use in production. For example, the cassandra-stress default
data model has a replication factor set to 1 and uses LOCAL_ONE as a consistency
level. Although cassandra-stress is a convenient way to get some initial performance
impressions, it is critical to benchmark the same/similar data model that you will
use in production. That's why we recommend using a custom data model and tuning your consistency level and queries. cassandra-stress and other benchmarking tools commonly provide ways to run with a user profile, in which you can define your own schema, queries, replication factor, request distribution and sizes, throughput rates, number of clients, and other aspects.
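For instance, a minimal cassandra-stress user profile might look like the following sketch. The keyspace, schema, and distributions are purely illustrative; check your tool's documentation for the full syntax.

```yaml
# profile.yaml -- illustrative cassandra-stress user profile
keyspace: myks
keyspace_definition: |
  CREATE KEYSPACE myks WITH replication =
    {'class': 'NetworkTopologyStrategy', 'replication_factor': 3};
table: readings
table_definition: |
  CREATE TABLE readings (
    sensor_id uuid,
    ts        timestamp,
    value     double,
    PRIMARY KEY (sensor_id, ts)
  );
columnspec:
  - name: sensor_id
    population: uniform(1..1M)    # high-cardinality partition key
queries:
  latest:
    cql: SELECT * FROM readings WHERE sensor_id = ? LIMIT 1
    fields: samerow
```

A run mixing one insert for every three reads at QUORUM consistency could then look something like: cassandra-stress user profile=profile.yaml "ops(insert=1,latest=3)" cl=QUORUM duration=30m -rate threads=200.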
Dataset Size
If you run the benchmark with a dataset that's smaller than your production dataset, you may get misleading or incorrect results due to the reduced number of I/O operations. Eventually, you should configure a test whose dataset realistically reflects your production dataset size, or at least a representative fraction of it at your current scale.
Workloads
Run the benchmark using a load that represents, as closely as possible, your anticipated
production workload. This includes the queries submitted by the load generator. When
you use the right type of queries, they are distributed over the cluster and the ratio
between reads and writes remains relatively constant.
The read/write ratio is important. Different combinations will impact your disk
in different ways. If you want results representative of production, use a realistic
workload mix.
Eventually, you will max out your storage I/O throughput and starve your disk, which
causes requests to start queuing on the database. If you continue pushing past that point,
latency will increase. When you hit that point of increased latency with unsatisfactory
results, stop, reflect on what happened, analyze how you can improve, and iterate
through the test again. Rinse and repeat as needed.
Here are some tips on creating realistic workloads for common use cases:
• Ingestion: Ingest data as fast as possible for at least a few hours, and
do it in a way that doesn’t produce timeouts or errors. The goal here
is to ensure that you’ve got a stable system, capable of keeping up
with your expected traffic rate for long periods.
• Real-time bidding: Use bulk writes coming in after hours or
constantly low background loads; the core of the workload is a lot of
reads with extremely strict latency requirements (perhaps below a
specific threshold).
The bottom line is to try to emulate what your workloads look like and run
something that’s meaningful to you.
Well, if you look just at the first minute, it seems that it’s serving 40K OPS. But if you
wait for a few minutes, the throughput decreases.
Whenever you want to make a statement about the maximum throughput that your
database can handle, do that from a steady state. Make sure that you’re inserting an
amount of data that is meaningful, not just a couple of gigabytes, and make sure that it
runs for enough time so it’s a realistic scenario. After you are satisfied with how many
requests can be sustained over a prolonged period of time, consider adding noise, such
as scaling clients, and introducing failure situations.
Figure 9-6. Lower baseline throughput that’s almost constant and predictable
throughout a ten-minute period
Both of these loads end up with roughly the same aggregate throughput. Figure 9-6 shows lower baseline throughput—but it's constant and very predictable throughout the period. The OPS in Figure 9-7 dips much lower than the first baseline but also spikes to a much higher value. The behavior shown in Figure 9-6 is obviously more desirable, yet if you aggregate your results, it would be really hard to notice the difference.
Another aggregation mistake is aggregating tail latencies: taking the average of P99
latencies from multiple load generators. The correct way to determine the percentiles
over multiple load generators is to merge the latency distribution of each load generator
and then determine the percentiles. If that isn’t an option, then the next best alternative
is to take the maximum (the P99, for example) of each of the load generators. The actual
P99 will be equal to or smaller than the maximum P99.
For example, assume you have two load generators: the 99th percentile for the first client is 3 milliseconds, and the 99th percentile for the second client is 30 milliseconds. Average those out, and you get 16.5 milliseconds. However, the true 99th percentile is obtained by merging the two latency distributions and taking the 99th percentile of the combined set; here, the actual 99th percentile was 30 milliseconds. That 16.5 millisecond "average" is meaningless—it doesn't correlate to anything in reality.
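A few lines of Python with NumPy demonstrate the difference on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)
# Two load generators' latency samples in milliseconds (synthetic data).
client_a = rng.uniform(1, 3, 10_000)     # P99 close to 3 ms
client_b = rng.uniform(1, 30, 10_000)    # P99 close to 30 ms

avg_of_p99s = (np.percentile(client_a, 99) + np.percentile(client_b, 99)) / 2
true_p99 = np.percentile(np.concatenate([client_a, client_b]), 99)

print(f"average of per-client P99s: {avg_of_p99s:.1f} ms")  # ~16 ms, meaningless
print(f"P99 of merged distribution: {true_p99:.1f} ms")     # ~29 ms, the real tail
```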
Also, do not blindly trust only your application latencies. In general, when evaluating
benchmarking results, be sure to consult your database-reported latencies to rule out
bottlenecks related to the database itself. Situations where the database latencies are
within your specific thresholds, but the client-side results deviate from your expected
numbers are fairly common—and may indicate a problem on either the network or at
the client side.
Also, provide enough detail so that the benchmark can be repeated. For example, for
a Cassandra benchmark, consider including details such as:
• JVM settings
Finally, keep in mind that the richer your reports, the easier it is for someone to
support your recommendation that option A is preferable to option B. For example, if
you’re looking into how two different databases compare on the same hardware, you
might share details in Table 9-1 in addition to the standard throughput and latency graphs.
9. See Tene's talk, "How NOT to Measure Latency" (https://www.youtube.com/watch?v=lJ8ydIuPFeU).
10. See Prisyazhynyy's blog, "On Coordinated Omission" (https://www.scylladb.com/2021/04/22/on-coordinated-omission/).
Beyond these tips, there are even more parameters that impact coordinated omission. We strongly recommend that you seek guidance from your vendor, Stack Overflow, or other community resources.
In this case, you have to consider a multitude of aspects. For instance, while assessing the impact of encryption-in-transit on your workload, you might have collected the initial test results while the database was running with a hot cache. Then, after applying the necessary changes, you restart your database and get higher latencies as a result. You might think, "Oh no! The encryption setting is really hurting my latency!" But you forgot that restarting the cluster to apply the change also cleared the cache—so upon restarting your tests, you're essentially reading from disk. In the end, after warming up the cache, you notice the encryption option barely impacted your latency. Whew!
sure—and it could very well be during the worst possible time (e.g., Black Friday or
during the big game you’re streaming to millions). You need to account for potential
disasters and test capacity planning with reduced nodes, a network partition, or
other undesired events. This has the added benefit of teaching you about the true
capabilities of the system’s resiliency.
Also, test the time and effort required to restore from a backup. Yes, this requires
spending a fair bit of time and money on what’s essentially a fire drill. But knowing what
to expect in a time of crisis is quite valuable—and avoiding databases with unacceptable
recovery times can be priceless.
If you’re running on the cloud, you might think you’re safe from disaster. “I’ll just
spin up another cluster and move forward. Right?” Wrong! Apart from the data migration
itself, there are a ton of other things that can go wrong. You’ll need to reconnect all
network VPCs, redo all the networking configuration between the application and
database, and so on. You may also run out of instances of the desired type in a given
region or availability zone. Did you ever go to the supermarket to buy a basic item, say
toilet paper, and find empty shelves because everybody suddenly started filling their
carts with it (e.g., due to a disaster)? This can happen to anything, even virtual instances.
It’s best to test disaster scenarios to gain a better understanding of what issues you could
experience—and practice how you’ll react.
To give you an idea of what this involved from a setup perspective, we provisioned 20
x i3en.metal AWS instances for the ScyllaDB cluster. Each instance had:
• 96 vCPUs
For the load generators, we used 50 x c5n.9xlarge AWS instances. Each instance had:
• 36 vCPUs
• 96 GiB RAM
If you’re thinking about performing your own extreme-scale benchmark, here are
some lessons learned that you might want to consider:
Summary
Benchmarking is tedious and painstaking, so make sure that you have clear goals and effective reporting to ensure the work pays off.
The next chapter dives into best practices for the ongoing monitoring that is critical
to interpreting many benchmarking results, as well as preventing and troubleshooting
performance issues in production.
CHAPTER 10
Monitoring
Databases require ongoing care and attention, especially when performance is a priority
and the data being stored is growing rapidly and/or changing frequently. Adverse events
that could place the business at risk—for example, node failures or a misbehaving
client—will inevitably occur. Given the complexity of both databases and data-intensive
applications, it’s not a matter of if some combination of factors ends up degrading
performance, but when.
Enter observability and monitoring. A proactive approach is the key to
understanding and optimizing your baseline performance, catching emerging issues
before your end-users feel the pain, and reacting fast when they do. This chapter helps
you determine where to focus your monitoring efforts—with examples from different
use cases—offers tips for exploring issues as they emerge, and details how you might
proceed when your key performance indicators (KPIs) are trending in the wrong
direction.
concurrency limiter, then you might focus your investigation on background operations
that may be slowing down your database. Or, maybe your application got scaled out to
handle more traffic, therefore breaking your previous client-side assumptions.
Monitoring trends over time can also help you predict and plan for peaks. For
instance, assume you’re a streaming media company. If you know that last year’s version
of a big sporting event drew over 25M active users when you had 250M subscribers,
you can use that data to make some predictions as to how much traffic you might
need to support this year—now that you have almost twice as many subscribers. It’s a
similar case for retail, fraud detection, or any other industry that experiences “Black
Friday” surges. One of the best ways to prepare for the next peak is to understand what
happened during the previous one.
Making monitoring a regular routine rather than an emergency response can also help you spot potential issues as they emerge—and keep them from escalating into a crisis. For example, one of the most common database mistakes is failing to carefully watch disk utilization. By the time you realize that the system is running out of storage space, it might be too late to respond.
As a nice side effect, monitoring can also provide a window into how your data and application usage are evolving. For example, if you note a steady increase in data volume and/or IOPS, you might consider benchmarking your database against what's feasible in the next year. Maybe you're already built for that scale, or maybe you need to think about your options for increasing capacity. Additionally, assessing what's required to achieve the expected latencies at the likely new scale also helps you predict and plan for the associated cost increase.
Note: Do You Need to Monitor a DBaaS? You selected a DBaaS because you
didn’t want to worry about your database, right? So does that mean you don’t have
to worry about monitoring? Yes … and no.
You should rest assured that your vendor of choice is carefully watching over
your instance with a great deal of automation as well as expertise. If you’re not
confident that this is the case, you might want to consider rethinking your DBaaS
vendor. But even if you are confident, it’s still advisable to keep a close eye on
database performance. To earn and retain your trust, your DBaaS vendor should
offer full transparency into what they’re monitoring. At a minimum, you should
understand:
• What triggers them to review KPIs, take action internally, and notify you of
an issue
It’s probably overkill to keep a DBaaS monitoring dashboard open on one of your
monitors 24/7. But at least know enough for a basic level of confidence that your
database—and your DBaaS vendor—are both doing their job.
The ultimate goal of monitoring a cluster is to ensure a steady state “healthy system.”
Before looking at specific KPIs, consider what an ideal cluster state looks like for your
database. For example, with a wide column database like ScyllaDB or Cassandra, your
target might be:
• There are no alerts indicating that a KPI you care about has exceeded
the acceptable threshold
• Requests for a partition/row are balanced (e.g., you don’t have a “hot
partition” with 50 percent of read requests going to a single partition)
• Partitions are balanced (e.g., you don't have an average partition size of 0.5MB and a few partitions that are 10GB)
• The cache hit rate (rows read from the cache) follows a specific
distribution pattern
Here are some specific KPIs to look into regarding your cluster health:
• Caching: This can vary from how much data your cache contains
to how much data is being read from the cache (as opposed to the
disk). The latter measurement will help you assess how the database
is using its caching system and if any tuning is required for it. It could
also explain some latency spikes, which would be correlated to reads
primarily hitting the disk.
• AdTech: AdTech is one of the most recognizable use cases that relies
heavily on sub-millisecond latencies. For example, in real-time
bidding, a single millisecond spike might be all it takes to miss a
targeted ad opportunity. As a result, these use cases often monitor
P99, P999, and even P9999 latencies and set up very aggressive
custom alerting thresholds so that spikes can be identified and
addressed immediately.
Application KPIs
Your distributed database is the single most important stateful component in your
infrastructure. It is therefore no surprise that many database vendors invest a lot of time
and effort into improving and bundling observability capabilities within their products.
However, monitoring a database alone can only do so much. There will always be an
application (or an entire infrastructure) behind it which, if not observed properly, may
cause important business impacts. Application KPIs are the key to exposing things like
query issues, poor data models, and unexpected driver behavior.
Here are some important KPIs to look into regarding your application (client side):
• Latency: High P99 latency on the client side does not necessarily mean that there's a problem with your database latency. Client-side latencies will typically run slightly higher than your database latencies.

1. See www.brendangregg.com/blog/2014-07-01/perf-heat-maps.html.
Infrastructure/Hardware KPIs
Keeping an eye on the database and application sounds reasonable, but what about
the underlying hardware and infrastructure? Keeping it all healthy and humming is the
top priority of infrastructure teams. After all, what good does tuning and monitoring a
database do if the server that powers it goes offline due to a weeks-long malfunction that
went unnoticed?
Here are the top infrastructure/hardware KPIs that are relevant from a database
perspective:
• CPU utilization: This is the one and only metric that counts…or is
it? CPU utilization can be looked at from different perspectives. On
the one hand, the OS might say that a CPU is 100 percent busy and
therefore it has certainly reached its limit and cannot possibly accept
more work. Right? Wrong! A busy CPU does not always mean that
the system has reached its limits. Databases such as ScyllaDB have
internal mechanisms to prioritize user workloads over background
internal processes such as compactions and repairs. In such a system,
it is actually expected to see CPU utilization at 100 percent most of
the time—and it does not mean that the system has reached its limits!
When defining custom alerts, think through the acceptable ranges for the possible values. For instance, maybe you think that a workload crossing its expected peak for one minute is acceptable, three minutes should trigger warnings, and five minutes indicates something is definitely wrong. Set your monitoring system accordingly and bind the appropriate alerting channels for each type of alert.
Also, make good use of alerting channels! Be sure to tag and appropriately direct
each level of alert to its own set of target channels. You don’t want the alerting system
automation to silently drop a message on a random Slack channel in the middle of the
night if the production system is down.
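If your stack uses Prometheus-style alerting, those tiers might translate into rules like the following. The metric name db_writes_per_second and the thresholds are placeholders for whatever your own monitoring exposes.

```yaml
groups:
  - name: database-throughput
    rules:
      - alert: WriteLoadAboveExpectedPeak
        expr: db_writes_per_second > 80000   # hypothetical metric and limit
        for: 3m                              # sustained three minutes: warn
        labels:
          severity: warning
        annotations:
          summary: "Write load above expected peak for 3 minutes"
      - alert: WriteLoadAboveExpectedPeakCritical
        expr: db_writes_per_second > 80000
        for: 5m                              # sustained five minutes: page on-call
        labels:
          severity: critical
        annotations:
          summary: "Write load above expected peak for 5 minutes"
```

The severity label is then what routes warnings to a team channel and criticals to the on-call pager.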
Figure 10-1. One replica taking much longer than all the others to acknowledge
requests
To see what’s going on here, let’s look at the foreground and background write
queues. But first: what’s a foreground and background queue? Foreground queues
are requests that the application directed to the specified node, but were not yet
acknowledged back to the client. That is, the requests were received, but are waiting to
be processed because the database is currently busy serving other requests. Background
queues are application requests that were already acknowledged back to the application,
but still require additional work in the database before they can be considered done.
Delays replicating data across nodes are typically the reason for high background
queues. High foreground and background queues both correlate with high latencies.
So what’s the true problem here? Figure 10-2 indicates that the application is
overloading the system. It’s sending more requests than the database can handle. And
since the running time of a single task in a distributed system is governed by the slowest
node, the entire system will throttle down to the speed of that slow node.
Figure 10-3 shows that the background queues in other nodes start climbing right
after one node gets overwhelmed with requests it can’t handle. This makes sense,
because the busy node is clearly taking longer to acknowledge requests sent to it.
There are a couple of options for resolving this. First, consider modifying the
application to throttle requests. If you can’t do that, then scale out the cluster to give it
more capacity.
To analyze this, let’s look at the internal cache metrics. The Reads with Misses graph
in Figure 10-5 shows that the reads aren’t hitting the cache—they’re all going to disk
instead. Fetching information from the disk is an order of magnitude slower than doing
so from memory. At this point, you know something weird is going on.
Figure 10-5. Database reads with cache misses; reads are going to disk instead
of cache
Similarly, Figure 10-6 shows the cache hits. You can see that almost no requests
are being served by the cache. This is a likely indication that the workload in question
heavily relies on reading cold (uncached) data.
To investigate further, look at the Active SSTable Reads graph in Figure 10-7. Here,
you can see that the amount of active read requests going to the disk is quite high.
Figure 10-7. Active SSTable Reads graph showing that the amount of active read
requests going to the disk is quite high
On the Queued Reads graph in Figure 10-8, you can see there’s a bit of queuing. This
queuing means that the underlying storage system can’t keep up with the request rate.
Requests need to wait longer before being served—and latency increases.
Figure 10-8. Queued Reads graph demonstrates that several requests are
getting queued
How do you resolve this? Review your queries and access patterns to use the
cache more efficiently. This is where query analysis is helpful. For example, with CQL,
you could look at the distribution of inserts, reads, deletes, and updates, the number
of connections per node or shard, and how many rows you’re currently reading. If
available, also check whether your queries are following the relevant best practices (for
CQL, this could be using prepared statements, token-aware queries, paged queries,
and so on).
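As a hedged sketch of those CQL best practices with the Python driver (the table, columns, and keyspace are hypothetical):

```python
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])              # hypothetical contact point
session = cluster.connect("app_keyspace")    # hypothetical keyspace

# Prepared once: the server parses the statement a single time, and a
# token-aware driver can route each execution straight to an owning replica.
get_events = session.prepare(
    "SELECT ts, payload FROM events WHERE device_id = ?"
)

def read_device(device_id: uuid.UUID):
    bound = get_events.bind([device_id])
    bound.fetch_size = 1000    # paged query: bound rows fetched per round trip
    for row in session.execute(bound):
        yield row              # the driver pulls later pages transparently
```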
Also, watch out for queries that require nodes across datacenters to participate
before requests are considered successful. Cross-datacenter traffic is usually more
expensive in terms of latencies and actual cost. Figure 10-9 shows an example of how to
identify queries traversing to remote regions.
Monitoring Options
Once you have a good grasp of what you’re looking for, how do you find it? There are a
number of tools and technologies available; here’s a quick rundown of the pros and cons
of common options.
Summary
This chapter began by recommending that you make monitoring a regular habit so
that you’re well-prepared to spot emerging issues and effectively diagnose the problem
when something goes wrong. It outlined a number of KPIs that have proven helpful
for tracking business-critical enterprise deployments. For each KPI, it explained what
to look for and offered some tips for how to react when the trends indicate a problem.
The chapter offered some high-level guidelines for creating custom alerts. Finally, we
walked through two sample monitoring scenarios and shared our take on the pros and
cons of different monitoring platform options. The next (and final) chapter looks at the
performance impacts of common admin operations and offers some tips on how you
might mitigate them.
CHAPTER 11
Administration
A database’s automated admin operations work to keep things tight and tidy behind
the scenes, but a level of supervision is required. Databases don’t know your business
and could very naively decide to execute resource-intensive admin operations at
what’s actually a performance-critical time. This final chapter details how common
admin operations tend to impact performance. It covers the nature and severity of
representative impacts and offers some tips on how you might mitigate them.
Figure 11-1. A quick rule of thumb for where to focus your admin-related
performance optimizations
If there is no discernible performance impact for your scenario, then the second and
third questions don’t really matter. If there’s a significant and business-critical impact
but you can’t control it, you’re in the tough position of deciding whether to accept it
or consider moving to an alternative database. If the stars align and you can control
something that’s both impactful and business-critical, that’s a great place to focus.
For example, consider PostgreSQL’s autovacuum function. As of this writing,
autovacuum is triggered when a specified scale factor/threshold is exceeded. This
is likely to coincide with heavy activity on the table—which is probably not when
you want background admin tasks to kick in. Starving some tables while repeatedly
vacuuming others is common, and users trying to compel autovacuum to hit starved
tables can easily end up pushing the system beyond its limit. What’s the likely impact
on the business? Probably fairly high for any performance-sensitive use case. And to
what extent can you control it? Quite well. For example, you can tune autovacuum
settings at both the global and table level, as well as apply strategies like supplementing
autovacuum with additional scheduled vacuum jobs. The bottom line here is that this is
a great performance optimization opportunity.
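As a sketch of that control (the table name and values are illustrative, not tuning recommendations), PostgreSQL lets you override autovacuum behavior per table and supplement it with scheduled manual runs:

```sql
-- Vacuum the hot "orders" table more eagerly than the global defaults:
-- trigger autovacuum after roughly 1% of rows change rather than 20%.
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold    = 1000
);

-- Supplement autovacuum with a scheduled vacuum (e.g., from cron) during
-- known quiet hours, so starved tables still get regular attention.
VACUUM (ANALYZE) orders;
```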
On the other hand, if you are using a managed DBaaS such as DynamoDB, admin
operations such as data cleanup might be largely beyond your scope of visibility and
control. It certainly doesn’t hurt to ask your vendor what they’re willing to divulge about
what, when, and how admin operations are performed. Even if you discover that an
admin operation undermines performance in a way that matters for you, you might
not be able to control it—but at least you can better prepare for it and diagnose the
performance hit when it occurs.
A number of admin operations can negatively impact database performance.
Two of the most common operations that impact performance across a variety of
databases are backups and compaction. Let’s take a deeper look at both.
Backups
Backups—a common maintenance procedure for any database—can be surprisingly
resource intensive. For example, consider a backup strategy where data deduplication
is required. As data in the database frequently gets written or overwritten, backups may
consume several CPU cycles and disk I/O on reads in order to compare whether the
data to be backed up has already been saved. Then, as it finds newer data that must
be retained, it eventually uploads the data (which also involves issuing underlying I/O
reads) to a safe location. As the process is repeated across multiple nodes, its parallelism
often ends up hurting latencies, especially for use cases that heavily rely on disk I/O to
fetch information.
Impacts
Factors that influence a backup’s performance impact include:
• Dataset size and replication factor: The more data you’re backing
up, the more time it takes to run a backup. Depending on the number
of files stored on disk, backing up may use a lot of read I/O to scan
through all the required database blobs.
• Scope: Are you backing up all on-disk data files all the time (full
backup)? A specific cluster? A system-wide snapshot? An incremental
backup? A properly defined backup strategy and scope will help you
mitigate the impact.
• Storage medium: Reads from local SSDs are noticeably faster than
regular disks. As a result, if your database relies on slow-access
storage devices, it is much easier for backups to deplete your
available read capacity.
Optimization
Before you start adjusting any options, consider these two critical questions:
• What type of backup makes the most sense given your workloads?
For example, if you’re working on a food delivery app, a large backup that kicks off
in the middle of the Friday lunch surge could result in lost business. The pain could be
alleviated by running regular backups during predictable downtimes (e.g., very early in
the morning), when there are resources to spare.
But other businesses don’t have a predictable downtime. For another example,
consider an application that provides location tracking services for ambulances—a
use case where a catastrophic event could bring a dramatic surge at any time without
warning. In that case, many small and frequent backups might be the best strategy. This
way, backups are unlikely to significantly impact database performance, no matter when
the unpredictable demand happens to rise.
Work with your team to understand the backup coverage that you need and what
type of backup pain you’re willing to accept, then adjust your options accordingly.
Note Repairs are a totally different process, but they have a similar impact.
Eventually consistent databases need to ensure that replicas (eventually) all have
the appropriate updates. In Cassandra and Cassandra-like databases, this process
is referred to as repairs. When repair runs, it could cause latency to spike. The
key to minimizing its performance impact varies according to your workload.
If there’s a time when your database is predictably idle, run repair then—with
high parallelism and intensity. If your use case can withstand minor latency
spikes, you can try to limit the repair’s intensity and parallelism. But, if you can’t
afford any latency spikes (e.g., a real-time bidding use case that must provide
sub-millisecond P9999 latencies around the clock), your best bet is to limit the
operation to run as slowly as possible.
Compaction
As mentioned in Chapter 2 and covered more in Appendix A, LSM-based databases use
compaction—a process of rewriting tables to remove deleted entries and reorganize data
to enable faster, more efficient reads and writes. Compaction operations are expensive in
terms of CPU, memory, and disk I/O.1
The degree to which you can control compaction varies dramatically from database
to database. For example, with Bigtable, it’s all done automatically. However, databases
such as Couchbase, HBase, Cassandra, and ScyllaDB let you choose from a variety of
compaction strategies, many of which have additional options you can use to fine-tune
how compaction is performed, as well as other settings that influence compaction
performance (for example, rate-limiting it).
Impacts
The performance impact of compaction also varies dramatically from database to database. One fundamental factor that influences compaction speed is whether the database performs major compactions on each shard/CPU concurrently, or whether compaction is bound to a single thread. As shown in Figure 11-2, benchmarks revealed nearly a 60X difference in the time required to run a major compaction of 1TB of data at RF=1 on i3.4xlarge machines.
1. For an interesting perspective on compaction, see Avi Kivity's real-time visualization in "How a Database Looks from a Disk's Perspective" (www.p99conf.io/session/how-a-database-looks-from-a-disks-perspective/).
Figure 11-2. The wide range of time required to perform compaction on similar
databases—from 36 minutes to 37 hours and 56 minutes
Optimization
When selecting a compaction strategy, keep in mind that the ultimate goal should be low amplification. You want to avoid:
• Read amplification (read requests needing many files to look up relevant data)
• Space amplification (excessive temporary disk space that requires the disk to be larger than a perfectly-compacted representation of the data)
• Write amplification (compacting the same data over and over again)
• Overwritten/deleted/expired data remaining on disk, slowing down your read path
Since not everyone is using a database that performs compaction, this chapter
doesn’t go deep into the weeds of the pros and cons of specific strategies. Table 11-1
provides an overview of which compaction strategy generally works best for different
workloads (your results may vary).
Two key takeaways should be that 1) one size never fits all, so it's nice to have a choice in admin matters, and 2) tradeoffs are inevitable—know which pain you can tolerate best and pick your poison accordingly.
To drive the point home, here’s a real-world story. Once upon a time, a new ScyllaDB
user reported high read latencies. The use case was a TTL’d time series to support
live media streaming. Time series use cases heavily rely on fetching data in specific
timeframes and expect that such lookups are fast enough to be served by the database.
As a result, time series use cases often rely on a Time-Bucketed compaction strategy,
which ensures that the data in question is compacted together under the same time
window to avoid the database having to potentially scan through multiple files across
distinct windows just to retrieve the data. However, if configured incorrectly, the strategy
may backfire and introduce severe performance headaches.
In this particular situation, we discovered that their time buckets were too small for the amount of data they were frequently retrieving in a single query. For example, if you time-bucket your data every ten minutes but routinely want to retrieve ten hours' worth of data, each query requires the database to scan through 60 buckets (6 buckets per hour × 10 hours)! With enough concurrency, queries scanning through these large chunks of data could starve the underlying disk I/O capacity. The resolution, therefore, was to update the compaction configuration to reflect a more realistic data grouping, as required by the use case.
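In CQL terms, a fix of that sort amounts to widening the compaction window so a typical query's time range spans only a few buckets; the table name and window size below are illustrative:

```sql
-- Time-window compaction with one window per day: a query covering ten
-- hours of data now touches one or two buckets instead of 60.
ALTER TABLE streams.viewer_events WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': '1'
};
```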
One final note on adjusting your compaction strategy for performance: Remember
that when you adjust your compaction strategy, your database will need to rewrite all
your table data. This will incur a significant performance penalty and should be carefully
planned to occur at a time that works best for your business.
Summary
Admin operations like repair, compactions, and backups are an unavoidable part of
running a healthy, well-performing database. There’s no such thing as a “zero impact”
admin operation; performing any operation consumes resources, and these operations
can have exacerbated impacts if you’re operating at extreme scale. This chapter used the
examples of backups and compaction to showcase the potentially significant—and also
highly variable—impact of admin operations on performance.
This is the final official chapter of this book—the end of these highly opinionated
recommendations for improving database performance based on what we’ve seen
working with a broad range of database users and databases. It's hardly the end of the options for optimizing database performance, though.
APPENDIX A
A Brief Look at Fundamental Database Design Decisions
This appendix briefly touches on a number of fundamental database design decisions that impact database performance. Why "briefly"? First, because we suspect that many readers are already familiar with them. But, more importantly, because other resources have covered them quite extensively, and extremely well. Honestly, there's not much to add. So, we'll offer a short take on some of the most pressing decisions that any distributed database must make, then share our top picks for learning more on each topic.
Sharding
The point of sharding is to ensure that data is well-balanced across your cluster for
reading and writing. You’re going to get the best performance by having all available
nodes reading and writing data—not by overloading some nodes while others are idle or
highly underutilized.
With most databases, sharding is built into its architecture. If that’s the case, it’s
important to get the data modeling correct (e.g., high cardinality partition keys) to help
the database’s automated sharding approach achieve an efficient balance across nodes.
However, if the database requires you to define the sharding strategy, you’re going to
have to bear the burden of more decisions that can impact balancing and, ultimately,
performance. If you see that certain shards are receiving or handling more requests than
others, it’s time to reconsider your strategy.
Also, not all automated sharding is the same. The two most common approaches
used in the class of databases we've been covering in this book are range-based
sharding, where each shard owns a contiguous slice of the key space, and hash-based
sharding, where each shard owns the keys whose hash falls into its portion of the hash
ring (as sketched below).
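Here's a minimal sketch of the difference, with made-up key ranges and a toy hash
function; production databases typically use stronger hashes (such as Murmur3) and
virtual nodes:

# A toy illustration of range-based vs. hash-based shard selection.
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def range_based_shard(key: str) -> str:
    # Each node owns a contiguous, alphabetical slice of the key space.
    if key < "h":
        return NODES[0]
    elif key < "p":
        return NODES[1]
    return NODES[2]

def hash_based_shard(key: str) -> str:
    # Hash the partition key, then map the digest onto a node. A high-
    # cardinality key spreads digests (and therefore load) evenly.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

print(range_based_shard("alice"), hash_based_shard("alice"))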
The level at which the sharding is performed also matters. The most common
approach is to shard per node: distribute data across separate database server nodes.
Another option is to shard per core, across all the servers in the
cluster. This divides a server’s resources into shared-nothing units of CPU core, RAM,
persistent storage, and network I/O. The advantage of this approach is that it maximizes
utilization of all the available cores of multi-CPU hardware architectures. If paired with
shard-aware drivers, each client writing or requesting data can send queries directly to
the CPU core responsible for that shard of data. This minimizes hot shards and removes
extra hops—improving performance. However, if you’re not running on powerful
servers, you’re less likely to take full advantage of the potential performance gains here.
Replication
When performed synchronously, replication affects the performance of the database
at runtime. Poorly designed or inappropriately selected replication can cause
performance bottlenecks.
There are many possible approaches to the design and implementation of database
replication. Database replication can either happen on an explicit command or be an
ongoing background process. Either approach should be well-accommodated by the
infrastructure that the database is running on.
For data replication, the database engineer needs to select a replication strategy:
The process for selecting the nodes to which each portion of data will be copied. Some
replication strategies don’t just copy data to a set of nodes; they also apply prioritization.
One or more nodes are called primary replicas for this portion of data, while the other
nodes are called secondary replicas.
Think about the number of nodes in the replica sets. To begin with, each piece of
data can be replicated to a single node, as shown in Figure A-1.
Here, data is sharded across the different nodes for load balancing, but this setup
doesn't provide high availability because none of the shards is replicated.
In more complex cases (shown in Figure A-2), there’s one primary replica and one or
more secondary replica nodes.
Figure A-2. Replicating data to several nodes, with a single primary and multiple
secondary replicas
Here, one node writes data, which then can be propagated to other read-only nodes.
This approach provides some level of high availability, since a secondary replica can
take over if the primary goes offline. However, it does not properly balance your workload.
All writes must be handled by the primary—which means that the primary becomes
a bottleneck. As a result, this method of data replication may be impractical for write-
intensive workloads. Spreading out the primary replicas for different portions of data
across different nodes in the cluster is one potential way to address this.
In a more extreme form (shown in Figure A-3), there are no designated primary or
secondary replicas at all.
Here, all data is replicated in an active-active leaderless topology. Every node can
accept read and write operations, so all are peers in managing the workload. When this
strategy is applied, any loss to part of the cluster will not result in lost data.
This active-active leaderless topology leads to what is called eventual consistency:
The guarantee that when an update is made in a distributed database, that update will
eventually be reflected in all nodes that store the data—resulting in the same response
every time the data is queried. In an eventually consistent system, the replication
strategy defines a number called the "replication factor" (RF): the number of nodes on
which each portion of data can be found. Writing data to (and reading data from) such a
system also conforms to different consistency requirements, referred to as the
consistency level (CL).
The most restrictive consistency level is often called "all." Writing in this mode
means that the data must be written to disk on all replica nodes for that portion of data.
Reading returns the data only after all replicas have responded, and the read operation
fails if even one replica does not respond.
Some less restrictive, but more performant, consistency levels may specify the exact
number of nodes that must confirm the operation. Usually, this number is selected in
the range of one through three, depending on how many replicas the cluster operator
expects to crash.
The "quorum" consistency level provides much stronger consistency for the data:
writing or reading at this level means that a majority of the nodes in the replica set must
confirm the operation. In multi-datacenter setups, quorum consistency often has
sub-levels depending on which datacenters the nodes in the replica set belong to.
Note that read and write consistency levels are independent of each other. Even if
data was written with one consistency level (e.g., QUORUM), it can be read at a different
consistency level depending on the intention. For example, a CL of ONE can be used for
data that doesn't need to be consistent, QUORUM can be used for the regular (or "unsure")
case, and the most restrictive level of ALL can be used to effectively force a full data repair.
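Here's a minimal sketch of mixing consistency levels per query with the Python driver
for a Cassandra-compatible database; the shop keyspace and orders table are
hypothetical. Assuming RF=3, QUORUM means two replicas, and any combination where
the write CL plus the read CL exceeds the RF guarantees that a read overlaps the
latest write:

# A hedged sketch of per-query consistency levels; names are hypothetical.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

session = Cluster(["127.0.0.1"]).connect("shop")  # contact point is an assumption

write = SimpleStatement(
    "INSERT INTO orders (id, total) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,   # a majority must ack the write
)
session.execute(write, (42, 99.90))

read = SimpleStatement(
    "SELECT total FROM orders WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,      # fast, possibly stale read
)
row = session.execute(read, (42,)).one()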
Learning More
Designing Data-Intensive Applications, by Martin Kleppmann (Chapters 5 and 6)
Database Internals: A Deep Dive Into How Distributed Data Systems Work, by Alex
Petrov (Chapters 11, 12, and 13)
Consensus Algorithms
Even though the quorum consistency level works extremely well and provides data
persistence, data consistency, and low latency access, it still does not guarantee the
linearizability, isolation, and atomicity required by transactions: queries that heavily rely
on ACID properties. One of the simplest and most obvious examples of a transaction can
be any CAS (compare-and-set) operation (e.g., increment a counter provided its value
is below 42). When transactions come into play, simple eventually consistent models
stop working and call for stronger means. One option is a consensus algorithm run
across the distributed system.
Consensus algorithms provide strong consistency guarantees about the underlying
data and allow a set of replicas to work together as a coherent unit. Provided the nodes
conform to the protocol, these algorithms tolerate failures of fewer than half of the
replicas, even in the presence of message loss and reordering. They guarantee
agreement (all correct replicas decide on the same value), validity (the value decided
was actually proposed by one of the replicas), and termination (every correct replica
eventually learns the decision).
Consensus algorithms fall into two large classes: those that need a leader to make
a decision and those that don't (the latter are called leaderless). Algorithms from the
former class can suffer from periods of silence if the leader fails: a new leader election
process is started, and the cluster cannot service requests while it's underway.
Raft
Raft is an example of a consensus algorithm with a leader. Raft was invented to be
simple. It replicates a log of commands from a leader to follower nodes; that log of
commands is called a “replicated state machine.” And, of course, it has the leader-
election algorithm on board too.
Each replica participating in Raft replication can be in one of three states: follower,
candidate, or leader (see Figure A-4). Any attempt to make a decision over the cluster
must be described in terms of a state change in the replicated state machine. Since the
client may not know who the leader is, it sends its decision requests to whatever node
it selects first. If the request lands on a follower rather than the leader, the follower
re-routes it to the leader, and the leader then replicates the decision across its followers.
In the case of leader failure, the followers turn into candidates. The election process
ultimately converts one of those candidates into a leader after it obtains votes from a
majority of replicas.
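To make the role transitions concrete, here's a minimal, non-authoritative sketch of
the state changes alone; real Raft also tracks terms, logs, and per-term votes:

# A toy model of the follower -> candidate -> leader transitions.
from enum import Enum, auto

class Role(Enum):
    FOLLOWER = auto()
    CANDIDATE = auto()
    LEADER = auto()

class RaftNode:
    def __init__(self, node_id: int, cluster_size: int):
        self.node_id = node_id
        self.cluster_size = cluster_size
        self.role = Role.FOLLOWER

    def on_election_timeout(self):
        # A follower that stops hearing from the leader becomes a candidate.
        self.role = Role.CANDIDATE

    def on_votes_received(self, votes: int):
        # A candidate that gathers a majority of votes becomes the leader.
        if self.role is Role.CANDIDATE and votes > self.cluster_size // 2:
            self.role = Role.LEADER

    def on_higher_term_seen(self):
        # Seeing a more recent leader demotes the node back to follower.
        self.role = Role.FOLLOWER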
Paxos
Paxos appeared earlier than Raft. It's an example of a leaderless algorithm, and it was
one of the first algorithms to prove that distributed decisions can be made safely on a
quorum basis.
According to the algorithm, each replica can play one or more of three roles—the
proposer, the acceptor, and the learner (see Figure A-5). The decision is made through a
two-step process in which the roles are involved, though from a practical perspective,
nodes usually combine those roles. The first step, usually called a phase, consists of
proposing a value; in order to be proposed, the value must be accompanied by a
sequence number. After the proposal is confirmed by a majority of acceptors, the
proposer may proceed to the confirmation phase. Once the confirmation is, in turn,
acknowledged by a majority of acceptors, the decision is made.
Figure A-5. Each replica participating in Paxos can be the proposer, acceptor,
or learner
Once the decision is made, none of the participants may fall back to the first phase;
making the next decision requires a brand-new run of the algorithm. Because of this,
Paxos is never used alone. It's always part of a larger algorithm that implements all
the necessary "paperwork" needed to instantiate, execute, and wrap up algorithm runs
for individual decisions. One example of such a larger algorithm is distributed
log replication.
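As a rough sketch of the two phases, here's the bookkeeping a single-decree Paxos
acceptor might perform; proposers, learners, and actual message passing are omitted:

# A toy single-decree Paxos acceptor: promise in phase 1, accept in phase 2.
from typing import Optional, Tuple

class Acceptor:
    def __init__(self):
        self.promised_n = -1                 # highest sequence number promised
        self.accepted: Optional[Tuple[int, object]] = None  # (n, value)

    def prepare(self, n: int):
        # Phase 1: promise to ignore proposals numbered below n, and report
        # any value already accepted so the proposer can adopt it.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n: int, value: object):
        # Phase 2: accept the value if no higher-numbered promise was made.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return "accepted"
        return "reject"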
Note It's worth mentioning that Paxos was one of the first consensus algorithms
to appear, and its goal was to prove how distributed consensus can be achieved.
It's not used in its pure form; it's commonly extended with something else. One
such extension was Raft, which (quite successfully) aimed to reduce Paxos'
complexity at the cost of a potential imbalance in the nodes' roles.
Learning More
Designing Data-Intensive Applications, by Martin Kleppmann (Chapter 9)
Database Internals: A Deep Dive Into How Distributed Data Systems Work, by Alex
Petrov (Chapter 14)
NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, by
Pramod J. Sadalage and Martin Fowler (Chapter 4)
B-Tree vs. LSM Tree
LSM-tree based databases are optimized for writes: rather than updating values in
place, they append each write to new, immutable files that are periodically merged (see
Figure A-6). Since a given key may therefore live in several files, when the database serves
reads, it might have to search quite a while to find the appropriate value. However,
compaction can help avoid this read amplification. Choosing an effective compaction
strategy for your workload, as well as optimizing when compactions are performed,
can significantly reduce the read penalty sometimes associated with LSM trees. Also,
mechanisms like built-in caching can enable an LSM-based database to achieve fast
reads as well as writes.
Figure A-6. With LSM trees, compaction creates fewer (and larger) files
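Here's a minimal sketch of that read path, and of what compaction buys, with plain
dictionaries standing in for the memtable and the immutable on-disk files (all names
are illustrative):

# A toy LSM read path: check the memtable, then runs from newest to oldest.
memtable = {"user:3": "carol"}
sstables = [                      # newest first
    {"user:1": "alice-v2"},
    {"user:1": "alice-v1", "user:2": "bob"},
]

def lsm_get(key: str):
    if key in memtable:
        return memtable[key]
    for run in sstables:          # read amplification: many files to consult
        if key in run:
            return run[key]       # the newest version wins
    return None

def compact(runs):
    # Compaction merges runs into one, dropping shadowed versions, so that
    # reads touch fewer files afterward (as in Figure A-6).
    merged = {}
    for run in reversed(runs):    # oldest first, so newer entries overwrite
        merged.update(run)
    return [merged]

print(lsm_get("user:1"))          # -> "alice-v2"
sstables = compact(sstables)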
On the other hand, B-tree based databases (as in Figure A-7) are optimized for reads.
B-trees offer fast reads since their structure—and lack of duplication—makes them
much more efficient to search than LSM trees. Traversing the tree is a straightforward
path down from the root (as opposed to searching through potentially many files, as is the case
with LSM trees).
Figure A-7. B-trees are updated in place (vs. LSM trees, which are append only)
The B-tree write path is not optimized for speed, though. Each time there’s a write,
the database must traverse the tree to find the appropriate location. If a value already
exists there, it’s updated. If not, a new leaf is added. This is quite disk-intensive, even for
small records. In some cases, moving to faster disks can improve performance.
The database creates a snapshot of the tree on each update, which enables you to
perform rollbacks (including point-in-time restore). B-trees are also well suited to
transactions: when you are going to perform a transaction, you typically want to read
the data and then start snapshotting everything. If the conflict resolution doesn't go as
planned, you can simply roll it all back. B-trees make this feasible.
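Here's a minimal sketch of that write path, with a single sorted node and the
standard-library bisect module standing in for a full multi-level B-tree:

# A toy in-place write path: traverse (binary search) to the slot, then
# either update the existing value or insert a new entry.
import bisect

keys, values = [], []             # one "leaf," kept sorted by key

def btree_put(key, value):
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        values[i] = value         # key exists: update in place
    else:
        keys.insert(i, key)       # key is new: add a leaf entry
        values.insert(i, value)   # (a real B-tree may also split the node)

def btree_get(key):
    i = bisect.bisect_left(keys, key)
    return values[i] if i < len(keys) and keys[i] == key else None

btree_put("user:1", "alice")
btree_put("user:1", "alice-v2")   # overwrites on disk, unlike an LSM append
print(btree_get("user:1"))        # -> "alice-v2"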
Learning More
Designing Data-Intensive Applications, by Martin Kleppmann (Chapter 3)
Database Internals: A Deep Dive Into How Distributed Data Systems Work, by Alex
Petrov (Chapters 2, 4, 6, and 7)
The B-Tree, LSM-Tree, and the Bw-Tree in Between, by PhotonDB
Figure A-8. Data is split into blocks of a fixed size; adjacent data is preferred over
data dispersed across the disk
There are many algorithms for grouping data in a logical way, each trying to improve
search efficiency (for example, partitioning and clustering).
When data is organized in the form of a table, the cells can be stored in at least two
ways: row-oriented (e.g., wide column) and column-oriented. It's not feasible to shift
from one approach to the other because of the amount of I/O required to convert the
data. At the same time, this decision has a major impact on the database's performance.
Row-Oriented Databases
Let’s look at a sample table for a better understanding of how different storage
approaches work. Figure A-9 shows a table containing information about a person in
each row (the person’s name, age, address, etc.).
With row-oriented storage, data is put on disk row-by-row, and each chunk of data
from a block consists of the cells from one table row. This design is perfect for so-called
OLTP (Online Transaction Processing) applications, since such workloads can often
modify the data by adding—or deleting—entities in the table. Writing a new row is
optimal because it involves just appending an entire row to the existing blocks or putting
it into new blocks. Another reason that row-oriented storage is good for OLTP workloads
is that these workloads are typically loaded with requests retrieving every attribute
from a single entity. Some examples of row-oriented databases are Cassandra, ScyllaDB,
Postgres, and MySQL.
In other words, row-oriented storage is beneficial when all or most of the record
needs to be accessed in the same query or transaction. In this case, it’s better to have
narrow tables. The more columns there are in a given table, the less likely it is that your
query will need all of them. In the extreme case, when the query requires only a few
columns (or even a single column), the row-store becomes too expensive: it needs to
read the whole block, yet the block will likely contain redundant data that is just thrown
away after being read.
Column-Oriented Databases
On the other hand, column-oriented databases store data on disk column-by-column.
Let's take the same table, but consider each chunk of data to be a column rather than a
row. As shown in Figure A-10, the "names" will be grouped together on the disk, the "ages" will
be grouped together on the disk, the “addresses” will be grouped together on the disk,
and so on and so forth.
This approach is a good choice for OLAP (Online Analytical Processing) workloads
because those workloads generally aggregate specific data over a very large number of
records. OLAP queries are mainly interested in a small subset of columns; for example,
calculating the average age of the people in the table. Also, these workloads rarely
modify data, and even when they do, the modifications mostly append new records. Some
examples of column-oriented databases are Google BigQuery and Amazon Redshift.
It's also worth mentioning that the compression rate is often much higher in a
column-store than in a row-store. That's because all the cells in a column have the same
data type, which makes the data quite compression-friendly.
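To make the layout difference concrete, here's a minimal sketch that flattens the same
kind of person table both ways; the lists stand in for contiguous disk blocks:

# A toy comparison of row-oriented and column-oriented layouts.
people = [
    ("Alice", 34, "12 Oak St"),
    ("Bob",   41, "7 Elm Rd"),
    ("Carol", 29, "3 Pine Ave"),
]

# Row-oriented: each row's cells sit together, so fetching one person
# (or appending a new one) touches a single contiguous chunk: OLTP-friendly.
row_store = [cell for row in people for cell in row]

# Column-oriented: each column's cells sit together, so an aggregate such
# as average age scans one homogeneous, compressible run: OLAP-friendly.
column_store = {
    "name":    [name for name, _, _ in people],
    "age":     [age for _, age, _ in people],
    "address": [addr for _, _, addr in people],
}

print(sum(column_store["age"]) / len(column_store["age"]))  # one column read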
Table A-1 breaks down the impact of these different approaches to storing records on
the disk.
These are just two representative examples. Data can also be stored using document,
graph, and other models. See the following resources for a comprehensive discussion of
how the various models store data, and the best and worst use cases for each.
Learning More
Designing Data-Intensive Applications, by Martin Kleppmann (Chapter 3)
Database Internals: A Deep Dive Into How Distributed Data Systems Work, by Alex
Petrov (Chapter 1)
NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, by
Pramod J. Sadalage and Martin Fowler (Chapters 8-11)