Database Internals
A Deep Dive into How
Distributed Data Systems Work
Alex Petrov
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Database Internals, the cover image,
and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the author, and do not represent the publisher’s views.
While the publisher and the author have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the author disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-492-04034-7
[MBP]
To Pieter Hintjens, from whom I got my first ever signed book:
an inspiring distributed systems programmer, author, philosopher, and friend.
Table of Contents
Preface
2. B-Tree Basics
Binary Search Trees
Tree Balancing
Trees for Disk-Based Storage
Disk-Based Structures
Hard Disk Drives
Solid State Drives
On-Disk Structures
Ubiquitous B-Trees
B-Tree Hierarchy
Separator Keys
B-Tree Lookup Complexity
B-Tree Lookup Algorithm
Counting Keys
B-Tree Node Splits
B-Tree Node Merges
Summary
3. File Formats
Motivation
Binary Encoding
Primitive Types
Strings and Variable-Size Data
Bit-Packed Data: Booleans, Enums, and Flags
General Principles
Page Structure
Slotted Pages
Cell Layout
Combining Cells into Slotted Pages
Managing Variable-Size Data
Versioning
Checksumming
Summary
4. Implementing B-Trees
Page Header
Magic Numbers
Sibling Links
Rightmost Pointers
Node High Keys
Overflow Pages
Binary Search
Binary Search with Indirection Pointers
Propagating Splits and Merges
Breadcrumbs
Rebalancing
Right-Only Appends
Bulk Loading
Compression
Vacuum and Maintenance
Fragmentation Caused by Updates and Deletes
Page Defragmentation
Summary
Strict Consistency
Linearizability
Sequential Consistency
Causal Consistency
Session Models
Eventual Consistency
Tunable Consistency
Witness Replicas
Strong Eventual Consistency and CRDTs
Summary
Virtual Synchrony
Zookeeper Atomic Broadcast (ZAB)
Paxos
Paxos Algorithm
Quorums in Paxos
Failure Scenarios
Multi-Paxos
Fast Paxos
Egalitarian Paxos
Flexible Paxos
Generalized Solution to Consensus
Raft
Leader Role in Raft
Failure Scenarios
Byzantine Consensus
PBFT Algorithm
Recovery and Checkpointing
Summary
A. Bibliography
Index
Preface
Distributed database systems are an integral part of most businesses and the vast
majority of software applications. These applications provide logic and a user
interface, while database systems take care of data integrity, consistency, and redundancy.
Back in 2000, if you were to choose a database, you would have just a few options,
and most of them would be within the realm of relational databases, so differences
between them would be relatively small. Of course, this does not mean that all
databases were completely the same, but their functionality and use cases were very
similar.
Some of these databases have focused on horizontal scaling (scaling out), which
improves performance and increases capacity by running multiple database instances
acting as a single logical unit: Gamma Database Machine Project, Teradata, Greenplum,
Parallel DB2, and many others. Today, horizontal scaling remains one of the most
important properties that customers expect from databases. This can be explained by the
rising popularity of cloud-based services. It is often easier to spin up a new instance
and add it to the cluster than to scale vertically (scaling up) by moving the database to
a larger, more powerful machine. Migrations can be long and painful, potentially
incurring downtime.
Around 2010, a new class of eventually consistent databases started appearing, and
terms such as NoSQL, and later, big data grew in popularity. Over the last 15 years,
the open source community, large internet companies, and database vendors have
created so many databases and tools that it’s easy to get lost trying to understand use
cases, details, and specifics.
The Dynamo paper [DECANDIA07], published by the team at Amazon in 2007, had
so much impact on the database community that within a short period it inspired
many variants and implementations. The most prominent of them were Apache
Cassandra, created at Facebook; Project Voldemort, created at LinkedIn; and Riak,
created by former Akamai engineers.
Today, the field is changing again: after the time of key-value stores, NoSQL, and
eventual consistency, we have started seeing more scalable and performant databases,
able to execute complex queries with stronger consistency guarantees.
Why Should I Read This Book?
We often hear people describing database systems in terms of the concepts and
algorithms they implement: “This database uses gossip for membership propagation” (see
Chapter 12), “They have implemented Dynamo,” or “This is just like what they’ve
described in the Spanner paper” (see Chapter 13). Or, if you’re discussing the
algorithms and data structures, you might hear something like “ZAB and Raft have a lot in
common” (see Chapter 14), “Bw-Trees are like the B-Trees implemented on top of
log-structured storage” (see Chapter 6), or “They are using sibling pointers like in
Blink-Trees” (see Chapter 5).
We need abstractions to discuss complex concepts, and we can’t have a discussion
about terminology every time we start a conversation. Having shortcuts in the form
of common language helps us to move our attention to other, higher-level problems.
One of the advantages of learning the fundamental concepts, proofs, and algorithms
is that they never grow old. Of course, there will always be new ones, but new
algorithms are often created after finding a flaw or room for improvement in a classical
one. Knowing the history helps to understand differences and motivation better.
Learning about these things is inspiring. You see the variety of algorithms, see how
our industry was solving one problem after the other, and get to appreciate that work.
At the same time, learning is rewarding: you can almost feel how multiple puzzle
pieces move together in your mind to form a full picture that you will always be able
to share with others.
databases. The rule of thumb for whether or not to include a particular concept in the
book was the question: “Do the people in the database industry and research circles
talk about this concept?” If the answer was “yes,” I added the concept to the long list
of things to discuss.
Figure 1-1. Architecture of a database management system
Upon receipt, the transport subsystem hands the query over to a query processor,
which parses, interprets, and validates it. Later, access control checks are performed,
as they can be done fully only after the query is interpreted.
The parsed query is passed to the query optimizer, which first eliminates impossible
and redundant parts of the query, and then attempts to find the most efficient way to
execute it based on internal statistics (index cardinality, approximate intersection size,
etc.) and data placement (which nodes in the cluster hold the data and the costs
associated with its transfer). The optimizer handles both relational operations required
for query resolution, usually presented as a dependency tree, and optimizations, such
as index ordering, cardinality estimation, and choosing access methods.
The query is usually presented in the form of an execution plan (or query plan): a
sequence of operations that have to be carried out for its results to be considered
complete. Since the same query can be satisfied using different execution plans that
can vary in efficiency, the optimizer picks the best available plan.
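As a rough illustration of how an optimizer might choose among equivalent plans, the sketch below scores each candidate with a simple cost estimate and keeps the cheapest. It is a minimal sketch under stated assumptions, not how any particular database does it: the `Plan` fields, the cost weights, and the candidate plans are all hypothetical and stand in for the internal statistics and data placement costs described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A hypothetical candidate execution plan with estimated work."""
    name: str
    rows_scanned: int      # estimated rows read locally (from statistics)
    rows_transferred: int  # estimated rows shipped between cluster nodes


# Assumed relative costs: moving a row over the network is taken to be
# ten times more expensive than reading it locally.
SCAN_COST = 1.0
TRANSFER_COST = 10.0


def estimated_cost(plan: Plan) -> float:
    """Combine local scan work and data transfer into one number."""
    return plan.rows_scanned * SCAN_COST + plan.rows_transferred * TRANSFER_COST


def pick_plan(candidates: list[Plan]) -> Plan:
    """Return the candidate with the lowest estimated cost."""
    if not candidates:
        raise ValueError("no candidate plans")
    return min(candidates, key=estimated_cost)


# Three made-up plans that would produce the same query result.
candidates = [
    Plan("full_scan", rows_scanned=1_000_000, rows_transferred=0),
    Plan("index_lookup_local", rows_scanned=1_200, rows_transferred=0),
    Plan("index_lookup_remote", rows_scanned=1_000, rows_transferred=1_000),
]

best = pick_plan(candidates)
print(best.name)
```

With these made-up numbers the remote index lookup reads the fewest rows but loses to the local one once transfer cost is counted, which is why data placement enters the estimate alongside statistics.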
Figure 51. William Sellers
The firm of William Sellers & Company had another master mind
in that of Dr. Coleman Sellers, a second cousin of William
Sellers.[211] He was born in Philadelphia in 1827, his father, Coleman
Sellers, being also an inventor and mechanic. Like Nasmyth he
spent his school holidays in his father’s shop, which was at
Cardington. In 1846, when he was nineteen years old, he went to
Cincinnati and worked in the Globe Rolling Mill, operated by his elder
brothers, where the first locomotives for the Panama Railroad were
built; and in two years he became superintendent. In 1851 he
became foreman of the works of James and Jonathan Niles, who
were then in Cincinnati and building locomotives. Six years later he
returned to Philadelphia, became chief engineer of William Sellers &
Company, and remained with them for over thirty years, becoming a
partner in 1873. During these years he designed a wide range of
machinery, which naturally covered much the same field as that of
William Sellers, but his familiarity with locomotive work especially
fitted him for the design of railway tools. His designs were original,
correct and refined. The Sellers coupling was his invention and he
did much to introduce the modern systems of power transmission.
[211] See Trans. A. S. M. E., Vol. XXIX, p. 1163; Cassier’s Magazine,
August, 1903, p. 352; Journal of the Franklin Institute, Vol. CXLIX, p. 5.