A Thorough Introduction To Distributed Systems
What is a Distributed System and why is it so
complicated?
Table of Contents
Introduction
1. What is a distributed system?
2. Distributed Data Stores
3. Distributed Computing
4. Distributed File Systems
5. Distributed Messaging
6. Distributed Applications
7. Distributed Ledgers
Summary
Introduction
With the ever-growing technological expansion of the world,
distributed systems are becoming more and more widespread. They are
a vast and complex field of study in computer science.
A distributed system, in its most basic definition, is a group of
computers working together so as to appear as a single computer to the
end user. These machines share state, operate concurrently and can fail
independently without affecting the whole system’s uptime.
Scaling vertically is all well and good while you can, but after a certain
point you will see that even the best hardware is not sufficient for your
traffic, not to mention impractical to host.
The best thing about scaling horizontally is that there is no cap on how
far you can scale — whenever performance degrades, you simply add
another machine, potentially ad infinitum.
Easy scaling is not the only benefit you get from distributed systems.
Fault tolerance and low latency are equally important.
Scaling our database
Imagine that our web application became insanely popular. Imagine also
that our database started getting twice as many queries per second as it
can handle. The application would immediately start declining in
performance, and our users would notice.
Let’s work together and make our database scale to meet our high
demands. A common first step is master-slave replication: all writes go
to a single master database, which asynchronously copies each new
record to one or more slave databases that serve the read traffic.
Pitfall
Gotcha! We immediately lost consistency guarantees. Specifically, we
lost read-your-writes consistency, the guarantee that a read issued right
after a write will reflect it.
You see, there now exists a possibility in which we insert a new record
into the database, immediately afterwards issue a read query for it and
get nothing back, as if it didn’t exist!
Propagating the new information from the master to the slave does not
happen instantaneously. There actually exists a time window in which
you can fetch stale information. If this were not the case, your write
performance would suffer, as it would have to synchronously wait for
the data to be propagated.
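To make that stale-read window concrete, here is a toy simulation in
Python (everything here is invented for illustration): writes land on the
master immediately, while the slave only receives them after an
artificial replication delay.

```python
import threading
import time

REPLICATION_DELAY = 0.5  # seconds; real-world lag varies wildly

master = {}  # the node that accepts writes
slave = {}   # the node that serves reads

def write(key, value):
    master[key] = value                # acknowledged to the client immediately
    def replicate():
        time.sleep(REPLICATION_DELAY)  # propagation is not instantaneous
        slave[key] = value
    threading.Thread(target=replicate).start()

def read(key):
    return slave.get(key)              # reads go to the slave

write("user:1", "Bob")
print(read("user:1"))  # None  -- we read before replication finished
time.sleep(1)
print(read("user:1"))  # 'Bob' -- eventually consistent
```

Making the write block until the slave confirms would close this
window, at the cost of write latency, exactly the trade-off described
above.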
Continuing to Scale
Using the slave database approach, we can horizontally scale our read
traffic to some extent. That’s great, but we’ve hit a wall in regards to
our write traffic — it’s still all on one server!
We’re not left with many options here. We simply need to split our
write traffic across multiple servers, as one alone is not able to handle it.
With sharding you split your server into multiple smaller servers, called
shards. These shards all hold different records — you create a rule as to
which kinds of records go into which shard. It is very important to create
the rule such that the data gets spread in a uniform way.
This sharding key should be chosen very carefully, as the load is not
spread equally across arbitrary columns (e.g. more people have names
starting with C than with Z). A single shard that receives more
requests than others is called a hot spot and must be avoided. Once
split up, re-sharding the data becomes incredibly expensive and can cause
significant downtime, as was the case with Foursquare’s infamous 11-hour
outage.
To keep our example simple, assume our client (the Rails app) knows
which database to use for each record. It is also worth noting that there
are many strategies for sharding and this is a simple example to
illustrate the concept.
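For illustration, here is a minimal sketch of one common sharding rule,
hash-based partitioning. The shard addresses and the choice of email as
the sharding key are hypothetical:

```python
import hashlib

# Hypothetical shard addresses; in reality, separate database servers.
SHARDS = [
    "db-shard-0.example.internal",
    "db-shard-1.example.internal",
    "db-shard-2.example.internal",
]

def shard_for(key: str) -> str:
    """Pick a shard by hashing the sharding key.

    Hashing spreads keys roughly uniformly, avoiding the hot spots a
    naive rule like 'first letter of the name' would create."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("alice@example.com"))  # deterministic: same key, same shard
print(shard_for("zack@example.com"))   # likely a different shard

# Caveat: the modulo makes re-sharding painful -- adding a fourth shard
# remaps most existing keys. Consistent hashing mitigates this.
```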
We have gained quite a lot right now — we can increase our write traffic N
times, where N is the number of shards. This gives us practically
no limit — imagine how finely grained we can get with this partitioning.
Pitfall
Everything in Software Engineering is more or less a trade-off and this
is no exception. Sharding is no simple feat and is best avoided until
really needed.
We have now made queries by keys other than the partition key
incredibly inefficient (they need to go through all of the shards). SQL
JOIN queries are even worse, and complex ones become practically
unusable.
Decentralized vs Distributed
Before we go any further I’d like to make a distinction between the two
terms.
Even though the words sound similar and might logically be taken to
mean the same thing, their difference makes a significant technological
and political impact. A decentralized system is still distributed in the
technical sense, but the whole system is not owned or controlled by a
single actor.
This means that most systems we will go over today can be thought of
as distributed centralized systems — and that is what they’re made to
be.
Note: This definition has been debated a lot and can be confused with
others (peer-to-peer, federated). In early literature it has been defined
differently as well. Regardless, the definition I gave you is the one I feel
is most widely used now that blockchain and cryptocurrencies have
popularized the term.
Distributed Data Stores
Distributed data stores are most widely used and recognized in the form
of distributed databases. Most distributed databases are NoSQL non-
relational databases, often limited to key-value semantics. They provide
incredible performance and scalability at the cost of consistency or
availability.
CAP Theorem
Proven way back in 2002, the CAP theorem states that a distributed
data store cannot simultaneously be consistent, available and partition
tolerant.
Think about it: if you have two nodes which accept writes and the
connection between them dies — how are they both going to remain
available while simultaneously providing you with consistency? They
have no way of knowing what the other node is doing, so each can
either go offline (become unavailable) or work with stale information
(become inconsistent).
What do we do?
In the end, you’re left to choose whether you want your system to be
strongly consistent or highly available under a network partition.
Systems that choose availability typically embrace the weaker BASE
properties instead of ACID:
• Basically available — The system always responds to requests,
possibly with stale data
• Soft state — The system could change over time, even during times
of no input (due to eventual consistency)
• Eventual consistency — In the absence of new writes, all reads
eventually return the latest written value
Of course, there are other data stores which prefer stronger consistency
— HBase, Couchbase, Redis, Zookeeper
Cassandra
Cassandra, as mentioned above, is a distributed NoSQL database
which prefers the AP properties out of CAP, settling for eventual
consistency. I must admit this may be a bit misleading, as Cassandra is
highly configurable — you can make it provide strong consistency at the
expense of availability as well, but that is not its common use case.
Possibly biased diagram, showing writes-per-second benchmarks.
Even though this diagram might be biased and appears to compare
Cassandra to databases configured for strong consistency (otherwise I
can’t see why MongoDB’s performance would drop when scaled from
4 to 8 nodes), this should still show what a properly set up Cassandra
cluster is capable of.
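As a sketch of that configurability, here is how per-query consistency
levels look with the DataStax Python driver. The cluster address,
keyspace and users table are all assumptions for the example:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # assumes a reachable cluster
session = cluster.connect("my_keyspace")  # ...and an existing keyspace

# ONE: a single replica acknowledges. Fast and highly available,
# but reads elsewhere may briefly see stale data.
fast_write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)

# QUORUM: a majority of replicas must acknowledge. Paired with QUORUM
# reads, this yields strong consistency at the cost of availability
# when too many replicas are down.
safe_write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)

session.execute(fast_write, (1, "Bob"))
session.execute(safe_write, (2, "Alice"))
```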
Consensus
Database transactions are tricky to implement in distributed systems as
they require each node to agree on the right action to take (abort or
commit). This is known as consensus and it is a fundamental problem
in distributed systems.
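One classic (if fragile) way to reach that agreement is two-phase
commit (2PC). The following toy Python sketch, with made-up
participants, shows the shape of the protocol:

```python
class Participant:
    """A toy transaction participant, e.g. one database node."""
    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.state = name, healthy, "idle"

    def prepare(self, txn):
        # Phase 1: vote yes only if we can guarantee a later commit.
        self.state = "prepared" if self.healthy else "aborted"
        return self.healthy

    def commit(self, txn):
        self.state = "committed"

    def abort(self, txn):
        self.state = "aborted"

def two_phase_commit(participants, txn):
    # Phase 1 (voting): every node must promise it can commit.
    if all(p.prepare(txn) for p in participants):
        for p in participants:   # Phase 2 (completion): commit everywhere
            p.commit(txn)
        return True
    for p in participants:       # any 'no' vote aborts on every node
        p.abort(txn)
    return False

nodes = [Participant("A"), Participant("B"), Participant("C", healthy=False)]
print(two_phase_commit(nodes, "transfer #42"))  # False: aborted everywhere
```

Note that 2PC blocks if the coordinator dies between the two phases,
which is one reason production systems reach for consensus algorithms
such as Paxos and Raft instead.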
Distributed Computing
Distributed computing is the key to the influx of Big Data processing
we’ve seen in recent years. It is the technique of splitting an enormous
task (e.g. aggregating 100 billion records), one which no single computer
is practically capable of executing on its own, into many smaller tasks,
each of which can fit on a single commodity machine. You split your
huge task into many smaller ones, have them execute on many
machines in parallel, aggregate the data appropriately, and you have
solved your initial problem. This approach again enables you to scale
horizontally — when you have a bigger task, simply include more nodes
in the calculation.
MapReduce
MapReduce can be simply defined as two steps — mapping the data and
reducing it to something meaningful.
This example is kept as short, clear and simple as possible, but imagine
we are working with loads of data (e.g. analyzing billions of claps).
Obviously we won’t be storing all of this information on one machine,
and we won’t be analyzing it all with a single machine either. We also
won’t be querying the production database but rather some “warehouse”
database built specifically for low-priority offline jobs.
Each Map job is a separate node transforming as much data as it can.
Each job traverses all of the data in its given storage node and maps it
to a simple tuple of the date and the number one. Then three
intermediary steps (which nobody talks about) are done — Shuffle, Sort
and Partition. They basically arrange the data further and deliver it to
the appropriate Reduce job. As we’re dealing with big data, each
Reduce job is set apart to work on a single date only.
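Here is the clap-counting example condensed into a single-machine
Python sketch. In a real cluster each stage would run on many nodes,
but the data flow is the same:

```python
from itertools import groupby
from operator import itemgetter

# A toy clap log of (date, user) records; imagine billions of these
# spread across many storage nodes.
logs = [
    ("2017-05-01", "bob"), ("2017-05-01", "alice"),
    ("2017-05-02", "bob"), ("2017-05-01", "zack"),
]

# Map: every record becomes a (date, 1) tuple.
mapped = [(date, 1) for date, _user in logs]

# Shuffle/Sort/Partition: group the tuples so that all tuples for a
# given date end up at the same reducer.
mapped.sort(key=itemgetter(0))
partitions = {date: list(group)
              for date, group in groupby(mapped, key=itemgetter(0))}

# Reduce: each reducer owns exactly one date and sums its ones.
claps_per_day = {date: sum(n for _d, n in group)
                 for date, group in partitions.items()}
print(claps_per_day)  # {'2017-05-01': 3, '2017-05-02': 1}
```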
This is a good paradigm and surprisingly enables you to do a lot with it
— you can chain multiple MapReduce jobs for example.
Better Techniques
MapReduce is somewhat legacy nowadays and brings some problems
with it. Because it works in batches (jobs), a problem arises: if your job
fails, you need to restart the whole thing. A 2-hour job failing can
really slow down your whole data-processing pipeline, and you do not
want that in the least, especially during peak hours.
Another issue is the time you wait until you receive results. In real-time
analytic systems (which all have big data and thus use distributed
computing) it is important to have your latest crunched data be as fresh
as possible and certainly not from a few hours ago.
As such, other architectures have emerged that address these issues.
Namely Lambda Architecture (mix of batch processing and stream
processing) and Kappa Architecture (only stream processing). These
advances in the field have brought new tools enabling them — Kafka
Streams, Apache Spark, Apache Storm, Apache Samza.
Distributed File Systems
Distributed file systems can be thought of as distributed data stores.
They’re the same thing as a concept — storing and accessing a large
amount of data across a cluster of machines all appearing as one. They
typically go hand in hand with Distributed Computing.
HDFS
Hadoop Distributed File System (HDFS) is the distributed file system
used for distributed computing via the Hadoop framework. Boasting
widespread adoption, it is used to store and replicate large files (GB or
TB in size) across many machines.
IPFS
InterPlanetary File System (IPFS) is an exciting new peer-to-peer
protocol/network for a distributed file system. Leveraging content
addressing and a Git-like data model, it boasts a completely
decentralized architecture with no single owner nor single point of failure.
IPFS offers a naming system (similar to DNS) called IPNS which lets users
easily access information. It stores files with historic versioning, similar to
how Git does. This allows for accessing all of a file’s previous states.
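The core trick behind that Git-like model is content addressing: a file’s
address is derived from its bytes. A minimal illustration (real IPFS adds
multihashes, chunking and Merkle DAGs on top of this idea):

```python
import hashlib

def content_address(data: bytes) -> str:
    """Address content by its hash: same bytes, same address, forever."""
    return hashlib.sha256(data).hexdigest()

v1 = content_address(b"my article, draft 1")
v2 = content_address(b"my article, draft 2")
print(v1 != v2)  # True: every edit yields a new address, so old
                 # versions remain reachable by their old hashes
```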
Distributed Messaging
Messaging systems provide a central place for storage and propagation
of messages/events inside your overall system. They allow you to
decouple your application logic from directly talking with your other
systems.
Known Scale — LinkedIn’s Kafka cluster processed 1 trillion messages a
day, with peaks of 4.5 million messages a second.
If you need to save a certain event in a few places (e.g. user creation to
the database, the warehouse, the email-sending service and whatever
else you can come up with), a messaging platform is the cleanest way
to spread that message.
Consumers can either pull information out of the brokers (pull model)
or have the brokers push information directly into the consumers (push
model).
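To make the user-creation example concrete, here is a minimal sketch
using the kafka-python client. The broker address, topic name and
consumer group are all invented for the example:

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer side: the application emits a single 'user created' event.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("user-events", {"type": "user_created", "email": "bob@example.com"})
producer.flush()

# Consumer side (pull model): the email service, warehouse loader, etc.
# each subscribe with their own group id and receive their own copy.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="email-service",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:  # blocks, polling the brokers for new events
    print("sending welcome email to", message.value["email"])
```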
Distributed Applications
If you spin up 5 Rails servers behind a single load balancer, all connected
to one database, could you call that a distributed application? Recall
my definition from up above: a group of computers working together to
appear as a single computer to the end user.
If you count the database as a shared state, you could argue that this
can be classified as a distributed system — but you’d be wrong, as
you’ve missed the “working together” part of the definition.
Erlang Virtual Machine
Erlang is a functional language that has great semantics for
concurrency, distribution and fault-tolerance. The Erlang Virtual
Machine itself handles the distribution of an Erlang application.
Its model works by having many isolated lightweight processes all with
the ability to talk to each other via a built-in system of message passing.
This is called the Actor Model and the Erlang OTP libraries can be
thought of as a distributed actor framework (along the lines of Akka for
the JVM).
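Erlang would be the idiomatic language here, but to keep a single
language throughout this article, here is a rough imitation of the actor
idea in Python. Real Erlang processes are far lighter-weight and truly
isolated; threads merely stand in for them:

```python
import queue
import threading
import time

class Actor:
    """Private state plus a mailbox; communication only via messages."""
    def __init__(self):
        self.mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        self.mailbox.put(message)  # the only way to reach an actor

    def _run(self):
        while True:
            self.receive(self.mailbox.get())

    def receive(self, message):
        raise NotImplementedError

class Counter(Actor):
    def __init__(self):
        self.count = 0  # state no other actor can touch directly
        super().__init__()

    def receive(self, message):
        if message == "increment":
            self.count += 1

counter = Counter()
for _ in range(3):
    counter.send("increment")
time.sleep(0.1)     # let the mailbox drain
print(counter.count)  # 3 (peeking at state for demo purposes only)
```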
BitTorrent
The most widely known distributed application is arguably BitTorrent, a
peer-to-peer protocol for sharing files between users across the network.
You have the notion of two types of user: a leecher and a seeder. A
leecher is the user who is downloading a file and a seeder is the user
who is uploading said file.
Distributed Ledgers
A distributed ledger can be thought of as an immutable, append-only
database that is replicated, synchronized and shared across all nodes in
the distributed network.
They leverage the Event Sourcing pattern, allowing you to rebuild the
ledger’s state at any time in its history.
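A toy sketch of that pattern: the ledger below is just an append-only
list of transfer events, and the state at any point in history can be
rebuilt by replaying a prefix of the log (the names and amounts are
invented):

```python
# An append-only event log; nothing is ever updated or deleted.
events = [
    {"from": None,    "to": "alice", "amount": 100},  # initial issuance
    {"from": "alice", "to": "bob",   "amount": 30},
    {"from": "bob",   "to": "zack",  "amount": 10},
]

def balances_at(event_log, up_to):
    """Rebuild state at any point in history by replaying events."""
    balances = {}
    for event in event_log[:up_to]:
        if event["from"] is not None:
            balances[event["from"]] -= event["amount"]
        balances[event["to"]] = balances.get(event["to"], 0) + event["amount"]
    return balances

print(balances_at(events, 2))  # {'alice': 70, 'bob': 30}
print(balances_at(events, 3))  # {'alice': 70, 'bob': 20, 'zack': 10}
```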
Blockchain
Blockchain is the current underlying technology used for distributed
ledgers and in fact marked their start. This latest and greatest
innovation in the distributed space enabled the creation of the first ever
truly distributed payment protocol — Bitcoin.
Simplified blockchain
Miners are the nodes who try to compute the hash (via brute force).
The miners all compete with each other over who can come up with a
random string (called a nonce) which, when combined with the block’s
contents, produces a hash that satisfies the network’s difficulty
requirement (e.g. one starting with a certain number of zeroes). Once
somebody finds the correct nonce, they broadcast it to the whole
network. The nonce is then verified by each node on its own and
accepted into their chain.
The network always trusts and replicates the longest valid chain. In
order to cheat the system and eventually produce a longer chain you’d
need more than 50% of the total CPU power used by all the nodes.
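Here is a toy version of that brute-force search in Python. The
difficulty rule (a hash that starts with some number of zeroes) mirrors
Bitcoin’s real scheme, though everything else is simplified:

```python
import hashlib

def mine(contents: str, difficulty: int = 4):
    """Brute-force a nonce so sha256(contents + nonce) starts with
    `difficulty` zeroes: a toy proof of work."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{contents}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce, digest
        nonce += 1  # guessing is the only way in; that is what burns CPU power

nonce, digest = mine("alice pays bob 1 BTC | prev hash: 0000a1b2...")
print(nonce, digest)

# Finding the nonce is expensive, but verifying it is one cheap hash,
# which is exactly why every node can independently check a new block.
```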
Bitcoin
What previous distributed payment protocols lacked was a way to
practically prevent the double-spending problem in real time, in a
distributed manner. Research has produced interesting propositions[1]
but Bitcoin was the first to implement a practical solution with clear
advantages over others.
The double-spending problem states that an actor (e.g. Bob) cannot
spend his single resource in two places. If Bob has $1, he should not be
able to give it to both Alice and Zack — it is only one asset, it cannot be
duplicated. It turns out it is really hard to truly achieve this guarantee
in a distributed system. There are some interesting mitigation
approaches predating blockchain, but they do not completely solve the
problem in a practical way.
This is also the reason malicious groups of nodes need to control over
50% of the computational power of the network to carry out a
successful attack. With anything less, the rest of the network will create
a longer blockchain faster.
Ethereum
Ethereum can be thought of as a programmable blockchain-based
software platform. It has its own cryptocurrency (Ether) which fuels
the deployment of smart contracts on its blockchain.
And there are many, many more. Distributed ledger technology really did
open up endless possibilities. Some are most probably being invented as
we speak!
Summary
In the short span of this article, we managed to define what a distributed
system is, why you’d use one, and to go over each category a little. Some
important things to remember are:
• Distributed systems are complex, and every design decision in them
is a trade-off
• They are chosen out of necessity of scale, not out of preference
• Horizontal scaling has no hard cap, but it costs you either
consistency or availability under a partition (CAP)
Caution
Let me leave you with a parting forewarning:
You must stay away from distributed systems as much as you can. The
complexity overhead they incur is not worth the effort if you can avoid
the problem by either solving it in a different way or with some other
out-of-the-box solution.
. . .
[1] Combating Double-Spending Using Cooperative P2P Systems, 25–27
June 2007 — a proposed solution in which each ‘coin’ can expire and is
assigned a witness (validator) to its being spent.
. . .
Thanks for taking the time to read through this long (~5,600-word)
article!
If, by any chance, you found this informative or thought it provided you
with value, please make sure to give it as many claps as you believe it
deserves and consider sharing it with a friend who could use an
introduction to this wonderful field of study.
~Stanislav Kozlovski
. . .
Update
I currently work at Confluent. Confluent is a Big Data company
founded by the creators of Apache Kafka themselves! I am immensely
grateful for the opportunity they have given me — I currently work on
Kafka itself, which is beyond awesome! We at Confluent help shape the
whole open-source Kafka ecosystem, including a new managed Kafka-
as-a-service cloud offering.