2a Intro To Cluster Computing PDF
2a Intro To Cluster Computing PDF
2a Intro To Cluster Computing PDF
CPU
Machine Learning, Statistics
Memory
2
20+ billion web pages x 20KB = 400+ TB
1 computer reads 30-35 MB/sec from disk
~4 months to read the web
~1,000 hard drives to store the web
Takes even more to do something useful
with the data!
Today, a standard architecture for such
problems is emerging:
Cluster of commodity Linux nodes
Commodity network (ethernet) to connect them
3
2-10 Gbps backbone between racks
1 Gbps between Switch
any pair of nodes
in a rack
Switch Switch
7
Issue: machine failure in the cluster computer
Idea:
?
8
Issue: machine failure in the cluster computer
Idea:
Store files multiple times for reliability
9
Issue: Copying data over a network takes time
Idea:
?
10
Issue: Copying data over a network takes time
Idea:
Bring computation close to the data
11
Issue: Complex programming for distributed
system
Idea:
?
12
Issue: Complex programming for distributed
system
Map-reduce addresses these all problems
Google’s computational/data manipulation model
Elegant way to work with big data
Storage Infrastructure – File system
Google: GFS. Hadoop: HDFS
Programming model
Map-Reduce
13
Typical usage pattern
Huge files (100s of GB to TB)
Data is rarely updated in place
Reads and appends are common
14
Reliable distributed file system
Data kept in “chunks” spread across machines
Each chunk replicated on different machines
Seamless recovery from disk or machine failure
…
Chunk server 1 Chunk server 2 Chunk server 3 Chunk server N
15
Reliable distributed file system
Data kept in “chunks” spread across machines
Each chunk replicated on different machines
Seamless recovery from disk or machine failure
C0 C1 D0 C1 C2 C4 C0 C4
C5 C2 C5 C3 D0 D1 … D1 C3
Chunk server 1 Chunk server 2 Chunk server 3 Chunk server N
17
Master node
a.k.a. Name Node in Hadoop’s HDFS
Stores metadata about where files are stored
Might be replicated
Client library for file access
Talks to master to find chunk servers
Connects directly to chunk servers to access data
18