
2a Intro To Cluster Computing


Modified from Mining of Massive Datasets (Stanford)

[Diagram: "classical" data mining (machine learning, statistics) runs on a single machine with one CPU, its memory, and a local disk.]
20+ billion web pages x 20 KB each = 400+ TB
One computer reads 30-35 MB/sec from disk
 ~4 months to read the web
 ~1,000 hard drives to store the web
Takes even more to do something useful with the data!
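A quick back-of-the-envelope check of the numbers above, as a Python sketch (the 20 KB page size and ~35 MB/s read rate are the slide's figures; the 500 GB drive size is an assumed round number):

pages = 20e9          # 20+ billion web pages
page_size = 20e3      # 20 KB per page
disk_rate = 35e6      # ~30-35 MB/s sequential read from one disk
drive_size = 500e9    # assumed ~500 GB per hard drive (illustrative)

total_bytes = pages * page_size
print(total_bytes / 1e12, "TB")                                  # 400 TB
print(total_bytes / disk_rate / 86400 / 30, "months to read")    # ~4 months on one machine
print(total_bytes / drive_size, "hard drives to store it")       # ~800, on the order of 1,000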
Today, a standard architecture for such problems is emerging:
 Cluster of commodity Linux nodes
 Commodity network (Ethernet) to connect them
[Diagram: cluster architecture. Each rack contains 16-64 nodes (CPU, memory, disk) connected by a rack switch; 1 Gbps between any pair of nodes in a rack, and a 2-10 Gbps backbone between racks.]

In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Large-scale computing for data mining problems on commodity hardware
Challenges:
 Machines fail:
 One server may stay up 3 years (~1,000 days)
 If you have 1,000 servers, expect to lose one per day
 People estimated Google had ~1M machines in 2011
 1,000 machines fail every day! (see the sketch after this list)
 How to store data persistently and keep it available if nodes can fail?
 How to deal with node failures during long-running computation?
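A rough sketch of the failure arithmetic, assuming a ~1,000-day mean lifetime per machine and independent failures (the 1,000-machine job size below is hypothetical):

machines = 1_000_000
mtbf_days = 1_000                          # one server stays up ~3 years
daily_failure_rate = 1 / mtbf_days

print(machines * daily_failure_rate, "expected failures per day")     # ~1,000

# Chance that a 1,000-machine, one-day job sees at least one machine die:
print(1 - (1 - daily_failure_rate) ** 1_000)                          # ~0.63 -- failures are the norm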
Challenges:
 Network bottleneck
 Network bandwidth = 1 Gbps
 Moving 10 TB takes approximately 1 day (checked in the sketch below)
 Distributed programming is hard!
 Need a simple model that hides most of the complexity
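A one-line check of that transfer time (1 Gbps is the slide's figure; protocol overhead is ignored):

data_bits = 10e12 * 8          # 10 TB in bits
link_bps = 1e9                 # 1 Gbps network
print(data_bits / link_bps / 3600, "hours")   # ~22 hours, roughly a day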
Issue: machine failure in the cluster computer
Idea: store files multiple times for reliability

Issue: copying data over a network takes time
Idea: bring computation close to the data

Issue: complex programming for distributed systems
Map-Reduce addresses all of these problems
 Google’s computational / data-manipulation model
 An elegant way to work with big data
 Storage infrastructure – a distributed file system
 Google: GFS. Hadoop: HDFS
 Programming model
 Map-Reduce
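To make the programming model concrete, here is a toy single-machine word count written in the Map-Reduce style (a sketch only; real systems such as Hadoop run the map and reduce phases in parallel across the cluster):

from collections import defaultdict

def map_fn(document):
    # Map: emit (word, 1) for every word in the input document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return (word, sum(counts))

def map_reduce(documents):
    # Shuffle: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]

print(map_reduce(["the cat sat", "the dog sat"]))
# [('the', 2), ('cat', 1), ('sat', 2), ('dog', 1)]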

Typical usage pattern
 Huge files (100s of GB to TB)
 Data is rarely updated in place
 Reads and appends are common

Reliable distributed file system
Data kept in “chunks” spread across machines
Each chunk replicated on different machines
 Seamless recovery from disk or machine failure

[Diagram: chunks C0-C5 and D0-D1 of two files spread across chunk servers 1..N, with each chunk stored on two different servers.]

Bring computation directly to the data!
 Chunk servers also serve as compute servers
Chunk servers
 File is split into contiguous chunks
 Typically each chunk is 16-64 MB
 Each chunk replicated (usually 2x or 3x)
 Try to keep replicas in different racks
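A minimal sketch of those two policies, fixed-size chunking and rack-aware replica placement (the 64 MB chunk size, 3x replication, and two-rack cluster below are illustrative choices, not GFS or HDFS code):

CHUNK_SIZE = 64 * 2**20        # 64 MB, the upper end of the 16-64 MB range
REPLICAS = 3                   # 3x replication

def split_into_chunks(data):
    # Split a file's bytes into contiguous fixed-size chunks.
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def place_replicas(chunk_id, racks):
    # Choose REPLICAS chunk servers, rotating over racks so that
    # copies land in different racks whenever possible.
    placement = []
    rack_names = sorted(racks)
    for i in range(REPLICAS):
        rack = rack_names[(chunk_id + i) % len(rack_names)]
        free = [s for s in racks[rack] if s not in placement]
        if free:
            placement.append(free[0])
    return placement

cluster = {"rack1": ["cs1", "cs2"], "rack2": ["cs3", "cs4"]}   # hypothetical cluster
print(place_replicas(0, cluster))   # ['cs1', 'cs3', 'cs2'] -- replicas span both racks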

Master node
 a.k.a. Name Node in Hadoop’s HDFS
 Stores metadata about where files are stored
 Might be replicated
Client library for file access
 Talks to master to find chunk servers
 Connects directly to chunk servers to access data
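A sketch of that read path in the same toy style (the master metadata map and chunk-server lookup below are hypothetical, not the GFS or HDFS client API):

def read_file(path, master, chunk_servers):
    # 1. Ask the master which chunks make up the file and where replicas live.
    #    `master` maps path -> list of (chunk_id, [server names]); metadata only.
    data = b""
    for chunk_id, replicas in master[path]:
        # 2. Fetch the chunk bytes directly from any replica that has it,
        #    bypassing the master for the actual data transfer.
        for server in replicas:
            chunk = chunk_servers[server].get(chunk_id)
            if chunk is not None:
                data += chunk
                break
    return data

# Hypothetical metadata and chunk stores for one small file.
master = {"/web/crawl.txt": [(0, ["cs1", "cs3"]), (1, ["cs2", "cs4"])]}
chunk_servers = {
    "cs1": {0: b"first chunk "}, "cs3": {0: b"first chunk "},
    "cs2": {1: b"second chunk"}, "cs4": {1: b"second chunk"},
}
print(read_file("/web/crawl.txt", master, chunk_servers))   # b'first chunk second chunk'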
