
Map Reduce Intro CS4961-L22


L22: SC Report, Map Reduce
November 23, 2010

Map Reduce
• What is MapReduce?
• Example computing environment
• How it works
• Fault Tolerance
• Debugging
• Performance
• Google version = MapReduce; Hadoop = open-source implementation

What is MapReduce?

• Parallel programming model meant for large clusters
  - User implements Map() and Reduce()

• Parallel computing framework
  - Libraries take care of EVERYTHING else:
    - Parallelization
    - Fault Tolerance
    - Data Distribution
    - Load Balancing

• Useful model for many practical tasks (large data)


Functional Abstractions Hide Parallelism
• Map and Reduce
• Functions borrowed from functional programming languages (e.g., Lisp)
• Map()
  - Process a key/value pair to generate intermediate key/value pairs
• Reduce()
  - Merge all intermediate values associated with the same key
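
The names come straight from the functional primitives they are modeled on. As a point of reference only (not the MapReduce library itself), Python's built-in map and functools.reduce behave the same way on a small list:

from functools import reduce

# map: apply a function independently to every element
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

# reduce: merge the mapped values into a single result
total = reduce(lambda a, b: a + b, squares)          # 1 + 4 + 9 + 16 = 30
print(squares, total)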

Example: Counting Words

• Map()
  - Input: <filename, file text>
  - Parses the file and emits <word, count> pairs
  - e.g., <"hello", 1>

• Reduce()
  - Sums values for the same key and emits <word, TotalCount>
  - e.g., <"hello", (3 5 2 7)> => <"hello", 17>
Example Use of MapReduce
• Counting words in a large set of documents

map(string key, string value)
  // key: document name
  // value: document contents
  for each word w in value
    EmitIntermediate(w, "1");

reduce(string key, iterator values)
  // key: word
  // values: list of counts
  int result = 0;
  for each v in values
    result += ParseInt(v);
  Emit(AsString(result));
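
The pseudocode above can be simulated on a single machine in a few lines. A minimal Python sketch (the run_job driver is hypothetical and stands in for everything the real library does in parallel):

from collections import defaultdict

def map_fn(filename, contents):
    # emit an intermediate <word, 1> pair for every word in the document
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # merge all intermediate values associated with the same key
    return (word, sum(counts))

def run_job(inputs):
    # hypothetical sequential driver: shuffle/group by key, then reduce
    intermediate = defaultdict(list)
    for filename, contents in inputs.items():
        for word, count in map_fn(filename, contents):
            intermediate[word].append(count)
    return dict(reduce_fn(w, c) for w, c in intermediate.items())

print(run_job({"doc1": "hello world", "doc2": "hello again"}))
# {'hello': 2, 'world': 1, 'again': 1}
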
How MapReduce Works

• User to-do list:
  - Indicate:
    - Input/output files
    - M: number of map tasks
    - R: number of reduce tasks
    - W: number of machines
  - Write the map and reduce functions
  - Submit the job (see the sketch below)

• This requires no knowledge of parallel/distributed systems!!!
• What about everything else?
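
A sketch of what that to-do list might look like as a concrete job specification. Everything here is illustrative: the field names, paths, and the commented-out submit() call are hypothetical, not the real Google or Hadoop API; only the map/reduce functions are the user's own code:

def map_fn(key, value):
    # user-written map: document -> <word, 1> pairs
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # user-written reduce: sum the counts for one word
    return (key, sum(values))

job = {
    "input_files": ["docs/part-00000", "docs/part-00001"],  # assumed paths
    "output_dir":  "wordcount/out",
    "M": 200,   # number of map tasks
    "R": 5,     # number of reduce tasks
    "W": 50,    # number of machines
    "map": map_fn,
    "reduce": reduce_fn,
}
# submit(job)  # once submitted, the library handles everything else
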
Data Distribution

• Input files are split into M pieces on the distributed file system
  - Typically ~64 MB blocks (see the splitting sketch below)
• Intermediate files created by map tasks are written to local disk
• Output files are written to the distributed file system
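
A toy illustration of the splitting step, using a local file in place of the distributed file system (the 64 MB figure is the block size quoted above; real splitters also have to respect record boundaries, which is omitted here):

CHUNK_SIZE = 64 * 1024 * 1024  # ~64 MB, the typical block size

def split_input(path, chunk_size=CHUNK_SIZE):
    # Yield (offset, bytes) pieces; each piece becomes one map task's input.
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            yield (offset, block)
            offset += len(block)

# M is simply the number of pieces produced, e.g.:
# M = sum(1 for _ in split_input("docs/part-00000"))
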
Assigning Tasks

• Many copies of the user program are started
• Tries to exploit data locality by running map tasks on machines that already hold the data
• One instance becomes the master
• The master finds idle machines and assigns them tasks
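
A rough sketch of that locality-aware assignment, assuming the master knows which machines hold a replica of each input split (all names and data structures here are made up for illustration):

def assign_map_tasks(splits, idle_machines, replicas):
    # splits: list of split ids; replicas: split id -> set of machines holding it
    assignments = {}
    idle = list(idle_machines)
    for split in splits:
        # prefer an idle machine that already stores this split's data
        local = [m for m in idle if m in replicas.get(split, set())]
        machine = local[0] if local else (idle[0] if idle else None)
        if machine is None:
            break                    # no idle workers left; wait for pings
        assignments[split] = machine
        idle.remove(machine)
    return assignments

print(assign_map_tasks(
    splits=["s0", "s1"],
    idle_machines=["m1", "m2"],
    replicas={"s0": {"m2"}, "s1": {"m3"}},
))
# {'s0': 'm2', 's1': 'm1'}
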
Execution (map)

• Map workers read in the contents of the corresponding input partition
• Perform the user-defined map computation to create intermediate <key, value> pairs
• Buffered output pairs are periodically written to local disk
  - Partitioned into R regions by a partitioning function
Partition Function

• Example partition function: hash(key) mod R
• Why do we need this?
• Example scenario:
  - Want to do word counting on 10 documents
  - 5 map tasks, 2 reduce tasks
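
The point of the partition function is that every map task applies the same rule, so all pairs sharing a key land in the same of the R regions and therefore reach the same reduce worker. A minimal sketch for the scenario above (R = 2):

R = 2  # number of reduce tasks in the example scenario

def partition(key, r=R):
    # hash(key) mod R: the same word always falls into the same region,
    # no matter which of the 5 map tasks produced it. Python's built-in
    # hash() is salted per process, so a real implementation would use a
    # stable hash of the key; it is used here only for illustration.
    return hash(key) % r

regions = {0: [], 1: []}
for word, count in [("hello", 1), ("world", 1), ("hello", 1)]:
    regions[partition(word)].append((word, count))
print(regions)   # both ("hello", 1) pairs end up in the same region
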
Execution (reduce)

• Reduce workers iterate over the ordered intermediate data
  - For each unique key encountered, the values are passed to the user's reduce function
  - e.g., <key, [value1, value2, ..., valueN]>

• Output of the user's reduce function is written to an output file on the global file system
• When all tasks have completed, the master wakes up the user program
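
A sketch of that reduce-side loop: the worker orders its region of intermediate data, walks it once, and hands each run of identical keys to the user's reduce function (a small in-memory list stands in for the fetched intermediate files):

from itertools import groupby
from operator import itemgetter

def reduce_fn(key, values):
    # user's reduce function from the word-count example
    return (key, sum(values))

# intermediate pairs fetched from the map workers for this reduce task
region = [("world", 1), ("hello", 1), ("hello", 1)]

region.sort(key=itemgetter(0))   # order the data by key
output = [reduce_fn(k, [v for _, v in group])
          for k, group in groupby(region, key=itemgetter(0))]
print(output)   # [('hello', 2), ('world', 1)]
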
Observations

• No reduce can begin until the map phase is complete
• Tasks are scheduled based on the location of data
• If a map worker fails any time before reduce finishes, its task must be completely rerun
• The master must communicate the locations of intermediate files
• The MapReduce library does most of the hard work for us!

[Dataflow diagram: input key/value pairs from data stores 1..n feed the map tasks; a barrier aggregates intermediate values by output key (key 1, key 2, key 3, ...); the reduce tasks then produce the final values for key 1, key 2, key 3.]

Fault Tolerance

• Workers are periodically pinged by the master
  - No response = failed worker
• Master writes periodic checkpoints
• On errors, workers send a "last gasp" UDP packet to the master
  - Detects records that cause deterministic crashes and skips them
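
A toy illustration of the ping check, assuming the master simply records the last time each worker answered (the timeout value and names are invented for the sketch):

import time

PING_TIMEOUT = 10.0  # seconds without a response before a worker is declared failed

def find_failed_workers(last_response, now=None):
    # last_response: worker id -> timestamp of the most recent successful ping
    now = time.time() if now is None else now
    return [w for w, t in last_response.items() if now - t > PING_TIMEOUT]

now = time.time()
print(find_failed_workers({"m1": now, "m2": now - 30.0}, now))   # ['m2']
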
Fault Tolerance

• Input file blocks are stored on multiple machines
• When the computation is almost done, reschedule in-progress tasks
  - Avoids "stragglers"
Debugging

• Offers human-readable status info on an HTTP server
  - Users can see jobs completed, in progress, processing rates, etc.
• Sequential implementation
  - Executes sequentially on a single machine
  - Allows use of gdb and other debugging tools
MapReduce Conclusions

• Simplifies large-scale computations that fit this model
• Allows the user to focus on the problem without worrying about details
• Computer architecture is not very important
  - Portable model
References

• Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"
• Josh Carter, http://multipart-mixed.com/software/mapreduce_presentation.pdf
• Ralf Lämmel, "Google's MapReduce Programming Model – Revisited"
• http://code.google.com/edu/parallel/mapreduce-tutorial.html

• RELATED
- Sawzall
- Pig
- Hadoop
