Understand: The First Phase of Mapreduce Paradigm, What Is A Map/Mapper, What Is The Input To The
MapReduce is the processing layer of Hadoop. The MapReduce programming model is designed for processing
large volumes of data in parallel by dividing the work into a set of independent tasks. You only need to express
your business logic in the form MapReduce expects, and the framework takes care of the rest.
The work (complete job) that the user submits to the master is divided into small pieces of work (tasks) and
assigned to slaves. In MapReduce, the input is a list, and it is converted into an output that is again
a list. MapReduce is the heart of Hadoop: Hadoop is powerful and efficient largely because MapReduce
processes data in parallel.
MapReduce divides the work into small parts, each of which can be done in parallel on the cluster of
servers. A problem is divided into a large number of smaller problems, each of which is processed to give
an individual output. These individual outputs are further processed to give the final output. Hadoop MapReduce
is scalable and can be used across many computers. Many small machines can be used to process jobs
that could not be processed by a single large machine.
MapReduce is the data processing component of Hadoop. MapReduce programs transform lists of input
data elements into lists of output data elements. A MapReduce program does this twice, using two
different list processing idioms:
• Map
• Reduce
In between Map and Reduce, there is a small phase called Shuffle and Sort.
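The three phases can be illustrated with a small in-memory sketch. This is not Hadoop code: the function names (run_job, map_phase, shuffle_and_sort, reduce_phase) and the word-count example are assumptions made for illustration, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every (key, value) input pair and collect what it emits."""
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    return intermediate


def shuffle_and_sort(intermediate):
    """Group intermediate values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped, reduce_fn):
    """Apply reduce_fn to each (key, list of values) group."""
    output = []
    for key, values in grouped:
        output.extend(reduce_fn(key, values))
    return output


def run_job(records, map_fn, reduce_fn):
    """Run map, then shuffle and sort, then reduce, on in-memory lists."""
    return reduce_phase(shuffle_and_sort(map_phase(records, map_fn)), reduce_fn)


# Hypothetical word-count logic plugged into the skeleton above.
def wc_map(offset, line):
    return [(word, 1) for word in line.split()]


def wc_reduce(word, counts):
    return [(word, sum(counts))]


lines = [(0, "big data big cluster"), (21, "data node")]
result = run_job(lines, wc_map, wc_reduce)
print(result)
```

The user supplies only wc_map and wc_reduce. The skeleton around them plays the role that the framework plays in a real cluster.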
A MapReduce job, or “full program”, is an execution of a Mapper and a Reducer across a data set, that is, an
execution of two processing layers. A MapReduce job is the work that the client wants
performed. It consists of the input data, the MapReduce program, and configuration information. So the client needs
to submit the input data, write the MapReduce program, and set the configuration (some of it is
provided during Hadoop setup in the configuration files, and some is specified in the
program itself and is specific to that MapReduce job).
Map Abstraction
Understand: The first phase of MapReduce paradigm, what is a map/mapper, what is the input to the
mapper, how it processes the data, what is output from the mapper?
The map takes a key/value pair as input. Whether the data is in a structured or unstructured format, the framework
converts the incoming data into keys and values.
• Key is a reference to the input value.
• Value is the data set on which to operate.
Map Processing:
• A function defined by user – user can write custom business logic according to his need to process the
data.
• Applies to every value in the input.
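As a rough illustration of a user-defined map function, the sketch below parses comma-separated "year,temperature" lines. The record format, the function name max_temp_mapper, and the use of the byte offset as the input key are all assumptions chosen for this example. Hadoop itself is not involved.

```python
def max_temp_mapper(offset, line):
    """Map one input record to zero or more intermediate (key, value) pairs.

    offset is the input key (a reference to the record) and line is the
    input value (the data to operate on).
    """
    fields = line.strip().split(",")
    if len(fields) != 2:
        return  # skip malformed records instead of failing the whole task
    year, reading = fields
    try:
        temperature = int(reading)
    except ValueError:
        return  # skip records whose reading is not a number
    yield year, temperature


# The mapper is applied to every record of the input.
records = [(0, "1950,22"), (8, "1950,31"), (16, "bad line"), (25, "1951,18")]
intermediate = [
    pair
    for offset, line in records
    for pair in max_temp_mapper(offset, line)
]
print(intermediate)
```

Note that the output key (the year) differs from the input key (the offset). The mapper is free to choose whichever key suits the later aggregation.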
Reduce Abstraction
Understand: what is the input to the reducer, what work reducer does, where reducer writes output?
The second phase of MapReduce – Reducer.
Reduce takes intermediate key/value pairs as input and processes the output of the mapper. Usually, the
reducer performs an aggregation or summation type of computation.
• The input given to the reducer is generated by Map (intermediate output).
• The key/value pairs provided to Reduce are sorted by key.
Reduce processing:
• A function defined by user – Here also user can write custom business logic and get the final output.
• An iterator supplies the values for a given key to the Reduce function.
Reduce produces a final list of key/value pairs:
• The output of Reduce is called the final output.
• It can be of a different type from the input pair.
• The output of Reduce is stored in HDFS.
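The reducer contract described above (input sorted by key, values for one key supplied through an iterator) can be sketched as follows. The names max_reducer and run_reducer are hypothetical, and itertools.groupby stands in for the grouping that the framework would perform. The result is simply returned as a list here instead of being written to HDFS.

```python
from itertools import groupby
from operator import itemgetter


def max_reducer(key, values):
    """Aggregate all values of one key; values is an iterator, not a list."""
    return key, max(values)


def run_reducer(sorted_pairs, reducer):
    """Feed each run of equal keys in sorted_pairs to the reducer."""
    output = []
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = (value for _, value in group)  # lazy iterator over the values
        output.append(reducer(key, values))
    return output


# Intermediate pairs must be sorted by key before grouping.
intermediate = sorted([("1951", 18), ("1950", 22), ("1950", 31)])
final_output = run_reducer(intermediate, max_reducer)
print(final_output)
```

Because groupby only merges adjacent equal keys, the sort step is what guarantees that each key reaches the reducer exactly once.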
The input data given to the mapper is processed by the user-defined function written for the mapper. All the required
complex business logic is implemented at the mapper level so that the heavy processing is done by the mappers
in parallel, since the number of mappers is much larger than the number of reducers. The mapper generates
intermediate data, and this output goes as input to the reducer.
This intermediate result is then processed by the user-defined function written for the reducer, and the final output is
generated. Usually, only very light processing is done in the reducer. This final output is stored in HDFS and
replicated as usual.
MapReduce DataFlow
How input is given to the mapper, how mappers process data, where mappers write the data, how data is
shuffled from mapper to reducer nodes, where reducers run, what type of processing should be done in the
reducers?
As seen from the diagram of the MapReduce workflow in Hadoop, each square block is a slave. There are 3 slaves
in the figure. Mappers will run on all 3 slaves, and then a reducer will run on any one of the slaves. For
simplicity of the figure, the reducer is shown on a different machine, but it actually runs on one of the mapper nodes.
Let us now discuss the map phase:
The input to a mapper is one block at a time (split = block by default).
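Since each split is one block by default, the number of map tasks for a file can be estimated from the file size and the block size. The sketch below is a simplification under that assumption only. The 128 MB block size and the function name count_map_tasks are assumptions for illustration, and real input formats may compute splits somewhat differently.

```python
MB = 1024 * 1024
ASSUMED_BLOCK_SIZE = 128 * MB  # assumption: block size configured on the cluster


def count_map_tasks(file_size_bytes, block_size_bytes=ASSUMED_BLOCK_SIZE):
    """Estimate the number of map tasks as one per block (split = block)."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: a partially filled last block still needs a mapper.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


print(count_map_tasks(1024 * MB))  # a 1 GB file
print(count_map_tasks(300 * MB))   # a 300 MB file
```

Under these assumptions the 1 GB file needs 8 map tasks and the 300 MB file needs 3, the last of which processes a partially filled block.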
The output of a mapper is written to the local disk of the machine on which the mapper is running. Once the map
finishes, this intermediate output travels to the reducer nodes (the nodes where reducers will run).
The reducer is the second phase of processing, where the user can again write custom business logic. The
output of the reducer is the final output, which is written to HDFS.
By default, 2 mappers run at a time on a slave, and this number can be increased as required. It
depends on factors such as datanode hardware, block size, and machine configuration. We should not
increase the number of mappers beyond a certain limit, because doing so will decrease performance.
A mapper in Hadoop MapReduce writes its output to the local disk of the machine it is running on. This is
temporary data. The output of a mapper is also called intermediate output. All mappers write their output
to the local disk. As soon as the first mapper finishes, its output starts traveling from the mapper node to the
reducer node. This movement of output from the mapper node to the reducer node is called shuffle.
The reducer is also deployed on one of the datanodes. The output from all the mappers goes to the
reducer. These outputs from the different mappers are merged to form the input for the reducer. This input is also
on local disk. The reducer is another processor where you can write custom business logic. It is the second stage
of the processing. Usually, we write aggregation, summation, and similar functionality in the reducer. The
reducer gives the final output, which it writes to HDFS.
Map and reduce are the stages of processing, and they run one after the other. Only after all mappers have completed
their processing does the reducer start processing.
The output from a mapper is partitioned and filtered into many partitions by the partitioner. Each of these partitions
goes to a reducer based on some condition. Hadoop works on the key/value principle, i.e. the mapper and reducer
get their input in the form of keys and values and write their output in the same form.
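One common way to decide which partition a key goes to is to hash the key and take the result modulo the number of reducers. The sketch below assumes that approach. The function names are hypothetical, and zlib.crc32 is used only as a deterministic stand-in hash, not as the hash function Hadoop uses.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick a partition index in [0, num_reducers) from a hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_output(pairs, num_reducers):
    """Split one mapper's output into per-reducer partitions, sorted by key."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return {index: sorted(items) for index, items in partitions.items()}


mapper_output = [("data", 1), ("big", 1), ("node", 1), ("data", 1), ("big", 1)]
for index, items in sorted(partition_output(mapper_output, 2).items()):
    print(index, items)
```

The important property is that the partition depends only on the key, so every pair with the same key ends up at the same reducer, whichever mapper produced it.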
Let’s understand what data locality is, how it optimizes MapReduce jobs, and how it improves job
performance.
“Move computation close to the data rather than data to computation”. A computation requested by an
application is much more efficient if it is executed near the data it operates on. This is especially true when
the size of the data set is huge, because it minimizes network congestion and increases the overall throughput of the
system. The assumption is that it is often better to move the computation closer to where the data is present
rather than moving the data to where the application is running. Hence, HDFS provides interfaces for
applications to move themselves closer to where the data is present.
Hadoop works on huge volumes of data, and it is not workable to move such volumes over the network.
Hence it has come up with the innovative principle of moving the algorithm to the data rather than the data to
the algorithm. This is called data locality.
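The idea can be sketched as a simple placement rule: prefer a node that already stores the block, and fall back to any free node only when no such node is available. This is a toy illustration, not Hadoop's scheduler. The function name pick_node, the node names, and the slot counts are all assumptions.

```python
def pick_node(block_locations, free_slots):
    """Choose a node for a map task and report whether the choice is data-local.

    block_locations: nodes that hold a replica of the task's input block.
    free_slots: mapping of node name to the number of free map slots.
    Returns (node, is_local), or (None, False) if no slot is free anywhere.
    """
    # First choice: run where the data already is, so no block is transferred.
    for node in block_locations:
        if free_slots.get(node, 0) > 0:
            return node, True
    # Fallback: any node with a free slot; the block must travel over the network.
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False
    return None, False


free_slots = {"slave1": 0, "slave2": 1, "slave3": 2}
print(pick_node(["slave1", "slave2"], free_slots))  # a replica holder is free
print(pick_node(["slave1"], free_slots))            # only a remote node is free
```

In the first call the task runs on slave2, which holds a replica, so the computation moves to the data. In the second call the only replica holder is busy, so the task runs remotely and the block has to be read over the network.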
Data flow