MapReduce Join Document
Side data is the extra, small, static data that a MapReduce job needs in order to perform its work.
The main challenge is making that side data available on every node where a map task may be
executed. Hadoop provides two side-data distribution techniques.
1. JOB CONFIGURATION
Small amounts of side data can be set as key-value pairs on the job's Configuration object.
The drawback is that the configuration is read by the jobtracker, the tasktracker, and every
new child JVM, so large values increase overhead on every front.
Apart from this, the side data requires its own serialization if it does not have a primitive
encoding.
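As a sketch of this first technique, the snippet below packs a small lookup table into a single string and unpacks it again, in plain Java. The class and method names are illustrative, not Hadoop API: encode() produces the value a driver would pass to conf.set(), and decode() is what each task would run on the value read back from its Configuration. It also shows why non-primitive side data needs this kind of hand-rolled serialization.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch (assumed names, not Hadoop API): side data travels through the job
// configuration as plain strings, so a map must be flattened and re-parsed.
public class ConfSideData {
    // What the driver would do before conf.set("status.codes", ...)
    static String encode(Map<String, String> codes) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : codes.entrySet()) {
            if (sb.length() > 0) sb.append(',');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // What each task would do with the string read back from its Configuration
    static Map<String, String> decode(String packed) {
        Map<String, String> codes = new HashMap<>();
        for (String pair : packed.split(",")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) codes.put(kv[0], kv[1]);
        }
        return codes;
    }

    public static void main(String[] args) {
        Map<String, String> codes = new HashMap<>();
        codes.put("001", "Delivered");
        codes.put("002", "Pending");
        System.out.println(decode(encode(codes)).get("001")); // Delivered
    }
}
```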
2. DISTRIBUTED CACHE
This provides a service for copying files and archives to the task nodes in time
for the tasks to use them when they run.
To save network bandwidth, files are normally copied to any particular node once
per job.
Sample code for Distributed Cache
@Override
public int run(String[] args) throws Exception {
    if (args.length != 2) {
        System.out.printf("Two parameters are required: <input dir> <output dir>\n");
        return -1;
    }
    Job job = Job.getInstance(getConf(), "distributed cache sample");
    job.setJarByClass(Driver.class);
    // Ship the lookup file to every task node (the path is a placeholder).
    job.addCacheFile(new Path("DeliveryStatusCodes.txt").toUri());
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
}
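The driver only ships the files; each task still has to load its local copy. Below is a minimal plain-Java sketch of the parsing a Mapper's setup() method would do once the Distributed Cache has placed a file such as DeliveryStatusCodes.txt on the task node's local disk. The class and method names are illustrative, not Hadoop API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Sketch (assumed names): parse a cached "code, status" file into an in-memory
// lookup table, as a Mapper's setup() would before any map() calls run.
public class CacheLoader {
    static Map<String, String> load(Reader src) {
        Map<String, String> table = new HashMap<>();
        try (BufferedReader in = new BufferedReader(src)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",\\s*");   // e.g. "001, Delivered"
                if (parts.length == 2) table.put(parts[0].trim(), parts[1].trim());
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return table;
    }

    public static void main(String[] args) {
        // Stand-in for a FileReader over the node-local cached copy.
        Reader cached = new StringReader("001, Delivered\n002, Pending\n");
        Map<String, String> codes = load(cached);
        System.out.println(codes.get("002")); // Pending
    }
}
```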
Sample Inputs
File 1 – UserDetails.txt
123456, Jim
456123, Tom
789123, Harry
789456, Richa
File 2 – DeliveryDetails.txt
123456, 001
456123, 002
789123, 003
789456, 004
File 3 – DeliveryStatusCodes.txt
001, Delivered
002, Pending
003, Failed
004, Resend
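To see how the three sample files fit together, here is a hedged plain-Java sketch of a map-side join over them: DeliveryDetails.txt and DeliveryStatusCodes.txt are small enough to sit in memory as lookup tables (for example, loaded from the Distributed Cache), so each UserDetails.txt record can be joined inside map() with two hash lookups and no reduce phase. Class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch (assumed names, not Hadoop API) of the join one map() call performs
// when the two small sample files are held in memory as lookup tables.
public class MapSideJoinSketch {
    static String mapRecord(String userId, String name,
                            Map<String, String> deliveries,    // userId -> code
                            Map<String, String> statusCodes) { // code -> status
        return name + "\t" + statusCodes.get(deliveries.get(userId));
    }

    public static void main(String[] args) {
        Map<String, String> deliveries = new HashMap<>();   // DeliveryDetails.txt
        deliveries.put("123456", "001");
        deliveries.put("456123", "002");
        Map<String, String> statusCodes = new HashMap<>();  // DeliveryStatusCodes.txt
        statusCodes.put("001", "Delivered");
        statusCodes.put("002", "Pending");
        // One UserDetails.txt record: 123456, Jim
        System.out.println(mapRecord("123456", "Jim", deliveries, statusCodes));
    }
}
```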
Two large datasets can also be joined in MapReduce programming. A join performed
in the map phase is called a map-side join, while a join performed on the reduce side is called a reduce-side join.
Let's look in detail at why we would need to join data in MapReduce. Suppose dataset A holds master data and B
holds transactional data (A and B are just for reference); we need to join them on a common key
to produce a result. It is important to realize that if the master dataset is small, we can share it with the side-data
techniques above (passing key-value pairs in the job configuration, or the distributed cache). We use a MapReduce join only
when both datasets are too big for those side-data sharing techniques.
A join written directly in MapReduce is not the recommended approach; the same problem can be addressed through higher-level frameworks
such as Hive or Cascading. But if you are in a situation where you must, you can use the method described below.
Data source refers to the input data files, probably extracted from an RDBMS.
A tag is attached to every record to mark its source, so that the record's origin can be identified at any given
point, whether in the map or the reduce phase; why this is required is covered later.
The group key refers to the column used as the join key between the two data sources.
Since we are going to join this data on the reduce side, we must prepare it in the map phase so that it can be used for
joining in the reduce phase. Let's look at the steps that need to be performed.
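The preparation just described can be sketched in plain Java (illustrative names, not Hadoop API): the map step tags each record with its source and emits it under the group key, the shuffle brings all values for one key together, and the reduce step pairs the tagged values from the two sources.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch (assumed names) of the reduce-side join idea: tag in map, pair in reduce.
public class ReduceSideJoinSketch {
    // map phase: prefix the value with its source tag before emitting
    static String tag(String source, String value) {
        return source + "\t" + value;
    }

    // reduce phase: all tagged values for one group key arrive together;
    // use the tags to tell the two sources apart and pair them up.
    static String joinGroup(List<String> taggedValues) {
        String name = null, code = null;
        for (String tagged : taggedValues) {
            String[] parts = tagged.split("\t", 2);
            if (parts[0].equals("USER")) name = parts[1];
            else if (parts[0].equals("DELIVERY")) code = parts[1];
        }
        return name + "\t" + code;
    }

    public static void main(String[] args) {
        // Stand-in for the shuffle: records grouped by the join key.
        Map<String, List<String>> shuffled = new HashMap<>();
        shuffled.computeIfAbsent("123456", k -> new ArrayList<>())
                .add(tag("USER", "Jim"));       // from UserDetails.txt
        shuffled.computeIfAbsent("123456", k -> new ArrayList<>())
                .add(tag("DELIVERY", "001"));   // from DeliveryDetails.txt
        for (List<String> group : shuffled.values())
            System.out.println(joinGroup(group));
    }
}
```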