MapReduce Join Document
Side data is the extra, small, static data that a MapReduce job needs in order to perform its work.
The main challenge is making that side data available on every node where a map task may be
executed. Hadoop provides two side-data distribution techniques.
1. JOB CONFIGURATION
Small amounts of side data can be set as key-value pairs on the job's Configuration object.
The drawback is that the configuration is read by the jobtracker, the tasktracker, and every
new child JVM, so large values increase overhead on every front.
Apart from this, the side data requires its own serialization if it does not have a primitive
encoding.
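As a sketch of this first technique, the snippet below packs a small lookup table into a single string and unpacks it again, in plain Java. The class and method names are illustrative, not Hadoop API: encode() produces the value a driver would pass to conf.set(), and decode() is what each task would run on the value read back from its Configuration. It also shows why non-primitive side data needs this kind of hand-rolled serialization.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch (assumed names, not Hadoop API): side data travels through the job
// configuration as plain strings, so a map must be flattened and re-parsed.
public class ConfSideData {
    // What the driver would do before conf.set("status.codes", ...)
    static String encode(Map<String, String> codes) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : codes.entrySet()) {
            if (sb.length() > 0) sb.append(',');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // What each task would do with the string read back from its Configuration
    static Map<String, String> decode(String packed) {
        Map<String, String> codes = new HashMap<>();
        for (String pair : packed.split(",")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) codes.put(kv[0], kv[1]);
        }
        return codes;
    }

    public static void main(String[] args) {
        Map<String, String> codes = new HashMap<>();
        codes.put("001", "Delivered");
        codes.put("002", "Pending");
        System.out.println(decode(encode(codes)).get("001")); // Delivered
    }
}
```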
2. DISTRIBUTED CACHE
This provides a service for copying files and archives to the task nodes in time
for the tasks to use them when they run.
To save network bandwidth, files are normally copied to any particular node once
per job.
Sample code for Distributed Cache
@Override
public int run(String[] args) throws Exception {
    if (args.length != 2) {
        System.out.printf("Two parameters are required: <input dir> <output dir>\n");
        return -1;
    }
    Job job = Job.getInstance(getConf(), "distributed cache sample");
    job.setJarByClass(Driver.class);
    // Ship the lookup file to every task node (the path is a placeholder).
    job.addCacheFile(new Path("DeliveryStatusCodes.txt").toUri());
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
}
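The driver only ships the files; each task still has to load its local copy. Below is a minimal plain-Java sketch of the parsing a Mapper's setup() method would do once the Distributed Cache has placed a file such as DeliveryStatusCodes.txt on the task node's local disk. The class and method names are illustrative, not Hadoop API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Sketch (assumed names): parse a cached "code, status" file into an in-memory
// lookup table, as a Mapper's setup() would before any map() calls run.
public class CacheLoader {
    static Map<String, String> load(Reader src) {
        Map<String, String> table = new HashMap<>();
        try (BufferedReader in = new BufferedReader(src)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",\\s*");   // e.g. "001, Delivered"
                if (parts.length == 2) table.put(parts[0].trim(), parts[1].trim());
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return table;
    }

    public static void main(String[] args) {
        // Stand-in for a FileReader over the node-local cached copy.
        Reader cached = new StringReader("001, Delivered\n002, Pending\n");
        Map<String, String> codes = load(cached);
        System.out.println(codes.get("002")); // Pending
    }
}
```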
Sample Inputs
File 1 – UserDetails.txt
123456, Jim
456123, Tom
789123, Harry
789456, Richa
File 2 – DeliveryDetails.txt
123456, 001
456123, 002
789123, 003
789456, 004
File 3 – DeliveryStatusCodes.txt
001, Delivered
002, Pending
003, Failed
004, Resend
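To see how the three sample files fit together, here is a hedged plain-Java sketch of a map-side join over them: DeliveryDetails.txt and DeliveryStatusCodes.txt are small enough to sit in memory as lookup tables (for example, loaded from the Distributed Cache), so each UserDetails.txt record can be joined inside map() with two hash lookups and no reduce phase. Class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch (assumed names, not Hadoop API) of the join one map() call performs
// when the two small sample files are held in memory as lookup tables.
public class MapSideJoinSketch {
    static String mapRecord(String userId, String name,
                            Map<String, String> deliveries,    // userId -> code
                            Map<String, String> statusCodes) { // code -> status
        return name + "\t" + statusCodes.get(deliveries.get(userId));
    }

    public static void main(String[] args) {
        Map<String, String> deliveries = new HashMap<>();   // DeliveryDetails.txt
        deliveries.put("123456", "001");
        deliveries.put("456123", "002");
        Map<String, String> statusCodes = new HashMap<>();  // DeliveryStatusCodes.txt
        statusCodes.put("001", "Delivered");
        statusCodes.put("002", "Pending");
        // One UserDetails.txt record: 123456, Jim
        System.out.println(mapRecord("123456", "Jim", deliveries, statusCodes));
    }
}
```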
Two large datasets can also be joined in MapReduce programming. A join performed
in the map phase is called a map-side join, while a join performed on the reduce side is called a reduce-side join.
Let's look in detail at why we would need to join data in MapReduce. Suppose dataset A holds master data and B
holds transactional data (A and B are just for reference); we need to join them on a common key
to produce a result. It is important to realize that if the master dataset is small, we can share it with the side-data
techniques above (passing key-value pairs in the job configuration, or the distributed cache). We use a MapReduce join only
when both datasets are too big for those side-data sharing techniques.
A join written directly in MapReduce is not the recommended approach; the same problem can be addressed through higher-level frameworks
such as Hive or Cascading. But if you are in a situation where you must, you can use the method described below.
Data source refers to the input data files, probably extracted from an RDBMS.
A tag is attached to every record to mark its source, so that the record's origin can be identified at any given
point, whether in the map or the reduce phase; why this is required is covered later.
The group key refers to the column used as the join key between the two data sources.
Since we are going to join this data on the reduce side, we must prepare it in the map phase so that it can be used for
joining in the reduce phase. Let's look at the steps that need to be performed.
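The preparation just described can be sketched in plain Java (illustrative names, not Hadoop API): the map step tags each record with its source and emits it under the group key, the shuffle brings all values for one key together, and the reduce step pairs the tagged values from the two sources.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch (assumed names) of the reduce-side join idea: tag in map, pair in reduce.
public class ReduceSideJoinSketch {
    // map phase: prefix the value with its source tag before emitting
    static String tag(String source, String value) {
        return source + "\t" + value;
    }

    // reduce phase: all tagged values for one group key arrive together;
    // use the tags to tell the two sources apart and pair them up.
    static String joinGroup(List<String> taggedValues) {
        String name = null, code = null;
        for (String tagged : taggedValues) {
            String[] parts = tagged.split("\t", 2);
            if (parts[0].equals("USER")) name = parts[1];
            else if (parts[0].equals("DELIVERY")) code = parts[1];
        }
        return name + "\t" + code;
    }

    public static void main(String[] args) {
        // Stand-in for the shuffle: records grouped by the join key.
        Map<String, List<String>> shuffled = new HashMap<>();
        shuffled.computeIfAbsent("123456", k -> new ArrayList<>())
                .add(tag("USER", "Jim"));       // from UserDetails.txt
        shuffled.computeIfAbsent("123456", k -> new ArrayList<>())
                .add(tag("DELIVERY", "001"));   // from DeliveryDetails.txt
        for (List<String> group : shuffled.values())
            System.out.println(joinGroup(group));
    }
}
```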