
Unit II Big Data


What is Hadoop

Hadoop is an open-source framework from Apache used to store, process, and analyze data that is
very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing).
It is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn
and many more. Moreover, it can be scaled up just by adding nodes to the cluster.

Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on
the basis of it. It states that files will be broken into blocks and stored on nodes in the
distributed architecture.
2. Yarn: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.
3. Map Reduce: This is a framework which helps Java programs to do parallel computation on data
using key-value pairs. The Map task takes input data and converts it into a data set which can be
computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output
of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by the other Hadoop
modules.

Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes
Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes DataNode
and TaskTracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It has a
master/slave architecture. This architecture consists of a single NameNode, which performs the role of
master, and multiple DataNodes, which perform the role of slaves.

Both the NameNode and DataNode are capable of running on commodity machines. HDFS is developed in
Java, so any machine that supports Java can easily run the NameNode and DataNode
software.

NameNode

o It is the single master server that exists in the HDFS cluster.


o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming and
closing files.
o It simplifies the architecture of the system.

DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the
NameNode.
o In response, the NameNode provides metadata to the Job Tracker.

Task Tracker
o It works as a slave node for Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be
called a Mapper.

MapReduce Layer
MapReduce comes into play when the client application submits a MapReduce job to the
Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers.
Sometimes a TaskTracker fails or times out. In such a case, that part of the job is rescheduled.

Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval.
Even the tools to process the data are often on the same servers, thus reducing the processing time.
It is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-
effective compared to a traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if
one node goes down or some other network failure happens, Hadoop takes the other copy of the data
and uses it. Normally, data is replicated three times, but the replication factor is configurable (an
example command is shown after this list).
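As a hedged illustration (the file path is a placeholder), the replication factor of a file that is
already in HDFS can be changed from the command line, while the cluster-wide default is governed by
the dfs.replication property in hdfs-site.xml:

$ hadoop fs -setrep -w 2 /user/test/data.txt

Here -w makes the command wait until the new replication factor of 2 has actually been applied to
every block of the file.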

Features of Hadoop Which Makes It Popular


Let’s discuss the key features which make Hadoop more reliable to use, an industry
favorite, and the most powerful Big Data tool.
1. Open Source:
Hadoop is open-source, which means it is free to use. Since it is an open-source project,
the source code is available online for anyone to understand it or make
modifications as per their industry requirements.
2. Highly Scalable Cluster:
Hadoop is a highly scalable model. A large amount of data is divided across multiple
inexpensive machines in a cluster and processed in parallel. The number of these
machines or nodes can be increased or decreased as per the enterprise's requirements.
Traditional RDBMSs (Relational Database Management Systems) cannot be
scaled to handle such large amounts of data.
3. Fault Tolerance is Available:
Hadoop uses commodity hardware (inexpensive systems) which can crash at any
moment. In Hadoop, data is replicated on various DataNodes in a Hadoop cluster, which
ensures the availability of data even if one of your systems crashes. If a machine faces a
technical issue, the data can still be read from other nodes in the Hadoop cluster,
because the data is copied or replicated by default. By default, Hadoop makes 3 copies of
each file block and stores them on different nodes. This replication factor is configurable and can be
changed via the replication property in the hdfs-site.xml file.
4. High Availability is Provided:
Fault tolerance provides High Availability in the Hadoop cluster. High Availability means
the availability of data in the Hadoop cluster. Due to fault tolerance, if any
DataNode goes down, the same data can be retrieved from any other node where the
data is replicated. A highly available Hadoop cluster also has two or more
NameNodes, i.e. an Active NameNode and a Passive NameNode, also known as a standby NameNode.
If the Active NameNode fails, then the Passive NameNode takes over the responsibility of the
Active Node and provides the same data as the Active NameNode, which can easily be
utilized by the user.
5. Cost-Effective:
Hadoop is open-source and uses cost-effective commodity hardware, which provides a
cost-efficient model, unlike traditional relational databases that require expensive
hardware and high-end processors to deal with Big Data. The problem with traditional
relational databases is that storing a massive volume of data is not cost-effective, so
companies started to discard the raw data, which may not reflect the true picture of
their business. Hadoop thus provides two main cost benefits: it is
open-source, meaning free to use, and it uses commodity hardware,
which is also inexpensive.
6. Hadoop Provide Flexibility:
Hadoop is designed in such a way that it can deal with any kind of dataset, such as
structured (MySQL data), semi-structured (XML, JSON), and unstructured (images and
videos), very efficiently. This means it can easily process any kind of data independent of
its structure, which makes it highly flexible. It is very useful for enterprises as they
can process large datasets easily, so businesses can use Hadoop to extract
valuable insights from data sources like social media, email, etc. With this flexibility,
Hadoop can be used for log processing, data warehousing, fraud detection, etc.
7. Easy to Use:
Hadoop is easy to use since developers need not worry about any of the processing
work, as it is managed by Hadoop itself. The Hadoop ecosystem is also very large and
comes with lots of tools like Hive, Pig, Spark, HBase, Mahout, etc.
8. Hadoop uses Data Locality:
The concept of Data Locality is used to make Hadoop processing fast. In the data locality
concept, the computation logic is moved near data rather than moving the data to the
computation logic. Moving data in HDFS is the costliest operation, and with the help of the
data locality concept, the bandwidth utilization in the system is minimized.
9. Provides Faster Data Processing:
Hadoop uses a distributed file system to manage its storage i.e. HDFS(Hadoop
Distributed File System). In a DFS (Distributed File System) a large file is broken into
small file blocks which are then distributed among the nodes available in a Hadoop cluster;
this massive number of file blocks is processed in parallel, which makes Hadoop faster
and gives it a high level of performance compared to traditional
Database Management Systems.
10. Support for Multiple Data Formats:
Hadoop supports multiple data formats like CSV, JSON, Avro, and more, making it easier
to work with different types of data sources. This makes it more convenient for
developers and data analysts to handle large volumes of data with different formats.
11. High Processing Speed:
Hadoop’s distributed processing model allows it to process large amounts of data at high
speeds. This is achieved by distributing data across multiple nodes and processing it in
parallel. As a result, Hadoop can process data much faster than traditional database
systems.
12. Machine Learning Capabilities:
Hadoop offers machine learning capabilities through its ecosystem tools like Mahout,
which is a library for creating scalable machine learning applications. With these tools,
data analysts and developers can build machine learning models to analyze and process
large datasets.
13. Integration with Other Tools:
Hadoop integrates with other popular tools like Apache Spark, Apache Flink, and Apache
Storm, making it easier to build data processing pipelines. This integration allows
developers and data analysts to use their favorite tools and frameworks for building data
pipelines and processing large datasets.
14. Secure:
Hadoop provides built-in security features like authentication, authorization, and
encryption. These features help to protect data and ensure that only authorized users
have access to it. This makes Hadoop a more secure platform for processing sensitive
data.
15. Community Support:
Hadoop has a large community of users and developers who contribute to its
development and provide support to users. This means that users can access a wealth of
resources and support to help them get the most out of Hadoop.

Hadoop – Daemons and Their Features




A daemon means a process. Hadoop daemons are a set of processes that run on Hadoop. Hadoop is a
framework written in Java, so all these processes are Java processes.

Apache Hadoop 2 consists of the following Daemons:


 NameNode
 DataNode
 Secondary Name Node
 Resource Manager
 Node Manager
Namenode, Secondary NameNode, and Resource Manager work on a Master System while the
Node Manager and DataNode work on the Slave machine.
1. NameNode

NameNode works on the master system. The primary purpose of the NameNode is to manage all the
metadata. Metadata is the list of files stored in HDFS (Hadoop Distributed File System). As we
know, the data is stored in the form of blocks in a Hadoop cluster. So the DataNode on which, or the
location at which, each block of a file is stored is recorded in the metadata. All information regarding
the logs of the transactions happening in a Hadoop cluster (when or who read/wrote the data) is
stored in the metadata. The metadata is stored in memory.

Features:
 It never stores the data that is present in the file.
 As the NameNode works on the master system, the master system should have good processing
power and more RAM than the slaves.
 It stores information about the DataNodes, such as their block IDs and number of blocks.
How to start Name Node?
hadoop-daemon.sh start namenode
How to stop Name Node?
hadoop-daemon.sh stop namenode
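Note: in recent Hadoop 3.x releases the hadoop-daemon.sh script is deprecated in favor of the
--daemon option. Assuming such a release, the equivalent commands would be:
hdfs --daemon start namenode
hdfs --daemon stop namenode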

2. DataNode

DataNode works on the slave system. The NameNode always instructs the DataNode about storing the
data. A DataNode is a program that runs on the slave system and serves the read/write requests from
the client. As the data is stored on these DataNodes, they should have large storage capacity to hold more
data.

How to start Data Node?


hadoop-daemon.sh start datanode
How to stop Data Node?
hadoop-daemon.sh stop datanode

3. Secondary NameNode

The Secondary NameNode is used for taking hourly backups of the metadata. In case the Hadoop cluster
fails or crashes, the Secondary NameNode takes the hourly backup or checkpoint of that metadata
and stores it in a file named fsimage. This file can then be transferred to a new system, the
metadata is assigned to that new system, a new master is created with this metadata, and the
cluster is made to run correctly again.
This is the benefit of the Secondary NameNode. In Hadoop 2 we have the High Availability and
Federation features, which minimize the importance of the Secondary NameNode.

Major Function Of Secondary NameNode:


 It merges the edit logs and fsimage from the NameNode.
 It continuously reads the metadata from the RAM of the NameNode and writes it to the hard disk.
As the Secondary NameNode keeps track of checkpoints in the Hadoop Distributed File System, it is also
known as the Checkpoint Node.

Hadoop Daemon           Port

Name Node               50070

Data Node               50075

Secondary Name Node     50090

These ports can be configured manually in hdfs-site.xml and mapred-site.xml files.

4. Resource Manager

The Resource Manager is also known as the Global Master Daemon that works on the master system.
The Resource Manager manages the resources for the applications that are running in a Hadoop
cluster. The Resource Manager mainly consists of two components:

1. ApplicationsManager
2. Scheduler
The ApplicationsManager is responsible for accepting job requests from clients and for negotiating a
resource container on the slaves in a Hadoop cluster to host the ApplicationMaster. The Scheduler is
responsible for allocating resources to the applications running in a Hadoop cluster; monitoring of a
running application is handled by its ApplicationMaster rather than by the Scheduler.
How to start ResourceManager?
yarn-daemon.sh start resourcemanager
How to stop ResourceManager?
yarn-daemon.sh stop resourcemanager

5. Node Manager

The NodeManager works on the slave system and manages the resources (memory and disk) within that
node. Each slave node in a Hadoop cluster has a single NodeManager daemon
running on it. It also sends this monitoring information to the Resource Manager.

How to start Node Manager?


yarn-daemon.sh start nodemanager
How to stop Node Manager?
yarn-daemon.sh stop nodemanager

In a Hadoop cluster, the Resource Manager and Node Manager can be tracked through specific web URLs
of the form http://<hostname>:<port>
Hadoop Daemon           Port

ResourceManager         8088

NodeManager             8042

[Diagram: how Hadoop works]


What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several
machines and replicated to ensure durability against failure and high availability for parallel
applications.

It is cost-effective as it uses commodity hardware. It involves the concepts of blocks, data nodes and
name node.

Where to use HDFS



o Very Large Files: Files should be of hundreds of megabytes, gigabytes or more.


o Streaming Data Access: The time to read the whole data set is more important than the latency of
reading the first record. HDFS is built on a write-once, read-many-times pattern.
o Commodity Hardware: It works on low-cost hardware.

Where not to use HDFS


o Low Latency data access: Applications that require very little time to access the first record
should not use HDFS, as it gives importance to the whole data set rather than the time to fetch the
first record.
o Lots Of Small Files: The name node holds the metadata of files in memory, and if the files
are small in size this consumes a lot of the name node's memory, which is not feasible.
o Multiple Writes: It should not be used when we have to write to the same file multiple times.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are
128 MB by default, and this is configurable. Files in HDFS are broken into block-sized
chunks, which are stored as independent units. Unlike in an ordinary file system, if a file in HDFS is
smaller than the block size, it does not occupy the full block's size, i.e. a 5 MB file stored in
HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large just to
minimize the cost of seeks. (A command for inspecting a file's blocks is shown after this list.)
2. Name Node: HDFS works in a master-worker pattern where the name node acts as
master. The Name Node is the controller and manager of HDFS, as it knows the status and the
metadata of all the files in HDFS; the metadata information being the file permissions, names and
location of each block. The metadata is small, so it is stored in the memory of the name
node, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple
clients concurrently, so all this information is handled by a single machine. The file system
operations like opening, closing, renaming etc. are executed by it.
3. Data Node: They store and retrieve blocks when they are told to, by the client or name node.
They report back to the name node periodically, with the list of blocks that they are storing. The
data node, being commodity hardware, also does the work of block creation, deletion and
replication as instructed by the name node.
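As a small illustrative sketch (the path refers to the data.txt example used later in this unit), the
way a file has been split into blocks and replicated can be inspected with the HDFS file system checker:

$ hdfs fsck /user/test/data.txt -files -blocks -locations

The report lists each block of the file, its length, its replication, and the data nodes holding its
replicas.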

[Diagram: HDFS DataNode and NameNode]

[Diagram: HDFS read operation]

[Diagram: HDFS write operation]

Since all the metadata is stored in the name node, it is very important. If it fails, the file system cannot
be used, as there would be no way of knowing how to reconstruct the files from the blocks present in the
data nodes. To overcome this, the concept of the secondary name node arises.

Secondary Name Node: It is a separate physical machine which acts as a helper to the name node. It
performs periodic checkpoints. It communicates with the name node and takes snapshots of the metadata,
which helps minimize downtime and loss of data.

Starting HDFS
The HDFS should be formatted initially and then started in the distributed mode. Commands are
given below.

To Format $ hadoop namenode -format

To Start $ start-dfs.sh

HDFS Basic File Operations


1. Putting data to HDFS from local file system
o First create a folder in HDFS where data can be put from the local file system.

$ hadoop fs -mkdir /user/test

o Copy the file "data.txt" from a file kept in local folder /usr/home/Desktop to HDFS
folder /user/ test

$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test

o Display the content of HDFS folder

$ hadoop fs -ls /user/test

2. Copying data from HDFS to local file system


o $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt
3. Compare the files and see that both are the same
o $ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt

Recursive deleting

o hadoop fs -rmr <arg>

Example:

o hadoop fs -rmr /user/sonoo/
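Note that -rmr is deprecated in current Hadoop releases; assuming a recent release, the same recursive
delete would be written as:

o hadoop fs -rm -r /user/sonoo/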

HDFS Other commands


The following notation is used in the commands:

"<path>" means any file or directory name.

"<path>..." means one or more file or directory names.

"<file>" means any filename.

"<src>" and "<dest>" are path names in a directed operation.


"<localSrc>" and "<localDest>" are paths as above, but on the local file system

o put <localSrc><dest>

Copies the file or directory from the local file system identified by localSrc to dest within the
DFS.

o copyFromLocal <localSrc><dest>

Identical to -put

o moveFromLocal <localSrc><dest>

Copies the file or directory from the local file system identified by localSrc to dest within
HDFS, and then deletes the local copy on success.

o get [-crc] <src><localDest>

Copies the file or directory in HDFS identified by src to the local file system path identified
by localDest.

o cat <filename>

Displays the contents of filename on stdout.

o moveToLocal <src><localDest>

Works like -get, but deletes the HDFS copy on success.

o setrep [-R] [-w] rep <path>

Sets the target replication factor for files identified by path to rep. (The actual replication
factor will move toward the target over time)

o touchz <path>

Creates a file at path containing the current time as a timestamp. Fails if a file already exists
at path, unless the file is already size 0.

o test -[ezd] <path>

Returns 1 if path exists, has zero length, or is a directory; returns 0 otherwise.


o stat [format] <path>

Prints information about path. Format is a string which accepts file size in blocks (%b),
filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
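For example, assuming the data.txt file created earlier in this unit, the following prints the file name,
size, and replication factor:

$ hadoop fs -stat "%n %b %r" /user/test/data.txt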

HDFS Features and Goals


The Hadoop Distributed File System (HDFS) is a distributed file system. It is a core part of Hadoop
which is used for data storage. It is designed to run on commodity hardware.

Unlike other distributed file systems, HDFS is highly fault-tolerant and can be deployed on low-cost
hardware. It can easily handle applications that contain large data sets.

Let's see some of the important features and goals of HDFS.

Features of HDFS

o Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a single
cluster.
o Replication - Due to some unfavorable condition, the node containing the data may be
lost. So, to overcome such problems, HDFS always maintains a copy of the data on a different
machine.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the
event of failure. HDFS is so fault-tolerant that if any machine fails, another machine
containing a copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS that makes
Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to
another.

Goals of HDFS
o Handling hardware failure - HDFS contains multiple server machines. If
any machine fails, the goal of HDFS is to recover from it quickly.
o Streaming data access - HDFS applications require streaming access to their data sets,
unlike general-purpose applications that run on general-purpose file systems.
o Coherence Model - Applications that run on HDFS are required to follow the write-once-
read-many approach. So a file, once created, need not be changed. However, it can be
appended and truncated.

Characteristics of HDFS
1. Run-on low-cost system i.e. commodity hardware
The Hadoop Distributed File System is very similar to existing distributed file
systems, but it differs in several aspects, such as its use of commodity hardware. Hadoop HDFS
does not require specialized hardware to store and process very large amounts of data; rather it
is designed to work on low-cost clusters of commodity hardware, where a cluster means a
group of connected computers that are cheap and affordable.
2. Provide High Fault Tolerance
HDFS provides high fault tolerance. Fault tolerance is achieved when the system
functions properly without any data loss even if some hardware components of the
system have failed. In an ordinary cluster, when a single node fails it can crash the entire system. The
primary duty of fault tolerance is to remove such failed nodes, which disturb the
normal functioning of the system. By default, in HDFS every data block is replicated on 3
data nodes. If a data node goes down, the client can easily fetch the data from the other 2
data nodes where the data is replicated; hence it prevents the entire system from going down, and
fault tolerance is achieved in a Hadoop cluster. HDFS is flexible enough to add and
remove data nodes with little effort. There are 3 ways in which HDFS can
achieve fault tolerance, i.e. data replication, heartbeat messages, and checkpoints and
recovery.
3. Large Data Set
In the case of HDFS, a large data set means data that is hundreds of megabytes,
gigabytes, terabytes, or sometimes even petabytes in size. It is preferable to use HDFS
for files of very large size instead of many small files, because the metadata of a
large number of small files consumes far more space in the name node's memory than the
smaller number of entries needed for large files.
4. High Throughput
HDFS is designed to be a high-throughput batch processing system rather than
providing low-latency interactive use. HDFS always implements the WORM pattern, i.e. Write
Once Read Many. The data is immutable, meaning that once the data is written it cannot be
changed, so the data stays consistent across the network. Thus it can process large
data in a given amount of time and hence provides high throughput.
5. Data Locality
HDFS allows us to store and process massive data sets on a cluster made up of
commodity hardware. Since the data is significantly large, HDFS moves the
computation process, i.e. the Map-Reduce program, towards the data instead of pulling the
data out for computation. This minimizes network congestion and increases the overall
throughput of the system.
6. Scalability
As HDFS stores large data sets over multiple nodes, when the data storage requirement
increases or decreases, the number of nodes can be scaled up or scaled down
in the cluster. Vertical and horizontal scalability are the two different mechanisms available
to provide scalability in the cluster. Vertical scalability means adding resources such as
disk space and RAM on the existing nodes of the cluster. In horizontal
scaling, on the other hand, we increase the number of nodes in the cluster, which is preferable
since we can have hundreds of nodes in a cluster.
7. Data Compression
HDFS provides built-in support for data compression. Data compression reduces the
amount of storage required for large data sets, which in turn reduces the storage cost.
HDFS uses the zlib compression algorithm by default and supports other compression
algorithms like Snappy, LZO, and Gzip. The compression algorithm can be specified at
the time of creating a file or directory.
8. Security
HDFS provides security for the stored data through features like authentication,
authorization, and encryption. Authentication is the process of verifying the identity of the
user who is accessing the system, while authorization is the process of determining
whether a user has the required permissions to access the data. Encryption is the
process of encoding the data in such a way that it cannot be read by unauthorized users.
HDFS provides Kerberos authentication and Access Control Lists (ACLs) for
authorization.
9. Easy Integration
HDFS is designed to be easily integrated with other Hadoop ecosystem components like
Apache Spark, Apache Hive, and Apache Pig. This enables developers to perform
various data processing tasks using these tools without having to worry about data
storage and retrieval, as HDFS takes care of it. Additionally, HDFS can be accessed
using a variety of programming languages like Java, Python, and Scala.
10. Open Source
HDFS is an open-source project developed by Apache Software Foundation. This means
that anyone can contribute to the development of the software and make improvements.
Open source also means that there is a large community of developers who are
constantly working to improve the software and fix bugs. This ensures that HDFS is
constantly evolving and improving.

Hadoop - HDFS Operations


Starting HDFS
Initially you have to format the configured HDFS file system, open
namenode (HDFS server), and execute the following command.

$ hadoop namenode -format

After formatting the HDFS, start the distributed file system. The following
command will start the namenode as well as the data nodes as cluster.

$ start-dfs.sh

Listing Files in HDFS


After loading the information into the server, we can list the files in a
directory, or the status of a file, using 'ls'. Given below is the syntax of ls; you
can pass a directory or a filename as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls <args>

Inserting Data into HDFS


Assume we have data in a file called file.txt in the local system which
ought to be saved in the HDFS file system. Follow the steps given below to
insert the required file into the Hadoop file system.
Step 1

You have to create an input directory.

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input

Step 2

Transfer and store a data file from local systems to the Hadoop file system
using the put command.

$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input

Step 3

You can verify the file using ls command.

$ $HADOOP_HOME/bin/hadoop fs -ls /user/input

Retrieving Data from HDFS


Assume we have a file in HDFS called outfile. Given below is a simple
demonstration for retrieving the required file from the Hadoop file system.

Step 1
Initially, view the data from HDFS using cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile

Step 2
Get the file from HDFS to the local file system using get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/

Shutting Down the HDFS


You can shut down the HDFS by using the following command.

$ stop-dfs.sh

What is MapReduce?
MapReduce is a data processing tool which is used to process data in parallel in a distributed
form. It was developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data
Processing on Large Clusters," published by Google.

MapReduce is a paradigm which has two phases, the mapper phase and the reducer phase. In
the Mapper, the input is given in the form of key-value pairs. The output of the Mapper is fed to
the reducer as input. The reducer runs only after the Mapper is over. The reducer also takes input in
key-value format, and the output of the reducer is the final output.
Steps in Map Reduce

o The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys will
not be unique in this case.
o Using the output of Map, sort and shuffle are applied by the Hadoop architecture. This sort
and shuffle acts on the list of <key, value> pairs and sends out unique keys and a list of
values associated with each unique key, <key, list(values)>.
o The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined
function on the list of values for each unique key, and the final output <key, value> is
stored/displayed.
Sort and Shuffle
The sort and shuffle occur on the output of Mapper and before the reducer. When the Mapper
task is complete, the results are sorted by key, partitioned if there are multiple reducers, and then
written to disk. Using the input from each Mapper <k2,v2>, we collect all the values for each
unique key k2. This output from the shuffle phase in the form of <k2, list(v2)> is sent as input to
reducer phase.
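As a small worked illustration (the input line is invented for this example), consider counting the
words in the line "deer bear river deer":

Map output:        (deer,1) (bear,1) (river,1) (deer,1)
Sort and shuffle:  (bear,[1]) (deer,[1,1]) (river,[1])
Reduce output:     (bear,1) (deer,2) (river,1)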

Usage of MapReduce
o It can be used in various applications like document clustering, distributed sorting, and web
link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.
o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and
mobile environments.

MapReduce Architecture


MapReduce and HDFS are the two major components of Hadoop which make it so powerful and
efficient to use. MapReduce is a programming model used for efficient parallel processing of
large data sets in a distributed manner. The data is first split and then combined to produce the
final result. Libraries for MapReduce have been written in many programming languages, with
various different optimizations. The purpose of MapReduce in Hadoop is to map each job and then
reduce it to equivalent tasks, which lowers the overhead on the cluster network and reduces the
processing power required. A MapReduce task is mainly divided into two phases, the Map
phase and the Reduce phase.
MapReduce Architecture:
Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the Job to the MapReduce for processing.
There can be multiple clients available that continuously send jobs for processing to the
Hadoop MapReduce Manager.
2. Job: The MapReduce job is the actual work that the client wants to do, which is composed of
many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all
the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce, we have a client. The client submits a job of a particular size to the Hadoop
MapReduce Master. The MapReduce master then divides this job into further equivalent job-
parts. These job-parts are then made available to the Map and Reduce tasks. The Map and Reduce
tasks contain the program written for the use case that the particular company is
solving; the developer writes the logic to fulfill the industry's requirement. The
input data is fed to the Map task, and the Map generates intermediate
key-value pairs as its output. The output of Map, i.e. these key-value pairs, is then fed to the
Reducer, and the final output is stored on HDFS. There can be any number of Map and Reduce
tasks made available for processing the data, as the requirement dictates. The algorithms for Map
and Reduce are written in a highly optimized way so that the time complexity and space complexity
are minimal.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input
to the map may be a key-value pair where the key can be the id or address of some kind and the
value is the actual record content. The Map() function is executed in its memory
repository on each of these input key-value pairs and generates the intermediate key-value pairs,
which work as input for the Reducer or Reduce() function.

2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted,
and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value
pairs as per the reducer algorithm written by the developer. (A minimal WordCount sketch illustrating
both phases is given below.)
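The sketch below shows both phases concretely. It follows the standard Apache WordCount example
written against the org.apache.hadoop.mapreduce API; the class names and the word-count logic are
illustrative rather than part of this unit's examples.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: for every word in the input split, emit the pair (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);          // intermediate key-value pair
            }
        }
    }

    // Reduce phase: sum the list of counts collected for each word after shuffle and sort
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);            // final (word, total) pair
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local combine step
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The mapper emits (word, 1) pairs, the framework shuffles and sorts them by word, and the reducer sums
the counts, mirroring the Map and Reduce descriptions above. The job would normally be packaged into a
jar and submitted with the hadoop jar command.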
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The work of Job tracker is to manage all the resources and all the jobs across the
cluster and also to schedule each map on the Task Tracker running on the same data node
since there can be hundreds of data nodes available in the cluster.

2. Task Tracker: The Task Tracker can be considered as the actual slaves that are working on the
instruction given by the Job Tracker. This Task Tracker is deployed on each of the nodes
available in the cluster that executes the Map and Reduce task as instructed by Job Tracker.
There is also one important component of the MapReduce architecture known as the Job History Server.
The Job History Server is a daemon process that saves and stores historical information about the
tasks or applications, such as the logs generated during or after job execution.

Map Reduce and its Phases with numerical example.



Map Reduce :-
It is a framework in which we can write applications to process huge amounts of data in parallel on
large clusters of commodity hardware in a reliable manner.
Different Phases of MapReduce:-
The MapReduce model has three major phases and one optional phase.
 Mapping
 Shuffling and Sorting
 Reducing
 Combining
Mapping :- It is the first phase of MapReduce programming. The Mapping phase accepts key-value
pairs as input, (k, v), where the key represents the address of each record and the value
represents the entire record content. The output of the Mapping phase will also be in key-value
format, (k', v').
Shuffling and Sorting :- The output of the various mapping parts (k', v') then goes into the Shuffling
and Sorting phase. Duplicate keys are merged, and the values are grouped together based on the
same key. The output of the Shuffling and Sorting phase will be key-value pairs again, as a key and an
array of values (k, v[ ]).
Reducer :- The output of the Shuffling and Sorting phase (k, v[ ]) will be the input of the Reducer
phase. In this phase the reducer function's logic is executed and all the values are collected against
their corresponding keys. The Reducer consolidates the outputs of the various mappers and computes
the final output.
Combining :- It is an optional phase in MapReduce. The combiner phase is used to
optimize the performance of the MapReduce phases. It performs a local reduce on the mapper output,
which makes the Shuffling and Sorting phase quicker because less data has to be transferred.

[Flow chart: phases of the MapReduce model]

Numerical:-
MovieLens Data

USER_ID   MOVIE_ID   RATING   TIMESTAMP
196       242        3        881250949
186       302        3        891717742
196       377        1        878887116
244       51         2        880606923
166       346        1        886397596
186       474        4        884182806
186       265        2        881171488
Solution :-
Step 1 - First we map the values (USER_ID : MOVIE_ID); this happens in the first phase of the
MapReduce model.
196:242 ; 186:302 ; 196:377 ; 244:51 ; 166:346 ; 186:474 ; 186:265
Step 2 - After mapping we shuffle and sort the values, grouping the movie IDs by user.
166:346 ; 186:302,474,265 ; 196:242,377 ; 244:51
Step 3 - After completing step 1 and step 2 we reduce each key's values, putting all the values for a
key together.
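A hedged completion of Step 3, assuming the reducer simply counts how many movies each user has rated:
166:1 ; 186:3 ; 196:2 ; 244:1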
Limitations of HDFS
Issues with Small Files
The main problem with Hadoop is that it is not suitable for small
data. HDFS lacks the ability to support random reading of small files due to its
high-capacity design. Small files are files smaller than the HDFS block size (default
128 MB). HDFS cannot handle huge numbers of such small files, because
it was designed to work with a small number of large files for storing large data
sets, rather than a large number of small files. If there are a lot of small files, then
the NameNode will be overloaded, since it stores the namespace of HDFS in memory.

Solution:

Hadoop Archives, or HAR files, are one solution to the small files problem.
Hadoop archives act as another layer of the file system over Hadoop. With
the Hadoop archive command we can build HAR files. This command runs a map-
reduce job in the backend to pack the archived files into a small number of
HDFS files. But reading through HAR files is not much more efficient than
reading through HDFS, because it requires accessing two index files and
then finally the data file. Sequence files are another solution to the small file
problem. In this approach, we write a program to merge a number of small files into
one sequence file. Then we process this sequence file in a streaming fashion.
Map-reduce can break the sequence file into chunks and process them in parallel,
because a sequence file is splittable.
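As a hedged example (the directory names are placeholders), an archive can be built with the archive
command and then browsed through the har:// scheme:

$ hadoop archive -archiveName small.har -p /user/test/input /user/test/archive
$ hadoop fs -ls har:///user/test/archive/small.har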
Slow Processing Speed
MapReduce processes a huge amount of data. In Hadoop, MapReduce works
by breaking the processing into phases: Map and Reduce. MapReduce
requires a lot of time to perform these tasks, thus increasing latency and
reducing the processing speed.

Solution:

Spark is the solution for the slow processing speed of map-reduce. It does in-
memory computation, which makes it up to a hundred times faster than Hadoop.
While processing, Spark reads the data from RAM and writes the data to RAM,
thereby making it a fast processing tool. Flink is one more technology which is
faster than Hadoop map-reduce as it also does in-memory computation. Flink is
even faster than Spark. This is due to its stream processing engine at the core,
as opposed to Spark, which has a batch processing engine.

Support for Batch Processing only


Hadoop only supports batch processing; it is not suitable for streaming data.
Hence, overall performance is slower. The MapReduce framework doesn't exploit
the memory of the Hadoop cluster to the maximum.

Solution

Apache Spark solves this problem as it supports stream processing. But Spark's
stream processing is not as efficient as Flink's, as it uses micro-batch
processing. Apache Flink improves the overall performance as it provides a
single runtime for streaming as well as batch processing.

No Real-time Processing
Hadoop, with its core Map-Reduce framework, is unable to process real-time
data. Hadoop processes data in batches. First, the user loads the file into HDFS.
Then the user runs a map-reduce job with the file as input. It follows the ETL
cycle of processing: the user extracts the data from the source, then the data
gets transformed to meet the business requirements, and finally it is loaded into
the data warehouse. The users can generate insights from this data. The
companies use these insights for the betterment of their business.

Solution

Spark has come up as a solution to the above problem. Spark supports real-
time processing. It processes the incoming streams of data by forming micro-
batches and then applying computations on these micro-batches.
Flink is also one more solution for slow processing speed. It is even
faster than Spark as it has a stream processing engine at its core. Flink is a
true streaming engine with adjustable latency and throughput. It has a rich set
of APIs exploiting the streaming runtime.

Iterative Processing
Core Hadoop does not support iterative processing. Iterative processing
requires a cyclic data flow, in which the output of a previous stage serves as the input
to the next stage. Hadoop map-reduce is capable of batch processing. It
works on the principle of write-once-read-many: the data gets written to the
disk once and is then read multiple times to get insights. The Map-reduce of
Hadoop has a batch processing engine at its core and is not able to iterate
through data.

Solution

Spark supports iterative processing. In Spark, each iteration needs to be
scheduled and executed separately. It accomplishes iterative processing
through a DAG, i.e. Directed Acyclic Graph. Spark has RDDs, or Resilient
Distributed Datasets. These are collections of elements partitioned across the
cluster of nodes. Spark creates RDDs from HDFS files. We can also cache them,
allowing reusability of RDDs. Iterative algorithms apply operations
repeatedly over data, so they benefit from caching RDDs across
iterations. Flink also supports iterative processing. Flink iterates data using its
streaming architecture. We can instruct Flink to process only the data which
has changed, thereby improving the performance. Flink implements iterative
algorithms by defining a step function. It embeds the step function into a
special iteration operator. The two variants of this operator are iterate and
delta iterate. Both these operators apply the step function over and over again
until they meet a terminating condition.

Latency
MapReduce in Hadoop is slower because it supports data of different formats
and structures and in huge volumes. In MapReduce, Map takes a set of data
and converts it into another set of data, where individual elements are broken
down into key-value pairs. Reduce takes the output from the Map as
input and processes it further.
MapReduce requires a lot of time to perform these tasks, thereby increasing
latency.

Solution:
Apache Spark can reduce this issue. Although Spark is also a batch system, it is
relatively faster, because it caches much of the input data in memory using RDDs.
Apache Flink data streaming achieves low latency and high throughput.

No Ease of Use
In Hadoop, we have to hand-code each and every operation. This has two
drawbacks: first, it is difficult to use; second, it increases the number of
lines of code. There is no interactive mode available with Hadoop Map-Reduce.
This also makes it difficult to debug, as it runs in batch mode. In this mode,
we have to specify the jar file, the input, as well as the location of the output file.
If the program fails in between, it is difficult to find the culprit code.

Solution

Spark is easier for the user as compared to Hadoop. This is because it has many
APIs for Java, Scala, Python, and Spark SQL. Spark performs batch processing,
stream processing and machine learning on the same cluster. This makes life
easy for users: they can use the same infrastructure for various workloads. In
Flink, a number of high-level operators are available. This reduces the number
of lines of code needed to achieve the same result.

Security Issue
Hadoop does not implement encryption-decryption at the storage or
network levels. Thus it is not very secure. For security, Hadoop adopts
Kerberos authentication, which is difficult to maintain.

Solution

Spark encrypts temporary data written to local disk. It does not support
encryption of output data generated by applications through APIs such as
saveAsHadoopFile or saveAsTable. Spark implements AES-based encryption for
RPC connections. We should enable RPC authentication to enable encryption,
and it should be properly configured.

No Caching
Apache Hadoop is not efficient at caching. MapReduce cannot cache the
intermediate data in memory for further requirements, and this diminishes
the performance of Hadoop.

Solution:

Spark and Flink overcome this issue. Spark and Flink cache data in memory for
further iterations, which enhances the overall performance.
Lengthy Code
Apache Hadoop has about 1,20,000 (i.e. 120,000) lines of code. A larger number of lines
produces a larger number of bugs, and it takes more time to execute the programs.

Solution:

Spark and Flink are written in Scala and Java, but the core implementation is in
Scala, so the number of lines of code is smaller than in Hadoop. Thus, it takes less
time to execute the programs.
