Hadoop

Big data:
Data that is too large to store and process by traditional means is known as big data.
Sources of big data:
Social networking sites, e-commerce sites, weather stations, the share market.

So, Hadoop was introduced to process and analyze data that are huge in volume.
History:
2002 -- Apache Nutch, an open-source web search engine, was being developed.
While working on Apache Nutch, the developers were dealing with big data. Storing that
data turned out to be very costly, which became a problem for the project. This problem
became one of the important reasons for the emergence of Hadoop.
2003 -- Google published its research paper on the Google File System (GFS).
2006 -- Doug Cutting introduced Hadoop and its distributed file system, HDFS.
First version of Hadoop = 0.1.0
Latest version of Hadoop = 3.3.4
Hadoop:
Hadoop is an open-source framework from Apache
used to store, process, and analyze data that are very huge in volume.
Hadoop is written in Java and is not OLAP (online analytical processing); it is used for
batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and
many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Hadoop architecture:
Hadoop Common: Java libraries that provide file system and OS-level
abstractions.
YARN: Yet Another Resource Negotiator, used for job scheduling and managing the cluster.
HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was
developed on the basis of it. It states that files are broken into blocks and stored on
nodes over the distributed architecture.

HDFS architecture:

The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It
follows a master/slave architecture, consisting of a single NameNode that performs the
role of master and multiple DataNodes that perform the role of slaves.
Both the NameNode and the DataNodes are capable of running on commodity machines. HDFS
is developed in Java, so any machine that supports Java can easily run the NameNode and
DataNode software.
NameNode
The NameNode is used for storing metadata.
The metadata keeps track of the transaction logs.
Metadata also includes the file name, file size, and the location information (block
numbers, block IDs) of the DataNodes, which the NameNode stores to find the closest
DataNode for faster communication.
The NameNode instructs the DataNodes to perform operations such as create, delete,
replicate, etc.
DataNode
The HDFS cluster contains multiple DataNodes.
Each DataNode contains multiple data blocks.
These data blocks are used to store data.
It is the responsibility of the DataNode to serve read and write requests from the file
system's clients.
It performs block creation, deletion, and replication upon instruction from the NameNode.
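
To make the NameNode/DataNode roles concrete, here is a minimal client sketch using
Hadoop's standard org.apache.hadoop.fs.FileSystem API. The NameNode address and the file
path are hypothetical placeholders; on a real cluster, fs.defaultFS would normally come
from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode address; normally fs.defaultFS comes from core-site.xml.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");  // hypothetical file path

        // Write: the NameNode picks target DataNodes, the client streams bytes to them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations, the client reads from DataNodes.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        // Metadata: ask the NameNode where each block of the file is stored.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block hosts: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}

The create and open calls move the actual bytes to and from DataNodes, while
getFileBlockLocations is answered from the metadata kept by the NameNode.
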
Job Tracker
The role of the Job Tracker is to accept MapReduce jobs from clients and process the data
with the help of the NameNode.
In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
It works as a slave node for the Job Tracker.
It receives tasks and code from the Job Tracker and applies that code to the file. This
process can also be called a Mapper.

Advantages of Hadoop

o Fast: In HDFS, the data is distributed over the cluster and mapped, which
helps in faster retrieval. Even the tools to process the data are often on the
same servers, thus reducing the processing time. Hadoop is able to process
terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to
store data, so it is really cost-effective compared to a traditional relational
database management system.
o Resilient to failure: HDFS can replicate data over the network, so if one node
goes down or some other network failure happens, Hadoop uses another copy of
the data. Normally, data is replicated three times, but the replication factor is
configurable (see the sketch after this list).
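
As a small illustration of the configurable replication factor mentioned in the last
point, the sketch below (with a hypothetical file path) sets the client-side default and
then overrides it for a single existing file. In practice, dfs.replication is usually set
cluster-wide in hdfs-site.xml rather than in code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        // Default replication for files this client creates
        // (dfs.replication is normally set in hdfs-site.xml; 3 by default).
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);

        // Per-file override: keep only 2 copies of this (hypothetical) existing file.
        Path file = new Path("/user/demo/sample.txt");
        boolean accepted = fs.setReplication(file, (short) 2);
        System.out.println("replication change accepted: " + accepted);
        fs.close();
    }
}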

MapReduce is a framework with which we can write applications to process huge amounts
of data, in parallel, on large clusters of commodity hardware in a reliable manner.

What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing
based on Java. The MapReduce algorithm contains two important tasks, namely Map and
Reduce. Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs). The reduce task then takes the
output from a map as input and combines those data tuples into a smaller set of tuples.
As the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application
into mappers and reducers is sometimes nontrivial. But, once we write an application in the
MapReduce form, scaling the application to run over hundreds, thousands, or even tens of
thousands of machines in a cluster is merely a configuration change. This simple scalability is
what has attracted many programmers to use the MapReduce model.
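
As a concrete example of mappers and reducers, below is a sketch of the classic word
count program written against Hadoop's org.apache.hadoop.mapreduce API; class and
variable names are illustrative. The map step emits a (word, 1) pair for every word in a
line, and the reduce step sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: break each input line into words and emit a (word, 1) pair per word.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);      // e.g. ("hadoop", 1)
            }
        }
    }

    // Reduce: sum all the counts emitted for the same word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            result.set(sum);
            context.write(key, result);        // e.g. ("hadoop", 42)
        }
    }
}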

The Algorithm

 Generally, the MapReduce paradigm is based on sending the computation to where the
data resides!
 A MapReduce program executes in three stages, namely the map stage, the shuffle
stage, and the reduce stage.
o Map stage − The map or mapper’s job is to process the input
data. Generally, the input data is in the form of a file or directory
and is stored in the Hadoop file system (HDFS). The input file is
passed to the mapper function line by line. The mapper
processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of
the Shuffle stage and the Reduce stage. The Reducer’s job is to
process the data that comes from the mapper. After processing,
it produces a new set of output, which will be stored in the
HDFS.
 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
 The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the
nodes.
 Most of the computing takes place on nodes with data on local disks, which
reduces network traffic.
 After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
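
To show how such a job is handed to the cluster, here is a minimal driver sketch that
wires together the mapper and reducer from the word count example above. The input and
output HDFS paths are taken from the command line and are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire in the mapper and reducer from the word count sketch above.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // optional map-side aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // HDFS input and output paths, taken from the command line (illustrative only).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait; the framework schedules map and reduce tasks on the
        // cluster and moves intermediate data between them (the shuffle).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it could be launched with something like
hadoop jar wordcount.jar WordCountDriver /user/demo/input /user/demo/out
(the jar name and directories here are hypothetical).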
