HADOOP

Er. Gursewak Singh


DSCE
HADOOP

 Hadoop is an open-source framework.

 It is provided by Apache to process and analyse very large volumes of data.

 It is written in Java and is currently used by Google, Facebook, LinkedIn, Yahoo, Twitter, etc.

 Moreover, it can be scaled up simply by adding nodes to the cluster.


Modules of Hadoop
 HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of that design. Files are broken into blocks and stored on nodes across the distributed architecture (see the HDFS API sketch after this list).

 YARN: Yet Another Resource Negotiator, which is used for job scheduling and cluster management.

 MapReduce: This is a framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes the input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the Reducer gives the desired result.

 Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop
modules.
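
As a rough illustration of how a client program talks to HDFS, here is a minimal sketch using the standard org.apache.hadoop.fs Java API. It assumes a reachable HDFS cluster configured via the usual core-site.xml/hdfs-site.xml files; the file path and message below are hypothetical, not taken from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);           // handle to the distributed file system

        Path file = new Path("/user/demo/sample.txt");  // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("HDFS splits large files into blocks replicated across DataNodes.");
        }

        // The NameNode tracks which DataNodes hold each block; the client sees one logical file.
        System.out.println("Block size for this file: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}

The client never addresses individual DataNodes directly; block placement and replication are handled for it, which is what the HDFS module description above refers to.
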
Features of Hadoop That Make It Popular
 1. Open Source:

Hadoop is open-source, which means it is free to use. Since it is an open-source project, the source code is available online for anyone to study or to modify as per their industry requirements.

 2. Highly Scalable Cluster:

Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements.

 3. Fault Tolerance is Available:

Hadoop uses commodity hardware (inexpensive systems) that can crash at any moment. In Hadoop, data is replicated on various DataNodes in the cluster, which ensures the availability of data even if one of your systems crashes. If the machine you are reading from faces a technical issue, the same data can still be read from other nodes in the cluster, because the data is copied or replicated by default.


 4. High Availability is Provided:

High Availability means the availability of data on the Hadoop cluster. Due to fault tolerance, if any DataNode goes down the same data can be retrieved from any other node where it is replicated. A highly available Hadoop cluster also has two or more NameNodes, i.e. an Active NameNode and a Passive NameNode, also known as a standby NameNode. If the Active NameNode fails, the Passive NameNode takes over the responsibility of the Active NameNode and provides the same data to the user.

 5. Cost-Effective:

Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with Big Data.

 6. Hadoop Provides Flexibility:

Hadoop is designed in such a way that it can deal with any kind of dataset very efficiently: structured (MySQL data), semi-structured (XML, JSON), and unstructured (images and videos). This means it can easily process any kind of data independent of its structure, which makes it highly flexible.
 7. Easy to Use:

Hadoop is easy to use since developers need not worry about the distributed processing work; it is managed by Hadoop itself. The Hadoop ecosystem is also very large and comes with lots of tools like Hive, Pig, Spark, HBase, Mahout, etc.

 8. Hadoop Uses Data Locality:

The concept of data locality is used to make Hadoop processing fast. With data locality, the computation logic is moved near the data rather than moving the data to the computation logic. Moving data in HDFS is costly, so the data locality concept minimizes bandwidth utilization in the system.

 9. Provides Faster Data Processing:

Hadoop uses a distributed file system, HDFS (Hadoop Distributed File System), to manage its storage. In a DFS (Distributed File System), a large file is broken into small file blocks that are distributed among the nodes available in a Hadoop cluster; because this massive number of file blocks is processed in parallel, Hadoop is faster.
History of Hadoop
Hadoop Architecture
The Hadoop architecture mainly consists of four components:
 MapReduce

 HDFS (Hadoop Distributed File System)

 YARN (Yet Another Resource Negotiator)

 Common Utilities or Hadoop Common


1. MapReduce
 MapReduce is essentially a programming model, built on top of the YARN framework, whose major feature is to perform distributed processing in parallel across a Hadoop cluster; this is what makes Hadoop work so fast. When you are dealing with Big Data, serial processing is no longer of any use. MapReduce has mainly two tasks, which are divided phase-wise:
 In the first phase Map is utilized, and in the next phase Reduce is utilized.
 The input is provided to the Map() function, its output is then used as the input to the Reduce() function, and after that we receive our final output. Let's understand what Map() and Reduce() do.
 As we can see, an input is provided to Map(); since we are dealing with Big Data, the input is a set of data blocks. The Map() function breaks these data blocks into tuples, which are nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines the tuples based on their key, forms a new set of tuples, and performs operations such as sorting or summation, which is then sent to the final output node. Finally, the output is obtained.
 The data processing is always done in the Reducer, depending on the business requirement of that industry. This is how first Map() and then Reduce() are utilized, one after the other; a minimal word-count sketch of this flow follows below.
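
The classic word count is a compact, hedged illustration of this flow: the Mapper emits a (word, 1) pair for every token in its input split, and the Reducer sums the values that share a key. The class names below (TokenMapper, SumReducer) are illustrative; only the Mapper and Reducer base classes come from the standard org.apache.hadoop.mapreduce API.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): turns each line of an input block into (word, 1) key-value pairs.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }
}

// Reduce(): receives all values for one key (after shuffle and sort) and aggregates them.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));   // final (word, total) pair
    }
}
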
Let’s understand the Map Task and Reduce Task in detail. 

 Map Task:
RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for providing key-value pairs to the Map() function; the key is the record's locational information (for text input, its byte offset) and the value is the data associated with it.
Map: A map is nothing but a user-defined function whose work is to process the tuples obtained from the RecordReader. The Map() function may generate no key-value pairs at all, or one or multiple pairs, for each input tuple.
Combiner: The Combiner is used for grouping the data in the Map workflow. It is similar to a local reducer: the intermediate key-value pairs generated by the Map are combined with the help of this Combiner. Using a Combiner is optional.
Partitioner: The Partitioner is responsible for fetching the key-value pairs generated in the Mapper phase and generating the shards corresponding to each reducer. It computes the hash code of each key and takes its modulus with the number of reducers (key.hashCode() % numberOfReducers), as in the sketch below.
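
That modulus rule can be written as a custom partitioner. The sketch below (the class name WordPartitioner is hypothetical) mirrors the behaviour of Hadoop's default HashPartitioner for Text keys.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each intermediate (key, value) pair to a reducer shard:
// reducer index = hash code of the key modulo the number of reducers.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is never negative before the modulus.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
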
 Reduce Task:
Shuffle and Sort: The task of the Reducer starts with this step. The process in which the Mapper generates intermediate key-value pairs and transfers them to the Reducer is known as shuffling. Using the shuffling process, the system can sort the data by its key. Shuffling begins as soon as some of the Map tasks are done, which is why it is a fast step: it does not wait for all of the work performed by the Mapper to complete.

Reduce: The main task of Reduce is to gather the tuples generated by Map and then perform sorting and aggregation operations on those key-value pairs, depending on their key element.

OutputFormat: Once all the operations are performed, the key-value pairs are written to a file with the help of the RecordWriter, each record on a new line, with the key and value separated by a delimiter (a tab, by default).
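
To tie these pieces together, here is a hedged driver sketch that wires the illustrative TokenMapper, SumReducer, and WordPartitioner classes from the earlier sketches into a Job, using TextOutputFormat so the RecordWriter emits one key-value pair per line. The input and output paths are supplied on the command line and are assumptions, not part of the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenMapper.class);          // Map phase
        job.setCombinerClass(SumReducer.class);         // optional local reduce on the map side
        job.setPartitionerClass(WordPartitioner.class); // shard assignment per reducer
        job.setReducerClass(SumReducer.class);          // Reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class); // RecordWriter: one "key<TAB>value" per line

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
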
Usage of MapReduce

 It can be used in various applications like document clustering, distributed sorting, and web link-graph reversal.

 It can be used for distributed pattern-based searching.

 We can also use MapReduce in machine learning.

 It was used by Google to regenerate Google's index of the World Wide Web.

 It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environments.
THANKS
