Hadoop Architecture
HADOOP
Hadoop is an open-source framework and ecosystem for processing,
storing, and analyzing large and complex datasets across clusters
of commodity hardware.
High Throughput:
HDFS is optimized for high throughput data access rather than low-
latency access.
It is well-suited for batch processing workloads, such as those commonly
found in big data analytics.
Data Locality:
HDFS tries to place computation close to the data.
This means that when processing data, tasks are scheduled on nodes
where the data resides, reducing the need for network transfers.
LAYERS OF HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on commodity hardware.
HDFS has two main layers:
NAMESPACE
Consists of directories, files, and blocks.
BLOCK STORAGE
A block address table maps each HDFS block to the DataNodes where its replicas are located.
Supports block-related operations such as create, delete, modify, and get block location.
The client's first contact point is the NameNode, which serves the file metadata; the actual file
I/O is then performed directly with the DataNodes.
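As a client-side sketch of this split between metadata and data, the snippet below asks the NameNode for a file's block locations through the FileSystem API. The host name, port, and path are illustrative assumptions, not values from these notes.

// Minimal sketch: the FileSystem client contacts the NameNode for metadata only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class BlockLocationLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode endpoint.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);
        Path file = new Path("/data/sample.txt");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        // Block locations come from the NameNode's block address table.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}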
If something goes wrong with the NameNode, any metadata held only in
main memory would be lost permanently.
To guard against this, the NameNode persists its metadata in the FsImage file,
which is loaded into memory when the NameNode starts up.
It is interesting to note that the NameNode never really uses these on-disk files
during runtime, except when it starts.
Changes made during runtime, such as creating directories and files or putting
data into HDFS, are recorded directly in the EditLog.
The edits/EditLog file (also called the transaction log, stored on disk) records all
changes that occur to the in-memory file system metadata during runtime.
The process that takes the last FsImage file, applies all the changes found in the
EditLog file, and produces a new, up-to-date FsImage file is called the
checkpointing process.
CHECK-POINT
When the NameNode starts up, it reads the FsImage and EditLog
from disk, applies all the transactions from the EditLog to the in-
memory representation of the FsImage, and flushes out this new
version into a new FsImage on disk.
This process is called a checkpoint.
In other words, checkpointing merges the FsImage with the latest EditLog and
creates a new FsImage, so that the NameNode possesses the latest metadata of
the HDFS namespace.
WRITING TO HDFS
To write a file, the client talks to the name node for metadata specifying where to
place the data blocks.
The name node responds with details based on the actual size of the file, the
block size, and the replication configuration.
These details from the name node contain the number of blocks of the file, the
replication factor, and the data nodes where each block will be stored.
Normally, the first replica is written to the data node creating the file, to
improve the write performance because of the write affinity.
Block A is transferred to data node 1 along with the details of the two
other data nodes where this block needs to be stored (shown in the metadata
as A(1,2,3)).
Data node 1 then copies Block A to data node 2; this block transfer goes via the
rack switch because both of these data nodes are in the same rack.
When it receives Block A from data node 1, data node 2 copies the same block to
the third data node (in this case, data node 3, which is in another rack).
The whole process is repeated for each block of the file, and data transfer
happens in parallel for a faster write of the blocks.
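From the application's point of view, this whole pipeline is hidden behind a single create() call. Below is a minimal write sketch; the host name and output path are assumptions for illustration.

// Minimal sketch: create() asks the NameNode for block placement, then the
// bytes flow directly to the DataNodes in the replication pipeline.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}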
COMMUNICATION PROTOCOLS
All communication from clients to the name node, clients to data
nodes, data nodes to the name node, and name node to the data
nodes happens over Transmission Control Protocol/Internet Protocol
(TCP/IP).
The data nodes communicate with the name node using the data node
protocol with its own TCP port number (configurable).
The client communicates with the name node using the client protocol
with its own TCP port number (configurable).
By design, the name node does not initiate a remote procedure call
(RPC); it only responds to the RPC requests coming from either data
nodes or clients.
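As a sketch of how the configurable client-protocol endpoint appears on the client side, the snippet below points the FileSystem client at the NameNode's RPC address via the standard fs.defaultFS property; the host name and port 9000 are assumptions.

// Minimal sketch: the NameNode host and client RPC port are configurable.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClientProtocolEndpoint {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS names the NameNode and the TCP port its client RPC server listens on.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}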
READING FROM HDFS
To read a file from HDFS, the client or application reaches out to
the name node with the name of the file and its path. The name
node responds with the number of blocks of the file and the data nodes where
each block is stored.
The client talks to the name node to get metadata about the file it wants to read.
Now the client or application reaches out to the data nodes directly to read the
blocks of the file in parallel, based on the information received from the name
node; the name node is not involved in the actual data transfer, and the data
blocks do not pass through it.
When the client or application receives all the blocks of the file, it
combines these blocks into the form of the original file.
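The read path is likewise wrapped in a single open() call. Below is a minimal read sketch matching the flow above; the host name and path are assumptions.

// Minimal sketch: open() fetches block locations from the NameNode, read()
// pulls the blocks from the DataNodes and reassembles the original file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.net.URI;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}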
CHECKSUM FOR DATA BLOCKS
When writing blocks of a file, the HDFS client computes the
checksum of each block of the file and stores these checksums in a
separate, hidden file in the same HDFS file system namespace.
On cluster startup, the name node enters a special state called safe mode.
During this time, the name node receives a heartbeat signal
(implying that the data node is active and functioning properly) and a
block-report from each data node (containing a list of all blocks on that
specific data node) in the cluster.
THE NAME NODE UPDATES ITS METADATA BASED ON
INFORMATION IT RECEIVES FROM THE DATA NODES.
HANDLING A DATA NODE FAILURE TRANSPARENTLY.
IN TERMS OF STORAGE, WHAT DOES A NAME NODE
CONTAIN AND WHAT DO DATA NODES CONTAIN?
HDFS stores and maintains file system metadata and application data
separately.
The first replica is written to the data node creating the file.
The second replica is written to another data node within the same
rack.
When a client writes a file to a data node, it splits the file into
multiple chunks, called blocks.
When creating a file, the client can also specify a block size to override the
cluster-wide configuration.
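As a sketch of this per-file override, the create() overload below takes an explicit block size and replication factor; the 64 MB value and the path are assumptions for illustration.

// Minimal sketch: override the cluster-wide block size for a single file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class CustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        long blockSize = 64L * 1024 * 1024;   // 64 MB instead of the cluster default
        short replication = 3;                // replication factor for this file
        int bufferSize = 4096;
        try (FSDataOutputStream out =
                 fs.create(new Path("/data/custom-block.txt"), true, bufferSize, replication, blockSize)) {
            out.write("file written with a per-file block size".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}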
WHAT IS A CHECKPOINT, AND WHO PERFORMS
THIS OPERATION?
As described above, a checkpoint merges the FsImage with the EditLog to produce a
new FsImage; in classic Hadoop this merge is performed by the Secondary NameNode.
TRASH
You can enable the Trash feature of HDFS using two configuration
properties, fs.trash.interval and fs.trash.checkpoint.interval, in
the core-site.xml configuration file.
After enabling it, a deleted file is moved to the Trash folder and stays there for
the configured interval.
If you recover the file from there before it is purged, you are fine; otherwise, you
will lose the file.
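A minimal sketch of the two properties named above, set programmatically for illustration; in practice they normally live in core-site.xml, and the values shown are assumptions.

// Minimal sketch: the Trash-related properties and what they control.
import org.apache.hadoop.conf.Configuration;

public class TrashSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Minutes a deleted file stays in the Trash folder before it is purged.
        conf.setLong("fs.trash.interval", 1440);            // keep for 24 hours (assumed value)
        // Minutes between checkpoints of the Trash folder.
        conf.setLong("fs.trash.checkpoint.interval", 60);   // checkpoint hourly (assumed value)
        System.out.println("Trash retention (minutes): " + conf.get("fs.trash.interval"));
    }
}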
MAPREDUCE: SIMPLIFIED DATA PROCESSING
OF LARGE DATA ACROSS THE CLUSTERS
MAPREDUCE
MapReduce is a programming model and processing framework for
processing and generating large datasets in parallel.
JOB TRACKER
The JobTracker creates and runs jobs on the NameNode (master) node. Whenever
a client submits a job to the JobTracker, it divides the job into tasks, assigns
the tasks to the worker nodes (task scheduling), and tracks their progress and
fault tolerance.
The JobTracker carries out the communication between the client and
the TaskTrackers by making use of Remote Procedure Calls (RPC).
The JobTracker keeps track of all the jobs and the associated tasks
in main memory.
TASK TRACKER
Within a cluster there can be multiple TaskTrackers.
It is the responsibility of the TaskTracker to execute all the tasks assigned to it by the JobTracker.
Within each TaskTracker there are a number of Map and Reduce slots.
The number of Map and Reduce slots determines how many Map and Reduce tasks can be
executed simultaneously.
The TaskTracker is pre-configured with the number of slots, indicating the number of tasks it can
accept (see the configuration sketch after this section).
When the JobTracker tries to schedule a task, it looks for an empty slot in the TaskTracker
running on the same server that hosts the DataNode where the data for that task resides.
If none is found, it looks for a machine in the same rack.
The TaskTracker sends a heartbeat signal to the JobTracker every 3 seconds; if the
JobTracker doesn't receive this signal, it considers that TaskTracker dead.
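As a sketch of how those slots were configured in classic (MRv1) Hadoop, the snippet below sets the per-TaskTracker slot counts; normally these properties live in mapred-site.xml, and the values are assumptions.

// Minimal sketch: Map and Reduce slot counts for one TaskTracker (classic MRv1).
import org.apache.hadoop.conf.Configuration;

public class TaskTrackerSlots {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Maximum Map tasks this TaskTracker runs at the same time.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);      // assumed value
        // Maximum Reduce tasks this TaskTracker runs at the same time.
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);   // assumed value
        System.out.println("Map slots: " + conf.get("mapred.tasktracker.map.tasks.maximum"));
    }
}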
MAPPING PHASE
This is the first phase of the program. There are two steps in this phase:
splitting and mapping.
In the splitting step, a dataset is split into equal units called chunks (input splits).
Each input split is then converted into key-value pairs (for example, by a record
reader), and these key-value pairs are used as inputs in the mapping step.
This is the only data format that a mapper can read or understand.
In the mapping step, the mapper contains the coding logic that processes the
key-value pairs and produces intermediate key-value pairs.
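A minimal mapper sketch for the classic word-count example (an example of our choosing, not taken from these notes): it reads a line as the input value, splits it into words, and emits (word, 1) intermediate pairs.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate key-value pair
        }
    }
}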
SHUFFLING PHASE
This is the second phase that takes place after the completion of the Mapping
phase.
In the sorting step, the key-value pairs are sorted by key (since different
mappers may have output the same key).
The shuffling phase facilitates the removal of duplicate values and the
grouping of values by key.
The output of this phase will be keys and values, just like in the Mapping phase.
REDUCER PHASE
In the reducer phase, the output of the shuffling phase is used as the
input.
All the duplicate values are removed, and the values are grouped based on
similar keys.
This output is fed as input to the reducer.
For each intermediate key, the reducer combines all of its intermediate values
into a list, called a tuple.
The record writer writes these output key-value pairs from the
reducer to the output files. The output data is stored on the HDFS.
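A matching reducer sketch for the word-count example: for each key, the grouped list of intermediate values is summed, and one (word, total) pair is written to the output files on HDFS.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));   // final key-value pair for this key
    }
}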
WORKFLOW EXAMPLE
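A driver sketch that wires the word-count mapper and reducer above into one job; the input and output paths come from the command line and are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an input folder on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an output folder on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, this would typically be submitted with the hadoop jar command, passing the HDFS input and output paths as arguments.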
RESOURCE MANAGER (RM)
SCHEDULER
The scheduler is responsible for allocating resources to the running
applications.
The scheduler is a pure scheduler, meaning that it performs no monitoring and
no tracking of the application, and it does not even guarantee restarting
failed tasks.
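As a hedged sketch, assuming the resource manager's scheduler implementation is selectable via configuration (a detail not stated in these notes), the snippet below shows the relevant property; normally it lives in yarn-site.xml, and choosing the CapacityScheduler here is an assumption.

// Minimal sketch: selecting the scheduler used by the ResourceManager.
import org.apache.hadoop.conf.Configuration;

public class SchedulerChoice {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
        System.out.println("Scheduler: " + conf.get("yarn.resourcemanager.scheduler.class"));
    }
}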
APPLICATION MANAGER
It manages running Application Masters in the cluster, i.e., it is
responsible for starting application masters and for monitoring and
restarting them on different nodes in case of failures.
NODE MANAGER (NM)
The node manager runs on every worker node in the cluster; it launches and
monitors containers on that node and reports resource usage and health to the
resource manager through heartbeats.
APPLICATION MASTER (AM)
It negotiates resources from the resource manager and works with the node
manager to execute and monitor the application's tasks.
PIG
Pig provides a query language (Pig Latin) that is very similar to SQL. It loads the
data, applies the required filters, and dumps the data in the required format.