Hadoop Distributed File System Basics
Contents
• Hadoop Distributed File System Design Features
• HDFS Components
• HDFS Block Replication
• HDFS Safe Mode
• Rack Awareness
• NameNode High Availability
• HDFS Checkpoints and Backups
• HDFS Snapshots
• HDFS NFS Gateway
Hadoop Distributed File System Design Features
• Designed for big data processing (large files, large sequential reads)
• Write-once/read-many access model
• No local caching of data (the overhead is too great for large streaming reads)
• Design based on the Google File System (GFS)
• Designed for data streaming (sequential access) rather than random access
• Data locality: move the computation to the data rather than the data to the computation
HDFS Components
• Node types:
  • NameNode: manages the file system metadata
  • DataNode: stores and retrieves the actual data blocks
• Design: master/slave architecture
  • Master (NameNode): maintains the file system namespace
  • Slaves (DataNodes): serve read/write requests from clients
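On a running cluster, the NameNode's view of these roles can be inspected with the dfsadmin report (this requires a configured HDFS client and superuser privileges; output varies by cluster):

```
# Print the NameNode's summary of the cluster, including each
# DataNode's capacity, usage, and state.
hdfs dfsadmin -report
```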
Various system roles in an HDFS deployment
• Disk files:
  • fsimage_* : an image of the file system state, read only at NameNode startup; stores the metadata
  • edits_* : a series of modifications made to the file system after the NameNode starts (the edit log)
  • Location: set by the dfs.namenode.name.dir property in the hdfs-site.xml file
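A minimal hdfs-site.xml entry for this property might look like the following (the path shown is illustrative, not a required location):

```xml
<property>
  <name>dfs.namenode.name.dir</name>
  <!-- Illustrative path; a comma-separated list of directories
       may be given so the metadata is written redundantly. -->
  <value>/var/data/hadoop/hdfs/nn</value>
</property>
```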
HDFS Block Replication
• When HDFS writes a file, it is replicated across the cluster; the replication factor is set in the hdfs-site.xml file
• Each block is replicated across a number of machines (default = 3)
• Rule of thumb by cluster size:
  Cluster size (DataNodes)   Replication factor
  > 8                        3
  > 1 and < 8                2
• HDFS stores files as fixed-size blocks; the default block size is 64 MB
• Splits are based on a logical partitioning of the data; blocks are the physical unit of storage
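Both settings mentioned above are ordinary hdfs-site.xml properties; a sketch of the relevant entries (values shown are the defaults described in the text):

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value>  <!-- default replication factor -->
</property>
<property>
  <name>dfs.blocksize</name>
  <!-- 64 MB, expressed in bytes -->
  <value>67108864</value>
</property>
```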
HDFS Safe Mode
• When the NameNode starts, it enters a read-only safe mode in which blocks cannot be replicated or deleted
• Safe mode enables the NameNode to perform two important processes:
  1. Load the fsimage file into memory and replay the edit log
  2. Build the mapping between blocks and DataNodes, confirming that at least one copy of the data is available before exiting safe mode
• An administrator can also enter safe mode for maintenance with the hdfs dfsadmin -safemode command
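The safe-mode subcommands are issued against a running cluster (they require HDFS superuser privileges), for example:

```
# Query the current safe-mode state
hdfs dfsadmin -safemode get
# Enter safe mode manually for maintenance
hdfs dfsadmin -safemode enter
# Return the cluster to normal read/write operation
hdfs dfsadmin -safemode leave
```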
Rack Awareness
• Data locality: Hadoop MapReduce tries to move the computation to the data
• Locality levels, in order of preference:
  1. Data resides on the local machine (best)
  2. Data resides in the same rack (better)
  3. Data resides in a different rack (good)
• Example: the YARN scheduler uses rack awareness when placing tasks
• Pros: improved fault tolerance, since replicas are placed in different racks
• Cons: if an entire rack fails, performance is degraded while blocks are re-replicated
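Rack awareness is commonly configured by pointing Hadoop's net.topology.script.file.name property at a small script that maps each host name or IP address to a rack path. A minimal sketch in Python, using an invented host-to-rack table (a real script would consult a site-specific source):

```python
#!/usr/bin/env python3
"""Toy rack-topology script: Hadoop invokes it with one or more
host names / IP addresses and expects one rack path per argument."""
import sys

# Hypothetical mapping; real deployments would load this from a
# site-maintained file or service.
HOST_TO_RACK = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

DEFAULT_RACK = "/default-rack"  # conventional fallback rack name

def rack_for(host: str) -> str:
    """Return the rack path for a host, or the default rack."""
    return HOST_TO_RACK.get(host, DEFAULT_RACK)

if __name__ == "__main__":
    # Emit one rack path per argument, space-separated,
    # in the order the arguments were given.
    print(" ".join(rack_for(h) for h in sys.argv[1:]))
```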
NameNode High Availability
• In earlier versions of Hadoop, the NameNode was a single point of failure
• NameNode High Availability (HA) was introduced to provide a true failover service
[Figure: HA configuration with an Active NameNode and a Standby NameNode]
[Figure: scheduler launching a speculative (duplicate) task on another node]
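In an HA deployment, the Active/Standby NameNode pair is declared in hdfs-site.xml. A skeletal example, with an invented nameservice ID and host names:

```xml
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>  <!-- illustrative nameservice ID -->
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>    <!-- the Active/Standby pair -->
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
```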