HADOOP
Before the digital era, data was generated at a modest pace and could be examined and stored using a single storage format. Data gathered for similar purposes also arrived in the same format. However, with the growth of the Internet and digital platforms like social media, data now comes in multiple formats (structured, semi-structured, and unstructured), and its velocity has also grown massively. This data came to be known as Big Data. Handling it created the need for multiple processors and storage units, and Hadoop was introduced as a solution.
Some well-known organizations that use Hadoop include:
Uber
The Bank of Scotland
Netflix
The National Security Agency (NSA) of the United States
Twitter
The three components of Hadoop are:
1. Hadoop YARN - It is a resource management unit of Hadoop.
2. Hadoop Distributed File System (HDFS) - It is the storage unit of Hadoop.
3. Hadoop MapReduce - It is the processing unit of Hadoop.
NameNode
The NameNode is the master daemon that runs on the master node. It stores the filesystem metadata, that is, file names, information about the blocks of each file, block locations, permissions, etc. It manages the DataNodes.
DataNode
DataNodes are the slave daemons that run on the slave nodes. They store the actual business data and serve client read/write requests based on the NameNode's instructions. The DataNodes store the blocks of the files, while the NameNode stores metadata such as block locations and permissions.
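For a quick illustration of this master/slave split, the commands below (run on a live cluster; the file path is a made-up placeholder) show the NameNode's view of its DataNodes and of a file's block locations:
hdfs dfsadmin -report        # capacity and the list of live/dead DataNodes as seen by the NameNode
hdfs fsck /user/demo/sample.txt -files -blocks -locations   # which DataNodes hold each block of the file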
Fault Tolerance
The Hadoop framework divides data into blocks and creates multiple copies of those blocks on several machines in the cluster. So, when any machine in the cluster fails, clients can still access their data from another machine that holds an exact copy of the data blocks.
High Availability
In the HDFS environment, data is duplicated by creating copies of the blocks. So, whenever a user wants to access this data, even in the event of a failure, they can simply read it from other nodes, because copies of the blocks are already present on the other nodes of the HDFS cluster.
High Reliability
HDFS splits the data into blocks, and the Hadoop framework stores these blocks on the nodes of the cluster. It protects data by creating a replica of every block in the cluster, which provides fault tolerance. By default, it creates 3 replicas of each block across the nodes, so the data is promptly available to users and the risk of data loss is low. Hence, HDFS is very reliable.
Replication
Replication resolves the problem of data loss in adverse conditions like device
failure, crashing of nodes, etc. It manages the process of replication at frequent
intervals of time. Thus, there is a low probability of a loss of user data.
Scalability
HDFS stores the data on multiple nodes, so in case of an increase in demand, the cluster can be scaled out.
HDFS vs. NAS:
HDFS is a Distributed File System that is mainly used to store data on commodity hardware, whereas NAS is a file-level computer data storage server connected to a computer network that provides network access to a heterogeneous group of clients.
HDFS is designed to work with the MapReduce paradigm, whereas NAS is not suitable for the MapReduce paradigm.
MapReduce consists of two operations. The first is the map operation, which takes a set of data and transforms it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce operation then consolidates those tuples based on the key and modifies the value of the key accordingly.
Let us take an example of a text file called example_data.txt and understand how
MapReduce works.
The content of the example_data.txt file is:
coding,jamming,ice,river,man,driving
Now, assume we have to find out the word count on the example_data.txt using
MapReduce. So, we will be looking for the unique words and the number of times
those unique words appeared.
First, we break the input into three splits. This distributes the work among all the map nodes.
Then, each mapper tokenizes the words and assigns a hardcoded value of 1 to each token. The reason for using the hardcoded value 1 is that every word, by itself, occurs at least once.
Now, a list of key-value pairs will be created where the key is nothing but the
individual words and value is one. So, for the first line (Coding Ice Jamming), we
have three key-value pairs – Coding, 1; Ice, 1; Jamming, 1.
The same mapping process runs on all the nodes.
Next, a partition process sorts and shuffles the intermediate data so that all the tuples with the same key are sent to the same reducer.
Subsequent to the sorting and shuffling phase, every reducer will have a unique
key and a list of values matching that very key. For example, Coding, [1,1]; Ice,
[1,1,1].., etc.
Now, each Reducer adds the values which are present in that list of values. As
shown in the example, the reducer gets a list of values [1,1] for the key Jamming.
Then, it adds the number of ones in the same list and gives the final output as –
Jamming, 2.
Lastly, all the output key/value pairs are then assembled and written in the
output file.
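To try this end to end, the stock word count job that ships with the Hadoop MapReduce examples jar can be run against example_data.txt; the HDFS paths and the jar version below are assumptions that depend on your installation.
hadoop fs -mkdir -p /user/demo/input
hadoop fs -put example_data.txt /user/demo/input
# run the built-in word count job (the version in the jar name differs per release)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /user/demo/input /user/demo/output
# each reducer writes one "word<TAB>count" pair per line
hadoop fs -cat /user/demo/output/part-r-00000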
In Hadoop MapReduce, shuffling is used to transfer data from the mappers to the appropriate reducers. It is the process by which the system sorts the unstructured map output and transfers it to the reducers as input. It is a significant step for the reducers, which would otherwise have no input. Moreover, since shuffling can begin even before the map phase has completed, it helps save time and finish the job sooner.
The Spark Core Engine can be used along with any of the other five components mentioned. It is not necessary to use all the Spark components together; depending on the use case and requirements, one or more of them can be used along with Spark Core.
11. What are the three modes in which Hadoop can run?
The three modes in which Hadoop can run are: Standalone (local) mode, Pseudo-distributed mode, and Fully distributed mode.
MapReduce requires programs to be translated into map and reduce stages. Since not all data analysts are accustomed to MapReduce, Yahoo researchers introduced Apache Pig to bridge the gap. Apache Pig was built on top of Hadoop, providing a high level of abstraction and enabling programmers to spend less time writing complex MapReduce programs.
YARN stands for Yet Another Resource Negotiator. It is the resource management layer of Hadoop and was introduced in Hadoop 2.x. YARN supports many data processing engines, such as graph processing, batch processing, interactive processing, and stream processing, to execute and process data saved in the Hadoop Distributed File System. YARN also offers job scheduling. It extends the capability of Hadoop to other evolving technologies so that they can take advantage of HDFS and economical clusters.
Apache YARN is the data operating system of Hadoop 2.x. It consists of a master daemon known as the ResourceManager, a slave daemon called the NodeManager, and the ApplicationMaster.
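On a running cluster, these daemons can be observed through the yarn command-line tool; a minimal sketch, assuming YARN is already started:
yarn node -list          # NodeManagers currently registered with the ResourceManager
yarn application -list   # applications known to the ResourceManager, each driven by its own ApplicationMaster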
Persistent Znodes:
The default znode type in ZooKeeper is the persistent znode. It stays in the ZooKeeper server permanently until a client explicitly deletes it.
Ephemeral Znodes:
These are temporary znodes. An ephemeral znode is destroyed whenever the client that created it logs out of the ZooKeeper server. For example, assume client1 created eznode1. Once client1 logs out of the ZooKeeper server, eznode1 gets destroyed.
Sequential Znodes:
A sequential znode is assigned a 10-digit number, in numerical order, at the end of its name. Assume client1 creates sznode1. In the ZooKeeper server, sznode1 will be named like this:
sznode0000000001
If client1 creates another sequential znode, it will get the next number in the sequence, so the subsequent sequential znode will be named <znode name>0000000002 (see the zkCli.sh sketch below).
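As a sketch, all three znode types can be created from the ZooKeeper command-line client (zkCli.sh); the znode names and data here are made up.
create /pznode1 "data"        # persistent znode: stays until a client deletes it
create -e /eznode1 "data"     # ephemeral znode: removed when the creating client's session ends
create -s /sznode "data"      # sequential znode: a 10-digit suffix is appended, e.g. /sznode0000000001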
C) cat: We use the cat command to display the content of a file present in an HDFS directory.
hadoop fs -cat /path_to_file_in_hdfs
D) mv: The HDFS mv command moves files or directories from a source to a destination within HDFS.
hadoop fs -mv <src> <dest>
E) copyToLocal: This command copies a file from HDFS to the local file system.
hadoop fs -copyToLocal <hdfs source> <localdst>
F) get: Copies a file from the Hadoop File System to the local file system.
hadoop fs -get <src> <localdst>
Robust: It is highly robust, has strong community support and contributions, and is easy to use.
Full Load: Sqoop can load the whole table just by a single Sqoop command. It
also allows us to load all the tables of the database by using a single Sqoop
command.
Incremental Load: It supports incremental load functionality. Using Sqoop, we
can load parts of the table whenever it is updated.
Parallel import/export: It uses the YARN framework for importing and exporting data, which provides fault tolerance on top of parallelism.
Import results of SQL query: It allows us to import the output of a SQL query into the Hadoop Distributed File System (sample commands are sketched below).
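A few hedged Sqoop commands illustrating these features; the JDBC URL, database, table, and HDFS paths are placeholders.
# full load of a single table
sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P --table orders --target-dir /user/demo/orders -m 4
# full load of every table in the database
sqoop import-all-tables --connect jdbc:mysql://dbhost/sales --username etl -P --warehouse-dir /user/demo/sales
# import the result of a SQL query ($CONDITIONS is required so Sqoop can split the work)
sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P --query 'SELECT id, total FROM orders WHERE $CONDITIONS' --split-by id --target-dir /user/demo/order_totals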
By default, the replication factor is 3, and no two copies of a block are stored on the same DataNode. Usually, two of the copies are placed on the same rack and the third copy is placed on a different rack. It is advised to keep the replication factor at least 3 so that one copy is always safe, even if something happens to an entire rack.
We can set the default replication factor of the file system, as well as of each file and directory individually. For files that are not essential, we can lower the replication factor, while critical files should have a high replication factor (see the commands sketched below).
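For reference, the replication factor can be adjusted per file or directory with setrep, while the cluster-wide default comes from the dfs.replication property; the paths below are placeholders.
hadoop fs -setrep -w 5 /user/demo/critical/report.csv   # raise replication of a critical file (-w waits for completion)
hadoop fs -setrep 2 /user/demo/tmp                      # lower replication for the files under a non-essential directory
hadoop fs -stat %r /user/demo/critical/report.csv       # print the current replication factor of a file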
26. Where are the two types of metadata that the NameNode server stores?
The NameNode server stores metadata in two places: on disk and in RAM. The metadata is linked to two files:
EditLogs: It contains all the recent changes made to the file system since the last FsImage.
FsImage: It contains the entire state of the file system namespace since the creation of the NameNode.
When a file is deleted from HDFS, the NameNode immediately records this change in the EditLog.
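On disk, both structures live in the NameNode's metadata directory (the location configured by dfs.namenode.name.dir); a hypothetical listing might look like the following, though the directory path and transaction IDs will differ.
ls /data/hadoop/namenode/current
# fsimage_0000000000000042000  fsimage_0000000000000042000.md5
# edits_0000000000000042001-0000000000000042150
# edits_inprogress_0000000000000042151  seen_txid  VERSION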
The Secondary NameNode continuously reads the file system metadata present in the NameNode's RAM and writes it to the file system or hard disk. In the NameNode, the EditLogs are combined with the FsImage. Periodically, the Secondary NameNode downloads the EditLogs from the NameNode and applies them to the FsImage. The new FsImage is then copied back to the NameNode and is used only after the NameNode starts the next time.
27. Which Command is used to find the status of the Blocks and
File-system health?
The command used to find the status of the blocks is: hdfs fsck <path> -files -blocks
And the command used to find the file-system health is: hdfs fsck / -files -blocks -locations > dfs-fsck.log
28. Write the command used to copy data from the local system
onto HDFS?
The command used for copying data from the local system to HDFS is: hadoop fs -copyFromLocal [source] [destination]
Both the JobTracker and the NameNode detect the failure and determine which blocks were stored on the failed DataNode.
All the tasks running on the failed node are rescheduled by locating other DataNodes that hold copies of those blocks.
The NameNode replicates the user's data to another node to maintain the configured replication factor.
Resilient Distributed Datasets are distributed collections of objects, so they can be operated on in parallel. They are divided into partitions so that they can be processed on various nodes of a cluster.
Java
PHP
Python
C++
Ruby
Local Metastore:
In local metastore mode, the metastore service runs in the same JVM as Hive, and it connects to a database running in a separate JVM, either on the same machine or on a separate machine.
Every table in Hive can have one or more partition keys to identify a particular partition. With the help of partitions, it is easy to run queries on slices of the data (a brief sketch follows below).
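As a minimal sketch (the table and column names are made up), a partitioned table could be created and queried from the hive CLI as follows; filtering on the partition key lets Hive scan only the matching partition.
hive -e "CREATE TABLE access_logs (ip STRING, url STRING) PARTITIONED BY (log_date STRING);"
hive -e "SELECT COUNT(*) FROM access_logs WHERE log_date = '2023-01-15';"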
42. How can you restart NameNode and all the daemons in
Hadoop?
The following commands will help you restart NameNode and all the daemons:
You can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and then start it using the ./sbin/hadoop-daemon.sh start namenode command.
You can stop all the daemons with the ./sbin/stop-all.sh command and then start them using the ./sbin/start-all.sh command.
43. How do you differentiate an inner bag and an outer bag in Pig?
Inner bag: Example: (4,{(4,2,1),(4,3,3)}). Here, the complete relation is an outer bag and {(4,2,1),(4,3,3)} is an inner bag.
Outer bag: Example: {(park, New York),(Hollywood, Los Angeles)}, which is a bag of tuples, i.e., an outer bag.
44. If the source data gets updated every now and then, how
will you synchronize the data in HDFS that is imported by
Sqoop?
If the source data gets updated in a very short interval of time, the synchronization of
data in HDFS that is imported by Sqoop is done with the help of incremental
parameters.
We should use incremental import with the append mode when the table is continuously refreshed with new rows. Here, a check column (typically an increasing id) is examined, and only rows whose value is greater than the one recorded at the last import are inserted as new rows.
We should use incremental import with the lastmodified mode when the source has a date or timestamp column that records when each row was last updated. All records modified after the last import, as determined by that column, are re-imported so that the values in HDFS are brought up to date (see the sample commands below).
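In command form, the two incremental modes might look like this (connection details, column names, and last values are placeholders):
# append mode: import only rows whose id exceeds the value recorded at the last import
sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P --table orders --target-dir /user/demo/orders --incremental append --check-column id --last-value 10000
# lastmodified mode: re-import rows whose timestamp changed after the last import and merge them on the key
sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P --table orders --target-dir /user/demo/orders --incremental lastmodified --check-column updated_at --last-value "2023-01-15 00:00:00" --merge-key id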
46. What is the default file format to import data using Apache Sqoop?
Sqoop allows data to be imported using two file formats (sample commands follow the list):
Delimited Text File Format (this is the default format for importing data)
Sequence File Format
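The format can be selected explicitly at import time; when neither flag is given, Sqoop falls back to delimited text. Connection details and paths below are placeholders.
sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P --table orders --target-dir /user/demo/orders_text --as-textfile      # delimited text (also the default)
sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P --table orders --target-dir /user/demo/orders_seq --as-sequencefile   # SequenceFile format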