Unit II Big Data
Hadoop is an open source framework from Apache that is used to store, process, and analyze data that is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files will be broken into blocks and stored on nodes across the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.
3. Map Reduce: This is a framework which helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop
modules.
Hadoop Architecture
The Hadoop architecture is a package of the file system, the MapReduce engine and HDFS (Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node runs the Job Tracker and NameNode, whereas each slave node runs a Task Tracker and DataNode.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It follows a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.
Both the NameNode and the DataNodes are capable of running on commodity machines. HDFS is developed in the Java language, so any machine that supports Java can easily run the NameNode and DataNode software.
NameNode
o A single NameNode exists in the HDFS cluster and acts as the master server.
o It manages the file system namespace and the metadata, i.e. which blocks of each file are stored on which DataNodes.
o It regulates clients' access to files and executes operations such as opening, renaming, and closing files.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and to process the data by consulting the NameNode.
o In response, the NameNode provides the metadata to the Job Tracker.
Task Tracker
o It works as a slave node for Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
MapReduce comes into play when a client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers.
Sometimes a TaskTracker fails or times out. In such a case, that part of the job is rescheduled.
Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. HDFS is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost effective compared to a traditional relational database management system.
o Resilient to failure: HDFS has the property of replicating data over the network, so if one node goes down or some other network failure happens, Hadoop takes the other copy of the data and uses it. Normally, data is replicated three times, but the replication factor is configurable.
Daemons mean processes. Hadoop daemons are the set of processes that run on Hadoop. Hadoop is a framework written in Java, so all of these processes are Java processes.
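Since all of these daemons are Java processes, a quick way to check which of them are running on a machine is the JDK's jps tool (a sketch, assuming the JDK is installed and on the PATH):
$ jps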
1. NameNode
The NameNode works on the master system. The primary purpose of the NameNode is to manage all the metadata. Metadata is the list of files stored in HDFS (Hadoop Distributed File System). As we know, the data is stored in the form of blocks in a Hadoop cluster. So the DataNode on which, or the location at which, each block of a file is stored is recorded in the metadata. All information regarding the logs of the transactions happening in the Hadoop cluster (when or who read/wrote the data) is also stored in the metadata. The metadata is stored in memory.
Features:
It never stores the data that is present in the file.
As the NameNode works on the master system, the master system should have good processing power and more RAM than the slaves.
It stores information about the DataNodes, such as their block IDs and the number of blocks.
How to start Name Node?
hadoop-daemon.sh start namenode
How to stop Name Node?
hadoop-daemon.sh stop namenode
2. DataNode
The DataNode works on the slave system. The NameNode always instructs the DataNode about storing the data. A DataNode is a program that runs on a slave system and serves read/write requests from the client. As the data is stored on these DataNodes, they should have large storage capacity to store more data.
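Like the NameNode, the DataNode daemon can be started and stopped with the same script pattern shown above (a sketch, assuming the standard Hadoop scripts):
hadoop-daemon.sh start datanode
hadoop-daemon.sh stop datanode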
3. Secondary NameNode
The Secondary NameNode is used for taking hourly backups of the metadata. In case the Hadoop cluster fails or crashes, the Secondary NameNode will have taken the hourly backups, or checkpoints, of that metadata and stored them in a file named fsimage. This file can then be transferred to a new system, a new master is created with this metadata, and the cluster is made to run again correctly.
This is the benefit of the Secondary NameNode. In Hadoop 2 we have the High Availability and Federation features, which minimize the importance of the Secondary NameNode.
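The Secondary NameNode follows the same start/stop convention as the other daemons (a sketch):
hadoop-daemon.sh start secondarynamenode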
4. Resource Manager
The Resource Manager is also known as the global master daemon that works on the master system. The Resource Manager manages the resources for the applications that are running in a Hadoop cluster. The Resource Manager mainly consists of 2 things:
1. ApplicationsManager
2. Scheduler
The ApplicationsManager is responsible for accepting a client's job request and for negotiating a container (memory resource) on a slave in the Hadoop cluster to host that job's ApplicationMaster. The Scheduler is responsible for allocating resources to the applications running in the Hadoop cluster.
How to start ResourceManager?
yarn-daemon.sh start resourcemanager
How to stop ResourceManager?
yarn-daemon.sh stop resourcemanager
5. Node Manager
The Node Manager works on the slave systems and manages the resources (memory and disk) within each node. Every slave node in a Hadoop cluster has a single NodeManager daemon running on it. It also sends this monitoring information to the Resource Manager.
In a Hadoop cluster, the Resource Manager and Node Manager can be tracked with specific URLs of the form http://<hostname>:port_number
Hadoop Daemon    Default Web UI Port
ResourceManager  8088
NodeManager      8042
HDFS is cost effective as it uses commodity hardware. It involves the concepts of blocks, data nodes and the name node.
Since all the metadata is stored in the name node, it is very important. If it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks present on the data nodes. To overcome this, the concept of the secondary name node arises.
Secondary Name Node: It is a separate physical machine which acts as a helper of the name node. It performs periodic checkpoints. It communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and loss of data.
Starting HDFS
HDFS should be formatted initially and then started in distributed mode. The commands are given below.
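Before the very first start, the NameNode is formatted once. A sketch of that command, assuming a default installation:
$ hadoop namenode -format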
To start: $ start-dfs.sh
o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test, for example with the put command sketched below.
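A sketch of the corresponding command, using the put syntax listed below and the paths mentioned above:
$ $HADOOP_HOME/bin/hadoop fs -put /usr/home/Desktop/data.txt /user/test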
o rm -r <path>
Recursive deleting: removes the directory identified by path and all of its contents (for example, hadoop fs -rm -r /user/test would remove the /user/test folder used above).
o put <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within the
DFS.
o copyFromLocal <localSrc><dest>
Identical to -put
o moveFromLocal <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within
HDFS, and then deletes the local copy on success.
o get <src><localDest>
Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
o cat <filename>
Displays the contents of filename on stdout.
o moveToLocal <src><localDest>
Works like -get, but deletes the HDFS copy on success.
o setrep [-R] [-w] rep <path>
Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)
o touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
o stat [format] <path>
Prints information about path. Format is a string which accepts file size in bytes (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
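As a short, hedged illustration of a few of these commands (the paths and the replication value are hypothetical):
$ $HADOOP_HOME/bin/hadoop fs -touchz /user/test/marker
$ $HADOOP_HOME/bin/hadoop fs -setrep -w 2 /user/test/data.txt
$ $HADOOP_HOME/bin/hadoop fs -stat "%n %b %r" /user/test/data.txt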
Unlike other distributed file systems, HDFS is highly fault-tolerant and can be deployed on low-cost hardware. It can easily handle applications that contain large data sets.
Features of HDFS
o Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a single cluster.
o Replication - Due to some unfavorable condition, the node containing the data may be lost. So, to overcome such problems, HDFS always maintains a copy of the data on a different machine.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is so fault-tolerant that if any machine fails, another machine containing a copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS that makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
Goals of HDFS
o Handling hardware failure - HDFS contains multiple server machines. If any machine fails, the goal of HDFS is to recover from it quickly.
o Streaming data access - Applications that run on HDFS require streaming access to their data sets; they are not the general-purpose applications that typically run on general-purpose file systems.
o Coherence model - Applications that run on HDFS are required to follow the write-once-read-many approach. So a file, once created, need not be changed; however, it can be appended and truncated.
Characteristics of HDFS
1. Run on low-cost systems i.e. commodity hardware
The Hadoop Distributed File System is very similar to existing distributed file systems, but it differs in several aspects, such as its use of commodity hardware. Hadoop HDFS does not require specialized hardware to store and process very large amounts of data; rather, it is designed to work on low-cost clusters of commodity hardware, where a cluster means a group of connected computers that are cheap and affordable.
2. Provide High Fault Tolerance
HDFS provides high fault tolerance. Fault tolerance is achieved when the system functions properly, without any data loss, even if some hardware components of the system have failed. Without fault tolerance, the failure of a single node in a cluster can bring down the entire system; the purpose of fault tolerance is to remove such failed nodes, which disturb the normal functioning of the system. By default, in HDFS every data block is replicated on 3 data nodes. If a data node goes down, the client can easily fetch the data from the other 2 data nodes where the data is replicated; this prevents the entire system from going down, and fault tolerance is achieved in the Hadoop cluster. HDFS is also flexible enough to add and remove data nodes with little effort. There are 3 ways with which HDFS can achieve fault tolerance, i.e. data replication, heartbeat messages, and checkpoints and recovery.
3. Large Data Set
In the case of HDFS, a large data set means data that is hundreds of megabytes, gigabytes, terabytes, or sometimes even petabytes in size. It is preferable to use HDFS for files of very large size rather than for many small files, because the metadata for a large number of small files consumes much more space in the name node's memory than the smaller number of entries for large files.
4. High Throughput
HDFS is designed to be a high-throughput batch processing system rather than to provide low-latency interactive use. HDFS always implements the WORM pattern, i.e. Write Once Read Many. The data is immutable, meaning that once it is written it cannot be changed; because of this, the data is the same across the network. Thus HDFS can process large data in a given amount of time and hence provides high throughput.
5. Data Locality
HDFS allows us to store and process massive amounts of data on a cluster made up of commodity hardware. Since the data is significantly large, HDFS moves the computation process, i.e. the Map-Reduce program, towards the data instead of pulling the data out for computation. This minimizes network congestion and increases the overall throughput of the system.
6. Scalability
As HDFS stores large amounts of data over multiple nodes, the number of nodes can be scaled up or scaled down in a cluster as the storage requirement increases or decreases. Vertical and horizontal scalability are the two mechanisms available to provide scalability in the cluster. Vertical scalability means adding resources, like disk space or RAM, to an existing node of the cluster. In horizontal scaling, on the other hand, we increase the number of nodes in the cluster, which is usually preferable since a cluster can grow to hundreds of nodes.
7. Data Compression
Hadoop provides built-in support for data compression. Data compression reduces the amount of storage required for large data sets, which in turn reduces the storage cost. The default codec is based on the zlib/Deflate algorithm, and other compression codecs such as Snappy, LZO, and Gzip are supported. The codec to be used can be specified by the application when files are written, as sketched below.
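As a hedged sketch of how an application can request output compression through the Hadoop MapReduce API (the class name CompressedOutputExample and the choice of Gzip are illustrative):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputExample {
    public static void configure(Job job) {
        // Ask Hadoop to compress the job's output files using the Gzip codec.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}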
8. Security
HDFS provides security for the stored data through features like authentication,
authorization, and encryption. Authentication is the process of verifying the identity of the
user who is accessing the system, while authorization is the process of determining
whether a user has the required permissions to access the data. Encryption is the
process of encoding the data in such a way that it cannot be read by unauthorized users.
HDFS provides Kerberos authentication and Access Control Lists (ACLs) for
authorization.
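As a hedged illustration of such ACLs (the user name alice and the path are hypothetical, and ACLs must be enabled in the cluster configuration), a read permission can be granted and then inspected with:
$ hdfs dfs -setfacl -m user:alice:r-- /user/test/data.txt
$ hdfs dfs -getfacl /user/test/data.txt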
9. Easy Integration
HDFS is designed to be easily integrated with other Hadoop ecosystem components like
Apache Spark, Apache Hive, and Apache Pig. This enables developers to perform
various data processing tasks using these tools without having to worry about data
storage and retrieval, as HDFS takes care of it. Additionally, HDFS can be accessed
using a variety of programming languages like Java, Python, and Scala.
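For instance, here is a minimal hedged sketch of reading an HDFS file from Java through the Hadoop FileSystem API (the class name HdfsReadExample and the path are illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();    // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);        // connect to the default file system (HDFS)
        Path file = new Path("/user/test/data.txt"); // illustrative HDFS path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);            // print each line of the HDFS file
            }
        }
    }
}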
10. Open Source
HDFS is an open-source project developed by the Apache Software Foundation. This means
that anyone can contribute to the development of the software and make improvements.
Open source also means that there is a large community of developers who are
constantly working to improve the software and fix bugs. This ensures that HDFS is
constantly evolving and improving.
Step 1
After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as the data nodes as a cluster.
$ start-dfs.sh
Step 2
Transfer and store a data file from the local system into the Hadoop file system using the put command (see the put example shown earlier in this unit).
Step 3
View the data stored in HDFS, for example using the cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Step 4
Get the file from HDFS back to the local file system using the get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
Finally, shut down HDFS.
$ stop-dfs.sh
What is MapReduce?
MapReduce is a data processing tool which is used to process data in parallel in a distributed form. It was developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data Processing on Large Clusters," published by Google.
MapReduce is a paradigm which has two phases, the mapper phase and the reducer phase. In the Mapper, the input is given in the form of key-value pairs. The output of the Mapper is fed to the Reducer as input, and the Reducer runs only after the Mapper is over. The Reducer also takes input in key-value format, and the output of the Reducer is the final output.
Steps in Map Reduce
o The Map takes data in the form of <key, value> pairs and returns a list of <key, value> pairs. The keys will not be unique in this case.
o Using the output of the Map, sort and shuffle are applied by the Hadoop architecture. This sort and shuffle acts on the list of <key, value> pairs and sends out unique keys together with the list of values associated with each unique key: <key, list(values)>.
o The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of values for each unique key, and the final output <key, value> is stored/displayed.
Sort and Shuffle
The sort and shuffle occur on the output of the Mapper and before the Reducer. When the Mapper task is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written to disk. Using the input from each Mapper, <k2, v2>, we collect all the values for each unique key k2. This output from the shuffle phase, in the form of <k2, list(v2)>, is sent as input to the reducer phase.
Usage of MapReduce
o It can be used in various applications such as document clustering, distributed sorting, and web link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.
o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environments.
MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop which make it so powerful and efficient to use. MapReduce is a programming model used for efficient parallel processing over large data sets in a distributed manner. The data is first split and then combined to produce the final result. Libraries for MapReduce have been written in many programming languages, with various different optimizations. The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, providing less overhead over the cluster network and reducing the required processing power. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the MapReduce for processing.
There can be multiple clients available that continuously send jobs for processing to the
Hadoop MapReduce Manager.
2. Job: The MapReduce job is the actual work that the client wants to get done, which is comprised of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce, we have a client. The client submits a job of a particular size to the Hadoop MapReduce Master. The MapReduce Master then divides this job into further equivalent job-parts. These job-parts are then made available for the Map and Reduce tasks. The Map and Reduce tasks contain the program required by the use case that the particular company is solving; the developer writes this logic to fulfill the requirement. The input data is fed to the Map task, and the Map generates intermediate key-value pairs as its output. The output of the Map, i.e. these key-value pairs, is then fed to the Reducer, and the final output is stored on HDFS. There can be any number of Map and Reduce tasks made available for processing the data, as the requirement dictates. The algorithms for Map and Reduce are written in a very optimized way, such that the time complexity and space complexity are minimal.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the Map may be a key-value pair, where the key can be the id or address of a record and the value is the actual content that the record keeps. The Map() function is executed, in its memory repository, on each of these input key-value pairs and generates the intermediate key-value pairs, which work as input for the Reducer or Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key, as per the reduce algorithm written by the developer. A small Java sketch of such map and reduce functions is given after this list.
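To make the Map() and Reduce() functions concrete, here is a minimal word-count sketch in Java using the Hadoop MapReduce API. It is an illustrative example, not this unit's own program; the class names WordCountMapper and WordCountReducer are hypothetical. The mapper emits (word, 1) pairs, and after shuffle and sort the reducer receives (word, list of counts) and sums them, exactly as described in the two phases above.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for every input line, emit (word, 1) intermediate key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate (word, 1) pair
            }
        }
    }
}

// Reducer: receives (word, list of counts) after shuffle and sort, and sums the counts.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // final (word, total) pair
    }
}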
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The work of Job tracker is to manage all the resources and all the jobs across the
cluster and also to schedule each map on the Task Tracker running on the same data node
since there can be hundreds of data nodes available in the cluster.
2. Task Tracker: The Task Trackers can be considered the actual slaves that work on the instructions given by the Job Tracker. A Task Tracker is deployed on each of the nodes available in the cluster and executes the Map and Reduce tasks as instructed by the Job Tracker.
There is also one more important component of the MapReduce architecture, known as the Job History Server. The Job History Server is a daemon process that saves and stores historical information about the tasks or applications; the logs which are generated during or after the job execution are stored on the Job History Server.
Map Reduce :-
It is a framework in which we can write applications to process huge amounts of data in parallel, on large clusters of commodity hardware, in a reliable manner.
Different Phases of MapReduce:-
MapReduce model has three major and one optional phase.
Mapping
Shuffling and Sorting
Reducing
Combining
Mapping :- It is the first phase of MapReduce programming. The Mapping phase accepts key-value pairs as input, (k, v), where the key represents the address of each record and the value represents the entire record content. The output of the Mapping phase will also be in the key-value format (k', v').
Shuffling and Sorting :- The output of the various mapping parts, (k', v'), then goes into the Shuffling and Sorting phase. Duplicate keys are merged, and the values are grouped together based on the same keys. The output of the Shuffling and Sorting phase will again be key-value pairs, as a key and an array of values (k, v[ ]).
Reducer :- The output of the Shuffling and Sorting phase, (k, v[ ]), will be the input of the Reducer phase. In this phase the reducer function's logic is executed and all the values are collected against their corresponding keys. The Reducer consolidates the outputs of the various mappers and computes the final output.
Combining :- It is an optional phase in the MapReduce phases. The combiner phase is used to optimize the performance of the MapReduce phases: it runs a local reduce on each mapper's output, which makes the Shuffling and Sorting phase work even quicker because less intermediate data has to be transferred.
[Figure: flow chart of the MapReduce phases]
Numerical:-
MovieLens Data
USER_ID MOVIE_ID RATING TIMESTAMP
196 242 3 881250949
186 302 3 891717742
196 377 1 878887116
244 51 2 880606923
166 346 1 886397596
186 474 4 884182806
186 265 2 881171488
Solution :-
Step 1 – First we have to map the values; this happens in the first phase of the MapReduce model.
196:242 ; 186:302 ; 196:377 ; 244:51 ; 166:346 ; 186:474 ; 186:265
Step 2 – After mapping, we have to shuffle and sort the values.
166:346 ; 186:302,474,265 ; 196:242,377 ; 244:51
Step 3 – After completing steps 1 and 2, we have to reduce each key's values and put all the values together.
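For example, assuming the reducer simply counts how many movies each user has rated (a common choice for this exercise), the final reduced output would be:
166:1 ; 186:3 ; 196:2 ; 244:1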
Limitations of HDFS
Issues with Small Files
The main problem with Hadoop is that it is not suitable for small data. HDFS lacks the ability to support random reading of small files due to its high-capacity design. Small files are files smaller than the HDFS block size (default 128 MB). HDFS cannot handle huge numbers of such small files, as it was designed to work with a small number of large files for storing large data sets, rather than with a large number of small files. If there are a lot of small files, the NameNode will be overloaded, since it stores the namespace of HDFS.
Solution:
Hadoop Archives, or HAR files, are one of the solutions to the small files problem. A Hadoop archive acts as another layer of file system over Hadoop. With the Hadoop archive command we can build HAR files; this command runs a MapReduce job in the background to pack the archived files into a small number of HDFS files. However, reading through HAR files is not more efficient than reading through HDFS, because it requires accessing two index files and then, finally, the data file. Sequence files are another solution to the small file problem. Here we write a program to merge a number of small files into one sequence file, and then process this sequence file in a streaming fashion. MapReduce can break the sequence file into chunks and process them in parallel, since the sequence file is splittable.
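A hedged sketch of such an archive command (the names data.har, /user/test and /user/archive are illustrative):
$ hadoop archive -archiveName data.har -p /user/test /user/archive
This launches a MapReduce job that packs the contents of /user/test into the archive data.har under /user/archive.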
Slow Processing Speed
MapReduce processes a huge amount of data. In Hadoop, MapReduce works by breaking the processing into phases: Map and Reduce. MapReduce requires a lot of time to perform these tasks, thereby increasing latency and reducing processing speed.
Solution:
Spark is the solution for the slow processing speed of MapReduce. It does in-memory computation, which makes it up to a hundred times faster than Hadoop. While processing, Spark reads the data from RAM and writes the data to RAM, thereby making it a fast processing tool. Flink is one more technology which is faster than Hadoop MapReduce, as it also does in-memory computation. Flink is even faster than Spark; this is due to the stream processing engine at its core, as opposed to Spark, which has a batch processing engine.
Support for Batch Processing Only
Hadoop's core MapReduce engine supports only batch processing and is not suited to streaming workloads.
Solution
Apache Spark addresses this problem as it supports stream processing. However, Spark stream processing is not as efficient as Flink's, as it uses micro-batch processing. Apache Flink improves the overall performance, as it provides a single run-time for streaming as well as batch processing.
No Real-time Processing
Hadoop, with its core MapReduce framework, is unable to process real-time data. Hadoop processes data in batches. First, the user loads the file into HDFS. Then the user runs a MapReduce job with the file as input. It follows the ETL cycle of processing: the user extracts the data from the source, then the data gets transformed to meet the business requirements, and finally it is loaded into the data warehouse. The users can then generate insights from this data, and companies use these insights for the betterment of their business.
Solution
Spark has come up as a solution to the above problem. Spark supports real-time processing: it processes the incoming streams of data by forming micro-batches and then applying computations on these micro-batches.
Flink is one more solution. It is even faster than Spark, as it has a stream processing engine at its core. Flink is a true streaming engine with adjustable latency and throughput, and it has a rich set of APIs exploiting its streaming runtime.
Iterative Processing
Core Hadoop does not support iterative processing. Iterative processing requires a cyclic data flow, in which the output of a previous stage serves as the input to the next stage. Hadoop MapReduce is capable of batch processing and works on the principle of write-once-read-many: the data gets written to disk once and is then read multiple times to get insights. The MapReduce of Hadoop has a batch processing engine at its core and is not able to iterate through data.
Solution
Spark supports iterative processing, since its RDDs can be cached in memory and reused across iterations. Flink also supports iterative processing natively, as its streaming engine allows cyclic data flows.
Latency
MapReduce in Hadoop is slower because it supports different formats of structured as well as unstructured data, in huge volumes. In MapReduce, Map takes a set of data and converts it into another set of data, where an individual element is broken down into a key-value pair. Reduce takes the output from the Map as input and processes it further. MapReduce requires a lot of time to perform these tasks, thereby increasing latency.
Solution:
Apache Spark can reduce this issue. Although Spark is also a batch system, it is relatively faster because it caches much of the input data in memory using RDDs. Apache Flink's data streaming achieves low latency and high throughput.
No Ease of Use
In Hadoop, we have to hand-code each and every operation. This has two drawbacks: first, it is difficult to use; second, it increases the number of lines of code. There is no interactive mode available with Hadoop MapReduce, which also makes it difficult to debug, as it runs in batch mode. In this mode we have to specify the jar file, the input, as well as the location of the output file. If the program fails in between, it is difficult to find the culprit code.
Solution
Spark is easier for the user as compared to Hadoop, because it has many APIs for Java, Scala, Python, and Spark SQL. Spark performs batch processing, stream processing and machine learning on the same cluster. This makes life easy for users, as they can use the same infrastructure for various workloads. In Flink, a number of high-level operators are available, which reduces the number of lines of code needed to achieve the same result.
Security Issue
Hadoop does not implement encryption and decryption at the storage or network level, so it is not very secure. For security, Hadoop adopts Kerberos authentication, which is difficult to maintain.
Solution
Spark encrypts temporary data written to local disk. It does not support encryption of output data generated by applications through APIs such as saveAsHadoopFile or saveAsTable. Spark implements AES-based encryption for RPC connections; RPC authentication should be enabled for the encryption to take effect, and it should be properly configured.
No Caching
Apache Hadoop is not efficient for caching. MapReduce cannot cache the intermediate data in memory for further requirements, and this diminishes the performance of Hadoop.
Solution:
Spark and Flink overcome this issue: they cache data in memory for further iterations, which enhances the overall performance.
Lengthy Code
Apache Hadoop has about 1,20,000 (120,000) lines of code. More lines of code mean more bugs, and hence it takes more time to execute the programs.
Solution:
Spark and Flink are written in Scala and Java, but the core implementation is in Scala, so the number of lines of code is smaller than in Hadoop. Thus, it takes less time to execute the programs.