Notes
The Hadoop Distributed File System (HDFS) was designed for Big Data
processing. Although capable of supporting many users simultaneously, HDFS
is not designed as a true parallel file system. Rather, the design assumes a large
file write-once/read-many model that enables other optimizations and relaxes
many of the concurrency and coherence overhead requirements of a true
parallel file system. For instance, HDFS rigorously restricts data writing to one
user at a time. All additional writes are "append-only," and there is no random
writing to HDFS files. Bytes are always appended to the end of a stream, and
byte streams are guaranteed to be stored in the order written.
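The HDFS shell reflects this model: a file can be written once and later appended to, but never overwritten in place. A small illustration (the file names here are only examples, not from the text):
$ hdfs dfs -mkdir logs
$ hdfs dfs -put day1.log logs/app.log
$ hdfs dfs -appendToFile day2.log logs/app.log     # bytes of day2.log are added to the end of the existing file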
HDFS is designed for data streaming where large amounts of data are read from
disk in bulk. The HDFS block size is typically 64MB or 128MB. Thus, this
approach is entirely unsuitable for standard POSIX file system use. In addition,
due to the sequential nature of the data, there is no local caching mechanism.
The large block and file sizes make it more efficient to reread data from HDFS
than to try to cache the data.
Perhaps the most interesting aspect of HDFS—and the one that separates it
from other file systems—is its data locality. A principal design aspect of
Hadoop MapReduce is the emphasis on moving the computation to the data
rather than moving the data to the computation. This distinction is reflected in
how Hadoop clusters are implemented. In other high-performance systems, a
parallel file system will exist on hardware separate from the compute hardware.
Data is then moved to and from the compute components via high-speed
interfaces to the parallel file system array. HDFS, in contrast, is designed to
work on the same hardware as the compute portion of the cluster. That is, a
single server node in the cluster is often both a computation engine and a
storage engine for the application.
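Block placement can be inspected with fsck, which reports the DataNodes holding each block of a file (the path here is only an example):
$ hdfs fsck /user/hdfs/war-and-peace.txt -files -blocks -locations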
Finally, Hadoop clusters assume node (and even rack) failure will occur at
some point. To deal with this situation, HDFS has a redundant design that can
tolerate system failure and still provide the data needed by the compute part of
the program.
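The replication factor can be checked and changed from the command line (the file name is illustrative; the HDFS default replication is 3):
$ hdfs dfs -ls somefile            # the second column is the replication factor
$ hdfs dfs -setrep -w 3 somefile   # set replication to 3 and wait until the extra copies exist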
2 Explain the important aspects of HDFS. 8
If several machines must be involved in the serving of a file, then a file could be rendered unavailable by the loss of any one of those machines.
In addition, the HDFS default block size is often 64MB. In a typical operating
system, the block size is 4KB or 8KB. The HDFS default block size is not the
minimum block size, however. If a 20KB file is written to HDFS, it will create
a block that is approximately 20KB in size. (The underlying file system may
have a minimal block size that increases the actual file size.) If a file of size
80MB is written to HDFS, a 64MB block and a 16MB block will be created.
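The arithmetic follows directly from the block size: an 80MB file with a 64MB block size needs two blocks (64MB + 16MB). The block size used for a particular file can also be overridden at write time and inspected afterward (dfs.blocksize is the Hadoop 2 property name; the file name is illustrative, and the -stat format specifiers are those of recent Hadoop 2 releases):
$ hdfs dfs -D dfs.blocksize=134217728 -put bigfile.dat bigfile.dat   # write this file with 128MB blocks
$ hdfs dfs -stat "block size: %o, length: %b" bigfile.dat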
HDFS blocks are not exactly the same as the data splits used by the MapReduce
process. The HDFS blocks are based on size, while the splits are based on a
logical partitioning of the data. For instance, if a file contains discrete records,
the logical split ensures that a record is not split physically across two separate
servers during processing. Each HDFS block may consist of one or more splits.
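Split sizes are determined by the job's InputFormat rather than by HDFS. For the common file-based input formats they can usually be bounded with job properties such as the following, passed on the job command line (property names are from the Hadoop 2 MapReduce configuration; the values are illustrative):
-D mapreduce.input.fileinputformat.split.minsize=67108864
-D mapreduce.input.fileinputformat.split.maxsize=134217728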
Figure 3.2 provides an example of how a file is broken into blocks and
replicated across the cluster. In this case, a replication factor of 3 ensures that
any one DataNode can fail and the replicated blocks will be available on other
nodes—and then subsequently re-replicated on other DataNodes.
Figure 3.2 A file broken into blocks and replicated (replication factor of 3) across the DataNodes of the cluster
The NameNode stores the metadata of the HDFS file system in a file called fsimage. File system modifications are written to an edits log file, and at
startup the NameNode merges the edits into a new fsimage. The Secondary NameNode, or CheckpointNode, periodically fetches edits from the NameNode, merges them, and returns an updated fsimage to the NameNode.
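Both files can be examined offline, and an administrator can force a checkpoint; a typical sequence (the file paths are placeholders for the actual files under the NameNode's metadata directory):
$ hdfs dfsadmin -safemode enter
$ hdfs dfsadmin -saveNamespace                        # merge the edits into a new fsimage
$ hdfs dfsadmin -safemode leave
$ hdfs oiv -p XML -i <fsimage file> -o fsimage.xml    # Offline Image Viewer
$ hdfs oev -p XML -i <edits file> -o edits.xml        # Offline Edits Viewer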
The HDFS NFS Gateway supports NFSv3 and enables HDFS to be mounted as part of the client's local file system. Users can browse the HDFS file system through their local file system on NFSv3 client-compatible operating systems.
This feature offers users the following capabilities:
Users can easily download/upload files from/to the HDFS file system
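For example, once the gateway service is running, the mount options documented for the HDFS NFS Gateway look like the following (the gateway host and mount point are placeholders):
$ sudo mount -t nfs -o vers=3,proto=tcp,nolock,sync <nfs_gateway_host>:/ /mnt/hdfs
$ ls /mnt/hdfs        # browse HDFS as if it were a local directory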
To copy a file from your current local directory into HDFS, use the following command. If a full path is not supplied, your home directory is assumed. In this case, the file test is placed in the directory stuff that was created previously.
$ hdfs dfs -put test stuff
The file transfer can be confirmed by listing the directory:
$ hdfs dfs -ls stuff
Found 1 items
-rw-r--r--   2 hdfs hdfs      12857 2015-05-29 13:12 stuff/test
Files can be copied from HDFS back to the local file system using the following command. In this case, the file we copied into HDFS, test, will be copied back to the current local directory with the name test-local.
$ hdfs dfs -get stuff/test test-local
The following command will delete the HDFS file test.hdfs that was created previously:
$ hdfs dfs -rm test.hdfs
Regular users can get an abbreviated HDFS status report using the following command. Those with HDFS administrator privileges will get a full (and potentially long) report. Also, this command uses dfsadmin instead of dfs to invoke administrative commands. The status report is similar to the data presented in the HDFS web GUI.
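The command in question is the dfsadmin report (output omitted here):
$ hdfs dfsadmin -report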
To speed up the search, one could ask several friends to help and give them each a section of the book to search. This step is the map stage. The reduce phase happens when everyone is done counting and the individual counts are added together to give the final totals.
13 Write simple Mapper and Reducer scripts and give a brief description of them. 8
Listing 5.1 Simple Mapper Script
#!/bin/bash
# Simple mapper: read text from standard input and emit a "name,1"
# pair for every occurrence of the words Kutuzov or Petersburg.
while read line ; do
    for token in $line; do
        if [ "$token" = "Kutuzov" ] ; then
            echo "Kutuzov,1"
        elif [ "$token" = "Petersburg" ] ; then
            echo "Petersburg,1"
        fi
    done
done
Listing 5.2 Simple Reducer Script
#!/bin/bash
# Simple reducer: count the "name,1" pairs produced by the mapper
# and print one final total per name.
kcount=0
pcount=0
while read line ; do
    if [ "$line" = "Kutuzov,1" ] ; then
        let kcount=kcount+1
    elif [ "$line" = "Petersburg,1" ] ; then
        let pcount=pcount+1
    fi
done
echo "Kutuzov,$kcount"
echo "Petersburg,$pcount"
Formally, the MapReduce process can be described as follows. The mapper and
reducer functions are both defined with respect to data structured in (key,value)
pairs. The mapper takes one pair of data with a type in one data domain, and
returns a list of pairs in a different domain:
Map(key1,value1) → list(key2,value2)
The reducer function is then applied to each key, together with the list of values collected for that key, and in turn produces a collection of values in the same domain:
Reduce(key2, list(value2)) → list(value3)
2. Map Step. The mapping process is where the parallel nature of Hadoop
comes into play. For large amounts of data, many mappers can be operating at
the same time. The user provides the specific mapping process. MapReduce
will try to execute the mapper on the machines where the block resides.
Because the file is replicated in HDFS, the least busy node with the data will be
chosen. If all nodes holding the data are too busy, MapReduce will try to pick a
node that is closest to the node that hosts the data block (a characteristic called rack-awareness; a sketch of a rack-topology script appears after this list). The last choice is any node in the cluster that has access to
HDFS.
4. Shuffle Step. Before the parallel reduction stage can complete, all similar
keys must be combined and counted by the same reducer process. Therefore,
results of the map stage must be collected by key–value pairs and shuffled to
the same reducer process. If only a single reducer process is used, the shuffle
stage is not needed.
5. Reduce Step. The final step is the actual reduction. In this stage, the data
reduction is performed as per the programmer’s design. The reduce step is also
optional. The results are written to HDFS. Each reducer will write its own output
file. For example, a MapReduce job running four reducers will create files
named part-00000, part-00001, part-00002, and part-00003 (or part-r-00000, and
so on, with the newer Java MapReduce API).
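The rack-awareness mentioned in the Map step is normally configured by pointing the net.topology.script.file.name property (in core-site.xml) at an administrator-supplied script that maps node addresses to rack names; Hadoop passes a list of addresses as arguments and expects one rack path per address. A minimal sketch, with an assumed address-to-rack mapping:
#!/bin/bash
# rack-topology.sh (hypothetical): map node IP addresses to rack names
for node in "$@"; do
    case $node in
        10.0.1.*) echo -n "/rack1 " ;;
        10.0.2.*) echo -n "/rack2 " ;;
        *)        echo -n "/default-rack " ;;
    esac
done
echo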
Figure 5.1 is an example of a simple Hadoop MapReduce data flow for a word
count program. The map process counts the words in the split, and the reduce
process calculates the total for each word. As mentioned earlier, the actual
computation of the map and reduce stages are up to the programmer. The
MapReduce data flow shown in Figure 5.1 is the same regardless of the specific
map and reduce tasks.
The input to the MapReduce application is the following file in HDFS with
three lines of text. The goal is to count the number of times each word is used.
see spot run
run spot run
see the cat
The first thing MapReduce will do is create the data splits. For simplicity, each
line will be one split. Since each split will require a map task, there are three
mapper processes that count the number of words in the split. On a cluster, the
results of each map task are written to local disk and not to HDFS. Next, similar
keys need to be collected and sent to a reducer process. The shuffle step
requires data movement and can be expensive in terms of processing time.
Depending on the nature of the application, the amount of data that must be
shuffled throughout the cluster can vary from small to large.
Once the data have been collected and sorted by key, the reduction step can
begin (even if only partial results are available). It is not necessary—and not
normally recommended—to have a reducer for each key–value pair as shown in
Figure 5.1.
In some cases, a single reducer will provide adequate performance; in other
cases, multiple reducers may be required to speed up the reduce phase. The
number of reducers is a tunable option for many applications. The final step is
to write the output to HDFS.
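For the three-line input above, the final output written to HDFS would contain one count per word (formatted here as word,count pairs to match the scripts shown earlier):
cat,1
run,3
see,2
spot,2
the,1
The number of reducers itself is usually set per job with the mapreduce.job.reduces property (or the -numReduceTasks option of the streaming interface).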
As mentioned, a combiner step enables some pre-reduction of the map output
data. For instance, in the previous example, one map produced the following counts:
(run,1)
(spot,1)
(run,1)
As shown in Figure 5.2, the count for run can be combined into (run,2) before
the shuffle. This optimization can help minimize the amount of data transfer
needed for the shuffle phase.
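With the Hadoop streaming interface, a combiner executable is supplied with the -combiner option. The command below is a sketch (combiner.sh is a hypothetical script; it must accept the mapper's word,1 pairs and emit output in a form the reducer can still consume):
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.sh,combiner.sh,reducer.sh \
    -mapper mapper.sh -combiner combiner.sh -reducer reducer.sh \
    -input war-and-peace-input -output war-and-peace-output
Note that the simple reducer in Listing 5.2 only recognizes counts of 1, so it would need a small change before a combiner could be used in front of it.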
Figure 5.3 shows a simple three-node MapReduce process. Once the mapping is
complete, the same nodes begin the reduce process. The shuffle stage makes
sure the necessary data are sent to each reducer. Also note that there is no
requirement that all the mappers complete at the same time or that the mapper
on a specific node be complete before a reducer is started. Reducers can be set
to start shuffling based on a threshold percentage of mappers that have
finished.
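In Hadoop 2 this threshold is the mapreduce.job.reduce.slowstart.completedmaps property, which can be set per job as a generic option (the value here is illustrative):
-D mapreduce.job.reduce.slowstart.completedmaps=0.80    # start reducers after 80% of the maps have finished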
Finally, although the examples are simple in nature, the parallel MapReduce
algorithm can be scaled up to extremely large data sizes. For instance, the
Hadoop word count sample application can be run on the three lines given
earlier or on a 3TB file. The application requires no changes to account for the
scale of the problem—a feature that is one of the remarkable advantages of
MapReduce processing.
17 Write a note on Hadoop MapReduce Hardware. 8
The use of server nodes for both storage (HDFS) and processing (mappers,
reducers) is somewhat different from the traditional separation of these two
tasks in the data centre. It is possible to build Hadoop systems that separate the
two roles (discrete storage and processing nodes), but the majority of Hadoop
systems use the general approach where servers enact both roles. Another
interesting feature of dynamic MapReduce execution is the capability to tolerate
dissimilar servers; that is, old and new hardware can be used together. Of
course, large disparities in performance will limit the faster systems, but the
dynamic nature of MapReduce execution will still work effectively on such
systems.
18 Explain briefly the steps for compiling and running the program from the command line. 6
To compile and run the program from the command line, perform the following
steps:
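The exact steps depend on the program; for a single-file Java job such as the classic WordCount example they typically run along these lines (all file, class, and directory names below are illustrative assumptions):
$ mkdir wordcount_classes
$ javac -cp $(hadoop classpath) -d wordcount_classes WordCount.java
$ jar -cvf wordcount.jar -C wordcount_classes/ .
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input
$ hadoop jar wordcount.jar WordCount war-and-peace-input war-and-peace-output
$ hdfs dfs -cat war-and-peace-output/part-r-00000     # view the results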
20 Explain how the Pipes interface is used for the mapper and reducer of an application.
Pipes is a library that allows C++ source code to be used for mapper and reducer code.
Applications that require high performance when crunching numbers may achieve
better throughput if written in C++ and used through the Pipes interface. Both key and
value inputs to Pipes programs are provided as STL strings (std::string). As shown in
Listing 6.4, the program must define an instance of a mapper and an instance of a
reducer. A program to use with Pipes is defined by writing classes extending Mapper
and Reducer. Hadoop must then be informed as to which classes to use to run the job.
The Pipes framework on each machine assigned to your job will start an instance of the
C++ program. Therefore, the executable must be placed in HDFS prior to use.
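Assuming the compiled executable is named wordcount (a placeholder, as are the HDFS directory names), the usual sequence is to copy it into HDFS and launch it with the pipes runner, telling Hadoop to use the Java record reader and writer:
$ hdfs dfs -mkdir bin
$ hdfs dfs -put wordcount bin/wordcount
$ mapred pipes \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input war-and-peace-input -output war-and-peace-pipes-output \
    -program bin/wordcount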