Big Data
1. How does a secondary name node differ from the name node in HDFS?
Ans. The Secondary NameNode is responsible for periodically merging the NameNode's edit logs into the fsimage file, after which the edit logs are cleared. Because this is done periodically, it keeps the edit log files from growing too large (the accumulated changes are flushed into the fsimage held on the secondary namenode).
Similarly, the Checkpoint Node periodically fetches the fsimage and edits files from the NameNode and merges them. The resulting state is called a checkpoint. After this, it uploads the result back to the NameNode.
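How often checkpointing happens is configurable. A hedged sketch of the relevant hdfs-site.xml properties (Hadoop 2.x property names; the values shown are only illustrative):
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- merge edits into the fsimage every hour (seconds) -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- or sooner, once this many uncheckpointed transactions accumulate -->
</property>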
Keep it simple.
Pig Latin provides a streamlined way of working with Java MapReduce. It is, in simple terms, an abstraction that offers an easy way to create parallel data-flow and analysis programs on the Hadoop cluster. Complex tasks may need a series of interrelated data transformations, and such series are encoded as data flow sequences. Implementing data transformations and flows as Pig Latin scripts rather than as Java MapReduce programs makes them much simpler and easier to write, understand, and maintain since:
3) We don’t need to come up with custom code to support rich data types.
Pig Latin provides a simpler language to exploit our Hadoop cluster, thus making
it easier for more people to leverage the power of Hadoop and become
productive sooner.
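For instance, a word count that would take dozens of lines of Java MapReduce can be sketched in a handful of Pig Latin statements (the file names below are made up for illustration):
lines = LOAD 'input.txt' AS (line:chararray);            -- read raw lines
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;  -- one word per row
grouped = GROUP words BY word;                            -- group identical words
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;  -- count each group
STORE counts INTO 'wordcount_output';                     -- write the result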
Make it smart.
We may recall that the Pig Latin compiler does the job of transforming a Pig Latin program into a series of Java MapReduce jobs. The trick is to ensure that the compiler can optimize the execution of these Java MapReduce jobs automatically, allowing the user to focus on semantics rather than on how to optimize and access the data. For the SQL types out there, this explanation will sound familiar. SQL is set up as a declarative query that we use to access structured data stored in an RDBMS. The RDBMS engine first translates the query into a data access method, then inspects the statistics and builds a series of data access strategies. The cost-based optimizer selects the most efficient strategy for execution.
Traditional RDBMS data warehouses make use of the ETL data processing pattern, where we extract data from outside sources, transform it to fit our operational needs, and then load it into the end target, whether that is an operational data store, a data warehouse, or some other variant of database. However, with big data, we typically want to reduce the amount of data we have moving about, so we end up bringing the processing to the data itself. That is why the language for Pig data flows takes a pass on the old ETL approach and goes with ELT instead: extract the data from our various sources, load it into HDFS, and then transform it as necessary to prepare the data for further analysis.
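A hedged sketch of what this ELT flow can look like in practice (the paths and the script name are made up for illustration):
hdfs dfs -mkdir -p /data/raw/logs                 # create a landing directory in HDFS
hdfs dfs -put weblogs.csv /data/raw/logs/         # extract + load: land the raw data in HDFS as-is
pig -x mapreduce transform_logs.pig               # transform: run the Pig data flow where the data lives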
4. How to create a table using HiveQL?
Ans. A table in Hive is a set of data that uses a schema to sort the data by given identifiers. A table is defined with a CREATE TABLE statement, as sketched below.
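A minimal sketch, assuming a comma-delimited text file; the table name, columns, and path below are made up for illustration:
CREATE TABLE IF NOT EXISTS employee (
  id INT,
  name STRING,
  salary DECIMAL(10,2),
  dept STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/employee.csv' INTO TABLE employee;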
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes the Job Tracker, Task Tracker, NameNode, and DataNode, whereas the slave nodes include the DataNode and TaskTracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It has a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.
NameNode
It is a single master server that exists in the HDFS cluster.
As it is a single node, it can become a single point of failure.
It manages the file system namespace by executing operations such as opening, renaming, and closing files.
It simplifies the architecture of the system.
DataNode
The HDFS cluster contains multiple DataNodes.
Each DataNode contains multiple data blocks.
These data blocks are used to store data.
It is the responsibility of the DataNodes to serve read and write requests from the file system's clients.
It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker
The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
It works as a slave node for Job Tracker.
It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
The MapReduce layer comes into play when the client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times out; in such a case, that part of the job is rescheduled.
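To make this concrete, here is a hedged sketch of the classic word-count job written against the org.apache.hadoop.mapreduce (Hadoop 2 style) API; on a Hadoop 1 cluster it is the Job Tracker and Task Trackers described above that schedule and run the resulting map and reduce tasks.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emits (word, 1) for every word in its input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts received for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures the job and submits it to the cluster
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}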
6. Explain the various operational modes of Hadoop cluster configuration.
Ans. Hadoop mainly works in 3 different modes:
Standalone Mode
Pseudo-distributed Mode
Fully-Distributed Mode
1. Standalone Mode
In Standalone Mode, none of the daemons run, i.e. NameNode, DataNode, Secondary NameNode, Job Tracker, and Task Tracker. (The Job Tracker and Task Tracker are used for processing in Hadoop 1; in Hadoop 2 the Resource Manager and Node Manager are used instead.) Standalone Mode also means that we install Hadoop on only a single system. By default, Hadoop is made to run in this Standalone Mode, which we can also call Local mode. We mainly use Hadoop in this mode for learning, testing, and debugging.
2. Pseudo Distributed Mode (Single Node Cluster)
In Pseudo-distributed Mode we also use only a single node, but the key point is that the cluster is simulated, which means that all the processes inside the cluster run independently of each other. All the daemons, that is the NameNode, DataNode, Secondary NameNode, Resource Manager, Node Manager, etc., run as separate processes on separate JVMs (Java Virtual Machines), or we can say as different Java processes; that is why it is called Pseudo-distributed mode.
One thing we should remember is that, since we are using a single-node setup, all the master and slave processes are handled by one system. The NameNode and Resource Manager act as masters, and the DataNode and Node Manager act as slaves. The Secondary NameNode also acts as a master; its purpose is simply to keep an hourly checkpoint (backup) of the NameNode. In this mode:
Hadoop is used both for development and for debugging purposes.
HDFS (Hadoop Distributed File System) is used for managing the input and output processes.
We need to change the configuration files mapred-site.xml, core-site.xml, and hdfs-site.xml to set up the environment, as sketched below.
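A hedged sketch of minimal single-node settings (the hostname, port, and values are illustrative and may differ between Hadoop releases):
core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value> <!-- all daemons talk to HDFS on this single host -->
</property>
hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value> <!-- only one node, so one replica per block -->
</property>
mapred-site.xml:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value> <!-- run MapReduce jobs on YARN (Hadoop 2) -->
</property>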
3. Fully Distributed Mode (Multi-Node Cluster)
This is the most important mode, in which multiple nodes are used: a few of them run the master daemons, namely the NameNode and Resource Manager, and the rest run the slave daemons, namely the DataNode and Node Manager. Here Hadoop runs on a cluster of machines (nodes), and the data being used is distributed across the different nodes. This is the production mode of Hadoop; let's understand it in physical terms.
In the single-node modes, you download Hadoop as a tar or zip file, install it on one system, and run all the processes there. In fully distributed mode, by contrast, this tar or zip file is extracted on each of the nodes in the Hadoop cluster, and a particular node is used for a particular process. Once you distribute the processes among the nodes, you define which nodes work as masters and which work as slaves.
7. Distinguish between the old and new versions of the Hadoop API for the MapReduce framework.
Ans.
8. Explain about the implementation of map reduce concept with a small
example.
9. Explain the architecture of Pig with a neat sketch.
Ans. The language used to analyze data in Hadoop with Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.
To perform a particular task, programmers using Pig need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, Embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.
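For example, a script can be submitted from the command line or typed interactively in the Grunt shell (the script name below is made up):
pig -x local wordcount.pig        # run the script locally, without a Hadoop cluster
pig -x mapreduce wordcount.pig    # run the script as MapReduce jobs on the cluster
pig                               # start the interactive Grunt shell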
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it
makes the programmer’s job easy. The architecture of Apache Pig is shown below.
Apache Pig Components
As shown in the figure, there are various components in the Apache Pig
framework. Let us take a look at the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of the
parser will be a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and
the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order, where they are executed to produce the desired results.
The data types in Hive are classified into four categories:
Column Types
Literals
Null Values
Complex Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data
range exceeds the range of INT, you need to use BIGINT and if the data range is
smaller than the INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). Hive has two string data types, VARCHAR and CHAR, and follows C-style escape characters.
Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It
supports java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format
“yyyy-mm-dd hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format, in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing immutable arbitrary-precision values. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance
using create union. The syntax and example is as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
The following literals are used in Hive:
Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
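A hedged sketch of a table definition that combines these complex types (the table name, columns, and delimiters are made up for illustration):
CREATE TABLE contacts (
  name STRING,
  phone_numbers ARRAY<STRING>,
  properties MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';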
1. Text files
A text file is the most basic and a human-readable file. It can be read or written in
any programming language and is mostly delimited by comma or tab.
The text file format consumes more space when a numeric value needs to be
stored as a string. It is also difficult to represent binary data such as an image.
2. Sequence File
The SequenceFile format can be used to store an image in binary format. Sequence files store key-value pairs in a binary container format and are more efficient than a text file. However, they are not human-readable.
3. Avro File
The Avro file format is ideal for long-term storage of important data. It can be read from and written to in many languages such as Java, Scala, and so on. Schema metadata can be embedded in the file to ensure that it will always be readable, and schema evolution can accommodate changes. The Avro file format is considered the best choice for general-purpose storage in Hadoop.
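In Hive, for instance, the storage format is selected per table with the STORED AS clause (the table names are made up; STORED AS AVRO assumes a reasonably recent Hive release):
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;
CREATE TABLE logs_seq  (line STRING) STORED AS SEQUENCEFILE;
CREATE TABLE logs_avro (line STRING) STORED AS AVRO;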
15. Explain the Hadoop Distributed File System architecture with a neat sketch.
Ans.
The Hadoop Distributed File System (HDFS) is the primary data storage system
used by Hadoop applications. HDFS employs a NameNode and DataNode
architecture to implement a distributed file system that provides high-performance
access to data across highly scalable Hadoop clusters.
With HDFS, data is written on the server once, and read and reused numerous
times after that. HDFS has a primary NameNode, which keeps track of where file
data is kept in the cluster.
The NameNode knows which DataNode contains which blocks and where the
DataNodes reside within the machine cluster. The NameNode also manages
access to the files, including reads, writes, creates, deletes and the data block
replication across the DataNodes.
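A few illustrative HDFS shell commands (the paths are made up) that follow this write-once, read-many pattern:
hdfs dfs -mkdir -p /user/demo/input             # create a directory in HDFS
hdfs dfs -put localfile.txt /user/demo/input    # write the file into HDFS once
hdfs dfs -cat /user/demo/input/localfile.txt    # read it back as many times as needed
hdfs fsck /user/demo/input -files -blocks       # show how the file maps to blocks on DataNodes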
16. What is HDFS? List all the components of HDFS and explain.
Ans.
Hadoop is a framework that uses distributed storage and parallel processing to
store and manage big data. It is the software most used by data analysts to
handle big data, and its market size continues to grow. There are three
components of Hadoop:
Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.
Hadoop MapReduce - Hadoop MapReduce is the processing unit.
Hadoop YARN - Yet Another Resource Negotiator (YARN) is a resource
management unit.
17. With block diagram discuss the various frameworks that run under YARN.
18. What are the characteristics of Big Data?
19. Explain hadoop Architectural Model.
20. Explain NoSQL data Architecture patterns.
Ans.
An architecture pattern is a logical way of categorizing data that will be stored in the database. NoSQL is a type of database which helps to perform operations on big data and store it in a valid format. It is widely used because of its flexibility and the wide variety of services it offers.
1. Key-Value Store Database:
In this pattern, data is stored as simple key-value pairs.
Limitations:
Complex queries may attempt to involve multiple key-value pairs, which may delay performance.
Data can involve many-to-many relationships, which may collide.
Examples:
DynamoDB
Berkeley DB
2. Column Store Database:
Rather than storing data in relational tuples, the data is stored in individual cells which are further grouped into columns. Column-oriented databases work only on columns. They store large amounts of data in columns together. The format and titles of the columns can diverge from one row to another. Every column is treated separately, but each individual column may still contain multiple other columns, as in traditional databases.
Basically, columns are the mode of storage in this type.
Examples:
HBase
Bigtable by Google
Cassandra
3. Document Database:
The document database fetches and accumulates data in the form of key-value pairs, but here the values are called documents. A document can be described as a complex data structure: it can be in the form of text, arrays, strings, JSON, XML, or any such format. The use of nested documents is also very common. This is very effective, as most of the data created is usually in the form of JSON and is unstructured.
Advantages:
This type of format is very useful and apt for semi-structured data.
Storage retrieval and managing of documents is easy.
Examples:
MongoDB
CouchDB
4. Graph Databases:
Clearly, this architecture pattern deals with the storage and management of data in graphs. Graphs are basically structures that depict connections between two or more objects in some data. The objects or entities are called nodes and are joined together by relationships called edges. Each edge has a unique identifier, and each node serves as a point of contact for the graph. This pattern is very commonly used in social networks, where there are a large number of entities and each entity has one or many characteristics which are connected by edges. The relational database pattern has tables that are loosely connected, whereas graphs are often very strong and rigid in nature.
Examples:
Neo4J
FlockDB( Used by Twitter)
21. What is Big Data? Why do we need Big Data? What are the challenges of Big Data?
22. What is HDFS? What is Pig? What is Hive used for?
Ans.
HDFS (Hadoop Distributed File System) is the primary data storage system used by Hadoop applications; it stores data in a distributed fashion across the nodes of the cluster.
Pig is an open-source, high-level data flow system. It provides a simple language called Pig Latin for queries and data manipulation, which are then compiled into MapReduce jobs that run on Hadoop.
Hive allows users to read, write, and manage petabytes of data using SQL. Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently store and process large datasets.
23. What are the 3 V's of Big Data? Explain with the help of two big data case studies.
24. How can we examine the Hive clients? Explain.
25. What is Hadoop API? Explain Hadoop API for MapReduce framework.