
BDA - UNIT-IV

Big data technology landscape: two important technologies, NoSQL and Hadoop

Topics Covered:

1. Distributed computing challenges
2. NoSQL
3. Hadoop: consisting of HDFS and MapReduce
   3.1. History of Hadoop
   3.2. Hadoop overview
   3.3. Use cases of Hadoop
   3.4. Hadoop distributors
4. HDFS
   4.1. HDFS daemons: NameNode, DataNode, Secondary NameNode
   4.2. File read, file write, replica processing of data with Hadoop
   4.3. Managing resources and applications with Hadoop YARN

Distributed computing challenges

1. In a distributed system, since several servers are networked together, there can be hardware
failures. Example: a hard disk failure creates a data retrieval problem.
2. In a distributed system the data is spread across several machines.
How do we integrate it prior to processing?
Solution: two important technologies, NoSQL and Hadoop, which we study in this unit.

NoSQL

RDBMSs
• MySQL is the world's most used RDBMS, and runs as a server providing multi-user access to
a number of databases.
• The Oracle Database is an object-relational database management system (ORDBMS).
• The main difference between Oracle and MySQL is the fact that MySQL is open source,
while Oracle is not.
• SQL stands for Structured Query Language. It is a standard language for accessing and
manipulating databases.
• SQL Server, Oracle, Informix, Postgres, etc. are RDBMSs.

Introduction to NoSQL
• It is a distributed database model, while Hadoop is not a database (Hadoop is a framework).
• NoSQL is open source, non-relational, and scalable.
• There are several databases which follow this NoSQL model.
• NoSQL databases are used in big data and real-time web applications, and in social media.
• They do not restrict the data to adhere to any schema at the time of storage.

• They structure the unstructured input data into different formats, viz. key-value pairs,
document oriented, column oriented, and graph based data, besides structured data.
• They adhere to the CAP theorem and compromise on C (consistency) in favor of A (availability) and P (partition tolerance).
• They do not support the ACID properties of transactions (Atomicity, Consistency, Isolation, and
Durability).

Types of NoSQL databases

They can be broadly classified into:


1. Key-value or big hash table type: they maintain a big hash table of keys and values (see the
Java sketch after this list). Sample key-value pair:
   key          value
   first name   Robert
   last name    Williams
2. Document type: maintain data as a collection of documents. A document is the equivalent of a
record in an RDBMS, and a collection is the equivalent of a table in an RDBMS. Sample document:
   {"Book Name": "Fundamentals .. ",
    "Publisher": "Wiley India",
    "year": "2011"
   }
3. Column type: each storage block has data from only one column.
4. Graph type: also called a network database. A graph stores data in nodes.
   Sample graph: ID, name, and age are stored in each node.
   Arrows carry labels like "member", "member since 2002", "knows since 2002", etc.
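A minimal Java sketch of how the first two data models can be pictured in code, using plain in-memory maps rather than any particular NoSQL product; the class name and the field values are illustrative, taken from the samples above.

import java.util.HashMap;
import java.util.Map;

public class NoSqlModels {
    public static void main(String[] args) {
        // Key-value model: a big hash table of keys and values
        Map<String, String> kvStore = new HashMap<>();
        kvStore.put("firstName", "Robert");
        kvStore.put("lastName", "Williams");

        // Document model: a schema-less record, here modeled as a nested map
        Map<String, Object> bookDocument = new HashMap<>();
        bookDocument.put("Book Name", "Fundamentals ..");
        bookDocument.put("Publisher", "Wiley India");
        bookDocument.put("year", "2011");

        System.out.println(kvStore.get("firstName"));       // Robert
        System.out.println(bookDocument.get("Publisher"));  // Wiley India
    }
}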

Popular NoSQL databases

They can be broadly grouped into:

1. Key-value or big hash table type (some schema is followed):
   Amazon S3 (Dynamo), Scalaris, Redis, Riak
2. Schema-less (no schema, not even key-value):
   Column based: Cassandra, HBase
   Document based: Apache CouchDB, MongoDB, MarkLogic
   Graph based: Neo4j, HyperGraphDB

Advantages of NoSQL
• Dynamic schema: since it allows insertion of data without a predefined schema, it facilitates
application changes in real time, i.e., faster code development and integration and less database
administration.
• Auto sharding: it automatically spreads data across an arbitrary number of servers while
balancing the load and queries on the servers. If a server fails, it is replaced without
disruption.
• Replication: multiple copies of data are stored across the cluster and even across data centers.
This promises high availability and fault tolerance.
• Rapid and elastic scalability: allows scaling to the cloud with the following capacities:
  - Cluster scale: allows distribution of the database across more than 100 nodes among multiple data centers.
  - Performance scale: supports over 100,000 database read and write operations per second.
  - Data scale: supports storing more than 1 billion documents in the database.
• Cheap and easy to implement.
• Adheres to the CAP theorem; relaxes the consistency requirement.

Disadvantages of NoSQL
• Does not support joins
• No support for ACID
• No standard query language interface, except in the case of MongoDB and Cassandra (CQL)
• No easy integration with other applications that support SQL

NoSQL applications in industry

• Key-value pair type databases: used for shopping carts and web user data analysis (Amazon,
LinkedIn).
• Column type databases: used by Facebook, Twitter, eBay, Netflix.
• Document type databases: used for logging and archives management.
• Graph type databases: used in network modeling (e.g., Walmart).
• NoSQL vendors:
  1. Amazon (Dynamo): used by LinkedIn, Mozilla.
  2. Facebook (Cassandra): used by Netflix, Twitter, eBay, i.e., a column type database.
  3. Google (Bigtable): used by Adobe Photoshop.

NewSQL
• A database that has the same scalable performance as NoSQL, supports OLTP, and maintains the
ACID guarantees of a traditional database.
• It is a new class of RDBMS supporting the relational data model and using SQL as its interface.


ACID

• In databases, a transaction is a very small unit of a program and it may contain several
low-level tasks.
• A transaction in a database system must maintain Atomicity, Consistency, Isolation, and
Durability − commonly known as the ACID properties − in order to ensure accuracy,
completeness, and data integrity.
• For example, a transfer of funds from one bank account to another, even involving multiple
changes such as debiting one account and crediting another, is a single transaction.
• Atomicity, Consistency, Isolation, Durability (ACID) is a concept referring to a database
system's four transaction properties: atomicity, consistency, isolation and durability.
• These four properties describe the major guarantees of the transaction paradigm, which has
influenced many aspects of development in database systems.

Atomicity
• An atomic transaction is an indivisible and irreducible series of database operations such that
either all occur, or nothing occurs.
• Transactions are often composed of multiple statements.
• A guarantee of atomicity prevents updates to the database occurring only partially, which can
cause greater problems than rejecting the whole series outright.
• Atomicity guarantees that each transaction is treated as a single "unit", which either succeeds
completely, or fails completely:
• if any of the statements in a transaction fails to complete, the entire transaction fails and the
database is left unchanged.
• An atomic system must guarantee atomicity in each and every situation, including power
failures, errors and crashes.
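A minimal JDBC sketch of the funds-transfer example mentioned earlier, showing how atomicity is obtained by grouping the debit and the credit into one transaction; the accounts table, its column names, and the connection URL argument are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class FundTransfer {
    // Transfers 'amount' between two (hypothetical) account rows in one atomic transaction.
    public static void transfer(String jdbcUrl, int fromId, int toId, double amount) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false); // group both updates into a single transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setDouble(1, amount);
                debit.setInt(2, fromId);
                debit.executeUpdate();

                credit.setDouble(1, amount);
                credit.setInt(2, toId);
                credit.executeUpdate();

                conn.commit();   // both changes take effect together
            } catch (SQLException e) {
                conn.rollback(); // neither change is applied if anything fails
                throw e;
            }
        }
    }
}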

Consistency
• Consistency ensures that a transaction can only bring the database from one valid state
to another valid state, maintaining database invariants:
• any data written to the database must be valid according to all defined rules,
including constraints, cascades, triggers, and any combination thereof.
• This prevents database corruption by an illegal transaction, but does not guarantee that a
transaction is correct.

Isolation
• Transactions are often executed concurrently (e.g., reading and writing to multiple tables at
the same time)
• Isolation ensures that concurrent execution of transactions leaves the database in the same
state that would have been obtained if the transactions were executed sequentially.
• Isolation is the main goal of concurrency control;
• depending on the method used, the effects of an incomplete transaction might not even be
visible to other transactions.
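In JDBC, the isolation guarantee can be tightened or relaxed per connection; a minimal sketch, assuming a hypothetical PostgreSQL connection URL.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class IsolationDemo {
    public static void main(String[] args) throws SQLException {
        // Hypothetical JDBC URL; any RDBMS driver will do
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/demo")) {
            // Strongest level: concurrent transactions behave as if executed sequentially
            conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
            // Weaker levels (e.g. TRANSACTION_READ_COMMITTED) trade isolation for concurrency
            System.out.println(conn.getTransactionIsolation());
        }
    }
}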

Durability
• Durability guarantees that once a transaction has been committed, it will remain committed
even in the case of a system failure (e.g., power outage or crash).
• This usually means that completed transactions (or their effects) are recorded in non-volatile
memory

3. Hadoop:
   3.1. History of Hadoop
   3.2. Hadoop overview
   3.3. Use cases of Hadoop
   3.4. Hadoop distributors
4. HDFS:
   4.1. HDFS daemons: NameNode, DataNode, Secondary NameNode
   4.2. File read, file write, replica processing of data with Hadoop
   4.3. Managing resources and applications with Hadoop YARN

Hadoop overview

Hadoop is used for:
1. massive data storage
2. faster data processing

Key aspects:
1. Open-source software (OSS)
2. Framework: programs, tools, etc. are provided to develop and execute applications. It is not a
database like NoSQL.
3. Distributed: data is distributed across multiple computers and processed in parallel.
4. Massive data storage and faster processing.

Hadoop distributors
• The following companies supply Hadoop products:
• Cloudera, Hortonworks, MapR, Apache Hadoop

4. HDFS
HDFS is one of the two core components of Hadoop, the second being MapReduce.
4.1. HDFS daemons: NameNode, DataNode, Secondary NameNode
4.2. File read, file write, replica processing of data with Hadoop
4.3. Managing resources and applications with Hadoop YARN

HDFS daemons
1. NameNode
• There is a single NameNode per cluster.
• It manages file-related operations like read, write, create and delete.
• The NameNode stores the HDFS namespace.
• It manages the file system namespace, which is the collection of files in the cluster.
• The file system namespace includes the mapping of blocks to files and file properties, and is
stored in a file called FsImage.
• It uses an EditLog to record every transaction.
• A rack is a collection of DataNodes within a cluster.
• The NameNode uses a rackID to identify the DataNodes in a rack.
• When the NameNode starts, it reads the FsImage and EditLog from disk and applies all transactions
from the EditLog to the in-memory FsImage.
• Then it flushes out a new version of the FsImage to disk and truncates the old EditLog, because its
changes are now reflected in the FsImage.

2. DataNode
• There are multiple DataNodes per cluster.
• During pipeline reads and writes, DataNodes communicate with each other.
• A DataNode also sends heartbeat messages to the NameNode to ensure connectivity between the
NameNode and the DataNodes.
• If no heartbeat is received, the NameNode replicates the blocks stored on that DataNode onto other
DataNodes within the cluster and keeps running.

3. Secondary NameNode
• It takes a snapshot of the HDFS metadata at intervals specified in the configuration.
• It occupies the same memory size as the NameNode.
• Therefore, the two are run on different machines.
• In case of NameNode failure, the Secondary NameNode can be configured to take over.

File read, file write, replica processing of data with Hadoop

File read:
1. The client opens the file it wants to read by calling open() on the DistributedFileSystem (DFS).
2. DFS communicates with the NameNode to get the locations of the data blocks.
3. The NameNode returns the addresses of the DataNodes containing the data blocks.
4. DFS returns an FSDataInputStream to the client.
5. The client calls read() on the FSDataInputStream, which holds the addresses of the DataNodes
for the first few blocks of the file, and connects to the nearest DataNode for the first block.
6. The client calls read() repeatedly to stream the data from the DataNode.
7. When the end of a block is reached, FSDataInputStream closes the connection with that DataNode.
8. It repeats these steps to find the best DataNode for the next block.
9. The client calls close() when it has finished reading.
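The read path above corresponds roughly to the following use of the HDFS Java API; a minimal sketch, with the class name and the file path /user/demo/input.txt being hypothetical examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                // DistributedFileSystem when fs.defaultFS points at HDFS
        FSDataInputStream in = fs.open(new Path("/user/demo/input.txt")); // open() asks the NameNode for block locations
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);  // read() streams data from the nearest DataNodes
        } finally {
            IOUtils.closeStream(in);                         // close() releases the DataNode connections
        }
    }
}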

File write:
1. The client calls create() to create the file.
2. An RPC call is initiated to the NameNode.
3. The NameNode creates the file after a few checks.
4. DFS returns an FSDataOutputStream for the client to write to.
5. As the client writes data, the data is split into packets, which are then written to a data queue.
6. The DataStreamer asks the NameNode to allocate blocks by selecting a list of suitable DataNodes
for storing the replicas (3 by default).
7. This list of DataNodes makes a pipeline, with 3 nodes in the pipeline for the first block.
8. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each
packet and forwards it to the other DataNodes in the pipeline.
9. DFSOutputStream manages an "ack queue" of packets that are waiting for acknowledgement; a
packet is removed from the queue only when it has been acknowledged by all the DataNodes in the pipeline.
10. When the client finishes writing the file, it calls close() on the stream.
11. This flushes all the remaining packets to the DataNode pipeline and waits for
acknowledgements before contacting the NameNode to signal that the creation of the file is complete.

Replica processing of data with Hadoop


• Replica placement strategy:
• By default, 3 replicas are created for each block.
  - The 1st replica is placed on the same node as the client.
  - The 2nd replica is placed on a node in a different rack.
  - The 3rd replica is placed on the same rack as the second, but on a different node in that rack.
• Then a data pipeline is built. The client application writes a block to the 1st DataNode in the
pipeline.
• Next, this DataNode takes over and forwards the data to the next node in the pipeline.
• This process continues for all the data blocks.
• Subsequently, all the data blocks are written to disk.
• The client application need not track all blocks of data. HDFS directs the client to the
nearest replica.
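The replication factor described above can also be influenced from the client side; a minimal sketch, assuming the standard dfs.replication property and a hypothetical file path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);   // default replication factor for files created by this client
        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing (hypothetical) file to 2
        fs.setReplication(new Path("/user/demo/output.txt"), (short) 2);
    }
}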

Why Hadoop 2.x?

• Because of the following limitations of Hadoop 1.0:
• In Hadoop 1.0, HDFS and MapReduce are the core components, while other components are built around them.
1. A single NameNode handles the entire namespace of a cluster. It keeps all its file metadata in main
memory. This puts a limit on the number of objects the NameNode can store.
2. Restricted to processing batch-oriented MapReduce jobs.
3. MapReduce is used for both cluster resource management and data processing; it is not suitable for
interactive analysis.
4. Hadoop 1.0 is not suitable for machine learning, graphs and other memory-intensive algorithms.
5. Map slots may become full while reduce slots are empty, and vice versa, leading to inefficient
resource utilization.

HDFS 2, used in Hadoop 2.0, consists of 2 major components:
1. Namespace service: takes care of file-related operations (create, read, write).
2. Block storage service: handles DataNode cluster management and replication.

HDFS 2 uses:
1. Multiple independent NameNodes: the DataNodes are common storage shared by all NameNodes, and
all DataNodes register with every NameNode in the cluster.
2. A passive standby NameNode.

Managing resources and applications with Hadoop YARN

• YARN is a sub-project of Hadoop 2.x.
• It is a general-purpose processing platform.
• YARN is not constrained to MapReduce alone.
• Multiple applications can be run in Hadoop 2.x, with all applications sharing common resource
(memory, CPU, network, etc.) management.
• With YARN, Hadoop can do not only batch processing but also interactive, online, streaming,
graph and other types of processing.

Daemons of YARN
1. Global ResourceManager: distributes resources among the various applications. It has 2
components:
   Scheduler: decides the allocation of resources to running applications; it does no monitoring.
   ApplicationsManager: accepts jobs and negotiates resources for executing the ApplicationMaster,
   which is specific to an application.
• 2. NodeManager: monitors the usage of resources and reports the usage to the Global Resource
Manager. It launches 'application containers' for the execution of applications.
• Every machine runs one NodeManager.

• 3. Per-application ApplicationMaster: every application has one. It negotiates the required
resources for execution from the ResourceManager, and works along with the NodeManager to
execute and monitor the component tasks.

• An application is a job submitted to the framework, e.g., a MapReduce job.

• A container is the basic unit of resource allocation across multiple resource types, e.g.:
  container_0 = 2 GB memory, 1 CPU
  container_1 = 1 GB memory, 6 CPUs
  Containers replace the fixed map/reduce slots of Hadoop 1.x.
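In the YARN Java API a container's resource demand is expressed as a Resource object; a minimal sketch of the two example containers above (memory is given in MB, CPUs as virtual cores), with the class name being illustrative.

import org.apache.hadoop.yarn.api.records.Resource;

public class ContainerSpecs {
    public static void main(String[] args) {
        // container_0 ~ 2 GB memory, 1 virtual core
        Resource container0 = Resource.newInstance(2048, 1);
        // container_1 ~ 1 GB memory, 6 virtual cores
        Resource container1 = Resource.newInstance(1024, 6);
        System.out.println(container0 + " / " + container1);
    }
}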

YARN architecture: steps

1. The client program submits the application, which contains the specifications needed to launch
the application-specific ApplicationMaster.
2. The ResourceManager launches the ApplicationMaster by assigning it a container.
3. The ApplicationMaster registers with the ResourceManager so that the client can query the
ResourceManager for details.
4. The ApplicationMaster negotiates appropriate resource containers via the resource-request
protocol.
5. After container allocation, the ApplicationMaster launches the container by providing the
specification to the NodeManager.
6. The NodeManager executes the application code and provides status to the ApplicationMaster via
an application-specific protocol.
7. On completion of the application, the ApplicationMaster deregisters with the ResourceManager and
shuts down; its containers can then be reused.
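Step 1 above (the client asking for an application) is done in code through the YarnClient API; a minimal sketch that only creates an application and obtains its id, leaving out the ApplicationMaster container specification needed for an actual submission. The class name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Step 1: ask the ResourceManager for a new application
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationId appId = app.getApplicationSubmissionContext().getApplicationId();
        System.out.println("Allocated application id: " + appId);

        // A real client would now fill in the ApplicationSubmissionContext
        // (application name, ApplicationMaster container launch spec, resources)
        // and call yarnClient.submitApplication(...), which triggers step 2.

        yarnClient.stop();
    }
}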
