BDA - Unit 4
Big data technology landscape: two important technologies, NoSQL and Hadoop
Topics Covered:
1. In a distributed system, since several servers are networked together, hardware failures can occur.
Example: a hard disk failure creates a data-retrieval problem.
2. In a distributed system the data is spread across several machines.
How do we integrate the data before processing it?
Solution: two important technologies, NoSQL and Hadoop, which we study in this unit.
NoSQL
RDBMSs
• MySQL is the world's most used RDBMS, and runs as a server providing multi-user access to
a number of databases.
• The Oracle Database is an object-relational database management system (ORDBMS).
• The main difference between Oracle and MySQL is that MySQL is open source, while Oracle is not.
• SQL stands for Structured Query Language. It's a standard language for accessing and
manipulating databases
• SQL Server, Oracle, Informix, Postgres, etc. are RDBMSs.
Introduction to NoSQL
• NoSQL is a distributed database model, while Hadoop is not a database (Hadoop is a framework).
• NoSQL is open source, non-relational, and scalable.
• Several databases follow this NoSQL model.
• NoSQL databases are used in big data and real-time web applications, and in social media.
• They do not restrict the data to adhere to any schema at the time of storage.
• They structure the unstructured input data into different formats: key-value pairs, document oriented, column oriented, and graph based, besides structured data (see the sketch after this list).
• They adhere to the CAP theorem and compromise on C (consistency) in favor of A (availability) and P (partition tolerance).
• They do not support the ACID properties of transactions (Atomicity, Consistency, Isolation, and Durability).
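To make the data-model bullet concrete, here is a minimal sketch using plain Java collections to contrast a key-value record with a document-style record. The field names and values are invented for illustration, and no particular NoSQL product's API is shown.

```java
import java.util.List;
import java.util.Map;

public class NoSqlShapes {
    public static void main(String[] args) {
        // Key-value: an opaque value looked up by key (as in Redis or Riak).
        Map<String, String> kv = Map.of("user:42", "{\"name\":\"Asha\",\"city\":\"Hyderabad\"}");

        // Document-oriented: nested fields, no fixed schema (as in MongoDB or CouchDB).
        Map<String, Object> doc = Map.of(
                "_id", 42,
                "name", "Asha",
                "orders", List.of(Map.of("item", "book", "qty", 2)));

        // Column-oriented stores group values by column family instead (e.g. Cassandra, HBase);
        // graph stores keep nodes and edges (e.g. Neo4j).
        System.out.println(kv.get("user:42"));
        System.out.println(doc.get("orders"));
    }
}
```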
Advantages of NoSQL
• Dynamic schema: data can be inserted without a predefined schema, which facilitates application changes in real time, i.e., faster code development and integration and less database administration.
• Auto sharding: it automatically spreads data across an arbitrary number of servers while balancing the load and the queries on those servers. If a server fails, the server is replaced without disruption (see the sharding sketch after this list).
• Replication: multiple copies of data are stored across the cluster and even across data centers.
This promises high availability and fault tolerance
• Rapid and elastic scalability: allows scaling to the cloud with the following capacities:
Cluster scale: allows distribution of the database across more than 100 nodes among multiple data centers.
Performance scale: supports more than 100,000 database read and write operations per second.
Data scale: supports storing more than 1 billion documents in the database.
• Cheap and easy to implement
• Adheres to the CAP theorem; relaxes the consistency requirement
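The auto-sharding bullet above can be illustrated with a toy hash-based router. This is a simplified sketch: the server names are made up, and production stores typically use consistent hashing or range partitioning rather than simple modulo hashing, precisely so that adding a server does not reshuffle most keys.

```java
public class ShardRouter {
    private final String[] servers;

    public ShardRouter(String[] servers) {
        this.servers = servers;
    }

    // Route a key to a server by hashing it. Real stores use consistent
    // hashing so that adding or removing a server moves only a few keys.
    public String serverFor(String key) {
        int idx = Math.floorMod(key.hashCode(), servers.length);
        return servers[idx];
    }

    public static void main(String[] args) {
        ShardRouter r = new ShardRouter(new String[] {"node-a", "node-b", "node-c"});
        System.out.println(r.serverFor("user:42"));  // the same key always lands on the same node
    }
}
```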
Disadvantages of NoSQL
• Does not support joins
• No support for ACID
• No standard query-language interface, except in the case of MongoDB and Cassandra (CQL)
• No easy integration with other applications that support SQL
NewSQL
• A database that has the same scalable performance as NoSQL, supports OLTP, and maintains the ACID guarantees of a traditional database.
• It is a new class of RDBMS that supports the relational data model and uses SQL as its interface.
ACID
Atomicity
• An atomic transaction is an indivisible and irreducible series of database operations such that either all occur, or nothing occurs.
• Transactions are often composed of multiple statements.
• A guarantee of atomicity prevents updates to the database occurring only partially, which can cause greater problems than rejecting the whole series outright.
• Atomicity guarantees that each transaction is treated as a single "unit", which either succeeds completely or fails completely: if any of the statements in a transaction fails to complete, the entire transaction fails and the database is left unchanged.
• An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes.
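A minimal JDBC sketch of atomicity, assuming a PostgreSQL database at a placeholder URL and an `accounts` table, both of which are inventions for illustration: the two updates of a funds transfer either commit together or roll back together.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class AtomicTransfer {
    public static void main(String[] args) throws SQLException {
        // The connection URL, credentials, and `accounts` table are assumptions.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/bank", "user", "pass")) {
            conn.setAutoCommit(false);  // group the statements into one transaction
            try (Statement st = conn.createStatement()) {
                st.executeUpdate("UPDATE accounts SET balance = balance - 100 WHERE id = 1");
                st.executeUpdate("UPDATE accounts SET balance = balance + 100 WHERE id = 2");
                conn.commit();          // both updates become visible together
            } catch (SQLException e) {
                conn.rollback();        // on any failure, neither update survives
                throw e;
            }
        }
    }
}
```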
Consistency
• Consistency ensures that a transaction can only bring the database from one valid state
to another valid state, maintaining database invariants:
• any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof (see the example below).
• This prevents database corruption by an illegal transaction, but it does not guarantee that a transaction is correct.
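A small sketch of a consistency rule in action, using an in-memory H2 database (the H2 driver on the classpath, and the `accounts` table with its CHECK constraint, are assumptions for illustration): the database rejects any write that would violate the constraint, so the table can never enter an invalid state.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class ConsistencyDemo {
    public static void main(String[] args) throws SQLException {
        // In-memory H2 database; the table and constraint are illustrative assumptions.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
             Statement st = conn.createStatement()) {
            st.executeUpdate("CREATE TABLE accounts (id INT PRIMARY KEY, balance INT CHECK (balance >= 0))");
            st.executeUpdate("INSERT INTO accounts VALUES (1, 50)");
            try {
                // Violates the CHECK constraint, so the database rejects the
                // update and the table stays in a valid state.
                st.executeUpdate("UPDATE accounts SET balance = -10 WHERE id = 1");
            } catch (SQLException e) {
                System.out.println("Rejected: " + e.getMessage());
            }
        }
    }
}
```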
Isolation
• Transactions are often executed concurrently (e.g., reading and writing to multiple tables at
the same time)
• Isolation ensures that concurrent execution of transactions leaves the database in the same
state that would have been obtained if the transactions were executed sequentially.
• Isolation is the main goal of concurrency control;
• depending on the method used, the effects of an incomplete transaction might not even be visible to other transactions (the snippet below shows how the isolation level is chosen).
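In JDBC the isolation level is selected per connection; this sketch (with a placeholder connection URL and credentials) requests the strictest standard level, SERIALIZABLE.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class IsolationDemo {
    public static void main(String[] args) throws SQLException {
        // URL and credentials are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/bank", "user", "pass")) {
            // SERIALIZABLE makes concurrent transactions behave as if they
            // ran one after another; weaker levels trade safety for speed.
            conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
            System.out.println(conn.getTransactionIsolation());
        }
    }
}
```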
Durability
• Durability guarantees that once a transaction has been committed, it will remain committed
even in the case of a system failure (e.g., power outage or crash).
• This usually means that completed transactions (or their effects) are recorded in non-volatile
memory
4.1. Hadoop: history of Hadoop, Hadoop overview, use case of Hadoop, Hadoop distributors
4.2. HDFS: HDFS daemons (NameNode, DataNode, Secondary NameNode); file read, file write, replica placement; processing of data with Hadoop
4.3. Managing resources and applications with Hadoop YARN
1. Hadoop overview
Hadoop distributors
• The following supply Hadoop products: Cloudera, Hortonworks, MapR, and the Apache Hadoop project itself.
4. HDFS
HDFS is one of the two core components of Hadoop, the second being MapReduce.
HDFS daemons
1. NameNode:
• There is a single namenode per cluster
• It manages file-related operations such as read, write, create, and delete.
• The namenode stores the HDFS namespace.
• It manages the file system namespace, which is the collection of files in the cluster.
• The file system namespace includes the mapping of blocks to files and the file properties, and is stored in a file called FsImage.
• It uses an EditLog to record every transaction.
• A rack is a collection of datanodes within a cluster.
• The namenode uses a rackID to identify the datanodes in a rack.
• When the namenode starts, it reads the FsImage and EditLog from disk and applies all the transactions from the EditLog to the in-memory FsImage.
• It then flushes out a new version of the FsImage to disk and truncates the old EditLog, because its changes have now been applied to the FsImage (a toy model of this checkpoint follows).
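The FsImage/EditLog interplay can be modelled in a few lines. This is only a toy illustration of the checkpointing idea described above, not the actual NameNode code; the operation names and map types are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: the namespace snapshot (FsImage) is a map and the EditLog is a
// list of logged operations that is replayed and then truncated.
public class NamespaceCheckpoint {
    private final Map<String, String> fsImage = new HashMap<>();  // path -> metadata
    private final List<String[]> editLog = new ArrayList<>();     // [op, path, value]

    void record(String op, String path, String value) {
        editLog.add(new String[] {op, path, value});  // every transaction is logged first
    }

    void checkpoint() {
        for (String[] e : editLog) {                  // replay the EditLog into the FsImage
            if (e[0].equals("create")) fsImage.put(e[1], e[2]);
            else if (e[0].equals("delete")) fsImage.remove(e[1]);
        }
        editLog.clear();                              // truncate the old EditLog
    }

    public static void main(String[] args) {
        NamespaceCheckpoint ns = new NamespaceCheckpoint();
        ns.record("create", "/data/file1", "3 blocks");
        ns.checkpoint();
        System.out.println(ns.fsImage);
    }
}
```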
2. DataNode:
• There are multiple datanodes per cluster.
• During pipeline reads and writes, datanodes communicate with each other.
• A datanode also sends heartbeat messages to the namenode to ensure connectivity between the namenode and the datanodes.
• If no heartbeat is received, the namenode re-replicates that datanode's blocks within the cluster and keeps running.
3. Secondary NameNode
• It takes a snapshot of the HDFS metadata at intervals specified in the configuration.
• It occupies the same memory size as the namenode.
• Therefore the two are run on different machines.
• In case of namenode failure, the secondary namenode can be configured to take over.
File write
1. The client calls create() to create a file.
2. An RPC call is initiated to the namenode.
3. The namenode creates the file after a few checks.
4. FSDataOutputStream returns the stream for the client to write on.
5. As the client writes data, the data is split into packets, which are written to a data queue.
6. The DataStreamer asks the namenode to allocate blocks by selecting a list of suitable datanodes for storing the replicas (3 by default).
7. This list of datanodes makes a pipeline, with 3 nodes in the pipeline for the 1st block.
8. The DataStreamer streams the packets to the 1st datanode in the pipeline, which stores each packet and then forwards it to the other datanodes in the pipeline.
9. DFSOutputStream manages an "ack queue" of packets that are waiting for acknowledgement; a packet is removed from the queue only when it has been acknowledged by all the datanodes in the pipeline.
10. When the client finishes writing the file, it calls close() on the stream.
11. This flushes all the remaining packets to the datanode pipeline and waits for the acknowledgements before contacting the namenode to inform it that the creation of the file is complete.
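From the client's side, the whole sequence above is hidden behind the standard Hadoop FileSystem API. A minimal write sketch, assuming a reachable cluster and an example path of my choosing:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // create() triggers the RPC to the namenode described in steps 1-3; the
        // returned stream splits the data into packets for the datanode pipeline.
        Path file = new Path("/user/demo/hello.txt");  // example path, an assumption
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }  // close() flushes the remaining packets and confirms completion (steps 10-11)
    }
}
```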
Replica placement
• The 1st replica is placed on the same node as the client (or on a randomly chosen node when the client is outside the cluster).
• The 2nd replica is placed on a node in a different rack.
• The 3rd replica is placed on the same rack as the second, but on a different node in the rack.
• Then a data pipeline is built. The client application writes a block to the 1st datanode in the pipeline; this datanode then takes over and forwards the data to the next node in the pipeline.
• This process continues for all the data blocks.
• Subsequently all the data blocks are written to the disk.
• The client application need not track all the blocks of data. For reads, HDFS directs the client to the nearest replica (see the read example below).
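A matching read sketch with the same assumed path: the client only opens the file, and HDFS picks the nearest replica of each block. The final line shows that the replication factor can also be changed per file (the value 2 is arbitrary).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/hello.txt");  // example path, an assumption

        // The client just opens the file; HDFS locates the block replicas and
        // serves each block from the nearest datanode.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        // The replication factor can be changed per file (the default is 3).
        fs.setReplication(file, (short) 2);
    }
}
```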
Daemons of YARN
1. Global ResourceManager: distributes resources among the various applications. It has 2 components:
Scheduler: decides the allocation of resources to the running applications; it does no monitoring.
ApplicationsManager: accepts jobs and negotiates the resources for executing the ApplicationMaster, which is specific to an application.
• 2. NodeManager: it monitors the usage of resources and reports the usage to the global ResourceManager. It launches 'application containers' for the execution of applications.
• Every machine runs one NodeManager.
• 3. Per-application ApplicationMaster: every application has one. It negotiates the resources required for execution from the ResourceManager, and it works along with the NodeManagers to execute and monitor the component tasks (a minimal client-side sketch follows).
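As a small client-side illustration of the ResourceManager/NodeManager split, this sketch uses the YarnClient API from hadoop-yarn-client to ask the ResourceManager for its running NodeManagers; it assumes a reachable cluster whose address comes from yarn-site.xml.

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodes {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());  // reads yarn-site.xml for the ResourceManager address
        yarn.start();

        // Ask the ResourceManager for the NodeManagers it currently knows about.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " containers=" + node.getNumContainers());
        }
        yarn.stop();
    }
}
```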