DIGITAL NOTES
ON
BIG DATA ANALYTICS
PREPARED BY
K.RAMANA REDDY
Dr.A.MUMMOORTHY
K.SWETHA
COURSE OBJECTIVES:
The objectives of this course are
1. To learn the need of Big Data and the various challenges involved, and to acquire
knowledge about different analytical architectures.
2. To understand Hadoop Architecture and its ecosystems.
3. To understand the Hadoop Ecosystem and acquire knowledge about NoSQL
databases.
4. To acquire knowledge about the NewSQL, MongoDB and Cassandra databases.
5. To understand the processing of Big Data with advanced architectures like Spark.
UNIT – I
Introduction to big data: Data, Characteristics of data and Types of digital data:
Unstructured, Semi-structured and Structured - Sources of data. Big Data Evolution -
Definition of big data-Characteristics and Need of big data-Challenges of big data. Big
data analytics, Overview of business intelligence.
UNIT – II
Big data technologies and Databases: Hadoop – Requirement of Hadoop
Framework - Design principle of Hadoop - Comparison with other systems: SQL and
RDBMS - Hadoop Components – Architecture - Hadoop 1 vs Hadoop 2.
UNIT – III
MapReduce and YARN framework: Introduction to MapReduce , Processing data
with Hadoop using MapReduce, Introduction to YARN, Architecture, Managing
Resources and Applications with Hadoop YARN.
Big data technologies and Databases: NoSQL: Introduction to NoSQL - Features
and Types- Advantages & Disadvantages -Application of NoSQL.
UNIT - IV
New SQL: Overview of New SQL - Comparing SQL, NoSQL and NewSQL.
Mongo DB: Introduction – Features – Data types – Mongo DB Query language –
CRUD operations – Arrays – Functions: Count – Sort – Limit – Skip – Aggregate – Map
Reduce. Cursors – Indexes – Mongo Import – Mongo Export.
Cassandra: Introduction – Features – Data types – CQLSH – Key spaces – CRUD
operations – Collections – Counter – TTL – Alter commands – Import and Export –
Querying System tables.
UNIT - V
(Big Data Frame Works for Analytics) Hadoop Frame Work: Map Reduce
Programming: I/O formats, Map side join-Reduce Side Join-Secondary Sorting-
Pipelining MapReduce jobs
TEXT BOOKS:
1. Seema Acharya and Subhashini Chellappan, “Big Data and Analytics”, Wiley India
Pvt. Ltd., 2016.
2. Mike Frampton, “Mastering Apache Spark”, Packt Publishing, 2015.
REFERENCE BOOKS:
1. Tom White, “Hadoop: The Definitive Guide”, O'Reilly, 4th Edition, 2015.
2. Mohammed Guller, “Big Data Analytics with Spark”, Apress, 2015
3. Donald Miner, Adam Shook, “Map Reduce Design Pattern”, O'Reilly, 2012
COURSE OUTCOMES:
On successful completion of the course, students will be able to
1. Demonstrate knowledge of Big Data, Data Analytics, challenges and their solutions
in Big Data.
2. Analyze the Hadoop Framework and its ecosystems.
3. Analyze MapReduce and YARN, and work in a NoSQL environment.
4. Work in a NewSQL environment, MongoDB and Cassandra.
5. Apply Big Data processing using MapReduce programming in both the Hadoop and
Spark frameworks.
INDEX
UNIT ‐ I:
Introduction to big data: Data, Characteristics of data and Types of digital data:
Unstructured, Semi- structured and Structured - Sources of data. Big Data Evolution
-Definition of big data-Characteristics and Need of big data-Challenges of big data.
Big data analytics, Overview of business intelligence.
For example, data might include individual prices, weights, addresses, ages, names,
temperatures, dates, or distances.
1. Accuracy
Data should be sufficiently accurate for the intended use and should be captured only
once, although it may have multiple uses. Data should be captured at the point of
activity.
2. Validity
Data should be recorded and used in compliance with relevant requirements, including
the correct application of any rules or definitions. This will ensure consistency between
periods and with similar organizations, measuring what is intended to be measured.
3. Reliability
Data should reflect stable and consistent data collection processes across collection
points and over time. Progress toward performance targets should reflect real changes
rather than variations in data collection approaches or methods. Source data is clearly
identified and readily available from manual, automated, or other systems and records.
4. Timeliness
Data should be captured as quickly as possible after the event or activity and must be
available for the intended use within a reasonable time period. Data must be available
quickly and frequently enough to support information needs and to influence service
or management decisions.
5. Relevance
Data captured should be relevant to the purposes for which it is to be used. This will
require a periodic review of requirements to reflect changing needs.
6. Completeness
Data requirements should be clearly specified based on the information needs of the
organization and data collection processes matched to these requirements.
Structured Data:
Structured data refers to any data that resides in a fixed field within a record or
file.
Having a particular Data Model.
Meaningful data.
Data arranged in rows and columns.
Structured data has the advantage of being easily entered, stored, queried and
analysed.
E.g.: relational databases, spreadsheets.
Structured data is often managed using Structured Query Language (SQL)
Unstructured Data:
Unstructured data cannot be readily classified and does not fit into a neat box.
Also called unclassified data.
It does not conform to any data model.
Business rules are not applied.
Indexing is not required.
New York Stock Exchange: The New York Stock Exchange is an example of Big
Data; it generates about one terabyte of new trade data per day.
Social Media: Statistics show that 500+ terabytes of new data get ingested into
the databases of the social media site Facebook every day. This data is mainly
generated in terms of photo and video uploads, message exchanges, posting
comments, etc.
Jet engine: A single jet engine can generate 10+ terabytes of data in 30 minutes of
flight time. With many thousands of flights per day, data generation reaches many
petabytes.
Volume:
The name Big Data itself is related to an enormous size. Big Data is a vast ‘volume’ of
data generated from many sources daily, such as business processes, machines,
social media platforms, networks, human interactions, and many more.
Variety:
Big Data can be structured, unstructured, or semi-structured, collected from different
sources. In the past, data was collected only from databases and spreadsheets, but
these days data comes in many forms: PDFs, emails, audio, social media posts,
photos, videos, etc.
Veracity
Veracity refers to how reliable the data is. There are many ways to filter or translate
the data; veracity is about being able to handle and manage data efficiently.
Big Data is also essential in business development.
Value
Value is an essential characteristic of big data. It is not just any data that we process or
store that matters; it is valuable and reliable data that we store, process, and analyze.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to
the speed at which data is created in real time. It covers the speed of incoming data
sets, their rate of change, and bursts of activity. A primary aspect of Big Data is to
provide demanded data rapidly.
Big data velocity deals with the speed at which data flows from sources like application
logs, business processes, networks, social media sites, sensors, mobile
devices, etc.
Companies are using Big Data to learn what their customers want, who their best
customers are, and why people choose different products. The more a company knows
about its customers, the more competitive it becomes.
We can use it with Machine Learning for creating market strategies based on
predictions about customers. Leveraging big data makes companies customer-centric.
Companies can use historical and real-time data to assess evolving consumers'
preferences. This in turn enables businesses to improve and update their
marketing strategies, which makes companies more responsive to customer needs.
Big Data's importance does not revolve around the amount of data a company has; its
importance lies in how the company utilizes the gathered data.
Every company uses its collected data in its own way. The more effectively a company
uses its data, the more rapidly it grows.
Companies in the present market need to collect and analyze data because:
1. Cost Savings
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to
businesses when they have to store large amounts of data. These tools help
organizations in identifying more effective ways of doing business.
2. Time-Saving
Real-time in-memory analytics helps companies to collect data from various sources.
Tools like Hadoop help them analyze data immediately, thus helping them make quick
decisions based on the learnings.
If a company does not know what its customers want, its success will suffer. It
will result in the loss of clientele, which has an adverse effect on business growth.
Big data analytics helps businesses to identify customer related trends and patterns.
Customer behavior analysis leads to a profitable business.
To deal with this challenge, businesses use data integration software, ETL
software, and business intelligence software to map disparate data sources
into a common structure and combine them so they can generate accurate
reports.
This also means validating data sources against what you expect them to be and cleaning up
corrupted and incomplete data sets. Data quality software can also be used
specifically for the task of validating and cleaning your data before it is processed.
When your business begins a data project, start with goals in mind and strategies
for how you will use the data you have available to reach those goals. The team
involved in implementing a solution needs to plan the type of data they need and
the schemas they will use before they start building the system so the project
doesn't go in the wrong direction. They also need to create policies for purging
old data from the system once it is no longer useful.
There are a few ways to solve this problem. One is to hire a big data
specialist and have that specialist manage and train your data team until they
are up to speed. The specialist can either be hired on as a full-time employee or
as a consultant who trains your team and moves on, depending on your budget.
Another option, if you have time to prepare ahead, is to offer training to your
current team members so they will have the skills once your big data project is in
motion.
8. Organizational resistance
Another way people can be a challenge to a data project is when they resist
change. The bigger an organization is, the more resistant it is to change. Leaders
may not see the value in big data, analytics, or machine learning. Or they may
simply not want to spend the time and money on a new project.
This can be a hard challenge to tackle, but it can be done. You can start with a
smaller project and a small team and let the results of that project prove the value
of big data to other leaders and gradually become a data-driven business. Another
option is placing big data experts in leadership roles so they can guide your
business towards transformation.
BI supports fact-based decision making using historical data rather than assumptions
and gut feeling.
BI tools perform data analysis and create reports, summaries, dashboards, maps,
graphs, and charts to provide users with detailed intelligence about the nature of the
business.
Why is BI important?
Step 1) Raw data is extracted from corporate operational databases and other sources.
Step 2) The data is cleaned and transformed into the data warehouse. Tables can
be linked, and data cubes are formed.
Step 3) Using the BI system, the user can ask queries, request ad-hoc reports, or conduct
any other analysis.
1. Boost productivity
With a BI program, it is possible for businesses to create reports with a single click,
which saves lots of time and resources. It also allows employees to be more productive
at their tasks.
2. To improve visibility
BI also helps to improve the visibility of these processes and make it possible to
identify any areas which need attention.
3. Fix Accountability
A BI system assigns accountability in the organization, as there must be someone who
owns accountability and ownership for the organization's performance against
its set goals.
BI System Disadvantages
1. Cost:
Business intelligence can prove costly for small as well as medium-sized
enterprises. The use of such a system may be expensive for routine business
transactions.
2. Complexity:
Another drawback of BI is the complexity of implementing the data warehouse. It can
be so complex that it makes business techniques rigid to deal with.
3. Limited use
Like all improved technologies, BI was first established keeping in mind the
buying capacity of rich firms. Therefore, a BI system is not yet affordable for many
small and medium-sized companies.
UNIT ‐ II:
Big data technologies and Databases: Hadoop – Requirement of Hadoop
Framework - Design principle of Hadoop - Comparison with other systems: SQL and
RDBMS - Hadoop Components – Architecture - Hadoop 1 vs Hadoop 2.
There are three core components of Hadoop as mentioned earlier. They are HDFS,
MapReduce, and YARN. These together form the Hadoop framework architecture.
1. HDFS (Hadoop Distributed File System):
Features:
The storage is distributed to handle a large data pool
Distribution increases data security
It is fault-tolerant: if one copy of a block fails, other replicas can take over
2. MapReduce:
The MapReduce framework is the processing unit. All data is distributed and
processed parallelly. There is a MasterNode that distributes data amongst
SlaveNodes. The SlaveNodes do the processing and send it back to the MasterNode.
Features:
Consists of two phases, Map Phase and Reduce Phase.
Processes big data faster, with multiple nodes working in parallel
3. YARN (Yet Another Resource Negotiator):
Features:
It acts as an operating system for the jobs that run over the data stored on
HDFS
It helps to schedule tasks to avoid overloading any system
A Hadoop cluster consists of a single master and multiple slave nodes. The master
node includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the
slave node includes DataNode and TaskTracker.
NameNode
o It is a single master server that exists in the HDFS cluster.
o As it is a single node, it can become a single point of failure.
o It manages the file system namespace by executing operations such as
opening, renaming, and closing files.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file
system's clients.
o It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and to locate
the data to be processed by consulting the NameNode.
o In response, NameNode provides metadata to Job Tracker.
Task Tracker
o It works as a slave node for Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the data.
This process can also be called a Mapper.
MapReduce Layer
MapReduce processing begins when the client application submits a
MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the
appropriate Task Trackers. Sometimes a TaskTracker fails or times out. In such a
case, that part of the job is rescheduled.
Hadoop 1 vs Hadoop 2
1. Components:
Hadoop 1: HDFS (storage) and MapReduce (resource management and data processing).
Hadoop 2: HDFS (storage), YARN (resource management) and MapReduce (data processing).
2. Daemons:
Hadoop 1: NameNode, DataNode, Secondary NameNode, Job Tracker, Task Tracker.
Hadoop 2: NameNode, DataNode, Secondary NameNode, Resource Manager, Node Manager.
3. Working:
In Hadoop 1, HDFS is used for storage and, on top of it, MapReduce
works as both Resource Management and Data Processing. Because of this double
workload on MapReduce, performance is affected.
In Hadoop 2, HDFS is again used for storage, and on top of HDFS there is
YARN, which works as Resource Management. It basically
allocates the resources and keeps everything running.
4. Limitations:
Hadoop 2 also has a Master-Slave architecture, but it consists of multiple masters (i.e.
active namenodes and standby namenodes) and multiple slaves. If the active master node
crashes, a standby master node takes over. You can configure multiple
standby namenodes.
5. Ecosystem
Oozie is basically a Work Flow Scheduler. It decides the particular time for jobs to
execute according to their dependencies.
Pig, Hive and Mahout are data processing tools that work on top of
Hadoop.
Sqoop is used to import and export structured data. You can directly import and
export data between HDFS and SQL databases.
Flume is used to import and export the unstructured data and streaming data.
UNIT ‐ III:
MapReduce and YARN framework: Introduction to MapReduce , Processing data
with Hadoop using MapReduce, Introduction to YARN, Architecture, Managing
Resources and Applications with Hadoop YARN.
Big data technologies and Databases: NoSQL: Introduction to NoSQL - Features
and Types- Advantages & Disadvantages -Application of NoSQL.
3.1 Introduction to MapReduce in Hadoop
The input to each phase is key-value pairs. In addition, every programmer needs to
specify two functions: map function and reduce function.
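To make the idea concrete, here is a small, framework-free Python sketch of the classic word-count example, with made-up map_fn/reduce_fn names; it only simulates the map, shuffle and reduce phases and is not Hadoop API code:

from collections import defaultdict

def map_fn(line):
    # map: emit a (word, 1) key-value pair for every word in the input record
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(word, counts):
    # reduce: combine all values seen for one key
    return word, sum(counts)

lines = ["big data needs big tools", "hadoop processes big data"]

intermediate = defaultdict(list)
for line in lines:                      # map phase
    for word, one in map_fn(line):
        intermediate[word].append(one)  # shuffle/sort groups values by key

result = dict(reduce_fn(w, c) for w, c in intermediate.items())  # reduce phase
print(result)   # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'hadoop': 1, 'processes': 1}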
Let us understand more about MapReduce and its components. MapReduce majorly
has the following three Classes. They are,
Mapper Class
The first stage in data processing using MapReduce is the Mapper Class. Here,
the RecordReader processes each input record and generates the respective key-value
pair. Hadoop stores this intermediate Mapper output on the local disk.
Input Split
An Input Split is a logical chunk of the input data that is processed by a single Mapper.
RecordReader
It interacts with the Input Split and converts the obtained data into key-value
pairs.
Reducer Class
The Intermediate output generated from the mapper is fed to the reducer which
processes it and generates the final output which is then saved in the HDFS.
Driver Class
The Driver class configures the MapReduce job (the job name, the Mapper and Reducer
classes, and the input and output paths) and submits it to the Hadoop cluster.
YARN (Yet Another Resource Negotiator) takes Hadoop beyond plain Java MapReduce
programming and lets other applications such as HBase and Spark work on the same
data. Different YARN applications can co-exist on the same cluster, so MapReduce, HBase,
and Spark can all run at the same time, bringing great benefits for manageability and cluster
utilization.
Components Of YARN
JobTracker and TaskTracker were used in previous versions of Hadoop and were
responsible for handling resources and checking progress. However,
Hadoop 2.0 has a ResourceManager and NodeManager to overcome the shortcomings of
JobTracker and TaskTracker.
Scheduler
Application manager
a) Scheduler
The scheduler is responsible for allocating resources to the running applications.
The scheduler is a pure scheduler, which means it performs no monitoring or tracking
of the application, and it gives no guarantees about restarting failed tasks, whether they fail
due to application failure or hardware failure.
b) Application Manager
It manages running Application Masters in the cluster, i.e., it is responsible for starting
application masters and for monitoring and restarting them on different nodes in case
of failures.
One application master runs per application. It negotiates resources from the resource
manager and works with the node manager. It manages the application life cycle.
The AM acquires containers from the RM’s Scheduler before contacting the
corresponding NMs to start the application’s individual tasks.
3.5 NoSQL
NoSQL Database
The term NoSQL database refers to a non-SQL or non-relational database.
It provides a mechanism for the storage and retrieval of data other than the tabular relations
model used in relational databases. A NoSQL database doesn't use tables for storing
data. It is generally used to store big data and for real-time web applications.
Schema-free
NoSQL databases are either schema-free or have relaxed schemas
Do not require any sort of definition of the schema of the data
Offers heterogeneous structures of data in the same domain
Simple API
Offers easy-to-use interfaces for storing and querying data
APIs allow low-level data manipulation and selection methods
Text-based protocols are mostly used, with HTTP REST and JSON
Mostly no standards-based NoSQL query language is used
Web-enabled databases running as internet-facing services
Distributed
Multiple NoSQL databases can be executed in a distributed fashion
Offers auto-scaling and fail-over capabilities
Often the ACID concept can be sacrificed for scalability and throughput
Mostly no synchronous replication between distributed nodes; instead asynchronous
multi-master replication, peer-to-peer, or HDFS-style replication
Often provides only eventual consistency
Shared Nothing Architecture. This enables less coordination and higher
distribution.
Key-Value Pair Based
This is one of the most basic NoSQL database types. This kind of NoSQL database
is used as a collection, dictionary, associative array, etc. Key-value stores help the
developer to store schema-less data. They work well for shopping cart contents.
Redis, Dynamo, Riak are some NoSQL examples of key-value store DataBases. They
are all based on Amazon’s Dynamo paper.
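As a hedged illustration of the key-value model (not taken from the notes), the sketch below uses the Python redis-py client to hold shopping-cart contents; the key cart:1001 and the item codes are made-up names, and a Redis server is assumed to be running locally:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.hset("cart:1001", "sku-4711", 2)     # 2 units of item sku-4711 in cart 1001
r.hset("cart:1001", "sku-0815", 1)
print(r.hgetall("cart:1001"))          # {'sku-4711': '2', 'sku-0815': '1'}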
Column-based
Column-oriented databases work on columns and are based on BigTable paper by
Google. Every column is treated separately. Values of single column databases are
stored contiguously.
They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN
etc. as the data is readily available in a column.
Document-Oriented
Document-Oriented NoSQL DB stores and retrieves data as a key value pair but the
value part is stored as a document. The document is stored in JSON or XML formats.
The value is understood by the DB and can be queried.
In the diagram (a relational table on the left and a document database on the right),
the left side has rows and columns, while the right side has a document with a
structure similar to JSON. For a relational database, you have to know in advance what
columns you have, and so on. For a document database, however, you store data like a
JSON object; you do not need to define the structure in advance, which makes it flexible.
The document type is mostly used for CMS systems, blogging platforms, real-time
analytics and e-commerce applications. It should not be used for complex transactions that
require multiple operations or queries against varying aggregate structures.
Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular
document-oriented DBMS systems.
Graph-Based
A graph-type database stores entities as well as the relations amongst those entities. The
entity is stored as a node, with the relationships as edges. An edge gives a relationship
between nodes. Every node and edge has a unique identifier.
Graph-based databases are mostly used for social networks, logistics, and spatial data.
Neo4J, Infinite Graph, OrientDB, FlockDB are some popular graph-based databases.
1. Data Mining and Data Storage
Perhaps when a user wishes to mine a particular dataset from large amounts
of data, one can make use of NoSQL databases, to begin with. Data is the
building block of technology that has led mankind to such great heights.
Therefore, one of the most essential fields where NoSQL databases can be put
to use is data mining and data storage.
2. Social Media
Social media sites like Facebook and Instagram often approach open-source
NoSQL databases to extract data that helps them keep track of their users and
the activities going on around their platforms.
3. Software Development
The third application that we will be looking at is software development. Software
development requires extensive research on users and the needs of the masses that
are met through software development.
UNIT ‐ IV:
New SQL: Overview of New SQL - Comparing SQL, NoSQL and NewSQL.
Mongo DB: Introduction – Features – Data types – Mongo DB Query language –
CRUD operations – Arrays – Functions: Count – Sort – Limit – Skip – Aggregate – Map
Reduce. Cursors – Indexes – Mongo Import – Mongo Export.
Cassandra: Introduction – Features – Data types – CQLSH – Key spaces – CRUD
operations – Collections – Counter – TTL – Alter commands – Import and Export –
Querying System tables.
4.1 NewSQL
4.1.1 Introduction to NewSQL
NewSQL is a modern relational database system that bridges the gap between SQL
and NoSQL. NewSQL databases aim to scale and stay consistent.
NoSQL databases scale while standard SQL databases are consistent. NewSQL
attempts to produce both features and find a middle ground. As a result, the database
type solves the problems in big data fields.
What is NewSQL?
NewSQL is a unique database system that combines ACID compliance with horizontal
scaling. The database system strives to keep the best of both worlds. OLTP-
based transactions and the high performance of NoSQL combine in a single solution.
Enterprises expect high quality and data integrity on large data volumes. When either
becomes a problem, an enterprise typically chooses to:
Both solutions are expensive on both a software and hardware level. NewSQL strives
to improve these faults by creating consistent databases that scale.
Partitioning scales the database into units. Queries execute on many shards
and combine into a single result.
ACID properties preserve the features of RDBMS.
Secondary indexing results in faster query processing and information
retrieval.
High availability due to the database replication mechanism.
A built-in crash recovery mechanism delivers fault tolerance and minimizes
downtime.
4.2 MongoDB
4.2.1 Introduction to MongoDB
MongoDB is an open-source document database that provides high
performance, high availability, and automatic scaling.
In simple words, you can say that - Mongo DB is a document-oriented
database. It is an open source product, developed and supported by a company
named 10gen.
MongoDB is available under General Public license for free, and it is also
available under Commercial license from the manufacturer.
Data Types and Descriptions:
String – String is the most commonly used datatype for storing data. A string
must be valid UTF-8 in MongoDB.
Integer – Integer is used to store numeric values. It can be 32-bit or 64-bit, depending
on the server you are using.
Boolean – This datatype is used to store boolean values. It just shows YES/NO values.
Min/Max Keys – This datatype compares a value against the lowest and highest BSON
elements.
Arrays – This datatype is used to store a list or multiple values in a single key.
Date – This datatype stores the current date or time in Unix time format. It also makes it
possible to specify your own date and time by creating a Date object and passing
the day, month, and year into it.
Create Operations
For MongoDB CRUD, if the specified collection doesn’t exist, the create operation will
create the collection when it’s executed. Create operations in MongoDB target a single
collection, not multiple collections. Insert operations in MongoDB are atomic on a
single document level.
MongoDB provides two different create operations that you can use to insert
documents into a collection:
db.collection.insertOne()
db.collection.insertMany()
Read Operations
The read operations allow you to supply special query filters and criteria that let you
specify which documents you want. The MongoDB documentation contains more
information on the available query filters. Query modifiers may also be used to change
how many results are returned.
MongoDB has two methods of reading documents from a collection:
db.collection.find()
db.collection.findOne()
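For illustration, the following PyMongo sketch mirrors the shell methods above; the connection string, the school database, and the students collection are assumed example names:

from pymongo import MongoClient

students = MongoClient("mongodb://localhost:27017/")["school"]["students"]

students.insert_one({"name": "Asha", "marks": 78})              # like insertOne()
students.insert_many([{"name": "Ravi", "marks": 91},
                      {"name": "Meena", "marks": 85}])          # like insertMany()

print(students.find_one({"name": "Asha"}))                      # like findOne()
for doc in students.find({"marks": {"$gt": 80}}):               # like find() with a filter
    print(doc)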
Update Operations
Like create operations, update operations operate on a single collection, and they are
atomic at a single document level. An update operation takes filters and criteria to
select the documents you want to update.
You should be careful when updating documents, as updates are permanent and can’t
be rolled back. This applies to delete operations as well.
For MongoDB CRUD, there are three different methods of updating documents:
db.collection.updateOne()
db.collection.updateMany()
db.collection.replaceOne()
Delete Operations
Delete operations operate on a single collection, like update and create operations.
Delete operations are also atomic for a single document. You can provide delete
operations with filters and criteria in order to specify which documents you would like
to delete from a collection. The filter options rely on the same syntax that read
operations utilize.
MongoDB has two different methods of deleting records from a collection:
db.collection.deleteOne()
db.collection.deleteMany()
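Continuing the same illustrative PyMongo sketch (same assumed database and collection), the update and delete calls map onto the shell methods listed above:

from pymongo import MongoClient

students = MongoClient("mongodb://localhost:27017/")["school"]["students"]

students.update_one({"name": "Asha"}, {"$set": {"marks": 80}})       # like updateOne()
students.update_many({"marks": {"$lt": 40}}, {"$inc": {"marks": 5}}) # like updateMany()
students.replace_one({"name": "Ravi"},
                     {"name": "Ravi", "marks": 95, "grade": "A"})    # like replaceOne()

students.delete_one({"name": "Meena"})                               # like deleteOne()
students.delete_many({"marks": {"$lt": 35}})                         # like deleteMany()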
An array in MongoDB lets you store multiple values in a single field. An array query is
divided into three parts (the array field, the operator, and the value), as described
below. Arrays are essential and useful in MongoDB.
Syntax:
{< array field >: {< operator1> : <value1>, < operator2> : <value2>, < operator3> :
<value3>, ….. }}
1. Array field: The field that holds the array and on which the query is defined.
2. Operator: The operator specifies which values to match in the array. The
operator name is specific to the value name in the array.
3. Value: The value in the array is the actual value against which we are querying
the array. The value is significant while defining an array.
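A small hedged PyMongo sketch of an array query using the $all operator (the subjects field and the sample document are invented for illustration):

from pymongo import MongoClient

students = MongoClient("mongodb://localhost:27017/")["school"]["students"]
students.insert_one({"name": "Kiran", "subjects": ["maths", "physics", "cs"]})

# match documents whose subjects array contains all of the listed values
for doc in students.find({"subjects": {"$all": ["maths", "cs"]}}):
    print(doc["name"], doc["subjects"])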
Count
Code:
db.emp_count.find()
On a sharded cluster, if you use this method without a query predicate, it will
return an inaccurate count if orphaned documents exist or if a chunk
migration is in progress. To avoid such a situation, use the
db.collection.aggregate() method.
Syntax:
db.collection.count(query)
Parameters:
query – The query selection criteria. (Required; type: document)
Sort
Syntax:
>db.COLLECTION_NAME.find().sort({KEY:1})
Limit
Syntax:
db.COLLECTION_NAME.find().limit(NUMBER)
Skip
Syntax:
cursor.skip(<offset>)
Aggregate
Syntax:
Basic syntax of aggregate() method is as follows −
>db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)
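The PyMongo equivalents of count, sort, limit, skip, and aggregate look roughly as follows (illustrative only; the grade field is an assumed example):

from pymongo import MongoClient

students = MongoClient("mongodb://localhost:27017/")["school"]["students"]

print(students.count_documents({"marks": {"$gt": 80}}))          # count with a query

for doc in students.find().sort("marks", -1).limit(3).skip(1):   # sort desc, limit, skip
    print(doc)

# aggregate: average marks per grade
pipeline = [{"$group": {"_id": "$grade", "avg_marks": {"$avg": "$marks"}}}]
for row in students.aggregate(pipeline):
    print(row)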
MapReduce
Syntax:
> db.collection.mapReduce(
function() {emit(key,value);}, // map function
function(key,values) {return reduceFunction}, // reduce function
{ out: collection }
)
Cursor Methods:
1. cursor.addOption(flag)
2. cursor.batchSize(size)
3. cursor.close()
4. cursor.collation(<collation document>)
5. cursor.forEach(function)
6. cursor.limit()
7. cursor.map(function)
8. cursor.max()
9. cursor.min()
10. cursor.tailable()
11. cursor.toArray()
Creating an Index:
MongoDB provides a method called createIndex() that allows a user to create an index.
Syntax:
db.COLLECTION_NAME.createIndex({KEY:1})
The key determines the field on which you want to create the index, and 1
(or -1) determines the order in which the index entries will be arranged (ascending or
descending).
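An illustrative PyMongo equivalent of createIndex(), using the same assumed students collection:

from pymongo import MongoClient, ASCENDING

students = MongoClient("mongodb://localhost:27017/")["school"]["students"]
index_name = students.create_index([("name", ASCENDING)])   # 1 = ascending order
print(index_name)                                           # e.g. "name_1"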
Mongo Import
The mongoimport command has the following general form (a typical invocation):
mongoimport --db <database_name> --collection <collection_name> --type <json|csv>
--file <input_file> [--fields <field_name(s)>] [--headerline]
Mongo Export
The mongoexport command has the following general form (a typical invocation):
mongoexport --db <database_name> --collection <collection_name>
--fields <field_name(s)> --out <output_file> [--type <json|csv>]
where,
Field_name(s) - Name of the field (or multiple fields separated by commas (,)) to be
exported. It is optional in the case of a JSON file; if not specified, all the fields of the
collection will be exported to the JSON file.
It is suggested to properly specify the name and path of the exported file.
When exporting to CSV format, you must specify the fields of the documents to be exported.
When exporting to JSON format, the _id field will also be exported by default, while it
will not be exported to a CSV file unless specified in the field list.
4.3 Cassandra
4.3.1 What is Cassandra
Apache Cassandra is a highly scalable, high-performance, distributed NoSQL database.
Cassandra is designed to handle huge amounts of data across many commodity
servers, providing high availability without a single point of failure.
Fast writes
Cassandra was designed to run on cheap commodity hardware. It performs blazingly
fast writes and can store hundreds of terabytes of data, without sacrificing the read
efficiency.
Start CQLsh:
CQLsh provides a number of options, shown in the following table:

Option              Usage
help                Shows help topics about the options of CQLsh commands.
version             Shows the version of CQLsh you are using.
execute             Directs the shell to accept and execute a CQL command.
file="file name"    Cassandra executes the commands in the given file and exits.
-u "username"       Authenticates a user. The default user name is: cassandra.
-p "password"       Authenticates a user with a password. The default password is: cassandra.
Syntax:
Or
A counter is a special column used to store a number that is changed in increments.
For example, you might use a counter column to count the number of times a page is
viewed. A counter can be defined only in a dedicated table, using the counter
datatype.
Now, we are going to create a table with a counter column; see the sketch below.
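A hedged sketch of what such a counter table might look like, executed here through the Python cassandra-driver (the keyspace testks, the table page_views, and the row key are made-up names; a local Cassandra node is assumed):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("testks")

session.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        page_id text PRIMARY KEY,
        views counter
    )""")

# a counter column can only be changed with UPDATE ... SET col = col + n
session.execute("UPDATE page_views SET views = views + 1 WHERE page_id = 'home'")
for row in session.execute("SELECT page_id, views FROM page_views"):
    print(row.page_id, row.views)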
1. In Cassandra, both the INSERT and UPDATE commands support setting a time
for data in a column to expire.
2. It is used to set a time limit for a specific period of time. With the USING TTL clause
we can set the TTL value at the time of insertion.
3. We can use the TTL() function to get the time remaining before expiry for a specific
selected column.
4. At the point of insertion, we can set the expiry limit of the inserted data by using the
TTL clause. If we want to set the expiry limit to two days, then we need to define the
corresponding TTL value.
5. With TTL we can set the expiration period to two days, and the value of TTL
will be 172800 seconds. Let's understand this with an example.
Table: student_Registration
To create the table, use the following CQL query (see the sketch after this example).
To insert data using TTL, use the following CQL query.
Output:
Id Name Event
Now, to determine the remaining time to expire for a specific column, use the following
CQL query.
Output:
ttl(Name)
172700
It will decrease as you check its TTL value again, because of the TTL time limit.
Now, use the following CQL query to check again.
Output:
ttl(Name)
172500
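A hedged sketch of the kind of CQL the example above refers to, run through the Python cassandra-driver; the keyspace testks and the sample values (101, 'Ashish', 'Hackathon') are invented, while the columns follow the Id / Name / Event layout and the 172800-second TTL shown in the notes:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("testks")

session.execute("""
    CREATE TABLE IF NOT EXISTS student_Registration (
        id int PRIMARY KEY,
        name text,
        event text
    )""")

# insert a row whose non-key column values expire after two days (172800 seconds)
session.execute("INSERT INTO student_Registration (id, name, event) "
                "VALUES (101, 'Ashish', 'Hackathon') USING TTL 172800")

# check the remaining time to live of the name column
for row in session.execute("SELECT TTL(name) FROM student_Registration WHERE id = 101"):
    print(row)   # the remaining TTL, which keeps decreasing towards 0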
Syntax:
ALTER (TABLE | COLUMNFAMILY) <tablename> <instruction>
Add a column
Drop a column
Adding a Column
Using the ALTER command, you can add a column to a table. While adding columns, you
have to take care that the column name does not conflict with existing column
names and that the table is not defined with the compact storage option.
Dropping a Column
Using the ALTER command, you can delete a column from a table. Before dropping a
column from a table, check that the table is not defined with the compact storage option.
Given below is the syntax to delete a column from a table using the ALTER command.
To import data (a typical cqlsh COPY command):
COPY <table_name> (<column1>, <column2>, ...) FROM '<file_name>.csv' WITH HEADER = TRUE;
To export data:
COPY <table_name> (<column1>, <column2>, ...) TO '<file_name>.csv' WITH HEADER = TRUE;
The system keyspace includes a number of tables that contain details about your
Cassandra database objects and cluster configuration. For example:
local – columns: schema_version, thrift_version, tokens set, truncated at map.
Information a node has about itself.
peers – columns: peer, data_center, rack, release_version, ring_id, rpc_address,
schema_version, tokens. Each node records what other nodes tell it about themselves
over the gossip.
schema_columns – columns: keyspace_name, columnfamily_name, column_name,
component_index, index_name, index_options, index_type, validator. Used internally
with compound primary keys; inspect schema_columnfamilies to get detailed
information about specific tables.
UNIT – V:
(Big Data Frame Works for Analytics) Hadoop Frame Work: Map Reduce
Programming: I/O formats, Map side join-Reduce Side Join-Secondary Sorting-
Pipelining MapReduce jobs
Spark Frame Work: Introduction to Apache spark-How spark works, Programming
with RDDs: Create RDD - Spark Operations - Data Frame.
Let’s get an idea of how data flows between the client interacting with HDFS, the name
node, and the data nodes with the help of a diagram. Consider the figure:
Step 1: The client opens the file it wishes to read by calling open() on the File System
Object(which for HDFS is an instance of Distributed File System).
Step 2: Distributed File System( DFS) calls the name node, using remote procedure
calls (RPCs), to determine the locations of the first few blocks in the file. For each
block, the name node returns the addresses of the data nodes that have a copy of that
block. The DFS returns an FSDataInputStream to the client for it to read data from.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored
the data node addresses for the first few blocks in the file, then connects to the
first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read()
repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream will close the
connection to the data node, then finds the best data node for the next block. This
happens transparently to the client, which from its point of view is simply reading an
endless stream. Blocks are read in order, with the DFSInputStream opening new
connections to data nodes as the client reads through the stream. It will also call
the name node to retrieve the data node locations for the next batch of blocks as
needed.
Step 6: When the client has finished reading the file, it calls close() on
the FSDataInputStream.
Next, we’ll check out how files are written to HDFS. Consider figure 1.2 to get a better
understanding of the concept.
Note: HDFS follows the Write once Read many times model. In HDFS we cannot edit
the files which are already stored in HDFS, but we can append data by reopening the
files.
Step 1: The client creates the file by calling create() on the Distributed File System (DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file
system's namespace, with no blocks associated with it. The name node performs
various checks to make sure the file doesn’t already exist and that the client has the
right permissions to create the file. If these checks pass, the name node prepares a
record of the new file; otherwise, the file can’t be created and therefore the client is
thrown an error i.e. IOException. The DFS returns an FSDataOutputStream for the
client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets,
which it writes to an internal queue called the data queue. The data queue is consumed
by the DataStreamer, which is responsible for asking the name node to allocate new blocks
by picking a list of suitable data nodes to store the replicas. The list of data
nodes forms a pipeline, and here we'll assume the replication level is three, so there
are three nodes in the pipeline. The DataStreamer streams the packets to the first
data node in the pipeline, which stores each packet and forwards it to the second
data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third
(and last) data node in the pipeline.
Step 5: The DFSOutputStream sustains an internal queue of packets that are waiting
to be acknowledged by data nodes, called an “ack queue”.
Step 6: When the client finishes writing, it calls close() on the stream. This action flushes
all the remaining packets to the data node pipeline and waits for acknowledgments before
contacting the name node to signal that the file is complete.
Depending upon the place where the actual join is performed, joins in Hadoop are
classified into-
1. Map-side join – When the join is performed by the mapper, it is called a map-side
join. In this type, the join is performed before data is actually consumed by the map
function. It is mandatory that the input to each map is in the form of a partition and is
in sorted order. Also, there must be an equal number of partitions and it must be sorted
by the join key.
2. Reduce-side join – When the join is performed by the reducer, it is called a reduce-side
join. Here, map-side processing emits the join key and corresponding tuples of both tables.
As an effect of this processing, all the tuples with the same join key fall into the same
reducer, which then joins the records with the same join key.
Secondary sort is a technique that allows the MapReduce programmer to control the
order in which the values show up within a reduce function call.
Let's also assume that our secondary sorting is on a composite key made up of Last
Name and First Name.
The partitioner and the group comparator use only the natural key: the partitioner uses it
to channel all records with the same natural key to a single reducer. This partitioning
happens in the Map Phase; data from various Map tasks is received by the reducers,
where it is grouped and then sent to the reduce method. This grouping is where
the group comparator comes into the picture. If we had not specified a custom
group comparator, Hadoop would have used the default implementation, which
would have considered the entire composite key and would have led to incorrect
results.
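A framework-free Python sketch of this idea (the names and marks are invented): records are sorted once on the composite key (last name, first name), but grouped only on the natural key (last name), so each group reaches the "reduce" step already ordered by first name:

from itertools import groupby

records = [("sharma", "anil", 72), ("rao", "kavya", 88),
           ("sharma", "zoya", 65), ("rao", "arun", 91)]

# sort on the composite key: (natural key, secondary key)
records.sort(key=lambda r: (r[0], r[1]))

# group only on the natural key, like the group comparator
for last_name, group in groupby(records, key=lambda r: r[0]):
    print(last_name, [(first, marks) for _, first, marks in group])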
Finally, reviewing the steps involved in an MR job and relating them to secondary sorting
should help clear up any lingering doubts.
What is Spark?
Spark was built on top of Hadoop MapReduce. It was optimized to run in
memory, whereas alternative approaches like Hadoop's MapReduce write data to and
from computer hard drives. So, Spark processes the data much quicker than the other
alternatives.
o Fast - It provides high performance for both batch and streaming data, using a
state-of-the-art DAG scheduler, a query optimizer, and a physical execution
engine.
o Easy to Use - It facilitates writing applications in Java, Scala, Python, R,
and SQL. It also provides more than 80 high-level operators.
o Generality - It provides a collection of libraries including SQL and DataFrames,
MLlib for machine learning, GraphX, and Spark Streaming.
o Lightweight - It is a light unified analytics engine which is used for large scale
data processing.
o Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes,
standalone, or in the cloud.
Spark has a small code base, and the system is divided into various layers. Each layer
has some responsibilities. The layers are independent of each other.
The first layer is the interpreter; Spark uses a Scala interpreter with some
modifications. As you enter your code in the Spark console (creating RDDs and applying
operators), Spark creates an operator graph. When the user runs an action (like collect),
the graph is submitted to a DAG Scheduler. The DAG scheduler divides the operator
graph into (map and reduce) stages. A stage is comprised of tasks based on partitions
of the input data. The DAG scheduler pipelines operators together to optimize the
graph. For example, many map operators can be scheduled in a single stage. This
optimization is key to Spark's performance. The final result of a DAG scheduler is a set
of stages. The stages are passed on to the Task Scheduler. The task scheduler
launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task
scheduler doesn't know about dependencies among stages.
o Transformation
o Action
Transformation
In Spark, the role of transformation is to create a new dataset from an existing one.
The transformations are considered lazy, as they are only computed when an action
requires a result to be returned to the driver program.
Transformation Description
flatMap(func) Here, each input item can be mapped to zero or more output
items, so func should return a sequence rather than a single
item.
union(otherDataset) It returns a new dataset that contains the union of the elements
in the source dataset and the argument.
join(otherDataset, [numPartitions]) - When called on datasets of type (K, V) and (K, W),
returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins
are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numPartitions]) - When called on datasets of type (K, V) and (K, W),
returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also
called groupWith.
pipe(command, [envVars]) - Pipe each partition of the RDD through a shell command, e.g.
a Perl or bash script.
repartition(numPartitions) - It reshuffles the data in the RDD randomly to create either more
or fewer partitions and balances it across them.
Action
In Spark, the role of action is to return a value to the driver program after running a
computation on the dataset.
Action Description
takeOrdered(n, [ordering]) It returns the first n elements of the RDD using either their
natural order or a custom comparator.
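A minimal PySpark sketch of transformations followed by an action (assumes a local Spark installation; the input lines are an in-memory list so the example stays self-contained):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

lines = sc.parallelize(["spark runs in memory", "spark is fast"])
words = lines.flatMap(lambda line: line.split())                      # transformation (lazy)
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)  # still lazy

print(counts.collect())   # action: triggers execution and returns results to the driver
sc.stop()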
*****