
Introduction To Big Data - Presentation


INTRODUCTION TO

BIG DATA
The Outline
1 What is Big Data?
2 Types of Big Data
3 The 5 V’s of Big Data
4 What is Big Data Analytics?
5 Types of Big Data Analytics
6 Workflow of Big Data Analytics
7 Benefits of big data analytics
8 Big data tools


The evolution of data
The volume of data is growing exponentially, driven by new technologies, the Internet of
Things, and the rapid rise of social media networks.

The evolution of data
Data is gathered from several sources like social media, transactional logs …

What is Big Data?

A collection of data sets so large and complex that they become difficult to process
using on-hand database system tools or traditional data processing applications.

Types of Big Data

• Structured data
• Unstructured data
• Semi-structured data

What is structured data?
• It is most often categorized as quantitative data: data that fits within fixed
fields and columns in relational databases and spreadsheets.
• Examples include names, dates, addresses, credit card numbers, stock information,
and geolocation.
• It is highly organized and easily processed by machines. It can be queried and
manipulated using a relational database management system (RDBMS).
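
As a minimal sketch of what "fixed fields and columns" means in practice, the snippet below stores a few rows in an in-memory SQLite database and queries them with SQL. The table name, column names, and sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical table and rows, used only to illustrate structured data.
conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
conn.execute("CREATE TABLE customers (name TEXT, signup_date TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Alice", "2021-03-14", "Paris"),
        ("Bob", "2022-07-01", "Tunis"),
        ("Chloe", "2022-11-23", "Paris"),
    ],
)

# Every row has the same fixed fields, so a declarative query is enough.
paris_customers = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Paris",)
    )
]
conn.close()
print(paris_customers)
```

Because the schema is known in advance, the RDBMS can filter and sort the rows without any custom parsing code.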
What is unstructured data?

• It is any data that does not have a recognizable structure.
• It is considered qualitative data, and it cannot easily be processed and analyzed
using conventional data tools and relational databases.
• It is typically managed by non-relational (NoSQL) databases.
• It is unorganised and raw, and can be textual or non-textual. For example, tweets,
images, MP3 files, emails, and books.

What is semi-structured data?
• It is the “bridge” between structured and unstructured data.
• It does not have a predefined data model and is more complex than structured data,
yet easier to store than unstructured data.
• The data cannot be stored as rows and columns, as it is in relational databases.
• Semi-structured data contains tags and elements (metadata) that are used to group
data and describe how the data is stored.
• Similar entities are grouped together and organized in a hierarchy.
• For example, JSON, CSV and XML files are considered to be semi-structured.
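
To make the idea of tags and hierarchy concrete, here is a small sketch that parses a JSON record with Python's standard `json` module. The record and its field names are hypothetical; the point is only that keys describe the data and that fields may be present in one element and absent in another.

```python
import json

# Hypothetical record: the field names are illustrative, not from the slides.
raw = """
{
  "user": {"id": 42, "name": "Alice"},
  "tags": ["music", "travel"],
  "posts": [
    {"title": "Hello", "likes": 3},
    {"title": "Big data", "likes": 10, "location": "Paris"}
  ]
}
"""

record = json.loads(raw)

# Keys act as self-describing tags; nesting expresses the hierarchy.
user_name = record["user"]["name"]
total_likes = sum(post["likes"] for post in record["posts"])

# Fields are optional: only the second post carries a "location".
locations = [post.get("location") for post in record["posts"]]

print(user_name, total_likes, locations)
```

Unlike a relational table, no schema had to be declared up front, which is why such data is harder to fit into fixed rows and columns.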
The 5 V’s of big data

The 5 V’s of big data:
Volume:
• The name Big Data itself refers to size, which is enormous.
• Volume is a characteristic to consider when dealing with Big Data solutions, as it
helps determine whether a given data set can actually be considered Big Data.
• For example, Netflix has over 86 million subscribers worldwide and streams more than
125 million hours of material daily, which is over 60 petabytes in size. Twitter also
generates millions of tweets daily.

The 5 V’s of big data:
Velocity:
• The term ‘velocity’ refers to the speed at which data is generated.

• How fast data is generated and processed to meet demand determines the real
potential of the data.

• Big Data velocity deals with the speed at which data flows in from sources such as
business processes, application logs, networks, social media sites, sensors, and mobile
devices. The flow of data is massive and continuous.

• For example, Google receives about 63,000 queries per second.
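
To get a feel for what that rate means over a full day, the arithmetic can be sketched as follows. The only input is the per-second figure quoted above; the constant names are illustrative.

```python
QUERIES_PER_SECOND = 63_000      # figure quoted in the slide
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

# Scale the per-second rate up to a whole day.
queries_per_day = QUERIES_PER_SECOND * SECONDS_PER_DAY
print(f"{queries_per_day:,} queries per day")  # 5,443,200,000 queries per day
```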

The 5 V’s of big data
Variety:
• Variety refers to the heterogeneous sources and nature of data, both structured and
unstructured.

• In the past, spreadsheets and databases were the only sources of data considered by
most applications. Nowadays, data in the form of emails, photos, videos, monitoring
devices, PDFs, audio, etc. is also considered in analysis applications.

• This variety of unstructured data poses challenges for storing, mining, and
analyzing it.

The 5 V’s of big data:

Veracity:

• Veracity refers to the degree of accuracy in data sets and how trustworthy they are.

• Raw data collected from various sources can cause data quality issues that may be
difficult to pinpoint. If they aren't fixed through data cleansing processes, bad data
leads to analysis errors that can undermine the value of business analytics initiatives.
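
A data cleansing step can be as simple as the sketch below, which drops missing, impossible, and duplicate records before computing a statistic. The records, the 0 to 120 age range, and the function name `cleanse` are assumptions made for illustration; real pipelines apply domain-specific rules.

```python
# Hypothetical raw records; None and negative ages stand in for bad data.
raw_records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": -5},    # impossible value
    {"id": 1, "age": 34},    # duplicate of the first record
    {"id": 4, "age": 51},
]


def cleanse(records):
    """Drop records with missing or out-of-range ages, then duplicate ids."""
    seen_ids = set()
    clean = []
    for rec in records:
        age = rec["age"]
        if age is None or not 0 <= age <= 120:
            continue  # missing or implausible value
        if rec["id"] in seen_ids:
            continue  # an earlier record with this id was already kept
        seen_ids.add(rec["id"])
        clean.append(rec)
    return clean


clean_records = cleanse(raw_records)
mean_age = sum(r["age"] for r in clean_records) / len(clean_records)
print(len(clean_records), mean_age)
```

Without the cleansing step, the missing value would break the average and the negative age and duplicate would silently skew it.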

The 5 V’s of big data:
Value:
• It refers to the value that big data can provide, and how organisations and
businesses can use the collected data.

• It is about the insights that can be gained from big data.

• Within each organisation, the value derived from big data is unique and relates to
that specific institution or business.

• Examples include disease detection and cost optimisation.

BIG DATA
ANALYTICS

What is Big Data Analytics?
• It is the process of analyzing huge datasets (structured, unstructured and
semi-structured data) collected from a wide range of data sources. Such analysis aims
at discovering hidden patterns and correlations in the data.

• For instance, music streaming companies like Spotify use these tools. Spotify has
96 million users who generate a huge amount of data every day. Based on likes, search
history and user preferences, recommendation lists are generated automatically for
end users.

Types of Big Data Analytics

Descriptive Analytics: it answers the question “What happened?” It is the first layer of
information that can be grasped from the collected data. For instance, for a primary school
dataset, such an analysis might find how many children between the ages of 6 and 10 attend
school.
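
The primary school question can be sketched as a simple count. The dataset below is hypothetical: each entry is a pair of a child's age and whether that child attends school.

```python
# Hypothetical primary-school dataset: (child_age, attends_school).
children = [(5, False), (6, True), (7, True), (8, False), (10, True), (11, True)]

# Descriptive analytics: summarize what is already in the data.
attending_6_to_10 = sum(
    1 for age, attends in children if 6 <= age <= 10 and attends
)
print(attending_6_to_10)
```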

Predictive Analytics: it answers the question “What is likely to happen?” It is about
making predictions about the future based on historical data and events.
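
One of the simplest forms of prediction is fitting a straight line to past values and extending it. The sketch below does this with ordinary least squares written out by hand; the yearly enrolment figures and the helper name `fit_line` are hypothetical, and real predictive analytics usually relies on richer models.

```python
# Hypothetical yearly enrolment counts used as the historical data.
years = [2019, 2020, 2021, 2022]
enrolment = [100, 110, 120, 130]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(years, enrolment)
forecast_2023 = slope * 2023 + intercept  # extrapolate one year ahead
print(slope, forecast_2023)
```

Here the past data grows by 10 per year, so the fitted line forecasts 140 for 2023.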

Types of Big Data Analytics (cont’d)

Diagnostic Analytics: it answers the question “Why did this happen?”, that is, why a certain
behaviour is observed. This is done by trying different combinations of variables and then
drawing conclusions.

Prescriptive Analytics: it answers the question “What should be done?”, or what is the best
course of action to make something happen. It is used by large companies that are looking
for advice to enhance their performance and outcomes.
Workflow of Big Data Analytics

Business case evaluation → Identification of data → Data filtering → Data extraction →
Data aggregation → Data analysis → Visualization of data
Workflow of Big Data Analytics (cont’d)
Stage 1 - Business case evaluation: The Big Data analytics lifecycle begins with a business
case, which defines the reason and goal behind the analysis.

Stage 2 - Identification of data: Here, a broad variety of data sources is identified.

Stage 3 - Data filtering: All of the data identified in the previous stage is filtered here
to remove corrupt data.

Stage 4 - Data extraction: Data that is not compatible with the tool is extracted and then
transformed into a compatible form.
Workflow of Big Data Analytics (cont’d)
Stage 5 - Data aggregation: In this stage, data with the same fields across different
datasets is integrated.

Stage 6 - Data analysis: Data is evaluated using analytical and statistical tools to discover
useful information.

Stage 7 - Visualization of data: With tools like Tableau, Power BI, and QlikView, Big Data
analysts can produce graphic visualizations of the analysis.
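
Stages 3 to 6 can be sketched as a chain of small functions. Everything below is a toy illustration under stated assumptions: the two data sources, their field names, and the function names are hypothetical, and a real workflow would run on dedicated big data tools rather than in-memory lists.

```python
from statistics import mean

# Two hypothetical sources identified in stage 2. The second source reports
# amounts as strings, and one record in the first source is corrupt.
store_a = [{"customer": "c1", "amount": 20.0}, {"customer": "c2", "amount": None}]
store_b = [{"customer": "c1", "amount": "15.5"}, {"customer": "c3", "amount": "4.5"}]


def filter_corrupt(records):
    """Stage 3: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]


def extract(records):
    """Stage 4: convert amounts to float so both sources share one format."""
    return [{"customer": r["customer"], "amount": float(r["amount"])} for r in records]


def aggregate(*datasets):
    """Stage 5: integrate datasets with the same fields, keyed by customer."""
    totals = {}
    for dataset in datasets:
        for r in dataset:
            totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


def analyze(totals):
    """Stage 6: compute a simple statistic over the aggregated data."""
    return mean(totals.values())


totals = aggregate(*(extract(filter_corrupt(s)) for s in (store_a, store_b)))
average_spend = analyze(totals)
print(totals, average_spend)
```

The result of the analysis stage would then be handed to a visualization tool in stage 7.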


Benefits of big data analytics
Businesses, institutions and industries can benefit from big data if it is used effectively.
The benefits of big data and analytics include:

• Customer acquisition and retention: keeping current clients and acquiring new ones.

• Decision-making: guiding decision makers towards the most appropriate and beneficial
decisions.

• Cost savings: identifying measures that reduce costs.

• Fraud detection: detecting fraud or inconsistencies in current data by comparing it with
historical data.

and many more use cases …


Famous companies using BDA
Large businesses all over the world use big data and analytics to drive their success:

• Amazon, the online retail giant, draws on its massive data bank of customer names,
addresses, payments, and search histories for its advertising algorithms and to improve
customer relations.

• The American Express Company uses big data to analyze customer behavior.

• Netflix uses big data to gain insight into the viewing habits of international viewers.

• Brands like Marriott Hotels, Uber Eats, McDonald's, and Starbucks also consistently use
big data as part of their core business.


Big data tools

[Figure: big data tools grouped by category]
• Real-time analytics engine
• Distributed, wide-column NoSQL DB
• Distributed processing engine
• Data storage and processing
• Document-oriented NoSQL DB
• Column-oriented NoSQL DB
• Real-time processing
• Data warehouse
• Data processing
Big data tools
Hadoop:
It is a framework for storing and processing large datasets in a distributed and parallel
manner, combining parallel processing with a distributed storage architecture. It can scale
from a single server to thousands of machines, each offering local computation and storage.
It is made of three main components:

• HDFS (Hadoop Distributed File System): stores any kind of data across the cluster.

• YARN: the resource manager and negotiator of the Hadoop ecosystem.

• MapReduce: responsible for processing the stored data.
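
The MapReduce idea can be illustrated with the classic word count, written here as a single-process Python sketch. This is not the Hadoop API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample documents are hypothetical, and on a real cluster the map and reduce tasks would run in parallel on different machines.

```python
from collections import defaultdict

# Hypothetical input: each string stands in for a block of a large file.
documents = ["big data is big", "data is everywhere"]


def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one input block."""
    return [(word, 1) for word in doc.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

Each map call only needs its own block and each reduce call only needs the values for one key, which is what allows the work to be spread across many machines.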

Big data tools
Apache Storm:

• It has emerged as a platform of choice for industry leaders developing distributed,
real-time data processing platforms.

• It follows the master-slave model, with the nodes coordinated through ZooKeeper:

o Nimbus: the master, responsible for distributing tasks among the workers (slaves).

o Supervisor nodes: the workers, which carry out the tasks.

o ZooKeeper: the coordinator between Nimbus and the supervisor nodes.

Thank you for your
attention
