Introduction To Big Data - Presentation
BIG DATA
The Outline
1. What is Big Data?
The evolution of data
Data is gathered from several sources like social media, transactional logs …
What is Big Data?
A collection of data sets so large and complex that they are difficult to process using on-hand database system tools or traditional data processing applications.
Types of Big Data?
What is structured data?
• It is most often categorized as quantitative data: data that fits within fixed fields and columns in relational databases and spreadsheets.
• Examples include names, dates, addresses, credit card numbers, stock information, and geolocation.
• It is highly organized and easily machine-readable, and it can be manipulated using a relational database management system (RDBMS).
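As a small sketch of the point above, the following stores structured records in a relational table and queries them with an RDBMS (SQLite here); the customer data is invented for illustration.

```python
import sqlite3

# Structured data fits fixed fields and columns; a relational table is a
# natural home for it. All records below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, signup_date TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Alice", "Paris", "2021-03-01"),
     ("Bob", "Cairo", "2021-04-15")],
)
# Because every row follows the same schema, querying is straightforward.
rows = conn.execute("SELECT name FROM customers WHERE city = 'Paris'").fetchall()
print(rows)  # [('Alice',)]
```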
What is unstructured data?
• It has no predefined data model or fixed organization, which makes it difficult to store and search in traditional relational databases.
• Examples include free-form text, emails, images, audio, and video.
What is semi-structured data?
• It is the “bridge” between structured and unstructured data.
• It does not have a rigid, predefined data model and is more complex to process than structured data.
• Its data cannot be stored in rows and columns as in relational databases.
• For example, JSON, CSV, and XML files are considered semi-structured.
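To illustrate the "no fixed schema" point, the sketch below parses two JSON records whose fields differ from one record to the next; the records are invented for this example.

```python
import json

# Semi-structured: each record is self-describing via field names, but there
# is no fixed schema -- records can carry different fields.
records = [
    '{"name": "Alice", "age": 30}',
    '{"name": "Bob", "email": "bob@example.com", "tags": ["vip"]}',
]
parsed = [json.loads(r) for r in records]
# The available fields vary record by record, unlike a relational table.
fields = [sorted(p.keys()) for p in parsed]
print(fields)  # [['age', 'name'], ['email', 'name', 'tags']]
```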
The 5 V’s of big data
The 5 V’s of big data:
Volume:
• The name Big Data itself relates to a size that is enormous.
• Volume is a key characteristic to consider when dealing with Big Data.
• For example, Netflix has over 86 million subscribers worldwide and streams more than 125 million hours of material daily, which amounts to over 60 petabytes. Twitter similarly generates hundreds of millions of tweets every day.
The 5 V’s of big data:
Velocity:
• The term ‘velocity’ refers to the speed of generation of data.
• How fast data is generated and processed to meet demand determines the real potential of the data.
• Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, and mobile devices. The flow of data is massive and continuous.
The 5 V’s of big data
Variety:
• Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured.
• In earlier days, spreadsheets and databases were the only data sources considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications.
• This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
The 5 V’s of big data:
Veracity:
• Veracity refers to the degree of accuracy in data sets and how trustworthy they are.
• Raw data collected from various sources can cause data quality issues that may be
difficult to pinpoint. If they aren't fixed through data cleansing processes, bad data
leads to analysis errors that can undermine the value of business analytics initiatives.
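A minimal data-cleansing sketch of the idea above: drop records with missing or implausible values before analysis. The sensor readings and plausibility range are invented for illustration.

```python
# Veracity in practice: raw data often contains missing or bad values that
# must be cleansed before analysis. All readings below are hypothetical.
raw = [
    {"sensor": "A", "temp_c": 21.5},
    {"sensor": "B", "temp_c": None},    # missing reading
    {"sensor": "C", "temp_c": 999.0},   # implausible value
    {"sensor": "D", "temp_c": 19.8},
]
# Keep only present, plausible temperatures (range chosen for this example).
clean = [r for r in raw
         if r["temp_c"] is not None and -50.0 <= r["temp_c"] <= 60.0]
print(len(clean))  # 2
```

Without such a cleansing step, the bad records (B and C) would distort any statistic computed over the data.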
The 5 V’s of big data:
Value:
• It refers to the value that Big Data can provide and to how organisations/businesses can use the collected data.
• Within each organisation, the derived value from big data is unique and relates to
that specific institution/business.
BIG DATA ANALYTICS
What is Big Data Analytics?
• It is the process of analyzing huge datasets (structured, unstructured, and semi-structured data) collected from numerous data sources. Such analysis aims at uncovering hidden patterns, correlations, and other useful insights.
• For instance, music services like Spotify use these tools. Spotify has 96 million users who generate a huge amount of data every day. Based on likes, search history, and listening habits, it recommends personalized content to its users.
Types of Big Data Analytics?
Descriptive Analytics: it answers the question – What happened? It is the first layer of information that can be grasped from the collected data. For instance, for a primary-school dataset, such analysis might find how many children between the ages of 6 and 10 attend school.
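The school example above can be sketched in a few lines: a simple count over the records answers the "what happened?" question. The pupil dataset is invented for illustration.

```python
# Descriptive analytics: summarize what the collected data says.
# Hypothetical primary-school records.
pupils = [
    {"name": "Sami", "age": 7},
    {"name": "Lina", "age": 11},
    {"name": "Omar", "age": 6},
    {"name": "Nour", "age": 9},
]
# "How many children between the ages of 6 and 10 attend school?"
attending = sum(1 for p in pupils if 6 <= p["age"] <= 10)
print(attending)  # 3
```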
Predictive Analytics: it answers the question – What is likely to happen? It is about making future predictions based on patterns in historical data.
Types of Big Data Analytics? Con’t
Diagnostic Analytics: it answers the question – Why did this happen? Why is a certain behaviour observed? This is done by trying different combinations of variables and then drawing conclusions.
Prescriptive Analytics: it answers the question – What should be done? Or what is the best course of action? It is used by large companies that are looking for advice to enhance their performance and outcomes.
Workflow of Big Data Analytics
Workflow of Big Data Analytics (con’t)
Stage 1 - Business case evaluation: The Big Data analytics lifecycle begins with a business case, which defines the reason and goal behind the analysis.
Stage 2 - Identification of data - Here, a broad variety of data sources are identified.
Stage 3 - Data filtering - All of the identified data from the previous stage is filtered here to remove corrupt
data.
Stage 4 - Data extraction - Data that is not compatible with the tool is extracted and then transformed into
a compatible form.
Workflow of Big Data Analytics (con’t)
Stage 5 - Data aggregation - In this stage, data with the same fields across different datasets are
integrated.
Stage 6 - Data analysis - Data is evaluated using analytical and statistical tools to discover useful
information.
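Stages 3 to 6 can be sketched as a toy pipeline: filter out corrupt records, aggregate two sources on a shared field, then compute a statistic. All data, field names, and sources here are invented for illustration.

```python
from statistics import mean

# Two hypothetical data sources sharing the "user" field.
source_a = [{"user": 1, "spend": 40.0},
            {"user": 2, "spend": None}]          # None marks a corrupt record
source_b = [{"user": 1, "visits": 5},
            {"user": 2, "visits": 3}]

# Stage 3 - data filtering: remove corrupt records.
filtered = [r for r in source_a if r["spend"] is not None]

# Stage 5 - data aggregation: integrate datasets on the shared "user" field.
visits = {r["user"]: r["visits"] for r in source_b}
merged = [{**r, "visits": visits.get(r["user"])} for r in filtered]

# Stage 6 - data analysis: a simple statistic over the merged data.
avg_spend = mean(r["spend"] for r in merged)
print(merged, avg_spend)
```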
Stage 7 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data analysts can produce graphic visualizations of the analysis results.
Benefits of Big Data Analytics:
• Customer Acquisition and Retention: to keep current clients and acquire new ones.
• Decision-Making: to guide decision owners towards making the most adequate and
beneficial decisions.
• Cost-savings: take the required measures to reduce costs thanks to big data analytics.
Companies using Big Data Analytics:
• Amazon, the online retail giant, uses its massive data bank to access customer names, addresses, payments, and search histories, and feeds them into advertising algorithms and product recommendations.
• The American Express Company uses big data to analyze customer behavior.
• Netflix uses big data to gain insight into the viewing habits of international viewers.
• Brands like Marriott Hotels, Uber Eats, McDonald's, and Starbucks are also consistently using Big Data analytics.
Big data tools
Hadoop:
It is a framework allowing the storage and processing of large datasets in a distributed and parallel manner. It implements parallel processing on top of a distributed storage architecture, and it can scale up from single servers to thousands of machines, each offering local computation and storage.
• HDFS (Hadoop Distributed File System): allows storing any kind of data across the cluster.
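Hadoop's processing model is MapReduce. The sketch below is a single-process Python analogy of the classic word-count job, not actual Hadoop code: a mapper emits (word, 1) pairs and a reducer sums them, the way Hadoop would across many machines.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (key, value) pair per word.
    for word in line.split():
        yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: sum the values for each key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data tools", "big data analytics"]       # toy input
pairs = [kv for line in lines for kv in mapper(line)]  # map
result = reducer(pairs)                                # shuffle + reduce
print(result)  # {'big': 2, 'data': 2, 'tools': 1, 'analytics': 1}
```

In real Hadoop, the map and reduce steps run in parallel on the cluster and HDFS supplies the input splits; the logic, however, is the same.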
Big data tools
Apache Storm:
• It has emerged as the platform of choice for industry leaders to develop distributed, real-time data processing applications. Its cluster follows a master-slave architecture:
o Nimbus: the master, responsible for distributing tasks among the workers (slaves)
o Supervisor: the worker daemon that runs on the slave nodes and executes the tasks assigned by Nimbus
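In Storm, a topology wires spouts (stream sources) to bolts (stream processors), and Nimbus schedules that topology across the workers. The sketch below is a plain-Python analogy of a spout-to-bolt pipeline, not the actual Storm API; the event stream is invented.

```python
def spout():
    # A spout emits a continuous stream of tuples; here, a short toy stream.
    for event in ["click", "view", "click"]:
        yield event

def count_bolt(stream):
    # A bolt consumes the stream and transforms or aggregates it.
    counts = {}
    for event in stream:
        counts[event] = counts.get(event, 0) + 1
    return counts

print(count_bolt(spout()))  # {'click': 2, 'view': 1}
```

In real Storm, the spout and bolt run on different workers and the stream never ends; the pipeline shape is what this sketch preserves.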
Thank you for your attention