Introduction To Big Data & Basic Data Analysis
[Figure: data pipeline: Data, Storage, Formatting and Cleaning, Data Access, Data Analysis, Visualization]
Big Data & Related Topics/Courses
[Figure: the data pipeline (Data, Storage, Formatting and Cleaning, Data Understanding, Data Integration, Data Access, Data Analysis, Visualization) surrounded by related topics and courses]
CS199
Human-Computer Interaction
Machine Learning
Databases
Information Retrieval
Data Mining
Computer Vision
Speech Recognition
Natural Language Processing
Data Warehousing
Signal Processing
Information Theory
Many Applications!
Some Data Analysis Techniques
Visualization
Classification
Predictive Modeling
Time Series
Clustering
Big Data Everywhere!
640K ought to be enough for anybody.
The Earthscope
The Earthscope is the world's
largest science project.
Designed to track North
America's geological evolution,
this observatory records data
over 3.8 million square miles,
amassing 67 terabytes of data.
much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Types of Data
Relational Data
(Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF)
What to do with these data?
Aggregation and Statistics
Data warehouse and OLAP
Indexing, Searching, and Querying
Keyword based search
Pattern matching (XML/RDF)
Knowledge discovery
Data Mining
Statistical Modeling
OLAP and Data Mining
Warehouse Architecture
[Figure: warehouse architecture with layers Client, Query & Analysis, Warehouse and Metadata, Integration]
Star
product  prodId  name  price
         p1      bolt  10
         p2      nut   5

store    storeId  city
         c1       nyc
         c2       sfo
         c3       la
Cube
dimensions = 2
3-D Cube
dimensions = 3
ROLAP vs. MOLAP
ROLAP:
Relational On-Line Analytical
Processing
MOLAP:
Multi-Dimensional On-Line Analytical
Processing
Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

ans   sum
      81
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

ans   date  sum
      1     81
      2     48
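The two queries above can be tried out with a minimal sketch like the following. It assumes Python's standard sqlite3 module and an in-memory database; the rows are copied from the sale table on the slide, while the variable names are only illustrative.

```python
import sqlite3

# Rows copied from the sale table on the slide: (prodId, storeId, date, amt).
SALE_ROWS = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

# An in-memory database is an assumption made to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", SALE_ROWS)

# Add up amounts for day 1.
day1_total = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day.
totals_by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(day1_total)     # 81
print(totals_by_day)  # [(1, 81), (2, 48)]
```

The ORDER BY clause is not on the slide; it is added only so the printed rows come back in a predictable order.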
Another Example
Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE
GROUP BY date, prodId
sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

ans   prodId  date  amt
      p1      1     62
      p2      1     19
      p1      2     48

rollup: from sums by day and product to sums by day
drill-down: from sums by day to sums by day and product
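Rollup and drill-down can be sketched in a few lines of plain Python. This is an illustrative sketch, not how an OLAP engine does it: the aggregate helper is a hypothetical name, and the rows are copied from the sale table on the slide.

```python
from collections import defaultdict

# Rows copied from the sale table on the slide: (prodId, storeId, date, amt).
SALE_ROWS = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def aggregate(rows, key):
    """Sum the amt column (index 3) for each distinct value of key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# Drill-down level: one total per (prodId, date) pair.
by_product_and_day = aggregate(SALE_ROWS, key=lambda r: (r[0], r[2]))

# Rollup: drop the prodId dimension by re-aggregating the finer totals.
by_day = defaultdict(int)
for (prod_id, day), amt in by_product_and_day.items():
    by_day[day] += amt
by_day = dict(by_day)

print(by_product_and_day)  # {('p1', 1): 62, ('p2', 1): 19, ('p1', 2): 48}
print(by_day)              # {1: 81, 2: 48}
```

Rolling up from the finer totals works here because sum can be recomputed from partial sums; an operator such as median would need the original rows instead.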
Aggregates
Operators: sum, count, max, min, median, avg
HAVING clause
Using dimension hierarchy
average by region (within store)
maximum by month (within date)
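A small sketch of the HAVING clause and of aggregating along a dimension hierarchy, again assuming Python's sqlite3 module. The slides give each store a city but no region, so city stands in for the coarser level here (an assumption); the threshold of 20 is also made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.execute("CREATE TABLE store (storeId TEXT, city TEXT)")

# Rows copied from the sale and store tables on the slides.
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])
conn.executemany("INSERT INTO store VALUES (?, ?)", [
    ("c1", "nyc"), ("c2", "sfo"), ("c3", "la"),
])

# Dimension hierarchy: join the fact table to the store dimension and
# aggregate at the city level (standing in for "region" on the slide).
avg_by_city = conn.execute(
    "SELECT store.city, avg(sale.amt) FROM sale "
    "JOIN store ON sale.storeId = store.storeId "
    "GROUP BY store.city ORDER BY store.city"
).fetchall()

# HAVING filters whole groups after aggregation; 20 is an arbitrary threshold.
big_cities = conn.execute(
    "SELECT store.city, sum(sale.amt) FROM sale "
    "JOIN store ON sale.storeId = store.storeId "
    "GROUP BY store.city HAVING sum(sale.amt) > 20 ORDER BY store.city"
).fetchall()

print(avg_by_city)
print(big_cities)  # [('la', 50), ('nyc', 67)]
```

WHERE filters individual rows before grouping, while HAVING filters groups after the aggregate has been computed, which is why the threshold on sum(amt) has to go in HAVING.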
What is Data Mining?
[Figure: training set]
Clustering
[Figure: clusters plotted on the axes income, education, age]
K-Means Clustering
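The slide gives only the name of the technique, so the following is a minimal sketch of the standard k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points), not necessarily the variant shown in the lecture. The (age, income) points and the starting centroids are made-up illustration data.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means on numeric tuples with caller-supplied starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid alone
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical (age, income in thousands) points forming two obvious groups.
POINTS = [(25, 30), (27, 32), (30, 35), (55, 90), (58, 95), (60, 100)]
centroids, clusters = kmeans(POINTS, centroids=[(25, 30), (60, 100)])
print(clusters)
```

Starting centroids are passed in explicitly to keep the run deterministic; in practice they are usually chosen at random, and the result can depend on that choice.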
Association Rule Mining
sales records (market-basket data):
transaction id  customer id  products bought
Association Rule Discovery
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, ...} --> {Potato Chips}
Potato Chips as the consequent => can be used to
determine what should be done to boost their sales.
Bagels in the antecedent => can be used to see which
products would be affected if the store discontinues
selling bagels.
Bagels in the antecedent and Potato Chips in the consequent
=> can be used to see what products should be sold
with bagels to promote sales of potato chips!
Supermarket shelf management.
Inventory management.
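How strongly a rule such as {Bagels} --> {Potato Chips} holds is usually measured by its support and confidence. The sketch below computes both over a handful of made-up baskets; the basket contents and the function names are assumptions for illustration, not data from the slides.

```python
# Hypothetical market-basket data: one set of products per transaction.
BASKETS = [
    {"bagels", "potato chips", "milk"},
    {"bagels", "potato chips"},
    {"bagels", "milk"},
    {"potato chips", "soda"},
    {"bagels", "potato chips", "soda"},
]


def count_containing(baskets, items):
    """Number of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket)


def support(baskets, antecedent, consequent):
    """Fraction of all baskets containing both sides of the rule."""
    both = set(antecedent) | set(consequent)
    return count_containing(baskets, both) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Among baskets with the antecedent, the fraction that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(baskets, both) / count_containing(baskets, antecedent)


rule_support = support(BASKETS, {"bagels"}, {"potato chips"})
rule_confidence = confidence(BASKETS, {"bagels"}, {"potato chips"})
print(rule_support, rule_confidence)  # 0.6 0.75
```

Here three of the five baskets contain both items (support 0.6), and three of the four bagel baskets also contain potato chips (confidence 0.75).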
Other Types of Mining
Text mining: application of data mining to
textual documents
cluster Web pages to find related pages
cluster pages a user has visited to organize
their visit history
classify Web pages automatically into a Web
directory
Graph Mining:
Deals with graph data
The Meaning of Big Data - 3 Vs
Big Volume
With simple (SQL) analytics
With complex (non-SQL) analytics
Big Velocity
Drink from the fire hose
Big Variety
Large number of diverse data sources to
integrate
The Participants
Hadoop
Simple analytics
100x slower than a parallel DBMS
Complex analytics (Mahout or roll-your-own)
100x slower than ScaLAPACK
Parallel programming
Parallel grep (great)
Everything else (awful)
Hadoop lacks
Stateful computations
Point-to-point communication
Big Velocity
Sensor tagging everything of value
sends velocity through the roof
E.g. car insurance
VoltDB: an example of NewSQL
A main memory SQL engine
Open source
Light-weight transactions
Run-to-completion with no locking
Single-threaded
Multi-core by splitting main memory
Big Variety
Typical enterprise has 5000 operational systems
Only a few get into the data warehouse
What about the rest?
The World of Data Integration
[Figure: data warehouse, enterprise text, the rest of your data]
Summary
The rest of your data (public and private)
Is a treasure trove of incredibly valuable
information
Largely untapped
IoT Meets Big Data
Big Data Value Chain
Collection -> Ingestion -> Discovery & Cleansing -> Integration -> Analysis -> Delivery
Considerations for Big Data Standardization
Anytime, Anything, Any Device, Any Context, Any Place, Anywhere, Anyone
Sensors, Applications, Software agents, Individuals, Organizations, Hardware resources
Big Data Standardization Challenges (1)
Big Data use cases, definitions, vocabulary and reference architectures
(e.g. system, data, platforms, online/offline)
Specifications and standardization of metadata including data
provenance
Application models (e.g. batch, streaming)
Query languages including non-relational queries to support diverse
data types (XML, RDF, JSON, multimedia) and Big Data operations (e.g.
matrix operations)
Domain-specific languages
Semantics of eventual consistency
Advanced network protocols for efficient data transfer
General and domain specific ontologies and taxonomies for describing
data semantics including interoperation between ontologies
Source: ISO
Big Data Standardization Challenges (2)
Big Data security and privacy access controls
Remote, distributed, and federated analytics (taking the
analytics to the data) including data and processing
resource discovery and data mining
Data sharing and exchange
Data storage, e.g. memory storage system, distributed file
system, data warehouse, etc.
Human consumption of the results of big data analysis (e.g.
visualization)
Interface between relational (SQL) and non-relational
(NoSQL)
Big Data Quality and Veracity description and management
Source: ISO
The Structure of Big Data
Structured
Most traditional data sources
Semi-structured
Many sources of big data
Unstructured
Video data, audio data
Benefits of Big Data
Big Data is already an important part of the $64 billion
database and data analytics market
It offers commercial opportunities of a comparable
Sekhar Kondepudi
sekhar.kondepudi@nus.edu.sg
www.kondepudi-group.info
M : +65 98566472