2 Hadoop (Uploaded)
1. General purpose RDBMS
- Powers first-generation data warehouses
Benefits:
- RDBMS already in-house
- SQL-based
- Trained DBAs
[Figure: operational systems feed an ETL process into the data warehouse, which feeds data marts, BI reports, and dashboards]
Challenges:
- Cost to deploy and upgrade
- Doesn't support complex analytics
- Scalability and performance
2. Analytical platforms
Purpose-built database management systems designed explicitly for query processing and analysis that provide dramatically higher price/performance and availability compared to general-purpose solutions.
Vendors: 1010data, Aster Data (Teradata), Calpont, Datallegro (Microsoft), Exasol, Greenplum (EMC), IBM SmartAnalytics, Infobright, Kognitio, Netezza (IBM), Oracle Exadata, Paraccel, Pervasive, Sand Technology, SAP HANA, Sybase IQ (SAP), Teradata, Vertica (HP)
Deployment options:
- Software only (Paraccel, Vertica)
- Appliance (SAP, Exadata, Netezza)
- Hosted (1010data, Kognitio)
3. Hadoop
Hadoop distilled: What’s new?
Hadoop ecosystem
[Figure: the Hadoop ecosystem of related projects. Source: Hortonworks]
Hadoop adoption rates
- No plans: 38%
- Considering: 32%
- Experimenting: 20%
- Implementing: 5%
- In production: 4%
Based on 158 respondents, BI Leadership Forum, April 2012
Which platform do you choose?
Hadoop
Analytic Database
General purpose RDBMS
Workflows
[Figure: two contrasting workflows between source systems and analytical tools. The data warehouse path "captures only what's needed": 1. extract, transform, load into an analytical database (DW); 9. report and mine data with analytical tools. The Hadoop path "captures in case it's needed": 5. explore data; 6. parse, aggregate.]
The new analytical ecosystem
Hadoop
• Open source (using the Apache license)
• Around 40 core Hadoop committers from ~10 companies
– Cloudera, Yahoo!, Facebook, Apple, and more
• Hundreds of contributors writing features, fixing bugs
• Many related projects, applications, tools, etc.
Hadoop History
• Hadoop is based on work done by Google in the early 2000s
– Specifically, on papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004
• This work takes a radical new approach to the problem of
distributed computing
– Meets all the requirements we have for reliability, scalability, etc.
• Core concept: distribute the data as it is initially stored in the
system
– Individual nodes can work on data local to those nodes
– No data transfer over the network is required for initial processing
History Continued
• Doug Cutting and Michael Cafarella created Hadoop in 2005
• Hadoop grew out of the Nutch search engine project
• 2006 – Yahoo! donated the Hadoop project to Apache
• Feb 2006 – Hadoop splits out of Nutch and Yahoo! starts using it
• Dec 2006 – Yahoo! creating a 100-node Webmap with Hadoop
• Apr 2007 – Yahoo! on a 1000-node cluster
• Dec 2007 – Yahoo! creating a 1000-node Webmap with Hadoop
• Jan 2008 – Hadoop made a top-level Apache project
• Sep 2008 – Hive added to Hadoop as a contrib project
Google's Solutions → Hadoop Projects
[Figure: Google's MapReduce and GFS correspond to Hadoop's MapReduce and HDFS]
Hadoop Projects
• Hive
• HBase
• Mahout
• Pig
• Oozie
• Flume
• Sqoop
• …..
What Does Hadoop Do?
• Hadoop is a set of tools (a framework)
• It supports running applications on Big Data
• Write the application on Hadoop, and it takes care of the scalability issue
• – a major concern with traditional applications
• It breaks the data into pieces, sends them to multiple computers, and later combines the results
Hadoop Architecture
[Figure: the master node runs the Job Tracker; slave nodes run Task Trackers and contribute CPU and memory; applications (e.g. machine learning, statistics) submit jobs to the Job Tracker]
Motivation: Google Example
• 20+ billion web pages x 20KB = 400+ TB
• 1 computer reads 30-35 MB/sec from disk
• ~4 months to read the web
• ~1,000 hard drives to store the web
• It takes even more to do something useful with the data!
• Today, a standard architecture for such problems is emerging:
• Cluster of commodity Linux nodes
• Commodity network (ethernet) to connect them
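A back-of-the-envelope check of the figures above (the ~400 GB drive capacity used for the last step is an assumption of this sketch, chosen to reproduce the slide's ~1,000-drive figure):

```python
# Verify the arithmetic behind the Google-scale example.
pages = 20e9          # 20+ billion web pages
page_size = 20e3      # 20 KB per page
total_bytes = pages * page_size
print(total_bytes / 1e12)            # 400.0 -> 400+ TB

read_rate = 35e6                     # one disk reads ~30-35 MB/sec
days = total_bytes / read_rate / 86400
print(round(days))                   # 132 -> roughly 4 months to read the web

drive_capacity = 400e9               # assumed ~400 GB commodity drive
print(total_bytes / drive_capacity)  # 1000.0 -> ~1,000 hard drives
```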
MapReduce
• Challenges:
• How to distribute computation?
• Distributed/parallel programming is hard
What is MapReduce?
• A simple data-parallel programming model and framework
• Designed for scalability and fault-tolerance
• Pioneered by Google
• – Processes around 20 petabytes of data per day
• Popularized by the open-source Hadoop project
• – Used at Yahoo!, Facebook, Amazon
• MapReduce design goals:
• Scalability to large data volumes
• – 1000s of machines, 10,000s of disks
• Cost efficiency
• – Commodity machines (cheap, but unreliable)
• – Commodity network
• – Automatic fault-tolerance
• Easy to use
Cluster Architecture
[Figure: nodes within a rack connect through a rack switch, with 1 Gbps between any pair of nodes in a rack; rack switches connect through a 2-10 Gbps backbone between racks]
Idea and Solution
• Issue: Copying data over a network takes time
• Idea:
• Bring computation close to the data
• Store files multiple times for reliability
• Map-reduce addresses these problems
• – Storage infrastructure: a file system (Google: GFS; Hadoop: HDFS)
• – Programming model: Map-Reduce
Storage Infrastructure
• Problem:
• If nodes fail, how to store data persistently?
• Answer:
• Distributed File System:
• Provides global file namespace
• Google: GFS; Hadoop: HDFS
• Typical usage pattern
• Huge files (100s of GB to TB)
• Data is rarely updated in place
• Reads and appends are common
Distributed File System
• Chunk servers
• File is split into contiguous chunks
• Typically each chunk is 16-64MB
• Each chunk replicated (usually 2x or 3x)
• Try to keep replicas in different racks
• Master node
• a.k.a. Name Node in Hadoop’s HDFS
• Stores metadata about where files are stored
• Might be replicated
• Client library for file access
• Talks to master to find chunk servers
• Connects directly to chunk servers to access data
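The master/chunk-server split above can be sketched as a toy in-memory model. All names here (`NameNode`, the 64 MB chunk size, the server labels) are illustrative assumptions for this sketch, not the real HDFS API:

```python
# Toy model of a distributed file system's metadata, mirroring the
# master node / chunk server design described above.
import random

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the upper end of the typical range
REPLICATION = 3                 # each chunk replicated 3x

class NameNode:
    """Master node: stores metadata about where each chunk is stored."""
    def __init__(self, chunk_servers):
        self.chunk_servers = chunk_servers
        self.metadata = {}  # filename -> list of (chunk_id, [server, ...])

    def add_file(self, filename, size_bytes):
        n_chunks = -(-size_bytes // CHUNK_SIZE)  # ceiling division
        chunks = []
        for i in range(n_chunks):
            # pick distinct servers for the replicas (a real system would
            # also try to spread replicas across racks)
            replicas = random.sample(self.chunk_servers, REPLICATION)
            chunks.append((f"{filename}#chunk{i}", replicas))
        self.metadata[filename] = chunks

    def locate(self, filename):
        # The client library calls this, then talks directly to the
        # chunk servers to read the data.
        return self.metadata[filename]

nn = NameNode([f"server{i}" for i in range(10)])
nn.add_file("weblog.txt", 200 * 1024 * 1024)  # 200 MB -> 4 chunks
print(len(nn.locate("weblog.txt")))           # 4
```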
Distributed File System
• Reliable distributed file system
• Data kept in “chunks” spread across machines
• Each chunk replicated on different machines
• Seamless recovery from disk or machine failure
[Figure: chunks C0, C1, C2, C3, C5 and D0, D1 spread across four machines, with each chunk replicated on more than one machine]
MapReduce: The Map Step
[Figure: each map task reads input key-value pairs (k, v) and emits intermediate key-value pairs]
MapReduce: The Reduce Step
[Figure: intermediate key-value pairs are grouped by key into key-value-list groups; each reduce task collapses one group into output key-value pairs]
MapReduce: Word Counting
[Figure: word-counting data flow over a big document, read sequentially. MAP (provided by the programmer): read the input and produce a set of key-value pairs, e.g. (The, 1), (crew, 1), (of, 1), (space, 1). Group by key: collect all pairs with the same key, e.g. (crew, 1), (crew, 1). Reduce (provided by the programmer): collect all values belonging to each key and output, e.g. (crew, 2), (the, 3). Only sequential reads of the document are needed.]
Word Count Using MapReduce
map(key, value):
// key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
// key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
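The pseudocode above can be run as a small single-process Python sketch; the `map_reduce` driver here is a toy stand-in for the framework's distributed execution, not Hadoop's actual API:

```python
# Single-process sketch of the word-count map and reduce functions.
from collections import defaultdict

def map_fn(doc_name, text):
    # key: document name; value: text of the document
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # key: a word; values: an iterable of counts
    yield (word, sum(counts))

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)           # the "group by key" step
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    out = {}
    for k, vs in groups.items():
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out

docs = [("d1", "the crew of the space shuttle"), ("d2", "the space")]
print(map_reduce(docs, map_fn, reduce_fn))
# {'the': 3, 'crew': 1, 'of': 1, 'space': 2, 'shuttle': 1}
```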
MapReduce: Overview
• Sequentially read a lot of data
• Map:
• Extract something you care about
• Group by key: Sort and Shuffle
• Reduce:
• Aggregate, summarize, filter or transform
• Write the result
• Other examples:
• Link analysis and graph processing
• Machine Learning algorithms
Example: Language Model
• Statistical machine translation:
• Need to count number of times every 5-word sequence occurs in a large
corpus of documents
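As a hedged sketch of the map side of such a job, each 5-word window of a document can be emitted as a key and counted; the tiny `corpus` below is a stand-in for a large document collection:

```python
# Counting 5-word sequences, as needed for the language model example.
from collections import Counter

def five_grams(text):
    # Emit every contiguous 5-word window of the text.
    words = text.split()
    for i in range(len(words) - 4):
        yield " ".join(words[i:i + 5])

corpus = ["a b c a b c a b c a"]  # stand-in for a large corpus
counts = Counter(g for doc in corpus for g in five_grams(doc))
print(counts["a b c a b"])  # 2
```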
Map Reduce Framework
Map-Reduce: Environment
Map-Reduce environment takes care of:
• Partitioning the input data
• Scheduling the program’s execution across a
set of machines
• Performing the group by key step
• Handling machine failures
• Managing required inter-machine communication
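The group-by-key step is typically implemented by hash-partitioning map output keys across reducers, so every pair with the same key reaches the same reducer no matter which mapper emitted it. A minimal sketch (the reducer count is arbitrary):

```python
# Hash partitioning: route each map output pair to a reducer by key.
NUM_REDUCERS = 3

def partition(key):
    # All occurrences of a key hash to the same reducer.
    return hash(key) % NUM_REDUCERS

map_outputs = [("the", 1), ("crew", 1), ("the", 1), ("space", 1)]
reducer_inputs = {r: [] for r in range(NUM_REDUCERS)}
for key, value in map_outputs:
    reducer_inputs[partition(key)].append((key, value))
# Both ("the", 1) pairs now sit in the same reducer's input list.
```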
Map-Reduce: A Diagram
[Figure: a big document flows through MAP (read the input and produce a set of key-value pairs), then group by key (collect all pairs with the same key: hash merge, shuffle, sort, partition), then Reduce (collect all values belonging to each key and output)]
Map-Reduce: In Parallel
All phases are distributed, with many tasks doing the work.
Combiners
• A combiner is a local aggregation function for repeated keys produced by the same map.
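For word count, a combiner can sum repeated keys from one map task before anything crosses the network. This is a single-process sketch of the idea, not Hadoop's Combiner interface:

```python
# Combiner sketch: locally aggregate one mapper's (word, count) output.
from collections import Counter

def combine(map_output):
    """Sum counts for repeated keys from a single map task."""
    combined = Counter()
    for word, count in map_output:
        combined[word] += count
    return list(combined.items())

one_mapper = [("the", 1), ("the", 1), ("crew", 1), ("the", 1)]
print(combine(one_mapper))  # [('the', 3), ('crew', 1)] -- 4 pairs shrink to 2
```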
Coordination: Master
• Tracks task status (idle, in progress, completed)
• Pings the nodes to get their state
• When a map task completes, this info is sent to the master so it can schedule the reducer tasks
Eight Common Hadoop-able Problems
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data “sandbox”
1. Modelling True Risk
• Challenge:
• How much risk exposure does an organization really have with each customer?
– Multiple sources of data across multiple lines of business
• Solution with Hadoop:
• Source and aggregate disparate data sources to build data picture
– e.g. credit card records, call recordings, chat sessions, emails, banking activity
• Structure and analyze
– Sentiment analysis, graph creation, pattern recognition
• Typical Industry:
– Financial Services (banks, insurance companies)
2. Customer Churn Analysis
• Challenge:
• Why is an organization really losing customers?
– Data on these factors comes from different sources
• Solution with Hadoop:
• Rapidly build behavioral model from disparate data sources
• Structure and analyze with Hadoop
– Traversing
– Graph creation
– Pattern recognition
• Typical Industry:
– Telecommunications, Financial Services
3. Recommendation Engine/Ad Targeting
• Challenge:
• Using user data to predict which products to recommend
• Solution with Hadoop:
• Batch processing framework
– Allows execution in parallel over large datasets
• Collaborative filtering
– Collecting 'taste' information from many users
– Utilizing the information to predict what similar users like
• Typical Industry
– Ecommerce, Manufacturing, Retail
– Advertising
4. Point of Sale Transaction Analysis
• Challenge:
• Analyzing Point of Sale (PoS) data to target promotions and manage operations
– Sources are complex and data volumes grow across chains of stores and other sources
• Solution with Hadoop:
• Batch processing framework
– Allows execution in parallel over large datasets
• Pattern recognition
– Optimizing over multiple data sources
– Utilizing information to predict demand
• Typical Industry:
– Retail
5. Analyzing Network Data to Predict Failure
• Challenge:
• Analyzing real-time data series from a network of sensors
– Calculating average frequency over time is extremely tedious because of the need to
analyze terabytes
• Solution with Hadoop:
• Take the computation to the data
– Expand from simple scans to more complex data mining
• Better understand how the network reacts to fluctuations
– Discrete anomalies may, in fact, be interconnected
• Identify leading indicators of component failure
• Typical Industry:
– Utilities, Telecommunications, Data Centers
6. Threat Analysis/Trade Surveillance
• Challenge:
• Detecting threats in the form of fraudulent activity or attacks
– Large data volumes involved
– Like looking for a needle in a haystack
• Solution with Hadoop:
• Parallel processing over huge datasets
• Pattern recognition to identify anomalies,
– i.e., threats
• Typical Industry:
– Security, Financial Services
• General: spam fighting, click fraud
7. Search Quality
• Challenge:
• Providing real time meaningful search results
• Solution with Hadoop:
• Analyzing search attempts in conjunction with structured data
• Pattern recognition
– Browsing pattern of users performing searches in different categories
• Typical Industry:
– Web, Ecommerce
8. Data "Sandbox"
• Challenge:
• Data Deluge
– Don't know what to do with the data or what analysis to run
• Solution with Hadoop:
• “Dump” all this data into an HDFS cluster
• Use Hadoop to start trying out different analyses on the data
• See patterns to derive value from data
• Typical Industry:
– Common across all industries
THANK YOU