Big Data Platforms

Three big data platforms (systems)


• General purpose relational database
• Analytical database
• Hadoop

1. General purpose RDBMS
- Powers first-generation data warehouses
Benefits:
- RDBMS already in-house
- SQL-based
- Trained DBAs
Challenges:
- Cost to deploy and upgrade
- Doesn't support complex analytics
- Scalability and performance
[Diagram: operational systems → ETL server → data warehouse → ETL → data marts → BI reports / dashboards]
2. Analytical platforms
Purpose-built database management systems designed explicitly for query processing and analysis that provide dramatically higher price/performance and availability compared to general-purpose solutions.

Examples: 1010data, Aster Data (Teradata), Calpont, Datallegro (Microsoft), Exasol, Greenplum (EMC), IBM Smart Analytics, Infobright, Kognitio, Netezza (IBM), Oracle Exadata, ParAccel, Pervasive, Sand Technology, SAP HANA, Sybase IQ (SAP), Teradata, Vertica (HP)

Deployment options:
- Software only (ParAccel, Vertica)
- Appliance (SAP, Exadata, Netezza)
- Hosted (1010data, Kognitio)
3. Hadoop
• Ecosystem of open source projects
• Hosted by the Apache Foundation
• Google developed and shared the concepts
• Distributed file system that scales out on commodity servers with direct-attached storage and automatic failover
Hadoop distilled: What's new?
- Unstructured data
- Distributed file system
- Data scientist
- "Schema on read"
- Open source ($$)
- MapReduce
- NoSQL

Benefits:
- Comprehensive
- Agile
- Expressive
- Affordable

Drawbacks:
- Immature
- Batch oriented
- Expertise
- TCO
Hadoop ecosystem
[Diagram of the Hadoop ecosystem. Source: Hortonworks]
Hadoop adoption rates
- No plans: 38%
- Considering: 32%
- Experimenting: 20%
- Implementing: 5%
- In production: 4%
Based on 158 respondents, BI Leadership Forum, April 2012
Which platform do you choose?
[Diagram: general purpose RDBMS, analytic database, and Hadoop positioned along a spectrum from structured to semi-structured to unstructured data]
Workflows
[Diagram contrasting two workflows: the data warehouse path ("capture only what's needed") takes data from source systems through extract, transform, load into an analytical database (DW), where analytical tools report on and mine it; the Hadoop path ("capture in case it's needed") loads raw data first, then explores it and parses/aggregates it.]
The new analytical ecosystem
[Diagram]
Hadoop
• Open source (using the Apache license)
• Around 40 core Hadoop committers from ~10 companies
  – Cloudera, Yahoo!, Facebook, Apple, and more
• Hundreds of contributors writing features, fixing bugs
• Many related projects, applications, tools, etc.
Hadoop History
• Hadoop is based on work done by Google in the early 2000s
  – Specifically, on papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004
• This work took a radically new approach to the problem of distributed computing
  – It meets all the requirements we have for reliability, scalability, etc.
• Core concept: distribute the data as it is initially stored in the system
  – Individual nodes can work on data local to those nodes
  – No data transfer over the network is required for initial processing
History Continued
• Doug Cutting and Michael Cafarella created Hadoop in 2005, growing out of the open-source Nutch search engine project
• Feb 2006 – Hadoop split out of Nutch as an Apache subproject, and Yahoo! started using it
• Dec 2006 – Yahoo! building a 100-node Webmap with Hadoop
• Apr 2007 – Yahoo! running Hadoop on a 1,000-node cluster
• Dec 2007 – Yahoo! building a 1,000-node Webmap with Hadoop
• Jan 2008 – Hadoop made a top-level Apache project
• Sep 2008 – Hive added to Hadoop as a contrib project
Google's Solutions / Projects
• Google File System – SOSP 2003
• MapReduce – OSDI 2004
• Sawzall – Scientific Programming Journal 2005
• BigTable – OSDI 2006
• Chubby – OSDI 2006
Open Source World's Solutions
• Google File System → Hadoop Distributed File System (HDFS)
• MapReduce → Hadoop MapReduce
• Sawzall → Pig, Hive, JAQL
• BigTable → HBase, Cassandra
• Chubby → ZooKeeper
Current Status of Hadoop
• Largest cluster: 2,000 nodes (8 cores, 4 TB disk)
• Used by 40+ companies and universities around the world
  – Yahoo!, Facebook, etc.
• Cloud computing donations from Google and IBM
• Startups focusing on providing services for Hadoop
  – Cloudera, Hortonworks
Hadoop Components: In General
• MapReduce
• HDFS
• Related projects
Hadoop Projects
• Hive
• HBase
• Mahout
• Pig
• Oozie
• Flume
• Sqoop
• …
What Does Hadoop Do?
• Hadoop is a set of tools (a framework)
• It supports running applications on big data
• Write the application for Hadoop and it takes care of the scalability issue
  – Scalability is the major concern with traditional applications
• Hadoop breaks the data into pieces, sends them to multiple computers, and later combines the results
Hadoop Architecture
[Diagram: applications interact with the MASTER node, which runs the Job Tracker and the Name Node; each SLAVE node runs a Task Tracker and a Data Node.]
Continued
• Applications interact with the master (batch process)
• Job Tracker (master): breaks the big task into small pieces and sends them to the Task Trackers on the slaves
• Name Node (master): maintains the index of which segment of data resides on which data node
• Task Tracker: processes the small piece of the task assigned to it
• Data Node: manages the piece of data assigned to it
The Five Hadoop Components
• Hadoop comprises five separate daemons:
• NameNode
  – Holds the metadata for HDFS
• Secondary NameNode
  – Performs housekeeping functions for the NameNode
  – Is not a backup or hot standby for the NameNode!
• DataNode
  – Stores actual HDFS data blocks
• JobTracker
  – Manages MapReduce jobs, distributes individual tasks
• TaskTracker
  – Responsible for instantiating and monitoring individual Map and Reduce tasks
Fault Tolerance
• Built-in fault tolerance (by default, 3 copies of each file on the slaves)
• The master may create a single point of failure
• Enterprise Hadoop distributions create two copies of the master as well (main master, backup master)
• If a node fails, the master will detect that failure and re-assign the work to a different node on the system
Continued
• Restarting a task does not require communication with nodes
working on other portions of the data
• If a failed node restarts, it is automatically added back to the
system and assigned new tasks
• If a node appears to be running slowly, the master can redundantly
execute another instance of the same task
For Programmers
• Need not think about:
  – Where the file is located
  – How to manage failures
  – How to break computations into pieces
  – How to program for scalability
• Should focus on writing scale-free programs (the cluster can add thousands of slaves)
Core Hadoop Concepts
• Applications are written in high-level code
– Developers do not worry about network programming, temporal
dependencies etc
• Nodes talk to each other as little as possible
– Developers should not write code which communicates between nodes
– ‘Shared nothing’ architecture
• Data is spread among machines in advance
– Computation happens where the data is stored, wherever possible
– Data is replicated multiple times
Hadoop: Very High-Level Overview
• When data is loaded into the system, it is split into ‘blocks’
– Typically 64MB or 128MB
• Map tasks (the first part of the MapReduce system) work on
relatively small portions of data
– Typically a single block
• A master program allocates work to nodes such that a Map task
will work on a block of data stored locally on that node
– Many nodes work in parallel, each on their own part of the overall
dataset
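
As a rough, back-of-the-envelope illustration (not from the original slides), a few lines of Python show how the block size determines the number of blocks, and therefore roughly the number of map tasks, for a file of a given size:

import math

def num_blocks(file_size_bytes, block_size_bytes=128 * 1024 ** 2):
    # One HDFS block per 128 MB (or 64 MB) of file data; one map task per block.
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# A hypothetical 10 GB input file:
print(num_blocks(10 * 1024 ** 3))                   # 80 blocks with 128 MB blocks
print(num_blocks(10 * 1024 ** 3, 64 * 1024 ** 2))   # 160 blocks with 64 MB blocks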
Hadoop Cluster
• A set of machines running HDFS and MapReduce is known as a Hadoop cluster
  – Individual machines are known as nodes
  – A cluster can have as few as one node or as many as several thousand
• Single-node vs. multi-node clusters
• More nodes = better performance!
Hadoop Components: HDFS
• HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
• Data files are split into blocks and distributed across multiple nodes in the cluster
• Each block is replicated multiple times
  – Default is to replicate each block three times
  – Replicas are stored on different nodes
  – This ensures both reliability and availability
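
To make the replication rule concrete, here is a minimal Python sketch, an illustration only and not HDFS's actual placement policy (real HDFS is also rack-aware), that assigns three replicas of each block to distinct nodes:

import random

def place_replicas(block_ids, nodes, replication=3):
    # Assign each block to `replication` distinct nodes (simplified illustration).
    return {block: random.sample(nodes, replication) for block in block_ids}

nodes = ["datanode-%d" % i for i in range(1, 6)]
print(place_replicas(["blk_0001", "blk_0002"], nodes))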
Hadoop Components: MapReduce
• MapReduce is the system used to process data in the Hadoop cluster
• Consists of mainly two phases: Map, and then Reduce
• Each Map task operates on a discrete portion of the overall dataset
  – Typically one HDFS data block
• After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase
HDFS Basic Concepts
• HDFS is a filesystem written in Java
  – Based on Google's GFS
• Sits on top of a native filesystem
  – ext3, xfs, etc.
• Provides redundant storage for massive amounts of data
Where Does Data Come From?
• Science
– Medical imaging, sensor data, genome sequencing, weather data, satellite
feeds, etc.
• Industry
– Financial, pharmaceutical, manufacturing, insurance, online, energy, retail
data
• Legacy
– Sales data, customer behavior, product databases, accounting data, etc.
• System Data
– Log files, health & status feeds, activity streams, network messages, Web
Analytics, intrusion detection, spam filters
Analyzing Data: The Challenges
• Huge volumes of data
• Mixed sources result in many different formats
  – XML
  – CSV
  – EDI
  – Logs
  – Objects
  – SQL
  – Text
  – JSON
  – Binary
  – Etc.
What is Common Across Hadoop-able Problems?
• Nature of the data
  – Complex data
  – Multiple data sources
  – Lots of it
• Nature of the analysis
  – Batch processing
  – Parallel execution
  – Spread data over a cluster of servers and take the computation to the data
Benefits of Analyzing With Hadoop
• Previously impossible/impractical to do this analysis
• Analysis conducted at lower cost
• Analysis conducted in less time
• Greater flexibility
• Linear scalability
What Analysis is Possible With Hadoop?
• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk assessment
HDFS
Goals of HDFS
• Very large distributed file system
  – 10K nodes, 100 million files, 10 PB
• Convenient cluster management
  – Load balancing
  – Node failures
  – Cluster expansion
• Optimized for batch processing
  – Allows moving computation to the data
  – Maximizes throughput
HDFS Architecture
[Diagram]
HDFS Details
• Data coherency
  – Write-once-read-many access model
  – Clients can only append to existing files
• Files are broken up into blocks
  – Typically 128 MB block size
  – Each block replicated on multiple DataNodes
• Intelligent client
  – Client can find the location of blocks
  – Client accesses data directly from the DataNode
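
These properties (write-once files, append-only updates, large blocks) are usually exercised through the hdfs dfs command line. A hedged sketch that drives that CLI from Python; the paths and file names are hypothetical, and it assumes a configured Hadoop client is on the PATH:

import subprocess

def hdfs(*args):
    # Run an `hdfs dfs` subcommand and return its output.
    return subprocess.run(["hdfs", "dfs", *args], check=True,
                          capture_output=True, text=True).stdout

hdfs("-mkdir", "-p", "/user/demo/logs")                            # hypothetical path
hdfs("-put", "access.log", "/user/demo/logs/")                     # write once
hdfs("-appendToFile", "more.log", "/user/demo/logs/access.log")    # appends are allowed
print(hdfs("-ls", "/user/demo/logs"))                              # listing comes from NameNode metadata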
Single Node Architecture
[Diagram: a single machine with CPU, memory, and disk, used for machine learning, statistics, and "classical" data mining.]
Motivation: Google Example
• 20+ billion web pages × 20 KB = 400+ TB
• One computer reads 30-35 MB/sec from disk
  – ~4 months just to read the web
  – ~1,000 hard drives just to store the web
• It takes even more to do something useful with the data!
• Today, a standard architecture for such problems is emerging:
  – Cluster of commodity Linux nodes
  – Commodity network (Ethernet) to connect them
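
The slide's back-of-the-envelope figures are easy to verify; a short Python calculation, assuming a read rate of about 35 MB/sec, shows both the single-machine cost and what a 1,000-node cluster buys:

TOTAL_BYTES = 20e9 * 20e3      # 20+ billion pages x 20 KB each = 400+ TB
READ_RATE = 35e6               # ~35 MB/sec from a single disk

seconds = TOTAL_BYTES / READ_RATE
print("one machine: %.0f days (~%.1f months) just to read the web"
      % (seconds / 86400, seconds / (86400 * 30)))
print("1,000 nodes reading in parallel: ~%.1f hours" % (seconds / 1000 / 3600))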
Cloud
[Diagram]
MapReduce
• Challenges:
  – How to distribute computation?
  – Distributed/parallel programming is hard
• Map-reduce addresses all of the above
  – Google's computational/data manipulation model
  – An elegant way to work with big data
What is MapReduce?
• A simple data-parallel programming model and framework
  – Designed for scalability and fault tolerance
• Pioneered by Google
  – Processes around 20 petabytes of data per day
• Popularized by the open-source Hadoop project
  – Used at Yahoo!, Facebook, Amazon
• MapReduce design goals
  – Scalability to large data volumes: 1,000s of machines, 10,000s of disks
  – Cost efficiency: commodity machines (cheap, but unreliable), commodity network, automatic fault tolerance, easy to use
Cluster Architecture
[Diagram: each rack contains 16-64 nodes (CPU, memory, disk) connected through a rack switch, giving 1 Gbps between any pair of nodes in a rack; a 2-10 Gbps backbone connects the racks.]
In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO


Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Large-scale Computing
• Large-scale computing for data mining problems on commodity hardware
• Challenges:
  – How do you distribute computation?
  – How can we make it easy to write distributed programs?
  – Machines fail:
    • One server may stay up 3 years (1,000 days)
    • If you have 1,000 servers, expect to lose 1 per day
    • People estimated Google had ~1M machines in 2011, so ~1,000 machines fail every day!
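
The failure arithmetic above can be reproduced in a couple of lines, assuming failures are spread roughly evenly over a 1,000-day average server lifetime:

def expected_failures_per_day(num_servers, mean_uptime_days=1000):
    # Roughly 1/1000 of the fleet fails on any given day.
    return num_servers / mean_uptime_days

print(expected_failures_per_day(1_000))      # ~1 failure per day
print(expected_failures_per_day(1_000_000))  # ~1,000 failures per day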
Idea and Solution
• Issue: copying data over a network takes time
• Idea:
  – Bring computation close to the data
  – Store files multiple times for reliability
• Map-reduce addresses these problems
  – Storage infrastructure: a distributed file system (Google: GFS, Hadoop: HDFS)
  – Programming model: Map-Reduce
Storage Infrastructure
• Problem: if nodes fail, how do we store data persistently?
• Answer: a distributed file system
  – Provides a global file namespace
  – Google GFS; Hadoop HDFS
• Typical usage pattern
  – Huge files (100s of GB to TB)
  – Data is rarely updated in place
  – Reads and appends are common
Distributed File System
• Chunk servers
  – File is split into contiguous chunks
  – Typically each chunk is 16-64 MB
  – Each chunk replicated (usually 2x or 3x)
  – Try to keep replicas in different racks
• Master node
  – a.k.a. Name Node in Hadoop's HDFS
  – Stores metadata about where files are stored
  – Might be replicated
• Client library for file access
  – Talks to master to find chunk servers
  – Connects directly to chunk servers to access data
Distributed File System
• Reliable distributed file system
• Data kept in "chunks" spread across machines
• Each chunk replicated on different machines
  – Seamless recovery from disk or machine failure
[Diagram: chunks C0, C1, C2, C3, C5, D0, D1 replicated across chunk servers 1 through N]
• Bring computation directly to the data!
  – Chunk servers also serve as compute servers
Map Reduce Programming Model
• Input data type: a file of key-value records
Example [diagram]
More Specifically
• Input: a set of key-value pairs
• Programmer specifies two methods:
  – Map(k, v) → <k', v'>*
    • Takes a key-value pair and outputs a set of key-value pairs
    • E.g., key is the filename, value is a single line in the file
    • There is one Map call for every (k, v) pair
  – Reduce(k', <v'>*) → <k', v''>*
    • All values v' with the same key k' are reduced together and processed in v' order
    • There is one Reduce function call per unique key k'
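
The signatures above can be mimicked in plain Python. The following is a single-process sketch of the programming model only (no distribution, fault tolerance, or spilling to disk); the helper name map_reduce is ours, not part of any Hadoop API:

from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: one map_fn call per input (key, value) pair.
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce phase: one reduce_fn call per unique intermediate key.
    output = []
    for out_key in sorted(intermediate):
        output.extend(reduce_fn(out_key, intermediate[out_key]))
    return output

The word-count walkthrough on the following slides plugs directly into this skeleton.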
MapReduce: The Map Step
[Diagram: each map call transforms an input key-value pair into a set of intermediate key-value pairs.]
MapReduce: The Reduce Step
[Diagram: intermediate key-value pairs are grouped by key into key-value groups; each reduce call takes one group and emits output key-value pairs.]
MapReduce: Word Counting
Input: a big document, e.g. "The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need …'"

MAP (provided by the programmer): read the input and produce a set of key-value pairs; the data is read sequentially
  (The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), …

Group by key: collect all pairs with the same key; only sequential reads
  (crew, 1) (crew, 1); (space, 1); (the, 1) (the, 1) (the, 1); (shuttle, 1); (recently, 1); …

Reduce (provided by the programmer): collect all values belonging to the key and output
  (crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), …
Word Count Using MapReduce
map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
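
The pseudocode above maps naturally onto Hadoop Streaming, where the mapper and reducer are ordinary scripts that read stdin and write tab-separated key-value pairs. A minimal sketch; the script name and the way it is wired into the streaming job are illustrative, and it assumes the streaming jar shipped with your Hadoop distribution:

#!/usr/bin/env python3
# wordcount_streaming.py -- acts as mapper ("map") or reducer ("reduce").
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)                                 # emit(w, 1)

def reducer():
    # Hadoop Streaming delivers the mapper output sorted by key.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print("%s\t%d" % (word, sum(int(c) for _, c in group)))   # emit(key, result)

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

It would typically be launched with the hadoop-streaming jar, passing this script as both the -mapper and the -reducer.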
MapReduce: Overview
• Sequentially read a lot of data
• Map: extract something you care about
• Group by key: sort and shuffle
• Reduce: aggregate, summarize, filter, or transform
• Write the result

The outline stays the same; Map and Reduce change to fit the problem
Problems Suited for Map-Reduce
Example: Host Size
• Suppose we have a large web corpus
• Look at the metadata file
  – Lines of the form: (URL, size, date, …)
• For each host, find the total number of bytes
  – That is, the sum of the page sizes for all URLs from that particular host (see the sketch after this list)
• Other examples:
  – Link analysis and graph processing
  – Machine learning algorithms
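
For the host-size example, only the map and reduce functions need to be written; a sketch that reuses the toy map_reduce helper from the "More Specifically" section (the tab-separated metadata layout is an assumption for illustration):

from urllib.parse import urlparse

def host_size_map(_, line):
    # line assumed to look like "URL<TAB>size<TAB>date ..."
    url, size = line.split("\t")[:2]
    yield urlparse(url).netloc, int(size)

def host_size_reduce(host, sizes):
    yield host, sum(sizes)        # total bytes for all pages from this host

records = [(None, "http://example.com/a\t1200\t2012-04-01"),
           (None, "http://example.com/b\t800\t2012-04-02"),
           (None, "http://other.org/x\t500\t2012-04-01")]
print(map_reduce(records, host_size_map, host_size_reduce))
# [('example.com', 2000), ('other.org', 500)]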
Example: Language Model
• Statistical machine translation:
  – Need to count the number of times every 5-word sequence occurs in a large corpus of documents
• Very easy with MapReduce:
  – Map: extract (5-word sequence, count) from document
  – Reduce: combine the counts
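
Compared with word count, only the map function changes: it emits 5-word sequences instead of single words. A small sketch with deliberately naive whitespace tokenization:

def five_gram_map(_, document_text):
    words = document_text.split()
    for i in range(len(words) - 4):
        yield " ".join(words[i:i + 5]), 1   # emit (5-word sequence, 1)

def count_reduce(ngram, counts):
    yield ngram, sum(counts)                # combine the counts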
Map Reduce Framework
Map-Reduce: Environment
The Map-Reduce environment takes care of:
• Partitioning the input data
• Scheduling the program's execution across a set of machines
• Performing the group-by-key step
• Handling machine failures
• Managing required inter-machine communication
Map-Reduce: A Diagram
Big document → MAP (read input and produce a set of key-value pairs) → Group by key (collect all pairs with the same key: hash merge, shuffle, sort, partition) → Reduce (collect all values belonging to the key and output)
Map-Reduce: In Parallel
[Diagram: all phases are distributed, with many map and reduce tasks doing the work in parallel.]
Combiners
• A combiner is a local aggregation function for repeated keys produced by the same map
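
A combiner performs the same kind of aggregation as the reducer, but locally on one mapper's output, so far fewer (key, value) pairs cross the network. A minimal sketch of the idea as in-mapper pre-aggregation for word count (this illustrates the concept rather than Hadoop's Combiner API):

import sys
from collections import Counter

def mapper_with_local_aggregation():
    # Pre-sum counts per word before emitting anything.
    local_counts = Counter()
    for line in sys.stdin:
        local_counts.update(line.split())
    for word, count in local_counts.items():
        print("%s\t%d" % (word, count))

In Hadoop itself, the reducer is often registered as the job's combiner when the reduce function is associative and commutative, as it is for word count.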
Coordination: Master
• Tracks task status (idle, in progress, completed)
• Pings the worker nodes to get their state
• When mapping is complete, this information is sent to the master so it can schedule the reduce tasks
Eight Common Hadoop-able Problems
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data “sandbox”
1. Modelling True Risk
• Challenge:
  – How much risk exposure does an organization really have with each customer?
  – Multiple sources of data, across multiple lines of business
• Solution with Hadoop:
  – Source and aggregate disparate data sources to build a data picture
    • e.g. credit card records, call recordings, chat sessions, emails, banking activity
  – Structure and analyze
    • Sentiment analysis, graph creation, pattern recognition
• Typical industry:
  – Financial services (banks, insurance companies)
2. Customer Churn Analysis
• Challenge:
  – Why is an organization really losing customers?
  – Data on these factors comes from different sources
• Solution with Hadoop:
  – Rapidly build a behavioral model from disparate data sources
  – Structure and analyze with Hadoop
    • Traversing
    • Graph creation
    • Pattern recognition
• Typical industry:
  – Telecommunications, financial services
3. Recommendation Engine / Ad Targeting
• Challenge:
  – Using user data to predict which products to recommend
• Solution with Hadoop:
  – Batch processing framework
    • Allows execution in parallel over large datasets
  – Collaborative filtering
    • Collecting 'taste' information from many users
    • Utilizing that information to predict what similar users like
• Typical industry:
  – Ecommerce, manufacturing, retail, advertising
4. Point of Sale Transaction Analysis
• Challenge:
  – Analyzing Point of Sale (PoS) data to target promotions and manage operations
  – Sources are complex and data volumes grow across chains of stores and other sources
• Solution with Hadoop:
  – Batch processing framework
    • Allows execution in parallel over large datasets
  – Pattern recognition
    • Optimizing over multiple data sources
    • Utilizing information to predict demand
• Typical industry:
  – Retail
5. Analyzing Network Data to Predict Failure
• Challenge:
  – Analyzing real-time data series from a network of sensors
  – Calculating average frequency over time is extremely tedious because of the need to analyze terabytes
• Solution with Hadoop:
  – Take the computation to the data
    • Expand from simple scans to more complex data mining
  – Better understand how the network reacts to fluctuations
    • Discrete anomalies may, in fact, be interconnected
  – Identify leading indicators of component failure
• Typical industry:
  – Utilities, telecommunications, data centers
6. Threat Analysis / Trade Surveillance
• Challenge:
  – Detecting threats in the form of fraudulent activity or attacks
  – Large data volumes involved
  – Like looking for a needle in a haystack
• Solution with Hadoop:
  – Parallel processing over huge datasets
  – Pattern recognition to identify anomalies, i.e., threats
• Typical industry:
  – Security, financial services
  – General: spam fighting, click fraud
7. Search Quality
• Challenge:
  – Providing meaningful search results in real time
• Solution with Hadoop:
  – Analyzing search attempts in conjunction with structured data
  – Pattern recognition
    • Browsing patterns of users performing searches in different categories
• Typical industry:
  – Web, ecommerce
8. Data "Sandbox"
• Challenge:
  – Data deluge
  – Don't know what to do with the data or what analysis to run
• Solution with Hadoop:
  – "Dump" all this data into an HDFS cluster
  – Use Hadoop to start trying out different analyses on the data
  – See patterns and derive value from the data
• Typical industry:
  – Common across all industries
THANK YOU
