
Cloud Computing Module 4


MODULE 4

DATA-INTENSIVE COMPUTING:
MAPREDUCE PROGRAMMING
WHAT IS DATA-INTENSIVE COMPUTING?
• Data-intensive computing focuses on a class of applications that
deal with a large amount of data.
• Data-intensive computing is concerned with the production,
manipulation, and analysis of large-scale data, in the range of
hundreds of megabytes (MB) to petabytes (PB) and beyond.
• The term dataset is commonly used to identify a collection of
information elements relevant to one or more applications.
• Datasets are often maintained in repositories, which are
infrastructures supporting the storage, retrieval, and indexing of
large amounts of information.
• To facilitate classification and search, relevant bits of
information, called metadata, are attached to datasets.
DATA-INTENSIVE RESEARCH ISSUES

CHALLENGES IN DATA-INTENSIVE COMPUTING
• The huge amount of data produced, analyzed, or stored imposes
requirements on the supporting infrastructure and middleware.
• Moving terabytes of data becomes an obstacle for high-performance
computations.
• Scalable algorithms that can search and process massive datasets.
• New metadata management technologies that can scale to handle
complex, heterogeneous, and distributed data sources.
• Advances in high-performance computing platforms aimed at
providing better support for accessing in-memory multi-terabyte
data structures.
• High-performance, highly reliable, petascale distributed file
systems.
CHALLENGES IN DATA-INTENSIVE COMPUTING (cont'd)
• Data signature-generation techniques for data reduction and rapid
processing.
• New approaches to software mobility for delivering algorithms that
are able to move the computation to where the data are located.
• Specialized hybrid interconnection architectures that provide
better support for filtering multi-gigabyte data streams coming
from high-speed networks and scientific instruments.
• Flexible and high-performance software integration techniques that
facilitate the combination of software modules running on different
platforms to quickly form analytical pipelines.
HISTORICAL PERSPECTIVE
1. The early age: high-speed wide-area networking
• In 1989, the first experiments in high-speed networking as a
support for remote visualization of scientific data led the way.
• The potential of using high-speed wide-area networks for enabling
high-speed, TCP/IP-based distributed applications was demonstrated
at Supercomputing 1991.
• The Kaiser project made high-data-rate instruments and online
systems available as remote data sources. It also provided:
  • Automatic generation of metadata
  • Automatic cataloguing of data and metadata while processing the
data in real time
  • Facilitation of cooperative research by providing local and
remote users with access to data
  • Mechanisms to incorporate data into databases and other documents
• The MAGIC project, with its Distributed Parallel Storage System
(DPSS), constituted the first data-intensive computing environment.


HISTORICAL PERSPECTIVE
2. Data grids
• With the advent of grid computing, huge computational power and
storage facilities could be obtained by harnessing heterogeneous
resources across different administrative domains.
• Data grids offer two main functionalities:
  • High-performance and reliable file transfer for moving large
amounts of data
  • Scalable replica discovery and management mechanisms for easy
access to distributed datasets
• Data grids provide storage and dataset management support for
scientific experiments.
• The heterogeneity of resources and the different administrative
domains are to be addressed with proper security measures.
DATA GRID REFERENCE SCENARIO

Data grids have their own characteristics and introduce new
challenges:
• Massive datasets: The size of datasets can easily be on the scale
of gigabytes, terabytes, and beyond, so it is necessary to minimize
latencies during bulk transfers and to replicate content.
• Shared data collections: Resource sharing includes distributed
collections of data. For example, repositories can be used to both
store and read data.
• Unified namespace: Data grids impose a unified logical namespace
in which to locate data collections and resources.
• Access restrictions: Even though data grids facilitate sharing of
results and data for experiments, some users might want to ensure
confidentiality for their data and restrict access to their
collaborators.
• Usage examples: high-energy physics, biology, and astronomy.
3. Data clouds and "Big Data"
• Large datasets have traditionally been a concern of the scientific
computing domain.
• Big Data is a phrase used to mean a massive volume of data that,
by its nature, is difficult to process using traditional database
and software techniques.
• Big Data applies to datasets whose size is beyond the ability of
commonly used software tools to capture, manage, and process within
a tolerable elapsed time.
• Cloud technologies support data-intensive computing in several
ways:
  • By providing a large amount of compute instances on demand to
process and analyze large datasets in parallel.
  • By providing a storage system optimized for keeping large blobs
of data, along with other distributed data store architectures.
  • By providing frameworks and programming APIs optimized for the
processing and management of large amounts of data.
A data cloud is a combination of these components. Some examples:
• Ex 1: The MapReduce framework, which provides the best performance
for leveraging the Google File System on top of Google's large
computing infrastructure.
• Ex 2: The Hadoop system, the most mature, large, and open-source
data cloud. It consists of HDFS and Hadoop's implementation of
MapReduce.
• Ex 3: Sector, which consists of the Sector Distributed File System
(SDFS) and a compute service called Sphere that allows users to
execute arbitrary user-defined functions (UDFs) over the data
managed by SDFS.
• Ex 4: Greenplum, which uses a shared-nothing massively parallel
processing (MPP) architecture based on commodity hardware.
4. Databases and data-intensive computing
• Distributed databases have been considered the natural evolution
of database management systems.
• A distributed database is a collection of data stored at different
sites of a computer network.
• Each site may provide services for the execution of local
applications and also participate in the execution of global
applications.
• A distributed database can be created by splitting and scattering
the data of an existing database over different sites or by
federating together multiple existing databases.
• These systems are very robust and provide distributed transaction
processing, distributed query optimization, and efficient management
of resources.
TECHNOLOGIES FOR DATA-INTENSIVE COMPUTING
• Data-intensive computing is concerned with the development of
applications focused on large amounts of data.
• Storage systems and programming models provide a natural
classification of the supporting technologies.

Storage systems
Several trends have driven the evolution of storage systems:
• Growing popularity of Big Data
• Growing importance of data analytics in the business chain
• Presence of data in several forms, not only structured
• New approaches and technologies for computing
High-performance distributed file systems and storage clouds
• Distributed file systems constitute the primary support for data
management.
• They provide an interface to store information in the form of
files and to access them.
• Most of these file systems provide the huge data storage and
transfer support needed by large computing clusters, supercomputers,
massively parallel architectures and, lately, storage/computing
clouds.
• Some notable file systems are:
  • Lustre
  • IBM General Parallel File System (GPFS)
  • Google File System (GFS)
  • Sector
  • Amazon Simple Storage Service (S3)
Lustre
• The Lustre file system is a massively parallel distributed file
system that covers the needs of anything from a small workgroup of
clusters to a large-scale computing cluster.
• The file system is used by several of the Top 500 supercomputing
systems.
• Lustre provides access to petabytes (PB) of storage and serves
thousands of clients with an I/O throughput of hundreds of gigabytes
per second (GB/s).
• The system is composed of a metadata server and a collection of
object storage servers.
• The file system implements a robust failover strategy and recovery
mechanism, making server failures and recoveries transparent to
clients.
IBM General Parallel File System (GPFS)
• GPFS is a high-performance distributed file system developed by
IBM that provides support for the RS/6000 supercomputer and Linux
computing clusters.
• GPFS is a multiplatform distributed file system built to provide
advanced recovery mechanisms.
• GPFS is built on the concept of shared disks.
• The file system makes this infrastructure transparent to users; it
stripes large files over the disk array, and replication ensures
high availability.
• GPFS distributes the metadata of the entire file system, thus
eliminating a single point of failure.
Google File System (GFS)
• GFS is the storage infrastructure that supports the execution of
distributed applications in Google's computing cloud.
• The system is a fault-tolerant, highly available distributed file
system built on commodity hardware and standard Linux operating
systems.
• GFS is designed with the following assumptions:
  • The system is built on top of commodity hardware that often
fails.
  • The system stores a modest number of large files; multi-GB files
are common and should be treated efficiently, while small files must
be supported but there is no need to optimize for them.
  • The workloads primarily consist of two kinds of reads: large
streaming reads and small random reads.
  • The workloads also have many large, sequential writes that
append data to files.
  • High sustained bandwidth is more important than low latency.
Google File System (GFS) cont'd
• The architecture of the file system is organized into a single
master and a collection of chunk servers.
• From a logical point of view, the system is composed of a
collection of software daemons, which implement either the master
server or the chunk server.
• A file is a collection of chunks, and chunks are replicated on
multiple nodes in order to tolerate failures.
• Clients look up the master server to identify the specific chunk
to access. Once the chunk is identified, the interaction happens
directly with the chunk server.
• Applications interact with the file system through a specific
interface supporting the usual operations for file creation,
deletion, read, and write.
• The interface also supports snapshots and record append.
• Specific attention has been given to implementing a highly
available, lightweight, and fault-tolerant infrastructure.
Sector
• Sector is the storage cloud that supports the execution of
data-intensive applications defined according to the Sphere
framework.
• It is a user-space file system that can be deployed on commodity
hardware across a wide-area network.
• The system's architecture is composed of four types of nodes: a
security server, one or more master nodes, slave nodes, and client
machines.
• The security server maintains all the information about access
control policies for users and files, whereas master servers
coordinate and serve the I/O requests of clients, which ultimately
interact with slave nodes to access files.
• The protocol used to exchange data with slave nodes is UDT.
Amazon Simple Storage Service (S3)
• Amazon S3 is the online storage service provided by Amazon.
• The system is claimed to support high availability, reliability,
scalability, infinite storage, and low latency at commodity cost.
• The system offers a flat storage space organized into buckets
within an AWS account.
• Each bucket can store multiple objects, each identified by a
unique key.
• Objects are identified by unique URLs and exposed through HTTP,
thus allowing very simple get-put semantics.
• A POSIX-like client library has been developed to mount S3 buckets
as part of the local file system.
• The visibility and accessibility of objects are linked to AWS
accounts.
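
The get-put semantics described above can be illustrated with a short
sketch. The example below uses the AWS SDK for Java (v1) and assumes
that credentials and a default region are already configured in the
environment; the bucket name and object key are hypothetical.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.S3Object;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;

    public class S3PutGetExample {
        public static void main(String[] args) throws Exception {
            // Assumes AWS credentials and a default region are configured
            // in the environment; bucket and key names are hypothetical.
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // PUT: store an object in a bucket under a unique key.
            s3.putObject("example-bucket", "datasets/sample.txt", "hello, storage cloud");

            // GET: retrieve the same object through its bucket/key pair.
            S3Object object = s3.getObject("example-bucket", "datasets/sample.txt");
            try (InputStream in = object.getObjectContent()) {
                System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }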
NOSQL SYSTEMS
• Not Only SQL (NoSQL) systems provide a mechanism for the storage
and retrieval of data that is modeled in means other than the
tabular relations used in relational databases.
• Originally, NoSQL was a collection of scripts that allowed users
to manage most of the simplest and more common database tasks by
using text files as information stores.
• Two main factors have determined the growth of NoSQL:
  • Simple data models are often enough to represent the information
used by applications.
  • The quantity of information contained in unstructured formats
has grown considerably.
A broad classification of NoSQL implementations includes:
• Document stores (Apache Jackrabbit, Apache CouchDB, SimpleDB,
Terrastore).
• Graph databases (AllegroGraph, Neo4j, FlockDB, Cerebrum).
• Key-value stores.
• Multivalue databases (OpenQM, Rocket U2, OpenInsight).
• Object databases (ObjectStore, JADE, ZODB).
• Tabular stores (Google Bigtable, Hadoop HBase, Hypertable).
• Tuple stores (Apache River).
Apache CouchDB and MongoDB
• Apache CouchDB and MongoDB are two examples of document stores.
• Both provide a schema-less store where the primary objects are
documents organized into collections of key-value fields.
• The value of each field can be a string, integer, float, date, or
an array of values.
• The databases expose a RESTful interface and represent data in
JSON format.
• Both allow querying and indexing data by using the MapReduce
programming model.
• Both support data replication and high availability.
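
As a small illustration of the document-store model described above,
the sketch below stores and queries a schema-less document of
key-value fields through the MongoDB Java driver; the connection
string, database, collection, and field names are assumptions made
only for the example.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    import static com.mongodb.client.model.Filters.eq;

    public class DocumentStoreExample {
        public static void main(String[] args) {
            // Assumes a MongoDB instance reachable at the default local address;
            // database, collection, and field names are hypothetical.
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoDatabase db = client.getDatabase("demo");
                MongoCollection<Document> logs = db.getCollection("logs");

                // A document is a schema-less collection of key-value fields.
                logs.insertOne(new Document("component", "scheduler")
                        .append("level", "INFO")
                        .append("timestamp", 1700000000L));

                // Query the collection by field value.
                Document first = logs.find(eq("component", "scheduler")).first();
                System.out.println(first == null ? "not found" : first.toJson());
            }
        }
    }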
Amazon Dynamo
• The main goal of Dynamo is to provide an incrementally scalable
and highly available storage system.
• Dynamo provides a simplified interface based on get/put semantics,
where objects are stored and retrieved with a unique identifier (key).
• The architecture of the Dynamo system is composed of a collection
of storage peers organized in a ring that shares the key space for a
given application.
• The key space is partitioned among the storage peers, and the keys
are replicated across the ring.
• Each peer is configured with access to a local storage facility
where original objects and replicas are stored.
• Each node provides facilities for distributing updates around the
ring and for detecting failures and unreachable nodes.
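
A minimal sketch, in plain Java, of how a key space can be
partitioned among storage peers organized in a ring, in the spirit of
Dynamo's design; this illustrates the idea only and is not Dynamo's
actual implementation, and the hash function and peer names are
arbitrary choices.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    /** Illustrative key-to-peer mapping on a hash ring (not Dynamo's code). */
    public class HashRing {
        private final TreeMap<Integer, String> ring = new TreeMap<>();

        public void addPeer(String peer) {
            ring.put(hash(peer), peer);        // each peer owns a position on the ring
        }

        public String peerFor(String key) {
            // Walk clockwise: the first peer at or after the key's position is responsible.
            SortedMap<Integer, String> tail = ring.tailMap(hash(key));
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }

        private static int hash(String s) {
            try {
                byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
                return ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16) | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }

        public static void main(String[] args) {
            HashRing ring = new HashRing();
            ring.addPeer("peer-A");
            ring.addPeer("peer-B");
            ring.addPeer("peer-C");
            System.out.println("user:42 -> " + ring.peerFor("user:42"));
        }
    }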
GOOGLE BIGTABLE
• Bigtable is the distributed storage system designed to scale up to
petabytes of data across thousands of servers.
• Bigtable provides storage support for several Google applications
that expose different types of workload.
• Bigtable's key design goals are wide applicability, scalability,
high performance, and high availability.
• Bigtable organizes the data storage in tables whose rows are
distributed over the distributed file system supporting the
middleware.
• From a logical point of view, a table is a multidimensional sorted
map.
• A table is organized into rows and columns.
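
The "multidimensional sorted map" abstraction can be pictured as
nested sorted maps from row key, column key, and timestamp to a cell
value. The sketch below is a conceptual illustration only, not
Bigtable's implementation; the row and column names are hypothetical.

    import java.util.Comparator;
    import java.util.TreeMap;

    /** Conceptual model: (row key, column key, timestamp) -> value, rows kept sorted. */
    public class SortedTableSketch {
        // row -> column -> timestamp (newest first) -> cell value
        private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> table = new TreeMap<>();

        public void put(String row, String column, long timestamp, String value) {
            table.computeIfAbsent(row, r -> new TreeMap<>())
                 .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
                 .put(timestamp, value);
        }

        /** Returns the most recent version of a cell, or null if absent. */
        public String latest(String row, String column) {
            TreeMap<String, TreeMap<Long, String>> columns = table.get(row);
            if (columns == null || !columns.containsKey(column)) return null;
            TreeMap<Long, String> versions = columns.get(column);
            return versions.isEmpty() ? null : versions.firstEntry().getValue();
        }

        public static void main(String[] args) {
            SortedTableSketch t = new SortedTableSketch();
            t.put("com.example/index.html", "contents:html", 1L, "<html>v1</html>");
            t.put("com.example/index.html", "contents:html", 2L, "<html>v2</html>");
            System.out.println(t.latest("com.example/index.html", "contents:html")); // prints v2
        }
    }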
GOOGLE BIGTABLE (cont'd)
• Bigtable identifies two kinds of processes: master processes and
tablet server processes.
• A tablet server is responsible for serving the requests for a
given tablet, which is a contiguous partition of rows of a table.
• Each tablet server can manage multiple tablets.
• The master server is responsible for keeping track of the status
of the tablet servers and of the allocation of tablets to tablet
servers.
• Chubby, a distributed, highly available, and persistent lock
service, supports the activity of the master and tablet servers.
• At the very bottom layer, the data are stored in the Google File
System in the form of files.
APACHE CASSANDRA
• Cassandra is a distributed object store for managing large amounts
of structured data spread across many commodity servers.
• It is designed to avoid a single point of failure and to offer a
highly reliable service.
• It provides storage support for several very large Web
applications.
• The data model exposed by Cassandra is based on the concept of a
table.
• In terms of infrastructure, Cassandra is very similar to Dynamo
and is designed for incremental scaling.
• The largest Cassandra deployment manages 100 TB of data
distributed over a cluster of 150 machines.
HADOOP HBASE
• HBase is the distributed database that supports the storage needs
of the Hadoop distributed programming platform.
• HBase's main goal is to offer real-time read/write operations for
tables with billions of rows and millions of columns by leveraging
clusters of commodity hardware.
• The internal architecture and logic model of HBase are very
similar to Google Bigtable, and the entire system is backed by the
Hadoop Distributed File System (HDFS).
PROGRAMMING PLATFORMS
• Platforms for programming data-intensive applications provide
abstractions that help express computations over large datasets.
• Traditional DBMSs have proven unsuccessful in the case of Big Data
that is unstructured or semistructured.
• Programming platforms focus on the processing of data and move the
management of data transfers into the runtime system, making the
data always available where needed.
• The MapReduce programming platform follows this approach: it
expresses the computation in the form of two simple functions, map
and reduce.
• It hides the complexities of managing large and numerous data
files in the distributed file system.
MAPREDUCE
• A MapReduce program is composed of a map method, which performs
filtering and sorting, and a reduce method, which performs a summary
operation.
• Data transfer and management are completely handled by the
distributed storage infrastructure.
• The computation of MapReduce applications is organized into a
workflow of map and reduce operations that is entirely controlled by
the runtime system.
• Developers need only specify how the map and reduce functions
operate on the key-value pairs.
• The MapReduce model is expressed in the form of the two functions:

  map(k1, v1) -> list(k2, v2)
  reduce(k2, list(v2)) -> list(v2)

• The map function reads a key-value pair and produces a list of
key-value pairs of possibly different types.
• The reduce function reads a pair composed of a key and a list of
values and produces a list of values of the same type.
• The types (k1, v1, k2, v2) used in the expression of the two
functions provide hints as to how these two functions are connected
and executed.
• The output of map tasks is aggregated by grouping the values
according to their corresponding keys; this constitutes the input of
the reduce tasks which, for each of the keys found, reduce the list
of attached values to a single value.
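
A minimal, framework-free sketch of this two-function model, using
word counting as the example: map emits (word, 1) pairs, the
intermediate pairs are grouped by key, and reduce collapses each list
of values into a single count.

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MiniMapReduce {
        // map(k1, v1) -> list(k2, v2): here, line number/text -> (word, 1)
        static List<Map.Entry<String, Integer>> map(Integer lineNo, String line) {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) out.add(new SimpleEntry<>(word, 1));
            }
            return out;
        }

        // reduce(k2, list(v2)) -> list(v2): here, (word, [1, 1, ...]) -> [count]
        static List<Integer> reduce(String word, List<Integer> values) {
            return List.of(values.stream().mapToInt(Integer::intValue).sum());
        }

        public static void main(String[] args) {
            List<String> input = List.of("the quick brown fox", "the lazy dog");

            // Map phase, followed by grouping of intermediate pairs by key.
            Map<String, List<Integer>> grouped = new HashMap<>();
            for (int i = 0; i < input.size(); i++) {
                for (Map.Entry<String, Integer> kv : map(i, input.get(i))) {
                    grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
                }
            }

            // Reduce phase: one call per distinct key.
            grouped.forEach((word, ones) -> System.out.println(word + " -> " + reduce(word, ones)));
        }
    }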

MAPREDUCE COMPUTATION WORKFLOW
• Some examples that show the flexibility of MapReduce are:
a) Distributed Grep
b) Count of URL-Access Frequency (a sketch of this example is given
after this list)
c) Reverse Web-Link Graph
d) Term-Vector per Host
e) Inverted Index
f) Distributed Sort
• These are text-based processing tasks, but MapReduce can also be
used for a wider range of problems.
• In general, any computation that can be expressed in two stages,
analysis and aggregation, can be represented in terms of a MapReduce
computation.
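
A sketch of example (b), count of URL-access frequency, written
against the Hadoop MapReduce API; the assumption that each log line
begins with the requested URL is made only for illustration. A
complete driver program is sketched later, in the Example
Applications section.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class UrlAccessFrequency {
        /** Map: emits (URL, 1) for each request log line. */
        public static class UrlMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Assumption: each log line starts with the requested URL.
                String[] fields = line.toString().split("\\s+");
                if (fields.length > 0 && !fields[0].isEmpty()) {
                    ctx.write(new Text(fields[0]), ONE);
                }
            }
        }

        /** Reduce: sums the access counts for each URL. */
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text url, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable c : counts) total += c.get();
                ctx.write(url, new IntWritable(total));
            }
        }
    }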
OVERVIEW OF A MAPREDUCE INFRASTRUCTURE
• Client libraries are in charge of submitting the input data files,
registering the map and reduce functions, and returning control to
the user once the job is completed.
• A generic distributed infrastructure (i.e., a cluster) equipped
with job-scheduling capabilities and distributed storage can be used
to run MapReduce applications.
• Two different kinds of processes are run on the distributed
infrastructure: a master process and worker processes.
• The master process is in charge of controlling the execution of
map and reduce tasks and of partitioning and reorganizing the
intermediate output produced by the map tasks in order to feed the
reduce tasks.
• The worker processes are used to host the execution of map and
reduce tasks and provide basic I/O facilities that are used to
interface the map and reduce tasks with input and output files.
VARIATIONS AND EXTENSIONS OF MAPREDUCE
• A collection of MapReduce-like frameworks exists; they differ from
the original MapReduce model:
a) Hadoop
b) Pig
c) Hive
d) Map-Reduce-Merge
e) Twister

ALTERNATIVES TO MAPREDUCE
a) Sphere
b) All-Pairs
c) DryadLINQ
ANEKA MAPREDUCE PROGRAMMING
• Aneka provides an implementation of the MapReduce abstractions by
following the reference model introduced by Google and implemented
by Hadoop.
• It defines the abstractions and runtime support for developing
MapReduce applications.
• The application instance is specialized with components that
identify the functions to use.
• Functions are expressed in terms of the Mapper and Reducer classes.
• The runtime support is composed of the MapReduce Scheduling
Service, the MapReduce Execution Service, and a specialized
distributed file system.
• The client component is the MapReduce Application.
ANEKA MAPREDUCE INFRASTRUCTURE

Three major components collaborate to execute MapReduce jobs:
• Programming abstractions
• Runtime support
• Distributed file system support

PROGRAMMING ABSTRACTIONS
Three major classes are used for application development:
• Mapper<K,V> and Reducer<K,V>: the starting point of application
design and implementation
• MapReduceApplication<M,R>: submission and execution of the
application
• The interface exposes only MapReduce-specific settings; the
general control logic is encapsulated in ApplicationBase<M>, from
which the application behavior can be set.
• The parameters that can be controlled are:
  • Partitions: an integer; the default value is 10.
  • Attempts: the number of retries for failed tasks; the default
value is 3.
  • UseCombiner: a Boolean value; set to true by default.
  • SynchReduce: a Boolean value; set to true by default.
  • IsInputReady: a Boolean value; set to false by default.
  • FetchResults: a Boolean value; set to true by default.
  • LogFile: a string; the default value is mapreduce.log.
RUNTIME SUPPORT
• The runtime support for the execution of MapReduce jobs comprises
the collection of services that deal with scheduling and executing
MapReduce tasks.
• These are the:
  • MapReduce Scheduling Service
  • MapReduce Execution Service

MAPREDUCE SCHEDULING SERVICE ARCHITECTURE

• The scheduling of jobs and tasks is the responsibility of the
MapReduce Scheduling Service, which plays the role of the master
process.
• The architecture of the Scheduling Service is organized into two
major components:
  • MapReduceSchedulerService: a wrapper around the scheduler,
implementing the interfaces required by Aneka to expose a software
component as a service.
  • MapReduceScheduler: controls the execution of jobs and schedules
tasks.
• The main role of the service wrapper is to translate messages
coming from the Aneka runtime or the client applications into calls
or events directed to the scheduler component, and vice versa.
• The scheduler manages multiple queues for several operations, such
as uploading input files into the distributed file system;
initializing jobs before scheduling; scheduling map and reduce tasks;
keeping track of unreachable nodes; resubmitting failed tasks; and
reporting execution statistics.
MAPREDUCE EXECUTION SERVICE ARCHITECTURE

• The execution of tasks is controlled by the MapReduce Execution
Service, which plays the role of the worker process.
• The service manages the execution of map and reduce tasks and
performs other operations, such as sorting and merging intermediate
files.
• Three major components coordinate for executing tasks:
  • MapReduceSchedulerService
  • ExecutorManager
  • MapReduceExecutor
• The MapReduceSchedulerService interfaces the ExecutorManager with
the Aneka middleware.
• The ExecutorManager is in charge of keeping track of the tasks
being executed, by demanding the specific execution of a task to the
MapReduceExecutor, and of sending statistics about the execution
back to the Scheduling Service.
DISTRIBUTED FILE SYSTEM SUPPORT
• Aneka supports the MapReduce model through a distributed file
system (DFS) implementation.
• DFS implementations guarantee high availability and better
efficiency by means of replication and distribution.
• MapReduce requires the ability to perform the following tasks:
  • Retrieving the location of files and file chunks
  • Accessing a file by means of a stream
• The MapReduce programming model offers classes to read from and
write to files in a sequential manner: the SeqReader and SeqWriter
classes.
• An Aneka MapReduce file is composed of a header, used to identify
the file, and a sequence of record blocks, each storing a key-value
pair. The header is composed of 4 bytes: the first 3 bytes represent
the character sequence SEQ and the fourth byte identifies the
version of the file.
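
A small sketch that writes such a header (the characters SEQ followed
by a version byte) is given below. The layout used for the record
block that follows the header, a length-prefixed key and value, is an
assumption made only for illustration, since the layout of record
blocks is not specified above.

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class SeqFileHeaderSketch {
        public static void main(String[] args) throws IOException {
            try (DataOutputStream out = new DataOutputStream(new FileOutputStream("part-0.seq"))) {
                // Header: 3 bytes "SEQ" followed by 1 byte for the file-format version.
                out.write("SEQ".getBytes(StandardCharsets.US_ASCII));
                out.writeByte(1);

                // One record block (hypothetical layout): length-prefixed key, then value.
                byte[] key = "example-key".getBytes(StandardCharsets.UTF_8);
                byte[] value = "example-value".getBytes(StandardCharsets.UTF_8);
                out.writeInt(key.length);
                out.write(key);
                out.writeInt(value.length);
                out.write(value);
            }
        }
    }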
EXAMPLE APPLICATIONS
• MapReduce is a very useful model for processing large quantities
of data, such as logs or Web pages.
• To demonstrate how to program real applications with Aneka
MapReduce, a very common task, parsing Aneka logs, is considered
through the following steps (an equivalent sketch is given after
this list):
  • Mapper design and implementation
  • Reducer design and implementation
  • Driver program
  • Running the application
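
Since Aneka's own classes are exposed through its .NET APIs, the
sketch below shows the same overall structure, mapper, reducer, and
driver program, for a log-parsing job that counts entries per log
level, using the Hadoop Java API instead; the assumed "LEVEL message"
line format is an illustration only.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogParsingJob {
        /** Mapper: emits (log level, 1) per line; assumes a "LEVEL message" format. */
        public static class LogLevelMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split("\\s+", 2);
                if (fields.length > 0 && !fields[0].isEmpty()) ctx.write(new Text(fields[0]), ONE);
            }
        }

        /** Reducer: sums the occurrences of each log level. */
        public static class LogLevelReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text level, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable c : counts) total += c.get();
                ctx.write(level, new IntWritable(total));
            }
        }

        /** Driver: registers the functions, sets input/output paths, and submits the job. */
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "log-parsing");
            job.setJarByClass(LogParsingJob.class);
            job.setMapperClass(LogLevelMapper.class);
            job.setReducerClass(LogLevelReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }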

