Cloud COMPUTING Module 4
WHAT IS DATA-INTENSIVE COMPUTING?
Data-intensive computing focuses on a class of applications that
deal with large volumes of data.
CHALLENGES IN DATA-INTENSIVE COMPUTING
Data grids have their own characteristics and introduce
new challenges:
Massive datasets: The size of datasets can easily reach
the scale of gigabytes, terabytes, and beyond, so it is
necessary to minimize latencies during bulk transfers
and to replicate content.
Shared data collections: Resource sharing includes
distributed collections of data. For example,
repositories can be used to both store and read data.
Unified namespace: Data grids impose a unified
logical namespace in which to locate data collections and
resources.
Access restrictions: Even though data grids facilitate
the sharing of results and experimental data, some users
might want to ensure the confidentiality of their data and
restrict access to their collaborators.
Example uses: high-energy physics, biology, and astronomy.
3. Data clouds and “BigData”
Storage systems
Google File System (GFS)
Key-value stores.
Apache CouchDB and MongoDB
Both are document stores: a document is a set of fields,
where a field can hold a scalar value or an array of values.
The databases expose a RESTful interface and
represent data in JSON format.
Both allow querying and indexing data by using the
MapReduce programming model.
Both support data replication and high availability.
Amazon Dynamo
The main goal of Dynamo is to provide an incrementally
scalable and highly available storage system.
Dynamo provides a simplified interface based on get/put
semantics, where objects are stored and retrieved with
a unique identifier (key).
The architecture of the Dynamo system is composed of
a collection of storage peers organized in a ring that
shares the key space for a given application.
The key space is partitioned among the storage peers,
and the keys are replicated across the ring.
Each peer is configured with access to a local storage
facility where original objects and replicas are stored.
Each node provides facilities for distributing updates
around the ring and for detecting failures and
unreachable nodes.
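The ring organization described above can be sketched with consistent hashing. This is a minimal single-process illustration; the class and peer names are hypothetical, and Dynamo's virtual nodes, vector clocks, and failure detection are omitted:

```python
import hashlib
from bisect import bisect_right

class DynamoRing:
    """Sketch of Dynamo-style key partitioning: storage peers are
    placed on a hash ring, each key is assigned to the first peer
    clockwise from its hash, and replicas go to the next peers."""

    def __init__(self, peers, replicas=3):
        self.replicas = replicas
        # Position each peer on the ring by hashing its name.
        self.ring = sorted((self._hash(p), p) for p in peers)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def preference_list(self, key):
        """Return the peers responsible for a key (coordinator + replicas)."""
        positions = [pos for pos, _ in self.ring]
        start = bisect_right(positions, self._hash(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(min(self.replicas, len(self.ring)))]

ring = DynamoRing(["peer-a", "peer-b", "peer-c", "peer-d"])
print(ring.preference_list("user:42"))  # three distinct peers own this key
```

Because the key-to-peer mapping depends only on hashes, adding a peer moves only the keys between it and its predecessor, which is what makes the scheme incrementally scalable.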
GOOGLE BIGTABLE
PROGRAMMING PLATFORMS
Platforms for programming data-intensive applications
provide abstractions that help express computations.
Traditional DBMSs have proven unsuccessful in the case of
Big Data that is unstructured or semistructured.
Programming platforms focus on the processing of data
and move the management of data transfers into the
runtime system, making the data always available.
The MapReduce programming platform follows this
approach, expressing the computation in the form
of two simple functions: map and reduce.
It hides the complexities of managing large and
numerous data files in the distributed file system.
MAPREDUCE
A MapReduce program is composed of a map method
which performs filtering and sorting, and a reduce
method which performs a summary operation.
Data transfer and management are completely handled
by the distributed storage infrastructure.
The computation of MapReduce applications is
organized into a workflow of map & reduce operations
that is entirely controlled by the runtime system.
Developers need only specify how the map and reduce
functions operate on the key-value pairs.
The MapReduce model is expressed in the form of two
functions:
The map function reads a key-value pair and
produces a list of key-value pairs of different types.
The reduce function reads a pair composed of a key
and a list of values and produces a list of values of the
same type.
The types (k1,v1,k2,v2) used in the expressions of the
two functions provide hints as to how these two
functions are connected and executed.
The output of the map tasks is aggregated by grouping the
values according to their corresponding keys; this
constitutes the input of the reduce tasks, which, for
each key found, reduce the list of attached values to a
single value.
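The grouping step described above can be illustrated with a minimal single-process sketch of word counting. The function names are illustrative; a real runtime distributes the map and reduce tasks across many nodes:

```python
from collections import defaultdict

def map_fn(_key, line):                 # (k1, v1) -> list of (k2, v2)
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):            # (k2, list of v2) -> single value
    return sum(counts)

def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):   # map phase emits key-value pairs
            groups[k2].append(v2)       # runtime groups values by key
    # reduce phase folds each value list into a single value
    return {k2: reduce_fn(k2, vs) for k2, vs in groups.items()}

docs = [(0, "the quick fox"), (1, "the lazy dog")]
print(map_reduce(docs, map_fn, reduce_fn))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Note how the types line up with the model: map consumes (k1,v1) pairs, emits (k2,v2) pairs, and reduce consumes a key together with the list of all values grouped under it.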
MAPREDUCE COMPUTATION WORKFLOW
Some examples that show the flexibility of
MapReduce:
a) Distributed Grep
b) Inverted Index
c) Distributed Sort
Variations and extensions of MapReduce:
a) Pig
b) Hive
c) Map-Reduce-Merge
d) Twister
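As an illustration of one of the examples listed above, here is a single-process sketch of an inverted index: map emits (word, document-id) pairs and reduce collects the sorted list of documents containing each word. Names are illustrative; a real MapReduce runtime would distribute both phases:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit each distinct word of the document once, keyed by word.
    return [(word, doc_id) for word in set(text.split())]

def reduce_fn(word, doc_ids):
    # Collapse the grouped document ids into a sorted posting list.
    return sorted(doc_ids)

def inverted_index(documents):
    groups = defaultdict(list)
    for doc_id, text in documents:
        for word, d in map_fn(doc_id, text):
            groups[word].append(d)
    return {w: reduce_fn(w, ids) for w, ids in groups.items()}

docs = [("d1", "cloud data cloud"), ("d2", "data grid")]
index = inverted_index(docs)
print(index["data"])  # ['d1', 'd2']
```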
ALTERNATIVES TO MAPREDUCE
a) Sphere
b) All-Pairs
c) DryadLINQ
ANEKA MAPREDUCE PROGRAMMING
Aneka provides an implementation of the MapReduce
abstractions by following the reference model
introduced by Google and implemented by Hadoop.
It defines the abstraction and runtime support for
developing MapReduce applications.
The application instance is specialized with the
components that identify the function to use.
Functions are expressed in terms of the Mapper and
Reducer classes.
Runtime support is composed of MapReduce
Scheduling Service, MapReduce Execution
Service, and a specialized Distributed file system.
The client component is the MapReduce application.
ANEKA MAPREDUCE INFRASTRUCTURE
Three major components collaborate to
execute MapReduce jobs:
Programming abstractions
Runtime support
Distributed file system support
PROGRAMMING ABSTRACTIONS
Three major classes for application development:
Mapper<K,V> and Reducer<K,V>: starting point of
application design and implementation
MapReduceApplication<M,R>: submission and
execution
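The class structure above can be mirrored with a rough Python sketch. The real Aneka API is .NET and its generic classes are Mapper&lt;K,V&gt;, Reducer&lt;K,V&gt;, and MapReduceApplication&lt;M,R&gt;; the names and local execution below are simplified stand-ins for illustration only, not the actual API:

```python
from collections import defaultdict

class Mapper:
    def map(self, key, value):          # override: emit (k2, v2) pairs
        raise NotImplementedError

class Reducer:
    def reduce(self, key, values):      # override: fold values to one result
        raise NotImplementedError

class MapReduceApplication:
    """Stand-in for submission and execution: runs the job locally
    instead of submitting it to the middleware."""
    def __init__(self, mapper_cls, reducer_cls):
        self.mapper, self.reducer = mapper_cls(), reducer_cls()

    def execute(self, records):
        groups = defaultdict(list)
        for k, v in records:
            for k2, v2 in self.mapper.map(k, v):
                groups[k2].append(v2)
        return {k2: self.reducer.reduce(k2, vs) for k2, vs in groups.items()}

# The application instance is specialized with user-defined classes:
class WordCountMapper(Mapper):
    def map(self, key, value):
        return [(w, 1) for w in value.split()]

class WordCountReducer(Reducer):
    def reduce(self, key, values):
        return sum(values)

app = MapReduceApplication(WordCountMapper, WordCountReducer)
print(app.execute([(0, "a b a")]))  # {'a': 2, 'b': 1}
```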
The interface exhibits only MapReduce-specific settings;
the control logic is encapsulated in ApplicationBase<M>,
from which the behavior can be set.
The parameters that can be controlled:
MAPREDUCE SCHEDULING SERVICE ARCHITECTURE
The scheduling of jobs and tasks is the responsibility of the
MapReduce Scheduling Service. (Master process role)
The architecture of the Scheduling Service is organized into two
major components:
MapReduceSchedulerService is a wrapper around the scheduler,
implementing the interfaces required by Aneka to expose a software
component as a service
MapReduceScheduler controls the execution of jobs and schedules
tasks
The main role of the service wrapper is to translate messages
coming from the Aneka runtime or the client applications into
calls or events directed to the scheduler component, and vice
versa.
The scheduler manages multiple queues for several operations,
such as uploading input files into the distributed file system;
initializing jobs before scheduling; scheduling map and reduce
tasks; keeping track of unreachable nodes; resubmitting failed
tasks; and reporting execution statistics.
MAPREDUCE EXECUTION SERVICE ARCHITECTURE
The execution of tasks is controlled by the MapReduce
Execution Service. (Worker process role)
The service manages the execution of map and reduce
tasks and performs other operations, such as sorting
and merging intermediate files.
There are three major components that coordinate
together for executing tasks:
MapReduce Scheduler Service
Executor Manager
MapReduce Executor
The MapReduceScheduler Service interfaces the
ExecutorManager with the Aneka middleware;
the ExecutorManager is in charge of keeping track of
the tasks being executed, delegating the execution of
each task to the MapReduce Executor, and sending the
statistics about the execution back to the Scheduler
Service.
DISTRIBUTED FILE SYSTEM SUPPORT
Aneka supports the MapReduce model that uses a DFS
implementation.
DFS implementations guarantee high availability and
better efficiency by means of replication and distribution.
MapReduce requires the ability to perform the following
tasks:
Retrieving the location of files and file chunks
Accessing a file by means of a stream
The MapReduce programming model offers classes to read from
and write to files in a sequential manner: the SeqReader
and SeqWriter classes.
An Aneka MapReduce file is composed of a header, used to
identify the file, and a sequence of record blocks, each
storing a key-value pair. The header is composed of 4 bytes:
the first 3 bytes represent the character sequence SEQ and
the fourth byte identifies the version of the file.
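The header layout described above can be sketched as follows. Only the 4-byte header ("SEQ" plus a version byte) is specified here; the record-block encoding used below (length-prefixed key and value) is an assumption, and the real Aneka layout may differ:

```python
import io

def write_seq_file(buf, records, version=1):
    # 4-byte header: the character sequence "SEQ" plus a version byte.
    buf.write(b"SEQ" + bytes([version]))
    for key, value in records:
        # Assumed record block: 4-byte big-endian length, then the bytes,
        # for the key and then the value.
        for field in (key, value):
            data = field.encode()
            buf.write(len(data).to_bytes(4, "big") + data)

def read_seq_file(buf):
    header = buf.read(4)
    assert header[:3] == b"SEQ", "not a recognized MapReduce file"
    version = header[3]
    records = []
    while True:
        raw = buf.read(4)
        if not raw:                      # end of file
            break
        key = buf.read(int.from_bytes(raw, "big")).decode()
        vlen = int.from_bytes(buf.read(4), "big")
        records.append((key, buf.read(vlen).decode()))
    return version, records

buf = io.BytesIO()
write_seq_file(buf, [("fox", "1"), ("dog", "2")])
buf.seek(0)
print(read_seq_file(buf))  # (1, [('fox', '1'), ('dog', '2')])
```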
EXAMPLE APPLICATIONS
Driver program