Cloud Computing Notes
Notes Prepared By Sarvagya Jain

Unit-1
⮚ Introduction to Service Oriented Architecture:
Service-Oriented Architecture (SOA) is an architectural approach in which applications make use of
services available in the network. In this architecture, services communicate over the network
(typically the internet) and are combined to form applications.
● SOA allows users to combine a large number of facilities from existing services to form
applications.
● SOA encompasses a set of design principles that structure system development and provide means
for integrating components into a coherent and decentralized system.
● SOA based computing packages functionalities into a set of interoperable services, which can be
integrated into different software systems belonging to separate business domains.
There are two major roles within Service-oriented Architecture:
1. Service provider: The service provider is the maintainer of the service and the organization that
makes available one or more services for others to use. To advertise services, the provider can
publish them in a registry, together with a service contract that specifies the nature of the service,
how to use it, the requirements for the service, and the fees charged.
2. Service consumer: The service consumer can locate the service metadata in the registry and
develop the required client components to bind and use the service.

Services might aggregate information and data retrieved from other services or create workflows of
services to satisfy the request of a given service consumer. This practice is known as service
orchestration. Another important interaction pattern is service choreography, which is the coordinated
interaction of services without a single point of control.
⮚ Components of SOA:

⮚ Principles of SOA:

1. Standardized service contract: Specified through one or more service description documents.

2. Loose coupling: Services are designed as self-contained components that maintain relationships
minimizing dependencies on other services.
3. Abstraction: A service is completely defined by service contracts and description documents.
They hide their logic, which is encapsulated within their implementation.
4. Reusability: Designed as components, services can be reused more effectively, thus reducing
development time and the associated costs.
5. Autonomy: Services have control over the logic they encapsulate and, from a service consumer's
point of view, there is no need to know about their implementation.
6. Discoverability: Services are defined by description documents that constitute supplemental
metadata through which they can be effectively discovered. Service discovery provides an
effective means for utilizing third-party resources.
7. Composability: Using services as building blocks, sophisticated and complex operations can be
implemented. Service orchestration and choreography provide solid support for composing
services and achieving business goals.

⮚ Advantages of SOA:
● Service reusability: In SOA, applications are made from existing services. Thus, services can be
reused to make many applications.
● Easy maintenance: As services are independent of each other, they can be updated and modified
easily without affecting other services.
● Platform independent: SOA allows making a complex application by combining services picked
from different sources, independent of the platform.
● Availability: SOA facilities are easily available to anyone on request.
● Reliability: SOA applications are more reliable because it is easy to debug small services rather
than huge code bases.
● Scalability: Services can run on different servers within an environment; this increases scalability.

⮚ Disadvantages of SOA:
● High overhead: Input parameters are validated whenever services interact; this decreases
performance because it increases load and response time.
● High investment: A huge initial investment is required for SOA.
● Complex service management: When services interact, they exchange messages to perform tasks.
The number of messages may run into millions, and handling such a large number of messages
becomes a cumbersome task.

⮚ Practical applications of SOA:


SOA is used in many ways around us, whether we notice it or not.
1. SOA infrastructure is used by many armies and air forces to deploy situational awareness systems.
2. SOA is used to improve healthcare delivery.
3. Nowadays many apps and games use inbuilt functions of the device to run. For example, an app
might need GPS, so it uses the inbuilt GPS functions of the device. This is SOA in mobile solutions.
4. SOA helps museums maintain a virtualized storage pool for their information and content.

⮚ Web service
Web service is a standardized medium to propagate communication between the client and server
applications on the World Wide Web. A web service is a software module that is designed to perform a
certain set of tasks.
● Web services in cloud computing can be searched for over the network and can also be invoked
accordingly.
● When invoked, the web service would be able to provide the functionality to the client, which
invokes that web service.

How web services works:

The above diagram shows a very simplistic view of how a web service would actually work. The
client would invoke a series of web service calls via requests to a server which would host the actual
web service.
These requests are made through what are known as remote procedure calls. Remote Procedure Calls
(RPC) are calls made to methods which are hosted by the relevant web service.
As an example, Amazon provides a web service that provides prices for products sold online via
amazon.com. The front end or presentation layer can be in .Net or Java but either programming
language would have the ability to communicate with the web service.
The main component of a web service design is the data which is transferred between the client and
the server, and that is XML. XML (Extensible Markup Language) is a counterpart to HTML: an
easy-to-understand intermediate language that is understood by many programming languages.
So when applications talk to each other, they actually talk in XML. This provides a common platform
for applications developed in various programming languages to talk to each other.
Web services use something known as SOAP (Simple Object Access Protocol) for sending the XML
data between applications. The data is sent over normal HTTP. The data which is sent from the web
service to the application is called a SOAP message. The SOAP message is nothing but an XML
document. Since the document is written in XML, the client application calling the web service can be
written in any programming language.

⮚ Architecture of Web Services


The Web Services architecture describes how to instantiate the elements and implement the operations
in an interoperable manner.
The architecture of web service interacts among three roles: service provider, service requester, and
service registry. The interaction involves the three operations: publish, find, and bind. These
operations and roles act upon the web services artifacts. The web service artifacts are the web service
software module and its description.
The service provider hosts a network-accessible module (the web service). It defines a service
description for the web service and publishes it to a service requestor or service registry. The service
requestor uses a find operation to retrieve the service description, locally or from the service registry,
and uses the service description to bind with the service provider and invoke or interact with the web
service implementation.

The following figure illustrates the operations, roles, and their interaction.

Roles in Web Service Architecture


There are three roles in web service architecture:
1. Service Provider
2. Service Requestor
3. Service Registry
● Service Provider
From an architectural perspective, it is the platform that hosts the services.
● Service Requestor
Service requestor is the application that is looking for and invoking or initiating an interaction with
a service. The browser plays the requester role, driven by a consumer or a program without a user
interface.
● Service Registry
This is where service requestors find services and obtain binding information for services during development.
Operations in Web Service Architecture
Three behaviours take place among these web service roles:
o Publication of service descriptions (Publish)
o Finding of services descriptions (Find)
o Invoking of service based on service descriptions (Bind)
Publish: In the publish operation, a service description must be published so that a service requestor
can find the service.
Find: In the find operation, the service requestor retrieves the service description directly. It can be
involved in two different lifecycle phases for the service requestor:
o At design time, to retrieve the service's interface description for program development.
o At runtime, to retrieve the service's binding and location description for invocation.
Bind: In the bind operation, the service requestor invokes or initiates an interaction with the service at
runtime using the binding details in the service description to locate, contact, and invoke the service.

⮚ Type of Web Service


There are mainly two types of web services.
1. SOAP web services.

2. RESTful web services.

⮚ SOAP (Simple Object Access Protocol)


SOAP is known as the Simple Object Access Protocol. It defines the standard XML format. It also
defines the way of building web services. We use Web Service Definition Language (WSDL) to define
the format of request XML and the response XML.
SOAP is known as a transport-independent messaging protocol. SOAP is based on transferring XML
data as SOAP messages. Each message is an XML document; only the structure of the XML document
follows a specific pattern, not the content. The best part of web services and SOAP is that everything
is sent via HTTP, the standard web protocol.

SOAP was developed as an intermediate language so that applications built on various programming
languages could talk easily to each other and avoid the extreme development effort.

For example, suppose we request access to a To-do application from a Facebook application. The
Facebook application sends an XML request to the To-do application; the To-do application processes
the request, generates the XML response, and sends it back to the Facebook application.

SOAP Building Blocks

The SOAP specification defines something known as a "SOAP message", which is what is exchanged
between the web service and the client application.

Here is what a SOAP message consists of

o Each SOAP document needs to have a root element known as the <Envelope> element. The
root element is the first element in an XML document.
o The "envelope" is in turn divided into 2 parts. The first is the header, and the next is the body.
o The header contains the routing data, which is basically the information that tells to which client
the XML document needs to be sent.
o The body will contain the actual message.
SOAP Message Structure
SOAP messages are normally auto-generated by the web service when it is called.

Whenever a client application calls a method in the web service, the web service will automatically
generate a SOAP message which will have the necessary details of the data which will be sent from
the web service to the client application.

A simple SOAP Message has the following elements –

The Envelope element

The header element and
The body element

The SOAP-Envelope contains a SOAP-Header and a SOAP-Body. The SOAP-Header contains
meta-information needed to identify the request, for example authentication, authorization, signature,
etc.; it is optional. The SOAP-Body contains the real XML content of the request or response.
In case of an error, the server responds with a SOAP-Fault.
The SOAP XML request and response structures are shown below.
XML Request
<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
  <Body>
    <getCourseDetailRequest xmlns="http://udemy.com/course">
      <id>course1</id>
    </getCourseDetailRequest>
  </Body>
</Envelope>
XML Response
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header /> <!-- empty header -->
  <SOAP-ENV:Body> <!-- body begins -->
    <ns2:getCourseDetailsResponse xmlns:ns2="http://in28mi"> <!-- content of the response -->
      <ns2:course>
        <ns2:id>Course1</ns2:id>
        <ns2:name>Spring</ns2:name>
        <ns2:description>10 Steps</ns2:description>
      </ns2:course>
    </ns2:getCourseDetailsResponse>
  </SOAP-ENV:Body> <!-- body ends -->
</SOAP-ENV:Envelope>
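
As a hedged illustration, the sketch below shows how a client could send the SOAP request above over HTTP from JavaScript, in the same style as the AJAX examples later in these notes. The service endpoint URL is a hypothetical placeholder, not a real service.

// A minimal sketch: posting the SOAP request shown above with JavaScript.
var soapRequest =
    '<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">' +
    '<Body>' +
    '<getCourseDetailRequest xmlns="http://udemy.com/course">' +
    '<id>course1</id>' +
    '</getCourseDetailRequest>' +
    '</Body>' +
    '</Envelope>';

var xhr = new XMLHttpRequest();
xhr.open("POST", "http://example.com/course-service", true); // hypothetical endpoint
xhr.setRequestHeader("Content-Type", "text/xml; charset=utf-8");
xhr.onreadystatechange = function () {
    if (xhr.readyState === 4 && xhr.status === 200) {
        console.log(xhr.responseXML); // the SOAP response document
    }
};
xhr.send(soapRequest);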

⮚ Web Services Components


There are two components of web services:
o Web Service Description Language (WSDL)
o Universal Description Discovery and Integration (UDDI)

⮚ Web Service Description Language (WSDL)
WSDL is an acronym for Web Service Description Language. WSDL is an XML-based interface
description language. It is used for describing the functionality offered by a web service. Sometimes it
is also known as the WSDL file; the extension of the WSDL file is .wsdl. It provides a
machine-readable description of how the service can be called, what parameters it expects, and what
data structures it returns.
It describes a service as a collection of network endpoints, or ports. It is often used in combination with
SOAP and an XML Schema to provide XML-based services over a distributed environment. In short,
the purpose of WSDL is similar to a type signature in a programming language.
WSDL 1.1 Term | WSDL 2.0 Term | Description
Service | Service | It is a set of system functions.
Port | Endpoint | It is an endpoint that defines a combination of a binding and a network address.
Binding | Binding | It specifies the interface and defines the SOAP binding style. It also defines the operations.
PortType | Interface | An abstract set of operations supported by one or more endpoints.
Operation | Operation | Abstract detail of an action supported by the service. It defines the SOAP actions and the way of encoding the message.
Message | N/A | An abstract, typed definition of the data to communicate. W3C removed the message in WSDL 2.0, in which XML Schema types for defining the bodies of inputs, outputs, and faults are referenced directly.
Types | Types | It is a container for data type definitions. The XML Schema language (XSD) is used for this purpose.
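
To make the table concrete, the following is a minimal, hedged sketch of a WSDL 1.1 document for a course service similar to the SOAP example above. The service name, target namespace, and endpoint URL are assumptions for illustration, not part of any real product.

<definitions name="CourseService"
             targetNamespace="http://example.com/course.wsdl"
             xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
             xmlns:tns="http://example.com/course.wsdl"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <types/>  <!-- XML Schema data type definitions would go here -->
  <message name="getCourseRequest">
    <part name="id" type="xsd:string"/>
  </message>
  <message name="getCourseResponse">
    <part name="name" type="xsd:string"/>
  </message>
  <portType name="CoursePortType">  <!-- called "interface" in WSDL 2.0 -->
    <operation name="getCourse">
      <input message="tns:getCourseRequest"/>
      <output message="tns:getCourseResponse"/>
    </operation>
  </portType>
  <binding name="CourseBinding" type="tns:CoursePortType">
    <soap:binding style="rpc" transport="http://schemas.xmlsoap.org/soap/http"/>
    <operation name="getCourse">
      <soap:operation soapAction=""/>
      <input><soap:body use="literal" namespace="http://udemy.com/course"/></input>
      <output><soap:body use="literal" namespace="http://udemy.com/course"/></output>
    </operation>
  </binding>
  <service name="CourseService">
    <port name="CoursePort" binding="tns:CourseBinding">  <!-- called "endpoint" in WSDL 2.0 -->
      <soap:address location="http://example.com/course-service"/>
    </port>
  </service>
</definitions>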

⮚ Universal Description, Discovery, and Integration (UDDI)


UDDI is an acronym for Universal Description, Discovery, and Integration. It is an XML-based registry
for businesses worldwide to list themselves on the internet. It defines a set of services supporting the
description and discovery of businesses, organizations, or other web service providers. UDDI makes
available the services and the technical interfaces which may be used to access those services.

The idea behind UDDI is to discover organizations and the services that organizations offer, much like
using a telephone directory. It allows businesses to list themselves by name, product, location, or
the web services they offer. A UDDI registry works in the following manner:
o A service provider registers its business with the UDDI registry.
o A service provider registers each service separately with the UDDI registry.
o The consumer looks up the business and service in the UDDI registry.
o The consumer binds the service with the service provider and uses the service.
The UDDI business registry system has three directories, as follows:
o White Pages
o Yellow pages
o Green Pages

White Pages: The white pages contain basic information such as company name, address, phone
number, and other business identifiers such as tax numbers.

Yellow Pages: The yellow pages contain detailed business data organized by relevant business
classification. This version of the yellow pages classifies businesses according to the
newer NAICS (North American Industry Classification System).

Green Pages: The green pages contain information about the company's crucial business process, such
as operating platform, supported programs, and other high-level business protocols.
⮚ Introduction to RESTful Web Services
REST stands for REpresentational State Transfer. It was developed by Roy Thomas Fielding, who was
also one of the principal authors of the HTTP specification. The main goal of RESTful web services is to make web services more
effective. RESTful web services try to define services using the different concepts that are already
present in HTTP. REST is an architectural approach, not a protocol.
It does not define a standard message exchange format. We can build REST services with both XML
and JSON; JSON is the more popular format with REST. The key abstraction in REST is a resource. A
resource can be anything, and it can be accessed through a Uniform Resource Identifier (URI); for
example, /users/{id} identifies a particular user.
A resource has representations like XML, HTML, and JSON. The current state of a resource is captured
by its representation. When we request a resource, the server provides a representation of the resource.
The important methods of HTTP are:
o GET: It reads a resource.
o PUT: It updates an existing resource.
o POST: It creates a new resource.
o DELETE: It deletes the resource.
For example, if we want to perform the following actions in a social media application, we use the
corresponding requests.
POST /users: It creates a user.
GET /users/{id}: It retrieves the detail of a user.
GET /users: It retrieves the detail of all users.
DELETE /users: It deletes all users.
DELETE /users/{id}: It deletes a user.
GET /users/{id}/posts/{post_id}: It retrieves the detail of a specific post.
POST /users/{id}/posts: It creates a post for the user.
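
As a hedged illustration, the sketch below calls two of these endpoints from JavaScript using the browser fetch API. The base URL https://api.example.com and the request body fields are assumptions for illustration only.

// A minimal sketch of calling a REST API (hypothetical base URL and fields).
async function demo() {
    // GET /users/42: retrieve the detail of one user
    const response = await fetch("https://api.example.com/users/42");
    console.log(response.status);              // 200 SUCCESS if the user exists
    const user = await response.json();        // JSON representation of the resource

    // POST /users/42/posts: create a post for that user
    const create = await fetch("https://api.example.com/users/42/posts", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ title: "Hello", text: "My first post" })
    });
    console.log(create.status);                // 201 CREATED on success
}
demo();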
HTTP also defines the following standard status code:
o 404: RESOURCE NOT FOUND
o 200: SUCCESS

o 201: CREATED
o 401: UNAUTHORIZED
o 500: SERVER ERROR
⮚ RESTful Service Constraints/Characteristics/Principle

1. Stateless – The server contains no client state; however, the client side holds the session state. Each
request contains enough context to process the message.
2. Uniform Interface – Interface between client and server, HTTP verbs (GET, PUT, POST,
DELETE), URLs (resource name), and HTTP response (status, body).
3. Cacheable – Server responses (representations) are cacheable; caching may be implicit, explicit, or
negotiated.
4. Layered System – It improves scalability. Usually, the client cannot tell whether it is connected
directly to the end server or to an intermediary. Intermediary servers may enable load balancing and
provide shared caches to improve system scalability.
5. Client–Server – Clients are separated from servers by a uniform interface. This separation of concerns
means clients are not concerned with activities at the server end, like data storage, etc.
6. Code on Demand (optional) – Servers can temporarily transfer executable logic to the client.
In this way the functionality of a client can be customized.
⮚ REST Web Service Components
An Informatica REST web service has the following components:
Resource
A resource includes the mapping that the REST web service runs and the definition of the
response message that the web service returns. The resource also includes a resource ID, which
is a key field in the output data. When you define a resource, you define the structure of the
output data that the web service returns to the client. A web service can have multiple
resources.
Resource mapping
The mapping that retrieves the data to return to the web service client. A resource mapping does
not read the request query. The REST resource mapping contains a Read transformation. The
transformation reads a data object in the Model repository to retrieve data to return to the
client. By default, you do not have to add a Filter transformation or a Lookup transformation to
retrieve the data based on the client query. The REST web service filters the output data after
the mapping returns data.
Request message
A request from a web service client to the web service to perform a task. An Informatica web
service can perform an HTTP GET method. The request message is a string that contains the
name of the web service, the name and network location of the resource to perform the task,
and the parameters to filter the output.
Resource ID
A key field that you can search for in the output data. Each key field has a URL in the output
data.
Response message
A JSON or XML file that contains the data to return to the web service client. The response
message can contain a hierarchy of elements and multiple-occurring data.

⮚ RESTful Message
RESTful Web Services make use of the HTTP protocol as a medium of communication between client
and server. A client sends a message in the form of an HTTP Request and the server responds in the
form of an HTTP Response. This technique is termed messaging. These messages contain message data
and metadata, i.e. information about the message itself. Let us have a look at the HTTP Request and
HTTP Response messages for HTTP 1.1.
HTTP Request

An HTTP Request has five major parts −


● Verb − Indicates the HTTP methods such as GET, POST, DELETE, PUT, etc.
● URI − Uniform Resource Identifier (URI) to identify the resource on the server.
● HTTP Version − Indicates the HTTP version. For example, HTTP v1.1.
● Request Header − Contains metadata for the HTTP Request message as key-value pairs. For
example, client (or browser) type, format supported by the client, format of the message body,
cache settings, etc.
● Request Body − Message content or Resource representation.
HTTP Response

An HTTP Response has four major parts −


● Status/Response Code − Indicates the server status for the requested resource. For example,
404 means resource not found and 200 means the response is OK.
● HTTP Version − Indicates the HTTP version. For example HTTP v1.1.
● Response Header − Contains metadata for the HTTP Response message as key value pairs.
For example, content length, content type, response date, server type, etc.
● Response Body − Response message content or Resource representation.
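
As a hedged illustration, here is what a simple HTTP 1.1 exchange for a hypothetical resource /users/42 on api.example.com might look like. The first request line carries the verb, URI, and HTTP version, followed by request headers; the response starts with the HTTP version and status code, followed by response headers and the response body.

GET /users/42 HTTP/1.1
Host: api.example.com
Accept: application/json

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 26

{"id": 42, "name": "Asha"}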

⮚ Software as a service
Software as a service (or SaaS) is a way of delivering applications over the Internet—as a service.
Instead of installing and maintaining software, you simply access it via the Internet, freeing yourself
from complex software and hardware management.

SaaS applications are sometimes called Web-based software, on-demand software, or hosted software.
Whatever the name, SaaS applications run on a SaaS provider’s servers. The provider manages access
to the application, including security, availability, and performance.

How does software as a service work?


SaaS works through the cloud delivery model. A software provider will either host the application and
related data using its own servers, databases, networking and computing resources, or it may be an
ISV (Independent Software Vendor) that contracts a cloud provider to host the application in the provider's
data center. The application will be accessible to any device with a network connection. SaaS
applications are typically accessed via web browsers.
As a result, companies using SaaS applications are not tasked with the setup and maintenance of the
software. Users simply pay a subscription fee to gain access to the software, which is a ready-made
solution.
SaaS is closely related to the application service provider (ASP) and on-demand computing software
delivery models where the provider hosts the customer's software and delivers it to approved end
users over the internet.

Services provided by SaaS providers -


Business Services - SaaS providers provide various business services to start up a business. The
SaaS business services include ERP (Enterprise Resource Planning), CRM (Customer Relationship
Management), billing, and sales.
Document Management - SaaS document management is a software application offered by a third
party (SaaS providers) to create, manage, and track electronic documents.
Example: Slack, Samepage, Box, and Zoho Forms.
Social Networks - As we all know, social networking sites are used by the general public, so social
networking service providers use SaaS for their convenience and handle the general public's
information.
Mail Services - To handle the unpredictable number of users and the load on e-mail services, many
e-mail providers offer their services using SaaS.
Advantages of SaaS cloud computing layer
1) SaaS is easy to buy
SaaS pricing is based on a monthly fee or annual fee subscription, so it allows organizations to access
business functionality at a low cost, which is less than licensed applications.
Unlike traditional software, which is sold as a license with an up-front cost (and often an
optional ongoing support fee), SaaS providers generally price their applications using a
subscription fee, most commonly a monthly or annual fee.
2. One to Many
SaaS services are offered on a one-to-many model, meaning a single instance of the application is
shared by multiple users.
3. Less hardware required for SaaS
The software is hosted remotely, so organizations do not need to invest in additional hardware.
4. Low maintenance required for SaaS
Software as a service removes the need for installation, set-up, and daily maintenance for the
organizations. The initial set-up cost for SaaS is typically lower than for enterprise software. SaaS
vendors price their applications based on usage parameters, such as the number of users using the
application, so usage is easy to monitor and updates are automatic.
5. No special software or hardware versions required
All users will have the same version of the software and typically access it through the web browser.
SaaS reduces IT support costs by outsourcing hardware and software maintenance and support to the
SaaS provider.
6. Multidevice support

SaaS services can be accessed from any device such as desktops, laptops, tablets, phones, and thin
clients.
7. API Integration
SaaS services easily integrate with other software or services through standard APIs.
8. No client-side installation
SaaS services are accessed directly from the service provider over an internet connection, so they do
not require any client-side software installation.
Disadvantages of SaaS cloud computing layer
1) Security
Actually, data is stored in the cloud, so security may be an issue for some users. However, cloud
computing is not more secure than in-house deployment.
2) Latency issue
Since data and applications are stored in the cloud at a variable distance from the end-user, there is a
possibility that there may be greater latency when interacting with the application compared to local
deployment. Therefore, the SaaS model is not suitable for applications that demand response times in
milliseconds.
3) Total Dependency on Internet
Without an internet connection, most SaaS applications are not usable.
4) Switching between SaaS vendors is difficult
Switching SaaS vendors involves the difficult and slow task of transferring very large data files
over the internet and then converting and importing them into the new SaaS application.
Popular SaaS Providers
The below table shows some popular SaaS providers and services that are provided by them -
Provider | Services
Salesforce.com | On-demand CRM solutions
Microsoft Office 365 | Online office suite
Google Apps | Gmail, Google Calendar, Docs, and Sites
NetSuite | ERP, accounting, order management, CRM, Professional Services Automation (PSA), and e-commerce applications
GoToMeeting | Online meeting and video-conferencing software
Constant Contact | E-mail marketing, online surveys, and event marketing
Oracle CRM | CRM applications
Workday, Inc. | Human capital management, payroll, and financial management

⮚ Platform as a Service | PaaS


Platform as a Service (PaaS) provides a runtime environment. It allows programmers to easily create,
test, run, and deploy web applications. You can purchase these applications from a cloud service
provider on a pay-as-per-use basis and access them using an Internet connection. In PaaS, back-end
scalability is managed by the cloud service provider, so end users do not need to worry about
managing the infrastructure.
PaaS includes infrastructure (servers, storage, and networking) and platform (middleware,
development tools, database management systems, business intelligence, and more) to support the web
application life cycle.
Example: Google App Engine, Force.com, Joyent, Azure.
PaaS providers provide the:
1. Programming languages
PaaS providers provide various programming languages for the developers to develop the applications.
Some popular programming languages provided by PaaS providers are Java, PHP, Ruby, Perl, and Go.

2. Application frameworks
PaaS providers provide application frameworks to simplify application development.
Some popular application frameworks provided by PaaS providers are Node.js, Drupal, Joomla,
WordPress, Spring, Play, Rack, and Zend.
3. Databases
PaaS providers provide various databases such as ClearDB, PostgreSQL, MongoDB, and Redis to
communicate with the applications.
4. Other tools
PaaS providers provide various other tools that are required to develop, test, and deploy the
applications.
Advantages of PaaS
There are the following advantages of PaaS -
1) Simplified Development
PaaS allows developers to focus on development and innovation without worrying about infrastructure
management.
2) Lower risk
No need for up-front investment in hardware and software. Developers only need a PC and an internet
connection to start building applications.
3) Prebuilt business functionality
Some PaaS vendors also provide predefined business functionality so that users can avoid building
everything from scratch and can start their projects directly.
4) Instant community
PaaS vendors frequently provide online communities where developers can get ideas, share
experiences, and seek advice from others.
5) Scalability
Applications deployed can scale from one to thousands of users without any changes to the
applications.
Disadvantages of PaaS cloud computing layer
1) Vendor lock-in
One has to write the applications according to the platform provided by the PaaS vendor, so the
migration of an application to another PaaS vendor would be a problem.
2) Data Privacy
Corporate data, whether critical or not, is private, so if it is not located within the walls of the
company, there can be a risk in terms of data privacy.
3) Integration with the rest of the systems applications
It may happen that some applications are local and some are in the cloud, so there are chances of
increased complexity when we want to use data that is in the cloud together with local data.
Popular PaaS Providers
The below table shows some popular PaaS providers and services that are provided by them -
Providers | Services
Google App Engine (GAE) | App Identity, URL Fetch, Cloud Storage client library, Log Service
Salesforce.com | Faster implementation, rapid scalability, CRM services, Sales Cloud, mobile connectivity, Chatter
Windows Azure | Compute, security, IoT, data storage
AppFog | Justcloud.com, SkyDrive, GoogleDocs
OpenShift | RedHat, Microsoft Azure
Cloud Foundry from VMware | Data, messaging, and other services

⮚ Infrastructure as a Service | IaaS
IaaS is also known as Hardware as a Service (HaaS). It is one of the layers of the cloud computing
platform. It allows customers to outsource their IT infrastructures such as servers, networking,
processing, storage, virtual machines, and other resources. Customers access these resources on the
Internet using a pay-as-per use model.
In traditional hosting services, IT infrastructure was rented out for a specific period of time, with
pre-determined hardware configuration. The client paid for the configuration and time, regardless of
the actual use. With the help of the IaaS cloud computing platform layer, clients can dynamically scale
the configuration to meet changing requirements and are billed only for the services actually used.
IaaS cloud computing platform layer eliminates the need for every organization to maintain the IT
infrastructure.
IaaS is offered in three models: public, private, and hybrid cloud. A private cloud implies that the
infrastructure resides at the customer's premises. In the case of a public cloud, it is located at the cloud
computing platform vendor's data center, and the hybrid cloud is a combination of the two in which the
customer selects the best of both public cloud and private cloud.
IaaS provider provides the following services -
1. Compute: Computing as a Service includes virtual central processing units and virtual main
memory for the VMs that are provisioned to the end users.
2. Storage: The IaaS provider provides back-end storage for storing files.
3. Network: Network as a Service (NaaS) provides networking components such as routers,
switches, and bridges for the VMs.
4. Load balancers: It provides load balancing capability at the infrastructure layer.
Advantages of IaaS cloud computing layer
There are the following advantages of IaaS computing layer -
1. Shared infrastructure
IaaS allows multiple users to share the same physical infrastructure.
2. Web access to the resources
IaaS allows IT users to access resources over the internet.
3. Pay-as-per-use model
IaaS providers provide services on a pay-as-per-use basis. The users are required to pay for
what they have used.
4. Focus on the core business
IaaS allows the organization to focus on its core business rather than on IT infrastructure.
5. On-demand scalability
On-demand scalability is one of the biggest advantages of IaaS. Using IaaS, users do not need to worry
about upgrading software or troubleshooting issues related to hardware components.
Disadvantages of IaaS cloud computing layer
1. Security
Security is one of the biggest issues in IaaS. Most of the IaaS providers are not able to provide 100%
security.
2. Maintenance & Upgrade
Although IaaS service providers maintain the software, they do not upgrade the software for some
organizations.
3. Interoperability issues
It is difficult to migrate VMs from one IaaS provider to another, so customers might face problems
related to vendor lock-in.
Top IaaS Providers
IaaS Vendor | IaaS Solution | Details
Amazon Web Services | Elastic Compute Cloud (EC2), Elastic MapReduce, Route 53, Virtual Private Cloud, etc. | The cloud computing platform pioneer, Amazon offers auto scaling, cloud monitoring, and load balancing features as part of its portfolio.
Netmagic Solutions | Netmagic IaaS Cloud | Netmagic runs from data centers in Mumbai, Chennai, and Bangalore, and a virtual data center in the United States. Plans are underway to extend services to West Asia.
Rackspace | Cloud Servers, Cloud Files, Cloud Sites, etc. | The cloud computing platform vendor focuses primarily on enterprise-level hosting services.
Reliance Communications | Reliance Internet Data Center | RIDC supports both traditional hosting and cloud services, with data centers in Mumbai, Bangalore, Hyderabad, and Chennai. The cloud services offered by RIDC include IaaS and SaaS.
Sify Technologies | Sify IaaS | Sify's cloud computing platform is powered by HP's converged infrastructure. The vendor offers all three types of cloud services: IaaS, PaaS, and SaaS.
Tata Communications | InstaCompute | InstaCompute is Tata Communications' IaaS offering. InstaCompute data centers are located in Hyderabad and Singapore, with operations in both countries.

⮚ Organizational Scenarios in cloud computing.


End user to cloud
End user data is managed in the cloud. The end user does not need to keep track of anything other
than a password. For example: email applications.
Concerned Issues
The cloud service must authenticate the users. (Identity theft)
Cloud vendors should be very clear about service level agreements. (SLAs)
Access to cloud should not require a particular platform or technology. (Open Client)
Enterprise to cloud to end user
When the end user interacts with the enterprise, the enterprise accesses data from the cloud,
manipulates it, and sends it to the end user.
Enterprise to cloud
Enterprise using cloud services for its internal processes.
Enterprise to cloud to Enterprise
Two enterprises using the same cloud.
Private cloud
Useful for larger enterprises.
Hybrid clouds
Both public and private clouds work together.
⮚ Administering & Monitoring cloud services
Administering Cloud Computing services is an important process when you have hosted your business
data on the cloud. The business owners need to know whether the performance is at the right level and
whether the deleted data is permanently gone.
Cloud monitoring is the process of evaluating, monitoring, and managing cloud-based services,
applications, and infrastructure. Companies utilize various application monitoring tools to monitor
cloud-based applications.

⮚ Types of Cloud Services to Monitor
There are multiple types of cloud services to monitor. Cloud monitoring is not just about monitoring
servers hosted on AWS or Azure. Enterprises also put a lot of importance on monitoring the
cloud-based services that they consume, including things like Office 365 and others.
SaaS – Services like Office 365, Salesforce and others
PaaS – Developer friendly services like SQL databases, caching, storage and more
IaaS – Servers hosted by cloud providers like Azure, AWS, Digital Ocean, and others
FaaS – New serverless applications like AWS Lambda and Azure Functions
Application Hosting – Services like Azure App Services, Heroku, etc
Cloud monitoring works through a set of tools that supervise the servers, resources, and applications.
These tools generally come from two sources:
In-house tools from the cloud provider - this is a simple option because the tools are part of the
service. There is no installation, and integration is seamless.
Tools from independent SaaS provider - although the SaaS provider may be different from the cloud
service provider, that doesn’t mean the two services don’t work seamlessly. These providers also have
expertise in managing performance and costs.
⮚ Benefits of Cloud Monitoring
The top benefits of leveraging cloud monitoring tools include:
● They already have infrastructure and configurations in place. Installation is quick and easy.
● Dedicated tools are maintained by the host. That includes hardware.
● These solutions are built for organizations of various sizes. So if cloud activity increases, the right
monitoring tool can scale seamlessly.
● Subscription-based solutions can keep costs low. They do not require startup or infrastructure
expenditures, and maintenance costs are spread among multiple users.
● Because the resources are not part of the organization’s servers and workstations, they don’t suffer
interruptions when local problems disrupt the organization.
● Many tools can be used on multiple types of devices — desktop computers, tablets, and phones.
This allows organizations to monitor apps and services from any location with Internet access.

⮚ Benefits of cloud
On-demand self-service: A client can provision computer resources without the need for interaction
with cloud service provider personnel.
• Broad network access: Access to resources in the cloud is available over the network using standard
methods in a manner that provides platform-independent access to clients of all types.
This includes a mixture of heterogeneous operating systems, and thick and thin platforms such as
laptops, mobile phones, and PDAs.
• Resource pooling: A cloud service provider creates resources that are pooled together in a system
that supports multi-tenant usage.
Physical and virtual systems are dynamically allocated or reallocated as needed. Intrinsic in this
concept of pooling is the idea of abstraction that hides the location of resources such as virtual
machines, processing, memory, storage, and network bandwidth and connectivity.
• Rapid elasticity: Resources can be rapidly and elastically provisioned.
The system can add resources by either scaling up systems (more powerful computers) or scaling out
systems (more computers of the same kind), and scaling may be automatic or manual. From the
standpoint of the client, cloud computing resources should look limitless and can be purchased at any
time and in any quantity.
• Measured service: The use of cloud system resources is measured, audited, and reported to the
customer based on a metered system.
A client can be charged based on a known metric such as amount of storage used, number of
transactions, network I/O (Input/Output) or bandwidth, amount of processing power used, and so forth.
A client is charged based on the level of services provided.
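
As a hedged illustration of metered service, the sketch below computes a monthly bill from usage metrics. The unit prices and metric names are purely hypothetical assumptions, not real provider rates.

// A minimal sketch of metered (pay-per-use) billing with hypothetical unit prices.
var rates = { storageGB: 0.02, networkGB: 0.05, cpuHours: 0.10 }; // price per unit

function monthlyBill(usage) {
    return usage.storageGB * rates.storageGB
         + usage.networkGB * rates.networkGB
         + usage.cpuHours  * rates.cpuHours;
}

// 500 GB stored, 200 GB transferred, 720 CPU-hours used in the month:
console.log(monthlyBill({ storageGB: 500, networkGB: 200, cpuHours: 720 }));
// 500*0.02 + 200*0.05 + 720*0.10 = 10 + 10 + 72 = 92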

Lower costs: Because cloud networks operate at higher efficiencies and with greater utilization,
significant cost reductions are often encountered.
• Ease of utilization: Depending upon the type of service being offered, you may find that you do not
require hardware or software licenses to implement your service.
• Quality of Service: The Quality of Service (QoS) is something that you can obtain under contract
from your vendor.
• Reliability: The scale of cloud computing networks and their ability to provide load balancing and
failover makes them highly reliable, often much more reliable than what you can achieve in a single
organization.
• Outsourced IT management: A cloud computing deployment lets someone else manage your
computing infrastructure while you manage your business. In most instances, you achieve
considerable reductions in IT staffing costs.
• Simplified maintenance and upgrade: Because the system is centralized, you can easily apply
patches and upgrades. This means your users always have access to the latest software versions.
• Low Barrier to Entry: In particular, upfront capital expenditures are dramatically reduced. In cloud
computing, anyone can be a giant at any time.

⮚ Limitations of cloud
1. Internet dependence: Everything in the cloud is accessible through the internet only. If the cloud
server faces some issue, so will your application. Also, if you use an internet service that fluctuates a
lot, cloud computing is not for you. Even the biggest service providers face quite long downtimes. In
certain scenarios, opting for cloud services becomes a crucial decision.
2. Data incompatibility: This varies between service providers. Sometimes, a vendor locks in the
customer by using proprietary rights so that the customer can't switch to another vendor. For
example, there is a chance that a vendor doesn't provide compatibility with Google Docs or Google
Sheets. As your customers or employees become more advanced, your business may be in crucial need
of it. So ensure your contract with the provider is on your terms, not theirs.
3. Security breach threats: Again, the data transactions happen through the internet only. Though your
cloud service provider claims to be one of the best-secured service providers, it should finally be your
call, because cloud computing history has recorded big incidents before. So if you own a business
where a single miss or corruption of data is not at all acceptable, you should think a hundred times
before going to the cloud, especially for a large-scale business. For small business owners, the question
here is: will you be able to provide more levels of security for your applications than a cloud service
provider does?
4. Various costs: Is your current application compatible enough to take to the cloud? A common
mistake business owners make is to invite unrequired expense in order to be highly advanced. As per
the current scenario and near-future analysis, if your current infrastructure is serving your needs, then
migrating to the cloud is not recommended, because it may happen that, to be compatible with the
cloud, your business applications need to be rewritten. Moreover, if your business demands huge
data transfers, you will be billed heavily every month as well. Sometimes, having set up your own
infrastructure can save you from this kind of constant high billing.
5. Customer support: Before opting for the cloud, check out the query-resolution time. With time,
service providers are becoming more modern, but still check for the best. If your business faces heavy
traffic every day and heavier traffic on weekends, then a quick fix is always a top priority. The best
cloud service provider must have optimum support for technical difficulties via email, call, chat, or
even forums. Choose the one who provides the best support.


Unit-2
⮚ Utility Computing:
Utility computing is a model in which computing resources are provided to the customer based on
specific demand. The service provider charges exactly for the services provided, instead of a flat rate.
The foundational concept is that users or businesses pay the providers of utility computing for the
amenities used – such as computing capabilities, storage space and applications services.
Utility computing helps eliminate data redundancy, as huge volumes of data are distributed across
multiple servers or backend systems. The client however, can access the data anytime and from
anywhere.
Utility computing is a trending IT service model. It provides on-demand computing resources
(computation, storage, and programming services via API) and infrastructure based on the
pay-per-use method. It minimizes the associated costs and maximizes the efficient use of resources.
The advantages of utility computing are that it reduces IT costs, provides greater flexibility, and is
easier to manage.
Large organizations such as Google and Amazon have established their own utility services for
computing, storage, and applications.

Properties of utility computing

There are following five characteristics of utility computing.

Scalability
Utility computing must ensure that sufficient IT resources are available under all conditions. Even if
demand for a service increases, its quality (e.g., response time) must not suffer.
Demand pricing
Traditionally, companies have to buy their own hardware and software when they need computing
power, and this IT infrastructure usually has to be paid for in advance, regardless of how intensively
the company uses it later. With demand pricing, technology vendors link the price to actual use, for
example by making the lease rate for their servers depend on how many CPUs the customer has
enabled. If it can be measured how much computing power the individual departments of a company
actually use, the IT costs can be attributed directly to those departments in internal cost accounting.
Other forms of linking IT costs to usage are also possible.
Standardized Utility Computing Services
The utility computing service provider offers its customers a catalog of standardized services. These
may have different service level agreements (agreements on the quality and the price of an IT
service). The customer has no influence on the underlying technologies such as the server platform.

Utility Computing and Virtualization


Virtualization technologies can be used to share the web and other resources in a shared pool of
machines. They divide the pool into logical resources rather than the physically available resources.
An application is not assigned to specific, pre-determined servers or storage, but is given a free server
or memory from the pool at runtime.
Automation
Repetitive management tasks such as setting up a new server or installing updates can be automated.
Moreover, resources can be allocated to services automatically and the management of IT services can
be optimized, taking service level agreements and the operating costs of IT resources into account.
Different Types of Utility Computing
Utility computing is of two types: internal utility and external utility. Internal utility means that the
computer network is shared only within a company. When several different companies pool together
to use a special service provider, it is called an external utility. In addition, various hybrid forms are
possible.
Advantages of Utility Computing:-
1. The client doesn't have to buy all the hardware, software and licenses needed to do business.
Instead, the client relies on another party to provide these services. The burden of maintaining and
administering the system falls to the utility computing company, allowing the client to concentrate on
other tasks.
2. Utility computing gives companies the option to subscribe to a single service and use the same suite
of software throughout the entire client organization.
3. Another advantage is compatibility. In a large company with many departments, problems can arise
with computing software. Each department might depend on different software suites, and the files
used by employees in one part of a company might be incompatible with the software used by
employees in another part. Subscribing to a single service and using the same suite of software
throughout the organization, as noted above, avoids this problem.
Disadvantages of Utility Computing:-
1. A potential disadvantage is reliability. If a utility computing company is in financial trouble or has
frequent equipment problems, clients could get cut off from the services for which they're paying.
2. Utility computing systems can also be attractive targets for hackers. A hacker might want to access
services without paying for them or snoop around and investigate client files. Much of the
responsibility for keeping the system safe falls to the provider.
⮚ Elastic Computing:
Elastic computing is nothing but a concept in cloud computing in which computing resources can be
scaled up and down easily by the cloud service provider. The cloud service provider gives you flexible
computing power when and wherever it is required. The elasticity of these resources depends upon
factors such as processing power, storage, and bandwidth.
Types of Elastic Cloud Computing
Elastic computing has only one type, i.e. elasticity, or fully automated scalability, which removes the
manual labor of increasing or decreasing resources, as everything is controlled by triggers from the
system monitoring tools.
Scalability refers to the ability of a system to accommodate larger loads just by adding resources,
either by making the hardware stronger (scale up) or by adding additional nodes (scale out).
Elasticity refers to the ability to fit the resources to the load, so that when load increases you scale up
by adding more resources, and when demand diminishes you shrink back and remove unneeded
resources. Elasticity is most important in a cloud environment, where you pay only for the resources
you actually use.
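
As a hedged illustration, the sketch below shows the kind of trigger-driven rule a monitoring tool might apply. The pool object, its methods, and the thresholds are hypothetical; real clouds expose this behaviour through their auto-scaling services.

// A minimal sketch of an elasticity (auto-scaling) rule.
var MAX_INSTANCES = 10;
var MIN_INSTANCES = 2;

function autoScale(pool, avgCpuPercent) {
    if (avgCpuPercent > 70 && pool.size < MAX_INSTANCES) {
        pool.addInstance();      // scale out: demand is high, add a server
    } else if (avgCpuPercent < 30 && pool.size > MIN_INSTANCES) {
        pool.removeInstance();   // scale in: demand fell, release unneeded resources
    }
}

// Called periodically by the monitoring system, for example every minute:
// autoScale(webServerPool, monitor.averageCpu());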
Benefits/Pros of Elastic Cloud Computing
Cost Efficiency: – Cloud is available at much cheaper rates than traditional approaches and can
significantly lower the overall IT expenses. By using cloud solutions, companies can save licensing fees
as well as eliminate overhead charges such as the cost of data storage, software updates, management
etc.
Convenience and continuous availability: – The cloud makes it easier to access shared documents and
files, with the choice to view or modify them. Public clouds also offer services that are available
wherever the end user might be located. Moreover, it guarantees continuous availability of resources;
in case of system failure, alternative instances are automatically spawned on other machines.
Backup and Recovery: – The process of backing up and recovering data is simplified, as information
resides in the cloud and not on a physical device. The various cloud providers offer reliable and
flexible backup/recovery solutions.
Cloud is environmentally friendly:-The cloud is more efficient than the typical IT infrastructure and
it takes fewer resources to compute, thus saving energy.
Scalability and Performance: – Scalability is a built-in feature for cloud deployments. Cloud
instances are deployed automatically only when needed and as a result enhance performance with
excellent speed of computations.
Increased Storage Capacity:-The cloud can accommodate and store much more data compared to a
personal computer and in a way offers almost unlimited storage capacity.
Disadvantages/Cons of Elastic Cloud Computing:-
Security and Privacy in the Cloud: – Security is the biggest concern in cloud computing. Companies
essentially host their private data and information in the cloud; as remote cloud infrastructure is used,
it is then up to the cloud service provider to manage and protect the data and keep it confidential.
Limited Control: – Since the applications and services run remotely on third-party virtual
environments, companies and users have limited control over the function and execution of the
hardware and software.
Dependency and vendor lock-in: – One of the major drawbacks of cloud computing is the implicit
dependency on the provider. It is also called “vendor lock-in”. It becomes difficult to migrate vast
amounts of data from the old provider to a new one, so it is advisable to select the vendor very carefully.
Increased Vulnerability: – Cloud-based solutions are exposed on the public internet and are therefore a
more vulnerable target for malicious users and hackers. As we know, nothing is completely secure over
the Internet; even the biggest organizations suffer from serious attacks and security breaches.
⮚ Ajax: asynchronous ‘rich’ interfaces:
Asynchronous communication between the client and the server forms the backbone of AJAX.
Although an asynchronous request-response method can provide significant value in the development
of rich functionality by itself, the results are a lot more pronounced when used in conjunction with other
functional standards such as CSS, DOM, JavaScript, and so on. The predominant popularity of AJAX
stems from such usage.
Client-server communication can be achieved either by using IFrames or by using the supported
JavaScript object XMLHttpRequest. Due to certain limitations of IFrames, XMLHttpRequest has
gained a lot more acceptance, although IFrames can also be an effective option for implementing
AJAX-based solutions.
The primary advantage of using AJAX-based interfaces is that the update of content occurs without
page refreshes. A typical AJAX implementation using XMLHttpRequest happens as described in the
following steps:
1. An action on the client side, whether this is a mouse click or a timed refresh, triggers a client event
2. An XMLHttpRequest object is created and configured
3. The XMLHttpRequest object makes a call
4. The request is processed by a server-side component
5. The component returns an XML (or an equivalent) document containing the result

6. The XMLHttpRequest object calls the callback() function and processes the result
7. The HTML DOM is updated with any resulting values
The following simplified image illustrates the high-level steps involved in an AJAX request
flow. The portal client page gets served to the client browser, where the execution of JavaScript
functions takes place.

The following example illustrates the initialization of the request object and its basic use, followed by
opening and sending an asynchronous GET request (the resource path here is only illustrative):

var request;
if (window.XMLHttpRequest) {
    // for non-IE browsers
    request = new XMLHttpRequest();
} else if (window.ActiveXObject) {
    // for older versions of IE
    request = new ActiveXObject("Microsoft.XMLHTTP");
}

request.onreadystatechange = function () {
    if (request.readyState == 4) {
        // everything received, OK. Do something now to process the response.
    } else {
        // wait for the response to reach the ready state
    }
};

request.open("GET", "/some/server/resource", true); // true = asynchronous
request.send();

⮚ Mashups: User interface Services


The term ‘mash-up’ refers to websites that weave data from different sources into new Web services.
The key to a successful Web service is to gather and use large datasets and harness the scale of the
Internet through what is known as network effects.
The main characteristics of a mashup are combination, visualization, and aggregation. The aim is to
make existing data more useful, for both personal and professional use. To be able to permanently
access the data of other services, mashups are generally client applications or are hosted online.
Architecture of the Mashup –
The architecture of a mashup is divided into three layers:
Presentation / user interaction: this is the user interface of the mashup. The technologies used are
HTML/XHTML, CSS, JavaScript, and Asynchronous JavaScript and XML (Ajax).


Web services: the product’s functionality can be accessed using API services. The technologies
used are XMLHttpRequest, XML-RPC, JSON-RPC, SOAP, and REST.
Data: handling the data, i.e. sending, storing, and receiving it. The technologies used are XML, JSON,
and KML.
Architecturally, there are two styles of mashups:
1. Web-based
2. Server-based
Web-based mashups typically use the user’s Web browser to combine and reformat the data.
Server-based mashups analyse and reformat the data on a remote server and transmit the data to the
user’s browser in its final form.
Note: Mashups can be used with software provided as a service (SaaS).
Types of the Mashup –
There are many types of mashup, such as business mashups, consumer mashups, and data mashups.
The most common type of mashup is the consumer mashup, aimed at the general public.
Business (or enterprise) mashups generally define applications that combine their own resources,
application and data, with other external Web services. They focus data into a single presentation and
allow for collaborative action among businesses and developers. This works well for an agile
development project, which requires collaboration between the developers and customer (or customer
proxy, typically a product manager) for defining and implementing the business requirements.
Enterprise mashups are secure, visually rich Web applications that expose actionable information from
diverse internal and external information sources.
Consumer mashups combine different data types. They combine data from multiple public sources in
the browser and organize it through a simple browser user interface.
Data mashups, in contrast to consumer mashups, combine similar types of media and information
from multiple sources into a single representation. The combination of all these resources creates a
new and distinct Web service that was not originally provided by either source.

⮚ Virtualization Technology:
Virtualization technology provides an alternative technical approach to delivering infrastructure,
platforms, operating systems, servers, software, and applications. Most virtualized computing
environments have much in common with conventional data centers, but employ high-performing
hardware and specialized software that enable a single physical server to function as multiple
concurrently running instances. This approach increases capacity utilization, which is especially
valuable in IT service-based models.
Virtual Machine:
A virtual machine is a virtual representation, or emulation, of a physical computer. They are often
referred to as a guest while the physical machine they run on is referred to as the host.
The virtual machine is the core abstraction provided by virtualization technology.
A Virtual Machine Monitor (VMM) is a software program that enables the creation, management
and governance of virtual machines (VM) and manages the operation of a virtualized environment on
top of a physical host machine.
VMM is also known as Virtual Machine Manager and Hypervisor.
VMM encapsulates the very basics of virtualization in cloud computing. It is used to separate the
physical hardware from its emulated parts, which often include the CPU, memory, I/O and network
traffic. The guest operating system, which would normally interact with the hardware directly, now
interacts with a software emulation of that hardware.

Process Virtual Machines


A process VM, sometimes called an application virtual machine, runs as a normal application inside an
OS and supports a single process. It is created when that process is started and destroyed when it exits.
Its purpose is to provide a platform-independent programming environment that abstracts away details
of the underlying hardware or operating system, and allows a program to execute in the same way on
any platform.
A process VM provides a high-level abstraction — that of a high-level programming language
(compared to the low-level ISA abstraction of the system VM). Process VMs are implemented using
an interpreter; performance comparable to compiled programming languages is achieved by the use of
just-in-time compilation.
This type of VM has become popular with the Java programming language, which is implemented
using the Java virtual machine. Another example is the .NET Framework, which runs on a VM called
the Common Language Runtime.

System Virtual Machines


System virtual machines (sometimes called hardware virtual machines) allow the sharing of the
underlying physical machine resources between different virtual machines, each running its own
operating system. The software layer providing the virtualization is called a virtual machine monitor or
hypervisor.
A hypervisor can run on bare hardware (Type 1 or native VM) or on top of an operating system (Type
2 or hosted VM).
Type 1 hypervisors run directly on the physical hardware (usually a server), taking the place of the
OS. Typically, you use a separate software product to create and manipulate VMs on the hypervisor.
Some management tools, like VMware’s vSphere, let you select a guest OS to install in the VM. You
can use one VM as a template for others, duplicating it to create new ones. Depending on your needs,
you might create multiple VM templates for different purposes, such as software testing, production
databases, and development environments.

Type 2 hypervisors run as an application within a host OS and usually target single-user desktop or
notebook platforms. With a Type 2 hypervisor, you manually create a VM and then install a guest OS
in it. You can use the hypervisor to allocate physical resources to your VM, manually setting the
number of processor cores and the amount of memory it can use. Depending on the hypervisor’s capabilities, you can
also set options like 3D acceleration for graphics.

Main advantages of system VMs
1. Multiple OS environments can co-exist on the same computer, in strong isolation from each
other;
2. The virtual machine can provide an instruction set architecture (ISA) that is somewhat different
from that of the real machine.
Main disadvantages of system VMs
● There is still an overhead from the virtualization solution used to run and manage a VM, so the
performance of a VM will be somewhat slower compared to a physical system with a comparable
configuration.
● Virtualization means decoupling from the physical hardware available to the host; access to devices
usually has to go through the virtualization solution, and this may not always be possible.

Virtual machine migration


Virtual machine migration is the task of moving a virtual machine from one physical hardware
environment to another. It is part of managing hardware virtualization systems and is something that
providers look at as they offer virtualization services.
Virtual machine migration is also known as teleportation.
In hardware virtualization, physical hardware is carved up into a set of virtual machines: logical
hardware pieces that do not have a physical shell or composition and are essentially just programmed
pieces of an overall hardware system. In a virtualization setup, a central hypervisor or
another tool allocates resources like CPU and memory to virtual machines. For instance, in older
networks, most of the individual elements were physical workstations, such as desktop PCs, which
were connected by Ethernet cabling or other physical connections. By contrast, virtual machines do
not have a physical interface. They do not have a box or shell or anything to move around. But they
can be connected to the same keyboards, monitors and peripherals that humans have always used to
interact with personal computers.
In virtual machine migration, system administrators move these virtual pieces between physical
servers or other hardware pieces. In an effort to facilitate this, a new kind of migration has evolved
called "live virtual machine migrations." Live migration involves moving these virtual machines
without shutting down a client system. Modern services often provide live migration functionality to
make it easier to move virtual machines without doing a lot of other administrative work.

⮚ Pitfalls of virtualization:
Here are some of the most common issues posed by adopting virtualization that every organization
must consider.
1. Detection/Discovery
You can't manage what you can't see! IT departments are often unprepared for the complexity
associated with understanding what VMs (virtual machines) exist and which are active or inactive.
To overcome these challenges, discovery tools need to extend to the virtual world by identifying
Virtual Machine Disk Format (.vmdk) files and how many exist within the environment. This will
identify both active and inactive VMs.

2. Correlation
Difficulty in understanding which VMs are on which hosts and identifying which business critical
functions are supported by each VM is a common and largely unforeseen problem encountered by IT
departments employing virtualization. Mapping guest-to-host relationships and grouping the VMs
by criticality and application is a best practice when implementing virtualization.


3. Configuration management
Ensuring VMs are configured properly is crucial in preventing performance bottlenecks and security
vulnerabilities. Complexities in VM provisioning and offline VM patching are a frequent issue for IT
departments. A technical-controls configuration management database (CMDB) is critical to
understanding the configurations of VMs, especially dormant ones. The CMDB will provide the
current state of a VM even if it is dormant, allowing a technician to update the configuration by
auditing and making changes to the template.

4. Additional security considerations


If a host is vulnerable, all associated guest VMs and the business applications on those VMs are also
at risk. This could lead to a far more wide-reaching impact than the same exploit on a single physical
server. Treat a virtual machine just like any other system and enforce security policies and compliance.
Also, use an application that dynamically maps guest-to-host relationships and tracks guest VMs as
they move from host to host.

5. VM identity management issues


Virtualization introduces complexities that often lead to issues surrounding separation of duties.
Who manages these machines? Do application owners have visibility into changes being made?
Identify roles and criticality and put them through the same processes you leverage for physical
devices including change management, release management and hardening guidelines.

6. VM network configuration control


With multiple operating systems sharing a single IP address behind a NAT, network access control
becomes much more complex in a virtual network. To address this, use AD, DNS and NetBIOS to
identify bridged VMs. IP sweeps in most cases will not pick these up.

7. Identifying and controlling VM proliferation


VMs can pop up and move to any location in an instant. To manage this potential issue, you must
establish and enforce a process for virtual machine deployment.

8. VM host capacity planning


Virtualization can make understanding what applications are running and how many resources are
being leveraged much more difficult. To better deal with this issue, organizations must track how
many guest-to-host relationships exist and the configuration of the VMs.

9. ESX host driver and ACL information


How is the ESX System itself configured? Does it meet your PCI requirements? Who has
permissions to the system? Does it meet your regulatory compliance needs? Organizations must
proactively manage ESX machines by tracking and trending their security configurations over time
to make sure they don’t "drift" from corporate standards.

10. ESX host configuration management


If a guest is infected with a worm or virus, it will attack the other local VMs. If that image is moved
to another host, it will continue to do damage across the organization. Do you have visibility into
guest-to-host relationships and their configurations? Guest-to-host mapping and their configuration
history is critical to the success of managing virtual machines.

11. Intellectual property


Virtualization makes it more difficult to know who has what information. How do you know your
VMs are not walking out the door with critical information and data? Verifying encrypted data and
historical information on your guest VMs can help manage and secure intellectual property.

⮚ Multitenant software: Multi-entity support, Multischema approach, Multi-tenancy using cloud data stores.
In cloud computing, multitenancy means that multiple customers of a cloud vendor are using the
same computing resources. Despite the fact that they share resources, cloud customers aren't aware
of each other, and their data is kept totally separate. Multitenancy is a crucial component of cloud
computing; without it, cloud services would be far less practical. Multitenant architecture is a feature
in many types of public cloud computing, including IaaS, PaaS, SaaS, containers, and serverless
computing.

To understand multitenancy, think of how banking works. Multiple people can store their money in
one bank, and their assets are completely separate even though they're stored in the same place.
Customers of the bank don't interact with each other, don't have access to other customers' money,
and aren't even aware of each other. Similarly, in public cloud computing, customers of the cloud
vendor use the same infrastructure – the same servers, typically – while still keeping their data and
their business logic separate and secure.

The classic definition of multitenancy was a single software instance* that served multiple users, or
tenants. However, in modern cloud computing, the term has taken on a broader meaning, referring to
shared cloud infrastructure instead of just a shared software instance.

*A software instance is a copy of a running program loaded into random access memory (RAM).

Benefits of multitenancy
Many of the benefits of cloud computing are only possible because of multitenancy. Here are two
crucial ways multitenancy improves cloud computing:

Better use of resources: One machine reserved for one tenant isn't efficient, as that one tenant is not
likely to use all of the machine's computing power. By sharing machines among multiple tenants, use
of available resources is maximized.

Lower costs: With multiple customers sharing resources, a cloud vendor can offer their services to
many customers at a much lower cost than if each customer required their own dedicated
infrastructure.

Drawbacks of multitenancy
Possible security risks and compliance issues: Some companies may not be able to store data within
shared infrastructure, no matter how secure, due to regulatory requirements. Additionally, security
problems or corrupted data from one tenant could spread to other tenants on the same machine,

although this is extremely rare and shouldn't occur if the cloud vendor has configured their
infrastructure correctly. These security risks are somewhat mitigated by the fact that cloud vendors
typically are able to invest more in their security than individual businesses can.

The "noisy neighbor" effect: If one tenant is using an inordinate amount of computing power, this
could slow down performance for the other tenants. Again, this should not occur if the cloud vendor
has set up their infrastructure correctly.

How does multitenancy work?


Here we will take a more in-depth look at the technical principles that make multitenancy possible in
different kinds of cloud computing.

In public cloud computing


Imagine a special car engine that could be shared easily between multiple cars and car owners. Each
car owner needs the engine to behave slightly differently: some car owners require a powerful
8-cylinder engine, while others require a more fuel-efficient 4-cylinder engine. Now imagine that
this special engine is able to morph itself each time it starts up so that it can better meet the car
owner's needs.

This is similar to the way many public cloud providers implement multitenancy. Most cloud
providers define multitenancy as a shared software instance. They store metadata* about each tenant
and use this data to alter the software instance at runtime to fit each tenant's needs. The tenants are
isolated from each other via permissions. Even though they all share the same software instance,
they each use and experience the software differently.

*Metadata is information about a file, somewhat like the description on the back of a book.
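To make the idea of a metadata-driven shared instance concrete, here is a minimal Java sketch of one way a single software instance might look up per-tenant metadata at request time and adapt its behaviour. It is an illustration only; the class name, tenant identifiers, plan names and quota values are all hypothetical and not taken from any particular cloud provider.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of metadata-driven multitenancy: one shared instance,
// per-tenant settings looked up at request time. All names are hypothetical.
public class MultiTenantService {

    // Per-tenant metadata kept by the provider (plan, quota, feature flags, ...)
    static class TenantConfig {
        final String plan;
        final int requestQuota;
        TenantConfig(String plan, int requestQuota) {
            this.plan = plan;
            this.requestQuota = requestQuota;
        }
    }

    // Registry of tenant metadata; in a real system this would live in a shared data store.
    private final Map<String, TenantConfig> registry = new HashMap<>();

    public MultiTenantService() {
        registry.put("tenant-a", new TenantConfig("premium", 10_000));
        registry.put("tenant-b", new TenantConfig("basic", 100));
    }

    // Every request carries a tenant id; the shared code path adapts to that tenant's metadata.
    public String handleRequest(String tenantId, String payload) {
        TenantConfig cfg = registry.get(tenantId);
        if (cfg == null) {
            return "unknown tenant";   // isolation: no metadata, no service
        }
        // The same instance behaves differently for each tenant.
        return "[" + cfg.plan + ", quota=" + cfg.requestQuota + "] processed: " + payload;
    }

    public static void main(String[] args) {
        MultiTenantService service = new MultiTenantService();
        System.out.println(service.handleRequest("tenant-a", "report"));
        System.out.println(service.handleRequest("tenant-b", "report"));
    }
}

In practice the registry would be a shared data store, and permissions would prevent one tenant's requests from ever reaching another tenant's configuration or data.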

In container architecture
Containers are self-contained bundles of software that include an application, system libraries,
system settings, and everything else the application needs in order to run. Containers help ensure that
an application runs the same no matter where it is hosted.

Containers are partitioned from each other into different user space environments, and each
container runs as if it were the only system on that host machine. Because containers are
self-contained, multiple containers created by different cloud customers can run on a single host
machine.

In serverless computing
Serverless computing is a model in which applications are broken up into smaller pieces called
functions, and each function only runs on demand, separately from the other functions. (This model
of cloud computing is also known as Function-as-a-Service, or FaaS.)

As the name implies, serverless functions do not run on dedicated servers, but rather on any
available machine in the serverless provider's infrastructure. Because companies are not assigned

their own discrete physical servers, serverless providers will often be running code from several of
their customers on a single server at any given time – another example of multitenancy.

Some serverless platforms use Node.js for executing serverless code. The Cloudflare serverless
platform, Cloudflare Workers, uses Chrome V8, in which each function runs in its own sandbox, or
separate environment. This keeps serverless functions totally separate from each other even when
they’re running on the same infrastructure.

In private cloud computing


Private cloud computing uses multitenant architecture in much the same way that public cloud
computing does. The difference is that the other tenants are not from external organizations. In
public cloud computing, Company A shares infrastructure with Company B. In private cloud
computing, different teams within Company A share infrastructure with each other.


Unit -3
Cloud Database: A cloud database is a database that typically runs on a cloud computing platform,
and access to the database is provided as-a-service.
Database services take care of scalability and high availability of the database. Database services
make the underlying software-stack transparent to the user.
There are two primary methods to run a database in a cloud:
Virtual machine image
Cloud platforms allow users to purchase virtual-machine instances for a limited time, and one can
run a database on such virtual machines. Users can either upload their own machine image with a
database installed on it, or use ready-made machine images that already include an optimized
installation of a database.
Database-as-a-service (DBaaS)
With a database as a service model, application owners do not have to install and maintain the
database themselves. Instead, the database service provider takes responsibility for installing and
maintaining the database, and application owners are charged according to their usage of the service.
This is a type of software as a service (SaaS).
Data model:
The design and development of typical systems utilize data management and relational databases as
their key building blocks. Modern relational databases have shown poor performance on
data-intensive systems; therefore, the idea of NoSQL has been utilized within database management
systems for cloud-based systems. NoSQL databases have proven to provide efficient horizontal
scalability, good performance, and ease of assembly into cloud applications.
Data models relying on simplified relay algorithms have also been employed in data-intensive cloud
mapping applications unique to virtual frameworks.
SQL databases
SQL databases are one type of database which can run in the cloud, either in a virtual machine or as a
service, depending on the vendor. While SQL databases are easily vertically scalable, horizontal
scalability poses a challenge that cloud database services based on SQL have started to address.
NoSQL databases
NoSQL databases are another type of database which can run in the cloud. NoSQL databases are built
to service heavy read/write loads and can scale up and down easily, and therefore they are more
natively suited to running in the cloud. However, most contemporary applications are built around an
SQL data model, so working with NoSQL databases often requires a complete rewrite of application code.
Some SQL databases have developed NoSQL capabilities including JSON, binary JSON (e.g. BSON
or similar variants), and key-value store data types.
A multi-model database with relational and non-relational capabilities provides a standard SQL
interface to users and applications and thus facilitates the usage of such databases for contemporary
applications built around an SQL data model. Native multi-model databases support multiple data
models with one core and a unified query language to access all data models.
How Cloud Databases Work
Cloud databases can be divided into two broad categories: relational and non-relational.

A relational database, typically written in structured query language (SQL), is composed of a set of
interrelated tables that are organized into rows and columns. The relationship between tables and
columns (fields) is specified in a schema. SQL databases, by design, rely on data that is highly
consistent in its format, such as banking transactions or a telephone directory. Popular relational
database systems include MySQL, Oracle, IBM DB2 and Microsoft SQL Server; some, such as
MySQL, are open source.

Non-relational databases, sometimes called NoSQL, do not employ a table model. Instead, they store
content, regardless of its structure, as a single document. This technology is well-suited for
unstructured data, such as social media content, photos and videos.

Cloud File System:


HDFS:
HDFS is a distributed file system that handles large data sets running on commodity hardware. It is
used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. HDFS is
one of the major components of Apache Hadoop, the others being MapReduce and YARN.
HDFS is the storage unit of Hadoop that is used to store and process huge volumes of data on
multiple data nodes. It is designed to run on low-cost hardware and provides data across multiple
Hadoop clusters. It has high fault tolerance and throughput.

Large file is broken down into small blocks of data. HDFS has a default block size of 128 MB which
can be increased as per requirement. Multiple copies of each block are stored in the cluster in a
distributed manner on different nodes.
HDFS follows the master-slave architecture and it has the following elements.
Name node:
The system having the namenode acts as the master server and it does the following tasks −
● Manages the file system namespace.
● Regulates client’s access to files.
● It also executes file system operations such as renaming, closing, and opening files and
directories.

Data node:
For every node (commodity hardware) in the cluster, there will be a data node. These nodes manage
the data storage of their system.
● Data nodes perform read-write operations on the file systems, as per client request.
● They also perform operations such as block creation, deletion, and replication according to
the instructions of the namenode.
Block:
Generally the user data is stored in the files of HDFS. The file in a file system will be divided into
one or more segments and/or stored in individual data nodes. These file segments are called as
blocks. In other words, the minimum amount of data that HDFS can read or write is called a Block.
The default block size is 128 MB, but it can be increased as per the need to change in HDFS
configuration.
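As a small illustration of how a client works against this NameNode/DataNode/block layout, the sketch below uses the Hadoop FileSystem Java API to write and then read a file. It is a minimal sketch: the file path is a placeholder, and the cluster address is assumed to come from the standard Hadoop configuration files on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client sketch: the FileSystem object talks to the NameNode for
// namespace operations, while the actual block data flows to and from DataNodes.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");  // placeholder path

        // Write: the file is split into blocks (128 MB by default) and replicated by HDFS.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read: the client asks the NameNode for block locations, then reads from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}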
GFS:
Google File System (GFS or GoogleFS) is a proprietary distributed file system developed by Google
to provide efficient, reliable access to data using large clusters of commodity hardware.
The GFS is composed of clusters. A cluster is a set of networked computers. GFS clusters contain
three types of interdependent entities which are: Client, master and chunk server.
Clients could be: Computers or applications manipulating existing files or creating new files on the
system.
The master server is the manager of the cluster system that maintains the operation log. Operation
log keeps track of the activities made by the master itself, which helps reduce service interruptions to
a minimum. At startup, the master server retrieves information about contents and inventories from
the chunk servers. Thereafter, the master server keeps track of the location of the chunks within the
cluster.
The GFS architecture keeps the messages that the master server sends and receives very small. The
master server itself doesn’t handle file data at all; this is done by chunk servers.
Chunk servers are the core engine of the GFS. They store file chunks of 64 MB size. Chunk servers
coordinate with the master server and send requested chunks to clients directly.
GFS consists of a single master and multiple chunk servers.

Files are divided into fixed sized chunks.


Each chunk has 64 MB of data in it. Each chunk is replicated on multiple chunk servers (3 by default).
Even if any chunk server crashes, the data file will still be present in other chunk servers.

This helped Google to store and process huge volumes of data in a distributed manner.

Comparisons among GFS and HDFS:


1. Node Division: HDFS contain single NameNode and many DataNodes in is file system. GFS
contain single Master Node and multiple Chunk Servers and is accessed by multiple clients.
2. Scalability: Both HDFS and GFS are considered as cluster based architecture. Each file system
runs over machines built from commodity hardware. Each cluster may consist of thousands of
nodes with huge data size storage.
3. Implementation: Since GFS is proprietary file system and exclusive to Google only, it cannot be
used by any other company. In the other part, HDFS based on Apache Hadoop open-source project
can be deployed and used by any company willing to manage and process big data.
4. File serving: In GFS, files are divided into units called chunks of fixed size. Chunk size is 64 MB
and can be stored on different nodes in cluster for load balancing and performance needs. In
Hadoop, HDFS file system divides the files into units called blocks of 128 MB in size. Block size
can be adjustable based on the size of data.
5. Internal communication: Communication between chucks and clusters within GFS is made
through TCP connections. For data transfer, pipelining is used over TCP connections. The same
method is in HDFS, but Remote Procedure Call (RPC) is used to conduct external communication
between clusters and blocks.

6. Cache management: In GFS, cache metadata are saved in client memory. Chunk servers do not
need to cache file data; the Linux system running on the chunk server caches frequently accessed data
in memory. The HDFS has a "Distributed Cache", a facility provided by the MapReduce framework to
distribute application-specific, large, read-only files efficiently. It also caches files such
can be private or public, that determines how they can be shared on the slave nodes.
"Private" Distributed Cache files are cached in a local directory private to the user whose jobs need
these files.
"Public" Distributed Cache files are cached in a global directory and the file access is setup in such
a way that they are publicly visible to all users.

7. Files protection and permission: GFS splits files up and stores them in multiple pieces on multiple
machines. Files have random names and are not human readable. Files are obfuscated
through algorithms that change constantly.
The HDFS implements POSIX-like mode permission for files and directories. All files and
directories are associated with an owner and a group with separate permissions for users who are
owners, for users that are members of the group and for all other users.

8. Replication strategy: The GFS has two replicas: Primary replicas and secondary replicas.
A primary replica is the data chunk that a chunk server sends to a client.
Secondary replicas serve as backups on other chunk servers. User can specify the number of
replicas to be maintained.
The HDFS has an automatic replication rack based system. By default two copies of each block are
stored by different Data Nodes in the same rack and a third copy is stored on a Data Node on a
different rack.
9. File namespace: In GFS, files are organized hierarchically in directories and identified by path
names. The GFS is exclusively for Google only.
The HDFS supports a traditional hierarchical file organization. Users or application can create
directories to store files inside. The HDFS also supports third-party file systems such as
CloudStore and Amazon Simple Storage Service (S3).
10. File system database: The GFS has bigtable database. Bigtable is a proprietary database
developed by Google using c++.
Apache developed its own database called HBase in Hadoop. The HBase is built with Java
language.

Features of HDFS:
1. Cost-effective:
In HDFS architecture, the DataNodes, which stores the actual data are inexpensive commodity
hardware, thus reduces storage costs.
2. Large Datasets/ Variety and volume of data:
HDFS can store data of any size (ranging from megabytes to petabytes) and of any formats (structured,
unstructured).
3. Replication:
In HDFS replication of data is done to solve the problem of data loss in unfavourable conditions like
crashing of a node, hardware failure, and so on.

The data is replicated across a number of machines in the cluster by creating replicas of blocks. The
process of replication is maintained at regular intervals of time by HDFS and HDFS keeps creating
replicas of user data on different machines present in the cluster.
Hence whenever any machine in the cluster gets crashed, the user can access their data from other
machines that contain the blocks of that data. Hence there is no possibility of a loss of user data.
4. Fault Tolerance and reliability:
HDFS is highly fault-tolerant and reliable. HDFS creates replicas of file blocks depending on the
replication factor and stores them on different machines.
If any of the machines containing data blocks fail, other DataNodes containing the replicas of that data
blocks are available. Thus ensuring no loss of data and makes the system reliable even in unfavourable
conditions.
Hadoop 3 introduced Erasure Coding to provide Fault Tolerance. Erasure Coding in HDFS improves
storage efficiency while providing the same level of fault tolerance and data durability as traditional
replication-based HDFS deployment.
5. High Availability:
The High availability feature of Hadoop ensures the availability of data even during NameNode or
DataNode failure.
Since HDFS creates replicas of data blocks, if any of the DataNodes goes down, the user can access
his data from the other DataNodes containing a copy of the same data block.
Also, if the active NameNode goes down, the passive node takes the responsibility of the active
NameNode. Thus, data will be available and accessible to the user even during a machine crash.
6. Scalability:
As HDFS stores data on multiple nodes in the cluster, when requirements increase we can scale the
cluster. There are two scalability mechanisms available: Vertical scalability – add more resources
(CPU, Memory, and Disk) on the existing nodes of the cluster.
Another way is horizontal scalability – Add more machines in the cluster. The horizontal way is
preferred since we can scale the cluster from 10s of nodes to 100s of nodes on the fly without any
downtime.
7. Data Integrity:
Data integrity refers to the correctness of data. HDFS ensures data integrity by constantly checking the
data against the checksum calculated during the write of the file.
While file reading, if the checksum does not match with the original checksum, the data is said to be
corrupted. The client then opts to retrieve the data block from another DataNode that has a replica of
that block. The NameNode discards the corrupted block and creates an additional new replica.
8. High Throughput:
Hadoop HDFS stores data in a distributed fashion, which allows data to be processed parallelly on a
cluster of nodes. This decreases the processing time and thus provides high throughput.
9. Data Locality:
Data locality means moving computation logic to the data rather than moving data to the
computational unit.
In the traditional system, the data is brought at the application layer and then gets processed.
But in the present scenario, due to the massive volume of data, bringing data to the application layer
degrades the network performance.

In HDFS, we bring the computation part to the Data Nodes where data resides. Hence, with Hadoop
HDFS, we are moving computation logic to the data, rather than moving data to the computation logic.
This feature reduces the bandwidth utilization in a system.
10. Distributed Storage:
HDFS store data in a distributed manner across the nodes. In Hadoop, data is divided into blocks and
stored on the nodes present in the HDFS cluster. After that HDFS create the replica of each and every
block and store on other nodes. When a single machine in the cluster gets crashed we can easily access
our data from the other nodes which contain its replica.
Features of GFS:
GFS features include:
1. Fault tolerance
2. Critical data replication
3. Automatic and efficient data recovery
4. High aggregate throughput
5. Reduced client and master interaction because of large chunk server size
6. Namespace management and locking
7. High availability
BigTable:
Google uses as data storage a facility called Bigtable. Bigtable is a distributed, persistent,
multidimensional sorted map. Bigtable is not a relational database. BigTable is designed with
semi-structured data storage in mind. It is a large map that is indexed by a row key, column key, and a
timestamp. Each value within the map is an array of bytes that is interpreted by the application.
Every read or write of data to a row is atomic, regardless of how many different columns are read or
written within that row.
(Row key: type string, column key: type string, timestamp: type int64) → String
The key can get generated by the database or by the application.

Few characteristics of BigTable:


1. map
A map is an associative array; a data structure that allows one to look up a value to a corresponding
key quickly. BigTable is a collection of (key, value) pairs where the key identifies a row and the value
is the set of columns.
Using JavaScript Object Notation, here's an example of a simple map where all the values are just
strings:
{
"zzzzz" : "woot",
"xyz" : "hello",
"aaaab" : "world",
"1" : "x",
"aaaaa" : "y"
}
2. persistent
The data is stored persistently on disk.
3. distributed
BigTable's data is distributed among many independent machines. At Google, BigTable is built on top
of GFS (Google File System). The table is broken up among rows, with groups of adjacent rows
managed by a server. A row itself is never distributed.
4. sparse

The table is sparse, meaning that different rows in a table may use different columns, with many of the
columns empty for a particular row.
5. sorted
Most associative arrays are not sorted. A key is hashed to a position in a table. BigTable sorts its data
by keys. This helps keep related data close together, usually on the same machine — assuming that
one structures keys in such a way that sorting brings the data together. For example, if domain names
are used as keys in a BigTable, it makes sense to store them in reverse order (com.google.mail rather
than mail.google.com) so that related domains sort next to each other.
Continuing our JSON example, the sorted version looks like this:
{
"1" : "x",
"aaaaa" : "y",
"aaaab" : "world",
"xyz" : "hello",
"zzzzz" : "woot"
}
6. multidimensional
A table is indexed by rows. Each row contains one or more named column families. Column families
are defined when the table is first created. Within a column family, one may have one or more named
columns. All data within a column family is usually of the same type. The implementation of BigTable
usually compresses all the columns within a column family together. Columns within a column family
can be created on the fly. Rows, column families and columns provide a three-level naming hierarchy
in identifying data. For example: Adding one dimension to our running JSON example gives us this:
{
"1" : {
"A" : "x",
"B" : "z"
},
"aaaaa" : {
"A" : "y",
"B" : "w"
},
"aaaab" : {
"A" : "world",
"B" : "ocean"
},
"xyz" : {
"A" : "hello",
"B" : "there"
},
"zzzzz" : {
"A" : "woot",
"B" : "1337"
}
}
A column family may have any number of columns, denoted by a column "qualifier" or "label".
Here's a subset of our JSON example again, this time with the column qualifier dimension built in:

{
// ...
"aaaaa" : {
"A" : {

"foo" : "y",
"bar" : "d"
},
"B" : {
"" : "w"
}
},
"aaaab" : {
"A" : {
"foo" : "world",
"bar" : "domination"
},
"B" : {
"" : "ocean"
}
},
// ...
}
7. time-based
Time is another dimension in BigTable data. Every column family may keep multiple versions of
column family data. If an application does not specify a timestamp, it will retrieve the latest version of
the column family. Alternatively, it can specify a timestamp and get the latest version that is earlier
than or equal to that timestamp.

Cloud BigTable Architecture:

● Client requests go through a front-end server


● Nodes are organized into a Cloud Bigtable cluster of a Cloud Bigtable instance
● Each node in the cluster handles a subset of the requests to the cluster.
● Add nodes to increase the number of simultaneous requests to handle and maximum
throughput
● Table is sharded into blocks of contiguous rows, called tablets similar to HBase regions.
Tablets are stored on Colossus, Google’s file system, in SSTable format.

● An SSTable is an ordered, immutable map from keys to values, where both keys and values are byte strings.
● Tablet is associated with a specific node.
● Writes are stored in Colossus’s shared log as soon as they are acknowledged.
● Data is never stored in nodes themselves;
● Nodes have pointers to a set of tablets stored on Colossus.
● Rebalancing tablets from one node to another is very fast
● Recovery from the failure of a node is very fast
● When a Cloud Bigtable node fails, no data is lost.
HBase:
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.
HBase is a data model that is similar to Google’s big table designed to provide quick random access to
huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System
(HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the
Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer reads/accesses the
data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read
and write access.

Storage Mechanism in HBase

HBase is a column-oriented database and the tables in it are sorted by row. The table schema
defines only column families, which are the key-value pairs. A table can have multiple column families
and each column family can have any number of columns. Subsequent column values are stored
contiguously on the disk. Each cell value of the table has a timestamp. In short, in an HBase:

● Table is a collection of rows.


● Row is a collection of column families.
● Column family is a collection of columns.
● Column is a collection of key value pairs.
Given below is an example schema of table in HBase.

Rowid | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3)

Column Oriented and Row Oriented

Column-oriented databases are those that store data tables as sections of columns of data, rather than
as rows of data. Shortly, they will have column families.

Row-Oriented Database: suitable for Online Transaction Processing (OLTP); such databases are
designed for a small number of rows and columns.
Column-Oriented Database: suitable for Online Analytical Processing (OLAP); column-oriented
databases are designed for huge tables.

The following image shows column families in a column-oriented database:

HBase Architecture and its Important Components

Below is a detailed architecture of HBase with its components:


HBase Architecture Diagram

HBase architecture consists mainly of the following components:

● HMaster: in HBase is the implementation of a Master server in HBase architecture. It acts as a


monitoring agent to monitor all Region Server instances present in the cluster and acts as an
interface for all the metadata changes.
● HRegionserver: HRegionServer is the Region Server implementation. It is responsible for serving
and managing regions or data that is present in a distributed cluster.
● HRegions: HRegions are the basic building elements of HBase cluster that consists of the
distribution of tables and are comprised of Column families. It contains multiple stores, one for
each column family. It consists of mainly two components, which are Memstore and Hfile.
● Zookeeper: HBase Zookeeper is a centralized monitoring server which maintains configuration
information and provides distributed synchronization. Distributed synchronization is to access the
distributed applications running across the cluster with the responsibility of providing coordination
services between nodes. If the client wants to communicate with regions, the server's client has to
approach ZooKeeper first.
● HDFS: HDFS is a Hadoop distributed file system, as the name implies it provides a distributed
environment for the storage and it is a file system designed in a way to run on commodity
hardware. It stores each file in multiple blocks and to maintain fault tolerance, the blocks are
replicated across a Hadoop cluster.
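The short sketch below shows how the row key / column family / column qualifier model described above looks from the HBase Java client API: it writes one cell and reads it back. It is a minimal sketch; the table name ("employee"), column family ("personal") and values are hypothetical and the table would need to exist in the cluster beforehand.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal HBase client sketch: data is addressed by (row key, column family, column qualifier).
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {  // hypothetical table

            // Write one cell: row "row1", column family "personal", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Raju"));
            table.put(put);

            // Read the same cell back; each cell value also carries a timestamp.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}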

Dynamo:
Dynamo is a set of techniques that together can form a highly available key-value structured storage
system or a distributed data store. It has properties of both databases and distributed hash tables
(DHTs). It was used in Amazon Web Services, such as its Simple Storage Service (S3).
Principles
Incremental scalability: Dynamo should be able to scale out one storage host (or “node”) at a time,
with minimal impact on both operators of the system and the system itself.
Symmetry: Every node in Dynamo should have the same set of responsibilities as its peers; there
should be no distinguished node or nodes that take special roles or extra set of responsibilities.
Decentralization: An extension of symmetry, the design should favor decentralized peer-to-peer
techniques over centralized control.
Heterogeneity: The system should be able to exploit heterogeneity in the infrastructure it runs on. For
example, the work distribution must be proportional to the capabilities of the individual servers. This
is essential in adding new nodes with higher capacity without having to upgrade all hosts at once.

DynamoDB is a fully-managed NoSQL database service designed to deliver fast and predictable
performance. It uses the Dynamo model in the essence of its design, and improves those features.
● Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that require
consistent single-digit millisecond latency at any scale.
● It is a fully managed database that supports both document and key-value data models.
● Its flexible data model and performance makes it a great fit for mobile, web, gaming, ad-tech, IOT,
and many other applications.
● It is stored in SSD storage.
● It is spread across three geographically distinct data centres.
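As an illustration of DynamoDB's key-value/document model, the sketch below uses the AWS SDK for Java document API to write and read one item. It is a minimal sketch only: the table name ("Music") and its partition/sort keys are hypothetical, the table is assumed to have been created beforehand, and AWS credentials are assumed to be configured in the environment.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;

// Minimal DynamoDB sketch using the document API (AWS SDK for Java v1).
public class DynamoDbExample {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
        DynamoDB dynamoDB = new DynamoDB(client);
        Table table = dynamoDB.getTable("Music");   // hypothetical, pre-created table

        // Put an item keyed by partition key "Artist" and sort key "SongTitle".
        table.putItem(new Item()
                .withPrimaryKey("Artist", "No One You Know", "SongTitle", "Call Me Today")
                .withString("AlbumTitle", "Somewhat Famous"));

        // Get the same item back by its full primary key.
        Item item = table.getItem("Artist", "No One You Know", "SongTitle", "Call Me Today");
        System.out.println(item.toJSONPretty());
    }
}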

Map Reduce:
MapReduce is a processing technique and a program model for distributed computing based on java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of
data and converts it into another set of data, where individual elements are broken down into tuples
(key/value pairs). Secondly, reduce task, which takes the output from a map as an input and combines
those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the
reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called mappers
and reducers. Decomposing a data processing application into mappers and reducers is sometimes
nontrivial. But, once we write an application in the MapReduce form, scaling the application to run
over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a
configuration change. This simple scalability is what has attracted many programmers to use the
MapReduce model.

The Map Reduce Model: Algorithm

● Generally the MapReduce paradigm is based on sending the computer to where the data
resides!
● MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce
stage.
o Map stage − The map or mapper’s job is to process the input data. Generally the input
data is in the form of file or directory and is stored in the Hadoop file system (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the
data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in the
HDFS.
● During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers
in the cluster.
● The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
● Most of the computing takes place on nodes with data on local disks that reduces the network
traffic.

● After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.

Inputs and Outputs (Java Perspective)

The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to
the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job,
conceivably of different types.
The key and the value classes should be in serialized manner by the framework and hence, need to
implement the Writable interface. Additionally, the key classes have to implement the
Writable-Comparable interface to facilitate sorting by the framework. Input and Output types of
a MapReduce job − (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3>(Output).
Input → Output
Map: <k1, v1> → list(<k2, v2>)
Reduce: <k2, list(v2)> → list(<k3, v3>)
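To make this <key, value> flow concrete, here is a minimal word-count sketch using the Hadoop MapReduce Java API: the mapper turns each input line into (word, 1) pairs, and the reducer sums the counts for each word after the shuffle. The class names are only illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map stage: (byte offset, line of text) -> list of (word, 1)
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    // Reduce stage: (word, [1, 1, ...]) -> (word, total count)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // emit (word, total)
        }
    }
}

In a complete program, a driver class would configure a Job with these mapper and reducer classes, set the HDFS input and output paths, and submit the job to the cluster.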

Parallel Computing:
Parallel computing refers to the process of breaking down larger problems into smaller, independent,
often similar parts that can be executed simultaneously by multiple processors communicating via
shared memory, the results of which are combined upon completion as part of an overall algorithm.
The primary goal of parallel computing is to increase available computation power for faster
application processing and problem solving.
Parallel computing infrastructure is typically housed within a single datacenter where several
processors are installed in a server rack; computation requests are distributed in small chunks by the
application server that are then executed simultaneously on each server.
There are generally four types of parallel computing, available from both proprietary and open source
parallel computing vendors: bit-level parallelism, instruction-level parallelism, task parallelism, and
superword-level parallelism.
● Task parallelism: a form of parallelization of computer code across multiple processors that runs
several different tasks at the same time on the same data (see the sketch after this list).
● Bit-level parallelism: increases processor word size, which reduces the quantity of instructions the
processor must execute in order to perform an operation on variables greater than the length of the
word.
● Instruction-level parallelism: the hardware approach works upon dynamic parallelism, in which
the processor decides at run-time which instructions to execute in parallel; the software approach
works upon static parallelism, in which the compiler decides which instructions to execute in parallel.
● Superword-level parallelism: a vectorization technique that can exploit parallelism of inline code.