Cloud Computing Notes.docx
Unit-1
⮚ Introduction to Service Oriented Architecture:
Service-Oriented Architecture (SOA) is an architectural approach in which applications make use of
services available in the network. In this architecture, services are provided to form applications,
through a communication call over the internet.
● SOA allows users to combine a large number of facilities from existing services to form
applications.
● SOA encompasses a set of design principles that structure system development and provide means
for integrating components into a coherent and decentralized system.
● SOA based computing packages functionalities into a set of interoperable services, which can be
integrated into different software systems belonging to separate business domains.
There are two major roles within Service-oriented Architecture:
1. Service provider: The service provider is the maintainer of the service and the organization that
makes available one or more services for others to use. To advertise services, the provider can
publish them in a registry, together with a service contract that specifies the nature of the service,
how to use it, the requirements for the service, and the fees charged.
2. Service consumer: The service consumer can locate the service metadata in the registry and
develop the required client components to bind and use the service.
Services might aggregate information and data retrieved from other services or create workflows of
services to satisfy the request of a given service consumer. This practice is known as service
orchestration. Another important interaction pattern is service choreography, which is the coordinated
interaction of services without a single point of control.
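As a concrete illustration of orchestration, here is a minimal sketch (not part of the original notes) of a workflow that composes two services; the endpoint URLs and field names are hypothetical placeholders, and the code assumes modern JavaScript (Node.js 18+ or a browser) with the built-in fetch API.

// Hypothetical orchestration sketch: one workflow calls two services in
// sequence and combines their results for the service consumer.
async function getOrderSummary(orderId) {
    // Step 1: call the (hypothetical) order service
    const orderRes = await fetch("https://orders.example.com/orders/" + orderId);
    const order = await orderRes.json();

    // Step 2: feed the order data into the (hypothetical) pricing service
    const priceRes = await fetch("https://pricing.example.com/quote?sku=" + order.sku);
    const quote = await priceRes.json();

    // The orchestrator composes both results into a single response
    return { orderId: orderId, sku: order.sku, total: quote.total };
}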
⮚ Components of SOA:
⮚ Principles of SOA:
1. Standardized service contract: Specified through one or more service description documents.
2. Loose coupling: Services are designed as self-contained components that maintain relationships
minimizing dependencies on other services.
3. Abstraction: A service is completely defined by service contracts and description documents.
They hide their logic, which is encapsulated within their implementation.
4. Reusability: Designed as components, services can be reused more effectively, thus reducing
development time and the associated costs.
5. Autonomy: Services have control over the logic they encapsulate and, from a service consumer's
point of view, there is no need to know about their implementation.
6. Discoverability: Services are defined by description documents that constitute supplemental
metadata through which they can be effectively discovered. Service discovery provides an
effective means for utilizing third-party resources.
7. Composability: Using services as building blocks, sophisticated and complex operations can be
implemented. Service orchestration and choreography provide solid support for composing
services and achieving business goals.
⮚ Advantages of SOA:
● Service reusability: In SOA, applications are made from existing services. Thus, services can be
reused to make many applications.
● Easy maintenance: As services are independent of each other they can be updated and modified
easily without affecting other services.
● Platform independent: SOA allows making a complex application by combining services picked
from different sources, independent of the platform.
● Availability: SOA facilities are easily available to anyone on request.
● Reliability: SOA applications are more reliable because it is easier to debug small services than
huge code bases.
● Scalability: Services can run on different servers within an environment; this increases scalability.
⮚ Disadvantages of SOA:
● High overhead: A validation of input parameters is performed whenever services interact; this
decreases performance because it increases load and response time.
● High investment: A huge initial investment is required for SOA.
● Complex service management: When services interact, they exchange messages to perform tasks.
The number of messages may run into the millions, and handling such a large number of messages
becomes a cumbersome task.
⮚ Web service
Web service is a standardized medium to propagate communication between the client and server
applications on the World Wide Web. A web service is a software module that is designed to perform a
certain set of tasks.
● Web services in cloud computing can be searched for over the network and can also be invoked
accordingly.
● When invoked, the web service would be able to provide the functionality to the client, which
invokes that web service.
How web services work:
The above diagram shows a very simplistic view of how a web service would actually work. The
client would invoke a series of web service calls via requests to a server which would host the actual
web service.
These requests are made through what are known as remote procedure calls. Remote Procedure Calls
(RPC) are calls made to methods hosted by the relevant web service.
As an example, Amazon provides a web service that provides prices for products sold online via
amazon.com. The front end or presentation layer can be in .Net or Java but either programming
language would have the ability to communicate with the web service.
The main component of a web service design is the data transferred between the client and the server,
and that data is XML. XML (Extensible Markup Language) is a counterpart to HTML: an
easy-to-understand intermediate language that is understood by many programming languages.
So when applications talk to each other, they actually talk in XML. This provides a common platform
for applications developed in various programming languages to talk to each other.
Web services use something known as SOAP (Simple Object Access Protocol) for sending the XML
data between applications. The data is sent over normal HTTP. The data which is sent from the web
service to the application is called a SOAP message. The SOAP message is nothing but an XML
document. Since the document is written in XML, the client application calling the web service can be
written in any programming language.
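As a rough sketch of the idea that a SOAP message is just an XML document sent over normal HTTP, the following hypothetical example posts a SOAP envelope using the fetch API (Node.js 18+ or a browser); the endpoint URL, SOAPAction value, and envelope contents are assumptions for illustration only.

// Hypothetical sketch: sending a SOAP envelope (an XML document) over HTTP.
const soapEnvelope =
    '<?xml version="1.0"?>' +
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">' +
    '  <soap:Body>' +
    '    <getPrice xmlns="http://example.com/products">' +
    '      <productId>12345</productId>' +
    '    </getPrice>' +
    '  </soap:Body>' +
    '</soap:Envelope>';

async function callSoapService() {
    const response = await fetch("https://example.com/productService", {
        method: "POST",
        headers: {
            "Content-Type": "text/xml; charset=utf-8",
            "SOAPAction": "http://example.com/products/getPrice"
        },
        body: soapEnvelope
    });
    const responseXml = await response.text(); // the SOAP response is also XML
    return responseXml;
}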
The service requestor uses a find operation to retrieve the service description, either locally or from the
service registry. It then uses the service description to bind with the service provider and invoke the
web service implementation. The following figure illustrates the operations, roles, and their interaction.
Web services are broadly of two types: SOAP web services and RESTful web services.
1. SOAP web services
SOAP was developed as an intermediate language so that applications built in various programming
languages could talk to each other easily, avoiding extreme development effort.
For example, suppose we request access to a To-do application from the Facebook application. The
Facebook application sends an XML request to the To-do application. The To-do application processes
the request, generates an XML response, and sends it back to the Facebook application.
The SOAP specification defines something known as a "SOAP message", which is what is exchanged
between the web service and the client application.
o Each SOAP document needs to have a root element known as the <Envelope> element. The
root element is the first element in an XML document.
o The "envelope" is in turn divided into 2 parts. The first is the header, and the next is the body.
o The header contains the routing data, which is basically the information that tells to which client
the XML document needs to be sent.
o The body will contain the actual message.
SOAP Message Structure
SOAP messages are normally auto-generated by the web service when it is called.
Whenever a client application calls a method in the web service, the web service will automatically
generate a SOAP message which will have the necessary details of the data which will be sent from
the web service to the client application.
A SOAP message contains two parts:
o The header element and
o The body element
In the above figure, the SOAP-Envelope contains a SOAP-Header and a SOAP-Body. The SOAP-Header
contains meta-information needed to identify the request, for example, authentication, authorization,
signature, etc. The SOAP-Header is optional. The SOAP-Body contains the real XML content of the
request or response. In case of an error, the server responds back with a SOAP-Fault.
The SOAP XML request and response structure.
XML Request
<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
    <Body>
        <getCourseDetailRequest xmlns="http://udemy.com/course">
            <id>course1</id>
        </getCourseDetailRequest>
    </Body>
</Envelope>
XML Response
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
    <SOAP-ENV:Header /> <!-- empty header -->
    <SOAP-ENV:Body> <!-- body begins -->
        <ns2:getCourseDetailsResponse xmlns:ns2="http://in28mi"> <!-- content of the response -->
            <ns2:course>
                <ns2:id>Course1</ns2:id>
                <ns2:name>Spring</ns2:name>
                <ns2:description>10 Steps</ns2:description>
            </ns2:course>
        </ns2:getCourseDetailsResponse>
    </SOAP-ENV:Body> <!-- body ends -->
</SOAP-ENV:Envelope>
⮚ Web Service Description Language (WSDL)
WSDL is an acronym for Web Service Description Language. WSDL is an XML-based interface
description language. It is used for describing the functionality offered by a web service. Sometimes it
is also known as the WSDL file; the extension of a WSDL file is .wsdl. It provides a
machine-readable description of how the service can be called, what parameters it expects, and what
data structure it returns.
It describes a service as a collection of network endpoints, or ports. It is often used in combination with
SOAP and an XML Schema to provide web services over a distributed environment. In short, the
purpose of WSDL is similar to a type signature in a programming language.
WSDL 1.1 Term | WSDL 2.0 Term | Description
Service | Service | A set of system functions.
Port | Endpoint | An endpoint that defines a combination of a binding and a network address.
Binding | Binding | Specifies the interface and defines the SOAP binding style. It also defines the operations.
PortType | Interface | An abstract set of operations supported by one or more endpoints.
Operation | Operation | An abstract description of an action supported by the service. It defines the SOAP actions and the way the message is encoded.
Message | N/A | An abstract, typed definition of the data to communicate. W3C removed the message element in WSDL 2.0; XML Schema types for defining the bodies of inputs, outputs, and faults are referenced directly.
Types | Types | A container for data type definitions. The XML Schema language (XSD) is used for this purpose.
⮚ Universal Description, Discovery, and Integration (UDDI)
The idea behind UDDI is to discover organizations and the services that organizations offer, much like
using a telephone directory. It allows businesses to list themselves by name, product, location, or
the web services they offer. UDDI works in the following manner:
o A service provider registers its business with the UDDI registry.
o A service provider registers each service separately with the UDDI registry.
o The consumer looks up the business and service in the UDDI registry.
o The consumer binds the service with the service provider and uses the service.
The UDDI business registry system consists of three directories, as follows:
o White Pages
o Yellow pages
o Green Pages
White Pages: The white pages contain basic information such as company name, address, phone
number, and other business identifiers such as tax numbers.
Yellow Pages: The yellow pages contain detailed business data organized by relevant business
classification. The newer version of the yellow pages classifies businesses according to
NAICS (the North American Industry Classification System).
Green Pages: The green pages contain information about the company's crucial business process, such
as operating platform, supported programs, and other high-level business protocols.
⮚ Introduction to RESTful Web Services
REST stands for REpresentational State Transfer. It was developed by Roy Thomas Fielding, who
was also one of the principal authors of the HTTP specification. The main goal of RESTful web
services is to make web services more effective. RESTful web services try to define services using
concepts that are already present in HTTP. REST is an architectural approach, not a protocol.
It does not define a standard message exchange format. We can build REST services with both XML
and JSON; JSON is the more popular format with REST. The key abstraction in REST is a resource. A
resource can be anything, and it can be accessed through a Uniform Resource Identifier (URI), for
example /users/{id} for a user resource.
A resource has representations like XML, HTML, and JSON. The current state of a resource is
captured by its representation. When we request a resource, the server provides a representation of that resource.
The important methods of HTTP are:
o GET: It reads a resource.
o PUT: It updates an existing resource.
o POST: It creates a new resource.
o DELETE: It deletes the resource.
For example, if we want to perform the following actions in a social media application, we use the
corresponding requests (a small client sketch follows the status codes below).
POST /users: creates a user.
GET /users/{id}: retrieves the detail of a user.
GET /users: retrieves the details of all users.
DELETE /users: deletes all users.
DELETE /users/{id}: deletes a user.
GET /users/{id}/posts/{post_id}: retrieves the detail of a specific post.
POST /users/{id}/posts: creates a post for the user.
HTTP also defines the following standard status codes:
o 404: RESOURCE NOT FOUND
o 200: SUCCESS
o 201: CREATED
o 401: UNAUTHORIZED
o 500: SERVER ERROR
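The sketch below shows how a client could exercise the example URIs above and inspect the returned status codes; the base URL and request payload are hypothetical, and the code assumes the fetch API (Node.js 18+ or a browser).

// Hypothetical REST client sketch for the /users resource described above.
const baseUrl = "https://api.example.com"; // placeholder host

async function demo() {
    // POST /users - create a user (expect 201 CREATED on success)
    const createRes = await fetch(baseUrl + "/users", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ name: "Alice" })
    });
    console.log(createRes.status); // e.g. 201

    // GET /users/{id} - retrieve the detail of a user
    const getRes = await fetch(baseUrl + "/users/1");
    if (getRes.status === 200) {
        console.log(await getRes.json()); // JSON representation of the resource
    } else if (getRes.status === 404) {
        console.log("RESOURCE NOT FOUND");
    }

    // DELETE /users/{id} - delete a user
    await fetch(baseUrl + "/users/1", { method: "DELETE" });
}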
⮚ RESTful Service Constraints/Characteristics/Principle
1. Stateless – The server holds no client state; the client side holds the session state. Each
request contains enough context to process the message.
2. Uniform Interface – Interface between client and server, HTTP verbs (GET, PUT, POST,
DELETE), URLs (resource name), and HTTP response (status, body).
3. Cacheable – Server responses (representations) are cacheable, whether implicitly, explicitly, or by negotiation.
4. Layered System – It improves scalability. Usually, the client cannot tell whether it is directly
connected to the end server or to an intermediary. Intermediary servers may enable load balancing and
provide shared caches to improve system scalability.
5. Client – Server – Clients and servers are separated by a uniform interface. This separation of concerns
means clients are not concerned with activities at the server end, such as data storage.
6. Code on Demand (optional) – Servers can temporarily transfer executable logic to the client.
In this way the functionality of a client can be customized.
⮚ REST Web Service Components
An Informatica REST web service has the following components:
Resource
A resource includes the mapping that the REST web service runs and the definition of the
response message that the web service returns. The resource also includes a resource ID, which
is a key field in the output data. When you define a resource, you define the structure of the
output data that the web service returns to the client. A web service can have multiple
resources.
Resource mapping
The mapping that returns data to the web service client. A resource mapping does
not read the request query. The REST resource mapping contains a Read transformation. The
transformation reads a data object in the Model repository to retrieve the data to return to the
client. By default, you do not have to add a Filter transformation or a Lookup transformation to
retrieve the data based on the client query. The REST web service filters the output data after
the mapping returns data.
Request message
A request from a web service client to the web service to perform a task. An Informatica web
service can perform an HTTP GET method. The request message is a string that contains the
name of the web service, the name and network location of the resource to perform the task,
and the parameters to filter the output.
Resource ID
A key field that you can search for in the output data. Each key field has a URL in the output
data.
Response message
A JSON or XML file that contains the data to return to the web service client. The response
message can contain a hierarchy of elements and multiple-occurring data.
⮚ RESTful Message
RESTful web services make use of the HTTP protocol as a medium of communication between client
and server. A client sends a message in the form of an HTTP Request, and the server responds in the form
of an HTTP Response. This technique is termed Messaging. These messages contain the message data
and metadata, i.e. information about the message itself. Let us have a look at the HTTP Request and
HTTP Response messages for HTTP 1.1.
HTTP Request
⮚ Software as a service
Software as a service (or SaaS) is a way of delivering applications over the Internet—as a service.
Instead of installing and maintaining software, you simply access it via the Internet, freeing yourself
from complex software and hardware management.
SaaS applications are sometimes called Web-based software, on-demand software, or hosted software.
Whatever the name, SaaS applications run on a SaaS provider’s servers. The provider manages access
to the application, including security, availability, and performance.
SaaS services can be accessed from any device such as desktops, laptops, tablets, phones, and thin
clients.
7. API Integration
SaaS services easily integrate with other software or services through standard APIs.
8. No client-side installation
SaaS services are accessed directly from the service provider over an internet connection, so no
client-side software installation is required.
Disadvantages of SaaS cloud computing layer
1) Security
Since data is stored in the cloud, security may be an issue for some users. However, cloud
computing is not necessarily more secure than in-house deployment.
2) Latency issue
Since data and applications are stored in the cloud at a variable distance from the end-user, there is a
possibility that there may be greater latency when interacting with the application compared to local
deployment. Therefore, the SaaS model is not suitable for applications whose demand response time is
in milliseconds.
3) Total Dependency on Internet
Without an internet connection, most SaaS applications are not usable.
4) Switching between SaaS vendors is difficult
Switching SaaS vendors involves the difficult and slow task of transferring very large data files
over the internet and then converting and importing them into the new SaaS application.
Popular SaaS Providers
The below table shows some popular SaaS providers and services that are provided by them -
Provider | Services
Salesforce.com | On-demand CRM solutions
Microsoft Office 365 | Online office suite
Google Apps | Gmail, Google Calendar, Docs, and Sites
NetSuite | ERP, accounting, order management, CRM, Professional Services Automation (PSA), and e-commerce applications
GoToMeeting | Online meeting and video-conferencing software
Constant Contact | E-mail marketing, online survey, and event marketing
Oracle CRM | CRM applications
Workday, Inc. | Human capital management, payroll, and financial management
2. Application frameworks
PaaS providers provide application frameworks to simplify application development.
Some popular application frameworks provided by PaaS providers are Node.js, Drupal, Joomla,
WordPress, Spring, Play, Rack, and Zend.
3. Databases
PaaS providers provide various databases such as ClearDB, PostgreSQL, MongoDB, and Redis to
communicate with the applications.
4. Other tools
PaaS providers provide various other tools that are required to develop, test, and deploy the
applications.
Advantages of PaaS
There are the following advantages of PaaS -
1) Simplified Development
PaaS allows developers to focus on development and innovation without worrying about infrastructure
management.
2) Lower risk
No need for up-front investment in hardware and software. Developers only need a PC and an internet
connection to start building applications.
3) Prebuilt business functionality
Some PaaS vendors also provide predefined business functionality so that users can avoid
building everything from scratch and can start their projects directly.
4) Instant community
PaaS vendors frequently provide online communities where developers can get ideas, share
experiences, and seek advice from others.
5) Scalability
Applications deployed can scale from one to thousands of users without any changes to the
applications.
Disadvantages of PaaS cloud computing layer
1) Vendor lock-in
One has to write the applications according to the platform provided by the PaaS vendor, so the
migration of an application to another PaaS vendor would be a problem.
2) Data Privacy
Corporate data, whether critical or not, is private, so if it is not located within the walls
of the company, there can be a risk in terms of data privacy.
3) Integration with the rest of the systems applications
It may happen that some applications are local and some are in the cloud, so there will be
increased complexity when we want to use data that is in the cloud together with local data.
Popular PaaS Providers
The below table shows some popular PaaS providers and services that are provided by them -
Providers | Services
Google App Engine (GAE) | App Identity, URL Fetch, Cloud Storage client library, LogService
Salesforce.com | Faster implementation, rapid scalability, CRM services, Sales Cloud, mobile connectivity, Chatter
Windows Azure | Compute, security, IoT, data storage
AppFog | Justcloud.com, SkyDrive, GoogleDocs
OpenShift | RedHat, Microsoft Azure
Cloud Foundry from VMware | Data, messaging, and other services
⮚ Infrastructure as a Service | IaaS
IaaS is also known as Hardware as a Service (HaaS). It is one of the layers of the cloud computing
platform. It allows customers to outsource their IT infrastructures such as servers, networking,
processing, storage, virtual machines, and other resources. Customers access these resources on the
Internet using a pay-as-per use model.
In traditional hosting services, IT infrastructure was rented out for a specific period of time, with
pre-determined hardware configuration. The client paid for the configuration and time, regardless of
the actual use. With the help of the IaaS cloud computing platform layer, clients can dynamically scale
the configuration to meet changing requirements and are billed only for the services actually used.
IaaS cloud computing platform layer eliminates the need for every organization to maintain the IT
infrastructure.
IaaS is offered in three models: public, private, and hybrid cloud. The private cloud implies that the
infrastructure resides at the customer-premise. In the case of public cloud, it is located at the cloud
computing platform vendor's data center, and the hybrid cloud is a combination of the two in which the
customer selects the best of both public cloud and private cloud.
IaaS provider provides the following services -
1. Compute: Computing as a Service includes virtual central processing units and virtual main
memory for the VMs that are provisioned to end users.
2. Storage: IaaS provider provides back-end storage for storing files.
3. Network: Network as a Service (NaaS) provides networking components such as routers,
switches, and bridges for the VMs.
4. Load balancers: It provides load balancing capability at the infrastructure layer.
Advantages of IaaS cloud computing layer
There are the following advantages of IaaS computing layer -
1. Shared infrastructure
IaaS allows multiple users to share the same physical infrastructure.
2. Web access to the resources
IaaS allows IT users to access resources over the internet.
3. Pay-as-per-use model
IaaS providers provide services on a pay-as-per-use basis. Users are required to pay only for
what they have used.
4. Focus on the core business
IaaS allows organizations to focus on their core business rather than on IT infrastructure.
5. On-demand scalability
On-demand scalability is one of the biggest advantages of IaaS. Using IaaS, users do not need to worry
about upgrading software or troubleshooting issues related to hardware components.
Disadvantages of IaaS cloud computing layer
1. Security
Security is one of the biggest issues in IaaS. Most of the IaaS providers are not able to provide 100%
security.
2. Maintenance & Upgrade
Although IaaS service providers maintain the software, they do not upgrade the software for some
organizations.
3. Interoperability issues
It is difficult to migrate a VM from one IaaS provider to another, so customers might face problems
related to vendor lock-in.
Top IaaS Providers
IaaS Vendor | IaaS Solution | Details
Amazon Web Services | Elastic Compute Cloud (EC2), Elastic MapReduce, Route 53, Virtual Private Cloud, etc. | The cloud computing platform pioneer; Amazon offers auto scaling, cloud monitoring, and load balancing features as part of its portfolio.
Netmagic Solutions | Netmagic IaaS Cloud | Netmagic runs from data centers in Mumbai, Chennai, and Bangalore, and a virtual data center in the United States. Plans are underway to extend services to West Asia.
Rackspace | Cloud Servers, Cloud Files, Cloud Sites, etc. | The cloud computing platform vendor focuses primarily on enterprise-level hosting services.
Reliance Communications | Reliance Internet Data Center | RIDC supports both traditional hosting and cloud services, with data centers in Mumbai, Bangalore, Hyderabad, and Chennai. The cloud services offered by RIDC include IaaS and SaaS.
Sify Technologies | Sify IaaS | Sify's cloud computing platform is powered by HP's converged infrastructure. The vendor offers all three types of cloud services: IaaS, PaaS, and SaaS.
Tata Communications | InstaCompute | InstaCompute is Tata Communications' IaaS offering. InstaCompute data centers are located in Hyderabad and Singapore, with operations in both countries.
⮚ Types of Cloud Services to Monitor
There are multiple types of cloud services to monitor. Cloud monitoring is not just about monitoring
servers hosted on AWS or Azure. Enterprises also place a lot of importance on monitoring the
cloud-based services that they consume, including services like Office 365 and others.
SaaS – Services like Office 365, Salesforce and others
PaaS – Developer friendly services like SQL databases, caching, storage and more
IaaS – Servers hosted by cloud providers like Azure, AWS, Digital Ocean, and others
FaaS – New serverless applications like AWS Lambda and Azure Functions
Application Hosting – Services like Azure App Services, Heroku, etc
Cloud monitoring works through a set of tools that supervise the servers, resources, and applications
running in the cloud. These tools generally come from two sources:
In-house tools from the cloud provider - this is a simple option because the tools are part of the
service. There is no installation, and integration is seamless.
Tools from independent SaaS provider - although the SaaS provider may be different from the cloud
service provider, that doesn’t mean the two services don’t work seamlessly. These providers also have
expertise in managing performance and costs.
⮚ Benefits of Cloud Monitoring
The top benefits of leveraging cloud monitoring tools include:
● They already have infrastructure and configurations in place. Installation is quick and easy.
● Dedicated tools are maintained by the host. That includes hardware.
● These solutions are built for organizations of various sizes. So if cloud activity increases, the right
monitoring tool can scale seamlessly.
● Subscription-based solutions can keep costs low. They do not require startup or infrastructure
expenditures, and maintenance costs are spread among multiple users.
● Because the resources are not part of the organization’s servers and workstations, they don’t suffer
interruptions when local problems disrupt the organization.
● Many tools can be used on multiple types of devices — desktop computers, tablets, and phones.
This allows organizations to monitor apps and services from any location with Internet access.
⮚ Benefits of cloud
On-demand self-service: A client can provision computer resources without the need for interaction
with cloud service provider personnel.
• Broad network access: Access to resources in the cloud is available over the network using standard
methods in a manner that provides platform-independent access to clients of all types.
This includes a mixture of heterogeneous operating systems, and thick and thin platforms such as
laptops, mobile phones, and PDAs.
• Resource pooling: A cloud service provider creates resources that are pooled together in a system
that supports multi-tenant usage.
Physical and virtual systems are dynamically allocated or reallocated as needed. Intrinsic in this
concept of pooling is the idea of abstraction that hides the location of resources such as virtual
machines, processing, memory, storage, and network bandwidth and connectivity.
• Rapid elasticity: Resources can be rapidly and elastically provisioned.
The system can add resources by either scaling up systems (more powerful computers) or scaling out
systems (more computers of the same kind), and scaling may be automatic or manual. From the
standpoint of the client, cloud computing resources should look limitless and can be purchased at any
time and in any quantity.
• Measured service: The use of cloud system resources is measured, audited, and reported to the
customer based on a metered system.
A client can be charged based on a known metric such as amount of storage used, number of
transactions, network I/O (Input/Output) or bandwidth, amount of processing power used, and so forth.
A client is charged based on the level of services provided.
Lower costs: Because cloud networks operate at higher efficiencies and with greater utilization,
significant cost reductions are often encountered.
• Ease of utilization: Depending upon the type of service being offered, you may find that you do not
require hardware or software licenses to implement your service.
• Quality of Service: The Quality of Service (QoS) is something that you can obtain under contract
from your vendor.
• Reliability: The scale of cloud computing networks and their ability to provide load balancing and
failover makes them highly reliable, often much more reliable than what you can achieve in a single
organization.
• Outsourced IT management: A cloud computing deployment lets someone else manage your
computing infrastructure while you manage your business. In most instances, you achieve
considerable reductions in IT staffing costs.
• Simplified maintenance and upgrade: Because the system is centralized, you can easily apply
patches and upgrades. This means your users always have access to the latest software versions.
• Low Barrier to Entry: In particular, upfront capital expenditures are dramatically reduced. In cloud
computing, anyone can be a giant at any time.
⮚ Limitations of cloud
1. Internet dependence: Everything in the cloud is accessible only through the internet. If the cloud server
faces issues, so will your application. Moreover, if you use an internet service that fluctuates a lot,
cloud computing is not for you. Even the biggest service providers face quite long downtimes, so in
certain scenarios it becomes a crucial decision to opt for cloud services.
2. Data incompatibility: This varies between service providers. Sometimes, a vendor locks in
the customer by using proprietary rights so that the customer can't switch to another vendor. For
example, there is a chance that a vendor doesn't provide compatibility with Google Docs or Google
Sheets. As your customers or employees become more advanced, your business may be in crucial need
of it. So ensure your contract with the provider is on your terms, not theirs.
3. Security breach threats: Again, data transactions happen over the internet only. Though your
cloud service provider may claim to be one of the best-secured service providers, the final call
should be yours, because cloud computing history has recorded big incidents before. So if you own a
business where a single miss or corruption of data is not at all acceptable, you should think 100 times
before going for the cloud, especially for a large-scale business. For small business owners, the question
here is: will you be able to provide more security for your applications than a cloud service
provider does?
4. Various costs: Is your current application compatible enough to move to the cloud? The common
mistake business owners make is to invite unnecessary expense in order to be highly advanced. If, as per
the current scenario and near-future analysis, your current infrastructure is serving your needs, then
migrating to the cloud would not be recommended, because it may happen that your business
applications need to be rewritten to be compatible with the cloud. Moreover, if your business demands
huge data transfers, you will be billed heavily every month as well. Sometimes, setting up your own
infrastructure can save you from this kind of constant high billing.
5. Customer support: Before opting for the cloud, check the query resolution time. Service providers
are modernizing with time, but still check for the best. If your business faces heavy traffic every day
and heavier traffic on weekends, then a quick fix is always a top priority. The best cloud service provider
must have optimum support for technical difficulties via email, call, chat, or even forums. Choose the
one that provides the best support.
Unit-2
⮚ Utility Computing:
Utility computing is a model in which computing resources are provided to the customer based on
specific demand. The service provider charges exactly for the services provided, instead of a flat rate.
The foundational concept is that users or businesses pay the providers of utility computing for the
amenities used – such as computing capabilities, storage space and applications services.
Utility computing helps eliminate data redundancy, as huge volumes of data are distributed across
multiple servers or backend systems. The client however, can access the data anytime and from
anywhere.
Utility computing is a trending IT service model. It provides on-demand computing resources
(computation, storage, and programming services via API) and infrastructure based on a pay-per-use
method. It minimizes the associated costs and maximizes the efficient use of resources. The advantages
of utility computing are that it reduces IT costs, provides greater flexibility, and is easier to manage.
Large organizations such as Google and Amazon have established their own utility services for
computing, storage, and applications.
Scalability
Utility computing must ensure that sufficient IT resources are available under all conditions. Even when
demand for a service increases, its quality (e.g., response time) should not suffer.
Demand pricing
Traditionally, companies have had to buy their own hardware and software when they need computing
power. This IT infrastructure must usually be paid for in advance, regardless of how intensively the
company uses it later. Technology vendors link price to usage, for example by making the lease rate
for their servers depend on how many CPUs the customer has enabled. If the computing power actually
claimed by individual units of a company can be measured, IT costs can be attributed directly to the
individual departments as internal costs. Other forms of linking IT costs to usage are also possible.
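To make the pay-per-use idea concrete, here is a hedged sketch of how metered usage could be turned into departmental costs; the rate and usage figures are invented for illustration and do not come from any vendor.

// Hypothetical demand-pricing sketch: bill each department only for the
// CPU-hours it actually consumed. The rate and usage numbers are invented.
const ratePerCpuHour = 0.05; // assumed price per CPU-hour

const usageByDepartment = {
    sales: 1200,        // CPU-hours measured for the sales department
    engineering: 8600,
    hr: 150
};

for (const department in usageByDepartment) {
    const cpuHours = usageByDepartment[department];
    const cost = cpuHours * ratePerCpuHour;
    // Unlike buying hardware up front, cost here tracks measured use directly.
    console.log(department + ": " + cpuHours + " CPU-hours -> " + cost.toFixed(2));
}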
Standardized Utility Computing Services
The utility computing service provider offers its customers a catalog of standardized services. These
may have different service level agreements (agreements on the quality and the price of the IT
services). The customer has no influence on the underlying technologies such as the server platform.
Elasticity means provisioning additional resources when demand rises and releasing unneeded
resources when demand falls. Elasticity is especially important in a cloud environment, where you pay
only for the resources you actually use.
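To illustrate the pay-per-use elasticity idea, here is a hedged sketch of a simple auto-scaling rule; the utilization thresholds and instance counts are arbitrary assumptions, not a real provider's policy.

// Hypothetical elasticity sketch: add instances when average utilization is
// high, release unneeded instances when it is low. Thresholds are arbitrary.
function decideScaling(currentInstances, avgCpuUtilization) {
    if (avgCpuUtilization > 0.80) {
        return currentInstances + 1;  // scale out under load
    }
    if (avgCpuUtilization < 0.20 && currentInstances > 1) {
        return currentInstances - 1;  // scale in and stop paying for idle capacity
    }
    return currentInstances;          // demand is steady, keep the current size
}

console.log(decideScaling(3, 0.92)); // 4 - provision an extra instance
console.log(decideScaling(4, 0.10)); // 3 - release an unneeded instance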
Benefits/Pros of Elastic Cloud Computing
Cost Efficiency: – The cloud is available at much cheaper rates than traditional approaches and can
significantly lower overall IT expenses. By using a cloud solution, companies can save licensing fees
as well as eliminate overhead charges such as the cost of data storage, software updates, management,
etc.
Convenience and continuous availability: – The cloud makes it easier to access shared documents and
files, with options to view and modify them. Public clouds also offer services that are available wherever
the end user might be located. Moreover, the cloud guarantees continuous availability of resources; in
case of system failure, alternative instances are automatically spawned on other machines.
Backup and Recovery: – The process of backing up and recovering data is simplified because information
resides in the cloud and not on a physical device. The various cloud providers offer reliable
and flexible backup/recovery solutions.
Cloud is environmentally friendly:-The cloud is more efficient than the typical IT infrastructure and
it takes fewer resources to compute, thus saving energy.
Scalability and Performance: – Scalability is a built-in feature for cloud deployments. Cloud
instances are deployed automatically only when needed and as a result enhance performance with
excellent speed of computations.
Increased Storage Capacity:-The cloud can accommodate and store much more data compared to a
personal computer and in a way offers almost unlimited storage capacity.
Disadvantages/Cons of Elastic Cloud Computing:-
Security and Privacy in the Cloud: – Security is the biggest concern in cloud computing. Companies
essentially hand their private data and information over to the cloud; since remote cloud infrastructure is
used, it is up to the cloud service provider to manage and protect the data and keep it confidential.
Limited Control: – Since the applications and services run remotely on third-party virtual
environments, companies and users have limited control over the function and execution of the hardware
and software.
Dependency and vendor lock-in: – One of the major drawbacks of cloud computing is the implicit
dependency on the provider, also called "vendor lock-in", as it becomes difficult to migrate vast amounts
of data from the old provider to a new one. So, it is advisable to select the vendor very carefully.
Increased Vulnerability: – Cloud-based solutions are exposed on the public internet and are therefore a
more vulnerable target for malicious users and hackers. As we know, nothing is completely secure over
the Internet; even the biggest organizations suffer from serious attacks and security breaches.
⮚ Ajax: asynchronous ‘rich’ interfaces:
Asynchronous communication between the client and the server forms the backbone of AJAX.
Although an asynchronous request-response method can provide significant value in the development
of rich functionality by itself, the results are a lot more pronounced when used in conjunction with other
functional standards such as CSS, DOM, JavaScript, and so on. The predominant popularity of AJAX
stems from such usage.
Client-server communication can be achieved either by using IFrames or by using the supported
JavaScript function call XMLHttpRequest(). While IFrames can also be an effective option for
implementing AJAX-based solutions, XMLHttpRequest has gained much more acceptance due to
certain limitations of IFrames.
The primary advantage of using AJAX-based interfaces is that the update of content occurs without
page refreshes. A typical AJAX implementation using XMLHttpRequest happens as described in the
following steps:
1. An action on the client side, whether this is a mouse click or a timed refresh, triggers a client event
2. An XMLHttpRequest object is created and configured
3. The XMLHttpRequest object makes a call
4. The request is processed by a server-side component
5. The component returns an XML (or an equivalent) document containing the result
6. The XMLHttpRequest object calls the callback() function and processes the result
7. The HTML DOM is updated with any resulting values
The following simplified image illustrates the high-level steps involved in an AJAX request
flow. The portal client page gets served to the client browser, where the execution of JavaScript
functions takes place.
The following example illustrates the initialization of the request object and its basic use:
if (window.XMLHttpRequest) {
    // For non-IE browsers: object of the current window
    request = new XMLHttpRequest();
} else if (window.ActiveXObject) {
    // For older versions of IE
    request = new ActiveXObject("Microsoft.XMLHTTP");
}
request.onreadystatechange = function() {
    // do something to process the response
    if (request.readyState == 4) {
        // everything received, OK. Do something now.
    } else {
        // wait for the response to reach the ready state
    }
};
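For comparison, the same asynchronous request-and-update cycle can be written with the modern fetch API; this is a hedged sketch in which the endpoint URL and element ids are hypothetical.

// Modern equivalent sketch (browser JavaScript): fetch data asynchronously
// and update the DOM without a page refresh. URL and element ids are placeholders.
async function refreshPrices() {
    const response = await fetch("/services/prices"); // hypothetical endpoint
    const data = await response.json();
    document.getElementById("priceList").textContent = JSON.stringify(data);
}
document.getElementById("refreshButton").addEventListener("click", refreshPrices);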
⮚ Mashups:
The technologies involved in building mashups include:
Web Services: the product's functionality can be accessed using API services. The technologies
used are XMLHTTPRequest, XML-RPC, JSON-RPC, SOAP, and REST.
Data: handling the data like sending, storing and receiving. The technologies used are XML, JSON,
and KML.
Architecturally, there are two styles of mashups:
1. Web-based
2. Server-based
Web-based mashups typically use the user’s Web browser to combine and reformat the data.
Server-based mashups analyse and reformat the data on a remote server and transmit the data to the
user’s browser in its final form.
Note: Mashups can be used with software provided as a service (SaaS).
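As a rough illustration of a Web-based mashup, the sketch below pulls data from two hypothetical public APIs in the user's browser and combines them into one view; the URLs and response fields are assumptions.

// Hypothetical Web-based mashup sketch: combine data from two sources in the
// browser. Both endpoints and their response shapes are placeholders.
async function buildCityDashboard(city) {
    const [weatherRes, eventsRes] = await Promise.all([
        fetch("https://weather.example.com/current?city=" + city),
        fetch("https://events.example.com/today?city=" + city)
    ]);
    const weather = await weatherRes.json();
    const events = await eventsRes.json();

    // Reformat the combined data into a single view for the page
    return {
        city: city,
        temperature: weather.temperature,
        happeningToday: events.items
    };
}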
Types of the Mashup –
There are many types of mashup, such as business mashups, consumer mashups, and data mashups.
The most common type of mashup is the consumer mashup, aimed at the general public.
Business (or enterprise) mashups generally define applications that combine their own resources,
application and data, with other external Web services. They focus data into a single presentation and
allow for collaborative action among businesses and developers. This works well for an agile
development project, which requires collaboration between the developers and customer (or customer
proxy, typically a product manager) for defining and implementing the business requirements.
Enterprise mashups are secure, visually rich Web applications that expose actionable information from
diverse internal and external information sources.
Consumer mashups combine different data types. They combine data from multiple public sources in
the browser and organize them through a simple browser user interface.
Data mashups, in contrast to consumer mashups, combine similar types of media and information
from multiple sources into a single representation. The combination of all these resources creates a
new and distinct Web service that was not originally provided by either source.
⮚ Virtualization Technology:
Virtualization technology provides an alternative technical approach to delivering infrastructure,
platforms, operating systems, servers, software, systems, and applications. Most virtualized
computing environments have much in common with conventional data centers, but employ
high-performing hardware and specialized software that enable a single physical server to function as
multiple concurrently running instances. This approach increases capacity utilization and underpins IT
service-based delivery models.
Virtual Machine:
A virtual machine is a virtual representation, or emulation, of a physical computer. It is often
referred to as a guest, while the physical machine it runs on is referred to as the host. The virtual
machine is the central construct of virtualization technology.
A Virtual Machine Monitor (VMM) is a software program that enables the creation, management
and governance of virtual machines (VM) and manages the operation of a virtualized environment on
top of a physical host machine.
VMM is also known as Virtual Machine Manager and Hypervisor.
VMM encapsulates the very basics of virtualization in cloud computing. It is used to separate the
physical hardware from its emulated parts, which often include the CPU, memory, I/O, and network
traffic. A secondary operating system that would usually interact with the hardware now interacts with
a software emulation of that hardware; this secondary operating system is the guest operating system.
Type 2 hypervisors run as an application within a host OS and usually target single-user desktop or
notebook platforms. With a Type 2 hypervisor, you manually create a VM and then install a guest OS
in it. You can use the hypervisor to allocate physical resources to your VM, manually setting the
amount of processor cores and memory it can use. Depending on the hypervisor’s capabilities, you can
also set options like 3D acceleration for graphics.
Main advantages of system VMs
1. Multiple OS environments can co-exist on the same computer, in strong isolation from each
other;
2. The virtual machine can provide an instruction set architecture (ISA) that is somewhat different
from that of the real machine.
Main disadvantages of system VMs
● There is still overhead from the virtualization solution used to run and manage a VM, so the
performance of a VM will be somewhat slower compared to a physical system with a comparable
configuration.
● Virtualization means decoupling from the physical hardware available to the host PC; this usually
means access to devices needs to go through the virtualization solution, and this may not always
be possible.
⮚ Pitfalls of virtualization:
Here are some of the most common issues posed by adopting virtualization that every organization
must consider.
1. Detection/Discovery
You can't manage what you can't see! IT departments are often unprepared for the complexity
associated with understanding what VMs (virtual machines) exist and which are active or inactive.
To overcome these challenges, discovery tools need to extend to the virtual world by identifying
Virtual Machine Disk Format (.vmdk) files and how many exist within the environment. This will
identify both active and inactive VM’s.
2. Correlation
Difficulty in understanding which VMs are on which hosts, and identifying which business-critical
functions are supported by each VM, is a common and largely unforeseen problem encountered by IT
departments employing virtualization. Mapping guest-to-host relationships and grouping VMs
by criticality and application is a best practice when implementing virtualization.
3. Configuration management
Ensuring VMs are configured properly is crucial in preventing performance bottlenecks and security
vulnerabilities. Complexities in VM provisioning and offline VM patching are frequent issues for IT
departments. A technical-controls configuration management database (CMDB) is critical to
understanding the configurations of VMs, especially dormant ones. The CMDB will provide the
current state of a VM even if it is dormant, allowing a technician to update the configuration by
auditing and making changes to the template.
guest-to-host relationships and their configurations? Guest-to-host mapping and configuration
history are critical to the success of managing virtual machines.
⮚ Multitenancy:
To understand multitenancy, think of how banking works. Multiple people can store their money in
one bank, and their assets are completely separate even though they are stored in the same place.
Customers of the bank don't interact with each other, don't have access to other customers' money,
and aren't even aware of each other. Similarly, in public cloud computing, customers of the cloud
vendor use the same infrastructure – the same servers, typically – while still keeping their data and
their business logic separate and secure.
The classic definition of multitenancy was a single software instance* that served multiple users, or
tenants. However, in modern cloud computing, the term has taken on a broader meaning, referring to
shared cloud infrastructure instead of just a shared software instance.
*A software instance is a copy of a running program loaded into random access memory (RAM).
Benefits of multitenancy
Many of the benefits of cloud computing are only possible because of multitenancy. Here are two
crucial ways multitenancy improves cloud computing:
Better use of resources: One machine reserved for one tenant isn't efficient, as that one tenant is not
likely to use all of the machine's computing power. By sharing machines among multiple tenants, use
of available resources is maximized.
Lower costs: With multiple customers sharing resources, a cloud vendor can offer their services to
many customers at a much lower cost than if each customer required their own dedicated
infrastructure.
Drawbacks of multitenancy
Possible security risks and compliance issues: Some companies may not be able to store data within
shared infrastructure, no matter how secure, due to regulatory requirements. Additionally, security
problems or corrupted data from one tenant could spread to other tenants on the same machine,
although this is extremely rare and shouldn't occur if the cloud vendor has configured their
infrastructure correctly. These security risks are somewhat mitigated by the fact that cloud vendors
typically are able to invest more in their security than individual businesses can.
The "noisy neighbor" effect: If one tenant is using an inordinate amount of computing power, this
could slow down performance for the other tenants. Again, this should not occur if the cloud vendor
has set up their infrastructure correctly.
This is similar to the way many public cloud providers implement multitenancy. Most cloud
providers define multitenancy as a shared software instance. They store metadata* about each tenant
and use this data to alter the software instance at runtime to fit each tenant's needs. The tenants are
isolated from each other via permissions. Even though they all share the same software instance,
they each use and experience the software differently.
*Metadata is information about a file, somewhat like the description on the back of a book.
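A hedged sketch of the "one shared instance, per-tenant metadata" idea follows: a single running program serves every tenant, looks up tenant-specific settings at runtime, and keeps each tenant's data isolated by a tenant key. All names and limits are hypothetical.

// Hypothetical multitenancy sketch: one software instance serves all tenants,
// using per-tenant metadata to alter its behaviour at runtime.
const tenantMetadata = {
    acme:   { theme: "dark",  maxRecords: 500 },
    globex: { theme: "light", maxRecords: 50 }
};

// Data is partitioned by tenant id, so tenants never see each other's records.
const dataByTenant = { acme: [], globex: [] };

function handleRequest(tenantId, record) {
    const settings = tenantMetadata[tenantId];          // tenant-specific config
    if (dataByTenant[tenantId].length >= settings.maxRecords) {
        throw new Error("tenant quota exceeded");
    }
    dataByTenant[tenantId].push(record);                // isolated per-tenant storage
    return { storedFor: tenantId, theme: settings.theme };
}

console.log(handleRequest("acme", { user: "alice" }));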
In container architecture
Containers are self-contained bundles of software that include an application, system libraries,
system settings, and everything else the application needs in order to run. Containers help ensure that
an application runs the same no matter where it is hosted.
Containers are partitioned from each other into different user space environments, and each
container runs as if it were the only system on that host machine. Because containers are
self-contained, multiple containers created by different cloud customers can run on a single host
machine.
In serverless computing
Serverless computing is a model in which applications are broken up into smaller pieces called
functions, and each function only runs on demand, separately from the other functions. (This model
of cloud computing is also known as Function-as-a-Service, or FaaS.)
As the name implies, serverless functions do not run on dedicated servers, but rather on any
available machine in the serverless provider's infrastructure. Because companies are not assigned
their own discrete physical servers, serverless providers will often be running code from several of
their customers on a single server at any given time – another example of multitenancy.
Some serverless platforms use Node.js for executing serverless code. The Cloudflare serverless
platform, Cloudflare Workers, uses Chrome V8, in which each function runs in its own sandbox, or
separate environment. This keeps serverless functions totally separate from each other even when
they’re running on the same infrastructure.
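As a hedged sketch, a serverless function is usually just an exported handler that the platform invokes on demand; the example below follows the general shape of Cloudflare Workers' module syntax, and the routing logic is a made-up illustration rather than a real deployment.

// Minimal serverless function sketch (Cloudflare Workers style module syntax).
// The platform runs this on demand; no dedicated server is assigned to it.
export default {
    async fetch(request) {
        const url = new URL(request.url);
        if (url.pathname === "/hello") {
            return new Response("Hello from a serverless function!");
        }
        return new Response("Not found", { status: 404 });
    }
};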
Unit-3
Cloud Database: A cloud database is a database that typically runs on a cloud computing platform,
and access to the database is provided as-a-service.
Database services take care of scalability and high availability of the database. Database services
make the underlying software-stack transparent to the user.
There are two primary methods to run a database in a cloud:
Virtual machine image
Cloud platforms allow users to purchase virtual-machine instances for a limited time, and one can
run a database on such virtual machines. Users can either upload their own machine image with a
database installed on it, or use ready-made machine images that already include an optimized
installation of a database.
Database-as-a-service (DBaaS)
With a database as a service model, application owners do not have to install and maintain the
database themselves. Instead, the database service provider takes responsibility for installing and
maintaining the database, and application owners are charged according to their usage of the service.
This is a type of software as a service (SaaS).
Data model:
The design and development of typical systems utilize data management and relational databases as
their key building blocks. Modern relational databases have shown poor performance on
data-intensive systems, therefore, the idea of NoSQL has been utilized within database management
systems for cloud based systems. "The NoSQL databases have proven to provide efficient horizontal
scalability, good performance, and ease of assembly into cloud applications
Data models relying on simplified relay algorithms have also been employed in data-intensive cloud
mapping applications unique to virtual frameworks.
SQL databases
are one type of database which can run in the cloud, either in a virtual machine or as a service,
depending on the vendor. While SQL databases are easily vertically scalable, horizontal scalability
poses a challenge that cloud database services based on SQL have started to address.
NoSQL databases
are another type of database which can run in the cloud. NoSQL databases are built to service heavy
read/write loads and can scale up and down easily, and therefore they are more natively suited to
running in the cloud. However, most contemporary applications are built around an SQL data model,
so working with NoSQL databases often requires a complete rewrite of application code.
Some SQL databases have developed NoSQL capabilities including JSON, binary JSON (e.g. BSON
or similar variants), and key-value store data types.
A multi-model database with relational and non-relational capabilities provides a standard SQL
interface to users and applications and thus facilitates the usage of such databases for contemporary
applications built around an SQL data model. Native multi-model databases support multiple data
models with one core and a unified query language to access all data models.
How Cloud Databases Work
Cloud databases can be divided into two broad categories: relational and non-relational.
A relational database, typically written in structured query language (SQL), is composed of a set of
interrelated tables that are organized into rows and columns. The relationship between tables and
columns (fields) is specified in a schema. SQL databases, by design, rely on data that is highly
consistent in its format, such as banking transactions or a telephone directory. Popular relational
database systems offered in the cloud include MySQL, Oracle, IBM DB2, and Microsoft SQL Server.
Some, such as MySQL, are open source.
Non-relational databases, sometimes called NoSQL, do not employ a table model. Instead, they store
content, regardless of its structure, as a single document. This technology is well-suited for
unstructured data, such as social media content, photos and videos.
HDFS:
In HDFS (the Hadoop Distributed File System), a large file is broken down into small blocks of data.
HDFS has a default block size of 128 MB, which can be increased as per requirement. Multiple copies
of each block are stored in the cluster in a distributed manner on different nodes.
Name node:
The system having the namenode acts as the master server and it does the following tasks −
● Manages the file system namespace.
● Regulates client’s access to files.
● It also executes file system operations such as renaming, closing, and opening files and
directories.
Data node:
Each slave node in the cluster has a data node. These nodes manage the data storage of their system.
● Data nodes perform read-write operations on the file systems, as per client request.
● They also perform operations such as block creation, deletion, and replication according to
the instructions of the namenode.
Block:
Generally the user data is stored in the files of HDFS. A file in the file system is divided into
one or more segments, which are stored in individual DataNodes. These file segments are called
blocks. In other words, the minimum amount of data that HDFS can read or write is called a block.
The default block size is 128 MB, but it can be changed as needed in the HDFS configuration (the
dfs.blocksize property in hdfs-site.xml).
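To make the idea of blocks concrete, the following is a minimal Python sketch (illustrative only, not HDFS code) that splits a local file into fixed-size blocks the way HDFS conceptually does; the 128 MB constant mirrors the default block size mentioned above, and the function name is hypothetical.

# Conceptual sketch: HDFS performs this splitting on the cluster; this only
# illustrates how one large file maps onto a sequence of fixed-size blocks.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default HDFS block size

def split_into_blocks(path, block_size=BLOCK_SIZE):
    """Yield successive fixed-size blocks read from the file at `path`."""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield block

# A 300 MB file, for example, would yield blocks of 128 MB, 128 MB and 44 MB,
# each of which HDFS would then replicate onto different DataNodes.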
GFS:
Google File System (GFS or GoogleFS) is a proprietary distributed file system developed by Google
to provide efficient, reliable access to data using large clusters of commodity hardware.
The GFS is composed of clusters. A cluster is a set of networked computers. GFS clusters contain
three types of interdependent entities: clients, a master and chunk servers.
Clients are computers or applications that manipulate existing files or create new files on the
system.
The master server is the manager of the cluster system and maintains the operation log. The
operation log keeps track of the activities performed by the master itself, which helps keep service
interruptions to a minimum. At startup, the master server retrieves information about contents and
inventories from the chunk servers. Thereafter, the master server keeps track of the location of the
chunks within the cluster.
The GFS architecture keeps the messages that the master server sends and receives very small. The
master server itself doesn’t handle file data at all; this is done by chunk servers.
Chunk servers are the core engine of the GFS. They store file chunks of 64 MB size. Chunk servers
coordinate with the master server and send requested chunks to clients directly.
GFS consists of a single master and multiple chunk servers.
Each chunk has 64 MB of data in it. Each chunk is replicated on multiple chunk servers (3 by default).
Even if any chunk server crashes, the data file will still be present in other chunk servers.
This helped Google to store and process huge volumes of data in a distributed manner.
6. Cache management: In GFS, cache metadata are saved in client memory. Chunk servers do not
need to cache file data; the Linux system running on the chunk server already caches frequently
accessed data in memory. HDFS provides a "Distributed Cache", a facility provided by the
MapReduce framework to distribute application-specific, large, read-only files efficiently. It also
caches files such as text, archives (zip, tar, tgz and tar.gz) and jars needed by applications.
Distributed Cache files can be private or public, which determines how they can be shared on the
slave nodes.
"Private" Distributed Cache files are cached in a local directory private to the user whose jobs need
these files.
"Public" Distributed Cache files are cached in a global directory and the file access is setup in such
a way that they are publicly visible to all users.
7. Files protection and permission: GFS splits files up and stores them in multiple pieces on multiple
machines. Files have random names and are not human readable. Files are obfuscated
through algorithms that change constantly.
The HDFS implements POSIX-like mode permission for files and directories. All files and
directories are associated with an owner and a group with separate permissions for users who are
owners, for users that are members of the group and for all other users.
8. Replication strategy: The GFS has two replicas: Primary replicas and secondary replicas.
A primary replica is the data chunk that a chunk server sends to a client.
Secondary replicas serve as backups on other chunk servers. User can specify the number of
replicas to be maintained.
The HDFS has an automatic replication rack based system. By default two copies of each block are
stored by different Data Nodes in the same rack and a third copy is stored on a Data Node on a
different rack.
9. File namespace: In GFS, files are organized hierarchically in directories and identified by path
names. The GFS is exclusively for Google only.
The HDFS supports a traditional hierarchical file organization. Users or application can create
directories to store files inside. The HDFS also supports third-party file systems such as
CloudStore and Amazon Simple Storage Service (S3).
10. File system database: Google built its Bigtable database on top of GFS. Bigtable is a proprietary
database developed by Google in C++.
Apache developed its own counterpart, HBase, on top of Hadoop. HBase is written in the Java
language.
Features of HDFS:
1. Cost-effective:
In the HDFS architecture, the DataNodes, which store the actual data, are inexpensive commodity
hardware, which reduces storage costs.
2. Large Datasets/ Variety and volume of data:
HDFS can store data of any size (ranging from megabytes to petabytes) and in any format (structured or
unstructured).
3. Replication:
In HDFS replication of data is done to solve the problem of data loss in unfavourable conditions like
crashing of a node, hardware failure, and so on.
The data is replicated across a number of machines in the cluster by creating replicas of blocks. HDFS
maintains this replication at regular intervals of time, continually creating
replicas of user data on different machines present in the cluster.
Hence, whenever any machine in the cluster crashes, the user can still access the data from other
machines that contain the blocks of that data, so user data is not lost when individual machines fail.
4. Fault Tolerance and reliability:
HDFS is highly fault-tolerant and reliable. HDFS creates replicas of file blocks depending on the
replication factor and stores them on different machines.
If any of the machines containing data blocks fails, other DataNodes containing replicas of those data
blocks are still available. This ensures no loss of data and makes the system reliable even in unfavourable
conditions.
Hadoop 3 introduced Erasure Coding to provide Fault Tolerance. Erasure Coding in HDFS improves
storage efficiency while providing the same level of fault tolerance and data durability as traditional
replication-based HDFS deployment.
5. High Availability:
The High availability feature of Hadoop ensures the availability of data even during NameNode or
DataNode failure.
Since HDFS creates replicas of data blocks, if any of the DataNodes goes down, the user can access
his data from the other DataNodes containing a copy of the same data block.
Also, if the active NameNode goes down, the passive node takes the responsibility of the active
NameNode. Thus, data will be available and accessible to the user even during a machine crash.
6. Scalability:
As HDFS stores data on multiple nodes in the cluster, the cluster can be scaled when requirements
increase. There are two scalability mechanisms available: vertical scalability – adding more resources
(CPU, memory and disk) to the existing nodes of the cluster –
and horizontal scalability – adding more machines to the cluster. The horizontal way is
preferred, since the cluster can be scaled from tens of nodes to hundreds of nodes on the fly without any
downtime.
7. Data Integrity:
Data integrity refers to the correctness of data. HDFS ensures data integrity by constantly checking the
data against the checksum calculated during the write of the file.
While reading a file, if the checksum does not match the original checksum, the data is considered
corrupted. The client then retrieves the data block from another DataNode that has a replica of
that block. The NameNode discards the corrupted block and creates an additional new replica.
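The checksum check described above can be sketched in a few lines of Python; the CRC-32 checksum and the helper names here are illustrative stand-ins for HDFS's internal per-block checksums, not actual HDFS APIs.

import zlib

def write_block(data):
    # On write, a checksum is computed and stored alongside the block.
    return data, zlib.crc32(data)

def read_block(data, stored_checksum):
    # On read, the checksum is recomputed and compared with the stored one;
    # a mismatch means the block is corrupted and a replica must be read instead.
    if zlib.crc32(data) != stored_checksum:
        raise IOError("Checksum mismatch: block corrupted, read another replica")
    return data

block, checksum = write_block(b"some user data")
assert read_block(block, checksum) == b"some user data"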
8. High Throughput:
Hadoop HDFS stores data in a distributed fashion, which allows data to be processed in parallel on a
cluster of nodes. This decreases the processing time and thus provides high throughput.
9. Data Locality:
Data locality means moving computation logic to the data rather than moving data to the
computational unit.
In the traditional system, the data is brought to the application layer and then processed.
But in the present scenario, due to the massive volume of data, bringing data to the application layer
degrades the network performance.
In HDFS, we bring the computation part to the Data Nodes where data resides. Hence, with Hadoop
HDFS, we are moving computation logic to the data, rather than moving data to the computation logic.
This feature reduces the bandwidth utilization in a system.
10. Distributed Storage:
HDFS stores data in a distributed manner across the nodes. In Hadoop, data is divided into blocks and
stored on the nodes present in the HDFS cluster. HDFS then creates replicas of each block and stores
them on other nodes. When a single machine in the cluster crashes, the data can still be accessed easily
from the other nodes that contain its replicas.
Features of GFS:
GFS features include:
1. Fault tolerance
2. Critical data replication
3. Automatic and efficient data recovery
4. High aggregate throughput
5. Reduced client and master interaction because of the large (64 MB) chunk size
6. Namespace management and locking
7. High availability
BigTable:
Google uses a data-storage facility called Bigtable. Bigtable is a distributed, persistent,
multidimensional sorted map. Bigtable is not a relational database. BigTable is designed with
semi-structured data storage in mind. It is a large map that is indexed by a row key, column key, and a
timestamp. Each value within the map is an array of bytes that is interpreted by the application.
Every read or write of data to a row is atomic, regardless of how many different columns are read or
written within that row.
(Row key: type string, column key: type string, timestamp: type int64) → String
The key can be generated by the database or by the application.
The table is sparse, meaning that different rows in a table may use different columns, with many of the
columns empty for a particular row.
5. sorted
Most associative arrays are not sorted: a key is hashed to a position in a table. BigTable sorts its data
by keys. This helps keep related data close together, usually on the same machine, assuming that
one structures keys in such a way that sorting brings the data together. For example, if domain names
are used as keys in a BigTable, it makes sense to store them in reverse order (e.g. com.google.mail
rather than mail.google.com) so that related domains sort next to each other.
Continuing our JSON example, the sorted version looks like this:
{
"1" : "x",
"aaaaa" : "y",
"aaaab" : "world",
"xyz" : "hello",
"zzzzz" : "woot"
}
6. multidimensional
A table is indexed by rows. Each row contains one or more named column families. Column families
are defined when the table is first created. Within a column family, one may have one or more named
columns. All data within a column family is usually of the same type. The implementation of BigTable
usually compresses all the columns within a column family together. Columns within a column family
can be created on the fly. Rows, column families and columns provide a three-level naming hierarchy
in identifying data. For example: Adding one dimension to our running JSON example gives us this:
{
"1" : {
"A" : "x",
"B" : "z"
},
"aaaaa" : {
"A" : "y",
"B" : "w"
},
"aaaab" : {
"A" : "world",
"B" : "ocean"
},
"xyz" : {
"A" : "hello",
"B" : "there"
},
"zzzzz" : {
"A" : "woot",
"B" : "1337"
}
}
A column family may have any number of columns, denoted by a column "qualifier" or "label".
Here's a subset of our JSON example again, this time with the column qualifier dimension built in:
{
// ...
"aaaaa" : {
"A" : {
"foo" : "y",
"bar" : "d"
},
"B" : {
"" : "w"
}
},
"aaaab" : {
"A" : {
"foo" : "world",
"bar" : "domination"
},
"B" : {
"" : "ocean"
}
},
// ...
}
7. time-based
Time is another dimension in BigTable data. Every column family may keep multiple versions of
column family data. If an application does not specify a timestamp, it will retrieve the latest version of
the column family. Alternatively, it can specify a timestamp and get the latest version that is earlier
than or equal to that timestamp.
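Continuing the running example, the sketch below (plain Python with made-up timestamps and values) shows how several versions of one cell might be kept, how a read without a timestamp returns the newest version, and how a read with a timestamp returns the newest version at or before it.

# Illustrative versions of a single cell, keyed by timestamp (oldest to newest).
versions = {
    1000: "y",
    2000: "m",
    3000: "now",
}

def read(cell_versions, timestamp=None):
    # No timestamp: return the latest version.
    # With a timestamp: return the latest version earlier than or equal to it.
    candidates = [t for t in cell_versions if timestamp is None or t <= timestamp]
    return cell_versions[max(candidates)] if candidates else None

print(read(versions))        # "now"  (latest version)
print(read(versions, 2500))  # "m"    (latest version at or before timestamp 2500)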
● An SSTable is an ordered, immutable map from keys to values, where both are byte strings.
● Tablet is associated with a specific node.
● Writes are stored in Colossus's shared log as soon as they are acknowledged.
● Data is never stored in nodes themselves;
● Nodes have pointers to a set of tablets stored on Colossus.
● Rebalancing tablets from one node to another is very fast
● Recovery from the failure of a node is very fast
● When a Cloud Bigtable node fails, no data is lost.
HBase:
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.
HBase is a data model that is similar to Google's Bigtable, designed to provide quick random access to
huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System
(HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the
Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer reads/accesses the
data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read
and write access.
HBase is a column-oriented database and the tables in it are sorted by row. The table schema
defines only column families, which are the key-value pairs. A table can have multiple column families
and each column family can have any number of columns. Subsequent column values are stored
contiguously on the disk. Each cell value of the table has a timestamp. In short, in an HBase table:
Rowid Column Family Column Family Column Family Column Family
col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3
Column-oriented databases are those that store data tables as sections of columns of data, rather than
as rows of data. In short, they have column families.
Row-oriented database: suitable for Online Transaction Processing (OLTP); designed for a small
number of rows and columns.
Column-oriented database: suitable for Online Analytical Processing (OLAP); designed for huge
tables.
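To see the column-family model from a client's point of view, here is a hedged sketch using the third-party happybase Python library; it assumes a running HBase Thrift gateway on localhost and an already-created table named "users" with a column family "cf" (both names are hypothetical).

import happybase

# Connect through the HBase Thrift gateway (assumed to be running locally).
connection = happybase.Connection("localhost")
table = connection.table("users")  # assumed table with column family "cf"

# Write: column keys take the form "family:qualifier", values are bytes.
table.put(b"row1", {b"cf:name": b"Alice", b"cf:city": b"Bhopal"})

# Read the whole row back as a {column: value} dictionary.
print(table.row(b"row1"))

# Scan a range of row keys; rows come back sorted by key, as in Bigtable.
for key, data in table.scan(row_start=b"row0", row_stop=b"row9"):
    print(key, data)

connection.close()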
Dynamo:
Dynamo is a set of techniques that together can form a highly available key-value structured storage
system or a distributed data store. It has properties of both databases and distributed hash tables
(DHTs). It was used in Amazon Web Services, such as its Simple Storage Service (S3).
Principles
Incremental scalability: Dynamo should be able to scale out one storage host (or “node”) at a time,
with minimal impact on both operators of the system and the system itself.
Symmetry: Every node in Dynamo should have the same set of responsibilities as its peers; there
should be no distinguished node or nodes that take special roles or extra set of responsibilities.
Decentralization: An extension of symmetry, the design should favor decentralized peer-to-peer
techniques over centralized control.
Heterogeneity: The system should be able to exploit heterogeneity in the infrastructure it runs on. For
example, the work distribution must be proportional to the capabilities of the individual servers. This
is essential in adding new nodes with higher capacity without having to upgrade all hosts at once.
DynamoDB is a fully-managed NoSQL database service designed to deliver fast and predictable
performance. It uses the Dynamo model as the essence of its design and improves on those features.
● Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that require
consistent single-digit millisecond latency at any scale.
● It is a fully managed database that supports both document and key-value data models.
● Its flexible data model and performance makes it a great fit for mobile, web, gaming, ad-tech, IOT,
and many other applications.
● Data is stored on SSD storage.
● It is spread across three geographically distinct data centres.
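As a minimal sketch of the key-value model, the snippet below uses the boto3 AWS SDK for Python; the table name "Users", its "user_id" partition key and the region are assumptions for illustration, and AWS credentials must already be configured.

import boto3

# Assumes a DynamoDB table "Users" with partition key "user_id" already exists.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Users")

# Write an item (a flexible, document-style record identified by its key).
table.put_item(Item={"user_id": "u1", "name": "Alice", "score": 42})

# Read it back by key.
response = table.get_item(Key={"user_id": "u1"})
print(response.get("Item"))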
Map Reduce:
MapReduce is a processing technique and a programming model for distributed computing based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of
data and converts it into another set of data, where individual elements are broken down into tuples
(key/value pairs). Secondly, reduce task, which takes the output from a map as an input and combines
those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the
reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called mappers
and reducers. Decomposing a data processing application into mappers and reducers is sometimes
nontrivial. But, once we write an application in the MapReduce form, scaling the application to run
over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a
configuration change. This simple scalability is what has attracted many programmers to use the
MapReduce model.
● Generally the MapReduce paradigm is based on sending the computation to where the data
resides.
● MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce
stage.
o Map stage − The map or mapper’s job is to process the input data. Generally the input
data is in the form of file or directory and is stored in the Hadoop file system (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the
data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in the
HDFS.
● During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers
in the cluster.
● The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
● Most of the computing takes place on nodes with data on local disks that reduces the network
traffic.
● After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to
the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job,
conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework. Input and output types of
a MapReduce job: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
           Input               Output
Map        <k1, v1>            list(<k2, v2>)
Reduce     <k2, list(v2)>      list(<k3, v3>)
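The <key, value> flow above can be illustrated with a self-contained Python sketch of the classic word-count job; it simulates the map, shuffle and reduce stages in a single process rather than running on Hadoop, so it is only a model of the paradigm.

from collections import defaultdict

def map_phase(lines):
    # Map: for each input line, emit (word, 1) pairs, i.e. <k2, v2>.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    # Reduce: combine the list of values for each key into a final <k3, v3> pair.
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(dict(reduce_phase(shuffle(map_phase(lines)))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}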
Parallel Computing:
Parallel computing refers to the process of breaking down larger problems into smaller, independent,
often similar parts that can be executed simultaneously by multiple processors communicating via
shared memory, the results of which are combined upon completion as part of an overall algorithm.
The primary goal of parallel computing is to increase available computation power for faster
application processing and problem solving.
Parallel computing infrastructure is typically housed within a single datacenter where several
processors are installed in a server rack; computation requests are distributed in small chunks by the
application server and are then executed simultaneously on each server.
There are generally four types of parallel computing, available from both proprietary and open source
parallel computing vendors: bit-level parallelism, instruction-level parallelism, task parallelism, and
superword-level parallelism.
● Bit-level parallelism: increases processor word size, which reduces the quantity of instructions the
processor must execute in order to perform an operation on variables greater than the length of the
word.
● Instruction-level parallelism: the hardware approach works upon dynamic parallelism, in which
the processor decides at run-time which instructions to execute in parallel; the software approach
works upon static parallelism, in which the compiler decides which instructions to execute in
parallel.
● Task parallelism: a form of parallelization of computer code across multiple processors that runs
several different tasks at the same time on the same data (see the sketch after this list).
● Superword-level parallelism: a vectorization technique that can exploit parallelism of inline code.
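The sketch below illustrates task parallelism with Python's multiprocessing module; the word-counting work function and the chunked input are purely illustrative, and the actual speed-up depends on the number of available processor cores.

from multiprocessing import Pool

def count_words(text):
    # One independent task: process a chunk of input on its own.
    return len(text.split())

if __name__ == "__main__":
    chunks = ["the quick brown fox", "jumps over", "the lazy dog"] * 1000
    # Distribute the independent tasks across a pool of worker processes,
    # then combine the partial results, mirroring the description above.
    with Pool(processes=4) as pool:
        counts = pool.map(count_words, chunks)
    print(sum(counts))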