Toward A Cloud Operating System
Fabio Pianese, Peter Bosch, Alessandro Duminuco, Nico Janssens, Thanos Stathopoulos, Moritz Steiner
Alcatel-Lucent Bell Labs
Service Infrastructure Research Dept., †Computer Systems and Security Dept.
{firstname.lastname}@alcatel-lucent.com
Abstract—Cloud computing is characterized today by a hotchpotch of elements and solutions, namely operating systems running on a single virtualized computing environment, middleware layers that attempt to combine physical and virtualized resources from multiple operating systems, and specialized application engines that leverage a key asset of the cloud service provider (e.g. Google's BigTable). Yet, there does not exist a virtual distributed operating system that ties together these cloud resources into a unified processing environment that is easy to program, flexible, scalable, self-managed, and dependable.
In this position paper, we advocate the importance of a virtual distributed operating system, a Cloud OS, as a catalyst in unlocking the real potential of the Cloud—a computing platform with seemingly infinite CPU, memory, storage and network resources. Following established Operating Systems and Distributed Systems principles laid out by UNIX and subsequent research efforts, the Cloud OS aims to provide simple programming abstractions to available cloud resources, strong isolation techniques between Cloud processes, and strong integration with network resources. At the same time, our Cloud OS design is tailored to the challenging environment of the Cloud by emphasizing elasticity, autonomous decentralized management, and fault tolerance.

I. INTRODUCTION

The computing industry is radically changing the scale of its operations. While a few years ago typical deployed systems consisted of individual racks filled with a few tens of computers, today's massive computing infrastructures are composed of multiple server farms, each built inside carefully engineered data centers that may host several tens of thousands of CPU cores in extremely dense and space-efficient layouts [1]. There are several reasons for this development:
• Significant economies of scale in manufacturing and purchasing huge amounts of off-the-shelf hardware parts.
• Remarkable savings in power and cooling costs from the massive pooling of computers in dedicated facilities.
• Hardware advances that have made the use of system virtualization techniques viable and attractive.
• Commercial interest in a growing set of applications and services to be offloaded "into the Cloud".
The commoditization of computing is thus transforming processing, storage, and bandwidth into utilities such as electrical power, water, or telephone access. This process is already well under way, as today businesses of all sizes tend to outsource their computing infrastructures, often turning to external providers to fulfill their entire operational IT needs. The migration of services and applications into the network is also modifying how computing is perceived in the mainstream, turning what was once perceived as the product of concrete equipment and processes into an abstract entity, devoid of any physical connotation: this is what the expression "Cloud computing" currently describes.
Previous research has successfully investigated the viability of several approaches to managing large-scale pools of hardware, users, processes, and applications. The main concerns of these efforts were twofold: on one hand, exploring technical issues such as the scalability limits of management techniques; on the other hand, understanding the real-world and "systemic" concerns such as ease of deployment and expressiveness of the user and programming interface.
Our main motivation lies in the fact that state-of-the-art management systems available today do not provide access to the Cloud in a uniform and coherent way. They either attempt to expose all the low-level details of the underlying pieces of hardware [2] or reduce the Cloud to a mere set of API calls—to instantiate and remotely control resources [3][4][5][6], to provide facilities such as data storage, CDN/streaming, and event queues [7][8][9], or to make available distributed computing library packages [10][11][12]. Yet, a major gap still has to be bridged in order to bond the Cloud resources into one unified processing environment that is easy to program, flexible, scalable, self-managing, and dependable.
In this position paper, we argue for a holistic approach to Cloud computing that transcends the limits of individual machines. We aim to provide a uniform abstraction—the Cloud Operating System—that adheres to well-established operating systems conventions, namely: (a) providing a simple and yet expressive set of Cloud metrics that can be understood by the applications and exploited according to individual policies and requirements, and (b) exposing a coherent and unified programming interface that can leverage the available network, CPU, and storage as the pooled resources of a large-scale distributed Cloud computer.
In Section II we elaborate our vision of a Cloud operating system, discuss our working assumptions, and state the requirements we aim to meet. In Section III we present a set of elements and features that we see as necessary in a Cloud OS: distributed resource measurement and management techniques, resource abstraction models, and interfaces, both to the underlying hardware and to the users/programmers. We then briefly review the related work in Section IV, and conclude in Section V with our plans for the future.
II. THE CLOUD: A UNITARY COMPUTING SYSTEM

The way we are accustomed to interact with computers has shaped over the years our expectations about what computers can and cannot achieve. The growth in processor power, the development of new human-machine interfaces, and the rise of the Internet have progressively turned an instrument initially meant to perform batch-mode calculations into the personal gateway for media processing and social networking we have today, which is an integral aspect of our lifestyle. The emergence of Clouds is about to further this evolution, with effects that we are not yet able to foresee: as processing and storage move away from end-user equipment, the way people interact with smaller and increasingly pervasive pieces of connected hardware will probably change in exciting and unexpected ways.

To facilitate this evolution, we argue it is important to recognize that the established metaphor of the computer as a self-contained entity is now outdated and needs to be abandoned. Computer networks have reached such a high penetration that in most cases all of the programs and data that are ever accessed on a user machine have in fact been produced somewhere else and downloaded from somewhere else. While much more powerful than in the past, the CPU power and storage capacity of the hardware a user has at her disposal pale in comparison with the power and storage of the hardware she is able to access over the Internet.

We feel that the time is ripe for a new way to understand and approach the huge amount of distributed, interconnected resources that are already available on large-scale networks: the first Cloud computing infrastructures that are becoming commercially available provide us with a concrete set of working assumptions on which to base the design of future computer systems. Research on the distributed management of computer hardware and applications, together with the emergence of Internet-wide distributed systems, has provided a wealth of experience and technical building blocks toward the goal of building and maintaining large-scale computer systems. However, users and developers still lack a definite perception of the potential of the Clouds, whose size and aggregate power are so large and hard to grasp: we therefore need to provide a new set of metaphors that unify and expose Cloud resources in a simple yet powerful way.

A. Assumptions on Cloud infrastructure

A Cloud is a logical entity composed of managed computing resources deployed in private facilities and interconnected over a public network, such as the Internet. Cloud machines (also called nodes) consist of inexpensive, off-the-shelf consumer-grade hardware. Clouds are composed of a large number of clusters (i.e. sets of nodes contained in the same facility) whose size may range from a few machines to entire datacenters. Clusters may use sealed enclosures or be placed in secluded locations that might not be accessible on a regular basis, a factor that hinders access and maintenance activities. Clusters are sparsely hosted in a number of locations across the world, as their deployment is driven by such practical concerns as:
• Availability of adequate facilities in strategic locations
• High-bandwidth, multi-homed Internet connectivity
• Presence of a reliable supply of cheap electrical power
• Suitable geological and/or meteorological properties (cold weather, nearby lakes or glaciers for cheap cooling)
• Presence of special legal regimes (data protection, etc.)
The clusters' computing and networking hardware is said to be inside the perimeter of the Cloud. The network that connects the clusters and provides the expected networking services (IP routing, DNS naming, etc.) lies outside the Cloud perimeter.

The reliance on commodity hardware and the reduced servicing capability compel us to treat Cloud hardware as unreliable and prone to malfunctions and failures. The network also needs to be considered unreliable, as a single failure of a piece of network equipment can impact a potentially large number of computers at the same time. A typical mode of network failure introduces partitions in the global connectivity, which disrupt the end-to-end behavior of the affected transport-layer connections. Network issues may arise both inside the Cloud perimeter, where counter-measures may be available to address them quickly in a way that is transparent to most applications, and outside, where it is not possible to react as promptly.

We do not intend to formulate any specific assumption about the applications that will run on the Cloud computer. In other words, we expect to satisfy the whole range of current application requirements, e.g. CPU-intensive number crunching, storage-intensive distributed backup, or network-intensive bulk data distribution (and combinations thereof). The Cloud operating system aims to be as general-purpose as possible, providing a simple set of interfaces for the management of Cloud resources: all the policy decisions pertaining to the use of the available resources are left to the individual applications.
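To make these infrastructure assumptions concrete, the following minimal sketch (in Python, with purely hypothetical names; it is not part of any existing Cloud OS implementation) models a Cloud as clusters of unreliable nodes inside the perimeter, reachable over an external network that may itself partition.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Node:
    """A single off-the-shelf machine inside the Cloud perimeter."""
    node_id: str
    alive: bool = True                      # commodity hardware: failures are the norm

@dataclass
class Cluster:
    """A set of nodes hosted in the same facility (possibly sealed or secluded)."""
    cluster_id: str
    location: str
    nodes: List[Node] = field(default_factory=list)

    def available_nodes(self) -> List[Node]:
        return [n for n in self.nodes if n.alive]

@dataclass
class Cloud:
    """The logical union of all clusters; the interconnecting network (IP routing,
    DNS, ...) lies outside the perimeter and can partition at any time."""
    clusters: List[Cluster] = field(default_factory=list)
    partitions: Set[Tuple[str, str]] = field(default_factory=set)   # unreachable cluster pairs

    def reachable(self, a: str, b: str) -> bool:
        return (a, b) not in self.partitions and (b, a) not in self.partitions
```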
B. Cloud OS: a familiar OS metaphor for the Cloud

We formulate a new metaphor, the Cloud operating system, that may be adequate to support the transition from individual computers as the atomic "computational units" to large-scale, seamless, distributed computer systems. The Cloud OS aims to provide a familiar interface for developing and deploying massively scalable distributed applications on behalf of a large number of users, exploiting the seemingly infinite CPU, storage, and bandwidth provided by the Cloud infrastructure.

The features of the Cloud OS aim to be an extension of those of modern operating systems, such as UNIX and its successors: in addition to simple programming abstractions and strong isolation techniques between users and applications, we emphasize the need to provide a much stronger level of integration with network resources. In this respect, there is much to be learnt from the successor of UNIX, Plan 9 from Bell Labs [13], which extended the consistency of the "everything is a file" metaphor to a number of inconsistent aspects of the UNIX environment. More useful lessons on techniques of process distribution and remote execution can
be drawn from research [13] and large-scale network testbed administration [20]. An additional inspiration, especially concerning the implementation of the Cloud OS, comes from the last decade of advances in distributed algorithms and peer-to-peer systems.

... outside the scope of the Cloud OS interface definition.

3 Nodes that are not part of the Cloud are still capable of accessing Cloud resources using the network-based system call interface; however, without full OS-level support for Cloud abstractions, they will not provide seamless integration between the local and the Cloud environment.
... requires complete control over the network infrastructure, but it may be used in certain cases to augment the accuracy of end-to-end measurements (e.g., with short-term predictions of CPU load or networking performance [21]) in Clouds that span several datacenters.

Measurements can target either local quantities, i.e. inside a single Cloud node, or pairwise quantities, i.e. involving pairs of connected machines (e.g. link bandwidth, latency, etc.). Complete measurements of pairwise quantities cannot be performed in large-scale systems, as the number of measurement operations required grows quadratically with the size of the Cloud. Several distributed algorithms that predict latencies without global measurement campaigns have been proposed: Vivaldi [22] collects local latency samples and represents nodes as points in a coordinate system, while Meridian [23] uses an overlay network to recursively select the machines closest to a given network host. Bandwidth estimation in Cloud environments remains an open problem: despite the existence of a number of established techniques [24], most of them are too intrusive and unsuitable for simultaneous use or for repeated measurements on high-capacity links.
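As a rough illustration of the coordinate-based approach, the sketch below implements a heavily simplified two-dimensional Vivaldi-style update rule in Python (the constants and the omission of Vivaldi's adaptive error weights and height vector are our own simplifications, not the algorithm as published in [22]): each node nudges its coordinate toward or away from a probed neighbor so that the Euclidean distance between coordinates approaches the measured round-trip time.

```python
import math
import random

class VivaldiNode:
    """Simplified 2-D network coordinate, refined only from local RTT samples."""
    def __init__(self):
        self.x, self.y = 0.0, 0.0

    def distance_to(self, other: "VivaldiNode") -> float:
        """Predicted latency (ms) between this node and `other`."""
        return math.hypot(self.x - other.x, self.y - other.y)

    def update(self, other: "VivaldiNode", measured_rtt_ms: float, delta: float = 0.1) -> None:
        """Nudge our coordinate so the predicted distance moves toward the measured RTT."""
        dist = self.distance_to(other)
        error = measured_rtt_ms - dist          # > 0: we are too close, push away
        if dist == 0.0:                         # colocated coordinates: pick a random direction
            ux, uy = random.random(), random.random()
        else:
            ux, uy = (self.x - other.x) / dist, (self.y - other.y) / dist
        self.x += delta * error * ux
        self.y += delta * error * uy

# Example: two nodes repeatedly exchange probes measuring a 50 ms RTT.
a, b = VivaldiNode(), VivaldiNode()
for _ in range(200):
    a.update(b, 50.0)
    b.update(a, 50.0)
print(round(a.distance_to(b), 1))               # converges toward 50.0
```

After enough local samples, the distance between two coordinates predicts the latency of node pairs that have never probed each other directly, avoiding a quadratic measurement campaign.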
2) Resource abstraction: Modern OS metaphors, such as the "everything is a file" model used by UNIX and Plan 9, provide transparent network interfaces and completely hide their properties and specificities from the applications. However, characterizing the underlying network is a crucial requirement for a Cloud OS, since network properties such as pairwise latencies, available bandwidth, etc., determine the ability of distributed applications to efficiently exploit the available resources. One major strength of a file-based interface is that it is very flexible, and its shortcomings can be supplemented with an appropriate use of naming conventions. We are considering several such mechanisms to present abstracted resource information from measurements to the applications, e.g. via appropriate extensions of the /proc interface or via POSIX-compatible semantic cues [25].
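Purely as an illustration of how such naming conventions might look (the paths and fields below are hypothetical and are not an interface defined by the Cloud OS), measured and predicted properties could be exposed as a /proc-like namespace that applications read like ordinary files and filter according to their own policies:

```python
# Hypothetical /proc-style Cloud namespace; values would be filled in from the
# distributed measurement layer (e.g. coordinate-based latency predictions).
cloud_proc = {
    "/proc/cloud/nodes/n17/cpu/load":         "0.42",
    "/proc/cloud/nodes/n17/mem/free_mb":      "1843",
    "/proc/cloud/net/n17/n42/latency_ms":     "23.5",
    "/proc/cloud/net/n58/n42/latency_ms":     "87.0",
}

def read(path: str) -> str:
    """Applications read abstracted measurements exactly like ordinary files."""
    return cloud_proc[path]

def nodes_with_low_latency(to: str, threshold_ms: float):
    """Policy stays with the application: e.g. pick nodes close to a given peer."""
    for path, value in cloud_proc.items():
        if path.startswith("/proc/cloud/net/") and path.endswith(f"/{to}/latency_ms"):
            if float(value) <= threshold_ms:
                yield path.split("/")[4]        # the source node identifier

print(read("/proc/cloud/nodes/n17/mem/free_mb"))    # -> 1843
print(list(nodes_with_low_latency("n42", 50.0)))    # -> ['n17']
```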
In order to present information about resources to the user applications, the Cloud OS first needs to collect and aggregate them in a timely way. Clearly, solutions based on centralized databases are not viable, since they lack the fault tolerance and the scalability we require. The use of distributed systems, such as distributed hash tables (DHTs), has proved to be very effective for publishing and retrieving information in large-scale systems [26], even in the presence of considerable levels of churn [27]. However, DHTs offer a hash-table (key, value) semantics, which is not expressive enough to support more complex queries such as those used when searching for resources. Multi-dimensional DHTs [28][29] and gossip-based approaches [30] extend the base (key, value) semantics in order to allow multi-criteria and range queries.
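The sketch below illustrates the contrast in the simplest possible terms: a toy in-process stand-in for a DHT with hypothetical keys (real systems such as [26][28][29][30] are distributed and far more elaborate). Exact-match lookups map naturally onto the (key, value) interface, while a range query such as "all nodes with at least 2 GB of free memory" cannot be expressed through it directly.

```python
import hashlib

class ToyDHT:
    """In-process stand-in for a DHT: only exact-match put/get on hashed keys."""
    def __init__(self):
        self.store = {}

    def _key(self, k: str) -> str:
        return hashlib.sha1(k.encode()).hexdigest()

    def put(self, key: str, value):
        self.store[self._key(key)] = value

    def get(self, key: str):
        return self.store.get(self._key(key))

dht = ToyDHT()
# Each node periodically publishes its measured resources under a well-known key.
dht.put("node:n17:free_mem_mb", 1843)
dht.put("node:n42:free_mem_mb", 3200)

# Exact-match queries are natural...
print(dht.get("node:n42:free_mem_mb"))          # -> 3200

# ...but a range query ("all nodes with >= 2048 MB free") cannot be expressed
# through the hash-table interface: hashing destroys key ordering, so one must
# either scan every key or maintain an order-preserving / multi-dimensional
# index on top, which is what multi-dimensional DHTs and gossip-based
# approaches provide in a distributed setting.
```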
3) Distributed process and application management: The Cloud OS instantiates and manages all objects that exist across the Cloud nodes. A consolidated practice is the use of virtual machines (VMs), which provide an abstraction that flexibly decouples the "logical" computing resources from the underlying physical Cloud nodes. Virtualization provides several properties required in a Cloud environment [31], such as support for multiple OS platforms on the same node and the implicit isolation (up to a certain extent) between processes running on different VMs on the same hardware. Computational elasticity, load balancing, and other optimization requirements introduce the need for dynamic re-allocation of resources, such as the ability to relocate a running process between two nodes in the Cloud. This can be done either at the Cloud process level, i.e. migrating single processes between nodes, or at the virtual machine level, i.e. checkpointing and restoring the whole VM state on a different node. The combination of process and VM migration, as introduced by MOSIX [32], is very interesting as a Cloud functionality, since it allows bundles of related Cloud objects to be regrouped and migrated autonomously with a single logical operation.4

4 Another compelling approach is Libra [33], which aims to bridge the distance between processes and VMs: the "guest" operating system is reduced to a thin layer on top of the hypervisor that accesses the functions exposed by the "host" operating system through a file-system interface.
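To make the two migration granularities concrete, here is a minimal sketch in Python (all class and function names are hypothetical and do not correspond to MOSIX's or Libra's actual interfaces): moving a whole VM implicitly carries every Cloud process it hosts, which is what allows a bundle of related objects to be relocated with a single logical operation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CloudProcess:
    pid: int
    state: bytes = b""            # serialized execution state (checkpoint image)

@dataclass
class VirtualMachine:
    vm_id: str
    processes: List[CloudProcess] = field(default_factory=list)

@dataclass
class CloudNode:
    node_id: str
    vms: List[VirtualMachine] = field(default_factory=list)

def migrate_process(proc: CloudProcess, src: VirtualMachine, dst: VirtualMachine) -> None:
    """Process-level migration: move a single checkpointed process between VMs/nodes."""
    src.processes.remove(proc)
    dst.processes.append(proc)

def migrate_vm(vm: VirtualMachine, src: CloudNode, dst: CloudNode) -> None:
    """VM-level migration: checkpoint and restore the whole VM state elsewhere,
    implicitly carrying every process it hosts in one logical operation."""
    src.vms.remove(vm)
    dst.vms.append(vm)

vm = VirtualMachine("vm-7", [CloudProcess(101), CloudProcess(102)])
n1, n2 = CloudNode("n17", [vm]), CloudNode("n42")
migrate_vm(vm, n1, n2)                            # both processes move in one logical step
print([p.pid for p in n2.vms[0].processes])       # -> [101, 102]
```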