SSI Clusters

Single System Image and Cluster Middleware
Approaches, Infrastructure and Technologies
Recap: Cluster Computer Architecture

[Figure: cluster computer architecture, top to bottom: Sequential Applications and Parallel Applications; Parallel Programming Environment; Cluster Middleware (Single System Image and Availability Infrastructure); PC/Workstation nodes, each with Communications Software and Network Interface Hardware; Cluster Interconnection Network/Switch.]

Recap: Major issues in Cluster design

Enhanced Performance (performance @ low cost)
Enhanced Availability (failure management)
Single System Image (look-and-feel of one system)
Size Scalability (physical & application)
Fast Communication (networks & protocols)
Load Balancing (CPU, Net, Memory, Disk)
Security and Encryption (clusters of clusters)
Distributed Environment (social issues)
Manageability (admin. and control)
Programmability (simple API if required)
Applicability (cluster-aware and non-aware app.)
A typical Cluster Computing Environment

Applications
PVM / MPI / RSH
???
Hardware/OS
The missing link is provided by cluster middleware/underware

Applications
PVM / MPI / RSH
Middleware
Hardware/OS
Message Passing Interface (MPI)

Message Passing Interface (MPI) is an API specification that allows processes to communicate with one another by sending and receiving messages. It is typically used for parallel programs running on computer clusters and supercomputers, where the cost of accessing non-local memory is high. MPI is a language-independent communications protocol used to program parallel computers.
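
The send/receive pattern described above can be sketched without an MPI installation. The following is a toy illustration only, not MPI itself: the `Communicator` class and its `send` and `recv` methods are hypothetical names, and threads with per-rank queues stand in for processes on separate cluster nodes. In a real MPI program the same roles are played by calls such as MPI_Send and MPI_Recv.

```python
import queue
import threading


class Communicator:
    """Toy stand-in for a message-passing communicator (hypothetical, not MPI)."""

    def __init__(self, size):
        self.size = size
        # One inbox per rank; a message is delivered by putting it in the inbox.
        self._inboxes = [queue.Queue() for _ in range(size)]

    def send(self, obj, dest):
        # Deliver a message to the destination rank.
        self._inboxes[dest].put(obj)

    def recv(self, rank):
        # Block until a message addressed to this rank arrives.
        return self._inboxes[rank].get()


def worker(comm, rank, data):
    # Each worker rank computes a partial sum and sends it to rank 0.
    comm.send((rank, sum(data)), dest=0)


def parallel_sum(values, size=4):
    comm = Communicator(size)
    workers = size - 1  # ranks 1..size-1 do the work, rank 0 collects
    chunks = [values[i::workers] for i in range(workers)]
    threads = [
        threading.Thread(target=worker, args=(comm, rank, chunks[rank - 1]))
        for rank in range(1, size)
    ]
    for t in threads:
        t.start()
    # Rank 0 receives exactly one partial result per worker and combines them.
    total = sum(comm.recv(0)[1] for _ in range(workers))
    for t in threads:
        t.join()
    return total
```

The ranks share no data; everything they exchange travels as an explicit message, which is the property that makes the model suitable when access to non-local memory is expensive.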
Middleware Design Goals

Complete Transparency (Manageability):
Offer a single system view of a cluster system.
Single entry point, ftp, telnet, software loading...

Scalable Performance:
Easy growth of cluster
No change of API and automatic load distribution.

Enhanced Availability:
Automatic recovery from failures
Employ checkpointing and fault-tolerant technologies
Handle consistency of data when replicated.
What is Single System Image (SSI)?

SSI is the illusion, created by software or hardware, that presents a collection of computing resources as one unified resource.

In other words, it is the property of a system that hides the heterogeneous and distributed nature of the available resources and presents them to users and applications as a single unified computing resource.
SSI makes the cluster appear like a single machine to the user, to applications, and to the network.
Cluster Middleware & SSI

SSI

Supported by a middleware layer that resides between the OS and the user-level environment.
Middleware consists of essentially 2 sub-layers of software infrastructure:

SSI infrastructure
Glues together the OSs on all nodes to offer unified access to system resources.

System availability infrastructure
Enables cluster services such as checkpointing, automatic failover, recovery from failure, and fault-tolerant support among all nodes of the cluster.

Functional Relationship Among Middleware SSI Modules
Benefits of SSI

Use of system resources is transparent.
Transparent process migration and load balancing across nodes.
Improved reliability and higher availability.
Improved system response time and performance.
Simplified system management.
Reduction in the risk of operator errors.
No need to be aware of the underlying system architecture to use these machines effectively.
Desired SSI Services/Functions

Single Entry Point:

telnet cluster.my_institute.edu (the cluster as a whole)
rather than telnet node1.cluster.institute.edu (an individual node)

Single User Interface: using the cluster through a single GUI window, which should provide the look and feel of managing a single resource (e.g., PARMON).
Single File Hierarchy: /Proc, NFS, xFS, AFS, etc.
Single Control Point: Management GUI
Single Virtual Networking
Single Memory Space: Network RAM/DSM
Single Job Management: Glunix, SGE, LSF
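
One way to picture the Single Entry Point service listed above is a front end that accepts every login under the cluster's name and hands each session to some node behind it. The sketch below is an assumption made for illustration: the `SingleEntryPoint` class, its round-robin policy, and the node names are hypothetical and are not taken from any of the systems named in these slides.

```python
import itertools


class SingleEntryPoint:
    """Hypothetical front end: users name the cluster, never an individual node."""

    def __init__(self, cluster_name, nodes):
        self.cluster_name = cluster_name
        # Round-robin is an assumed policy; a real system might pick by load.
        self._next_node = itertools.cycle(nodes)
        self.sessions = {}

    def connect(self, user):
        # Choose a node on the user's behalf and remember the placement.
        node = next(self._next_node)
        self.sessions[user] = node
        return node
```

For example, three successive logins to `SingleEntryPoint("cluster.my_institute.edu", ["node1", "node2"])` would land on node1, node2, and node1 again, without any of the users having to know that the nodes exist.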
Availability Support Functions

Single I/O Space:

Any node can access any peripheral or disk device without knowledge of its physical location.

Single Process Space:

Any process on any node can create processes with cluster-wide process IDs, and they communicate through signals, pipes, etc., as if they were on a single node.

Single Global Job Management System

Checkpointing and process migration:

Saves the process state and intermediate results in memory to disk to support rollback recovery when a node fails. RMS load balancing...
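
The checkpointing item above can be illustrated with a minimal sketch of checkpoint and rollback recovery. Everything here is an assumption made for illustration: the function names, the `fail_at` parameter that simulates a node failure, and the use of `pickle` to write state to a local file. Real cluster checkpointing captures whole-process state and is considerably more involved.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    # Write to a temporary file and rename it into place, so that a crash
    # in the middle of a write never leaves a corrupt checkpoint behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    # Return the last saved state, or the default if no checkpoint exists.
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_sum(path, n, fail_at=None, every=10):
    """Sum 0..n-1, saving intermediate results every `every` iterations."""
    state = load_checkpoint(path, {"next": 0, "total": 0})
    for i in range(state["next"], n):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += i
        state["next"] = i + 1
        if state["next"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If a run fails partway through, calling `run_sum` again with the same path rolls back to the last saved state and repeats only the work done since that checkpoint, rather than starting from zero.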
SSI Levels

SSI levels of abstraction:
Application and Subsystem Level
Operating System Kernel Level
Hardware Level
SSI Characteristics
Every SSI has a boundary. Single system support can exist at different levels within a system, one able to be built on another.
SSI Boundaries
Batch System SSI Boundary

Source: Pfister, In Search of Clusters
SSI Middleware Implementation: Layered approach
SSI at Application and Sub-system Levels

Level: Application
Examples: batch system and system management; Google Search Engine
Boundary: An application
Importance: What a user wants

Level: Sub-system
Examples: Distributed DB (e.g., Oracle 10g), OSF DME, Lotus Notes, MPI, PVM
Boundary: A sub-system
Importance: SSI for all applications of the sub-system

Level: File system
Examples: Sun NFS, OSF DFS, NetWare, and so on
Boundary: Shared portion of the file system
Importance: Implicitly supports many applications and subsystems

Level: Toolkit
Examples: OSF DCE, Sun ONC+, Apollo Domain
Boundary: Explicit toolkit facilities: user, service name, time
Importance: Best level of support for heterogeneous system

Source: Pfister, In Search of Clusters
SSI at OS Kernel Level

Level: Kernel/OS layer
Examples: Solaris MC, Unixware, MOSIX, Sprite, Amoeba/GLunix
Boundary: Each name space: files, processes, pipes, devices, etc.
Importance: Kernel support for applications, adm subsystems

Level: Kernel interfaces
Examples: UNIX (Sun) vnode, Locus (IBM) vproc
Boundary: Type of kernel objects: files, processes, etc.
Importance: Modularizes SSI code within kernel

Level: Virtual memory
Examples: None supporting OS kernel
Boundary: Each distributed virtual memory space
Importance: May simplify implementation of kernel objects

Level: Microkernel
Examples: Mach, PARAS, Chorus, OSF/1AD, Amoeba
Boundary: Each service outside the microkernel
Importance: Implicit SSI for all system services
SSI at Hardware Level

[Figure: the Hardware Level sits below the Application and Subsystem Level and the Operating System Kernel Level.]

Level: Memory
Examples: SCI (Scalable Coherent Interface), Stanford DASH
Boundary: Memory space
Importance: Better communication and synchronization

Level: Memory and I/O
Examples: SCI, SMP techniques
Boundary: Memory and I/O device space
Importance: Lower overhead cluster I/O

Source: Pfister, In Search of Clusters
SSI via OS path!

1. Build as a layer on top of the existing OS

Benefits: makes the system quickly portable, tracks vendor software upgrades, and reduces development time; i.e., new systems can be built quickly by mapping new services onto the functionality provided by the layer beneath. E.g., Glunix.

2. Build SSI at kernel level, True Cluster OS

Good, but cannot leverage OS improvements by the vendor. E.g., Unixware, Solaris-MC, and MOSIX.
SSI Systems & Tools

OS level:
SCO NSC UnixWare; Solaris-MC; MOSIX, ...

Subsystem level:
PVM/MPI, TreadMarks (DSM), Glunix, Condor, SGE, Nimrod, PBS, ..., Aneka

Application level:
PARMON, Parallel Oracle, Google, ...

UnixWare: NonStop Cluster (NSC) OS

http://www.sco.com/products/clustering/
[Figure: two UP or SMP nodes. On each node, users, applications, and systems management sit on standard OS kernel calls plus extensions, above standard SCO UnixWare with clustering hooks plus modular kernel extensions. Each node has its own devices, and the nodes are linked to each other and to other nodes through ServerNet.]
How does NonStop Clusters Work?

Modular Extensions and Hooks to Provide:

Single clusterwide filesystem view
Transparent clusterwide device access
Transparent swap space sharing
Transparent clusterwide IPC
High performance internode communications
Transparent clusterwide processes, migration, etc.
Node down cleanup and resource failover
Transparent clusterwide parallel TCP/IP networking
Application availability
Clusterwide membership and cluster timesync
Cluster system administration
Load leveling
Sun Solaris MC (Multi-Computers)

Solaris MC: A High Performance Operating System for Clusters

A distributed OS for a multicomputer, a cluster of computing nodes connected by a high-speed interconnect.
Provides a single system image, making the cluster appear like a single machine to the user, to applications, and to the network.
Built as a globalization layer on top of the existing Solaris kernel.

Interesting features:
extends the existing Solaris OS
preserves the existing Solaris ABI/API compliance
provides support for high availability
uses C++, IDL, CORBA in the kernel
leverages Spring OS technology
Solaris-MC: Solaris for MultiComputers

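[Figure: Solaris MC architecture. Applications sit above the system call interface. Below it, the Solaris MC layer (network, file system, and processes, written in C++) is built on an object framework that uses object invocations to reach other nodes, all on top of the existing Solaris 2.5 kernel.]

global file system
globalized process management
globalized networking and I/O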
http://research.sun.com/techrep/1995/abstract-48.html
Solaris MC components
[Figure: Solaris MC architecture: applications, system call interface, the Solaris MC layer (network, file system, processes; C++) over an object framework with object invocations to other nodes, and the existing Solaris 2.5 kernel.]

Object and communication support
High availability support
PXFS global distributed file system
Process management
Networking
MOSIX: Multicomputer OS for UNIX

http://www.mosix.cs.huji.ac.il/ || mosix.org

An OS module (layer) that provides the applications with the illusion of working on a single system. Remote operations are performed like local operations. Transparent to the application - user interface unchanged.
Application
PVM / MPI / RSH
Hardware/OS
Key Features of MOSIX

Preemptive process migration that can migrate any process, anywhere, anytime

Supervised by distributed algorithms that respond online and transparently to global resource availability.
Load balancing: migrates processes from overloaded to underloaded nodes.
Memory ushering: migrates processes from a node that has exhausted its memory, to prevent paging/swapping.
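
The load-balancing idea above (move processes from overloaded to underloaded nodes) can be sketched as a simple greedy loop. This is not MOSIX's actual algorithm, which is distributed and runs online. The sketch assumes a central view of all loads, and the function name `rebalance`, the `threshold` parameter, and the per-process load numbers are hypothetical.

```python
def rebalance(nodes, threshold=1):
    """Greedy load balancing over {node: {pid: load}}; returns the migrations made.

    Assumes every process load is a positive number.
    """
    migrations = []
    while True:
        load = {name: sum(procs.values()) for name, procs in nodes.items()}
        src = max(load, key=load.get)  # most loaded node
        dst = min(load, key=load.get)  # least loaded node
        gap = load[src] - load[dst]
        # Moving a process of load c changes the gap to |gap - 2c|, so only
        # processes with 0 < c < gap actually narrow it.
        movable = {pid: c for pid, c in nodes[src].items() if 0 < c < gap}
        if gap <= threshold or not movable:
            return migrations
        # Pick the process whose move leaves the two nodes closest in load.
        pid = min(movable, key=lambda p: abs(gap - 2 * movable[p]))
        nodes[dst][pid] = nodes[src].pop(pid)
        migrations.append((pid, src, dst))
```

Each accepted move strictly reduces the sum of squared node loads, so the loop terminates. Memory ushering could be sketched the same way by tracking memory use per process instead of CPU load.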
Download MOSIX: http://www.mosix.cs.huji.ac.il/
SSI at Subsystem Level
Resource Management and Scheduling
Resource Management and Scheduling (RMS)

The RMS system is responsible for distributing applications among cluster nodes.
It enables the effective and efficient utilization of the resources available.

Software components

Resource manager
Locating and allocating computational resources, authentication, process creation and migration.

Resource scheduler
Queuing applications, resource location and assignment. It instructs the resource manager what to do and when (policy).

Reasons for using RMS

Provide an increased, and reliable, throughput of user applications on the systems
Load balancing
Utilizing spare CPU cycles
Providing fault-tolerant systems
Manage access to powerful systems, etc.
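
The split between the two software components above can be sketched as two small classes: a resource manager that knows what is free and performs allocations, and a scheduler that owns the queue and decides what runs when. This is a minimal illustration under assumed names (`ResourceManager`, `Scheduler`, a strict first-in-first-out policy, CPU counts as the only resource) and does not reflect the design of any RMS product named in these slides.

```python
from collections import deque


class ResourceManager:
    """Locates and allocates resources; here, only free CPUs per node."""

    def __init__(self, free_cpus):
        self.free_cpus = dict(free_cpus)

    def find_node(self, cpus):
        # Return the first node with enough free CPUs, or None.
        for node, free in self.free_cpus.items():
            if free >= cpus:
                return node
        return None

    def allocate(self, node, cpus):
        self.free_cpus[node] -= cpus

    def release(self, node, cpus):
        self.free_cpus[node] += cpus


class Scheduler:
    """Queues jobs and instructs the resource manager what to run and when."""

    def __init__(self, manager):
        self.manager = manager
        self.queue = deque()
        self.running = {}

    def submit(self, job_id, cpus):
        self.queue.append((job_id, cpus))

    def dispatch(self):
        # Assumed policy: strict FIFO, so the head of the queue blocks later jobs.
        started = []
        while self.queue:
            job_id, cpus = self.queue[0]
            node = self.manager.find_node(cpus)
            if node is None:
                break
            self.queue.popleft()
            self.manager.allocate(node, cpus)
            self.running[job_id] = (node, cpus)
            started.append((job_id, node))
        return started

    def finish(self, job_id):
        # Return the job's CPUs to the pool so queued jobs can start.
        node, cpus = self.running.pop(job_id)
        self.manager.release(node, cpus)
```

Keeping the policy in the scheduler means it could be swapped (for example, to let small jobs jump ahead) without touching the code that tracks and allocates resources.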
Basic architecture of RMS: client-server system
Cluster RMS Architecture

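[Figure: Cluster RMS architecture. The user population (User 1 to User u) submits jobs to the manager node, which hosts the Resource Manager, Job Manager, Job Scheduler, and Node Status Monitor. Jobs are passed on to the computation nodes (Computation Node 1 to Computation Node c), and execution results are returned to the users.]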
Services provided by RMS

Process Migration
Computational resource has become too heavily loaded
Fault tolerance concern

Checkpointing

Scavenging Idle Cycles
70% to 90% of the time most workstations are idle

Fault Tolerance
Minimization of Impact on Users
Load Balancing
Multiple Application Queues
Some Popular Resource Management Systems

Commercial Systems (Project - URL)

LSF - http://www.platform.com/
SGE - http://www.sun.com/grid/
NQE - http://www.cray.com/
LL - http://www.ibm.com/systems/clusters/software/loadleveler/
PBS - http://www.pbsgridworks.com/

Public Domain Systems (Project - URL)

Alchemi - http://www.alchemi.net (desktop grids)
Condor - http://www.cs.wisc.edu/condor/
GNQS - http://www.gnqs.org/
Pros and Cons of SSI Approaches

Hardware:
Offers the highest level of transparency, but its rigid architecture is not flexible when extending or enhancing the system.

Operating System:
Offers full SSI, but is expensive to develop and maintain due to limited market share. It cannot be developed partially: the full functionality needs to be developed before it brings any benefit, so it can be risky. E.g., Mosix and Solaris MC.

Subsystem Level:
Easy to implement, and benefits the class of applications for which it is designed. E.g., job management systems such as PBS and SGE.

Application Level:
Easy to realise, but requires that each application be developed as SSI-aware separately. E.g., Google.
Additional References

R. Buyya, T. Cortes, and H. Jin, Single System Image, International Journal of High-Performance Computing Applications (IJHPCA), Volume 15, No. 2, Summer 2001.
G. Pfister, In Search of Clusters, Prentice Hall, USA.
B. Walker, Open SSI Linux Cluster Project: http://openssi.org/ssi-intro.pdf
