Oracle Solaris Cluster 4.x Administration
D74942GC10
Edition 1.0
February 2012
D75861
Authors
Raghavendra JS
Zeeshan Nofil
Venu Poddar

Technical Contributors and Reviewers
Thorsten Fruauf
Harish Mallya
Hemachandran Namachivayam

Graphic Designer
Satish Bettegowda

Editors
Raj Kumar
Richard Wallis

Publishers
Michael Sebastian
Giri Venugopal

Disclaimer
This document contains proprietary information and is protected by copyright and other intellectual property laws. You may copy and print this document solely for your own use in an Oracle training course. The document may not be modified or altered in any way. Except where your use constitutes "fair use" under copyright law, you may not use, share, download, upload, copy, print, display, perform, reproduce, publish, license, post, transmit, or distribute this document in whole or in part without the express authorization of Oracle.

The information contained in this document is subject to change without notice. If you find any problems in the document, please report them in writing to: Oracle University, 500 Oracle Parkway, Redwood Shores, California 94065 USA. This document is not warranted to be error-free.

Trademark Notice
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.
Contents
Preface
1 Introduction
Overview 1-2
Agenda 2-32
Identifying Applications That Are Supported by Oracle Solaris Cluster 2-33
Cluster-Unaware Applications 2-34
Failover Applications 2-35
Scalable Applications 2-36
Cluster-Aware Applications 2-38
Oracle Solaris Cluster Data Services 2-40
Quiz 2-41
Agenda 2-42
Agenda 6-67
Modifying Private Network Address and Netmask (1/5) 6-68
Modifying Private Network Address and Netmask (2/5) 6-69
Modifying Private Network Address and Netmask (3/5) 6-70
Modifying Private Network Address and Netmask (4/5) 6-71
Modifying Private Network Address and Netmask (5/5) 6-72
Summary 6-73
Practice 6 Overview: Performing Basic Cluster Administration 6-74
Agenda 8-14
Shared Disk Set Replica Management 8-15
Initializing the Local metadb Replicas on Local Disks 8-16
Shared Disk Set Mediators 8-19
Creating Shared Disk Sets and Mediators 8-20
Quiz 8-23
Installing Solaris Volume Manager 8-24
Automatic Repartitioning and metadb Placement on Shared Disk Sets 8-25
Using Shared Disk-Set Disk Space 8-27
Agenda 9-28
Performing Failover and Failback Manually 9-29
Agenda 9-30
Configuring IPMP in the Oracle Solaris Cluster Environment 9-31
Integrating IPMP into the Oracle Solaris Cluster Software Environment 9-32
Summary 9-36
Practice 9 Overview: Configuring and Testing IPMP 9-37
Summary 11-44
Practice 11 Overview: Installing and Configuring Apache as a Scalable Service on Oracle Solaris Cluster 11-45
Preface
Profile
Before You Begin This Course
Before you begin this course, you should be able to:
• Administer the Oracle Solaris 10/11 Operating System
• Manage file systems and local disk drives
• Perform system boot procedures
• Perform user and role administration
How This Course Is Organized
Oracle Solaris Cluster 4.x Administration is an instructor-led course featuring
Related Publications
Oracle Publications
Title
Oracle Solaris Cluster 4.0 Information Library
http://docs.oracle.com/cd/E23623_01/index.html
Oracle Solaris 11 Information Library
http://docs.oracle.com/cd/E23824_01/
Related Publications
Additional Publications
• System release bulletins
• Installation and user’s guides
• Read-me files
• International Oracle User’s Group (IOUG) articles
• Oracle Magazine
Typographic Conventions
The following two lists explain Oracle University typographical conventions for
words that appear within regular text or within code samples.
Introduction
Overview
• Goals
• Agenda
• Introductions
• Your learning center
In this course, you learn the essential information and skills needed to install and administer
Oracle Solaris Cluster hardware and software systems. To begin, we would like to take about
20 minutes to give you an introduction to the course as well as to your fellow students and the
classroom environment.
Course Objectives
Agenda: Day 1
• Lesson 1: Introduction
• Lesson 2: Planning the Oracle Solaris Cluster Environment
– Define clustering.
– Describe the Oracle Solaris Cluster features.
Agenda: Day 2
Agenda: Day 3
Agenda: Day 4
Agenda: Day 5
Introductions
• Name
• Company affiliation
• Title, function, and job responsibility
• Experience related to topics presented in this course
• Logistics
– Restrooms
– Break rooms and designated smoking areas
– Cafeterias and restaurants in the area
Environment
Objectives
Agenda
• Defining clustering
• Describing Oracle Solaris Cluster features
• Identifying Oracle Solaris Cluster hardware environment
• Identifying Oracle Solaris Cluster software environment
Clustering
• HA standards
• How clusters provide HA
• HA benefits of planned and unplanned outages
• Why fault-tolerant servers are not an alternative to HA
HA can be defined as the minimization of downtime rather than its complete elimination. Most true standards of HA cannot be achieved in a stand-alone server environment.
HA standards are usually phrased with wording such as "provides five-nines availability." This corresponds to 99.999% uptime for the application, or about five minutes of downtime per year. A single clean server reboot often exceeds that amount of downtime by itself.
Many vendors provide servers that are marketed as fault tolerant. These servers are designed to tolerate any single hardware failure, for example, a memory failure or a central processing unit (CPU) failure, without any downtime.
Inter-node failover:
• Application services and data are recovered automatically when there is any hardware or software failure.
• Application recovery is done without human intervention.
Clusters provide an environment where, in the case of any single hardware or software failure
in the cluster, application services and data are recovered automatically (without human
intervention) and quickly (faster than a server reboot). The existence of the redundant servers
in the cluster and redundant server-storage paths makes this possible.
Inter-node failover:
[Figure: Workload failover across a WAN from a production server to a standby server, both attached to shared storage.]
Planned reboots
The HA benefit that cluster environments provide involves not only hardware and software failures, but also planned outages. Although a cluster automatically relocates applications within the cluster in the case of failures, services can also be relocated manually for planned outages. As such, a normal reboot for hardware maintenance in the cluster affects application uptime only for as long as it takes to manually relocate the applications to different servers in the cluster.
Clusters also provide an integrated hardware and software environment for scalable
applications. Scalability is defined as the ability to increase application performance by
supporting multiple instances of applications on different nodes in the cluster. These
instances are generally accessing the same data as each other.
Clusters generally do not require a choice between availability and performance. HA is
generally built into scalable applications as well as non-scalable ones. In scalable
applications, you might not need to relocate failed applications because other instances are
already running on other nodes. You might still need to perform recovery on behalf of failed
instances.
Agenda
• Defining clustering
• Describing Oracle Solaris Cluster features
• Identifying Oracle Solaris Cluster hardware environment
• Identifying Oracle Solaris Cluster software environment
The Oracle Solaris Cluster hardware and software environment is the latest-generation
clustering product. The following are the features of the Oracle Solaris Cluster product:
• Global device implementation: Although data storage must be physically connected
on paths from at least two different nodes in the Oracle Solaris Cluster hardware and
software environment, all the storage in the cluster is logically available from every node
in the cluster by using standard device semantics. This provides the flexibility to run
applications on nodes that use data that is not even physically connected to the nodes.
• Global file system implementation: The Oracle Solaris Cluster software framework
provides a global file service independent of any particular application running in the
cluster, so that the same files can be accessed on every node of the cluster, regardless
of the storage topology.
Note: The global file system is also referred to as cluster file system.
• Cluster framework services implemented in the kernel: The Oracle Solaris Cluster
software is tightly integrated with the Oracle Solaris OS kernels. Node monitoring
capability, transport monitoring capability, and the global device and file system
implementation are implemented in the kernel to provide higher reliability and
performance.
• Off-the-shelf application support: The Oracle Solaris Cluster product includes data
service agents for a large variety of cluster-unaware applications. These are tested
programs and fault monitors that make applications run properly in the cluster
environment.
• Support for some off-the-shelf applications as scalable applications with built-in load balancing
Agenda
• Defining clustering
• Describing Oracle Solaris Cluster features
• Identifying Oracle Solaris Cluster hardware environment
• Identifying Oracle Solaris Cluster software environment
[Figure: Typical cluster hardware environment — an administration workstation reaching the cluster nodes over the network.]
The Oracle Solaris Cluster hardware environment supports a maximum of 16 nodes. The
hardware components of a typical two-node cluster comprise:
• Cluster nodes running the Oracle Solaris 11 OS. Each node must run the same revision and update of the OS.
• Separate boot disks on each node (with a preference for mirrored boot disks)
• One or more public network interfaces per system per subnet (at least two are preferred)
• A redundant private cluster transport interface
• Dual-hosted, mirrored disk storage
• One terminal concentrator (or any other console access method)
• Administrative workstation
Cluster Nodes
A wide range of server platforms are supported for use in the clustered environment. These
range from small rack-mounted servers up to enterprise-level servers.
Different models of a server architecture are supported as nodes in the same cluster, based
on the network and storage host adapters used. However, you cannot mix SPARC and x86
servers in the same cluster.
All nodes in a cluster are linked by a private cluster transport. The transport can be used for
the following purposes:
• Cluster-wide monitoring and recovery
• Global data access (transparent to applications)
• Application-specific transport for cluster-aware applications
It is highly recommended to use two separate private networks that form the cluster transport.
You can have more than two private networks (and you can add more later). More private
networks can provide a performance benefit in certain circumstances, because global data
access traffic is striped across all the transports.
Oracle Solaris Cluster enables you to build configurations with a single private network
forming the cluster transport. This would be recommended in production only if the single
private network is already redundant (using a lower-level device aggregation).
Crossover cables are often used in a two-node cluster. Switches are optional when you have
two nodes, and they are required for more than two nodes.
Clients connect to the cluster through the public network interfaces. Each network adapter
card can connect to one or more public networks, depending on whether the card has multiple
hardware interfaces.
You can set up Oracle Solaris hosts in the cluster to include multiple public network interface
cards that:
• Are configured so that multiple cards are active
• Serve as failover backups for one another
Each node must have public network interfaces that are under the control of the Oracle
Solaris OS IP Multipathing (IPMP) software. It is recommended to have at least two interfaces
in each IPMP group.
If one of the adapters fails, IP network multipathing software is called to fail over the defective
interface to another adapter in the group. An Oracle Solaris Cluster server is not allowed to
act as a router.
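As a minimal sketch only (the interface names net0 and net1, the group name sc_ipmp0, and the address are placeholders rather than values from the course environment), an IPMP group with two active interfaces might be created on Oracle Solaris 11 with commands along these lines:
# ipadm create-ip net0
# ipadm create-ip net1
# ipadm create-ipmp sc_ipmp0
# ipadm add-ipmp -i net0 -i net1 sc_ipmp0
# ipadm create-addr -T static -a 192.0.2.10/24 sc_ipmp0/v4
- create-ip: Plumbs the IP interfaces that will become members of the group
- create-ipmp and add-ipmp: Create the IPMP group interface and place the underlying interfaces in it
- create-addr: Configures a data address on the IPMP group interface
Configuring and testing IPMP in the cluster environment is covered in detail in Lesson 9.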
• Multihost disks
– Multihost disks are connected and shared with more than
one Oracle Solaris host.
– Multihost storage makes disks highly available.
Disks that can be connected to more than one Oracle Solaris host at a time are multihost
devices. In the Oracle Solaris Cluster environment, multihost storage makes disks highly
available.
Multihost devices have the following characteristics:
• Tolerance of single-host failures
• Ability to store application data, application binaries, and configuration files
• Protection against host failures. If clients request the data through one host and the host
fails, the requests are switched over to use another host with a direct connection to the
same disks.
• Global access through a primary host that “masters” the disks, or direct concurrent
access through local paths
The Oracle Solaris Cluster hardware environment can use several storage models. They must
all accept multihost connections. The StorEdge 6120 array has a single connection and must
be used with a hub or a switch.
Some data storage arrays support only two physically connected nodes. Many other storage
configurations support more than two nodes connected to the storage.
You can use ZFS and Solaris Volume Manager software to mirror the storage across
controllers. You can choose not to use any volume manager if each node has multipathed
access to HA hardware redundant array of independent disks (RAID) storage.
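For illustration only (the pool name and disk device names below are placeholders), mirroring shared storage across two controllers with ZFS might look like the following, with the two disks chosen from arrays on different controllers:
# zpool create datapool mirror c1t0d0 c2t0d0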
Local disks:
• Local disks, also called boot disks, are the disks that are
connected to only a single Oracle Solaris host.
• Boot disks must not be connected to multiple nodes.
Local disks are the disks that are connected to only a single Oracle Solaris host. The Oracle
Solaris Cluster environment requires that boot disks for each node be local to the node. That
is, the boot disks are not connected or not visible to any other node. For example, if the boot
device was connected through a storage area network (SAN), it would still be supported if the
LUN is not visible to any other nodes.
Note: Oracle Solaris Cluster software does not require that you mirror the ZFS root pool.
Terminal concentrator:
• A terminal concentrator (TC) is a typical way of accessing node consoles if you are using a ttya console.
• A TC provides data translation from the network to serial port interfaces.
Servers supported in the Oracle Solaris Cluster environment have a variety of console access
mechanisms.
If you are using a serial port console access mechanism (ttya), then you probably have a
terminal concentrator in order to provide the convenience of remote access to your node
consoles. A terminal concentrator (TC) is a device that provides data translation from the
network to serial port interfaces. Each of the serial port outputs connects to a separate node
in the cluster through serial port A.
There is always a trade-off between convenience and security. You might prefer to have only
dumb-terminal console access to the cluster nodes, and keep these terminals behind locked
doors requiring stringent security checks to open them. This is acceptable (although less
convenient to administer) for Oracle Solaris Cluster hardware as well.
Administrative Workstation
Included with the Oracle Solaris Cluster software is the administration console software,
which can be installed on any SPARC or x86 Solaris OS workstation. The software can be a
convenience in managing the multiple nodes of a cluster from a centralized location. It does
not affect the cluster in any other way.
Quiz
Answer: b
The following list summarizes the generally required and optional hardware redundancy
features in the Oracle Solaris Cluster hardware environment:
• Redundant server nodes are required.
• Redundant transport is highly recommended.
• HA access to data storage is required. That is, at least one of the following is required.
- Mirroring across controllers for Just a Bunch of Disks (JBOD) or for hardware
RAID devices without multipathing
- Multipathing from each connected node to hardware RAID devices
• Redundant public network interfaces per subnet are recommended.
You should locate redundant components as far apart as possible. For example, on a system
with multiple I/O boards, you should put the redundant transport interfaces, the redundant
public nets, and the redundant storage array controllers on two different I/O boards.
Agenda
• Defining clustering
• Describing Oracle Solaris Cluster features
• Identifying Oracle Solaris Cluster hardware environment
• Identifying Oracle Solaris Cluster software environment
[Figure: Oracle Solaris Cluster software stack — cluster-unaware applications and data services running in user land above the cluster framework.]
The slide gives a graphical, high-level overview of the software components that work
together to create the Oracle Solaris Cluster software environment.
To function as a cluster member, the following types of software must be installed on every
Oracle Solaris Cluster node:
• Oracle Solaris OS software
• Oracle Solaris Cluster software
• Data service application software
• Logical volume management
An exception is a configuration that uses hardware RAID. This configuration might not require
a software volume manager.
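As a sketch of how the cluster software itself is delivered, Oracle Solaris Cluster 4.0 is installed as IPS packages on Oracle Solaris 11. The group package name shown here is an assumption based on the standard Oracle Solaris Cluster 4.0 package groups, so verify it against your repository:
# pkg install ha-cluster-full
Installing the software on the cluster nodes is covered in the Installation lesson.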
Agenda
• Defining clustering
• Describing Oracle Solaris Cluster features
• Identifying Oracle Solaris Cluster hardware environment
• Identifying Oracle Solaris Cluster software environment
The Oracle Solaris Cluster software environment supports both cluster-unaware and cluster-aware applications.
Cluster-Unaware Applications
Failover Applications
The failover model is the easiest to support in the cluster. Failover applications run on only
one node of the cluster at a time. The cluster provides HA by providing automatic restart on
the same node or on a different node of the cluster.
Failover services are usually paired with an application IP address. This is an IP address that
always fails over from node to node along with the application. In this way, clients outside the
cluster see a logical host name with no knowledge of which node a service is running on. The
client should not even be able to tell that the service is running in a cluster.
Note: Both IPv4 and IPv6 addresses are supported.
Multiple failover applications in the same resource group can share an IP address, with the
restriction that they must all fail over to the same node together.
Scalable Applications
Scalable applications involve running multiple instances of an application in the same cluster
and making it look like a single service by means of a global interface that provides a single
IP address and load balancing.
Scalable Applications
[Figure: A scalable application — multiple HTTP/application instances running on different nodes behind a single global interface.]
Although scalable applications are still off-the-shelf, not every application can be made to run
as a scalable application in the Oracle Solaris Cluster software environment. Applications that
write data without any type of locking mechanism might work as failover applications but do
not work as scalable applications.
Cluster-Aware Applications
Cluster-aware applications are applications in which knowledge of the cluster is built into the
software. They differ from cluster-unaware applications in the following ways:
• Multiple instances of the application running on different nodes are aware of each other
and communicate across the private transport.
• It is not required that the Oracle Solaris Cluster framework's Resource Group Manager (RGM) start and stop these applications. Because these applications are cluster-aware, they can be started by their own independent scripts or by hand.
• Applications are not necessarily logically grouped with external application IP
addresses. If they are, the network connections can be monitored by cluster commands.
It is also possible to monitor these cluster-aware applications with Solaris Cluster
software framework resource types.
Cluster-Aware Applications
Parallel database applications: Parallel database applications are a special type of cluster application. Multiple instances of the database server cooperate in the cluster, handling
different queries on the same database and even providing parallel query capability on large
queries. A supported application is listed in the slide.
RSM applications: Applications that run on Oracle Solaris Cluster hardware can make use of
an application programming interface (API) called Remote Shared Memory (RSM). This API
maps data from an application instance running on one node to the address space of an
instance running on another node. This can be a highly efficient way for cluster-aware
applications to share large amounts of data across the transport. This requires the SCI
interconnect.
Data service agents make cluster-unaware applications highly available, in either failover or
scalable configurations.
HA and Scalable Data Service Support
The Oracle Solaris Cluster software provides preconfigured components that support HA data
services.
For the list of the most recent components available, visit the Supported Products page at
http://docs.oracle.com/cd/E23623_01/html/E23438/relnotes-6-products.html.
Quiz
a. Cluster interconnect
b. Data service agent
c. Proxy agent
Answer: b
Agenda
• Defining clustering
• Describing Oracle Solaris Cluster features
• Identifying Oracle Solaris Cluster hardware environment
• Identifying Oracle Solaris Cluster software environment
The Oracle Solaris Cluster software HA framework is the software layer that provides generic
cluster services to the nodes in the cluster, regardless of which applications are running in the
cluster. The Oracle Solaris Cluster software framework is implemented as a series of
daemons and kernel modules. One advantage of the Oracle Solaris Cluster software
environment is that much of the framework resides in the kernel, where it is fast, reliable, and
always memory-resident. Some of the services provided by the framework are listed in the
slide.
The cluster membership monitor (CMM) is kernel-resident on each node and detects major
cluster status changes, such as loss of communication between one or more nodes. The
CMM relies on the transport kernel module to generate heartbeats across the transport
medium to other nodes in the cluster. If the heartbeat from a node is not detected within a defined timeout period, that node is considered to have failed, and a cluster reconfiguration is initiated to renegotiate cluster membership.
[Figure: The clprivnet0 virtual interface on each node, with per-node private addresses such as 172.16.193.1 and 172.16.193.2.]
Applications written correctly can use the transport for data transfer. This feature stripes IP
traffic sent to the per-node logical IP addresses across all private interconnects. Transmission
Control Protocol (TCP) traffic is striped on a per connection granularity. User Datagram
Protocol (UDP) traffic is striped on a per-packet basis. The cluster framework uses the
clprivnet0 virtual network device for these transactions. This network interface is visible
with ifconfig. No manual configuration is required.
The application receives the benefit of striping across all the physical private interconnects,
but needs to be aware of only a single IP address on each node configured on that node’s
clprivnet0 adapter.
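For example, you can view the per-node private address configured on clprivnet0 from any cluster node with ordinary networking commands (output is omitted here because it depends on your cluster's private network configuration):
# ifconfig clprivnet0
# ipadm show-addr | grep clprivnet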
Agenda
• Defining clustering
• Describing Oracle Solaris Cluster features
• Identifying Oracle Solaris Cluster hardware environment
• Identifying Oracle Solaris Cluster software environment
The Oracle Solaris Cluster software framework provides global storage services, a feature
which greatly distinguishes the Oracle Solaris Cluster software product. Not only does this
feature enable scalable applications to run in the cluster, but it also provides a much more
flexible environment for failover services by freeing applications to run on nodes that are not
physically connected to the data.
It is important to understand the differences and relationships between the services listed in
the slide.
Note: The global file system is also referred to as cluster file system.
Global Naming
The DID feature provides a unique device name for every disk drive, CD-ROM drive, or tape
drive in the cluster. Shared disks that might have different logical names on different nodes
(different controller numbers) are given a cluster-wide unique DID instance number. Different
local disks that may use the same logical name (for example, c0t0d0 for each node’s root
disk) are each given unique DID instance numbers.
Global Naming
The figure in the slide demonstrates the relationship between typical Oracle Solaris OS logical
path names and DID instances.
Device files are created for each of the standard eight Solaris OS disk partitions in both the
/dev/did/dsk and /dev/did/rdsk directories (for example, /dev/did/dsk/d2s3 and
/dev/did/rdsk/d2s3).
DIDs themselves are just a global naming scheme and not a global access scheme.
DIDs are used as components of Solaris Volume Manager volumes and in choosing cluster
quorum devices.
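As a quick illustration, the cldevice command reports the mapping between DID instance numbers and the per-node logical device paths; d2 below refers to the example DID instance mentioned above, and output is omitted because it depends on your configuration:
# cldevice list -v
# cldevice show d2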
Global Devices
The global devices feature of Solaris Cluster software provides simultaneous access to the
raw (character) device associated with storage devices from all nodes, regardless of where
the storage is physically attached. This includes individual DID disk devices, CD-ROMs and
tapes, as well as Solaris Volume Manager volumes.
The Solaris Cluster software framework manages automatic failover of the primary node for
global device groups. All nodes use the same device path, but only the primary node for a
particular device actually talks through the storage medium to the disk device. All other nodes
access the device by communicating with the primary node through the cluster transport. All
nodes have simultaneous access to the device /dev/vx/rdsk/nfsdg/nfsvol. Node 2
becomes the primary node if Node 1 fails.
In general, if a node fails while providing access to a global device, the Oracle Solaris Cluster
software automatically discovers another path to the device. The Oracle Solaris Cluster
software then redirects the access to that path. The local disks on each server are also not
multiported, and thus are not highly available devices.
The cluster automatically assigns unique IDs to each disk, CD-ROM, and tape device in the
cluster. This assignment enables consistent access to each device from any node in the
cluster. The global device namespace is held in the /dev/global directory.
The Oracle Solaris Cluster software maintains a special file system on each node, completely
dedicated to storing the device files for global devices. This file system has the mount point
/global/.devices/node@nodeID, where nodeID is an integer representing a node in
the cluster. The file system is stored on a dedicated partition on the boot disk.
All the /global/.devices file systems, one for each node, are visible from each node. In other words, they are examples of global file systems.
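For example, a simple df listing on any node shows one such globally mounted file system per node:
# df -h | grep /global/.devices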
The device names under the /global/.devices/node@nodeID arena can be used
directly. However, because they are unwieldy, the Oracle Solaris Cluster environment
provides symbolic links into this namespace.
For Solaris Volume Manager, the Oracle Solaris Cluster software links the standard device
access directories into the global namespace.
proto192:/dev/vx# ls -l /dev/vx/rdsk/nfsdg
lrwxrwxrwx 1 root root 40 Nov 25 03:57
/dev/vx/rdsk/nfsdg ->/global/.devices/node@1/dev/vx/rdsk/nfsdg/
proto192:/dev/md/nfsds# ls -l /dev/global/rdsk/d3s0
lrwxrwxrwx 1 root root 39 Nov 4 17:43
/dev/global/rdsk/d3s0 -> ../../../devices/pseudo/did@0:3,3s0,raw
proto192:/dev/md/nfsds# ls -l /dev/global/rmt/1
lrwxrwxrwx 1 root root 39 Nov 4 17:43
/dev/global/rmt/1 -> ../../../devices/pseudo/did@8191,1,tp
The cluster file system feature makes file systems simultaneously available on all nodes,
regardless of their physical location.
The cluster file system capability is independent of the structure of the actual file system
layout on disk. However, only certain file system types are supported by Oracle Solaris
Cluster to be file systems underlying the global file system. One of these is:
• UNIX file system (UFS)
The Oracle Solaris Cluster software makes a file system global with a global mount option.
This is typically in the /etc/vfstab file, but can be put on the command line of a standard
mount command:
# mount -o global /dev/vx/dsk/nfs-dg/vol-01 /global/nfs
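The equivalent /etc/vfstab entry for the same example would carry the global option in the mount options field (the device and mount point names simply mirror the example command and are not values from your environment):
/dev/vx/dsk/nfs-dg/vol-01  /dev/vx/rdsk/nfs-dg/vol-01  /global/nfs  ufs  2  yes  global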
Oracle Solaris Cluster software also has support in the cluster for failover file system access.
Highly Available Local File Systems, also known as failover file systems, are available only on
one node at a time, on a node that is running a service and has a physical connection to the
storage in question. Failover file systems are also called non-global file systems.
In Oracle Solaris Cluster, more file system types are supported as a failover file system than
as underlying file systems for a global file system.
UFS and ZFS are the supported types of failover file systems.
Failover file system access is appropriate for failover services that run only on the nodes that
are physically connected to storage devices. Failover file system access is not suitable for
scalable services.
Failover file system access, when used appropriately, can have a performance benefit over
global file system access. The global file system infrastructure has an overhead of
maintaining replicated state information on multiple nodes simultaneously.
Quiz
Answer: c
Quiz
Answer: b
Agenda
• Defining clustering
• Describing Oracle Solaris Cluster features
• Identifying Oracle Solaris Cluster hardware environment
• Identifying Oracle Solaris Cluster software environment
Oracle VM Server for SPARC domains are fully supported as cluster nodes. Both I/O domains and guest domains are supported.
Note: The term Oracle VM Server for SPARC, or Logical Domains for short, is the new name
for LDoms. Throughout this course, the term Logical Domain is used as a short name to refer
to Oracle VM Server for SPARC.
You can use one or more Logical Domains and one or more physical nodes not using Logical
Domains in the same cluster.
Zone Clusters
Summary
Practice 2 Overview:
Guided Tour of the Virtual Training Lab
This practice provides a guided tour of the virtual training lab.
While participating in the guided tour, you identify the Oracle Solaris Cluster hardware
components, including the cluster nodes, terminal concentrator, and administrative
workstation.
Establishing Cluster Node Console Connectivity
Objectives
Agenda
This section describes different methods for achieving access to the Solaris Cluster node
consoles. It is expected that a Solaris Cluster environment administrator:
• Does not require node console access for most operations described in this course. Most cluster operations require only that you be logged in on a cluster node as root or as a user with cluster authorizations in the Role-Based Access Control (RBAC) subsystem. It is acceptable to have direct telnet, rlogin, or ssh access to the node.
• Must have console node access for certain emergency and informational purposes. If a
node is failing to boot, the cluster administrator will have to access the node console to
figure out why. The cluster administrator might like to observe boot messages even in
normal, functioning clusters.
Traditional Oracle Solaris Cluster nodes usually use serial port ttyA as the console. Even if
you have a graphics monitor and system keyboard, you are supposed to redirect console
access to the serial port or an emulation thereof.
The rule for console connectivity is simple. You can connect to the node ttyA interfaces any way you prefer, as long as whatever device is connected directly to the interfaces does not spuriously issue BREAK signals on the serial line. BREAK signals on the serial port bring a cluster node to the OK prompt, killing all cluster operations on that node.
You can disable node recognition of a BREAK signal by a hardware keyswitch position (on
some nodes), a software keyswitch position (on midrange and high-end servers), or a file
setting (on all nodes). For those servers with a hardware keyswitch, turn the key to the third
position to power the server on and disable the BREAK signal.
For those servers with a software keyswitch, issue the setkeyswitch command with the
secure option to power the server on and disable the BREAK signal.
For all servers, while running Solaris OS, uncomment the line
KEYBOARD_ABORT=alternate in /etc/default/kbd to disable receipt of the normal
BREAK signal through the serial port. This setting takes effect on boot, or by running the
kbd -i command as root. The Alternate Break signal is defined by the particular serial port
driver that you happen to have on your system. You can use the prtconf command to figure
out the name of your serial port driver, and then use man serial-driver to figure out the
sequence. For example, for the zs driver, the sequence is carriage return, tilde (~), and
Control + B: CR ~ CTRL + B. When the Alternate Break sequence is in effect, only serial
console devices are affected.
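As a sketch of the steps just described (run as root on each node):
# vi /etc/default/kbd        (uncomment or set KEYBOARD_ABORT=alternate)
# kbd -i                     (apply the settings in /etc/default/kbd without rebooting)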
One of the popular ways of accessing traditional node consoles is through a terminal
concentrator (TC), a device which listens for connections on the network and passes through
traffic (un-encapsulating and re-encapsulating all the TCP/IP headers) to the various serial
ports.
A TC is also known as a Network Terminal Server (NTS). The figure in the slide shows a
terminal concentrator network and serial port interfaces. The node public network interfaces
are not shown. Although you can attach the TC to the public net, most security-conscious
administrators would attach it to a private management network.
Most TCs enable you to administer TCP pass-through ports on the TC. When you connect with telnet to the TC's IP address and a pass-through port, the TC transfers traffic directly to the appropriate serial port (perhaps with an additional password challenge).
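For example (the TC host name and pass-through port number here are purely illustrative; the actual port-to-serial-line mapping is defined in your TC's configuration), connecting to the console attached to one of the TC's serial ports might look like this:
# telnet tc-lab1 5002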
You can choose any type of TC as long as it does not issue BREAK signals on the serial
ports when it is powered on, powered off, or reset, or at any other time that might be
considered spurious. If your TC cannot meet that requirement, you can still disable
recognition of the BREAK signal or enable an alternate abort signal for your node. Some
terminal concentrators support Secure Shell. This might influence your choice, if you are
concerned about passing TC traffic in the clear on the network.
Many of the servers supported with Solaris Cluster have console access through a network
connection to a virtual console device. These include:
• Hardware domain-based systems: The console access device is the system controller
(SC) or system service processor (SSP).
• Servers such as Sun Fire V890: You can choose to have console access through the
Remote System Control (RSC) device and software.
• Modern rack-based servers: The console access device is a small onboard system
controller running Advanced Lights Out Management (ALOM).
• Oracle VM Server for SPARC: Console access to a logical domain is through a network connection to the service domain, which provides the virtual console service for the logical domain.
Agenda
Ensure that a supported version of the Oracle Solaris OS and any Oracle Solaris software
updates are installed on the administrative console.
Notes:
• When you install the Oracle Solaris Cluster man page packages on the administrative
console, you can view them from the administrative console before you install Oracle
Solaris Cluster software on the cluster nodes or on a quorum server.
• Setting the publisher origin to the file repository URI: To enable client systems to get
packages from your local file repository, you need to reset the origin for the solaris
publisher. Execute the following command on each client:
# pkg set-publisher -G '*' -g /net/host1/export/repoSolaris11/ solaris
- -G '*': Removes all existing origins for the solaris publisher
- -g: Adds the URI of the newly created local repository as the new origin for the
solaris publisher
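You can then confirm the new origin for the solaris publisher with the standard IPS command (output omitted):
# pkg publisher solaris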
Quiz
Answer: b
Agenda
All the tools have the same general look and feel. Each tool automatically opens one new window for each node, plus a small common keystroke window. You can type in each individual window as desired. Input directed to the common window is automatically replicated to all the other windows.
Summary
Practice 3 Overview:
Connecting to the Cluster Node Console
This practice covers the following topics:
• Task 1: Updating host name resolution
• Task 2: Installing the pconsole utility
• Task 3: Configuring the pconsole utility
Installation
Objectives
Agenda
You can install Oracle Solaris software from a local DVD-ROM or from a network installation
server by using the Automated Installer (AI) installation method. In addition, Oracle Solaris Cluster
software provides a custom method for installing both the Oracle Solaris OS and Oracle Solaris
Cluster software by using the AI installation method. During AI installation of Oracle Solaris
software, you choose to either install the OS with defaults accepted or run an interactive
installation of the OS where you can customize the installation for components, such as the boot
disk and the ZFS root pool. If you are installing several cluster nodes, consider a network
installation.
Consider the following points when you plan for the Oracle
Solaris OS in an Oracle Solaris Cluster configuration:
• Oracle Solaris Zones
– Install Oracle Solaris Cluster framework software only in the global zone.
Consider the following points when you plan to use the Oracle Solaris OS in an Oracle Solaris
Cluster configuration.
• Oracle Solaris Zones: Install Oracle Solaris Cluster framework software only in the global
zone.
• Loopback file system (LOFS): During cluster creation, LOFS capability is enabled by
default. If the cluster meets both of the following conditions, you must disable LOFS to avoid
switchover problems or other failures:
- HA for NFS is configured on a highly available local file system.
- The automountd daemon is running.
If the cluster meets only one of these conditions, you can safely enable LOFS.
If you require both LOFS and the automountd daemon to be enabled, exclude from the
automounter map all files that are part of the highly available local file system that is
exported by HA for NFS.
IP Filter
• Oracle Solaris Cluster relies on IP multipathing (IPMP) for
public network monitoring.
• IP Filter configuration must follow the IPMP configuration guidelines and restrictions concerning IP Filter.
• IP Filter: Oracle Solaris Cluster relies on IP network multipathing (IPMP) for public network
monitoring. Any IP Filter configuration must be made in accordance with IPMP configuration
guidelines and restrictions concerning IP Filter.
• The space requirements for the root (/) file system are
as follows:
– Oracle Solaris Cluster software requires less than 40 MB of
space in the root (/) file system.
– Each Oracle Solaris Cluster data service might use between 1 MB and 5 MB.
When you install the Oracle Solaris OS, ensure that you create the required Oracle Solaris Cluster
partitions and that all partitions meet minimum space requirements.
• Root (/): The primary space requirements for the root (/) file system are as follows:
- The Oracle Solaris Cluster software occupies less than 40 MB of space in the root (/)
file system.
- Each Oracle Solaris Cluster data service might use between 1 MB and 5 MB.
- Solaris Volume Manager software requires less than 5 MB.
- You need to set aside ample space for log files. Also, more messages might be logged
on a clustered node than would be found on a typical stand-alone server. Therefore,
allow at least 100 MB for log files.
- The lofi device for the global-devices namespace requires 100 MB of free space.
In Oracle Solaris Cluster 4.0, a dedicated partition is no longer used for the global-
devices namespace.
- To configure ample additional space and inode capacity, add at least 100 MB to the
amount of space you would normally allocate for your root (/) file system. This space is
used for the creation of both block special devices and character special devices used
by the volume management software. You especially need to allocate this extra space
if a large number of shared disks are in the cluster.
• /var : The Oracle Solaris Cluster software occupies a negligible amount of space in the
/var file system at installation time. However, you need to set aside ample space for log
files. Also, more messages might be logged on a clustered node than would be found on a
typical stand-alone server. Therefore, allow at least 100 MB for the /var file system.
• swap: The combined amount of swap space that is allocated for Oracle Solaris and Oracle
Solaris Cluster software must be no less than 750 MB. For best results, add at least 512 MB for the Oracle Solaris Cluster software to the amount required by the Oracle Solaris OS.
Agenda
Previous versions of the Oracle Solaris Cluster software had strict rules regarding how many
nodes were supported in various disk topologies. The only rules in the Oracle Solaris Cluster
software regarding the data storage for the cluster are the following:
• Oracle Solaris Cluster software supports up to 16 nodes. Some storage configurations have
restrictions on the total number of nodes supported.
• A shared storage device can connect to as many nodes as the storage device supports.
• Shared storage devices do not need to connect to all nodes of the cluster. However, these
storage devices must connect to at least two nodes.
Quiz
Answer: a, b, c
Cluster topologies describe typical ways in which cluster nodes can be connected to data storage
devices. Oracle Solaris Cluster does not require you to configure a cluster by using specific
topologies.
[Figure: Clustered pairs topology — pairs of nodes connected through switches to shared storage.]
In a clustered pairs topology, two or more pairs of nodes form a single cluster, with each pair physically connected to its own storage. Because of the global device and global file system infrastructure, this does not restrict where applications can fail over to and run. Still, it is likely that you will configure applications to fail over within the pair of nodes attached to the same storage.
Features of clustered pair configurations:
• Nodes are configured in pairs. You can have any even number of nodes from 2 to 16.
• Each pair of nodes shares storage. Storage is connected to both nodes in the pair.
• All nodes are part of the same cluster. You are likely to design applications that run on the
pair of nodes physically connected to the data storage for that application, but you are not
restricted to this design.
• Because each pair has its own storage, no one node must have a significantly higher
storage capacity than the others.
• This configuration is well suited for failover data services.
• This configuration is well suited if you have a legacy SCSI-array or any disk array that can be
attached to no more than two nodes.
Pair+N Topology
[Figure: Pair+N topology — a node pair directly connected through switches to shared storage, plus additional nodes without direct storage connections.]
The Pair+N topology includes a pair of nodes that are directly connected to the shared storage
and nodes that must use the cluster interconnect to access shared storage because they have no
direct connection themselves.
Features of Pair+N configurations:
• All shared storage is connected to a single pair.
• Additional cluster nodes support scalable data services or failover data services with the
global device and file system infrastructure.
• A maximum of 16 nodes are supported.
• There are common redundant interconnects between all the nodes.
• The Pair+N configuration is well suited for scalable data services.
• This configuration is well suited if you have a legacy SCSI-array or any disk array that can be
attached to no more than two nodes.
A limitation of the Pair+N configuration is that there can be heavy data traffic on the cluster interconnects. You can increase the bandwidth by adding more cluster transports.
N+1 Topology
The N+1 topology enables one system to act as the storage backup for every other system in the
cluster. All of the secondary paths to the storage devices are connected to the redundant or
secondary system, which can be running a normal workload of its own.
Features of N+1 configurations:
• The secondary node is the only node in the configuration that is physically connected to all
the multihost storage.
• The backup node can take over without any performance degradation.
• The backup node is more cost effective because it does not require additional data storage.
• This configuration is best suited for failover data services.
• This configuration is well suited if you have a legacy SCSI array or any disk array that can be attached to no more than two nodes.
A limitation of the N+1 configuration is that if there is more than one primary node failure, you can
overload the secondary node.
[Figure: Scalable (N*N) topology — more than two nodes connected through switches to the same shared storage.]
In a scalable, or N*N, topology, more than two nodes can be physically connected to the same
storage. This configuration is required for running Oracle Real Application Clusters (Oracle RAC)
across more than two nodes. For ordinary, cluster-unaware applications, each particular disk
group or diskset in the shared storage still supports physical traffic from only one node at a time.
However, having more than two nodes physically connected to the storage adds flexibility and
reliability to the cluster.
[Figure: Data replication topology — nodes with separately attached storage kept in sync by controller-based replication.]
Oracle Solaris Cluster supports a data replication topology. In this topology, data storage is not
physically multiported between nodes but, rather, is replicated between storage attached to the
individual nodes by using controller-based replication.
The data replication topology is ideal for wider-area clusters where the data replication solution is
preferred to the extra connectivity that would be involved to actually connect the storage to nodes
that are far apart. This topology would be ideal with the quorum server feature.
In this configuration, one node or domain forms the entire cluster. This configuration allows for a
single node to run as a functioning cluster. It offers users the benefits of having application
management functionality and application restart functionality. The cluster starts and is fully
functional with just one node.
Single-node clusters are ideal for users learning how to manage a cluster, to observe cluster
behavior (possibly for agent development purposes), or to begin a cluster with the intention of
adding nodes, as time goes on. Oracle Solaris Cluster provides the ability to experience
application failovers, even on a single-node cluster. You could have an application that fails over
between different nonglobal Oracle Solaris zones on the node.
Single-node clusters can also be useful in the Oracle Solaris Cluster Geographic Edition product,
which manages a partnership of two clusters with data replication across a wide area. Each
member of such a partnership must be a full Oracle Solaris Cluster installation, and a one-node
cluster on either or both ends is acceptable.
[Figure: A primary cluster and a secondary cluster linked by data replication.]
Oracle Solaris Cluster Geographic Edition enables you to implement a disaster recovery scenario
by forming a conceptual “cluster of clusters” across a wide area. Application data is then replicated
by using data replication.
The Oracle Solaris Cluster Geographic Edition software has the following properties:
• Solaris Cluster Geographic Edition software is configured on top of standard Solaris Cluster
software on the participating clusters.
• Exactly two clusters are involved in the relationship shown in the diagram, and are said to
form a partnership.
• There is no conceptual limit to the distance between the two clusters.
• Oracle Solaris Cluster Geographic Edition does not currently provide an automatic failover of
an application across the two clusters. Instead it provides very simple commands to migrate
an application (either nicely or forcefully) across a wide area, while simultaneously
performing the correct operations on the data replication framework.
• Oracle Solaris Cluster 4.0 offers reliable protection from disaster for traditional or virtualized
workloads on Oracle Solaris 11 through automated application failover and coordination with
replication solutions such as StorageTek Availability Suite 4.0, Oracle Data Guard, and a
script-based plug-in.
[Figure: Geographic Edition partnership — a primary cluster and a secondary cluster (for example, in Geneva and Rome) linked by data replication.]
The following points reinforce the main concepts of Oracle Solaris Cluster Geographic Edition by
comparing its elements to a single-cluster configuration:
• Oracle Solaris Cluster resource groups control manual and automatic migration/failover of
applications within a single cluster. Oracle Solaris Cluster Geographic Edition protection
groups provide a framework for control of application migration and data replication between
remote clusters, but the actual migration/takeover is manual (an easy three-word command).
• Single-cluster configurations do support data replication as an alternative to full storage
multiporting between the nodes. This enables single clusters to run in a wider area (campus
or metro clusters) without having to connect nodes to storage that is far away. Oracle Solaris
Cluster Geographic Edition depends on data replication to provide a disaster-recovery
scenario for data and applications that can be an arbitrarily wide distance apart.
Quiz
Answer: b
Quiz
Answer: e
Quiz
Answer: d
Quiz
Answer: c
Quiz
Answer: a
Agenda
The cluster membership subsystem of the Oracle Solaris Cluster software framework operates on
a voting system as follows:
• Each node is assigned exactly one vote.
• Certain devices can be identified as quorum devices and are assigned votes. The following
are types of quorum devices:
- Directly attached multiported disks: Disks are the traditional type of quorum device and
have been supported in all versions of Solaris Cluster.
- NAS quorum devices
- Quorum servers
• There must be a majority (more than 50 percent of all possible votes present) to form a
cluster or remain in a cluster.
• Failure fencing
• Amnesia prevention
Given the rules for quorum voting, it is clear by looking at a simple two-node cluster why you need
extra quorum device votes. If a two-node cluster had only node votes, you must have both nodes
booted to run the cluster. This defeats one of the major goals of the cluster, which is to be able to
survive node failure. But why have quorum voting at all? If there were no quorum rules, you could
run as many nodes in the cluster as were able to boot at any point in time. However, the quorum
vote and quorum devices solve the following two major problems:
• Failure fencing
• Amnesia prevention
These are two distinct problems that are solved by the quorum mechanism in the Solaris Cluster
software. These problems are discussed in the following slides.
Failure Fencing
[Figure: Failure fencing — cluster nodes attached to a storage array containing a quorum device with one vote, QD(1).]
Amnesia Prevention
If it is allowed to happen, a cluster amnesia scenario would involve one or more nodes being able
to form a cluster (boot first in the cluster) with a stale copy of the cluster configuration. Consider
the following scenario:
1. In a two-node cluster (Node 1 and Node 2), Node 2 is halted for maintenance or crashes.
2. Cluster configuration changes are made on Node 1.
3. Node 1 is shut down.
4. You try to boot Node 2 to form a new cluster. If this is allowed, the cluster would lose the
configuration changes.
[Figure: Two-node cluster quorum — two nodes joined by the cluster interconnect and attached to a storage array containing the quorum disk.]
A two-node cluster requires a single quorum device, which is typically a quorum disk. The total
votes are three. With the quorum disk, a single node can start clustered operation with a majority
of votes (two votes, in the example shown in the slide).
[Figure: Pair+N quorum devices — three quorum disks, each with one vote, QD(1).]
A typical quorum disk configuration in a Pair+2 configuration is shown in the figure. Three quorum
disks are used.
The following is true for the Pair+N configuration:
• There are three quorum disks.
• There are seven possible votes.
• A quorum is four votes.
• Nodes 3 and 4 do not have access to any quorum devices.
• Nodes 1 or 2 can start clustered operation by themselves.
• Up to three nodes can fail (Nodes 1, 3, and 4 or Nodes 2, 3, and 4), and clustered operation
can continue.
[Figure: N+1 quorum devices — two quorum disks, each with one vote, QD(1).]
The N+1 configuration requires a different approach. Node 3 is the failover backup for both Node 1
and Node 2.
The following is true for the N+1 configuration:
• There are five possible votes.
• A quorum is three votes.
• If Nodes 1 and 2 fail, Node 3 can continue.
Quiz
Answer: a, b, c
[Figure: Scalable topology quorum — a single quorum device with two votes, QD(2).]
Quorum devices in the scalable storage topology differ significantly from those in any other
topology. The following is true for the quorum devices in the scalable storage topology:
• The single quorum device has a vote count equal to the votes of the nodes directly attached
to it minus one.
Note: This rule is universal. In all the previous examples, there were two nodes (with one
vote each) directly connected to the quorum device, so that the quorum device had one vote.
• The mathematics and consequences still apply.
• A reservation is performed by using a SCSI-3 persistent group reservation (which is
discussed in more detail later in this lesson).
• If, for example, Nodes 1 and 3 can intercommunicate but Node 2 is isolated, Node 1 or Node
3 can reserve the quorum device on behalf of both of them.
Note: It would seem that in the same race, Node 2 could win and eliminate both Nodes 1 and 3. The topic titled "Intentional Reservation Delays for Partitions with Fewer Than Half of the Nodes," later in this lesson, shows why this is unlikely.
[Figure: Quorum server — a machine outside the cluster, reachable over the network, acting as the quorum device.]
Oracle Solaris Cluster introduced a new kind of quorum device called a quorum server quorum
device. The quorum server software is installed on some machine external to the cluster. A
quorum server daemon (scqsd) runs on this external machine. The daemon essentially takes the
place of a directly connected quorum disk.
Characteristics of the quorum server quorum device:
• The same quorum server daemon can be used as a quorum device for an unlimited number
of clusters.
• The quorum server software must be installed separately on the server (external side).
• No additional software is necessary on the cluster side.
• A quorum server is especially useful when there is a great physical distance between the cluster nodes. It would be an ideal solution for a cluster that uses the data replication topology.
• A quorum server can be used on any cluster where you prefer the logic of having a single
cluster quorum device whose vote count is automatically assigned to be one fewer than
the total node votes.
For example, with a clustered pairs topology, you might prefer the simplicity of a quorum
server quorum device. In that example, any single node could boot into the cluster by itself, if
it could access the quorum server. Of course, you might not be able to run clustered
applications unless the storage for a particular application is also available, but those
relationships can be controlled properly by the application resource dependencies.
Agenda
Quorum devices in the Oracle Solaris Cluster software environment are used not only as a means
of failure fencing but also as a means to prevent cluster amnesia.
Earlier, you reviewed the following scenario:
1. In a two-node cluster (Node 1 and Node 2), Node 2 is halted for maintenance.
2. Meanwhile Node 1, which is running fine in the cluster, makes all sorts of cluster
configuration changes (new device groups, resource groups).
3. Now Node 1 is shut down.
4. You try to boot Node 2 to form a new cluster.
In this simple scenario, the problem is that if you were allowed to boot Node 2 at the end, it would
not have the correct copy of the CCR. Node 2 would have to use the copy that it has (because
there is no other copy available) and you would lose the changes to the cluster configuration made
in Step 2.
The Oracle Solaris Cluster software quorum involves persistent reservations that prevent Node 2
from booting into the cluster. It is not able to count the quorum device as a vote. Therefore, Node
2 waits until the other node boots to achieve the correct number of quorum votes.
A persistent reservation means that reservation information on a quorum device will survive:
• Even if all nodes connected to the device are reset
• Even after the quorum device itself is powered on and off
Clearly, this involves writing some type of information on the disk itself. The information is called a
reservation key and is as follows:
• Each node is assigned a unique 64-bit reservation key value.
• Every node that is physically connected to a quorum device has its reservation key
physically written onto the device. This set of keys is collectively known as the registered
keys on the device.
• Exactly one node’s key is recorded on the device as the reservation holder, but this node
has no special privileges greater than any other registrant. You can think of the reservation
holder as the last node to ever manipulate the keys, but the reservation holder can later be
fenced out by another registrant.
If Node 1 needs to fence out Node 2 for any reason, it will preempt Node 2’s registered key off of
the device. If a node’s key is preempted from the device, it is fenced from the device. If there is a
split brain, each node is racing to preempt the other’s key.
Now the rest of the equation is clear. The reservation is persistent, so if a node is booting into the
cluster, a node cannot count the quorum device vote unless its reservation key is already
registered on the quorum device. Therefore, in the scenario illustrated in the previous paragraph, if
Node 1 subsequently goes down so there are no remaining cluster nodes, only Node 1’s key
remains registered on the device. If Node 2 tries to boot first into the cluster, it will not be able to
count the quorum vote, and must wait for Node 1 to boot.
After Node 1 joins the cluster, it can detect Node 2 across the transport and add Node 2’s
reservation key back to the quorum device so that everything is equal again. A reservation key
only gets added back to a quorum device by another node in the cluster whose key is already
there.
Oracle Solaris Cluster supports both SCSI-2 and SCSI-3 disk reservations. The default policy is
prefer3.
• For disks to which there are exactly two paths, use SCSI-2.
• For disks to which there are more than two paths (for example, any disk with physical
connections from more than two nodes), you must use SCSI-3.
The following slides outline the differences between SCSI-2 and SCSI-3 reservations.
SCSI-2 reservations themselves provide a simple reservation mechanism (the first one to reserve
the device fences out the other one), but it is not persistent and does not involve registered keys.
In other words, SCSI-2 is sufficient to support the fencing goals in Oracle Solaris Cluster, but does
not include the persistence required to implement amnesia prevention.
To implement amnesia prevention by using SCSI-2 quorum devices, Solaris Cluster must make
use of Persistent Group Reservation Emulation (PGRE) to implement the reservation keys. PGRE
has the following characteristics:
• The persistent reservations are not supported directly by the SCSI-2 command set. Instead,
they are emulated by the Solaris Cluster software.
• Reservation keys are written (by the Solaris Cluster software, not directly by the SCSI
reservation mechanism) on private cylinders of the disk (cylinders that are not visible in the
format command, but are still directly writable by the Solaris OS).
The reservation keys have no impact on using the disk as a regular data disk, where you will
not see the private cylinders.
• The race (for example, in a split-brain scenario) is still decided by a normal SCSI-2 disk reservation.
SCSI-3 reservations have the persistent group reservation (PGR) mechanism built in. They have
the following characteristics:
• The persistent reservations are implemented directly by the SCSI-3 command set. Disk
firmware itself must be fully SCSI-3 compliant.
• Removing another node’s reservation key is not a separate step from physical reservation of
the disk, as it is in SCSI-2. With SCSI-3, the removal of the other node’s key is both the
fencing and the amnesia prevention.
• SCSI-3 reservations are generally simpler in the cluster because everything that the cluster
needs to do (both fencing and persistent reservations to prevent amnesia) is done directly
and simultaneously with the SCSI-3 reservation mechanism.
• With more than two disk paths (that is, any time more than two nodes are connected to a
device), Oracle Solaris Cluster must use SCSI-3.
The figure in the slide shows that four nodes are all physically connected to a quorum drive.
Remember that the single quorum drive has three quorum votes.
Now, imagine that, because of multiple transport failures, there is a partitioning where Nodes 1
and 3 can see each other over the transport and Nodes 2 and 4 can see each other over the
transport. In each pair, the node with the lower reservation key tries to eliminate the registered
reservation key of the other pair. The SCSI-3 protocol assures that only one pair will remain
registered (the operation is atomic).
In the diagram, Node 1 has successfully won the race to eliminate the keys for Nodes 2 and 4.
Because Nodes 2 and 4 have their reservation key eliminated, they cannot count the three votes
of the quorum device. Because they fall below the needed quorum, they will cause kernel panic.
Cluster amnesia is avoided in the same way as in a two-node quorum device. If you now shut
down the whole cluster, Node 2 and Node 4 cannot count the quorum device because their
reservation key is eliminated. They must wait for either Node 1 or Node 3 to join. One of those
nodes can then add back reservation keys for Node 2 and Node 4.
Both NAS quorum devices and quorum servers provide reservation key–based persistent emulations.
Fencing and amnesia prevention are provided in a way analogous to how they are provided with
a SCSI-3 quorum device. In both implementations, the keys are maintained in a persistent fashion
on the server side; that is, the state of the registration keys recorded with the quorum device
survives rebooting of both the cluster nodes and the quorum server device.
The diagram in the slide shows the scenario just presented, but three nodes can talk to each other
while the fourth is isolated on the cluster transport.
Is there anything to prevent the lone node from eliminating the cluster keys of the other three and
making them all kernel panic?
In this configuration, the lone node intentionally delays before racing for the quorum device. The
only way it can win is if the other three nodes are really dead, or if each is isolated and delaying
the same amount. The delay is implemented when the number of nodes that a node can see on
the transport (including itself) is fewer than half the total nodes.
Agenda
Data Fencing
The surviving node fences all shared disks. This eliminates any
timing-related danger in taking over data.
• With the prefer3 policy, data fencing uses SCSI-2 (two-path disks) or SCSI-3.
As an extra precaution, nodes that are eliminated from the cluster because of quorum problems
also lose access to all shared data devices. The reason for this is to eliminate a potential timing
problem. The node or nodes that remain in the cluster have no idea whether the nodes being
eliminated from the cluster are actually still running. If they are running, they will have a kernel
panic (after they recognize that they have fallen beneath the required quorum votes). However,
the surviving node or nodes cannot wait for the other nodes to kernel panic before taking over the
data. The reason that nodes are being eliminated is that there has been a communication failure
with them.
To eliminate this potential timing problem, which could otherwise lead to data corruption, before a
surviving node or nodes reconfigure the data and applications, the prefer3 policy fences the
eliminated node or nodes from all shared data devices, in the following manner:
• With the prefer3 policy, SCSI-2 reservation is used for two-path devices and SCSI-3 for
devices with more than two paths.
• You can change the policy to use SCSI-3 even if there are only two paths.
• If you do use the default SCSI-2 for a two-path device, data fencing is just the reservation
and does not include any PGRE.
• Data fencing is released when a fenced node is able to boot successfully into the cluster
again.
• You can turn off data fencing either per disk or globally. This is:
– Intended for support of SATA disks
– Not recommended for disks that support fencing; keep fencing on for those disks
You can disable fencing, either on a disk-by-disk basis or globally in the cluster.
You look at how this can be done in later lessons. It is highly recommended that you keep the
fencing on normal SCSI-capable shared disks.
With the new option to disable fencing, Oracle Solaris Cluster can support SATA disks that are
incapable of either SCSI-2 or SCSI-3 fencing in any cluster, or disks incapable of SCSI-3 fencing
in a cluster where more than two nodes are connected to the storage. Oracle Solaris Cluster can
also support access to a storage device from servers outside of the cluster, if fencing is disabled
on the device.
Quorum Device on a Disk with No Fencing
Oracle Solaris Cluster can support a quorum device on a disk on which it is doing neither SCSI-2
nor SCSI-3 fencing. Solaris Cluster will implement a “software” reservation process, whereby
races for the quorum devices can be decided atomically and reliably without use of any SCSI-2 or
SCSI-3 protocols. The persistent reservation for a disk with no fencing works exactly the same
way as a disk on which you are doing SCSI-2 fencing (the persistent reservation is emulated by
using PGRE).
Quiz
Answer: a
Agenda
In cluster configurations with more than two nodes, you must join the interconnect interfaces by
using switches. You can also use switches to join two-node cluster interconnects to prepare for
the expansion of the number of nodes at a later time. A typical switch-based interconnect is shown
in the figure in the slide.
During the Oracle Solaris Cluster software installation, you are asked whether the interconnect
system uses switches. If you answer yes, you must provide names for each of the switches.
Note: If you specify more than two nodes during the initial portion of the Solaris Cluster software
installation, the use of switches is assumed.
During the Oracle Solaris Cluster software installation, the cluster interconnects are assigned IP
addresses based on a base address of 172.16.0.0. If necessary, you can override the default
address, but this is not recommended. Uniform addresses can be a benefit during problem
isolation.
The netmask property associated with the entire cluster transport describes, together with the
base address, the entire range of addresses associated with the transport. For example, if you
used the default base address of 172.16.0.0 and the default netmask of 255.255.240.0, you
would be dedicating a 12-bit range (255.255.240.0 has 12 zeros at the end) to the cluster
transport. This range is from 172.16.0.0 to 172.16.15.255.
Note: When you set 255.255.240.0 as the cluster transport netmask, you will not see this
netmask actually applied to any of the private network adapters. Once again, the cluster uses this
netmask to define the entire range that it has access to, and then subdivides the range even
further to cover the multiple separate networks that make up the cluster transport.
While you can choose the cluster transport netmask by hand, the cluster prefers instead that you
specify:
• The maximum anticipated number of private networks
• The maximum anticipated number of nodes
• The maximum anticipated number of virtual clusters (Note: Virtual cluster is another term
for a Solaris Containers cluster.)
Note: In Oracle Solaris Cluster, if you want to restrict private network addresses to a class C–
like space, such as 192.168.5.0, you can do it easily, even with relatively large numbers of
nodes, subnets, and Solaris Containers clusters.
3. Choose a network interface (at this point, it might be an actual private network, or one ready
to be set up as a secondary public network, or one not connected to anything at all).
# ping public_net_broadcast_address
5. Now that you know your adapter is not on the public net, check to see whether it is
connected on a private net. Make up some unused subnet address just to test out
interconnectivity across a private network. Do not use addresses in the existing public
subnet space.
# ipadm create-addr -T static -a 192.168.1.1/24 net2/v4static
6. Perform Steps 4 and 5 to try to guess the matching network interface on the other node.
Choose a corresponding IP address (for example, 192.168.1.2).
7. Test that the nodes can ping across each private network, as in the following example:
# ping 192.168.1.2
192.168.1.2 is alive
8. After you have identified the new network interfaces, delete them. Cluster installation fails if
your transport network interfaces are still up from testing.
# ipadm delete-ip net1
9. Repeat Steps 3 through 8 with network interfaces for the second cluster transport. Repeat
again if you are configuring more than two cluster transports.
Agenda
You will not be asked about public network configuration when you are installing the cluster.
The public network interfaces must be managed by IPMP, which can be administered either
before or after cluster installation.
Because you are identifying your private transport interfaces before cluster installation, it can be
useful to identify your public network interfaces at the same time, so as to avoid confusion.
Your primary public network adapter should be the only one currently configured on the public
network. You can verify this with the following command:
# dladm show-link
# ipadm show-if
You can verify your secondary public network adapter, if applicable, by making sure that:
• It is not one of those that you identified to be used as the private transport
• It can snoop public network broadcast traffic
# ipadm create-ip net2
# snoop -d net2
(other window or node)
# ping -s pubnet_broadcast_addr
Agenda
Recall that certain adapters are capable of participating in 802.1q-tagged VLANs, and can be
used as both private and public network adapters assuming that the switches are also capable of
tagged VLANs. This enables blade architecture servers that have only two physical network
adapters to be clustered and to still have redundant public and private networks.
An adapter that is participating in a tagged VLAN configuration is assigned an instance number of
1000 * (VLAN_identifier) + physical_instance_number.
For example, if you have a physical adapter net1, and it is participating in a tagged VLAN with ID
3 as its public network personality, and a tagged VLAN with ID 5 as its private network
personality, then it appears as if it were two separate adapters, net3001 and net5001.
Summary
Practice 4 Overview:
Preparing for Installation
This practice covers the following topics:
• Task 1: Verifying the Oracle Solaris 11 environment
• Task 2: Identifying a cluster topology
• Task 3: Selecting quorum devices
Objectives
Agenda
(Slide table rows: Agents, Localization, Framework man pages, Data service man pages, Agent builder, Generic data service)
The table shown in the slide lists the primary group packages for the Oracle Solaris Cluster 4.0
software and the principal features that each group package contains. You must install at least the
ha-cluster-framework-minimal group package.
Agenda
Agenda
Perform the following steps to complete the Oracle Solaris Cluster installation:
1. If you are using a cluster administrative console, display a console screen for each node in
the cluster.
- If pconsole software is installed and configured on your administrative console, use
the pconsole utility to display the individual console screens.
- As superuser, use the following command to start the pconsole utility:
adminconsole# pconsole host[:port] […] &
The pconsole utility also opens a master window from which you can send your input
to all individual console windows at the same time.
- If you do not use the pconsole utility, connect to the consoles of each node
individually.
Note: The output of the last command should show that the
local_only property is now set to false.
5. Set the repository for the Oracle Solaris Cluster software packages.
If you are using an ISO image of the software, perform the following steps:
a. Download the Oracle Solaris Cluster 4.0 ISO image from Oracle Software Delivery
Cloud at http://edelivery.oracle.com/.
Oracle Solaris cluster software is part of the Oracle Solaris Product Pack. Follow online
instructions to complete selection of the media pack and download the software. A
valid Oracle license is required to access Oracle Software Delivery Cloud.
b. Make the Oracle Solaris Cluster 4.0 ISO image available.
# lofiadm -a path-to-iso-image
/dev/lofi/N
# mount -F hsfs /dev/lofi/N /mnt
where path-to-iso-image specifies the full path and file name of the ISO image.
c. Set the location of the Oracle Solaris Cluster 4.0 package repository.
# pkg set-publisher -g file:///mnt/repo ha-cluster
5. Alternatively, set up the repository for the Oracle Solaris Cluster software packages from the network.
If the cluster nodes have direct access to the Internet, perform the following steps:
a. Go to http://pkg-register.oracle.com.
b. Choose Oracle Solaris Cluster software.
c. Accept the license.
d. Request a new certificate by choosing Oracle Solaris Cluster software and
submitting a request.
e. The certification page is displayed with download buttons for the key and the certificate.
f. Download the key and certificate files and install them as described in the returned
certification page.
g. Configure the ha-cluster publisher with the downloaded SSL keys and set the
location of the Oracle Solaris Cluster 4.0 repository.
6. Ensure that the solaris and ha-cluster publishers are valid as shown in the slide.
Note: For information about setting the solaris publisher, see
http://www.oracle.com/technetwork/indexes/documentation/index.html#CCOSPrepo
_sharenfs2.
7. Install the Oracle Solaris Cluster 4.0 software.
# /usr/bin/pkg install package
8. Verify that the package is installed successfully.
# pkg info -r package
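For reference, a minimal sketch of Steps 6 through 8 might look like the following; the group package name ha-cluster-framework-full is only an example, and your choice depends on the features that you need:
# pkg publisher
# pkg install ha-cluster-framework-full
# pkg info -r ha-cluster-framework-full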
Agenda
Perform the procedure on each node in the global cluster as in the slide.
Note: Always make /usr/cluster/bin the first entry in the PATH. This placement ensures that
Oracle Solaris Cluster commands take precedence over any other binaries that have the same
name, thus avoiding unexpected behavior.
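For example, in a Bourne-compatible root shell you might type the following (adding the same lines to the root login initialization file makes the change permanent; the MANPATH entry is an optional convenience):
# PATH=/usr/cluster/bin:$PATH; export PATH
# MANPATH=/usr/cluster/man:$MANPATH; export MANPATH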
Agenda
The Solaris Cluster configuration is performed by one of the two following methods:
• Using the scinstall utility interactively: This is the most common method of configuring
Solaris Cluster, and the only one that is described in detail in this lesson.
• Automated Installer: Set up an Automated Installer (AI) install server. Then use the
scinstall AI option to install the software on each node and establish the cluster.
As you configure Oracle Solaris Cluster software on cluster nodes and reboot the nodes into the
cluster, a special flag called the installmode flag is set in the cluster CCR. When this flag is set,
the following happens:
• The first node installed (node ID 1) has a quorum vote of one.
• All other nodes have a quorum vote of zero.
This enables you to complete the rebooting of the second node into the cluster while maintaining
the quorum mathematics rules. If the second node had a vote (making a total of two in the cluster),
the first node would kernel panic when the second node was rebooted after the cluster software
was installed because the first node would lose operational quorum.
One important side effect of the installmode flag is that you must be careful not to reboot the
first node (node ID 1) until you can choose quorum devices and eliminate (reset) the
installmode flag. If you accidentally reboot the first node, all the other nodes will kernel panic
because they have zero votes out of a possible total of one.
If the installation is a single-node cluster, the installmode flag is not set. Post-installation steps
to choose a quorum device and reset the installmode flag are unnecessary.
On a two-node cluster only, you have the option of having the scinstall utility insert a script
that automatically chooses your quorum device as the second node boots into the cluster. The
default is to accept this option.
The quorum device chosen will be the first dual-ported disk or LUN (the one with the lowest DID
number).
If you choose to allow automatic quorum configuration, the installmode flag is automatically
reset after the quorum device is automatically configured.
You can disable the two-node cluster automatic quorum configuration if you want to:
• Choose the quorum device yourself
• Use a NAS device as a quorum device
• Use the quorum server as a quorum device
In clusters with more than two nodes, scinstall inserts a script to automatically reset the
installmode flag. It will not automatically configure a quorum device. If you want a quorum
device, you still have to do that manually after the installation. By resetting installmode, each
node is assigned its proper single quorum vote.
Cluster transport IP network number and netmask: As described in the lesson titled “Exploring
Node Console Connectivity and the Cluster Console Software,” the default cluster transport IP
address range begins with 172.16.0.0 and the netmask is 255.255.240.0. You should keep
the default if it causes no conflict with anything else visible on any other network. It is perfectly fine
for multiple clusters to use the same addresses on their cluster transports, because these
addresses are not visible anywhere outside the cluster.
Note: The netmask refers to the range of IP addresses that are reserved for all possible cluster
transport addresses. This will not match the actual netmask that you will see configured on the
transport adapters if you check by using ifconfig -a.
If you must specify a different IP address range for the transport, you can do so. Rather than being
asked initially for a specific netmask, you will be asked for the anticipated maximum number of
nodes, private networks, and virtual clusters, and a suitable netmask is calculated for you.
Quiz
Answer: c
• The node that you are driving from becomes the last to join.
• Drive from the node that you want to have the highest node ID.
• List the other nodes in reverse order.
If you choose the option to configure the entire cluster, you run scinstall on only one node.
You should be aware of the following behavior:
• The node that you are driving from will be the last node to join the cluster, because it needs
to configure and reboot all the other nodes first.
• If you care about which node IDs are assigned to the nodes, you should drive from the node
that you want to have the highest node ID, and list the other nodes in reverse order.
• The Oracle Solaris Cluster software packages must already be installed on all nodes.
You do not need remote shell access (rsh or ssh) between the nodes.
The remote configuration is performed by using an RPC service installed by the Solaris Cluster
packages. If you are concerned about authentication, you can use DES authentication.
If you choose this method, you run scinstall separately on each node.
You must complete scinstall and reboot into the cluster on the first node. This becomes the
sponsoring node for the remaining nodes.
If you have more than two nodes, you can run scinstall simultaneously on all but the first node,
but it might be hard to predict which node gets assigned which node ID. If you care, you should
just run scinstall on the remaining nodes one at a time, and wait for each node to boot into the
cluster before starting the next one.
Both the all-at-once and one-at-a-time methods have typical and custom configuration options (to
make a total of four variations).
The typical configuration mode assumes the following responses:
• It uses network address 172.16.0.0 with netmask 255.255.240.0 for the cluster
interconnect.
• It assumes that you want to perform autodiscovery of cluster transport adapters on the other
nodes with the all-at-once method. (On the one-node-at-a-time method, it asks whether you
want to use autodiscovery in both typical and custom modes.)
• It uses the names switch1 and switch2 for the two transport switches, and assumes the use
of switches even for a two-node cluster.
• It assumes that you want to use standard system authentication (not DES authentication) for
new nodes configuring themselves into the cluster.
Agenda
Option: 1
The example shows the full dialog for the Oracle Solaris Cluster installation method that requires the
least information: the all-at-once, Typical mode installation. The example is from a two-node cluster,
where the default is to let scinstall set up a script that automates configuration of the quorum
device. In the example, scinstall is running on the node named clnode1.
From the Main Menu, select option 1, Create a new cluster or add a cluster node.
Option: 1
From the New Cluster and Cluster Node Menu, select option 1, Create a new
cluster.
1) Typical
2) Custom
?) Help
q) Return to the Main Menu
Option [1]: 1
Each cluster has a name assigned to it. The name can be made up of
any characters other than whitespace. Each cluster name should be
unique within the namespace of your enterprise.
List the names of the other nodes planned for the initial cluster
configuration. List one node name per line. When finished, type Control-D:
clnode1
clnode2
You must identify the cluster transport adapters which attach this
node to the private cluster interconnect.
Option: 1
Select the cluster transport adapters that will be used as the cluster private interconnect.
In this example, net1 and net3 are selected as the cluster private interconnect.
1) net1
2) net2
3) net3
4) Other
Specify no when asked whether you want to disable the automatic quorum device selection.
Specify yes when asked whether it is okay to create the new cluster.
Specify no when asked to interrupt cluster creation for cluster check errors.
The cluster configuration begins. Somewhere toward the end of the cluster configuration, the
cluster node reboots.
Rebooting ...
Option: 1
The following is an example of using the one-at-a-time node configuration. The dialog is shown for
clnode1, the first node in the cluster. You cannot install other cluster nodes until this node is
rebooted into the cluster and can then be the sponsor node for the other nodes.
Option: 2
Each cluster has a name assigned to it. The name can be made up of any
characters other than whitespace.
This step enables you to run cluster check to verify that certain basic
hardware and software pre-configuration requirements have been met.
After the first node establishes itself as a single node cluster, other
nodes attempting to add themselves to the cluster configuration must be
found on the list of nodes you just provided. You can modify this list
What is the name of the first switch in the cluster [switch1]? <CR>
What is the name of the second switch in the cluster [switch2]? <CR>
You must configure the cluster transport adapters for each node in the
cluster. These are the adapters which attach to the private cluster
interconnect.
Select the first cluster transport adapter:
Each node in the cluster must have a local file system mounted on
/global/.devices/node@<nodeID> before it can successfully participate as
a cluster member. Because the “nodeID” is not assigned until
scinstall is run, scinstall will set this up for you.
...
If you choose to turn off global fencing now, after your cluster
starts you can still use the cluster(1CL) command to turn on global
fencing.
...
The explanation for these options can be found in the Oracle Solaris
Cluster Installation Guide. The first option -i is config. without
Are these the options you want to use (yes/no) [yes]? yes
...
Initializing cluster name to "cluster1" ... done
Rebooting ...
Quiz
Answer: b, c
In the one-at-a-time method, after the first node has rebooted into the cluster, you can configure
the remaining node or nodes. Here, there is almost no difference between the Typical and Custom
modes, except that the Typical mode does not ask about the global devices file system. (The
installer assumes that the placeholder is /globaldevices.) Here, you have no choice about the
automatic quorum selection or the authentication mechanism, because it was already chosen on
the first node.
Option: 3
Before you select this option, the Oracle Solaris Cluster framework
software must already be installed.
This tool supports two modes of operation, Typical and Custom Modes.
For most clusters, you can use Typical mode. However, you might need to
Already established clusters can keep a list of hosts which are able to
configure themselves as new cluster members. This machine should be in
the join list of any cluster which it tries to join. If the list does
not include this machine, you may need to add it by using claccess(1CL)
or other tools.
Each cluster has a name assigned to it. When adding a node to the
cluster, you must identify the name of the cluster you are attempting
to join. A sanity check is performed to verify that the “sponsoring“
This step enables you to run cluster check to verify that certain basic
Do you want to use a lofi device instead and continue the installation
(yes/no) [yes]? Yes
scinstall -i \
Are these the options you want to use (yes/no) [yes]? yes
Rebooting ...
Agenda
scinstall automatically configures the following files and settings on each cluster node:
• /etc/inet/hosts file
• /etc/nsswitch.conf file
• /etc/inet/ntp.conf file
• local-mac-address? setting in electrically erasable programmable read-only memory (EEPROM) (SPARC only)
The scinstall utility automatically adds all the cluster names and IP addresses to each node’s
hosts file if it was not there already. (All the names already had to be resolvable, through some
name service, for scinstall to work at all.)
Changes to the /etc/nsswitch.conf File
• It makes sure the files keyword precedes every other name service for every entry in the file.
• It adds the cluster keyword for the hosts and netmasks keywords. This keyword
modifies the standard Oracle Solaris OS resolution libraries so that they can resolve the
cluster transport host names and netmasks directly from the CCR. The default transport
host names (associated with IP addresses on the clprivnet0 adapter) are
clusternode1-priv, clusternode2-priv, and so on. These names can be used by
any utility or application as normal resolvable names without having to be entered in any
other name service.
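As an illustration only (the sources that follow the cluster keyword depend on what your site already uses, and on Oracle Solaris 11 the switch is ultimately managed through the svc:/system/name-service/switch service), the resulting hosts and netmasks entries typically look similar to the following:
hosts:      cluster files dns
netmasks:   cluster files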
Agenda
On a two-node cluster on which you chose to allow automatic quorum configuration, the quorum
device is chosen (the lowest possible DID device number) as the second node boots into the
cluster for the first time.
If your cluster has more than two nodes, no quorum device is selected automatically, but the
installmode flag is automatically reset as the last node boots into the cluster.
In the Oracle Solaris 11 OS, as the last node boots into the cluster, you get the login prompt on
the last node booting into the cluster before the quorum auto-configuration runs. This is because
the boot environment is controlled by the SMF of the Oracle Solaris 11 OS, which runs boot
services in parallel and gives you the login prompt before many of the services are complete. The
auto-configuration of the quorum device does not complete until a minute or so later. Do not
attempt to configure the quorum device by hand, because the auto-configuration eventually runs to
completion.
Agenda
# cldevice list -v
DID Device Full Device Path
---------- ----------------
d1 clnode1:/dev/rdsk/c0t0d0
d2 clnode1:/dev/rdsk/c0t1d0
You must choose a quorum device or quorum devices manually in the following circumstances:
• Two-node cluster where you disabled automatic quorum selection
• Any cluster of more than two nodes where a quorum device is desired
Verifying DID Devices
If you are going to be manually choosing quorum devices that are physically attached disks or
LUNs, you must know the DID device number for the quorum device or devices that you want to
choose.
The cldevice (cldev) command shows the DID numbers assigned to the disks in the cluster.
The most succinct option that shows the mapping between DID numbers and all the
corresponding disk paths is cldev list -v.
You must know the DID device number for the quorum device that you choose in the next step.
You can choose any multiported disk.
Note: The local disks (single-ported) appear at the beginning and end of the output and cannot be
chosen as quorum devices.
# cldevice list -v
DID Device Full Device Path
---------- ----------------
d1 clnode1:/dev/rdsk/c0t0d0
d2 clnode1:/dev/rdsk/c0t1d0
d3 clnode1:/dev/rdsk/c0t6d0
d4 clnode1:/dev/rdsk/c1t0d0
d4 clnode2:/dev/rdsk/c1t0d0
d5 clnode1:/dev/rdsk/c1t1d0
d5 clnode2:/dev/rdsk/c1t1d0
d6 clnode1:/dev/rdsk/c1t2d0
d6 clnode2:/dev/rdsk/c1t2d0
In a cluster of more than two nodes, the installmode flag is always automatically reset, but the
quorum device or devices are never automatically selected.
You should use clsetup to choose quorum devices, but the initial screens look a little different
because the installmode flag is already reset.
?) Help
q) Return to the Main Menu
Option: 1
From here, the dialog looks similar to the previous example, except that the installmode is
already reset. Therefore, after adding your quorum devices, you just return to the main menu.
Please do not proceed if any additional nodes have yet to join the
cluster.
Before a new cluster can operate normally, you must reset the installmode attribute on all
nodes. On a two-node cluster where automatic quorum selection was disabled, the installmode
will still be set on the cluster. You must choose a quorum device as a prerequisite to resetting
installmode.
Choosing quorum by using the clsetup utility: The clsetup utility is a menu-driven
interface, which, when the installmode flag is reset, turns into a general menu-driven
alternative to low-level cluster commands.
The clsetup utility recognizes whether the installmode flag is still set, and will not present
any of its normal menus until you reset it. For a two-node cluster, this involves choosing a single
quorum device first.
Option: 1
You can use a disk containing user data or one that is a member of a
device group as a quorum device.
After the “installmode” property has been reset, this program will skip
“Initial Cluster Setup” each time it is run again in the future.
However, quorum devices can always be added to the cluster by using the
regular menu options. Resetting this property fully activates quorum
settings and is necessary for the normal and safe operation of the
cluster.
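If you prefer the command line to the clsetup utility, a minimal equivalent sketch for a two-node cluster might look like the following; d4 is a hypothetical DID device, and you should confirm the exact subcommand behavior in the clquorum(1CL) man page:
# clquorum add d4
# clquorum reset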
Agenda
• Cluster status
• Cluster configuration information
• status subcommands:
– Nodes
When you have completed the Oracle Solaris Cluster software installation on all nodes, verify the
following information:
• General cluster status
• Cluster configuration information
Verifying General Cluster Status
The status subcommand of the cluster utilities shows the current status of various cluster
components, such as:
• Nodes
• Devices
• Quorum votes (including node and device quorum votes)
• Device groups
• Resource groups and related resources
• Cluster interconnect status
Note: The cluster command-line interface (CLI) commands are described in detail starting in the
next lesson and continuing on a per-topic basis as you configure storage and applications into the
cluster in the following lessons.
The following two commands give identical output, and show the cluster membership and quorum
vote information:
# cluster status -t quorum
# clquorum status
The following two commands are identical, and show the status of the private networks that make
up the cluster interconnect (cluster transport):
# cluster status -t interconnect
# clinterconnect status
Cluster configuration is displayed in general by using the list, list -v, show, and show -v
subcommands of the various cluster utilities.
The following command shows the configuration of everything. If you added a -t global at the end
of the command, it would list only the cluster global properties that appear in the first section of
output.
# cluster show
Summary
Objectives
Agenda
When a cluster node is fully booted into a cluster, several cluster daemons are added to the
traditional Oracle Solaris Operating System (OS).
None of these daemons require any manual maintenance, regardless of which version of
Oracle Solaris OS you are running. Behind the scenes, the Oracle Solaris 11 OS uses
Service Management Facility (SMF) to launch daemons. Therefore, at boot time, you might
see a console login prompt before many of these daemons are launched. SMF itself can
restart some daemons.
Agenda
• clnode
• clquorum (clq)
• clinterconnect (clintr)
• cldevice (cldev)
• cldevicegroup (cldg)
• clresourcegroup (clrg)
• clresource (clrs)
• clreslogicalhostname (clrslh)
The following commands relate to administration of device groups and cluster application
resources. In subsequent lessons, you learn more about these commands.
• cldevicegroup (cldg): Device group configuration, status, settings, and adding and
deleting device groups (including VxVM and Solaris Volume Manager device groups)
• clresourcegroup (clrg): Application resource group configuration, status, settings,
and adding and deleting application resource groups
• clresource (clrs): Resource configuration, status, settings, and adding and deleting
individual resources in application resource groups
• clreslogicalhostname (clrslh) and clressharedaddress (clrssa): IP
resource configuration, status, settings, and adding and deleting IP resources in
application resource groups (These commands simplify tasks that can also be
accomplished with clresource.)
• clresourcetype (clrt): Resources for Oracle Solaris Cluster data services
# clquorum
clquorum: (C961689) Not enough arguments.
clquorum: (C101856) Usage error.
SUBCOMMANDS:
While all the cluster commands have excellent man pages, they are also self-documenting
because, if you run a command without any subcommand, the usage message always lists
the possible subcommands.
-i {- | <clconfiguration>}
Specify XML configuration as input.
-p <name>=<value>
Specify the properties.
-t <type>
Specify the device type.
-v Verbose output.
If you run a subcommand that requires arguments, the usage message gives you more
information about the particular subcommand. It does not give you all the information about
the names of properties that you might need to set (for that, you have to go to the man pages).
Quiz
Answer: b
Quiz
Answer: c
Quiz
Answer: e
Agenda
For selected Oracle Solaris Cluster commands and options that you issue at the command
line, use Role-Based Access Control (RBAC) for authorization. Oracle Solaris Cluster has a
simplified RBAC structure that can enable you to assign cluster administrative privileges to
non-root users or roles. Oracle Solaris Cluster commands and options that require RBAC
authorization will require one or more of the following authorization levels:
• solaris.cluster.read
• solaris.cluster.admin
• solaris.cluster.modify
Oracle Solaris Cluster RBAC rights profiles apply to both voting nodes in a global cluster.
solaris.cluster.read: Gives the ability to do any status, list, or show subcommands. By
default, every user has this authorization because it is in the Basic Oracle Solaris User profile.
This can be assigned directly to a user.
solaris.cluster.modify: Gives the ability to run add, create, delete, remove, and
related subcommands. Can be assigned directly to a role (allowed users must assume the
role, and then they get the authorization).
solaris.cluster.admin: Gives the ability to run switch, online, offline, enable, and
disable subcommands. Can be assigned to a rights profile that is then given to a user or a
role.
To create and assign an RBAC role with an Oracle Solaris Cluster Management Rights
Profile, perform the following steps:
1. Become a superuser or assume a role that provides solaris.cluster.admin RBAC
authorization.
2. Select one of the following methods for creating a role:
• For roles in the local scope, do either of the following:
– Use the roleadd command to specify a new local role and its attributes.
– Edit the user_attr file to add a user with type=role. Note that you should
use this method only for emergencies.
• For roles in a name service, use the roleadd and rolemod commands to specify
the new role and its attributes. The roleadd and rolemod commands require
authentication by superuser or a role that is capable of creating other roles. Note that
you can apply the roleadd command to all name services.
3. Start and stop the name service cache daemon. New roles do not take effect until the
name service cache daemon is restarted. As root, type the following text:
# /etc/init.d/nscd stop
# /etc/init.d/nscd start
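A minimal sketch of this procedure on Oracle Solaris 11 follows. The role name clusadm and user name ouser are hypothetical, and because the name-service cache is managed by SMF on Oracle Solaris 11, the svcadm command shown has the same effect as the init.d commands in Step 3:
# roleadd -A solaris.cluster.admin,solaris.cluster.modify clusadm
# passwd clusadm
# usermod -R clusadm ouser
# svcadm restart svc:/system/name-service/cache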
Quiz
Answer: b, c, e
Agenda
In the example shown in the slide, where the name of the cluster is cluster1, the
cluster show -t global command shows only the cluster global properties or global
default SCSI protocol settings.
=== Cluster ===
You can rename the cluster by using the cluster rename command. The cluster name is not
particularly important and is not required as an argument in any other commands.
heartbeat_quantum controls the timing of cluster heartbeats on the private network (in
milliseconds).
heartbeat_timeout describes the number of milliseconds of missing heartbeat required by
a node to declare a single interconnect dead or to declare the other node(s) dead and start
reconfiguration.
You usually do not change these values, although it is possible to make them bigger (if your
nodes are very far apart) or smaller (if for some reason you are unsatisfied with the 10-second
timeout).
The cluster enforces that heartbeat_timeout is at least five times as big as
heartbeat_quantum, as illustrated in the slide.
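For illustration only (the values are arbitrary and, as described above, are expressed in milliseconds), you could raise both properties with the cluster command; raising heartbeat_timeout first ensures that the five-times rule is never violated:
# cluster set -p heartbeat_timeout=25000
# cluster set -p heartbeat_quantum=5000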
Note: Modifying the private_netaddr and private_netmask properties is a special
case in that it is done only when the entire cluster is down and all nodes are booted in non-
cluster mode. This is covered later in the lesson.
Agenda
The clnode command can be used to show status and configuration of nodes. Although it
shows a variety of data for each node in the cluster, there is only a limited amount of
information that you can actually change with the clnode command.
Viewing node status and configuration: The status and show subcommands show, by
default, all nodes. You can also show a single node by giving its name as the last command-
line argument.
Most of the information shown by clnode cannot be modified by the clnode command.
Some can be modified by other commands (clinterconnect for adding and deleting
transport adapters, for example).
The reboot_on_path_failure property is described later in the lesson. You can run the
clnode command instead of the clsetup utility to change the privatehostname.
You can set the privatehostname to whatever you want. This name automatically resolves
to the IP address associated with the node’s clprivnet0 adapter. This is the single private
IP address whose traffic is automatically distributed across all physical private networks.
Note: If it seems that the OS can resolve private hostnames that no longer exist (because you
changed them), it is because of the OS name-caching daemon (nscd). You can use
nscd -i hosts to clear this cache.
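For example, a sketch of renaming the private hostname of clnode1 (the new name is arbitrary) and then clearing the name cache might look like this:
# clnode set -p privatehostname=clnode1-private clnode1
# nscd -i hosts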
# clnode show-rev -v
Oracle Solaris Cluster 4.0.0 0.22.1 for Solaris 11 i386
ha-cluster/data-service/apache :4.0.0-0.22
ha-cluster/data-service/dhcp :4.0.0-0.22
ha-cluster/data-service/dns :4.0.0-0.22
...
You can use clnode show-rev -v to see installed cluster package release information on
a node or on all nodes. This is useful information to have when talking to technical support
personnel.
You can also directly examine the /etc/cluster/release file to get quick information
about the release of the cluster software installed on a particular node.
Agenda
The clquorum (clq) command is used to view the configuration and status of quorum
devices and node vote count information and to add and delete quorum devices.
Viewing quorum status and configuration: The status, list, and show suboptions show
the status and configuration of quorum devices and node-related quorum information. You
can reduce the amount of information by adding a type-restriction option (-t shared_disk,
for example), or by adding the name of a particular quorum device or node as the very last
argument.
# clq show d2
There is no specific command to replace or repair a quorum device. You just add a new one
and remove the old one.
A two-node cluster, which absolutely requires a quorum device, requires that you perform
repairs in that order (add and then remove). If you have more than two nodes, you can
perform the operations in any order.
# clq status
=== Cluster Quorum ===
--- Quorum Votes Summary ---
On the cluster side, you need to specifically add a quorum server device to serve the cluster.
This is likely to be your only quorum device (after you remove other quorum devices, if
necessary) because it always has the number of votes equal to one fewer than the number of
node votes.
On the cluster side, you assign a random ASCII device name to the quorum server device (in
this example, qservydude).
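A sketch of adding such a device from the cluster side is shown below, after which the old shared-disk quorum device can be removed, as the slide commands that follow illustrate. The host address and port are assumptions, and you should confirm the exact type keyword (quorum_server is shown here) and property names in the clquorum(1CL) man page:
# clq add -t quorum_server -p qshost=192.168.10.50 -p port=9000 qservydude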
# clq remove d5
# clq show d2
Quiz
Answer: b
Agenda
# cldev list -v
DID Device Full Device Path
---------- ----------------
d1 clnode1:/dev/rdsk/c0t600144F0B3CAAC8300004EF9E2520002d0
d1 clnode2:/dev/rdsk/c0t600144F0B3CAAC8300004EF9E2520002d0
d2 clnode1:/dev/rdsk/c0t600144F0B3CAAC8300004EF9E25F0003d0
d2 clnode2:/dev/rdsk/c0t600144F0B3CAAC8300004EF9E25F0003d0
The cldev list -v command provides the best summary of node-to-disk paths and the
corresponding DID device numbers.
/dev/did/rdsk/d10 clnode1 Ok
clnode2 Ok
/dev/did/rdsk/d11 clnode1 Ok
/dev/did/rdsk/d13 clnode2 Ok
/dev/did/rdsk/d2 clnode1 Ok
clnode2 Ok
...
# cldev status -s Fail
By default, the scdpmd daemon probes disk paths periodically (once every few minutes).
Disk path status changes are logged into /var/adm/messages with the syslogd
LOG_INFO facility level. All failures are logged by using the LOG_ERR facility level.
The cldevice status command shows the status of disk paths as last recorded by the
daemon. That is, you can pull out a disk or sever a disk path and the status might still be
reported as Ok for a couple of minutes, until the next time the daemon probes the paths.
# cldev status -s fail is used to print faulted disk paths for the entire cluster.
Agenda
A global setting controls what form of SCSI reservations, if any, are used with disks. The
default value, as demonstrated in the command example, is prefer3. With this value, disks
with exactly two paths are fenced with SCSI-2 reservations (with SCSI-2 PGRE added when the disk
is used as a quorum device). Disks with more than two paths are fenced with SCSI-3 reservations, which
already implement the persistence needed for quorum devices. Nothing more is needed.
Each individual disk has its own fencing property. The default value for every disk is global,
which means that fencing for that disk follows the global property.
The per-disk policy for the existing quorum device has been set to pathcount so that it does
not follow the global default of prefer3. For other disks, the individual default_fencing
property remains with the value global and the cluster immediately uses SCSI-3 the next
time they need to be fenced.
# clq add d5
# clq remove d4
# cldev set -p default_fencing=global d4
Updating shared devices on node 1
Updating shared devices on node 2
# clq add d4
You can turn off SCSI fencing for particular disk devices. This feature enables the use of
Serial Advanced Technology Attachment (SATA) devices as shared storage devices. These
devices do not support SCSI fencing of any sort.
Note: Elimination of fencing would also enable you to attach devices that support SCSI-2 but
do not support SCSI-3 directly to more than two nodes. This, however, is not the intention of
the feature. It was specifically invented to support SATA devices.
It is not recommended in any way that you eliminate fencing for devices that support fencing.
The per-disk default_fencing property has values that specify that you do not want
fencing for that disk:
• nofencing: Turns off fencing after scrubbing the disk of any existing reservation keys
• nofencing-noscrub: Turns off fencing without any scrubbing
The example in the slide shows the elimination of fencing for a particular disk device.
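The slide is not reproduced in these notes; a minimal equivalent sketch, assuming the SATA disk in question is DID device d7, would be:
# cldev set -p default_fencing=nofencing d7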
It is likely that if none of your shared disk devices support fencing (all are SATA drives, for
example), you would have to turn off fencing globally during scinstall time.
In the unlikely case that you need to turn off fencing on all devices after Oracle Solaris Cluster
is already running, you can use the new values of the cluster-wide global_fencing
property. The values are the same as those for the per-disk fencing property, listed
previously.
The example in the slide shows the elimination of all disk fencing globally. As in the earlier
examples of switching from scsi2 to scsi3 fencing, a disk that is already a quorum device
requires special manipulation.
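Again as a sketch of what the slide shows, turning off fencing cluster-wide might look like this (a disk that is currently a quorum device must first be removed as a quorum device and added back afterward, as in the earlier examples):
# cluster set -p global_fencing=nofencing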
# clq show d4
=== Quorum Devices ===
As shown in the previous section, you can add a disk on which SCSI fencing has
been eliminated as a quorum device.
Oracle Solaris Cluster will quietly implement its own software quorum mechanism to reliably
and atomically simulate the SCSI-2 or SCSI-3 “race” for the quorum device. The persistent
reservations will be implemented by using PGRE, exactly as when a SCSI-2 device is used
as a quorum device.
The example in the slide shows verification that a quorum device is using the software
quorum mechanism, because its SCSI fencing has been eliminated.
Agenda
The clinterconnect (clintr) command enables you to display the configuration and
status of the private networks that make up the cluster transport. In addition, it enables you to
configure new private networks and/or remove private network components without having to
reboot any cluster nodes.
Viewing interconnect status: The clintr status command shows the status of all private
network paths between all pairs of nodes.
No particular software administration is required to repair a broken interconnect path. If a
cable breaks, for example, the cluster immediately reports a path failure. If you replace the
cable, the path immediately goes back online.
You can cable a new private network and get it defined in the cluster without any reboots or
interruption to any existing service.
The private network definitions in the cluster configuration repository are somewhat complex.
You must perform one of the following sets of actions:
• For a private network defined without a switch (two-node cluster only):
- Define the two adapter endpoints (Oracle Solaris 11 vanity naming feature names
all interfaces as net<N>).
- Define the cable between the two endpoints.
• For a private network defined with a switch:
- Define the adapter endpoints for each node.
- Define the switch endpoint (cluster assumes any endpoint not in the form of
node:adapter is a switch).
- Define cables between each adapter and the switch.
For example, the commands shown in the slide define a new private network for a two-node
cluster. The definitions define a switch. This does not mean that the switch needs to
physically exist. It is just a definition in the cluster.
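Those slide commands are not reproduced in these notes. A hedged equivalent for a two-node cluster, assuming the new adapters are net4 on each node and the switch definition is named switch3, might be as follows; confirm the exact endpoint syntax in the clinterconnect(1CL) man page:
# clintr add clnode1:net4
# clintr add clnode2:net4
# clintr add switch3
# clintr add clnode1:net4,switch3
# clintr add clnode2:net4,switch3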
Agenda
# clsetup
1) Quorum
The clsetup command is a menu-driven utility meant to guide you through many common
(but not all) administrative operations. The clsetup command leads you through a series of
menus, and at the end calls the lower-level administrative commands for you.
In general, clsetup always shows the lower-level commands as it runs them, so it has
educational value as well.
Even if you use the clsetup utility, you still need to know how things are done in the Oracle
Solaris Cluster software environment.
For example, if you go to the cluster interconnect submenu, you see the output shown in the
slide.
For example, to permanently delete an entire private network, you must perform the following
tasks in the correct order:
1. Disable the cable or cables that define the transport.
This is a single operation for a crossover-cable definition, or multiple operations if a
switch is defined (one cable per node connected to the switch).
2. Remove the definition(s) of the transport cable or cables.
3. Remove the definition of the switch.
4. Remove the definitions of the adapters (one per node).
Nothing bad happens if you try to do things in the wrong order; you are just informed that you
missed a step. This would be the same with the command line or with clsetup.
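Whether you use clsetup or the command line, the underlying operations are the same. A sketch of the command-line sequence, assuming the same hypothetical net4 adapters and switch3 definition as before, might be:
# clintr disable clnode1:net4,switch3
# clintr disable clnode2:net4,switch3
# clintr remove clnode1:net4,switch3
# clintr remove clnode2:net4,switch3
# clintr remove switch3
# clintr remove clnode1:net4
# clintr remove clnode2:net4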
Agenda
Controlling Clusters
Basic cluster control includes starting and stopping clustered operation on one or more nodes
and booting nodes in non-cluster mode.
Starting and stopping cluster nodes: The Oracle Solaris Cluster software starts
automatically during a system boot operation. Use the standard init or shutdown
commands to shut down a single node. Use the cluster shutdown command to shut
down all nodes in the cluster.
Before shutting down an individual node, you should switch resource groups to the next
preferred node and then run shutdown or init on that node.
Note: After an initial Oracle Solaris Cluster software installation, there are no configured
resource groups with which to be concerned.
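For example (the resource group name web-rg is hypothetical), evacuating a node and shutting it down, or shutting down the entire cluster, might look like the following:
# clrg switch -n clnode2 web-rg
# shutdown -g0 -y -i0          (single node)
# cluster shutdown -g0 -y      (entire cluster)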
Occasionally, you might want to boot a node without it joining the cluster. This might be to
debug some sort of problem preventing a node from joining a cluster, or to perform
maintenance. For example, you upgrade the cluster software itself when a node is booted into
non-cluster mode. Other nodes might still be up running your clustered applications.
To other nodes that are still booted into the cluster, a node that is booted into non-cluster
mode looks like it has failed completely. It cannot be reached across the cluster transport.
To boot a node to non-cluster mode, you supply the -x boot option, which is passed through
to the kernel.
Booting a SPARC platform machine with the -x command: For a SPARC-based
machine, booting is simple (as shown in the slide).
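For example, from the OpenBoot ok prompt:
ok boot -x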
For x86:
• With the normal Oracle Solaris 11 OS highlighted, press
the e key to edit the boot parameters.
• Highlight the line that begins with kernel. Then press e again and add -x to the end of the line.
+---------------------------------------------------------------+
| Oracle Solaris 11 11/11 |
| solaris-backup-1 |
+---------------------------------------------------------------+
Use the ^ and v keys to select which entry is highlighted.
Press enter to boot the selected OS, ‘e’ to edit the
commands before booting, or ‘c’ for a command-line.
The highlighted entry will be booted automatically in 1 second.
With the normal Solaris 11 OS highlighted, press the e key to edit the boot parameters. You
see the screen as displayed in the slide.
Use the arrows to highlight the line that begins with kernel, and then press e again to edit
that specific line and add the -x.
If you anticipate that a node will be down for an extended period, you can place the node into
maintenance state from an active cluster node. This operation is done to affect vote counts of
a node that is already down. The maintenance state disables the node’s quorum vote. You
cannot place an active cluster member into maintenance state.
A typical command is as follows:
clnode2:/# clq disable clnode1
The clquorum status command shows that the possible vote for clnode1 is now set to 0.
In addition, the vote count for any dual-ported quorum device physically attached to the node
is also set to 0.
You can reset the maintenance state for a node by rebooting the node into the cluster. The
node and any dual-ported quorum devices regain their votes.
To see the value of placing a node in maintenance state, consider the following topology:
Suppose that there is no way that you can add a quorum device between nodes 2 and 3 as
described in an earlier lesson (maybe you have no more available storage controllers).
Now suppose that Node 4 has gone down. At that point in time, the cluster survives, but you
can tell that if you also lose Node 3, you will lose the whole cluster, because you will have only
three of the total possible six quorum votes.
If you put Node 4 into maintenance state, you would eliminate its quorum vote and the
quorum vote of the shared quorum device between Nodes 3 and 4. There would
temporarily be a total of four possible quorum votes, making the required number of
votes three. This would enable the cluster to survive the subsequent death of Node 3.
Agenda
The final task presented in this lesson is unique because it must be accomplished while all
nodes are in multi-user, non-cluster mode.
In this mode, if you run clsetup, it recognizes that the only possible task is to change the
private network information, and it guides you through this task. You can choose a different
network number, and you can give a different anticipated maximum number of nodes and
subnets and use a different suggested netmask.
You run this from one node only, and it automatically propagates to the other nodes. Nodes
can communicate by using the same Remote Procedure Call (RPC) mechanism that is used
during scinstall.
Refer to the examples shown in the next few slides.
Option: 1
After you reboot into the cluster, your new private network information is automatically applied
by the cluster. For cluster-aware applications, the same clprivnet0 private hostnames (by
default, clusternode1-priv and so on) now resolve to the new addresses.
Summary
Practice 6 Overview:
Performing Basic Cluster Administration
This practice covers the following topics:
• Task 1: Verifying basic cluster configuration and status
• Task 2: Reassigning a quorum device
• Task 3: Configuring Oracle Solaris Cluster quorum server