Administration Guide
SUSE Linux Enterprise High Availability Extension
This guide is intended for administrators who need to set up, configure, and maintain clusters with SUSE® Linux Enterprise High Availability Extension. For quick and efficient configuration and administration, the product includes both a graphical user interface and a command line interface (CLI). Both approaches are covered for key tasks, so you can choose the tool that best matches your needs.
SUSE LLC
10 Canal Park Drive
Suite 200
Cambridge MA 02141
USA
https://www.suse.com/documentation
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled “GNU Free Documentation License”.
For SUSE trademarks, see http://www.suse.com/company/legal/ . All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates.
Asterisks (*) denote third-party trademarks.
All information found in this book has been compiled with utmost attention to detail. However, this does not
guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors nor the translators shall be held liable
for possible errors or the consequences thereof.
Contents
1 Product Overview
  1.1 Availability as Extension
  1.3 Benefits
  1.5 Architecture
    Architecture Layers • Process Flow
  Configuration for Two-Node Clusters • Corosync Configuration for N-Node Clusters
  7.2 Logging In
  Resource Templates • Modifying Resources • Adding STONITH Resources • Adding Cluster Resource Groups • Adding Clone Resources • Adding Multi-state Resources • Grouping Resources by Using Tags • Configuring Resource Monitoring
  8.2 Managing Corosync Configuration
III STORAGE AND DATA REPLICATION
  18 OCFS2
    18.1 Features and Benefits
  19 GFS2
    19.1 GFS2 Packages and Management Utilities
  20 DRBD
    20.1 Conceptual Overview
    20.5 Creating a Stacked DRBD Device
  23 Samba Clustering
    23.1 Conceptual Overview
IV APPENDIX
  A Troubleshooting
    A.1 Installation and First Steps
Glossary
This guide is divided into the following parts:
Appendix
Contains an overview of common problems and their solutions. Presents the naming conventions used in this documentation with regard to clusters, resources, and constraints. Contains a glossary of HA-specific terminology.
Administration Guide
This guide is intended for administrators who need to set up, configure, and maintain
clusters with SUSE® Linux Enterprise High Availability Extension. For quick and efficient
configuration and administration, the product includes both a graphical user interface and
a command line interface (CLI). For performing key tasks, both approaches are covered in
this guide. Thus, you can choose the appropriate tool that matches your needs.
2 Feedback
Several feedback channels are available:
User Comments
We want to hear your comments about and suggestions for this manual and the other
documentation included with this product. Use the User Comments feature at the bottom of
each page in the online documentation or go to http://www.suse.com/doc/feedback.html
and enter your comments there.
Mail
For feedback on the documentation of this product, you can also send a mail to doc-team@suse.com. Make sure to include the document title, the product version and the
publication date of the documentation. To report errors or suggest enhancements, provide
a concise description of the problem and refer to the respective section number and page
(or URL).
tux > command
Commands that can be run by any user, including the root user.
root # command
Commands that must be run with root privileges. Often you can also prefix these commands with the sudo command to run them.
crm(live)#
Commands executed in the interactive crm shell. For details, see Chapter 8, Configuring and
Managing Cluster Resources (Command Line).
Alt , Alt – F1 : a key to press or a key combination; keys are shown in uppercase as on
a keyboard
amd64, em64t, ipf This paragraph is only relevant for the architectures amd64 , em64t ,
and ipf . The arrows mark the beginning and the end of the text block.
Notices
Important
Important information you should be aware of before proceeding.
Note
Additional information, for example about differences in software versions.
Tip
Helpful information, like a guideline or a piece of practical advice.
For an overview of naming conventions with regard to cluster nodes and names, resources, and
constraints, see Appendix B, Naming Conventions.
1 Product Overview
Active/active configurations
Hybrid physical and virtual clusters, allowing virtual servers to be clustered with physical
servers. This improves service availability and resource usage.
Local clusters
Your cluster can contain up to 32 Linux servers. Using pacemaker_remote, the cluster can be extended to include additional Linux servers beyond this limit. Any server in the cluster can restart resources (applications, services, IP addresses, and file systems) from a failed server in the cluster.
1.2.2 Flexibility
The High Availability Extension ships with the Corosync messaging and membership layer and the Pacemaker cluster resource manager. Using Pacemaker, administrators can continually monitor the health and status of their resources, and manage dependencies. They can automatically stop
and start services based on highly configurable rules and policies. The High Availability Extension allows you to tailor a cluster to the specific applications and hardware infrastructure that fit your organization. Time-dependent configuration enables services to automatically migrate back to repaired nodes at specified times.
Local Clusters
A single cluster in one location (for example, all nodes are located in one data center).
The cluster uses multicast or unicast for communication between the nodes and manages failover internally. Network latency is negligible. Storage is typically accessed synchronously by all nodes.
The greater the geographical distance between individual cluster nodes, the more factors may
potentially disturb the high availability of services the cluster provides. Network latency, limited
bandwidth and access to storage are the main challenges for long-distance clusters.
YaST
A graphical user interface for general system installation and administration. Use it to
install the High Availability Extension on top of SUSE Linux Enterprise Server as described
in the Installation and Setup Quick Start. YaST also provides the following modules in the
High Availability category to help configure your cluster or individual components:
Cluster: Basic cluster setup. For details, refer to Chapter 4, Using the YaST Cluster Module.
Hawk2
A user-friendly Web-based interface with which you can monitor and administer your High
Availability clusters from Linux or non-Linux machines alike. Hawk2 can be accessed from
any machine inside or outside of the cluster by using a (graphical) Web browser. Therefore
it is the ideal solution even if the system on which you are working only provides a minimal
graphical user interface. For details, see Chapter 7, Configuring and Managing Cluster Resources with Hawk2.
crm Shell
A powerful unified command line interface to configure resources and execute all monitoring or administration tasks. For details, refer to Chapter 8, Configuring and Managing Cluster Resources (Command Line).
1.3 Benefits
The High Availability Extension allows you to configure up to 32 Linux servers into a high-
availability cluster (HA cluster). Resources can be dynamically switched or moved to any node
in the cluster. Resources can be configured to automatically migrate if a node fails, or they can
be moved manually to troubleshoot hardware or balance the workload.
Increased availability
Improved performance
Scalability
Disaster recovery
Data protection
Server consolidation
Storage consolidation
Shared disk fault tolerance can be obtained by implementing RAID on the shared disk subsystem.
The following scenario illustrates some benefits the High Availability Extension can provide.
During normal cluster operation, each node is in constant communication with the other nodes
in the cluster and performs periodic polling of all registered resources to detect failure.
Suppose Web Server 1 experiences hardware or software problems and the users depending on
Web Server 1 for Internet access, e-mail, and information lose their connections. The following
figure shows how resources are moved when Web Server 1 fails.
Web Site A moves to Web Server 2 and Web Site B moves to Web Server 3. IP addresses and
certificates also move to Web Server 2 and Web Server 3.
Detected a failure and verified with STONITH that Web Server 1 was really dead. STONITH
is an acronym for “Shoot The Other Node In The Head”. It is a means of bringing down
misbehaving nodes to prevent them from causing trouble in the cluster.
Remounted the shared data directories that were formerly mounted on Web Server 1 on Web Server 2 and Web Server 3.
Restarted applications that were running on Web Server 1 on Web Server 2 and Web Server
3.
In this example, the failover process happened quickly and users regained access to Web site
information within seconds, usually without needing to log in again.
Now suppose the problems with Web Server 1 are resolved, and Web Server 1 is returned to
a normal operating state. Web Site A and Web Site B can either automatically fail back (move
back) to Web Server 1, or they can stay where they are. This depends on how you configured
the resources for them. Migrating the services back to Web Server 1 will incur some downtime.
Therefore the High Availability Extension also allows you to defer the migration until a period
when it will cause little or no service interruption. There are advantages and disadvantages to
both alternatives.
The High Availability Extension also provides resource migration capabilities. You can move
applications, Web sites, etc. to other servers in your cluster as required for system management.
For example, you could have manually moved Web Site A or Web Site B from Web Server 1 to
either of the other servers in the cluster. Use cases for this are upgrading or performing scheduled
maintenance on Web Server 1, or increasing performance or accessibility of the Web sites.
Typical resources might include data, applications, and services. The following figures show
how a typical Fibre Channel cluster configuration might look. The green lines depict connections
to an Ethernet power switch. Such a device can be controlled over a network and can reboot
a node when a ping request fails.
Although most clusters include a shared disk subsystem, it is also possible to create a cluster
without a shared disk subsystem. The following figure shows how a cluster without a shared
disk subsystem might look.
FIGURE 1.6: ARCHITECTURE
This component provides reliable messaging, membership, and quorum information about the
cluster. This is handled by the Corosync cluster engine, a group communication system.
Pacemaker as cluster resource manager is the “brain” which reacts to events occurring in the
cluster. It is implemented as pacemaker-controld , the cluster controller, which coordinates
all actions. Events can be nodes that join or leave the cluster, failure of resources, or scheduled activities such as maintenance.
Policy Engine
The policy engine runs on every node, but the one on the DC is the active one. The engine is implemented as the pacemaker-schedulerd daemon. When a cluster transition is needed,
based on the current state and configuration, pacemaker-schedulerd calculates the ex-
pected next state of the cluster. It determines what actions need to be scheduled to achieve
the next state.
In some cases, it may be necessary to power off nodes to protect shared data or complete resource recovery. In a Pacemaker cluster, the implementation of node-level fencing is STONITH. For this, Pacemaker comes with a fencing subsystem, pacemaker-fenced. STONITH devices must be configured as cluster resources (that use specific fencing agents), because this makes it possible to monitor the fencing devices. When clients detect a failure, they send a request to pacemaker-fenced, which then executes the fencing agent to bring down the node.
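As a sketch only, a STONITH device could be added from the crm shell along the following lines. The agent ( external/ipmi ), node name, address and credentials are illustrative placeholders, not values from this guide; check which fencing agents are available on your system with crm ra list stonith .

```
# Hypothetical example: an IPMI-based fencing device configured as a
# cluster resource, including a monitor operation for the device itself.
crm configure primitive fence-alice stonith:external/ipmi \
    params hostname=alice ipaddr=192.168.1.101 userid=admin passwd=secret \
    op monitor interval=60s
```

Configuring the monitor operation is what lets the cluster detect an unreachable or broken fencing device before it is actually needed.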
The following section informs you about system requirements and some prerequisites for SUSE® Linux Enterprise High Availability Extension. It also includes recommendations for cluster setup.
Servers
1 to 32 Linux servers with software as specified in Section 2.2, “Software Requirements”.
The servers can be bare metal or virtual machines. They do not require identical hardware (memory, disk space, etc.), but they must have the same architecture. Cross-platform clusters are not supported.
Using pacemaker_remote , the cluster can be extended to include additional Linux servers
beyond the 32-node limit.
Communication Channels
At least two TCP/IP communication media per cluster node. The network equipment must
support the communication means you want to use for cluster communication: multicast
or unicast. The communication media should support a data rate of 100 Mbit/s or higher.
For a supported cluster setup two or more redundant communication paths are required.
This can be done via:
For details, refer to Chapter 13, Network Device Bonding and Procedure 4.3, “Defining a Redundant Communication Channel”, respectively.
Depending on the system roles you select during installation, the following software patterns
are installed by default:
The shared disk system is properly set up and functional according to the manufacturer’s
instructions.
The disks contained in the shared disk system should be configured to use mirroring or
RAID to add fault tolerance to the shared disk system. Hardware-based RAID is recom-
mended. Host-based software RAID is not supported for all configurations.
If you are using iSCSI for shared disk system access, ensure that you have properly configured iSCSI initiators and targets.
When using DRBD* to implement a mirroring RAID system that distributes data across
two machines, make sure to only access the device provided by DRBD—never the backing
device. To leverage the redundancy it is possible to use the same NICs as the rest of the
cluster.
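For example (device and mount point names are placeholders): always address the DRBD device itself, never the disk or partition it mirrors:

```
# Correct: mount the DRBD device
mount /dev/drbd0 /mnt/data

# Wrong: never access the backing device directly while DRBD manages it
# mount /dev/sda7 /mnt/data
```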
When using SBD as STONITH mechanism, additional requirements apply for the shared storage.
For details, see Section 11.3, “Requirements”.
Time Synchronization
Cluster nodes must synchronize to an NTP server outside the cluster. Since SUSE Linux
Enterprise High Availability Extension 15, chrony is the default implementation of NTP.
For more information, see the Administration Guide for SUSE Linux Enterprise Server 15 SP1, chapter Time Synchronization with NTP. It is available from http://www.suse.com/documentation/ .
If nodes are not synchronized, the cluster may not work properly. In addition, log files and
cluster reports are very hard to analyze without synchronization. If you use the bootstrap
scripts, you will be warned if NTP is not configured yet.
List all cluster nodes in the /etc/hosts file with their fully qualified host name and short host name. It is essential that members of the cluster can find each other by name. If the names are not available, internal cluster communication will fail.
For details on how Pacemaker gets the node names, see also http://clusterlabs.org/doc/
en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-node-name.html .
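For example, /etc/hosts entries for a two-node cluster might look like this (names and addresses are placeholders):

```
# /etc/hosts — illustrative entries; each node is listed with its
# fully qualified and its short host name
192.168.1.101   alice.example.com   alice
192.168.1.102   bob.example.com     bob
```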
SSH
All cluster nodes must be able to access each other via SSH. Tools like crm report (for
troubleshooting) and Hawk2's History Explorer require passwordless SSH access between
the nodes, otherwise they can only collect data from the current node.
For the History Explorer there is currently no alternative for passwordless login.
If you are setting up a High Availability cluster with SUSE® Linux Enterprise High Availability Extension for the first time, the easiest way is to start with a basic two-node cluster. You can also use the two-node cluster to run some tests. Afterward,
you can add more nodes by cloning existing cluster nodes with AutoYaST. The cloned nodes will have the same packages installed and the same system configuration as the original ones.
If you want to upgrade an existing cluster that runs an older version of SUSE Linux
Enterprise High Availability Extension, refer to Chapter 5, Upgrading Your Cluster and
Updating Software Packages.
1. Make sure the node you want to clone is correctly installed and configured. For details, see
the Installation and Setup Quick Start for SUSE Linux Enterprise High Availability Extension
or Chapter 4, Using the YaST Cluster Module.
2. Follow the description outlined in the SUSE Linux Enterprise 15 SP1 Deployment Guide for
simple mass installation. This includes the following basic steps:
a. Creating an AutoYaST profile. Use the AutoYaST GUI to create and modify a profile
based on the existing system configuration. In AutoYaST, choose the High Availability
module and click the Clone button. If needed, adjust the configuration in the other modules and save the resulting control file as XML.
If you have configured DRBD, you can select and clone this module in the AutoYaST
GUI, too.
b. Determining the source of the AutoYaST profile and the parameter to pass to the
installation routines for the other nodes.
c. Determining the source of the SUSE Linux Enterprise Server and SUSE Linux Enterprise High Availability Extension installation data.
e. Passing the command line to the installation routines, either by adding the parameters manually or by creating an info file.
After the clone has been successfully installed, execute the following steps to make the cloned
node join the cluster:
1. Transfer the key configuration files from the already configured nodes to the cloned node with Csync2 as described in Section 4.5, “Transferring the Configuration to All Nodes”.
The cloned node will now join the cluster because the /etc/corosync/corosync.conf file has been applied to the cloned node via Csync2. The CIB is automatically synchronized among the cluster nodes.
The YaST cluster module allows you to set up a cluster manually (from scratch) or
to modify options for an existing cluster.
However, if you prefer an automated approach for setting up a cluster, refer to Article “Installation and Setup Quick Start”. It describes how to install the needed packages and leads you to a basic two-node cluster, which is set up with the ha-cluster-bootstrap scripts.
You can also use a combination of both setup methods, for example: set up one
node with YaST cluster and then use one of the bootstrap scripts to integrate more
nodes (or vice versa).
conntrack Tools
Allow interaction with the in-kernel connection tracking system for enabling stateful packet
inspection for iptables. Used by the High Availability Extension to synchronize the connection status between cluster nodes. For detailed information, refer to http://conntrack-tools.netfilter.org/ .
Existing Cluster
The term “existing cluster” is used to refer to any cluster that consists of at least one
node. Existing clusters have a basic Corosync configuration that defines the communication
channels, but they do not necessarily have resource configuration yet.
Multicast
A technology used for a one-to-many communication within a network that can be used
for cluster communication. Corosync supports both multicast and unicast.
If set to active , Corosync uses both interfaces actively. However, this mode is deprecated.
If set to passive , Corosync sends messages alternatively over the available networks.
Unicast
A technology for sending messages to a single network destination. Corosync supports both
multicast and unicast. In Corosync, unicast is implemented as UDP-unicast (UDPU).
The following list shows an overview of the available screens in the YaST cluster module. It also
mentions whether the screen contains parameters that are required for successful cluster setup
or whether its parameters are optional.
Service (required)
Allows you to configure the service for bringing the cluster node online. Define whether to
start the Pacemaker service at boot time and whether to open the ports in the firewall that
are needed for communication between the nodes. For details, see Section 4.7, “Configuring
Services”.
If you start the cluster module for the first time, it appears as a wizard, guiding you through all the steps necessary for basic setup. Otherwise, click the categories on the left panel to access the configuration options for each step.
All settings defined in the YaST Communication Channels screen are written to /etc/corosync/corosync.conf . Find example files for a multicast and a unicast setup in /usr/share/doc/packages/corosync/ .
If you are using IPv4 addresses, node IDs are optional. If you are using IPv6 addresses, node
IDs are required. Instead of specifying IDs manually for each node, the YaST cluster module
contains an option to automatically generate a unique ID for every cluster node.
When using multicast, the same bindnetaddr , mcastaddr , and mcastport will be used
for all cluster nodes. All nodes in the cluster will know each other by using the same
multicast address. For different clusters, use different multicast addresses.
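To illustrate how these settings end up in /etc/corosync/corosync.conf , a totem interface section for multicast might look roughly like this (addresses and port are placeholders):

```
totem {
    interface {
        ringnumber:  0
        bindnetaddr: 192.168.1.0    # subnet used for cluster communication
        mcastaddr:   239.60.60.60   # identical on all nodes of this cluster
        mcastport:   5405           # identical on all nodes of this cluster
    }
}
```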
1. Start the YaST cluster module and switch to the Communication Channels category.
3. Define the Bind Network Address. Set the value to the subnet you will use for cluster multicast.
6. To automatically generate a unique ID for every cluster node, keep Auto Generate Node ID enabled.
If you want to use unicast instead of multicast for cluster communication, proceed as follows.
1. Start the YaST cluster module and switch to the Communication Channels category.
IP Address
To modify or remove any addresses of cluster members, use the Edit or Del buttons.
5. To automatically generate a unique ID for every cluster node, keep Auto Generate Node ID enabled.
7. Enter the number of Expected Votes. This is important for Corosync to calculate quorum in
case of a partitioned cluster. By default, each node has 1 vote. The number of Expected
Votes must match the number of nodes in your cluster.
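As a back-of-the-envelope check (assuming the default of one vote per node), a partition keeps quorum only if it holds a strict majority of the expected votes:

```shell
# Quorum calculation sketch: a partition retains quorum only with
# more than half of the expected votes. Values are illustrative.
expected_votes=5
quorum=$(( expected_votes / 2 + 1 ))
echo "A ${expected_votes}-node cluster keeps quorum with ${quorum} votes"
# prints: A 5-node cluster keeps quorum with 3 votes
```

This is why Expected Votes must match the real node count: setting it too low lets a minority partition believe it has quorum.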
If network device bonding cannot be used for any reason, the second best choice is to define
a redundant communication channel (a second ring) in Corosync. That way, two physically
separate networks can be used for communication. If one network fails, the cluster nodes can
still communicate via the other network.
The additional communication channel in Corosync will form a second token-passing ring. In /etc/corosync/corosync.conf , the first channel you configured is the primary ring and gets the ring number 0 . The second ring (redundant channel) gets the ring number 1 .
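Sketched in /etc/corosync/corosync.conf terms, two rings might look like this (network addresses and ports are placeholders):

```
totem {
    rrp_mode: passive
    interface {
        ringnumber:  0              # primary ring
        bindnetaddr: 192.168.1.0
        mcastport:   5405
    }
    interface {
        ringnumber:  1              # redundant ring on a separate network
        bindnetaddr: 10.0.0.0
        mcastport:   5407
    }
}
```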
After defining redundant communication channels in Corosync, use RRP to tell the cluster how to use these interfaces. With RRP, two physically separate networks are used for communication. If one network fails, the cluster nodes can still communicate via the other network.
If set to active , Corosync uses both interfaces actively. However, this mode is deprecated.
If set to passive , Corosync sends messages alternatively over the available networks.
1. Start the YaST cluster module and switch to the Communication Channels category.
2. Activate Redundant Channel. The redundant channel must use the same protocol as the first communication channel you defined.
3. If you use multicast, enter the following parameters: the Bind Network Address to use, the
Multicast Address and the Multicast Port for the redundant channel.
If you use unicast, define the following parameters: the Bind Network Address to use, and
the Multicast Port. Enter the IP addresses of all nodes that will be part of the cluster.
4. To tell Corosync how and when to use the different channels, select the rrp_mode to use:
If set to active , Corosync uses both interfaces actively. However, this mode is deprecated.
If set to passive , Corosync sends messages alternatively over the available networks.
When RRP is used, the High Availability Extension monitors the status of the current rings
and automatically re-enables redundant rings after faults.
Alternatively, check the ring status manually with corosync-cfgtool . View the available
options with -h .
1. Start the YaST cluster module and switch to the Security category.
3. For a newly created cluster, click Generate Auth Key File. An authentication key is created
and written to /etc/corosync/authkey .
If you want the current machine to join an existing cluster, do not generate a new key file. Instead, copy the /etc/corosync/authkey from one of the nodes to the current machine (either manually or with Csync2).
Csync2 helps you to keep track of configuration changes and to keep files synchronized across the cluster nodes:
You can define a list of files that are important for operation.
You can show changes to these files (against the other cluster nodes).
With a simple shell script in ~/.bash_logout , you can be reminded about unsynchronized changes before logging out of the system.
2. To specify the synchronization group, click Add in the Sync Host group and enter the local
host names of all nodes in your cluster. For each node, you must use exactly the strings
that are returned by the hostname command.
3. Click Generate Pre-Shared-Keys to create a key file for the synchronization group. The key file is written to /etc/csync2/key_hagroup . After it has been created, it must be copied manually to all members of the cluster.
4. To populate the Sync File list with the files that usually need to be synchronized among all nodes, click Add Suggested Files.
5. To Edit, Add or Remove files from the list of files to be synchronized, use the respective buttons. You must enter the absolute path name for each file.
6. Activate Csync2 by clicking Turn Csync2 ON. This will execute the following command to
start Csync2 automatically at boot time:
FIGURE 4.4: YAST CLUSTER—CSYNC2
1. Copy the file /etc/csync2/csync2.cfg manually to all nodes after you have configured it as described in Section 4.5.1, “Configuring Csync2 with YaST”.
2. Copy the file /etc/csync2/key_hagroup that you have generated on one node in Step 3 of Section 4.5.1 to all nodes in the cluster. It is needed for authentication by Csync2. However, do not regenerate the file on the other nodes; it needs to be the same file on all nodes.
3. Execute the following command on all nodes to start the service now:
1. To initially synchronize all files once, execute the following command on the machine that you want to copy the configuration from:
This will synchronize all the files once by pushing them to the other nodes. If all files are synchronized successfully, Csync2 will finish with no errors.
If one or several files that are to be synchronized have been modified on other nodes (not only on the current one), Csync2 reports a conflict. You will get an output similar to the one below:
2. If you are sure that the file version on the current node is the “best” one, you can resolve the conflict by forcing this file and resynchronizing:
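In Csync2 terms, this is typically done by flagging the local copy as dirty and running the synchronization again; the file path below is only an example:

```shell
# Declare the local version of the file authoritative ...
csync2 -f /etc/corosync/corosync.conf
# ... then push the configured files to the other nodes again
csync2 -x
```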
For an overview of all Csync2 options, run csync2 -help .
2. Configuring a resource for conntrackd (class: ocf , provider: heartbeat ). If you use
Hawk2 to add the resource, use the default values proposed by Hawk2.
After having configured the conntrack tools, you can use them for Linux Virtual Server, see Load
Balancing.
Use the YaST cluster module to configure the user space conntrackd . It needs a dedicated
network interface that is not used for other communication channels. The daemon can be
started via a resource agent afterward.
1. Start the YaST cluster module and switch to the Configure conntrackd category.
2. Select a Dedicated Interface for synchronizing the connection status. The IPv4 address of
the selected interface is automatically detected and shown in YaST. It must already be
configured and it must support multicast.
3. Define the Multicast Address to be used for synchronizing the connection status.
4. In Group Number, define a numeric ID for the group to synchronize the connection status to.
7. For further cluster configuration, click Next and proceed with Section 4.7, “Configuring Services”.
FIGURE 4.5: YAST CLUSTER—conntrackd
PROCEDURE 4.8: ENABLING PACEMAKER
4. To open the ports in the firewall that are needed for cluster communication on the current
machine, activate Open Port in Firewall.
5. Confirm your changes. Note that the configuration only applies to the current machine,
not to all cluster nodes.
FIGURE 4.6: YAST CLUSTER—SERVICES
4. On one of the nodes, check the cluster status with the crm status command. If all nodes
are online, the output should be similar to the following:
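For illustration only (node names, version strings and counts will differ on your systems), such output might look roughly like:

```
Cluster Summary:
  * Stack: corosync
  * Current DC: alice (version ...) - partition with quorum
  * 2 nodes configured
  * 0 resource instances configured

Node List:
  * Online: [ alice bob ]
```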
This output indicates that the cluster resource manager is started and is ready to manage
resources.
After the basic configuration is done and the nodes are online, you can start to configure cluster
resources. Use one of the cluster management tools like the crm shell (crmsh) or Hawk2. For
more information, see Chapter 8, Configuring and Managing Cluster Resources (Command Line) or
Chapter 7, Configuring and Managing Cluster Resources with Hawk2.
This chapter covers two different scenarios: upgrading a cluster to another version
of SUSE Linux Enterprise High Availability Extension (either a major release or a
service pack) as opposed to updating individual packages on cluster nodes. See Section 5.2, “Upgrading your Cluster to the Latest Product Version” versus Section 5.3, “Updating Software Packages on Cluster Nodes”.
If you want to upgrade your cluster, check Section 5.2.1, “Supported Upgrade Paths for
SLE HA and SLE HA Geo” and Section 5.2.2, “Required Preparations Before Upgrading” before starting to upgrade.
5.1 Terminology
In the following, find definitions of the most important terms used in this chapter:
Major Release,
General Availability (GA) Version
A major release is a new product version that brings new features and tools, and decommissions previously deprecated components. It comes with backward incompatible changes.
Offline Migration
If a new product version includes major changes that are backward incompatible, the
cluster needs to be upgraded by an offline migration. You need to take all nodes offline
and upgrade the cluster as a whole, before you can bring all nodes back online.
Rolling Upgrade
In a rolling upgrade one cluster node at a time is upgraded while the rest of the cluster is still running. You take the first node offline, upgrade it and bring it back online to join the cluster. Then you continue one by one until all cluster nodes are upgraded to the new version.
Upgrade
Installation of a newer major version of a package or distribution, which brings new features.
See also Offline Migration versus Rolling Upgrade.
Rolling upgrades are only supported within the same major release (from the GA of a
product version to the next service pack, and from one service pack to the next).
Offline migrations are required to upgrade from one major release to the next (for example,
from SLE HA 12 to SLE HA 15) or from a service pack within one major release to the next
major release (for example, from SLE HA 12 SP3 to SLE HA 15).
Section 5.2.1 gives an overview of the supported upgrade paths for SLE HA (Geo). The column For
Details lists the specific upgrade documentation you should refer to (including documentation for
the base system and for Geo Clustering for SUSE Linux Enterprise High Availability Extension).
This documentation is available from:
http://www.suse.com/documentation/sles
http://www.suse.com/documentation/sle-ha
http://www.suse.com/documentation/sle-ha-geo
Mixed clusters running on SUSE Linux Enterprise High Availability Extension 12/
SUSE Linux Enterprise High Availability Extension 15 are not supported.
After the upgrade to product version 15, reverting to product version
12 is not supported.
Backup
Ensure that your system backup is up to date and restorable.
Testing
Test the upgrade procedure on a staging instance of your cluster setup first, before
performing it in a production environment. This gives you an estimation of the time frame
required for the maintenance window. It also helps to detect and solve any unexpected
problems that might arise.
If your cluster is still based on an older product version than the ones listed above, first upgrade
it to a version of SLES and SLE HA that can be used as a source for upgrading to the desired
target version.
The High Availability Extension 12 cluster stack comes with major changes in various com-
ponents (for example, /etc/corosync/corosync.conf , disk formats of OCFS2). Therefore, a
rolling upgrade from any SUSE Linux Enterprise High Availability Extension 11 version is
not supported. Instead, all cluster nodes must be offline and the cluster needs to be migrated as
a whole as described in Procedure 5.1, “Upgrading from Product Version 11 to 12: Cluster-Wide Offline
Migration”.
1. Log in to each cluster node and stop the cluster stack with:
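The command itself was lost in extraction. On SUSE Linux Enterprise High Availability Extension 11, the cluster stack is based on OpenAIS, so stopping it would be sketched as follows (verify against your installed version):

```shell
# on each cluster node, stop the OpenAIS-based cluster stack (SLE HA 11)
rcopenais stop
```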
3. After the upgrade process has finished, reboot each node with the upgraded version of
SUSE Linux Enterprise Server and SUSE Linux Enterprise High Availability Extension.
4. If you use OCFS2 in your cluster setup, update the on-device structure by executing the
following command:
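A sketch of the command referred to above; the exact invocation may differ depending on your OCFS2 version, and the device path is a placeholder:

```shell
# update the OCFS2 on-device structure for the new cluster stack
# (/dev/disk/by-id/DEVICE is a placeholder for your OCFS2 device)
tunefs.ocfs2 --update-cluster-stack /dev/disk/by-id/DEVICE
```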
It adds additional parameters to the disk. They are needed for the updated OCFS2 version
that is shipped with SUSE Linux Enterprise High Availability Extension 12 and 12 SPx.
b. Switch to the Communication Channels category and enter values for the following
new parameters: Cluster Name and Expected Votes. For details, see Procedure 4.1, “Defin-
ing the First Communication Channel (Multicast)” or Procedure 4.2, “Defining the First Com-
munication Channel (Unicast)”, respectively.
If YaST detects any other options that are invalid or missing according to
Corosync version 2, it will prompt you to change them.
d. If Csync2 is configured for your cluster, use the following command to push the
updated Corosync configuration to the other cluster nodes:
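The Csync2 push referred to above is typically done as follows:

```shell
# push the updated configuration files to all Csync2 peers
csync2 -xv
```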
For details on Csync2, see Section 4.5, “Transferring the Configuration to All Nodes”.
Alternatively, synchronize the updated Corosync configuration by manually copying
/etc/corosync/corosync.conf to all cluster nodes.
1. Before starting the offline migration to SUSE Linux Enterprise High Availability Extension
15, manually upgrade the CIB syntax in your current cluster as described in Note: Upgrading
the CIB Syntax Version.
2. Log in to each cluster node and stop the cluster stack with:
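The command was lost in extraction. Assuming crmsh-based cluster tooling, stopping the cluster stack on a node can be sketched as:

```shell
# stop the cluster services on this node
crm cluster stop
```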
4. After the upgrade process has finished, log in to each node and boot it with the upgrad-
ed version of SUSE Linux Enterprise Server and SUSE Linux Enterprise High Availability
Extension.
5. If you use Cluster LVM, you need to migrate from clvmd to lvmlockd. See the man page
of lvmlockd , section changing a clvm VG to a lockd VG and Section 21.4, “Online Migration
from Mirror LV to Cluster MD”.
1. Log in as root on the node that you want to upgrade and stop the cluster stack:
3. Start the cluster stack on the upgraded node to make the node rejoin the cluster:
4. Take the next node offline and repeat the procedure for that node.
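Assuming crmsh-based cluster tooling, the per-node sequence sketched in the steps above looks like this (the package upgrade step itself is elided):

```shell
crm cluster stop      # take this node out of the cluster
# ...upgrade SLES and SLE HA packages on this node...
crm cluster start     # rejoin the upgraded node to the cluster
```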
The Hawk2 Status screen also shows a warning if different CRM versions are detected for your
cluster nodes.
Does the update affect any packages belonging to SUSE Linux Enterprise High Avail-
ability Extension or Geo Clustering for SUSE Linux Enterprise High Availability Ex-
tension? If yes : Stop the cluster stack on the node before starting the software up-
date:
Does the package update require a reboot? If yes : Stop the cluster stack on the node
before starting the software update:
If none of the situations above apply, you do not need to stop the cluster stack. In
that case, put the node into maintenance mode before starting the software update:
For more details on maintenance mode, see Section 16.2, “Different Options for Mainte-
nance Tasks”.
Either start the cluster stack on the respective node (if you stopped it in Step 1):
or remove the maintenance flag to bring the node back to normal mode:
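The commands for the cases above can be sketched as follows, assuming crmsh-based tooling (NODENAME is a placeholder):

```shell
# cases 1 and 2: stop the cluster stack on the node before updating
crm cluster stop
# otherwise: put the node into maintenance mode instead
crm node maintenance NODENAME
# after the update, either restart the cluster stack ...
crm cluster start
# ... or remove the maintenance flag
crm node ready NODENAME
```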
Two-node clusters
Clusters with more than two nodes. This usually means an odd number of nodes.
By combining these node counts with different topologies, different use cases can be derived.
The following use cases are the most common:
Usage scenario: Embedded clusters that focus on high availability of services rather than
data redundancy via data replication. Such a setup is used for radio stations or assembly line
controllers, for example.
Usage scenario: Classic stretched clusters that focus on high availability of services and local
data redundancy, for example, for databases and enterprise resource planning. This has been
one of the most popular setups during the last few years.
Usage scenario: Classic stretched clusters that focus on high availability of services and data
redundancy, for example, for databases and enterprise resource planning.
N ≥ C/2 + 1 (where C/2 is rounded down)
For example, a five-node cluster needs a minimum of three operational nodes (or can
tolerate two failed nodes).
We strongly recommend using either a two-node cluster or an odd number of cluster nodes.
Two-node clusters make sense for stretched setups across two sites. Clusters with an odd
number of nodes can either be built on one single site or might be spread across three sites.
ignore
Setting no-quorum-policy to ignore makes the cluster behave like it has quorum. Re-
source management is continued.
On SLES 11 this was the recommended setting for a two-node cluster. Starting with
SLES 12, this option is obsolete. Based on configuration and conditions, Corosync gives
cluster nodes or a single node “quorum”—or not.
For two-node clusters the only meaningful behavior is to always react in case of quorum
loss. The first step should always be to try to fence the lost node.
freeze
If quorum is lost, the cluster partition freezes. Resource management is continued: running
resources are not stopped (but possibly restarted in response to monitor events), but no
further resources are started within the affected partition.
suicide
If quorum is lost, all nodes in the affected cluster partition are fenced. This option works
only in combination with SBD, see Chapter 11, Storage Protection and SBD.
quorum {
# Enable and configure quorum subsystem (default: off)
# see also corosync.conf.5 and votequorum.5
provider: corosync_votequorum
expected_votes: 2
two_node: 1
}
As opposed to SUSE Linux Enterprise 11, the votequorum subsystem in SUSE Linux Enterprise 12
is powered by Corosync version 2.x. This means that the no-quorum-policy=ignore option
must not be used.
By default, when two_node: 1 is set, the wait_for_all option is automatically enabled. If
wait_for_all is not enabled, the cluster should be started on both nodes in parallel. Otherwise
the first node will perform startup fencing on the missing second node.
quorum {
provider: corosync_votequorum 1
expected_votes: N 2
wait_for_all: 1 3
}
The output is in XML format and includes several sections (general description,
available parameters, available actions for the agent).
Alternatively, use the crmsh to view information on OCF resource agents. For details, see
Section 8.1.3, “Displaying Information about OCF Resource Agents”.
Systemd
Starting with SUSE Linux Enterprise 12, systemd is a replacement for the popular System
V init daemon. Pacemaker can manage systemd services if they are present. Instead of init
scripts, systemd has unit files. Generally the services (or unit files) are provided by the
operating system. In case you want to convert existing init scripts, find more information
at http://0pointer.de/blog/projects/systemd-for-admins-3.html .
Service
There are currently many “common” types of system services that exist in parallel: LSB
(belonging to System V init), systemd , and (in some distributions) upstart . Therefore,
Pacemaker supports a special alias which intelligently figures out which one applies to a
given cluster node. This is particularly useful when the cluster contains a mix of systemd,
upstart, and LSB services. Pacemaker will try to find the named service in the following
order: as an LSB (SYS-V) init script, a systemd unit file, or an Upstart job.
Nagios
Monitoring plug-ins (formerly called Nagios plug-ins) allow you to monitor services on
remote hosts. Pacemaker can do remote monitoring with the monitoring plug-ins if they are
present. For detailed information, see Section 6.6.1, “Monitoring Services on Remote Hosts with
Monitoring Plug-ins”.
The agents supplied with the High Availability Extension are written to OCF specifications.
Primitives
A primitive resource is the most basic type of resource.
Learn how to create primitive resources with your preferred cluster management tool:
Groups
Groups contain a set of resources that need to be located together, started sequentially and
stopped in the reverse order. For more information, refer to Section 6.3.5.1, “Groups”.
Clones
Clones are resources that can be active on multiple hosts. Any resource can be cloned,
provided the respective resource agent supports it. For more information, refer to Sec-
tion 6.3.5.2, “Clones”.
Promotable clones (formerly known as master/slave or multi-state resources) are a special
type of clone resource that can be promoted.
6.3.5.1 Groups
Some cluster resources depend on other components or resources. They require that each com-
ponent or resource starts in a specific order and runs together on the same server with resources
it depends on. To simplify this configuration, you can use cluster resource groups.
An example of a resource group would be a Web server that requires an IP address and
a file system. In this case, each component is a separate resource that is combined into a
cluster resource group. The resource group would run on one or more servers. In case of
a software or hardware malfunction, the group would fail over to another server in the
cluster, similar to an individual cluster resource.
FIGURE 6.1: GROUP RESOURCE — a group containing an IP address resource (192.168.1.180)
and a file system resource (XFS), loaded in this order.
Dependency
If a resource in the group cannot run anywhere, then none of the resources located after
that resource in the group is allowed to run.
Contents
Groups may only contain a collection of primitive cluster resources. Groups must contain
at least one resource, otherwise the configuration is not valid. To refer to the child of a
group resource, use the child’s ID instead of the group’s ID.
Constraints
Although it is possible to reference the group’s children in constraints, it is usually prefer-
able to use the group’s name instead.
Resource Monitoring
To enable resource monitoring for a group, you must configure monitoring separately for
each resource in the group that you want monitored.
Learn how to create groups with your preferred cluster management tool:
6.3.5.2 Clones
You may want certain resources to run simultaneously on multiple nodes in your cluster. To do
this you must configure a resource as a clone. Examples of resources that might be configured
as clones include cluster file systems like OCFS2. You can clone any resource, provided
this is supported by the resource's resource agent. Clone resources may even be configured
differently depending on which nodes they are hosted on.
There are three types of resource clones:
Anonymous Clones
These are the simplest type of clones. They behave identically anywhere they are running.
Because of this, there can only be one instance of an anonymous clone active per machine.
Warning: Use Unique IDs — This value must not overlap with any existing resource or
node IDs.
The output lists all the supported attributes, their purpose and default values.
For example, the command
If left empty, the script will try to determine this from the
routing table.
If unspecified, the script will also try to determine this from the
routing table.
local_stop_script (string):
Script called when the IP is released
local_start_script (string):
Script called when the IP is added
start timeout=90
stop timeout=100
monitor_0 interval=5s timeout=20s
1. If a resource runs into a timeout, it fails and the cluster will try to stop it.
2. If stopping the resource also fails (for example, because the timeout for stopping is set
too low), the cluster will fence the node. It considers the node where this happens to be
out of control.
You can adjust the global default for operations and set any specific timeout values with both
crmsh and Hawk2. The best practice for determining and setting timeout values is as follows:
1. Check how long it takes your resources to start and stop (under load).
2. If needed, add the op_defaults parameter and set the (default) timeout value accord-
ingly:
b. For resources that need longer periods of time, define individual timeout values.
3. When configuring operations for a resource, add separate start and stop operations.
When configuring operations with Hawk2, it will provide useful timeout proposals for
those operations.
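As a sketch in crmsh (the resource name, the Dummy agent, and all timeout values are examples only, not recommendations):

```shell
# set a cluster-wide default operation timeout
crm configure op_defaults timeout=120s
# give a slow resource its own, longer start/stop timeouts
crm configure primitive rsc1 ocf:heartbeat:Dummy \
    op start timeout=180s \
    op stop timeout=180s
```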
Log file messages are generated, according to the configuration specified in the logging
section of /etc/corosync/corosync.conf .
The failure is reflected in the cluster management tools (Hawk2, crm status ), and in
the CIB status section.
The cluster initiates noticeable recovery actions which may include stopping the resource
to repair the failed state and restarting the resource locally or on another node. The re-
source also may not be restarted, depending on the configuration and state of the cluster.
If you do not configure resource monitoring, resource failures after a successful start will not be
communicated, and the cluster will always show the resource as healthy.
This configuration triggers a monitoring operation every 300 seconds for the resource
dummy1 when it is in role="Stopped" . When running, it will be monitored every 30
seconds.
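In crmsh syntax, such a configuration could look like this (the Dummy agent stands in for a real resource, and timeout values are omitted for brevity):

```shell
primitive dummy1 ocf:heartbeat:Dummy \
    op monitor interval=300s role=Stopped \
    op monitor interval=30s
```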
Probing
The CRM executes an initial monitoring for each resource on every node, the so-called
probe . A probe is also executed after the cleanup of a resource. If multiple monitoring
operations are defined for a resource, the CRM will select the one with the smallest interval
and will use its timeout value as default timeout for probing. If no monitor operation is
configured, the cluster-wide default applies. The default is 20 seconds (if not specified
otherwise by configuring the op_defaults parameter). If you do not want to rely on the
automatic calculation or the op_defaults value, define a specific monitoring operation
for the probing of this resource. Do so by adding a monitoring operation with the interval
set to 0 , for example:
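For example, a probe-specific monitoring operation with a 60-second timeout for a resource rsc1 could be defined like this (the Dummy agent is used as a stand-in):

```shell
crm configure primitive rsc1 ocf:pacemaker:Dummy \
    op monitor interval=0 timeout=60s
```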
The probe of rsc1 will time out in 60s , independent of the global timeout defined in
op_defaults , or any other operation timeouts configured. If you did not set inter-
val="0" for specifying the probing of the respective resource, the CRM will automatically
check for any other monitoring operations defined for that resource and will calculate the
timeout value for probing as described above.
Learn how to add monitor operations to resources with your preferred cluster management tool:
Resource Location
Location constraints define on which nodes a resource may run, may not run, or is
preferred to run.
Resource Colocation
Colocation constraints tell the cluster which resources may or may not run together
on a node.
Resource Order
Ordering constraints define the sequence of actions.
Do not create colocation constraints for members of a resource group. Create a colo-
cation constraint pointing to the resource group as a whole instead. All other types
of constraints are safe to use for members of a resource group.
Do not use any constraints on a resource that has a clone resource or a promotable
clone resource applied to it. The constraints must apply to the clone or promotable
clone resource, not to the child resource.
As an alternative format for defining location, colocation or ordering constraints, you can use
resource sets , where primitives are grouped together in one set. Previously this was possible
either by defining a resource group (which could not always accurately express the design), or by
defining each relationship as an individual constraint. The latter caused a constraint explosion
as the number of resources and combinations grew. The configuration via resource sets is not
necessarily less verbose, but is easier to understand and maintain, as the following examples
show.
For example, you can use the following configuration of a resource set ( loc-alice ) in
the crmsh to place two virtual IPs ( vip1 and vip2 ) on the same node, alice :
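In crmsh syntax, the constraint described above could look like this (a sketch; the names loc-alice, vip1, vip2 and alice are taken from the surrounding text):

```shell
crm configure location loc-alice { vip1 vip2 } inf: alice
```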
If you want to use resource sets to replace a configuration of colocation constraints, consider
the following two examples:
<constraints>
<rsc_colocation id="coloc-1" score="INFINITY" rsc="B" with-rsc="A"/>
<rsc_colocation id="coloc-2" score="INFINITY" rsc="C" with-rsc="B"/>
<rsc_colocation id="coloc-3" score="INFINITY" rsc="D" with-rsc="C"/>
</constraints>
The same purpose can be achieved by using a resource set with colocated resources:
<constraints>
<rsc_colocation id="coloc-1" score="INFINITY" >
<resource_set id="colocated-set-example" sequential="true">
<resource_ref id="A"/>
<resource_ref id="B"/>
<resource_ref id="C"/>
<resource_ref id="D"/>
</resource_set>
</rsc_colocation>
</constraints>
If you want to use resource sets to replace a configuration of ordering constraints, consider the
following two examples:
<constraints>
<rsc_order id="order-1" first="A" then="B" />
<rsc_order id="order-2" first="B" then="C" />
<rsc_order id="order-3" first="C" then="D" />
</constraints>
The same purpose can be achieved by using a resource set with ordered resources:
<constraints>
<rsc_order id="order-1">
<resource_set id="ordered-set-example" sequential="true">
<resource_ref id="A"/>
<resource_ref id="B"/>
<resource_ref id="C"/>
<resource_ref id="D"/>
</resource_set>
</rsc_order>
</constraints>
Sometimes it is useful to place a group of resources on the same node (defining a colocation
constraint), but without having hard dependencies between the resources. For example, you
want two resources to be placed on the same node, but you do not want the cluster to restart
the other one if one of them fails. This can be achieved on the crm shell by using the weak
bond command.
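A sketch in crmsh syntax (the resource names A and B are placeholders; the parenthesized set is non-sequential, which expresses the weak bond — failure of one member does not cause a restart of the other):

```shell
crm configure colocation weak-bond inf: ( A B )
```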
Learn how to set these “weak bonds” with your preferred cluster management tool:
Learn how to add the various kinds of constraints with your preferred cluster management tool:
For more information on configuring constraints and detailed background information about the
basic concepts of ordering and colocation, refer to the following documents. They are available
at http://www.clusterlabs.org/pacemaker/doc/ :
Colocation Explained
Ordering Explained
When defining resource constraints, you specify a score for each constraint. The score indicates
the value you are assigning to this resource constraint. Constraints with higher scores are applied
before those with lower scores. By creating additional location constraints with different scores
for a given resource, you can specify an order for the nodes that a resource will fail over to.
order constraints,
colocation constraints,
However, colocation constraints must not contain more than one reference to a template. Re-
source sets must not contain a reference to a template.
Resource templates referenced in constraints stand for all primitives which are derived from that
template. This means, the constraint applies to all primitive resources referencing the resource
template. Referencing resource templates in constraints is an alternative to resource sets and
can simplify the cluster configuration considerably. For details about resource sets, refer to
Procedure 7.17, “Using a Resource Set for Constraints”.
For example, let us assume you have configured a location constraint for resource rsc1
to preferably run on alice . If it fails there, migration-threshold is checked and com-
pared to the failcount. If failcount >= migration-threshold then the resource is migrated
to the node with the next best preference.
After the threshold has been reached, the node will no longer be allowed to run the failed
resource until the resource's failcount is reset. This can be done manually by the cluster
administrator or by setting a failure-timeout option for the resource.
For example, a setting of migration-threshold=2 and failure-timeout=60s would
cause the resource to migrate to a new node after two failures. It would be allowed to
move back (depending on the stickiness and constraint scores) after one minute.
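The settings mentioned above could be applied like this in crmsh (rsc1 and the Dummy agent are placeholders for a real resource definition):

```shell
crm configure primitive rsc1 ocf:heartbeat:Dummy \
    meta migration-threshold=2 failure-timeout=60s
```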
There are two exceptions to the migration threshold concept, occurring when a resource either
fails to start or fails to stop:
Start failures set the failcount to INFINITY and thus always cause an immediate migration.
Stop failures cause fencing (when stonith-enabled is set to true which is the default).
In case there is no STONITH resource defined (or stonith-enabled is set to false ), the
resource will not migrate.
Value is 0 :
This is the default. The resource will be placed optimally in the system. This may mean
that it is moved when a “better” or less loaded node becomes available. This option is
almost equivalent to automatic failback, except that the resource may be moved to a node
that is not the one it was previously active on.
Value is INFINITY :
The resource will always remain in its current location unless forced off because the node
is no longer eligible to run the resource (node shutdown, node standby, reaching the mi-
gration-threshold , or configuration change). This option is almost equivalent to com-
pletely disabling automatic failback.
Value is -INFINITY :
The resource will always move away from its current location.
Learn how to configure these settings with your preferred cluster management tool:
A node is considered eligible for a resource if it has sufficient free capacity to satisfy the re-
source's requirements. The nature of the capacities is completely irrelevant for the High Avail-
ability Extension; it only makes sure that all capacity requirements of a resource are satisfied
before moving a resource to a node.
To manually configure the resource's requirements and the capacity a node provides, use uti-
lization attributes. You can name the utilization attributes according to your preferences and
define as many name/value pairs as your configuration needs. However, the attribute's values
must be integers.
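For illustration (node and resource names, the path, and all values are examples; the utilization attribute names are user-defined):

```shell
# capacity provided by a node
crm configure node alice utilization memory=16384 cpu=8
# capacity required by a resource
crm configure primitive xenA ocf:heartbeat:Xen \
    params xmfile="/etc/xen/vm/xenA" \
    utilization hv_memory=3500 cpu=4
```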
If multiple resources with utilization attributes are grouped or have colocation constraints, the
High Availability Extension takes that into account. If possible, the resources will be placed on
a node that can fulfill all capacity requirements.
Apart from detecting the minimal requirements, the High Availability Extension also allows
you to monitor the current utilization via the VirtualDomain resource agent. It detects CPU and
RAM use of the virtual machine. To use this feature, configure a resource of the following
class, provider and type: ocf:heartbeat:VirtualDomain . The following instance attributes
are available: autoset_utilization_cpu and autoset_utilization_hv_memory . Both de-
fault to true . This updates the utilization values in the CIB during each monitoring cycle.
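A sketch of such a resource in crmsh syntax (the configuration path and timing values are examples):

```shell
primitive vm1 ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/vm1.xml" \
        autoset_utilization_cpu=true \
        autoset_utilization_hv_memory=true \
    op monitor interval=60s timeout=90s
```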
Independent of manually or automatically configuring capacity and requirements, the place-
ment strategy must be specified with the placement-strategy property (in the global cluster
options). The following values are available:
utilization
Utilization values are considered when deciding if a node has enough free capacity to sat-
isfy a resource's requirements. However, load-balancing is still done based on the number
of resources allocated to a node.
balanced
Utilization values are considered when deciding if a node has enough free capacity to
satisfy a resource's requirements. An attempt is made to distribute the resources evenly,
thus optimizing resource performance.
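The property can be set via crmsh, for example:

```shell
crm configure property placement-strategy=balanced
```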
The following example demonstrates a three-node cluster of equal nodes, with four virtual
machines.
With all three nodes up, resource xenA will be placed onto a node first, followed by xenD .
xenB and xenC would either be allocated together or one of them with xenD .
Learn how to create tags with your preferred cluster management tool:
monitoring-plugins
monitoring-plugins-metadata
1 The supported parameters are the same as the long options of a monitoring plug-in.
Monitoring plug-ins connect to services with the parameter hostname . Therefore the
attribute's value must be a resolvable host name or an IP address.
2 As it takes some time to get the guest operating system up and its services running,
the start timeout of the monitoring resource must be long enough.
3 A cluster resource container of type ocf:heartbeat:Xen , ocf:heartbeat:Virtu-
alDomain or ocf:heartbeat:lxc . It can either be a VM or a Linux Container.
The example above contains only one resource for the check_tcp plug-in, but multiple
resources for different plug-in types can be configured (for example, check_http or
check_udp ).
If the host names of the services are the same, the hostname parameter can also be spec-
ified for the group, instead of adding it to the individual primitives. For example:
If any of the services monitored by the monitoring plug-ins fail within the VM, the cluster
will detect that and restart the container resource (the VM). Which action to take in this
case can be configured by specifying the on-fail attribute for the service's monitoring
operation. It defaults to restart-container .
Failure counts of services will be taken into account when considering the VM's migra-
tion-threshold.
The “normal” (bare-metal) cluster nodes run the High Availability Extension.
The virtual machines run the pacemaker_remote service (almost no configuration re-
quired on the VM's side).
The cluster stack on the “normal” cluster nodes launches the VMs and connects to the
pacemaker_remote service running on the VMs to integrate them as remote nodes into
the cluster.
As the remote nodes do not have the cluster stack installed, this has the following implications:
Remote nodes are not bound by the scalability limits (Corosync has a member limit of
32 nodes).
Find more information about the pacemaker_remote service, including multiple use cases with
detailed setup instructions, in Article “Pacemaker Remote Quick Start”.
To automatically move resources away from a node in case the node runs out of disk
space, proceed as follows:
1 Which disk partitions to monitor. For example, /tmp , /usr , /var , and /dev . To
specify multiple partitions as attribute values, separate them with a blank.
2 The minimum free disk space required for those partitions. Optionally, you can spec-
ify the unit to use for measurement (in the example above, M for megabytes is used).
If not specified, min_disk_free defaults to the unit defined in the disk_unit pa-
rameter.
3 The unit in which to report the disk space.
property node-health-strategy="migrate-on-red"
After a node's health status has turned to red , solve the issue that led to the problem. Then
clear the red status to make the node eligible again for running resources. Log in to the cluster
node and use one of the following methods:
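One way to clear the status via crmsh — a sketch, assuming the node health attribute is named #health_disk (as set by the ocf:pacemaker:SysInfo agent; NODENAME is a placeholder):

```shell
crm node status-attr NODENAME delete "#health_disk"
```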
http://crmsh.github.io/documentation
Holds several documents about the crm shell, including a Getting Started tutorial for basic
cluster setup with crmsh and the comprehensive Manual for the crm shell. The latter is
available at http://crmsh.github.io/man-2.0/ . Find the tutorial at http://crmsh.github.io/
start-guide/ .
http://clusterlabs.org/
Home page of Pacemaker, the cluster resource manager shipped with the High Availability
Extension.
http://www.clusterlabs.org/pacemaker/doc/
Holds several comprehensive manuals and some shorter documents explaining general
concepts. For example:
Pacemaker Explained: Contains comprehensive and very detailed information for ref-
erence.
Colocation Explained
Ordering Explained
To configure and manage cluster resources, either use Hawk2, or the crm shell
(crmsh) command line utility. If you upgrade from an earlier version of SUSE® Lin-
ux Enterprise High Availability Extension where Hawk was installed, the package
will be replaced with the current version, Hawk2.
Hawk2's user-friendly Web interface allows you to monitor and administer your
High Availability clusters from Linux or non-Linux machines alike. Hawk2 can be
accessed from any machine inside or outside of the cluster by using a (graphical)
Web browser.
hawk2 Package
The hawk2 package must be installed on all cluster nodes you want to connect to with
Hawk2.
Web Browser
On the machine from which to access a cluster node using Hawk2, you need a (graphical)
Web browser (with JavaScript and cookies enabled) to establish the connection.
Hawk2 Service
To use Hawk2, the respective Web service must be started on the node that you want to
connect to via the Web interface. See Procedure 7.1, “Starting Hawk2 Services”.
If you have set up your cluster with the scripts from the ha-cluster-bootstrap package,
the Hawk2 service is already enabled.
1. On the node you want to connect to, open a shell and log in as root .
If you want Hawk2 to start automatically at boot time, execute the following command:
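The service commands, assuming the systemd unit is named hawk.service as in recent SLE HA versions:

```shell
systemctl enable hawk.service   # start Hawk2 automatically at boot time
systemctl start hawk.service    # start the service now
systemctl status hawk.service   # verify that it is running
```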
7.2 Logging In
The Hawk2 Web interface uses the HTTPS protocol and port 7630 .
Instead of logging in to an individual cluster node with Hawk2, you can configure a floating,
virtual IP address ( IPaddr or IPaddr2 ) as a cluster resource. It does not need any special
configuration. It allows clients to connect to the Hawk service no matter which physical node
the service is running on.
When setting up the cluster with the ha-cluster-bootstrap scripts, you will be asked whether
to configure a virtual IP for cluster administration.
1. On any machine, start a Web browser and enter the following URL:
https://HAWKSERVER:7630/
Replace HAWKSERVER with the IP address or host name of any cluster node running the
Hawk Web service. If a virtual IP address has been configured for cluster administration
with Hawk2, replace HAWKSERVER with the virtual IP address.
2. On the Hawk2 login screen, enter the Username and Password of the hacluster user (or
of any other user that is a member of the haclient group).
Monitoring
Status: Displays the current cluster status at a glance (similar to crm status on
the crmsh). For details, see Section 7.8.1, “Monitoring a Single Cluster”. If your cluster
includes guest nodes (nodes that run the pacemaker_remote daemon), they are
displayed, too. The screen refreshes in near real-time: any status changes for nodes
or resources are visible almost immediately.
Dashboard: Allows you to monitor multiple clusters (also located on different sites, in
case you have a Geo cluster setup). For details, see Section 7.8.2, “Monitoring Multiple
Clusters”. If your cluster includes guest nodes (nodes that run the pacemaker_re-
mote daemon), they are displayed, too. The screen refreshes in near real-time: any
status changes for nodes or resources are visible almost immediately.
Troubleshooting
History: Opens the History Explorer from which you can generate cluster reports. For
details, see Section 7.10, “Viewing the Cluster History”.
Configuration
Add Resource: Opens the resource configuration screen. For details, see Section 7.5,
“Configuring Cluster Resources”.
Add Constraint: Opens the constraint configuration screen. For details, see Section 7.6,
“Configuring Constraints”.
Wizards: Allows you to select from several wizards that guide you through the cre-
ation of resources for a certain workload, for example, a DRBD block device. For
details, see Section 7.5.2, “Adding Resources with the Wizard”.
Edit Configuration: Allows you to edit resources, constraints, node names and attrib-
utes, tags, alerts (http://crmsh.github.io/man/#cmdhelp_configure_alert) , and fencing
topologies (http://crmsh.github.io/man/#cmdhelp_configure_fencing_topology) .
Cluster Configuration: Allows you to modify global cluster options and resource and
operation defaults. For details, see Section 7.4, “Configuring Global Cluster Options”.
Access Control Targets: Opens a screen where you can create targets (system users)
for access control lists and assign roles to them. For details, see Procedure 12.3, “As-
signing a Role to a Target with Hawk2”.
Batch: Click to switch to batch mode. This allows you to simulate and stage changes and
to apply them as a single transaction. For details, see Section 7.9, “Using the Batch Mode”.
USERNAME : Allows you to set preferences for Hawk2 (for example, the language for the
Web interface, or whether to display a warning if STONITH is disabled).
Help: Access the SUSE Linux Enterprise High Availability Extension documentation, read
the release notes or report a bug.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
FIGURE 7.1: HAWK2—CLUSTER CONFIGURATION
3. Check the values for no-quorum-policy and stonith-enabled and adjust them, if necessary:
a. Set no-quorum-policy to the appropriate value. See Section 6.2.2, “Global Option no-
quorum-policy” for more details.
b. If you need to disable fencing for any reason, set stonith-enabled to no . By default,
it is set to true , because using STONITH devices is necessary for normal cluster
operation. With the default value, the cluster refuses to start any resources if no
STONITH resources have been configured.
c. To remove a parameter from the cluster configuration, click the Minus icon next to
the parameter. If a parameter is deleted, the cluster will behave as if that parameter
had the default value.
d. To add a new parameter to the cluster configuration, choose one from the drop-
down box.
a. To adjust a value, either select a different value from the drop-down box or edit the
value directly.
b. To add a new resource default or operation default, choose one from the empty
drop-down box and enter a value. If there are default values, Hawk2 proposes them
automatically.
c. To remove a parameter, click the Minus icon next to it. If no values are specified
for Resource Defaults and Operation Defaults, the cluster uses the default values that
are documented in Section 6.3.6, “Resource Options (Meta Attributes)” and Section 6.3.8,
“Resource Operations”.
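The same global options and defaults can also be set from the command line with crmsh. The following is a sketch only; the values shown are illustrative and must be adjusted to your cluster:

```
# Global cluster options (values are examples only)
crm configure property no-quorum-policy=stop stonith-enabled=true
# Resource defaults and operation defaults (hypothetical values)
crm configure rsc_defaults resource-stickiness=100
crm configure op_defaults timeout=60s
```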
Operations
Needed for resource monitoring. For more information, refer to Section 6.3.8, “Resource Op-
erations”.
When creating a resource, Hawk2 displays the most important resource operations ( mon-
itor , start , and stop ).
Meta Attributes
Tells the CRM how to treat a specific resource. For more information, refer to Section 6.3.6,
“Resource Options (Meta Attributes)”.
When creating a resource, Hawk2 automatically lists the important meta attributes for
that resource (for example, the target-role attribute, which defines the initial state of
a resource; by default, it is set to Stopped , so the resource does not start immediately).
Utilization
Tells the CRM what capacity a certain resource requires from a node. For more information,
refer to Section 7.6.8, “Configuring Placement of Resources Based on Load Impact”.
You can adjust the entries and values in those categories either during resource creation or later.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
3. Expand the individual categories by clicking the arrow down icon next to them and select
the desired wizard.
4. Follow the instructions on the screen. After the last configuration step, Verify the values
you have entered.
Hawk2 shows which actions it is going to perform and what the configuration looks like.
Depending on the configuration, you might be prompted for the root password before
you can Apply the configuration.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Configuration Add Resource Primitive.
4. In case a resource template exists on which you want to base the resource configuration,
select the respective Template. For details about configuring templates, see Procedure 7.6,
“Adding a Resource Template”.
5. Select the resource agent Class you want to use: lsb , ocf , service , stonith , or sys-
temd . For more information, see Section 6.3.2, “Supported Resource Agent Classes”.
6. If you selected ocf as class, specify the Provider of your OCF resource agent. The OCF
specification allows multiple vendors to supply the same resource agent.
Note
The selection you get in the Type list depends on the Class (and for OCF resources
also on the Provider) you have chosen.
FIGURE 7.3: HAWK2—PRIMITIVE RESOURCE
8. To keep the Parameters, Operations, and Meta Attributes as suggested by Hawk2, click Create
to finish the configuration. A message at the top of the screen shows if the action has
been successful.
To adjust the parameters, operations, or meta attributes, refer to Section 7.5.5, “Modifying
Resources”. To configure Utilization attributes for the resource, see Procedure 7.21, “Config-
uring the Capacity a Resource Requires”.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Configuration Add Resource Template.
4. Follow the instructions in Procedure 7.5, “Adding a Primitive Resource”, starting from Step 5.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
3. In the Operations column, click the arrow down icon next to the resource or group you
want to modify and select Edit.
The resource configuration screen opens.
4. To add a new parameter, operation, or meta attribute, select an entry from the empty
drop-down box.
5. To edit any values in the Operations category, click the Edit icon of the respective entry,
enter a different value for the operation, and click Apply.
6. When you are finished, click the Apply button in the resource configuration screen to
confirm your changes to the parameters, operations, or meta attributes.
A message at the top of the screen shows if the action has been successful.
By default, the global cluster option stonith-enabled is set to true . If no STONITH resources
have been defined, the cluster will refuse to start any resources. Configure one or more STONITH
resources to complete the STONITH setup. To add a STONITH resource for SBD, for libvirt (KVM/
Xen) or for vCenter/ESX Server, the easiest way is to use the Hawk2 wizard (see Section 7.5.2,
“Adding Resources with the Wizard”). While STONITH resources are configured similarly to other
resources, their behavior is different in some respects. For details refer to Section 10.3, “STONITH
Resources and Configuration”.
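As an alternative to the wizard, a STONITH resource for SBD can be defined in crmsh configure syntax. This is a minimal sketch; the required parameters depend on your SBD setup:

```
# Minimal SBD fencing resource (crmsh configure syntax)
primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_max=30s
```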
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Configuration Add Resource Primitive.
4. From the Class list, select the resource agent class stonith.
5. From the Type list, select the STONITH plug-in to control your STONITH device. A short
description for this plug-in is displayed.
6. Hawk2 automatically shows the required Parameters for the resource. Enter values for
each parameter.
7. Hawk2 displays the most important resource Operations and proposes default values. If you
do not modify any settings here, Hawk2 adds the proposed operations and their default
values when you confirm.
8. If there is no reason to change them, keep the default Meta Attributes settings.
To complete your fencing configuration, add constraints. For more details, refer to Chapter 10,
Fencing and STONITH.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Configuration Add Resource Group.
4. To define the group members, select one or multiple entries in the list of Children. Re-
sort group members by dragging and dropping them into the order you want by using the
“handle” icon on the right.
6. Click Create to finish the configuration. A message at the top of the screen shows if the
action has been successful.
FIGURE 7.6: HAWK2—RESOURCE GROUP
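In crmsh configure syntax, the resulting group might look like the following sketch (the member names ip-web and apache-web are hypothetical):

```
# Members start in the listed order and stop in reverse order
group g-web ip-web apache-web
```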
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Configuration Add Resource Clone.
4. From the Child Resource list, select the primitive or group to use as a sub-resource for the
clone.
6. Click Create to finish the configuration. A message at the top of the screen shows if the
action has been successful.
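A corresponding clone definition can be sketched in crmsh configure syntax (the child resource p-dlm is hypothetical):

```
# Run the hypothetical primitive p-dlm on every node
clone cl-dlm p-dlm meta interleave=true
```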
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Configuration Add Resource Multi-state.
4. From the Child Resource list, select the primitive or group to use as a sub-resource for the
multi-state resource.
6. Click Create to finish the configuration. A message at the top of the screen shows if the
action has been successful.
FIGURE 7.8: HAWK2—MULTI-STATE RESOURCE
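In crmsh configure syntax, a multi-state resource might be sketched as follows (the child resource p-drbd is hypothetical):

```
# Multi-state wrapper around a hypothetical DRBD primitive
ms ms-drbd p-drbd \
    meta master-max=1 clone-max=2 notify=true
```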
PROCEDURE 7.12: ADDING A TAG
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Configuration Add Resource Tag.
4. From the Objects list, select the resources you want to refer to with the tag.
5. Click Create to finish the configuration. A message at the top of the screen shows if the
action has been successful.
FIGURE 7.9: HAWK2—TAG
1. Log in to Hawk2:
https://HAWKSERVER:7630/
3. To change the suggested timeout values for the start or stop operation:
b. In the dialog that opens, enter a different value for the timeout parameter, for
example 10 , and confirm your change.
b. In the dialog that opens, enter a different value for the monitoring interval .
i. Select the role entry from the empty drop-down box below.
iii. Click Apply to confirm your changes and to close the dialog for the operation.
5. Confirm your changes in the resource configuration screen. A message at the top of the
screen shows if the action has been successful.
For the processes that take place if the resource monitor detects a failure, refer to Section 6.4,
“Resource Monitoring”.
To view resource failures, switch to the Status screen in Hawk2 and select the resource you are
interested in. In the Operations column click the arrow down icon and select Recent Events. The
dialog that opens lists recent actions performed for the resource. Failures are displayed in red.
To view the resource details, click the magnifier icon in the Operations column.
FIGURE 7.10: HAWK2—RESOURCE DETAILS
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Configuration Add Constraint Location.
4. From the list of Resources select the resource or resources for which to define the constraint.
5. Enter a Score. The score indicates the value you are assigning to this resource constraint.
Positive values indicate the resource can run on the Node you specify in the next step.
Negative values mean it should not run on that node. Constraints with higher scores are
applied before those with lower scores.
To force the resources to run on the node, click the arrow icon and select Always .
This sets the score to INFINITY .
If you never want the resources to run on the node, click the arrow icon and select
Never . This sets the score to -INFINITY , meaning that the resources must not run
on the node.
To set the score to 0 , click the arrow icon and select Advisory . This disables the
constraint. This is useful when you want to set resource discovery but do not want
to constrain the resources.
6. Select a Node.
7. Click Create to finish the configuration. A message at the top of the screen shows if the
action has been successful.
FIGURE 7.11: HAWK2—LOCATION CONSTRAINT
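The equivalent location constraint can be sketched in crmsh configure syntax (resource and node names are hypothetical):

```
# Prefer node alice for ip-web with score 100;
# inf: would force the placement, -inf: would forbid it
location loc-ip-web-alice ip-web 100: alice
```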
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Configuration Add Constraint Colocation.
4. Enter a Score. The score determines the location relationship between the resources. Pos-
itive values indicate that the resources should run on the same node. Negative values in-
dicate that the resources should not run on the same node. The score will be combined
with other factors to decide where to put the resource.
Some often-used values can also be set via the drop-down box:
To force the resources to run on the same node, click the arrow icon and select
Always . This sets the score to INFINITY .
If you never want the resources to run on the same node, click the arrow icon and
select Never . This sets the score to -INFINITY , meaning that the resources must
not run on the same node.
a. From the drop-down box in the Resources category, select a resource (or a template).
The resource is added and a new empty drop-down box appears beneath.
c. To swap the order of resources within the colocation constraint, click the arrow up
icon next to a resource to swap it with the entry above.
6. If needed, specify further parameters for each resource (such as Started , Stopped , Mas-
ter , Slave , Promote , Demote ): Click the empty drop-down box next to the resource
and select the desired entry.
7. Click Create to finish the configuration. A message at the top of the screen shows if the
action has been successful.
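A matching colocation constraint in crmsh configure syntax might look like this sketch (the resource names are hypothetical):

```
# Keep apache-web on the same node as ip-web
colocation col-web-with-ip inf: apache-web ip-web
```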
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Configuration Add Constraint Order.
4. Enter a Score. If the score is greater than zero, the order constraint is mandatory, otherwise
it is optional.
If you want to make the order constraint mandatory, click the arrow icon and select
Mandatory .
If you want the order constraint to be a suggestion only, click the arrow icon and
select Optional .
Serialize : To ensure that no two stop/start actions occur concurrently for the
resources, click the arrow icon and select Serialize . This makes sure that one
resource must complete starting before the other can be started. A typical use case
is resources that put a high load on the host during start-up.
5. For order constraints, you can usually keep the option Symmetrical enabled. This specifies
that resources are stopped in reverse order.
a. From the drop-down box in the Resources category, select a resource (or a template).
The resource is added and a new empty drop-down box appears beneath.
c. To swap the order of resources within the order constraint, click the arrow up icon
next to a resource to swap it with the entry above.
7. If needed, specify further parameters for each resource (like Started , Stopped , Master ,
Slave , Promote , Demote ): Click the empty drop-down box next to the resource and
select the desired entry.
8. Confirm your changes to finish the configuration. A message at the top of the screen shows
if the action has been successful.
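An order constraint like the one above can be sketched in crmsh configure syntax (the resource names are hypothetical):

```
# Start ip-web before apache-web; with the symmetrical default,
# the resources are stopped in reverse order
order ord-ip-before-web Mandatory: ip-web apache-web
```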
b. To remove a resource from the location constraint, press Ctrl and click the resource
again to deselect it.
d. You can combine multiple resources in a resource set or create multiple resource sets.
e. To unlink a resource from the resource above, click the scissors icon next to the
resource.
Colocation Explained
Ordering Explained
1. Log in to Hawk2:
https://HAWKSERVER:7630/
4. If you want to automatically expire the failcount for a resource, add the failure-timeout
meta attribute to the resource as described in Procedure 7.5: Adding a Primitive Resource, Step
4 and enter a Value for the failure-timeout .
The process flow regarding migration thresholds and failcounts is demonstrated in Example 6.8,
“Migration Threshold—Process Flow”.
Instead of letting the failcount for a resource expire automatically, you can also clean up fail-
counts for a resource manually at any time. Refer to Section 7.7.3, “Cleaning Up Resources” for
details.
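Both meta attributes can also be set directly in crmsh configure syntax; a sketch with hypothetical resource name and values:

```
# Move the resource after 3 failures; expire the failcount after 120s
primitive p-web ocf:heartbeat:apache \
    meta migration-threshold=3 failure-timeout=120s
```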
1. Log in to Hawk2:
https://HAWKSERVER:7630/
For more details and a configuration example, refer to Section 6.5.6, “Placing Resources Based on
Their Load Impact”.
Utilization attributes are used to configure both the resource's requirements and the capacity
a node provides. You first need to configure a node's capacity before you can configure the
capacity a resource requires.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
3. On the Nodes tab, select the node whose capacity you want to configure.
4. In the Operations column, click the arrow down icon and select Edit.
The Edit Node screen opens.
5. Below Utilization, enter a name for a utilization attribute into the empty drop-down box.
The name can be arbitrary (for example, RAM_in_GB ).
7. In the empty text box next to the attribute, enter an attribute value. The value must be
an integer.
8. Add as many utilization attributes as you need and add values for all of them.
9. Confirm your changes. A message at the top of the screen shows if the action has been
successful.
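The same node attribute can be sketched in crmsh configure syntax (node name and value are examples only):

```
# Node alice provides 64 units of the arbitrary capacity RAM_in_GB
node alice utilization RAM_in_GB=64
```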
Configure the capacity a certain resource requires from a node either when creating a
primitive resource or when editing an existing primitive resource.
Before you can add utilization attributes to a resource, you need to have set utilization
attributes for your cluster nodes as described in Procedure 7.20.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. To add a utilization attribute to an existing resource: Go to Manage Status and open the
resource configuration dialog as described in Section 7.7.1, “Editing Resources and Groups”.
If you create a new resource: Go to Configuration Add Resource and proceed as described
in Section 7.5.3, “Adding Simple Resources”.
4. From the empty drop-down box, select one of the utilization attributes that you have
configured for the nodes in Procedure 7.20.
5. In the empty text box next to the attribute, enter an attribute value. The value must be
an integer.
6. Add as many utilization attributes as you need and add values for all of them.
7. Confirm your changes. A message at the top of the screen shows if the action has been
successful.
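In crmsh configure syntax, the resource side of the utilization configuration might look like this sketch (resource name and value are hypothetical; required params are omitted for brevity):

```
# The hypothetical resource p-vm requires 4 units of RAM_in_GB
primitive p-vm ocf:heartbeat:VirtualDomain \
    utilization RAM_in_GB=4
```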
After you have configured the capacities your nodes provide and the capacities your resources
require, set the placement strategy in the global cluster options. Otherwise the capacity config-
urations have no effect. Several strategies are available to schedule the load: for example, you
can concentrate it on as few nodes as possible, or balance it evenly over all available nodes. For
more information, refer to Section 6.5.6, “Placing Resources Based on Their Load Impact”.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Configuration Cluster Configuration to open the respec-
tive screen. It shows global cluster options and resource and operation defaults.
When creating a resource with Hawk2, you can set its initial state with the target-role meta
attribute. If you set its value to Stopped , the resource does not start automatically after being
created.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Monitoring Status. The list of Resources also shows
the Status.
3. Select the resource to start. In its Operations column click the Start icon. To continue,
confirm the message that appears.
When the resource has started, Hawk2 changes the resource's Status to green and shows
on which node it is running.
PROCEDURE 7.24: CLEANING UP A RESOURCE
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Status. The list of Resources also shows the Status.
3. Go to the resource to clean up. In the Operations column click the arrow down button and
select Cleanup. To continue, confirm the message that appears.
This executes the command crm resource cleanup and cleans up the resource on all
nodes.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
a. From the left navigation bar, select Monitoring Status. The list of Resources also
shows the Status.
b. In the Operations column click the Stop button next to the resource.
b. In the list of Resources, go to the respective resource. From the Operations column
click the Delete icon next to the resource.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Monitoring Status. The list of Resources also shows
the Status.
4. In the Operations column click the arrow down button and select Migrate.
Away from current node: This creates a location constraint with a -INFINITY score
for the current node.
Alternatively, you can move the resource to another node. This creates a location
constraint with an INFINITY score for the destination node.
PROCEDURE 7.27: UNMIGRATING A RESOURCE
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. From the left navigation bar, select Monitoring Status. The list of Resources also shows
the Status.
4. In the Operations column click the arrow down button and select Clear. To continue, con-
firm the message that appears.
Hawk2 uses the crm_resource --clear command. The resource can move back to its
original location or it may stay where it is (depending on resource stickiness).
Errors
If errors have occurred, they are shown at the top of the page.
Resources
Shows the configured resources including their Status, Name (ID), Location (node on which
they are running), and resource agent Type. From the Operations column, you can start
or stop a resource, trigger several actions, or view details. Actions that can be triggered
Nodes
Shows the nodes belonging to the cluster site you are logged in to, including the nodes'
Status and Name. In the Maintenance and Standby columns, you can set or remove the
maintenance or standby flag for a node. The Operations column allows you to view
recent events for the node or further details: for example, if a utilization, standby, or
maintenance attribute is set for the respective node.
Tickets
Only shown if tickets have been configured (for use with Geo clustering).
FIGURE 7.15: HAWK2—CLUSTER STATUS
PREREQUISITES
All clusters to be monitored from Hawk2's Dashboard must be running SUSE Linux Enter-
prise High Availability Extension 15 SP1.
If you have not yet replaced the self-signed certificate for Hawk2 on every cluster node
with your own certificate (or a certificate signed by an official Certificate Authority),
do the following: Log in to Hawk2 on every node in every cluster at least once. Verify the
certificate (or add an exception in the browser to bypass the warning). Otherwise, Hawk2
cannot connect to the cluster.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
b. Enter the Cluster name with which to identify the cluster in the Dashboard. For ex-
ample, berlin .
c. Enter the fully qualified host name of one of the nodes in the second cluster. For
example, charlie .
4. To view more details for a cluster site or to manage it, switch to the site's tab and click
the chain icon.
Hawk2 opens the Status view for this site in a new browser window or tab. From there,
you can administer this part of the Geo cluster.
5. To remove a cluster from the dashboard, click the x icon on the right-hand side of the
cluster's details.
Staging changes to the cluster and applying them as a single transaction, instead of having
each change take effect immediately.
Simulating changes and cluster events, for example, to explore potential failure scenarios.
For example, batch mode can be used when creating groups of resources that depend on each
other. Using batch mode, you can avoid applying intermediate or incomplete configurations to
the cluster.
While batch mode is enabled, you can add or edit resources and constraints or change the cluster
configuration. It is also possible to simulate events in the cluster, including nodes going online
or offline, resource operations and tickets being granted or revoked. See Procedure 7.30, “Injecting
Node, Resource or Ticket Events” for details.
The cluster simulator runs automatically after every change and shows the expected outcome in
the user interface. For example, if you stop a resource while in batch mode, the user
interface shows the resource as stopped, although the resource is actually still running.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. To activate the batch mode, select Batch from the top-level row.
An additional bar appears below the top-level row. It indicates that batch mode is active
and contains links to actions that you can execute in batch mode.
3. While batch mode is active, perform any changes to your cluster, like adding or editing
resources and constraints or editing the cluster configuration.
The changes will be simulated and shown in all screens.
4. To view details of the changes you have made, select Show from the batch mode bar. The
Batch Mode window opens.
For any configuration changes it shows the difference between the live state and the sim-
ulated changes in crmsh syntax: Lines starting with a - character represent the current
state whereas lines starting with + show the proposed state.
5. To inject events or view even more details, see Procedure 7.30. Otherwise Close the window.
6. Choose to either Discard or Apply the simulated changes and confirm your choice. This
also deactivates batch mode and takes you back to normal mode.
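The difference view described above uses crmsh syntax; a hypothetical example for a changed IP address parameter:

```
# Live state (-) versus simulated change (+) for a hypothetical resource
-primitive ip-web IPaddr2 params ip=192.168.1.100
+primitive ip-web IPaddr2 params ip=192.168.1.101
```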
When running in batch mode, Hawk2 also allows you to inject Node Events and Resource Events.
Node Events
Let you change the state of a node. Available states are online, offline, and unclean.
Resource Events
Let you change some properties of a resource. For example, you can set an operation (like
start , stop , monitor ), the node it applies to, and the expected result to be simulated.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. If batch mode is not active yet, click Batch at the top-level row to switch to batch mode.
3. In the batch mode bar, click Show to open the Batch Mode window.
b. Select the Node you want to manipulate and select its target State.
c. Confirm your changes. Your event is added to the queue of events listed in the Batch
Mode dialog.
b. Select the Resource you want to manipulate and select the Operation to simulate.
d. Select the Node on which to run the operation and the targeted Result. Your event is
added to the queue of events listed in the Batch Mode dialog.
b. Select the Ticket you want to manipulate and select the Action to simulate.
c. Confirm your changes. Your event is added to the queue of events listed in the Batch
Mode dialog.
7. The Batch Mode dialog (Figure 7.18) shows a new line per injected event. Any event listed
here is simulated immediately and is reflected on the Status screen.
8. To remove an injected event, click the Remove icon next to it. Hawk2 updates the Status
screen accordingly.
9. To view more details about the simulation run, click Simulator and choose one of the
following:
Summary
Shows a detailed summary.
Transition Graph
Shows a graphical representation of the transition.
Transition
Shows an XML representation of the transition.
10. If you have reviewed the simulated changes, close the Batch Mode window.
11. To leave the batch mode, either Apply or Discard the simulated changes.
https://HAWKSERVER:7630/
2. From the left navigation bar, select Monitoring Status. It lists Resources and Nodes.
b. In the Operations column for the resource, click the arrow down button and select
Recent events.
Hawk2 opens a new window and displays a table view of the latest events.
Generate
Create a cluster report for a certain time. Hawk2 calls the crm report command to gen-
erate the report.
Upload
Allows you to upload crm report archives that have either been created with the crm
shell directly or even on a different cluster.
After reports have been generated or uploaded, they are shown below Reports. From the list of
reports, you can show a report's details, download or delete the report.
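From the command line, a report for a custom time frame can be generated with crm report. The times and destination below are illustrative only:

```
# Report covering a custom time frame, written to the given destination
crm report -f "2019-05-14 00:00" -t "2019-05-15 00:00" /tmp/hb_report-example
```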
1. Log in to Hawk2:
https://HAWKSERVER:7630/
139 Using the History Explorer for Cluster Reports SLE HA 15 SP1
3. To create a cluster report:
b. To modify the time frame for the report, click anywhere on the suggested time frame
and select another option from the drop-down box. You can also enter a Custom start
date, end date and hour, respectively. To start the report, click Generate.
After the report has finished, it is shown below Reports.
4. To upload a cluster report, the crm report archive must be located on a file system that
you can access with Hawk2. Proceed as follows:
5. To download or delete a report, click the respective icon next to the report in the Operations
column.
6. To view Report Details in History Explorer, click the report's name or select Show from the
Operations column.
REPORT DETAILS IN HISTORY EXPLORER
Number of transitions plus time line of all transitions in the cluster that are covered by the
report. To learn how to view more details for a transition, see Section 7.10.3.
Node events.
Resource events.
In Hawk2, you can display the name of each pe-* file plus the time and node on which it
was created. In addition, the History Explorer can visualize the following details, based on the
respective pe-* file:
Details
Shows snippets of logging data that belong to the transition. Displays the output of the
following command (including the resource agents' log messages):
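In crmsh, the corresponding transition log is typically shown with a command along these lines (a hedged sketch; the exact invocation may differ by crmsh version):

```
# Show the log for a transition, based on the respective pe-input file
crm history transition log
```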
Configuration
Shows the cluster configuration at the time that the pe-* file was created.
Diff
Shows the differences in configuration and status between the selected pe-* file and
the following one.
Graph
Shows a graphical representation of the transition. If you click Graph, the calculation is
simulated (exactly as done by pacemaker-schedulerd ) and a graphical visualization is
generated.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
3. Click the report's name or select Show from the Operations column to open the Report Details
in History Explorer.
4. To access the transition details, you need to select a transition point in the transition time
line that is shown below. Use the Previous and Next icons and the Zoom In and Zoom Out
icons to find the transition that you are interested in.
5. To display the name of a pe-input* file plus the time and node on which it was created,
hover the mouse pointer over a transition point in the time line.
6. To view the Transition Details in the History Explorer, click the transition point for which you
want to know more.
7. To show Details, Configuration, Diff, Logs or Graph, click the respective buttons to show the
content described in Transition Details in the History Explorer.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
6. Enter the root password for your cluster and click Apply. Hawk2 will generate the report.
To configure and manage cluster resources, either use the crm shell (crmsh) com-
mand line utility or Hawk2, a Web-based user interface.
This chapter introduces crm , the command line tool. It gives an overview of the tool,
explains how to use templates, and focuses on configuring and managing cluster resources:
creating basic and advanced types of resources (groups and clones), configuring
constraints, specifying failover nodes and failback nodes, configuring resource
monitoring, starting, cleaning up or removing resources, and migrating resources
manually.
Note that you need to set up /etc/sudoers so that sudo does not ask for a password.
8.1 crmsh—Overview
The crm command has several subcommands which manage resources, CIBs, nodes, resource
agents, and others. It offers a thorough help system with embedded examples. All examples
follow a naming convention described in Appendix B.
crm(live/HOSTNAME)
For readability reasons, we omit the host name in the interactive crm prompts in our
documentation. We only include the host name if you need to run the interactive shell
on a specific node, like alice for example:
crm(live/alice)
To print the syntax, its usage, and examples of the group subcommand of configure :
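One way to do this is from the system shell (equivalently, run help group inside the configure level):

```
crm configure help group
```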
Directly: Concatenate all subcommands to crm , press Enter and you see the output im-
mediately. For example, enter crm help ra to get information about the ra subcom-
mand (resource agents).
It is possible to abbreviate subcommands as long as they are unique. For example, you can
shorten status to st and crmsh still recognizes the command.
Another feature is to shorten parameters. Usually, you add parameters through the params
keyword. You can leave out the params section if it is the first and only section. For ex-
ample, this line:
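The example line referred to here is missing from the extracted text. A hedged reconstruction (the resource name and IP address are illustrative):

```
# with the params keyword:
crm(live)configure# primitive ipaddr ocf:heartbeat:IPaddr2 params ip=192.168.0.55
# identical, with params omitted, as it is the first and only section:
crm(live)configure# primitive ipaddr ocf:heartbeat:IPaddr2 ip=192.168.0.55
```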
As crm Shell Script: Crm shell scripts contain subcommands of crm . For more information,
see Section 8.1.4, “Using crmsh's Shell Scripts”.
As crmsh Cluster Scripts: These are a collection of metadata, references to RPM packages,
configuration files, and crmsh subcommands bundled under a single, yet descriptive name.
They are managed through the crm script command.
Interactive as Internal Shell: Type crm to enter the internal shell. The prompt changes
to crm(live) . With help you can get an overview of the available subcommands. As
the internal shell has different levels of subcommands, you can “enter” one by typing this
subcommand and press Enter .
For example, if you type resource you enter the resource management level. Your prompt
changes to crm(live)resource# . To leave the internal shell, use the commands quit ,
bye , or exit . If you need to go one level back, use back , up , end , or cd .
You can enter the level directly by typing crm and the respective subcommand(s) without
any options and press Enter .
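A short sketch of navigating the internal shell levels described above:

```
root # crm
crm(live)# resource           # enter the resource management level
crm(live)resource# status     # run a subcommand at this level
crm(live)resource# up         # go one level back
crm(live)# quit               # leave the internal shell
```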
The internal shell also supports tab completion for subcommands and resources. Type the
beginning of a command, press →| and crm completes the respective object.
In addition to previously explained methods, crmsh also supports synchronous command exe-
cution. Use the -w option to activate it. If you have started crm without -w , you can enable it
later with the user preference's wait set to yes ( options wait yes ). If this option is enabled,
crm waits until the transition is finished. Whenever a transition is started, dots are printed
to indicate progress. Synchronous command execution is only applicable for commands like
resource start .
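For example, to start a resource synchronously (the resource name myIP is illustrative):

```
root # crm -w resource start myIP
```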
The following subsections give you an overview of some important aspects of the crm tool.
root # crm ra
crm(live)ra#
crm(live)ra# classes
lsb
ocf / heartbeat linbit lvm2 ocfs2 pacemaker
service
stonith
systemd
To get an overview of all available resource agents for a class (and provider) use the list
command:
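A sketch of the list command, here for the ocf class and pacemaker provider (output omitted):

```
crm(live)ra# list ocf pacemaker
```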
start timeout=240
promote timeout=90
demote timeout=90
notify timeout=90
stop timeout=100
monitor_Slave_0 interval=20 timeout=20 start-delay=1m
monitor_Master_0 interval=10 timeout=20 start-delay=1m
Any line starting with the hash symbol ( # ) is a comment and is ignored. If a line is too long,
insert a backslash ( \ ) at the end and continue in the next line. It is recommended to indent
lines that belong to a certain subcommand to improve readability.
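A minimal crm shell script illustrating these rules (comment, line continuation, indentation); run it with crm -f FILE :

```
# show the cluster status, then the full configuration
status
configure \
    show
```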
In contrast to crmsh shell scripts, cluster scripts perform additional tasks like:
crmsh cluster scripts do not replace other tools for managing clusters—they provide an inte-
grated way to perform the above tasks across the cluster. Find detailed information at http://
crmsh.github.io/scripts/ .
8.1.5.1 Usage
To view the components of a script, use the show command and the name of the cluster script,
for example:
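The command referred to here is missing from the extracted text; assuming the mailto cluster script (whose parameters are listed below), it would look like:

```
root # crm script show mailto
```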
id (required) (unique)
Identifier for the cluster resource
email (required)
Email address
subject
Subject
The output of show contains a title, a short description, and a procedure. Each procedure is
divided into a series of steps, performed in the given order.
Each step contains a list of required and optional parameters, along with a short description
and its default value.
Each cluster script understands a set of common parameters. These parameters can be passed
to any script:
TABLE 8.1: COMMON PARAMETERS
mailx
The verify command prints the steps and replaces any placeholders with your given parameters.
If verify finds any problems, it reports them. If everything is ok, replace the verify command
with run :
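A hedged sketch for the mailto script (parameter values are illustrative):

```
root # crm script verify mailto id=sysadmin email=tux@example.org subject="cluster event"
root # crm script run mailto id=sysadmin email=tux@example.org subject="cluster event"
```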
Check whether your resource is integrated into your cluster with crm status :
Configuration templates are ready-made cluster configurations for crmsh. Do not confuse them
with the resource templates (as described in Section 8.3.3, “Creating Resource Templates”). Those are
templates for the cluster and not for the crm shell.
Configuration templates require minimum effort to be tailored to the particular user's needs.
Whenever a template creates a configuration, warning messages give hints which can be edited
later for further customization.
The following procedure shows how to create a simple yet functional Apache configuration:
crm(live)configure# template
crm(live)configure template# list
g-intranet
b. Display the minimum required changes that need to be filled out by you:
crm(live)configure template# show
ERROR: 23: required parameter ip not set
ERROR: 61: required parameter id not set
ERROR: 65: required parameter configfile not set
c. Invoke your preferred text editor and fill out all lines that have been displayed as
errors in Step 3.b:
crm(live)configure template# edit
4. Show the configuration and check whether it is valid (bold text depends on the configu-
ration you have entered in Step 3.c):
crm(live)configure template# show
primitive virtual-ip ocf:heartbeat:IPaddr \
params ip="192.168.1.101"
primitive apache ocf:heartbeat:apache \
params configfile="/etc/apache2/httpd.conf"
monitor apache 120s:60s
group g-intranet \
apache virtual-ip
crm(live)configure template# apply
crm(live)configure# cd ..
crm(live)configure# show
crm(live)configure# commit
If you are inside your internal crm shell, use the following command:
However, the previous command only creates its configuration from the configuration template.
It neither applies nor commits it to the CIB.
If you omit the name of the shadow CIB, a temporary name @tmp@ is created.
3. To copy the current live configuration into your shadow configuration, use the following
command, otherwise skip this step:
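The shadow CIB commands are missing from the extracted text; a hedged reconstruction (the shadow name myNewConfig matches the prompt shown below):

```
crm(live)configure# cib new myNewConfig          # create and switch to the shadow CIB
crm(myNewConfig)configure# cib reset myNewConfig # copy the live configuration into it
```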
The previous command makes it easier to modify any existing resources later.
4. Make your changes as usual. After you have created the shadow configuration, all changes
go there. To save all your changes, use the following command:
crm(myNewConfig)# commit
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
175704363 1 alice.example.com (local)
175704619 1 bob.example.com
The diff command is very helpful: It compares the Corosync configuration on all nodes (if not
stated otherwise) and prints the difference between:
replace
This option replaces the current configuration with the new source configuration.
update
This option tries to import the source configuration. It adds new items or updates existing
items to the current configuration.
push
This option imports the content from the source into the current configuration (same as
update ). However, it removes objects that are not available in the new configuration.
To load the new configuration from the file mycluster-config.txt , use the following syntax:
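A hedged sketch at the configure level (replace push with update or replace as needed):

```
crm(live)configure# load push mycluster-config.txt
```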
The previous command configures a “primitive” with the name myIP . You need to choose
a class (here ocf ), provider ( heartbeat ), and type ( IPaddr ). Furthermore, this primi-
tive expects other parameters like the IP address. Change the address to your setup.
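The command described in this paragraph is missing from the extracted text; a hedged reconstruction (the IP address is illustrative):

```
crm(live)configure# primitive myIP ocf:heartbeat:IPaddr \
     params ip=127.0.0.99 op monitor interval=60s
```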
crm(live)configure# show
crm(live)configure# commit
For example, the following command creates a new resource template with the name BigVM
derived from the ocf:heartbeat:Xen resource and some default values and operations:
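The command is missing from the extracted text; a hedged sketch (parameter and operation values are illustrative):

```
crm(live)configure# rsc_template BigVM ocf:heartbeat:Xen \
   params allow_mem_management="true" \
   op start timeout=60s \
   op stop timeout=120s \
   op monitor interval=15s timeout=60s
```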
The new primitive MyVM1 is going to inherit everything from the BigVM resource template.
For example, the equivalent of the above two would be:
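Both forms are missing from the extracted text; a hedged reconstruction (file paths are illustrative):

```
# referencing the template:
crm(live)configure# primitive MyVM1 @BigVM xmfile="/etc/xen/shared-vm/MyVM1" name="MyVM1"
# the expanded equivalent, without a template:
crm(live)configure# primitive MyVM1 ocf:heartbeat:Xen \
   params xmfile="/etc/xen/shared-vm/MyVM1" name="MyVM1" allow_mem_management="true" \
   op start timeout=60s \
   op stop timeout=120s \
   op monitor interval=15s timeout=60s
```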
If you want to overwrite some options or operations, add them to your (primitive) definition.
For example, the following new primitive MyVM2 doubles the timeout for monitor operations
but leaves others untouched:
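The command is missing from the extracted text; a hedged sketch (paths and values are illustrative):

```
crm(live)configure# primitive MyVM2 @BigVM \
   params xmfile="/etc/xen/shared-vm/MyVM2" name="MyVM2" \
   op monitor interval=30s timeout=120s
```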
A resource template may be referenced in constraints to stand for all primitives which are de-
rived from that template. This helps to produce a more concise and clear cluster configuration.
Resource template references are allowed in all constraints except location constraints. Coloca-
tion constraints may not contain more than one template reference.
3. Choose a STONITH type from the above list and view the list of possible options. Use the
following command:
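The command is missing from the extracted text; a hedged sketch, assuming the external/ipmi type used in the next step:

```
crm(live)# ra info stonith:external/ipmi
```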
4. Create the STONITH resource with the stonith class, the type you have chosen in Step
3, and the respective parameters if needed, for example:
crm(live)# configure
crm(live)configure# primitive my-stonith stonith:external/ipmi \
params hostname="alice" \
ipaddr="192.168.1.221" \
userid="admin" passwd="secret" \
op monitor interval=60m timeout=120s
The location command defines on which nodes a resource may run, may not run, or is
preferred to run.
This type of constraint may be added multiple times for each resource. All location constraints
are evaluated for a given resource. A simple example that expresses a preference to run the
resource fs1 on the node with the name alice to 100 would be the following:
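The constraint itself is missing from the extracted text; a hedged reconstruction:

```
crm(live)configure# location loc-fs1 fs1 100: alice
```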
Another use case for location constraints is grouping primitives as a resource set. This can be
useful if several resources depend on, for example, a ping attribute for network connectivity. In
former times, the -inf/ping rules needed to be duplicated several times in the configuration,
making it unnecessarily complex.
The following example creates a resource set loc-alice , referencing the virtual IP addresses
vip1 and vip2 :
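The constraint is missing from the extracted text; a hedged reconstruction:

```
crm(live)configure# location loc-alice { vip1 vip2 } inf: alice
```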
For a master slave configuration, it is necessary to know if the current node is a master in
addition to running the resource locally.
The implementation of weak-bond creates a dummy resource and a colocation constraint with
the given resources automatically.
The example used for this section would not work without additional constraints. It is essential
that all resources run on the same machine as the master of the DRBD resource. The DRBD
resource must be master before any other resource starts. Trying to mount the DRBD device
when it is not the master simply fails. The following constraints must be fulfilled:
The file system must always be on the same node as the master of the DRBD resource.
The NFS server and the IP address must be on the same node as the file system.
The NFS server and the IP address start after the file system is mounted.
The file system must be mounted on a node after the DRBD resource is promoted to master
on this node.
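A hedged sketch of constraints meeting these requirements (the resource names drbd, filesys, nfsserver and ip-nfs are illustrative):

```
crm(live)configure# colocation col-fs-on-drbd inf: filesys drbd:Master
crm(live)configure# order ord-drbd-before-fs inf: drbd:promote filesys:start
crm(live)configure# colocation col-nfs-with-fs inf: nfsserver filesys
crm(live)configure# colocation col-ip-with-fs inf: ip-nfs filesys
crm(live)configure# order ord-fs-before-nfs inf: filesys:start nfsserver:start
```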
Normally, rsc1 prefers to run on alice. If it fails there, migration-threshold is checked and com-
pared to the failcount. If failcount >= migration-threshold then it is migrated to the node with
the next best preference.
Start failures set the failcount to INFINITY depending on the start-failure-is-fatal option.
Stop failures cause fencing. If there is no STONITH defined, the resource will not migrate.
For an overview, refer to Section 6.5.4, “Failover Nodes”.
For detailed background information about the parameters and a configuration example, refer
to Section 6.5.6, “Placing Resources Based on Their Load Impact”.
2. To specify the capacity a node provides, use the following command and replace the place-
holder NODE_1 with the name of your node:
With these values, NODE_1 would be assumed to provide 16 GB of memory and 8 CPU
cores to resources.
This would make the resource consume 4096 of those memory units from NODE_1 , and
4 of the CPU units.
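The two commands described above are missing from the extracted text; a hedged reconstruction (the primitive name and path are illustrative):

```
# capacity the node provides:
crm(live)configure# node NODE_1 utilization memory=16384 cpu=8
# capacity a resource consumes:
crm(live)configure# primitive xenA ocf:heartbeat:Xen \
    params xmfile="/etc/xen/shared-vm/vm1" \
    utilization memory=4096 cpu=4
```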
minimal
Utilization values are considered when deciding if a node has enough free capacity
to satisfy a resource's requirements. An attempt is made to concentrate the resources
on as few nodes as possible (to achieve power savings on the remaining nodes).
balanced
Utilization values are considered when deciding if a node has enough free capacity
to satisfy a resource's requirements. An attempt is made to distribute the resources
evenly, thus optimizing resource performance.
crm(live)configure# commit
The following example demonstrates a three-node cluster of equal nodes, with four virtual
machines:
With all three nodes up, xenA will be placed onto a node first, followed by xenD. xenB and xenC
would either be allocated together or one of them with xenD.
If one node failed, too little total memory would be available to host them all. xenA would be
ensured to be allocated, as would xenD. However, only one of xenB or xenC could still be placed,
and since their priority is equal, the result is not defined yet. To resolve this ambiguity as well,
you would need to set a higher priority for either one.
1. Run the crm command as system administrator. The prompt changes to crm(live) .
crm(live)# configure
3. Group the primitives with their relevant identifiers in the correct order:
To change the order of a group member, use the modgroup command from the configure
subcommand. Use the following commands to move the primitive Email before Public-IP .
(This is just to demonstrate the feature):
In case you want to remove a resource from a group (for example, Email ), use this command:
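Hedged sketches of both modgroup operations (the group name g-mail is illustrative):

```
# move Email before Public-IP within the group:
crm(live)configure# modgroup g-mail add Email before Public-IP
# remove Email from the group:
crm(live)configure# modgroup g-mail remove Email
```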
In case you have lots of resources, the output of show is too verbose. To restrict the output,
use the name of the resource. For example, to list the properties of the primitive admin_addr
only, append the resource name to show :
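For example:

```
crm(live)configure# show admin_addr
```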
However, in some cases, you want to limit the output of specific resources even more. This can
be achieved with filters. Filters limit the output to specific components. For example, to list the
nodes only, use type:node :
Furthermore, to search for an object that starts with a certain string, use this notation:
To list all available types, enter crm configure show type: and press the →| key. The Bash
completion will give you a list of all types.
root # crm
crm(live)# resource
3. Start the resource with start and press the →| key to show all known resources:
crm(live)resource# start ID
For example, the output can look like this (where myIP is the relevant identifier of your
resource):
3. Delete the resource with the relevant identifier (which implies a commit too):
It shows a description and a list of all parameters and their default values. To execute a script,
use run :
If you prefer to run only one step from the suite, the describe command lists all available
steps in the Steps category.
For example, the following command executes the first step of the health command. The output
is stored in the health.json file for further investigation:
To set a password for the above mydb resource, use the following commands:
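The commands referred to here are missing from the extracted text; a hedged reconstruction (the parameter name passwd and value are illustrative):

```
crm(live)# resource
crm(live)resource# secret mydb set passwd linux
```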
Note that the parameters need to be synchronized between nodes; the crm resource secret
command will take care of that. We highly recommend using only this command to manage
secret parameters.
A cluster changes states, migrates resources, and starts important processes. All these actions
can be retrieved by subcommands of history .
By default, all history commands look at the events of the last hour. To change this time
frame, use the limit subcommand. The syntax is:
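The syntax line is missing from the extracted text; a hedged reconstruction:

```
crm(live)history# limit FROM_TIME [TO_TIME]
```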
limit 4:00pm
limit 16:00
Both commands mean the same: today at 4pm.
crm(live)history# info
Source: live
Period: 2012-01-12 14:10:56 - end
Nodes: alice
Groups:
Resources:
To limit crm report to certain parameters view the available options with the subcommand
help .
To narrow down the level of detail, use the subcommand detail with a level:
crm(live)history# detail 1
The higher the number, the more detailed your report will be. Default is 0 (zero).
After you have set the above parameters, use log to show the log messages.
To display the last transition, use the following command:
crm(live)history# transition -1
INFO: fetching new logs, please wait ...
This command fetches the logs and runs dotty (from the graphviz package) to show the
transition graph. The shell opens the log file which you can browse with the ↓ and ↑ cursor
keys.
If you do not want to open the transition graph, use the nograph option:
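For example:

```
crm(live)history# transition -1 nograph
```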
See Article “Highly Available NFS Storage with DRBD and Pacemaker” for an exhaustive example.
To get a list of all currently available STONITH devices (from the software side), use the com-
mand crm ra list stonith . If you do not find your favorite agent, install the -devel pack-
age. For more information on STONITH devices and resource agents, see Chapter 10, Fencing and
STONITH.
As yet, there is no documentation about writing STONITH agents. If you want to write new
STONITH agents, consult the examples available in the source of the cluster-glue package.
start
start or enable the resource
status
returns the status of the resource
monitor
similar to status , but checks also for unexpected states
validate
validate the resource's configuration
meta-data
returns information about the resource agent in XML
2. Create a new subdirectory for each new resource agent to avoid naming conflicts.
For example, if you have a resource group kitchen with the resource coffee_machine ,
add this resource to the directory /usr/lib/ocf/resource.d/kitchen/ . To access this
RA, execute the command crm :
3. Implement the different shell functions and save your file under a different name.
More details about writing OCF resource agents can be found at http://linux-ha.org/wiki/Re-
source_Agents . Find special information about several concepts at Chapter 1, Product Overview.
Assuming an action is considered to have failed, the following table outlines the different OCF
return codes. It also shows the type of recovery the cluster will initiate when the respective
error code is received.
5 OCF_ERR_INSTALLED The tools required by the resource are not installed on this machine. hard
root # stonith -L
or
Lights-out Devices
Lights-out devices (IBM RSA, HP iLO, Dell DRAC) are becoming increasingly popular and
may even become standard in off-the-shelf computers. However, they are inferior to UPS
devices, because they share a power supply with their host (a cluster node). If a node
stays without power, the device supposed to control it will be useless as well.
Testing Devices
Testing devices are used exclusively for testing purposes. They are usually more gentle on
the hardware. Before the cluster goes into production, they must be replaced with real
fencing devices.
The choice of the STONITH device depends mainly on your budget and the kind of hardware
you use.
pacemaker-fenced
pacemaker-fenced is a daemon which can be accessed by local processes or over the net-
work. It accepts the commands which correspond to fencing operations: reset, power-off,
and power-on. It can also check the status of the fencing device.
The pacemaker-fenced daemon runs on every node in the High Availability cluster. The
pacemaker-fenced instance running on the DC node receives a fencing request from the
pacemaker-controld . It is up to this and other pacemaker-fenced programs to carry
out the desired fencing operation.
STONITH Plug-ins
For every supported fencing device there is a STONITH plug-in which is capable of control-
ling said device. A STONITH plug-in is the interface to the fencing device. The STONITH
plug-ins contained in the cluster-glue package reside in /usr/lib64/stonith/plu-
gins on each node. (If you installed the fence-agents package, too, the plug-ins con-
tained there are installed in /usr/sbin/fence_* .) All STONITH plug-ins look the same
to pacemaker-fenced , but are quite different on the other side, reflecting the nature of
the fencing device.
Some plug-ins support more than one device. A typical example is ipmilan (or exter-
nal/ipmi ) which implements the IPMI protocol and can control any device which sup-
ports this protocol.
The list of parameters (attributes) depends on the respective STONITH type. To view a list of
parameters for a specific device, use the stonith command:
stonith -t stonith-device-type -n
For example, to view the parameters for the ibmhmc device type, enter the following:
stonith -t ibmhmc -n
To get a short help text for the device, use the -h option:
stonith -t stonith-device-type -h
configure
primitive st-ibmrsa-1 stonith:external/ibmrsa-telnet \
params nodename=alice ip_address=192.168.0.101 \
username=USERNAME password=PASSW0RD
primitive st-ibmrsa-2 stonith:external/ibmrsa-telnet \
params nodename=bob ip_address=192.168.0.102 \
username=USERNAME password=PASSW0RD
location l-st-alice st-ibmrsa-1 -inf: alice
location l-st-bob st-ibmrsa-2 -inf: bob
commit
In this example, location constraints are used for the following reason: There is always
a certain probability that the STONITH operation is going to fail. Therefore, a STONITH
operation on the node which is the executioner as well is not reliable. If the node is re-
set, it cannot send the notification about the fencing operation outcome. The only way
to do that is to assume that the operation is going to succeed and send the notification
beforehand. But if the operation fails, problems could arise. Therefore, by convention,
pacemaker-fenced refuses to terminate its host.
The configuration of a UPS type fencing device is similar to the examples above. The
details are not covered here. All UPS devices employ the same mechanics for fencing. How
the device is accessed varies. Old UPS devices only had a serial port, usually connected
at 1200 baud using a special serial cable. Many new ones still have a serial port, but often
they also use a USB or Ethernet interface. The kind of connection you can use depends
on what the plug-in supports.
For example, compare the apcmaster with the apcsmart device by using the stonith
-t stonith-device-type -h command:
stonith -t apcmaster -h
With
stonith -t apcsmart -h
The rst plug-in supports APC UPS with a network port and telnet protocol. The second
plug-in uses the APC SMART protocol over the serial line, which is supported by many
APC UPS product lines.
Kdump belongs to the Special Fencing Devices and is in fact the opposite of a fencing device.
The plug-in checks if a Kernel dump is in progress on a node. If so, it returns true, and
acts as if the node has been fenced.
The Kdump plug-in must be used in concert with another, real STONITH device, for ex-
ample, external/ipmi . For the fencing mechanism to work properly, you must specify
that Kdump is checked before a real STONITH device is triggered. Use crm configure
fencing_topology to specify the order of the fencing devices as shown in the following
procedure.
configure
primitive st-kdump stonith:fence_kdump \
params nodename="alice" \ 1
pcmk_host_check="static-list" \
pcmk_reboot_action="off" \
pcmk_monitor_action="metadata" \
pcmk_reboot_retries="1" \
timeout="60"
commit
1 Name of the node to be monitored. If you need to monitor more than one node,
configure more STONITH resources. To prevent a specific node from using a
fencing device, add location constraints.
The fencing action will be started after the timeout of the resource.
The node that does a Kdump will restart automatically after Kdump has finished.
3. Write a new initrd to include the library fence_kdump_send with network en-
abled. Use the -f option to overwrite the existing file, so the new file will be used
for the next boot process:
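The command is missing from the extracted text; a hedged reconstruction, assuming dracut is used to rebuild the initrd with the kdump module:

```
root # dracut -f -a kdump
```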
4. Open a port in the firewall for the fence_kdump resource. The default port is 7410 .
5. To achieve that Kdump is checked before triggering a real fencing mechanism (like
external/ipmi ), use a configuration similar to the following:
fencing_topology \
alice: kdump-node1 ipmi-node1 \
bob: kdump-node2 ipmi-node2
Fencing devices are an indispensable part of an HA cluster, but the less you need to use them,
the better. Power management equipment is often affected by too much broadcast traffic. Some
devices cannot handle more than ten or so connections per minute. Some get confused if two
clients try to connect at the same time. Most cannot handle more than one session at a time.
Checking the status of fencing devices once every few hours should usually be enough. The
probability that a fencing operation needs to be performed and the power switch fails is low.
For detailed information on how to configure monitor operations, refer to Section 8.3.9, “Config-
uring Resource Monitoring” for the command line approach.
external/ssh
ssh
fence_kdump
This plug-in checks if a Kernel dump is in progress on a node. If so, it returns true , and
acts as if the node has been fenced. The node cannot run any resources during the dump
anyway. This avoids fencing a node that is already down but doing a dump, which takes
some time. The plug-in must be used in concert with another, real STONITH device.
For configuration details, see Example 10.3, “Configuration of a Kdump Device”.
external/sbd
This is a self-fencing device. It reacts to a so-called “poison pill” which can be inserted into a
shared disk. On shared-storage connection loss, it stops the node from operating. Learn how
to use this STONITH agent to implement storage-based fencing in Chapter 11, Procedure 11.7,
“Configuring the Cluster to Use SBD”. See also http://www.linux-ha.org/wiki/SBD_Fencing for
more details.
meatware
meatware requires help from the user to operate. Whenever invoked, meatware logs a
CRIT severity message which shows up on the node's console. The operator then confirms
that the node is down and issues a meatclient(8) command. This tells meatware to
inform the cluster that the node should be considered dead. See /usr/share/doc/pack-
ages/cluster-glue/README.meatware for more information.
suicide
This is a software-only device, which can reboot a node it is running on, using the reboot
command. This requires action by the node's operating system and can fail under certain
circumstances. Therefore avoid using this device whenever possible. However, it is safe
to use on one-node clusters.
Diskless SBD
This configuration is useful if you want a fencing mechanism without shared storage. In
this diskless mode, SBD fences nodes by using the hardware watchdog without relying on
any shared device. However, diskless SBD cannot handle a split brain scenario for a two-
node cluster. Use this option only for clusters with more than two nodes.
To test your STONITH devices and their configuration, pull the plug once from each node
and verify that fencing the node does take place.
Use appropriate fencing devices for your setup. For details, also refer to Section 10.5, “Special
Fencing Devices”.
Configure one or more STONITH resources. By default, the global cluster option stonith-
enabled is set to true . If no STONITH resources have been defined, the cluster will refuse
to start any resources.
Do not set the global cluster option stonith-enabled to false for the following reasons:
DLM/OCFS2 will block forever waiting for a fencing operation that will never hap-
pen.
Do not set the global cluster option startup-fencing to false . By default, it is set to
true for the following reason: If a node is in an unknown state during cluster start-up,
the node will be fenced once to clarify its status.
http://www.linux-ha.org/wiki/STONITH
Information about STONITH on the home page of The High Availability Linux Project.
http://www.clusterlabs.org/pacemaker/doc/
http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html
Article explaining the concepts of split brain, quorum and fencing in HA clusters.
SBD (STONITH Block Device) provides a node fencing mechanism for Pacemak-
er-based clusters through the exchange of messages via shared block storage (SAN,
iSCSI, FCoE, etc.). This isolates the fencing mechanism from changes in firmware
version or dependencies on specific firmware controllers. SBD needs a watchdog on
each node to ensure that misbehaving nodes are really stopped. Under certain con-
ditions, it is also possible to use SBD without shared storage, by running it in disk-
less mode.
The ha-cluster-bootstrap scripts provide an automated way to set up a cluster
with the option of using SBD as fencing mechanism. For details, see the Article “In-
stallation and Setup Quick Start”. However, manually setting up SBD provides you with
more options regarding the individual settings.
This chapter explains the concepts behind SBD. It guides you through configuring
the components needed by SBD to protect your cluster from potential data corrup-
tion in case of a split brain scenario.
In addition to node level fencing, you can use additional mechanisms for storage
protection, such as LVM2 exclusive activation or OCFS2 file locking support (re-
source level fencing). They protect your system against administrative or applica-
tion faults.
SBD Partition
In an environment where all nodes have access to shared storage, a small partition of the
device is formatted for use with SBD. The size of the partition depends on the block size
of the used disk (for example, 1 MB for standard SCSI disks with 512 byte block size or
4 MB for DASD disks with 4 kB block size). The initialization process creates a message
layout on the device with slots for up to 255 nodes.
SBD Daemon
After the respective SBD daemon is configured, it is brought online on each node before
the rest of the cluster stack is started. It is terminated after all other cluster components
have been shut down, thus ensuring that cluster resources are never activated without SBD
supervision.
Messages
The daemon automatically allocates one of the message slots on the partition to itself,
and constantly monitors it for messages addressed to itself. Upon receipt of a message, the
daemon immediately complies with the request, such as initiating a power-o or reboot
cycle for fencing.
Also, the daemon constantly monitors connectivity to the storage device, and terminates
itself in case the partition becomes unreachable. This guarantees that it is not disconnected
from fencing messages. If the cluster data resides on the same logical unit in a different
partition, this is not an additional point of failure: The workload will terminate anyway
if the storage connectivity has been lost.
Watchdog
Whenever SBD is used, a correctly working watchdog is crucial. Modern systems support
a hardware watchdog that needs to be “tickled” or “fed” by a software component. The
software component (in this case, the SBD daemon) “feeds” the watchdog by regularly
writing a service pulse to the watchdog. If the daemon stops feeding the watchdog, the
hardware will enforce a system restart. This protects against failures of the SBD process
itself, such as dying, or becoming stuck on an I/O error.
2. Depending on your scenario, either use SBD with one to three devices or in diskless mode.
For an outline, see Section 11.4, “Number of SBD Devices”. The detailed setup is described in:
11.3 Requirements
You can use up to three SBD devices for storage-based fencing. When using one to three
devices, the shared storage must be accessible from all nodes.
The path to the shared storage device must be persistent and consistent across all nodes in
the cluster. Use stable device names such as /dev/disk/by-id/dm-uuid-part1-mpath-
abcedf12345 .
The shared storage can be connected via Fibre Channel (FC), Fibre Channel over Ethernet
(FCoE), or even iSCSI.
The shared storage segment must not use host-based RAID, LVM2, or DRBD*. DRBD can
be split, which is problematic for SBD, as there cannot be two states in SBD. Cluster mul-
ti-device (Cluster MD) cannot be used for SBD.
An SBD device can be shared between different clusters, as long as no more than 255 nodes
share the device.
For clusters with more than two nodes, you can also use SBD in diskless mode.
One Device
The simplest implementation. It is appropriate for clusters where all of your data is
on the same shared storage.
Two Devices
This configuration is primarily useful for environments that use host-based mirroring but
where no third storage device is available. SBD will not terminate itself if it loses access
to one mirror leg, allowing the cluster to continue. However, since SBD does not have
enough knowledge to detect an asymmetric split of the storage, it will not fence the other
side while only one mirror leg is available. Thus, it cannot automatically tolerate a second
failure while one of the storage arrays is down.
Three Devices
The most reliable configuration. It is resilient against outages of one device—be it because
of failures or maintenance. SBD will terminate itself only if more than one device is lost
and if required, depending on the status of the cluster partition or node. If at least two
devices are still accessible, fencing messages can be successfully transmitted.
This configuration is suitable for more complex scenarios where storage is not restricted
to a single array. Host-based mirroring solutions can have one SBD per mirror leg (not
mirrored itself), and an additional tie-breaker on iSCSI.
Diskless
This configuration is useful if you want a fencing mechanism without shared storage. In
this diskless mode, SBD fences nodes by using the hardware watchdog without relying on
any shared device. However, diskless SBD cannot handle a split brain scenario for a two-
node cluster. Use this option only for clusters with more than two nodes.
Watchdog Timeout
This timeout is set during initialization of the SBD device. It depends mostly on your storage
latency. The majority of devices must be successfully read within this time. Otherwise, the
node might self-fence.
msgwait Timeout
This timeout is set during initialization of the SBD device. It defines the time after which
a message written to a node's slot on the SBD device is considered delivered. The timeout
should be long enough for the node to detect that it needs to self-fence.
However, if the msgwait timeout is relatively long, a fenced cluster node might rejoin
before the fencing action returns. This can be mitigated by setting the SBD_DELAY_START
parameter in the SBD configuration, as described in Procedure 11.4 in Step 4.
If you change the watchdog timeout, you need to adjust the other two timeouts as well. The
following “formula” expresses the relationship between these three values:
EXAMPLE 11.1: FORMULA FOR TIMEOUT CALCULATION
Timeout (msgwait) >= (Timeout (watchdog) * 2)
stonith-timeout = Timeout (msgwait) + 20%
For example, if you set the watchdog timeout to 120 , set the msgwait timeout to 240 and the
stonith-timeout to 288 .
If you use the ha-cluster-bootstrap scripts to set up a cluster and to initialize the SBD device,
the relationship between these timeouts is automatically considered.
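The relationship above can be checked with plain shell arithmetic. This is a sketch; the 20% margin for stonith-timeout follows the example values given in this section:

```shell
#!/bin/sh
# Derive msgwait and stonith-timeout from a given watchdog timeout.
WATCHDOG_TIMEOUT=120

# msgwait must be at least twice the watchdog timeout
MSGWAIT=$(( WATCHDOG_TIMEOUT * 2 ))

# stonith-timeout adds a 20% safety margin on top of msgwait
STONITH_TIMEOUT=$(( MSGWAIT + MSGWAIT / 5 ))

echo "msgwait=${MSGWAIT} stonith-timeout=${STONITH_TIMEOUT}"
```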
Hardware   Driver
HP         hpwdt
Generic    softdog
1. List the drivers that have been installed with your kernel version:
2. List any watchdog modules that are currently loaded in the kernel:
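The two listing steps above might look as follows. This is a sketch; the driver directory layout is assumed from standard kernel packaging:

```shell
# 1. Watchdog drivers shipped with the running kernel
ls -1 /lib/modules/"$(uname -r)"/kernel/drivers/watchdog

# 2. Watchdog modules currently loaded in the kernel
lsmod | grep -E '(wd|dog)'
```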
Before you start, make sure the block device or devices you want to use for SBD meet the
requirements specified in Section 11.3.
When setting up the SBD devices, you need to take several timeout values into account. For
details, see Section 11.5, “Calculation of Timeouts”.
To use SBD with shared storage, you must first create the messaging layout on one to three
block devices. The sbd create command will write a metadata header to the specified
device or devices. It will also initialize the messaging slots for up to 255 nodes. If executed
without any further options, the command will use the default timeout settings.
(Replace /dev/SBD with your actual path name, for example: /dev/disk/by-id/scsi-ST2000DM001-0123456_Wabcdefg .)
To use more than one device for SBD, specify the -d option multiple times, for example:
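For example, initializing a single device with default timeouts, and a three-device setup in one step (device names are placeholders):

```shell
# Initialize one SBD device with default timeouts
sbd -d /dev/SBD create

# Initialize three SBD devices in one step
sbd -d /dev/SBD1 -d /dev/SBD2 -d /dev/SBD3 create
```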
3. If your SBD device resides on a multipath group, use the -1 and -4 options to adjust the
timeouts to use for SBD. For details, see Section 11.5, “Calculation of Timeouts”. All timeouts
are given in seconds:
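A command matching the timeout values explained below might look like this (a sketch; the device name is a placeholder):

```shell
sbd -d /dev/SBD -4 180 -1 90 create
```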
1 The -4 option is used to specify the msgwait timeout. In the example above, it is
set to 180 seconds.
2 The -1 option is used to specify the watchdog timeout. In the example above, it
is set to 90 seconds. The minimum allowed value for the emulated watchdog is 15
seconds.
As you can see, the timeouts are also stored in the header, to ensure that all participating
nodes agree on them.
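To inspect the timeouts stored in the device header, you can dump the metadata (a sketch; replace the device name):

```shell
sbd -d /dev/SBD dump
```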
After you have initialized the SBD devices, edit the SBD configuration file, then enable and start
the respective services for the changes to take effect.
SBD_DEVICE="/dev/SBD"
If you need to specify multiple devices in the first line, separate them with semicolons
(the order of the devices does not matter):
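For example (a configuration sketch with placeholder device names):

```shell
SBD_DEVICE="/dev/SBD1;/dev/SBD2;/dev/SBD3"
```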
If the SBD device is not accessible, the daemon will fail to start and inhibit cluster start-up.
It will be started together with the Corosync service whenever the Pacemaker service is
started.
1. The following command will dump the node slots and their current messages from the
SBD device:
Now you should see all cluster nodes that have ever been started with SBD listed here.
For example, if you have a two-node cluster, the message slot should show clear for
both nodes:
0 alice clear
1 bob clear
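The sending side of this test (step 2) is not shown above. A sketch, using the node names from the example (run on bob to message alice):

```shell
sbd -d /dev/SBD message alice test
```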
3. The node will acknowledge the receipt of the message in the system log files:
May 03 16:08:31 alice sbd[66139]: /dev/SBD: notice: servant: Received command test
from bob on disk /dev/SBD
As a final step, you need to adjust the cluster configuration as described in Procedure 11.7.
To configure the use of SBD in the cluster, you need to do the following in the cluster
configuration:
1 This is the default configuration, because clusters without STONITH are not supported.
However, if STONITH has been deactivated for testing purposes, make sure this
parameter is set to true again.
2 If not explicitly set, this value defaults to 0 , which is appropriate for use of SBD with
one to three devices.
3 A stonith-timeout value of 40 would be appropriate if the msgwait timeout value
for SBD was set to 30 seconds.
4. For a two-node cluster, decide if you want predictable or random delays. For other cluster
setups you do not need to set this parameter.
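The cluster properties and the SBD STONITH resource described in this procedure can be sketched in the crm shell as follows. The values are examples taken from the callouts above; pcmk_delay_max adds a random fencing delay, which is the option for two-node clusters that choose random delays:

```
crm(live)configure# property stonith-enabled="true"
crm(live)configure# property stonith-watchdog-timeout=0
crm(live)configure# property stonith-timeout="40s"
crm(live)configure# primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_max=30
```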
6. Submit your changes with commit and leave the crm live configuration with exit .
After the resource has started, your cluster is successfully configured for use of SBD. It will use
this method in case a node needs to be fenced.
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_DELAY_START=no
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
The SBD_DEVICE entry is not needed as no shared disk is used. When this parameter is
missing, the sbd service does not start any watcher process for SBD devices.
It will be started together with the Corosync service whenever the Pacemaker service is
started.
5. Run crm configure and set the following cluster properties on the crm shell:
1 This is the default configuration, because clusters without STONITH are not supported.
However, if STONITH has been deactivated for testing purposes, make sure this
parameter is set to true again.
2 For diskless SBD, this parameter must not equal zero. It defines after how long the
fencing target is assumed to have already self-fenced. Therefore its value needs to be
>= the value of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd . Starting with
7. Submit your changes with commit and leave the crm live configuration with exit .
Check if the node is fenced and if the other nodes consider the node as fenced after the
stonith-watchdog-timeout .
The node proactively self-fences. The other nodes notice the loss of the node and
consider it has self-fenced after the stonith-watchdog-timeout .
2. Let the monitoring operation fail (for example, by terminating the respective dae-
mon, if the resource relates to a service).
This failure triggers a fencing action.
Before you proceed, check if your disk supports persistent reservations. Use the following com-
mand (replace DISK with your device name):
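The check might be done with sg_persist from the sg3_utils package (a sketch; replace the device name with your disk):

```shell
sg_persist -n --in --read-reservation -d /dev/sdc1
```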
Supported disk:
Unsupported disk:
If you get an error message (like the one above), replace the old disk with a SCSI-compatible
disk. Otherwise proceed as follows:
1. To create the primitive resource sg_persist , run the following commands as root :
crm(live)configure# ms ms-sg sg \
meta master-max=1 notify=true
3. Do some tests. When the resource is in master/slave status, you can mount and write on
/dev/sdc1 on the cluster node where the master instance is running, while you cannot
write to it on the cluster node where the slave instance is running.
5. Add the following order relationship plus a collocation between the sg_persist master
and the file system resource:
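A hedged sketch, assuming a file system resource named fs and the ms-sg multi-state resource from the earlier step (resource and constraint IDs are illustrative):

```
crm(live)configure# order o-ms-sg-before-fs Mandatory: ms-sg:promote fs:start
crm(live)configure# colocation col-fs-with-ms-sg inf: fs ms-sg:Master
```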
11.10.2.1 Overview
In a shared storage environment, a small partition of the storage is set aside for storing one or
more locks.
Before acquiring protected resources, the node must first acquire the protecting lock. The order-
ing is enforced by Pacemaker. The sfex component ensures that even if Pacemaker were subject
to a split brain situation, the lock will never be granted more than once.
These locks must also be refreshed periodically, so that a node's death does not permanently
block the lock and other nodes can proceed.
In the following, learn how to create a shared partition for use with sfex and how to configure
a resource for the sfex lock in the CIB. A single sfex partition can hold any number of locks,
and needs 1 KB of storage space allocated per lock. By default, sfex_init creates one lock
on the partition.
Important: Requirements
The shared partition for sfex should be on the same logical unit as the data you
want to protect.
The shared sfex partition must not use host-based RAID, nor DRBD.
1. Create a shared partition for use with sfex. Note the name of this partition and use it as
a substitute for /dev/sfex below.
1. The sfex lock is represented via a resource in the CIB, configured as follows:
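A minimal sketch of such a resource, assuming the partition /dev/sfex and lock index 1 (parameter values are illustrative, not mandatory defaults):

```
crm(live)configure# primitive sfex_1 ocf:heartbeat:sfex \
    params device="/dev/sfex" index="1" \
    op monitor interval="10s" timeout="30s"
```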
3. If using group syntax, add the sfex resource as the first resource to the group:
http://www.linux-ha.org/wiki/SBD_Fencing
The cluster administration tools like crm shell (crmsh) or Hawk2 can be used by
root or any user in the group haclient . By default, these users have full read/
write access. To limit access or assign more fine-grained access rights, you can use
Access control lists (ACLs).
Access control lists consist of an ordered set of access rules. Each rule allows read
or write access or denies access to a part of the cluster configuration. Rules are typi-
cally combined to produce a specific role, then users may be assigned to a role that
matches their tasks.
Ensure you have the same users on all nodes in your cluster, either by using NIS, Active
Directory, or by manually adding the same users to all nodes.
All users for whom you want to modify access rights with ACLs must belong to the
haclient group.
If non-privileged users want to run crmsh, their PATH variable needs to be extended with
/usr/sbin .
If ACLs are not enabled, root and all users belonging to the haclient group have
full read/write access to the cluster configuration.
Even if ACLs are enabled and configured, both root and the default CRM owner
hacluster always have full access to the cluster configuration.
To use ACLs you need some knowledge about XPath. XPath is a language for selecting nodes
in an XML document. Refer to http://en.wikipedia.org/wiki/XPath or look into the specification
at http://www.w3.org/TR/xpath/ .
Alternatively, use Hawk2 as described in Procedure 12.1, “Enabling Use of ACLs with Hawk2”.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
2. In the left navigation bar, select Cluster Configuration to display the global cluster options
and their current values.
3. Below Cluster Configuration click the empty drop-down box and select enable-acl to add the
parameter. It is added with its default value No .
a specification where to apply the rule. This specification can be a type, an ID reference,
or an XPath expression.
Usually, it is convenient to bundle ACLs into roles and assign a specific role to system users (ACL
targets). There are two methods to create ACL rules:
Section 12.3.1, “Setting ACL Rules via XPath Expressions”. You need to know the structure of the
underlying XML to create ACL rules.
Section 12.3.2, “Setting ACL Rules via Abbreviations”. Create a shorthand syntax and ACL rules
to apply to the matched objects.
<cib num_updates="59"
     dc-uuid="175704363"
     crm_feature_set="3.0.9"
     validate-with="pacemaker-2.0"
     epoch="96"
     admin_epoch="0"
     cib-last-written="Fri Aug 8 13:47:28 2014"
     have-quorum="1">
  <configuration>
With the XPath language you can locate nodes in this XML document. For example, to select
the root node ( cib ) use the XPath expression /cib . To locate the global cluster configurations,
use the XPath expression /cib/configuration/crm_config .
As an example, Table 12.1, “Operator Role—Access Types and XPath Expressions” shows the parame-
ters (access type and XPath expression) to create an “operator” role. Users with this role can
only execute the tasks mentioned in the second column—they cannot reconfigure any resources
(for example, change parameters or operations), nor change the configuration of colocation or
ordering constraints.
Type   XPath/Explanation
Write  //crm_config//nvpair[@name='maintenance-mode']
Write  //op_defaults//nvpair[@name='record-pending']
Write  //nodes/node//nvpair[@name='standby']
Write  //resources//nvpair[@name='target-role']
Write  //resources//nvpair[@name='maintenance']
Write  //constraints/rsc_location
Read   /cib
XPath Expression             Abbreviation
//*[@id="rsc1"]              ref:"rsc1"
//constraints/rsc_location   type:"rsc_location"
1. Log in to Hawk2:
https://HAWKSERVER:7630/
3. Click Create.
7. Click Create.
8. If necessary, add more rules by clicking the plus icon and specifying the respective
parameters.
To assign the role we created in Procedure 12.2 to a system user (target), proceed as follows:
1. Log in to Hawk2:
https://HAWKSERVER:7630/
3. To create a system user (ACL Target), click Create and enter a unique Target ID, for exam-
ple, tux . Make sure this user belongs to the haclient group.
To configure access rights for resources or constraints, you can also use the abbreviated syntax
as explained in Section 12.3.2, “Setting ACL Rules via Abbreviations”.
1. Log in as root .
The previous command creates a new role with the name monitor , sets the read
rights and applies it to all elements in the CIB by using the XPath expression /cib .
If necessary, you can add more access rights and XPath arguments.
4. Assign your roles to one or multiple ACL targets, which are the corresponding system
users. Make sure they belong to the haclient group.
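The steps above can be sketched in the crm shell like this (the role name monitor and the user tux are the examples used in this section):

```
crm(live)configure# acl_role monitor \
    read xpath:"/cib"
crm(live)configure# acl_target tux monitor
```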
crm(live)configure# show
crm(live)configure# commit
To configure access rights for resources or constraints, you can also use the abbreviated syntax
as explained in Section 12.3.2, “Setting ACL Rules via Abbreviations”.
2. In the Network Settings, switch to the Overview tab, which shows the available devices.
b. In the Address tab of the Network Card Setup dialog that opens, select the option No
Link and IP Setup (Bonding Slaves).
a. Click Add and set the Device Type to Bond. Proceed with Next.
b. Select how to assign the IP address to the bonding device. Three methods are at your
disposal:
Use the method that is appropriate for your environment. If Corosync manages vir-
tual IP addresses, select Statically assigned IP Address and assign an IP address to the
interface.
d. It shows any Ethernet devices that have been configured as bonding slaves in Step
3.b. To select the Ethernet devices that you want to include into the bond, below
Bond Slaves and Order activate the check box in front of the respective devices.
balance-rr
Provides load balancing and fault tolerance, at the cost of out-of-order packet
transmission. This may cause delays, for example, for TCP reassembly.
active-backup
Provides fault tolerance.
balance-xor
Provides load balancing and fault tolerance.
broadcast
Provides fault tolerance.
802.3ad
Provides dynamic link aggregation if supported by the connected switch.
balance-tlb
Provides load balancing for outgoing traffic.
balance-alb
Provides load balancing for incoming and outgoing traffic, if the network de-
vices used allow the modifying of the network device's hardware address while
in use.
5. Click Next and leave YaST with OK to finish the configuration of the bonding device. YaST
writes the configuration to /etc/sysconfig/network/ifcfg-bondDEVICENUMBER .
If you prefer manual configuration instead, refer to the SUSE Linux Enterprise Server
SUSE Linux Enterprise High Availability Extension Administration Guide, chapter Basic
Networking, section Hotplugging of Bonding Slaves.
2. In the Network Settings, switch to the Overview tab, which shows the already configured
devices. If bonding slaves are already configured, the Note column shows it.
a. Select the device to change and click Edit. The Network Card Setup dialog opens.
b. Switch to the General tab and make sure that Activate device is set to On Hotplug .
d. For the Udev rules, click Change and select the BusID option.
e. Click OK and Next to return to the Overview tab in the Network Settings dialog. If
you click the Ethernet device entry now, the bottom pane shows the device's details,
including the bus ID.
At boot time, the network setup does not wait for the hotplug slaves, but for the bond to become
ready, which needs at least one available slave. When one of the slave interfaces is removed
from the system (unbind from NIC driver, rmmod of the NIC driver or true PCI hotplug removal),
the Kernel removes it from the bond automatically. When a new card is added to the system
(replacement of the hardware in the slot), udev renames it by applying the bus-based persistent
name rule and calls ifup for it. The ifup call automatically joins it into the bond.
Load Balancing makes a cluster of servers appear as one large, fast server to out-
side clients. This apparent single server is called a virtual server. It consists of one or
more load balancers dispatching incoming requests and several real servers running
the actual services. With a load balancing setup of High Availability Extension, you
can build highly scalable and highly available network services, such as Web, cache,
mail, FTP, media and VoIP services.
Round Robin. The simplest strategy is to direct each connection to a different address,
taking turns. For example, a DNS server can have several entries for a given host name.
With DNS round robin, the DNS server will return all of them in a rotating order. Thus
different clients will see different addresses.
Selecting the “best” server. Although this has several drawbacks, balancing could be
implemented with a “first server that responds” or “least loaded server” approach.
Balance number of connections per server. A load balancer between users and servers
can divide the number of users across multiple servers.
URI. Inspect the HTTP content and dispatch to a server most suitable for this specific URI.
URL parameter, RDP cookie. Inspect the HTTP content for a session parameter, possibly
in post parameters, or the RDP (remote desktop protocol) session cookie, and dispatch to
the server serving this session.
Although there is some overlap, HAProxy can be used in scenarios where LVS/ipvsadm is not
adequate and vice versa:
SSL termination. The front-end load balancers can handle the SSL layer. Thus the cloud
nodes do not need to have access to the SSL keys, or could take advantage of SSL acceler-
ators in the load balancers.
Application level. HAProxy operates at the application level, allowing the load balancing
decisions to be influenced by the content stream. This allows for persistence based on
cookies and other such filters.
LVS supports “direct routing”, where the load balancer is only in the inbound stream,
whereas the outbound traffic is routed to the clients directly. This allows for potentially
much higher throughput in asymmetric environments.
LVS supports stateful connection table replication (via conntrackd ). This allows for load
balancer failover that is transparent to the client and server.
14.2.1 Director
The main component of LVS is the ip_vs (or IPVS) Kernel code. It is part of the default Kernel
and implements transport-layer load balancing inside the Linux Kernel (layer-4 switching). The
node that runs a Linux Kernel including the IPVS code is called director. The IPVS code running
on the director is the essential feature of LVS.
When clients connect to the director, the incoming requests are load-balanced across all cluster
nodes: The director forwards packets to the real servers, using a modified set of routing rules that
make the LVS work. For example, connections do not originate or terminate on the director, and it
does not send acknowledgments. The director acts as a specialized router that forwards packets
from end users to real servers (the hosts that run the applications that process the requests).
228 Configuring Load Balancing with Linux Virtual Server SLE HA 15 SP1
14.2.3 Packet Forwarding
There are three methods the director can use to forward packets from the client to the
real servers:
Direct Routing
Packets from end users are forwarded directly to the real server. The IP packet is not
modified, so the real servers must be configured to accept traffic for the virtual server's IP
address. The response from the real server is sent directly to the client. The real servers
and load balancers need to be in the same physical network segment.
The default installation does not include the configuration file /etc/ha.d/ldirectord.cf .
This file is created by the YaST module. The tabs available in the YaST module correspond to
the structure of the /etc/ha.d/ldirectord.cf configuration file, defining global options and
defining the options for the virtual services.
For an example configuration and the resulting processes between load balancers and real
servers, refer to Example 14.1, “Simple ldirectord Configuration”.
The following procedure describes how to configure the most important global parameters.
For more details about the individual parameters (and the parameters not covered here),
click Help or refer to the ldirectord man page.
1. With Check Interval, define the interval in which ldirectord will connect to each of the
real servers to check if they are still online.
2. With Check Timeout, set the time in which the real server should have responded after
the last check.
3. With Failure Count you can define how many times ldirectord will attempt to request
the real servers until the check is considered failed.
5. In Fallback, enter the host name or IP address of the Web server onto which to redirect a
Web service in case all real servers are down.
6. If you want the system to send alerts in case the connection status to any real server
changes, enter a valid e-mail address in Email Alert.
7. With Email Alert Frequency, define after how many seconds the e-mail alert should be
repeated if any of the real servers remains inaccessible.
8. In Email Alert Status specify the server states for which e-mail alerts should be sent. If you
want to define more than one state, use a comma-separated list.
9. With Auto Reload define whether ldirectord should continuously monitor the configuration
file for modification. If set to yes , the configuration is automatically reloaded upon
changes.
10. With the Quiescent switch, define whether to remove failed real servers from the Kernel's
LVS table or not. If set to Yes, failed servers are not removed. Instead their weight is set to
0 which means that no new connections will be accepted. Already established connections
will persist until they time out.
11. If you want to use an alternative path for logging, specify a path for the log files in Log
File. By default, ldirectord writes its log files to /var/log/ldirectord.log .
You can configure one or more virtual services by defining a couple of parameters for
each. The following procedure describes how to configure the most important parameters
for a virtual service. For more details about the individual parameters (and the parameters
not covered here), click Help or refer to the ldirectord man page.
1. In the YaST IP Load Balancing module, switch to the Virtual Server Configuration tab.
2. Add a new virtual server or Edit an existing virtual server. A new dialog shows the available
options.
3. In Virtual Server enter the shared virtual IP address (IPv4 or IPv6) and port under which
the load balancers and the real servers are accessible as LVS. Instead of IP address and
port number you can also specify a host name and a service. Alternatively, you can also
use a firewall mark. A firewall mark is a way of aggregating an arbitrary collection of
VIP:port services into one virtual service.
4. To specify the Real Servers, you need to enter the IP addresses (IPv4, IPv6, or host names)
of the servers, the ports (or service names) and the forwarding method. The forwarding
method must either be gate , ipip or masq , see Section 14.2.3, “Packet Forwarding”.
Click the Add button and enter the required arguments for each real server.
5. As Check Type, select the type of check that should be performed to test if the real servers
are still alive. For example, to send a request and check if the response contains an expected
string, select Negotiate .
6. If you have set the Check Type to Negotiate , you also need to define the type of service
to monitor. Select it from the Service drop-down box.
7. In Request, enter the URI to the object that is requested on each real server during the
check intervals.
8. If you want to check if the response from the real servers contains a certain string (“I'm
alive” message), define a regular expression that needs to be matched. Enter the regular
expression into Receive. If the response from a real server contains this expression, the real
server is considered to be alive.
9. Depending on the type of Service you have selected in Step 6, you also need to specify
further parameters for authentication. Switch to the Auth type tab and enter the details
like Login, Password, Database, or Secret. For more information, refer to the YaST help text
or to the ldirectord man page.
11. Select the Scheduler to be used for load balancing. For information on the available sched-
ulers, refer to the ipvsadm(8) man page.
12. Select the Protocol to be used. If the virtual service is specified as an IP address and port,
it must be either tcp or udp . If the virtual service is specified as a firewall mark, the
protocol must be fwm .
13. Define further parameters, if needed. Confirm your configuration with OK. YaST writes
the configuration to /etc/ha.d/ldirectord.cf .
The values shown in Figure 14.1, “YaST IP Load Balancing—Global Parameters” and Figure 14.2,
“YaST IP Load Balancing—Virtual Services”, would lead to the following configuration, defined
in /etc/ha.d/ldirectord.cf :
autoreload = yes 1
checkinterval = 5 2
checktimeout = 3 3
quiescent = yes 4
virtual = 192.168.0.200:80 5
checktype = negotiate 6
fallback = 127.0.0.1:80 7
protocol = tcp 8
scheduler = wlc 12
service = http 13
1 Defines that ldirectord should continuously check the configuration file for
modification.
2 Interval in which ldirectord will connect to each of the real servers to check if
they are still online.
3 Time in which the real server should have responded after the last check.
4 Defines not to remove failed real servers from the Kernel's LVS table, but to set their
weight to 0 instead.
5 Virtual IP address (VIP) of the LVS. The LVS is available at port 80 .
6 Type of check that should be performed to test if the real servers are still alive.
7 Server onto which to redirect a Web service in case all real servers for this service are down.
8 Protocol to be used.
9 Two real servers defined, both available at port 80 . The packet forwarding method
is gate , meaning that direct routing is used.
10 Regular expression that needs to be matched in the response string from the real
server.
11 URI to the object that is requested on each real server during the check intervals.
12 Selected scheduler to be used for load balancing.
13 Type of service to monitor.
This configuration would lead to the following process flow: The ldirectord will connect
to each real server once every 5 seconds ( 2 ) and request 192.168.0.110:80/test.html
or 192.168.0.120:80/test.html as specified in 9 and 11 . If it does not receive the
expected still alive string ( 10 ) from a real server within 3 seconds ( 3 ) of the last
check, it will remove the real server from the available pool. However, because of the
quiescent=yes setting ( 4 ), the real server will not be removed from the LVS table.
Instead its weight will be set to 0 so that no new connections to this real server will be
accepted. Already established connections will be persistent until they time out.
The real servers are set up correctly to provide the needed services.
The load balancing server (or servers) must be able to route traffic to the real servers using
IP forwarding. The network configuration of the real servers depends on which packet
forwarding method you have chosen.
To prevent the load balancing server (or servers) from becoming a single point of failure
for the whole system, you need to set up one or several backups of the load balancer. In the
cluster configuration, configure a primitive resource for ldirectord , so that ldirectord
can fail over to other servers in case of hardware failure.
As the backup of the load balancer also needs the ldirectord configuration file to fulfill
its task, make sure the /etc/ha.d/ldirectord.cf is available on all servers that you
want to use as backup for the load balancer. You can synchronize the configuration file
with Csync2 as described in Section 4.5, “Transferring the Configuration to All Nodes”.
Our servers (usually for Web content) www.example1.com (IP: 192.168.1.200 ) and
www.example2.com (IP: 192.168.1.201 )
global 1
maxconn 256
daemon
defaults 2
log global
mode http
option httplog
option dontlognull
retries 3
option redispatch
maxconn 2000
timeout connect 5000 3
frontend LB
bind 192.168.1.99:80 6
backend LB
mode http
stats enable
stats hide-version
stats uri /stats
stats realm Haproxy\ Statistics
stats auth haproxy:password 7
balance roundrobin 8
option httpclose
option forwardfor
cookie LB insert
option httpchk GET /robots.txt HTTP/1.0
server web1-srv 192.168.1.200:80 cookie web1-srv check
server web2-srv 192.168.1.201:80 cookie web2-srv check
maxconn
Maximum per-process number of concurrent connections.
daemon
Recommended mode, HAProxy runs in the background.
redispatch
Enables or disables session redistribution in case of connection failure.
log
Enables logging of events and traffic.
mode http
Operates in HTTP mode (recommended mode for HAProxy). In this mode, a
request will be analyzed before a connection to any server is performed. Requests
that are not RFC-compliant will be rejected.
option forwardfor
Adds the HTTP X-Forwarded-For header into the request. You need this option
if you want to preserve the client's IP address.
3 The maximum time to wait for a connection attempt to a server to succeed.
4 The maximum time of inactivity on the client side.
5 The maximum time of inactivity on the server side.
6 Section which combines front-end and back-end sections in one.
balance leastconn
Defines the load balancing algorithm, see http://cbonte.github.io/haproxy-dconv/
configuration-1.5.html#4-balance .
stats enable ,
stats auth
Enables statistics reporting (by stats enable ). The auth option logs statistics
with authentication to a specific account.
7 Credentials for HAProxy Statistic report page.
8 Load balancing will work in a round-robin process.
include /etc/haproxy/haproxy.cfg
5. Synchronize it:
Note
The Csync2 configuration part assumes that the HA nodes were configured using
ha-cluster-bootstrap . For details, see the Installation and Setup Quick Start.
6. Make sure HAProxy is disabled on both load balancers ( alice and bob ) as it is started
by Pacemaker:
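Disabling the service might be done with systemd on each load balancer (a sketch, assuming haproxy is the systemd unit name):

```shell
systemctl disable haproxy
```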
crm(haproxy-config)# verify
For more information about ldirectord , refer to its comprehensive man page.
Apart from local clusters and metro area clusters, SUSE® Linux Enterprise High
Availability Extension 15 SP1 also supports geographically dispersed clusters (Geo
clusters, sometimes also called multi-site clusters). That means you can have multiple,
geographically dispersed sites with a local cluster each. Failover between these
clusters is coordinated by a higher level entity, the so-called booth . For details on
how to use and set up Geo clusters, refer to Article “Geo Clustering Quick Start” and
Book “Geo Clustering Guide”.
To perform maintenance tasks on the cluster nodes, you might need to stop the re-
sources running on that node, to move them, or to shut down or reboot the node. It
might also be necessary to temporarily take over the control of resources from the
cluster, or even to stop the cluster service while resources remain running.
This chapter explains how to manually take down a cluster node without negative
side-effects. It also gives an overview of different options the cluster stack provides
for executing maintenance tasks.
The resources that are running on the node will be stopped or moved off the node.
If stopping the resources fails or times out, the STONITH mechanism will fence the
node and shut it down.
If your aim is to move the services off the node in an orderly fashion before shutting down
or rebooting the node, proceed as follows:
1. On the node you want to reboot or shut down, log in as root or equivalent.
That way, services can migrate off the node without being limited by the shutdown timeout
of Pacemaker.
[...]
Node bob: standby
[...]
4. After you have finished, put the resource, node, or cluster back to “normal” operation.
To put the cluster back to normal mode after your maintenance work is done, use the following
command:
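In the crm shell, the cluster-wide maintenance mode is toggled via a cluster property. A sketch of the typical commands (set to true before the maintenance work, back to false afterward):

```
# Put the whole cluster into maintenance mode:
root # crm configure property maintenance-mode=true
# ... perform your maintenance work ...
# Return the cluster to normal operation:
root # crm configure property maintenance-mode=false
```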
1. Start a Web browser and log in to the cluster as described in Section 7.2, “Logging In”.
3. In the CRM Configuration group, select the maintenance-mode attribute from the empty
drop-down box and click the plus icon to add it.
5. After you have finished the maintenance task for the whole cluster, deactivate the check
box next to the maintenance-mode attribute.
From this point on, High Availability Extension will take over cluster management again.
To put the node back to normal mode after your maintenance work is done, use the following
command:
1. Start a Web browser and log in to the cluster as described in Section 7.2, “Logging In”.
3. In one of the individual nodes' views, click the wrench icon next to the node and select
Maintenance.
4. After you have finished your maintenance task, click the wrench icon next to the node
and select Ready.
To bring the node back online after your maintenance work is done, use the following command:
1. Start a Web browser and log in to the cluster as described in Section 7.2, “Logging In”.
3. In one of the individual nodes' views, click the wrench icon next to the node and select
Standby.
5. To deactivate the standby mode, click the wrench icon next to the node and select Ready.
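With the crm shell, the equivalent standby/online cycle for a node looks like this (the node name bob is only an example):

```
# Put the node into standby mode; its resources are moved off:
root # crm node standby bob
# ... perform your maintenance work ...
# Bring the node back online so it can run resources again:
root # crm node online bob
```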
To put the resource back into normal mode after your maintenance work is done, use the following command:
1. Start a Web browser and log in to the cluster as described in Section 7.2, “Logging In”.
3. Select the resource you want to put in maintenance mode or unmanaged mode, click the
wrench icon next to the resource and select Edit Resource.
5. From the empty drop-down list, select the maintenance attribute and click the plus icon
to add it.
6. Activate the check box next to maintenance to set the maintenance attribute to yes .
8. After you have finished the maintenance task for that resource, deactivate the check box
next to the maintenance attribute for that resource.
From this point on, the resource will be managed by the High Availability Extension software again.
To put it into managed mode again after your maintenance work is done, use the following
command:
1. Start a Web browser and log in to the cluster as described in Section 7.2, “Logging In”.
2. From the left navigation bar, select Status and go to the Resources list.
3. In the Operations column, click the arrow down icon next to the resource you want to
modify and select Edit.
The resource configuration screen opens.
4. Below Meta Attributes, select the is-managed entry from the empty drop-down box.
6. After you have finished your maintenance task, set is-managed to Yes (which is the default
value) and apply your changes.
From this point on, the resource will be managed by the High Availability Extension software again.
Note: Implications
If the cluster or a node is in maintenance mode, you can stop or restart cluster resources
at will—the High Availability Extension will not attempt to restart them. If you stop the
Pacemaker service on a node, all daemons and processes (originally started as Pacemaker-managed cluster resources) will continue to run.
If you want to take down a node while either the cluster or the node is in maintenance
mode, proceed as follows:
1. On the node you want to reboot or shut down, log in as root or equivalent.
2. If you have a DLM resource (or other resources depending on DLM), make sure to explicitly
stop those resources before stopping the Pacemaker service:
The reason is that stopping Pacemaker also stops the Corosync service, on whose membership and messaging services DLM depends. If Corosync stops, the DLM resource will
assume a split brain scenario and trigger a fencing operation.
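A sketch of this sequence, assuming the DLM resource is named dlm (use the name of your DLM resource or its containing clone):

```
# Explicitly stop DLM-dependent resources before the cluster stack:
root # crm resource stop dlm
# Only then stop the Pacemaker service on this node:
root # systemctl stop pacemaker
```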
18 OCFS2
19 GFS2
20 DRBD
The Distributed Lock Manager (DLM) in the kernel is the base component used by
OCFS2, GFS2, Cluster MD, and Cluster LVM (lvmlockd) to provide active-active
storage at each respective layer.
If rrp_mode is set to none (which means redundant ring configuration is disabled), DLM automatically uses TCP. However, without a redundant communication channel, DLM communication will fail if the TCP link is down.
If rrp_mode is set to passive (which is the typical setting), and a second communication
ring in /etc/corosync/corosync.conf is configured correctly, DLM automatically uses
SCTP. In this case, DLM messaging has the redundancy capability provided by SCTP.
The configuration consists of a base group that includes several primitives and a base
clone. Both base group and base clone can be used in various scenarios afterward (for both
OCFS2 and Cluster LVM, for example). You only need to extend the base group with the
respective primitives as needed. As the base group has internal colocation and ordering,
this simplifies the overall setup as you do not need to specify several individual groups,
clones and their dependencies.
Follow the steps below on one node in the cluster:
4. Create a base group for the DLM resource and further storage-related resources:
7. If everything is correct, submit your changes with commit and leave the crm live configuration with exit .
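The steps above can be sketched in the crm live configuration as follows. The resource names dlm, g-storage, and cl-storage follow the naming used elsewhere in this guide; the monitor timeouts are illustrative, not mandated values:

```
root # crm configure
crm(live)configure# primitive dlm ocf:pacemaker:controld \
  op monitor interval=60 timeout=60
crm(live)configure# group g-storage dlm
crm(live)configure# clone cl-storage g-storage \
  meta interleave=true
crm(live)configure# commit
crm(live)configure# exit
```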
Xen image store in a cluster. Xen virtual machines and virtual servers can be stored on
OCFS2 volumes that are mounted by cluster servers. This provides quick and easy portability of Xen virtual machines between servers.
As a high-performance, symmetric and parallel cluster file system, OCFS2 supports the following
functions:
An application's files are available to all nodes in the cluster. Users simply install the application once on an OCFS2 volume in the cluster.
All nodes can concurrently read and write directly to storage via the standard file system interface, enabling easy management of applications that run across the cluster.
File access is coordinated through DLM. DLM control is good for most cases, but an application's design might limit scalability if it contends with the DLM to coordinate file access.
Storage backup functionality is available on all back-end storage. An image of the shared application files can be easily created, which can help provide effective disaster recovery.
Metadata caching.
Metadata journaling.
Support for multiple block sizes up to 4 KB and cluster sizes up to 1 MB, for a maximum volume size of 4 PB (petabyte).
Asynchronous and direct I/O support for database files for improved database performance.
TABLE 18.1: OCFS2 UTILITIES
6. If everything is correct, submit your changes with commit and leave the crm live configuration with exit .
For details on configuring the resource group for DLM, see Procedure 17.1, “Configuring a Base Group for DLM”.
Before you begin, prepare the block devices you plan to use for your OCFS2 volumes. Leave
the devices as free space.
Then create and format the OCFS2 volume with the mkfs.ocfs2 as described in Procedure 18.2,
“Creating and Formatting an OCFS2 Volume”. The most important parameters for the command are
listed in Table 18.2, “Important OCFS2 Parameters”. For more information and the command syntax,
refer to the mkfs.ocfs2 man page.
Specific Features On/Off ( --fs-features ): A comma-separated list of feature flags can be provided, and mkfs.ocfs2 will try to create the file system with those features set according to the list. To turn a feature on, include it in the list. To turn a feature off, prepend no to the name.
For an overview of all available flags, refer to the mkfs.ocfs2 man page.
If you do not specify any features when creating and formatting the volume with mkfs.ocfs2 ,
the following features are enabled by default: backup-super , sparse , inline-data , unwritten , metaecc , indexed-dirs , and xattr .
3. Create and format the volume using the mkfs.ocfs2 utility. For information about the
syntax for this command, refer to the mkfs.ocfs2 man page.
For example, to create a new OCFS2 file system on /dev/sdb1 that supports up to 32
cluster nodes, enter the following command:
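A minimal sketch of such a command; -N sets the number of node slots, and the device name is only an example:

```
root # mkfs.ocfs2 -N 32 /dev/sdb1
```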
3. Mount the volume from the command line, using the mount command.
To mount an OCFS2 volume with the High Availability software, configure an ocfs2 file
system resource in the cluster. The following procedure uses the crm shell to configure
the cluster resources. Alternatively, you can also use Hawk2 to configure the resources as
described in Section 18.6, “Configuring OCFS2 Resources With Hawk2”.
3. Configure Pacemaker to mount the OCFS2 file system on every node in the cluster:
4. Add the ocfs2-1 primitive to the g-storage group you created in Procedure 17.1, “Configuring a Base Group for DLM”.
The add subcommand appends the new group member by default. Because of the base
group's internal colocation and ordering, Pacemaker will only start the ocfs2-1 resource
on nodes that also have a dlm resource already running.
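A sketch of the corresponding crm configuration; the device, mount point, and timeouts below are placeholders you need to adapt:

```
root # crm configure
crm(live)configure# primitive ocfs2-1 ocf:heartbeat:Filesystem \
  params device="/dev/sdb1" directory="/mnt/shared" fstype="ocfs2" \
  op monitor interval=20 timeout=40
crm(live)configure# modgroup g-storage add ocfs2-1
crm(live)configure# commit
```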
Using the OCFS2 template in the Hawk2 Setup Wizard also leads to a slightly different resource configuration than the manual configuration described in Procedure 17.1, “Configuring a Base Group for DLM” and Procedure 18.4, “Mounting an OCFS2 Volume with the Cluster Resource Manager”.
1. Log in to Hawk2:
https://HAWKSERVER:7630/
3. Expand the File System category and select OCFS2 File System .
4. Follow the instructions on the screen. If you need information about an option, click it
to display a short help text in Hawk2. After the last configuration step, Verify the values
you have entered.
The wizard displays the configuration snippet that will be applied to the CIB and any
additional changes, if required.
https://ocfs2.wiki.kernel.org/
The OCFS2 project home page.
http://oss.oracle.com/projects/ocfs2/
The former OCFS2 project home page at Oracle.
http://oss.oracle.com/projects/ocfs2/documentation
The project's former documentation home page.
Global File System 2, or GFS2, is a shared disk file system for Linux computer clusters. GFS2 allows all nodes to have direct concurrent access to the same shared
block storage. GFS2 has no disconnected operating mode, and no client or server
roles. All nodes in a GFS2 cluster function as peers. GFS2 supports up to 32 cluster
nodes. Using GFS2 in a cluster requires hardware to allow access to the shared storage, and a lock manager to control access to the storage.
SUSE recommends OCFS2 over GFS2 for your cluster environments if performance
is one of your major requirements. Our tests have revealed that OCFS2 performs
better than GFS2 in such settings.
TABLE 19.1: GFS2 UTILITIES
6. If everything is correct, submit your changes with commit and leave the crm live configuration with exit .
For details on configuring the resource group for DLM, see Procedure 17.1, “Configuring a Base Group for DLM”.
Before you begin, prepare the block devices you plan to use for your GFS2 volumes. Leave the
devices as free space.
Then create and format the GFS2 volume with the mkfs.gfs2 as described in Procedure 19.2,
“Creating and Formatting a GFS2 Volume”. The most important parameters for the command are
listed in Table 19.2, “Important GFS2 Parameters”. For more information and the command syntax,
refer to the mkfs.gfs2 man page.
Lock Protocol Name ( -p ): The name of the locking protocol to use. Acceptable locking protocols are lock_dlm (for shared storage) or, if you are using GFS2 as a local file system (1 node only), lock_nolock . If this option is not specified, the lock_dlm protocol is assumed.
Lock Table Name ( -t ): The lock table field appropriate to the lock module you are using. It is clustername : fsname . clustername must match the name in the cluster configuration file, /etc/corosync/corosync.conf . Only members of this cluster are permitted to use this file system. fsname is a unique file system name used to distinguish this GFS2 file system from others created (1 to 16 characters).
Number of Journals ( -j ): The number of journals for gfs2_mkfs to create. You need at least one journal per machine that will mount the file system. If this option is not specified, one journal is created.
3. Create and format the volume using the mkfs.gfs2 utility. For information about the
syntax for this command, refer to the mkfs.gfs2 man page.
For example, to create a new GFS2 file system on /dev/sdb1 that supports up to 32 cluster
nodes, use the following command:
The hacluster name relates to the entry cluster_name in the file /etc/corosync/corosync.conf (this is the default).
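A sketch of such a command; hacluster is the cluster name taken from /etc/corosync/corosync.conf, mygfs2 is a placeholder file system name, and -j 32 creates one journal per potential node:

```
root # mkfs.gfs2 -t hacluster:mygfs2 -p lock_dlm -j 32 /dev/sdb1
```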
3. Mount the volume from the command line, using the mount command.
To mount a GFS2 volume with the High Availability software, configure an OCF file system
resource in the cluster. The following procedure uses the crm shell to configure the cluster
resources. Alternatively, you can also use Hawk2 to configure the resources.
3. Configure Pacemaker to mount the GFS2 file system on every node in the cluster:
4. Create a base group that consists of the dlm primitive you created in Procedure 17.1, “Configuring a Base Group for DLM” and the gfs2-1 primitive. Clone the group:
Because of the base group's internal colocation and ordering, Pacemaker will only start
the gfs2-1 resource on nodes that also have a dlm resource already running.
6. If everything is correct, submit your changes with commit and leave the crm live configuration with exit .
The distributed replicated block device (DRBD*) allows you to create a mirror of two
block devices that are located at two different sites across an IP network. When
used with Corosync, DRBD supports distributed high-availability Linux clusters.
This chapter shows you how to install and set up DRBD.
Service Service
Filesystem Filesystem
DRBD allows you to use any block device supported by Linux, usually:
software RAID
By default, DRBD uses the TCP ports 7788 and higher for communication between DRBD nodes.
Make sure that your firewall does not prevent communication on the used ports.
You must set up the DRBD devices before creating file systems on them. Everything pertaining to
user data should be done solely via the /dev/drbdN device and not on the raw device, as DRBD
uses the last part of the raw device for metadata. Using the raw device will cause inconsistent
data.
With udev integration, you will also get symbolic links in the form /dev/drbd/by-res/
RESOURCES which are easier to use and provide safety against misremembering the devices'
minor number.
For example, if the raw device is 1024 MB in size, the DRBD device has only 1023 MB available
for data, with about 70 KB hidden and reserved for the metadata. Any attempt to access this
space via the raw device fails, because it is reserved for metadata and not available for user data.
To use it permanently for root , create or extend the file /root/.bashrc and insert the previous
line.
The following sections assume you have two nodes, alice and bob, and that they should use
TCP port 7788 . Make sure this port is open in your firewall.
a. Make sure the block devices in your Linux nodes are ready and partitioned (if need-
ed).
b. If your disk already contains a file system that you do not need anymore, destroy
the file system structure with the following command:
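A common way to do this is to overwrite the start of the device with zeros. The device name below is a placeholder; note that this irrevocably destroys the data on the device:

```
root # dd if=/dev/zero of=/dev/sdb1 count=16 bs=1M
```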
If you have more file systems to destroy, repeat this step on all devices you want to
include in your DRBD setup.
c. If the cluster is already using DRBD, put your cluster in maintenance mode:
If you skip this step when your cluster already uses DRBD, a syntax error in the live
configuration will lead to a service shutdown.
As an alternative, you can also use drbdadm -c FILE to test a configuration le.
3. If you have configured Csync2 (which should be the default), the DRBD configuration files
are already included in the list of files that need to be synchronized. To synchronize them, use:
If you do not have Csync2 (or do not want to use it), copy the DRBD configuration files
manually to the other node:
4. Perform the initial synchronization (see Section 20.3.3, “Initializing and Formatting DRBD Resource”).
Beginning with DRBD version 8.3, the former configuration file is split into separate files,
located under the directory /etc/drbd.d/ .
startup {
# wfc-timeout degr-wfc-timeout outdated-wfc-timeout
# wait-after-sb;
wfc-timeout 100;
degr-wfc-timeout 120;
}
These options are used to reduce the timeouts when booting; see https://docs.linbit.com/docs/users-guide-9.0/#ch-configure for more details.
2. Create the file /etc/drbd.d/r0.res . Change the lines according to your situation and
save it:
resource r0 { 1
device /dev/drbd0; 2
disk /dev/sda1; 3
meta-disk internal; 4
on alice { 5
address 192.168.1.10:7788; 6
node-id 0; 7
}
on bob { 5
address 192.168.1.11:7788; 6
node-id 1; 7
}
disk {
resync-rate 10M; 8
}
connection-mesh { 9
1 DRBD resource name that allows some association to the service that needs them.
For example, nfs , http , mysql_0 , postgres_wal , etc. Here a more general name
r0 is used.
3 The raw device that is replicated between nodes. Note that in this example the devices
are the same on both nodes. If you need different devices, move the disk parameter
into the respective on section.
4 The meta-disk parameter usually contains the value internal , but it is possible
to specify an explicit device to hold the meta data. See https://docs.linbit.com/docs/
users-guide-9.0/#s-metadata for more information.
5 The on section states which host this configuration statement applies to.
6 The IP address and port number of the respective node. Each resource needs an individual port, usually starting with 7788 . Both ports must be the same for a DRBD
resource.
7 The node ID is required when configuring more than two nodes. It is a unique, non-
negative integer to distinguish the different nodes.
8 The synchronization rate. Set it to one third of the lower of the disk and network
bandwidth. It only limits the resynchronization, not the replication.
9 Defines all nodes of a mesh. The hosts parameter contains all host names that share
the same DRBD setup.
3. Check the syntax of your configuration file(s). If the following command returns an error,
verify your files:
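A sketch of such a syntax check with drbdadm; the command parses the configuration and prints all configured resources:

```
root # drbdadm dump all
```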
1. Start YaST and select the configuration module High Availability DRBD. If you already
have a DRBD configuration, YaST warns you. YaST will change your configuration and
will save your old DRBD configuration les as *.YaSTsave .
2. Leave the booting flag in Start-up Configuration Booting as it is (by default it is off ); do
not change it, as Pacemaker manages this service.
4. Go to the Resource Configuration entry. Press Add to create a new resource (see Figure 20.2,
“Resource Configuration”).
FIGURE 20.2: RESOURCE CONFIGURATION
Resource Name
The name of the DRBD resource (mandatory)
Name
The host name of the relevant node
Address:Port
The IP address and port number (default 7788 ) for the respective node
Disk
The raw device that is replicated between both nodes. If you use LVM, insert your
LVM device name.
Meta-disk
The Meta-disk is either set to the value internal or specifies an explicit device
extended by an index to hold the meta data needed by DRBD.
A real device may also be used for multiple DRBD resources. For example, if your
Meta-Disk is /dev/sda6[0] for the first resource, you may use /dev/sda6[1] for
the second resource. However, there must be at least 128 MB of space available on
this disk for each resource. The fixed metadata size limits the maximum data size that
you can replicate.
5. Click Save.
6. Click Add to enter the second DRBD resource and finish with Save.
8. If you use LVM with DRBD, it is necessary to change some options in the LVM configuration
file (see the LVM Configuration entry). This change can be done by the YaST DRBD module
automatically.
The disk name of localhost for the DRBD resource and the default filter will be rejected in
the LVM filter. Only /dev/drbd can be scanned for an LVM device.
For example, if /dev/sda1 is used as a DRBD disk, the device name will be inserted as the
first entry in the LVM filter. To change the filter manually, click the Modify LVM Device
Filter Automatically check box.
1. On both nodes (alice and bob), initialize the meta data storage:
2. To shorten the initial resynchronization of your DRBD resource check the following:
If the DRBD devices on all nodes have the same data (for example, by destroying
the file system structure with the dd command as shown in Section 20.3, “Setting Up
DRBD Service”), then skip the initial resynchronization with the following command
(on both nodes):
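A sketch of the surrounding initialization sequence for a resource r0; the exact flags can vary between DRBD versions, so verify them against the drbdadm man page:

```
# On both nodes: create the metadata and bring the device up:
root # drbdadm create-md r0
root # drbdadm up r0
# If both backing devices are known to hold identical data (for
# example, zeroed out), skip the initial full synchronization:
root # drbdadm new-current-uuid --clear-bitmap r0
# Otherwise, start the initial full synchronization from one node:
root # drbdadm primary --force r0
```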
5. Create your file system on top of your DRBD device, for example:
DRBD 9 falls back to being compatible with version 8. For three nodes or more, you need to
re-create the metadata to use DRBD 9-specific options.
If you have a stacked DRBD resource, refer also to Section 20.5, “Creating a Stacked DRBD Device”
for more information.
To keep your data and allow adding new nodes without re-creating resources, do the following:
2. Update all the DRBD packages on all of your nodes, see Section 20.2, “Installing DRBD Services”.
4. Enlarge the space of your DRBD disks when using internal as meta-disk key. Use a
device that supports enlarging the space like LVM. As an alternative, change to an external
disk for metadata and use meta-disk DEVICE; .
FIGURE 20.3: RESOURCE STACKING
# /etc/drbd.d/r0.res
resource r0 {
protocol C;
device /dev/drbd0;
disk /dev/sda1;
meta-disk internal;
on amsterdam-bob {
address 192.168.1.2:7900;
}
}
resource r0-U {
protocol A;
device /dev/drbd10;
stacked-on-top-of r0 {
address 192.168.2.1:7910;
}
on berlin-charlie {
disk /dev/sda10;
address 192.168.2.2:7910; # Public IP of the backup node
meta-disk internal;
}
}
resource RESOURCE {
net {
fencing resource-only;
# ...
}
handlers {
fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.9.sh";
# ...
}
If the DRBD replication link becomes disconnected, DRBD does the following:
3. The script determines the Pacemaker resource associated with this DRBD resource.
4. The script ensures that the DRBD resource no longer gets promoted to any other node. It
stays on the currently active one.
5. If the replication link becomes connected again and DRBD completes its synchronization
process, then the constraint is removed. The cluster manager is now free to promote the
resource.
root # ls /srv/r0/from_alice
b. Downgrade the DRBD service on bob by typing the following command on bob:
5. To get the service to automatically start and fail over if the server has a problem, you
can set up DRBD as a high availability service with Pacemaker/Corosync. For information
about installing and configuring for SUSE Linux Enterprise 15 SP1 see Part II, “Configuration
and Administration”.
1. Use an external disk for your metadata. This might help, at the cost of maintenance ease.
2. Tune your network connection by changing the receive and send buffer settings via
sysctl .
5. If you have a hardware RAID controller with a BBU (Battery Backup Unit), you might benefit
from setting no-disk-flushes , no-disk-barrier and/or no-md-flushes .
20.10.1 Configuration
If the initial DRBD setup does not work as expected, there is probably something wrong with
your configuration.
To get information about the configuration:
2. Test the configuration le by running drbdadm with the -d option. Enter the following
command:
In a dry run of the adjust option, drbdadm compares the actual configuration of the
DRBD resource with your DRBD configuration le, but it does not execute the calls. Review
the output to make sure you know the source and cause of any errors.
4. If the partitions and settings are correct, run drbdadm again without the -d option.
To resolve this situation, enter the following commands on the node which has data to be discarded:
That resolves the issue by overwriting one node's data with the peer's data, therefore getting a
consistent view on both nodes.
See Article “Highly Available NFS Storage with DRBD and Pacemaker”.
The following man pages for DRBD are available in the distribution: drbd(8) , drbdmeta(8) , drbdsetup(8) , drbdadm(8) , drbd.conf(5) .
Furthermore, for easier storage administration across your cluster, see the recent announcement about the DRBD-Manager at https://www.linbit.com/en/drbd-manager/ .
When managing shared storage on a cluster, every node must be informed about
changes that are done to the storage subsystem. The Logical Volume Manager 2
(LVM2), which is widely used to manage local storage, has been extended to support transparent management of volume groups across the whole cluster. Volume
groups shared among multiple hosts can be managed using the same commands as
local storage.
A shared storage device is available, provided by a Fibre Channel, FCoE, SCSI, iSCSI SAN,
or DRBD*, for example.
Make sure the following packages have been installed: lvm2 and lvm2-lockd .
From SUSE Linux Enterprise 15 onward, we use lvmlockd as the LVM2 cluster extension,
rather than clvmd. Make sure the clvmd daemon is not running, otherwise lvmlockd will
fail to start.
3. If you have already configured a DLM resource (and a corresponding base group and base
clone), continue with Procedure 21.2, “Creating an lvmlockd Resource”.
Otherwise, configure a DLM resource and a corresponding base group and base clone as
described in Procedure 17.1, “Configuring a Base Group for DLM”.
4. To ensure the lvmlockd resource is started on every node, add the primitive resource to
the base group for storage you have created in Procedure 21.1, “Creating a DLM Resource”:
2. Assuming you already have two shared disks, create a shared VG with them:
This resource manages the activation of a VG. In a shared VG, LV activation has two
different modes: exclusive and shared mode. The exclusive mode is the default and should
be used normally, when a local file system like ext4 uses the LV. The shared mode should
only be used for cluster le systems like OCFS2.
Use shared activation mode for OCFS2 and add it to the cloned g-storage group:
VG
PV
iSCSI iSCSI
SAN 1 SAN 2
FIGURE 21.1: SETUP OF A SHARED DISK WITH CLUSTER LVM
Configure only one SAN box first. Each SAN box needs to export its own iSCSI target. Proceed
as follows:
1. Run YaST and click Network Services iSCSI LIO Target to start the iSCSI Server module.
2. If you want to start the iSCSI target whenever your computer is booted, choose When
Booting, otherwise choose Manually.
4. Switch to the Global tab. If you need authentication, enable incoming or outgoing authentication or both. In this example, we select No Authentication.
b. Click Add.
iqn.DATE.DOMAIN
For more information about the format, refer to Section 3.2.6.3.1, Type "iqn." (iSCSI
Qualified Name), of RFC 3720 at http://www.ietf.org/rfc/rfc3720.txt .
d. If you want a more descriptive name, you can change it as long as your identifier
is unique for your different targets.
e. Click Add.
2. If you want to start the iSCSI initiator whenever your computer is booted, choose When
Booting, otherwise set Manually.
4. Add the IP address and the port of your iSCSI target (see Procedure 21.5, “Configuring iSCSI
Targets (SAN)”). Normally, you can leave the port as it is and use the default value.
5. If you use authentication, insert the incoming and outgoing user name and password,
otherwise activate No Authentication.
...
[4:0:0:2] disk IET ... 0 /dev/sdd
[5:0:0:1] disk IET ... 0 /dev/sde
Look for entries with IET in their third column. In this case, the devices are /dev/sdd
and /dev/sde .
1. Open a root shell on one of the nodes you have run the iSCSI initiator from Procedure 21.6,
“Configuring iSCSI Initiators”.
After you have created the volumes and started your resources you should have new device
names under /dev/testvg , for example /dev/testvg/lv1 . This indicates the LV has been
activated for use.
b. Add the following options to your configuration file (usually something like /etc/drbd.d/r0.res ):
resource r0 {
net {
allow-two-primaries;
}
c. Copy the changed configuration file to the other node, for example:
2. Include the lvmlockd resource as a clone in the pacemaker configuration, and make it
depend on the DLM clone resource. See Procedure 21.1, “Creating a DLM Resource” for detailed
instructions. Before proceeding, confirm that these resources have started successfully on
your cluster. Use crm status or the Web interface to check the running services.
3. Prepare the physical volume for LVM with the command pvcreate . For example, on the
device /dev/drbd_r0 the command would look like this:
5. Create logical volumes as needed. You probably want to change the size of the logical
volume. For example, create a 4 GB logical volume with the following command:
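The PV, VG, and LV steps above can be sketched as follows; the device, volume group, and logical volume names are placeholders:

```
# Prepare the physical volume (step 3):
root # pvcreate /dev/drbd_r0
# Create a shared volume group on it:
root # vgcreate --shared vg1 /dev/drbd_r0
# Create a 4 GB logical volume (step 5):
root # lvcreate -L 4G -n lv1 vg1
```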
6. The logical volumes within the VG are now available as file system mounts or for raw usage.
Ensure that services using them have proper dependencies to colocate them with the VG
and to order them after the VG has been activated.
After finishing these configuration steps, the LVM2 configuration can be done like on any stand-
alone workstation.
1. Edit the file /etc/lvm/lvm.conf and search for the line starting with filter .
2. The patterns there are handled as regular expressions. A leading “a” accepts a
device pattern for the scan; a leading “r” rejects the devices that follow the device pattern.
3. To remove a device named /dev/sdb1 , add the following expression to the filter rule:
"r|^/dev/sdb1$|"
A filter line that accepts DRBD and MPIO devices but rejects all other devices would look
like this:
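A sketch of such a filter line; the DRBD pattern is the common one, while the multipath pattern depends on your device naming and is only an assumption:

```
filter = [ "a|/dev/drbd.*|", "a|/dev/disk/by-id/dm-uuid-mpath-.*|", "r|.*|" ]
```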
You have a two-node cluster consisting of the nodes alice and bob .
A mirror logical volume named test-lv was created from a volume group named cluster-vg2 .
The volume group cluster-vg2 is composed of the disks /dev/vdb and /dev/vdc .
root # lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 253:0 0 40G 0 disk
├─vda1 253:1 0 4G 0 part [SWAP]
└─vda2 253:2 0 36G 0 part /
vdb 253:16 0 20G 0 disk
├─cluster--vg2-test--lv_mlog_mimage_0 254:0 0 4M 0 lvm
│ └─cluster--vg2-test--lv_mlog 254:2 0 4M 0 lvm
│ └─cluster--vg2-test--lv 254:5 0 12G 0 lvm
└─cluster--vg2-test--lv_mimage_0 254:3 0 12G 0 lvm
└─cluster--vg2-test--lv 254:5 0 12G 0 lvm
vdc 253:32 0 20G 0 disk
├─cluster--vg2-test--lv_mlog_mimage_1 254:1 0 4M 0 lvm
│ └─cluster--vg2-test--lv_mlog 254:2 0 4M 0 lvm
│ └─cluster--vg2-test--lv 254:5 0 12G 0 lvm
└─cluster--vg2-test--lv_mimage_1 254:4 0 12G 0 lvm
└─cluster--vg2-test--lv 254:5 0 12G 0 lvm
Is the mirror log itself mirrored ( mirrored option) and allocated on the same device
as the mirror leg? (For example, this might be the case if you have created the
logical volume for a cmirrord setup on SUSE Linux Enterprise High Availability
Extension 11 or 12 as described in https://www.suse.com/documentation/sle-ha-12/
singlehtml/book_sleha/book_sleha.html#sec.ha.clvm.config.cmirrord .)
Is the mirror log written to a different device ( disk option) or kept in memory ( core
option)? Before starting the migration, either enlarge the size of the physical volume
or reduce the size of the logical volume (to free more space for the physical volume).
b. Remove the physical volume /dev/vdc from the volume group cluster-vg2 :
For details on why to use the data-offset option, see Important: Avoiding Migration
Failures.
If your cluster consists of more than two nodes, execute this step on all remaining nodes
in your cluster.
a. Initialize the MD device /dev/md0 as physical volume for use with LVM:
c. Move the data from the disk /dev/vdb to the /dev/md0 device:
e. Remove the label from the device so that LVM no longer recognizes it as physical
volume:
Thorough information is available from the Pacemaker mailing list, available at http://www.clusterlabs.org/wiki/Help:Contents .
Cluster multi-device (Cluster MD) is a software-based RAID storage solution for
a cluster. Currently, Cluster MD provides the redundancy of RAID1 mirroring to the
cluster. With SUSE Linux Enterprise High Availability Extension 15 SP1, RAID10 is
included as a technology preview. If you want to try RAID10, replace mirror with
10 in the related mdadm command. This chapter shows you how to create and use
Cluster MD.
A resource agent for DLM (see Procedure 17.1, “Configuring a Base Group for DLM” on how to
configure DLM).
At least two shared disk devices. You can use an additional device as a spare which will
fail over automatically in case of device failure.
If you do not have an existing normal RAID device, create the Cluster MD device on
the node running the DLM resource with the following command:
If you already have an existing normal RAID, first clear the existing bitmap and then
create the clustered bitmap:
Optionally, to create a Cluster MD device with a spare device for automatic failover,
run the following command on one cluster node:
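The commands for the three cases above might look as follows (the device names /dev/sda , /dev/sdb , and /dev/sdc are examples; run the commands as root on the node running the DLM resource):

```shell
# Create a new Cluster MD device with a clustered bitmap:
mdadm --create /dev/md0 --bitmap=clustered --metadata=1.2 \
  --raid-devices=2 --level=mirror /dev/sda /dev/sdb

# For an existing RAID, first clear the existing bitmap, then create the clustered one:
mdadm --grow /dev/md0 --bitmap=none
mdadm --grow /dev/md0 --bitmap=clustered

# Create a Cluster MD device with a spare device for automatic failover:
mdadm --create /dev/md0 --bitmap=clustered --raid-devices=2 \
  --level=mirror --spare-devices=1 /dev/sda /dev/sdb /dev/sdc
```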
The UUID must match the UUID stored in the superblock. For details on the UUID, refer
to the mdadm.conf man page.
4. Open /etc/mdadm.conf and add the md device name and the devices associated with it.
Use the UUID from the previous step:
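The resulting entry could look like this (the UUID below is a placeholder; use the one reported for your array, for example by mdadm --detail /dev/md0 ):

```
DEVICE /dev/sda /dev/sdb
ARRAY /dev/md0 UUID=1d70f103:49740ef1:af2afce5:fcf6a489
```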
group ha_group
{
# ... list of files pruned ...
include /etc/mdadm.conf
}
2. Add the raider resource to the base group for storage that you have created for DLM:
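A sketch of this step with the crm shell. The group name g-storage and the operation timeouts are assumptions following this guide's conventions:

```
crm(live)configure# primitive raider Raid1 \
  params raidconf="/etc/mdadm.conf" raiddev=/dev/md0 \
  op monitor timeout=20s interval=10 \
  op start timeout=20s interval=0 \
  op stop timeout=20s interval=0
crm(live)configure# modgroup g-storage add raider
crm(live)configure# commit
```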
The behavior of the newly added device depends on the state of the Cluster MD device:
If only one of the mirrored devices is active, the new device becomes the second device
of the mirrored devices and a recovery is initiated.
If both devices of the Cluster MD device are active, the newly added device becomes a spare
device.
1. Make sure the device has failed by inspecting /proc/mdstat . Look for an (F) before
the device.
2. Run the following command on one cluster node to make a device fail:
3. Remove the failed device using the command on one cluster node:
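A failed mirror leg shows up in /proc/mdstat with an (F) flag. The following sketch works on a sample copy of that output (device names are examples); on a live cluster you would read /proc/mdstat directly, and fail or remove a device with mdadm --manage /dev/md0 --fail DEVICE and mdadm --manage /dev/md0 --remove DEVICE :

```shell
# Write a sample /proc/mdstat excerpt with one failed device (sdb1):
cat > /tmp/mdstat.sample <<'EOF'
md0 : active raid1 sdc1[1] sdb1[0](F)
      20955136 blocks super 1.2 [2/1] [_U]
EOF
# List all devices flagged as failed:
grep -o '[a-z0-9]*\[[0-9]*\](F)' /tmp/mdstat.sample
```

The grep prints sdb1[0](F) for the sample above, identifying the failed leg.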
Figure: Structure of a CTDB cluster. A CTDB instance runs on each node; the instances communicate over a private network on top of a shared cluster file system.
Mapping table that associates Unix user and group IDs to Windows users and groups.
Join information for a member server in a Windows domain must be available on all nodes.
Metadata needs to be available on all nodes, like active SMB sessions, share connections,
and various locks.
The goal is that a clustered Samba server with N+1 nodes is faster than one with only N
nodes, and that one node is not slower than an unclustered Samba server.
a. Make sure the following packages are installed before you proceed: ctdb , tdb-tools ,
and samba (needed for smb and nmb resources).
b. Configure your cluster (Pacemaker, OCFS2) as described in this guide in Part II, “Configuration
and Administration”.
c. Configure a shared file system, like OCFS2, and mount it, for example, on /srv/clusterfs .
See Chapter 18, OCFS2 for more information.
Make sure the acl option is specified in the file system resource. Use the crm
shell as follows:
e. Make sure the services ctdb , smb , and nmb are disabled:
f. Open port 4379 of your firewall on all nodes. This is needed for CTDB to
communicate with other cluster nodes.
2. Create a directory for the CTDB lock on the shared file system:
192.168.1.10
192.168.1.11
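A sketch of this step, assuming the shared file system is mounted at /srv/clusterfs and the addresses above are the private IP addresses of the two cluster nodes (the node list conventionally lives in /etc/ctdb/nodes , one address per line):

```shell
mkdir -p /srv/clusterfs/samba/
cat > /etc/ctdb/nodes <<'EOF'
192.168.1.10
192.168.1.11
EOF
```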
4. Configure Samba. Add the following lines in the [global] section of /etc/samba/smb.conf .
Use the host name of your choice in place of "CTDB-SERVER" (all nodes in the
cluster will appear as one big node with this name, effectively):
[global]
# ...
# settings applicable for all CTDB deployments
netbios name = CTDB-SERVER
clustering = yes
idmap config * : backend = tdb2
passdb backend = tdbsam
ctdbd socket = /var/lib/ctdb/ctdb.socket
# settings necessary for CTDB on OCFS2
fileid:algorithm = fsid
vfs objects = fileid
# ...
For more information, see Procedure 4.6, “Synchronizing the Configuration Files with Csync2”.
crm(live)configure# commit
5. Edit the g-ctdb group and insert winbind between the nmb and smb resources:
6. Consult your Windows Server documentation for instructions on how to set up an Active
Directory domain. In this example, we use the following parameters:
AD domain 2k3test.example.com
1. Make sure the following files are included in Csync2's configuration so they are installed
on all cluster hosts:
/etc/samba/smb.conf
/etc/security/pam_winbind.conf
/etc/krb5.conf
/etc/nsswitch.conf
/etc/security/pam_mount.conf.xml
/etc/pam.d/common-session
You can also use YaST's Configure Csync2 module for this task, see Section 4.5, “Transferring
the Configuration to All Nodes”.
2. Run YaST and open the Windows Domain Membership module from the Network Services
entry.
ctdb_diagnostics
Run this tool to diagnose your clustered Samba server. Detailed debug messages should
help you track down any problems you might have.
The ctdb_diagnostics command searches for the following files which must be available
on all nodes:
/etc/krb5.conf
/etc/hosts
/etc/ctdb/nodes
/etc/sysconfig/ctdb
ping_pong
Check whether your file system is suitable for CTDB with ping_pong . It performs certain
tests of your cluster file system like coherence and performance (see http://wiki.samba.org/index.php/Ping_pong )
and gives some indication how your cluster may behave under high load.
1. Start the command ping_pong on one node and replace the placeholder N with the
number of nodes plus one. The file ABSPATH/data.txt is available in your shared storage
and is therefore accessible on all nodes ( ABSPATH indicates an absolute path):
ping_pong ABSPATH/data.txt N
Expect a very high locking rate as you are running only one node. If the program does not
print a locking rate, replace your cluster file system.
2. Start a second copy of ping_pong on another node with the same parameters.
Replace your cluster file system if either of the following applies: the locking rates in the
two instances are not almost equal, or the locking rate did not drop after you started the
second instance.
3. Start a third copy of ping_pong . Add another node and note how the locking rates change.
4. Kill the ping_pong commands one after the other. You should observe an increase of the
locking rate until you get back to the single node case. If you did not get the expected
behavior, find more information in Chapter 18, OCFS2.
http://wiki.samba.org/index.php/CTDB_Setup
http://ctdb.samba.org
http://wiki.samba.org/index.php/Samba_%26_Clustering
Risk Analysis. Conduct a solid risk analysis of your infrastructure. List all the possible
threats and evaluate how serious they are. Determine how likely these threats are and
prioritize them. It is recommended to use a simple categorization: probability and impact.
Budget Planning. The outcome of the analysis is an overview of which risks can be tolerated
and which are critical for your business. Ask yourself how you can minimize risks and how
much it will cost. Depending on how big your company is, spend two to fifteen percent of
the overall IT budget on disaster recovery.
Disaster Recovery Plan Development. Make checklists, test procedures, establish and as-
sign priorities, and inventory your IT infrastructure. Define how to deal with a problem
when some services in your infrastructure fail.
Test. After defining an elaborate plan, test it. Test it at least once a year. Use the same
testing hardware as your main IT infrastructure.
To allow disaster recovery on UEFI systems, you need at least Rear version 1.18.a and the
package ebiso . Only this version supports the new helper tool /usr/bin/ebiso . This
helper tool is used to create a UEFI-bootable Rear system ISO image.
If you have a tested and fully functional disaster recovery procedure with one Rear version,
do not update Rear. Keep the Rear package and do not change your disaster recovery
method!
Version updates for Rear are provided as separate packages that intentionally conflict with
each other to prevent your installed version getting accidentally replaced with another
version.
In the following cases you need to completely re-validate your existing disaster recovery
procedure:
If you update low-level system components such as parted , btrfs and similar.
Warning
Btrfs snapshot subvolumes cannot be backed up and restored as usual with file-based
backup software.
While recent snapshot subvolumes on Btrfs file systems need almost no disk space (because
of Btrfs's copy-on-write functionality), those files would be backed up as complete files
when using file-based backup software. They would end up twice in the backup with their
original file size. Therefore, it is impossible to restore the snapshots as they were
before on the original system.
Rear does not replace a file backup, but complements it. By default, Rear supports the generic
tar command, and several third-party backup tools (such as Tivoli Storage Manager, QNetix
Galaxy, Symantec NetBackup, EMC NetWorker, or HP DataProtector). Refer to Example 24.2 for
an example configuration of using Rear with EMC NetWorker as backup tool.
When your system is booted with UEFI. If your system boots with a UEFI boot loader,
install the package ebiso and add the following line into /etc/rear/local.conf :
ISO_MKISOFS_BIN=/usr/bin/ebiso
How to back up files and how to create and store the disaster recovery system. This needs
to be configured in /etc/rear/local.conf .
How the recovery process works. To change how Rear generates the recovery installer, or
to adapt what the Rear recovery installer does, you need to edit the Bash scripts.
To configure Rear, add your options to the /etc/rear/local.conf configuration file. (The
former configuration file /etc/rear/sites.conf has been removed from the package. However,
if you have such a file from your last setup, Rear will still use it.)
All Rear configuration variables and their default values are set in /usr/share/rear/conf/default.conf .
Some example files ( *example.conf ) for user configurations (for example,
what is set in /etc/rear/local.conf ) are available in the examples subdirectory. Find more
information in the Rear man page.
You should start with a matching example configuration file as template and adapt it as
needed to create your particular configuration file. Copy various options from several
example configuration files and paste them into your specific /etc/rear/local.conf file
that matches your particular system. Do not use the original example configuration files as
is, because they only provide an overview of variables that can be used for specific setups.
Rear can be used in different scenarios. The following example uses an NFS server as
storage for the file backup.
1. Set up an NFS server with YaST as described in the SUSE Linux Enterprise Server 15
SP1 Administration Guide, chapter Sharing File Systems with NFS. It is available from
http://www.suse.com/documentation/ .
2. Define the configuration for your NFS server in the /etc/exports file. Make sure the
directory on the NFS server (where you want the backup data to be available) has the
right mount options. For example:
/srv/nfs *([...],rw,no_root_squash,[...])
Replace /srv/nfs with the path to your backup data on the NFS server and adjust the
mount options. You might need no_root_squash as the rear mkbackup command runs
as root .
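A minimal /etc/rear/local.conf sketch for this scenario, using Rear's internal NETFS method (the default tar backup) and assuming the NFS server host.example.com with the export /srv/nfs :

```
OUTPUT=ISO
BACKUP=NETFS
BACKUP_URL=nfs://host.example.com/srv/nfs
```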
Using third-party backup tools instead of tar requires appropriate settings in the Rear
configuration file.
The following is an example configuration for EMC NetWorker. Add this configuration
snippet to /etc/rear/local.conf and adjust it according to your setup:
BACKUP=NSR
OUTPUT=ISO
BACKUP_URL=nfs://host.example.com/path/to/rear/backup
OUTPUT_URL=nfs://host.example.com/path/to/rear/backup
NSRSERVER=backupserver.example.com
RETENTION_TIME="Month"
rear -d -D mkbackup
1. Analyzing the target system and gathering information, in particular about the disk layout
(partitioning, file systems, mount points) and about the boot loader.
2. Creating a bootable recovery system with the information gathered in the first step. The
resulting Rear recovery installer is specific to the system that you want to protect from
disaster. It can only be used to re-create this specific system.
3. Calling the configured backup tool to back up system and user files.
1. Create a recovery medium by burning the recovery system that you have created in
Section 24.3 to a DVD or CD. Alternatively, you can use a network boot via PXE.
rear -d -D recover
For details about the steps that Rear takes during the process, see Recovery Process.
6. After the recovery process has finished, check whether the system has been successfully
re-created and can serve as a replacement for your original system in the production
environment.
RECOVERY PROCESS
1. Restoring the disk layout (partitions, file systems, and mount points).
/usr/share/doc/packages/rear/
A Troubleshooting
Strange problems may occur that are not easy to understand, especially when starting
to experiment with High Availability. However, there are several utilities that
allow you to take a closer look at the High Availability internal processes. This
chapter recommends various solutions.
In case they are not running, start them by executing the following command:
A.2 Logging
Where to find the log files?
Pacemaker writes its log files into the /var/log/pacemaker directory. The main Pacemaker
log file is /var/log/pacemaker/pacemaker.log . In case you cannot find the log
files, check the logging settings in /etc/sysconfig/pacemaker , Pacemaker's own
configuration file. If PCMK_logfile is configured there, Pacemaker will use the path that is
defined by this parameter.
If you need a cluster-wide report showing all relevant log files, see How can I create a report
with an analysis of all my cluster nodes? for more information.
I enabled monitoring but there is no trace of monitoring operations in the log files?
The pacemaker-execd daemon does not log recurring monitor operations unless an error
occurred. Logging all recurring operations would produce too much noise. Therefore,
recurring monitor operations are logged only once an hour.
root # crm_mon -o -r
Operations:
* Node bob:
my_ipaddress: migration-threshold=3
+ (14) start: rc=0 (ok)
+ (15) monitor: interval=10000ms rc=0 (ok)
* Node alice:
Replace NODE with the node you want to examine, or leave it empty. See Section A.5, “History”
for further information.
A.3 Resources
How can I clean up my resources?
Use the following commands:
If you leave out the node, the resource is cleaned on all nodes. More information can be
found in Section 8.4.3, “Cleaning Up Resources”.
Use -o multiple times for more parameters. The list of required and optional parameters
can be obtained by running crm ra info AGENT , for example:
Before running ocf-tester, make sure the resource is not managed by the cluster.
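A sketch of both commands, using the IPaddr agent as an example (the resource name, agent path, and ip value are illustrative):

```shell
# Show required and optional parameters of an agent:
crm ra info ocf:heartbeat:IPaddr

# Test the agent manually, passing each parameter with -o:
ocf-tester -n ip1 -o ip=127.0.0.99 /usr/lib/ocf/resource.d/heartbeat/IPaddr
```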
Why do resources not fail over and why are there no errors?
The terminated node might be considered unclean. Then it is necessary to fence it. If the
STONITH resource is not operational or does not exist, the remaining node will wait
for the fencing to happen. The fencing timeouts are typically high, so it may take quite a
while to see any obvious sign of problems (if ever).
Yet another possible explanation is that a resource is simply not allowed to run on this
node. That may be because of a failure which happened in the past and which was not
“cleaned”. Or it may be because of an earlier administrative action, that is, a location
constraint with a negative score. Such a location constraint is inserted by the crm resource
migrate command, for example.
Why does fencing not happen, although I have the STONITH resource?
Each STONITH resource must provide a host list. This list may be inserted by hand in the
STONITH resource configuration or retrieved from the device itself from outlet names, for
example. That depends on the nature of the STONITH plugin. pacemaker-fenced uses
the list to find out which STONITH resource can fence the target node. Only if the node
appears in the list can the STONITH resource shoot (fence) the node.
If pacemaker-fenced does not find the node in any of the host lists provided by running
STONITH resources, it will ask pacemaker-fenced instances on other nodes. If the target
node does not show up in the host lists of other pacemaker-fenced instances, the fencing
request ends in a timeout at the originating node.
This gives you a full transition log for the given resource only. However, it is possible to
investigate more than one resource. Append the resource names after the first one.
If you followed some naming conventions (see section ), the resource command makes
it easier to investigate a group of resources. For example, this command investigates all
primitives starting with db :
Use exclude
Use timeframe
The exclude command lets you set an additive regular expression that excludes certain
patterns from the log. For example, the following command excludes all SSH, systemd,
and kernel messages:
With the timeframe command you limit the output to a certain range. For example, the
following command shows all the events on August 23rd from 12:00 to 12:30:
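A sketch of both commands in the crm history shell (the patterns and times are examples matching the text above):

```
crm(live)history# exclude ssh|systemd|kernel
crm(live)history# timeframe "Aug 23 12:00" "Aug 23 12:30"
```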
A.6 Hawk2
Replacing the Self-Signed Certificate
To avoid the warning about the self-signed certificate on first Hawk2 start-up, replace the
automatically created certificate with your own certificate (or a certificate that was signed
by an official Certificate Authority, CA):
Change ownership of the files to root:haclient and make the files accessible to the
group:
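Assuming the certificate and key are installed at the default Hawk2 locations, the commands might be:

```shell
chown root:haclient /etc/hawk/hawk.key /etc/hawk/hawk.pem
chmod 640 /etc/hawk/hawk.key /etc/hawk/hawk.pem
```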
A.7 Miscellaneous
How can I run commands on all cluster nodes?
Use the command pssh for this task. If necessary, install pssh . Create a file (for example
hosts.txt ) where you collect all the IP addresses or host names you want to access. Make
sure you can log in with ssh to each host listed in your hosts.txt file. If everything
is correctly prepared, execute pssh and use the hosts.txt file (option -h ) and the
interactive mode (option -i ) as shown in this example:
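For example (the file checked on each node is illustrative):

```shell
pssh -i -h hosts.txt "ls -l /etc/corosync/corosync.conf"
```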
Check if the connection between your nodes is broken. Most often, this is the result
of a badly configured firewall. This also may be the reason for a split brain condition,
where the cluster is partitioned.
In this case the Kernel module ocfs2_stackglue.ko is missing. Install the package
ocfs2-kmp-default , ocfs2-kmp-pae or ocfs2-kmp-xen , depending on the installed
Kernel.
Package states,
DLM/OCFS2 states,
System information,
CIB history,
Customize the command execution with further options. For example, if you have a Pace-
maker cluster, you certainly want to add the option -A . In case you have another user
who has permissions to the cluster, use the -u option and specify this user (in addition to
root and hacluster ). In case you have a non-standard SSH port, use the -X option to
add the port (for example, with the port 3479, use -X "-p 3479" ). Further options can
be found in the man page of crm report .
After crm report has analyzed all the relevant log files and created the directory (or
archive), check the log files for an uppercase ERROR string. The most important files in
the top level directory of the report are:
analysis.txt
Compares files that should be identical on all nodes.
corosync.txt
Contains a copy of the Corosync configuration file.
crm_mon.txt
Contains the output of the crm_mon command.
description.txt
Contains all cluster package versions on your nodes. There is also the sysinfo.txt
file, which is node-specific. It is linked to the top directory.
This file can be used as a template to describe the issue you encountered and post it
to https://github.com/ClusterLabs/crmsh/issues .
sysinfo.txt
Contains a list of all relevant package names and their versions. Additionally, there
is also a list of configuration files which are different from the original RPM package.
Node-specific files are stored in a subdirectory named by the node's name. It contains a
copy of the directory /etc of the respective node.
In case you need to simplify the arguments, set your default values in the configuration
file /etc/crm/crm.conf , section report . Further information is written in the man page
man 8 crmsh_hb_report .
Cluster Nodes
Cluster nodes use first names:
alice, bob, charlie, doro, and eris
Cluster Resources
Primitives No prefix
Groups Prefix g-
Constraints
High Availability Extension ships with a comprehensive set of tools to assist you in managing
your cluster from the command line. This chapter introduces the tools needed for managing
the cluster configuration in the CIB and the cluster resources. Other command line tools for
managing resource agents or tools used for debugging (and troubleshooting) your setup are
covered in Appendix A, Troubleshooting.
The following list presents several tasks related to cluster management and briefly introduces
the tools to use to accomplish these tasks:
2. Configuring passwordless SSH access for that user account, ideally by using a non-standard
SSH port.
By default when crm report is run, it attempts to log in to remote nodes first as root , then
as user hacluster . However, if your local security policy prevents root login using SSH,
the script execution will fail on all remote nodes. Even attempting to run the script as user
hacluster will fail because this is a service account, and its shell is set to /bin/false , which
prevents login. Creating a dedicated local user is the only option to successfully run the crm
report script on all nodes in the High Availability cluster.
1. Start a shell and create a user hareport with a home directory /home/hareport :
By default, the SSH daemon and the SSH client talk and listen on port 22 . If your network
security guidelines require the default SSH port to be changed to an alternate high-numbered
port, you need to modify the daemon's configuration file /etc/ssh/sshd_config .
1. To modify the default port, search the file for the Port line, uncomment it and edit it
according to your wishes. For example, set it to:
Port 5022
2. If your organization does not permit the root user to access other servers, search the file
for the PermitRootLogin entry, uncomment it and set it to no :
PermitRootLogin no
3. Alternatively, add the respective lines to the end of the file by executing the following
commands:
4. After modifying /etc/ssh/sshd_config , restart the SSH daemon to make the new settings
take effect:
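The commands for steps 3 and 4 might look like this (run as root ):

```shell
echo "Port 5022" >> /etc/ssh/sshd_config
echo "PermitRootLogin no" >> /etc/ssh/sshd_config
systemctl restart sshd
```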
If the SSH port change is going to be made on all nodes in the cluster, it is useful to modify
the SSH client configuration file, /etc/ssh/ssh_config .
Port 5022
2. Alternatively, add the respective line to the end of the file by executing the following
commands:
You can access other servers using SSH and not be asked for a password. While this may
appear insecure at first sight, it is actually a very secure access method since the users can
only access servers that their public key has been shared with. The shared key must be
created as the user that will use the key.
1. Log in to one of the nodes with the user account that you have created for running cluster
reports (in our example above, the user account was hareport ).
This command will generate a 2048-bit key by default. The default location for the key is
~/.ssh/ . You are asked to set a passphrase on the key. However, do not enter a passphrase,
because for passwordless login there must not be a passphrase on the key.
3. After the keys have been generated, copy the public key to each of the other nodes (including
the node where you created the key):
4. After the key is shared to all cluster nodes, test if you can log in as user hareport to the
other nodes by using passwordless SSH:
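A sketch of steps 2 to 4 ( node1 stands for each cluster node in turn):

```shell
ssh-keygen -t rsa                                 # step 2: accept the default location, empty passphrase
ssh-copy-id -i ~/.ssh/id_rsa.pub hareport@node1   # step 3: repeat for every node, including this one
ssh hareport@node1                                # step 4: test passwordless login
```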
You should be automatically connected to the remote server without being asked to accept
a certificate or enter a password.
1. Log in as root .
3. Look for the following categories: Host alias specification , User alias specification ,
Cmnd alias specification , and Runas alias specification .
User_Alias HA = hareport 2
Runas_Alias R = root 4
1 The host alias defines on which server (or range of servers) the sudo user has rights
to issue commands. In the host alias you can use DNS names, or IP addresses, or
specify an entire network range (for example, 172.17.12.0/24 ). To limit the scope
of access you should specify the host names for the cluster nodes only.
2 The user alias allows you to add multiple local user accounts to a single alias. However,
in this case you could avoid creating an alias since only one account is being
used. In the example above, we added the hareport user which we have created
for running cluster reports.
3 The command alias defines which commands can be executed by the user. This is
useful if you want to limit what the non-root user can access when using sudo . In
this case the hareport user account will need access to the commands crm report
and su .
4 The runas alias specifies the account that the command will be run as. In this case
root .
Defaults targetpw
ALL ALL=(ALL) ALL
As they would conflict with the setup we want to create, disable them:
#Defaults targetpw
#ALL ALL=(ALL) ALL
7. After having defined the aliases above, you can now add the following rule there:
The NOPASSWD option ensures that the user hareport can execute the cluster report
without providing a password.
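Putting the aliases and the rule together, the relevant /etc/sudoers lines might look like this. The Host_Alias and Cmnd_Alias definitions correspond to callouts 1 and 3 and are assumptions based on the surrounding text (host names and command paths may differ on your system):

```
Host_Alias  CLUSTER = alice,bob
User_Alias  HA = hareport
Cmnd_Alias  HA_CMDS = /usr/sbin/crm report *, /usr/bin/su
Runas_Alias R = root

HA CLUSTER = (R) NOPASSWD:HA_CMDS
```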
This command will extract all information since 0 am on the named nodes and create a
*.tar.bz2 archive named pcmk-DATE.tar.bz2 in the current directory.
1. When using a custom SSH port, use the -X option with crm report to modify the client's
SSH port. For example, if your custom SSH port is 5022 , use the following command:
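For example (node names follow this guide's conventions; -f sets the start time, -n lists the nodes):

```shell
crm report -f 0:00 -n "alice bob"                  # default SSH port
crm report -X "-p 5022" -f 0:00 -n "alice bob"     # custom SSH port 5022
```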
2. To set your custom SSH port permanently for crm report , start the interactive crm shell:
crm options
active/active, active/passive
A concept of how services are running on nodes. An active-passive scenario means that one
or more services are running on the active node and the passive node waits for the active
node to fail. Active-active means that each node is active and passive at the same time. For
example, it has some services running, but can take over other services from the other node.
Compare with primary/secondary and dual-primary in DRBD speak.
arbitrator
Additional instance in a Geo cluster that helps to reach consensus about decisions such as
failover of resources across sites. Arbitrators are single machines that run one or more booth
instances in a special mode.
AutoYaST
AutoYaST is a system for installing one or more SUSE Linux Enterprise systems automatically
and without user intervention.
booth
The instance that manages the failover process between the sites of a Geo cluster. It aims
to get multi-site resources active on one and only one site. This is achieved by using so-
called tickets that are treated as failover domain between cluster sites, in case a site should
be down.
cluster
A high-performance cluster is a group of computers (real or virtual) sharing the application
load to achieve faster results. A high-availability cluster is designed primarily to secure the
highest possible availability of services.
cluster partition
Whenever communication fails between one or more nodes and the rest of the cluster, a
cluster partition occurs. The nodes of a cluster are split into partitions but still active. They
can only communicate with nodes in the same partition and are unaware of the separated
nodes. As the loss of the nodes on the other partition cannot be confirmed, a split brain
scenario develops (see also split brain).
concurrency violation
A resource that should be running on only one node in the cluster is running on several nodes.
conntrack tools
Allow interaction with the in-kernel connection tracking system for enabling stateful packet
inspection for iptables. Used by the High Availability Extension to synchronize the
connection status between cluster nodes.
crmsh
The command line utility crmsh manages your cluster, nodes, and resources.
See Chapter 8, Configuring and Managing Cluster Resources (Command Line) for more information.
DC (designated coordinator)
The DC is elected from all nodes in the cluster. This happens if there is no DC yet or if the
current DC leaves the cluster for any reason. The DC is the only entity in the cluster that
can decide that a cluster-wide change needs to be performed, such as fencing a node or
moving resources around. All other nodes get their configuration and resource allocation
information from the current DC.
Disaster
Unexpected interruption of critical infrastructure induced by nature, humans, hardware
failure, or software bugs.
Disaster Recovery
Disaster recovery is the process by which a business function is restored to the normal, steady
state after a disaster.
DRBD
DRBD® is a block device designed for building high availability clusters. The whole block
device is mirrored via a dedicated network and is seen as a network RAID-1.
existing cluster
The term “existing cluster” is used to refer to any cluster that consists of at least one node.
Existing clusters have a basic Corosync configuration that defines the communication
channels, but they do not necessarily have resource configuration yet.
failover
Occurs when a resource or node fails on one machine and the affected resources are started
on another node.
fencing
Describes the concept of preventing access to a shared resource by isolated or failing cluster
members. There are two classes of fencing: resource level fencing and node level fencing.
Resource level fencing ensures exclusive access to a given resource. Node level fencing
prevents a failed node from accessing shared resources entirely and prevents resources from
running on a node whose status is uncertain. This is usually done in a simple and abrupt
way: reset or power off the node.
Geo cluster
Consists of multiple, geographically dispersed sites with a local cluster each. The sites
communicate via IP. Failover across the sites is coordinated by a higher-level entity, the booth.
Geo clusters need to cope with limited network bandwidth and high latency. Storage is
replicated asynchronously.
load balancing
The ability to make several servers participate in the same service and do the same work.
local cluster
A single cluster in one location (for example, all nodes are located in one data center).
Network latency can be neglected. Storage is typically accessed synchronously by all nodes.
multicast
A technology used for a one-to-many communication within a network that can be used for
cluster communication. Corosync supports both multicast and unicast.
node
Any computer (real or virtual) that is a member of a cluster and invisible to the user.
PE (policy engine)
The policy engine is implemented as the pacemaker-schedulerd daemon. When a cluster
transition is needed, based on the current state and configuration, pacemaker-schedulerd
calculates the expected next state of the cluster. It determines what actions need to be
scheduled to achieve the next state.
quorum
In a cluster, a cluster partition is defined to have quorum (be “quorate”) if it has the majority
of nodes (or votes). Quorum distinguishes exactly one partition. It is part of the algorithm
to prevent several disconnected partitions or nodes from proceeding and causing data and
service corruption (split brain). Quorum is a prerequisite for fencing, which then ensures
that quorum is indeed unique.
RA (resource agent)
A script acting as a proxy to manage a resource (for example, to start, stop or monitor a
resource). The High Availability Extension supports different kinds of resource agents: For
details, see Section 6.3.2, “Supported Resource Agent Classes”.
split brain
A scenario in which the cluster nodes are divided into two or more groups that do not know
of each other (either through a software or hardware failure). STONITH prevents a split
brain situation from badly affecting the entire cluster. Also known as a “partitioned cluster”
scenario.
The term split brain is also used in DRBD but means that the two nodes contain different data.
STONITH
The acronym for “Shoot the other node in the head”. It refers to the fencing mechanism that
shuts down a misbehaving node to prevent it from causing trouble in a cluster. In a Pacemaker cluster, the implementation of node level fencing is STONITH. For this, Pacemaker comes with a fencing subsystem, pacemaker-fenced.
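As a hedged sketch of what node level fencing looks like in practice, the following crm shell snippet configures an SBD-based STONITH resource. The resource name stonith-sbd is an arbitrary example, and SBD must already be set up on shared storage:

```
# Example only: assumes SBD is configured and the sbd fence agent is installed
crm configure primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_max=30s
```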
switchover
Planned, on-demand moving of services to other nodes in a cluster. See failover.
unicast
A technology for sending messages to a single network destination. Corosync supports both
multicast and unicast. In Corosync, unicast is implemented as UDP-unicast (UDPU).
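The choice between the two transports is made in the totem section of /etc/corosync/corosync.conf. A minimal sketch, assuming example placeholder addresses:

```
totem {
    version: 2
    # "udp" enables multicast; "udpu" selects UDP-unicast
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0   # example network address
    }
}

# With udpu, each cluster member is listed explicitly:
nodelist {
    node { ring0_addr: 192.168.1.1 }
    node { ring0_addr: 192.168.1.2 }
}
```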
This License applies to any manual or other work, in any medium, that contains a notice placed by the copyright holder saying it can be distributed under the terms of this License. Such a notice grants a world-wide, royalty-free license, unlimited in duration, to use that work under the conditions stated herein. The "Document", below, refers to any such manual or work. Any member of the public is a licensee, and is addressed as "you". You accept the license if you copy, modify or distribute the work in a way requiring permission under copyright law.
A "Modified Version" of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.
A "Secondary Section" is a named appendix or a front-matter section of the Document that deals exclusively with the relationship of the publishers or authors of the Document to the Document's overall subject (or to related matters) and contains nothing that could fall directly within that overall subject. (Thus, if the Document is in part a textbook of mathematics, a Secondary Section may not explain any mathematics.) The relationship could be a matter of historical connection with the subject or with related matters, or of legal, commercial, philosophical, ethical or political position regarding them.
The "Invariant Sections" are certain Secondary Sections whose titles are designated, as being those of Invariant Sections, in the notice that says that the Document is released under this License. If a section does not fit the above definition of Secondary then it is not allowed to be designated as Invariant. The Document may contain zero Invariant Sections. If the Document does not identify any Invariant Sections then there are none.
The "Cover Texts" are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Document is released under this License. A Front-Cover Text may be at most 5 words, and a Back-Cover Text may be at most 25 words.
A "Transparent" copy of the Document means a machine-readable copy, represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors or (for images composed of pixels) generic paint programs or (for drawings) some widely available drawing editor, and that is suitable for input to text formatters or for automatic translation to a variety of formats suitable for input to text formatters. A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent. An image format is not Transparent if used for any substantial amount of text. A copy that is not "Transparent" is called "Opaque".
Examples of suitable formats for Transparent copies include plain ASCII without markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD, and standard-conforming simple HTML, PostScript or PDF designed for human modification. Examples of transparent image formats include PNG, XCF and JPG. Opaque formats include proprietary formats that can be read and edited only by proprietary word processors, SGML or XML for which the DTD and/or processing tools are not generally available, and the machine-generated HTML, PostScript or PDF produced by some word processors for output purposes only.

3. COPYING IN QUANTITY

If you publish printed copies (or copies in media that commonly have printed covers) of the Document, numbering more than 100, and the Document's license notice requires Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of the Document and satisfy these conditions, can be treated as verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent pages.
If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a computer-network location from which the general network-using public has access to download using public-standard network protocols a complete Transparent copy of the Document, free of added material. If you use the latter option, you must take reasonably prudent steps, when you begin distribution of Opaque copies in quantity, to ensure that this Transparent copy will remain thus accessible at the stated location until at least one year after the last time you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the public.
It is requested, but not required, that you contact the authors of the Document well before redistributing any large number of copies, to give them a chance to provide you with an updated version of the Document.
4. MODIFICATIONS

You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:

A. Use in the Title Page (and on the covers, if any) a title distinct from that of the Document, and from those of previous versions (which should, if there were any, be listed in the History section of the Document). You may use the same title as a previous version if the original publisher of that version gives permission.
B. List on the Title Page, as authors, one or more persons or entities responsible for authorship of the modifications in the Modified Version, together with at least five of the principal authors of the Document (all of its principal authors, if it has fewer than five), unless they release you from this requirement.
C. State on the Title page the name of the publisher of the Modified Version, as the publisher.
D. Preserve all the copyright notices of the Document.
E. Add an appropriate copyright notice for your modifications adjacent to the other copyright notices.
F. Include, immediately after the copyright notices, a license notice giving the public permission to use the Modified Version under the terms of this License, in the form shown in the Addendum below.
G. Preserve in that license notice the full lists of Invariant Sections and required Cover Texts given in the Document's license notice.
H. Include an unaltered copy of this License.
I. Preserve the section Entitled "History", Preserve its Title, and add to it an item stating at least the title, year, new authors, and publisher of the Modified Version as given on the Title Page. If there is no section Entitled "History" in the Document, create one stating the title, year, authors, and publisher of the Document as given on its Title Page, then add an item describing the Modified Version as stated in the previous sentence.
J. Preserve the network location, if any, given in the Document for public access to a Transparent copy of the Document, and likewise the network locations given in the Document for previous versions it was based on. These may be placed in the "History" section. You may omit a network location for a work that was published at least four years before the Document itself, or if the original publisher of the version it refers to gives permission.
K. For any section Entitled "Acknowledgements" or "Dedications", Preserve the Title of the section, and preserve in the section all the substance and tone of each of the contributor acknowledgements and/or dedications given therein.
L. Preserve all the Invariant Sections of the Document, unaltered in their text and in their titles. Section numbers or the equivalent are not considered part of the section titles.
M. Delete any section Entitled "Endorsements". Such a section may not be included in the Modified Version.
N. Do not retitle any existing section to be Entitled "Endorsements" or to conflict in title with any Invariant Section.
O. Preserve any Warranty Disclaimers.

If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this, add their titles to the list of Invariant Sections in the Modified Version's license notice. These titles must be distinct from any other section titles.
You may add a section Entitled "Endorsements", provided it contains nothing but endorsements of your Modified Version by various parties--for example, statements of peer review or that the text has been approved by an organization as the authoritative definition of a standard.
You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through arrangements made by) any one entity. If the Document already includes a cover text for the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one.
The author(s) and publisher(s) of the Document do not by this License give permission to use their names for publicity for or to assert or imply endorsement of any Modified Version.

5. COMBINING DOCUMENTS

You may combine the Document with other documents released under this License, under the terms defined in section 4 above for modified versions, provided that you include in the combination all of the Invariant Sections of all of the original documents, unmodified, and list them all as Invariant Sections of your combined work in its license notice, and that you preserve all their Warranty Disclaimers.
The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license notice of the combined work.
In the combination, you must combine any sections Entitled "History" in the various original documents, forming one section Entitled "History"; likewise combine any sections Entitled "Acknowledgements", and any sections Entitled "Dedications". You must delete all sections Entitled "Endorsements".

6. COLLECTIONS OF DOCUMENTS

You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects.
You may extract a single document from such a collection, and distribute it individually under this License, provided you insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim copying of that document.

7. AGGREGATION WITH INDEPENDENT WORKS

A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a volume of a storage or distribution medium, is called an "aggregate" if the copyright resulting from the compilation is not used to limit the legal rights of the compilation's users beyond what the individual works permit. When the Document is included in an aggregate, this License does not apply to the other works in the aggregate which are not themselves derivative works of the Document.
If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than one half of the entire aggregate, the Document's Cover Texts may be placed on covers that bracket the Document within the aggregate, or the electronic equivalent of covers if the Document is in electronic form. Otherwise they must appear on printed covers that bracket the whole aggregate.

8. TRANSLATION

Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of section 4. Replacing Invariant Sections with translations requires special permission from their copyright holders, but you may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You may include a translation of this License, and all the license notices in the Document, and any Warranty Disclaimers, provided that you also include the original English version of this License and the original versions of those notices and disclaimers. In case of a disagreement between the translation and the original version of this License or a notice or disclaimer, the original version will prevail.
If a section in the Document is Entitled "Acknowledgements", "Dedications", or "History", the requirement (section 4) to Preserve its Title (section 1) will typically require changing the actual title.

9. TERMINATION

You may not copy, modify, sublicense, or distribute the Document except as expressly provided for under this License. Any other attempt to copy, modify, sublicense or distribute the Document is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.
10. FUTURE REVISIONS OF THIS LICENSE

The Free Software Foundation may publish new, revised versions of the GNU Free Documentation License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. See http://www.gnu.org/copyleft/ .
Each version of the License is given a distinguishing version number. If the Document specifies that a particular numbered version of this License "or any later version" applies to it, you have the option of following the terms and conditions either of that specified version or of any later version that has been published (not as a draft) by the Free Software Foundation. If the Document does not specify a version number of this License, you may choose any version ever published (not as a draft) by the Free Software Foundation.
ADDENDUM: How to use this License for your documents

If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, replace the “with...Texts.” line with this:

with the Invariant Sections being LIST THEIR TITLES, with the
Front-Cover Texts being LIST, and with the Back-Cover Texts being LIST.

If you have Invariant Sections without Cover Texts, or some other combination of the three, merge those two alternatives to suit the situation.
If your document contains nontrivial examples of program code, we recommend releasing these examples in parallel under your choice of free software license, such as the GNU General Public License, to permit their use in free software.