Jan 20 2008

Heartbeat 2 Howto

Published by dave at 12:00 am under Apache, linux


Since version 2, Heartbeat is able to manage more than 2 nodes, and doesn’t need "mon" utility to monitor services.
This fonctionnality is now implemented within Heartbeat.
As a consequence, this flexibility and new features may make it harder to configure.
It’s intresting to know version 1 configuration files are still supported.

 

Installation

Heartbeat source files are available from the official site http://www.linux-ha.org. Redhat Enterprise compatible rpms can be downloaded from the Centos website (link given from Heartbeat website). Packages are built pretty quickly as source version is 2.1.0 when rpm is 2.0.8-2 at the time this article was made. 3 rpms are needed:

heartbeat-pils
heartbeat-stonith

heartbeat

We are going to monitor Apache but the configuration remains valid for any other service: mail, database, DNS, DHCP, file server, etc…

 

Diagrams

 
Failover or load-balancing

Heartbeat supports Active-Passive for the failover mode and Active-Active for the load-balancing.
Many other options are possible adding servers and/or services; There are many many possibilities.

We will focus on load-balancing, only a few lines need to be removed for failover.

  Failover   Load-balancing  

 

Configuration

In this setup, the 2 servers are interconnected through their eth0 interface.
Applicative flows come on eth1, that is configured as in the diagram below, with addresses 192.168.0.4 and .5
Addresses on eth0 habe to be in a sub-network dedicated to Heartbeat.
These addresses must appear in /etc/hosts on every node.

3 files must be configured for Heartbeat 2. Again, they are identical on each node.

In /etc/ha.d/

  • ha.cf
  • authkeys

In /var/lib/heartbeat/crm/

  • cib.xml

 
ha.cf

ha.cf contains the main settings like cluster nodes or the communication topology.

use_logd on
# Heartbeat packets sending frequency
keepalive 500ms # seconds - Specify ms for times shorter than 1 second

# Period of time after which a node is declared "dead"
deadtime 2
# Send a warning message
# Important to adjust the deadtime value
warntime 1
# Identical to deadtime but when initializing
initdead 8
updport 694

# Host to be tested to check the node is still online
# The default gateway most likely
ping 192.168.0.1
# Interface used for Heartbeat packets
# Use the serial port,

# "mcast" and "ucast" for multicast and unicast
bcast eth0
# Resource switches back on the primary node when it's available
auto_failback on
# Cluster Nodes List
node n1.domain.com
node n2.domain.com
# Activate Heartbeat 2 Configuration

crm yes
# Allow to add dynamically a new node to the cluster
autojoin any

Other options are available such as compression or bandwidth for communication on a serial cable. Check http://linux-ha.org/ha.cf

 
authkeys

Authkeys defines authentication keys.
Several types are available: crc, md5 and sha1. crc is to be used on a secured sub-network (vlan isolation or cross-over cable.
sha1 offers a higher level of secuirity but can consume a lot CPU resources.
md5 sits in between. It’s not a bad choice.

auth 1
1 md5 secret

Password is stored in clear text, it is important to change the file permissions to 600.

 
cib.xml

<cib>
<configuration>

  <crm_config/>
  <nodes/>
  <resources>
    <group id="server1">

      <primitive class="ocf" id="IP1" provider="heartbeat" type="IPaddr">
        <operations>
          <op id="IP1_mon" interval="10s" name="monitor" timeout="5s"/>
        </operations>
        <instance_attributes id="IP1_inst_attr">
          <attributes>
            <nvpair id="IP1_attr_0" name="ip" value="10.0.0.2"/>
            <nvpair id="IP1_attr_1" name="netmask" value="24"/>
            <nvpair id="IP1_attr_2" name="nic" value="eth1"/>
          </attributes>
        </instance_attributes>
      </primitive>

      <primitive class="lsb" id="apache1" provider="heartbeat" type="apache">
        <operations>
          <op id="jboss1_mon" interval="30s" name="monitor" timeout="20s"/>
        </operations>
      </primitive>
    </group>

    <group id="server2">

      <primitive class="ocf" id="IP2" provider="heartbeat" type="IPaddr">
        <operations>
          <op id="IP2_mon" interval="10s" name="monitor" timeout="5s"/>
        </operations>
        <instance_attributes id="IP2_inst_attr">
          <attributes>
            <nvpair id="IP2_attr_0" name="ip" value="10.0.0.3"/>
            <nvpair id="IP2_attr_1" name="netmask" value="24"/>
            <nvpair id="IP2_attr_2" name="nic" value="eth1"/>
          </attributes>
        </instance_attributes>
      </primitive>

      <primitive class="lsb" id="apache2" provider="heartbeat" type="apache">
        <operations>
          <op id="jboss2_mon" interval="30s" name="monitor" timeout="20s"/>
        </operations>
      </primitive>
    </group>
  </resources>

  <constraints>
    <rsc_location id="location_server1" rsc="server1">
      <rule id="best_location_server1" score="100">
        <expression_attribute="#uname" id="best_location_server1_expr" operation="eq"
        value="n1.domain.com"/>
      </rule>
    </rsc_location>

    <rsc_location id="location_server2" rsc="server2">
      <rule id="best_location_server2" score="100">
        <expression_attribute="#uname" id="best_location_server2_expr" operation="eq"
        value="n2.domain.com"/>
      </rule>
    </rsc_location>

    <rsc_location id="server1_connected" rsc="server1">
      <rule id="server1_connected_rule" score="-INFINITY" boolean_op="or">
        <expression id="server1_connected_undefined" attribute="pingd"
        operation="not_defined"/>
        <expression id="server1_connected_zero" attribute="pingd" operation="lte"
        value="0"/>
      </rule>
    </rsc_location>

    <rsc_location id="server2_connected" rsc="server2">
      <rule id="server2_connected_rule" score="-INFINITY" boolean_op="or">
        <expression id="server2_connected_undefined" attribute="pingd"
        operation="not_defined"/>
        <expression id="server2_connected_zero" attribute="pingd" operation="lte"
        value="0"/>
      </rule>
    </rsc_location>
  </constraints>

</configuration>
</cib>

 

It’s possible to generate the file from a Heartbeat 1 configuration file, haresources located in /etc/ha.d/, with the following command:
python /usr/lib/heartbeat/haresources2cib.py > /var/lib/heartbeat/crm/cib.xml

The file can be split in 2 parts: resources and constraints.

 
Resources

Resources are organized in groups (server1 & 2) putting together a virtual IP address and a service: Apache.
Resources are declared with the <primitive> syntax within the group.
Groups are useful to gather several resources under the same constraints.

The IP1 primitive checks virtual IP 1 is reachable.
It executes OCF type IPaddr script.
OCF scripts are provided with Heartbeat in the rpm packages.
It’s also possible to specify the virtual address, the network mask as well as the interface.
Apache resource type is LSB, meaning it makes a call to a startup script located in the usual /etc/init.d.
The script’s name in the variable type: type="name".
In order to run with Heartbeat, the script must be LSB compliant. LSB compliant means the script must:

  • return its status with “script status”
  • not fail “script start” on a service that is already running
  • not fail stopping a service already stopped

All LSB specifications can be checked at http://www.linux- ha.org/LSBResourceAgent

The following time values can be defined:

  • interval: defines how often the resource’s status is checked
  • timeout: defines the time period before considering a start, stop or status action failed

 
Constraints

2 constraints apply to each group of resources:
The favourite location where resources "should" run.
We give a score of 100 to n1.domain.com for the 1st group.
Hence, if n1.domain.com is active, and option auto_failback is set to "on", resources in this group will always come back there.

Action depending on ping result. If none of the gateways answer ping packets, resources move on another server, and the node goes to standby status.

Score -INFINITY means the the node will never ever accept these resources if the gateways are unreachable.

 

Important Notes

 
Files Rights

/etc/ha.cf directory contains sensible data. Rights have to be changed to be accessed by the owner only – or the application will not launch.

chmod 600 /etc/ha.d/ha.cf

The /var/lib/heartbeat/crm/cib.crm file has to belong to hacluster and group haclient. It must not be accessible by other users.
chown hacluster:haclient /var/lib/heartbeat/crm/cib.crm
chmod 660 /var/lib/heartbeat/crm/cib.crm

cib.xml is accessed in write mode by Heartbeat. Some files are created at the same time.
If you need to edit it manually, stop Heartbeat on all nodes, remove cib.xml.sig in the same directory, edit cib.xml on all nodes, and restart Heartbeat.

It’s advised to use crm_resource to bring modifications (see section below).

 
hosts files

The /etc/hosts file must contain all the cluster nodes hostnames. These names should be identical to the result of ‘uname -n’ command.

 
Services Startup

Heartbeat starts Apache when the service is inactive. It is better to disable Apache automatic startup with chkconfig for instance, as it may take a while to boot, Heartbeat could start it a second time.

chkconfig httpd off

There is a startup script in /etc/init.d/. To launch Heartbeat, run “/etc/init.d/hearbeat start”. Yu can launch Heartbeat automatically at boot time:
chkconfig heartbeat on

 

Behaviour and Tests

The following actions simulate incidents that may occur at some stage, and alter the cluster’s health. We study Heartbeat’s behaviour in each case.

 
Apache shutdown

Local Heartbeat will restart Apache. If the reboot fails, a warning is sent to the logs. Heartbeat won’t launch the service anymore then. The virtual IP address remains on the node. However, the address can be moved manually with Heartbeat tools.

 
Server’s Shutdown or Crash

Virtual addresses move to other servers.

 
Heartbeat’s Manual Shutdown on one Node

Heartbeat stops Heartbeat. Virtual addresses are moved to other nodes. The normal procedure is to run crm_standby to turn the node into standby mide and migrate resources accross.

 
Node to node cable disconnection

Each machine thinks it’s on its own and takes the 2 virtual IPs. However, it is not really a problem as the gateway keeps on sending packets to the last entry in the ARP table. A ping on the broadcast address sends duplicates.

 
Gateway disconnection

This could happen in 2 cases:
- The gateway is unreachable. In this case, the 2 nodes remove their virtual IP address and stop Apache.
- The connection to one of the nodes is down. Addresses move to the other server which can normally reach the gateway.

All simulations allow to maintain the cluster up except if the gateway is gone. But this is not a cluster problem…

 

Tools

Heartbeat’s health can be checked via different ways.

 
Logs

Heartbeat sends messages to the logd daemon that stores them in the /var/log/messages system file.

 
Unix Tools

Usual Unix commands can be used. Heartbeat creates sub-interfaces that you can check with "ifconfig" or "ip address show". Processes status can be displayed with the startup scripts or with the "ps" command for instance.

 
Heartbeat Commands

Heartbeat is provided with a set of commands. Here are the main ones:

  • crmadmin: Controls nodes managers on each machine of the cluster.
  • crm_mon: Quick and useful; Displays nodes status and resources.
  • crm_resource: Makes requests and modifies resources/services related data. Ability to list, migrate, disable or delete resources.
  • crm_verify: Reports the cluster’s warnings and errors
  • crm_standby: Migrate all node’s resources. Useful when upgrading.

 

Maintenance

 
Service Shutdown

When upgrading or shutting Apache down, you may proceed as follow:

  • Stop Apache on node 1 and migrate the virtual address on node 2:
    crm_standby -U n1.domain.com -v true
  • Start upgrading n1 and restart the node:
    crm_standby -U n1.domain.com -v false
  • Resources automatically fail back to n1 (if cib.xml is properly configured)

Proceed the same way on n2

Rem: The previous commands can be run from any node of the cluster.

It is possible to switch the 2 nodes on standby at the same time; Apache will be stopped on the 2 machines and the virtual addresses deleted until it comes back to running mode.

 
Machine Reboot

It is not required to switch the node to standby mode before rebooting. However, it is better practice as resources migrate faster, detection time being inexistant.

 
Resource Failure

Example: Apache crashed and doesn’t restart anymore, all group resources are moved to the 2nd node.

When the issue is resolved and the resource is up again, the cluster’s health can be checked with:
crm_resource –reprobe (-P)

and the resource restarted:
crm_resource –cleanup –resource apache1 (ou crm_resource -C -r apache1).
It will move automatically back to the original server.

 

Adding a New Node to the cluster

 
In Standby

If the cluster contains 2 nodes connected with a cross-over cable, you then will need a switch for the heartbeats network interfaces.

You first need to add new node’s informations. Edit the /etc/hosts files on the current nodes and add the node’s hostname. /etc/hosts contents must be copied accross to the new node. Configure Heartbeat in the same way as other nodes.
ha.cf files must contain the "autojoin any" setting to accept new nodes on the fly.

On the new host, start Heartbeat; The should join the cluster automatically.

If not, run the following on one of the 1st nodes:
/usr/lib/heartbeat/hb_addnode n3.domain.com
The new node only acts as a failover, no service is associated. If n1.domain.com goes in standby, resources will move to n3. They will come back to the original server as soon as it comes back up (n1 being the favourite server).

 
With a New Service

To add a 3rd IP (along with a 3rd Apache), you need to follow the above procedure and then:
- Either stop Heartbeat on the 3 servers and edit cib.xml files
- Or build similar files to cib.xml, containing new resources and constraints to be added, and add them to the cluster in live. This is the preferred method. Create the following files on one of the nodes:
 
newGroup.xml

<group id="server3">
  <primitive class="ocf" provider="heartbeat" type="IPaddr" id="IP3">
    <operations>
      <op id="IP3_mon" interval="5s" name="monitor" timeout="2s"/>
    </operations>
    <instance_attributes id="IP3_attr">
      <attributes>
        <nvpair id="IP3_attr_0" name="ip" value="10.0.0.28"/>
        <nvpair id="IP3_attr_1" name="netmask" value="29"/>
        <nvpair id="IP3_attr_2" name="nic" value="eth1"/>
      </attributes>
    </instance_attributes>
  </primitive>

  <primitive class="lsb" provider="heartbeat" type="apache" id="apache3">
    <operations>
      <op id="apache3_mon" interval="5s" name="monitor" timeout="5s"/>
    </operations>
  </primitive>
</group>

 

newLocationConstraint.xml

<rsc_location id="location_server3" rsc="server3">
  <rule id="best_location_server3" score="100">
    <expression attribute="#uname" id="best_location_server3_expr" operation="eq"
    value="n3.domain.com"/>
  </rule>
</rsc_location>

 

newPingConstraint.xml

<rsc_location id="server3_connected" rsc="server3">
  <rule id="server3_connected_rule" score="-INFINITY" boolean_op="or">
    <expression id="server3_connected_undefined" attribute="pingd" 
    operation="not_defined"/>
    <expression id="server3_connected_zero" attribute="pingd" operation="lte" 
    value="0"/>
  </rule>
</rsc_location>

 

Add n3 constraints
cibadmin -C -o constraints -x newLocationConstraint.xml
cibadmin -C -o constraints -x newPingConstraint.xml

Add n3 resources

cibadmin -C -o resources -x newGroup.xml

n3 resources should start right away.

Note: Adding constraints can be done one at a time. If you try to add 2 constraints from within the same file, only the first will be set.


6 responses so far

6 Responses to “Heartbeat 2 Howto”

  1. Davidon 10 Feb 2009 at 7:01 pm

    “expression_attribute” in your resource constraints is invalid (the underscore doesn’t belong). It should be “expression attribute”

  2. Allie Syadiqinon 12 Feb 2009 at 1:16 am

    Thanks for the howto. It really help me to understand Heartbeat V2 a bit more as I am struggling to configure my servers to use V2.

    If I need to configure for a failover setup instead of load-balancing, what should I change in your example?

    FYI, I am currently running failover using V1 with 2 nodes, which works fine but I need to add a 3rd node which now require me to setup V2. My required setup is as follows :

    Shared IP : 10.1.1.2
    LB1 (10.1.1.10) – Primary/Lead node
    LB2 (10.1.1.11) – Secondary node should LB1 go offline
    LB3 (10.1.1.12) – Backup node. Will only takeover IP if both LB1 and LB2 goes offline

    Any help is greatly appreciated. Thanks again.

  3. ha_adminon 13 Apr 2010 at 12:19 am

    Hi!

    Im configuring HA-Cluster with HeartBeat. I have the same question than Allie Syadiqin:

    What points in config should I change to have a failover cluster rather than a load-balancing?

    Thanks in advance!

  4. daveon 14 Apr 2010 at 7:30 pm

    Hi,

    I wrote this a while ago but if my memory is correct, you need to remove the whole block as you don’t need the 2nd IP nor the service anymore. You also need to remove associated constraints indeed in the following block.
    Look at how a 3rd resource is added below and that’s basically what you need to remove.

    I just saw the IPs on the schema don’t match (192.168 should be 10.0), but the .3 IP and the service on the 2nd server are no more needed.

    If I wrote a failover howto, people would ask for load-balancing ;-)

  5. xpson 23 May 2010 at 10:22 pm

    Hi Dave,

    In configuration step you say that both servers are connected between eth0 interface and this interface have to be in a subnetwork
    dedicated to heartbeat.

    Also, you say that we have to configure eth1 interfaces with addresses as 192.168.0.4 and 192.168.0.5.

    I’m not sure that understand it totally, so i think i have to configure as down:

    – eth0 connect directly between servers, not to the local network. (for example. ip 192.168.1.2 and .3)
    – eth1 connect to the local network and to internet. (pc – switch – internet) with different addresses of eth0

    It’s that correct?

    Thanks and excuse my poor english ;)

  6. daveon 04 Jun 2010 at 5:40 pm

    Hi, that’s exactly it. eth1 are your production IPs while eth0 is a different subnet for heartbeats only.
    It could be on the same network but it wouldn’t be secure

Comments RSS

Leave a Reply

Security Code: