Jan 20 2008

Heartbeat 2 Howto

Important note:
Heartbeat is now obsolete and has moved to a new stack available on Clusterlabs. For a simple high-availability project using a virtual IP, try keepalived, which handles monitoring and failover with a single simple configuration file.

Since version 2, Heartbeat can manage more than 2 nodes and no longer needs the “mon” utility to monitor services.
This functionality is now implemented within Heartbeat itself.
As a consequence of this flexibility and the new features, it can be harder to configure.
It is worth knowing that version 1 configuration files are still supported.

Installation

Heartbeat source files are available from the official site http://www.linux-ha.org. Redhat Enterprise compatible rpms can be downloaded from the CentOS website (a link is given on the Heartbeat website). Packages follow the source releases fairly closely: at the time of writing the source version is 2.1.0 while the rpm is 2.0.8-2. 3 rpms are needed (an installation sketch follows the list):

heartbeat-pils
heartbeat-stonith

heartbeat

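A minimal installation sketch, assuming the three rpms have been downloaded to the current directory (exact file names depend on the version and architecture):

rpm -ivh heartbeat-pils-2.0.8-2.*.rpm \
         heartbeat-stonith-2.0.8-2.*.rpm \
         heartbeat-2.0.8-2.*.rpm
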
We are going to monitor Apache but the configuration remains valid for any other service: mail, database, DNS, DHCP, file server, etc…

Diagrams

Failover or load-balancing

Heartbeat supports Active-Passive for failover and Active-Active for load-balancing.
Many other layouts are possible by adding servers and/or services.

We will focus on load-balancing; only a few lines need to be removed to get a failover setup.

[Diagram: Failover]   [Diagram: Load-balancing]

Configuration

In this setup, the 2 servers are interconnected through their eth0 interfaces.
Application traffic arrives on eth1, which is configured as in the diagram, with addresses 192.168.0.4 and .5.
Addresses on eth0 have to be in a sub-network dedicated to Heartbeat.
These addresses must appear in /etc/hosts on every node.
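For example, assuming 10.0.0.1 and 10.0.0.2 are chosen for the dedicated Heartbeat sub-network on eth0 (any private subnet distinct from the eth1 network will do), /etc/hosts would contain:

# /etc/hosts - identical on both nodes
127.0.0.1    localhost
10.0.0.1     n1.domain.com    n1
10.0.0.2     n2.domain.com    n2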

3 files must be configured for Heartbeat 2. As with /etc/hosts, they are identical on each node.

In /etc/ha.d/

  • ha.cf
  • authkeys

In /var/lib/heartbeat/crm/

  • cib.xml

ha.cf

ha.cf contains the main settings, such as the list of cluster nodes and the communication topology.

use_logd on
# Heartbeat packets sending frequency
keepalive 500ms # seconds - Specify ms for times shorter than 1 second

# Period of time after which a node is declared "dead"
deadtime 2
# Send a warning message
# Important to adjust the deadtime value
warntime 1
# Identical to deadtime but when initializing
initdead 8
udpport 694

# Host to be tested to check the node is still online
# The default gateway most likely
ping 192.168.0.1
# Interface used for Heartbeat packets
# Use "serial" for a serial link,
# "mcast" and "ucast" for multicast and unicast
bcast eth0
# Resource switches back on the primary node when it's available
auto_failback on
# Cluster Nodes List
node n1.domain.com
node n2.domain.com
# Activate the Heartbeat 2 configuration (CRM)
crm yes
# Allow a new node to be added dynamically to the cluster
autojoin any

Other options are available such as compression or bandwidth for communication on a serial cable. Check http://linux-ha.org/ha.cf

authkeys

authkeys defines the authentication keys.
Several types are available: crc, md5 and sha1. crc should only be used on a secured sub-network (vlan isolation or a cross-over cable).
sha1 offers a higher level of security but can consume a lot of CPU resources.
md5 sits in between and is not a bad choice.

auth 1
1 md5 secret

The password is stored in clear text, so it is important to change the file permissions to 600.
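As a sketch, the shared secret can be replaced with a random string and the permissions tightened as follows (the md5sum pipeline is just one way of generating a random key):

dd if=/dev/urandom count=4 2>/dev/null | md5sum | cut -d' ' -f1
# paste the resulting string in place of "secret" in /etc/ha.d/authkeys, then:
chmod 600 /etc/ha.d/authkeys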

cib.xml

<cib>
<configuration>

  <crm_config/>
  <nodes/>
  <resources>
    <group id="server1">

      <primitive class="ocf" id="IP1" provider="heartbeat" type="IPaddr">
        <operations>
          <op id="IP1_mon" interval="10s" name="monitor" timeout="5s"/>
        </operations>
        <instance_attributes id="IP1_inst_attr">
          <attributes>
            <nvpair id="IP1_attr_0" name="ip" value="192.168.0.2"/>
            <nvpair id="IP1_attr_1" name="netmask" value="24"/>
            <nvpair id="IP1_attr_2" name="nic" value="eth1"/>
          </attributes>
        </instance_attributes>
      </primitive>

      <primitive class="lsb" id="apache1" provider="heartbeat" type="apache">
        <operations>
          <op id="jboss1_mon" interval="30s" name="monitor" timeout="20s"/>
        </operations>
      </primitive>
    </group>

    <group id="server2">

      <primitive class="ocf" id="IP2" provider="heartbeat" type="IPaddr">
        <operations>
          <op id="IP2_mon" interval="10s" name="monitor" timeout="5s"/>
        </operations>
        <instance_attributes id="IP2_inst_attr">
          <attributes>
            <nvpair id="IP2_attr_0" name="ip" value="192.168.0.3"/>
            <nvpair id="IP2_attr_1" name="netmask" value="24"/>
            <nvpair id="IP2_attr_2" name="nic" value="eth1"/>
          </attributes>
        </instance_attributes>
      </primitive>

      <primitive class="lsb" id="apache2" provider="heartbeat" type="apache">
        <operations>
          <op id="jboss2_mon" interval="30s" name="monitor" timeout="20s"/>
        </operations>
      </primitive>
    </group>
  </resources>

  <constraints>
    <rsc_location id="location_server1" rsc="server1">
      <rule id="best_location_server1" score="100">
        <expression_attribute="#uname" id="best_location_server1_expr" operation="eq"
        value="n1.domain.com"/>
      </rule>
    </rsc_location>

    <rsc_location id="location_server2" rsc="server2">
      <rule id="best_location_server2" score="100">
        <expression_attribute="#uname" id="best_location_server2_expr" operation="eq"
        value="n2.domain.com"/>
      </rule>
    </rsc_location>

    <rsc_location id="server1_connected" rsc="server1">
      <rule id="server1_connected_rule" score="-INFINITY" boolean_op="or">
        <expression id="server1_connected_undefined" attribute="pingd"
        operation="not_defined"/>
        <expression id="server1_connected_zero" attribute="pingd" operation="lte"
        value="0"/>
      </rule>
    </rsc_location>

    <rsc_location id="server2_connected" rsc="server2">
      <rule id="server2_connected_rule" score="-INFINITY" boolean_op="or">
        <expression id="server2_connected_undefined" attribute="pingd"
        operation="not_defined"/>
        <expression id="server2_connected_zero" attribute="pingd" operation="lte"
        value="0"/>
      </rule>
    </rsc_location>
  </constraints>

</configuration>
</cib>

It’s possible to generate the file from a Heartbeat 1 configuration file, haresources located in /etc/ha.d/, with the following command:
python /usr/lib/heartbeat/haresources2cib.py > /var/lib/heartbeat/crm/cib.xml
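As an illustration, a Heartbeat 1 haresources line equivalent to the server1 group defined above might look like this (hypothetical, reusing the addresses from this article):

n1.domain.com IPaddr::192.168.0.2/24/eth1 apache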

The file can be split in 2 parts: resources and constraints.

Resources

Resources are organized in groups (server1 and server2), each putting together a virtual IP address and a service, here Apache.
Resources are declared with the <primitive> syntax within the group.
Groups are useful to gather several resources under the same constraints.

The IP1 primitive brings up virtual IP 1 and checks that it is reachable.
It executes the OCF IPaddr script.
OCF scripts are provided with Heartbeat in the rpm packages.
The virtual address, the network mask and the interface are specified as instance attributes.
The Apache resource class is LSB, meaning it calls a startup script located in the usual /etc/init.d.
The script’s name is given in the type attribute: type=”name”.
In order to run with Heartbeat, the script must be LSB compliant; a minimal skeleton is sketched after the list. Being LSB compliant means the script must:

  • return its status with “script status”
  • not fail “script start” on a service that is already running
  • not fail stopping a service already stopped

All LSB specifications can be checked at http://www.linux-ha.org/LSBResourceAgent
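
A minimal sketch of such a script, using a hypothetical daemon called mydaemon (the real Apache init script shipped by the distribution already follows these rules):

#!/bin/sh
# /etc/init.d/myservice - illustrative LSB-style skeleton
case "$1" in
  start)
    # starting an already running service must not fail
    pidof mydaemon >/dev/null && exit 0
    /usr/sbin/mydaemon || exit 1
    ;;
  stop)
    # stopping an already stopped service must not fail
    pidof mydaemon >/dev/null && killall mydaemon
    exit 0
    ;;
  status)
    # report the real status: 0 if running, 3 if stopped
    if pidof mydaemon >/dev/null; then
      echo "mydaemon is running"
    else
      echo "mydaemon is stopped"
      exit 3
    fi
    ;;
  *)
    echo "Usage: $0 {start|stop|status}"
    exit 1
    ;;
esac
exit 0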

The following time values can be defined:

  • interval: defines how often the resource’s status is checked
  • timeout: defines the time period before considering a start, stop or status action failed

Constraints

2 constraints apply to each group of resources:

  • The preferred location where the resources “should” run. We give a score of 100 to n1.domain.com for the 1st group. Hence, if n1.domain.com is active and the auto_failback option is set to “on”, the resources in this group will always come back to it.

  • An action depending on the ping result. If none of the gateways answer ping packets, the resources move to another server and the node goes into standby status.

A score of -INFINITY means the node will never accept these resources while the gateways are unreachable.

Important Notes

Files Rights

The /etc/ha.d directory contains sensitive data. Permissions have to be changed so that files are accessible by the owner only, or the application will not launch.

chmod 600 /etc/ha.d/ha.cf

The /var/lib/heartbeat/crm/cib.xml file has to belong to user hacluster and group haclient. It must not be accessible by other users.
chown hacluster:haclient /var/lib/heartbeat/crm/cib.xml
chmod 660 /var/lib/heartbeat/crm/cib.xml

cib.xml is written to by Heartbeat, and companion files are created alongside it.
If you need to edit it manually, stop Heartbeat on all nodes, remove cib.xml.sig in the same directory, edit cib.xml on all nodes, and restart Heartbeat.
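A sketch of that procedure, to be repeated on every node (stop Heartbeat everywhere before editing anything):

/etc/init.d/heartbeat stop
rm -f /var/lib/heartbeat/crm/cib.xml.sig
vi /var/lib/heartbeat/crm/cib.xml
/etc/init.d/heartbeat start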

It is advised to use crm_resource to make modifications (see the section below).

hosts files

The /etc/hosts file must contain all the cluster nodes’ hostnames. These names must be identical to the output of the ‘uname -n’ command.
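A quick consistency check, to be run on each node:

uname -n    # must match the node names used in ha.cf and /etc/hosts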

Services Startup

Heartbeat starts Apache itself when it finds the service inactive. It is better to disable Apache’s automatic startup (with chkconfig for instance): since the machine may take a while to boot, Heartbeat could otherwise try to start the service a second time.

chkconfig httpd off

There is a startup script in /etc/init.d/. To launch Heartbeat, run “/etc/init.d/heartbeat start”. You can also launch Heartbeat automatically at boot time:
chkconfig heartbeat on

Behaviour and Tests

The following actions simulate incidents that may occur at some stage, and alter the cluster’s health. We study Heartbeat’s behaviour in each case.
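While running these tests, it is convenient to keep an eye on the cluster from any node, for instance with:

crm_mon -i 2    # refresh the node and resource status every 2 seconds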

Apache shutdown

The local Heartbeat will restart Apache. If the restart fails, a warning is sent to the logs and Heartbeat will not try to launch the service anymore. The virtual IP address remains on the node; however, it can be moved manually with the Heartbeat tools.

Server’s Shutdown or Crash

Virtual addresses move to other servers.

Heartbeat’s Manual Shutdown on one Node

Stopping Heartbeat stops the resources it manages on that node. Virtual addresses are moved to the other nodes. The normal procedure is to run crm_standby to turn the node into standby mode and migrate the resources across.

Node to node cable disconnection

Each machine thinks it is on its own and takes the 2 virtual IPs. However, this is not really a problem, as the gateway keeps sending packets to the last entry in its ARP table. A ping to the broadcast address shows duplicate replies.
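This can be observed from another machine on the LAN, assuming 192.168.0.255 is the broadcast address of the eth1 network:

ping -b 192.168.0.255
# duplicate replies marked "DUP!" show that both nodes carry the same virtual IP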

Gateway disconnection

This could happen in 2 cases:
– The gateway is unreachable. In this case, the 2 nodes remove their virtual IP address and stop Apache.
– The connection to one of the nodes is down. Addresses move to the other server which can normally reach the gateway.

In all these simulations the cluster stays up, except when the gateway itself is gone. But that is not a cluster problem…

Tools

Heartbeat’s health can be checked in several ways.

Logs

Heartbeat sends messages to the logd daemon that stores them in the /var/log/messages system file.

Unix Tools

The usual Unix commands can be used. Heartbeat creates sub-interfaces that you can check with “ifconfig” or “ip address show”. Process status can be displayed with the startup script or with the “ps” command, for instance.

Heartbeat Commands

Heartbeat is provided with a set of commands. Here are the main ones (typical invocations are sketched after the list):

  • crmadmin: Controls the node managers on each machine of the cluster.
  • crm_mon: Quick and useful; displays node and resource status.
  • crm_resource: Queries and modifies resource/service related data; can list, migrate, disable or delete resources.
  • crm_verify: Reports the cluster’s warnings and errors.
  • crm_standby: Migrates all of a node’s resources. Useful when upgrading.
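
A few typical invocations, as a sketch (resource and node names are the ones used in this article):

crm_mon -1                            # one-shot display of nodes and resources
crm_verify -L -V                      # check the live configuration for warnings and errors
crm_resource -L                       # list the configured resources
crm_resource -M -r server1            # ask the cluster to migrate the server1 group
crm_standby -U n1.domain.com -v true  # put n1 in standby (see Maintenance below)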

Maintenance

Service Shutdown

When upgrading or shutting Apache down, you may proceed as follows:

  • Stop Apache on node 1 and migrate the virtual address to node 2:
    crm_standby -U n1.domain.com -v true
  • Upgrade n1, then bring the node back into the cluster:
    crm_standby -U n1.domain.com -v false
  • Resources automatically fail back to n1 (if cib.xml is properly configured)

Proceed the same way on n2

Note: the previous commands can be run from any node of the cluster.

It is possible to switch both nodes to standby at the same time; Apache will be stopped on the 2 machines and the virtual addresses removed until a node comes back to running mode.

Machine Reboot

It is not required to switch the node to standby mode before rebooting. However, it is better practice, as resources migrate faster since there is no failure-detection delay.

Resource Failure

Example: Apache crashed and no longer restarts, so all of the group’s resources are moved to the 2nd node.

When the issue is resolved and the resource is up again, the cluster’s health can be checked with:
crm_resource --reprobe (or crm_resource -P)

and the resource restarted:
crm_resource --cleanup --resource apache1 (or crm_resource -C -r apache1).
It will move automatically back to the original server.

Adding a New Node to the cluster

In Standby

If the cluster so far contains 2 nodes connected with a cross-over cable, you will then need a switch for the heartbeat network interfaces.

You first need to add the new node’s information. Edit the /etc/hosts files on the current nodes and add the new node’s hostname. The /etc/hosts contents must be copied across to the new node. Configure Heartbeat on it in the same way as on the other nodes.
The ha.cf files must contain the “autojoin any” setting to accept new nodes on the fly.

On the new host, start Heartbeat; it should join the cluster automatically.

If not, run the following on one of the 1st nodes:
/usr/lib/heartbeat/hb_addnode n3.domain.com
The new node only acts as a failover host; no service is associated with it. If n1.domain.com goes into standby, its resources will move to n3. They will come back to the original server as soon as it is up again (n1 being the preferred server).

With a New Service

To add a 3rd IP (along with a 3rd Apache), you need to follow the above procedure and then:
– Either stop Heartbeat on the 3 servers and edit the cib.xml files
– Or build files similar to cib.xml, containing the new resources and constraints to be added, and load them into the live cluster. This is the preferred method. Create the following files on one of the nodes:

newGroup.xml

<group id="server3">
  <primitive class="ocf" provider="heartbeat" type="IPaddr" id="IP3">
    <operations>
      <op id="IP3_mon" interval="5s" name="monitor" timeout="2s"/>
    </operations>
    <instance_attributes id="IP3_attr">
      <attributes>
        <nvpair id="IP3_attr_0" name="ip" value="192.168.0.28"/>
        <nvpair id="IP3_attr_1" name="netmask" value="29"/>
        <nvpair id="IP3_attr_2" name="nic" value="eth1"/>
      </attributes>
    </instance_attributes>
  </primitive>

  <primitive class="lsb" provider="heartbeat" type="apache" id="apache3">
    <operations>
      <op id="apache3_mon" interval="5s" name="monitor" timeout="5s"/>
    </operations>
  </primitive>
</group>

newLocationConstraint.xml

<rsc_location id="location_server3" rsc="server3">
  <rule id="best_location_server3" score="100">
    <expression attribute="#uname" id="best_location_server3_expr" operation="eq"
    value="n3.domain.com"/>
  </rule>
</rsc_location>

newPingConstraint.xml

<rsc_location id="server3_connected" rsc="server3">
  <rule id="server3_connected_rule" score="-INFINITY" boolean_op="or">
    <expression id="server3_connected_undefined" attribute="pingd" 
    operation="not_defined"/>
    <expression id="server3_connected_zero" attribute="pingd" operation="lte" 
    value="0"/>
  </rule>
</rsc_location>

Add the n3 constraints:
cibadmin -C -o constraints -x newLocationConstraint.xml
cibadmin -C -o constraints -x newPingConstraint.xml

Add the n3 resources:

cibadmin -C -o resources -x newGroup.xml

n3 resources should start right away.

Note: constraints can only be added one at a time. If you try to add 2 constraints from within the same file, only the first will be set.
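To check what actually made it into the CIB after each call, the constraints section can be dumped with:

cibadmin -Q -o constraints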



23 Responses to “Heartbeat 2 Howto”

  1. David on 10 Feb 2009 at 7:01 pm

    “expression_attribute” in your resource constraints is invalid (the underscore doesn’t belong). It should be “expression attribute”

  2. Allie Syadiqin on 12 Feb 2009 at 1:16 am

    Thanks for the howto. It really help me to understand Heartbeat V2 a bit more as I am struggling to configure my servers to use V2.

    If I need to configure for a failover setup instead of load-balancing, what should I change in your example?

    FYI, I am currently running failover using V1 with 2 nodes, which works fine but I need to add a 3rd node which now require me to setup V2. My required setup is as follows :

    Shared IP : 10.1.1.2
    LB1 (10.1.1.10) – Primary/Lead node
    LB2 (10.1.1.11) – Secondary node should LB1 go offline
    LB3 (10.1.1.12) – Backup node. Will only takeover IP if both LB1 and LB2 goes offline

    Any help is greatly appreciated. Thanks again.

  3. ha_admin on 13 Apr 2010 at 12:19 am

    Hi!

    Im configuring HA-Cluster with HeartBeat. I have the same question than Allie Syadiqin:

    What points in config should I change to have a failover cluster rather than a load-balancing?

    Thanks in advance!

  4. dave on 14 Apr 2010 at 7:30 pm

    Hi,

    I wrote this a while ago but if my memory is correct, you need to remove the whole block as you don’t need the 2nd IP nor the service anymore. You also need to remove associated constraints indeed in the following block.
    Look at how a 3rd resource is added below and that’s basically what you need to remove.

    I just saw the IPs on the schema don’t match (192.168 should be 10.0), but the .3 IP and the service on the 2nd server are no more needed.

    If I wrote a failover howto, people would ask for load-balancing 😉

  5. xps on 23 May 2010 at 10:22 pm

    Hi Dave,

    In configuration step you say that both servers are connected between eth0 interface and this interface have to be in a subnetwork
    dedicated to heartbeat.

    Also, you say that we have to configure eth1 interfaces with addresses as 192.168.0.4 and 192.168.0.5.

    I’m not sure that understand it totally, so i think i have to configure as down:

    – eth0 connect directly between servers, not to the local network. (for example. ip 192.168.1.2 and .3)
    – eth1 connect to the local network and to internet. (pc – switch – internet) with different addresses of eth0

    It’s that correct?

    Thanks and excuse my poor english 😉

  6. dave on 04 Jun 2010 at 5:40 pm

    Hi, that’s exactly it. eth1 are your production IPs while eth0 is a different subnet for heartbeats only.
    It could be on the same network but it wouldn’t be secure

  7. tejas on 15 Sep 2010 at 7:39 am

    In the above XML, are that 10.0.0.2 and 10.0.0.3 are virtual IP’s, one for each server group.

  8. dave on 15 Sep 2010 at 4:15 pm

    yes they’re virtual IPs. They should be 192.168.0.2 and 3 though to match the pic. I just made the change on the post.

  9. david on 26 Oct 2010 at 7:33 pm

    thanks for the great tutorial!!!

  10. bharat on 03 Nov 2010 at 2:31 pm

    thanks for this nice tutorial as it really helped me as i am very new to linux
    wat will be in haresources file if i will try it with heartbeat 1 .

  11. dave on 04 Nov 2010 at 8:57 pm

    Heartbeat 1 is a complete different story since it doesn’t manage services.
    Check http://www.netexpertise.eu/en/mysql/mysql-failover.html but Heartbeat 2 is way more advanced and powerful

  12. server on 12 Nov 2010 at 9:09 am

    In 3 node system which one will be selected DC for the first time
    will it be order in which they are in cib or something else
    as in two node second one is elected as current DC ,
    please reply
    Thanks in advance

  13. dave on 13 Nov 2010 at 2:29 pm

    Look at location rules above, you can change the score for each one of your nodes

  14. server on 17 Nov 2010 at 9:06 am

    what is the default order of resource failover and DC election in more than three node if i dont specify any lovation constraint as i have gone through them but TL dont want them to be included in cib
    is there any way to do it or any doc specifying this default behaviour of heartbeat2 ????
    please reply
    thanks in advance

  15. server on 25 Nov 2010 at 5:44 am

    hi,
    previously i was trying to establish four node architecture using heartbeat v1 and i got the problem but now i shifted to heartbeat 2 and trying same architecture

    4 node like
    Node A, B, C, D.
    and
    one resource Ipaddr

    but here i am facing problem with ip failover as

    1- Node A goes down then with default setting and without constraint which node will get the ip resource ????.what will be the default behaviour of cluster in such schenario.

    2- if node B gets resource in above case then when i make node A up and node B down then resource is failing back on node A which is not expected either it should move to node C or D which is desired

    cib.xml

    no-quorum-policy — ignore
    resource-stickiness —- INFINITY
    resource-failure-stickiness—- 0

    node description
    then resource description having monitor operation on

    no constraints

    ha.cf

    crm — yes
    autojoin —- yes
    node A B C D
    auto failback off

    any one having information related it please help in need…….
    thanks in advance

  16. rahul on 26 Nov 2010 at 1:38 pm

    in Heartbeat 2 cluster of three nodes how to stop auto failback:

    as in cib i am having default resource stickiness —- INFINITY, and no constraint.

    1- IF i stop heartbeat on node A failover happens and ip resource shifts to node B stays
    there if i start heartbeat on node A, but

    2- When i give init 6 on node A again failover happens and ip resource shifts to node B but as node A again joins the cluster after reboot (heartbeat is on at run level services boot time) ip resource fails back to node A which is not expected

    if any one have solution then please help…..

    Thanks

  17. dave on 26 Nov 2010 at 5:33 pm

    Hi,

    1/ I have no clue. Test it and let everyone know

    2/ It may come back to the first node? Same, if you have a chance to test it

  18. Dangtiz on 02 Dec 2010 at 4:23 pm

    Hi all,

    Whether the heartbeat version 2(more than two node architecture) is it a stable one which can be used for the production purpose.

    If no please let us know the stable release of heartbeat version which can be used for the production purpose with more than two nodes.

    Waiting for the reply……….

    Regards,
    Dangtiz

  19. dave on 02 Dec 2010 at 5:51 pm

    According to the official website, the last stable version is 3.0.3.
    I’ve been using Heartbeat 2 for a few years on production servers and I’ve had no problem so far

  20. Dangtiz on 04 Dec 2010 at 7:44 am

    Thanks for the reply,

    As we are using the SUSE Linux Enterprise Server 10 SP2 in our production servers and also we cant upgrade to higher version as they are working in real time so can you suggest us which stable Heartbeat version to use with respect to this SLES 10 SP2.

    Waiting for the reply,

    Regards,
    Dangtiz.

  21. Ben Erridge on 22 Dec 2010 at 5:11 pm

    All I can say is thanks! simple to the point and so useful!

  22. v nagrik on 30 Apr 2011 at 1:08 am

    where can I download haresources2cib.py.

    nagrik

  23. Paul.LKW on 17 Aug 2013 at 11:03 pm

    Your how to is very useful as I searched whole night and seems no one site mentioned V2 will not use “haresources” file, I have been played by the box whole night until I find this page.
    All is working fine now.

    Very Goooooooooooooooooooooooooooooooood.
