[TriLUG] Configuring a simple 1:1 Linux HA cluster on CentOS 5.3
Ronald Kelley
rkelleyrtp at gmail.com
Tue Oct 27 13:58:22 EDT 2009
Since I missed the last TriLUG meeting (specifically on Linux HA), I
spent the better part of two days getting a couple of servers
configured for a 1:1 failover solution. So far, my cluster has run
well with no hiccups. Although many articles describe how to
configure HA, they often leave out specific details about resource
stickiness, STONITH, etc.
Here are my notes - I thought the community could benefit as well...
========================================================================
Configuring a simple 1:1 Linux HA cluster on CentOS Servers
                                                            -Ron Kelley
========================================================================
This article describes how to install and configure a simple 1:1 Linux HA
cluster on CentOS using the latest heartbeat (v3.0.0) and pacemaker (v1.0.5)
packages. Taken from the excellent article here:
http://clusterlabs.org/mediawiki/images/9/9d/Clusters_from_Scratch_-_Apache_on_Fedora11.pdf
#=======================================================================#
# I. Install Operating System and HA packages                           #
#=======================================================================#
(1) Install CentOS 5.3 or 5.4 (i386 or x86-64)
      --> Don't include any clustering packages during the install

(2) Identify a secondary network interface (eth1) and associated IP address
    that will be used for cluster heartbeat. Make sure this address does
    not collide with something else on the network! If a secondary NIC is
    not available, you can use the primary NIC with a subinterface.
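    If you need the subinterface approach, a minimal sketch of the
    interface definition looks like the following (assumptions: eth0 is
    the primary NIC, and the 10.1.1.1/30 heartbeat address matches the
    one shown later in this article; on CentOS 5 this file would be
    /etc/sysconfig/network-scripts/ifcfg-eth0:0):
    -----------------------------------------------------
    # Heartbeat subinterface on the primary NIC (example addresses only)
    DEVICE=eth0:0
    BOOTPROTO=static
    IPADDR=10.1.1.1
    NETMASK=255.255.255.252
    ONBOOT=yes
    -----------------------------------------------------
    Bring it up with "ifup eth0:0" and adjust ha.cf to use eth0.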
(3) On all servers, remove any existing heartbeat, cluster, or pacemaker
packages via the “yum erase” command
-----------------------------------------------------
#shell> yum erase heartbeat* cluster* pacemaker*
-----------------------------------------------------
(4) On all servers, exclude any HA software from the existing CentOS
repo files
-----------------------------------------------------
<foreach section in /etc/yum.repos.d/*repo>
exclude=heartbeat* pacemaker* cluster*
-----------------------------------------------------
(5) On all servers, create a new CentOS-HA repo file using the info
below
-----------------------------------------------------
[server_ha-clustering]
name=High Availability/Clustering server technologies (CentOS_5)
type=rpm-md
baseurl=http://download.opensuse.org/repositories/server:/ha-clustering/CentOS_5/
gpgcheck=1
gpgkey=http://download.opensuse.org/repositories/server:/ha-clustering/CentOS_5/repodata/repomd.xml.key
enabled=1
-----------------------------------------------------
(6) On all servers, install the latest HA software
---------------------------------------------------------
#shell> yum install heartbeat*64 cluster*64 pacemaker*64
---------------------------------------------------------
    Note: Make sure the packages are from the HA repository and not the
          standard CentOS repos. Also, choose either the 32-bit or 64-bit
          package versions; installing both platform packages on the same
          machine could lead to problems.
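    One way to confirm where a package came from (a sketch, not the only
    way; "yum list" prints the repository id in its right-hand column,
    which should show the [server_ha-clustering] repo defined above):
    -----------------------------------------------------
    # Show the source repository for the HA packages
    yum list heartbeat pacemaker
    # The Vendor field of an installed package also hints at its origin
    rpm -qi pacemaker | grep -i vendor
    -----------------------------------------------------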
#=======================================================================#
# II. Configure the HA Software                                         #
#=======================================================================#
(1) On primary HA node, create /etc/ha.d/authkeys with the following info
    -----------------------------------------------------
    auth 2
    2 sha1 test-ha
    -----------------------------------------------------
Note: The "test-ha" value is the secret key shared between nodes
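    For anything beyond a lab setup, you probably want a random secret
    instead of a guessable string. A minimal sketch (the file format is
    the one shown above; the key is just 40 hex characters derived from
    /dev/urandom):
    -----------------------------------------------------
    # Generate a random shared key and print an authkeys file body
    KEY=$(dd if=/dev/urandom bs=512 count=1 2>/dev/null | sha1sum | awk '{print $1}')
    printf 'auth 2\n2 sha1 %s\n' "$KEY"
    -----------------------------------------------------
    Write the output to /etc/ha.d/authkeys on the primary node; step (4)
    below copies it to the other nodes.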
(2) On primary HA node, change the file permissions to 600 on
    /etc/ha.d/authkeys
-----------------------------------------------------
#shell> chmod 600 /etc/ha.d/authkeys
-----------------------------------------------------
(3) On primary HA node, create /etc/ha.d/ha.cf (HA config file)
with the following info:
-----------------------------------------------------
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 15
initdead 30
bcast eth1
udpport 694
auto_failback off
node pacemaker-svr1.home.local
node pacemaker-svr2.home.local
crm on
apiauth mgmtd uid=root
respawn root /usr/lib64/heartbeat/mgmtd -v ha
-----------------------------------------------------
    Notes:
      --> "bcast eth1" defines the HA NIC. In this case, we are using
          eth1 via broadcasting. We can also use "eth0" or
          "ucast <IP_ADDR_OF_NIC>"
      --> The "node" lines must include hostnames from all nodes. The
          hostnames MUST match the "uname -n" output from each host!
      --> The lines "crm on", "apiauth...", and "respawn..." are required
          for the hb_gui tool to work properly
      --> In order to use the hb_gui tool, you need to give the
          "hacluster" user a valid password
      --> If you are using the 32-bit version of heartbeat and pacemaker,
          make sure the "respawn" directive points to the correct
          location (/usr/lib/heartbeat) for "mgmtd".
(4) On primary HA node, copy files from /etc/ha.d to all other nodes
-----------------------------------------------------
#shell> scp -r /etc/ha.d root@<node_2>:/etc/
-----------------------------------------------------
(5) On all nodes, make sure the ha software is set to autostart
-----------------------------------------------------
#shell> /sbin/chkconfig heartbeat on
#shell> /sbin/service heartbeat start
-----------------------------------------------------
#=======================================================================#
# III. Creating the Cluster                                             #
#=======================================================================#
(1) On the primary node, run the following commands:
-----------------------------------------------------
#shell> crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip=192.168.1.50 cidr_netmask=32 op monitor interval=30s
#shell> crm configure property stonith-enabled=false
#shell> crm configure property no-quorum-policy=ignore
#shell> crm configure rsc_defaults resource-stickiness=100
-----------------------------------------------------
    Notes:
      --> The first command creates an entity called "ClusterIP" with a
          cluster IP address of 192.168.1.50 and a /32 mask. The HA
          software will monitor this address every 30 secs for failure.
      --> The second command disables the STONITH
          (shoot-the-other-node-in-the-head) feature.
      --> The third command tells the cluster to ignore loss of quorum.
          In a two-node cluster, losing one node also loses quorum;
          without this setting, the surviving node would stop all
          resources (thereby negating a 1:1 failover solution). Leave the
          default policy only if you can guarantee both nodes will be up
          all the time.
      --> The fourth command applies a resource-stickiness value of 100
          as a cluster default. This option controls how much a service
          prefers to stay running where it is. When enabled, it ensures
          the ClusterIP will remain on the active host instead of failing
          back to a designated primary host. In some cases, a specific
          server may be the preferred server for a given entity.
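(2) To confirm the configuration took effect, the following commands can
    be run on a cluster node (a sketch; exact output varies by pacemaker
    version, and all three require the heartbeat/pacemaker stack to be
    running):
    -----------------------------------------------------
    #shell> crm configure show    # dump the configuration; should list ClusterIP
    #shell> crm_mon -1            # one-shot status: nodes online, ClusterIP started
    #shell> crm_verify -L         # sanity-check the live configuration for errors
    -----------------------------------------------------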
#=======================================================================#
# IV. Monitoring the Cluster and VIP                                    #
#=======================================================================#
(1) To view the virtual IP address, use the command "ip addr list" on
    the nodes
    -----------------------------------------------------
    #shell> ip addr list
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
        link/ether 00:50:56:b1:4f:9e brd ff:ff:ff:ff:ff:ff
        inet 192.168.1.10/24 brd 192.168.1.255 scope global eth0
        inet 192.168.1.50/32 brd 192.168.1.255 scope global eth0
        inet6 fe80::250:56ff:feb1:4f9e/64 scope link
           valid_lft forever preferred_lft forever
    3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
        link/ether 00:50:56:b1:21:e8 brd ff:ff:ff:ff:ff:ff
        inet 10.1.1.1/30 brd 10.1.1.3 scope global eth1
        inet6 fe80::250:56ff:feb1:21e8/64 scope link
           valid_lft forever preferred_lft forever
    4: sit0: <NOARP> mtu 1480 qdisc noop
        link/sit 0.0.0.0 brd 0.0.0.0
    -----------------------------------------------------
(2) To view the heartbeat log files, simply "tail -100f /var/log/ha-log"
(3) To check the reliability of the cluster, reboot the nodes and verify
    the virtual IP comes back up. Here are some basic failover tests:
    Notes:
      --> N1=node-1 and N2=node-2
      --> On each node, install and start apache, and create a unique
          /var/www/html/index.html file on each node. Using a 3rd
          machine, run a "while" loop using wget to constantly get the
          index.html file from the cluster VIP.
      --> If you intend to run Apache with a VIP, you must add the
          following entry to the /etc/sysctl.conf file on all HA nodes:
          ---------------------------------------------
          net.ipv4.ip_nonlocal_bind = 1
          ---------------------------------------------
          You can either reboot the host or run the "sysctl -p" command
          for the new setting to take effect.
    -----------------------------------------------------
      --> Test-1: N1 and N2 active, reboot N2, IP remains active on N1
      --> Test-2: N1 and N2 active, reboot N1, IP fails over to N2
      --> Test-3: N1 active, N2 powered off, power on N2, IP remains on N1
      --> Test-4: N1 and N2 powered off, power on N1, IP starts on N1
      --> Test-5: N1 and N2 powered off, power on N2, IP starts on N2
      --> Test-6: N1 and N2 powered off, power on N1, IP starts on N1,
                  power on N2, IP remains on N1
      --> Test-7: N1 and N2 powered off, power on N2, IP starts on N2,
                  power on N1, IP remains on N2
      --> Test-8: N1 and N2 powered off, power on N1 and N2, IP starts
                  on one node, other node is ready to take over IP
    -----------------------------------------------------
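    The "while" loop mentioned above can be sketched as follows (run on
    the 3rd machine; the VIP 192.168.1.50 is the ClusterIP created in
    section III, and the one-second interval is arbitrary):
    -----------------------------------------------------
    #!/bin/sh
    # Poll the cluster VIP once per second and print which node's
    # index.html answers; during a failover you will see the page
    # contents change (or a FAILED line while the IP moves).
    VIP=192.168.1.50
    while true; do
        page=$(wget -q -O - "http://${VIP}/index.html" || echo FAILED)
        echo "$(date '+%H:%M:%S')  ${page}"
        sleep 1
    done
    -----------------------------------------------------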
(4) To reset the cluster configuration to scratch, stop the heartbeat
    software, remove all files from /var/lib/heartbeat/crm on all nodes,
    then restart the heartbeat software
-----------------------------------------------------
#shell> /sbin/service heartbeat stop
#shell> \rm -r /var/lib/heartbeat/crm/*
#shell> /sbin/service heartbeat start
-----------------------------------------------------
#=======================================================================#
# V. Learning more about Linux HA                                       #
#=======================================================================#
The Linux HA pages are somewhat outdated and lack detailed configuration
data. The best links to date include:
  --> http://clusterlabs.org/mediawiki/images/9/9d/Clusters_from_Scratch_-_Apache_on_Fedora11.pdf
  --> http://clusterlabs.org/wiki/Main_Page
    Note: Download the above PDF and look thru the sections. It contains
          a wealth of information detailing how to combine resource
          stickiness and program/process dependencies for a given
          cluster.