Active/Passive Cluster with Pacemaker, Corosync

December 28, 2010 by Igor Drobot 26 Comments

In this article I will explains how to set up and maintain an Active/Passive Cluster, using Pacemaker with Corosync with DRBD replication.

We have two nodes (same hardware), one active and another in passive mode. If the active node goes down, the passive one will automatically take its position and all running services.

This article was written complete for and with Debian 6-squeeze (With a lot of luck you could bring it to run on Ubuntu)

Active/Passive

1. Basic notes
Node-1:
Hostname: node-1
IP: 192.168.2.101

Node-2:
Hostname: node-2
IP: 192.168.2.102

Partitioning:
/dev/vda1 System / ext4
/dev/vda5 reserved for DRBD – empty
/dev/vda6 swap area

Cluster IP: 192.168.2.100
We need this cluster IP to reach currently active node.

This is a physical network structure of the cluster:

2. Basic configuration
To ssh from one node to another without passwords:

root@node-1:~# ssh-keygen
root@node-2:~# ssh-keygen
root@node-1:~# ssh-copy-id root@192.168.2.102
root@node-2:~# ssh-copy-id root@192.168.2.101

Now we need to make sure we can communicate with the machines by their name. If you have a DNS server,
add additional entries for the three machines. Otherwise, you’ll need to add the machines to /etc/hosts .
Below are the entries for my cluster nodes:

root@node-1:~# cat /etc/hosts
127.0.0.1       localhost
192.168.2.101   node-1
192.168.2.102   node-2

root@node-2:~# cat /etc/hosts
127.0.0.1       localhost
192.168.2.101   node-1
192.168.2.102   node-2

3. Installation
Install corosync:

1	apt-get install pacemaker

4. Initial corosync configuration:
Generate key for openais and copy it to node

root@node-1:~# corosync-keygen
root@node-1:~# scp /etc/corosync/authkey node-2:/etc/corosync/authkey
chown root:root /etc/corosync/authkey
chmod 400 /etc/corosync/authkey

vim /etc/default/corosync
# change to yes
START=yes
scp /etc/default/corosync node-2:/etc/default/corosync

The only block you must change is the network one:
Full corosync.conf:

totem {
        version: 2
 
        # How long before declaring a token lost (ms)
        token: 3000
 
        # How many token retransmits before forming a new configuration
        token_retransmits_before_loss_const: 10
 
        # How long to wait for join messages in the membership protocol (ms)
        join: 60
 
        # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
        consensus: 3600
 
        # Turn off the virtual synchrony filter
        vsftype: none
 
        # Number of messages that may be sent by one processor on receipt of the token
        max_messages: 20
 
        # Limit generated nodeids to 31-bits (positive signed integers)
        clear_node_high_bit: yes
 
        # Disable encryption
        secauth: off
 
        # How many threads to use for encryption/decryption
        threads: 0
 
        # Optionally assign a fixed node id (integer)
        # nodeid: 1234
 
        # This specifies the mode of redundant ring, which may be none, active, or passive.
        rrp_mode: none
 
        interface {
                # The following values need to be set based on your environment
                ringnumber: 0
                bindnetaddr: 192.168.2.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}
 
amf {  
        mode: disabled
}
 
service {
        # Load the Pacemaker Cluster Resource Manager
        ver:       0
        name:      pacemaker
}
 
aisexec {
        user:   root
        group:  root
}
 
logging {
        fileline: off
        to_stderr: yes
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
                tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
}

Now copy this config to your node2:

1	scp /etc/corosync/corosync.conf node-2:/etc/corosync/corosync.conf

5. Try and Error:
Lets start corosync and see what’s happening:

1 2	root@node-1:~# /etc/init.d/corosync start root@node-2:~# /etc/init.d/corosync start

Check cluster status:

1	crm_mon --one-shot -V

1	crm_verify -L

Take a look on your logs:

1	less /var/log/corosync/corosync.log

6. First cluster resource:

We now turn off STONITH since we don’t need it in this example configuration:

1	crm configure property stonith-enabled=false

Now we add our first resource, virtual-IP to the configuration:

1	crm configure primitive FAILOVER-ADDR ocf:heartbeat:IPaddr2 params ip="192.168.2.100" nic="eth0" op monitor interval="10s" meta is-managed="true"

Of course there are much more options available like Nic.

7. Use crm:
To use all futures of Corosync you should use the Cluster Resource Manager or CRM command line tool. It’s great and very powerful.

1 2	root@node-2:~# crm crm(live)#

Use help command and TAB for completion.

Available commands:
 
        cib              manage shadow CIBs
        resource         resources management
        node             nodes management
        options          user preferences
        configure        CRM cluster configuration
        ra               resource agents information center
        status           show cluster status
        quit,bye,exit    exit the program
        help             show help
        end,cd,up        go back one level

Get cluster status:

1	crm_mon --one-shot

root@node-2:~# crm_mon --one-shot
============
Last updated: Tue Dec 28 21:50:38 2010
Stack: openais
Current DC: node-1 - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.
============
 
Online: [ node-1 node-2 ]
     FAILOVER-ADDR    (ocf::heartbeat:IPaddr2):   Started node-2

8. Maintain
To list cluster configuration:

1	crm configure show

To list cluster status and maintain on the status screen:

1	crm_mon --interval=1

Another shorter command for cluster status with fail count overview:

1	crm_mon -i 2 -f

resource-stickiness option
Configuring weight-points to change the resource to another node.
When a node goes down and then goes up, this configuration makes the resource that is running on the another server be kept there (that was always up). This is very good to prevent a sync problem on the node that was down, or prevent that the node that is flapping, flap the cluster services.

1	crm configure rsc_defaults resource-stickiness=100

migration-threshold option
The default migration-threshold is INFINITY (that means the value is 1000000) what a little bit too much is:

# Before modification
apache2: migration-threshold=1000000 fail-count=1
# After modification
apache2: migration-threshold=3 fail-count=1

Lets modify it:

1	crm configure rsc_defaults migration-threshold=3

crm_mon
This configuration is very useful to quickly check the cluster status. Should run on both nodes:

1	crm_mon --daemonize --as-html /var/www/cluster/index.html

Autostart:

1	echo "crm_mon --daemonize --as-html /var/www/html/cluster/index.html" >> /etc/rc.d/rc.local

Click on image to resize:

It’ s a simple HTML plain text with a 5sec refresh.

To migrate a resource to another node, do:

1	crm resource migrate AP-CLUST node-2

To clean up resource messages, do:

1	crm resource cleanup FAILOVER-ADDR

To stop (for start use start) FAILOVER-ADDR resource, do:

1	crm resource stop FAILOVER-ADDR

Put the first node to standby mode and back to online:

1 2	crm node standby node-1 crm node online node-1

Other
Add resources:

# DRBD:
crm configure primitive DRBD ocf:linbit:drbd params drbd_resource="drbd0" op monitor interval="120s"
# Mount target:
crm configure primitive SRV-MOUNT ocf:heartbeat:Filesystem params device="/dev/drbd0" directory="/srv/" fstype="ext4"
# MySQL resource:
crm configure primitive MYSQL lsb:mysql op monitor interval="10s"

Example configuration:

root@node-1:~# crm configure show
node node-1
node node-2
primitive APACHE2 lsb:apache2 \
	op monitor interval="35s" timeout="20s" start-delay="0s"
primitive DRBD ocf:linbit:drbd \
	params drbd_resource="drbd0" \
	op monitor interval="120s" \
	op start interval="0" timeout="240s" \
	op stop interval="0" timeout="100s"
primitive DRBD-LINKS heartbeat:drbdlinks \
	op monitor interval="10s" \
	meta target-role="Started"
primitive FAILOVER-ADDR ocf:heartbeat:IPaddr2 \
	params ip="192.168.2.100" nic="eth0" \
	op monitor interval="10s" \
	meta is-managed="true"
primitive FAILOVER-SRC ocf:heartbeat:IPsrcaddr \
	params ipaddress="192.168.2.100"
primitive MYSQL lsb:mysql \
	op monitor interval="10s"
primitive PING-NET ocf:pacemaker:ping \
	params dampen="5s" multiplier="100" host_list="192.168.2.1 192.168.2.36 8.8.8.8" \
	op monitor interval="60s" timeout="60" \
	op start interval="0" timeout="60s" \
	op stop interval="0" timeout="60s"
primitive SRV-MOUNT ocf:heartbeat:Filesystem \
	params device="/dev/drbd0" directory="/srv/" fstype="ext4" \
	op start interval="0" timeout="60s" \
	op stop interval="0" timeout="60s"
group AP-CLUST FAILOVER-ADDR SRV-MOUNT FAILOVER-SRC APACHE2 DRBD-LINKS MYSQL \
	meta target-role="Started"
ms DRBD-DATA DRBD \
	meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
clone PING-NET-CLONE PING-NET
location cli-standby-AP-CLUST AP-CLUST \
	rule $id="cli-standby-rule-AP-CLUST" -inf: #uname eq node-2
location connected_node AP-CLUST \
	rule $id="connected_node-rule" -inf: not_defined pingd or pingd lte 0
location master-prefer-node-1 FAILOVER-ADDR 25: node-1
colocation AP-CLUST_on_DRBD inf: AP-CLUST DRBD-DATA:Master
order AP-CLUST_after_DRBD-DATA inf: DRBD-DATA:promote AP-CLUST:start
property $id="cib-bootstrap-options" \
	dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="2" \
	stonith-enabled="false" \
	no-quorum-policy="ignore" \
	start-failure-is-fatal="false" \
	last-lrm-refresh="1313668727"
rsc_defaults $id="rsc-options" \
	resource-stickiness="100" \
	migration-threshold="3"

root@node-1:~# crm configure show node node-1 node node-2 primitive APACHE2 lsb:apache2 \ op monitor interval="35s" timeout="20s" start-delay="0s" primitive DRBD ocf:linbit:drbd \ params drbd_resource="drbd0" \ op monitor interval="120s" \ op start interval="0" timeout="240s" \ op stop interval="0" timeout="100s" primitive DRBD-LINKS heartbeat:drbdlinks \ op monitor interval="10s" \ meta target-role="Started" primitive FAILOVER-ADDR ocf:heartbeat:IPaddr2 \ params ip="192.168.2.100" nic="eth0" \ op monitor interval="10s" \ meta is-managed="true" primitive FAILOVER-SRC ocf:heartbeat:IPsrcaddr \ params ipaddress="192.168.2.100" primitive MYSQL lsb:mysql \ op monitor interval="10s" primitive PING-NET ocf:pacemaker:ping \ params dampen="5s" multiplier="100" host_list="192.168.2.1 192.168.2.36 8.8.8.8" \ op monitor interval="60s" timeout="60" \ op start interval="0" timeout="60s" \ op stop interval="0" timeout="60s" primitive SRV-MOUNT ocf:heartbeat:Filesystem \ params device="/dev/drbd0" directory="/srv/" fstype="ext4" \ op start interval="0" timeout="60s" \ op stop interval="0" timeout="60s" group AP-CLUST FAILOVER-ADDR SRV-MOUNT FAILOVER-SRC APACHE2 DRBD-LINKS MYSQL \ meta target-role="Started" ms DRBD-DATA DRBD \ meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started" clone PING-NET-CLONE PING-NET location cli-standby-AP-CLUST AP-CLUST \ rule $id="cli-standby-rule-AP-CLUST" -inf: #uname eq node-2 location connected_node AP-CLUST \ rule $id="connected_node-rule" -inf: not_defined pingd or pingd lte 0 location master-prefer-node-1 FAILOVER-ADDR 25: node-1 colocation AP-CLUST_on_DRBD inf: AP-CLUST DRBD-DATA:Master order AP-CLUST_after_DRBD-DATA inf: DRBD-DATA:promote AP-CLUST:start property $id="cib-bootstrap-options" \ dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \ cluster-infrastructure="openais" \ expected-quorum-votes="2" \ stonith-enabled="false" \ no-quorum-policy="ignore" \ start-failure-is-fatal="false" \ last-lrm-refresh="1313668727" rsc_defaults $id="rsc-options" \ resource-stickiness="100" \ migration-threshold="3"

Special:
NTP:
It is highly recommended to enable NTP-Service on your cluster nodes. Doing so ensures all nodes agree on the current time and makes reading log ﬁles signiﬁcantly easier! For example to find a reason for a split brain, you need the exact time.

OCF resources
To ﬁnd all OCF resource agents provided by Pacemaker and Heartbeat, run:

1 2	crm ra list ocf heartbeat crm ra list ocf pacemaker

Dealing with failconuts:
The output of “crm_mon -i1 -f” shows you some failcounts for your resources:

1	APACHE2: migration-threshold=3 fail-count=2

Clear resource with failcounts:

1	crm resource cleanup APACHE2

Reset failcounts to zero:

1 2	crm resource failcount APACHE2 set node-1 0 crm resource failcount APACHE2 set node-2 0

Maintenance of cluster resources:
If you want to disable your MySQL-Server for a short time you should set it to unmanaged status:

1
2
3

crm resource meta MYSQL set is-managed false
# after upgrades or restarts put it to the managed status back
crm resource meta MYSQL set is-managed true

Comments

Prateek says

December 4, 2017 at 09:42

I have configured a 2 node cluster but both nodes are unable to see each other It happens after multiple times Restransmit both node exclude each other after that i have tried with many times start/stop of corosync and pacemaker at both node but still they are not able to see each other

corosync.conf for both node is same

compatibility: whitetank

totem {
version: 2
secauth: off
threads: 0
interface {
member {
memberaddr: 172.20.172.151
}
member {
memberaddr: 172.20.172.152
}
ringnumber: 0
bindnetaddr: 172.20.172.0
mcastaddr: 226.94.1.1
mcastport: 5406
ttl: 1
}
join: 2000
token: 20000
transport=udpu
}

logging {
fileline: off
to_stderr: no
to_logfile: yes
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}

amf {
mode: disabled
}

CIB of one of the node is
node srv-vme-ccs-02 \
attributes Postgresql9-data-status=”LATEST”
primitive ACP ocf:heartbeat:acp \
op start interval=”0″ timeout=”120s” \
op stop interval=”0″ timeout=”60s” \
op monitor interval=”10s” timeout=”60s” \
meta failure-timeout=”2000″ target-role=”Started” is-managed=”true”
primitive ACPIP ocf:heartbeat:IPaddr2 \
params ip=”172.20.172.158″ cidr_netmask=”24″ \
op monitor interval=”10s” timeout=”60s” \
op start interval=”0″ timeout=”60s” \
op stop interval=”0″ timeout=”60s” \
meta failure-timeout=”2000s” target-role=”Started”
primitive AmeyoArchiver ocf:heartbeat:ameyoarchiver \
op start timeout=”120s” interval=”0″ \
op stop interval=”0″ timeout=”60s” \
op monitor interval=”10s” timeout=”250s” \
meta failure-timeout=”2000″ target-role=”Started” is-managed=”true”
primitive AmeyoReports ocf:heartbeat:ameyoreports \
op start timeout=”250s” interval=”0″ \
op stop interval=”0″ timeout=”60s” \
op monitor interval=”10s” timeout=”250s” \
meta failure-timeout=”2000″ target-role=”Started” is-managed=”true”
primitive AmeyoReportsIP ocf:heartbeat:IPaddr2 \
params ip=”172.20.172.159″ cidr_netmask=”24″ \
op start timeout=”60s” interval=”0″ \
op stop interval=”0″ timeout=”60s” \
op monitor interval=”20s” timeout=”60s” \
meta target-role=”Started”
primitive Asterisk ocf:heartbeat:asterisk \
op monitor interval=”10s” timeout=”100s” \
op start interval=”0″ timeout=”60s” \
op stop interval=”0″ timeout=”60s” \
meta failure-timeout=”2000″ target-role=”Started”
primitive AsteriskIP ocf:heartbeat:IPaddr2 \
params ip=”172.20.172.157″ cidr_netmask=”24″ \
op start interval=”0″ timeout=”60s” \
op stop interval=”0″ timeout=”60s” \
op monitor interval=”10s” timeout=”100s” \
meta failure-timeout=”2000″
primitive CRM ocf:heartbeat:crm \
op start interval=”0″ timeout=”60s” \
op stop interval=”0″ timeout=”60s” \
op monitor interval=”10s” timeout=”100s” \
meta failure-timeout=”2000″ target-role=”Started”
primitive CRMIP ocf:heartbeat:IPaddr2 \
params ip=”172.20.172.156″ cidr_netmask=”24″ \
op start interval=”0″ timeout=”60s” \
op stop interval=”0″ timeout=”60s” \
op monitor interval=”10s” timeout=”60s” \
meta failure-timeout=”2000″
primitive DBClusterIP ocf:heartbeat:IPaddr2 \
params ip=”172.20.172.155″ cidr_netmask=”24″ \
op monitor interval=”10s” timeout=”60s” \
op start interval=”0″ timeout=”60s” \
op stop interval=”0″ timeout=”60s” \
meta failure-timeout=”2000s” target-role=”Started”
primitive Djinn ocf:heartbeat:djinn \
op monitor interval=”15s” timeout=”100s” \
op start interval=”0″ timeout=”90s” \
op stop interval=”0″ timeout=”60s” \
meta failure-timeout=”2000″
primitive Ping ocf:pacemaker:ping \
params dampen=”40s” multiplier=”1000″ host_list=”172.20.172.1″ \
op stop interval=”0″ timeout=”80s” \
op monitor interval=”10s” timeout=”60s” \
meta failure-timeout=”40s”
primitive Postgresql9 ocf:heartbeat:pgsql \
params pgctl=”/usr/pgsql-9.3/bin/pg_ctl” psql=”/usr/pgsql-9.3/bin/psql” pgdata=”/var/lib/pgsql/9.3/data/” start_opt=”-p 5432″ rep_mode=”async” node_list=”srv-vme-ccs-01 srv-vme-ccs-02″ restore_command=”” primary_conninfo_opt=”keepalives_idle=60 keepalives_interval=5 keepalives_count=5″ master_ip=”172.20.172.155″ restart_on_promote=”true” \
op monitor interval=”20s” role=”Slave” timeout=”100s” \
op monitor interval=”10s” role=”Master” timeout=”100s” \
op start interval=”0″ timeout=”250s” \
op promote interval=”0″ timeout=”70s” \
op stop interval=”0″ timeout=”70s” \
op demote interval=”0″ timeout=”200s” \
op notify interval=”0″ timeout=”200s” \
meta failure-timeout=”2000s” \
meta migration-threshold=”3″
primitive Server ocf:heartbeat:server \
op monitor interval=”10s” timeout=”250s” \
op start interval=”0″ timeout=”450s” \
op stop interval=”0″ timeout=”160s” \
meta failure-timeout=”2000″ target-role=”Started” is-managed=”true”
primitive ServerIP ocf:heartbeat:IPaddr2 \
op start interval=”0″ timeout=”60s” \
op stop interval=”0″ timeout=”60s” \
op monitor interval=”10s” timeout=”60s” \
meta failure-timeout=”2000s” target-role=”Started”
primitive SyncFolder ocf:heartbeat:syncfolder_RA \
params ip=”172.20.172.157″ folder=”/dacx/var/ameyo/dacxdata/voiceprompts/” \
op start interval=”0″ timeout=”60s” \
op stop interval=”0″ timeout=”60s” \
op monitor interval=”10s” timeout=”60s” \
meta failure-timeout=”2000″
group ACP-with-IP ACP ACPIP \
meta target-role=”Started”
group AmeyoReports-with-IP AmeyoReports AmeyoReportsIP \
meta target-role=”Started”
group Asterisk-with-IP Asterisk AsteriskIP \
meta target-role=”Started”
group CRM-with-IP CRM CRMIP \
meta target-role=”Started”
group Server-with-IP Server ServerIP \
meta target-role=”Started”
ms mspostgres Postgresql9 \
meta master-max=”1″ master-node-max=”1″ clone-max=”2″ clone-node-max=”1″ notify=”true” migration-threshold=”3″ target-role=”Started”
clone Ping_clone Ping \
meta globally-unique=”false” target-role=”Started”
clone djinn_clone Djinn \
meta globally-unique=”true” clone-max=”2″ clone-node-max=”1″
clone syncfolder_clone SyncFolder \
meta globally-unique=”true” clone-max=”1″ clone-node-max=”1″
location ACP-with-IP-Ping ACP-with-IP \
rule $id=”acp-ping-rule” -inf: not_defined pingd or pingd eq 0
location AmeyoReports-with-IP-Ping AmeyoReports-with-IP \
rule $id=”ameyoreports-ip-ping-rule” -inf: not_defined pingd or pingd eq 0
location Asterisk-with-IP-Ping Asterisk-with-IP \
rule $id=”asterisk-ping-rule” -inf: not_defined pingd or pingd eq 0
location CRM-with-IP-Ping CRM-with-IP \
rule $id=”crm-ping-rule” -inf: not_defined pingd or pingd eq 0
location DBClusterIP-Ping DBClusterIP \
rule $id=”db-ip-ping-rule” -inf: not_defined pingd or pingd eq 0
location Server-with-IP-Ping Server-with-IP \
rule $id=”server-ping-rule” -inf: not_defined pingd or pingd eq 0
location failover-acp ACP-with-IP 80: srv-vme-ccs-02
location failover-archiver AmeyoArchiver 80: srv-vme-ccs-02
:
location Server-with-IP-Ping Server-with-IP \
rule $id=”server-ping-rule” -inf: not_defined pingd or pingd eq 0
location failover-acp ACP-with-IP 80: srv-vme-ccs-02
location failover-archiver AmeyoArchiver 80: srv-vme-ccs-02
location failover-callserver Asterisk-with-IP 80: srv-vme-ccs-02
location failover-crm CRM-with-IP 80: srv-vme-ccs-02
location failover-db mspostgres 80: srv-vme-ccs-02
location failover-reports AmeyoReports-with-IP 80: srv-vme-ccs-02
location failover-server Server-with-IP 80: srv-vme-ccs-02
location mspostgres-Ping mspostgres \
rule $id=”mspostgres-ping-rule” -inf: not_defined pingd or pingd eq 0
location prefer-acp ACP-with-IP 90: srv-vme-ccs-01
location primary-archiver AmeyoArchiver 90: srv-vme-ccs-01
location primary-callserver Asterisk-with-IP 90: srv-vme-ccs-01
location primary-crm CRM-with-IP 90: srv-vme-ccs-01
location primary-db mspostgres 90: srv-vme-ccs-01
location primary-reports AmeyoReports-with-IP 90: srv-vme-ccs-01
location primary-server Server-with-IP 90: srv-vme-ccs-01
colocation ip-with-ms +inf: DBClusterIP mspostgres:Master
colocation syncfolder-with-asterisk -inf: syncfolder_clone Asterisk
order archiver-after-DBIP inf: DBClusterIP:start AmeyoArchiver:start symmetrical=false
order database-stop inf: DBClusterIP:stop mspostgres:demote symmetrical=false
order reports-after-DBIP inf: DBClusterIP:start AmeyoReports-with-IP:start symmetrical=false
order server-after-ip inf: DBClusterIP:start Server-with-IP:start symmetrical=false
order syncFolder-after-asterisk inf: Asterisk-with-IP syncfolder_clone
property $id=”cib-bootstrap-options” \
dc-version=”1.1.10-14.el6-368c726″ \
cluster-infrastructure=”classic openais (with plugin)” \
expected-quorum-votes=”3″ \
cluster-recheck-interval=”5min” \
stonith-enabled=”false” \
no-quorum-policy=”ignore” \
placement-strategy=”balanced” \
stop-all-resources=”false” \
last-lrm-refresh=”1511469460″
rsc_defaults $id=”rsc-options” \
resource-stickiness=”100″ \
failure-timeout=”2000s” \
migration-threshold=”5″
op_defaults $id=”op_defaults-options” \
on-fail=”restart”

location ACP-with-IP-Ping ACP-with-IP \
rule $id=”acp-ping-rule” -inf: not_defined pingd or pingd eq 0
location AmeyoReports-with-IP-Ping AmeyoReports-with-IP \
rule $id=”ameyoreports-ip-ping-rule” -inf: not_defined pingd or pingd eq 0
location Asterisk-with-IP-Ping Asterisk-with-IP \
rule $id=”asterisk-ping-rule” -inf: not_defined pingd or pingd eq 0
location CRM-with-IP-Ping CRM-with-IP \
rule $id=”crm-ping-rule” -inf: not_defined pingd or pingd eq 0
location DBClusterIP-Ping DBClusterIP \
rule $id=”db-ip-ping-rule” -inf: not_defined pingd or pingd eq 0
location Server-with-IP-Ping Server-with-IP \
Gagan says

October 26, 2017 at 07:28

Is there any way than than OCF script to avoid someone from going to the cluster and doing a start/stop from /etc/init.d/, which is supposed to be done via the cluster commands?
Can we point a new path for LSB scripts in cluster?
Velmurugan says

November 26, 2015 at 14:23

Can you help me to configure openNMS HA

Thanks
zeldor says

October 26, 2014 at 23:52

doudou, you should configure a resource group with all your resources. In case of failure, your group with all your included resources will move to another node.
doudou says

October 22, 2014 at 09:55

Hi zeldor,

i’ve a pacemaker/corosync with 2 nodes.
– a FAILOVER-ADDR
– a service (syslog-ng)

All works fine, but…
If, for example, i do an error on syslog conf on the first node, who have the FAILOVER ADDR, when i kill syslog, node 1 try to start syslog but he can’t (normal).

My problem, he move this resource to node 2, but FAILOVER ADDR stay on node 1.

It is possible, to configure, when one service fail and can’t restart, to move ALL resources to another node, include the FAILOVER ADDR of course..

Thanks
Weimar says

July 22, 2014 at 03:16

disculpe pacemaker y corosyn funcionan en CENTOS 5.5 o cual distro me aconsejarian para mi cluster HA aguardo su respuesta saludos
Stefano says

June 24, 2014 at 14:24

I installed cluster on CentOS.

resource-agents-3.9.2-40.el6_5.6.x86_64
corosync-1.4.1-17.el6.x86_64
pacemaker-1.1.10-14.el6_5.2.x86_64

I have these commands available:

crmadmin crm_error crm_mon crm_resource crm_standby
crm_attribute crm_failcount crm_node crm_shadow crm_ticket
crm_diff crm_master crm_report crm_simulate crm_verify
zeldor says

June 23, 2014 at 22:33

Strange. Maybe you have another syntax in your distro/version? http://oss.clusterlabs.org/pipermail/pacemaker/2010-November/008417.html
Stefano says

June 23, 2014 at 09:55

Thanks Zeldor, but restart command in crm_resource doesn’t exist…
zeldor says

June 21, 2014 at 08:34

Hi Stefano,

you can perform a restart of mysql over crm shell: crm resource restart mysqld

Another Idea: you can set mysql to “unmanaged”-state for the cluster and restart mysql manually.
Stefano says

June 20, 2014 at 10:28

When I edit the mysql file /etc/my.conf, how I restart the service that is controlled by pacemaker?
Stas says

February 13, 2014 at 21:29

Thank you! Great manual!
zeldor says

February 25, 2013 at 10:42

Ortim, I work at moment for axxeo so if you are ready to pay for this work, we could bring your cluster to work. You can contact us over this form
Ortim says

February 25, 2013 at 09:29

I’ve never setup Crosync and Pacemaker.
But now for my work i need a failover between two servers.

after a full night work can’t connect between two nodes.
Can you help me. I can also pay to you
Thanks.
zeldor says

November 16, 2012 at 21:08

Rob, thank you for commenting! The default value of secauth on Ubuntu systems is on. Debian provide the default configuration file with off. This option is only for the totem messages.

The creation of authkey for corosync is only for communication between your two or more nodes.

Encryption and authentication consume 75% of CPU cycles in aisexec as measured with gprof when enabled.
I will say its better to use another security solutions like netfilter – iptables command.
Rob says

November 15, 2012 at 20:02

if you want to use the authkey make sure to enable the use of it…

secauth: on
sjamal says

October 31, 2012 at 18:19

Hi,
Upon first starting corosync on my two nodes and running crm_mon I see that the Current DC is set to ‘NONE.’ It also only shows 1 node configured (this is consistent when I run crm_mon on either node). I’m not sure how to to get the nodes to acknowledge one another. All we really do is copy the same authkey over – what causes the nodes to see each other? Thank you for any help and input.
zeldor says

October 20, 2012 at 23:07

Thank you!
bindnetaddr: specifies the address which the corosync executive should bind. This address should always end in zero. In other words its local LAN subnet address: example 192.168.2.0/10.1.100.0

DRBD has 2 separate posts, you can find the here: http://zeldor.biz/tag/drbd/
Kinnow Grower says

October 20, 2012 at 22:03

Really nice article. I had tried this couple of months before and tried again today. I dont remember now, I had to change the ip address in the following

interface { # The following values need to be set based on your environment ringnumber: 0 bindnetaddr: 192.168.2.0 mcastaddr: 226.94.1.1 mcastport: 5405 }

to ip address of each individual node. Eg. on node1, I had to put the ip only for node1 and on node2, I had to put node2’s ip address. But here you just put the 192.168.2.0. It seems it is network range. Can you please clarify on that?. Also can you please add here, how to setup DRDB on individual nodes?.

Again thank you very much for nice article . It is really helpful.
Fred says

May 21, 2012 at 15:11

don’t mind my last post.. you can delete it.. missed to see the scroll bar
krgds
/Fred
Fred says

May 21, 2012 at 15:08

Hi,
I didn’t want to ask anything, just to see the entire blog. As it is now it doesn’t show all text.. So I need to submit something to get to the page where I can see it all ;)
zeldor says

April 16, 2011 at 00:11

Hey, yeah we are working really on much similar and great things – that is cool, two heads are better than one:)

Pacemaker is really fascinating, I had a lot fun to work with it. I have a webserver-cluster running under Fedora 14 ; normally I use only Debian but in this case I wanted to have newer version of pacemaker.

My apprenticeship project was about HA-Cluster with Pacemaker.

On what kind of cluster are you working?

id
sw says

April 15, 2011 at 14:14

Looks like we are working on many similar things :-)

Thanks for this post, I still have to go through it, but if you want to be able to ssh from one host to the other, copy your SSH key on both servers and use “ssh -A”.

Cheers
Seb

Trackbacks

HA cluster on openSUSE says:

December 13, 2018 at 00:07

[…] more deeper information take a look into this […]
Active/Passive Cluster with Pacemaker, Corosync « Kunawut's knowledge blog says:

July 20, 2012 at 17:10

[…] Replication, Cluster, DRBD, LVS, MPP » Active/Passive Cluster with Pacemaker, Corosync Active/Passive Cluster with Pacemaker, Corosync Published on December 28, 2010 22:58Tags: active/passive, cluster, corosync, debian […]
Sencillo cluster de alta disponilidad con Pacemaker y Corosync « Desde lo alto del Cerro says:

March 4, 2012 at 10:12

[…] http://zeldor.biz/2010/12/activepassive-cluster-with-pacemaker-corosync/ […]

Active/Passive Cluster with Pacemaker, Corosync

Categories

Archives

Categories

Archives

Tags

Comments

Trackbacks

Leave a Reply