[Openais] Redundant ring not recovering after issuing the command corosync-cfgtool -r

Tom Pride tom.pride at gmail.com
Tue Apr 13 03:30:28 PDT 2010


Hi There,

As per the recommendations, the 2-node clusters I have built use 2 redundant
rings for added resilience.  I am currently carrying out some testing on the
clusters to ensure that a failure in one of the redundant rings can be
recovered from.  I am aware that corosync does not currently have a feature
which monitors failed rings and brings them back up automatically when
communications are repaired.  All I have been doing is testing that the
corosync-cfgtool -r command does what it says on the tin and will "Reset
redundant ring state cluster wide after a fault, to re-enable redundant ring
operation."
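
For completeness, the sequence I have been running on node1 looks roughly
like this (in my setup eth1 carries ring 0; the peer address would be
node2's ring 0 IP, so adjust for your own hosts):

# simulate a ring 0 failure on node1
ifdown eth1
corosync-cfgtool -s          # ring 0 now reported FAULTY

# repair the link and ask corosync to re-enable the ring
ifup eth1
ping -c 3 <node2 ring0 address>   # confirm the link really is back
corosync-cfgtool -r
corosync-cfgtool -s          # ring 0 briefly shows "no faults" again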

In my 2 node cluster I have been issuing the ifdown command on eth1 on
node1.  This results in corosync-cfgtool -s reporting the following:

root@mq006:~# corosync-cfgtool -s
Printing ring status.
Local node ID 71056300
RING ID 0
    id    = 172.59.60.4
    status    = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY -
adminisrtative intervention required.
RING ID 1
    id    = 172.23.42.37
    status    = ring 1 active with no faults

I then issue ifup eth1 on node1 and confirm that I can now ping node2, so
the link is definitely back up.  I then issue corosync-cfgtool -r, and
running corosync-cfgtool -s again reports:

root@mq006:~# corosync-cfgtool -s
Printing ring status.
Local node ID 71056300
RING ID 0
    id    = 172.59.60.4
    status    = ring 0 active with no faults
RING ID 1
    id    = 172.23.42.37
    status    = ring 1 active with no faults

So things are looking good at this point, but if I wait 10 more seconds and
run corosync-cfgtool -s again, it reports that ring_id 0 is FAULTY again:

root@mq006:~# corosync-cfgtool -s
Printing ring status.
Local node ID 71056300
RING ID 0
    id    = 172.59.60.4
    status    = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY -
adminisrtative intervention required.
RING ID 1
    id    = 172.23.42.37
    status    = ring 1 active with no faults
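
To make the ~10 second window easier to catch, I have just been polling the
ring status in a loop rather than running the command by hand, e.g.:

watch -n 2 corosync-cfgtool -s

and you can see ring 0 flip from "active with no faults" back to FAULTY
shortly after the reset.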

It does not matter how many times I run corosync-cfgtool -r; ring 0 is
reported as FAULTY again about 10 seconds after each reset.  I have tried
running /etc/init.d/network restart on node1 in the hope that a full network
stop and start would make a difference, but it doesn't.  The only thing that
fixes the situation is completely stopping and restarting the corosync
cluster stack on both nodes (/etc/init.d/corosync stop and
/etc/init.d/corosync start).  Once I have done that, both rings stay up and
are stable.  This is obviously not what we want.
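
To summarise the recovery attempts, run on node1 (and, for the corosync
restart, on node2 as well):

# does not help - ring 0 goes FAULTY again after ~10 seconds
/etc/init.d/network restart
corosync-cfgtool -r

# the only thing that actually recovers ring 0 (run on both nodes)
/etc/init.d/corosync stop
/etc/init.d/corosync start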

I am running the latest RHEL RPMs from here:
http://www.clusterlabs.org/rpm/epel-5/x86_64/

corosync-1.2.1-1.el5
corosynclib-1.2.1-1.el5
pacemaker-1.0.8-4.el5
pacemaker-libs-1.0.8-4.el5

My corosync.conf looks like this:
compatibility: whitetank

totem {
    version: 2
    secauth: off
    threads: 0
    consensus: 1201
    rrp_mode: passive
    interface {
        ringnumber: 0
        bindnetaddr: 172.59.60.0
        mcastaddr: 226.94.1.1
        mcastport: 4010
    }
    interface {
        ringnumber: 1
        bindnetaddr: 172.23.40.0
        mcastaddr: 226.94.2.1
        mcastport: 4011
    }
}

logging {
    fileline: off
    to_stderr: yes
    to_logfile: yes
    to_syslog: yes
    logfile: /tmp/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

amf {
    mode: disabled
}

service {
       # Load the Pacemaker Cluster Resource Manager
       name: pacemaker
       ver: 0
}

aisexec {
       user:   root
       group:  root
}


This is what gets written into /tmp/corosync.log when I carry out the link
failure test and then try to reset the ring status:
root@mq005:~/activemq_rpms# cat /tmp/corosync.log
Apr 13 11:20:31 corosync [MAIN  ] Corosync Cluster Engine ('1.2.1'): started
and ready to provide service.
Apr 13 11:20:31 corosync [MAIN  ] Corosync built-in features: nss rdma
Apr 13 11:20:31 corosync [MAIN  ] Successfully read main configuration file
'/etc/corosync/corosync.conf'.
Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security:
libtomcrypt SOBER128/SHA1HMAC (mode 0).
Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security:
libtomcrypt SOBER128/SHA1HMAC (mode 0).
Apr 13 11:20:31 corosync [TOTEM ] The network interface [172.59.60.3] is now
up.
Apr 13 11:20:31 corosync [pcmk  ] info: process_ais_conf: Reading configure
Apr 13 11:20:31 corosync [pcmk  ] info: config_find_init: Local handle:
4730966301143465986 for logging
Apr 13 11:20:31 corosync [pcmk  ] info: config_find_next: Processing
additional logging options...
Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Found 'off' for
option: debug
Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to 'off'
for option: to_file
Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Found 'yes' for
option: to_syslog
Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to
'daemon' for option: syslog_facility
Apr 13 11:20:31 corosync [pcmk  ] info: config_find_init: Local handle:
7739444317642555395 for service
Apr 13 11:20:31 corosync [pcmk  ] info: config_find_next: Processing
additional service options...
Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to 'pcmk'
for option: clustername
Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to 'no'
for option: use_logd
Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to 'no'
for option: use_mgmtd
Apr 13 11:20:31 corosync [pcmk  ] info: pcmk_startup: CRM: Initialized
Apr 13 11:20:31 corosync [pcmk  ] Logging: Initialized pcmk_startup
Apr 13 11:20:32 corosync [pcmk  ] info: pcmk_startup: Maximum core file size
is: 18446744073709551615
Apr 13 11:20:32 corosync [pcmk  ] info: pcmk_startup: Service: 9
Apr 13 11:20:32 corosync [pcmk  ] info: pcmk_startup: Local hostname:
mq005.back.int.cwwtf.local
Apr 13 11:20:32 corosync [pcmk  ] info: pcmk_update_nodeid: Local node id:
54279084
Apr 13 11:20:32 corosync [pcmk  ] info: update_member: Creating entry for
node 54279084 born on 0
Apr 13 11:20:32 corosync [pcmk  ] info: update_member: 0x5452c00 Node
54279084 now known as mq005.back.int.cwwtf.local (was: (null))
Apr 13 11:20:32 corosync [pcmk  ] info: update_member: Node
mq005.back.int.cwwtf.local now has 1 quorum votes (was 0)
Apr 13 11:20:32 corosync [pcmk  ] info: update_member: Node
54279084/mq005.back.int.cwwtf.local is now: member
Apr 13 11:20:32 corosync [pcmk  ] info: spawn_child: Forked child 11873 for
process stonithd
Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child 11874 for
process cib
Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child 11875 for
process lrmd
Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child 11876 for
process attrd
Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child 11877 for
process pengine
Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child 11878 for
process crmd
Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: Pacemaker Cluster
Manager 1.0.8
Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync extended
virtual synchrony service
Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync
configuration service
Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync cluster
closed process group service v1.01
Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync cluster
config database access v1.01
Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync profile
loading service
Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync cluster
quorum service v0.1
Apr 13 11:20:33 corosync [MAIN  ] Compatibility mode set to whitetank.
Using V1 and V2 of the synchronization engine.
Apr 13 11:20:33 corosync [TOTEM ] The network interface [172.23.42.36] is
now up.
Apr 13 11:20:33 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 640: memb=0, new=0, lost=0
Apr 13 11:20:33 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 640: memb=1, new=1, lost=0
Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_peer_update: NEW:
mq005.back.int.cwwtf.local 54279084
Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
mq005.back.int.cwwtf.local 54279084
Apr 13 11:20:33 corosync [pcmk  ] info: update_member: Node
mq005.back.int.cwwtf.local now has process list:
00000000000000000000000000013312 (78610)
Apr 13 11:20:33 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Apr 13 11:20:33 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc: Recorded connection
0x545a660 for attrd/11876
Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc: Recorded connection
0x545b290 for stonithd/11873
Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc: Recorded connection
0x545d4e0 for cib/11874
Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc: Sending membership update
640 to cib
Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_ipc: Recorded connection
0x545e210 for crmd/11878
Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_ipc: Sending membership update
640 to crmd
Apr 13 11:20:34 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 648: memb=1, new=0, lost=0
Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_peer_update: memb:
mq005.back.int.cwwtf.local 54279084
Apr 13 11:20:34 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 648: memb=2, new=1, lost=0
Apr 13 11:20:34 corosync [pcmk  ] info: update_member: Creating entry for
node 71056300 born on 648
Apr 13 11:20:34 corosync [pcmk  ] info: update_member: Node 71056300/unknown
is now: member
Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_peer_update: NEW:  .pending.
71056300
Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
mq005.back.int.cwwtf.local 54279084
Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_peer_update: MEMB: .pending.
71056300
Apr 13 11:20:34 corosync [pcmk  ] info: send_member_notification: Sending
membership update 648 to 2 children
Apr 13 11:20:34 corosync [pcmk  ] info: update_member: 0x5452c00 Node
54279084 ((null)) born on: 648
Apr 13 11:20:34 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Apr 13 11:20:34 corosync [pcmk  ] info: update_member: 0x545dd00 Node
71056300 (mq006.back.int.cwwtf.local) born on: 648
Apr 13 11:20:34 corosync [pcmk  ] info: update_member: 0x545dd00 Node
71056300 now known as mq006.back.int.cwwtf.local (was: (null))
Apr 13 11:20:34 corosync [pcmk  ] info: update_member: Node
mq006.back.int.cwwtf.local now has process list:
00000000000000000000000000013312 (78610)
Apr 13 11:20:34 corosync [pcmk  ] info: update_member: Node
mq006.back.int.cwwtf.local now has 1 quorum votes (was 0)
Apr 13 11:20:34 corosync [pcmk  ] info: send_member_notification: Sending
membership update 648 to 2 children
Apr 13 11:20:34 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Apr 13 11:23:34 corosync [TOTEM ] Marking seqid 6843 ringid 0 interface
172.59.60.3 FAULTY - adminisrtative intervention required.
Apr 13 11:25:15 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3
FAULTY - adminisrtative intervention required.
Apr 13 11:28:02 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3
FAULTY - adminisrtative intervention required.
Apr 13 11:28:13 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3
FAULTY - adminisrtative intervention required.


Can anyone help me out with this?  Am I doing something wrong or have I
found a bug?

Cheers,
Tom