Hi there,<br><br>As per the recommendations, the 2 node clusters I have built use 2 redundant rings for added resilience. I am currently carrying out some testing on the clusters to ensure that a failure in one of the redundant rings can be recovered from. I am aware that corosync does not currently have a feature to monitor failed rings and bring them back up automatically once communications are repaired. All I have been doing is testing that the corosync-cfgtool -r command does what it says on the tin and will "Reset redundant ring state cluster wide after a fault, to re-enable redundant ring operation."<br>
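(In the absence of that feature, the obvious stopgap would be to script the reset yourself, e.g. from cron. A rough sketch is below; the helper function and its name are my own, only the two corosync-cfgtool calls are the real commands. As you'll see further down, though, the reset itself is what isn't sticking for me, so this wouldn't actually help here.)

```shell
#!/bin/sh
# Stopgap sketch: re-enable redundant ring operation whenever
# `corosync-cfgtool -s` reports a ring as FAULTY. The helper name is
# illustrative, not part of corosync.

status_is_faulty() {
    # Succeed if the supplied ring-status text mentions a FAULTY ring.
    printf '%s\n' "$1" | grep -q 'FAULTY'
}

# Only attempt the reset on a machine that actually runs corosync.
if command -v corosync-cfgtool >/dev/null 2>&1; then
    status=$(corosync-cfgtool -s)
    if status_is_faulty "$status"; then
        corosync-cfgtool -r
    fi
fi
```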
<br>In my 2 node cluster I have been issuing the ifdown command on eth1 on node1. This results in corosync-cfgtool -s reporting the following:<br><br>root@mq006:~# corosync-cfgtool -s<br>Printing ring status.<br>Local node ID 71056300<br>
RING ID 0<br> id = 172.59.60.4<br> status = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY - adminisrtative intervention required.<br>RING ID 1<br> id = 172.23.42.37<br> status = ring 1 active with no faults<br>
<br>I then issue ifup eth1 on node1 and ensure that I can now ping node2. The link is definitely up, so I issue corosync-cfgtool -r. Running corosync-cfgtool -s again now reports:<br><br>root@mq006:~# corosync-cfgtool -s<br>
Printing ring status.<br>Local node ID 71056300<br>RING ID 0<br> id = 172.59.60.4<br> status = ring 0 active with no faults<br>RING ID 1<br> id = 172.23.42.37<br> status = ring 1 active with no faults<br>
<br>So things are looking good at this point, but if I wait another 10 seconds or so and run corosync-cfgtool -s once more, it reports ring 0 as FAULTY again:<br><br>root@mq006:~# corosync-cfgtool -s<br>Printing ring status.<br>
Local node ID 71056300<br>RING ID 0<br> id = 172.59.60.4<br> status = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY - adminisrtative intervention required.<br>RING ID 1<br> id = 172.23.42.37<br>
status = ring 1 active with no faults<br><br>No matter how many times I run corosync-cfgtool -r, ring 0 is reported as FAULTY again 10 seconds after the reset. I have tried running /etc/init.d/network restart on node1 in the hope that a full network stop and start would make a difference, but it doesn't. The only thing that fixes the situation is completely stopping and restarting the corosync cluster stack on both nodes (/etc/init.d/corosync stop and /etc/init.d/corosync start). Once I've done that, both rings stay up and are stable. This is obviously not what we want.<br>
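(For what it's worth, the 10-second relapse is easy to catch by polling the status in a loop after issuing the reset. A minimal sketch of such a check is below; the function name and arguments are my own invention, only `corosync-cfgtool -s`/`-r` are real commands.)

```shell
#!/bin/sh
# Sketch: after `corosync-cfgtool -r`, watch the ring status for a while
# to see whether the FAULTY state comes back (for me it does after ~10s).
# $1 = command that prints ring status, $2 = number of seconds to watch.
fault_returns_within() {
    cmd=$1; secs=$2; i=0
    while [ "$i" -lt "$secs" ]; do
        if $cmd | grep -q 'FAULTY'; then
            return 0        # fault re-appeared within the window
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1                # stayed clean for the whole window
}

# Live usage (needs the cluster):
#   corosync-cfgtool -r
#   fault_returns_within 'corosync-cfgtool -s' 15 && echo "ring FAULTY again"
```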
<br>I am running the latest RHEL rpms from here: <a href="http://www.clusterlabs.org/rpm/epel-5/x86_64/">http://www.clusterlabs.org/rpm/epel-5/x86_64/</a><br><br>corosync-1.2.1-1.el5<br>corosynclib-1.2.1-1.el5<br>pacemaker-1.0.8-4.el5<br>
pacemaker-libs-1.0.8-4.el5<br><br>My corosync.conf looks like this:<br>compatibility: whitetank<br><br>totem {<br> version: 2<br> secauth: off<br> threads: 0<br> consensus: 1201<br> rrp_mode: passive<br> interface {<br>
ringnumber: 0<br> bindnetaddr: 172.59.60.0<br> mcastaddr: 226.94.1.1<br> mcastport: 4010<br> }<br> interface {<br> ringnumber: 1<br> bindnetaddr: 172.23.40.0<br>
mcastaddr: 226.94.2.1<br> mcastport: 4011<br> }<br>}<br><br>logging {<br> fileline: off<br> to_stderr: yes<br> to_logfile: yes<br> to_syslog: yes<br> logfile: /tmp/corosync.log<br>
debug: off<br> timestamp: on<br> logger_subsys {<br> subsys: AMF<br> debug: off<br> }<br>}<br><br>amf {<br> mode: disabled<br>}<br><br>service {<br> # Load the Pacemaker Cluster Resource Manager<br>
 name: pacemaker<br> ver: 0<br>}<br><br>aisexec {<br> user: root<br> group: root<br>}<br><br><br>This is what gets written into /tmp/corosync.log when I carry out the link failure test and then try to reset the ring status:<br>
root@mq005:~/activemq_rpms# cat /tmp/corosync.log <br>Apr 13 11:20:31 corosync [MAIN ] Corosync Cluster Engine ('1.2.1'): started and ready to provide service.<br>Apr 13 11:20:31 corosync [MAIN ] Corosync built-in features: nss rdma<br>
Apr 13 11:20:31 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.<br>Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).<br>Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).<br>
Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).<br>Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).<br>Apr 13 11:20:31 corosync [TOTEM ] The network interface [172.59.60.3] is now up.<br>
Apr 13 11:20:31 corosync [pcmk ] info: process_ais_conf: Reading configure<br>Apr 13 11:20:31 corosync [pcmk ] info: config_find_init: Local handle: 4730966301143465986 for logging<br>Apr 13 11:20:31 corosync [pcmk ] info: config_find_next: Processing additional logging options...<br>
Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Found 'off' for option: debug<br>Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'off' for option: to_file<br>Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog<br>
Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility<br>Apr 13 11:20:31 corosync [pcmk ] info: config_find_init: Local handle: 7739444317642555395 for service<br>
Apr 13 11:20:31 corosync [pcmk ] info: config_find_next: Processing additional service options...<br>Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername<br>Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd<br>
Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd<br>Apr 13 11:20:31 corosync [pcmk ] info: pcmk_startup: CRM: Initialized<br>Apr 13 11:20:31 corosync [pcmk ] Logging: Initialized pcmk_startup<br>
Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615<br>Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Service: 9<br>Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Local hostname: mq005.back.int.cwwtf.local<br>
Apr 13 11:20:32 corosync [pcmk ] info: pcmk_update_nodeid: Local node id: 54279084<br>Apr 13 11:20:32 corosync [pcmk ] info: update_member: Creating entry for node 54279084 born on 0<br>Apr 13 11:20:32 corosync [pcmk ] info: update_member: 0x5452c00 Node 54279084 now known as mq005.back.int.cwwtf.local (was: (null))<br>
Apr 13 11:20:32 corosync [pcmk ] info: update_member: Node mq005.back.int.cwwtf.local now has 1 quorum votes (was 0)<br>Apr 13 11:20:32 corosync [pcmk ] info: update_member: Node 54279084/mq005.back.int.cwwtf.local is now: member<br>
Apr 13 11:20:32 corosync [pcmk ] info: spawn_child: Forked child 11873 for process stonithd<br>Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11874 for process cib<br>Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11875 for process lrmd<br>
Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11876 for process attrd<br>Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11877 for process pengine<br>Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11878 for process crmd<br>
Apr 13 11:20:33 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.0.8<br>Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service<br>Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync configuration service<br>
Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01<br>Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01<br>Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync profile loading service<br>
Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1<br>Apr 13 11:20:33 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.<br>Apr 13 11:20:33 corosync [TOTEM ] The network interface [172.23.42.36] is now up.<br>
Apr 13 11:20:33 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 640: memb=0, new=0, lost=0<br>Apr 13 11:20:33 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 640: memb=1, new=1, lost=0<br>
Apr 13 11:20:33 corosync [pcmk ] info: pcmk_peer_update: NEW: mq005.back.int.cwwtf.local 54279084<br>Apr 13 11:20:33 corosync [pcmk ] info: pcmk_peer_update: MEMB: mq005.back.int.cwwtf.local 54279084<br>Apr 13 11:20:33 corosync [pcmk ] info: update_member: Node mq005.back.int.cwwtf.local now has process list: 00000000000000000000000000013312 (78610)<br>
Apr 13 11:20:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.<br>Apr 13 11:20:33 corosync [MAIN ] Completed service synchronization, ready to provide service.<br>Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545a660 for attrd/11876<br>
Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545b290 for stonithd/11873<br>Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545d4e0 for cib/11874<br>Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Sending membership update 640 to cib<br>
Apr 13 11:20:34 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545e210 for crmd/11878<br>Apr 13 11:20:34 corosync [pcmk ] info: pcmk_ipc: Sending membership update 640 to crmd<br>Apr 13 11:20:34 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 648: memb=1, new=0, lost=0<br>
Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: memb: mq005.back.int.cwwtf.local 54279084<br>Apr 13 11:20:34 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 648: memb=2, new=1, lost=0<br>
Apr 13 11:20:34 corosync [pcmk ] info: update_member: Creating entry for node 71056300 born on 648<br>Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node 71056300/unknown is now: member<br>Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: NEW: .pending. 71056300<br>
Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: MEMB: mq005.back.int.cwwtf.local 54279084<br>Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: MEMB: .pending. 71056300<br>Apr 13 11:20:34 corosync [pcmk ] info: send_member_notification: Sending membership update 648 to 2 children<br>
Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x5452c00 Node 54279084 ((null)) born on: 648<br>Apr 13 11:20:34 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.<br>Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x545dd00 Node 71056300 (mq006.back.int.cwwtf.local) born on: 648<br>
Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x545dd00 Node 71056300 now known as mq006.back.int.cwwtf.local (was: (null))<br>Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node mq006.back.int.cwwtf.local now has process list: 00000000000000000000000000013312 (78610)<br>
Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node mq006.back.int.cwwtf.local now has 1 quorum votes (was 0)<br>Apr 13 11:20:34 corosync [pcmk ] info: send_member_notification: Sending membership update 648 to 2 children<br>
Apr 13 11:20:34 corosync [MAIN ] Completed service synchronization, ready to provide service.<br>Apr 13 11:23:34 corosync [TOTEM ] Marking seqid 6843 ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.<br>
Apr 13 11:25:15 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.<br>Apr 13 11:28:02 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.<br>
Apr 13 11:28:13 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.<br><br><br>Can anyone help me out with this? Am I doing something wrong or have I found a bug?<br><br>
Cheers,<br>Tom<br>