[Openais] Redundant ring not recovering after issuing the command corosync-cfgtool -r

Steven Dake sdake at redhat.com
Tue Apr 13 08:37:36 PDT 2010


Try separating the mcastport values by 2 instead of 1.
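
For example, with the interface sections from your corosync.conf below,
that would mean leaving ring 0 on 4010 and moving ring 1 to 4012 or
higher (totem also uses mcastport - 1 for each ring, which is why
back-to-back values like 4010/4011 can clash on port 4010):

    interface {
            ringnumber: 0
            bindnetaddr: 172.59.60.0
            mcastaddr: 226.94.1.1
            mcastport: 4010
    }
    interface {
            ringnumber: 1
            bindnetaddr: 172.23.40.0
            mcastaddr: 226.94.2.1
            mcastport: 4012
    }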

Regards
-steve
On Tue, 2010-04-13 at 11:30 +0100, Tom Pride wrote:
> Hi There,
> 
> As per the recommendations, the 2-node clusters I have built use 2
> redundant rings for added resilience.  I am currently carrying out
> some testing on the clusters to ensure that a failure in one of the
> redundant rings can be recovered from.  I am aware of the fact that
> corosync does not currently have a feature which monitors failed rings
> to bring them back up automatically when communications are repaired.
> All I have been doing is testing to see that the corosync-cfgtool -r
> command will do as it says on the tin and "Reset redundant ring state
> cluster wide after a fault, to re-enable redundant ring operation."
> 
> In my 2-node cluster I have been issuing the ifdown command on eth1 on
> node1.  This results in corosync-cfgtool -s reporting the following:
> 
> root at mq006:~# corosync-cfgtool -s
> Printing ring status.
> Local node ID 71056300
> RING ID 0
>     id    = 172.59.60.4
>     status    = Marking seqid 8574 ringid 0 interface 172.59.60.4
> FAULTY - adminisrtative intervention required.
> RING ID 1
>     id    = 172.23.42.37
>     status    = ring 1 active with no faults
> 
> I then issue ifup eth1 on node1 and confirm that I can now ping node2.
> The link is definitely up, so I issue the command
> corosync-cfgtool -r.  Running corosync-cfgtool -s again, it now
> reports:
> 
> root at mq006:~# corosync-cfgtool -s
> Printing ring status.
> Local node ID 71056300
> RING ID 0
>     id    = 172.59.60.4
>     status    = ring 0 active with no faults
> RING ID 1
>     id    = 172.23.42.37
>     status    = ring 1 active with no faults
> 
> So things are looking good at this point, but if I wait 10 more
> seconds and run corosync-cfgtool -s again, it reports that ring_id 0
> is FAULTY again:
> 
> root at mq006:~# corosync-cfgtool -s
> Printing ring status.
> Local node ID 71056300
> RING ID 0
>     id    = 172.59.60.4
>     status    = Marking seqid 8574 ringid 0 interface 172.59.60.4
> FAULTY - adminisrtative intervention required.
> RING ID 1
>     id    = 172.23.42.37
>     status    = ring 1 active with no faults
> 
> It does not matter how many times I run corosync-cfgtool -r; ring_id 0
> is marked FAULTY again 10 seconds after issuing the reset.  I
> have tried running /etc/init.d/network restart on node1 in the hope
> that a full network stop and start makes a difference, but it doesn't.
> The only thing that fixes the situation is completely stopping and
> restarting the corosync cluster stack on both nodes
> (/etc/init.d/corosync stop and /etc/init.d/corosync start).  Once I've
> done that, both rings stay up and are stable.  This is obviously not
> what we want.
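> 
> To summarise, the test sequence I am running on node1 is roughly this
> (eth1 carries the ring 0 network on these hosts, node2 being the other
> cluster node):
> 
>     ifdown eth1                 # fail ring 0
>     corosync-cfgtool -s         # ring 0 is marked FAULTY, as expected
>     ifup eth1                   # repair the link
>     ping -c 3 node2             # confirm the other node is reachable again
>     corosync-cfgtool -r         # reset redundant ring state cluster wide
>     corosync-cfgtool -s         # ring 0 briefly reports "no faults"
>     sleep 10
>     corosync-cfgtool -s         # ring 0 is FAULTY again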
> 
> I am running the latest RHEL rpms from here:
> http://www.clusterlabs.org/rpm/epel-5/x86_64/
> 
> corosync-1.2.1-1.el5
> corosynclib-1.2.1-1.el5
> pacemaker-1.0.8-4.el5
> pacemaker-libs-1.0.8-4.el5
> 
> My corosync.conf looks like this:
> compatibility: whitetank
> 
> totem {
>     version: 2
>     secauth: off
>     threads: 0
>     consensus: 1201
>     rrp_mode: passive
>     interface {
>                 ringnumber: 0
>                 bindnetaddr: 172.59.60.0
>                 mcastaddr: 226.94.1.1
>                 mcastport: 4010
>     }
>     interface {
>                 ringnumber: 1
>                 bindnetaddr: 172.23.40.0
>                 mcastaddr: 226.94.2.1
>                 mcastport: 4011
>     }
> }
> 
> logging {
>     fileline: off
>     to_stderr: yes
>     to_logfile: yes
>     to_syslog: yes
>     logfile: /tmp/corosync.log
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: AMF
>         debug: off
>     }
> }
> 
> amf {
>     mode: disabled
> }
> 
> service {
>        # Load the Pacemaker Cluster Resource Manager
>        name: pacemaker
>        ver: 0
> }
> 
> aisexec {
>        user:   root
>        group:  root
> }
> 
> 
> This is what gets written into /tmp/corosync.log when I carry out the
> link failure test and then try to reset the ring status:
> root at mq005:~/activemq_rpms# cat /tmp/corosync.log 
> Apr 13 11:20:31 corosync [MAIN  ] Corosync Cluster Engine ('1.2.1'):
> started and ready to provide service.
> Apr 13 11:20:31 corosync [MAIN  ] Corosync built-in features: nss rdma
> Apr 13 11:20:31 corosync [MAIN  ] Successfully read main configuration
> file '/etc/corosync/corosync.conf'.
> Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
> Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive
> security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
> Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive
> security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Apr 13 11:20:31 corosync [TOTEM ] The network interface [172.59.60.3]
> is now up.
> Apr 13 11:20:31 corosync [pcmk  ] info: process_ais_conf: Reading
> configure
> Apr 13 11:20:31 corosync [pcmk  ] info: config_find_init: Local
> handle: 4730966301143465986 for logging
> Apr 13 11:20:31 corosync [pcmk  ] info: config_find_next: Processing
> additional logging options...
> Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Found 'off'
> for option: debug
> Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to
> 'off' for option: to_file
> Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Found 'yes'
> for option: to_syslog
> Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to
> 'daemon' for option: syslog_facility
> Apr 13 11:20:31 corosync [pcmk  ] info: config_find_init: Local
> handle: 7739444317642555395 for service
> Apr 13 11:20:31 corosync [pcmk  ] info: config_find_next: Processing
> additional service options...
> Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to
> 'pcmk' for option: clustername
> Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to
> 'no' for option: use_logd
> Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to
> 'no' for option: use_mgmtd
> Apr 13 11:20:31 corosync [pcmk  ] info: pcmk_startup: CRM: Initialized
> Apr 13 11:20:31 corosync [pcmk  ] Logging: Initialized pcmk_startup
> Apr 13 11:20:32 corosync [pcmk  ] info: pcmk_startup: Maximum core
> file size is: 18446744073709551615
> Apr 13 11:20:32 corosync [pcmk  ] info: pcmk_startup: Service: 9
> Apr 13 11:20:32 corosync [pcmk  ] info: pcmk_startup: Local hostname:
> mq005.back.int.cwwtf.local
> Apr 13 11:20:32 corosync [pcmk  ] info: pcmk_update_nodeid: Local node
> id: 54279084
> Apr 13 11:20:32 corosync [pcmk  ] info: update_member: Creating entry
> for node 54279084 born on 0
> Apr 13 11:20:32 corosync [pcmk  ] info: update_member: 0x5452c00 Node
> 54279084 now known as mq005.back.int.cwwtf.local (was: (null))
> Apr 13 11:20:32 corosync [pcmk  ] info: update_member: Node
> mq005.back.int.cwwtf.local now has 1 quorum votes (was 0)
> Apr 13 11:20:32 corosync [pcmk  ] info: update_member: Node
> 54279084/mq005.back.int.cwwtf.local is now: member
> Apr 13 11:20:32 corosync [pcmk  ] info: spawn_child: Forked child
> 11873 for process stonithd
> Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child
> 11874 for process cib
> Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child
> 11875 for process lrmd
> Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child
> 11876 for process attrd
> Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child
> 11877 for process pengine
> Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child
> 11878 for process crmd
> Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: Pacemaker
> Cluster Manager 1.0.8
> Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync
> extended virtual synchrony service
> Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync
> configuration service
> Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync
> cluster closed process group service v1.01
> Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync
> cluster config database access v1.01
> Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync
> profile loading service
> Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync
> cluster quorum service v0.1
> Apr 13 11:20:33 corosync [MAIN  ] Compatibility mode set to whitetank.
> Using V1 and V2 of the synchronization engine.
> Apr 13 11:20:33 corosync [TOTEM ] The network interface [172.23.42.36]
> is now up.
> Apr 13 11:20:33 corosync [pcmk  ] notice: pcmk_peer_update:
> Transitional membership event on ring 640: memb=0, new=0, lost=0
> Apr 13 11:20:33 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> membership event on ring 640: memb=1, new=1, lost=0
> Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_peer_update: NEW:
> mq005.back.int.cwwtf.local 54279084
> Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> mq005.back.int.cwwtf.local 54279084
> Apr 13 11:20:33 corosync [pcmk  ] info: update_member: Node
> mq005.back.int.cwwtf.local now has process list:
> 00000000000000000000000000013312 (78610)
> Apr 13 11:20:33 corosync [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> Apr 13 11:20:33 corosync [MAIN  ] Completed service synchronization,
> ready to provide service.
> Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc: Recorded connection
> 0x545a660 for attrd/11876
> Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc: Recorded connection
> 0x545b290 for stonithd/11873
> Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc: Recorded connection
> 0x545d4e0 for cib/11874
> Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc: Sending membership
> update 640 to cib
> Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_ipc: Recorded connection
> 0x545e210 for crmd/11878
> Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_ipc: Sending membership
> update 640 to crmd
> Apr 13 11:20:34 corosync [pcmk  ] notice: pcmk_peer_update:
> Transitional membership event on ring 648: memb=1, new=0, lost=0
> Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_peer_update: memb:
> mq005.back.int.cwwtf.local 54279084
> Apr 13 11:20:34 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> membership event on ring 648: memb=2, new=1, lost=0
> Apr 13 11:20:34 corosync [pcmk  ] info: update_member: Creating entry
> for node 71056300 born on 648
> Apr 13 11:20:34 corosync [pcmk  ] info: update_member: Node
> 71056300/unknown is now: member
> Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_peer_update:
> NEW:  .pending. 71056300
> Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> mq005.back.int.cwwtf.local 54279084
> Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_peer_update:
> MEMB: .pending. 71056300
> Apr 13 11:20:34 corosync [pcmk  ] info: send_member_notification:
> Sending membership update 648 to 2 children
> Apr 13 11:20:34 corosync [pcmk  ] info: update_member: 0x5452c00 Node
> 54279084 ((null)) born on: 648
> Apr 13 11:20:34 corosync [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> Apr 13 11:20:34 corosync [pcmk  ] info: update_member: 0x545dd00 Node
> 71056300 (mq006.back.int.cwwtf.local) born on: 648
> Apr 13 11:20:34 corosync [pcmk  ] info: update_member: 0x545dd00 Node
> 71056300 now known as mq006.back.int.cwwtf.local (was: (null))
> Apr 13 11:20:34 corosync [pcmk  ] info: update_member: Node
> mq006.back.int.cwwtf.local now has process list:
> 00000000000000000000000000013312 (78610)
> Apr 13 11:20:34 corosync [pcmk  ] info: update_member: Node
> mq006.back.int.cwwtf.local now has 1 quorum votes (was 0)
> Apr 13 11:20:34 corosync [pcmk  ] info: send_member_notification:
> Sending membership update 648 to 2 children
> Apr 13 11:20:34 corosync [MAIN  ] Completed service synchronization,
> ready to provide service.
> Apr 13 11:23:34 corosync [TOTEM ] Marking seqid 6843 ringid 0
> interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> Apr 13 11:25:15 corosync [TOTEM ] Marking ringid 0 interface
> 172.59.60.3 FAULTY - adminisrtative intervention required.
> Apr 13 11:28:02 corosync [TOTEM ] Marking ringid 0 interface
> 172.59.60.3 FAULTY - adminisrtative intervention required.
> Apr 13 11:28:13 corosync [TOTEM ] Marking ringid 0 interface
> 172.59.60.3 FAULTY - adminisrtative intervention required.
> 
> 
> Can anyone help me out with this?  Am I doing something wrong or have
> I found a bug?
> 
> Cheers,
> Tom
> _______________________________________________
> Openais mailing list
> Openais at lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais


