[Openais] Redundant ring not recovering after issuing the command corosync-cfgtool -r
Steven Dake
sdake at redhat.com
Tue Apr 13 10:33:40 PDT 2010
On Tue, 2010-04-13 at 17:04 +0100, Tom Pride wrote:
> Hi Steve,
>
> Thanks for the suggestion but that didn't work. I'm not sure if you
> read my entire post or not, but the two redundant rings that I have
> configured, both work without a problem until I introduce a fault by
> shutting down eth1 on one of the nodes. This then causes the cluster
> to mark ringid 0 as FAULTY. When I then reactivate eth1 and both
> nodes can once again ping each other over the network, I then run
> corosync-cfgtool -r which should re-enable the FAULTY redundant ring
> within corosync, but it doesn't work. Corosync refuses to re-enable
> the ring even though there is no longer any network fault.
>
By deactivating eth1, I assume you mean you ran ifdown eth1.
Unfortunately, taking a network interface out of service while using
redundant ring doesn't work properly. To verify that a failure on that
interface is detected, I recommend using iptables to block the ports
related to corosync.
A bit more detail:
http://www.corosync.org/doku.php?id=faq:ifdown
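
As a rough illustration of the iptables approach, the sketch below only
prints the rules it would apply (a dry run) rather than running them.
It assumes ring 0 is on eth1 with mcastport 4010, as in the configuration
quoted further down, and that corosync uses both mcastport and
mcastport - 1, as described in the corosync.conf man page. The names
IFACE, MCASTPORT and ring_fault_rules are made up for this example.

```shell
#!/bin/sh
# Dry-run sketch: prints iptables commands, does not execute them.
# IFACE and MCASTPORT are assumptions taken from the thread
# (eth1 carries ring 0, whose mcastport is 4010).
IFACE="eth1"
MCASTPORT=4010

# Corosync uses mcastport and mcastport - 1, so cover both ports.
LOWPORT=$((MCASTPORT - 1))

# ring_fault_rules ACTION
#   ACTION is A (append the DROP rules, i.e. simulate the fault)
#   or D (delete the same rules, i.e. clear the fault).
ring_fault_rules() {
    action="$1"
    echo "iptables -$action INPUT -i $IFACE -p udp --dport $LOWPORT:$MCASTPORT -j DROP"
    echo "iptables -$action OUTPUT -o $IFACE -p udp --dport $LOWPORT:$MCASTPORT -j DROP"
}

ring_fault_rules A   # rules that would block corosync traffic on the ring
ring_fault_rules D   # rules that would remove the block again
```

Once the printed commands have been reviewed and run as root, the ring
should be marked FAULTY while the DROP rules are in place. After deleting
them, corosync-cfgtool -r can be used to reset the ring state, and
corosync-cfgtool -s to check it.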
> I might be mistaken, but isn't the trick of separating the port values
> by 2 instead of 1 only for when you are using broadcast instead of the
> recommended multicast? I'm using multicast.
>
I thought it might make a difference on the local interface port used
for UDP messages (the token), but I wasn't sure.
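
To make the port arithmetic concrete: according to the corosync.conf man
page, each ring uses its mcastport and mcastport - 1, so two rings
configured with 4010 and 4011 both touch UDP port 4010. The small check
below is only a sketch of that arithmetic. The function name
rings_share_port is made up for this example, and whether the overlap
actually matters with multicast on separate interfaces is not
established in this thread.

```shell
#!/bin/sh
# rings_share_port PORT_A PORT_B
# Each ring uses its mcastport and mcastport - 1, so two rings share a
# UDP port number when their mcastport values differ by less than 2.
# Returns success (0) if they share a port, failure (1) otherwise.
rings_share_port() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    [ "$diff" -lt 2 ]
}

# Ports from the quoted configuration (4010 and 4011).
if rings_share_port 4010 4011; then
    echo "4010/4011: rings share a UDP port"
fi

# Ports separated by 2, as suggested earlier in the thread.
if ! rings_share_port 4010 4012; then
    echo "4010/4012: no shared UDP port"
fi
```

With the values from this thread, the first pair is reported as sharing
a port and the second is not.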
Regards
-steve
> Any more suggestions?
>
> Cheers,
> Tom
>
> On Tue, Apr 13, 2010 at 4:37 PM, Steven Dake <sdake at redhat.com> wrote:
> try separating the port values by 2 instead of 1.
>
> Regards
> -steve
>
> On Tue, 2010-04-13 at 11:30 +0100, Tom Pride wrote:
> > Hi There,
> >
> > As per the recommendations, the 2 node clusters I have built use 2
> > redundant rings for added resilience. I have currently been carrying
> > out some testing on the clusters to ensure that a failure in one of
> > the redundant rings can be recovered from. I am aware of the fact
> > that corosync does not currently have a feature which monitors
> > failed rings to bring them back up automatically when communications
> > are repaired. All I have been doing is testing to see that the
> > corosync-cfgtool -r command will do as it says on the tin and "Reset
> > redundant ring state cluster wide after a fault, to re-enable
> > redundant ring operation."
> >
> > In my 2 node cluster I have been issuing the ifdown command on eth1
> > on node1. This results in corosync-cfgtool -s reporting the
> > following:
> >
> > root at mq006:~# corosync-cfgtool -s
> > Printing ring status.
> > Local node ID 71056300
> > RING ID 0
> > id = 172.59.60.4
> > status = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY - adminisrtative intervention required.
> > RING ID 1
> > id = 172.23.42.37
> > status = ring 1 active with no faults
> >
> > I then issue ifup eth1 on node1 and ensure that I can now ping
> > node2. The link is definitely up, so I then issue the command
> > corosync-cfgtool -r. I then run corosync-cfgtool -s again and it
> > reports:
> >
> > root at mq006:~# corosync-cfgtool -s
> > Printing ring status.
> > Local node ID 71056300
> > RING ID 0
> > id = 172.59.60.4
> > status = ring 0 active with no faults
> > RING ID 1
> > id = 172.23.42.37
> > status = ring 1 active with no faults
> >
> > So things are looking good at this point, but if I wait 10 more
> > seconds and run corosync-cfgtool -s again, it reports that ring_id 0
> > is FAULTY again:
> >
> > root at mq006:~# corosync-cfgtool -s
> > Printing ring status.
> > Local node ID 71056300
> > RING ID 0
> > id = 172.59.60.4
> > status = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY - adminisrtative intervention required.
> > RING ID 1
> > id = 172.23.42.37
> > status = ring 1 active with no faults
> >
> > It does not matter how many times I run corosync-cfgtool -r, ring_id
> > 0 is reported as FAULTY again 10 seconds after issuing the reset. I
> > have tried running /etc/init.d/network restart on node1 in the hope
> > that a full network stop and start makes a difference, but it
> > doesn't. The only thing that will fix this situation is if I
> > completely stop and restart the corosync cluster stack on both nodes
> > (/etc/init.d/corosync stop and /etc/init.d/corosync start). Once
> > I've done that both rings stay up and are stable. This is obviously
> > not what we want.
> >
> > I am running the latest RHEL rpms from here:
> > http://www.clusterlabs.org/rpm/epel-5/x86_64/
> >
> > corosync-1.2.1-1.el5
> > corosynclib-1.2.1-1.el5
> > pacemaker-1.0.8-4.el5
> > pacemaker-libs-1.0.8-4.el5
> >
> > My corosync.conf looks like this:
> > compatibility: whitetank
> >
> > totem {
> >     version: 2
> >     secauth: off
> >     threads: 0
> >     consensus: 1201
> >     rrp_mode: passive
> >     interface {
> >         ringnumber: 0
> >         bindnetaddr: 172.59.60.0
> >         mcastaddr: 226.94.1.1
> >         mcastport: 4010
> >     }
> >     interface {
> >         ringnumber: 1
> >         bindnetaddr: 172.23.40.0
> >         mcastaddr: 226.94.2.1
> >         mcastport: 4011
> >     }
> > }
> >
> > logging {
> >     fileline: off
> >     to_stderr: yes
> >     to_logfile: yes
> >     to_syslog: yes
> >     logfile: /tmp/corosync.log
> >     debug: off
> >     timestamp: on
> >     logger_subsys {
> >         subsys: AMF
> >         debug: off
> >     }
> > }
> >
> > amf {
> >     mode: disabled
> > }
> >
> > service {
> >     # Load the Pacemaker Cluster Resource Manager
> >     name: pacemaker
> >     ver: 0
> > }
> >
> > aisexec {
> >     user: root
> >     group: root
> > }
> >
> >
> > This is what gets written into /tmp/corosync.log when I carry out
> > the link failure test and then try and reset the ring status:
> > root at mq005:~/activemq_rpms# cat /tmp/corosync.log
> > Apr 13 11:20:31 corosync [MAIN ] Corosync Cluster Engine ('1.2.1'): started and ready to provide service.
> > Apr 13 11:20:31 corosync [MAIN ] Corosync built-in features: nss rdma
> > Apr 13 11:20:31 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
> > Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
> > Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> > Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
> > Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> > Apr 13 11:20:31 corosync [TOTEM ] The network interface [172.59.60.3] is now up.
> > Apr 13 11:20:31 corosync [pcmk ] info: process_ais_conf: Reading configure
> > Apr 13 11:20:31 corosync [pcmk ] info: config_find_init: Local handle: 4730966301143465986 for logging
> > Apr 13 11:20:31 corosync [pcmk ] info: config_find_next: Processing additional logging options...
> > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Found 'off' for option: debug
> > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'off' for option: to_file
> > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog
> > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
> > Apr 13 11:20:31 corosync [pcmk ] info: config_find_init: Local handle: 7739444317642555395 for service
> > Apr 13 11:20:31 corosync [pcmk ] info: config_find_next: Processing additional service options...
> > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername
> > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd
> > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd
> > Apr 13 11:20:31 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
> > Apr 13 11:20:31 corosync [pcmk ] Logging: Initialized pcmk_startup
> > Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
> > Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Service: 9
> > Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Local hostname: mq005.back.int.cwwtf.local
> > Apr 13 11:20:32 corosync [pcmk ] info: pcmk_update_nodeid: Local node id: 54279084
> > Apr 13 11:20:32 corosync [pcmk ] info: update_member: Creating entry for node 54279084 born on 0
> > Apr 13 11:20:32 corosync [pcmk ] info: update_member: 0x5452c00 Node 54279084 now known as mq005.back.int.cwwtf.local (was: (null))
> > Apr 13 11:20:32 corosync [pcmk ] info: update_member: Node mq005.back.int.cwwtf.local now has 1 quorum votes (was 0)
> > Apr 13 11:20:32 corosync [pcmk ] info: update_member: Node 54279084/mq005.back.int.cwwtf.local is now: member
> > Apr 13 11:20:32 corosync [pcmk ] info: spawn_child: Forked child 11873 for process stonithd
> > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11874 for process cib
> > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11875 for process lrmd
> > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11876 for process attrd
> > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11877 for process pengine
> > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11878 for process crmd
> > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.0.8
> > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service
> > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync configuration service
> > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
> > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01
> > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync profile loading service
> > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> > Apr 13 11:20:33 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> > Apr 13 11:20:33 corosync [TOTEM ] The network interface [172.23.42.36] is now up.
> > Apr 13 11:20:33 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 640: memb=0, new=0, lost=0
> > Apr 13 11:20:33 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 640: memb=1, new=1, lost=0
> > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_peer_update: NEW: mq005.back.int.cwwtf.local 54279084
> > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_peer_update: MEMB: mq005.back.int.cwwtf.local 54279084
> > Apr 13 11:20:33 corosync [pcmk ] info: update_member: Node mq005.back.int.cwwtf.local now has process list: 00000000000000000000000000013312 (78610)
> > Apr 13 11:20:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> > Apr 13 11:20:33 corosync [MAIN ] Completed service synchronization, ready to provide service.
> > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545a660 for attrd/11876
> > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545b290 for stonithd/11873
> > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545d4e0 for cib/11874
> > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Sending membership update 640 to cib
> > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545e210 for crmd/11878
> > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_ipc: Sending membership update 640 to crmd
> > Apr 13 11:20:34 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 648: memb=1, new=0, lost=0
> > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: memb: mq005.back.int.cwwtf.local 54279084
> > Apr 13 11:20:34 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 648: memb=2, new=1, lost=0
> > Apr 13 11:20:34 corosync [pcmk ] info: update_member: Creating entry for node 71056300 born on 648
> > Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node 71056300/unknown is now: member
> > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: NEW: .pending. 71056300
> > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: MEMB: mq005.back.int.cwwtf.local 54279084
> > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: MEMB: .pending. 71056300
> > Apr 13 11:20:34 corosync [pcmk ] info: send_member_notification: Sending membership update 648 to 2 children
> > Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x5452c00 Node 54279084 ((null)) born on: 648
> > Apr 13 11:20:34 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> > Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x545dd00 Node 71056300 (mq006.back.int.cwwtf.local) born on: 648
> > Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x545dd00 Node 71056300 now known as mq006.back.int.cwwtf.local (was: (null))
> > Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node mq006.back.int.cwwtf.local now has process list: 00000000000000000000000000013312 (78610)
> > Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node mq006.back.int.cwwtf.local now has 1 quorum votes (was 0)
> > Apr 13 11:20:34 corosync [pcmk ] info: send_member_notification: Sending membership update 648 to 2 children
> > Apr 13 11:20:34 corosync [MAIN ] Completed service synchronization, ready to provide service.
> > Apr 13 11:23:34 corosync [TOTEM ] Marking seqid 6843 ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> > Apr 13 11:25:15 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> > Apr 13 11:28:02 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> > Apr 13 11:28:13 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> >
> >
> > Can anyone help me out with this? Am I doing something wrong or
> > have I found a bug?
> >
> > Cheers,
> > Tom
>