[Openais] Redundant ring not recovering after issuing the command corosync-cfgtool -r
Tom Pride
tom.pride at gmail.com
Wed Apr 14 01:20:42 PDT 2010
Hi Steve,
Many thanks for your detailed explanation on this. Makes a lot more sense
to me now. I have now repeated the test by shutting down the switch port
that the server is connected to and then reactivating it. Corosync
recovered as expected from the failure after I ran corosync-cfgtool -r.
That's one more test I can tick off the list on my way to building a robust
live cluster.
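
For anyone who wants to reproduce a ring failure without access to the
switch, the iptables approach Steve suggested earlier in this thread could
be sketched roughly as follows. This is only a sketch under assumptions:
ring_block_rules is a made-up helper name, eth1 and port 4010 are taken from
my setup, and the port range relies on my understanding of the corosync.conf
man page that corosync uses mcastport for receives and mcastport - 1 for
sends. The function only prints the commands and runs nothing itself.

```shell
#!/bin/sh
# Sketch only: print iptables commands that drop corosync traffic on one ring.
# Nothing is executed here; review the output, then pipe it to `sh` as root.

# usage: ring_block_rules ACTION IFACE MCASTPORT   (hypothetical helper)
#   ACTION    -A to add the DROP rules, -D to delete them again
#   IFACE     interface carrying the ring (eth1 in this thread)
#   MCASTPORT mcastport from corosync.conf (4010 for ringnumber 0 here)
ring_block_rules() {
    action="$1"; iface="$2"; port="$3"
    # Assumption: corosync uses mcastport (receives) and mcastport - 1 (sends).
    low=$((port - 1))
    echo "iptables $action INPUT -i $iface -p udp --dport $low:$port -j DROP"
    echo "iptables $action OUTPUT -o $iface -p udp --dport $low:$port -j DROP"
}
```

Running "ring_block_rules -A eth1 4010 | sh" as root would introduce the
fault, and the same call with -D would remove it again before running
corosync-cfgtool -r.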
Just out of curiosity, do you have an approximate ETA for a corosync release
with a feature that monitors a redundant ring that has been marked as faulty
and then automatically re-enables it when it detects that the fault has been
fixed?
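
In the meantime, something along these lines might do as a stopgap monitor.
It is a rough sketch, not a corosync feature: faulty_rings and watch_rings
are made-up names, the 30 second default is arbitrary, and the parsing
assumes the corosync-cfgtool -s layout shown further down in this thread
("RING ID n" followed by a status line), which other versions may print
differently.

```shell
#!/bin/sh
# Sketch only: report rings that `corosync-cfgtool -s` shows as FAULTY.

# Read `corosync-cfgtool -s` output on stdin and print one faulty ring ID
# per line. A ring is printed at most once even if its status text wraps.
faulty_rings() {
    awk '
        $1 == "RING" && $2 == "ID" { ring = $3; seen = 0 }
        /FAULTY/ && ring != "" && !seen { print ring; seen = 1 }
    '
}

# Poll every INTERVAL seconds (hypothetical default: 30) and log faulty rings.
# It deliberately does not run `corosync-cfgtool -r` on its own.
watch_rings() {
    interval="${1:-30}"
    while :; do
        bad=$(corosync-cfgtool -s | faulty_rings | tr '\n' ' ')
        [ -n "$bad" ] && echo "$(date) faulty ring(s): $bad"
        sleep "$interval"
    done
}
```

A one-off check would be "corosync-cfgtool -s | faulty_rings". I left the
reset as a manual step because resetting while the fault is still present
just gets the ring marked FAULTY again.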
Cheers,
Tom
On Tue, Apr 13, 2010 at 7:34 PM, Steven Dake <sdake at redhat.com> wrote:
> On Tue, 2010-04-13 at 19:31 +0100, Tom Pride wrote:
> > Just to clarify, when I ifdown eth1 corosync does detect a failure and
> > it does mark the ring as faulty. Are you saying that when I use ifup
> > corosync can't work out that the interface is back up and
> > communications can resume when I run corosync-cfgtool -r ? Would I
> > therefore get a different result if I introduced the failure by
> > physically unplugging the cat5 from the server and then physically
> > reconnecting the cat5? What about if I shut down the port on the
> > switch it is connected to?
> >
>
> Yes this is correct. You should see proper operation if the network
> link is lost normally (ie the nic fails, the link fails, the switch port
> fails, the switch fails).
>
> When an interface is ifdowned, it sends a special event to corosync,
> which corosync captures and causes special behavior to occur (the
> binding to 127.0.0.1). Pulling a network cable doesn't cause this same
> event to occur. This rebind behavior is incompatible with redundant
> ring.
>
> Regards
> -steve
>
> > On Tue, Apr 13, 2010 at 6:33 PM, Steven Dake <sdake at redhat.com> wrote:
> > On Tue, 2010-04-13 at 17:04 +0100, Tom Pride wrote:
> > > Hi Steve,
> > >
> > > Thanks for the suggestion, but that didn't work. I'm not sure if you
> > > read my entire post or not, but the two redundant rings that I have
> > > configured both work without a problem until I introduce a fault by
> > > shutting down eth1 on one of the nodes. This then causes the cluster
> > > to mark ringid 0 as FAULTY. When I then reactivate eth1 and both
> > > nodes can once again ping each other over the network, I then run
> > > corosync-cfgtool -r, which should re-enable the FAULTY redundant ring
> > > within corosync, but it doesn't work. Corosync refuses to re-enable
> > > the ring even though there is no longer any network fault.
> > >
> >
> >
> > By deactivating eth1, I assume you mean you ifdown eth1. Unfortunately,
> > taking a network interface out of service while using redundant ring
> > doesn't work properly. To verify that a failure on that interface is
> > detected, I recommend using iptables to block the ports related to
> > corosync.
> >
> > a bit more detail:
> >
> > http://www.corosync.org/doku.php?id=faq:ifdown
> >
> > > I might be mistaken, but isn't the trick of separating the port values
> > > by 2 instead of 1 only for when you are using broadcast instead of the
> > > recommended multicast? I'm using multicast.
> > >
> >
> >
> > Thought it may make a difference on the local interface port used for
> > udp messages (the token), but wasn't sure.
> >
> > Regards
> > -steve
> >
> >
> > > Any more suggestions?
> > >
> > > Cheers,
> > > Tom
> > >
> > > On Tue, Apr 13, 2010 at 4:37 PM, Steven Dake <sdake at redhat.com> wrote:
> > > try separating the port values by 2 instead of 1.
> > >
> > > Regards
> > > -steve
> > >
> > > On Tue, 2010-04-13 at 11:30 +0100, Tom Pride wrote:
> > > > Hi There,
> > > >
> > > > As per the recommendations, the 2 node clusters I have built use 2
> > > > redundant rings for added resilience. I have currently been carrying
> > > > out some testing on the clusters to ensure that a failure in one of
> > > > the redundant rings can be recovered from. I am aware of the fact that
> > > > corosync does not currently have a feature which monitors failed rings
> > > > to bring them back up automatically when communications are repaired.
> > > > All I have been doing is testing to see that the corosync-cfgtool -r
> > > > command will do as it says on the tin and "Reset redundant ring state
> > > > cluster wide after a fault, to re-enable redundant ring operation."
> > > >
> > > > In my 2 node cluster I have been issuing the ifdown command on eth1 on
> > > > node1. This results in corosync-cfgtool -s reporting the following:
> > > >
> > > > root at mq006:~# corosync-cfgtool -s
> > > > Printing ring status.
> > > > Local node ID 71056300
> > > > RING ID 0
> > > > id = 172.59.60.4
> > > > status = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY - adminisrtative intervention required.
> > > > RING ID 1
> > > > id = 172.23.42.37
> > > > status = ring 1 active with no faults
> > > >
> > > > I then issue ifup eth1 on node1 and ensure that I can now ping node2.
> > > > The link is definitely up, so I then issue the command
> > > > corosync-cfgtool -r. I then run corosync-cfgtool -s again and it
> > > > reports:
> > > >
> > > > root at mq006:~# corosync-cfgtool -s
> > > > Printing ring status.
> > > > Local node ID 71056300
> > > > RING ID 0
> > > > id = 172.59.60.4
> > > > status = ring 0 active with no faults
> > > > RING ID 1
> > > > id = 172.23.42.37
> > > > status = ring 1 active with no faults
> > > >
> > > > So things are looking good at this point, but if I wait 10 more
> > > > seconds and run corosync-cfgtool -s again, it reports that ring_id 0
> > > > is FAULTY again:
> > > >
> > > > root at mq006:~# corosync-cfgtool -s
> > > > Printing ring status.
> > > > Local node ID 71056300
> > > > RING ID 0
> > > > id = 172.59.60.4
> > > > status = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY - adminisrtative intervention required.
> > > > RING ID 1
> > > > id = 172.23.42.37
> > > > status = ring 1 active with no faults
> > > >
> > > > It does not matter how many times I run corosync-cfgtool -r, ring_id 0
> > > > is reported as FAULTY again 10 seconds after issuing the reset. I
> > > > have tried running /etc/init.d/network restart on node1 in the hope
> > > > that a full network stop and start makes a difference, but it doesn't.
> > > > The only thing that will fix this situation is if I completely stop
> > > > and restart the corosync cluster stack on both nodes
> > > > (/etc/init.d/corosync stop and /etc/init.d/corosync start). Once I've
> > > > done that both rings stay up and are stable. This is obviously not
> > > > what we want.
> > > >
> > > > I am running the latest RHEL rpms from here:
> > > > http://www.clusterlabs.org/rpm/epel-5/x86_64/
> > > >
> > > > corosync-1.2.1-1.el5
> > > > corosynclib-1.2.1-1.el5
> > > > pacemaker-1.0.8-4.el5
> > > > pacemaker-libs-1.0.8-4.el5
> > > >
> > > > My corosync.conf looks like this:
> > > > compatibility: whitetank
> > > >
> > > > totem {
> > > > version: 2
> > > > secauth: off
> > > > threads: 0
> > > > consensus: 1201
> > > > rrp_mode: passive
> > > > interface {
> > > > ringnumber: 0
> > > > bindnetaddr: 172.59.60.0
> > > > mcastaddr: 226.94.1.1
> > > > mcastport: 4010
> > > > }
> > > > interface {
> > > > ringnumber: 1
> > > > bindnetaddr: 172.23.40.0
> > > > mcastaddr: 226.94.2.1
> > > > mcastport: 4011
> > > > }
> > > > }
> > > >
> > > > logging {
> > > > fileline: off
> > > > to_stderr: yes
> > > > to_logfile: yes
> > > > to_syslog: yes
> > > > logfile: /tmp/corosync.log
> > > > debug: off
> > > > timestamp: on
> > > > logger_subsys {
> > > > subsys: AMF
> > > > debug: off
> > > > }
> > > > }
> > > >
> > > > amf {
> > > > mode: disabled
> > > > }
> > > >
> > > > service {
> > > > # Load the Pacemaker Cluster Resource Manager
> > > > name: pacemaker
> > > > ver: 0
> > > > }
> > > >
> > > > aisexec {
> > > > user: root
> > > > group: root
> > > > }
> > > >
> > > >
> > > > This is what gets written into /tmp/corosync.log when I carry out the
> > > > link failure test and then try and reset the ring status:
> > > > root at mq005:~/activemq_rpms# cat /tmp/corosync.log
> > > > Apr 13 11:20:31 corosync [MAIN ] Corosync Cluster Engine ('1.2.1'): started and ready to provide service.
> > > > Apr 13 11:20:31 corosync [MAIN ] Corosync built-in features: nss rdma
> > > > Apr 13 11:20:31 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
> > > > Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
> > > > Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> > > > Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
> > > > Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> > > > Apr 13 11:20:31 corosync [TOTEM ] The network interface [172.59.60.3] is now up.
> > > > Apr 13 11:20:31 corosync [pcmk ] info: process_ais_conf: Reading configure
> > > > Apr 13 11:20:31 corosync [pcmk ] info: config_find_init: Local handle: 4730966301143465986 for logging
> > > > Apr 13 11:20:31 corosync [pcmk ] info: config_find_next: Processing additional logging options...
> > > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Found 'off' for option: debug
> > > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'off' for option: to_file
> > > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog
> > > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
> > > > Apr 13 11:20:31 corosync [pcmk ] info: config_find_init: Local handle: 7739444317642555395 for service
> > > > Apr 13 11:20:31 corosync [pcmk ] info: config_find_next: Processing additional service options...
> > > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername
> > > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd
> > > > Apr 13 11:20:31 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd
> > > > Apr 13 11:20:31 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
> > > > Apr 13 11:20:31 corosync [pcmk ] Logging: Initialized pcmk_startup
> > > > Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
> > > > Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Service: 9
> > > > Apr 13 11:20:32 corosync [pcmk ] info: pcmk_startup: Local hostname: mq005.back.int.cwwtf.local
> > > > Apr 13 11:20:32 corosync [pcmk ] info: pcmk_update_nodeid: Local node id: 54279084
> > > > Apr 13 11:20:32 corosync [pcmk ] info: update_member: Creating entry for node 54279084 born on 0
> > > > Apr 13 11:20:32 corosync [pcmk ] info: update_member: 0x5452c00 Node 54279084 now known as mq005.back.int.cwwtf.local (was: (null))
> > > > Apr 13 11:20:32 corosync [pcmk ] info: update_member: Node mq005.back.int.cwwtf.local now has 1 quorum votes (was 0)
> > > > Apr 13 11:20:32 corosync [pcmk ] info: update_member: Node 54279084/mq005.back.int.cwwtf.local is now: member
> > > > Apr 13 11:20:32 corosync [pcmk ] info: spawn_child: Forked child 11873 for process stonithd
> > > > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11874 for process cib
> > > > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11875 for process lrmd
> > > > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11876 for process attrd
> > > > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11877 for process pengine
> > > > Apr 13 11:20:33 corosync [pcmk ] info: spawn_child: Forked child 11878 for process crmd
> > > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.0.8
> > > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service
> > > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync configuration service
> > > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
> > > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01
> > > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync profile loading service
> > > > Apr 13 11:20:33 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> > > > Apr 13 11:20:33 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> > > > Apr 13 11:20:33 corosync [TOTEM ] The network interface [172.23.42.36] is now up.
> > > > Apr 13 11:20:33 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 640: memb=0, new=0, lost=0
> > > > Apr 13 11:20:33 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 640: memb=1, new=1, lost=0
> > > > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_peer_update: NEW: mq005.back.int.cwwtf.local 54279084
> > > > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_peer_update: MEMB: mq005.back.int.cwwtf.local 54279084
> > > > Apr 13 11:20:33 corosync [pcmk ] info: update_member: Node mq005.back.int.cwwtf.local now has process list: 00000000000000000000000000013312 (78610)
> > > > Apr 13 11:20:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> > > > Apr 13 11:20:33 corosync [MAIN ] Completed service synchronization, ready to provide service.
> > > > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545a660 for attrd/11876
> > > > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545b290 for stonithd/11873
> > > > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545d4e0 for cib/11874
> > > > Apr 13 11:20:33 corosync [pcmk ] info: pcmk_ipc: Sending membership update 640 to cib
> > > > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x545e210 for crmd/11878
> > > > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_ipc: Sending membership update 640 to crmd
> > > > Apr 13 11:20:34 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 648: memb=1, new=0, lost=0
> > > > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: memb: mq005.back.int.cwwtf.local 54279084
> > > > Apr 13 11:20:34 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 648: memb=2, new=1, lost=0
> > > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: Creating entry for node 71056300 born on 648
> > > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node 71056300/unknown is now: member
> > > > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: NEW: .pending. 71056300
> > > > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: MEMB: mq005.back.int.cwwtf.local 54279084
> > > > Apr 13 11:20:34 corosync [pcmk ] info: pcmk_peer_update: MEMB: .pending. 71056300
> > > > Apr 13 11:20:34 corosync [pcmk ] info: send_member_notification: Sending membership update 648 to 2 children
> > > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x5452c00 Node 54279084 ((null)) born on: 648
> > > > Apr 13 11:20:34 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> > > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x545dd00 Node 71056300 (mq006.back.int.cwwtf.local) born on: 648
> > > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: 0x545dd00 Node 71056300 now known as mq006.back.int.cwwtf.local (was: (null))
> > > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node mq006.back.int.cwwtf.local now has process list: 00000000000000000000000000013312 (78610)
> > > > Apr 13 11:20:34 corosync [pcmk ] info: update_member: Node mq006.back.int.cwwtf.local now has 1 quorum votes (was 0)
> > > > Apr 13 11:20:34 corosync [pcmk ] info: send_member_notification: Sending membership update 648 to 2 children
> > > > Apr 13 11:20:34 corosync [MAIN ] Completed service synchronization, ready to provide service.
> > > > Apr 13 11:23:34 corosync [TOTEM ] Marking seqid 6843 ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> > > > Apr 13 11:25:15 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> > > > Apr 13 11:28:02 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> > > > Apr 13 11:28:13 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
> > > >
> > > >
> > > > Can anyone help me out with this? Am I doing something wrong or have
> > > > I found a bug?
> > > >
> > > > Cheers,
> > > > Tom
> > >
> > > > _______________________________________________
> > > > Openais mailing list
> > > > Openais at lists.linux-foundation.org
> > > > https://lists.linux-foundation.org/mailman/listinfo/openais
> > >
> > >
> >
> >
> >
>
>