[Openais] Redundant ring not recovering after issuing the command corosync-cfgtool -r

Steven Dake sdake at redhat.com
Tue Apr 13 11:34:43 PDT 2010


On Tue, 2010-04-13 at 19:31 +0100, Tom Pride wrote:
> Just to clarify, when I ifdown eth1 corosync does detect a failure and
> it does mark the ring as faulty.  Are you saying that when I use ifup
> corosync can't work out that the interface is back up and
> communications can resume when I run corosync-cfgtool -r?  Would I
> therefore get a different result if I introduced the failure by
> physically unplugging the cat5 from the server and then physically
> reconnecting the cat5?  What about if I shut down the port on the
> switch it is connected to?
> 

Yes, this is correct.  You should see proper operation if the network
link is lost normally (i.e. the NIC fails, the link fails, the switch
port fails, or the switch fails).
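One way to reproduce that kind of "normal" link loss in testing, without touching the interface itself, is to drop corosync's ring traffic with iptables (the approach suggested below in this thread).  A rough sketch, assuming ring 0 uses mcastport 4010 as in Tom's corosync.conf; the `IPT` variable is only there so the commands can be dry-run without root:

```shell
#!/bin/sh
# Sketch: simulate a real link failure for ring 0 by dropping its
# corosync UDP traffic, instead of ifdown (which triggers the
# 127.0.0.1 rebind).  Port 4010 matches ring 0's mcastport in the
# corosync.conf quoted below; adjust for your configuration.
IPT="${IPT:-iptables}"
PORT="${PORT:-4010}"

block_ring0() {
    # Drop corosync ring-0 traffic in both directions.
    "$IPT" -A INPUT  -p udp --dport "$PORT" -j DROP
    "$IPT" -A OUTPUT -p udp --dport "$PORT" -j DROP
}

unblock_ring0() {
    # Remove the rules, then re-enable the ring cluster-wide.
    "$IPT" -D INPUT  -p udp --dport "$PORT" -j DROP
    "$IPT" -D OUTPUT -p udp --dport "$PORT" -j DROP
    corosync-cfgtool -r
}
```

Run block_ring0, wait for the ring to be marked FAULTY, then unblock_ring0 and the reset should stick, since corosync never saw an interface-down event.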

When an interface is ifdowned, the kernel sends a special event to
corosync, which corosync handles by rebinding to 127.0.0.1.  Pulling a
network cable doesn't cause this same event to occur, so that failure
is detected normally.  The rebind behavior is incompatible with
redundant ring.
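While testing recovery, the `corosync-cfgtool -s` output can be scanned for faulty rings with a small helper.  This is my own sketch, not part of corosync; it only pattern-matches the status text shown elsewhere in this thread:

```shell
#!/bin/sh
# faulty_rings: read `corosync-cfgtool -s` output on stdin and print
# the id of every ring whose status line mentions FAULTY.
faulty_rings() {
    awk '/^RING ID/ { ring = $3 }
         /FAULTY/   { print ring }'
}

# Typical use on a live node:
#   corosync-cfgtool -s | faulty_rings
```

Re-running this ~15 seconds after `corosync-cfgtool -r` shows whether the reset actually held or the ring was re-marked FAULTY.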

Regards
-steve

> On Tue, Apr 13, 2010 at 6:33 PM, Steven Dake <sdake at redhat.com> wrote:
>         On Tue, 2010-04-13 at 17:04 +0100, Tom Pride wrote:
>         > Hi Steve,
>         >
>         > Thanks for the suggestion but that didn't work.  I'm not
>         sure if you
>         > read my entire post or not, but the two redundant rings that
>         I have
>         > configured, both work without a problem until I introduce a
>         fault by
>         > shutting down eth1 on one of the nodes.  This then causes
>         the cluster
>         > to mark ringid 0 as FAULTY.  When I then reactivate eth1 and
>         both
>         > nodes can once again ping each other over the network, I
>         then run
>         > corosync-cfgtool -r which should re-enable the FAULTY
>         redundant ring
>         > within corosync, but it doesn't work.  Corosync refuses to
>         re-enable
>         > the ring even though there is no longer any network fault.
>         >
>         
>         
>         By deactivating eth1, I assume you mean you ifdown eth1.
>         Unfortunately,
>         taking a network interface out of service while using
>         redundant ring
>         doesn't work properly.  To verify that a failure on that
>         interface is
>         detected, I recommend using iptables to block the ports
>         related to
>         corosync.
>         
>         a bit more detail:
>         
>         http://www.corosync.org/doku.php?id=faq:ifdown
>         
>         > I might be mistaken, but isn't the trick of separating the
>         port values
>         > by 2 instead of 1 only for when you are using broadcast
>         instead of the
>         > recommended multicast?  I'm using multicast.
>         >
>         
>         
>         I thought it might make a difference on the local
>         interface port used for UDP messages (the token), but
>         wasn't sure.
>         
>         Regards
>         -steve
>         
>         
>         > Any more suggestions?
>         >
>         > Cheers,
>         > Tom
>         >
>         > On Tue, Apr 13, 2010 at 4:37 PM, Steven Dake
>         <sdake at redhat.com> wrote:
>         >         try separating the port values by 2 instead of 1.
>         >
>         >         Regards
>         >         -steve
>         >
>         >         On Tue, 2010-04-13 at 11:30 +0100, Tom Pride wrote:
>         >         > Hi There,
>         >         >
>         >         > As per the recommendations, the 2 node clusters I
>         have built
>         >         use 2
>         >         > redundant rings for added resilience.  I am
>         >         > currently carrying out
>         >         > some testing on the clusters to ensure that a
>         failure in one
>         >         of the
>         >         > redundant rings can be recovered from.  I am aware
>         of the
>         >         fact that
>         >         > corosync does not currently have a feature which
>         monitors
>         >         failed rings
>         >         > to bring them back up automatically when
>         communications are
>         >         repaired.
>         >         > All I have been doing is testing to see that the
>         >         corosync-cfgtool -r
>         >         > command will do as it says on the tin and "Reset
>         redundant
>         >         ring state
>         >         > cluster wide after a fault, to re-enable redundant
>         ring
>         >         operation."
>         >         >
>         >         > In my 2 node cluster I have been issuing the
>         ifdown command
>         >         on eth1 on
>         >         > node1.  This results in corosync-cfgtool -s
>         reporting the
>         >         following:
>         >         >
>         >         > root at mq006:~# corosync-cfgtool -s
>         >         > Printing ring status.
>         >         > Local node ID 71056300
>         >         > RING ID 0
>         >         >     id    = 172.59.60.4
>         >         >     status    = Marking seqid 8574 ringid 0
>         interface
>         >         172.59.60.4
>         >         > FAULTY - adminisrtative intervention required.
>         >         > RING ID 1
>         >         >     id    = 172.23.42.37
>         >         >     status    = ring 1 active with no faults
>         >         >
>         >         > I then issue ifup eth1 on node1 and ensure that I
>         can now
>         >         ping node2.
>         >         > The link is definitely up, so I then issue the
>         command
>         >         > corosync-cfgtool -r.  I then run corosync-cfgtool
>         -s again
>         >         and it
>         >         > reports:
>         >         >
>         >         > root at mq006:~# corosync-cfgtool -s
>         >         > Printing ring status.
>         >         > Local node ID 71056300
>         >         > RING ID 0
>         >         >     id    = 172.59.60.4
>         >         >     status    = ring 0 active with no faults
>         >         > RING ID 1
>         >         >     id    = 172.23.42.37
>         >         >     status    = ring 1 active with no faults
>         >         >
>         >         > So things are looking good at this point, but if I
>         wait 10
>         >         more
>         >         > seconds and run corosync-cfgtool -s again, it
>         reports that
>         >         ring_id 0
>         >         > is FAULTY again:
>         >         >
>         >         > root at mq006:~# corosync-cfgtool -s
>         >         > Printing ring status.
>         >         > Local node ID 71056300
>         >         > RING ID 0
>         >         >     id    = 172.59.60.4
>         >         >     status    = Marking seqid 8574 ringid 0
>         interface
>         >         172.59.60.4
>         >         > FAULTY - adminisrtative intervention required.
>         >         > RING ID 1
>         >         >     id    = 172.23.42.37
>         >         >     status    = ring 1 active with no faults
>         >         >
>         >         > It does not matter how many times I run
>         corosync-cfgtool -r,
>         >         ring_id 0
>         >         > will report it as being FAULTY 10 seconds after
>         issuing the
>         >         reset.  I
>         >         > have tried running /etc/init.d/network restart on
>         node1 in
>         >         the hope
>         >         > that a full network stop and start makes a
>         difference, but
>         >         it doesn't.
>         >         > The only thing that will fix this situation is if
>         I
>         >         completely stop
>         >         > and restart the corosync cluster stack on both
>         nodes
>         >         > (/etc/init.d/corosync stop
>         and /etc/init.d/corosync start).
>         >          Once I've
>         >         > done that both rings stay up and are stable.  This
>         is
>         >         obviously not
>         >         > what we want.
>         >         >
>         >         > I am running the latest RHEL rpms from here:
>         >         > http://www.clusterlabs.org/rpm/epel-5/x86_64/
>         >         >
>         >         > corosync-1.2.1-1.el5
>         >         > corosynclib-1.2.1-1.el5
>         >         > pacemaker-1.0.8-4.el5
>         >         > pacemaker-libs-1.0.8-4.el5
>         >         >
>         >         > My corosync.conf looks like this:
>         >         > compatibility: whitetank
>         >         >
>         >         > totem {
>         >         >     version: 2
>         >         >     secauth: off
>         >         >     threads: 0
>         >         >     consensus: 1201
>         >         >     rrp_mode: passive
>         >         >     interface {
>         >         >                 ringnumber: 0
>         >         >                 bindnetaddr: 172.59.60.0
>         >         >                 mcastaddr: 226.94.1.1
>         >         >                 mcastport: 4010
>         >         >     }
>         >         >         interface {
>         >         >                 ringnumber: 1
>         >         >                 bindnetaddr: 172.23.40.0
>         >         >                 mcastaddr: 226.94.2.1
>         >         >                 mcastport: 4011
>         >         >         }
>         >         > }
>         >         >
>         >         > logging {
>         >         >     fileline: off
>         >         >     to_stderr: yes
>         >         >     to_logfile: yes
>         >         >     to_syslog: yes
>         >         >     logfile: /tmp/corosync.log
>         >         >     debug: off
>         >         >     timestamp: on
>         >         >     logger_subsys {
>         >         >         subsys: AMF
>         >         >         debug: off
>         >         >     }
>         >         > }
>         >         >
>         >         > amf {
>         >         >     mode: disabled
>         >         > }
>         >         >
>         >         > service {
>         >         >        # Load the Pacemaker Cluster Resource
>         Manager
>         >         >        name: pacemaker
>         >         >        ver: 0
>         >         > }
>         >         >
>         >         > aisexec {
>         >         >        user:   root
>         >         >        group:  root
>         >         > }
>         >         >
>         >         >
>         >         > This is what gets written into /tmp/corosync.log
>         when I
>         >         carry out the
>         >         > link failure test and then try and reset the ring
>         status:
>         >         > root at mq005:~/activemq_rpms# cat /tmp/corosync.log
>         >         > Apr 13 11:20:31 corosync [MAIN  ] Corosync Cluster
>         Engine
>         >         ('1.2.1'):
>         >         > started and ready to provide service.
>         >         > Apr 13 11:20:31 corosync [MAIN  ] Corosync
>         built-in
>         >         features: nss rdma
>         >         > Apr 13 11:20:31 corosync [MAIN  ] Successfully
>         read main
>         >         configuration
>         >         > file '/etc/corosync/corosync.conf'.
>         >         > Apr 13 11:20:31 corosync [TOTEM ] Initializing
>         transport
>         >         (UDP/IP).
>         >         > Apr 13 11:20:31 corosync [TOTEM ] Initializing
>         >         transmit/receive
>         >         > security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>         >         > Apr 13 11:20:31 corosync [TOTEM ] Initializing
>         transport
>         >         (UDP/IP).
>         >         > Apr 13 11:20:31 corosync [TOTEM ] Initializing
>         >         transmit/receive
>         >         > security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>         >         > Apr 13 11:20:31 corosync [TOTEM ] The network
>         interface
>         >         [172.59.60.3]
>         >         > is now up.
>         >         > Apr 13 11:20:31 corosync [pcmk  ] info:
>         process_ais_conf:
>         >         Reading
>         >         > configure
>         >         > Apr 13 11:20:31 corosync [pcmk  ] info:
>         config_find_init:
>         >         Local
>         >         > handle: 4730966301143465986 for logging
>         >         > Apr 13 11:20:31 corosync [pcmk  ] info:
>         config_find_next:
>         >         Processing
>         >         > additional logging options...
>         >         > Apr 13 11:20:31 corosync [pcmk  ] info:
>         get_config_opt:
>         >         Found 'off'
>         >         > for option: debug
>         >         > Apr 13 11:20:31 corosync [pcmk  ] info:
>         get_config_opt:
>         >         Defaulting to
>         >         > 'off' for option: to_file
>         >         > Apr 13 11:20:31 corosync [pcmk  ] info:
>         get_config_opt:
>         >         Found 'yes'
>         >         > for option: to_syslog
>         >         > Apr 13 11:20:31 corosync [pcmk  ] info:
>         get_config_opt:
>         >         Defaulting to
>         >         > 'daemon' for option: syslog_facility
>         >         > Apr 13 11:20:31 corosync [pcmk  ] info:
>         config_find_init:
>         >         Local
>         >         > handle: 7739444317642555395 for service
>         >         > Apr 13 11:20:31 corosync [pcmk  ] info:
>         config_find_next:
>         >         Processing
>         >         > additional service options...
>         >         > Apr 13 11:20:31 corosync [pcmk  ] info:
>         get_config_opt:
>         >         Defaulting to
>         >         > 'pcmk' for option: clustername
>         >         > Apr 13 11:20:31 corosync [pcmk  ] info:
>         get_config_opt:
>         >         Defaulting to
>         >         > 'no' for option: use_logd
>         >         > Apr 13 11:20:31 corosync [pcmk  ] info:
>         get_config_opt:
>         >         Defaulting to
>         >         > 'no' for option: use_mgmtd
>         >         > Apr 13 11:20:31 corosync [pcmk  ] info:
>         pcmk_startup: CRM:
>         >         Initialized
>         >         > Apr 13 11:20:31 corosync [pcmk  ] Logging:
>         Initialized
>         >         pcmk_startup
>         >         > Apr 13 11:20:32 corosync [pcmk  ] info:
>         pcmk_startup:
>         >         Maximum core
>         >         > file size is: 18446744073709551615
>         >         > Apr 13 11:20:32 corosync [pcmk  ] info:
>         pcmk_startup:
>         >         Service: 9
>         >         > Apr 13 11:20:32 corosync [pcmk  ] info:
>         pcmk_startup: Local
>         >         hostname:
>         >         > mq005.back.int.cwwtf.local
>         >         > Apr 13 11:20:32 corosync [pcmk  ] info:
>         pcmk_update_nodeid:
>         >         Local node
>         >         > id: 54279084
>         >         > Apr 13 11:20:32 corosync [pcmk  ] info:
>         update_member:
>         >         Creating entry
>         >         > for node 54279084 born on 0
>         >         > Apr 13 11:20:32 corosync [pcmk  ] info:
>         update_member:
>         >         0x5452c00 Node
>         >         > 54279084 now known as mq005.back.int.cwwtf.local
>         (was:
>         >         (null))
>         >         > Apr 13 11:20:32 corosync [pcmk  ] info:
>         update_member: Node
>         >         > mq005.back.int.cwwtf.local now has 1 quorum votes
>         (was 0)
>         >         > Apr 13 11:20:32 corosync [pcmk  ] info:
>         update_member: Node
>         >         > 54279084/mq005.back.int.cwwtf.local is now: member
>         >         > Apr 13 11:20:32 corosync [pcmk  ] info:
>         spawn_child: Forked
>         >         child
>         >         > 11873 for process stonithd
>         >         > Apr 13 11:20:33 corosync [pcmk  ] info:
>         spawn_child: Forked
>         >         child
>         >         > 11874 for process cib
>         >         > Apr 13 11:20:33 corosync [pcmk  ] info:
>         spawn_child: Forked
>         >         child
>         >         > 11875 for process lrmd
>         >         > Apr 13 11:20:33 corosync [pcmk  ] info:
>         spawn_child: Forked
>         >         child
>         >         > 11876 for process attrd
>         >         > Apr 13 11:20:33 corosync [pcmk  ] info:
>         spawn_child: Forked
>         >         child
>         >         > 11877 for process pengine
>         >         > Apr 13 11:20:33 corosync [pcmk  ] info:
>         spawn_child: Forked
>         >         child
>         >         > 11878 for process crmd
>         >         > Apr 13 11:20:33 corosync [SERV  ] Service engine
>         loaded:
>         >         Pacemaker
>         >         > Cluster Manager 1.0.8
>         >         > Apr 13 11:20:33 corosync [SERV  ] Service engine
>         loaded:
>         >         corosync
>         >         > extended virtual synchrony service
>         >         > Apr 13 11:20:33 corosync [SERV  ] Service engine
>         loaded:
>         >         corosync
>         >         > configuration service
>         >         > Apr 13 11:20:33 corosync [SERV  ] Service engine
>         loaded:
>         >         corosync
>         >         > cluster closed process group service v1.01
>         >         > Apr 13 11:20:33 corosync [SERV  ] Service engine
>         loaded:
>         >         corosync
>         >         > cluster config database access v1.01
>         >         > Apr 13 11:20:33 corosync [SERV  ] Service engine
>         loaded:
>         >         corosync
>         >         > profile loading service
>         >         > Apr 13 11:20:33 corosync [SERV  ] Service engine
>         loaded:
>         >         corosync
>         >         > cluster quorum service v0.1
>         >         > Apr 13 11:20:33 corosync [MAIN  ] Compatibility
>         mode set to
>         >         whitetank.
>         >         > Using V1 and V2 of the synchronization engine.
>         >         > Apr 13 11:20:33 corosync [TOTEM ] The network
>         interface
>         >         [172.23.42.36]
>         >         > is now up.
>         >         > Apr 13 11:20:33 corosync [pcmk  ] notice:
>         pcmk_peer_update:
>         >         > Transitional membership event on ring 640: memb=0,
>         new=0,
>         >         lost=0
>         >         > Apr 13 11:20:33 corosync [pcmk  ] notice:
>         pcmk_peer_update:
>         >         Stable
>         >         > membership event on ring 640: memb=1, new=1,
>         lost=0
>         >         > Apr 13 11:20:33 corosync [pcmk  ] info:
>         pcmk_peer_update:
>         >         NEW:
>         >         > mq005.back.int.cwwtf.local 54279084
>         >         > Apr 13 11:20:33 corosync [pcmk  ] info:
>         pcmk_peer_update:
>         >         MEMB:
>         >         > mq005.back.int.cwwtf.local 54279084
>         >         > Apr 13 11:20:33 corosync [pcmk  ] info:
>         update_member: Node
>         >         > mq005.back.int.cwwtf.local now has process list:
>         >         > 00000000000000000000000000013312 (78610)
>         >         > Apr 13 11:20:33 corosync [TOTEM ] A processor
>         joined or left
>         >         the
>         >         > membership and a new membership was formed.
>         >         > Apr 13 11:20:33 corosync [MAIN  ] Completed
>         service
>         >         synchronization,
>         >         > ready to provide service.
>         >         > Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc:
>         Recorded
>         >         connection
>         >         > 0x545a660 for attrd/11876
>         >         > Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc:
>         Recorded
>         >         connection
>         >         > 0x545b290 for stonithd/11873
>         >         > Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc:
>         Recorded
>         >         connection
>         >         > 0x545d4e0 for cib/11874
>         >         > Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc:
>         Sending
>         >         membership
>         >         > update 640 to cib
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_ipc:
>         Recorded
>         >         connection
>         >         > 0x545e210 for crmd/11878
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_ipc:
>         Sending
>         >         membership
>         >         > update 640 to crmd
>         >         > Apr 13 11:20:34 corosync [pcmk  ] notice:
>         pcmk_peer_update:
>         >         > Transitional membership event on ring 648: memb=1,
>         new=0,
>         >         lost=0
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info:
>         pcmk_peer_update:
>         >         memb:
>         >         > mq005.back.int.cwwtf.local 54279084
>         >         > Apr 13 11:20:34 corosync [pcmk  ] notice:
>         pcmk_peer_update:
>         >         Stable
>         >         > membership event on ring 648: memb=2, new=1,
>         lost=0
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info:
>         update_member:
>         >         Creating entry
>         >         > for node 71056300 born on 648
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info:
>         update_member: Node
>         >         > 71056300/unknown is now: member
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info:
>         pcmk_peer_update:
>         >         > NEW:  .pending. 71056300
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info:
>         pcmk_peer_update:
>         >         MEMB:
>         >         > mq005.back.int.cwwtf.local 54279084
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info:
>         pcmk_peer_update:
>         >         > MEMB: .pending. 71056300
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info:
>         >         send_member_notification:
>         >         > Sending membership update 648 to 2 children
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info:
>         update_member:
>         >         0x5452c00 Node
>         >         > 54279084 ((null)) born on: 648
>         >         > Apr 13 11:20:34 corosync [TOTEM ] A processor
>         joined or left
>         >         the
>         >         > membership and a new membership was formed.
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info:
>         update_member:
>         >         0x545dd00 Node
>         >         > 71056300 (mq006.back.int.cwwtf.local) born on: 648
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info:
>         update_member:
>         >         0x545dd00 Node
>         >         > 71056300 now known as mq006.back.int.cwwtf.local
>         (was:
>         >         (null))
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info:
>         update_member: Node
>         >         > mq006.back.int.cwwtf.local now has process list:
>         >         > 00000000000000000000000000013312 (78610)
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info:
>         update_member: Node
>         >         > mq006.back.int.cwwtf.local now has 1 quorum votes
>         (was 0)
>         >         > Apr 13 11:20:34 corosync [pcmk  ] info:
>         >         send_member_notification:
>         >         > Sending membership update 648 to 2 children
>         >         > Apr 13 11:20:34 corosync [MAIN  ] Completed
>         service
>         >         synchronization,
>         >         > ready to provide service.
>         >         > Apr 13 11:23:34 corosync [TOTEM ] Marking seqid
>         6843 ringid
>         >         0
>         >         > interface 172.59.60.3 FAULTY - adminisrtative
>         intervention
>         >         required.
>         >         > Apr 13 11:25:15 corosync [TOTEM ] Marking ringid 0
>         interface
>         >         > 172.59.60.3 FAULTY - adminisrtative intervention
>         required.
>         >         > Apr 13 11:28:02 corosync [TOTEM ] Marking ringid 0
>         interface
>         >         > 172.59.60.3 FAULTY - adminisrtative intervention
>         required.
>         >         > Apr 13 11:28:13 corosync [TOTEM ] Marking ringid 0
>         interface
>         >         > 172.59.60.3 FAULTY - adminisrtative intervention
>         required.
>         >         >
>         >         >
>         >         > Can anyone help me out with this?  Am I doing
>         something
>         >         wrong or have
>         >         > I found a bug?
>         >         >
>         >         > Cheers,
>         >         > Tom
>         >
>         >         > _______________________________________________
>         >         > Openais mailing list
>         >         > Openais at lists.linux-foundation.org
>         >         >
>         https://lists.linux-foundation.org/mailman/listinfo/openais
>         >
>         >
>         
>         
> 


