[Openais] Redundant ring not recovering after issuing the command corosync-cfgtool -r

Steven Dake sdake at redhat.com
Tue Apr 13 10:33:40 PDT 2010


On Tue, 2010-04-13 at 17:04 +0100, Tom Pride wrote:
> Hi Steve,
> 
> Thanks for the suggestion, but that didn't work.  I'm not sure if you
> read my entire post or not, but the two redundant rings that I have
> configured both work without a problem until I introduce a fault by
> shutting down eth1 on one of the nodes.  This causes the cluster to
> mark ringid 0 as FAULTY.  When I then reactivate eth1 and both nodes
> can once again ping each other over the network, I run
> corosync-cfgtool -r, which should re-enable the FAULTY redundant ring
> within corosync, but it doesn't work.  Corosync refuses to re-enable
> the ring even though there is no longer any network fault.
> 
> 

By deactivating eth1, I assume you mean you ifdown eth1.  Unfortunately,
taking a network interface out of service while using redundant rings
doesn't work properly.  To verify that a failure on that interface is
detected, I recommend using iptables to block the ports related to
corosync.

A bit more detail:

http://www.corosync.org/doku.php?id=faq:ifdown
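
For example, something along these lines exercises the failure path
without touching the interface itself (a rough, untested sketch; the
port numbers assume the mcastport values in the corosync.conf quoted
below, plus the mcastport - 1 port corosync allocates on each
interface):

  # block ring 0 traffic on its interface (eth1 here) instead of ifdown
  iptables -A INPUT  -i eth1 -p udp --dport 4009:4010 -j DROP
  iptables -A OUTPUT -o eth1 -p udp --dport 4009:4010 -j DROP

  # wait for corosync to mark ring 0 FAULTY, then remove the rules
  iptables -D INPUT  -i eth1 -p udp --dport 4009:4010 -j DROP
  iptables -D OUTPUT -o eth1 -p udp --dport 4009:4010 -j DROP

  # and re-enable the ring cluster wide
  corosync-cfgtool -r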

> I might be mistaken, but isn't the trick of separating the port values
> by 2 instead of 1 only for when you are using broadcast instead of the
> recommended multicast?  I'm using multicast.
> 

I thought it might make a difference to the local interface port used
for UDP messages (the token), but I wasn't sure.
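
If you did want to try it anyway, the change would just be keeping the
two mcastport values at least 2 apart, e.g. (sketch only, adapted from
the interface sections in your corosync.conf; each interface also uses
mcastport - 1):

  interface {
          ringnumber: 0
          bindnetaddr: 172.59.60.0
          mcastaddr: 226.94.1.1
          mcastport: 4010    # ring 0 uses 4010 and 4009
  }
  interface {
          ringnumber: 1
          bindnetaddr: 172.23.40.0
          mcastaddr: 226.94.2.1
          mcastport: 4012    # ring 1 uses 4012 and 4011, no overlap
  }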

Regards
-steve

> Any more suggestions?
> 
> Cheers,
> Tom
> 
> On Tue, Apr 13, 2010 at 4:37 PM, Steven Dake <sdake at redhat.com> wrote:
>         Try separating the port values by 2 instead of 1.
>         
>         Regards
>         -steve
>         
>         On Tue, 2010-04-13 at 11:30 +0100, Tom Pride wrote:
>         > Hi There,
>         >
>         > As per the recommendations, the 2 node clusters I have built use
>         > 2 redundant rings for added resilience.  I have been carrying out
>         > some testing on the clusters to ensure that a failure in one of
>         > the redundant rings can be recovered from.  I am aware of the fact
>         > that corosync does not currently have a feature which monitors
>         > failed rings to bring them back up automatically when
>         > communications are repaired.  All I have been doing is testing to
>         > see that the corosync-cfgtool -r command will do as it says on the
>         > tin and "Reset redundant ring state cluster wide after a fault, to
>         > re-enable redundant ring operation."
>         >
>         > In my 2 node cluster I have been issuing the ifdown command on
>         > eth1 on node1.  This results in corosync-cfgtool -s reporting the
>         > following:
>         >
>         > root at mq006:~# corosync-cfgtool -s
>         > Printing ring status.
>         > Local node ID 71056300
>         > RING ID 0
>         >     id    = 172.59.60.4
>         >     status    = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY - adminisrtative intervention required.
>         > RING ID 1
>         >     id    = 172.23.42.37
>         >     status    = ring 1 active with no faults
>         >
>         > I then issue ifup eth1 on node1 and ensure that I can now ping
>         > node2.  The link is definitely up, so I then issue the command
>         > corosync-cfgtool -r.  I then run corosync-cfgtool -s again and it
>         > reports:
>         >
>         > root at mq006:~# corosync-cfgtool -s
>         > Printing ring status.
>         > Local node ID 71056300
>         > RING ID 0
>         >     id    = 172.59.60.4
>         >     status    = ring 0 active with no faults
>         > RING ID 1
>         >     id    = 172.23.42.37
>         >     status    = ring 1 active with no faults
>         >
>         > So things are looking good at this point, but if I wait 10 more
>         > seconds and run corosync-cfgtool -s again, it reports that ring_id
>         > 0 is FAULTY again:
>         >
>         > root at mq006:~# corosync-cfgtool -s
>         > Printing ring status.
>         > Local node ID 71056300
>         > RING ID 0
>         >     id    = 172.59.60.4
>         >     status    = Marking seqid 8574 ringid 0 interface 172.59.60.4 FAULTY - adminisrtative intervention required.
>         > RING ID 1
>         >     id    = 172.23.42.37
>         >     status    = ring 1 active with no faults
>         >
>         > It does not matter how many times I run corosync-cfgtool -r,
>         > ring_id 0 will report it as being FAULTY 10 seconds after issuing
>         > the reset.  I have tried running /etc/init.d/network restart on
>         > node1 in the hope that a full network stop and start makes a
>         > difference, but it doesn't.  The only thing that will fix this
>         > situation is if I completely stop and restart the corosync cluster
>         > stack on both nodes (/etc/init.d/corosync stop and
>         > /etc/init.d/corosync start).  Once I've done that both rings stay
>         > up and are stable.  This is obviously not what we want.
>         >
>         > I am running the latest RHEL rpms from here:
>         > http://www.clusterlabs.org/rpm/epel-5/x86_64/
>         >
>         > corosync-1.2.1-1.el5
>         > corosynclib-1.2.1-1.el5
>         > pacemaker-1.0.8-4.el5
>         > pacemaker-libs-1.0.8-4.el5
>         >
>         > My corosync.conf looks like this:
>         > compatibility: whitetank
>         >
>         > totem {
>         >     version: 2
>         >     secauth: off
>         >     threads: 0
>         >     consensus: 1201
>         >     rrp_mode: passive
>         >     interface {
>         >                 ringnumber: 0
>         >                 bindnetaddr: 172.59.60.0
>         >                 mcastaddr: 226.94.1.1
>         >                 mcastport: 4010
>         >     }
>         >         interface {
>         >                 ringnumber: 1
>         >                 bindnetaddr: 172.23.40.0
>         >                 mcastaddr: 226.94.2.1
>         >                 mcastport: 4011
>         >         }
>         > }
>         >
>         > logging {
>         >     fileline: off
>         >     to_stderr: yes
>         >     to_logfile: yes
>         >     to_syslog: yes
>         >     logfile: /tmp/corosync.log
>         >     debug: off
>         >     timestamp: on
>         >     logger_subsys {
>         >         subsys: AMF
>         >         debug: off
>         >     }
>         > }
>         >
>         > amf {
>         >     mode: disabled
>         > }
>         >
>         > service {
>         >        # Load the Pacemaker Cluster Resource Manager
>         >        name: pacemaker
>         >        ver: 0
>         > }
>         >
>         > aisexec {
>         >        user:   root
>         >        group:  root
>         > }
>         >
>         >
>         > This is what gets written into /tmp/corosync.log when I carry out
>         > the link failure test and then try and reset the ring status:
>         > root at mq005:~/activemq_rpms# cat /tmp/corosync.log
>         > Apr 13 11:20:31 corosync [MAIN  ] Corosync Cluster Engine ('1.2.1'): started and ready to provide service.
>         > Apr 13 11:20:31 corosync [MAIN  ] Corosync built-in features: nss rdma
>         > Apr 13 11:20:31 corosync [MAIN  ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
>         > Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
>         > Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>         > Apr 13 11:20:31 corosync [TOTEM ] Initializing transport (UDP/IP).
>         > Apr 13 11:20:31 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>         > Apr 13 11:20:31 corosync [TOTEM ] The network interface [172.59.60.3] is now up.
>         > Apr 13 11:20:31 corosync [pcmk  ] info: process_ais_conf: Reading configure
>         > Apr 13 11:20:31 corosync [pcmk  ] info: config_find_init: Local handle: 4730966301143465986 for logging
>         > Apr 13 11:20:31 corosync [pcmk  ] info: config_find_next: Processing additional logging options...
>         > Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Found 'off' for option: debug
>         > Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to 'off' for option: to_file
>         > Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Found 'yes' for option: to_syslog
>         > Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
>         > Apr 13 11:20:31 corosync [pcmk  ] info: config_find_init: Local handle: 7739444317642555395 for service
>         > Apr 13 11:20:31 corosync [pcmk  ] info: config_find_next: Processing additional service options...
>         > Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername
>         > Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to 'no' for option: use_logd
>         > Apr 13 11:20:31 corosync [pcmk  ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd
>         > Apr 13 11:20:31 corosync [pcmk  ] info: pcmk_startup: CRM: Initialized
>         > Apr 13 11:20:31 corosync [pcmk  ] Logging: Initialized pcmk_startup
>         > Apr 13 11:20:32 corosync [pcmk  ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
>         > Apr 13 11:20:32 corosync [pcmk  ] info: pcmk_startup: Service: 9
>         > Apr 13 11:20:32 corosync [pcmk  ] info: pcmk_startup: Local hostname: mq005.back.int.cwwtf.local
>         > Apr 13 11:20:32 corosync [pcmk  ] info: pcmk_update_nodeid: Local node id: 54279084
>         > Apr 13 11:20:32 corosync [pcmk  ] info: update_member: Creating entry for node 54279084 born on 0
>         > Apr 13 11:20:32 corosync [pcmk  ] info: update_member: 0x5452c00 Node 54279084 now known as mq005.back.int.cwwtf.local (was: (null))
>         > Apr 13 11:20:32 corosync [pcmk  ] info: update_member: Node mq005.back.int.cwwtf.local now has 1 quorum votes (was 0)
>         > Apr 13 11:20:32 corosync [pcmk  ] info: update_member: Node 54279084/mq005.back.int.cwwtf.local is now: member
>         > Apr 13 11:20:32 corosync [pcmk  ] info: spawn_child: Forked child 11873 for process stonithd
>         > Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child 11874 for process cib
>         > Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child 11875 for process lrmd
>         > Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child 11876 for process attrd
>         > Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child 11877 for process pengine
>         > Apr 13 11:20:33 corosync [pcmk  ] info: spawn_child: Forked child 11878 for process crmd
>         > Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: Pacemaker Cluster Manager 1.0.8
>         > Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync extended virtual synchrony service
>         > Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync configuration service
>         > Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
>         > Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync cluster config database access v1.01
>         > Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync profile loading service
>         > Apr 13 11:20:33 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
>         > Apr 13 11:20:33 corosync [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
>         > Apr 13 11:20:33 corosync [TOTEM ] The network interface [172.23.42.36] is now up.
>         > Apr 13 11:20:33 corosync [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 640: memb=0, new=0, lost=0
>         > Apr 13 11:20:33 corosync [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 640: memb=1, new=1, lost=0
>         > Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_peer_update: NEW:  mq005.back.int.cwwtf.local 54279084
>         > Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_peer_update: MEMB: mq005.back.int.cwwtf.local 54279084
>         > Apr 13 11:20:33 corosync [pcmk  ] info: update_member: Node mq005.back.int.cwwtf.local now has process list: 00000000000000000000000000013312 (78610)
>         > Apr 13 11:20:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
>         > Apr 13 11:20:33 corosync [MAIN  ] Completed service synchronization, ready to provide service.
>         > Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc: Recorded connection 0x545a660 for attrd/11876
>         > Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc: Recorded connection 0x545b290 for stonithd/11873
>         > Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc: Recorded connection 0x545d4e0 for cib/11874
>         > Apr 13 11:20:33 corosync [pcmk  ] info: pcmk_ipc: Sending membership update 640 to cib
>         > Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_ipc: Recorded connection 0x545e210 for crmd/11878
>         > Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_ipc: Sending membership update 640 to crmd
>         > Apr 13 11:20:34 corosync [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 648: memb=1, new=0, lost=0
>         > Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_peer_update: memb: mq005.back.int.cwwtf.local 54279084
>         > Apr 13 11:20:34 corosync [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 648: memb=2, new=1, lost=0
>         > Apr 13 11:20:34 corosync [pcmk  ] info: update_member: Creating entry for node 71056300 born on 648
>         > Apr 13 11:20:34 corosync [pcmk  ] info: update_member: Node 71056300/unknown is now: member
>         > Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_peer_update: NEW:  .pending. 71056300
>         > Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_peer_update: MEMB: mq005.back.int.cwwtf.local 54279084
>         > Apr 13 11:20:34 corosync [pcmk  ] info: pcmk_peer_update: MEMB: .pending. 71056300
>         > Apr 13 11:20:34 corosync [pcmk  ] info: send_member_notification: Sending membership update 648 to 2 children
>         > Apr 13 11:20:34 corosync [pcmk  ] info: update_member: 0x5452c00 Node 54279084 ((null)) born on: 648
>         > Apr 13 11:20:34 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
>         > Apr 13 11:20:34 corosync [pcmk  ] info: update_member: 0x545dd00 Node 71056300 (mq006.back.int.cwwtf.local) born on: 648
>         > Apr 13 11:20:34 corosync [pcmk  ] info: update_member: 0x545dd00 Node 71056300 now known as mq006.back.int.cwwtf.local (was: (null))
>         > Apr 13 11:20:34 corosync [pcmk  ] info: update_member: Node mq006.back.int.cwwtf.local now has process list: 00000000000000000000000000013312 (78610)
>         > Apr 13 11:20:34 corosync [pcmk  ] info: update_member: Node mq006.back.int.cwwtf.local now has 1 quorum votes (was 0)
>         > Apr 13 11:20:34 corosync [pcmk  ] info: send_member_notification: Sending membership update 648 to 2 children
>         > Apr 13 11:20:34 corosync [MAIN  ] Completed service synchronization, ready to provide service.
>         > Apr 13 11:23:34 corosync [TOTEM ] Marking seqid 6843 ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
>         > Apr 13 11:25:15 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
>         > Apr 13 11:28:02 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
>         > Apr 13 11:28:13 corosync [TOTEM ] Marking ringid 0 interface 172.59.60.3 FAULTY - adminisrtative intervention required.
>         >
>         >
>         > Can anyone help me out with this?  Am I doing something wrong or
>         > have I found a bug?
>         >
>         > Cheers,
>         > Tom
>         
>         > _______________________________________________
>         > Openais mailing list
>         > Openais at lists.linux-foundation.org
>         > https://lists.linux-foundation.org/mailman/listinfo/openais
>         
> 


