[Openais] corosync ring marked FAULTY - administrative intervention required

Fri Apr 9 04:45:08 PDT 2010

Hi,

I see this issue on every cluster I have, not just this one, so it could be a common misconfiguration on my part.
I am using the latest version of corosync:

corosync-1.2.1-1.el5

Here is my config:

compatibility: none

aisexec {
        user:   root
        group:  root
}

service {
        name: pacemaker
        ver:  0
}

totem {
        version: 2
        token: 5000
        token_retransmits_before_loss_const: 20
        join: 1000
        consensus: 7500
        vsftype: none
        max_messages: 20
        secauth: off
        threads: 0
        clear_node_high_bit: yes
        rrp_mode: passive
        interface {
                ringnumber: 0
                broadcast: yes
                bindnetaddr: 10.0.0.0
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                broadcast: yes
                bindnetaddr: 207.207.163.0
                mcastport: 5406
        }
}

logging {
        fileline: off
        to_stderr: no
        to_syslog: yes
        debug: on
        timestamp: on
}

amf {
        mode: disabled
}

[root@xen-11 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:30:48:62:4E:DC  
          inet addr:207.207.163.11  Bcast:207.207.163.255  Mask:255.255.255.0
          inet6 addr: fe80::230:48ff:fe62:4edc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2009418 errors:0 dropped:0 overruns:0 frame:0
          TX packets:799835 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1428434820 (1.3 GiB)  TX bytes:664164837 (633.3 MiB)

eth1      Link encap:Ethernet  HWaddr 00:30:48:62:4E:DD  
          inet addr:10.0.0.1  Bcast:10.0.0.3  Mask:255.255.255.252
          inet6 addr: fe80::230:48ff:fe62:4edd/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4233811 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14118095 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:518593446 (494.5 MiB)  TX bytes:14199338528 (13.2 GiB)
          Memory:d8060000-d8080000 

[root@xen-12 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:30:48:62:4C:CA  
          inet addr:207.207.163.12  Bcast:207.207.163.255  Mask:255.255.255.0
          inet6 addr: fe80::230:48ff:fe62:4cca/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1210002 errors:0 dropped:0 overruns:0 frame:0
          TX packets:473204 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:698444593 (666.0 MiB)  TX bytes:1145344594 (1.0 GiB)

eth1      Link encap:Ethernet  HWaddr 00:30:48:62:4C:CB  
          inet addr:10.0.0.2  Bcast:10.0.0.3  Mask:255.255.255.252
          inet6 addr: fe80::230:48ff:fe62:4ccb/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:13776771 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4008079 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:14138136203 (13.1 GiB)  TX bytes:493569061 (470.7 MiB)
          Memory:d8060000-d8080000 

eth1 is a cross-over connection.
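
As a sanity check on the bindnetaddr values in the totem section, the network address for each interface can be derived from the address and netmask shown in the ifconfig output above. This is a minimal Python sketch, not part of corosync; the function name `network_address` and the dictionary layout are illustrative assumptions, and only the addresses, masks, and bindnetaddr values come from this post.

```python
import ipaddress


def network_address(addr: str, netmask: str) -> str:
    """Return the network address for an interface address and netmask."""
    iface = ipaddress.ip_interface(f"{addr}/{netmask}")
    return str(iface.network.network_address)


# (address, netmask, configured bindnetaddr), copied from the xen-11
# ifconfig output and the totem interface sections above.
interfaces = {
    "ring 0 (eth1)": ("10.0.0.1", "255.255.255.252", "10.0.0.0"),
    "ring 1 (eth0)": ("207.207.163.11", "255.255.255.0", "207.207.163.0"),
}

for name, (addr, mask, bindnetaddr) in interfaces.items():
    computed = network_address(addr, mask)
    verdict = "matches" if computed == bindnetaddr else "differs from"
    print(f"{name}: computed {computed} {verdict} bindnetaddr {bindnetaddr}")
```

Both computed network addresses agree with the configured bindnetaddr values, so this check alone does not explain the FAULTY ring.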

I don't see much detail in the message log; I probably need to increase the debug level.

[root@xen-12 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 33554442
RING ID 0
	id	= 10.0.0.2
	status	= ring 0 active with no faults
RING ID 1
	id	= 207.207.163.12
	status	= Marking seqid 6594 ringid 1 interface 207.207.163.12 FAULTY - adminisrtative intervention required.

I can reset it just fine:

[root@xen-12 ~]# corosync-cfgtool -r
Re-enabling all failed rings.
[root@xen-12 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 33554442
RING ID 0
	id	= 10.0.0.2
	status	= ring 0 active with no faults
RING ID 1
	id	= 207.207.163.12
	status	= ring 1 active with no faults

But it goes into FAULTY mode almost right away:

Apr  9 11:40:56 xen-12 corosync[13835]:   [TOTEM ] Marking seqid 18340 ringid 1 interface 207.207.163.12 FAULTY - adminisrtative intervention required.

That is the only message from corosync in the log.
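
To watch for the ring going FAULTY again after a reset, the `corosync-cfgtool -s` output shown above can be scanned for the affected ring numbers. The following is a rough Python sketch that assumes the output keeps the "RING ID n" / "status = ..." layout pasted in this post; the function name `faulty_rings` and the `SAMPLE` text are illustrative, not a corosync API.

```python
import re
from typing import List

# Sample text copied from the corosync-cfgtool -s output above
# (the misspelling in the status line is how the tool printed it).
SAMPLE = "\n".join([
    "Printing ring status.",
    "Local node ID 33554442",
    "RING ID 0",
    "\tid\t= 10.0.0.2",
    "\tstatus\t= ring 0 active with no faults",
    "RING ID 1",
    "\tid\t= 207.207.163.12",
    "\tstatus\t= Marking seqid 6594 ringid 1 interface 207.207.163.12 "
    "FAULTY - adminisrtative intervention required.",
])


def faulty_rings(cfgtool_output: str) -> List[int]:
    """Return the ring numbers whose status line contains FAULTY."""
    faulty = []
    current = None  # ring number of the section being read
    for raw in cfgtool_output.splitlines():
        line = raw.strip()
        header = re.match(r"RING ID (\d+)$", line)
        if header:
            current = int(header.group(1))
        elif current is not None and line.startswith("status") and "FAULTY" in line:
            faulty.append(current)
    return faulty


print(faulty_rings(SAMPLE))  # prints [1]
```

In practice the text would come from running `corosync-cfgtool -s` on each node rather than from a pasted sample.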

Thank you,
Vadym Chepkov