[Openais] Two questions; Intro to AIS? and TOTEM errors in fence loop

Mon Nov 2 15:37:36 PST 2009

On Tue, 2009-11-03 at 08:46 +1030, Darren Thompson wrote:
> Kelly et al
> 
> My experience is from SLES (SUSE Enterprise) so you may need to
> "localise" this for your distribution.
> 
> First of all, get rid of that crappy "network bridge fudge script" that
> Xen uses to monkey around with your comms when xend starts. Find the
> file xend-config.sxp (or the equivalent xend config file).
> Find the network section where it explains the various bridging methods
> and look for the line "network-bridge". Comment this line out completely.
> 
> Now using the appropriate network tools create a network bridge for each
> real NIC that you have (in this case br0, br1, br2).
> Within the bridge script, connect each of your real Ethernet ports to
> the bridge, remove any IP configuration from the Ethernet port, add the
> network configuration to the bridge. Do the same for both nodes.
> Note: you can also bond the NICs first and then connect the bond to the
> bridge. This also works, and gives you NIC redundancy.
> 
> This way the bridging is intrinsically set up from network
> initialisation and is not changed partway through the boot
> process (the ugly Xen network script caused me all sorts of hassles with
> clusters, which is why I now do it this way).
> 
> I hope this helps.
> 
> Regards
> Darren
> 
> 

Darren

Thanks for the detailed explanation.

There are numerous issues with xend and openais operating together
because xen does weird networking stuff after openais is started.
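
As a rough sketch of what Darren describes, assuming CentOS 5 style
ifcfg files (his own setup is SLES, where the file names and keys
differ), the static bridge could look something like the fragment
below. The address 192.0.2.10, the netmask, and the choice of eth0/br0
are placeholders, not values taken from this thread.

```
# /etc/xen/xend-config.sxp
# Stop xend from rewiring the network when it starts:
# (network-script network-bridge)
(vif-script vif-bridge)

# /etc/sysconfig/network-scripts/ifcfg-br0
# The bridge carries the IP configuration (placeholder address).
DEVICE=br0
TYPE=Bridge
BOOTPROTO=static
IPADDR=192.0.2.10
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth0
# The physical port has no IP of its own and is enslaved to br0.
DEVICE=eth0
BRIDGE=br0
BOOTPROTO=none
ONBOOT=yes
```

With something like this in place the bridge exists before openais
starts, so the interface openais binds to does not change underneath it
when xend comes up.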

I hope the above helps.

Regards
-steve
> 
> 
> On Mon, 2009-11-02 at 16:55 -0500, Madison Kelly wrote:
> > Hi all,
> > 
> >    I've been playing with clustering on CentOS 5.x lately and thus far 
> > I've not really understood AIS' role in it. I've been to the AIS website 
> > but there isn't much there.
> > 
> >    So my first question is: where can I go to learn the fundamentals of AIS?
> > 
> >    Second question relates to a real-world problem I've been having. 
> > I've got a pretty simple 2-node cluster running DRBD+LVM on eth1. I use 
> > eth0 between the two nodes as a back channel which connects to an 
> > internal network via which we can manage the servers with IPMI. Lastly, 
> > there is eth2 on each node that is Internet-facing.
> > 
> >    So far, I can't put eth0 under Xen control (that is, I can't have it 
> > virtualized on dom0) without it causing a fence loop. I've tried asking 
> > for help elsewhere but so far I've not heard back.
> > 
> >    The reason I am asking here now is that the node that gets fenced 
> > shows this in its logs several times just before going down:
> > 
> > Oct 31 00:27:21 vsh02 openais[3133]: [TOTEM] FAILED TO RECEIVE
> > Oct 31 00:27:21 vsh02 openais[3133]: [TOTEM] entering GATHER state from 6.
> > 
> >    On the surviving node I see this:
> > 
> > Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] The token was lost in the OPERATIONAL state.
> > Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
> > Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
> > Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] entering GATHER state from 2.
> > Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] entering GATHER state from 0.
> > Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Creating commit token because I am the rep.
> > Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Saving state aru 2c high seq received 2c
> > Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Storing new sequence id for ring 108
> > Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] entering COMMIT state.
> > Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] entering RECOVERY state.
> > Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] position [0] member 10.255.135.3:
> > Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] previous ring seq 260 rep 10.255.135.2
> > Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] aru 2c high delivered 2c received flag 1
> > Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Did not need to originate any messages in recovery.
> > Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Sending initial ORF token
> > Oct 31 00:35:51 vsh03 openais[3237]: [CLM  ] CLM CONFIGURATION CHANGE
> > Oct 31 00:35:51 vsh03 openais[3237]: [CLM  ] New Configuration:
> > Oct 31 00:35:51 vsh03 kernel: dlm: closing connection to node 1
> > Oct 31 00:35:51 vsh03 fenced[3256]: vsh02.domain.com not a cluster member after 0 sec post_fail_delay
> > Oct 31 00:35:51 vsh03 openais[3237]: [CLM  ]     r(0) ip(10.255.135.3)
> > Oct 31 00:35:51 vsh03 fenced[3256]: fencing node "vsh02.domain.com"
> > 
> >    This happens when I put eth0 under Xen's management. The nodes will 
> > keep fencing until DRBD breaks. At that point, the fencing stops and 
> > everything seems to be fine. However, once I fix the DRBD partition and 
> > try to look at the LVM the above errors return and I'm right back into a 
> > fence loop until DRBD breaks again.
> > 
> >    Even a pointer to where I can learn more about AIS/TOTEM so that I 
> > can try to understand what's going on would be awesome. I'm really stuck 
> > on this error...
> > 
> > Thanks!
> > 
> > Madi
> > _______________________________________________
> > Openais mailing list
> > Openais at lists.linux-foundation.org
> > https://lists.linux-foundation.org/mailman/listinfo/openais
> > 
> 