[Openais] Split brain when using EVS library

Andrew Beekhof beekhof at gmail.com
Mon Sep 15 06:54:35 PDT 2008


On Mon, Sep 15, 2008 at 15:22, David Teigland <teigland at redhat.com> wrote:
> On Sat, Sep 13, 2008 at 06:16:45PM +0200, Lars Marowsky-Bree wrote:
>> On 2008-09-09T11:18:59, David Teigland <teigland at redhat.com> wrote:
>>
>> > > For some reason our cluster splits up into two rings.
>> > > Scenario is:
>> > > node1(n1) n2 n3 n4 n5 n6 are in the ring.
>> > >
>> > > Suddenly the ring splits into two rings:
>> > > n1 n2 n3 got leave msg from n4 n5 n6
>> > > n4 n5 n6 got leave msg from n1 n2 n3
>> > >
>> > > After a few milliseconds the two rings join again:
>> > > n1 n2 n3 got join msg from n4 n5 n6
>> > > n4 n5 n6 got join msg from n1 n2 n3
>> > >
>> > > The two rings are joined into one ring again:
>> > > node1(n1) n2 n3 n4 n5 n6 are in the ring.
>> >
>> > We at RH have struggled a great deal with this exact "feature" for quite a
>> > long time.  It's the biggest problem by far that we've had using openais.
>>
>> Any insights as to why this occurs? Random membership fluctuations are
>> .. a problem.
>>
>> Pacemaker can, AFAIK, deal with the rings healing, but the splits are
>> worrying, as they might cause recovery action to occur.
>
> Yes, the splits are annoying because they do cause recovery [1].  Even
> more annoying, though, is the merging of the splits.  That's what I was
> complaining about. We've really struggled with adding just the right kind
> of code to detect when a split+merge happens and handle it properly (for
> us that's continuing with recovery for the original split.)
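
To make the split+merge case concrete, here is a minimal sketch of the kind
of bookkeeping described above. It is not openais or cman code: the class
MembershipTracker and its methods are hypothetical names, and it assumes the
application is simply handed the full member list on every configuration
change. A node that rejoins while recovery for its earlier departure is still
pending is reported as "merged", and it stays in the pending set so that
recovery for the original split can continue.

```python
class MembershipTracker:
    """Hypothetical helper: tracks membership changes as seen by one node."""

    def __init__(self, members):
        self.members = set(members)
        # Nodes that left and whose recovery has not been completed yet.
        self.pending_recovery = set()

    def on_config_change(self, new_members):
        """Process a new member list; return (left, joined, merged)."""
        new_members = set(new_members)
        left = self.members - new_members
        joined = new_members - self.members
        # A node that comes back before its recovery finished is the
        # split+merge case: it must not be treated as a clean join.
        merged = joined & self.pending_recovery
        self.pending_recovery |= left
        self.members = new_members
        return left, joined, merged

    def recovery_done(self, node):
        """Mark recovery for a departed node as complete."""
        self.pending_recovery.discard(node)


# The scenario from the original report, from n1's point of view.
tracker = MembershipTracker(["n1", "n2", "n3", "n4", "n5", "n6"])
split = tracker.on_config_change(["n1", "n2", "n3"])
merge = tracker.on_config_change(["n1", "n2", "n3", "n4", "n5", "n6"])
```

In this sketch the merge does not clear pending_recovery; only an explicit
recovery_done() does, which mirrors the "continue with recovery for the
original split" policy. What recovery actually involves is left out.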

My biggest complaint is the splits that happen when a node joins the cluster.
Somehow one goes from a "cluster of N" to "N clusters-of-one" and all
manner of different combinations before finally getting a "cluster of
N+1".

And I'm talking about "stable" memberships, not the intermediate steps.

I've seen this with N as low as 4.

> [1] The splits can't usually be blamed on openais; it's usually something
> to do with a network glitch.

Not sure I'm buying this, though it depends on your definition of
network glitch.
Whitetank is far better than Trunk in this regard, but it still throws
the odd fit.

