[Openais] Split brain when using EVS library
Ruppert Koch
ruppert at rcsc.de
Sat Sep 13 11:58:26 PDT 2008
Lars Marowsky-Bree wrote:
> On 2008-09-09T11:18:59, David Teigland <teigland at redhat.com> wrote:
>
>
>>> For some reason our cluster splits up into two rings.
>>> Scenario is:
>>> node1(n1) n2 n3 n4 n5 n6 are in the ring.
>>>
>>> Suddenly the ring splits into two rings:
>>> n1 n2 n3 got leave msg from n4 n5 n6
>>> n4 n5 n6 got leave msg from n1 n2 n3
>>>
>>> After a few milliseconds the two rings join again:
>>> n1 n2 n3 got join msg from n4 n5 n6
>>> n4 n5 n6 got join msg from n1 n2 n3
>>>
>>> The two rings are joined into one ring again:
>>> node1(n1) n2 n3 n4 n5 n6 are in the ring.
>>>
>> We at RH have struggled a great deal with this exact "feature" for quite a
>> long time. It's the biggest problem by far that we've had using openais.
>>
>
> Any insights as to why this occurs? Random membership fluctuations are
> ... a problem.
>
> Pacemaker can, AFAIK, deal with the rings healing, but the splits are
> worrying, as they might cause recovery action to occur.
>
>
> Regards,
> Lars
>
>
Fault detection as well as membership are managed by the Totem protocol.
I assume the following happens:
A node P experiences a token timeout. In this case P automatically
assumes that the previous token holder Q has failed and puts Q on its
list of failed nodes. This means that the next membership can contain
either P or Q, but not both. P initiates the establishment of a new
membership by sending a Gather message. When receiving that message Q
also starts to gather nodes for a competing membership. In effect, two
gather phases are executed simultaneously. Some of the other nodes
decide to stick with P while others side with Q. In the end two parallel
memberships are established: The old ring breaks into two independent rings.
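
To make the scenario concrete, here is a toy Python sketch of it. This is
not Totem or openais code. The function name form_memberships, the choice
of n4 as P and n3 as Q, and the rule that every other node simply follows
one of the two gather initiators are assumptions made for illustration.

```python
def form_memberships(nodes, p, q, sides_with):
    """Toy model: P has Q on its failed list, so they land in different rings.

    sides_with maps every other node to "P" or "Q", the gather initiator
    that node follows (an illustrative assumption, not Totem's real rule).
    """
    ring_p, ring_q = {p}, {q}
    for node in nodes:
        if node in (p, q):
            continue
        (ring_p if sides_with[node] == "P" else ring_q).add(node)
    return ring_p, ring_q


nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]
# Hypothetical: n4 times out on the token and blames its predecessor n3.
sides = {"n1": "Q", "n2": "Q", "n5": "P", "n6": "P"}
ring_p, ring_q = form_memberships(nodes, p="n4", q="n3", sides_with=sides)
print(sorted(ring_p))  # ['n4', 'n5', 'n6']
print(sorted(ring_q))  # ['n1', 'n2', 'n3']
```

With these assumed choices the result matches the reported split:
n1 n2 n3 in one ring and n4 n5 n6 in the other.
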
Since we still have a multicast environment, the nodes of each ring keep
receiving the messages of the other ring as well. Both ring leaders
therefore detect nodes that do not belong to their own ring. The fact
that P cannot be in the same membership as Q is no longer an issue,
because each of the two assumes that the other has been repaired or
that the communication disruption has been overcome. The two rings join
and form a single ring.
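
The rejoin step can be sketched in the same toy fashion. Again this is
not openais code: the names foreign_nodes and next_membership are made up,
and the sketch assumes that the failed lists of the old configuration are
simply not carried over into the new one.

```python
ring_a = {"n1", "n2", "n3"}
ring_b = {"n4", "n5", "n6"}

# On a shared multicast group, every node hears senders from both rings.
senders_seen = ring_a | ring_b


def foreign_nodes(own_ring, seen):
    """Senders that were heard but do not belong to our own ring."""
    return seen - own_ring


def next_membership(own_ring, seen):
    """Toy rule: the next ring contains our ring plus every foreign node.

    Assumes no node is still on a failed list at this point.
    """
    return own_ring | foreign_nodes(own_ring, seen)


print(sorted(foreign_nodes(ring_a, senders_seen)))  # ['n4', 'n5', 'n6']
merged = next_membership(ring_a, senders_seen)
print(sorted(merged))  # ['n1', 'n2', 'n3', 'n4', 'n5', 'n6']
```

Both leaders arrive at the same six-node membership, which corresponds
to the join messages seen a few milliseconds after the split.
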
Ruppert
--
Ruppert Koch, Ph.D.
Reliable Computer Systems Consulting
http://www.rcsc.de