[Openais] Split brain when using EVS library

Ruppert Koch ruppert at rcsc.de
Sat Sep 13 11:58:26 PDT 2008


Lars Marowsky-Bree wrote:
> On 2008-09-09T11:18:59, David Teigland <teigland at redhat.com> wrote:
>
>   
>>> For some reason our cluster splits up into two rings.
>>> Scenario is:
>>> node1(n1) n2 n3 n4 n5 n6 are in the ring.
>>>
>>> Suddenly the ring splits into two rings:
>>> n1 n2 n3 got leave msg from n4 n5 n6
>>> n4 n5 n6 got leave msg from n1 n2 n3
>>>
>>> After a few milliseconds the two rings join again:
>>> n1 n2 n3 got join msg from n4 n5 n6
>>> n4 n5 n6 got join msg from n1 n2 n3
>>>
>>> The two rings are joined into one ring again:
>>> node1(n1) n2 n3 n4 n5 n6 are in the ring.
>>>       
>> We at RH have struggled a great deal with this exact "feature" for quite a
>> long time.  It's the biggest problem by far that we've had using openais.
>>     
>
> Any insights as to why this occurs? Random membership fluctuations are
> ... a problem.
>
> Pacemaker can, AFAIK, deal with the rings healing, but the splits are
> worrying, as they might cause recovery action to occur.
>
>
> Regards,
>     Lars
>
>   
Fault detection as well as membership are managed by the Totem protocol. 
I assume the following happens:

A node P experiences a token timeout. In this case P automatically 
assumes that the previous token holder Q has failed and puts Q on its 
list of failed nodes. This means that the next membership can contain 
either P or Q, but not both. P initiates the establishment of a new 
membership by sending a Gather message. On receiving that message, Q 
also starts to gather nodes for a competing membership. In effect, two 
gather phases run simultaneously. Some of the other nodes decide to 
stick with P while others side with Q. In the end, two parallel 
memberships are established: the old ring breaks into two independent 
rings.
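
As a toy illustration of the split described above (this is not the 
actual Totem code; the function name split_ring and the choice of which 
nodes side with P are assumptions made up for the sketch):

```python
def split_ring(ring, p, q, sides_with_p):
    """Toy model of two competing gather phases.

    P has put Q on its failed list, so no new membership may contain
    both. Every other node ends up in exactly one of the two rings.
    'sides_with_p' is a hypothetical input: in reality the outcome
    depends on message timing, not on a fixed set.
    """
    ring_p = {p}
    ring_q = {q}
    for node in ring:
        if node in (p, q):
            continue
        if node in sides_with_p:
            ring_p.add(node)
        else:
            ring_q.add(node)
    return ring_p, ring_q


# Token order n1 -> n2 -> ... -> n6 -> n1, so n6 is n1's previous holder.
old_ring = ["n1", "n2", "n3", "n4", "n5", "n6"]
ring_a, ring_b = split_ring(old_ring, p="n1", q="n6",
                            sides_with_p={"n2", "n3"})
print(sorted(ring_a))  # ['n1', 'n2', 'n3']
print(sorted(ring_b))  # ['n4', 'n5', 'n6']
```

With n1 as P and n6 as Q, this reproduces the n1 n2 n3 / n4 n5 n6 split 
from the original report.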

Now two independent rings exist. Since we still have a multicast 
environment, every node still receives the messages sent on both 
rings. Both ring leaders detect that nodes exist that do not belong to 
their own ring. The fact that P cannot be in the same membership as Q 
is no longer an issue, because each node assumes that the other node 
has been repaired or that the communication disruption has been 
overcome. The two rings join and form a single ring.
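
The rejoin can be sketched the same way. Again this is only a toy model 
under my assumptions, not the real protocol; foreign_senders and 
merge_if_foreign are hypothetical names:

```python
def foreign_senders(my_ring, heard_from):
    """Nodes whose multicast traffic we hear but which are not members."""
    return set(heard_from) - set(my_ring)


def merge_if_foreign(ring_a, ring_b):
    """Return the merged membership if either ring hears foreign nodes.

    Assumption: on a shared multicast network every node hears every
    sender, so each ring hears the members of both rings.
    """
    heard = set(ring_a) | set(ring_b)
    if foreign_senders(ring_a, heard) or foreign_senders(ring_b, heard):
        # Old failed lists are dropped; a new gather forms one ring.
        return heard
    return None


merged = merge_if_foreign({"n1", "n2", "n3"}, {"n4", "n5", "n6"})
print(sorted(merged))  # ['n1', 'n2', 'n3', 'n4', 'n5', 'n6']
```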

Ruppert


-- 
Ruppert Koch, Ph.D.
Reliable Computer Systems Consulting
http://www.rcsc.de


