[Openais] Re: membership change peculiarity

Mark Haverkamp markh at osdl.org
Thu Sep 30 12:35:44 PDT 2004


On Thu, 2004-09-30 at 12:23, Steven Dake wrote:
> Mark,
> 
> This looks like a bug in the membership algorithm.  A bugzilla entry is
> in order.  How repeatable is this problem?  Did the node think it was
> the only one, right after you killed off node 8, or was it some time
> (more than 2 seconds) later?

I'll send in a bugzilla.  It's hard to tell exactly, but it looked like
it happened right away after killing node 8.

> 
> The way it works is that when cl017 detects a token loss, it should
> enter the gather state.  Then it sends an attempt join (multicast, but
> not reliable or ordered).  Every rep of a ring should respond with an
> attempt join.  In this case, node 8 is special, because it is normally
> the ring rep (until it is killed, at which point 17 becomes the ring
> rep).  The processor with the smallest IP address is chosen as the
> ring rep.  Importantly, non-reps do not take part in building the
> membership.
> 
> The fact that it is 17, now the rep, that fails to produce the desired
> gathers and doesn't detect the new ring suggests that the membership
> algorithm may fail to form on token loss in the operational state when
> the rep is killed off.  If it were some other node, I'd say it likely
> just didn't get picked up because the processor failed to
> communicate.
> 
> We just don't have enough information from the logs to determine what
> the algorithm is doing incorrectly, though.  We need some more debug
> output...
> 
> I'll get to these problems as soon as I can; I'm working on the cluster
> membership library/executive (clm) code trying to fix up a few bugs and
> increase code coverage...

Sounds good,
Mark.

-- 
Mark Haverkamp <markh at osdl.org>



