[Openais] Re: membership change peculiarity
markh at osdl.org
Thu Sep 30 12:35:44 PDT 2004
On Thu, 2004-09-30 at 12:23, Steven Dake wrote:
> This looks like a bug in the membership algorithm. A bugzilla entry is
> in order. How repeatable is this problem? Did the node think it was
> the only one, right after you killed off node 8, or was it some time
> (more than 2 seconds) later?
I'll file a bugzilla entry. It's hard to tell exactly, but it looked like
it happened right after node 8 was killed.
> The way it works is when cl017 detects a token loss, it should enter the
> gather state. Then it sends an attempt join (multicast, but not
> reliable or ordered). Every rep of a ring should respond with an
> attempt join. In this case, node 8 is special, because it is always
> likely to be the ring rep (until it is killed, in which case 17 is the
> ring rep). The processor with the smallest IP address is chosen as the
> ring rep. Of importance is that non-reps do not take part in building
> the membership for the membership algorithm.
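The rep selection rule quoted above (smallest IP address wins) can be sketched roughly as follows. This is only an illustration, not the openais code: the function name choose_ring_rep and the 192.168.1.x addresses are made up for the example, and it assumes all members use IPv4 addresses.

```python
import ipaddress


def choose_ring_rep(member_addrs):
    """Return the member with the numerically smallest IP address.

    Hypothetical helper, not part of openais. Assumes every entry is
    an IPv4 address given as a dotted-quad string.
    """
    return min(member_addrs, key=ipaddress.ip_address)


# Made-up addresses: the last octet stands in for the node number.
ring = ["192.168.1.17", "192.168.1.8", "192.168.1.100"]

# While node 8 is alive it has the smallest address, so it is the rep.
rep = choose_ring_rep(ring)

# After node 8 is killed, the rule picks node 17 from the survivors.
survivors = [addr for addr in ring if addr != rep]
new_rep = choose_ring_rep(survivors)

print(rep, new_rep)
```

The comparison is done on parsed addresses rather than on the strings, since a plain string comparison would order "192.168.1.100" ahead of "192.168.1.17".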
> The fact that it is now 17 that fails to gather properly and doesn't
> detect the new ring suggests that the membership algorithm may fail to
> form a ring on token loss in the operational state when the rep is
> killed off. If it were some other node, I'd say it most likely just
> didn't get picked up because the processor failed to communicate.
> We just don't have enough information in the logs to determine what the
> algorithm is doing incorrectly, though. We need some more debug
> output...
> I'll get to these problems as soon as I can; I'm working on the cluster
> membership library/executive (clm) code trying to fix up a few bugs and
> increase code coverage...
Mark Haverkamp <markh at osdl.org>