[Openais] Logs during reconfiguration (node lost)

Kristen Smith kjsmith at nortel.com
Mon Feb 21 11:01:46 PST 2005


On the clocks not being synchronized: we forgot to set the NTP server
before this run. We will set it before the next one.

-----Original Message-----
From: Mark Haverkamp [mailto:markh at osdl.org] 
Sent: Monday, February 21, 2005 12:28 PM
To: Smith, Kristen [NGC:B675:EXCH]
Cc: Openais List
Subject: Re: [Openais] Logs during reconfiguration (node lost)


On Mon, 2005-02-21 at 12:19 -0500, Kristen Smith wrote:
> Hi Steve,
> 
> We had some traffic running this weekend (5+1) and one of the nodes 
> died (the same aisexec: ../include/sq.h:152: sq_item_get: Assertion 
> `sq_position >= 0' failed. that is already reported). In looking 
> through the logs when this happened, I am confused about something and 
> maybe you can clear this up for me.
> 
> We had 6 nodes (47.104.22.82 - 47.104.22.87) - the failure occurred on 
> .84. The reconfig looks the same on 4 of the remaining nodes and 
> different on another one. The logs are shown below.
> 
> My questions are:
> 
> 1) why do all but .86 think that .84 AND .86 went away - .84 died, so 
> that makes sense, but why .86 as well?
> 2) why does .86 think all other nodes went away and it is all by 
> itself?
> 3) both .82 and .86 think they are the rep and create new commit 
> tokens - I guess this is because .86 thinks it is in a cluster by 
> itself and .82 was the original rep.
> 
> Also, this is just the beginning of the reconfiguration at this time - 
> all nodes do multiple reconfigurations after this one caused by the 
> failure. I can send all logs along later if you want. Eventually 
> (within a second or so after this initial reconfig), all the nodes 
> wind up seeing each other and the ring is reformed in a 5+0 scenario.

I have also seen this kind of thing happen.  My guess is that it is a timing
effect: each node's timeouts can expire at slightly different moments, and
the protocol has states in which foreign messages are ignored, so nodes can
end up with different views of the ring for a short while.
Maybe Steve will have a less ambiguous explanation than this :-).
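
To make the timing argument concrete, here is a toy sketch in Python. It is
not the aisexec or totem algorithm; the function name, the timeout value and
every timestamp are invented for illustration. It only shows how two nodes
that apply the same timeout rule to different "last heard" times can reach
different membership views, roughly matching what .82 and .86 logged.

```python
# Toy model only: NOT the aisexec/totem membership algorithm.
# All names, timestamps and the timeout value are hypothetical.

def membership_view(observer, last_heard, now, timeout):
    """Return the set of nodes the observer still considers alive.

    A peer is kept if the observer heard from it within `timeout`
    time units; the observer always counts itself.
    """
    alive = {node for node, t in last_heard.items() if now - t <= timeout}
    return {observer} | alive


TIMEOUT = 50  # hypothetical timeout, arbitrary time units

# .82 kept hearing from most peers, but .84 died and the last message
# from .86 that it accepted is just past the timeout.
heard_by_82 = {".83": 95, ".84": 40, ".85": 96, ".86": 48, ".87": 97}

# .86 spent a while in a state where it ignored foreign messages,
# so every peer looks stale from its point of view.
heard_by_86 = {".82": 45, ".83": 45, ".84": 40, ".85": 45, ".87": 45}

view_82 = membership_view(".82", heard_by_82, now=100, timeout=TIMEOUT)
view_86 = membership_view(".86", heard_by_86, now=100, timeout=TIMEOUT)

print(sorted(view_82))  # ['.82', '.83', '.85', '.87']
print(sorted(view_86))  # ['.86']
```

In this made-up run, .82 drops both .84 and .86 while .86 drops everyone,
even though only .84 actually failed.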

Mark.

p.s. It's a little difficult to see the relationship of the log messages
when the clocks look like they aren't synchronized.


-- 
Mark Haverkamp <markh at osdl.org>

