[Openais] Logs during reconfiguration (node lost)

Steven Dake sdake at mvista.com
Tue Feb 22 11:37:34 PST 2005


On Mon, 2005-02-21 at 11:28, Mark Haverkamp wrote:
> On Mon, 2005-02-21 at 12:19 -0500, Kristen Smith wrote:
> > Hi Steve,
> > 
> > We had some traffic running this weekend (5+1) and one of the nodes
> > died (the same aisexec: ../include/sq.h:152: sq_item_get: Assertion
> > `sq_position >= 0' failed. that is already reported). In looking
> > through the logs when this happened, I am confused about something and
> > maybe you can clear this up for me.
> > 
> > We had 6 nodes (47.104.22.82 - 47.104.22.87) - the failure occurred
> > on .84. The reconfig looks the same on 4 of the remaining nodes and
> > different on another one. The logs are shown below. 
> > 
> > My questions are:
> > 
> > 1) why do all but .86 think that .84 AND .86 went away - .84 died, so
> > that makes sense, but why .86 as well? 
> > 2) why does .86 think all other nodes went away and it is all by
> > itself? 
> > 3) both .82 and .86 think they are the rep and create new commit
> > tokens - I guess this is because .86 thinks it is in a cluster by
> > itself and .82 was the original rep.
> > 
> > Also, this is just the beginning of the reconfiguration at this time -
> > all nodes do multiple reconfigurations after this one caused by the
> > failure. I can send all logs along later if you want. Eventually
> > (within a second or so after this initial reconfig), all the nodes
> > wind up seeing each other and the ring is reformed in a 5+0 scenario.
> 
> I have also seen this kind of thing happen.  It seems to me that because
> each node has its own timeouts, which can expire at slightly different
> times, and because there are protocol states in which foreign messages
> are ignored, timing alone can make things like this happen.  Maybe Steve
> will have a less ambiguous explanation than this :-).
> 

Mark, it could be that you're right: timing differences between the
individual processors could result in this behavior.  I am not sure about
this, though.  See my previous explanation.
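
To make the timing argument concrete, here is a minimal toy model. It is
not the Totem protocol and not how aisexec is implemented. It assumes each
surviving node notices the failure at its own time, then listens for peers
during a fixed window, and that two nodes only hear each other if they
noticed the failure within that window of one another. The function name,
the times, and the window length are all made up for illustration.

```python
def membership_views(detect_ms, window_ms):
    """Return the set of peers each node ends up seeing (toy model).

    detect_ms: node name -> time (ms) at which that node noticed the failure.
    window_ms: how long a node keeps listening for peers after it notices.
    Two nodes see each other only if their detection times differ by at
    most window_ms, i.e. their listening windows overlap.
    """
    views = {}
    for node, start in detect_ms.items():
        views[node] = sorted(
            peer for peer, other in detect_ms.items()
            if abs(other - start) <= window_ms
        )
    return views


# Hypothetical detection times: .86 times out much later than the rest.
detect = {".82": 0, ".83": 2, ".85": 3, ".87": 4, ".86": 50}
views = membership_views(detect, window_ms=10)

for node in sorted(views):
    print(node, "sees", views[node])
```

With these made-up numbers, .82, .83, .85 and .87 agree on a four-node
view without .86, while .86 sees only itself, which is the shape of the
first reconfiguration described above. A later round in which everybody is
listening at the same time would merge the two views.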

> Mark.
> 
> p.s. It's a little difficult to see the relationship of the log messages
> when the clocks look like they aren't synchronized.
> 
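
On the clock problem: if a rough offset for each node's clock is known,
the per-node logs can be shifted onto one timeline before reading them
side by side. The sketch below assumes the offsets have already been
estimated some other way; the node names, timestamps, and messages are
placeholders, not real aisexec output.

```python
def merge_logs(logs, offsets_ms):
    """Merge per-node logs into one list ordered by corrected time.

    logs: node name -> list of (timestamp_ms, message) tuples.
    offsets_ms: node name -> how far that node's clock runs ahead (ms).
    Nodes missing from offsets_ms are assumed to need no correction.
    """
    merged = []
    for node, entries in logs.items():
        skew = offsets_ms.get(node, 0)
        for stamp, message in entries:
            merged.append((stamp - skew, node, message))
    # Sort on corrected time only; ties keep their original order.
    merged.sort(key=lambda entry: entry[0])
    return merged


# Placeholder data: .86's clock is assumed to run 3000 ms ahead.
logs = {
    ".82": [(1000, "event A"), (1010, "event C")],
    ".86": [(4002, "event B")],
}
timeline = merge_logs(logs, {".86": 3000})

for stamp, node, message in timeline:
    print(stamp, node, message)
```

Without the correction, the .86 entry would sort after everything from
.82; with it, the entry lands between the two .82 entries.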



