[Openais] status of code in bk?

Steven Dake sdake at mvista.com
Fri Feb 11 12:52:18 PST 2005


Kristen

My apologies for a short response; I had to take my daughter to the
doctor and wanted to get a quick response out to you.

A configuration change means that either a new processor has joined the
configuration or a processor has been detected as faulty because the
token was not received within TIMEOUT_TOKEN.  This happens when the
token is not delivered (because it was not forwarded or was dropped
during transmission).  If it was dropped, the token is retransmitted
after TIMEOUT_TOKEN_RETRANSMIT.  When the token is lost because a
processor has faulted, however, the membership algorithm is started.
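
A rough sketch of how the two timers relate, in Python.  The values are
the ones from the quoted thread below; the function name and the idea of
keying everything off "msec since the token was last seen" are
simplifications for illustration, not the actual openais code.

```python
from enum import Enum

# Values in msec from the quoted thread below; tune for your own network.
TIMEOUT_TOKEN = 90
TIMEOUT_TOKEN_RETRANSMIT = 30


class TokenAction(Enum):
    WAIT = "wait"                          # token is not overdue yet
    RETRANSMIT = "retransmit"              # token may have been dropped
    START_MEMBERSHIP = "start_membership"  # token is considered lost


def token_action(msec_since_token: float) -> TokenAction:
    """Hypothetical helper: what to do given the time since the token was seen."""
    if msec_since_token >= TIMEOUT_TOKEN:
        return TokenAction.START_MEMBERSHIP
    if msec_since_token >= TIMEOUT_TOKEN_RETRANSMIT:
        return TokenAction.RETRANSMIT
    return TokenAction.WAIT
```

With these values a dropped token gets a retransmit at 30 msec, well
before the 90 msec point where a reconfiguration would start.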

This causes every processor to enter the "GATHER" phase of the
membership state machine.  Each processor sends a message listing the
processors it thinks should be considered for membership in this round
and the processors it believes have failed.  During this phase, every
processor must reach consensus on these two lists.
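
The consensus condition can be sketched roughly as below.  This is only
an illustration: the data layout (a dict of views keyed by processor
name) and the function name are made up here, and the real membership
algorithm also has to deal with timeouts and late joiners.

```python
from typing import Dict, FrozenSet, Tuple

# One processor's view: (proposed members, processors it believes failed)
View = Tuple[FrozenSet[str], FrozenSet[str]]


def has_consensus(views: Dict[str, View]) -> bool:
    """True when all reported views match and every non-failed member reported."""
    if not views:
        return False
    distinct = set(views.values())
    if len(distinct) != 1:
        # At least two processors disagree on one of the two lists.
        return False
    members, failed = next(iter(distinct))
    # Every processor expected in the new ring must have sent its view.
    return (members - failed) <= set(views)
```

For example, if n1 and n2 both report members {n1, n2, n3} with n3
failed, that is consensus; if only n1 has reported so far, it is not.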
 
The next phase of the algorithm is the "COMMIT" phase, where a token is
transmitted from the representative (lowest IP address in configuration)
along a ring specified in the consensus membership.  This commit token
spreads information used for recovery of lost messages in the EVS
state.  The commit token rotates twice to pass this information and
share the information regarding the new configuration.
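
As a small illustration of picking the representative, here is a Python
sketch.  Comparing addresses numerically matters (a plain string sort
would put 10.0.0.12 ahead of 10.0.0.3).  Ordering the whole ring by IP
address is an assumption made for this example; the text above only
says that the representative is the lowest address.

```python
import ipaddress
from typing import Iterable, List


def ring_order(members: Iterable[str]) -> List[str]:
    """Order members numerically by IP address (assumed ring order)."""
    return sorted(members, key=ipaddress.ip_address)


def representative(members: Iterable[str]) -> str:
    """The processor with the lowest IP address starts the commit token."""
    return ring_order(members)[0]


def commit_token_path(members: Iterable[str], rotations: int = 2) -> List[str]:
    """Processors visited by the commit token over the given rotations."""
    return ring_order(members) * rotations
```

With members 10.0.0.12, 10.0.0.3 and 10.0.0.7, the representative is
10.0.0.3 and the commit token visits each processor twice.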

The next state is the "EVS" state which recovers any lost messages and
ensures virtual synchrony.  If there are messages missing at the end of
a configuration for a processor, the algorithm will recover those
missing messages and ensure all processors (that are not failed) will
receive the same message stream and configuration change stream.
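
To make the recovery idea concrete, here is a toy Python sketch.  It
assumes each surviving processor knows the set of sequence numbers it
received in the old configuration; the function names are invented for
the example, and the real recovery runs over the ring rather than with
global knowledge like this.

```python
from typing import Dict, List, Set


def missing_messages(received: Dict[str, Set[int]]) -> Dict[str, List[int]]:
    """Sequence numbers each survivor lacks but some other survivor holds."""
    recoverable = set().union(*received.values())
    return {proc: sorted(recoverable - seqs) for proc, seqs in received.items()}


def recover(received: Dict[str, Set[int]]) -> Dict[str, List[int]]:
    """After recovery every survivor holds the same ordered message stream."""
    recoverable = sorted(set().union(*received.values()))
    return {proc: list(recoverable) for proc in received}
```

If n1 holds {1, 2, 3, 5} and n2 holds {1, 2, 4, 5}, then n1 is missing 4
and n2 is missing 3, and after recovery both hold 1 through 5.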

The next state is the OPERATIONAL state.  When entering the operational
state, all messages from the point that recovery occurs to the end of
the message stream are delivered along with configuration changes.
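
Putting the phases together, the path through a reconfiguration can be
sketched as a simple table in Python.  This only covers the sequence
described in this mail and ignores any fallback paths the real
implementation may have; the names below are for illustration.

```python
from enum import Enum
from typing import List


class State(Enum):
    OPERATIONAL = "OPERATIONAL"
    GATHER = "GATHER"
    COMMIT = "COMMIT"
    EVS = "EVS"


# Successor of each state along the path described above.
NEXT_STATE = {
    State.OPERATIONAL: State.GATHER,  # token lost or a new processor joins
    State.GATHER: State.COMMIT,       # consensus reached on both lists
    State.COMMIT: State.EVS,          # commit token has rotated twice
    State.EVS: State.OPERATIONAL,     # lost messages recovered
}


def reconfiguration_path(start: State = State.OPERATIONAL) -> List[State]:
    """States visited from `start` until the machine returns to `start`."""
    path = [start]
    state = NEXT_STATE[start]
    while state != start:
        path.append(state)
        state = NEXT_STATE[state]
    path.append(start)
    return path
```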

As you can see, this process is somewhat complex.  It's better to spend
the 40 msec it usually takes to execute this process on extra token
timeout time.

Unfortunately, the token timeout depends on your network load and
hardware.  One way to determine it is to calculate the token rotation
time on your slowest hardware under your maximum load.  The token
timeout can then be set slightly above this maximum time period.  Make
sure to take the retransmit period into account.
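
One possible way to turn that into a number, as a Python sketch.  The
formula is an assumption, not something openais computes: it adds the
retransmit period to the worst-case rotation time (so one retransmitted
token still has time to get around) and then applies a safety margin.
The function name and the 20% default margin are made up for the
example.

```python
import math


def suggested_token_timeout(max_rotation_msec: float,
                            retransmit_msec: float,
                            margin: float = 0.2) -> int:
    """Hypothetical estimate of a token timeout in msec.

    max_rotation_msec: measured rotation time on the slowest hardware
                       under maximum load.
    retransmit_msec:   the token retransmit timeout in use.
    margin:            fractional headroom on top (0.2 means 20%).
    """
    if max_rotation_msec < 0 or retransmit_msec < 0 or margin < 0:
        raise ValueError("inputs must be non-negative")
    return math.ceil((max_rotation_msec + retransmit_msec) * (1 + margin))
```

For a measured 6 msec rotation and a 30 msec retransmit timeout, a 50%
margin gives 54 msec.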

Configuration changes are a cause for concern, because messages are
blocked during reconfiguration.  For a real time system, this is not
particularly good during normal operation.  During a fault, I believe
blocking is ok because, hey, it's a fault and blocking is better than
not working at all :)

Regards
-steve

On Fri, 2005-02-11 at 09:55, Kristen Smith wrote:
> Steve,
> 
> Right now we have TIMEOUT_TOKEN set to 60 and we periodically see
> reconfigurations. What exactly is going on when a reconfiguration
> occurs? Is it cause for concern when these occur?
> 
> Thanks,
> Kristen
> 
> -----Original Message-----
> From: Steven Dake [mailto:sdake at mvista.com] 
> Sent: Wednesday, February 09, 2005 7:12 PM
> To: Smith, Kristen [NGC:B675:EXCH]
> Cc: openais at lists.osdl.org; Bajpai, Muni [NGC:B670:EXCH]
> Subject: RE: [Openais] status of code in bk?
> 
> 
> Kristen
> I'd suggest playing with the timing and reporting the lowest values
> which work for you.  I intend to spend some time on determining this
> but it's low priority for now.  I'd expect that the following
> aggressive values should work in a LAN setting.  If they don't, try
> increasing (scaling all values by the same multiplier).
> 
> TIMEOUT_STATE_GATHER_JOIN 40
> TIMEOUT_STATE_GATHER_CONSENSUS 80 (should be double join)
> TIMEOUT_TOKEN 90
> TIMEOUT_TOKEN_RETRANSMIT 30
> 
> You may be able to get TIMEOUT_TOKEN down to 60 with more chance of
> reconfigurations.
> 
> There was no intent to change the timing values.  I must have made the
> change during debugging.  I often change these values to test for
> different timeout values and may have inadvertently committed that
> change.
> 
> When calculating the timeout for the token, I find that a token should
> spend about 300 usec at each processor if there are no messages to
> multicast.  With 16 processors, that is about 2 msec.  If the token
> doesn't rotate in TIMEOUT_TOKEN a reconfiguration occurs.  If you add
> one processor multicasting 40 messages per ring rotation, a token may
> take 5-6 msec to rotate.  Given that, 90 msec is long enough to wait
> before declaring the token lost.
> 
> I eventually intend to calculate the ring timeouts dynamically during
> ring formation, but this work is quite a bit out (maybe even next
> year).
> 
> Thanks
> -steve
> 
> On Wed, 2005-02-09 at 17:41, Kristen Smith wrote:
> > Steve,
> > 
> > One thing I notice when running the latest bitkeeper code is that the
> > time it takes to notice that another node has failed has increased.  If
> > I start up 2 aisexecs (one on each node) and then ctrl-c one of them,
> > the other takes a few seconds to notice that the node went away.  When
> > we started using the totem-ais code in Jan, I was impressed that the
> > time to notice the failure had decreased (almost instantaneous)
> > compared with the previous openais, but now it seems like it is slower
> > than with the previous openais (before the totem changes).
> > 
> > Are there new configuration parms that I need to muck with to get the
> > node failure detection time down? (I did see your email a while back
> > on decreasing this time, I was just wondering if you had intended to
> > make the detection time greater in this new code).
> > 
> > Thanks,
> > Kristen
> > 
> > -----Original Message-----
> > From: Steven Dake [mailto:sdake at mvista.com]
> > Sent: Tuesday, February 08, 2005 3:29 PM
> > To: Smith, Kristen [NGC:B675:EXCH]
> > Cc: openais at lists.osdl.org; Bajpai, Muni [NGC:B670:EXCH]
> > Subject: Re: [Openais] status of code in bk?
> > 
> > 
> > Kristen,
> > 
> > All of the code is now in bitkeeper.  I'll try to wrap up a freshmeat
> > release tomorrow with code coverage reports after running the tests we
> > have available.
> > 
> > Thanks
> > -steve
> > 
> > On Tue, 2005-02-08 at 07:30, Kristen Smith wrote:
> > > Hello,
> > > 
> > > Could you please tell me the status of the latest code that is in
> > > bitkeeper? Does it have all the patches you guys have been putting out
> > > for the past few weeks? If not, when do you foresee updating it with
> > > all these patches?
> > > 
> > > Thanks,
> > > Kristen



