[Openais] Checkpoint crash in aisexec

Steven Dake sdake at mvista.com
Mon Feb 14 12:17:26 PST 2005


On Sat, 2005-02-12 at 08:08, Kristen Smith wrote:
> Steve,
> 
> Thanks for the response.
> 
> For recovery - what are the ramifications if we don't have recovery
> working 100%? What I see now is that when a node leaves 
> the cluster and then rejoins, it receives evt messages, but it can
> take anywhere from 15 seconds to minutes for evt messages sent from
> that node to reach the other applications. I handle this with some 

Mark, have you seen this issue?

> message retries which is ok in this startup case. However, are we in
> jeopardy in other cases that I am not considering? When running
> traffic the past few days and seeing periodic reconfigs, I don't seem
> to be losing messages when that occurs - I only see the lost messages
> when I actually kill a node and start it back up to rejoin the
> cluster.
> 

What we have today is totally unacceptable because, at least for
checkpointing, there is no recovery.  And Mark is waiting on my base
code for event recovery.

"100% working" means that if there is a failure during recovery, we are
still guaranteed a consistent state.  I think evt is pretty close to
this goal, although the checkpoint replication after merge has not been
developed yet.  I can think of a lot of easy ways to do this, but
handling a failure during the recovery phase makes it more difficult.

"Almost 100%" means that recovery works properly if there are no faults
during recovery (i.e., the merge process), but if there is a fault
during recovery (such as a reconfig) something could go awry.

We want consistently replicated data (the 100% case).  100% is probably
past your development window; the other case is within reach.
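
To make the difference between the two cases concrete, here is a minimal
sketch in C of a merge loop that notices a reconfig in the middle of
recovery and restarts instead of carrying on with partial state.  Every
name in it (struct recovery, recovery_start, recovery_step, ring_seq) is
hypothetical and made up for illustration; none of it is openais code,
and it only models the bookkeeping, not the actual replication.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the openais implementation. */
enum recovery_state { RECOVERY_MERGING, RECOVERY_DONE };

struct recovery {
    enum recovery_state state;
    unsigned int ring_seq;  /* configuration the merge was started under */
    size_t next_item;       /* next item (e.g. checkpoint) to replicate */
    size_t item_count;      /* total items that must be replicated */
};

/* Begin (or restart) recovery under the given configuration. */
void recovery_start(struct recovery *r, unsigned int ring_seq,
                    size_t item_count)
{
    assert(r != NULL);
    r->state = (item_count == 0) ? RECOVERY_DONE : RECOVERY_MERGING;
    r->ring_seq = ring_seq;
    r->next_item = 0;
    r->item_count = item_count;
}

/*
 * Replicate one item.  Returns true once recovery has completed.
 * If the configuration changed since the merge began (a reconfig during
 * recovery), partial progress is discarded and the merge restarts under
 * the new configuration.  The "almost 100%" case would skip this check.
 */
bool recovery_step(struct recovery *r, unsigned int current_ring_seq)
{
    assert(r != NULL);
    if (r->state == RECOVERY_DONE) {
        return true;
    }
    if (current_ring_seq != r->ring_seq) {
        recovery_start(r, current_ring_seq, r->item_count);
        return r->state == RECOVERY_DONE;
    }
    r->next_item++;  /* stand-in for replicating one item */
    if (r->next_item == r->item_count) {
        r->state = RECOVERY_DONE;
    }
    return r->state == RECOVERY_DONE;
}
```

The restart-from-scratch choice is the simplest one that keeps state
consistent; a real implementation would likely want something cheaper.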

Regards
-steve

> Thanks
> Kristen
> 
> -----Original Message-----
> From: Steven Dake [mailto:sdake at mvista.com] 
> Sent: Friday, February 11, 2005 5:30 PM
> To: Smith, Kristen [NGC:B675:EXCH]
> Subject: RE: [Openais] Checkpoint crash in aisexec
> 
> 
> OK, well, I doubt there is a buffer overflow with 200-byte checkpoints.
> :)
> 
> Recovery will come after 188 is wrapped up.  I think your two-week
> window looks good for alpha-level recovery (i.e., works most of the
> time).  High-quality production recovery (i.e., works 100% of the time
> no matter what happens) will not land within your development window.
> 
> Thanks
> -steve
> 
> On Fri, 2005-02-11 at 15:56, Kristen Smith wrote:
> > Steve,
> > 
> > The checkpoints are ~200 bytes in size.
> > 
> > I agree, valgrind is an excellent tool. We will run it through and
> > see if that shows anything.
> > 
> > I have tried this scenario maybe 30 times today (for various other
> > testing) and it happened maybe 10 times. For a while I could reproduce
> > with a given test about 5 times and then it hasn't happened again.
> > 
> > Sounds like defect-188 fixing is going well. May I ask how the
> > recovery work is going as well? (Don't mean to be pushy on that front
> > - we have 2 more weeks of coding for our application left and I am
> > really hoping that we are able to put the new recovery code in during
> > that time).
> > 
> > Thanks a bunch,
> > Kristen
> > 
> > -----Original Message-----
> > From: Steven Dake [mailto:sdake at mvista.com]
> > Sent: Friday, February 11, 2005 4:37 PM
> > To: Smith, Kristen [NGC:B675:EXCH]
> > Subject: Re: [Openais] Checkpoint crash in aisexec
> > 
> > 
> > > How large are the read or write requests?
> > > Just a thought: there could be a buffer overrun with larger
> > > requests.
> > 
> > On Fri, 2005-02-11 at 14:55, Kristen Smith wrote:
> > > Steve,
> > > 
> > > We are periodically seeing aisexec crash with the following trace:
> > > 
> > > >         (gdb) bt
> > > >         #0  message_handler_req_lib_ckpt_checkpointclose (conn_info=0x0, message=0xb73fc008) at ckpt.c:1552
> > > >         #1  0x080494c2 in poll_handler_libais_deliver (handle=0, fd=3, revent=134633824, data=0x89c2ad8, prio=0x89b2784) at main.c:578
> > > >         #2  0x08056e62 in poll_run (handle=0) at aispoll.c:386
> > > >         #3  0x080499ac in main (argc=1, argv=0xbfffcb64) at main.c:1003
> > > 
> > > > We have looked through the code but can't seem to figure out how
> > > > conn_info is getting set to 0. Do you have any idea under what
> > > > circumstances conn_info could be null when this function is called?
> > > > 
> > > > This is happening when we have multiple nodes up and we kill one of
> > > > the active nodes. The standby node (which was reading checkpoints)
> > > > must now become a writer, so it closes the checkpoint and this
> > > > happens. Unfortunately, I can't reproduce this consistently - I
> > > > finally got a core dump today. I don't recall ever seeing this with
> > > > the old code.
> > > 
> > > Thanks,
> > > Kristen
> > > 
> > > 
> > > 
> > >
> >
> ______________________________________________________________________
> > > _______________________________________________
> > > Openais mailing list
> > > Openais at lists.osdl.org
> > http://lists.osdl.org/mailman/listinfo/openais
> > 
> > 
> 
> 