[Openais] Checkpoint crash in aisexec

Steven Dake sdake at mvista.com
Tue Feb 15 11:50:33 PST 2005


On Tue, 2005-02-15 at 09:47, Kristen Smith wrote:
> Steve,
> 
> Thanks for the response - I hear ya loud and clear - not good without
> recovery. So, is there something that we could do to help you with
> this recovery coding? If you had some type of design thoughts on how
> you wanted checkpoint recovery to occur, maybe that is something we
> could help out with. Just throwing this out there to see what you
> think.
> 

Kristen
You have done a lot to help us so far, but more help is always
appreciated :)

If someone from your org wanted to get started writing code for
checkpoint recovery, that would be great!  I spent some time on the drive
to work this morning thinking about how checkpoint recovery should work:

There are 3 main steps that should be done in order:
1. synchronize checkpoint reference counts (so retention timers work
properly)
2. synchronize checkpoint metadata contents (sizes, sections, etc)
3. synchronize checkpoint section data contents

The place to get started is on the reference count synchronization.

The checkpoint must contain a list of the processor ids of its active
users, along with each processor's reference count.  So if processor A
has checkpoint 1 open twice, processor B has checkpoint 1 open three
times, and processor C has checkpoint 1 open four times, each processor
would maintain a list for the checkpoint (in the checkpoint data
structure):

p_A:r_2
p_B:r_3
p_C:r_4
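
To make the list concrete, here is a rough sketch in C of what such a
per-checkpoint table could look like.  The struct names, the fixed table
size, and the helper functions are all made up for illustration; none of
them exist in ckpt.c today.

```c
#include <assert.h>
#include <stddef.h>

#define CKPT_MAX_PROCESSORS 16 /* hypothetical limit, only for this sketch */

/* One entry per processor that currently has the checkpoint open. */
struct ckpt_proc_ref {
    unsigned int proc_id;   /* processor (node) identifier */
    unsigned int ref_count; /* number of opens held by that processor */
};

/* Hypothetical table that would live inside the checkpoint structure. */
struct ckpt_ref_table {
    struct ckpt_proc_ref refs[CKPT_MAX_PROCESSORS];
    size_t count;
};

/* Record one open by proc_id. Returns 0 on success, -1 if the table is full. */
static int ckpt_ref_open(struct ckpt_ref_table *t, unsigned int proc_id)
{
    size_t i;

    assert(t != NULL);
    for (i = 0; i < t->count; i++) {
        if (t->refs[i].proc_id == proc_id) {
            t->refs[i].ref_count++;
            return 0;
        }
    }
    if (t->count == CKPT_MAX_PROCESSORS)
        return -1;
    t->refs[t->count].proc_id = proc_id;
    t->refs[t->count].ref_count = 1;
    t->count++;
    return 0;
}

/* Return the number of opens held by proc_id, or 0 if it holds none. */
static unsigned int ckpt_ref_get(const struct ckpt_ref_table *t,
                                 unsigned int proc_id)
{
    size_t i;

    assert(t != NULL);
    for (i = 0; i < t->count; i++) {
        if (t->refs[i].proc_id == proc_id)
            return t->refs[i].ref_count;
    }
    return 0;
}
```

Opening the checkpoint twice from p_A, three times from p_B and four
times from p_C through ckpt_ref_open() would produce the three entries
listed above.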

Then on a configuration change, the reference counts held by the leaving
processors are dropped.  So in this example, if p_B leaves, the
processor ref counts look like:
p_A:r_2
p_C:r_4
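
The leave handling could be sketched roughly as below.  This assumes the
configuration change hands us an array of the processor ids that left;
the function and struct names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct proc_ref {
    unsigned int proc_id;   /* processor (node) identifier */
    unsigned int ref_count; /* opens held by that processor */
};

/*
 * Remove every entry whose processor appears in left[], compacting the
 * array in place. Returns the number of entries that remain.
 */
static size_t refs_drop_left(struct proc_ref *refs, size_t count,
                             const unsigned int *left, size_t left_count)
{
    size_t in, out = 0, j;

    assert(refs != NULL || count == 0);
    for (in = 0; in < count; in++) {
        int gone = 0;

        for (j = 0; j < left_count; j++) {
            if (refs[in].proc_id == left[j]) {
                gone = 1;
                break;
            }
        }
        if (!gone)
            refs[out++] = refs[in]; /* keep entries of surviving processors */
    }
    return out;
}
```

With the three entries from the earlier example and a left list holding
only p_B, two entries remain: p_A with 2 and p_C with 4.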

During this configuration change, a processor p_D joins.  It has
checkpoint 1 open once.  p_D gets a configuration change {add p_A,
p_C} and then sends a synchronization message with its previous ring
identifier and its current list of checkpoint reference counts (after
the above leave in the configuration change was processed).  The
representative of {p_A, p_C} also sends a synchronization message with
the previous ring identifier and a current list of checkpoint reference
counts.  If the previous ring identifier in a message matches the
receiver's own previous ring identifier and the sending processor is not
the delivering processor, then the receiver should ignore the message
(ie: p_C receives p_A's message, but it already knows about p_A's
references).

This requires us to add the ring identifier to the configuration change.
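
A small sketch of that filtering rule in C follows.  The ring identifier
layout and the message header are assumptions made up for this example,
not the real totem or ckpt structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ring identifier: representative id plus a sequence number. */
struct ring_id {
    unsigned int rep;
    unsigned long long seq;
};

/* Hypothetical header of a reference count synchronization message. */
struct sync_msg_hdr {
    unsigned int sender_id;   /* processor that sent the message */
    struct ring_id prev_ring; /* ring the sender was in before the change */
};

static bool ring_id_equal(const struct ring_id *a, const struct ring_id *b)
{
    return a->rep == b->rep && a->seq == b->seq;
}

/*
 * Rule from the text: ignore the message when the sender came from the
 * same previous ring as we did and the sender is not ourselves, since we
 * already know that sender's references. How a processor treats its own
 * message is not covered by the text and is left to the caller.
 */
static bool sync_msg_should_ignore(const struct sync_msg_hdr *msg,
                                   unsigned int my_id,
                                   const struct ring_id *my_prev_ring)
{
    assert(msg != NULL && my_prev_ring != NULL);
    return msg->sender_id != my_id &&
           ring_id_equal(&msg->prev_ring, my_prev_ring);
}
```

In the example, p_C would ignore the message from p_A (same previous
ring) but would apply the one from p_D, whose previous ring differs.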

So now each previous configuration is aware of the new configuration. 
The reference counts look like:
p_A:r_2
p_C:r_4
p_D:r_1

The above maintenance of the reference counts, or open checkpoints, must
also maintain a per-checkpoint variable which is the "reference count
for this checkpoint".  In the last case, that reference count would be 7
(2 + 4 + 1).  Each time a processor leaves, its reference counts are
subtracted from this "global ref count".  Each time a processor is
added, its reference counts are added.  This reference count is then
what is used for retention duration.
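
As a rough sketch, the bookkeeping for this global count could look like
the following.  The names are hypothetical, and a real version would
live next to the per-processor list in the checkpoint structure.

```c
#include <assert.h>
#include <stddef.h>

struct proc_ref {
    unsigned int proc_id;   /* processor (node) identifier */
    unsigned int ref_count; /* opens held by that processor */
};

/* Recompute the global count from scratch by summing every entry. */
static unsigned int refs_total(const struct proc_ref *refs, size_t count)
{
    unsigned int total = 0;
    size_t i;

    for (i = 0; i < count; i++)
        total += refs[i].ref_count;
    return total;
}

/* Incremental update when a processor holding leaving_refs opens leaves. */
static unsigned int total_after_leave(unsigned int total,
                                      unsigned int leaving_refs)
{
    assert(leaving_refs <= total); /* a leaver cannot hold more than the total */
    return total - leaving_refs;
}

/* Incremental update when a processor holding joining_refs opens joins. */
static unsigned int total_after_join(unsigned int total,
                                     unsigned int joining_refs)
{
    return total + joining_refs;
}
```

Starting from 9 (2 + 3 + 4), p_B leaving with 3 gives 6, and p_D joining
with 1 gives 7, which matches summing the final list directly.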

Any thoughts on the above approach welcome.

Thanks!
-steve

> Thanks,
> Kristen 
> 
> -----Original Message-----
> From: Steven Dake [mailto:sdake at mvista.com] 
> Sent: Monday, February 14, 2005 2:17 PM
> To: Smith, Kristen [NGC:B675:EXCH]; markh at osdl.org;
> openais at lists.osdl.org
> Cc: Bajpai, Muni [NGC:B670:EXCH]
> Subject: RE: [Openais] Checkpoint crash in aisexec
> 
> 
> On Sat, 2005-02-12 at 08:08, Kristen Smith wrote:
> > Steve,
> > 
> > Thanks for the response.
> > 
> > For recovery - what are the ramifications if we don't have recovery 
> > working 100%? What I see now is that when a node leaves the cluster 
> > and then rejoins, it receives evt messages, but it can take anywhere
> > from 15 seconds to minutes for evt messages sent from that node to 
> > reach the other applications. I handle this with some
> 
> Mark have you seen this issue?
> 
> > message retries which is ok in this startup case. However, are we in
> > jeopardy in other cases that I am not considering? When running
> > traffic the past few days and seeing periodic reconfigs, I don't seem
> > to be losing messages when that occurs - I only see the lost messages
> > when I actually kill a node and start it back up to rejoin the
> > cluster.
> > 
> 
> What we have today is totally unacceptable because, at least for
> checkpointing, there is no recovery.  And Mark is waiting on my base
> code for event recovery.
> 
> Definition of 100% working means if there is a failure during
> recovery, we are guaranteed a consistent state.  I think evt is pretty
> close to this goal, although the checkpoint replication after merge
> has not been developed yet.  I can think of a lot of easy ways to do
> this, but handling a failure during the recovery phase makes it more
> difficult.
> 
> Definition of almost 100% is that recovery works properly if there are
> no faults during recovery (ie: the merge process), but if there is a
> fault during recovery (ie: reconfig) something could go awry.
> 
> We want consistently replicated data (the 100% case).  100% is
> probably past your development window; the other case is within reach.
> 
> Regards
> -steve
> 
> > Thanks
> > Kristen
> > 
> > -----Original Message-----
> > From: Steven Dake [mailto:sdake at mvista.com]
> > Sent: Friday, February 11, 2005 5:30 PM
> > To: Smith, Kristen [NGC:B675:EXCH]
> > Subject: RE: [Openais] Checkpoint crash in aisexec
> > 
> > 
> > Ok well I doubt with 200 byte checkpoints there is a buffer overflow.
> > :)
> > 
> > Recovery will come after 188 is wrapped up.  I think your two weeks
> > window looks good for alpha-level recovery (ie: works most of the
> > time).  High quality production recovery will not hit your window for
> > development (ie: works 100% of the time no matter what happens).
> > 
> > Thanks
> > -steve
> > 
> > On Fri, 2005-02-11 at 15:56, Kristen Smith wrote:
> > > Steve,
> > > 
> > > The size of the checkpoints is ~200 bytes.
> > > 
> > > I agree, valgrind is an excellent tool. We will run it through and
> > > see if that shows anything.
> > > 
> > > I have tried this scenario maybe 30 times today (for various other
> > > testing) and it happened maybe 10 times. For a while I could
> > > reproduce with a given test about 5 times and then it hasn't
> > > happened again.
> > > 
> > > Sounds like defect-188 fixing is going well. May I ask how the
> > > recovery work is going as well? (Don't mean to be pushy on that
> > > front - we have 2 more weeks of coding for our application left and
> > > I am really hoping that we are able to put the new recovery code in
> > > during that time).
> > > 
> > > Thanks a bunch,
> > > Kristen
> > > 
> > > -----Original Message-----
> > > From: Steven Dake [mailto:sdake at mvista.com]
> > > Sent: Friday, February 11, 2005 4:37 PM
> > > To: Smith, Kristen [NGC:B675:EXCH]
> > > Subject: Re: [Openais] Checkpoint crash in aisexec
> > > 
> > > 
> > > how large are the read or write requests?
> > > just a thought there could be some buffer overrun with larger
> > > requests.
> > > 
> > > On Fri, 2005-02-11 at 14:55, Kristen Smith wrote:
> > > > Steve,
> > > > 
> > > > We are periodically seeing aisexec crash with the following trace:
> > > > 
> > > >         (gdb) bt
> > > >         #0  message_handler_req_lib_ckpt_checkpointclose
> > > >             (conn_info=0x0, message=0xb73fc008) at ckpt.c:1552
> > > >         #1  0x080494c2 in poll_handler_libais_deliver (handle=0, fd=3,
> > > >             revent=134633824, data=0x89c2ad8, prio=0x89b2784)
> > > >             at main.c:578
> > > >         #2  0x08056e62 in poll_run (handle=0) at aispoll.c:386
> > > >         #3  0x080499ac in main (argc=1, argv=0xbfffcb64) at main.c:1003
> > > > 
> > > > We have looked through the code but can't seem to figure out how
> > > > conn_info is getting set to 0. Do you have any idea under what
> > > > circumstances conn_info could be null when this function is
> > > > called?
> > > > 
> > > > This is happening when we have multiple nodes up and we kill one
> > > > of the active nodes. The standby node (which was reading
> > > > checkpoints) must now become a writer, so it closes the checkpoint
> > > > and this happens. Unfortunately, I can't reproduce this
> > > > consistently - I finally got a core dump today. I don't recall ever
> > > > seeing this with the old code.
> > > > 
> > > > Thanks,
> > > > Kristen
> > > > 



