[Openais] Checkpoint crash in aisexec

Steven Dake sdake at mvista.com
Wed Feb 16 12:44:15 PST 2005


Sounds great Muni
Looking forward to seeing your patches.

regards
-steve

On Wed, 2005-02-16 at 12:30, Muni Bajpai wrote:
> Ok steve,
> 
> Thanks for the feedback. This is my take on the steps.
> 
> I.) First Patch
>         1.) Move struct memb_ring_id from totemsrp.c to totemsrp.h
>         2.) Move #define MAX_MEMBERS from totemsrp.c to totemsrp.h,
> change the name of the definition to PROCESSOR_COUNT_MAX.
> 
>         3.) Make changes to handlers.h, amf.c, ckpt.c, clm.c, evs.c,
> totemsrp.c, totempg.c
> 
> II.) Second Patch
>         Add support for sync on the ckpt service.
> 
> Thanks
> 
> Muni
> -----Original Message-----
> From: Steven Dake [mailto:sdake at mvista.com] 
> Sent: Wednesday, February 16, 2005 1:02 PM
> To: Bajpai, Muni [NGC:B670:EXCH]
> Cc: openais at lists.osdl.org; Smith, Kristen [NGC:B675:EXCH];
> markh at osdl.org
> Subject: RE: [Openais] Checkpoint crash in aisexec
> 
> 
> Muni
> 
> I responded inline.  I'd suggest if you tackle this problem to try to
> break it up into a few patches to work on seperately.  Ie: the
> configuration change changes required to get the ring id through the
> config change system, and then as a seperate patch the syncronization
> code.
> 
> Thanks
> -steve
> 
> On Wed, 2005-02-16 at 09:38, Muni Bajpai wrote:
> > Thanks for the quick responses last evening. My Response/Queries are
> > inline prepended by a -------------------
> > 
> > Muni
> > 
> > -----Original Message-----
> > From: Steven Dake [mailto:sdake at mvista.com]
> > Sent: Tuesday, February 15, 2005 6:20 PM
> > To: Bajpai, Muni [NGC:B670:EXCH]; openais at lists.osdl.org
> > Cc: Smith, Kristen [NGC:B675:EXCH]; markh at osdl.org
> > Subject: RE: [Openais] Checkpoint crash in aisexec
> > 
> > 
> > Muni
> > I hope you dont mind me copying the openais mailing list so others
> can 
> > share in our exchanges.
> > 
> > Thanks for taking a look at this
> > 
> > Responses inline
> > 
> > On Tue, 2005-02-15 at 14:54, Muni Bajpai wrote:
> > > Hey Steve,
> > > 
> > > I work with kristen and need some more info on the checkpoint
> > recovery
> > > ...
> > > 
> > > 1.) So the logic for accepting a configuration change from a
> > processor
> > > is :
> > >         if ((incoming_ring_id == last_known_ring_id) 
> > >                 && (source_processor != delivering_processor) {
> > > 
> > >                 //IGNORE Change.
> > >         }
> > > 
> > >         So as per my understanding:
> > >         1.) (Ckpt Executive Perspective) If the change is from ME
> > then
> > > always change
> > 
> > maybe I was wrong with what I said before.  Try this logic out:
> > 
> > If the sync message is from your previous configuration, then the 
> > reference counts should not be updated because they would double the
> > reference counts incorrectly.
> > 
> > ------------- So you mean don't care about the source/dest of the
> sync 
> > message for decision making of accepting/ignoring config_chg, just
> use 
> > the ring_id ?
> > 
> 
> Its not the decision to accept the config change callback, its the
> decision to accept the syncronization message.  You should always
> accept the configuration change callback.  But in some cases, the sync
> message should be ignored.
> 
> A member of the synchronization message should be "previous_ring_id"
> which is the ring identifier of the ring previous to the one that is
> currently undergoing recovery.  Keep in mind that it should be the
> last regular configuration, not the transitional configuration.
> 
> The previous ring id is sufficient to determine if the refcount
> increase request would result in an invalid increase.  
> 
> If they match, then the processor is already aware of the
> synchronization contents and should ignore the request.  If they dont
> match, then the processor is unaware of the syncronization contents
> and should accept the request.
> 
> > ?
> > 
> > ------------- Is it possible to get sync's from 2 different
> processors 
> > with the same ring_id ??
> > 
> 
> No this is not possible.  
> 
> The reason is that when determining to send the sync message, the old
> ring id's representative is checked against the local ip.  If they
> match, then the sync message is sent (because this processor is the
> representative).  If they don't match, no sync message is sent
> (because the representative will take care of requesting the
> synchronization message).
> 
> > The sync message is originated from the representative processor 
> > containing the ring id prior to the transitional configuration
> change.
> > 
> > When the message is delivered, it is compared to the ring id prior
> to 
> > the transitional configuration.  If these two match, then the
> message 
> > should be ignored because its a sync message from a processor within
> > the prior configuration.
> > 
> > >         2.) if the ring_id's don't match then always change.
> > > 
> > 
> > Yes if the ring id in the delivered sync message doesn't match the 
> > previous ring id, then add the reference count information for that 
> > processor to the checkpoint.
> > 
> > >         Please confirm.
> > > 
> > > 2.) We must add support for the new data structure additions in
> the
> > > Ckpt Executive Opens and Close handlers also.
> > > 
> > 
> > no data structures are required in the handler prototypes.  I think
> we 
> > need a new message vs open and close.  The message should be
> something 
> > like "synchronizecounts".  I dont want to overload open and close
> too 
> > much with extra meaning.  We could use this synchronizecounts for
> some 
> > other purpose later, like exchanging metadata too.
> > 
> > ------------ So the ckpt_refcount[MAX_MEMBERS] array is modified on 
> > the receipt of sync messages,open and close??
> > 
> 
> Yes ckpt_refcount is modified on open, close, and in some cases on
> sync given the logic above.
> 
> > > 3.) The addition as you enumerated to the checkpoint data
> structure, 
> > > did you have any implementation preferences or did you want us to
> > use
> > > anything appropriates (cursively I was thinking of a list of
> struct
> > > refs)
> > 
> > hmm I have an affinity towards avoiding any sort of memory
> allocation 
> > if at all possible (because they can fail, and this can cause us
> major 
> > troubles).  Maybe something like struct ckpt_refcnt {
> > 
> >         int count;
> >         struct in_addr addr;
> > };
> > 
> > Then somethign like adding to saCkptCheckpoint
> > 
> > struct ckpt_refcount ckpt_refcount[MAX_MEMBERS];
> > 
> > MAX_MEMBERS should probably be brought out fromt otemsrp.c into 
> > totemsrp.h and changed from MAX_MEMBERS to PROCESSOR_COUNT_MAX.
> > 
> > > 
> > > 4.) The last_known_ring_id. What does that mean to a newly added
> > > processor. Explicitly ( incoming_ring_id == last_known_ring_id )
> > will
> > > always fail on a newly commissioned processor. Am I understanding
> > that
> > > correctly ?
> > > 
> > 
> > no not incoming ring id.  Instead it is the processor's last ring id
> > in the originated synchronization message.
> > 
> > last known ring id should be inited to zero.  You understand that
> the 
> > sync message will have some value and last_known_ring_id will be
> zero.
> > 
> > This will force the synchronization message to be accepted which is 
> > desired behavior.
> > 
> > > Where is the last_known_ring_id stored ?
> > > 
> > 
> > it must be stored when a configuration change is delivered to the 
> > ckpt_confchg_fn.
> > 
> > > 5.) Is exec/evt.c the best example for any ideas on implementation
> > ??
> > > 
> > 
> > I don't think evt uses reference counting to track channels, but it
> is 
> > necessary for checkpoints because of checkpoint retention.  I'd
> rather 
> > try to invent a few different approaches here so we can unify them 
> > later once we have discovered the best design.
> > 
> > Synchronization after a merge or partition is the hardest part of a 
> > distributed system and I hope we can find a few approaches to test 
> > out.
> > 
> > > 
> > > Thanks
> > > 
> > > Muni
> > > 
> > > -----Original Message-----
> > > From: Steven Dake [mailto:sdake at mvista.com]
> > > Sent: Tuesday, February 15, 2005 1:51 PM
> > > To: Smith, Kristen [NGC:B675:EXCH]
> > > Cc: markh at osdl.org; openais at lists.osdl.org; Bajpai, Muni 
> > > [NGC:B670:EXCH]
> > > Subject: RE: [Openais] Checkpoint crash in aisexec
> > > 
> > > 
> > > On Tue, 2005-02-15 at 09:47, Kristen Smith wrote:
> > > > Steve,
> > > > 
> > > > Thanks for the response - I hear ya loud and clear - not good
> > > without
> > > > recovery. So, is there something that we could do to help you
> with 
> > > > this recovery coding? If you had some type of design thoughts on
> > how
> > > > you wanted checkpoint recovery to occur, maybe that is something
> > we
> > > > could help out with. Just throwing this out there to see what
> you
> > > > think.
> > > > 
> > > 
> > > Kristen
> > > You have done alot to help us so far but more help is always
> > > appreciated
> > > :)
> > > 
> > > If someone from your org wanted to get started writing code for
> > > checkpoint recovery that would be great!  I spent some time in the
> > > drive to work this morning thinking about how checkpoint recovery 
> > > should work:
> > > 
> > > There are 3 main steps that should be done in order:
> > > 1. synchronize checkpoint reference counts (so retention timers
> work
> > > properly)
> > > 2. synchronize checkpoint metadata contents (sizes, sections, etc)
> > 2.
> > > synchronize checkpoint section data contents
> > > 
> > > The place to get started is on the reference count
> synchronization.
> > > 
> > > The checkpoint must contain a list of active user's processor ids
> > > along with their reference count.  So if processor A has
> checkpoint
> > 1
> > > open twice, and processor B has checkpoint 1 open three times, and
> > > processor C has checkpoint 1 open four times each processor would 
> > > maintain a list for the checkpoint (in the checkpoint data
> > structure):
> > > 
> > > p_A:r_2
> > > p_B:r_3
> > > p_C:r_4
> > > 
> > > Then on a configuration change, the leaving processors would close
> > > their reference counts.  So in this example, p_B leaves then the 
> > > processor ref count looks like: p_A:r_2 p_C:r_4
> > > 
> > > During this configuration change, a processor joins p_D.  It has
> > > checkpoint 1 open 1 time.  p_D gets a configuration change {add
> p_A,
> > > p_C} and then sends a synchronization message with its previous
> ring
> > > identifier and current list of checkpoint reference counts (after
> > the
> > > above leave in the configuration change was processed).  The
> > > representative of {p_A, p_C} also sends a synchronization message
> > with
> > > the previous ring identifier and a current list of checkpoint
> > > reference counts.  If the previous ring identifiers match and the 
> > > sending processor is not the delivering processor then p_C should 
> > > ignore p_A's message (ie: p_C receives p_A message, but it already
> > > knows about p_A's references).
> > > 
> > > This requires us to add the ring identifier to the configuration
> > > change.
> > > 
> > > So now each previous configuration is aware of the new
> > configuration.
> > > The reference counts look like:
> > > p_A:r_2
> > > p_C:r_4
> > > p_D:r_1
> > > 
> > > The above maintenence of the reference counts, or open
> checkpoints,
> > > must maintain a per-checkpoint variable which is the "reference
> > count
> > > for this checkpoint".  In the last case, that reference count
> would
> > be
> > > 7.
> > > 
> > > Each time a processor leaves, its reference counts are subtracted
> > from
> > > this "global ref count".  Each time a processor is added, its
> > > reference counts are added.  This reference count is then what is
> > used
> > > for retention duration.
> > > 
> > > Any thoughts on the above approach welcome.
> > > 
> > > Thanks!
> > > -steve
> > > 
> > > > Thanks,
> > > > Kristen
> > > > 
> > > > -----Original Message-----
> > > > From: Steven Dake [mailto:sdake at mvista.com]
> > > > Sent: Monday, February 14, 2005 2:17 PM
> > > > To: Smith, Kristen [NGC:B675:EXCH]; markh at osdl.org;
> > > > openais at lists.osdl.org
> > > > Cc: Bajpai, Muni [NGC:B670:EXCH]
> > > > Subject: RE: [Openais] Checkpoint crash in aisexec
> > > > 
> > > > 
> > > > On Sat, 2005-02-12 at 08:08, Kristen Smith wrote:
> > > > > Steve,
> > > > > 
> > > > > Thanks for the response.
> > > > > 
> > > > > For recovery - what are the ramifications if we don't have
> > > recovery
> > > > > working 100%? What I see now is that when a node leaves the
> > > cluster
> > > > > and then rejoins, it receives evt messages, but it can take
> > > anywhere
> > > > > from 15seconds to minutes for evt messages sent from that node
> > to
> > > > > reach the other applications. I handle this with some
> > > > 
> > > > Mark have you seen this issue?
> > > > 
> > > > > message retries which is ok in this startup case. However, are
> > we
> > > in
> > > > > jeopardy in other cases that I am not considering? When
> running 
> > > > > traffic the past few days and seeing periodic reconfigs, I
> don't
> > > > seem
> > > > > to be losing messages when that occurs - I only see the lost
> > > > messages
> > > > > when I actually kill a node and start it back up to rejoin the
> > > > > cluster.
> > > > > 
> > > > 
> > > > What we have today is totally unacceptable because atleast for 
> > > > checkpointing, there is no recovery.  And Mark is waiting on my
> > base
> > > > code for event recovery.
> > > > 
> > > > Definition of 100% working means if there is a failure during 
> > > > recovery, we are guaranteed a consistent state.  I think evt is
> > > pretty
> > > > close to this goal, although the checkpoint replication after
> > merge
> > > > has not been developed yet.  I can think of alot of easy ways to
> > do
> > > > this, but handling a failure during the recovery phase makes it
> > more
> > > > difficult.
> > > > 
> > > > Definition of almost 100% is that recovery works properly if
> there
> > > are
> > > > no faults during recovery (ie: the merge process), but if there
> is
> > a
> > > > fault during recovery (ie: reconfig) something could go awry.
> > > > 
> > > > We want consistently replicated data (the 100% case).  100% is 
> > > > probably past your development window; the other case is within
> > > reach.
> > > > 
> > > > Regards
> > > > -steve
> > > > 
> > > > > Thanks
> > > > > Kristen
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Steven Dake [mailto:sdake at mvista.com]
> > > > > Sent: Friday, February 11, 2005 5:30 PM
> > > > > To: Smith, Kristen [NGC:B675:EXCH]
> > > > > Subject: RE: [Openais] Checkpoint crash in aisexec
> > > > > 
> > > > > 
> > > > > Ok well I doubt with 200 byte checkpoints there is a buffer
> > > > overflow.
> > > > > :)
> > > > > 
> > > > > Recovery will come after 188 is wrapped up.  I think your two
> > > weeks
> > > > > window looks good for alpha-level recovery (ie: works most of
> > the
> > > > > time).  High quality production recovery will not hit your
> > window
> > > > for
> > > > > development (ie: works 100% of the time no matter what
> happens).
> > > > > 
> > > > > Thanks
> > > > > -steve
> > > > > 
> > > > > On Fri, 2005-02-11 at 15:56, Kristen Smith wrote:
> > > > > > Steve,
> > > > > > 
> > > > > > The size of the checkpoints are ~200 bytes.
> > > > > > 
> > > > > > I agree, valgrind is an excellent tool. We will run it
> through
> > > and
> > > > > see
> > > > > > if that shows anything.
> > > > > > 
> > > > > > I have tried this scenario maybe 30 times today (for various
> > > other
> > > > > > testing) and it happened maybe 10 times. For a while I could
> > > > > reproduce
> > > > > > with a given test about 5 times and then it hasn't happened
> > > again.
> > > > > > 
> > > > > > Sounds like defect-188 fixing is going well. May I ask how
> the 
> > > > > > recovery work is going as well? (Don't mean to be pushy on
> > that
> > > > > front
> > > > > > - we have 2 more weeks of coding for our application left
> and
> > I
> > > am
> > > > > > really hoping that we are able to put the new recovery code
> in
> > > > > during
> > > > > > that time).
> > > > > > 
> > > > > > Thanks a bunch,
> > > > > > Kristen
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Steven Dake [mailto:sdake at mvista.com]
> > > > > > Sent: Friday, February 11, 2005 4:37 PM
> > > > > > To: Smith, Kristen [NGC:B675:EXCH]
> > > > > > Subject: Re: [Openais] Checkpoint crash in aisexec
> > > > > > 
> > > > > > 
> > > > > > how large are the read or write requests?
> > > > > > just a thought there could be some buffer overrun with
> larger 
> > > > > > requests.
> > > > > > 
> > > > > > On Fri, 2005-02-11 at 14:55, Kristen Smith wrote:
> > > > > > > Steve,
> > > > > > > 
> > > > > > > We are periodically seeing aisexec crash with the
> following
> > > > trace:
> > > > > > > 
> > > > > > >         (gdb) bt
> > > > > > >         #0  message_handler_req_lib_ckpt_checkpointclose
> > > > > > >         (conn_info=0x0, message=0xb73fc008) at ckpt.c:1552
> > > > > > >         #1  0x080494c2 in poll_handler_libais_deliver
> > > (handle=0,
> > > > > > fd=3,
> > > > > > >         revent=134633824, data=0x89c2ad8,
> > > > > > >             prio=0x89b2784) at main.c:578
> > > > > > >         #2  0x08056e62 in poll_run (handle=0) at
> > aispoll.c:386
> > > > > > > 
> > > > > > > 
> > > > > > > #3  0x080499ac in main (argc=1, argv=0xbfffcb64) at
> > > main.c:1003
> > > > > > > 
> > > > > > > We have looked through the code but can't seem to figure
> out
> > > how
> > > > > > > conn_info is getting set to 0. Do you have any idea under
> > what
> > > > > > > circumstances conn_info could be null when this function
> is
> > > > > called?
> > > > > > > 
> > > > > > > This is happening when we have multiple nodes up and we
> kill
> > > one
> > > > > of
> > > > > > > the active nodes. The standby node (which was reading
> > > > checkpoints)
> > > > > > > must now become a writer, so it closes the checkpoint and
> > this
> > > > > > > happens. Unfortunately, I can't reproduce this
> consistently
> > -
> > > I
> > > > > > > finally got a core dump today. I don't recall ever seeing
> > this
> > > > > with
> > > > > > > the old code.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Kristen
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> ______________________________________________________________________
> > > > > > > _______________________________________________
> > > > > > > Openais mailing list
> > > > > > > Openais at lists.osdl.org
> > > > > > http://lists.osdl.org/mailman/listinfo/openais
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> 
> 




More information about the Openais mailing list