[Openais] Checkpoint crash in aisexec

Steven Dake sdake at mvista.com
Tue Feb 15 14:56:03 PST 2005


Muni

The ring id will be delivered as part of the confchg_fn callback from
totemsrp.c.  From the checkpoint perspective, this configuration change
will be delivered as a parameter to ckpt.c:ckpt_confchg_fn().

So we will change ckpt_confchg_fn to:
static int ckpt_confchg_fn (
    struct memb_ring_id *ring_id,
    enum totempg_configuration_type configuration_type,
    struct in_addr *member_list, void *member_list_private,
        int member_list_entries,
    struct in_addr *left_list, void *left_list_private,
        int left_list_entries,
    struct in_addr *joined_list, void *joined_list_private,
        int joined_list_entries)

To do that the following changes have to be made:

1. handlers.h has to be modified to define this new parameter
2. every service must be updated to match the new definition in handlers.h
3. all callbacks that deliver the confchg_fn must be modified to deliver
the ring id in addition.  The functions are main.c:confchg_fn,
totempg.c:totempg_confchg_fn, totemsrp.c:totemsrp_confchg_fn and
totemsrp_initialize.
4. struct memb_ring_id must be moved from totemsrp.c to totemsrp.h so
others know how to read its contents.
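
A minimal sketch of what items 3 and 4 enable on the checkpoint side:
once the ring id arrives in ckpt_confchg_fn(), the service can keep the
previous and current ring ids for later comparison.  The layout of
struct memb_ring_id shown here (representative address plus sequence
number) and the names ring_id_equal, ckpt_ring_id_save,
ckpt_previous_ring_id and ckpt_current_ring_id are assumptions for
illustration, not the actual totemsrp.h contents.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>

/* Assumed layout; the real definition currently lives in totemsrp.c. */
struct memb_ring_id {
    struct in_addr rep;        /* ring representative (assumed field) */
    unsigned long long seq;    /* ring sequence number (assumed field) */
};

/* Hypothetical per-service state: previous and current ring ids. */
static struct memb_ring_id ckpt_previous_ring_id;
static struct memb_ring_id ckpt_current_ring_id;

/* Compare field by field; memcmp could trip over struct padding. */
static int ring_id_equal(const struct memb_ring_id *a,
                         const struct memb_ring_id *b)
{
    assert(a != NULL && b != NULL);
    return a->rep.s_addr == b->rep.s_addr && a->seq == b->seq;
}

/* Hypothetical helper, meant to be called from ckpt_confchg_fn()
 * with the new ring_id parameter. */
static void ckpt_ring_id_save(const struct memb_ring_id *ring_id)
{
    assert(ring_id != NULL);
    ckpt_previous_ring_id = ckpt_current_ring_id;
    ckpt_current_ring_id = *ring_id;
}
```

Keeping both ids means the service still knows which ring it came from
after the new configuration has been delivered.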

This is a good first patch to get familiar with the flow of the
configuration change delivery.

Regards
-steve

> For 5.) I see that you store it on disk.
> sprintf (filename, "/tmp/ringid_%s",inet_ntoa (my_id.sin_addr));
> 
> Should I access that from ckpt.c via memb_ring_id_create_or_load()? I
> see that it is not declared in totemsrp.h. Is it not meant to be
> accessed from outside? If not, what would you suggest?
> 
> Thanks
> 
> Muni
> 
> -----Original Message-----
> From: Bajpai, Muni [NGC:B670:EXCH] 
> Sent: Tuesday, February 15, 2005 3:54 PM
> To: 'sdake at mvista.com'; Smith, Kristen [NGC:B675:EXCH]
> Cc: markh at osdl.org
> Subject: RE: [Openais] Checkpoint crash in aisexec
> 
> 
> Hey Steve,
> 
> I work with Kristen and need some more info on the checkpoint recovery
> ...
> 
> 1.) So the logic for accepting a configuration change from a processor
> is :
>         if ((incoming_ring_id == last_known_ring_id)
>                 && (source_processor != delivering_processor)) {
> 
>                 //IGNORE Change.
>         }
> 
>         So as per my understanding:
>         1.) (Ckpt Executive Perspective) If the change is from ME then
>             always change.
>         2.) If the ring ids don't match then always change.
> 
>         Please confirm.
> 
> 2.) We must add support for the new data structure additions in the
> Ckpt Executive Open and Close handlers also.
> 
> 3.) For the additions you enumerated to the checkpoint data structure,
> did you have any implementation preferences, or did you want us to use
> anything appropriate? (Cursorily, I was thinking of a list of struct
> refs.)
> 
> 4.) The last_known_ring_id: what does that mean to a newly added
> processor? Explicitly, (incoming_ring_id == last_known_ring_id) will
> always fail on a newly commissioned processor. Am I understanding that
> correctly? Where is the last_known_ring_id stored?
> 
> 5.) Is exec/evt.c the best example for any ideas on implementation ??
> 
> 
> Thanks
> 
> Muni
> 
> -----Original Message-----
> From: Steven Dake [mailto:sdake at mvista.com] 
> Sent: Tuesday, February 15, 2005 1:51 PM
> To: Smith, Kristen [NGC:B675:EXCH]
> Cc: markh at osdl.org; openais at lists.osdl.org; Bajpai, Muni
> [NGC:B670:EXCH]
> Subject: RE: [Openais] Checkpoint crash in aisexec
> 
> 
> On Tue, 2005-02-15 at 09:47, Kristen Smith wrote:
> > Steve,
> > 
> > Thanks for the response - I hear ya loud and clear - not good without
> > recovery. So, is there something that we could do to help you with
> > this recovery coding? If you had some type of design thoughts on how
> > you wanted checkpoint recovery to occur, maybe that is something we
> > could help out with. Just throwing this out there to see what you
> > think.
> > 
> 
> Kristen
> You have done a lot to help us so far but more help is always
> appreciated :)
> 
> If someone from your org wanted to get started writing code for
> checkpoint recovery that would be great!  I spent some time in the
> drive to work this morning thinking about how checkpoint recovery
> should work:
> 
> There are 3 main steps that should be done in order:
> 1. synchronize checkpoint reference counts (so retention timers work
> properly)
> 2. synchronize checkpoint metadata contents (sizes, sections, etc)
> 3. synchronize checkpoint section data contents
> 
> The place to get started is on the reference count synchronization.
> 
> The checkpoint must contain a list of active user's processor ids
> along with their reference count.  So if processor A has checkpoint 1
> open twice, and processor B has checkpoint 1 open three times, and
> processor C has checkpoint 1 open four times each processor would
> maintain a list for the checkpoint (in the checkpoint data structure):
> 
> p_A:r_2
> p_B:r_3
> p_C:r_4
> 
> Then on a configuration change, the leaving processors would close
> their reference counts.  So in this example, if p_B leaves, the
> processor ref count list looks like:
> p_A:r_2
> p_C:r_4
> 
> During this configuration change, a processor p_D joins.  It has
> checkpoint 1 open 1 time.  p_D gets a configuration change {add p_A,
> p_C} and then sends a synchronization message with its previous ring
> identifier and current list of checkpoint reference counts (after the
> above leave in the configuration change was processed).  The
> representative of {p_A, p_C} also sends a synchronization message with
> the previous ring identifier and a current list of checkpoint
> reference counts.  If the previous ring identifiers match and the
> sending processor is not the delivering processor then p_C should
> ignore p_A's message (ie: p_C receives p_A message, but it already
> knows about p_A's references).
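
A minimal sketch of the ignore rule described above, assuming each
synchronization message carries the sender's address and its previous
ring id.  The message layout (struct ckpt_sync_header), the assumed
fields of struct memb_ring_id, and the function name
ckpt_sync_should_apply are hypothetical and only illustrate the
comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>

/* Assumed layout of the ring id (representative + sequence number). */
struct memb_ring_id {
    struct in_addr rep;
    unsigned long long seq;
};

/* Hypothetical header of a checkpoint synchronization message. */
struct ckpt_sync_header {
    struct in_addr source;                 /* sending processor */
    struct memb_ring_id previous_ring_id;  /* ring the sender came from */
};

/* Returns 1 if the receiver should apply the sender's reference
 * counts, 0 if the message should be ignored. */
static int ckpt_sync_should_apply(const struct ckpt_sync_header *msg,
                                  const struct in_addr *my_addr,
                                  const struct memb_ring_id *my_previous_ring_id)
{
    int same_ring;
    int from_self;

    assert(msg != NULL && my_addr != NULL && my_previous_ring_id != NULL);

    same_ring =
        msg->previous_ring_id.rep.s_addr == my_previous_ring_id->rep.s_addr &&
        msg->previous_ring_id.seq == my_previous_ring_id->seq;
    from_self = msg->source.s_addr == my_addr->s_addr;

    /* Ignore only when the sender shared our previous ring and is a
     * different processor: we already know its references. */
    return !(same_ring && !from_self);
}
```

In the example above, p_C would ignore the message from p_A (same
previous ring, different sender) but apply the one from p_D (different
previous ring).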
> 
> This requires us to add the ring identifier to the configuration
> change.
> 
> So now each previous configuration is aware of the new configuration. 
> The reference counts look like:
> p_A:r_2
> p_C:r_4
> p_D:r_1
> 
> The above maintenance of the reference counts, or open checkpoints,
> must also maintain a per-checkpoint variable which is the "reference
> count for this checkpoint".  In the last case, that reference count
> would be 7.
> 
> Each time a processor leaves, its reference counts are subtracted from
> this "global ref count".  Each time a processor is added, its
> reference counts are added.  This reference count is then what is used
> for retention duration.
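
The bookkeeping above can be sketched roughly as follows.  This is an
illustration only: the fixed-size array, the integer processor ids, and
the names ckpt_refcounts, ckpt_refcounts_join and ckpt_refcounts_leave
are assumptions, and a real implementation would key entries by
processor address and live inside the checkpoint data structure.

```c
#include <assert.h>
#include <stddef.h>

#define CKPT_MAX_PROCESSORS 16  /* assumed bound, for the sketch only */

struct ckpt_refcount_entry {
    unsigned int processor_id;  /* stand-in for the processor address */
    unsigned int count;         /* opens held by that processor */
};

struct ckpt_refcounts {
    struct ckpt_refcount_entry entries[CKPT_MAX_PROCESSORS];
    size_t entry_count;
    unsigned int total;         /* the "global ref count" */
};

/* Add or replace a processor's count. Returns 0 on success, -1 if full. */
static int ckpt_refcounts_join(struct ckpt_refcounts *r,
                               unsigned int processor_id, unsigned int count)
{
    size_t i;

    assert(r != NULL);
    for (i = 0; i < r->entry_count; i++) {
        if (r->entries[i].processor_id == processor_id) {
            r->total = r->total - r->entries[i].count + count;
            r->entries[i].count = count;
            return 0;
        }
    }
    if (r->entry_count == CKPT_MAX_PROCESSORS) {
        return -1;
    }
    r->entries[r->entry_count].processor_id = processor_id;
    r->entries[r->entry_count].count = count;
    r->entry_count++;
    r->total += count;
    return 0;
}

/* Drop a leaving processor and subtract its count from the total. */
static void ckpt_refcounts_leave(struct ckpt_refcounts *r,
                                 unsigned int processor_id)
{
    size_t i;

    assert(r != NULL);
    for (i = 0; i < r->entry_count; i++) {
        if (r->entries[i].processor_id == processor_id) {
            r->total -= r->entries[i].count;
            r->entries[i] = r->entries[r->entry_count - 1];
            r->entry_count--;
            return;
        }
    }
}
```

Replaying the example: joining A with 2, B with 3 and C with 4 gives a
total of 9; after B leaves it is 6; after D joins with 1 it is 7.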
> 
> Any thoughts on the above approach welcome.
> 
> Thanks!
> -steve
> 
> > Thanks,
> > Kristen
> > 
> > -----Original Message-----
> > From: Steven Dake [mailto:sdake at mvista.com]
> > Sent: Monday, February 14, 2005 2:17 PM
> > To: Smith, Kristen [NGC:B675:EXCH]; markh at osdl.org;
> > openais at lists.osdl.org
> > Cc: Bajpai, Muni [NGC:B670:EXCH]
> > Subject: RE: [Openais] Checkpoint crash in aisexec
> > 
> > 
> > On Sat, 2005-02-12 at 08:08, Kristen Smith wrote:
> > > Steve,
> > > 
> > > Thanks for the response.
> > > 
> > > For recovery - what are the ramifications if we don't have recovery
> > > working 100%? What I see now is that when a node leaves the cluster
> > > and then rejoins, it receives evt messages, but it can take anywhere
> > > from 15 seconds to minutes for evt messages sent from that node to
> > > reach the other applications. I handle this with some
> > 
> > Mark have you seen this issue?
> > 
> > > message retries which is ok in this startup case. However, are we in
> > > jeopardy in other cases that I am not considering? When running
> > > traffic the past few days and seeing periodic reconfigs, I don't seem
> > > to be losing messages when that occurs - I only see the lost messages
> > > when I actually kill a node and start it back up to rejoin the
> > > cluster.
> > > 
> > 
> > What we have today is totally unacceptable because, at least for
> > checkpointing, there is no recovery.  And Mark is waiting on my base
> > code for event recovery.
> > 
> > Definition of 100% working means if there is a failure during
> > recovery, we are guaranteed a consistent state.  I think evt is pretty
> > close to this goal, although the checkpoint replication after merge
> > has not been developed yet.  I can think of a lot of easy ways to do
> > this, but handling a failure during the recovery phase makes it more
> > difficult.
> > 
> > Definition of almost 100% is that recovery works properly if there are
> > no faults during recovery (ie: the merge process), but if there is a
> > fault during recovery (ie: reconfig) something could go awry.
> > 
> > We want consistently replicated data (the 100% case).  100% is
> > probably past your development window; the other case is within reach.
> > 
> > Regards
> > -steve
> > 
> > > Thanks
> > > Kristen
> > > 
> > > -----Original Message-----
> > > From: Steven Dake [mailto:sdake at mvista.com]
> > > Sent: Friday, February 11, 2005 5:30 PM
> > > To: Smith, Kristen [NGC:B675:EXCH]
> > > Subject: RE: [Openais] Checkpoint crash in aisexec
> > > 
> > > 
> > > Ok well I doubt with 200 byte checkpoints there is a buffer overflow.
> > > :)
> > > 
> > > Recovery will come after 188 is wrapped up.  I think your two weeks
> > > window looks good for alpha-level recovery (ie: works most of the
> > > time).  High quality production recovery will not hit your window for
> > > development (ie: works 100% of the time no matter what happens).
> > > 
> > > Thanks
> > > -steve
> > > 
> > > On Fri, 2005-02-11 at 15:56, Kristen Smith wrote:
> > > > Steve,
> > > > 
> > > > The size of the checkpoints are ~200 bytes.
> > > > 
> > > > I agree, valgrind is an excellent tool. We will run it through and see
> > > > if that shows anything.
> > > > 
> > > > I have tried this scenario maybe 30 times today (for various other
> > > > testing) and it happened maybe 10 times. For a while I could reproduce
> > > > with a given test about 5 times and then it hasn't happened again.
> > > > 
> > > > Sounds like defect-188 fixing is going well. May I ask how the
> > > > recovery work is going as well? (Don't mean to be pushy on that front
> > > > - we have 2 more weeks of coding for our application left and I am
> > > > really hoping that we are able to put the new recovery code in during
> > > > that time).
> > > > 
> > > > Thanks a bunch,
> > > > Kristen
> > > > 
> > > > -----Original Message-----
> > > > From: Steven Dake [mailto:sdake at mvista.com]
> > > > Sent: Friday, February 11, 2005 4:37 PM
> > > > To: Smith, Kristen [NGC:B675:EXCH]
> > > > Subject: Re: [Openais] Checkpoint crash in aisexec
> > > > 
> > > > 
> > > > How large are the read or write requests?
> > > > Just a thought: there could be some buffer overrun with larger
> > > > requests.
> > > > 
> > > > On Fri, 2005-02-11 at 14:55, Kristen Smith wrote:
> > > > > Steve,
> > > > > 
> > > > > We are periodically seeing aisexec crash with the following trace:
> > > > > 
> > > > >         (gdb) bt
> > > > >         #0  message_handler_req_lib_ckpt_checkpointclose (conn_info=0x0,
> > > > >             message=0xb73fc008) at ckpt.c:1552
> > > > >         #1  0x080494c2 in poll_handler_libais_deliver (handle=0, fd=3,
> > > > >             revent=134633824, data=0x89c2ad8, prio=0x89b2784) at main.c:578
> > > > >         #2  0x08056e62 in poll_run (handle=0) at aispoll.c:386
> > > > >         #3  0x080499ac in main (argc=1, argv=0xbfffcb64) at main.c:1003
> > > > > 
> > > > > We have looked through the code but can't seem to figure out how
> > > > > conn_info is getting set to 0. Do you have any idea under what
> > > > > circumstances conn_info could be null when this function is called?
> > > > > 
> > > > > This is happening when we have multiple nodes up and we kill one of
> > > > > the active nodes. The standby node (which was reading checkpoints)
> > > > > must now become a writer, so it closes the checkpoint and this
> > > > > happens. Unfortunately, I can't reproduce this consistently - I
> > > > > finally got a core dump today. I don't recall ever seeing this with
> > > > > the old code.
> > > > > 
> > > > > Thanks,
> > > > > Kristen
> > > > > 
> > > > > 
> > > > > 
> > > > >
> > > >
> > >
> >
> ______________________________________________________________________
> > > > > _______________________________________________
> > > > > Openais mailing list
> > > > > Openais at lists.osdl.org
> > > > http://lists.osdl.org/mailman/listinfo/openais
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> 
> 




More information about the Openais mailing list