[Openais] Re: PATCH I.

Steven Dake sdake at mvista.com
Wed Feb 16 14:07:34 PST 2005


Muni,

great work!  Can you fix a few minor things, and then I'll apply the
patch?

Also can you post messages or patches to the mailing list so others
besides myself can review them?  This really helps give more reviews of
the code.

There is a commented out PROCESSOR_COUNT_MAX in totemsrp.c that should
be deleted.

Please use a space after commas:
        totemsrp_confchg_fn (TOTEMSRP_CONFIGURATION_TRANSITIONAL,
                my_trans_memb_list, 0, my_trans_memb_entries,
                left_list, 0, left_list_entries,
-               0, 0, 0);
+               0, 0, 0,&my_ring_id);

&my_ring_id should have a space before it.

When specifying pointers, please use the style to match the rest of the
code:

struct memb_ring_id *ring_id

instead of

struct memb_ring_id* ring_id

Great work
-steve

On Wed, 2005-02-16 at 14:44, Muni Bajpai wrote:
> Hey Steven,
> 
> I have completed Step 1 of our changes please review the patch.
> (Today's bitkeeper load). 
> I ran the ckptbench on 2 machines and everything seemed fine.
> Am attaching the patch of the changes. Please let me know if
> everything is OK.
> Also I am assuming that pending the OK you will checkin the code into
> bitkeeper.
> 
> Thanks
> 
> Muni
> 
> -----Original Message-----
> From: Steven Dake [mailto:sdake at mvista.com] 
> Sent: Wednesday, February 16, 2005 2:44 PM
> To: Bajpai, Muni [NGC:B670:EXCH]
> Cc: openais at lists.osdl.org; Smith, Kristen [NGC:B675:EXCH];
> markh at osdl.org
> Subject: RE: [Openais] Checkpoint crash in aisexec
> 
> 
> Sounds great Muni
> Looking forward to seeing your patches.
> 
> regards
> -steve
> 
> On Wed, 2005-02-16 at 12:30, Muni Bajpai wrote:
> > Ok steve,
> > 
> > Thanks for the feedback. This is my take on the steps.
> > 
> > I.) First Patch
> >         1.) Move struct memb_ring_id from totemsrp.c to totemsrp.h
> >         2.) Move #define MAX_MEMBERS from totemsrp.c to totemsrp.h, 
> > change the name of the definition to PROCESSOR_COUNT_MAX.
> > 
> >         3.) Make changes to handlers.h, amf.c, ckpt.c, clm.c, evs.c,
> > totemsrp.c, totempg.c
> > 
> > II.) Second Patch
> >         Add support for sync on the ckpt service.
> > 
> > Thanks
> > 
> > Muni
> > -----Original Message-----
> > From: Steven Dake [mailto:sdake at mvista.com]
> > Sent: Wednesday, February 16, 2005 1:02 PM
> > To: Bajpai, Muni [NGC:B670:EXCH]
> > Cc: openais at lists.osdl.org; Smith, Kristen [NGC:B675:EXCH];
> > markh at osdl.org
> > Subject: RE: [Openais] Checkpoint crash in aisexec
> > 
> > 
> > Muni
> > 
> > I responded inline.  I'd suggest if you tackle this problem to try
> to 
> > break it up into a few patches to work on seperately.  Ie: the 
> > configuration change changes required to get the ring id through the
> > config change system, and then as a seperate patch the
> syncronization 
> > code.
> > 
> > Thanks
> > -steve
> > 
> > On Wed, 2005-02-16 at 09:38, Muni Bajpai wrote:
> > > Thanks for the quick responses last evening. My Response/Queries
> are 
> > > inline prepended by a -------------------
> > > 
> > > Muni
> > > 
> > > -----Original Message-----
> > > From: Steven Dake [mailto:sdake at mvista.com]
> > > Sent: Tuesday, February 15, 2005 6:20 PM
> > > To: Bajpai, Muni [NGC:B670:EXCH]; openais at lists.osdl.org
> > > Cc: Smith, Kristen [NGC:B675:EXCH]; markh at osdl.org
> > > Subject: RE: [Openais] Checkpoint crash in aisexec
> > > 
> > > 
> > > Muni
> > > I hope you dont mind me copying the openais mailing list so others
> > can
> > > share in our exchanges.
> > > 
> > > Thanks for taking a look at this
> > > 
> > > Responses inline
> > > 
> > > On Tue, 2005-02-15 at 14:54, Muni Bajpai wrote:
> > > > Hey Steve,
> > > > 
> > > > I work with kristen and need some more info on the checkpoint
> > > recovery
> > > > ...
> > > > 
> > > > 1.) So the logic for accepting a configuration change from a
> > > processor
> > > > is :
> > > >         if ((incoming_ring_id == last_known_ring_id) 
> > > >                 && (source_processor != delivering_processor) {
> > > > 
> > > >                 //IGNORE Change.
> > > >         }
> > > > 
> > > >         So as per my understanding:
> > > >         1.) (Ckpt Executive Perspective) If the change is from
> ME
> > > then
> > > > always change
> > > 
> > > maybe I was wrong with what I said before.  Try this logic out:
> > > 
> > > If the sync message is from your previous configuration, then the
> > > reference counts should not be updated because they would double
> the
> > > reference counts incorrectly.
> > > 
> > > ------------- So you mean don't care about the source/dest of the
> > sync
> > > message for decision making of accepting/ignoring config_chg, just
> > use
> > > the ring_id ?
> > > 
> > 
> > Its not the decision to accept the config change callback, its the 
> > decision to accept the syncronization message.  You should always 
> > accept the configuration change callback.  But in some cases, the
> sync 
> > message should be ignored.
> > 
> > A member of the synchronization message should be "previous_ring_id"
> > which is the ring identifier of the ring previous to the one that is
> > currently undergoing recovery.  Keep in mind that it should be the 
> > last regular configuration, not the transitional configuration.
> > 
> > The previous ring id is sufficient to determine if the refcount 
> > increase request would result in an invalid increase.
> > 
> > If they match, then the processor is already aware of the 
> > synchronization contents and should ignore the request.  If they
> dont 
> > match, then the processor is unaware of the syncronization contents 
> > and should accept the request.
> > 
> > > ?
> > > 
> > > ------------- Is it possible to get sync's from 2 different
> > processors
> > > with the same ring_id ??
> > > 
> > 
> > No this is not possible.
> > 
> > The reason is that when determining to send the sync message, the
> old 
> > ring id's representative is checked against the local ip.  If they 
> > match, then the sync message is sent (because this processor is the 
> > representative).  If they don't match, no sync message is sent 
> > (because the representative will take care of requesting the 
> > synchronization message).
> > 
> > > The sync message is originated from the representative processor
> > > containing the ring id prior to the transitional configuration
> > change.
> > > 
> > > When the message is delivered, it is compared to the ring id prior
> > to
> > > the transitional configuration.  If these two match, then the
> > message
> > > should be ignored because its a sync message from a processor
> within 
> > > the prior configuration.
> > > 
> > > >         2.) if the ring_id's don't match then always change.
> > > > 
> > > 
> > > Yes if the ring id in the delivered sync message doesn't match the
> > > previous ring id, then add the reference count information for
> that 
> > > processor to the checkpoint.
> > > 
> > > >         Please confirm.
> > > > 
> > > > 2.) We must add support for the new data structure additions in
> > the
> > > > Ckpt Executive Opens and Close handlers also.
> > > > 
> > > 
> > > no data structures are required in the handler prototypes.  I
> think
> > we
> > > need a new message vs open and close.  The message should be
> > something
> > > like "synchronizecounts".  I dont want to overload open and close
> > too
> > > much with extra meaning.  We could use this synchronizecounts for
> > some
> > > other purpose later, like exchanging metadata too.
> > > 
> > > ------------ So the ckpt_refcount[MAX_MEMBERS] array is modified
> on
> > > the receipt of sync messages,open and close??
> > > 
> > 
> > Yes ckpt_refcount is modified on open, close, and in some cases on 
> > sync given the logic above.
> > 
> > > > 3.) The addition as you enumerated to the checkpoint data
> > structure,
> > > > did you have any implementation preferences or did you want us
> to
> > > use
> > > > anything appropriates (cursively I was thinking of a list of
> > struct
> > > > refs)
> > > 
> > > hmm I have an affinity towards avoiding any sort of memory
> > allocation
> > > if at all possible (because they can fail, and this can cause us
> > major
> > > troubles).  Maybe something like struct ckpt_refcnt {
> > > 
> > >         int count;
> > >         struct in_addr addr;
> > > };
> > > 
> > > Then somethign like adding to saCkptCheckpoint
> > > 
> > > struct ckpt_refcount ckpt_refcount[MAX_MEMBERS];
> > > 
> > > MAX_MEMBERS should probably be brought out fromt otemsrp.c into
> > > totemsrp.h and changed from MAX_MEMBERS to PROCESSOR_COUNT_MAX.
> > > 
> > > > 
> > > > 4.) The last_known_ring_id. What does that mean to a newly added
> > > > processor. Explicitly ( incoming_ring_id == last_known_ring_id )
> > > will
> > > > always fail on a newly commissioned processor. Am I
> understanding
> > > that
> > > > correctly ?
> > > > 
> > > 
> > > no not incoming ring id.  Instead it is the processor's last ring
> id 
> > > in the originated synchronization message.
> > > 
> > > last known ring id should be inited to zero.  You understand that
> > the
> > > sync message will have some value and last_known_ring_id will be
> > zero.
> > > 
> > > This will force the synchronization message to be accepted which
> is
> > > desired behavior.
> > > 
> > > > Where is the last_known_ring_id stored ?
> > > > 
> > > 
> > > it must be stored when a configuration change is delivered to the
> > > ckpt_confchg_fn.
> > > 
> > > > 5.) Is exec/evt.c the best example for any ideas on
> implementation
> > > ??
> > > > 
> > > 
> > > I don't think evt uses reference counting to track channels, but
> it
> > is
> > > necessary for checkpoints because of checkpoint retention.  I'd
> > rather
> > > try to invent a few different approaches here so we can unify them
> > > later once we have discovered the best design.
> > > 
> > > Synchronization after a merge or partition is the hardest part of
> a
> > > distributed system and I hope we can find a few approaches to test
> > > out.
> > > 
> > > > 
> > > > Thanks
> > > > 
> > > > Muni
> > > > 
> > > > -----Original Message-----
> > > > From: Steven Dake [mailto:sdake at mvista.com]
> > > > Sent: Tuesday, February 15, 2005 1:51 PM
> > > > To: Smith, Kristen [NGC:B675:EXCH]
> > > > Cc: markh at osdl.org; openais at lists.osdl.org; Bajpai, Muni
> > > > [NGC:B670:EXCH]
> > > > Subject: RE: [Openais] Checkpoint crash in aisexec
> > > > 
> > > > 
> > > > On Tue, 2005-02-15 at 09:47, Kristen Smith wrote:
> > > > > Steve,
> > > > > 
> > > > > Thanks for the response - I hear ya loud and clear - not good
> > > > without
> > > > > recovery. So, is there something that we could do to help you
> > with
> > > > > this recovery coding? If you had some type of design thoughts
> on
> > > how
> > > > > you wanted checkpoint recovery to occur, maybe that is
> something
> > > we
> > > > > could help out with. Just throwing this out there to see what
> > you
> > > > > think.
> > > > > 
> > > > 
> > > > Kristen
> > > > You have done alot to help us so far but more help is always 
> > > > appreciated
> > > > :)
> > > > 
> > > > If someone from your org wanted to get started writing code for 
> > > > checkpoint recovery that would be great!  I spent some time in
> the 
> > > > drive to work this morning thinking about how checkpoint
> recovery 
> > > > should work:
> > > > 
> > > > There are 3 main steps that should be done in order:
> > > > 1. synchronize checkpoint reference counts (so retention timers
> > work
> > > > properly)
> > > > 2. synchronize checkpoint metadata contents (sizes, sections,
> etc)
> > > 2.
> > > > synchronize checkpoint section data contents
> > > > 
> > > > The place to get started is on the reference count
> > synchronization.
> > > > 
> > > > The checkpoint must contain a list of active user's processor
> ids 
> > > > along with their reference count.  So if processor A has
> > checkpoint
> > > 1
> > > > open twice, and processor B has checkpoint 1 open three times,
> and 
> > > > processor C has checkpoint 1 open four times each processor
> would 
> > > > maintain a list for the checkpoint (in the checkpoint data
> > > structure):
> > > > 
> > > > p_A:r_2
> > > > p_B:r_3
> > > > p_C:r_4
> > > > 
> > > > Then on a configuration change, the leaving processors would
> close 
> > > > their reference counts.  So in this example, p_B leaves then the
> > > > processor ref count looks like: p_A:r_2 p_C:r_4
> > > > 
> > > > During this configuration change, a processor joins p_D.  It has
> > > > checkpoint 1 open 1 time.  p_D gets a configuration change {add
> > p_A,
> > > > p_C} and then sends a synchronization message with its previous
> > ring
> > > > identifier and current list of checkpoint reference counts
> (after
> > > the
> > > > above leave in the configuration change was processed).  The 
> > > > representative of {p_A, p_C} also sends a synchronization
> message
> > > with
> > > > the previous ring identifier and a current list of checkpoint 
> > > > reference counts.  If the previous ring identifiers match and
> the 
> > > > sending processor is not the delivering processor then p_C
> should 
> > > > ignore p_A's message (ie: p_C receives p_A message, but it
> already 
> > > > knows about p_A's references).
> > > > 
> > > > This requires us to add the ring identifier to the configuration
> > > > change.
> > > > 
> > > > So now each previous configuration is aware of the new
> > > configuration.
> > > > The reference counts look like:
> > > > p_A:r_2
> > > > p_C:r_4
> > > > p_D:r_1
> > > > 
> > > > The above maintenence of the reference counts, or open
> > checkpoints,
> > > > must maintain a per-checkpoint variable which is the "reference
> > > count
> > > > for this checkpoint".  In the last case, that reference count
> > would
> > > be
> > > > 7.
> > > > 
> > > > Each time a processor leaves, its reference counts are
> subtracted
> > > from
> > > > this "global ref count".  Each time a processor is added, its 
> > > > reference counts are added.  This reference count is then what
> is
> > > used
> > > > for retention duration.
> > > > 
> > > > Any thoughts on the above approach welcome.
> > > > 
> > > > Thanks!
> > > > -steve
> > > > 
> > > > > Thanks,
> > > > > Kristen
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Steven Dake [mailto:sdake at mvista.com]
> > > > > Sent: Monday, February 14, 2005 2:17 PM
> > > > > To: Smith, Kristen [NGC:B675:EXCH]; markh at osdl.org; 
> > > > > openais at lists.osdl.org
> > > > > Cc: Bajpai, Muni [NGC:B670:EXCH]
> > > > > Subject: RE: [Openais] Checkpoint crash in aisexec
> > > > > 
> > > > > 
> > > > > On Sat, 2005-02-12 at 08:08, Kristen Smith wrote:
> > > > > > Steve,
> > > > > > 
> > > > > > Thanks for the response.
> > > > > > 
> > > > > > For recovery - what are the ramifications if we don't have
> > > > recovery
> > > > > > working 100%? What I see now is that when a node leaves the
> > > > cluster
> > > > > > and then rejoins, it receives evt messages, but it can take
> > > > anywhere
> > > > > > from 15seconds to minutes for evt messages sent from that
> node
> > > to
> > > > > > reach the other applications. I handle this with some
> > > > > 
> > > > > Mark have you seen this issue?
> > > > > 
> > > > > > message retries which is ok in this startup case. However,
> are
> > > we
> > > > in
> > > > > > jeopardy in other cases that I am not considering? When
> > running
> > > > > > traffic the past few days and seeing periodic reconfigs, I
> > don't
> > > > > seem
> > > > > > to be losing messages when that occurs - I only see the lost
> > > > > messages
> > > > > > when I actually kill a node and start it back up to rejoin
> the 
> > > > > > cluster.
> > > > > > 
> > > > > 
> > > > > What we have today is totally unacceptable because atleast for
> > > > > checkpointing, there is no recovery.  And Mark is waiting on
> my
> > > base
> > > > > code for event recovery.
> > > > > 
> > > > > Definition of 100% working means if there is a failure during
> > > > > recovery, we are guaranteed a consistent state.  I think evt
> is
> > > > pretty
> > > > > close to this goal, although the checkpoint replication after
> > > merge
> > > > > has not been developed yet.  I can think of alot of easy ways
> to
> > > do
> > > > > this, but handling a failure during the recovery phase makes
> it
> > > more
> > > > > difficult.
> > > > > 
> > > > > Definition of almost 100% is that recovery works properly if
> > there
> > > > are
> > > > > no faults during recovery (ie: the merge process), but if
> there
> > is
> > > a
> > > > > fault during recovery (ie: reconfig) something could go awry.
> > > > > 
> > > > > We want consistently replicated data (the 100% case).  100% is
> > > > > probably past your development window; the other case is
> within
> > > > reach.
> > > > > 
> > > > > Regards
> > > > > -steve
> > > > > 
> > > > > > Thanks
> > > > > > Kristen
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Steven Dake [mailto:sdake at mvista.com]
> > > > > > Sent: Friday, February 11, 2005 5:30 PM
> > > > > > To: Smith, Kristen [NGC:B675:EXCH]
> > > > > > Subject: RE: [Openais] Checkpoint crash in aisexec
> > > > > > 
> > > > > > 
> > > > > > Ok well I doubt with 200 byte checkpoints there is a buffer
> > > > > overflow.
> > > > > > :)
> > > > > > 
> > > > > > Recovery will come after 188 is wrapped up.  I think your
> two
> > > > weeks
> > > > > > window looks good for alpha-level recovery (ie: works most
> of
> > > the
> > > > > > time).  High quality production recovery will not hit your
> > > window
> > > > > for
> > > > > > development (ie: works 100% of the time no matter what
> > happens).
> > > > > > 
> > > > > > Thanks
> > > > > > -steve
> > > > > > 
> > > > > > On Fri, 2005-02-11 at 15:56, Kristen Smith wrote:
> > > > > > > Steve,
> > > > > > > 
> > > > > > > The size of the checkpoints are ~200 bytes.
> > > > > > > 
> > > > > > > I agree, valgrind is an excellent tool. We will run it
> > through
> > > > and
> > > > > > see
> > > > > > > if that shows anything.
> > > > > > > 
> > > > > > > I have tried this scenario maybe 30 times today (for
> various
> > > > other
> > > > > > > testing) and it happened maybe 10 times. For a while I
> could
> > > > > > reproduce
> > > > > > > with a given test about 5 times and then it hasn't
> happened
> > > > again.
> > > > > > > 
> > > > > > > Sounds like defect-188 fixing is going well. May I ask how
> > the
> > > > > > > recovery work is going as well? (Don't mean to be pushy on
> > > that
> > > > > > front
> > > > > > > - we have 2 more weeks of coding for our application left
> > and
> > > I
> > > > am
> > > > > > > really hoping that we are able to put the new recovery
> code
> > in
> > > > > > during
> > > > > > > that time).
> > > > > > > 
> > > > > > > Thanks a bunch,
> > > > > > > Kristen
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Steven Dake [mailto:sdake at mvista.com]
> > > > > > > Sent: Friday, February 11, 2005 4:37 PM
> > > > > > > To: Smith, Kristen [NGC:B675:EXCH]
> > > > > > > Subject: Re: [Openais] Checkpoint crash in aisexec
> > > > > > > 
> > > > > > > 
> > > > > > > how large are the read or write requests?
> > > > > > > just a thought there could be some buffer overrun with
> > larger
> > > > > > > requests.
> > > > > > > 
> > > > > > > On Fri, 2005-02-11 at 14:55, Kristen Smith wrote:
> > > > > > > > Steve,
> > > > > > > > 
> > > > > > > > We are periodically seeing aisexec crash with the
> > following
> > > > > trace:
> > > > > > > > 
> > > > > > > >         (gdb) bt
> > > > > > > >         #0  message_handler_req_lib_ckpt_checkpointclose
> > > > > > > >         (conn_info=0x0, message=0xb73fc008) at
> ckpt.c:1552
> > > > > > > >         #1  0x080494c2 in poll_handler_libais_deliver
> > > > (handle=0,
> > > > > > > fd=3,
> > > > > > > >         revent=134633824, data=0x89c2ad8,
> > > > > > > >             prio=0x89b2784) at main.c:578
> > > > > > > >         #2  0x08056e62 in poll_run (handle=0) at
> > > aispoll.c:386
> > > > > > > > 
> > > > > > > > 
> > > > > > > > #3  0x080499ac in main (argc=1, argv=0xbfffcb64) at
> > > > main.c:1003
> > > > > > > > 
> > > > > > > > We have looked through the code but can't seem to figure
> > out
> > > > how
> > > > > > > > conn_info is getting set to 0. Do you have any idea
> under
> > > what
> > > > > > > > circumstances conn_info could be null when this function
> > is
> > > > > > called?
> > > > > > > > 
> > > > > > > > This is happening when we have multiple nodes up and we
> > kill
> > > > one
> > > > > > of
> > > > > > > > the active nodes. The standby node (which was reading
> > > > > checkpoints)
> > > > > > > > must now become a writer, so it closes the checkpoint
> and
> > > this
> > > > > > > > happens. Unfortunately, I can't reproduce this
> > consistently
> > > -
> > > > I
> > > > > > > > finally got a core dump today. I don't recall ever
> seeing
> > > this
> > > > > > with
> > > > > > > > the old code.
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Kristen
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> ______________________________________________________________________
> > > > > > > > _______________________________________________
> > > > > > > > Openais mailing list
> > > > > > > > Openais at lists.osdl.org
> > > > > > > http://lists.osdl.org/mailman/listinfo/openais
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> 
> 
>  




More information about the Openais mailing list