[Openais] RE: Checkpoint Recovery Synchronization

Steven Dake sdake at mvista.com
Thu Feb 17 15:02:51 PST 2005


Responses inline, Muni


On Thu, 2005-02-17 at 15:13, Muni Bajpai wrote:
> Hey Steven,
> 
> Some more questions inline. (prepended --------------- )
> 
> Thanks
> 
> Muni
> 
> -----Original Message-----
> From: Steven Dake [mailto:sdake at mvista.com] 
> Sent: Thursday, February 17, 2005 1:21 PM
> To: Bajpai, Muni [NGC:B670:EXCH]
> Cc: openais at lists.osdl.org; Smith, Kristen [NGC:B675:EXCH]
> Subject: Re: Checkpoint Recovery Synchronization
> 
> 
> On Thu, 2005-02-17 at 11:35, Muni Bajpai wrote:
> > Hey Steven,
> >  
> > So onto phase II.
> >  
> > Do you have any preferences for the new (struct
> > req_exec_ckpt_checkpointsynchronize)?  I know you did mention having
> 
> Hmm, I'd use a different message name, something like
> "synchronizestate".
> 
> The reason is that there is a checkpoint synchronize API call in
> the checkpoint services which we may someday support.
> 
> > the previous regular configuration ring_id in that message, but what
> > else?  I know we have to send all the saCkptCheckpoint structures
> > stored in the list that checkpointListHead points to, or we could
> > send out multiple sync messages, one per checkpoint.  I prefer
> > sending one message.  But we
> 
> I'd not send saCkptCheckpoint, but instead the checkpoint name
> identifier.  Then when the message is received, it can be found by
> the synchronize handler and updated, or if it doesn't exist, it can
> be created.
> 
> In fact, we may desire to send the checkpoint metadata and section
> metadata in this message too.
> --------------- Let's make this a Patch III (checkpoint metadata and
> section metadata). 
> 

This sounds fine.  But we should prepare for the fact that this same
synchronization message could (should?) send that information too.
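
A minimal sketch of what such a per-checkpoint synchronization message
might look like; the field names here are hypothetical (the thread has
not settled on a layout), built from types mentioned elsewhere in it
plus the AIS SaNameT name type, and assume the headers where those
types live:

struct req_exec_ckpt_synchronizestate {
        struct req_header header;        /* standard executive header */
        struct memb_ring_id previous_ring_id;  /* last regular config */
        SaNameT checkpoint_name;         /* find or create by name */
        struct ckpt_refcnt ckpt_refcount[PROCESSOR_COUNT_MAX];
                                         /* per-processor open counts */
};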

> The advantage of a separate message per checkpoint is that we can
> easily add other things like the rest of the metadata (checkpoint
> section names, checkpoint section sizes).  This allows us to avoid
> running into the 512k message size limit.
> 
> > have to decide on the type of the aggregated data.
> >  
> > Also the standard
> > struct req_header header;
> > struct message_source source;
> >  
> 
> I think we don't want message_source, since this is used to identify
> the source of an IPC message from a library.  Instead, struct in_addr
> should suffice.  We should at some point rename message_source to
> ipc_source, I guess, to avoid this confusion.
> 
> > should be a part of the new struct too.
> >  
> > I can't think of anything else.
> >  
> 
> This is a little complicated, but one thing to keep in mind is that
> we should not enable the new checkpoint data until synchronization
> messages have been delivered from all representative processors.  The
> reason for this is that a failure (loss of the token) in the
> checkpoint recovery state would be disastrous if we updated the data
> on the fly; we could end up with inconsistent data.
> 
> As part of the base recovery code, I'll take care of developing code
> to notify you when it is safe to activate the new checkpoint data.
> 
> One more complication to consider.  During recovery it is possible
> that you will be unable to send data because the transmit queue is
> full.  In this case, you must store your state and then use
> totempg_token_callback_create to queue a callback to notify you when
> the token has left the processor.  At this point, there may be an
> opportunity to queue new requests into totempg.  This last step may
> have to be repeated several times before you can complete the
> synchronization process.
> 
> --------------- Are you talking about queueing up all the outgoing
> mcasts?  E.g., requests received from the ckpt lib
> (open, close, write, etc.)

The library requests will receive a try-again response, and won't be
able to submit new requests.

Requests that are partially in flight will be completed.  The recovery
base code I am working out basically provides a programmable barrier
mechanism.

Here is how it works:
A service calls "barrier start".  Then once the processor has received
the barrier from all processors, a callback will be called within the
service to request that it initiate its recovery.  Then the service
calls barrier stop.  Once the barrier stop is received from all
processors, a callback will be called indicating that recovery has
completed.  This callback would execute steps 6 and 7.  Barriers can
have dependencies in this system, so for example, evt will depend upon
the clm service because it requires recovery information from clm.  I
wouldn't worry too much about this for now.
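
As a rough sketch, the checkpoint service's side of that barrier
interface might look like the following; sync_barrier_start,
sync_barrier_stop, and both callbacks are hypothetical names, since
the base recovery code providing them did not exist yet
(ckpt_recovery_process and ckpt_recovery_finalize are the functions
proposed just below):

/* called by the service to enter recovery */
void sync_barrier_start (void);
/* called by the service when its recovery sends are complete */
void sync_barrier_stop (void);

/* invoked once "barrier start" has arrived from all processors */
static void ckpt_barrier_reached (void) {
        ckpt_recovery_process ();  /* safe to begin recovery sends */
}

/* invoked once "barrier stop" has arrived from all processors */
static void ckpt_barrier_completed (void) {
        ckpt_recovery_finalize (); /* steps 6 and 7: activate new data */
}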

One way to prepare for this design is to create a few functions
(sketched after this list):
* ckpt_recovery_initialize, which builds the local state data used for
recovery and is called from the ckpt_confchg_fn
* ckpt_recovery_process, which processes the next recovery data
prepared by ckpt_recovery_initialize and is queued to run by the token
callback function.  Once ckpt_recovery_process has no more data to
process, it can stop asking to be requeued by the token transmit
callback.
* ckpt_recovery_finalize, which executes steps 6 and 7 and is called
by ckpt_recovery_process when the processing has been completed.
Later we will change this to be called by the barrier stop callback.
* ckpt_recovery_abort, which aborts the recovery and releases
allocated local state data.  This should be called when a
configuration change occurs while already in recovery, before the
barrier stop callback occurs.  For now, it can be a dummy function.
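
A skeleton of those four functions, as a sketch only (the recovery
list name is an assumption, and the comments describe intent rather
than actual code):

static struct list_head checkpoint_recovery_list_head; /* new copy */

void ckpt_recovery_initialize (void) {
        /* called from ckpt_confchg_fn: copy the existing checkpoint
         * data into checkpoint_recovery_list_head, leaving the old
         * list and its head untouched */
}

void ckpt_recovery_process (void) {
        /* queue synchronizestate messages until finished or until the
         * totempg transmit queue fills; if it fills, ask to be
         * requeued by the token callback and return */
}

void ckpt_recovery_finalize (void) {
        /* steps 6 and 7: free the old list entries, reinit the old
         * list head, and link the new list into it */
}

void ckpt_recovery_abort (void) {
        /* release partially built recovery state; dummy for now */
}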

This design offers a 100% correct approach to recovery that handles
configuration changes during recovery.

Now back to your question; I am speaking specifically about the
synchronization messages.  Let's consider an example:

There are 5,000 checkpoints to synchronize.  That is 5,000 separate
messages in the recovery phase.  The totempg code can only queue about
500 messages (smaller than 1400 bytes), and can then only send about
40 per token rotation, so that means the remaining 4,500 have to be
sent somehow.  Unfortunately, you are only going to get one
configuration change, so you won't be notified again when you can
start queueing new requests.  To solve this problem there is a
callback mechanism which informs you that more room might be available
to send messages.  Then it would be possible to send the next 40
checkpoint sync messages (until the queue is full), then wait for the
next notification, then send the next 40, and repeat, until all
messages are finally queued.
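
The resend loop might be shaped like this sketch; the two helper
functions and the totempg_token_callback_create signature shown are
assumptions for illustration only:

static void ckpt_sync_token_callback (void *data) {
        /* try to queue the remaining synchronization messages */
        while (more_checkpoints_to_sync ()) {
                if (queue_next_sync_message () == -1) {
                        /* transmit queue full again: ask to be called
                         * back on the next token rotation and yield */
                        totempg_token_callback_create (
                                ckpt_sync_token_callback, NULL);
                        return;
                }
        }
        ckpt_recovery_finalize (); /* all messages queued */
}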

> 
> So the basic algorithm is:
> 1. remove left members in transitional configuration from reference
> counts on checkpoints immediately 
> 2. copy the existing checkpoint data to a new checkpoint list. 
> --------------- (What about the old list?  Should the checkpoint list
> header point to null?)  If yes, that would mean that all incoming
> ckpt_exec commands would fail, because they all make a call to
> ckpt_checkpoint_find_global and that call would fail.
> 

Keep the old list intact without making any changes, including the list
head.  Then when you get to step 7, remove the old list entries (with
free, etc), reinit the list head, and link the new list into the list
head that previously existed.
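
A sketch of that step, assuming kernel-style list primitives like
those in openais's list.h, and assuming saCkptCheckpoint links itself
through a member named "list" (both are guesses for illustration):

/* step 7: free the old entries, then reuse the existing head */
struct list_head *p = checkpointListHead.next;

while (p != &checkpointListHead) {
        struct saCkptCheckpoint *ckpt =
                list_entry (p, struct saCkptCheckpoint, list);
        p = p->next;          /* advance before freeing the container */
        free (ckpt);          /* plus any section data it owns */
}
list_init (&checkpointListHead);

/* step 6: link the list built during recovery into the old head */
while (!list_empty (&checkpoint_recovery_list_head)) {
        p = checkpoint_recovery_list_head.next;
        list_del (p);
        list_add (p, &checkpointListHead);
}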

> THIS IS WHAT I think you mean: if no, then that would mean that all
> exec commands would try to complete on the old data and could cause
> data inconsistency (or would it?), e.g. open and close.  Or maybe it
> won't, as at the end of the sync the new list gets activated?
> 
> What if an executive receives a close while sync is in progress?  The
> new-data list will not have updated refcounts, and could also be
> pointing to a deallocated section whose pointer it copied when the
> new list was created.  Should I be updating the new list as well as
> the old list if this should occur?
> 
> 3. start synchronization process, requesting a restart with
> totempg_token_callback_create when the transmit queue is full  
> 
> 4. update copy of checkpoint data with synchronization message
> information when a syncstate message is received 
> 5. wait for synchronization to complete (I'll take care of this part) 
> 6. activate new list (by copying the new checkpoint list header into
> the old list header) 
> 7. free old checkpoint data
> 
> That should be enough to get you started.. :)
> 
> Regards
> -steve
> 
> > Please let me know,
> >  
> > Thanks
> >  
> > Muni
> >  
> >  
> >         -----Original Message-----
> >         From: openais-bounces at lists.osdl.org
> >         [mailto:openais-bounces at lists.osdl.org] On Behalf Of Bajpai,
> >         Muni [NGC:B670:EXCH]
> >         Sent: Wednesday, February 16, 2005 1:31 PM
> >         To: 'sdake at mvista.com'
> >         Cc: openais at lists.osdl.org; markh at osdl.org; Smith, Kristen
> >         [NGC:B675:EXCH]
> >         Subject: RE: [Openais] Checkpoint crash in aisexec
> >         
> >         
> >         
> >         Ok steve,
> >         
> >         Thanks for the feedback. This is my take on the steps.
> >         
> >         I.) First Patch
> >                 1.) Move struct memb_ring_id from totemsrp.c to
> >         totemsrp.h
> >                 2.) Move #define MAX_MEMBERS from totemsrp.c to
> >         totemsrp.h, change the name of the definition to
> >         PROCESSOR_COUNT_MAX.
> >         
> >                 3.) Make changes to handlers.h, amf.c, ckpt.c,
> >                 clm.c, evs.c, totemsrp.c, totempg.c
> >         
> >         II.) Second Patch
> >                 Add support for sync on the ckpt service.
> >         
> >         Thanks
> >         
> >         Muni
> >         -----Original Message-----
> >         From: Steven Dake [mailto:sdake at mvista.com] 
> >         Sent: Wednesday, February 16, 2005 1:02 PM
> >         To: Bajpai, Muni [NGC:B670:EXCH]
> >         Cc: openais at lists.osdl.org; Smith, Kristen [NGC:B675:EXCH];
> >         markh at osdl.org
> >         Subject: RE: [Openais] Checkpoint crash in aisexec
> >         
> >         
> >         Muni
> >         
> >         I responded inline.  I'd suggest, if you tackle this
> >         problem, trying to break it up into a few patches to work
> >         on separately, i.e. the configuration change changes
> >         required to get the ring id through the config change
> >         system, and then, as a separate patch, the synchronization
> >         code.
> >         
> >         Thanks
> >         -steve
> >         
> >         On Wed, 2005-02-16 at 09:38, Muni Bajpai wrote:
> >         > Thanks for the quick responses last evening.  My
> >         > responses/queries are inline, prepended by a
> >         > -------------------
> >         > 
> >         > Muni
> >         > 
> >         > -----Original Message-----
> >         > From: Steven Dake [mailto:sdake at mvista.com]
> >         > Sent: Tuesday, February 15, 2005 6:20 PM
> >         > To: Bajpai, Muni [NGC:B670:EXCH]; openais at lists.osdl.org
> >         > Cc: Smith, Kristen [NGC:B675:EXCH]; markh at osdl.org
> >         > Subject: RE: [Openais] Checkpoint crash in aisexec
> >         > 
> >         > 
> >         > Muni
> >         > I hope you don't mind me copying the openais mailing list
> >         > so others can share in our exchanges.
> >         > 
> >         > Thanks for taking a look at this
> >         > 
> >         > Responses inline
> >         > 
> >         > On Tue, 2005-02-15 at 14:54, Muni Bajpai wrote:
> >         > > Hey Steve,
> >         > > 
> >         > > I work with kristen and need some more info on the
> >         checkpoint
> >         > recovery
> >         > > ...
> >         > > 
> >         > > 1.) So the logic for accepting a configuration change
> >         > > from a processor is:
> >         > >         if ((incoming_ring_id == last_known_ring_id)
> >         > >                 && (source_processor != delivering_processor)) {
> >         > > 
> >         > >                 // IGNORE change.
> >         > >         }
> >         > > 
> >         > >         So as per my understanding:
> >         > >         1.) (Ckpt Executive Perspective) If the change
> >         > >         is from ME, then always change
> >         > 
> >         > Maybe I was wrong with what I said before.  Try this
> >         > logic out:
> >         > 
> >         > If the sync message is from your previous configuration,
> >         > then the reference counts should not be updated, because
> >         > they would double the reference counts incorrectly.
> >         > 
> >         > ------------- So you mean: don't care about the
> >         > source/dest of the sync message when deciding whether to
> >         > accept or ignore config_chg, just use the ring_id?
> >         >
> >         
> >         It's not the decision to accept the config change callback,
> >         it's the decision to accept the synchronization message.
> >         You should always accept the configuration change callback.
> >         But in some cases, the sync message should be ignored.
> >         
> >         A member of the synchronization message should be
> >         "previous_ring_id", which is the ring identifier of the
> >         ring previous to the one that is currently undergoing
> >         recovery.  Keep in mind that it should be the last regular
> >         configuration, not the transitional configuration.
> >         
> >         The previous ring id is sufficient to determine if the
> >         refcount increase request would result in an invalid
> >         increase.
> >         
> >         If they match, then the processor is already aware of the
> >         synchronization contents and should ignore the request.  If
> >         they don't match, then the processor is unaware of the
> >         synchronization contents and should accept the request.
> >         
> >         > ?
> >         > 
> >         > ------------- Is it possible to get syncs from 2
> >         > different processors with the same ring_id?
> >         >
> >         
> >         No this is not possible.
> >         
> >         The reason is that when determining to send the sync
> >         message, the old ring id's representative is checked
> >         against the local ip.  If they match, then the sync message
> >         is sent (because this processor is the representative).  If
> >         they don't match, no sync message is sent (because the
> >         representative will take care of requesting the
> >         synchronization message).
> >         
> >         > The sync message is originated from the representative
> >         > processor, containing the ring id prior to the
> >         > transitional configuration change.
> >         > 
> >         > When the message is delivered, it is compared to the ring
> >         > id prior to the transitional configuration.  If these two
> >         > match, then the message should be ignored, because it's a
> >         > sync message from a processor within the prior
> >         > configuration.
> >         > 
> >         > >         2.) if the ring_id's don't match, then always
> >         > >         change.
> >         > > 
> >         > 
> >         > Yes, if the ring id in the delivered sync message doesn't
> >         > match the previous ring id, then add the reference count
> >         > information for that processor to the checkpoint.
> >         > 
> >         > >         Please confirm.
> >         > > 
> >         > > 2.) We must add support for the new data structure
> >         > > additions in the Ckpt Executive Open and Close handlers
> >         > > also.
> >         > > 
> >         > 
> >         > No data structures are required in the handler
> >         > prototypes.  I think we need a new message vs. open and
> >         > close.  The message should be something like
> >         > "synchronizecounts".  I don't want to overload open and
> >         > close too much with extra meaning.  We could use this
> >         > synchronizecounts for some other purpose later, like
> >         > exchanging metadata too.
> >         > 
> >         > ------------ So the ckpt_refcount[MAX_MEMBERS] array is
> >         > modified on the receipt of sync messages, open, and
> >         > close?
> >         >
> >         
> >         Yes ckpt_refcount is modified on open, close, and in some
> >         cases on sync given the logic above.
> >         
> >         > > 3.) For the additions you enumerated to the checkpoint
> >         > > data structure, did you have any implementation
> >         > > preferences, or did you want us to use anything
> >         > > appropriate (cursorily I was thinking of a list of
> >         > > struct refs)?
> >         > 
> >         > Hmm, I have an affinity towards avoiding any sort of
> >         > memory allocation if at all possible (because allocations
> >         > can fail, and this can cause us major troubles).  Maybe
> >         > something like:
> >         > 
> >         > struct ckpt_refcnt {
> >         >         int count;
> >         >         struct in_addr addr;
> >         > };
> >         > 
> >         > Then something like adding to saCkptCheckpoint:
> >         > 
> >         > struct ckpt_refcnt ckpt_refcount[MAX_MEMBERS];
> >         > 
> >         > MAX_MEMBERS should probably be brought out from
> >         > totemsrp.c into totemsrp.h and changed from MAX_MEMBERS
> >         > to PROCESSOR_COUNT_MAX.
> >         > 
> >         > > 
> >         > > 4.) The last_known_ring_id: what does that mean to a
> >         > > newly added processor?  Explicitly, (incoming_ring_id
> >         > > == last_known_ring_id) will always fail on a newly
> >         > > commissioned processor.  Am I understanding that
> >         > > correctly?
> >         > > 
> >         > 
> >         > No, not the incoming ring id.  Instead it is the
> >         > processor's last ring id in the originated
> >         > synchronization message.
> >         > 
> >         > last_known_ring_id should be inited to zero.  You
> >         > understand correctly: the sync message will have some
> >         > value and last_known_ring_id will be zero.
> >         > 
> >         > This will force the synchronization message to be
> >         > accepted, which is the desired behavior.
> >         > 
> >         > > Where is the last_known_ring_id stored ?
> >         > > 
> >         > 
> >         > it must be stored when a configuration change is delivered
> >         to the 
> >         > ckpt_confchg_fn.
> >         > 
> >         > > 5.) Is exec/evt.c the best example for any ideas on
> >         > > implementation?
> >         > > 
> >         > 
> >         > I don't think evt uses reference counting to track
> >         > channels, but it is necessary for checkpoints because of
> >         > checkpoint retention.  I'd rather try to invent a few
> >         > different approaches here so we can unify them later once
> >         > we have discovered the best design.
> >         > 
> >         > Synchronization after a merge or partition is the hardest
> >         > part of a distributed system, and I hope we can find a
> >         > few approaches to test out.
> >         > 
> >         > > 
> >         > > Thanks
> >         > > 
> >         > > Muni
> >         > > 
> >         > > -----Original Message-----
> >         > > From: Steven Dake [mailto:sdake at mvista.com]
> >         > > Sent: Tuesday, February 15, 2005 1:51 PM
> >         > > To: Smith, Kristen [NGC:B675:EXCH]
> >         > > Cc: markh at osdl.org; openais at lists.osdl.org; Bajpai, Muni
> >         > > [NGC:B670:EXCH]
> >         > > Subject: RE: [Openais] Checkpoint crash in aisexec
> >         > > 
> >         > > 
> >         > > On Tue, 2005-02-15 at 09:47, Kristen Smith wrote:
> >         > > > Steve,
> >         > > > 
> >         > > > Thanks for the response - I hear ya loud and clear
> >         > > > - not good without recovery.  So, is there something
> >         > > > that we could do to help you with this recovery
> >         > > > coding?  If you had some type of design thoughts on
> >         > > > how you wanted checkpoint recovery to occur, maybe
> >         > > > that is something we could help out with.  Just
> >         > > > throwing this out there to see what you think.
> >         > > 
> >         > > Kristen,
> >         > > You have done a lot to help us so far, but more help is
> >         > > always appreciated :)
> >         > > 
> >         > > If someone from your org wanted to get started writing
> >         > > code for checkpoint recovery, that would be great!  I
> >         > > spent some time on the drive to work this morning
> >         > > thinking about how checkpoint recovery should work:
> >         > > 
> >         > > There are 3 main steps that should be done in order:
> >         > > 1. synchronize checkpoint reference counts (so retention
> >         > > timers work properly)
> >         > > 2. synchronize checkpoint metadata contents (sizes,
> >         > > sections, etc.)
> >         > > 3. synchronize checkpoint section data contents
> >         > > 
> >         > > The place to get started is on the reference count
> >         synchronization.
> >         > > 
> >         > > The checkpoint must contain a list of active users'
> >         > > processor ids along with their reference counts.  So if
> >         > > processor A has checkpoint 1 open twice, processor B has
> >         > > checkpoint 1 open three times, and processor C has
> >         > > checkpoint 1 open four times, each processor would
> >         > > maintain a list for the checkpoint (in the checkpoint
> >         > > data structure):
> >         > > 
> >         > > p_A:r_2
> >         > > p_B:r_3
> >         > > p_C:r_4
> >         > > 
> >         > > Then on a configuration change, the leaving processors
> >         > > would close their reference counts.  So in this example,
> >         > > if p_B leaves, then the processor ref count list looks
> >         > > like:
> >         > > 
> >         > > p_A:r_2
> >         > > p_C:r_4
> >         > > 
> >         > > During this configuration change, a processor p_D
> >         > > joins.  It has checkpoint 1 open 1 time.  p_D gets a
> >         > > configuration change {add p_A, p_C} and then sends a
> >         > > synchronization message with its previous ring
> >         > > identifier and current list of checkpoint reference
> >         > > counts (after the above leave in the configuration
> >         > > change was processed).  The representative of {p_A,
> >         > > p_C} also sends a synchronization message with the
> >         > > previous ring identifier and a current list of
> >         > > checkpoint reference counts.  If the previous ring
> >         > > identifiers match and the sending processor is not the
> >         > > delivering processor, then p_C should ignore p_A's
> >         > > message (i.e., p_C receives p_A's message, but it
> >         > > already knows about p_A's references).
> >         > > 
> >         > > This requires us to add the ring identifier to the
> >         configuration
> >         > > change.
> >         > > 
> >         > > So now each previous configuration is aware of the new
> >         > > configuration.  The reference counts look like:
> >         > > p_A:r_2
> >         > > p_C:r_4
> >         > > p_D:r_1
> >         > > 
> >         > > The above maintenance of the reference counts, or open
> >         > > checkpoints, must maintain a per-checkpoint variable
> >         > > which is the "reference count for this checkpoint".  In
> >         > > the last case, that reference count would be 7.
> >         > > 
> >         > > Each time a processor leaves, its reference counts are
> >         > > subtracted from this "global ref count".  Each time a
> >         > > processor is added, its reference counts are added.
> >         > > This reference count is then what is used for retention
> >         > > duration.
> >         > > 
> >         > > Any thoughts on the above approach welcome.
> >         > > 
> >         > > Thanks!
> >         > > -steve
> >         > > 
> >         > > > Thanks,
> >         > > > Kristen
> >         > > > 
> >         > > > -----Original Message-----
> >         > > > From: Steven Dake [mailto:sdake at mvista.com]
> >         > > > Sent: Monday, February 14, 2005 2:17 PM
> >         > > > To: Smith, Kristen [NGC:B675:EXCH]; markh at osdl.org;
> >         > > > openais at lists.osdl.org
> >         > > > Cc: Bajpai, Muni [NGC:B670:EXCH]
> >         > > > Subject: RE: [Openais] Checkpoint crash in aisexec
> >         > > > 
> >         > > > 
> >         > > > On Sat, 2005-02-12 at 08:08, Kristen Smith wrote:
> >         > > > > Steve,
> >         > > > > 
> >         > > > > Thanks for the response.
> >         > > > > 
> >         > > > > For recovery - what are the ramifications if we
> >         > > > > don't have recovery working 100%?  What I see now
> >         > > > > is that when a node leaves the cluster and then
> >         > > > > rejoins, it receives evt messages, but it can take
> >         > > > > anywhere from 15 seconds to minutes for evt messages
> >         > > > > sent from that node to reach the other applications.
> >         > > > > I handle this with some
> >         > > > 
> >         > > > Mark have you seen this issue?
> >         > > > 
> >         > > > > message retries, which is ok in this startup case.
> >         > > > > However, are we in jeopardy in other cases that I am
> >         > > > > not considering?  When running traffic the past few
> >         > > > > days and seeing periodic reconfigs, I don't seem to
> >         > > > > be losing messages when that occurs - I only see the
> >         > > > > lost messages when I actually kill a node and start
> >         > > > > it back up to rejoin the cluster.
> >         > > > > 
> >         > > > 
> >         > > > What we have today is totally unacceptable because,
> >         > > > at least for checkpointing, there is no recovery.  And
> >         > > > Mark is waiting on my base code for event recovery.
> >         > > > 
> >         > > > The definition of 100% working means that if there is
> >         > > > a failure during recovery, we are guaranteed a
> >         > > > consistent state.  I think evt is pretty close to this
> >         > > > goal, although the checkpoint replication after merge
> >         > > > has not been developed yet.  I can think of a lot of
> >         > > > easy ways to do this, but handling a failure during
> >         > > > the recovery phase makes it more difficult.
> >         > > > 
> >         > > > The definition of almost 100% is that recovery works
> >         > > > properly if there are no faults during recovery (i.e.,
> >         > > > the merge process), but if there is a fault during
> >         > > > recovery (i.e., a reconfig), something could go awry.
> >         > > > 
> >         > > > We want consistently replicated data (the 100% case). 
> >         100% is 
> >         > > > probably past your development window; the other case
> is
> >         within
> >         > > reach.
> >         > > > 
> >         > > > Regards
> >         > > > -steve
> >         > > > 
> >         > > > > Thanks
> >         > > > > Kristen
> >         > > > > 
> >         > > > > -----Original Message-----
> >         > > > > From: Steven Dake [mailto:sdake at mvista.com]
> >         > > > > Sent: Friday, February 11, 2005 5:30 PM
> >         > > > > To: Smith, Kristen [NGC:B675:EXCH]
> >         > > > > Subject: RE: [Openais] Checkpoint crash in aisexec
> >         > > > > 
> >         > > > > 
> >         > > > > Ok, well, I doubt with 200-byte checkpoints there
> >         > > > > is a buffer overflow. :)
> >         > > > > 
> >         > > > > Recovery will come after 188 is wrapped up.  I think
> >         > > > > your two-week window looks good for alpha-level
> >         > > > > recovery (i.e., works most of the time).
> >         > > > > High-quality production recovery will not hit your
> >         > > > > window for development (i.e., works 100% of the time
> >         > > > > no matter what happens).
> >         > > > > 
> >         > > > > Thanks
> >         > > > > -steve
> >         > > > > 
> >         > > > > On Fri, 2005-02-11 at 15:56, Kristen Smith wrote:
> >         > > > > > Steve,
> >         > > > > > 
> >         > > > > > The size of the checkpoints is ~200 bytes.
> >         > > > > > 
> >         > > > > > I agree, valgrind is an excellent tool.  We will
> >         > > > > > run it through and see if that shows anything.
> >         > > > > > 
> >         > > > > > I have tried this scenario maybe 30 times today
> >         > > > > > (for various other testing) and it happened maybe
> >         > > > > > 10 times.  For a while I could reproduce it with a
> >         > > > > > given test about 5 times, and then it hasn't
> >         > > > > > happened again.
> >         > > > > > 
> >         > > > > > Sounds like defect-188 fixing is going well.  May
> >         > > > > > I ask how the recovery work is going as well?
> >         > > > > > (Don't mean to be pushy on that front - we have 2
> >         > > > > > more weeks of coding for our application left, and
> >         > > > > > I am really hoping that we are able to put the new
> >         > > > > > recovery code in during that time.)
> >         > > > > > 
> >         > > > > > Thanks a bunch,
> >         > > > > > Kristen
> >         > > > > > 
> >         > > > > > -----Original Message-----
> >         > > > > > From: Steven Dake [mailto:sdake at mvista.com]
> >         > > > > > Sent: Friday, February 11, 2005 4:37 PM
> >         > > > > > To: Smith, Kristen [NGC:B675:EXCH]
> >         > > > > > Subject: Re: [Openais] Checkpoint crash in aisexec
> >         > > > > > 
> >         > > > > > 
> >         > > > > > How large are the read or write requests?
> >         > > > > > Just a thought: there could be some buffer overrun
> >         > > > > > with larger requests.
> >         > > > > > 
> >         > > > > > On Fri, 2005-02-11 at 14:55, Kristen Smith wrote:
> >         > > > > > > Steve,
> >         > > > > > > 
> >         > > > > > > We are periodically seeing aisexec crash with
> >         > > > > > > the following trace:
> >         > > > > > > 
> >         > > > > > >         (gdb) bt
> >         > > > > > >         #0  message_handler_req_lib_ckpt_checkpointclose
> >         > > > > > >             (conn_info=0x0, message=0xb73fc008) at ckpt.c:1552
> >         > > > > > >         #1  0x080494c2 in poll_handler_libais_deliver
> >         > > > > > >             (handle=0, fd=3, revent=134633824, data=0x89c2ad8,
> >         > > > > > >             prio=0x89b2784) at main.c:578
> >         > > > > > >         #2  0x08056e62 in poll_run (handle=0) at aispoll.c:386
> >         > > > > > >         #3  0x080499ac in main (argc=1, argv=0xbfffcb64) at main.c:1003
> >         > > > > > > 
> >         > > > > > > We have looked through the code but can't seem
> >         > > > > > > to figure out how conn_info is getting set to
> >         > > > > > > 0.  Do you have any idea under what
> >         > > > > > > circumstances conn_info could be null when this
> >         > > > > > > function is called?
> >         > > > > > > 
> >         > > > > > > This is happening when we have multiple nodes up
> >         > > > > > > and we kill one of the active nodes.  The standby
> >         > > > > > > node (which was reading checkpoints) must now
> >         > > > > > > become a writer, so it closes the checkpoint and
> >         > > > > > > this happens.  Unfortunately, I can't reproduce
> >         > > > > > > this consistently - I finally got a core dump
> >         > > > > > > today.  I don't recall ever seeing this with the
> >         > > > > > > old code.
> >         > > > > > > 
> >         > > > > > > Thanks,
> >         > > > > > > Kristen
> >         > > > > > > 
> >         > > > > > > 
> >         > > > > > > 
> >         > > > > > >
> >         > > > > >
> >         > > > >
> >         > > >
> >         > >
> >         >
> >        
> ______________________________________________________________________
> >         > > > > > > _______________________________________________
> >         > > > > > > Openais mailing list
> >         > > > > > > Openais at lists.osdl.org
> >         > > > > > http://lists.osdl.org/mailman/listinfo/openais
> >         > > > > > 
> >         > > > > > 
> >         > > > > 
> >         > > > > 
> >         > > > 
> >         > > > 
> >         > > 
> >         > > 
> >         > 
> >         >
> >         
> >         
> 
> 




More information about the Openais mailing list