[Openais] Checkpoint crash in aisexec

Steven Dake sdake at mvista.com
Tue Feb 15 14:12:07 PST 2005


Kristen,

Good bug hunting work.

We have seen this before, long ago, but it was fixed for a time back
when recovery was limping along.

Thanks
-steve

On Tue, 2005-02-15 at 15:02, Kristen Smith wrote:
> Mark,
> 
> I wasn't explaining myself correctly because I didn't fully understand
> what was happening. I have been digging into it this afternoon and
> believe I now understand what is going on.
> 
> So, what I see is that messages are not queued on the existing node's
> side (as I previously implied); they are just dropped. My app retries
> messages when it doesn't get a response from the other node, so the
> retries wind up getting through to the existing node after a period of
> time. This should explain it better:
> 
> 1) I have 2 nodes (a and b) up - both are sending/receiving EVT
> messages every 1 second
> 2) I kill one of the nodes (a)
> 3) I start node a and it rejoins the cluster
> 4) The EVT messages coming from b get to a's app; however, the EVT
> messages coming from a get to b's aisexec but are never sent to the
> app.
> 
> The reason it is not being sent to the lib (on b's side) is that this
> call in exec/evt.c (evt_remote_evt):
> 
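>     /* returns true when evtpkt's event id is not newer than the last
>        id recorded for source_addr, so the event is treated as a dup */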
>     if (check_last_event(evtpkt, source_addr)) {
>         return 0;
>     }
> 
> is returning true. And the reason it returns true is that the event_id
> from a has restarted at 0 (the lower 32 bits), but b thinks the last
> event received from a is something greater than 0 (based on the number
> of events received from a before it went down earlier - which is why
> it takes longer to get new-node messages if the node has been around
> for a while). Now, a keeps sending (incrementing its id on each
> publish) until the event id from a finally catches up to what b had
> recorded for a before it failed; at that point check_last_event no
> longer returns true and the message is delivered to b's app.
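> 
> Roughly, the comparison behaves like this (a simplified standalone
> sketch of my understanding - the names and types here are made up, not
> the actual check_last_event code from evt.c):
> 
>     #include <stdint.h>
>     #include <stdio.h>
> 
>     /* hypothetical per-source record of the last event id seen */
>     struct last_event {
>             uint32_t last_id;       /* low 32 bits of the last id from this node */
>     };
> 
>     /* returns 1 (drop) when the incoming id is not newer than the stored one */
>     static int is_duplicate (struct last_event *le, uint32_t incoming_id)
>     {
>             if (incoming_id <= le->last_id) {
>                     return (1);     /* looks old - dropped, as b does to a's events */
>             }
>             le->last_id = incoming_id;
>             return (0);
>     }
> 
>     int main (void)
>     {
>             struct last_event from_a = { .last_id = 500 }; /* b's state before a died */
>             uint32_t id;
> 
>             /* a restarts and publishes from id 0 again */
>             for (id = 0; id < 502; id++) {
>                     if (is_duplicate (&from_a, id) == 0) {
>                             printf ("delivered id %u\n", (unsigned) id);
>                     }
>             }
>             return (0);     /* only id 501 is ever delivered */
>     }
> 
> With b's stored id at 500, a's restarted publisher has to count all
> the way past 500 before anything is delivered again - exactly the
> delay I am seeing.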
> 
> So, basically, after really digging into this and understanding how
> recovery applies, I now see that once evt recovery is in place, the
> last event ids will be transmitted across the nodes. The publishers
> will then send the correct event id based on what was received during
> recovery, and this will not be a problem any longer.
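> 
> In other words, the publisher should resume above whatever id the
> recovery exchange reports, rather than restarting at 0. Just a sketch
> of my guess at that behavior (not the planned implementation):
> 
>     #include <stdint.h>
> 
>     /* hypothetical: choose the next publish id once recovery has told
>        us the last id the other nodes recorded for us */
>     static uint32_t next_event_id (uint32_t local_counter, uint32_t recovered_last_id)
>     {
>             uint32_t base = local_counter > recovered_last_id ?
>                     local_counter : recovered_last_id;
> 
>             return (base + 1);      /* resume above b's last recorded id */
>     }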
> 
> Thanks for your patience,
> Kristen
> 
> 
> -----Original Message-----
> From: Mark Haverkamp [mailto:markh at osdl.org] 
> Sent: Tuesday, February 15, 2005 9:52 AM
> To: Steven Dake
> Cc: Smith, Kristen [NGC:B675:EXCH]; Openais List; Bajpai, Muni
> [NGC:B670:EXCH]
> Subject: RE: [Openais] Checkpoint crash in aisexec
> 
> 
> On Mon, 2005-02-14 at 13:17 -0700, Steven Dake wrote:
> > On Sat, 2005-02-12 at 08:08, Kristen Smith wrote:
> > > Steve,
> > > 
> > > Thanks for the response.
> > > 
> > > For recovery - what are the ramifications if we don't have
> > > recovery working 100%? What I see now is that when a node leaves
> > > the cluster and then rejoins, it receives evt messages, but it can
> > > take anywhere from 15 seconds to minutes for evt messages sent
> > > from that node to reach the other applications. I handle this with
> > > some
> > 
> > Mark have you seen this issue?
> 
> I haven't seen this.  Since the recovery code isn't enabled, there
> can't be a time delay from mcasting messages with retention times.
> Normally, I see very little pause in sending and receiving messages
> when a config change happens.
> 
> > 
> > > message retries, which is OK in this startup case. However, are
> > > we in jeopardy in other cases that I am not considering? When
> > > running traffic the past few days and seeing periodic reconfigs, I
> > > don't seem to be losing messages when that occurs - I only see the
> > > lost messages when I actually kill a node and start it back up to
> > > rejoin the cluster.
> 
> It is possible for events to be dropped by the event service.  If
> there are so many events occurring that the application can't keep up,
> the event service will drop some and send a "lost message" (as
> specified in the API spec).  Those messages wouldn't be delayed,
> though, just dropped, and the queue has to be backed up 1000 messages
> for this to happen.
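> 
> For reference, an application could recognize that drop in its
> delivery callback with something like this (my rough sketch, assuming
> the spec's reserved SA_EVT_LOST_EVENT_PATTERN pattern and the standard
> pattern-array types; this is not code from the tree):
> 
>     #include <string.h>
>     #include "saEvt.h"
> 
>     /* sketch: nonzero when a delivered event is the spec's "lost
>        event" notification rather than a normal application event */
>     static int is_lost_event (const SaEvtEventPatternArrayT *patterns)
>     {
>             return (patterns->patternsNumber == 1 &&
>                     patterns->patterns[0].patternSize ==
>                             sizeof (SA_EVT_LOST_EVENT_PATTERN) - 1 &&
>                     memcmp (patterns->patterns[0].pattern,
>                             SA_EVT_LOST_EVENT_PATTERN,
>                             patterns->patterns[0].patternSize) == 0);
>     }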
> 
> Mark.
> 
> 
> > > 
> > 
> > What we have today is totally unacceptable because, at least for
> > checkpointing, there is no recovery.  And Mark is waiting on my base
> > code for event recovery.
> > 
> > The definition of 100% working is that if there is a failure during
> > recovery, we are guaranteed a consistent state.  I think evt is
> > pretty close to this goal, although the checkpoint replication after
> > merge has not been developed yet.  I can think of a lot of easy ways
> > to do this, but handling a failure during the recovery phase makes
> > it more difficult.
> > 
> > The definition of almost 100% is that recovery works properly if
> > there are no faults during recovery (ie: the merge process), but if
> > there is a fault during recovery (ie: a reconfig) something could go
> > awry.
> > 
> > We want consistently replicated data (the 100% case).  100% is
> > probably past your development window; the other case is within
> > reach.
> > 
> > Regards
> > -steve
> > 
> > > Thanks
> > > Kristen
> > > 
> > > -----Original Message-----
> > > From: Steven Dake [mailto:sdake at mvista.com]
> > > Sent: Friday, February 11, 2005 5:30 PM
> > > To: Smith, Kristen [NGC:B675:EXCH]
> > > Subject: RE: [Openais] Checkpoint crash in aisexec
> > > 
> > > 
> > > Ok, well, I doubt that with 200-byte checkpoints there is a
> > > buffer overflow. :)
> > > 
> > > Recovery will come after 188 is wrapped up.  I think your two-week
> > > window looks good for alpha-level recovery (ie: works most of the
> > > time).  High-quality production recovery will not hit your
> > > development window (ie: works 100% of the time no matter what
> > > happens).
> > > 
> > > Thanks
> > > -steve
> > > 
> > > On Fri, 2005-02-11 at 15:56, Kristen Smith wrote:
> > > > Steve,
> > > > 
> > > > The size of the checkpoints is ~200 bytes.
> > > > 
> > > > I agree, valgrind is an excellent tool. We will run it through
> > > > and see if that shows anything.
> > > > 
> > > > I have tried this scenario maybe 30 times today (for various
> > > > other testing) and it happened maybe 10 times. For a while I
> > > > could reproduce it with a given test about 5 times, and then it
> > > > hasn't happened again.
> > > > 
> > > > Sounds like the defect-188 fixing is going well. May I ask how
> > > > the recovery work is going as well? (I don't mean to be pushy on
> > > > that front - we have 2 more weeks of coding left for our
> > > > application and I am really hoping that we are able to put the
> > > > new recovery code in during that time.)
> > > > Thanks a bunch,
> > > > Kristen
> > > > 
> > > > -----Original Message-----
> > > > From: Steven Dake [mailto:sdake at mvista.com]
> > > > Sent: Friday, February 11, 2005 4:37 PM
> > > > To: Smith, Kristen [NGC:B675:EXCH]
> > > > Subject: Re: [Openais] Checkpoint crash in aisexec
> > > > 
> > > > 
> > > > How large are the read or write requests?
> > > > Just a thought - there could be some buffer overrun with larger
> > > > requests.
> > > > 
> > > > On Fri, 2005-02-11 at 14:55, Kristen Smith wrote:
> > > > > Steve,
> > > > > 
> > > > > We are periodically seeing aisexec crash with the following 
> > > > > trace:
> > > > > 
> > > > >         (gdb) bt
> > > > >         #0  message_handler_req_lib_ckpt_checkpointclose
> > > > >             (conn_info=0x0, message=0xb73fc008) at ckpt.c:1552
> > > > >         #1  0x080494c2 in poll_handler_libais_deliver (handle=0,
> > > > >             fd=3, revent=134633824, data=0x89c2ad8,
> > > > >             prio=0x89b2784) at main.c:578
> > > > >         #2  0x08056e62 in poll_run (handle=0) at aispoll.c:386
> > > > >         #3  0x080499ac in main (argc=1, argv=0xbfffcb64)
> > > > >             at main.c:1003
> > > > > 
> > > > > We have looked through the code but can't seem to figure out
> > > > > how conn_info is getting set to 0. Do you have any idea under
> > > > > what circumstances conn_info could be null when this function
> > > > > is called?
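> > > > > 
> > > > > A guard like the one below would at least keep the daemon up
> > > > > while we hunt the real cause (a hypothetical sketch against
> > > > > the handler parameters shown in the backtrace, not a proposed
> > > > > fix):
> > > > > 
> > > > >     static int message_handler_req_lib_ckpt_checkpointclose (
> > > > >             struct conn_info *conn_info, void *message)
> > > > >     {
> > > > >             /* hypothetical guard: the backtrace shows
> > > > >              * conn_info == 0x0, so bail out instead of
> > > > >              * dereferencing it at ckpt.c:1552 */
> > > > >             if (conn_info == NULL) {
> > > > >                     return (0);
> > > > >             }
> > > > > 
> > > > >             /* ... existing close handling unchanged ... */
> > > > >             return (0);
> > > > >     }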
> > > > > 
> > > > > This is happening when we have multiple nodes up and we kill
> > > > > one of the active nodes. The standby node (which was reading
> > > > > checkpoints) must now become a writer, so it closes the
> > > > > checkpoint, and this happens. Unfortunately, I can't reproduce
> > > > > this consistently - I finally got a core dump today. I don't
> > > > > recall ever seeing this with the old code.
> > > > > 
> > > > > Thanks,
> > > > > Kristen
> > > > > 
> -- 
> Mark Haverkamp <markh at osdl.org>
> 