[Openais] Checkpoint crash in aisexec

Mark Haverkamp markh at osdl.org
Tue Feb 15 14:14:19 PST 2005


On Tue, 2005-02-15 at 17:02 -0500, Kristen Smith wrote:
> Mark,
> 
> I wasn't explaining myself correctly because I didn't fully understand
> what was happening. I have been digging into it this afternoon and
> believe I now understand what is going on.
> 
> So, what I see is that messages are not queued on the existing node's
> side (as I previously implied); they are just dropped. My app does
> message retries when it doesn't get a response from the other node, so
> the retries wind up getting through to the existing node after a
> period of time. This should explain it better:
> 
> 1) I have 2 nodes (a and b) up - both are sending/receiving EVT
> messages every 1 second 
> 2) I kill one of the nodes (a) 
> 3) I start node a and it rejoins the cluster 
> 4) The evt messages coming from b get to a's app, however, the EVT
> messages coming from a, get to b's aisexec, but don't get sent to the
> app. 
> 
> The reason it is not being sent to the lib (on b's side) is that this
> call in exec/evt.c(evt_remote_evt):
> 
>     if (check_last_event(evtpkt, source_addr)) { 
>         return 0; 
>     }
> 
> is returning true. And the reason it returns true is that the event_id
> from a has restarted at 0 (the lower 32 bits), but b thinks the last
> event received from a is something more than 0 (based on # of events
> received from a before it went down earlier - which is why it takes
> longer to get new node messages if the node has been around for a
> while). Now, a keeps sending (and incrementing its id on each publish)
> and finally the event id from a catches up to what b had for a before
> it failed; check_last_event then returns false and the message is
> delivered to b's app.
> 
> So, basically, after really digging into this and understanding how
> recovery applies, I now see that once evt recovery is in place, the
> last event ids will be transmitted across the nodes and the publishers
> will send the correct event id based on what was just received during
> recovery and this will not be a problem any longer.

OK, thanks, it's good to know what is going on. 
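
For anyone following the thread, the behaviour described in the quoted
message can be sketched roughly as below. This is only an illustration,
not the code in exec/evt.c: the table last_seen, the helpers
is_stale_event and publish_and_count_drops, the MAX_NODES constant and
the "less than or equal" comparison are all assumptions made for the
sketch. It models a receiver that remembers the highest event id seen
per node and drops anything at or below it.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 4 /* hypothetical cluster size for this sketch */

/* Highest event id the receiver has seen from each publishing node.
 * Hypothetical stand-in for whatever state check_last_event consults. */
static uint64_t last_seen[MAX_NODES];

/* Returns 1 if the event looks old and would be dropped, 0 if it is
 * new. A new event is recorded as the latest one seen from that node.
 * The "<=" comparison is an assumption, not taken from exec/evt.c. */
static int is_stale_event(unsigned node, uint64_t event_id)
{
    assert(node < MAX_NODES);
    if (event_id <= last_seen[node]) {
        return 1;
    }
    last_seen[node] = event_id;
    return 0;
}

/* Simulates a publisher on `node` whose counter currently stands at
 * `start_id` sending `count` events (ids start_id+1 .. start_id+count).
 * Returns how many of them the receiver dropped. */
static unsigned publish_and_count_drops(unsigned node, uint64_t start_id,
                                        unsigned count)
{
    unsigned dropped = 0;
    for (unsigned i = 1; i <= count; i++) {
        dropped += (unsigned)is_stale_event(node, start_id + i);
    }
    return dropped;
}
```

In this model, a node that publishes 100 events, restarts with its
counter back at 0, and publishes 150 more loses the first 100 of them;
the longer the node ran before the restart, the longer the gap. If the
restarted publisher instead resumes from the receiver's last_seen value,
which is roughly what exchanging last event ids during recovery would
achieve, nothing is dropped.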


Mark.

-- 
Mark Haverkamp <markh at osdl.org>



