[Openais] Checkpoint crash in aisexec
Mark Haverkamp
markh at osdl.org
Tue Feb 15 14:14:19 PST 2005
On Tue, 2005-02-15 at 17:02 -0500, Kristen Smith wrote:
> Mark,
>
> I wasn't explaining myself correctly because I didn't fully understand
> what was happening. I have been digging into it this afternoon and
> believe I now understand what is going on.
>
> So, what I see is that messages are not queued on the existing node's
> side (as I previously implied); they are just dropped. My app retries
> a message when it doesn't get a response from the other node, so the
> retries eventually get through to the existing node. This should
> explain it better:
>
> 1) I have 2 nodes (a and b) up - both are sending/receiving EVT
> messages every 1 second
> 2) I kill one of the nodes (a)
> 3) I start node a and it rejoins the cluster
> 4) The EVT messages coming from b get to a's app; however, the EVT
> messages coming from a get to b's aisexec but don't get sent to the
> app.
>
> The reason it is not being sent to the lib (on b's side) is that this
> call in exec/evt.c(evt_remote_evt):
>
>     if (check_last_event(evtpkt, source_addr)) {
>             return 0;
>     }
>
> is returning true. It returns true because the event_id from a has
> restarted at 0 (the lower 32 bits), but b thinks the last event
> received from a is something more than 0 (based on the number of
> events received from a before it went down earlier, which is why a
> node that has been around for a while takes longer to get its
> messages through after a restart). Now, a keeps sending (incrementing
> its id on each publish) until its event id catches up to what b had
> recorded for a before the failure; at that point check_last_event
> returns false and the message is delivered to b's app.
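
For illustration, the drop behaviour described above can be modelled
roughly as follows. This is only a sketch and not the code in
exec/evt.c: struct sender_state, is_stale_event() and
events_dropped_after_restart() are hypothetical names, and the "less
than or equal" comparison is an assumption about how check_last_event
decides that an event has already been seen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-sender record kept by the receiving node
 * (not the real aisexec data structure). */
struct sender_state {
    uint64_t last_event_id;   /* highest event id accepted from this sender */
};

/*
 * Simplified stand-in for check_last_event(): returns true when the
 * incoming id is not newer than the last id accepted from this sender,
 * i.e. the event is treated as already seen and is dropped.
 * Otherwise it records the id and returns false (event is delivered).
 */
bool is_stale_event(struct sender_state *s, uint64_t event_id)
{
    assert(s != NULL);
    if (event_id <= s->last_event_id) {
        return true;
    }
    s->last_event_id = event_id;
    return false;
}

/*
 * Counts how many events are dropped when a restarted sender publishes
 * `publishes` events with ids first_id, first_id + 1, ... while the
 * receiver still remembers the id from before the restart.
 */
unsigned events_dropped_after_restart(struct sender_state *s,
                                      uint64_t first_id,
                                      unsigned publishes)
{
    unsigned dropped = 0;
    for (unsigned i = 0; i < publishes; i++) {
        if (is_stale_event(s, first_id + i)) {
            dropped++;
        }
    }
    return dropped;
}
```

Under these assumptions, the longer a had been publishing before it
was killed, the larger last_event_id is on b, and the more events b
drops before a's restarted counter catches up.
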
>
> So, basically, after really digging into this and understanding how
> recovery applies, I now see that once evt recovery is in place, the
> last event ids will be transmitted across the nodes. The publishers
> will then send the correct event id based on what was just received
> during recovery, and this will no longer be a problem.
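
As a rough sketch of the recovery idea described above (not the
actual evt recovery code, which is not shown in this thread): the
rejoining publisher learns the last event id each peer accepted from
it and resumes numbering above the highest one. struct publisher,
resume_after_recovery() and publish() are hypothetical names used
only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical publisher state on the node that has just rejoined. */
struct publisher {
    uint64_t next_event_id;   /* id to use for the next published event */
};

/*
 * Hypothetical recovery step: peer_last_ids[i] is the last event id
 * that peer i accepted from this node before it went down. The
 * publisher moves its counter past the highest of those ids so that
 * its next event is not mistaken for one already seen.
 */
void resume_after_recovery(struct publisher *p,
                           const uint64_t *peer_last_ids,
                           size_t n_peers)
{
    assert(p != NULL);
    for (size_t i = 0; i < n_peers; i++) {
        if (peer_last_ids[i] >= p->next_event_id) {
            p->next_event_id = peer_last_ids[i] + 1;
        }
    }
}

/* Returns the id for the event being published and advances the counter. */
uint64_t publish(struct publisher *p)
{
    assert(p != NULL);
    return p->next_event_id++;
}
```

With a counter resumed this way, the first event published after the
rejoin already carries an id newer than anything the peers remember,
so nothing has to "catch up".
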
OK, thanks, it's good to know what is going on.
Mark.
--
Mark Haverkamp <markh at osdl.org>