[Openais] corosync enters recovery repeatedly on lossy network

Thu Jun 17 19:16:13 PDT 2010

Hi,

I'm running corosync on a setup where corosync packets are being delayed and
lost. I'm seeing corosync enter recovery mode repeatedly, which in turn causes
other problems for us. (We're running trunk as of revision 2569 (8 Dec 09), so
some of these flow-on problems may already be fixed.)

The repeated entry into recovery mode doesn't look like it's fixed on the
latest trunk though. The problem is that corosync cancels its token retransmit
timeout prematurely in message_handler_mcast().

In this setup corosync receives some mcast packets out of order. So corosync
can receive an mcast message with a lower seq than the last token it sent out,
and it stops its token retransmit timer anyway. If the token it just sent is
then lost, it never retransmits the token. The token timeout expires and
corosync enters gather/commit/recovery.

I think the message_handler_mcast() code should also check the seq of the mcast
message before stopping the retransmit timer (see attached patch). You can only
guarantee the last token sent was successfully received if another node sends
an mcast message with a higher seq.
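
To illustrate the idea, here is a minimal standalone sketch of the check. It
is not the attached patch and not the real totemsrp.c code: the struct, the
field names and the helper functions are all hypothetical, and it assumes
32-bit sequence numbers that may wrap around and that only messages from
other nodes are considered.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of one node's state (not corosync's struct). */
struct node_state {
    uint32_t my_nodeid;
    uint32_t last_token_seq;       /* seq carried by the last token we sent */
    bool retransmit_timer_armed;   /* true while the token retransmit timer runs */
};

/* Wraparound-safe test: is seq a strictly after seq b?
 * Assumes the two values are less than 2^31 apart. */
static bool seq_after(uint32_t a, uint32_t b)
{
    uint32_t d = a - b;            /* unsigned subtraction wraps modulo 2^32 */
    return d != 0 && d < UINT32_C(0x80000000);
}

/* Behaviour described above: an mcast from another node stops the timer,
 * whatever its seq is. */
static void on_mcast_current(struct node_state *n, uint32_t from, uint32_t seq)
{
    (void)seq;                     /* seq is not looked at */
    if (from != n->my_nodeid)
        n->retransmit_timer_armed = false;
}

/* Proposed behaviour: only an mcast with a higher seq than the last token
 * we sent is taken as proof that the token was received. */
static void on_mcast_proposed(struct node_state *n, uint32_t from, uint32_t seq)
{
    if (from != n->my_nodeid && seq_after(seq, n->last_token_seq))
        n->retransmit_timer_armed = false;
}
```

With last_token_seq at 100, a delayed mcast with seq 99 from another node
stops the timer in on_mcast_current() but leaves it running in
on_mcast_proposed(), so a lost token would still be retransmitted.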

Does anyone see any problems with this patch?

Thanks,
Tim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync-retransmit-timer.patch
Type: text/x-patch
Size: 1358 bytes
Desc: not available
Url : http://lists.linux-foundation.org/pipermail/openais/attachments/20100618/dd6ba330/attachment.bin