[Openais] Segfault and issues scaling up

Steven Dake sdake at redhat.com
Tue Jun 1 09:04:07 PDT 2010


On 05/31/2010 08:41 PM, Dave Dillow wrote:
> Hello,
> I'm investigating the use of corosync and pacemaker to manage our file
> system cluster, and I'm running into some not unexpected issues. For
> many reasons, it makes sense to manage all of the nodes as a single
> cluster, but it would appear that pacemaker is not currently suitable
> for a ~200 node cluster, and that corosync will require some tuning to
> get there. As I said, not unexpected.
>
> To separate concerns, I've been focusing on getting corosync up and stable at
> smaller scales first, and then plan to get pacemaker happy once there is
> a solid foundation. To that end, I've started with smaller clusters, 12
> to 48 nodes or so -- using GigE currently, though I would prefer to
> eventually use a redundant ring over Infiniband.
>
> I've been using the following in the configuration file at the moment:
>      join: 50
>      token: 2000
>      consensus: 5000
>
> I've tried a few other settings as well, but the ring seems to become
> unstable after 70 or so nodes, and it may also have some stability
> issues at lower scales, especially around configuration changes, where
> multiple rings will be formed and dissolved in rapid succession. It will
> often settle down at smaller scales and include all active nodes, but at
> the larger scales it will continue this instability indefinitely, as
> well as cause some nodes to segfault or get confused about the current
> sequence number expected. In the case where the configuration does
> stabilize, I have seen it get to a state where it seems to be passing
> only 4 to 8 messages per second as measured by the log output from the
> SYNC service. Pacemaker has been disabled for this work.

The most I have tested physically is 48 nodes.  I can't offer any advice 
on what to tune beyond that, other than to increase join and increase 
consensus to much larger values than they are currently set.

Some good options might be:
join: 150
token: 5000
consensus: 20000
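
For reference, these timing values belong in the totem stanza of 
corosync.conf; a sketch with the suggested values above (the interface 
addresses below are placeholders, not part of the original mail):

```
totem {
    version: 2
    # join (ms): how long to wait for join messages during membership
    # formation; larger values help big rings converge
    join: 150
    # token (ms): token loss timeout before a processor is declared failed
    token: 5000
    # consensus (ms): how long to wait for consensus before starting a new
    # membership round; must be larger than token
    consensus: 20000
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0    # placeholder network address
        mcastaddr: 226.94.1.1       # placeholder multicast address
        mcastport: 5405
    }
}
```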

Note that I am hesitant to think that corosync will work well in its 
current form at a 70-node count.

>
> Does anyone have some suggestions on good timing parameters to use for
> rings of this size? I can probably work my way through the papers on
> Totem to deduce some numbers, but perhaps the experienced hands here
> have some idea of the ballpark I'm looking for.
>
> As for the segfault, it is the result of totempg_deliver_fn() being
> handed an encapsulated packet and then misinterpreting it. This was
> handed down from messages_deliver_to_app(), and based on the flow around
> deliver_messages_from_recovery_to_regular() I expect that it should not
> see encapsulated messages. Looking through the core dump, the
> multi-encapsulated message is from somewhat ancient ring instances: the
> current ringid seq is 38260, and the outer encapsulation is for seq
> 38204 with an inner encapsulation of seq 38124. It seems this node was
> last operational in ring 38204, and had entered recovery state a number of
> times without landing in operational again prior to the crash.

It is normal for these ring messages to be recovered, but perhaps there 
is some error in how the recovery is functioning.
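
To illustrate the failure mode being described, here is a toy model (not 
corosync internals; the dict-based message format is invented for the 
sketch): during Totem recovery, messages from the previous ring are 
re-sent wrapped inside recovery messages, and the recovery-to-regular 
delivery path is supposed to strip that wrapper before the payload 
reaches the application.  A still-encapsulated message reaching the app 
layer gets its header misinterpreted, consistent with the segfault in 
totempg_deliver_fn() described above.

```python
# Toy sketch of recovery-message encapsulation; not corosync code.

def encapsulate(ring_seq, payload):
    """Wrap a payload from an old ring in a recovery message."""
    return {"encapsulated": True, "ring_seq": ring_seq, "body": payload}

def deliver_to_app(msg):
    """App delivery must only ever see fully unwrapped payloads."""
    if isinstance(msg, dict) and msg.get("encapsulated"):
        raise ValueError("encapsulated message leaked to the app layer")
    return msg

# The core dump showed two layers: outer ring seq 38204, inner 38124.
wrapped = encapsulate(38204, encapsulate(38124, b"app data"))

# Correct path: strip every layer of encapsulation before delivery.
inner = wrapped
while isinstance(inner, dict) and inner.get("encapsulated"):
    inner = inner["body"]
print(deliver_to_app(inner))  # b'app data'
```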

>
> I have a core dump of this occurring in corosync 1.2.1, as well as the
> logs from the node that crashed and one or two others in the cluster.
> I've looked through the changes to 1.2.2 and 1.2.3, but nothing stands
> out as likely to solve this. Building new versions is somewhat painful
> on this diskless cluster, so I'll try to reproduce with 1.2.3 before
> building custom versions. I can probably make the logs available to
> interested parties as well.

1.2.2 contains several fixes related to lossy messaging and one segfault 
in particular that occurs when recovery is interrupted by new memberships.

>
> While working with pacemaker prior to focusing on corosync, I noticed on
> several occasions where corosync would get into a situation where all
> nodes of the cluster were considered members of the ring, but some nodes
> were working with sequence numbers that were several hundred behind
> everyone else, and did not catch up. I have not seen this in a
> corosync-only test, but I suspect it may be related to the segfault
> above -- it only seemed to occur after a pass through the recovery state.
>

I would expect that is normal on a lossy network (i.e. if there are 
retransmits in your network).  With 48 nodes all sending messages, it is 
possible for one node during high load to have lower seqids because it 
hasn't yet received or processed those messages.
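
The mechanism can be sketched with a toy model (again, not corosync 
code): Totem delivers messages in sequence order, so a node that misses 
a message cannot advance its delivered sequence number past the gap 
until the retransmit arrives, even though it keeps receiving later 
messages.

```python
# Toy sketch: in-order delivery stalls at the first missing seqid.

def highest_delivered(received):
    """Highest contiguous sequence number deliverable from `received`."""
    seq = 0
    while seq + 1 in received:
        seq += 1
    return seq

sent = set(range(1, 1001))   # 1000 messages broadcast on the ring
healthy = set(sent)          # node that received everything
lossy = sent - {317, 642}    # node that dropped two messages

print(highest_delivered(healthy))  # 1000
print(highest_delivered(lossy))    # 316: stuck at the first gap
```

Until the retransmits for the two dropped messages arrive, the lossy 
node reports a seqid hundreds behind its peers despite holding most of 
the later traffic.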

> Any suggestions on how to proceed to put this bug to bed?
>

I would try 1.2.4 (pending).  1.2.2 and 1.2.3 exhibit problems with 
logging on some platforms.

> Thanks,
> Dave
>
> _______________________________________________
> Openais mailing list
> Openais at lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais


