[Openais] Segfault and issues scaling up

David Dillow dave at thedillows.org
Tue Jun 1 10:26:35 PDT 2010


On Tue, 2010-06-01 at 09:04 -0700, Steven Dake wrote:
> The most I have tested physically is 48 nodes.  I can't offer any advice 
> on what to tune beyond that, other than to increase join and consensus 
> to much larger values than they are currently set.
> 
> Some good options might be
> join: 150
> token: 5000
> consensus: 20000
> 
> Note I am hesitant to think that corosync will work well in its current 
> form at a 70 node count.

Ok, I'll give those a shot.
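
Concretely, here is roughly what I'll be trying in the totem section of
corosync.conf -- a sketch only, with the rest of my totem block left
as-is and the timer values (in milliseconds) taken straight from your
suggestion:

totem {
        # ... existing version/interface settings unchanged ...
        token: 5000
        consensus: 20000
        join: 150
}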

> > As for the segfault, it is the result of totempg_deliver_fn() being
> > handed an encapsulated packet and then misinterpreting it. This was
> > handed down from messages_deliver_to_app(), and based on the flow around
> > deliver_messages_from_recovery_to_regular() I expect that it should not
> > see encapsulated messages. Looking through the core dump, the
> > multi-encapsulated message is from somewhat ancient ring instances: the
> > current ringid seq is 38260, and the outer encapsulation is for seq
> > 38204 with an inner encapsulation of seq 38124. It seems this node was
> > last operation in ring 38204, and had entered recovery state a number of
> > times without landing in operational again prior to the crash.
> 
> It is normal for these ring messages to be recovered, but perhaps there 
> is some error in how the recovery is functioning.

Certainly, it doesn't look like there should ever be encapsulated
messages on the regular ring, only the recovery ring. Somehow, we're
getting messages on the regular ring with at least one, if not two
levels of encapsulation.

Also, should we be recovering messages from ringids that are not our
immediate ancestor?
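
To put the invariant another way, here is a standalone sketch of the
check I think is being violated -- the struct and function names below
are mine for illustration, not the actual totemsrp/totempg code:

/*
 * Illustrative only: by the time a multicast message reaches the
 * regular (operational) delivery path, its header should no longer be
 * marked encapsulated; if it is, a recovery-ring message has leaked
 * through.
 */
#include <stdio.h>
#include <stddef.h>

struct msg_header {
        char type;
        char encapsulated;      /* stand-in for the real header flag */
};

static void deliver_to_regular(const struct msg_header *hdr,
                               const void *payload, size_t len)
{
        if (hdr->encapsulated) {
                /* The case we appear to be hitting: log and drop
                 * instead of misinterpreting the payload. */
                fprintf(stderr, "encapsulated message on regular ring, dropping\n");
                return;
        }

        /* normal delivery to totempg / the application would happen here */
        (void)payload;
        (void)len;
}

int main(void)
{
        struct msg_header hdr = { 0, 1 };       /* encapsulated message */

        deliver_to_regular(&hdr, NULL, 0);
        return 0;
}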

> 1.2.2 contains several fixes related to lossy messaging and one segfault 
> in particular that occurs when recovery is interrupted by new memberships.

Ok, I'll recheck the changes between the two.

> > While working with pacemaker prior to focusing on corosync, I noticed on
> > several occasions where corosync would get into a situation where all
> > nodes of the cluster were considered members of the ring, but some nodes
> > were working with sequence numbers that were several hundred behind
> > everyone else, and did not catch up. I have not seen this in a
> > corosync-only test, but I suspect it may be related to the segfault
> > above -- it only seemed to occur after a pass through the recovery state.
> >
> 
> I would expect that is normal on a lossy network (i.e., if there are 
> retransmits in your network).  With 48 nodes all sending messages, it is 
> possible for one node during high load to have lower seqids because it 
> hasn't yet received or processed them.

The network rarely has retransmits -- and this was a stable
configuration. I'm working a bit from memory here, as I concentrated on
the segfault issue. I'll keep an eye out for this occurrence and see if
I can collect more data.
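
For what it's worth, "rarely has retransmits" is based on watching the
corosync logs for retransmit-list messages, roughly:

    grep -c "Retransmit List" /var/log/messages

(adjust the path for wherever your logging section sends output; the
exact message text may also vary by version).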

> > Any suggestions on how to proceed to put this bug to bed?
> >
> 
> I would try 1.2.4 (pending).  1.2.2 and 1.2.3 exhibit problems with 
> logging on some platforms.

Alrighty, will give that a go.

Thanks!
Dave


