[Openais] Segfault and issues scaling up

Dave Dillow dave at thedillows.org
Tue Jun 1 20:18:12 PDT 2010


On 06/01/2010 01:37 PM, Steven Dake wrote:
> On 06/01/2010 10:26 AM, David Dillow wrote:
>> Certainly, it doesn't look like there should ever be encapsulated
>> messages on the regular ring, only the recovery ring. Somehow, we're
>> getting messages on the regular ring with at least one, if not two
>> levels of encapsulation.
>>
> 
> There should never be an encapsulated message in a regular ring.  The 
> ring id problem I spoke about later in this mail would explain why that 
> encapsulated message would come into the regular ring.

Ok, looks like r2792 fixed the encapsulated messages on the regular
ring, as you expected. I'm now tripping the assert on line 2750 in
totemsrp.c:

        assert (instance->commit_token->memb_index <= \
		instance->commit_token->addr_entries);

This happened on several nodes when running with a peak of 93 nodes in
the cluster. It first hit one or two nodes, then tripped again later
once the count had dropped to 90 or so.
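
For anyone reading along, the condition that assert enforces can be
sketched in isolation. This is only an illustration, not code from
totemsrp.c: the struct below is a hypothetical stand-in that carries
just the two fields named in the assert (their int types are an
assumption), and commit_token_index_ok() is a made-up helper name. It
logs both values instead of aborting, which might help capture them in
the logs even when a core file is lost.

```c
#include <assert.h>	/* for callers that wrap the check in assert() */
#include <stdio.h>

/*
 * Hypothetical stand-in for the commit token. Only the two fields
 * named in the assert are modeled; the real structure has more.
 */
struct commit_token_sketch {
	int memb_index;		/* field named in the assert (type assumed) */
	int addr_entries;	/* field named in the assert (type assumed) */
};

/*
 * Same comparison as the assert, without aborting.
 * Returns 1 if memb_index <= addr_entries, otherwise logs both
 * values to stderr and returns 0.
 */
static int commit_token_index_ok(const struct commit_token_sketch *token)
{
	if (token->memb_index <= token->addr_entries) {
		return 1;
	}
	fprintf(stderr,
		"commit token check failed: memb_index=%d addr_entries=%d\n",
		token->memb_index, token->addr_entries);
	return 0;
}
```

A caller that still wants the abort could wrap the helper, e.g.
assert (commit_token_index_ok (token)), so the values are printed
before the process dies.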

I'm still running the shorter timeouts, as they seem to stress the
system a bit more to force issues like this to surface. I've saved off
three specimens of the core files and associated logs for further study,
as it is likely the machines will be rebooted tomorrow for other testing
and they don't have long-term local storage.

Any suggestions on how I can help debug this?

Thanks,
Dave

