[Openais] Segfault and issues scaling up

Tue Jun 1 20:57:39 PDT 2010

On 06/01/2010 11:18 PM, Dave Dillow wrote:
> On 06/01/2010 01:37 PM, Steven Dake wrote:
>> On 06/01/2010 10:26 AM, David Dillow wrote:
>>> Certainly, it doesn't look like there should ever be encapsulated
>>> messages on the regular ring, only the recovery ring. Somehow, we're
>>> getting messages on the regular ring with at least one, if not two
>>> levels of encapsulation.
>>>
>>
>> There should never be an encapsulated message in a regular ring.  The
>> ring id problem I spoke about later in this mail would explain why that
>> encapsulated message would come into the regular ring.

Sorry, here would have been a good place to mention that these tests
were with r2917 off the trunk.

> Ok, looks like r2792 fixed the encapsulated messages on the regular
> ring, as you expected. I'm now tripping the assert on line 2750 in
> totemsrp.c:
> 
>         assert (instance->commit_token->memb_index <=
>                 instance->commit_token->addr_entries);
> 
> This happened on several nodes when running with a peak of 93 nodes in
> the cluster. It first hit one or two nodes, then tripped again later
> once the count had dropped to 90 or so.
> 
> I'm still running the shorter timeouts, as they seem to stress the
> system a bit more to force issues like this to surface. I've saved off
> three specimens of the core files and associated logs for further study,
> as it is likely the machines will be rebooted tomorrow for other testing
> and they don't have long-term local storage.
> 
> Any suggestions on how I can help debug this?
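
One thing that might help while the cores are still around: record the two
values the assert compares, so the logs show how far memb_index ran past
addr_entries. Below is a minimal standalone sketch of that check, not a
patch against totemsrp.c. The struct and function names are made up for
illustration; only the memb_index and addr_entries field names come from
the assert quoted above, and their types here are an assumption.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for the commit token. Only the two fields named
 * in the assert are modelled; the real structure in totemsrp.c has more
 * members, and the unsigned int types here are assumed.
 */
struct commit_token_sketch {
	unsigned int memb_index;
	unsigned int addr_entries;
};

/*
 * Checks the same condition as the assert (memb_index <= addr_entries).
 * Returns 1 when it holds. When it does not, prints both values to
 * stderr and returns 0, so the caller can log context before aborting.
 */
int commit_token_index_ok (const struct commit_token_sketch *token)
{
	if (token->memb_index <= token->addr_entries) {
		return (1);
	}

	fprintf (stderr,
		"commit token check failed: memb_index=%u addr_entries=%u\n",
		token->memb_index, token->addr_entries);
	return (0);
}
```

A caller could test the return value and dump whatever else looks useful
(ring id, node id) before calling abort(), which still leaves a core file
behind.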