[Openais] aisexec core dump during traffic

Kristen Smith kjsmith at nortel.com
Mon Feb 14 06:38:48 PST 2005


One correction - the assertion I saw on the 2nd node that crashed was not the
same one as on the first node. It was this one:

aisexec: ../include/sq.h:152: sq_item_get: Assertion `sq_position >= 0' failed

-----Original Message-----
From: openais-bounces at lists.osdl.org [mailto:openais-bounces at lists.osdl.org]
On Behalf Of Smith, Kristen [NGC:B675:EXCH]
Sent: Sunday, February 13, 2005 9:42 AM
To: 'openais at lists.osdl.org'
Cc: Bajpai, Muni [NGC:B670:EXCH]
Subject: RE: [Openais] aisexec core dump during traffic


One more thing - this assert actually happened on 2 nodes, not just 1.
Unfortunately, I didn't have core files enabled on the 2nd machine. It was
the same assert line as the other node.

-----Original Message-----
From: openais-bounces at lists.osdl.org [mailto:openais-bounces at lists.osdl.org]
On Behalf Of Smith, Kristen [NGC:B675:EXCH]
Sent: Sunday, February 13, 2005 9:26 AM
To: 'openais at lists.osdl.org'
Cc: Bajpai, Muni [NGC:B670:EXCH]
Subject: [Openais] aisexec core dump during traffic



Steve, 

I was running traffic this weekend in a 3+1 configuration, with each of the
active nodes writing out ~6 ckpts/second. It ran for about 20 hours and then I
got the following from aisexec on one of the active nodes:

aisexec: ../include/sq.h:102: sq_item_add: Assertion `sq->items_inuse[sq_position] == 0' failed.

and a trace: 

#0  0x00bebcdf in raise () from /lib/tls/libc.so.6 
#1  0x00bed4e5 in abort () from /lib/tls/libc.so.6 
#2  0x00be5609 in __assert_fail () from /lib/tls/libc.so.6 
#3  0x0805add1 in orf_token_mcast (token=0xbfffce00, fcc_mcasts_allowed=29, system_from=0xbfffd420) at totemsrp.c:1990
#4  0x080587e6 in message_handler_orf_token (system_from=0xbfffd420, iovec=0xbfffce00, iov_len=1, bytes_received=78, endian_conversion_needed=0) at totemsrp.c:2702
#5  0x0805a3d9 in recv_handler (handle=0, fd=7, revents=1, data=0x0, prio=0x0) at totemsrp.c:3351
#6  0x08056e62 in poll_run (handle=0) at aispoll.c:386 
#7  0x080499ac in main (argc=1, argv=0xbfffd634) at main.c:1003 
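
For readers unfamiliar with what the two assertions in this thread guard, here
is a minimal sketch in C. It is not the openais sq.h code. It assumes a
fixed-size window of slots indexed by (head + seqid - head_seqid) modulo the
window size, and every name in it (sketch_sq, SKETCH_SQ_SIZE, sketch_sq_add,
and so on) is hypothetical. Under that assumption, a slot can already be in use
when a sequence id is added twice or runs a full window ahead of the head, and
the position can go negative when a sequence id is older than the head.

```c
#include <string.h>

#define SKETCH_SQ_SIZE 8 /* hypothetical window size, not taken from openais */

/* Hypothetical sliding-window queue keyed by sequence id. */
struct sketch_sq {
    int head;                                  /* slot that holds head_seqid */
    int head_seqid;                            /* oldest sequence id in the window */
    unsigned char items_inuse[SKETCH_SQ_SIZE]; /* 1 if the slot is occupied */
    int items[SKETCH_SQ_SIZE];                 /* payload, an int for brevity */
};

enum sketch_sq_status {
    SKETCH_SQ_OK = 0,
    SKETCH_SQ_NEGATIVE_POSITION = -1, /* analogue of `sq_position >= 0' failing */
    SKETCH_SQ_SLOT_IN_USE = -2        /* analogue of `items_inuse[...] == 0' failing */
};

static void sketch_sq_init(struct sketch_sq *sq, int head_seqid)
{
    memset(sq, 0, sizeof(*sq));
    sq->head_seqid = head_seqid;
}

/* Map a sequence id to a slot. C's % keeps the sign of the dividend,
 * so a seqid older than the head can produce a negative position. */
static int sketch_sq_position(const struct sketch_sq *sq, int seqid)
{
    return (sq->head + seqid - sq->head_seqid) % SKETCH_SQ_SIZE;
}

/* Store an item, reporting the two failure cases instead of asserting. */
static enum sketch_sq_status sketch_sq_add(struct sketch_sq *sq, int item, int seqid)
{
    int pos = sketch_sq_position(sq, seqid);

    if (pos < 0)
        return SKETCH_SQ_NEGATIVE_POSITION;
    if (sq->items_inuse[pos])
        return SKETCH_SQ_SLOT_IN_USE;
    sq->items[pos] = item;
    sq->items_inuse[pos] = 1;
    return SKETCH_SQ_OK;
}
```

With a window of 8 slots starting at sequence id 100, adding id 100 twice or
adding id 108 reports a slot in use, and adding id 99 reports a negative
position. Whether either case matches what happened in aisexec here is not
something this sketch can show.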

This is the BitKeeper code from last Monday.

Here are the #defines I have changed, if that matters at all: 

#define TIMEOUT_STATE_GATHER_JOIN       40
#define TIMEOUT_STATE_GATHER_CONSENSUS  80
#define TIMEOUT_TOKEN                   180
#define TIMEOUT_TOKEN_RETRANSMIT        30

Any other information I can provide for you? 

Thanks, 
Kristen 
