[Openais] recover from corosync daemon restart and cpg_finalize timing

Steven Dake sdake at redhat.com
Thu Jun 24 00:16:12 PDT 2010


On 06/23/2010 11:35 PM, Andrew Beekhof wrote:
> On Thu, Jun 24, 2010 at 1:50 AM, dan clark <2clarkd at gmail.com> wrote:
>> Dear Gentle Reader....
>>
>> Attached is a small test program to stress initializing and finalizing
>> communication between a corosync cpg client and the corosync daemon.
>> The test was run under version 1.2.4.  Initial testing was done on a
>> single node; subsequent testing occurred on a system consisting of 3
>> nodes.
>>
>> 1) If the program loops on initialize/mcast_joined/dispatch/finalize
>> AND the corosync daemon is restarted while the program is looping
>> (service corosync restart), then the application locks up in the
>> corosync client library in a variety of interesting locations.  This
>> is easiest to reproduce on a single-node system with a large
>> iteration count and a usleep value between joins:
>> 'stress_finalize -t 500 -i 10000 -u 1000 -v'.  Sometimes it recovers
>> in a few seconds (analysis of strace showed
>> futex(...FUTEX_WAIT, 0, {1, 997888000}) ..., which would account for
>> multiple 2-second delays in error recovery from a lost corosync
>> daemon).  Sometimes it locks up solid!  What is the proper way of
>> handling the loss of the corosync daemon?  Is it possible for the cpg
>> library to recover quickly in the case of a failed daemon?
>>
>> sample back trace of lockup:
>> #0  0x000000363c60c711 in sem_wait () from /lib64/libpthread.so.0
>> #1  0x0000003000002a34 in coroipcc_msg_send_reply_receive (
>>    handle=<value optimized out>, iov=<value optimized out>, iov_len=1,
>>    res_msg=0x7fffaefecac0, res_len=24) at coroipcc.c:465
>> #2  0x0000003000802db1 in cpg_leave (handle=1648075416440668160,
>>    group=<value optimized out>) at cpg.c:458
>> #3  0x0000000000400df8 in coInit (handle=0x7fffaefecdb0,
>>    groupNameStr=0x7fffaefeccb0 "./stress_finalize_groupName-0", ctx=0x6e1)
>>    at stress_finalize.c:101
>> #4  0x000000000040138a in main (argc=8, argv=0x7fffaefecf28)
>>    at stress_finalize.c:243
>
> I've also started getting semaphore-related stack traces.
>

The stack trace from Dan is different from yours, Andrew.  Yours is 
during startup.  Dan is more concerned with the fact that 
sem_timedwait sits around for 2 seconds before returning an indication 
that the server has exited or stopped (along with other issues).
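
For illustration, here is a minimal sketch of that
initialize/join/mcast/dispatch/finalize cycle with every return code
checked.  This is not Dan's stress_finalize.c; the group name, payload,
and helper name are made up.  Roughly the best a client can do is treat
CS_ERR_LIBRARY (or CS_ERR_TRY_AGAIN) from any cpg_* call as a sign the
daemon connection is gone, finalize the handle, and reconnect later;
the 2 second sem_timedwait ceiling inside the library still applies
before those errors come back.

    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <corosync/corotypes.h>
    #include <corosync/cpg.h>

    /* Sketch only -- not the actual stress_finalize.c.  Group name and
     * payload are placeholders.  The idea: check every cpg_* return
     * code, treat an error as "connection to the daemon is gone",
     * finalize, and let the caller retry with a fresh handle. */
    static cpg_callbacks_t callbacks = { NULL, NULL };

    static int one_iteration(const char *name)
    {
            cpg_handle_t handle;
            struct cpg_name group;
            struct iovec iov;
            char payload[] = "ping";
            cs_error_t err;

            err = cpg_initialize(&handle, &callbacks);
            if (err != CS_OK)
                    return -1;      /* daemon not (yet) reachable */

            memset(&group, 0, sizeof(group));
            group.length = snprintf(group.value, sizeof(group.value),
                                    "%s", name);

            err = cpg_join(handle, &group);
            if (err == CS_OK) {
                    iov.iov_base = payload;
                    iov.iov_len = sizeof(payload);
                    err = cpg_mcast_joined(handle, CPG_TYPE_AGREED,
                                           &iov, 1);
            }
            if (err == CS_OK)
                    err = cpg_dispatch(handle, CS_DISPATCH_ALL);
            if (err == CS_OK)
                    err = cpg_leave(handle, &group);

            /* Finalize unconditionally so the handle is not leaked
             * even when the daemon died in the middle of a call. */
            cpg_finalize(handle);
            return (err == CS_OK) ? 0 : -1;
    }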

> #0  __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at sem_init.c:45
> 45	  isem->value = value;
> Missing separate debuginfos, use: debuginfo-install
> audit-libs-2.0.1-1.fc12.x86_64 libgcrypt-1.4.4-8.fc12.x86_64
> libgpg-error-1.6-4.x86_64 libtasn1-2.3-1.fc12.x86_64
> libuuid-2.16-10.2.fc12.x86_64
> (gdb) where
> #0  __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at sem_init.c:45
> #1  0x00007ff01e601e8e in coroipcc_service_connect (socket_name=<value
> optimized out>, service=<value optimized out>, request_size=1048576,
> response_size=1048576, dispatch_size=1048576, handle=<value optimized
> out>)
>      at coroipcc.c:706
> #2  0x00007ff01ec1bb81 in init_ais_connection_once (dispatch=0x40e798
> <cib_ais_dispatch>, destroy=0x40e8f2 <cib_ais_destroy>, our_uuid=0x0,
> our_uname=0x6182c0, nodeid=0x0) at ais.c:622
> #3  0x00007ff01ec1ba22 in init_ais_connection (dispatch=0x40e798
> <cib_ais_dispatch>, destroy=0x40e8f2 <cib_ais_destroy>, our_uuid=0x0,
> our_uname=0x6182c0, nodeid=0x0) at ais.c:585
> #4  0x00007ff01ec16b90 in crm_cluster_connect (our_uname=0x6182c0,
> our_uuid=0x0, dispatch=0x40e798, destroy=0x40e8f2, hb_conn=0x6182b0)
> at cluster.c:56
> #5  0x000000000040e9fb in cib_init () at main.c:424
> #6  0x000000000040df78 in main (argc=1, argv=0x7ffff194aaf8) at main.c:218
> (gdb) print *isem
> Cannot access memory at address 0x7ff01f81a008
>
> sigh
>

This code literally hasn't been modified for over a year, so it is 
strange to start seeing errors now.

Is your /dev/shm full?
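
As a quick sanity check (assumption here: the coroipcc IPC buffers are
backed by files on the /dev/shm tmpfs, so a full tmpfs would leave the
mapped semaphore page unbacked, which would be consistent with the
"Cannot access memory" above), something like this would tell you:

    #include <stdio.h>
    #include <sys/statvfs.h>

    /* Rough check of free space on the /dev/shm tmpfs. */
    int main(void)
    {
            struct statvfs vfs;

            if (statvfs("/dev/shm", &vfs) != 0) {
                    perror("statvfs /dev/shm");
                    return 1;
            }
            printf("/dev/shm: %llu of %llu blocks free\n",
                   (unsigned long long)vfs.f_bavail,
                   (unsigned long long)vfs.f_blocks);
            return 0;
    }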

Regards
-steve

