[Openais] recover from corosync daemon restart and cpg_finalize timing
Steven Dake
sdake at redhat.com
Thu Jun 24 00:16:12 PDT 2010
On 06/23/2010 11:35 PM, Andrew Beekhof wrote:
> On Thu, Jun 24, 2010 at 1:50 AM, dan clark<2clarkd at gmail.com> wrote:
>> Dear Gentle Reader....
>>
>> Attached is a small test program to stress initializing and finalizing
>> communication between a corosync cpg client and the corosync daemon.
>> The test was run under version 1.2.4. Initial testing was with a
>> single node, subsequent testing occurred on a system consisting of 3
>> nodes.
>>
>> 1) If the program is run in such a way that it loops on the
>> initialize/mcast_joined/dispatch/finalize AND the corosync daemon is
>> restarted while the program is looping (service corosync restart) then
>> the application locks up in the corosync client library in a variety
>> of interesting locations. This is easiest to reproduce in a single
>> node system with a large iteration count and a usleep value between
>> joins. 'stress_finalize -t 500 -i 10000 -u 1000 -v' Sometimes it
>> recovers in a few seconds (analysis of strace indicated
>> futex(...FUTEX_WAIT, 0, {1, 997888000}) ... which would account for
>> multiple 2 second delays in error recovery from a lost corosync
>> daemon). Sometimes it locks up solid! What is the proper way of
>> handling the loss of the corosync daemon? Is it possible for the
>> cpg library to provide fast error recovery when the daemon fails?
>>
>> sample back trace of lockup:
>> #0 0x000000363c60c711 in sem_wait () from /lib64/libpthread.so.0
>> #1 0x0000003000002a34 in coroipcc_msg_send_reply_receive (
>> handle=<value optimized out>, iov=<value optimized out>, iov_len=1,
>> res_msg=0x7fffaefecac0, res_len=24) at coroipcc.c:465
>> #2 0x0000003000802db1 in cpg_leave (handle=1648075416440668160,
>> group=<value optimized out>) at cpg.c:458
>> #3 0x0000000000400df8 in coInit (handle=0x7fffaefecdb0,
>> groupNameStr=0x7fffaefeccb0 "./stress_finalize_groupName-0", ctx=0x6e1)
>> at stress_finalize.c:101
>> #4 0x000000000040138a in main (argc=8, argv=0x7fffaefecf28)
>> at stress_finalize.c:243
>
> I've also started getting semaphore related stack traces.
>
The stack trace from Dan is different from yours, Andrew. Yours is
during startup. Dan is more concerned about the fact that
sem_timedwait sits around for 2 seconds before returning information
indicating the server has exited or stopped (along with other issues).
> #0 __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at sem_init.c:45
> 45 isem->value = value;
> Missing separate debuginfos, use: debuginfo-install
> audit-libs-2.0.1-1.fc12.x86_64 libgcrypt-1.4.4-8.fc12.x86_64
> libgpg-error-1.6-4.x86_64 libtasn1-2.3-1.fc12.x86_64
> libuuid-2.16-10.2.fc12.x86_64
> (gdb) where
> #0 __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at sem_init.c:45
> #1 0x00007ff01e601e8e in coroipcc_service_connect (socket_name=<value
> optimized out>, service=<value optimized out>, request_size=1048576,
> response_size=1048576, dispatch_size=1048576, handle=<value optimized
> out>)
> at coroipcc.c:706
> #2 0x00007ff01ec1bb81 in init_ais_connection_once (dispatch=0x40e798
> <cib_ais_dispatch>, destroy=0x40e8f2<cib_ais_destroy>, our_uuid=0x0,
> our_uname=0x6182c0, nodeid=0x0) at ais.c:622
> #3 0x00007ff01ec1ba22 in init_ais_connection (dispatch=0x40e798
> <cib_ais_dispatch>, destroy=0x40e8f2<cib_ais_destroy>, our_uuid=0x0,
> our_uname=0x6182c0, nodeid=0x0) at ais.c:585
> #4 0x00007ff01ec16b90 in crm_cluster_connect (our_uname=0x6182c0,
> our_uuid=0x0, dispatch=0x40e798, destroy=0x40e8f2, hb_conn=0x6182b0)
> at cluster.c:56
> #5 0x000000000040e9fb in cib_init () at main.c:424
> #6 0x000000000040df78 in main (argc=1, argv=0x7ffff194aaf8) at main.c:218
> (gdb) print *isem
> Cannot access memory at address 0x7ff01f81a008
>
> sigh
>
This code literally hasn't been modified for over a year - strange to
start seeing errors now.
Is your /dev/shm full?
Regards
-steve