[Openais] recover from corosync daemon restart and cpg_finalize timing

dan clark 2clarkd at gmail.com
Thu Jun 24 12:50:48 PDT 2010


Thank you for trying out this test.

I have upgraded to release 1.2.5 and applied the fix posted for the
leak to /dev/shm.  Unfortunately, when I run the test application
(slightly modified to fix a couple of bugs I found), I still find
/dev/shm filling up with large files ("control_buffer-xxx,
dispatch_buffer-xxx, fdata-xxxx, request_buffer_xxx,
response_buffer_xxx"), even after corosync is restarted and the
application daemon is killed.  It appears there may still be a
problem in the cleanup of the temporary files used by corosync
(the library, the daemon, or both?) in /dev/shm.

Should shutting down the application (and the associated corosync
library) clean up the temporary files?  Should shutting down the
daemon clean up the /dev/shm temporary files?  Would a stop-gap
measure be to 'rm -f /dev/shm/*' in the init.d script to clean up any
leftovers?  Would that break the library if the applications were not
also shut down?

dan

On Thu, Jun 24, 2010 at 12:16 AM, Steven Dake <sdake at redhat.com> wrote:
> On 06/23/2010 11:35 PM, Andrew Beekhof wrote:
>>
>> On Thu, Jun 24, 2010 at 1:50 AM, dan clark<2clarkd at gmail.com>  wrote:
>>>
>>> Dear Gentle Reader....
>>>
>>> Attached is a small test program to stress initializing and finalizing
>>> communication between a corosync cpg client and the corosync daemon.
>>> The test was run under version 1.2.4.  Initial testing was with a
>>> single node, subsequent testing occurred on a system consisting of 3
>>> nodes.
>>>
>>> 1) If the program is run in such a way that it loops on the
>>> initialize/mcast_joined/dispatch/finalize AND the corosync daemon is
>>> restarted while the program is looping (service corosync restart) then
>>> the application locks up in the corosync client library in a variety
>>> of interesting locations.  This is easiest to reproduce in a single
>>> node system with a large iteration count and a usleep value between
>>> joins.  'stress_finalize -t 500 -i 10000 -u 1000 -v'  Sometimes it
>>> recovers in a few seconds (analysis of strace indicated
>>> futex(...FUTEX_WAIT, 0, {1, 997888000}) ... which would account for
>>> multiple 2 second delays in error recovery from a lost corosync
>>> daemon).  Sometimes it locks up solid!   What is the proper way of
>>> handling the loss of the corosync daemon?  Is it possible for the
>>> cpg library to recover quickly when the daemon has failed?
>>>
>>> sample back trace of lockup:
>>> #0  0x000000363c60c711 in sem_wait () from /lib64/libpthread.so.0
>>> #1  0x0000003000002a34 in coroipcc_msg_send_reply_receive (
>>>   handle=<value optimized out>, iov=<value optimized out>, iov_len=1,
>>>   res_msg=0x7fffaefecac0, res_len=24) at coroipcc.c:465
>>> #2  0x0000003000802db1 in cpg_leave (handle=1648075416440668160,
>>>   group=<value optimized out>) at cpg.c:458
>>> #3  0x0000000000400df8 in coInit (handle=0x7fffaefecdb0,
>>>   groupNameStr=0x7fffaefeccb0 "./stress_finalize_groupName-0", ctx=0x6e1)
>>>   at stress_finalize.c:101
>>> #4  0x000000000040138a in main (argc=8, argv=0x7fffaefecf28)
>>>   at stress_finalize.c:243
>>
>> I've also started getting semaphore related stack traces.
>>
>
> the stack trace from Dan is different from yours, Andrew.  Yours is during
> startup.  Dan is more concerned that sem_timedwait sits around for 2
> seconds before returning information indicating that the server has
> exited or stopped (along with other issues).
>
>> #0  __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at
>> sem_init.c:45
>> 45        isem->value = value;
>> Missing separate debuginfos, use: debuginfo-install
>> audit-libs-2.0.1-1.fc12.x86_64 libgcrypt-1.4.4-8.fc12.x86_64
>> libgpg-error-1.6-4.x86_64 libtasn1-2.3-1.fc12.x86_64
>> libuuid-2.16-10.2.fc12.x86_64
>> (gdb) where
>> #0  __new_sem_init (sem=0x7ff01f81a008, pshared=1, value=0) at
>> sem_init.c:45
>> #1  0x00007ff01e601e8e in coroipcc_service_connect (socket_name=<value
>> optimized out>, service=<value optimized out>, request_size=1048576,
>> response_size=1048576, dispatch_size=1048576, handle=<value optimized
>> out>)
>>     at coroipcc.c:706
>> #2  0x00007ff01ec1bb81 in init_ais_connection_once (dispatch=0x40e798
>> <cib_ais_dispatch>, destroy=0x40e8f2<cib_ais_destroy>, our_uuid=0x0,
>> our_uname=0x6182c0, nodeid=0x0) at ais.c:622
>> #3  0x00007ff01ec1ba22 in init_ais_connection (dispatch=0x40e798
>> <cib_ais_dispatch>, destroy=0x40e8f2<cib_ais_destroy>, our_uuid=0x0,
>> our_uname=0x6182c0, nodeid=0x0) at ais.c:585
>> #4  0x00007ff01ec16b90 in crm_cluster_connect (our_uname=0x6182c0,
>> our_uuid=0x0, dispatch=0x40e798, destroy=0x40e8f2, hb_conn=0x6182b0)
>> at cluster.c:56
>> #5  0x000000000040e9fb in cib_init () at main.c:424
>> #6  0x000000000040df78 in main (argc=1, argv=0x7ffff194aaf8) at main.c:218
>> (gdb) print *isem
>> Cannot access memory at address 0x7ff01f81a008
>>
>> sigh
>>
>
> This code literally hasn't been modified for over a year - strange to start
> seeing errors now.
>
> Is your /dev/shm full?
>
> Regards
> -steve
>

