[Openais] recover from corosync daemon restart and cpg_finalize timing

Steven Dake sdake at redhat.com
Thu Jun 24 10:42:34 PDT 2010


Dan,

Thanks for the test case.

Responses inline.

On 06/23/2010 04:50 PM, dan clark wrote:
> Dear Gentle Reader....
>
> Attached is a small test program to stress initializing and finalizing
> communication between a corosync cpg client and the corosync daemon.
> The test was run under version 1.2.4.  Initial testing was with a
> single node, subsequent testing occurred on a system consisting of 3
> nodes.
>
> 1) If the program is run in such a way that it loops on the
> initialize/mcast_joined/dispatch/finalize AND the corosync daemon is
> restarted while the program is looping (service corosync restart) then
> the application locks up in the corosync client library in a variety
> of interesting locations.  This is easiest to reproduce in a single
> node system with a large iteration count and a usleep value between
> joins.  'stress_finalize -t 500 -i 10000 -u 1000 -v'  Sometimes it
> recovers in a few seconds (analysis of strace indicated
> futex(...FUTEX_WAIT, 0, {1, 997888000}) ... which would account for
> multiple 2 second delays in error recovery from a lost corosync
> daemon).  Sometimes it locks up solid!   What is the proper way of
> handling the loss of the corosync daemon?  Is it possible to have the
> cpg library have a fast error recovery in the case of a failed daemon?
>

The 2 second delay is normal, although it can be improved upon.  I have 
filed a bugzilla to address this point:

https://bugzilla.redhat.com/show_bug.cgi?id=607744
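
Until that is improved, the usual client-side approach is to treat any 
hard error from cpg_dispatch() as "connection lost", finalize the old 
handle, and retry cpg_initialize() until the daemon is back.  A minimal 
sketch, assuming the lost connection shows up as an error other than 
CS_OK/CS_ERR_TRY_AGAIN (commonly CS_ERR_LIBRARY); the helper name and 
sleep interval are just illustrative:

#include <unistd.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

/*
 * Sketch only: dispatch until the daemon connection breaks, then
 * re-initialize.  Any error other than CS_OK/CS_ERR_TRY_AGAIN is
 * treated as "daemon gone"; adjust for what your version reports.
 */
static void dispatch_and_recover (cpg_handle_t *handle,
        cpg_callbacks_t *callbacks)
{
        cs_error_t err;

        for (;;) {
                err = cpg_dispatch (*handle, CS_DISPATCH_ONE);
                if (err == CS_OK || err == CS_ERR_TRY_AGAIN) {
                        continue;
                }

                /*
                 * Connection lost: drop the old handle and keep trying
                 * to reconnect until the restarted daemon accepts us.
                 */
                cpg_finalize (*handle);
                while (cpg_initialize (handle, callbacks) != CS_OK) {
                        sleep (1);
                }
        }
}

Note that the application still has to re-join its groups after the 
re-initialize, since the old membership belonged to the old connection.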

I was not able to generate a lockup with your test case.  Maybe my 
hardware is too fast/slow or has different properties than yours.  It is 
possible this is related to the sem_destroy, as per the man page:

        Destroying a semaphore that other processes or threads are
        currently blocked on (in sem_wait(3)) produces undefined behavior.

If you're up for testing on your hardware, you might consider removing 
the sem_destroy calls at coroipcs.c:1846, specifically:
         sem_destroy (&conn_info->control_buffer->sem0);
         sem_destroy (&conn_info->control_buffer->sem1);
         sem_destroy (&conn_info->control_buffer->sem2);

Running that test would really help eliminate this as a possibility.
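
In patch form, the experiment would look roughly like this in coroipcs.c 
(sketch only; the surrounding code is not reproduced here):

        /*
         * Experiment: skip destroying semaphores that a client thread
         * may still be blocked on in sem_wait().
         */
#if 0
        sem_destroy (&conn_info->control_buffer->sem0);
        sem_destroy (&conn_info->control_buffer->sem1);
        sem_destroy (&conn_info->control_buffer->sem2);
#endif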

> sample back trace of lockup:
> #0  0x000000363c60c711 in sem_wait () from /lib64/libpthread.so.0
> #1  0x0000003000002a34 in coroipcc_msg_send_reply_receive (
>     handle=<value optimized out>, iov=<value optimized out>, iov_len=1,
>     res_msg=0x7fffaefecac0, res_len=24) at coroipcc.c:465
> #2  0x0000003000802db1 in cpg_leave (handle=1648075416440668160,
>     group=<value optimized out>) at cpg.c:458
> #3  0x0000000000400df8 in coInit (handle=0x7fffaefecdb0,
>     groupNameStr=0x7fffaefeccb0 "./stress_finalize_groupName-0", ctx=0x6e1)
>     at stress_finalize.c:101
> #4  0x000000000040138a in main (argc=8, argv=0x7fffaefecf28)
>     at stress_finalize.c:243
>
> 2) If the test program is run with an iteration count of greater than
> about 10, group joins for the specified group name tends to start
> failing (CS_ERR_TRY_AGAIN) but never recovers (trying again doesn't
> help :).  This test was run on a single node of a 3 node system (but
> may reproduce similar problems on a smaller number of nodes).
> ' ./stress_finalize -i 10 -j 1 junk'
>

I was able to reproduce this, bug filed at:
https://bugzilla.redhat.com/show_bug.cgi?id=607745
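
For reference, the normal client-side handling of CS_ERR_TRY_AGAIN is a 
bounded retry loop along the lines of the sketch below (the helper name, 
retry count and backoff interval are just illustrative), which is exactly 
the pattern that never succeeds once this bug is hit:

#include <string.h>
#include <unistd.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

/*
 * Sketch: retry cpg_join() a bounded number of times while the daemon
 * answers CS_ERR_TRY_AGAIN.  Assumes name is shorter than
 * CPG_MAX_NAME_LENGTH.
 */
static cs_error_t join_with_retry (cpg_handle_t handle,
        const char *name, int max_retries)
{
        struct cpg_name group;
        cs_error_t err;
        int i;

        memset (&group, 0, sizeof (group));
        group.length = strlen (name);
        memcpy (group.value, name, group.length);

        for (i = 0; i < max_retries; i++) {
                err = cpg_join (handle, &group);
                if (err != CS_ERR_TRY_AGAIN) {
                        return err;     /* CS_OK or a hard error */
                }
                usleep (100000);        /* back off 100 ms, then retry */
        }
        return CS_ERR_TRY_AGAIN;
}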

> 3) An unrelated observation is that if the corosync daemon is setup on
> two nodes that participate in multicast through a tunnel, the
> corosync daemon runs in a tight loop at very high priority level
> effectively halting the machine.  Is this because the basic daemon
> communication relies on message reflection of the underlying transport
> which would occur on an ethernet multicast but would not on a tunnel?
>
> An example setup for an ip tunnel might be something along the following lines:
> modprobe ip_gre
> echo 1 > /proc/sys/net/ipv4/ip_forward
> ip tunnel add gre1 mode gre remote 10.x.y.z local 20.z.y.x ttl 127
> ip addr add 192.168.100.33/24 peer 192.168.100.11/24 dev gre1
> ip link set gre1 up multicast on
>

No idea...

I know some people have had success with tunnels, but I have never tried 
them myself.  Please feel free to file an enhancement request as per our 
policy at:

http://www.corosync.org/doku.php?id=support

Also, as a side note, what is the motivation for wanting to use a tunnel?

Thanks for the work on the test case.  It is really nice to see this 
kind of activity filling gaps in our current test cases.  If you're 
interested, you might consider merging this into the cts framework that 
Angus has developed.

Regards
-steve

> Thank you for taking the time to consider these tests.  Perhaps future
> versions of the software package could include a similar set of tests
> illustrating proper behavior?
>
> dan
>


