[Openais] recover from corosync daemon restart and cpg_finalize timing
Steven Dake
sdake at redhat.com
Thu Jun 24 10:42:34 PDT 2010
Dan,
Thanks for the test case; responses inline.
On 06/23/2010 04:50 PM, dan clark wrote:
> Dear Gentle Reader....
>
> Attached is a small test program to stress initializing and finalizing
> communication between a corosync cpg client and the corosync daemon.
> The test was run under version 1.2.4. Initial testing was with a
> single node, subsequent testing occurred on a system consisting of 3
> nodes.
>
> 1) If the program is run in such a way that it loops on the
> initialize/mcast_joined/dispatch/finalize AND the corosync daemon is
> restarted while the program is looping (service corosync restart) then
> the application locks up in the corosync client library in a variety
> of interesting locations. This is easiest to reproduce in a single
> node system with a large iteration count and a usleep value between
> joins. 'stress_finalize -t 500 -i 10000 -u 1000 -v' Sometimes it
> recovers in a few seconds (analysis of strace indicated
> futex(...FUTEX_WAIT, 0, {1, 997888000}) ... which would account for
> multiple 2 second delays in error recovery from a lost corosync
> daemon). Sometimes it locks up solid! What is the proper way of
> handling the loss of the corosync daemon? Is it possible to have the
> cpg library have a fast error recovery in the case of a failed daemon?
>
The 2 second delay is normal, although it can be improved upon. I have
filed a bugzilla to address this point:
https://bugzilla.redhat.com/show_bug.cgi?id=607744
I was not able to generate a lockup with your test case. Maybe my
hardware is too fast/slow or has different properties than yours. It is
possible this is related to the sem_destroy, as per the sem_destroy(3) man page:

    Destroying a semaphore that other processes or threads are currently
    blocked on (in sem_wait(3)) produces undefined behavior.
If you're up for testing on your hardware, you might consider removing
the sem_destroy calls from coroipcs.c:1846, specifically:
sem_destroy (&conn_info->control_buffer->sem0);
sem_destroy (&conn_info->control_buffer->sem1);
sem_destroy (&conn_info->control_buffer->sem2);
Testing this on your hardware would really help eliminate the
sem_destroy as a possibility.
> sample back trace of lockup:
> #0 0x000000363c60c711 in sem_wait () from /lib64/libpthread.so.0
> #1 0x0000003000002a34 in coroipcc_msg_send_reply_receive (
> handle=<value optimized out>, iov=<value optimized out>, iov_len=1,
> res_msg=0x7fffaefecac0, res_len=24) at coroipcc.c:465
> #2 0x0000003000802db1 in cpg_leave (handle=1648075416440668160,
> group=<value optimized out>) at cpg.c:458
> #3 0x0000000000400df8 in coInit (handle=0x7fffaefecdb0,
> groupNameStr=0x7fffaefeccb0 "./stress_finalize_groupName-0", ctx=0x6e1)
> at stress_finalize.c:101
> #4 0x000000000040138a in main (argc=8, argv=0x7fffaefecf28)
> at stress_finalize.c:243
>
> 2) If the test program is run with an iteration count of greater than
> about 10, group joins for the specified group name tends to start
> failing (CS_ERR_TRY_AGAIN) but never recovers (trying again doesn't
> help :). This test was run on a single node of a 3 node system (but
> may reproduce similar problems on a smaller number of nodes).
> ' ./stress_finalize -i 10 -j 1 junk'
>
I was able to reproduce this, bug filed at:
https://bugzilla.redhat.com/show_bug.cgi?id=607745
> 3) An unrelated observation is that if the corosync daemon is setup on
> two nodes that participate in multicast through a tunnel, the
> corosync daemon runs in a tight loop at very high priority level
> effectively halting the machine. Is this because the basic daemon
> communication relies on message reflection of the underlying transport
> which would occur on an ethernet multicast but would not on a tunnel?
>
> An example setup for an ip tunnel might be something along the following lines:
> modprobe ip_gre
> echo 1 > /proc/sys/net/ipv4/ip_forward
> ip tunnel add gre1 mode gre remote 10.x.y.z local 20.z.y.x ttl 127
> ip addr add 192.168.100.33/24 peer 192.168.100.11/24 dev gre1
> ip link set gre1 up multicast on
>
no idea...
I know some people have had success with tunnels, but I have never tried
them myself. Please feel free to file an enhancement request as per our
policy at:
http://www.corosync.org/doku.php?id=support
Also, as a side note, what is the motivation for wanting to use a tunnel?
Thanks for the work on the test case. Really nice to see this kind of
activity for gaps in our current test cases. If you're interested, you
might consider merging this into the cts framework that Angus has developed.
Regards
-steve
> Thank you for taking the time to consider these tests. Perhaps future
> versions of the software package could include a similar set of tests
> illustrating proper behavior?
>
> dan
>
>
>
> _______________________________________________
> Openais mailing list
> Openais at lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/openais