[Openais] recover from corosync daemon restart and cpg_finalize timing

dan clark 2clarkd at gmail.com
Wed Jun 23 16:50:57 PDT 2010


Dear Gentle Reader....

Attached is a small test program that stresses initializing and
finalizing communication between a corosync cpg client and the corosync
daemon.  The tests were run under version 1.2.4.  Initial testing was on
a single node; subsequent testing was on a system consisting of 3 nodes.

1) If the program is run so that it loops on
initialize/mcast_joined/dispatch/finalize AND the corosync daemon is
restarted while the program is looping (service corosync restart), the
application locks up inside the corosync client library in a variety
of interesting locations.  This is easiest to reproduce on a single
node with a large iteration count and a usleep between joins:
'stress_finalize -t 500 -i 10000 -u 1000 -v'.  Sometimes it recovers
after a few seconds (an strace showed
futex(...FUTEX_WAIT, 0, {1, 997888000}) ..., which would account for
the multiple 2-second delays while recovering from a lost corosync
daemon).  Sometimes it locks up solid!  What is the proper way of
handling the loss of the corosync daemon?  Is it possible for the cpg
library to recover quickly when the daemon fails?

sample backtrace of the lockup:
#0  0x000000363c60c711 in sem_wait () from /lib64/libpthread.so.0
#1  0x0000003000002a34 in coroipcc_msg_send_reply_receive (
   handle=<value optimized out>, iov=<value optimized out>, iov_len=1,
   res_msg=0x7fffaefecac0, res_len=24) at coroipcc.c:465
#2  0x0000003000802db1 in cpg_leave (handle=1648075416440668160,
   group=<value optimized out>) at cpg.c:458
#3  0x0000000000400df8 in coInit (handle=0x7fffaefecdb0,
   groupNameStr=0x7fffaefeccb0 "./stress_finalize_groupName-0", ctx=0x6e1)
   at stress_finalize.c:101
#4  0x000000000040138a in main (argc=8, argv=0x7fffaefecf28)
   at stress_finalize.c:243
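
For concreteness, the kind of loop being stressed looks roughly like the
sketch below.  This is not the attached stress_finalize.c; the group
name, iteration count, and error handling are placeholders, written
against the public API in <corosync/cpg.h>:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/uio.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

/* no-op callbacks; the stress test only exercises setup/teardown */
static void deliver_cb(cpg_handle_t h, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid, void *msg, size_t len)
{
}

static void confchg_cb(cpg_handle_t h, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
}

int main(void)
{
    cpg_callbacks_t cb = { .cpg_deliver_fn = deliver_cb,
                           .cpg_confchg_fn = confchg_cb };
    struct cpg_name group;
    char payload[] = "ping";
    struct iovec iov = { payload, sizeof(payload) };
    int i;

    strcpy(group.value, "stress_group");
    group.length = strlen(group.value);

    for (i = 0; i < 10000; i++) {
        cpg_handle_t handle;

        /* each of these calls can fail (or block) if the daemon restarts */
        if (cpg_initialize(&handle, &cb) != CS_OK) {
            usleep(1000);
            continue;                      /* daemon gone: retry next pass */
        }
        if (cpg_join(handle, &group) == CS_OK) {
            cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
            cpg_dispatch(handle, CS_DISPATCH_ALL);
            cpg_leave(handle, &group);     /* hangs here in the trace above */
        }
        cpg_finalize(handle);
        usleep(1000);                      /* the -u 1000 interval */
    }
    return 0;
}

The open question is what such a loop should do when any of these calls
starts returning errors, or blocks, because the daemon was restarted
underneath it.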

2) If the test program is run with an iteration count greater than
about 10, group joins for the specified group name tend to start
failing with CS_ERR_TRY_AGAIN and never recover (trying again doesn't
help :).  This test was run on a single node of a 3-node system, but
similar problems may be reproducible with fewer nodes.
' ./stress_finalize -i 10 -j 1 junk'
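
For reference, "trying again" here means roughly the following bounded
retry (a sketch with an arbitrary retry count and delay, not the code
from the attached program):

#include <unistd.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

/* Retry cpg_join while it keeps returning CS_ERR_TRY_AGAIN.  In the
 * failure mode described above, the loop runs to exhaustion and the
 * last attempt still reports TRY_AGAIN. */
static cs_error_t join_with_retry(cpg_handle_t handle, struct cpg_name *group)
{
    cs_error_t err = CS_ERR_TRY_AGAIN;
    int attempt;

    for (attempt = 0; attempt < 100; attempt++) {
        err = cpg_join(handle, group);
        if (err != CS_ERR_TRY_AGAIN)
            return err;            /* CS_OK or a hard error */
        usleep(100000);            /* 100 ms between attempts */
    }
    return err;                    /* still CS_ERR_TRY_AGAIN after ~10 s */
}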

3) An unrelated observation: if the corosync daemon is set up on two
nodes that participate in multicast through a tunnel, the corosync
daemon runs in a tight loop at a very high priority level, effectively
halting the machine.  Is this because the basic daemon communication
relies on message reflection by the underlying transport, which would
occur with ethernet multicast but not over a tunnel?

An example setup for an ip tunnel might be something along the following lines:
modprobe ip_gre
echo 1 > /proc/sys/net/ipv4/ip_forward
ip tunnel add gre1 mode gre remote 10.x.y.z local 20.z.y.x ttl 127
ip addr add 192.168.100.33/24 peer 192.168.100.11/24 dev gre1
ip link set gre1 up multicast on
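
As a sanity check independent of corosync, something like the following
can show whether multicast datagrams actually traverse the tunnel at all
(a rough sketch; the group address and port are arbitrary test values,
not taken from corosync's configuration):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define GROUP "239.192.99.1"   /* arbitrary test group */
#define PORT  6789             /* arbitrary test port */

int main(int argc, char **argv)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(PORT) };
    struct in_addr local;

    if (argc < 3) {
        fprintf(stderr, "usage: %s send|recv <local-tunnel-addr>\n", argv[0]);
        return 1;
    }
    inet_aton(argv[2], &local);

    if (strcmp(argv[1], "recv") == 0) {
        struct ip_mreq mreq;
        char buf[64];
        ssize_t n;

        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(s, (struct sockaddr *)&addr, sizeof(addr));
        inet_aton(GROUP, &mreq.imr_multiaddr);
        mreq.imr_interface = local;          /* join on the tunnel address */
        setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
        n = recv(s, buf, sizeof(buf) - 1, 0);  /* blocks until a packet arrives */
        if (n > 0) {
            buf[n] = '\0';
            printf("multicast arrived: %s\n", buf);
        }
    } else {
        /* send out of the tunnel interface rather than the default route */
        setsockopt(s, IPPROTO_IP, IP_MULTICAST_IF, &local, sizeof(local));
        inet_aton(GROUP, &addr.sin_addr);
        sendto(s, "hello", 6, 0, (struct sockaddr *)&addr, sizeof(addr));
    }
    close(s);
    return 0;
}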

Thank you for taking the time to consider these tests.  Perhaps future
versions of the software package could include a similar set of tests
illustrating proper behavior?

dan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: stress_finalize.c
Type: text/x-csrc
Size: 8512 bytes
Desc: not available
Url : http://lists.linux-foundation.org/pipermail/openais/attachments/20100623/bf741aa7/attachment-0001.c 

