[Openais] recover from corosync daemon restart and cpg_finalize timing

dan clark 2clarkd at gmail.com
Thu Jun 24 13:44:57 PDT 2010


Hi Steven!

I really appreciate the consideration you have given to this scenario,
and I am thankful that the test case warranted a bug submission and
that you took the time to jump through the hoops required to file it.

Please note that the stress test only reflects the nature of the
problem.  One aspect of the test that does not reflect actual usage is
that the test case runs start to finish, whereas the actual use case
is a daemon running for months on end, where leaks of file
descriptors, memory or shared memory segments would make it
challenging to keep the application functioning over long periods of
time within a constrained set of resources.  The primary purpose of
the test case is to allow a daemon utilizing corosync to quickly
recover from these two cases:
1) the application daemon starting prior to the corosync daemon on reboot
2) the corosync daemon being yanked out from under the application
daemon during normal run time.
In the second case the application should continue to function in a
degraded state until the corosync daemon can be restarted, at which
point the application should quickly re-establish connectivity and
re-join the same groups (which are hopefully still represented by
other nodes).  As was correctly stated, one problem is the 2 second
delay.  Under 1.2.5 I think I recreated the lockup situation last
night, but this morning I have not been able to reproduce the lockup;
the 2 second delay is still occurring.  Perhaps it is, as was
speculated, timing related, and it is necessary to hit the right
critical sections of the locking to reproduce it.  As a result I do
not have any data on the proposed change below to eliminate the
sem_destroy calls in coroipcs.c.  Would the proposed change end up
leaking semaphores?
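
To make the recovery expectation concrete, here is a minimal sketch
(in the spirit of the test case, not taken from it) of the reconnect
loop an application might run once it notices the daemon is gone.  The
callbacks structure and the back-off interval are placeholders; only
the standard cpg.h calls (cpg_initialize, cpg_join, cpg_finalize) are
assumed.

#include <unistd.h>
#include <string.h>
#include <corosync/cpg.h>

extern cpg_callbacks_t callbacks;   /* deliver/confchg handlers defined elsewhere */

/*
 * Keep retrying until the corosync daemon is reachable again and the
 * group has been re-joined.  The back-off interval is illustrative.
 */
static cs_error_t reconnect (cpg_handle_t *handle, const char *group_name)
{
    struct cpg_name group;
    cs_error_t err;

    group.length = strlen (group_name);   /* assumed < CPG_MAX_NAME_LENGTH */
    memcpy (group.value, group_name, group.length);

    for (;;) {
        err = cpg_initialize (handle, &callbacks);
        if (err == CS_OK) {
            err = cpg_join (*handle, &group);
            if (err == CS_OK) {
                return CS_OK;              /* reconnected and re-joined */
            }
            cpg_finalize (*handle);        /* join failed, drop the handle */
        }
        if (err != CS_ERR_TRY_AGAIN && err != CS_ERR_LIBRARY) {
            return err;                    /* unexpected failure, give up */
        }
        usleep (100000);                   /* daemon not back yet, retry */
    }
}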

Note that the test application reports file descriptors leaked by the
client library and loss of group functionality across a corosync
daemon restart.  I assume that the leaked file descriptors are the IPC
handles between the corosync client libraries and the corosync daemon?
 (see the 'leaking fd now...' message).  Perhaps an application needs
a way to fully clean up the state of the cpg library interface, so
that once the daemon does restart it can be safely re-attached without
any vestiges of the past.  Would this be considered a separate bug?
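
For clarity, here is a hedged sketch of the "fully clean up before
re-attaching" step I have in mind; drop_stale_handle is a hypothetical
helper name, and whether cpg_finalize actually reclaims the leaked
descriptor once the daemon has gone away is exactly the open question:

#include <corosync/cpg.h>

/*
 * Discard a handle left over from a dead daemon connection before
 * creating a fresh one.  cpg_finalize() may fail once the daemon is
 * gone; in that case the descriptor may stay leaked until the library
 * offers a full-cleanup path.
 */
static void drop_stale_handle (cpg_handle_t *handle)
{
    if (*handle == 0) {
        return;
    }
    (void) cpg_finalize (*handle);   /* best effort; result deliberately ignored */
    *handle = 0;
}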

> Also, as a side note, what is the motivation for wanting to use a tunnel?

The tunnel was an attempt to extend the capability beyond a local set
of nodes and attach to a second set on the other side of a firewall
(or a set of routers that will not pass multicast traffic).  If future
projects are considering adding unicast connections between groups
(such as the 'spread' facility supports), that would obviate the need
for tunnels.  Are there plans for such capabilities?

I would appreciate any posting on the preferred method of creating
tunnels for use with corosync, if there has been success using them
in the field.

> If you're interested, you might consider merging this into the cts framework that Angus has developed.

Unfortunately I am not familiar with the cts framework.  I simply
modeled the test case after the cases in the 'tests' directory of the
corosync distribution.  If future distributions re-write the 'tests'
directory entries to use the cts framework, I can certainly model such
a change; in the meantime, however, it would be nice to have a version
similar to the finalize stress test supplied as part of the 'tests'
directory.  Would that be possible?  If so, I can add some fixes to my
original submission, but I am not familiar with how to do so
efficiently.

As stated in an earlier post, /dev/shm seems to fill up with large
temporary files over time, which may be a related library (or daemon)
cleanup issue.

Thank you again for any shared insights.  I apologize for not drilling
down and providing better insight into the areas of the corosync code
itself that may be affected, but figured that in this case a simple
test program would suffice.  The idea of recognizing the loss of the
daemon at some level and having ALL corosync library calls fail fast
from that point on seems to capture the timing aspect of the problem,
although cleanup of resources is equally important.
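
As an illustration only, the application-side version of that
fail-fast idea could look something like the following; checked_dispatch
is a hypothetical wrapper name, and this is a mitigation in the
application, not a proposed library change:

#include <stdbool.h>
#include <corosync/cpg.h>

static bool daemon_lost = false;

/*
 * Once any call reports that the daemon connection is gone, remember
 * it and refuse further calls immediately instead of letting them
 * block inside the library.
 */
static cs_error_t checked_dispatch (cpg_handle_t handle)
{
    cs_error_t err;

    if (daemon_lost) {
        return CS_ERR_LIBRARY;           /* fail fast, do not touch the handle */
    }
    err = cpg_dispatch (handle, CS_DISPATCH_ALL);
    if (err == CS_ERR_LIBRARY || err == CS_ERR_BAD_HANDLE) {
        daemon_lost = true;              /* daemon connection lost */
    }
    return err;
}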

dan

On Thu, Jun 24, 2010 at 10:42 AM, Steven Dake <sdake at redhat.com> wrote:
> Dan,
>
> Thanks for the test case
>
> responses inline
>
> On 06/23/2010 04:50 PM, dan clark wrote:
>>
>> Dear Gentle Reader....
>>
>> Attached is a small test program to stress initializing and finalizing
>> communication between a corosync cpg client and the corosync daemon.
>> The test was run under version 1.2.4.  Initial testing was with a
>> single node, subsequent testing occurred on a system consisting of 3
>> nodes.
>>
>> 1) If the program is run in such a way that it loops on the
>> initialize/mcast_joined/dispatch/finalize AND the corosync daemon is
>> restarted while the program is looping (service corosync restart) then
>> the application locks up in the corosync client library in a variety
>> of interesting locations.  This is easiest to reproduce in a single
>> node system with a large iteration count and a usleep value between
>> joins.  'stress_finalize -t 500 -i 10000 -u 1000 -v'  Sometimes it
>> recovers in a few seconds (analysis of strace indicated
>> futex(...FUTEX_WAIT, 0, {1, 997888000}) ... which would account for
>> multiple 2 second delays in error recovery from a lost corosync
>> daemon).  Sometimes it locks up solid!   What is the proper way of
>> handling the loss of the corosync daemon?  Is it possible to have the
>> cpg library have a fast error recovery in the case of a failed daemon?
>>
>
> The 2 second delay is normal, although it can be improved upon.  I have
> filed a bugzilla to address this point:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=607744
>
> I was not able to generate a lockup with your test case.  Maybe my hardware
> is too fast/slow or has different properties than yours.  It is possible
> this is related to the sem_destroy, as per the man page:
>        Destroying a semaphore that other processes or threads are currently
>        blocked on (in sem_wait(3)) produces undefined behavior.
>
> If you're up for testing on your hardware, you might consider removing the
> sem_destroy calls from coroipcs.c:1846, specifically:
>        sem_destroy (&conn_info->control_buffer->sem0);
>        sem_destroy (&conn_info->control_buffer->sem1);
>        sem_destroy (&conn_info->control_buffer->sem2);
>
> This activity would really help eliminate this as a possibility.
>
>> sample back trace of lockup:
>> #0  0x000000363c60c711 in sem_wait () from /lib64/libpthread.so.0
>> #1  0x0000003000002a34 in coroipcc_msg_send_reply_receive (
>>    handle=<value optimized out>, iov=<value optimized out>, iov_len=1,
>>    res_msg=0x7fffaefecac0, res_len=24) at coroipcc.c:465
>> #2  0x0000003000802db1 in cpg_leave (handle=1648075416440668160,
>>    group=<value optimized out>) at cpg.c:458
>> #3  0x0000000000400df8 in coInit (handle=0x7fffaefecdb0,
>>    groupNameStr=0x7fffaefeccb0 "./stress_finalize_groupName-0", ctx=0x6e1)
>>    at stress_finalize.c:101
>> #4  0x000000000040138a in main (argc=8, argv=0x7fffaefecf28)
>>    at stress_finalize.c:243
>>
>> 2) If the test program is run with an iteration count of greater than
>> about 10, group joins for the specified group name tend to start
>> failing (CS_ERR_TRY_AGAIN) and never recover (trying again doesn't
>> help :).  This test was run on a single node of a 3 node system (but
>> may reproduce similar problems on a smaller number of nodes).
>> ' ./stress_finalize -i 10 -j 1 junk'
>>
>
> I was able to reproduce this, bug filed at:
> https://bugzilla.redhat.com/show_bug.cgi?id=607745
>
>> 3) An unrelated observation is that if the corosync daemon is set up on
>> two nodes that participate in multicast through a tunnel, the
>> corosync daemon runs in a tight loop at a very high priority level,
>> effectively halting the machine.  Is this because the basic daemon
>> communication relies on message reflection by the underlying transport,
>> which would occur on an ethernet multicast but would not on a tunnel?
>>
>> An example setup for an ip tunnel might be something along the following
>> lines:
>> modprobe ip_gre
>> echo 1 > /proc/sys/net/ipv4/ip_forward
>> ip tunnel add gre1 mode gre remote 10.x.y.z local 20.z.y.x ttl 127
>> ip addr add 192.168.100.33/24 peer 192.168.100.11/24 dev gre1
>> ip link set gre1 up multicast on
>>
>
> no idea...
>
> I know some people have had success with tunnels, but I have never tried
> them myself.  Please feel free to file an enhancement request as per our
> policy at:
>
> http://www.corosync.org/doku.php?id=support
>
> Also, as a side note, what is the motivation for wanting to use a tunnel?
>
> Thanks for the work on the test case.  Really nice to see this kind of
> activity for gaps in our current test cases.  If you're interested, you might
> consider merging this into the cts framework that Angus has developed.
>
> Regards
> -steve
>
>> Thank you for taking the time to consider these tests.  Perhaps future
>> versions of the software package could include a similar set of tests
>> illustrating proper behavior?
>>
>> dan
>>
>>
>>
>> _______________________________________________
>> Openais mailing list
>> Openais at lists.linux-foundation.org
>> https://lists.linux-foundation.org/mailman/listinfo/openais
>
>

