[Openais] {patch] Corosync hangs on startup
Andrew Beekhof
andrew at beekhof.net
Fri Jun 18 07:33:47 PDT 2010
Checked in as r2948.
Please backport to 1.2
On Fri, Jun 11, 2010 at 6:23 PM, Steven Dake <sdake at redhat.com> wrote:
> On 06/11/2010 09:00 AM, Andrew Beekhof wrote:
>>
>> This is a bit convoluted, but hang in there.
>>
>>
>> So there is this bug:
>> http://developerbugs.linux-foundation.org/show_bug.cgi?id=2379
>>
>> Essentially, to reproduce, you stop syslog but leave it enabled in
>> corosync.conf.
>> Here is the logging section I used:
>>
>> logging {
>> debug: on
>> fileline: off
>> to_syslog: yes
>> to_stderr: no
>> to_logfile: yes
>> logfile: /tmp/corosync.log
>> syslog_facility: daemon
>> timestamp: on
>> }
>>
>> What would happen is that on startup, the (pacemaker) child processes
>> would deadlock _inside_ the call to fork().
>> This seemed to happen more often if the logfile didnt yet exist.
>>
>>
>> Here's the gdb stack trace:
>>
>> #0 0x0000003268407bfd in pthread_join (threadid=140599098124048,
>> thread_return=0x0) at pthread_join.c:89
>> #1 0x0000000000406805 in sigsegv_handler (num=<value optimized out>) at
>> main.c:212
>> #2<signal handler called>
>> #3 fresetlockfiles () at ../nptl/sysdeps/unix/sysv/linux/fork.c:48
>> #4 __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:158
>> #5 0x00007fdfc5d95772 in spawn_child () from
>> /usr/libexec/lcrso/pacemaker.lcrso
>> #6 0x00007fdfc5d99e4a in pcmk_startup () from
>> /usr/libexec/lcrso/pacemaker.lcrso
>> #7 0x0000000000407f23 in corosync_service_link_and_init
>> (corosync_api=0x613c40, service_name=0x1020430 "pacemaker",
>> service_ver=<value optimized out>) at service.c:201
>> #8 0x00000000004082c1 in corosync_service_defaults_link_and_init
>> (corosync_api=0x613c40) at service.c:535
>> #9 0x0000000000405df6 in main_service_ready () at main.c:1204
>> #10 0x000000326e00f81f in main_iface_change_fn (context=0x7fdfc6027010,
>> iface_addr=<value optimized out>, iface_no=<value optimized out>) at
>> totemsrp.c:4347
>> #11 0x000000326e009f7a in timer_function_netif_check_timeout
>> (data=0x10474d0) at totemudp.c:1359
>> #12 0x000000326e006709 in timerlist_expire (handle=1197105576937521152) at
>> tlist.h:309
>> #13 poll_run (handle=1197105576937521152) at coropoll.c:409
>> #14 0x000000000040568b in main (argc=<value optimized out>, argv=<value
>> optimized out>) at main.c:1556
>>
>>
>> As I see it, there are three problems here.
>>
>> 1) fork() is segfaulting
>> 2) sigsegv_handler() is doing _WAY_ too much.
>> I don't believe it should be calling logsys_atexit() - certainly not if
>> logsys_atexit() then calls pthread_join
>> 3) logsys_atexit() is calling pthread_join() on a thread that doesn't
>> exist in the new process
>>
>
> the signal handling needs some work around segfault.
>
>>
>> I'll leave 2) and 3) for someone more knowledgeable. I investigated 1) as
>> the other two don't matter if fork() isn't segfaulting.
>>
>> I tried running valgrind and every fork produced the same complaint:
>>
>> ==00:00:00:03.081 23392== Invalid write of size 4
>> ==00:00:00:03.082 23392== at 0x3267CA4E72: fork (fork.c:48)
>> ==00:00:00:03.082 23392== by 0x7633771: spawn_child (in
>> /usr/libexec/lcrso/pacemaker.lcrso)
>> ==00:00:00:03.082 23392== by 0x7637E49: pcmk_startup (in
>> /usr/libexec/lcrso/pacemaker.lcrso)
>> ==00:00:00:03.082 23392== by 0x407E07: corosync_service_link_and_init
>> (service.c:201)
>> ==00:00:00:03.082 23392== by 0x4081A0:
>> corosync_service_defaults_link_and_init (service.c:534)
>> ==00:00:00:03.082 23392== by 0x405C85: main_service_ready (main.c:1211)
>> ==00:00:00:03.082 23392== by 0x4C4B2BE: main_iface_change_fn
>> (totemsrp.c:4363)
>> ==00:00:00:03.082 23392== by 0x4C42AD9:
>> timer_function_netif_check_timeout (totemudp.c:1380)
>> ==00:00:00:03.082 23392== by 0x4C3F8DC: poll_run (tlist.h:309)
>> ==00:00:00:03.082 23392== by 0x405543: main (main.c:1563)
>> ==00:00:00:03.082 23392== Address 0x0 is not stack'd, malloc'd or
>> (recently) free'd
>>
>> Note that this is the same file and line that gdb reported.
>>
>> And if you look at line 48 of ./nptl/sysdeps/unix/sysv/linux/fork.c, you
>> see that its the body of this for-loop
>>
>> static void
>> fresetlockfiles (void)
>> {
>> _IO_ITER i;
>>
>> for (i = _IO_iter_begin(); i != _IO_iter_end(); i = _IO_iter_next(i))
>> _IO_lock_init (*((_IO_lock_t *) _IO_iter_file(i)->_lock));
>> }
>>
>> I tracking down _IO_iter_begin() which led me to _IO_list_all, which led
>> me to _IO_2_1_stderr_, which led me to test the following patch:
>>
>> Index: exec/main.c
>> ===================================================================
>> --- exec/main.c (revision 2943)
>> +++ exec/main.c (working copy)
>> @@ -418,6 +418,9 @@
>> fd = open("/dev/null", O_RDWR);
>> if (fd>= 0) {
>> /* dup2 to 0 / 1 / 2 (stdin / stdout / stderr) */
>> + close (STDIN_FILENO);
>> + close (STDOUT_FILENO);
>> + close (STDERR_FILENO);
>> dup2(fd, STDIN_FILENO); /* 0 */
>> dup2(fd, STDOUT_FILENO); /* 1 */
>> dup2(fd, STDERR_FILENO); /* 2 */
>>
>
> I'm not sure why that isn't merged, it should be there. I notice the child
> close operatoin for 0,1,2, are commented out as well.
>
> Regards
> -steve
>
>>
>> With the patch I've not been able to reproduce the hang and valgrind no
>> longer complains.
>> So I'm reasonably certain its the correct fix.
>>
>> Please ACK.
>> _______________________________________________
>> Openais mailing list
>> Openais at lists.linux-foundation.org
>> https://lists.linux-foundation.org/mailman/listinfo/openais
>
>
More information about the Openais
mailing list