[Openais] {patch] Corosync hangs on startup
Andrew Beekhof
andrew at beekhof.net
Fri Jun 11 09:00:08 PDT 2010
This is a bit convoluted, but hang in there.
So there is this bug:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=2379
Essentially, to reproduce, you stop syslog but leave it enabled in corosync.conf.
Here is the logging section I used:
logging {
debug: on
fileline: off
to_syslog: yes
to_stderr: no
to_logfile: yes
logfile: /tmp/corosync.log
syslog_facility: daemon
timestamp: on
}
What would happen is that on startup, the (pacemaker) child processes would deadlock _inside_ the call to fork().
This seemed to happen more often if the logfile didnt yet exist.
Here's the gdb stack trace:
#0 0x0000003268407bfd in pthread_join (threadid=140599098124048, thread_return=0x0) at pthread_join.c:89
#1 0x0000000000406805 in sigsegv_handler (num=<value optimized out>) at main.c:212
#2 <signal handler called>
#3 fresetlockfiles () at ../nptl/sysdeps/unix/sysv/linux/fork.c:48
#4 __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:158
#5 0x00007fdfc5d95772 in spawn_child () from /usr/libexec/lcrso/pacemaker.lcrso
#6 0x00007fdfc5d99e4a in pcmk_startup () from /usr/libexec/lcrso/pacemaker.lcrso
#7 0x0000000000407f23 in corosync_service_link_and_init (corosync_api=0x613c40, service_name=0x1020430 "pacemaker", service_ver=<value optimized out>) at service.c:201
#8 0x00000000004082c1 in corosync_service_defaults_link_and_init (corosync_api=0x613c40) at service.c:535
#9 0x0000000000405df6 in main_service_ready () at main.c:1204
#10 0x000000326e00f81f in main_iface_change_fn (context=0x7fdfc6027010, iface_addr=<value optimized out>, iface_no=<value optimized out>) at totemsrp.c:4347
#11 0x000000326e009f7a in timer_function_netif_check_timeout (data=0x10474d0) at totemudp.c:1359
#12 0x000000326e006709 in timerlist_expire (handle=1197105576937521152) at tlist.h:309
#13 poll_run (handle=1197105576937521152) at coropoll.c:409
#14 0x000000000040568b in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:1556
As I see it, there are three problems here.
1) fork() is segfaulting
2) sigsegv_handler() is doing _WAY_ too much.
I don't believe it should be calling logsys_atexit() - certainly not if logsys_atexit() then calls pthread_join
3) logsys_atexit() is calling pthread_join() on a thread that doesn't exist in the new process
I'll leave 2) and 3) for someone more knowledgeable. I investigated 1) as the other two don't matter if fork() isn't segfaulting.
I tried running valgrind and every fork produced the same complaint:
==00:00:00:03.081 23392== Invalid write of size 4
==00:00:00:03.082 23392== at 0x3267CA4E72: fork (fork.c:48)
==00:00:00:03.082 23392== by 0x7633771: spawn_child (in /usr/libexec/lcrso/pacemaker.lcrso)
==00:00:00:03.082 23392== by 0x7637E49: pcmk_startup (in /usr/libexec/lcrso/pacemaker.lcrso)
==00:00:00:03.082 23392== by 0x407E07: corosync_service_link_and_init (service.c:201)
==00:00:00:03.082 23392== by 0x4081A0: corosync_service_defaults_link_and_init (service.c:534)
==00:00:00:03.082 23392== by 0x405C85: main_service_ready (main.c:1211)
==00:00:00:03.082 23392== by 0x4C4B2BE: main_iface_change_fn (totemsrp.c:4363)
==00:00:00:03.082 23392== by 0x4C42AD9: timer_function_netif_check_timeout (totemudp.c:1380)
==00:00:00:03.082 23392== by 0x4C3F8DC: poll_run (tlist.h:309)
==00:00:00:03.082 23392== by 0x405543: main (main.c:1563)
==00:00:00:03.082 23392== Address 0x0 is not stack'd, malloc'd or (recently) free'd
Note that this is the same file and line that gdb reported.
And if you look at line 48 of ./nptl/sysdeps/unix/sysv/linux/fork.c, you see that its the body of this for-loop
static void
fresetlockfiles (void)
{
_IO_ITER i;
for (i = _IO_iter_begin(); i != _IO_iter_end(); i = _IO_iter_next(i))
_IO_lock_init (*((_IO_lock_t *) _IO_iter_file(i)->_lock));
}
I tracking down _IO_iter_begin() which led me to _IO_list_all, which led me to _IO_2_1_stderr_, which led me to test the following patch:
Index: exec/main.c
===================================================================
--- exec/main.c (revision 2943)
+++ exec/main.c (working copy)
@@ -418,6 +418,9 @@
fd = open("/dev/null", O_RDWR);
if (fd >= 0) {
/* dup2 to 0 / 1 / 2 (stdin / stdout / stderr) */
+ close (STDIN_FILENO);
+ close (STDOUT_FILENO);
+ close (STDERR_FILENO);
dup2(fd, STDIN_FILENO); /* 0 */
dup2(fd, STDOUT_FILENO); /* 1 */
dup2(fd, STDERR_FILENO); /* 2 */
With the patch I've not been able to reproduce the hang and valgrind no longer complains.
So I'm reasonably certain its the correct fix.
Please ACK.
More information about the Openais
mailing list