[Openais] corosync 1.2.5 still doesn't shutdown properly

Tue Jun 22 11:31:43 PDT 2010

On Tue, Jun 22, 2010 at 2:21 PM, Steven Dake <sdake at redhat.com> wrote:
> On 06/22/2010 11:07 AM, Vadym Chepkov wrote:
>>
>> On Tue, Jun 22, 2010 at 1:49 PM, Steven Dake<sdake at redhat.com>  wrote:
>>>
>>> On 06/22/2010 03:56 AM, Vadym Chepkov wrote:
>>>>
>>>> Hi,
>>>>
>>>> I decided to check if I can start using corosync again on several of
>>>> my clusters (have to use heartbeat there at the moment).
>>>> I don't even have any services defined in corosync.conf, commented
>>>> pacemaker out, just plain corosync and it never goes down:
>>>>
>>>> # ps axf|grep corosync
>>>> 26294 pts/0    S+     0:00  |               \_ /bin/sh /sbin/service corosync restart
>>>> 26299 pts/0    S+     0:01  |                   \_ /bin/bash /etc/init.d/corosync restart
>>>> 29249 pts/1    S+     0:00                  \_ grep corosync
>>>> 25959 ?        Ssl    0:00 corosync
>>>>
>>>>
>>>> I attached to the process and this is where it hangs:
>>>>
>>>> (gdb) where
>>>> #0  0x0fe14134 in poll () from /lib/libc.so.6
>>>> #1  0x0ffbc530 in poll_run (handle=150346236434579456) at coropoll.c:413
>>>> #2  0x10006e50 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:1576
>>>>
>>>> How can I help to debug this problem?
>>>> It is 100% reproducible.
>>>>
>>>> Thank you,
>>>> Vadym
>>>
>>> Vadym,
>>>
>>> Thanks for the feedback.  I do test this scenario and it works for me:
>>>
>>> [root@cast flatiron]# service corosync start
>>> Starting Corosync Cluster Engine (corosync):               [  OK  ]
>>> [root@cast flatiron]# service corosync restart
>>> Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
>>> Waiting for corosync services to unload:.                  [  OK  ]
>>> Starting Corosync Cluster Engine (corosync):               [  OK  ]
>>> [root@cast flatiron]# service corosync stop
>>> Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
>>> Waiting for corosync services to unload:.                  [  OK  ]
>>> [root@cast flatiron]# service corosync start
>>> Starting Corosync Cluster Engine (corosync):               [  OK  ]
>>> [root@cast flatiron]# /etc/init.d/corosync restart
>>> Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
>>> Waiting for corosync services to unload:.                  [  OK  ]
>>> Starting Corosync Cluster Engine (corosync):               [  OK  ]
>>>
>>>
>>> One thing that would stop corosync from shutting down is if it couldn't
>>> enter operational state.  This often happens because a firewall is
>>> blocking the ports corosync uses to communicate.
>>>
>>> The system logs would be helpful (with debug: on).
>>>
>>> Regards
>>> -steve
>>
>>
>> And it works fine on Intel-based servers, but on a Red Hat PPC-based
>> server it doesn't.
>>
>> I attached the config and the log file.
>>
>> Thanks,
>> Vadym
>
> Nothing jumps out from the logs.  Thanks for the pointer about ppc. I'll
> hunt down some PPC hardware and see if I can reproduce/fix.  Could you be
> more specific about which ppc (32 or 64) you were running?  Were you
> running BE and LE in the same cluster?
>
> Please be patient, however.  I don't have any ppc hardware personally, and
> getting access to non-x86 hardware may take me a few days.

That's why I offered to help, since I have access to the PPC and it's
in my best interests :)

The kernel is ppc64, but most of the utilities are 32-bit; that's how
Red Hat ships PPC. I compiled corosync as 32-bit anyway. Both machines
run an identical kernel, so they can't have different byte orders.
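
For anyone who wants to confirm the same two facts on their own nodes,
here is a minimal sketch in Python. It is not part of corosync, and the
function name describe_build is only an illustration. It reports the
pointer width and byte order of the process it runs in, so it describes
the userland build of the interpreter rather than the kernel:

```python
import struct
import sys


def describe_build():
    """Return (pointer_bits, byte_order) for the running interpreter."""
    # struct.calcsize("P") is the size in bytes of a C void* in this process.
    pointer_bits = struct.calcsize("P") * 8
    # sys.byteorder is the string "big" or "little".
    return pointer_bits, sys.byteorder


if __name__ == "__main__":
    bits, order = describe_build()
    print(f"{bits}-bit userland, {order}-endian")
```

Running it on every node and comparing the output is a quick way to
rule out a mixed word-size or mixed byte-order cluster, assuming the
Python on each node was built the same way as the other userland
binaries.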

Thanks,
Vadym