[Openais] [PATCH corosync] select a new sync member if the node with the lowest nodeid has left.

David Teigland teigland at redhat.com
Thu Apr 22 15:18:47 PDT 2010


On Thu, Apr 22, 2010 at 04:35:08PM -0500, David Teigland wrote:
> On Thu, Apr 22, 2010 at 11:06:19AM +1000, Angus Salkeld wrote:
> > Problem:
> > 
> > Under certain circumstances cpg does not send group leave messages.
> > 
> > With a big token timeout (tested with token == 5min).
> > 1 start all nodes
> > 2 start ./test/testcpg on all nodes
> > 3 go to the node with the lowest nodeid
> > 4 ifconfig <int> down && killall -9 corosync && /etc/init.d/corosync restart && ./testcpg
> > 5 the other nodes will not get the cpg leave event
> > 6 testcpg reports an extra cpg group (basically one was not removed)
> > 
> > Solution:
> > If a member gets removed using the new trans_list and
> > that member is the node used for syncing (lowest nodeid)
> > then the next lowest node needs to be chosen for syncing.
> > 
> > David would you mind confirming that this solves your problem?
> 
> It works great, thanks!

That was after two tests, and it may have been a bit hasty.  When I went
back to do some further tests, I happened to make a slight mistake running
the usual steps, and the node failure then went unnoticed like before.
When I repeat the "mistake" intentionally, I get the same problem.  The
new test is:

1 nodes 1,2,3,4: cman_tool join
2 create iptables partition: 1 | 2,3,4
3 node 1: kill -9 corosync
4 remove iptables partition: 1,2,3,4
5 node 1: cman_tool join
6 nodes 1,2,3,4: fenced; fence_tool join
7 create iptables partition: 1 | 2,3,4
8 node 1: kill -9 corosync
9 remove iptables partition: 1,2,3,4
10 node 1: cman_tool join
11 no confchg removing 1 from the fenced cpg on nodes 2,3,4
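
For reference, the rule from the quoted patch description (if the node used
for syncing leaves, the next lowest nodeid takes over) could be sketched
roughly as below.  This is only an illustration under assumptions, not the
corosync code: pick_sync_node() and its array arguments are made-up names,
and treating nodeid 0 as "no node" is an assumption of the sketch.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a corosync function).
 * Returns the lowest nodeid in members[] that does not appear in left[],
 * i.e. the node that would take over syncing once the nodes in left[]
 * are gone.  Returns 0 if no member remains; 0 is assumed here to be an
 * invalid nodeid.
 */
static unsigned int pick_sync_node(const unsigned int *members, size_t n_members,
                                   const unsigned int *left, size_t n_left)
{
    unsigned int best = 0;

    for (size_t i = 0; i < n_members; i++) {
        int gone = 0;

        /* Skip any member that is listed as having left. */
        for (size_t j = 0; j < n_left; j++) {
            if (members[i] == left[j]) {
                gone = 1;
                break;
            }
        }
        if (gone)
            continue;

        /* Keep the lowest remaining nodeid seen so far. */
        if (best == 0 || members[i] < best)
            best = members[i];
    }
    return best;
}
```

With members 1,2,3,4 and node 1 in the left list, this picks node 2.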

Dave


