[Openais] patch AMF sync

Steven Dake sdake at redhat.com
Mon Aug 14 17:21:16 PDT 2006


On Mon, 2006-08-14 at 21:33 +0200, Hans Feldt (AS/EAB) wrote:
> I am performing some hardening of the AMF sync at the moment and have
> fixed a couple of issues. Assert is my friend.
> 
> One assert I get is when I kill a node and start it again directly. I do
> get config change callbacks in the other nodes but they say no node left
> and no node joined! Isn't that strange?
> 
This is proper behavior.  What happens is a node fails (ctrl-c?), and
then restarts.  When a node restarts, it starts the membership protocol.
Therefore, it appears as though the node never left or joined.

In fact, there is no way to tell whether a node has left or joined; it's
more of a "here is a list of the processors in the configuration".  The
left/joined lists are misnomers and should probably be removed, but
several people complained when I last mentioned it.
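
To picture why both lists come back empty, here is a minimal sketch.  It
assumes, purely for illustration, that left/joined are derived by
comparing the previous member list with the new one; this is not a
statement about how totem actually computes them.  The integer node ids
and the names list_contains and count_missing are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: node ids are plain integers and member lists are arrays. */
static bool list_contains(const uint32_t *list, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i] == id) {
            return true;
        }
    }
    return false;
}

/* Counts the ids that appear in `from` but not in `in`.
 *   left   = count_missing(old_members, new_members)
 *   joined = count_missing(new_members, old_members) */
static size_t count_missing(const uint32_t *from, size_t n_from,
                            const uint32_t *in, size_t n_in)
{
    size_t missing = 0;

    assert(from != NULL || n_from == 0);
    for (size_t i = 0; i < n_from; i++) {
        if (!list_contains(in, n_in, from[i])) {
            missing++;
        }
    }
    return missing;
}
```

If the restarted node is back in the member list by the time the new
configuration is delivered, the old and new lists hold the same ids, so
both counts are zero even though the node has lost all of its state.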

The bottom line is that after every configuration change you must do a
complete resync of the data.  How do you know who should do the resync?
The ring id identifies unique ring configurations (and could, I suppose,
be used to determine a left and joined list in some strange way).  The
way I'd suggest doing this is that every processor that gets a
configuration change checks the ringid.rep field to see if it matches
this_ip.  If it does, that node synchronizes the data for that part of
the ring.

This could be extended into the sync code so that the sync callbacks are
only called on nodes that are ring reps, but some services don't
synchronize in this way, so it would take some work to change them to
work in this fashion.

Regards
-steve

> Regards,
> Hans
> 
> > -----Original Message-----
> > From: openais-bounces at lists.osdl.org 
> > [mailto:openais-bounces at lists.osdl.org] On Behalf Of Hans Feldt
> > > Sent: 11 August 2006 14:34
> > To: sdake at redhat.com
> > Cc: openais at lists.osdl.org
> > Subject: Re: [Openais] patch AMF sync
> > 
> > Steven Dake wrote:
> > 
> > > > 10) I suggest using the regular openais_timer_add functions
> > > > instead of poll_timer_add.  If these functions have problems
> > > > (which I think have been addressed now) then I'd like to know
> > > > about them so they can be fixed.  The poll timer add should only
> > > > be used by totem.
> > 
> > I tried again to use the openais_timer interface but this 
> > time totem locked up and cluster communication did not work, 
> > I got a split brain cluster...
> > 
> > Initially the nodes see each other, one node syncs the other 
> > but after that we got the split brain.
> > 
> > Therefore AMF still uses the poll_timer interface.
> > 
> > My test environment is a 3 node User mode Linux cluster. I 
> > have _not_ tried with a real cluster.
> > 
> > Regards,
> > Hans
> > 
> > 
> > 



