[Openais] Re: confchg_fn, cluster membership, etc.

Steven Dake sdake at mvista.com
Fri Sep 10 14:18:21 PDT 2004


On Fri, 2004-09-10 at 13:48, Mark Haverkamp wrote:
> On Fri, 2004-09-10 at 13:30, Steven Dake wrote:
> > On Fri, 2004-09-10 at 10:23, Mark Haverkamp wrote:
> > > Steve,
> > > 
> > > I've been looking at the configuration change function and what I get
> > > when it is called.
> > > 
> > > When I start the first aisexec, I see that I am the only node.  This
> > > makes sense.
> > > 
> > > When I start a second aisexec, It first sees itself as the only node in
> > > the cluster.  (Does this mean that for a short time there are two
> > > clusters?) Then the config function gets called again and I see that the
> > > first node joined.  The first node on the other hand sees that the
> > > second node joined (which seems to be the correct view).  I would think
> > > that each node should see the same view of the cluster with regard to
> > > who is joining and who was already a member.  Is it possible to have
> > > each node have the same idea of who has joined and who was already a
> > > member?
> > > 
> > 
> > There is a good reason there are two configuration changes.  The first
> > configuration (called a transitional configuration) indicates who has
> > left the configuration.  The second configuration (called the regular
> > configuration) specifies who has joined the configuration.
> 
> Are these transitions visible via the config change function?  What I
> see on the second node that I start is this:
> 
> 
> L(4): AIS Executive Service: Copyright (C) 2002-2004 MontaVista
> Software, Inc.
> L(4): entering GATHER state.
> L(4): SENDING attempt join because this node is ring rep.
> New queue for ip 192.168.1.17
> L(5): Evt exec init request
> L(4): AIS Executive Service: started and ready to receive connections.
> L(4): Got attempt join from 192.168.1.8
> L(4): CONSENSUS reached!
> Got membership form token
> Got membership form token
> conf_desc_list 2
> highest seq 0 0
> highest seq 1 0
> setting barrier seq to 1
> EVS STATE group arut 0 gmi arut 0 highest 0 barrier 1 starting group
> arut 0
> EVS STATE group arut 1 gmi arut 1 highest 0 barrier 1 starting group
> arut 1
> L(4): EVS recovery of messages complete, transitioning to operational.
> CONFCHG ENTRIES 1

this is the transitional configuration

> L(4): CLM CONFIGURATION CHANGE
> L(4): New Configuration:
> L(4):   192.168.1.17
> L(4): Members Left:
> L(4): Members Joined:
> L(5): Evt conf change
> L(5): m 1, j 0, l 0
> New queue for ip 192.168.1.8

this is the regular configuration

> L(4): CLM CONFIGURATION CHANGE
> L(4): New Configuration:
> L(4):   192.168.1.8
> L(4):   192.168.1.17
> L(4): Members Left:
> L(4): Members Joined:
> L(4):   192.168.1.8
> L(5): Evt conf change
> L(5): m 2, j 1, l 0
> L(4): got nodejoin message 192.168.1.17
> L(4): got nodejoin message 192.168.1.8
> L(3): Token being retransmitted.
> L(3): Token loss in OPERATIONAL.
> L(4): entering GATHER state.
> L(4): SENDING attempt join because this node is ring rep.
> L(4): I am the only member.

this is the transitional configuration

> L(4): CLM CONFIGURATION CHANGE
> L(4): New Configuration:
> L(4):   192.168.1.17
> L(4): Members Left:
> L(4):   192.168.1.8
> L(4): Members Joined:
> L(5): Evt conf change
> L(5): m 1, j 0, l 1
> 

There is no regular configuration because the processor is the only
member.  The code is broken: it should deliver the regular configuration
too, even in this "single processor" state.

The broken code is in gmi.c around line 2125.  There should be another
call to gmi_confchg_fn.
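The shape of the fix can be sketched as follows.  This is a hedged
illustration only -- the names (complete_recovery, confchg_fn_t,
count_confchg) are hypothetical stand-ins, not the actual gmi.c code or
the real gmi_confchg_fn signature.  The point is that the regular
configuration must be delivered unconditionally after the transitional
one, even when the membership is a single processor:

```c
#include <assert.h>

enum conf_type { CONF_TRANSITIONAL, CONF_REGULAR };

/* Hypothetical callback type, standing in for gmi_confchg_fn. */
typedef void (*confchg_fn_t)(enum conf_type type,
                             const int *members, int n_members);

/* Sketch of the recovery completion path: the transitional
 * configuration (who survived from the old ring) is delivered first,
 * then the regular configuration (the new membership) is delivered
 * unconditionally -- including in the single-processor case. */
static void complete_recovery(confchg_fn_t confchg,
                              const int *trans, int n_trans,
                              const int *reg, int n_reg)
{
    confchg(CONF_TRANSITIONAL, trans, n_trans);
    /* The missing piece in the broken path: this call must happen
     * even when n_reg == 1. */
    confchg(CONF_REGULAR, reg, n_reg);
}

static int g_deliveries;           /* counts callback invocations */

static void count_confchg(enum conf_type type,
                          const int *members, int n_members)
{
    (void)type; (void)members; (void)n_members;
    g_deliveries++;
}
```

With this structure, a lone processor still sees both configuration
changes, matching what every other membership transition delivers.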

>  
> The first one says that there are no joiners, the second one shows
> joiners.  The final is when I killed the first node.
>  
> 
> 
> > 
> > When a partition is detected, all messages that are part of the old
> > configuration are delivered.  When a gap is detected in sequence
> > numbers, a transitional configuration is delivered, and then the
> > remaining messages that can be delivered are delivered.  Then the
> > regular configuration is delivered.  New messages are then delivered
> > under the new regular configuration.
> 
> Are you saying that there can be more than one cluster?  I would have
> thought that there is only one cluster and that if you weren't in it
> before but you are in it now, that you are the new guy and just joined.
> 

Yes, extended virtual synchrony allows multiple partitions to operate at
the same time, although if they are operating and can see each other,
they will merge automatically.
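When two partitions merge, each side's configuration change reports the
other side's nodes in its own "Members Joined" list, so an application
that wants that view can simply diff the new member list against the one
it saw previously.  A minimal sketch (the helper below is hypothetical,
not an openais API):

```c
#include <assert.h>

#define MAX_MEMBERS 32

/* Return the number of node ids present in "current" but not in
 * "previous"; these are the nodes that joined from this observer's
 * point of view.  Joined ids are written to "joined". */
static int diff_joined(const int *previous, int n_prev,
                       const int *current, int n_cur,
                       int *joined)
{
    int n_joined = 0;
    for (int i = 0; i < n_cur; i++) {
        int found = 0;
        for (int j = 0; j < n_prev; j++) {
            if (current[i] == previous[j]) {
                found = 1;
                break;
            }
        }
        if (!found)
            joined[n_joined++] = current[i];
    }
    return n_joined;
}
```

Note this gives the observer-relative answer: if node A previously saw
{A} and node B previously saw {B}, after the merge A computes joined =
{B} while B computes joined = {A}, and both are correct from their own
observation points.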

> > 
> > This ensures that messages are delivered under the correct
> > configuration.
> > 
> > Philosophically I don't think it's possible to specify, at least with
> > the current vs messaging model, who has joined, because of observer
> > relativity (see below).  The reason is that two partitions each with 4
> > processors could be operating separately, and then merge.  So who would
> > be the joining partition, and who would be the partition that was joined?
> 
> How do you ensure data integrity if two partitions can operate
> independently?  What is to stop them from stomping on each other because
> they are unaware?  For instance, if the ais lock service gets
> implemented, you can't have the two halves of the cluster think that
> they can take ownership of the same locks.
> 

This problem applies to the AMF and LCK services.  All other services
can merge in some deterministic way.

Some researchers think a lock service is impossible to make work
reliably.  I'd almost tend to agree with them, except I believe it is
possible to specify the network configuration to identify partitionable
elements.

For example:

SWITCH A - SWITCH B

Switch A and Switch B could easily partition.  So if you specified all
the nodes for switch A as one partition and all the nodes for switch B
as another partition, then if they split, you could select an active
partition using some deterministic mechanism.  It may be, though, that
you select a partition where the power has gone out.  Maybe you could
have some other mechanism to identify whether switch A is really dead.
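One such deterministic mechanism can be sketched as below.  This is a
hypothetical illustration, not openais code: it assumes each node is
statically configured as belonging to the switch A or switch B
partition, and uses "the partition containing the lowest configured
node id wins" as the rule.  Any rule works as long as every node
applies the same one; as noted above, it can still pick a dead
partition unless some separate liveness check covers that case.

```c
#include <assert.h>

/* Lowest node id in a configured partition member list. */
static int lowest_id(const int *nodes, int n)
{
    int low = nodes[0];
    for (int i = 1; i < n; i++)
        if (nodes[i] < low)
            low = nodes[i];
    return low;
}

/* Deterministic selection after a split: returns 1 if the partition
 * "mine" should stay active against "other".  Every node evaluates
 * the same comparison on the same configured lists, so all nodes
 * agree on the outcome without communicating. */
static int partition_is_active(const int *mine, int n_mine,
                               const int *other, int n_other)
{
    return lowest_id(mine, n_mine) < lowest_id(other, n_other);
}
```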

There is no easy solution to the partition problem when only one active
partition should be selected.  In 2005, I'd like to tackle this problem,
and perhaps feedback whatever we learn into the SA Forum AIS
specifications.

Part of tackling this problem is implementing a multiring protocol that
can work in wide area (multi-site) networks.

For now, if we can just ensure that on a merge, every processor has the
same state, I would be satisfied with that result for our initial
release.
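One way to get "same state on every processor after a merge" is a merge
rule that depends only on the data itself, so every processor converges
on the same copy no matter which partition it was in.  A minimal sketch
(the entry structure and rule are hypothetical, not openais data types):

```c
#include <assert.h>

/* Hypothetical replicated entry: a version number plus the id of the
 * node that produced it. */
struct entry {
    int version;
    int origin_node;
};

/* Deterministic merge rule for two copies of the same entry: the
 * higher version wins, with ties broken by the lower origin node id.
 * Because the rule looks only at the data, every processor that
 * applies it keeps the same copy. */
static struct entry merge_entry(struct entry a, struct entry b)
{
    if (a.version != b.version)
        return a.version > b.version ? a : b;
    return a.origin_node <= b.origin_node ? a : b;
}
```

The rule is symmetric: merge_entry(a, b) and merge_entry(b, a) select
the same copy, which is what makes the post-merge state identical
everywhere.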

> > 
> > > For a function of the event service, I'd like to know if I'm the new
> > > guy.  The way things are I don't think that I can know this.
> > > 
> > 
> > The "new guy" is relative to the observer.  Hence, processor A thinks
> > processor B is the new guy, and processor B thinks processor A is the
> > new guy.  So who is right?  They are both right, from their observation
> > points.  I'm not sure how else to think about this scenario.
> > 
> > The mechanism I had always believed would work would be to synchronize
> > state whenever a new processor is available.  Since, because of the
> > relativity of the observer, it is impossible to know who is new, the
> > algorithms have to figure out who has the correct data using some form
> > of algorithm that allows all processors to agree on the data set.
> > 
> > It is a weakness in the SA Forum AIS that what happens on a merge and
> > partition are unspecified.  Some policies (that are perhaps selectable)
> > would be useful for the specifications.  For now, we can do whatever
> > works.  I'd be happy enough for now, if what we had ensured that every
> > processor had the same data set.
> > 
> > > Another thing that I found is that SaClmClusterNodeT data isn't
> > > available in the confchg_fn for newly joining nodes.
> > > 
> > 
> > This is true and the reason we added the clm_get_by_nodeid.  
> 
> This is what I tried to use.  I get NULL node information returned.
> 

OK, it must be broken.  No other services are using it currently.  Is it
line evt.c:1986 (from bk) that is returning the incorrect values?

I'll have a look at it.

> > If this is
> > too cumbersome to be useful, we can change the confchg_fn passed to
> > executive handlers to take the SaClmClusterNodeT data structure.  I'm
> > not too attracted to this idea, because it requires adding information
> > about the SaClmClusterNodeT data structure to exec/main.c to formulate
> > the data set.
> > 
> > Regards
> > -steve
> > > Mark.
