[Openais] Re: confchg_fn, cluster membership, etc.

Mon Sep 13 14:50:45 PDT 2004

On Mon, 2004-09-13 at 11:43, Daniel McNeil wrote:
> On Fri, 2004-09-10 at 14:18, Steven Dake wrote:
> [stuff deleted] 
> > > 
> > > How do you ensure data integrity if two partitions can operate
> > > independently?  What is to stop them from stomping on each other because
> > > they are unaware?  For instance if the ais lock service gets
> > > implemented, you can't have the two halves of the cluster think that
> > > they can take ownership of the same locks.
> > > 
> > 
> > This problem applies to AMF and LCK service.  All other services can
> > merge in some deterministic way.
> > 
> > Some researchers think a lock service is impossible to make work
> > reliably.  I'd almost tend to agree with them, except I believe it is
> > possible to specify the network configuration to identify partitionable
> > elements.
> 
> Steve,
> 
> WHAT!?  Did I read this correctly?  "a lock service is impossible to
> make work reliably"  Someone better tell Oracle and the old
> VAX guys that they had it wrong for all these years!  :)
> 
I'm not familiar with how these two lock services work.  I suspect they
do not allow two or more partitions to operate within the network.

I don't believe it is possible to make a lock service operate correctly
when one configuration can partition into two without selecting only one
operational partition (and causing the other to fail new requests or
block).

> Can you explain what you mean by "I believe it is possible to specify
> the network configuration to identify partitionable elements"?
> 

Sure.
Switch A has processors 1, 2, 3
Switch B has processors 4, 5
Switch A is connected to Switch B by a link

If switch A fails, processors 1, 2, 3 fail
If switch B fails, processors 4, 5 fail
If the link between switch A and switch B fails, processors 1, 2, 3
form one partition and processors 4, 5 form another

It is this last case that can be handled by a virtual synchrony filter
that takes into account the network topology.
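
A minimal sketch of how such a topology description could be used to work
out which processors can still reach each other after a failure.  The
names (ATTACH, LINKS, partitions) are hypothetical and are not part of
openais; the layout is the two-switch example above.

```python
from collections import defaultdict

# Hypothetical topology description: which processors hang off which switch,
# and which switches are linked to each other.
ATTACH = {"A": {1, 2, 3}, "B": {4, 5}}
LINKS = [("A", "B")]


def partitions(attach, links, failed_switches=(), failed_links=()):
    """Return the groups of processors that can still reach each other."""
    dead = set(failed_switches)
    broken = {frozenset(link) for link in failed_links}
    live = {s for s in attach if s not in dead}

    # Adjacency between live switches over links that still work.
    adj = defaultdict(set)
    for a, b in links:
        if a in live and b in live and frozenset((a, b)) not in broken:
            adj[a].add(b)
            adj[b].add(a)

    # Each connected component of live switches is one partition.
    seen, result = set(), []
    for start in sorted(live):
        if start in seen:
            continue
        seen.add(start)
        stack, procs = [start], set()
        while stack:
            switch = stack.pop()
            procs |= attach[switch]
            for neighbour in adj[switch]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        result.append(procs)
    return result
```

With the link between the switches marked as failed, this returns two
groups, {1, 2, 3} and {4, 5}, which is the case a topology-aware filter
would have to arbitrate.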

I'd like to work on this with the multiring protocols next year...

> Doesn't the ability of services to merge also depend on the ability of
> the application using the service to merge?
> 
Agreed.  Merging and partitioning are the most difficult parts of a
distributed application.

If you look at the majority of solutions available now, they don't handle
a merge or partition particularly well.  They crash, reach an inconsistent
state, fail to form a new membership, etc.

openais handles these cases, but does add some complication to the
merging and partitioning process.  The complication is that something
needs to be done in these cases.  But at least they are well accounted
for and occur deterministically.  Other solutions leave the partition and
merge unspecified.

> For high availability clusters, you want the application to continue
> running in the event of failures.   For data integrity
> this means that the application only runs on 1 cluster partition
> (primary partition) to prevent multiple instances from assuming they
> own the data.  This leads to the notion of quorum and fencing of shared
> resources to prevent corruption.  Applications can then use a DLM to
> provide cluster-wide consistency.  vsync communication would also be
> useful in this hi-av primary partition cluster.
> 

There are two forms of virtual synchrony (actually there are a lot
more).  Extended virtual synchrony allows multiple partitions to operate
at the same time (what is implemented in openais) and virtual synchrony
only allows one primary partition to operate.  It is possible to develop
a virtual synchrony filter for extended virtual synchrony.  This work
needs to be done at some point, but hasn't been done yet.  After this is
available, services which require it (AMF, LCK) can use the vs filter
while other services can continue without a vs filter, or with one
(depending on administrator configuration).

STONITH is one approach to ensuring processors are fenced from further
destruction.  It's kind of draconian, but it works.  A virtual synchrony
filter should work about as well in most cases.  If the processor cannot
send new requests, it cannot corrupt shared resources, so goes the
thinking :)
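
A minimal sketch of what a vs filter on top of extended virtual synchrony
could look like.  It assumes the primary partition is chosen by simple
majority of the configured processors, which is only one possible rule;
the class and method names are hypothetical and not an openais API.

```python
class NotPrimary(Exception):
    """Raised when a request is submitted from a non-primary partition."""


class VsFilter:
    def __init__(self, configured, deliver):
        self.configured = frozenset(configured)  # all processors in the cluster
        self.deliver = deliver                   # callback that sends a request
        self.primary = False

    def on_config_change(self, members):
        """Re-evaluate primary status whenever the membership changes."""
        live = self.configured & set(members)
        # Assumption: a strict majority of configured processors is primary.
        self.primary = 2 * len(live) > len(self.configured)
        return self.primary

    def submit(self, request):
        """Pass the request through only while in the primary partition."""
        if not self.primary:
            raise NotPrimary(request)
        self.deliver(request)
```

With five configured processors, the partition {1, 2, 3} keeps sending
requests while {4, 5} has its requests refused.  Under this majority rule
an even split leaves no primary partition at all.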

See below for a paper on extended virtual synchrony filters if you're
interested in implementing one :-)
ftp://ftp.cs.huji.ac.il/users/transis/evs.ps.gz

This certainly looks doable from the reading.

> Since I have only worked on quorum-based clusters, I cannot think of
> an application that would work in a cluster where there can be multiple
> partitions that can run in parallel and then merge back.  Can you
> give us an example?
> 
> How would/could Openais be used as the membership component where one
> would like to implement a cluster database?
> 
Virtual synchrony is perfect for cluster database replication.  I am
afraid the SA Forum APIs are less perfect, however, since they don't
account for partitions.

The way to implement a virtual synchrony database is to order all
requests in agreed order.  Every replica then applies the same requests
in the same sequence, so no lock service is needed.  On a merge, the
database should be resynchronized.  On a partition, nothing should be
done.  A virtual synchrony filter could be used for those desiring
only one operational partition.
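
A minimal sketch of the idea, assuming the group communication layer
already delivers requests to every replica in the same agreed order.  The
Replica class and the helper are hypothetical, and which side's state wins
on a merge is a policy decision this sketch leaves to the caller.

```python
class Replica:
    """One copy of a tiny key/value database."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of requests applied so far

    def deliver(self, request):
        """Apply one request; must be called in agreed order."""
        key, value = request
        self.data[key] = value
        self.applied += 1

    def resync_from(self, other):
        """On a merge, replace local state with the chosen replica's state."""
        self.data = dict(other.data)
        self.applied = other.applied


def deliver_in_agreed_order(replicas, requests):
    """Stand-in for the transport: same requests, same order, every replica."""
    for request in requests:
        for replica in replicas:
            replica.deliver(request)
```

Two replicas that receive ("x", 1), ("y", 2), ("x", 3) in that order end
up with identical contents without taking any lock.  If they then
partition and apply different requests, their contents diverge until one
side calls resync_from on the other after the merge.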

See http://www.emicnetworks.com who use some variation of this approach
for MySQL.

The research suggests using agreed/safe ordering in place of a
distributed lock manager, and developing the algorithms accordingly.

Hope this helps
-steve

> Thanks,
> 
> Daniel
>