[Openais] correlating events

Steven Dake sdake at redhat.com
Thu Sep 10 16:11:28 PDT 2009


On Thu, 2009-09-10 at 14:25 -0500, David Teigland wrote:
> On Wed, Sep 02, 2009 at 09:23:08AM -0500, David Teigland wrote:
> > > > 1. correlating events from different services locally
> > > > 
> > > > I get nodedown from both cman (or quorum service) and cpg.  I need to
> > > > correlate them with each other.  When I get a cpg nodedown for node A, I don't
> > > > know which cman nodedown for A it refers to: one of multiple in the past or
> > > > one in the future that cman hasn't reported yet.
> > > > 
> > > 
> > > Correlation could be solved by addition of api to cman, cpg, and quorum
> > > to retrieve the globally unique ring id for the last configuration
> > > change delivered to the application.
> > > 
> > > If you agree, we can work on the implementation for corosync 1.1.
> > > Adding this to CPG is trivial, not sure about other services.
> > > 
> > > Our policies wrt x.y.z would not be violated with this change.
> > > 
> > > As an example, the API for cpg might look like
> > > 
> > > cpg_ringid_get (handle, &ring_id);
> > > 
> > > Then ring_id could be memcmp'ed in the application.
> > > 
> > > This would retrieve the last ring id delivered to the application (not
> > > the current ring id known to the cpg service).
> 
> Thinking more about this, I think there are two different kinds of ringid
> queries that we'd want from cpg.  That's because all new ringids result in
> cman/quorum confchgs, but not all ringid changes result in cpg confchgs.
> 
> My understanding is that ringid (actually ringid sequence number) is
> incremented for each new ring (each cluster membership change).
> 
> a. For a given ringid from cpg for a nodedown confchg, need to know that
>    cman/quorum has seen the same nodedown.
> 
> Comparing the ringid of the cpg nodedown confchg and the ringid from cman
> should work for this.  If cman ringid is greater than or equal to the ringid
> of the cpg nodedown confchg, then we know cman is aware of the cpg nodedown.
> cman ringid may be larger if another node has since joined the cluster but not
> the cpg, or if a cluster member failed that was not a member of the cpg.
> 
> b. For a given ringid from cman/quorum, need to know that any confchgs up to
>    that same ringid have been delivered to the cpg.
> 
> These imply two different ringid values for cpg:
> 
> 1. the ringid of the last confchg delivered to the cpg
> 2. the ringid that cpg deliveries are up to date with, which may be greater
>    than the ringid of the last confchg delivered if the latest ring changes
>    have not altered the cpg membership
> 
> example
> 
> cluster ringid = 40
> cluster members = 1,2,3,4,5
> cpg members = 1,2,3,4
> 
> node 1 fails
> cluster ringid = 44
> cluster members = 2,3,4,5
> cpg members = 2,3,4
> 
> cman_ringid(h, &id)
>   id = 44
> 
> cpg_ringid(h, &id1, &id2)
>   before the app dispatches the cpg confchg callback
>   id1 = 40, id2 = 40
>   after the app dispatches the cpg confchg callback
>   id1 = 44, id2 = 44
> 
> node 5 fails
> cluster ringid = 48
> cluster members = 2,3,4
> cpg members = 2,3,4
> 
> cman_ringid(h, &id)
>   id = 48
> 
> cpg_ringid(h, &id1, &id2)
>   id1 = 44, id2 = 48
>   (there are no confchgs for the cpg in response to 5 failing)
> 
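
The two checks (a) and (b) in the quoted example could be sketched roughly
as below.  This is only an illustration: struct cpg_ring_view and both
function names are made up here, not part of cpg or cman, and it compares
sequence numbers only, exactly as the quoted text does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the two values proposed for cpg_ringid(). */
struct cpg_ring_view {
    uint64_t last_confchg_seq; /* id1: ring seq of the last cpg confchg delivered */
    uint64_t up_to_date_seq;   /* id2: ring seq that cpg deliveries are up to date with */
};

/* Check (a): has cman/quorum seen the ring that carried this cpg nodedown? */
bool cman_has_seen_cpg_nodedown(uint64_t cman_seq, uint64_t cpg_nodedown_seq)
{
    return cman_seq >= cpg_nodedown_seq;
}

/* Check (b): have all cpg confchgs up to the cman ring been delivered? */
bool cpg_caught_up_with_cman(const struct cpg_ring_view *v, uint64_t cman_seq)
{
    assert(v != NULL);
    return v->up_to_date_seq >= cman_seq;
}
```

With the numbers from the example, a view of {40, 40} is not yet caught up
with cman ring 44, while a view of {44, 48} is caught up with cman ring 48
even though no cpg confchg was delivered for node 5.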

IMO the proper way to do this is to ensure whatever ringid was delivered
in a callback to the application is the current ring id returned by the
api.  This gets rid of any races you describe above.

> > Turns out that libcman already has a call that returns the ring id, so all I
> > need now is the addition to cpg.
> 
> Chrissie pointed out that libcman only returns the 64 bit ringid as uint32,
> but I doubt we'll see ringids bigger than that... even if we do, I'm just
> comparing consecutive ids so the lower 32 bits should be fine.
> 

Once the ring id is greater than 32 bits, you would always be comparing
0.  Looks like cman needs this error corrected, along with the addition
of the ring leader node id.

> Dave

A ring id is uniquely identified by the nodeid of the ring leader and
the 64 bit value of the ringid.  Need both values in the comparison.
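
As a rough sketch of that comparison (struct example_ring_id and both
function names are hypothetical, chosen for illustration only; the real
layout is whatever the API ends up exposing):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: ring leader node id plus the 64 bit sequence number. */
struct example_ring_id {
    uint32_t leader_nodeid;
    uint64_t seq;
};

/* Field-by-field equality; unlike memcmp() it never looks at padding bytes. */
bool example_ring_id_equal(const struct example_ring_id *a,
                           const struct example_ring_id *b)
{
    assert(a != NULL && b != NULL);
    return a->leader_nodeid == b->leader_nodeid && a->seq == b->seq;
}

/* What a 32 bit interface can compare: only the low 32 bits of seq. */
bool example_truncated_seq_equal(const struct example_ring_id *a,
                                 const struct example_ring_id *b)
{
    assert(a != NULL && b != NULL);
    return (uint32_t)a->seq == (uint32_t)b->seq;
}
```

Two rings with the same seq but different leaders compare as different
under the first function, and two seq values that differ only above bit 31
compare as equal under the second, which is why the truncated value is not
enough on its own.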

Regards
-steve
