[Openais] whitetank: question regarding saCkptCheckpointOpen()

Sat Feb 21 17:11:48 PST 2009

On Sat, 2009-02-21 at 13:02 -0800, Joel Becker wrote:
> On Sat, Feb 21, 2009 at 04:38:49PM +0100, Lars Marowsky-Bree wrote:
> > I have a question regarding this call; possibly it applies to other
> > CKPT functions too, but this is the one currently giving me worries.
> > 
> > ocfs2_controld uses this service, and they get spawned by the cluster
> > manager at essentially the same time everywhere. (At a time where all
> > nodes are up, and dlm_controld.pcmk is also already up.)
> > 
> > This causes saCkptCheckpointOpen() to fail on a number of nodes,
> > possibly because the membership is in-flux, with EAGAIN. Now, what is an
> > appropriate amount of times to retry such calls?
> 
> 	It shouldn't be because of membership.  It's probably due to
> the time corosync takes to get the checkpoint setup on the initial node.
> 	Actually, is there any chance you can figure out which one it
> is?  There is debug logging in ocfs2_controld about that.  We want to
> know if it is the write&&EEXIST case or the !write&&!EEXIST case.
> 	I really suspect it is the !write&&!EEXIST case on the global
> checkpoint.  Here's what ocfs2_controld is trying to do (from
> ocfs2_controld/main.c).  Note the last two paragraphs especially:
> 
> /*
>  * Protocol negotiation.
>  *
>  * This is the maximum protocol supported by the daemon for inter-daemon
>  * communication.  The negotiated value daemon_running_proto is what the
>  * daemon uses at runtime.
>  *
>  * All daemons must support the initial protocol, which works as follows:
>  * Prior to starting CPG, daemons store two values in the local node
>  * checkpoint.  The maximum daemon protocol is stored in the
>  * "daemon_max_protocol" section, and the ocfs2 maximum protocol is stored
>  * in the "ocfs2_max_protocol" section.  The protocols are stored in the
>  * format:
>  *
>  *     <2-char-hex-major><space><2-char-hex-minor><null>
>  *
>  * These sections MUST be created before CPG is started.  Other sections
>  * MUST NOT be created at this time.
>  *
>  * Once CPG is started, the daemon reads the "daemon_protocol" and
>  * "ocfs2_protocol" sections from the daemon's global checkpoint.  The
>  * values are stored as the running versions.  All interaction takes place
>  * based on the running versions.  At this point, the daemon may add
>  * other sections to the local node checkpoint that are part of the
>  * running protocol.
>  *
>  * If the daemon is the first node to join the group, it sets the
>  * "daemon_protocol" and "ocfs2_protocol" sections of the global checkpoint
>  * to the maximum values this daemon supports.
>  */
> 
> 	So the first ocfs2_controld to start (the one that gets a cpg
> join with only itself as a member) is responsible for opening the global
> checkpoint read/write and storing its protocol versions.  All the other
> ocfs2_controlds get cpg join messages with more than themselves as
> members and attempt to open the global checkpoint read-only.  They want
> to read the protocol versions stored by the first node.
> 	The code, as you note, assumes that the Ckpt service will have
> made the global checkpoint visible to all nodes within the two seconds
> we allow for startup.  I've never had a problem with that, but I haven't
> started up more than four nodes at once.
> 	The other case, write&&EEXIST, only happens when you kill and
> restart a controld; that's why I suspect it isn't your problem.  I'm
> assuming you're starting on rebooted nodes.  I actually ran into this
> one more often when I was developing, because I was restarting daemons
> left and right as I recompiled and tested.
> 	In the write&&EEXIST case, you often have a daemon's local
> checkpoint (the one opened before cpg is started) already existing in
> the Ckpt service.  When the daemon exited before, Ckpt was supposed to
> tear it down.  Then, when the daemon restarts, it tries to create the
> Ckpt again with O_EXCL semantics.  It turns out that if you restart fast
> enough, Ckpt won't have torn it down yet.  So you retry a couple times,
> Ckpt eventually tears the old one down, and you can now create a new
> one.  Once again, two seconds was always enough in my testing.
> 
> > And what is the real cause why they fail spuriously?
> 
> 	I assume we're finding some sort of propagation delay in
> corosync/openais.  I'm curious what sdake thinks.
> 
> > Initially, ocfs2_controld retried twice; clearly not often enough. But I
> > have a feeling that the number of retries depends on timing and on the
> > number of cluster nodes, which makes my gut scream "band-aid!" when I
> > simply increase it - when are we going to hit it again?
> 
> 	You've noted that the ocfs2_controld/ckpt.c has two retry
> counts, the TENTATIVE retry count and the SERIOUS one.  The TENTATIVE is
> 2 - we expect that it should really work without a retry.  The SERIOUS
> one is for behaviors we want to try harder on, and we use that for
> init/exit of checkpoint.
> 
> > Do we need a "blocking" wrapper, and simply retry infinitely?
> 
> 	Looks like dlm_controld retries indefinitely.  I'm open to
> suggestions.
> 
> Joel
> 

Here is how it works.

During a configuration change, most new messages from library calls are
rejected with SA_AIS_ERR_TRY_AGAIN.  It is hard to send messages if the
membership is not determined and no ring is available.  So instead of
queuing messages that can't be sent, the library returns try again.

In the current cluster 3 code, cman blocks membership changes for 10
seconds, which is the minimum period over which retries should continue.
The blocked period is the "token" totem parameter (man openais.conf).

With the latest ipc code, TRY_AGAIN normally occurs only during config
changes as described above, or when aisexec or corosync hasn't been
started yet.
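
As a rough sketch of what that implies for callers: retry on try-again
until a deadline of at least the token period has passed, rather than a
fixed number of times.  The names below (retry_until_deadline,
demo_call_fn, demo_flaky_open, and the DEMO_* constants) are made up for
illustration and are not part of openais or ocfs2_controld.  Real code
would wrap saCkptCheckpointOpen() and compare against
SA_AIS_ERR_TRY_AGAIN from saAis.h.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Stand-ins for SaAisErrorT values, so the sketch builds without the
 * openais headers.  Real code would use SA_AIS_OK and
 * SA_AIS_ERR_TRY_AGAIN instead. */
enum { DEMO_OK = 1, DEMO_ERR_TRY_AGAIN = 6 };

/* Hypothetical callback type wrapping one attempt at the library call. */
typedef int (*demo_call_fn)(void *arg);

/* Call `call` repeatedly while it reports try-again, sleeping sleep_ms
 * between attempts, until deadline_ms has elapsed.  Returns the last
 * result: success, a hard error, or try-again if the deadline passed. */
static int retry_until_deadline(demo_call_fn call, void *arg,
                                long deadline_ms, long sleep_ms)
{
    struct timespec start, now, pause;
    int rc;

    assert(call != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    pause.tv_sec = sleep_ms / 1000;
    pause.tv_nsec = (sleep_ms % 1000) * 1000000L;

    for (;;) {
        rc = call(arg);
        if (rc != DEMO_ERR_TRY_AGAIN)
            return rc;          /* success or a non-retryable error */

        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed_ms = (now.tv_sec - start.tv_sec) * 1000L +
                          (now.tv_nsec - start.tv_nsec) / 1000000L;
        if (elapsed_ms >= deadline_ms)
            return rc;          /* give up, still try-again */

        nanosleep(&pause, NULL);
    }
}

/* Fake call for demonstration: reports try-again *remaining times,
 * then succeeds. */
static int demo_flaky_open(void *arg)
{
    int *remaining = arg;

    if (*remaining > 0) {
        (*remaining)--;
        return DEMO_ERR_TRY_AGAIN;
    }
    return DEMO_OK;
}
```

With a 10 second token, a caller might pass a deadline_ms of 10000 or a
bit more.  The deadline should follow the configured token value rather
than a hard-coded retry count, since the count that happens to work
depends on timing and cluster size.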