[Openais] whitetank: question regarding saCkptCheckpointOpen()

Joel Becker Joel.Becker at oracle.com
Sat Feb 21 13:02:01 PST 2009


On Sat, Feb 21, 2009 at 04:38:49PM +0100, Lars Marowsky-Bree wrote:
> I have a question regarding this call; possibly it applies to other
> CKPT functions too, but this is the one currently giving me worries.
> 
> ocfs2_controld uses this service, and they get spawned by the cluster
> manager at essentially the same time everywhere. (At a time where all
> nodes are up, and dlm_controld.pcmk is also already up.)
> 
> This causes saCkptCheckpointOpen() to fail on a number of nodes,
> possibly because the membership is in-flux, with EAGAIN. Now, what is an
> appropriate amount of times to retry such calls?

	It shouldn't be because of membership.  It's probably due to
the time corosync takes to get the checkpoint set up on the initial node.
	Actually, is there any chance you can figure out which one it
is?  There is debug logging in ocfs2_controld about that.  We want to
know if it is the write&&EEXIST case or the !write&&!EEXIST case.
	I really suspect it is the !write&&!EEXIST case on the global
checkpoint.  Here's what ocfs2_controld is trying to do (from
ocfs2_controld/main.c).  Note the last two paragraphs especially:

/*
 * Protocol negotiation.
 *
 * This is the maximum protocol supported by the daemon for inter-daemon
 * communication.  The negotiated value daemon_running_proto is what the
 * daemon uses at runtime.
 *
 * All daemons must support the initial protocol, which works as follows:
 * Prior to starting CPG, daemons store two values in the local node
 * checkpoint.  The maximum daemon protocol is stored in the
 * "daemon_max_protocol" section, and the ocfs2 maximum protocol is stored
 * in the "ocfs2_max_protocol" section.  The protocols are stored in the
 * format:
 *
 *     <2-char-hex-major><space><2-char-hex-minor><null>
 *
 * These sections MUST be created before CPG is started.  Other sections
 * MUST NOT be created at this time.
 *
 * Once CPG is started, the daemon reads the "daemon_protocol" and
 * "ocfs2_protocol" sections from the daemon's global checkpoint.  The
 * values are stored as the running versions.  All interaction takes place
 * based on the running versions.  At this point, the daemon may add
 * other sections to the local node checkpoint that are part of the
 * running protocol.
 *
 * If the daemon is the first node to join the group, it sets the
 * "daemon_protocol" and "ocfs2_protocol" sections of the global checkpoint
 * to the maximum values this daemon supports.
 */
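
	For illustration, here is a minimal sketch of storing a version
string in that format, assuming the standard SA Forum saCkpt.h API and
that the local node checkpoint is already open read/write.  The section
name handling and the function itself are made up for this sketch, not
lifted from ocfs2_controld:

#include <stdio.h>
#include <string.h>
#include <saCkpt.h>

/*
 * Sketch only: create a section holding
 * "<2-char-hex-major><space><2-char-hex-minor><null>" in an already-open
 * checkpoint.  Error handling is left to the caller.
 */
static SaAisErrorT store_protocol(SaCkptCheckpointHandleT ch,
				  const char *section_name,
				  unsigned int major, unsigned int minor)
{
	char buf[6];	/* "MM mm" plus the trailing NUL */
	SaCkptSectionIdT id = {
		.idLen	= (SaUint16T)strlen(section_name),
		.id	= (SaUint8T *)section_name,
	};
	SaCkptSectionCreationAttributesT attrs = {
		.sectionId	= &id,
		.expirationTime	= SA_TIME_END,	/* placeholder: never expire */
	};

	snprintf(buf, sizeof(buf), "%02x %02x", major & 0xff, minor & 0xff);

	/* The NUL is part of the stored format, so write all sizeof(buf) bytes. */
	return saCkptSectionCreate(ch, &attrs, buf, sizeof(buf));
}

	Per the comment, the "daemon_max_protocol" and
"ocfs2_max_protocol" sections would be created this way before CPG is
started.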

	So the first ocfs2_controld to start (the one that gets a cpg
join with only itself as a member) is responsible for opening the global
checkpoint read/write and storing its protocol versions.  All the other
ocfs2_controlds get cpg join messages with more than themselves as
members and attempt to open the global checkpoint read-only.  They want
to read the protocol versions stored by the first node.
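	That split is just a difference in open flags.  Here's a rough
sketch, again assuming the standard saCkpt.h API; the checkpoint name,
sizes, retention, and timeout are placeholders, not the values
ocfs2_controld actually uses:

#include <stdio.h>
#include <saCkpt.h>

/*
 * Sketch only: the first member creates the global checkpoint read/write
 * so it can store the negotiated protocols; later members open it
 * read-only and go looking for those sections.
 */
static SaAisErrorT open_global_checkpoint(SaCkptHandleT handle,
					  int first_member,
					  SaCkptCheckpointHandleT *ch)
{
	SaNameT name;
	SaCkptCheckpointCreationAttributesT attrs = {
		.creationFlags		= SA_CKPT_WR_ALL_REPLICAS,
		.checkpointSize		= 4096,		/* placeholder sizes */
		.retentionDuration	= SA_TIME_END,	/* placeholder */
		.maxSections		= 4,
		.maxSectionSize		= 1024,
		.maxSectionIdSize	= 32,
	};
	SaCkptCheckpointOpenFlagsT flags;

	/* Placeholder name; the daemon derives its own checkpoint names. */
	name.length = snprintf((char *)name.value, sizeof(name.value),
			       "ocfs2:controld:global");

	if (first_member)
		flags = SA_CKPT_CHECKPOINT_CREATE | SA_CKPT_CHECKPOINT_WRITE;
	else
		flags = SA_CKPT_CHECKPOINT_READ;

	/* Creation attributes only make sense with the CREATE flag set. */
	return saCkptCheckpointOpen(handle, &name,
				    first_member ? &attrs : NULL,
				    flags, 2 * SA_TIME_ONE_SECOND, ch);
}
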
	The code, as you note, assumes that the Ckpt service will have
made the global checkpoint visible to all nodes within the two seconds
we allow for startup.  I've never had a problem with that, but I haven't
started up more than four nodes at once.
	The other case, write&&EEXIST, only happens when you kill and
restart a controld; that's why I suspect it isn't your problem.  I'm
assuming you're starting on rebooted nodes.  I actually ran into this
one more often when I was developing, because I was restarting daemons
left and right as I recompiled and tested.
	In the write&&EEXIST case, you often have a daemon's local
checkpoint (the one opened before cpg is started) already existing in
the Ckpt service.  When the daemon exited before, Ckpt was supposed to
tear it down.  Then, when the daemon restarts, it tries to create the
checkpoint again with O_EXCL semantics.  It turns out that if you restart
fast enough, Ckpt won't have torn it down yet.  So you retry a couple times,
Ckpt eventually tears the old one down, and you can now create a new
one.  Once again, two seconds was always enough in my testing.
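	Either way, the shape of the retry is the same: call the open
again a bounded number of times with a short sleep in between.  Here's a
sketch of such a wrapper; the one-second interval, the set of errors
treated as transient, and the name are illustrative, not what ckpt.c
actually does:

#include <unistd.h>
#include <saCkpt.h>

/*
 * Sketch only: retry saCkptCheckpointOpen() for the transient failures
 * described above.
 *   SA_AIS_ERR_TRY_AGAIN - the service isn't ready yet (the EAGAIN case).
 *   SA_AIS_ERR_NOT_EXIST - a read-only open raced ahead of the first
 *                          node's create of the global checkpoint.
 *   SA_AIS_ERR_EXIST     - a create raced the teardown of a previous
 *                          daemon's local checkpoint.
 * A real loop would only treat NOT_EXIST or EXIST as transient in the
 * specific situations above; this sketch lumps them together.
 */
static SaAisErrorT ckpt_open_retry(SaCkptHandleT handle, const SaNameT *name,
				   const SaCkptCheckpointCreationAttributesT *attrs,
				   SaCkptCheckpointOpenFlagsT flags,
				   int retries,
				   SaCkptCheckpointHandleT *ch)
{
	SaAisErrorT rc;

	while (1) {
		rc = saCkptCheckpointOpen(handle, name, attrs, flags,
					  2 * SA_TIME_ONE_SECOND, ch);
		if (rc != SA_AIS_ERR_TRY_AGAIN &&
		    rc != SA_AIS_ERR_NOT_EXIST &&
		    rc != SA_AIS_ERR_EXIST)
			break;			/* success or a hard error */
		if (retries-- <= 0)
			break;			/* give up; return last error */
		sleep(1);			/* placeholder interval */
	}

	return rc;
}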

> And what is the real cause why they fail spuriously?

	I assume we're finding some sort of propagation delay in
corosync/openais.  I'm curious what sdake thinks.

> Initially, ocfs2_controld retried twice; clearly not often enough. But I
> have a feeling that the number of retries depends on timing and on the
> number of cluster nodes, which makes my gut scream "band-aid!" when I
> simply increase it - when are we going to hit it again?

	You've noted that ocfs2_controld/ckpt.c has two retry counts,
the TENTATIVE retry count and the SERIOUS one.  The TENTATIVE count is
2 - we expect the call to work without a retry.  The SERIOUS one is for
operations we want to try harder on, and we use it for checkpoint
init/exit.
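	With a wrapper like the retry sketch above, those two budgets
just become different arguments.  The constant names and the SERIOUS
value below are made up (only the TENTATIVE count of 2 is real), and
handle, name, attrs, and ch stand for an already-initialized Ckpt
handle, checkpoint name, creation attributes, and result handle:

/* Illustrative retry budgets in the spirit of ckpt.c's two counts. */
#define TENTATIVE_RETRIES	2	/* expected to work on the first try */
#define SERIOUS_RETRIES		10	/* made-up value for init/exit paths */

	/* Checkpoint setup at daemon init: worth trying harder. */
	rc = ckpt_open_retry(handle, &name, &attrs,
			     SA_CKPT_CHECKPOINT_CREATE | SA_CKPT_CHECKPOINT_WRITE,
			     SERIOUS_RETRIES, &ch);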

> Do we need a "blocking" wrapper, and simply retry infinitely?

	Looks like dlm_controld retries indefinitely.  I'm open to
suggestions.
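	For what it's worth, the blocking flavor is just the same loop
without a budget.  A sketch, again illustrative rather than
dlm_controld's actual code:

#include <unistd.h>
#include <saCkpt.h>

/* Sketch only: which results we'd consider worth waiting out. */
static int ckpt_err_transient(SaAisErrorT rc)
{
	return rc == SA_AIS_ERR_TRY_AGAIN ||
	       rc == SA_AIS_ERR_NOT_EXIST ||
	       rc == SA_AIS_ERR_EXIST;
}

/*
 * Sketch only: block until the open succeeds or fails with an error we
 * don't consider transient.  Never gives up early, at the cost of
 * possibly hanging if the service never comes around.
 */
static SaAisErrorT ckpt_open_block(SaCkptHandleT handle, const SaNameT *name,
				   const SaCkptCheckpointCreationAttributesT *attrs,
				   SaCkptCheckpointOpenFlagsT flags,
				   SaCkptCheckpointHandleT *ch)
{
	SaAisErrorT rc;

	do {
		rc = saCkptCheckpointOpen(handle, name, attrs, flags,
					  2 * SA_TIME_ONE_SECOND, ch);
		if (ckpt_err_transient(rc))
			sleep(1);	/* placeholder interval */
	} while (ckpt_err_transient(rc));

	return rc;
}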

Joel

-- 

Life's Little Instruction Book #306

	"Take a nap on Sunday afternoons."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

