[Openais] whitetank: question regarding saCkptCheckpointOpen()
lmb at suse.de
Sat Feb 21 07:38:49 PST 2009
I have a question regarding this call; possibly it applies to other
CKPT functions too, but this is the one currently giving me worries.
ocfs2_controld uses this service, and its instances get spawned by the
cluster manager at essentially the same time everywhere. (At a time when
all nodes are up, and dlm_controld.pcmk is also already up.)
This causes saCkptCheckpointOpen() to fail on a number of nodes with
EAGAIN (that is, SA_AIS_ERR_TRY_AGAIN), possibly because the membership
is in flux. Now, what is an appropriate number of times to retry such
calls?
And what is the real reason they fail spuriously?
Initially, ocfs2_controld retried twice; clearly not often enough. But I
have a feeling that the number of retries depends on timing and on the
number of cluster nodes, which makes my gut scream "band-aid!" when I
simply increase it - when are we going to hit it again?
Do we need a "blocking" wrapper, and simply retry infinitely?
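To make the "blocking wrapper" idea concrete, here is a rough sketch of
what I have in mind, not what ocfs2_controld does today. It retries only
on "try again", backs off with a doubling delay up to a cap, and treats
max_tries == 0 as "retry forever". The names retry_try_again, attempt_fn
and flaky_attempt are made up for illustration, and the RC_* constants
are local stand-ins for SA_AIS_OK and SA_AIS_ERR_TRY_AGAIN so that the
sketch compiles without the openais headers. A real caller would wrap
saCkptCheckpointOpen() in the callback.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Local stand-ins for SA_AIS_OK (1) and SA_AIS_ERR_TRY_AGAIN (6), so
 * the sketch builds without the AIS headers. Assumption: real code
 * would use the SaAisErrorT values from saAis.h instead. */
enum { RC_OK = 1, RC_TRY_AGAIN = 6 };

/* Hypothetical callback: performs one attempt (e.g. one call to
 * saCkptCheckpointOpen()) and returns its result code. */
typedef int (*attempt_fn)(void *ctx);

/* Call op until it returns something other than "try again".
 * The pause between attempts starts at base_ms and doubles up to
 * cap_ms. max_tries == 0 means no limit (the "blocking" variant).
 * Returns the result code of the last attempt. */
static int retry_try_again(attempt_fn op, void *ctx, unsigned max_tries,
                           long base_ms, long cap_ms)
{
    long delay_ms = base_ms;
    unsigned tries = 0;

    assert(op != NULL && base_ms >= 0 && cap_ms >= base_ms);

    for (;;) {
        int rc = op(ctx);
        tries++;

        if (rc != RC_TRY_AGAIN)
            return rc;                 /* success or a hard error */
        if (max_tries != 0 && tries >= max_tries)
            return rc;                 /* retry budget exhausted */

        struct timespec ts = {
            .tv_sec  = delay_ms / 1000,
            .tv_nsec = (delay_ms % 1000) * 1000000L,
        };
        nanosleep(&ts, NULL);

        /* Double the pause, but never exceed the cap. */
        delay_ms = (delay_ms * 2 > cap_ms) ? cap_ms : delay_ms * 2;
    }
}

/* Demo stand-in for the real call: reports "try again" until the
 * counter that ctx points to has reached zero, then succeeds. */
static int flaky_attempt(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return RC_TRY_AGAIN;
    }
    return RC_OK;
}
```

For example, retry_try_again(flaky_attempt, &n, 10, 50, 1000) would make
at most ten attempts, pausing 50ms, 100ms, 200ms and so on up to one
second in between. Whether a limit like that is any less of a band-aid
than "retry twice" is exactly my question.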
saCkptCheckpointOpen() also takes a timeout parameter, which it
subsequently does not appear to use anywhere.
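If the library really ignores the timeout, the caller could enforce one
itself. The following is only a sketch under that assumption: it keeps
retrying on "try again" until a deadline on the monotonic clock has
passed, then gives up with a timeout code. The timeout is taken in
nanoseconds, matching the unit of SaTimeT. call_until_deadline, now_ns,
attempt_fn and flaky_attempt are hypothetical names, and the RC_*
constants are local stand-ins for SA_AIS_OK, SA_AIS_ERR_TIMEOUT and
SA_AIS_ERR_TRY_AGAIN.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Local stand-ins for SA_AIS_OK (1), SA_AIS_ERR_TIMEOUT (5) and
 * SA_AIS_ERR_TRY_AGAIN (6). Assumption: real code would take these
 * from saAis.h. */
enum { RC_OK = 1, RC_TIMEOUT = 5, RC_TRY_AGAIN = 6 };

#define NS_PER_SEC 1000000000LL

/* Hypothetical callback: one attempt at the real call. */
typedef int (*attempt_fn)(void *ctx);

/* Current monotonic time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * NS_PER_SEC + ts.tv_nsec;
}

/* Retry op on "try again" until timeout_ns has elapsed, pausing at
 * most pause_ns between attempts. One final attempt is made once the
 * deadline is reached; if that also says "try again", RC_TIMEOUT is
 * returned. Any other result code is passed through unchanged. */
static int call_until_deadline(attempt_fn op, void *ctx,
                               int64_t timeout_ns, int64_t pause_ns)
{
    assert(op != NULL && timeout_ns >= 0 && pause_ns > 0);

    const int64_t deadline = now_ns() + timeout_ns;

    for (;;) {
        int rc = op(ctx);

        if (rc != RC_TRY_AGAIN)
            return rc;

        int64_t left = deadline - now_ns();
        if (left <= 0)
            return RC_TIMEOUT;

        /* Never sleep past the deadline. */
        int64_t nap = (left < pause_ns) ? left : pause_ns;
        struct timespec ts = {
            .tv_sec  = (time_t)(nap / NS_PER_SEC),
            .tv_nsec = (long)(nap % NS_PER_SEC),
        };
        nanosleep(&ts, NULL);
    }
}

/* Demo stand-in for the real call: reports "try again" until the
 * counter that ctx points to has reached zero, then succeeds. */
static int flaky_attempt(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return RC_TRY_AGAIN;
    }
    return RC_OK;
}
```

A deadline at least does not depend on the number of nodes the way a
fixed retry count does, but it still leaves open what a sane value
would be.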
Maybe this question doesn't make sense, but I'm trying to dig into it
right now ;-)
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde