[Openais] whitetank: question regarding saCkptCheckpointOpen()

Lars Marowsky-Bree lmb at suse.de
Sat Feb 21 07:38:49 PST 2009


Hi all,

I have a question regarding this call; possibly it applies to other
CKPT functions too, but this is the one currently giving me worries.

ocfs2_controld uses this service, and the daemons get spawned by the
cluster manager at essentially the same time on every node (at a point
when all nodes are up and dlm_controld.pcmk is already running).

This causes saCkptCheckpointOpen() to fail with EAGAIN
(SA_AIS_ERR_TRY_AGAIN) on a number of nodes, possibly because the
membership is in flux. Now, what is an appropriate number of times to
retry such calls?

And what is the real reason they fail spuriously?

Initially, ocfs2_controld retried twice; clearly not often enough. But I
have a feeling that the number of retries depends on timing and on the
number of cluster nodes, which makes my gut scream "band-aid!" when I
simply increase it - when are we going to hit it again?
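
To make the retry question concrete, here is a minimal sketch of a
bounded retry with exponential backoff. It is not what ocfs2_controld
does today. rc_t, op_fn, retry_with_backoff() and flaky_open() are
hypothetical names, and the stub return codes only stand in for
SA_AIS_OK and SA_AIS_ERR_TRY_AGAIN; the retry count and delays are
arbitrary assumptions, not recommended values.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-ins for the AIS return codes; real code would use
 * SaAisErrorT, SA_AIS_OK and SA_AIS_ERR_TRY_AGAIN from saAis.h. */
typedef enum { RC_OK = 0, RC_TRY_AGAIN = 1, RC_FAILED = 2 } rc_t;

/* The operation to retry, e.g. a thin wrapper around
 * saCkptCheckpointOpen() that maps its result onto rc_t. */
typedef rc_t (*op_fn)(void *ctx);

/* Sleep for the given number of milliseconds (POSIX nanosleep). */
static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Call op until it stops reporting "try again" or max_tries attempts
 * have been made. The delay doubles after every failed attempt and is
 * capped at max_delay_ms. Returns the result of the last attempt. */
static rc_t retry_with_backoff(op_fn op, void *ctx, int max_tries,
                               long delay_ms, long max_delay_ms)
{
    rc_t rc = RC_TRY_AGAIN;

    assert(op != NULL && max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        rc = op(ctx);
        if (rc != RC_TRY_AGAIN)
            break;                      /* success or a hard error */
        if (i + 1 < max_tries) {        /* no sleep after the last try */
            sleep_ms(delay_ms);
            delay_ms *= 2;
            if (delay_ms > max_delay_ms)
                delay_ms = max_delay_ms;
        }
    }
    return rc;
}

/* Test stub: reports "try again" *ctx times, then succeeds. */
static rc_t flaky_open(void *ctx)
{
    int *remaining = ctx;
    return (*remaining)-- > 0 ? RC_TRY_AGAIN : RC_OK;
}
```

With a stub that fails three times, retry_with_backoff(flaky_open,
&n, 5, 1, 4) succeeds on the fourth attempt, while a budget of two
tries gives up with RC_TRY_AGAIN. That is exactly the "band-aid"
property: the budget is a guess and does not scale with the cluster.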

Do we need a "blocking" wrapper that simply retries indefinitely?

saCkptCheckpointOpen() also takes a timeout parameter, which it
subsequently does not appear to use anywhere.
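
If the library really ignores the timeout, a caller-side "blocking"
wrapper could enforce one itself. The following is only a sketch under
that assumption: ckpt_rc_t, ckpt_op_fn, call_until_deadline() and the
two stubs are hypothetical names, the return codes stand in for the
AIS ones, and the timeout is given in nanoseconds because that is, as
far as I can tell, the unit of SaTimeT.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-ins for SA_AIS_OK, SA_AIS_ERR_TRY_AGAIN and
 * SA_AIS_ERR_TIMEOUT. */
typedef enum { CKPT_OK = 0, CKPT_TRY_AGAIN = 1, CKPT_TIMEOUT = 2 } ckpt_rc_t;

/* The operation to repeat, e.g. a wrapper around saCkptCheckpointOpen(). */
typedef ckpt_rc_t (*ckpt_op_fn)(void *ctx);

/* Monotonic clock in nanoseconds, unaffected by wall-clock jumps. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Repeat op every poll_ms milliseconds while it reports "try again".
 * Gives up with CKPT_TIMEOUT once timeout_ns has elapsed; any other
 * result from op is returned immediately. */
static ckpt_rc_t call_until_deadline(ckpt_op_fn op, void *ctx,
                                     int64_t timeout_ns, long poll_ms)
{
    assert(op != NULL && timeout_ns >= 0 && poll_ms >= 0);

    const int64_t deadline = now_ns() + timeout_ns;
    const struct timespec pause = { poll_ms / 1000,
                                    (poll_ms % 1000) * 1000000L };
    for (;;) {
        ckpt_rc_t rc = op(ctx);
        if (rc != CKPT_TRY_AGAIN)
            return rc;
        if (now_ns() >= deadline)
            return CKPT_TIMEOUT;
        nanosleep(&pause, NULL);
    }
}

/* Test stub: reports "try again" *ctx times, then succeeds. */
static ckpt_rc_t ready_after(void *ctx)
{
    int *remaining = ctx;
    return (*remaining)-- > 0 ? CKPT_TRY_AGAIN : CKPT_OK;
}

/* Test stub: never becomes ready. */
static ckpt_rc_t never_ready(void *ctx)
{
    (void)ctx;
    return CKPT_TRY_AGAIN;
}
```

Compared with a fixed retry count, the budget here is expressed in
time, so it at least does not depend on how fast each attempt fails.
It still does not answer why the call fails in the first place.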

Maybe this question doesn't make sense, but I'm trying to dig into it
right now ;-)


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
