[Openais] whitetank: question regarding saCkptCheckpointOpen()
sdake at redhat.com
Sat Feb 21 17:15:37 PST 2009
On Sat, 2009-02-21 at 16:38 +0100, Lars Marowsky-Bree wrote:
> Hi all,
> I have a question regarding this call; possibly it applies to other
> CKPT functions too, but this is the one currently giving me worries.
> ocfs2_controld uses this service, and they get spawned by the cluster
> manager at essentially the same time everywhere. (At a time where all
> nodes are up, and dlm_controld.pcmk is also already up.)
> This causes saCkptCheckpointOpen() to fail on a number of nodes with
> EAGAIN, possibly because the membership is in flux. Now, what is an
> appropriate number of times to retry such calls?
> And what is the real cause why they fail spuriously?
> Initially, ocfs2_controld retried twice; clearly not often enough. But I
> have a feeling that the number of retries depends on timing and on the
> number of cluster nodes, which makes my gut scream "band-aid!" when I
> simply increase it - when are we going to hit it again?
> Do we need a "blocking" wrapper, and simply retry infinitely?
By design, when try_again (SA_AIS_ERR_TRY_AGAIN) is delivered, the
developer can either retry the call or do other work in the meantime.
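
As a rough illustration of the retry option, here is a minimal C sketch of a
bounded retry loop with a pause between attempts. Everything in it is a
stand-in invented for the example: status_t, attempt_fn, retry_on_try_again()
and flaky_open() are hypothetical names, not part of openais or the AIS API.
Real code would call saCkptCheckpointOpen() inside the callback and compare
the result against SA_AIS_ERR_TRY_AGAIN; the attempt limit and delay are
arbitrary choices left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Stand-in status codes for this sketch only; real code would use
 * SaAisErrorT values such as SA_AIS_OK and SA_AIS_ERR_TRY_AGAIN. */
typedef enum { ST_OK, ST_TRY_AGAIN, ST_FAILED } status_t;

/* Hypothetical callback type: performs one attempt of the real call. */
typedef status_t (*attempt_fn)(void *ctx);

/* Call op until it returns something other than ST_TRY_AGAIN, or until
 * max_attempts attempts have been made. Sleeps delay_ms milliseconds
 * between attempts. Returns the status of the last attempt, or
 * ST_TRY_AGAIN if max_attempts is not positive. */
static status_t retry_on_try_again(attempt_fn op, void *ctx,
                                   int max_attempts, long delay_ms)
{
    status_t st = ST_TRY_AGAIN;

    assert(op != NULL);
    for (int i = 0; i < max_attempts; i++) {
        st = op(ctx);
        if (st != ST_TRY_AGAIN)
            break;                      /* success or a hard error */
        if (i + 1 < max_attempts && delay_ms > 0) {
            struct timespec ts;
            ts.tv_sec = delay_ms / 1000;
            ts.tv_nsec = (delay_ms % 1000) * 1000000L;
            nanosleep(&ts, NULL);       /* pause before the next attempt */
        }
    }
    return st;
}

/* Demo operation: reports ST_TRY_AGAIN until the counter that ctx
 * points to reaches zero, then reports ST_OK. */
static status_t flaky_open(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return ST_TRY_AGAIN;
    }
    return ST_OK;
}
```

A caller that would rather do other work than sleep can pass a delay of 0
and a small attempt limit, then come back to the call later if the result is
still ST_TRY_AGAIN.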
> saCkptCheckpointOpen() also takes a timeout parameter. Which it
> subsequently does not appear to use anywhere.
This is ignored in openais. It is meant for implementations which may
have huge open times (several minutes!), which openais does not.
Maybe it makes sense during the config change case, but I think try
again is easier to use than dealing with a timeout.
> Maybe this question doesn't make sense, but I'm trying to dig into it
> right now ;-)
Hopefully my previous message on this topic answered your questions.