[Openais] How to tune corosync heartbeat timer ?

Alain.Moulle Alain.Moulle at bull.net
Wed Jun 2 01:19:21 PDT 2010


Hi Steven,

Do you have a formula for calculating the timeout from the token,
token_retransmits_before_loss_const, and consensus values?

Is there any risk to corosync's behavior, stability, etc. if we
increase this timeout to around 45s / 60s?

Does anybody have experience with this?

Thanks
Regards
Alain
> Hi Steven,
> I've given it a try:
> setting token=45000 and token_retransmits_before_loss_const=45 also
> requires setting consensus=54000 (at least 1.2 * token), otherwise corosync
> fails to start. With these values, when I do ifdown eth0 on one node, it
> takes around 98s for this node to appear OFFLINE in crm_mon on the healthy
> node, so I don't know exactly what the formula is.
>
> Thanks
> Regards
> Alain
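
One possible reading of the figures quoted above, offered as an assumption rather than a formula confirmed anywhere in this thread: the failed node is noticed once the token timeout expires, and a new membership is then agreed within the consensus timeout, so the two values add up. The Python sketch below checks the "at least 1.2 * token" rule for consensus and compares token + consensus with the observed 98s. The function names are hypothetical and are not part of corosync.

```python
def min_consensus_ms(token_ms: int) -> int:
    """Smallest consensus value allowed by the 'at least 1.2 * token' rule."""
    # 1.2 * token == 6 * token / 5; integer ceiling division avoids float rounding.
    return (6 * token_ms + 4) // 5


def estimated_detection_ms(token_ms: int, consensus_ms: int) -> int:
    """ASSUMPTION (not confirmed in this thread): the time until a node is
    reported lost is roughly token + consensus."""
    if consensus_ms < min_consensus_ms(token_ms):
        raise ValueError("consensus must be at least 1.2 * token")
    return token_ms + consensus_ms


if __name__ == "__main__":
    token, consensus = 45000, 54000          # values from the message above
    print(min_consensus_ms(token))           # 54000
    print(estimated_detection_ms(token, consensus))  # 99000, close to the observed 98s
```

If this reading holds, a target of about 45s in total would need a token value well below 45000, since consensus alone adds at least 1.2 * token on top.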
>
>   
>
>     token: 45000
>     token_retransmits_before_loss_const: 45
>
>      On Wed, 2010-05-19 at 08:39 +0200, Alain.Moulle wrote:
>         
>
>         Hi Steven,
>         in fact, I first posted this question on the Pacemaker ML,
>         but there is no way in Pacemaker to increase this time, which
>         I think is normal since the "cluster manager" part is provided
>         by corosync, which manages the heartbeat. My concern is to
>         increase this time substantially, even up to values such as 45s.
>         This is not a problem for the applications I have to manage,
>         but 10s is really a big problem for me in case of a network
>         problem that leads to silence on the heartbeat for a while. So,
>         based on your experience, which parameters do you think I can
>         try to increase to get this 45s timeout?
>
>         Thanks a lot.
>         Regards
>         Alain
>               
>
>             On Mon, 2010-05-17 at 08:25 +0200, Alain.Moulle wrote:
>                     
>
>                     Hi again,
>
>                     I've checked man corosync.conf and seen many parameters
>                     around token timers etc., but I can't see how to increase the heartbeat
>                     timeout. In testing, the timeout is between 10s and 12s
>                     before a node decides to fence another one in the cluster (for
>                     example, when I force an ifdown eth0 on this node to simulate a heartbeat failure).
>                     I can't see which parameter(s) to tune in corosync.conf to increase
>                     these 10 or 12s ...
>
>                     Any tip would be appreciated...
>                     Thanks
>                     Alain
>                                 
>
>             Alain,
>
>             I don't have a direct answer to your question.  Corosync detects a
>             failure of any node in "token" msec.  I have not measured how long
>             qpid/fencing/pacemaker/rgmanager/gfs/ocfs/etc take to operate on this
>             notification.  This delta between failure detection and recovery would
>             be a good question to potentially ask on the pacemaker ml.
>
>             In my test environments I run at token = 1000 msec.  Totem can be tuned
>             to lower values, but under a heavy network load, may falsely detect a
>             node failure.
>
>             Most products that use Corosync ship with a 10000 msec (10 sec) or larger
>             token value to minimize the chance of falsely detecting a node failure.
>
>             The token timer is just one consideration, however.  The
>             "token_retransmits_before_loss_const" defaults to 4.  This may be too
>             low in lossy or heavy load networks.  A higher value for this
>             configuration produces a bit more load but more resilient behavior.
>
>             Regards
>             -steve
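
To put numbers on the retransmit trade-off described above, here is a small Python sketch. It assumes that corosync derives the retransmit interval as token / (token_retransmits_before_loss_const + 0.2), truncated to whole milliseconds. That relation is an assumption here: it reproduces the 238 ms token_retransmit default listed in corosync.conf(5) for token=1000 and a constant of 4, but it should be checked against the corosync version in use. The function name is hypothetical.

```python
def token_retransmit_ms(token_ms: int, retransmits_before_loss_const: int) -> int:
    """ASSUMPTION: retransmit interval = token / (const + 0.2), truncated.

    Multiplying numerator and denominator by 5 keeps the arithmetic in integers.
    """
    if token_ms <= 0 or retransmits_before_loss_const <= 0:
        raise ValueError("both values must be positive")
    return (5 * token_ms) // (5 * retransmits_before_loss_const + 1)


if __name__ == "__main__":
    print(token_retransmit_ms(1000, 4))     # 238
    print(token_retransmit_ms(45000, 4))    # 10714
    print(token_retransmit_ms(45000, 45))   # 995
```

Under that assumption, raising the token value to 45000 while keeping the default constant of 4 would leave more than 10s between retransmit attempts, whereas a constant of 45 keeps them about 1s apart, at the cost of somewhat more traffic.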
>
>
>                     
>
>         _______________________________________________
>         Openais mailing list
>         Openais at lists.linux-foundation.org
>         https://lists.linux-foundation.org/mailman/listinfo/openais