[Openais] Understanding the openais.conf 'rrp_*' variables

Fri Feb 19 15:17:34 PST 2010

On Fri, Feb 19, 2010 at 10:02 AM, Digimer <linux at alteeve.com> wrote:

> Hi all,
>
>   I've been reading through the man page and have been struggling to
> understand the relationship of the redundant ring protocol options. I
> think I understand now, but would be grateful if someone could confirm
> that I've got it right or not.
>
> rrp_problem_count_timeout
>
>   Two purposes;
>
> - When no errors are seen for this many milliseconds,
> rrp_problem_count_threshold is decremented by 1.
>
> - While an error exists, this many milliseconds is the upper limit
> before the interface is declared bad.
>
>   How is this different from 'token'?
>

>
> rrp_problem_count_threshold
>
> - Starts at '0' and is increased by 1 every rrp_token_expired_timeout
> milliseconds without receiving a token.
>
> - Counts down by 1 every rrp_problem_count_timeout milliseconds without
> a problem
>
>   How is this different from 'fail_to_recv_const'?
>
> rrp_token_expired_timeout
>
> - This is the maximum time that can pass without receiving a token
> before triggering an increment of rrp_problem_count_threshold.
>
>   How is this different from 'max_network_delay'?
>
>   I am sure I am misunderstanding something here. :)
>

RRP has two modes, active and passive. rrp_problem_count_timeout and
rrp_token_expired_timeout are only used in active mode.
rrp_problem_count_threshold is used in both active and passive mode; it is
a constant and never changes.

In active mode, if the token does not arrive within
rrp_token_expired_timeout, the internal problem_count is increased by 1. If
the token arrives within rrp_problem_count_timeout, the internal
problem_count is decreased by 1. As this goes on, if problem_count becomes
greater than or equal to the configured rrp_problem_count_threshold, that
interface is marked FAULTY and won't be used any more until the
administrator fixes it.
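
To make the active-mode bookkeeping concrete, here is a rough Python sketch
of the counter logic as described above. It is only a toy model, not
corosync's code: the names ActiveRing and simulate_active are made up for
illustration, the two timeouts are reduced to "expired" / "ok" events, and
keeping problem_count from going below 0 is my own assumption.

```python
from dataclasses import dataclass


@dataclass
class ActiveRing:
    """Toy model of one interface in active mode (hypothetical names)."""
    threshold: int          # rrp_problem_count_threshold, constant
    problem_count: int = 0  # internal counter, starts at 0
    faulty: bool = False

    def token_expired(self) -> None:
        # No token arrived within rrp_token_expired_timeout.
        if self.faulty:
            return
        self.problem_count += 1
        if self.problem_count >= self.threshold:
            self.faulty = True  # stays FAULTY until an administrator fixes it

    def token_arrived(self) -> None:
        # Token arrived within rrp_problem_count_timeout.
        if self.faulty:
            return
        if self.problem_count > 0:  # assumption: counter never goes negative
            self.problem_count -= 1


def simulate_active(events, threshold):
    """Feed a sequence of 'expired' / 'ok' events to a fresh ring."""
    ring = ActiveRing(threshold=threshold)
    for event in events:
        if event == "expired":
            ring.token_expired()
        else:
            ring.token_arrived()
    return ring
```

With a threshold of 3, the sequence expired, ok, expired, expired leaves
the interface usable (the counter ends at 2), while three expirations in a
row mark it FAULTY.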

In passive mode, it maintains a token_recv_count and an mcast_recv_count
for each interface. Whenever a token or mcast msg is received, the
corresponding count is increased by 1, and the count is compared with that
of the other interface. If the difference is more than
rrp_problem_count_threshold, the interface with the smaller count becomes
FAULTY and won't be used any more until the administrator fixes it.
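
The passive-mode comparison can be sketched the same way. Again this is a
simplified illustration under my own assumptions, not the real
implementation: find_faulty_passive is a hypothetical helper, and it looks
at a single counter per interface (the same check would apply to
token_recv_count and mcast_recv_count separately).

```python
def find_faulty_passive(recv_counts, threshold):
    """Return the indexes of interfaces that lag too far behind.

    recv_counts: one receive counter per interface (for example the
                 token_recv_count of ring 0 and ring 1).
    threshold:   rrp_problem_count_threshold.
    """
    highest = max(recv_counts)
    # An interface is FAULTY when it is behind the best interface by
    # MORE than the threshold; a gap equal to the threshold is tolerated.
    return [
        index
        for index, count in enumerate(recv_counts)
        if highest - count > threshold
    ]
```

For example, with a threshold of 10, counts of 120 and 100 flag the second
interface, while counts of 110 and 100 flag nothing because the gap is not
more than the threshold.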

These are there to detect a faulty interface. The token lost timeout is
mainly to detect a failed node. When the token lost timeout expires,
corosync enters GATHER mode to find out which nodes are currently there.

fail_to_recv_const is mainly to detect a faulty node that fails to
receive a message. corosync (or the cluster) cannot wait forever in this
situation, so it enters GATHER mode if a node fails to receive a message
for fail_to_recv_const token rotations.

Thanks
hj

-- 
Peakpoint Service

Cluster Setup, Troubleshooting & Development
kerdosa at gmail.com
(303) 997-2823