[Openais] aisexec unable to get in sync

Steven Dake sdake at mvista.com
Mon Sep 27 12:49:07 PDT 2004


Mark

Could you file a bugzilla bug?

I've studied the output log for about an hour and have some ideas to add
more debug output that might provide more clues to what is happening. 
For example it would be very helpful to know the gather set and the set
of members that produced consensus.  It would also help to know if a
processor ordered (sent) messages (specifically messages 0-4) in a new
configuration.  It would also help to know the configuration ids of the
configurations.  Also if any messages were dropped in
message_handler_mcast (the comment Ignore multicasts for other
configurations followed by the TODO...)  I'll work up a patch to get
some more debug output for these items.

How reproducible is this?  What exact test were you running?

In the bug log, where it prints out the line:

EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647
starting group arut 4

I assume that line scrolled forever.

This could be the holes at the end of the configuration bug that is
still lurking out there.  But it doesn't look like it.

It almost looks as though a recovery was taking place on one processor,
while on another processor messages were being ordered in a new
configuration.  The problem occurs when both think they are the same
configuration.  This shouldn't happen, and maybe its not, but thats my
best guess for now.

Thanks
-steve

On Mon, 2004-09-27 at 10:27, Mark Haverkamp wrote:
> On Mon, 2004-09-27 at 09:22, Mark Haverkamp wrote:
> > Steve,
> > 
> > On Friday afternoon I started a test on my four nodes publishing
> > events.  After they ran for a while, I checked top and saw that they
> > were using about 8meg of memory and running OK.  When I got in this
> > morning I saw this sort of thing on all nodes.  The programs were still
> > using around 8 meg so I don't think that there was some sort of
> > out-of-memory problem.  
> > 
> > EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647 starting group arut 4
> > EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647 starting group arut 4
> > EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647 starting group arut 4
> > EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647 starting group arut 4
> > 
> > I checked the saved console logs and saw that this happened on all four
> > nodes about the same time (4AM Saturday).  Any ideas on a good way to
> > debug this?
> > 
> > Thanks,
> > Mark.
> 
> Actually, the four nodes had some differences in what they printed out. 
> Here is output from each.
> 
> cl008:
> L(3): Token being retransmitted.
> L(1): Received message has invalid digest... ignoring.
> L(3): Token being retransmitted.
> L(1): Received message has invalid digest... ignoring.
> L(4): Got attempt join from 192.168.1.18
> L(4): entering GATHER state.
> L(4): SENDING attempt join because this node is ring rep.
> L(3): Token loss in GATHER or COMMIT.
> L(4): Got attempt join from 192.168.1.19
> L(4): CONSENSUS reached!
> Got membership form token
> conf_desc_list 1
> highest seq 0 51420627
> setting barrier seq to 51420627
> setting barrier seq to 51420628
> Got membership form token
> FORM CONF ENTRIES 1
> EVS STATE group arut 51420627 gmi arut 51420627 highest 51420627 barrier 51420628 starting group arut 51420627
> EVS STATE group arut 51420628 gmi arut 51420628 highest 51420627 barrier 51420628 starting group arut 51420628
> L(4): EVS recovery of messages complete, transitioning to operational.
> CONFCHG ENTRIES 4
> calling recovery
> L(4): CLM CONFIGURATION CHANGE
> L(4): New Configuration:
> L(4):   192.168.1.8
> L(4):   192.168.1.18
> L(4):   192.168.1.19
> L(4): Members Left:
> L(4):   192.168.1.17
> L(4): Members Joined:
> L(4): CLM CONFIGURATION CHANGE
> L(4): New Configuration:
> L(4):   192.168.1.8
> L(4):   192.168.1.18
> L(4):   192.168.1.19
> L(4): Members Left:
> L(4): Members Joined:
> L(4): All services unplugged, unplugging processor
> L(4): All processors unplugged, allowing messages to be transmitted.
> L(4): Got attempt join from 192.168.1.17
> L(4): entering GATHER state.
> L(4): SENDING attempt join because this node is ring rep.
> L(4): CONSENSUS reached!
> L(4): swallowing ORF token 16310570.
> Got membership form token
> conf_desc_list 2
> highest seq 0 51420646
> setting barrier seq to 51420646
> highest seq 1 51420646
> setting barrier seq to 51420647
> Got membership form token
> FORM CONF ENTRIES 2
> EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647 starting group arut 4
> 
> cl017:
> already received token 65190900 65190900
> L(4): Got attempt join from 192.168.1.18
> L(4): rep is 192.168.1.8, not handling attempt join.
> L(4): Got attempt join from 192.168.1.19
> L(4): rep is 192.168.1.8, not handling attempt join.
> L(4): Got attempt join from 192.168.1.8
> L(4): rep is 192.168.1.8, not handling attempt join.
> already received token 65239212 65239212
> L(3): Token being retransmitted.
> L(3): Token loss in OPERATIONAL.
> L(4): entering GATHER state.
> L(4): SENDING attempt join because this node is ring rep.
> L(4): Got attempt join from 192.168.1.8
> L(4): CONSENSUS reached!
> Got membership form token
> Got membership form token
> conf_desc_list 2
> highest seq 0 51420646
> setting barrier seq to 51420646
> highest seq 1 51420646
> setting barrier seq to 51420647
> EVS STATE group arut 4 gmi arut 51420646 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 51420647 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 51420647 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 51420647 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 51420647 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 51420647 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 51420647 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 51420647 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 51420647 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 51420647 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 51420647 highest 51420646 barrier 51420647 starting group arut 4
> 
> cl018:
> L(3): Token being retransmitted.
> L(3): Token being retransmitted.
> L(3): Token loss in OPERATIONAL.
> L(4): entering GATHER state.
> L(4): SENDING attempt join because this node is ring rep.
> L(4): Got attempt join from 192.168.1.19
> L(4): Got attempt join from 192.168.1.8
> L(4): CONSENSUS reached!
> Got membership form token
> Got membership form token
> conf_desc_list 1
> highest seq 0 51420627
> setting barrier seq to 51420627
> setting barrier seq to 51420628
> EVS STATE group arut 51420627 gmi arut 51420627 highest 51420627 barrier 51420628 starting group arut 51420627
> EVS STATE group arut 51420628 gmi arut 51420628 highest 51420627 barrier 51420628 starting group arut 51420628
> L(4): EVS recovery of messages complete, transitioning to operational.
> CONFCHG ENTRIES 4
> calling recovery
> L(4): CLM CONFIGURATION CHANGE
> L(4): New Configuration:
> L(4):   192.168.1.8
> L(4):   192.168.1.18
> L(4):   192.168.1.19
> L(4): Members Left:
> L(4):   192.168.1.17
> L(4): Members Joined:
> L(4): CLM CONFIGURATION CHANGE
> L(4): New Configuration:
> L(4):   192.168.1.8
> L(4):   192.168.1.18
> L(4):   192.168.1.19
> L(4): Members Left:
> L(4): Members Joined:
> L(4): All services unplugged, unplugging processor
> already received token 15 65239213
> L(4): All processors unplugged, allowing messages to be transmitted.
> L(1): Received message has invalid digest... ignoring.
> L(4): Got attempt join from 192.168.1.17
> L(4): rep is 192.168.1.8, not handling attempt join.
> L(4): Got attempt join from 192.168.1.8
> L(4): rep is 192.168.1.8, not handling attempt join.
> Got membership form token
> Got membership form token
> conf_desc_list 2
> highest seq 0 51420646
> setting barrier seq to 51420646
> highest seq 1 51420646
> setting barrier seq to 51420647
> EVS STATE group arut 51420646 gmi arut 4 highest 51420646 barrier 51420647 starting group arut 51420646
> EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647 starting group arut 4
> 
> 
> cl019:
> already received token 65190898 65190898
> L(3): Token being retransmitted.
> already received token 65239210 65239210
> L(3): Token being retransmitted.
> L(4): Got attempt join from 192.168.1.18
> L(4): rep is 192.168.1.8, not handling attempt join.
> L(3): Token loss in OPERATIONAL.
> L(4): entering GATHER state.
> L(4): SENDING attempt join because this node is ring rep.
> L(4): Got attempt join from 192.168.1.8
> L(4): CONSENSUS reached!
> Got membership form token
> Got membership form token
> conf_desc_list 1
> highest seq 0 51420627
> setting barrier seq to 51420627
> setting barrier seq to 51420628
> EVS STATE group arut 51420627 gmi arut 51420627 highest 51420627 barrier 51420628 starting group arut 51420627
> EVS STATE group arut 51420628 gmi arut 51420628 highest 51420627 barrier 51420628 starting group arut 51420628
> L(4): EVS recovery of messages complete, transitioning to operational.
> CONFCHG ENTRIES 4
> calling recovery
> L(4): CLM CONFIGURATION CHANGE
> L(4): New Configuration:
> L(4):   192.168.1.8
> L(4):   192.168.1.18
> L(4):   192.168.1.19
> L(4): Members Left:
> L(4):   192.168.1.17
> L(4): Members Joined:
> L(4): CLM CONFIGURATION CHANGE
> L(4): New Configuration:
> L(4):   192.168.1.8
> L(4):   192.168.1.18
> L(4):   192.168.1.19
> L(4): Members Left:
> L(4): Members Joined:
> L(4): All services unplugged, unplugging processor
> L(4): All processors unplugged, allowing messages to be transmitted.
> L(4): Got attempt join from 192.168.1.17
> L(4): rep is 192.168.1.8, not handling attempt join.
> L(4): Got attempt join from 192.168.1.8
> L(4): rep is 192.168.1.8, not handling attempt join.
> Got membership form token
> Got membership form token
> conf_desc_list 2
> highest seq 0 51420646
> setting barrier seq to 51420646
> highest seq 1 51420646
> setting barrier seq to 51420647
> EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647 starting group arut 4
> EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647 starting group arut 4




More information about the Openais mailing list