[Openais] aisexec unable to get in sync

Steven Dake sdake at mvista.com
Mon Sep 27 13:16:16 PDT 2004


On Mon, 2004-09-27 at 13:02, Mark Haverkamp wrote:
> On Mon, 2004-09-27 at 12:49, Steven Dake wrote:
> > Mark
> > 
> > Could you file a bugzilla bug?
> > 
> 
> I'll do that.
> 
> 
> > I've studied the output log for about an hour and have some ideas to add
> > more debug output that might provide more clues to what is happening. 
> > For example it would be very helpful to know the gather set and the set
> > of members that produced consensus.  It would also help to know if a
> > processor ordered (sent) messages (specifically messages 0-4) in a new
> > configuration.  It would also help to know the configuration ids of the
> > configurations.  It would also help to know whether any messages were
> > dropped in message_handler_mcast (the comment "Ignore multicasts for
> > other configurations" followed by the TODO...).  I'll work up a patch
> > to get some more debug output for these items.
> > 
> > How reproducible is this?  What exact test were you running?
> 
> I have seen it a few times.  This time it probably ran 18-20 hours
> before doing this.  I was running my publish and subscribe programs on
> each of the four nodes.
> 

The other reports you made were about holes at the end of the configuration.

> subscribe -q -q on each node and
> publish -t10 -x1000 -w10
> publish -t3 -x10000 -w3
> publish -t2 -x10000 -w 4
> publish -t1 -x10000 -w2
> 
> one publish per node.
> 
> 
> One peculiar thing was this:
> 
> L(3): Token being retransmitted.
> L(1): Received message has invalid digest... ignoring.
> L(3): Token being retransmitted.
> L(1): Received message has invalid digest... ignoring.
> 
> Does this mean anything?
> 

Each message is digested with SHA1/HMAC.  The key for HMAC is built from
a private key (keygen output) and a random number generated with prng.
The random number is then stored in the message header.  This operation
is repeated on the receiver side in reverse order.  If the digest
doesn't match, the message is rejected.
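
For illustration only, here is a minimal Python sketch of the scheme as
described above.  The names (PRIVATE_KEY, seal, open_msg), the salt size,
and the key-derivation step are assumptions, not the actual aisexec
implementation:

```python
import hashlib
import hmac
import os

# Placeholder for the shared private key (the keygen output).
PRIVATE_KEY = b"output-of-keygen"

def seal(payload: bytes) -> bytes:
    """Sender side: derive a per-message HMAC key and prepend salt + digest."""
    salt = os.urandom(16)                             # PRNG random number
    key = hashlib.sha1(PRIVATE_KEY + salt).digest()   # key built from private key + salt
    digest = hmac.new(key, payload, hashlib.sha1).digest()
    return salt + digest + payload                    # salt travels in the message header

def open_msg(packet: bytes) -> bytes:
    """Receiver side: repeat the key derivation and verify the digest."""
    salt, digest, payload = packet[:16], packet[16:36], packet[36:]
    key = hashlib.sha1(PRIVATE_KEY + salt).digest()
    expected = hmac.new(key, payload, hashlib.sha1).digest()
    if not hmac.compare_digest(digest, expected):
        raise ValueError("Received message has invalid digest... ignoring.")
    return payload
```

The point of the sketch is the structure: the receiver can only rebuild
the HMAC key because the salt is carried in the header, so corruption of
either the salt, the digest, or the payload makes verification fail.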

I'm not sure how this could happen unless there was a corruption of the
UDP packet data that still passed checksumming in Linux.  Another
possibility is that the PRNG/SHA1/HMAC algorithms are broken in some
way, although they all pass the test vectors.  We could hash the full
message with MD5 using a static key.  This would tell us if the UDP
packet data was corrupted but still passed the checksum, by eliminating
the extra algorithms in use.  It might also help to have a copy of the
messages that failed to digest properly so we could look at the message
header (after decryption) and see if it is valid.
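
A rough Python sketch of that debug idea (STATIC_KEY and the helper
names are made up for illustration; this is not the proposed patch):

```python
import hashlib

# Hypothetical fixed key: no PRNG and no HMAC involved, so a mismatch on
# this tag can only come from corrupted packet data.
STATIC_KEY = b"debug-static-key"

def md5_tag(message: bytes) -> bytes:
    """Digest the full message with MD5 and a static key."""
    return hashlib.md5(STATIC_KEY + message).digest()

def classify(message: bytes, received_md5: bytes, hmac_ok: bool) -> str:
    """Separate 'packet corrupted in transit' from 'digest path broken'."""
    md5_ok = md5_tag(message) == received_md5
    if hmac_ok:
        return "message accepted"
    if md5_ok:
        return "payload intact: suspect PRNG/SHA1/HMAC path"
    return "payload corrupted in transit (passed UDP checksum anyway)"
```

If the static-key MD5 still matches when the HMAC digest fails, the
packet data arrived intact and the fault is in the PRNG/SHA1/HMAC path;
if both fail, the data itself was corrupted on the wire.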

I'll work this up in a separate debug patch.  We can apply them both and
try our respective runs on each end and see what we get.

Regards
-steve

> 
> > 
> > In the bug log, where it prints out the line:
> > 
> > EVS STATE group arut 4 gmi arut 4 highest 51420646 barrier 51420647
> > starting group arut 4
> > 
> > I assume that line scrolled forever.
> 
> It did.
> 
> > 
> > This could be the holes at the end of the configuration bug that is
> > still lurking out there.  But it doesn't look like it.
> > 
> > It almost looks as though a recovery was taking place on one processor,
> > while on another processor messages were being ordered in a new
> > configuration.  The problem occurs when both think they are in the same
> > configuration.  This shouldn't happen, and maybe it's not, but that's my
> > best guess for now.
> > 



