[Openais] Re: recent segfault

Steven Dake sdake at mvista.com
Tue Feb 1 17:05:59 PST 2005


On Tue, 2005-02-01 at 17:28, Daniel McNeil wrote:
> On Tue, 2005-02-01 at 13:51, Steven Dake wrote:
> > On Tue, 2005-02-01 at 14:27, Mark Haverkamp wrote:
> > > On Tue, 2005-02-01 at 14:19 -0700, Steven Dake wrote:
> > > > On Tue, 2005-02-01 at 14:03, Mark Haverkamp wrote:
> > > > > On Tue, 2005-02-01 at 13:48 -0700, Steven Dake wrote:
> > > > > > I was thinking another possibility is that after a processor joins a
> > > > > > configuration, it takes the end of previous fragment from another
> > > > > > processor into its assembly area.  Instead it should start on the next
> > > > > > fragment start and discard any previous fragmented data from new
> > > > > > processors.
> > > > > 
> > > > > I think that I see.  What you are saying is that a partial message was
> > > > > sent before the processor joined and once it joined it received the last
> > > > > piece.  
> > > > > > 
> > > > > > I think what we need is some kind of value in each message (short int)
> > > > > > which specifies the index in msg_lens[x] where the first fragment starts
> > > > > > for this packet, or 0xffff if this fragment contains no starting
> > > > > > fragment.
> > > > > 
> > > > > Maybe, along with the fragmented bit (last message is fragment) add a
> > > > > continuation bit (first part of buffer is continuation of a previous
> > > > > message).  The receiving processor would throw away continuations if its
> > > > > assembly area didn't already have something in it.
> > > > > 
> > > > This is good.  I want to be sure we can handle large MTUs for messages. 
> > > > This means we need a range of about 0-3000 to specify the start index (2
> > > > bytes, plus 1 byte per message with MTU of 9000).  I'll start working on
> > > > a patch integrating the fragment bit and continuation bit into the start
> > > > index to compact some space.
> > > 
> > > I'm not following the need for extra bytes. Wouldn't we only need a
> > > single bit in the mcast structure like the fragmented bit?  The only
> > > message in the incoming buffer that can be a continuation is the first
> > > one.  If the assembly index is zero and the continuation bit is set on
> > > the incoming message, we just throw away the first message in the
> > > incoming buffer and the next one (if any) is the start of a new one.
> > > 
> > 
> > Good idea, Mark.  The patch should be pretty easy to develop.  I'm
> > looking at the sort queue in use bug now.  If you want to work up a
> > patch for the continuation bit idea that would be cool.
> > 
> > It looks like if a message is lost in recovery,
> > memb_state_operational_enter may sometimes be called in certain
> > conditions after about 1-2 hours of running with RANDOM_DROP enabled. 
> > This would definitely result in a crash because there would be missing
> > messages in the message stream, which a) doesn't follow vs semantics and b)
> > would break the assembler.
> > 
> 
> 
> Steve,
> 
> The handling of packed and fragmented messages makes me think of a
> potential problem:
> 
> If a config change happens in the middle of a large message that has
> been fragmented, I'm wondering if the ordering of messages might
> be messed up:
> 
> Starting with a 2 node cluster (A and B)
> 
> A sends out A1
> 
> B sends out B1frag1
> 
> C joins cluster and sends out C1
> 
> A sends out A2
> 
> B sends out B1frag2 and B1frag3
> 
> I think the above describes what you and Mark are talking about
> where C can see the B1frag2 and B1frag3 and not know how to process
> it.  Am I understanding this right?
> 
> Now the problem: what is the actual message delivery order:
> 
> A sees A1,C1,A2,B1
> B sees A1,C1,A2,B1
> C sees C1,A2 (with Mark's fix to drop partial fragments).
> 
> So I see 2 problems with this:
> 
> 1. B1 was started in the old config (A,B) but delivered in the new
>     config (A,B,C)
> 
> 2. C does not see B1 at all, since he only received partial fragments.
> 
> Am I misunderstanding the way it works?  If B does not deliver the
> entire message B1 before C joins, then we can get the above problems.
> Does the protocol give the surviving nodes a chance to send out their
> last message in its entirety before allowing a new node to join?
> 

You're absolutely right; the current code violates extended virtual
synchrony in this case.
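
To make the case concrete, here is a rough sketch of the drop rule Mark
described, run against your B1 example.  The struct layout and the names
(packet, assembly, assemble) are made up for illustration and are not the
real totempg structures; it assumes one assembly area per sending
processor and that only the last message in a packet can be fragmented.

```c
#include <assert.h>
#include <string.h>

#define MAX_MSGS 4

/* Hypothetical packet layout, not the real totempg mcast structure. */
struct packet {
	int continuation;	/* first message continues an earlier packet */
	int fragmented;		/* last message is finished in a later packet */
	int msg_count;
	const char *msgs[MAX_MSGS];
};

/* One assembly area per sending processor (assumption). */
struct assembly {
	char data[256];
	size_t index;		/* bytes assembled so far, 0 when empty */
};

/*
 * Feed one packet to the assembler.  Returns the number of complete
 * messages delivered; the most recent one is left in a->data.
 */
static int assemble(struct assembly *a, const struct packet *p)
{
	int delivered = 0;
	int i;

	assert(p->msg_count <= MAX_MSGS);
	for (i = 0; i < p->msg_count; i++) {
		int is_cont = (i == 0 && p->continuation);
		int is_frag = (i == p->msg_count - 1 && p->fragmented);
		size_t len = strlen(p->msgs[i]);

		/* Tail of a message whose start we never saw: drop it. */
		if (is_cont && a->index == 0)
			continue;

		assert(a->index + len < sizeof(a->data));
		memcpy(a->data + a->index, p->msgs[i], len);
		a->index += len;

		if (!is_frag) {
			/* Message complete; a real assembler would deliver it here. */
			a->data[a->index] = '\0';
			delivered++;
			a->index = 0;
		}
	}
	return delivered;
}
```

Feeding B1frag1, B1frag2 and B1frag3 to A's assembly area delivers B1
once.  Feeding only B1frag2 and B1frag3 to C's empty assembly area
delivers nothing, which is exactly your problem 2.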

I believe that any message that is in a pending queue should be flushed
into the algorithm before the configuration change is delivered. 
Unfortunately totemsrp has no easy mechanism for adding such
functionality.  Adding some new state to the MEMB_STATE_RECOVERY state
of the state machine might do it.  I.e., at the point where
memb_state_operational would be entered, it may be safe to begin
transmitting messages in the pending queue.  Handling recovery from a
failure in the MEMB_STATE_RECOVERY state would take some thinking though.

Yet another option is to add some code to totempg that adds a message
(flush barrier) to the pending queue.  Then totempg would hold off
delivery of the configuration change until all of these flush barriers
are delivered to the processor.  Once the flush barriers had been
received by all processors in the configuration, this would indicate
that all messages in the pending queues had been flushed and it would be
safe to issue a configuration change to the application.  This approach
is simple and provides vs semantics until you consider failures between
the point that the configuration change is delivered to totempg and the
point that the flush barriers have been delivered.
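
The bookkeeping for the flush barrier idea might look roughly like the
sketch below.  None of these names (flush_state, flush_begin,
flush_barrier_deliver) exist in totempg; they are invented here, and the
set of processors to wait on is simply passed in by the caller, since
which configuration that should be is still an open question.  It does
not attempt to handle a failure while the change is being held.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PROCS 8

/* Hypothetical state; none of these names exist in totempg. */
struct flush_state {
	int member[MAX_PROCS];		/* 1 if we wait on this processor */
	int barrier_seen[MAX_PROCS];	/* 1 once its flush barrier arrived */
	int config_pending;		/* a configuration change is held back */
};

/*
 * Called when a configuration change arrives from below.  The caller
 * would also queue its own flush barrier behind whatever is already in
 * its pending queue.
 */
static void flush_begin(struct flush_state *fs, const int *procs, size_t count)
{
	size_t i;

	memset(fs, 0, sizeof(*fs));
	for (i = 0; i < count; i++) {
		assert(procs[i] >= 0 && procs[i] < MAX_PROCS);
		fs->member[procs[i]] = 1;
	}
	fs->config_pending = 1;
}

/* Returns 1 if every processor we wait on has delivered its barrier. */
static int flush_complete(const struct flush_state *fs)
{
	int i;

	for (i = 0; i < MAX_PROCS; i++) {
		if (fs->member[i] && !fs->barrier_seen[i])
			return 0;
	}
	return 1;
}

/*
 * Called when a flush barrier from proc is delivered.  Returns 1 exactly
 * once, when the held configuration change may go to the application.
 */
static int flush_barrier_deliver(struct flush_state *fs, int proc)
{
	assert(proc >= 0 && proc < MAX_PROCS);
	if (!fs->config_pending)
		return 0;
	fs->barrier_seen[proc] = 1;
	if (!flush_complete(fs))
		return 0;
	fs->config_pending = 0;
	return 1;
}
```

If a processor fails after flush_begin but before its barrier arrives,
this sketch waits forever, which is the failure window described above.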

Unfortunately handling the failure conditions during these operations is
difficult.  In the short term I don't want to take on the redesign of the
protocol to handle this rare case.  Although if I had a patch that
worked I'd commit it :)

I suspect nobody who has implemented totem has properly solved this
issue in the fragmented message case.  At least I have found no code or
papers describing it.

> Thanks,
> 
> Daniel
> 



