[Openais] Re: segfaults and asserts

Mark Haverkamp markh at osdl.org
Thu Feb 24 07:43:25 PST 2005


On Wed, 2005-02-23 at 17:16 -0700, Steven Dake wrote:
> On Wed, 2005-02-23 at 14:00, Mark Haverkamp wrote:
> > On Wed, 2005-02-23 at 13:03 -0700, Steven Dake wrote:
> > > The cl019 assert path is a new one I think unreported.  If you still
> > > have the windows open can you print out the sq data?  Also could you go
> > > up to update_aru and print out my_aru, i, and my_high_seq.  Must be some
> > > case I have missed.
> > > 
> > 
> > $1 = (struct sq *) 0x80ba430
> > (gdb) p *sq
> > $2 = {head = 38, size = 2000, items = 0x80eb7d8,
> >   items_inuse = 0x8100fa0 '\001' <repeats 200 times>..., size_per_item = 44,
> >   head_seqid = 38, item_count = 2000}
> > (gdb)
> > 
> > 
> > 
> > (gdb) p my_aru
> > $5 = 2037
> > (gdb) p i
> > $6 = 2038
> > (gdb) p my_high_seq_delivered
> > $7 = 2037
> > (gdb) p my_high_seq_received
> > $8 = 2038
> > (gdb) p my_high_seq_received_save
> > $9 = 0
> > (gdb)
> > 
> > 
> 
> There seems to be some strange correlation between i (packet 2038), the
> item_count for the sort queue (2000 entries) and the head (position 38).
> 
> It would be interesting to know how head got reset to zero when my_aru
> is 2037.  Can you print the memb_state variable?  Did a configuration
> change occur around the time this crash occurred? (i.e. token lost, or any
> sort of membership messages received).
(gdb) p memb_state
$2 = MEMB_STATE_OPERATIONAL

A configuration change had just occurred and, as the memb_state variable
says, we were in operational state.  If you check the logs, it looked
like the nodes had gone through a number of continuous config changes
just before the segfaults.  For instance:

[...]
Feb 22 18:25:36 [NOTICE  ] [CLM  ] CLM CONFIGURATION CHANGE
Feb 22 18:25:36 [NOTICE  ] [CLM  ] New Configuration:
Feb 22 18:25:36 [NOTICE  ] [CLM  ]      192.168.1.8
Feb 22 18:25:36 [NOTICE  ] [CLM  ]      192.168.1.17
Feb 22 18:25:36 [NOTICE  ] [CLM  ]      192.168.1.18
Feb 22 18:25:36 [NOTICE  ] [CLM  ]      192.168.1.19
Feb 22 18:25:36 [NOTICE  ] [CLM  ] Members Left:
Feb 22 18:25:36 [NOTICE  ] [CLM  ] Members Joined:
Feb 22 18:25:36 [NOTICE  ] [CLM  ] CLM CONFIGURATION CHANGE
Feb 22 18:25:36 [NOTICE  ] [CLM  ] New Configuration:
Feb 22 18:25:36 [NOTICE  ] [CLM  ]      192.168.1.8
Feb 22 18:25:36 [NOTICE  ] [CLM  ]      192.168.1.17
Feb 22 18:25:36 [NOTICE  ] [CLM  ]      192.168.1.18
Feb 22 18:25:36 [NOTICE  ] [CLM  ]      192.168.1.19
Feb 22 18:25:36 [NOTICE  ] [CLM  ] Members Left:
Feb 22 18:25:36 [NOTICE  ] [CLM  ] Members Joined:
Feb 22 18:25:36 [NOTICE  ] [GMI  ] entering OPERATIONAL state.
Feb 22 18:25:37 [NOTICE  ] [GMI  ] Creating commit token because I am the rep.
Feb 22 18:25:37 [NOTICE  ] [GMI  ] Storing new sequence id for ring 8544
Feb 22 18:25:37 [NOTICE  ] [GMI  ] entering COMMIT state.
Feb 22 18:25:37 [NOTICE  ] [GMI  ] entering GATHER state.
Feb 22 18:25:37 [NOTICE  ] [GMI  ] Creating commit token because I am the rep.
Feb 22 18:25:37 [NOTICE  ] [GMI  ] Storing new sequence id for ring 8548
Feb 22 18:25:37 [NOTICE  ] [GMI  ] entering COMMIT state.
[...]
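
For reference, the correlation noted above between i (2038), item_count
(2000), and head (38) is what a ring-buffer index wrap would produce. A
minimal sketch in C, assuming the sort queue maps a sequence id to a slot as
(head + seq_id - head_seqid) % size. That formula, the struct sq_view type,
and both helper names are hypothetical and only for illustration; they are
not taken from the openais source. The field values come from the gdb dump
of *sq earlier in this thread.

```c
#include <assert.h>
#include <stdio.h>

/* Subset of the fields shown in the gdb dump of *sq (hypothetical type). */
struct sq_view {
    int head;        /* slot index of the oldest entry */
    int size;        /* number of slots in the queue */
    int head_seqid;  /* sequence id stored at the head slot */
};

/*
 * Assumed ring-buffer mapping from a sequence id to a slot index.
 * Only meaningful for seq_id >= head_seqid.
 */
static int slot_for_seq(const struct sq_view *sq, int seq_id)
{
    assert(sq->size > 0);
    return (sq->head + seq_id - sq->head_seqid) % sq->size;
}

/* Nonzero when seq_id lies in the window [head_seqid, head_seqid + size). */
static int seq_in_window(const struct sq_view *sq, int seq_id)
{
    return seq_id >= sq->head_seqid && seq_id - sq->head_seqid < sq->size;
}

/* Print the slot and window check for one sequence id. */
static void report(const struct sq_view *sq, int seq_id)
{
    printf("seq %d -> slot %d, in window: %d\n",
           seq_id, slot_for_seq(sq, seq_id), seq_in_window(sq, seq_id));
}
```

With head = 38, size = 2000, and head_seqid = 38, calling report() for
seq 2038 gives slot 38 with the window check failing: 2038 is exactly size
entries past head_seqid, so under this assumed mapping it wraps back onto
the head slot. Seq 2037 (my_aru) maps to slot 37 and is still inside the
window.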


Mark.



> 
> Thanks
> -steve
> 
> > > I think we have seen the other segfault but I'm not sure.  I don't think
> > > I've seen source_addr set to the address 0x8 before.  Were you able to
> > > debug the segfault?  need to know assembly->index and datasize, and
> > > iovec[0].iov_len (arguments to the memcpy).  Might be interesting to see
> > > all the iovec metadata if it has a iovlen of more than 1.
> > 
> > I think that the stack is pretty much trashed by the time the segfault
> > happens.  None of the memory addresses that I tried to look at were
> > valid.
> > For instance:
> > 
> > (gdb) p iovec
> > Cannot access memory at address 0xc
> > (gdb) p datasize
> > Cannot access memory at address 0xfffff9f0
> > (gdb) p header
> > Cannot access memory at address 0xfffffa08
> > (gdb)
> > (gdb) p source_addr
> > Cannot access memory at address 0x8
> > (gdb) p iov_len
> > Cannot access memory at address 0x10
> > (gdb) p endian_conversion_required
> > Cannot access memory at address 0x14
> > 
> > 
> > > 
> > > Thanks
> > > -steve
> > 
> > 
> > 
> > > > 
-- 
Mark Haverkamp <markh at osdl.org>



