[Openais] Checkpoint crash in aisexec

Steven Dake sdake at mvista.com
Fri Feb 11 14:34:26 PST 2005


Kristen,

The only way I can think of that conn_info could be zero (as shown by
gdb) is if the stack were corrupted in some fashion.  Otherwise, the
crash would have happened earlier, in poll_handler_libais_deliver (which
makes extensive use of the conn_info structure).
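
To illustrate that reasoning, here is a minimal sketch.  It is not the
actual openais code: the structure fields, the function names
(deliver, handler_checkpointclose) and the return codes are all
hypothetical stand-ins.  The point is only that the dispatcher touches
conn_info before it calls the handler, so a null pointer that first
shows up inside the handler suggests something changed it in between.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the real conn_info structure. */
struct conn_info {
    int fd;
    int state;
};

/*
 * Hypothetical handler. It reports a null conn_info instead of
 * dereferencing it, so the problem is logged rather than crashing.
 */
static int handler_checkpointclose(struct conn_info *conn_info,
                                   const void *message)
{
    (void)message;
    if (conn_info == NULL) {
        fprintf(stderr, "handler: conn_info is NULL, possible stack corruption\n");
        return -1;
    }
    return conn_info->state;
}

/*
 * Hypothetical dispatcher. It reads conn_info->fd before calling the
 * handler, so a null pointer passed in from outside is rejected here
 * (code without this check would fault here) and never reaches the handler.
 */
static int deliver(struct conn_info *conn_info, const void *message)
{
    if (conn_info == NULL || conn_info->fd < 0) {
        return -2;
    }
    return handler_checkpointclose(conn_info, message);
}
```

With a guard like this in the handler, a corrupted pointer would show
up as a log message rather than a core dump, which may help narrow
down when the corruption happens.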

One thing that doesn't make sense to me: the trace points to
ckpt.c:1552, but in current bitkeeper that line is in
message_handler_req_lib_ckpt_sectionexpirationtimeset, not
message_handler_req_lib_ckpt_checkpointclose.  Are you using an older
version?

One way to find these sorts of problems is to run the code under
valgrind.  This tool really rocks and can find many of these sorts of
bugs.

A bunch of bugs have been fixed in the rework associated with
defect-188.  I know it's been a long time coming, but we are closing in
on a complete patch for this defect.  It is possible that one of these
fixes also addresses the buffer overwrite.  There were a lot of problems
in various scenarios relating to buffer management of IPC.

You mentioned it is not easily reproducible.  Could you comment on how
often it reproduces?  For example, 1 out of 20 times?

Thanks
-steve

On Fri, 2005-02-11 at 14:55, Kristen Smith wrote:
> Steve,
> 
> We are periodically seeing aisexec crash with the following trace:
> 
>         (gdb) bt
>         #0  message_handler_req_lib_ckpt_checkpointclose
>             (conn_info=0x0, message=0xb73fc008) at ckpt.c:1552
>         #1  0x080494c2 in poll_handler_libais_deliver (handle=0, fd=3,
>             revent=134633824, data=0x89c2ad8, prio=0x89b2784) at main.c:578
>         #2  0x08056e62 in poll_run (handle=0) at aispoll.c:386
>         #3  0x080499ac in main (argc=1, argv=0xbfffcb64) at main.c:1003
> 
> We have looked through the code but can't seem to figure out how
> conn_info is getting set to 0. Do you have any idea under what
> circumstances conn_info could be null when this function is called?
> 
> This is happening when we have multiple nodes up and we kill one of
> the active nodes. The standby node (which was reading checkpoints)
> must now become a writer, so it closes the checkpoint and this
> happens. Unfortunately, I can't reproduce this consistently - I
> finally got a core dump today. I don't recall ever seeing this with
> the old code.
> 
> Thanks,
> Kristen
> 
> 
> 



