[PATCH] cr_tests: Fix hang when robust futex lists are not restored during restart

Matt Helsley matthltc at us.ibm.com
Fri Jul 10 16:34:57 PDT 2009


On Thu, Jul 09, 2009 at 05:21:44PM -0700, Sukadev Bhattiprolu wrote:
> Serge E. Hallyn [serue at us.ibm.com] wrote:
> | Quoting Matt Helsley (matthltc at us.ibm.com):
> | > The robust futex test can hang if the kernel fails to properly set the robust
> | > list pointer. This currently happens during restart. The test should not
> | > hang and instead should report failure.
> | > 
> | > Use a timeout to ensure that hangs are caught and reported as failure.
> | 
> | Doesn't seem to work though :)  The test still hangs on restart.
> 
> I got a hang on restart, with following backtrace (ckpt-v17-rc1 plus couple
> of bug fixes)

Sorry, which fixes?

Perhaps this is the same problem that Serge was seeing.

> 
> mktree        S f6a4bbe0     0 25126  25124 0x00000000
>  f6589b00 00000086 00000001 f6a4bbe0 f6a4bd74 c3190160 f5e17e1c 011a6d85
>  00000000 c302f680 ffffffea 007ee140 f5e17e1c 00000000 00000001 00000000
>  c15fdbfc f5e17e00 f5e17e00 00000000 c1041af6 00000000 f5e17e00 00000000
> Call Trace:
>  [<c1041af6>] ? futex_wait_queue_me+0x94/0xa5
>  [<c1041bfd>] ? futex_wait+0xf6/0x1e9
>  [<c106300b>] ? generic_file_buffered_write+0x169/0x257
>  [<c1042dd7>] ? do_futex+0x93/0xa01
>  [<c101d867>] ? enqueue_entity+0xe/0x7e
>  [<c1081787>] ? cache_alloc_refill+0x54/0x43e
>  [<c106274a>] ? find_get_page+0x1d/0x7a
>  [<c1064407>] ? filemap_fault+0xbb/0x320
>  [<c107296c>] ? __do_fault+0x319/0x352
>  [<c1037c5c>] ? autoremove_wake_function+0x0/0x2d
>  [<c1073f6e>] ? handle_mm_fault+0x24e/0x508

This is what the stack should look like when a task waiting on a futex is
being woken. The fault just means that the page backing the futex was paged
out between checkpoint and restart. In theory that's not a problem for the
futex code -- it's designed to handle faults. Of course, a fault like that
should not cause a stack dump either.

>  [<c1043846>] ? sys_futex+0x101/0x116
>  [<c1351f46>] ? do_page_fault+0x1ff/0x27b
>  [<c10027e8>] ? sysenter_do_call+0x12/0x26
> mktree        S f642b750     0 25127  25124 0x00000000
>  f6589b00 00000086 c15fcd3c f642b750 f642b8e4 c3170160 c1041e2f 011a6d7f
>  ffffffff f6589b00 000005da 00000000 00000001 00000000 00000000 00000000
>  f6500000 00000008 f66d5e7c f66d5f9c c108a797 00000000 f642b750 c1037c5c
> Call Trace:
>  [<c1041e2f>] ? futex_wake+0xb9/0xc3
>  [<c108a797>] ? pipe_wait+0x4b/0x62
>  [<c1037c5c>] ? autoremove_wake_function+0x0/0x2d
>  [<c108afdf>] ? pipe_read+0x2c0/0x32d
>  [<c1066aad>] ? get_page_from_freelist+0x284/0x2de
>  [<c1084d7e>] ? do_sync_read+0xbf/0x100
>  [<c1037c5c>] ? autoremove_wake_function+0x0/0x2d
>  [<c10798ca>] ? page_add_new_anon_rmap+0x20/0x3b
>  [<c1073ef8>] ? handle_mm_fault+0x1d8/0x508
>  [<c1139499>] ? security_file_permission+0xc/0xd
>  [<c1084cbf>] ? do_sync_read+0x0/0x100
>  [<c10853f7>] ? vfs_read+0x81/0x102
>  [<c1085787>] ? sys_read+0x3c/0x63
>  [<c10027e8>] ? sysenter_do_call+0x12/0x26

The first place to look, of course, is the futex restart blocks.
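
For reference, here is a minimal sketch of the kind of timed futex wait the
patch description relies on to turn a hang into a reported failure. It is not
the code from cr_tests; the helper name futex_wait_timeout() and its
millisecond parameter are made up for illustration. As far as I recall, a
timed FUTEX_WAIT interrupted by a signal is also the case that goes through a
futex restart block, which is why it seems worth having in front of us.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from cr_tests): block while *uaddr == expected,
 * but for at most timeout_ms milliseconds.
 *
 * Returns 0 if we were woken or the value had already changed.
 * Returns -1 with errno set otherwise; errno == ETIMEDOUT means nobody
 * woke us in time, which a test can report as a failure instead of hanging.
 */
static int futex_wait_timeout(uint32_t *uaddr, uint32_t expected,
                              long timeout_ms)
{
	struct timespec ts = {
		.tv_sec  = timeout_ms / 1000,
		.tv_nsec = (timeout_ms % 1000) * 1000000L,
	};
	long rc;

	/* For FUTEX_WAIT the timeout is relative. */
	rc = syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, &ts, NULL, 0);
	if (rc == -1 && errno == EAGAIN)
		return 0;	/* *uaddr no longer equals expected */
	return (int)rc;
}
```

A test would call this in place of an untimed wait and treat ETIMEDOUT as
"the wakeup never arrived", failing the test rather than blocking forever.
A caller that uses signals would also want to retry on EINTR.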

Thanks for the report. I'm kind of swamped with little things at the
moment so I'm going to have to put off deeper analysis of this for now.

Cheers,
	-Matt

