[Ksummit-2011-discuss] Discussion Topic proposal: Handling Failure Better

theodore.tso at gmail.com
Thu Aug 4 19:02:34 PDT 2011


On 8/4/2011 at 22:02:33, roland at kernel.org submitted the following
Discussion Topic proposal.

Title:	Handling Failure Better
Abstract:
"Making the Common Case the Only Case with Anticipatory Memory
Allocation" <http://pages.cs.wisc.edu/~swami/papers/refuse.pdf> points
out that even very mature filesystems like ext3 and xfs do not always
handle allocation failure properly and may even corrupt themselves if
an allocation fails.

Recently I've found that the block layer also doesn't always handle
device failure properly, e.g. <https://lkml.org/lkml/2011/7/8/457> (and
I've seen other cases where hot-unplugging a disk results in userspace
getting stuck in an uninterruptible sleep, or worse).

These two seem like special cases of a more general problem: our
exception paths are poorly tested and often buggy.  kmalloc almost
never fails and disks are almost never hot-unplugged while still in
use, but having these rare failures lead to filesystem corruption and
panics is not ideal.  The usual process for fixing kernel problems
breaks down here: these failures are by definition rare and often hard
to reproduce, so even when a relevant developer tries to fix them, it
is difficult to make progress.

Is there a way to exercise exception paths more, so that bugs are found
sooner, or could we adopt designs (e.g. the one in "Making the Common
Case the Only Case") that eliminate exception paths altogether?  The
existing fault injection code in the kernel seems to be used by a very
small minority; is there a way we could make it easier to enable, so
that developers and testers could run with it without too much
disruption?  lockdep and other automatic checking code have proven
invaluable for kernel quality -- could we test error paths in a
similar way?
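For reference, a sketch of how the existing fault injection can be
turned on today, using the failslab knobs described in
Documentation/fault-injection/fault-injection.txt (this assumes a
kernel built with CONFIG_FAILSLAB=y and CONFIG_FAULT_INJECTION_DEBUG_FS=y,
which is exactly the barrier to entry being questioned above):

```shell
# Make the fault-injection knobs visible (no-op if already mounted).
mount -t debugfs none /sys/kernel/debug 2>/dev/null || true

cd /sys/kernel/debug/failslab
echo 10 > probability   # fail roughly 10% of eligible slab allocations
echo 100 > interval     # only consider every 100th allocation
echo -1 > times         # keep injecting failures indefinitely
echo 1 > verbose        # log a message for each injected failure

# Optionally confine injection to a single test process instead of
# destabilizing the whole system:
echo 1 > /proc/self/make-it-fail
```

Even this short sequence needs root, debugfs, and a specially
configured kernel, which may explain why so few people run with it.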


Suggested participants:
Submitter: roland at kernel.org

