[Ksummit-2012-discuss] [ATTEND] kernel core dump and "dying breath"
tony.luck at intel.com
Thu Jun 21 17:53:16 UTC 2012
>> For the really tricky sorts of problems like memory corruption
>> however, the addresses tend to move around so this is not likely to
>> help much. I tend to fall back to kdb, and the "kdb death script" (a
> They move? Is it that the whole memory bank is busted? Could the
> second reboot mark the DIMM as unsavory and not use it?
A lot of memory corruption is due to software bugs (accessing beyond
the end of data structures, using pointers after data has been freed
and re-allocated). These almost always move around.
Hardware errors often move too. Bit flips from stray neutrons or
alpha particles occur randomly throughout memory ... if you don't
have ECC they will bite you from time to time. Some failures do
stick in the same place ... as silicon ages and leakage from some
memory cells happens faster than your DRAM refresh cycle, you start
seeing "stuck" bits. Typically you'd want to deal with those at
a smaller than "whole DIMM" level (to begin with, your memory is
probably interleaved across DIMMs for better bandwidth; if you stop
using a DIMM, not only do you lose that capacity, the rest of the
memory in that interleave set gets slower too).
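
To illustrate the interleave point, here is a toy Python model. It is only
a sketch: the 2-way interleave at 64-byte cache-line granularity and the
4 KiB page size are assumptions, real memory controllers use more
complicated mappings, and all function names are made up for illustration.

```python
CACHE_LINE = 64    # bytes; assumed interleave granularity
PAGE_SIZE = 4096   # bytes; assumed page size
NUM_DIMMS = 2      # assumed 2-way interleave set


def dimm_for_address(addr):
    """Return the DIMM index holding this physical address (toy mapping)."""
    return (addr // CACHE_LINE) % NUM_DIMMS


def page_for_address(addr):
    """Return the page frame number containing this physical address."""
    return addr // PAGE_SIZE


def pages_touching_dimm(dimm, num_pages):
    """Return the pages that have at least one cache line on the given DIMM."""
    touched = []
    for page in range(num_pages):
        base = page * PAGE_SIZE
        lines = range(base, base + PAGE_SIZE, CACHE_LINE)
        if any(dimm_for_address(a) == dimm for a in lines):
            touched.append(page)
    return touched
```

In this model a single stuck bit lands in exactly one page, so retiring
that page costs 4 KiB. Every page, however, has cache lines on every DIMM
in the set, so dropping one DIMM touches all of them.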
It may be hard for the OS to get involved in the "don't use these
bits of memory" game though. High end servers are doing their own
monitoring, and will take their own corrective actions (e.g. using a
spare rank to take the flaky memory out of the range visible to the
OS and replace with good memory). Low-end systems may not provide
enough data to locate the problem (e.g. without ECC they don't even
know that there is a problem).
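
Where the platform does report the addresses of corrected errors, telling
"stuck" bits apart from random flips could look roughly like the sketch
below. This is not how any particular kernel or firmware does it: the
function name, the 4 KiB page size and the repeat threshold of 3 are all
assumptions chosen for illustration.

```python
from collections import Counter

PAGE_SIZE = 4096       # bytes; assumed page size
REPEAT_THRESHOLD = 3   # assumed: this many errors in one page looks "stuck"


def suspect_pages(error_addresses, threshold=REPEAT_THRESHOLD):
    """Return the sorted page frame numbers with repeated errors.

    Random bit flips should scatter across pages, while a failing cell
    keeps reporting errors from the same page.
    """
    counts = Counter(addr // PAGE_SIZE for addr in error_addresses)
    return sorted(page for page, n in counts.items() if n >= threshold)
```

For example, errors at 0x1000, 0x1008, 0x5000, 0x1010 and 0x9000 would
flag only page 1; three errors in three different pages flag nothing.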