[Ksummit-2012-discuss] [ATTEND] large system issues and failure analysis

Roland Dreier roland at kernel.org
Sun Jun 24 00:07:15 UTC 2012


On Fri, Jun 22, 2012 at 7:54 PM, Luck, Tony <tony.luck at intel.com> wrote:
> I’m also interested in mission critical systems – for which one of the
> key requirements is being able to debug the root cause of a crash in
> just one occurrence (owners of stock exchanges hate it when you say
> “well let’s run it again tomorrow and see if it crashes again”).
>
> So pstore, “dying breath”, flight recorder, etc. are all hugely interesting
> to me.

Yes, very interesting from the POV of shipping appliance systems on
x86 HW too.  I'm hoping we can do better on the platforms we have
and also push vendors to build better platforms.

> I’d also like to provide some large system perspective to balance out
> against the embedded, phone, tablet etc. people who may sometimes
> forget that we need to support hundreds of cores, terabytes of memory
> and thousands of I/O devices.

Even on our current HW (12 cores, 96 GB memory, 40-100 SSDs) we
see things not scaling as well as one might hope (mmap_sem contention,
block queue / SCSI host issues, etc.).

 - R.

