[Ksummit-2012-discuss] [ATTEND] kernel core dump and "dying breath"

Thu Jun 21 15:25:56 UTC 2012

On Thu, Jun 21, 2012 at 12:14:49PM +0100, James Bottomley wrote:
> On Thu, 2012-06-21 at 05:57 -0500, Jason Wessel wrote:
> > On 06/21/2012 05:37 AM, Konrad Rzeszutek Wilk wrote:
> > > On Thu, Jun 21, 2012 at 4:05 AM, Cong Wang <xiyou.wangcong at gmail.com> wrote:
> > >> Hi, all,
> > >>
> > >> I would like to bring the kernel "dying breath" topic
> > >> to the Kernel Summit this year. It would cover some recent
> > >> technologies like pstore, ramconsole, etc., and of course
> > >> kdump and netconsole too.
> > 
> > If we are going to make netconsole actually work reliably in all
> > contexts, kgdboe can be revived.  Currently there are places where
> > netconsole simply doesn't work, and that printk you were looking
> > for... is never going to be delivered.
> > 
> > > 
> > > Are there any future projects in the pipeline? Most of these deal with
> > > depositing "why it crashed" information somewhere, but are there any
> > > that try to avoid the cause on the next boot?
> > > 
> > 
> > Self healing, eh?  I am not aware of anything, short of using a
> > different kernel to run some scripts that look at the crash and take
> > action such as booting a new kernel, OR booting the existing kernel and
> > having it examine the previous crash information to attempt corrective
> > action.  Things like fsck take this sort of action today, but I
> > believe you are asking about something at an entirely different level.
> > 
> > The one thing that does come to mind: if you did save information
> > about the prior crash, you could probably get more information on the
> > next crash by automatically inserting a kprobe at the crash address
> > to collect extra data into the ftrace buffer, or "something" else
> > depending on the original crash.
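A minimal sketch of that idea, assuming the previous boot's oops text has been saved somewhere readable (for example via pstore) and contains an x86-style "RIP:" line annotated with symbol+offset/size. The function name kprobe_from_oops and the event name prev_crash are hypothetical; the output only follows the kprobe_events syntax p[:EVENT] [MOD:]SYM[+offs].

```python
import re
from typing import Optional

# Matches the "symbol+0xoff/0xsize [module]" part of an x86 oops RIP line,
# in both the older "RIP: 0010:[<addr>]  [<addr>] symbol+off/size" form and
# the newer "RIP: 0010:symbol+off/size" form. The module name is optional.
RIP_RE = re.compile(
    r"RIP:.*?([A-Za-z_][\w.]*)\+(0x[0-9a-fA-F]+)/0x[0-9a-fA-F]+(?:\s+\[(\w+)\])?"
)


def kprobe_from_oops(oops_text: str, event: str = "prev_crash") -> Optional[str]:
    """Build a kprobe_events line that probes the previous crash site.

    Hypothetical helper: returns None if no RIP line with a symbol is found.
    """
    for line in oops_text.splitlines():
        match = RIP_RE.search(line)
        if match is None:
            continue
        symbol, offset, module = match.groups()
        # kprobe_events syntax: p[:EVENT] [MOD:]SYM[+offs]
        target = f"{module}:{symbol}+{offset}" if module else f"{symbol}+{offset}"
        return f"p:{event} {target}"
    return None
```

The resulting line could be appended to the tracing directory's kprobe_events file and the event enabled on the next boot; the kernel may still refuse the probe if the crash site sits in a function that cannot be probed.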
> 
> So I think we're advocating something much simpler.  The basic
> problem is that a lot of kernel crashes are very hard for the average
> user to report:  for many, the best we get is a photo of the crash
> screen with most of the relevant information scrolled off the top.

I really like the idea someone posted of using QR codes: easy to
take a photo of, and easy enough to decode.
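One way the QR idea could cope with a long oops, sketched under assumptions: compress the text and split it into numbered pieces, each small enough for a single QR code, so a few photos can be reassembled in any order. Rendering each piece as an actual QR image is left to an existing QR encoder and is not shown; the chunk size, the "index/total:" header and both function names are hypothetical choices.

```python
import zlib
from typing import Dict, List

# Hypothetical payload budget per QR code. The largest QR code (version 40,
# error-correction level L) holds at most 2953 bytes in byte mode; a smaller
# budget gives codes that are easier to photograph off a screen.
CHUNK_BYTES = 1024


def oops_to_chunks(oops_text: str, chunk_bytes: int = CHUNK_BYTES) -> List[bytes]:
    """Compress an oops and split it into numbered, QR-sized payloads."""
    data = zlib.compress(oops_text.encode("utf-8"), 9)
    pieces = [data[i:i + chunk_bytes] for i in range(0, len(data), chunk_bytes)]
    total = len(pieces)
    # Each payload starts with an ASCII "index/total:" header so the photos
    # can be decoded in any order.
    return [f"{i}/{total}:".encode("ascii") + piece for i, piece in enumerate(pieces)]


def chunks_to_oops(chunks: List[bytes]) -> str:
    """Reassemble payloads (in any order) back into the original oops text."""
    if not chunks:
        raise ValueError("no QR chunks")
    pieces: Dict[int, bytes] = {}
    total = 0
    for chunk in chunks:
        # The header contains no ':', so the first ':' ends it even if the
        # compressed payload itself contains ':' bytes.
        header, _, payload = chunk.partition(b":")
        index_text, total_text = header.split(b"/")
        pieces[int(index_text)] = payload
        total = int(total_text)
    missing = [i for i in range(total) if i not in pieces]
    if missing:
        raise ValueError(f"missing QR chunks: {missing}")
    return zlib.decompress(b"".join(pieces[i] for i in range(total))).decode("utf-8")
```

Compressing first matters because oops text is repetitive, so fewer codes need to be photographed.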

> Serial or net consoles may be standard dev tools, but they're also too
> complex for most average users (although, to their credit, there's a lot
> who try).
> 
> So the question becomes: could we make gathering this data easier, so
> that it becomes a simple cut and paste from a log file (or even an
> automatic submission, depending on what permissions you gave your
> distro)?  The thought is that a lot of systems come with areas of
> NVRAM today, so we could write the dying gasp of a crashing kernel to
> this NVRAM area (which avoids most of the problems we had trying to
> write it to disk, and NVRAM isn't cleared on BIOS init) and then spit
> it out on the next boot in a form that can be easily consumed by
> automated bug reporting tools or easily cut and pasted into an email.
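On the next boot, the "spit it out" half could be as simple as the following sketch. It assumes the saved records are exposed as files in a pstore-style directory (pstore mounts at /sys/fs/pstore, with kernel log records named like dmesg-<backend>-<id>); the function name and the report layout are hypothetical.

```python
from pathlib import Path
from typing import List

# Assumed location of the pstore filesystem; adjust if mounted elsewhere.
PSTORE_DIR = Path("/sys/fs/pstore")


def collect_dying_breath(pstore_dir: Path = PSTORE_DIR) -> str:
    """Concatenate saved kernel log records into one paste-able report.

    Only files whose names start with "dmesg-" are included; other record
    types (console, ftrace, ...) are skipped. Returns "" if nothing is saved.
    """
    if not pstore_dir.is_dir():
        return ""
    records = sorted(
        p for p in pstore_dir.iterdir() if p.is_file() and p.name.startswith("dmesg-")
    )
    parts: List[str] = []
    for record in records:
        # Decode leniently: a truncated record may end mid-character.
        text = record.read_bytes().decode("utf-8", errors="replace")
        parts.append(f"===== {record.name} =====\n{text.rstrip()}\n")
    return "\n".join(parts)
```

The returned text can be pasted into an email as-is or handed to an automated bug reporting tool.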

I recall the RTC being used for this: if the kernel panics, it writes
a hash value of where it crashed into the RTC? Granted, that only gets
us one piece of data, not the whole tombstone.
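A sketch of why that yields only one piece of data: the crash location, here a hypothetical "file:line" string, is squeezed into a small number that can be hidden in the RTC, and on the next boot the only way back is to rehash every known location and see which ones match. The bucket count and function names are hypothetical, and CRC32 stands in for whatever hash a real implementation would use. The kernel's CONFIG_PM_TRACE_RTC option uses a similar trick for debugging suspend/resume hangs.

```python
import zlib
from typing import Iterable, List

# Hypothetical number of distinct values we assume can be hidden in the RTC
# without the stored time looking absurd; real limits depend on the hardware.
HASH_BUCKETS = 1000


def crash_site_hash(site: str, buckets: int = HASH_BUCKETS) -> int:
    """Map a crash location such as 'mm/slab.c:1234' to a small integer."""
    return zlib.crc32(site.encode("utf-8")) % buckets


def resolve_hash(stored: int, candidates: Iterable[str], buckets: int = HASH_BUCKETS) -> List[str]:
    """On the next boot, rehash every known site and return those that match."""
    return [site for site in candidates if crash_site_hash(site, buckets) == stored]
```

With only a few hundred or thousand buckets, several sites can collide, which is why resolve_hash returns a list rather than a single location.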