[Openais] [RFC] simple blackbox

Steven Dake sdake at redhat.com
Mon Oct 13 08:13:08 PDT 2008


On Mon, 2008-10-13 at 14:45 +0200, Lars Marowsky-Bree wrote:
> On 2008-10-09T08:51:58, Steven Dake <sdake at redhat.com> wrote:
> 
> Hi Steven,
> 
> > > The goal was to have a blackbox which we cannot just retroactively dump,
> > > but also easily recover from a core or kernel crashdump. 
> > 
> > the array is global and can easily be obtained from a core file with a
> > simple script.
> 
> I got that, but I would at least mention that we are looking at crash
> scenarios. Not exactly scenarios which are well known for well-behaved
> pointers.

If any segfault were to corrupt the flight recorder data segment's head
and tail array indexes (which are at the end of the memory buffer), the
rest of the buffer would still contain a full record trace which could
be recovered.  A record looks like this:

record[0] size of record
record[1] record #
record[2..size_of_record] record contents

Once the first size-of-record field was found in the array, it would
be possible to iterate over the entire array and sort on record #.  Then
the valid flight data (whatever wasn't corrupted) could be read via
fplay (with some changes to fplay).

Even with random corruption of various values in the flight data record,
the remaining data could still be analyzed.  The flight data record
replay code could be made extremely resilient to corruption of the
flight data events by errant memory overwrites.
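
A rough sketch of what such a recovery pass over a raw dump might look
like, assuming the buffer is an array of 32-bit words laid out as above
and that the size field is counted in words (the function name and the
sanity checks here are illustrative only):

#include <stdint.h>
#include <stdio.h>

static void scan_records (const uint32_t *buf, size_t words)
{
	size_t i = 0;

	while (i + 2 <= words) {
		uint32_t size = buf[i];		/* record[0]: size of record */
		uint32_t rec_no = buf[i + 1];	/* record[1]: record # */

		/* plausible sanity check against a corrupted size field */
		if (size < 2 || size > words - i) {
			i++;			/* resynchronize on the next word */
			continue;
		}
		printf ("record %u: %u words of contents\n", rec_no, size - 2);

		/* records found here would then be sorted on rec_no and
		 * handed to fplay for replay */
		i += size;
	}
}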

> 
> > > I would maintain that I'd prefer a ring buffer of messages logged (at +n
> > > log levels above the actual log level configured), in their printable
> > > char[] form.
> > too much overhead - see below.  Also then trace data is lost during
> > normal runtime operations.  IMO we want to maintain this trace
> > information at all times during every run.
> 
> Uhm, but for how long? It seems obvious that the blackbox eventually
> overflows and wraps around.
> 

Let me start out by stating the use case.  The application crashes with
a segfault.  The bug could be one of two classes: a) the bug is broken
code which is easy to fix by inspection, or b) the bug is a logic error
caused by many previous state changes within the application.

Case b) is what the flight recorder is meant to solve: the really
hard bugs which can't be understood without knowing what events
led up to the crash.

So in this case, I am debugging a real problem which falls into the b)
category.  I have a 4MB flight recorder buffer which wraps at
approximately 27k records (roughly 150 bytes per record on average).  It
demonstrates the start of the fault at record 22k and fails at record
27k.  So in that case, 5k records were enough to analyze the problem.
4MB seems like a good buffer size, but the fact is memory is cheap and
even 10MB might be a reasonable flight buffer size.

example of solution in action:

https://bugzilla.redhat.com/show_bug.cgi?id=441413


> > > That would also ensure that we are looking at exactly the same strings
> > > as were eventually logged externally, which could be very much useful to
> > > correlate the logs from the blackbox with those in the system logs.
> > this is indeed a problem in the current design.
> 
> A rather key issue I think.
> 

Today the system logs contain a mix of string debugging information
(only useful to a developer) and actual logging information telling the
user something informative.  What this log recorder does is keep those
informative messages in the flight recorder as strings (since these are
often low performance paths which do not affect timing).  The fplay app
will then replay every log message along with any event status change
messages that were recorded during normal run-time.  So what is logged
externally to syslog (only the informative string messages) is exactly
the same set of messages as seen in the replay of the system crash log.


> > Let me explain the real world problem and design goals I had when I
> > started so you understand where the implementation comes from.
> > 
> > I have a real world problem in the checkpoint service where under
> > certain circumstances with 10-12 hrs of test runs during recovery
> > checkpoint segfaults.  I have instrumented it heavily with logsys and
> > the error goes away entirely (because it dramatically changes the timing
> > characteristics).  So there is the problem - instrumentation overhead
> > causes errors that were once there to disappear.
> 
> I understand this. Logging instrumentation of whatever kind will always
> subtly change the timing.
> 
> However, the idea of instrumenting the general logging call, for example
> such as:
> 
> - loglevel_general >= notice
> - loglevel_blackbox >= debug
> 
> Would actually ensure that the debug messages were generated at all
> times (just not normally written out). This would thus not affect timing
> between test and production runs, as the logging would be identical.
> 

The act of writing the log message also changes system timing.

> Of course, this means that even at debug log level, the code should not
> log so much that it slows down so that it becomes unusable for
> production. Some additional internal debug levels might be used for
> actual debugging runs during development.
> 
> Even with your proposal, I don't expect the tracing would record
> _EVERYTHING_ at all times, as that would still be too much data. But
> depending on your format, you might be able to squeeze in more data,
> sure - as long as we don't overcompensate by logging too large
> structures ;-)
> 

Actually I think you misunderstand my proposal slightly.  I intend, at
least in the corosync and openais code, to record in the flight recorder
nearly every state event that occurs, without logging those events
externally.  The state change events (log_rec) have different goals from
those of the informative message events (log_printf).
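
As a rough illustration of the two call styles (the log_rec signature
and the REC_ID_CKPT_WRITE identifier below are assumptions for the
sketch, not the final logsys API):

#include <stdint.h>

extern void log_rec (int rec_ident, const void *data, uint32_t len);
extern void log_printf (int level, const char *fmt, ...);

#define REC_ID_CKPT_WRITE	1
#define LOG_LEVEL_NOTICE	5

struct ckpt_pending_write {
	uint32_t section_id;
	uint32_t offset;
	uint32_t length;
};

static void ckpt_write_requested (struct ckpt_pending_write *w)
{
	/* state change event: the raw structure is copied into the
	 * flight recorder; nothing is formatted or logged externally */
	log_rec (REC_ID_CKPT_WRITE, w, sizeof (*w));
}

static void ckpt_recovery_done (void)
{
	/* informative message: formatted text, sent to syslog/file and
	 * also kept in the flight recorder */
	log_printf (LOG_LEVEL_NOTICE, "checkpoint recovery complete\n");
}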

Any user that doesn't want to adopt this model doesn't have to in any
way.  There is no removal of the log_printf API.  It remains the staple
logging API and is usable for all cases, even debugging.  Just keep in
mind it may change program execution because of the variable,
non-deterministic nature of sprintf string processing and the logging
via syslogd or the filesystem that ensues.  Its contents are still
recorded in the flight recorder for later analysis.

So today what we are talking about for most apps is a completely hidden
"under the hood" change in how log_printf logs messages: an snprintf
followed by a log_rec operation, while a logsys thread reads the log
messages back from the flight recorder and replays those that should be
printed to the user.

In a failure or crash scenario, the system log can be analyzed via
fplay.
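
In rough terms the hidden path might look something like this (the
rec_write/rec_read helpers, the record structure and the thread details
are assumptions made for the sketch, not the actual logsys code):

#include <stdarg.h>
#include <stdio.h>
#include <string.h>
#include <syslog.h>

/* hypothetical flight recorder helpers, for illustration only */
struct rec {
	int ident;
	int level;
	char data[1024];
};
extern void rec_write (int ident, int level, const void *data, size_t len);
extern int rec_read (struct rec *r);	/* returns 1 when a record is ready */

#define REC_ID_LOG_MSG	2

void log_printf (int level, const char *fmt, ...)
{
	char msg[1024];
	va_list ap;

	va_start (ap, fmt);
	vsnprintf (msg, sizeof (msg), fmt, ap);
	va_end (ap);

	/* store the finished string in the flight recorder; this is the
	 * only work done on the timing sensitive caller's path */
	rec_write (REC_ID_LOG_MSG, level, msg, strlen (msg) + 1);
}

/* separate logging thread: drains printable records from the flight
 * recorder and emits them to syslog, off the critical path */
static void *log_thread (void *unused)
{
	struct rec r;

	for (;;) {
		if (rec_read (&r) && r.ident == REC_ID_LOG_MSG) {
			syslog (r.level, "%s", r.data);
		}
	}
	return NULL;
}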


> > How will this be used?  All of those printfs will be turned into log_rec
> > messages.  Since log_rec is super high performance, I am hopeful it
> > doesn't impact the reproduction of the error.  Also those log_rec
> > commands will be _left in the code_ permanently. 
> 
> The same would also be true for instrumenting the general logging code.
> 
> > Anytime there is a field failure, instead of providing a log file,
> > users would provide the flight data which could be analyzed and
> > debugged.  I could look at the trace data and say "wow that event
> > looks out of place..".  Complete data structures can be logged instead
> > of logging individual pieces of the sprintf functionality.
> 
> That is indeed an interesting property. Maybe one worth accepting the
> complexity for, even though I'm not sure about that. As long as the logs
> can be easily reconciled with the externally observed logs - preferably,
> those would use the same infrastructure, just that then the logging
> thread would chose to convert and output them as text immediately?
> 
> > The goals are:
> > 1) super high performance
> > 2) tracing left in permanently
> > 3) uses circular buffer properly and efficiently
> > 4) easy to recover trace data from core files
> > 5) allows variable length parameters to be logged which are binary in
> > format (such as data structures)
> > 6) records all information necessary for the person supporting the
> > failure to understand exactly what happened during execution.
> 
> I definitely don't disagree with the goals; I hadn't thought of 5), and
> am not perfectly sure whether the implementation achieves 4), but maybe
> this is a good trade-off to make.
> 
> 
> Regards,
>     Lars
> 

I am doing the development in whitetank.  I hope to be able to share a
version for trunk of openais and corosync soon so that the behavior of
the flight replay can be reviewed along with the general use of the
code.

There is developer pain in using the log_rec functionality, specifically
the printing functions in the fplay code that decode the trace messages.
But the pain only needs to be suffered once, after which it can be
forgotten about until a crash or fault of some kind occurs.
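
Roughly, the kind of per-record printing function meant here looks
something like this (the record contents and field names are purely
illustrative, not the real whitetank format):

#include <stdint.h>
#include <stdio.h>

static void print_ckpt_write_rec (const uint32_t *rec, uint32_t words)
{
	/* fplay has already consumed the size and record # fields, so
	 * rec points at the record contents */
	if (words < 3) {
		printf ("ckpt write: truncated record\n");
		return;
	}
	printf ("ckpt write: section %u offset %u length %u\n",
		rec[0], rec[1], rec[2]);
}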

Regards
-steve


