[Ksummit-2008-discuss] DTrace

James Bottomley James.Bottomley at HansenPartnership.com
Mon Jun 30 12:22:12 PDT 2008


On Sun, 2008-06-29 at 21:04 -0400, Frank Ch. Eigler wrote:
> Please forgive me for "crashing" the discussion party here.  I would
> like to clarify some systemtap-related issues that people have raised.
> (I'm one of its developers.)  I'll just list individual points,
> roughly in order they were raised.  For a fuller treatment of any of
> the topics, please involve our public <systemtap at sources.redhat.com>
> mailing list.

It's not a private party ... hence the "discuss" part of the list
naming ...

> * postgres, other dtrace-probe-instrumented userspace programs
> 
>   We aim to piggyback on these efforts by reusing the dtrace
>   instrumentation calls embedded into postgres etc., if at all
>   possible.
> 
> * "klunky and prone to break in unexpected ways"
> 
>   There's a germ of truth there, but OTOH the case James ran into
>   involved complications beyond normal symbolic debugging too
>   (possibly having to search separately compiled modules for
>   definitions of opaque struct-pointer types).  We're working on it;
>   our bug/feature list is in public bugzilla.

Well, let me give you another example, because it tripped me up for
days:  Return probes give access to the entry variables as they were
when the routine was entered (not as they are on return).  I ran into it
because I was trying
to look at what a routine had done to the scsi command structure which
was passed as an input.
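
To make the trap concrete, here is a toy Python model (not systemtap
code) of the behaviour described above: the handler that fires on return
is handed a snapshot taken at entry, so it never sees what the routine
did to its input.  The names return_probe, record, complete_command and
the "result" field are all hypothetical, chosen only for illustration.

```python
import copy


def return_probe(handler):
    """Toy model of a return probe whose variables are captured at entry."""
    def wrap(func):
        def wrapper(*args):
            # Snapshot of the arguments as they were on entry.
            entry_args = copy.deepcopy(args)
            result = func(*args)
            # The handler runs at return time; `args` is the live state.
            handler(entry_args, args, result)
            return result
        return wrapper
    return wrap


seen = []


def record(entry_args, live_args, result):
    # Compare what the probe would report with what is really there.
    seen.append((entry_args[0]["result"], live_args[0]["result"]))


@return_probe(record)
def complete_command(cmd):
    # Hypothetical routine that modifies the structure it was passed.
    cmd["result"] = 2
    return 0


def demo():
    seen.clear()
    complete_command({"result": 0})
    return seen[0]
```

Calling demo() yields a pair whose first element is the entry-time value
and whose second is the value the routine actually left behind, which is
the one I was trying to look at.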

I've also found it very easy to crash the system under probe if you use
the wrong build tree for the running kernel (not a problem, I know, that
enterprise customers run into, but a common one for kernel developers).
Since we have a kernel build version that increments with every build,
it would be useful to sanity check the one systemtap pulled out of the
debug info against the one in the running kernel.
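
A minimal sketch of that sanity check, in Python for brevity.  It
assumes both sides can supply a version string of the usual "#N ..."
form (for the running kernel, os.uname().version on Linux); how the
second string gets extracted from the debug data is not shown, and the
function names build_number and same_build are hypothetical.

```python
import re


def build_number(uts_version):
    """Return the build counter N from a version string like '#N SMP ...'."""
    match = re.match(r"#(\d+)", uts_version.strip())
    if match is None:
        raise ValueError("no build counter in %r" % uts_version)
    return int(match.group(1))


def same_build(running_version, debuginfo_version):
    """True if both version strings carry the same build counter."""
    return build_number(running_version) == build_number(debuginfo_version)
```

A tool could refuse to insert the module when same_build() is false.
Comparing the full strings, timestamp included, would be stricter; the
counter alone is just the simplest thing that catches a stale tree.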

> * "unhappy week with dwarf"
> 
>   Guilty as charged. :-)
> 
> * kprobes, markers
> 
>   Performance of kprobes-based probes is about 1 us per hit overhead.
>   Markers are on the order of tens of nanoseconds, which makes a huge
>   difference for frequently-hit probes.  We'd be happy to interface to
>   other event sources like ftrace or whatever, as long as they provide
>   suitable kernel-module-accessible APIs.

There were two specific latencies of concern to the financial trading
house type of end user.  One was the latency from invoking a script to
its probes actually running.  This is caused mostly by the module build
and insertion.  I really can't see
this getting better except by divorcing systemtap from having to use the
whole of the kernel build infrastructure.  To do that, we need to begin
putting a lot of the C fragments that make up that infrastructure into
the kernel to lessen the load.  It would actually be nice finally to get
to the point where you simply link the probe routines with a special
module stub (built as part of the kernel) and insert it.

The other is the probe execution latency.  Since the institutions are
tracing transactions on the order of milliseconds, microsecond latencies
in the probes do give them cause for concern (it only takes a few probe
points to add up to a significant perturbation).
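
As a rough back-of-envelope illustration of that arithmetic, the sketch
below uses the per-hit figures quoted above (about 1 us for kprobes,
tens of nanoseconds for markers).  The 1 ms transaction, the 20 probe
hits and the 50 ns marker cost are assumptions picked only to show the
scale, not measurements.

```python
def perturbation_percent(probe_hits, per_hit_ns, transaction_ns):
    """Probe overhead as a percentage of the untraced transaction time."""
    return 100.0 * probe_hits * per_hit_ns / transaction_ns


TRANSACTION_NS = 1000000   # assumed: a 1 ms transaction
PROBE_HITS = 20            # assumed: probe hits per transaction

# ~1 us per hit, the kprobes figure quoted above
kprobes_pct = perturbation_percent(PROBE_HITS, 1000, TRANSACTION_NS)

# 50 ns per hit, an assumed value within "tens of nanoseconds"
markers_pct = perturbation_percent(PROBE_HITS, 50, TRANSACTION_NS)
```

Under those assumptions the kprobes case adds about 2% to the
transaction and the markers case about 0.1%.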

> * user-space probing
> 
>   We're finally getting very close on this.  Yes, it'd use the IBM
>   uprobes prototype above the Red Hat utrace work as a lower layer,
>   which we hope gets upstream as soon as possible.  It will behave
>   analogously to dtrace: executing probes in kernel space.  If it can
>   be made safe (and we think it can), it's a huge performance win over
>   trying to do it in userspace (with some gang of debugging processes
>   or whatever).
> 
> * oprofile
> 
>   It's a fine special-purpose tool.  We hope to hook into the same
>   sorts of underlying hardware performance counters to enable the same
>   profiling capability in systemtap - except well integrated with the
>   rest of the probing events / scripts.  perfmon2 upstream would be
>   very helpful.
> 
> * dtrace "just works"
> 
>   Yeah, so I hear, but think about how different their target
>   environment is.  Their kernel hardly changes (several fixed APIs,
>   ABIs): this has huge implications.  Their kernel was willing to
>   insert probes (~ markers), a bunch of build system changes (debug
>   info subset transcribing).  Here in linux land, we suffer
>   multifaceted tensions and it is hard to go toward a goal without
>   obstructions (well-meaning as they may be).

The goal has to be well articulated and agreed to.  Open source is rapid
at progressing towards common goals ... it's when the goals aren't
common that progress gets bogged down.

>   A bunch of third-party scripts are often conflated with "dtrace",
>   which is just a matter of growing the user community enough, and
>   giving them a good tool to build on top of.  A growing set of
>   runnable end-user scripts is already packaged with systemtap,
>   intended for use by nonexperts; more help (e.g. concise problem
>   statements about what you'd like to measure/see) would be welcome.
> 
> * integrating systemtap runtime into kernel
> 
>   We did some analysis about how much of the runtime code contains
>   novel & relevant code to the kernel.  We came up with a fraction
>   like 20% (IIRC; still searching for a link to the thread).  Some of
>   the code is indeed in need of some cleanup love.  
> 
>   Some of it has been necessary to work around kernel disruptions
>   (e.g., unexporting stuff like kallsyms_lookup).  The parts that are
>   deeply kernel-version-sensitive (and would thus benefit from your
>   maintenance) are quite small.  We're still open to trying to pursue
>   copying/upstreaming some of this code into the kernel.

Actually, this one is an example of a wrong approach.  What you're
effectively doing is trying to implement an ABI for staprun in these
files (as well as various helpers for the modules).  The workaround for
kallsyms_lookup is pretty horrible as well ... especially as the kernel
has its own address to symbol string converter.

This is a lot of what needs to be cleaned up and simplified.  The
interface between systemtap and the kernel is essentially a private ABI
and we should treat it as such, so all the helpers for the modules and
the necessary implementers of the ABI should be in kernel ... there
shouldn't be any (if done right) carried around as C fragments with
kernel version ifdefs ...

> * tapsets
> 
>   Theodore is mistaken that we are deflecting the job of tapset (probe
>   macro; abstracting architecture and kernel version-change -
>   $foo->bar->baz, function names) authorship.  We have asked for help,
>   and have received a little, but the group has in fact authored a
>   growing collection of this stuff.
> 
>   We would welcome having tapsets be included with the kernel and
>   cared for by you guys.
> 
> * debuginfo
> 
>   Yes, it's very helpful & necessary if one wants to place probes at
>   just about any statement and extract just about any data value.
>   It's the same prerequisite that crash or kgdb would have, since we
>   operate at a similar level of object/source code visibility.  Other
>   distros are learning to package this admittedly bulky data up, so
>   it'll be a matter of a largish download for distro users. Kernel
>   developers will of course have the data generated locally already.
> 
>   We've recently gained the ability to work on symbol table level data
>   only.  It's a compromise technology: it shrinks the installation
>   footprint but we get only function-entry probes; we lose data
>   typing; can only get at ABI-dictated positional integral arguments.
> 
> * systemtap building
> 
>   The only thing unusual with building the thing is the use of the
>   elfutils library to parse elf/dwarf data; links to that are provided
>   and one can link to a private copy if the system lacks it.

That's true, but only just: I've done it, but it's not exactly easy.  The
necessity of the undocumented --enable-staticdw flag stalled my
attempts to build it for a while.

> * systemtap releases
> 
>   True, we've been spotty with formal releases, though they are
>   archived and available, and we're moving to a more regular release
>   schedule very shortly.  The weekly snapshots have been good (except
>   a recent unfortunate regression that hits 2.6.25 kernels
>   particularly badly - that's holding up the new release plans).
> 
> 
> Thanks for reading; sorry about the length.

James



