Probing into Systemtap

Systemtap has been under active development for a while now; more than 35 people have contributed enhancements in the last year. But newer developments, like the ability to dynamically trace user space programs, not just kernel space, on recent GNU/Linux distributions, have been introduced rather quietly and have not always been noticed by users who are not yet using Systemtap extensively. So we take a look at what currently works out of the box, what that box should contain to make things work, the work in progress, and the challenges Systemtap faces to become more powerful and gain wider adoption.

Systemtap's goal is to provide full system observability on production systems in a way that is safe, non-intrusive and (near) zero-overhead, and that allows ubiquitous data collection across the whole system for any interesting event that could happen. To achieve this Systemtap defines the stap language, in which you define your probes, actions and data acquisition. The Systemtap translator and runtime guarantee that probe points are only placed at safe locations and that probe functions cannot generate too much overhead when collecting data. For dynamic probes on addresses inside the kernel Systemtap uses kprobes; for dynamic probes in user space programs it uses their cousin, uprobes. This provides a unified way of probing, and then collecting data, for observing the whole system. To dynamically find locations for probe points, the arguments of the probed functions and the variables in scope at the probe point, Systemtap uses the standard debugging information (DWARF debuginfo) that the compiler generates.

So the ideal setting for using Systemtap is a GNU/Linux distribution that provides easy access to debuginfo for the kernel and user space programs (almost all do) and a kernel that supports kprobes, which have been in the upstream kernel for some years, and uprobes, which comes with and is automatically loaded by Systemtap, but which relies on the full utrace framework, which isn't yet in all distribution kernels (the latest few releases of the Fedora family, including Red Hat Enterprise Linux and CentOS, do include full utrace support by default). Systemtap works without debuginfo, but the range of probes and the amount of data you can collect are then very limited. It also works without utrace support, but then you won't be able to do deep user space probing, only observe direct user/kernel space interactions.

There are various probe variants one can use with Systemtap, but the most interesting ones are the debuginfo based probes for the kernel, kernel modules and user space applications. These can use function, statement or return variants and wildcards: kernel.function("rpc_new_task") probes a named kernel function, process("/bin/ls").function("*") probes any function entry in a specific process, module("usb*").function("*sync*").return probes every return of a function containing the word sync in any module starting with usb, and kernel.statement("bio_init@fs/bio.c+3") probes a specific statement in a particular function in a particular file.
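As a sketch, such probe points are used by attaching a probe body to them (whether the wildcard patterns actually match anything depends on your kernel and on which modules are loaded):

  probe kernel.function("rpc_new_task")
  {
    printf("rpc_new_task called by %s\n", execname())
  }

  probe module("usb*").function("*sync*").return
  {
    printf("%s returned\n", probefunc())
  }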

Depending on the type of probe one can access specifics of the probe point. For the debuginfo based probes these are $var for in scope variables or function arguments, $var->field for accessing structure fields, $var[N] for array elements, $return for the return value of a function in a return probe, and meta variables like $$vars to get a string representation of all the in scope variables at a particular probe point. All accesses through such constructs are safeguarded by the Systemtap runtime to make sure no illegal accesses can occur.
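A sketch of using these constructs, probing the kernel's do_sys_open (assuming that function and its filename argument are visible in your kernel's debuginfo):

  probe kernel.function("do_sys_open")
  {
    # $filename is an in scope function argument pointing into
    # user space; user_string() safely fetches the string
    printf("opening: %s\n", user_string($filename))
    # $$vars gives one string listing all in scope variables
    printf("in scope: %s\n", $$vars)
  }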

Given that one has the debuginfo of a program installed, one can easily get a simple call trace of a specific program, including all function parameters and return values, with the following stap script:

  probe process("/bin/ls").function("*").call
  {
    printf("=>%s(%s)\n", probefunc(), $$parms);
  }

  probe process("/bin/ls").function("*").return
  {
    printf("<=%s:%s\n", probefunc(), $$return);
  }

The examples included with Systemtap contain much more powerful versions that show timed per-thread callgraphs, optionally showing only the children of a particular function call.

While these probing and data extraction constructs are very powerful, they do require some knowledge of the kernel or program code base. Since you are often interested in what is happening and not precisely how, Systemtap comes with tapsets: utility functions and aliases for groups of interesting probes in a particular subsystem. Examples include system calls, NFS operations, signals, sockets, etc. Currently these are distributed with Systemtap itself, but ideally each program or subsystem would come with its own tapset of interesting events provided by the program or subsystem maintainer.
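For example, with the syscall tapset you can watch file opens across the whole system without knowing which kernel function implements them; a minimal sketch (the filename variable is provided by the tapset):

  probe syscall.open
  {
    printf("%s(%d) opened %s\n", execname(), pid(), filename)
  }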

Just printing out events while they occur is not always ideal. First, you get overwhelmed by the output. Second, you might only be interested in a specific subset of the same event (only certain parameters, only calls that take longer than a specific time, only from the process that does the most calls over a specific time frame, etc.). Finally, processing all the events on your production system might interfere with the thing you are trying to observe, especially at the start of your investigation, when you might not yet be sure which events are the interesting ones and do some very wide probing to see what is going on.

For this reason the stap language supports variables that can be used as associative arrays, simple control structures and aggregate functions to do simple statistics at probe time, with very low overhead and without having to call external programs that might interfere with the system being probed.

The following script might be how you would start investigating a system which seems to do an excessive amount of reads. It uses the vfs tapset and an associative array to store the number of reads each executable, identified by name and pid, has done:

  global totals;
  probe vfs.read
  {
    totals[execname(), pid()]++
  }

  probe end
  {
    printf("== totals ==\n")
    foreach ([name,pid] in totals-)
      printf("%s (%d): %d \n", name, pid, totals[name,pid])
  }

This will give you a list of executables and their pids, sorted by the total number of vfs reads done between when you started and stopped the script. These facilities in the stap language help greatly to minimize any overhead of the tracing framework. If you tried to do the same thing by just printing each vfs event and then post-processing the results with perl, you might end up with perl itself being the process doing the most vfs calls, or worse: by having to parse a couple of megabytes of trace data, perl might start thrashing the system even more, making it harder to determine the root cause in the first place.
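The statistics aggregates are made for exactly this situation. A sketch of a possible next step in the investigation (assuming vfs_read and its return value are visible in your kernel's debuginfo), collecting the bytes read per executable and printing a logarithmic histogram at the end:

  global bytes
  probe kernel.function("vfs_read").return
  {
    # <<< adds a value to a statistics aggregate
    if ($return > 0)
      bytes[execname()] <<< $return
  }

  probe end
  {
    foreach (name in bytes)
    {
      printf("%s: %d reads, %d bytes on average\n",
             name, @count(bytes[name]), @avg(bytes[name]))
      print(@hist_log(bytes[name]))
    }
  }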

Systemtap now also supports static markers in the kernel. These allow subsystem maintainers to mark specific events as interesting, providing a format string for the arguments of the event that can easily be parsed by tracing tools like Systemtap. The advantage over tapsets is that markers are in-code and so might be easier to maintain (you probably still want an associated tapset for utilities like nicely formatting the arguments or associating various markers with each other), and they can work without any DWARF debuginfo around (but then you lose the ability to inspect local variables or function parameters not passed to the marker). You use them through probe kernel.mark("kernel_sched_wakeup") and can then access the arguments through $arg1, $arg2, etc. and get the argument format string of the marker with $format.
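A sketch of probing this scheduler marker (assuming your kernel was built with markers enabled; which arguments exist, and their types, depend on the marker's definition, so here we only report the format string):

  probe kernel.mark("kernel_sched_wakeup")
  {
    # $format describes how to interpret $arg1, $arg2, ...
    printf("wakeup marker hit, format: %s\n", $format)
  }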

An alternative way of adding static markers to the kernel, tracepoints, is not yet directly supported in Systemtap. Tracepoints have the disadvantage that they require the DWARF debuginfo to be around, since they don't currently specify the types of their arguments except through their function prototypes. So Systemtap can currently only use tracepoints via hand-written intermediary code that maps them to markers.

The development version of Systemtap recently got support for user space static markers. Systemtap defines its own STAP_PROBE macros for use in applications that want to add static markers, but there is also an alternative tracing tool, dtrace, that already defines ways for programs to embed static markers. Systemtap supports the dtrace convention as well, by providing an alternative include file and build preprocessor, so that programs using the DTRACE_PROBE macros can be compiled as if for dtrace and have their static markers show up in stap.

Luckily, various programs already have such markers defined. For example, PostgreSQL has several static markers that one can probe with Systemtap right now to trace higher level events like transactions and database locks. Currently one has to adapt the build process of such programs by hand, but the next version of Systemtap will come with scripts that automate that process.
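Once PostgreSQL is built with its markers enabled, a sketch like the following could count transaction starts (the binary path here is an assumption for your system, and the marker name assumes PostgreSQL's dtrace probe transaction-start, which appears as transaction__start in stap):

  global starts
  probe process("/usr/bin/postmaster").mark("transaction__start")
  {
    starts++
  }

  probe end
  {
    printf("%d transactions started\n", starts)
  }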

While Systemtap works great on GNU/Linux distributions that support it, there are a couple of challenges to overcome to make it more ubiquitous and easier for more people to use out of the box. This is not just work on the Systemtap code base itself. Since the goal is to provide full system observability, from low-level kernel events to high-level application events, there is work to do all across the GNU/Linux stack, plus better integration into more distributions (a default install of Systemtap and tapsets, easy access to debuginfo for deep inspection, binaries compiled with marker support for high-level events, etc.). The two main challenges to making Systemtap more powerful and easier to use on any distribution are debuginfo and better kernel support.

Much of Systemtap's power comes from the fact that it can use the DWARF debuginfo of the kernel and applications to do very detailed inspection. But this comes at a price, since the debuginfo is often large; on Fedora, for example, the kernel debuginfo package is far larger than the kernel package itself. One easy win will be to split the debuginfo package into the DWARF files and the source files; the latter are needed for a debugger, but not directly for a tracer like Systemtap. Fedora plans to do this for its next release. The elfutils team is also working on a framework for DWARF transformation and compression that could be used as a post-processor on the output of the compiler.

Systemtap sometimes suffers from the same issues you might have with a debugger: the compiler has optimized the code, but forgot where it put a certain variable after the optimization. Of course this is always the variable you are most interested in. Alexandre Oliva is working on improving the local variable debug information in GCC. His variable tracking assignments branch aims to improve debug information by annotating assignments early in the compilation and carrying those annotations through all optimization passes, so that variables can be tracked accurately even in optimized code.

Finally, work is being done on a Systemtap client and server that could be used on production systems where you might not want any tools or debuginfo installed at all. You then set up a development client that has the same setup as the production system, but includes the Systemtap translator and all debuginfo, create and test your scripts there, and only run the final result on the bare bones production server.

Most of the Systemtap runtime, like the kprobes support, is maintained in the upstream Linux kernel, but some pieces are still missing. This leads to distributions having to add small patches to their kernels, especially to support user space tracing. In particular, parts of the utrace framework are still not upstream. Over the last few kernel releases various parts have been merged already: the utrace user_regset framework, which creates an interface for code accessing the user-space view of any machine specific state, and the tracehook work, which provides a framework for all user process tracing, are both upstream. The actual utrace framework sits on top of these, and the ptrace interface is then implemented as a utrace client. Anything that changes the ptrace implementation is hairy stuff, and there is a large ptrace testsuite to make sure that nothing breaks. But one idea is to push upstream in two stages: at first, using utrace or ptrace on a process would be mutually exclusive, which could pave the way to get pure utrace upstream first and then do proper ptrace cooperation in a second stage.

This would also clear the way for uprobes, which depends on the utrace framework, to be submitted upstream. uprobes components such as breakpoint insertion and removal and the single-stepping infrastructure are also potentially useful for other user space tracers and debuggers. As with utrace, one idea is to factor out these portions of uprobes so that they can be used by multiple clients as a shared user-space breakpoint support (ubs) layer. With multiple clients using the same layer, upstream acceptance might be easier.

One candidate for using both the utrace and uprobes layers besides Systemtap is Froggy, which provides an alternative debugger interface to ptrace. The GDB Archer project would like to serve as a testbed for Froggy, which they hope will also make gdb more robust when linked with libpython, the library being used for GDB scripting.

There is still work to do, but over the last couple of years the GNU/Linux tracing and debugging experience has kept improving. Hopefully soon all these parts will fall into place and provide hackers with a fairly nice environment for not only debugging on development systems, but also for unobtrusive tracing on production systems.

About the author: Mark Wielaard is a Senior Software Engineer at Red Hat, working in the Engineering Tools group, hacking on Systemtap.