Posted
on February 2, 2016, 15:36,
by mjw.
A package is more than a binary – make it observable
Introduction
I gave a presentation at Fosdem 2016 in the distributions developer room. This article is an extended version of the slide presenter notes. You can get the original from the talk page (press ‘h’ for help and ‘p’ to get the presenter view for the slides).
If any of this sounds interesting and you would like to get help implementing some of this for your distribution please contact me. I work upstream on valgrind and elfutils, which take advantage of having symbols and debuginfo available for programs. And elfutils is used by systemtap, systemd and perf to use some of that information (if available). I am also happy to work on gdb, gcc, rpm, binutils, etc. if that would help make some of this possible or more usable. I work for Red Hat which might explain some of my bias towards how Fedora handles some of this. Please do point out when I am making assumptions that are just plain wrong for other distributions. Ideally all this will be usable cross-distros and when running programs in VMs or containers based on different distro images, you’ll simply reach in and trace, profile and debug anything running seamlessly.
Goal
The main goal is to seamlessly go from a binary back to the original source. In this case limited to “native” ELF code (stuff you build with GCC or any other compiler that produces native code and sufficiently good debuginfo). Once we get this right for “native code” we can look at how to setup similar conventions for other language execution environments.
Whether you are tracing, profiling or debugging a locally running program, get a core file or want to interpret trace or profile data captured on some system, we want to make sure as many symbols, debuginfo and sources are available and easily accessible. Get them wherever they are. Or have a standard way to get them.
I know how most of this stuff works in Fedora, and how to improve some things for that distribution. But I would need help with other distributions and sanity checking that these ideas make sense in other context.
Why?
If you are running Free Software then you should be able to get back at the source code of your binaries. Code is never perfect and real issues always happen in production. Every user really is (and should allowed to be) a “debugger”. Observing (tracing, profiling) their system as it is running. Since actual running (optimized) code in a specific setup really is different from development code. You will observe different behavior in an actual deployed binary compared to how it behaved on the packager or developer setup.
And this isn’t just for the user on the machine getting useful backtraces. The user might just capture a trace or profile on their machine. Or you might get a core file that needs analysis “off-line”. Then having everything ready beforehand makes recreating a “debug environment” that matches the “production environment” precisely so much easier.
We do want users to trace, profile and debug processes running on their systems so they know precisely what is going on with their machine. But we also care about security so all code should run with the minimal privileges possible. Different users shouldn’t be able to trace each other processes, services should run in separate security context, processes handling sensitive data should make sure to pin security sensitive memory to prevent dumping such data to disk and processes that aren’t supposed to use introspection syscalls should be sandboxed. That is all good stuff. It makes sure users/services can synchronize, signal, debug, trace and profile their own processes, but not more than that.
There are however some kernel tweaks that don’t obey process separation and don’t respect different security scopes. Like setting selinux/yama ptrace_deny/scope. Enabling those will break stuff and will cause use of more privileged code than necessary. These “deny ptrace” features aren’t just about blocking the ptrace system call. They don’t just block “debuggers”. They block all inter-process synchronization, signaling, tracing and profiling by normal (unprivileged) users. Both settings were tried in Fedora and were disabled by default in the end. Because with them users can no longer observe their own processes. So they will have to raise their privileges to root. It also means a privileged monitoring process cannot just drop privileges to trace or profile lesser privileged code. So you’ll have to debug, profile and trace as root! It can also be seen as a form of security theater since a compromised process that is running in the same user/security context, might not be able to easily observe another process directly, but it can still get at the same inputs, read and manipulate the other process files, settings, install preload code to disable any restrictions, etc. Making observing other processes much more cumbersome, but not impossible.
So please don’t use these system breaking tweaks on normal setups where users and administrators should be able to monitor their own processes. We need real solutions that don’t require running everything as root and that respects normal user privileges and security contexts.
build-ids
A build-id is a globally unique identifier for an executable ELF image. Luckily everybody gets this right now (all support is upstream and enabled by default in the GNU toolchain). An build-id is an (allocated) ELF note put into the binary by the linker. It is (normally) the SHA1 hash over all code sections in the ELF image. The build-id can be found in each executable, shared library, kernel, module, etc. It is loaded into memory and automatically dumped into core files.
When you know the build-ids and the addresses where the ELF images are/were loaded then you have enough information to match any address to original source.
If your build is reproducible then the build-id will also be exactly the same. The build-id identifies the executable code. So stripping symbols or adding debuginfo doesn’t change it. And in theory with reproducible builds you could “just” rebuild your binaries with all debug options turned on (GCC guarantees that producing debug output will not change the executable code generated) and not strip out any symbols. But that is not really practical and a bit cumbersome (you will need to also keep around the exact build environment for any binary on the system).
Because they are so useful and so essential it really makes sense to make it an error when no build-id is found in an executable or shared library, not just warn about it when creating a package.
backtraces/unwind tables
Backtraces are the backbone of everything (tracing, profiling, debugging). They provide the necessary context for any observation. If you have any observability this should be it. To make it possible to get accurate and precise backtraces in any context always use gcc -fasynchronous-unwind-tables everywhere. It is already the default on the most widely used architectures, but you should enable it on all architectures you support. Fedora already does this (either through making it the default in gcc or by passing it explicitly in the build flags used by rpmbuild).
This will get you unwind tables which are put into .eh_frame sections, which are always kept with the binary and loaded in memory and so can be accessed easily and fast. frame pointers only get you so far. It is always the tricky code, signal handlers, fork/exec/clone, unexpected termination in the prologue/epilogue that manipulates the frame pointer. And it is often this tricky situations where you want accurate backtraces the most and you get bad results when only trying to rely on frame pointers. Maintaining frame pointers bloats code and reduces optimization oppertunities. GCC is really good at automatically generating it for any higher level language. And glibc now has CFI for all/most hand written assembler.
The only exception might be the kernel (for reasons mainly to do with the fact that linux kernel modules are ET_REL files, loadable .eh_frame sections are somewhat problematic). But even for the kernel please do generate accurate unwind tables and then put those in the .debug_frame section (which can then be stripped out and put into a separate debug file later). You can do this with the .cfi_sections assembler directive.
function symbols
When you do get backtraces for observations it would be really nice to immediately be able to match any addresses to the function names from the original source code. But normally only the .dynsym symbols are available (these are only those symbols that are necessary for dynamic linking your application and shared libraries). The full .symtab is normally stripped away since it is strictly only necessary for the linker combining object files.
Because .dynsym provides too little symbols and .symtab provides too much symbols, Fedora introduced the mini-symtab (sometimes called mini-debuginfo). This is a special (non-loaded) .gnu_debugdata section that contains a xz compressed ELF image. This ELF image contains minimal .symtab + .strtab sections for just the function symbols of the original .symtab section.
gdb and elfutils support using/reading .gnu_debugdata upstream. But it is only generated by some obscure function inside the rpm find-debuginfo.sh script. This really should be its own reusable script/program.
An alternative might be to just not strip the full .symtab out together with the full debuginfo and maybe use the new ELF compressed section support. (valgrind needs the full .symtab in some cases – although only really for ld.so, and valgrind doesn’t support compressed sections at the moment).
Together with accurate unwind tables having the function symbols available (and not stripped away or put into a separate debug file that might not be immediately accessible) provides the minimal requirements for easy, simple and useful tracing and profiling.
Full debuginfo
Other debug information can be stored separately from the main executable, but we still need to generate it. Some recommendations:
- Always use -g (-gdwarf-4 is the default in recent GCC)
- Do NOT disable -fvar-tracking-assignments
- gdb-add-index (.gdb_index)
- Maybe use -g3 (adds macro definitions)
This is a lot of info and sadly somewhat entangled. But please always generate it and then strip it into a separate .debug file.
This will give you inlines (program structure, which code ended up where). Arguments to functions and local variables, plus which value they have at which point in the program.
The types and structures used by the program. Matching addresses to source lines. .gdb_index provides debuggers a quick way to navigate some of these structures, so they don’t need to scan it all at startup, even if you only want to use a small portion. -g3 used to be very expensive, but recent GCC versions generate much denser data. Nobody really uses them much though, since nobody generates them… so chicken, egg. Both indexing and dense macros are proposed as DWARFv5 extensions.
Not using -fvar-tracking-assignments, which is the default with gcc now, really provides very poor results. Some projects disable it because they are afraid that generating extra debuginfo will somehow impact the generated code. If it ever does that is a bug in GCC. If you do want to double check then you can enable GCC -fcompare-debug or define the environment variable GCC_COMPARE_DEBUG to explicitly make GCC check this invariant isn’t violated.
Compressing
Full debuginfo is big! So yes, compression is something to think about. But ELF section compression is the wrong level. It isn’t supported by many programs (valgrind for example doesn’t). There are two variants (if you use any please not .zdebug, which is now a deprecated GNU extension). It prevents simply mmapping the data and using an index to only read/use what you need. It causes very slow startup.
You should however use DWZ, the DWARF optimization and duplicate removal tool. Given all debuginfo in a package this tool will make sure duplicate DWARF information is stored in a common place, reducing the size of the individual debug files.
You could use both DWZ and ELF section compression together if you really want to get the most compression. But I would recommend using DWZ only and then compress the whole file(s) for storage (like in a package), but install them uncompressed for direct usage.
Sources
The DWARF debuginfo references sources, and you really want to have them easily available. So package the (generated) sources (as they were compiled) and put them somewhere under /usr/src/debug/[package-version]/.
There is however one gotcha. DWARF references the sources where they were build. So unless you put and build the sources precisely where you want to install them you will have to adjust the. This can be done in two ways:
- rpm debugedit
- gcc -fdebug-prefix-map=old=new
debugedit is both more and less flexible. It is more flexible because it provides you the actual source file list used in the DWARF describing the program. It is less flexible because it isn’t a full DWARF rewriter. It adjusts the location/directories as long as they are smaller… So setup a big enough build root PATH name. It is probably time to rewrite debugedit to support proper DWARF rewriting and make it an independent tool that can easily be reused not just by rpm.
Separating and “linking”
There are two ways to “link” your binaries to their debug files:
- .gnu_debuglink section in main file with name (and CRC) of .debug file
- /usr/lib/debug/.build-id/XX/XXXXXXXXXXXXXXXXXXXXXX.debug
The .gnu_debuglink name has to be searched under well known paths (/usr/lib/debug + original location and/or subdirs). This makes it fragile, but more tools support it and it is the fallback used for when there is no build-id. But it might be time to deprecate/remove it because they inherently conflict between package versions.
Fedora supports both linking build-id.debug -> debuglink file. Fedora also throws in extra link to main exe under .build-id. But in the debuginfo package, so that link could mismatch if the main package and debug package versions don’t match up. It is not recommended to mimic this setup.
Preventing conflict
This is work in progress in Fedora:
- Want to install both 64-bit and 32-bit debug package.
- Have older/newer version of a debuginfo package installed. (inspecting a core file).
By making debuginfo packages parallel installable across arches and versions you should be able easily trace, profile and debug 32 and 64 bit programs at the same time. Inspect a core file generated against slightly different versions of the executable and libraries installed on the developer machine. And be able to install all debug files matching the executables running in a container for deep inspection.
To get there:
- Hash in full name-version-arch of package into build-id.
- Get rid of .gnu_debuglink files.
- No more build-id main file backlinks.
- Put sources under full name-version-arch subdir
This is where I still have more questions than answers. build-ids can conflict for minor version updates (the files should be completely identical though). Should we hash-in the full package name to make them unique again or accept that multiple packages can provide the same ELF images/build-ids? Dropping .gnu_debuglink (or changing install/renamed paths) will need program updates. Should the build-id-main-file backlinks be moved into the main package?
Should we really package debug files?
We might also want to explore alternatives to parallel installable debuginfo packages. Would it make sense to completely do away with debuginfo packages by:
- Making /usr/lib/debug and /usr/src/debug “magic” fuse file systems?
- Populate through something like darkserver
- Have a cross-distro federated registry of build-ids?
Something like the above is being experimented with in the Clear Linux Project.