- Topic/Title slide

DWARF5 and GNU extensions
New ways to go from binary to source

Abstract:

After several years, a new DWARF debugging standard, DWARF5, has been released. It incorporates various GNU extensions that better express how to map the binary constructs created by optimizing compilers back to the original source code, while reducing the size of the debugging information. This results in more expressive debuginfo, but also introduces more complexity that DWARF consumers need to deal with.

We will go over the existing GNU Extensions to DWARF, how they were adopted by DWARF5, and describe how debug consumers can take advantage of them. To reduce space, a lot of different strategies are being used: separate .debug files, .gnu_debuglink, build-ids, compressed ELF sections, debug types in ELF COMDAT sections, the DWARF compressor DWZ and its .multi files, DWARF Supplementary Object Files, GNU Debug Fission, split-dwarf .dwo files, and DWARF Package Files (.dwp files). The basic structure of describing a program with a tree of Debug Information Entries (DIEs) with attributes per compile unit, augmented with some auxiliary data structures to map to source files, describe macros used in the source and get call frame information, hasn't fundamentally changed between DWARF version 2 and version 5. But the hierarchy of the representation and where the bits are stored has become much more complex. It is no longer possible to just see the DWARF descriptions as a fancy symbol table which can be simply indexed through some offsets. It has also become much more expressive than that.

*Note how the second paragraph already shows some focus on specific things.*


- What is DWARF?

  - Going from binary to source.
  - But that is slightly misleading.
    If that was all there was to it, you would just need the .debug_line
    mapping (addrs -> source/line-no).
  - Once you have source you might want to know...
    - Which function are we in, what parameters does it have, which variables are in scope? What are the types of those? .debug_info (DIE - Debug Information Entries tree.)
    - Given those variables, what are their values? .debug_loc (location descriptors).
    - What code range does this function (or lexical scope) span? .debug_ranges.
    - How did I get here? What were the values of the variables in scope when this function was called? .debug_frame (.eh_frame) unwind information
    - Now that I have the source, how does the following snippet expand? .debug_macro defines.

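A minimal sketch of the "binary to source" step, assuming the line
table has already been decoded into rows. Real .debug_line data is a
compact state-machine program that has to be run to produce such rows;
the row values and the names LINE_ROWS, END_ADDRESS and addr_to_line
are made up for illustration.

```python
import bisect

# Hypothetical, already-decoded line table rows: (address, file, line),
# sorted by address.
LINE_ROWS = [
    (0x1000, "main.c", 3),
    (0x1008, "main.c", 4),
    (0x1014, "util.h", 10),
    (0x1020, "main.c", 5),
]
END_ADDRESS = 0x1030  # first address past the last row (end of sequence)


def addr_to_line(addr):
    """Return (file, line) for addr, or None if addr is outside the table."""
    if not LINE_ROWS or addr < LINE_ROWS[0][0] or addr >= END_ADDRESS:
        return None
    addresses = [row[0] for row in LINE_ROWS]
    # Find the last row whose address is <= addr.
    index = bisect.bisect_right(addresses, addr) - 1
    _, filename, line = LINE_ROWS[index]
    return filename, line
```

Everything else on the list above (scopes, types, values, unwinding)
needs more than this one mapping.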

- DWARF standard design goals

- Language Independence
- Architecture Independence
- Operating System Independence
- Compact Data Representation
- Efficient Processing
- Implementation Independence
- Explicit Rather Than Implicit Description
- Avoid Duplication of Information
- Leverage Other Standards
- Limited Dependence on Tools
- Separate Description From Implementation
- Permissive Rather Than Prescriptive
- Vendor Extensibility

All have strong and weak points (give some quick examples).
But DWARF is the best we have, and it is widely supported.

The last point means we can often try out stuff through a GNU Vendor
Extension and then propose it for the next standard.

For this talk let's concentrate on "Limited Dependence on Tools".

  DWARF data is designed so that it can be processed by commonly
  available assemblers, linkers, and other support programs, without
  requiring additional functionality specifically to support DWARF
  data.

It explains a bit why the data representation is as it is. It makes DWARF easily "pass through" the toolchain, because it looks just like "data" to e.g. the assembler or linker. But the tools not having to know about DWARF does limit some of the other goals. There are some clever GNU extensions added to DWARF5 to counter some of these limitations.

- All the DWARF5 additions

[... whole long list ...]

Sticking to the new data representation issues because I cannot
discuss them all now. But please do ask me if you want to know more.

- What is needed to produce/composite DWARF?

  - Being able to reference labels in data.
  - Being able to reference labels between data sections.
  - Being able to reference symbols (relocations)
  - Would be nice if assembler can produce leb128

And that is it! With the above a DWARF producer can generate DWARF for
an object that can be combined by a linker without any more special
support.
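
For reference, a small sketch of the LEB128 variable-length integer
encoding mentioned above (the unsigned and signed variants). The
function names are just for illustration; an assembler with leb128
directives does the equivalent for the producer.

```python
def encode_uleb128(value):
    """Encode a non-negative integer as unsigned LEB128 bytes."""
    if value < 0:
        raise ValueError("ULEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_sleb128(value):
    """Encode a (possibly negative) integer as signed LEB128 bytes."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # Python's shift is arithmetic for negative ints
        sign_bit_set = bool(byte & 0x40)
        if (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set):
            out.append(byte)
            return bytes(out)
        out.append(byte | 0x80)  # more bytes follow


def decode_uleb128(data, offset=0):
    """Decode unsigned LEB128 from data; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, offset
```

Small values take one byte, which is why so many DWARF fields use it.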

So DWARF is modelled around the traditional compile unit model.

- Example

  - two source files, one header with data structure.

  - Parts of object file 1.
  - Parts of object file 2.
  - Combined, just concatenate
    and resolve symbol relocations, inter-section references.


* So, no really special magic needed to combine DWARF from separate
  object files. But look at that repeated type. And that is a lot of
  relocations...

- .debug_types

An extension for DWARF3, integrated into DWARF4.

- What if we had a "section group" or "linkonce section" where
  identical/duplicate sections would be merged/only one picked when
  combining object files?  (ELF Comdat sections)

- And if we define a hash/checksum over a type, then we could put a
  type into such a data section with that hash/checksum as name.

- Define a new way to refer to a type DIE (DW_FORM_ref_sig8) and the
  linker will make sure identically named ones will be de-duplicated.

- Does require a new DWARF unit header format that includes the sig8
  and the offset into the DIE tree that identifies the type. So we put
  these into their own section, so they don't get mixed up with the
  "real" compile units in .debug_info.
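
A rough sketch of the idea, not of the real algorithm: each type unit
is keyed by an 8-byte signature and the "linker" keeps one unit per
key, as it would for identically named COMDAT groups. The DWARF
signature is computed over a precisely specified flattening of the
type DIE; here a caller-provided byte string stands in for that, and
the choice of MD5 and of which 8 digest bytes to keep is an assumption
of this sketch. The names type_signature and link are hypothetical.

```python
import hashlib


def type_signature(flattened_description):
    """Stand-in for the sig8 computation: hash bytes, keep 64 bits."""
    digest = hashlib.md5(flattened_description).digest()
    return int.from_bytes(digest[-8:], "little")


def link(objects):
    """Mimic COMDAT handling: keep the first type unit seen per signature.

    objects is a list of object files, each a list of flattened type
    descriptions (bytes).
    """
    kept = {}
    for type_units in objects:
        for description in type_units:
            kept.setdefault(type_signature(description), description)
    return kept
```

References to the type then carry only the signature
(DW_FORM_ref_sig8), so they stay valid whichever copy the linker kept.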

- .debug_types example

[xxx same example as before, note we only have one type DIE.]

So, this is pretty nice. Linkers all already have some kind of
mechanism for this, so now we de-duplicate some information between
DWARF in object files for "free".

But things do get more complicated, there are now two DIE data
sections (.debug_info and .debug_types). Lots of existing consumers
depend on being able to reference DIEs by "offset". Which used to be
simple with the whole DIE tree in one section, but is now slightly
more complicated.

For this reason, not enabled by default in GCC.

In DWARF5 unit headers got more "generic", allowing for different
header fields for different unit types. So now these debug type units
are again part of .debug_info.

So that gives us some automatic de-duplication of information.  But
there is still a lot of DWARF data that the linker has to process (if
just to reduce the amount of data that is just copied around).

- GNU DebugFission or .dwo files

What if we could let the linker only deal with those parts of the
DWARF data that needs relocations?

- Why do we have relocations (in .debug_info) again?
  - Attributes referencing strings
  - Attributes referencing addresses/symbols
  - Attributes using inter-section references

Through indirection we can "remove" the direct relocations for strings
and addresses. Add .debug_addr which is a section just containing
addresses and reference each address through an index into
.debug_addr. Likewise introduce a new section .debug_str_offsets which
points to strings in the .debug_str section. And introduce forms to
reference strings through offsets into .debug_str_offsets.

For inter-section references, like location lists or range lists, do
something similar. Instead of referencing "directly" through a
relocation (wherever that data might end up), reference as index
from the start. Both range lists and location lists now start with an
index table that contains the offset to the actual list entries.
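
A simplified sketch of this indirection, assuming little-endian data,
4-byte string offsets and 8-byte addresses, and leaving out the table
headers. The section contents and the names read_strx and read_addrx
are made up for illustration.

```python
import struct

# Hypothetical section contents for one unit.
DEBUG_STR = b"main\0argc\0argv\0"
# .debug_str_offsets: 4-byte offsets into .debug_str (header omitted).
DEBUG_STR_OFFSETS = struct.pack("<3I", 0, 5, 10)
# .debug_addr: 8-byte addresses (header omitted).
DEBUG_ADDR = struct.pack("<2Q", 0x401000, 0x401040)


def read_strx(index, str_offsets_base=0):
    """Resolve a string index via .debug_str_offsets into .debug_str."""
    (offset,) = struct.unpack_from(
        "<I", DEBUG_STR_OFFSETS, str_offsets_base + 4 * index)
    end = DEBUG_STR.index(b"\0", offset)
    return DEBUG_STR[offset:end].decode()


def read_addrx(index, addr_base=0):
    """Resolve an address index via .debug_addr."""
    (address,) = struct.unpack_from("<Q", DEBUG_ADDR, addr_base + 8 * index)
    return address
```

The attribute itself only stores the small index. The one value that
still depends on the final layout is the base, which is the same for
every attribute in the unit.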

Now all we have to resolve is "from the start". We do this in a couple of ways.

First we split the Compile Unit in two. A skeleton unit, which we keep
in the main object file, has only one CU DIE with the attributes
that might still need relocations. In particular we add
DW_AT_str_offsets_base, DW_AT_addr_base, DW_AT_rnglists_base and
DW_AT_loclists_base, which are relocatable pointers to the start of the
.debug_str_offsets and .debug_addr sections and the start of the index
tables of the .debug_rnglists and .debug_loclists sections.

Then for all other DIEs (none of which have attributes needing
relocations anymore) we create a split unit in a separate .dwo file.

To properly connect the two, both the skeleton unit and split unit
have a dwo_id (like the type unit sig8) and the skeleton compile
unit DIE has a comp_dir attribute and a dwo_name attribute that
together point to the actual .dwo file.

(Same?) Example: Maybe with eu-readelf --debug-dump=info+

This is nice for your quick edit/compile/debug cycle. Most of the
DWARF data now gets output once in a separate .dwo file and the linker
doesn't need to touch it at all. It doesn't get rid of the
duplicate types. But that also wasn't really the case when they went
into the .o files directly. And a DWARF consumer can do the
deduplication during load time based on the type sig8.

GCC and GDB support this (as a DWARF4 extension, not yet the
standardized DWARF5 variant). But I don't know if people have been
using it in anger yet. Executables can easily be created from
thousands of object files, which might mean you get thousands of open
file descriptors. (That is how I currently have it implemented in
elfutils libdw, to support lazy file reading, but it might run out of
file descriptors.)

- DWARF Package Files .dwp

But once you get out of the edit/compile/debug cycle and you want to
install or distribute your work it isn't convenient to drag along all
those .dwo files. So we do now need a tool to combine them.
GNU binutils comes with dwp which does exactly this.

Usage: dwp [options] [file...]
  -e EXE, --exec EXE       Get list of dwo files from EXE
                           (defaults output to EXE.dwp)

Now this tool does need to know a bit of DWARF. At least enough to
read the type signatures and split compile unit ids. It also needs to
act a bit like a linker, concatenating the section data (but it
doesn't need to do any relocations). It then creates a lookup table
for the compile unit ids (.debug_cu_index) and type ids
(.debug_tu_index) which list the offsets (and sizes) of the supporting
sections (info, abbrev, line, loclists, str_offsets, macro, rnglists)
so the indexes can be adjusted to those offsets.

If there are string tables in the .dwo files, it should merge those
and update the .debug_str_offsets tables.
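
The string merging step could look roughly like this. It is a sketch
of the idea only, working on plain Python values instead of real
section data; merge_string_tables is a hypothetical name and this is
not how dwp is actually implemented.

```python
def merge_string_tables(units):
    """Merge per-file string tables and rewrite their offset tables.

    units is a list of (str_section, offsets) pairs, one per .dwo file:
    str_section holds NUL-terminated strings, offsets plays the role of
    that file's .debug_str_offsets entries.
    Returns (merged_str_section, list_of_rewritten_offset_lists).
    """
    merged = bytearray()
    seen = {}  # string bytes -> offset in the merged table
    new_tables = []
    for strings, offsets in units:
        rewritten = []
        for offset in offsets:
            end = strings.index(b"\0", offset)
            text = strings[offset:end]
            if text not in seen:
                seen[text] = len(merged)
                merged += text + b"\0"
            rewritten.append(seen[text])
        new_tables.append(rewritten)
    return bytes(merged), new_tables
```

Because the DIEs only hold indexes into the offsets table, none of the
DIE data itself has to be rewritten.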

So a consumer will first look for a .dwp file matching a main/debug
file and get the split compile units from there, if that file doesn't
exist, try to get the individual .dwo files. Strangely the lookup of
the .dwp files isn't specified.
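
One possible lookup order, sketched under assumptions: since the .dwp
location isn't specified, this guesses the default output name of the
dwp tool (EXE.dwp next to the executable) and otherwise falls back to
comp_dir plus dwo_name. The function name and the injectable exists
parameter are made up for illustration.

```python
import os
import posixpath


def find_split_unit_file(exe_path, comp_dir, dwo_name, exists=os.path.exists):
    """Return (kind, path) of the file expected to hold the split unit.

    Returns None when neither a package file nor the .dwo file is found.
    """
    # Assumption: the package file sits next to the executable as EXE.dwp.
    dwp_path = exe_path + ".dwp"
    if exists(dwp_path):
        return "dwp", dwp_path
    # An absolute dwo_name overrides comp_dir in posixpath.join.
    dwo_path = posixpath.join(comp_dir, dwo_name)
    if exists(dwo_path):
        return "dwo", dwo_path
    return None
```

Either way the consumer still has to check that the dwo_id found there
matches the one in the skeleton unit.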

- DWZ or DWARF Supplementary Object Files .sup

So what could we do if we do use a tool that "understands" DWARF data?

multi files. Very convenient for distro packaging.

[...] examples [...]

- Some diagram? What comes from where? Maybe from DWARF spec.
  Doesn't even list sup references.

  Should list all the DW_FORM_ sups?
  DW_FORM_strp_sup,

.gnu_debugaltlink
.debug_sup section

- Advertisement

If writing a DWARF consumer you might want to use a library.
Why not try elfutils libdw?