- Topic/Title slide

  DWARF5 and GNU extensions
  New ways to go from binary to source

  Abstract:

  After several years a new DWARF debugging standard, DWARF5, has been released. It incorporates various GNU extensions that make it possible to better express how to map various binary constructs created by optimizing compilers back to the original source code, while reducing the size of the debugging information. This results in more expressive debuginfo, but also introduces more complexity that DWARF consumers need to deal with. We will go over the existing GNU extensions to DWARF, how they were adopted by DWARF5, and describe how debug consumers can take advantage of them.

  To reduce space a lot of different strategies are being used: separate .debug files, .gnu_debuglink, build-ids, compressed ELF sections, debug types in ELF COMDAT sections, the DWARF compressor DWZ .multi files, DWARF Supplementary Object Files, GNU Debug Fission, split-dwarf .dwo files, and DWARF Package Files (.dwp files).

  The basic structure of describing a program with a tree of Debug Information Entries (DIEs) with attributes per compile unit, augmented with some auxiliary data structures to map to source files, describe macros used in the source and get call frame information, hasn't fundamentally changed between DWARF version 2 and version 5. But the hierarchy of the representation and where the bits are stored has become much more complex. It is no longer possible to just see the DWARF descriptions as a fancy symbol table which can be simply indexed through some offsets. It has also become much more expressive than that.

  *Note how the second paragraph already shows some focus on specific things.*

- What is DWARF?
  - Going from binary to source.
  - But that is slightly misleading. If that was all there was to it, you would just need the .debug_line mapping (addrs -> source/line-no).
  - Once you have source you might want to know...
    - Which function are we in, what parameters does it have, which variables are in scope?
      What are the types of those? .debug_info (DIE - Debug Information Entries tree).
    - Given those variables, what are their values? .debug_loc (location descriptors).
    - What code range does this function (or lexical scope) span? .debug_ranges.
    - How did I get here? What were the values of the variables in scope when this function was called? .debug_frame (.eh_frame) unwind information.
    - Now that I have the source, how does the following snippet expand? .debug_macro defines.

- DWARF standard design goals
  - Language Independence
  - Architecture Independence
  - Operating System Independence
  - Compact Data Representation
  - Efficient Processing
  - Implementation Independence
  - Explicit Rather Than Implicit Description
  - Avoid Duplication of Information
  - Leverage Other Standards
  - Limited Dependence on Tools
  - Separate Description From Implementation
  - Permissive Rather Than Prescriptive
  - Vendor Extensibility

  All have strong and weak points (give some quick examples). But DWARF is the best we have, and it is widely supported. The last point means we can often try out stuff through a GNU Vendor Extension and then propose it for the next standard.

  For this talk let's concentrate on "Limited Dependence on Tools".

  DWARF data is designed so that it can be processed by commonly available assemblers, linkers, and other support programs, without requiring additional functionality specifically to support DWARF data.

  It explains a bit why the data representation is as it is. It makes DWARF easily "pass through" the toolchain, because it looks just like "data" to e.g. the assembler or linker. But the tools not having to know about DWARF does limit some of the other goals. There are some clever GNU extensions added to DWARF5 to counter some of these limitations.

- All the DWARF5 additions

  [... whole long list ...]

  Sticking to the new data representation issues because I cannot discuss them all now. But please do ask me if you want to know more.

- What is needed to produce/combine DWARF?
  - Being able to reference labels in data.
  - Being able to reference labels between data sections.
  - Being able to reference symbols (relocations).
  - Would be nice if the assembler can produce leb128.

  And that is it! With the above a DWARF producer can generate DWARF for an object that can be combined by a linker without any more special support. So DWARF is modelled around the traditional compile unit model.

- Example
  - Two source files, one header with data structure.
  - Parts of object file 1.
  - Parts of object file 2.
  - Combined, just concatenate and resolve symbol relocations, inter-section references.

  * So, no really special magic needed to combine DWARF from separate object files. But look at that repeated type. And those are a lot of relocations...

- .debug_types

  An extension for DWARF3 integrated into DWARF4.

  - What if we had a "section group" or "linkonce section" where identical/duplicate sections would be merged/only one picked when combining object files? (ELF COMDAT sections)
  - And if we define a hash/checksum over a type, then we could put a type into such a data section with that hash/checksum as name.
  - Define a new way to refer to a type DIE (DW_FORM_ref_sig8) and the linker will make sure identically named ones will be de-duplicated.
  - Does require a new DWARF unit header format that includes the sig8 and the offset into the DIE tree that identifies the type. So we put these into their own section, so they don't get mixed up with the "real" compile units in .debug_info.

- .debug_types example

  [xxx same example as before, note we only have one type DIE.]

  So, this is pretty nice. Linkers all already have some kind of mechanism for this, so now we de-duplicate some information between DWARF in object files for "free".

  But things do get more complicated, there are now two DIE data sections (.debug_info and .debug_types). Lots of existing consumers depend on being able to reference DIEs by "offset".
  That used to be simple with the whole DIE tree in one section, but is now slightly more complicated. For this reason, it is not enabled by default in GCC.

  In DWARF5 unit headers got more "generic", allowing for different header fields for different unit types. So now these debug type units are again part of .debug_info.

  So that gives us some automatic de-duplication of information. But there is still a lot of DWARF data that the linker has to process (if just to reduce the amount of data that is just copied around).

- GNU DebugFission or .dwo files

  What if we could let the linker only deal with those parts of the DWARF data that need relocations?

  - Why do we have relocations (in .debug_info) again?
    - Attributes referencing strings
    - Attributes referencing addresses/symbols
    - Attributes using inter-section references

  Through indirection we can "remove" the direct relocations for strings and addresses. Add .debug_addr, which is a section just containing addresses, and reference each address through an index into .debug_addr. Likewise introduce a new section .debug_str_offsets which points to strings in the .debug_str section. And introduce forms to reference strings through offsets into .debug_str_offsets.

  For inter-section references, like location lists or range lists, do something similar. Instead of referencing "directly" through a relocation (wherever that data might end up), reference as index from the start. Both range lists and location lists now start with an index table that contains the offset to the actual list entries.

  Now all we have to resolve is "from the start". We do this in a couple of ways.

  First we split the Compile Unit in two. A skeleton unit, which we keep in the main object file, has only one CU DIE with the attributes that might still need relocations.
  In particular we add DW_AT_str_offsets_base, DW_AT_addr_base, DW_AT_rnglists_base and DW_AT_loclists_base, which are relocatable pointers to the start of the .debug_str_offsets and .debug_addr sections and to the start of the index tables of the .debug_rnglists and .debug_loclists sections.

  Then for all other DIEs (none of which have attributes needing relocations anymore) we create a split unit in a separate .dwo file. To properly connect the two, both the skeleton unit and split unit have a dwo_id (like the type unit sig8) and the skeleton compile unit DIE has a comp_dir attribute and dwo_name attribute that point to the actual .dwo file.

  (Same?) Example: Maybe with eu-readelf --debug-dump=info+

  This is nice for your quick edit/compile/debug cycle. Most of the DWARF data now gets output once in a separate .dwo file and the linker doesn't need to touch it at all.

  It doesn't get rid of the duplicate types. But that also wasn't really the case when they went into the .o files directly. And a DWARF consumer can do the deduplication during load time based on the type sig8.

  GCC and GDB support this (as a DWARF4 extension, not yet the standardized DWARF5 variant). But I don't know if people have been using it in anger yet. Executables can easily be created from thousands of object files, which might mean you get thousands of open file descriptors (that is how I currently have it implemented in elfutils libdw, to support lazy file reading, but it might run out of file descriptors).

- DWARF Package Files .dwp

  But once you get out of the edit/compile/debug cycle and you want to install or distribute your work, it isn't convenient to drag along all those .dwo files.

  So we do now need a tool. GNU binutils comes with dwp, which does exactly this.

    Usage: dwp [options] [file...]
      -e EXE, --exec EXE   Get list of dwo files from EXE (defaults output to EXE.dwp)

  Now this tool does need to know a bit of DWARF. At least enough to read the type signatures and split compile unit ids.
  It also needs to act a bit like a linker, concatenating the section data (but it doesn't need to do any relocations). It then creates a lookup table for the compile unit ids (.debug_cu_index) and type ids (.debug_tu_index), which list the offsets (and sizes) of the supporting sections (info, abbrev, line, loclists, str_offsets, macro, rnglists) so the indexes can be adjusted to those offsets. If there is a string table in the .dwo files, it should merge those and update the .debug_str_offsets tables.

  So a consumer will first look for a .dwp file matching a main/debug file and get the split compile units from there. If that file doesn't exist, it tries to get the individual .dwo files. Strangely, the lookup of the .dwp files isn't specified.

- DWZ or DWARF Supplementary Object Files .sup

  So what could we do if we do use a tool that "understands" DWARF data? Multi files. Very convenient for distro packaging.

  [...] examples [...]

- Some diagram? What comes from where?

  Maybe from DWARF spec. Doesn't even list sup references. Should list all the DW_FORM_ sups? DW_FORM_strp_sup, .gnu_debugaltlink, .debug_sup section.

- Advertisement

  If writing a DWARF consumer you might want to use a library. Why not try elfutils libdw?
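
  As a rough illustration of the string indirection from the DebugFission part, here is a minimal Python sketch of how a consumer might resolve a string index through .debug_str_offsets into .debug_str. The helper names (build_str_section, build_str_offsets, lookup_strx) are made up for this example. It assumes plain 32-bit little-endian offsets and models the table header as 8 opaque bytes, with `base` standing in for what DW_AT_str_offsets_base would provide. It is not how libdw or any real consumer is implemented.

```python
import struct


def build_str_section(strings):
    """Build a toy .debug_str: NUL-terminated strings, plus each string's offset."""
    data = bytearray()
    offsets = []
    for s in strings:
        offsets.append(len(data))
        data += s.encode("utf-8") + b"\x00"
    return bytes(data), offsets


def build_str_offsets(offsets, header=b""):
    """Build a toy .debug_str_offsets: optional header, then 32-bit LE offsets."""
    return header + struct.pack("<%dI" % len(offsets), *offsets)


def lookup_strx(index, str_offsets, debug_str, base=0):
    """Resolve a string index: index -> offset table entry -> string bytes."""
    # 'base' plays the role of the relocated start of the offset table
    # (an assumption standing in for DW_AT_str_offsets_base).
    (off,) = struct.unpack_from("<I", str_offsets, base + 4 * index)
    end = debug_str.index(b"\x00", off)
    return debug_str[off:end].decode("utf-8")


# Hypothetical data: three names a split unit might refer to by index.
debug_str, offsets = build_str_section(["main", "int", "argc"])
header = b"\x00" * 8  # placeholder header bytes, contents not modelled here
table = build_str_offsets(offsets, header)
base = len(header)
names = [lookup_strx(i, table, debug_str, base) for i in range(3)]
print(names)
```

  The point of the sketch is that the DIE data itself only ever holds small indexes. Only `base` depends on where the table ends up after linking, which is why the skeleton unit is the only place that still needs a relocation for strings.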