mind | EOSS2023 – Efficient and Practical Capturing of Crash Data on Embedded Systems

Applications crash. Deal with it. minicoredumper is a project that makes it easier to get information about crashes in live, deployed products. If you dump just the stack trace of the crashing thread, the core dump is so small that there’s really no argument not to store it.

We’re talking about crashes in the field here, not just in development.

A core dump is an image of the application’s memory at the time of the crash, stored in ELF format. The kernel produces it; you don’t need anything else to get one. It allows post-mortem, offline debugging. A core dump is so useful for debugging an application crash that you really should use it in deployment. It’s much better than trying to reproduce the problem with limited or no information.

The problem is that a core dump is very large, and it only contains information about the crashed process, not others that it works together with.

minicoredumper creates minimal, custom core dumps and can also take pre-crash state snapshots. It’s a userspace application. It has a configuration file that defines exactly what it should do. It can compress the core dump in memory before writing it to disk.

It hooks in through /proc/sys/kernel/core_pattern, which normally defines the path of core files. If the first character of the pattern is a |, the rest is treated as a command line and the kernel pipes the core dump to that program’s standard input. There are %-patterns to pass the PID and other process metadata as arguments.
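As a rough illustration of the pipe mechanism, here is a minimal Python sketch that splits a core_pattern value into the handler program and its %-specifiers. The specifiers listed are the ones documented in core(5); the example pattern and the /usr/sbin/minicoredumper install path are assumptions for illustration, so check your own setup.

```python
# %-specifiers documented in core(5); this is a selection, not the full list.
SPECIFIERS = {
    "%p": "PID of the dumped process",
    "%u": "numeric real UID",
    "%g": "numeric real GID",
    "%s": "number of the signal causing the dump",
    "%t": "time of dump, seconds since the epoch",
    "%h": "hostname",
    "%e": "comm value (usually the executable name)",
}


def parse_core_pattern(pattern):
    """Split a core_pattern value into (is_pipe, program, args)."""
    pattern = pattern.strip()
    if not pattern.startswith("|"):
        # Plain file name template: the kernel writes the core file itself.
        return False, None, []
    parts = pattern[1:].split()
    if not parts:
        return False, None, []
    # The kernel runs parts[0] and feeds the core dump to its stdin.
    return True, parts[0], parts[1:]


def read_core_pattern(path="/proc/sys/kernel/core_pattern"):
    """Read the pattern currently configured on this machine."""
    with open(path) as f:
        return f.read()


# Hypothetical pattern; the install path is an assumption.
EXAMPLE = "|/usr/sbin/minicoredumper %p %u %g %s %t %h %e"
is_pipe, program, args = parse_core_pattern(EXAMPLE)
for arg in args:
    print(arg, "->", SPECIFIERS.get(arg, "unknown specifier"))
```

Calling parse_core_pattern(read_core_pattern()) on a target shows whether a pipe handler is installed at all, which is a quick sanity check before chasing a missing dump.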

The configuration file is JSON. It specifies the path where dumps are stored, and matching rules that select application-specific dump configurations, called recepts. Each recept is a separate JSON file, making it easy to apply the same one to different matches. The matching rules use the fields that are passed as metadata through core_pattern.
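To make the rule matching concrete, here is a small Python sketch of how such a lookup could work. The key names (base_dir, watch, exe, comm, recept), the file paths, and the first-match-wins behaviour are all assumptions for illustration; they are not taken from the talk, so consult the minicoredumper documentation for the real schema.

```python
import fnmatch
import json

# Hypothetical configuration: key names and paths are illustrative only.
CONFIG_TEXT = """
{
    "base_dir": "/var/crash/minicoredumper",
    "watch": [
        {"exe": "*/sensor-daemon", "recept": "/etc/minicoredumper/sensor.recept.json"},
        {"comm": "ui-*", "recept": "/etc/minicoredumper/ui.recept.json"},
        {"recept": "/etc/minicoredumper/generic.recept.json"}
    ]
}
"""


def select_recept(config, metadata):
    """Return the recept of the first rule whose patterns all match.

    A rule without any pattern acts as a catch-all.
    """
    for rule in config["watch"]:
        patterns = {k: v for k, v in rule.items() if k != "recept"}
        if all(
            fnmatch.fnmatchcase(metadata.get(key, ""), pattern)
            for key, pattern in patterns.items()
        ):
            return rule["recept"]
    return None


config = json.loads(CONFIG_TEXT)
crashed = {"exe": "/usr/bin/sensor-daemon", "comm": "sensor-daemon"}
print(select_recept(config, crashed))
```

Because the recept is referenced by path, two rules can point at the same file, which is what makes sharing one recept between several applications cheap.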

A recept can extract just the crashing thread’s stack from the core dump, limit the stack size, and limit which maps are dumped. It can also specify individual symbols that should be grabbed. The compressor is simply an external executable that gets called. The dump can also be put in a tar file. Tar supports sparse files, i.e. if a piece of a file is just zeroes, tar will not store it. This spares the compressor from running its algorithm over all the zeroes, which saves some CPU time. A recept can also dump /proc info from the time of the crash.
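The idea behind sparse storage can be sketched in a few lines of Python. This is not how tar implements it; it only shows the principle of recording the non-zero runs of a buffer so that everything else can be skipped. The 4096-byte block size is an arbitrary choice.

```python
def data_extents(buf, block=4096):
    """Return (offset, length) runs of blocks containing any non-zero byte."""
    extents = []
    start = None  # offset where the current non-zero run began
    for off in range(0, len(buf), block):
        chunk = buf[off:off + block]
        if any(chunk):
            if start is None:
                start = off
        elif start is not None:
            # An all-zero block ends the current run.
            extents.append((start, off - start))
            start = None
    if start is not None:
        extents.append((start, len(buf) - start))
    return extents


# A mostly empty 16 KiB "dump" with data in the second and fourth block.
dump = bytearray(4 * 4096)
dump[4096] = 1
dump[3 * 4096 + 10] = 7
print(data_extents(dump))
```

Only the listed extents would need to be stored and compressed; the holes in between cost nothing but a little bookkeeping.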

The core dump on standard input is only used to read the ELF header. The rest is read from /proc. minicoredumper creates an ELF file just like a normal core dump, with the metadata needed by gdb. It also adds a custom note section to distinguish between memory that is actually 0 and memory that just wasn’t included in the dump. This requires a patched gdb though.
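Reading a process through /proc can be sketched as follows. This is a simplified illustration, not minicoredumper’s actual code: it parses /proc/<pid>/maps text and reads a mapping from /proc/<pid>/mem. Note that only the main thread’s stack carries the [stack] label, so finding the stack of an arbitrary crashing thread takes more work than shown here. The sample text is made up.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str


def parse_maps(text):
    """Parse the contents of /proc/<pid>/maps into Mapping tuples."""
    mappings = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(x, 16) for x in fields[0].split("-"))
        # Anonymous mappings have no sixth field.
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(start, end, fields[1], path))
    return mappings


def read_mapping(pid, mapping):
    """Read one mapping from /proc/<pid>/mem (needs sufficient privileges)."""
    with open(f"/proc/{pid}/mem", "rb") as mem:
        mem.seek(mapping.start)
        return mem.read(mapping.end - mapping.start)


# Made-up sample in the format of /proc/<pid>/maps.
SAMPLE = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /usr/bin/app
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
7ffc1a2b3000-7ffc1a2d4000 rw-p 00000000 00:00 0                          [stack]
"""

for m in parse_maps(SAMPLE):
    print(hex(m.start), hex(m.end), m.perms, m.path or "(anonymous)")
```

Selecting mappings by path or permissions before reading them is what keeps the resulting dump small: anything not selected is simply never copied.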

If you use minicoredumper, you should actually test whether the generated core dumps contain the information you need to debug a crash. You can easily simulate a crash with kill -SEGV.
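Such a test is easy to script. The Python sketch below starts a throwaway child process and sends it SIGSEGV, the equivalent of kill -SEGV. Whether a dump is actually produced, and where it ends up, depends on how the system is configured, so the script only confirms that the child died from the signal; inspecting the resulting dump is left to you.

```python
import signal
import subprocess
import sys


def crash_child():
    """Start a sleeping child, send it SIGSEGV, and return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    # Same effect as running `kill -SEGV <pid>` from a shell.
    child.send_signal(signal.SIGSEGV)
    return child.wait(timeout=10)


status = crash_child()
# subprocess reports death by signal N as a return code of -N.
print("child killed by SIGSEGV:", status == -signal.SIGSEGV)
```

In a real test you would replace the sleeping child with your own application in a representative state, then open the resulting dump in gdb and check that the backtrace and the variables you care about are present.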

libminicoredumper is a library that you can link into your application to specify what you want to be dumped. It makes it easier to grab the interesting data without taking the full heap. You can also register printable text. Data can be directed to a file other than the core file, which is particularly useful for the text output. There’s a separate tool to inject the external files back into the core file. The library works by defining two symbols with a linked list of pointers to the data you want dumped.

With libminicoredumper it’s also possible to do live dumps, i.e. snapshots. Note that a snapshot is not instantaneous, e.g. minicoredumper has to start up and parse its configuration file. Latency is 2-30ms to the first dump. libminicoredumper registers itself with an external daemon via a UNIX domain socket. That daemon freezes the PIDs that need to be dumped and extracts their data. Note that the freeze means that if one process crashes, the other registered processes are temporarily stopped as well.