mind | EOSS2023 – Designing to the Worst Case Scenario – Practical System Call Filtering with Seccomp

Seccomp is a tool to mitigate security vulnerabilities in your device. This talk explains how to use it in practice.

The primary goal is that the device performs its function, but you also want to protect against misuse.

Userspace can only interact with hardware or its environment through system calls (which are usually wrapped in a system library). [There was a lot of explanation of really basic things, like fork() and execve().] System calls have some built-in access control (e.g. sys_open() checks whether the user has access to the file being opened). Seccomp is a mechanism for putting additional, arbitrary filtering on system calls. libseccomp is a userspace library that provides a nice high-level API to this functionality. The filters can be set up either by the parent process or by the child itself. When a process attempts a disallowed syscall, it is typically killed, but a number of other actions can be specified instead, including logging the call or making it fail with an errno. There’s also a global default action that applies to any syscall not matched by a rule.

The typical setup is a call to seccomp_init() to set the default action, a series of seccomp_rule_add() calls, and finally seccomp_load() to apply them.

Seccomp rules define the action to be taken for a syscall, and its arguments can be checked as well. For checking arguments, there’s a complicated set of macros that can be used, but it’s basically integer comparison with masking.
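
To make the "default action plus rules plus argument checks" idea concrete, here is a small pure-Python model of how such a filter resolves a syscall to an action. It is only a sketch: it does not call libseccomp or install a real filter, the names ToyFilter, ArgCheck and COMPARATORS are made up for illustration, and "first matching rule wins" is a simplifying assumption. The operators are modelled after libseccomp's SCMP_CMP_* comparisons, where the masked variant checks (argument & mask) == value.

```python
from dataclasses import dataclass, field

# Comparison operators modelled after libseccomp's SCMP_CMP_* family.
# Each takes (actual_argument, datum_a, datum_b) and returns a bool.
COMPARATORS = {
    "EQ": lambda arg, a, b: arg == a,
    "NE": lambda arg, a, b: arg != a,
    "LT": lambda arg, a, b: arg < a,
    "GE": lambda arg, a, b: arg >= a,
    # For the masked comparison, datum_a is the mask and datum_b the value.
    "MASKED_EQ": lambda arg, mask, value: (arg & mask) == value,
}


@dataclass
class ArgCheck:
    """One integer comparison on a single syscall argument (hypothetical)."""
    index: int        # position of the argument, 0..5
    op: str           # key into COMPARATORS
    datum_a: int
    datum_b: int = 0

    def matches(self, args):
        return COMPARATORS[self.op](args[self.index], self.datum_a, self.datum_b)


@dataclass
class ToyFilter:
    """Toy stand-in for the seccomp_init / seccomp_rule_add sequence."""
    default_action: str
    rules: list = field(default_factory=list)

    def rule_add(self, action, syscall, *checks):
        self.rules.append((action, syscall, checks))

    def decide(self, syscall, args=()):
        # Simplification: the first rule whose name and checks match wins.
        for action, name, checks in self.rules:
            if name == syscall and all(c.matches(args) for c in checks):
                return action
        return self.default_action


# Linux values of the open(2) flag constants used below.
O_RDONLY, O_WRONLY, O_ACCMODE, O_CLOEXEC = 0o0, 0o1, 0o3, 0o2000000

flt = ToyFilter(default_action="KILL")
flt.rule_add("ALLOW", "read")
flt.rule_add("ALLOW", "write", ArgCheck(0, "EQ", 1))  # stdout only
# openat(dirfd, path, flags, mode): flags is argument 2; allow read-only opens.
flt.rule_add("ALLOW", "openat", ArgCheck(2, "MASKED_EQ", O_ACCMODE, O_RDONLY))

print(flt.decide("write", (1, 0, 0)))                        # ALLOW
print(flt.decide("write", (2, 0, 0)))                        # KILL
print(flt.decide("openat", (-100, 0, O_RDONLY | O_CLOEXEC)))  # ALLOW
print(flt.decide("openat", (-100, 0, O_WRONLY)))              # KILL
print(flt.decide("socket", (1, 1, 0)))                        # KILL
```

The masked check is what lets the openat rule ignore unrelated flag bits such as O_CLOEXEC while still rejecting any write access mode.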

A practical way to work out which syscalls to allow is to run the program with a filter that logs instead of killing the process. The log contains the details of all syscalls, including arguments. Each system call is logged by number; use scmp_sys_resolver to find the corresponding syscall name. Alternatively, you can use strace to obtain a similar list. Use -f to follow forks into children and grandchildren (-ff additionally writes one output file per process when combined with -o), which is useful for tracing containerized processes.
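
As a rough sketch of the strace route, the following Python snippet collects the set of syscall names from strace output, which can then serve as a starting point for an allowlist. The helper name syscalls_in_trace and the sample trace are invented for illustration; the regular expression assumes the usual strace line shapes (an optional "[pid N]" or bare PID prefix followed by name and opening parenthesis) and may need adjusting for other output options.

```python
import re
from collections import Counter

# Syscall name at the start of a strace line, after an optional
# "[pid N] " prefix (strace -f) or "N " prefix (strace -f -o file).
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_in_trace(lines):
    """Count syscalls by name; signal, exit and 'resumed' lines are skipped."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample in the style of `strace -f` output, for illustration only.
SAMPLE_TRACE = r"""execve("/bin/true", ["true"], 0x7ffc0a1b2c30 /* 12 vars */) = 0
[pid  4242] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid  4242] read(3, <unfinished ...>
[pid  4243] write(1, "hi\n", 3) = 3
[pid  4242] <... read resumed>"\177ELF", 832) = 832
[pid  4243] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=4244, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
[pid  4242] exit_group(0) = ?
+++ exited with 0 +++"""

counts = syscalls_in_trace(SAMPLE_TRACE.splitlines())
print(sorted(counts))
```

A trace only covers the code paths that were actually exercised, so rarely hit paths (error handling, for instance) can need syscalls that never show up in the list.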

systemd can apply seccomp filtering from the service unit file, with SystemCallFilter. It also has predefined system call sets, e.g. @network-io, which can be used to disallow network access. There’s a basic allowlist for the very basic things, like sleeping.

Container management systems (Docker, LXC) can similarly lock down the container’s syscalls. LXC has lxc.seccomp.profile, which can be either an allowlist or a denylist. The denylist can specify an action for each syscall. However, these filters are applied pretty early, so the syscalls made by LXC itself need to be allowed as well. Docker has the --security-opt seccomp=... option, which points to a JSON file. A blocked syscall returns -EPERM. Some filters are applied by default to unprivileged containers. In the JSON file it’s possible to add argument filtering as well.
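
To give an idea of what such a Docker JSON file looks like, here is a minimal sketch that generates one with an errno default action, a plain allowlist and one argument-filtered rule. The helper name docker_profile and the chosen syscalls are illustrative assumptions, and this three-syscall allowlist is far too small to start a real container. The field names (defaultAction, syscalls, names, action, args, index, value, valueTwo, op) follow Docker's seccomp profile format.

```python
import json
import socket


def docker_profile(allowed, conditional=()):
    """Build a Docker-style seccomp profile as a dict (hypothetical helper).

    allowed:     syscall names permitted unconditionally
    conditional: (name, arg_index, op, value) tuples, allowed only if the
                 given argument satisfies the comparison
    """
    profile = {
        # Anything not listed below fails with an errno (EPERM).
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(allowed), "action": "SCMP_ACT_ALLOW"},
        ],
    }
    for name, index, op, value in conditional:
        profile["syscalls"].append({
            "names": [name],
            "action": "SCMP_ACT_ALLOW",
            "args": [
                {"index": index, "value": value, "valueTwo": 0, "op": op},
            ],
        })
    return profile


profile = docker_profile(
    allowed={"read", "write", "exit_group"},
    # Allow socket() only when argument 0 (the domain) is AF_UNIX.
    conditional=[("socket", 0, "SCMP_CMP_EQ", int(socket.AF_UNIX))],
)
print(json.dumps(profile, indent=2))
```

The output can be saved to a file and passed to --security-opt seccomp=... as described above; in practice it is usually easier to start from a known-good profile and remove syscalls than to build an allowlist from nothing.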