Impressions from Linaro Connect 2024 (part 2)

Arnout Vandecappelle
10/06/2024

This is part 2 of my report of Linaro Connect – you can find part 1 here. This article gives my own reflections about some of the talks I attended. If one of them looks interesting, go to the linked resources to view the slides or the video recording.

Rethinking the kernel system call entry – Arnd Bergmann (Linaro)

slides & video – clicking starts download of PDF!

A system call is a complex beast. Typically there’s a libc wrapper around the actual system call, which uses syscall() to put the system call number and the arguments in the right registers (or on the stack, depending on the architecture) and to do the low-level signaling to the kernel. On the kernel side, there’s a stack of functions that wrap each other to handle different parts of the system call. All architectures do this slightly differently. This leads to unreadable code, which makes it prone to bugs and exploits. In addition to these good reasons to rework and simplify it, it also sounds like fun.

The first and most impactful cleanup is to improve the use of the system call table. Many architectures already have a table that enumerates the system calls, their numbers, and which wrapper they use. For some architectures, however, there is only plain code. Even for the architectures that do use a system call table, the table is copied everywhere – right now you have to update 25 files to add a new system call. All new system calls, however, use the same number on all architectures, so they can be put in a common table that is used by all architectures. Finally, the Makefile that converts the table to code is copied for each architecture (with slight differences). This can be factored into a single Makefile, with additional flags specified in the arch-specific Makefile.
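For illustration, this is what entries in such a table look like; the two system calls below indeed carry the same number on all architectures:

    # <number> <abi>   <name>        <entry point>
    435        common  clone3        sys_clone3
    436        common  close_range   sys_close_range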

A second major change is to clean up the system call wrappers – i.e. the functions that handle arguments and context from userspace. These wrappers are macros. They are extremely complex because they need to take care of a lot of little details. Because they are macros, the code is very hard to understand and review. Arnd would like to replace them with code that is generated from the syscall table – the header is already generated from it, so the C code could be generated just as well. This generated C code is much easier to understand than the macro, so we can review whether the wrappers are in fact correct.
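To make this concrete: today a system call handler is defined through a SYSCALL_DEFINEn() macro, which hides the wrapper. The first block below is the real definition of read() from fs/read_write.c; the second is my own sketch of what an equivalent generated arm64 wrapper could look like, not actual generator output:

    /* today: defined via a macro (fs/read_write.c) */
    SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
    {
            return ksys_read(fd, buf, count);
    }

    /* sketch of an equivalent generated wrapper */
    long __arm64_sys_read(const struct pt_regs *regs)
    {
            return ksys_read((unsigned int)regs->regs[0],
                             (char __user *)regs->regs[1],
                             (size_t)regs->regs[2]);
    }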

The information in the system call table currently has to be duplicated in a bunch of projects: glibc and other userspace wrappers, qemu, strace, perf. If the kernel exported the system call table, the situation for these projects would become simpler.

All of the above is still work in progress. I couldn’t find any patches submitted upstream that do any of these nice improvements.

Simple, Yocto, Secure Boot: Using new systemd features to pick all three! – Erik Schilling (Linaro)

slides & video (https://resources.linaro.org/en/resource/awmFCMBZgXWgvGvWug1gY1) – clicking starts download of PDF!

This talk is about integration problems of existing features.

Secure boot today includes firmware, kernel, initramfs and userspace. This talk is about the userspace parts. The assumption is that the initramfs goes with the kernel in an EFI image, the Unified Kernel Image (UKI), while the userspace executables etc. go in a read-only filesystem, and the user data goes in an encrypted writable partition.

For the user data, we want to encrypt it with a key that is tied to the TPM. Currently, there is no standard way of creating and populating this partition, and it’s actually quite complicated. systemd-repart comes to the rescue here. It’s a tool that at runtime checks if all partitions are present and creates them if needed. It supports tpm2 encryption out of the box. So the factory image simply doesn’t have the user data partition. On first boot, systemd sees that the partition doesn’t exist, so it creates it based on the configuration file, which specifies that it should use LUKS with the key enrolled into the TPM.
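A minimal sketch of such a configuration, assuming an encrypted ext4 /var partition (the file would live in e.g. /usr/lib/repart.d/50-var.conf):

    [Partition]
    Type=var
    Format=ext4
    Encrypt=tpm2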

Yocto supports a read-only rootfs, but that means /etc is not writable. Many programs assume that they can write to /etc. So the alternative is to leave everything in / empty except /usr; systemd will then populate the rest of / (including /etc) based on tmpfiles stanzas.
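These stanzas are one-line directives in tmpfiles.d configuration files; two illustrative examples (the file names are just examples):

    # symlink a machine-independent /etc file straight into /usr
    L /etc/os-release - - - - ../usr/lib/os-release
    # copy the factory default from /usr/share/factory/etc/ on first boot
    C /etc/nsswitch.conf - - - -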

Now /usr has to be secured. One solution is to put /usr in a separate filesystem and protect it with dm-verity – this is best integrated with systemd. dm-verity creates a hash tree of the entire block device, and the root of that tree is signed. It verifies this signature at boot time and verifies the hashes every time a block is accessed. For Yocto, there is a dm-verity-img.bbclass in meta-security, but you still need to integrate the generated metadata into the final image. This can again be solved with systemd-repart, which supports generating the dm-verity metadata volume. By running systemd-repart in “offline” mode, we can create the entire image, including dm-verity metadata, on the build machine. In other words, systemd-repart can replace wic.
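A sketch of how the /usr partition and its verity metadata could be described declaratively, assuming two repart.d files tied together with VerityMatchKey=:

    # 20-usr.conf
    [Partition]
    Type=usr
    Verity=data
    VerityMatchKey=usr

    # 21-usr-verity.conf
    [Partition]
    Type=usr-verity
    Verity=hash
    VerityMatchKey=usr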

We still need to include the signed root hash somewhere. The UAPI group has defined partition type UUIDs for various partitions: root, /usr, but also the verity data and the verity signature for each of them. So we just need to create a partition of that type to store the signature, and it will be automatically discovered.

Since partitions are auto-discovered, we need an initramfs to do the discovery (and set up verity). But that means we have to verify the initramfs as well. EFI only has a protocol for verifying EFI executables, not for non-executable binaries. In addition, ARM SystemReady doesn’t even mandate supporting verification of an EFI executable. The solution is to put everything in one big EFI executable. The boot loader needs to know how to extract the initramfs (and other data like device tree) from it, and UEFI will verify the entire thing. UKI specifies how to put things in the UKI image so all bootloaders interpret it the same. Note that currently only a single device tree in the UKI is supported.
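systemd’s ukify tool can assemble such an image; a sketch, with all file names and the command line as placeholders:

    ukify build \
        --linux=vmlinuz \
        --initrd=initramfs.cpio.gz \
        --devicetree=board.dtb \
        --cmdline="console=ttyS0 ro" \
        --output=board-uki.efi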

All of this is working, but not entirely integrated or upstreamed to Yocto yet.

Kria Dynamic Board-ID & Device Tree Selection – Wesley Skeffington, Michal Simek (AMD)

slides & video (https://resources.linaro.org/en/resource/q7U3Rr7m3ZbZmXzYK7A9u3) – clicking starts download of PDF!

Kria is a Xilinx (AMD) SoM for which several carrier cards with various I/Os can be defined. This hardware modularity is nice, but how do you manage all these combinations in software? The different carrier cards must be exposed to the FPGA bitstream, to the kernel and to the applications. Ideally, you should be able to have a single common firmware image that runs with all carrier cards. The carrier card can be detected automatically with an I2C EEPROM that is installed both on the SoM and on the carrier card. This has to be used to load the correct DTB.

Not only the carrier cards create differences; the FPGA bitstream itself does too. The functionality exposed by the FPGA bitstream needs a corresponding kernel driver, and there are use cases where the FPGA should be reprogrammed while the application keeps running. For example, in automotive you may have different bitstreams for parked, city driving, and highway driving. The switch has to be fast, so you can’t reboot the entire system.

For both use cases, the feature to use is device tree overlays. However, both U-Boot and Linux have limitations. U-Boot supports applying an overlay, but only to the device tree that is passed to Linux, not to the device tree it uses itself. U-Boot does, however, support automatic selection of a device tree based on a compatible string constructed from the EEPROM IDs. This was extended with wildcards, because many carrier cards are not relevant to U-Boot.
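On the U-Boot side, this plugs into the FIT configuration selection; a minimal sketch of the board hook involved, where the EEPROM helper and the compatible format are made up for illustration:

    /* U-Boot calls this for each configuration in the FIT image;
     * return 0 for the one matching the detected carrier card. */
    int board_fit_config_name_match(const char *name)
    {
            char expected[64];

            /* kria_carrier_id() is a hypothetical helper returning the
             * ID string read from the carrier card EEPROM */
            snprintf(expected, sizeof(expected), "zynqmp-sm-%s-revA",
                     kria_carrier_id());

            return strcmp(name, expected) ? -EINVAL : 0;
    }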

Linux has support for loading FPGA bitstreams dynamically with fpga-mgr. This also supports loading and unloading device tree overlays. In fact, the FPGA bitstream itself is specified in a device tree overlay. Of course, it turns out that unloading often leads to problems, because drivers don’t expect this to happen and don’t work correctly when the device is removed. In userspace, libdfx helps with managing the loading of the FPGA bitstream.
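Such a bitstream is typically shipped as a device tree overlay against the FPGA region node; a simplified sketch, with names and addresses purely illustrative:

    /dts-v1/;
    /plugin/;

    &fpga_region0 {
            /* fpga-mgr loads this image when the overlay is applied */
            firmware-name = "city-driving.bit.bin";

            /* peripherals implemented by this bitstream */
            gpio@a0000000 {
                    compatible = "xlnx,xps-gpio-1.00.a";
                    reg = <0xa0000000 0x1000>;
            };
    };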

There’s a lot of upstreaming work still to be done, although they’ve been working upstream all along.

Developing and Deploying Software for Hybrid Systems – Chris Adeniyi-Jones (Arm)

slides & video (https://resources.linaro.org/en/resource/cgzZUpmBPdkrqVR7hUeSVC) – clicking starts download of PDF!

A hybrid system is one that has more than one type of processor, e.g. a Cortex-A and a Cortex-M. Usually there are multiple of each. Each processor has a bit of its own memory and a bit of memory that is shared with other processors. Devices (peripherals) are assigned to one of the processors.

As a demonstrator, Chris uses an NXP i.MX 8M Mini EVK. The Cortex-A software was built with Yocto; the Cortex-M runs FreeRTOS with the MCUXpresso SDK. To communicate between the two, Linux uses remoteproc and rpmsg: remoteproc loads firmware into the coprocessor and starts and stops it, and rpmsg exchanges messages with it. There’s a corresponding library for integration on the Cortex-M side. All this works out of the box.
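On the Linux side this is driven through sysfs; assuming the firmware file has been installed in /lib/firmware, starting and stopping the Cortex-M looks like this (the firmware name is just an example):

    # select and boot the Cortex-M firmware
    echo rpmsg_str_echo_freertos.elf > /sys/class/remoteproc/remoteproc0/firmware
    echo start > /sys/class/remoteproc/remoteproc0/state
    # stop it again
    echo stop > /sys/class/remoteproc/remoteproc0/state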

However, the configuration is spread out over several files, and in particular it’s partially duplicated across the different subsystems. For example, the memory map has to be specified both to MCUXpresso and to Linux, and the two have to be consistent, of course.

Another limitation is that the partitioning (memory map) has to be done statically, it’s not easy to change it at runtime. The partitioning should also be enforced by the system registers, so that Linux can’t (accidentally) access the peripherals assigned to the Cortex-M and vice versa. Unfortunately, there is no common driver infrastructure that automatically applies this.

Finally, the naming in device trees is not standardized. The right names are needed for the userspace remoteproc and rpmsg calls to find the devices. These non-standard names make it more difficult to migrate to a different SoC variant.

OpenAMP is a project to improve support for hybrid systems. The task they took on is the definition of the System Device Tree, which spans the Cortex-A and the Cortex-M. It also includes partitioning and memory map information. From this, a DT is generated for Linux and an MCUXpresso description for the Cortex-M. Clearly, this would make even more sense if Zephyr were used on the Cortex-M side, but that hasn’t been tried yet.
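To give an idea, here is a rough sketch of the kind of information a system device tree adds on top of a normal DT; the node and property names loosely follow the draft specification and may not match the current proposal exactly:

    / {
            domains {
                    linux_domain {
                            compatible = "openamp,domain-v1";
                            cpus = <&a53_cluster 0xf 0x0>;   /* which cores */
                            memory = <0x0 0x40000000 0x0 0x3fe00000>;
                            access = <&uart0>, <&ethernet0>; /* owned devices */
                    };
                    rtos_domain {
                            compatible = "openamp,domain-v1";
                            cpus = <&m4_cluster 0x1 0x0>;
                            memory = <0x0 0x80000000 0x0 0x200000>;
                            access = <&uart1>, <&i2c0>;
                    };
            };
    };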

Architecting for CPU performance on the next phase of AI workloads everywhere on Arm – Nick Horne (Arm)

slides & video – clicking starts download of PDF!

Arm’s strategy is based on a few premises:
– ML will replace classic heuristics or be combined with them; therefore, Arm needs to adapt its hardware and software platforms.
– New types of ML networks will keep emerging. A recent example is transformer networks (they’ve existed for a while already but are only picking up steam now). Therefore, Arm needs to create future-proof solutions.
– Demand for compute is still insatiable, so Arm must keep a focus on performance, area and energy efficiency.
– Standardisation is inevitable (so the same software can run on a variety of hardware), so Arm must be involved in it.
– A vibrant ecosystem benefits everyone, so Arm should plug into this ecosystem and support it.

CPU, GPU and NPU are all relevant. CPU is in fact more efficient for small networks because otherwise overheads start to accumulate. Also, it’s the most flexible to support future networks that are built on different principles. NPUs require a lot more effort to map to and are not flexible. ARMv7, ARMv8 and the upcoming ARMv9 all have extensions dedicated to improving machine learning efficiency on the general-purpose CPU.

A large part of current machine learning is built on Large Language Models (LLMs). LLM inference has two phases. The first phase is the encoder. It’s compute bound, and it determines the latency until the first word comes out. The second phase, the decoder, is memory bound and determines the text generation speed. Compute bound means you need more parallel ops, which can be achieved with smaller bit widths (e.g. 8-bit float). For the memory-bound part, compression techniques are key. This can be as simple as downscaling to e.g. 4-bit integers, but also involves approximations and memory layout. They optimised a LLAMA3 model with 4-bit blockwise quantization and achieved a 60% improvement for the encoder and 30% for the decoder.
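A back-of-the-envelope calculation (mine, not from the talk) shows why this matters: to generate one token, the decoder has to stream essentially all weights through memory. For a hypothetical 8-billion-parameter model, 16-bit weights amount to 16 GB, so a SoC with 50 GB/s of memory bandwidth is capped at roughly 3 tokens/s no matter how fast the CPU is; 4-bit weights shrink that to 4 GB, allowing about 12 tokens/s.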

Arm develops highly optimized kernels (the Arm Compute Library and Arm NN). They don’t really care what software gets used; the important thing is that the CPU can be used efficiently by users. It’s just that Arm is in a good position to do these optimisations. So the kernels are fully open source with a very permissive license. They also work on (contribute to) code generation projects like TVM and MLIR.

How to stay sane: Long Term Support, Government Mandates and Open Source – Thomas Gall (Linaro)

slides & video – clicking starts download of PDF!

Linaro is starting an LTS program for Linux and U-Boot. Surprisingly, however, Thomas Gall never mentioned this program in his talk. Instead, it focused on advice about how open source communities should structure their Long Term Support.

LTS is basically a branch that is maintained for a period of time, so the users of that branch don’t get exposed to breakage due to new features. Obviously, this costs resources because you have to duplicate some effort.

For users, the following policies are important to know about:
– How long will the branch be maintained?
– What is the policy for which patches are accepted? E.g. the Civil Infrastructure Platform kernel accepts new features under certain conditions.
– What is the decision process to pull in a fix? How is it decided whether a patch falls within the policy?
– How is the relationship between mainline and LTS patches tracked? In particular, how do you keep track of which patches need to be backported, and which ones already have been? What happens if a fix from mainline doesn’t “just apply”?
– What is the impact on mainline development? What additional burdens are placed on mainline to make life easier for LTS maintenance? E.g. the requirement to have Fixes: tags on mainline commits (see the example below).
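For the Linux kernel, that last point takes the form of trailers in the mainline commit message; the hash and subject below are placeholders:

    Fixes: 123456789abc ("subsys: commit that introduced the bug")
    Cc: stable@vger.kernel.org # 6.1.x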

Two examples were discussed in a bit more detail: U-Boot and the Linux kernel.

U-Boot doesn’t have an LTS at all, it just releases twice per year and that’s it. In the mainline commits, there is no formal identification of when they are fixes, and if so, what they fix. In addition, U-Boot shares code with the Linux kernel (both by copying verbatim and by copying the structure), but it’s not clear if fixes that have been applied to the kernel have also been applied to U-Boot. All this makes it much harder to start maintaining an LTS branch – it’s either a lot of work to maintain it, or the community processes have to change to accommodate the LTS.

Linux is a bit of a gold standard for LTS. It started in 2006 with 2.6.16. There is a clear patch policy. One weird thing in this policy is that it only picks up fixes from mainline. This means that, if an issue was accidentally fixed in mainline by a refactoring, that issue is never going to be fixed in the stable tree. That is the reason that vendor and distro kernels don’t simply follow the stable tree: they care about actual issues seen by their customers, so they will apply fixes that never appeared in mainline.

The new regulations in the US and EU have a few similar pillars: there should be a process for detecting vulnerabilities and for applying security updates, security updates should be applied within a certain timeframe (4 months, although it depends), and software updates should be provided for at least 5 years. To reach the latter, some form of LTS is needed – but it can also be combined with updating the software components. There is unfortunately not a lot of data that can guide us in choosing between LTS and updating.

OP-TEE device drivers frameworks – Etienne Carrière (ST)

slides & video – clicking starts download of PDF!

Etienne is one of the most knowledgeable people when it comes to OP-TEE, and one of the two main contributors to the project, so it was really interesting to hear him speak about it. The subject of this talk was device drivers in OP-TEE. It is not very difficult to write one, so go ahead and do it!

When you do something in OP-TEE, you’re probably doing something crypto related. OP-TEE has software implementations of many operations, but sometimes it’s useful to have hardware assistance. You also need a few drivers for resource management, like regulators, clocks and buses. OP-TEE doesn’t have persistent storage drivers; it assumes the non-secure world handles that. E.g. for RPMB on eMMC, OP-TEE holds the key to it, but the actual access goes through the non-secure world.

A driver is defined by a struct with operation callbacks that is passed to a register function, e.g. drvcrypt_register_authenc() to register a structure with 9 functions that are used to perform authenticated encryption. The register function is specific to the framework in which the driver fits, e.g. crypt, regulator, clock. All register functions are called early in the boot process, and then devices are initialised in a specific order. This order is important to handle dependencies between drivers. The talk went into a lot of detail about which ordering is used in which situations. In summary, every driver chooses an “init level” at which it wants to be initialised. If, during initialisation, it turns out that some resource it needs (clock, regulator, …) is not available yet, it can return a “defer” error and it will be called again later. This is not a very efficient way of handling dependencies, but since in practice there aren’t that many drivers loaded in OP-TEE, it works well enough.
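A minimal sketch of this pattern for an authenticated-encryption driver; drvcrypt_register_authenc() comes from the talk and driver_init() from OP-TEE’s initcall framework, while the ops contents and the clock check are made up for illustration:

    static struct drvcrypt_authenc my_authenc_ops = {
            /* the 9 callbacks: .init, .update_aad, .update_payload, ... */
    };

    static TEE_Result my_authenc_probe(void)
    {
            /* hypothetical dependency: defer until the crypto engine's
             * clock driver has been initialised */
            if (!my_crypto_clock_ready())
                    return TEE_ERROR_DEFER_DRIVER_INIT;

            return drvcrypt_register_authenc(&my_authenc_ops);
    }

    /* run the probe at the "driver" init level */
    driver_init(my_authenc_probe);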

This talk is absolutely worth watching in full if you need to learn about writing drivers in OP-TEE.

Demo Friday

The final event of Linaro Connect is Demo Friday. It takes place during the last lunch break of the conference, which is extended to three hours to allow people to visit all the demos. The demos are real technical demos, not just commercial nonsense, and they’re mostly given by the people who worked on them. They offer an interesting opportunity to discuss these topics with the technical people.
