Embedded Recipes 2023 Day 2 – part 1, Paris, France

The second day of Embedded Recipes got started right away with some interesting talks.
The video of the live stream can be viewed on YouTube. At the time of writing, the cut clips for each talk are not yet available.

Arnout Vandecapelle, embedded software consultant, Mind (Essensium division)
11/10/2023

Accelerated ML at the edge with mainline – Tomeu Vizoso, Freelancer

slides

It’s difficult to use mainline Linux when there are binary drivers. If you’re not using mainline, it’s unlikely that you’ll contribute back. In the past, GPU drivers often used to be binary – fortunately that’s changing. Now, however, we have the same situation again for NPUs (Neural-network Processing Units).

At the kernel level, NPUs and GPUs have a lot in common. Compute-only drivers tend to be relatively small but are often out-of-tree. For inclusion in the DRM subsystem, there also needs to be an open source userspace (to allow testing), which is often missing. For GPUs, there are open standards like OpenGL that define the API for them. That makes it easier to create an open source userspace. For NPUs, however, no such standards exist. Instead, vendors often ship binary-only forks of various ML projects like TensorFlow.

Tomeu proposes to choose one ML framework and bless it as the standard userspace API. Other ML frameworks can then use that API. The plan is to use TensorFlow Lite (TFLite), in Mesa, with Gallium as the HAL. They’ll start with VeriSilicon’s Vivante NPU, which is used in several SoCs and already has the Etnaviv kernel driver. TFLite’s Teflon delegate is used for mapping onto the GPU/NPU.

[ Arnout: the rest of the talk was too hard to follow. ]

The work requires a lot of reverse-engineering of the binaries. Various tools can be used to observe and analyze the blobs that get sent to the NPU.

In the future, the concept should be extended to include mapping OpenCL code to NPUs. Frontends should also be added for frameworks other than TFLite.

One image to rule them all: portably handling hardware variants – Ahmad Fatoum, Pengutronix

slides

In an embedded project, you often start out with a single board, but then develop multiple variants. Sometimes it can be just some DNP (Do Not Populate) components in a single PCB design, or a component that goes EOL and needs to be replaced with a (pin-compatible) new one. Other times, it’s really different PCB designs, or even with a different SoC.

This can be handled either with multiple images, one for each variant, or with a single image. Using multiple images has the advantage that everything is contained in the build system, and it’s possible to tune the compiler, kernel, etc. to the specific CPU. It gives the smallest image size and carries less risk of breaking the already-existing variants. In Yocto, multiple images are handled by creating a different MACHINE for each variant – it’s still possible to factor out common parts in an include, or to base one machine on another and add overrides.

A single image, on the other hand, shortens the build time and increases shared code. Fewer artifacts simplify CI and testing. It’s also easier for the user: there is no risk of choosing the wrong image to flash/update. E.g. it’s possible to create a single rescue USB stick that works on all product variants.

In a single image, all software uses shared code but with dynamic configuration at runtime. Basically, each software component reads configuration files, and different variants get different configuration files.

For the kernel, a hardware description is needed, in ACPI or device tree. Theoretically, since it describes the hardware, it should never need to be updated. In practice, however, bindings change, and some things are put in device tree which are not strictly hardware related. Therefore, the device tree should be shipped together with the kernel. There are many specifications that allow multiple device trees to be shipped: bootloader specification, distroboot, U-Boot’s FIT images, Unified Kernel Image (UKI). FIT has a few pitfalls however. Load and entry address are hardcoded, so the bootloader should somehow figure it out. The bootloader also hardcodes the configuration name. Basically, the bootloader needs to figure out which hardware it is running on and chooses the kernel’s device tree based on that.

For now, we assume that the bootloader is outside the main storage area (e.g. in the eMMC boot partitions) and isn’t updated – see below for more thoughts on the bootloader. So in the build system, we need a single build to produce a different bootloader image for each board. Barebox and TF-A have native support for this. Yocto has UBOOT_CONFIG, a list of configurations; the U-Boot build is re-run for each configuration in the list.

Once a device tree is selected, the bootloader can still apply fixups to it. This is a good way to handle variants that are pretty similar, e.g. just one or two components in the device tree that change. In barebox, this can be done in a script. In U-Boot, you need to write custom C code for it – and don’t forget to call fdt_increase_size(). Make sure that nodes are looked up by alias, so if a node moves around in the device tree the fixup still applies. You can also look up by compatible.
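The advice to look nodes up by alias can be illustrated with a small model. This is not bootloader code: the device tree is modelled as a nested Python dict, and the node names, the rtc0 alias and the helper names are all hypothetical. A real fixup would use the bootloader's own device tree API.

```python
# Toy model of a device tree: nested dicts for nodes, plain values for properties.
# All node names and aliases here are made up for illustration.

def find_node(tree, path):
    """Return the node at an absolute path such as '/soc/i2c@1000/rtc@68'."""
    node = tree
    for name in path.strip("/").split("/"):
        node = node[name]
    return node


def disable_by_alias(tree, alias):
    """Resolve the alias via /aliases, then mark that node as disabled."""
    path = tree["aliases"][alias]
    find_node(tree, path)["status"] = "disabled"


def make_tree(rtc_path):
    """Build a tree in which the RTC node lives at rtc_path."""
    tree = {"aliases": {"rtc0": rtc_path}}
    node = tree
    for name in rtc_path.strip("/").split("/"):
        node = node.setdefault(name, {})
    node["status"] = "okay"
    return tree


# The same fixup works on both layouts because it never hard-codes the path.
old_layout = make_tree("/soc/i2c@1000/rtc@68")
new_layout = make_tree("/soc/bus@0/i2c@1000/rtc@68")
for tree in (old_layout, new_layout):
    disable_by_alias(tree, "rtc0")
```

Because only the alias is referenced, the fixup keeps working when the node is moved under a different parent bus.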

Overlays are a special case of device tree fixups that apply mostly to extension boards. Note that overlays can’t delete nodes – but they can change the status to disabled. There’s a patch set that allows applying overlays in Linux, but that has only been partially merged and hasn’t seen any progress in years. So don’t bother with that, just do overlays in the bootloader.

In userspace, avoid hard-coding paths to device nodes or other hardware identifiers. Instead, use symlinks. Many symlinks are created automatically, e.g. /sys/bus/…/devices/ rather than the absolute path, or the /dev/disk/by-* symlinks. You can also create symlinks to /dev nodes in udev rules – for example for serial ports. Block device paths are a bit problematic here – in an A/B setup, the two partitions will have the same UUID and label. Therefore, you need to write your own rules to create symlinks based on information about which is the current rootfs. For things that are described in device tree, it’s easy to write a udev rule that creates a symlink based on the alias:

ACTION=="add", ENV{OF_ALIAS_0}=="?*", \
RUN+="/bin/mkdir -p /dev/by-ofalias", \
RUN+="/bin/ln -sfn /sys%p /dev/by-ofalias/%E{OF_ALIAS_0}"
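
On the application side, such symlinks can be resolved at startup instead of hard-coding a device path. A minimal sketch, assuming the /dev/by-ofalias directory created by the rule above; the helper name and any alias passed to it (for example serial1) are hypothetical.

```python
import os


def resolve_alias(alias, base="/dev/by-ofalias"):
    """Return the real path behind a by-alias symlink, or None if it is absent."""
    link = os.path.join(base, alias)
    if not os.path.islink(link):
        # This variant does not have the device, or the udev rule did not run.
        return None
    return os.path.realpath(link)
```

An application would call resolve_alias("serial1") and treat a None result as "this board variant has no such device", rather than failing on a missing hard-coded path.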

In systemd, it’s possible to enable/disable services based on the existence of nodes. It’s even possible to match the board compatible with ConditionFirmware. Similarly, systemd-networkd can use matches. Configuration files for applications can be written to /run by a pre-start script.
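
The pre-start approach can be sketched as a small script that reads the board compatible from the device tree and writes the matching configuration under /run. This is an illustration only: the compatible strings, the settings and the output path /run/myapp/config.json are hypothetical. The compatible property is a list of NUL-terminated strings, which Linux exposes under /proc/device-tree.

```python
import json
import os

# Hypothetical per-variant settings, keyed by board compatible string.
VARIANTS = {
    "vendor,board-v2": {"display": "lvds", "can_ports": 2},
    "vendor,board-v1": {"display": "hdmi", "can_ports": 1},
}


def read_compatibles(path="/proc/device-tree/compatible"):
    """Return the compatible strings, most specific first."""
    with open(path, "rb") as f:
        raw = f.read()
    return [s.decode("ascii") for s in raw.split(b"\0") if s]


def select_config(compatibles, variants=VARIANTS):
    """Pick the settings of the first (most specific) compatible we know about."""
    for compatible in compatibles:
        if compatible in variants:
            return variants[compatible]
    raise LookupError("no configuration for board: %s" % ", ".join(compatibles))


def write_config(config, path="/run/myapp/config.json"):
    """Write the selected settings where the application expects them."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(config, f)


def main():
    """Entry point for the pre-start step."""
    write_config(select_config(read_compatibles()))
```

A service could run main() from an ExecStartPre= command, so the file exists before the application starts.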

In RAUC, it’s possible to specify in the manifest that a certain sub-image only applies to a specific board. This works by providing a path (e.g. in /sys/soc) that yields an identifier. The sub-image with matching identifier will be applied. This is useful for selecting a bootloader image from a single RAUC bundle that applies to all variants.

A single bootloader is useful when many variants need to be supported – for the same reasons as single image in general. For completely dissimilar SoCs, it’s possible to play with different entry points to map multiple bootloaders into a single image. For similar SoCs, support is needed in the bootloader itself. U-Boot doesn’t support this, but barebox has CONFIG_ARCH_MULTIARCH which makes it use device tree. You need to detect the board type, which can be done by reading EEPROM, probing I2C devices, strapping pins, etc. Based on that, you set the compatible and the bootloader selects the matching device tree.
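
The detection flow described in the last two sentences can be sketched as follows. This is a Python model, not bootloader code: the EEPROM layout (a single board ID byte at offset 0), the ID values, the compatible strings and the file names are all hypothetical.

```python
# Hypothetical mapping from a board ID byte stored in EEPROM to a compatible string.
BOARD_IDS = {
    0x01: "vendor,board-basic",
    0x02: "vendor,board-pro",
}

# Hypothetical device trees shipped in the image, keyed by top-level compatible.
DEVICE_TREES = {
    "vendor,board-basic": "board-basic.dtb",
    "vendor,board-pro": "board-pro.dtb",
}


def detect_compatible(eeprom):
    """Map the board ID byte at offset 0 of the EEPROM contents to a compatible."""
    if not eeprom:
        raise ValueError("empty EEPROM contents")
    board_id = eeprom[0]
    if board_id not in BOARD_IDS:
        raise ValueError("unknown board ID 0x%02x" % board_id)
    return BOARD_IDS[board_id]


def select_device_tree(compatible):
    """Return the device tree whose compatible matches the detected board."""
    return DEVICE_TREES[compatible]


eeprom_dump = bytes([0x02, 0xFF, 0xFF])  # example contents, as if read over I2C
dtb = select_device_tree(detect_compatible(eeprom_dump))
```

Keeping detection (hardware probing) separate from selection (a lookup by compatible) means a new variant only adds table entries, not new code paths.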

Since all of the above requires a lot of consideration, it’s best to do this from the very beginning. So even if you have only a single board variant, apply the symlinks and board ID probing right away.

[ Olivier’s personal thoughts:

Take-away: look again at barebox; it looks like it has interesting properties and it seems to be used in practice.

Many things were quite obvious to me, but saying them explicitly could help. ]

State of the Beagle: BeaglePlay, BeagleConnect and beyond – Jason Kridner, Freelancer; Drew Fustini, BayLibre

slides (Google Slides)

Jason couldn’t make it, so the talk was given by Drew Fustini (BayLibre) who is on the board of directors of beagleboard.org.

BeagleBoard is a non-profit organization focused on embedded Linux and Zephyr OS products. It produces Open Hardware and Open Source Software, and it has a lot of devices. There is a big community and lots of resources. The boards designed by BeagleBoard can be produced by anyone, but the organization gets a commission on the boards produced by their official vendor. Visit https://bbb.io/about to learn more, and {,git.,docs.}beagleboard.org.

BeagleBone AI-64 is a 64-bit ARM device with 4 GB of RAM and with deep learning and multimedia hardware acceleration. It includes integration with TensorFlow Lite for AI in Python.

BeaglePlay is a cheaper device but still 64-bit, without an NPU but with an M4 core for real-time work. There is still work ongoing to upstream the BSP for it. It also has a microcontroller on board running Zephyr, which is used to connect to low-speed wireless peripherals, e.g. the BeagleConnect Freedom.

BeagleConnect Freedom is a wireless microcontroller with Zephyr OS and several wireless connectivity options: subGHz, BLE, 802.15.4. subGHz gives it a range of up to 1km at 1kbps. Its purpose is to wirelessly connect sensors (or actuators) with e.g. BeaglePlay as the central. Sensors are plugged in using a mikroBUS connector.

A key component of this vision is to handle the remote sensors with Linux drivers. The sensors are typically connected with I2C, SPI or other slow buses. Rather than having to write drivers for them in Zephyr, then invent a protocol to give an API to it and implement that protocol both on the Zephyr and the Linux side, and finally rewrite application code to use that protocol instead of the standard Linux APIs, BeagleConnect makes it possible to tunnel the I2C/SPI/… bus over the wireless connection. The sensor driver is thus entirely on the Linux node.

The underlying technology is Greybus, which comes out of Project Ara. Project Ara was about making a smartphone modular. It ultimately failed because it is really difficult to make a smartphone modular, due to the form factor and the high level of integration required. However, the Greybus protocol that was used to achieve modularity made it into the upstream kernel.

Greybus abstracts away the underlying network technology. It talks to greybus-for-zephyr on the remote node. It has a discovery mechanism to find out what is available on that remote node. On the BeaglePlay, Greybus creates a (virtual) I2C bus and populates it with the i2c_device nodes corresponding to the ones discovered remotely. Thus, Linux does not see the difference between direct I2C and I2C over Greybus.

The current Greybus implementation still uses a gbridge component in userspace; this needs to be eliminated to improve efficiency and stability.

BeagleV Ahead is a T-Head RISC-V (TH1520) based board. It is more powerful than BeaglePlay. Upstreaming is ongoing. Debian and Ubuntu are supported. A limitation of this board is that its CPU uses instruction set extensions that were not released yet at the time the hardware was designed, and the released instruction set extension spec turned out to be slightly different. Because of this, it’s not possible to use upstream GCC and binutils; patched versions are needed.

[ Olivier’s personal thoughts: BeagleBoard is an ecosystem that seems to do a lot of software the right way: very open, very nice. But it bases everything on TI solutions, which don’t seem to be used much in embedded systems. The BeagleV is an exception, but they intend to do more. One to follow. ]

Add the power of the Web to your embedded devices with WPE WebKit – Mario Sánchez-Prada, Igalia

slides: HTML pdf

There are 4 main web rendering engines at the moment: WebKit, Chromium (Google Chrome), Gecko (Firefox) and Servo (replacement intended for Firefox, but abandoned by Mozilla). Igalia works on all of them except Gecko. This talk is about WebKit only.

WebKit is a (mostly) BSD-licensed web browser engine started by Apple as a fork of KHTML. It was forked again by Google to become Blink, the engine that Chromium now uses. WebKit’s goals are performance, portability, stability, compatibility, standards compliance and hackability. It’s available on many different platforms, which (together with the previous points) makes it very suitable for embedded. WebKit is only a web engine, not a browser. It just handles rendering and user interaction, not all the menus, bookmarks, etc.

The WebKit architecture consists of WebCore and JavaScriptCore for rendering and execution, Platform for hooking into the underlying platform, and WebKit, which provides the (stable) API adapted to the underlying platform. WebCore also includes network, multimedia and accessibility. WebKit also implements the split-process model.

A WebKit port is an adaptation of WebKit for a specific platform. Some are upstream, some are out-of-tree. Official ports include Mac (where the API is Objective C, not C), iOS (which is also used in Chrome on iOS!), two Windows ports, WebKitGTK, and WPE. For Linux, the common parts are in separate libraries: GLib, libsoup, GStreamer. The differences are mainly in graphics and input handling. WebKitGTK is for integration in GTK applications, both GTK3 and GTK4. WPE is lower level, aimed at embedded devices. A port is a specific instantiation of the Platform (i.e. GLib, libsoup, …) but also of the WebKit API itself (e.g. using GTK for integration in the application).

WPE is optimized for embedded devices. It has a minimal set of dependencies. It uses a backend architecture, which is great for supporting hardware acceleration for graphics and multimedia. It supports 32-bit ARM, though the JavaScript support is a bit limited. It has a low memory and storage footprint. On the other hand, it doesn’t support all APIs found in other WebKit ports. For example, there’s no API for locking the mouse pointer. It doesn’t link with a particular UI toolkit – it’s generally meant to be used with no toolkit at all (i.e. full screen display).

WPE is used in set-top boxes, smart home appliances, Hi-Fi audio systems, digital signage, GPS devices, video/audio conferencing, and more. Some of these don’t have any display at all: the web engine is simply used to stream audio. There is also headless server rendering, where the rendered output is encoded as a video stream and sent over the network.

There are two forks of WPE. Upstream WPE is generic and free of customizations. Downstream WPE, aka WebPlatformForEmbedded, is optimized for Broadcom and other SoCs used in set-top boxes. It comes from the RDK project.

WPEWebKit is the actual WebKit port. It relies on backends for display and input. libwpe provides the callbacks used for rendering that are implemented by the graphics backend. It also allows the input backend to relay events from the application. WPEBackend-FDO is a wayland-based reference backend. It can be used for development on PC and then replaced with a device-specific backend. Cog is a launcher that simply loads a URL given on the command line. It has no user interface, but it has a dbus interface that can be used to enter input events.

Hardware accelerated graphics are implemented with ANGLE. ANGLE (Almost Native Graphics Layer Engine) basically translates WebGL to e.g. OpenGL. DMABuf is used for buffer sharing with the graphics pipeline. There’s a fallback implementation for drivers that don’t support it. WPE unifies the pipelines for HTML/CSS and SVG rendering, which allows the latter to also be accelerated using the same backend. GStreamer is tightly integrated to implement various web APIs, e.g. Media Capture. It also supports DMABuf for decoders.

In the future they want to develop a simplified architecture for WPEWebKit. They also want to accelerate 2D rendering. Android is a new target platform; a demo is already available. The graphics pipeline can also be improved, e.g. by using DMABuf more.

Implementing ISP algorithms in libcamera – Laurent Pinchart, Ideas on Board

slides

This talk is somewhat of a continuation of the talk Laurent gave at Embedded Linux Conference 2023 in Prague.

Assuming your hardware is supported with Linux drivers for all components (using V4L2, Media controller), you still have a mess of components that need to be controlled in the video input pipeline. libcamera tries to solve or simplify this problem of setting up the pipeline, supporting multiple streams, per-frame controls, and more.

The ISP (Image Signal Processor) is a piece of hardware that is meant for (pre)processing the images that come from cameras. When retrieving the raw images from a camera, a large number of operations need to be performed on them to clean them up, like Bayer conversion, simple filtering, lens shading, and more. This is basically a sequence of algorithms (operations) on each image, each producing a new image. The ISP provides primitives in hardware to perform those operations.
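
The idea of a chain of operations, each producing a new image, can be illustrated in a few lines. The sketch below runs in software on a tiny raw frame; a real ISP does this in hardware and with many more stages. The RGGB layout, the black level of 16 and the gain values are assumptions chosen for the example.

```python
# A raw Bayer frame as a list of rows; an RGGB pattern is assumed:
#   even rows: R G R G ...   odd rows: G B G B ...

def subtract_black_level(image, black_level):
    """Remove the sensor's black offset, clamping at zero."""
    return [[max(pixel - black_level, 0) for pixel in row] for row in image]


def apply_white_balance(image, red_gain, blue_gain):
    """Scale red and blue pixels of an RGGB frame; green is left unchanged."""
    result = []
    for y, row in enumerate(image):
        new_row = []
        for x, pixel in enumerate(row):
            if y % 2 == 0 and x % 2 == 0:
                pixel = pixel * red_gain   # red site
            elif y % 2 == 1 and x % 2 == 1:
                pixel = pixel * blue_gain  # blue site
            new_row.append(pixel)
        result.append(new_row)
    return result


def run_pipeline(image, stages):
    """Apply each stage in order; every stage returns a new image."""
    for stage in stages:
        image = stage(image)
    return image


raw = [
    [116, 216, 116, 216],
    [216, 66, 216, 66],
]
processed = run_pipeline(raw, [
    lambda img: subtract_black_level(img, 16),
    lambda img: apply_white_balance(img, red_gain=2.0, blue_gain=4.0),
])
```

Each stage takes an image and returns a new one, so stages can be reordered or added without touching the others, which mirrors how the hardware blocks are chained.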

libcamera has pipeline handlers which are helpers to set up the plumbing specific to a platform. This includes talking to the ISP itself. A second platform-specific component is the Image Processing Algorithms support. That is what this talk is about. This is the only component of libcamera that is allowed to be closed source. Out-of-tree IPAs are sandboxed in an isolated process. In-tree IPAs talk directly to the pipeline handler, but it’s transparent to the pipeline handler if it’s direct or sandboxed.

The application asks the camera to capture a frame with a Request. This includes setting controls and setting up buffers. The request is then queued to the camera. After some time it completes. The application can then reuse() the request.

Creating an IPA module takes 4 easy steps: the IPA interface, the pipeline handler, the IPA module, and the algorithm itself.

First is the IPA interface. It defines the operations that are exposed to the pipeline handler, including buffers and event handling.

Second is the pipeline handler which must be implemented accordingly. The pipeline handler is the only one that calls the IPA interfaces, so the two are really developed together. It creates an IPA instance and wires it up with the V4L2 ports and functions. Part of the IPA interface is the tuning file that sets parameters about e.g. the lens. Then the camera should be configured, and finally started and stopped again. IPAs need statistics from the captured images to improve the algorithm (e.g. to perform white balancing). The pipeline handler detects when these are ready and pushes them into the IPA.

The third step is the IPA module, which implements all the functions defined in the interface.

The final step is the algorithm. libcamera helps with infrastructure to build an algorithm, defining the typical steps that correspond to an interface. There is a skeleton for new applications.
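
As an illustration of how image statistics feed an algorithm, here is a sketch of white balancing under the grey-world assumption (the scene averages out to neutral grey). This is a textbook method chosen for the example, not necessarily what any libcamera IPA implements; the function name and the shape of the statistics are hypothetical.

```python
def grey_world_gains(mean_red, mean_green, mean_blue):
    """Compute red and blue gains that equalise the channel means to green.

    The inputs stand for per-channel averages that the ISP reports as
    statistics for a captured frame.
    """
    if mean_red <= 0 or mean_blue <= 0:
        raise ValueError("channel means must be positive")
    return mean_green / mean_red, mean_green / mean_blue


# Statistics from a frame with a warm (reddish) cast.
red_gain, blue_gain = grey_world_gains(mean_red=160.0, mean_green=120.0, mean_blue=80.0)
# red_gain is 0.75 and blue_gain is 1.5: red is attenuated, blue is boosted.
```

The gains computed from one frame's statistics would be applied to later frames, which is why the statistics have to be pushed to the algorithm continuously.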

[ Olivier’s personal thoughts: Amazing how complex it is to get something as obvious as a good image from a camera on an embedded system. And the libcamera team masters it. ]
