When a motherboard’s empty battery makes a production device fail to boot

Charles-Antoine Couret
12/01/2024

Context

My current customer is CMC, a company that designs and builds controllers for industrial compressors. These big machines compress air for a variety of industrial uses, and many physical parameters (temperature, pressure, humidity, dew point, power, etc.) must be monitored to produce compressed air efficiently and to improve the reliability of the machines.

The project here is to build an ecosystem to monitor this data from different kinds of sensors, then to send the result to the cloud, and finally to make reports, check if devices are working well and predict failures.

Two components of this ecosystem are relevant for this story:

  • A small x86 PC that displays some data locally on a touch screen. The data comes from the other machines over the local Ethernet network.
  • A monitor for each compressor, based on an i.MX6ULL, that collects data via sensors or Modbus. Its data is sent to the cloud and to the local device manager over Ethernet.

 

The idea behind the PC is to be able to work without a cloud connection: it provides local monitoring and can send some commands. This makes it possible to work offline, which is sometimes required by the customer for security or privacy reasons, and sometimes necessary because connectivity can be really bad in these factories.

That’s why we’re using a PC for this task. Good performance is required to run a local version of the cloud stack, with a local backend and frontend, which is displayed in a minimal browser as the main interface. And because x86 hardware is standard and produced in high volume, the cost is relatively low and we don’t need to build our own hardware or customize the software too much.

The running image is an x86 Yocto image that we build ourselves, with just the minimal set of components required for this task.

Issue

After some years, some customers reported that the PC no longer booted and seemed stuck at the UEFI configuration screen. It is hard to get more information from such a report and clearly impossible to debug such an issue remotely. Therefore, one customer sent the device back for further analysis.

The disk was correctly detected by the UEFI firmware, and its content looked correct, without any corruption that could explain the issue. But the UEFI interface did not show the bootable partitions. Because the device uses A/B partitioning with two different bootable partitions, both should have been displayed there.

After booting a Fedora Linux LiveUSB on the device, the efibootmgr command provided the answer: the UEFI parameters related to boot were lost, that is, which partition is bootable, which arguments to pass to the kernel, etc. Setting them again manually, as is done in production, solved the issue:

efibootmgr -d /dev/sda --gpt   -p 1 -c -L "UEFI: aero1" -l '\EFI\BOOT\bzImage.efi' \
    -u "root=/dev/sda3 rootwait rootfstype=ext4 console=tty1 quiet" > /dev/null
efibootmgr -d /dev/sda --gpt   -p 2 -c -L "UEFI: aero2" -l '\EFI\BOOT\bzImage.efi' \
    -u "root=/dev/sda4 rootwait rootfstype=ext4 console=tty1 quiet" > /dev/null
efibootmgr -o 0000,0001

These commands declare the first two partitions of the SSD as bootable, with a Linux kernel as the EFI binary, and set the parameters required to boot, especially the location of the rootfs on the device.
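
To spot a unit that has lost these entries, the text printed by efibootmgr can be checked for the two expected labels. The following is only a sketch: the function name count_aero_entries and both sample outputs are made up for illustration and were not captured from a real device.

```shell
#!/bin/sh
# Count the NVRAM boot entries labelled "UEFI: aero1" / "UEFI: aero2"
# in the text printed by `efibootmgr` (passed as $1).
count_aero_entries() {
  # An entry line looks like "Boot0000* UEFI: aero1"; the "*" marks an
  # active entry and is replaced by a space when the entry is inactive.
  printf '%s\n' "$1" | grep -c -E '^Boot[0-9A-Fa-f]{4}[* ] UEFI: aero[12]'
}

# Illustrative output for a healthy device (assumption, not a real capture).
healthy='BootCurrent: 0000
BootOrder: 0000,0001
Boot0000* UEFI: aero1
Boot0001* UEFI: aero2'

# Illustrative output after the NVRAM content was lost (assumption).
lost='BootCurrent: 0002
BootOrder: 0002
Boot0002* UEFI: USB stick'

echo "healthy: $(count_aero_entries "$healthy") entries"
echo "lost: $(count_aero_entries "$lost") entries"
```

On a real device, the argument would be "$(efibootmgr)"; fewer than two matching entries means the commands above have to be run again.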

After analysis, it turned out that the battery of the motherboard was dead. The issue is easy to reproduce on that particular unit: after cutting the power completely for some time, the PC fails to boot again. We’d expect the non-volatile RAM to be, you know, non-volatile, but apparently it’s just a battery-backed memory, not actual flash.

The battery that powers the NVRAM is not supposed to run empty within the lifetime of the PC. However, these machines are used in a hot environment: the ambient temperature can sometimes be above 40-50°C. This reduces the lifetime of the battery to just a few years.

So, we have an explanation, but it’s not realistic to solve this as we did during our investigation. It does not scale, the issue can happen again if the power fails, and it’s really too complex for technicians to solve it themselves.

Solution

Boot partitions can be detected by the motherboard without extra info

UEFI relies on the GPT (GUID Partition Table) standard to list the partitions on the device and to know the type of each one. This standard removes a lot of the limitations of the MBR (Master Boot Record), which was used by the BIOS in the early days of x86 PCs.

To boot from a device, the UEFI firmware looks for EFI system partitions, which it identifies by the partition type GUID C12A7328-F81F-11D2-BA4B-00A0C93EC93B in the GPT. So to fix this part of the issue, we have to give our boot partitions this type.

In Yocto, we’re using a wks file to define our SSD image file, which contains the partitions. We can add a parameter --part-type to define the EFI partition like this:

part /boot --ondisk sda --label boot1 --part-type C12A7328-F81F-11D2-BA4B-00A0C93EC93B --part-name aero1 --fstype=vfat --active --align 1024 --size=50 --source rawcopy --sourceparams="file=cmc-boot-aero-aero-pc.vfat"

This way, new devices are able to boot from these partitions even if the NVRAM battery dies. But we need to solve this for existing products as well. We can create a boot script that marks the current boot partition as an EFI system partition in the GPT if it’s not already configured correctly.

fix_guid_for_bootable_partition() {
  local BOOT_PARTITION_NUMBER="$1"
  local ROOT_DEVICE_FILE="$2"
  local EFI_GUID="C12A7328-F81F-11D2-BA4B-00A0C93EC93B"
  local GUID_BOOT_PARTITION=$(sgdisk -i "${BOOT_PARTITION_NUMBER}" "${ROOT_DEVICE_FILE}" | grep "Partition GUID code:" | cut -d ' ' -f 4)

  # If the boot partition is not typed as an EFI system partition, set it in the GPT.
  # Only write when it is needed, to limit the risk of corruption: this step has
  # to be applied only once on each old device.
  if [ "${GUID_BOOT_PARTITION}" != "${EFI_GUID}" ]; then
    sgdisk -s -t "${BOOT_PARTITION_NUMBER}:${EFI_GUID}" "${ROOT_DEVICE_FILE}"
  fi
}
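
The function needs the number of the boot partition currently in use. One possible way to derive it, sketched below, is to read the root= argument from the kernel command line. The helper name boot_partition_for_cmdline is hypothetical, and the pairing (boot partition 1 with /dev/sda3, boot partition 2 with /dev/sda4) is an assumption taken from the efibootmgr commands shown earlier.

```shell
#!/bin/sh
# Print the boot partition number matching the root= argument found in a
# kernel command line (passed as $1). Returns 1 if no known root= is found.
boot_partition_for_cmdline() {
  # Split the command line into one argument per line and keep the
  # partition number of the last root=/dev/sdaN argument.
  root_number=$(printf '%s\n' "$1" | tr ' ' '\n' |
    sed -n 's|^root=/dev/sda\([0-9][0-9]*\)$|\1|p' | tail -n 1)

  # Assumed layout: boot partitions 1 and 2 pair with rootfs partitions 3 and 4.
  case "$root_number" in
    3) echo 1 ;;
    4) echo 2 ;;
    *) return 1 ;;
  esac
}

# Example with the kernel arguments used in this article.
echo "boot partition: $(boot_partition_for_cmdline \
  'root=/dev/sda3 rootwait rootfstype=ext4 console=tty1 quiet')"

# On a device, this could be combined with the function above, for example:
#   part="$(boot_partition_for_cmdline "$(cat /proc/cmdline)")" &&
#     fix_guid_for_bootable_partition "$part" /dev/sda
```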

Unfortunately, this fix is not enough: if the issue happens again, the EFI firmware will not find the EFI binary!

Change the EFI binary location

In our ESP, the previous path \EFI\BOOT\bzImage.efi was not standard, which means that even with the right GUID, the EFI firmware cannot find the EFI binary to load, and the boot process stops. Previously, we stored the location of the EFI binary in the NVRAM, but when this information is lost, the EFI firmware cannot detect it automatically.

The solution is to use the standard path, which for the x86_64 architecture is \EFI\boot\bootx64.efi. It’s not an issue to use only this path, without an extra copy elsewhere, because we don’t intend to support multiboot with other systems.
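
Whether a mounted ESP, or a staging directory used to build one, contains a binary at this standard path can be checked with a few lines of shell. This is a minimal sketch: the function name has_fallback_loader and the temporary demo directory are hypothetical. Since FAT is case-insensitive, the glob accepts any casing of the path.

```shell
#!/bin/sh
# Succeed if the directory $1 (a mounted ESP or a staging copy of it)
# contains the x86_64 fallback loader EFI/BOOT/BOOTX64.EFI, in any casing.
has_fallback_loader() {
  for loader in "$1"/[Ee][Ff][Ii]/[Bb][Oo][Oo][Tt]/[Bb][Oo][Oo][Tt][Xx]64.[Ee][Ff][Ii]; do
    # When nothing matches, the pattern is kept as-is and the test fails.
    [ -f "$loader" ] && return 0
  done
  return 1
}

# Demo on a throwaway directory that mimics the old and the new layout.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/EFI/BOOT"

: > "$demo_dir/EFI/BOOT/bzImage.efi"    # old, non-standard file name
has_fallback_loader "$demo_dir" || echo "old layout: no fallback loader"

: > "$demo_dir/EFI/BOOT/bootx64.efi"    # standard file name
has_fallback_loader "$demo_dir" && echo "new layout: fallback loader found"

rm -rf "$demo_dir"
```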

But this fix is not enough either: if the issue happens again, the kernel does not get the parameters it needs to find the rootfs and continue the boot!

Change the bootloader

For simplicity, there was no bootloader between the UEFI firmware and the kernel. The kernel can be built as an EFI binary and started directly by the firmware, which removes an extra step. Because of the A/B partitioning, we can’t hardcode the rootfs parameter in the kernel binary, and the kernel is not able to read this setting from a file in the boot partition.

The simplest solution is to add a bootloader that starts the Linux kernel with the right parameters. In that case the parameters are stored on the disk and can’t be lost. Another solution would be a dedicated EFI firmware with these settings hardcoded, but then we would need to ask the mainboard provider to make this change, which would probably be costly and take some time. We could also create an EFI script that is called at boot time to read these settings from a file, but we have less experience with that and we might run into issues if we change the provider of these boards. Finally, we could create an initramfs that is linked into the kernel and put a script in there to find the right rootfs partition. But adding a bootloader is much simpler.

For the bootloader, we considered GRUB and systemd-boot; both are supported by Yocto. We chose the latter because it’s simpler for our use case. It works only on UEFI systems, supports this standard well, and does not carry a lot of legacy features that our device doesn’t need. In addition, its configuration file is really simple, while GRUB’s is fairly complicated.

So the layout of the ESP becomes a bit different: EFI\boot\bootx64.efi is systemd-boot, Linux\bzImage is the kernel as a plain bzImage instead of an EFI binary, loader\loader.conf stores the bootloader configuration, and loader\entries\boot.conf (and boot2.conf) are the boot entries that start our system. The kernel must be in the ESP along with systemd-boot.

The content of these configuration files is:

loader.conf:

default boot.conf
timeout 0

boot.conf (for the rootfs on /dev/sda3; the other partition has the corresponding root= option):

title boot
linux /Linux/bzImage
options root=/dev/sda3 rootwait rootfstype=ext4 console=tty1

Finally, we set the EFI_PROVIDER variable to systemd-boot in our machine definition in Yocto.
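
Because the two entries differ only in their root= option, they can be generated from one template. The sketch below is an illustration, not the way the files are produced in our build: the function name write_boot_entry and the temporary directory standing in for the ESP are hypothetical.

```shell
#!/bin/sh
# Write a systemd-boot entry file named $2 into $1/loader/entries,
# pointing the kernel at the root device $3.
write_boot_entry() {
  esp="$1"
  entry="$2"
  root_dev="$3"
  mkdir -p "$esp/loader/entries"
  # The title is the file name without its .conf suffix (assumption).
  cat > "$esp/loader/entries/$entry" <<EOF
title ${entry%.conf}
linux /Linux/bzImage
options root=$root_dev rootwait rootfstype=ext4 console=tty1
EOF
}

# Demo: generate the entry for the first rootfs in a throwaway directory
# and print it; it matches the boot.conf shown above.
demo_esp="$(mktemp -d)"
write_boot_entry "$demo_esp" boot.conf /dev/sda3
cat "$demo_esp/loader/entries/boot.conf"
rm -rf "$demo_esp"
```

The entry for the second rootfs would be generated the same way with /dev/sda4 as the last argument.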

Fixing the issue for existing devices with Yocto

For now, the problem is solved for new devices, and for existing devices that are upgraded to the new version before the problem occurs. But at that time there were already at least two devices with this issue, and because customers don’t update often or quickly, we expected more devices to run into it.

We need a way to restore the device to a good state without performing an update. The first idea was to use a LiveUSB of an existing distribution like Fedora, modify its boot process to execute some commands, then boot from it. But that is a weird workaround which is not really reproducible, and the technician has no way to know what is happening if the process fails for unknown reasons. The resulting file is also pretty heavy in general: more than 800 MiB just to execute a script and possibly perform the update automatically…

The solution is to take advantage of the power of Yocto and generate our own ISO file.

We can keep the same machine and distro definitions; we just need to create two recipes: one for the specific script to execute, and the image recipe that generates the ISO file. The recipe for the image is pretty simple and looks like this:

DESCRIPTION = "Aero fix boot issue Live image"

LICENSE = "CLOSED"
IMAGE_FEATURES:append = " empty-root-password allow-empty-password allow-root-login "

inherit core-image image-live

IMAGE_INSTALL += " \
    boot-issue-script \
    e2fsprogs \
    e2fsprogs-resize2fs \
    update-tools \
    linux-firmware-i915 \
    swupdate \
"

LABELS_LIVE = "boot.conf"

IMAGE_FSTYPES = "ext4 live iso"
NOISO = "0"

Only the minimal set of packages is included, we use the same infrastructure to build this file as for our other deliverables, and it’s easy to extend for other purposes in the future if required. The script is installed by the boot-issue-script recipe; it runs the efibootmgr and sgdisk commands to fix the boot issue and then performs the update, thus solving the problem for good. The ISO file is around 280 MiB, which is a lot lighter than mainstream distributions.

Then we need to explain to the technician how to flash a USB stick and boot from it. We wrote a short document, and they can use the Fedora Media Writer tool to write our ISO file to the USB stick. It works well: this software is simple to understand and works on Windows, macOS and Linux without problems, which makes it a nice choice to keep our explanations simple.

Problem solved!
