Upstreaming for a Memory Leak in the Linux Kernel Bluetooth Stack

Upstreaming for a Memory Leak in the Linux Kernel Bluetooth Stack

Marleen Vos
24/07/2024

In the first part, we saw how we hit, reproduced and fixed a memory leak in the Bluetooth stack in the kernel. After Olivier submitted this as an RFC upstream, he unfortunately no longer had the time to iterate on the patches. Fortunately, a few weeks later, Marleen was able to continue the work.

Newer Kernels

Since almost a month had passed, there had been new kernel releases. The first step of course was to test if the issue still occurred on those. Marleen tested with the most recent kernels at the time
(6.5.11 and 6.6), and it turned out that neither of them still showed the leak! In kernel 6.5, however, the memory leak could still be reproduced.

Of course, not being able to reproduce doesn’t necessarily mean that the leak was actually fixed. Looking at the logs between 6.5 and 6.5.11, nothing was mentioned about fixing a memory leak. So we took out the git bisect tool to pinpoint the commit that fixed the issue: commit 4ab81f16c68a (Bluetooth: hci_conn: Consolidate code for aborting connections). Its parent commit c2509f7c3735 is the last one where our reproducer still shows the leak.

Looking at the code of this commit, it is not clear whether the issue was fixed intentionally or by accident, or if it was just moved somewhere else. Ideally, we would be able to backport this patch to the 5.13 kernel that is used in the project where we originally encountered the issue. That would allow us to prove the memory leak is gone also in real scenarios. However, there have been some 900 patches in the net/bluetooth/ part of the kernel code since kernel 5.13 and before commit 4ab81f16c68a, and a lot of these create conflicts or otherwise interact with backporting the patch in question.

Bluetooth Testing (BlueZ tests)

If the leak is (implicitly) fixed in the upstream kernel, then we should at least try to make sure it doesn’t return. An important step for that is to add it to the test framework. When Olivier made his bug report, BlueZ maintainer Luiz Augusto von Dentz already requested for such a test to be added to l2cap-tester. Marleen started to work on this, but it turns out that Luiz Augusto had already implemented such a test himself.

Thus, the next step was to prove that this new test really does expose the memory leak. For that, we added it to our reproducer repository. This entailed the following steps.

  • Enable all the sub-options of the bluez-utils package in Buildroot’s “make menuconfig”. Some of the tests use some of these options, it’s easiest to just enable all of them.
  • Add “–enable-testing to bluez5_utils.mk” in the Buildroot package. Buildroot doesn’t have an option to install the test tools, so we simply patched Buildroot to install them.
  • Add the patch that adds the timeout test to the bluez-utils package. This can be done without modifying Buildroot itself, through BR2_GLOBAL_PATCH_DIR. The patch came in a series of two, we need both of them to avoid creating a merge conflict.
  • Add some extra settings to the kernel configuration, cfr. the the instructions on the BlueZ wiki. We also have to enable PCI (even though our target STM32MP15 platform doesn’t have PCI) because some of those depend on it.
  • If some of these additional configurations are put in modules, make sure to modprobe them before running the l2cap-tester.

Finally we can use l2cap-tester on two different kernels:

  • v6.5.11, where we cannot reproduce the memleak with our reproducer. We find that l2cap-tester does not reproduce a memleak in this version either.
  • v6.5.11 with a revert of the ‘fix’ commit. We find that l2cap-tester does trigger the memleak in this version.

Coming to these conclusions was not entirely trivial, because l2cap-tester doesn’t run reliably (at least not on the STM32MP15 platform). For example, some of the tests (other than the timeout test) would occasionally also time out.

We also tried to gain more confidence in the lack of memory leak by tracing the reference counting added by Olivier’s second patch, using dynamic debug with:

echo 'module bluetooth +p' > /sys/kernel/debug/dynamic_debug/control

Unfortunately, enabling this dynamic debug causes l2cap-tester to timeout and hang before it even gets to the timeout test. We got no useful information out of this, and we had to disable the extra logging in order to run the tests.

Conclusion

Even though we are not 100% sure that the specific commit fixes the memory leak, we can conclude that the new tests introduced catch it, and that it is no longer present in newer kernels. Of course, to avoid regressions, the tests still have to be run with kmemleak enabled.

Presentations

Drop the docs and embrace the model with Gaphor Fosdem '24 - Frank Van Bever 20 March, 2024 Read more
How to update your Yocto layer for embedded systems? ER '23 -Charles-Antoine Couret 28 September, 2023 Read more
Tracking vulnerabilities with Buildroot & Yocto EOSS23 conference - Arnout Vandecapelle 12 July, 2023 Read more
Lua for the lazy C developer Fosdem '23 - Frank Van Bever 5 February, 2023 Read more
Exploring a Swedish smart home hub Fosdem '23 - Hannah Kiekens 4 February, 2023 Read more
prplMesh An Open-source Implementation of the Wi-Fi Alliance® Multi-AP (Arnout Vandecappelle) 25 October, 2018 Read more

 

News