mind | EOSS2023 – Do the Time Warp – the Rocky Horror PTP Show: Verification of Network Time Synchronization in the Real World

PTP (Precision Time Protocol) synchronizes multiple clocks over a network. It compensates for path delays. It automagically selects the best leader clock to follow. It’s complicated to configure, and you need to verify how well it works. Always check your assumptions!


This talk is based on personal experience with linuxptp v3.1. v4 has since been released, and stability has apparently improved a lot.

The basic protocol is a 2 step sync. The leader sends a sync, the follower sends a delay request, and the leader sends a delay response. The leader’s messages carry its send and receive timestamps, and the follower records its own receive and send times, so at the end the follower has 4 timestamps from which it can calculate the path delay and its offset from the leader.
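The calculation from the 4 timestamps can be sketched as follows. This is a minimal illustration, not linuxptp code: the names `Exchange`, `t1` to `t4` and `path_delay_and_offset` are made up for this example, and it assumes the path delay is the same in both directions.

```python
from dataclasses import dataclass


@dataclass
class Exchange:
    """Hypothetical container for the four timestamps, all in nanoseconds."""
    t1: int  # leader sends sync (leader clock)
    t2: int  # follower receives sync (follower clock)
    t3: int  # follower sends delay request (follower clock)
    t4: int  # leader receives delay request (leader clock)


def path_delay_and_offset(x: Exchange) -> tuple[float, float]:
    """Return (one-way delay, follower offset), assuming a symmetric path."""
    forward = x.t2 - x.t1    # = delay + offset
    backward = x.t4 - x.t3   # = delay - offset
    delay = (forward + backward) / 2
    offset = (forward - backward) / 2
    return delay, offset


# Follower clock runs 500 ns ahead of the leader, one-way delay is 1000 ns.
delay, offset = path_delay_and_offset(Exchange(t1=0, t2=1500, t3=10_000, t4=10_500))
print(delay, offset)
```

Any asymmetry between the two directions ends up directly in the computed offset, which is one reason to verify the result with an independent measurement.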

To select the leader, every clock checks its capabilities and if it thinks it’s better than what was already announced, it announces itself. There’s a decision tree to determine the leader based on this. The user can statically configure a leader as well.
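A much simplified sketch of that comparison is shown below. It assumes the usual ordering of the announced attributes (priority1, clock class, accuracy, variance, priority2, clock identity, lower wins) and leaves out everything the real decision tree does about topology. The `Announce` class and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Announce:
    """Hypothetical subset of what a clock announces about itself."""
    priority1: int        # user-configurable, compared first
    clock_class: int
    clock_accuracy: int
    variance: int
    priority2: int        # user-configurable tie breaker
    clock_identity: bytes  # final tie breaker

    def rank(self) -> tuple:
        # Lower is better for every field; tuples compare field by field.
        return (self.priority1, self.clock_class, self.clock_accuracy,
                self.variance, self.priority2, self.clock_identity)


def best_leader(candidates):
    """Pick the candidate that wins the simplified comparison."""
    return min(candidates, key=Announce.rank)


gps_clock = Announce(128, 6, 0x21, 0x4E5D, 128, bytes.fromhex("001122fffe334455"))
free_running = Announce(128, 248, 0xFE, 0xFFFF, 128, bytes.fromhex("000000fffe000001"))
# Same poor clock, but the user forced it to win by lowering priority1.
forced = Announce(1, 248, 0xFE, 0xFFFF, 128, bytes.fromhex("000000fffe000002"))

print(best_leader([gps_clock, free_running]) is gps_clock)
print(best_leader([gps_clock, free_running, forced]) is forced)
```

The second case shows how a single configured priority can override clock quality, so it is worth checking which clock actually got elected.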

The full protocol is much, much more complex than this, to cater to various corner cases and use cases. For example, there are adaptations depending on whether the switch supports PTP or not. It’s important to evaluate what exactly gets used in your system, because it can affect the behaviour (accuracy) quite a lot.

Ideally, PTP is hardware offloaded. Packet timestamping support makes sure that the timestamps are accurate w.r.t. the exact time the packet is sent.

The most established implementation is linuxptp, but some other projects exist. linuxptp is a userspace implementation, and it’s quite tricky to configure correctly. The project should have quarterly releases as of now.

To evaluate how well two systems are synchronised, you can use pulse outputs on each and check if the edges match up over a longer period of time. You can also measure in pure software, by letting the follower send back to the leader, but then you have to verify that the measurement itself is correct. Always check your assumptions!
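Evaluating such a capture could look like the sketch below. It assumes the edge times of both pulse outputs have already been exported as integer nanoseconds on a common time base (for example from a logic analyser), and the function names and sample values are made up for illustration.

```python
from statistics import mean


def edge_offsets(leader_edges, follower_edges):
    """Pair edges by index and return follower - leader for each pulse."""
    if len(leader_edges) != len(follower_edges):
        # A missing or extra pulse is itself a failure worth investigating.
        raise ValueError("pulse count mismatch, inspect the capture")
    return [f - l for l, f in zip(leader_edges, follower_edges)]


def summarize(offsets):
    """Reduce a long capture to a few numbers to watch over time."""
    return {
        "mean": mean(offsets),
        "worst": max(offsets, key=abs),
        "peak_to_peak": max(offsets) - min(offsets),
    }


# Illustrative values only: three pulses, one second apart.
leader = [0, 1_000_000_000, 2_000_000_000]
follower = [40, 1_000_000_025, 1_999_999_990]
stats = summarize(edge_offsets(leader, follower))
print(stats)
```

Looking at the worst case and the peak-to-peak value over hours rather than seconds is what catches the occasional glitch that an average hides.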

PTP configuration has many possible permutations, which makes sure that anything that can go wrong, will go wrong. It makes sense to add plausibility checks, i.e. to compare the full time of day and not just the synchronisation edge.
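A small sketch of why the edge alone is not enough: an error of a whole number of seconds is invisible on a pulse-per-second output. The helper names are hypothetical, and the 37 s used here is the current TAI to UTC difference, picked as an example of a timescale mix-up.

```python
NS_PER_S = 1_000_000_000


def edge_error_ns(leader_ns: int, follower_ns: int) -> int:
    """Error as seen on PPS edges: wrapped into [-0.5 s, +0.5 s)."""
    diff = (follower_ns - leader_ns) % NS_PER_S
    return diff - NS_PER_S if diff >= NS_PER_S // 2 else diff


def full_error_ns(leader_ns: int, follower_ns: int) -> int:
    """Error of the full time of day, without any wrapping."""
    return follower_ns - leader_ns


leader = 1_700_000_000 * NS_PER_S
follower = leader + 37 * NS_PER_S + 20  # 37 s off, plus 20 ns

print(edge_error_ns(leader, follower))  # the edges look almost perfect
print(full_error_ns(leader, follower))  # the time of day is far off
```

A plausibility check that also compares the full timestamps would flag this case, while the edge measurement reports only 20 ns.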

Measurements with a logic analyser of the PPS pulses on the leader and the follower show a number of different failure cases, with different reasons for the failures. See slides for details, but it basically takes a lot of debugging to find out what exactly went wrong, because there are several possible causes for each issue.

Failure modes can sometimes be seen in the linuxptp log file. However, if there’s a master sync timeout, it’s hard to tell what the exact cause is. E.g. a half duplex link is simply not allowed by the standard, so it never syncs. The cause can also be a hardware or driver bug that, for example, doesn’t propagate timestamps properly.
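As a rough sketch of automating the log check, the snippet below scans for the periodic "master offset" lines and flags entries that look suspicious. The line layout is an assumption based on typical ptp4l output and may differ between versions and configurations; the sample lines, the threshold and the function name are made up for illustration.

```python
import re

# Assumed layout: "... master offset <ns> s<state> freq <ppb> path delay <ns>"
LINE = re.compile(
    r"master offset\s+(-?\d+)\s+(s\d)\s+freq\s+([+-]?\d+)\s+path delay\s+(-?\d+)"
)


def suspicious(lines, limit_ns=1000):
    """Return (servo state, offset) for lines that are not locked or too far off."""
    flagged = []
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue  # not a periodic offset report
        offset, state = int(m.group(1)), m.group(2)
        # Assumption: "s2" is the locked servo state.
        if state != "s2" or abs(offset) > limit_ns:
            flagged.append((state, offset))
    return flagged


# Made-up sample lines for illustration only.
sample = [
    "ptp4l[100.000]: master offset         -3 s2 freq   -1200 path delay       540",
    "ptp4l[101.000]: master offset      48213 s2 freq  +30000 path delay       540",
    "some unrelated line",
]
print(suspicious(sample))
```

This only tells you that something went wrong and when. Finding the cause still means going through the possibilities one by one.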

Some common pitfalls:

If NTP is also enabled, you get jumps in system time.
Timestamping in the MAC or in the PHY gives different results.
The timescale of the clock source may be different from what you assumed, e.g. are leap seconds included or not?
Measurements may not notice some errors, e.g. if they’re too big, they roll over.
Check if the elected leader is the one that you expect to be elected.
And many more.

Choose the correct profile. It’s often implied by the application, e.g. in GSM. Don’t just copy commands from the internet, read the man pages. Check the logs, including those of bridges (switches). Test over a longer period of time. And always check your assumptions!