As companies grow, adapt, morph, and mature, one thing remains the same: the need for reinvention. Technical infrastructure is no exception. As our member community grew, our priority was to keep up with that growth, or as we say, ensure continuous “site up.” (Read: adding servers to scale from hundreds to hundreds of thousands.) We ran into challenges in planning for this type of scaling, in particular in keeping the platform images and kernels installed on our servers up to date. We moved forward in fits and starts, reimaging our entire physical server fleet in ad-hoc, all-hands efforts to respond to various extrinsic factors, such as publicly disclosed CPU bugs.
The learnings we took away from these prior efforts allowed us to build a more refined and automated process for reimaging servers going forward, and to more crisply define the lifecycle of the servers on which we deploy LinkedIn’s production stack. With this increased confidence, we undertook an effort to reimage all of the servers comprising Rain, LinkedIn’s private cloud, to CentOS with a modern kernel. This blog post aims to share a new set of learnings from our most recent effort.
At its start, the CentOS reimaging process went mostly according to plan. However, as we neared completion, we suddenly halted the process because of multiple reports of severe 99th percentile latency increases for a serving application when an instance of another application was being deployed on the same physical server. The problem only affected servers with the new image, so we had our work cut out for us to avoid a lengthy and discouraging rollback process across tens of thousands of servers. The bug itself could have disrupted service for our members, and a platform image rollback would carry yet another set of risks.
“Noisy neighbor” problems are well known in multi-tenant scheduling environments like most cloud platforms. To avoid the tragedy of the commons, abstractions such as containerization and time-sharing are introduced. However, abstractions are leaky—one tenant is often able to breach the agreement and unfairly exclude other users of a global resource.
In maintaining our private cloud, we have become familiar enough with this class of problem that we knew exactly where to look: load average, system CPU utilization, page cache utilization, free page scans, disk queues. Atop is a great tool for diagnosing a shared server at a glance.
The interesting thing about this problem is that no shared resource was being exhausted, which pointed in the direction of a mutual exclusion problem. Useful techniques for diagnosing mutual exclusion problems include looking at the stack of each process, running echo l > /proc/sysrq-trigger to snapshot each core’s stack, and using the perf top and/or perf record utilities. Taken together, these tools help determine where the system is spending its wall-clock time, since it did not seem to be spending enough of it executing the workload that serves our members.
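One way to narrow down which process stacks are worth reading is to list the tasks currently in uninterruptible sleep (state D), since those are typically waiting on I/O or a kernel lock. The following is a minimal sketch, not the tooling we used; the helper names are hypothetical, and it only inspects top-level processes rather than individual threads.

```python
import os


def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')'.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the line was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Calling blocked_tasks() repeatedly on a Linux host and watching which names persist gives a short list of processes whose stacks deserve a closer look.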
Strangely, these efforts turned up nothing of interest. The system wasn’t busy; it was just slower for our workload, according to the wall clock, than the older platform image.
Fresh out of quick explanations, we attempted to create test cases to reproduce the problem. A valid test case would be fast on the old platform image and slow on the new one. One team created a test case which reproduced the problem by downloading several large artifacts in parallel. Another team then created a benchmark utilizing the fio test framework, which was fast on the old image but exceeded its configured runtime by multiple minutes on the new image. We determined that the problem existed in both kernels 4.19 and 5.4.
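To illustrate the shape of such a test case, here is a minimal sketch in Python. It is not either team's actual benchmark: several threads perform large buffered writes (standing in for parallel artifact downloads) while a probe thread times small write-plus-fsync operations and reports their 99th percentile. The function names, sizes, and the choice of fsync as the probe operation are all assumptions made for illustration.

```python
import os
import tempfile
import threading
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceil(n * pct / 100)
    return ordered[int(rank) - 1]


def bulk_writer(path, total_bytes, chunk=1 << 20):
    """Buffered sequential write, standing in for a large download."""
    block = b"\0" * chunk
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            f.write(block)
            written += chunk


def probe_latencies(directory, stop_event, max_samples=1000):
    """Time small write+fsync operations until the bulk writers finish."""
    samples = []
    path = os.path.join(directory, "probe")
    # Always take at least one sample, even if the writers finish first.
    while len(samples) < max_samples and (not stop_event.is_set() or not samples):
        start = time.monotonic()
        with open(path, "wb") as f:
            f.write(b"x" * 4096)
            f.flush()
            os.fsync(f.fileno())
        samples.append(time.monotonic() - start)
    return samples


def run_trial(directory, writers=4, bytes_per_writer=64 << 20):
    """Return the probe's p99 latency in seconds under write load."""
    stop = threading.Event()
    results = []
    probe = threading.Thread(
        target=lambda: results.extend(probe_latencies(directory, stop)))
    bulk = [
        threading.Thread(
            target=bulk_writer,
            args=(os.path.join(directory, f"bulk{i}"), bytes_per_writer))
        for i in range(writers)
    ]
    probe.start()
    for t in bulk:
        t.start()
    for t in bulk:
        t.join()
    stop.set()
    probe.join()
    return percentile(results, 99)
```

Running run_trial against a directory on the root filesystem of an old-image host and a new-image host, and comparing the two numbers, is the kind of fast-versus-slow comparison a valid test case needs to produce.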
Concurrently, we deduced that this problem exclusively impacted older servers with HDD (rotating) root disks, which we arrange in a software RAID1 mirror configuration; newer servers with SSD (solid-state) root disks were unaffected. One engineer noticed an upstream bug tracking an issue seemingly related to blk-mq. It seemed plausible that this was a blk-mq problem, since blk-mq was introduced after the release of kernel 3.10 (our previous golden image kernel). It was also evident that the system was not actually busy while the latency problem existed, so it seemed reasonable to hypothesize that inefficient I/O submission was the root cause.
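For readers who want to check which class a given server falls into, Linux exposes a per-device rotational flag in sysfs. The sketch below is only an illustration with hypothetical helper names; it reads /sys/block/<dev>/queue/rotational for every block device and does not resolve RAID members to their underlying disks.

```python
import os


def parse_rotational(text):
    """Interpret the contents of a queue/rotational file ('1' means HDD)."""
    return text.strip() == "1"


def rotational_by_device(sysfs_root="/sys/block"):
    """Map each block device name to True (rotating) or False."""
    result = {}
    for dev in sorted(os.listdir(sysfs_root)):
        path = os.path.join(sysfs_root, dev, "queue", "rotational")
        try:
            with open(path) as f:
                result[dev] = parse_rotational(f.read())
        except OSError:
            continue  # device has no queue directory or vanished
    return result
```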
Applying the patch attached to that upstream bug (to reduce the number of queues in the scalable bitmap layer) did improve performance enough to move some teams forward. Based on this result, we explored the possibility that the regression was related to the scsi-mq migration and the new I/O schedulers it required. However, after trying a number of configurations, it was clear that the choice or configuration of the I/O scheduler had little to no impact on the problem.
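When comparing scheduler configurations like this, it helps to confirm which scheduler is actually active on each device. The kernel reports the available schedulers in /sys/block/<dev>/queue/scheduler with the active one in square brackets. The following sketch parses that format; the function name is hypothetical, and the sample string in the usage note is an example rather than output from our servers.

```python
def parse_scheduler(text):
    """Return (active, available) from a queue/scheduler sysfs string.

    The active scheduler is the token wrapped in square brackets;
    active is None if no token is bracketed.
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available
```

For example, parse_scheduler("[mq-deadline] kyber bfq none") yields "mq-deadline" as the active scheduler along with the full list of four choices.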
Due to prior experience in our storage tier, we were also familiar with the ext4 regression introduced in Linux 4.9 for direct I/O workloads. There was no equivalent guidance that we could find addressing increased latency on normal buffered I/O workloads in kernel 4.x. With suspicion of ext4 aroused by that existing direct I/O issue, we decided to replace the ext4 filesystem with XFS on some test servers to determine whether ext4 was again at fault here.
Surprisingly, we found that the problem was indeed nonexistent on XFS. Remember, this problem exclusively affected servers with rotating disk storage, so what could cause a latency problem exclusively on ext4, only on rotating HDD devices, and on a server that isn’t busy?
Rooting out the cause
Lacking other palatable options and knowing that the problem was introduced sometime between kernel 3.10 and 4.19, we proceeded to take one of the affected servers out of rotation and bisect the kernel. Bisecting the kernel on a physical host in a datacenter came with its own set of complications and workarounds, requiring additional assistance across multiple teams.
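The idea behind bisection is a binary search between a known-good and a known-bad build. The sketch below shows that search over a simple ordered list. It is a simplification: git bisect walks the commit graph rather than a flat list, and the is_bad callback here is a hypothetical stand-in for "boot this kernel on the test server and run the benchmark."

```python
def first_bad(candidates, is_bad):
    """Return the index of the first bad candidate.

    Assumes candidates[0] is good, candidates[-1] is bad, and that every
    candidate after the first bad one is also bad.
    """
    lo, hi = 0, len(candidates) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid
        else:
            lo = mid
    return hi


# Illustrative usage with a made-up predicate; real testing means booting
# each build and measuring the workload.
releases = ["3.10", "4.1", "4.4", "4.6", "4.9", "4.14", "4.19"]
tested = []


def pretend_benchmark(release):
    tested.append(release)
    return releases.index(release) >= releases.index("4.6")


culprit = releases[first_bad(releases, pretend_benchmark)]
```

Each step halves the remaining range, which matters when every test requires reinstalling and rebooting a physical host.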
In kernel 4.19, we fixed the regression by reverting the following two commits, both of which were introduced during the 4.6 release cycle. The second had previously been reverted by Linus:
- Commit 06bd3c36a733 (“ext4: fix data exposure after a crash”)
- Commit 1f60fbe72749 (“ext4: allow readdir()’s of large empty directories to be interrupted”)
In kernel 5.4, we fixed the regression by cherry-picking the following commit, which was introduced during the 5.6 release cycle:
- “ext4: make dioread_nolock the default” from commit e5da4c933, an ext4 merge
  - Interestingly, setting the dioread_nolock boot param had no effect.
  - Our workload is buffered I/O; what should a change around dioread_nolock, a direct I/O knob, have to offer in this situation?
The following graph shows approximate testing results as we iterated to converge on the 3.10 kernel’s performance: